-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reimplement Kolmogorov Smirnov query logic with sqlalchemy's Language Expression API #44
Conversation
Codecov Report
@@ Coverage Diff @@
## main #44 +/- ##
==========================================
+ Coverage 93.84% 93.90% +0.05%
==========================================
Files 15 15
Lines 1577 1607 +30
==========================================
+ Hits 1480 1509 +29
- Misses 97 98 +1
Help us with your feedback. Take ten seconds to tell us how you rate us. |
@@ -63,9 +63,10 @@ def check_acceptance( | |||
def c(alpha: float): | |||
return math.sqrt(-math.log(alpha / 2.0 + 1e-10) * 0.5) | |||
|
|||
return d_statistic <= c(accepted_level) * math.sqrt( | |||
threshold = c(accepted_level) * math.sqrt( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly more convenient to debug.
("value_0_1", "value_1_1", 0.3924, 0.0), | ||
], | ||
) | ||
def test_ks_2sample_implementation(engine, random_normal_table, configuration): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this test to a different test suite. Not a hard constraint but the general idea so far was:
Everything that needs a database goes into tests/integration
. Tests operating on a TestResult
- obtained via the test
method of a Requirement
- go into tests/integration/test_integration.py
.
@@ -1925,31 +1913,41 @@ def test_ks_2sample_constraint_perfect_between(engine, int_table1, data): | |||
assert operation(test_result.outcome), test_result.failure_message | |||
|
|||
|
|||
# TODO: Enable this test once the bug is fixed. | |||
@pytest.mark.skip(reason="This is a known bug and unintended behaviour.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer skipping this test (as well as adding further examples) should indicate that this PR solves #42
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the SQLAlchemy code incredibly difficult to follow. Not saying that it is because you are writing complicated SQLAlchemy. Maybe it's just more difficult for me (with little SQLAlchemy experience) to follow than to follow SQL. Maybe it would help people like me if we had a simplified version of the query in SQL somewhere in a docstring to help get an overview of the code.
col = ref.get_column(engine) | ||
selection = ref.get_selection(engine).subquery() | ||
|
||
# Step 1: Calculate the CDF over the value column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious: Is possible to merge the two steps? Like so
sa.select([
selection.c[col],
sa.func.max(sa.func.cume_dist().over(order_by=col)),
])
.group_by(selection.c[col])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered the same and doing it that way leads to an error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.GroupingError) aggregate function calls cannot contain window function calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add in the docstring a bit more information about the objective/idea behind this method?
It's great to have the comments on the step-by-step like in the SQL version before, but a summary would be a great addition to it, particularly clarifying and being explicit about the meaning of the arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
f396053
|
||
@staticmethod | ||
def calculate_statistic( | ||
engine, | ||
ref1: DataReference, | ||
ref2: DataReference, | ||
) -> Tuple[float, Optional[float], int, int]: | ||
) -> Tuple[float, Optional[float], int, int, List]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List of what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite frankly we/I haven't figured out yet what the latest common SQLAlchemy ancestor type yet. Throughout almost all of db_access
we don't annotate the type of the selections because the types, iirc, differ. Quite a few are simply sqlalchemy.sql.selectable.Select
or sqlalchemy.sql.selectable.Subquery
. Yet, some aren't and still support the necessary method interfaces.
Now there certainly are remedies to this situation but we haven't considered this to be 'sufficiently important' up until now.
def get_ks_2sample( | ||
engine: sa.engine.Engine, | ||
ref1: DataReference, | ||
ref2: DataReference, | ||
) -> float: | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you not annotate return types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I missing a comment somewhere? What's the idea? Shouldn't we annotate as much as possible?
Even using Any makes sense because you are proactively declaring that you don't care while not annotating leaves the user to guess where it's (1) unknown (2) not important (3) missing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I missing a comment somewhere?
Have you read this [0]?
because you are proactively declaring that you don't care
It's not clear to me why we don't care.
[0] #44 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take the liberty to merge for now. Yet, if you consider this an open topic still, happy to further discuss this and address it as a follow-up @YYYasin19 .
I definitely understand where you're coming from. A couple of hopefully related comments from my side:
I'm definitely open to the idea. I made a point out of mostly preserving @YYYasin19's structure of iterative steps - in contrast to recursion - to allow the curious developer to create, read and run these intermediate steps. What I often do is to set breakpoints and run Nevertheless, if you have a concrete suggestion I'm happy to follow your lead. :) [0] https://github.com/Quantco/datajudge/pull/44/files#diff-74c4ccf9b9cba732c4562df6e1e05a5ecdee4e9c04f38276b54e85f850155660R943-R945 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job!
col = ref.get_column(engine) | ||
selection = ref.get_selection(engine).subquery() | ||
|
||
# Step 1: Calculate the CDF over the value column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add in the docstring a bit more information about the objective/idea behind this method?
It's great to have the comments on the step-by-step like in the SQL version before, but a summary would be a great addition to it, particularly clarifying and being explicit about the meaning of the arguments.
): | ||
"""Create a cross cumulative distribution function selection given two samples. | ||
|
||
Concretely, both ``DataReference``s are expected to have specified a single relevant column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you want to explicitly enforce that expectation at the beginning of the method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, great point! Did it for all at once:
3c9408c
Why not put a link to the full version in Yasin's repository containing the SQL version? |
Very nice (re-)implementation in the expression language API! |
@@ -288,7 +288,13 @@ def get_column(self, engine): | |||
f"Trying to access column of DataReference " | |||
f"{self.get_string()} yet none is given." | |||
) | |||
return self.get_columns(engine)[0] | |||
columns = self.get_columns(engine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this one
(col,) = self.get_columns(engine)
return col
See linked issues for context.
Importantly, I have high hopes that the isolation and explicit testing of
_cross_cdf_selection
will now be useful for other kinds ofConstraint
s, e.g. #45.