Reimplement Kolmogorov Smirnov query logic with sqlalchemy's Language Expression API #44

kklein · 2022-07-28T16:03:30Z

See linked issues for context.

Importantly, I have high hopes that the isolation and explicit testing of _cross_cdf_selection will now be useful for other kinds of Constraints, e.g. #45.

codecov · 2022-07-28T16:07:34Z

Codecov Report

Merging #44 (397443f) into main (da059c5) will increase coverage by 0.05%.
The diff coverage is 97.87%.

@@            Coverage Diff             @@
##             main      #44      +/-   ##
==========================================
+ Coverage   93.84%   93.90%   +0.05%     
==========================================
  Files          15       15              
  Lines        1577     1607      +30     
==========================================
+ Hits         1480     1509      +29     
- Misses         97       98       +1

Impacted Files	Coverage Δ
src/datajudge/db_access.py	94.10% <96.96%> (+0.11%)
src/datajudge/constraints/stats.py	94.00% <100.00%> (+0.66%)

kklein · 2022-07-29T15:27:13Z

src/datajudge/constraints/stats.py

@@ -63,9 +63,10 @@ def check_acceptance(
        def c(alpha: float):
            return math.sqrt(-math.log(alpha / 2.0 + 1e-10) * 0.5)

-        return d_statistic <= c(accepted_level) * math.sqrt(
+        threshold = c(accepted_level) * math.sqrt(


Slightly more convenient to debug.
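
For context, the acceptance check compares the observed statistic against a critical value. A minimal sketch of that comparison, assuming the textbook large-sample threshold c(alpha) * sqrt((n + m) / (n * m)) for sample sizes n and m. The diff above is truncated, so the sample-size factor and the helper name `ks_threshold` are assumptions rather than the code from this PR:

```python
import math


def ks_threshold(alpha: float, n: int, m: int) -> float:
    """Critical value for the two-sample KS statistic (textbook approximation)."""
    # Same c(alpha) as in the diff above, including the small epsilon
    # that guards against log(0).
    c = math.sqrt(-math.log(alpha / 2.0 + 1e-10) * 0.5)
    # Assumed sample-size factor; not visible in the truncated diff.
    return c * math.sqrt((n + m) / (n * m))


def check_acceptance(d_statistic: float, n: int, m: int, accepted_level: float) -> bool:
    # Binding the threshold to a name lets a debugger show it before the comparison.
    threshold = ks_threshold(accepted_level, n, m)
    return d_statistic <= threshold
```

With the threshold in its own variable, a breakpoint on the return line shows both sides of the inequality.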

kklein · 2022-07-29T17:04:12Z

tests/integration/test_integration.py

-        ("value_0_1", "value_1_1", 0.3924, 0.0),
-    ],
-)
-def test_ks_2sample_implementation(engine, random_normal_table, configuration):


Moved this test to a different test suite. Not a hard constraint but the general idea so far was:
Everything that needs a database goes into tests/integration. Tests operating on a TestResult - obtained via the test method of a Requirement - go into tests/integration/test_integration.py.

kklein · 2022-07-29T17:06:48Z

tests/integration/test_integration.py

@@ -1925,31 +1913,41 @@ def test_ks_2sample_constraint_perfect_between(engine, int_table1, data):
    assert operation(test_result.outcome), test_result.failure_message


-# TODO: Enable this test once the bug is fixed.
-@pytest.mark.skip(reason="This is a known bug and unintended behaviour.")


No longer skipping this test (as well as adding further examples) should indicate that this PR solves #42.

jonashaag

I find the SQLAlchemy code incredibly difficult to follow. Not saying that it is because you are writing complicated SQLAlchemy. Maybe it's just more difficult for me (with little SQLAlchemy experience) to follow than to follow SQL. Maybe it would help people like me if we had a simplified version of the query in SQL somewhere in a docstring to help get an overview of the code.
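
For orientation, here is a rough idea of what such a simplified SQL overview could look like. This is a sketch only and deliberately not the query from this PR: it evaluates both empirical CDFs with correlated subqueries (quadratic, but short to read), and the table names `sample1` and `sample2` as well as the helper `ks_statistic` are made up. It runs against an in-memory SQLite database through Python's standard library:

```python
import sqlite3

# Simplified two-sample KS statistic: the largest absolute difference
# between the two empirical CDFs, evaluated at every observed value.
KS_STATISTIC_SQL = """
WITH grid AS (
    SELECT value FROM sample1
    UNION
    SELECT value FROM sample2
)
SELECT MAX(ABS(
    (SELECT COUNT(*) FROM sample1 WHERE sample1.value <= grid.value) * 1.0
        / (SELECT COUNT(*) FROM sample1)
    -
    (SELECT COUNT(*) FROM sample2 WHERE sample2.value <= grid.value) * 1.0
        / (SELECT COUNT(*) FROM sample2)
))
FROM grid
"""


def ks_statistic(values1, values2):
    """Load two samples into throwaway tables and return the KS statistic."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sample1 (value REAL)")
        conn.execute("CREATE TABLE sample2 (value REAL)")
        conn.executemany("INSERT INTO sample1 VALUES (?)", [(v,) for v in values1])
        conn.executemany("INSERT INTO sample2 VALUES (?)", [(v,) for v in values2])
        (d_statistic,) = conn.execute(KS_STATISTIC_SQL).fetchone()
        return d_statistic
    finally:
        conn.close()
```

A docstring version would probably keep only the SQL string; the Python wrapper is there so the sketch can be run as is.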

jonashaag · 2022-07-29T18:18:22Z

src/datajudge/db_access.py

+    col = ref.get_column(engine)
+    selection = ref.get_selection(engine).subquery()
+
+    # Step 1: Calculate the CDF over the value column.


Just curious: Is it possible to merge the two steps? Like so:

    sa.select([
        selection.c[col],
        sa.func.max(sa.func.cume_dist().over(order_by=col)),
    ]).group_by(selection.c[col])

I wondered the same, but doing it that way leads to an error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.GroupingError) aggregate function calls cannot contain window function calls
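
The restriction can be sidestepped by keeping the window function and the aggregate on separate query levels. A minimal sketch of that two-step shape, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database with window-function support. The table `samples` and its rows are made up for illustration, and this is not the code from this PR:

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
metadata = sa.MetaData()
# Hypothetical single-column table standing in for a DataReference selection.
samples = sa.Table("samples", metadata, sa.Column("value", sa.Integer))
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(samples.insert(), [{"value": v} for v in (1, 1, 2, 3)])

# Step 1: window function only, wrapped in a subquery.
cdf_per_row = sa.select(
    samples.c.value,
    sa.func.cume_dist().over(order_by=samples.c.value).label("cdf"),
).subquery()

# Step 2: aggregate over the subquery, so the aggregate never
# contains a window function call directly.
cdf_per_value = (
    sa.select(cdf_per_row.c.value, sa.func.max(cdf_per_row.c.cdf).label("cdf"))
    .group_by(cdf_per_row.c.value)
    .order_by(cdf_per_row.c.value)
)

with engine.connect() as conn:
    rows = [tuple(row) for row in conn.execute(cdf_per_value)]
```

Here `rows` holds one (value, cdf) pair per distinct value.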

Could you add in the docstring a bit more information about the objective/idea behind this method?

It's great to have the comments on the step-by-step like in the SQL version before, but a summary would be a great addition to it, particularly clarifying and being explicit about the meaning of the arguments.

Good idea.
f396053

jonashaag · 2022-07-29T18:19:14Z

src/datajudge/constraints/stats.py


    @staticmethod
    def calculate_statistic(
        engine,
        ref1: DataReference,
        ref2: DataReference,
-    ) -> Tuple[float, Optional[float], int, int]:
+    ) -> Tuple[float, Optional[float], int, int, List]:


List of what?

Quite frankly, we/I haven't yet figured out what the latest common SQLAlchemy ancestor type is. Throughout almost all of db_access we don't annotate the type of the selections because the types, IIRC, differ. Quite a few are simply sqlalchemy.sql.selectable.Select or sqlalchemy.sql.selectable.Subquery. Yet, some aren't and still support the necessary method interfaces.

Now there certainly are remedies to this situation but we haven't considered this to be 'sufficiently important' up until now.
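
One lightweight remedy could be a named alias, so that the annotation at least says what the list is meant to hold. A sketch under that assumption: the alias `Selection` and the stub function are hypothetical, and `Any` is used precisely because the common ancestor type is not settled:

```python
from typing import Any, List, Optional, Tuple

# Hypothetical alias: documents the intent ("some SQLAlchemy selectable")
# without committing to a concrete class.
Selection = Any


def calculate_statistic_stub() -> Tuple[float, Optional[float], int, int, List[Selection]]:
    """Stand-in with the same return shape as above; all values are placeholders."""
    selections: List[Selection] = []
    return 0.0, None, 0, 0, selections
```

Once a suitable base class or protocol is identified, only the alias would need to change.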

jonashaag · 2022-07-29T18:23:17Z

src/datajudge/db_access.py

 def get_ks_2sample(
    engine: sa.engine.Engine,
    ref1: DataReference,
    ref2: DataReference,
-) -> float:
+):


Do you not annotate return types?

Am I missing a comment somewhere? What's the idea? Shouldn't we annotate as much as possible?
Even using Any makes sense, because you are proactively declaring that you don't care, while not annotating leaves the user to guess whether it's (1) unknown, (2) not important, or (3) missing.

> Am I missing a comment somewhere?

Have you read this [0]?

> because you are proactively declaring that you don't care

It's not clear to me why we don't care.

[0] #44 (comment)

I'll take the liberty of merging for now. Yet, if you still consider this an open topic, I'm happy to discuss it further and address it as a follow-up, @YYYasin19.

kklein · 2022-07-30T09:58:02Z

> I find the SQLAlchemy code incredibly difficult to follow. Not saying that it is because you are writing complicated SQLAlchemy. Maybe it's just more difficult for me (with little SQLAlchemy experience) to follow than to follow SQL.

I definitely understand where you're coming from. A couple of hopefully related comments from my side:

Personally, I see multiple upsides of using the language expression API.
- Firstly, it conveniently enables the use of abstractions such as ref.get_selection(engine). IIRC, not being able to use this abstraction caused a lot of the development effort in the raw-SQL version of this query logic. Moreover, it caused a bug in relation to Conditions.
- Secondly, some raw SQL queries might actually run just fine against multiple dialects. Yet, this is definitely not the case for all of our queries. When it is not possible, the language expression API actually allows us to 'write once, run everywhere™'. I'm not sure whether having a patchwork of raw SQL and language expression API would be desirable in the long run.
- Thirdly, I find it easier to share logic between different queries than with raw SQL queries; see the comment about reusing _cross_cdf_selection for another hypothesis test.

As you gently allude to, I think I have indeed seen a development on my end when it comes to reading language expression code. I believe I find it easier to read now than two years ago.
I still don't find it easy to read.
To me, this very example seems to sit at one end of the query logic complexity spectrum, at least as far as datajudge is concerned.
Personally - even though I admire the rigid structure and invaluable comments introduced by @YYYasin19 - I didn't find the raw SQL code easy to follow either. It took me some debugging and reimplementing to understand some of the intricacies.

> Maybe it would help people like me if we had a simplified version of the query in SQL somewhere in a docstring to help get an overview of the code.

I'm definitely open to the idea.

I made a point of mostly preserving @YYYasin19's structure of iterative steps - in contrast to recursion - to allow the curious developer to create, read and run these intermediate steps. What I often do is set breakpoints and run str(intermediate_selection) or even engine.connect().execute(intermediate_selection) via pdb while executing the integration tests. Moreover, I hoped to have made the query building a little clearer with this[0] example as well as the dedicated test case[1].
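
To illustrate the str(...) trick: a selection can be rendered to SQL text without any database connection. A small sketch, assuming SQLAlchemy 1.4 or later; the table `samples` and the variable names are made up and do not come from this PR:

```python
import sqlalchemy as sa

# Hypothetical lightweight table; rendering SQL needs no engine.
samples = sa.table("samples", sa.column("value"))

intermediate_selection = sa.select(
    samples.c.value,
    sa.func.cume_dist().over(order_by=samples.c.value).label("cdf"),
)

# str() compiles the selection with SQLAlchemy's default dialect.
rendered_sql = str(intermediate_selection)
print(rendered_sql)
```

The output uses the default dialect, so it can differ slightly from what a specific backend such as mssql would receive.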

Nevertheless, if you have a concrete suggestion I'm happy to follow your lead. :)

[0] https://github.com/Quantco/datajudge/pull/44/files#diff-74c4ccf9b9cba732c4562df6e1e05a5ecdee4e9c04f38276b54e85f850155660R943-R945
[1] https://github.com/Quantco/datajudge/pull/44/files#diff-294af7c1d98bcac87ae111bebc4c13a1b1d8155cce13ea48446397e950f9db32R7-R28

ivergara

Great job!

ivergara · 2022-07-30T21:13:25Z

src/datajudge/db_access.py

+    col = ref.get_column(engine)
+    selection = ref.get_selection(engine).subquery()
+
+    # Step 1: Calculate the CDF over the value column.


Could you add in the docstring a bit more information about the objective/idea behind this method?

It's great to have the comments on the step-by-step like in the SQL version before, but a summary would be a great addition to it, particularly clarifying and being explicit about the meaning of the arguments.

ivergara · 2022-07-30T21:16:42Z

src/datajudge/db_access.py

+):
+    """Create a cross cumulative distribution function selection given two samples.
+
+    Concretely, both ``DataReference``s are expected to have specified a single relevant column.


Don't you want to explicitly enforce that expectation at the beginning of the method?

Yeah, great point! Did it for all at once:
3c9408c

ivergara · 2022-07-30T21:36:23Z

> Maybe it would help people like me if we had a simplified version of the query in SQL somewhere in a docstring to help get an overview of the code.

> I'm definitely open to the idea.

Why not put a link to the full version in Yasin's repository containing the SQL version?

YYYasin19 · 2022-07-31T14:37:54Z

Very nice (re-)implementation in the expression language API!
I agree with the fact that it is hard to parse without having experience with the API. I think we can split the query up as much as possible here, since the final compilation and optimization will still be done by the database engine(s), so our goal should be readability and maintainability -- though I don't see any obvious wins left here.

jonashaag · 2022-08-02T07:26:55Z

src/datajudge/db_access.py

@@ -288,7 +288,13 @@ def get_column(self, engine):
                f"Trying to access column of DataReference "
                f"{self.get_string()} yet none is given."
            )
-        return self.get_columns(engine)[0]
+        columns = self.get_columns(engine)


I like this one

    (col,) = self.get_columns(engine)
    return col
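
The appeal of this idiom is that the unpacking itself enforces the "exactly one column" expectation. A standalone sketch, where the function name `get_single_column` is made up for illustration:

```python
def get_single_column(columns):
    """Return the only element of ``columns``.

    Unpacking into a one-element tuple raises ValueError
    if there are zero or several elements.
    """
    (col,) = columns
    return col
```

The trade-off is a generic ValueError message instead of one that names the DataReference.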

kklein added 8 commits July 28, 2022 10:50

Use DataRefereces in function interfaces.

94ae647

Seperation of concerns: Fetch counts outside of get_ks_2sample method.

6d6cd33

Adapt formatting for mssql.

47034c0

Fix usage of Conditions in tests.

5c8952b

Undo redundant change.

bf19817

Bluntly translate raw string to sqlalchemy language expresions.

ddb6441

Merge branch 'main' of github.com:Quantco/datajudge into ks_le2

92afc22

Bring back the comments.

c0a1efd

kklein linked an issue Jul 28, 2022 that may be closed by this pull request

Re-implement KS test in sqlalchemy expression language API #29

Closed

kklein added 2 commits July 29, 2022 15:43

Modularize selection generation to some degree.

ada70bc

Enable test of previous bug.

2387817

kklein linked an issue Jul 29, 2022 that may be closed by this pull request

KolmogorovSmirnov2Sample: Conditions are ignored #42

Closed

kklein commented Jul 29, 2022

kklein added 4 commits July 29, 2022 17:33

Extend test.

6df75e8

Store selection queries.

44c0cdd

Separate generation of cross-cdf selection from ks method.

73e7e99

Implement test for _cross_cdf_selection.

7e9f44f

kklein changed the title ~~[WIP] Reimplement KS test with sqlalchemy's Language Expression API~~ Reimplement KS test with sqlalchemy's Language Expression API Jul 29, 2022

kklein changed the title ~~Reimplement KS test with sqlalchemy's Language Expression API~~ Reimplement Kolmogorov Smirnov test with sqlalchemy's Language Expression API Jul 29, 2022

kklein added 2 commits July 29, 2022 18:56

Use type annotations as if it was Python 3.8 days.

61eb97e

Add separate tests for stats module.

d641748

kklein commented Jul 29, 2022

kklein marked this pull request as ready for review July 29, 2022 17:07

kklein changed the title ~~Reimplement Kolmogorov Smirnov test with sqlalchemy's Language Expression API~~ Reimplement Kolmogorov Smirnov query logic with sqlalchemy's Language Expression API Jul 29, 2022

Sort tuple before comparison.

1353488

kklein requested review from jonashaag and YYYasin19 July 29, 2022 17:25

jonashaag reviewed Jul 29, 2022

ivergara approved these changes Jul 30, 2022

YYYasin19 approved these changes Jul 31, 2022

kklein added 3 commits August 1, 2022 20:09

Add doc string to _cdf_selection.

f396053

Check for number of columns in get_column.

3c9408c

Add link to raw-sql PR.

397443f

kklein merged commit 8adeefd into main Aug 1, 2022

kklein deleted the ks_le2 branch August 1, 2022 18:33

jonashaag reviewed Aug 2, 2022
