Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AggregateNumericRangeEquality: Mismatch between implementation and documentation #37

Open
kklein opened this issue Jul 25, 2022 · 2 comments
Labels
bug Something isn't working question Further information is requested

Comments

@kklein
Copy link
Collaborator

kklein commented Jul 25, 2022

The doc string of the corresponding add_*_constraint method claims the following:

Since we expect aggregate_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.

Hence if we have, for a given key, n rows (in other words, the cardinality of the multiset) and a start_value of k, I would expect a range to be complete if exactly the following rows exist:

(key, k)
(key, k+1)
...
(key, k+n-1)

Yet, the implementation checks the following:

def missing_from_range(values, start=0):
return set(range(start, max(values) + start)) - set(values)

On the one hand, it revolves around the maximal encountered value instead of the cardinality of the set. On the other hand, the start value is added to said maximum.

It is easy to come up with an example where both outlined behaviours diverge. Assume start_value to equal k and the observed rows to correspond to this:

(key, k)
(key, k+1)
(key, k+2)

According to the former definition - as described in the doc string - this would be a legitimate key.
According to the latter definition, we would expect

(key, k)
(key, k+1)
(key, k+2)
...
(key, k+2+k)

which would flag the current key as a failure for some k.

We do not notice this diverging behaviour in our tests since our tests only use start_value=1.

What is intended behaviour?

@kklein kklein added bug Something isn't working question Further information is requested labels Jul 25, 2022
@ivergara
Copy link
Collaborator

From what I remember, and by reading the code, the idea was to test for continuous keys of the aggregation when such keys are integers and having an offset.

Unfortunately, I didn't get exactly the issues you present here.

What you get from the aggregation are pairs (k, n_k). for example (1, 12), (2, 1), (3,100). In this case, a test would be fine if start_value=1 but it'd failt for start_value=0.

@kklein
Copy link
Collaborator Author

kklein commented Nov 9, 2022

Unfortunately, I didn't get exactly the issues you present here.

In the example above we observe

(key, k)
(key, k+1)
(key, k+2)

The docstring says:

Since we expect aggregate_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.

Hence, if start_value = k, this should be an accepted input since the cardinality is 3 and we observe values k, k + 1 and k + 2.

Yet, according to the current implementation we would not accept the input. The current implementation accepts the input only if range(start, max(values) + start) is fully covered.

Note that in the general case it doesn't hold that
range(start, max(values) + start) = range(start, start + cardinality)

In the example above we have max(values) = k + 2 and start = k, implying that

  • max(values) + start = 2*k + 2
  • start + cardinality = k + 3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants