The doc string of the corresponding add_*_constraint method claims the following:
Since we expect aggregate_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.
Hence, if we have, for a given key, n rows (in other words, the cardinality of the multiset is n) and a start_value of k, I would expect the range to be complete if exactly the following rows exist:
(key, k)
(key, k+1)
...
(key, k+n-1)
Yet, the implementation (datajudge/src/datajudge/constraints/groupby.py, lines 36 to 37 at commit 0350318) checks something different. On the one hand, it revolves around the maximal encountered value instead of the cardinality of the multiset. On the other hand, the start value is added to said maximum.
It is easy to come up with an example where the two outlined behaviours diverge. Assume start_value equals k and that the observed rows are:
(key, k)
(key, k+1)
(key, k+2)
According to the former definition - as described in the doc string - this would be a legitimate key.
According to the latter definition, we would expect
(key, k)
(key, k+1)
(key, k+2)
...
(key, k+k+1)
which would flag the current key as a failure for any k > 1.
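The divergence can be made concrete with a small sketch. This is illustrative code, not datajudge's actual implementation; the two helper functions below merely encode the two readings described above (the docstring's cardinality-based definition versus the implementation's max-based check):

```python
# Hypothetical sketch (not datajudge's code) contrasting the two definitions
# for start_value = k and observed aggregated values {k, k+1, k+2}.

def complete_per_docstring(values, start):
    # Docstring reading: the values must be exactly the integers
    # start, start + 1, ..., start + cardinality - 1.
    return sorted(values) == list(range(start, start + len(values)))

def complete_per_implementation(values, start):
    # Implementation reading: range(start, max(values) + start)
    # must be fully covered by the observed values.
    return set(range(start, max(values) + start)) <= set(values)

k = 5
values = [k, k + 1, k + 2]
print(complete_per_docstring(values, k))       # True
print(complete_per_implementation(values, k))  # False for any k > 1

# With start_value = 1 — the only case our tests cover — both agree:
print(complete_per_implementation([1, 2, 3], 1))  # True
```

This also shows why the test suite does not catch the divergence: for start_value=1 both definitions coincide.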
We do not notice this diverging behaviour in our tests since they only use start_value=1.
What is the intended behaviour?
From what I remember, and from reading the code, the idea was to test that the aggregation keys are contiguous integers, allowing for an offset.
Unfortunately, I don't quite follow the issue you present here.
What you get from the aggregation are pairs (k, n_k), for example (1, 12), (2, 1), (3, 100). In this case, the test would pass for start_value=1 but fail for start_value=0.
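A short sketch of that scenario, under the "fully covered" reading of the check described later in this thread (illustrative code, not datajudge's implementation):

```python
# Aggregation result as pairs (k, n_k); only the keys matter for the check.
pairs = [(1, 12), (2, 1), (3, 100)]
keys = {k for k, _ in pairs}

def passes(keys, start):
    # "Fully covered" reading: every integer in range(start, max + start)
    # must appear among the aggregated keys.
    return set(range(start, max(keys) + start)) <= keys

print(passes(keys, start=1))  # True
print(passes(keys, start=0))  # False: 0 is missing from the keys
```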
> Unfortunately, I don't quite follow the issue you present here.
In the example above we observe
(key, k)
(key, k+1)
(key, k+2)
The docstring says:
Since we expect aggregate_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.
Hence, if start_value = k, this should be an accepted input since the cardinality is 3 and we observe values k, k + 1 and k + 2.
Yet, the current implementation would not accept this input for k > 1: it accepts the input only if range(start, max(values) + start) is fully covered.
Note that, in the general case, it does not hold that range(start, max(values) + start) = range(start, start + cardinality).
In the example above we have max(values) = k + 2 and start = k, implying that range(start, max(values) + start) = range(k, 2k + 2), whereas range(start, start + cardinality) = range(k, k + 3); the two coincide only for k = 1.
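A quick loop makes the mismatch between the two ranges concrete (illustrative sketch, not datajudge code):

```python
# Compare range(start, max(values) + start) against
# range(start, start + cardinality) for values {k, k+1, k+2}.
for k in range(5):
    values = [k, k + 1, k + 2]               # max(values) = k + 2, cardinality 3
    impl = list(range(k, max(values) + k))   # range(k, 2k + 2)
    doc = list(range(k, k + len(values)))    # range(k, k + 3)
    print(k, impl == doc)                    # only k = 1 prints True
```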