`AggregateNumericRangeEquality`: Mismatch between implementation and documentation #37

kklein · 2022-07-25T16:36:43Z

The doc string of the corresponding add_*_constraint method claims the following:

Since we expect aggregate_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.

Hence if we have, for a given key, n rows (in other words, the cardinality of the multiset) and a start_value of k, I would expect a range to be complete if exactly the following rows exist:

(key, k)
(key, k+1)
...
(key, k+n-1)

Yet, the implementation checks the following:

datajudge/src/datajudge/constraints/groupby.py

Lines 36 to 37 in 0350318

    
           def missing_from_range(values, start=0): 
        
               return set(range(start, max(values) + start)) - set(values)

On the one hand, it revolves around the maximal encountered value instead of the cardinality of the set. On the other hand, the start value is added to said maximum.

It is easy to come up with an example where both outlined behaviours diverge. Assume start_value to equal k and the observed rows to correspond to this:

(key, k)
(key, k+1)
(key, k+2)

According to the former definition - as described in the doc string - this would be a legitimate key.
According to the latter definition, we would expect

(key, k)
(key, k+1)
(key, k+2)
...
(key, k+2+k)

which would flag the current key as a failure for some k.

We do not notice this diverging behaviour in our tests since our tests only use start_value=1.

What is intended behaviour?

The text was updated successfully, but these errors were encountered:

ivergara · 2022-07-31T22:40:03Z

From what I remember, and by reading the code, the idea was to test for continuous keys of the aggregation when such keys are integers and having an offset.

Unfortunately, I didn't get exactly the issues you present here.

What you get from the aggregation are pairs (k, n_k). for example (1, 12), (2, 1), (3,100). In this case, a test would be fine if start_value=1 but it'd failt for start_value=0.

kklein · 2022-11-09T11:03:32Z

Unfortunately, I didn't get exactly the issues you present here.

In the example above we observe

(key, k)
(key, k+1)
(key, k+2)

The docstring says:

Since we expect aggregate_column to be a numeric column, this leads to a multiset of aggregated values. These values should correspond to the integers ranging from start_value to the cardinality of the multiset.

Hence, if start_value = k, this should be an accepted input since the cardinality is 3 and we observe values k, k + 1 and k + 2.

Yet, according to the current implementation we would not accept the input. The current implementation accepts the input only if range(start, max(values) + start) is fully covered.

Note that in the general case it doesn't hold that
range(start, max(values) + start) = range(start, start + cardinality)

In the example above we have max(values) = k + 2 and start = k, implying that

max(values) + start = 2*k + 2
start + cardinality = k + 3

kklein added bug Something isn't working question Further information is requested labels Jul 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`AggregateNumericRangeEquality`: Mismatch between implementation and documentation #37

`AggregateNumericRangeEquality`: Mismatch between implementation and documentation #37

kklein commented Jul 25, 2022

ivergara commented Jul 31, 2022

kklein commented Nov 9, 2022 •

edited

Loading

AggregateNumericRangeEquality: Mismatch between implementation and documentation #37

AggregateNumericRangeEquality: Mismatch between implementation and documentation #37

Comments

kklein commented Jul 25, 2022

ivergara commented Jul 31, 2022

kklein commented Nov 9, 2022 • edited Loading

`AggregateNumericRangeEquality`: Mismatch between implementation and documentation #37

`AggregateNumericRangeEquality`: Mismatch between implementation and documentation #37

kklein commented Nov 9, 2022 •

edited

Loading