Skip to content
This repository has been archived by the owner on May 17, 2024. It is now read-only.

Counts from command line summary do not match up with row counts in materialized results #895

Closed
zigouras opened this issue May 15, 2024 · 1 comment
Labels
bug Something isn't working triage

Comments

@zigouras
Copy link

zigouras commented May 15, 2024

Describe the bug
The counts reported on the command line by the open source data-diff program are correct for rows exclusive to table a, exclusive to b and changed rows. However, the counts of the result rows in the materialized results table do not match what I would expect based on those counts. Only the count of rows exclusive to table B is correct. SQL outside of data-diff confirms the inaccuracy of the materialized results.

Make sure to include the following (minus sensitive information):

data-diff \
  $VACANT_LOTS_DB \
  public.li_complaints \
  backup_.li_complaints \
  -k service_request_id \
  -c '%' \
  -m 'backup_.li_complaints_diff' \
  --stats 
  • The run output + error you're getting. (including tracestack)
247415 rows in table A
246058 rows in table B
1560 rows exclusive to table A (not present in B)
203 rows exclusive to table B (not present in A)
929 rows updated
244926 rows unchanged
1.01% difference score

Extra-Info:
  diff_counts = {'service_request_id_a': 19, 'service_code_a': 15, 'service_name_a': 19, 'address_a': 19, 'subject_a': 19, 
'y_a': 19, 'x_a': 19, 'status_a': 19, 'geometry_a': 19}
  exclusive_count = 19
  table1_count = 247415
  table1_sum_x = -18593115.109934963
  table1_sum_y = 9894894.846454354
  table2_count = 246058
  table2_sum_x = -18491140.8768517
  table2_sum_y = 9840625.990917146

Count from SQL against the materialized table:

select is_exclusive_a, is_exclusive_b, count(*) from backup_.li_complaints_diff 
group by is_exclusive_a,is_exclusive_b;
 is_exclusive_a | is_exclusive_b | count 
----------------+----------------+-------
 f              | f              |   832
 f              | t              |   203
 t              | f              |   795

Raw counts from the source and target tables:

select count(*) from li_complaints p where not exists (select service_request_id from 
backup_.li_complaints where service_request_id = p.service_request_id) ;
 count 
-------
  1560
(1 row)

select count(*) from backup_.li_complaints b where not exists (select service_request_id from 
li_complaints where service_request_id = b.service_request_id) ;
 count 
-------
   203
(1 row)
  • Run data-diff with the -d switch for extra debug information.
    see attached file data-diff-debug.txt
    data-diff-debug.txt

Describe the environment

data-diff 0.11.1
Python 3.11.4
os: Ubuntu Linux on WSL on Windows 10
PostgreSQL 14 on Ubuntu
@zigouras zigouras added the bug Something isn't working label May 15, 2024
@zigouras zigouras changed the title Counts from summary do not match up with row counts in materialized results Counts from command line summary do not match up with row counts in materialized results May 15, 2024
@zigouras
Copy link
Author

zigouras commented May 15, 2024

I think I figured it out, the default --table-write-limit of 1000 was the problem. I increased it and the numbers look correct in the materialized table now.

This issue can be deleted.

@zigouras zigouras closed this as not planned Won't fix, can't repro, duplicate, stale May 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working triage
Projects
None yet
Development

No branches or pull requests

1 participant