
Dedup transactions #277

Merged
merged 6 commits into from
Oct 19, 2023
Conversation

dgitis
Collaborator

@dgitis dgitis commented Oct 17, 2023

Description & motivation

Resolves #193.

This PR deduplicates transactions on transaction_id within the current processing window as determined by the static_incremental_days variable.
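Just for context, the dedup described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual SQL: the model name (`base_ga4__events`), column names, and the default value for `static_incremental_days` are all assumptions.

```sql
-- Hypothetical sketch: keep one row per transaction_id among purchase
-- events inside the incremental processing window (BigQuery syntax).
with purchases_in_window as (
    select *
    from {{ ref('base_ga4__events') }}  -- assumed source model
    where event_name = 'purchase'
        and event_date_dt >= date_sub(current_date(), interval {{ var('static_incremental_days', 3) }} day)
),

numbered as (
    select
        *,
        row_number() over (
            partition by transaction_id
            order by event_timestamp asc
        ) as row_num
    from purchases_in_window
)

select * except (row_num)
from numbered
where row_num = 1
```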

There are a number of possible issues that could be fixed here, but many of them conflict with one another, so I chose to keep this as simple as possible.

Checklist

  • [y] I have verified that these changes work locally
  • [y] I have updated the README.md (if applicable)
  • [n/a] I have added tests & descriptions to my models (and macros if applicable)
  • [y] I have run dbt test and python -m pytest . to validate existing tests

@dgitis
Collaborator Author

dgitis commented Oct 17, 2023

Just for reference, the performance additions in 52868e5 process 136.46 MB on a client test site that's been running for over a year, compared with 8.8 GB before the optimization.

The changes reduce the effectiveness of the deduplication, but I prefer to default to the performant option and let people undo it as needed.

@adamribaudo-velir
Collaborator

The business logic seems sound, but we could use the QUALIFY clause to cut the amount of SQL, remove the 'distinct', and remove a CTE. See here for an example: https://github.com/Velir/dbt-ga4/blob/f2f03a2475b3e567a452510e04e01a3464ca59c2/models/staging/base/base_ga4__events.sql#L36C1-L36C26

Something like this I think:

qualify row_number() over (
    partition by transaction_id
    order by event_timestamp asc
) = 1
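For context, a full query using this QUALIFY form might look like the following sketch (the model and column names are assumptions for illustration, not the merged SQL). QUALIFY filters directly on the window function's result, so the numbering CTE and the `distinct` both go away:

```sql
-- Hypothetical sketch of the QUALIFY-based dedup (BigQuery syntax).
select *
from {{ ref('base_ga4__events') }}  -- assumed source model
where event_name = 'purchase'
qualify row_number() over (
    partition by transaction_id
    order by event_timestamp asc
) = 1
```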

@dgitis
Collaborator Author

dgitis commented Oct 18, 2023

Review the new version when you get a chance. It seems to work fine on my test site.

The processing cost of both versions is equal.

@adamribaudo-velir adamribaudo-velir merged commit 3a7fb3c into main Oct 19, 2023
@adamribaudo-velir adamribaudo-velir deleted the dedup-transactions branch October 19, 2023 12:38