Import transactions by quarter, in chunks of a few hundred #212
Conversation
```makefile
.SECONDEXPANSION:
import/% : _data/sorted/$$(word 1, $$(subst _, , $$*))_$$(word 3, $$(subst _, , $$*)).csv
```
Parse the transaction type and year out of a pattern like `CON_1_2023`. One transaction file covers the entire year, so we don't need to download it again for each quarterly import.
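For illustration, here is a Python mirror of what that prerequisite computes for a target like `import/CON_1_2023` (`prerequisite` is a hypothetical helper, not project code):

```python
def prerequisite(stem: str) -> str:
    # $(subst _, , $*) splits "CON_1_2023" into ["CON", "1", "2023"]
    words = stem.split("_")
    # $$(word 1, ...) is the transaction type and $$(word 3, ...) is the year;
    # the quarter (word 2) is ignored, since one file covers the whole year.
    return f"_data/sorted/{words[0]}_{words[2]}.csv"

assert prerequisite("CON_1_2023") == "_data/sorted/CON_2023.csv"
```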
```makefile
define quarterly_target
$(foreach YEAR,$(1),$(patsubst %,import/$(2)_%_$(YEAR),1 2 3 4))
endef
```
Create a target for each year (`$1`), transaction type (`$2`), and quarter.
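Modeled in Python, the macro expands like this (a hypothetical mirror of `quarterly_target`, not project code):

```python
def quarterly_target(years, tx_type):
    # $(foreach YEAR,$(1),$(patsubst %,import/$(2)_%_$(YEAR),1 2 3 4))
    return [
        f"import/{tx_type}_{quarter}_{year}"
        for year in years
        for quarter in (1, 2, 3, 4)
    ]

assert quarterly_target(["2023"], "CON") == [
    "import/CON_1_2023",
    "import/CON_2_2023",
    "import/CON_3_2023",
    "import/CON_4_2023",
]
```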
```yaml
with:
  ref: "deploy"
- name: Import data for 2024
  ref: "hcg/batch-it-up"
```
TODO: Change back to deploy before merge.
looking good. one bug fix, and some suggestions.
""" | ||
return filter( | ||
lambda x: get_quarter(x[0][2]) in filing_quarters, | ||
groupby(tqdm(records), key=filing_key), |
this tqdm is a bit confusing, as it is over all the records, not just the filtered records. would something like

`return groupby(tqdm(filter(lambda x: ..., records)), key=filing_key)`

work?
Yup! Done.
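For reference, a sketch of the revised shape, assuming `filing_key(record)` returns the grouping key whose third field feeds `get_quarter` (the function name is hypothetical and the record shapes are inferred from the snippet above, not confirmed against the project; `get_quarter` and `filing_key` come from the surrounding module):

```python
from itertools import groupby

from tqdm import tqdm

def filings_in_quarters(records, filing_quarters):
    # Filter first, then wrap the survivors in tqdm so the progress bar
    # counts only the records that actually get grouped.
    kept = (r for r in records if get_quarter(filing_key(r)[2]) in filing_quarters)
    return groupby(tqdm(kept), key=filing_key)
```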
```python
if not len(batch) % batch_size:
    self._save_batch(batch)
    batch = []
```
you also need to handle the case where you have iterated through all the records but the final batch is smaller than batch_size; as written, that last partial batch never gets saved
Great catch, thanks.
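A minimal sketch of the flush-after-the-loop fix being discussed; `make_contribution` and `_save_batch` come from the diff below, while the loop framing is illustrative:

```python
batch = []
for record in records:
    batch.append(self.make_contribution(record, None, filing))
    if len(batch) % batch_size == 0:
        # A full batch: persist it and start a fresh one.
        self._save_batch(batch)
        batch = []
# The records rarely divide evenly into batches; save the remainder.
if batch:
    self._save_batch(batch)
```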
```python
contribution = self.make_contribution(record, None, filing)
batch.append(contribution)

if not len(batch) % batch_size:
```
when doing modulo, i think it's better to use the form `len(batch) % batch_size == 0`; it's just a touch more explicit.
Love it. Done.
nice work!
Very curious to see what the performance improvement looks like!
```python
):
    yield cls.objects.bulk_create(cls_records)

def import_contributions(self, f, quarters, year, batch_size):
```
Such a clean function.
Overview
The transaction import makes something like 10 database calls per iteration. That can really slow us down when our database connection and/or cloud compute environment isn't zippy (#210).
This PR reduces the number of iterations from n to roughly n/4 by slicing the import by quarter, and reduces the number of database calls made by transaction saves (other queries are unaffected) from n to n/500 by batching those saves in chunks of 500. Both of these options are configurable, but are given reasonable defaults.
It further reduces run time by importing only one transaction type per job, rather than both contributions and expenditures. Jobs are now created via a matrix strategy, so all transaction imports run concurrently.
Connects #210.
Notes
I initially did a month filter, but filing periods tend to correspond to quarters, so some months have no filings. I thought it better to do quarters, which yield jobs of roughly equal size with no wasted jobs (i.e., jobs that don't import anything).
The quarter and year parameters are a little confusing.
When a filing period spans years, its transactions can be split across two data files. The year is the vintage of the data file, not the transaction date or filing period; it is used to remove only the transactions from a given filing in the given year (i.e., the ones that should be reimported in a given run).
The quarter is used to filter filings to only those with a filing period beginning in the given quarter of any year. In this way, it accounts for filings spanning years. For example, consider a filing period starting in December 2023 and ending in February 2024. Transactions would be split across the 2023 and 2024 files. To get them all, you would run the Q4 import for both 2023 and 2024.
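To make the quarter arithmetic concrete, a small sketch (`get_quarter` appears in the diff above, but this implementation is an assumption):

```python
from datetime import date

def get_quarter(d: date) -> int:
    # Map a date to its calendar quarter: Jan-Mar -> 1, ..., Oct-Dec -> 4.
    return (d.month - 1) // 3 + 1

# A filing period starting 2023-12-15 is a Q4 filing regardless of year,
# so running the Q4 import against both the 2023 and 2024 files picks up
# all of its transactions.
assert get_quarter(date(2023, 12, 15)) == 4
```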
Testing Instructions