Ujson and chunking #183
Conversation
Coverage report (generated by python-coverage-comment-action): click through to see where and how coverage changed.
I put a comment in dt_to_json to explain the reasoning behind doing a groupby before processing the error data. Essentially this was the only way to maintain data integrity without adding even more processing to collate related data between chunks. Adding this groupby does impact smaller datasets, though not by a huge margin. So if we decide that an absurd number of errors, in the 3 million+ range, is just not something we are going to concern ourselves with for now, I can remove all that and keep just the ujson and df.concat changes. Post-MVP we might look at parallel processing across multiple services to really speed things up.

Over the weekend I spent a bit of time trying many different approaches. I tried creating the original error dataframe in the format to_csv needs, but the extra processing there negated just about any improvement seen in df_to_csv, and the changes needed to create the intended JSON actually increased overall processing time by a decent amount. I tried switching between dicts, df.groupby, and iterrows(), and so far this approach seems to be the fastest and doesn't cause a memory crash. That said, I still have yet to get the multi-million-record file, which spits out over 28 million validation rows in a dataframe, to fully process. It just takes way too long.
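To illustrate the idea being discussed (not the PR's actual code): group the error dataframe by a record key first, then emit chunks made of whole groups, so rows that belong to the same record are never split across a chunk boundary. The column name `record_id` and the helper `errors_to_json_chunks` are hypothetical stand-ins, and plain `to_json` stands in for the ujson-based serialization.

```python
import json

import pandas as pd


def errors_to_json_chunks(df: pd.DataFrame, groups_per_chunk: int = 2) -> list:
    """Serialize errors in chunks that never split one record's rows apart."""
    chunks, batch = [], []
    # Group first: every row for a given record id stays together.
    for _, group in df.groupby("record_id"):
        batch.append(group)
        if len(batch) >= groups_per_chunk:
            chunks.append(pd.concat(batch).to_json(orient="records"))
            batch = []
    if batch:  # flush the final partial chunk
        chunks.append(pd.concat(batch).to_json(orient="records"))
    return chunks


# Example: all three rows for record 3 land in the same chunk.
df = pd.DataFrame({"record_id": [1, 1, 2, 3, 3, 3],
                   "error": ["a", "b", "c", "d", "e", "f"]})
chunks = errors_to_json_chunks(df, groups_per_chunk=2)
ids_per_chunk = [{row["record_id"] for row in json.loads(c)} for c in chunks]
```

The tradeoff matches the comment in the diff: the upfront groupby costs a little extra on small files, but bounds memory on huge ones because each chunk can be serialized and released independently.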
# dataframes (millions of errors). We can't chunk naively because that could
# split related validation data across chunks without adding extra processing
# to tie those objects back together. Grouping adds a little more processing
# time for smaller datasets but keeps really large ones from crashing.
When you say crashing, does it kill the whole API with OOM or something like that, or does just the validation step fail? While "chunking" is worthwhile regardless, if just the validation fails and doesn't kill the pod, then I'm less concerned; maybe post-MVP we can add some sort of "file has too many errors to process" response.
@lchen-2101 Honestly I don't know how the filing-api responds, because so far I've only run this in the data-validator client. It crashes that process altogether. I'll try to test with an actual full flow, see if I can reproduce it, and see how the filing-api container handles it.
cfpb/sbl-filing-api#223 was written to handle the case where an executor crash happens during processing in the filing-api and to gracefully move the submission to VALIDATION_ERROR.
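As a rough sketch of that graceful-failure idea (the state names and `run_validation` helper here are illustrative, not the filing-api's actual code): wrap the validation call so a crash surfaced as an exception moves the submission to VALIDATION_ERROR instead of leaving it stuck in progress.

```python
from enum import Enum


class SubmissionState(Enum):  # illustrative names, not the filing-api's real enum
    VALIDATION_IN_PROGRESS = "VALIDATION_IN_PROGRESS"
    VALIDATION_SUCCESSFUL = "VALIDATION_SUCCESSFUL"
    VALIDATION_ERROR = "VALIDATION_ERROR"


def run_validation(submission: dict, validate) -> dict:
    """Run a validator, downgrading any crash to VALIDATION_ERROR."""
    submission["state"] = SubmissionState.VALIDATION_IN_PROGRESS
    try:
        validate(submission)
        submission["state"] = SubmissionState.VALIDATION_SUCCESSFUL
    except Exception:
        # e.g. a MemoryError from processing a huge error dataframe
        submission["state"] = SubmissionState.VALIDATION_ERROR
    return submission
```

Note this only helps when the failure arrives as a catchable exception in the same process; an OOM kill of the whole pod still needs handling outside the worker, which is part of why the question above matters.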
LGTM
oops, merge conflict needs resolving.
LGTM
Closes #181
Closes #182