
Ujson and chunking #183

Merged
4 commits merged into main from ujson_and_chunking on May 15, 2024
Conversation

jcadam14
Contributor

@jcadam14 jcadam14 commented May 9, 2024

Closes #181
Closes #182


github-actions bot commented May 9, 2024

Coverage report

| File | Statements | Missing | Coverage | Coverage (new stmts) | Lines missing |
|------|------------|---------|----------|----------------------|---------------|
| src/regtech_data_validator/create_schemas.py | | | | | |
| src/regtech_data_validator/data_formatters.py | | | | | 153 |
| Project Total | | | | | |

This report was generated by python-coverage-comment-action

@jcadam14 jcadam14 changed the title Ujson and chunking [WIP] Ujson and chunking May 10, 2024
@jcadam14 jcadam14 marked this pull request as draft May 10, 2024 17:54
@jcadam14 jcadam14 changed the title [WIP] Ujson and chunking Ujson and chunking May 13, 2024
@jcadam14 jcadam14 marked this pull request as ready for review May 13, 2024 14:59
@jcadam14
Contributor Author

jcadam14 commented May 13, 2024

I put a comment in df_to_json to explain the reasoning behind doing a groupby before processing the error data. Essentially, this was the only way to maintain data integrity without adding even more processing to collate related data between chunks. However, adding this groupby does impact smaller data sets, though not by a huge margin. So if we decide that an absurd number of errors, in the 3 million+ range, is just not something we're going to concern ourselves with for now, I can remove all of that and just keep ujson and the df.concat change.
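
(A rough sketch of the groupby-before-serializing idea, purely illustrative and not the actual df_to_json implementation; the column names validation_id, record_no, field_name, and field_value are hypothetical stand-ins for whatever the error dataframe really uses.)

```python
import pandas as pd
import ujson  # drop-in, faster replacement for the stdlib json module


def df_to_json_sketch(errors: pd.DataFrame) -> str:
    """Serialize one complete validation at a time so related rows never
    get split the way fixed-size chunking of the error dataframe could."""
    findings = []
    # groupby keeps every row for a given validation together; sort=False
    # avoids an extra sort pass over a potentially huge frame.
    for validation_id, group in errors.groupby("validation_id", sort=False):
        findings.append(
            {
                "validation_id": validation_id,
                "records": group[["record_no", "field_name", "field_value"]].to_dict(orient="records"),
            }
        )
    return ujson.dumps(findings)
```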

Post-MVP, it might be worth looking at parallel processing across multiple services to really speed things up.

Over the weekend I spent a bit of time trying many different approaches. I tried creating the original error dataframe in the format that to_csv needs, but the extra processing there negated just about any improvement seen in df_to_csv, and the changes needed to create the intended JSON actually increased overall processing time by a decent amount. I also tried switching between dicts, df.groupby, and iterrows(), and so far this approach seems to be the fastest and doesn't cause a memory crash. I will say I still have yet to get the file with millions of records, which spits out over 28 million validation rows in a dataframe, to fully process; it just takes way too long.
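
(For reference, a minimal sketch of the chunk-then-concat pattern being discussed; the names validate_chunk, uid, and validation_id are hypothetical, and this is not the actual df.concat change made in this PR.)

```python
import pandas as pd


def validate_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in for per-chunk validation: flag rows missing a "uid".
    findings = chunk[chunk["uid"].isna()].copy()
    findings["validation_id"] = "E0001"  # hypothetical validation id
    return findings


def collect_findings(lar_frame: pd.DataFrame, chunk_size: int = 50_000) -> pd.DataFrame:
    # Validate the submission in fixed-size chunks, accumulate the per-chunk
    # results in a list, and concatenate once at the end; concatenating inside
    # the loop would re-copy the growing frame on every iteration.
    results = [
        validate_chunk(lar_frame.iloc[start : start + chunk_size])
        for start in range(0, len(lar_frame), chunk_size)
    ]
    return pd.concat(results, ignore_index=True) if results else pd.DataFrame()
```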

# dataframes (millions of errors). We can't chunk because that could split
# related validation data across chunks without adding extra processing
# to tie those objects back together. Grouping adds a little more processing
# time for smaller datasets but keeps really large ones from crashing.
Collaborator
When you say crashing, does it kill the whole API with an OOM or something like that, or does just the validation step fail? While "chunking" is worthwhile regardless, if only the validation fails and it doesn't kill the pod, then I'm less concerned; maybe post-MVP we can add some sort of "file has too many errors to process" handling.

Contributor Author
@lchen-2101 Honestly, I don't know how the filing-api responds, because so far I've only run this in the data-validator client, and it crashes that process altogether. I'll try to test with an actual full flow, see if I can get it to happen again, and see how the filing-api container handles it.

Contributor Author
@jcadam14 jcadam14 May 14, 2024
cfpb/sbl-filing-api#223 was written to handle the case where an executor crash happens during processing in the filing-api and to gracefully move the submission to VALIDATION_ERROR.

lchen-2101
lchen-2101 previously approved these changes May 15, 2024
Collaborator
@lchen-2101 lchen-2101 left a comment
LGTM

@lchen-2101
Collaborator

oops, merge conflict needs resolving.

Collaborator
@lchen-2101 lchen-2101 left a comment
LGTM

@lchen-2101 lchen-2101 merged commit 2da1812 into main May 15, 2024
6 checks passed
@lchen-2101 lchen-2101 deleted the ujson_and_chunking branch May 15, 2024 17:21
Successfully merging this pull request may close these issues:

- Chunk results dataframe for json processing
- Add ujson to data validator for data formatter df_to_json