
Ujson and chunking #183

Merged
4 commits merged into main from ujson_and_chunking on May 15, 2024
Conversation

jcadam14
Contributor

@jcadam14 jcadam14 commented May 9, 2024

Closes #181
Closes #182


github-actions bot commented May 9, 2024

Coverage report

| File | Statements | Missing | Coverage | Coverage (new stmts) | Lines missing |
|------|------------|---------|----------|----------------------|---------------|
| src/regtech_data_validator/create_schemas.py | | | | | |
| src/regtech_data_validator/data_formatters.py | | | | | 153 |
| Project Total | | | | | |

This report was generated by python-coverage-comment-action

@jcadam14 jcadam14 changed the title Ujson and chunking [WIP] Ujson and chunking May 10, 2024
@jcadam14 jcadam14 marked this pull request as draft May 10, 2024 17:54
@jcadam14 jcadam14 changed the title [WIP] Ujson and chunking Ujson and chunking May 13, 2024
@jcadam14 jcadam14 marked this pull request as ready for review May 13, 2024 14:59
@jcadam14
Contributor Author

jcadam14 commented May 13, 2024

I put a comment in df_to_json to explain the reasoning behind doing a groupby before processing the error data. Essentially, this was the only way to maintain data integrity without adding even more processing to collate related data between chunks. However, adding this groupby does impact smaller data sets, though not by a huge margin. So if we decide that an absurd number of errors, in the 3 million+ range, is just not something we're going to concern ourselves with for now, I can remove all of that and just keep ujson and the df.concat change.
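
(A rough sketch of the groupby-before-serializing idea, purely illustrative and not the actual df_to_json implementation; the column names validation_id, record_no, field_name, and field_value are hypothetical stand-ins for whatever the error dataframe really uses.)

```python
import pandas as pd
import ujson  # drop-in, faster replacement for the stdlib json module


def df_to_json_sketch(errors: pd.DataFrame) -> str:
    """Serialize one complete validation at a time so related rows never
    get split the way fixed-size chunking of the error dataframe could."""
    findings = []
    # groupby keeps every row for a given validation together; sort=False
    # avoids an extra sort pass over a potentially huge frame.
    for validation_id, group in errors.groupby("validation_id", sort=False):
        findings.append(
            {
                "validation_id": validation_id,
                "records": group[["record_no", "field_name", "field_value"]].to_dict(orient="records"),
            }
        )
    return ujson.dumps(findings)
```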

Post-MVP, it might be worth looking at parallel processing across multiple services to really speed things up.

Over the weekend I spent a bit of time trying many different approaches. I tried creating the original error dataframe in the format that to_csv needs, but the extra processing there negated just about any improvement seen in df_to_csv, and the changes needed to create the intended JSON actually increased overall processing time by a decent amount. I also tried switching between dicts, df.groupby, and iterrows(), and so far this approach seems to be the fastest and doesn't cause a memory crash. I will say I still have yet to get the file with millions of records, which spits out over 28 million validation rows in a dataframe, to fully process; it just takes way too long.
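
(For reference, a minimal sketch of the chunk-then-concat pattern being discussed; the names validate_chunk, uid, and validation_id are hypothetical, and this is not the actual df.concat change made in this PR.)

```python
import pandas as pd


def validate_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in for per-chunk validation: flag rows missing a "uid".
    findings = chunk[chunk["uid"].isna()].copy()
    findings["validation_id"] = "E0001"  # hypothetical validation id
    return findings


def collect_findings(lar_frame: pd.DataFrame, chunk_size: int = 50_000) -> pd.DataFrame:
    # Validate the submission in fixed-size chunks, accumulate the per-chunk
    # results in a list, and concatenate once at the end; concatenating inside
    # the loop would re-copy the growing frame on every iteration.
    results = [
        validate_chunk(lar_frame.iloc[start : start + chunk_size])
        for start in range(0, len(lar_frame), chunk_size)
    ]
    return pd.concat(results, ignore_index=True) if results else pd.DataFrame()
```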

# dataframes (millions of errors). We can't chunk because that could split
# related validation data across chunks without adding extra processing
# to tie those objects back together. Grouping adds a little more processing
# time for smaller datasets but keeps really large ones from crashing.
Collaborator
When you say crashing, does it kill the whole API with an OOM or something like that, or does just the validation step fail? While "chunking" is worthwhile regardless, if only the validation fails and it doesn't kill the pod, then I'm less concerned; maybe post-MVP we can add some sort of "file has too many errors to process" handling.

Contributor Author
@lchen-2101 Honestly, I don't know how the filing-api responds, because so far I've only run this in the data-validator client, and it crashes that process altogether. I'll try to test with an actual full flow, see if I can get it to happen again, and see how the filing-api container handles it.

Contributor Author
@jcadam14 jcadam14 May 14, 2024
cfpb/sbl-filing-api#223 was written to handle the case where an executor crash happens during processing in the filing-api and to gracefully move the submission to VALIDATION_ERROR.

lchen-2101
lchen-2101 previously approved these changes May 15, 2024
Collaborator
@lchen-2101 lchen-2101 left a comment
LGTM

@lchen-2101
Collaborator

oops, merge conflict needs resolving.

Collaborator
@lchen-2101 lchen-2101 left a comment
LGTM

@lchen-2101 lchen-2101 merged commit 2da1812 into main May 15, 2024
6 checks passed
@lchen-2101 lchen-2101 deleted the ujson_and_chunking branch May 15, 2024 17:21
Successfully merging this pull request may close these issues:

- Chunk results dataframe for json processing
- Add ujson to data validator for data formatter df_to_json