feat(pacer.email): improve bankruptcy short description parsing #1276

Merged
merged 10 commits into from
Jan 10, 2025

@grossir grossir commented Dec 17, 2024

Solves #912, Solves #914

  • simplify parsing by getting rid of cases by court groups
  • support multi docket NEF parsing: add examples for deb, ctb, mdb, ndb, nhb, paeb, txnb
  • support flsb
  • correct wrong parsing for vaeb and mdb after double checking on PACER
  • updated paeb_1 example file where creation of example file had broken parsing

Member

@mlissner mlissner left a comment

Wow, you really simplified the code while adding a bunch more test cases. Love it.

How much have you checked vs. the pacer history reports to make sure that your output is correct?

Contributor Author

@grossir grossir left a comment

I left a screenshot of the "History/Documents" report on most JSON files to explain the changes. I noticed I need to improve the parsing of a couple of them

@@ -16,7 +16,7 @@
"pacer_doc_id": null,
"pacer_magic_num": null,
"pacer_seq_no": null,
-"short_description": "Hearing Held"
+"short_description": "Hearing Held (BK)"
Contributor Author

Document number 56 has short description "Hearing Held (BK)"
image


# Deletes:
# - extra docket number 'components', such as `federal_dn_judge_initials_assigned`
# - Chapter component
Contributor Author

On deleting "Chapter", see nhb1 example
The email's subject is
Subject:21-10245-BAH Chapter 13 Mykle Lepene Affidavit of Compliance with Discharge Requirements
if we didn't delete the "Chapter" the short_description would be
"Chapter 13 Affidavit of Compliance with Discharge Requirements"
but it's different on pacer
image
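
The effect of dropping the chapter component can be sketched as follows. This is a minimal illustration, not the code in `juriscraper/pacer/email.py`: the helper `strip_subject` and its parameters are hypothetical, and it assumes the docket number and case name are already known from the rest of the email.

```python
import re


def strip_subject(subject, docket_number, case_name, delete_chapter=True):
    """Hypothetical helper: reduce an NEF subject to a short description."""
    # Drop the parts of the subject that are not the description itself
    text = subject.replace(docket_number, " ").replace(case_name, " ")
    if delete_chapter:
        # Remove the first "Chapter NN" component only
        text = re.sub(r"Chapter \d{1,2}", " ", text, count=1)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())


subject = (
    "21-10245-BAH Chapter 13 Mykle Lepene "
    "Affidavit of Compliance with Discharge Requirements"
)
print(strip_subject(subject, "21-10245-BAH", "Mykle Lepene"))
# Affidavit of Compliance with Discharge Requirements
print(strip_subject(subject, "21-10245-BAH", "Mykle Lepene", delete_chapter=False))
# Chapter 13 Affidavit of Compliance with Discharge Requirements
```

The second call reproduces the description that the comment above says does not match PACER.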

@@ -16,7 +16,7 @@
"pacer_doc_id": null,
"pacer_magic_num": null,
"pacer_seq_no": null,
-"short_description": "CHAP - Hearing Continued"
+"short_description": "CHAP - Hearing Continued (Bk Other)"
Contributor Author

On keeping "CHAP" and "(Bk Other)"
image

@@ -16,7 +16,7 @@
"pacer_doc_id": "188040985133",
"pacer_magic_num": "49963627",
"pacer_seq_no": "101",
-"short_description": "Notice of Dismissal"
+"short_description": "Notice of Dismissal - CerDocTyp"
Contributor Author

On keeping "CerDocTyp"
image

@@ -16,7 +16,7 @@
"pacer_doc_id": null,
"pacer_magic_num": null,
"pacer_seq_no": null,
-"short_description": "Docket Order - Continue Hearing (Auto) Ch 13"
+"short_description": "Docket Order - Continue Hearing (Auto)"
Contributor Author

On deleting "Ch 13"

image

"pacer_doc_id": null,
"pacer_magic_num": null,
"pacer_seq_no": null,
"short_description": "Close Adversary Case"
Contributor Author

Subject in the email is
"Multiple Cases "Close Adversary Case" - AP -"

Should be reduced to "Close Adversary Case". It doesn't follow the usual clean up process, that's why I am using the early return
image
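
A minimal sketch of that early return, assuming the description is always the double-quoted part of a "Multiple Cases" subject (only this one subject is shown in the thread). The names `MULTI_CASE_RE` and `multi_case_description` are hypothetical and do not come from the PR.

```python
import re

# Assumption: the description is the quoted text right after "Multiple Cases"
MULTI_CASE_RE = re.compile(r'^Multiple Cases\s+"(?P<description>[^"]+)"')


def multi_case_description(subject):
    """Return the quoted description, or None if the subject has another shape."""
    match = MULTI_CASE_RE.match(subject.strip())
    if match:
        # Early return: skip the usual clean up steps entirely
        return match.group("description")
    return None


print(multi_case_description('Multiple Cases "Close Adversary Case" - AP -'))
# Close Adversary Case
```

Returning `None` for any other subject lets the caller fall through to the regular clean up path.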

"pacer_doc_id": "050057572723",
"pacer_magic_num": "58443666",
"pacer_seq_no": "384",
"short_description": "UST Form 11"
Contributor Author

This one should be "UST Form 11-MOR"
image

@@ -16,7 +16,7 @@
"pacer_doc_id": "092051714783",
"pacer_magic_num": "11582823",
"pacer_seq_no": "409",
-"short_description": "Order on Motion To Quash"
+"short_description": "Order on Motion To Quash - CH"
Contributor Author

This one shouldn't have the trailing " - CH"
image

"pacer_doc_id": null,
"pacer_magic_num": null,
"pacer_seq_no": null,
"short_description": "Close Adversary Case"
Contributor Author

This one is OK
image

@@ -4,7 +4,7 @@
"court_id": "paeb",
"dockets": [
{
-"case_name": "Br Cun",
+"case_name": "Brittany Cunningham",
Contributor Author

Editing the example source had messed up the parsing of this case

@mlissner mlissner assigned albertisfu and unassigned mlissner Dec 18, 2024
@mlissner mlissner requested a review from albertisfu December 18, 2024 17:59
@mlissner
Member

Great. Thanks Gianfranco. Assigning to Alberto for review.

@grossir
Contributor Author

grossir commented Dec 18, 2024

I am missing one more commit; I have been checking the old example files too and doing some minor updates

grossir and others added 2 commits December 18, 2024 15:08
"pacer_doc_id": "154018316665",
"pacer_magic_num": "88545762",
"pacer_seq_no": "31",
"short_description": "BNC Certificate of Notice (341 Meeting Notice (Chapter 13))"
Contributor Author

We keep the "Chapter ..." string

Email subject:
Subject:Ch-13 1:24-bk-00941-HWV -Charlene R. House BNC Certificate of Notice (341 Meeting Notice (Chapter 13))

Document history report:
image

"pacer_doc_id": "152032523670",
"pacer_magic_num": "29282863",
"pacer_seq_no": "78",
"short_description": "Chapter 13 Plan"
Contributor Author

We keep the "Chapter..." string

Email subject:
Subject:Ch-13 23-13878-pmm Chapter 13 Plan - Kervince Markenzy

History report
image

@grossir
Contributor Author

grossir commented Dec 18, 2024

@albertisfu this is now ready for review

Contributor

@albertisfu albertisfu left a comment

Thanks @grossir I reviewed the refactor applied, and it looks good to me.
I just left a comment and a couple of concerns after finding some examples where the conditions for deciding when to keep or remove some suffixes and Chapter could be problematic.

juriscraper/pacer/email.py (outdated)
component_regex = r"((?!-MOR)(\-[A-Z]{2,}))|(\-[a-z]{2,})|(NEF:? )"
if self.court_id in ["paeb", "pamb"]:
    # keeps the "Chapter ..." description on the short description
    chapter_regex = r"(C[Hh][- ]?(13|7|9|11))|(C[hH][\s-]*$)"
Contributor

Without looking too hard, I was able to find a couple of examples where removing "Chapter..." does not match the PACER report.

For instance this one:
cacb_2.txt
Screenshot 2024-12-23 at 7 10 00 p m

You can see that in the Docket history report, many entries retain "Chapter" in their short descriptions.

Another example comes from pawb:

Screenshot 2024-12-23 at 7 11 59 p m

It seems difficult to determine when "Chapter..." should be removed. I'm also not sure whether the behavior is consistent across courts. If it should sometimes be removed and sometimes kept within the same court, that's a problem unless we can find a different pattern to decide. However, if it is consistent, we should look for more courts where it should be retained and add them to the condition.

Contributor Author

I added an example with "Chapter..." or "Ch..." for each court; except for nyeb, for which I couldn't find any
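
For reference, the behavior of the two patterns quoted in the excerpt under review can be checked in isolation. The sketch below copies both patterns verbatim; the `remove` helper and the order in which the patterns are applied are assumptions made for illustration, since the thread does not show how the PR combines them.

```python
import re

# Both patterns are copied verbatim from the excerpt under review
COMPONENT_REGEX = r"((?!-MOR)(\-[A-Z]{2,}))|(\-[a-z]{2,})|(NEF:? )"
CHAPTER_ABBREV_REGEX = r"(C[Hh][- ]?(13|7|9|11))|(C[hH][\s-]*$)"


def remove(pattern, text):
    """Delete every match of `pattern` and collapse leftover whitespace."""
    return " ".join(re.sub(pattern, "", text).split())


# "-MOR" is protected by the negative lookahead
print(remove(COMPONENT_REGEX, "UST Form 11-MOR"))
# UST Form 11-MOR

subject = "Ch-13 23-13878-pmm Chapter 13 Plan - Kervince Markenzy"
step1 = remove(COMPONENT_REGEX, subject)  # drops "-pmm"
step2 = remove(CHAPTER_ABBREV_REGEX, step1)  # drops "Ch-13", keeps "Chapter 13"
print(step2)
# 23-13878 Chapter 13 Plan - Kervince Markenzy
```

The abbreviation pattern only matches "Ch"/"CH" followed by a chapter number or the end of the string, so the spelled-out "Chapter 13" survives.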

@albertisfu albertisfu assigned grossir and unassigned albertisfu Dec 24, 2024
- Add `casb` examples
- Add examples with a "Chapter.." string for all courts except nyeb, for which I couldn't find any
- Solved bugs pointed out in code review
- Now using saved raw case numbers to clean up the subject. This simplifies the process
@grossir
Contributor Author

grossir commented Jan 8, 2025

Hi @albertisfu, thanks for the review. I think this is ready for review again. I have simplified the code by using the "raw" docket number, before it is separated into components.
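
The idea of cleaning the subject with the raw docket number can be sketched like this. It is only an illustration under stated assumptions: the helper name `remove_raw_docket_numbers` is hypothetical, and it assumes the raw strings are kept in a list exactly as they appeared in the email.

```python
def remove_raw_docket_numbers(subject, raw_docket_numbers):
    """Hypothetical helper: delete each docket number exactly as it appeared."""
    # Longest first, so a number that is a prefix of another is not cut short
    for raw in sorted(raw_docket_numbers, key=len, reverse=True):
        subject = subject.replace(raw, " ")
    # Collapse the whitespace left behind by the removals
    return " ".join(subject.split())


subject = (
    "Ch-13 1:24-bk-00941-HWV -Charlene R. House "
    "BNC Certificate of Notice (341 Meeting Notice (Chapter 13))"
)
print(remove_raw_docket_numbers(subject, ["1:24-bk-00941-HWV"]))
# Ch-13 -Charlene R. House BNC Certificate of Notice (341 Meeting Notice (Chapter 13))
```

Because the raw string still carries components such as the judge initials ("-HWV"), a plain string replacement removes them together with the number and no separate pattern is needed for them.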

@grossir grossir assigned albertisfu and unassigned grossir Jan 8, 2025
@grossir grossir requested a review from albertisfu January 8, 2025 15:50
"pacer_doc_id": "07404406548",
"pacer_magic_num": "85473386",
"pacer_seq_no": "4579",
"short_description": "Order Confirming Chapter 11 Plan"
Contributor Author

image

"pacer_doc_id": "14404609723",
"pacer_magic_num": "39816587",
"pacer_seq_no": "190",
"short_description": "Chapter 11 Monthly Operating Report UST Form 11-MOR"
Contributor Author

image

"pacer_doc_id": "11606636788",
"pacer_magic_num": "97314543",
"pacer_seq_no": "304",
"short_description": "Chapter 13 Trustee's Interim Report Upon Completion of Plan Payments (batch)"
Contributor Author

image

"pacer_doc_id": "036021419934",
"pacer_magic_num": "29072456",
"pacer_seq_no": "2",
"short_description": "Chapter 7 Voluntary Petition"
Contributor Author

image

Contributor

@albertisfu albertisfu left a comment

Thanks, @grossir. The changes look pretty good. I just left a couple of small comments, and I think we'll be ready to merge this.

juriscraper/pacer/email.py (outdated)
# Courts like `nhb` do not use the "Ch \d{1,2}" abbreviation
# and we must delete the "Chapter..." string; but only once
# See nhb_2 for an example with 2 "Chapter..." strings
if self.court_id in ["nhb"]:
Contributor

Would it be a good idea to log an error message in Sentry (except for nhb) when we find an email subject with more than one 'Chapter...' string? This way, we can detect whether other courts need to be added to this list of exceptions.

Contributor Author

I have added a logger.error call for this case; and another for when we encounter an email from a court for which we have no examples
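
A rough sketch of the behavior discussed in this thread: delete the first "Chapter..." string for `nhb` only, and log an error when any other court sends a subject with more than one. The helper `handle_chapter`, the `Chapter \d{1,2}` pattern, and the log message are assumptions for illustration, not the code merged in the PR.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumed pattern for a spelled-out chapter component
CHAPTER_RE = re.compile(r"Chapter \d{1,2}")
# Courts where the first "Chapter..." string is deleted; only nhb per the thread
DELETE_CHAPTER_ONCE = {"nhb"}


def handle_chapter(subject, court_id):
    """Hypothetical helper mirroring the behavior discussed in this thread."""
    if court_id in DELETE_CHAPTER_ONCE:
        # count=1 leaves any later "Chapter..." string in place
        return " ".join(CHAPTER_RE.sub(" ", subject, count=1).split())
    if len(CHAPTER_RE.findall(subject)) > 1:
        # Surface candidates for the exception list (e.g. through Sentry)
        logger.error(
            "Subject with multiple 'Chapter' strings for court %s: %s",
            court_id,
            subject,
        )
    return subject


# Made-up subject remainder with two "Chapter..." strings
print(handle_chapter("Chapter 13 Order Confirming Chapter 13 Plan", "nhb"))
# Order Confirming Chapter 13 Plan
```

For any court outside the exception set the subject is returned unchanged, so the log entry is the only side effect.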

tests/examples/pacer/nef/s3/cacb_3.json (outdated, resolved)
tests/examples/pacer/nef/s3/cob.json (outdated, resolved)
tests/examples/pacer/nef/s3/paeb_3.json (outdated, resolved)
- add logger.error calls for courts without examples; and for unexpected usage of "Chapter.." strings
- restore overwritten test cases
- add a test case for nceb
- handle text/plain emails, as seen in the cob example
@grossir
Contributor Author

grossir commented Jan 9, 2025

@albertisfu thanks for the review; I have addressed your comments, please check again when you have time

@albertisfu
Contributor

Thanks, @grossir. With the new loggers, we can confirm the parsing once we get new examples.

@mlissner this seems ready to merge!

@mlissner mlissner merged commit 5981261 into main Jan 10, 2025
12 checks passed
@mlissner mlissner deleted the recap_email_bankruptcy_short_description branch January 10, 2025 04:29
@mlissner
Member

Whew! Thanks guys!
