feat(pacer.email): improve bankruptcy short description parsing #1276
Solves #912, Solves #914
- simplify parsing by getting rid of cases by court groups
- support multi docket NEF parsing: add examples for deb, ctb, mdb, ndb, nhb, paeb, txnb
- support flsb
- correct wrong parsing for vaeb and mdb after double checking on PACER
- updated paeb_1 example file where creation of example file had broken parsing
1369764 to ecfc625
Wow, you really simplified the code while adding a bunch more test cases. Love it.
How much have you checked vs. the pacer history reports to make sure that your output is correct?
I left a screenshot of the "History/Documents" report on most JSON files to explain the changes. I noticed I need to improve the parsing of a couple of them
@@ -16,7 +16,7 @@
 "pacer_doc_id": null,
 "pacer_magic_num": null,
 "pacer_seq_no": null,
-"short_description": "Hearing Held"
+"short_description": "Hearing Held (BK)"
juriscraper/pacer/email.py
Outdated
# Deletes:
# - extra docket number 'components', such as `federal_dn_judge_initials_assigned`
# - Chapter component
On deleting "Chapter", see the nhb1 example. The email's subject is
Subject:21-10245-BAH Chapter 13 Mykle Lepene Affidavit of Compliance with Discharge Requirements
If we didn't delete the "Chapter" component, the short_description would be
"Chapter 13 Affidavit of Compliance with Discharge Requirements"
but PACER shows a different short description.
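The effect of dropping the "Chapter" component can be sketched as follows. This is a minimal illustration, not the code in juriscraper/pacer/email.py: the helper name `short_description_from_subject` is hypothetical, and it assumes the docket number and case name have already been extracted from the email.

```python
import re


def short_description_from_subject(subject: str, docket_number: str, case_name: str) -> str:
    """Hypothetical helper: strip the docket number, the first
    "Chapter N" string and the case name from an NEF subject."""
    text = subject.replace(docket_number, "", 1)
    # Assumption: the chapter is spelled out as "Chapter <number>"
    text = re.sub(r"\bChapter \d{1,2}\b", "", text, count=1)
    text = text.replace(case_name, "", 1)
    # Collapse the whitespace left behind by the deletions
    return " ".join(text.split())


subject = (
    "21-10245-BAH Chapter 13 Mykle Lepene "
    "Affidavit of Compliance with Discharge Requirements"
)
print(short_description_from_subject(subject, "21-10245-BAH", "Mykle Lepene"))
# prints "Affidavit of Compliance with Discharge Requirements"
```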
@@ -16,7 +16,7 @@
 "pacer_doc_id": null,
 "pacer_magic_num": null,
 "pacer_seq_no": null,
-"short_description": "CHAP - Hearing Continued"
+"short_description": "CHAP - Hearing Continued (Bk Other)"
@@ -16,7 +16,7 @@
 "pacer_doc_id": "188040985133",
 "pacer_magic_num": "49963627",
 "pacer_seq_no": "101",
-"short_description": "Notice of Dismissal"
+"short_description": "Notice of Dismissal - CerDocTyp"
@@ -16,7 +16,7 @@
 "pacer_doc_id": null,
 "pacer_magic_num": null,
 "pacer_seq_no": null,
-"short_description": "Docket Order - Continue Hearing (Auto) Ch 13"
+"short_description": "Docket Order - Continue Hearing (Auto)"
"pacer_doc_id": null,
"pacer_magic_num": null,
"pacer_seq_no": null,
"short_description": "Close Adversary Case"
"pacer_doc_id": "050057572723",
"pacer_magic_num": "58443666",
"pacer_seq_no": "384",
"short_description": "UST Form 11"
@@ -16,7 +16,7 @@
 "pacer_doc_id": "092051714783",
 "pacer_magic_num": "11582823",
 "pacer_seq_no": "409",
-"short_description": "Order on Motion To Quash"
+"short_description": "Order on Motion To Quash - CH"
"pacer_doc_id": null,
"pacer_magic_num": null,
"pacer_seq_no": null,
"short_description": "Close Adversary Case"
@@ -4,7 +4,7 @@
 "court_id": "paeb",
 "dockets": [
 {
-"case_name": "Br Cun",
+"case_name": "Brittany Cunningham",
Editing the example source had messed up the parsing of this case
Great. Thanks Gianfranco. Assigning to Alberto for review.
I am missing one more commit; I have been checking the old example files too and doing some minor updates.
Keep "Chapter ..." string in paeb and pamb courts
"pacer_doc_id": "154018316665",
"pacer_magic_num": "88545762",
"pacer_seq_no": "31",
"short_description": "BNC Certificate of Notice (341 Meeting Notice (Chapter 13))"
"pacer_doc_id": "152032523670",
"pacer_magic_num": "29282863",
"pacer_seq_no": "78",
"short_description": "Chapter 13 Plan"
@albertisfu this is now ready for review
Thanks @grossir. I reviewed the refactor, and it looks good to me.
I left a comment and a couple of concerns after finding some examples where the conditions for deciding when to keep or remove some suffixes and the "Chapter" string could be problematic.
juriscraper/pacer/email.py
Outdated
component_regex = r"((?!-MOR)(\-[A-Z]{2,}))|(\-[a-z]{2,})|(NEF:? )"
if self.court_id in ["paeb", "pamb"]:
    # keeps the "Chapter ..." description on the short description
    chapter_regex = r"(C[Hh][- ]?(13|7|9|11))|(C[hH][\s-]*$)"
Without looking too hard, I was able to find a couple of examples where removing "Chapter..." does not work.
For instance, this one: cacb_2.txt
You can see that in the Docket history report, many entries retain "Chapter" in their short descriptions.
Another example comes from pawb.
It seems difficult to determine when "Chapter..." should be removed. I'm also not sure whether the behavior is consistent across courts. If it should sometimes be removed and sometimes not within the same court, that's a problem unless we can find a different pattern to decide. However, if it is consistent, we should look for more courts where it should be retained and add them to the condition.
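A court-keyed condition of the kind discussed here could look roughly like the sketch below. Only the abbreviated pattern and the `paeb`/`pamb` list come from the diff under review; the broader `ANY_CHAPTER` pattern, the constant names and the `clean_chapter` helper are assumptions made for illustration.

```python
import re

# Courts whose short descriptions keep the spelled-out "Chapter ..." text
# (taken from the condition in the diff under review).
KEEP_CHAPTER_COURTS = {"paeb", "pamb"}

# Abbreviated forms only ("Ch 13", "CH-7", trailing "Ch"), copied from the diff.
ABBREVIATED_CHAPTER = r"(C[Hh][- ]?(13|7|9|11))|(C[hH][\s-]*$)"

# Assumption: a broader pattern that also removes the spelled-out form.
ANY_CHAPTER = r"(C[Hh](apter)?[- ]?(13|7|9|11))|(C[hH][\s-]*$)"


def clean_chapter(description: str, court_id: str) -> str:
    """Hypothetical helper: remove chapter strings according to the court."""
    if court_id in KEEP_CHAPTER_COURTS:
        pattern = ABBREVIATED_CHAPTER
    else:
        pattern = ANY_CHAPTER
    cleaned = re.sub(pattern, "", description)
    # Collapse the whitespace left behind by the deletion
    return " ".join(cleaned.split())


print(clean_chapter("Chapter 13 Plan", "paeb"))
print(clean_chapter("Chapter 13 Plan", "nhb"))
```

If more courts turn out to retain "Chapter", only the set of court ids would need to grow, which is the kind of condition the comment above asks about.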
I added an example with "Chapter..." or "Ch..." for each court; except for nyeb, for which I couldn't find any
- Add `casb` examples
- Add examples with a "Chapter.." string for all courts except nyeb, for which I couldn't find any
- Solved bugs pointed out in code review
- Now using saved raw case numbers to clean up the subject. This simplifies the process
Hi @albertisfu, thanks for the review. I think this is ready for review again; I have simplified the code by using the "raw" docket number, before it is separated into components.
"pacer_doc_id": "07404406548",
"pacer_magic_num": "85473386",
"pacer_seq_no": "4579",
"short_description": "Order Confirming Chapter 11 Plan"
"pacer_doc_id": "14404609723",
"pacer_magic_num": "39816587",
"pacer_seq_no": "190",
"short_description": "Chapter 11 Monthly Operating Report UST Form 11-MOR"
"pacer_doc_id": "11606636788",
"pacer_magic_num": "97314543",
"pacer_seq_no": "304",
"short_description": "Chapter 13 Trustee's Interim Report Upon Completion of Plan Payments (batch)"
"pacer_doc_id": "036021419934",
"pacer_magic_num": "29072456",
"pacer_seq_no": "2",
"short_description": "Chapter 7 Voluntary Petition"
Thanks, @grossir. The changes look pretty good. I just left a couple of small comments, and I think we'll be ready to merge this.
# Courts like `nhb` do not use the "Ch \d{1,2}" abbreviation
# and we must delete the "Chapter..." string; but only once
# See nhb_2 for an example with 2 "Chapter..." strings
if self.court_id in ["nhb"]:
Would it be a good idea to log an error message in Sentry (except for `nhb`) when we find an email subject with more than one 'Chapter...' string? This way, we can detect whether there are other courts that need to be added to this list of exceptions.
I have added a logger.error call for this case, and another for when we encounter an email from a court for which we have no examples.
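A check of this kind might be sketched as below. The names `MULTI_CHAPTER_COURTS`, `CHAPTER_PATTERN` and `count_chapter_strings` are hypothetical, and the pattern assumes the spelled-out "Chapter <number>" form; the actual code in juriscraper/pacer/email.py may differ.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumption: courts already known to repeat "Chapter ..." in a subject
# (`nhb`, per the review thread).
MULTI_CHAPTER_COURTS = {"nhb"}
CHAPTER_PATTERN = re.compile(r"\bChapter \d{1,2}\b")


def count_chapter_strings(subject: str, court_id: str) -> int:
    """Hypothetical helper: count "Chapter N" strings in a subject and
    log an error when an unlisted court has more than one."""
    matches = CHAPTER_PATTERN.findall(subject)
    if len(matches) > 1 and court_id not in MULTI_CHAPTER_COURTS:
        logger.error(
            "Found %d 'Chapter' strings in an email subject from court %s",
            len(matches),
            court_id,
        )
    return len(matches)
```

With an error-tracking integration attached to the logging module, such a call would surface courts that need to be added to the exception list.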
- add logger.error calls for courts without examples; and for unexpected usage of "Chapter.." strings
- restore overwritten test cases
- add a test case for nceb
- handle plain/text as seen in cob example
@albertisfu thanks for the review; I have addressed your comments, please check again when you have time.
Whew! Thanks guys!
Solves #912, Solves #914