
fix(sd): update extract_from_text #1293

Open
wants to merge 2 commits into
base: main

Conversation

grossir
Contributor

@grossir grossir commented Jan 10, 2025

Solves #1292

Now parsing: disposition, docket_number and judges

@grossir grossir requested a review from flooie January 10, 2025 01:19
@grossir grossir self-assigned this Jan 10, 2025
Comment on lines +33 to +34
"aff in pt, vacate, & rem in pt": "Affirm in part, vacate and remand in part",
"aff in pt & vacate": "Affirm and vacate", # https://www.courtlistener.com/opinion/9502826/state-v-scott/pdf/
Contributor

We may need to fine tune these dispositions.

Contributor Author

I can see this one:
"aff in pt & vacate": "Affirm and vacate"
should be
"aff in pt & vacate": "Affirm in part and vacate"

Any others?

Contributor

I would suggest the past tense. "Dismiss" should be "Dismissed", matching "Affirmed", "Reversed and Remanded", etc.

Also, "aff in pt & vacate" should be "Affirmed in Part and Vacated".
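
For reference, the past-tense suggestion could look roughly like the sketch below. This is not the code in the PR: `DISPOSITION_MAP` and `normalize_disposition` are hypothetical names, only the two keys quoted in this thread are included, and the expanded wording for the first key is an assumption extrapolated from the second.

```python
# Hypothetical sketch of a past-tense disposition table.
# Keys come from the lines under review; the first value is an assumed
# extrapolation of the reviewer's "Affirmed in Part and Vacated" wording.
DISPOSITION_MAP = {
    "aff in pt, vacate, & rem in pt": "Affirmed in Part, Vacated and Remanded in Part",
    "aff in pt & vacate": "Affirmed in Part and Vacated",
}


def normalize_disposition(raw: str) -> str:
    """Return the expanded disposition, or the stripped input if unknown."""
    # Lowercase and collapse runs of whitespace so PDF spacing quirks
    # do not prevent a lookup hit.
    key = " ".join(raw.lower().split())
    return DISPOSITION_MAP.get(key, raw.strip())


print(normalize_disposition("Aff in pt  & vacate"))
```

Falling back to the raw string keeps unknown abbreviations visible instead of silently dropping them, which should make it easier to spot entries that still need fine tuning.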

Contributor

@flooie flooie left a comment

The current regex pattern for citations is too restrictive. It seems to require a citation for processing, causing cases without one to be skipped. This should be loosened to ensure that all relevant data is captured, even when a citation is missing.
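
A minimal sketch of what loosening could mean, assuming the docket number is the required anchor and the citation follows it somewhere later in the text. The pattern, the sample layout, the flat return shape, and the name `extract_metadata` are all illustrative assumptions, not the actual pattern or return format used in `extract_from_text`.

```python
import re

# Hypothetical pattern: the docket number is required, the neutral citation
# is wrapped in an optional group so its absence does not block a match.
PATTERN = re.compile(
    r"#(?P<docket>\d+)"                             # docket number, e.g. "#12345"
    r"(?:.*?(?P<citation>\d{4}\s+S\.D\.\s+\d+))?",  # optional "2020 S.D. 5"
    re.DOTALL,
)


def extract_metadata(text: str) -> dict:
    """Return whatever fields are present; the citation key may be missing."""
    match = PATTERN.search(text)
    if match is None:
        return {}
    result = {"docket_number": match.group("docket")}
    if match.group("citation"):
        result["citation"] = match.group("citation")
    return result


# Made-up sample text; real opinions may be laid out differently.
print(extract_metadata("#12345-a-ABC\n2020 S.D. 5\nIN THE SUPREME COURT"))
print(extract_metadata("#12345-a-ABC\nIN THE SUPREME COURT"))
```

With the citation in an optional group, a document without one still yields the docket number instead of being skipped.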

The backscraper and the extract_from_text method only work with PDFs from 2005 onward, when the court transitioned to a new format. Extraction fails on pre-2005 documents because their text patterns differ.

We should add HTML cleanup code as well.

The regular backscraper is likely to fail starting around 2009-2010 because of overly strict regex constraints. Adjusting the pattern to accommodate format variations would improve reliability.

Overall, I liked what you did.

Projects
Status: PRs to Review
2 participants