DataImportError when scraping restricted bills #30

Open
antidipyramid opened this issue Oct 21, 2024 · 8 comments
@antidipyramid
Collaborator

antidipyramid commented Oct 21, 2024

Board reports 2024-0556 and 2024-0549 were restricted bills, and both raised DataImportErrors during scraping (whether or not they were restricted at the time).

[2024-10-21, 15:35:25 CDT] {docker.py:391} INFO - duplicate key value violates unique constraint "councilmatic_core_bill_slug_ecb9ca6b_uniq"
DETAIL: Key (slug)=(2024-0556) already exists.
while importing {'identifier': '2024-0556', 'title': 'AUTHORIZE the Chief ...

@antidipyramid
Collaborator Author

Deleting the offending bills and re-scraping fixed the issue.

This has come up again with another restricted bill with an identifier of 2024-1033:

raise DataImportError(
pupa.exceptions.DataImportError: duplicate key value violates unique constraint "councilmatic_core_bill_slug_ecb9ca6b_uniq"
DETAIL:  Key (slug)=(2024-1033) already exists.
 while importing {'identifier': '2024-1033', 'title': 'Restricted View', 'classification': ['bill'], 'subject': [], 'extras': {'restrict_view': True, 'plain_text': '', 'rtf_text': ''}, 'legislative_session_id': UUID('d5353c5e-efed-43b7-9c08-54751ed323a8'), 'from_organization_id': 'ocd-organization/f659e65f-0e12-46f2-9610-c3f1456540a2'} as <class 'opencivicdata.legislative.models.bill.Bill'>; 2002774)

There's already a bill with the same identifier in the database. One difference between it and the scraped bill seems to be the legislative_session_id: the one in the database has a legislative_session_id of UUID('997eda68-3c01-4378-adeb-2a009842a7b4').

The from_organization_ids are identical. Pupa uses these two attributes (legislative_session_id and from_organization_id) along with the bill's identifier to decide whether it needs to create a new object or update an existing one.

Since Pupa thinks it's scraping a new bill, it tries to create it but the identifier/slug clashes with the existing bill's, raising the import error.
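
The failure mode described above can be reproduced with a small in-memory model. This is only a sketch, not Pupa's actual importer: `Bill`, `BillStore`, and `DuplicateSlug` are hypothetical names, and the unique slug constraint is simulated with a plain check on the identifier.

```python
from dataclasses import dataclass


@dataclass
class Bill:
    identifier: str
    legislative_session_id: str
    from_organization_id: str


class DuplicateSlug(Exception):
    """Stands in for the unique-constraint violation on the slug column."""


class BillStore:
    """Hypothetical stand-in for the bill table and the importer's lookup."""

    def __init__(self):
        self.bills = []

    def find(self, scraped):
        # Match on identifier, session, and organization together.
        for bill in self.bills:
            if (
                bill.identifier == scraped.identifier
                and bill.legislative_session_id == scraped.legislative_session_id
                and bill.from_organization_id == scraped.from_organization_id
            ):
                return bill
        return None

    def import_bill(self, scraped):
        existing = self.find(scraped)
        if existing is not None:
            return "update"
        # No match, so the bill is treated as new. The slug (identifier)
        # must still be unique, which is where the import fails.
        if any(b.identifier == scraped.identifier for b in self.bills):
            raise DuplicateSlug(f"Key (slug)=({scraped.identifier}) already exists.")
        self.bills.append(scraped)
        return "insert"


ORG = "ocd-organization/f659e65f-0e12-46f2-9610-c3f1456540a2"
store = BillStore()
store.import_bill(Bill("2024-1033", "997eda68-3c01-4378-adeb-2a009842a7b4", ORG))

try:
    # Same identifier and organization, different session.
    store.import_bill(Bill("2024-1033", "d5353c5e-efed-43b7-9c08-54751ed323a8", ORG))
except DuplicateSlug as exc:
    print("import failed:", exc)
```

With the session changed, the lookup finds nothing, the bill is treated as new, and the simulated slug constraint rejects it.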

To your knowledge, has this come up in the past, @hancush? It seems like the common thread is that all of these bills were at one time restricted.

@hancush
Collaborator

hancush commented Nov 1, 2024

@antidipyramid The conflict in legislative session is definitely to blame here. Does the way we determine the legislative session differ between private and public bills?

@antidipyramid
Collaborator Author

@hancush
Collaborator

hancush commented Nov 1, 2024

It looks like we pass the matter's intro date to self.session – can that change?

date = matter['MatterIntroDate']
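
To illustrate why a changed intro date would matter, here is a rough sketch of a date-to-session lookup. It assumes sessions are defined by date ranges; the `SESSIONS` table, its identifiers and boundaries, and the `session_for` helper are all invented for illustration and are not taken from the scraper.

```python
from datetime import date

# Hypothetical session table: identifiers and date ranges are made up.
SESSIONS = [
    {"identifier": "2023", "start": date(2023, 7, 1), "end": date(2024, 6, 30)},
    {"identifier": "2024", "start": date(2024, 7, 1), "end": date(2025, 6, 30)},
]


def session_for(intro_date):
    """Return the identifier of the session whose range contains intro_date."""
    for session in SESSIONS:
        if session["start"] <= intro_date <= session["end"]:
            return session["identifier"]
    raise ValueError(f"no session covers {intro_date}")


# Two intro dates one day apart can land in different sessions.
print(session_for(date(2024, 6, 30)), session_for(date(2024, 7, 1)))
```

Under this assumption, if a matter's intro date moved across a session boundary between scrapes, the same identifier would be assigned a different session.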

@antidipyramid
Collaborator Author

antidipyramid commented Nov 6, 2024

@hancush One way of dealing with this is to simply remove the legislative session from the object spec during import:

https://github.com/opencivicdata/pupa/blob/2f7847cb87ed467f7afeec3f51cc704471b679c1/pupa/importers/bills.py#L53-L65

Bill slugs (i.e. identifiers) already must be unique so this would allow the importer to match to the existing object and update its session.
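
A minimal sketch of that idea, using a plain dict in place of the database. `upsert_by_identifier` is a hypothetical helper rather than a Pupa function; it only shows the matching behaviour when the session is left out of the lookup.

```python
def upsert_by_identifier(db, scraped):
    """Match on identifier alone, then overwrite the stored fields.

    db is a dict keyed by bill identifier, standing in for the bill table.
    """
    existing = db.get(scraped["identifier"])
    if existing is None:
        db[scraped["identifier"]] = dict(scraped)
        return "insert"
    # The session is not part of the match, so a changed session
    # becomes an ordinary field update instead of a failed insert.
    existing.update(scraped)
    return "update"


db = {
    "2024-1033": {
        "identifier": "2024-1033",
        "legislative_session_id": "997eda68-3c01-4378-adeb-2a009842a7b4",
    }
}
scraped = {
    "identifier": "2024-1033",
    "legislative_session_id": "d5353c5e-efed-43b7-9c08-54751ed323a8",
}
print(upsert_by_identifier(db, scraped), db["2024-1033"]["legislative_session_id"])
```

The trade-off is that the stored session silently follows whatever the latest scrape reports, so a wrong session would be overwritten without any error to flag it.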

@hancush
Collaborator

hancush commented Nov 6, 2024

Could work, @antidipyramid! I do want to understand why this is happening now, though.

@antidipyramid
Collaborator Author

@xmedr, if you have a chance in the next two weeks, it might be a good idea to take a look at the most recent scraper updates to see if those have anything to do with this behavior re: restricted bills. I don't see an obvious link but it'd be nice to get a second opinion.

I searched for similar errors in past issues across multiple repos and didn't see anything that looked like this.

@antidipyramid
Collaborator Author

This issue hasn't come up again so I'll put this in the icebox for now.

@antidipyramid antidipyramid moved this from 📝 In Progress to 🧊 Icebox in boardagendas.metro.net - Monthly priorities Dec 4, 2024