-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add custom ali copyright language #2050
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making a start on this! As Ben noted, the current algorithm of looking for any word in "match_words" won't work, because nodes can contain words like "contracts" without being from the Restatement of the Law of Contracts. What we want is something that will detect each section in this PDF: https://www.ali.org/media/filer_public/f8/e9/f8e919d5-790e-408c-af0a-65db1687dbc5/2023-checklist.pdf , which you can also see listed on this page: https://www.ali.org/publications/
So for example:
- If the title contains "restatement" AND "contracts", it should have title="Restatement of the Law, Contracts", "years": "1971-2023"
- If the title contains "restatement" AND "agency", it should have title="Restatement of the Law, Agency", "years": "2026-2023"
- if it just says "Contracts: An introduction", it should have ali_licensed=false.
A good way to do this would be to add a test function to the test suite for your new get_ali_license_text
function that gives it a bunch of challenging examples, run some google searches like "site:opencasebook.org restatement", "site:opencasebook.org contracts", "site:opencasebook.org agency", "site:opencasebook.org torts", and make sure the test suite includes expected results for a range of examples that look challenging to you from those search results.
Thanks for the detailed explanation, I will work on it! |
@jcushman Some of the titles in the pdf contain 2nd, 3rd, and some additional wording, for example Another note is that since there are several publications under each title, and I will be grouping similar titles under one through above, I made the copyright years a range of the earliest and latest dates. For example: 2006-2023 for Let me know if you are ok with this approach, or would prefer it to be more granular |
This is also what I had in mind -- sounds great! |
One more thing to note is that since we didn't have the Model Penal Code publications in the pdf, I grabbed the two related materials' information from the ALI website (Sentencing, Sexual Assault and Related Offenses). Let me know if anything needs to be corrected for those. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks promising! The data structure looks right and the logic looks reasonable.
In addition to the notes I left for efficiency, can you add tests to make sure this performs as you intend? You can see in models.py we have various doctests like
r"""
Return all text, including spaces, from the html, using the LXML library.
>>> html = ' \n <p> \r\n <em> ...'
>>> assert ContentAnnotation.text_from_html(html) == ' \n \r\n foo ...'
"""
This code will actually be run from pytest, and is able to import fixtures defined in web/conftest.py. So you would do something in the docstring like:
>>> node = getfixture('contentnode')
>>> node.title = 'Restatement of Contracts 2d, blah blah blah'
>>> assert node.get_ali_license_text() == '...'
And include a few different examples to cover different scenarios. Alternatively you can add a test function to a reasonable place in main/test/ if the docstring approach starts to feel clunky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a typo in the "Principles" titles, then this is ready to go!
ali_licensed
field to the ContentCode model to indicate whether the section/resource has a special ALI copyright textlicensed_materials
list will be updated when we receive it from ALI, right now I only used it as a placeholder. Thus this PR won't be merged until we finalize it.