This is the source code for case.law, a website written by the Harvard Law School Library Innovation Lab to manage and serve court opinions. Other than several cases used for our automated testing, this repository does not contain case data. Case data may be obtained through the website.
- Capstone
The Caselaw Access Project is a large-scale digitization project hosted by the Harvard Law School Library Innovation Lab. Visit case.law for more details.
The output of the project consists of page images, marked-up case XML files, ALTO XML files, and METS XML files. This repository has a more detailed explanation of the format, and two volumes' worth of sample data:
CAP Samples and Format Documentation
This data, with some temporary restrictions, is available to all. Please see our project site for more information about how to access the API or get bulk access to the data.
This is a living, breathing corpus of data. While we've taken great pains to ensure its accuracy and integrity, two large components of this project, namely OCR and human review, are utterly fallible. When we were designing Capstone, we knew that one of its primary functions would be to facilitate safe, accountable updates. If you find any errors in the data, we would be extraordinarily grateful if you took a moment to report them by creating an issue in this GitHub repository's issue tracker. If you notice a large pattern of problems that would be better fixed programmatically, or have a very large number of modifications, describe it in an issue. If we need more information, we'll ask. We'll close the issue once the error has been corrected.
These are known issues — there's no need to file an issue if you come across one of these.
- Missing Judges Tag: In many volumes, elements which should have the tag name <judges> instead have the tag name <p>. We're working on this one.
- Nominative Case Citations: In many cases that come from nominative volumes, the citation format is wrong. We hope to have this corrected soon.
- Jurisdictions: Though the jurisdiction values in our API metadata entries are normalized, we have not propagated those changes to the XML.
- Court Name: We've seen some inconsistencies in the court name. We're trying to get this normalized in the data, and we'll also publish a complete court name list when we're done.
- OCR errors: There will be OCR errors on nearly every page. We're still trying to figure out how best to address this. If you've got some killer OCR correction strategies, get at us.
Capstone is a Django application with a PostgreSQL database which stores and manages the non-image data output of the CAP project. This includes:
- Original XML data
- Normalized metadata extracted from the XML
- External metadata, such as the Reporter database
- Changelog data, tracking changes and corrections
Add the following to /etc/hosts:
127.0.0.1 case.test
127.0.0.1 api.case.test
127.0.0.1 cite.case.test
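To confirm the entries above are in place before starting Docker, a small check along these lines may help. This is only a sketch: the helper name `missing_hosts` is hypothetical and not part of Capstone, and it assumes the standard hosts-file layout of an address followed by one or more hostnames.

```python
# Hypothetical helper, not part of Capstone: report which of the
# development hostnames are not yet mapped to 127.0.0.1.
REQUIRED_HOSTS = ("case.test", "api.case.test", "cite.case.test")


def missing_hosts(hosts_text, required=REQUIRED_HOSTS, address="127.0.0.1"):
    """Return the required hostnames that hosts_text does not map to address."""
    mapped = set()
    for line in hosts_text.splitlines():
        # Ignore anything after a '#' comment marker, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == address:
            mapped.update(fields[1:])
    return [name for name in required if name not in mapped]
```

For example, `missing_hosts(open("/etc/hosts").read())` returns an empty list once all three entries are present.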
We support local development via docker compose. Docker setup looks like this:
Using pull first will avoid rebuilding images locally:
$ docker-compose pull
Start docker:
$ docker-compose up -d
Set up databases:
$ docker-compose exec db psql --user=postgres -c "CREATE DATABASE capdb;"
$ docker-compose exec db psql --user=postgres -c "CREATE DATABASE capapi;"
$ docker-compose exec db psql --user=postgres -c "CREATE DATABASE cap_user_data;"
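If you prefer to script the three database commands above, one possible approach is sketched below. The function names are hypothetical and not part of Capstone's fabfile. The sketch assumes `docker-compose` is on your PATH and adds the `-T` flag so that `exec` does not try to allocate a TTY when run from a script.

```python
import subprocess

# Database names taken from the setup commands above.
DATABASES = ("capdb", "capapi", "cap_user_data")


def create_database_commands(databases=DATABASES):
    """Build one docker-compose command (as an argument list) per database."""
    return [
        [
            "docker-compose", "exec", "-T", "db",
            "psql", "--user=postgres",
            "-c", f"CREATE DATABASE {name};",
        ]
        for name in databases
    ]


def create_databases(databases=DATABASES):
    """Run each command in turn, stopping at the first failure."""
    for command in create_database_commands(databases):
        subprocess.run(command, check=True)
```

Calling `create_databases()` from the host, after `docker-compose up -d`, should be equivalent to running the three commands by hand.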
Log into web container:
$ docker-compose exec web bash
#
From now on all commands starting with # are assumed to be run from within docker-compose exec web bash.
Load dev data:
⚠️ Note: Make sure that Docker has sufficient resources allocated to run Elasticsearch. Lower allocations may cause rebuild_search_index to crash. Recommended minimum:
- CPUs: 6
- Memory: 16 GB
- Swap: 1 GB
- Disk image: ~256 GB
# fab init_dev_db
# fab ingest_fixtures
# fab import_web_volumes
# fab refresh_case_body_cache
# fab rebuild_search_index
To get ngrams working, run:
# mkdir test_data/ngrams
# fab ngram_jurisdictions
Run the dev server:
# fab run
Capstone should now be running at 127.0.0.1:8000.
If you are working on frontend JavaScript files, use fab run_frontend:
# fab run_frontend
- [Testing ](#testing-)
- [Requirements ](#requirements-)
- [Applying model changes ](#applying-model-changes-)
- [Stored Postgres functions ](#stored-postgres-functions-)
- [Running Command Line Scripts ](#running-command-line-scripts-)
- [Logging In ](#logging-in-)
- [Local debugging tools ](#local-debugging-tools-)
- [Model versioning ](#model-versioning-)
- [Download real data locally ](#download-real-data-locally-)
- [Working with javascript ](#working-with-javascript-)
- [Elasticsearch ](#elasticsearch-)
We use pytest for tests. Some notable flags:
Run all tests:
# pytest
Run one test:
# pytest -k test_name
Drop into pdb on test failure:
# pytest --pdb
Run tests in parallel for speed:
# pytest -n 2
Top-level requirements are stored in requirements.in. After updating that file, you should run
# fab pip_compile
to freeze all subdependencies into requirements.txt.
To upgrade a single requirement to the latest version:
# fab pip_compile:"-P package_name"
Use Django to apply migrations. After you change models.py:
# ./manage.py makemigrations
This will write a migration script to cap/migrations. Then apply:
# fab migrate
This will migrate the underlying model in PostgreSQL. In order to transfer changes to Elasticsearch, apply:
# fab rebuild_search_index
Ensure that the relevant handlers to transfer this data are written in capstone/capapi/documents.py.
Some Capstone features depend on stored functions.
See set_up_postgres.py for documentation.
Command line scripts are defined in fabfile.py. You can list all available commands using fab -l, and run a command with fab command_name.
fab init_dev_db will create a user with email [email protected] and password Password2.
You can create additional test users from ./manage.py shell_plus using the same code that is used by the init_dev_db command, or using the web frontend on the local development server.
Creating a new user through the frontend requires access to an email verification link. That link will be shown in the output of fab run or fab run_frontend in the following format:
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Caselaw Access Project: Verify your email address
From: [email protected]
To: [email protected]
Date: Wed, 04 Aug 2021 17:53:46 -0000
Message-ID: <162809962609.2188.6020186441304370023@63fceca6d616>
Please click here to verify your email address:
https://case.test:8000/user/verify-user/4/ffffffffffffffffff/
If you received this message in error, please ignore it.
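When the dev server output is long, it can be handy to pull the verification link out programmatically. The following is a minimal sketch, assuming the link keeps the /user/verify-user/<id>/<token>/ shape shown in the sample message above; the function name is hypothetical and not part of Capstone.

```python
import re

# Matches links shaped like the one in the sample email:
#   https://case.test:8000/user/verify-user/4/ffffffffffffffffff/
VERIFY_LINK_RE = re.compile(r"https?://\S+?/user/verify-user/\d+/[^/\s]+/")


def find_verify_links(log_text):
    """Return every verification link found in captured dev server output."""
    return VERIFY_LINK_RE.findall(log_text)
```

For example, passing the sample message above to `find_verify_links` returns a one-element list containing the verification URL.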
django-extensions is enabled by default, including the very handy ./manage.py shell_plus command.
django-debug-toolbar is not automatically enabled, but if you run pip install django-debug-toolbar it will be detected and enabled by settings_dev.py.
For database versioning we use the Postgres temporal tables approach inspired by SQL:2011's temporal databases.
See this blog post for an explanation of temporal tables and how to use them in Postgres.
We use django-simple-history to manage creation, migration, and querying of the historical tables.
Data is kept in sync through the temporal_tables Postgres extension and the triggers created in our scripts/set_up_postgres.py file.
We store complete fixtures for about 1,000 cases in the case.law downloads section.
You can download and ingest all volume fixtures from that section with the command fab import_web_volumes, or ingest a single volume downloaded from that section with the command fab import_volume:some.zip.
We use Vite to compile javascript files. New javascript entrypoints can be added to vite.config.js and included in templates with {% vite_asset %}.
To see javascript changes live, run the dev server with
# fab run_frontend
This will start yarn serve behind the scenes before calling fab run.
For local dev, Elasticsearch will automatically be started by docker-compose up -d. You can then run fab refresh_case_body_cache to populate CaseBodyCache for all cases, and fab rebuild_search_index to populate the search index.
For debugging, see settings.py.example for an example of how to log all requests to and from Elasticsearch.
It may also be useful to run Kibana to directly query Elasticsearch from a browser GUI:
$ brew install kibana
$ kibana -e http://127.0.0.1:9200
You can then go to Kibana -> Dev Tools to run any of the logged queries, or GET /_mapping to see the search indexes.
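If you would rather not install Kibana, the same GET /_mapping request can be made with the Python standard library. This is a sketch only: it assumes Elasticsearch is reachable at http://127.0.0.1:9200 (the address passed to Kibana above) without authentication, and the function names are hypothetical.

```python
import json
import urllib.request


def fetch_mapping(base_url="http://127.0.0.1:9200"):
    """GET /_mapping and return the decoded JSON response."""
    with urllib.request.urlopen(f"{base_url}/_mapping") as response:
        return json.load(response)


def index_names(mapping):
    """The /_mapping response is keyed by index name; list those names sorted."""
    return sorted(mapping)
```

With the containers running, `index_names(fetch_mapping())` lists the indexes created by fab rebuild_search_index.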
We maintain a separate CAP examples repo for some ideas about using code to interact with CAP data.