From 91841b7a41de1d16e2ff2eed4e3a76e215d1e55e Mon Sep 17 00:00:00 2001 From: JSv4 Date: Mon, 26 Aug 2024 08:39:37 -0700 Subject: [PATCH] Deployed 9aacd08 with MkDocs version: 1.6.0 --- 404.html | 2 +- acknowledgements/index.html | 4 ++-- architecture/PDF-data-layer/index.html | 2 +- architecture/asynchronous-processing/index.html | 2 +- architecture/components/Data-flow-diagram/index.html | 2 +- .../annotator/how-annotations-are-created/index.html | 2 +- architecture/components/annotator/overview/index.html | 2 +- architecture/opencontract-corpus-actions/index.html | 4 ++-- configuration/add-users/index.html | 2 +- configuration/choose-an-authentication-backend/index.html | 2 +- configuration/choose-and-configure-docker-stack/index.html | 2 +- configuration/choose-storage-backend/index.html | 2 +- configuration/configure-admin-users/index.html | 2 +- configuration/configure-gremlin/index.html | 2 +- configuration/frontend-configuration/index.html | 2 +- development/documentation/index.html | 2 +- development/environment/index.html | 2 +- development/frontend-notes/index.html | 2 +- development/test-suite/index.html | 2 +- extract_and_retrieval/document_data_extract/index.html | 2 +- .../intro_to_django_annotation_vector_store/index.html | 2 +- extract_and_retrieval/querying_corpus/index.html | 2 +- index.html | 2 +- philosophy/index.html | 2 +- quick-start/index.html | 2 +- requirements/index.html | 2 +- search/search_index.json | 2 +- walkthrough/advanced/configure-annotation-view/index.html | 2 +- walkthrough/advanced/data-extract-models/index.html | 2 +- walkthrough/advanced/export-import-corpuses/index.html | 2 +- walkthrough/advanced/fork-a-corpus/index.html | 2 +- walkthrough/advanced/generate-graphql-schema-files/index.html | 2 +- walkthrough/advanced/pawls-token-format/index.html | 4 ++-- walkthrough/advanced/register-doc-analyzer/index.html | 2 +- walkthrough/advanced/run-gremlin-analyzer/index.html | 2 +- walkthrough/advanced/testing-llama-index-calls/index.html | 2 +- walkthrough/advanced/write-your-own-extractors/index.html | 2 +- walkthrough/key-concepts/index.html | 2 +- walkthrough/step-1-add-documents/index.html | 2 +- walkthrough/step-2-create-labelset/index.html | 2 +- walkthrough/step-3-create-a-corpus/index.html | 2 +- walkthrough/step-4-create-text-annotations/index.html | 2 +- walkthrough/step-5-create-doc-type-annotations/index.html | 2 +- .../step-6-search-and-filter-by-annotations/index.html | 2 +- walkthrough/step-7-query-corpus/index.html | 2 +- walkthrough/step-8-data-extract/index.html | 2 +- walkthrough/step-9-corpus-actions/index.html | 2 +- 47 files changed, 50 insertions(+), 50 deletions(-) diff --git a/404.html b/404.html index 9c24ee96..577685b5 100755 --- a/404.html +++ b/404.html @@ -1 +1 @@ - OpenContracts
\ No newline at end of file + OpenContracts
\ No newline at end of file diff --git a/acknowledgements/index.html b/acknowledgements/index.html index 28a66986..7c0e30da 100755 --- a/acknowledgements/index.html +++ b/acknowledgements/index.html @@ -1,4 +1,4 @@ - Acknowledgements - OpenContracts

Acknowledgements

OpenContracts is built in part on top of the PAWLs project frontend. We have made extensive changes, however, and plan to remove even more of the original PAWLs codebase, particularly their state management, as it's currently duplicative of the Apollo state store we use throughout the application. That said, PAWLs was the inspiration for how we handle text extraction, and we're planning to continue using their PDF rendering code. We are also using PAWLs' pre-processing script, which is based on Grobid.

We should also thank the Grobid project, which was clearly a source of inspiration for PAWLs and an extremely impressive tool. Grobid is designed more for medical and scientific papers, but, nevertheless, offers a tremendous amount of inspiration and examples for the legal world to borrow. Perhaps there is an opportunity to have a unified tool in that respect.

Finally, let's not forget Tesseract, the OCR engine that started its life as an HP research project in the 1980s before being taken over by Google in the early aughts and finally becoming an independent project in 2018. Were it not for the excellent, free OCR provided by Tesseract, we'd have to rely on commercial OCR tech, which would make this kind of open-source, free project prohibitively expensive. Thanks to the many, many people who've made free OCR possible over the nearly 40 years Tesseract has been under development.

Acknowledgements

OpenContracts is built in part on top of the PAWLs project frontend. We have made extensive changes, however, and plan to remove even more of the original PAWLs codebase, particularly their state management, as it's currently duplicative of the Apollo state store we use throughout the application. That said, PAWLs was the inspiration for how we handle text extraction, and we're planning to continue using their PDF rendering code. We are also using PAWLs' pre-processing script, which is based on Grobid.

We should also thank the Grobid project, which was clearly a source of inspiration for PAWLs and an extremely impressive tool. Grobid is designed more for medical and scientific papers, but, nevertheless, offers a tremendous amount of inspiration and examples for the legal world to borrow. Perhaps there is an opportunity to have a unified tool in that respect.

Finally, let's not forget Tesseract, the OCR engine that started its life as an HP research project in the 1980s before being taken over by Google in the early aughts and finally becoming an independent project in 2018. Were it not for the excellent, free OCR provided by Tesseract, we'd have to rely on commercial OCR tech, which would make this kind of open-source, free project prohibitively expensive. Thanks to the many, many people who've made free OCR possible over the nearly 40 years Tesseract has been under development.

\ No newline at end of file diff --git a/architecture/PDF-data-layer/index.html b/architecture/PDF-data-layer/index.html index c33767e6..ff32a8b8 100755 --- a/architecture/PDF-data-layer/index.html +++ b/architecture/PDF-data-layer/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}
Skip to content

PDF data layer

Data Layers

OpenContracts builds on the work that AllenAI did with PAWLs to create a consistent shared source of truth for data labeling and NLP algorithms, regardless of whether they are layout-aware (like LayoutLM) or not (like BERT, spaCy or LexNLP). One of the challenges with natural language documents, particularly contracts, is that there are so many ways to structure any given file (e.g. .docx or .pdf) to represent exactly the same text. Even an identical document with identical formatting in a format like .pdf can have a significantly different file structure depending on what software was used to create it, the user's choices, and the software's own choices in deciding how to structure its output.

PAWLs and OpenContracts attempt to solve this by sending every document through a processing pipeline that provides a uniform and consistent way of extracting and structuring text and layout information. Using the parsing engine of Grobid and the open-source OCR engine Tesseract, every single document is re-OCRed (to produce a consistent output for the same inputs) and then the "tokens" (text surrounded on all sides by whitespace - typically a word) in the OCRed document are stored as JSONs with their page and positional information. In OpenContracts, we refer to this JSON layer that combines text and positional data as the "PAWLs" layer. We also use the PAWLs layer to build the full text extract from the document and store this as the "text layer".

Thus, in OpenContracts, every document has three files associated with it - the original pdf, a json file (the "PAWLs layer"), and a text file (the "text layer"). Because the text layer is built from the PAWLs layer, we can easily translate back and forth from text to positional information - e.g. given the start and end of a span of text in the text layer, we can accurately say which PAWLs tokens the span includes, and, based on that, the x,y position of the span in the document.

This lets us take the outputs of many NLP libraries which typically produce only start and stop ranges and layer them perfectly on top of the original pdf. With the PAWLs tokens as the source of truth, we can seamlessly transition from text only to layout-aware text.

Limitations

OCR is not perfect. By only accepting pdf inputs and OCRing every document, we do ignore any text embedded in the pdf. To the extent that text was exported accurately from whatever tool was used to write the document, this introduces some potential loss of fidelity - e.g. if you've ever seen an OCR engine mistake an 'O' for a '0' or an 'I' for a '1'. Typically, however, the incidence of such errors is fairly small, and it's a price we have to pay for the power of being able to effortlessly layer NLP outputs that have no layout awareness on top of complex, visual layouts.

Skip to content

PDF data layer

Data Layers

OpenContracts builds on the work that AllenAI did with PAWLs to create a consistent shared source of truth for data labeling and NLP algorithms, regardless of whether they are layout-aware (like LayoutLM) or not (like BERT, spaCy or LexNLP). One of the challenges with natural language documents, particularly contracts, is that there are so many ways to structure any given file (e.g. .docx or .pdf) to represent exactly the same text. Even an identical document with identical formatting in a format like .pdf can have a significantly different file structure depending on what software was used to create it, the user's choices, and the software's own choices in deciding how to structure its output.

PAWLs and OpenContracts attempt to solve this by sending every document through a processing pipeline that provides a uniform and consistent way of extracting and structuring text and layout information. Using the parsing engine of Grobid and the open-source OCR engine Tesseract, every single document is re-OCRed (to produce a consistent output for the same inputs) and then the "tokens" (text surrounded on all sides by whitespace - typically a word) in the OCRed document are stored as JSONs with their page and positional information. In OpenContracts, we refer to this JSON layer that combines text and positional data as the "PAWLs" layer. We also use the PAWLs layer to build the full text extract from the document and store this as the "text layer".
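To make this concrete, here is a minimal sketch (in Python) of what one page of a PAWLs-style token layer might look like. The field names are assumptions for illustration - consult the OpenContracts source for the actual schema:

# Illustrative only: an assumed shape for one page of a PAWLs-style token layer.
# Field names are hypothetical; check the OpenContracts source for the real schema.
example_pawls_page = {
    "page": {"index": 0, "width": 612.0, "height": 792.0},
    "tokens": [
        {"x": 72.0, "y": 96.5, "width": 38.2, "height": 11.0, "text": "This"},
        {"x": 113.4, "y": 96.5, "width": 74.9, "height": 11.0, "text": "Agreement"},
        # ... one entry per whitespace-delimited token on the page
    ],
}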

Thus, in OpenContracts, every document has three files associated with it - the original pdf, a json file (the "PAWLs layer"), and a text file (the "text layer"). Because the text layer is built from the PAWLs layer, we can easily translate back and forth from text to positional information - e.g. given the start and end of a span of text in the text layer, we can accurately say which PAWLs tokens the span includes, and, based on that, the x,y position of the span in the document.

This lets us take the outputs of many NLP libraries which typically produce only start and stop ranges and layer them perfectly on top of the original pdf. With the PAWLs tokens as the source of truth, we can seamlessly transition from text only to layout-aware text.
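As a rough illustration of that translation, the sketch below walks the token layer and collects every token that overlaps a character span from the text layer. It assumes the text layer was built by joining token texts with single spaces, which may not match OpenContracts' actual reconstruction rules:

def tokens_for_span(pages, start, end):
    """Return (page_index, token_index) pairs overlapping the span [start, end).

    Illustrative sketch only - assumes the text layer joins token texts with
    single spaces, page after page; this is not OpenContracts' actual code.
    """
    hits = []
    offset = 0
    for page_idx, page in enumerate(pages):
        for tok_idx, token in enumerate(page["tokens"]):
            tok_start, tok_end = offset, offset + len(token["text"])
            if tok_start < end and tok_end > start:  # spans overlap
                hits.append((page_idx, tok_idx))
            offset = tok_end + 1  # account for the joining space
    return hits

With the matching tokens in hand, each token's x, y, width and height values give the span's position on the page.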

Limitations

OCR is not perfect. By only accepting pdf inputs and OCRing every document, we do ignore any text embedded in the pdf. To the extent that text was exported accurately from whatever tool was used to write the document, this introduces some potential loss of fidelity - e.g. if you've ever seen an OCR engine mistake an 'O' for a '0' or an 'I' for a '1'. Typically, however, the incidence of such errors is fairly small, and it's a price we have to pay for the power of being able to effortlessly layer NLP outputs that have no layout awareness on top of complex, visual layouts.

\ No newline at end of file diff --git a/architecture/asynchronous-processing/index.html b/architecture/asynchronous-processing/index.html index 683237cd..ea9a1e96 100755 --- a/architecture/asynchronous-processing/index.html +++ b/architecture/asynchronous-processing/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}
Skip to content

Asynchronous Processing

Asynchronous Tasks

OpenContracts makes extensive use of Celery, a powerful, mature Python framework for distributed and asynchronous processing. Out-of-the-box, dedicated celery workers are configured in the docker compose stack to handle computationally-intensive and long-running tasks like parsing documents, applying annotations to pdfs, creating exports, importing exports, and more.

What if my celery queue gets clogged?

We are always working to make OpenContracts more fault-tolerant and stable. That said, due to the nature of the types of documents we're working with - pdfs - there is tremendous variation in what the parsers have to parse. Some documents are extremely long - thousands of pages or more - whereas other documents may have poor formatting, no text layers, etc. In most cases, OpenContracts should be able to process the pdfs and make them compatible with our annotation tools. Sometimes, however, either due to unexpected issues or an unexpected volume of documents, you may want to purge the queue of tasks to be processed by your celery workers. To do this, type:

sudo docker-compose -f local.yml run django celery -A config.celery_app purge
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Asynchronous Processing

Asynchronous Tasks

OpenContracts makes extensive use of Celery, a powerful, mature Python framework for distributed and asynchronous processing. Out-of-the-box, dedicated celery workers are configured in the docker compose stack to handle computationally-intensive and long-running tasks like parsing documents, applying annotations to pdfs, creating exports, importing exports, and more.

What if my celery queue gets clogged?

We are always working to make OpenContracts more fault-tolerant and stable. That said, due to the nature of the types of documents we're working with - pdfs - there is tremendous variation in what the parsers have to parse. Some documents are extremely long - thousands of pages or more - whereas other documents may have poor formatting, no text layers, etc. In most cases, OpenContracts should be able to process the pdfs and make them compatible with our annotation tools. Sometimes, however, either due to unexpected issues or an unexpected volume of documents, you may want to purge the queue of tasks to be processed by your celery workers. To do this, type:

sudo docker-compose -f local.yml run django celery -A config.celery_app purge
 

Be aware that this can cause some undesired effects for your users. For example, every time a new document is uploaded, a Django signal kicks off the pdf preprocessor to produce the PAWLs token layer that is later annotated. If these tasks are in-queue and the queue is purged, you'll have documents that are not annotatable as they'll lack the PAWLs token layers. In such cases, we recommend you delete and re-upload the documents. There are ways to manually reprocess the pdfs, but we don't have a user-friendly way to do this yet.
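For reference, the mechanism described above is the standard Django post_save signal pattern: a handler enqueues a Celery task when a new Document row is created. The model and task names below are placeholders rather than the actual OpenContracts identifiers:

# Hypothetical sketch of the signal-to-Celery pattern described above.
# Model and task names are placeholders, not OpenContracts' actual identifiers.
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Document          # placeholder model
from myapp.tasks import process_pdf_task   # placeholder Celery task

@receiver(post_save, sender=Document)
def kick_off_pdf_preprocessing(sender, instance, created, **kwargs):
    if created:
        # Enqueue PAWLs token extraction. If the queue is purged before this
        # runs, the document will lack a token layer and won't be annotatable.
        process_pdf_task.delay(instance.pk)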

\ No newline at end of file diff --git a/architecture/components/Data-flow-diagram/index.html b/architecture/components/Data-flow-diagram/index.html index 54e7c03c..12ccda8a 100755 --- a/architecture/components/Data-flow-diagram/index.html +++ b/architecture/components/Data-flow-diagram/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Container Architecture & Data Flow

You'll notice that we have a number of containers in our docker compose file (note that local.yml is up to date; the production file needs some work to be production-grade, and we may switch to Tilt).

Here, you can see how these containers relate to some of the core data elements powering the application - such as parsing structural and layout annotations from PDFs (which powers the vector store) and generating vector embeddings.

PNG Diagram

Diagram

Mermaid Version

graph TB
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Container Architecture & Data Flow

You'll notice that we have a number of containers in our docker compose file (note that local.yml is up to date; the production file needs some work to be production-grade, and we may switch to Tilt).

Here, you can see how these containers relate to some of the core data elements powering the application - such as parsing structural and layout annotations from PDFs (which powers the vector store) and generating vector embeddings.

PNG Diagram

Diagram

Mermaid Version

graph TB
     subgraph "Docker Compose Environment"
         direction TB
         django[Django]
diff --git a/architecture/components/annotator/how-annotations-are-created/index.html b/architecture/components/annotator/how-annotations-are-created/index.html
index 1962b0bd..c4cbbb71 100755
--- a/architecture/components/annotator/how-annotations-are-created/index.html
+++ b/architecture/components/annotator/how-annotations-are-created/index.html
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

How Annotations are Handled

Overview

Here's a step-by-step explanation of the flow:

  1. The user selects text on the PDF by clicking and dragging the mouse. This triggers a mouse event in the Page component.
  2. The Page component checks if the Shift key is pressed.
  3. If the Shift key is not pressed, it creates a new selection and sets the selection state in the AnnotationStore.
  4. If the Shift key is pressed, it adds the selection to the selection queue in the AnnotationStore.
  5. The AnnotationStore updates its internal state with the new selection or the updated selection queue.
  6. If the Shift key is released, the Page component triggers the creation of a multi-page annotation. If the Shift key is still pressed, it waits for the next user action.
  7. To create a multi-page annotation, the Page component combines the selections from the queue.
  8. The Page component retrieves the annotation data from the PDFPageInfo object for each selected page.
  9. The Page component creates a ServerAnnotation object with the combined annotation data.
  10. The Page component calls the createAnnotation function in the AnnotationStore, passing the ServerAnnotation object.
  11. The AnnotationStore invokes the requestCreateAnnotation function in the Annotator component.
  12. The Annotator component sends a mutation to the server to create the annotation.
  13. If the server responds with success, the Annotator component updates the local state with the new annotation. If there's an error, it displays an error message.
  14. The updated annotations trigger a re-render of the relevant components, reflecting the newly created annotation on the PDF.

Flowchart

graph TD
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

How Annotations are Handled

Overview

Here's a step-by-step explanation of the flow:

  1. The user selects text on the PDF by clicking and dragging the mouse. This triggers a mouse event in the Page component.
  2. The Page component checks if the Shift key is pressed.
  3. If the Shift key is not pressed, it creates a new selection and sets the selection state in the AnnotationStore.
  4. If the Shift key is pressed, it adds the selection to the selection queue in the AnnotationStore.
  5. The AnnotationStore updates its internal state with the new selection or the updated selection queue.
  6. If the Shift key is released, the Page component triggers the creation of a multi-page annotation. If the Shift key is still pressed, it waits for the next user action.
  7. To create a multi-page annotation, the Page component combines the selections from the queue.
  8. The Page component retrieves the annotation data from the PDFPageInfo object for each selected page.
  9. The Page component creates a ServerAnnotation object with the combined annotation data.
  10. The Page component calls the createAnnotation function in the AnnotationStore, passing the ServerAnnotation object.
  11. The AnnotationStore invokes the requestCreateAnnotation function in the Annotator component.
  12. The Annotator component sends a mutation to the server to create the annotation.
  13. If the server responds with success, the Annotator component updates the local state with the new annotation. If there's an error, it displays an error message.
  14. The updated annotations trigger a re-render of the relevant components, reflecting the newly created annotation on the PDF.

Flowchart

graph TD
     A[User selects text on the PDF] -->|Mouse event| B(Page component)
     B --> C{Is Shift key pressed?}
     C -->|No| D[Create new selection]
diff --git a/architecture/components/annotator/overview/index.html b/architecture/components/annotator/overview/index.html
index 3f4d212f..9531b147 100755
--- a/architecture/components/annotator/overview/index.html
+++ b/architecture/components/annotator/overview/index.html
@@ -7,6 +7,6 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Open Contracts Annotator Components

Key Questions

  1. How is the PDF loaded?
  • The PDF is loaded in the Annotator.tsx component.
  • Inside the useEffect hook that runs when the openedDocument prop changes, the PDF loading process is initiated.
  • The pdfjsLib.getDocument function from the pdfjs-dist library is used to load the PDF file specified by openedDocument.pdfFile.
  • The loading progress is tracked using the loadingTask.onProgress callback, which updates the progress state.
  • Once the PDF is loaded, the loadingTask.promise is resolved, and the PDFDocumentProxy object is obtained.
  • The PDFPageInfo objects are created for each page of the PDF using doc.getPage(i) and stored in the pages state.

  2. Where and how are annotations loaded?
  • Annotations are loaded using the REQUEST_ANNOTATOR_DATA_FOR_DOCUMENT GraphQL query in the Annotator.tsx component.
  • The useQuery hook from Apollo Client is used to fetch the annotator data based on the provided initial_query_vars.
  • The annotator_data received from the query contains information about existing text annotations, document label annotations, and relationships.
  • The annotations are transformed into ServerAnnotation, DocTypeAnnotation, and RelationGroup objects and stored in the pdfAnnotations state using setPdfAnnotations.

  3. Where is the PAWLs layer loaded?
  • The PAWLs layer is loaded in the Annotator.tsx component.
  • Inside the useEffect hook that runs when the openedDocument prop changes, the PAWLs layer is loaded using the getPawlsLayer function from api/rest.ts.
  • The getPawlsLayer function makes an HTTP GET request to fetch the PAWLs data file specified by openedDocument.pawlsParseFile.
  • The PAWLs data is expected to be an array of PageTokens objects, which contain token information for each page of the PDF.
  • The loaded PAWLs data is then used to create PDFPageInfo objects for each page, which include the page tokens.

High-level Components Overview

  • The Annotator component is the top-level component that manages the state and data loading for the annotator.
  • It renders the PDFView component, which is responsible for displaying the PDF and annotations.
  • The PDFView component renders various sub-components, such as LabelSelector, DocTypeLabelDisplay, AnnotatorSidebar, AnnotatorTopbar, and PDF.
  • The PDF component renders individual Page components for each page of the PDF.
  • Each Page component renders Selection and SearchResult components for annotations and search results, respectively.
  • The AnnotatorSidebar component displays the list of annotations, relations, and a search widget.
  • The PDFStore and AnnotationStore are context providers that hold the PDF and annotation data, respectively.

Specific Component Deep Dives

PDFView.tsx

The PDFView component is a top-level component that renders the PDF document with annotations, relations, and text search capabilities. It manages the state and functionality related to annotations, relations, and user interactions. Here's a detailed explanation of how the component works:

  1. The PDFView component receives several props, including permissions, callbacks for CRUD operations on annotations and relations, refs for container and selection elements, and various configuration options.

  2. It initializes several state variables using the useState hook, including:

  3. selectionElementRefs and searchResultElementRefs: Refs for annotation selections and search results.
  4. pageElementRefs: Refs for individual PDF pages.
  5. scrollContainerRef: Ref for the scroll container.
  6. textSearchMatches and searchText: State for text search matches and search text.
  7. selectedAnnotations and selectedRelations: State for currently selected annotations and relations.
  8. pageSelection and pageSelectionQueue: State for current page selection and queued selections.
  9. pdfPageInfoObjs: State for PDF page information objects.
  10. Various other state variables for active labels, relation modal visibility, and annotation options.

  11. The component defines several functions for updating state and handling user interactions, such as:

  12. insertSelectionElementRef, insertSearchResultElementRefs, and insertPageRef: Functions to add refs for selections, search results, and pages.
  13. onError: Error handling callback.
  14. advanceTextSearchMatch and reverseTextSearchMatch: Functions to navigate through text search matches.
  15. onRelationModalOk and onRelationModalCancel: Callbacks for relation modal actions.
  16. createMultiPageAnnotation: Function to create a multi-page annotation from queued selections.

  17. The component uses the useEffect hook to handle side effects, such as:

  18. Setting the scroll container ref on load.
  19. Listening for changes in the shift key and triggering annotation creation.
  20. Updating text search matches when the search text changes.

  21. The component renders the PDF document and its related components using the PDFStore and AnnotationStore contexts:

  22. The PDFStore context provides the PDF document, pages, and error handling.
  23. The AnnotationStore context provides annotation-related state and functions.

  24. The component renders the following main sections:

  25. LabelSelector: Allows the user to select the active label for annotations.
  26. DocTypeLabelDisplay: Displays the document type labels.
  27. AnnotatorSidebar: Sidebar component for managing annotations and relations.
  28. AnnotatorTopbar: Top bar component for additional controls and options.
  29. PDF: The actual PDF component that renders the PDF pages and annotations.

  30. The PDF component, defined in PDF.tsx, is responsible for rendering the PDF pages and annotations. It receives props from the PDFView component, such as permissions, configuration options, and callbacks.

  31. The PDF component maps over each page of the PDF document and renders a Page component for each page, passing the necessary props.

  32. The Page component, also defined in PDF.tsx, is responsible for rendering a single page of the PDF document along with its annotations and search results. It handles mouse events for creating and modifying annotations.

  33. The PDFView component also renders the RelationModal component when the active relation label is set and the user has the necessary permissions. The modal allows the user to create or modify relations between annotations.

PDF.tsx

The PDF component renders the actual PDF document with annotations and text search capabilities. PDFView (see above) is what actually interacts with the backend API.

  1. The PDF component receives several props:
  2. shiftDown: Indicates whether the Shift key is pressed (optional).
  3. doc_permissions and corpus_permissions: Specify the permissions for the document and corpus, respectively.
  4. read_only: Determines if the component is in read-only mode.
  5. show_selected_annotation_only: Specifies whether to show only the selected annotation.
  6. show_annotation_bounding_boxes: Specifies whether to show annotation bounding boxes.
  7. show_annotation_labels: Specifies the behavior for displaying annotation labels.
  8. setJumpedToAnnotationOnLoad: A callback function to set the jumped-to annotation on load.
  9. The PDF component retrieves the PDF document and pages from the PDFStore context.
  10. It maps over each page of the PDF document and renders a Page component for each page, passing the necessary props.
  11. The Page component is responsible for rendering a single page of the PDF document along with its annotations and search results.
  12. Inside the Page component:
  13. It creates a canvas element using the useRef hook to render the PDF page.
  14. It retrieves the annotations for the current page from the AnnotationStore context.
  15. It defines a ConvertBoundsToSelections function that converts the selected bounds to annotations and tokens.
  16. It uses the useEffect hook to set up the PDF page rendering and event listeners for resizing and scrolling.
  17. It renders the PDF page canvas, annotations, search results, and queued selections.
  18. The Page component renders the following sub-components:
  19. PageAnnotationsContainer: A styled container for the page annotations.
  20. PageCanvas: A styled canvas element for rendering the PDF page.
  21. Selection: Represents a single annotation selection on the page.
  22. SearchResult: Represents a search result on the page.
  23. The Page component handles mouse events for creating and modifying annotations:
  24. On mouseDown, it initializes the selection if the necessary permissions are granted and the component is not in read-only mode.
  25. On mouseMove, it updates the selection bounds if a selection is active.
  26. On mouseUp, it adds the completed selection to the pageSelectionQueue and triggers the creation of a multi-page annotation if the Shift key is not pressed.
  27. The Page component also handles fetching more annotations for previous and next pages using the FetchMoreOnVisible component.
  28. The SelectionBoundary and SelectionTokens components are used to render the annotation boundaries and tokens, respectively.
  29. The PDFPageRenderer class is responsible for rendering a single PDF page on the canvas. It manages the rendering tasks and provides methods for canceling and rescaling the rendering.
  30. The getPageBoundsFromCanvas function calculates the bounding box of the page based on the canvas dimensions and its parent container.

Open Contracts Annotator Components

Key Questions

  1. How is the PDF loaded?
  • The PDF is loaded in the Annotator.tsx component.
  • Inside the useEffect hook that runs when the openedDocument prop changes, the PDF loading process is initiated.
  • The pdfjsLib.getDocument function from the pdfjs-dist library is used to load the PDF file specified by openedDocument.pdfFile.
  • The loading progress is tracked using the loadingTask.onProgress callback, which updates the progress state.
  • Once the PDF is loaded, the loadingTask.promise is resolved, and the PDFDocumentProxy object is obtained.
  • The PDFPageInfo objects are created for each page of the PDF using doc.getPage(i) and stored in the pages state.

  2. Where and how are annotations loaded?
  • Annotations are loaded using the REQUEST_ANNOTATOR_DATA_FOR_DOCUMENT GraphQL query in the Annotator.tsx component.
  • The useQuery hook from Apollo Client is used to fetch the annotator data based on the provided initial_query_vars.
  • The annotator_data received from the query contains information about existing text annotations, document label annotations, and relationships.
  • The annotations are transformed into ServerAnnotation, DocTypeAnnotation, and RelationGroup objects and stored in the pdfAnnotations state using setPdfAnnotations.

  3. Where is the PAWLs layer loaded?
  • The PAWLs layer is loaded in the Annotator.tsx component.
  • Inside the useEffect hook that runs when the openedDocument prop changes, the PAWLs layer is loaded using the getPawlsLayer function from api/rest.ts.
  • The getPawlsLayer function makes an HTTP GET request to fetch the PAWLs data file specified by openedDocument.pawlsParseFile.
  • The PAWLs data is expected to be an array of PageTokens objects, which contain token information for each page of the PDF.
  • The loaded PAWLs data is then used to create PDFPageInfo objects for each page, which include the page tokens.

High-level Components Overview

  • The Annotator component is the top-level component that manages the state and data loading for the annotator.
  • It renders the PDFView component, which is responsible for displaying the PDF and annotations.
  • The PDFView component renders various sub-components, such as LabelSelector, DocTypeLabelDisplay, AnnotatorSidebar, AnnotatorTopbar, and PDF.
  • The PDF component renders individual Page components for each page of the PDF.
  • Each Page component renders Selection and SearchResult components for annotations and search results, respectively.
  • The AnnotatorSidebar component displays the list of annotations, relations, and a search widget.
  • The PDFStore and AnnotationStore are context providers that hold the PDF and annotation data, respectively.

Specific Component Deep Dives

PDFView.tsx

The PDFView component is a top-level component that renders the PDF document with annotations, relations, and text search capabilities. It manages the state and functionality related to annotations, relations, and user interactions. Here's a detailed explanation of how the component works:

  1. The PDFView component receives several props, including permissions, callbacks for CRUD operations on annotations and relations, refs for container and selection elements, and various configuration options.

  2. It initializes several state variables using the useState hook, including:

  3. selectionElementRefs and searchResultElementRefs: Refs for annotation selections and search results.
  4. pageElementRefs: Refs for individual PDF pages.
  5. scrollContainerRef: Ref for the scroll container.
  6. textSearchMatches and searchText: State for text search matches and search text.
  7. selectedAnnotations and selectedRelations: State for currently selected annotations and relations.
  8. pageSelection and pageSelectionQueue: State for current page selection and queued selections.
  9. pdfPageInfoObjs: State for PDF page information objects.
  10. Various other state variables for active labels, relation modal visibility, and annotation options.

  11. The component defines several functions for updating state and handling user interactions, such as:

  12. insertSelectionElementRef, insertSearchResultElementRefs, and insertPageRef: Functions to add refs for selections, search results, and pages.
  13. onError: Error handling callback.
  14. advanceTextSearchMatch and reverseTextSearchMatch: Functions to navigate through text search matches.
  15. onRelationModalOk and onRelationModalCancel: Callbacks for relation modal actions.
  16. createMultiPageAnnotation: Function to create a multi-page annotation from queued selections.

  17. The component uses the useEffect hook to handle side effects, such as:

  18. Setting the scroll container ref on load.
  19. Listening for changes in the shift key and triggering annotation creation.
  20. Updating text search matches when the search text changes.

  21. The component renders the PDF document and its related components using the PDFStore and AnnotationStore contexts:

  22. The PDFStore context provides the PDF document, pages, and error handling.
  23. The AnnotationStore context provides annotation-related state and functions.

  24. The component renders the following main sections:

  25. LabelSelector: Allows the user to select the active label for annotations.
  26. DocTypeLabelDisplay: Displays the document type labels.
  27. AnnotatorSidebar: Sidebar component for managing annotations and relations.
  28. AnnotatorTopbar: Top bar component for additional controls and options.
  29. PDF: The actual PDF component that renders the PDF pages and annotations.

  30. The PDF component, defined in PDF.tsx, is responsible for rendering the PDF pages and annotations. It receives props from the PDFView component, such as permissions, configuration options, and callbacks.

  31. The PDF component maps over each page of the PDF document and renders a Page component for each page, passing the necessary props.

  32. The Page component, also defined in PDF.tsx, is responsible for rendering a single page of the PDF document along with its annotations and search results. It handles mouse events for creating and modifying annotations.

  33. The PDFView component also renders the RelationModal component when the active relation label is set and the user has the necessary permissions. The modal allows the user to create or modify relations between annotations.

PDF.tsx

The PDF component renders the actual PDF document with annotations and text search capabilities. PDFView (see above) is what actually interacts with the backend API.

  1. The PDF component receives several props:
  2. shiftDown: Indicates whether the Shift key is pressed (optional).
  3. doc_permissions and corpus_permissions: Specify the permissions for the document and corpus, respectively.
  4. read_only: Determines if the component is in read-only mode.
  5. show_selected_annotation_only: Specifies whether to show only the selected annotation.
  6. show_annotation_bounding_boxes: Specifies whether to show annotation bounding boxes.
  7. show_annotation_labels: Specifies the behavior for displaying annotation labels.
  8. setJumpedToAnnotationOnLoad: A callback function to set the jumped-to annotation on load.
  9. The PDF component retrieves the PDF document and pages from the PDFStore context.
  10. It maps over each page of the PDF document and renders a Page component for each page, passing the necessary props.
  11. The Page component is responsible for rendering a single page of the PDF document along with its annotations and search results.
  12. Inside the Page component:
  13. It creates a canvas element using the useRef hook to render the PDF page.
  14. It retrieves the annotations for the current page from the AnnotationStore context.
  15. It defines a ConvertBoundsToSelections function that converts the selected bounds to annotations and tokens.
  16. It uses the useEffect hook to set up the PDF page rendering and event listeners for resizing and scrolling.
  17. It renders the PDF page canvas, annotations, search results, and queued selections.
  18. The Page component renders the following sub-components:
  19. PageAnnotationsContainer: A styled container for the page annotations.
  20. PageCanvas: A styled canvas element for rendering the PDF page.
  21. Selection: Represents a single annotation selection on the page.
  22. SearchResult: Represents a search result on the page.
  23. The Page component handles mouse events for creating and modifying annotations:
  24. On mouseDown, it initializes the selection if the necessary permissions are granted and the component is not in read-only mode.
  25. On mouseMove, it updates the selection bounds if a selection is active.
  26. On mouseUp, it adds the completed selection to the pageSelectionQueue and triggers the creation of a multi-page annotation if the Shift key is not pressed.
  27. The Page component also handles fetching more annotations for previous and next pages using the FetchMoreOnVisible component.
  28. The SelectionBoundary and SelectionTokens components are used to render the annotation boundaries and tokens, respectively.
  29. The PDFPageRenderer class is responsible for rendering a single PDF page on the canvas. It manages the rendering tasks and provides methods for canceling and rescaling the rendering.
  30. The getPageBoundsFromCanvas function calculates the bounding box of the page based on the canvas dimensions and its parent container.
\ No newline at end of file diff --git a/architecture/opencontract-corpus-actions/index.html b/architecture/opencontract-corpus-actions/index.html index bce1d27b..5c92bd83 100755 --- a/architecture/opencontract-corpus-actions/index.html +++ b/architecture/opencontract-corpus-actions/index.html @@ -1,4 +1,4 @@ - Automated Tests - OpenContracts

CorpusAction System in OpenContracts: Revised Explanation

The CorpusAction system in OpenContracts automates document processing when new documents are added to a corpus. This system is designed to be flexible, allowing for different types of actions to be triggered based on the configuration.

Within this system, users have three options for registering actions to run automatically on new documents:

  1. Custom data extractors
  2. Analyzer microservices
  3. Celery tasks decorated with @doc_analyzer_task (a "task-based Analyzer")

The @doc_analyzer_task decorator is specifically designed for the third option, providing a straightforward way to implement simple, span-based analytics directly within the OpenContracts ecosystem.

Action Execution Overview

The following flowchart illustrates the CorpusAction system in OpenContracts, demonstrating the process that occurs when a new document is added to a corpus. This automated workflow begins with the addition of a document, which triggers a Django signal. The signal is then handled, leading to the processing of the corpus action. At this point, the system checks the type of CorpusAction configured for the corpus. Depending on this configuration, one of three paths is taken: running an Extract with a Fieldset, executing an Analysis with a doc_analyzer_task, or submitting an Analysis to a Gremlin Engine. This diagram provides a clear visual representation of how the CorpusAction system automates document processing based on predefined rules, enabling efficient and flexible handling of new documents within the OpenContracts platform.

graph TD
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

CorpusAction System in OpenContracts: Revised Explanation

The CorpusAction system in OpenContracts automates document processing when new documents are added to a corpus. This system is designed to be flexible, allowing for different types of actions to be triggered based on the configuration.

Within this system, users have three options for registering actions to run automatically on new documents:

  1. Custom data extractors
  2. Analyzer microservices
  3. Celery tasks decorated with @doc_analyzer_task (a "task-based Analyzer")

The @doc_analyzer_task decorator is specifically designed for the third option, providing a straightforward way to implement simple, span-based analytics directly within the OpenContracts ecosystem.
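As a rough sketch of what a task-based Analyzer can look like, the example below wraps an ordinary function with the decorator and returns simple span annotations. The import path, call signature, and return format shown here are assumptions for illustration only - check the OpenContracts source for the real contract:

# Illustrative sketch only. The decorator's import path, the arguments it passes,
# and the return format it expects are assumptions, not the verified API.
from opencontractserver.shared.decorators import doc_analyzer_task  # assumed path

@doc_analyzer_task()
def flag_termination_language(*args, pdf_text_extract=None, **kwargs):
    """Toy span-based analyzer: flag each occurrence of the word 'termination'."""
    text = (pdf_text_extract or "").lower()
    spans = []
    idx = text.find("termination")
    while idx != -1:
        spans.append({"label": "TERMINATION", "start": idx, "end": idx + len("termination")})
        idx = text.find("termination", idx + 1)
    return spans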

Action Execution Overview

The following flowchart illustrates the CorpusAction system in OpenContracts, demonstrating the process that occurs when a new document is added to a corpus. This automated workflow begins with the addition of a document, which triggers a Django signal. The signal is then handled, leading to the processing of the corpus action. At this point, the system checks the type of CorpusAction configured for the corpus. Depending on this configuration, one of three paths is taken: running an Extract with a Fieldset, executing an Analysis with a doc_analyzer_task, or submitting an Analysis to a Gremlin Engine. This diagram provides a clear visual representation of how the CorpusAction system automates document processing based on predefined rules, enabling efficient and flexible handling of new documents within the OpenContracts platform.

graph TD
     A[Document Added to Corpus] -->|Triggers| B[Django Signal]
     B --> C[Handle Document Added Signal]
     C --> D[Process Corpus Action]
diff --git a/configuration/add-users/index.html b/configuration/add-users/index.html
index 43225604..21310bee 100755
--- a/configuration/add-users/index.html
+++ b/configuration/add-users/index.html
@@ -7,6 +7,6 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Add Users

Adding More Users

You can use the same User admin page described above to create new users. Alternatively, go back to the main admin page http://localhost:8000/admin and, under the User section, click the "+Add" button:

Then, follow the on-screen instructions:

When you're done, the username and password you provided can be used to login.

OpenContracts is currently not built to allow users to self-register unless you use Auth0 authentication. When managing users yourself, you'll need to add, remove and modify users via the admin panels.

Add Users

Adding More Users

You can use the same User admin page described above to create new users. Alternatively, go back to the main admin page http://localhost:8000/admin and, under the User section, click the "+Add" button:

Then, follow the on-screen instructions:

When you're done, the username and password you provided can be used to login.

OpenContracts is currently not built to allow users to self-register unless you use Auth0 authentication. When managing users yourself, you'll need to add, remove and modify users via the admin panels.

\ No newline at end of file diff --git a/configuration/choose-an-authentication-backend/index.html b/configuration/choose-an-authentication-backend/index.html index 67dcdc91..8acb8e8e 100755 --- a/configuration/choose-an-authentication-backend/index.html +++ b/configuration/choose-an-authentication-backend/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Configure Authentication Backend

Select Authentication System via Env Variables

For authentication and authorization, you have two choices: (1) you can configure an Auth0 account and use Auth0 to authenticate users, in which case anyone who is permitted to authenticate via your Auth0 setup can log in and automatically get an account, or (2) you can require a username and password for each user and let the OpenContracts backend provide user authentication and authorization. With the latter option, there is no currently-supported sign-up method; you'll need to use the admin dashboard (see the "Adding Users" section).

Auth0 Auth Setup

You need to configure three, separate applications on Auth0's platform:

  1. Configure the SPA as an application. You'll need the App Client ID.
  2. Configure the API. You'll need the API Audience.
  3. Configure an M2M application to access the Auth0 Management API. This is used to fetch user details. You'll need the API_ID for the M2M application and the Client Secret for the M2M app.

You'll also need your Auth0 tenant ID (assuming it's the same for all three applications, though you could, in theory, host them in different tenants). These directions are not comprehensive, so, if you're not familiar with Auth0, we recommend you disable Auth0 for the time being and use username and password.

To enable and configure Auth0 Authentication, you'll need to set the following env variables in your .env file (the .django file in .envs/.production or .envs/.local, depending on your target environment). Our sample .envs only show these fields in the .production sample, but you could use them in the .local env file too:

  1. USE_AUTH0 - set to true to enable Auth0
  2. AUTH0_CLIENT_ID - should be the client ID configured on Auth0
  3. AUTH0_API_AUDIENCE - Configured API audience
  4. AUTH0_DOMAIN - domain of your configured Auth0 application
  5. AUTH0_M2M_MANAGEMENT_API_SECRET - secret for the auth0 Machine to Machine (M2M) API
  6. AUTH0_M2M_MANAGEMENT_API_ID - ID for Auth0 Machine to Machine (M2M) API
  7. AUTH0_M2M_MANAGEMENT_GRANT_TYPE - set to client_credentials

Detailed Explanation of Auth0 Implementation

To get Auth0 to work nicely with Graphene, we modified the graphql_jwt backend to support syncing remote user metadata with a local user, similar to Django's default RemoteUserMiddleware. We're keeping the graphql_jwt graphene middleware in its entirety, as it fetches the token and then passes it along to the Django authentication backend. That Django backend is what we're modifying to decode the JWT token against Auth0 settings and then check whether a local user exists, creating one if not.

Here's the order of operations in the original Graphene backend provided by graphql_jwt:

  1. The backend's authenticate method is called from the graphene middleware via Django (from django.contrib.auth import authenticate)
  2. The token is retrieved via utils.get_credentials
  3. If the token is not None, get_user_by_token in the shortcuts module is called
    1. The "payload" is retrieved via utils.get_payload
    2. The user is requested via utils.get_user_by_payload
    3. The username is retrieved from the payload via auth0_settings.JWT_PAYLOAD_GET_USERNAME_HANDLER
    4. The user object is retrieved via auth0_settings.JWT_GET_USER_BY_NATURAL_KEY_HANDLER

We modified a couple of things:

  1. The decode method called in 3(a) needs to be modified to decode with Auth0 secrets and settings.
  2. get_user_by_payload needs to be modified in several ways:
    1. The user object must use RemoteUserMiddleware logic: if everything from Auth0 decodes properly, check whether a user with that e-mail exists and, if not, create it. Upon completion, try to sync the user data with Auth0.
    2. Return the created or retrieved user object, as the original method did.

Django-Based Authentication Setup

The only thing you need to do for this is toggle the two Auth0-related environment variables:

  1. For the backend environment, set USE_AUTH0=False in your environment (either via an environment variable file or directly in your environment via the console).
  2. For the frontend environment, set REACT_APP_USE_AUTH0=false in your environment (either via an environment variable file or directly in your environment via the console).

Note

As noted elsewhere, users cannot sign up on their own. You need to log into the admin dashboard - e.g. http://localhost:8000/admin - and add users manually.

Configure Authentication Backend

Select Authentication System via Env Variables

For authentication and authorization, you have two choices: (1) you can configure an Auth0 account and use Auth0 to authenticate users, in which case anyone who is permitted to authenticate via your Auth0 setup can log in and automatically get an account, or (2) you can require a username and password for each user and let the OpenContracts backend provide user authentication and authorization. With the latter option, there is no currently-supported sign-up method; you'll need to use the admin dashboard (see the "Adding Users" section).

Auth0 Auth Setup

You need to configure three, separate applications on Auth0's platform:

  1. Configure the SPA as an application. You'll need the App Client ID.
  2. Configure the API. You'll need the API Audience.
  3. Configure an M2M application to access the Auth0 Management API. This is used to fetch user details. You'll need the API_ID for the M2M application and the Client Secret for the M2M app.

You'll also need your Auth0 tenant ID (assuming it's the same for all three applications, though you could, in theory, host them in different tenants). These directions are not comprehensive, so, if you're not familiar with Auth0, we recommend you disable Auth0 for the time being and use username and password.

To enable and configure Auth0 Authentication, you'll need to set the following env variables in your .env file (the .django file in .envs/.production or .envs/.local, depending on your target environment). Our sample .envs only show these fields in the .production sample, but you could use them in the .local env file too:

  1. USE_AUTH0 - set to true to enable Auth0
  2. AUTH0_CLIENT_ID - should be the client ID configured on Auth0
  3. AUTH0_API_AUDIENCE - Configured API audience
  4. AUTH0_DOMAIN - domain of your configured Auth0 application
  5. AUTH0_M2M_MANAGEMENT_API_SECRET - secret for the auth0 Machine to Machine (M2M) API
  6. AUTH0_M2M_MANAGEMENT_API_ID - ID for Auth0 Machine to Machine (M2M) API
  7. AUTH0_M2M_MANAGEMENT_GRANT_TYPE - set to client_credentials

Detailed Explanation of Auth0 Implementation

To get Auth0 to work nicely with Graphene, we modified the graphql_jwt backend to support syncing remote user metadata with a local user, similar to Django's default RemoteUserMiddleware. We're keeping the graphql_jwt graphene middleware in its entirety, as it fetches the token and then passes it along to the Django authentication backend. That Django backend is what we're modifying to decode the JWT token against Auth0 settings and then check whether a local user exists, creating one if not.

Here's the order of operations in the original Graphene backend provided by graphql_jwt:

  1. The backend's authenticate method is called from the graphene middleware via Django (from django.contrib.auth import authenticate)
  2. The token is retrieved via utils.get_credentials
  3. If the token is not None, get_user_by_token in the shortcuts module is called
    1. The "payload" is retrieved via utils.get_payload
    2. The user is requested via utils.get_user_by_payload
    3. The username is retrieved from the payload via auth0_settings.JWT_PAYLOAD_GET_USERNAME_HANDLER
    4. The user object is retrieved via auth0_settings.JWT_GET_USER_BY_NATURAL_KEY_HANDLER

We modified a couple of things:

  1. The decode method called in 3(a) needs to be modified to decode with Auth0 secrets and settings.
  2. get_user_by_payload needs to be modified in several ways:
    1. The user object must use RemoteUserMiddleware logic: if everything from Auth0 decodes properly, check whether a user with that e-mail exists and, if not, create it. Upon completion, try to sync the user data with Auth0.
    2. Return the created or retrieved user object, as the original method did.
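Conceptually, the modified decode and user-lookup steps look something like the sketch below, which validates the token against the Auth0 tenant's JWKS endpoint using PyJWT and then applies RemoteUserMiddleware-style get-or-create logic. This is a simplified illustration with placeholder settings values, not the actual OpenContracts backend code:

# Simplified illustration of the modified decode / get_user_by_payload logic.
# Settings values are placeholders; this is not the actual OpenContracts backend.
import jwt
from django.contrib.auth import get_user_model

AUTH0_DOMAIN = "your-tenant.us.auth0.com"    # placeholder
AUTH0_API_AUDIENCE = "https://your-api/"     # placeholder
jwks_client = jwt.PyJWKClient(f"https://{AUTH0_DOMAIN}/.well-known/jwks.json")

def decode_auth0_token(token):
    # Fetch the tenant's signing key and validate the JWT against Auth0 settings.
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=AUTH0_API_AUDIENCE,
        issuer=f"https://{AUTH0_DOMAIN}/",
    )

def get_user_by_payload(payload):
    # RemoteUserMiddleware-style behavior: create the local user if it doesn't exist.
    User = get_user_model()
    user, _created = User.objects.get_or_create(
        username=payload["sub"],
        defaults={"email": payload.get("email", "")},
    )
    return user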

Django-Based Authentication Setup

The only thing you need to do for this is toggle the two Auth0-related environment variables:

  1. For the backend environment, set USE_AUTH0=False in your environment (either via an environment variable file or directly in your environment via the console).
  2. For the frontend environment, set REACT_APP_USE_AUTH0=false in your environment (either via an environment variable file or directly in your environment via the console).

Note

As noted elsewhere, users cannot sign up on their own. You need to log into the admin dashboard - e.g. http://localhost:8000/admin - and add users manually.

\ No newline at end of file diff --git a/configuration/choose-and-configure-docker-stack/index.html b/configuration/choose-and-configure-docker-stack/index.html index 1981f6ab..b08a93c9 100755 --- a/configuration/choose-and-configure-docker-stack/index.html +++ b/configuration/choose-and-configure-docker-stack/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Choose and Configure Docker Compose Stack

Deployment Options

OpenContracts is designed to be deployed using docker-compose. You can run it locally or in a production environment. Follow the instructions below for a local environment if you just want to test it or you want to use it for yourself and don't intend to make the application available to other users via the Internet.

Local Deployment

Quick Start with Default Settings

A "local" deployment is deployed on your personal computer and is not meant to be accessed over the Internet. If you don't need to configure anything, just follow the quick start guide above to get up and running with a local deployment without needing any further configuration.

Setup .env Files

Backend

After cloning this repo to a machine of your choice, create a folder for your environment files in the repo root. You'll need ./.envs/.local/.django and ./.envs/.local/.postgres. Use the samples in ./documentation/sample_env_files/local as guidance. NOTE: you'll need to replace the placeholder passwords and users where noted, but otherwise minimal config should be required.

Frontend

In the ./frontend folder, you also need to create a single .env file which holds the configuration for your login method as well as certain feature switches (e.g. turning off imports). We've included a sample using Auth0 and another sample using Django's auth backend. Local and production deployments are essentially the same, but the root URL of the backend will change from localhost to wherever you're hosting the application in production.

Build the Stack

Once your .env files are set up, build the stack using docker-compose:

$ docker-compose -f local.yml build

Then, run migrations (to set up the database):

$ docker-compose -f local.yml run django python manage.py migrate

Then, create a superuser account that can log in to the admin dashboard (in a local deployment this is available at http://localhost:8000/admin) by typing this command and following the prompts:

$ docker-compose -f local.yml run django python manage.py createsuperuser
 

Finally, bring up the stack:

$ docker-compose -f local.yml up
 

You should now be able to access the OpenContracts frontend by visiting http://localhost:3000.

Production Environment

The production environment is designed to be public-facing and exposed to the Internet, so quite a few more configurations are required than for a local deployment, particularly if you use an AWS S3 storage backend or the Auth0 authentication system.

After cloning this repo to a machine of your choice, configure the production .env files as described above.

You'll also need to configure your website URL. This needs to be done in a few places.

First, in opencontractserver/contrib/migrations, you'll find a file called 0003_set_site_domain_and_name.py. BEFORE running any of your migrations, you should modify the domain and name defaults you'll find in update_site_forward:

def update_site_forward(apps, schema_editor):
    """Set site domain and name."""
    Site = apps.get_model("sites", "Site")
    Site.objects.update_or_create(
        id=settings.SITE_ID,
        defaults={
            "domain": "opencontracts.opensource.legal",
            "name": "OpenContractServer",
        },
    )

Configure Storage Backend

Select and Setup Storage Backend

You can use Amazon S3 as a file storage backend (if you set the env flag USE_AWS=True, more on that below), or you can use the local storage of the host machine via a Docker volume.

AWS Storage Backend

If you want to use AWS S3 to store files (primarily PDFs, but also exports, tokens and txt files), you will need an Amazon AWS account to set up S3. This README does not cover the AWS side of configuration, but there are a number of tutorials and guides on configuring AWS for use with a Django project.

Once you have an S3 bucket configured, you'll need to set the following env variables in your .env file (the .django file in .envs/.production or .envs/.local, depending on your target environment). Our sample .envs only show these fields in the .production samples, but you could use them in the .local env file too.

Here are the variables you need to set to enable AWS S3 storage:

  1. USE_AWS - set to true to use AWS; otherwise, the backend will use a Docker volume for storage.
  2. AWS_ACCESS_KEY_ID - the access key ID created by AWS when you set up your IAM user (see tutorials above).
  3. AWS_SECRET_ACCESS_KEY - the secret access key created by AWS when you set up your IAM user (see tutorials above)
  4. AWS_STORAGE_BUCKET_NAME - the name of the AWS bucket you created to hold the files.
  5. AWS_S3_REGION_NAME - the region of the AWS bucket you configured.

Django Storage Backend

Setting USE_AWS=false will use the disk space in the Django container. When using the local Docker Compose stack, the Celery workers and Django containers share the same disk, so this works fine. Our production configuration would not work properly with USE_AWS=false, however, as each container has its own disk.
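
For intuition, here is a minimal sketch of how a Django settings module can switch file storage on this flag. It assumes django-storages for the S3 backend; the MEDIA_ROOT path is a hypothetical placeholder, and this is not necessarily the repository's exact settings code.

import os

# Sketch only: pick Django's default file storage based on USE_AWS.
if os.environ.get("USE_AWS", "false").lower() == "true":
    DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
    AWS_ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]
    AWS_SECRET_ACCESS_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
    AWS_STORAGE_BUCKET_NAME = os.environ["AWS_STORAGE_BUCKET_NAME"]
    AWS_S3_REGION_NAME = os.environ["AWS_S3_REGION_NAME"]
else:
    # Files land on the container's own disk (a Docker volume in the local stack).
    DEFAULT_FILE_STORAGE = "django.core.files.storage.FileSystemStorage"
    MEDIA_ROOT = "/app/media"  # hypothetical path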


Configure Admin Users

Gremlin Admin Dashboard

Gremlin's backend is built on Django, which has its own powerful admin dashboard. This dashboard is not meant for end-users and should only be used by admins. You can access the admin dashboard by going to the /admin page - e.g. opencontracts.opensource.legal/admin or http://localhost:8000/admin. For the most part, you shouldn't need to use the admin dashboard and should only go in here if you're experiencing errors or unexpected behavior and want to look at the detailed contents of the database to see if it sheds any light on what's happening with a given corpus, document, etc.

By default, Gremlin creates an admin user for you. If you don't specify the username and password in your environment on first boot, it'll use system defaults. You can customize the default username and password via environment variables or after the system boots using the admin dash.

Configure Username and Password Prior to First Deployment

If the variable DJANGO_SUPERUSER_USERNAME is set, that will be the default admin user created on startup (the first time you run docker-compose -f local.yml up). The repo ships with a default superuser username of admin. The default password is set using the DJANGO_SUPERUSER_PASSWORD variable. The environment files for local deployments (but not production) include a default password of Openc0ntracts_def@ult. You should change this in the environment file before the first start, OR follow the instructions below to change it after the first start.

If you modify these environment variables in the environment file BEFORE running the docker-compose up command for the first time, your initial superuser will have the username, email and/or password you specify. If you don't modify the defaults, you can change them after you have created them via the admin dashboard (see below).

After First Deployment via Admin Dashboard

Once the default superuser has been created, you'll need to use the admin dashboard to modify it.

To manage users, including changing the password, you'll need to access the backend admin dashboard. OpenContracts is built on Django, which ships with Django Admin, a tool to manage low-level object data and users. It doesn't provide the rich, document-focused UI/UX our frontend does, but it does let you edit and delete objects created on the frontend if, for any reason, you are unable to fix something done by a frontend user (e.g. a corrupt file is uploaded and cannot be parsed or rendered properly on the frontend).

To update your users, first login to the admin panel:

Then, in the left-hand navbar, find the entry for "Users" and click on it.

Then, you'll see a list of all users for this instance. You should see your admin user and an "Anonymous" user. The Anonymous user is required for public browsing of objects with their is_public field set to True. The Anonymous user cannot see other objects.

Click on the admin user to bring up the detailed user view:

Now you can click the "WHAT AM I CALLED" button to bring up a dialog to change the user password.


Configure Gremlin Analyzer

Gremlin is a separate project by OpenSource Legal to provide a standard API to access NLP capabilities. This lets us wrap multiple NLP engines / techniques in the same API, which lets us build tools that can readily consume the outputs of very different NLP libraries (e.g. a Transformers-based model like BERT and tools like spaCy and LexNLP can all be deployed on Gremlin, and the outputs from all three can readily be rendered in OpenContracts).

OpenContracts is designed to work with Gremlin out-of-the-box. We have sample compose YAML files showing how to do this on a local machine (local_deploy_with_gremlin.yaml) and as a web-facing application (production_deploy_with_gremlin.yaml).

When you add a new Gremlin Engine to the database, OpenContracts will automatically query it for its installed analyzers and labels. These will then be available within OpenContracts, and you can use an analyzer to analyze any OpenContracts corpus.

While we have plans to automatically "install" the default Gremlin on first boot, currently you must manually go into the OpenContracts admin dash and add the Gremlin. Thankfully, this is an easy process:

  1. In your environment file, make sure you set CALLBACK_ROOT_URL_FOR_ANALYZER
    1. For local deploy, use CALLBACK_ROOT_URL_FOR_ANALYZER=http://localhost:8000
    2. For production deploy, use http://django:5000. Why the change? Well, in our local docker compose stack, the host is localhost and the Django development server runs on port 8000. In production, we want Gremlin to communicate with the OpenContracts container ("django") via its hostname on the docker compose stack's network. The production OpenContracts container also uses gunicorn on port 5000 instead of the development server on port 8000, so the port changes too.
  2. Go to the admin page:
  3. Click "Add+" in the Gremlin row to bring up the Add Gremlin Engine form. You just need to set the creator Url fields (the url for our default config is http://gremlinengine:5000). If, for some reason, you don't want the analyzer to be visible to any unauthenticated user, unselect the is_public box :
  4. This will automatically kick off an install process that runs in the background (a rough sketch of what that step does follows this list). When it's complete, you'll see the "Install Completed" field change. It should take a second or two. At the moment, we don't handle errors in this process, so, if it doesn't complete successfully in 30 seconds, there is probably a misconfiguration somewhere. We plan to improve our error handling for these backend installation processes.
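
To make that background install step more concrete, here is a rough, hypothetical sketch of what it might do. The endpoint paths and return shapes are illustrative placeholders, not Gremlin's actual API, and the real task also persists the results to the database.

import requests


def install_gremlin_engine(engine_url, callback_root_url):
    # Hypothetical endpoints, for illustration only.
    analyzers = requests.get(f"{engine_url}/api/analyzers", timeout=30).json()
    labels = requests.get(f"{engine_url}/api/labels", timeout=30).json()
    # The real install process would create the corresponding analyzer and label
    # records in OpenContracts and register callback_root_url so Gremlin can POST
    # analysis results back to the server.
    return analyzers, labels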

Note, in our example implementations, Gremlin is NOT encrypted or API-key secured against outside traffic. It's not exposed to outside traffic either, per our docker compose config, so this shouldn't be a major concern. If you do expose the container to the host via your Docker Compose file, you should ensure you run the traffic through Traefik and set up API key authentication.


Frontend Configuration

Why?

The frontend configuration variables should not be secrets as there is no way to keep them secure on the frontend. That said, being able to specify certain configurations via environment variables makes configuration and deployment much easier.

What Can be Configured?

Our frontend config file should look like this (the OPEN_CONTRACTS_ prefixes are necessary to get the env variables injected into the frontend container; the env variable that shows up on window._env_ in the React frontend will omit the prefix, however - e.g. OPEN_CONTRACTS_REACT_APP_APPLICATION_DOMAIN will show up as REACT_APP_APPLICATION_DOMAIN). A sketch of the injection step follows the variable list:

OPEN_CONTRACTS_REACT_APP_APPLICATION_DOMAIN=
 OPEN_CONTRACTS_REACT_APP_APPLICATION_CLIENT_ID=
 OPEN_CONTRACTS_REACT_APP_AUDIENCE=http://localhost:3000
 OPEN_CONTRACTS_REACT_APP_API_ROOT_URL=https://opencontracts.opensource.legal
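
The following is a minimal sketch of the kind of prefix-stripping injection described above, assuming a small script run at container startup; it is illustrative, not the container's actual entrypoint.

import json
import os

# Gather OPEN_CONTRACTS_* variables, strip the prefix, and emit an env.js file
# the React app can load to populate window._env_.
prefix = "OPEN_CONTRACTS_"
env = {k[len(prefix):]: v for k, v in os.environ.items() if k.startswith(prefix)}

with open("env.js", "w") as f:
    f.write("window._env_ = " + json.dumps(env) + ";")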

Documentation

Documentation Stack

We're using mkdocs to render our markdown into pretty, bite-sized pieces. The markdown lives in /docs in our repo. If you want to work on the docs you'll need to install the requirements in /requirements/docs.txt.

To have a live server while working on them, type:

mkdocs serve
 

Building Docs

To build an HTML website from your markdown that can be uploaded to a webhost (or a GitHub Page), just type:

mkdocs build
 

Deploying to GH Page

mkdocs makes it super easy to deploy your docs to a GitHub page.

Just run:

mkdocs gh-deploy
 

Dev Environment

We use Black and Flake8 for Python code styling. These are run via pre-commit before all commits. If you want to develop extensions or code based on OpenContracts, you'll need to set up pre-commit. First, make sure the requirements in ./requirements/local.txt are installed in your local environment.

Then, install pre-commit into your local git repo. From the root of the repo, run:

 $ pre-commit install
 
If you want to run pre-commit manually on all the code in the repo, use this command:

 $ pre-commit run --all-files
 

When you commit changes to your repo or our repo as a PR, pre-commit will run and ensure your code follows our style guide and passes linting.

Frontend Notes

Responsive Layout

The application was primarily designed to be viewed around 1080p. We've built in some quick and dirty fixes (honestly, hacks) to display a usable layout at other resolutions. A more thorough redesign / refactor is in order if there's sufficient interest. What's available now should handle a lot of situations OK. If layout or performance doesn't look great at your resolution, try using a desktop browser at a 1080p resolution.

No Test Suite

As of our initial release, the test suite only tests the backend (and coverage is admittedly not as robust as we'd like). We'd like to add tests for the frontend, though this is a fairly large undertaking. We welcome any contributions on this front!


Test Suite

Our test suite is a bit sparse, but we're working to improve coverage on the backend. Frontend tests will likely take longer to implement. Our existing tests do test imports and a number of the utility functions for manipulating annotations. These tests are integrated in our GitHub actions.

NOTE: use Python 3.10 or above, as pydantic and certain pre-3.10 type annotations do not play well together. Using from __future__ import annotations doesn't always solve the problem, and upgrading to Python 3.10 was a lot easier than trying to figure out why the __future__ import didn't behave as expected.
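
For a concrete (illustrative) example of the kind of failure this avoids: pydantic evaluates annotations at runtime, so a PEP 604 union still breaks on Python 3.9 even when the __future__ import turns annotations into strings.

from __future__ import annotations

from pydantic import BaseModel


class Cell(BaseModel):
    # Fine on Python 3.10+; on 3.9, pydantic still has to evaluate this
    # annotation at runtime and raises despite the __future__ import above.
    value: str | None = None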

To run the tests, check your test coverage, and generate an HTML coverage report:

 $ docker-compose -f local.yml run django coverage run -m pytest
  $ docker-compose -f local.yml run django coverage html
  $ open htmlcov/index.html
 

To run a specific test (e.g. test_analyzers):

 $ sudo docker-compose -f local.yml run django python manage.py test opencontractserver.tests.test_analyzers --noinput

Extracting Structured Data from Documents using LlamaIndex, AI Agents, and Marvin

We've added a powerful feature called "extract" that enables the generation of structured data grids from a list of documents using a combination of vector search, AI agents, and the Marvin library.

This run_extract task orchestrates the extraction process, spinning up a number of llama_index_doc_query tasks. Each of these query tasks uses LlamaIndex (backed by Django & pgvector) for vector search and retrieval, and Marvin for data parsing and extraction. Each document and column is processed in parallel using Celery's task system.

All credit for the inspiration of this feature goes to the fine folks at Nlmatics. They were some of the first pioneers working on building datagrids from documents using a set of questions and custom transformer models. This implementation of their concept ultimately leverages newer techniques and better models, but hats off to them for coming up with a design like this in 2017/2018!

The current implementation relies heavily on LlamaIndex, specifically their vector store tooling, their reranker and their agent framework.

Structured data extraction is powered by the amazing Marvin library.

Overview

The extract process involves the following key components:

  1. Document Corpus: A collection of documents from which structured data will be extracted.
  2. Fieldset: A set of columns defining the structure of the data to be extracted.
  3. LlamaIndex: A library used for efficient vector search and retrieval of relevant document sections.
  4. AI Agents: Intelligent agents that analyze the retrieved document sections and extract structured data.
  5. Marvin: A library that facilitates the parsing and extraction of structured data from text.

The extract process is initiated by creating an Extract object that specifies the document corpus and the fieldset defining the desired data structure. The process is then broken down into individual tasks for each document and column combination, allowing for parallel processing and scalability.

Detailed Walkthrough

Here's how the extract process works step by step.

1. Initiating the Extract Process

The run_extract function is the entry point for initiating the extract process. It takes the extract_id and user_id as parameters and performs the following steps (a simplified sketch follows the list):

  1. Retrieves the Extract object from the database based on the provided extract_id.
  2. Sets the started timestamp of the extract to the current time.
  3. Retrieves the fieldset associated with the extract, which defines the columns of the structured data grid.
  4. Retrieves the list of document IDs associated with the extract.
  5. Creates Datacell objects for each document and column combination, representing the individual cells in the structured data grid.
  6. Sets the appropriate permissions for each Datacell object based on the user's permissions.
  7. Kicks off the processing job for each Datacell by appending a task to the Celery task queue.
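
The following is a stripped-down sketch of what such an orchestration task can look like. It assumes Celery and mirrors the steps above, but the import paths, relation names (documents, fieldset.columns) and the mark_extract_complete callback are assumptions rather than the repository's actual code.

from celery import chord, shared_task
from django.utils import timezone

# Assumed model/task import paths, for illustration only.
from opencontractserver.extracts.models import Datacell, Extract
from opencontractserver.tasks import llama_index_doc_query, mark_extract_complete


@shared_task
def run_extract(extract_id, user_id):
    extract = Extract.objects.get(pk=extract_id)
    extract.started = timezone.now()
    extract.save()

    cell_tasks = []
    for document_id in extract.documents.values_list("pk", flat=True):
        for column in extract.fieldset.columns.all():
            cell = Datacell.objects.create(
                extract=extract,
                document_id=document_id,
                column=column,
                creator_id=user_id,
            )
            cell_tasks.append(llama_index_doc_query.si(cell.pk))

    # Fan out one task per datacell, then mark the extract finished.
    chord(cell_tasks)(mark_extract_complete.si(extract_id))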

2. Processing Individual Datacells

The llama_index_doc_query function is responsible for processing each individual Datacell.
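
In spirit, each per-cell task does something like the sketch below, using LlamaIndex for retrieval over the document's annotations and Marvin to coerce the retrieved text into the column's output type. The embedding model name and column field names are assumptions, and the real task also handles agents, reranking, permissions and error states.

import marvin
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding


def answer_cell(datacell, vector_store):
    # Build an index over the document's annotations via the custom vector store.
    index = VectorStoreIndex.from_vector_store(
        vector_store=vector_store,
        embed_model=HuggingFaceEmbedding(
            model_name="sentence-transformers/all-mpnet-base-v2"  # illustrative
        ),
    )
    retriever = index.as_retriever(similarity_top_k=8)
    nodes = retriever.retrieve(datacell.column.query)  # field name is an assumption

    # Let Marvin parse the retrieved context into structured output.
    context = "\n".join(n.node.get_content() for n in nodes)
    return marvin.cast(context, target=str)  # real code maps the column's target type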

Execution Flow Visualized:

graph TD
     I[llama_index_doc_query] --> J[Retrieve Datacell]
     J --> K[Create HuggingFaceEmbedding]
     K --> L[Create OpenAI LLM]

Making a Django Application Compatible with LlamaIndex using a Custom Vector Store

Introduction

In this walkthrough, we'll explore how the custom DjangoAnnotationVectorStore makes a Django application compatible with LlamaIndex, enabling powerful vector search capabilities within the application's structured annotation store. By leveraging the BasePydanticVectorStore class provided by LlamaIndex and integrating it with Django's ORM and the pg-vector extension for PostgreSQL, we can achieve efficient and scalable vector search functionality.

Understanding the DjangoAnnotationVectorStore

The DjangoAnnotationVectorStore is a custom implementation of LlamaIndex's BasePydanticVectorStore class, tailored specifically for a Django application. It allows the application to store and retrieve granular, visually-locatable annotations (x-y blocks) from PDF pages using vector search.

Let's break down the key components and features of the DjangoAnnotationVectorStore:

1. Inheritance from BasePydanticVectorStore

class DjangoAnnotationVectorStore(BasePydanticVectorStore):
     ...
 

By inheriting from BasePydanticVectorStore, the DjangoAnnotationVectorStore gains access to the base functionality and interfaces provided by LlamaIndex for vector stores. This ensures compatibility with LlamaIndex's query engines and retrieval methods.
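
Concretely, a subclass mainly has to satisfy LlamaIndex's vector store interface. The simplified sketch below shows roughly that surface; the real class also carries Django-specific configuration (such as the corpus and document ids it is scoped to) and fills in the query logic.

from llama_index.core.vector_stores.types import (
    BasePydanticVectorStore,
    VectorStoreQuery,
    VectorStoreQueryResult,
)


class DjangoAnnotationVectorStore(BasePydanticVectorStore):
    stores_text: bool = True

    @property
    def client(self):
        return None  # no external client; Django's ORM plays that role

    def add(self, nodes, **kwargs):
        raise NotImplementedError  # in this sketch the store is read-only

    def delete(self, ref_doc_id, **kwargs):
        raise NotImplementedError  # in this sketch the store is read-only

    def query(self, query: VectorStoreQuery, **kwargs) -> VectorStoreQueryResult:
        # Translate the query into a Django queryset plus a pgvector similarity search.
        ...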

2. Integration with Django's ORM

The DjangoAnnotationVectorStore leverages Django's Object-Relational Mapping (ORM) to interact with the application's database. It defines methods like _get_annotation_queryset() and _build_filter_query() to retrieve annotations from the database using Django's queryset API.

def _get_annotation_queryset(self) -> QuerySet:
     queryset = Annotation.objects.all()

Answering Queries using LlamaIndex in a Django Application

This markdown document explains how queries are answered in a Django application using LlamaIndex, the limitations of the approach, and how LlamaIndex is leveraged for this purpose.

Query Answering Process

  1. A user submits a query through the Django application, which is associated with a specific corpus (a collection of documents).
  2. The query is saved in the database as a CorpusQuery object, and a Celery task (run_query) is triggered to process the query asynchronously.
  3. Inside the run_query task (a simplified sketch follows this list):
  4. The CorpusQuery object is retrieved from the database using the provided query_id.
  5. The query's started timestamp is set to the current time.
  6. The necessary components for query processing are set up, including the embedding model (HuggingFaceEmbedding), language model (OpenAI), and vector store (DjangoAnnotationVectorStore).
  7. The DjangoAnnotationVectorStore is initialized with the corpus_id associated with the query, allowing it to retrieve the relevant annotations for the specified corpus.
  8. A VectorStoreIndex is created from the DjangoAnnotationVectorStore, which serves as the index for the query engine.
  9. A CitationQueryEngine is instantiated with the index, specifying the number of top similar results to retrieve (similarity_top_k) and the granularity of the citation sources (citation_chunk_size).
  10. The query is passed to the CitationQueryEngine, which processes the query and generates a response.
  11. The response includes the answer to the query along with the source annotations used to generate the answer.
  12. The source annotations are parsed and converted into a markdown format, with each citation linked to the corresponding annotation ID.
  13. The query's sources field is updated with the annotation IDs used in the response.
  14. The query's response field is set to the generated markdown text.
  15. The query's completed timestamp is set to the current time.
  16. If an exception occurs during the query processing, the query's failed timestamp is set, and the stack trace is stored in the stacktrace field.
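
Pulling those steps together, the heart of the task looks roughly like the following. This is a sketch under the assumptions above: the vector store's import path and constructor arguments, the model names, and the helper name are illustrative rather than the repository's exact code.

from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI

# Assumed import path for the custom store described in the previous section.
from opencontractserver.llms.vector_stores import DjangoAnnotationVectorStore


def answer_corpus_query(query_text, corpus_id):
    # Model setup; the real task configures these from the deployment's settings.
    Settings.llm = OpenAI(model="gpt-4o")  # model name is illustrative
    Settings.embed_model = HuggingFaceEmbedding(
        model_name="sentence-transformers/all-mpnet-base-v2"  # illustrative
    )

    # Index backed by the corpus-scoped Django vector store.
    vector_store = DjangoAnnotationVectorStore(corpus_id=corpus_id)  # args assumed
    index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

    engine = CitationQueryEngine.from_args(
        index, similarity_top_k=10, citation_chunk_size=512
    )
    response = engine.query(query_text)
    # response.response holds the answer text; response.source_nodes carries the
    # cited annotations whose ids get written back onto the CorpusQuery.
    return response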

Leveraging LlamaIndex

LlamaIndex is leveraged in the following ways to enable query answering in the Django application:

  1. Vector Store: LlamaIndex provides the BasePydanticVectorStore class, which serves as the foundation for the custom DjangoAnnotationVectorStore. The DjangoAnnotationVectorStore integrates with Django's ORM to store and retrieve annotations efficiently, allowing seamless integration with the existing Django application.
  2. Indexing: LlamaIndex's VectorStoreIndex is used to create an index from the DjangoAnnotationVectorStore. The index facilitates fast and efficient retrieval of relevant annotations based on the query.
  3. Query Engine: LlamaIndex's CitationQueryEngine is employed to process the queries and generate responses. The query engine leverages the index to find the most relevant annotations and uses the language model to generate a coherent answer.
  4. Embedding and Language Models: LlamaIndex provides abstractions for integrating various embedding and language models. In this implementation, the HuggingFaceEmbedding and OpenAI models are used, but LlamaIndex allows flexibility in choosing different models based on requirements.

By leveraging LlamaIndex, the Django application benefits from a structured and efficient approach to query answering. LlamaIndex provides the necessary components and abstractions to handle vector storage, indexing, and query processing, allowing the application to focus on integrating these capabilities into its existing architecture.


OpenContracts

Open Contracts

The Free and Open Source Document Analytics Platform


CI/CD codecov
Meta code style - black types - Mypy imports - isort License - Apache2

What Does it Do?

OpenContracts is an Apache-2 Licensed enterprise document analytics tool. It provides several key features:

  1. Manage Documents - Manage document collections (Corpuses)
  2. Layout Parser - Automatically extracts layout features from PDFs
  3. Automatic Vector Embeddings - generated for uploaded PDFs and extracted layout blocks
  4. Pluggable microservice analyzer architecture - to let you analyze documents and automatically annotate them
  5. Human Annotation Interface - to manually annotate documents, including multi-page annotations.
  6. LlamaIndex Integration - Use our vector stores (powered by pgvector) and any manual or automatically annotated features to let an LLM intelligently answer questions.
  7. Data Extract - ask multiple questions across hundreds of documents using complex LLM-powered querying behavior. Our sample implementation uses LlamaIndex + Marvin.
  8. Custom Data Extract - Custom data extract pipelines can be used on the frontend to query documents in bulk.

Grid Review And Sources.gif

Manual Annotations

Key Docs

  1. Quickstart Guide - You'll probably want to get started quickly. Setting up locally should be pretty painless if you're already running Docker.
  2. Basic Walkthrough - Check out the walkthrough to step through basic usage of the application for document and annotation management.
  3. PDF Annotation Data Format Overview - You may be interested in how we map text to PDFs visually and the underlying data format we're using.
  4. Django + Pgvector Powered Hybrid Vector Database - We've used the latest open source tooling for vector storage in postgres to make it almost trivially easy to combine structured metadata and vector embeddings with an API-powered application.
  5. LlamaIndex Integration Walkthrough - We wrote a wrapper for our backend database and vector store to make it simple to load our parsed annotations, embeddings and text into LlamaIndex. Even better, if you have additional annotations in the document, the LLM can access those too.
  6. Write Custom Data Extractors - Custom data extract tasks (which can use LlamaIndex or can be totally bespoke) are automatically loaded and displayed on the frontend to let users select how to ask questions and extract data from documents.

Architecture and Data Flows at a Glance

Core Data Standard

The core idea here - besides providing a platform to analyze contracts - is an open and standardized architecture that makes data extremely portable. Powering this is a set of data standards to describe the text and layout blocks on a PDF page:

Data Format

Robust PDF Processing Pipeline

We have a robust PDF processing pipeline that is horizontally scalable and generates our standardized data consistently for PDF inputs (We're working on adding additional formats soon):

PDF Processor

Special thanks to Nlmatics and nlm-ingestor for powering the layout parsing and extraction.

Limitations

At the moment, it only works with PDFs. In the future, it will be able to convert other document types to PDF for storage and labeling. PDF is an excellent format for this as it introduces a consistent, repeatable format which we can use to generate a text and x-y coordinate layer from scratch.

Adding OCR and ingestion for other enterprise documents is a priority.

Acknowledgements

Special thanks to AllenAI's PAWLS project and Nlmatics nlm-ingestor. They've pioneered a number of features and flows, and we are using their code in some parts of the application.

\ No newline at end of file diff --git a/philosophy/index.html b/philosophy/index.html index 23981045..2ecde6ca 100755 --- a/philosophy/index.html +++ b/philosophy/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Philosophy

Don't Repeat Yourself

OpenContracts is designed not only to be a powerful document analysis and annotation platform; it's also envisioned as a way to embrace the DRY (Don't Repeat Yourself) principle for legal and legal engineering work. You can make a corpus, along with all of its labels, documents and annotations, "public" (currently, you must do this via a GraphQL mutation).

Once something is public, it's read-only for everyone other than its original creator. People with read-only access can "clone" the corpus to create a private copy of the corpus, its documents and its annotations. They can then edit the annotations, add to them, export them, etc. This lets us work from previous document annotations and re-use labels and training data.

\ No newline at end of file diff --git a/quick-start/index.html b/quick-start/index.html index 3a14c395..9b669841 100755 --- a/quick-start/index.html +++ b/quick-start/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Quick Start (For use on your local machine)

This guide is for people who want to quickly get started using the application and aren't interested in hosting it online for others to use. You'll get a default, local user with admin access. We recommend you change the user password after completing this tutorial. We assume you're using Linux or Mac OS, but you could do this on Windows too, assuming you have docker compose and docker installed. The commands to create directories will be different on Windows, but the git, docker and docker-compose commands should all be the same.

Step 1: Clone this Repo

Clone the repository into a local directory of your choice. Here, we assume you are using a folder called source in your user's home directory:

    $ cd ~
     $ mkdir source
     $ cd source
     $ git clone https://github.com/JSv4/OpenContracts.git
diff --git a/requirements/index.html b/requirements/index.html
index 6421286f..238938f0 100755
--- a/requirements/index.html
+++ b/requirements/index.html
@@ -7,6 +7,6 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

System Requirements

System Requirements

You will need Docker and Docker Compose installed to run Open Contracts. We've developed and run the application in a Linux x86_64 environment. We haven't tested on Windows, and it's known that celery is not supported on Windows. For this reason, we do not recommend deployment on Windows. If you must run on a Windows machine, consider using a virtual machine or the Windows Subsystem for Linux.

If you need help setting up Docker, we recommend Digital Ocean's setup guide. Likewise, if you need assistance setting up Docker Compose, Digital Ocean's guide is excellent.

\ No newline at end of file diff --git a/search/search_index.json b/search/search_index.json index 56572afc..587d02dc 100755 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"About","text":""},{"location":"#open-contracts","title":"Open Contracts","text":""},{"location":"#the-free-and-open-source-document-analytics-platform","title":"The Free and Open Source Document Analytics Platform","text":"CI/CD Meta"},{"location":"#what-does-it-do","title":"What Does it Do?","text":"

OpenContracts is an Apache-2 Licensed enterprise document analytics tool. It provides several key features:

  1. Manage Documents - Manage document collections (Corpuses)
  2. Layout Parser - Automatically extracts layout features from PDFs
  3. Automatic Vector Embeddings - generated for uploaded PDFs and extracted layout blocks
  4. Pluggable microservice analyzer architecture - to let you analyze documents and automatically annotate them
  5. Human Annotation Interface - to manually annotate documents, including multi-page annotations.
  6. LlamaIndex Integration - Use our vector stores (powered by pgvector) and any manual or automatically annotated features to let an LLM intelligently answer questions.
  7. Data Extract - ask multiple questions across hundreds of documents using complex LLM-powered querying behavior. Our sample implementation uses LlamaIndex + Marvin.
  8. Custom Data Extract - Custom data extract pipelines can be used on the frontend to query documents in bulk.

"},{"location":"#key-docs","title":"Key Docs","text":"
  1. Quickstart Guide - You'll probably want to get started quickly. Setting up locally should be pretty painless if you're already running Docker.
  2. Basic Walkthrough - Check out the walkthrough to step through basic usage of the application for document and annotation management.
  3. PDF Annotation Data Format Overview - You may be interested in how we map text to PDFs visually and the underlying data format we're using.
  4. Django + Pgvector Powered Hybrid Vector Database - We've used the latest open source tooling for vector storage in postgres to make it almost trivially easy to combine structured metadata and vector embeddings with an API-powered application.
  5. LlamaIndex Integration Walkthrough - We wrote a wrapper for our backend database and vector store to make it simple to load our parsed annotations, embeddings and text into LlamaIndex. Even better, if you have additional annotations in the document, the LLM can access those too.
  6. Write Custom Data Extractors - Custom data extract tasks (which can use LlamaIndex or can be totally bespoke) are automatically loaded and displayed on the frontend to let users select how to ask questions and extract data from documents.
"},{"location":"#architecture-and-data-flows-at-a-glance","title":"Architecture and Data Flows at a Glance","text":""},{"location":"#core-data-standard","title":"Core Data Standard","text":"

The core idea here - besides providing a platform to analyze contracts - is an open and standardized architecture that makes data extremely portable. Powering this is a set of data standards to describe the text and layout blocks on a PDF page:

"},{"location":"#robust-pdf-processing-pipeline","title":"Robust PDF Processing Pipeline","text":"

We have a robust PDF processing pipeline that is horizontally scalable and generates our standardized data consistently for PDF inputs (We're working on adding additional formats soon):

Special thanks to Nlmatics and nlm-ingestor for powering the layout parsing and extraction.

"},{"location":"#limitations","title":"Limitations","text":"

At the moment, it only works with PDFs. In the future, it will be able to convert other document types to PDF for storage and labeling. PDF is an excellent format for this as it introduces a consistent, repeatable format which we can use to generate a text and x-y coordinate layer from scratch.

Adding OCR and ingestion for other enterprise documents is a priority.

"},{"location":"#acknowledgements","title":"Acknowledgements","text":"

Special thanks to AllenAI's PAWLS project and Nlmatics nlm-ingestor. They've pioneered a number of features and flows, and we are using their code in some parts of the application.

"},{"location":"acknowledgements/","title":"Acknowledgements","text":"

OpenContracts is built in part on top of the PAWLs project frontend. We have made extensive changes, however, and plan to remove even more of the original PAWLs codebase, particularly their state management, as it's currently duplicative of the Apollo state store we use throughout the application. That said, PAWLs was the inspiration for how we handle text extraction, and we're planning to continue using their PDF rendering code. We are also using PAWLs' pre-processing script, which is based on Grobid.

We should also thank the Grobid project, which was clearly a source of inspiration for PAWLs and an extremely impressive tool. Grobid is designed more for medical and scientific papers, but, nevertheless, offers a tremendous amount of inspiration and examples for the legal world to borrow. Perhaps there is an opportunity to have a unified tool in that respect.

Finally, let's not forget Tesseract, the OCR engine that started its life as an HP research project in the 1980s before being taken over by Google in the early aughts and finally becoming an independent project in 2018. Were it not for the excellent, free OCR provided by Tesseract, we'd have to rely on commercial OCR tech, which would make this kind of opensource, free project prohibitively expensive. Thanks to the many, many people who've made free OCR possible over the nearly 40 years Tesseract has been under development.

"},{"location":"philosophy/","title":"Philosophy","text":""},{"location":"philosophy/#dont-repeat-yourself","title":"Don't Repeat Yourself","text":"

OpenContracts is designed not only to be a powerful document analysis and annotation platform; it's also envisioned as a way to embrace the DRY (Don't Repeat Yourself) principle for legal and legal engineering work. You can make a corpus, along with all of its labels, documents and annotations, \"public\" (currently, you must do this via a GraphQL mutation).

Once something is public, it's read-only for everyone other than its original creator. People with read-only access can \"clone\" the corpus to create a private copy of the corpus, its documents and its annotations. They can then edit the annotations, add to them, export them, etc. This lets us work from previous document annotations and re-use labels and training data.

"},{"location":"quick-start/","title":"Quick Start (For use on your local machine)","text":"

This guide is for people who want to quickly get started using the application and aren't interested in hosting it online for others to use. You'll get a default, local user with admin access. We recommend you change the user password after completing this tutorial. We assume you're using Linux or Mac OS, but you could do this on Windows too, assuming you have docker compose and docker installed. The commands to create directories will be different on Windows, but the git, docker and docker-compose commands should all be the same.

"},{"location":"quick-start/#step-1-clone-this-repo","title":"Step 1: Clone this Repo","text":"

Clone the repository into a local directory of your choice. Here, we assume you are using a folder called source in your user's home directory:

    $ cd ~\n    $ mkdir source\n    $ cd source\n    $ git clone https://github.com/JSv4/OpenContracts.git\n
"},{"location":"quick-start/#step-2-copy-sample-env-files-to-appropriate-folders","title":"Step 2: Copy sample .env files to appropriate folders","text":"

Again, we're assuming a local deployment here with basic options. To just get up and running, you'll want to copy our sample .env file from the ./docs/sample_env_files directory to the appropriate .local subfolder in the .envs directory in the repo root.

"},{"location":"quick-start/#backend-env-file","title":"Backend .Env File","text":"

For the most basic deployment, copy ./sample_env_files/backend/local/.django to ./.envs/.local/.django and copy ./sample_env_files/backend/local/.postgres to ./.envs/.local/.postgres. You can use the default configurations, but we recommend you set your own admin account password in .django and your own postgres credentials in .postgres.

"},{"location":"quick-start/#frontend-env-file","title":"Frontend .Env File","text":"

You also need to copy the appropriate .frontend env file to ./.envs/.local/.frontend. We're assuming you're not using something like auth0 and are going to rely on Django auth to provision and authenticate users. Grab ./sample_env_files/frontend/local/django.auth.env and copy it to ./.envs/.local/.frontend.

"},{"location":"quick-start/#step-3-build-the-stack","title":"Step 3: Build the Stack","text":"

Change into the directory of the repository you just cloned, e.g.:

    cd OpenContracts\n

Now, you need to build the docker compose stack. If you are okay with the default username and password, and, most importantly, you are NOT planning to host the application online, the default local settings are sufficient and no configuration is required. If you want to change the defaults, see the configuration docs first.

    $ docker-compose -f local.yml build\n
"},{"location":"quick-start/#step-4-choose-frontend-deployment-method","title":"Step 4 Choose Frontend Deployment Method","text":"

Option 1 Use \"Fullstack\" Profile in Docker Compose

If you're not planning to do any frontend development, the easiest way to get started with OpenContracts is to just type:

    docker-compose -f local.yml --profile fullstack up\n

This will start docker compose and add a container for the frontend to the stack.

Option 2 Use Node to Deploy Frontend

If you plan to actively develop the frontend in the /frontend folder, you can just point your favorite TypeScript IDE to that directory and then run:

yarn install\n

and

yarn start\n

to bring up the frontend. Then you can edit the frontend code as desired and have it hot reload as you'd expect for a React app.

Congrats! You have OpenContracts running.

"},{"location":"quick-start/#step-5-login-and-start-annotating","title":"Step 5: Login and Start Annotating","text":"

If you go to http://localhost:3000 in your browser, you'll see the login page. You can log in with the default username and password. These are set in the environment variable file you can find in the ./.envs/.local/ directory. In that directory, you'll see a file called .django. Backend-specific configuration variables go in there. See our guide for how to create new users.

NOTE: The frontend is at port 3000, not 8000, so don't forget to use http://localhost:3000 for frontend access. We have an open issue to add a redirect from the backend root page - http://localhost:8000/ - to http://localhost:3000.

Caveats

The quick start local config is designed for use on a local machine, not for access over the Internet or a network. It uses the local disk for storage (not AWS) and Django's built-in authentication.

"},{"location":"requirements/","title":"System Requirements","text":""},{"location":"requirements/#system-requirements","title":"System Requirements","text":"

You will need Docker and Docker Compose installed to run Open Contracts. We've developed and run the application in a Linux x86_64 environment. We haven't tested on Windows, and it's known that celery is not supported on Windows. For this reason, we do not recommend deployment on Windows. If you must run on a Windows machine, consider using a virtual machine or the Windows Subsystem for Linux.

If you need help setting up Docker, we recommend Digital Ocean's setup guide. Likewise, if you need assistance setting up Docker Compose, Digital Ocean's guide is excellent.

"},{"location":"architecture/PDF-data-layer/","title":"PDF data layer","text":""},{"location":"architecture/PDF-data-layer/#data-layers","title":"Data Layers","text":"

OpenContracts builds on the work that AllenAI did with PAWLs to create a consistent shared source of truth for data labeling and NLP algorithms, regardless of whether they are layout-aware (like LayoutLM) or not (like BERT, spaCy or LexNLP). One of the challenges with natural language documents, particularly contracts, is that there are so many ways to structure any given file (e.g. .docx or .pdf) to represent exactly the same text. Even an identical document with identical formatting in a format like .pdf can have a significantly different file structure depending on what software was used to create it, the user's choices, and the software's own choices in deciding how to structure its output.

PAWLs and OpenContracts attempt to solve this by sending every document through a processing pipeline that provides a uniform and consistent way of extracting and structuring text and layout information. Using the parsing engine of Grobid and the open source OCR engine Tesseract, every single document is re-OCRed (to produce a consistent output for the same inputs) and then the \"tokens\" (text surrounded on all sides by whitespace - typically a word) in the OCRed document are stored as JSONs with their page and positional information. In OpenContracts, we refer to this JSON layer that combines text and positional data as the \"PAWLs\" layer. We use the PAWLs layer to build the full text extract from the document as well and store this as the \"text layer\".

Thus, in OpenContracts, every document has three files associated with it - the original pdf, a json file (the \"PAWLs layer\"), and a text file (the \"text layer\"). Because the text layer is built from the PAWLs layer, we can easily translate back and forth from text to positional information - e.g. given the start and end of a span of text in the text layer, we can accurately say which PAWLs tokens the span includes, and, based on that, the x,y position of the span in the document.

This lets us take the outputs of many NLP libraries which typically produce only start and stop ranges and layer them perfectly on top of the original pdf. With the PAWLs tokens as the source of truth, we can seamlessly transition from text only to layout-aware text.
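To make this concrete, here is a minimal Python sketch of the idea, assuming a simplified token schema (page, x, y, width, height, text); the real PAWLs layer stored by OpenContracts has more structure, so treat the field names and shapes below as illustrative rather than the actual format:

    # Illustrative only: a simplified "PAWLs layer" for one page.
    # The real schema used by OpenContracts is richer than this.
    pawls_layer = [
        {
            "page": {"index": 0, "width": 612, "height": 792},
            "tokens": [
                {"x": 72.0, "y": 85.0, "width": 30.1, "height": 10.2, "text": "This"},
                {"x": 105.3, "y": 85.0, "width": 62.5, "height": 10.2, "text": "Agreement"},
                {"x": 171.0, "y": 85.0, "width": 10.4, "height": 10.2, "text": "is"},
            ],
        }
    ]

    def build_text_layer(pages):
        """Concatenate token text with single spaces, remembering each token's
        character span so text offsets can be mapped back to tokens."""
        text, spans = "", []
        for page in pages:
            for token in page["tokens"]:
                start = len(text)
                text += token["text"] + " "
                spans.append((start, start + len(token["text"]), token))
        return text, spans

    def tokens_for_span(spans, start, end):
        """Return the tokens whose character ranges overlap [start, end)."""
        return [tok for (s, e, tok) in spans if s < end and e > start]

    text_layer, spans = build_text_layer(pawls_layer)
    # e.g. a text-only NLP library reports an entity at characters 5-14 ("Agreement"):
    hits = tokens_for_span(spans, 5, 14)
    print([t["text"] for t in hits], [(t["x"], t["y"]) for t in hits])

Because every token carries both its character span in the text layer and its x/y box on the page, a character-offset prediction from a text-only model can be projected straight back onto the PDF.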

"},{"location":"architecture/PDF-data-layer/#limitations","title":"Limitations","text":"

OCR is not perfect. By only accepting pdf inputs and OCRing every document, we do ignore any text embedded in the pdf. To the extent that text was exported accurately from whatever tool was used to write the document, this introduces some potential loss of fidelity - e.g. if you've ever seen an OCR engine mistake an 'O' for a '0' or an 'I' for a '1', or something like that. Typically, however, the incidence of such errors is fairly small, and it's a price we have to pay for the power of being able to effortlessly layer NLP outputs that have no layout awareness on top of complex, visual layouts.

"},{"location":"architecture/asynchronous-processing/","title":"Asynchronous Processing","text":""},{"location":"architecture/asynchronous-processing/#asynchronous-tasks","title":"Asynchronous Tasks","text":"

OpenContracts makes extensive use of celery, a powerful, mature python framework for distributed and asynchronous processing. Out-of-the-box, dedicated celery workers are configured in the docker compose stack to handle computationally-intensive and long-running tasks like parsing documents, applying annotations to pdfs, creating exports, importing exports, and more.
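As a rough illustration of the pattern (this is generic Celery usage, not an actual OpenContracts task - the task name below is made up), work is declared with @shared_task so it runs on a worker container, and Django code queues it with an immutable signature via .si(...).apply_async(), just like the snippets later in these docs:

    from celery import shared_task

    @shared_task
    def process_document(doc_id: int) -> None:
        # Long-running work (parsing, exporting, annotating, ...) happens here,
        # inside a celery worker container rather than the web process.
        ...

    # From a view, signal handler, or another task: queue it asynchronously
    # instead of blocking the request/response cycle.
    process_document.si(doc_id=42).apply_async()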

"},{"location":"architecture/asynchronous-processing/#what-if-my-celery-queue-gets-clogged","title":"What if my celery queue gets clogged?","text":"

We are always working to make OpenContracts more fault-tolerant and stable. That said, due to the nature of the types of documents we're working with - pdfs - there is tremendous variation in what the parsers have to parse. Some documents are extremely long - thousands of pages or more - whereas other documents may have poor formatting, no text layers, etc. In most cases, OpenContracts should be able to process the pdfs and make them compatible with our annotation tools. Sometimes, however, either due to unexpected issues or an unexpected volume of documents, you may want to purge the queue of tasks to be processed by your celery workers. To do this, type:

sudo docker-compose -f local.yml run django celery -A config.celery_app purge\n

Be aware that this can cause some undesired effects for your users. For example, every time a new document is uploaded, a Django signal kicks off the pdf preprocessor to produce the PAWLs token layer that is later annotated. If these tasks are in-queue and the queue is purged, you'll have documents that are not annotatable as they'll lack the PAWLs token layers. In such cases, we recommend you delete and re-upload the documents. There are ways to manually reprocess the pdfs, but we don't have a user-friendly way to do this yet.

"},{"location":"architecture/opencontract-corpus-actions/","title":"CorpusAction System in OpenContracts: Revised Explanation","text":"

The CorpusAction system in OpenContracts automates document processing when new documents are added to a corpus. This system is designed to be flexible, allowing for different types of actions to be triggered based on the configuration.

Within this system, users have three options for registering actions to run automatically on new documents:

  1. Custom data extractors
  2. Analyzer microservices
  3. Celery tasks decorated with @doc_analyzer_task (a \"task-based Analyzer\")

The @doc_analyzer_task decorator is specifically designed for the third option, providing a straightforward way to implement simple, span-based analytics directly within the OpenContracts ecosystem.

"},{"location":"architecture/opencontract-corpus-actions/#action-execution-overview","title":"Action Execution Overview","text":"

The following flowchart illustrates the CorpusAction system in OpenContracts, demonstrating the process that occurs when a new document is added to a corpus. This automated workflow begins with the addition of a document, which triggers a Django signal. The signal is then handled, leading to the processing of the corpus action. At this point, the system checks the type of CorpusAction configured for the corpus. Depending on this configuration, one of three paths is taken: running an Extract with a Fieldset, executing an Analysis with a doc_analyzer_task, or submitting an Analysis to a Gremlin Engine. This diagram provides a clear visual representation of how the CorpusAction system automates document processing based on predefined rules, enabling efficient and flexible handling of new documents within the OpenContracts platform.

graph TD\n    A[Document Added to Corpus] -->|Triggers| B[Django Signal]\n    B --> C[Handle Document Added Signal]\n    C --> D[Process Corpus Action]\n    D --> E{Check CorpusAction Type}\n    E -->|Fieldset| F[Run Extract]\n    E -->|Analyzer with task_name| G[Run Analysis with doc_analyzer_task]\n    E -->|Analyzer with host_gremlin| H[Run Analysis with Gremlin Engine]\n
"},{"location":"architecture/opencontract-corpus-actions/#key-components","title":"Key Components","text":"
  1. CorpusAction Model: Defines the action to be taken, including:

    • Reference to the associated corpus
    • Trigger type (e.g., ADD_DOCUMENT)
    • Reference to either an Analyzer or a Fieldset
  2. CorpusActionTrigger Enum: Defines trigger events (ADD_DOCUMENT, EDIT_DOCUMENT)

  3. Signal Handlers: Detect when documents are added to a corpus

  4. Celery Tasks: Perform the actual processing asynchronously
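As a hedged sketch of how these pieces are wired up (the field names follow the snippets below on this page; import paths, required fields, and defaults in the real models may differ), registering an action that runs an extract whenever a document is added might look roughly like this:

    # Illustrative Django shell snippet - adjust imports and fields to the actual models.
    corpus = Corpus.objects.get(title="My Contracts")
    fieldset = Fieldset.objects.get(name="Key Terms")  # hypothetical fieldset lookup

    CorpusAction.objects.create(
        corpus=corpus,                             # corpus to watch
        trigger=CorpusActionTrigger.ADD_DOCUMENT,  # fire when documents are added
        fieldset=fieldset,                         # run an Extract (or set analyzer=... instead)
    )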

"},{"location":"architecture/opencontract-corpus-actions/#process-flow","title":"Process Flow","text":"
  1. Document Addition: A document is added to a corpus, triggering a Django signal.

  2. Signal Handling:

    @receiver(m2m_changed, sender=Corpus.documents.through)\ndef handle_document_added_to_corpus(sender, instance, action, pk_set, **kwargs):\n    if action == \"post_add\":\n        process_corpus_action.si(\n            corpus_id=instance.id,\n            document_ids=list(pk_set),\n            user_id=instance.creator.id,\n        ).apply_async()\n

  3. Action Processing: The process_corpus_action task is called, which determines the appropriate action based on the CorpusAction configuration.

  4. Execution Path: One of three paths is taken based on the CorpusAction configuration:

a) Run Extract with Fieldset
    • If the CorpusAction is associated with a Fieldset
    • Creates a new Extract object
    • Runs the extract process on the new document(s)

b) Run Analysis with doc_analyzer_task
    • If the CorpusAction is associated with an Analyzer that has a task_name
    • The task_name must refer to a function decorated with @doc_analyzer_task
    • Creates a new Analysis object
    • Runs the specified doc_analyzer_task on the new document(s)

c) Run Analysis with Gremlin Engine
    • If the CorpusAction is associated with an Analyzer that has a host_gremlin
    • Creates a new Analysis object
    • Submits the analysis job to the specified Gremlin Engine

Here's the relevant code snippet showing these paths:

@shared_task\ndef process_corpus_action(corpus_id: int, document_ids: list[int], user_id: int):\n    corpus = Corpus.objects.get(id=corpus_id)\n    actions = CorpusAction.objects.filter(\n        corpus=corpus, trigger=CorpusActionTrigger.ADD_DOCUMENT\n    )\n\n    for action in actions:\n        if action.fieldset:\n            # Path a: Run Extract with Fieldset\n            extract = Extract.objects.create(\n                name=f\"Extract for {corpus.title}\",\n                corpus=corpus,\n                fieldset=action.fieldset,\n                creator_id=user_id,\n            )\n            extract.documents.add(*document_ids)\n            run_extract.si(extract_id=extract.id).apply_async()\n        elif action.analyzer:\n            analysis = Analysis.objects.create(\n                analyzer=action.analyzer,\n                analyzed_corpus=corpus,\n                creator_id=user_id,\n            )\n            if action.analyzer.task_name:\n                # Path b: Run Analysis with doc_analyzer_task\n                task = import_string(action.analyzer.task_name)\n                for doc_id in document_ids:\n                    task.si(doc_id=doc_id, analysis_id=analysis.id).apply_async()\n            elif action.analyzer.host_gremlin:\n                # Path c: Run Analysis with Gremlin Engine\n                start_analysis.si(analysis_id=analysis.id).apply_async()\n

This system provides a flexible framework for automating document processing in OpenContracts. By configuring CorpusAction objects appropriately, users can ensure that newly added documents are automatically processed according to their specific needs, whether that involves running extracts, local analysis tasks, or submitting to external Gremlin engines for processing.

"},{"location":"architecture/components/Data-flow-diagram/","title":"Container Architecture & Data Flow","text":"

You'll notice that we have a number of containers in our docker compose file (Note the local.yml is up-to-date. The production file needs some work to be production grade, and we may switch to Tilt.).

Here, you can see how these containers relate to some of the core data elements powering the application - such as parsing structural and layout annotations from PDFs (which powers the vector store) and generating vector embeddings.

"},{"location":"architecture/components/Data-flow-diagram/#png-diagram","title":"PNG Diagram","text":""},{"location":"architecture/components/Data-flow-diagram/#mermaid-version","title":"Mermaid Version","text":"
graph TB\n    subgraph \"Docker Compose Environment\"\n        direction TB\n        django[Django]\n        postgres[PostgreSQL]\n        redis[Redis]\n        celeryworker[Celery Worker]\n        celerybeat[Celery Beat]\n        flower[Flower]\n        frontend[Frontend React]\n        nlm_ingestor[NLM Ingestor]\n        vector_embedder[Vector Embedder]\n    end\n\n    subgraph \"Django Models\"\n        direction TB\n        document[Document]\n        annotation[Annotation]\n        relationship[Relationship]\n        labelset[LabelSet]\n        extract[Extract]\n        datacell[Datacell]\n    end\n\n    django -->|Manages| document\n    django -->|Manages| annotation\n    django -->|Manages| relationship\n    django -->|Manages| labelset\n    django -->|Manages| extract\n    django -->|Manages| datacell\n\n    nlm_ingestor -->|Parses PDFs| django\n    nlm_ingestor -->|Creates layout annotations| annotation\n\n    vector_embedder -->|Generates embeddings| django\n    vector_embedder -->|Stores embeddings| annotation\n    vector_embedder -->|Stores embeddings| document\n\n    django -->|Stores data| postgres\n    django -->|Caching| redis\n\n    celeryworker -->|Processes tasks| django\n    celerybeat -->|Schedules tasks| celeryworker\n    flower -->|Monitors| celeryworker\n\n    frontend -->|User interface| django\n\n    classDef container fill:#e1f5fe,stroke:#01579b,stroke-width:2px;\n    classDef model fill:#fff59d,stroke:#f57f17,stroke-width:2px;\n\n    class django,postgres,redis,celeryworker,celerybeat,flower,frontend,nlm_ingestor,vector_embedder container;\n    class document,annotation,relationship,labelset,extract,datacell model;\n
"},{"location":"architecture/components/annotator/how-annotations-are-created/","title":"How Annotations are Handled","text":""},{"location":"architecture/components/annotator/how-annotations-are-created/#overview","title":"Overview","text":"

Here's a step-by-step explanation of the flow:

  1. The user selects text on the PDF by clicking and dragging the mouse. This triggers a mouse event in the Page component.
  2. The Page component checks if the Shift key is pressed.
  3. If the Shift key is not pressed, it creates a new selection and sets the selection state in the AnnotationStore.
  4. If the Shift key is pressed, it adds the selection to the selection queue in the AnnotationStore.
  5. The AnnotationStore updates its internal state with the new selection or the updated selection queue.
  6. If the Shift key is released, the Page component triggers the creation of a multi-page annotation. If the Shift key is still pressed, it waits for the next user action.
  7. To create a multi-page annotation, the Page component combines the selections from the queue.
  8. The Page component retrieves the annotation data from the PDFPageInfo object for each selected page.
  9. The Page component creates a ServerAnnotation object with the combined annotation data.
  10. The Page component calls the createAnnotation function in the AnnotationStore, passing the ServerAnnotation object.
  11. The AnnotationStore invokes the requestCreateAnnotation function in the Annotator component.
  12. The Annotator component sends a mutation to the server to create the annotation.
  13. If the server responds with success, the Annotator component updates the local state with the new annotation. If there's an error, it displays an error message.
  14. The updated annotations trigger a re-render of the relevant components, reflecting the newly created annotation on the PDF.
"},{"location":"architecture/components/annotator/how-annotations-are-created/#flowchart","title":"Flowchart","text":"
graph TD\n    A[User selects text on the PDF] -->|Mouse event| B(Page component)\n    B --> C{Is Shift key pressed?}\n    C -->|No| D[Create new selection]\n    C -->|Yes| E[Add selection to queue]\n    D --> F[Set selection state in AnnotationStore]\n    E --> G[Update selection queue in AnnotationStore]\n    F --> H{Is Shift key released?}\n    G --> H\n    H -->|Yes| I[Create multi-page annotation]\n    H -->|No| J[Wait for next user action]\n    I --> K[Combine selections from queue]\n    K --> L[Get annotation data from PDFPageInfo]\n    L --> M[Create ServerAnnotation object]\n    M --> N[Call createAnnotation in AnnotationStore]\n    N --> O[Invoke requestCreateAnnotation in Annotator]\n    O --> P[Send mutation to server]\n    P --> Q{Server response}\n    Q -->|Success| R[Update local state with new annotation]\n    Q -->|Error| S[Display error message]\n    R --> T[Re-render components with updated annotations]\n
"},{"location":"architecture/components/annotator/overview/","title":"Open Contracts Annotator Components","text":""},{"location":"architecture/components/annotator/overview/#key-questions","title":"Key Questions","text":"
  1. How is the PDF loaded?
    • The PDF is loaded in the Annotator.tsx component.
    • Inside the useEffect hook that runs when the openedDocument prop changes, the PDF loading process is initiated.
    • The pdfjsLib.getDocument function from the pdfjs-dist library is used to load the PDF file specified by openedDocument.pdfFile.
    • The loading progress is tracked using the loadingTask.onProgress callback, which updates the progress state.
    • Once the PDF is loaded, the loadingTask.promise is resolved, and the PDFDocumentProxy object is obtained.
    • The PDFPageInfo objects are created for each page of the PDF using doc.getPage(i) and stored in the pages state.

  2. Where and how are annotations loaded?
    • Annotations are loaded using the REQUEST_ANNOTATOR_DATA_FOR_DOCUMENT GraphQL query in the Annotator.tsx component.
    • The useQuery hook from Apollo Client is used to fetch the annotator data based on the provided initial_query_vars.
    • The annotator_data received from the query contains information about existing text annotations, document label annotations, and relationships.
    • The annotations are transformed into ServerAnnotation, DocTypeAnnotation, and RelationGroup objects and stored in the pdfAnnotations state using setPdfAnnotations.

  3. Where is the PAWLs layer loaded?
    • The PAWLs layer is loaded in the Annotator.tsx component.
    • Inside the useEffect hook that runs when the openedDocument prop changes, the PAWLs layer is loaded using the getPawlsLayer function from api/rest.ts.
    • The getPawlsLayer function makes an HTTP GET request to fetch the PAWLs data file specified by openedDocument.pawlsParseFile.
    • The PAWLs data is expected to be an array of PageTokens objects, which contain token information for each page of the PDF.
    • The loaded PAWLs data is then used to create PDFPageInfo objects for each page, which include the page tokens.
"},{"location":"architecture/components/annotator/overview/#high-level-components-overview","title":"High-level Components Overview","text":"
  • The Annotator component is the top-level component that manages the state and data loading for the annotator.
  • It renders the PDFView component, which is responsible for displaying the PDF and annotations.
  • The PDFView component renders various sub-components, such as LabelSelector, DocTypeLabelDisplay, AnnotatorSidebar, AnnotatorTopbar, and PDF.
  • The PDF component renders individual Page components for each page of the PDF.
  • Each Page component renders Selection and SearchResult components for annotations and search results, respectively.
  • The AnnotatorSidebar component displays the list of annotations, relations, and a search widget.
  • The PDFStore and AnnotationStore are context providers that hold the PDF and annotation data, respectively.
"},{"location":"architecture/components/annotator/overview/#specific-component-deep-dives","title":"Specific Component Deep Dives","text":""},{"location":"architecture/components/annotator/overview/#pdfviewtsx","title":"PDFView.tsx","text":"

The PDFView component is a top-level component that renders the PDF document with annotations, relations, and text search capabilities. It manages the state and functionality related to annotations, relations, and user interactions. Here's a detailed explanation of how the component works:

  1. The PDFView component receives several props, including permissions, callbacks for CRUD operations on annotations and relations, refs for container and selection elements, and various configuration options.

  2. It initializes several state variables using the useState hook, including:

  3. selectionElementRefs and searchResultElementRefs: Refs for annotation selections and search results.
  4. pageElementRefs: Refs for individual PDF pages.
  5. scrollContainerRef: Ref for the scroll container.
  6. textSearchMatches and searchText: State for text search matches and search text.
  7. selectedAnnotations and selectedRelations: State for currently selected annotations and relations.
  8. pageSelection and pageSelectionQueue: State for current page selection and queued selections.
  9. pdfPageInfoObjs: State for PDF page information objects.
  10. Various other state variables for active labels, relation modal visibility, and annotation options.

  11. The component defines several functions for updating state and handling user interactions, such as:

  12. insertSelectionElementRef, insertSearchResultElementRefs, and insertPageRef: Functions to add refs for selections, search results, and pages.
  13. onError: Error handling callback.
  14. advanceTextSearchMatch and reverseTextSearchMatch: Functions to navigate through text search matches.
  15. onRelationModalOk and onRelationModalCancel: Callbacks for relation modal actions.
  16. createMultiPageAnnotation: Function to create a multi-page annotation from queued selections.

  17. The component uses the useEffect hook to handle side effects, such as:

  18. Setting the scroll container ref on load.
  19. Listening for changes in the shift key and triggering annotation creation.
  20. Updating text search matches when the search text changes.

  21. The component renders the PDF document and its related components using the PDFStore and AnnotationStore contexts:

  22. The PDFStore context provides the PDF document, pages, and error handling.
  23. The AnnotationStore context provides annotation-related state and functions.

  24. The component renders the following main sections:

  25. LabelSelector: Allows the user to select the active label for annotations.
  26. DocTypeLabelDisplay: Displays the document type labels.
  27. AnnotatorSidebar: Sidebar component for managing annotations and relations.
  28. AnnotatorTopbar: Top bar component for additional controls and options.
  29. PDF: The actual PDF component that renders the PDF pages and annotations.

  30. The PDF component, defined in PDF.tsx, is responsible for rendering the PDF pages and annotations. It receives props from the PDFView component, such as permissions, configuration options, and callbacks.

  31. The PDF component maps over each page of the PDF document and renders a Page component for each page, passing the necessary props.

  32. The Page component, also defined in PDF.tsx, is responsible for rendering a single page of the PDF document along with its annotations and search results. It handles mouse events for creating and modifying annotations.

  33. The PDFView component also renders the RelationModal component when the active relation label is set and the user has the necessary permissions. The modal allows the user to create or modify relations between annotations.

"},{"location":"architecture/components/annotator/overview/#pdftsx","title":"PDF.tsx","text":"

PDF renders the actual PDF document with annotations and text search capabilities. PDFView (see above) is what actually interacts with the backend / API.

  1. The PDF component receives several props:
  2. shiftDown: Indicates whether the Shift key is pressed (optional).
  3. doc_permissions and corpus_permissions: Specify the permissions for the document and corpus, respectively.
  4. read_only: Determines if the component is in read-only mode.
  5. show_selected_annotation_only: Specifies whether to show only the selected annotation.
  6. show_annotation_bounding_boxes: Specifies whether to show annotation bounding boxes.
  7. show_annotation_labels: Specifies the behavior for displaying annotation labels.
  8. setJumpedToAnnotationOnLoad: A callback function to set the jumped-to annotation on load.
  9. The PDF component retrieves the PDF document and pages from the PDFStore context.
  10. It maps over each page of the PDF document and renders a Page component for each page, passing the necessary props.
  11. The Page component is responsible for rendering a single page of the PDF document along with its annotations and search results.
  12. Inside the Page component:
  13. It creates a canvas element using the useRef hook to render the PDF page.
  14. It retrieves the annotations for the current page from the AnnotationStore context.
  15. It defines a ConvertBoundsToSelections function that converts the selected bounds to annotations and tokens.
  16. It uses the useEffect hook to set up the PDF page rendering and event listeners for resizing and scrolling.
  17. It renders the PDF page canvas, annotations, search results, and queued selections.
  18. The Page component renders the following sub-components:
  19. PageAnnotationsContainer: A styled container for the page annotations.
  20. PageCanvas: A styled canvas element for rendering the PDF page.
  21. Selection: Represents a single annotation selection on the page.
  22. SearchResult: Represents a search result on the page.
  23. The Page component handles mouse events for creating and modifying annotations:
  24. On mouseDown, it initializes the selection if the necessary permissions are granted and the component is not in read-only mode.
  25. On mouseMove, it updates the selection bounds if a selection is active.
  26. On mouseUp, it adds the completed selection to the pageSelectionQueue and triggers the creation of a multi-page annotation if the Shift key is not pressed.
  27. The Page component also handles fetching more annotations for previous and next pages using the FetchMoreOnVisible component.
  28. The SelectionBoundary and SelectionTokens components are used to render the annotation boundaries and tokens, respectively.
  29. The PDFPageRenderer class is responsible for rendering a single PDF page on the canvas. It manages the rendering tasks and provides methods for canceling and rescaling the rendering.
  30. The getPageBoundsFromCanvas function calculates the bounding box of the page based on the canvas dimensions and its parent container.
"},{"location":"configuration/add-users/","title":"Add Users","text":""},{"location":"configuration/add-users/#adding-more-users","title":"Adding More Users","text":"

You can use the same User admin page described above to create new users. Alternatively, go back to the main admin page http://localhost:8000/admin and, under the User section, click the \"+Add\" button:

Then, follow the on-screen instructions:

When you're done, the username and password you provided can be used to login.

OpenContracts is currently not built to allow users to self-register unless you use the Auth0 authentication. When managing users yourself, you'll need to add, remove and modify users via the admin panels.
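If you'd rather script user creation than click through the admin, standard Django APIs work from the container's shell; a minimal sketch (the username and password below are placeholders):

    # Run inside: docker-compose -f local.yml run django python manage.py shell
    from django.contrib.auth import get_user_model

    User = get_user_model()
    User.objects.create_user(
        username="new.annotator",     # placeholder
        password="change-me-please",  # placeholder - rotate after first login
    )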

"},{"location":"configuration/choose-an-authentication-backend/","title":"Configure Authentication Backend","text":""},{"location":"configuration/choose-an-authentication-backend/#select-authentication-system-via-env-variables","title":"Select Authentication System via Env Variables","text":"

For authentication and authorization, you have two choices: (1) you can configure an Auth0 account and use Auth0 to authenticate users, in which case anyone who is permitted to authenticate via your Auth0 setup can log in and automatically get an account, or (2) you can require a username and password for each user and let the OpenContracts backend provide user authentication and authorization. With the latter option, there is no currently-supported sign-up method; you'll need to use the admin dashboard (see the "Adding Users" section).

"},{"location":"configuration/choose-an-authentication-backend/#auth0-auth-setup","title":"Auth0 Auth Setup","text":"

You need to configure three separate applications on Auth0's platform:

  1. Configure the SPA as an application. You'll need the App Client ID.
  2. Configure the API. You'll need API Audience.
  3. Configure a M2M application to access the Auth0 Management API. This is used to fetch user details. You'll need the API_ID for the M2M application and the Client Secret for the M2M app.

You'll also need your Auth0 tenant ID (assuming it's the same for all three applications, though you could, in theory, host them in different tenants). These directions are not comprehensive, so, if you're not familiar with Auth0, we recommend you disable Auth0 for the time being and use username and password.

To enable and configure Auth0 Authentication, you'll need to set the following env variables in your .env file (the .django file in .envs/.production or .envs/.local, depending on your target environment). Our sample .envs only show these fields in the .production sample, but you could use them in the .local env file too:

  1. USE_AUTH0 - set to true to enable Auth0
  2. AUTH0_CLIENT_ID - should be the client ID configured on Auth0
  3. AUTH0_API_AUDIENCE - Configured API audience
  4. AUTH0_DOMAIN - domain of your configured Auth0 application
  5. AUTH0_M2M_MANAGEMENT_API_SECRET - secret for the auth0 Machine to Machine (M2M) API
  6. AUTH0_M2M_MANAGEMENT_API_ID - ID for Auth0 Machine to Machine (M2M) API
  7. AUTH0_M2M_MANAGEMENT_GRANT_TYPE - set to client_credentials
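Put together, the relevant block of your .django env file might look something like the following; every value is a placeholder to be replaced with your own Auth0 settings:

    USE_AUTH0=true
    AUTH0_CLIENT_ID=your-spa-client-id
    AUTH0_API_AUDIENCE=https://your-api-audience
    AUTH0_DOMAIN=your-tenant.us.auth0.com
    AUTH0_M2M_MANAGEMENT_API_SECRET=your-m2m-client-secret
    AUTH0_M2M_MANAGEMENT_API_ID=your-m2m-client-id
    AUTH0_M2M_MANAGEMENT_GRANT_TYPE=client_credentials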
"},{"location":"configuration/choose-an-authentication-backend/#detailed-explanation-of-auth0-implementation","title":"Detailed Explanation of Auth0 Implementation","text":"

To get Auth0 to work nicely with Graphene, we modified the graphql_jwt backend to support syncing remote user metadata with a local user, similar to the default Django RemoteUserMiddleware. We're keeping the graphql_jwt graphene middleware in its entirety, as it fetches the token and then passes it along to the Django authentication backend. That Django backend is what we're modifying to decode the JWT token against Auth0 settings and then check to see if a local user exists and, if not, create it.

Here's the order of operations in the original Graphene backend provided by graphql_jwt:

  1. Backend's authenticate method is called from the graphene middleware via django (from django.contrib.auth import authenticate)
  2. token is retrieved via utils.get_credentials
  3. if token is not None, get_user_by_token in shortcuts module is called
    1. \"Payload\" is retrieved via utils.get_payload
    2. User is requested via utils.get_user_by_payload
    3. username is retrieved from payload via auth0_settings.JWT_PAYLOAD_GET_USERNAME_HANDLER
    4. user object is retrieved via auth0_settings.JWT_GET_USER_BY_NATURAL_KEY_HANDLER

We modified a couple of things:

  1. The decode method called in 3(a) needs to be modified to decode with Auth0 secrets and settings.
  2. get_user_by_payload needs to be modified in several ways:
    1. The user object must use RemoteUserMiddleware-style logic: if everything from Auth0 decodes properly, check whether a user with that e-mail exists and, if not, create it. Upon completion of this, try to sync user data with Auth0.
    2. Return the created or retrieved user object as the original method did.
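Conceptually, the modified user lookup ends up doing something like the following - a simplified sketch of the flow described above, not the actual OpenContracts code (the payload field names are assumptions):

    from django.contrib.auth import get_user_model

    def get_or_create_user_from_payload(payload: dict):
        """Mimic RemoteUserMiddleware: trust the decoded Auth0 payload and
        provision a local user on first login. Field names are illustrative."""
        User = get_user_model()
        email = payload["email"]  # assumes an email claim is present in the decoded token
        user, created = User.objects.get_or_create(
            email=email,
            defaults={"username": email},
        )
        if created:
            # The real implementation also syncs profile data from the Auth0
            # Management API (via the M2M application configured above).
            pass
        return user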
"},{"location":"configuration/choose-an-authentication-backend/#django-based-authentication-setup","title":"Django-Based Authentication Setup","text":"

The only thing you need to do for this is toggle the two Auth0-related environment variables:

  1. For the backend environment, set USE_AUTH0=False in your environment (either via an environment variable file or directly in your environment via the console).
  2. For the frontend environment, set REACT_APP_USE_AUTH0=false in your environment (either via an environment variable file or directly in your environment via the console).
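In env-file terms, that's just the following two lines (the backend value goes in your .django env file, the frontend value in the ./frontend .env file):

    # backend (.envs/.local/.django or .envs/.production/.django)
    USE_AUTH0=False

    # frontend (./frontend/.env)
    REACT_APP_USE_AUTH0=false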

Note

As noted elsewhere, users cannot sign up on their own. You need to log into the admin dashboard - e.g. http://localhost:8000/admin - and add users manually.

"},{"location":"configuration/choose-and-configure-docker-stack/","title":"Choose and Configure Docker Compose Stack","text":""},{"location":"configuration/choose-and-configure-docker-stack/#deployment-options","title":"Deployment Options","text":"

OpenContracts is designed to be deployed using docker-compose. You can run it locally or in a production environment. Follow the instructions below for a local environment if you just want to test it or you want to use it for yourself and don't intend to make the application available to other users via the Internet.

"},{"location":"configuration/choose-and-configure-docker-stack/#local-deployment","title":"Local Deployment","text":""},{"location":"configuration/choose-and-configure-docker-stack/#quick-start-with-default-settings","title":"Quick Start with Default Settings","text":"

A \"local\" deployment is deployed on your personal computer and is not meant to be accessed over the Internet. If you don't need to configure anything, just follow the quick start guide above to get up and running with a local deployment without needing any further configuration.

"},{"location":"configuration/choose-and-configure-docker-stack/#setup-env-files","title":"Setup .env Files","text":""},{"location":"configuration/choose-and-configure-docker-stack/#backend","title":"Backend","text":"

After cloning this repo to a machine of your choice, create a folder for your environment files in the repo root. You'll need ./.envs/.local/.django and ./.envs/.local/.postgres. Use the samples in ./documentation/sample_env_files/local as guidance. NOTE, you'll need to replace the placeholder passwords and users where noted, but, otherwise, minimal config should be required.

"},{"location":"configuration/choose-and-configure-docker-stack/#frontend","title":"Frontend","text":"

In the ./frontend folder, you also need to create a single .env file which holds your configurations for your login method as well as certain feature switches (e.g. turn off imports). We've included a sample using auth0 and another sample using django's auth backend. Local vs production deployments are essentially the same, but the root url of the backend will change from localhost to wherever you're hosting the application in production.

"},{"location":"configuration/choose-and-configure-docker-stack/#build-the-stack","title":"Build the Stack","text":"

Once your .env files are setup, build the stack using docker-compose:

$ docker-compose -f local.yml build

Then, run migrations (to setup the database):

$ docker-compose -f local.yml run django python manage.py migrate

Then, create a superuser account that can log in to the admin dashboard (in a local deployment this is available at http://localhost:8000/admin) by typing this command and following the prompts:

$ docker-compose -f local.yml run django python manage.py createsuperuser\n

Finally, bring up the stack:

$ docker-compose -f local.yml up\n

You should now be able to access the OpenContracts frontend by visiting http://localhost:3000.

"},{"location":"configuration/choose-and-configure-docker-stack/#production-environment","title":"Production Environment","text":"

The production environment is designed to be public-facing and exposed to the Internet, so there are quite a few more configurations required than for a local deployment, particularly if you use an AWS S3 storage backend or the Auth0 authentication system.

After cloning this repo to a machine of your choice, configure the production .env files as described above.

You'll also need to configure your website URL. This needs to be done in a few places.

First, in opencontractserver/contrib/migrations, you'll find a file called 0003_set_site_domain_and_name.py. BEFORE running any of your migrations, you should modify the domain and name defaults you'll find in update_site_forward:

def update_site_forward(apps, schema_editor):
    """Set site domain and name."""
    Site = apps.get_model("sites", "Site")
    Site.objects.update_or_create(
        id=settings.SITE_ID,
        defaults={
            "domain": "opencontracts.opensource.legal",
            "name": "OpenContractServer",
        },
    )

and update_site_backward:

def update_site_backward(apps, schema_editor):
    """Revert site domain and name to default."""
    Site = apps.get_model("sites", "Site")
    Site.objects.update_or_create(
        id=settings.SITE_ID,
        defaults={"domain": "example.com", "name": "example.com"},
    )

Finally, don't forget to configure Traefik, the router in the docker-compose stack that exposes different containers to end-users depending on the route (URL) received. You need to update the Traefik config file here.

If you're using Auth0, see the Auth0 configuration section.

If you're using AWS S3 for file storage, see the AWS configuration section. NOTE, the underlying django library that provides cloud storage, django-storages, can also work with other cloud providers such as Azure and GCP. See the django storages library docs for more info.

Once your production .env files, site domain, and Traefik config are in place, build the stack:

$ docker-compose -f production.yml build\n

Then, run migrations (to setup the database):

$ docker-compose -f production.yml run django python manage.py migrate\n

Then, create a superuser account that can log in to the admin dashboard (in a production deployment this is available at the url set in your env file as the DJANGO_ADMIN_URL) by typing this command and following the prompts:

$ docker-compose -f production.yml run django python manage.py createsuperuser\n

Finally, bring up the stack:

$ docker-compose -f production.yml up\n

You should now be able to access the OpenContracts frontend at the production domain you configured.

"},{"location":"configuration/choose-and-configure-docker-stack/#env-file-configurations","title":"ENV File Configurations","text":"

OpenContracts is configured via .env files. For a local deployment, these should go in .envs/.local. For production, use .envs/.production. Sample .envs for each deployment environment are provided in documentation/sample_env_files.

The local configuration should let you deploy the application on your PC without requiring any specific configuration. The production configuration is meant to provide a web application and requires quite a bit more configuration and knowledge of web apps.

"},{"location":"configuration/choose-and-configure-docker-stack/#include-gremlin","title":"Include Gremlin","text":"

If you want to include a Gremlin analyzer, use local_deploy_with_gremlin.yml or production_deploy_with_gremlin.yml instead of local.yml or production.yml, respectively. All other parts of the tutorial are the same.

"},{"location":"configuration/choose-storage-backend/","title":"Configure Storage Backend","text":""},{"location":"configuration/choose-storage-backend/#select-and-setup-storage-backend","title":"Select and Setup Storage Backend","text":"

You can use Amazon S3 as a file storage backend (if you set the env flag USE_AWS=True, more on that below), or you can use the local storage of the host machine via a Docker volume.

"},{"location":"configuration/choose-storage-backend/#aws-storage-backend","title":"AWS Storage Backend","text":"

If you want to use AWS S3 to store files (primarily PDFs, but also exports, tokens and txt files), you will need an Amazon AWS account to set up S3. This guide does not cover the AWS side of configuration, but there are a number of tutorials and guides to getting AWS configured to be used with a Django project.

Once you have an S3 bucket configured, you'll need to set the following env variables in your .env file (the .django file in .envs/.production or .envs/.local, depending on your target environment). Our sample .envs only show these fields in the .production samples, but you could use them in the .local env file too.

Here are the variables you need to set to enable AWS S3 storage:

  1. USE_AWS - set to true since you're using AWS, otherwise the backend will use a docker volume for storage.
  2. AWS_ACCESS_KEY_ID - the access key ID created by AWS when you set up your IAM user (see tutorials above).
  3. AWS_SECRET_ACCESS_KEY - the secret access key created by AWS when you set up your IAM user (see tutorials above)
  4. AWS_STORAGE_BUCKET_NAME - the name of the AWS bucket you created to hold the files.
  5. AWS_S3_REGION_NAME - the region of the AWS bucket you configured.
"},{"location":"configuration/choose-storage-backend/#django-storage-backend","title":"Django Storage Backend","text":"

Setting USE_AWS=false will use the disk space in the django container. When using the local docker compose stack, the celery workers and django containers share the same disk, so this works fine. Our production configuration would not work properly with USE_AWS=false, however, as each container has its own disk.
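
For orientation, here is a minimal sketch of how this kind of switch is commonly wired up in Django settings with django-storages. It is illustrative only - the setting names OpenContracts actually uses may differ - but it shows how the env variables above typically map onto storage configuration:

# Illustrative sketch only -- actual OpenContracts settings may differ\nimport os\n\nUSE_AWS = os.environ.get('USE_AWS', 'false').lower() == 'true'\n\nif USE_AWS:\n    # Store uploaded files in S3 via django-storages\n    DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'\n    AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']\n    AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']\n    AWS_STORAGE_BUCKET_NAME = os.environ['AWS_STORAGE_BUCKET_NAME']\n    AWS_S3_REGION_NAME = os.environ['AWS_S3_REGION_NAME']\nelse:\n    # Fall back to the container's local disk (a Docker volume)\n    DEFAULT_FILE_STORAGE = 'django.core.files.storage.FileSystemStorage'\n    MEDIA_ROOT = '/app/media'\n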

"},{"location":"configuration/configure-admin-users/","title":"Configure Admin Users","text":""},{"location":"configuration/configure-admin-users/#gremlin-admin-dashboard","title":"Gremlin Admin Dashboard","text":"

Gremlin's backend is built on Django, which has its own powerful admin dashboard. This dashboard is not meant for end-users and should only be used by admins. You can access the admin dashboard by going to the /admin page - e.g. opencontracts.opensource.legal/admin or http://localhost:8000/admin. For the most part, you shouldn't need to use the admin dashboard and should only go in here if you're experiencing errors or unexpected behavior and want to look at the detailed contents of the database to see if it sheds any light on what's happening with a given corpus, document, etc.

By default, Gremlin creates an admin user for you. If you don't specify the username and password in your environment on first boot, it'll use system defaults. You can customize the default username and password via environment variables or after the system boots using the admin dash.

"},{"location":"configuration/configure-admin-users/#configure-username-and-password-prior-to-first-deployment","title":"Configure Username and Password Prior to First Deployment","text":"

If the variable DJANGO_SUPERUSER_USERNAME is set, that will be the default admin user created on startup (the first time you run docker-compose -f local.yml up). The repo ships with a default superuser username of admin. The default password is set using the DJANGO_SUPERUSER_PASSWORD variable. The environment files for local deployments (but not production) include a default password of Openc0ntracts_def@ult. You should change this in the environment file before the first start OR follow the instructions below to change it after the first start.

If you modify these environment variables in the environment file BEFORE running the docker-compose up command for the first time, your initial superuser will have the username, email and/or password you specify. If you don't modify the defaults, you can change them after you have created them via the admin dashboard (see below).
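
For reference, that first-boot bootstrap is conceptually just a guarded create_superuser call driven by those env variables. A hedged sketch of the pattern (the actual bootstrap code in the repo may differ):

# Hedged sketch of a first-boot superuser bootstrap driven by env vars (illustrative only)\nimport os\nfrom django.contrib.auth import get_user_model\n\nUser = get_user_model()\nusername = os.environ.get('DJANGO_SUPERUSER_USERNAME', 'admin')\npassword = os.environ.get('DJANGO_SUPERUSER_PASSWORD', 'Openc0ntracts_def@ult')\n\n# Only create the account once; later password changes happen via the admin dashboard\nif not User.objects.filter(username=username).exists():\n    User.objects.create_superuser(username=username, password=password)\n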

"},{"location":"configuration/configure-admin-users/#after-first-deployment-via-admin-dashboard","title":"After First Deployment via Admin Dashboard","text":"

Once the default superuser has been created, you'll need to use the admin dashboard to modify it.

To manage users, including changing the password, you'll need to access the backend admin dashboard. OpenContracts is built on Django, which ships with Django Admin, a tool to manage low-level object data and users. It doesn't provide the rich, document-focused UI/UX our frontend does, but it does let you edit and delete objects created on the frontend if, for any reason, you are unable to fix something done by a frontend user (e.g. a corrupt file is uploaded and cannot be parsed or rendered properly on the frontend).

To update your users, first login to the admin panel:

Then, in the lefthand navbar, find the entry for \"Users\" and click on it

Then, you'll see a list of all users for this instance. You should see your admin user and an \"Anonymous\" user. The Anonymous user is required for public browsing of objects with their is_public field set to True. The Anonymous user cannot see other objects.

Click on the admin user to bring up the detailed user view:

Now you can click the \"WHAT AM I CALLED\" button to bring up a dialog to change the user password.

"},{"location":"configuration/configure-gremlin/","title":"Configure Gremlin Analyzer","text":"

Gremlin is a separate project by OpenSource Legal to provide a standard API to access NLP capabilities. This lets us wrap multiple NLP engines / techniques in the same API, which in turn lets us build tools that can readily consume the outputs of very different NLP libraries (e.g. a Transformers-based model like BERT, and tools like spaCy and LexNLP, can all be deployed on Gremlin, and the outputs from all three can readily be rendered in OpenContracts).

OpenContracts is designed to work with Gremlin out-of-the-box. We have a sample compose yaml file showing how to do this on a local machine local_deploy_with_gremlin.yaml and as a web-facing application production_deploy_with_gremlin.yaml.

When you add a new Gremlin Engine to the database, OpenContracts will automatically query it for its installed analyzers and labels. These will then be available within OpenContracts, and you can use an analyzer to analyze any OpenContracts corpus.

While we have plans to automatically \"install\" the default Gremlin on first boot, currently you must manually go into the OpenContracts admin dash and add the Gremlin. Thankfully, this is an easy process:

  1. In your environment file, make sure you set CALLBACK_ROOT_URL_FOR_ANALYZER
    1. For local deploy, use CALLBACK_ROOT_URL_FOR_ANALYZER=http://localhost:8000
    2. For production deploy, use http://django:5000. Why the change? Well, in our local docker compose stack, the host is localhost and the Django development server runs on port 8000. In production, we want Gremlin to communicate with the OpenContracts container (\"django\") via its hostname on the docker compose stack's network. The production OpenContracts container also uses gunicorn on port 5000 instead of the development server on port 8000, so the port changes too.
  2. Go to the admin page:
  3. Click \"Add+\" in the Gremlin row to bring up the Add Gremlin Engine form. You just need to set the creator Url fields (the url for our default config is http://gremlinengine:5000). If, for some reason, you don't want the analyzer to be visible to any unauthenticated user, unselect the is_public box :
  4. This will automatically kick off an install process that runs in the background. When it's complete, you'll see the \"Install Completed\" Field change. It should take a second or two. At the moment, we don't handle errors in this process, so, if it doesn't complete successfully in 30 seconds, there is probably a misconfiguration somewhere. We plan to improve our error handling for these backend installation processes.

Note, in our example implementations, Gremlin is NOT encrypted or API Key secured to outside traffic. It's not exposed to outside traffic either per our docker compose config, so this shouldn't be a major concern. If you do expose the container to the host via your Docker Compose file, you should ensure you run the traffic through Traefik and set up API Key authentication.

"},{"location":"configuration/frontend-configuration/","title":"Frontend Configuration","text":""},{"location":"configuration/frontend-configuration/#why","title":"Why?","text":"

The frontend configuration variables should not be secrets as there is no way to keep them secure on the frontend. That said, being able to specify certain configurations via environment variables makes configuration and deployment much easier.

"},{"location":"configuration/frontend-configuration/#what-can-be-configured","title":"What Can be Configured?","text":"

Our frontend config file should look like this (The OPEN_CONTRACTS_ prefixes are necessary to get the env variables injected into the frontend container. The env variable that shows up on window._env_ in the React frontend will omit the prefix, however - e.g. OPEN_CONTRACTS_REACT_APP_APPLICATION_DOMAIN will show up as REACT_APP_APPLICATION_DOMAIN):

OPEN_CONTRACTS_REACT_APP_APPLICATION_DOMAIN=\nOPEN_CONTRACTS_REACT_APP_APPLICATION_CLIENT_ID=\nOPEN_CONTRACTS_REACT_APP_AUDIENCE=http://localhost:3000\nOPEN_CONTRACTS_REACT_APP_API_ROOT_URL=https://opencontracts.opensource.legal\n\n# Uncomment to use Auth0 (you must then set the DOMAIN and CLIENT_ID envs above\n# OPEN_CONTRACTS_REACT_APP_USE_AUTH0=true\n\n# Uncomment to enable access to analyzers via the frontend\n# OPEN_CONTRACTS_REACT_APP_USE_ANALYZERS=true\n\n# Uncomment to enable access to import functionality via the frontend\n# OPEN_CONTRACTS_REACT_APP_ALLOW_IMPORTS=true\n

At the moment, there are three key configurations:

  1. OPEN_CONTRACTS_REACT_APP_USE_AUTH0 - uncomment this / set it to true to switch the frontend login components and auth flow from Django password auth to Auth0 OAuth2. If this is true, you also need to provide valid configurations for OPEN_CONTRACTS_REACT_APP_APPLICATION_DOMAIN, OPEN_CONTRACTS_REACT_APP_APPLICATION_CLIENT_ID, and OPEN_CONTRACTS_REACT_APP_AUDIENCE. These are configured on the Auth0 platform. We don't have a walkthrough for that at the moment.
  2. OPEN_CONTRACTS_REACT_APP_USE_ANALYZERS - allow users to see and use analyzers. False on the demo deployment.
  3. OPEN_CONTRACTS_REACT_APP_ALLOW_IMPORTS - let people upload OpenContracts export zip files and attempt to import them. Not recommended on truly public installations as securing this will be challenging. Internal to an org should be OK, but still use caution.

"},{"location":"configuration/frontend-configuration/#how-to-configure","title":"How to Configure","text":""},{"location":"configuration/frontend-configuration/#method-1-using-an-env-file","title":"Method 1: Using an .env File","text":"

This method involves using a .env file that Docker Compose automatically picks up.

"},{"location":"configuration/frontend-configuration/#steps","title":"Steps:","text":"
  1. Create a file named .env in the same directory as your docker-compose.yml file.
  2. Copy the contents of your environment variable file into this .env file.
  3. In your docker-compose.yml, you don't need to explicitly specify the env file.
"},{"location":"configuration/frontend-configuration/#example-docker-composeyml","title":"Example docker-compose.yml:","text":"
version: '3'\nservices:\n  frontend:\n    build: ./frontend\n    ports:\n      - \"3000:3000\"\n    # No need to specify env_file here\n
"},{"location":"configuration/frontend-configuration/#pros","title":"Pros:","text":"
  • Simple setup
  • Docker Compose automatically uses the .env file
  • Easy to version control (if desired)
"},{"location":"configuration/frontend-configuration/#cons","title":"Cons:","text":"
  • All services defined in the Docker Compose file will have access to these variables
  • May not be suitable if you need different env files for different services
"},{"location":"configuration/frontend-configuration/#method-2-using-env_file-in-docker-compose","title":"Method 2: Using env_file in Docker Compose","text":"

This method allows you to specify a custom named env file for each service.

"},{"location":"configuration/frontend-configuration/#steps_1","title":"Steps:","text":"
  1. Keep your existing .env file (or rename it if desired).
  2. In your docker-compose.yml, specify the env file using the env_file key.
"},{"location":"configuration/frontend-configuration/#example-docker-composeyml_1","title":"Example docker-compose.yml:","text":"
version: '3'\nservices:\n  frontend:\n    build: ./frontend\n    ports:\n      - \"3000:3000\"\n    env_file:\n      - ./.env  # or your custom named file\n
"},{"location":"configuration/frontend-configuration/#pros_1","title":"Pros:","text":"
  • Allows using different env files for different services
  • More explicit than relying on the default .env file
"},{"location":"configuration/frontend-configuration/#cons_1","title":"Cons:","text":"
  • Requires specifying the env file in the Docker Compose file
"},{"location":"configuration/frontend-configuration/#method-3-defining-environment-variables-directly-in-docker-compose","title":"Method 3: Defining Environment Variables Directly in Docker Compose","text":"

This method involves defining the environment variables directly in the docker-compose.yml file.

"},{"location":"configuration/frontend-configuration/#steps_2","title":"Steps:","text":"
  1. In your docker-compose.yml, use the environment key to define variables.
"},{"location":"configuration/frontend-configuration/#example-docker-composeyml_2","title":"Example docker-compose.yml:","text":"
version: '3'\nservices:\n  frontend:\n    build: ./frontend\n    ports:\n      - \"3000:3000\"\n    environment:\n      - OPEN_CONTRACTS_REACT_APP_APPLICATION_DOMAIN=yourdomain.com\n      - OPEN_CONTRACTS_REACT_APP_APPLICATION_CLIENT_ID=your_client_id\n      - OPEN_CONTRACTS_REACT_APP_AUDIENCE=http://localhost:3000\n      - OPEN_CONTRACTS_REACT_APP_API_ROOT_URL=https://opencontracts.opensource.legal\n      - OPEN_CONTRACTS_REACT_APP_USE_AUTH0=true\n      - OPEN_CONTRACTS_REACT_APP_USE_ANALYZERS=true\n      - OPEN_CONTRACTS_REACT_APP_ALLOW_IMPORTS=true\n
"},{"location":"configuration/frontend-configuration/#pros_2","title":"Pros:","text":"
  • All configuration is in one file
  • Easy to see all environment variables at a glance
"},{"location":"configuration/frontend-configuration/#cons_2","title":"Cons:","text":"
  • Can make the docker-compose.yml file long and harder to manage
  • Sensitive information in the Docker Compose file may be a security risk
"},{"location":"configuration/frontend-configuration/#method-4-combining-env_file-and-environment","title":"Method 4: Combining env_file and environment","text":"

This method allows you to use an env file for most variables and override or add specific ones in the Docker Compose file.

"},{"location":"configuration/frontend-configuration/#steps_3","title":"Steps:","text":"
  1. Keep your .env file with most variables.
  2. In docker-compose.yml, use both env_file and environment.
"},{"location":"configuration/frontend-configuration/#example-docker-composeyml_3","title":"Example docker-compose.yml:","text":"
version: '3'\nservices:\n  frontend:\n    build: ./frontend\n    ports:\n      - \"3000:3000\"\n    env_file:\n      - ./.env\n    environment:\n      - REACT_APP_USE_AUTH0=true\n      - REACT_APP_USE_ANALYZERS=true\n      - REACT_APP_ALLOW_IMPORTS=true\n
"},{"location":"configuration/frontend-configuration/#pros_3","title":"Pros:","text":"
  • Flexibility to use env files and override when needed
  • Can keep sensitive info in env file and non-sensitive in Docker Compose
"},{"location":"configuration/frontend-configuration/#cons_3","title":"Cons:","text":"
  • Need to be careful about precedence (Docker Compose values override env file)
"},{"location":"development/documentation/","title":"Documentation","text":""},{"location":"development/documentation/#documentation-stack","title":"Documentation Stack","text":"

We're using mkdocs to render our markdown into pretty, bite-sized pieces. The markdown lives in /docs in our repo. If you want to work on the docs you'll need to install the requirements in /requirements/docs.txt.

To have a live server while working on them, type:

mkdocs serve\n
"},{"location":"development/documentation/#building-docs","title":"Building Docs","text":"

To build a html website from your markdown that can be uploaded to a webhost (or a GitHub Page), just type:

mkdocs build\n
"},{"location":"development/documentation/#deploying-to-gh-page","title":"Deploying to GH Page","text":"

mkdocs makes it super easy to deploy your docs to a GitHub page.

Just run:

mkdocs gh-deploy\n
"},{"location":"development/environment/","title":"Dev Environment","text":"

We use Black and Flake8 for Python code styling. These are run via pre-commit before all commits. If you want to develop extensions or code based on OpenContracts, you'll need to set up pre-commit. First, make sure the requirements in ./requirements/local.txt are installed in your local environment.

Then, install pre-commit into your local git repo. From the root of the repo, run:

 $ pre-commit install\n
If you want to run pre-commit manually on all the code in the repo, use this command:

 $ pre-commit run --all-files\n

When you commit changes to your repo or our repo as a PR, pre-commit will run and ensure your code follows our style guide and passes linting.

"},{"location":"development/frontend-notes/","title":"Frontend Notes","text":""},{"location":"development/frontend-notes/#responsive-layout","title":"Responsive Layout","text":"

The application was primarily designed to be viewed around 1080p. We've built in some quick-and-dirty fixes (honestly, hacks) to display a usable layout at other resolutions. A more thorough redesign / refactor is in order, again if there's sufficient interest. What's available now should handle a lot of situations OK. If you find the performance / layout is not looking great at your given resolution, try using a desktop browser at a 1080p resolution.

"},{"location":"development/frontend-notes/#no-test-suite","title":"No Test Suite","text":"

As of our initial release, the test suite only tests the backend (and coverage is admittedly not as robust as we'd like). We'd like to add tests for the frontend, though this is a fairly large undertaking. We welcome any contributions on this front!

"},{"location":"development/test-suite/","title":"Test Suite","text":"

Our test suite is a bit sparse, but we're working to improve coverage on the backend. Frontend tests will likely take longer to implement. Our existing tests do test imports and a number of the utility functions for manipulating annotations. These tests are integrated in our GitHub actions.

NOTE: use Python 3.10 or above, as pydantic and certain pre-3.10 type annotations do not play well together. Using from __future__ import annotations doesn't always solve the problem, and upgrading to Python 3.10 was a lot easier than trying to figure out why the from __future__ import didn't behave as expected.
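
For example, a PEP 604 union annotation on a pydantic model is exactly the kind of thing that breaks on older interpreters, because pydantic evaluates annotations at runtime (illustrative snippet, not from the codebase):

# Illustrative only: the class of annotation that breaks on Python < 3.10\nfrom pydantic import BaseModel\n\nclass Cell(BaseModel):\n    # Fine on Python 3.10+. On 3.9, pydantic resolves this annotation at runtime and\n    # `int | None` raises TypeError -- even with `from __future__ import annotations`.\n    value: int | None = None\n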

To run the tests, check your test coverage, and generate an HTML coverage report:

 $ docker-compose -f local.yml run django coverage run -m pytest\n $ docker-compose -f local.yml run django coverage html\n $ open htmlcov/index.html\n

To run a specific test (e.g. test_analyzers):

 $ sudo docker-compose -f local.yml run django python manage.py test opencontractserver.tests.test_analyzers --noinput\n
"},{"location":"extract_and_retrieval/document_data_extract/","title":"Extracting Structured Data from Documents using LlamaIndex, AI Agents, and Marvin","text":"

We've added a powerful feature called \"extract\" that enables the generation of structured data grids from a list of documents using a combination of vector search, AI agents, and the Marvin library.

The run_extract task orchestrates the extraction process, spinning up a number of llama_index_doc_query tasks. Each of these query tasks uses LlamaIndex (backed by Django and pgvector) for vector search and retrieval, and Marvin for data parsing and extraction. Each document and column is processed in parallel using Celery's task system.

All credit for the inspiration of this feature goes to the fine folks at Nlmatics. They were some of the first pioneers working on building datagrids from documents using a set of questions and custom transformer models. This implementation of their concept ultimately leverages newer techniques and better models, but hats off to them for coming up with a design like this in 2017/2018!

The current implementation relies heavily on LlamaIndex, specifically their vector store tooling, their reranker and their agent framework.

Structured data extraction is powered by the amazing Marvin library.
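
As a rough illustration of the two Marvin operations the extract pipeline relies on (a minimal sketch with made-up example data, not the project's actual code):

# Minimal sketch of the two Marvin calls used downstream (illustrative only)\nimport marvin\nfrom pydantic import BaseModel\n\nclass Party(BaseModel):\n    name: str\n    role: str\n\ntext = 'This Agreement is between Acme Corp (the Buyer) and Widget LLC (the Seller).'\n\n# marvin.extract pulls a list of structured values out of unstructured text\nparties = marvin.extract(text, target=Party)\n\n# marvin.cast coerces text into a single typed value, optionally guided by instructions\nbuyer = marvin.cast(text, target=str, instructions='Return the name of the Buyer only')\n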

"},{"location":"extract_and_retrieval/document_data_extract/#overview","title":"Overview","text":"

The extract process involves the following key components:

  1. Document Corpus: A collection of documents from which structured data will be extracted.
  2. Fieldset: A set of columns defining the structure of the data to be extracted.
  3. LlamaIndex: A library used for efficient vector search and retrieval of relevant document sections.
  4. AI Agents: Intelligent agents that analyze the retrieved document sections and extract structured data.
  5. Marvin: A library that facilitates the parsing and extraction of structured data from text.

The extract process is initiated by creating an Extract object that specifies the document corpus and the fieldset defining the desired data structure. The process is then broken down into individual tasks for each document and column combination, allowing for parallel processing and scalability.

"},{"location":"extract_and_retrieval/document_data_extract/#detailed-walkthrough","title":"Detailed Walkthrough","text":"

Here's how the extract process works step by step.

"},{"location":"extract_and_retrieval/document_data_extract/#1-initiating-the-extract-process","title":"1. Initiating the Extract Process","text":"

The run_extract function is the entry point for initiating the extract process. It takes the extract_id and user_id as parameters and performs the following steps:

  1. Retrieves the Extract object from the database based on the provided extract_id.
  2. Sets the started timestamp of the extract to the current time.
  3. Retrieves the fieldset associated with the extract, which defines the columns of the structured data grid.
  4. Retrieves the list of document IDs associated with the extract.
  5. Creates Datacell objects for each document and column combination, representing the individual cells in the structured data grid.
  6. Sets the appropriate permissions for each Datacell object based on the user's permissions.
  7. Kicks off the processing job for each Datacell by appending a task to the Celery task queue.
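
Schematically, that fan-out / fan-in pattern looks roughly like the sketch below. The Celery primitives (immutable signatures and chord) are real; the model field names and task arguments are assumptions based on the description above, not the exact source:

# Simplified sketch of the fan-out / fan-in in run_extract (field names are assumptions)\nfrom celery import chord\n\ntasks = []\nfor document_id in document_ids:\n    for column in fieldset.columns.all():\n        cell = Datacell.objects.create(\n            extract=extract, column=column, creator=user, document_id=document_id,\n        )\n        tasks.append(llama_index_doc_query.si(cell.id))\n\n# Run every per-cell task in parallel, then mark the whole extract complete\nchord(tasks)(mark_extract_complete.si(extract.id))\n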
"},{"location":"extract_and_retrieval/document_data_extract/#2-processing-individual-datacells","title":"2. Processing Individual Datacells","text":"

The llama_index_doc_query function is responsible for processing each individual Datacell.

"},{"location":"extract_and_retrieval/document_data_extract/#execution-flow-visualized","title":"Execution Flow Visualized:","text":"
graph TD\n    I[llama_index_doc_query] --> J[Retrieve Datacell]\n    J --> K[Create HuggingFaceEmbedding]\n    K --> L[Create OpenAI LLM]\n    L --> M[Create DjangoAnnotationVectorStore]\n    M --> N[Create VectorStoreIndex]\n    N --> O{Special character '|||' in search_text?}\n    O -- Yes --> P[Split examples and average embeddings]\n    P --> Q[Query annotations using averaged embeddings]\n    Q --> R[Rerank nodes using SentenceTransformerRerank]\n    O -- No --> S[Retrieve results using index retriever]\n    S --> T[Rerank nodes using SentenceTransformerRerank]\n    R --> U{Column is agentic?}\n    T --> U\n    U -- Yes --> V[Create QueryEngineTool]\n    V --> W[Create FunctionCallingAgentWorker]\n    W --> X[Create StructuredPlannerAgent]\n    X --> Y[Query agent for definitions]\n    U -- No --> Z{Extract is list?}\n    Y --> Z\n    Z -- Yes --> AA[Extract with Marvin]\n    Z -- No --> AB[Cast with Marvin]\n    AA --> AC[Save result to Datacell]\n    AB --> AC\n    AC --> AD[Mark Datacell complete]\n
"},{"location":"extract_and_retrieval/document_data_extract/#step-by-step-walkthrough","title":"Step-by-step Walkthrough","text":"
  1. The run_extract task is called with an extract_id and user_id. It retrieves the corresponding Extract object and marks it as started.

  2. It then iterates over the document IDs associated with the extract. For each document and each column in the extract's fieldset, it:

    1. Creates a new Datacell object with the extract, column, output type, creator, and document.
    2. Sets CRUD permissions for the datacell to the user.
    3. Appends a llama_index_doc_query task to a list of tasks, passing the datacell ID.

  3. After all datacells are created and their tasks added to the list, a Celery chord is used to group the tasks. Once all tasks are complete, it calls the mark_extract_complete task to mark the extract as finished.

  4. The llama_index_doc_query task processes each individual datacell. It:

    1. Retrieves the datacell and marks it as started.
    2. Creates a HuggingFaceEmbedding model and sets it as the Settings.embed_model.
    3. Creates an OpenAI LLM and sets it as the Settings.llm.
    4. Creates a DjangoAnnotationVectorStore from the document ID and column settings.
    5. Creates a VectorStoreIndex from the vector store.

  5. If the search_text contains the special character '|||':

    1. It splits the examples and calculates the embeddings for each example.
    2. It calculates the average embedding from the individual embeddings.
    3. It queries the Annotation objects using the averaged embeddings and orders them by cosine distance.
    4. It reranks the nodes using SentenceTransformerRerank and retrieves the top-n nodes.
    5. It adds the annotation IDs of the reranked nodes to the datacell's sources.
    6. It retrieves the text from the reranked nodes.

  6. If the search_text does not contain the special character '|||':

    1. It retrieves the relevant annotations using the index retriever based on the search_text or query.
    2. It reranks the nodes using SentenceTransformerRerank and retrieves the top-n nodes.
    3. It adds the annotation IDs of the reranked nodes to the datacell's sources.
    4. It retrieves the text from the retrieved nodes.

  7. If the column is marked as agentic:

    1. It creates a QueryEngineTool, FunctionCallingAgentWorker, and StructuredPlannerAgent.
    2. It queries the agent to find defined terms and section references in the retrieved text.
    3. The definitions and section text are added to the retrieved text.

  8. Depending on whether the column's extract_is_list is true, it either:

    1. Extracts a list of the output_type from the retrieved text using Marvin, with optional instructions or query.
    2. Casts the retrieved text to the output_type using Marvin, with optional instructions or query.

  9. The result is saved to the datacell's data field based on the output_type. The datacell is marked as completed.

  10. If an exception occurs during processing, the error is logged, saved to the datacell's stacktrace, and the datacell is marked as failed.

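Condensed into code, the heart of that per-cell logic looks something like the sketch below. It is simplified and illustrative - helper names like query_annotations_by_embedding and the column field names are assumptions, not the actual source - but it shows how the example-averaging, retrieval, and Marvin steps fit together:

# Condensed, illustrative sketch of the per-cell retrieval + structuring flow\nimport numpy as np\nimport marvin\nfrom llama_index.core import Settings, VectorStoreIndex\n\nindex = VectorStoreIndex.from_vector_store(vector_store)  # a DjangoAnnotationVectorStore\n\nif '|||' in (search_text or ''):\n    # Split the provided examples and average their embeddings...\n    examples = search_text.split('|||')\n    vectors = [Settings.embed_model.get_text_embedding(e) for e in examples]\n    avg_embedding = np.mean(vectors, axis=0)\n    # ...then rank annotations by cosine distance to the averaged vector (hypothetical helper)\n    nodes = query_annotations_by_embedding(avg_embedding)\nelse:\n    nodes = index.as_retriever(similarity_top_k=10).retrieve(search_text or query)\n\nretrieved_text = ' '.join(n.get_content() for n in nodes)\n\nif column.extract_is_list:\n    result = marvin.extract(retrieved_text, target=output_type, instructions=column.instructions)\nelse:\n    result = marvin.cast(retrieved_text, target=output_type, instructions=column.instructions)\n
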
"},{"location":"extract_and_retrieval/document_data_extract/#next-steps","title":"Next Steps","text":"

This is more of a proof-of-concept of the power of the existing universe of open source tooling. There are a number of more advanced techniques we can use to get better retrieval, more intelligent agentic behavior and more. Also, we haven't optimized for performance AT ALL, so any improvements in any of these areas would be welcome. Further, we expect the real power for an open source tool like OpenContracts to come from custom implementations of this functionality, so we'll also be working on more easily customizable and modular agents and retrieval pipelines so you can quickly select the right pipeline for the right task.

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/","title":"Making a Django Application Compatible with LlamaIndex using a Custom Vector Store","text":""},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#introduction","title":"Introduction","text":"

In this walkthrough, we'll explore how the custom DjangoAnnotationVectorStore makes a Django application compatible with LlamaIndex, enabling powerful vector search capabilities within the application's structured annotation store. By leveraging the BasePydanticVectorStore class provided by LlamaIndex and integrating it with Django's ORM and the pg-vector extension for PostgreSQL, we can achieve efficient and scalable vector search functionality.

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#understanding-the-djangoannotationvectorstore","title":"Understanding the DjangoAnnotationVectorStore","text":"

The DjangoAnnotationVectorStore is a custom implementation of LlamaIndex's BasePydanticVectorStore class, tailored specifically for a Django application. It allows the application to store and retrieve granular, visually-locatable annotations (x-y blocks) from PDF pages using vector search.

Let's break down the key components and features of the DjangoAnnotationVectorStore:

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#1-inheritance-from-basepydanticvectorstore","title":"1. Inheritance from BasePydanticVectorStore","text":"
class DjangoAnnotationVectorStore(BasePydanticVectorStore):\n    ...\n

By inheriting from BasePydanticVectorStore, the DjangoAnnotationVectorStore gains access to the base functionality and interfaces provided by LlamaIndex for vector stores. This ensures compatibility with LlamaIndex's query engines and retrieval methods.
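
The surface a subclass has to implement is fairly small. A rough skeleton is shown below; the method names follow the BasePydanticVectorStore interface in recent LlamaIndex releases, but exact signatures can vary by version, so treat this as an orientation aid rather than a drop-in:

# Rough skeleton of the interface a BasePydanticVectorStore subclass fills in (may vary by LlamaIndex version)\nfrom typing import Any, List\n\nfrom llama_index.core.schema import BaseNode\nfrom llama_index.core.vector_stores.types import (\n    BasePydanticVectorStore,\n    VectorStoreQuery,\n    VectorStoreQueryResult,\n)\n\nclass DjangoAnnotationVectorStore(BasePydanticVectorStore):\n    stores_text: bool = True\n\n    @property\n    def client(self) -> Any:\n        return None  # no external client -- the Django ORM is the backend\n\n    def add(self, nodes: List[BaseNode], **kwargs: Any) -> List[str]:\n        ...  # annotations are written through the Django models, not this method\n\n    def delete(self, ref_doc_id: str, **kwargs: Any) -> None:\n        ...\n\n    def query(self, query: VectorStoreQuery, **kwargs: Any) -> VectorStoreQueryResult:\n        ...  # translate the query into an ORM + pg-vector lookup, as shown below\n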

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#2-integration-with-djangos-orm","title":"2. Integration with Django's ORM","text":"

The DjangoAnnotationVectorStore leverages Django's Object-Relational Mapping (ORM) to interact with the application's database. It defines methods like _get_annotation_queryset() and _build_filter_query() to retrieve annotations from the database using Django's queryset API.

def _get_annotation_queryset(self) -> QuerySet:\n    queryset = Annotation.objects.all()\n    if self.corpus_id is not None:\n        queryset = queryset.filter(\n            Q(corpus_id=self.corpus_id) | Q(document__corpus=self.corpus_id)\n        )\n    if self.document_id is not None:\n        queryset = queryset.filter(document=self.document_id)\n    if self.must_have_text is not None:\n        queryset = queryset.filter(raw_text__icontains=self.must_have_text)\n    return queryset.distinct()\n

This integration allows seamless retrieval of annotations from the Django application's database, making it compatible with LlamaIndex's querying and retrieval mechanisms.

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#3-utilization-of-pg-vector-for-vector-search","title":"3. Utilization of pg-vector for Vector Search","text":"

The DjangoAnnotationVectorStore utilizes the pg-vector extension for PostgreSQL to perform efficient vector search operations. pg-vector adds support for vector data types and provides optimized indexing and similarity search capabilities.

queryset = (\n    queryset.order_by(\n        CosineDistance(\"embedding\", query.query_embedding)\n    ).annotate(\n        similarity=CosineDistance(\"embedding\", query.query_embedding)\n    )\n)[: query.similarity_top_k]\n

In the code above, the CosineDistance function from pg-vector is used to calculate the cosine distance between the query embedding and the annotation embeddings stored in the database. This allows for fast and accurate retrieval of relevant annotations based on vector similarity.
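
For context, CosineDistance is provided by pgvector's Django integration, so the ranking is just an ordinary ORM query. A minimal illustration:

# Illustrative: ranking Annotation rows by cosine distance with pgvector's Django helpers\nfrom pgvector.django import CosineDistance\n\nclosest = (\n    Annotation.objects\n    .annotate(similarity=CosineDistance('embedding', query_embedding))\n    .order_by('similarity')[:10]\n)\n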

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#4-customization-and-filtering-options","title":"4. Customization and Filtering Options","text":"

The DjangoAnnotationVectorStore provides various customization and filtering options to fine-tune the vector search process. It allows filtering annotations based on criteria such as corpus_id, document_id, and must_have_text.

def _build_filter_query(self, filters: Optional[MetadataFilters]) -> QuerySet:\n    queryset = self._get_annotation_queryset()\n\n    if filters is None:\n        return queryset\n\n    for filter_ in filters.filters:\n        if filter_.key == \"label\":\n            queryset = queryset.filter(annotation_label__text__iexact=filter_.value)\n        else:\n            raise ValueError(f\"Unsupported filter key: {filter_.key}\")\n\n    return queryset\n

This flexibility enables targeted retrieval of annotations based on specific metadata filters, enhancing the search capabilities of the application.

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#benefits-of-integrating-llamaindex-with-django","title":"Benefits of Integrating LlamaIndex with Django","text":"

Integrating LlamaIndex with a Django application using the DjangoAnnotationVectorStore offers several benefits:

  1. Structured Annotation Storage: The Django application's annotation store provides a structured and organized way to store and manage granular annotations extracted from PDF pages. Each annotation is associated with metadata such as page number, bounding box coordinates, and labels, allowing for precise retrieval and visualization.
  2. Efficient Vector Search: By leveraging the pg-vector extension for PostgreSQL, the DjangoAnnotationVectorStore enables efficient vector search operations within the Django application. This allows for fast and accurate retrieval of relevant annotations based on their vector embeddings, improving the overall performance of the application.
  3. Compatibility with LlamaIndex: The DjangoAnnotationVectorStore is designed to be compatible with LlamaIndex's query engines and retrieval methods. This compatibility allows the Django application to benefit from the powerful natural language processing capabilities provided by LlamaIndex, such as semantic search, question answering, and document summarization.
  4. Customization and Extensibility: The DjangoAnnotationVectorStore provides a flexible and extensible foundation for building custom vector search functionality within a Django application. It can be easily adapted and extended to meet specific application requirements, such as adding new filtering options or incorporating additional metadata fields.
"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#conclusion","title":"Conclusion","text":"

By implementing the DjangoAnnotationVectorStore and integrating it with LlamaIndex, a Django application can achieve powerful vector search capabilities within its structured annotation store. The custom vector store leverages Django's ORM and the pg-vector extension for PostgreSQL to enable efficient retrieval of granular annotations based on vector similarity.

This integration opens up new possibilities for building intelligent and interactive applications that can process and analyze large volumes of annotated data. With the combination of Django's robust web framework and LlamaIndex's advanced natural language processing capabilities, developers can create sophisticated applications that deliver enhanced user experiences and insights.

The DjangoAnnotationVectorStore serves as a bridge between the Django ecosystem and the powerful tools provided by LlamaIndex, enabling developers to harness the best of both worlds in their applications.

"},{"location":"extract_and_retrieval/querying_corpus/","title":"Answering Queries using LlamaIndex in a Django Application","text":"

This markdown document explains how queries are answered in a Django application using LlamaIndex, the limitations of the approach, and how LlamaIndex is leveraged for this purpose.

"},{"location":"extract_and_retrieval/querying_corpus/#query-answering-process","title":"Query Answering Process","text":"
  1. A user submits a query through the Django application, which is associated with a specific corpus (a collection of documents).
  2. The query is saved in the database as a CorpusQuery object, and a Celery task (run_query) is triggered to process the query asynchronously.
  3. Inside the run_query task:
    1. The CorpusQuery object is retrieved from the database using the provided query_id.
    2. The query's started timestamp is set to the current time.
    3. The necessary components for query processing are set up, including the embedding model (HuggingFaceEmbedding), language model (OpenAI), and vector store (DjangoAnnotationVectorStore).
    4. The DjangoAnnotationVectorStore is initialized with the corpus_id associated with the query, allowing it to retrieve the relevant annotations for the specified corpus.
    5. A VectorStoreIndex is created from the DjangoAnnotationVectorStore, which serves as the index for the query engine.
    6. A CitationQueryEngine is instantiated with the index, specifying the number of top similar results to retrieve (similarity_top_k) and the granularity of the citation sources (citation_chunk_size).
    7. The query is passed to the CitationQueryEngine, which processes the query and generates a response.
    8. The response includes the answer to the query along with the source annotations used to generate the answer.
    9. The source annotations are parsed and converted into a markdown format, with each citation linked to the corresponding annotation ID.
    10. The query's sources field is updated with the annotation IDs used in the response.
    11. The query's response field is set to the generated markdown text.
    12. The query's completed timestamp is set to the current time.
    13. If an exception occurs during the query processing, the query's failed timestamp is set, and the stack trace is stored in the stacktrace field.
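
Put together, the core of the run_query path amounts to roughly the following. This is a simplified sketch - the embedding/LLM model names and the CorpusQuery field names are assumptions, and the real task also handles permissions, markdown formatting, and error states:

# Simplified sketch of the query path (model names and field names are assumptions)\nfrom llama_index.core import Settings, VectorStoreIndex\nfrom llama_index.core.query_engine import CitationQueryEngine\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.llms.openai import OpenAI\n\nSettings.embed_model = HuggingFaceEmbedding(model_name='sentence-transformers/all-MiniLM-L6-v2')\nSettings.llm = OpenAI(model='gpt-4o')\n\nvector_store = DjangoAnnotationVectorStore(corpus_id=corpus_query.corpus_id)\nindex = VectorStoreIndex.from_vector_store(vector_store)\n\nengine = CitationQueryEngine.from_args(\n    index,\n    similarity_top_k=8,       # how many similar annotations to retrieve\n    citation_chunk_size=512,  # granularity of the citation sources\n)\n\nresponse = engine.query(corpus_query.query)  # response.source_nodes carry the cited annotations\n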
"},{"location":"extract_and_retrieval/querying_corpus/#leveraging-llamaindex","title":"Leveraging LlamaIndex","text":"

LlamaIndex is leveraged in the following ways to enable query answering in the Django application:

  1. Vector Store: LlamaIndex provides the BasePydanticVectorStore class, which serves as the foundation for the custom DjangoAnnotationVectorStore. The DjangoAnnotationVectorStore integrates with Django's ORM to store and retrieve annotations efficiently, allowing seamless integration with the existing Django application.
  2. Indexing: LlamaIndex's VectorStoreIndex is used to create an index from the DjangoAnnotationVectorStore. The index facilitates fast and efficient retrieval of relevant annotations based on the query.
  3. Query Engine: LlamaIndex's CitationQueryEngine is employed to process the queries and generate responses. The query engine leverages the index to find the most relevant annotations and uses the language model to generate a coherent answer.
  4. Embedding and Language Models: LlamaIndex provides abstractions for integrating various embedding and language models. In this implementation, the HuggingFaceEmbedding and OpenAI models are used, but LlamaIndex allows flexibility in choosing different models based on requirements.

By leveraging LlamaIndex, the Django application benefits from a structured and efficient approach to query answering. LlamaIndex provides the necessary components and abstractions to handle vector storage, indexing, and query processing, allowing the application to focus on integrating these capabilities into its existing architecture.

"},{"location":"walkthrough/key-concepts/","title":"Key-Concepts","text":""},{"location":"walkthrough/key-concepts/#data-types","title":"Data Types","text":"

Text annotation data is divided into several concepts:

  1. Corpuses (or collections of documents). One document can be in multiple corpuses.
  2. Documents. Currently, these are PDFs ONLY.
  3. Annotations. These are either document-level annotations (the document type), text-level annotations (highlighted text), or relationships (which apply a label between two annotations). Relationships are currently not well-supported and may be buggy.
  4. Analyses. These are groups of read-only annotations added by a Gremlin analyzer (see more on that below).
"},{"location":"walkthrough/key-concepts/#permissioning","title":"Permissioning","text":"

OpenContracts is built on top of the powerful permissioning framework for Django called django-guardian. Each GraphQL request can add a field to annotate the object-level permissions the current user has for a given object, and the frontend relies on this to determine whether to make some objects and pages read-only and whether certain features should be exposed to a given user. The capability of sharing objects with specific users is built in, but is not enabled from the frontend at the moment. Allowing such widespread sharing and user lookups could be a security hole and could also unduly tax the system. We'd like to test these capabilities more fully before letting users use them.

"},{"location":"walkthrough/key-concepts/#graphql","title":"GraphQL","text":""},{"location":"walkthrough/key-concepts/#mutations-and-queries","title":"Mutations and Queries","text":"

OpenContracts uses Graphene and GraphQL to serve data to its frontend. You can access the Graphiql playground by going to your OpenContracts root url /graphql - e.g. https://opencontracts.opensource.legal/graphql. Anonymous users have access to any public data. To authenticate and access your own data, you either need to use the login mutation to create a JWT token or login to the admin dashboard to get a Django session and auth cookie that will automatically authenticate your requests to the GraphQL endpoint.

If you're not familiar with GraphQL, it's a very powerful way to expose your backend to the user and/or frontend clients to permit the construction of specific queries with specific data shapes. As an example, here's a request to get public corpuses and the annotated text and labels in them:

Graphiql comes with a built-in documentation browser. Just click \"Docs\" in the top-right of the screen to start browsing. Typically, mutations change things on the server. Queries merely request copies of data from the server. We've tried to make our schema fairly self-explanatory, but we do plan to add more descriptions and guidance to our API docs.

"},{"location":"walkthrough/key-concepts/#graphql-only-features","title":"GraphQL-only features","text":"

Some of our features are currently not accessible via the frontend. Sharing analyses and corpuses to the public, for example, can only be achieved via makeCorpusPublic and makeAnalysisPublic mutations, and only admins have this power at the moment. For our current release, we've done this to prevent large numbers of public corpuses being shared to cut down on server usage. We'd like to make a fully free and open, collaborative platform with more features to share anonymously, but this will require additional effort and compute power.

"},{"location":"walkthrough/step-1-add-documents/","title":"Step 1 - Add Documents","text":"

In order to do anything, you need to add some documents to OpenContracts.

"},{"location":"walkthrough/step-1-add-documents/#go-to-the-documents-tab","title":"Go to the Documents tab","text":"

Click on the \"Documents\" entry in the menu to bring up a view of all documents you have read and/or write access to:

"},{"location":"walkthrough/step-1-add-documents/#open-the-action-menu","title":"Open the Action Menu","text":"

Now, click on the \"Action\" dropdown to open the Action menu for available actions and click \"Import\":

This will bring up a dialog to load documents:

"},{"location":"walkthrough/step-1-add-documents/#select-documents-to-upload","title":"Select Documents to Upload","text":"

Open Contracts works with PDFs only (as this helps us have a single file type with predictable data structures, formats, etc.). In the future, we'll add functionality to convert other files to PDF, but, for now, please use PDFs. It doesn't matter if they are OCRed or not as OpenContracts performs its own OCR on every PDF anyway to ensure consistent OCR quality and outputs. Once you've added documents for upload, you'll see a list of documents:

Click on a document to change the description or title:

"},{"location":"walkthrough/step-1-add-documents/#upload-your-documents","title":"Upload Your Documents","text":"

Click upload to upload the documents to OpenContracts. Note: Once the documents are uploaded, they are automatically processed with Tesseract and PAWLs to create a layer of tokens - each one representing a word / symbol in the PDF and its X,Y coordinates on the page. This is what powers OpenContracts' annotator and allows us to create both layout-aware and text-only annotations. While the PAWLs processing script is running, the document you uploaded will not be available for viewing and cannot be added to a corpus. You'll see a loading bar on the document until the pre-processing is complete. This only runs once and can take a long time (a couple of minutes to a max of 10) depending on the document length, quality, etc.

"},{"location":"walkthrough/step-2-create-labelset/","title":"Step 2 - Create Labelset","text":""},{"location":"walkthrough/step-2-create-labelset/#why-labelsets","title":"Why Labelsets?","text":"

Before you can add labels, you need to decide what you want to label. A labelset should reflect the taxonomy or concepts you want to associate with text in your document. This can be solely for the purpose of human review and retrieval, but we imagine many of you want to use it to train machine learning models.

At the moment, there's no way to create a label in a corpus without creating a labelset and creating a label for the labelset (though we'd like to add that and welcome contributions).

"},{"location":"walkthrough/step-2-create-labelset/#create-text-labels","title":"Create Text Labels","text":"

Let's say we want to add some labels for \"Parties\", \"Termination Clause\", and \"Effective Date\". To do that, let's first create a LabelSet to hold the labels.

  1. Go to the labelset view and click the action button to bring up the action menu:
  2. Clicking on the \"Create Label Set\" item will bring up a modal to let you create labels:
  3. Now click on the new label set to edit the labels:
  4. A modal comes up that lets you edit three types of labels:

    1. Text Labels - are meant to label spans of text (\"highlights\")
    2. Relationship Labels - this feature is still under development, but it labels relationships between text labels (e.g. one labelled party is the \"Parent Company\" of another).
    3. Doc Type Labels - are meant to label what category the document belongs in - e.g. a \"Stock Purchase Agreement\" or an \"NDA\"
  5. Click the \"Text Labels\" tab to bring up a view of current labels for text annotations and an action button that lets you create new ones. There should be no labels when you first open this view\"

  6. Click the action button and then the \"Create Text Label\" dropdown item:
  7. You'll see a new, blank label in the list of text labels:
  8. Click the edit icon on the label to edit the label title, description, color and/or icon. To edit the icon or highlight color, hover over or click the giant tag icon on the left side of the label:
  9. Hit save to commit the changes to the database. Repeat for the other labels - \"Parties\", \"Termination Clause\", and \"Effective Date\":
"},{"location":"walkthrough/step-2-create-labelset/#create-document-type-labels","title":"Create Document-Type Labels","text":"

In addition to labelling specific parts of a document, you may want to tag a document itself as being a certain type of document or as addressing a certain subject. In this example, let's say we want to label some documents as \"contracts\" and others as \"not contracts\".

  1. Let's also create two example document type labels. Click the \"Doc Type Labels\" tab:
  2. As before, click the action button and the \"Create Document Type Label\" item to create a blank document type label:
  3. Repeat to create two doc type labels - \"Contract\" and \"Not Contract\":
  4. Hit \"Close\" to close the editor.
"},{"location":"walkthrough/step-3-create-a-corpus/","title":"Step 3 - Create Corpus","text":""},{"location":"walkthrough/step-3-create-a-corpus/#purpose-of-the-corpus","title":"Purpose of the Corpus","text":"

A \"Corpus\" is a collection of documents that can be annotated by hand or automatically by a \"Gremlin\" analyzer. In order to create a Corpus, you first need to create a Corpus and then add documents to it.

"},{"location":"walkthrough/step-3-create-a-corpus/#go-to-the-corpus-page","title":"Go to the Corpus Page","text":"
  1. First, login if you're not already logged in.
  2. Then, go to the \"Corpus\" tab and click the \"Action\" dropdown to bring up the action menu:
  3. Click \"Create Corpus\" to bring up the Create Corpus dialog. If you've already created a labelset or have a pre-existing one, you can select it, otherwise you'll need to create and add one later:
  4. Assuming you created the labelset you want to use, when you click on the dropdown in the \"Label Set\" section, you should see your new labelset. Click on it to select it:
    1. You will now be able to open the corpus again, open documents in the corpus and start labelling.
"},{"location":"walkthrough/step-3-create-a-corpus/#add-documents-to-corpus","title":"Add Documents to Corpus","text":"
  1. Once you have a corpus, go back to the document page to select documents to add. You can do this in one of two ways.
    1. Right-click on a document to show a context menu:
    2. Or, SHIFT + click on the documents you want to select in order to select multiple documents at once. A green checkmark will appear on selected documents.
  2. When you're done, click the \"Action\" dropdown.
  3. A dialog will pop up asking you to select a corpus to add the documents to. Select the desired corpus and hit ok.
  4. You'll get a confirmation dialog. Hit OK.
  5. When you click on the Corpus you just added the documents to, you'll get a tabbed view of all of the documents, annotations and analyses for that Corpus. At this stage, you should see your documents:

Congrats! You've created a corpus to hold annotations or perform an analysis! In order to start labelling it yourself, however, you need to create and then select a LabelSet. You do not need to do this to run an analyzer.

Note: If you have an OpenContracts export file and proper permissions, you can also import a corpus, documents, annotations, and labels. This is disabled on our demo instance, however, to cut down on server load and reduce opportunities to upload potentially malicious files. See the \"Advanced\" section for more details.

"},{"location":"walkthrough/step-4-create-text-annotations/","title":"Step 4 - Create Some Annotations","text":"

To view or edit annotations, you need to open a corpus and then open a document in the Corpus.

  1. Go to your Corpuses page and click on the corpus you just created:
  2. This will open up the document view again. Click on one of the documents to bring up the annotator:
  3. To select the label to apply, Click the vertical ellipses in the \"Text Label to Apply Widget\". This will bring up an interface that lets you search your labelset and select a label:
  4. Select the \"Effective Date\" label, for example, to label the Effective Date:
  5. Now, in the document, click and drag a box around the language that corresponds to your selected label:
  6. When you've selected the correct text, release the mouse. You'll see a confirmation when your annotation is created (you'll also see the annotation in the sidebar to the left):
  7. If you want to delete the annotation, you can click on the trash icon in the corresponding annotation card in the sidebar, or, when you hover over the annotation on the page, you'll see a trash icon in the label bar of the annotation. You can click this to delete the annotation too.
  8. If your desired annotated text is non-contiguous, you can hold down the SHIFT key while selecting blocks of text to combine them into a single annotation. While holding SHIFT, releasing the mouse will not create the annotation in the database, it will just allow you to move to a new area.
    1. One situation where you might want to do this is where the text you want to highlight spans different lines but is just a small part of the surrounding paragraph (such as this example, where the Effective Date spans two lines):
    2. Or you might want to select multiple snippets of text in a larger block of text, such as where you have multiple parties you want to combine into a single annotation:
"},{"location":"walkthrough/step-5-create-doc-type-annotations/","title":"Step 5 - Create Some Document Annotations","text":"
  1. If you want to label the type of document instead of the text inside it, use the controls in the \"Doc Type\" widget on the bottom right of the Annotator. Hover over it and a green plus button should appear:
  2. Click the \"+\" button to bring up a dialog that lets you search and select document type labels (remember, we created these earlier in the tutorial):
  3. Click \"Add Label\" to actually apply the label, and you'll now see that label displayed in the \"Doc Type\" widget in the annotator:
  4. As before, you can click the trash can to delete the label.
"},{"location":"walkthrough/step-6-search-and-filter-by-annotations/","title":"Step 6 - Search and Filter By Annotations","text":"
  1. Back in the Corpus view, you can see in the document view the document type label you just added:
  2. You can click on the filter dropdown above to filter the documents to only those with a certain doc type label:
  3. With the corpus opened, click on the \"Annotations\" tab instead of the \"Documents\" tab to get a summary of all the current annotations in the Corpus:
  4. Click on an annotation card to automatically load the document it's in and jump right to the page containing the annotation:
"},{"location":"walkthrough/step-7-query-corpus/","title":"Querying a Corpus","text":"

Once you've created a corpus of documents, you can ask a natural language question and get a natural language answer, complete with citations and links back to the relevant text in the document(s).

Note: We're still working to improve nav and GUI performance, but this is pretty good for a first cut.

"},{"location":"walkthrough/step-8-data-extract/","title":"Build a Datagrid","text":"

You can easily use OpenContracts to create an \"Extract\" - a collection of queries and natural language-specified data points, represented as columns in a grid, that will be asked of every document in the extract (represented as rows). You can define complex extract schemas, including python primitives, Pydantic models (no nesting - yet) and lists.

"},{"location":"walkthrough/step-8-data-extract/#building-a-datagrid","title":"Building a Datagrid","text":"

To create a data grid, you can start by adding documents or adding data fields. Your choice. If you selected a corpus when defining the extract, the documents from that Corpus will be pre-loaded.

"},{"location":"walkthrough/step-8-data-extract/#to-add-documents","title":"To add documents:","text":""},{"location":"walkthrough/step-8-data-extract/#and-to-add-data-fields","title":"And to add data fields:","text":""},{"location":"walkthrough/step-8-data-extract/#running-an-extract","title":"Running an Extract","text":"

Once you've added all of the documents you want and defined all of the data fields to apply, you can click run to start processing the grid:

Extract speed will depend on your underlying LLM and the number of celery workers provisioned for OpenContracts. We haven't yet optimized for performance and hope to do so in a future minor release.

"},{"location":"walkthrough/step-8-data-extract/#reviewing-results","title":"Reviewing Results","text":"

Once an extract is complete, you can click on the hamburger menu in a cell to see a dropdown menu. Click the eye to view the sources for that datacell. If you click thumbs up or thumbs down, you can log that you approved or rejected the value in question. Extract value edits are coming soon.

See a quick walkthrough here:

"},{"location":"walkthrough/step-9-corpus-actions/","title":"Corpus Actions","text":""},{"location":"walkthrough/step-9-corpus-actions/#introduction","title":"Introduction","text":"

If you're familiar with GitHub actions - user-scripted functions that run automatically over a software VCS repository when certain actions take place (like a merge, PR, etc.) - then a CorpusAction should be a familiar concept. You can configure a celery task using our @doc_analyzer_task decorator (see more here on how to write these) and then configure a CorpusAction to run your custom task on all documents added to the target corpus.

"},{"location":"walkthrough/step-9-corpus-actions/#setting-up-a-corpus-action","title":"Setting up a Corpus Action","text":""},{"location":"walkthrough/step-9-corpus-actions/#supported-actions","title":"Supported Actions","text":"

NOTE: Currently, you have to configure all of this via the Django admin dashboard (http://localhost:8000/admin if you're using our local deployment). We'd like to expose this functionality via our React frontend, but the required GUI elements and GraphQL mutations need to be built out. A good starter PR for someone ;-).

Currently, a CorpusAction can be configured to run one of the following automatically:

  1. A data extract fieldset - in which case, a data extract will be created and run on new documents added to the configured corpus (see our guide on setting up a data extract job)
  2. An Analyzer
    1. Configured as a \"Gremlin Microservice\". See more information on configuring a microservice-based analyzer here
    2. Configured to run a task decorated using the @doc_analyzer_task decorator. See more about configuring these kinds of tasks here.
"},{"location":"walkthrough/step-9-corpus-actions/#creating-corpus-action","title":"Creating Corpus Action","text":"

From within the Django admin dashboard, click on CorpusActions or the +Add button next to the header:

Once you've opened the create action form, you'll see a number of different options you can configure:

See the next section for more details on these configuration options.

"},{"location":"walkthrough/step-9-corpus-actions/#configuration-options-for-corpus-action","title":"Configuration Options for Corpus Action","text":"

Corpus specifies that an action should run only on a single corpus, specified via dropdown.

Analyzer or Fieldset properties control whether an analysis or a data extract runs when the applicable trigger fires (more on this below). If you want to run a data extract when a document is added to the corpus, select the fieldset defining the data you want to extract. If you want to run an analyzer, select the pre-configured analyzer. Remember, an analyzer can point to a microservice or a task decorated with @doc_analyzer_task.

Trigger refers to the specific action type that should kick off the desired analysis. Currently, we \"provide\" add and edit actions - i.e., run specified analytics when a document is added or edited, respectively - but we have not configured the edit action to run.

Disabled is a toggle that will turn off the specified CorpusAction for ALL corpuses.

Run on all corpuses is a toggle that, if True, will run the specified action on EVERY corpus. Be careful with this, as it runs for all corpuses for ALL users. Depending on your environment, this could incur a substantial compute cost, and other users may not appreciate it. A nice feature we'd love to add is a more fine-grained set of rule-based access controls to limit actions to certain groups. This would require a substantial investment in the frontend of the application and remains an unlikely addition, though we'd absolutely welcome contributions!

"},{"location":"walkthrough/step-9-corpus-actions/#quick-reference-configuring-doc_analyzer_task-analyzer","title":"Quick Reference - Configuring @doc_analyzer_task + Analyzer","text":"

If you write your own @doc_analyzer_task and want to run it automatically, here's how to set it up, step by step.

  1. First, we assume you put a properly written and decorated task in opencontractserver.tasks.doc_analysis_tasks.py.
  2. Second, you need to create and configure an Analyzer via the Django admin panel. Click on the +Add button next to the Analyzer entry in the admin sidebar and then configure necessary properties:

Place the name of your task in the task_name property - e.g. opencontractserver.tasks.doc_analysis_tasks.contract_not_contract - add a brief description, assign the creator to the desired user, and click save.
  3. Now, this Analyzer instance can be assigned to a CorpusAction!
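
For reference, the same configuration can be sketched from a Django shell rather than the admin forms. This is a rough, hypothetical sketch only - the import paths, field names, and trigger value are assumptions based on the configuration options described above, so check the model definitions in your deployment before relying on them:

# Hypothetical sketch from a Django shell; import paths, field names, and the trigger value are assumptions.\nfrom django.contrib.auth import get_user_model\nfrom opencontractserver.analyzer.models import Analyzer  # assumed import path\nfrom opencontractserver.corpuses.models import Corpus, CorpusAction  # assumed import path\n\nuser = get_user_model().objects.get(username=\"admin\")\n\n# Step 2: register the Analyzer that points at your decorated task\nanalyzer = Analyzer.objects.create(\n    task_name=\"opencontractserver.tasks.doc_analysis_tasks.contract_not_contract\",\n    description=\"Flags documents that do not look like contracts\",\n    creator=user,\n)\n\n# Step 3: attach the Analyzer to a corpus via a CorpusAction that fires when documents are added\ncorpus = Corpus.objects.get(title=\"Example Corpus\")\nCorpusAction.objects.create(corpus=corpus, analyzer=analyzer, trigger=\"add_document\", creator=user)\n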

"},{"location":"walkthrough/advanced/configure-annotation-view/","title":"Configure How Annotations Are Displayed","text":"

Annotations are composed of tokens (basically text in a line surrounded by whitespace). The tokens have a highlight. OpenContracts also has a \"BoundingBox\" around the tokens which is the smallest rectangle that can cover all of the tokens in an Annotation.

In the Annotator view, you'll see a purple-colored \"eye\" icon in the top left of the annotation list in the sidebar. Click the icon to bring up a series of configurations for how annotations are displayed:

There are three different settings that can be combined to significantly change how you see the annotations:

  1. Show only selected - You will only see the selected annotation, either by clicking on it in the sidebar or by clicking into an annotation from the Corpus view. All other annotations will be completely hidden.
  2. Show bounding boxes - If you unselect this, only the tokens will be visible. This is recommended where you have large numbers of overlapping annotations or annotations that are sparse - e.g. a few words scattered throughout a paragraph. In either of these cases, the bounding boxes can cover other bounding boxes, which can be confusing. Where you have too many overlapping bounding boxes, it's easier to hide them and just look at the tokens.
  3. Label Display Behavior - has three options:

  1. Always Show - Always show the label for an annotation when it's displayed (remember, you can choose to only display selected annotations).
  2. Always Hide - Never show the label for an annotation, regardless of its visibility.
  3. Show on Hover - If an annotation is visible, when you hover over it, you'll see the label.
"},{"location":"walkthrough/advanced/data-extract-models/","title":"Why Data Extract?","text":"

An extraction process is pivotal for transforming raw, unstructured data into actionable insights, especially in fields like legal, financial, healthcare, and research. Imagine having thousands of documents, such as contracts, invoices, medical records, or research papers, and needing to quickly locate and analyze specific information like key terms, dates, patient details, or research findings. Automated extraction saves countless hours of manual labor, reduces human error, and enables real-time data analysis. By leveraging an efficient extraction pipeline, businesses and researchers can make informed decisions faster, ensure compliance, enhance operational efficiency, and uncover valuable patterns and trends that might otherwise remain hidden in the data deluge. Simply put, data extraction transforms overwhelming amounts of information into strategic assets, driving innovation and competitive advantage.

"},{"location":"walkthrough/advanced/data-extract-models/#how-we-store-our-data-extracts","title":"How we Store Our Data Extracts","text":"

Ultimately, our application design follows Django best practices for a data-driven application with asynchronous data processing. We use the Django ORM (with capabilities like vector search) to store our data and celery tasks to orchestrate processing. The extracts/models.py file defines several key models that are used to manage and track the process of extracting data from documents.

These models include:

  1. Fieldset
  2. Column
  3. Extract
  4. Datacell

Each model plays a specific role in the extraction workflow, and together they enable the storage, configuration, and execution of document-based data extraction tasks.

"},{"location":"walkthrough/advanced/data-extract-models/#detailed-explanation-of-each-model","title":"Detailed Explanation of Each Model","text":""},{"location":"walkthrough/advanced/data-extract-models/#1-fieldset","title":"1. Fieldset","text":"

Purpose: The Fieldset model groups related columns together. Each Fieldset represents a specific configuration of data fields that need to be extracted from documents.

class Fieldset(BaseOCModel):\n    name = models.CharField(max_length=256, null=False, blank=False)\n    description = models.TextField(null=False, blank=False)\n
  • name: The name of the fieldset.
  • description: A description of what this fieldset is intended to extract.

Usage: Fieldsets are associated with extracts in the Extract model, defining what data needs to be extracted.

"},{"location":"walkthrough/advanced/data-extract-models/#2-column","title":"2. Column","text":"

Purpose: The Column model defines individual data fields that need to be extracted. Each column specifies what to extract, the criteria for extraction, and the model to use for extraction.

class Column(BaseOCModel):\n    name = models.CharField(max_length=256, null=False, blank=False, default=\"\")\n    fieldset = models.ForeignKey('Fieldset', related_name='columns', on_delete=models.CASCADE)\n    query = models.TextField(null=True, blank=True)\n    match_text = models.TextField(null=True, blank=True)\n    must_contain_text = models.TextField(null=True, blank=True)\n    output_type = models.TextField(null=False, blank=False)\n    limit_to_label = models.CharField(max_length=512, null=True, blank=True)\n    instructions = models.TextField(null=True, blank=True)\n    task_name = models.CharField(max_length=1024, null=False, blank=False)\n    agentic = models.BooleanField(default=False)\n    extract_is_list = models.BooleanField(default=False)\n
  • name: The name of the column.
  • fieldset: ForeignKey linking to the Fieldset model.
  • query: The query used for extraction.
  • match_text: Text that must be matched during extraction.
  • must_contain_text: Text that must be contained in the document for extraction.
  • output_type: The type of data to be extracted.
  • limit_to_label: A label to limit the extraction scope.
  • instructions: Instructions for the extraction process.
  • task_name: The name of the registered celery extract task used to process this column (this lets you define and deploy custom extract tasks).
  • agentic: Boolean indicating if the extraction is agentic.
  • extract_is_list: Boolean indicating if the extraction result is a list.

Usage: Columns are linked to fieldsets and specify detailed criteria for each piece of data to be extracted.
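
As a rough illustration of how these two models fit together, here is a minimal, hypothetical sketch of defining a fieldset with a single column via the Django ORM. The import path, the creator field, and the task_name value are assumptions for illustration; the other field names follow the model definitions above:

# Hypothetical sketch; import path, creator field, and the task_name value are assumptions - other fields follow the model definitions above.\nfrom django.contrib.auth import get_user_model\nfrom opencontractserver.extracts.models import Fieldset, Column  # assumed path based on extracts/models.py\n\nuser = get_user_model().objects.first()\n\nfieldset = Fieldset.objects.create(\n    name=\"Basic contract terms\",\n    description=\"Pulls a handful of key terms out of each contract\",\n    creator=user,  # creator assumed to come from BaseOCModel\n)\n\nColumn.objects.create(\n    fieldset=fieldset,\n    name=\"Effective Date\",\n    query=\"What is the effective date of this agreement?\",\n    output_type=\"str\",\n    task_name=\"llama_index_doc_query\",  # registered extract task to use; exact value is an assumption\n    extract_is_list=False,\n    creator=user,\n)\n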

"},{"location":"walkthrough/advanced/data-extract-models/#4-extract","title":"4. Extract","text":"

Purpose: The Extract model represents an extraction job. It contains metadata about the extraction process, such as the documents to be processed, the fieldset to use, and the task type.

class Extract(BaseOCModel):\n    corpus = models.ForeignKey('Corpus', related_name='extracts', on_delete=models.SET_NULL, null=True, blank=True)\n    documents = models.ManyToManyField('Document', related_name='extracts', related_query_name='extract', blank=True)\n    name = models.CharField(max_length=512, null=False, blank=False)\n    fieldset = models.ForeignKey('Fieldset', related_name='extracts', on_delete=models.PROTECT, null=False)\n    created = models.DateTimeField(auto_now_add=True)\n    started = models.DateTimeField(null=True, blank=True)\n    finished = models.DateTimeField(null=True, blank=True)\n    error = models.TextField(null=True, blank=True)\n    doc_query_task = models.CharField(\n        max_length=10,\n        choices=[(tag.name, tag.value) for tag in DocQueryTask],\n        default=DocQueryTask.DEFAULT.name\n    )\n
  • corpus: ForeignKey linking to the Corpus model.
  • documents: ManyToManyField linking to the Document model.
  • name: The name of the extraction job.
  • fieldset: ForeignKey linking to the Fieldset model.
  • created: Timestamp when the extract was created.
  • started: Timestamp when the extract started.
  • finished: Timestamp when the extract finished.
  • error: Text field for storing error messages.
  • doc_query_task: CharField for storing the task type using DocQueryTask enum.

Usage: Extracts group the documents to be processed and the fieldset that defines what data to extract. The doc_query_task field determines which extraction pipeline to use.

"},{"location":"walkthrough/advanced/data-extract-models/#5-datacell","title":"5. Datacell","text":"

Purpose: The Datacell model stores the result of extracting a specific column from a specific document. Each datacell links to an extract, a column, and a document.

class Datacell(BaseOCModel):\n    extract = models.ForeignKey('Extract', related_name='extracted_datacells', on_delete=models.CASCADE)\n    column = models.ForeignKey('Column', related_name='extracted_datacells', on_delete=models.CASCADE)\n    document = models.ForeignKey('Document', related_name='extracted_datacells', on_delete=models.CASCADE)\n    sources = models.ManyToManyField('Annotation', blank=True, related_name='referencing_cells', related_query_name='referencing_cell')\n    data = NullableJSONField(default=jsonfield_default_value, null=True, blank=True)\n    data_definition = models.TextField(null=False, blank=False)\n    started = models.DateTimeField(null=True, blank=True)\n    completed = models.DateTimeField(null=True, blank=True)\n    failed = models.DateTimeField(null=True, blank=True)\n    stacktrace = models.TextField(null=True, blank=True)\n
  • extract: ForeignKey linking to the Extract model.
  • column: ForeignKey linking to the Column model.
  • document: ForeignKey linking to the Document model.
  • sources: ManyToManyField linking to the Annotation model.
  • data: JSON field for storing extracted data.
  • data_definition: Text field describing the data definition.
  • started: Timestamp when the datacell processing started.
  • completed: Timestamp when the datacell processing completed.
  • failed: Timestamp when the datacell processing failed.
  • stacktrace: Text field for storing error stack traces.

Usage: Datacells store the results of extracting specific fields from documents, linking back to the extract and column definitions. They also track the status and any errors during extraction.
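
Because each Datacell records its own timestamps and errors, reviewing an extract's output programmatically is mostly a matter of filtering on those fields. A minimal sketch using the relationships defined above (the extract variable is assumed to be an existing Extract instance):

# Minimal sketch using the related_name values defined above; 'extract' is an existing Extract instance.\nfailed_cells = extract.extracted_datacells.filter(failed__isnull=False)\nfor cell in failed_cells:\n    print(cell.document_id, cell.column.name, cell.stacktrace)\n\n# Completed cells carry their extracted value in the JSON 'data' field\nfor cell in extract.extracted_datacells.filter(completed__isnull=False):\n    print(cell.column.name, cell.data)\n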

"},{"location":"walkthrough/advanced/data-extract-models/#how-these-models-relate-to-data-extraction-tasks","title":"How These Models Relate to Data Extraction Tasks","text":"

  1. Fieldset and Column: Specify what data needs to be extracted and the criteria for extraction. Fieldsets group columns, which detail each piece of data to be extracted. You can register your own LlamaIndex extractors, which you can then select as the extract engine for a given column, allowing you to create very bespoke extraction capabilities.
  2. Extract: Represents an extraction job, grouping documents to be processed with the fieldset defining what data to extract. The doc_query_task field allows dynamic selection of the extraction pipeline.
  3. Datacell: Stores the results of the extraction process for each document and column, tracking the status and any errors encountered.

"},{"location":"walkthrough/advanced/data-extract-models/#extraction-workflow","title":"Extraction Workflow","text":"
  1. Create Extract: An Extract instance is created, specifying the documents to process, the fieldset to use, and the desired extraction task.
  2. Run Extract: The run_extract task uses the doc_query_task field to determine which extraction pipeline to use. It iterates over the documents and columns, creating Datacell instances for each.
  3. Process Datacell: Each Datacell is processed by the selected extraction task (e.g., llama_index_doc_query or custom_llama_index_doc_query). The results are stored in the data field of the Datacell.
  4. Store Results: The extracted data is saved, and the status of each Datacell is updated to reflect completion or failure.

By structuring the models this way, the system is flexible and scalable, allowing for complex data extraction tasks to be defined, executed, and tracked efficiently.
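
To make the workflow above concrete, here is a hedged sketch of kicking off an extract programmatically. The import paths and the exact signature of the run_extract task are assumptions, so treat this as an illustration of the flow rather than a verbatim recipe:

# Hypothetical sketch of the extraction workflow; import paths and the run_extract signature are assumptions.\nfrom opencontractserver.extracts.models import Extract  # assumed import path\nfrom opencontractserver.tasks import run_extract  # assumed import path\n\n# 1. Create the Extract, pointing at a corpus and the fieldset defining what to pull out\nextract = Extract.objects.create(\n    corpus=corpus,\n    name=\"Key terms - first batch\",\n    fieldset=fieldset,\n    creator=user,\n)\nextract.documents.set(docs)  # 'docs' is a queryset or list of Document instances\n\n# 2. Queue the run; it fans out one Datacell per (document, column) pair\nrun_extract.delay(extract.id)\n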

"},{"location":"walkthrough/advanced/export-import-corpuses/","title":"Export / Import Functionality","text":""},{"location":"walkthrough/advanced/export-import-corpuses/#exports","title":"Exports","text":"

OpenContracts supports both exporting and importing corpuses. This functionality is disabled on the public demo as it can be bandwidth intensive. If you want to experiment with these features on your own, you'll see the export action when you right-click on a corpus:

You can access your exports from the user dropdown menu in the top right corner of the screen. Once your export is complete, you should be able to download a zip containing all the documents, their PAWLs layers, and the corpus data you created - including all annotations.

"},{"location":"walkthrough/advanced/export-import-corpuses/#imports","title":"Imports","text":"

If you've enabled corpus imports (see the frontend env file for the boolean toggle to do this - it's REACT_APP_ALLOW_IMPORTS), you'll see an import action when you click the action button on the corpus page.

"},{"location":"walkthrough/advanced/export-import-corpuses/#export-format","title":"Export Format","text":""},{"location":"walkthrough/advanced/export-import-corpuses/#opencontracts-export-format-specification","title":"OpenContracts Export Format Specification","text":"

The OpenContracts export is a zip archive containing:

  1. A data.json file with metadata about the export
  2. The original PDF documents
  3. Exported annotations \"burned in\" to the PDF documents

"},{"location":"walkthrough/advanced/export-import-corpuses/#datajson-format","title":"data.json Format","text":"

The data.json file contains a JSON object with the following fields:

  • annotated_docs (dict): Maps PDF filenames to OpenContractDocExport objects with annotations for that document.

  • doc_labels (dict): Maps document label names (strings) to AnnotationLabelPythonType objects defining those labels.

  • text_labels (dict): Maps text annotation label names (strings) to AnnotationLabelPythonType objects defining those labels.

  • corpus (OpenContractCorpusType): Metadata about the exported corpus, with fields:

    • id (int): ID of the corpus
    • title (string)
    • description (string)
    • icon_name (string): Filename of the corpus icon image
    • icon_data (string): Base64 encoded icon image data
    • creator (string): Email of the corpus creator
    • label_set (string): ID of the labelset used by this corpus
  • label_set (OpenContractsLabelSetType): Metadata about the label set, with fields:

    • id (int)
    • title (string)
    • description (string)
    • icon_name (string): Filename of the labelset icon
    • icon_data (string): Base64 encoded labelset icon data
    • creator (string): Email of the labelset creator
"},{"location":"walkthrough/advanced/export-import-corpuses/#opencontractdocexport-format","title":"OpenContractDocExport Format","text":"

Each document in annotated_docs is represented by an OpenContractDocExport object with fields:

  • doc_labels (list[string]): List of document label names applied to this doc
  • labelled_text (list[OpenContractsAnnotationPythonType]): List of text annotations
  • title (string): Document title
  • content (string): Full text content of the document
  • description (string): Description of the document
  • pawls_file_content (list[PawlsPagePythonType]): PAWLS parse data for each page
  • page_count (int): Number of pages in the document
"},{"location":"walkthrough/advanced/export-import-corpuses/#opencontractsannotationpythontype-format","title":"OpenContractsAnnotationPythonType Format","text":"

Represents an individual text annotation, with fields:

  • id (string): Optional ID
  • annotationLabel (string): Name of the label for this annotation
  • rawText (string): Raw text content of the annotation
  • page (int): 0-based page number the annotation is on
  • annotation_json (dict): Maps page numbers to OpenContractsSinglePageAnnotationType
"},{"location":"walkthrough/advanced/export-import-corpuses/#opencontractssinglepageannotationtype-format","title":"OpenContractsSinglePageAnnotationType Format","text":"

Represents the annotation data for a single page:

  • bounds (BoundingBoxPythonType): Bounding box of the annotation on the page
  • tokensJsons (list[TokenIdPythonType]): List of PAWLS tokens covered by the annotation
  • rawText (string): Raw text of the annotation on this page
"},{"location":"walkthrough/advanced/export-import-corpuses/#boundingboxpythontype-format","title":"BoundingBoxPythonType Format","text":"

Represents a bounding box with fields:

  • top (int)
  • bottom (int)
  • left (int)
  • right (int)
"},{"location":"walkthrough/advanced/export-import-corpuses/#tokenidpythontype-format","title":"TokenIdPythonType Format","text":"

References a PAWLS token by page and token index:

  • pageIndex (int)
  • tokenIndex (int)
"},{"location":"walkthrough/advanced/export-import-corpuses/#pawlspagepythontype-format","title":"PawlsPagePythonType Format","text":"

Represents PAWLS parse data for a single page:

  • page (PawlsPageBoundaryPythonType): Page boundary info
  • tokens (list[PawlsTokenPythonType]): List of PAWLS tokens on the page
"},{"location":"walkthrough/advanced/export-import-corpuses/#pawlspageboundarypythontype-format","title":"PawlsPageBoundaryPythonType Format","text":"

Represents the page boundary with fields:

  • width (float)
  • height (float)
  • index (int): Page index
"},{"location":"walkthrough/advanced/export-import-corpuses/#pawlstokenpythontype-format","title":"PawlsTokenPythonType Format","text":"

Represents a single PAWLS token with fields:

  • x (float): X-coordinate of token box
  • y (float): Y-coordinate of token box
  • width (float): Width of token box
  • height (float): Height of token box
  • text (string): Text content of the token
"},{"location":"walkthrough/advanced/export-import-corpuses/#annotationlabelpythontype-format","title":"AnnotationLabelPythonType Format","text":"

Defines an annotation label with fields:

  • id (string)
  • color (string): Hex color for the label
  • description (string)
  • icon (string): Icon name
  • text (string): Label text
  • label_type (LabelType): One of DOC_TYPE_LABEL, TOKEN_LABEL, RELATIONSHIP_LABEL, METADATA_LABEL
"},{"location":"walkthrough/advanced/export-import-corpuses/#example-datajson","title":"Example data.json","text":"
{\n  \"annotated_docs\": {\n    \"document1.pdf\": {\n      \"doc_labels\": [\"Contract\", \"NDA\"],\n      \"labelled_text\": [\n        {\n          \"id\": \"1\",\n          \"annotationLabel\": \"Effective Date\",\n          \"rawText\": \"This agreement is effective as of January 1, 2023\",\n          \"page\": 0,\n          \"annotation_json\": {\n            \"0\": {\n              \"bounds\": {\n                \"top\": 100,\n                \"bottom\": 120,\n                \"left\": 50,\n                \"right\": 500\n              },\n              \"tokensJsons\": [\n                {\n                  \"pageIndex\": 0,\n                  \"tokenIndex\": 5\n                },\n                {\n                  \"pageIndex\": 0,\n                  \"tokenIndex\": 6\n                }\n              ],\n              \"rawText\": \"January 1, 2023\"\n            }\n          }\n        }\n      ],\n      \"title\": \"Nondisclosure Agreement\",\n      \"content\": \"This Nondisclosure Agreement is made...\",\n      \"description\": \"Standard mutual NDA\",\n      \"pawls_file_content\": [\n        {\n          \"page\": {\n            \"width\": 612,\n            \"height\": 792,\n            \"index\": 0\n          },\n          \"tokens\": [\n            {\n              \"x\": 50,\n              \"y\": 100,\n              \"width\": 60,\n              \"height\": 10,\n              \"text\": \"This\"\n            },\n            {\n              \"x\": 120,\n              \"y\": 100,\n              \"width\": 100,\n              \"height\": 10,\n              \"text\": \"agreement\"\n            }\n          ]\n        }\n      ],\n      \"page_count\": 5\n    }\n  },\n  \"doc_labels\": {\n    \"Contract\": {\n      \"id\": \"1\",\n      \"color\": \"#FF0000\",\n      \"description\": \"Indicates a legal contract\",\n      \"icon\": \"contract\",\n      \"text\": \"Contract\",\n      \"label_type\": \"DOC_TYPE_LABEL\"\n    },\n    \"NDA\": {\n      \"id\": \"2\",\n      \"color\": \"#00FF00\",\n      \"description\": \"Indicates a non-disclosure agreement\",\n      \"icon\": \"nda\",\n      \"text\": \"NDA\",\n      \"label_type\": \"DOC_TYPE_LABEL\"\n    }\n  },\n  \"text_labels\": {\n    \"Effective Date\": {\n      \"id\": \"3\",\n      \"color\": \"#0000FF\",\n      \"description\": \"The effective date of the agreement\",\n      \"icon\": \"calendar\",\n      \"text\": \"Effective Date\",\n      \"label_type\": \"TOKEN_LABEL\"\n    }\n  },\n  \"corpus\": {\n    \"id\": 1,\n    \"title\": \"Example Corpus\",\n    \"description\": \"A sample corpus for demonstration\",\n    \"icon_name\": \"corpus_icon.png\",\n    \"icon_data\": \"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAACklEQVR4nGMAAQAABQABDQottAAAAABJRU5ErkJggg==\",\n    \"creator\": \"user@example.com\",\n    \"label_set\": \"4\"\n  },\n  \"label_set\": {\n    \"id\": \"4\",\n    \"title\": \"Example Label Set\",\n    \"description\": \"A sample label set\",\n    \"icon_name\": \"label_icon.png\",\n    \"icon_data\": \"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAACklEQVR4nGMAAQAABQABDQottAAAAABJRU5ErkJggg==\",\n    \"creator\":  \"user@example.com\"\n  }\n}\n

This data.json file includes:

  • One annotated document (document1.pdf) with two document labels (\"Contract\" and \"NDA\") and one text annotation for the \"Effective Date\"
  • Definitions for the two document labels (\"Contract\" and \"NDA\") and one text label (\"Effective Date\")
  • Metadata about the exported corpus and labelset, including Base64 encoded icon data

The PAWLS token data and text content are truncated for brevity. In a real export, the pawls_file_content would include the complete token data for each page, and content would contain the full extracted text of the document.
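
If you want to work with an export outside of OpenContracts, the data.json file can be read with nothing more than the Python standard library. A minimal sketch (field names follow the format specification above):

# Minimal sketch for reading an export's data.json with the Python standard library.\nimport json\nfrom pathlib import Path\n\nexport = json.loads(Path(\"data.json\").read_text())\n\nprint(export[\"corpus\"][\"title\"])\nfor filename, doc in export[\"annotated_docs\"].items():\n    print(filename, \"-\", doc[\"title\"], \"-\", doc[\"page_count\"], \"pages\")\n    for annotation in doc[\"labelled_text\"]:\n        print(\"  \", annotation[\"annotationLabel\"], \"->\", annotation[\"rawText\"])\n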


"},{"location":"walkthrough/advanced/fork-a-corpus/","title":"Fork a Corpus","text":""},{"location":"walkthrough/advanced/fork-a-corpus/#to-fork-or-not-to-fork","title":"To Fork or Not to Fork?","text":"

One of the amazing things about Open Source collaboration is you can stand on the shoulders of giants - we can share techniques and data and collectively achieve what we could never do alone. OpenContracts is designed to make it super easy to share and re-use annotation data.

In OpenContracts, we introduce the concept of \"forking\" a corpus - basically creating a copy of a public or private corpus, complete with its documents and annotations, which you can edit and tweak as needed. This opens up some interesting possibilities. For example, you might have a base corpus with annotations common to many types of AI models or annotation projects, which you can fork as needed and layer task- or domain-specific annotations on top of.

"},{"location":"walkthrough/advanced/fork-a-corpus/#fork-a-corpus","title":"Fork a Corpus","text":"

Forking a corpus is easy.

  1. Again, right-click on a corpus to bring up the context menu. You'll see an entry to \"Fork Corpus\":
  2. Click on it to start a fork. You should see a confirmation in the top right of the screen:
  3. Once the fork is complete, the next time you go to your Corpus page, you'll see a new Corpus with a Fork icon in the icon bar at the bottom. If you hover over it, you'll be able to see a summary of the corpus it was forked from. This is tracked in the database, so, long-term, we'd like to have corpus version control similar to how git works:
"},{"location":"walkthrough/advanced/generate-graphql-schema-files/","title":"Generate GraphQL Schema Files","text":""},{"location":"walkthrough/advanced/generate-graphql-schema-files/#generating-graphql-schema-files","title":"Generating GraphQL Schema Files","text":"

Open Contracts uses Graphene to provide a rich GraphQL endpoint, complete with the GraphiQL query application. For some applications, you may want to generate a GraphQL schema file in SDL or JSON. One example use case is if you're developing a frontend you want to connect to OpenContracts and you'd like to autogenerate TypeScript types from the GraphQL schema.

To generate a GraphQL schema file, run your choice of the following commands.

For an SDL file:

$ docker-compose -f local.yml run django python manage.py graphql_schema --schema config.graphql.schema.schema --out schema.graphql\n

For a JSON file:

$ docker-compose -f local.yml run django python manage.py graphql_schema --schema config.graphql.schema.schema --out schema.json\n

You can convert these to TypeScript for use in a frontend (though you'll find this has already been done for the React-based OpenContracts frontend) using a tool like this.

"},{"location":"walkthrough/advanced/pawls-token-format/","title":"Understanding Document Ground Truth in OpenContracts","text":"

OpenContracts utilizes the PAWLs format for representing documents and their annotations. PAWLs was designed by AllenAI to provide a consistent and structured way to store text and layout information for complex documents like contracts, scientific papers, and newspapers.

AllenAI has largely stopped maintaining PAWLs, and OpenContracts has evolved into something very different from its PAWLs namesake, but we've kept the name (and contributed a few PRs back to the PAWLs project).

"},{"location":"walkthrough/advanced/pawls-token-format/#standardized-pdf-data-layers","title":"Standardized PDF Data Layers","text":"

In OpenContracts, every document is processed through a pipeline that extracts and structures text and layout information into the following layers:

  1. Original PDF: The original PDF document.
  2. PAWLs Layer (JSON): A JSON file containing the text and positional data for each token (word) in the document.
  3. Text Layer: A text file containing the full text extracted from the document.
  4. Structural Annotations: Thanks to nlm-ingestor, we now use Nlmatics' parser to generate the PAWLs layer and turn the layout blocks - like headers, paragraphs, tables, etc. - into Open Contracts Annotation objects that represent the visual blocks of each PDF. Upon creation, we create embeddings for each Annotation, which are stored in Postgres via pgvector.

The PAWLs layer serves as the source of truth for the document, allowing seamless translation between text and positional information.

"},{"location":"walkthrough/advanced/pawls-token-format/#visualizing-how-pdfs-are-converted-to-data-annotations","title":"Visualizing How PDFs are Converted to Data & Annotations","text":"

Here's a rough diagram showing how a series of tokens - Lorem, ipsum, dolor, sit and amet - are mapped from a PDF to our various data types.

"},{"location":"walkthrough/advanced/pawls-token-format/#pawls-processing-pipeline","title":"PAWLs Processing Pipeline","text":"

The PAWLs processing pipeline involves the following steps:

  1. Token Extraction: The OCRed document is processed using the parsing engine of Grobid to extract \"tokens\" (text surrounded by whitespace, typically a word) along with their page and positional information.
  2. PAWLs Layer Generation: The extracted tokens and their positional data are stored as a JSON file, referred to as the \"PAWLs layer.\"
  3. Text Layer Generation: The full text is extracted from the PAWLs layer and stored as a separate text file, called the \"text layer.\"
"},{"location":"walkthrough/advanced/pawls-token-format/#pawls-layer-structure","title":"PAWLs Layer Structure","text":"

The PAWLs layer JSON file consists of a list of page objects, each containing the necessary tokens and page information for a given page. Here's the data shape for each page object:

class PawlsPagePythonType(TypedDict):\n    page: PawlsPageBoundaryPythonType\n    tokens: list[PawlsTokenPythonType]\n

The PawlsPageBoundaryPythonType represents the page boundary information:

class PawlsPageBoundaryPythonType(TypedDict):\n    width: float\n    height: float\n    index: int\n

Each token in the tokens list is represented by the PawlsTokenPythonType:

class PawlsTokenPythonType(TypedDict):\n    x: float\n    y: float\n    width: float\n    height: float\n    text: str\n

The x, y, width, and height fields provide the positional information for each token on the page.
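
Because the PAWLs layer stores every token in reading order with its position, you can recover a plain-text approximation of each page simply by concatenating token text. A minimal sketch using only the structures defined above:

# Minimal sketch: rebuild a rough per-page text string from a PAWLs layer JSON file.\nimport json\nfrom pathlib import Path\n\npawls_pages = json.loads(Path(\"pawls_layer.json\").read_text())\n\nfor page in pawls_pages:\n    page_text = \" \".join(token[\"text\"] for token in page[\"tokens\"])\n    print(f\"Page {page['page']['index']}: {page_text}\")\n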

"},{"location":"walkthrough/advanced/pawls-token-format/#annotation-process","title":"Annotation Process","text":"

OpenContracts allows users to annotate documents using the PAWLs layer. Annotations are stored as a dictionary mapping page numbers to annotation data:

Dict[int, OpenContractsSinglePageAnnotationType]\n

The OpenContractsSinglePageAnnotationType represents the annotation data for a single page:

class OpenContractsSinglePageAnnotationType(TypedDict):\n    bounds: BoundingBoxPythonType\n    tokensJsons: list[TokenIdPythonType]\n    rawText: str\n

The bounds field represents the bounding box of the annotation, while tokensJsons contains a list of token IDs that make up the annotation. The rawText field stores the raw text of the annotation.
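
Putting these pieces together, an annotation covering the first two tokens on page 0 would look roughly like the hand-built sketch below. The import path mirrors the TextSpan import shown elsewhere in this documentation and is an assumption; the values are illustrative:

# Hand-built sketch of the page-number -> annotation-data mapping described above; values are illustrative.\nfrom opencontractserver.types.dicts import OpenContractsSinglePageAnnotationType  # assumed import path\n\nannotation_json: dict[int, OpenContractsSinglePageAnnotationType] = {\n    0: {\n        \"bounds\": {\"top\": 100, \"bottom\": 120, \"left\": 50, \"right\": 500},\n        \"tokensJsons\": [\n            {\"pageIndex\": 0, \"tokenIndex\": 0},\n            {\"pageIndex\": 0, \"tokenIndex\": 1},\n        ],\n        \"rawText\": \"Lorem ipsum\",\n    }\n}\n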

"},{"location":"walkthrough/advanced/pawls-token-format/#advantages-of-pawls","title":"Advantages of PAWLs","text":"

The PAWLs format offers several advantages for document annotation and NLP tasks:

  1. Consistent Structure: PAWLs provides a consistent and structured representation of documents, regardless of the original file format or structure.
  2. Layout Awareness: By storing positional information for each token, PAWLs enables layout-aware text analysis and annotation.
  3. Seamless Integration: The PAWLs layer allows easy integration with various NLP libraries and tools, whether they are layout-aware or not.
  4. Reproducibility: The re-OCR process ensures consistent output across different documents and software versions.
"},{"location":"walkthrough/advanced/pawls-token-format/#conclusion","title":"Conclusion","text":"

The PAWLs format in OpenContracts provides a powerful and flexible way to represent and annotate complex documents. By extracting and structuring text and layout information, PAWLs enables efficient and accurate document analysis and annotation tasks. The consistent structure and layout awareness of PAWLs make it an essential component of the OpenContracts project.

"},{"location":"walkthrough/advanced/pawls-token-format/#example-pawls-file","title":"Example PAWLs File","text":"

Here's an example of what a PAWLs layer JSON file might look like:

[\n  {\n    \"page\": {\n      \"width\": 612.0,\n      \"height\": 792.0,\n      \"index\": 0\n    },\n    \"tokens\": [\n      {\n        \"x\": 72.0,\n        \"y\": 720.0,\n        \"width\": 41.0,\n        \"height\": 12.0,\n        \"text\": \"Lorem\"\n      },\n      {\n        \"x\": 113.0,\n        \"y\": 720.0,\n        \"width\": 35.0,\n        \"height\": 12.0,\n        \"text\": \"ipsum\"\n      },\n      {\n        \"x\": 148.0,\n        \"y\": 720.0,\n        \"width\": 31.0,\n        \"height\": 12.0,\n        \"text\": \"dolor\"\n      },\n      {\n        \"x\": 179.0,\n        \"y\": 720.0,\n        \"width\": 18.0,\n        \"height\": 12.0,\n        \"text\": \"sit\"\n      },\n      {\n        \"x\": 197.0,\n        \"y\": 720.0,\n        \"width\": 32.0,\n        \"height\": 12.0,\n        \"text\": \"amet,\"\n      },\n      {\n        \"x\": 72.0,\n        \"y\": 708.0,\n        \"width\": 66.0,\n        \"height\": 12.0,\n        \"text\": \"consectetur\"\n      },\n      {\n        \"x\": 138.0,\n        \"y\": 708.0,\n        \"width\": 60.0,\n        \"height\": 12.0,\n        \"text\": \"adipiscing\"\n      },\n      {\n        \"x\": 198.0,\n        \"y\": 708.0,\n        \"width\": 24.0,\n        \"height\": 12.0,\n        \"text\": \"elit.\"\n      }\n    ]\n  },\n  {\n    \"page\": {\n      \"width\": 612.0,\n      \"height\": 792.0,\n      \"index\": 1\n    },\n    \"tokens\": [\n      {\n        \"x\": 72.0,\n        \"y\": 756.0,\n        \"width\": 46.0,\n        \"height\": 12.0,\n        \"text\": \"Integer\"\n      },\n      {\n        \"x\": 118.0,\n        \"y\": 756.0,\n        \"width\": 35.0,\n        \"height\": 12.0,\n        \"text\": \"vitae\"\n      },\n      {\n        \"x\": 153.0,\n        \"y\": 756.0,\n        \"width\": 39.0,\n        \"height\": 12.0,\n        \"text\": \"augue\"\n      },\n      {\n        \"x\": 192.0,\n        \"y\": 756.0,\n        \"width\": 45.0,\n        \"height\": 12.0,\n        \"text\": \"rhoncus\"\n      },\n      {\n        \"x\": 237.0,\n        \"y\": 756.0,\n        \"width\": 57.0,\n        \"height\": 12.0,\n        \"text\": \"fermentum\"\n      },\n      {\n        \"x\": 294.0,\n        \"y\": 756.0,\n        \"width\": 13.0,\n        \"height\": 12.0,\n        \"text\": \"at\"\n      },\n      {\n        \"x\": 307.0,\n        \"y\": 756.0,\n        \"width\": 29.0,\n        \"height\": 12.0,\n        \"text\": \"quis.\"\n      }\n    ]\n  }\n]\n

In this example, the PAWLs layer JSON file contains an array of two page objects. Each page object has a page field with the page dimensions and index, and a tokens field with an array of token objects.

Each token object represents a word or a piece of text on the page, along with its positional information. The x and y fields indicate the coordinates of the token's bounding box, while width and height specify the dimensions of the bounding box. The text field contains the actual text content of the token.

The tokens are ordered based on their appearance on the page, allowing for the reconstruction of the document's text content while preserving the layout information.

This sample demonstrates the structure and content of a PAWLs layer JSON file, which serves as the foundation for annotation and analysis tasks in the OpenContracts project.

"},{"location":"walkthrough/advanced/register-doc-analyzer/","title":"Detailed Overview of @doc_analyzer_task Decorator","text":"

The @doc_analyzer_task decorator is an integral part of the OpenContracts CorpusAction system, which automates document processing when new documents are added to a corpus. As a refresher, within the CorpusAction system, users have three options for registering actions to run automatically on new documents:

  1. Custom data extractors
  2. Analyzer microservices
  3. Celery tasks decorated with @doc_analyzer_task

The @doc_analyzer_task decorator is specifically designed for the third option, providing a straightforward way to write and deploy simple, span-based analytics directly within the OpenContracts ecosystem.

"},{"location":"walkthrough/advanced/register-doc-analyzer/#when-to-use-doc_analyzer_task","title":"When to Use @doc_analyzer_task","text":"

The @doc_analyzer_task decorator is ideal for scenarios where:

  1. You're performing tests or analyses solely based on document text or PAWLs tokens.
  2. Your analyzer doesn't require conflicting dependencies or non-Python code bases.
  3. You want a quick and easy way to integrate custom analysis into the OpenContracts workflow.

For more complex scenarios, such as those requiring specific environments, non-Python components, or heavy computational resources, creating an analyzer microservice would be recommended.

"},{"location":"walkthrough/advanced/register-doc-analyzer/#advantages-of-doc_analyzer_task","title":"Advantages of @doc_analyzer_task","text":"

Using the @doc_analyzer_task decorator offers several benefits:

  1. Simplicity: It abstracts away much of the complexity of interacting with the OpenContracts system.
  2. Integration: Tasks are automatically integrated into the CorpusAction workflow.
  3. Consistency: It ensures that your analysis task produces outputs in a format that OpenContracts can readily use.
  4. Error Handling: It provides built-in error handling and retry mechanisms.

By using this decorator, you can focus on writing the core analysis logic while the OpenContracts system handles the intricacies of document processing, annotation creation, and result storage.

In the following sections, we'll dive deep into how to structure functions decorated with @doc_analyzer_task, what data they receive, and how their outputs are processed by the OpenContracts system.

"},{"location":"walkthrough/advanced/register-doc-analyzer/#function-signature","title":"Function Signature","text":"

Functions decorated with @doc_analyzer_task should have the following signature:

@doc_analyzer_task()\ndef your_analyzer_function(*args, pdf_text_extract=None, pdf_pawls_extract=None, **kwargs):\n    # Function body\n    pass\n
"},{"location":"walkthrough/advanced/register-doc-analyzer/#parameters","title":"Parameters:","text":"
  1. *args: Allows the function to accept any positional arguments.
  2. pdf_text_extract: Optional parameter that will contain the extracted text from the PDF.
  3. pdf_pawls_extract: Optional parameter that will contain the PAWLS (PDF Annotation With Labels and Structure) data from the PDF.
  4. **kwargs: Allows the function to accept any keyword arguments.

The resulting task then expects some kwargs which, while not passed through to the decorated function, are used to load the data that it receives:

  • doc_id: The ID of the document being analyzed.
  • corpus_id: The ID of the corpus containing the document (if applicable).
  • analysis_id: The ID of the analysis being performed.
"},{"location":"walkthrough/advanced/register-doc-analyzer/#injected-data","title":"Injected Data","text":"

The decorator provides the following data to your decorated function as kwargs:

  1. PDF Text Extract: The full text content of the PDF document, accessible via the pdf_text_extract parameter.
  2. PAWLS Extract: A structured representation of the document's layout and content, accessible via the pdf_pawls_extract parameter. This typically includes information about pages, tokens, and their positions.
"},{"location":"walkthrough/advanced/register-doc-analyzer/#required-outputs","title":"Required Outputs","text":"

The @doc_analyzer_task decorator in OpenContracts expects the decorated function's return value to match a specific output structure: a four-element tuple, with each of the four elements (below) having a specific schema.

return doc_labels, span_labels, metadata, task_pass\n

Failure to adhere to this in your function will throw an error. This structure is designed to map directly to the data models used in the OpenContracts system.

Let's break down each component of the required output and explain how it's used.

"},{"location":"walkthrough/advanced/register-doc-analyzer/#1-document-labels-doc_labels","title":"1. Document Labels (doc_labels)","text":"

Document labels should be a list of strings representing the labels you want to apply to the entire document.

doc_labels = [\"IMPORTANT_DOCUMENT\", \"FINANCIAL_REPORT\"]\n

Purpose: These labels are applied to the entire document.

Relationship to OpenContracts Models:

  • Each string in this list corresponds to an AnnotationLabel object with label_type = DOC_TYPE_LABEL.
  • For each label, an Annotation object is created with:
    • document: Set to the current document
    • annotation_label: The corresponding AnnotationLabel object
    • analysis: The current Analysis object
    • corpus: The corpus of the document (if applicable)

Example in OpenContracts:

for label_text in doc_labels:\n    label = AnnotationLabel.objects.get(text=label_text, label_type=\"DOC_TYPE_LABEL\")\n    Annotation.objects.create(\n        document=document,\n        annotation_label=label,\n        analysis=analysis,\n        corpus=document.corpus\n    )\n
"},{"location":"walkthrough/advanced/register-doc-analyzer/#2-span-labels-span_labels","title":"2. Span Labels (span_labels)","text":"

These describe token- or span-level features that you want to annotate.

span_labels = [\n    (TextSpan(id=\"1\", start=0, end=10, text=\"First ten\"), \"HEADER\"),\n    (TextSpan(id=\"2\", start=50, end=60, text=\"Next span\"), \"IMPORTANT_CLAUSE\")\n]\n

Purpose: These labels are applied to specific spans of text within the document.

Relationship to OpenContracts Models:

  • Each tuple in this list creates an Annotation object.
  • The TextSpan contains the position and content of the annotated text.
  • The label string corresponds to an AnnotationLabel object with label_type = TOKEN_LABEL.

Example in OpenContracts:

for span, label_text in span_labels:\n    label = AnnotationLabel.objects.get(text=label_text, label_type=\"TOKEN_LABEL\")\n    Annotation.objects.create(\n        document=document,\n        annotation_label=label,\n        analysis=analysis,\n        corpus=document.corpus,\n        page=calculate_page_from_span(span),\n        raw_text=span.text,\n        json={\n            \"1\": {\n                \"bounds\": calculate_bounds(span),\n                \"tokensJsons\": calculate_tokens(span),\n                \"rawText\": span.text\n            }\n        }\n    )\n
"},{"location":"walkthrough/advanced/register-doc-analyzer/#3-metadata","title":"3. Metadata","text":"

This element contains DataCell values we want to associate with the resulting Analysis.

metadata = [{\"data\": {\"processed_date\": \"2023-06-15\", \"confidence_score\": 0.95}}]\n

Purpose: This provides additional context or information about the analysis.

Relationship to OpenContracts Models:

  • This element contains DataCell values we want to associate with the resulting Analysis.

Example in OpenContracts:

analysis.metadata = metadata\nanalysis.save()\n
"},{"location":"walkthrough/advanced/register-doc-analyzer/#4-task-pass-task_pass","title":"4. Task Pass (task_pass)","text":"

This can be used to signal the success or failure of some kind of test or logic for automated testing.

task_pass = True\n

Purpose: Indicates whether the analysis task completed successfully.

Relationship to OpenContracts Models:

  • This boolean value is used to update the status of the Analysis object.
  • It can trigger further actions or notifications in the OpenContracts system.

Example in OpenContracts:

if task_pass:\n    analysis.status = \"COMPLETED\"\nelse:\n    analysis.status = \"FAILED\"\nanalysis.save()\n
"},{"location":"walkthrough/advanced/register-doc-analyzer/#how-the-decorator-processes-the-output","title":"How the Decorator Processes the Output","text":"
  1. Validation: The decorator first checks that the return value is a tuple of length 4 and that each element has the correct type.

  2. Document Label Processing: For each document label, it creates an Annotation object linked to the document, analysis, and corpus.

  3. Span Label Processing: For each span label, it creates an Annotation object with detailed information about the text span, including its position and content.

  4. Metadata Handling: The metadata is stored, typically with the Analysis object, for future reference.

  5. Task Status Update: Based on the task_pass value, the status of the analysis is updated.

  6. Error Handling: If any part of this process fails, the decorator handles the error, potentially marking the task as failed and logging the error.

"},{"location":"walkthrough/advanced/register-doc-analyzer/#benefits-of-this-structure","title":"Benefits of This Structure","text":"
  1. Consistency: By enforcing a specific output structure, the system ensures that all document analysis tasks provide consistent data.

  2. Separation of Concerns: The analysis logic (in the decorated function) is separated from the database operations (handled by the decorator).

  3. Flexibility: The structure allows for both document-level and span-level annotations, accommodating various types of analysis.

  4. Traceability: By linking annotations to specific analyses and including metadata, the system maintains a clear record of how and when annotations were created.

  5. Error Management: The task_pass boolean allows for clear indication of task success or failure, which can trigger appropriate follow-up actions in the system.

By structuring the output this way, the @doc_analyzer_task decorator seamlessly integrates custom analysis logic into the broader OpenContracts data model, ensuring that the results of document analysis are properly stored, linked, and traceable within the system.

"},{"location":"walkthrough/advanced/register-doc-analyzer/#example-implementation","title":"Example Implementation","text":"

Here's an example of how a function decorated with @doc_analyzer_task might look:

from opencontractserver.shared.decorators import doc_analyzer_task\nfrom opencontractserver.types.dicts import TextSpan\n\n\n@doc_analyzer_task()\ndef example_analyzer(*args, pdf_text_extract=None, pdf_pawls_extract=None, **kwargs):\n    doc_id = kwargs.get('doc_id')\n\n    # Your analysis logic here\n    # For example, let's say we're identifying a document type and important clauses\n\n    doc_type = identify_document_type(pdf_text_extract)\n    important_clauses = find_important_clauses(pdf_text_extract)\n\n    doc_labels = [doc_type]\n    span_labels = [\n        (TextSpan(id=str(i), start=clause.start, end=clause.end, text=clause.text), \"IMPORTANT_CLAUSE\")\n        for i, clause in enumerate(important_clauses)\n    ]\n    metadata = [{\"data\": {\"analysis_version\": \"1.0\", \"clauses_found\": len(important_clauses)}}]\n    task_pass = True\n\n    return doc_labels, span_labels, metadata, task_pass\n

In this example, the function uses the injected pdf_text_extract to perform its analysis. It identifies the document type and finds important clauses, then structures this information into the required output format.

By using the @doc_analyzer_task decorator, this function is automatically integrated into the OpenContracts system, handling document locking, error management, and annotation creation without requiring explicit code for these operations in the function body.

"},{"location":"walkthrough/advanced/run-gremlin-analyzer/","title":"Run a Gremlin Analyzer","text":""},{"location":"walkthrough/advanced/run-gremlin-analyzer/#introduction-to-gremlin-integration","title":"Introduction to Gremlin Integration","text":"

OpenContracts integrates with a powerful NLP engine called Gremlin Engine (\"Gremlin\"). If you run a Gremlin analyzer on a Corpus, it will create annotations of its own that you can view and export (e.g. automatically applying document labels or labeling parties, dates, and places, etc.). It's meant to provide a consistent API to deliver and render NLP and machine learning capabilities to end-users. As discussed in the configuration section, you need to install Gremlin Analyzers through the admin dashboard.

Once you've installed Gremlin Analyzers, however, it's easy to apply them.

"},{"location":"walkthrough/advanced/run-gremlin-analyzer/#using-an-installed-gremlin-analyzer","title":"Using an Installed Gremlin Analyzer","text":"
  1. If analysis capabilities are enabled for your instance, when you right-click on a Corpus, you'll see an option to \"Analyze Corpus\":

  2. Clicking on this item will bring up a dialog where you can browse available analyzers:

  3. Select one and hit \"Analyze\" to submit a corpus for processing. When you go to the Analysis tab of your Corpus now, you'll see the analysis. Most likely, if you just clicked there, it will say processing:

  4. When the Analysis is complete, you'll see a summary of the number of labels and annotations applied by the analyzer:

"},{"location":"walkthrough/advanced/run-gremlin-analyzer/#note-on-processing-time","title":"Note on Processing Time","text":"

Large Corpuses of hundreds of documents can take a long time to process (10 minutes or more). It's hard to predict processing time up front, because it's dependent on the number of total pages and the specific analysis being performed. At the moment, there is not a great mechanism in place to detect and handle failures in a Gremlin analyzer and reflect this in OpenContracts. It's on our roadmap to improve this integration. In the meantime, the example analyzers we've released with Gremlin should be very stable, so they should run predictably.

"},{"location":"walkthrough/advanced/run-gremlin-analyzer/#viewing-the-outputs","title":"Viewing the Outputs","text":"

Once an Analysis completes, you'll be able to browse the annotations from the analysis in several ways.

  1. First, they'll be available in the \"Annotation\" tab, and you can easily filter to annotations from a specific analyzer.
  2. Second, when you load a Document, in the Annotator view, there's a small widget in the top of the annotator that has three downwards-facing arrows and says \"Human Annotation Mode\".
  3. Click on the arrows to open a tray showing the analyses applied to this document.
  4. Click on an analysis to load the annotations and view them in the document.

Note: You can delete an analysis, but you cannot edit it. The annotations are machine-created and cannot be edited by human users.

"},{"location":"walkthrough/advanced/testing-llama-index-calls/","title":"Testing Complex LLM Applications","text":"

I've built a number of full-stack, LLM-powered applications at this point. A persistent challenge is testing the underlying LLM query pipelines in a deterministic and isolated way.

A colleague and I eventually hit on a way to make testing complex LLM behavior deterministic and decoupled from upstream LLM API providers. This tutorial walks you through the problem and solution to this testing issue.

In this guide, you'll learn:

  1. Why testing LLM applications is particularly challenging
  2. How to overcome common testing obstacles like API dependencies and resource limitations
  3. An innovative approach using VCR.py to record and replay LLM interactions
  4. How to implement this solution with popular frameworks like LlamaIndex and Django
  5. Potential pitfalls to watch out for when using this method

Whether you're working with RAG models, multi-hop reasoning loops, or other complex LLM architectures, this tutorial will show you how to create fast, deterministic, and accurate tests without relying on expensive resources or compromising the integrity of your test suite.

By the end of this guide, you'll have a powerful new tool in your AI development toolkit, enabling you to build more robust and reliable LLM-powered applications. Let's dive in!

"},{"location":"walkthrough/advanced/testing-llama-index-calls/#problem","title":"Problem","text":"

To understand why testing complex LLM-powered applications is challenging, let's break down the components and processes involved in a typical RAG (Retrieval-Augmented Generation) application using a framework like LlamaIndex:

  1. Data Ingestion: Your application likely starts by ingesting large amounts of data from various sources (documents, databases, APIs, etc.).

  2. Indexing: This data is then processed and indexed, often using vector embeddings, to allow for efficient retrieval.

  3. Query Processing: When a user submits a query, your application needs to: a) Convert the query into a suitable format (often involving embedding the query) b) Search the index to retrieve relevant information c) Format the retrieved information for use by the LLM

  4. LLM Interaction: The processed query and retrieved information are sent to an LLM (like GPT-4) for generating a response.

  5. Post-processing: The LLM's response might need further processing or validation before being returned to the user.

Now, consider the challenges in testing such a system:

  1. External Dependencies: Many of these steps rely on external APIs or services. The indexing and query embedding often use one model (e.g., OpenAI's embeddings API), while the final response generation uses another (e.g., GPT-4). Traditional testing approaches would require mocking or stubbing these services, which can be complex and may not accurately represent real-world behavior.

  2. Resource Intensity: Running a full RAG pipeline for each test can be extremely resource-intensive and time-consuming. It might involve processing large amounts of data and making multiple API calls to expensive LLM services.

  3. Determinism: LLMs can produce slightly different outputs for the same input, making it difficult to write deterministic tests. This variability can lead to flaky tests that sometimes pass and sometimes fail.

  4. Complexity of Interactions: In more advanced setups, you might have multi-step reasoning processes or agent-based systems where the LLM is called multiple times with intermediate results. This creates complex chains of API calls that are difficult to mock effectively.

  5. Sensitive Information: Your tests might involve querying over proprietary or sensitive data. You don't want to include this data in your test suite, especially if it's going to be stored in a version control system.

  6. Cost: Running tests that make real API calls to LLM services can quickly become expensive, especially when running comprehensive test suites in CI/CD pipelines.

  7. Speed: Tests that rely on actual API calls are inherently slower, which can significantly slow down your development and deployment processes.

Traditional testing approaches fall short in addressing these challenges:

  • Unit tests with mocks may not capture the nuances of LLM behavior.
  • Integration tests with real API calls are expensive, slow, and potentially non-deterministic.
  • Dependency injection can help but becomes unwieldy with complex, multi-step processes.

What's needed is a way to capture the behavior of the entire system, including all API interactions, in a reproducible manner that doesn't require constant re-execution of expensive operations. This is where the VCR approach comes in, as we'll explore in the next section.

"},{"location":"walkthrough/advanced/testing-llama-index-calls/#solution","title":"Solution","text":"

Over a couple of years of working with the LLM and RAG application stack, a solution to this problem has emerged. A former colleague of mine pointed out a library for Ruby called VCR with the following goal:

Record your test suite's HTTP interactions and replay them during future test runs for fast, deterministic, accurate\ntests.\n

This sounds like exactly the sort of solution we're looking for! We have numerous API calls to third-party API endpoints. They are deterministic IF the responses from each step of the LLM reasoning loop are identical to a previous run of the same loop. If we could record each LLM call and response from one run of a specific LlamaIndex pipeline, for example, and then intercept future calls to the same endpoints and replay the old responses, in theory we'd have exactly the same results.

It turns out there's a Python version of VCR called VCR.py. It integrates nicely with pytest and lets you decorate an entire Django test. If you call a LlamaIndex pipeline from your test and no \"cassette\" file is found in your fixtures directory, your HTTP calls will go out to the actual APIs and the requests and responses will be recorded to a new cassette.

"},{"location":"walkthrough/advanced/testing-llama-index-calls/#example","title":"Example","text":"

Using VCR.py with LlamaIndex is super simple. In a Django test, for example, you just write a test function as usual:

import vcr\nfrom django.test import TestCase\n\n\nclass ExtractsTaskTestCase(TestCase):\n\n    def test_run_extract_task(self):\n        print(f\"{self.extract.documents.all()}\")\n        ...\n

Add a vcr.py decorator naming the target fixture location:

import vcr\nfrom django.test import TestCase\n\n\nclass ExtractsTaskTestCase(TestCase):\n\n    @vcr.use_cassette(\"fixtures/vcr_cassettes/test_run_extract_task.yaml\", filter_headers=['authorization'])\n    def test_run_extract_task(self):\n        print(f\"{self.extract.documents.all()}\")\n\n        # Call your LLMs or LLM framework here\n        ...\n

Now you can call LlamaIndex query engines, retrievers, agents, etc. On the first run, all of your API calls and responses are captured. You'll obviously need to provide your API credentials, where required, or these calls will fail. As noted below, if you omit the filter_headers parameter, your API key will end up in the recorded 'cassette'.

On subsequent runs, VCR will intercept calls to recorded endpoints with identical data and return the recorded responses, letting you fully test your use of LlamaIndex without needing to patch the library or its dependencies.
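
If you'd rather have CI fail fast than silently hit the network when a cassette is missing or a request has changed, one option is to tighten VCR's record mode in CI. Here's a minimal sketch, assuming you gate it on a CI environment variable:

import os\n\nimport vcr\nfrom django.test import TestCase\n\n# In CI we never want live HTTP calls; locally we allow recording a cassette once.\nRECORD_MODE = \"none\" if os.environ.get(\"CI\") else \"once\"\n\n\nclass ExtractsTaskTestCase(TestCase):\n\n    @vcr.use_cassette(\n        \"fixtures/vcr_cassettes/test_run_extract_task.yaml\",\n        filter_headers=['authorization'],\n        record_mode=RECORD_MODE,\n    )\n    def test_run_extract_task(self):\n        # With record_mode=\"none\", an unrecorded or changed request raises an error\n        # instead of silently hitting the real API.\n        ...\n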

"},{"location":"walkthrough/advanced/testing-llama-index-calls/#pitfalls","title":"Pitfalls","text":"

This approach has been used for production applications. We have seen a couple of things worth noting:

  1. Be warned that if you don't use filter_headers=['authorization'] in your decorators, your API keys will end up in the cassette. You can replace these with fake credentials, or you can simply revoke the now-exposed keys.
  2. If you use any local models and don't preload them, VCR.py will capture the calls to download the model's weights, configuration, etc. Even for small models this can be a couple of hundred megabytes, and for somewhat larger open models like Phi or Llama it can be gigabytes of data. This is particularly problematic for GitHub, as you'll quickly exceed file size caps even if you're using LFS. One way to preload models is sketched after this list.
  3. There is a bug in VCR.py 6.0.1 that surfaces in some limited circumstances if you use async code.
  4. This is obviously Python-only. Presumably there are similar libraries for other languages and HTTP client libraries.
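
For the second pitfall, one mitigation (a sketch - the HuggingFaceEmbedding import path varies by LlamaIndex version) is to instantiate any local models in setUpClass, outside the cassette, so their weights are downloaded and cached before VCR starts recording:

import vcr\nfrom django.test import TestCase\n\n# NOTE: the import path for HuggingFaceEmbedding differs across LlamaIndex versions.\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\n\n\nclass ExtractsTaskTestCase(TestCase):\n\n    @classmethod\n    def setUpClass(cls):\n        super().setUpClass()\n        # Instantiating the embedding model here happens OUTSIDE any cassette, so the\n        # multi-hundred-megabyte weight download is cached and never recorded by VCR.\n        HuggingFaceEmbedding(\n            model_name=\"multi-qa-MiniLM-L6-cos-v1\", cache_folder=\"/models\"\n        )\n\n    @vcr.use_cassette(\n        \"fixtures/vcr_cassettes/test_run_extract_task.yaml\",\n        filter_headers=['authorization'],\n    )\n    def test_run_extract_task(self):\n        ...\n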
"},{"location":"walkthrough/advanced/write-your-own-extractors/","title":"Write Your Own Agentic, LlamaIndex Data Extractor","text":""},{"location":"walkthrough/advanced/write-your-own-extractors/#refresher-on-what-an-open-contracts-data-extractor-does","title":"Refresher on What an Open Contracts Data Extractor Does","text":"

When you create a new Extract on the frontend, you can build a grid of data field columns and document rows that the application will traverse, cell-by-cell, to answer the question posed in each column for every document:

You can define the target data shape for each column - e.g. require all outputs match a certain dictionary schema or be floats. We leverage LLMs to ensure that the retrieved data matches the desired schema.
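
To make the idea of a target schema concrete, here's a tiny standalone sketch (not the production extract pipeline) showing how Marvin - which our default extractor uses below - can coerce free text into a Pydantic model. It assumes Marvin is installed and configured with an OpenAI API key:

import marvin\nfrom pydantic import BaseModel\n\n\nclass Party(BaseModel):\n    name: str\n    role: str  # e.g. Buyer or Seller\n\n\n# Marvin uses an LLM to coerce unstructured text into the target schema.\nparty = marvin.cast(\n    \"This Agreement is entered into by ACME Corp., as Seller...\",\n    target=Party,\n    instructions=\"Extract the first named party and its role.\",\n)\nprint(party)  # e.g. Party(name='ACME Corp.', role='Seller')\n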

You'll notice when you add or edit a column, you can configure a number of different things:

Specifically, you can adjust:

  • name: The name of the column.
  • query: The query used for extraction.
  • match_text: Text we want to match semantically to find responsive annotations. If this field is provided, we use it instead of the query to find responsive text; if not, we fall back to the query.
  • must_contain_text: Text that must be contained in a returned annotation. This is case insensitive.
  • output_type: The type of data to be extracted. This can be a Python primitive or a simple Pydantic model.
  • instructions: Instructions for the extraction process. This tells our parser how to convert retrieved text to the target output_type. Not strictly necessary, but recommended, particularly for objects.
  • task_name: The name of the registered celery extract task used for processing (this lets you define and deploy custom ones). We'll show you how to create a custom one in this walkthrough.
  • agentic: Boolean indicating whether the extraction is agentic.
  • extract_is_list: Boolean indicating whether the extraction result is a list of the output_type you provided.
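
For orientation, here's a hypothetical sketch of configuring such a column programmatically via the Django ORM. The field names come from the list above, but the import path, lookup, and required arguments are assumptions - treat it as illustrative rather than authoritative:

# Hypothetical import path - adjust to wherever Column and Fieldset live in your install.\nfrom opencontractserver.extracts.models import Column, Fieldset\n\nfieldset = Fieldset.objects.get(name=\"My Extract Fields\")  # assumed lookup\n\nColumn.objects.create(\n    fieldset=fieldset,\n    name=\"Effective Date\",\n    query=\"What is the effective date of this agreement?\",\n    match_text=\"This Agreement is effective as of\",\n    must_contain_text=\"effective\",\n    output_type=\"str\",  # a Python primitive, or a Pydantic model for structured outputs\n    instructions=\"Return the date as an ISO-8601 string.\",\n    task_name=\"opencontractserver.tasks.data_extract_tasks.oc_llama_index_doc_query\",\n    agentic=False,\n    extract_is_list=False,\n    creator_id=1,  # assumed - your setup may attach the creator differently\n)\n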

You'll notice that in the GUI, there is a dropdown to pick the extract task:

This is actually retrieved dynamically from the backend from the tasks in opencontractserver/tasks/data_extract_tasks.py. Every celery task in this Python module will show up in the GUI, and the description in the dropdown is pulled out of the docstring provided in the code itself:

@shared_task\ndef oc_llama_index_doc_query(cell_id, similarity_top_k=15, max_token_length: int = 512):\n    \"\"\"\n    OpenContracts' default LlamaIndex and Marvin-based data extract pipeline to run queries specified for a\n    particular cell. We use sentence transformer embeddings + sentence transformer re-ranking.\n    \"\"\"\n\n    ...\n

This means you can write your own data extractors! If you write a new task in data_extract_tasks.py, the next time the containers are rebuilt, you should see your custom extractor. We'll walk through this in a minute.

"},{"location":"walkthrough/advanced/write-your-own-extractors/#how-open-contracts-integrates-with-llamaindex","title":"How Open Contracts Integrates with LlamaIndex","text":"

You don't have to use LlamaIndex in your extractor - you could just pass an entire document to OpenAI's GPT-4o, for example - but LlamaIndex provides a tremendous amount of configurability that may yield faster, better, cheaper, or more reliable performance in many cases. You could even incorporate tools and third-party APIs in agentic fashion.

We assume you're already familiar with LlamaIndex, the \"data framework for your LLM applications\". It has a rich ecosystem of integrations, prompt templates, agents, retrieval techniques and more to let you customize how your LLMs interact with data.

"},{"location":"walkthrough/advanced/write-your-own-extractors/#custom-djangoannotationvectorstore","title":"Custom DjangoAnnotationVectorStore","text":"

We've written a custom implementation of one of LlamaIndex's core building blocks - the VectorStore - that lets LlamaIndex use OpenContracts as a vector store. Our DjangoAnnotationVectorStore in opencontractserver/llms/vector_stores.py lets you quickly write a LlamaIndex agent or question answering pipeline that can pull directly from the rich annotations and structural data (like annotation positions, layout class - e.g. header - and more) in OpenContracts. If you want to learn more about LlamaIndex's vector stores, see more in the documentation about VectorStores.

"},{"location":"walkthrough/advanced/write-your-own-extractors/#task-orchestration","title":"Task Orchestration","text":"

As discussed elsewhere, we use celery workers to run most of our analytics and transform logic. It simplifies the management of complex queues and lets us scale our application compute horizontally in the right environment.

Our data extract functionality has an orchestrator task - run_extract. For each data extract column for each document in the extract, we look at the column's task_name property and use it to attempt to load the celery task with that name via the get_task_by_name function:

def get_task_by_name(task_name) -> Optional[Callable]:\n    \"\"\"\n    Try to get celery task function Callable by name\n    \"\"\"\n    try:\n        return celery_app.tasks.get(task_name)\n    except Exception:\n        return None\n

As we loop over the datacells, we store the celery invocation for the cell's column's task_name in a task list:

for document_id in document_ids:\n        for column in fieldset.columns.all():\n            with transaction.atomic():\n                cell = Datacell.objects.create(\n                    extract=extract,\n                    column=column,\n                    data_definition=column.output_type,\n                    creator_id=user_id,\n                    document_id=document_id,\n                )\n\n            # Omitting some code here\n            ...\n\n            # Get the task function dynamically based on the column's task_name\n            task_func = get_task_by_name(column.task_name)\n            if task_func is None:\n                logger.error(\n                    f\"Task {column.task_name} not found for column {column.id}\"\n                )\n                continue\n\n            # Add the task to the group\n            tasks.append(task_func.si(cell.pk))\n

Upon completing the traversal of the grid, we use a celery workflow to run all the cell extract tasks in parallel:

chord(group(*tasks))(mark_extract_complete.si(extract_id))\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#our-default-data-extract-task-oc_llama_index_doc_query","title":"Our Default Data Extract Task - oc_llama_index_doc_query","text":"

Our default data extractor uses LlamaIndex to retrieve and structure the data in the DataGrid. Before we write a new one, let's walk through how we orchestrate tasks and how our default extract works.

oc_llama_index_doc_query requires a Datacell id as a positional argument. NOTE: if you were to write your own extract task, you'd need to follow this same signature (with a name of your choice, of course):

@shared_task\ndef oc_llama_index_doc_query(cell_id, similarity_top_k=15, max_token_length: int = 512):\n    \"\"\"\n    OpenContracts' default LlamaIndex and Marvin-based data extract pipeline to run queries specified for a\n    particular cell. We use sentence transformer embeddings + sentence transformer re-ranking.\n    \"\"\"\n\n    ...\n
The frontend pulls the task description from the docstring, so, again, if you write your own, make sure you provide a useful description.

Let's walk through how oc_llama_index_doc_query works.

"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-1-mark-datacell-as-started","title":"Step 1 - Mark Datacell as Started","text":"

Once the task kicks off, step one is to log in the DB that the task has started:

    ...\n\n    try:\n        datacell.started = timezone.now()\n        datacell.save()\n\n        ...\n
  • Exception Handling: We use a try block to handle any exceptions that might occur during the processing.
  • Set Started Timestamp: We set the started field to the current time to mark the beginning of the datacell processing.
  • Save Changes: We save the Datacell object to the database.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-2-configure-embeddings-and-llm-settings","title":"Step 2 - Configure Embeddings and LLM Settings","text":"

Then, we create our embeddings module. We actually have a microservice for this to cut down on memory usage and allow for easier scaling of the compute-intensive parts of the app. For now, though, the task does not call the microservice, so we're using a lightweight sentence transformer embedding model:

    document = datacell.document\n\n    embed_model = HuggingFaceEmbedding(\n        model_name=\"multi-qa-MiniLM-L6-cos-v1\", cache_folder=\"/models\"\n    )\n    Settings.embed_model = embed_model\n\n    llm = OpenAI(model=settings.OPENAI_MODEL, api_key=settings.OPENAI_API_KEY)\n    Settings.llm = llm\n
  • Retrieve Document: We fetch the document associated with the datacell.
  • Configure Embedding Model: We set up the HuggingFace embedding model. This model converts text into embeddings (vector representations), which are essential for semantic search.
  • Set Embedding Model in Settings: We assign the embedding model to Settings.embed_model for global access within the task.
  • Configure LLM: We set up the OpenAI model using the API key from settings. This model will be used for language processing tasks.
  • Set LLM in Settings: We assign the LLM to Settings.llm for global access within the task.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-3-initialize-djangoannotationvectorstore-for-llamaindex","title":"Step 3 - Initialize DjangoAnnotationVectorStore for LlamaIndex","text":"

Now, here's the cool part with LlamaIndex. Assuming we have Django models with embeddings produced by the same embeddings model, we don't need to do any real-time encoding of our source documents, and our Django object store in Postgres can be loaded as a LlamaIndex vector store. Even better, we can pass in some arguments that let us scope the store down to what we want. For example, we can limit retrieval to a single document, to annotations containing certain text, and to annotations with certain labels - e.g. termination clauses. This lets us leverage all of the work that's been done by humans (and machines) in an OpenContracts corpus to label and tag documents. We're getting the best of both worlds - both human and machine intelligence!

    vector_store = DjangoAnnotationVectorStore.from_params(\n        document_id=document.id, must_have_text=datacell.column.must_contain_text\n    )\n    index = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n
  • Vector Store Initialization: Here we create an instance of DjangoAnnotationVectorStore using parameters specific to the document and column.
  • LlamaIndex Integration: We create a VectorStoreIndex from the custom vector store. This integrates the vector store with LlamaIndex, enabling advanced querying capabilities.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-4-perform-retrieval","title":"step 4 - Perform Retrieval","text":"

Now we use the properties of a configured column to find the proper text. For example, if match_text has been provided, we search for the nearest K annotations to the match_text (rather than searching based on the query itself):

    search_text = datacell.column.match_text\n    query = datacell.column.query\n\n    retriever = index.as_retriever(similarity_top_k=similarity_top_k)\n    results = retriever.retrieve(search_text if search_text else query)\n
  • Retrieve Search Text and Query: We fetch the search text and query from the column associated with the datacell.
  • Configure Retriever: We configure the retriever with the similarity_top_k parameter, which determines the number of top similar results to retrieve.
  • Retrieve Results: We perform the retrieval using the search text or query. The retriever fetches the most relevant annotations from the vector store.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-5-rerank-results","title":"Step 5 - Rerank Results","text":"

We use a LlamaIndex reranker (in this case a SentenceTransformer reranker) to rerank the retrieved annotations based on the query. This is an example of where you could easily customize your own pipeline - you might want to rerank based on the match text, use an LLM-based reranker, or use a totally different reranker like Cohere's:

sbert_rerank = SentenceTransformerRerank(\n    model=\"cross-encoder/ms-marco-MiniLM-L-2-v2\", top_n=5\n)\nretrieved_nodes = sbert_rerank.postprocess_nodes(\n    results, QueryBundle(query)\n)\n
  • Reranker Configuration: We set up the SentenceTransformerRerank model. This model is used to rerank the retrieved results for better relevance.
  • Rerank Nodes: We rerank the retrieved nodes using the SentenceTransformerRerank model and the original query. This ensures that the top results are the most relevant.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-6-process-retrieved-annotations","title":"Step 6 - Process Retrieved Annotations","text":"

Now, we determine the Annotation instance ids we retrieved so these can be linked to the datacell. On the OpenContracts frontend, this lets us readily navigate to the Annotations in the source documents:

        retrieved_annotation_ids = [\n            n.node.extra_info[\"annotation_id\"] for n in retrieved_nodes\n        ]\n        datacell.sources.add(*retrieved_annotation_ids)\n
  • Extract Annotation IDs: We extract the annotation IDs from the retrieved nodes.
  • Add Sources: We add the retrieved annotation IDs to the sources field of the datacell. This links the relevant annotations to the datacell.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-7-format-retrieved-text-for-output","title":"Step 7 - Format Retrieved Text for Output","text":"

Next, we aggregate the retrieved annotations into a single string we can pass to an LLM:

    retrieved_text = \"\\n\".join(\n        [f\"```Relevant Section:\\n\\n{n.text}\\n```\" for n in results]\n    )\n    logger.info(f\"Retrieved text: {retrieved_text}\")\n
  • Format Text: We format the retrieved text for output. Each relevant section is enclosed in Markdown code blocks for better readability.
  • Log Retrieved Text: We log the retrieved text for debugging and tracking purposes.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-8-parse-data","title":"Step 8 - Parse Data","text":"

Finally, we dynamically specify the output schema / format of the data. We use marvin to do the structuring, but you could tweak the pipeline to use LlamaIndex's Structured Data Extract or you could roll your own custom parsers.

        output_type = parse_model_or_primitive(datacell.column.output_type)\n        logger.info(f\"Output type: {output_type}\")\n\n        # If provided, we use the column parse instructions property to instruct Marvin how to parse, otherwise,\n        # we give it the query and target output schema. Usually the latter approach is OK, but the former is more\n        # intentional and gives better performance.\n        parse_instructions = datacell.column.instructions\n\n        result = marvin.cast(\n            retrieved_text,\n            target=output_type,\n            instructions=parse_instructions if parse_instructions else query,\n        )\n\n        if isinstance(result, BaseModel):\n            datacell.data = {\"data\": result.model_dump()}\n        else:\n            datacell.data = {\"data\": result}\n        datacell.completed = timezone.now()\n        datacell.save()\n
  • Determine Output Type: We determine the output type based on the column's output type.
  • Log Output Type: We log the output type for debugging purposes.
  • Parse Instructions: We fetch parsing instructions from the column.
  • Parse Result: We use marvin.cast to parse the retrieved text into the desired output type using the parsing instructions.
  • Save Result: We save the parsed result in the data field of the datacell. We also mark the datacell as completed and save the changes to the database.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-9-save-results","title":"Step 9 - Save Results","text":"

This step is particularly important if you write your own extract. We're planning to write a decorator to make a lot of this easier and automatic, but, for now, you need to remember to store the output of your extract task as JSON of the form:

{\n  \"data\": <extracted data>\n}\n

Here's the code from oc_llama_index_doc_query:

 if isinstance(result, BaseModel):\n    datacell.data = {\"data\": result.model_dump()}\nelse:\n    datacell.data = {\"data\": result}\n\ndatacell.completed = timezone.now()\ndatacell.save()\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-10-exception-handling","title":"Step 10 - Exception Handling","text":"

If processing fails, we catch the error and stack trace. These are stored with the Datacell so we can see which extracts succeeded or failed, and, if they failed, why.

    except Exception as e:\n        logger.error(f\"run_extract() - Ran into error: {e}\")\n        datacell.stacktrace = f\"Error processing: {e}\"\n        datacell.failed = timezone.now()\n        datacell.save()\n
  • Exception Logging: We log any exceptions that occur during the processing.
  • Save Stacktrace: We save the error message in the stacktrace field of the datacell.
  • Mark as Failed: We mark the datacell as failed and save the changes to the database.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#write-a-custom-llamaindex-extractor","title":"Write a Custom LlamaIndex Extractor","text":"

Let's write another data extractor based on LlamaIndex's REACT Agent!

"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-1-ensure-you-load-datacell","title":"Step 1 - Ensure you Load Datacell","text":"

As mentioned above, we'd like to use decorators to make some of this more automatic, but, for now, you need to load the Datacell instance from the provided id:

@shared_task\ndef llama_index_react_agent_query(cell_id):\n    \"\"\"\n    Use our DjangoAnnotationVectorStore + LlamaIndex REACT Agent to retrieve text.\n    \"\"\"\n\n    datacell = Datacell.objects.get(id=cell_id)\n\n    try:\n\n        datacell.started = timezone.now()\n        datacell.save()\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-2-setup-embeddings-model-llm","title":"Step 2 - Setup Embeddings Model + LLM","text":"

OpenContracts uses multi-qa-MiniLM-L6-cos-v1 to generate its embeddings (for now - we can make this modular as well). You can use whatever LLM you want, but we're using GPT-4o. Don't forget to instantiate both of these and configure LlamaIndex's global settings:

embed_model = HuggingFaceEmbedding(\n    model_name=\"multi-qa-MiniLM-L6-cos-v1\", cache_folder=\"/models\"\n)  # Using our pre-load cache path where the model was stored on container build\nSettings.embed_model = embed_model\n\nllm = OpenAI(model=settings.OPENAI_MODEL, api_key=settings.OPENAI_API_KEY)\nSettings.llm = llm\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-3-instantiate-open-contracts-vector-store","title":"Step 3 - Instantiate Open Contracts Vector Store","text":"

Now, let's instantiate a vector store that will only retrieve annotations from the document linked to our loaded datacell:

document = datacell.document\n\nvector_store = DjangoAnnotationVectorStore.from_params(\n    document_id=document.id, must_have_text=datacell.column.must_contain_text\n)\nindex = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-4-instantiate-query-engine-wrap-it-as-llm-agent-tool","title":"Step 4 - Instantiate Query Engine & Wrap it As LLM Agent Tool","text":"

Next, let's use OpenContracts as a LlamaIndex query engine:

doc_engine = index.as_query_engine(similarity_top_k=10)\n

And let's use that engine with an agent tool:

document = datacell.document\n\nquery_engine_tools = [\n    QueryEngineTool(\n        query_engine=doc_engine,\n        metadata=ToolMetadata(\n            name=\"doc_engine\",\n            description=(\n                f\"Provides detailed annotations and text from within the {document.title}\"\n            ),\n        ),\n    )\n]\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-5-setup-agent","title":"Step 5 - Setup Agent","text":"
agent = ReActAgent.from_tools(\n    query_engine_tools,\n    llm=llm,\n    verbose=True,\n    # context=context\n)\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-6-decide-how-to-map-column-properties-to-retrieval-process","title":"Step 6 - Decide how to Map Column Properties to Retrieval Process","text":"

As discussed above, the Column model definition has a lot of properties that serve slightly different purposes depending on your RAG implementation. Since we're writing a new extract, you can decide how to map these inputs here. To keep things simple for starters, let's just take the column's query and pass it directly to the ReAct Agent. As an improvement, we could build a more complex prompt that passes along, for example, the column's parsing instructions - see the sketch after the code below.

response = agent.chat(datacell.column.query)  # pass the column's query straight to the agent\n
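
If you wanted to fold the column's parsing instructions into the prompt as well, a minimal variation might look like this sketch (you could get much fancier with prompt templates):

query = datacell.column.query\ninstructions = datacell.column.instructions\n\n# Fold any user-configured parsing instructions into the prompt we hand the agent.\nprompt = query\nif instructions:\n    prompt = f\"{query} When answering, follow these instructions: {instructions}\"\n\nresponse = agent.chat(prompt)\n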
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-7-post-process-and-store-data","title":"Step 7 - Post-Process and Store Data","text":"

At this stage we could use a structured data parser, or we could just store the answer from the agent. For simplicity, let's do the latter:

datacell.data = {\"data\": str(response)}\ndatacell.completed = timezone.now()\ndatacell.save()\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-8-rebuild-containers-and-look-at-your-frontend","title":"Step 8 - Rebuild Containers and Look at Your Frontend","text":"

The next time you rebuild the containers (in production; in the local environment they rebuild automatically), you will see a new entry in the column configuration modals:

It's that easy! Now, any user in your instance can run your extract and generate outputs - here we've used it for the Company Name column:

We plan to create decorators and other developer aids to reduce boilerplate here and let you focus entirely on your retrieval pipeline.

"},{"location":"walkthrough/advanced/write-your-own-extractors/#conclusion","title":"Conclusion","text":"

By breaking down the tasks step-by-step, you can see how the custom vector store integrates with LlamaIndex to provide powerful semantic search capabilities within a Django application. Even better, if you write your own data extract tasks you can expose them to users who don't have to know anything at all about how they're built. This is the way it should be - separation of concerns!

"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"About","text":""},{"location":"#open-contracts","title":"Open Contracts","text":""},{"location":"#the-free-and-open-source-document-analytics-platform","title":"The Free and Open Source Document Analytics Platform","text":"CI/CD Meta"},{"location":"#what-does-it-do","title":"What Does it Do?","text":"

OpenContracts is an Apache-2 Licensed enterprise document analytics tool. It provides several key features:

  1. Manage Documents - Manage document collections (Corpuses)
  2. Layout Parser - Automatically extracts layout features from PDFs
  3. Automatic Vector Embeddings - generated for uploaded PDFs and extracted layout blocks
  4. Pluggable microservice analyzer architecture - to let you analyze documents and automatically annotate them
  5. Human Annotation Interface - to manually annotate documents, including multi-page annotations.
  6. LlamaIndex Integration - Use our vector stores (powered by pgvector) and any manual or automatically annotated features to let an LLM intelligently answer questions.
  7. Data Extract - ask multiple questions across hundreds of documents using complex LLM-powered querying behavior. Our sample implementation uses LlamaIndex + Marvin.
  8. Custom Data Extract - Custom data extract pipelines can be used on the frontend to query documents in bulk.

"},{"location":"#key-docs","title":"Key Docs","text":"
  1. Quickstart Guide - You'll probably want to get started quickly. Setting up locally should be pretty painless if you're already running Docker.
  2. Basic Walkthrough - Check out the walkthrough to step through basic usage of the application for document and annotation management.
  3. PDF Annotation Data Format Overview - You may be interested in how we map text to PDFs visually and the underlying data format we're using.
  4. Django + Pgvector Powered Hybrid Vector Database - We've used the latest open source tooling for vector storage in postgres to make it almost trivially easy to combine structured metadata and vector embeddings with an API-powered application.
  5. LlamaIndex Integration Walkthrough - We wrote a wrapper for our backend database and vector store to make it simple to load our parsed annotations, embeddings and text into LlamaIndex. Even better, if you have additional annotations in the document, the LLM can access those too.
  6. Write Custom Data Extractors - Custom data extract tasks (which can use LlamaIndex or can be totally bespoke) are automatically loaded and displayed on the frontend to let users select how to ask questions and extract data from documents.
"},{"location":"#architecture-and-data-flows-at-a-glance","title":"Architecture and Data Flows at a Glance","text":""},{"location":"#core-data-standard","title":"Core Data Standard","text":"

The core idea here - besides providing a platform to analyze contracts - is an open and standardized architecture that makes data extremely portable. Powering this is a set of data standards to describe the text and layout blocks on a PDF page:

"},{"location":"#robust-pdf-processing-pipeline","title":"Robust PDF Processing Pipeline","text":"

We have a robust PDF processing pipeline that is horizontally scalable and generates our standardized data consistently for PDF inputs (We're working on adding additional formats soon):

Special thanks to Nlmatics and nlm-ingestor for powering the layout parsing and extraction.

"},{"location":"#limitations","title":"Limitations","text":"

At the moment, it only works with PDFs. In the future, it will be able to convert other document types to PDF for storage and labeling. PDF is an excellent format for this as it introduces a consistent, repeatable format which we can use to generate a text and x-y coordinate layer from scratch.

Adding OCR and ingestion for other enterprise documents is a priority.

"},{"location":"#acknowledgements","title":"Acknowledgements","text":"

Special thanks to AllenAI's PAWLS project and Nlmatics nlm-ingestor. They've pioneered a number of features and flows, and we are using their code in some parts of the application.

"},{"location":"acknowledgements/","title":"Acknowledgements","text":"

OpenContracts is built in part on top of the PAWLs project frontend. We have made extensive changes, however, and plan to remove even more of the original PAWLs codebase, particularly their state management, as it's currently duplicative of the Apollo state store we use throughout the application. That said, PAWLs was the inspiration for how we handle text extraction, and we're planning to continue using their PDF rendering code. We are also using PAWLs' pre-processing script, which is based on Grobid.

We should also thank the Grobid project, which was clearly a source of inspiration for PAWLs and an extremely impressive tool. Grobid is designed more for medical and scientific papers, but, nevertheless, offers a tremendous amount of inspiration and examples for the legal world to borrow. Perhaps there is an opportunity to have a unified tool in that respect.

Finally, let's not forget Tesseract, the OCR engine that started its life as an HP research project in the 1980s before being taken over by Google in the early aughts and finally becoming an independent project in 2018. Were it not for the excellent, free OCR provided by Tesseract, we'd have to rely on commercial OCR tech, which would make this kind of open-source, free project prohibitively expensive. Thanks to the many, many people who've made free OCR possible over the nearly 40 years Tesseract has been under development.

"},{"location":"philosophy/","title":"Philosophy","text":""},{"location":"philosophy/#dont-repeat-yourself","title":"Don't Repeat Yourself","text":"

OpenContracts is designed not only to be a powerful document analysis and annotation platform; it's also envisioned as a way to embrace the DRY (Don't Repeat Yourself) principle for legal work and legal engineering. You can make a corpus, along with all of its labels, documents and annotations, \"public\" (currently, you must do this via a GraphQL mutation).

Once something is public, it's read-only for everyone other than its original creator. People with read-only access can \"clone\" the corpus to create a private copy of the corpus, its documents and its annotations. They can then edit the annotations, add to them, export them, etc. This lets us work from previous document annotations and re-use labels and training data.

"},{"location":"quick-start/","title":"Quick Start (For use on your local machine)","text":"

This guide is for people who want to quickly get started using the application and aren't interested in hosting it online for others to use. You'll get a default, local user with admin access. We recommend you change the user password after completing this tutorial. We assume you're using Linux or macOS, but you could do this on Windows too, assuming you have docker compose and docker installed. The commands to create directories will be different on Windows, but the git, docker and docker-compose commands should all be the same.

"},{"location":"quick-start/#step-1-clone-this-repo","title":"Step 1: Clone this Repo","text":"

Clone the repository into a local directory of your choice. Here, we assume you are using a folder called source in your user's home directory:

    $ cd ~\n    $ mkdir source\n    $ cd source\n    $ git clone https://github.com/JSv4/OpenContracts.git\n
"},{"location":"quick-start/#step-2-copy-sample-env-files-to-appropriate-folders","title":"Step 2: Copy sample .env files to appropriate folders","text":"

Again, we're assuming a local deployment here with basic options. To just get up and running, you'll want to copy our sample .env file from the ./docs/sample_env_files directory to the appropriate .local subfolder in the .envs directory in the repo root.

"},{"location":"quick-start/#backend-env-file","title":"Backend .Env File","text":"

For the most basic deployment, copy ./sample_env_files/backend/local/.django to ./.envs/.local/.django and copy ./sample_env_files/backend/local/.postgres to ./.envs/.local/.postgres. You can use the default configurations, but we recommend you set your own admin account password in .django and your own postgres credentials in .postgres.

"},{"location":"quick-start/#frontend-env-file","title":"Frontend .Env File","text":"

You also need to copy the appropriate .frontend env file to ./.envs/.local/.frontend. We're assuming you're not using something like Auth0 and are going to rely on Django auth to provision and authenticate users. Grab ./sample_env_files/frontend/local/django.auth.env and copy it to ./.envs/.local/.frontend.

"},{"location":"quick-start/#step-3-build-the-stack","title":"Step 3: Build the Stack","text":"

Change into the directory of the repository you just cloned, e.g.:

    cd OpenContracts\n

Now, you need to build the docker compose stack. If you are okay with the default username and password, and, most importantly, you are NOT PLANNING TO HOST THE APPLICATION online, the default, local settings are sufficient and no configuration is required. If you want to change the defaults, review the configuration guides first.

    $ docker-compose -f local.yml build\n
"},{"location":"quick-start/#step-4-choose-frontend-deployment-method","title":"Step 4 Choose Frontend Deployment Method","text":"

Option 1: Use \"Fullstack\" Profile in Docker Compose

If you're not planning to do any frontend development, the easiest way to get started with OpenContracts is to just type:

    docker-compose -f local.yml --profile fullstack up\n

This will start docker compose and add a container for the frontend to the stack.

Option 2: Use Node to Deploy Frontend

If you plan to actively develop the frontend in the /frontend folder, you can just point your favorite TypeScript IDE to that directory and then run:

yarn install\n

and

yarn start\n

to bring up the frontend. Then you can edit the frontend code as desired and have it hot reload as you'd expect for a React app.

Congrats! You have OpenContracts running.

"},{"location":"quick-start/#step-5-login-and-start-annotating","title":"Step 5: Login and Start Annotating","text":"

If you go to http://localhost:3000 in your browser, you'll see the login page. You can login with the default username and password. These are set in the environment variable file you can find in the ./.envs/.local/ directory. In that directory, you'll see a file called .django. Backend specific configuration variables go in there. See our guide for how to create new users.

NOTE: The frontend is at port 3000, not 8000, so don't forget to use http://localhost:3000 for frontend access. We have an open issue to add a redirect from the backend root page - http://localhost:8000/ - to http://localhost:3000.

Caveats

The quick start local config is designed for use on a local machine, not for access over the Internet or a network. It uses the local disk for storage (not AWS), and Django's built-in authentication rather than an external provider like Auth0.

"},{"location":"requirements/","title":"System Requirements","text":""},{"location":"requirements/#system-requirements","title":"System Requirements","text":"

You will need Docker and Docker Compose installed to run Open Contracts. We've developed and run the application in a Linux x86_64 environment. We haven't tested on Windows, and it's known that celery is not supported on Windows. For this reason, we do not recommend deployment on Windows. If you must run on a Windows machine, consider using a virtual machine or the Windows Subsystem for Linux.

If you need help setting up Docker, we recommend Digital Ocean's setup guide. Likewise, if you need assistance setting up Docker Compose, Digital Ocean's guide is excellent.

"},{"location":"architecture/PDF-data-layer/","title":"PDF data layer","text":""},{"location":"architecture/PDF-data-layer/#data-layers","title":"Data Layers","text":"

OpenContracts builds on the work that AllenAI did with PAWLs to create a consistent shared source of truth for data labeling and NLP algorithms, regardless of whether they are layout-aware, like LayoutLM, or not, like BERT, Spacy or LexNLP. One of the challenges with natural language documents, particularly contracts, is that there are so many ways to structure any given file (e.g. .docx or .pdf) to represent exactly the same text. Even an identical document with identical formatting in a format like .pdf can have a significantly different file structure depending on what software was used to create it, the user's choices, and the software's own choices in deciding how to structure its output.

PAWLs and OpenContracts attempt to solve this by sending every document through a processing pipeline that provides a uniform and consistent way of extracting and structuring text and layout information. Using the parsing engine of Grobid and the open source OCR engine Tesseract, every single document is re-OCRed (to produce a consistent output for the same inputs) and then the \"tokens\" (text surrounded on all sides by whitespace - typically a word) in the OCRed document are stored as JSONs with their page and positional information. In OpenContracts, we refer to this JSON layer that combines text and positional data as the \"PAWLs\" layer. We use the PAWLs layer to build the full text extract from the document as well and store this as the \"text layer\".

Thus, in OpenContracts, every document has three files associated with it - the original pdf, a json file (the \"PAWLs layer\"), and a text file (the \"text layer\"). Because the text layer is built from the PAWLs layer, we can easily translate back and forth from text to positional information - e.g. given the start and end of a span of text in the text layer, we can accurately say which PAWLs tokens the span includes, and, based on that, the x,y position of the span in the document.

This lets us take the outputs of many NLP libraries which typically produce only start and stop ranges and layer them perfectly on top of the original pdf. With the PAWLs tokens as the source of truth, we can seamlessly transition from text only to layout-aware text.
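
To make this concrete, here's a minimal, self-contained Python sketch of the idea. The field names and values are simplified and illustrative - the real PAWLs JSON carries more information per page and token - but the round trip from tokens to a text layer and back is the same:

# Simplified stand-in for one page of a PAWLs layer: each token carries its text\n# plus x/y position and size on the page (field names are illustrative).\npawls_page = {\n    \"page\": {\"index\": 0, \"width\": 612, \"height\": 792},\n    \"tokens\": [\n        {\"text\": \"This\", \"x\": 72.0, \"y\": 90.0, \"width\": 24.0, \"height\": 11.0},\n        {\"text\": \"Agreement\", \"x\": 99.0, \"y\": 90.0, \"width\": 62.0, \"height\": 11.0},\n        {\"text\": \"terminates\", \"x\": 164.0, \"y\": 90.0, \"width\": 58.0, \"height\": 11.0},\n    ],\n}\n\n# Build the text layer by joining tokens, remembering each token's character range.\noffsets, pieces, cursor = [], [], 0\nfor token in pawls_page[\"tokens\"]:\n    start = cursor\n    pieces.append(token[\"text\"])\n    cursor += len(token[\"text\"])\n    offsets.append((start, cursor, token))\n    cursor += 1  # account for the joining space\ntext_layer = \" \".join(pieces)\n\n# Given a character span from any NLP library, recover the tokens (and their x/y boxes).\nspan_start = text_layer.find(\"Agreement\")\nspan_end = span_start + len(\"Agreement terminates\")\nhit_tokens = [tok for start, end, tok in offsets if start < span_end and end > span_start]\nprint([t[\"text\"] for t in hit_tokens])  # ['Agreement', 'terminates']\n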

"},{"location":"architecture/PDF-data-layer/#limitations","title":"Limitations","text":"

OCR is not perfect. By only accepting pdf inputs and OCRing every document, we ignore any text embedded in the pdf. To the extent that text was exported accurately from whatever tool was used to write the document, this introduces some potential loss of fidelity - e.g. if you've ever seen an OCR engine mistake an 'O' or a '0' or 'I' for a '1' or something like that. Typically, however, the incidence of such errors is fairly small, and it's a price we have to pay for the power of being able to effortlessly layer NLP outputs that have no layout awareness on top of complex, visual layouts.

"},{"location":"architecture/asynchronous-processing/","title":"Asynchronous Processing","text":""},{"location":"architecture/asynchronous-processing/#asynchronous-tasks","title":"Asynchronous Tasks","text":"

OpenContracts makes extensive use of celery, a powerful, mature python framework for distributed and asynchronous processing. Out-of-the-box, dedicated celery workers are configured in the docker compose stack to handle computationally-intensive and long-running tasks like parsing documents, applying annotations to pdfs, creating exports, importing exports, and more.

"},{"location":"architecture/asynchronous-processing/#what-if-my-celery-queue-gets-clogged","title":"What if my celery queue gets clogged?","text":"

We are always working to make OpenContracts more fault-tolerant and stable. That said, due to the nature of the types of documents we're working with - pdfs - there is tremendous variation in what the parsers have to parse. Some documents are extremely long - thousands of pages or more - whereas other documents may have poor formatting, no text layers, etc. In most cases, OpenContracts should be able to process the pdfs and make them compatible with our annotation tools. Sometimes, however, either due to unexpected issues or an unexpected volume of documents, you may want to purge the queue of tasks to be processed by your celery workers. To do this, type:

sudo docker-compose -f local.yml run django celery -A config.celery_app purge\n

Be aware that this can cause some undesired effects for your users. For example, every time a new document is uploaded, a Django signal kicks off the pdf preprocessor to produce the PAWLs token layer that is later annotated. If these tasks are in-queue and the queue is purged, you'll have documents that are not annotatable as they'll lack the PAWLs token layers. In such cases, we recommend you delete and re-upload the documents. There are ways to manually reprocess the pdfs, but we don't have a user-friendly way to do this yet.

"},{"location":"architecture/opencontract-corpus-actions/","title":"CorpusAction System in OpenContracts: Revised Explanation","text":"

The CorpusAction system in OpenContracts automates document processing when new documents are added to a corpus. This system is designed to be flexible, allowing for different types of actions to be triggered based on the configuration.

Within this system, users have three options for registering actions to run automatically on new documents:

  1. Custom data extractors
  2. Analyzer microservices
  3. Celery tasks decorated with @doc_analyzer_task (a \"task-based Analyzer\")

The @doc_analyzer_task decorator is specifically designed for the third option, providing a straightforward way to implement simple, span-based analytics directly within the OpenContracts ecosystem.

"},{"location":"architecture/opencontract-corpus-actions/#action-execution-overview","title":"Action Execution Overview","text":"

The following flowchart illustrates the CorpusAction system in OpenContracts, demonstrating the process that occurs when a new document is added to a corpus. This automated workflow begins with the addition of a document, which triggers a Django signal. The signal is then handled, leading to the processing of the corpus action. At this point, the system checks the type of CorpusAction configured for the corpus. Depending on this configuration, one of three paths is taken: running an Extract with a Fieldset, executing an Analysis with a doc_analyzer_task, or submitting an Analysis to a Gremlin Engine. This diagram provides a clear visual representation of how the CorpusAction system automates document processing based on predefined rules, enabling efficient and flexible handling of new documents within the OpenContracts platform.

graph TD\n    A[Document Added to Corpus] -->|Triggers| B[Django Signal]\n    B --> C[Handle Document Added Signal]\n    C --> D[Process Corpus Action]\n    D --> E{Check CorpusAction Type}\n    E -->|Fieldset| F[Run Extract]\n    E -->|Analyzer with task_name| G[Run Analysis with doc_analyzer_task]\n    E -->|Analyzer with host_gremlin| H[Run Analysis with Gremlin Engine]\n
"},{"location":"architecture/opencontract-corpus-actions/#key-components","title":"Key Components","text":"
  1. CorpusAction Model: Defines the action to be taken, including:

    • Reference to the associated corpus
    • Trigger type (e.g., ADD_DOCUMENT)
    • Reference to either an Analyzer or a Fieldset
  2. CorpusActionTrigger Enum: Defines trigger events (ADD_DOCUMENT, EDIT_DOCUMENT)

  3. Signal Handlers: Detect when documents are added to a corpus

  4. Celery Tasks: Perform the actual processing asynchronously

"},{"location":"architecture/opencontract-corpus-actions/#process-flow","title":"Process Flow","text":"
  1. Document Addition: A document is added to a corpus, triggering a Django signal.

  2. Signal Handling:

    @receiver(m2m_changed, sender=Corpus.documents.through)\ndef handle_document_added_to_corpus(sender, instance, action, pk_set, **kwargs):\n    if action == \"post_add\":\n        process_corpus_action.si(\n            corpus_id=instance.id,\n            document_ids=list(pk_set),\n            user_id=instance.creator.id,\n        ).apply_async()\n

  3. Action Processing: The process_corpus_action task is called, which determines the appropriate action based on the CorpusAction configuration.

  4. Execution Path: One of three paths is taken based on the CorpusAction configuration:

a) Run Extract with Fieldset
    • If the CorpusAction is associated with a Fieldset
    • Creates a new Extract object
    • Runs the extract process on the new document(s)

b) Run Analysis with doc_analyzer_task
    • If the CorpusAction is associated with an Analyzer that has a task_name
    • The task_name must refer to a function decorated with @doc_analyzer_task
    • Creates a new Analysis object
    • Runs the specified doc_analyzer_task on the new document(s)

c) Run Analysis with Gremlin Engine
    • If the CorpusAction is associated with an Analyzer that has a host_gremlin
    • Creates a new Analysis object
    • Submits the analysis job to the specified Gremlin Engine

Here's the relevant code snippet showing these paths:

@shared_task\ndef process_corpus_action(corpus_id: int, document_ids: list[int], user_id: int):\n    corpus = Corpus.objects.get(id=corpus_id)\n    actions = CorpusAction.objects.filter(\n        corpus=corpus, trigger=CorpusActionTrigger.ADD_DOCUMENT\n    )\n\n    for action in actions:\n        if action.fieldset:\n            # Path a: Run Extract with Fieldset\n            extract = Extract.objects.create(\n                name=f\"Extract for {corpus.title}\",\n                corpus=corpus,\n                fieldset=action.fieldset,\n                creator_id=user_id,\n            )\n            extract.documents.add(*document_ids)\n            run_extract.si(extract_id=extract.id).apply_async()\n        elif action.analyzer:\n            analysis = Analysis.objects.create(\n                analyzer=action.analyzer,\n                analyzed_corpus=corpus,\n                creator_id=user_id,\n            )\n            if action.analyzer.task_name:\n                # Path b: Run Analysis with doc_analyzer_task\n                task = import_string(action.analyzer.task_name)\n                for doc_id in document_ids:\n                    task.si(doc_id=doc_id, analysis_id=analysis.id).apply_async()\n            elif action.analyzer.host_gremlin:\n                # Path c: Run Analysis with Gremlin Engine\n                start_analysis.si(analysis_id=analysis.id).apply_async()\n

This system provides a flexible framework for automating document processing in OpenContracts. By configuring CorpusAction objects appropriately, users can ensure that newly added documents are automatically processed according to their specific needs, whether that involves running extracts, local analysis tasks, or submitting to external Gremlin engines for processing.

"},{"location":"architecture/components/Data-flow-diagram/","title":"Container Architecture & Data Flow","text":"

You'll notice that we have a number of containers in our docker compose file (Note the local.yml is up-to-date. The production file needs some work to be production grade, and we may switch to Tilt.).

Here, you can see how these containers relate to some of the core data elements powering the application - such as parsing structural and layout annotations from PDFs (which powers the vector store) and generating vector embeddings.

"},{"location":"architecture/components/Data-flow-diagram/#png-diagram","title":"PNG Diagram","text":""},{"location":"architecture/components/Data-flow-diagram/#mermaid-version","title":"Mermaid Version","text":"
graph TB\n    subgraph \"Docker Compose Environment\"\n        direction TB\n        django[Django]\n        postgres[PostgreSQL]\n        redis[Redis]\n        celeryworker[Celery Worker]\n        celerybeat[Celery Beat]\n        flower[Flower]\n        frontend[Frontend React]\n        nlm_ingestor[NLM Ingestor]\n        vector_embedder[Vector Embedder]\n    end\n\n    subgraph \"Django Models\"\n        direction TB\n        document[Document]\n        annotation[Annotation]\n        relationship[Relationship]\n        labelset[LabelSet]\n        extract[Extract]\n        datacell[Datacell]\n    end\n\n    django -->|Manages| document\n    django -->|Manages| annotation\n    django -->|Manages| relationship\n    django -->|Manages| labelset\n    django -->|Manages| extract\n    django -->|Manages| datacell\n\n    nlm_ingestor -->|Parses PDFs| django\n    nlm_ingestor -->|Creates layout annotations| annotation\n\n    vector_embedder -->|Generates embeddings| django\n    vector_embedder -->|Stores embeddings| annotation\n    vector_embedder -->|Stores embeddings| document\n\n    django -->|Stores data| postgres\n    django -->|Caching| redis\n\n    celeryworker -->|Processes tasks| django\n    celerybeat -->|Schedules tasks| celeryworker\n    flower -->|Monitors| celeryworker\n\n    frontend -->|User interface| django\n\n    classDef container fill:#e1f5fe,stroke:#01579b,stroke-width:2px;\n    classDef model fill:#fff59d,stroke:#f57f17,stroke-width:2px;\n\n    class django,postgres,redis,celeryworker,celerybeat,flower,frontend,nlm_ingestor,vector_embedder container;\n    class document,annotation,relationship,labelset,extract,datacell model;\n
"},{"location":"architecture/components/annotator/how-annotations-are-created/","title":"How Annotations are Handled","text":""},{"location":"architecture/components/annotator/how-annotations-are-created/#overview","title":"Overview","text":"

Here's a step-by-step explanation of the flow:

  1. The user selects text on the PDF by clicking and dragging the mouse. This triggers a mouse event in the Page component.
  2. The Page component checks if the Shift key is pressed.
  3. If the Shift key is not pressed, it creates a new selection and sets the selection state in the AnnotationStore.
  4. If the Shift key is pressed, it adds the selection to the selection queue in the AnnotationStore.
  5. The AnnotationStore updates its internal state with the new selection or the updated selection queue.
  6. If the Shift key is released, the Page component triggers the creation of a multi-page annotation. If the Shift key is still pressed, it waits for the next user action.
  7. To create a multi-page annotation, the Page component combines the selections from the queue.
  8. The Page component retrieves the annotation data from the PDFPageInfo object for each selected page.
  9. The Page component creates a ServerAnnotation object with the combined annotation data.
  10. The Page component calls the createAnnotation function in the AnnotationStore, passing the ServerAnnotation object.
  11. The AnnotationStore invokes the requestCreateAnnotation function in the Annotator component.
  12. The Annotator component sends a mutation to the server to create the annotation.
  13. If the server responds with success, the Annotator component updates the local state with the new annotation. If there's an error, it displays an error message.
  14. The updated annotations trigger a re-render of the relevant components, reflecting the newly created annotation on the PDF.
"},{"location":"architecture/components/annotator/how-annotations-are-created/#flowchart","title":"Flowchart","text":"
graph TD\n    A[User selects text on the PDF] -->|Mouse event| B(Page component)\n    B --> C{Is Shift key pressed?}\n    C -->|No| D[Create new selection]\n    C -->|Yes| E[Add selection to queue]\n    D --> F[Set selection state in AnnotationStore]\n    E --> G[Update selection queue in AnnotationStore]\n    F --> H{Is Shift key released?}\n    G --> H\n    H -->|Yes| I[Create multi-page annotation]\n    H -->|No| J[Wait for next user action]\n    I --> K[Combine selections from queue]\n    K --> L[Get annotation data from PDFPageInfo]\n    L --> M[Create ServerAnnotation object]\n    M --> N[Call createAnnotation in AnnotationStore]\n    N --> O[Invoke requestCreateAnnotation in Annotator]\n    O --> P[Send mutation to server]\n    P --> Q{Server response}\n    Q -->|Success| R[Update local state with new annotation]\n    Q -->|Error| S[Display error message]\n    R --> T[Re-render components with updated annotations]\n
"},{"location":"architecture/components/annotator/overview/","title":"Open Contracts Annotator Components","text":""},{"location":"architecture/components/annotator/overview/#key-questions","title":"Key Questions","text":"
  1. How is the PDF loaded?
  2. The PDF is loaded in the Annotator.tsx component.
  3. Inside the useEffect hook that runs when the openedDocument prop changes, the PDF loading process is initiated.
  4. The pdfjsLib.getDocument function from the pdfjs-dist library is used to load the PDF file specified by openedDocument.pdfFile.
  5. The loading progress is tracked using the loadingTask.onProgress callback, which updates the progress state.
  6. Once the PDF is loaded, the loadingTask.promise is resolved, and the PDFDocumentProxy object is obtained.
  7. The PDFPageInfo objects are created for each page of the PDF using doc.getPage(i) and stored in the pages state.

  8. Where and how are annotations loaded?

  9. Annotations are loaded using the REQUEST_ANNOTATOR_DATA_FOR_DOCUMENT GraphQL query in the Annotator.tsx component.
  10. The useQuery hook from Apollo Client is used to fetch the annotator data based on the provided initial_query_vars.
  11. The annotator_data received from the query contains information about existing text annotations, document label annotations, and relationships.
  12. The annotations are transformed into ServerAnnotation, DocTypeAnnotation, and RelationGroup objects and stored in the pdfAnnotations state using setPdfAnnotations.

  13. Where is the PAWLs layer loaded?

  14. The PAWLs layer is loaded in the Annotator.tsx component.
  15. Inside the useEffect hook that runs when the openedDocument prop changes, the PAWLs layer is loaded using the getPawlsLayer function from api/rest.ts.
  16. The getPawlsLayer function makes an HTTP GET request to fetch the PAWLs data file specified by openedDocument.pawlsParseFile.
  17. The PAWLs data is expected to be an array of PageTokens objects, which contain token information for each page of the PDF.
  18. The loaded PAWLs data is then used to create PDFPageInfo objects for each page, which include the page tokens.
"},{"location":"architecture/components/annotator/overview/#high-level-components-overview","title":"High-level Components Overview","text":"
  • The Annotator component is the top-level component that manages the state and data loading for the annotator.
  • It renders the PDFView component, which is responsible for displaying the PDF and annotations.
  • The PDFView component renders various sub-components, such as LabelSelector, DocTypeLabelDisplay, AnnotatorSidebar, AnnotatorTopbar, and PDF.
  • The PDF component renders individual Page components for each page of the PDF.
  • Each Page component renders Selection and SearchResult components for annotations and search results, respectively.
  • The AnnotatorSidebar component displays the list of annotations, relations, and a search widget.
  • The PDFStore and AnnotationStore are context providers that hold the PDF and annotation data, respectively.
"},{"location":"architecture/components/annotator/overview/#specific-component-deep-dives","title":"Specific Component Deep Dives","text":""},{"location":"architecture/components/annotator/overview/#pdfviewtsx","title":"PDFView.tsx","text":"

The PDFView component is a top-level component that renders the PDF document with annotations, relations, and text search capabilities. It manages the state and functionality related to annotations, relations, and user interactions. Here's a detailed explanation of how the component works:

  1. The PDFView component receives several props, including permissions, callbacks for CRUD operations on annotations and relations, refs for container and selection elements, and various configuration options.

  2. It initializes several state variables using the useState hook, including:

  3. selectionElementRefs and searchResultElementRefs: Refs for annotation selections and search results.
  4. pageElementRefs: Refs for individual PDF pages.
  5. scrollContainerRef: Ref for the scroll container.
  6. textSearchMatches and searchText: State for text search matches and search text.
  7. selectedAnnotations and selectedRelations: State for currently selected annotations and relations.
  8. pageSelection and pageSelectionQueue: State for current page selection and queued selections.
  9. pdfPageInfoObjs: State for PDF page information objects.
  10. Various other state variables for active labels, relation modal visibility, and annotation options.

  11. The component defines several functions for updating state and handling user interactions, such as:

  12. insertSelectionElementRef, insertSearchResultElementRefs, and insertPageRef: Functions to add refs for selections, search results, and pages.
  13. onError: Error handling callback.
  14. advanceTextSearchMatch and reverseTextSearchMatch: Functions to navigate through text search matches.
  15. onRelationModalOk and onRelationModalCancel: Callbacks for relation modal actions.
  16. createMultiPageAnnotation: Function to create a multi-page annotation from queued selections.

  17. The component uses the useEffect hook to handle side effects, such as:

  18. Setting the scroll container ref on load.
  19. Listening for changes in the shift key and triggering annotation creation.
  20. Updating text search matches when the search text changes.

  21. The component renders the PDF document and its related components using the PDFStore and AnnotationStore contexts:

  22. The PDFStore context provides the PDF document, pages, and error handling.
  23. The AnnotationStore context provides annotation-related state and functions.

  24. The component renders the following main sections:

  25. LabelSelector: Allows the user to select the active label for annotations.
  26. DocTypeLabelDisplay: Displays the document type labels.
  27. AnnotatorSidebar: Sidebar component for managing annotations and relations.
  28. AnnotatorTopbar: Top bar component for additional controls and options.
  29. PDF: The actual PDF component that renders the PDF pages and annotations.

  30. The PDF component, defined in PDF.tsx, is responsible for rendering the PDF pages and annotations. It receives props from the PDFView component, such as permissions, configuration options, and callbacks.

  31. The PDF component maps over each page of the PDF document and renders a Page component for each page, passing the necessary props.

  32. The Page component, also defined in PDF.tsx, is responsible for rendering a single page of the PDF document along with its annotations and search results. It handles mouse events for creating and modifying annotations.

  33. The PDFView component also renders the RelationModal component when the active relation label is set and the user has the necessary permissions. The modal allows the user to create or modify relations between annotations.

"},{"location":"architecture/components/annotator/overview/#pdftsx","title":"PDF.tsx","text":"

PDF renders the actual PDF document with annotations and text search capabilities. PDFView (see above) is what actually interacts with the backend / API.

  1. The PDF component receives several props:
  2. shiftDown: Indicates whether the Shift key is pressed (optional).
  3. doc_permissions and corpus_permissions: Specify the permissions for the document and corpus, respectively.
  4. read_only: Determines if the component is in read-only mode.
  5. show_selected_annotation_only: Specifies whether to show only the selected annotation.
  6. show_annotation_bounding_boxes: Specifies whether to show annotation bounding boxes.
  7. show_annotation_labels: Specifies the behavior for displaying annotation labels.
  8. setJumpedToAnnotationOnLoad: A callback function to set the jumped-to annotation on load.
  9. The PDF component retrieves the PDF document and pages from the PDFStore context.
  10. It maps over each page of the PDF document and renders a Page component for each page, passing the necessary props.
  11. The Page component is responsible for rendering a single page of the PDF document along with its annotations and search results.
  12. Inside the Page component:
  13. It creates a canvas element using the useRef hook to render the PDF page.
  14. It retrieves the annotations for the current page from the AnnotationStore context.
  15. It defines a ConvertBoundsToSelections function that converts the selected bounds to annotations and tokens.
  16. It uses the useEffect hook to set up the PDF page rendering and event listeners for resizing and scrolling.
  17. It renders the PDF page canvas, annotations, search results, and queued selections.
  18. The Page component renders the following sub-components:
  19. PageAnnotationsContainer: A styled container for the page annotations.
  20. PageCanvas: A styled canvas element for rendering the PDF page.
  21. Selection: Represents a single annotation selection on the page.
  22. SearchResult: Represents a search result on the page.
  23. The Page component handles mouse events for creating and modifying annotations:
  24. On mouseDown, it initializes the selection if the necessary permissions are granted and the component is not in read-only mode.
  25. On mouseMove, it updates the selection bounds if a selection is active.
  26. On mouseUp, it adds the completed selection to the pageSelectionQueue and triggers the creation of a multi-page annotation if the Shift key is not pressed.
  27. The Page component also handles fetching more annotations for previous and next pages using the FetchMoreOnVisible component.
  28. The SelectionBoundary and SelectionTokens components are used to render the annotation boundaries and tokens, respectively.
  29. The PDFPageRenderer class is responsible for rendering a single PDF page on the canvas. It manages the rendering tasks and provides methods for canceling and rescaling the rendering.
  30. The getPageBoundsFromCanvas function calculates the bounding box of the page based on the canvas dimensions and its parent container.
"},{"location":"configuration/add-users/","title":"Add Users","text":""},{"location":"configuration/add-users/#adding-more-users","title":"Adding More Users","text":"

You can use the same User admin page described above to create new users. Alternatively, go back to the main admin page http://localhost:8000/admin and, under the User section, click the \"+Add\" button:

Then, follow the on-screen instructions:

When you're done, the username and password you provided can be used to login.

OpenContracts is currently not built to allow users to self-register unless you use the Auth0 authentication. When managing users yourself, you'll need to add, remove and modify users via the admin panels.

"},{"location":"configuration/choose-an-authentication-backend/","title":"Configure Authentication Backend","text":""},{"location":"configuration/choose-an-authentication-backend/#select-authentication-system-via-env-variables","title":"Select Authentication System via Env Variables","text":"

For authentication and authorization, you have two choices: (1) you can configure an Auth0 account and use Auth0 to authenticate users, in which case anyone permitted to authenticate via your Auth0 setup can log in and automatically gets an account, or (2) you can require a username and password for each user and let the OpenContracts backend handle authentication and authorization. With the latter option, there is no currently-supported sign-up method; you'll need to use the admin dashboard (see the \"Adding Users\" section).

"},{"location":"configuration/choose-an-authentication-backend/#auth0-auth-setup","title":"Auth0 Auth Setup","text":"

You need to configure three separate applications on Auth0's platform:

  1. Configure the SPA as an application. You'll need the App Client ID.
  2. Configure the API. You'll need API Audience.
  3. Configure a M2M application to access the Auth0 Management API. This is used to fetch user details. You'll need the API_ID for the M2M application and the Client Secret for the M2M app.

You'll also need your Auth0 tenant ID (assuming it's the same for all three applications, though you could, in theory, host them in different tenants). These directions are not comprehensive, so, if you're not familiar with Auth0, we recommend you disable Auth0 for the time being and use username and password.

To enable and configure Auth0 Authentication, you'll need to set the following env variables in your .env file (the .django file in .envs/.production or .envs/.local, depending on your target environment). Our sample .envs only show these fields in the .production sample, but you could use them in the .local env file too:

  1. USE_AUTH0 - set to true to enable Auth0
  2. AUTH0_CLIENT_ID - should be the client ID configured on Auth0
  3. AUTH0_API_AUDIENCE - Configured API audience
  4. AUTH0_DOMAIN - domain of your configured Auth0 application
  5. AUTH0_M2M_MANAGEMENT_API_SECRET - secret for the auth0 Machine to Machine (M2M) API
  6. AUTH0_M2M_MANAGEMENT_API_ID - ID for Auth0 Machine to Machine (M2M) API
  7. AUTH0_M2M_MANAGEMENT_GRANT_TYPE - set to client_credentials
"},{"location":"configuration/choose-an-authentication-backend/#detailed-explanation-of-auth0-implementation","title":"Detailed Explanation of Auth0 Implementation","text":"

To get Auth0 to work nicely with Graphene, we modified the graphql_jwt backend to support syncing remote user metadata with a local user, similar to the default django RemoteUserMiddleware. We're keeping the graphql_jwt graphene middleware in its entirety, as it fetches the token and then passes it along to the django authentication backend. That django backend is what we're modifying to decode the jwt token against Auth0 settings and then check whether a local user exists and, if not, create it.

Here's the order of operations in the original Graphene backend provided by graphql_jwt:

  1. Backend's authenticate method is called from the graphene middleware via django (from django.contrib.auth import authenticate)
  2. token is retrieved via .utils get_credentials
  3. if token is not None, get_user_by_token in shortcuts module is called
    1. \"Payload\" is retrieved via utils.get_payload
    2. User is requested via utils.get_user_by_payload
    3. username is retrieved from payload via auth0_settings.JWT_PAYLOAD_GET_USERNAME_HANDLER
    4. user object is retrieved via auth0_settings.JWT_GET_USER_BY_NATURAL_KEY_HANDLER

We modified a couple of things:

  1. The decode method called in 3(a) needs to be modified to decode with Auth0 secrets and settings.
  2. get_user_by_payload needs to be modified in several ways:
    1. the user object must use RemoteUserMiddleware-style logic: if everything from Auth0 decodes properly, check whether a user with that e-mail exists and, if not, create it; upon completion, try to sync the user's data with Auth0 (see the sketch below).
    2. return the created or retrieved user object, as the original method did.
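
To make this concrete, here is a minimal sketch of what the modified lookup might look like. Treat it as illustrative only: it is not the actual OpenContracts module, and helper names like sync_remote_user_metadata are hypothetical.

# Illustrative sketch only -- not the actual OpenContracts module; helper\n# names like sync_remote_user_metadata are hypothetical.\nfrom django.contrib.auth import get_user_model\n\n\ndef get_user_by_payload(payload):\n    # payload has already been decoded and verified against Auth0 settings\n    email = payload.get(\"email\")\n    if email is None:\n        return None\n    User = get_user_model()\n    # RemoteUserMiddleware-style behavior: create the user on first login\n    user, created = User.objects.get_or_create(\n        email=email, defaults={\"username\": email}\n    )\n    if created:\n        sync_remote_user_metadata(user, payload)  # hypothetical helper\n    return user\n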
"},{"location":"configuration/choose-an-authentication-backend/#django-based-authentication-setup","title":"Django-Based Authentication Setup","text":"

The only thing you need to do for this is toggle the two auth0-related environment variables: 1. For the backend environment, set USE_AUTH0=False in your environment (either via an environment variable file or directly in your environment via the console). 2. For the frontend environment, set REACT_APP_USE_AUTH0=false in your environment (either via an environment variable file or directly in your environment via the console).

Note

As noted elsewhere, users cannot sign up on their own. You need to log into the admin dashboard - e.g. http://localhost:8000/admin - and add users manually.

"},{"location":"configuration/choose-and-configure-docker-stack/","title":"Choose and Configure Docker Compose Stack","text":""},{"location":"configuration/choose-and-configure-docker-stack/#deployment-options","title":"Deployment Options","text":"

OpenContracts is designed to be deployed using docker-compose. You can run it locally or in a production environment. Follow the instructions below for a local environment if you just want to test it or you want to use it for yourself and don't intend to make the application available to other users via the Internet.

"},{"location":"configuration/choose-and-configure-docker-stack/#local-deployment","title":"Local Deployment","text":""},{"location":"configuration/choose-and-configure-docker-stack/#quick-start-with-default-settings","title":"Quick Start with Default Settings","text":"

A \"local\" deployment is deployed on your personal computer and is not meant to be accessed over the Internet. If you don't need to configure anything, just follow the quick start guide above to get up and running with a local deployment without needing any further configuration.

"},{"location":"configuration/choose-and-configure-docker-stack/#setup-env-files","title":"Setup .env Files","text":""},{"location":"configuration/choose-and-configure-docker-stack/#backend","title":"Backend","text":"

After cloning this repo to a machine of your choice, create a folder for your environment files in the repo root. You'll need ./.envs/.local/.django and ./.envs/.local/.postgres. Use the samples in ./documentation/sample_env_files/local as guidance. NOTE, you'll need to replace the placeholder passwords and users where noted, but, otherwise, minimal config should be required.

"},{"location":"configuration/choose-and-configure-docker-stack/#frontend","title":"Frontend","text":"

In the ./frontend folder, you also need to create a single .env file which holds your configurations for your login method as well as certain feature switches (e.g. turning off imports). We've included a sample using auth0 and another sample using django's auth backend. Local vs production deployments are essentially the same, but the root url of the backend will change from localhost to wherever you're hosting the application in production.

"},{"location":"configuration/choose-and-configure-docker-stack/#build-the-stack","title":"Build the Stack","text":"

Once your .env files are set up, build the stack using docker-compose:

$ docker-compose -f local.yml build

Then, run migrations (to setup the database):

$ docker-compose -f local.yml run django python manage.py migrate

Then, create a superuser account that can log in to the admin dashboard (in a local deployment this is available at http://localhost:8000/admin) by typing this command and following the prompts:

$ docker-compose -f local.yml run django python manage.py createsuperuser\n

Finally, bring up the stack:

$ docker-compose -f local.yml up\n

You should now be able to access the OpenContracts frontend by visiting http://localhost:3000.

"},{"location":"configuration/choose-and-configure-docker-stack/#production-environment","title":"Production Environment","text":"

The production environment is designed to be public-facing and exposed to the Internet, so there are quite a number more configurations required than a local deployment, particularly if you use an AWS S3 storage backend or the Auth0 authentication system.

After cloning this repo to a machine of your choice, configure the production .env files as described above.

You'll also need to configure your website url. This needs to be done in a few places.

First, in opencontractserver/contrib/migrations, you'll find a file called 0003_set_site_domain_and_name.py. BEFORE running any of your migrations, you should modify the domain and name defaults you'll find in update_site_forward:

def update_site_forward(apps, schema_editor):\n    \"\"\"Set site domain and name.\"\"\"\n    Site = apps.get_model(\"sites\", \"Site\")\n    Site.objects.update_or_create(\n        id=settings.SITE_ID,\n        defaults={\n            \"domain\": \"opencontracts.opensource.legal\",\n            \"name\": \"OpenContractServer\",\n        },\n    )\n

and update_site_backward:

def update_site_backward(apps, schema_editor):\n    \"\"\"Revert site domain and name to default.\"\"\"\n    Site = apps.get_model(\"sites\", \"Site\")\n    Site.objects.update_or_create(\n        id=settings.SITE_ID,\n        defaults={\"domain\": \"example.com\", \"name\": \"example.com\"},\n    )\n

Finally, don't forget to configure Traefik, the router in the docker-compose stack that exposes different containers to end-users depending on the route (url) received; you'll need to update the Traefik configuration file accordingly.

If you're using Auth0, see the Auth0 configuration section.

If you're using AWS S3 for file storage, see the AWS configuration section. NOTE, the underlying django library that provides cloud storage, django-storages, can also work with other cloud providers such as Azure and GCP. See the django storages library docs for more info.

Once your production .env files and Traefik configuration are in place, build the stack:

$ docker-compose -f production.yml build\n

Then, run migrations (to setup the database):

$ docker-compose -f production.yml run django python manage.py migrate\n

Then, create a superuser account that can log in to the admin dashboard (in a production deployment this is available at the url set in your env file as the DJANGO_ADMIN_URL) by typing this command and following the prompts:

$ docker-compose -f production.yml run django python manage.py createsuperuser\n

Finally, bring up the stack:

$ docker-compose -f production.yml up\n

You should now be able to access the OpenContracts frontend by visiting http://localhost:3000.

"},{"location":"configuration/choose-and-configure-docker-stack/#env-file-configurations","title":"ENV File Configurations","text":"

OpenContracts is configured via .env files. For a local deployment, these should go in .envs/.local. For production, use .envs/.production. Sample .envs for each deployment environment are provided in documentation/sample_env_files.

The local configuration should let you deploy the application on your PC without requiring any specific configuration. The production configuration is meant to provide a web application and requires quite a bit more configuration and knowledge of web apps.

"},{"location":"configuration/choose-and-configure-docker-stack/#include-gremlin","title":"Include Gremlin","text":"

If you want to include a Gremlin analyzer, use local_deploy_with_gremlin.yml or production_deploy_with_gremlin.yml instead of local.yml or production.yml, respectively. All other parts of the tutorial are the same.

"},{"location":"configuration/choose-storage-backend/","title":"Configure Storage Backend","text":""},{"location":"configuration/choose-storage-backend/#select-and-setup-storage-backend","title":"Select and Setup Storage Backend","text":"

You can use Amazon S3 as a file storage backend (if you set the env flag USE_AWS=True, more on that below), or you can use the local storage of the host machine via a Docker volume.

"},{"location":"configuration/choose-storage-backend/#aws-storage-backend","title":"AWS Storage Backend","text":"

If you want to use AWS S3 to store files (primarily pdfs, but also exports, tokens and txt files), you will need an Amazon AWS account to setup S3. This README does not cover the AWS side of configuration, but there are a number of tutorials and guides to getting AWS configured to be used with a django project.

Once you have an S3 bucket configured, you'll need to set the following env variables in your .env file (the .django file in .envs/.production or .envs/.local, depending on your target environment). Our sample .envs only show these fields in the .production samples, but you could use them in the .local env file too.

Here are the variables you need to set to enable AWS S3 storage (an illustrative settings excerpt follows the list):

  1. USE_AWS - set to true since you're using AWS, otherwise the backend will use a docker volume for storage.
  2. AWS_ACCESS_KEY_ID - the access key ID created by AWS when you set up your IAM user (see tutorials above).
  3. AWS_SECRET_ACCESS_KEY - the secret access key created by AWS when you set up your IAM user (see tutorials above)
  4. AWS_STORAGE_BUCKET_NAME - the name of the AWS bucket you created to hold the files.
  5. AWS_S3_REGION_NAME - the region of the AWS bucket you configured.
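
For orientation, here is a minimal sketch of how these variables typically feed django-storages settings. It is not the project's actual settings module, just an illustration of the USE_AWS toggle described above.

# Illustrative settings.py excerpt (not the project's actual config): toggle\n# S3 vs. local volume storage based on the USE_AWS flag.\nimport os\n\nUSE_AWS = os.environ.get(\"USE_AWS\", \"false\").lower() == \"true\"\n\nif USE_AWS:\n    DEFAULT_FILE_STORAGE = \"storages.backends.s3boto3.S3Boto3Storage\"\n    AWS_ACCESS_KEY_ID = os.environ[\"AWS_ACCESS_KEY_ID\"]\n    AWS_SECRET_ACCESS_KEY = os.environ[\"AWS_SECRET_ACCESS_KEY\"]\n    AWS_STORAGE_BUCKET_NAME = os.environ[\"AWS_STORAGE_BUCKET_NAME\"]\n    AWS_S3_REGION_NAME = os.environ[\"AWS_S3_REGION_NAME\"]\nelse:\n    # fall back to the Docker volume mounted into the django container\n    DEFAULT_FILE_STORAGE = \"django.core.files.storage.FileSystemStorage\"\n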
"},{"location":"configuration/choose-storage-backend/#django-storage-backend","title":"Django Storage Backend","text":"

Setting USE_AWS=false will use the disk space in the django container. When using the local docker compose stack, the celery workers and django containers share the same disk, so this works fine. Our production configuration would not work properly with USE_AWS=false, however, as each container has its own disk.

"},{"location":"configuration/configure-admin-users/","title":"Configure Admin Users","text":""},{"location":"configuration/configure-admin-users/#gremlin-admin-dashboard","title":"Gremlin Admin Dashboard","text":"

Gremlin's backend is built on Django, which has its own powerful admin dashboard. This dashboard is not meant for end-users and should only be used by admins. You can access the admin dashboard by going to the /admin page - e.g. opencontracts.opensource.legal/admin or http://localhost:8000/admin. For the most part, you shouldn't need to use the admin dashboard and should only go in here if you're experiencing errors or unexpected behavior and want to look at the detailed contents of the database to see if it sheds any light on what's happening with a given corpus, document, etc.

By default, Gremlin creates an admin user for you. If you don't specify the username and password in your environment on first boot, it'll use system defaults. You can customize the default username and password via environment variables or after the system boots using the admin dash.

"},{"location":"configuration/configure-admin-users/#configure-username-and-password-prior-to-first-deployment","title":"Configure Username and Password Prior to First Deployment","text":"

If the variable DJANGO_SUPERUSER_USERNAME is set, that will be the default admin user created on startup (the first time you run docker-compose -f local.yml up). The repo ships with a default superuser username of admin. The default password is set using the DJANGO_SUPERUSER_PASSWORD variable. The environment files for local deployments (but not production) include a default password of Openc0ntracts_def@ult. You should change this in the environment file before the first start OR follow the instructions below to change it after the first start.

If you modify these environment variables in the environment file BEFORE running the docker-compose up command for the first time, your initial superuser will have the username, email and/or password you specify. If you don't modify the defaults, you can change them after you have created them via the admin dashboard (see below).

"},{"location":"configuration/configure-admin-users/#after-first-deployment-via-admin-dashboard","title":"After First Deployment via Admin Dashboard","text":"

Once the default superuser has been created, you'll need to use the admin dashboard to modify it.

To manage users, including changing the password, you'll need to access the backend admin dashboard. OpenContracts is built on Django, which ships with Django Admin, a tool to manage low-level object data and users. It doesn't provide the rich, document focused UI/UX our frontend does, but it does let you edit and delete objects created on the frontend if, for any reason, you are unable to fix something done by a frontend user (e.g. a corrupt file is uploaded and cannot be parsed or rendered properly on the frontend).

To update your users, first login to the admin panel:

Then, in the lefthand navbar, find the entry for \"Users\" and click on it

Then, you'll see a list of all users for this instance. You should see your admin user and an \"Anonymous\" user. The Anonymous user is required for public browsing of objects with their is_public field set to True. The Anonymous user cannot see other objects.

Click on the admin user to bring up the detailed user view:

Now you can click the \"WHAT AM I CALLED\" button to bring up a dialog to change the user password.

"},{"location":"configuration/configure-gremlin/","title":"Configure Gremlin Analyzer","text":"

Gremlin is a separate project by OpenSource Legal to provide a standard API to access NLP capabilities. This lets us wrap multiple NLP engines / techniques in the same API, which in turn lets us build tools that can readily consume the outputs of very different NLP libraries (e.g. a Transformers-based model like BERT and tools like spaCy and LexNLP can all be deployed on Gremlin, and the outputs from all three can readily be rendered in OpenContracts).

OpenContracts is designed to work with Gremlin out-of-the-box. We have a sample compose yaml file showing how to do this on a local machine local_deploy_with_gremlin.yaml and as a web-facing application production_deploy_with_gremlin.yaml.

When you add a new Gremlin Engine to the database, OpenContracts will automatically query it for its installed analyzers and labels. These will then be available within OpenContracts, and you can use an analyzer to analyze any OpenContracts corpus.

While we have plans to automatically \"install\" the default Gremlin on first boot, currently you must manually go into the OpenContracts admin dash and add the Gremlin. Thankfully, this is an easy process:

  1. In your environment file, make sure you set CALLBACK_ROOT_URL_FOR_ANALYZER
    1. For local deploy, use CALLBACK_ROOT_URL_FOR_ANALYZER=http://localhost:8000
    2. For production deploy, use http://django:5000. Why the change? Well, in our local docker compose stack, the host is localhost and the django development server runs on port 8000. In production, we want Gremlin to communicate with the OpenContracts container (\"django\") via its hostname on the docker compose stack's network. The production OpenContracts container also uses gunicorn on port 5000 instead of the development server on port 8000, so the port changes too.
  2. Go to the admin page:
  3. Click \"Add+\" in the Gremlin row to bring up the Add Gremlin Engine form. You just need to set the creator Url fields (the url for our default config is http://gremlinengine:5000). If, for some reason, you don't want the analyzer to be visible to any unauthenticated user, unselect the is_public box :
  4. This will automatically kick off an install process that runs in the background. When it's complete, you'll see the \"Install Completed\" Field change. It should take a second or two. At the moment, we don't handle errors in this process, so, if it doesn't complete successfully in 30 seconds, there is probably a misconfiguration somewhere. We plan to improve our error handling for these backend installation processes.

Note, in our example implementations, Gremlin is NOT encrypted or API key secured for outside traffic. It's not exposed to outside traffic either, per our docker compose config, so this shouldn't be a major concern. If you do expose the container to the host via your Docker Compose file, you should ensure you run the traffic through Traefik and set up API key authentication.

"},{"location":"configuration/frontend-configuration/","title":"Frontend Configuration","text":""},{"location":"configuration/frontend-configuration/#why","title":"Why?","text":"

The frontend configuration variables should not be secrets as there is no way to keep them secure on the frontend. That said, being able to specify certain configurations via environment variables makes configuration and deployment much easier.

"},{"location":"configuration/frontend-configuration/#what-can-be-configured","title":"What Can be Configured?","text":"

Our frontend config file should look like this (The OPEN_CONTRACTS_ prefixes are necessary to get the env variables injected into the frontend container. The env variable that shows up on window._env_ in the React frontend will omit the prefix, however - e.g. OPEN_CONTRACTS_REACT_APP_APPLICATION_DOMAIN will show up as REACT_APP_APPLICATION_DOMAIN):

OPEN_CONTRACTS_REACT_APP_APPLICATION_DOMAIN=\nOPEN_CONTRACTS_REACT_APP_APPLICATION_CLIENT_ID=\nOPEN_CONTRACTS_REACT_APP_AUDIENCE=http://localhost:3000\nOPEN_CONTRACTS_REACT_APP_API_ROOT_URL=https://opencontracts.opensource.legal\n\n# Uncomment to use Auth0 (you must then set the DOMAIN and CLIENT_ID envs above\n# OPEN_CONTRACTS_REACT_APP_USE_AUTH0=true\n\n# Uncomment to enable access to analyzers via the frontend\n# OPEN_CONTRACTS_REACT_APP_USE_ANALYZERS=true\n\n# Uncomment to enable access to import functionality via the frontend\n# OPEN_CONTRACTS_REACT_APP_ALLOW_IMPORTS=true\n

ATM, there are three key configurations:

  1. OPEN_CONTRACTS_REACT_APP_USE_AUTH0 - uncomment this / set it to true to switch the frontend login components and auth flow from django password auth to Auth0 oauth2. If this is true, you also need to provide valid configurations for OPEN_CONTRACTS_REACT_APP_APPLICATION_DOMAIN, OPEN_CONTRACTS_REACT_APP_APPLICATION_CLIENT_ID, and OPEN_CONTRACTS_REACT_APP_AUDIENCE. These are configured on the Auth0 platform; we don't have a walkthrough for that ATM.
  2. OPEN_CONTRACTS_REACT_APP_USE_ANALYZERS - allow users to see and use analyzers. This is set to false on the demo deployment.
  3. OPEN_CONTRACTS_REACT_APP_ALLOW_IMPORTS - allow users to upload OpenContracts export zip files and import them. Not recommended on truly public installations, as securing this will be challenging. Internal to an org should be OK, but still use caution.

"},{"location":"configuration/frontend-configuration/#how-to-configure","title":"How to Configure","text":""},{"location":"configuration/frontend-configuration/#method-1-using-an-env-file","title":"Method 1: Using an .env File","text":"

This method involves using a .env file that Docker Compose automatically picks up.

"},{"location":"configuration/frontend-configuration/#steps","title":"Steps:","text":"
  1. Create a file named .env in the same directory as your docker-compose.yml file.
  2. Copy the contents of your environment variable file into this .env file.
  3. In your docker-compose.yml, you don't need to explicitly specify the env file.
"},{"location":"configuration/frontend-configuration/#example-docker-composeyml","title":"Example docker-compose.yml:","text":"
version: '3'\nservices:\n  frontend:\n    build: ./frontend\n    ports:\n      - \"3000:3000\"\n    # No need to specify env_file here\n
"},{"location":"configuration/frontend-configuration/#pros","title":"Pros:","text":"
  • Simple setup
  • Docker Compose automatically uses the .env file
  • Easy to version control (if desired)
"},{"location":"configuration/frontend-configuration/#cons","title":"Cons:","text":"
  • All services defined in the Docker Compose file will have access to these variables
  • May not be suitable if you need different env files for different services
"},{"location":"configuration/frontend-configuration/#method-2-using-env_file-in-docker-compose","title":"Method 2: Using env_file in Docker Compose","text":"

This method allows you to specify a custom named env file for each service.

"},{"location":"configuration/frontend-configuration/#steps_1","title":"Steps:","text":"
  1. Keep your existing .env file (or rename it if desired).
  2. In your docker-compose.yml, specify the env file using the env_file key.
"},{"location":"configuration/frontend-configuration/#example-docker-composeyml_1","title":"Example docker-compose.yml:","text":"
version: '3'\nservices:\n  frontend:\n    build: ./frontend\n    ports:\n      - \"3000:3000\"\n    env_file:\n      - ./.env  # or your custom named file\n
"},{"location":"configuration/frontend-configuration/#pros_1","title":"Pros:","text":"
  • Allows using different env files for different services
  • More explicit than relying on the default .env file
"},{"location":"configuration/frontend-configuration/#cons_1","title":"Cons:","text":"
  • Requires specifying the env file in the Docker Compose file
"},{"location":"configuration/frontend-configuration/#method-3-defining-environment-variables-directly-in-docker-compose","title":"Method 3: Defining Environment Variables Directly in Docker Compose","text":"

This method involves defining the environment variables directly in the docker-compose.yml file.

"},{"location":"configuration/frontend-configuration/#steps_2","title":"Steps:","text":"
  1. In your docker-compose.yml, use the environment key to define variables.
"},{"location":"configuration/frontend-configuration/#example-docker-composeyml_2","title":"Example docker-compose.yml:","text":"
version: '3'\nservices:\n  frontend:\n    build: ./frontend\n    ports:\n      - \"3000:3000\"\n    environment:\n      - OPEN_CONTRACTS_REACT_APP_APPLICATION_DOMAIN=yourdomain.com\n      - OPEN_CONTRACTS_REACT_APP_APPLICATION_CLIENT_ID=your_client_id\n      - OPEN_CONTRACTS_REACT_APP_AUDIENCE=http://localhost:3000\n      - OPEN_CONTRACTS_REACT_APP_API_ROOT_URL=https://opencontracts.opensource.legal\n      - OPEN_CONTRACTS_REACT_APP_USE_AUTH0=true\n      - OPEN_CONTRACTS_REACT_APP_USE_ANALYZERS=true\n      - OPEN_CONTRACTS_REACT_APP_ALLOW_IMPORTS=true\n
"},{"location":"configuration/frontend-configuration/#pros_2","title":"Pros:","text":"
  • All configuration is in one file
  • Easy to see all environment variables at a glance
"},{"location":"configuration/frontend-configuration/#cons_2","title":"Cons:","text":"
  • Can make the docker-compose.yml file long and harder to manage
  • Sensitive information in the Docker Compose file may be a security risk
"},{"location":"configuration/frontend-configuration/#method-4-combining-env_file-and-environment","title":"Method 4: Combining env_file and environment","text":"

This method allows you to use an env file for most variables and override or add specific ones in the Docker Compose file.

"},{"location":"configuration/frontend-configuration/#steps_3","title":"Steps:","text":"
  1. Keep your .env file with most variables.
  2. In docker-compose.yml, use both env_file and environment.
"},{"location":"configuration/frontend-configuration/#example-docker-composeyml_3","title":"Example docker-compose.yml:","text":"
version: '3'\nservices:\n  frontend:\n    build: ./frontend\n    ports:\n      - \"3000:3000\"\n    env_file:\n      - ./.env\n    environment:\n      - REACT_APP_USE_AUTH0=true\n      - REACT_APP_USE_ANALYZERS=true\n      - REACT_APP_ALLOW_IMPORTS=true\n
"},{"location":"configuration/frontend-configuration/#pros_3","title":"Pros:","text":"
  • Flexibility to use env files and override when needed
  • Can keep sensitive info in env file and non-sensitive in Docker Compose
"},{"location":"configuration/frontend-configuration/#cons_3","title":"Cons:","text":"
  • Need to be careful about precedence (Docker Compose values override env file)
"},{"location":"development/documentation/","title":"Documentation","text":""},{"location":"development/documentation/#documentation-stack","title":"Documentation Stack","text":"

We're using mkdocs to render our markdown into pretty, bite-sized pieces. The markdown lives in /docs in our repo. If you want to work on the docs you'll need to install the requirements in /requirements/docs.txt.

To have a live server while working on them, type:

mkdocs serve\n
"},{"location":"development/documentation/#building-docs","title":"Building Docs","text":"

To build a html website from your markdown that can be uploaded to a webhost (or a GitHub Page), just type:

mkdocs build\n
"},{"location":"development/documentation/#deploying-to-gh-page","title":"Deploying to GH Page","text":"

mkdocs makes it super easy to deploy your docs to a GitHub page.

Just run:

mkdocs gh-deploy\n
"},{"location":"development/environment/","title":"Dev Environment","text":"

We use Black and Flake8 for Python Code Styling. These are run via pre-commit before all commits. If you want to develop extensions or code based on OpenContracts, you'll need to setup pre-commit. First, make sure the requirements in ./requirements/local.txt are installed in your local environment.

Then, install pre-commit into your local git repo. From the root of the repo, run:

 $ pre-commit install\n
If you want to run pre-commit manually on all the code in the repo, use this command:

 $ pre-commit run --all-files\n

When you commit changes to your repo or our repo as a PR, pre-commit will run and ensure your code follows our style guide and passes linting.

"},{"location":"development/frontend-notes/","title":"Frontend Notes","text":""},{"location":"development/frontend-notes/#responsive-layout","title":"Responsive Layout","text":"

The application was primarily designed to be viewed around 1080p. We've built in some quick-and-dirty fixes (honestly, hacks) to display a usable layout at other resolutions. A more thorough redesign / refactor is in order, again if there's sufficient interest. What's available now should handle a lot of situations ok. If you find performance / layout is not looking great at your given resolution, try to use a desktop browser at a 1080p resolution.

"},{"location":"development/frontend-notes/#no-test-suite","title":"No Test Suite","text":"

As of our initial release, the test suite only tests the backend (and coverage is admittedly not as robust as we'd like). We'd like to add tests for the frontend, though this is a fairly large undertaking. We welcome any contributions on this front!

"},{"location":"development/test-suite/","title":"Test Suite","text":"

Our test suite is a bit sparse, but we're working to improve coverage on the backend. Frontend tests will likely take longer to implement. Our existing tests do test imports and a number of the utility functions for manipulating annotations. These tests are integrated in our GitHub actions.

NOTE: use Python 3.10 or above, as pydantic and certain pre-3.10 type annotations do not play well together. Using from __future__ import annotations doesn't always solve the problem, and upgrading to Python 3.10 was a lot easier than trying to figure out why the __future__ import didn't behave as expected.

To run the tests, check your test coverage, and generate an HTML coverage report:

 $ docker-compose -f local.yml run django coverage run -m pytest\n $ docker-compose -f local.yml run django coverage html\n $ open htmlcov/index.html\n

To run a specific test (e.g. test_analyzers):

 $ sudo docker-compose -f local.yml run django python manage.py test opencontractserver.tests.test_analyzers --noinput\n
"},{"location":"extract_and_retrieval/document_data_extract/","title":"Extracting Structured Data from Documents using LlamaIndex, AI Agents, and Marvin","text":"

We've added a powerful feature called \"extract\" that enables the generation of structured data grids from a list of documents using a combination of vector search, AI agents, and the Marvin library.

The run_extract task orchestrates the extraction process, spinning up a number of llama_index_doc_query tasks. Each of these query tasks uses LlamaIndex (with a Django- and pgvector-backed vector store) for vector search and retrieval, and Marvin for data parsing and extraction. It processes each document and column in parallel using celery's task system.

All credit for the inspiration of this feature goes to the fine folks at Nlmatics. They were some of the first pioneers building data grids from documents using a set of questions and custom transformer models. This implementation of their concept ultimately leverages newer techniques and better models, but hats off to them for coming up with a design like this in 2017/2018!

The current implementation relies heavily on LlamaIndex, specifically their vector store tooling, their reranker and their agent framework.

Structured data extraction is powered by the amazing Marvin library.
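
As a quick orientation, here is a toy example of the two Marvin calls used downstream (casting to a single typed value and extracting a list). The text and values are made up, and the snippet assumes the marvin package is installed and an OpenAI API key is configured.

# Toy example of the two Marvin calls used downstream (text and values are\n# made up; requires the marvin package and an OpenAI API key).\nimport marvin\n\n# cast free text to a single typed value\nnotice_days = marvin.cast(\n    \"Either party may terminate on sixty (60) days' written notice.\",\n    target=int,\n)  # -> 60\n\n# extract a list of typed values\nparties = marvin.extract(\n    \"This Agreement is between Acme Corp. and Widgets LLC.\",\n    target=str,\n    instructions=\"Extract the names of the contracting parties\",\n)  # -> [\"Acme Corp.\", \"Widgets LLC\"]\n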

"},{"location":"extract_and_retrieval/document_data_extract/#overview","title":"Overview","text":"

The extract process involves the following key components:

  1. Document Corpus: A collection of documents from which structured data will be extracted.
  2. Fieldset: A set of columns defining the structure of the data to be extracted.
  3. LlamaIndex: A library used for efficient vector search and retrieval of relevant document sections.
  4. AI Agents: Intelligent agents that analyze the retrieved document sections and extract structured data.
  5. Marvin: A library that facilitates the parsing and extraction of structured data from text.

The extract process is initiated by creating an Extract object that specifies the document corpus and the fieldset defining the desired data structure. The process is then broken down into individual tasks for each document and column combination, allowing for parallel processing and scalability.

"},{"location":"extract_and_retrieval/document_data_extract/#detailed-walkthrough","title":"Detailed Walkthrough","text":"

Here's how the extract process works step by step.

"},{"location":"extract_and_retrieval/document_data_extract/#1-initiating-the-extract-process","title":"1. Initiating the Extract Process","text":"

The run_extract function is the entry point for initiating the extract process. It takes the extract_id and user_id as parameters and performs the following steps (a simplified sketch follows the list):

  1. Retrieves the Extract object from the database based on the provided extract_id.
  2. Sets the started timestamp of the extract to the current time.
  3. Retrieves the fieldset associated with the extract, which defines the columns of the structured data grid.
  4. Retrieves the list of document IDs associated with the extract.
  5. Creates Datacell objects for each document and column combination, representing the individual cells in the structured data grid.
  6. Sets the appropriate permissions for each Datacell object based on the user's permissions.
  7. Kicks off the processing job for each Datacell by appending a task to the Celery task queue.
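
The steps above can be condensed into a rough sketch. This is an approximation, not the actual task code: the celery_app reference, the model relationships (extract.documents, extract.fieldset.columns), and the grant_crud_permissions helper are assumptions standing in for the project's real names.

# Simplified sketch of run_extract (approximate field and helper names).\nfrom celery import chord\nfrom django.utils import timezone\n\n# Extract, Datacell, llama_index_doc_query and mark_extract_complete are the\n# project objects/tasks described above; their imports are omitted here.\n\n\n@celery_app.task\ndef run_extract(extract_id, user_id):\n    extract = Extract.objects.get(pk=extract_id)\n    extract.started = timezone.now()\n    extract.save()\n\n    tasks = []\n    for document_id in extract.documents.values_list(\"id\", flat=True):\n        for column in extract.fieldset.columns.all():\n            cell = Datacell.objects.create(\n                extract=extract,\n                column=column,\n                creator_id=user_id,\n                document_id=document_id,\n            )\n            grant_crud_permissions(cell, user_id)  # hypothetical helper\n            tasks.append(llama_index_doc_query.si(cell.pk))\n\n    # run all cell tasks in parallel, then mark the extract complete\n    chord(tasks)(mark_extract_complete.si(extract_id))\n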
"},{"location":"extract_and_retrieval/document_data_extract/#2-processing-individual-datacells","title":"2. Processing Individual Datacells","text":"

The llama_index_doc_query function is responsible for processing each individual Datacell.

"},{"location":"extract_and_retrieval/document_data_extract/#execution-flow-visualized","title":"Execution Flow Visualized:","text":"
graph TD\n    I[llama_index_doc_query] --> J[Retrieve Datacell]\n    J --> K[Create HuggingFaceEmbedding]\n    K --> L[Create OpenAI LLM]\n    L --> M[Create DjangoAnnotationVectorStore]\n    M --> N[Create VectorStoreIndex]\n    N --> O{Special character '|||' in search_text?}\n    O -- Yes --> P[Split examples and average embeddings]\n    P --> Q[Query annotations using averaged embeddings]\n    Q --> R[Rerank nodes using SentenceTransformerRerank]\n    O -- No --> S[Retrieve results using index retriever]\n    S --> T[Rerank nodes using SentenceTransformerRerank]\n    R --> U{Column is agentic?}\n    T --> U\n    U -- Yes --> V[Create QueryEngineTool]\n    V --> W[Create FunctionCallingAgentWorker]\n    W --> X[Create StructuredPlannerAgent]\n    X --> Y[Query agent for definitions]\n    U -- No --> Z{Extract is list?}\n    Y --> Z\n    Z -- Yes --> AA[Extract with Marvin]\n    Z -- No --> AB[Cast with Marvin]\n    AA --> AC[Save result to Datacell]\n    AB --> AC\n    AC --> AD[Mark Datacell complete]\n
"},{"location":"extract_and_retrieval/document_data_extract/#step-by-step-walkthrough","title":"Step-by-step Walkthrough","text":"
  1. The run_extract task is called with an extract_id and user_id. It retrieves the corresponding Extract object and marks it as started.

  2. It then iterates over the document IDs associated with the extract. For each document and each column in the extract's fieldset, it:

  3. Creates a new Datacell object with the extract, column, output type, creator, and document.
  4. Sets CRUD permissions for the datacell to the user.
  5. Appends a llama_index_doc_query task to a list of tasks, passing the datacell ID.

  6. After all datacells are created and their tasks added to the list, a Celery chord is used to group the tasks. Once all tasks are complete, it calls the mark_extract_complete task to mark the extract as finished.

  7. The llama_index_doc_query task processes each individual datacell. It:

  8. Retrieves the datacell and marks it as started.
  9. Creates a HuggingFaceEmbedding model and sets it as the Settings.embed_model.
  10. Creates an OpenAI LLM and sets it as the Settings.llm.
  11. Creates a DjangoAnnotationVectorStore from the document ID and column settings.
  12. Creates a VectorStoreIndex from the vector store.

  13. If the search_text contains the special character '|||' (see the sketch after this list):

  14. It splits the examples and calculates the embeddings for each example.
  15. It calculates the average embedding from the individual embeddings.
  16. It queries the Annotation objects using the averaged embeddings and orders them by cosine distance.
  17. It reranks the nodes using SentenceTransformerRerank and retrieves the top-n nodes.
  18. It adds the annotation IDs of the reranked nodes to the datacell's sources.
  19. It retrieves the text from the reranked nodes.

  20. If the search_text does not contain the special character '|||':

  21. It retrieves the relevant annotations using the index retriever based on the search_text or query.
  22. It reranks the nodes using SentenceTransformerRerank and retrieves the top-n nodes.
  23. It adds the annotation IDs of the reranked nodes to the datacell's sources.
  24. It retrieves the text from the retrieved nodes.

  25. If the column is marked as agentic:

  26. It creates a QueryEngineTool, FunctionCallingAgentWorker, and StructuredPlannerAgent.
  27. It queries the agent to find defined terms and section references in the retrieved text.
  28. The definitions and section text are added to the retrieved text.

  29. Depending on whether the column's extract_is_list is true, it either:

  30. Extracts a list of the output_type from the retrieved text using Marvin, with optional instructions or query.
  31. Casts the retrieved text to the output_type using Marvin, with optional instructions or query.

  32. The result is saved to the datacell's data field based on the output_type. The datacell is marked as completed.

  33. If an exception occurs during processing, the error is logged, saved to the datacell's stacktrace, and the datacell is marked as failed.
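
To make the '|||' branch concrete, here is a rough, hedged sketch of averaging the example embeddings and ordering annotations by cosine distance. The search_text, embed_model, datacell, and Annotation names come from the walkthrough above; the exact field names and limits in the real task may differ.

# Rough sketch of the '|||' example-averaging branch (not the exact task code).\nimport numpy as np\nfrom pgvector.django import CosineDistance\n\nexamples = [ex.strip() for ex in search_text.split(\"|||\") if ex.strip()]\nvectors = [embed_model.get_text_embedding(ex) for ex in examples]\navg_embedding = np.mean(np.array(vectors), axis=0).tolist()\n\n# order this document's annotations by cosine distance to the averaged vector,\n# then hand the top candidates to the reranker\ncandidates = (\n    Annotation.objects.filter(document_id=datacell.document_id)\n    .order_by(CosineDistance(\"embedding\", avg_embedding))[:20]\n)\n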

"},{"location":"extract_and_retrieval/document_data_extract/#next-steps","title":"Next Steps","text":"

This is more of a proof-of-concept of the power of the existing universe of open source tooling. There are a number of more advanced techniques we can use to get better retrieval, more intelligent agentic behavior, and more. Also, we haven't optimized for performance AT ALL, so any improvements in any of these areas would be welcome. Further, we expect the real power for an open source tool like OpenContracts to come from custom implementations of this functionality, so we'll also be working on more easily customizable and modular agents and retrieval pipelines so you can quickly select the right pipeline for the right task.

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/","title":"Making a Django Application Compatible with LlamaIndex using a Custom Vector Store","text":""},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#introduction","title":"Introduction","text":"

In this walkthrough, we'll explore how the custom DjangoAnnotationVectorStore makes a Django application compatible with LlamaIndex, enabling powerful vector search capabilities within the application's structured annotation store. By leveraging the BasePydanticVectorStore class provided by LlamaIndex and integrating it with Django's ORM and the pg-vector extension for PostgreSQL, we can achieve efficient and scalable vector search functionality.

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#understanding-the-djangoannotationvectorstore","title":"Understanding the DjangoAnnotationVectorStore","text":"

The DjangoAnnotationVectorStore is a custom implementation of LlamaIndex's BasePydanticVectorStore class, tailored specifically for a Django application. It allows the application to store and retrieve granular, visually-locatable annotations (x-y blocks) from PDF pages using vector search.

Let's break down the key components and features of the DjangoAnnotationVectorStore:

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#1-inheritance-from-basepydanticvectorstore","title":"1. Inheritance from BasePydanticVectorStore","text":"
class DjangoAnnotationVectorStore(BasePydanticVectorStore):\n    ...\n

By inheriting from BasePydanticVectorStore, the DjangoAnnotationVectorStore gains access to the base functionality and interfaces provided by LlamaIndex for vector stores. This ensures compatibility with LlamaIndex's query engines and retrieval methods.
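
As a hedged usage sketch (the constructor keyword arguments are assumptions based on the filtering fields described below, not a guaranteed signature), plugging the store into LlamaIndex might look like this:

from llama_index.core import VectorStoreIndex\n\nvector_store = DjangoAnnotationVectorStore(\n    corpus_id=corpus.id,   # restrict search to one corpus (assumed kwarg)\n    document_id=None,\n    must_have_text=None,\n)\n\n# Because the store implements the BasePydanticVectorStore interface,\n# LlamaIndex can build an index, retrievers and query engines on top of it.\nindex = VectorStoreIndex.from_vector_store(vector_store=vector_store)\nretriever = index.as_retriever(similarity_top_k=5)\nnodes = retriever.retrieve(\"termination for convenience\")\n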

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#2-integration-with-djangos-orm","title":"2. Integration with Django's ORM","text":"

The DjangoAnnotationVectorStore leverages Django's Object-Relational Mapping (ORM) to interact with the application's database. It defines methods like _get_annotation_queryset() and _build_filter_query() to retrieve annotations from the database using Django's queryset API.

def _get_annotation_queryset(self) -> QuerySet:\n    queryset = Annotation.objects.all()\n    if self.corpus_id is not None:\n        queryset = queryset.filter(\n            Q(corpus_id=self.corpus_id) | Q(document__corpus=self.corpus_id)\n        )\n    if self.document_id is not None:\n        queryset = queryset.filter(document=self.document_id)\n    if self.must_have_text is not None:\n        queryset = queryset.filter(raw_text__icontains=self.must_have_text)\n    return queryset.distinct()\n

This integration allows seamless retrieval of annotations from the Django application's database, making it compatible with LlamaIndex's querying and retrieval mechanisms.

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#3-utilization-of-pg-vector-for-vector-search","title":"3. Utilization of pg-vector for Vector Search","text":"

The DjangoAnnotationVectorStore utilizes the pg-vector extension for PostgreSQL to perform efficient vector search operations. pg-vector adds support for vector data types and provides optimized indexing and similarity search capabilities.

queryset = (\n    queryset.order_by(\n        CosineDistance(\"embedding\", query.query_embedding)\n    ).annotate(\n        similarity=CosineDistance(\"embedding\", query.query_embedding)\n    )\n)[: query.similarity_top_k]\n

In the code above, the CosineDistance function from pg-vector is used to calculate the cosine similarity between the query embedding and the annotation embeddings stored in the database. This allows for fast and accurate retrieval of relevant annotations based on vector similarity.

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#4-customization-and-filtering-options","title":"4. Customization and Filtering Options","text":"

The DjangoAnnotationVectorStore provides various customization and filtering options to fine-tune the vector search process. It allows filtering annotations based on criteria such as corpus_id, document_id, and must_have_text.

def _build_filter_query(self, filters: Optional[MetadataFilters]) -> QuerySet:\n    queryset = self._get_annotation_queryset()\n\n    if filters is None:\n        return queryset\n\n    for filter_ in filters.filters:\n        if filter_.key == \"label\":\n            queryset = queryset.filter(annotation_label__text__iexact=filter_.value)\n        else:\n            raise ValueError(f\"Unsupported filter key: {filter_.key}\")\n\n    return queryset\n

This flexibility enables targeted retrieval of annotations based on specific metadata filters, enhancing the search capabilities of the application.
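For example, a label filter could be passed through from a retriever like so. This is a sketch only: the label value is made up and the exact import path varies by LlamaIndex version.

from llama_index.core.vector_stores.types import ExactMatchFilter, MetadataFilters\n\nfilters = MetadataFilters(\n    filters=[ExactMatchFilter(key=\"label\", value=\"Termination Clause\")]\n)\nretriever = index.as_retriever(similarity_top_k=5, filters=filters)\nnodes = retriever.retrieve(\"When can the agreement be terminated?\")\n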

"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#benefits-of-integrating-llamaindex-with-django","title":"Benefits of Integrating LlamaIndex with Django","text":"

Integrating LlamaIndex with a Django application using the DjangoAnnotationVectorStore offers several benefits:

  1. Structured Annotation Storage: The Django application's annotation store provides a structured and organized way to store and manage granular annotations extracted from PDF pages. Each annotation is associated with metadata such as page number, bounding box coordinates, and labels, allowing for precise retrieval and visualization.
  2. Efficient Vector Search: By leveraging the pg-vector extension for PostgreSQL, the DjangoAnnotationVectorStore enables efficient vector search operations within the Django application. This allows for fast and accurate retrieval of relevant annotations based on their vector embeddings, improving the overall performance of the application.
  3. Compatibility with LlamaIndex: The DjangoAnnotationVectorStore is designed to be compatible with LlamaIndex's query engines and retrieval methods. This compatibility allows the Django application to benefit from the powerful natural language processing capabilities provided by LlamaIndex, such as semantic search, question answering, and document summarization.
  4. Customization and Extensibility: The DjangoAnnotationVectorStore provides a flexible and extensible foundation for building custom vector search functionality within a Django application. It can be easily adapted and extended to meet specific application requirements, such as adding new filtering options or incorporating additional metadata fields.
"},{"location":"extract_and_retrieval/intro_to_django_annotation_vector_store/#conclusion","title":"Conclusion","text":"

By implementing the DjangoAnnotationVectorStore and integrating it with LlamaIndex, a Django application can achieve powerful vector search capabilities within its structured annotation store. The custom vector store leverages Django's ORM and the pg-vector extension for PostgreSQL to enable efficient retrieval of granular annotations based on vector similarity.

This integration opens up new possibilities for building intelligent and interactive applications that can process and analyze large volumes of annotated data. With the combination of Django's robust web framework and LlamaIndex's advanced natural language processing capabilities, developers can create sophisticated applications that deliver enhanced user experiences and insights.

The DjangoAnnotationVectorStore serves as a bridge between the Django ecosystem and the powerful tools provided by LlamaIndex, enabling developers to harness the best of both worlds in their applications.

"},{"location":"extract_and_retrieval/querying_corpus/","title":"Answering Queries using LlamaIndex in a Django Application","text":"

This markdown document explains how queries are answered in a Django application using LlamaIndex, the limitations of the approach, and how LlamaIndex is leveraged for this purpose.

"},{"location":"extract_and_retrieval/querying_corpus/#query-answering-process","title":"Query Answering Process","text":"
  1. A user submits a query through the Django application, which is associated with a specific corpus (a collection of documents).
  2. The query is saved in the database as a CorpusQuery object, and a Celery task (run_query) is triggered to process the query asynchronously.
  3. Inside the run_query task:
    1. The CorpusQuery object is retrieved from the database using the provided query_id.
    2. The query's started timestamp is set to the current time.
    3. The necessary components for query processing are set up, including the embedding model (HuggingFaceEmbedding), language model (OpenAI), and vector store (DjangoAnnotationVectorStore).
    4. The DjangoAnnotationVectorStore is initialized with the corpus_id associated with the query, allowing it to retrieve the relevant annotations for the specified corpus.
    5. A VectorStoreIndex is created from the DjangoAnnotationVectorStore, which serves as the index for the query engine.
    6. A CitationQueryEngine is instantiated with the index, specifying the number of top similar results to retrieve (similarity_top_k) and the granularity of the citation sources (citation_chunk_size).
    7. The query is passed to the CitationQueryEngine, which processes the query and generates a response.
    8. The response includes the answer to the query along with the source annotations used to generate the answer.
    9. The source annotations are parsed and converted into a markdown format, with each citation linked to the corresponding annotation ID.
    10. The query's sources field is updated with the annotation IDs used in the response.
    11. The query's response field is set to the generated markdown text.
    12. The query's completed timestamp is set to the current time.
    13. If an exception occurs during query processing, the query's failed timestamp is set, and the stack trace is stored in the stacktrace field.
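
A condensed, illustrative sketch of the heart of that flow is below. The embedding/LLM model names and the DjangoAnnotationVectorStore import path are assumptions for the example; the real run_query task also handles timestamps, markdown citation formatting, and error capture as described above:

from llama_index.core import Settings, VectorStoreIndex\nfrom llama_index.core.query_engine import CitationQueryEngine\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\nfrom llama_index.llms.openai import OpenAI\n\n# Hypothetical import path for the custom vector store described earlier.\nfrom opencontractserver.llms.vector_stores import DjangoAnnotationVectorStore\n\nSettings.embed_model = HuggingFaceEmbedding(model_name='sentence-transformers/all-MiniLM-L6-v2')  # illustrative\nSettings.llm = OpenAI(model='gpt-4o-mini')  # illustrative\n\nvector_store = DjangoAnnotationVectorStore(corpus_id=1)  # corpus tied to the CorpusQuery\nindex = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n\nquery_engine = CitationQueryEngine.from_args(\n    index,\n    similarity_top_k=10,\n    citation_chunk_size=512,\n)\nresponse = query_engine.query('What is the termination notice period?')\n\n# response.response holds the answer text; response.source_nodes holds the cited\n# annotations, which run_query converts into markdown links and saves on the CorpusQuery.\nprint(response.response)\n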
"},{"location":"extract_and_retrieval/querying_corpus/#leveraging-llamaindex","title":"Leveraging LlamaIndex","text":"

LlamaIndex is leveraged in the following ways to enable query answering in the Django application:

  1. Vector Store: LlamaIndex provides the BasePydanticVectorStore class, which serves as the foundation for the custom DjangoAnnotationVectorStore. The DjangoAnnotationVectorStore integrates with Django's ORM to store and retrieve annotations efficiently, allowing seamless integration with the existing Django application.
  2. Indexing: LlamaIndex's VectorStoreIndex is used to create an index from the DjangoAnnotationVectorStore. The index facilitates fast and efficient retrieval of relevant annotations based on the query.
  3. Query Engine: LlamaIndex's CitationQueryEngine is employed to process the queries and generate responses. The query engine leverages the index to find the most relevant annotations and uses the language model to generate a coherent answer.
  4. Embedding and Language Models: LlamaIndex provides abstractions for integrating various embedding and language models. In this implementation, the HuggingFaceEmbedding and OpenAI models are used, but LlamaIndex allows flexibility in choosing different models based on requirements.

By leveraging LlamaIndex, the Django application benefits from a structured and efficient approach to query answering. LlamaIndex provides the necessary components and abstractions to handle vector storage, indexing, and query processing, allowing the application to focus on integrating these capabilities into its existing architecture.

"},{"location":"walkthrough/key-concepts/","title":"Key-Concepts","text":""},{"location":"walkthrough/key-concepts/#data-types","title":"Data Types","text":"

Text annotation data is divided into several concepts:

  1. Corpuses (or collections of documents). One document can be in multiple corpuses.
  2. Documents. Currently, these are PDFs ONLY.
  3. Annotations. These are either document-level annotations (the document type), text-level annotations (highlighted text), or relationships (which apply a label between two annotations). Relationships are currently not well-supported and may be buggy.
  4. Analyses. These are groups of read-only annotations added by a Gremlin analyzer (see more on that below).
"},{"location":"walkthrough/key-concepts/#permissioning","title":"Permissioning","text":"

OpenContracts is built on top of the powerful permissioning framework for Django called django-guardian. Each GraphQL request can add a field to annotate the object-level permissions the current user has for a given object, and the frontend relies on this to determine whether to make some objects and pages read-only and whether certain features should be exposed to a given user. The capability of sharing objects with specific users is built in, but is not enabled from the frontend at the moment. Allowing such widespread sharing and user lookups could be a security hole and could also unduly tax the system. We'd like to test these capabilities more fully before letting users use them.
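
To make that concrete, here's a small sketch of the object-level pattern django-guardian enables. The 'read_corpus' codename is illustrative - the actual permission codenames are defined on the OpenContracts models themselves:

from django.contrib.auth import get_user_model\nfrom guardian.shortcuts import assign_perm, get_perms\n\n\ndef share_corpus_readonly(corpus, username: str) -> list[str]:\n    # Grant one user read access to a single corpus and return their object-level perms.\n    user = get_user_model().objects.get(username=username)\n    # 'read_corpus' is an illustrative codename, not necessarily the registered one.\n    assign_perm('read_corpus', user, corpus)\n    return get_perms(user, corpus)\n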

"},{"location":"walkthrough/key-concepts/#graphql","title":"GraphQL","text":""},{"location":"walkthrough/key-concepts/#mutations-and-queries","title":"Mutations and Queries","text":"

OpenContracts uses Graphene and GraphQL to serve data to its frontend. You can access the GraphiQL playground by going to your OpenContracts root URL /graphql - e.g. https://opencontracts.opensource.legal/graphql. Anonymous users have access to any public data. To authenticate and access your own data, you either need to use the login mutation to create a JWT token or log in to the admin dashboard to get a Django session and auth cookie that will automatically authenticate your requests to the GraphQL endpoint.

If you're not familiar with GraphQL, it's a powerful way to expose your backend to users and frontend clients, letting them construct specific queries that return exactly the data shapes they need. As an example, here's a request to get public corpuses and the annotated text and labels in them:
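
As a hypothetical illustration - the corpuses field and its sub-fields below are assumptions, so browse the live schema in GraphiQL's Docs panel for the real names and add deeper selections (annotations, labels) from there - a query can be posted to the endpoint from Python like this:

import requests\n\nGRAPHQL_URL = 'https://opencontracts.opensource.legal/graphql'\n\nquery = '''\nquery PublicCorpuses {\n  corpuses {\n    edges {\n      node {\n        id\n        title\n      }\n    }\n  }\n}\n'''\n\nresponse = requests.post(\n    GRAPHQL_URL,\n    json={'query': query},\n    # Omit the header for anonymous access to public data; the exact scheme\n    # (e.g. 'Bearer <token>' vs 'JWT <token>') depends on the configured auth backend.\n    headers={'Authorization': 'Bearer <your token>'},\n    timeout=30,\n)\nprint(response.json())\n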

Graphiql comes with a built-in documentation browser. Just click \"Docs\" in the top-right of the screen to start browsing. Typically, mutations change things on the server. Queries merely request copies of data from the server. We've tried to make our schema fairly self-explanatory, but we do plan to add more descriptions and guidance to our API docs.

"},{"location":"walkthrough/key-concepts/#graphql-only-features","title":"GraphQL-only features","text":"

Some of our features are currently not accessible via the frontend. Sharing analyses and corpuses to the public, for example, can only be achieved via makeCorpusPublic and makeAnalysisPublic mutations, and only admins have this power at the moment. For our current release, we've done this to prevent large numbers of public corpuses from being shared and to cut down on server usage. We'd like to make a fully free, open, and collaborative platform with more features for sharing anonymously, but this will require additional effort and compute power.

"},{"location":"walkthrough/step-1-add-documents/","title":"Step 1 - Add Documents","text":"

In order to do anything, you need to add some documents to OpenContracts.

"},{"location":"walkthrough/step-1-add-documents/#go-to-the-documents-tab","title":"Go to the Documents tab","text":"

Click on the \"Documents\" entry in the menu to bring up a view of all documents you have read and/or write access to:

"},{"location":"walkthrough/step-1-add-documents/#open-the-action-menu","title":"Open the Action Menu","text":"

Now, click on the \"Action\" dropdown to open the Action menu for available actions and click \"Import\":

This will bring up a dialog to load documents:

"},{"location":"walkthrough/step-1-add-documents/#select-documents-to-upload","title":"Select Documents to Upload","text":"

Open Contracts works with PDFs only (as this helps us have a single file type with predictable data structures, formats, etc.). In the future, we'll add functionality to convert other files to PDF, but, for now, please use PDFs. It doesn't matter if they are OCRed or not as OpenContracts performs its own OCR on every PDF anyway to ensure consistent OCR quality and outputs. Once you've added documents for upload, you'll see a list of documents:

Click on a document to change the description or title:

"},{"location":"walkthrough/step-1-add-documents/#upload-your-documents","title":"Upload Your Documents","text":"

Click upload to upload the documents to OpenContracts. Note: once the documents are uploaded, they are automatically processed with Tesseract and PAWLs to create a layer of tokens - each one representing a word / symbol in the PDF and its X,Y coordinates on the page. This is what powers OpenContracts' annotator and allows us to create both layout-aware and text-only annotations. While the PAWLs processing script is running, the document you uploaded will not be available for viewing and cannot be added to a corpus. You'll see a loading bar on the document until the pre-processing is complete. This is only done once and can take a while (a couple of minutes up to a maximum of about 10) depending on the document length, quality, etc.

"},{"location":"walkthrough/step-2-create-labelset/","title":"Step 2 - Create Labelset","text":""},{"location":"walkthrough/step-2-create-labelset/#why-labelsets","title":"Why Labelsets?","text":"

Before you can add labels, you need to decide what you want to label. A labelset should reflect the taxonomy or concepts you want to associate with text in your document. This can be solely for the purpose of human review and retrieval, but we imagine many of you want to use it to train machine learning models.

At the moment, there's no way to create a label in a corpus without creating a labelset and creating a label for the labelset (though we'd like to add that and welcome contributions).

"},{"location":"walkthrough/step-2-create-labelset/#create-text-labels","title":"Create Text Labels","text":"

Let's say we want to add some labels for \"Parties\", \"Termination Clause\", and \"Effective Date\". To do that, let's first create a LabelSet to hold the labels.

  1. Go to the labelset view and click the action button to bring up the action menu:
  2. Clicking on the \"Create Label Set\" item will bring up a modal to let you create labels:
  3. Now click on the new label set to edit the labels:
  4. A modal comes up that lets you edit three types of labels:

    1. Text Labels - are meant to label spans of text (\"highlights\")
    2. Relationship Labels - this feature is still under development, but it labels relationships between text labels (e.g. one labelled party is the \"Parent Company\" of another).
    3. Doc Type Labels - are meant to label what category the document belongs in - e.g. a \"Stock Purchase Agreement\" or an \"NDA\"
  5. Click the \"Text Labels\" tab to bring up a view of current labels for text annotations and an action button that lets you create new ones. There should be no labels when you first open this view\"

  6. Click the action button and then the \"Create Text Label\" dropdown item:
  7. You'll see a new, blank label in the list of text labels:
  8. Click the edit icon on the label to edit the label title, description, color and/or icon. To edit the icon or highlight color, hover over or click the giant tag icon on the left side of the label:
  9. Hit save to commit the changes to the database. Repeat for the other labels - \"Parties\", \"Termination Clause\", and \"Effective Date\":
"},{"location":"walkthrough/step-2-create-labelset/#create-document-type-labels","title":"Create Document-Type Labels","text":"

In addition to labelling specific parts of a document, you may want to tag the document itself as being a certain type of document or as addressing a certain subject. In this example, let's say we want to label some documents as \"contracts\" and others as \"not contracts\".

  1. Let's also create two example document type labels. Click the \"Doc Type Labels\" tab:
  2. As before, click the action button and the \"Create Document Type Label\" item to create a blank document type label:
  3. Repeat to create two doc type labels - \"Contract\" and \"Not Contract\":
  4. Hit \"Close\" to close the editor.
"},{"location":"walkthrough/step-3-create-a-corpus/","title":"Step 3 - Create Corpus","text":""},{"location":"walkthrough/step-3-create-a-corpus/#purpose-of-the-corpus","title":"Purpose of the Corpus","text":"

A \"Corpus\" is a collection of documents that can be annotated by hand or automatically by a \"Gremlin\" analyzer. In order to create a Corpus, you first need to create a Corpus and then add documents to it.

"},{"location":"walkthrough/step-3-create-a-corpus/#go-to-the-corpus-page","title":"Go to the Corpus Page","text":"
  1. First, login if you're not already logged in.
  2. Then, go to the \"Corpus\" tab and click the \"Action\" dropdown to bring up the action menu:
  3. Click \"Create Corpus\" to bring up the Create Corpus dialog. If you've already created a labelset or have a pre-existing one, you can select it, otherwise you'll need to create and add one later:
  4. Assuming you created the labelset you want to use, when you click on the dropdown in the \"Label Set\" section, you should see your new labelset. Click on it to select it:
  5. You will now be able to open the corpus again, open documents in the corpus, and start labelling.
"},{"location":"walkthrough/step-3-create-a-corpus/#add-documents-to-corpus","title":"Add Documents to Corpus","text":"
  1. Once you have a corpus, go back to the document page to select documents to add. You can do this in one of two ways.
    1. Right-click on a document to show a context menu:
    2. Or, SHIFT + click on the documents you want to select in order to select multiple documents at once. A green checkmark will appear on selected documents.
  2. When you're done, click the \"Action\" dropdown and choose the option to add the selected documents to a corpus.
  3. A dialog will pop up asking you to select a corpus to add the documents to. Select the desired corpus and hit ok.
  4. You'll get a confirmation dialog. Hit OK.
  5. When you click on the Corpus you just added the documents to, you'll get a tabbed view of all of the documents, annotations and analyses for that Corpus. At this stage, you should see your documents:

Congrats! You've created a corpus to hold annotations or perform an analysis! To start labelling it yourself, however, you need to create and then select a LabelSet. You do not need a LabelSet to run an analyzer.

Note: If you have an OpenContracts export file and proper permissions, you can also import a corpus, documents, annotations, and labels. This is disabled on our demo instance, however, to cut down on server load and reduce opportunities to upload potentially malicious files. See the \"Advanced\" section for more details.

"},{"location":"walkthrough/step-4-create-text-annotations/","title":"Step 4 - Create Some Annotations","text":"

To view or edit annotations, you need to open a corpus and then open a document in the Corpus.

  1. Go to your Corpuses page and click on the corpus you just created:
  2. This will open up the document view again. Click on one of the documents to bring up the annotator:
  3. To select the label to apply, click the vertical ellipsis in the \"Text Label to Apply\" widget. This will bring up an interface that lets you search your labelset and select a label:
  4. Select the \"Effective Date\" label, for example, to label the Effective Date:
  5. Now, in the document, click and drag a box around the language that corresponds to your selected label:
  6. When you've selected the correct text, release the mouse. You'll see a confirmation when your annotation is created (you'll also see the annotation in the sidebar to the left):
  7. If you want to delete the annotation, you can click on the trash icon in the corresponding annotation card in the sidebar, or, when you hover over the annotation on the page, you'll see a trash icon in the label bar of the annotation. You can click this to delete the annotation too.
  8. If your desired annotated text is non-contiguous, you can hold down the SHIFT key while selecting blocks of text to combine them into a single annotation. While holding SHIFT, releasing the mouse will not create the annotation in the database; it just lets you move to a new area.
    1. One situation where you might want to do this is when the text you want to highlight sits on different lines but is just a small part of the surrounding paragraph (such as this example, where Effective Date spans two lines):
    2. Or you might want to select multiple snippets of text in a larger block of text, such as where you have multiple parties you want to combine into a single annotation:
"},{"location":"walkthrough/step-5-create-doc-type-annotations/","title":"Step 5 - Create Some Document Annotations","text":"
  1. If you want to label the type of document instead of the text inside it, use the controls in the \"Doc Type\" widget on the bottom right of the Annotator. Hover over it and a green plus button should appear:
  2. Click the \"+\" button to bring up a dialog that lets you search and select document type labels (remember, we created these earlier in the tutorial):
  3. Click \"Add Label\" to actually apply the label, and you'll now see that label displayed in the \"Doc Type\" widget in the annotator:
  4. As before, you can click the trash can to delete the label.
"},{"location":"walkthrough/step-6-search-and-filter-by-annotations/","title":"Step 6 - Search and Filter By Annotations","text":"
  1. Back in the Corpus view, you can see in the document view the document type label you just added:
  2. You can click on the filter dropdown above to filter the documents to only those with a certain doc type label:
  3. With the corpus opened, click on the \"Annotations\" tab instead of the \"Documents\" tab to get a summary of all the current annotations in the Corpus:
  4. Click on an annotation card to automatically load the document it's in and jump right to the page containing the annotation:
"},{"location":"walkthrough/step-7-query-corpus/","title":"Querying a Corpus","text":"

Once you've created a corpus of documents, you can ask a natural language question and get a natural language answer, complete with citations and links back to the relevant text in the document(s).

Note: We're still working to improve nav and GUI performance, but this is pretty good for a first cut.

"},{"location":"walkthrough/step-8-data-extract/","title":"Build a Datagrid","text":"

You can easily use OpenContracts to create an \"Extract\" - a collection of queries and natural language-specified data points, represented as columns in a grid, that will be asked of every document in the extract (represented as rows). You can define complex extract schemas, including Python primitives, Pydantic models (no nesting - yet), and lists.

"},{"location":"walkthrough/step-8-data-extract/#building-a-datagrid","title":"Building a Datagrid","text":"

To create a data grid, you can start by adding documents or adding data fields. Your choice. If you selected a corpus when defining the extract, the documents from that Corpus will be pre-loaded.

"},{"location":"walkthrough/step-8-data-extract/#to-add-documents","title":"To add documents:","text":""},{"location":"walkthrough/step-8-data-extract/#and-to-add-data-fields","title":"And to add data fields:","text":""},{"location":"walkthrough/step-8-data-extract/#running-an-extract","title":"Running an Extract","text":"

Once you've added all of the documents you want and defined all of the data fields to apply, you can click run to start processing the grid:

Extract speed will depend on your underlying LLM and the number of available Celery workers provisioned for OpenContracts. We haven't optimized for performance yet and hope to do more performance work in a v2 minor release.

"},{"location":"walkthrough/step-8-data-extract/#reviewing-results","title":"Reviewing Results","text":"

Once an extract is complete, you can click on the hamburger menu in a cell to see a dropdown menu. Click the eye to view the sources for that datacell. If you click thumbs up or thumbs down, you can log that you approved or rejected the value in question. Extract value edits are coming soon.

See a quick walkthrough here:

"},{"location":"walkthrough/step-9-corpus-actions/","title":"Corpus Actions","text":""},{"location":"walkthrough/step-9-corpus-actions/#introduction","title":"Introduction","text":"

If you're familiar with GitHub Actions - user-scripted functions that run automatically over a software VCS repository when certain actions take place (like a merge, PR, etc.) - then a CorpusAction should be a familiar concept. You can configure a celery task using our @doc_analyzer_task decorator (see more here on how to write these) and then configure a CorpusAction to run your custom task on all documents added to the target corpus.

"},{"location":"walkthrough/step-9-corpus-actions/#setting-up-a-corpus-action","title":"Setting up a Corpus Action","text":""},{"location":"walkthrough/step-9-corpus-actions/#supported-actions","title":"Supported Actions","text":"

NOTE: Currently, you have to configure all of this via the Django admin dashboard (http://localhost:8000/admin if you're using our local deployment). We'd like to expose this functionality using our React frontend, but the required GUI elements and GraphQL mutations need to be built out. A good starter PR for someone ;-).

Currently, a CorpusAction can be configured to run one of three types of analyzers automatically:

  1. A data extract fieldset - in which case, a data extract will be created and run on new documents added to the configured corpus (see our guide on setting up a data extract job)
  2. An Analyzer
    1. Configured as a \"Gremlin Microservice\". See more information on configuring a microservice-based analyzer here
    2. Configured to run a task decorated using the @doc_analyzer_task decorator. See more about configuring these kinds of tasks here.
"},{"location":"walkthrough/step-9-corpus-actions/#creating-corpus-action","title":"Creating Corpus Action","text":"

From within the Django admin dashboard, click on CorpusActions or the +Add button next to the header:

Once you've opened the create action form, you'll see a number of different options you can configure:

See next section for more details on these configuration options. Once you type in the appropriate configurations and hit \"Save\", the specified Analyzer or Fieldset will be run automatically on the specified Corpus! If you want to learn more about the underlying architecture, check out our deep dive on CorpusActions.

"},{"location":"walkthrough/step-9-corpus-actions/#configuration-options-for-corpus-action","title":"Configuration Options for Corpus Action","text":"

Corpus specifies that an action should run only on a single corpus, specified via dropdown.

Analyzer or Fieldset properties control whether an analysis or data extract runs when the applicable trigger fires (more on this below). If you want to run a data extract when a document is added to the corpus, select the fieldset defining the data you want to extract. If you want to run an analyzer, select the pre-configured analyzer. Remember, an analyzer can point to a microservice or a task decorated with @doc_analyzer_task.

Trigger refers to the specific action type that should kick off the desired analysis. Currently, we \"provide\" add and edit actions - i.e., run specified analytics when a document is added or edited, respectively - but we have not configured the edit action to run.

Disabled is a toggle that will turn off the specified CorpusAction for ALL corpuses.

Run on all corpuses is a toggle that, if True, will run the specified action on EVERY corpus. Be careful with this as it runs for all corpuses for ALL users. Depending on your environment, this could incur a substantial compute cost, and other users may not appreciate it. A nice feature we'd love to add is a more fine-grained set of rule-based access controls to limit actions to certain groups. This would require a substantial investment into the frontend of the application and remains an unlikely addition, though we'd absolutely welcome contributions!

"},{"location":"walkthrough/step-9-corpus-actions/#quick-reference-configuring-doc_analyzer_task-analyzer","title":"Quick Reference - Configuring @doc_analyzer_task + Analyzer","text":"

If you write your own @doc_analyzer_task and want to run it automatically, here's how to set that up, step by step.

  1. First, we assume you put a properly written and decorated task in opencontractserver.tasks.doc_analysis_tasks.py.
  2. Second, you need to create and configure an Analyzer via the Django admin panel. Click on the +Add button next to the Analyzer entry in the admin sidebar and then configure necessary properties:

Place the name of your task in the task_name property - e.g. opencontractserver.tasks.doc_analysis_tasks.contract_not_contract - then add a brief description, assign the creator to the desired user, and click save.

  3. Now, this Analyzer instance can be assigned to a CorpusAction!

"},{"location":"walkthrough/advanced/configure-annotation-view/","title":"Configure How Annotations Are Displayed","text":"

Annotations are composed of tokens (basically text in a line surrounded by whitespace). The tokens have a highlight, and OpenContracts also draws a \"BoundingBox\" around the tokens - the smallest rectangle that can cover all of the tokens in an Annotation.

In the Annotator view, you'll see a purple-colored \"eye\" icon in the top left of the annotation list in the sidebar. Click the icon to bring up a series of configurations for how annotations are displayed:

There are three different settings that can be combined to significantly change how you see the annotations:

  1. Show only selected - You will only see the annotation selected, either by clicking on it in the sidebar or when you clicked into an annotation from the Corpus view. All other annotations will be completely hidden.
  2. Show bounding boxes - If you unselect this, only the tokens will be visible. This is recommended where you have large numbers of overlapping annotations or annotations that are sparse - e.g. a few words scattered throughout a paragraph. In either of these cases, the bounding boxes can cover other bounding boxes, which can be confusing. Where you have too many overlapping bounding boxes, it's easier to hide them and just look at the tokens.
  3. Label Display Behavior - has three options:

    1. Always Show - Always show the label for an annotation when it's displayed (remember, you can choose to only display selected annotations).
    2. Always Hide - Never show the label for an annotation, regardless of its visibility.
    3. Show on Hover - If an annotation is visible, when you hover over it, you'll see the label.
"},{"location":"walkthrough/advanced/data-extract-models/","title":"Why Data Extract?","text":"

An extraction process is pivotal for transforming raw, unstructured data into actionable insights, especially in fields like legal, financial, healthcare, and research. Imagine having thousands of documents, such as contracts, invoices, medical records, or research papers, and needing to quickly locate and analyze specific information like key terms, dates, patient details, or research findings. Automated extraction saves countless hours of manual labor, reduces human error, and enables real-time data analysis. By leveraging an efficient extraction pipeline, businesses and researchers can make informed decisions faster, ensure compliance, enhance operational efficiency, and uncover valuable patterns and trends that might otherwise remain hidden in the data deluge. Simply put, data extraction transforms overwhelming amounts of information into strategic assets, driving innovation and competitive advantage.

"},{"location":"walkthrough/advanced/data-extract-models/#how-we-store-our-data-extracts","title":"How we Store Our Data Extracts","text":"

Ultimately, our application design follows Django best practices for a data-driven application with asynchronous data processing. We use the Django ORM (with capabilities like vector search) to store our data and Celery tasks to orchestrate processing. The extracts/models.py file defines several key models that are used to manage and track the process of extracting data from documents.

These models include:

  1. Fieldset
  2. Column
  3. Extract
  4. Datacell

Each model plays a specific role in the extraction workflow, and together they enable the storage, configuration, and execution of document-based data extraction tasks.

"},{"location":"walkthrough/advanced/data-extract-models/#detailed-explanation-of-each-model","title":"Detailed Explanation of Each Model","text":""},{"location":"walkthrough/advanced/data-extract-models/#1-fieldset","title":"1. Fieldset","text":"

Purpose: The Fieldset model groups related columns together. Each Fieldset represents a specific configuration of data fields that need to be extracted from documents.

class Fieldset(BaseOCModel):\n    name = models.CharField(max_length=256, null=False, blank=False)\n    description = models.TextField(null=False, blank=False)\n
  • name: The name of the fieldset.
  • description: A description of what this fieldset is intended to extract.

Usage: Fieldsets are associated with extracts in the Extract model, defining what data needs to be extracted.

"},{"location":"walkthrough/advanced/data-extract-models/#2-column","title":"2. Column","text":"

Purpose: The Column model defines individual data fields that need to be extracted. Each column specifies what to extract, the criteria for extraction, and the model to use for extraction.

class Column(BaseOCModel):\n    name = models.CharField(max_length=256, null=False, blank=False, default=\"\")\n    fieldset = models.ForeignKey('Fieldset', related_name='columns', on_delete=models.CASCADE)\n    query = models.TextField(null=True, blank=True)\n    match_text = models.TextField(null=True, blank=True)\n    must_contain_text = models.TextField(null=True, blank=True)\n    output_type = models.TextField(null=False, blank=False)\n    limit_to_label = models.CharField(max_length=512, null=True, blank=True)\n    instructions = models.TextField(null=True, blank=True)\n    task_name = models.CharField(max_length=1024, null=False, blank=False)\n    agentic = models.BooleanField(default=False)\n    extract_is_list = models.BooleanField(default=False)\n
  • name: The name of the column.
  • fieldset: ForeignKey linking to the Fieldset model.
  • query: The query used for extraction.
  • match_text: Text that must be matched during extraction.
  • must_contain_text: Text that must be contained in the document for extraction.
  • output_type: The type of data to be extracted.
  • limit_to_label: A label to limit the extraction scope.
  • instructions: Instructions for the extraction process.
  • task_name: The name of the registered Celery extract task used to process this column (lets you define and deploy custom ones).
  • agentic: Boolean indicating if the extraction is agentic.
  • extract_is_list: Boolean indicating if the extraction result is a list.

Usage: Columns are linked to fieldsets and specify detailed criteria for each piece of data to be extracted.

"},{"location":"walkthrough/advanced/data-extract-models/#4-extract","title":"4. Extract","text":"

Purpose: The Extract model represents an extraction job. It contains metadata about the extraction process, such as the documents to be processed, the fieldset to use, and the task type.

class Extract(BaseOCModel):\n    corpus = models.ForeignKey('Corpus', related_name='extracts', on_delete=models.SET_NULL, null=True, blank=True)\n    documents = models.ManyToManyField('Document', related_name='extracts', related_query_name='extract', blank=True)\n    name = models.CharField(max_length=512, null=False, blank=False)\n    fieldset = models.ForeignKey('Fieldset', related_name='extracts', on_delete=models.PROTECT, null=False)\n    created = models.DateTimeField(auto_now_add=True)\n    started = models.DateTimeField(null=True, blank=True)\n    finished = models.DateTimeField(null=True, blank=True)\n    error = models.TextField(null=True, blank=True)\n    doc_query_task = models.CharField(\n        max_length=10,\n        choices=[(tag.name, tag.value) for tag in DocQueryTask],\n        default=DocQueryTask.DEFAULT.name\n    )\n
  • corpus: ForeignKey linking to the Corpus model.
  • documents: ManyToManyField linking to the Document model.
  • name: The name of the extraction job.
  • fieldset: ForeignKey linking to the Fieldset model.
  • created: Timestamp when the extract was created.
  • started: Timestamp when the extract started.
  • finished: Timestamp when the extract finished.
  • error: Text field for storing error messages.
  • doc_query_task: CharField for storing the task type using DocQueryTask enum.

Usage: Extracts group the documents to be processed and the fieldset that defines what data to extract. The doc_query_task field determines which extraction pipeline to use.

"},{"location":"walkthrough/advanced/data-extract-models/#5-datacell","title":"5. Datacell","text":"

Purpose: The Datacell model stores the result of extracting a specific column from a specific document. Each datacell links to an extract, a column, and a document.

class Datacell(BaseOCModel):\n    extract = models.ForeignKey('Extract', related_name='extracted_datacells', on_delete=models.CASCADE)\n    column = models.ForeignKey('Column', related_name='extracted_datacells', on_delete=models.CASCADE)\n    document = models.ForeignKey('Document', related_name='extracted_datacells', on_delete=models.CASCADE)\n    sources = models.ManyToManyField('Annotation', blank=True, related_name='referencing_cells', related_query_name='referencing_cell')\n    data = NullableJSONField(default=jsonfield_default_value, null=True, blank=True)\n    data_definition = models.TextField(null=False, blank=False)\n    started = models.DateTimeField(null=True, blank=True)\n    completed = models.DateTimeField(null=True, blank=True)\n    failed = models.DateTimeField(null=True, blank=True)\n    stacktrace = models.TextField(null=True, blank=True)\n
  • extract: ForeignKey linking to the Extract model.
  • column: ForeignKey linking to the Column model.
  • document: ForeignKey linking to the Document model.
  • sources: ManyToManyField linking to the Annotation model.
  • data: JSON field for storing extracted data.
  • data_definition: Text field describing the data definition.
  • started: Timestamp when the datacell processing started.
  • completed: Timestamp when the datacell processing completed.
  • failed: Timestamp when the datacell processing failed.
  • stacktrace: Text field for storing error stack traces.

Usage: Datacells store the results of extracting specific fields from documents, linking back to the extract and column definitions. They also track the status and any errors during extraction.

"},{"location":"walkthrough/advanced/data-extract-models/#how-these-models-relate-to-data-extraction-tasks","title":"How These Models Relate to Data Extraction Tasks","text":"

  1. Fieldset and Column: Specify what data needs to be extracted and the criteria for extraction. Fieldsets group columns, which detail each piece of data to be extracted. You can register your own LlamaIndex extractors, which you can then select as the extract engine for a given column, allowing you to create very bespoke extraction capabilities.
  2. Extract: Represents an extraction job, grouping documents to be processed with the fieldset defining what data to extract. The doc_query_task field allows dynamic selection of the extraction pipeline.
  3. Datacell: Stores the results of the extraction process for each document and column, tracking the status and any errors encountered.

"},{"location":"walkthrough/advanced/data-extract-models/#extraction-workflow","title":"Extraction Workflow","text":"
  1. Create Extract: An Extract instance is created, specifying the documents to process, the fieldset to use, and the desired extraction task.
  2. Run Extract: The run_extract task uses the doc_query_task field to determine which extraction pipeline to use. It iterates over the documents and columns, creating Datacell instances for each.
  3. Process Datacell: Each Datacell is processed by the selected extraction task (e.g., llama_index_doc_query or custom_llama_index_doc_query). The results are stored in the data field of the Datacell.
  4. Store Results: The extracted data is saved, and the status of each Datacell is updated to reflect completion or failure.

By structuring the models this way, the system is flexible and scalable, allowing for complex data extraction tasks to be defined, executed, and tracked efficiently.
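
As a rough code-level sketch of that workflow - the import paths, the creator/user handling, and the run_extract invocation below are assumptions based on the description above, so check the tasks module for the real entry point:

# Assumes `user` is an existing User and `my_documents` is an iterable of Document instances.\nfrom opencontractserver.extracts.models import Column, Extract, Fieldset  # assumed import path\n\nfieldset = Fieldset.objects.create(\n    name='NDA review',\n    description='Key fields to pull from each NDA',\n    creator=user,\n)\nColumn.objects.create(\n    fieldset=fieldset,\n    name='Effective Date',\n    query='What is the effective date of this agreement?',\n    output_type='str',\n    task_name='opencontractserver.tasks.data_extract_tasks.llama_index_doc_query',  # assumed task path\n    creator=user,\n)\n\nextract = Extract.objects.create(name='Q3 NDA review', fieldset=fieldset, creator=user)\nextract.documents.add(*my_documents)\n\n# Kick off asynchronous processing; one Datacell per (document, column) pair is created.\nfrom opencontractserver.tasks import run_extract  # assumed import path\nrun_extract.delay(extract.id, user.id)  # assumed signature\n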

"},{"location":"walkthrough/advanced/export-import-corpuses/","title":"Export / Import Functionality","text":""},{"location":"walkthrough/advanced/export-import-corpuses/#exports","title":"Exports","text":"

OpenContracts supports both exporting and importing corpuses. This functionality is disabled on the public demo as it can be bandwidth intensive. If you want to experiment with these features on your own, you'll see the export action when you right-click on a corpus:

You can access your exports from the user dropdown menu in the top right corner of the screen. Once your export is complete, you should be able to download a zip containing all the documents, their PAWLs layers, and the corpus data you created - including all annotations.

"},{"location":"walkthrough/advanced/export-import-corpuses/#imports","title":"Imports","text":"

If you've enabled corpus imports (see the frontend env file for the boolean toggle to do this - it's REACT_APP_ALLOW_IMPORTS), you'll see an import action when you click the action button on the corpus page.

"},{"location":"walkthrough/advanced/export-import-corpuses/#export-format","title":"Export Format","text":""},{"location":"walkthrough/advanced/export-import-corpuses/#opencontracts-export-format-specification","title":"OpenContracts Export Format Specification","text":"

The OpenContracts export is a zip archive containing:

  1. A data.json file with metadata about the export
  2. The original PDF documents
  3. Exported annotations \"burned in\" to the PDF documents

"},{"location":"walkthrough/advanced/export-import-corpuses/#datajson-format","title":"data.json Format","text":"

The data.json file contains a JSON object with the following fields:

  • annotated_docs (dict): Maps PDF filenames to OpenContractDocExport objects with annotations for that document.

  • doc_labels (dict): Maps document label names (strings) to AnnotationLabelPythonType objects defining those labels.

  • text_labels (dict): Maps text annotation label names (strings) to AnnotationLabelPythonType objects defining those labels.

  • corpus (OpenContractCorpusType): Metadata about the exported corpus, with fields:

    • id (int): ID of the corpus
    • title (string)
    • description (string)
    • icon_name (string): Filename of the corpus icon image
    • icon_data (string): Base64 encoded icon image data
    • creator (string): Email of the corpus creator
    • label_set (string): ID of the labelset used by this corpus
  • label_set (OpenContractsLabelSetType): Metadata about the label set, with fields:

    • id (int)
    • title (string)
    • description (string)
    • icon_name (string): Filename of the labelset icon
    • icon_data (string): Base64 encoded labelset icon data
    • creator (string): Email of the labelset creator
"},{"location":"walkthrough/advanced/export-import-corpuses/#opencontractdocexport-format","title":"OpenContractDocExport Format","text":"

Each document in annotated_docs is represented by an OpenContractDocExport object with fields:

  • doc_labels (list[string]): List of document label names applied to this doc
  • labelled_text (list[OpenContractsAnnotationPythonType]): List of text annotations
  • title (string): Document title
  • content (string): Full text content of the document
  • description (string): Description of the document
  • pawls_file_content (list[PawlsPagePythonType]): PAWLS parse data for each page
  • page_count (int): Number of pages in the document
"},{"location":"walkthrough/advanced/export-import-corpuses/#opencontractsannotationpythontype-format","title":"OpenContractsAnnotationPythonType Format","text":"

Represents an individual text annotation, with fields:

  • id (string): Optional ID
  • annotationLabel (string): Name of the label for this annotation
  • rawText (string): Raw text content of the annotation
  • page (int): 0-based page number the annotation is on
  • annotation_json (dict): Maps page numbers to OpenContractsSinglePageAnnotationType
"},{"location":"walkthrough/advanced/export-import-corpuses/#opencontractssinglepageannotationtype-format","title":"OpenContractsSinglePageAnnotationType Format","text":"

Represents the annotation data for a single page:

  • bounds (BoundingBoxPythonType): Bounding box of the annotation on the page
  • tokensJsons (list[TokenIdPythonType]): List of PAWLS tokens covered by the annotation
  • rawText (string): Raw text of the annotation on this page
"},{"location":"walkthrough/advanced/export-import-corpuses/#boundingboxpythontype-format","title":"BoundingBoxPythonType Format","text":"

Represents a bounding box with fields:

  • top (int)
  • bottom (int)
  • left (int)
  • right (int)
"},{"location":"walkthrough/advanced/export-import-corpuses/#tokenidpythontype-format","title":"TokenIdPythonType Format","text":"

References a PAWLS token by page and token index:

  • pageIndex (int)
  • tokenIndex (int)
"},{"location":"walkthrough/advanced/export-import-corpuses/#pawlspagepythontype-format","title":"PawlsPagePythonType Format","text":"

Represents PAWLS parse data for a single page:

  • page (PawlsPageBoundaryPythonType): Page boundary info
  • tokens (list[PawlsTokenPythonType]): List of PAWLS tokens on the page
"},{"location":"walkthrough/advanced/export-import-corpuses/#pawlspageboundarypythontype-format","title":"PawlsPageBoundaryPythonType Format","text":"

Represents the page boundary with fields:

  • width (float)
  • height (float)
  • index (int): Page index
"},{"location":"walkthrough/advanced/export-import-corpuses/#pawlstokenpythontype-format","title":"PawlsTokenPythonType Format","text":"

Represents a single PAWLS token with fields:

  • x (float): X-coordinate of token box
  • y (float): Y-coordinate of token box
  • width (float): Width of token box
  • height (float): Height of token box
  • text (string): Text content of the token
"},{"location":"walkthrough/advanced/export-import-corpuses/#annotationlabelpythontype-format","title":"AnnotationLabelPythonType Format","text":"

Defines an annotation label with fields:

  • id (string)
  • color (string): Hex color for the label
  • description (string)
  • icon (string): Icon name
  • text (string): Label text
  • label_type (LabelType): One of DOC_TYPE_LABEL, TOKEN_LABEL, RELATIONSHIP_LABEL, METADATA_LABEL
"},{"location":"walkthrough/advanced/export-import-corpuses/#example-datajson","title":"Example data.json","text":"
{\n  \"annotated_docs\": {\n    \"document1.pdf\": {\n      \"doc_labels\": [\"Contract\", \"NDA\"],\n      \"labelled_text\": [\n        {\n          \"id\": \"1\",\n          \"annotationLabel\": \"Effective Date\",\n          \"rawText\": \"This agreement is effective as of January 1, 2023\",\n          \"page\": 0,\n          \"annotation_json\": {\n            \"0\": {\n              \"bounds\": {\n                \"top\": 100,\n                \"bottom\": 120,\n                \"left\": 50,\n                \"right\": 500\n              },\n              \"tokensJsons\": [\n                {\n                  \"pageIndex\": 0,\n                  \"tokenIndex\": 5\n                },\n                {\n                  \"pageIndex\": 0,\n                  \"tokenIndex\": 6\n                }\n              ],\n              \"rawText\": \"January 1, 2023\"\n            }\n          }\n        }\n      ],\n      \"title\": \"Nondisclosure Agreement\",\n      \"content\": \"This Nondisclosure Agreement is made...\",\n      \"description\": \"Standard mutual NDA\",\n      \"pawls_file_content\": [\n        {\n          \"page\": {\n            \"width\": 612,\n            \"height\": 792,\n            \"index\": 0\n          },\n          \"tokens\": [\n            {\n              \"x\": 50,\n              \"y\": 100,\n              \"width\": 60,\n              \"height\": 10,\n              \"text\": \"This\"\n            },\n            {\n              \"x\": 120,\n              \"y\": 100,\n              \"width\": 100,\n              \"height\": 10,\n              \"text\": \"agreement\"\n            }\n          ]\n        }\n      ],\n      \"page_count\": 5\n    }\n  },\n  \"doc_labels\": {\n    \"Contract\": {\n      \"id\": \"1\",\n      \"color\": \"#FF0000\",\n      \"description\": \"Indicates a legal contract\",\n      \"icon\": \"contract\",\n      \"text\": \"Contract\",\n      \"label_type\": \"DOC_TYPE_LABEL\"\n    },\n    \"NDA\": {\n      \"id\": \"2\",\n      \"color\": \"#00FF00\",\n      \"description\": \"Indicates a non-disclosure agreement\",\n      \"icon\": \"nda\",\n      \"text\": \"NDA\",\n      \"label_type\": \"DOC_TYPE_LABEL\"\n    }\n  },\n  \"text_labels\": {\n    \"Effective Date\": {\n      \"id\": \"3\",\n      \"color\": \"#0000FF\",\n      \"description\": \"The effective date of the agreement\",\n      \"icon\": \"calendar\",\n      \"text\": \"Effective Date\",\n      \"label_type\": \"TOKEN_LABEL\"\n    }\n  },\n  \"corpus\": {\n    \"id\": 1,\n    \"title\": \"Example Corpus\",\n    \"description\": \"A sample corpus for demonstration\",\n    \"icon_name\": \"corpus_icon.png\",\n    \"icon_data\": \"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAACklEQVR4nGMAAQAABQABDQottAAAAABJRU5ErkJggg==\",\n    \"creator\": \"user@example.com\",\n    \"label_set\": \"4\"\n  },\n  \"label_set\": {\n    \"id\": \"4\",\n    \"title\": \"Example Label Set\",\n    \"description\": \"A sample label set\",\n    \"icon_name\": \"label_icon.png\",\n    \"icon_data\": \"iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAACklEQVR4nGMAAQAABQABDQottAAAAABJRU5ErkJggg==\",\n    \"creator\":  \"user@example.com\"\n  }\n}\n

This data.json file includes:

  • One annotated document (document1.pdf) with two document labels (\"Contract\" and \"NDA\") and one text annotation for the \"Effective Date\"
  • Definitions for the two document labels (\"Contract\" and \"NDA\") and one text label (\"Effective Date\")
  • Metadata about the exported corpus and labelset, including Base64 encoded icon data

The PAWLS token data and text content are truncated for brevity. In a real export, the pawls_file_content would include the complete token data for each page, and content would contain the full extracted text of the document.


"},{"location":"walkthrough/advanced/fork-a-corpus/","title":"Fork a Corpus","text":""},{"location":"walkthrough/advanced/fork-a-corpus/#to-fork-or-not-to-fork","title":"To Fork or Not to Fork?","text":"

One of the amazing things about Open Source collaboration is you can stand on the shoulders of giants - we can share techniques and data and collectively achieve what we could never do alone. OpenContracts is designed to make it super easy to share and re-use annotation data.

In OpenContracts, we introduce the concept of \"forking\" a corpus - basically creating a copy of a public or private corpus, complete with its documents and annotations, which you can edit and tweak as needed. This opens up some interesting possibilities. For example, you might have a base corpus with annotations common to many types of AI models or annotation projects, which you can fork as needed and layer task- or domain-specific annotations on top of.

"},{"location":"walkthrough/advanced/fork-a-corpus/#fork-a-corpus","title":"Fork a Corpus","text":"

Forking a corpus is easy.

  1. Again, right-click on a corpus to bring up the context menu. You'll see an entry to \"Fork Corpus\":
  2. Click on it to start a fork. You should see a confirmation in the top right of the screen:
  3. Once the fork is complete, the next time you go to your Corpus page, you'll see a new Corpus with a Fork icon in the icon bar at the bottom. If you hover over it, you'll be able to see a summary of the corpus it was forked from. This is tracked in the database, so, long-term, we'd like to have corpus version control similar to how git works:
"},{"location":"walkthrough/advanced/generate-graphql-schema-files/","title":"Generate GraphQL Schema Files","text":""},{"location":"walkthrough/advanced/generate-graphql-schema-files/#generating-graphql-schema-files","title":"Generating GraphQL Schema Files","text":"

Open Contracts uses Graphene to provide a rich GraphQL endpoint, complete with the GraphiQL query application. For some applications, you may want to generate a GraphQL schema file in SDL or JSON. One example use case is if you're developing a frontend you want to connect to OpenContracts and you'd like to autogenerate TypeScript types from the GraphQL schema.

To generate a GraphQL schema file, run your choice of the following commands.

For an SDL file:

$ docker-compose -f local.yml run django python manage.py graphql_schema --schema config.graphql.schema.schema --out schema.graphql\n

For a JSON file:

$ docker-compose -f local.yml run django python manage.py graphql_schema --schema config.graphql.schema.schema --out schema.json\n

You can convert these to TypeScript for use in a frontend (though you'll find this has already been done for the React-based OpenContracts frontend) using a tool like this.

"},{"location":"walkthrough/advanced/pawls-token-format/","title":"Understanding Document Ground Truth in OpenContracts","text":"

OpenContracts utilizes the PAWLs format for representing documents and their annotations. PAWLs was designed by AllenAI to provide a consistent and structured way to store text and layout information for complex documents like contracts, scientific papers, and newspapers.

AllenAI has largely stopped maintaining PAWLs, and our implementation has evolved into something very different from its namesake, but we've kept the name (and contributed a few PRs back to the PAWLs project).

"},{"location":"walkthrough/advanced/pawls-token-format/#standardized-pdf-data-layers","title":"Standardized PDF Data Layers","text":"

In OpenContracts, every document is processed through a pipeline that extracts and structures text and layout information into the following artifacts:

  1. Original PDF: The original PDF document.
  2. PAWLs Layer (JSON): A JSON file containing the text and positional data for each token (word) in the document.
  3. Text Layer: A text file containing the full text extracted from the document.
  4. Structural Annotations: Thanks to nlm-ingestor, we now use Nlmatics' parser to generate the PAWLs layer and turn the layout blocks - like header, paragraph, table, etc. - into Open Contracts Annotation objects that represent the visual blocks for each PDF. Upon creation, we create embeddings for each Annotation, which are stored in Postgres via pgvector.

The PAWLs layer serves as the source of truth for the document, allowing seamless translation between text and positional information.
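
To make that concrete, here's a small sketch - using the data shapes documented in the next sections and a hypothetical filename - showing how an annotation's token references can be translated back into plain text:

import json\n\n\ndef annotation_text(pawls_pages: list[dict], tokens_jsons: list[dict]) -> str:\n    # Each token reference carries a pageIndex and tokenIndex into the PAWLs layer.\n    words = []\n    for ref in tokens_jsons:\n        token = pawls_pages[ref['pageIndex']]['tokens'][ref['tokenIndex']]\n        words.append(token['text'])\n    return ' '.join(words)\n\n\nwith open('document.pawls.json') as f:  # hypothetical filename\n    pawls_pages = json.load(f)\n\nprint(annotation_text(pawls_pages, [{'pageIndex': 0, 'tokenIndex': 5}, {'pageIndex': 0, 'tokenIndex': 6}]))\n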

"},{"location":"walkthrough/advanced/pawls-token-format/#visualizing-how-pdfs-are-converted-to-data-annotations","title":"Visualizing How PDFs are Converted to Data & Annotations","text":"

Here's a rough diagram showing how a series of tokens - Lorem, ipsum, dolor, sit and amet - are mapped from a PDF to our various data types.

"},{"location":"walkthrough/advanced/pawls-token-format/#pawls-processing-pipeline","title":"PAWLs Processing Pipeline","text":"

The PAWLs processing pipeline involves the following steps:

  1. Token Extraction: The OCRed document is processed using the parsing engine of Grobid to extract \"tokens\" (text surrounded by whitespace, typically a word) along with their page and positional information.
  2. PAWLs Layer Generation: The extracted tokens and their positional data are stored as a JSON file, referred to as the \"PAWLs layer.\"
  3. Text Layer Generation: The full text is extracted from the PAWLs layer and stored as a separate text file, called the \"text layer.\"
"},{"location":"walkthrough/advanced/pawls-token-format/#pawls-layer-structure","title":"PAWLs Layer Structure","text":"

The PAWLs layer JSON file consists of a list of page objects, each containing the necessary tokens and page information for a given page. Here's the data shape for each page object:

class PawlsPagePythonType(TypedDict):\n    page: PawlsPageBoundaryPythonType\n    tokens: list[PawlsTokenPythonType]\n

The PawlsPageBoundaryPythonType represents the page boundary information:

class PawlsPageBoundaryPythonType(TypedDict):\n    width: float\n    height: float\n    index: int\n

Each token in the tokens list is represented by the PawlsTokenPythonType:

class PawlsTokenPythonType(TypedDict):\n    x: float\n    y: float\n    width: float\n    height: float\n    text: str\n

The x, y, width, and height fields provide the positional information for each token on the page.

"},{"location":"walkthrough/advanced/pawls-token-format/#annotation-process","title":"Annotation Process","text":"

OpenContracts allows users to annotate documents using the PAWLs layer. Annotations are stored as a dictionary mapping page numbers to annotation data:

Dict[int, OpenContractsSinglePageAnnotationType]\n

The OpenContractsSinglePageAnnotationType represents the annotation data for a single page:

class OpenContractsSinglePageAnnotationType(TypedDict):\n    bounds: BoundingBoxPythonType\n    tokensJsons: list[TokenIdPythonType]\n    rawText: str\n

The bounds field represents the bounding box of the annotation, while tokensJsons contains a list of token IDs that make up the annotation. The rawText field stores the raw text of the annotation.

"},{"location":"walkthrough/advanced/pawls-token-format/#advantages-of-pawls","title":"Advantages of PAWLs","text":"

The PAWLs format offers several advantages for document annotation and NLP tasks:

  1. Consistent Structure: PAWLs provides a consistent and structured representation of documents, regardless of the original file format or structure.
  2. Layout Awareness: By storing positional information for each token, PAWLs enables layout-aware text analysis and annotation.
  3. Seamless Integration: The PAWLs layer allows easy integration with various NLP libraries and tools, whether they are layout-aware or not.
  4. Reproducibility: The re-OCR process ensures consistent output across different documents and software versions.
"},{"location":"walkthrough/advanced/pawls-token-format/#conclusion","title":"Conclusion","text":"

The PAWLs format in OpenContracts provides a powerful and flexible way to represent and annotate complex documents. By extracting and structuring text and layout information, PAWLs enables efficient and accurate document analysis and annotation tasks. The consistent structure and layout awareness of PAWLs make it an essential component of the OpenContracts project.

"},{"location":"walkthrough/advanced/pawls-token-format/#example-pawls-file","title":"Example PAWLs File","text":"

Here's an example of what a PAWLs layer JSON file might look like:

[\n  {\n    \"page\": {\n      \"width\": 612.0,\n      \"height\": 792.0,\n      \"index\": 0\n    },\n    \"tokens\": [\n      {\n        \"x\": 72.0,\n        \"y\": 720.0,\n        \"width\": 41.0,\n        \"height\": 12.0,\n        \"text\": \"Lorem\"\n      },\n      {\n        \"x\": 113.0,\n        \"y\": 720.0,\n        \"width\": 35.0,\n        \"height\": 12.0,\n        \"text\": \"ipsum\"\n      },\n      {\n        \"x\": 148.0,\n        \"y\": 720.0,\n        \"width\": 31.0,\n        \"height\": 12.0,\n        \"text\": \"dolor\"\n      },\n      {\n        \"x\": 179.0,\n        \"y\": 720.0,\n        \"width\": 18.0,\n        \"height\": 12.0,\n        \"text\": \"sit\"\n      },\n      {\n        \"x\": 197.0,\n        \"y\": 720.0,\n        \"width\": 32.0,\n        \"height\": 12.0,\n        \"text\": \"amet,\"\n      },\n      {\n        \"x\": 72.0,\n        \"y\": 708.0,\n        \"width\": 66.0,\n        \"height\": 12.0,\n        \"text\": \"consectetur\"\n      },\n      {\n        \"x\": 138.0,\n        \"y\": 708.0,\n        \"width\": 60.0,\n        \"height\": 12.0,\n        \"text\": \"adipiscing\"\n      },\n      {\n        \"x\": 198.0,\n        \"y\": 708.0,\n        \"width\": 24.0,\n        \"height\": 12.0,\n        \"text\": \"elit.\"\n      }\n    ]\n  },\n  {\n    \"page\": {\n      \"width\": 612.0,\n      \"height\": 792.0,\n      \"index\": 1\n    },\n    \"tokens\": [\n      {\n        \"x\": 72.0,\n        \"y\": 756.0,\n        \"width\": 46.0,\n        \"height\": 12.0,\n        \"text\": \"Integer\"\n      },\n      {\n        \"x\": 118.0,\n        \"y\": 756.0,\n        \"width\": 35.0,\n        \"height\": 12.0,\n        \"text\": \"vitae\"\n      },\n      {\n        \"x\": 153.0,\n        \"y\": 756.0,\n        \"width\": 39.0,\n        \"height\": 12.0,\n        \"text\": \"augue\"\n      },\n      {\n        \"x\": 192.0,\n        \"y\": 756.0,\n        \"width\": 45.0,\n        \"height\": 12.0,\n        \"text\": \"rhoncus\"\n      },\n      {\n        \"x\": 237.0,\n        \"y\": 756.0,\n        \"width\": 57.0,\n        \"height\": 12.0,\n        \"text\": \"fermentum\"\n      },\n      {\n        \"x\": 294.0,\n        \"y\": 756.0,\n        \"width\": 13.0,\n        \"height\": 12.0,\n        \"text\": \"at\"\n      },\n      {\n        \"x\": 307.0,\n        \"y\": 756.0,\n        \"width\": 29.0,\n        \"height\": 12.0,\n        \"text\": \"quis.\"\n      }\n    ]\n  }\n]\n

In this example, the PAWLs layer JSON file contains an array of two page objects. Each page object has a page field with the page dimensions and index, and a tokens field with an array of token objects.

Each token object represents a word or a piece of text on the page, along with its positional information. The x and y fields indicate the coordinates of the token's bounding box, while width and height specify the dimensions of the bounding box. The text field contains the actual text content of the token.

The tokens are ordered based on their appearance on the page, allowing for the reconstruction of the document's text content while preserving the layout information.

This sample demonstrates the structure and content of a PAWLs layer JSON file, which serves as the foundation for annotation and analysis tasks in the OpenContracts project.
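To make the reconstruction point concrete, the text layer for a file like the one above could be rebuilt with a few lines of Python (a minimal sketch; the real pipeline is more careful about ordering and spacing, and the file name here is hypothetical):

import json\n\nwith open(\"example.pawls.json\") as f:  # hypothetical file name\n    pages = json.load(f)\n\ntext_layer = \"\\n\\n\".join(\n    \" \".join(token[\"text\"] for token in page[\"tokens\"]) for page in pages\n)\nprint(text_layer)  # \"Lorem ipsum dolor sit amet, consectetur adipiscing elit. ...\"\n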

"},{"location":"walkthrough/advanced/register-doc-analyzer/","title":"Detailed Overview of @doc_analyzer_task Decorator","text":"

The @doc_analyzer_task decorator is an integral part of the OpenContracts CorpusAction system, which automates document processing when new documents are added to a corpus. As a refresher, within the CorpusAction system, users have three options for registering actions to run automatically on new documents:

  1. Custom data extractors
  2. Analyzer microservices
  3. Celery tasks decorated with @doc_analyzer_task

The @doc_analyzer_task decorator is specifically designed for the third option, providing a straightforward way to write and deploy simple, span-based analytics directly within the OpenContracts ecosystem.

"},{"location":"walkthrough/advanced/register-doc-analyzer/#when-to-use-doc_analyzer_task","title":"When to Use @doc_analyzer_task","text":"

The @doc_analyzer_task decorator is ideal for scenarios where:

  1. You're performing tests or analyses solely based on document text or PAWLs tokens.
  2. Your analyzer doesn't require conflicting dependencies or non-Python code bases.
  3. You want a quick and easy way to integrate custom analysis into the OpenContracts workflow.

For more complex scenarios, such as those requiring specific environments, non-Python components, or heavy computational resources, creating an analyzer microservice would be recommended.

"},{"location":"walkthrough/advanced/register-doc-analyzer/#advantages-of-doc_analyzer_task","title":"Advantages of @doc_analyzer_task","text":"

Using the @doc_analyzer_task decorator offers several benefits:

  1. Simplicity: It abstracts away much of the complexity of interacting with the OpenContracts system.
  2. Integration: Tasks are automatically integrated into the CorpusAction workflow.
  3. Consistency: It ensures that your analysis task produces outputs in a format that OpenContracts can readily use.
  4. Error Handling: It provides built-in error handling and retry mechanisms.

By using this decorator, you can focus on writing the core analysis logic while the OpenContracts system handles the intricacies of document processing, annotation creation, and result storage.

In the following sections, we'll dive deep into how to structure functions decorated with @doc_analyzer_task, what data they receive, and how their outputs are processed by the OpenContracts system.

"},{"location":"walkthrough/advanced/register-doc-analyzer/#function-signature","title":"Function Signature","text":"

Functions decorated with @doc_analyzer_task should have the following signature:

@doc_analyzer_task()\ndef your_analyzer_function(*args, pdf_text_extract=None, pdf_pawls_extract=None, **kwargs):\n    # Function body\n    pass\n
"},{"location":"walkthrough/advanced/register-doc-analyzer/#parameters","title":"Parameters:","text":"
  1. *args: Allows the function to accept any positional arguments.
  2. pdf_text_extract: Optional parameter that will contain the extracted text from the PDF.
  3. pdf_pawls_extract: Optional parameter that will contain the PAWLS (PDF Annotation With Labels and Structure) data from the PDF.
  4. **kwargs: Allows the function to accept any keyword arguments.

The resulting task also expects several kwargs which, while not forwarded to your decorated function, are used to load the data that gets injected into it (an example invocation follows the list below):

  • doc_id: The ID of the document being analyzed.
  • corpus_id: The ID of the corpus containing the document (if applicable).
  • analysis_id: The ID of the analysis being performed.
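For example, queuing the task manually might look something like the following (the IDs are hypothetical; in normal operation the CorpusAction system supplies them for you):

your_analyzer_function.delay(\n    doc_id=123,\n    corpus_id=456,\n    analysis_id=789,\n)\n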
"},{"location":"walkthrough/advanced/register-doc-analyzer/#injected-data","title":"Injected Data","text":"

The decorator provides the following data to your decorated function as kwargs:

  1. PDF Text Extract: The full text content of the PDF document, accessible via the pdf_text_extract parameter.
  2. PAWLS Extract: A structured representation of the document's layout and content, accessible via the pdf_pawls_extract parameter. This typically includes information about pages, tokens, and their positions.
"},{"location":"walkthrough/advanced/register-doc-analyzer/#required-outputs","title":"Required Outputs","text":"

The @doc_analyzer_task decorator in OpenContracts expects the decorated function's return value to match a specific output structure: a four-element tuple, with each element (detailed below) having a specific schema.

return doc_labels, span_labels, metadata, task_pass\n

Failure to adhere to this in your function will throw an error. This structure is designed to map directly to the data models used in the OpenContracts system.

Let's break down each component of the required output and explain how it's used.

"},{"location":"walkthrough/advanced/register-doc-analyzer/#1-document-labels-doc_labels","title":"1. Document Labels (doc_labels)","text":"

Document labels should be a list of strings representing the labels you want to apply to the entire document.

doc_labels = [\"IMPORTANT_DOCUMENT\", \"FINANCIAL_REPORT\"]\n

Purpose: These labels are applied to the entire document.

Relationship to OpenContracts Models:

  • Each string in this list corresponds to an AnnotationLabel object with label_type = DOC_TYPE_LABEL.
  • For each label, an Annotation object is created with:
    • document: Set to the current document
    • annotation_label: The corresponding AnnotationLabel object
    • analysis: The current Analysis object
    • corpus: The corpus of the document (if applicable)

Example in OpenContracts:

for label_text in doc_labels:\n    label = AnnotationLabel.objects.get(text=label_text, label_type=\"DOC_TYPE_LABEL\")\n    Annotation.objects.create(\n        document=document,\n        annotation_label=label,\n        analysis=analysis,\n        corpus=document.corpus\n    )\n
"},{"location":"walkthrough/advanced/register-doc-analyzer/#2-span-labels-span_labels","title":"2. Span Labels (span_labels)","text":"

These describe token- or span-level features you want to apply an annotation to.

span_labels = [\n    (TextSpan(id=\"1\", start=0, end=10, text=\"First ten\"), \"HEADER\"),\n    (TextSpan(id=\"2\", start=50, end=60, text=\"Next span\"), \"IMPORTANT_CLAUSE\")\n]\n

Purpose: These labels are applied to specific spans of text within the document.

Relationship to OpenContracts Models:

  • Each tuple in this list creates an Annotation object.
  • The TextSpan contains the position and content of the annotated text.
  • The label string corresponds to an AnnotationLabel object with label_type = TOKEN_LABEL.

Example in OpenContracts:

for span, label_text in span_labels:\n    label = AnnotationLabel.objects.get(text=label_text, label_type=\"TOKEN_LABEL\")\n    Annotation.objects.create(\n        document=document,\n        annotation_label=label,\n        analysis=analysis,\n        corpus=document.corpus,\n        page=calculate_page_from_span(span),\n        raw_text=span.text,\n        json={\n            \"1\": {\n                \"bounds\": calculate_bounds(span),\n                \"tokensJsons\": calculate_tokens(span),\n                \"rawText\": span.text\n            }\n        }\n    )\n
"},{"location":"walkthrough/advanced/register-doc-analyzer/#3-metadata","title":"3. Metadata","text":"

This element contains DataCell values we want to associate with the resulting Analysis.

metadata = [{\"data\": {\"processed_date\": \"2023-06-15\", \"confidence_score\": 0.95}}]\n

Purpose: This provides additional context or information about the analysis.

Relationship to OpenContracts Models:

  • The metadata list is stored on the resulting Analysis object (see the example below).

Example in OpenContracts:

analysis.metadata = metadata\nanalysis.save()\n
"},{"location":"walkthrough/advanced/register-doc-analyzer/#4-task-pass-task_pass","title":"4. Task Pass (task_pass)","text":"

This can be used to signal whether some test or check succeeded or failed, which is useful for automated testing.

task_pass = True\n

Purpose: Indicates whether the analysis task completed successfully.

Relationship to OpenContracts Models:

  • This boolean value is used to update the status of the Analysis object.
  • It can trigger further actions or notifications in the OpenContracts system.

Example in OpenContracts:

if task_pass:\n    analysis.status = \"COMPLETED\"\nelse:\n    analysis.status = \"FAILED\"\nanalysis.save()\n
"},{"location":"walkthrough/advanced/register-doc-analyzer/#how-the-decorator-processes-the-output","title":"How the Decorator Processes the Output","text":"
  1. Validation: The decorator first checks that the return value is a tuple of length 4 and that each element has the correct type (a rough sketch of this check appears after this list).

  2. Document Label Processing: For each document label, it creates an Annotation object linked to the document, analysis, and corpus.

  3. Span Label Processing: For each span label, it creates an Annotation object with detailed information about the text span, including its position and content.

  4. Metadata Handling: The metadata is stored, typically with the Analysis object, for future reference.

  5. Task Status Update: Based on the task_pass value, the status of the analysis is updated.

  6. Error Handling: If any part of this process fails, the decorator handles the error, potentially marking the task as failed and logging the error.
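As a rough sketch of the shape check described in step 1 (illustrative only - this is not the decorator's actual code):

def _validate_output(result):\n    # Unpacking raises if the return value is not a 4-tuple\n    doc_labels, span_labels, metadata, task_pass = result\n    assert isinstance(doc_labels, list)\n    assert isinstance(span_labels, list)\n    assert isinstance(metadata, list)\n    assert isinstance(task_pass, bool)\n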

"},{"location":"walkthrough/advanced/register-doc-analyzer/#benefits-of-this-structure","title":"Benefits of This Structure","text":"
  1. Consistency: By enforcing a specific output structure, the system ensures that all document analysis tasks provide consistent data.

  2. Separation of Concerns: The analysis logic (in the decorated function) is separated from the database operations (handled by the decorator).

  3. Flexibility: The structure allows for both document-level and span-level annotations, accommodating various types of analysis.

  4. Traceability: By linking annotations to specific analyses and including metadata, the system maintains a clear record of how and when annotations were created.

  5. Error Management: The task_pass boolean allows for clear indication of task success or failure, which can trigger appropriate follow-up actions in the system.

By structuring the output this way, the @doc_analyzer_task decorator seamlessly integrates custom analysis logic into the broader OpenContracts data model, ensuring that the results of document analysis are properly stored, linked, and traceable within the system.

"},{"location":"walkthrough/advanced/register-doc-analyzer/#example-implementation","title":"Example Implementation","text":"

Here's an example of how a function decorated with @doc_analyzer_task might look:

from opencontractserver.shared.decorators import doc_analyzer_task\nfrom opencontractserver.types.dicts import TextSpan\n\n\n@doc_analyzer_task()\ndef example_analyzer(*args, pdf_text_extract=None, pdf_pawls_extract=None, **kwargs):\n    doc_id = kwargs.get('doc_id')\n\n    # Your analysis logic here\n    # For example, let's say we're identifying a document type and important clauses\n\n    doc_type = identify_document_type(pdf_text_extract)\n    important_clauses = find_important_clauses(pdf_text_extract)\n\n    doc_labels = [doc_type]\n    span_labels = [\n        (TextSpan(id=str(i), start=clause.start, end=clause.end, text=clause.text), \"IMPORTANT_CLAUSE\")\n        for i, clause in enumerate(important_clauses)\n    ]\n    metadata = [{\"data\": {\"analysis_version\": \"1.0\", \"clauses_found\": len(important_clauses)}}]\n    task_pass = True\n\n    return doc_labels, span_labels, metadata, task_pass\n

In this example, the function uses the injected pdf_text_extract to perform its analysis. It identifies the document type and finds important clauses, then structures this information into the required output format.

By using the @doc_analyzer_task decorator, this function is automatically integrated into the OpenContracts system, handling document locking, error management, and annotation creation without requiring explicit code for these operations in the function body.

"},{"location":"walkthrough/advanced/run-gremlin-analyzer/","title":"Run a Gremlin Analyzer","text":""},{"location":"walkthrough/advanced/run-gremlin-analyzer/#introduction-to-gremlin-integration","title":"Introduction to Gremlin Integration","text":"

OpenContracts integrates with a powerful NLP engine called Gremlin Engine (\"Gremlin\"). If you run a Gremlin analyzer on a Corpus, it will create annotations of its own that you can view and export (e.g. automatically applying document labels or labeling parties, dates, and places, etc.). It's meant to provide a consistent API to deliver and render NLP and machine learning capabilities to end-users. As discussed in the configuration section, you need to install Gremlin Analyzers through the admin dashboard.

Once you've installed Gremlin Analyzers, however, it's easy to apply them.

"},{"location":"walkthrough/advanced/run-gremlin-analyzer/#using-an-installed-gremlin-analyzer","title":"Using an Installed Gremlin Analyzer","text":"
  1. If analysis capabilities are enabled for your instance, when you right-click on a Corpus, you'll see an option to \"Analyze Corpus\":

  2. Clicking on this item will bring up a dialog where you can browse available analyzers:

  3. Select one and hit \"Analyze\" to submit a corpus for processing. When you go to the Analysis tab of your Corpus now, you'll see the analysis. Most likely, if you just clicked there, it will say processing:

  4. When the Analysis is complete, you'll see a summary of the number of labels and annotations applied by the analyzer:

"},{"location":"walkthrough/advanced/run-gremlin-analyzer/#note-on-processing-time","title":"Note on Processing Time","text":"

Large Corpuses of hundreds of documents can take a long time to process (10 minutes or more). It's hard to predict processing time up front, because it's dependent on the number of total pages and the specific analysis being performed. At the moment, there is not a great mechanism in place to detect and handle failures in a Gremlin analyzer and reflect this in OpenContracts. It's on our roadmap to improve this integration. In the meantime, the example analyzers we've released with Gremlin should be very stable, so they should run predictably.

"},{"location":"walkthrough/advanced/run-gremlin-analyzer/#viewing-the-outputs","title":"Viewing the Outputs","text":"

Once an Analysis completes, you'll be able to browse the annotations from the analysis in several ways.

  1. First, they'll be available in the \"Annotation\" tab, and you can easily filter to annotations from a specific analyzer.
  2. Second, when you load a Document, in the Annotator view, there's a small widget in the top of the annotator that has three downwards-facing arrows and says \"Human Annotation Mode\".
  3. Click on the arrows to open a tray showing the analyses applied to this document.
  4. Click on an analysis to load the annotations and view them in the document.

Note: You can delete an analysis, but you cannot edit it. The annotations are machine-created and cannot be edited by human users.

"},{"location":"walkthrough/advanced/testing-llama-index-calls/","title":"Testing Complex LLM Applications","text":"

I've built a number of full-stack, LLM-powered applications at this point. A persistent challenge is testing the underlying LLM query pipelines in a deterministic and isolated way.

A colleague and I eventually hit on a way to make testing complex LLM behavior deterministic and decoupled from upstream LLM API providers. This tutorial walks you through the problem and solution to this testing issue.

In this guide, you'll learn:

  1. Why testing LLM applications is particularly challenging
  2. How to overcome common testing obstacles like API dependencies and resource limitations
  3. An innovative approach using VCR.py to record and replay LLM interactions
  4. How to implement this solution with popular frameworks like LlamaIndex and Django
  5. Potential pitfalls to watch out for when using this method

Whether you're working with RAG models, multi-hop reasoning loops, or other complex LLM architectures, this tutorial will show you how to create fast, deterministic, and accurate tests without relying on expensive resources or compromising the integrity of your test suite.

By the end of this guide, you'll have a powerful new tool in your AI development toolkit, enabling you to build more robust and reliable LLM-powered applications. Let's dive in!

"},{"location":"walkthrough/advanced/testing-llama-index-calls/#problem","title":"Problem","text":"

To understand why testing complex LLM-powered applications is challenging, let's break down the components and processes involved in a typical RAG (Retrieval-Augmented Generation) application using a framework like LlamaIndex:

  1. Data Ingestion: Your application likely starts by ingesting large amounts of data from various sources (documents, databases, APIs, etc.).

  2. Indexing: This data is then processed and indexed, often using vector embeddings, to allow for efficient retrieval.

  3. Query Processing: When a user submits a query, your application needs to: a) Convert the query into a suitable format (often involving embedding the query) b) Search the index to retrieve relevant information c) Format the retrieved information for use by the LLM

  4. LLM Interaction: The processed query and retrieved information are sent to an LLM (like GPT-4) for generating a response.

  5. Post-processing: The LLM's response might need further processing or validation before being returned to the user.

Now, consider the challenges in testing such a system:

  1. External Dependencies: Many of these steps rely on external APIs or services. The indexing and query embedding often use one model (e.g., OpenAI's embeddings API), while the final response generation uses another (e.g., GPT-4). Traditional testing approaches would require mocking or stubbing these services, which can be complex and may not accurately represent real-world behavior.

  2. Resource Intensity: Running a full RAG pipeline for each test can be extremely resource-intensive and time-consuming. It might involve processing large amounts of data and making multiple API calls to expensive LLM services.

  3. Determinism: LLMs can produce slightly different outputs for the same input, making it difficult to write deterministic tests. This variability can lead to flaky tests that sometimes pass and sometimes fail.

  4. Complexity of Interactions: In more advanced setups, you might have multi-step reasoning processes or agent-based systems where the LLM is called multiple times with intermediate results. This creates complex chains of API calls that are difficult to mock effectively.

  5. Sensitive Information: Your tests might involve querying over proprietary or sensitive data. You don't want to include this data in your test suite, especially if it's going to be stored in a version control system.

  6. Cost: Running tests that make real API calls to LLM services can quickly become expensive, especially when running comprehensive test suites in CI/CD pipelines.

  7. Speed: Tests that rely on actual API calls are inherently slower, which can significantly slow down your development and deployment processes.

Traditional testing approaches fall short in addressing these challenges:

  • Unit tests with mocks may not capture the nuances of LLM behavior.
  • Integration tests with real API calls are expensive, slow, and potentially non-deterministic.
  • Dependency injection can help but becomes unwieldy with complex, multi-step processes.

What's needed is a way to capture the behavior of the entire system, including all API interactions, in a reproducible manner that doesn't require constant re-execution of expensive operations. This is where the VCR approach comes in, as we'll explore in the next section.

"},{"location":"walkthrough/advanced/testing-llama-index-calls/#solution","title":"Solution","text":"

Over a couple years of working with the LLM and RAG application stack, a solution has emerged to this problem. A former colleague of mine pointed out a library for Ruby called VCR with the following goal:

Record your test suite's HTTP interactions and replay them during future test runs for fast, deterministic, accurate\ntests.\n

This sounds like exactly the sort of solution we're looking for! We have numerous API calls to third-party API endpoints. They are deterministic IF the responses from each step of the LLM reasoning loop are identical to a previous run of the same loop. If we could record each LLM call and response from one run of a specific LlamaIndex pipeline, for example, and then intercept future calls to the same endpoints and replay the old responses, in theory we'd get exactly the same results.

It turns out there's a Python version of VCR called VCR.py. It comes with nice pytest fixtures and lets you decorate an entire Django test. If you call a LlamaIndex pipeline from your test and no \"cassette\" file is found in your fixtures directory, your HTTPS calls will go out to the actual API endpoints, and the requests and responses will be recorded to a new cassette.

"},{"location":"walkthrough/advanced/testing-llama-index-calls/#example","title":"Example","text":"

Using VCR.py + LlamaIndex, for example, is super simple. In a Django test, for example, you just write a test function per usual:

import vcr\nfrom django.test import TestCase\n\n\nclass ExtractsTaskTestCase(TestCase):\n\n    def test_run_extract_task(self):\n        print(f\"{self.extract.documents.all()}\")\n        ...\n

Add a vcr.py decorator naming the target fixture location:

import vcr\nfrom django.test import TestCase\n\n\nclass ExtractsTaskTestCase(TestCase):\n\n    @vcr.use_cassette(\"fixtures/vcr_cassettes/test_run_extract_task.yaml\", filter_headers=['authorization'])\n    def test_run_extract_task(self):\n        print(f\"{self.extract.documents.all()}\")\n\n        # Call your LLMs or LLM framework here\n        ...\n

Now you can call LlamaIndex query engines, retrievers, agents, etc. On the first run, all of your API calls and responses are captured. You'll obviously need to provide your API credentials, where required, or these calls will fail. As noted below, if you omit the filter_headers parameter, your API key will end up in the recorded 'cassette'.

On subsequent runs, VCR will intercept calls to recorded endpoints with identical data and return the recorded responses, letting you fully test your use of LlamaIndex without needing to patch the library or its dependencies.

"},{"location":"walkthrough/advanced/testing-llama-index-calls/#pitfalls","title":"Pitfalls","text":"

This approach has been used for production applications. We have seen a couple of things worth noting:

  1. Be warned that if you don't use filter_headers=['authorization'] in your decorators, your API keys will be in the cassette. You can replace these with fake credentials or you can just de-auth the now-public keys (see the sketch after this list).
  2. If you use any local models and don't preload them, VCR.py will capture the call to download the model's weights, configuration, etc. Even for small models, this can be a couple hundred megabytes, and for models like Phi or Llama3 7B it can be gigabytes of data. This is particularly problematic for GitHub, as you'll quickly exceed file size caps, even if you're using LFS.
  3. There is a bug in VCR.py 6.0.1 in some limited circumstances if you use async code.
  4. This is obviously Python-only. Presumably there are similar libraries for other languages and web client libraries.
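On the first point, VCR.py's filter_headers also accepts (header, replacement) tuples, so you can keep a dummy value in the cassette rather than dropping the header entirely - a sketch, not code taken from the OpenContracts test suite:

@vcr.use_cassette(\n    \"fixtures/vcr_cassettes/test_run_extract_task.yaml\",\n    filter_headers=[(\"authorization\", \"DUMMY\")],\n)\ndef test_run_extract_task(self):\n    ...\n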
"},{"location":"walkthrough/advanced/write-your-own-extractors/","title":"Write Your Own Agentic, LlamaIndex Data Extractor","text":""},{"location":"walkthrough/advanced/write-your-own-extractors/#refresher-on-what-an-open-contracts-data-extractor-does","title":"Refresher on What an Open Contracts Data Extractor Does","text":"

When you create a new Extract on the frontend, you can build a grid of data field columns and document rows that the application will traverse, cell-by-cell, to answer the question posed in each column for every document:

You can define the target data shape for each column - e.g. require all outputs match a certain dictionary schema or be floats. We leverage LLMs to ensure that the retrieved data matches the desired schema.

You'll notice when you add or edit a column, you can configure a number of different things:

Specifically, you can adjust:

  • name: The name of the column.
  • query: The query used for extraction.
  • match_text: Text we want to match semantically to find responsive text. If this field is provided, we use it instead of the query to find the relevant annotations; otherwise we fall back to the query.
  • must_contain_text: Text that must be contained in a returned annotation. This is case insensitive.
  • output_type: The type of data to be extracted. This can be a Python primitive or a simple Pydantic model.
  • instructions: Instructions for the extraction process. This tells our parser how to convert retrieved text to the target output_type. Not strictly necessary, but recommended, particularly for objects.
  • task_name: The name of the registered celery extract task used to process the cell (lets you define and deploy custom ones). We'll show you how to create a custom one in this walkthrough.
  • agentic: Boolean indicating if the extraction is agentic.
  • extract_is_list: Boolean indicating if the extraction result is a list of the output_type you provided.

You'll notice that in the GUI, there is a dropdown to pick the extract task:

This is actually retrieved dynamically from the backend from the tasks in opencontractserver.tasks.data_extract_tasks.py. Every celery task in this Python module will show up in the GUI, and the description in the dropdown is pulled from the docstring provided in the code itself:

@shared_task\ndef oc_llama_index_doc_query(cell_id, similarity_top_k=15, max_token_length: int = 512):\n    \"\"\"\n    OpenContracts' default LlamaIndex and Marvin-based data extract pipeline to run queries specified for a\n    particular cell. We use sentence transformer embeddings + sentence transformer re-ranking.\n    \"\"\"\n\n    ...\n

This means you can write your own data extractors! If you write a new task in data_extract_tasks.py, the next time the containers are rebuilt, you should see your custom extractor. We'll walk through this in a minute.
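As a preview, the bare-minimum skeleton for such a task looks roughly like this (the task name and extracted value are placeholders, the usual imports - shared_task, Datacell, timezone - are assumed, and the details are covered step by step below):

@shared_task\ndef my_custom_extractor(cell_id):\n    \"\"\"\n    Short description here - the frontend shows this docstring in the column dropdown.\n    \"\"\"\n    datacell = Datacell.objects.get(id=cell_id)\n    datacell.started = timezone.now()\n    datacell.save()\n\n    # ... your retrieval / LLM logic here ...\n\n    datacell.data = {\"data\": \"extracted value\"}\n    datacell.completed = timezone.now()\n    datacell.save()\n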

"},{"location":"walkthrough/advanced/write-your-own-extractors/#how-open-contracts-integrates-with-llamaindex","title":"How Open Contracts Integrates with LlamaIndex","text":"

You don't have to use LlamaIndex in your extractor - you could just pass an entire document to OpenAI's GPT-4o, for example - but LlamaIndex provides a tremendous amount of configurability that may yield faster, better, cheaper, or more reliable performance in many cases. You could even incorporate tools and third-party APIs in agentic fashion.

We assume you're already familiar with LlamaIndex, the \"data framework for your LLM applications\". It has a rich ecosystem of integrations, prompt templates, agents, retrieval techniques and more to let you customize how your LLMs interact with data.

"},{"location":"walkthrough/advanced/write-your-own-extractors/#custom-djangoannotationvectorstore","title":"Custom DjangoAnnotationVectorStore","text":"

We've written a custom implementation of one of LlamaIndex's core building blocks - the VectorStore - that lets LlamaIndex use OpenContracts as a vector store. Our DjangoAnnotationVectorStore in opencontractserver/llms/vector_stores.py lets you quickly write a LlamaIndex agent or question answering pipeline that can pull directly from the rich annotations and structural data (like annotation positions, layout class - e.g. header - and more) in OpenContracts. If you want to learn more about LlamaIndex's vector stores, see more in the documentation about VectorStores.

"},{"location":"walkthrough/advanced/write-your-own-extractors/#task-orchestration","title":"Task Orchestration","text":"

As discussed elsewhere, we use celery workers to run most of our analytics and transform logic. It simplifies the management of complex queues and lets us scale our application compute horizontally in the right environment.

Our data extract functionality has an orchestrator task - run_extract. For each data extract column for each document in the extract, we look at the column's task_name property and use it to attempt to load the celery task with that name via the get_task_by_name function:

def get_task_by_name(task_name) -> Optional[Callable]:\n    \"\"\"\n    Try to get celery task function Callable by name\n    \"\"\"\n    try:\n        return celery_app.tasks.get(task_name)\n    except Exception:\n        return None\n

As we loop over the datacells, we store the celery invocation for the cell's column's task_name in a task list:

for document_id in document_ids:\n        for column in fieldset.columns.all():\n            with transaction.atomic():\n                cell = Datacell.objects.create(\n                    extract=extract,\n                    column=column,\n                    data_definition=column.output_type,\n                    creator_id=user_id,\n                    document_id=document_id,\n                )\n\n            # Omitting some code here\n            ...\n\n            # Get the task function dynamically based on the column's task_name\n            task_func = get_task_by_name(column.task_name)\n            if task_func is None:\n                logger.error(\n                    f\"Task {column.task_name} not found for column {column.id}\"\n                )\n                continue\n\n            # Add the task to the group\n            tasks.append(task_func.si(cell.pk))\n

Upon completing the traversal of the grid, we use a celery workflow to run all the cell extract tasks in parallel:

chord(group(*tasks))(mark_extract_complete.si(extract_id))\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#our-default-data-extract-task-oc_llama_index_doc_query","title":"Our Default Data Extract Task - oc_llama_index_doc_query","text":"

Our default data extractor uses LlamaIndex to retrieve and structure the data in the DataGrid. Before we write a new one, let's walk through how we orchestrate tasks and how our default extract works.

oc_llama_index_doc_query requires a Datacell id as a positional argument. NOTE: if you were to write your own extract task, you'd need to follow this same signature (with a name of your choice, of course):

@shared_task\ndef oc_llama_index_doc_query(cell_id, similarity_top_k=15, max_token_length: int = 512):\n    \"\"\"\n    OpenContracts' default LlamaIndex and Marvin-based data extract pipeline to run queries specified for a\n    particular cell. We use sentence transformer embeddings + sentence transformer re-ranking.\n    \"\"\"\n\n    ...\n
The frontend pulls the task description from the docstring, so, again, if you write your own, make sure you provide a useful description.

Let's walk through how oc_llama_index_doc_query works

"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-1-mark-datacell-as-started","title":"Step 1 - Mark Datacell as Started","text":"

Once the task kicks off, step one is to log in the DB that the task has started:

    ...\n\n    try:\n        datacell.started = timezone.now()\n        datacell.save()\n\n        ...\n
  • Exception Handling: We use a try block to handle any exceptions that might occur during the processing.
  • Set Started Timestamp: We set the started field to the current time to mark the beginning of the datacell processing.
  • Save Changes: We save the Datacell object to the database.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-2-configure-embeddings-and-llm-settings","title":"Step 2 - Configure Embeddings and LLM Settings","text":"

Then, we create our embeddings module. We actually have a microservice for this to cut down on memory usage and allow for easier scaling of the compute-intensive parts of the app. For now, though, the task does not call the microservice, so we're using a lightweight sentence transformer embeddings model:

    document = datacell.document\n\n    embed_model = HuggingFaceEmbedding(\n        model_name=\"multi-qa-MiniLM-L6-cos-v1\", cache_folder=\"/models\"\n    )\n    Settings.embed_model = embed_model\n\n    llm = OpenAI(model=settings.OPENAI_MODEL, api_key=settings.OPENAI_API_KEY)\n    Settings.llm = llm\n
  • Retrieve Document: We fetch the document associated with the datacell.
  • Configure Embedding Model: We set up the HuggingFace embedding model. This model converts text into embeddings (vector representations), which are essential for semantic search.
  • Set Embedding Model in Settings: We assign the embedding model to Settings.embed_model for global access within the task.
  • Configure LLM: We set up the OpenAI model using the API key from settings. This model will be used for language processing tasks.
  • Set LLM in Settings: We assign the LLM to Settings.llm for global access within the task.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-3-initialize-djangoannotationvectorstore-for-llamaindex","title":"Step 3 - Initialize DjangoAnnotationVectorStore for LlamaIndex","text":"

Now, here's the cool part with LlamaIndex. Assuming we have Django models with embeddings produced by the same embeddings model, we don't need to do any real-time encoding of our source documents, and our Django object store in Postgres can be loaded as a LlamaIndex vector store. Even better, we can pass in some arguments that let us scope the store down to what we want. For example, we can limit retrieval to a single document, to annotations containing certain text, and to annotations with certain labels - e.g. termination. This lets us leverage all of the work that's been done by humans (and machines) in an OpenContracts corpus to label and tag documents. We're getting the best of both worlds - both human and machine intelligence!

    vector_store = DjangoAnnotationVectorStore.from_params(\n        document_id=document.id, must_have_text=datacell.column.must_contain_text\n    )\n    index = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n
  • Vector Store Initialization: Here we create an instance of DjangoAnnotationVectorStore using parameters specific to the document and column.
  • LlamaIndex Integration: We create a VectorStoreIndex from the custom vector store. This integrates the vector store with LlamaIndex, enabling advanced querying capabilities.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-4-perform-retrieval","title":"step 4 - Perform Retrieval","text":"

Now we use the properties of a configured column to find the proper text. For example, if match_text has been provided, we search for nearest K annotations to the match_text (rather than searching based on the query itself):

    search_text = datacell.column.match_text\n    query = datacell.column.query\n\n    retriever = index.as_retriever(similarity_top_k=similarity_top_k)\n    results = retriever.retrieve(search_text if search_text else query)\n
  • Retrieve Search Text and Query: We fetch the search text and query from the column associated with the datacell.
  • Configure Retriever: We configure the retriever with the similarity_top_k parameter, which determines the number of top similar results to retrieve.
  • Retrieve Results: We perform the retrieval using the search text or query. The retriever fetches the most relevant annotations from the vector store.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-5-rerank-results","title":"Step 5 - Rerank Results","text":"

We use a LlamaIndex reranker (in this case a SentenceTransformer reranker) to rerank the retrieved annotations based on the query (this is an example of where you could easily customize your own pipeline - you might want to rerank based on match text, use an LLM-based reranker, or use a totally different reranker like cohere):

sbert_rerank = SentenceTransformerRerank(\n    model=\"cross-encoder/ms-marco-MiniLM-L-2-v2\", top_n=5\n)\nretrieved_nodes = sbert_rerank.postprocess_nodes(\n    results, QueryBundle(query)\n)\n
  • Reranker Configuration: We set up the SentenceTransformerRerank model. This model is used to rerank the retrieved results for better relevance.
  • Rerank Nodes: We rerank the retrieved nodes using the SentenceTransformerRerank model and the original query. This ensures that the top results are the most relevant.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-6-process-retrieved-annotations","title":"Step 6 - Process Retrieved Annotations","text":"

Now, we determine the Annotation instance ids we retrieved so these can be linked to the datacell. On the OpenContracts frontend, this lets us readily navigate to the Annotations in the source documents:

        retrieved_annotation_ids = [\n            n.node.extra_info[\"annotation_id\"] for n in retrieved_nodes\n        ]\n        datacell.sources.add(*retrieved_annotation_ids)\n
  • Extract Annotation IDs: We extract the annotation IDs from the retrieved nodes.
  • Add Sources: We add the retrieved annotation IDs to the sources field of the datacell. This links the relevant annotations to the datacell.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-7-format-retrieved-text-for-output","title":"Step 7 - Format Retrieved Text for Output","text":"

Next, we aggregate the retrieved annotations into a single string we can pass to an LLM:

    retrieved_text = \"\\n\".join(\n        [f\"```Relevant Section:\\n\\n{n.text}\\n```\" for n in results]\n    )\n    logger.info(f\"Retrieved text: {retrieved_text}\")\n
  • Format Text: We format the retrieved text for output. Each relevant section is enclosed in Markdown code blocks for better readability.
  • Log Retrieved Text: We log the retrieved text for debugging and tracking purposes.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-8-parse-data","title":"Step 8 - Parse Data","text":"

Finally, we dynamically specify the output schema / format of the data. We use marvin to do the structuring, but you could tweak the pipeline to use LlamaIndex's Structured Data Extract or you could roll your own custom parsers.

        output_type = parse_model_or_primitive(datacell.column.output_type)\n        logger.info(f\"Output type: {output_type}\")\n\n        # If provided, we use the column parse instructions property to instruct Marvin how to parse, otherwise,\n        # we give it the query and target output schema. Usually the latter approach is OK, but the former is more\n        # intentional and gives better performance.\n        parse_instructions = datacell.column.instructions\n\n        result = marvin.cast(\n            retrieved_text,\n            target=output_type,\n            instructions=parse_instructions if parse_instructions else query,\n        )\n\n        if isinstance(result, BaseModel):\n            datacell.data = {\"data\": result.model_dump()}\n        else:\n            datacell.data = {\"data\": result}\n        datacell.completed = timezone.now()\n        datacell.save()\n
  • Determine Output Type: We determine the output type based on the column's output type.
  • Log Output Type: We log the output type for debugging purposes.
  • Parse Instructions: We fetch parsing instructions from the column.
  • Parse Result: We use marvin.cast to parse the retrieved text into the desired output type using the parsing instructions.
  • Save Result: We save the parsed result in the data field of the datacell. We also mark the datacell as completed and save the changes to the database.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-9-save-results","title":"Step 9 - Save Results","text":"

This step is particularly important if you write your own extract. We're planning to write a decorator to make a lot of this easier and automatic, but, for now, you need to remember to store the output of your extract task as JSON of the form

{\n  \"data\": <extracted data>\n}\n

Here's the code from oc_llama_index_doc_query:

if isinstance(result, BaseModel):\n    datacell.data = {\"data\": result.model_dump()}\nelse:\n    datacell.data = {\"data\": result}\n\ndatacell.completed = timezone.now()\ndatacell.save()\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-10-exception-handling","title":"Step 10 - Exception Handling","text":"

If processing fails, we catch the error and stacktrace. These are stored with the Datacell so we can see which extracts succeeded or failed, and, if they failed, why.

    except Exception as e:\n        logger.error(f\"run_extract() - Ran into error: {e}\")\n        datacell.stacktrace = f\"Error processing: {e}\"\n        datacell.failed = timezone.now()\n        datacell.save()\n
  • Exception Logging: We log any exceptions that occur during the processing.
  • Save Stacktrace: We save the error message in the stacktrace field of the datacell.
  • Mark as Failed: We mark the datacell as failed and save the changes to the database.
"},{"location":"walkthrough/advanced/write-your-own-extractors/#write-a-custom-llamaindex-extractor","title":"Write a Custom LlamaIndex Extractor","text":"

Let's write another data extractor based on LlamaIndex's REACT Agent!

"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-1-ensure-you-load-datacell","title":"Step 1 - Ensure you Load Datacell","text":"

As mentioned above, we'd like to use decorators to make some of this more automatic, but, for now, you need to load the Datacell instance from the provided id:

@shared_task\ndef llama_index_react_agent_query(cell_id):\n    \"\"\"\n    Use our DjangoAnnotationVectorStore + LlamaIndex REACT Agent to retrieve text.\n    \"\"\"\n\n    datacell = Datacell.objects.get(id=cell_id)\n\n    try:\n\n        datacell.started = timezone.now()\n        datacell.save()\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-2-setup-embeddings-model-llm","title":"Step 2 - Setup Embeddings Model + LLM","text":"

OpenContracts uses multi-qa-MiniLM-L6-cos-v1 to generate its embeddings (for now; we can make this modular as well). You can use whatever LLM you want, but we're using GPT-4o. Don't forget to instantiate both of these and configure LlamaIndex's global settings:

embed_model = HuggingFaceEmbedding(\n    model_name=\"multi-qa-MiniLM-L6-cos-v1\", cache_folder=\"/models\"\n)  # Using our pre-load cache path where the model was stored on container build\nSettings.embed_model = embed_model\n\nllm = OpenAI(model=settings.OPENAI_MODEL, api_key=settings.OPENAI_API_KEY)\nSettings.llm = llm\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-3-instantiate-open-contracts-vector-store","title":"Step 3 - Instantiate Open Contracts Vector Store","text":"

Now, let's instantiate a vector store that will only retrieve annotations from the document linked to our loaded datacell:

document = datacell.document\n\nvector_store = DjangoAnnotationVectorStore.from_params(\n    document_id=document.id, must_have_text=datacell.column.must_contain_text\n)\nindex = VectorStoreIndex.from_vector_store(vector_store=vector_store)\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-4-instantiate-query-engine-wrap-it-as-llm-agent-tool","title":"Step 4 - Instantiate Query Engine & Wrap it As LLM Agent Tool","text":"

Next, let's use OpenContracts as a LlamaIndex query engine:

doc_engine = index.as_query_engine(similarity_top_k=10)\n

And let's use that engine with an agent tool:

document = datacell.document\n\nquery_engine_tools = [\n    QueryEngineTool(\n        query_engine=doc_engine,\n        metadata=ToolMetadata(\n            name=\"doc_engine\",\n            description=(\n                f\"Provides detailed annotations and text from within the {document.title}\"\n            ),\n        ),\n    )\n]\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-5-setup-agent","title":"Step 5 - Setup Agent","text":"
agent = ReActAgent.from_tools(\n    query_engine_tools,\n    llm=llm,\n    verbose=True,\n    # context=context\n)\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-6-decide-how-to-map-column-properties-to-retrieval-process","title":"Step 6 - Decide how to Map Column Properties to Retrieval Process","text":"

As discussed above, the Column model definition has a lot of properties that serve slightly different purposes depending on your RAG implementation. Since we're writing a new extract, you can decide how to map these inputs here. To keep things simple for starters, let's just take the column's query and pass it directly to the ReAct Agent. For improvements, we could build a more complex prompt that passes along, for example, the column's parsing instructions (see the sketch after the code below).

response = agent.chat(datacell.column.query)\n
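For instance, folding the column's instructions into the prompt might look like this (a sketch of one possible improvement, not part of the default task):

query = datacell.column.query\nparse_instructions = datacell.column.instructions\nprompt = (\n    f\"{query}\\n\\nFollow these instructions when answering:\\n{parse_instructions}\"\n    if parse_instructions\n    else query\n)\nresponse = agent.chat(prompt)\n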
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-7-post-process-and-store-data","title":"Step 7 - Post-Process and Store Data","text":"

At this stage we could use a structured data parser, or we could just store the answer from the agent. For simplicity, let's do the latter:

datacell.data = {\"data\": str(response)}\ndatacell.completed = timezone.now()\ndatacell.save()\n
"},{"location":"walkthrough/advanced/write-your-own-extractors/#step-8-rebuild-containers-and-look-at-your-frontend","title":"Step 8 - Rebuild Containers and Look at Your Frontend","text":"

The next time you rebuild the containers (required in prod; in the local env they rebuild automatically), you will see a new entry in the column configuration modals:

It's that easy! Now, any user in your instance can run your extract and generate outputs - here we've used it for the Company Name column:

We plan to create decorators and other developer aids to reduce boilerplate here and let you focus entirely on your retrieval pipeline.

"},{"location":"walkthrough/advanced/write-your-own-extractors/#conclusion","title":"Conclusion","text":"

By breaking down the tasks step-by-step, you can see how the custom vector store integrates with LlamaIndex to provide powerful semantic search capabilities within a Django application. Even better, if you write your own data extract tasks you can expose them to users who don't have to know anything at all about how they're built. This is the way it should be - separation of concerns!

"}]} \ No newline at end of file diff --git a/walkthrough/advanced/configure-annotation-view/index.html b/walkthrough/advanced/configure-annotation-view/index.html index bb766262..1087262e 100755 --- a/walkthrough/advanced/configure-annotation-view/index.html +++ b/walkthrough/advanced/configure-annotation-view/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Configure How Annotations Are Displayed

Annotations are composed of tokens (basically text in a line surrounded by whitespace). The tokens have a highlight. OpenContracts also has a "BoundingBox" around the tokens which is the smallest rectangle that can cover all of the tokens in an Annotation.

In the Annotator view, you'll see a purple-colored "eye" icon in the top left of the annotation list in the sidebar. Click the icon to bring up a series of configurations for how annotations are displayed:

There are three different settings that can be combined to significantly change how you see the annotations: 1. Show only selected - You will only see the annotation selected, either by clicking on it in the sidebar or when you clicked into an annotation from the Corpus view. All other annotations will be completely hidden. 2. Show bounding boxes - If you unselect this, only the tokens will be visible. This is recommended where you have large numbers of overlapping annotations or annotations that are sparse - e.g. a few words scattered throughout a paragraph. In either of these cases, the bounding boxes can cover other bounding boxes and this can be confusing. Where you have too many overlapping bounding boxes, it's easier to hide them and just look at the tokens. 3. Label Display Behavior - has three options:

  1. Always Show - Always show the label for an annotation when it's displayed (remember, you can choose to only display selected annotations).
  2. Always Hide - Never show the label for an annotation, regardless of its visibility.
  3. Show on Hover - If an annotation is visible, when you hover over it, you'll see the label.

Configure How Annotations Are Displayed

Annotations are composed of tokens (basically text in a line surrounded by whitespace). The tokens have a highlight. OpenContracts also has a "BoundingBox" around the tokens which is the smallest rectangle that can cover all of the tokens in an Annotation.

In the Annotator view, you'll see a purple-colored "eye" icon in the top left of the annotation list in the sidebar. Click the icon to bring up a series of configurations for how annotations are displayed:

There are three different settings that can be combined to significantly change how you see the annotations: 1. Show only selected - You will only see the annotation selected, either by clicking on it in the sidebar or when you clicked into an annotation from the Corpus view. All other annotations will be completely hidden. 2. Show bounding boxes - If you unselect this, only the tokens will be visible. This is recommended where you have large numbers of overlapping annotations or annotations that are sparse - e.g. a few words scattered throughout a paragraph. In either of these cases, the bounding boxes can cover other bounding boxes and this can be confusing. Where you have too many overlapping bounding boxes, it's easier to hide them and just look at the tokens. 3. Label Display Behavior - has three options:

  1. Always Show - Always show the label for an annotation when it's displayed (remember, you can choose to only display selected annotations).
  2. Always Hide - Never show the label for an annotation, regardless of its visibility.
  3. Show on Hover - If an annotation is visible, when you hover over it, you'll see the label.
\ No newline at end of file diff --git a/walkthrough/advanced/data-extract-models/index.html b/walkthrough/advanced/data-extract-models/index.html index dc348c10..b1cfd908 100755 --- a/walkthrough/advanced/data-extract-models/index.html +++ b/walkthrough/advanced/data-extract-models/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Why Data Extract?

An extraction process is pivotal for transforming raw, unstructured data into actionable insights, especially in fields like legal, financial, healthcare, and research. Imagine having thousands of documents, such as contracts, invoices, medical records, or research papers, and needing to quickly locate and analyze specific information like key terms, dates, patient details, or research findings. Automated extraction saves countless hours of manual labor, reduces human error, and enables real-time data analysis. By leveraging an efficient extraction pipeline, businesses and researchers can make informed decisions faster, ensure compliance, enhance operational efficiency, and uncover valuable patterns and trends that might otherwise remain hidden in the data deluge. Simply put, data extraction transforms overwhelming amounts of information into strategic assets, driving innovation and competitive advantage.

How we Store Our Data Extracts

Ultimately, our application design follows Django best practices for a data-driven application with asynchronous data processing. We use the Django ORM (with capabilities like vector search) to store our data and Celery tasks to orchestrate processing. The extracts/models.py file defines several key models that are used to manage and track the process of extracting data from documents.

These models include:

  1. Fieldset
  2. Column
  3. Extract
  4. Datacell

Each model plays a specific role in the extraction workflow, and together they enable the storage, configuration, and execution of document-based data extraction tasks.

Detailed Explanation of Each Model

1. Fieldset

Purpose: The Fieldset model groups related columns together. Each Fieldset represents a specific configuration of data fields that need to be extracted from documents.

class Fieldset(BaseOCModel):
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Why Data Extract?

An extraction process is pivotal for transforming raw, unstructured data into actionable insights, especially in fields like legal, financial, healthcare, and research. Imagine having thousands of documents, such as contracts, invoices, medical records, or research papers, and needing to quickly locate and analyze specific information like key terms, dates, patient details, or research findings. Automated extraction saves countless hours of manual labor, reduces human error, and enables real-time data analysis. By leveraging an efficient extraction pipeline, businesses and researchers can make informed decisions faster, ensure compliance, enhance operational efficiency, and uncover valuable patterns and trends that might otherwise remain hidden in the data deluge. Simply put, data extraction transforms overwhelming amounts of information into strategic assets, driving innovation and competitive advantage.

How we Store Our Data Extracts

Ultimately, our application design follows Django best practices for a data-driven application with asynchronous data processing. We use the Django ORM (with capabilities like vector search) to store our data and asynchronous tasks to orchestrate processing. The extracts/models.py file defines several key models that are used to manage and track the process of extracting data from documents.

These models include:

  1. Fieldset
  2. Column
  3. Extract
  4. Datacell

Each model plays a specific role in the extraction workflow, and together they enable the storage, configuration, and execution of document-based data extraction tasks.

Detailed Explanation of Each Model

1. Fieldset

Purpose: The Fieldset model groups related columns together. Each Fieldset represents a specific configuration of data fields that need to be extracted from documents.

class Fieldset(BaseOCModel):
     name = models.CharField(max_length=256, null=False, blank=False)
     description = models.TextField(null=False, blank=False)
 
  • name: The name of the fieldset.
  • description: A description of what this fieldset is intended to extract.

Usage: Fieldsets are associated with extracts in the Extract model, defining what data needs to be extracted.
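
To make this concrete, here's a minimal sketch of creating a Fieldset via the Django ORM. This is a hypothetical example - the import path is an assumption, and BaseOCModel may require additional fields (such as a creator) that aren't shown here:

from opencontractserver.extracts.models import Fieldset  # assumed import path

fieldset = Fieldset.objects.create(
    name="Key contract terms",
    description="Pulls the effective date, termination date, and governing law from each contract.",
)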

2. Column

Purpose: The Column model defines individual data fields that need to be extracted. Each column specifies what to extract, the criteria for extraction, and the model to use for extraction.

class Column(BaseOCModel):
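     # NOTE: a hedged sketch of the Column fields, inferred from the column
     # options described later in this walkthrough (query, match_text,
     # must_contain_text, output_type, instructions, task_name, agentic,
     # extract_is_list). The exact field types and constraints in
     # extracts/models.py may differ.
     name = models.CharField(max_length=256, null=False, blank=False)
     query = models.TextField(null=True, blank=True)
     match_text = models.TextField(null=True, blank=True)
     must_contain_text = models.TextField(null=True, blank=True)
     output_type = models.TextField(null=False, blank=False)
     instructions = models.TextField(null=True, blank=True)
     task_name = models.CharField(max_length=1024, null=False, blank=False)
     agentic = models.BooleanField(default=False)
     extract_is_list = models.BooleanField(default=False)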

Export / Import Functionality

Exports

OpenContracts supports both exporting and importing corpuses. This functionality is disabled on the public demo as it can be bandwidth intensive. If you want to experiment with these features on your own, you'll see the export action when you right-click on a corpus:

You can access your exports from the user dropdown menu in the top right corner of the screen. Once your export is complete, you should be able to download a zip containing all the documents, their PAWLs layers, and the corpus data you created - including all annotations.

Imports

If you've enabled corpus imports (see the frontend env file for the boolean toggle to do this - it's REACT_APP_ALLOW_IMPORTS), you'll see an import action when you click the action button on the corpus page.

Export Format

OpenContracts Export Format Specification

The OpenContracts export is a zip archive containing:

  1. A data.json file with metadata about the export
  2. The original PDF documents
  3. Exported annotations "burned in" to the PDF documents

data.json Format

The data.json file contains a JSON object with the following fields:

  • annotated_docs (dict): Maps PDF filenames to OpenContractDocExport objects with annotations for that document.

  • doc_labels (dict): Maps document label names (strings) to AnnotationLabelPythonType objects defining those labels.

  • text_labels (dict): Maps text annotation label names (strings) to AnnotationLabelPythonType objects defining those labels.

  • corpus (OpenContractCorpusType): Metadata about the exported corpus, with fields:

    • id (int): ID of the corpus
    • title (string)
    • description (string)
    • icon_name (string): Filename of the corpus icon image
    • icon_data (string): Base64 encoded icon image data
    • creator (string): Email of the corpus creator
    • label_set (string): ID of the labelset used by this corpus
  • label_set (OpenContractsLabelSetType): Metadata about the label set, with fields:

    • id (int)
    • title (string)
    • description (string)
    • icon_name (string): Filename of the labelset icon
    • icon_data (string): Base64 encoded labelset icon data
    • creator (string): Email of the labelset creator

OpenContractDocExport Format

Each document in annotated_docs is represented by an OpenContractDocExport object with fields:

  • doc_labels (list[string]): List of document label names applied to this doc
  • labelled_text (list[OpenContractsAnnotationPythonType]): List of text annotations
  • title (string): Document title
  • content (string): Full text content of the document
  • description (string): Description of the document
  • pawls_file_content (list[PawlsPagePythonType]): PAWLS parse data for each page
  • page_count (int): Number of pages in the document

OpenContractsAnnotationPythonType Format

Represents an individual text annotation, with fields:

  • id (string): Optional ID
  • annotationLabel (string): Name of the label for this annotation
  • rawText (string): Raw text content of the annotation
  • page (int): 0-based page number the annotation is on
  • annotation_json (dict): Maps page numbers to OpenContractsSinglePageAnnotationType

OpenContractsSinglePageAnnotationType Format

Represents the annotation data for a single page:

  • bounds (BoundingBoxPythonType): Bounding box of the annotation on the page
  • tokensJsons (list[TokenIdPythonType]): List of PAWLS tokens covered by the annotation
  • rawText (string): Raw text of the annotation on this page

BoundingBoxPythonType Format

Represents a bounding box with fields:

  • top (int)
  • bottom (int)
  • left (int)
  • right (int)

TokenIdPythonType Format

References a PAWLS token by page and token index:

  • pageIndex (int)
  • tokenIndex (int)
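
For reference, the two structures above map naturally onto Python TypedDicts. The sketch below simply mirrors the field lists; the canonical definitions live in the OpenContracts type modules and may differ in detail:

from typing import TypedDict


class BoundingBoxPythonType(TypedDict):
    # Coordinates of the annotation's bounding box on the page
    top: int
    bottom: int
    left: int
    right: int


class TokenIdPythonType(TypedDict):
    # Points at a specific PAWLS token by page and position within that page
    pageIndex: int
    tokenIndex: int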

PawlsPagePythonType Format

Represents PAWLS parse data for a single page:

  • page (PawlsPageBoundaryPythonType): Page boundary info
  • tokens (list[PawlsTokenPythonType]): List of PAWLS tokens on the page

PawlsPageBoundaryPythonType Format

Represents the page boundary with fields:

  • width (float)
  • height (float)
  • index (int): Page index

PawlsTokenPythonType Format

Represents a single PAWLS token with fields:

  • x (float): X-coordinate of token box
  • y (float): Y-coordinate of token box
  • width (float): Width of token box
  • height (float): Height of token box
  • text (string): Text content of the token

AnnotationLabelPythonType Format

Defines an annotation label with fields:

  • id (string)
  • color (string): Hex color for the label
  • description (string)
  • icon (string): Icon name
  • text (string): Label text
  • label_type (LabelType): One of DOC_TYPE_LABEL, TOKEN_LABEL, RELATIONSHIP_LABEL, METADATA_LABEL

Example data.json

{
   "annotated_docs": {
     "document1.pdf": {
       "doc_labels": ["Contract", "NDA"],

Fork a Corpus

To Fork or Not to Fork?

One of the amazing things about Open Source collaboration is you can stand on the shoulders of giants - we can share techniques and data and collectively achieve what we could never do alone. OpenContracts is designed to make it super easy to share and re-use annotation data.

In OpenContracts, we introduce the concept of "forking" a corpus - basically creating a copy of a public or private corpus, complete with its documents and annotations, which you can edit and tweak as needed. This opens up some interesting possibilities. For example, you might have a base corpus with annotations common to many types of AI models or annotation projects, which you can fork as needed and layer task- or domain-specific annotations on top of.

Fork a Corpus

Forking a corpus is easy.

  1. Again, right-click on a corpus to bring up the context menu. You'll see an entry to "Fork Corpus":
  2. Click on it to start a fork. You should see a confirmation in the top right of the screen:
  3. Once the fork is complete, the next time you go to your Corpus page, you'll see a new Corpus with a Fork icon in the icon bar at the bottom. If you hover over it, you'll be able to see a summary of the corpus it was forked from. This is tracked in the database, so, long-term, we'd like to have corpus version control similar to how git works:

Generate GraphQL Schema Files

Generating GraphQL Schema Files

Open Contracts uses Graphene to provide a rich GraphQL endpoint, complete with the GraphiQL query application. For some applications, you may want to generate a GraphQL schema file in SDL or JSON. One example use case is if you're developing a frontend that you want to connect to OpenContracts and you'd like to autogenerate TypeScript types from a GraphQL schema.

To generate a GraphQL schema file, run your choice of the following commands.

For an SDL file:

$ docker-compose -f local.yml run django python manage.py graphql_schema --schema config.graphql.schema.schema --out schema.graphql
 

For a JSON file:

$ docker-compose -f local.yml run django python manage.py graphql_schema --schema config.graphql.schema.schema --out schema.json
 

You can convert these to TypeScript for use in a frontend (though you'll find this has already been done for the React- based OpenContracts frontend) using a tool like this.

Understanding Document Ground Truth in OpenContracts

OpenContracts utilizes the PAWLs format for representing documents and their annotations. PAWLs was designed by AllenAI to provide a consistent and structured way to store text and layout information for complex documents like contracts, scientific papers, and newspapers.

AllenAI has largely stopped maintaining PAWLs, and OpenContracts has evolved into something very different from its PAWLs namesake, but we've kept the name (and contributed a few PRs back to the PAWLs project).

Standardized PDF Data Layers

In OpenContracts, every document is processed through a pipeline that extracts and structures text and layout information into three files, plus a set of structural annotations:

  1. Original PDF: The original PDF document.
  2. PAWLs Layer (JSON): A JSON file containing the text and positional data for each token (word) in the document.
  3. Text Layer: A text file containing the full text extracted from the document.
  4. Structural Annotations: Thanks to nlm-ingestor, we now use Nlmatics' parser to generate the PAWLs layer and turn the layout blocks - like header, paragraph, table, etc. - into OpenContracts Annotation objects that represent the visual blocks for each PDF. Upon creation, we create embeddings for each Annotation, which are stored in Postgres via pgvector.

The PAWLs layer serves as the source of truth for the document, allowing seamless translation between text and positional information.

Visualizing How PDFs are Converted to Data & Annotations

Here's a rough diagram showing how a series of tokens - Lorem, ipsum, dolor, sit and amet - are mapped from a PDF to our various data types.

pawls-annotation-mapping.svg

PAWLs Processing Pipeline

The PAWLs processing pipeline involves the following steps:

  1. Token Extraction: The OCRed document is processed using the parsing engine of Grobid to extract "tokens" (text surrounded by whitespace, typically a word) along with their page and positional information.
  2. PAWLs Layer Generation: The extracted tokens and their positional data are stored as a JSON file, referred to as the "PAWLs layer."
  3. Text Layer Generation: The full text is extracted from the PAWLs layer and stored as a separate text file, called the "text layer."

PAWLs Layer Structure

The PAWLs layer JSON file consists of a list of page objects, each containing the necessary tokens and page information for a given page. Here's the data shape for each page object:

class PawlsPagePythonType(TypedDict):
     page: PawlsPageBoundaryPythonType
     tokens: list[PawlsTokenPythonType]
 

The PawlsPageBoundaryPythonType represents the page boundary information:

class PawlsPageBoundaryPythonType(TypedDict):
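     # Fields per the page-boundary description above
     width: float
     height: float
     index: int


PawlsTokenPythonType holds each token's position and text. Here's a sketch that mirrors the token fields documented in the export format section above (the canonical definition lives in the OpenContracts type modules):

class PawlsTokenPythonType(TypedDict):
    x: float       # X-coordinate of the token box
    y: float       # Y-coordinate of the token box
    width: float   # Width of the token box
    height: float  # Height of the token box
    text: str      # Text content of the token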

Detailed Overview of @doc_analyzer_task Decorator

The @doc_analyzer_task decorator is an integral part of the OpenContracts CorpusAction system, which automates document processing when new documents are added to a corpus. As a refresher, within the CorpusAction system, users have three options for registering actions to run automatically on new documents:

  1. Custom data extractors
  2. Analyzer microservices
  3. Celery tasks decorated with @doc_analyzer_task

The @doc_analyzer_task decorator is specifically designed for the third option, providing a straightforward way to write and deploy simple, span-based analytics directly within the OpenContracts ecosystem.

When to Use @doc_analyzer_task

The @doc_analyzer_task decorator is ideal for scenarios where:

  1. You're performing tests or analyses solely based on document text or PAWLs tokens.
  2. Your analyzer doesn't require conflicting dependencies or non-Python code bases.
  3. You want a quick and easy way to integrate custom analysis into the OpenContracts workflow.

For more complex scenarios, such as those requiring specific environments, non-Python components, or heavy computational resources, creating an analyzer microservice would be recommended.

Advantages of @doc_analyzer_task

Using the @doc_analyzer_task decorator offers several benefits:

  1. Simplicity: It abstracts away much of the complexity of interacting with the OpenContracts system.
  2. Integration: Tasks are automatically integrated into the CorpusAction workflow.
  3. Consistency: It ensures that your analysis task produces outputs in a format that OpenContracts can readily use.
  4. Error Handling: It provides built-in error handling and retry mechanisms.

By using this decorator, you can focus on writing the core analysis logic while the OpenContracts system handles the intricacies of document processing, annotation creation, and result storage.

In the following sections, we'll dive deep into how to structure functions decorated with @doc_analyzer_task, what data they receive, and how their outputs are processed by the OpenContracts system.

Function Signature

Functions decorated with @doc_analyzer_task should have the following signature:

@doc_analyzer_task()
 def your_analyzer_function(*args, pdf_text_extract=None, pdf_pawls_extract=None, **kwargs):
     # Function body
     pass

Run a Gremlin Analyzer

Introduction to Gremlin Integration

OpenContracts integrates with a powerful NLP engine called Gremlin Engine ("Gremlin"). If you run a Gremlin analyzer on a Corpus, it will create annotations of its own that you can view and export (e.g. automatically applying document labels or labeling parties, dates, and places, etc.). It's meant to provide a consistent API to deliver and render NLP and machine learning capabilities to end-users. As discussed in the configuration section, you need to install Gremlin Analyzers through the admin dashboard.

Once you've installed Gremlin Analyzers, however, it's easy to apply them.

Using an Installed Gremlin Analyzer

  1. If analysis capabilities are enabled for your instance, when you right-click on a Corpus, you'll see an option to "Analyze Corpus":

  2. Clicking on this item will bring up a dialog where you can browse available analyzers:

  3. Select one and hit "Analyze" to submit a corpus for processing. When you go to the Analysis tab of your Corpus now, you'll see the analysis. Most likely, if you just clicked there, it will say processing:

  4. When the Analysis is complete, you'll see a summary of the number of labels and annotations applied by the analyzer:

Note on Processing Time

Large Corpuses of hundreds of documents can take a long time to process (10 minutes or more). It's hard to predict processing time up front, because it's dependent on the number of total pages and the specific analysis being performed. At the moment, there is not a great mechanism in place to detect and handle failures in a Gremlin analyzer and reflect this in OpenContracts. It's on our roadmap to improve this integration. In the meantime, the example analyzers we've released with Gremlin should be very stable, so they should run predictably.

Viewing the Outputs

Once an Analysis completes, you'll be able to browse the annotations from the analysis in several ways.

  1. First, they'll be available in the "Annotation" tab, and you can easily filter to annotations from a specific analyzer.
  2. Second, when you load a Document in the Annotator view, there's a small widget at the top of the annotator that has three downward-facing arrows and says "Human Annotation Mode".
  3. Click on the arrows to open a tray showing the analyses applied to this document.
  4. Click on an analysis to load the annotations and view them in the document.

Note: You can delete an analysis, but you cannot edit it. The annotations are machine-created and cannot be edited by human users.

Testing Complex LLM Applications

I've built a number of full-stack, LLM-powered applications at this point. A persistent challenge is testing the underlying LLM query pipelines in a deterministic and isolated way.

A colleague and I eventually hit on a way to make testing complex LLM behavior deterministic and decoupled from upstream LLM API providers. This tutorial walks you through the problem and solution to this testing issue.

In this guide, you'll learn:

  1. Why testing LLM applications is particularly challenging
  2. How to overcome common testing obstacles like API dependencies and resource limitations
  3. An innovative approach using VCR.py to record and replay LLM interactions
  4. How to implement this solution with popular frameworks like LlamaIndex and Django
  5. Potential pitfalls to watch out for when using this method

Whether you're working with RAG models, multi-hop reasoning loops, or other complex LLM architectures, this tutorial will show you how to create fast, deterministic, and accurate tests without relying on expensive resources or compromising the integrity of your test suite.

By the end of this guide, you'll have a powerful new tool in your AI development toolkit, enabling you to build more robust and reliable LLM-powered applications. Let's dive in!

Problem

To understand why testing complex LLM-powered applications is challenging, let's break down the components and processes involved in a typical RAG (Retrieval-Augmented Generation) application using a framework like LlamaIndex:

  1. Data Ingestion: Your application likely starts by ingesting large amounts of data from various sources (documents, databases, APIs, etc.).

  2. Indexing: This data is then processed and indexed, often using vector embeddings, to allow for efficient retrieval.

  3. Query Processing: When a user submits a query, your application needs to: a) Convert the query into a suitable format (often involving embedding the query) b) Search the index to retrieve relevant information c) Format the retrieved information for use by the LLM

  4. LLM Interaction: The processed query and retrieved information are sent to an LLM (like GPT-4) for generating a response.

  5. Post-processing: The LLM's response might need further processing or validation before being returned to the user.

Now, consider the challenges in testing such a system:

  1. External Dependencies: Many of these steps rely on external APIs or services. The indexing and query embedding often use one model (e.g., OpenAI's embeddings API), while the final response generation uses another (e.g., GPT-4). Traditional testing approaches would require mocking or stubbing these services, which can be complex and may not accurately represent real-world behavior.

  2. Resource Intensity: Running a full RAG pipeline for each test can be extremely resource-intensive and time-consuming. It might involve processing large amounts of data and making multiple API calls to expensive LLM services.

  3. Determinism: LLMs can produce slightly different outputs for the same input, making it difficult to write deterministic tests. This variability can lead to flaky tests that sometimes pass and sometimes fail.

  4. Complexity of Interactions: In more advanced setups, you might have multi-step reasoning processes or agent-based systems where the LLM is called multiple times with intermediate results. This creates complex chains of API calls that are difficult to mock effectively.

  5. Sensitive Information: Your tests might involve querying over proprietary or sensitive data. You don't want to include this data in your test suite, especially if it's going to be stored in a version control system.

  6. Cost: Running tests that make real API calls to LLM services can quickly become expensive, especially when running comprehensive test suites in CI/CD pipelines.

  7. Speed: Tests that rely on actual API calls are inherently slower, which can significantly slow down your development and deployment processes.

Traditional testing approaches fall short in addressing these challenges:

  • Unit tests with mocks may not capture the nuances of LLM behavior.
  • Integration tests with real API calls are expensive, slow, and potentially non-deterministic.
  • Dependency injection can help but becomes unwieldy with complex, multi-step processes.

What's needed is a way to capture the behavior of the entire system, including all API interactions, in a reproducible manner that doesn't require constant re-execution of expensive operations. This is where the VCR approach comes in, as we'll explore in the next section.

Solution

Over a couple years of working with the LLM and RAG application stack, a solution has emerged to this problem. A former colleague of mine pointed out a library for Ruby called VCR with the following goal:

Record your test suite's HTTP interactions and replay them during future test runs for fast, deterministic, accurate
 tests.
 

This sounds like exactly the sort of solution we're looking for! We have numerous API calls to third-party API endpoints. They are deterministic IF the responses from each step of the LLM reasoning loop are identical to a previous run of the same loop. If we could record each LLM call and response from one run of a specific LlamaIndex pipeline, for example, and then intercept future calls to the same endpoints and replay the old responses, in theory we'd get exactly the same results.

It turns out there's a Python version of VCR called VCR.py. It comes with nice pytest fixtures and lets you decorate an entire Django test. If you call a LlamaIndex pipeline from your test and no "cassette" file is found in your fixtures directory, your HTTP calls go out to the actual API endpoints and the requests and responses are recorded to a new cassette; on subsequent runs, the recorded responses are replayed instead of hitting the live APIs.

Example

Using VCR.py with LlamaIndex is super simple. In a Django test, for example, you just write a test function per usual:

import vcr
 from django.test import TestCase
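 
 class ContractQueryTests(TestCase):
 
     # A minimal sketch (not the project's actual test suite) of a cassette-backed
     # Django test. The cassette path and the pipeline helper below are
     # illustrative assumptions, not part of OpenContracts.
     @vcr.use_cassette("fixtures/vcr_cassettes/test_contract_query.yaml")
     def test_contract_query(self):
         # On the first run, real HTTP calls to the embeddings / LLM APIs are
         # recorded into the cassette YAML; later runs replay those responses,
         # making the test fast, deterministic, and offline.
         answer = run_my_llama_index_pipeline("What is the termination date?")  # hypothetical helper
         self.assertIn("termination", answer.lower())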

Write Your Own Agentic, LlamaIndex Data Extractor

Refresher on What an Open Contracts Data Extractor Does

When you create a new Extract on the frontend, you can build a grid of data field columns and document rows that the application will traverse, cell-by-cell, to answer the question posed in each column for every document:

datagrid

You can define the target data shape for each column - e.g. require all outputs match a certain dictionary schema or be floats. We leverage LLMs to ensure that the retrieved data matches the desired schema.

You'll notice when you add or edit a column, you can configure a number of different things:

datagrid

Specifically, you can adjust:

  • name: The name of the column.
  • query: The query used for extraction.
  • match_text: Text we want to match semantically to process on. If this field is provided, we use it instead of the query to find responsive text; if not, we fall back to the query.
  • must_contain_text: Text that must be contained in a returned annotation. This is case insensitive.
  • output_type: The type of data to be extracted. This can be a Python primitive or a simple Pydantic model.
  • instructions: Instructions for the extraction process. This instructs our parser how to convert retrieved text to the target output_type. Not strictly necessary, but recommended, specifically for objects.
  • task_name: The name of the registered celery extract task to use for processing (lets you define and deploy custom ones). We'll show you how to create a custom one in this walkthrough.
  • agentic: Boolean indicating if the extraction is agentic.
  • extract_is_list: Boolean indicating if the extraction result is a list of the output_type you provided.

You'll notice that in the GUI, there is a dropdown to pick the extract task:

Extract_Task_Dropdown.png

This list is actually retrieved dynamically from the backend, based on the tasks in opencontractsserver.tasks.data_extract_tasks.py. Every celery task in this Python module will show up in the GUI, and the description in the dropdown is pulled from the docstring provided in the code itself:

@shared_task
 def oc_llama_index_doc_query(cell_id, similarity_top_k=15, max_token_length: int = 512):
     """
     OpenContracts' default LlamaIndex and Marvin-based data extract pipeline to run queries specified for a
     given Datacell against its document.
     """

Key-Concepts

Data Types

Text annotation data is divided into several concepts:

  1. Corpuses (or collections of documents). One document can be in multiple corpuses.
  2. Documents. Currently, these are PDFs ONLY.
  3. Annotations. These are either document-level annotations (the document type), text-level annotations (highlighted text), or relationships (which apply a label between two annotations). Relationships are currently not well-supported and may be buggy.
  4. Analyses. These are groups of read-only annotations added by a Gremlin analyzer (see more on that below).

Permissioning

OpenContracts is built on top of django-guardian, a powerful object-level permissioning framework for Django. Each GraphQL request can add a field to annotate the object-level permissions the current user has for a given object, and the frontend relies on this to determine whether to make some objects and pages read-only and whether certain features should be exposed to a given user. The capability of sharing objects with specific users is built in, but is not enabled from the frontend at the moment. Allowing such widespread sharing and user lookups could be a security hole and could also unduly tax the system. We'd like to test these capabilities more fully before letting users use them.

GraphQL

Mutations and Queries

OpenContracts uses Graphene and GraphQL to serve data to its frontend. You can access the GraphiQL playground by going to your OpenContracts root URL + /graphql - e.g. https://opencontracts.opensource.legal/graphql. Anonymous users have access to any public data. To authenticate and access your own data, you either need to use the login mutation to create a JWT token or log in to the admin dashboard to get a Django session and auth cookie that will automatically authenticate your requests to the GraphQL endpoint.
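
As a rough, non-authoritative sketch of the token-based flow from a script, something like the following should work. The login mutation's argument and return field names (username, password, token) and the Authorization header prefix are assumptions you should verify in the GraphiQL docs browser:

    import requests

    GRAPHQL_URL = "https://opencontracts.opensource.legal/graphql"

    # Field names are assumptions - check the real schema in GraphiQL under "Docs"
    login_mutation = """
    mutation Login($username: String!, $password: String!) {
      login(username: $username, password: $password) {
        token
      }
    }
    """

    resp = requests.post(
        GRAPHQL_URL,
        json={"query": login_mutation, "variables": {"username": "me", "password": "secret"}},
    )
    token = resp.json()["data"]["login"]["token"]

    # Later requests pass the JWT; the prefix (Bearer vs. JWT) depends on backend config
    headers = {"Authorization": f"Bearer {token}"}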

If you're not familiar with GraphQL, it's a very powerful way to expose your backend to users and/or frontend clients, letting them construct specific queries that return exactly the data shapes they need. As an example, here's a request to get public corpuses and the annotated text and labels in them:
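
An illustrative stand-in for that request is sketched below; the relay-style field names (corpuses, annotations, rawText, annotationLabel) are assumptions, so verify them against the schema in the GraphiQL docs browser:

    import requests

    GRAPHQL_URL = "https://opencontracts.opensource.legal/graphql"

    # Anonymous request for public data; field names below are assumptions
    public_corpus_query = """
    query {
      corpuses {
        edges {
          node {
            title
            annotations {
              edges {
                node {
                  rawText
                  annotationLabel {
                    text
                  }
                }
              }
            }
          }
        }
      }
    }
    """

    resp = requests.post(GRAPHQL_URL, json={"query": public_corpus_query})
    print(resp.json())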

GraphiQL comes with a built-in documentation browser. Just click "Docs" in the top-right of the screen to start browsing. Typically, mutations change things on the server, while queries merely request copies of data from the server. We've tried to make our schema fairly self-explanatory, but we plan to add more descriptions and guidance to our API docs.

GraphQL-only features

Some of our features are currently not accessible via the frontend. Sharing analyses and corpuses with the public, for example, can only be achieved via the makeCorpusPublic and makeAnalysisPublic mutations, and only admins have this power at the moment. For our current release, we've done this to limit the number of public corpuses and cut down on server usage. We'd like to build a fully free and open, collaborative platform with more features for sharing anonymously, but this will require additional effort and compute power.

\ No newline at end of file diff --git a/walkthrough/step-1-add-documents/index.html b/walkthrough/step-1-add-documents/index.html index 43d1ada0..5a97b01e 100755 --- a/walkthrough/step-1-add-documents/index.html +++ b/walkthrough/step-1-add-documents/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Step 1 - Add Documents

In order to do anything, you need to add some documents to OpenContracts.

Go to the Documents tab

Click on the "Documents" entry in the menu to bring up a view of all documents you have read and/or write access to:

Open the Action Menu

Now, click on the "Action" dropdown to open the menu of available actions, then click "Import":

This will bring up a dialog to load documents:

Select Documents to Upload

OpenContracts works with PDFs only, as this gives us a single file type with predictable data structures and formats. In the future, we'll add functionality to convert other files to PDF, but, for now, please use PDFs. It doesn't matter whether they are already OCRed, as OpenContracts performs its own OCR on every PDF to ensure consistent OCR quality and outputs. Once you've added documents for upload, you'll see a list of documents:

Click on a document to change the description or title:

Upload Your Documents

Click upload to upload the documents to OpenContracts. Note: Once the documents are uploaded, they are automatically processed with Tesseract and PAWLs to create a layer of tokens - each one representing a word / symbol in the PDF and its X,Y coordinates on the page. This is what powers the OpenContracts annotator and allows us to create both layout-aware and text-only annotations. While the PAWLs processing script is running, the document you uploaded will not be available for viewing and cannot be added to a corpus. You'll see a loading bar on the document until the pre-processing is complete. This only happens once and can take a while (a couple of minutes up to roughly ten) depending on document length, quality, etc.
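
To give a concrete feel for that token layer, here is an approximate sketch of one page of PAWLs-style output as a Python dict. The authoritative schema is documented on the "PAWLS token format" page under the Advanced section, so treat the field names here as an approximation:

    # Approximate shape of one page of PAWLs token data (illustrative only;
    # see the PAWLs token format page for the authoritative schema)
    pawls_page = {
        "page": {"width": 612.0, "height": 792.0, "index": 0},
        "tokens": [
            # Each token is one word/symbol plus its X,Y position and size on the page
            {"x": 72.0, "y": 96.5, "width": 38.2, "height": 11.8, "text": "This"},
            {"x": 114.5, "y": 96.5, "width": 62.7, "height": 11.8, "text": "Agreement"},
        ],
    }

    # Layout-aware annotations can then reference these tokens, so a highlight can be
    # redrawn on the rendered PDF page while text-only annotations use the text alone.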

\ No newline at end of file diff --git a/walkthrough/step-2-create-labelset/index.html b/walkthrough/step-2-create-labelset/index.html index d00ad06f..dcbb8073 100755 --- a/walkthrough/step-2-create-labelset/index.html +++ b/walkthrough/step-2-create-labelset/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Step 2 - Create Labelset

Why Labelsets?

Before you can add labels, you need to decide what you want to label. A labelset should reflect the taxonomy or concepts you want to associate with text in your document. This can be solely for the purpose of human review and retrieval, but we imagine many of you want to use it to train machine learning models.

At the moment, there's no way to create a label in a corpus without creating a labelset and creating a label for the labelset (though we'd like to add that and welcome contributions).

Create Text Labels

Let's say we want to add some labels for "Parties", "Termination Clause", and "Effective Date". To do that, let's first create a LabelSet to hold the labels.

  1. Go to the labelset view and click the action button to bring up the action menu:
  2. Clicking on the "Create Label Set" item will bring up a modal to let you create labels:
  3. Now click on the new label set to edit the labels:
  4. A modal comes up that lets you edit three types of labels:

    1. Text Labels - are meant to label spans of text ("highlights")
    2. Relationship Labels - this feature is still under development, but it labels relationships between text labels (e.g. one labelled party is the "Parent Company" of another).
    3. Doc Type Labels - are meant to label what category the document belongs in - e.g. a "Stock Purchase Agreement" or an "NDA"
  5. Click the "Text Labels" tab to bring up a view of current labels for text annotations and an action button that lets you create new ones. There should be no labels when you first open this view"

  6. Click the action button and then the "Create Text Label" dropdown item:
  7. You'll see a new, blank label in the list of text labels:
  8. Click the edit icon on the label to edit the label title, description, color and/or icon. To edit the icon or highlight color, hover over or click the giant tag icon on the left side of the label:
  9. Hit save to commit the changes to the database. Repeat for the other labels - "Parties", "Termination Clause", and "Effective Date":

Create Document-Type Labels

In addition to labelling specific parts of a document, you may want to tag the document itself as a certain type of document or as addressing a certain subject. In this example, let's say we want to label some documents as "contracts" and others as "not contracts".

  1. Let's also create two example document type labels. Click the "Doc Type Labels" tab:
  2. As before, click the action button and the "Create Document Type Label" item to create a blank document type label:
  3. Repeat to create two doc type labels - "Contract" and "Not Contract":
  4. Hit "Close" to close the editor.
\ No newline at end of file diff --git a/walkthrough/step-3-create-a-corpus/index.html b/walkthrough/step-3-create-a-corpus/index.html index 1731f3e1..4740c1be 100755 --- a/walkthrough/step-3-create-a-corpus/index.html +++ b/walkthrough/step-3-create-a-corpus/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Step 3 - Create Corpus

Purpose of the Corpus

A "Corpus" is a collection of documents that can be annotated by hand or automatically by a "Gremlin" analyzer. In order to create a Corpus, you first need to create a Corpus and then add documents to it.

Go to the Corpus Page

  1. First, login if you're not already logged in.
  2. Then, go to the "Corpus" tab and click the "Action" dropdown to bring up the action menu:
  3. Click "Create Corpus" to bring up the Create Corpus dialog. If you've already created a labelset or have a pre-existing one, you can select it, otherwise you'll need to create and add one later:
  4. Assuming you created the labelset you want to use, when you click on the dropdown in the "Label Set" section, you should see your new labelset. Click on it to select it:

  1. You will now be able to open the corpus again, open documents in the corpus and start labelling.

Add Documents to Corpus

  1. Once you have a corpus, go back to the document page to select documents to add. You can do this in one of two ways.
    1. Right-click on a document to show a context menu:
    2. Or, SHIFT + click on the documents you want to select in order to select multiple documents at once. A green checkmark will appear on selected documents.
  2. When you're done, click the "Action" dropdown and choose the option to add the selected documents to a corpus.
  3. A dialog will pop up asking you to select a corpus to add the documents to. Select the desired corpus and hit ok.
  4. You'll get a confirmation dialog. Hit OK.
  5. When you click on the Corpus you just added the documents to, you'll get a tabbed view of all of the documents, annotations and analyses for that Corpus. At this stage, you should see your documents:

Congrats! You've created a corpus to hold annotations or perform an analysis! To start labelling it yourself, however, you need to create and then select a LabelSet. You do not need to do this to run an analyzer.

Note: If you have an OpenContracts export file and proper permissions, you can also import a corpus, documents, annotations, and labels. This is disabled on our demo instance, however, to cut down on server load and reduce opportunities to upload potentially malicious files. See the "Advanced" section for more details.

\ No newline at end of file diff --git a/walkthrough/step-4-create-text-annotations/index.html b/walkthrough/step-4-create-text-annotations/index.html index b19ead71..7be37365 100755 --- a/walkthrough/step-4-create-text-annotations/index.html +++ b/walkthrough/step-4-create-text-annotations/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Step 4 - Create Some Annotations

To view or edit annotations, you need to open a corpus and then open a document in the Corpus.

  1. Go to your Corpuses page and click on the corpus you just created:
  2. This will open up the document view again. Click on one of the documents to bring up the annotator:
  3. To select the label to apply, click the vertical ellipsis in the "Text Label to Apply" widget. This will bring up an interface that lets you search your labelset and select a label:
  4. Select the "Effective Date" label, for example, to label the Effective Date:
  5. Now, in the document, click and drag a box around the language that corresponds to your selected label:
  6. When you've selected the correct text, release the mouse. You'll see a confirmation when your annotation is created (you'll also see the annotation in the sidebar to the left):
  7. If you want to delete the annotation, you can click on the trash icon in the corresponding annotation card in the sidebar, or, when you hover over the annotation on the page, you'll see a trash icon in the label bar of the annotation. You can click this to delete the annotation too.
  8. If your desired annotated text is non-contiguous, you can hold down the SHIFT key while selecting blocks of text to combine them into a single annotation. While holding SHIFT, releasing the mouse will not create the annotation in the database; it just lets you move to a new area.
    1. One situation where you might want to do this is when the text you want to highlight spans multiple lines but is only a small part of the surrounding paragraph (such as this example, where the Effective Date spans two lines):
    2. Or you might want to select multiple snippets of text in a larger block of text, such as where you have multiple parties you want to combine into a single annotation:
\ No newline at end of file diff --git a/walkthrough/step-5-create-doc-type-annotations/index.html b/walkthrough/step-5-create-doc-type-annotations/index.html index 036c9588..eeeac4d8 100755 --- a/walkthrough/step-5-create-doc-type-annotations/index.html +++ b/walkthrough/step-5-create-doc-type-annotations/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Step 5 - Create Some Document Annotations

  1. If you want to label the type of document instead of the text inside it, use the controls in the "Doc Type" widget on the bottom right of the Annotator. Hover over it and a green plus button should appear:
  2. Click the "+" button to bring up a dialog that lets you search and select document type labels (remember, we created these earlier in the tutorial):
  3. Click "Add Label" to actually apply the label, and you'll now see that label displayed in the "Doc Type" widget in the annotator:
  4. As before, you can click the trash can to delete the label.
\ No newline at end of file diff --git a/walkthrough/step-6-search-and-filter-by-annotations/index.html b/walkthrough/step-6-search-and-filter-by-annotations/index.html index 44147049..ac2ed021 100755 --- a/walkthrough/step-6-search-and-filter-by-annotations/index.html +++ b/walkthrough/step-6-search-and-filter-by-annotations/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Step 6 - Search and Filter By Annotations

  1. Back in the Corpus view, you can see the document type label you just added in the document view:
  2. You can click on the filter dropdown above to filter the documents to only those with a certain doc type label:
  3. With the corpus opened, click on the "Annotations" tab instead of the "Documents" tab to get a summary of all the current annotations in the Corpus:
  4. Click on an annotation card to automatically load the document it's in and jump right to the page containing the annotation:
\ No newline at end of file diff --git a/walkthrough/step-7-query-corpus/index.html b/walkthrough/step-7-query-corpus/index.html index 4c4f5b18..a019f07c 100755 --- a/walkthrough/step-7-query-corpus/index.html +++ b/walkthrough/step-7-query-corpus/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Querying a Corpus

Once you've created a corpus of documents, you can ask a natural language question and get a natural language answer, complete with citations and links back to the relevant text in the document(s).

Corpus Query.gif

Note: We're still working to improve nav and GUI performance, but this is pretty good for a first cut.

\ No newline at end of file diff --git a/walkthrough/step-8-data-extract/index.html b/walkthrough/step-8-data-extract/index.html index f55c01ca..f7abec57 100755 --- a/walkthrough/step-8-data-extract/index.html +++ b/walkthrough/step-8-data-extract/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Build a Datagrid

You can easily use OpenContracts to create an "Extract" - a collection of queries and natural-language-specified data points, represented as columns in a grid, that will be asked of every document in the extract (documents are represented as rows). You can define complex extract schemas, including Python primitives, Pydantic models (no nesting - yet), and lists.
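
To make the kinds of schemas concrete, here is an illustrative sketch using plain Pydantic. The class and field names are invented for this example, and how a fieldset column is actually bound to an output type in OpenContracts is covered in the data-extract and write-your-own-extractors guides rather than here:

    from pydantic import BaseModel

    # Hypothetical column output types - names are illustrative only.

    # A primitive column, e.g. "What is the effective date?" -> str
    effective_date: str = "2023-01-01"

    # A flat Pydantic model (no nesting yet, per the note above)
    class PartyInfo(BaseModel):
        name: str
        role: str  # e.g. "Buyer" or "Seller"

    # A list column, e.g. "List the defined terms" -> list[str]
    defined_terms: list[str] = ["Agreement", "Effective Date"]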

Building a Datagrid

To create a data grid, you can start by adding documents or adding data fields. Your choice. If you selected a corpus when defining the extract, the documents from that Corpus will be pre-loaded.

To add documents:

Add Extract Docs.gif

And to add data fields:

Add Extract Column Gif.gif

Running an Extract

Once you've added all of the documents you want and defined all of the data fields to apply, you can click run to start processing the grid:

Grid Processing.gif

Extract speed will depend on your underlying LLM and the number of available celery workers provisioned for OpenContracts. We haven't optimized for performance yet; we hope to do more performance work in a v2 minor release.

Reviewing Results

Once an extract is complete, you can click on the hamburger menu in a cell to see a dropdown menu. Click the eye to view the sources for that datacell. If you click thumbs up or thumbs down, you can log that you approved or rejected the value in question. Extract value edits are coming soon.

See a quick walkthrough here:

Grid Review And Sources.gif

\ No newline at end of file diff --git a/walkthrough/step-9-corpus-actions/index.html b/walkthrough/step-9-corpus-actions/index.html index bdcde6d8..4c6d8a27 100755 --- a/walkthrough/step-9-corpus-actions/index.html +++ b/walkthrough/step-9-corpus-actions/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Corpus Actions

Introduction

If you're familiar with GitHub Actions - user-scripted functions that run automatically over a version-controlled repository when certain events take place (like a merge, PR, etc.) - then a CorpusAction should be a familiar concept. You can configure a celery task using our @doc_analyzer_task decorator (see more here on how to write these) and then configure a CorpusAction to run your custom task on all documents added to the target corpus.
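
As a heavily hedged sketch of what such a task might look like - the decorator's import path, call signature, and required return structure are all defined in the write-your-own-extractors guide, so everything below apart from the decorator's name should be treated as a placeholder:

    # Placeholder sketch only - the import path, parameter names, and return structure
    # are assumptions; see the write-your-own-extractors guide for the real contract.
    from opencontractserver.shared.decorators import doc_analyzer_task  # assumed path


    @doc_analyzer_task()
    def contract_not_contract(*args, pdf_text_extract=None, pdf_pawls_extract=None, **kwargs):
        """Toy example: tag a document based on naive keyword matching."""
        text = (pdf_text_extract or "").lower()
        doc_label = "CONTRACT" if "agreement" in text else "NOT_CONTRACT"
        # The decorator expects a specific return shape (doc labels, span annotations,
        # metadata, task status) - the tuple below only stands in for that shape.
        return [doc_label], [], [{"data": {}}], True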

Setting up a Corpus Action

Supported Actions

NOTE: Currently, you have to configure all of this via the Django admin dashboard (http://localhost:8000/admin if you're using our local deployment). We'd like to expose this functionality via our React frontend, but the required GUI elements and GraphQL mutations need to be built out. A good starter PR for someone ;-).

Currently, a CorpusAction can be configured to trigger one of three kinds of processing automatically:

  1. A data extract fieldset - in which case, a data extract will be created and run on new documents added to the configured corpus (see our guide on setting up a data extract job)
  2. An Analyzer
    1. Configured as a "Gremlin Microservice". See more information on configuring a microservice-based analyzer here
    2. Configured to run a task decorated using the @doc_analyzer_task decorator. See more about configuring these kinds of tasks here.

Creating Corpus Action

From within the Django admin dashboard, click on CorpusActions or the +Add button next to the header:

img.png

Once you've opened the create action form, you'll see a number of different options you can configure:

img.png

See the next section for more details on these configuration options. Once you enter the appropriate configuration and hit "Save", the specified Analyzer or Fieldset will run automatically on the specified Corpus! If you want to learn more about the underlying architecture, check out our deep dive on CorpusActions.

Configuration Options for Corpus Action

Corpus specifies that an action should run only on a single corpus, specified via dropdown.

Analyzer or Fieldset properties control whether an analysis or a data extract runs when the applicable trigger fires (more on this below). If you want to run a data extract when a document is added to the corpus, select the fieldset defining the data you want to extract. If you want to run an analyzer, select the pre-configured analyzer. Remember, an analyzer can point to a microservice or to a task decorated with @doc_analyzer_task.

Trigger refers to the specific action type that should kick off the desired analysis. Currently, we "provide" add and edit actions - i.e., run specified analytics when a document is added or edited, respectively - but we have not configured the edit action to run.

Disabled is a toggle that will turn off the specified CorpusAction for ALL corpuses.

Run on all corpuses is a toggle that, if True, will run the specified action on EVERY corpus. Be careful with this, as it runs for all corpuses for ALL users. Depending on your environment, this could incur a substantial compute cost, and other users may not appreciate it. A nice feature we'd love to add is a more fine-grained set of rule-based access controls to limit actions to certain groups. This would require a substantial investment in the application's frontend and remains an unlikely addition, though we'd absolutely welcome contributions!
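
If you prefer a shell to the admin forms, a roughly equivalent setup via the Django shell might look like the sketch below. The model import paths and exact field names (analyzer, fieldset, trigger, disabled, run_on_all_corpuses) are assumptions inferred from the admin labels above, so double-check them against the actual models:

    # Illustrative sketch only - import paths and field names are assumptions
    # inferred from the admin form labels described above.
    from opencontractserver.corpuses.models import Corpus, CorpusAction
    from opencontractserver.analyzer.models import Analyzer

    corpus = Corpus.objects.get(title="My Contracts")
    analyzer = Analyzer.objects.get(
        task_name="opencontractserver.tasks.doc_analysis_tasks.contract_not_contract"
    )

    CorpusAction.objects.create(
        corpus=corpus,              # run only on this corpus
        analyzer=analyzer,          # or set fieldset=... to run a data extract instead
        trigger="add_document",     # assumed value for the "add" trigger
        disabled=False,
        run_on_all_corpuses=False,  # leave False unless you really mean every corpus
    )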

Quick Reference - Configuring @doc_analyzer_task + Analyzer

If you've written your own @doc_analyzer_task and want to run it automatically, here's the setup step by step.

  1. First, we assume you put a properly written and decorated task in opencontractserver.tasks.doc_analysis_tasks.py.
  2. Second, you need to create and configure an Analyzer via the Django admin panel. Click on the +Add button next to the Analyzer entry in the admin sidebar and then configure necessary properties:

img.png

Place the name of your task in the task_name property - e.g. opencontractserver.tasks.doc_analysis_tasks.contract_not_contract - add a brief description, assign the creator to the desired user, and click save.

  3. Now, this Analyzer instance can be assigned to a CorpusAction!
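
To sanity-check that the dotted path you entered in task_name actually resolves to your task, you can try importing it from a Django shell. This is just a quick verification trick; the module path shown is the example from above, and whether the decorated function exposes a celery-style .name attribute depends on how the decorator is implemented:

    # Quick check that the task_name dotted path resolves to an importable task.
    from importlib import import_module

    task_path = "opencontractserver.tasks.doc_analysis_tasks.contract_not_contract"
    module_path, task_name = task_path.rsplit(".", 1)

    task = getattr(import_module(module_path), task_name)
    # If the decorator produces a celery task, it should expose its registered name:
    print(getattr(task, "name", task))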

\ No newline at end of file