diff --git a/404.html b/404.html index eb155900..c6a175c0 100755 --- a/404.html +++ b/404.html @@ -1 +1 @@ - OpenContracts
\ No newline at end of file + OpenContracts
\ No newline at end of file diff --git a/acknowledgements/index.html b/acknowledgements/index.html index 595f794c..9c2e198e 100755 --- a/acknowledgements/index.html +++ b/acknowledgements/index.html @@ -1,4 +1,4 @@ - Acknowledgements - OpenContracts

Acknowledgements

OpenContracts is built in part on top of the PAWLs project frontend. We have made extensive changes, however, and plan to remove even more of the original PAWLs codebase, particularly their state management, as it currently duplicates the Apollo state store we use throughout the application. That said, PAWLs was the inspiration for how we handle text extraction, and we're planning to continue using their PDF rendering code. We are also using PAWLs' pre-processing script, which is based on Grobid.

We should also thank the Grobid project, which was clearly a source of inspiration for PAWLs and an extremely impressive tool. Grobid is designed more for medical and scientific papers, but, nevertheless, offers a tremendous amount of inspiration and examples for the legal world to borrow. Perhaps there is an opportunity to have a unified tool in that respect.

Finally, let's not forget Tesseract, the OCR engine that started its life as an HP research project in the 1980s before being open-sourced in 2005, developed under Google's sponsorship from 2006, and finally becoming an independent project in 2018. Were it not for the excellent, free OCR provided by Tesseract, we'd have to rely on commercial OCR tech, which would make this kind of free, open-source project prohibitively expensive. Thanks to the many, many people who've made free OCR possible over the nearly 40 years Tesseract has been under development.

Acknowledgements

OpenContracts is built in part on top of the PAWLs project frontend. We have made extensive changes, however, and plan to remove even more of the original PAWLs codebase, particularly their state management, as it currently duplicates the Apollo state store we use throughout the application. That said, PAWLs was the inspiration for how we handle text extraction, and we're planning to continue using their PDF rendering code. We are also using PAWLs' pre-processing script, which is based on Grobid.

We should also thank the Grobid project, which was clearly a source of inspiration for PAWLs and an extremely impressive tool. Grobid is designed more for medical and scientific papers, but, nevertheless, offers a tremendous amount of inspiration and examples for the legal world to borrow. Perhaps there is an opportunity to have a unified tool in that respect.

Finally, let's not forget Tesseract, the OCR engine that started its life as an HP research project in the 1980s before being open-sourced in 2005, developed under Google's sponsorship from 2006, and finally becoming an independent project in 2018. Were it not for the excellent, free OCR provided by Tesseract, we'd have to rely on commercial OCR tech, which would make this kind of free, open-source project prohibitively expensive. Thanks to the many, many people who've made free OCR possible over the nearly 40 years Tesseract has been under development.

\ No newline at end of file diff --git a/architecture/asynchronous-processing/index.html b/architecture/asynchronous-processing/index.html index d14f2e66..e0e28e5c 100755 --- a/architecture/asynchronous-processing/index.html +++ b/architecture/asynchronous-processing/index.html @@ -1,4 +1,4 @@ - Asynchronous Processing - OpenContracts
Skip to content

Asynchronous Processing

Asynchronous Tasks

OpenContracts makes extensive use of Celery, a powerful, mature Python framework for distributed and asynchronous processing. Out of the box, dedicated Celery workers are configured in the Docker Compose stack to handle computationally intensive and long-running tasks such as parsing documents, applying annotations to PDFs, creating exports, importing exports, and more.

What if my Celery queue gets clogged?

We are always working to make OpenContracts more fault-tolerant and stable. That said, because the documents we work with are PDFs, there is tremendous variation in what the parsers have to handle. Some documents are extremely long (thousands of pages or more), whereas others may have poor formatting, no text layer, etc. In most cases, OpenContracts should be able to process the PDFs and make them compatible with our annotation tools. Sometimes, however, whether due to unexpected issues or an unexpected volume of documents, you may want to purge the queue of tasks waiting to be processed by your Celery workers. To do this, run:

sudo docker-compose -f local.yml run django celery -A config.celery_app purge
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Asynchronous Processing

Asynchronous Tasks

OpenContracts makes extensive use of Celery, a powerful, mature Python framework for distributed and asynchronous processing. Out of the box, dedicated Celery workers are configured in the Docker Compose stack to handle computationally intensive and long-running tasks such as parsing documents, applying annotations to PDFs, creating exports, importing exports, and more.

What if my Celery queue gets clogged?

We are always working to make OpenContracts more fault-tolerant and stable. That said, because the documents we work with are PDFs, there is tremendous variation in what the parsers have to handle. Some documents are extremely long (thousands of pages or more), whereas others may have poor formatting, no text layer, etc. In most cases, OpenContracts should be able to process the PDFs and make them compatible with our annotation tools. Sometimes, however, whether due to unexpected issues or an unexpected volume of documents, you may want to purge the queue of tasks waiting to be processed by your Celery workers. To do this, run:

sudo docker-compose -f local.yml run django celery -A config.celery_app purge
 

Be aware that this can have undesired effects for your users. For example, every time a new document is uploaded, a Django signal kicks off the PDF preprocessor to produce the PAWLs token layer that is later annotated. If these tasks are in the queue when it is purged, you'll have documents that cannot be annotated because they lack the PAWLs token layer. In such cases, we recommend you delete and re-upload the documents. There are ways to manually reprocess the PDFs, but we don't have a user-friendly way to do this yet.

\ No newline at end of file diff --git a/architecture/components/annotator/how-annotations-are-created/index.html b/architecture/components/annotator/how-annotations-are-created/index.html index 05aa3165..363d1633 100755 --- a/architecture/components/annotator/how-annotations-are-created/index.html +++ b/architecture/components/annotator/how-annotations-are-created/index.html @@ -1,4 +1,4 @@ - How Annotations are Handled - OpenContracts

How Annotations are Handled

Overview

Here's a step-by-step explanation of the flow:

  1. The user selects text on the PDF by clicking and dragging the mouse. This triggers a mouse event in the Page component.
  2. The Page component checks if the Shift key is pressed.
  3. If the Shift key is not pressed, it creates a new selection and sets the selection state in the AnnotationStore.
  4. If the Shift key is pressed, it adds the selection to the selection queue in the AnnotationStore.
  5. The AnnotationStore updates its internal state with the new selection or the updated selection queue.
  6. If the Shift key is released, the Page component triggers the creation of a multi-page annotation. If the Shift key is still pressed, it waits for the next user action.
  7. To create a multi-page annotation, the Page component combines the selections from the queue.
  8. The Page component retrieves the annotation data from the PDFPageInfo object for each selected page.
  9. The Page component creates a ServerAnnotation object with the combined annotation data.
  10. The Page component calls the createAnnotation function in the AnnotationStore, passing the ServerAnnotation object.
  11. The AnnotationStore invokes the requestCreateAnnotation function in the Annotator component.
  12. The Annotator component sends a mutation to the server to create the annotation.
  13. If the server responds with success, the Annotator component updates the local state with the new annotation. If there's an error, it displays an error message.
  14. The updated annotations trigger a re-render of the relevant components, reflecting the newly created annotation on the PDF.

Flowchart

graph TD
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

How Annotations are Handled

Overview

Here's a step-by-step explanation of the flow:

  1. The user selects text on the PDF by clicking and dragging the mouse. This triggers a mouse event in the Page component.
  2. The Page component checks if the Shift key is pressed.
  3. If the Shift key is not pressed, it creates a new selection and sets the selection state in the AnnotationStore.
  4. If the Shift key is pressed, it adds the selection to the selection queue in the AnnotationStore.
  5. The AnnotationStore updates its internal state with the new selection or the updated selection queue.
  6. If the Shift key is released, the Page component triggers the creation of a multi-page annotation. If the Shift key is still pressed, it waits for the next user action.
  7. To create a multi-page annotation, the Page component combines the selections from the queue.
  8. The Page component retrieves the annotation data from the PDFPageInfo object for each selected page.
  9. The Page component creates a ServerAnnotation object with the combined annotation data.
  10. The Page component calls the createAnnotation function in the AnnotationStore, passing the ServerAnnotation object.
  11. The AnnotationStore invokes the requestCreateAnnotation function in the Annotator component.
  12. The Annotator component sends a mutation to the server to create the annotation.
  13. If the server responds with success, the Annotator component updates the local state with the new annotation. If there's an error, it displays an error message.
  14. The updated annotations trigger a re-render of the relevant components, reflecting the newly created annotation on the PDF.
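
The queueing behaviour in the steps above can be sketched in isolation. This is a minimal model and not the actual AnnotationStore or Page code: `SelectionQueue`, `DraftAnnotation`, `onMouseUp`, and `onShiftReleased` are hypothetical names, and the server mutation is replaced by pushing onto a local `created` array. It assumes every finished drag is queued, and that the queue is flushed either immediately (Shift not held) or when Shift is released.

```typescript
// Hypothetical types; the real ServerAnnotation carries more fields.
interface BoundingBox { left: number; top: number; right: number; bottom: number }
interface PageSelection { pageNumber: number; bounds: BoundingBox }
interface DraftAnnotation { label: string; boundsByPage: Record<number, BoundingBox[]> }

class SelectionQueue {
  private queue: PageSelection[] = [];
  private activeLabel: string;
  // Stand-in for the server round trip: finished annotations land here.
  created: DraftAnnotation[] = [];

  constructor(activeLabel: string) {
    this.activeLabel = activeLabel;
  }

  // A finished drag is always queued. Without Shift the queue is flushed
  // right away; with Shift held it keeps accumulating selections.
  onMouseUp(selection: PageSelection, shiftDown: boolean): void {
    this.queue.push(selection);
    if (!shiftDown) {
      this.flush();
    }
  }

  // Releasing Shift turns everything queued so far into one annotation.
  onShiftReleased(): void {
    this.flush();
  }

  // Combine the queued selections, grouped by page, into a single draft.
  private flush(): void {
    if (this.queue.length === 0) {
      return;
    }
    const boundsByPage: Record<number, BoundingBox[]> = {};
    this.queue.forEach((selection) => {
      if (!boundsByPage[selection.pageNumber]) {
        boundsByPage[selection.pageNumber] = [];
      }
      boundsByPage[selection.pageNumber].push(selection.bounds);
    });
    this.created.push({ label: this.activeLabel, boundsByPage });
    this.queue = [];
  }
}

// Usage: two Shift-drags on different pages, then one plain drag.
const box = (n: number): BoundingBox => ({ left: n, top: n, right: n + 10, bottom: n + 10 });
const selectionQueue = new SelectionQueue("Indemnification");
selectionQueue.onMouseUp({ pageNumber: 1, bounds: box(0) }, true);
selectionQueue.onMouseUp({ pageNumber: 2, bounds: box(5) }, true);
selectionQueue.onShiftReleased();
selectionQueue.onMouseUp({ pageNumber: 3, bounds: box(9) }, false);
```

After this sequence `created` holds two drafts: one spanning pages 1 and 2, and one covering page 3 only.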

Flowchart

graph TD
     A[User selects text on the PDF] -->|Mouse event| B(Page component)
     B --> C{Is Shift key pressed?}
     C -->|No| D[Create new selection]
diff --git a/architecture/components/annotator/overview/index.html b/architecture/components/annotator/overview/index.html
index ca1ca606..1dcd006e 100755
--- a/architecture/components/annotator/overview/index.html
+++ b/architecture/components/annotator/overview/index.html
@@ -1,4 +1,4 @@
- Open Contracts Annotator Components - OpenContracts      

Open Contracts Annotator Components

Key Questions

  1. How is the PDF loaded?

     • The PDF is loaded in the Annotator.tsx component.
     • Inside the useEffect hook that runs when the openedDocument prop changes, the PDF loading process is initiated.
     • The pdfjsLib.getDocument function from the pdfjs-dist library is used to load the PDF file specified by openedDocument.pdfFile.
     • The loading progress is tracked using the loadingTask.onProgress callback, which updates the progress state.
     • Once the PDF is loaded, the loadingTask.promise is resolved, and the PDFDocumentProxy object is obtained.
     • The PDFPageInfo objects are created for each page of the PDF using doc.getPage(i) and stored in the pages state.

  2. Where and how are annotations loaded?

     • Annotations are loaded using the REQUEST_ANNOTATOR_DATA_FOR_DOCUMENT GraphQL query in the Annotator.tsx component.
     • The useQuery hook from Apollo Client is used to fetch the annotator data based on the provided initial_query_vars.
     • The annotator_data received from the query contains information about existing text annotations, document label annotations, and relationships.
     • The annotations are transformed into ServerAnnotation, DocTypeAnnotation, and RelationGroup objects and stored in the pdfAnnotations state using setPdfAnnotations.

  3. Where is the PAWLs layer loaded?

     • The PAWLs layer is loaded in the Annotator.tsx component.
     • Inside the useEffect hook that runs when the openedDocument prop changes, the PAWLs layer is loaded using the getPawlsLayer function from api/rest.ts.
     • The getPawlsLayer function makes an HTTP GET request to fetch the PAWLs data file specified by openedDocument.pawlsParseFile.
     • The PAWLs data is expected to be an array of PageTokens objects, which contain token information for each page of the PDF.
     • The loaded PAWLs data is then used to create PDFPageInfo objects for each page, which include the page tokens.

High-level Components Overview

  • The Annotator component is the top-level component that manages the state and data loading for the annotator.
  • It renders the PDFView component, which is responsible for displaying the PDF and annotations.
  • The PDFView component renders various sub-components, such as LabelSelector, DocTypeLabelDisplay, AnnotatorSidebar, AnnotatorTopbar, and PDF.
  • The PDF component renders individual Page components for each page of the PDF.
  • Each Page component renders Selection and SearchResult components for annotations and search results, respectively.
  • The AnnotatorSidebar component displays the list of annotations, relations, and a search widget.
  • The PDFStore and AnnotationStore are context providers that hold the PDF and annotation data, respectively.

Specific Component Deep Dives

PDFView.tsx

The PDFView component is a top-level component that renders the PDF document with annotations, relations, and text search capabilities. It manages the state and functionality related to annotations, relations, and user interactions. Here's a detailed explanation of how the component works:

  1. The PDFView component receives several props, including permissions, callbacks for CRUD operations on annotations and relations, refs for container and selection elements, and various configuration options.

  2. It initializes several state variables using the useState hook, including:

     • selectionElementRefs and searchResultElementRefs: Refs for annotation selections and search results.
     • pageElementRefs: Refs for individual PDF pages.
     • scrollContainerRef: Ref for the scroll container.
     • textSearchMatches and searchText: State for text search matches and search text.
     • selectedAnnotations and selectedRelations: State for currently selected annotations and relations.
     • pageSelection and pageSelectionQueue: State for current page selection and queued selections.
     • pdfPageInfoObjs: State for PDF page information objects.
     • Various other state variables for active labels, relation modal visibility, and annotation options.

  3. The component defines several functions for updating state and handling user interactions, such as:

     • insertSelectionElementRef, insertSearchResultElementRefs, and insertPageRef: Functions to add refs for selections, search results, and pages.
     • onError: Error handling callback.
     • advanceTextSearchMatch and reverseTextSearchMatch: Functions to navigate through text search matches.
     • onRelationModalOk and onRelationModalCancel: Callbacks for relation modal actions.
     • createMultiPageAnnotation: Function to create a multi-page annotation from queued selections.

  4. The component uses the useEffect hook to handle side effects, such as:

     • Setting the scroll container ref on load.
     • Listening for changes in the shift key and triggering annotation creation.
     • Updating text search matches when the search text changes.

  5. The component renders the PDF document and its related components using the PDFStore and AnnotationStore contexts:

     • The PDFStore context provides the PDF document, pages, and error handling.
     • The AnnotationStore context provides annotation-related state and functions.

  6. The component renders the following main sections:

     • LabelSelector: Allows the user to select the active label for annotations.
     • DocTypeLabelDisplay: Displays the document type labels.
     • AnnotatorSidebar: Sidebar component for managing annotations and relations.
     • AnnotatorTopbar: Top bar component for additional controls and options.
     • PDF: The actual PDF component that renders the PDF pages and annotations.

  7. The PDF component, defined in PDF.tsx, is responsible for rendering the PDF pages and annotations. It receives props from the PDFView component, such as permissions, configuration options, and callbacks.

  8. The PDF component maps over each page of the PDF document and renders a Page component for each page, passing the necessary props.

  9. The Page component, also defined in PDF.tsx, is responsible for rendering a single page of the PDF document along with its annotations and search results. It handles mouse events for creating and modifying annotations.

  10. The PDFView component also renders the RelationModal component when the active relation label is set and the user has the necessary permissions. The modal allows the user to create or modify relations between annotations.

PDF.tsx

PDF renders the actual PDF document with annotations and text search capabilities. PDFView (see above) is what actually interacts with the backend / API.

  1. The PDF component receives several props:
     • shiftDown: Indicates whether the Shift key is pressed (optional).
     • doc_permissions and corpus_permissions: Specify the permissions for the document and corpus, respectively.
     • read_only: Determines if the component is in read-only mode.
     • show_selected_annotation_only: Specifies whether to show only the selected annotation.
     • show_annotation_bounding_boxes: Specifies whether to show annotation bounding boxes.
     • show_annotation_labels: Specifies the behavior for displaying annotation labels.
     • setJumpedToAnnotationOnLoad: A callback function to set the jumped-to annotation on load.
  2. The PDF component retrieves the PDF document and pages from the PDFStore context.
  3. It maps over each page of the PDF document and renders a Page component for each page, passing the necessary props.
  4. The Page component is responsible for rendering a single page of the PDF document along with its annotations and search results.
  5. Inside the Page component:
     • It creates a canvas element using the useRef hook to render the PDF page.
     • It retrieves the annotations for the current page from the AnnotationStore context.
     • It defines a ConvertBoundsToSelections function that converts the selected bounds to annotations and tokens.
     • It uses the useEffect hook to set up the PDF page rendering and event listeners for resizing and scrolling.
     • It renders the PDF page canvas, annotations, search results, and queued selections.
  6. The Page component renders the following sub-components:
     • PageAnnotationsContainer: A styled container for the page annotations.
     • PageCanvas: A styled canvas element for rendering the PDF page.
     • Selection: Represents a single annotation selection on the page.
     • SearchResult: Represents a search result on the page.
  7. The Page component handles mouse events for creating and modifying annotations:
     • On mouseDown, it initializes the selection if the necessary permissions are granted and the component is not in read-only mode.
     • On mouseMove, it updates the selection bounds if a selection is active.
     • On mouseUp, it adds the completed selection to the pageSelectionQueue and triggers the creation of a multi-page annotation if the Shift key is not pressed.
  8. The Page component also handles fetching more annotations for previous and next pages using the FetchMoreOnVisible component.
  9. The SelectionBoundary and SelectionTokens components are used to render the annotation boundaries and tokens, respectively.
  10. The PDFPageRenderer class is responsible for rendering a single PDF page on the canvas. It manages the rendering tasks and provides methods for canceling and rescaling the rendering.
  11. The getPageBoundsFromCanvas function calculates the bounding box of the page based on the canvas dimensions and its parent container.

Open Contracts Annotator Components

Key Questions

  1. How is the PDF loaded?

     • The PDF is loaded in the Annotator.tsx component.
     • Inside the useEffect hook that runs when the openedDocument prop changes, the PDF loading process is initiated.
     • The pdfjsLib.getDocument function from the pdfjs-dist library is used to load the PDF file specified by openedDocument.pdfFile.
     • The loading progress is tracked using the loadingTask.onProgress callback, which updates the progress state.
     • Once the PDF is loaded, the loadingTask.promise is resolved, and the PDFDocumentProxy object is obtained.
     • The PDFPageInfo objects are created for each page of the PDF using doc.getPage(i) and stored in the pages state.

  2. Where and how are annotations loaded?

     • Annotations are loaded using the REQUEST_ANNOTATOR_DATA_FOR_DOCUMENT GraphQL query in the Annotator.tsx component.
     • The useQuery hook from Apollo Client is used to fetch the annotator data based on the provided initial_query_vars.
     • The annotator_data received from the query contains information about existing text annotations, document label annotations, and relationships.
     • The annotations are transformed into ServerAnnotation, DocTypeAnnotation, and RelationGroup objects and stored in the pdfAnnotations state using setPdfAnnotations.

  3. Where is the PAWLs layer loaded?

     • The PAWLs layer is loaded in the Annotator.tsx component.
     • Inside the useEffect hook that runs when the openedDocument prop changes, the PAWLs layer is loaded using the getPawlsLayer function from api/rest.ts.
     • The getPawlsLayer function makes an HTTP GET request to fetch the PAWLs data file specified by openedDocument.pawlsParseFile.
     • The PAWLs data is expected to be an array of PageTokens objects, which contain token information for each page of the PDF.
     • The loaded PAWLs data is then used to create PDFPageInfo objects for each page, which include the page tokens.
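
To make the "array of PageTokens objects" concrete, here is a minimal sketch of validating a fetched PAWLs layer before building per-page data from it. The field names (`page.index`, `tokens[].x`, and so on) are assumptions modelled on the PAWLs JSON format rather than copied from the OpenContracts codebase, and `parsePawlsLayer` and `tokensForPage` are hypothetical helpers, not functions from api/rest.ts.

```typescript
// Assumed shapes, modelled on the PAWLs JSON format; not taken from the codebase.
interface Token { x: number; y: number; width: number; height: number; text: string }
interface PageTokens {
  page: { width: number; height: number; index: number };
  tokens: Token[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function isToken(value: unknown): value is Token {
  return (
    isRecord(value) &&
    typeof value["text"] === "string" &&
    typeof value["x"] === "number" &&
    typeof value["y"] === "number" &&
    typeof value["width"] === "number" &&
    typeof value["height"] === "number"
  );
}

function isPageTokens(value: unknown): value is PageTokens {
  if (!isRecord(value)) {
    return false;
  }
  const page = value["page"];
  const tokens = value["tokens"];
  if (!isRecord(page) || !Array.isArray(tokens)) {
    return false;
  }
  return (
    typeof page["index"] === "number" &&
    typeof page["width"] === "number" &&
    typeof page["height"] === "number" &&
    tokens.every(isToken)
  );
}

// Validate the fetched JSON text before handing it to page-building code.
function parsePawlsLayer(json: string): PageTokens[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed) || !parsed.every(isPageTokens)) {
    throw new Error("PAWLs layer must be an array of PageTokens objects");
  }
  return parsed;
}

// Hypothetical helper: tokens for one page, or [] when the page has no entry.
function tokensForPage(layer: PageTokens[], pageIndex: number): Token[] {
  const match = layer.filter((p) => p.page.index === pageIndex)[0];
  return match ? match.tokens : [];
}

const sampleLayer = parsePawlsLayer(JSON.stringify([
  { page: { width: 612, height: 792, index: 0 }, tokens: [{ x: 72, y: 72, width: 40, height: 12, text: "Master" }] },
  { page: { width: 612, height: 792, index: 1 }, tokens: [{ x: 72, y: 72, width: 55, height: 12, text: "Governing" }] },
]));
```

Validating up front means a malformed or partially written PAWLs file fails loudly at load time instead of producing pages that silently have no selectable tokens.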

High-level Components Overview

  • The Annotator component is the top-level component that manages the state and data loading for the annotator.
  • It renders the PDFView component, which is responsible for displaying the PDF and annotations.
  • The PDFView component renders various sub-components, such as LabelSelector, DocTypeLabelDisplay, AnnotatorSidebar, AnnotatorTopbar, and PDF.
  • The PDF component renders individual Page components for each page of the PDF.
  • Each Page component renders Selection and SearchResult components for annotations and search results, respectively.
  • The AnnotatorSidebar component displays the list of annotations, relations, and a search widget.
  • The PDFStore and AnnotationStore are context providers that hold the PDF and annotation data, respectively.

Specific Component Deep Dives

PDFView.tsx

The PDFView component is a top-level component that renders the PDF document with annotations, relations, and text search capabilities. It manages the state and functionality related to annotations, relations, and user interactions. Here's a detailed explanation of how the component works:

  1. The PDFView component receives several props, including permissions, callbacks for CRUD operations on annotations and relations, refs for container and selection elements, and various configuration options.

  2. It initializes several state variables using the useState hook, including:

     • selectionElementRefs and searchResultElementRefs: Refs for annotation selections and search results.
     • pageElementRefs: Refs for individual PDF pages.
     • scrollContainerRef: Ref for the scroll container.
     • textSearchMatches and searchText: State for text search matches and search text.
     • selectedAnnotations and selectedRelations: State for currently selected annotations and relations.
     • pageSelection and pageSelectionQueue: State for current page selection and queued selections.
     • pdfPageInfoObjs: State for PDF page information objects.
     • Various other state variables for active labels, relation modal visibility, and annotation options.

  3. The component defines several functions for updating state and handling user interactions, such as:

     • insertSelectionElementRef, insertSearchResultElementRefs, and insertPageRef: Functions to add refs for selections, search results, and pages.
     • onError: Error handling callback.
     • advanceTextSearchMatch and reverseTextSearchMatch: Functions to navigate through text search matches.
     • onRelationModalOk and onRelationModalCancel: Callbacks for relation modal actions.
     • createMultiPageAnnotation: Function to create a multi-page annotation from queued selections.

  4. The component uses the useEffect hook to handle side effects, such as:

     • Setting the scroll container ref on load.
     • Listening for changes in the shift key and triggering annotation creation.
     • Updating text search matches when the search text changes.

  5. The component renders the PDF document and its related components using the PDFStore and AnnotationStore contexts:

     • The PDFStore context provides the PDF document, pages, and error handling.
     • The AnnotationStore context provides annotation-related state and functions.

  6. The component renders the following main sections:

     • LabelSelector: Allows the user to select the active label for annotations.
     • DocTypeLabelDisplay: Displays the document type labels.
     • AnnotatorSidebar: Sidebar component for managing annotations and relations.
     • AnnotatorTopbar: Top bar component for additional controls and options.
     • PDF: The actual PDF component that renders the PDF pages and annotations.

  7. The PDF component, defined in PDF.tsx, is responsible for rendering the PDF pages and annotations. It receives props from the PDFView component, such as permissions, configuration options, and callbacks.

  8. The PDF component maps over each page of the PDF document and renders a Page component for each page, passing the necessary props.

  9. The Page component, also defined in PDF.tsx, is responsible for rendering a single page of the PDF document along with its annotations and search results. It handles mouse events for creating and modifying annotations.

  10. The PDFView component also renders the RelationModal component when the active relation label is set and the user has the necessary permissions. The modal allows the user to create or modify relations between annotations.

PDF.tsx

PDF renders the actual PDF document with annotations and text search capabilities. PDFView (see above) is what actually interacts with the backend / API.

  1. The PDF component receives several props:
     • shiftDown: Indicates whether the Shift key is pressed (optional).
     • doc_permissions and corpus_permissions: Specify the permissions for the document and corpus, respectively.
     • read_only: Determines if the component is in read-only mode.
     • show_selected_annotation_only: Specifies whether to show only the selected annotation.
     • show_annotation_bounding_boxes: Specifies whether to show annotation bounding boxes.
     • show_annotation_labels: Specifies the behavior for displaying annotation labels.
     • setJumpedToAnnotationOnLoad: A callback function to set the jumped-to annotation on load.
  2. The PDF component retrieves the PDF document and pages from the PDFStore context.
  3. It maps over each page of the PDF document and renders a Page component for each page, passing the necessary props.
  4. The Page component is responsible for rendering a single page of the PDF document along with its annotations and search results.
  5. Inside the Page component:
     • It creates a canvas element using the useRef hook to render the PDF page.
     • It retrieves the annotations for the current page from the AnnotationStore context.
     • It defines a ConvertBoundsToSelections function that converts the selected bounds to annotations and tokens.
     • It uses the useEffect hook to set up the PDF page rendering and event listeners for resizing and scrolling.
     • It renders the PDF page canvas, annotations, search results, and queued selections.
  6. The Page component renders the following sub-components:
     • PageAnnotationsContainer: A styled container for the page annotations.
     • PageCanvas: A styled canvas element for rendering the PDF page.
     • Selection: Represents a single annotation selection on the page.
     • SearchResult: Represents a search result on the page.
  7. The Page component handles mouse events for creating and modifying annotations:
     • On mouseDown, it initializes the selection if the necessary permissions are granted and the component is not in read-only mode.
     • On mouseMove, it updates the selection bounds if a selection is active.
     • On mouseUp, it adds the completed selection to the pageSelectionQueue and triggers the creation of a multi-page annotation if the Shift key is not pressed.
  8. The Page component also handles fetching more annotations for previous and next pages using the FetchMoreOnVisible component.
  9. The SelectionBoundary and SelectionTokens components are used to render the annotation boundaries and tokens, respectively.
  10. The PDFPageRenderer class is responsible for rendering a single PDF page on the canvas. It manages the rendering tasks and provides methods for canceling and rescaling the rendering.
  11. The getPageBoundsFromCanvas function calculates the bounding box of the page based on the canvas dimensions and its parent container.
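
The idea of converting selected bounds to tokens can be sketched as a rectangle-overlap test. This is an illustration under stated assumptions, not the ConvertBoundsToSelections implementation: `normalizeBounds`, `toPageUnits`, and `tokensInBounds` are hypothetical helpers, the token shape is assumed from the PAWLs format, and `scale` is assumed to mean rendered pixels per PDF unit.

```typescript
// Assumed shapes for illustration only.
interface Bounds { left: number; top: number; right: number; bottom: number }
interface PageToken { x: number; y: number; width: number; height: number; text: string }

// A drag can run in any direction, so enforce left <= right and top <= bottom.
function normalizeBounds(b: Bounds): Bounds {
  return {
    left: Math.min(b.left, b.right),
    top: Math.min(b.top, b.bottom),
    right: Math.max(b.left, b.right),
    bottom: Math.max(b.top, b.bottom),
  };
}

// Convert on-screen pixels back to PDF page units.
function toPageUnits(b: Bounds, scale: number): Bounds {
  return { left: b.left / scale, top: b.top / scale, right: b.right / scale, bottom: b.bottom / scale };
}

// Return the indices of every token whose box overlaps the selection.
function tokensInBounds(tokens: PageToken[], selection: Bounds, scale: number): number[] {
  const sel = toPageUnits(normalizeBounds(selection), scale);
  const hits: number[] = [];
  tokens.forEach((t, i) => {
    const overlaps =
      t.x < sel.right && t.x + t.width > sel.left &&
      t.y < sel.bottom && t.y + t.height > sel.top;
    if (overlaps) {
      hits.push(i);
    }
  });
  return hits;
}

// Usage: a right-to-left, bottom-to-top drag on a page rendered at 2x.
const pageTokens: PageToken[] = [
  { x: 10, y: 10, width: 30, height: 10, text: "Hello" },
  { x: 50, y: 10, width: 30, height: 10, text: "World" },
  { x: 10, y: 40, width: 30, height: 10, text: "Below" },
];
const dragged: Bounds = { left: 170, top: 50, right: 90, bottom: 10 };
const selectedTokenIndices = tokensInBounds(pageTokens, dragged, 2);
```

In this example the drag covers page units 45 to 85 horizontally and 5 to 25 vertically, so only the token at index 1 ("World") is selected.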
\ No newline at end of file diff --git a/architecture/under-the-hood/index.html b/architecture/under-the-hood/index.html index 0db8dd3f..6510bde3 100755 --- a/architecture/under-the-hood/index.html +++ b/architecture/under-the-hood/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Application Architecture

Data Layers

OpenContracts builds on the work that AllenAI did with PAWLs to create a consistent, shared source of truth for data labeling and NLP algorithms, regardless of whether they are layout-aware, like LayoutLM, or not, like BERT, spaCy, or LexNLP. One of the challenges with natural language documents, particularly contracts, is that there are so many ways to structure any given file (e.g. .docx or .pdf) to represent exactly the same text. Even an identical document with identical formatting in a format like .pdf can have a significantly different file structure depending on what software was used to create it, the user's choices, and the software's own choices in deciding how to structure its output.

PAWLs and OpenContracts attempt to solve this by sending every document through a processing pipeline that extracts and structures text and layout information in a uniform, consistent way. Using the parsing engine of Grobid and the open-source OCR engine Tesseract, every document is re-OCRed (to produce a consistent output for the same inputs), and the "tokens" (text surrounded on all sides by whitespace, typically a word) in the OCRed document are stored as JSON with their page and positional information. In OpenContracts, we refer to this JSON layer that combines text and positional data as the "PAWLs" layer. We also use the PAWLs layer to build the full text extract of the document, which we store as the "text layer".

Thus, in OpenContracts, every document has three files associated with it - the original pdf, a json file (the "PAWLs layer"), and a text file (the "text layer"). Because the text layer is built from the PAWLs layer, we can easily translate back and forth from text to positional information - e.g. given the start and end of a span of text in the text layer, we can accurately say which PAWLs tokens the span includes and, based on that, the x,y position of the span in the document.

This lets us take the outputs of many NLP libraries, which typically produce only start and stop ranges, and layer them perfectly on top of the original pdf. With the PAWLs tokens as the source of truth, we can seamlessly transition from text-only to layout-aware text.
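The translation from a text span back to page positions can be sketched as follows. This is a minimal illustration, not the OpenContracts implementation: the Token fields, the single-space separator used to build the text layer, and the helper names (build_text_layer, tokens_in_span, bounding_box) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for one PAWLs token: text plus its box on a page."""
    page: int
    text: str
    x: float
    y: float
    width: float
    height: float


def build_text_layer(tokens):
    """Join tokens with single spaces and record each token's (start, end) offsets.

    Assumption: the text layer separates tokens with exactly one space.
    """
    parts, offsets, cursor = [], [], 0
    for tok in tokens:
        if parts:
            cursor += 1  # account for the space separator
        offsets.append((cursor, cursor + len(tok.text)))
        parts.append(tok.text)
        cursor += len(tok.text)
    return " ".join(parts), offsets


def tokens_in_span(tokens, offsets, start, end):
    """Return every token whose character range overlaps [start, end)."""
    return [tok for tok, (s, e) in zip(tokens, offsets) if s < end and e > start]


def bounding_box(tokens):
    """Smallest box (left, top, right, bottom) covering tokens on a single page."""
    left = min(t.x for t in tokens)
    top = min(t.y for t in tokens)
    right = max(t.x + t.width for t in tokens)
    bottom = max(t.y + t.height for t in tokens)
    return left, top, right, bottom


if __name__ == "__main__":
    page_tokens = [
        Token(0, "Governing", 72, 100, 60, 12),
        Token(0, "Law", 136, 100, 24, 12),
        Token(0, "applies", 164, 100, 40, 12),
    ]
    text, offsets = build_text_layer(page_tokens)
    start = text.find("Governing Law")
    hits = tokens_in_span(page_tokens, offsets, start, start + len("Governing Law"))
    print([t.text for t in hits], bounding_box(hits))
```

An NLP library that only reports character offsets can be mapped onto the page this way; a span that crosses pages would need one box per page rather than a single bounding box.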

Limitations

OCR is not perfect. By only accepting pdf inputs and OCRing every document, we ignore any text embedded in the pdf. To the extent that text was exported accurately from whatever tool was used to write the document, this introduces some potential loss of fidelity - e.g. if you've ever seen an OCR engine mistake an 'O' for a '0', or an 'I' for a '1'. Typically, however, the incidence of such errors is fairly small, and it's a price we have to pay for the power of being able to effortlessly layer NLP outputs that have no layout awareness on top of complex, visual layouts.

\ No newline at end of file diff --git a/configuration/add-users/index.html b/configuration/add-users/index.html index 255bc27e..18638d70 100755 --- a/configuration/add-users/index.html +++ b/configuration/add-users/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Add Users

Adding More Users

You can use the same User admin page described above to create new users. Alternatively, go back to the main admin page http://localhost:8000/admin and, under the User section, click the "+Add" button:

Then, follow the on-screen instructions:

When you're done, the username and password you provided can be used to log in.

OpenContracts is currently not built to allow users to self-register unless you use Auth0 authentication. When managing users yourself, you'll need to add, remove and modify users via the admin panels.

\ No newline at end of file diff --git a/configuration/choose-an-authentication-backend/index.html b/configuration/choose-an-authentication-backend/index.html index 12905be5..e304234d 100755 --- a/configuration/choose-an-authentication-backend/index.html +++ b/configuration/choose-an-authentication-backend/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Configure Authentication Backend

Select Authentication System via Env Variables

For authentication and authorization, you have two choices:

  1. You can configure an Auth0 account and use Auth0 to authenticate users, in which case anyone who is permitted to authenticate via your Auth0 setup can log in and automatically get an account.
  2. Or, you can require a username and password for each user, and the OpenContracts backend can provide user authentication and authorization. With this option, there is currently no supported sign-up method; you'll need to use the admin dashboard (see the "Adding Users" section).

Auth0 Auth Setup

You need to configure three separate applications on Auth0's platform:

  1. Configure the SPA as an application. You'll need the App Client ID.
  2. Configure the API. You'll need the API Audience.
  3. Configure an M2M application to access the Auth0 Management API. This is used to fetch user details. You'll need the API_ID for the M2M application and the Client Secret for the M2M app.

You'll also need your Auth0 tenant ID (assuming it's the same for all three applications, though you could, in theory, host them in different tenants). These directions are not comprehensive, so, if you're not familiar with Auth0, we recommend you disable Auth0 for the time being and use username and password.

To enable and configure Auth0 Authentication, you'll need to set the following env variables in your .env file (the .django file in .envs/.production or .envs/.local, depending on your target environment). Our sample .envs only show these fields in the .production sample, but you could use them in the .local env file too:

  1. USE_AUTH0 - set to true to enable Auth0
  2. AUTH0_CLIENT_ID - should be the client ID configured on Auth0
  3. AUTH0_API_AUDIENCE - Configured API audience
  4. AUTH0_DOMAIN - domain of your configured Auth0 application
  5. AUTH0_M2M_MANAGEMENT_API_SECRET - secret for the auth0 Machine to Machine (M2M) API
  6. AUTH0_M2M_MANAGEMENT_API_ID - ID for Auth0 Machine to Machine (M2M) API
  7. AUTH0_M2M_MANAGEMENT_GRANT_TYPE - set to client_credentials
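Before starting the stack, it can help to confirm that none of these variables was left out. The snippet below is a small, hypothetical pre-flight check and not part of OpenContracts: the helper names (as_bool, missing_auth0_vars) and the set of strings treated as "true" are assumptions, and the backend may parse booleans differently.

```python
import os

# Variable names taken from the list above; USE_AUTH0 is handled separately.
REQUIRED_AUTH0_VARS = (
    "AUTH0_CLIENT_ID",
    "AUTH0_API_AUDIENCE",
    "AUTH0_DOMAIN",
    "AUTH0_M2M_MANAGEMENT_API_SECRET",
    "AUTH0_M2M_MANAGEMENT_API_ID",
    "AUTH0_M2M_MANAGEMENT_GRANT_TYPE",
)


def as_bool(value):
    """Interpret common truthy strings (assumed set; adjust to your env parser)."""
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


def missing_auth0_vars(env=None):
    """Return the required Auth0 variables that are unset or empty.

    Returns an empty list when USE_AUTH0 is not enabled, since the other
    variables are then not needed.
    """
    env = os.environ if env is None else env
    if not as_bool(env.get("USE_AUTH0", "false")):
        return []
    return [name for name in REQUIRED_AUTH0_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_auth0_vars()
    if missing:
        print("Missing Auth0 settings: " + ", ".join(missing))
    else:
        print("Auth0 settings look complete (or Auth0 is disabled).")
```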

Detailed Explanation of Auth0 Implementation

To get Auth0 to work nicely with Graphene, we modified the graphql_jwt backend to support syncing remote user metadata with a local user, similar to the default Django RemoteUserMiddleware. We're keeping the graphql_jwt graphene middleware in its entirety, as it fetches the token and then passes it along to the Django authentication backend. That Django backend is what we're modifying to decode the JWT token against the Auth0 settings, then check whether a local user exists and, if not, create it.

Here's the order of operations in the original Graphene backend provided by graphql_jwt:

  1. Backend's authenticate method is called from the graphene middleware via django (from django.contrib.auth import authenticate)
  2. token is retrieved via .utils get_credentials
  3. if token is not None, get_user_by_token in shortcuts module is called
    1. "Payload" is retrieved via utils.get_payload
    2. User is requested via utils.get_user_by_payload
    3. username is retrieved from payload via auth0_settings.JWT_PAYLOAD_GET_USERNAME_HANDLER
    4. user object is retrieved via auth0_settings.JWT_GET_USER_BY_NATURAL_KEY_HANDLER

We modified a couple of things:

  1. The decode method called in step 3.1 needs to be modified to decode with Auth0 secrets and settings.
  2. get_user_by_payload needs to be modified in several ways:
    1. The user object must use RemoteUserMiddleware logic and, if everything from Auth0 decodes properly, check to see if a user with that e-mail exists and, if not, create it. Upon completion of this, try to sync user data with Auth0.
    2. Return the created or retrieved user object, as the original method did.
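The get-or-create-then-sync flow described for get_user_by_payload can be illustrated with a framework-free sketch. Everything here is an assumption made for illustration: the payload keys ("sub", "email"), the in-memory store standing in for Django's user model, and the fetch_remote_profile callable standing in for a call to the Auth0 Management API. The real backend works against Django and graphql_jwt.

```python
from dataclasses import dataclass


@dataclass
class LocalUser:
    """Hypothetical local user record."""
    username: str
    email: str
    name: str = ""


class InMemoryUserStore:
    """Stand-in for the Django user table, keyed by e-mail."""

    def __init__(self):
        self._by_email = {}

    def get_or_create(self, email, username):
        user = self._by_email.get(email)
        if user is not None:
            return user, False
        user = LocalUser(username=username, email=email)
        self._by_email[email] = user
        return user, True


def get_user_by_payload(payload, store, fetch_remote_profile):
    """Return the local user for a decoded token payload, creating it if needed."""
    username = payload.get("sub")
    email = payload.get("email")
    if not username or not email:
        raise ValueError("payload is missing 'sub' or 'email'")

    # Check whether a user with this e-mail exists and, if not, create it.
    user, _created = store.get_or_create(email=email, username=username)

    # Try to sync remote metadata; a failed sync should not block login.
    try:
        profile = fetch_remote_profile(username)
        user.name = profile.get("name", user.name)
    except Exception:
        pass  # best-effort sync in this sketch; real code would log the error

    # Return the created or retrieved user, as the original method did.
    return user
```

Treating the sync as best effort means a temporary outage of the remote profile lookup leaves the locally stored user data unchanged instead of rejecting the request.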

Django-Based Authentication Setup

The only thing you need to do for this is toggle the two Auth0-related environment variables:

  1. For the backend environment, set USE_AUTH0=False in your environment (either via an environment variable file or directly in your environment via the console).
  2. For the frontend environment, set REACT_APP_USE_AUTH0=false in your environment (either via an environment variable file or directly in your environment via the console).

Note

As noted elsewhere, users cannot sign up on their own. You need to log into the admin dashboard - e.g. http://localhost:8000/admin - and add users manually.

\ No newline at end of file diff --git a/configuration/choose-and-configure-docker-stack/index.html b/configuration/choose-and-configure-docker-stack/index.html index 18656ea5..c2078a45 100755 --- a/configuration/choose-and-configure-docker-stack/index.html +++ b/configuration/choose-and-configure-docker-stack/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Choose and Configure Docker Compose Stack

Deployment Options

OpenContracts is designed to be deployed using docker-compose. You can run it locally or in a production environment. Follow the instructions below for a local environment if you just want to test it or you want to use it for yourself and don't intend to make the application available to other users via the Internet.

Local Deployment

Quick Start with Default Settings

A "local" deployment runs on your personal computer and is not meant to be accessed over the Internet. If you don't need to configure anything, just follow the quick start guide above to get up and running.

Setup .env Files

Backend

After cloning this repo to a machine of your choice, create a folder for your environment files in the repo root. You'll need ./.envs/.local/.django and ./.envs/.local/.postgres. Use the samples in ./documentation/sample_env_files/local as guidance. Note: you'll need to replace the placeholder passwords and users where noted, but otherwise minimal config should be required.

Frontend

In the ./frontend folder, you also need to create a single .env file which holds the configuration for your login method as well as certain feature switches (e.g. turn off imports). We've included a sample using Auth0 and another sample using Django's auth backend. Local and production deployments are essentially the same, but the root url of the backend will change from localhost to wherever you're hosting the application in production.

Build the Stack

Once your .env files are set up, build the stack using docker-compose:

$ docker-compose -f local.yml build

Then, run migrations (to set up the database):

$ docker-compose -f local.yml run django python manage.py migrate

Then, create a superuser account that can log in to the admin dashboard (in a local deployment this is available at http://localhost:8000/admin) by typing this command and following the prompts:

$ docker-compose -f local.yml run django python manage.py createsuperuser
 

Finally, bring up the stack:

$ docker-compose -f local.yml up
 

You should now be able to access the OpenContracts frontend by visiting http://localhost:3000.

Production Environment

The production environment is designed to be public-facing and exposed to the Internet, so there are quite a number more configurations required than a local deployment, particularly if you use an AWS S3 storage backend or the Auth0 authentication system.

After cloning this repo to a machine of your choice, configure the production .env files as described above.

You'll also need to configure your website url. This needs to be done in a few places.

First, in opencontractserver/contrib/migrations, you'll find a file called 0003_set_site_domain_and_name.py. BEFORE running any of your migrations, you should modify the domain and name defaults you'll find in update_site_forward:

def update_site_forward(apps, schema_editor):
    """Set site domain and name."""
    Site = apps.get_model("sites", "Site")
    Site.objects.update_or_create(
        id=settings.SITE_ID,
        defaults={
            "domain": "opencontracts.opensource.legal",
            "name": "OpenContractServer",
        },
    )
diff --git a/configuration/choose-storage-backend/index.html b/configuration/choose-storage-backend/index.html
index 1373c826..8d248d7d 100755
--- a/configuration/choose-storage-backend/index.html
+++ b/configuration/choose-storage-backend/index.html
@@ -7,6 +7,6 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Configure Storage Backend

Select and Setup Storage Backend

You can use Amazon S3 as a file storage backend (if you set the env flag USE_AWS=True, more on that below), or you can use the local storage of the host machine via a Docker volume.

AWS Storage Backend

If you want to use AWS S3 to store files (primarily pdfs, but also exports, tokens and txt files), you will need an Amazon AWS account to set up S3. This README does not cover the AWS side of configuration, but there are a number of tutorials and guides on configuring AWS for use with a Django project.

Once you have an S3 bucket configured, you'll need to set the following env variables in your .env file (the .django file in .envs/.production or .envs/.local, depending on your target environment). Our sample .envs only show these fields in the .production samples, but you could use them in the .local env file too.

Here are the variables you need to set to enable AWS S3 storage:

  1. USE_AWS - set to true since you're using AWS, otherwise the backend will use a docker volume for storage.
  2. DJANGO_AWS_ACCESS_KEY_ID - the access key ID created by AWS when you set up your IAM user (see tutorials above).
  3. DJANGO_AWS_SECRET_ACCESS_KEY - the secret access key created by AWS when you set up your IAM user (see tutorials above).
  4. DJANGO_AWS_STORAGE_BUCKET_NAME - the name of the AWS bucket you created to hold the files.
  5. DJANGO_AWS_S3_REGION_NAME - the region of the AWS bucket you configured.

Django Storage Backend

Setting USE_AWS=false will use the disk space in the django container. When using the local docker compose stack, the celery workers and django containers share the same disk, so this works fine. Our production configuration would not work properly with USE_AWS=false, however, as each container has its own disk.

\ No newline at end of file diff --git a/configuration/configure-admin-users/index.html b/configuration/configure-admin-users/index.html index cfa7ffa4..5f87d460 100755 --- a/configuration/configure-admin-users/index.html +++ b/configuration/configure-admin-users/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Configure Admin Users

Gremlin Admin Dashboard

Gremlin's backend is built on Django, which has its own powerful admin dashboard. This dashboard is not meant for end-users and should only be used by admins. You can access the admin dashboard by going to the /admin page - e.g. opencontracts.opensource.legal/admin or http://localhost:8000/admin. For the most part, you shouldn't need to use the admin dashboard and should only go in here if you're experiencing errors or unexpected behavior and want to look at the detailed contents of the database to see if it sheds any light on what's happening with a given corpus, document, etc.

By default, Gremlin creates an admin user for you. If you don't specify the username and password in your environment on first boot, it'll use system defaults. You can customize the default username and password via environment variables or after the system boots using the admin dash.

Configure Username and Password Prior to First Deployment

If the variable DJANGO_SUPERUSER_USERNAME is set, that will be the default admin user created on startup (the first time you run docker-compose -f local.yml up). The repo ships with a default superuser username of admin. The default password is set using the DJANGO_SUPERUSER_PASSWORD variable. The environment files for local deployments (but not production) include a default password of Openc0ntracts_def@ult. You should change this in the environment file before the first start, or follow the instructions below to change it after the first start.

If you modify these environment variables in the environment file BEFORE running the docker-compose up command for the first time, your initial superuser will have the username, email and/or password you specify. If you don't modify the defaults, you can change them after you have created them via the admin dashboard (see below).

After First Deployment via Admin Dashboard

Once the default superuser has been created, you'll need to use the admin dashboard to modify it.

To manage users, including changing the password, you'll need to access the backend admin dashboard. OpenContracts is built on Django, which ships with Django Admin, a tool to manage low-level object data and users. It doesn't provide the rich, document focused UI/UX our frontend does, but it does let you edit and delete objects created on the frontend if, for any reason, you are unable to fix something done by a frontend user (e.g. a corrupt file is uploaded and cannot be parsed or rendered properly on the frontend).

To update your users, first login to the admin panel:

Then, in the left-hand navbar, find the entry for "Users" and click on it:

Then, you'll see a list of all users for this instance. You should see your admin user and an "Anonymous" user. The Anonymous user is required for public browsing of objects with their is_public field set to True. The Anonymous user cannot see other objects.

Click on the admin user to bring up the detailed user view:

Now you can click the "WHAT AM I CALLED" button to bring up a dialog to change the user password.

\ No newline at end of file diff --git a/configuration/configure-gremlin/index.html b/configuration/configure-gremlin/index.html index e9f0f274..1f6fac59 100755 --- a/configuration/configure-gremlin/index.html +++ b/configuration/configure-gremlin/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Configure Gremlin Analyzer

Gremlin is a separate project by OpenSource Legal that provides a standard API for accessing NLP capabilities. This lets us wrap multiple NLP engines and techniques in the same API, so we can build tools that readily consume the outputs of very different NLP libraries (e.g. a Transformers-based model like BERT and tools like spaCy and LexNLP can all be deployed on Gremlin, and the outputs from all three can readily be rendered in OpenContracts).

OpenContracts is designed to work with Gremlin out-of-the-box. We have sample compose yaml files showing how to do this on a local machine (local_deploy_with_gremlin.yaml) and as a web-facing application (production_deploy_with_gremlin.yaml).

When you add a new Gremlin Engine to the database, OpenContracts will automatically query it for its installed analyzers and labels. These will then be available within OpenContracts, and you can use an analyzer to analyze any OpenContracts corpus.

While we have plans to automatically "install" the default Gremlin on first boot, currently you must manually go into the OpenContracts admin dash and add the Gremlin. Thankfully, this is an easy process:

  1. In your environment file, make sure you set CALLBACK_ROOT_URL_FOR_ANALYZER
    1. For local deploy, use CALLBACK_ROOT_URL_FOR_ANALYZER=http://localhost:8000
    2. For production deploy, use http://django:5000. Why the change? Well, in our local docker compose stack, the host is localhost and the django development server runs on port 8000. In production, we want Gremlin to communicate with the OpenContracts container ("django") via its hostname on the docker compose stack's network. The production OpenContracts container also uses gunicorn on port 5000 instead of the development server on port 8000, so the port changes too.
  2. Go to the admin page:
  3. Click "Add+" in the Gremlin row to bring up the Add Gremlin Engine form. You just need to set the creator and Url fields (the url for our default config is http://gremlinengine:5000). If, for some reason, you don't want the analyzer to be visible to any unauthenticated user, unselect the is_public box:
  4. This will automatically kick off an install process that runs in the background. When it's complete, you'll see the "Install Completed" Field change. It should take a second or two. At the moment, we don't handle errors in this process, so, if it doesn't complete successfully in 30 seconds, there is probably a misconfiguration somewhere. We plan to improve our error handling for these backend installation processes.
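The two values from step 1 and the 30-second rule of thumb from step 4 can be captured in a small sketch. Both helpers are hypothetical and not part of OpenContracts or Gremlin. How you check the "Install Completed" field (for example, by refreshing the admin page) is up to you, so it is passed in as a callable.

```python
import time
from typing import Callable


def callback_root_url(production: bool) -> str:
    """Return the CALLBACK_ROOT_URL_FOR_ANALYZER value described in step 1."""
    if production:
        # Gremlin reaches the "django" container by hostname; gunicorn listens on 5000.
        return "http://django:5000"
    # Local stack: the Django development server is reachable on localhost:8000.
    return "http://localhost:8000"


def wait_for_install(is_installed: Callable[[], bool],
                     timeout: float = 30.0,
                     interval: float = 1.0) -> bool:
    """Poll until the install is reported complete or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_installed():
            return True
        time.sleep(interval)
    # One last check; False here suggests a misconfiguration somewhere.
    return is_installed()
```

If wait_for_install returns False, the callback URL and the Gremlin url entered in the form are the first things to re-check.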

Note: in our example implementations, Gremlin is NOT encrypted or API Key secured to outside traffic. Per our docker compose config, it's not exposed to outside traffic either, so this shouldn't be a major concern. If you do expose the container to the host via your Docker Compose file, you should ensure you run the traffic through Traefik and set up API Key authentication.


\ No newline at end of file diff --git a/development/documentation/index.html b/development/documentation/index.html index efca45e7..e647d2dc 100755 --- a/development/documentation/index.html +++ b/development/documentation/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Documentation

Documentation Stack

We're using mkdocs to render our markdown into pretty, bite-sized pieces. The markdown lives in /docs in our repo. If you want to work on the docs you'll need to install the requirements in /requirements/docs.txt.

To have a live server while working on them, type:

mkdocs serve
 

Building Docs

To build an HTML website from your markdown that can be uploaded to a webhost (or a GitHub Page), just type:

mkdocs build
 

Deploying to GH Page

mkdocs makes it super easy to deploy your docs to a GitHub page.

Just run:

mkdocs gh-deploy
 

Dev Environment

We use Black and Flake8 for Python code styling. These are run via pre-commit before all commits. If you want to develop extensions or code based on OpenContracts, you'll need to set up pre-commit. First, make sure the requirements in ./requirements/local.txt are installed in your local environment.

Then, install pre-commit into your local git repo. From the root of the repo, run:

 $ pre-commit install
 
If you want to run pre-commit manually on all the code in the repo, use this command:

 $ pre-commit run --all-files
 

When you commit changes to your repo or our repo as a PR, pre-commit will run and ensure your code follows our style guide and passes linting.

Frontend Notes

Responsive Layout

The application was primarily designed to be viewed at around 1080p. We've built in some quick and dirty hacks to display a usable layout at other resolutions. A more thorough redesign / refactor is in order if there's sufficient interest. What's available now should handle a lot of situations OK. If performance or layout doesn't look great at your resolution, try a desktop browser at 1080p.

No Test Suite

As of our initial release, the test suite only tests the backend (and coverage is admittedly not as robust as we'd like). We'd like to add tests for the frontend, though this is a fairly large undertaking. We welcome any contributions on this front!


\ No newline at end of file diff --git a/development/test-suite/index.html b/development/test-suite/index.html index bb9f3677..cae68496 100755 --- a/development/test-suite/index.html +++ b/development/test-suite/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Test Suite

Our test suite is a bit sparse, but we're working to improve coverage on the backend. Frontend tests will likely take longer to implement. Our existing tests cover imports and a number of the utility functions for manipulating annotations. These tests are integrated into our GitHub Actions.

NOTE: use Python 3.10 or above, as pydantic and certain pre-3.10 type annotations do not play well together. Using from __future__ import annotations doesn't always solve the problem, and upgrading to Python 3.10 was a lot easier than trying to figure out why the from __future__ import didn't behave as expected.
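One plausible reason for this behavior, offered as an illustration rather than a diagnosis of our specific failures: from __future__ import annotations only turns annotations into strings, and libraries like pydantic still have to evaluate those strings at runtime. The sketch below writes the annotation as a string by hand to mimic that, and uses the standard library's typing.get_type_hints to evaluate it. The class and function names are made up for the example.

```python
import sys
import typing


class CellValue:
    # Written as a string, which is how annotations are stored once
    # `from __future__ import annotations` is in effect.
    value: "int | None" = None


def annotations_resolve(cls: type) -> bool:
    """Return True if the class's string annotations can be evaluated at runtime."""
    try:
        typing.get_type_hints(cls)
        return True
    except TypeError:
        # Before Python 3.10, evaluating `int | None` raises TypeError
        # because `type` objects do not support the `|` operator.
        return False


if __name__ == "__main__":
    print(sys.version_info[:2], annotations_resolve(CellValue))
```

On Python 3.10 or newer the annotation resolves. On older interpreters it does not, even though the module itself imports without error.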

To run the tests, check your test coverage, and generate an HTML coverage report:

 $ docker-compose -f local.yml run django coverage run -m pytest
  $ docker-compose -f local.yml run django coverage html
  $ open htmlcov/index.html
 

To run a specific test (e.g. test_analyzers):

 $ sudo docker-compose -f local.yml run django python manage.py test opencontractserver.tests.test_analyzers --noinput
diff --git a/extract_and_retrieval/document_data_extract/index.html b/extract_and_retrieval/document_data_extract/index.html
index aca8a5f7..996ceb9a 100755
--- a/extract_and_retrieval/document_data_extract/index.html
+++ b/extract_and_retrieval/document_data_extract/index.html
@@ -1,4 +1,4 @@
- Extracting Structured Data from Documents using LlamaIndex, AI Agents, and Marvin - OpenContracts      

Extracting Structured Data from Documents using LlamaIndex, AI Agents, and Marvin

We've added a powerful feature called "extract" that enables the generation of structured data grids from a list of documents using a combination of vector search, AI agents, and the Marvin library. This functionality is implemented in a Django application and leverages Celery for asynchronous task processing.

All credit for the inspiration of this feature goes to the fine folks at Nlmatics. They were some of the first pioneers working on datagrids from documents using a set of questions and custom transformer models. This implementation of their concept leverages newer techniques and better models, but hats off to them for coming up with a design like this 6 years ago!

Overview

The extract process involves the following key components:

  1. Document Corpus: A collection of documents from which structured data will be extracted.
  2. Fieldset: A set of columns defining the structure of the data to be extracted.
  3. LlamaIndex: A library used for efficient vector search and retrieval of relevant document sections.
  4. AI Agents: Intelligent agents that analyze the retrieved document sections and extract structured data.
  5. Marvin: A library that facilitates the parsing and extraction of structured data from text.

The extract process is initiated by creating an Extract object that specifies the document corpus and the fieldset defining the desired data structure. The process is then broken down into individual tasks for each document and column combination, allowing for parallel processing and scalability.

Detailed Walkthrough

Let's dive into the code and understand how the extract process works step by step.

1. Initiating the Extract Process

The run_extract function is the entry point for initiating the extract process. It takes the extract_id and user_id as parameters and performs the following steps:

  1. Retrieves the Extract object from the database based on the provided extract_id.
  2. Sets the started timestamp of the extract to the current time.
  3. Retrieves the fieldset associated with the extract, which defines the columns of the structured data grid.
  4. Retrieves the list of document IDs associated with the extract.
  5. Creates Datacell objects for each document and column combination, representing the individual cells in the structured data grid.
  6. Sets the appropriate permissions for each Datacell object based on the user's permissions.
  7. Kicks off the processing job for each Datacell by appending a task to the Celery task queue.
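The fan-out in steps 4 to 7 can be sketched without Django or Celery. This is a simplified stand-in, not the actual run_extract implementation: the dataclasses replace the real models, a plain list replaces the Celery queue, the function takes an object instead of an extract_id, and permissions are reduced to recording the creator.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import product
from typing import List, Optional, Tuple


@dataclass
class Datacell:
    document_id: int
    column_id: int
    creator_id: int  # stand-in for per-user permissions


@dataclass
class Extract:
    document_ids: List[int]
    column_ids: List[int]  # the columns of the extract's fieldset
    started: Optional[datetime] = None
    cells: List[Datacell] = field(default_factory=list)


def run_extract(extract: Extract, user_id: int, queue: List[Tuple[int, int]]) -> None:
    """Create one Datacell per (document, column) pair and enqueue a job for each."""
    extract.started = datetime.now(timezone.utc)
    for doc_id, col_id in product(extract.document_ids, extract.column_ids):
        extract.cells.append(Datacell(doc_id, col_id, creator_id=user_id))
        queue.append((doc_id, col_id))  # in the real system: a Celery task


if __name__ == "__main__":
    jobs: List[Tuple[int, int]] = []
    demo = Extract(document_ids=[1, 2], column_ids=[10, 20, 30])
    run_extract(demo, user_id=7, queue=jobs)
    print(len(demo.cells), "cells queued")
```

Because every cell is an independent job, the number of tasks grows as documents times columns, which is what makes the work easy to parallelize.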

2. Processing Individual Datacells

The llama_index_doc_query function is responsible for processing each individual Datacell. It performs the following steps:

  1. Retrieves the Datacell object from the database based on the provided cell_id.
  2. Sets the started timestamp of the datacell to the current time.
  3. Retrieves the associated document and initializes the necessary components for vector search and retrieval using LlamaIndex, including the embedding model, language model, and vector store.
  4. Performs a vector search to retrieve the most relevant document sections based on the search text or query specified in the datacell's column.
  5. Extracts the retrieved annotation IDs and associates them with the datacell as sources.
  6. If the datacell's column is marked as "agentic," it uses an AI agent to further analyze the retrieved document sections and extract additional information such as defined terms and section references.
  7. Prepares the retrieved text and additional information for parsing using the Marvin library.
  8. Depending on the specified output type of the datacell's column, it uses Marvin to extract the structured data as either a list or a single instance.
  9. Parses the extracted data and stores it in the datacell's data field based on the output type (e.g., BaseModel, str, int, bool, float).
  10. Sets the completed timestamp of the datacell to the current time.
  11. If an exception occurs during processing, it sets the failed timestamp and stores the error stacktrace in the datacell.
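The timestamp and error-handling pattern in steps 2 and 9 to 11 can be sketched as follows. Retrieval and the Marvin call are replaced by a plain callable, and the Cell dataclass and CONVERTERS table are hypothetical simplifications of the real Datacell model and output types.

```python
import traceback
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Optional


def _to_bool(value: Any) -> bool:
    return str(value).strip().lower() in {"true", "yes", "1"}


# Hypothetical mapping from a column's output type to a converter.
CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "str": str, "int": int, "float": float, "bool": _to_bool,
}


@dataclass
class Cell:
    output_type: str
    data: Any = None
    started: Optional[datetime] = None
    completed: Optional[datetime] = None
    failed: Optional[datetime] = None
    stacktrace: str = ""


def process_cell(cell: Cell, extract_raw: Callable[[], Any]) -> Cell:
    """Run one cell: record start, convert the result, record success or failure."""
    cell.started = datetime.now(timezone.utc)
    try:
        raw = extract_raw()  # stands in for vector search plus extraction
        cell.data = CONVERTERS[cell.output_type](raw)
        cell.completed = datetime.now(timezone.utc)
    except Exception:
        cell.failed = datetime.now(timezone.utc)
        cell.stacktrace = traceback.format_exc()
    return cell
```

Catching the exception inside the task means a single bad cell is recorded as failed without stopping the rest of the extract.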

3. Marking the Extract as Complete

Once all the datacells have been processed, the mark_extract_complete function is triggered by the Celery chord. It retrieves the Extract object based on the provided extract_id and sets the finished timestamp to the current time, indicating that the extract process is complete.
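The "run everything, then fire one callback" shape of a chord can be imitated with the standard library. The sketch below uses concurrent.futures as a stand-in for Celery, so it illustrates the control flow only. The run_chord helper and the dictionary-based extract are assumptions made for the example, and the real mark_extract_complete takes an extract_id rather than an object.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


def run_chord(header_tasks: List[Callable[[], Any]],
              callback: Callable[[List[Any]], Any]) -> Any:
    """Run every header task, then call `callback` once with all results."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda task: task(), header_tasks))
    return callback(results)


def mark_extract_complete(results: List[Any], extract: Dict[str, Any]) -> int:
    """Set the finished timestamp once all cell tasks have returned."""
    extract["finished"] = datetime.now(timezone.utc)
    return len(results)


if __name__ == "__main__":
    extract = {"id": 1, "finished": None}
    cell_tasks = [lambda: "cell-a", lambda: "cell-b", lambda: "cell-c"]
    count = run_chord(cell_tasks, lambda res: mark_extract_complete(res, extract))
    print(count, "cells processed; finished at", extract["finished"])
```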

Benefits and Considerations

The extract functionality offers several benefits:

  1. Structured Data Extraction: It enables the extraction of structured data from unstructured or semi-structured documents, making the information more accessible and actionable.
  2. Scalability: By breaking down the process into individual tasks for each document and column combination, it allows for parallel processing and scalability, enabling the handling of large document corpora.
  3. Flexibility: The use of fieldsets allows for the definition of custom data structures tailored to specific requirements.
  4. AI-Powered Analysis: The integration of AI agents and the Marvin library enables intelligent analysis and extraction of relevant information from the retrieved document sections.
  5. Asynchronous Processing: The use of Celery for asynchronous task processing ensures that the extract process doesn't block the main application and can be performed in the background.

However, there are a few considerations to keep in mind:

  1. Processing Time: Depending on the size of the document corpus and the complexity of the fieldset, the extract process may take a considerable amount of time to complete.
  2. Error Handling: Proper error handling and monitoring should be implemented to handle any exceptions or failures during the processing of individual datacells.
  3. Data Validation: The extracted structured data may require additional validation and cleansing steps to ensure its quality and consistency.

Next Steps

This is more of a proof of concept of the power of the existing universe of open source tooling. There are a number of more advanced techniques we can use to get better retrieval, more intelligent agentic behavior and more. Also, we haven't optimized for performance AT ALL, so any improvements in any of these areas would be welcome. Further, we expect the real power of an open source tool like OpenContracts to come from custom implementations of this functionality, so we'll also be working on more easily customizable and modular agents and retrieval pipelines so you can quickly select the right pipeline for the right task.


\ No newline at end of file diff --git a/extract_and_retrieval/intro_to_django_annotation_vector_store/index.html b/extract_and_retrieval/intro_to_django_annotation_vector_store/index.html index f0891fe2..9726261d 100755 --- a/extract_and_retrieval/intro_to_django_annotation_vector_store/index.html +++ b/extract_and_retrieval/intro_to_django_annotation_vector_store/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Making a Django Application Compatible with LlamaIndex using a Custom Vector Store

Introduction

In this walkthrough, we'll explore how the custom DjangoAnnotationVectorStore makes a Django application compatible with LlamaIndex, enabling powerful vector search capabilities within the application's structured annotation store. By leveraging the BasePydanticVectorStore class provided by LlamaIndex and integrating it with Django's ORM and the pg-vector extension for PostgreSQL, we can achieve efficient and scalable vector search functionality.

Understanding the DjangoAnnotationVectorStore

The DjangoAnnotationVectorStore is a custom implementation of LlamaIndex's BasePydanticVectorStore class, tailored specifically for a Django application. It allows the application to store and retrieve granular, visually-locatable annotations (x-y blocks) from PDF pages using vector search.
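As a rough picture of what "vector search over annotations" means, here is a minimal in-memory sketch that ranks annotation embeddings by cosine similarity to a query embedding. It is not how the store is implemented: in the application the distance computation is delegated to PostgreSQL via pg-vector, and the function names and toy two-dimensional vectors here are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_annotations(query: Sequence[float],
                      embeddings: Dict[int, Sequence[float]],
                      k: int = 2) -> List[Tuple[int, float]]:
    """Return the k (annotation_id, score) pairs most similar to the query."""
    scored = [(ann_id, cosine_similarity(query, vec))
              for ann_id, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    toy_embeddings = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.9, 0.1]}
    print(top_k_annotations([1.0, 0.0], toy_embeddings))
```

Because each result carries an annotation ID, a match can be traced back to the annotation's position on the PDF page.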

Let's break down the key components and features of the DjangoAnnotationVectorStore:

1. Inheritance from BasePydanticVectorStore

class DjangoAnnotationVectorStore(BasePydanticVectorStore):
     ...
 

By inheriting from BasePydanticVectorStore, the DjangoAnnotationVectorStore gains access to the base functionality and interfaces provided by LlamaIndex for vector stores. This ensures compatibility with LlamaIndex's query engines and retrieval methods.

2. Integration with Django's ORM

The DjangoAnnotationVectorStore leverages Django's Object-Relational Mapping (ORM) to interact with the application's database. It defines methods like _get_annotation_queryset() and _build_filter_query() to retrieve annotations from the database using Django's queryset API.

def _get_annotation_queryset(self) -> QuerySet:
     queryset = Annotation.objects.all()
diff --git a/extract_and_retrieval/querying_corpus/index.html b/extract_and_retrieval/querying_corpus/index.html
index 14773f9e..857a4073 100755
--- a/extract_and_retrieval/querying_corpus/index.html
+++ b/extract_and_retrieval/querying_corpus/index.html
@@ -1,4 +1,4 @@
- Answering Queries using LlamaIndex in a Django Application - OpenContracts      

Answering Queries using LlamaIndex in a Django Application

This markdown document explains how queries are answered in a Django application using LlamaIndex, the limitations of the approach, and how LlamaIndex is leveraged for this purpose.

Query Answering Process

  1. A user submits a query through the Django application, which is associated with a specific corpus (a collection of documents).
  2. The query is saved in the database as a CorpusQuery object, and a Celery task (run_query) is triggered to process the query asynchronously.
  3. Inside the run_query task:
    1. The CorpusQuery object is retrieved from the database using the provided query_id.
    2. The query's started timestamp is set to the current time.
    3. The necessary components for query processing are set up, including the embedding model (HuggingFaceEmbedding), language model (OpenAI), and vector store (DjangoAnnotationVectorStore).
    4. The DjangoAnnotationVectorStore is initialized with the corpus_id associated with the query, allowing it to retrieve the relevant annotations for the specified corpus.
    5. A VectorStoreIndex is created from the DjangoAnnotationVectorStore, which serves as the index for the query engine.
    6. A CitationQueryEngine is instantiated with the index, specifying the number of top similar results to retrieve (similarity_top_k) and the granularity of the citation sources (citation_chunk_size).
    7. The query is passed to the CitationQueryEngine, which processes the query and generates a response.
    8. The response includes the answer to the query along with the source annotations used to generate the answer.
    9. The source annotations are parsed and converted into a markdown format, with each citation linked to the corresponding annotation ID.
    10. The query's sources field is updated with the annotation IDs used in the response.
    11. The query's response field is set to the generated markdown text.
    12. The query's completed timestamp is set to the current time.
    13. If an exception occurs during the query processing, the query's failed timestamp is set, and the stack trace is stored in the stacktrace field.

Leveraging LlamaIndex

LlamaIndex is leveraged in the following ways to enable query answering in the Django application:

  1. Vector Store: LlamaIndex provides the BasePydanticVectorStore class, which serves as the foundation for the custom DjangoAnnotationVectorStore. The DjangoAnnotationVectorStore integrates with Django's ORM to store and retrieve annotations efficiently, allowing seamless integration with the existing Django application.
  2. Indexing: LlamaIndex's VectorStoreIndex is used to create an index from the DjangoAnnotationVectorStore. The index facilitates fast and efficient retrieval of relevant annotations based on the query.
  3. Query Engine: LlamaIndex's CitationQueryEngine is employed to process the queries and generate responses. The query engine leverages the index to find the most relevant annotations and uses the language model to generate a coherent answer.
  4. Embedding and Language Models: LlamaIndex provides abstractions for integrating various embedding and language models. In this implementation, the HuggingFaceEmbedding and OpenAI models are used, but LlamaIndex allows flexibility in choosing different models based on requirements.

By leveraging LlamaIndex, the Django application benefits from a structured and efficient approach to query answering. LlamaIndex provides the necessary components and abstractions to handle vector storage, indexing, and query processing, allowing the application to focus on integrating these capabilities into its existing architecture.

Answering Queries using LlamaIndex in a Django Application

This markdown document explains how queries are answered in a Django application using LlamaIndex, the limitations of the approach, and how LlamaIndex is leveraged for this purpose.

Query Answering Process

  1. A user submits a query through the Django application, which is associated with a specific corpus (a collection of documents).
  2. The query is saved in the database as a CorpusQuery object, and a Celery task (run_query) is triggered to process the query asynchronously.
  3. Inside the run_query task:
    1. The CorpusQuery object is retrieved from the database using the provided query_id.
    2. The query's started timestamp is set to the current time.
    3. The necessary components for query processing are set up, including the embedding model (HuggingFaceEmbedding), language model (OpenAI), and vector store (DjangoAnnotationVectorStore).
    4. The DjangoAnnotationVectorStore is initialized with the corpus_id associated with the query, allowing it to retrieve the relevant annotations for the specified corpus.
    5. A VectorStoreIndex is created from the DjangoAnnotationVectorStore, which serves as the index for the query engine.
    6. A CitationQueryEngine is instantiated with the index, specifying the number of top similar results to retrieve (similarity_top_k) and the granularity of the citation sources (citation_chunk_size).
    7. The query is passed to the CitationQueryEngine, which processes the query and generates a response.
    8. The response includes the answer to the query along with the source annotations used to generate the answer.
    9. The source annotations are parsed and converted into a markdown format, with each citation linked to the corresponding annotation ID.
    10. The query's sources field is updated with the annotation IDs used in the response.
    11. The query's response field is set to the generated markdown text.
    12. The query's completed timestamp is set to the current time.
    13. If an exception occurs during the query processing, the query's failed timestamp is set, and the stack trace is stored in the stacktrace field.
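
The bookkeeping described in the steps above (started, completed and failed timestamps, plus the stored stack trace) can be sketched without Django or LlamaIndex. This is a minimal illustration, not the project's run_query task: QueryRecord is a hypothetical stand-in for CorpusQuery that carries only the fields named above, and answer_fn is a placeholder for the query engine call.

```python
import traceback
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Optional, Tuple


@dataclass
class QueryRecord:
    """Hypothetical stand-in for CorpusQuery (not the real Django model)."""
    query: str
    started: Optional[datetime] = None
    completed: Optional[datetime] = None
    failed: Optional[datetime] = None
    response: str = ""
    sources: List[int] = field(default_factory=list)
    stacktrace: str = ""


def run_query_sketch(
    record: QueryRecord,
    answer_fn: Callable[[str], Tuple[str, List[int]]],
) -> QueryRecord:
    """Run answer_fn and record the outcome on the query record.

    answer_fn is a placeholder for the query engine; it is assumed to
    return (markdown_text, annotation_ids).
    """
    record.started = datetime.now(timezone.utc)
    try:
        markdown, annotation_ids = answer_fn(record.query)
        record.sources = list(annotation_ids)
        record.response = markdown
        record.completed = datetime.now(timezone.utc)
    except Exception:
        # On failure, keep the failure time and the full stack trace.
        record.failed = datetime.now(timezone.utc)
        record.stacktrace = traceback.format_exc()
    return record
```

Catching the exception inside the task, rather than letting it propagate, means a failed query still ends up with a recorded state that can be inspected later.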

Leveraging LlamaIndex

LlamaIndex is leveraged in the following ways to enable query answering in the Django application:

  1. Vector Store: LlamaIndex provides the BasePydanticVectorStore class, which serves as the foundation for the custom DjangoAnnotationVectorStore. The DjangoAnnotationVectorStore integrates with Django's ORM to store and retrieve annotations efficiently, allowing seamless integration with the existing Django application.
  2. Indexing: LlamaIndex's VectorStoreIndex is used to create an index from the DjangoAnnotationVectorStore. The index facilitates fast and efficient retrieval of relevant annotations based on the query.
  3. Query Engine: LlamaIndex's CitationQueryEngine is employed to process the queries and generate responses. The query engine leverages the index to find the most relevant annotations and uses the language model to generate a coherent answer.
  4. Embedding and Language Models: LlamaIndex provides abstractions for integrating various embedding and language models. In this implementation, the HuggingFaceEmbedding and OpenAI models are used, but LlamaIndex allows flexibility in choosing different models based on requirements.

By leveraging LlamaIndex, the Django application benefits from a structured and efficient approach to query answering. LlamaIndex provides the necessary components and abstractions to handle vector storage, indexing, and query processing, allowing the application to focus on integrating these capabilities into its existing architecture.
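
To make the role of similarity_top_k concrete, here is a small, library-free sketch of top-k retrieval over embedding vectors. It assumes cosine similarity as the ranking metric, which the text above does not state, and it is not how LlamaIndex or DjangoAnnotationVectorStore implement retrieval; the function names and the dictionary of vectors are illustrative only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_annotations(
    query_vec: Sequence[float],
    annotation_vecs: Dict[int, Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return the k (annotation_id, score) pairs most similar to the query."""
    scored = [
        (ann_id, cosine_similarity(query_vec, vec))
        for ann_id, vec in annotation_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A larger k gives the language model more candidate annotations to cite, at the cost of a longer prompt.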

\ No newline at end of file diff --git a/index.html b/index.html index 2c270458..56261971 100755 --- a/index.html +++ b/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

OpenContracts

The Free and Open Source Document Analysis Platform


OSLegal logo

CI/CD Codecov (Backend)
Meta code style - black types - Mypy imports - isort License - Apache2

What Does it Do?

OpenContracts is an Apache-2 licensed software application to label, share and search annotated documents. It's designed specifically to label documents with complex layouts such as contracts, scientific papers, newspapers, etc.

When combined with an NLP processing engine like Gremlin Engine (another of our open source projects), OpenContracts not only lets humans collaborate on and share document annotations, it can also analyze and export data from contracts using state-of-the-art NLP technology.

Why Does it Exist?

The OpenContracts stack is designed to provide a cutting-edge frontend experience while providing access to the incredible machine learning and natural language processing capabilities of Python. For this reason, our frontend is based on React. We use a GraphQL API to connect it to a Django-based backend. Django is an incredibly mature, battle-tested framework that is written in Python, so integrating all the amazing Python-based AI and NLP libraries out there is super easy.

We'd like to give credit to AllenAI's PAWLs project for our document annotating component. We rewrote most of the code base and replaced their backend entirely, so it was hard to keep , but we believe in giving credit where it's due! We are relying on their document parser, however, as it produces a really excellent text and x-y coordinate layer that we'd encourage others to use as well in similar applications that require you to interact with complex text layouts.

Limitations

At the moment, it only works with PDFs. In the future, it will be able to convert other document types to PDF for storage and labeling. PDF is an excellent format for this as it introduces a consistent, repeatable format which we can use to generate a text and x-y coordinate layer from scratch. Formats like .docx and .html are too complex and varied to provide an easy, consistent format. Likewise, the output quality of many converters and tools is sub-par and these tools can produce very different document structures for the same inputs.

About OpenSource.Legal

OpenSource.Legal believes that the effective digital transformation of the legal services industry and the execution of "the law", broadly speaking, requires shared solutions and tools to solve some of the problems that are common to almost every legal workflow. The current splintering of service delivery into dozens of incompatible platforms with limited configurations threatens to put software developers and software vendors in the driver's seat of the industry. We firmly believe that lawyers and legal engineers, armed with easily configurable and extensible tools, can much more effectively design the workflows and user experiences that they need to deliver and scale their expertise.

Visit us at https://opensource.legal for a directory of open source legal projects and an overview of our projects.

OpenContracts

The Free and Open Source Document Analysis Platform


OSLegal logo

CI/CD Codecov (Backend)
Meta code style - black types - Mypy imports - isort License - Apache2

What Does it Do?

OpenContracts is an Apache-2 licensed software application to label, share and search annotated documents. It's designed specifically to label documents with complex layouts such as contracts, scientific papers, newspapers, etc.

When combined with an NLP processing engine like Gremlin Engine (another of our open source projects), OpenContracts not only lets humans collaborate on and share document annotations, it can also analyze and export data from contracts using state-of-the-art NLP technology.

Why Does it Exist?

The OpenContracts stack is designed to provide a cutting-edge frontend experience while providing access to the incredible machine learning and natural language processing capabilities of Python. For this reason, our frontend is based on React. We use a GraphQL API to connect it to a Django-based backend. Django is an incredibly mature, battle-tested framework that is written in Python, so integrating all the amazing Python-based AI and NLP libraries out there is super easy.

We'd like to give credit to AllenAI's PAWLs project for our document annotating component. We rewrote most of the code base and replaced their backend entirely, so it was hard to keep , but we believe in giving credit where it's due! We are relying on their document parser, however, as it produces a really excellent text and x-y coordinate layer that we'd encourage others to use as well in similar applications that require you to interact with complex text layouts.

Limitations

At the moment, it only works with PDFs. In the future, it will be able to convert other document types to PDF for storage and labeling. PDF is an excellent format for this as it introduces a consistent, repeatable format which we can use to generate a text and x-y coordinate layer from scratch. Formats like .docx and .html are too complex and varied to provide an easy, consistent format. Likewise, the output quality of many converters and tools is sub-par and these tools can produce very different document structures for the same inputs.

About OpenSource.Legal

OpenSource.Legal believes that the effective digital transformation of the legal services industry and the execution of "the law", broadly speaking, requires shared solutions and tools to solve some of the problems that are common to almost every legal workflow. The current splintering of service delivery into dozens of incompatible platforms with limited configurations threatens to put software developers and software vendors in the driver's seat of the industry. We firmly believe that lawyers and legal engineers, armed with easily configurable and extensible tools, can much more effectively design the workflows and user experiences that they need to deliver and scale their expertise.

Visit us at https://opensource.legal for a directory of open source legal projects and an overview of our projects.

\ No newline at end of file diff --git a/philosophy/index.html b/philosophy/index.html index 271e119d..ab60c8e3 100755 --- a/philosophy/index.html +++ b/philosophy/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Philosophy

Don't Repeat Yourself

OpenContracts is designed not only to be a powerful document analysis and annotation platform, but also as a way to embrace the DRY (Don't Repeat Yourself) principle for legal and legal engineering. You can make a corpus, along with all of its labels, documents and annotations, "public" (currently, you must do this via a GraphQL mutation).

Once something is public, it's read-only for everyone other than its original creator. People with read-only access can "clone" the corpus to create a private copy of the corpus, its documents and its annotations. They can then edit the annotations, add to them, export them, etc. This lets us work from previous document annotations and re-use labels and training data.

Philosophy

Don't Repeat Yourself

OpenContracts is designed not only to be a powerful document analysis and annotation platform, but also as a way to embrace the DRY (Don't Repeat Yourself) principle for legal and legal engineering. You can make a corpus, along with all of its labels, documents and annotations, "public" (currently, you must do this via a GraphQL mutation).

Once something is public, it's read-only for everyone other than its original creator. People with read-only access can "clone" the corpus to create a private copy of the corpus, its documents and its annotations. They can then edit the annotations, add to them, export them, etc. This lets us work from previous document annotations and re-use labels and training data.

\ No newline at end of file diff --git a/quick-start/index.html b/quick-start/index.html index 5176f04d..6cc1978b 100755 --- a/quick-start/index.html +++ b/quick-start/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Quick Start (For use on your local machine)

This guide is for people who want to quickly get started using the application and aren't interested in hosting it online for others to use. You'll get a default, local user with admin access. We recommend you change the user password after completing this tutorial. We assume you're using Linux or macOS, but you could do this on Windows too, assuming you have Docker and Docker Compose installed. The commands to create directories will be different on Windows, but the git, docker and docker-compose commands should all be the same.

Step 1: Clone this Repo

Clone the repository into a local directory of your choice. Here, we assume you are using a folder called source in your user's home directory:

    $ cd ~
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Quick Start (For use on your local machine)

This guide is for people who want to quickly get started using the application and aren't interested in hosting it online for others to use. You'll get a default, local user with admin access. We recommend you change the user password after completing this tutorial. We assume you're using Linux or macOS, but you could do this on Windows too, assuming you have Docker and Docker Compose installed. The commands to create directories will be different on Windows, but the git, docker and docker-compose commands should all be the same.

Step 1: Clone this Repo

Clone the repository into a local directory of your choice. Here, we assume you are using a folder called source in your user's home directory:

    $ cd ~
     $ mkdir source
     $ cd source
     $ git clone https://github.com/JSv4/OpenContracts.git
diff --git a/requirements/index.html b/requirements/index.html
index e6ca4f38..0cd672a0 100755
--- a/requirements/index.html
+++ b/requirements/index.html
@@ -7,6 +7,6 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

System Requirements

System Requirements

You will need Docker and Docker Compose installed to run OpenContracts. We've developed and run the application in a Linux x86_64 environment. We haven't tested on Windows, and Celery is known not to be supported on Windows. For this reason, we do not recommend deployment on Windows. If you must run on a Windows machine, consider using a virtual machine or the Windows Subsystem for Linux.

If you need help setting up Docker, we recommend Digital Ocean's setup guide. Likewise, if you need assistance setting up Docker Compose, Digital Ocean's guide is excellent.

System Requirements

System Requirements

You will need Docker and Docker Compose installed to run OpenContracts. We've developed and run the application in a Linux x86_64 environment. We haven't tested on Windows, and Celery is known not to be supported on Windows. For this reason, we do not recommend deployment on Windows. If you must run on a Windows machine, consider using a virtual machine or the Windows Subsystem for Linux.

If you need help setting up Docker, we recommend Digital Ocean's setup guide. Likewise, if you need assistance setting up Docker Compose, Digital Ocean's guide is excellent.

\ No newline at end of file diff --git a/walkthrough/advanced/configure-annotation-view/index.html b/walkthrough/advanced/configure-annotation-view/index.html index cfe5f3ee..8ce0e65f 100755 --- a/walkthrough/advanced/configure-annotation-view/index.html +++ b/walkthrough/advanced/configure-annotation-view/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Configure How Annotations Are Displayed

Annotations are composed of tokens (basically text in a line surrounded by whitespace). The tokens have a highlight. OpenContracts also has a "BoundingBox" around the tokens which is the smallest rectangle that can cover all of the tokens in an Annotation.

In the Annotator view, you'll see a purple-colored "eye" icon in the top left of the annotation list in the sidebar. Click the icon to bring up a series of configurations for how annotations are displayed:

There are three different settings that can be combined to significantly change how you see the annotations:

  1. Show only selected - You will only see the selected annotation, chosen either by clicking on it in the sidebar or by clicking into an annotation from the Corpus view. All other annotations will be completely hidden.
  2. Show bounding boxes - If you unselect this, only the tokens will be visible. This is recommended where you have large numbers of overlapping annotations or annotations that are sparse - e.g. a few words scattered throughout a paragraph. In either of these cases, the bounding boxes can cover other bounding boxes and this can be confusing. Where you have too many overlapping bounding boxes, it's easier to hide them and just look at the tokens.
  3. Label Display Behavior - has three options:
    1. Always Show - Always show the label for an annotation when it's displayed (remember, you can choose to only display selected annotations).
    2. Always Hide - Never show the label for an annotation, regardless of its visibility.
    3. Show on Hover - If an annotation is visible, when you hover over it, you'll see the label.

Configure How Annotations Are Displayed

Annotations are composed of tokens (basically text in a line surrounded by whitespace). The tokens have a highlight. OpenContracts also has a "BoundingBox" around the tokens which is the smallest rectangle that can cover all of the tokens in an Annotation.

In the Annotator view, you'll see a purple-colored "eye" icon in the top left of the annotation list in the sidebar. Click the icon to bring up a series of configurations for how annotations are displayed:

There are three different settings that can be combined to significantly change how you see the annotations:

  1. Show only selected - You will only see the selected annotation, chosen either by clicking on it in the sidebar or by clicking into an annotation from the Corpus view. All other annotations will be completely hidden.
  2. Show bounding boxes - If you unselect this, only the tokens will be visible. This is recommended where you have large numbers of overlapping annotations or annotations that are sparse - e.g. a few words scattered throughout a paragraph. In either of these cases, the bounding boxes can cover other bounding boxes and this can be confusing. Where you have too many overlapping bounding boxes, it's easier to hide them and just look at the tokens.
  3. Label Display Behavior - has three options:
    1. Always Show - Always show the label for an annotation when it's displayed (remember, you can choose to only display selected annotations).
    2. Always Hide - Never show the label for an annotation, regardless of its visibility.
    3. Show on Hover - If an annotation is visible, when you hover over it, you'll see the label.
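
The "smallest rectangle that can cover all of the tokens" mentioned at the top of this page can be computed directly from the token boxes. The sketch below is illustrative rather than the application's own code: it assumes each token is a dict with x, y, width and height keys, and that y grows downward so the smallest y value is the top edge.

```python
from typing import Dict, List


def bounding_box(tokens: List[Dict[str, float]]) -> Dict[str, float]:
    """Smallest axis-aligned rectangle covering every token box.

    Assumes x/y mark each token's top-left corner and that y increases
    down the page (an assumption, not confirmed by the text above).
    """
    if not tokens:
        raise ValueError("an annotation needs at least one token")
    return {
        "left": min(t["x"] for t in tokens),
        "top": min(t["y"] for t in tokens),
        "right": max(t["x"] + t["width"] for t in tokens),
        "bottom": max(t["y"] + t["height"] for t in tokens),
    }
```

Because the box spans from the first token to the last, a sparse annotation (a few words scattered through a paragraph) produces a rectangle that covers much unrelated text, which is why hiding bounding boxes helps in that case.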
\ No newline at end of file diff --git a/walkthrough/advanced/export-import-corpuses/index.html b/walkthrough/advanced/export-import-corpuses/index.html index b511f2cc..bf93ba4e 100755 --- a/walkthrough/advanced/export-import-corpuses/index.html +++ b/walkthrough/advanced/export-import-corpuses/index.html @@ -7,7 +7,7 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Export / Import Functionality

Exports

OpenContracts supports both exporting and importing corpuses. This functionality is disabled on the public demo as it can be bandwidth intensive. If you want to experiment with these features on your own, you'll see the export action when you right-click on a corpus:

You can access your exports from the user dropdown menu in the top right corner of the screen. Once your export is complete, you should be able to download a zip containing all the documents, their PAWLs layers, and the corpus data you created - including all annotations.

Imports

If you've enabled corpus imports (see the frontend env file for the boolean toggle to do this - it's REACT_APP_ALLOW_IMPORTS), you'll see an import action when you click the action button on the corpus page.

Export Format

OpenContracts Export Format Specification

The OpenContracts export is a zip archive containing:

  1. A data.json file with metadata about the export
  2. The original PDF documents
  3. Exported annotations "burned in" to the PDF documents

data.json Format

The data.json file contains a JSON object with the following fields:

  • annotated_docs (dict): Maps PDF filenames to OpenContractDocExport objects with annotations for that document.

  • doc_labels (dict): Maps document label names (strings) to AnnotationLabelPythonType objects defining those labels.

  • text_labels (dict): Maps text annotation label names (strings) to AnnotationLabelPythonType objects defining those labels.

  • corpus (OpenContractCorpusType): Metadata about the exported corpus, with fields:

    • id (int): ID of the corpus
    • title (string)
    • description (string)
    • icon_name (string): Filename of the corpus icon image
    • icon_data (string): Base64 encoded icon image data
    • creator (string): Email of the corpus creator
    • label_set (string): ID of the labelset used by this corpus
  • label_set (OpenContractsLabelSetType): Metadata about the label set, with fields:

    • id (int)
    • title (string)
    • description (string)
    • icon_name (string): Filename of the labelset icon
    • icon_data (string): Base64 encoded labelset icon data
    • creator (string): Email of the labelset creator

OpenContractDocExport Format

Each document in annotated_docs is represented by an OpenContractDocExport object with fields:

  • doc_labels (list[string]): List of document label names applied to this doc
  • labelled_text (list[OpenContractsAnnotationPythonType]): List of text annotations
  • title (string): Document title
  • content (string): Full text content of the document
  • description (string): Description of the document
  • pawls_file_content (list[PawlsPagePythonType]): PAWLS parse data for each page
  • page_count (int): Number of pages in the document

OpenContractsAnnotationPythonType Format

Represents an individual text annotation, with fields:

  • id (string): Optional ID
  • annotationLabel (string): Name of the label for this annotation
  • rawText (string): Raw text content of the annotation
  • page (int): 0-based page number the annotation is on
  • annotation_json (dict): Maps page numbers to OpenContractsSinglePageAnnotationType

OpenContractsSinglePageAnnotationType Format

Represents the annotation data for a single page:

  • bounds (BoundingBoxPythonType): Bounding box of the annotation on the page
  • tokensJsons (list[TokenIdPythonType]): List of PAWLS tokens covered by the annotation
  • rawText (string): Raw text of the annotation on this page

BoundingBoxPythonType Format

Represents a bounding box with fields:

  • top (int)
  • bottom (int)
  • left (int)
  • right (int)

TokenIdPythonType Format

References a PAWLS token by page and token index:

  • pageIndex (int)
  • tokenIndex (int)

PawlsPagePythonType Format

Represents PAWLS parse data for a single page:

  • page (PawlsPageBoundaryPythonType): Page boundary info
  • tokens (list[PawlsTokenPythonType]): List of PAWLS tokens on the page

PawlsPageBoundaryPythonType Format

Represents the page boundary with fields:

  • width (float)
  • height (float)
  • index (int): Page index

PawlsTokenPythonType Format

Represents a single PAWLS token with fields:

  • x (float): X-coordinate of token box
  • y (float): Y-coordinate of token box
  • width (float): Width of token box
  • height (float): Height of token box
  • text (string): Text content of the token

AnnotationLabelPythonType Format

Defines an annotation label with fields:

  • id (string)
  • color (string): Hex color for the label
  • description (string)
  • icon (string): Icon name
  • text (string): Label text
  • label_type (LabelType): One of DOC_TYPE_LABEL, TOKEN_LABEL, RELATIONSHIP_LABEL, METADATA_LABEL

Example data.json

{
+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Export / Import Functionality

Exports

OpenContracts supports both exporting and importing corpuses. This functionality is disabled on the public demo as it can be bandwidth intensive. If you want to experiment with these features on your own, you'll see the export action when you right-click on a corpus:

You can access your exports from the user dropdown menu in the top right corner of the screen. Once your export is complete, you should be able to download a zip containing all the documents, their PAWLs layers, and the corpus data you created - including all annotations.

Imports

If you've enabled corpus imports (see the frontend env file for the boolean toggle to do this - it's REACT_APP_ALLOW_IMPORTS), you'll see an import action when you click the action button on the corpus page.

Export Format

OpenContracts Export Format Specification

The OpenContracts export is a zip archive containing:

  1. A data.json file with metadata about the export
  2. The original PDF documents
  3. Exported annotations "burned in" to the PDF documents

data.json Format

The data.json file contains a JSON object with the following fields:

  • annotated_docs (dict): Maps PDF filenames to OpenContractDocExport objects with annotations for that document.

  • doc_labels (dict): Maps document label names (strings) to AnnotationLabelPythonType objects defining those labels.

  • text_labels (dict): Maps text annotation label names (strings) to AnnotationLabelPythonType objects defining those labels.

  • corpus (OpenContractCorpusType): Metadata about the exported corpus, with fields:

    • id (int): ID of the corpus
    • title (string)
    • description (string)
    • icon_name (string): Filename of the corpus icon image
    • icon_data (string): Base64 encoded icon image data
    • creator (string): Email of the corpus creator
    • label_set (string): ID of the labelset used by this corpus
  • label_set (OpenContractsLabelSetType): Metadata about the label set, with fields:

    • id (int)
    • title (string)
    • description (string)
    • icon_name (string): Filename of the labelset icon
    • icon_data (string): Base64 encoded labelset icon data
    • creator (string): Email of the labelset creator

OpenContractDocExport Format

Each document in annotated_docs is represented by an OpenContractDocExport object with fields:

  • doc_labels (list[string]): List of document label names applied to this doc
  • labelled_text (list[OpenContractsAnnotationPythonType]): List of text annotations
  • title (string): Document title
  • content (string): Full text content of the document
  • description (string): Description of the document
  • pawls_file_content (list[PawlsPagePythonType]): PAWLS parse data for each page
  • page_count (int): Number of pages in the document

OpenContractsAnnotationPythonType Format

Represents an individual text annotation, with fields:

  • id (string): Optional ID
  • annotationLabel (string): Name of the label for this annotation
  • rawText (string): Raw text content of the annotation
  • page (int): 0-based page number the annotation is on
  • annotation_json (dict): Maps page numbers to OpenContractsSinglePageAnnotationType

OpenContractsSinglePageAnnotationType Format

Represents the annotation data for a single page:

  • bounds (BoundingBoxPythonType): Bounding box of the annotation on the page
  • tokensJsons (list[TokenIdPythonType]): List of PAWLS tokens covered by the annotation
  • rawText (string): Raw text of the annotation on this page

BoundingBoxPythonType Format

Represents a bounding box with fields:

  • top (int)
  • bottom (int)
  • left (int)
  • right (int)

TokenIdPythonType Format

References a PAWLS token by page and token index:

  • pageIndex (int)
  • tokenIndex (int)

PawlsPagePythonType Format

Represents PAWLS parse data for a single page:

  • page (PawlsPageBoundaryPythonType): Page boundary info
  • tokens (list[PawlsTokenPythonType]): List of PAWLS tokens on the page

PawlsPageBoundaryPythonType Format

Represents the page boundary with fields:

  • width (float)
  • height (float)
  • index (int): Page index

PawlsTokenPythonType Format

Represents a single PAWLS token with fields:

  • x (float): X-coordinate of token box
  • y (float): Y-coordinate of token box
  • width (float): Width of token box
  • height (float): Height of token box
  • text (string): Text content of the token
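As a rough illustration of how the token and bounding-box types relate, the sketch below computes an integer box that encloses a set of tokens. It assumes that (x, y) is the top-left corner of a token box and that y grows downward, which the field lists here do not state. `enclosing_box` is a hypothetical helper, not part of OpenContracts.

```python
import math
from typing import TypedDict


class PawlsTokenPythonType(TypedDict):
    x: float
    y: float
    width: float
    height: float
    text: str


class BoundingBoxPythonType(TypedDict):
    top: int
    bottom: int
    left: int
    right: int


def enclosing_box(tokens: list[PawlsTokenPythonType]) -> BoundingBoxPythonType:
    """Return the smallest integer box containing every token (hypothetical helper)."""
    if not tokens:
        raise ValueError("need at least one token")
    # Assumption: (x, y) is the top-left corner and y increases down the page.
    return {
        "top": math.floor(min(t["y"] for t in tokens)),
        "bottom": math.ceil(max(t["y"] + t["height"] for t in tokens)),
        "left": math.floor(min(t["x"] for t in tokens)),
        "right": math.ceil(max(t["x"] + t["width"] for t in tokens)),
    }
```

Rounding outward (floor for top/left, ceil for bottom/right) is a design choice made here so the integer box never clips a token; a real export may round differently.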

AnnotationLabelPythonType Format

Defines an annotation label with fields:

  • id (string)
  • color (string): Hex color for the label
  • description (string)
  • icon (string): Icon name
  • text (string): Label text
  • label_type (LabelType): One of DOC_TYPE_LABEL, TOKEN_LABEL, RELATIONSHIP_LABEL, METADATA_LABEL

Example data.json

{
   "annotated_docs": {
     "document1.pdf": {
       "doc_labels": ["Contract", "NDA"],
diff --git a/walkthrough/advanced/fork-a-corpus/index.html b/walkthrough/advanced/fork-a-corpus/index.html
index 2a5d01bb..df51a79e 100755
--- a/walkthrough/advanced/fork-a-corpus/index.html
+++ b/walkthrough/advanced/fork-a-corpus/index.html
@@ -7,6 +7,6 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Fork a Corpus

To Fork or Not to Fork?

One of the amazing things about Open Source collaboration is that you can stand on the shoulders of giants - we can share techniques and data and collectively achieve what we could never do alone. OpenContracts is designed to make it easy to share and re-use annotation data.

In OpenContracts, we introduce the concept of "forking" a corpus - basically creating a copy of a public or private corpus, complete with its documents and annotations, which you can edit and tweak as needed. This opens up some interesting possibilities. For example, you might have a base corpus with annotations common to many types of AI models or annotation projects, which you can fork as needed and then layer task- or domain-specific annotations on top of.

Fork a Corpus

Forking a corpus is easy.

  1. Again, right-click on a corpus to bring up the context menu. You'll see an entry to "Fork Corpus":
  2. Click on it to start a fork. You should see a confirmation in the top right of the screen:
  3. Once the fork is complete, the next time you go to your Corpus page, you'll see a new Corpus with a Fork icon in the icon bar at the bottom. If you hover over it, you'll be able to see a summary of the corpus it was forked from. This is tracked in the database, so, long-term, we'd like to have corpus version control similar to how git works:

\ No newline at end of file
diff --git a/walkthrough/advanced/generate-graphql-schema-files/index.html b/walkthrough/advanced/generate-graphql-schema-files/index.html
index b103f9b7..4fde7747 100755
--- a/walkthrough/advanced/generate-graphql-schema-files/index.html
+++ b/walkthrough/advanced/generate-graphql-schema-files/index.html
@@ -7,7 +7,7 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Generate GraphQL Schema Files

Generating GraphQL Schema Files

OpenContracts uses Graphene to provide a rich GraphQL endpoint, complete with the GraphiQL query application. For some applications, you may want to generate a GraphQL schema file in SDL or JSON. One example use case is if you're developing a frontend you want to connect to OpenContracts, and you'd like to autogenerate TypeScript types from a GraphQL schema.

To generate a GraphQL schema file, run your choice of the following commands.

For an SDL file:

$ docker-compose -f local.yml run django python manage.py graphql_schema --schema config.graphql.schema.schema --out schema.graphql
 

For a JSON file:

$ docker-compose -f local.yml run django python manage.py graphql_schema --schema config.graphql.schema.schema --out schema.json
 

You can convert these to TypeScript for use in a frontend (though you'll find this has already been done for the React-based OpenContracts frontend) using a tool like this.
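Before wiring up a code generator, it can help to sanity-check the JSON output. The sketch below lists the object type names in an introspection result. It assumes the file follows the standard GraphQL introspection shape (a `__schema` object, possibly wrapped in a top-level `data` key); the exact layout of the generated file may differ. `list_object_types` is a hypothetical helper, and the `CorpusType` name in the sample is made up for illustration.

```python
import json


def list_object_types(introspection: dict) -> list[str]:
    """Return sorted names of non-internal OBJECT types (hypothetical helper)."""
    # Assumed shapes: {"data": {"__schema": ...}} or {"__schema": ...}.
    root = introspection.get("data", introspection)
    types = root["__schema"]["types"]
    return sorted(
        t["name"]
        for t in types
        if t["kind"] == "OBJECT" and not t["name"].startswith("__")
    )


# Tiny stand-in for the contents of schema.json; the type names are illustrative only.
sample = json.loads("""
{"data": {"__schema": {"types": [
    {"kind": "OBJECT", "name": "Query"},
    {"kind": "OBJECT", "name": "__Type"},
    {"kind": "SCALAR", "name": "String"},
    {"kind": "OBJECT", "name": "CorpusType"}
]}}}
""")
print(list_object_types(sample))
```

For the real file, load it with `json.load` and pass the resulting dict to the same function.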

+    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Understanding the PAWLs Format in OpenContracts

The OpenContracts project utilizes the PAWLs format for representing documents and their annotations. PAWLs is designed to provide a consistent and structured way to store text and layout information for complex documents like contracts, scientific papers, and newspapers.

PAWLs Layers

In OpenContracts, every document is processed through a pipeline that extracts and structures text and layout information into three files:

  1. Original PDF: The original PDF document.
  2. PAWLs Layer (JSON): A JSON file containing the text and positional data for each token (word) in the document.
  3. Text Layer: A text file containing the full text extracted from the document.

The PAWLs layer serves as the source of truth for the document, allowing seamless translation between text and positional information.

PAWLs Processing Pipeline

The PAWLs processing pipeline involves the following steps:

  1. OCR: The original PDF is re-OCRed using the open-source Tesseract OCR engine to produce a consistent output.
  2. Token Extraction: The OCRed document is processed using the parsing engine of Grobid to extract "tokens" (text surrounded by whitespace, typically a word) along with their page and positional information.
  3. PAWLs Layer Generation: The extracted tokens and their positional data are stored as a JSON file, referred to as the "PAWLs layer."
  4. Text Layer Generation: The full text is extracted from the PAWLs layer and stored as a separate text file, called the "text layer."
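Steps 3 and 4 can be pictured roughly as follows. This is a minimal sketch, not the OpenContracts implementation: it assumes each page object carries a `tokens` list in reading order and simply joins token text with single spaces, one line per page. `build_text_layer` is a hypothetical name, and the real pipeline may handle whitespace and line breaks differently.

```python
def build_text_layer(pawls_pages: list[dict]) -> str:
    """Join token text page by page (hypothetical helper; whitespace rules are assumed)."""
    page_texts = []
    for page in pawls_pages:
        # Assumption: tokens are already in reading order.
        page_texts.append(" ".join(token["text"] for token in page["tokens"]))
    return "\n".join(page_texts)


# Illustrative two-page PAWLs layer with made-up coordinates.
pages = [
    {
        "page": {"width": 612.0, "height": 792.0, "index": 0},
        "tokens": [
            {"x": 72.0, "y": 72.0, "width": 40.0, "height": 10.0, "text": "Master"},
            {"x": 116.0, "y": 72.0, "width": 60.0, "height": 10.0, "text": "Agreement"},
        ],
    },
    {
        "page": {"width": 612.0, "height": 792.0, "index": 1},
        "tokens": [
            {"x": 72.0, "y": 72.0, "width": 30.0, "height": 10.0, "text": "Signed"},
        ],
    },
]
print(build_text_layer(pages))
```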

PAWLs Layer Structure

The PAWLs layer JSON file consists of a list of page objects, each containing the necessary tokens and page information for a given page. Here's the data shape for each page object:

class PawlsPagePythonType(TypedDict):
     page: PawlsPageBoundaryPythonType
     tokens: list[PawlsTokenPythonType]
 

The PawlsPageBoundaryPythonType represents the page boundary information:

class PawlsPageBoundaryPythonType(TypedDict):
diff --git a/walkthrough/advanced/run-gremlin-analyzer/index.html b/walkthrough/advanced/run-gremlin-analyzer/index.html
index 1714c2fe..e80878ec 100755
--- a/walkthrough/advanced/run-gremlin-analyzer/index.html
+++ b/walkthrough/advanced/run-gremlin-analyzer/index.html
@@ -7,6 +7,6 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}      

Run a Gremlin Analyzer

Introduction to Gremlin Integration

OpenContracts integrates with a powerful NLP engine called Gremlin Engine ("Gremlin"). If you run a Gremlin analyzer on a Corpus, it will create annotations of its own that you can view and export (e.g. automatically applying document labels or labeling parties, dates, and places, etc.). It's meant to provide a consistent API to deliver and render NLP and machine learning capabilities to end-users. As discussed in the configuration section, you need to install Gremlin Analyzers through the admin dashboard.

Once you've installed Gremlin Analyzers, however, it's easy to apply them.

Using an Installed Gremlin Analyzer

  1. If analysis capabilities are enabled for your instance, when you right-click on a Corpus, you'll see an option to "Analyze Corpus":

  2. Clicking on this item will bring up a dialog where you can browse available analyzers:

  3. Select one and hit "Analyze" to submit a corpus for processing. When you go to the Analysis tab of your Corpus now, you'll see the analysis. Most likely, if you just clicked there, it will say processing:

  4. When the Analysis is complete, you'll see a summary of the number of labels and annotations applied by the analyzer:

Note on Processing Time

Large Corpuses of hundreds of documents can take a long time to process (10 minutes or more). It's hard to predict processing time up front, because it's dependent on the number of total pages and the specific analysis being performed. At the moment, there is not a great mechanism in place to detect and handle failures in a Gremlin analyzer and reflect this in OpenContracts. It's on our roadmap to improve this integration. In the meantime, the example analyzers we've released with Gremlin should be very stable, so they should run predictably.

Viewing the Outputs

Once an Analysis completes, you'll be able to browse the annotations from the analysis in several ways.

  1. First, they'll be available in the "Annotation" tab, and you can easily filter to annotations from a specific analyzer.
  2. Second, when you load a Document, in the Annotator view, there's a small widget in the top of the annotator that has three downwards-facing arrows and says "Human Annotation Mode".
  3. Click on the arrows to open a tray showing the analyses applied to this document.
  4. Click on an analysis to load the annotations and view them in the document.

Note: You can delete an analysis, but you cannot edit it. The annotations are machine-created and cannot be edited by human users.


\ No newline at end of file
diff --git a/walkthrough/key-concepts/index.html b/walkthrough/key-concepts/index.html
index 120c6e81..cb095075 100755
--- a/walkthrough/key-concepts/index.html
+++ b/walkthrough/key-concepts/index.html
@@ -7,6 +7,6 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Key-Concepts

Data Types

Text annotation data is divided into several concepts:

  1. Corpuses (or collections of documents). One document can be in multiple corpuses.
  2. Documents. Currently, these are PDFs ONLY.
  3. Annotations. These are either document-level annotations (the document type), text-level annotations (highlighted text), or relationships (which apply a label between two annotations). Relationships are currently not well-supported and may be buggy.
  4. Analyses. These are groups of read-only annotations added by a Gremlin analyzer (see more on that below).

Permissioning

OpenContracts is built on top of django-guardian, a powerful permissioning framework for Django. Each GraphQL request can add a field to annotate the object-level permissions the current user has for a given object, and the frontend relies on this to determine whether to make some objects and pages read-only and whether certain features should be exposed to a given user. The capability of sharing objects with specific users is built in, but is not enabled from the frontend at the moment. Allowing such widespread sharing and user lookups could be a security hole and could also unduly tax the system. We'd like to test these capabilities more fully before letting users use them.

GraphQL

Mutations and Queries

OpenContracts uses Graphene and GraphQL to serve data to its frontend. You can access the GraphiQL playground by going to /graphql under your OpenContracts root URL - e.g. https://opencontracts.opensource.legal/graphql. Anonymous users have access to any public data. To authenticate and access your own data, you either need to use the login mutation to create a JWT token or log in to the admin dashboard to get a Django session and auth cookie that will automatically authenticate your requests to the GraphQL endpoint.

If you're not familiar with GraphQL, it's a very powerful way to expose your backend to the user and/or frontend clients to permit the construction of specific queries with specific data shapes. As an example, here's a request to get public corpuses and the annotated text and labels in them:

GraphiQL comes with a built-in documentation browser. Just click "Docs" in the top-right of the screen to start browsing. Typically, mutations change things on the server, while queries merely request copies of data from the server. We've tried to make our schema fairly self-explanatory, but we do plan to add more descriptions and guidance to our API docs.
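For readers who prefer scripts to the playground, the sketch below builds (but does not send) a GraphQL request using only Python's standard library. It uses a generic introspection query rather than real OpenContracts fields, and the `JWT` authorization prefix is an assumption about how the token is passed, so check your deployment before relying on it. `build_graphql_request` is a hypothetical helper.

```python
import json
import urllib.request
from typing import Optional


def build_graphql_request(
    endpoint: str, query: str, token: Optional[str] = None
) -> urllib.request.Request:
    """Build a POST request carrying a GraphQL query as JSON (hypothetical helper)."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the server expects "JWT <token>"; some setups use "Bearer" instead.
        headers["Authorization"] = f"JWT {token}"
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


# Introspection query that any GraphQL server should understand.
request = build_graphql_request(
    "https://opencontracts.opensource.legal/graphql",
    "{ __schema { queryType { name } } }",
)
# To actually send it: urllib.request.urlopen(request)
```

Anonymous requests built this way would only see public data; passing a token obtained from the login mutation is what would unlock your own objects, subject to the header assumption above.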

GraphQL-only features

Some of our features are currently not accessible via the frontend. Sharing analyses and corpuses to the public, for example, can only be achieved via makeCorpusPublic and makeAnalysisPublic mutations, and only admins have this power at the moment. For our current release, we've done this to prevent large numbers of public corpuses being shared to cut down on server usage. We'd like to make a fully free and open, collaborative platform with more features to share anonymously, but this will require additional effort and compute power.


\ No newline at end of file
diff --git a/walkthrough/step-1-add-documents/index.html b/walkthrough/step-1-add-documents/index.html
index 7ee9cd36..83862172 100755
--- a/walkthrough/step-1-add-documents/index.html
+++ b/walkthrough/step-1-add-documents/index.html
@@ -7,6 +7,6 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Step 1 - Add Documents

In order to do anything, you need to add some documents to OpenContracts.

Go to the Documents tab

Click on the "Documents" entry in the menu to bring up a view of all documents you have read and/or write access to:

Open the Action Menu

Now, click on the "Action" dropdown to open the Action menu for available actions and click "Import":

This will bring up a dialog to load documents:

Select Documents to Upload

Open Contracts works with PDFs only (as this helps us have a single file type with predictable data structures, formats, etc.). In the future, we'll add functionality to convert other files to PDF, but, for now, please use PDFs. It doesn't matter if they are OCRed or not as OpenContracts performs its own OCR on every PDF anyway to ensure consistent OCR quality and outputs. Once you've added documents for upload, you'll see a list of documents:

Click on a document to change the description or title:

Upload Your Documents

Click upload to upload the documents to OpenContracts. Note: Once the documents are uploaded, they are automatically processed with Tesseract and PAWLs to create a layer of tokens - each one representing a word or symbol in the PDF and its X,Y coordinates on the page. This is what powers the OpenContracts annotator and allows us to create both layout-aware and text-only annotations. While the PAWLs processing script is running, the document you uploaded will not be available for viewing and cannot be added to a corpus. You'll see a loading bar on the document until the pre-processing is complete. This is only done once and can take a long time (from a couple of minutes up to a maximum of 10) depending on the document length, quality, etc.


\ No newline at end of file
diff --git a/walkthrough/step-2-create-labelset/index.html b/walkthrough/step-2-create-labelset/index.html
index fab6214d..01c14849 100755
--- a/walkthrough/step-2-create-labelset/index.html
+++ b/walkthrough/step-2-create-labelset/index.html
@@ -7,6 +7,6 @@
     .gdesc-inner { font-size: 0.75rem; }
     body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);}
     body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);}
-    body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Step 2 - Create Labelset

Why Labelsets?

Before you can add labels, you need to decide what you want to label. A labelset should reflect the taxonomy or concepts you want to associate with text in your document. This can be solely for the purpose of human review and retrieval, but we imagine many of you want to use it to train machine learning models.

At the moment, there's no way to create a label in a corpus without creating a labelset and creating a label for the labelset (though we'd like to add that and welcome contributions).

Create Text Labels

Let's say we want to add some labels for "Parties", "Termination Clause", and "Effective Date". To do that, let's first create a LabelSet to hold the labels.

  1. Go to the labelset view and click the action button to bring up the action menu:
  2. Clicking on the "Create Label Set" item will bring up a modal to let you create labels:
  3. Now click on the new label set to edit the labels:
  4. A modal comes up that lets you edit three types of labels:

    1. Text Labels - are meant to label spans of text ("highlights")
    2. Relationship Labels - this feature is still under development, but it labels relationships between text labels (e.g. one labelled party is the "Parent Company" of another).
    3. Doc Type Labels - are meant to label what category the document belongs in - e.g. a "Stock Purchase Agreement" or an "NDA"
  5. Click the "Text Labels" tab to bring up a view of current labels for text annotations and an action button that lets you create new ones. There should be no labels when you first open this view:

  6. Click the action button and then the "Create Text Label" dropdown item:
  7. You'll see a new, blank label in the list of text labels:
  8. Click the edit icon on the label to edit the label title, description, color and/or icon. To edit the icon or highlight color, hover over or click the giant tag icon on the left side of the label:
  9. Hit save to commit the changes to the database. Repeat for the other labels - "Parties", "Termination Clause", and "Effective Date":

Create Document-Type Labels

In addition to labelling specific parts of a document, you may want to tag a document itself as a certain type of document or addressing a certain subject. In this example, let's say we want to label some documents as "contracts" and others as "not contracts".

  1. Let's also create two example document type labels. Click the "Doc Type Labels" tab:
  2. As before, click the action button and the "Create Document Type Label" item to create a blank document type label:
  3. Repeat to create two doc type labels - "Contract" and "Not Contract":
  4. Hit "Close" to close the editor.

Step 2 - Create Labelset

Why Labelsets?

Before you can add labels, you need to decide what you want to label. A labelset should reflect the taxonomy or concepts you want to associate with text in your document. This can be solely for the purpose of human review and retrieval, but we imagine many of you want to use it to train machine learning models.

At the moment, there's no way to create a label in a corpus without creating a labelset and creating a label for the labelset (though we'd like to add that and welcome contributions).

Create Text Labels

Let's say we want to add some labels for "Parties", "Termination Clause", and "Effective Date". To do that, let's first create a LabelSet to hold the labels.

  1. Go to the labelset view and click the action button to bring up the action menu:
  2. Clicking on the "Create Label Set" item will bring up a modal to let you create labels:
  3. Now click on the new label set to edit the labels:
  4. A modal comes up that lets you edit three types of labels:

    1. Text Labels - are meant to label spans of text ("highlights")
    2. Relationship Labels - this feature is still under development, but it labels relationships bewteen text label (e.g. one labelled party is the "Parent Company" of another).
    3. Doc Type Labels - are meant to label what category the document belongs in - e.g. a "Stock Purchase Agreement" or an "NDA"
  5. Click the "Text Labels" tab to bring up a view of current labels for text annotations and an action button that lets you create new ones. There should be no labels when you first open this view"

  6. Click the action button and then the "Create Text Label" dropdown item:
  7. You'll see a new, blank label in the list of text labels:
  8. Click the edit icon on the label to edit the label title, description, color and/or icon. To edit the icon or highlight color, hover over or click the giant tag icon on the left side of the label:
  9. Hit save to commit the changes to the database. Repeat for the other labels - "Parties", "Termination Clause", and "Effective Date":
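
To make the structure built in the steps above concrete, here is a minimal sketch of a labelset holding the three kinds of labels. It is only an illustration: the `LabelType`, `Label` and `LabelSet` names, their fields, and the "Contract Review" title are hypothetical and are not taken from the OpenContracts codebase or API.

```python
from dataclasses import dataclass, field
from enum import Enum


class LabelType(Enum):
    TEXT = "text"                  # labels a span of text (a "highlight")
    RELATIONSHIP = "relationship"  # labels a relationship between text labels
    DOC_TYPE = "doc_type"          # labels the category of a whole document


@dataclass
class Label:
    # Hypothetical fields mirroring what the label editor lets you change.
    title: str
    label_type: LabelType
    description: str = ""


@dataclass
class LabelSet:
    title: str
    labels: list[Label] = field(default_factory=list)

    def add(self, title: str, label_type: LabelType) -> Label:
        """Create a label and attach it to this labelset."""
        label = Label(title=title, label_type=label_type)
        self.labels.append(label)
        return label

    def titles_of_type(self, label_type: LabelType) -> list[str]:
        """Return the titles of all labels of one type, in creation order."""
        return [lbl.title for lbl in self.labels if lbl.label_type is label_type]


labelset = LabelSet("Contract Review")  # made-up labelset title
for title in ("Parties", "Termination Clause", "Effective Date"):
    labelset.add(title, LabelType.TEXT)
for title in ("Stock Purchase Agreement", "NDA"):
    labelset.add(title, LabelType.DOC_TYPE)

print(labelset.titles_of_type(LabelType.TEXT))
```

The point of the sketch is that text labels and doc type labels live side by side in one labelset and differ only in what they attach to.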

\ No newline at end of file diff --git a/walkthrough/step-3-create-a-corpus/index.html b/walkthrough/step-3-create-a-corpus/index.html index 02da8118..4e4e650e 100755 --- a/walkthrough/step-3-create-a-corpus/index.html +++ b/walkthrough/step-3-create-a-corpus/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Step 3 - Create Corpus

Purpose of the Corpus

A "Corpus" is a collection of documents that can be annotated by hand or automatically by a "Gremlin" analyzer. To set one up, you first create the Corpus and then add documents to it.

Go to the Corpus Page

  1. First, log in if you're not already logged in.
  2. Then, go to the "Corpus" tab and click the "Action" dropdown to bring up the action menu:
  3. Click "Create Corpus" to bring up the Create Corpus dialog. If you've already created a labelset or have a pre-existing one, you can select it; otherwise, you'll need to create and add one later:
  4. Assuming you created the labelset you want to use, when you click on the dropdown in the "Label Set" section, you should see your new labelset. Click on it to select it:
  5. You will now be able to open the corpus again, open documents in the corpus and start labelling.

Add Documents to Corpus

  1. Once you have a corpus, go back to the document page to select documents to add. You can do this in one of two ways.
    1. Right-click on a document to show a context menu:
    2. Or, SHIFT + click on the documents you want to select in order to select multiple documents at once. A green checkmark will appear on selected documents.
  2. When you're done, click the "Action" dropdown.
  3. A dialog will pop up asking you to select a corpus to add the documents to. Select the desired corpus and hit OK.
  4. You'll get a confirmation dialog. Hit OK.
  5. When you click on the Corpus you just added the documents to, you'll get a tabbed view of all of the documents, annotations and analyses for that Corpus. At this stage, you should see your documents:

Congrats! You've created a corpus to hold annotations or perform an analysis! To start labelling it yourself, however, you need to create and then select a LabelSet. You do not need to do this to run an analyzer.
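
As a rough mental model of the above, a corpus can be pictured as a titled collection of documents with an optional labelset attached. The sketch below is purely illustrative: the `Corpus` class, its fields, and the file names are hypothetical and do not reflect how OpenContracts stores anything internally.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Corpus:
    title: str
    label_set: Optional[str] = None  # title of the selected labelset, if any
    documents: list[str] = field(default_factory=list)

    def add_documents(self, *names: str) -> None:
        """Add the selected documents to this corpus."""
        self.documents.extend(names)

    def ready_for_manual_labelling(self) -> bool:
        """Hand labelling needs both documents and a selected labelset."""
        return self.label_set is not None and len(self.documents) > 0


corpus = Corpus("Demo Corpus")                   # made-up corpus title
corpus.add_documents("nda.pdf", "lease.pdf")     # made-up file names
before = corpus.ready_for_manual_labelling()     # no labelset selected yet

corpus.label_set = "Contract Review"             # made-up labelset title
after = corpus.ready_for_manual_labelling()

print(before, after)
```

The check captures the one rule stated above: the corpus itself can exist without a labelset, but manual labelling cannot start until one is selected.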

Note: If you have an OpenContracts export file and proper permissions, you can also import a corpus, documents, annotations, and labels. This is disabled on our demo instance, however, to cut down on server load and reduce opportunities to upload potentially malicious files. See the "Advanced" section for more details.

\ No newline at end of file diff --git a/walkthrough/step-4-create-text-annotations/index.html b/walkthrough/step-4-create-text-annotations/index.html index 1fffbcd4..3a10b70d 100755 --- a/walkthrough/step-4-create-text-annotations/index.html +++ b/walkthrough/step-4-create-text-annotations/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Step 4 - Create Some Annotations

To view or edit annotations, you need to open a corpus and then open a document in the Corpus.

  1. Go to your Corpuses page and click on the corpus you just created:
  2. This will open up the document view again. Click on one of the documents to bring up the annotator:
  3. To select the label to apply, click the vertical ellipsis in the "Text Label to Apply Widget". This will bring up an interface that lets you search your labelset and select a label:
  4. Select the "Effective Date" label, for example, to label the Effective Date:
  5. Now, in the document, click and drag a box around the language that corresponds to your selected label:
  6. When you've selected the correct text, release the mouse. You'll see a confirmation when your annotation is created (you'll also see the annotation in the sidebar to the left):
  7. If you want to delete the annotation, you can click on the trash icon in the corresponding annotation card in the sidebar, or, when you hover over the annotation on the page, you'll see a trash icon in the label bar of the annotation. You can click this to delete the annotation too.
  8. If the text you want to annotate is non-contiguous, you can hold down the SHIFT key while selecting blocks of text to combine them into a single annotation. While you hold SHIFT, releasing the mouse will not create the annotation in the database; it will just allow you to move to a new area.
    1. One situation where you might want to do this is when the text you want to highlight spans different lines but is just a small part of the surrounding paragraph (such as this example, where Effective Date spans two lines):
    2. Or you might want to select multiple snippets of text in a larger block of text, such as where you have multiple parties you want to combine into a single annotation:
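
The SHIFT behaviour described in step 8 can be sketched as a small state machine: each mouse release adds a selection, and the annotation is only finalized when SHIFT is not held. This is a simplified illustration, not the annotator's actual code; `Span`, `PendingAnnotation`, `release_mouse` and the sample text are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    page: int
    text: str


@dataclass
class PendingAnnotation:
    label: str
    spans: list[Span] = field(default_factory=list)

    def release_mouse(self, span: Span, shift_held: bool) -> Optional[dict]:
        """Record the selected span; finish the annotation unless SHIFT is held."""
        self.spans.append(span)
        if shift_held:
            # Keep collecting: nothing is created yet.
            return None
        finished = {"label": self.label, "spans": list(self.spans)}
        self.spans.clear()
        return finished


pending = PendingAnnotation("Effective Date")

# First selection with SHIFT held: no annotation is created.
first = pending.release_mouse(Span(1, "effective as of the first"), shift_held=True)

# Second selection without SHIFT: both pieces become one annotation.
second = pending.release_mouse(Span(1, "day of the month"), shift_held=False)

print(first)
print([s.text for s in second["spans"]])
```

Modelling the annotation as a list of spans rather than a single range is what allows one label to cover text on different lines or several separate snippets.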

\ No newline at end of file diff --git a/walkthrough/step-5-create-doc-type-annotations/index.html b/walkthrough/step-5-create-doc-type-annotations/index.html index e04ce4e2..4dcfdd17 100755 --- a/walkthrough/step-5-create-doc-type-annotations/index.html +++ b/walkthrough/step-5-create-doc-type-annotations/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Step 5 - Create Some Document Annotations

  1. If you want to label the type of document instead of the text inside it, use the controls in the "Doc Type" widget on the bottom right of the Annotator. Hover over it and a green plus button should appear:
  2. Click the "+" button to bring up a dialog that lets you search and select document type labels (remember, we created these earlier in the tutorial):
  3. Click "Add Label" to actually apply the label, and you'll now see that label displayed in the "Doc Type" widget in the annotator:
  4. As before, you can click the trash can to delete the label.

\ No newline at end of file diff --git a/walkthrough/step-6-search-and-filter-by-annotations/index.html b/walkthrough/step-6-search-and-filter-by-annotations/index.html index d3da0d60..d322c383 100755 --- a/walkthrough/step-6-search-and-filter-by-annotations/index.html +++ b/walkthrough/step-6-search-and-filter-by-annotations/index.html @@ -7,6 +7,6 @@ .gdesc-inner { font-size: 0.75rem; } body[data-md-color-scheme="slate"] .gdesc-inner { background: var(--md-default-bg-color);} body[data-md-color-scheme="slate"] .gslide-title { color: var(--md-default-fg-color);} - body[data-md-color-scheme="slate"] .gslide-desc { color: var(--md-default-fg-color);}

Step 6 - Search and Filter By Annotations

  1. Back in the Corpus view, you can see in the document view the document type label you just added:
  2. You can click on the filter dropdown above to filter the documents to only those with a certain doc type label:
  3. With the corpus opened, click on the "Annotations" tab instead of the "Documents" tab to get a summary of all the current annotations in the Corpus:
  4. Click on an annotation card to automatically load the document it's in and jump right to the page containing the annotation:
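
Conceptually, the filter dropdown and the annotation cards described above do two simple lookups: keep the documents carrying a given doc type label, and map an annotation back to its document and page. The sketch below illustrates that idea with plain dictionaries; the data, field names and helper functions are hypothetical and are not the OpenContracts API.

```python
# Made-up sample data standing in for a corpus's documents and annotations.
documents = [
    {"title": "nda.pdf", "doc_type_labels": {"Contract"}},
    {"title": "memo.pdf", "doc_type_labels": {"Not Contract"}},
    {"title": "lease.pdf", "doc_type_labels": {"Contract"}},
]

annotations = [
    {"label": "Effective Date", "document": "nda.pdf", "page": 1},
    {"label": "Parties", "document": "lease.pdf", "page": 2},
]


def filter_by_doc_type(docs: list[dict], label: str) -> list[str]:
    """Return titles of documents tagged with the given doc type label."""
    return [d["title"] for d in docs if label in d["doc_type_labels"]]


def locate(annotation: dict) -> tuple[str, int]:
    """Return the document and page to open for an annotation card."""
    return annotation["document"], annotation["page"]


print(filter_by_doc_type(documents, "Contract"))
print(locate(annotations[1]))
```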

\ No newline at end of file