
Merge pull request #128 from JSv4/JSv4/v2-bugfixes
v2 Bugfixes
JSv4 authored Jun 21, 2024
2 parents 1e821d5 + a022ed6 commit e3ad017
Showing 30 changed files with 1,107,646 additions and 20,169 deletions.
2 changes: 1 addition & 1 deletion .envs/.test/.django
@@ -38,5 +38,5 @@ USE_AUTH0=false

# LLM SETTINGS
# ------------------------------------------------------------------------------
OPENAI_API_KEY=FAKE_API_KEY
OPENAI_API_KEY=fake
OPENAI_MODEL=gpt-4o
43 changes: 22 additions & 21 deletions README.md
@@ -1,27 +1,38 @@
# OpenContracts
![OpenContracts](/docs/assets/images/logos/OpenContracts.webp)

## The Free and Open Source Document Analysis Platform
## The Free and Open Source Document Analytics Platform

---

![OSLegal logo](docs/assets/images/logos/os_legal_128_name_left_dark.png)

| | |
| --- | --- |
| CI/CD | [![codecov](https://codecov.io/gh/JSv4/OpenContracts/branch/main/graph/badge.svg?token=RdVsiuaTVz)](https://codecov.io/gh/JSv4/OpenContracts) |
| Meta | [![code style - black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![types - Mypy](https://img.shields.io/badge/types-Mypy-blue.svg)](https://github.com/python/mypy) [![imports - isort](https://img.shields.io/badge/imports-isort-ef8336.svg)](https://github.com/pycqa/isort) [![License - Apache2](https://img.shields.io/badge/license-Apache%202-blue.svg)](https://spdx.org/licenses/) |

## What Does it Do?

OpenContracts is an **Apache-2 Licensed** software application to label, share, and search annotated documents.
It's designed specifically to label documents with complex layouts such as contracts, scientific papers, newspapers,
etc.
OpenContracts is an **Apache-2 Licensed** enterprise document analytics tool. It was originally designed to label and
share labeled document corpuses with complex layouts such as contracts, scientific papers, newspapers,
etc. It has evolved into a platform for mass contract analytics that still maintains its core functionality as an open
platform that makes it effortless to view, edit, and share annotations:

![Grid Review And Sources.gif](docs/assets/images/gifs/Grid_Review_And_Sources.gif)

![](docs/assets/images/screenshots/Jumped_To_Annotation.png)

When combined with an NLP processing engine like Gremlin Engine (another of our open source projects),
OpenContracts not only lets humans collaborate on and share document annotations, it can also analyze and export data
from contracts using state-of-the-art NLP technology.
Now, in the version 2 release (currently in beta), we've incorporated LLMs and vector databases to
provide a seamless and efficient workflow for processing large volumes of documents in parallel. At the core of the
system are pgvector for vector storage and search, LlamaIndex for retrieval, and the Marvin framework for data
parsing and extraction.

Users can still create and edit annotations directly within the platform, enabling them to enrich documents with their
own insights and domain expertise. Through a custom LlamaIndex `DjangoVectorStore`, we can expose this structured data
(human-annotated text with embeddings) to LLMs and the LlamaIndex ecosystem.
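
For a rough sense of the pattern, here is a minimal sketch only: the vector-store class name is taken from the extract documentation in this repo, while the import path, `from_params` arguments, and the example query are assumptions.

```python
# Illustrative sketch: exposing OpenContracts annotations to LlamaIndex.
# The import path and `from_params` arguments are assumptions, not a verified public API.
from llama_index.core import VectorStoreIndex

from opencontractserver.llms.vector_stores import DjangoAnnotationVectorStore  # hypothetical path

# Wrap a document's human-annotated text + embeddings in a vector store...
vector_store = DjangoAnnotationVectorStore.from_params(document_id=123)

# ...and hand it to LlamaIndex so an LLM can retrieve over the annotations.
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("What is the termination notice period?")
```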

Finally, the tool's intuitive interface allows for easy navigation through documents, providing clear visual cues to identify
the exact source of information extracted by the language model. This transparency ensures that users can verify the
accuracy and context of the extracted data.


## Documentation

@@ -49,14 +60,4 @@ use to generate a text and x-y coordinate layer from scratch. Formats like .docx
to provide an easy, consistent format. Likewise, the output quality of many converters and tools is sub-par and these
tools can produce very different document structures for the same inputs.

## About OpenSource.Legal

OpenSource.Legal believes that the effective, digital transformation of the legal services industry and the execution of
"the law", broadly speaking, requires shared solutions and tools to solve some of the problems that are common to almost
every legal workflow. The current splintering of service delivery into dozens of incompatible platforms with limited
configurations threatens to put software developers and software vendors in the driver seat of the industry. We firmly
believe that lawyers and legal engineers, armed with easily configurable and *extensible* tools can much more effectively
design the workflows and user experiences that they need to deliver and scale their expertise.

Visit us at https://opensource.legal for a directory of open source legal projects and an overview of our
projects.
**Adding OCR and ingestion for other enterprise documents is a priority**.
4 changes: 3 additions & 1 deletion config/graphql/queries.py
@@ -738,7 +738,9 @@ def resolve_extract(self, info, **kwargs):
Q(id=django_pk) & (Q(creator=info.context.user) | Q(is_public=True))
)

extracts = DjangoFilterConnectionField(ExtractType, filterset_class=ExtractFilter)
extracts = DjangoFilterConnectionField(
ExtractType, filterset_class=ExtractFilter, max_limit=15
)

@login_required
def resolve_extracts(self, info, **kwargs):
Binary file added docs/assets/images/gifs/Add_Extract_Docs.gif
Binary file added docs/assets/images/gifs/Corpus_Query.gif
Binary file added docs/assets/images/gifs/Grid_Processing.gif
Binary file added docs/assets/images/logos/OpenContracts.webp
Binary file added docs/assets/images/logos/OpenContracts_Logo.png
133 changes: 92 additions & 41 deletions docs/extract_and_retrieval/document_data_extract.md
@@ -1,13 +1,21 @@
# Extracting Structured Data from Documents using LlamaIndex, AI Agents, and Marvin

We've added a powerful feature called "extract" that enables the generation of structured data grids from a list of
documents using a combination of vector search, AI agents, and the Marvin library. This functionality is implemented in
a Django application and leverages Celery for asynchronous task processing.
documents using a combination of vector search, AI agents, and the Marvin library.

All credit for the inspiration of this feature goes to the fine folks at Nlmatics. They were some of the first pioneers
working on datagrids from documents using a set of questions and custom transformer models. This implementation of their
concept ultimately leverages newer techniques and better models, but hats off to them for coming up with a design like
this 6 years ago!
The `run_extract` task orchestrates the extraction process, spinning up a number of `llama_index_doc_query` tasks.
Each of these query tasks uses LlamaIndex with a Django/pgvector-backed vector store for search and retrieval, and
Marvin for data parsing and extraction. Each document and column is processed in parallel using Celery's task system.

All credit for the inspiration of this feature goes to the fine folks at [Nlmatics](https://www.nlmatics.com/). They
were some of the first pioneers working on datagrids from documents using a set of questions and custom transformer
models. This implementation of their concept ultimately leverages newer techniques and better models, but hats off
to them for coming up with a design like this in 2017/2018!

The current implementation relies heavily on [LlamaIndex](https://docs.llamaindex.ai/en/stable/), specifically
their vector store tooling, their reranker and their agent framework.

Structured data extraction is powered by the amazing [Marvin library](https://github.com/prefecthq/marvin).
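
To give a concrete sense of what Marvin handles, here is a minimal, self-contained sketch: `marvin.cast` and `marvin.extract` are real Marvin calls, while the Pydantic model, instructions, and sample text are invented for illustration.

```python
# Minimal sketch of the Marvin calls used for structured extraction.
# The EffectiveDate model and the sample text are invented for illustration.
from datetime import date

import marvin
from pydantic import BaseModel


class EffectiveDate(BaseModel):
    effective_date: date
    source_sentence: str


text = "This Agreement is effective as of January 5, 2023 (the 'Effective Date')."

# Cast free text to a single structured instance...
single = marvin.cast(text, target=EffectiveDate, instructions="Find the agreement's effective date.")

# ...or extract a list of instances when multiple values are expected.
many = marvin.extract(text, target=EffectiveDate, instructions="Find all effective dates.")
```

In the extract pipeline described below, the choice between casting a single value and extracting a list corresponds to the column's `extract_is_list` flag.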

## Overview

@@ -17,13 +25,13 @@ The extract process involves the following key components:
2. **Fieldset**: A set of columns defining the structure of the data to be extracted.
3. **LlamaIndex**: A library used for efficient vector search and retrieval of relevant document sections.
4. **AI Agents**: Intelligent agents that analyze the retrieved document sections and extract structured data.
5. **Marvin**: A library that facilitates the parsing and extraction of structured data from text.
5. **[Marvin](https://github.com/prefecthq/marvin)**: A library that facilitates the parsing and extraction of structured data from text.

The extract process is initiated by creating an `Extract` object that specifies the document corpus and the fieldset defining the desired data structure. The process is then broken down into individual tasks for each document and column combination, allowing for parallel processing and scalability.
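
As a rough sketch of how such an extract might be configured through the Django ORM (model and field names here are assumptions inferred from this walkthrough, not a verified schema):

```python
# Hypothetical sketch: configuring an extract. Model and field names are assumptions
# inferred from this walkthrough (Extract, Fieldset, Column, output_type, query,
# agentic, extract_is_list), not a verified schema.
from opencontractserver.extracts.models import Column, Extract, Fieldset  # assumed import path

# `user` and `corpus` are assumed to already exist.
fieldset = Fieldset.objects.create(name="Key contract terms", creator=user)

Column.objects.create(
    fieldset=fieldset,
    name="Termination notice period",
    query="How many days' notice are required to terminate?",
    output_type="str",          # Marvin parses the retrieved text into this type
    agentic=False,              # no agent pass for this simple field
    extract_is_list=False,      # single value, not a list
    creator=user,
)

extract = Extract.objects.create(
    name="Q2 vendor agreements",
    corpus=corpus,              # the document corpus to run over
    fieldset=fieldset,
    creator=user,
)
```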

## Detailed Walkthrough

Let's dive into the code and understand how the extract process works step by step.
Here's how the extract process works step by step.

### 1. Initiating the Extract Process

@@ -39,39 +47,82 @@ The `run_extract` function is the entry point for initiating the extract process
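
The step-by-step walkthrough later in this doc covers the details; as a quick orientation, the orchestration is a standard Celery chord. Below is a stripped-down sketch, not the project's actual code: the task names mirror this doc, while the helper functions are hypothetical stand-ins.

```python
# Stripped-down sketch of the orchestration pattern described here (not the
# project's actual code): one llama_index_doc_query task per (document, column)
# pair, then a chord callback that marks the extract finished.
from celery import chord, shared_task


@shared_task
def llama_index_doc_query(cell_id: int) -> int:
    ...  # vector search + Marvin parsing for a single Datacell (see section 2)
    return cell_id


@shared_task
def mark_extract_complete(results, extract_id: int) -> None:
    ...  # set the Extract's `finished` timestamp once every datacell task has run


@shared_task
def run_extract(extract_id: int, user_id: int) -> None:
    # get_document_ids, get_column_ids, and create_datacell are hypothetical
    # helpers standing in for the ORM calls the real task performs.
    tasks = [
        llama_index_doc_query.si(cell_id=create_datacell(extract_id, doc_id, col_id, user_id))
        for doc_id in get_document_ids(extract_id)
        for col_id in get_column_ids(extract_id)
    ]
    # Run all query tasks in parallel; the callback fires exactly once at the end.
    chord(tasks)(mark_extract_complete.s(extract_id=extract_id))
```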

### 2. Processing Individual Datacells

The `llama_index_doc_query` function is responsible for processing each individual `Datacell`. It performs the following steps:

1. Retrieves the `Datacell` object from the database based on the provided `cell_id`.
2. Sets the `started` timestamp of the datacell to the current time.
3. Retrieves the associated `document` and initializes the necessary components for vector search and retrieval using LlamaIndex, including the embedding model, language model, and vector store.
4. Performs a vector search to retrieve the most relevant document sections based on the search text or query specified in the datacell's column.
5. Extracts the retrieved annotation IDs and associates them with the datacell as sources.
6. If the datacell's column is marked as "agentic," it uses an AI agent to further analyze the retrieved document sections and extract additional information such as defined terms and section references.
7. Prepares the retrieved text and additional information for parsing using the Marvin library.
8. Depending on the specified output type of the datacell's column, it uses Marvin to extract the structured data as either a list or a single instance.
9. Parses the extracted data and stores it in the datacell's `data` field based on the output type (e.g., BaseModel, str, int, bool, float).
10. Sets the `completed` timestamp of the datacell to the current time.
11. If an exception occurs during processing, it sets the `failed` timestamp and stores the error stacktrace in the datacell.

### 3. Marking the Extract as Complete

Once all the datacells have been processed, the `mark_extract_complete` function is triggered by the Celery chord. It retrieves the `Extract` object based on the provided `extract_id` and sets the `finished` timestamp to the current time, indicating that the extract process is complete.

## Benefits and Considerations

The extract functionality offers several benefits:

1. **Structured Data Extraction**: It enables the extraction of structured data from unstructured or semi-structured documents, making the information more accessible and actionable.
2. **Scalability**: By breaking down the process into individual tasks for each document and column combination, it allows for parallel processing and scalability, enabling the handling of large document corpora.
3. **Flexibility**: The use of fieldsets allows for the definition of custom data structures tailored to specific requirements.
4. **AI-Powered Analysis**: The integration of AI agents and the Marvin library enables intelligent analysis and extraction of relevant information from the retrieved document sections.
5. **Asynchronous Processing**: The use of Celery for asynchronous task processing ensures that the extract process doesn't block the main application and can be performed in the background.

However, there are a few considerations to keep in mind:

1. **Processing Time**: Depending on the size of the document corpus and the complexity of the fieldset, the extract process may take a considerable amount of time to complete.
2. **Error Handling**: Proper error handling and monitoring should be implemented to handle any exceptions or failures during the processing of individual datacells.
3. **Data Validation**: The extracted structured data may require additional validation and cleansing steps to ensure its quality and consistency.
The `llama_index_doc_query` function is responsible for processing each individual `Datacell`.

#### Execution Flow Visualized:

```mermaid
graph TD
I[llama_index_doc_query] --> J[Retrieve Datacell]
J --> K[Create HuggingFaceEmbedding]
K --> L[Create OpenAI LLM]
L --> M[Create DjangoAnnotationVectorStore]
M --> N[Create VectorStoreIndex]
N --> O{Special character '|||' in search_text?}
O -- Yes --> P[Split examples and average embeddings]
P --> Q[Query annotations using averaged embeddings]
Q --> R[Rerank nodes using SentenceTransformerRerank]
O -- No --> S[Retrieve results using index retriever]
S --> T[Rerank nodes using SentenceTransformerRerank]
R --> U{Column is agentic?}
T --> U
U -- Yes --> V[Create QueryEngineTool]
V --> W[Create FunctionCallingAgentWorker]
W --> X[Create StructuredPlannerAgent]
X --> Y[Query agent for definitions]
U -- No --> Z{Extract is list?}
Y --> Z
Z -- Yes --> AA[Extract with Marvin]
Z -- No --> AB[Cast with Marvin]
AA --> AC[Save result to Datacell]
AB --> AC
AC --> AD[Mark Datacell complete]
```
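
To make the '|||' branch in the diagram concrete, here is a minimal sketch of averaging example embeddings, querying by cosine distance, and reranking. `HuggingFaceEmbedding`, `SentenceTransformerRerank`, and `CosineDistance` are real library classes; the model names, the `embedding` and `raw_text` fields on `Annotation`, and the import path are assumptions.

```python
# Sketch of the '|||' branch: average the embeddings of several example texts,
# query annotations by cosine distance (pgvector), then rerank with a cross-encoder.
# Field names on Annotation and the model names are assumptions.
import numpy as np
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.schema import NodeWithScore, TextNode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from pgvector.django import CosineDistance

from opencontractserver.annotations.models import Annotation  # assumed import path

embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# `search_text` and `document_id` come from the datacell's column and document.
examples = search_text.split("|||")
avg_embedding = np.mean([embed_model.get_text_embedding(e) for e in examples], axis=0)

# Nearest annotations for this document by cosine distance to the averaged embedding.
candidates = (
    Annotation.objects.filter(document_id=document_id)
    .annotate(distance=CosineDistance("embedding", avg_embedding.tolist()))
    .order_by("distance")[:20]
)

# Rerank the candidates with a cross-encoder and keep the top few as sources.
nodes = [NodeWithScore(node=TextNode(text=a.raw_text, id_=str(a.id)), score=1.0) for a in candidates]
reranked = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3).postprocess_nodes(
    nodes, query_str=" ".join(examples)
)
retrieved_text = "\n".join(n.node.get_content() for n in reranked)
```
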
#### Step-by-step Walkthrough

1. The `run_extract` task is called with an `extract_id` and `user_id`. It retrieves the corresponding `Extract` object and marks it as started.

2. It then iterates over the document IDs associated with the extract. For each document and each column in the extract's fieldset, it:
- Creates a new `Datacell` object with the extract, column, output type, creator, and document.
- Sets CRUD permissions for the datacell to the user.
- Appends a `llama_index_doc_query` task to a list of tasks, passing the datacell ID.

3. After all datacells are created and their tasks added to the list, a Celery `chord` is used to group the tasks. Once all tasks are complete, it calls the `mark_extract_complete` task to mark the extract as finished.

4. The `llama_index_doc_query` task processes each individual datacell. It:
- Retrieves the datacell and marks it as started.
- Creates a `HuggingFaceEmbedding` model and sets it as the `Settings.embed_model`.
- Creates an `OpenAI` LLM and sets it as the `Settings.llm`.
- Creates a `DjangoAnnotationVectorStore` from the document ID and column settings.
- Creates a `VectorStoreIndex` from the vector store.

5. If the `search_text` contains the special character '|||':
- It splits the examples and calculates the embeddings for each example.
- It calculates the average embedding from the individual embeddings.
- It queries the `Annotation` objects using the averaged embeddings and orders them by cosine distance.
- It reranks the nodes using `SentenceTransformerRerank` and retrieves the top-n nodes.
- It adds the annotation IDs of the reranked nodes to the datacell's sources.
- It retrieves the text from the reranked nodes.

6. If the `search_text` does not contain the special character '|||':
- It retrieves the relevant annotations using the index retriever based on the `search_text` or `query`.
- It reranks the nodes using `SentenceTransformerRerank` and retrieves the top-n nodes.
- It adds the annotation IDs of the reranked nodes to the datacell's sources.
- It retrieves the text from the retrieved nodes.

7. If the column is marked as `agentic` (see the sketch after this list):
- It creates a `QueryEngineTool`, `FunctionCallingAgentWorker`, and `StructuredPlannerAgent`.
- It queries the agent to find defined terms and section references in the retrieved text.
- The definitions and section text are added to the retrieved text.

8. Depending on whether the column's `extract_is_list` is true, it either:
- Extracts a list of the `output_type` from the retrieved text using Marvin, with optional `instructions` or `query`.
- Casts the retrieved text to the `output_type` using Marvin, with optional `instructions` or `query`.

9. The result is saved to the datacell's `data` field based on the `output_type`. The datacell is marked as completed.

10. If an exception occurs during processing, the error is logged, saved to the datacell's `stacktrace`, and the
datacell is marked as failed.
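
For the agentic branch in step 7, the agent construction pattern in LlamaIndex looks roughly like the following. This is a sketch under assumed prompts and parameters: `QueryEngineTool`, `FunctionCallingAgentWorker`, and `StructuredPlannerAgent` are real LlamaIndex classes, but the exact arguments and prompt text the project uses are not shown here.

```python
# Sketch of the agentic branch (step 7): wrap the document index in a query-engine
# tool and have a planner agent hunt for defined terms and referenced sections.
# Prompt text and parameters are illustrative, not the project's actual values.
from llama_index.core.agent import FunctionCallingAgentWorker, StructuredPlannerAgent
from llama_index.core.tools import QueryEngineTool
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")

# `index` is the VectorStoreIndex built over the document's annotations (step 4).
doc_engine = index.as_query_engine(llm=llm)
tools = [
    QueryEngineTool.from_defaults(
        query_engine=doc_engine,
        name="document_parts",
        description="Looks up defined terms and referenced sections in this document.",
    )
]

worker = FunctionCallingAgentWorker.from_tools(tools, llm=llm, verbose=True)
agent = StructuredPlannerAgent(worker, tools=tools, verbose=True)

# Ask the agent to resolve definitions and section references for the retrieved text,
# then append its answer to the text that Marvin will parse (step 8).
response = agent.query(
    "Identify any defined terms or section references in the following text and "
    f"explain what they mean in this document:\n\n{retrieved_text}"
)
retrieved_text += "\n\n" + str(response)
```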

## Next Steps
