
Write Your Own Agentic, LlamaIndex Data Extractor

Refresher on What an Open Contracts Data Extractor Does

When you create a new Extract on the frontend, you can build a grid of data field columns and document rows that the application will traverse, cell-by-cell, to answer the question posed in each column for every document:

datagrid

You can define the target data shape for each column - e.g. require that all outputs match a certain dictionary schema or be floats. We leverage LLMs to ensure that the retrieved data matches the desired schema.

You'll notice when you add or edit a column, you can configure a number of different things:

datagrid

Specifically, you can adjust:

- name: The name of the column.
- query: The query used for extraction.
- match_text: Text we want to match semantically to find responsive text. If this field is provided, we use it instead of the query for retrieval; if not, we fall back to the query.
- must_contain_text: Text that must be contained in a returned annotation. This is case insensitive.
- output_type: The type of data to be extracted. This can be a Python primitive or a simple Pydantic model (see the sketch below).
- instructions: Instructions for the extraction process. This tells our parser how to convert retrieved text to the target output_type. Not strictly necessary, but recommended, especially for objects.
- task_name: The name of the registered celery extract task to use for processing (this lets you define and deploy custom ones). We'll show you how to create a custom one in this walkthrough.
- agentic: Boolean indicating if the extraction is agentic.
- extract_is_list: Boolean indicating if the extraction result is a list of the output_type you provided.
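To make output_type and instructions concrete, here's a hypothetical column configuration sketched in Python. The CompanyInfo model and every value below are invented for illustration; only the field names mirror the actual column settings:

from pydantic import BaseModel

# Hypothetical Pydantic schema used as the column's output_type.
class CompanyInfo(BaseModel):
    name: str
    state_of_incorporation: str

# Illustrative column settings (values are made up).
column_config = {
    "name": "Company Info",
    "query": "Which company is party to this agreement, and where is it incorporated?",
    "match_text": "organized under the laws of",
    "must_contain_text": "corporation",
    "output_type": CompanyInfo,
    "instructions": "Return the company name and its state of incorporation.",
    "task_name": "oc_llama_index_doc_query",
    "agentic": False,
    "extract_is_list": False,
}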

You'll notice that in the GUI, there is a dropdown to pick the extract task:

Extract_Task_Dropdown.png

This list is actually retrieved dynamically from the backend, based on the tasks in opencontractsserver.tasks.data_extract_tasks.py. Every celery task in this Python module will show up in the GUI, and the description shown in the dropdown is pulled from the docstring provided in the code itself:

@shared_task
def oc_llama_index_doc_query(cell_id, similarity_top_k=15, max_token_length: int = 512):
    """
    OpenContracts' default LlamaIndex and Marvin-based data extract pipeline to run queries specified for a
    particular cell.
    """
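Because the docstring travels with the task object, the backend can enumerate everything in the module at runtime with no extra registration step. Here's a minimal sketch of how such discovery could work - illustrative only, not the actual OpenContracts implementation:

import inspect

from opencontractsserver.tasks import data_extract_tasks  # module referenced above

def list_extract_task_choices(module):
    """Collect (task_name, description) pairs for each Celery task in a module."""
    choices = []
    for name, obj in vars(module).items():
        # Duck-type check: Celery task objects (and their lazy proxies)
        # expose .delay(), and they keep the decorated function's docstring.
        if hasattr(obj, "delay"):
            choices.append((name, (inspect.getdoc(obj) or "").strip()))
    return choices

print(list_extract_task_choices(data_extract_tasks))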
 

Step 7 - Post-Process and Store Data

At this stage we could use a structured data parser, or we could just store the answer from the agent. For simplicity, let's do the latter:

datacell.data = {"data": str(response)}
datacell.completed = timezone.now()  # timezone here is django.utils.timezone
datacell.save()
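If you later want the structured-parsing route instead, the same Marvin library used by the default pipeline can coerce the agent's free-text answer into your column's output_type. A hedged sketch, assuming Marvin 2.x's marvin.cast and a float-typed column (response and datacell come from the surrounding task, as above):

import marvin
from django.utils import timezone

# Ask Marvin (an LLM under the hood) to map the raw answer onto the target type.
parsed_value = marvin.cast(str(response), target=float)

datacell.data = {"data": parsed_value}
datacell.completed = timezone.now()
datacell.save()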

Step 8 - Rebuild Containers and Look at Your Frontend

The next time you rebuild the containers (required in prod; in a local env they rebuild automatically), you will see a new entry in the column configuration modals:

Custom_Extract_Task.png

It's that easy! Now, any user in your instance can run your extract and generate outputs - here we've used it for the Company Name column:

Agent_Extract_Demo_Run.png

We plan to create decorators and other developer aids to reduce boilerplate here and let you focus entirely on your retrieval pipeline.

Conclusion

By breaking the task down step-by-step, you can see how the custom vector store integrates with LlamaIndex to provide powerful semantic search capabilities within a Django application. Even better, if you write your own data extract tasks, you can expose them to users who don't have to know anything at all about how they're built. This is the way it should be - separation of concerns!
