Add Data Extraction #117

Merged
merged 69 commits into main from JSv4/add-data-extraction
Jun 19, 2024

JSv4
Owner

@JSv4 JSv4 commented May 27, 2024

feat: Add Background Annotation Extracts and Supporting Models

  • Added new models in the extracts app for background annotation extracts.

    • Extract: Represents a headless, background annotation task linked to a Corpus and Fieldset.
    • Fieldset: Defines a reusable set of fields for Extracts, linked to Columns.
    • Column: Represents a discrete data structure to extract from a document, with various properties like query, match_text, output_type, and more.
    • Row: Represents extracted data for each column and document, storing data as JSON.
    • LanguageModel: Represents a language model to be used in the extraction process.
  • Defined new GraphQL types, queries, and mutations for managing the new models.

    • Added CRUD operations for Extract, Fieldset, Column, Row, and LanguageModel.
    • Included necessary permissions using Django Guardian.
  • Implemented Celery tasks to process extracts and generate rows.

    • Task workflow includes creating Row instances for each column and document, performing vector search, handling agentic fetch, and extracting data using a custom function.
  • Followed existing design patterns for model definitions, GraphQL schema, and permissions.

  • Ensured modularity and efficiency in the workflow with Celery chains and groups for parallel execution.
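The relationships between the models listed above can be sketched with plain dataclasses. This is only an illustration of the described shape, not the PR's Django model code: `query`, `match_text`, and `output_type` come from the description, while `name`, `corpus_id`, `document_id`, and the `rows` list are hypothetical names added for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Column:
    """One discrete piece of data to extract from a document."""
    name: str                         # hypothetical, not named in the description
    query: str                        # named in the description
    output_type: str                  # named in the description
    match_text: Optional[str] = None  # named in the description


@dataclass
class Fieldset:
    """A reusable set of columns that extracts can share."""
    name: str                         # hypothetical
    columns: List[Column] = field(default_factory=list)


@dataclass
class Row:
    """Extracted data for one (document, column) pair, held as JSON-like data."""
    document_id: int                  # hypothetical stand-in for a document reference
    column: Column
    data: Optional[Dict[str, Any]] = None


@dataclass
class Extract:
    """A background extraction job linking a corpus to a fieldset."""
    corpus_id: int                    # hypothetical stand-in for a Corpus reference
    fieldset: Fieldset
    rows: List[Row] = field(default_factory=list)
```

In this sketch a Fieldset can be reused across several Extract instances, which matches the "reusable set of fields" wording in the description.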

This update extends the application's annotation capabilities with automated background extraction of structured data from documents, which improves efficiency and scalability.
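The fan-out described in the task workflow (one Row per column and document, processed in parallel) can be sketched with the standard library alone. This is a minimal stand-in under stated assumptions, not the PR's Celery code: a thread pool plays the role of a Celery group, and `extract_cell` and `run_extract` are hypothetical names whose bodies only mark where vector search, agentic fetch, and extraction would happen.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def extract_cell(document_id, column_name):
    """Placeholder for the per-cell work (search, fetch, extraction)."""
    # Hypothetical stand-in: a real task would query the document here.
    return {"document_id": document_id, "column": column_name, "data": None}


def run_extract(document_ids, column_names, max_workers=4):
    """Build one row per (document, column) pair and process them in parallel."""
    cells = list(product(document_ids, column_names))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so rows line up with `cells`.
        rows = list(pool.map(lambda cell: extract_cell(*cell), cells))
    return rows
```

Calling `run_extract([1, 2], ["party", "date"])` yields four rows, one for each document and column combination. In a Celery setting each cell would typically become its own task so that workers can pick them up independently.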

codecov bot commented May 27, 2024

Codecov Report

Attention: Patch coverage is 90.00000% with 42 lines in your changes missing coverage. Please review.

Project coverage is 68.73%. Comparing base (3f67012) to head (ece27d2).
Report is 2 commits behind head on main.

| File | Patch % | Lines missing |
| --- | --- | --- |
| opencontractserver/llms/vector_stores.py | 75.28% | 22 ⚠️ |
| opencontractserver/tasks/extract_tasks.py | 88.57% | 12 ⚠️ |
| opencontractserver/tasks/query_tasks.py | 89.36% | 5 ⚠️ |
| opencontractserver/extracts/apps.py | 77.77% | 2 ⚠️ |
| opencontractserver/utils/etl.py | 97.14% | 1 ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #117      +/-   ##
==========================================
+ Coverage   65.36%   68.73%   +3.37%     
==========================================
  Files          51       58       +7     
  Lines        2249     2658     +409     
==========================================
+ Hits         1470     1827     +357     
- Misses        779      831      +52     


JSv4 added 23 commits May 26, 2024 23:32
…pes match older convention. A lot of the requisite components for the extract flow are built out.
…Also made DataGrid pass TypeScript checks, subbing in some console.log statements for now.
…ew mutations and a modal to select and add documents that's very similar to the Document view itself (ideally, that would have been re-used, but it was different enough that recoding it seemed expedient).
JSv4 added 28 commits June 10, 2024 20:39
…) to allow target annotations for render to get passed in as props and not get overridden by annotator data loaded on mount.
…d cassette file from repo where it contained a model binary. It was too large. Now this won't be necessary thanks to the cached model in the image.
…wer and fixed a bug that was cropping up in a month-old image on Docker Hub. Pushed local image to Docker Hub.
@JSv4 JSv4 merged commit f55cdcf into main Jun 19, 2024
5 checks passed
@JSv4 JSv4 deleted the JSv4/add-data-extraction branch July 19, 2024 07:36