UW-Macrostrat/factsheet-generator

About The Project

The factsheet generator uses Retrieval-Augmented Generation (RAG), run over a distributed cluster, to extract key facts from a dataset of documents.

System Design

RAG overview

In the RAG pipeline, each document is first split into chunks of 500-1000 tokens, and an embedding is generated for each chunk. The chunk-embedding pairs are stored in a PostgreSQL database with the pgvector extension. Once the entire dataset has been processed, similarity searches over the stored embeddings retrieve the chunks most relevant to a given query.
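
The chunk, embed, store, and search steps can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for the real embedding model, tokens are approximated by whitespace-separated words, and an in-memory list replaces the pgvector table. None of these names come from the repository.

```python
import math
from collections import Counter


def chunk_tokens(text, chunk_size=500):
    """Split text into chunks of at most chunk_size whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def toy_embed(text):
    """Stand-in embedding (assumption): word counts, NOT a real embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(documents, chunk_size=500):
    """Return (chunk, embedding) pairs; a list stands in for the pgvector table."""
    return [
        (chunk, toy_embed(chunk))
        for doc in documents
        for chunk in chunk_tokens(doc, chunk_size)
    ]


def top_k(query, store, k=3):
    """Return the k chunks whose embeddings are most similar to the query."""
    query_embedding = toy_embed(query)
    ranked = sorted(
        store,
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]
```

In a real deployment the sorting step would presumably be pushed into the database as a nearest-neighbour query rather than done in Python.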

Queries are written beforehand for each fact that should be extracted. For each stratigraphic unit, the relevant retrieved data and a query are combined into a prompt for an LLM, which generates the facts.
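
One way to picture this step is the sketch below, which combines retrieved chunks and a pre-written query into a prompt for each fact. The query texts, the prompt wording, and the names `FACT_QUERIES`, `build_prompt`, and `prompts_for_unit` are hypothetical and are not taken from the repository; `retrieve` is any callable that returns relevant chunks for a search string.

```python
# Hypothetical pre-written queries, one per fact to extract (assumption).
FACT_QUERIES = {
    "age": "What is the geologic age of this unit?",
    "lithology": "What rock types make up this unit?",
}


def build_prompt(unit_name, query, context_chunks):
    """Combine retrieved chunks and one query into a single LLM prompt."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(context_chunks)
    )
    return (
        f"Context:\n{context}\n\n"
        f"Stratigraphic unit: {unit_name}\n"
        f"Question: {query}\n"
        "Answer using only the context above."
    )


def prompts_for_unit(unit_name, retrieve, queries=FACT_QUERIES):
    """Build one prompt per fact; retrieve(search_text) returns relevant chunks."""
    return {
        fact: build_prompt(unit_name, query, retrieve(f"{unit_name}: {query}"))
        for fact, query in queries.items()
    }
```

Each resulting prompt would then be sent to the LLM, and the answers collected into the factsheet for that unit.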

System overview

Worker nodes running the LLM and the embedding model are distributed across the COSMOS machines using Docker Swarm. A master node delegates tasks to them through gRPC requests. The embeddings and factsheets generated by the worker nodes are sent to a container running PostgreSQL with pgvector.
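
The delegation pattern can be sketched in-process as follows. This is only an analogy under stated assumptions: a thread pool stands in for the Docker Swarm workers, a plain function call stands in for a gRPC request, and the returned list stands in for rows written to PostgreSQL. `worker_task` and `master_dispatch` are hypothetical names, and the "embedding" is a placeholder rather than model output.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(chunk):
    """Hypothetical worker job: return the chunk with a placeholder embedding.

    A real worker would run the embedding model (or the LLM) here.
    """
    return chunk, [float(len(chunk)), float(len(chunk.split()))]


def master_dispatch(chunks, num_workers=4):
    """Fan chunks out to workers and collect results in input order."""
    rows = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves the order of the input chunks.
        for chunk, embedding in pool.map(worker_task, chunks):
            rows.append({"chunk": chunk, "embedding": embedding})
    return rows
```

In the actual system the workers are separate containers, so each call would cross the network and the results would be inserted into the database rather than returned.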

app
embeddings
images
llms
nb
scripts
utils
.gitignore
LICENSE
README.md
docker-compose.yml
llm.proto
master.Dockerfile
nb.Dockerfile
node.Dockerfile
pgvector.Dockerfile
pgvector.sql
requirements.txt
startup.py
worker.Dockerfile
workerserver.proto
