The factsheet generator uses Retrieval-Augmented Generation (RAG), running on a distributed cluster, to extract key facts from a dataset of documents.
In the RAG pipeline, documents are first split into chunks of 500-1000 tokens each. An embedding is then generated for each text chunk, and the chunk-embedding pairs are stored in a pgvector database. Once the entire dataset has been processed, similarity searches retrieve the chunks most relevant to a given query.
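The chunking and retrieval steps can be illustrated with a minimal in-memory sketch. It is not the project's implementation: the function names (`chunk_tokens`, `cosine_similarity`, `top_k`) are hypothetical, the fixed-size splitter is an assumption (the last chunk may be shorter than 500 tokens), and a plain Python list stands in for the pgvector table. Cosine similarity is also an assumption, since the similarity metric is not stated here.

```python
import math
from typing import List, Sequence, Tuple


def chunk_tokens(tokens: Sequence[str], max_tokens: int = 1000) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [list(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_embedding: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Return the text of the k stored chunks most similar to the query embedding."""
    ranked = sorted(
        store,
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

In the real pipeline the ranking would be done by the database rather than in Python; pgvector provides distance operators such as `<=>` (cosine distance) for this purpose.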
A set of queries is written beforehand, one for each fact that should be extracted. For each stratigraphic unit, the retrieved data and the corresponding query are combined into a prompt, from which an LLM generates the fact.
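A minimal sketch of how a prompt might be assembled from a query and the retrieved chunks is shown below. The prompt wording, the `FACT_QUERIES` entries, and the `build_prompt` function are all hypothetical and chosen only for illustration; the actual queries and prompt template are not given here.

```python
from typing import Dict, List

# Hypothetical fact names and queries; the real query set is written per fact.
FACT_QUERIES: Dict[str, str] = {
    "age": "What is the geologic age of this unit?",
    "lithology": "What rock types make up this unit?",
}


def build_prompt(unit: str, query: str, chunks: List[str]) -> str:
    """Combine retrieved chunks and a query into a single LLM prompt."""
    # Number each chunk so the model (and a reader) can tell passages apart.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        f"Stratigraphic unit: {unit}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above."
    )
```

Calling `build_prompt` once per unit and per entry in `FACT_QUERIES` would yield one prompt for each fact on the factsheet.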
Worker nodes running the LLM and the embedding model are distributed across the COSMOS machines using Docker Swarm. A master node delegates tasks to the workers through gRPC requests. The embeddings and factsheets produced by the worker nodes are sent to a container running PostgreSQL with pgvector.
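The delegation step can be pictured with the sketch below, which only models how a master might divide tasks among workers. Round-robin assignment is an assumption (the scheduling policy is not described here), the `Task` type, its `kind` values, and the worker names are hypothetical, and the gRPC call itself is omitted because it depends on the project's service definitions.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Task:
    kind: str     # hypothetical task kinds, e.g. "embed" or "generate"
    payload: str  # chunk text or prompt to process


def assign_round_robin(tasks: Iterable[Task], workers: List[str]) -> Dict[str, List[Task]]:
    """Assign tasks to workers in rotation and return the per-worker plan."""
    if not workers:
        raise ValueError("at least one worker is required")
    plan: Dict[str, List[Task]] = {worker: [] for worker in workers}
    # zip stops when the tasks run out; cycle repeats the worker list as needed.
    for task, worker in zip(tasks, cycle(workers)):
        plan[worker].append(task)
    return plan
```

In a real deployment, each entry in the plan would correspond to a gRPC request from the master to that worker, with results written to the PostgreSQL container.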