| Example notebook | Link |
|---|---|
| RAG pipeline with LLMs loaded from Hugging Face | 📔 |
| RAG pipeline with FiD generator | 📔 |
| RAG pipeline with REPLUG-based generator | 📔 |
| RAG pipeline with LLMs running on Gaudi2 | 📔 |
| RAG pipeline with quantized LLMs running on an ONNX Runtime backend | 📔 |
| RAG pipeline with LLMs running on a Llama-CPP backend | 📔 |
| Optimized and quantized embedding models for retrieval and ranking | 📔 |
| RAG pipeline with PLAID index and ColBERT Ranker | 📔 |
| RAG pipeline with Qdrant index | 📔 |
| RAG pipeline for summarization of multiple documents | 📔 |
Generate answers to questions that are answerable using a corpus of knowledge.
- Retrieval with fast lexical retrieval using BM25, or late-interaction dense retrieval with PLAID
- Ranking with Sentence Transformers or ColBERT; we also offer highly optimized, quantized re-rankers for fast inference (see how to get your own here)
- Generation with Fusion-in-Decoder (FiD)
flowchart LR
id1[(Elastic<br>/PLAID)] <--> id2(BM25<br>/ColBERT) --> id3(ST<br>/ColBERT) --> id4(FiD)
style id1 fill:#E1D5E7,stroke:#9673A6
style id2 fill:#DAE8FC,stroke:#6C8EBF
style id4 fill:#D5E8D4,stroke:#82B366
📓 Efficient and fast ODQA with PLAID, ColBERT and FiD
📓 Quantized Retrievers and Rankers using bi-encoders
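For intuition, the sketch below illustrates the retrieve-and-rerank stage of such a pipeline using BM25 (via `rank_bm25`) and a Sentence Transformers cross-encoder. The toy corpus, query, and the `cross-encoder/ms-marco-MiniLM-L-6-v2` model are illustrative placeholders; the full PLAID/ColBERT/FiD setup is covered in the notebooks above.

```python
# Minimal retrieve-and-rerank sketch: BM25 lexical retrieval followed by
# re-ranking with a Sentence Transformers cross-encoder.
# The corpus, query, and model name below are illustrative placeholders.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "Paris is the capital and most populous city of France.",
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is Earth's highest mountain above sea level.",
]
query = "What is the capital of France?"

# Stage 1: lexical retrieval with BM25 over a whitespace-tokenized corpus.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:2]
candidates = [corpus[i] for i in top_k]

# Stage 2: re-rank the candidate passages with a cross-encoder.
ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = ranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(rerank_scores, candidates), reverse=True)]
print(reranked[0])
```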
To enhance generation with a Large Language Model (LLM) using retrieval augmentation, you can follow these steps:

- Define a retrieval flow: create a store that holds the relevant information, and one or more retrievers/rankers to retrieve the most relevant documents or passages.
- Define a prompt template: design a template with a suitable context or instruction, along with placeholders for the query and the information retrieved by the pipeline; these placeholders are filled in dynamically during generation.
- Request token generation from the LLM: pass the filled-in template to the LLM so it generates tokens based on the provided context, query, and retrieved information.

Most Hugging Face decoder LLMs are supported.
See a complete example in our RAG with LLMs📓 notebook.
flowchart LR
id1[(Index)] <-->id2(.. Retrieval pipeline ..) --> id3(Prompt Template) --> id4(LLM)
style id1 fill:#E1D5E7,stroke:#9673A6
style id2 fill:#DAE8FC,stroke:#6C8EBF
style id3 fill:#F3CECC,stroke:#B25450
style id4 fill:#D5E8D4,stroke:#82B366
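As a rough illustration of the last two steps, the sketch below fills a prompt template with a query and a few passages (assumed to have been returned by the retrieval pipeline) and requests tokens from a Hugging Face decoder LLM. The `gpt2` checkpoint and the passages are placeholders, not the pipeline's defaults; see the notebook above for the full flow.

```python
# Sketch: fill a prompt template with the query and the retrieved passages,
# then generate an answer with a Hugging Face decoder LLM.
# The model name and the retrieved passages are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT_TEMPLATE = (
    "Answer the question based only on the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {query}\n"
    "Answer:"
)

retrieved_docs = [
    "Paris is the capital and most populous city of France.",
    "France is a country in Western Europe.",
]
query = "What is the capital of France?"
prompt = PROMPT_TEMPLATE.format(context="\n".join(retrieved_docs), query=query)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```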
We support the algorithm introduced in REPLUG: Retrieval-Augmented Black-Box Language Models, which reads multiple documents in parallel to generate an answer to a given question.
📓 Using REPLUG for Parallel Document Reading with LLMs
flowchart LR
id1[(Index)] <--> id2(.. Retrieval pipeline ..) -- "in parallel" --> id4(Doc 1 ...\nDoc 2 ...\nDoc 3 ...)
style id1 fill:#E1D5E7,stroke:#9673A6
style id2 fill:#DAE8FC,stroke:#6C8EBF
style id4 fill:#D5E8D4,stroke:#82B366
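The following is a simplified, illustrative sketch of the REPLUG idea, not the library's implementation: each retrieved document conditions the LLM separately, and the per-document next-token distributions are combined with weights derived from the retrieval scores. The `gpt2` model, the documents, and the retrieval scores are placeholders.

```python
# Illustrative REPLUG-style sketch: ensemble the next-token distributions
# obtained by prepending each retrieved document to the query, weighted by
# (softmax-normalized) retrieval scores, and decode greedily.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

query = "What is the capital of France?"
docs = ["Paris is the capital of France.", "France borders Spain and Italy."]
retrieval_scores = torch.tensor([2.0, 1.0])       # similarity scores from the retriever
weights = torch.softmax(retrieval_scores, dim=0)  # normalized ensemble weights

prompts = [f"{d}\n\nQuestion: {query}\nAnswer:" for d in docs]
generated = ""
for _ in range(16):  # greedy decoding over the weighted ensemble
    ensemble_probs = None
    for prompt, w in zip(prompts, weights):
        ids = tokenizer(prompt + generated, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]      # next-token logits for this document
        probs = torch.softmax(logits, dim=-1) * w
        ensemble_probs = probs if ensemble_probs is None else ensemble_probs + probs
    next_id = int(torch.argmax(ensemble_probs))
    if next_id == tokenizer.eos_token_id:
        break
    generated += tokenizer.decode([next_id])
print(generated)
```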
We enable the use of the FiD model, which reads multiple documents in parallel and generates an answer by fusing the knowledge from all of the retrieved documents.
flowchart LR
id1[(Index)] <--> id2(.. Retrieval pipeline ..) --> id4(FiD)
style id1 fill:#E1D5E7,stroke:#9673A6
style id2 fill:#DAE8FC,stroke:#6C8EBF
style id4 fill:#D5E8D4,stroke:#82B366
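For intuition, here is a minimal conceptual sketch of the fusion idea using a plain T5 checkpoint from Hugging Face (a real FiD checkpoint is fine-tuned for this, and this is not the library's FiD class): each (question, passage) pair is encoded independently, the encoder outputs are concatenated, and the decoder attends over all of them at once. The `t5-small` model and the passages are placeholders, and passing precomputed `encoder_outputs` to `generate` relies on transformers' support for that pattern.

```python
# Conceptual FiD-style sketch: encode each (question, passage) pair separately,
# concatenate the encoder outputs, and let the decoder fuse them into one answer.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # placeholder; real FiD checkpoints are fine-tuned
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "What is the capital of France?"
passages = ["Paris is the capital of France.", "France is in Western Europe."]

# Encode each (question, passage) pair independently with the encoder.
encoded = [tokenizer(f"question: {question} context: {p}", return_tensors="pt") for p in passages]
with torch.no_grad():
    hidden = [model.encoder(**e).last_hidden_state for e in encoded]

# Fuse: concatenate encoder outputs along the sequence axis so the decoder
# can attend over every passage at once.
fused = torch.cat(hidden, dim=1)
fused_mask = torch.cat([e.attention_mask for e in encoded], dim=1)

output_ids = model.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=fused),
    attention_mask=fused_mask,
    max_new_tokens=20,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```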
Summarize topics given free-text input and a corpus of knowledge.
- Retrieval with BM25 or other retrievers
- Ranking with Sentence Transformers or other rankers
- Generation using a "summarize: " prompt, with all retrieved documents concatenated, and a FLAN-T5 generative model
flowchart LR
id1[(Elastic)] <--> id2(BM25) --> id3(SentenceTransformer) -- summarize--> id4(FLAN-T5)
style id1 fill:#E1D5E7,stroke:#9673A6
style id2 fill:#DAE8FC,stroke:#6C8EBF
style id4 fill:#D5E8D4,stroke:#82B366
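A minimal sketch of the generation step, assuming the documents have already been retrieved and ranked: concatenate them, prepend the "summarize: " prompt, and generate with FLAN-T5. The `google/flan-t5-base` checkpoint and the documents below are illustrative.

```python
# Sketch of the summarization generation step: "summarize: " prompt plus the
# concatenated retrieved documents, fed to a FLAN-T5 seq2seq model.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

retrieved_docs = [
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "It was initially criticized by artists but became a symbol of Paris.",
]

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = "summarize: " + " ".join(retrieved_docs)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```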