Retrieval Augmented Generation (RAG)?
The term Retrieval Augmented Generation (RAG) comes from a 2020 paper by Lewis et al. at Facebook AI Research. The idea is to use a pre-trained language model (LM) to generate text, while relying on a separate retrieval system to find relevant documents for the LM to condition on.
RAG & VectorDB
A vector database (VectorDB) provides the underlying retrieval infrastructure for RAG, enabling it to look up external knowledge efficiently. RAG then uses the retrieved information to improve the quality and relevance of the text it generates. Together, the two components let the system give more in-depth and accurate answers, especially for complex queries, and make it easier to see where the model is accurate and where it may be hallucinating answers or incorrectly skipping questions.
What is VectorDB?
A vector database is a vector search engine that lets you find similar vectors in a large dataset. It stores unstructured data (such as audio, video, images, text, and PDFs) in vectorized form, handles large volumes of high-dimensional data, performs similarity and nearest-neighbor searches efficiently, and offers powerful indexing with support for CRUD operations, metadata filtering, and horizontal scaling. This makes vector databases important in fields such as recommendation systems, object detection, image retrieval, and fraud detection.
Evaluation of RAG
Ragas is an open-source framework dedicated to evaluating the performance of RAG systems. It provides a range of scoring metrics that measure different aspects of a RAG system, offering a comprehensive, multi-angle view of the quality of a RAG application. The key metrics are listed below, followed by a short usage sketch.
• Faithfulness: Assessing the factual accuracy of the generated answer in the given context.
• Answer Relevancy: Evaluating the relevance of the generated answer to the question.
• Context Precision: Signal-to-noise ratio in the retrieved context.
• Answer Correctness: Assessing the accuracy of the generated answer compared to the ground truth.
• Answer Similarity: Evaluating the semantic resemblance between the generated answer and the ground truth.
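As a rough illustration, here is a minimal sketch of how these metrics might be computed with Ragas. The sample question, answer, and contexts are made up, and column names can vary slightly between Ragas versions, so treat this as a sketch rather than a definitive recipe.

```python
# Minimal Ragas evaluation sketch; the sample data is made up, and Ragas needs an
# LLM and embedding model (by default via OPENAI_API_KEY) to score the metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    answer_correctness,
    answer_similarity,
)

# Evaluation data: question, generated answer, retrieved contexts, ground truth.
# Note: column names ("ground_truth" vs. "ground_truths") differ across Ragas versions.
samples = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and largest city of France."]],
    "ground_truth": ["Paris"],
}
dataset = Dataset.from_dict(samples)

# Score the dataset with the metrics described above.
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        answer_correctness,
        answer_similarity,
    ],
)
print(result)
```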
Which indexes are used in a VectorDB?
Indexes supported in Milvus
Based on the data type they are suited to, the indexes supported in Milvus fall into two categories:
Indexes for floating-point embeddings:
A 128-dimensional floating-point embedding takes up 128 * sizeof(float) = 128 * 4 = 512 bytes of storage. The distance metrics used for floating-point embeddings are Euclidean distance (L2) and inner product (IP).
These indexes include FLAT, IVF_FLAT, IVF_PQ, IVF_SQ8, HNSW, and SCANN (beta) for CPU-based ANN search, and GPU_IVF_FLAT and GPU_IVF_PQ for GPU-based ANN search.
Indexes for binary embeddings
A 128-dimensional binary embedding takes up 128 / 8 = 16 bytes of storage. The distance metrics used for binary embeddings are Jaccard and Hamming.
These indexes include BIN_FLAT and BIN_IVF_FLAT.
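For concreteness, here is a minimal pymilvus sketch of creating an HNSW index on a float-vector field and a BIN_IVF_FLAT index on a binary-vector field. The collection and field names are hypothetical, and the parameter values are just typical choices.

```python
# Illustrative pymilvus index creation; assumes the collections already exist
# with the corresponding 128-dimensional vector fields. Names are hypothetical.
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")

# HNSW index on a FLOAT_VECTOR field, using Euclidean (L2) distance.
docs = Collection("documents")
docs.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 16, "efConstruction": 200},
    },
)

# BIN_IVF_FLAT index on a BINARY_VECTOR field, using Hamming distance.
fingerprints = Collection("fingerprints")
fingerprints.create_index(
    field_name="binary_embedding",
    index_params={
        "index_type": "BIN_IVF_FLAT",
        "metric_type": "HAMMING",
        "params": {"nlist": 128},
    },
)
```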
HNSW
The HNSW graph indexing algorithm is a popular choice because it is fast to build, fast to query, highly accurate, and straightforward to implement. However, it has a well-known downside: it requires a lot of memory.
An HNSW index is a series of layers, where each layer above the base layer has roughly 10% as many nodes as the previous. This enables the upper layers to act as a skip list, allowing the search to zero in on the right neighborhood of the bottom layer that contains all of the vectors.
Unlike normal database query workloads, every vector in the graph has an almost equal chance of being relevant to a search, so there is no small "hot" subset that caching alone can cover. (The exception is the upper layers, which can be, and typically are, cached.)
As a result, a disk-based vector store such as Cassandra ends up spending almost all of its time waiting to read vectors off of disk.
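To make the layered-graph idea concrete, here is a small in-memory sketch using the hnswlib library. The dimensions, dataset size, and parameter values are arbitrary choices for illustration.

```python
# Small HNSW demo with hnswlib; dimensions and parameter values are arbitrary.
import numpy as np
import hnswlib

dim, num_elements = 128, 10_000
data = np.random.rand(num_elements, dim).astype(np.float32)

# Build the index: M controls graph connectivity (memory vs. recall),
# ef_construction controls build-time search width (build speed vs. recall).
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_elements, M=16, ef_construction=200)
index.add_items(data, np.arange(num_elements))

# Query: ef controls the search-time accuracy vs. speed trade-off.
index.set_ef(64)
query = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels, distances)
```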
OpenAI also offers its own RAG tooling:
OpenAI Embedding API
The OpenAI Embedding API is a powerful tool designed to help developers convert text into high-dimensional vectors that can represent its semantic information. These vectors can be used for various applications, including semantic similarity matching, data clustering, text classification, information retrieval, and fine-tuning of language models. The API is backed by large-scale, efficient neural network models that have been extensively trained to quickly generate high-quality text embeddings. The design of the OpenAI Embedding API focuses on efficiency, scalability, and ease of use, making it an ideal choice for enterprise applications and big data needs. By simplifying the integration process, developers can easily incorporate advanced natural language processing capabilities into their applications to process and understand text data.
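A minimal call might look like the following; the model name and input text are just examples.

```python
# Minimal OpenAI embeddings sketch (openai>=1.0 client); the model name is an example.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Vector databases store embeddings for similarity search.",
)

embedding = response.data[0].embedding  # a list of floats
print(len(embedding))
```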
OpenAI Knowledge Retrieval
The Knowledge Retrieval feature from OpenAI allows an assistant to extend its knowledge with external documents, such as proprietary product information or user-provided files. Once the feature is enabled, uploaded files are automatically chunked and indexed, their embedding vectors are stored, and vector search is then used to retrieve relevant content to answer user queries.
Unlike traditional vector databases, Knowledge Retrieval is specifically aimed at integration with OpenAI's assistant models, allowing developers to enhance the model's answering capabilities through simple API calls utilizing these external data sources.
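A rough sketch of wiring this up with the (beta) Assistants API is shown below. This API has evolved since (the retrieval tool was later renamed file_search), so treat the exact parameters and the file name as illustrative.

```python
# Illustrative Assistants API retrieval sketch (beta API; exact parameters may
# differ by API version). File name and assistant details are hypothetical.
from openai import OpenAI

client = OpenAI()

# Upload a document for the assistant to retrieve from.
file = client.files.create(
    file=open("product_manual.pdf", "rb"),  # hypothetical file
    purpose="assistants",
)

# Create an assistant with the retrieval tool enabled and the file attached.
assistant = client.beta.assistants.create(
    name="Product Support Assistant",
    instructions="Answer questions using the attached product documentation.",
    model="gpt-4-turbo-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[file.id],
)
```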
RAG with LlamaIndex
• Loading: this refers to getting your data from where it lives – whether it’s text files, PDFs, another website, a database, or an API – into your pipeline. LlamaHub provides hundreds of connectors to choose from.
• Indexing: this means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.
• Storing: once your data is indexed you will almost always want to store your index, as well as other metadata, to avoid having to re-index it.
• Querying: for any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries and hybrid strategies.
• Evaluation: a critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful, and fast your responses to queries are. (A minimal end-to-end sketch of these stages follows below.)
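A minimal LlamaIndex pipeline covering loading, indexing, storing, and querying might look like this. The directory paths and query are placeholders, and the imports follow the llama_index.core layout used by recent versions; by default it relies on an OpenAI API key for embeddings and the LLM.

```python
# Minimal LlamaIndex RAG sketch; paths and the query are placeholders.
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

# Loading: read documents from a local folder.
documents = SimpleDirectoryReader("./data").load_data()

# Indexing: embed the documents into a vector index.
index = VectorStoreIndex.from_documents(documents)

# Storing: persist the index so it does not have to be rebuilt.
index.storage_context.persist(persist_dir="./storage")

# (Later) reload the persisted index instead of re-indexing.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# Querying: ask a question against the index.
query_engine = index.as_query_engine()
response = query_engine.query("What does the documentation say about returns?")
print(response)
```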
RAG Architecture
The first step involves storing the knowledge from internal documents in a format suitable for querying. We embed it using an embedding model:
Split the entire knowledge base's text corpus into chunks; each chunk will represent a single, queryable context. The data of interest can come from multiple sources, for example, documents in Confluence supplemented with PDF reports.
Use an embedding model to transform each chunk into a vector embedding.
Store all vector embeddings in a vector database, with some reference to the original content the embedding was created from.
Save the text that each embedding was created from, and keep a separate mapping from each embedding back to its text (we will need this in the retrieval step).
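Putting the ingestion steps together, a schematic sketch could look like the following. The chunking rule, the embedding model, and the in-memory list standing in for a vector database are all illustrative choices; a real pipeline would use a proper splitter and a vector DB such as Milvus.

```python
# Schematic ingestion sketch: chunk -> embed -> store.
# The embedding model name and the in-memory "vector store" are stand-ins.
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Naive fixed-size chunking; real pipelines usually split on sentences or sections."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]

# Source id -> raw text (placeholder content).
documents = {"confluence_page_1": "…full page text…"}

# Each record keeps the embedding plus a reference back to the original chunk text.
vector_store: list[dict] = []
for doc_id, text in documents.items():
    chunks = chunk_text(text)
    for chunk, vector in zip(chunks, embed(chunks)):
        vector_store.append({"doc_id": doc_id, "text": chunk, "embedding": vector})
```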
Next, we can start building answers for the questions/queries of interest:
Embed the question/query you want to ask, using the same embedding model that was used to embed the knowledge base itself.
Run a query against the index in the vector DB using the resulting vector embedding. Choose how many vectors you want to retrieve from the vector database; this equals the number of contexts you will retrieve and ultimately use to answer the query.
The vector DB performs an approximate nearest neighbor (ANN) search against the index for the provided query embedding and returns the chosen number of context vectors. These are the vectors most similar to the query in the given embedding/latent space.
Map the returned vector embeddings back to the text blocks representing them.
Pass the question, along with the retrieved context text blocks, to an LLM via a prompt, instructing the LLM to use only the provided context to answer the question. This doesn't mean that prompt engineering is unnecessary: you still need to ensure that the answer returned by the LLM stays within expected boundaries, for example, that no fabricated answer is given when the retrieved context contains no relevant data.
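Continuing the ingestion sketch above, the query side (embed the question, retrieve the nearest chunks, prompt the LLM) could look roughly like this. The brute-force similarity loop stands in for the vector DB's ANN search, and the model names and prompt wording are illustrative.

```python
# Schematic query sketch: embed the question, find the nearest chunks, and prompt
# the LLM with only the retrieved context. Uses the in-memory store from above.
import math
from openai import OpenAI

client = OpenAI()

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer(question: str, vector_store: list[dict], top_k: int = 3) -> str:
    # Embed the question with the same model used for the knowledge base.
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # Brute-force nearest-neighbor search; a real vector DB would do ANN here.
    ranked = sorted(
        vector_store,
        key=lambda rec: cosine_similarity(q_vec, rec["embedding"]),
        reverse=True,
    )
    context = "\n\n".join(rec["text"] for rec in ranked[:top_k])

    # Prompt the LLM to answer using only the retrieved context.
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",  # example model name
        messages=[
            {
                "role": "system",
                "content": "Answer using only the provided context. "
                           "If the context does not contain the answer, say you don't know.",
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```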