Welcome to the Bangla Retrieval-Augmented Generation (RAG) Pipeline! This repository provides a pipeline for interacting with Bengali text data using natural language.
- Interact with your Bengali data in Bengali.
- Ask questions about your Bengali text and get answers.
- LLM Framework: Transformers
- RAG Framework: Langchain
- Chunking: Recursive Character Split
- Vector Store: ChromaDB
- Data Ingestion: Currently supports text (.txt) files only due to the lack of reliable Bengali PDF parsing tools.
- Customizable LLM Integration: Supports Hugging Face or local LLMs compatible with Transformers.
- Flexible Embedding: Supports embedding models compatible with Sentence Transformers (embedding dimension: 768).
- Hyperparameter Control: Adjust `max_new_tokens`, `top_p`, `top_k`, `temperature`, `chunk_size`, `chunk_overlap`, and `k`.
- Toggle Quantization Mode: Pass the `--quantization` flag to switch between model variants, including LoRA and 4-bit quantization.
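The pipeline delegates chunking to LangChain's `RecursiveCharacterTextSplitter`; the dependency-free sketch below only illustrates how `chunk_size` and `chunk_overlap` interact (the real splitter additionally prefers paragraph, sentence, and word boundaries before falling back to raw character windows):

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 150) -> list[str]:
    """Sliding-window character split: each chunk starts where the previous
    one ended, minus the overlap. A simplified stand-in for LangChain's
    RecursiveCharacterTextSplitter, which also tries natural boundaries first.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults (size 500, overlap 150), a 1,000-character document yields three chunks of 500, 500, and 300 characters.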
- Install Python: Download and install Python from python.org.
- Clone the Repository:

```shell
git clone https://github.com/Bangla-RAG/PoRAG.git
cd PoRAG
```

- Install Required Libraries:

```shell
pip install -r requirements.txt
```
Example `requirements.txt`:

```
langchain==0.2.3
langchain-community==0.2.4
langchain-core==0.2.5
chromadb==0.5.0
accelerate==0.31.0
peft==0.11.1
transformers==4.40.1
bitsandbytes==0.41.3
sentence-transformers==3.0.1
rich==13.7.1
```
- Prepare Your Bangla Text Corpus: Create a text file (e.g., `test.txt`) containing the Bengali text you want to use.
- Run the RAG Pipeline:

```shell
python main.py --text_path test.txt
```
- Interact with the System: Type your question and press Enter to get a response based on the retrieved information.
```
আপনার প্রশ্ন: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কোথায়?
উত্তর: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কলকাতার জোড়াসাঁকোর 'ঠাকুরবাড়ি'তে।
```

(English: "Your question: Where was Rabindranath Tagore born?" / "Answer: Rabindranath Tagore's birthplace is the 'Thakurbari' at Jorasanko, Kolkata.")
You can pass the following arguments and adjust their values on each run.
| Flag Name | Type | Description | Instructions |
|---|---|---|---|
| `chat_model` | `str` | The ID of the chat model, either a Hugging Face model ID or a local path to the model. | Use the model ID from the Hugging Face model card or provide a local path. Default: `hassanaliemon/bn_rag_llama3-8b`. |
| `embed_model` | `str` | The ID of the embedding model, either a Hugging Face model ID or a local path to the model. | Use the model ID from the Hugging Face model card or provide a local path. Default: `l3cube-pune/bengali-sentence-similarity-sbert`. |
| `k` | `int` | The number of documents to retrieve. | Default: `4`. |
| `top_k` | `int` | The `top_k` sampling parameter for the chat model. | Default: `2`. |
| `top_p` | `float` | The `top_p` sampling parameter for the chat model. | Default: `0.6`. |
| `temperature` | `float` | The sampling temperature for the chat model. | Default: `0.6`. |
| `max_new_tokens` | `int` | The maximum number of new tokens to generate. | Default: `256`. |
| `chunk_size` | `int` | The chunk size for text splitting. | Default: `500`. |
| `chunk_overlap` | `int` | The chunk overlap for text splitting. | Default: `150`. |
| `text_path` | `str` | The path to the input text (`.txt`) file. | Required. Provide the path to the text file you want to use. |
| `show_context` | `bool` | Whether to show the retrieved context. | Pass the `--show_context` flag to enable. |
| `quantization` | `bool` | Whether to enable 4-bit quantization. | Pass the `--quantization` flag to enable. |
| `hf_token` | `str` | Your Hugging Face API token. | Default: `None`. Provide a token if the model requires authentication. |
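As an illustration of how these flags fit together, here is a hypothetical `argparse` sketch mirroring the table's types and defaults; the actual parsing in `main.py` may be organized differently:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of the CLI, reconstructed from the flag table above.
    p = argparse.ArgumentParser(description="Bangla RAG pipeline (sketch)")
    p.add_argument("--chat_model", type=str, default="hassanaliemon/bn_rag_llama3-8b")
    p.add_argument("--embed_model", type=str, default="l3cube-pune/bengali-sentence-similarity-sbert")
    p.add_argument("--k", type=int, default=4)              # documents to retrieve
    p.add_argument("--top_k", type=int, default=2)          # sampling parameter
    p.add_argument("--top_p", type=float, default=0.6)
    p.add_argument("--temperature", type=float, default=0.6)
    p.add_argument("--max_new_tokens", type=int, default=256)
    p.add_argument("--chunk_size", type=int, default=500)
    p.add_argument("--chunk_overlap", type=int, default=150)
    p.add_argument("--text_path", type=str, required=True)  # the only required flag
    p.add_argument("--show_context", action="store_true")
    p.add_argument("--quantization", action="store_true")
    p.add_argument("--hf_token", type=str, default=None)
    return p

# Example: run on test.txt with 4-bit quantization enabled.
args = build_parser().parse_args(["--text_path", "test.txt", "--quantization"])
```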
- Default LLM: Trained a LLaMA-3 8B model, `hassanaliemon/bn_rag_llama3-8b`, for context-based QA.
- Embedding Model: Tested `sagorsarker/bangla-bert-base` and `csebuetnlp/banglabert`, and found `l3cube-pune/bengali-sentence-similarity-sbert` to be the most effective.
- Retrieval Pipeline: Implemented a LangChain retrieval pipeline and tested it with our fine-tuned LLM and embedding model.
- Ingestion System: Settled on text (.txt) files after testing several PDF parsing solutions.
- Question Answering Chat Loop: Developed a multi-turn chat loop for terminal testing.
- Generation Configuration Control: Attempted to use a generation config in the LLM pipeline.
- Model Testing: Tested quantized and LoRA versions of several models.
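The retrieval pipeline boils down to ranking chunk embeddings by similarity to the query embedding and keeping the `k` best. ChromaDB handles this in the actual pipeline; the sketch below illustrates the idea as a plain cosine-similarity top-k over toy 2-D vectors (the real embeddings are 768-dimensional):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def retrieve(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return indices of the k chunks most similar to the query,
    mirroring what the vector store does for the `k` flag."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The retrieved chunks are then passed to the LLM as context for answer generation.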
- PDF Parsing: Currently, only text (.txt) files are supported due to the lack of reliable Bengali PDF parsing tools.
- Answer Quality: The quality of answers depends heavily on the chosen LLM, the embedding model, and the quality of your Bengali text corpus.
- Scarcity of Pre-trained Models: High-fidelity Bengali LLMs pre-trained for QA tasks are not yet available, which makes it difficult to achieve impressive RAG performance; overall performance may vary depending on the models used.
- PDF Parsing: Develop a reliable Bengali-specific PDF parser.
- User Interface: Design a chat-like UI for easier interaction.
- Chat History Management: Implement a system for maintaining and accessing chat history.
We welcome contributions! If you have suggestions, bug reports, or enhancements, please open an issue or submit a pull request.
This is a work-in-progress and may require further refinement. The results depend on the quality of your Bengali text corpus and the chosen models.