Welcome to the Bangla Retrieval-Augmented Generation (RAG) Pipeline! This repository provides a pipeline for interacting with Bengali text data using natural language.
- Interact with your Bengali data in Bengali.
- Ask questions about your Bengali text and get answers.
- LLM Framework: Transformers
- RAG Framework: Langchain
- Chunking: Recursive Character Split
- Vector Store: ChromaDB
- Data Ingestion: Currently supports text (.txt) files only due to the lack of reliable Bengali PDF parsing tools.
- Customizable LLM Integration: Supports Hugging Face or local LLMs compatible with Transformers.
- Flexible Embedding: Supports embedding models compatible with Sentence Transformers (embedding dimension: 768).
- Hyperparameter Control: Adjust `max_new_tokens`, `top_p`, `top_k`, `temperature`, `chunk_size`, `chunk_overlap`, and `k`.
- Toggle Quantization Mode: Pass the `--quantization` flag to switch between model variants, including LoRA and 4-bit quantization.
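The pipeline delegates chunking to LangChain's `RecursiveCharacterTextSplitter`; the dependency-free sketch below only illustrates how `chunk_size` and `chunk_overlap` interact (the real splitter additionally prefers paragraph, sentence, and word boundaries before falling back to raw character windows):

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 150) -> list[str]:
    """Sliding-window character split: each chunk starts where the previous
    one ended, minus the overlap. A simplified stand-in for LangChain's
    RecursiveCharacterTextSplitter, which also tries natural boundaries first.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults (size 500, overlap 150), a 1,000-character document yields three chunks of 500, 500, and 300 characters.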
- Install Python: Download and install Python from python.org.
- Clone the Repository:

```shell
git clone https://github.com/Bangla-RAG/PoRAG.git
cd PoRAG
```

- Install Required Libraries:

```shell
pip install -r requirements.txt
```
Example `requirements.txt`:

```
langchain==0.2.3
langchain-community==0.2.4
langchain-core==0.2.5
chromadb==0.5.0
accelerate==0.31.0
peft==0.11.1
transformers==4.40.1
bitsandbytes==0.41.3
sentence-transformers==3.0.1
rich==13.7.1
```
- Prepare Your Bangla Text Corpus: Create a text file (e.g., `test.txt`) containing the Bengali text you want to use.
- Run the RAG Pipeline:

```shell
python main.py --text_path test.txt
```
- Interact with the System: Type your question and press Enter to get a response based on the retrieved information.
```
আপনার প্রশ্ন: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কোথায়?
উত্তর: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কলকাতার জোড়াসাঁকোর 'ঠাকুরবাড়ি'তে।
```

(English: "Your question: Where was Rabindranath Tagore born?" / "Answer: Rabindranath Tagore's birthplace is the 'Thakurbari' at Jorasanko, Kolkata.")
You can pass the following arguments and adjust their values on each run.
| Flag Name | Type | Description | Instructions |
|---|---|---|---|
| `chat_model` | `str` | The ID of the chat model, either a Hugging Face model ID or a local path to the model. | Use the model ID from the Hugging Face model card or provide a local path. Default: `hassanaliemon/bn_rag_llama3-8b`. |
| `embed_model` | `str` | The ID of the embedding model, either a Hugging Face model ID or a local path to the model. | Use the model ID from the Hugging Face model card or provide a local path. Default: `l3cube-pune/bengali-sentence-similarity-sbert`. |
| `k` | `int` | The number of documents to retrieve. | Default: `4`. |
| `top_k` | `int` | The `top_k` sampling parameter for the chat model. | Default: `2`. |
| `top_p` | `float` | The `top_p` sampling parameter for the chat model. | Default: `0.6`. |
| `temperature` | `float` | The sampling temperature for the chat model. | Default: `0.6`. |
| `max_new_tokens` | `int` | The maximum number of new tokens to generate. | Default: `256`. |
| `chunk_size` | `int` | The chunk size for text splitting. | Default: `500`. |
| `chunk_overlap` | `int` | The chunk overlap for text splitting. | Default: `150`. |
| `text_path` | `str` | The path to the input text (`.txt`) file. | Required. Provide the path to the text file you want to use. |
| `show_context` | `bool` | Whether to show the retrieved context. | Pass the `--show_context` flag to enable. |
| `quantization` | `bool` | Whether to enable 4-bit quantization. | Pass the `--quantization` flag to enable. |
| `hf_token` | `str` | Your Hugging Face API token. | Default: `None`. Provide a token if the model requires authentication. |
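As an illustration of how these flags fit together, here is a hypothetical `argparse` sketch mirroring the table's types and defaults; the actual parsing in `main.py` may be organized differently:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of the CLI, reconstructed from the flag table above.
    p = argparse.ArgumentParser(description="Bangla RAG pipeline (sketch)")
    p.add_argument("--chat_model", type=str, default="hassanaliemon/bn_rag_llama3-8b")
    p.add_argument("--embed_model", type=str, default="l3cube-pune/bengali-sentence-similarity-sbert")
    p.add_argument("--k", type=int, default=4)              # documents to retrieve
    p.add_argument("--top_k", type=int, default=2)          # sampling parameter
    p.add_argument("--top_p", type=float, default=0.6)
    p.add_argument("--temperature", type=float, default=0.6)
    p.add_argument("--max_new_tokens", type=int, default=256)
    p.add_argument("--chunk_size", type=int, default=500)
    p.add_argument("--chunk_overlap", type=int, default=150)
    p.add_argument("--text_path", type=str, required=True)  # the only required flag
    p.add_argument("--show_context", action="store_true")
    p.add_argument("--quantization", action="store_true")
    p.add_argument("--hf_token", type=str, default=None)
    return p

# Example: run on test.txt with 4-bit quantization enabled.
args = build_parser().parse_args(["--text_path", "test.txt", "--quantization"])
```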
- Default LLM: Trained a LLaMA-3 8B model, `hassanaliemon/bn_rag_llama3-8b`, for context-based QA.
- Embedding Model: Tested `sagorsarker/bangla-bert-base` and `csebuetnlp/banglabert`, and found `l3cube-pune/bengali-sentence-similarity-sbert` to be the most effective.
- Retrieval Pipeline: Implemented a LangChain retrieval pipeline and tested it with our fine-tuned LLM and embedding model.
- Ingestion System: Settled on text (.txt) files after testing several PDF parsing solutions.
- Question Answering Chat Loop: Developed a multi-turn chat loop for terminal testing.
- Generation Configuration Control: Attempted to use a generation config in the LLM pipeline.
- Model Testing: Tested quantized and LoRA versions of several models.
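The retrieval pipeline boils down to ranking chunk embeddings by similarity to the query embedding and keeping the `k` best. ChromaDB handles this in the actual pipeline; the sketch below illustrates the idea as a plain cosine-similarity top-k over toy 2-D vectors (the real embeddings are 768-dimensional):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def retrieve(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return indices of the k chunks most similar to the query,
    mirroring what the vector store does for the `k` flag."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The retrieved chunks are then passed to the LLM as context for answer generation.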
- PDF Parsing: Currently, only text (.txt) files are supported due to the lack of reliable Bengali PDF parsing tools.
- Answer Quality: The quality of answers depends heavily on the chosen LLM, the embedding model, and the quality of your Bengali text corpus.
- Scarcity of Pre-trained Models: High-fidelity Bengali LLMs pre-trained for QA tasks are not yet available, which makes it difficult to achieve impressive RAG performance; overall performance may vary depending on the models used.
- PDF Parsing: Develop a reliable Bengali-specific PDF parser.
- User Interface: Design a chat-like UI for easier interaction.
- Chat History Management: Implement a system for maintaining and accessing chat history.
We welcome contributions! If you have suggestions, bug reports, or enhancements, please open an issue or submit a pull request.
This is a work-in-progress and may require further refinement. The results depend on the quality of your Bengali text corpus and the chosen models.