RAG Query System

Introduction

This project implements a document query system using LightRAG, a state-of-the-art Retrieval-Augmented Generation (RAG) framework. LightRAG represents a significant advancement in RAG systems, offering advantages over traditional approaches like GraphRAG while maintaining lower computational costs.

What is LightRAG?

LightRAG is a lightweight yet powerful RAG framework that enhances retrieval quality through:

Dual-View Retrieval: Combines both lexical and semantic search capabilities
Local Structure Modeling: Captures document relationships with hierarchical structures
Global Knowledge Integration: Leverages knowledge graphs for comprehensive understanding
Cost-Effective Performance: Achieves comparable or better results than GraphRAG with reduced computational overhead

Indexing Flow

Retrieval & Querying Flow

Learn more about LightRAG

Features

Implementation of LightRAG's dual-view retrieval system:
- Naive search for direct matching
- Hybrid search combining semantic and lexical features
Support for different document types

Graph Storage Options

Neo4j Approach

Enterprise-grade graph database integration
Scalable for large document collections
Requires Neo4j installation:
requirements.txt includes neo4j
The graph of BIAN document from neo4j

NetworkX Approach

Lightweight in-memory graph solution
Suitable for smaller document sets
No additional installation needed
Default configuration:

Prerequisites

Python 3.8+
Neo4j Database (required for BAIN document querying)

Installation

Clone the repository:

git clone <repository-url>
cd <repository-name>

Create and activate a virtual environment:

# Windows
python -m venv .venv
.venv\Scripts\activate

# Linux/MacOS
python -m venv .venv
source .venv/bin/activate

Install required packages:

pip install -e .

Insall Docker Desktop or any other local container host, & run

docker run \
    --restart always \
    --publish=7474:7474 --publish=7687:7687 \
    neo4j:5.25.1

Set environment variables for OpenAI and Neo4j

export OPENAI_API_KEY="sk-..."
export NEO4J_URI="neo4j://localhost:7687"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="password"

Document Preparation

Two separate RAG storages are prepared using:

BIAN document - https://bian.org/wp-content/uploads/2020/10/BIAN-Semantic-API-Pactitioner-Guide-V8.1-FINAL.pdf
ARPA - Managing Data Risk - https://www.apra.gov.au/sites/default/files/Prudential-Practice-Guide-CPG-235-Managing-Data-Risk_1.pdf

The system expects documents to be in the raw-docs folder. If you have PDF files, they need to be converted to markdown format using PyMuPDF4LLM. In the raw-docs you will find that .md files already exists.

import fitz  # PyMuPDF
from PyMuPDF4LLM import extract_text_from_pdf

# Convert PDF to markdown
with fitz.open("raw-docs/your-document.pdf") as doc:
    markdown_text = extract_text_from_pdf(doc)
    
    with open("raw-docs/output.md", "w", encoding="utf-8") as f:
        f.write(markdown_text)

Project Structure

.
├── raw-docs/
│   ├── bian.md
│   └── CPG235.md
├── lightrag-demo.py
├── .venv/
└── README.md

Usage

Run the main script:

python lightrag-demo.py

Choose the document you want to query:
- Option 1: BAIN documentation
- Option 2: APRA Data Regulation
Enter your question when prompted
View both naive and hybrid search results:
- Naive search provides direct matching results
- Hybrid search combines semantic understanding with lexical matching for more comprehensive answers
Choose your next action:
- Ask another question
- Change document
- Exit the program

Testing

To run the test cases:

python -m pytest lightrag-demo.py

Working Directories

The system creates working directories for each document type:

./bian_neo4jWorkDir: For BAIN documentation
./cpg235_workdir: For APRA Data Regulation

These directories store the necessary indexes and cached data for the RAG system.

Notes

The system uses gpt_4o_mini_complete as the default LLM model, but can be configured to use other models
For BIAN documentation, Neo4j storage is specifically used for graph operations
LightRAG's dual-view retrieval can be adjusted through QueryParam settings for different use cases

Error Handling

The system includes comprehensive error handling and logging. All errors are:

Logged with timestamps
Returned in the RAGResponse object
Displayed to the user with appropriate context

Contributing

Feel free to submit issues, fork the repository, and create pull requests for any improvements.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
lightrag		lightrag
raw-docs		raw-docs
.DS_Store		.DS_Store
.gitignore		.gitignore
LightRAG-demo.py		LightRAG-demo.py
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RAG Query System

Introduction

What is LightRAG?

Features

Graph Storage Options

Neo4j Approach

NetworkX Approach

Prerequisites

Installation

Document Preparation

Project Structure

Usage

Testing

Working Directories

Notes

Error Handling

Contributing

About

Releases

Packages

Languages

bhardwajvicky/rag-lightrag-demo

Folders and files

Latest commit

History

Repository files navigation

RAG Query System

Introduction

What is LightRAG?

Features

Graph Storage Options

Neo4j Approach

NetworkX Approach

Prerequisites

Installation

Document Preparation

Project Structure

Usage

Testing

Working Directories

Notes

Error Handling

Contributing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages