This project was developed as part of the EPFL Machine Learning course (2023).
- Viacheslav Surkov
- Semen Matrenok
- Daniil Likhobaba
This repository contains code for building a RAG pipeline, together with scripts for dataset generation and pipeline evaluation.
## Installation

The code was tested with Python 3.10 and CUDA 12.

Install the requirements:

```bash
pip install -r requirements.txt
```
Store your OpenAI API key in the environment:

```python
import os

os.environ["OPENAI_API_KEY"] = "<YOUR API KEY>"
```
## Dataset Generation

To generate the dataset:

```bash
python dataset_generation/script_gpt4.py <path to data> <store path>
```
The collection of transcripts in `<path to data>` should be a JSON file in the following format:

```json
[
    ...
    {
        "transcript": <example transcript: str>,
        "media_id": <example Media ID: str>
    },
    ...
]
```
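For reference, here is a minimal sketch of producing such an input file in Python; the transcript text, media ID, and `transcripts.json` filename are illustrative assumptions:

```python
import json

# Hypothetical transcripts; the real data would come from your media collection.
transcripts = [
    {"transcript": "Bonjour et bienvenue au cours...", "media_id": "media-001"},
]

with open("transcripts.json", "w") as f:
    json.dump(transcripts, f, ensure_ascii=False, indent=2)
```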
A dataset of 100 examples will be saved to `<store path>`.
## Index Storing

First, construct and store an embedding index on the dataset:

```bash
python index_storing/build_and_store.py <model name> <path to data> <persist dir path>
```
`<model name>` is the name of an embedding model from Hugging Face, `<path to data>` is the filepath to the dataset (in the same format as in the Dataset Generation section), and `<persist dir path>` is the path to the directory where the index will be stored.
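Conceptually, the build step embeds each transcript and persists the vectors. A rough sketch of that idea with sentence-transformers is below; the library, model name, file paths, and on-disk format are assumptions, not necessarily what `build_and_store.py` does:

```python
import json
import os

import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative stand-ins for <model name>, <path to data>, <persist dir path>.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
with open("transcripts.json") as f:
    data = json.load(f)

# Embed every transcript and persist the vectors next to their media IDs.
embeddings = model.encode([item["transcript"] for item in data])
os.makedirs("index", exist_ok=True)
np.save("index/embeddings.npy", embeddings)
with open("index/media_ids.json", "w") as f:
    json.dump([item["media_id"] for item in data], f)
```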
## Running the Pipeline

To launch the pipeline as a service running on port 8000:

```bash
cd models
bash start_app.sh
```
The service can be configured via the `models/config.json` configuration file; refer to it for the list of configuration parameters.
Querying is performed with HTTP POST requests, for example:

```python
import requests

question = "<your question>"
response = requests.post("http://127.0.0.1:8000/", json={
    "question": question,
    "techniques": ["cot"],
})
print(response.json())
```
`techniques` is the list of prompt techniques to apply; refer to the `models/config.json` file for the list of implemented techniques.
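For convenience, the request above can be wrapped in a small helper; the function name, default technique, and timeout below are illustrative assumptions:

```python
import requests

def ask(question: str, techniques: list[str] | None = None) -> dict:
    """Query the local RAG service and return the parsed JSON response."""
    response = requests.post(
        "http://127.0.0.1:8000/",
        json={"question": question, "techniques": techniques or ["cot"]},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

print(ask("Qu'est-ce que la descente de gradient ?"))
```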
## Evaluation

To test embeddings:

```bash
python evaluation/embedding_quality_simple.py <model name> <dataset path> <persist dir path>
```
`<dataset path>` and `<persist dir path>` are the paths to the generated dataset and the constructed index.
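As an illustration, one simple retrieval metric such a script might report is hit rate at k: the fraction of questions whose gold media ID appears among the top-k retrieved transcripts. The metric choice and array layout below are assumptions, not the script's actual implementation:

```python
import numpy as np

def hit_rate_at_k(question_embs, doc_embs, gold_ids, doc_ids, k=5):
    """Fraction of questions whose gold document is in the top-k by cosine similarity."""
    q = question_embs / np.linalg.norm(question_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    topk = np.argsort(-(q @ d.T), axis=1)[:, :k]
    hits = [gold in [doc_ids[j] for j in row] for gold, row in zip(gold_ids, topk)]
    return float(np.mean(hits))
```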
To test the correctness of the whole pipeline:

```bash
python evaluation/quality_metrics.py <dataset path>
```

It will also generate a log file with all answers.
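One plausible way to score answer correctness is embedding similarity between a generated answer and the reference; the sketch below only illustrates that idea and is not what `quality_metrics.py` actually computes:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical generated/reference answer pair.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
generated = "Le modèle apprend à partir des données d'entraînement."
reference = "Un modèle est entraîné sur des données."
score = util.cos_sim(model.encode(generated), model.encode(reference)).item()
print(f"similarity: {score:.2f}")
```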
To get average lengths and the non-French answer rate:

```bash
python evaluation/inner_quality_statistics.py <path to logs>
```
`<path to logs>` is the path to the log file generated by `evaluation/quality_metrics.py`.
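For intuition, statistics of this kind can be computed as below; the log format (one JSON object with an `answer` field per line) and the `langdetect` dependency are assumptions about the script, not its actual implementation:

```python
import json

from langdetect import detect  # assumed dependency for language identification

answers = []
with open("logs.jsonl") as f:  # hypothetical log file name
    for line in f:
        answers.append(json.loads(line)["answer"])

avg_len = sum(len(a.split()) for a in answers) / len(answers)
non_french = sum(detect(a) != "fr" for a in answers) / len(answers)
print(f"average length: {avg_len:.1f} words, non-French rate: {non_french:.1%}")
```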
To check the output for toxicity:

```bash
python evaluation/toxic_detection.py <path to logs> <path to output>
```
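A rough sketch of a toxicity check over logged answers, using a Hugging Face text-classification pipeline; the model choice and truncation are assumptions, not necessarily what `toxic_detection.py` uses:

```python
from transformers import pipeline

# Illustrative model choice; toxic_detection.py may use a different classifier.
classifier = pipeline("text-classification", model="unitary/toxic-bert")

answers = ["Voici une réponse générée par le pipeline."]  # hypothetical logged answers
for answer in answers:
    result = classifier(answer[:512])[0]  # truncate very long answers
    print(result["label"], round(result["score"], 3))
```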