A RAG system based on Langchain and Llama3 created for answering general question about Kendrick Lamar's career and life. It proved a simple yet effective point on how to setup, compile and use NLP tools for interesting and fast results.
Table of Contents
This project serves the purpose of showing how to correctly build from start to finish a RAG system using Langchain and Llama3. Not only that, it is also shown how to recreate the dataset used (as well as a copy ready for use). From dataset creation to prompt answering, it can be useful to guide new comers.
The project is divided into the following steps:
- Create the dataset -> This was done using the wikiextractor library and the source code can be found in the
building_dataset
folder. - Creating the FAISS index -> Especially for larger datasets, it can be quite computationally expensive to build one each and every time. As such I used a cluster made available my university based on SLURM and Singularity. In the
faiss_env
folder you can find the source code on how it was done. - Creating the RAG model -> This was done creating a Jupiter notebook on Google Colab. The source code can be found in the main folder of the project.
Skip to the appropriate section for more information.
Feel free to skip to whatever section you are interested in. As the project is divided into three main parts, you might have different requirements for each one.
The dataset was extract from the Wikipedia dumps. The main idea is to parse the dump, extract only the relevant articles and then saving them in a format suitable for the Langchain API. All of the code mentioned can be found in the building_dataset
folder.
An already made extraction for the Kendrick Lamar related articles can be found in the folder under the name extract_kendrick.tar.gz
.
To extract and parse all of the articles I used the wikiextractor library. Simply check the documentation and extract them wherever you want. Given this was a dummy project, I extracted the data without parsing the templates
as well, make sure to change that if you want to have a more complete dataset. The rest of the procedure remains unchanged.
Once extracted the data, use the kendrick_polisher.py
script to parse the data and save it in a format suitable for the Langchain API. The script will create a folder with the extracted .txt files. The script uses concurrent processing to speed up the process in batches, change the batch_size
variable to your liking.
python3 kendrick_polisher.py <path_to_extracted_data> <path_to_save_data> <batch_size>
Now the data is ready to be used for the next steps, I zipped it and uploaded it to Google Drive for easy access and retrieved it easily with the use of the gdown library.
Creating the FAISS index can be quite long, in my case it wasn't suitable on my laptop and I used a cluster for more computing power. Given that each and every setup is quite different, you might skip this part unless you have a very similar setup to mine.
All referenced files can be found in the faiss_env
folder.
Still, the process was the following:
- Create a Singularity image with the necessary dependencies.
sudo singularity build faiss.sif faiss.def
- Copy the slurm script and the python script to the cluster and run it.
sbatch faiss.slurm
You can also use the build_faiss_index.py
file on your own local machine, to do so run it like this:
python3 build_faiss_index.py <path_to_dataset> <path_to_save_index>
This will create the index and save it to the specified path. Later on I uploaded it to Google Drive as well and used the gdown library to download it in Google Colab.
The heart of the project: the RAG model. The development was done on Google Colab but the code can be run on any machine with the necessary dependencies. The code can be found in the main folder as NLP_Rag_project.ipynb
.
The code is quite well documented and should be easy to follow.
Please be aware that you need a GPU to load the model and also an account on the Hugging Face website to access the model. Make sure to request permissions at the Llama 3 model card.
The main steps are:
- Load the dataset
- Load the FAISS index
- Do some analysis on the dataset
- Setup Llama3
- Create the RAG model
Here you can find some examples prompt-answer using the RAG model created:
Question: Tell me an interesting fact about Kendrick.
Answer:
According to the provided context, an interesting fact about Kendrick Lamar is that his song "Alright" became a rallying cry for the Black Lives Matter movement in 2015, symbolizing hope and optimism in the face of discrimination and persecution.
Question: Did Lamar set any record or achieved widely known prizes?
Answer:
Yes, Kendrick Lamar has achieved several records and widely recognized prizes throughout his career. Some of his notable achievements include:
* Most awarded artist in BET Hip Hop Awards history with 29 wins
* First artist to win Video of the Year for a music video he co-directed with "Humble"
* First artist from any genre to be nominated for Album of the Year with four consecutive lead studio albums since Billy Joel (1979–1983)
* Third-most won by a rapper in Grammy Awards history with 17 wins
* Most wins by any artist in BET Hip Hop Awards history with 29 wins
* Won the Pulitzer Prize for Music for his album "Damn"
* Won the Primetime Emmy Award for Outstanding Variety Special (Live) as a headliner of the Pepsi Super Bowl LVI Halftime Show
* Won the Brit Award, four American Music Awards, six "Billboard" Music Awards, and 11 MTV Video Music Awards (including two Video of the Year wins)
These are just a few examples of his many achievements.
Question: What are some of the best Lamar's songs?
Answer: It's difficult to pinpoint the best of Kendrick Lamar's songs, as opinions on his music are subjective and varied. However, some of his most critically acclaimed and popular tracks include:
* "Alright"
* "Humble"
* "The Blacker the Berry"
* "Like That"
* "Freedom"
* "Sidewalks"
* "Goosebumps"
* "Bad Blood" (remix)
* "America Has a Problem"
* "The Hillbillies"
These songs showcase his unique blend of jazz, funk, soul, and hip-hop styles, as well as his thought-provoking lyrics that often address social and political issues.
Distributed under the MIT License.
Project Link: https://github.com/gp-1108/NLP_RAG