The Tim Ferriss Show (TFS) is one of the most popular podcasts, focused on "deconstructing world-class performers from eclectic areas (investing, chess, pro sports, etc.), digging deep to find the tools, tactics, and tricks that listeners can use". After 10 years and over 750 episodes, the library has grown intimidating to read and search for the gems.
The TFS Archivist is a conversational AI that helps users find relevant ideas from a specific guest or episode, saving them from manually skimming through the library and hour-long transcripts.
This is my final project for DataTalks.Club's LLM Zoomcamp, a free course about LLMs and RAG.
- 1. The Tim Ferriss Show Archivist
- 2. Notes
- 3. Progress
- 4. Points
- 5. Overview
- 6. Dataset
- 7. App Architecture
- 8. How to Run the App
- 9. Code
- 10. Evaluations
- 11. Monitoring
- 12. Acknowledgements
The app was developed on GitHub Codespaces with a disk constraint of 32 GB. If the app is set up normally, the machine crashes due to disk overflow. To circumvent this, the `docker-compose` file points the Postgres volume to `/tmp/postgres_data`, a system folder outside the Codespace working directory that is not counted towards the 32 GB quota. If you are running the app outside of GitHub Codespaces, you may want to change this path back to just `postgres_data`.
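For illustration, the relevant volume mapping in `docker-compose.yml` might look like the sketch below (the service and image names are assumptions; only the `/tmp/postgres_data` host path is the point):

```yaml
services:
  postgres:
    image: postgres:16  # assumed image tag
    env_file: .env
    volumes:
      # Host path outside the Codespace working directory,
      # so it does not count towards the 32 GB disk quota
      - /tmp/postgres_data:/var/lib/postgresql/data
    ports:
      - "${POSTGRES_PORT}:5432"
```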
- Scrape the data.
- Chunk the data.
- Tokenize the data.
- Ingest the data into an Elasticsearch Docker container
- Perform RAG trial with Groq API & Phi-3 (Ollama)
- Build a UI for the app
- Perform Evaluations with GPT-4o
- Build a dashboard for evaluation
- Best practices
To save you the trouble of looking for the project criteria, I put my marks here. You can double-check them while reading through the repo and running it.
Problem description
- 2 points: The problem is well-described and it's clear what problem the project solves
RAG flow
- 2 points: Both a knowledge base and an LLM are used in the RAG flow
Retrieval evaluation
- 2 points: Multiple retrieval approaches are evaluated, and the best one is used
RAG evaluation
- 2 points: Multiple RAG approaches are evaluated, and the best one is used
Interface
- 2 points: UI (e.g., Streamlit), web application (e.g., Django), or an API (e.g., built with FastAPI)
Ingestion pipeline
- 2 points: Automated ingestion with a Python script or a special tool (e.g., Mage, dlt, Airflow, Prefect)
Monitoring
- 2 points: User feedback is collected and there's a dashboard with at least 5 charts
Containerization
- 2 points: Everything is in docker-compose
Reproducibility
- 2 points: Instructions are clear, the dataset is accessible, it's easy to run the code, and it works. The versions for all dependencies are specified.
Best practices
- Hybrid search: combining both text and vector search (at least evaluating it) (1 point)
- Document re-ranking (1 point)
- User query rewriting (1 point)
The TFS Archivist lets users search for specific content from an episode of The Tim Ferriss Show.
Example use cases include:
- Search for background information about a guest.
- Search for the episode a guest appears in.
- Search for a specific idea that a guest mentioned in the show.
The dataset is the show transcripts up to episode 766, scraped from https://tim.blog/2018/09/20/all-transcripts-from-the-tim-ferriss-show/. The notebook to process the data is in the `scrape` folder. The notebook was run on Colab (to make use of the GPU) across different sessions, so it can be messy. The basic steps:
- Get all the transcripts, in legacy format (PDF) and current format (web content).
- Process them to extract the episode content itself.
- Chunk each episode into chunks of 700 words with a 20-word overlap.
- Use SentenceTransformer to embed each chunk into a 768-dimensional dense vector.
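As a rough sketch of the chunking and embedding steps (the exact model name and file path are assumptions; any 768-dimensional SentenceTransformer such as `all-mpnet-base-v2` fits the description):

```python
from sentence_transformers import SentenceTransformer

# Assumed model: any 768-dimensional SentenceTransformer matches the description
model = SentenceTransformer("all-mpnet-base-v2")

def chunk_words(text: str, size: int = 700, overlap: int = 20) -> list[str]:
    """Split a transcript into 700-word chunks with a 20-word overlap."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

episode_text = open("transcripts/episode_001.txt").read()  # hypothetical path
chunks = chunk_words(episode_text)
embeddings = model.encode(chunks)  # shape: (num_chunks, 768)
```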
After processing, the data has the following fields:
- `id`: The episode number.
- `chunk_id`: The chunk ID in the format `id_{auto-increment number}`.
- `title`: The episode title.
- `chunk`: The text in the chunk.
- `embedding`: The embedding vector of the text chunk.
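A minimal sketch of a matching Elasticsearch index mapping (the index name is an assumption; the field names follow the list above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

mappings = {
    "properties": {
        "id": {"type": "keyword"},        # episode number
        "chunk_id": {"type": "keyword"},  # id_{auto-increment number}
        "title": {"type": "text"},
        "chunk": {"type": "text"},
        "embedding": {
            "type": "dense_vector",
            "dims": 768,
            "index": True,
            "similarity": "cosine",  # matches the cosine similarity used in evaluation
        },
    }
}

es.indices.create(index="tfs-transcripts", mappings=mappings)  # assumed index name
```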
Note: Based on the copyright notice prominently displayed on his website (e.g., here), commercial usage of the transcripts is disallowed. This means you cannot take an app like this and deploy it to the cloud for commercial use.
Technologies used:
- Python 3.12
- Docker and Docker Compose for containerization
- Elasticsearch for full-text search (and semantic search during evaluation)
- Streamlit as both the app backend and frontend
- PostgreSQL as the backend for monitoring
- Grafana as monitoring dashboard
- OpenAI and Groq as LLM providers
Prepare a `.env` file with the following format:
GROQ_API_KEY=your_api_key
OPENAI_API_KEY=your_api_key
TZ=Asia/Singapore
# PostgreSQL Configuration
POSTGRES_HOST=postgres
POSTGRES_DB=tfs_archivist
POSTGRES_USER=admin
POSTGRES_PASSWORD=admin
POSTGRES_PORT=5432
# Grafana Configuration
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin
GRAFANA_SECRET_KEY=SECRET_KEY
# Elasticsearch Configuration
ELASTIC_URL_LOCAL=http://127.0.0.1:9200
ELASTIC_URL=http://elasticsearch:9200
ELASTIC_PORT=9200
# Streamlit Configuration
STREAMLIT_PORT=8501
Get your Groq and OpenAI API keys from their respective websites.
The Elasticsearch and PostgreSQL databases need to be initialized before running the app.
First, run only the postgres and elasticsearch containers:
docker-compose up postgres elasticsearch -d
Second, prepare the Python environment and run the `prep.py` and `ingestion.py` scripts:
conda create -n llm python=3.12
conda activate llm
pip install -r requirements.txt
export POSTGRES_HOST=localhost
python prep.py
python ingestion.py
The easiest way is to use Docker Compose. After database initialization, run
docker-compose up
If you want to run the application locally, after database initialization, instead of `docker-compose up`, run:
export POSTGRES_HOST=localhost
bash streamlit.sh
If you want to run the application using only Docker for development, after database initialization, build the image and run it:
docker build -t streamlit .
docker run -it --rm \
--network="llm-zoomcamp-tf-show-archivist_default" \
--env-file=".env" \
-e OPENAI_API_KEY=${OPENAI_API_KEY} \
-e GROQ_API_KEY=${GROQ_API_KEY} \
-p 8501:8501 \
streamlit
Navigate to http://127.0.0.1:8501/ to use the app via the Streamlit UI.
Demo can be viewed at
https://www.loom.com/share/1c3e150ea6c04e9bb21f13c295e201d3
- `grafana` - initialization and dashboard settings for the Grafana dashboards.
- `notebooks` - experiment notebooks and the first prototype.
- `scrape` - the notebook used to scrape and process the data.
- `utils` - utility functions.
- `app.py` - the main app logic.
- `assistant.py` - the main RAG logic for retrieving the data and building the prompt.
- `ingestion.py` - loading the data into the knowledge base.
- `db.py` - the logic for logging requests and responses to the Postgres database.
- `prep.py` - the script for initializing the database.
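As a hedged sketch of the retrieve-then-prompt flow described for `assistant.py` (the function names, index name, and prompt wording are illustrative, not the actual implementation):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search(query: str, top_k: int = 5) -> list[dict]:
    """Keyword search over the chunk and title fields."""
    resp = es.search(
        index="tfs-transcripts",  # assumed index name
        query={"multi_match": {"query": query, "fields": ["chunk", "title"]}},
        size=top_k,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]

def build_prompt(query: str, docs: list[dict]) -> str:
    """Assemble the retrieved chunks into a grounded prompt for the LLM."""
    context = "\n\n".join(f"{d['title']}:\n{d['chunk']}" for d in docs)
    return (
        "Answer the question about The Tim Ferriss Show using only the CONTEXT.\n\n"
        f"QUESTION: {query}\n\nCONTEXT:\n{context}"
    )
```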
Note: Due to a gross mistake on my part during a transfer between different GitHub Codespaces (I broke the last one), the experiment data were lost. Only the data output by the notebooks remain 😔.
There are 2 Jupyter notebooks in the `notebooks` folder:
- `evaluation_data_generation.ipynb` - ground truth dataset generation.
- `evaluation_rag.ipynb` - the retrieval and RAG evaluation.
Approximate vector search (10,000 candidates (the max setting for Elasticsearch), top-5, cosine similarity):
- Chunk Hit Rate: 0.3821
- Chunk MRR: 0.4408
- Document Hit Rate: 0.6316
- Document MRR: 0.9489
Keyword search (chunk and title, no boosting, top-5):
- Chunk Hit Rate: 0.7891
- Chunk MRR: 1.0369
- Document Hit Rate: 0.8723
- Document MRR: 1.5714
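For reference, a standard single-hit implementation of these two metrics looks like the sketch below. Note that the MRR values above exceed 1, which suggests the evaluation notebook sums reciprocal ranks over multiple relevant chunks per query rather than stopping at the first hit; the standard variant shown here is bounded by 1.

```python
def hit_rate(relevance: list[list[bool]]) -> float:
    """Fraction of queries with at least one relevant result in the top-k."""
    return sum(any(row) for row in relevance) / len(relevance)

def mrr(relevance: list[list[bool]]) -> float:
    """Mean reciprocal rank of the first relevant result (0 if none)."""
    total = 0.0
    for row in relevance:
        for rank, rel in enumerate(row, start=1):
            if rel:
                total += 1 / rank
                break
    return total / len(relevance)

# relevance[i][j] is True if the j-th result for query i is a ground-truth match
relevance = [[False, True, False], [True, False, False], [False, False, False]]
print(hit_rate(relevance), mrr(relevance))  # 0.666..., 0.5
```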
Keyword search performed better. Due to time constraints, I did not test boosting for keyword search. To do so, we could use `minsearch.py` as an approximation to perform a simple optimization, and then transfer the resulting settings to Elasticsearch; see the sketch below.
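A sketch of that boosting optimization with `minsearch` (the random search and parameter ranges follow the course material; `documents` and `ground_truth` are assumed to be loaded elsewhere):

```python
import random
import minsearch  # single-file search engine from the LLM Zoomcamp materials

index = minsearch.Index(text_fields=["chunk", "title"], keyword_fields=["id"])
index.fit(documents)  # documents: list of dicts with chunk, title, and id fields

def hit_rate_for(boost: dict) -> float:
    """Top-5 document hit rate for a given set of boost weights."""
    hits = 0
    for q in ground_truth:  # list of {"question": ..., "id": ...} pairs
        results = index.search(q["question"], boost_dict=boost, num_results=5)
        hits += any(d["id"] == q["id"] for d in results)
    return hits / len(ground_truth)

# Simple random search over boost weights; transfer the best setting to Elasticsearch
candidates = [{"chunk": random.uniform(0, 3), "title": random.uniform(0, 3)} for _ in range(20)]
best = max(candidates, key=hit_rate_for)
```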
I evaluated the newer `llama-3.1-8b-instant` and the older `llama3-8b-8192` from Groq, using GPT-4o-mini as a judge on 103 samples.
The odd number of samples is due to Groq's rate limits!
| relevance | Llama 3 | Llama 3.1 |
| --- | --- | --- |
| RELEVANT | 0.7379 | 0.7184 |
| PARTLY_RELEVANT | 0.1748 | 0.1650 |
| NON_RELEVANT | 0.0874 | 0.1165 |
Based on the 103 samples, GPT-4o-mini judged that Llama-3 8B has a slight edge over the newer Llama-3.1 8B, though the difference amounts to just 1-2 questions. Since both are free, I used both.
A further evaluation would be to try Llama-3 70B, to see whether the increased size leads to better performance and whether it is worth the cost.
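A minimal sketch of the LLM-as-a-judge call (the prompt wording is illustrative, not the actual notebook; the labels match the table above):

```python
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an expert evaluator for a RAG system.
Classify the relevance of the ANSWER to the QUESTION as one of:
RELEVANT, PARTLY_RELEVANT, NON_RELEVANT. Reply with the label only.

QUESTION: {question}
ANSWER: {answer}"""

def judge(question: str, answer: str) -> str:
    """Ask GPT-4o-mini to grade one question/answer pair."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return resp.choices[0].message.content.strip()
```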
A Postgres DB was set up as the backend for monitoring, storing the conversations as well as user feedback. Grafana visualizes this data.
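A hedged sketch of what the logging in `db.py` might do (the table and column names are assumptions inferred from the dashboard panels; the connection values come from the `.env` above):

```python
import psycopg2

conn = psycopg2.connect(
    host="localhost", dbname="tfs_archivist", user="admin", password="admin"
)

def log_conversation(question, answer, model, relevance, tokens, response_time):
    """Insert one conversation turn into the assumed `conversations` table."""
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO conversations
               (question, answer, model, relevance, tokens, response_time, ts)
               VALUES (%s, %s, %s, %s, %s, %s, NOW())""",
            (question, answer, model, relevance, tokens, response_time),
        )
    conn.commit()
```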
When the app is running, access it at localhost:3000:
- Login: "admin"
- Password: "admin"
The dashboard follows the template in the course, with 7 panels:
- Last 5 Conversations (Table): Displays a table showing the five most recent conversations, including details such as the question, answer, relevance, and timestamp. This panel helps monitor recent interactions with users.
- +1/-1 (Pie Chart): A pie chart that visualizes the feedback from users, showing the count of positive (thumbs up) and negative (thumbs down) feedback received. This panel helps track user satisfaction.
- Relevancy (Gauge): A gauge chart representing the relevance of the responses provided during conversations. The chart categorizes relevance and indicates thresholds using different colors to highlight varying levels of response quality.
- Tokens Cost (Time Series): A time series line chart depicting the cost associated with API usage over time for both Groq and OpenAI. This panel helps monitor and analyze the expenditure linked to the AI model's usage.
- Tokens (Time Series): Another time series chart that tracks the number of tokens used in conversations over time. This helps to understand the usage patterns and the volume of data processed.
- Model Used (Bar Chart): A bar chart displaying the count of conversations based on the different models used. This panel provides insights into which AI models are most frequently used.
- Response Time (Time Series): A time series chart showing the response time of conversations over time. This panel is useful for identifying performance issues and ensuring the system's responsiveness.
All Grafana configurations are in the `grafana` folder:
- `init.py` - for initializing the datasource and the dashboard.
- `dashboard.json` - the actual dashboard (taken from LLM Zoomcamp without changes).
To initialize the dashboard, first ensure Grafana is running (it starts automatically when you do `docker-compose up`).
Then run:
export POSTGRES_HOST=localhost
python init.py
Then go to localhost:3000:
- Login: "admin"
- Password: "admin"
I would like to thank DataTalks.Club and all the guests and sponsors for the quality content of the course, all totally free.
And I hope you, the reviewer, enjoyed doing the course as much as I did ⸜(。˃ ᵕ ˂ )⸝♡