title	author
Semantic Search and RAG on a FOSS stack	Robert Timm

Semantic Search and RAG on a FOSS stack

Wikimedia Hackathon Tallinn 2024

github.com/rti/barebone-rag

Robert Timm robert.timm@wikimedia.de

Semantic Search

Given a query, find texts with a similar meaning.
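To make "similar meaning" concrete, here is a minimal sketch that ranks texts by cosine similarity between vectors. The three-dimensional vectors are made up for illustration; a real embedding model would produce them, with hundreds of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", not the output of any real model.
texts = {
    "How do I feed my cat?": [0.9, 0.1, 0.0],
    "Feline nutrition basics": [0.8, 0.2, 0.1],
    "Installing a GPU driver": [0.0, 0.1, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # stands in for the embedded query

# Most similar text first.
ranked = sorted(
    texts,
    key=lambda t: cosine_similarity(query_vector, texts[t]),
    reverse=True,
)
print(ranked)
```

The two cat-related texts rank ahead of the GPU text even though they share few words, which is the point of searching by meaning rather than by keywords.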

Retrieval Augmented Generation (RAG)

Create texts based on information loaded from external sources.

Free and Open Source Software (FOSS) Stack

All software components are released under OSI approved licenses.

Demo Time

GPU stacks

⛔ NVIDIA CUDA has a proprietary license
✅ AMD ROCm stack is MIT licensed
- amdgpu driver in kernel mainline

Components for Semantic Search

Encode semantics ▶ Embeddings
Find semantically similar objects ▶ Vector Database

Embedding Models

	Dims	OSI License	Pre Train Data	Fine Tune data
all-MiniLM-L6-v2	384	✅ Apache-2.0	✅	✅
nomic-embed-text-v1	768	✅ Apache-2.0	✅	✅
bge-large-en-v1.5	1024	✅ MIT	⛔	⛔
mxbai-embed-large-v1	1024	✅ Apache-2.0	⛔	⛔
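The dimension count drives storage cost. A rough back-of-the-envelope sketch, assuming each dimension is stored as a 32-bit float and ignoring index and row overhead (the corpus size is a hypothetical number):

```python
BYTES_PER_FLOAT32 = 4  # assumption: one 32-bit float per dimension

# Dimensions taken from the table above.
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "nomic-embed-text-v1": 768,
    "bge-large-en-v1.5": 1024,
    "mxbai-embed-large-v1": 1024,
}

def raw_vector_bytes(dims: int, num_chunks: int) -> int:
    """Raw size of the vectors alone, without index or row overhead."""
    return dims * BYTES_PER_FLOAT32 * num_chunks

NUM_CHUNKS = 1_000_000  # hypothetical corpus size
for name, dims in MODEL_DIMS.items():
    size_gib = raw_vector_bytes(dims, NUM_CHUNKS) / 2**30
    print(f"{name}: {size_gib:.2f} GiB for {NUM_CHUNKS:,} chunks")
```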

Embedding Inference

🦙 Ollama

Get up and running with large language models.

Inference engine based on llama.cpp
Supports AMD GPU via ROCm
CPU support (AVX, AVX2, AVX512, Apple Silicon)
Quantization
Model Library
One model, one inference at a time
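Quantization in one picture: store weights as small integers plus a scale factor. The sketch below shows plain symmetric 8-bit quantization over one list of weights. It is only an illustration of the idea; the formats used by llama.cpp and Ollama work block-wise and differ in detail. The weights are made up.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.03, 0.49]  # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error)
```

Each weight now needs one byte instead of four, at the price of a small rounding error.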

Embedding Inference

	OSI License	ROCm support	Production
Ollama	✅ MIT	✅	⛔
llama.cpp	✅ MIT	✅	⛔
HF Text Embeddings Inference	✅ Apache-2.0	⛔	✅
Infinity	✅ MIT	✅	✅

Embedding Implementation

Start ollama

$ ollama serve &
$ ollama pull nomic-embed-text-v1

Generate embedding

import ollama # pip install ollama

res = ollama.embeddings(
    model="nomic-embed-text-v1",
    prompt="This string")

res["embedding"] # [0.33, 0.62, 0.19, ...]
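Longer documents are usually split into chunks before embedding. A minimal sketch, assuming simple fixed-size character windows with overlap; `chunk_text`, `embed_chunks` and `ollama_embed` are hypothetical helper names, and the window sizes are arbitrary:

```python
from typing import Callable

def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed_chunks(
    chunks: list[str],
    embed: Callable[[str], list[float]],
) -> list[tuple[str, list[float]]]:
    """Pair every chunk with its embedding; `embed` is any text -> vector function."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def ollama_embed(text: str) -> list[float]:
    """Embed via Ollama, as on the slide above (needs a running `ollama serve`)."""
    import ollama  # imported lazily so the helpers above work without it
    res = ollama.embeddings(model="nomic-embed-text-v1", prompt=text)
    return res["embedding"]
```

Usage would look like `embed_chunks(chunk_text(document), ollama_embed)`. Passing the embedding function in as a parameter keeps the chunking logic testable without a model.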

Vector Database

🐘 Postgres can do it 🎉

pgvecto.rs

Scalable, Low-latency and Hybrid-enabled Vector Search in Postgres.

A PostgreSQL extension written in Rust.

Vector Database Licensing

	OSI License
PostgreSQL	✅ PostgreSQL License
pgvecto.rs	✅ Apache-2.0

Vector Database Implementation

Create PostgreSQL table using pgvecto.rs

CREATE EXTENSION vectors;

CREATE TABLE chunks (
  text TEXT NOT NULL,
  embedding VECTOR( 768 ) NOT NULL
);
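Filling the table from Python could look like the sketch below. It assumes a DB-API style cursor (for example from psycopg) and that the `'[x,y,z]'` text form can be cast to `vector`; `to_vector_literal` and `insert_chunk` are hypothetical helper names.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as the '[x,y,z]' text form of a vector."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# Parameterized statement; the cast to `vector` is an assumption about the extension.
INSERT_SQL = "INSERT INTO chunks (text, embedding) VALUES (%s, %s::vector)"

def insert_chunk(cursor, text: str, embedding: list[float]) -> None:
    """Insert one chunk; `cursor` is assumed to be a DB-API cursor."""
    cursor.execute(INSERT_SQL, (text, to_vector_literal(embedding)))
```

Passing the text and the vector as parameters, rather than formatting them into the SQL string, avoids quoting problems with arbitrary chunk text.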

Find most similar chunks

SELECT text FROM chunks
  ORDER BY embedding <-> '[0.33, 0.62, 0.19, ...]'
  LIMIT 5;
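What the query does, mirrored in memory. As far as the pgvecto.rs documentation describes it, `<->` orders by squared Euclidean distance; the sketch assumes that and scans every row, whereas the database can use an index instead. `most_similar` is a hypothetical name.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def most_similar(
    query: list[float],
    rows: list[tuple[str, list[float]]],
    limit: int = 5,
) -> list[str]:
    """Return the texts of the `limit` rows closest to `query`."""
    ranked = sorted(rows, key=lambda row: squared_l2(query, row[1]))
    return [text for text, _ in ranked[:limit]]

# Made-up two-dimensional rows for illustration.
rows = [("a", [0.0, 0.0]), ("b", [1.0, 1.0]), ("c", [5.0, 5.0])]
print(most_similar([1.0, 1.0], rows, limit=2))
```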

Components for Retrieval Augmented Generation (RAG)

Find matching sources ▶ Semantic Search
Generate Response ▶ Large Language Model (LLM) Inference
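The two components compose into a short pipeline. A minimal sketch, with retrieval and generation passed in as functions so that any search backend and any LLM can be plugged in; `rag_answer` and the prompt wording are illustrative assumptions, not the demo's actual code.

```python
from typing import Callable

def rag_answer(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve sources for the question, then let the LLM answer from them."""
    sources = retrieve(question)          # semantic search step
    context = "\n\n".join(sources)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return generate(prompt)               # LLM inference step
```

In a real setup `retrieve` would embed the question and query the vector database, and `generate` would call the LLM inference engine.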

LLM Inference

🦙 Ollama

LLMs too 😀 Actually its core use case
Great model library 📚

LLM Inference

	OSI License	ROCm Support	Production
Ollama	✅ MIT	✅	⛔
llama.cpp	✅ MIT	✅	⛔
vllm	✅ Apache-2.0	✅	✅
HF Text Generation Inference	✅ Apache-2.0	✅	✅

LLM Inference Implementation

Generate a text based on a prompt

import ollama # pip install ollama

res = ollama.chat(
    model="zephyr:7b-beta",
    messages=[{"role": "user", "content": f"Summarize this text: {text}"}],
    stream=False,
)
res["message"]["content"] # "The given text..."
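For interactive use, the response can also be streamed. A sketch assuming that `ollama.chat(..., stream=True)` yields parts shaped like the non-streaming result; `collect_stream` and `summarize_streaming` are hypothetical helper names.

```python
from typing import Iterable, Mapping

def collect_stream(parts: Iterable[Mapping]) -> str:
    """Print each streamed fragment as it arrives and return the full text."""
    pieces = []
    for part in parts:
        fragment = part["message"]["content"]
        print(fragment, end="", flush=True)
        pieces.append(fragment)
    print()
    return "".join(pieces)

def summarize_streaming(text: str) -> str:
    """Same request as above, but streamed (needs a running `ollama serve`)."""
    import ollama  # imported lazily so collect_stream works without it
    parts = ollama.chat(
        model="zephyr:7b-beta",
        messages=[{"role": "user", "content": f"Summarize this text: {text}"}],
        stream=True,
    )
    return collect_stream(parts)
```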

Large Language Model Building Blocks

Weights (binary)
Pre-training data (source)
Fine tuning data (source)
Training code (build scripts)

OSI - Open Source AI Definition

Intends to define Open Source models
Defines which parts need to have OSD-compliant licenses
Draft, Release planned for October 2024
Latest draft (April 2024)
- Marks training data sets as optional
- But requires data characteristics, labeling procedures, etc.

LLMs with Openly Licensed Weights

	OSI Weights	PT Data	FT Data	Code
Mistral 0.2 7b	✅ Apache-2.0	⛔	⛔	⛔
HF Zephyr 7b beta	✅ MIT	⛔	✅	✅
Microsoft Phi-3 Mini 3.8b	✅ MIT	⛔	⛔	⛔
Apple OpenELM 3b	⛔❓ASCL	✅	✅	✅
Meta Llama 3 8b	⛔ Custom	⛔	⛔	⛔
Google Gemma 1.1 7b	⛔ Custom	⛔	⛔	⛔

PT: pre-training - FT: fine tuning
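One way to read the table is as a checklist over the four building blocks. The sketch below encodes the rows with an unambiguous mark (the row with the ❓ is left out) and asks which models are open in all four columns. The "all four" rule is a simplifying assumption for illustration, not the OSI draft's actual criteria.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelOpenness:
    name: str
    weights: bool           # weights under an OSI-approved license
    pretraining_data: bool  # pre-training data available
    finetuning_data: bool   # fine-tuning data available
    code: bool              # training code available

    def fully_open(self) -> bool:
        """True only if every building block is marked open."""
        return all((self.weights, self.pretraining_data,
                    self.finetuning_data, self.code))

# Rows transcribed from the table above.
MODELS = [
    ModelOpenness("Mistral 0.2 7b", True, False, False, False),
    ModelOpenness("HF Zephyr 7b beta", True, False, True, True),
    ModelOpenness("Microsoft Phi-3 Mini 3.8b", True, False, False, False),
    ModelOpenness("Meta Llama 3 8b", False, False, False, False),
    ModelOpenness("Google Gemma 1.1 7b", False, False, False, False),
]

fully_open = [m.name for m in MODELS if m.fully_open()]
print(fully_open)  # prints [] : no listed row passes all four checks
```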

Open Source LLM Projects

Bloom (2022), BigScience RAIL License v1.0, not SOTA, not OSD
Allen AI OLMo based on the Dolma dataset
LumiOpen Viking built on Lumi Supercomputer
HuggingFace StarChat2 focussed on code
OpenGPT-X EU funded

Conclusion

✅ Almost all software components are available with OSI approved licenses
✅ ROCm works and people are using it
❓ Definition of open source models unclear
🤔 Identifying truly open source models is complicated
⏳ Interesting developments ongoing
