mysql_vss
is a plugin designed for storing and searching vector embeddings using approximate nearest neighbor search, leveraging the Annoy library for fast lookups in high-dimensional spaces.
🚧 Experimental Stage: The plugin is experimental and not recommended for production use yet.
- Install recent versions of
gcc
andg++
. - Ensure necessary build tools and libraries are installed. These include:
- g++
- gcc
- libstdc++-static
- cmake
- openssl-devel
- python-devel
- ncurses-devel
- Clone the repository:
git clone https://github.com/stephenc222/mysql_vss
cd mysql_vss
- Initialize and update submodules:
git submodule update --init --recursive --progress
- Compile
mysql-server
. This plugin requires amysql-server
build from source to link against:
cd src/vendor/mysql-server
mkdir build
cd build
cmake ..
make
*NOTE: mysql_vss
uses mysql-server
version 8.0, and this is pegged in the mysql-server
git submodule. There will be additional packages you'll need to install, such as a modern version of bison
and others. CMake's output is pretty helpful in determining what you need
- Compile the
mysql-vss
plugin source:
cmake .
make
The compiled output is a shared library named something like libmysql_vss_v0.0.1_AmazonLinux2023_x86_64.so
, tailored to your operating system.
- Deploy the shared library to MySQL's plugin directory.
- Register the UDFs in MySQL:
CREATE FUNCTION vss_search RETURNS STRING SONAME 'libmysql_vss.so';
CREATE FUNCTION vss_version RETURNS STRING SONAME 'libmysql_vss.so';
Verify installation by checking the plugin version:
SELECT CAST(vss_version() AS CHAR);
Set up the embeddings
table as per the required schema (currently a manual process):
CREATE TABLE IF NOT EXISTS embeddings (
ID INT PRIMARY KEY,
vector JSON NOT NULL,
original_text TEXT NOT NULL,
annoy_index INT
);
Use vss_search
for querying similar embeddings:
SELECT e.original_text
FROM embeddings AS e
WHERE FIND_IN_SET(e.ID, CAST(vss_search('[0.01,0.02,0.03,...]') AS CHAR)) > 0
ORDER BY FIELD(e.ID, CAST(vss_search('[0.01,0.02,0.03,...]') AS CHAR)) DESC;
- The current version only supports 768-dimensional embeddings.
mysql_vss
was developed using the embedding model, gte-base, a top performing embedding model that is runnable on a wide variety of consumer hardware.
- The Annoy Index loads once and requires a manual update for new embeddings.
To enhance mysql_vss
for larger data scales, some possible future development considerations include:
- External Process Management: Isolate the Annoy Index to manage memory separately from MySQL.
- Dynamic Index Reloading: Enable index updates without restarting the service.
Additionally, configurations that you can take as well:
- Containerization: Apply memory quotas to contain the Annoy Index within resource limits.
- MySQL Optimization: Tweak MySQL configurations for larger datasets (e.g.,
innodb_buffer_pool_size
). - Performance Benchmarks: Test against various data sizes to determine performance and stability.
- Expand testing to include unit and integration tests for robust validation.
Check out examples/app.py
and the provided Dockerfile in the repository for demonstration and containerized deployment of mysql_vss
.