feat: Add example python project #197
base: main
Revises:
Create Date: 2024-11-04 11:47:57.345379

"""
This file was generated with alembic revision --autogenerate, which compares the database to the SQLAlchemy models and generates a naive diff. It works quite well for adding new tables, but of course it doesn't work for our views.
SELECT ai.create_vectorizer(
    'code_files'::regclass,
    destination => 'code_files_embeddings',
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    chunking => ai.chunking_recursive_character_text_splitter(
        'contents',
        chunk_size => 1000,
        chunk_overlap => 200
    ),
    formatting => ai.formatting_python_template(
        'File: $file_name\n\nContents:\n$chunk'
    )
);
""")
This is how I am creating a vectorizer via migrations. It works, but it's not as pretty as the native alembic functions. It should be possible to provide nice wrappers around these calls.
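One way such a wrapper could look is a plain function that renders the SQL for the migration. This is only a sketch: `create_vectorizer_sql` and `sql_literal` are hypothetical names, not part of pgai or alembic, and the quoting assumes Postgres runs with `standard_conforming_strings` on.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def create_vectorizer_sql(
    source_table: str,
    destination: str,
    model: str,
    dimensions: int,
    chunk_column: str,
    chunk_size: int,
    chunk_overlap: int,
    template: str,
) -> str:
    """Render the ai.create_vectorizer call used in the migration above."""
    return (
        "SELECT ai.create_vectorizer(\n"
        f"    {sql_literal(source_table)}::regclass,\n"
        f"    destination => {sql_literal(destination)},\n"
        f"    embedding => ai.embedding_openai({sql_literal(model)}, {int(dimensions)}),\n"
        "    chunking => ai.chunking_recursive_character_text_splitter(\n"
        f"        {sql_literal(chunk_column)},\n"
        f"        chunk_size => {int(chunk_size)},\n"
        f"        chunk_overlap => {int(chunk_overlap)}\n"
        "    ),\n"
        f"    formatting => ai.formatting_python_template({sql_literal(template)})\n"
        ");"
    )


# Inside a migration this could be passed to alembic's op.execute(...):
# op.execute(create_vectorizer_sql("code_files", "code_files_embeddings", ...))
```

The migration would then read as a single call with keyword arguments instead of a hand-written SQL string.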
examples/code-llm-sync/db/models.py
class CodeFileEmbedding(Base):
    """
    Model representing the view created by pgai vectorizer.
    This maps to the automatically created view 'code_files_embeddings'
    which joins the original code_files table with its embeddings.
    """

    __tablename__ = "code_files_embeddings"

    # We make this a view model by setting it as such
    __table_args__ = {"info": {"is_view": True}}

    # Original CodeFile columns
    id = Column(Integer, ForeignKey("code_files.id"), primary_key=True)
    file_name = Column(String(255), nullable=False)
    updated_at = Column(DateTime, nullable=True)
    contents = Column(Text, nullable=True)

    # Embedding specific columns added by pgai
    embedding_uuid = Column(String, primary_key=True)
    chunk = Column(Text, nullable=False)
    embedding = Column(
        Vector(768), nullable=False
    )  # 768 dimensions for text-embedding-3-small
    chunk_seq = Column(Integer, nullable=False)

    # Relationship back to original CodeFile
    code_file = relationship("CodeFile", foreign_keys=[id])

    @override
    def __repr__(self) -> str:
        return f"<CodeFileEmbedding(file_name='{self.file_name}', chunk_seq={self.chunk_seq})>"
It is nice that this works at all, but having to define all fields twice feels a bit ugly. Maybe we can provide a helper annotation or similar here, for example an @embedded decorator on the original model that automatically injects the embedding and chunk fields, etc.
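To make the `@embedded` idea concrete, here is a rough sketch using plain dataclasses instead of SQLAlchemy models, so it runs without a database. The decorator name comes from the comment above; everything else (`EMBEDDING_FIELDS`, the generated `Embedding` attribute) is hypothetical, and a real version would have to generate SQLAlchemy columns and map to the view.

```python
from dataclasses import dataclass, fields, make_dataclass

# Fields the pgai view adds on top of the source columns
# (names taken from the CodeFileEmbedding model in this PR).
EMBEDDING_FIELDS = [
    ("embedding_uuid", str),
    ("chunk", str),
    ("chunk_seq", int),
    ("embedding", list),
]


def embedded(dimensions: int):
    """Hypothetical decorator: attach a generated `Embedding` class to a model."""

    def wrap(cls):
        # Copy the source model's fields, then append the embedding fields.
        source_fields = [(f.name, f.type) for f in fields(cls)]
        view_cls = make_dataclass(
            f"{cls.__name__}Embedding", source_fields + EMBEDDING_FIELDS
        )
        view_cls.dimensions = dimensions
        cls.Embedding = view_cls
        return cls

    return wrap


@embedded(dimensions=768)
@dataclass
class CodeFile:
    id: int
    file_name: str
    contents: str


row = CodeFile.Embedding(1, "main.py", "x = 1", "abc", "x = 1", 0, [0.0] * 768)
```

The point is only that the source fields are declared once and the view model is derived from them.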
def parse_embedding_string(embedding_str: str) -> np.ndarray:
    """Convert a pgai embedding string to a numpy array"""
    # Remove brackets and split on commas
    values = embedding_str.strip("[]").split(",")
    # Convert to float array
    return np.array([float(x) for x in values], dtype=">f4")
This function is currently not in use. But when I originally called openai_embed(), it simply returned a string representation of the vector (a float array), which is extremely unfortunate if you want to use the embedding in any way afterwards. Maybe I set up pgvector-python the wrong way, though.
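If the string form does come back, a dependency-free fallback is possible because pgvector's text output (for example `[0.1,0.2,0.3]`) also happens to be a valid JSON array. The sketch below assumes that format; `parse_vector_text` is a hypothetical name. pgvector-python also documents a `register_vector` helper for asyncpg connections, which may remove the need for this entirely; that is untested here.

```python
import json


def parse_vector_text(value: str) -> list[float]:
    """Parse pgvector's text representation, e.g. '[0.1,0.2]', into floats."""
    parsed = json.loads(value)
    if not isinstance(parsed, list):
        raise ValueError(f"expected a vector literal, got: {value!r}")
    return [float(x) for x in parsed]
```

The resulting list can be handed to `np.array(...)` if a numpy array is needed.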
examples/code-llm-sync/main.py
class OpenAIEmbed(FunctionElement):
    inherit_cache = True


class PGAIFunction(expression.FunctionElement):
    def __init__(self, model: str, text: str, dimensions: int):
        self.model = model
        self.text = literal(text)
        self.dimensions = dimensions
        super().__init__()


@compiles(PGAIFunction)
def _compile_pgai_embed(element, compiler, **kw):
    return "ai.openai_embed('%s', %s, dimensions => %d)" % (
        element.model,
        compiler.process(element.text),
        element.dimensions,
    )
This is already part of a proposed solution: it was quite hard to pass named arguments through SQLAlchemy. SQLAlchemy does allow calling basically arbitrary functions, but I couldn't find a way to pass the dimensions parameter by name (maybe this is some SQL magic I don't understand).
This implementation has another benefit, though: you can define the return type and parameter types. I think this is low-hanging fruit and mostly a bit of busy work to build, but it has some nice benefits.
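As an alternative to interpolating values into the SQL string, the named-argument call can be rendered with bind-parameter placeholders and a separate parameter dict. This is a sketch under assumptions: `named_call` is a hypothetical helper, and the `:name` placeholder style is the one SQLAlchemy's `text()` construct accepts.

```python
from typing import Any


def named_call(
    func: str, positional: list[Any], named: dict[str, Any]
) -> tuple[str, dict[str, Any]]:
    """Render `func(:p0, ..., name => :name)` plus its bind parameters."""
    params: dict[str, Any] = {}
    parts: list[str] = []
    for index, value in enumerate(positional):
        key = f"p{index}"
        params[key] = value
        parts.append(f":{key}")
    for name, value in named.items():
        # Argument names end up in the SQL text, so only allow identifiers.
        if not name.isidentifier():
            raise ValueError(f"invalid argument name: {name!r}")
        params[name] = value
        parts.append(f"{name} => :{name}")
    return f"{func}({', '.join(parts)})", params


sql, params = named_call(
    "ai.openai_embed",
    ["text-embedding-3-small", "how does the sync work?"],
    {"dimensions": 768},
)
# sql can be wrapped in sqlalchemy.text(...) and executed with params.
```

This keeps the model name and query text out of the SQL string, which the `'%s'` interpolation in the compile hook does not.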
results = await session.execute(
    select(
        CodeFileEmbedding.file_name,
        CodeFileEmbedding.chunk,
        CodeFileEmbedding.chunk_seq,
        similarity_score,
    )
    .order_by(similarity_score.desc())
    .limit(limit)
)
By the way, does using the function twice in the query, as it is here, make it execute twice? Or is the optimizer smart enough to understand that the result is the same, since it's an idempotent function, and execute it only once?
# Test database configuration
TEST_DB_URL = "postgresql+asyncpg://postgres:postgres@localhost/postgres"
project_root = Path(__file__).parent.parent
Testing in general was a bit of a pain, mostly because I struggled with setting up migrations and the database. That said, I am not trying to mock anything here, which makes it a little easier.
    docker_client: DockerClient, load_dotenv
) -> Generator[Container, None, None]:
    """Start vectorizer worker after database is ready"""
    # Configure container
    container_config = {
        "image": "timescale/pgai-vectorizer-worker:0.1.0",
        "environment": {
            "PGAI_VECTORIZER_WORKER_DB_URL": "postgres://postgres:[email protected]:5432/postgres",
            "OPENAI_API_KEY": os.environ["OPENAI_API_KEY"],
        },
        "command": ["--poll-interval", "5s"],
        "extra_hosts": {
            "host.docker.internal": "host-gateway"
        },  # Allow container to connect to host postgres
    }

    # Start container
    container = docker_client.containers.run(**container_config, detach=True)

    # Wait for container to be running
    container.reload()
    assert container.status == "running"

    yield container

    # Cleanup
    container.stop()
    container.remove()
This is probably not how you'd actually want to test this; instead you could depend on pgai and call the vectorizer worker from there. But Claude came up with this (as with most of the code in this PR) and it worked, so I'm keeping it for now.
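One fragile spot in the fixture is the single `container.reload()` followed by an assert, which can fail if the container has not reached the running state yet. A small polling helper would make that more robust. `wait_until` below is a hypothetical helper, not part of the Docker SDK or pytest; the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool], timeout: float = 30.0, interval: float = 0.5
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# In the fixture this could replace the reload-then-assert step:
# def is_running() -> bool:
#     container.reload()
#     return container.status == "running"
#
# assert wait_until(is_running)
```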
I built a small demo application to figure out how well pgai integrates with existing Python tooling. It's not done, but the first endpoint works. It's a FastAPI service using SQLAlchemy that enables semantic code search using pgai features like automatic embeddings.
The idea is to track a code base with file watchers, write changes into Postgres, and embed the files immediately. You can then use these embeddings to find relevant code files for any LLM query about improving that code base, without having to manually copy all the related code each time.
Changes you make based on those results will then immediately propagate into the store -> repeat.
Might not be my greatest startup idea but I needed something to start working 😄
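The change-tracking part of that loop can be sketched without a watcher library by comparing content hashes between two scans. This is not the PR's implementation, just an illustration; `snapshot` and `changed_files` are hypothetical names, and a real service would react to file-system events rather than rescanning.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path, pattern: str = "*.py") -> dict[str, str]:
    """Map each matching file (relative path) to a SHA-256 of its contents."""
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob(pattern))
        if path.is_file()
    }


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return files that are new or whose contents differ from the old scan."""
    return sorted(name for name, digest in new.items() if old.get(name) != digest)
```

Each path returned by `changed_files` would be upserted into the `code_files` table, after which the vectorizer worker picks up the change and re-embeds it.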
Status:
Currently there is a single API endpoint that lets you send a query and retrieve relevant code files based on it (see the tests for how it works).
Example usage:
I left a review on this with some thoughts, but I would continue building for a bit longer before prioritizing work items.