Edr/sql db #22

Merged
merged 39 commits into from
Dec 15, 2023
39 commits
4517a1e
Add models, repos, config
evanderiel Nov 30, 2023
7ee5a4a
Add db util functions for saving
evanderiel Nov 30, 2023
7ea7d99
Try to resolve sqlalchemy typing trilemma
evanderiel Nov 30, 2023
c1ef1de
Add alembic + tests
evanderiel Dec 1, 2023
58f45f7
Fix tests
evanderiel Dec 4, 2023
8ec53a7
Commit poetry.lock
evanderiel Dec 5, 2023
c9e0db6
Fix "Args" in docstrings
evanderiel Dec 5, 2023
f2724ad
PR feedback
evanderiel Dec 7, 2023
6c6de89
Fixes for tests and configs (PR)
evanderiel Dec 7, 2023
9261aa5
Add check constraints to columns
evanderiel Dec 7, 2023
7f5c9e8
Replace old alembic init
evanderiel Dec 7, 2023
e59b104
Make TranscriptEntity.segments a JSON column
evanderiel Dec 7, 2023
20c13f5
Add frame_ids to frame extraction output
evanderiel Dec 7, 2023
30917b9
Run alembic migration on app startup
evanderiel Dec 7, 2023
730d9c8
Use importlib with pytest to avoid pydantic import errors
evanderiel Dec 11, 2023
8355f05
PR feedback: add frame_ids to typeddict
evanderiel Dec 11, 2023
7977377
PR feedback: fix batching for db util functions
evanderiel Dec 11, 2023
9e9c0a4
PR: more fixes to new pipeline nodes
evanderiel Dec 11, 2023
e28f38c
PR: try fix again
evanderiel Dec 11, 2023
994624c
PR: Fix tests
evanderiel Dec 11, 2023
af68a85
PR: fixes for batching, dict_output
evanderiel Dec 12, 2023
5e58869
Fix transcription stuff
evanderiel Dec 12, 2023
b1eedb0
Fix bug in blip2 video endpoint definition
evanderiel Dec 13, 2023
d6f6306
PR: de-batchify video storage
evanderiel Dec 13, 2023
bda03c5
Use a separate Video and Media entities
evanderiel Dec 13, 2023
8e95a54
Fix tests
evanderiel Dec 13, 2023
99e92a2
PR: cleanup
evanderiel Dec 13, 2023
4bf5b67
Re-add batch transcript saving
evanderiel Dec 14, 2023
94ba449
PR feedback: rename aana.models.db.BaseModel to BaseEntity
evanderiel Dec 14, 2023
334c3a3
Add __init__.py file to remove --import-mode=importlib
evanderiel Dec 14, 2023
e355d20
Ruff fixes
evanderiel Dec 14, 2023
afe6076
Update README & precommit hook
evanderiel Dec 14, 2023
763f6ee
Ruff format for db init
evanderiel Dec 14, 2023
271661e
PR feedback: remove need for video_id in nodes
evanderiel Dec 14, 2023
5144b91
Pr feedback: disable importlib mode in CI tests
evanderiel Dec 14, 2023
7e9cac8
Merge branch 'main' into edr/sql_db
evanderiel Dec 15, 2023
129f72d
Fix db tests and type annotations
evanderiel Dec 15, 2023
c038c34
merge from origin/main
evanderiel Dec 15, 2023
28eb1a9
Ruff fix
evanderiel Dec 15, 2023
4 changes: 2 additions & 2 deletions .githooks/pre-commit
@@ -2,6 +2,6 @@

set -e # exit on error

-ruff check aana
-ruff format aana
+poetry run ruff check aana
+poetry run ruff format aana

2 changes: 2 additions & 0 deletions .vscode/settings.json
@@ -5,10 +5,12 @@
"editor.formatOnSave": true,
},
"python.testing.pytestArgs": [
// "--import-mode=importlib",
"aana"
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true,
"python.testing.pytestPath": "poetry run pytest",
"ruff.fixAll": true,
"ruff.organizeImports": true,
}
27 changes: 26 additions & 1 deletion README.md
@@ -102,7 +102,7 @@ to look for `/nas` and `/nas2`). You can read more about environment variables f

## Code Standards
This project uses Ruff for linting and formatting. If you want to
-manually run Ruff on the codebase, it's
+manually run Ruff on the codebase, using poetry it's

```sh
poetry run ruff check aana
@@ -118,6 +118,7 @@ To run the auto-formatter, it's
poetry run ruff format aana
```

(If you are running code in a non-poetry environment, just leave off `poetry run`.)
If you want to enable this as a local pre-commit hook, additionally
run the following:

@@ -132,3 +133,27 @@ command is available in your default shell. You can also simply run
For users of VS Code, the included `settings.json` should ensure
that Ruff problems appear while you edit, and formatting is applied
automatically on save.


## Databases
The project uses two databases: a vector database and a traditional SQL database,
referred to internally as vectorstore and datastore, respectively.

### Vectorstore
TBD

### Datastore
The datastore uses SQLAlchemy as an ORM layer and Alembic for migrations. The migrations are run
automatically at startup. If you change the SQLAlchemy models, you must also
create an Alembic migration that can be run to upgrade the database.
The easiest way to do so is as follows:

```bash
poetry run alembic revision --autogenerate -m "<Short description of changes in sentence form.>"
```

ORM models referenced in the rest of the code should be imported from `aana.models.db` directly,
not from that model's file, for reasons explained in `aana/models/db/__init__.py`. This also means that
if you add a new model class, you should import it in `__init__.py` in addition to creating a migration.

Higher-level code for interacting with the ORM is available in `aana.repository.data`.
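
The startup migration described in the Datastore section could be triggered from Python roughly as follows. This is a minimal sketch, not the code from this PR: the helper names `alembic_ini_path` and `run_alembic_migrations` and the `package_dir` parameter are assumptions made for illustration. Only `alembic.config.Config`, `Config.set_main_option`, and `alembic.command.upgrade` are Alembic's own API.

```python
from pathlib import Path


def alembic_ini_path(package_dir: Path) -> Path:
    """Return the assumed location of alembic.ini inside the package directory."""
    return package_dir / "alembic.ini"


def run_alembic_migrations(package_dir: Path, revision: str = "head") -> None:
    """Upgrade the database to `revision` using the package's Alembic config."""
    # Imported lazily so the path helper above stays usable without Alembic installed.
    from alembic import command
    from alembic.config import Config

    config = Config(str(alembic_ini_path(package_dir)))
    # alembic.ini sets `script_location = alembic`, a relative path that Alembic
    # resolves against the current working directory. Overriding it with an
    # absolute path makes the call independent of where the app is started.
    config.set_main_option("script_location", str(package_dir.resolve() / "alembic"))
    command.upgrade(config, revision)
```

Under these assumptions, calling `run_alembic_migrations(Path("aana"))` once during application startup would apply any pending revisions before the app touches the database.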
116 changes: 116 additions & 0 deletions aana/alembic.ini
@@ -0,0 +1,116 @@
# A generic, single database configuration.

[alembic]
# path to migration scripts
script_location = alembic

# template used to generate migration file names; The default value is %%(rev)s_%%(slug)s
# Uncomment the line below if you want the files to be prepended with date and time
# see https://alembic.sqlalchemy.org/en/latest/tutorial.html#editing-the-ini-file
# for all available tokens
# file_template = %%(year)d_%%(month).2d_%%(day).2d_%%(hour).2d%%(minute).2d-%%(rev)s_%%(slug)s

# sys.path path, will be prepended to sys.path if present.
# defaults to the current working directory.
prepend_sys_path = .

# timezone to use when rendering the date within the migration file
# as well as the filename.
# If specified, requires the python-dateutil library that can be
# installed by adding `alembic[tz]` to the pip requirements
# string value is passed to dateutil.tz.gettz()
# leave blank for localtime
# timezone =

# max length of characters to apply to the
# "slug" field
# truncate_slug_length = 40

# set to 'true' to run the environment during
# the 'revision' command, regardless of autogenerate
# revision_environment = false

# set to 'true' to allow .pyc and .pyo files without
# a source .py file to be detected as revisions in the
# versions/ directory
# sourceless = false

# version location specification; This defaults
# to alembic/versions. When using multiple version
# directories, initial revisions must be specified with --version-path.
# The path separator used here should be the separator specified by "version_path_separator" below.
# version_locations = %(here)s/bar:%(here)s/bat:alembic/versions

# version path separator; As mentioned above, this is the character used to split
# version_locations. The default within new alembic.ini files is "os", which uses os.pathsep.
# If this key is omitted entirely, it falls back to the legacy behavior of splitting on spaces and/or commas.
# Valid values for version_path_separator are:
#
# version_path_separator = :
# version_path_separator = ;
# version_path_separator = space
version_path_separator = os # Use os.pathsep. Default configuration used for new projects.

# set to 'true' to search source files recursively
# in each "version_locations" directory
# new in Alembic version 1.10
# recursive_version_locations = false

# the output encoding used when revision files
# are written from script.py.mako
# output_encoding = utf-8

# sqlalchemy.url = driver://user:pass@localhost/dbname


[post_write_hooks]
# post_write_hooks defines scripts or Python functions that are run
# on newly generated revision scripts. See the documentation for further
# detail and examples

# format using "black" - use the console_scripts runner, against the "black" entrypoint
# hooks = black
# black.type = console_scripts
# black.entrypoint = black
# black.options = -l 79 REVISION_SCRIPT_FILENAME

# lint with attempts to fix using "ruff" - use the exec runner, execute a binary
hooks = ruff
ruff.type = exec
ruff.executable = ruff
ruff.options = --fix REVISION_SCRIPT_FILENAME

# Logging configuration
[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
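
The `[formatter_generic]` format string uses `%`-style field specifiers, where `-5.5s` left-aligns the level name and pads or truncates it to exactly five characters. The sketch below is illustrative only and not part of the PR (the helper `render` is a hypothetical name). It reproduces the resulting line shape with the standard library alone.

```python
import logging

# Same format string as in [formatter_generic]; "%(levelname)-5.5s" pads or
# truncates the level name to exactly five characters.
GENERIC_FORMAT = "%(levelname)-5.5s [%(name)s] %(message)s"


def render(level: int, logger_name: str, message: str) -> str:
    """Format one log record the way the `generic` formatter would."""
    record = logging.LogRecord(
        name=logger_name,
        level=level,
        pathname="",
        lineno=0,
        msg=message,
        args=(),
        exc_info=None,
    )
    return logging.Formatter(GENERIC_FORMAT).format(record)


print(render(logging.INFO, "alembic", "Running upgrade"))
print(render(logging.WARNING, "sqlalchemy.engine", "pool timeout"))
```

Because the format string contains no `%(asctime)s` field, the `datefmt` setting has no effect on these lines.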
1 change: 1 addition & 0 deletions aana/alembic/README
@@ -0,0 +1 @@
Generic single-database configuration.
Empty file added aana/alembic/__init__.py
Empty file.
79 changes: 79 additions & 0 deletions aana/alembic/env.py
@@ -0,0 +1,79 @@
from logging.config import fileConfig

from alembic import context
from sqlalchemy import engine_from_config, pool

from aana.configs.db import create_database_engine
from aana.configs.settings import settings
from aana.models.db.base import BaseEntity

# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config

# Interpret the config file for Python logging.
# This line sets up loggers basically.
if config.config_file_name is not None:
    fileConfig(config.config_file_name)

# add your model's MetaData object here
# for 'autogenerate' support
# from myapp import mymodel
# target_metadata = mymodel.Base.metadata

target_metadata = BaseEntity.metadata

# other values from the config, defined by the needs of env.py,
# can be acquired:
# my_important_option = config.get_main_option("my_important_option")
# ... etc.


def run_migrations_offline() -> None:
    """Run migrations in 'offline' mode.

    Modified to use our existing db config module.

    Calls to context.execute() here emit the given string to the
    script output.

    """
    engine = create_database_engine(settings.db_config)
    context.configure(
        url=engine.url,
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )

    with context.begin_transaction():
        context.run_migrations()


def run_migrations_online() -> None:
    """Run migrations in 'online' mode.

    In this scenario we need to create an Engine
    and associate a connection with the context.

    """
    config_section = config.get_section(config.config_ini_section, {})
    engine = create_database_engine(settings.db_config)
    config_section["sqlalchemy.url"] = engine.url
    connectable = engine_from_config(
        config_section,
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )

    with connectable.connect() as connection:
        context.configure(connection=connection, target_metadata=target_metadata)

        with context.begin_transaction():
            context.run_migrations()


if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()
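
Which branch of `env.py` runs depends on how Alembic is invoked: passing `--sql` to `upgrade` puts the context in offline mode, so the migration SQL is printed instead of executed. The following is a hedged sketch of both invocations through Alembic's command-line entry point `alembic.config.main`. The helper names `upgrade_argv` and `run_upgrade` are assumptions for illustration and do not appear in this PR.

```python
from pathlib import Path


def upgrade_argv(
    ini_path: Path, revision: str = "head", *, offline: bool = False
) -> list[str]:
    """Build the argument list for `alembic upgrade`."""
    argv = ["-c", str(ini_path), "upgrade", revision]
    if offline:
        # --sql makes context.is_offline_mode() return True inside env.py.
        argv.append("--sql")
    return argv


def run_upgrade(ini_path: Path, *, offline: bool = False) -> None:
    """Run `alembic upgrade head`, online by default or as an SQL script."""
    # Imported lazily so upgrade_argv() can be used without Alembic installed.
    from alembic.config import main as alembic_main

    alembic_main(argv=upgrade_argv(ini_path, offline=offline))
```

With `offline=False` the call takes the `run_migrations_online()` path and applies revisions over a live connection. With `offline=True` it takes `run_migrations_offline()` and emits the SQL to standard output.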
28 changes: 28 additions & 0 deletions aana/alembic/script.py.mako
@@ -0,0 +1,28 @@
"""${message}

Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}

"""
from typing import Sequence

from alembic import op
import sqlalchemy as sa
${imports if imports else ""}

# revision identifiers, used by Alembic.
revision: str = ${repr(up_revision)}
down_revision: str | None = ${repr(down_revision)}
branch_labels: str | Sequence[str] | None = ${repr(branch_labels)}
depends_on: str | Sequence[str] | None = ${repr(depends_on)}


def upgrade() -> None:
    """Upgrade database to this revision from previous."""
    ${upgrades if upgrades else "pass"}


def downgrade() -> None:
    """Downgrade database from this revision to previous."""
    ${downgrades if downgrades else "pass"}