Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving upserts #79

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,9 @@ Getting Started
- Write access to an empty `PostgreSQL <http://www.postgresql.org>`_ database.
- A Python installation with `Jupyter Notebook <https://github.com/jupyter/notebook>`_ >= 5.0.

PGContents will put its table in the `pgcontents` namespace. When you log onto the PostgreSQL database server, make
sure the `pgcontents` schema is in the search path (e.g., `set search_path to 'pgcontents'`; see `PostgreSQL documentation`_).

**Installation:**

0. Install ``pgcontents`` from PyPI via ``pip install pgcontents``.
Expand All @@ -23,3 +26,4 @@ Demo Video
You can see a demo of PGContents in action in `this presentation from JupyterCon 2017`_.

.. _`this presentation from JupyterCon 2017` : https://youtu.be/TtsbspKHJGo?t=917
.. _`PostgreSQL documentation` : https://www.postgresql.org/docs/14/ddl-schemas.html#DDL-SCHEMAS-PATH
26 changes: 10 additions & 16 deletions pgcontents/query.py
Original file line number Diff line number Diff line change
Expand Up @@ -510,7 +510,7 @@ def save_file(db, user_id, path, content, encrypt_func, max_size_bytes):
)
directory, name = split_api_filepath(path)
with db.begin_nested() as savepoint:
try:
if not file_exists(db, user_id, path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm nervous about handling this by checking file_exists before executing the insert, since that allows for a possible race condition where another transaction inserts a file in between the exists check and the insert. That worry was the reason that this operation is currently implemented as "try to insert, but handle unique violations if they occur". Do you have any more info on your transaction that ended up in weird state?

Also, since this code was written, postgresql and sqlalchemy have added support for proper upserts directly in sql (see https://docs.sqlalchemy.org/en/14/dialects/postgresql.html#insert-on-conflict-upsert, for instance). Using that would probably be the "right" way to fix this if there's a problem with the current error handling strategy.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm new to SQLALchemy, so forgive my ignorance.

I was under the assumption that the select and insert/update would happen in a single transaction, because of the db.begin_nested() in the with statement. Hence, if two transaction are trying to save (a new notebook) simultaneously, one would win, the other would need to retry a save and get updated in the second attempt.

I get the following exceptions when using pgcontents version 0.6.0:

First, the insert fails

psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "uix_filepath_username"
DETAIL:  Key (user_id, parent_name, name)=(my_awesome_username, /, Untitled.ipynb) already exists.

Then

sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager.  Please complete the context manager before emitting further commands.

If I understand it correctly, that means that the transaction started by the db.begin_nested() has failed in some way and a new transaction needs to be started?

On the upsert: very interesting. My initial attempt at using the on_conflict_do_update fails with Insert does not have an attribute on_conflict_do_update. This is the code fragment:

res = db.execute(
            files.insert()
                .values(
                name=name,
                user_id=user_id,
                parent_name=directory,
                content=content,
            )
            .on_conflict_do_update(constraint="uix_filepath_username", set_={
                "name": name,
                "user_id": user_id,
                "parent_name": directory,
                "content": content
            })
        )

res = db.execute(
files.insert().values(
name=name,
Expand All @@ -519,22 +519,16 @@ def save_file(db, user_id, path, content, encrypt_func, max_size_bytes):
content=content,
)
)
except IntegrityError as error:
# The file already exists, so overwrite its content with the newer
# version.
if is_unique_violation(error):
savepoint.rollback()
res = db.execute(
files.update().where(
_file_where(user_id, path),
).values(
content=content,
created_at=func.now(),
)
else:
# The file already exists, so overwrite its content with the newer version.
res = db.execute(
files.update().where(
_file_where(user_id, path),
).values(
content=content,
created_at=func.now(),
)
else:
# Unknown error. Reraise
raise
)

return res

Expand Down