Add ability to export tables #4090

Open
wants to merge 7 commits into develop

Conversation

@pavish (Member) commented Dec 17, 2024

Fixes #4067

This PR implements downloads for tables using the following approach:

Querying data:

  • Creates a database cursor that queries records
  • Fetches records in chunks of 2000 rows

Transferring:

  • Forms CSV from the results of each fetch
  • Streams the CSV to the browser, which saves it as a file (a sketch of this flow follows below)
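
For a reader skimming the PR, here is a minimal sketch of this fetch-in-chunks-and-stream pattern. It is not the PR's actual code (the PR drives the cursor through a plpgsql function); the view, query, column names, and batch size handling here are illustrative placeholders.

```python
# Minimal sketch of the chunked-fetch + CSV-streaming pattern described above.
# Not the PR's actual code; all names and the query are placeholders.
import csv
import io

from django.http import StreamingHttpResponse

BATCH_SIZE = 2000  # rows per fetch, matching the PR description


def csv_chunks(cursor, select_sql, column_names):
    """Yield CSV text: one chunk for the header, then one per fetched batch."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)

    def flush():
        chunk = buffer.getvalue()
        buffer.seek(0)
        buffer.truncate(0)
        return chunk

    writer.writerow(column_names)
    yield flush()
    cursor.execute(select_sql)
    while True:
        rows = cursor.fetchmany(BATCH_SIZE)
        if not rows:
            break
        writer.writerows(rows)
        yield flush()


def export_table(request, cursor):
    """Hypothetical view: the browser saves the streamed response as a file."""
    response = StreamingHttpResponse(
        csv_chunks(cursor, "SELECT * FROM some_table", ["id", "name"]),
        content_type="text/csv",
    )
    response["Content-Disposition"] = 'attachment; filename="some_table.csv"'
    return response
```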

Approaches tested

Querying data:

  • COPY command:
    • Faster if we only need to export an entire non-foreign table.
    • Can form CSV output directly on the database.
    • For our requirements, the syntax COPY (SELECT * FROM table) TO is more useful than COPY table TO (we need filtering & sorting, plus support for explorations in the future). This syntax has a negligible performance difference compared to reading from a cursor.
    • COPY is not supported inside a db function (see the sketch after this list).
  • Multiple queries with standard limit and incremental offsets:
    • Can use the existing list function.
    • Flexible for our needs.
    • The data can change under the hood between queries, so the end result is unreliable.
  • DB cursor:
    • Can fetch & transfer in chunks.
    • Flexible for our needs.
    • Data is reliable since querying is part of the same transaction.
    • Reasonably fast, and since it fetches data in chunks, we have more control over memory and CPU usage.
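
As referenced in the COPY bullet above, the COPY route would look roughly like this from Python. This is a sketch assuming psycopg 3 and an already-open connection conn; the table name and send_to_client are placeholders:

```python
# Sketch of the tested COPY alternative, using psycopg 3's cursor.copy().
# COPY cannot run inside a database function, so it must be issued client-side.
with conn.cursor() as cur:
    copy_sql = "COPY (SELECT * FROM some_table) TO STDOUT WITH (FORMAT csv, HEADER)"
    with cur.copy(copy_sql) as copy:
        for data in copy:  # CSV produced by the server, yielded in bytes chunks
            send_to_client(data)  # hypothetical sink
```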

Transferring:

  • Direct transfer:
    • Forms the entire file in the server and transfers it.
    • Error handling is easier.
    • Processing time is higher as we wait for the entire file to complete and the request is kept open until then.
    • Higher chances of network disconnects since webservers/proxies/load-balancers might cut off idle connections after a threshold.
    • Cannot update progress on the frontend.
  • Streaming:
    • Streams chunk by chunk to the frontend.
    • Error handling is additional work.
    • Keeps the connection open and active, so webservers/proxies/load-balancers would not cut it off.
    • Can update progress on the frontend.
  • Storing as a file and allowing download via link:
    • Stores the csv data as a file on the Mathesar server.
    • Allows the frontend to download it via a link.
    • Requires additional permission checks.
    • Makes it possible to view export history and audit access.
    • Ideal for the long term; however, we cannot do that without figuring out a storage flow for Mathesar.
    • Not feasible for v0.2.0.

What is not implemented in this PR:

  • Error handling, i.e. viewing the exact cause of failure on the frontend:
    • We cannot return a different status code while a streaming request is underway.
      • Say a few chunks have been sent, and one fails due to an unforeseen error.
      • We cannot return a different HTTP status code separately for that chunk.
    • The download is made directly by the browser via an anchor tag's download attribute.
    • This lets the browser take over and download the streamed data. When there's a failure, the browser shows an error, but it cannot show the actual cause, since streaming requests do not return the error.
    • We could handle this by streaming JSON, with each chunk containing a status code and data, and joining the data on the frontend (see the sketch after this list).
    • This isn't implemented in this PR, since it's additional work with little benefit. The only major cause of streaming failure is a network failure, which the browser does display.
    • We could implement this as a nice-to-have post v0.2.0, or when users report issues.
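
The JSON-chunk idea referenced above is sketched here. It is not implemented in the PR; the wrapper name and payload shape are hypothetical:

```python
# Sketch of the unimplemented JSON-chunk error-reporting idea (names hypothetical).
# Each streamed line carries a status and a payload, so a failure cause can be
# delivered in-band even though the HTTP status code has already been sent.
import json


def json_line_chunks(csv_chunks):
    try:
        for chunk in csv_chunks:
            yield json.dumps({"status": 200, "data": chunk}) + "\n"
    except Exception as exc:
        yield json.dumps({"status": 500, "error": str(exc)}) + "\n"
```

The frontend would then join the data fields of the successful chunks into the final CSV and surface any error payload to the user.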

UX changes from the design

  • The UX is more straightforward, since the download is handled directly by the browser.
  • Here is a video of the interaction:
Screen.Recording.2024-12-18.at.7.32.06.PM.mov

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the develop branch of the repository.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@pavish self-assigned this Dec 17, 2024
@pavish added the "pr-status: revision" label (A PR awaiting follow-up work from its author after review) Dec 17, 2024
@pavish added this to the v0.2.0 (beta release) milestone Dec 17, 2024
@pavish added the "pr-status: review" label (A PR awaiting review) and removed the "pr-status: revision" label Dec 18, 2024
@pavish (Member Author) commented Dec 18, 2024

@ghislaineguerin @kgodey I've assigned you to this PR to get a sign-off on the UX.

@pavish marked this pull request as ready for review December 18, 2024 13:49
@mathemancer (Contributor) left a comment

The overall method seems reasonable. Nice work on that!

However, I think most of the SQL code and a large part of the Python code could be removed if you can get ServerCursor.fetchmany working (or just use the built-in itersize param and loop through rows one at a time client-side).

If there's a really good reason that doesn't work, what is it?

Also, please just pull apart the msar.list_records_from_table result to get the actual records to return. I don't want to have to maintain that formatted query in multiple places if we can avoid it. Alternatively, you could extract the CTE-generating code and call it from both the record lister and the iterable record lister SQL functions. I prefer the former approach, since it prioritizes the efficiency of the hot path, i.e., listing records for display normally.

Comment on lines +43 to +48
while True:
    cursor.execute(fetch_query)
    rows = cursor.fetchall()
    if not rows:
        break
    yield rows
@mathemancer (Contributor) commented

Why not use the fetchmany functionality of the psycopg.ServerCursor class? This seems like it's reinventing the wheel.

See the docs for ServerCursor.

I think it has all the pieces you need, and it'd be more convenient than rolling our own if it works.
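
For context, the suggested pattern looks roughly like this. This is a sketch assuming psycopg 3; the connection parameters, query, and handle() are placeholders:

```python
# Sketch of the psycopg 3 ServerCursor pattern suggested above.
import psycopg

with psycopg.connect() as conn:  # connection parameters omitted
    # A named cursor is a server-side cursor (psycopg.ServerCursor).
    with conn.cursor(name="export_cursor") as cur:
        cur.execute("SELECT * FROM some_table")  # placeholder query
        while batch := cur.fetchmany(2000):
            handle(batch)  # hypothetical: e.g. serialize to CSV and stream
```

The itersize alternative mentioned in the review instead sets cur.itersize and iterates the cursor row by row, with psycopg fetching from the server in itersize batches behind the scenes.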

@pavish (Member Author) commented Dec 19, 2024

  1. We need a SQL call to build the select query, which accepts filters and sorting in the same format as list does.
  2. We need a SQL call to fetch the names of the columns (as the existing functions rely on oids).
  3. We need a server side cursor for the records.

If we use the ServerCursor class for the records, we'd need to make additional calls for build_select_query and fetch_selectable_column_names. We'd also need additional logic to ensure that the column names are in the same order as the columns in the records result.

It's much simpler to fetch everything via a single plpgsql function.

Comment on lines +5311 to +5318
'SELECT %1$s FROM %2$I.%3$I %7$s %6$s LIMIT %4$L OFFSET %5$L',
COALESCE(columns_sql, 'NULL'),
msar.get_relation_schema_name(tab_id),
msar.get_relation_name(tab_id),
limit_,
offset_,
msar.build_order_by_expr(tab_id, order_),
msar.build_where_clause(tab_id, filter_)
@mathemancer (Contributor) commented

This code replication is avoidable by just calling the record lister, then pulling apart the result. Given that this is not exactly a "Hot Path", I think the very minor inefficiency involved is well worth the increase in maintainability.

@pavish (Member Author) commented Dec 19, 2024

> This code replication is avoidable by just calling the record lister, then pulling apart the result.

The record lister queries all the records and stores them in a single JSON value, so calling that function means that we cannot use a cursor to iterate through the results. That would make the entire process inefficient.

@mathemancer (Contributor) commented

Regardless, maintaining that SQL construction in multiple different functions isn't a good idea. We'll need to find a better way.

@pavish requested a review from mathemancer December 19, 2024 14:50
@pavish (Member Author) commented Dec 23, 2024

Update for anyone following the PR: @mathemancer and I had a call discussing the approach to take and we'll be continuing the discussion after the holidays. This PR might remain idle until then.

Labels
"pr-status: review" (A PR awaiting review)
Development

Successfully merging this pull request may close these issues.

Add ability to export tables
4 participants