Add ability to export tables #4090

pavish · 2024-12-17T20:17:47Z

Fixes #4067

This PR implements downloads for tables using the following approach:

Querying data:

Creates a database cursor that queries records
Fetches records in chunks of 2000 rows

Transferring:

Forms csv from the results of each fetch
Streams the csv to the browser, which saves it as a file

Approaches tested

Querying data:

COPY command:
- Faster if we only need to export entire non-foreign table.
- Can form CSV output directly on the database.
- For our requirements, we would have more use for the syntax COPY (SELECT * FROM table) TO instead of COPY table TO (since we need filtering & sorting and support for explorations in the future). This syntax has negligible performance difference compared to reading from a cursor.
- COPY is not supported inside a db function.
Multiple queries with standard limit and incremental offsets:
- Can use the existing list function.
- Flexible for our needs.
- Chances of data changing under the hood between queries, so the end result is unreliable.
DB cursor:
- Can fetch & transfer in chunks.
- Flexible for our needs.
- Data is reliable since querying is part of the same transaction.
- Reasonably fast, and since it fetches data in chunks, we have more control over memory and cpu usage.

Transferring:

Direct transfer:
- Forms the entire file in the server and transfers it.
- Error handling is easier.
- Processing time is higher as we wait for the entire file to complete and the request is kept open until then.
- Higher chances of network disconnects since webservers/proxies/load-balancers might cut off idle connections after a threshold.
- Cannot update progress on the frontend.
Streaming:
- Streams chunk by chunk to the frontend.
- Error handling is additional work.
- Keeps the connection open and active, so webservers/proxies/load-balancers would not cut them off.
- Can update progress on the frontend.
Storing as a file and allowing download via link:
- Stores the csv data as a file on the Mathesar server.
- Allows the frontend to download it via a link.
- Requires additional permission checks.
- Makes it possible to view export history and audit access.
- Ideal for long-term, however we cannot do that without figuring out storage flow for Mathesar.
- Not feasible for v0.2.0.

What is not implemented in this PR:

Error handling i.e. viewing exact cause for failure on the frontend:
- We cannot return a different status code while a streaming request in underway.
  - Say a few chunks have been sent, and one fails due to an unforeseen error.
  - We cannot return a different http status code separately for that chunk.
- The download is directly made by the browser via the a['download'] tag.
- This lets the browser take over and download the streamed data. When there's a failure, the browser would show an error. However, it cannot show the actual cause since streaming requests do not return the error.
- We can handle this by streaming as json with each chunk containing a status code and data, and later joining the data on the frontend.
- This isn't implemented in this PR, since it's additional work with little benefit. The only major cause for streaming to fail is due to network failures, which the browser would display.
- We could implement this as a nice-to-have post v0.2.0, or when users report issues.

UX changes from the design

The UX is made more straight-forward since download is handled directly by the browser.
Here is a video of the interaction:

Screen.Recording.2024-12-18.at.7.32.06.PM.mov

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the develop branch of the repository
My commit messages follow best practices.
My code follows the established code style of the repository.
I added tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no
visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

…esar into export_tables

pavish · 2024-12-18T13:49:50Z

@ghislaineguerin @kgodey I've assigned you to this PR to get a sign off on the UX.

mathemancer

The overall method seems reasonable. Nice work on that!

However, I think most of the SQL code and a large part of the Python code could be removed if you can get ServerCursor.fetchmany working (or just use the builtin itersize param and loop through rows one at a time clientside).

If there's a really good reason that doesn't work, what is it?

Also, please just pull apart the msar.list_records_from_table result to get the actual records to return. I don't want to have to maintain that formatted query in multiple places if we can avoid it. Alternatively, you could extract the CTE-generating code and call it from both the record lister and the iterable record lister SQL functions. I prefer the former approach, since it prioritizes the efficiency of the hot path, i.e., listing records for display normally.

mathemancer · 2024-12-19T01:54:55Z

db/connection.py

+    while True:
+        cursor.execute(fetch_query)
+        rows = cursor.fetchall()
+        if not rows:
+            break
+        yield rows


Why not use the fetchmany functionality of the psycopg.ServerCursor class? This seems like it's reinventing the wheel.

See the docs for ServerCursor.

I think it has all the pieces you need, and it'd be more convenient than rolling our own if it works.

We need a SQL call to build the select query which accepts filters and sorting in the exact format as list does.

We need a SQL call to fetch the names of the columns (as the existing functions rely on oids).

We need a server side cursor for the records.

If we use ServerCursor class for the records, we'd need to make additional calls for build_select_query and fetch_selectable_column_names. We'd also need to have additional logic to ensure that the column names are in the same order as the columns in the records result.

It's much simpler to fetch everything via a single plpgsql function.

mathemancer · 2024-12-19T02:18:41Z

db/sql/00_msar.sql

+    'SELECT %1$s FROM %2$I.%3$I %7$s %6$s LIMIT %4$L OFFSET %5$L',
+    COALESCE(columns_sql, 'NULL'),
+    msar.get_relation_schema_name(tab_id),
+    msar.get_relation_name(tab_id),
+    limit_,
+    offset_,
+    msar.build_order_by_expr(tab_id, order_),
+    msar.build_where_clause(tab_id, filter_)


This code replication is avoidable by just calling the record lister, then pulling apart the result. Given that this is not exactly a "Hot Path", I think the very minor inefficiency involved is well worth the increase in maintainability.

This code replication is avoidable by just calling the record lister, then pulling apart the result.

The record lister queries all the records and stores it into a single json, so calling that function means that we cannot use a cursor to iterate through the results. That would make the entire process inefficient.

Regardless, maintaining that SQL construction in multiple different functions isn't a good idea. We'll need to find a better way.

pavish · 2024-12-23T13:41:19Z

Update for anyone following the PR: @mathemancer and I had a call discussing the approach to take and we'll be continuing the discussion after the holidays. This PR might remain idle until then.

pavish added 4 commits December 11, 2024 12:17

Stream table as csv file

004181d

Export table using postgresql cursor to batch records

5b89a5c

Add column headers to exported csv

f8315af

Merge branch 'develop' of https://github.com/mathesar-foundation/math…

838452d

…esar into export_tables

pavish self-assigned this Dec 17, 2024

pavish added the pr-status: revision A PR awaiting follow-up work from its author after review label Dec 17, 2024

pavish added this to the v0.2.0 (beta release) milestone Dec 17, 2024

pavish added 2 commits December 18, 2024 18:15

Improve naming of functions

d92eb24

Add information tooltip to export button

59ae25c

pavish assigned kgodey, ghislaineguerin and mathemancer and unassigned pavish Dec 18, 2024

pavish added pr-status: review A PR awaiting review and removed pr-status: revision A PR awaiting follow-up work from its author after review labels Dec 18, 2024

pavish marked this pull request as ready for review December 18, 2024 13:49

Merge branch 'develop' into export_tables

93392db

zackkrida requested review from kgodey and ghislaineguerin December 18, 2024 15:00

pavish requested a review from mathemancer December 18, 2024 15:59

mathemancer requested changes Dec 19, 2024

View reviewed changes

ghislaineguerin approved these changes Dec 19, 2024

View reviewed changes

pavish requested a review from mathemancer December 19, 2024 14:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ability to export tables #4090

Add ability to export tables #4090

pavish commented Dec 17, 2024 •

edited

Loading

pavish commented Dec 18, 2024

mathemancer left a comment

mathemancer Dec 19, 2024

pavish Dec 19, 2024 •

edited

Loading

mathemancer Dec 19, 2024

pavish Dec 19, 2024 •

edited

Loading

mathemancer Dec 20, 2024

pavish commented Dec 23, 2024

Add ability to export tables #4090

Are you sure you want to change the base?

Add ability to export tables #4090

Conversation

pavish commented Dec 17, 2024 • edited Loading

Approaches tested

What is not implemented in this PR:

UX changes from the design

Checklist

Developer Certificate of Origin

pavish commented Dec 18, 2024

mathemancer left a comment

Choose a reason for hiding this comment

mathemancer Dec 19, 2024

Choose a reason for hiding this comment

pavish Dec 19, 2024 • edited Loading

Choose a reason for hiding this comment

mathemancer Dec 19, 2024

Choose a reason for hiding this comment

pavish Dec 19, 2024 • edited Loading

Choose a reason for hiding this comment

mathemancer Dec 20, 2024

Choose a reason for hiding this comment

pavish commented Dec 23, 2024

pavish commented Dec 17, 2024 •

edited

Loading

pavish Dec 19, 2024 •

edited

Loading

pavish Dec 19, 2024 •

edited

Loading