
Performance of cursor.next() could be improved with typedarray #117

Open
neon-ninja opened this issue Feb 9, 2024 · 2 comments
Labels: help wanted (Extra attention is needed)

neon-ninja commented Feb 9, 2024

Hi,

I'm trying to read a parquet file in the browser, and it seems to take a lot longer than it does in Python. Testing with the largest parquet file in this repo, test/test-files/customer.impala.parquet, in Python:

#!/usr/bin/env python3

import pandas as pd
import time

start = time.time()
df = pd.read_parquet("test/test-files/customer.impala.parquet", engine='pyarrow')
print(df)
end = time.time()
print(f"Took {end-start}s to read with pyarrow")

start = time.time()
df = pd.read_parquet("test/test-files/customer.impala.parquet", engine='fastparquet')
end = time.time()
print(f"Took {end-start}s to read with fastparquet")

outputs:

Took 0.1700916290283203s to read with pyarrow
Took 0.10409688949584961s to read with fastparquet

Whereas in the browser, using this test HTML/JS:

<html>
  <head>
    <script type="module">
      const parquet = await import("https://unpkg.com/@dsnp/[email protected]/dist/browser/parquet.esm.js");
      const buffer_library = await import("https://esm.sh/buffer");
      console.log(buffer_library)
      console.log(parquet)
      const URL = "test/test-files/customer.impala.parquet";
      let resp = await fetch(URL)
      let buffer = await resp.arrayBuffer()
      console.log(buffer)
      buffer = buffer_library.Buffer.from(buffer);
      const reader = await parquet.ParquetReader.openBuffer(buffer);
      //const reader = await parquet.ParquetReader.openUrl(URL);
      window.reader = reader
      console.log(reader)
      var startTime = performance.now()
      let cursor = reader.getCursor();
      await cursor.next()
      console.log(`Time to read first row: ${(performance.now() - startTime)/1000}s`)
      let record = null;
      while (record = await cursor.next()) {
        //console.log(record);
      }
      var endTime = performance.now()
      console.log(`Took ${(endTime - startTime)/1000}s to read ${URL}`)
    </script>
  </head>
</html>

The console outputs:

Time to read first row: 0.6747999997138977s
Took 1.0477999997138978s to read test/test-files/customer.impala.parquet

This is ~10x slower than Python.

Any ideas on how to improve browser read performance?

The bulk of the time seems to be spent reading the first row.

wilwade commented Feb 9, 2024

So a few things.

  1. While running a version of your script, we can save a bit by finally getting around to updating from buffer.slice to buffer.subarray (mostly saves on the stack). I'll put up a PR for that.
  2. Loading the first row requires loading a lot of pages that then don't have to be loaded again for rows 2+, so it is doing some up-front work that I believe it mostly has to do. (There are likely some additional ways to optimize that.)
  3. The buffer shim could likely be replaced with the native JS ArrayBuffer or a typed array; however, that is a large refactor. (A small illustrative sketch of the slice/subarray distinction follows this list.)
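
For reference, a minimal sketch (illustration only, not this library's code) of the copy-vs-view distinction on plain typed arrays that items 1 and 3 are getting at:

// Illustration only (not @dsnp/parquetjs code): on a plain Uint8Array,
// slice() copies the bytes into a new backing buffer, while subarray()
// returns a view over the same memory, avoiding the allocation and copy.
const bytes = new Uint8Array([1, 2, 3, 4, 5, 6, 7, 8]);

const copy = bytes.slice(2, 6);    // new backing buffer, bytes copied
const view = bytes.subarray(2, 6); // view sharing bytes' backing buffer

view[0] = 99;
console.log(bytes[2]); // 99 -- the view aliases the original
console.log(copy[0]);  // 3  -- the copy is independent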

If you don't need any of the additional features this library has, you might find https://github.com/kylebarron/parquet-wasm to be faster.

wilwade added a commit that referenced this issue Feb 10, 2024
- Buffer.slice -> Buffer.subarray (and correct a test that wasn't using buffers)
- new Buffer(array) -> Buffer.from(array)
- Fix issue with `npm run serve`

Via looking into #117, as `subarray` is slightly faster in the browser shim.

simline commented Apr 28, 2024

I did some work on cursor.next(): I added a function called nextBatch() that returns a batch of rows, with each row stored without the key-value (object) layout. Test results:

TPCH sf1 lineitem.parquet with specified columns:
pyarrow 1 thread: 0.45s
cursor.next() with columns typedarray: 43s
nextBatch with default array: 13s

If materializeRecords in shred.js could automatically return typed arrays, it might be faster.
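
For illustration, here is a rough sketch of the kind of columnar nextBatch() API described above; the nextBatch name, the batch-size argument, and the per-column typed-array layout are hypothetical and not part of the published @dsnp/parquetjs API:

// Hypothetical sketch: nextBatch() and the column layout below are
// illustrative only and not part of the published @dsnp/parquetjs API.
const reader = await parquet.ParquetReader.openBuffer(buffer);
const cursor = reader.getCursor(); // column selection omitted for brevity

// cursor.next() materializes one JS object per row:
//   { l_quantity: 17, l_extendedprice: 21168.23, ... }
// A batched, columnar variant could instead hand back one typed array
// per requested column:
//   { l_quantity: Float64Array(1024), l_extendedprice: Float64Array(1024) }
let batch;
while ((batch = await cursor.nextBatch(1024))) {
  let sum = 0;
  const prices = batch.l_extendedprice;
  for (let i = 0; i < prices.length; i++) sum += prices[i];
  // ...consume the batch without allocating a per-row JS object
}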

wilwade changed the title from "Performance" to "Performance of cursor.next()" on Jun 11, 2024
wilwade changed the title from "Performance of cursor.next()" to "Performance of cursor.next() could be improved with typedarray" on Jun 11, 2024
wilwade added the "help wanted" label on Jun 11, 2024