
Performance of cursor.next() could be improved with typedarray #117

Open
neon-ninja opened this issue Feb 9, 2024 · 2 comments
Labels: help wanted (Extra attention is needed)

neon-ninja commented Feb 9, 2024

Hi,

I'm trying to read a parquet file in the browser, and it seems to take a lot longer than it does in Python. Testing with the largest parquet file in this repo, test/test-files/customer.impala.parquet, in Python:

#!/usr/bin/env python3

import pandas as pd
import time

start = time.time()
df = pd.read_parquet("test/test-files/customer.impala.parquet", engine='pyarrow')
print(df)
end = time.time()
print(f"Took {end-start}s to read with pyarrow")

start = time.time()
df = pd.read_parquet("test/test-files/customer.impala.parquet", engine='fastparquet')
end = time.time()
print(f"Took {end-start}s to read with fastparquet")

outputs:

Took 0.1700916290283203s to read with pyarrow
Took 0.10409688949584961s to read with fastparquet

Whereas in the browser, using this test HTML/JS:

<html>
  <head>
    <script type="module">
      const parquet = await import("https://unpkg.com/@dsnp/[email protected]/dist/browser/parquet.esm.js");
      const buffer_library = await import("https://esm.sh/buffer");
      console.log(buffer_library)
      console.log(parquet)
      const URL = "test/test-files/customer.impala.parquet";
      let resp = await fetch(URL)
      let buffer = await resp.arrayBuffer()
      console.log(buffer)
      buffer = buffer_library.Buffer.from(buffer);
      const reader = await parquet.ParquetReader.openBuffer(buffer);
      //const reader = await parquet.ParquetReader.openUrl(URL);
      window.reader = reader
      console.log(reader)
      var startTime = performance.now()
      let cursor = reader.getCursor();
      await cursor.next()
      console.log(`Time to read first row: ${(performance.now() - startTime)/1000}s`)
      let record = null;
      while (record = await cursor.next()) {
        //console.log(record);
      }
      var endTime = performance.now()
      console.log(`Took ${(endTime - startTime)/1000}s to read ${URL}`)
    </script>
  </head>
</html>

The console outputs:

Time to read first row: 0.6747999997138977s
Took 1.0477999997138978s to read test/test-files/customer.impala.parquet

This is ~10x slower than Python.

Any ideas on how to improve browser read performance?

The bulk of the time seems to be spent reading the first row.

wilwade commented Feb 9, 2024

So a few things.

  1. While running a version of your script, we can save a bit by finally getting around to updating from buffer.slice to buffer.subarray (mostly saves on the stack). I'll put up a PR for that.
  2. Loading the first row requires loading a lot of pages that then don't have to be loaded again for rows 2+, so it is doing some up-front work that I believe it mostly has to do. (There are likely some additional ways to optimize that.)
  3. The buffer shim could likely be replaced with the native JS ArrayBuffer or a typed array; however, that is a large refactor. (A small illustrative sketch of the slice/subarray distinction follows this list.)
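
For reference, a minimal sketch (illustration only, not this library's code) of the copy-vs-view distinction on plain typed arrays that items 1 and 3 are getting at:

// Illustration only (not @dsnp/parquetjs code): on a plain Uint8Array,
// slice() copies the bytes into a new backing buffer, while subarray()
// returns a view over the same memory, avoiding the allocation and copy.
const bytes = new Uint8Array([1, 2, 3, 4, 5, 6, 7, 8]);

const copy = bytes.slice(2, 6);    // new backing buffer, bytes copied
const view = bytes.subarray(2, 6); // view sharing bytes' backing buffer

view[0] = 99;
console.log(bytes[2]); // 99 -- the view aliases the original
console.log(copy[0]);  // 3  -- the copy is independent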

If you don't need any of the additional features this library has, you might find https://github.com/kylebarron/parquet-wasm to be faster.

wilwade added a commit that referenced this issue Feb 10, 2024
- Buffer.slice -> Buffer.subarray (and correct a test that wasn't using buffers)
- new Buffer(array) -> Buffer.from(array)
- Fix issue with `npm run serve`

Via looking into #117, as `subarray` is slightly faster in the browser shim.

simline commented Apr 28, 2024

I did some work on cursor.next(): I added a function called nextBatch() that returns a batch of rows, with each row stored without the key-value (object) layout. Test results:

TPCH sf1 lineitem.parquet with specified columns:
pyarrow 1 thread: 0.45s
cursor.next() with columns typedarray: 43s
nextBatch with default array: 13s

If materializeRecords in shred.js could automatically return typed arrays, it might be faster.
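
For illustration, here is a rough sketch of the kind of columnar nextBatch() API described above; the nextBatch name, the batch-size argument, and the per-column typed-array layout are hypothetical and not part of the published @dsnp/parquetjs API:

// Hypothetical sketch: nextBatch() and the column layout below are
// illustrative only and not part of the published @dsnp/parquetjs API.
const reader = await parquet.ParquetReader.openBuffer(buffer);
const cursor = reader.getCursor(); // column selection omitted for brevity

// cursor.next() materializes one JS object per row:
//   { l_quantity: 17, l_extendedprice: 21168.23, ... }
// A batched, columnar variant could instead hand back one typed array
// per requested column:
//   { l_quantity: Float64Array(1024), l_extendedprice: Float64Array(1024) }
let batch;
while ((batch = await cursor.nextBatch(1024))) {
  let sum = 0;
  const prices = batch.l_extendedprice;
  for (let i = 0; i < prices.length; i++) sum += prices[i];
  // ...consume the batch without allocating a per-row JS object
}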

wilwade changed the title from "Performance" to "Performance of cursor.next()" on Jun 11, 2024
wilwade changed the title from "Performance of cursor.next()" to "Performance of cursor.next() could be improved with typedarray" on Jun 11, 2024
wilwade added the "help wanted" label on Jun 11, 2024