Support apache arrow #45
Whether & when I work on Arrow support will depend on the fate of this currently-deprecated part of the C API. Based on this PR, it looks like they will remain deprecated. I'm not sure what the long-term plan is.
It seems the Arrow support in the C API has a "DEPRECATION NOTICE", which is distinct from "DEPRECATED". Apparently this means the functions are likely to change, but the functionality itself will likely be preserved. So it seems possible I could expose Arrow support using the current functions with some confidence I can preserve that support when the C API is changed.
I reviewed the Arrow support in the C API, and, while I believe I can expose it, I'm unsure whether it provides the desired functionality. In particular, I'm unsure it provides access to the binary IPC format. For folks interested in Arrow support: What functionality is useful? What parts of the Arrow C API would you use, and what would you like that's missing from that API?
I mostly need IPC, which I could then read with flechette or arrow js. In most cases I'll send the Arrow straight over the wire somewhere else. Arrow seems like the ideal format for streaming data from a server to a client. I'm not sure what the alternative would be, so this seems like essential functionality.
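For illustration, a minimal sketch (not from this thread) of consuming such an IPC payload on the client with apache-arrow; the endpoint URL is a placeholder:

```ts
import { tableFromIPC } from 'apache-arrow';

// Fetch an Arrow IPC payload from some server endpoint (hypothetical URL).
const response = await fetch('/query.arrow');
const bytes = new Uint8Array(await response.arrayBuffer());

// Deserialize the IPC bytes into a Table; flechette exposes a similar tableFromIPC entry point.
const table = tableFromIPC(bytes);
console.log(table.numRows);
```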
While it's admittedly more awkward than direct API support, have you tried using the …? It's not very well documented, but the tests are informative.
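Assuming the suggestion refers to the DuckDB Arrow extension's `to_arrow_ipc` function (which later comments in this thread use), a minimal sketch of that route looks roughly like this:

```ts
import { DuckDBInstance, DuckDBBlobValue } from '@duckdb/node-api';

const instance = await DuckDBInstance.create(':memory:');
const connection = await instance.connect();
await connection.run('INSTALL ARROW');
await connection.run('LOAD ARROW');

// to_arrow_ipc((<subquery>)) serializes the subquery's result as Arrow IPC blobs.
const reader = await connection.runAndReadAll(
  'FROM to_arrow_ipc((SELECT 42 AS answer))'
);
const blobs = reader.getRows().map((row) => (row[0]! as DuckDBBlobValue).bytes);
```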
Oh, fun idea. This should be easy to wrap as well. It would be nice to have something built in, though, to make sure we didn't fall into some perf trap with a workaround.
Chiming in: the core feature that is keeping us on the old node client is the ability to have some kind of iterator over Arrow record batches. Basically, we want to be able to stream data through DuckDB, keep memory usage low, and keep the types as close as possible, like @domoritz, for subsequent passing across the network. E.g. we have essentially the following case running now:

```ts
import { RecordBatchStreamReader } from 'apache-arrow';

const stream = db.arrowIPCStream("SELECT * FROM 's3://huggingface-datasets/somebigtable-part*.parquet'");
const reader = await RecordBatchStreamReader.from(stream);
for await (const batch of reader) {
  await myUploadFunction(batch);
}
```

To be fair, we (@RLesser) have found some problems with this workflow in the node client, so it may not be easy!
I'd be curious to see how … It does seem that … Example:
Thanks for the suggestion! Having dug in a little more, it turns out true streaming is tricky because of the … So I think you can probably call this supported. This is an insanely roundabout way to run a query on a streaming chunk, though… Is there an easy way to just run SQL against a data chunk?

```ts
import { DuckDBInstance, DuckDBBlobValue } from '@duckdb/node-api';
import { tableFromIPC } from 'apache-arrow';

// Small helper to total byte lengths (the original presumably imported one, e.g. from d3-array).
const sum = (xs: number[]) => xs.reduce((a, b) => a + b, 0);

const instance = await DuckDBInstance.create(':memory:');
const connection = await instance.connect();
const connection2 = await instance.connect();
await connection2.run('INSTALL ARROW');
await connection2.run('LOAD ARROW');

const query = `FROM 'hf://datasets/HuggingFaceGECLM/REDDIT_comments@~parquet/default/AskHistorians/*.parquet'`;
// Holder table with the query's schema; its contents are replaced for each streamed chunk.
await connection2.run(`CREATE TABLE eric AS ${query} LIMIT 1`);

const result = await connection.stream(query);
while (true) {
  const chunk = await result.fetchChunk();
  if (!chunk?.rowCount) { break; }

  // Write the chunk to the holder table.
  await connection2.run('DELETE FROM eric');
  const appender = await connection2.createAppender('main', 'eric');
  appender.appendDataChunk(chunk);
  appender.flush();

  // Pull the chunk as Arrow IPC from the holder.
  const reader = await connection2.runAndReadAll(`FROM to_arrow_ipc((FROM eric))`);
  const batches: Uint8Array[] = [];
  const rows = reader.getRows();
  for (const row of rows) {
    batches.push((row[0]! as DuckDBBlobValue).bytes);
  }

  // Write the chunk bytes to a single Uint8Array we can deserialize.
  const buff = new Uint8Array(sum(batches.map(d => d.length)));
  let offset = 0;
  for (const batch of batches) {
    buff.set(batch, offset);
    offset += batch.length;
  }
  const v = tableFromIPC(buff);
}
```
That's a neat trick @bmschmidt. I suspect it's actually fairly efficient. Yes, the data has to go back and forth more than it should, but moving single chunks and buffers around like that is going to be pretty fast. It's definitely not possible to query a data chunk directly, though. I can see the value in putting this approach you've concocted into a helper method in the library. I'll have to think about how best to fit it in.
Cool, thanks. Yeah, I always figure the name of the game in these things is to avoid ever casting to js-native types, so shoving around the DuckDBDataChunk object under the table feels sensible enough… As I see it there are two things that would be nice from the duckdb-node API perspective here; happy to help if there's a need.
Both of those sound feasible. Certainly exposing the buffers in a convenient way. I think it's also possible to get enough information about the column names and types from the (non-materialized) result and the initial chunk to create the table to append to. I've got a bunch of other items on my queue, but I'll try to get to this at some point. PRs are also welcome.
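A rough sketch of what such a helper might look like, building directly on the workaround earlier in this thread; the name `streamArrowIPC`, the holder-table name, and the overall shape are hypothetical, not an existing @duckdb/node-api API:

```ts
import { DuckDBInstance, DuckDBBlobValue } from '@duckdb/node-api';

// Hypothetical helper: stream a query and yield one Arrow IPC buffer per source chunk.
async function* streamArrowIPC(instance: DuckDBInstance, query: string): AsyncGenerator<Uint8Array> {
  const source = await instance.connect();   // streams the user's query chunk by chunk
  const helper = await instance.connect();   // holds one chunk at a time
  await helper.run('INSTALL ARROW');
  await helper.run('LOAD ARROW');

  // Holder table with the query's schema (same trick as above);
  // its single seed row is deleted before the first append.
  await helper.run(`CREATE TABLE __ipc_holder AS ${query} LIMIT 1`);

  const result = await source.stream(query);
  while (true) {
    const chunk = await result.fetchChunk();
    if (!chunk?.rowCount) { break; }

    // Replace the holder's contents with the current chunk.
    await helper.run('DELETE FROM __ipc_holder');
    const appender = await helper.createAppender('main', '__ipc_holder');
    appender.appendDataChunk(chunk);
    appender.flush();

    // Serialize the holder to Arrow IPC and concatenate the blobs into one buffer.
    const reader = await helper.runAndReadAll('FROM to_arrow_ipc((FROM __ipc_holder))');
    const blobs = reader.getRows().map((row) => (row[0]! as DuckDBBlobValue).bytes);
    const buff = new Uint8Array(blobs.reduce((n, b) => n + b.length, 0));
    let offset = 0;
    for (const b of blobs) {
      buff.set(b, offset);
      offset += b.length;
    }
    yield buff;
  }

  await helper.run('DROP TABLE __ipc_holder');
}
```

Each yielded buffer is the concatenation of one `to_arrow_ipc` call's blobs (schema plus record batches), so, as in the example above, it can be deserialized with `tableFromIPC` or forwarded over the wire as-is.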
It's on the roadmap, but I wanted to create an issue for it so I can see when it may be supported. I'm very interested in adopting this package but need Arrow support (I just need access to the binary IPC).