
Append data chunk with Dates and null values? #124

Closed
nshiab opened this issue Jan 28, 2025 · 7 comments

@nshiab

nshiab commented Jan 28, 2025

Hi!

I have this data.

```ts
const data = [
  {
    key1: 1,
    key2: "un",
    key3: new Date("2010-01-01"),
    key4: true,
  },
  {
    key1: NaN,
    key2: "deux",
    key3: new Date("2010-01-02"),
    key4: null,
  },
  {
    key1: 3,
    key2: undefined,
    key3: new Date("2010-01-03"),
    key4: false,
  },
  {
    key1: 4,
    key2: "quatre",
    key3: new Date("2010-01-04"),
    key4: false,
  },
];
```

I would like to append it to a table.

Following the documentation, I wrote this. Note that I am replacing undefined and NaN with null.

```ts
const instance = await DuckDBInstance.create(":memory:");
const connection = await instance.connect();

await connection.run(
  `CREATE OR REPLACE TABLE table1("key1" DOUBLE, "key2" VARCHAR, "key3" TIMESTAMP, "key4" BOOLEAN)`,
);

const appender = await connection.createAppender("main", "table1");

const chunk = DuckDBDataChunk.create([DOUBLE, VARCHAR, TIMESTAMP, BOOLEAN]);

const keys = Object.keys(data[0]) as (keyof typeof data[0])[];
const dataForChunk = keys.map((key) =>
  data.map((d) =>
    d[key] === undefined || Number.isNaN(d[key]) ? null : d[key] as DuckDBValue
  )
);

chunk.setColumns(dataForChunk);

appender.appendDataChunk(chunk);

appender.flush();
```

But I get this error:

```
error: Uncaught (in promise) TypeError: Cannot convert undefined to a BigInt
    at DuckDBTimestampVector.setItem (file:///Users/naelshiab/Library/Caches/deno/npm/registry.npmjs.org/@duckdb/node-api/1.1.3-alpha.10/lib/DuckDBVector.js:956:35)
    at DuckDBDataChunk.setColumnValues (file:///Users/naelshiab/Library/Caches/deno/npm/registry.npmjs.org/@duckdb/node-api/1.1.3-alpha.10/lib/DuckDBDataChunk.js:66:20)
    at DuckDBDataChunk.setColumns (file:///Users/naelshiab/Library/Caches/deno/npm/registry.npmjs.org/@duckdb/node-api/1.1.3-alpha.10/lib/DuckDBDataChunk.js:94:18)
    at file:///Users/naelshiab/Desktop/duckdb-neo/main.ts:56:7
```

Note that I am using Deno, and this is adapted from a test for my library simple-data-analysis.

What am I doing wrong?

Here's the whole script in case you want to test it.

```ts
import {
  BOOLEAN,
  DOUBLE,
  DuckDBDataChunk,
  DuckDBInstance,
  DuckDBValue,
  TIMESTAMP,
  VARCHAR,
} from "@duckdb/node-api";

const data = [
  {
    key1: 1,
    key2: "un",
    key3: new Date("2010-01-01"),
    key4: true,
  },
  {
    key1: NaN,
    key2: "deux",
    key3: new Date("2010-01-02"),
    key4: null,
  },
  {
    key1: 3,
    key2: undefined,
    key3: new Date("2010-01-03"),
    key4: false,
  },
  {
    key1: 4,
    key2: "quatre",
    key3: new Date("2010-01-04"),
    key4: false,
  },
];

const instance = await DuckDBInstance.create(":memory:");
const connection = await instance.connect();

await connection.run(
  `CREATE OR REPLACE TABLE table1("key1" DOUBLE, "key2" VARCHAR, "key3" TIMESTAMP, "key4" BOOLEAN)`,
);

const appender = await connection.createAppender("main", "table1");

const chunk = DuckDBDataChunk.create([DOUBLE, VARCHAR, TIMESTAMP, BOOLEAN]);

const keys = Object.keys(data[0]) as (keyof typeof data[0])[];
const dataForChunk = keys.map((key) =>
  data.map((d) =>
    d[key] === undefined || Number.isNaN(d[key]) ? null : d[key] as DuckDBValue
  )
);

chunk.setColumns(dataForChunk);

appender.appendDataChunk(chunk);

appender.flush();

// const rows = await reader.getRowsJson();
// console.log(rows);
```

Thank you!

@jraymakers
Contributor

The problem is the JS Date objects. You need to explicitly convert those to DuckDBTimestampValue objects. One way to do that is: `new DuckDBTimestampValue(BigInt(date.getTime()) * 1000n)`. This extracts the milliseconds-from-the-epoch from the JS Date (as a Number, so it must be converted to BigInt before scaling), converts it to microseconds (which is how DuckDB natively stores TIMESTAMPs), and wraps it in a DuckDB value object of the appropriate type.

(The `as DuckDBValue` type assertion is hiding the fact that passing a JS Date to `setColumns` is not supported.)

Clearly this could be made more user-friendly; converting JS Date to a TIMESTAMP is likely a common case. I recently did some work on converting values to more convenient forms when reading; perhaps that pattern could be applied when writing as well.
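A minimal sketch of the millisecond-to-microsecond arithmetic described above (the `DuckDBTimestampValue` wrapper itself comes from `@duckdb/node-api` and is not exercised here):

```typescript
// Convert a JS Date to the microseconds-since-epoch BigInt that
// DuckDB TIMESTAMPs store natively. getTime() returns milliseconds
// as a Number, so it must be converted to BigInt before scaling;
// multiplying a Number by 1000n directly throws a TypeError.
const date = new Date("2010-01-02T00:00:00Z");
const micros = BigInt(date.getTime()) * 1000n;
console.log(micros); // 1262390400000000n
// A value like this would then be wrapped, e.g.
// new DuckDBTimestampValue(micros), before being passed to setColumns.
```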

@jraymakers
Contributor

Of course, if what you actually have are dates, not timestamps, you might consider using the DuckDB type DATE and DuckDBDateValue. The constructor of DuckDBDateValue takes the number of days since the epoch, and it also has a fromParts that takes an object with the year, month, and day. (DuckDBTimestampValue also has a fromParts that takes the same parts plus hours, minutes, seconds, and microseconds.)
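For the DATE case, the days-since-the-epoch number mentioned above can be derived from a JS Date like this (a sketch of the arithmetic only; `DuckDBDateValue` itself is not exercised here):

```typescript
// DuckDB DATEs are stored as whole days since the Unix epoch.
const MS_PER_DAY = 86_400_000;
const days = Math.floor(
  new Date("2010-01-01T00:00:00Z").getTime() / MS_PER_DAY,
);
console.log(days); // 14610
// Per the comment above, this number could be passed to the
// DuckDBDateValue constructor, or fromParts could be used with
// the year, month, and day instead.
```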

@nshiab
Author

nshiab commented Jan 28, 2025

Thank you, @jraymakers! It works!

It could be more user-friendly, but I understand the challenges of making data flow in and out of DuckDB.

I noticed two other things while transitioning from the old library to this one when using .getRowObjectsJson() for the data:

  • BigInt values are returned as strings.
  • Dates are also returned as strings.

I created a simple function to handle this on my end (see below), but I wanted to let you know in case it's useful. :)

```ts
export default function convertForJS(rows: {
  [key: string]: string | number | boolean | Date | null;
}[], types: {
  [key: string]: string;
}) {
  const firstObjectKeys = Object.keys(rows[0]);
  for (const key of Object.keys(types)) {
    if (!firstObjectKeys.includes(key)) {
      continue;
    }
    if (types[key] === "DATE") {
      for (const row of rows) {
        row[key] = row[key] === null
          ? null
          : new Date(`${row[key]}T00:00:00.000Z`);
      }
    } else if (types[key] === "TIMESTAMP") {
      for (const row of rows) {
        row[key] = row[key] === null
          ? null
          : new Date((row[key] as string).replace(" ", "T") + "Z");
      }
    } else if (types[key] === "BIGINT") {
      for (const row of rows) {
        row[key] = row[key] === null ? null : Number(row[key]);
      }
    }
  }
}
```
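The TIMESTAMP branch above assumes DuckDB renders timestamps as "YYYY-MM-DD HH:MM:SS" strings (inferred from the function itself); a minimal illustration of that parsing step:

```typescript
// Swapping the space for "T" and appending "Z" turns the DuckDB
// string into an ISO 8601 string that Date parses as UTC.
const ts = "2010-01-02 00:00:00";
const parsed = new Date(ts.replace(" ", "T") + "Z");
console.log(parsed.getTime()); // 1262390400000
```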

Just out of curiosity—are you part of the DuckDB team? And are you working on this alone?

Cheers!

@nshiab nshiab closed this as completed Jan 28, 2025
@jraymakers
Contributor

It's intentional that the get...Json methods return large integers and date/time/timestamps as strings. The goal is to return JSON-serializable JS objects (or arrays), and JS BigInt and Date objects are not JSON-serializable. (Large integers that don't fit into JS Numbers (e.g. 64-bit integers or larger) can only be represented losslessly in JSON as strings. They could be converted to Number with some loss of precision; whether this is acceptable depends on the use case.)

I think there's value in additional methods, similar to the get...Json family, that return JS built-ins instead of JSON-serializable objects. This would allow them to return BigInts and Dates, and also represent the floating-point values NaN and Infinity in JS-native ways (instead of as strings). This should be easy to add using the same mechanism, but with slightly different conversion logic.
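The precision caveat above is easy to demonstrate (a small standalone illustration, not part of the library):

```typescript
// 2^53 + 1 is a valid 64-bit integer but is not representable as a
// JS Number (doubles are exact only up to 2^53), so converting it
// to Number silently loses precision.
const big = 9007199254740993n; // Number.MAX_SAFE_INTEGER + 2
const lossy = Number(big);
console.log(lossy); // 9007199254740992
console.log(BigInt(lossy) === big); // false
// A string representation round-trips losslessly:
console.log(BigInt("9007199254740993") === big); // true
```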

@jraymakers
Contributor

Regarding your other question: I work at MotherDuck, which is distinct from DuckDB Labs, although we collaborate closely. Node Neo is done mostly as a side project, not as part of my "day job", though of course there's some overlap. What's good for the DuckDB community is also good for MotherDuck!

It got started as a collaboration with @hannes, who set the direction and laid the foundation, but since the initial prototype I've been working on it mostly solo. Note that the repo is part of the DuckDB GitHub organization, and I continue to get some advice and assistance from the kind folks at DuckDB Labs from time to time.

@nshiab
Author

nshiab commented Jan 28, 2025

> It's intentional that the get...Json methods return large integers and date/time/timestamps as strings. The goal is to return JSON-serializable JS objects (or arrays), and JS BigInt and Date objects are not JSON-serializable. (Large integers that don't fit into JS Numbers (e.g. 64-bit integers or larger) can only be represented losslessly in JSON as strings. They could be converted to Number with some loss of precision; whether this is acceptable depends on the use case.)
>
> I think there's value in additional methods, similar to the get...Json family, that return JS built-ins instead of JSON-serializable objects. This would allow them to return BigInts and Dates, and also represent the floating-point values NaN and Infinity in JS-native ways (instead of as strings). This should be easy to add using the same mechanism, but with slightly different conversion logic.

This makes a lot of sense! And yes, having a method or option to return JavaScript built-ins would be great. In my context, we use the library/DuckDB to process and analyze the data, but afterward, we rely on JavaScript built-ins for visualization.

> Regarding your other question: I work at MotherDuck, which is distinct from DuckDB Labs, although we collaborate closely. Node Neo is done mostly as a side project, not as part of my "day job", though of course there's some overlap. What's good for the DuckDB community is also good for MotherDuck!
>
> It got started as a collaboration with @hannes, who set the direction and laid the foundation, but since the initial prototype I've been working on it mostly solo. Note that the repo is part of the DuckDB GitHub organization, and I continue to get some advice and assistance from the kind folks at DuckDB Labs from time to time.

Very interesting! Thank you for taking the time to share this! Selfishly, I hope you'll keep maintaining this library. 😅

@jraymakers
Contributor

Filed this issue to track the idea of adding methods for JS built-ins.

I do intend to keep improving and maintaining this library for the foreseeable future. I'm glad you're finding it useful!
