SNOW-1730905: Support retrieving empty Arrow result #1219

Closed
datbth opened this issue Oct 10, 2024 · 14 comments
Labels: question (Issue is a usage/other question rather than a bug), status-triage_done (Initial triage done, will be further handled by the driver team), wontfix

Comments

@datbth

datbth commented Oct 10, 2024

What is the current behavior?

When a query returns an empty result (e.g. SELECT 1 WHERE 1 = 0), GetArrowBatches returns an empty array.

What is the desired behavior?

An empty result is still a valid result, so GetArrowBatches should return an empty Arrow Record.

Otherwise, downstream processing has to treat empty results as errors, which does not make sense.

How would this improve gosnowflake?

Allow downstream processing/usage to consume empty results properly.

References, Other Background

This could be a server-side issue if the server does not return any "chunks".
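
For illustration, a minimal sketch of the flow in question, modeled on the driver's WithArrowBatches usage (the DSN below is a placeholder and error handling is abbreviated):

package main

import (
    "context"
    "database/sql"
    "database/sql/driver"
    "fmt"
    "log"

    sf "github.com/snowflakedb/gosnowflake"
)

func main() {
    // Placeholder DSN; replace with real credentials.
    db, err := sql.Open("snowflake", "user:password@account/db/schema?warehouse=wh")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Ask the driver to return Arrow batches instead of row-based results.
    ctx := sf.WithArrowBatches(context.Background())

    conn, err := db.Conn(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    err = conn.Raw(func(x interface{}) error {
        rows, err := x.(driver.QueryerContext).QueryContext(ctx, "SELECT 1 WHERE 1 = 0", nil)
        if err != nil {
            return err
        }
        defer rows.Close()

        batches, err := rows.(sf.SnowflakeRows).GetArrowBatches()
        if err != nil {
            return err
        }
        // Current behavior: len(batches) == 0, so there is no record and no schema.
        // Desired behavior: one record carrying the result schema and zero rows.
        fmt.Println("number of batches:", len(batches))
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
}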

@github-actions github-actions bot changed the title Support retrieving empty Arrow result SNOW-1730905: Support retrieving empty Arrow result Oct 10, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Oct 11, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added the status-triage Issue is under initial triage label Oct 11, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added enhancement The issue is a request for improvement or a new feature status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage labels Oct 11, 2024
@sfc-gh-dszmolka
Contributor

Hi, and thank you for raising this requirement. To me it makes sense indeed: being able to handle results the same way regardless of whether they are empty or non-empty. We'll look into this (no timeline estimated).

@sfc-gh-dszmolka
Contributor

So, I had a word with the team and, for now, we won't change our approach, mostly because if there's no resultset, then there's no schema to create the Arrow record with.

So if your code seems to need error handling when a resultset is empty (the result being an empty array rather than an arrow.Record), perhaps changing the processing approach could help the fastest. Here's some pseudocode:

batches, err := result.GetArrowBatches()
if err != nil {
    // handle error
}
for _, batch := range batches {
    records, err := batch.Fetch() // download this batch and convert it to Arrow records
    if err != nil {
        // handle error
    }
    for _, record := range *records {
        // do something with record; this loop body simply never runs when there are no records
    }
}

And here's an actually working example of how to process Arrow batches with GetArrowBatches.
If you modify the generator to GENERATOR(ROWCOUNT=>0), you can emulate an empty resultset and observe that the code still runs to completion, since it simply iterates over however many records are in each batch.

Hope this helps.

@sfc-gh-dszmolka sfc-gh-dszmolka added question Issue is a usage/other question rather than a bug wontfix and removed enhancement The issue is a request for improvement or a new feature labels Oct 11, 2024
@datbth
Author

datbth commented Oct 11, 2024

Mostly because if there's no resultset, then there's no schema to create the Arrow record with.

I think this is more of an implementation issue. Any schemaful database would output empty results with a proper schema.
Here is an empty result in the Snowflake console:
[screenshot: an empty result in the Snowflake console]
When using the Snowflake clients, the JSON response of an empty result also has a schema.

So if your code seems to need error handling when a resultset is empty

It's because my application needs to do downstream processing. It should be able to process any valid resultset as a normal relation/table/Arrow record. It expects to process 1 resultset (empty or not), not 0 resultsets.
Otherwise, for example, would you output a blank CSV or a Parquet file with no schema when downloading from an empty Snowflake table?

I can understand if you say it's a technical limitation/oversight of the API (on either the server side or the client side).
But I really doubt your database team would say it is "by design" that "an empty Arrow result does not have a schema".

@sfc-gh-dszmolka
Contributor

@sfc-gh-pfus do you have any insight that might be worth sharing here?

@datbth
Author

datbth commented Oct 11, 2024

I just tried the Python connector and it outputs a proper empty result:

import snowflake.connector

ctx = snowflake.connector.connect(
    user=user,
    password=password,
    account=account,
    warehouse=warehouse,
    database=database,
    login_timeout=3,
)
cur = ctx.cursor()
sql = "select 1 a where 1 = 0"
cur.execute(sql)
df = cur.fetch_pandas_all()

print(df.to_string())

print("--\n")

print("num columns: " + str(len(df.columns)))
print("size:" + str(df.size))

print("-----\n\n")

import pyarrow as pa
table = pa.Table.from_pandas(df)
print(table)

Output:

Empty DataFrame
Columns: [A]
Index: []
--

num columns: 1
size:0
-----


pyarrow.Table
A: int64
----
A: [[]]

@sfc-gh-pfus
Collaborator

Hi @datbth! Let me summarize a couple of things.

  1. "Any schemaful database would output empty results with proper schema." I agree with that. But don't look at Arrow batches as a standard SQL interface. If you want SQL, use the standard Go SQL approach. If you use Arrow batches, please keep in mind that this is no longer standard SQL; in particular, you don't retrieve schema+rows, but Arrow results (self-contained, with a schema if any). Using SQL query syntax here is only a convenience; the rest is not SQL-like.
  2. The most important reason: Snowflake performs some optimizations on each micropartition (source). Because of that, each Arrow record can carry a slightly different schema (based on what values exist in a specific micropartition). This is also the reason why we don't want to produce a schema for empty records: it could potentially vary. What you showed in Snowsight is, again, the row-based approach. If you describe such a query, you can see, for instance, that SELECT 1, 256, 32768 produces three INTEGER database columns, but in Arrow they may be described as int8, int16 and int32 columns. Another example: if you have a table with one INTEGER database column containing values from 1 to, say, 1000000 (incrementing by one) and you observe the Arrow batch schemas, you may see that the first record has int8 (supposing the first batch contains only values below 128), then a couple of records with int16 (up to the int16 limit), and only then int32 records. I hope that with this knowledge I have convinced you why we don't want to create an empty record.
  3. I'm far from being a Python driver expert, as it is managed by another team, but it seems to me that we are comparing different things. Based on what you showed, in Python you first create a pandas DataFrame, which is schema-aware, and then convert it to Arrow. In Go we don't have such a step: we just read the native Arrow stream and return it from the driver directly. Still, I don't know this technology stack, so I'm not sure I'm fully right. By the way, the schema described in pyarrow is also interesting: why is it int64 and not, for instance, int8? If you do just SELECT 1, you should retrieve an int8 column, or at least that is what is returned from the backend.
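
As an illustration of point 2, a minimal sketch (assuming batches obtained via GetArrowBatches and the gosnowflake ArrowBatch API) of how the per-record schema variation could be observed:

package example

import (
    "fmt"
    "log"

    sf "github.com/snowflakedb/gosnowflake"
)

// printBatchSchemas prints the Arrow schema of every record in every batch,
// making any per-micropartition type variation visible.
func printBatchSchemas(batches []*sf.ArrowBatch) {
    for i, batch := range batches {
        records, err := batch.Fetch()
        if err != nil {
            log.Fatal(err)
        }
        for j, record := range *records {
            // An INTEGER column may appear as int8/int16/int32/int64 (or a
            // decimal type) depending on the values stored in the micropartition.
            fmt.Printf("batch %d, record %d: %s\n", i, j, record.Schema())
            record.Release()
        }
    }
}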

@datbth
Author

datbth commented Oct 15, 2024

Thank you for the response.

If you use Arrow batches, please have in mind, that this is not a standard SQL anymore.

My point was to explain that my query returns a schemaful relation, so I expect to be able to retrieve that schema, especially when Arrow is also schemaful.

  • Data structure-wise, an Arrow record can still have zero rows, so it is still possible to return 1 result with 0 rows instead of 0 results.
  • Usage-wise, skipping the schema affects all downstream usages that rely on the schema.
    • For me, I'm working on a Business Intelligence platform. Here are a few use cases:
      • Export data to CSV, Excel, HTML tables, Google Sheets, etc. These all expect to see empty result tables with proper column names, not just "nothing".
        • The exported data can be further processed by our users, who would expect to be able to use the schema as long as their queries are correct. Imagine having an hourly data processing pipeline that suddenly breaks when there is no data for one hour.
      • Data caching
        • It becomes painful when suddenly all the post-processing steps have to handle a "no-schema" flow.
        • Most importantly, I'm caching data in a schemaful data storage. All schemaful data storages, even Parquet files, require a schema.
        • It is simply not reasonable to have to deal with such a "no-schema" flow.

I'm trying to use Arrow to benefit from its performance advantages (minimal serialization cost, columnar processing, etc.).
I think it's reasonable to expect that, from the client's perspective, Arrow is an interchangeable schemaful data format.

If you have a table which contains one INTEGER database type, and it contains values from 1 to let's say 1000000 (incrementing by one), and you observe Arrow batches schema, you can see that for instance first record may have int8 (supposing that first batch contains only values below 128), then a couple of records containing int16 (up to the int16 limit) and only then you will have int32 records.

I agree that this makes some sense. But then how about using the minimal data type (e.g. int8 in your example) when the result table is empty?

Based on what you showed it seems that in Python you firstly create pandas, which is schema aware, and then you convert it to arrow

I also don't know whether it is possible to retrieve Arrow records directly or not. But at least the library provides a natively supported way to retrieve the schema. Could you try discussing this matter with other client teams or the database team?


For now, I'm able to construct the Arrow schema from the Snowflake result schema myself, so this issue is not urgent for me.
However, I hope you agree that having a schema is important (for further processing) and that the need for the schema is valid.
The technical limitation/optimization you described is a valid reason to me, but I also hope it can be improved.
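
A minimal sketch of that workaround: deriving an Arrow schema from the database/sql column metadata when no Arrow batches come back. The type mapping below is illustrative and incomplete, and the precision/scale handling is an assumption:

package arrowschema

import (
    "database/sql"

    "github.com/apache/arrow/go/v15/arrow" // module version should match the one used by gosnowflake
)

// schemaFromColumnTypes builds an Arrow schema from result-set column metadata
// as a fallback for empty results. The mapping is illustrative, not exhaustive.
func schemaFromColumnTypes(cols []*sql.ColumnType) *arrow.Schema {
    fields := make([]arrow.Field, 0, len(cols))
    for _, c := range cols {
        var dt arrow.DataType
        switch c.DatabaseTypeName() {
        case "FIXED":
            precision, scale, ok := c.DecimalSize()
            if !ok {
                precision, scale = 38, 0
            }
            dt = &arrow.Decimal128Type{Precision: int32(precision), Scale: int32(scale)}
        case "REAL":
            dt = arrow.PrimitiveTypes.Float64
        case "BOOLEAN":
            dt = arrow.FixedWidthTypes.Boolean
        default: // TEXT and anything unmapped
            dt = arrow.BinaryTypes.String
        }
        nullable, _ := c.Nullable()
        fields = append(fields, arrow.Field{Name: c.Name(), Type: dt, Nullable: nullable})
    }
    return arrow.NewSchema(fields, nil)
}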

@sfc-gh-pfus
Collaborator

I see your point, @datbth. Nevertheless, we currently have 7 native drivers. We try to minimize code duplication and keep consistency across our codebases, especially if something can be implemented on the backend side. The second big point is that we don't create Arrow data in the driver; we just propagate what the backend sends us, and we don't want to change that approach (we have had this discussion more than once before). My suggestion is to work with your AM/SE and file a backend request to fill in the schema on that end, and hopefully someone will implement it :) 🤞 🤞 🤞

@datbth
Author

datbth commented Oct 15, 2024

My suggestion is to work with your AM/SE and file a backend request to fill the schema on that end and hopefully someone will implement it

I see. I was hoping this would get propagated to your backend team.
But I understand; I will try to file a request when I get the time.

Thank you.

@sfc-gh-pfus
Collaborator

I think it may be faster if it is an external requirement :)

Can we close the issue now?

@datbth
Author

datbth commented Oct 15, 2024

Yeah ok. Thanks

@datbth
Author

datbth commented Oct 15, 2024

If you have a table which contains one INTEGER database type, and it contains values from 1 to let's say 1000000 (incrementing by one), and you observe Arrow batches schema, you can see that for instance first record may have int8 (supposing that first batch contains only values below 128), then a couple of records containing int16 (up to the int16 limit) and only then you will have int32 records.

Such varying schemas in a single resultset would be hard to deal with.
And to produce such varying schemas, the server side would have to compute the min/max of each column of each batch, which can be costly.

I just went ahead and tried this query:

select 1 FROM table (generator(rowcount => 10000))
union all
select 123456789123456789.123456789 FROM table (generator(rowcount => 10000))

It yields 2 batches, containing 8192 rows (in 1 Arrow Record) and 11808 rows (in 2 Arrow Records) respectively, but both batches (all Arrow Records) have this same schema:

arrow.Decimal128Type {Precision: 27, Scale: 9}

Tried another one:

select 1 FROM table (generator(rowcount => 10000))
union all
select 123456789123456789 FROM table (generator(rowcount => 10000))

and also found that all the resulting Arrow Records have this same schema:

arrow.Int64Type{}

So I think there has been a mistake in your statement.

@sfc-gh-pfus
Collaborator

My guess is that it is because you just select static data, but I'm not 100% sure. I think the server does not compute it at select time but at storage time; this is an optimization to save disk space. I'm not an expert on how the backend compresses integers or what the rules behind it are. I only know that we have already had customers who were unhappy with this compression and schema variation, and there is an ongoing task to implement a parameter to disable this behaviour on the backend side.

@datbth
Author

datbth commented Oct 16, 2024

Thank you for the answer.

there is ongoing task to implement a parameter to disable this behaviour on the backend side

Is there any way I can get notified when such a parameter is released?
