fix: keep column statistics of all NULL column #16753

dantengsky
Member

@dantengsky dantengsky commented Nov 2, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

In PR #16728, statistics for columns with unsupported data types are excluded.

However, the data type is inferred from the scalar rather than from the column's type, so an edge case arises for columns that contain only NULL values:

For such columns, both the min and max scalar values are NULL, so the inferred type fails the supported_stat_type check and their statistics are incorrectly excluded. This causes problems during pruning, because RangePruner cannot prune on these columns without statistics.
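The difference between checking the scalar's inferred type and checking the column's declared type can be sketched as follows. This is a minimal illustration, not Databend's code: the `DataType` and `Scalar` enums, `infer_data_type`, and the two `kept_by_*` helpers are hypothetical stand-ins, and the body of `supported_stat_type` is an assumed shape of the real check.

```rust
// Hypothetical, simplified stand-ins for the real Databend types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Null,
    Int32,
    Nullable(Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Null,
    Int32(i32),
}

impl Scalar {
    // Infers a type from the value alone; a NULL scalar carries no column type.
    fn infer_data_type(&self) -> DataType {
        match self {
            Scalar::Null => DataType::Null,
            Scalar::Int32(_) => DataType::Int32,
        }
    }
}

// Assumed shape of the check: strip nullability, then allow only known types.
fn supported_stat_type(ty: &DataType) -> bool {
    let inner = match ty {
        DataType::Nullable(inner) => inner.as_ref(),
        other => other,
    };
    matches!(inner, DataType::Int32)
}

// Checking the scalar's inferred type drops the statistics of an all-NULL column.
fn kept_by_scalar_type(min: &Scalar) -> bool {
    supported_stat_type(&min.infer_data_type())
}

// Checking the declared column type would keep them.
fn kept_by_column_type(column_type: &DataType) -> bool {
    supported_stat_type(column_type)
}
```

In this sketch, a nullable int column whose min is `Scalar::Null` is rejected by `kept_by_scalar_type` but accepted by `kept_by_column_type`, which mirrors the edge case described above.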

Although the table data is safe and filtering remains correct, pruning may be inefficient:

Example

Please disable the table meta cache in the query config file to reproduce this issue:

[cache]
...
enable_table_meta_cache = false

create or replace database col_stats_all_null;
use col_stats_all_null;
create or replace table t(c int);
insert into t values(NULL);

-- the segment should be pruned, but it is not
explain select * from t where c > 6;

Filter
├── output columns: [t.c (#0)]
├── filters: [is_true(t.c (#0) > 6)]
├── estimated rows: 0.00
└── TableScan
    ├── table: default.col_stats_all_null.t
    ├── output columns: [c (#0)]
    ├── read rows: 0
    ├── read size: 0
    ├── partitions total: 1
    ├── partitions scanned: 0
    ├── pruning stats: [segments: <range pruning: 1 to 1>]
    ├── push downs: [filters: [is_true(t.c (#0) > 6)], limit: NONE]
    └── estimated rows: 1.00
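
Why missing statistics block pruning for a predicate such as `c > 6` can be sketched roughly as below. This is an illustrative model only, not RangePruner's actual logic: `ColumnStats` and `may_match_gt` are hypothetical names, and a `None` bound is assumed to mean that every value in the column is NULL.

```rust
// Hypothetical min/max statistics of one column in a segment or block.
// `None` stands for a NULL bound, which here means every value is NULL.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
}

// Decides whether a segment may hold rows satisfying `col > threshold`.
// Returning `true` means the segment must be scanned.
fn may_match_gt(stats: Option<&ColumnStats>, threshold: i64) -> bool {
    match stats {
        // No statistics at all: nothing can be proven, so keep the segment.
        None => true,
        Some(s) => match (s.min, s.max) {
            // All values are NULL: `NULL > threshold` is never true.
            (None, None) => false,
            // Otherwise the segment can match only if its max exceeds the threshold.
            (_, Some(max)) => max > threshold,
            // Unexpected shape (min without max): stay conservative.
            (Some(_), None) => true,
        },
    }
}
```

With no statistics the function has to answer `true`, which corresponds to the unpruned segment in the plan above. With retained all-NULL statistics it can answer `false`.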

Changes

In this PR, column statistics whose min and max values are both NULL are retained. Since they are stored as Scalar::Null, they can be serialized and deserialized without issue.

Note:

  • Columns containing at least one non-NULL value will always have non-NULL min/max values.
  • Tweaking databend_storages_common_table_meta::meta::supported_stat_type might also work, but to minimize risk (other components also rely on it), the changes are confined to ColStatsVisitor.
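
The retention rule described in this section might look roughly like the sketch below. It is an assumption-laden illustration, not the code in ColStatsVisitor: the `Scalar` enum, `scalar_type_supported`, and `keep_column_stats` are hypothetical names introduced here.

```rust
// Hypothetical, simplified stand-in for the real scalar type.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Null,
    Int32(i32),
    // Stands for any value whose type has no statistics support.
    Unsupported,
}

// Assumed check on the type inferred from a scalar value.
fn scalar_type_supported(s: &Scalar) -> bool {
    matches!(s, Scalar::Int32(_))
}

// Retains statistics when both bounds are NULL (the all-NULL column case),
// otherwise falls back to the supported-type check on the bounds.
fn keep_column_stats(min: &Scalar, max: &Scalar) -> bool {
    let all_null = matches!((min, max), (Scalar::Null, Scalar::Null));
    all_null || (scalar_type_supported(min) && scalar_type_supported(max))
}
```

Keeping the special case next to the visitor, rather than inside the shared type check, matches the stated goal of leaving other callers of supported_stat_type unaffected.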

For tables that have already been processed (compact, insert, etc.) with PR #16728, it is safe to apply the changes in this PR:

  • Blocks newly created with this PR will keep the correct column statistics.
  • Segments and snapshots newly created with this PR, which may contain column statistics merged from legacy blocks or segments, will also be safe.

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-bugfix this PR patches a bug in codebase label Nov 2, 2024
@dantengsky dantengsky force-pushed the fix-include-col-stats-of-all-null-cols branch from 86992b7 to db30e09 Compare November 2, 2024 10:52
@dantengsky dantengsky marked this pull request as ready for review November 2, 2024 11:28
@dantengsky dantengsky added this pull request to the merge queue Nov 2, 2024
Merged via the queue into databendlabs:main with commit 1388625 Nov 2, 2024
77 checks passed
@dantengsky dantengsky deleted the fix-include-col-stats-of-all-null-cols branch November 2, 2024 12:06
dantengsky added a commit to dantengsky/fuse-query that referenced this pull request Nov 2, 2024
* fix: keep column statistics of all NULL column

* add logic test

* fix typos reported by typos-cli
dantengsky added a commit that referenced this pull request Nov 2, 2024
…16753) (#16756)

fix: keep column statistics of all NULL column (#16753)

* fix: keep column statistics of all NULL column

* add logic test

* fix typos reported by typos-cli