HIVE-28581: Support Partition Prunning stats optimization for Iceberg tables #5498

deniskuzZ · 2024-10-09T15:33:37Z

What changes were proposed in this pull request?

Add support for Iceberg partition prune stats optimization

Why are the changes needed?

Performance

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

mvn test -Dtest=TestIcebergCliDriver -Dqfile=iceberg_stats_with_ppr.q -Drat.skip=true

deniskuzZ · 2024-10-09T15:42:16Z

ql/src/java/org/apache/hadoop/hive/ql/metadata/DummyPartition.java

-  public DummyPartition(Table tbl, String name,
-      Map<String, String> partSpec) {
-    setTable(tbl);
+  public DummyPartition(Table tbl, String name, Map<String, String> partSpec) throws HiveException {


Can we reuse this object or better create another abstraction, like [Virtual/Hidden]Partition?
cc @kasakrisz

Is the functionality of DummyPartition exploited? I saw that when we create DummyPartition instances we return a Partition type reference and never access any DummyPartition defined method. If this is not the case then DummyPartition is fine. The name is misleading though since these objects represent real partitions, aren't they?

We use the overridden getValues and getSpec. And yes, they represent real partitions (just not ones stored in HMS).

Seems that the current class hierarchy doesn't represent our needs:

Partition class represents a partition stored on HMS.

DummyPartition extends Partition so it seems like a special HMS stored partition but this is not the case.

Maybe defining a Partition interface or abstract class with a minimal contract (getName, getValues) would be better. Two classes could then implement/extend it: one for HMS-stored partitions and another for non-HMS ones.

It seems to be a bigger refactor because the current Partition class is widely used.

DummyPartition is used to represent non-HMS partitions. Before Iceberg, it had just two properties: table and name. But I agree the name is not self-explanatory.
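The split suggested in this thread could look roughly like the sketch below. It is not code from the PR: `PartitionView`, `MetastorePartition` and `VirtualPartition` are hypothetical names chosen for illustration, and Hive's real `Partition` class carries far more state than shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal contract shared by every kind of partition.
interface PartitionView {
  String getName();

  List<String> getValues();
}

// Stand-in for a partition whose metadata is stored in HMS.
final class MetastorePartition implements PartitionView {
  private final String name;
  private final List<String> values;

  MetastorePartition(String name, List<String> values) {
    this.name = name;
    this.values = List.copyOf(values);
  }

  @Override
  public String getName() {
    return name;
  }

  @Override
  public List<String> getValues() {
    return values;
  }
}

// Stand-in for a partition known only to the table format (for example,
// derived from Iceberg metadata) and never registered in HMS.
final class VirtualPartition implements PartitionView {
  private final String name;
  private final Map<String, String> spec;

  VirtualPartition(String name, Map<String, String> spec) {
    this.name = name;
    // LinkedHashMap keeps the partition columns in declaration order.
    this.spec = new LinkedHashMap<>(spec);
  }

  @Override
  public String getName() {
    return name;
  }

  @Override
  public List<String> getValues() {
    return new ArrayList<>(spec.values());
  }

  Map<String, String> getSpec() {
    return new LinkedHashMap<>(spec);
  }
}
```

Callers that only need the name and values would depend on the interface, so the non-HMS case would no longer have to pretend to be an HMS partition. As noted in the thread, retrofitting this onto the widely used `Partition` class would be a larger refactor.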

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java

kasakrisz · 2024-10-21T13:09:13Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java

+  public static Expression generateExpressionFromPartitionSpec(Table table, Map<String, String> partitionSpec,
+      boolean latestSpecOnly) throws SemanticException {
+    Map<String, PartitionField> partitionFieldMap = getPartitionFields(table, latestSpecOnly).stream()


The method generateExpressionFromPartitionSpec is called from two places and the actual parameter value latestSpecOnly is always true. How about calling

icebergTable.spec().fields()

instead of

getPartitionFields(table, latestSpecOnly)

it's called from
IcebergTableUtil.getPartitionInfo(Table icebergTable, ... , boolean latestSpecOnly),
that is called from HiveIcebergStorageHandler.getPartitionNames(Table icebergTable, ..., boolean latestSpecOnly)

or are you saying that we could refactor it to

public static List<PartitionField> getPartitionFields(Table table, boolean latestSpecOnly) {
  return latestSpecOnly ? table.spec().fields() :
      table.specs().values().stream()
          .flatMap(spec -> spec.fields().stream()).distinct()
          .collect(Collectors.toList());
}

yes.

Actually, at the top of the call stack, where we call this method from, we already know whether we need only the current spec or all of them. :)
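To make the difference between the two branches concrete, here is a small self-contained sketch. It does not use the Iceberg API: partition fields are reduced to plain strings, the map of field lists keyed by spec id is a stand-in for the table's spec history, and `PartitionFieldsSketch` is a hypothetical name.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class PartitionFieldsSketch {
  private PartitionFieldsSketch() {
  }

  // currentSpecFields: fields of the table's current partition spec.
  // fieldsBySpecId: fields of every spec the table has used, keyed by spec id.
  static List<String> partitionFields(List<String> currentSpecFields,
      Map<Integer, List<String>> fieldsBySpecId, boolean latestSpecOnly) {
    if (latestSpecOnly) {
      return currentSpecFields;
    }
    // Union over all specs; distinct() drops fields shared between specs.
    return fieldsBySpecId.values().stream()
        .flatMap(List::stream)
        .distinct()
        .collect(Collectors.toList());
  }
}
```

With a table that was first partitioned by (region, day) and later by (region, month), the current-spec branch yields only region and month, while the all-specs branch yields region, day and month.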

kasakrisz · 2024-10-21T13:13:40Z

ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnInfo.java

@@ -166,9 +170,13 @@ public String getTabAlias() {
  }

  public boolean getIsVirtualCol() {
-    return isVirtualCol;
+    return isVirtualCol || isPartitionCol;


Please don't mix virtual columns and partition columns. These are very different things. getIsVirtualCol() should depend on isVirtualCol only.

I know, but it's the opposite here: all partition columns were already flagged as virtual. I've added isPartitionCol so that they are not excluded from the projected list because of the isVirtual marker.

ColumnPrunerProcFactory.class

if (colInfo.getIsVirtualCol() && !colInfo.getIsPartitionCol()) {
  // part is also a virtual column, but part col should not in this
  // list.

I still have the impression that this change is not safe. Example:
here we create a new instance of ColumnInfo based on an existing one. If the source ColumnInfo is a partition column, the field isPartitionCol is still false in the result, which is not valid.

hive/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java

Lines 196 to 197 in b2a5933

ColumnInfo newCol = new ColumnInfo(colInfo.getInternalName(), colInfo.getType(),
    colInfo.getTabAlias(), colInfo.getIsVirtualCol(), colInfo.isHiddenVirtualCol());

True, but what else should I call this method? isHiddenPartitionCol?
We didn't pass virtual/partition columns to the reader before, as we knew the location upfront; now we rely on Iceberg.

renamed to isHiddenPartitionCol
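The copy problem raised in this thread can be reproduced with a stripped-down model. The class below is a hypothetical stand-in, not Hive's `ColumnInfo`: it keeps only the two flags under discussion and uses the getter variant from the diff, where a partition column also reports as virtual.

```java
// Hypothetical, simplified stand-in for a column descriptor with two flags.
final class ColumnFlagsSketch {
  private final String internalName;
  private final boolean virtualCol;
  private final boolean partitionCol;

  ColumnFlagsSketch(String internalName, boolean virtualCol, boolean partitionCol) {
    this.internalName = internalName;
    this.virtualCol = virtualCol;
    this.partitionCol = partitionCol;
  }

  // Getter variant from the diff: a partition column also counts as virtual.
  boolean isVirtualCol() {
    return virtualCol || partitionCol;
  }

  boolean isPartitionCol() {
    return partitionCol;
  }

  // Copy that forwards only the virtual flag, like a constructor call that
  // has no parameter for the partition flag. The partition marker is lost.
  ColumnFlagsSketch copyVirtualFlagOnly() {
    return new ColumnFlagsSketch(internalName, isVirtualCol(), false);
  }

  // Copy that forwards both flags unchanged.
  ColumnFlagsSketch copyAllFlags() {
    return new ColumnFlagsSketch(internalName, virtualCol, partitionCol);
  }
}
```

After `copyVirtualFlagOnly()`, a partition column looks like a plain virtual column, so a check of the form `getIsVirtualCol() && !getIsPartitionCol()` would treat the copy differently from its source.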

kasakrisz · 2024-10-21T13:19:40Z

ql/src/java/org/apache/hadoop/hive/ql/metadata/DummyPartition.java

-  public DummyPartition(Table tbl, String name,
-      Map<String, String> partSpec) {
-    setTable(tbl);
+  public DummyPartition(Table tbl, String name, Map<String, String> partSpec) throws HiveException {


Is the functionality of DummyPartition exploited? I saw that when we create DummyPartition instances we return a Partition type reference and never access any DummyPartition defined method. If this is not the case then DummyPartition is fine. The name is misleading though since these objects represent real partitions, aren't they?

kasakrisz · 2024-10-21T13:32:04Z

ql/src/test/results/clientpositive/llap/vector_bucket.q.out

-                enabled: false
+                enabled: true
                enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
-                enabledConditionsNotMet: Could not enable vectorization due to partition column names size 1 is greater than the number of table column names size 0 IS false
                inputFileFormats: org.apache.hadoop.hive.ql.io.NullRowsInputFormat
+                notVectorizedReason: UDTF Operator (UDTF) not supported
+                vectorized: false


What is the cause of these changes? This is not an iceberg related test.

There was a bug in the Vectorizer; I noticed it while troubleshooting. I can move it into a different PR.

sonarcloud · 2024-10-22T12:34:32Z

Quality Gate passed

Issues
18 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

zhangbutao · 2024-10-23T04:25:11Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java

+  @Override
+  public boolean canProvidePartitionStatistics(org.apache.hadoop.hive.ql.metadata.Table hmsTable) {
+    Table table = IcebergTableUtil.getTable(conf, hmsTable.getTTable());
+    if (table.currentSnapshot() != null) {


We can also check branch/tag here.

zhangbutao · 2024-10-23T04:39:13Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java

@@ -932,7 +932,7 @@ private Collection<List<ColumnStatisticsObj>> verifyAndGetPartColumnStats(
    private Long getRowCnt(
        ParseContext pCtx, TableScanOperator tsOp, Table tbl) throws HiveException {
      Long rowCnt = 0L;
-      if (tbl.isPartitioned()) {
+      if (tbl.isPartitioned() && StatsUtils.checkCanProvidePartitionStats(tbl)) {
        for (Partition part : pctx.getPrunedPartitions(
            tsOp.getConf().getAlias(), tsOp).getPartitions()) {
          if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(part.getTable(), part.getParameters())) {


StatsUtils::areBasicStatsUptoDateForQueryAnswering is not applicable to Iceberg tables: it checks the table param COLUMN_STATS_ACCURATE and then decides whether to use the stats. But we always get partition stats from the Iceberg metadata file, so COLUMN_STATS_ACCURATE should always be treated as true.

The reason the Iceberg qtests for table and partition stats look good is that we already set COLUMN_STATS_ACCURATE to true in hive-site.xml. But in fact, I think no users will care about this param. So if we want to use Iceberg partition stats, we should consider removing this param.

hive/data/conf/hive-site.xml

Lines 334 to 339 in 48a67a4

<property>
  <name>iceberg.hive.keep.stats</name>
  <value>true</value>
  <description>
    We want we keep the stats in Hive sessions.
  </description>
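One way to read this suggestion is as a guard that skips the COLUMN_STATS_ACCURATE check whenever the table format itself supplies the statistics. The sketch below is only an illustration under that reading: `StatsFreshnessSketch` and `canAnswerFromStats` are hypothetical names, the boolean stands in for whatever check decides that stats come from Iceberg metadata, and the substring match assumes the parameter holds a JSON value containing `"BASIC_STATS":"true"`, which is a simplification of how Hive parses it.

```java
import java.util.Map;

final class StatsFreshnessSketch {
  private StatsFreshnessSketch() {
  }

  // statsFromTableFormat: true when stats are read from the table format's
  // own metadata (assumed always current), so no accuracy marker is needed.
  static boolean canAnswerFromStats(boolean statsFromTableFormat, Map<String, String> params) {
    if (statsFromTableFormat) {
      return true;
    }
    // Fallback for other tables: require the accuracy marker.
    // Simplified check, assuming a JSON value such as {"BASIC_STATS":"true"}.
    String accurate = params.get("COLUMN_STATS_ACCURATE");
    return accurate != null && accurate.contains("\"BASIC_STATS\":\"true\"");
  }
}
```

With a guard like this, the result for an Iceberg table would no longer depend on a test-only setting such as iceberg.hive.keep.stats.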

Support Iceberg partition stats with PPR

60e821a

github-actions bot requested a review from miklosgergely October 9, 2024 15:34

deniskuzZ mentioned this pull request Oct 9, 2024

HIVE-28268: Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false #5215

Open

deniskuzZ changed the title ~~Support Iceberg partition stats with PPR~~ Support Partition Prunning stats optimization for Iceberg tables Oct 9, 2024

support only hive catalog

b4671b0

deniskuzZ force-pushed the iceberg_part_stats_with_ppr branch from f2e7ab6 to 4b0f6e9 Compare October 11, 2024 08:48

deniskuzZ force-pushed the iceberg_part_stats_with_ppr branch from 4b0f6e9 to 68f1a7a Compare October 11, 2024 09:01

fix bug introduced by HIVE-25457

29d91d4

deniskuzZ force-pushed the iceberg_part_stats_with_ppr branch from 4d940c2 to 29d91d4 Compare October 20, 2024 18:01

qtests

f27a60d

kasakrisz reviewed Oct 21, 2024

refactor

a791d29

deniskuzZ force-pushed the iceberg_part_stats_with_ppr branch from 60840a7 to a791d29 Compare October 21, 2024 19:51

review comments

858419a

zhangbutao reviewed Oct 23, 2024
