Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28581: Support Partition Prunning stats optimization for Iceberg tables #5498

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

deniskuzZ
Copy link
Member

@deniskuzZ deniskuzZ commented Oct 9, 2024

What changes were proposed in this pull request?

Add support for Iceberg partition prune stats optimization

Why are the changes needed?

Performance

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

mvn test -Dtest=TestIcebergCliDriver -Dqfile=iceberg_stats_with_ppr.q -Drat.skip=true

public DummyPartition(Table tbl, String name,
Map<String, String> partSpec) {
setTable(tbl);
public DummyPartition(Table tbl, String name, Map<String, String> partSpec) throws HiveException {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reuse this object or better create another abstraction, like [Virtual/Hidden]Partition?
cc @kasakrisz

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the functionality of DummyPartition exploited? I saw that when we create DummyPartition instances we return a Partition type reference and never access any DummyPartition defined method. If this is not the case then DummyPartition is fine. The name is misleading though since these objects represent real partitions, aren't they?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use overriden getValues and getSpec. And yes, they represent real partitions (not in HMS)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that the current class hierarchy doesn't represent our needs:

  • Partition class represents a partition stored on HMS.
  • DummyPartition extends Partition so it seems like a special HMS stored partition but this is not the case.

Maybe defining a Partition interface or abstract class with a minimal contract (getName, getValues) would be better. And two class could implement/extend it: one for HMS stored partition and another for non-HMS.

It seems to be a bigger refactor because the current Partition class is widely used.

Copy link
Member Author

@deniskuzZ deniskuzZ Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DummyPartition is used to represent non-HMS partitions. Before iceberg, it had just two properties: table and name. But I agree the name is not self-explanatory

@deniskuzZ deniskuzZ changed the title Support Iceberg partition stats with PPR Support Partition Prunning stats optimization for Iceberg tables Oct 9, 2024
Comment on lines +447 to +449
public static Expression generateExpressionFromPartitionSpec(Table table, Map<String, String> partitionSpec,
boolean latestSpecOnly) throws SemanticException {
Map<String, PartitionField> partitionFieldMap = getPartitionFields(table, latestSpecOnly).stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method generateExpressionFromPartitionSpec is called from two places and the actual parameter value latestSpecOnly is always true. How about calling

icebergTable.spec().fields()

instead of

getPartitionFields(table, latestSpecOnly)

Copy link
Member Author

@deniskuzZ deniskuzZ Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's called from
IcebergTableUtil.getPartitionInfo(Table icebergTable, ... , boolean latestSpecOnly),
that is called from HiveIcebergStorageHandler.getPartitionNames(Table icebergTable, ..., boolean latestSpecOnly)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or are you saying that we could refactor it to

  public static List<PartitionField> getPartitionFields(Table table, boolean latestSpecOnly) {
    return latestSpecOnly ? table.spec().fields() :
      table.specs().values().stream()
        .flatMap(spec -> spec.fields().stream()).distinct()
        .collect(Collectors.toList());
  }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or are you saying that we could refactor it to

  public static List<PartitionField> getPartitionFields(Table table, boolean latestSpecOnly) {
    return latestSpecOnly ? table.spec().fields() :
      table.specs().values().stream()
        .flatMap(spec -> spec.fields().stream()).distinct()
        .collect(Collectors.toList());
  }

yes.

Actually at the top of the call stack, from we call this method, we already know whether we need only the current spec or all. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -166,9 +170,13 @@ public String getTabAlias() {
}

public boolean getIsVirtualCol() {
return isVirtualCol;
return isVirtualCol || isPartitionCol;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't mix virtual columns and partition columns. These are very different things. getIsVirtualCol() should depend on isVirtualCol only.

Copy link
Member Author

@deniskuzZ deniskuzZ Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, but it's completely opposite here: all partitionCol were virtual ones, I've added isPartitionCol not to exclude them from the projected list because of isVirtual marker

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ColumnPrunerProcFactory.class

     if (colInfo.getIsVirtualCol() && !colInfo.getIsPartitionCol()) {
        // part is also a virtual column, but part col should not in this
        // list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still has the impression that this change is not safe. Example:
here we create a new instance of ColumnInfo based on an existing one. If the source ColumnInfo is a partition column the field isPartitionCol is still false in the result which is not valid.

ColumnInfo newCol = new ColumnInfo(colInfo.getInternalName(), colInfo.getType(),
colInfo.getTabAlias(), colInfo.getIsVirtualCol(), colInfo.isHiddenVirtualCol());

Copy link
Member Author

@deniskuzZ deniskuzZ Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, but how else should I call this method? isHiddenPartitionCol?
We didn't pass virtual/partition columns to reader before as we new the location upfront, now we rely on iceberg.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to isHiddenPartitionCol

public DummyPartition(Table tbl, String name,
Map<String, String> partSpec) {
setTable(tbl);
public DummyPartition(Table tbl, String name, Map<String, String> partSpec) throws HiveException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the functionality of DummyPartition exploited? I saw that when we create DummyPartition instances we return a Partition type reference and never access any DummyPartition defined method. If this is not the case then DummyPartition is fine. The name is misleading though since these objects represent real partitions, aren't they?

Comment on lines -64 to +68
enabled: false
enabled: true
enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true
enabledConditionsNotMet: Could not enable vectorization due to partition column names size 1 is greater than the number of table column names size 0 IS false
inputFileFormats: org.apache.hadoop.hive.ql.io.NullRowsInputFormat
notVectorizedReason: UDTF Operator (UDTF) not supported
vectorized: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the cause of these changes? This is not an iceberg related test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was a bug in a Vectorizer, I noticed it while troubleshooting. I can move it into a different PR

Copy link

sonarcloud bot commented Oct 22, 2024

@Override
public boolean canProvidePartitionStatistics(org.apache.hadoop.hive.ql.metadata.Table hmsTable) {
Table table = IcebergTableUtil.getTable(conf, hmsTable.getTTable());
if (table.currentSnapshot() != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also check branch/tag here.

@@ -932,7 +932,7 @@ private Collection<List<ColumnStatisticsObj>> verifyAndGetPartColumnStats(
private Long getRowCnt(
ParseContext pCtx, TableScanOperator tsOp, Table tbl) throws HiveException {
Long rowCnt = 0L;
if (tbl.isPartitioned()) {
if (tbl.isPartitioned() && StatsUtils.checkCanProvidePartitionStats(tbl)) {
for (Partition part : pctx.getPrunedPartitions(
tsOp.getConf().getAlias(), tsOp).getPartitions()) {
if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(part.getTable(), part.getParameters())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StatsUtils::areBasicStatsUptoDateForQueryAnswering is not applicable to Iceberg table, and it will check table param COLUMN_STATS_ACCURATE and then determine to get stats or not. But we always get partition stats from iceberg metadata file, so COLUMN_STATS_ACCURATE should be always true.

The reason the iceberg qtest for table&partition's stats looks good is because we already set COLUMN_STATS_ACCURATE to true in hive-site.xml. But in fact, i think no users will care this param. So i think if we want to use iceberg partition stats, we should consider to remove this param.

<property>
<name>iceberg.hive.keep.stats</name>
<value>true</value>
<description>
We want we keep the stats in Hive sessions.
</description>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants