-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28581: Support Partition Prunning stats optimization for Iceberg tables #5498
base: master
Are you sure you want to change the base?
Conversation
public DummyPartition(Table tbl, String name, | ||
Map<String, String> partSpec) { | ||
setTable(tbl); | ||
public DummyPartition(Table tbl, String name, Map<String, String> partSpec) throws HiveException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reuse this object or better create another abstraction, like [Virtual/Hidden]Partition?
cc @kasakrisz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the functionality of DummyPartition
exploited? I saw that when we create DummyPartition
instances we return a Partition
type reference and never access any DummyPartition
defined method. If this is not the case then DummyPartition
is fine. The name is misleading though since these objects represent real partitions, aren't they?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use overriden getValues
and getSpec
. And yes, they represent real partitions (not in HMS)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that the current class hierarchy doesn't represent our needs:
Partition
class represents a partition stored on HMS.DummyPartition
extendsPartition
so it seems like a special HMS stored partition but this is not the case.
Maybe defining a Partition interface or abstract class with a minimal contract (getName, getValues) would be better. And two class could implement/extend it: one for HMS stored partition and another for non-HMS.
It seems to be a bigger refactor because the current Partition
class is widely used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DummyPartition
is used to represent non-HMS partitions. Before iceberg, it had just two properties: table
and name
. But I agree the name is not self-explanatory
f2e7ab6
to
4b0f6e9
Compare
4b0f6e9
to
68f1a7a
Compare
4d940c2
to
29d91d4
Compare
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
Outdated
Show resolved
Hide resolved
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
Outdated
Show resolved
Hide resolved
public static Expression generateExpressionFromPartitionSpec(Table table, Map<String, String> partitionSpec, | ||
boolean latestSpecOnly) throws SemanticException { | ||
Map<String, PartitionField> partitionFieldMap = getPartitionFields(table, latestSpecOnly).stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method generateExpressionFromPartitionSpec
is called from two places and the actual parameter value latestSpecOnly
is always true
. How about calling
icebergTable.spec().fields()
instead of
getPartitionFields(table, latestSpecOnly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's called from
IcebergTableUtil.getPartitionInfo(Table icebergTable, ... , boolean latestSpecOnly)
,
that is called from HiveIcebergStorageHandler.getPartitionNames(Table icebergTable, ..., boolean latestSpecOnly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or are you saying that we could refactor it to
public static List<PartitionField> getPartitionFields(Table table, boolean latestSpecOnly) {
return latestSpecOnly ? table.spec().fields() :
table.specs().values().stream()
.flatMap(spec -> spec.fields().stream()).distinct()
.collect(Collectors.toList());
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or are you saying that we could refactor it to
public static List<PartitionField> getPartitionFields(Table table, boolean latestSpecOnly) { return latestSpecOnly ? table.spec().fields() : table.specs().values().stream() .flatMap(spec -> spec.fields().stream()).distinct() .collect(Collectors.toList()); }
yes.
Actually at the top of the call stack, from we call this method, we already know whether we need only the current spec or all. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -166,9 +170,13 @@ public String getTabAlias() { | |||
} | |||
|
|||
public boolean getIsVirtualCol() { | |||
return isVirtualCol; | |||
return isVirtualCol || isPartitionCol; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't mix virtual columns and partition columns. These are very different things. getIsVirtualCol()
should depend on isVirtualCol
only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, but it's completely opposite here: all partitionCol were virtual ones, I've added isPartitionCol not to exclude them from the projected list because of isVirtual
marker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ColumnPrunerProcFactory.class
if (colInfo.getIsVirtualCol() && !colInfo.getIsPartitionCol()) {
// part is also a virtual column, but part col should not in this
// list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still has the impression that this change is not safe. Example:
here we create a new instance of ColumnInfo
based on an existing one. If the source ColumnInfo
is a partition column the field isPartitionCol
is still false
in the result which is not valid.
hive/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
Lines 196 to 197 in b2a5933
ColumnInfo newCol = new ColumnInfo(colInfo.getInternalName(), colInfo.getType(), | |
colInfo.getTabAlias(), colInfo.getIsVirtualCol(), colInfo.isHiddenVirtualCol()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, but how else should I call this method? isHiddenPartitionCol
?
We didn't pass virtual/partition columns to reader before as we new the location upfront, now we rely on iceberg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed to isHiddenPartitionCol
public DummyPartition(Table tbl, String name, | ||
Map<String, String> partSpec) { | ||
setTable(tbl); | ||
public DummyPartition(Table tbl, String name, Map<String, String> partSpec) throws HiveException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the functionality of DummyPartition
exploited? I saw that when we create DummyPartition
instances we return a Partition
type reference and never access any DummyPartition
defined method. If this is not the case then DummyPartition
is fine. The name is misleading though since these objects represent real partitions, aren't they?
enabled: false | ||
enabled: true | ||
enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true | ||
enabledConditionsNotMet: Could not enable vectorization due to partition column names size 1 is greater than the number of table column names size 0 IS false | ||
inputFileFormats: org.apache.hadoop.hive.ql.io.NullRowsInputFormat | ||
notVectorizedReason: UDTF Operator (UDTF) not supported | ||
vectorized: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the cause of these changes? This is not an iceberg related test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there was a bug in a Vectorizer, I noticed it while troubleshooting. I can move it into a different PR
60840a7
to
a791d29
Compare
Quality Gate passedIssues Measures |
@Override | ||
public boolean canProvidePartitionStatistics(org.apache.hadoop.hive.ql.metadata.Table hmsTable) { | ||
Table table = IcebergTableUtil.getTable(conf, hmsTable.getTTable()); | ||
if (table.currentSnapshot() != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also check branch/tag here.
@@ -932,7 +932,7 @@ private Collection<List<ColumnStatisticsObj>> verifyAndGetPartColumnStats( | |||
private Long getRowCnt( | |||
ParseContext pCtx, TableScanOperator tsOp, Table tbl) throws HiveException { | |||
Long rowCnt = 0L; | |||
if (tbl.isPartitioned()) { | |||
if (tbl.isPartitioned() && StatsUtils.checkCanProvidePartitionStats(tbl)) { | |||
for (Partition part : pctx.getPrunedPartitions( | |||
tsOp.getConf().getAlias(), tsOp).getPartitions()) { | |||
if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(part.getTable(), part.getParameters())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StatsUtils::areBasicStatsUptoDateForQueryAnswering
is not applicable to Iceberg table, and it will check table param COLUMN_STATS_ACCURATE
and then determine to get stats or not. But we always get partition stats from iceberg metadata file, so COLUMN_STATS_ACCURATE
should be always true
.
The reason the iceberg qtest for table&partition's stats looks good is because we already set COLUMN_STATS_ACCURATE
to true in hive-site.xml. But in fact, i think no users will care this param. So i think if we want to use iceberg partition stats, we should consider to remove this param.
Lines 334 to 339 in 48a67a4
<property> | |
<name>iceberg.hive.keep.stats</name> | |
<value>true</value> | |
<description> | |
We want we keep the stats in Hive sessions. | |
</description> |
What changes were proposed in this pull request?
Add support for Iceberg partition prune stats optimization
Why are the changes needed?
Performance
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
mvn test -Dtest=TestIcebergCliDriver -Dqfile=iceberg_stats_with_ppr.q -Drat.skip=true