-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORC-1361: InvalidProtocolBufferException when reading large stripe statistics #1402
Conversation
…atistics Catch `InvalidProtocolBufferException` when parsing stripe statistics, log a warning message, and return an empty list. In some cases ORC files may be created with very large Metadata section due to stripe statistics. The bigger the ORC file gets the easier is to endup with a large Metadata section. Nevertheless, it is still possible to hit the problem with smaller ORC files and `TestOrcWithLargeStripeStatistics` demonstrates some extremes. Any attempt to read back the stripe statistics from the file will fail with `InvalidProtocolBufferException`. The exact exception may differ slighly and depending on: a) the size of stripe statistics; b) protobuf size limit. Stripe statistics less than 2GB, and protobuf limit less than stats size: ``` com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. ``` Stripe statistics greater than 2GB, and protobuf limit 2GB (`Integer.MAX_VALUE`): ``` com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length. ``` The `Protocol message was too large` problem could be alleviated by increasing the limit as it was done in the past. However, increasing the limit would hide the underlying problem and probably lead to permanent metadata corruption in the near future. Moreover, the limit is already high enough (1GB) and bumping it further would lead to more memory being used potentially triggering `OutOfMemoryError`, OOM Killer, GC pauses, and other problems that are usually harder to debug and find the root cause. When the stripe statistics exceeds the 2GB there is nothing to be done to parse back the statistics since protobuf cannot deserialize such messages (protocolbuffers/protobuf#11729). The problem cannot be solved unless the metadata storage changes but this can only happen in newer versions. On the other hand, stripe statistics are important but not vital for readers. In those situations where limits are breached it is acceptable to log a warning and return nothing instead of raising a fatal error to the caller. Existing unit tests plus new tests added in TestOrcWithLargeStripeStatistics.
InStream.createCodedInputStream(stream)); | ||
return meta.getStripeStatsList(); | ||
} catch (InvalidProtocolBufferException e) { | ||
LOG.warn("Failed to parse stripe statistics; check ORC-1361 for more details.", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message could be wrong because InvalidProtocolBufferException
can happen in another cases, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
From protobuf, there are many other cases here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first part of the message (Failed to parse stripe statistics
) is accurate. The second part indeed can be misleading in some cases.
Instead of pointing to ORC-1361 what do you think of creating a page/section in the ORC website and document there whatever is needed? Then the message can contain the URL to this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the exception message provide detail information? It yes, we may not need to state check ORC-1361 for more details
explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we just need to change the title to catch InvalidProtocolBufferException
and let the program continue to run.
Reading large bars of statistical information is one of the causes of the exception. But throw any InvalidProtocolBufferException and this pr will allow the program to continue to run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can keep running, but need to let the user know that we skipped these statistics. This can be achieved by catching exceptions, or requiring the user to configure a parameter such as skipCheckStatistics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The details in the exception are probably enough for the user to understand that the stats are big/corrupted. What may be more difficult to understand is why the stats are big or why they were corrupted and what can they do about it.
Providing some tips like check the strip size, number of columns, size of string stats, total size of the file, etc., should be useful and that's what I was thinking that could go in ORC-1361 or another resource.
I am fine with whatever you decide but I would strongly suggest to LOG at least the exception that was caught.
Let me know what you prefer and I will do the appropriate changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the content of the logging message is somewhat blocking this PR from being merged. I think this is a rather minor issue so how about keeping it dead simple (Failed to parse stripe statistics
) and move the PR forward? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the content of the logging message is somewhat blocking this PR from being merged. I think this is a rather minor issue so how about keeping it dead simple (
Failed to parse stripe statistics
) and move the PR forward? WDYT?
I'm +1 on this. Thanks @zabetak
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplified message in 55eebc6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making a PR, @zabetak . Do you think 2GB limitation exists in C++ code path?
cc @williamhyun , @wgtmac , @stiga-huang , @coderex2522 .
I think so. According to my experience, large message cannot be decoded with following warning:
One has to workaround it manually by setting a larger limit. |
I agree to merge the PR although the PR does not directly solve the problem of large statistical information. |
@deshanxiao It doesn't solve the problem cause there is no a real fix :/ . The metadata structure has to change if we want to allow large metadata but this is relevant only for new versions of the ORC format. If there is enough interest we can try to brainstorm under #1404 or another place. |
I see. +1 for the PR. Thank you @zabetak |
I just submitted it. Thanks all! |
…atistics ### What changes were proposed in this pull request? Catch `InvalidProtocolBufferException` when parsing stripe statistics, log a warning message, and return an empty list. ### Why are the changes needed? In some cases ORC files may be created with very large Metadata section due to stripe statistics. The bigger the ORC file gets the easier is to endup with a large Metadata section. Nevertheless, it is still possible to hit the problem with smaller ORC files and `TestOrcWithLargeStripeStatistics` demonstrates some extremes. Any attempt to read back the stripe statistics from the file will fail with `InvalidProtocolBufferException`. The exact exception may differ slighly and depending on: a) the size of stripe statistics; b) protobuf size limit. Stripe statistics less than 2GB, and protobuf limit less than stats size: ``` com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. ``` Stripe statistics greater than 2GB, and protobuf limit 2GB (`Integer.MAX_VALUE`): ``` com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length. ``` The `Protocol message was too large` problem could be alleviated by increasing the limit as it was done in the past. However, increasing the limit would hide the underlying problem and probably lead to permanent metadata corruption in the near future. Moreover, the limit is already high enough (1GB) and bumping it further would lead to more memory being used potentially triggering `OutOfMemoryError`, OOM Killer, GC pauses, and other problems that are usually harder to debug and find the root cause. When the stripe statistics exceeds the 2GB there is nothing to be done to parse back the statistics since protobuf cannot deserialize such messages (protocolbuffers/protobuf#11729). The problem cannot be solved unless the metadata storage changes but this can only happen in newer versions. On the other hand, stripe statistics are important but not vital for readers. In those situations where limits are breached it is acceptable to log a warning and return nothing instead of raising a fatal error to the caller. ### How was this patch tested? Existing unit tests plus new tests added in `TestOrcWithLargeStripeStatistics`. This closes apache#1402
What changes were proposed in this pull request?
Catch
InvalidProtocolBufferException
when parsing stripe statistics, log a warning message, and return an empty list.Why are the changes needed?
In some cases ORC files may be created with very large Metadata section due to stripe statistics. The bigger the ORC file gets the easier is to endup with a large Metadata section. Nevertheless, it is still possible to hit the problem with smaller ORC files and
TestOrcWithLargeStripeStatistics
demonstrates some extremes.Any attempt to read back the stripe statistics from the file will fail with
InvalidProtocolBufferException
. The exact exception may differ slighly and depending on: a) the size of stripe statistics; b) protobuf size limit.Stripe statistics less than 2GB, and protobuf limit less than stats size:
Stripe statistics greater than 2GB, and protobuf limit 2GB (
Integer.MAX_VALUE
):The
Protocol message was too large
problem could be alleviated by increasing the limit as it was done in the past. However, increasing the limit would hide the underlying problem and probably lead to permanent metadata corruption in the near future. Moreover, the limit is already high enough (1GB) and bumping it further would lead to more memory being used potentially triggeringOutOfMemoryError
, OOM Killer, GC pauses, and other problems that are usually harder to debug and find the root cause.When the stripe statistics exceeds the 2GB there is nothing to be done to parse back the statistics since protobuf cannot deserialize such messages (protocolbuffers/protobuf#11729). The problem cannot be solved unless the metadata storage changes but this can only happen in newer versions.
On the other hand, stripe statistics are important but not vital for readers. In those situations where limits are breached it is acceptable to log a warning and return nothing instead of raising a fatal error to the caller.
How was this patch tested?
Existing unit tests plus new tests added in
TestOrcWithLargeStripeStatistics
.