Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1361: InvalidProtocolBufferException when reading large stripe statistics #1402

Merged
merged 2 commits into from
Feb 24, 2023

Conversation

zabetak
Copy link
Member

@zabetak zabetak commented Feb 8, 2023

What changes were proposed in this pull request?

Catch InvalidProtocolBufferException when parsing stripe statistics, log a warning message, and return an empty list.

Why are the changes needed?

In some cases ORC files may be created with very large Metadata section due to stripe statistics. The bigger the ORC file gets the easier is to endup with a large Metadata section. Nevertheless, it is still possible to hit the problem with smaller ORC files and TestOrcWithLargeStripeStatistics demonstrates some extremes.

Any attempt to read back the stripe statistics from the file will fail with InvalidProtocolBufferException. The exact exception may differ slighly and depending on: a) the size of stripe statistics; b) protobuf size limit.

Stripe statistics less than 2GB, and protobuf limit less than stats size:

com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase the size limit.

Stripe statistics greater than 2GB, and protobuf limit 2GB (Integer.MAX_VALUE):

com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.

The Protocol message was too large problem could be alleviated by increasing the limit as it was done in the past. However, increasing the limit would hide the underlying problem and probably lead to permanent metadata corruption in the near future. Moreover, the limit is already high enough (1GB) and bumping it further would lead to more memory being used potentially triggering OutOfMemoryError, OOM Killer, GC pauses, and other problems that are usually harder to debug and find the root cause.

When the stripe statistics exceeds the 2GB there is nothing to be done to parse back the statistics since protobuf cannot deserialize such messages (protocolbuffers/protobuf#11729). The problem cannot be solved unless the metadata storage changes but this can only happen in newer versions.

On the other hand, stripe statistics are important but not vital for readers. In those situations where limits are breached it is acceptable to log a warning and return nothing instead of raising a fatal error to the caller.

How was this patch tested?

Existing unit tests plus new tests added in TestOrcWithLargeStripeStatistics.

…atistics

Catch `InvalidProtocolBufferException` when parsing stripe statistics, log a warning message, and return an empty list.

In some cases ORC files may be created with very large Metadata section due to stripe statistics.
The bigger the ORC file gets the easier is to endup with a large Metadata section.
Nevertheless, it is still possible to hit the problem with smaller ORC files and `TestOrcWithLargeStripeStatistics` demonstrates some extremes.

Any attempt to read back the stripe statistics from the file will fail with `InvalidProtocolBufferException`.
The exact exception may differ slighly and depending on: a) the size of stripe statistics; b) protobuf size limit.

Stripe statistics less than 2GB, and protobuf limit less than stats size:
```
com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase the size limit.
```

Stripe statistics greater than 2GB, and protobuf limit 2GB (`Integer.MAX_VALUE`):
```
com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.
```

The `Protocol message was too large` problem could be alleviated by increasing the limit as it was done in the past.
However, increasing the limit would hide the underlying problem and probably lead to permanent metadata corruption in the near future.
Moreover, the limit is already high enough (1GB) and bumping it further would lead to more memory being used potentially triggering `OutOfMemoryError`, OOM Killer, GC pauses, and other problems that are usually harder to debug and find the root cause.

When the stripe statistics exceeds the 2GB there is nothing to be done to parse back the statistics since protobuf cannot deserialize such messages (protocolbuffers/protobuf#11729).
The problem cannot be solved unless the metadata storage changes but this can only happen in newer versions.

On the other hand, stripe statistics are important but not vital for readers.
In those situations where limits are breached it is acceptable to log a warning and return nothing instead of raising a fatal error to the caller.

Existing unit tests plus new tests added in TestOrcWithLargeStripeStatistics.
@github-actions github-actions bot added the JAVA label Feb 8, 2023
InStream.createCodedInputStream(stream));
return meta.getStripeStatsList();
} catch (InvalidProtocolBufferException e) {
LOG.warn("Failed to parse stripe statistics; check ORC-1361 for more details.", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message could be wrong because InvalidProtocolBufferException can happen in another cases, right?

Copy link
Contributor

@deshanxiao deshanxiao Feb 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
From protobuf, there are many other cases here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first part of the message (Failed to parse stripe statistics) is accurate. The second part indeed can be misleading in some cases.

Instead of pointing to ORC-1361 what do you think of creating a page/section in the ORC website and document there whatever is needed? Then the message can contain the URL to this page.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the exception message provide detail information? It yes, we may not need to state check ORC-1361 for more details explicitly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we just need to change the title to catch InvalidProtocolBufferException and let the program continue to run.
Reading large bars of statistical information is one of the causes of the exception. But throw any InvalidProtocolBufferException and this pr will allow the program to continue to run.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep running, but need to let the user know that we skipped these statistics. This can be achieved by catching exceptions, or requiring the user to configure a parameter such as skipCheckStatistics

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The details in the exception are probably enough for the user to understand that the stats are big/corrupted. What may be more difficult to understand is why the stats are big or why they were corrupted and what can they do about it.

Providing some tips like check the strip size, number of columns, size of string stats, total size of the file, etc., should be useful and that's what I was thinking that could go in ORC-1361 or another resource.

I am fine with whatever you decide but I would strongly suggest to LOG at least the exception that was caught.

Let me know what you prefer and I will do the appropriate changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the content of the logging message is somewhat blocking this PR from being merged. I think this is a rather minor issue so how about keeping it dead simple (Failed to parse stripe statistics) and move the PR forward? WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the content of the logging message is somewhat blocking this PR from being merged. I think this is a rather minor issue so how about keeping it dead simple (Failed to parse stripe statistics) and move the PR forward? WDYT?

I'm +1 on this. Thanks @zabetak

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified message in 55eebc6

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making a PR, @zabetak . Do you think 2GB limitation exists in C++ code path?

cc @williamhyun , @wgtmac , @stiga-huang , @coderex2522 .

@wgtmac
Copy link
Member

wgtmac commented Feb 9, 2023

Thank you for making a PR, @zabetak . Do you think 2GB limitation exists in C++ code path?

cc @williamhyun , @wgtmac , @stiga-huang , @coderex2522 .

I think so. According to my experience, large message cannot be decoded with following warning:

Reading dangerously large protocol message. If the message turns out to be larger than 67108864 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.

One has to workaround it manually by setting a larger limit.

@zabetak
Copy link
Member Author

zabetak commented Feb 9, 2023

@wgtmac Note that there is no workaround when the size reaches 2GB at least from the java perspective. The limit cannot be increased further and as I explained under ORC-1361 probably is not worth it.

@deshanxiao
Copy link
Contributor

deshanxiao commented Feb 17, 2023

I agree to merge the PR although the PR does not directly solve the problem of large statistical information.
It can skip statistical information without affecting the normal reading of files.

@zabetak
Copy link
Member Author

zabetak commented Feb 17, 2023

@deshanxiao It doesn't solve the problem cause there is no a real fix :/ . The metadata structure has to change if we want to allow large metadata but this is relevant only for new versions of the ORC format. If there is enough interest we can try to brainstorm under #1404 or another place.

@deshanxiao
Copy link
Contributor

@deshanxiao It doesn't solve the problem cause there is no a real fix :/ . The metadata structure has to change if we want to allow large metadata but this is relevant only for new versions of the ORC format. If there is enough interest we can try to brainstorm under #1404 or another place.

I see. +1 for the PR. Thank you @zabetak

@wgtmac wgtmac merged commit c5d7755 into apache:main Feb 24, 2023
@wgtmac
Copy link
Member

wgtmac commented Feb 24, 2023

I just submitted it. Thanks all!

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
…atistics

### What changes were proposed in this pull request?
Catch `InvalidProtocolBufferException` when parsing stripe statistics, log a warning message, and return an empty list.

### Why are the changes needed?
In some cases ORC files may be created with very large Metadata section due to stripe statistics. The bigger the ORC file gets the easier is to endup with a large Metadata section. Nevertheless, it is still possible to hit the problem with smaller ORC files and `TestOrcWithLargeStripeStatistics` demonstrates some extremes.

Any attempt to read back the stripe statistics from the file will fail with `InvalidProtocolBufferException`. The exact exception may differ slighly and depending on: a) the size of stripe statistics; b) protobuf size limit.

Stripe statistics less than 2GB, and protobuf limit less than stats size:
```
com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase the size limit.
```

Stripe statistics greater than 2GB, and protobuf limit 2GB (`Integer.MAX_VALUE`):
```
com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.
```

The `Protocol message was too large` problem could be alleviated by increasing the limit as it was done in the past. However, increasing the limit would hide the underlying problem and probably lead to permanent metadata corruption in the near future. Moreover, the limit is already high enough (1GB) and bumping it further would lead to more memory being used potentially triggering `OutOfMemoryError`, OOM Killer, GC pauses, and other problems that are usually harder to debug and find the root cause.

When the stripe statistics exceeds the 2GB there is nothing to be done to parse back the statistics since protobuf cannot deserialize such messages (protocolbuffers/protobuf#11729). The problem cannot be solved unless the metadata storage changes but this can only happen in newer versions.

On the other hand, stripe statistics are important but not vital for readers. In those situations where limits are breached it is acceptable to log a warning and return nothing instead of raising a fatal error to the caller.

### How was this patch tested?
Existing unit tests plus new tests added in `TestOrcWithLargeStripeStatistics`.

This closes apache#1402
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants