
Issue a diagnostic if we try to parse a source file that is too large. #4429

Merged: 11 commits into carbon-language:trunk from toolchain-file-too-large on Oct 22, 2024

Conversation

@zygoloid (Contributor) commented Oct 19, 2024:

Previously in an optimized build we'd produce bogus tokens, such as tokens with incorrect IdentifierIds, and in a debug build we would try to CHECK-fail -- but actually wouldn't, because we were incorrectly checking for 2 << bits instead of 1 << bits. I hit this while trying to do some profiling and was seeing some very strange diagnostics.

The diagnostic is pointed at the first token that is beyond the limit to help people determine where to split their files.
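
To make the arithmetic concrete, here is a minimal sketch of why the broken check never fired (the DCHECK shape is from the diff below; the concrete bit width is an illustrative assumption):

    // Suppose PayloadBits == 23.
    // Intended bound: 1 << 23 == 0x800000, so valid indices are
    // [0, 0x7FFFFF].
    // Broken bound: 2 << 23 == 0x1000000, which also admits the
    // out-of-range half [0x800000, 0xFFFFFF], so overflowing
    // identifier IDs passed the check silently.
    CARBON_DCHECK(ident_id.index < (2 << PayloadBits));  // wrong
    CARBON_DCHECK(ident_id.index < (1 << PayloadBits));  // intended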

  }
  EXPECT_CALL(consumer, HandleDiagnostic(IsSingleDiagnostic(
                            DiagnosticKind::TooManyTokens,
                            DiagnosticLevel::Error, 0x400000, 2, _)));
@jonmeow (Contributor) commented Oct 21, 2024:

0x400000 seems opaque in its origin (I'm looking back and forth between files and it seems correct to assume it's PayloadBits, but it's really making me think, particularly due to the switch from bit shift to hex). Had you considered either refactoring the 1 << TokenizedBuffer::TokenInfo::PayloadBits so that it's shared here, or not asserting the specific value? e.g., there could be a TokenizedBuffer::MaxTokens.

@zygoloid (Contributor, Author) replied:

It's actually one less than PayloadBits because we have two tokens per line. My thinking was that I wanted the test to fail if someone changed PayloadBits so that their attention would be drawn to the fact that they're modifying the maximum number of tokens supported (which is otherwise very implicit here). But I think adding a MaxTokens would deal with this nicely; I'll do that.

Also I just realized the 2 here is off by one if we want to point at the first "bad" token, because we need to leave room for the end-of-file token.
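
For reference, a sketch of the shape that could take (TokenizedBuffer, PayloadBits, and MaxTokens are all named in this thread; the exact value and placement are assumptions):

    // In TokenizedBuffer: an explicit, named limit the test can refer
    // to instead of a hard-coded hex literal derived from it.
    static constexpr int MaxTokens = 1 << 23;  // assumed value
    static_assert(MaxTokens <= 1 << PayloadBits,
                  "Not enough payload bits to store a token index");

With two tokens per line, as in this test, the limit is hit at line MaxTokens / 2, which is where the 0x400000 in the expectation comes from.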

@zygoloid (Contributor, Author) replied:

Done.

Comment on lines 399 to 402
// Note that we don't check the payload fits in `token_payload_` here.
// Payloads are typically ID types for which we create at most one per
// token, and in `Lexer::LexFileEnd` we diagnose if the file has too many
// tokens to fit in `token_payload_`.
@jonmeow (Contributor) commented:

Had you considered having the above comments refer to this, since you provide more detail here? Moving it to either LexFileEnd or token_payload_ might make it easier to cross-reference.

Another related thought would be to migrate the logic to something like TokenizedBuffer::Verify, similar to Parse::Tree::Verify, as another way to give an easy cross-reference (I don't think it's complex enough for complexity alone to motivate a refactor).

@zygoloid (Contributor, Author) replied:

Moved comment to token_payload_. I've also factored out the finalization steps into a separate function on the lexer.
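
A rough sketch of what that factoring might look like (Finalize is named later in this thread; the member and helper names here are assumptions, not the PR's actual code):

    // Runs once after the main lex loop, before the end-of-file token
    // is added, so the check leaves room for it under the limit.
    auto Lexer::Finalize() -> void {
      // One token-count check bounds all payloads at once, since
      // payloads are ID types created at most once per token.
      if (buffer_.token_infos_.size() > TokenizedBuffer::MaxTokens) {
        // Diagnose at the first token beyond the limit so people can
        // see where to split the file.
        DiagnoseTooManyTokens(Token(TokenizedBuffer::MaxTokens));
      }
    }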

toolchain/lex/tokenized_buffer_test.cpp (review thread on an outdated diff, resolved)
@@ -306,7 +306,7 @@ class TokenizedBuffer : public Printable<TokenizedBuffer> {
   }
   auto set_ident_id(IdentifierId ident_id) -> void {
     CARBON_DCHECK(kind() == TokenKind::Identifier);
-    CARBON_DCHECK(ident_id.index < (2 << PayloadBits));
+    // Overflow is checked in `Lexer::LexFileEnd`.
@jonmeow (Contributor) commented:

Is it worth noting somewhere that this relies on no identifiers being added to the shared value store outside this specific tokenized buffer?

i.e., this should work because the identifier count is limited by the token count. However, if there's ever a second source of identifiers, this breaks.

@zygoloid (Contributor, Author) replied:

Added a comment to Lexer::Finalize.
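
Plausible wording for such a comment (paraphrasing this thread, not the PR's actual text):

    // The overflow check compares the token count against MaxTokens,
    // which also bounds identifier IDs because every identifier in the
    // shared value store is added by this lexer, at most one per
    // token. A second source of identifiers would break that bound.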

Co-authored-by: Jon Ross-Perkins <[email protected]>
@jonmeow (Contributor) commented Oct 21, 2024:

Note: the test failure is because this diagnostic needs to be added to IsUntestedDiagnostic in emitted_diagnostics_test.cpp.

@zygoloid requested a review from @jonmeow on October 21, 2024 at 22:01.
@jonmeow (Contributor) left a review comment:

LG

toolchain/lex/lex.cpp (review thread on an outdated diff, resolved)

// A bitfield that encodes the token's kind, the leading space flag, and the
// remaining bits in a payload. These are encoded together as a bitfield for
// density and because these are the hottest fields of tokens for consumers
// after lexing.
TokenKind::RawEnumType kind_ : sizeof(TokenKind) * 8;
bool has_leading_space_ : 1;
// Payload values are typically ID types for which we create at most one per
@jonmeow (Contributor) commented:

Suggested change (whitespace only; the rendered page dropped the diff, but from the reply below the suggestion inserts a blank line before this comment):

-  // Payload values are typically ID types for which we create at most one per
+
+  // Payload values are typically ID types for which we create at most one per

Whitespace suggestion due to the long comment.

@zygoloid (Contributor, Author) replied:

This makes the previous comment that's trying to talk about all three bit-fields less clear.

What do you think about moving this comment before all the bit-fields, with the other one?
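
Roughly, the proposed layout (comment text abridged from the diff context above):

    // A bitfield that encodes the token's kind, the leading space
    // flag, and the remaining bits in a payload. [...]
    //
    // Payload values are typically ID types for which we create at
    // most one per token. [...]
    TokenKind::RawEnumType kind_ : sizeof(TokenKind) * 8;
    bool has_leading_space_ : 1;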

Comment on lines +413 to +414
static_assert(MaxTokens <= 1 << PayloadBits,
"Not enough payload bits to store a token index");
@jonmeow (Contributor) commented:

What made you choose this approach versus relocating PayloadBits so that MaxTokens is calculated directly from it?

@zygoloid (Contributor, Author) replied:

I wanted an explicit value for MaxTokens so that anyone changing that value knows they're changing the token limit, not just some representation detail of the tokenized buffer.
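
The trade-off, sketched (the derived form is hypothetical, shown for contrast; the explicit value is an assumption):

    // Derived: silently tracks the representation, so editing
    // PayloadBits would change the token limit without drawing
    // attention to it.
    //   static constexpr int MaxTokens = 1 << PayloadBits;

    // Explicit: the limit is a deliberate, visible choice, and the
    // static_assert above fails if the representation can no longer
    // hold it.
    static constexpr int MaxTokens = 1 << 23;  // assumed value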

@josh11b removed their request for review October 22, 2024 21:14
@zygoloid added this pull request to the merge queue Oct 22, 2024
Merged via the queue into carbon-language:trunk with commit e68e54d Oct 22, 2024
8 checks passed
@zygoloid deleted the toolchain-file-too-large branch October 22, 2024 23:47