Issue a diagnostic if we try to parse a source file that is too large. #4429
Conversation
Previously in an optimized build we'd produce bogus tokens, such as tokens with incorrect IdentifierIds, and in a debug build we would CHECK-fail. The diagnostic is pointed at the first token that is beyond the limit to help people determine where to split their files.
```cpp
  }
  EXPECT_CALL(consumer, HandleDiagnostic(IsSingleDiagnostic(
                            DiagnosticKind::TooManyTokens,
                            DiagnosticLevel::Error, 0x400000, 2, _)));
```
`0x400000` seems opaque in its origin (I'm looking back and forth between files and it seems correct to assume it's `PayloadBits`, but it's really making me think, particularly due to the switch from bit shift to hex). Had you considered either refactoring the `1 << TokenizedBuffer::TokenInfo::PayloadBits` so that it's shared here, or not asserting the specific value? E.g., there could be a `TokenizedBuffer::MaxTokens`.
It's actually one less than `PayloadBits` because we have two tokens per line. My thinking was that I wanted the test to fail if someone changed `PayloadBits`, so that their attention would be drawn to the fact that they're modifying the maximum number of tokens supported (which is otherwise very implicit here). But I think adding a `MaxTokens` would deal with this nicely; I'll do that.

Also, I just realized the `2` here is off by one if we want to point at the first "bad" token, because we need to leave room for the end-of-file token.
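For concreteness, a sketch of the arithmetic behind that constant (the `PayloadBits` value of 23 is an assumption for illustration, not taken from the thread):

```cpp
constexpr int PayloadBits = 23;  // Assumed width for illustration.

// With two tokens per line, the token limit of 1 << PayloadBits is reached
// at line 1 << (PayloadBits - 1), which is where 0x400000 comes from.
static_assert(0x400000 == 1 << (PayloadBits - 1));
```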
Done.
toolchain/lex/tokenized_buffer.h
Outdated
```cpp
// Note that we don't check the payload fits in `token_payload_` here.
// Payloads are typically ID types for which we create at most one per
// token, and in `Lexer::LexFileEnd` we diagnose if the file has too many
// tokens to fit in `token_payload_`.
```
Had you considered having the above comments refer to this, since you provide more detail here? Moving it to either `LexFileEnd` or `token_payload_` might make it easier to cross-reference.

Another related thought would be to migrate the logic to something like `TokenizedBuffer::Verify`, similar to `Parse::Tree::Verify`, as another way to give an easy cross-reference (I don't think it's complex enough to have complexity motivate a refactor).
Moved comment to `token_payload_`. I've also factored out the finalization steps into a separate function on the lexer.
toolchain/lex/tokenized_buffer.h
Outdated
```diff
@@ -306,7 +306,7 @@ class TokenizedBuffer : public Printable<TokenizedBuffer> {
   }
   auto set_ident_id(IdentifierId ident_id) -> void {
     CARBON_DCHECK(kind() == TokenKind::Identifier);
-    CARBON_DCHECK(ident_id.index < (2 << PayloadBits));
+    // Overflow is checked in `Lexer::LexFileEnd`.
```
Is it worth noting somewhere that this relies on no identifiers being added to the shared value store outside this specific tokenized buffer?
i.e., this should work because the identifier count is limited by the token count. However, if there's ever a second source of identifiers, this breaks.
Added a comment to `Lexer::Finalize`.
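The comment being described might read roughly like this sketch (the wording is illustrative, not the actual patch; only `Lexer::Finalize` and `set_ident_id` are named in the thread):

```cpp
// Note: identifier IDs fit in the token payload only because all identifiers
// in the shared value store are created by this lexer, at most one per
// identifier token, so the identifier count is bounded by the token count.
// A second source of identifiers would invalidate this bound, and
// `set_ident_id` would then need its own overflow check.
```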
Co-authored-by: Jon Ross-Perkins <[email protected]>
Note: the failing test indicates that this should be added to `IsUntestedDiagnostic` in `emitted_diagnostics_test.cpp`.
LG
```cpp
// A bitfield that encodes the token's kind, the leading space flag, and the
// remaining bits in a payload. These are encoded together as a bitfield for
// density and because these are the hottest fields of tokens for consumers
// after lexing.
TokenKind::RawEnumType kind_ : sizeof(TokenKind) * 8;
bool has_leading_space_ : 1;
// Payload values are typically ID types for which we create at most one per
```
```suggestion

  // Payload values are typically ID types for which we create at most one per
```
Whitespace suggestion due to the long comment.
This makes the previous comment that's trying to talk about all three bit-fields less clear.
What do you think about moving this comment before all the bit-fields, with the other one?
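For reference, a self-contained sketch of the packing being discussed (the field widths and the use of plain `uint32_t` bitfields are assumptions for illustration; the real code uses `TokenKind::RawEnumType` and the actual ID types):

```cpp
#include <cstdint>

struct TokenInfo {
  // 8 bits of kind plus 1 leading-space bit leaves 23 bits of payload in a
  // single 32-bit word, which is what bounds the number of tokens per file.
  static constexpr int PayloadBits = 23;

  uint32_t kind_ : 8;
  uint32_t has_leading_space_ : 1;
  uint32_t token_payload_ : PayloadBits;
};

// All three fields pack into one word.
static_assert(sizeof(TokenInfo) == 4);
```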
```cpp
static_assert(MaxTokens <= 1 << PayloadBits,
              "Not enough payload bits to store a token index");
```
What made you choose this approach versus relocating `PayloadBits` to make `MaxTokens` directly calculated based on it?
I wanted an explicit value for `MaxTokens` so that anyone changing that value knows they're changing the token limit, not just some representation detail of the tokenized buffer.
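A minimal sketch of that trade-off (the literal value and the `PayloadBits` width are assumptions for illustration):

```cpp
constexpr int PayloadBits = 23;  // Assumed representation detail.

// Spelled as an explicit literal on purpose: editing this line is visibly a
// change to the supported token count, not a side effect of re-packing bits.
constexpr int MaxTokens = 1 << 23;

// The representation must still be able to store any valid token index.
static_assert(MaxTokens <= 1 << PayloadBits,
              "Not enough payload bits to store a token index");
```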
Co-authored-by: Jon Ross-Perkins <[email protected]>
Previously in an optimized build we'd produce bogus tokens, such as tokens with incorrect IdentifierIds, and in a debug build we would try to CHECK-fail -- but actually wouldn't, because we were incorrectly checking for `2 << bits` instead of `1 << bits`. I hit this while I was trying to do some profiling and was seeing some very strange diagnostics.

The diagnostic is pointed at the first token that is beyond the limit to help people determine where to split their files.
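To make the broken check concrete, the shift identity it tripped over (not the project's code, just the arithmetic):

```cpp
// `2 << bits` equals `1 << (bits + 1)`, i.e. twice the intended limit, so
// the old DCHECK accepted indices that overflow a `bits`-wide payload.
static_assert((2 << 23) == (1 << 24));
static_assert((1 << 23) < (2 << 23));  // An overflowing index passed the check.
```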