
Issue a diagnostic if we try to parse a source file that is too large. #4429

Merged: 11 commits into carbon-language:trunk from toolchain-file-too-large on Oct 22, 2024

Conversation

@zygoloid (Contributor) commented Oct 19, 2024:

Previously in an optimized build we'd produce bogus tokens, such as tokens with incorrect IdentifierIds, and in a debug build we would try to CHECK-fail -- but actually wouldn't, because we were incorrectly checking for 2 << bits instead of 1 << bits. I hit this while trying to do some profiling and was seeing some very strange diagnostics.

The diagnostic is pointed at the first token that is beyond the limit to help people determine where to split their files.
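
To make the arithmetic concrete, here is a minimal sketch of why the broken check never fired (the DCHECK shape is from the diff below; the concrete bit width is an illustrative assumption):

    // Suppose PayloadBits == 23.
    // Intended bound: 1 << 23 == 0x800000, so valid indices are
    // [0, 0x7FFFFF].
    // Broken bound: 2 << 23 == 0x1000000, which also admits the
    // out-of-range half [0x800000, 0xFFFFFF], so overflowing
    // identifier IDs passed the check silently.
    CARBON_DCHECK(ident_id.index < (2 << PayloadBits));  // wrong
    CARBON_DCHECK(ident_id.index < (1 << PayloadBits));  // intended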

  }
  EXPECT_CALL(consumer, HandleDiagnostic(IsSingleDiagnostic(
                            DiagnosticKind::TooManyTokens,
                            DiagnosticLevel::Error, 0x400000, 2, _)));
@jonmeow (Contributor) commented Oct 21, 2024:

0x400000 seems opaque in its origin (I'm looking back and forth between files and it seems correct to assume it's PayloadBits, but it's really making me think, particularly due to the switch from bit shift to hex). Had you considered either refactoring the 1 << TokenizedBuffer::TokenInfo::PayloadBits so that it's shared here, or not asserting the specific value? e.g., there could be a TokenizedBuffer::MaxTokens.

@zygoloid (Contributor, Author) replied:

It's actually one less than PayloadBits because we have two tokens per line. My thinking was that I wanted the test to fail if someone changed PayloadBits so that their attention would be drawn to the fact that they're modifying the maximum number of tokens supported (which is otherwise very implicit here). But I think adding a MaxTokens would deal with this nicely; I'll do that.

Also I just realized the 2 here is off by one if we want to point at the first "bad" token, because we need to leave room for the end-of-file token.
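
For reference, a sketch of the shape that could take (TokenizedBuffer, PayloadBits, and MaxTokens are all named in this thread; the exact value and placement are assumptions):

    // In TokenizedBuffer: an explicit, named limit the test can refer
    // to instead of a hard-coded hex literal derived from it.
    static constexpr int MaxTokens = 1 << 23;  // assumed value
    static_assert(MaxTokens <= 1 << PayloadBits,
                  "Not enough payload bits to store a token index");

With two tokens per line, as in this test, the limit is hit at line MaxTokens / 2, which is where the 0x400000 in the expectation comes from.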

@zygoloid (Contributor, Author) replied:

Done.

Comment on lines 399 to 402
// Note that we don't check the payload fits in `token_payload_` here.
// Payloads are typically ID types for which we create at most one per
// token, and in `Lexer::LexFileEnd` we diagnose if the file has too many
// tokens to fit in `token_payload_`.
@jonmeow (Contributor) commented:

Had you considered having the above comments refer to this, since you provide more detail here? Moving it to either LexFileEnd or token_payload_ might make it easier to cross-reference.

Another related thought would be to migrate the logic to something like TokenizedBuffer::Verify, similar to Parse::Tree::Verify, as another way to give an easy cross-reference (I don't think it's complex enough for complexity alone to motivate a refactor).

@zygoloid (Contributor, Author) replied:

Moved comment to token_payload_. I've also factored out the finalization steps into a separate function on the lexer.
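
A rough sketch of what that factoring might look like (Finalize is named later in this thread; the member and helper names here are assumptions, not the PR's actual code):

    // Runs once after the main lex loop, before the end-of-file token
    // is added, so the check leaves room for it under the limit.
    auto Lexer::Finalize() -> void {
      // One token-count check bounds all payloads at once, since
      // payloads are ID types created at most once per token.
      if (buffer_.token_infos_.size() > TokenizedBuffer::MaxTokens) {
        // Diagnose at the first token beyond the limit so people can
        // see where to split the file.
        DiagnoseTooManyTokens(Token(TokenizedBuffer::MaxTokens));
      }
    }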

toolchain/lex/tokenized_buffer_test.cpp (review thread on an outdated diff, resolved)
@@ -306,7 +306,7 @@ class TokenizedBuffer : public Printable<TokenizedBuffer> {
   }
   auto set_ident_id(IdentifierId ident_id) -> void {
     CARBON_DCHECK(kind() == TokenKind::Identifier);
-    CARBON_DCHECK(ident_id.index < (2 << PayloadBits));
+    // Overflow is checked in `Lexer::LexFileEnd`.
@jonmeow (Contributor) commented:

Is it worth noting somewhere that this relies on no identifiers being added to the shared value store outside this specific tokenized buffer?

i.e., this should work because the identifier count is limited by the token count. However, if there's ever a second source of identifiers, this breaks.

@zygoloid (Contributor, Author) replied:

Added a comment to Lexer::Finalize.
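
Plausible wording for such a comment (paraphrasing this thread, not the PR's actual text):

    // The overflow check compares the token count against MaxTokens,
    // which also bounds identifier IDs because every identifier in the
    // shared value store is added by this lexer, at most one per
    // token. A second source of identifiers would break that bound.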

Co-authored-by: Jon Ross-Perkins <[email protected]>
@jonmeow (Contributor) commented Oct 21, 2024:

Note: the test failure is because this diagnostic needs to be added to IsUntestedDiagnostic in emitted_diagnostics_test.cpp.

@zygoloid requested a review from @jonmeow on October 21, 2024 at 22:01.
@jonmeow (Contributor) left a review comment:

LG

toolchain/lex/lex.cpp (review thread on an outdated diff, resolved)

// A bitfield that encodes the token's kind, the leading space flag, and the
// remaining bits in a payload. These are encoded together as a bitfield for
// density and because these are the hottest fields of tokens for consumers
// after lexing.
TokenKind::RawEnumType kind_ : sizeof(TokenKind) * 8;
bool has_leading_space_ : 1;
// Payload values are typically ID types for which we create at most one per
@jonmeow (Contributor) commented:

Suggested change (whitespace only; the rendered page dropped the diff, but from the reply below the suggestion inserts a blank line before this comment):

-  // Payload values are typically ID types for which we create at most one per
+
+  // Payload values are typically ID types for which we create at most one per

Whitespace suggestion due to the long comment.

@zygoloid (Contributor, Author) replied:

This makes the previous comment that's trying to talk about all three bit-fields less clear.

What do you think about moving this comment before all the bit-fields, with the other one?
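
Roughly, the proposed layout (comment text abridged from the diff context above):

    // A bitfield that encodes the token's kind, the leading space
    // flag, and the remaining bits in a payload. [...]
    //
    // Payload values are typically ID types for which we create at
    // most one per token. [...]
    TokenKind::RawEnumType kind_ : sizeof(TokenKind) * 8;
    bool has_leading_space_ : 1;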

Comment on lines +413 to +414
static_assert(MaxTokens <= 1 << PayloadBits,
"Not enough payload bits to store a token index");
@jonmeow (Contributor) commented:

What made you choose this approach versus relocating PayloadBits so that MaxTokens is calculated directly from it?

@zygoloid (Contributor, Author) replied:

I wanted an explicit value for MaxTokens so that anyone changing that value knows they're changing the token limit, not just some representation detail of the tokenized buffer.
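
The trade-off, sketched (the derived form is hypothetical, shown for contrast; the explicit value is an assumption):

    // Derived: silently tracks the representation, so editing
    // PayloadBits would change the token limit without drawing
    // attention to it.
    //   static constexpr int MaxTokens = 1 << PayloadBits;

    // Explicit: the limit is a deliberate, visible choice, and the
    // static_assert above fails if the representation can no longer
    // hold it.
    static constexpr int MaxTokens = 1 << 23;  // assumed value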

@josh11b removed their request for review October 22, 2024 21:14
@zygoloid added this pull request to the merge queue Oct 22, 2024
Merged via the queue into carbon-language:trunk with commit e68e54d Oct 22, 2024
8 checks passed
@zygoloid deleted the toolchain-file-too-large branch October 22, 2024 23:47