Modify lex yaml output to elide FileStart/End in tests. #4433

jonmeow · 2024-10-22T00:00:28Z

Trying to make split-file tests of lex functionality shorter and easier to read. numeric_literals.carbon in particular has an example of why I'm interested in this (at the bottom). This also switches from the [] list format to the - list format so that the trailing ] is removed.

Trimming comments in tokenized_buffer.h because (1) they give too much detail about what's printed, which has drifted slightly, and (2) they seem to be trying to justify YAML output, when that's just what we're doing in general.

chandlerc

It seems like the bigger change here is to omit the tokens: [ array wrapper around the output, not the file start/end tokens themselves... Am I misunderstanding?

If so, I'm actually down with that, but wonder if we could keep the file start/end tokens and avoid the flag, and always omit the array wrapper?

geoffromer

> It seems like the bigger change here is to omit the tokens: [ array wrapper around the output, not the file start/end tokens themselves... Am I misunderstanding?
>
> If so, I'm actually down with that, but wonder if we could keep the file start/end tokens and avoid the flag, and always omit the array wrapper?

I haven't spent much time reading these files before, but to my eye, omitting the start/end seems like a bigger change than omitting the [ ]. Both because it's more text, and because it's data rather than metadata.

toolchain/lex/tokenized_buffer.h

geoffromer · 2024-10-22T16:03:32Z

toolchain/lex/tokenized_buffer.cpp

@@ -254,7 +256,7 @@ auto TokenizedBuffer::PrintToken(llvm::raw_ostream& output_stream,
  // justification manually in order to use the dynamically computed widths
  // and get the quotes included.
  output_stream << llvm::formatv(
-      "    { index: {0}, kind: {1}, line: {2}, column: {3}, indent: {4}, "
+      "  - { index: {0}, kind: {1}, line: {2}, column: {3}, indent: {4}, "


What's the purpose of the -?

This is the alternate YAML sequence syntax. See https://www.tutorialspoint.com/yaml/yaml_sequence_styles.htm for examples.

jonmeow · 2024-10-22T16:16:56Z

Instead of:

- filename: ...
  tokens: [
    token0,
    token1
  ]

This does change it to:

- filename: ...
  tokens:
  - token0
  - token1

I've added this to the PR description; the point is to eliminate that trailing ]. As with FileEnd, the way it's handled in the output ends up looking a little distracting.
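To make the difference between the two sequence styles concrete, here is a minimal sketch. `RenderTokens` is a hypothetical helper written for illustration and is not part of the toolchain; it assumes each token entry has already been formatted as a single line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not the toolchain's API): renders pre-formatted,
// single-line token entries under a `tokens:` key in either YAML flow style
// (`[ ... ]`) or block style (`- ...`).
auto RenderTokens(const std::vector<std::string>& tokens, bool block_style)
    -> std::string {
  std::string out = block_style ? "  tokens:\n" : "  tokens: [\n";
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    // Assumption: line-oriented output requires single-line entries.
    assert(tokens[i].find('\n') == std::string::npos);
    if (block_style) {
      // Block style: every entry stands alone, with no separator or closer.
      out += "  - " + tokens[i] + "\n";
    } else {
      // Flow style: entries are comma-separated inside the brackets.
      out += "    " + tokens[i] + (i + 1 < tokens.size() ? ",\n" : "\n");
    }
  }
  if (!block_style) {
    // Only flow style needs the trailing `]` line.
    out += "  ]\n";
  }
  return out;
}
```

In block style the last line of output is always a token entry, so nothing trails after the final token when the next split begins.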

Note what I'm trying to achieve is as in numeric_literals.carbon:


// --- fail_binary_real.carbon
// CHECK:STDOUT: - filename: fail_binary_real.carbon
// CHECK:STDOUT:   tokens:

// CHECK:STDERR: fail_binary_real.carbon:[[@LINE+4]]:4: error(BinaryRealLiteral): binary real number literals are not supported
// CHECK:STDERR: 0b1.0
// CHECK:STDERR:    ^
// CHECK:STDERR:
0b1.0
// CHECK:STDOUT:   - { index: 1, kind: 'RealLiteral', line: {{ *}}[[@LINE-1]], column: 1, indent: 1, spelling: '0b1.0', value: `2*2^-1`, has_leading_space: true }

// --- fail_wrong_real_exponent
// CHECK:STDOUT: - filename: fail_wrong_real_exponent
// CHECK:STDOUT:   tokens:

The current state is:

// --- fail_binary_real.carbon
// CHECK:STDOUT: - filename: fail_binary_real.carbon
// CHECK:STDOUT:   tokens: [
// CHECK:STDOUT:     { index: 0, kind:   'FileStart', line: {{ *\d+}}, column:  1, indent: 1, spelling: '' },

// CHECK:STDERR: fail_binary_real.carbon:[[@LINE+4]]:4: error(BinaryRealLiteral): binary real number literals are not supported
// CHECK:STDERR: 0b1.0
// CHECK:STDERR:    ^
// CHECK:STDERR:
0b1.0
// CHECK:STDOUT:     { index: 1, kind: 'RealLiteral', line: {{ *}}[[@LINE-1]], column:  1, indent: 1, spelling: '0b1.0', value: `2*2^-1`, has_leading_space: true },

// CHECK:STDOUT:     { index: 2, kind:     'FileEnd', line: {{ *}}[[@LINE+1]], column: {{ *\d+}}, indent: 1, spelling: '', has_leading_space: true },
// CHECK:STDOUT:   ]
// --- fail_wrong_real_exponent
// CHECK:STDOUT: - filename: fail_wrong_real_exponent
// CHECK:STDOUT:   tokens: [
// CHECK:STDOUT:     { index: 0, kind: 'FileStart', line: {{ *\d+}}, column:  1, indent: 1, spelling: '' },

For me, in the proposed output, the split (// --- fail_wrong_real_exponent) is easier to notice because there's a blank line above, and less text below (no FileStart token). The increased amount of whitespace makes the split stand out.

If FileStart/FileEnd is retained, it would look like:


// --- fail_binary_real.carbon
// CHECK:STDOUT: - filename: fail_binary_real.carbon
// CHECK:STDOUT:   tokens:
// CHECK:STDOUT:   - { index: 0, kind:   'FileStart', line: {{ *\d+}}, column:  1, indent: 1, spelling: '' },

// CHECK:STDERR: fail_binary_real.carbon:[[@LINE+4]]:4: error(BinaryRealLiteral): binary real number literals are not supported
// CHECK:STDERR: 0b1.0
// CHECK:STDERR:    ^
// CHECK:STDERR:
0b1.0
// CHECK:STDOUT:   - { index: 1, kind: 'RealLiteral', line: {{ *}}[[@LINE-1]], column: 1, indent: 1, spelling: '0b1.0', value: `2*2^-1`, has_leading_space: true }

// CHECK:STDOUT:   - { index: 2, kind:     'FileEnd', line: {{ *}}[[@LINE+1]], column: {{ *\d+}}, indent: 1, spelling: '', has_leading_space: true },
// --- fail_wrong_real_exponent
// CHECK:STDOUT: - filename: fail_wrong_real_exponent
// CHECK:STDOUT:   tokens:
// CHECK:STDOUT:   - { index: 0, kind: 'FileStart', line: {{ *\d+}}, column:  1, indent: 1, spelling: '' },

For me, this returns to making it hard to notice the --- fail_wrong_real_exponent split due to it being crowded by tokens.

And to give the example with []:

// --- fail_binary_real.carbon
// CHECK:STDOUT: - filename: fail_binary_real.carbon
// CHECK:STDOUT:   tokens: [

// CHECK:STDERR: fail_binary_real.carbon:[[@LINE+4]]:4: error(BinaryRealLiteral): binary real number literals are not supported
// CHECK:STDERR: 0b1.0
// CHECK:STDERR:    ^
// CHECK:STDERR:
0b1.0
// CHECK:STDOUT:     { index: 1, kind: 'RealLiteral', line: {{ *}}[[@LINE-1]], column:  1, indent: 1, spelling: '0b1.0', value: `2*2^-1`, has_leading_space: true },

// CHECK:STDOUT:   ]
// --- fail_wrong_real_exponent
// CHECK:STDOUT: - filename: fail_wrong_real_exponent
// CHECK:STDOUT:   tokens: [

Here, I was primarily thinking it's easier to read without the // CHECK:STDOUT: ] right before the split. That's less because of crowding (though the lower amount of text helps) and more because I thought the - array syntax would be more helpful for reading. (Strictly speaking, I could also switch to an index: map, as we do in https://github.com/carbon-language/carbon-lang/blob/trunk/toolchain/check/testdata/basics/no_prelude/raw_ir.carbon)

jonmeow · 2024-10-22T16:18:49Z

Note, my specific intent here is to write more tests of lex using splits. But I'm hesitant to do that due to the overhead of a split in lex.

jonmeow · 2024-10-22T16:22:34Z

One more example to illustrate what I mean with indices:

// --- fail_binary_real.carbon
// CHECK:STDOUT: - filename: fail_binary_real.carbon
// CHECK:STDOUT:   tokens:

// CHECK:STDERR: fail_binary_real.carbon:[[@LINE+4]]:4: error(BinaryRealLiteral): binary real number literals are not supported
// CHECK:STDERR: 0b1.0
// CHECK:STDERR:    ^
// CHECK:STDERR:
0b1.0
// CHECK:STDOUT:     1: { kind: 'RealLiteral', line: {{ *}}[[@LINE-1]], column:  1, indent: 1, spelling: '0b1.0', value: `2*2^-1`, has_leading_space: true },

// --- fail_wrong_real_exponent
// CHECK:STDOUT: - filename: fail_wrong_real_exponent
// CHECK:STDOUT:   tokens:

Note there that - { index: 1, becomes 1: {, switching from a sequence to a map (...with numeric keys). This would probably also mean switching from spaces to 0s for index padding, i.e. 01: once there's >= 10 entries.
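The zero-padding idea can be sketched as follows. `PaddedIndexKey` is a hypothetical name used only for illustration, and the sketch assumes padding is sized to the largest index in the file, so it starts to apply once some index needs two digits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not the toolchain's API): formats `index` as a YAML map
// key, zero-padded so that all keys for `count` tokens share the same width.
auto PaddedIndexKey(int index, int count) -> std::string {
  // Assumption: indices are 0-based and dense, so the largest is count - 1.
  assert(index >= 0 && index < count);
  std::string key = std::to_string(index);
  std::size_t width = std::to_string(count - 1).size();
  if (key.size() < width) {
    // Pad with leading zeros instead of the spaces used for `index: {0}`.
    key.insert(0, width - key.size(), '0');
  }
  return key + ":";
}
```

For example, under these assumptions `PaddedIndexKey(1, 11)` yields `01:`, while `PaddedIndexKey(1, 5)` stays `1:`.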

chandlerc · 2024-10-22T16:26:54Z

> For me, this returns to making it hard to notice the --- fail_wrong_real_exponent split due to it being crowded by tokens.

Thanks, these examples helped me understand the problem you were hoping to solve.

I definitely prefer the line-oriented list without the []s:

// tokens:
// - { ... }
// - { ... }

But it would seem nice to keep the file start/end, so wondering if there is another approach that would still really emphasize the file splits visually?

Specifically, could we force whitespace around the split comments?

Would it help to have more of a horizontal rule as part of the split syntax?

Thinking of something like:

// --- fail_binary_real.carbon

// CHECK:STDOUT: - filename: fail_binary_real.carbon
// CHECK:STDOUT:   tokens:
// CHECK:STDOUT:   - { index: 0, kind:   'FileStart', line: {{ *\d+}}, column:  1, indent: 1, spelling: '' },

// CHECK:STDERR: fail_binary_real.carbon:[[@LINE+4]]:4: error(BinaryRealLiteral): binary real number literals are not supported
// CHECK:STDERR: 0b1.0
// CHECK:STDERR:    ^
// CHECK:STDERR:
0b1.0
// CHECK:STDOUT:   - { index: 1, kind: 'RealLiteral', line: {{ *}}[[@LINE-1]], column: 1, indent: 1, spelling: '0b1.0', value: `2*2^-1`, has_leading_space: true }

// CHECK:STDOUT:   - { index: 2, kind:     'FileEnd', line: {{ *}}[[@LINE+1]], column: {{ *\d+}}, indent: 1, spelling: '', has_leading_space: true },

////////////////////////////////////////////////////////////////////////////////
// --- fail_wrong_real_exponent

// CHECK:STDOUT: - filename: fail_wrong_real_exponent
// CHECK:STDOUT:   tokens:
// CHECK:STDOUT:   - { index: 0, kind: 'FileStart', line: {{ *\d+}}, column:  1, indent: 1, spelling: '' },

jonmeow · 2024-10-22T16:36:07Z

> > For me, this returns to making it hard to notice the --- fail_wrong_real_exponent split due to it being crowded by tokens.
>
> Thanks, these examples helped me understand the problem you were hoping to solve.
>
> I definitely prefer the line-oriented list without the []s:
>
> // tokens:
> // - { ... }
> // - { ... }
>
> But it would seem nice to keep the file start/end, so wondering if there is another approach that would still really emphasize the file splits visually?
>
> Specifically, could we force whitespace around the split comments?
>
> Would it help to have more of a horizontal rule as part of the split syntax?

Understood. To be sure, I think that'd require significant modifications to how file_test and autoupdate work. That's more work than I'm bargaining for, and I'm not sure it's really going to help me (I think FileStart/FileEnd would still crowd the actual test data substantially [I'm partly concerned about the # of lines per test]), so I'm just reverting the FileStart/FileEnd portion of this change and I'll leave it to someone else if they want to work on lex test approaches.

geoffromer

FWIW I substantially preferred dropping the FileStart and FileEnd lines, but this still seems like an improvement.

chandlerc · 2024-10-22T20:14:21Z

> FWIW I substantially preferred dropping the FileStart and FileEnd lines, but this still seems like an improvement.

If you both prefer without start and end, let's do that.

jonmeow · 2024-10-22T20:51:56Z

Okay, switched back to the version of this PR that omits the FileStart/FileEnd in most tests.

chandlerc · 2024-10-22T20:57:17Z

toolchain/driver/compile_subcommand.cpp

+
+  b.AddFlag(
+      {
+          .name = "omit-file-boundary-tokens",


Any reason to keep the flag? I'd be happy at least making the omitted form the default, or even removing the flag entirely.

So that people get the actual representation by default.
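As a rough illustration of what such a flag controls, here is a minimal sketch. The `Token` struct and `TokensToPrint` function are hypothetical stand-ins, not the toolchain's types; only the flag name `omit-file-boundary-tokens` comes from the diff above. The sketch assumes every buffer begins with FileStart and ends with FileEnd, and that the remaining tokens keep their original indices, as in the proposed test output where RealLiteral stays at index 1.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a lexed token; the real toolchain types differ.
struct Token {
  int index;
  std::string kind;
};

// Hypothetical filter: returns the tokens to print. When
// `omit_file_boundary_tokens` is set, FileStart/FileEnd are skipped and the
// remaining tokens keep their original indices.
auto TokensToPrint(const std::vector<Token>& tokens,
                   bool omit_file_boundary_tokens) -> std::vector<Token> {
  // Assumption: a lexed buffer is always bracketed by the boundary tokens.
  assert(!tokens.empty() && tokens.front().kind == "FileStart" &&
         tokens.back().kind == "FileEnd");
  std::vector<Token> result;
  for (const Token& token : tokens) {
    bool is_boundary = token.kind == "FileStart" || token.kind == "FileEnd";
    if (omit_file_boundary_tokens && is_boundary) {
      continue;
    }
    result.push_back(token);
  }
  return result;
}
```

With the flag off, the output is the full token list, which matches the point above about seeing the actual representation by default.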

Modify lex yaml output to elide FileStart/End in tests.

c39e335

github-actions bot added the toolchain label Oct 22, 2024

github-actions bot requested a review from geoffromer October 22, 2024 00:00

chandlerc reviewed Oct 22, 2024

geoffromer reviewed Oct 22, 2024

jonmeow and others added 3 commits October 22, 2024 09:23

Update toolchain/lex/tokenized_buffer.h

295952d

Co-authored-by: Geoff Romer <[email protected]>

pre-commit

47c780f

pre-commit

9268b81

Revert FileStart/FileEnd changes

7782a39

jonmeow changed the title ~~Modify lex yaml output to elide FileStart/End in tests.~~ Change lex yaml output to use - for sequences. Oct 22, 2024

geoffromer approved these changes Oct 22, 2024

Restore omit flag

00a1066

jonmeow changed the title ~~Change lex yaml output to use - for sequences.~~ Modify lex yaml output to elide FileStart/End in tests. Oct 22, 2024

Extra autoupdate

19c62c2

geoffromer approved these changes Oct 22, 2024

chandlerc reviewed Oct 22, 2024

chandlerc approved these changes Oct 23, 2024

Merge branch 'trunk' into lex-yaml-format

dce891f

jonmeow enabled auto-merge October 23, 2024 18:36

jonmeow added this pull request to the merge queue Oct 23, 2024

Merged via the queue into carbon-language:trunk with commit 06f4eec Oct 23, 2024
8 checks passed

jonmeow deleted the lex-yaml-format branch October 23, 2024 19:16
