Modify lex yaml output to elide FileStart/End in tests. #4433

Merged: 8 commits, Oct 23, 2024

Conversation

Contributor

@jonmeow jonmeow commented Oct 22, 2024

Trying to make split file tests of lex functionality shorter and easier to read. numeric_literals.carbon in particular has an example of why I'm interested in this (at the bottom). This also switches from [] list format to - list format so that the trailing ] is removed.

Trimming comments in tokenized_buffer.h because (1) it feels like it's giving too much detail about what's printed, which has drifted slightly and (2) it also feels like it's trying to justify YAML output, when that's just what we're doing in general.

Contributor

@chandlerc chandlerc left a comment

It seems like the bigger change here is to omit the tokens: [ array wrapper around the output, not the file start/end tokens themselves... Am I misunderstanding?

If so, I'm actually down with that, but wonder if we could keep the file start/end tokens and avoid the flag, and always omit the array wrapper?

Contributor

@geoffromer geoffromer left a comment

> It seems like the bigger change here is to omit the tokens: [ array wrapper around the output, not the file start/end tokens themselves... Am I misunderstanding?
>
> If so, I'm actually down with that, but wonder if we could keep the file start/end tokens and avoid the flag, and always omit the array wrapper?

I haven't spent much time reading these files before, but to my eye, omitting the start/end seems like a bigger change than omitting the [ ]. Both because it's more text, and because it's data rather than metadata.

toolchain/lex/tokenized_buffer.h (outdated)
@@ -254,7 +256,7 @@ auto TokenizedBuffer::PrintToken(llvm::raw_ostream& output_stream,
// justification manually in order to use the dynamically computed widths
// and get the quotes included.
output_stream << llvm::formatv(
- " { index: {0}, kind: {1}, line: {2}, column: {3}, indent: {4}, "
+ " - { index: {0}, kind: {1}, line: {2}, column: {3}, indent: {4}, "
Contributor

What's the purpose of the -?

Contributor Author

This is the alternate YAML sequence syntax. See https://www.tutorialspoint.com/yaml/yaml_sequence_styles.htm for examples.
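
The code comment in the diff above mentions doing the justification manually so that the dynamically computed width applies to the kind name with its quotes included. A minimal sketch of that idea in plain C++, assuming a hypothetical helper named RightJustifyQuoted (this is not the toolchain's code, which goes through llvm::formatv):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only; not part of the toolchain.
// Wraps `text` in single quotes first, then left-pads with spaces so the
// quoted result is at least `width` characters wide.
std::string RightJustifyQuoted(const std::string& text, std::size_t width) {
  std::string quoted = "'" + text + "'";
  if (quoted.size() < width) {
    // Pad after quoting so the quotes count toward the column width.
    quoted = std::string(width - quoted.size(), ' ') + quoted;
  }
  return quoted;
}
```

With a width of 13, 'FileStart' gets two leading spaces and 'RealLiteral' gets none, which is consistent with the spacing after kind: in the test output quoted later in this thread.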

@jonmeow
Contributor Author

jonmeow commented Oct 22, 2024

Instead of:

- filename: ...
  tokens: [
    token0,
    token1
  ]

This does change it to:

- filename: ...
  tokens:
  - token0
  - token1

I've added this to the PR description; it's to eliminate that trailing ]. As with FileEnd, the way it's handled in the output ends up looking a little distracting.
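
To make the difference between the two sequence styles concrete, here is a rough sketch of printers for each. The function names and signatures are assumptions made for illustration and are not the toolchain's API; token text is passed in as plain strings.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helpers for illustration; names and signatures are
// assumptions, not the toolchain's API.

// Flow style: the sequence is wrapped in [ ], so a closing `]` line has to
// follow the last token.
std::string PrintFlowStyle(const std::string& filename,
                           const std::vector<std::string>& tokens) {
  std::ostringstream out;
  out << "- filename: " << filename << "\n";
  out << "  tokens: [\n";
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    out << "    " << tokens[i];
    // Separate entries with commas; the last entry has none.
    if (i + 1 < tokens.size()) {
      out << ",";
    }
    out << "\n";
  }
  out << "  ]\n";
  return out.str();
}

// Block style: each entry is its own `- ` line, so nothing trails the last
// token.
std::string PrintBlockStyle(const std::string& filename,
                            const std::vector<std::string>& tokens) {
  std::ostringstream out;
  out << "- filename: " << filename << "\n";
  out << "  tokens:\n";
  for (const std::string& token : tokens) {
    out << "  - " << token << "\n";
  }
  return out.str();
}
```

In the block-style printer every output line after the header belongs to exactly one token, which is what lets a split-file test place each CHECK line next to the source line that produced it.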

Note what I'm trying to achieve is as in numeric_literals.carbon:


// --- fail_binary_real.carbon
// CHECK:STDOUT: - filename: fail_binary_real.carbon
// CHECK:STDOUT:   tokens:

// CHECK:STDERR: fail_binary_real.carbon:[[@LINE+4]]:4: error(BinaryRealLiteral): binary real number literals are not supported
// CHECK:STDERR: 0b1.0
// CHECK:STDERR:    ^
// CHECK:STDERR:
0b1.0
// CHECK:STDOUT:   - { index: 1, kind: 'RealLiteral', line: {{ *}}[[@LINE-1]], column: 1, indent: 1, spelling: '0b1.0', value: `2*2^-1`, has_leading_space: true }

// --- fail_wrong_real_exponent
// CHECK:STDOUT: - filename: fail_wrong_real_exponent
// CHECK:STDOUT:   tokens:

The current state is:

// --- fail_binary_real.carbon
// CHECK:STDOUT: - filename: fail_binary_real.carbon
// CHECK:STDOUT:   tokens: [
// CHECK:STDOUT:     { index: 0, kind:   'FileStart', line: {{ *\d+}}, column:  1, indent: 1, spelling: '' },

// CHECK:STDERR: fail_binary_real.carbon:[[@LINE+4]]:4: error(BinaryRealLiteral): binary real number literals are not supported
// CHECK:STDERR: 0b1.0
// CHECK:STDERR:    ^
// CHECK:STDERR:
0b1.0
// CHECK:STDOUT:     { index: 1, kind: 'RealLiteral', line: {{ *}}[[@LINE-1]], column:  1, indent: 1, spelling: '0b1.0', value: `2*2^-1`, has_leading_space: true },

// CHECK:STDOUT:     { index: 2, kind:     'FileEnd', line: {{ *}}[[@LINE+1]], column: {{ *\d+}}, indent: 1, spelling: '', has_leading_space: true },
// CHECK:STDOUT:   ]
// --- fail_wrong_real_exponent
// CHECK:STDOUT: - filename: fail_wrong_real_exponent
// CHECK:STDOUT:   tokens: [
// CHECK:STDOUT:     { index: 0, kind: 'FileStart', line: {{ *\d+}}, column:  1, indent: 1, spelling: '' },

For me, in the proposed output, the split (// --- fail_wrong_real_exponent) is easier to notice because there's a blank line above, and less text below (no FileStart token). The increased amount of whitespace makes the split stand out.

If FileStart/FileEnd is retained, it would look like:


// --- fail_binary_real.carbon
// CHECK:STDOUT: - filename: fail_binary_real.carbon
// CHECK:STDOUT:   tokens:
// CHECK:STDOUT:   - { index: 0, kind:   'FileStart', line: {{ *\d+}}, column:  1, indent: 1, spelling: '' },

// CHECK:STDERR: fail_binary_real.carbon:[[@LINE+4]]:4: error(BinaryRealLiteral): binary real number literals are not supported
// CHECK:STDERR: 0b1.0
// CHECK:STDERR:    ^
// CHECK:STDERR:
0b1.0
// CHECK:STDOUT:   - { index: 1, kind: 'RealLiteral', line: {{ *}}[[@LINE-1]], column: 1, indent: 1, spelling: '0b1.0', value: `2*2^-1`, has_leading_space: true }

// CHECK:STDOUT:   - { index: 2, kind:     'FileEnd', line: {{ *}}[[@LINE+1]], column: {{ *\d+}}, indent: 1, spelling: '', has_leading_space: true },
// --- fail_wrong_real_exponent
// CHECK:STDOUT: - filename: fail_wrong_real_exponent
// CHECK:STDOUT:   tokens:
// CHECK:STDOUT:   - { index: 0, kind: 'FileStart', line: {{ *\d+}}, column:  1, indent: 1, spelling: '' },

For me, this returns to making it hard to notice the --- fail_wrong_real_exponent split due to it being crowded by tokens.

And to give the example with []:

// --- fail_binary_real.carbon
// CHECK:STDOUT: - filename: fail_binary_real.carbon
// CHECK:STDOUT:   tokens: [

// CHECK:STDERR: fail_binary_real.carbon:[[@LINE+4]]:4: error(BinaryRealLiteral): binary real number literals are not supported
// CHECK:STDERR: 0b1.0
// CHECK:STDERR:    ^
// CHECK:STDERR:
0b1.0
// CHECK:STDOUT:     { index: 1, kind: 'RealLiteral', line: {{ *}}[[@LINE-1]], column:  1, indent: 1, spelling: '0b1.0', value: `2*2^-1`, has_leading_space: true },

// CHECK:STDOUT:   ]
// --- fail_wrong_real_exponent
// CHECK:STDOUT: - filename: fail_wrong_real_exponent
// CHECK:STDOUT:   tokens: [

Here, I was primarily thinking it's easier to read without the // CHECK:STDOUT: ] right before the split. That's maybe less because of crowding (though the lower amount of text helps) and more because I thought the - array syntax would be more helpful for reading. (Strictly speaking, I could also switch to an index: map, as we do in https://github.com/carbon-language/carbon-lang/blob/trunk/toolchain/check/testdata/basics/no_prelude/raw_ir.carbon.)

@jonmeow
Contributor Author

jonmeow commented Oct 22, 2024

Note, my specific intent here is to write more tests of lex using splits. But I'm hesitant to do that due to the overhead of a split in lex.

@jonmeow
Contributor Author

jonmeow commented Oct 22, 2024

One more example to illustrate what I mean with indices:

// --- fail_binary_real.carbon
// CHECK:STDOUT: - filename: fail_binary_real.carbon
// CHECK:STDOUT:   tokens:

// CHECK:STDERR: fail_binary_real.carbon:[[@LINE+4]]:4: error(BinaryRealLiteral): binary real number literals are not supported
// CHECK:STDERR: 0b1.0
// CHECK:STDERR:    ^
// CHECK:STDERR:
0b1.0
// CHECK:STDOUT:     1: { kind: 'RealLiteral', line: {{ *}}[[@LINE-1]], column:  1, indent: 1, spelling: '0b1.0', value: `2*2^-1`, has_leading_space: true },

// --- fail_wrong_real_exponent
// CHECK:STDOUT: - filename: fail_wrong_real_exponent
// CHECK:STDOUT:   tokens:

Note there that - { index: 1, becomes 1: {, switching from a sequence to a map (...with numeric keys). This would probably also mean switching from spaces to 0s for index padding, i.e. 01: once there are >= 10 entries.
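
As a rough sketch of the zero-padding idea, the helper below formats an index as a map key padded to the digit count of the largest index. The name PaddedIndexKey and the choice to pass the largest index explicitly are assumptions for illustration, not anything in the toolchain.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration; not part of the toolchain.
// Formats `index` as a YAML map key, zero-padded to the digit count of
// `max_index` so that keys line up in a column.
std::string PaddedIndexKey(std::size_t index, std::size_t max_index) {
  std::string digits = std::to_string(index);
  const std::size_t width = std::to_string(max_index).size();
  if (digits.size() < width) {
    // Pad with '0' rather than ' ': the padding sits in front of the key.
    digits = std::string(width - digits.size(), '0') + digits;
  }
  return digits + ":";
}
```

For example, PaddedIndexKey(1, 10) gives 01: while PaddedIndexKey(1, 9) gives 1:. The likely reason for zeros over spaces is that padding now has to go in front of the key, and leading whitespace there is YAML indentation.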

@chandlerc
Contributor

> For me, this returns to making it hard to notice the --- fail_wrong_real_exponent split due to it being crowded by tokens.

Thanks, these examples helped me understand the problem you were hoping to solve.

I definitely prefer the line-oriented list without the []s:

// tokens:
// - { ... }
// - { ... }

But it would seem nice to keep the file start/end, so wondering if there is another approach that would still really emphasize the file splits visually?

Specifically, could we force whitespace around the split comments?

Would it help to have more of a horizontal rule as part of the split syntax?

Thinking of something like:

// --- fail_binary_real.carbon

// CHECK:STDOUT: - filename: fail_binary_real.carbon
// CHECK:STDOUT:   tokens:
// CHECK:STDOUT:   - { index: 0, kind:   'FileStart', line: {{ *\d+}}, column:  1, indent: 1, spelling: '' },

// CHECK:STDERR: fail_binary_real.carbon:[[@LINE+4]]:4: error(BinaryRealLiteral): binary real number literals are not supported
// CHECK:STDERR: 0b1.0
// CHECK:STDERR:    ^
// CHECK:STDERR:
0b1.0
// CHECK:STDOUT:   - { index: 1, kind: 'RealLiteral', line: {{ *}}[[@LINE-1]], column: 1, indent: 1, spelling: '0b1.0', value: `2*2^-1`, has_leading_space: true }

// CHECK:STDOUT:   - { index: 2, kind:     'FileEnd', line: {{ *}}[[@LINE+1]], column: {{ *\d+}}, indent: 1, spelling: '', has_leading_space: true },

////////////////////////////////////////////////////////////////////////////////
// --- fail_wrong_real_exponent

// CHECK:STDOUT: - filename: fail_wrong_real_exponent
// CHECK:STDOUT:   tokens:
// CHECK:STDOUT:   - { index: 0, kind: 'FileStart', line: {{ *\d+}}, column:  1, indent: 1, spelling: '' },

@jonmeow
Contributor Author

jonmeow commented Oct 22, 2024

> > For me, this returns to making it hard to notice the --- fail_wrong_real_exponent split due to it being crowded by tokens.
>
> Thanks, these examples helped me understand the problem you were hoping to solve.
>
> I definitely prefer the line-oriented list without the []s:
>
> // tokens:
> // - { ... }
> // - { ... }
>
> But it would seem nice to keep the file start/end, so wondering if there is another approach that would still really emphasize the file splits visually?
>
> Specifically, could we force whitespace around the split comments?
>
> Would it help to have more of a horizontal rule as part of the split syntax?

Understood. To be sure, I think that'd require significant modifications to how file_test and autoupdate work. That's more work than I bargained for, and I'm not sure it would really help me: I think FileStart/FileEnd would still crowd the actual test data substantially (I'm partly concerned about the number of lines per test). So I'm just reverting the FileStart/FileEnd portion of this change, and I'll leave it to someone else if they want to work on lex test approaches.

@jonmeow jonmeow changed the title Modify lex yaml output to elide FileStart/End in tests. Change lex yaml output to use - for sequences. Oct 22, 2024
Contributor

@geoffromer geoffromer left a comment

FWIW I substantially preferred dropping the FileStart and FileEnd lines, but this still seems like an improvement.

@chandlerc
Contributor

> FWIW I substantially preferred dropping the FileStart and FileEnd lines, but this still seems like an improvement.

If you both prefer without start and end, let's do that.

@jonmeow jonmeow changed the title Change lex yaml output to use - for sequences. Modify lex yaml output to elide FileStart/End in tests. Oct 22, 2024
@jonmeow
Contributor Author

jonmeow commented Oct 22, 2024

Okay, switched back to the version of this PR that omits the FileStart/FileEnd in most tests.


b.AddFlag(
{
.name = "omit-file-boundary-tokens",
Contributor

Any reason to keep the flag? I'd be happy with at least making omission the default, or even removing the flag entirely.

Contributor Author

So that people get the actual representation by default.
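
A minimal sketch of what an opt-in flag like this could do, assuming a hypothetical Token record and a hypothetical TokensToPrint function (neither is the lexer's real type or API). The point it illustrates is that the default keeps every token, and only an explicit opt-in drops the boundary tokens.

```cpp
#include <string>
#include <vector>

// Hypothetical token record for illustration; the real lexer's types differ.
struct Token {
  int index;
  std::string kind;
};

// Returns the tokens that would be printed. When `omit_file_boundary_tokens`
// is false (the default in this sketch), every token is kept, so the output
// reflects the actual representation.
std::vector<Token> TokensToPrint(const std::vector<Token>& tokens,
                                 bool omit_file_boundary_tokens = false) {
  std::vector<Token> printed;
  for (const Token& token : tokens) {
    const bool is_boundary =
        token.kind == "FileStart" || token.kind == "FileEnd";
    if (omit_file_boundary_tokens && is_boundary) {
      continue;
    }
    // Indices are not renumbered, so the first printed token may be index 1.
    printed.push_back(token);
  }
  return printed;
}
```

Keeping the original indices matches the proposed test output earlier in the thread, where the RealLiteral token is still printed with index: 1 even though FileStart is not shown.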

@jonmeow jonmeow added this pull request to the merge queue Oct 23, 2024
Merged via the queue into carbon-language:trunk with commit 06f4eec Oct 23, 2024
8 checks passed
@jonmeow jonmeow deleted the lex-yaml-format branch October 23, 2024 19:16
3 participants