Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multiple issues with multi-strings #142

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

chqrlie
Copy link
Contributor

@chqrlie chqrlie commented Dec 22, 2024

lexer:

  • simplify lexer: do not check for multi-strings
  • remove Tokenizer.lex_string_literal_multi, Tokenizer.is_multi_string,
    Tokenizer.skip_to_next_string

parser:

  • concatenate multi-strings in Parser.parseStringLiteral
  • this allows multi-strings to be partially commented or preprocessed
  • fix incorrect parsing of "\1" "23" to produce 3 characters instead of 1

tests:

  • add tests in test/parser/multi_string.c2 for various issues

@chqrlie chqrlie force-pushed the strings branch 2 times, most recently from 34fbf28 to e9e803c Compare December 22, 2024 23:48
@bvdberg
Copy link
Member

bvdberg commented Dec 23, 2024

The convention in the git commit message is to use 'subject: msg'. Possible followed by an empty line and more text.
This makes it easier to see which commits are related (a group with the same subject)

@bvdberg
Copy link
Member

bvdberg commented Dec 23, 2024

  • I dont think bootstrap.c needs an update for this patch (i only update when needed, since it is always a big change)
  • please change commit msg to: 'subject: ..', subject could be parser/tokenizer here
  • I dont like having the parsing doing char stuff and using malloc's (since this will lead to mem leaks). The use of a string-buffer in the parser would be ok for this. Now the parser is handling char-level stuff instead of the tokenizer. I dont like that. What would be better is to let the tokenizer handle multi-string by also checking for comments in between.
  • maybe you could also add a test case for 'too many digits in hexadecimal escape sequence '

@chqrlie
Copy link
Contributor Author

chqrlie commented Dec 23, 2024

I dont think bootstrap.c needs an update for this patch (i only update when needed, since it is always a big change)

OK, I will be more careful about this.

please change commit msg to: 'subject: ..', subject could be parser/tokenizer here

OK, I will conform to current usage.

I dont like having the parsing doing char stuff and using malloc's (since this will lead to mem leaks). The use of a string-buffer in the parser would be ok for this. Now the parser is handling char-level stuff instead of the tokenizer. I dont like that. What would be better is to let the tokenizer handle multi-string by also checking for comments in between.

For the tokenizer to handle string concatenation is not semantically correct and would require recursion to skip comments and handle preprocessing correctly.

Regarding memory leaks, the operation allocates and frees the temporary block, which may cause a leak if p.pool.add(p3, len3, false) uses longjmp in case for error handling, yet currently it ultimately calls Block.create which uses malloc and does not check for allocation failure (like everywhere else).

Using a string_buffer.Buf in the parser for this tricky concatenation is possible.

Limiting the length of combined multi-strings and using a VLA or a local array is even simpler.

A more correct approach would be to create string concatenation nodes in the ast and delay the handling until the analysis phase or the generation phase. This would allow for multi-strings composed of different string types as is possible in C. I don't see a need for L"touché" nor u8"touché" in C2 as UTF-8 should be the default encoding except for explicit octal and hexadecimal escape sequences, but raw strings or triple quoted strings would be handy for multi-line text blocks.

maybe you could also add a test case for 'too many digits in hexadecimal escape sequence '

OK, will do. I should probably make a separate PR for the escape sequence fixes.

@chqrlie chqrlie force-pushed the strings branch 6 times, most recently from adfd6b1 to 0f9fafa Compare January 21, 2025 08:25
@bvdberg
Copy link
Member

bvdberg commented Jan 21, 2025

If you have an expression like "one" "two" "three", what the code does is to StringPool:

  • add 'one' (by tokenizer)
  • add 'one' (by parser, will be mapped to same)
  • add 'two' (by tokenizer)
  • add 'one two' (by parser)
  • add 'three'
  • add 'one two three'

So this is not very good. There will be duplicate strings anyway with this solution, but then let's
concatenate them first (on stack/elsewhere) and then add them to stringpool after while (tok == String)

@chqrlie
Copy link
Contributor Author

chqrlie commented Jan 22, 2025

That's a good point, I updated the code to avoid creating the intermediary strings in the string pool.
There will still be extra entries in the string pool, but limited to a factor of 2 and no impact on the generated code.

lexer:
* simplify lexer: do not check for multi-strings
* remove `Tokenizer.lex_string_literal_multi`, `Tokenizer.is_multi_string`,
  `Tokenizer.skip_to_next_string`

parser:
* concatenate multi-strings in `Parser.parseStringLiteral`
* this allows multi-strings to be partially commented or preprocessed
* fix incorrect parsing of `"\1" "23"` to produce 3 characters instead of 1

tests:
* add tests in test/parser/multi_string.c2 for various issues
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants