Fix multiple issues with multi-strings #142

chqrlie · 2024-12-22T22:22:09Z

lexer:

simplify lexer: do not check for multi-strings
remove Tokenizer.lex_string_literal_multi, Tokenizer.is_multi_string,
Tokenizer.skip_to_next_string

parser:

concatenate multi-strings in Parser.parseStringLiteral
this allows multi-strings to be partially commented or preprocessed
fix incorrect parsing of "\1" "23" to produce 3 characters instead of 1

tests:

add tests in test/parser/multi_string.c2 for various issues

bvdberg · 2024-12-23T06:48:48Z

The convention in the git commit message is to use 'subject: msg'. Possible followed by an empty line and more text.
This makes it easier to see which commits are related (a group with the same subject)

bvdberg · 2024-12-23T07:01:29Z

I dont think bootstrap.c needs an update for this patch (i only update when needed, since it is always a big change)
please change commit msg to: 'subject: ..', subject could be parser/tokenizer here
I dont like having the parsing doing char stuff and using malloc's (since this will lead to mem leaks). The use of a string-buffer in the parser would be ok for this. Now the parser is handling char-level stuff instead of the tokenizer. I dont like that. What would be better is to let the tokenizer handle multi-string by also checking for comments in between.
maybe you could also add a test case for 'too many digits in hexadecimal escape sequence '

chqrlie · 2024-12-23T07:59:55Z

I dont think bootstrap.c needs an update for this patch (i only update when needed, since it is always a big change)

OK, I will be more careful about this.

please change commit msg to: 'subject: ..', subject could be parser/tokenizer here

OK, I will conform to current usage.

I dont like having the parsing doing char stuff and using malloc's (since this will lead to mem leaks). The use of a string-buffer in the parser would be ok for this. Now the parser is handling char-level stuff instead of the tokenizer. I dont like that. What would be better is to let the tokenizer handle multi-string by also checking for comments in between.

For the tokenizer to handle string concatenation is not semantically correct and would require recursion to skip comments and handle preprocessing correctly.

Regarding memory leaks, the operation allocates and frees the temporary block, which may cause a leak if p.pool.add(p3, len3, false) uses longjmp in case for error handling, yet currently it ultimately calls Block.create which uses malloc and does not check for allocation failure (like everywhere else).

Using a string_buffer.Buf in the parser for this tricky concatenation is possible.

Limiting the length of combined multi-strings and using a VLA or a local array is even simpler.

A more correct approach would be to create string concatenation nodes in the ast and delay the handling until the analysis phase or the generation phase. This would allow for multi-strings composed of different string types as is possible in C. I don't see a need for L"touché" nor u8"touché" in C2 as UTF-8 should be the default encoding except for explicit octal and hexadecimal escape sequences, but raw strings or triple quoted strings would be handy for multi-line text blocks.

maybe you could also add a test case for 'too many digits in hexadecimal escape sequence '

OK, will do. I should probably make a separate PR for the escape sequence fixes.

bvdberg · 2025-01-21T09:09:40Z

If you have an expression like "one" "two" "three", what the code does is to StringPool:

add 'one' (by tokenizer)
add 'one' (by parser, will be mapped to same)
add 'two' (by tokenizer)
add 'one two' (by parser)
add 'three'
add 'one two three'

So this is not very good. There will be duplicate strings anyway with this solution, but then let's
concatenate them first (on stack/elsewhere) and then add them to stringpool after while (tok == String)

chqrlie · 2025-01-22T03:48:21Z

That's a good point, I updated the code to avoid creating the intermediary strings in the string pool.
There will still be extra entries in the string pool, but limited to a factor of 2 and no impact on the generated code.

lexer: * simplify lexer: do not check for multi-strings * remove `Tokenizer.lex_string_literal_multi`, `Tokenizer.is_multi_string`, `Tokenizer.skip_to_next_string` parser: * concatenate multi-strings in `Parser.parseStringLiteral` * this allows multi-strings to be partially commented or preprocessed * fix incorrect parsing of `"\1" "23"` to produce 3 characters instead of 1 tests: * add tests in test/parser/multi_string.c2 for various issues

chqrlie force-pushed the strings branch 2 times, most recently from 34fbf28 to e9e803c Compare December 22, 2024 23:48

chqrlie force-pushed the strings branch 6 times, most recently from adfd6b1 to 0f9fafa Compare January 21, 2025 08:25

chqrlie force-pushed the strings branch from 0f9fafa to aa56870 Compare January 22, 2025 05:06

chqrlie force-pushed the strings branch from aa56870 to 38f5173 Compare January 22, 2025 11:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix multiple issues with multi-strings #142

Fix multiple issues with multi-strings #142

chqrlie commented Dec 22, 2024 •

edited

Loading

bvdberg commented Dec 23, 2024 •

edited

Loading

bvdberg commented Dec 23, 2024

chqrlie commented Dec 23, 2024

bvdberg commented Jan 21, 2025

chqrlie commented Jan 22, 2025 •

edited

Loading

Fix multiple issues with multi-strings #142

Are you sure you want to change the base?

Fix multiple issues with multi-strings #142

Conversation

chqrlie commented Dec 22, 2024 • edited Loading

bvdberg commented Dec 23, 2024 • edited Loading

bvdberg commented Dec 23, 2024

chqrlie commented Dec 23, 2024

bvdberg commented Jan 21, 2025

chqrlie commented Jan 22, 2025 • edited Loading

chqrlie commented Dec 22, 2024 •

edited

Loading

bvdberg commented Dec 23, 2024 •

edited

Loading

chqrlie commented Jan 22, 2025 •

edited

Loading