How to handle Tokens when the lexical grammar might be non regular #463

vikigenius · 2025-01-19T04:22:09Z

I am working on a toy CSS parser and it seems to have really complicated identifiers that don't seem like they can be captured by a regex?

https://www.w3.org/TR/css-syntax-3/#ident-token-diagram

And also what if you have to treat a Token differently based on surrounding context? Is it an identifier or an arithmetic operation (CSS/SASS allow hyphens in their identifier)?

How would I handle such cases in Logos? Would you recommend just combining the parser and lexer for cases like these and not use Logos?

jeertmans · 2025-01-19T14:47:44Z

Hi @vikigenius, this is probably possible using extras (Extra) or callbacks, did you check the examples and the book?

Context-based lexing is pretty much a matter of you implementing the necessary logic inside callbacks.

vikigenius · 2025-01-20T01:56:59Z

@jeertmans thanks. I checked the examples in the book, and it does point towards this being possible in theory. But the examples using callbacks and extras are all for simple things like (type conversions) or keeping track of some minor additional state (such as line no. etc.)

None of them deal with changing the state of lexing itself. I have seen other lexers have modes. In theory this does seem possible in logos. But it would be good to have a more complex example with callbacks

jeertmans · 2025-01-20T10:25:03Z

Indeed, the examples are relatively simple, but that's also the point of examples: show what is possible with a minimal amount of code.

We could add more complex examples, but it takes time. If you manage to implement what you are looking for, please let me know, and I would be happy to add it as an example to the book!

0x2a-42 · 2025-01-26T23:31:18Z

@vikigenius: A railroad diagram without recursion can always be transformed to a regex (the proof for this is trivial, as you can expand the diagram, at which point it is just another way to draw a finite automaton).

I haven't tested it, but something like the following should correspond to your railroad diagram.

#[derive(Logos, Debug, PartialEq, Copy, Clone)]
#[logos(subpattern newline = r"\n|\r\n|\r|\f")]
#[logos(subpattern whitespace = r"[ \t]|(?&newline)")]
#[logos(subpattern hexdigit = "[0-9a-fA-F]")]
#[logos(subpattern escape = r"\\((?&hexdigit){1,6}(?&whitespace)?)|[^\n\r\f0-9a-fA-F]")]
pub enum Token {
    #[regex("(-?([a-zA-Z_\u{0080}-\u{10FFFF}]|(?&escape))|--)([a-zA-Z_\u{0080}-\u{10FFFF}-]|(?&escape))*")]
    Identifier,
}

And also what if you have to treat a Token differently based on surrounding context? Is it an identifier or an arithmetic operation (CSS/SASS allow hyphens in their identifier)?

Typically the lexical grammar of a programming language is designed to follow the maximal munch principle, so the longest match wins. An example for this would be the ++ operator in C++ which is lexed as a single token instead of two + tokens. If this is not the case you can use callbacks with extras instead of regexes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to handle Tokens when the lexical grammar might be non regular #463

How to handle Tokens when the lexical grammar might be non regular #463

vikigenius commented Jan 19, 2025

jeertmans commented Jan 19, 2025

vikigenius commented Jan 20, 2025

jeertmans commented Jan 20, 2025

0x2a-42 commented Jan 26, 2025

How to handle Tokens when the lexical grammar might be non regular #463

How to handle Tokens when the lexical grammar might be non regular #463

Comments

vikigenius commented Jan 19, 2025

jeertmans commented Jan 19, 2025

vikigenius commented Jan 20, 2025

jeertmans commented Jan 20, 2025

0x2a-42 commented Jan 26, 2025