Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to handle Tokens when the lexical grammar might be non regular #463

Open
vikigenius opened this issue Jan 19, 2025 · 4 comments
Open

Comments

@vikigenius
Copy link

I am working on a toy CSS parser and it seems to have really complicated identifiers that don't seem like they can be captured by a regex?

https://www.w3.org/TR/css-syntax-3/#ident-token-diagram

And also what if you have to treat a Token differently based on surrounding context? Is it an identifier or an arithmetic operation (CSS/SASS allow hyphens in their identifier)?

How would I handle such cases in Logos? Would you recommend just combining the parser and lexer for cases like these and not use Logos?

@jeertmans
Copy link
Collaborator

Hi @vikigenius, this is probably possible using extras (Extra) or callbacks, did you check the examples and the book?

Context-based lexing is pretty much a matter of you implementing the necessary logic inside callbacks.

@vikigenius
Copy link
Author

@jeertmans thanks. I checked the examples in the book, and it does point towards this being possible in theory. But the examples using callbacks and extras are all for simple things like (type conversions) or keeping track of some minor additional state (such as line no. etc.)

None of them deal with changing the state of lexing itself. I have seen other lexers have modes. In theory this does seem possible in logos. But it would be good to have a more complex example with callbacks

@jeertmans
Copy link
Collaborator

Indeed, the examples are relatively simple, but that's also the point of examples: show what is possible with a minimal amount of code.

We could add more complex examples, but it takes time. If you manage to implement what you are looking for, please let me know, and I would be happy to add it as an example to the book!

@0x2a-42
Copy link
Contributor

0x2a-42 commented Jan 26, 2025

@vikigenius: A railroad diagram without recursion can always be transformed to a regex (the proof for this is trivial, as you can expand the diagram, at which point it is just another way to draw a finite automaton).

I haven't tested it, but something like the following should correspond to your railroad diagram.

#[derive(Logos, Debug, PartialEq, Copy, Clone)]
#[logos(subpattern newline = r"\n|\r\n|\r|\f")]
#[logos(subpattern whitespace = r"[ \t]|(?&newline)")]
#[logos(subpattern hexdigit = "[0-9a-fA-F]")]
#[logos(subpattern escape = r"\\((?&hexdigit){1,6}(?&whitespace)?)|[^\n\r\f0-9a-fA-F]")]
pub enum Token {
    #[regex("(-?([a-zA-Z_\u{0080}-\u{10FFFF}]|(?&escape))|--)([a-zA-Z_\u{0080}-\u{10FFFF}-]|(?&escape))*")]
    Identifier,
}

And also what if you have to treat a Token differently based on surrounding context? Is it an identifier or an arithmetic operation (CSS/SASS allow hyphens in their identifier)?

Typically the lexical grammar of a programming language is designed to follow the maximal munch principle, so the longest match wins. An example for this would be the ++ operator in C++ which is lexed as a single token instead of two + tokens. If this is not the case you can use callbacks with extras instead of regexes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants