-
-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to handle Tokens when the lexical grammar might be non regular #463
Comments
Hi @vikigenius, this is probably possible using extras (Extra) or callbacks, did you check the examples and the book? Context-based lexing is pretty much a matter of you implementing the necessary logic inside callbacks. |
@jeertmans thanks. I checked the examples in the book, and it does point towards this being possible in theory. But the examples using callbacks and extras are all for simple things like (type conversions) or keeping track of some minor additional state (such as line no. etc.) None of them deal with changing the state of lexing itself. I have seen other lexers have |
Indeed, the examples are relatively simple, but that's also the point of examples: show what is possible with a minimal amount of code. We could add more complex examples, but it takes time. If you manage to implement what you are looking for, please let me know, and I would be happy to add it as an example to the book! |
@vikigenius: A railroad diagram without recursion can always be transformed to a regex (the proof for this is trivial, as you can expand the diagram, at which point it is just another way to draw a finite automaton). I haven't tested it, but something like the following should correspond to your railroad diagram. #[derive(Logos, Debug, PartialEq, Copy, Clone)]
#[logos(subpattern newline = r"\n|\r\n|\r|\f")]
#[logos(subpattern whitespace = r"[ \t]|(?&newline)")]
#[logos(subpattern hexdigit = "[0-9a-fA-F]")]
#[logos(subpattern escape = r"\\((?&hexdigit){1,6}(?&whitespace)?)|[^\n\r\f0-9a-fA-F]")]
pub enum Token {
#[regex("(-?([a-zA-Z_\u{0080}-\u{10FFFF}]|(?&escape))|--)([a-zA-Z_\u{0080}-\u{10FFFF}-]|(?&escape))*")]
Identifier,
}
Typically the lexical grammar of a programming language is designed to follow the maximal munch principle, so the longest match wins. An example for this would be the |
I am working on a toy CSS parser and it seems to have really complicated identifiers that don't seem like they can be captured by a regex?
https://www.w3.org/TR/css-syntax-3/#ident-token-diagram
And also what if you have to treat a Token differently based on surrounding context? Is it an identifier or an arithmetic operation (CSS/SASS allow hyphens in their identifier)?
How would I handle such cases in Logos? Would you recommend just combining the parser and lexer for cases like these and not use Logos?
The text was updated successfully, but these errors were encountered: