Compose non-exclusive token with regex w/ diff priorities #397

pinkforest · 2024-06-09T01:24:32Z

I know the regex support is limited, but it would be nice to be able to write exclusive regexes without repetition.

E.g. in the example below I would like to match the token '(' and stop there, without even trying the regex below it.

use logos::Logos;

#[derive(Debug, Logos, PartialEq)]
pub enum Tokens<'hdr> {
    #[token("(", priority = 20)]
    CommentStart,

    #[regex(r#"[^\s\r\n\t;]+"#, |lex| lex.slice(), priority = 2)]
    MaybeValue(&'hdr str),
}

#[cfg(test)]
mod test {

    use super::*;

    #[test]
    fn comments() {
        let mut lexer = Tokens::lexer("(comment) value");
        assert_eq!(lexer.next(), Some(Ok(Tokens::CommentStart)));
    }
}

This results in the regex with priority = 2 overriding the token with priority = 20:

assertion `left == right` failed
  left: Some(Ok(MaybeValue("(comment)")))
 right: Some(Ok(CommentStart))

This happens regardless of which of the two has the higher priority. For example, if CommentStart has 2 instead of 20 and MaybeValue has 20 instead of 2, the result is the same, so the priority is effectively ignored:

    #[token("(", priority = 2)]
    CommentStart,

    #[regex(r#"[^\s\r\n\t;]+"#, |lex| lex.slice(), priority = 20)]
    MaybeValue(&'hdr str),

If I add ( to the excluded set [^..] of Tokens::MaybeValue, then it works, but it would be nice if priority could be used to compose regular expressions with tokens that may match the same input.

Looking at the codegen, both seem to be treated as regexes, but that doesn't explain why different priorities have no effect.

That said, the documentation could perhaps be filled out regarding limitations if this is not supported. Happy to help with docs or send PRs.

What is curious is that if the priorities are the same, you at least get a warning about both matching the same input, but given that the priorities differ, it could probably be treated as a composable grouping.

If I write everything as regexes then it also works, but it would be nice to compose tokens with regexes w/ diff priorities.

e.g. this works:

use logos::Logos;

#[derive(Debug, Logos, PartialEq)]
pub enum Tokens<'hdr> {
    #[regex(r#"\([a-z0-9\s]+\)"#, |lex| lex.slice())]
    WholeComment(&'hdr str),

    #[regex(r#"[^()\s\r\n\t;]+"#, |lex| lex.slice())]
    MaybeValue(&'hdr str),

    #[regex(r"[\s\r\n\t]+", |lex| lex.slice())]
    WHS(&'hdr str),
}

#[cfg(test)]
mod test {

    use super::*;

    #[test]
    fn whole_thing() {
        let mut lexer = Tokens::lexer("(comment) value");
        assert_eq!(lexer.next(), Some(Ok(Tokens::WholeComment("(comment)"))));
        assert_eq!(lexer.next(), Some(Ok(Tokens::WHS(" "))));
        assert_eq!(lexer.next(), Some(Ok(Tokens::MaybeValue("value"))));
    }
}

But my preference would be to use tokens where I can and leave regexes for where I can't use tokens.

I could always split this into different lexers, but constructing and morphing between different lexers is time-consuming.

jeertmans · 2024-06-10T07:38:20Z

Hello, thanks for creating this issue!

I think this is related to the other bugs with priorities, see also #265 and other related issues.

Sadly, I currently have no time to invest in this problem, but I hope someone smarter than me (and with more free time) can address this in the near future! That would greatly help the project!

0x2a-42 · 2025-01-27T00:39:13Z

In Logos the longest match always takes priority (maximal munch principle). So this is not a bug.

A simple example for this would be the following.

use logos::Logos;

#[derive(Logos)]
enum Token {
  #[regex("[a-z][a-z]*")]
  Identifier,
  #[regex("int")]
  Int,
}

Here the regex for Identifier could also match the input int. However, Logos uses a heuristic to decide which regex has a higher priority (other lexer generators such as flex use the definition order instead). In this case the regex int gets priority 6 and [a-z][a-z]* gets priority 2, so Int wins for the input int. The priority is only relevant when multiple regexes can match input of the same length.
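To make the rule concrete, here is a minimal sketch in plain Rust (no Logos involved) of "longest match first, priority only on ties". It is a toy model, not how Logos is implemented: the Rule struct, the matcher functions, next_token and example_rules are all hypothetical names made up for illustration, and the priorities 2 and 6 are simply the values quoted above.

```rust
/// Hypothetical token kinds mirroring the example above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Identifier,
    Int,
}

/// A toy rule: `matcher` returns the length of the match at the start of
/// the input (or None), and `priority` is used only to break ties.
struct Rule {
    kind: Kind,
    priority: u32,
    matcher: fn(&str) -> Option<usize>,
}

/// Models `[a-z][a-z]*`: the longest run of ASCII lowercase letters.
fn match_identifier(input: &str) -> Option<usize> {
    let len = input.bytes().take_while(|b| b.is_ascii_lowercase()).count();
    if len > 0 { Some(len) } else { None }
}

/// Models the literal `int`.
fn match_int(input: &str) -> Option<usize> {
    if input.starts_with("int") { Some(3) } else { None }
}

/// Longest match wins; priority is compared only among equal lengths,
/// because the key is ordered as (length, priority).
fn next_token(input: &str, rules: &[Rule]) -> Option<(Kind, usize)> {
    rules
        .iter()
        .filter_map(|r| (r.matcher)(input).map(|len| (len, r.priority, r.kind)))
        .max_by_key(|&(len, priority, _)| (len, priority))
        .map(|(len, _, kind)| (kind, len))
}

/// The two rules from the example, with the priorities quoted above.
fn example_rules() -> Vec<Rule> {
    vec![
        Rule { kind: Kind::Identifier, priority: 2, matcher: match_identifier },
        Rule { kind: Kind::Int, priority: 6, matcher: match_int },
    ]
}
```

With these rules, next_token("int", ...) picks Int because both rules match three bytes and Int has the higher priority, while next_token("integer", ...) picks Identifier because its match is longer, whatever the priorities are.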

Assigning an explicit priority is just an escape hatch for when the priority heuristic fails. In your example both regexes have the same priority and can match (, so by assigning a higher priority to the first regex you get a CommentStart if ( is the longest match.

I guess it would be possible to extend Logos such that one could specify a lazy match for certain regexes. However, I don't think this would be a good idea, as it would implicitly change the possible matches for other regexes in a non-obvious way. For example, if such a flag was used for the int regex in the above example, it would prevent integer from being lexed as an identifier.
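The integer case can be sketched in plain Rust. This is only a toy model: Logos has no such lazy flag, and next_maximal, next_lazy_int and lex_all are hypothetical names that hand-code the two rules from the example rather than using Logos.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Identifier,
    Int,
}

/// Models `[a-z][a-z]*`: the longest run of ASCII lowercase letters.
fn ident_len(input: &str) -> Option<usize> {
    let len = input.bytes().take_while(|b| b.is_ascii_lowercase()).count();
    if len > 0 { Some(len) } else { None }
}

/// Models the literal `int`.
fn int_len(input: &str) -> Option<usize> {
    if input.starts_with("int") { Some(3) } else { None }
}

/// Maximal munch: the longest match wins, and `int` wins only on a tie.
fn next_maximal(input: &str) -> Option<(Kind, usize)> {
    match (ident_len(input), int_len(input)) {
        (Some(id), Some(kw)) if kw >= id => Some((Kind::Int, kw)),
        (Some(id), _) => Some((Kind::Identifier, id)),
        (None, Some(kw)) => Some((Kind::Int, kw)),
        (None, None) => None,
    }
}

/// Hypothetical lazy flag on `int`: accept as soon as `int` matches,
/// without checking whether another rule could match more input.
fn next_lazy_int(input: &str) -> Option<(Kind, usize)> {
    match int_len(input) {
        Some(kw) => Some((Kind::Int, kw)),
        None => ident_len(input).map(|id| (Kind::Identifier, id)),
    }
}

/// Lexes until the input is exhausted or nothing matches.
fn lex_all<'a>(
    mut input: &'a str,
    step: fn(&str) -> Option<(Kind, usize)>,
) -> Vec<(Kind, &'a str)> {
    let mut out = Vec::new();
    while let Some((kind, len)) = step(input) {
        // Matches are ASCII-only here, so slicing at `len` is safe.
        out.push((kind, &input[..len]));
        input = &input[len..];
    }
    out
}
```

Under maximal munch, lex_all("integer", next_maximal) yields a single Identifier for integer. With the hypothetical lazy rule, lex_all("integer", next_lazy_int) yields Int for int followed by Identifier for eger, which is the non-obvious side effect described above.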

pinkforest · 2025-01-27T23:35:58Z

@0x2a-42 I have specified explicit weighted priorities manually, and I don't see how 20 is the same as 2?

0x2a-42 · 2025-01-28T12:26:56Z

In your example both regexes get assigned the same priority (2) by the heuristic, which is why you have to explicitly assign a priority. You assigned 20 to the first regex and 2 to the second regex. Therefore the first regex will be prioritized when ( is the longest match.

#[test]
fn priority() {
    let mut lexer = Tokens::lexer("(");
    assert_eq!(lexer.next(), Some(Ok(Tokens::CommentStart)));
}

If you assign the priorities the other way around, the assertion in this example will fail instead.

Note that the priority is only used to resolve ambiguities, when multiple regexes match with the same length. The longest match is always prioritized, so if a regex with a lower priority has a longer match it will still be chosen instead of a higher priority regex with a shorter match.

pinkforest changed the title ~~regex does not honor priority~~ Compose non-exclusive token with regex w/ diff priorities Jun 9, 2024

jeertmans added the duplicate and help wanted labels Jun 10, 2024
