Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with overlapping token definitions #420

Open
ccleve opened this issue Sep 13, 2024 · 18 comments
Open

Error with overlapping token definitions #420

ccleve opened this issue Sep 13, 2024 · 18 comments
Labels
question Further information is requested

Comments

@ccleve
Copy link

ccleve commented Sep 13, 2024

I'm getting a strange error when a regex could match the prefix of another regex. Maybe. I just don't know what the problem is. Here's a simplified case:


#[derive(Logos, Debug, PartialEq)]
#[logos(skip r".|[\r\n]")] // skip everything not recognized
pub enum LogosToken {
    // any letter except capital Z
    #[regex(r"[a-zA-Y]+", priority = 3)]
    WordExceptZ,

    // any number
    #[regex(r"[0-9]+", priority = 3)]
    Number,

    /*
    This expression is:
    (letter or number)* [Z] (letter or number)*
    In other words, a token with any number of letters or numbers,
    including at least one capital Z.
     */
    #[regex(r"[a-zA-Z0-9]*[Z][a-zA-Z0-9]*", priority = 3)]
    TermWithZ,
}

#[pg_extern]
fn test_logos() {
    let mut lex = LogosToken::lexer("hello 42world fooZfoo");
    while let Some(result) = lex.next() {
        let slice = lex.slice();
        println!("{:?} {:?}", slice, result);
    }
}

This generates:

"hello" Ok(WordExceptZ)
"42world" Err(())
"fooZfoo" Ok(TermWithZ)

If I replace the regex over TermWithZ with #[regex(r"Z", priority = 3)], I get:

"hello" Ok(WordExceptZ)
"42" Ok(Number)
"world" Ok(WordExceptZ)
"foo" Ok(WordExceptZ)
"Z" Ok(TermWithZ)
"foo" Ok(WordExceptZ)

The "42world" is getting recognized correctly as a number and word.

What I don't understand is, why does the first TermWithZ regex mess up the recognition of "42world"? It doesn't contain a Z, so TermWithZ should ignore it completely and let the first two variants do their job.

@ccleve
Copy link
Author

ccleve commented Sep 13, 2024

After looking at this a bit more, I'm guessing I'm running into the "no backtracking" limitation.

Any workaround suggestions are welcome.

@jeertmans
Copy link
Collaborator

Hello! Can you please run in debugging mode and printout the corresponding graph?

See https://logos.maciej.codes/debugging.html.

@ccleve
Copy link
Author

ccleve commented Sep 13, 2024

@jeertmans

{
    1: ::<skip> (<skip>),
    2: [80-BF] ⇒ 1,
    3: [A0-BF][80-BF] ⇒ 1,
    4: [80-BF][80-BF] ⇒ 1,
    5: [80-9F][80-BF] ⇒ 1,
    6: [90-BF][80-BF][80-BF] ⇒ 1,
    7: [80-BF][80-BF][80-BF] ⇒ 1,
    8: [80-8F][80-BF][80-BF] ⇒ 1,
    10: ::WordExceptZ,
    13: ::Number,
    16: ::TermWithZ,
    17: {
        [0-9] ⇒ 17,
        [A-Z] ⇒ 17,
        [a-z] ⇒ 17,
        _ ⇒ 16,
    },
    18: Z ⇒ 17,
    19: {
        [0-9] ⇒ 19,
        [A-Z] ⇒ 19,
        [a-z] ⇒ 19,
        _ ⇒ 18,
    },
    22: {
        [0-9] ⇒ 24,
        [A-Y] ⇒ 19,
        Z ⇒ 25,
        [a-z] ⇒ 19,
        _ ⇒ 13,
    },
    24: {
        [0-9] ⇒ 24,
        [A-Y] ⇒ 19,
        Z ⇒ 25,
        [a-z] ⇒ 19,
        _ ⇒ 13,
    },
    25: {
        [0-9] ⇒ 25,
        [A-Z] ⇒ 25,
        [a-z] ⇒ 25,
        _ ⇒ 16,
    },
    27: {
        [0-9] ⇒ 19,
        [A-Y] ⇒ 29,
        Z ⇒ 25,
        [a-z] ⇒ 29,
        _ ⇒ 10,
    },
    29: {
        [0-9] ⇒ 19,
        [A-Y] ⇒ 29,
        Z ⇒ 25,
        [a-z] ⇒ 29,
        _ ⇒ 10,
    },
    32: {
        [0-9] ⇒ 25,
        [A-Z] ⇒ 25,
        [a-z] ⇒ 25,
        _ ⇒ 16,
    },
    33: {
        [00-/] ⇒ 1,
        [0-9] ⇒ 22,
        [:-@] ⇒ 1,
        [A-Y] ⇒ 27,
        Z ⇒ 32,
        [[-`] ⇒ 1,
        [a-z] ⇒ 27,
        [{-7F] ⇒ 1,
        [C2-DF] ⇒ 2,
        [E0] ⇒ 3,
        [E1-EC] ⇒ 4,
        [ED] ⇒ 5,
        [EE-EF] ⇒ 4,
        [F0] ⇒ 6,
        [F1-F3] ⇒ 7,
        [F4] ⇒ 8,
    },
}

@jeertmans
Copy link
Collaborator

Hum, so I think you are right about the fact that the error might come from no backtracking issue, but a perfect Logos implementation shouldn't have that issue.

First question: are you using Logos >=0.14.0? As it may have fixed some issues.

Second: did you try not setting any priority? Here, you set the priority to 3 to all tokens, it doesn't make much sense as the priority is only used when two or more patterns match the same slice, and they are differentiated based on their priority. But, if the number is the same, it doesn't help. So please try without any priority, and only edit one priority at a time.

Last, it is often a source of issues to have patterns embedded in others, like TermWithZ containing both Word and Number, causing backtracking issues.

@jeertmans jeertmans added the question Further information is requested label Sep 15, 2024
@ccleve
Copy link
Author

ccleve commented Sep 15, 2024 via email

@jeertmans
Copy link
Collaborator

Yes, using 0.14.1. I had to set the priorities

Ok perfect.

3 because of the skip expression, #[logos(skip r".|[\r\n]")] The "." can match anything, so there was a conflict. Other than that, the priorities can all be the same.

Ok seems legit, wasn't aware of that.

I'm not sure I understand the third point. How else would you do it?

Usually, you can break down your logic in unique, non-overlapping, tokens, and then use callbacks and extras to handle more complex logic. Unfortunately, I don't have enough time to dig into this problem and understand really the root causes of why it doesn't work :-/

@ccleve
Copy link
Author

ccleve commented Jan 27, 2025

It's good to see a bit of activity on the no-backtracking issue. I'm not fully convinced that that's what the problem is here, though.

Can I bump this issue? It has come up again in my project.

@0x2a-42 @jeertmans

@jeertmans
Copy link
Collaborator

jeertmans commented Jan 27, 2025

Can I bump this issue? It has come up again in my project.

What do you mean with "bump"?

@ccleve
Copy link
Author

ccleve commented Jan 27, 2025 via email

@ccleve
Copy link
Author

ccleve commented Jan 27, 2025

Here is a simpler version of the problem:

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r" ")]
pub enum LogosTester {
    #[regex(r"[a-z]+")]
    Word,

    #[regex(r"[0-9]+")]
    Number,

    // identifier followed by X
    #[regex(r"[a-z]([a-z0-9])*X")]
    Fieldname,
}

pub fn logos_test() {
    let mut lex = LogosTester::lexer("a1");
    while let Some(result) = lex.next() {
        println!("{:?}", result);
    }
}

Running this returns Err(()).

Here is the debug dump:

{
    1: ::<skip> (<skip>),
    3: ::Word,
    6: ::Number,
    7: {
        [0-9] ⇒ 7,
        _ ⇒ 6,
    },
    9: ::Fieldname,
    10: X ⇒ 9,
    11: {
        [0-9] ⇒ 11,
        [a-z] ⇒ 11,
        _ ⇒ 10,
    },
    13: {
        [0-9] ⇒ 11,
        X ⇒ 9,
        [a-z] ⇒ 13,
        _ ⇒ 3,
    },
    15: {
          ⇒ 1,
        [0-9] ⇒ 7,
        [a-z] ⇒ 13,
    },
}

@0x2a-42
Copy link
Contributor

0x2a-42 commented Jan 27, 2025

@ccleve: Your example works with Herring, where the following minimized DFA is produced, which looks close enough to the Logos debug output.

flowchart LR
style start fill:#FFFFFF00, stroke:#FFFFFF00
start-->0;
0@{shape: circ}
0 -- "{0x20}" --> 1
0 -- "{'0'-'9'}" --> 2
0 -- "{'a'-'z'}" --> 3
1@{shape: dbl-circ}
2@{shape: dbl-circ}
2 -- "{'0'-'9'}" --> 2
3@{shape: dbl-circ}
3 -- "{'0'-'9'}" --> 4
3 -- "{'X'}" --> 5
3 -- "{'a'-'z'}" --> 3
4@{shape: circ}
4 -- "{'0'-'9', 'a'-'z'}" --> 4
4 -- "{'X'}" --> 5
5@{shape: dbl-circ}
skipped_regex_0[skipped regex]@{shape: rect}
1 .-> skipped_regex_0
Number_0[Number]@{shape: rect}
2 .-> Number_0
Word_0[Word]@{shape: rect}
3 .-> Word_0
Fieldname_0[Fieldname]@{shape: rect}
5 .-> Fieldname_0
Loading

Feel free to use that crate, if it solves your problem. It probably will not be merged into Logos anytime soon, as it currently requires an additional LLVM optimization to reach a similar performance as Logos.

The problem with the code generated by Logos seems to be, that in its state 11, where the end of input is reached, the state machine still transitions to state 10 where it will eventually return an error instead of the longest match.

fn goto11_ctx10_x<'s>(lex: &mut Lexer<'s>) {
    while let Some(arr) = lex.read::<&[u8; 16]>() {
        // unrolled loop ...
    }
    while lex.test(pattern0) {
        lex.bump_unchecked(1);
    }
    goto10_ctx10_x(lex);
}

To fix this Logos would have to keep track of the last accepting state and the corresponding offset, when it was visited.

The equivalent state 4 in the code generated by Herring looks like this.

State::S4 => {
    match lexer.next_byte() {
        Some(b) if LUT0[b as usize] & 1u8 > 0 => {
            state = State::S4;
            continue;
        }
        Some(88u8) => {
            state = State::S5;
            continue;
        }
        None => {
            lexer.offset -= 1;
            break;
        }
        _ => break,
    }
}

Here the None branch breaks out of the state machine loop and afterwards a Word token will be returned, as state 3 was the last accepting state that was visited.

State::S3 => {
    last_accept = LastAccept::Token(LogosTester::Word, lexer.offset);
    // ...
match last_accept {
    LastAccept::None => {
        use herring::Source;
        while !lexer.source.is_boundary(lexer.offset) {
            lexer.offset += 1;
        }
        return Some(Err(Default::default()));
    }
    LastAccept::Token(token, offset) => {
        lexer.offset = offset;
        return Some(Ok(token));
    }
    LastAccept::TokenCallback(callback, offset) => {
        lexer.offset = offset;
        return Some(callback(lexer));
    }
    LastAccept::Skip(offset) => {
        lexer.offset = offset;
    }
    LastAccept::SkipCallback(callback, offset) => {
        lexer.offset = offset;
        callback(lexer);
    }
}

@jeertmans
Copy link
Collaborator

https://www.google.com/search?client=firefox-b-1-d&q=bump+a+post+definition

To "bump" a post means to bump it up on the priority list, that is, to
bring it to your attention again.

I got this, but there is no such thing on GitHub, as we don't have tasks or projects. I will pin the issue, so it gives more visibility, but I barely put enough time into this project to answer questions, and I cannot really work on fixing bugs (especially complex ones).

@jeertmans jeertmans pinned this issue Jan 28, 2025
@ccleve
Copy link
Author

ccleve commented Jan 28, 2025

I got this, but there is no such thing on GitHub, as we don't have tasks or projects. I will pin the issue, so it gives more visibility, but I barely put enough time into this project to answer questions, and I cannot really work on fixing bugs (especially complex ones).

I understand, and thank you for putting in all work you've done so far. It's a great project, and we're grateful for it. @jeertmans

@ccleve
Copy link
Author

ccleve commented Jan 28, 2025

@0x2a-42 Herring sounds terrific, and I'll plug it in and see how it does today.

Question: is it possible to get the actual code that Herring generates? I think it's better to generate actual text files at build time and check them into the project, the way LALRPOP does it. I've found it super useful to be able to examine the generated code, and in one small corner case I had to modify the generated LALRPOP code to make something work.

@jeertmans
Copy link
Collaborator

I got this, but there is no such thing on GitHub, as we don't have tasks or projects. I will pin the issue, so it gives more visibility, but I barely put enough time into this project to answer questions, and I cannot really work on fixing bugs (especially complex ones).

I understand, and thank you for putting in all work you've done so far. It's a great project, and we're grateful for it. @jeertmans

All credits are to @maciejhirsz, I did nothing but maintain the tool and write some docs, but thanks ;)

@0x2a-42
Copy link
Contributor

0x2a-42 commented Jan 28, 2025

@ccleve: Thanks.

Question: is it possible to get the actual code that Herring generates? I think it's better to generate actual text files at build time and check them into the project, the way LALRPOP does it. I've found it super useful to be able to examine the generated code, and in one small corner case I had to modify the generated LALRPOP code to make something work.

Well, procedural macros are not really supposed to write files, so I would probably not add such a feature to the derive macro. The most sensible way to achieve this would be another input format (e.g. json) and an executable that generates code from this format.

If you don't need an automated process, you can just use cargo expand to inspect the generated code. It requires a tiny bit of manual cleanup if you use other derive macros such as Debug at the Token enum, as it will expand all macros.

@0x2a-42
Copy link
Contributor

0x2a-42 commented Jan 28, 2025

@ccleve: Adding something like the logos-cli or a library that can be called from the build.rs file would also be a possibility.

@ccleve
Copy link
Author

ccleve commented Jan 28, 2025

@0x2a-42

Adding something like the logos-cli or a library that can be called from the build.rs file would also be a possibility.

Thanks. I was unaware of logos-cli. But I think the build.rs approach would be the best. This is all I have to do with LALRPOP:

lalrpop::Configuration::new()
    .generate_in_source_tree()
    .process()
    .unwrap();

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants