Error with overlapping token definitions #420
After looking at this a bit more, I'm guessing I'm running into the "no backtracking" limitation. Any workaround suggestions are welcome.
Hello! Can you please run in debug mode and print out the corresponding graph?
Hum, so I think you are right that the error might come from the *no backtracking* issue, but a perfect Logos implementation shouldn't have that issue.
First question: are you using Logos >=0.14.0? It may have fixed some issues.
Second: did you try not setting any priority? Here, you set the priority to 3 on all tokens, which doesn't make much sense: priority is only used when two or more patterns match the same slice, and they are differentiated based on their priority. If the number is the same, it doesn't help. So please try without any priority, and only edit one priority at a time.
Last, it is often a source of issues to have patterns embedded in others, like TermWithZ containing both Word and Number, causing backtracking issues.
Yes, using 0.14.1.
I had to set the priorities > 3 because of the skip expression:
`#[logos(skip r".|[\r\n]")]`
The "." can match anything, so there was a conflict. Other than that, the priorities can all be the same.
I'm not sure I understand the third point. How else would you do it?
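The priority behavior discussed above can be sketched in plain Rust. This is an assumed model of Logos-style disambiguation for illustration only, not actual Logos code: the longest match wins, and priority only breaks ties between patterns that matched the same slice, so giving every token the same priority changes nothing.

```rust
// Assumed model of Logos-style disambiguation (illustrative, not the
// actual Logos source): the longest match wins, and `priority` only
// breaks ties between patterns that matched the same slice. Equal
// priorities therefore never help.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Candidate {
    token: &'static str,
    len: usize,   // length of the matched slice
    priority: u8, // pattern priority
}

fn pick(candidates: &[Candidate]) -> Option<Candidate> {
    candidates
        .iter()
        .copied()
        .max_by(|a, b| a.len.cmp(&b.len).then(a.priority.cmp(&b.priority)))
}
```

With matches of different lengths, the longer one wins no matter the priorities; only when the lengths are equal does the priority matter.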
Ok perfect.
Ok seems legit, wasn't aware of that.
Usually, you can break your logic down into unique, non-overlapping tokens, and then use callbacks and extras to handle more complex logic. Unfortunately, I don't have enough time to dig into this problem and really understand the root causes of why it doesn't work :-/
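A minimal sketch of that advice in plain Rust (this is not the Logos API; in Logos itself the recombination would live in callbacks or `extras`, and the token names here are made up): lex only simple, non-overlapping tokens, then merge them in a second pass.

```rust
// Sketch of the "lex simple tokens, recombine later" idea. `Marker`
// stands in for a trailing "X"/"Z" token; all names are illustrative.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tok { Word, Number, Marker }

#[derive(Debug, PartialEq)]
enum Merged { Word, Number, Fieldname } // Fieldname = Word + Marker

fn recombine(tokens: &[Tok]) -> Vec<Merged> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        match &tokens[i..] {
            // a Word immediately followed by a Marker becomes a Fieldname
            [Tok::Word, Tok::Marker, ..] => { out.push(Merged::Fieldname); i += 2; }
            [Tok::Word, ..] => { out.push(Merged::Word); i += 1; }
            [Tok::Number, ..] => { out.push(Merged::Number); i += 1; }
            [Tok::Marker, ..] => { i += 1; } // stray marker: skip (a policy choice)
            [] => break,
        }
    }
    out
}
```

Because each low-level token is non-overlapping, the lexer itself never needs backtracking; the ambiguity is resolved in the cheap second pass.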
It's good to see a bit of activity on the no-backtracking issue. I'm not fully convinced that that's what the problem is here, though.

Can I bump this issue? It has come up again in my project.
What do you mean by "bump"?
https://www.google.com/search?client=firefox-b-1-d&q=bump+a+post+definition
To "bump" a post means to bump it up on the priority list, that is, to
bring it to your attention again.
Here is a simpler version of the problem:
Running this returns:
Here is the debug dump:
@ccleve: Your example works with Herring, where the following minimized DFA is produced, which looks close enough to the Logos debug output.

```mermaid
flowchart LR
    style start fill:#FFFFFF00, stroke:#FFFFFF00
    start-->0;
    0@{shape: circ}
    0 -- "{0x20}" --> 1
    0 -- "{'0'-'9'}" --> 2
    0 -- "{'a'-'z'}" --> 3
    1@{shape: dbl-circ}
    2@{shape: dbl-circ}
    2 -- "{'0'-'9'}" --> 2
    3@{shape: dbl-circ}
    3 -- "{'0'-'9'}" --> 4
    3 -- "{'X'}" --> 5
    3 -- "{'a'-'z'}" --> 3
    4@{shape: circ}
    4 -- "{'0'-'9', 'a'-'z'}" --> 4
    4 -- "{'X'}" --> 5
    5@{shape: dbl-circ}
    skipped_regex_0[skipped regex]@{shape: rect}
    1 .-> skipped_regex_0
    Number_0[Number]@{shape: rect}
    2 .-> Number_0
    Word_0[Word]@{shape: rect}
    3 .-> Word_0
    Fieldname_0[Fieldname]@{shape: rect}
    5 .-> Fieldname_0
```
Feel free to use that crate, if it solves your problem. It probably will not be merged into Logos anytime soon, as it currently requires an additional LLVM optimization to reach performance similar to Logos.

The problem with the code generated by Logos seems to be that in its state 11, where the end of input is reached, the state machine still transitions to state 10, where it will eventually return an error instead of the longest match.

```rust
fn goto11_ctx10_x<'s>(lex: &mut Lexer<'s>) {
    while let Some(arr) = lex.read::<&[u8; 16]>() {
        // unrolled loop ...
    }
    while lex.test(pattern0) {
        lex.bump_unchecked(1);
    }
    goto10_ctx10_x(lex);
}
```

To fix this, Logos would have to keep track of the last accepting state and the corresponding offset at which it was visited. The equivalent state 4 in the code generated by Herring looks like this:

```rust
State::S4 => {
    match lexer.next_byte() {
        Some(b) if LUT0[b as usize] & 1u8 > 0 => {
            state = State::S4;
            continue;
        }
        Some(88u8) => {
            state = State::S5;
            continue;
        }
        None => {
            lexer.offset -= 1;
            break;
        }
        _ => break,
    }
}
```

Here the accepting state 3 records the last accepted token, and the final match on `last_accept` restores the lexer to the corresponding offset:

```rust
State::S3 => {
    last_accept = LastAccept::Token(LogosTester::Word, lexer.offset);
    // ...
}

match last_accept {
    LastAccept::None => {
        use herring::Source;
        while !lexer.source.is_boundary(lexer.offset) {
            lexer.offset += 1;
        }
        return Some(Err(Default::default()));
    }
    LastAccept::Token(token, offset) => {
        lexer.offset = offset;
        return Some(Ok(token));
    }
    LastAccept::TokenCallback(callback, offset) => {
        lexer.offset = offset;
        return Some(callback(lexer));
    }
    LastAccept::Skip(offset) => {
        lexer.offset = offset;
    }
    LastAccept::SkipCallback(callback, offset) => {
        lexer.offset = offset;
        callback(lexer);
    }
}
```
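The fix described above can be demonstrated with a small hand-rolled DFA. The states and token names follow the minimized DFA shown earlier in this thread (minus the skip rule); this is an illustration of the last-accepting-state technique, not Herring's or Logos' generated code.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tok { Number, Word, Fieldname }

// Returns the longest token starting at `start`, plus the end offset.
// On a dead end it falls back to the last accepting state seen, which
// is exactly the behavior argued above to be missing from Logos.
fn next_token(input: &[u8], start: usize) -> Option<(Tok, usize)> {
    let mut state = 0u8;
    let mut last_accept: Option<(Tok, usize)> = None;
    let mut i = start;
    while i < input.len() {
        state = match (state, input[i]) {
            (0, b'0'..=b'9') => 2,
            (0, b'a'..=b'z') => 3,
            (2, b'0'..=b'9') => 2,
            (3, b'0'..=b'9') => 4,
            (3, b'X') => 5,
            (3, b'a'..=b'z') => 3,
            (4, b'0'..=b'9' | b'a'..=b'z') => 4,
            (4, b'X') => 5,
            _ => break, // dead end: fall back to last_accept
        };
        i += 1;
        match state {
            2 => last_accept = Some((Tok::Number, i)),
            3 => last_accept = Some((Tok::Word, i)),
            5 => last_accept = Some((Tok::Fieldname, i)),
            _ => {} // state 4 is not accepting
        }
    }
    last_accept
}
```

With the fallback in place, "42world" lexes as Number followed by Word even though state 4 (digits seen after letters) dead-ends without an `X`.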
I got this, but there is no such thing on GitHub, as we don't have tasks or projects. I will pin the issue to give it more visibility, but I barely put enough time into this project to answer questions, and I cannot really work on fixing bugs (especially complex ones).
I understand, and thank you for putting in all the work you've done so far. It's a great project, and we're grateful for it. @jeertmans
@0x2a-42 Herring sounds terrific, and I'll plug it in and see how it does today. Question: is it possible to get the actual code that Herring generates? I think it's better to generate actual text files at build time and check them into the project, the way LALRPOP does it. I've found it super useful to be able to examine the generated code, and in one small corner case I had to modify the generated LALRPOP code to make something work.
All credit goes to @maciejhirsz; I did nothing but maintain the tool and write some docs, but thanks ;)
@ccleve: Thanks.
Well, procedural macros are not really supposed to write files, so I would probably not add such a feature to the derive macro. The most sensible way to achieve this would be another input format (e.g. JSON) and an executable that generates code from this format. If you don't need an automated process, you can just use logos-cli.
Thanks. I was unaware of logos-cli. But I think the build.rs approach would be the best. This is all I have to do with LALRPOP:
I'm getting a strange error when a regex could match the prefix of another regex. Maybe. I just don't know what the problem is. Here's a simplified case:
This generates:
If I replace the regex over TermWithZ with `#[regex(r"Z", priority = 3)]`, I get:
The "42world" is getting recognized correctly as a number and word.
What I don't understand is why the first TermWithZ regex messes up the recognition of "42world". It doesn't contain a Z, so TermWithZ should ignore it completely and let the first two variants do their job.
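Since the issue's exact regexes are elided above, here is a hypothetical reconstruction of the failure mode in plain Rust, assuming Number = `[0-9]+` and a TermWithZ-like pattern `[0-9a-z]+Z` (both assumptions, not the original code): a lexer that commits to the merged DFA path and never falls back to the last accepted token errors out on "42world", while one that remembers the last accept returns Number.

```rust
// Hypothetical illustration: assume Number = [0-9]+ and a TermWithZ-like
// pattern [0-9a-z]+Z. Starting on a digit, both patterns are viable, so
// the merged DFA keeps consuming [0-9a-z] hoping for a final 'Z'.
// Without remembering the last accepted token, a dead end becomes an
// error; with it, "42world" yields Number("42").
fn first_token(input: &[u8], remember_last_accept: bool) -> Result<(&'static str, usize), usize> {
    let mut last_accept = None;
    let mut i = 0;
    while i < input.len() && input[i].is_ascii_digit() {
        i += 1;
        last_accept = Some(("Number", i)); // Number accepts here
    }
    // keep going toward TermWithZ; no accepting state in this stretch
    while i < input.len() && (input[i].is_ascii_lowercase() || input[i].is_ascii_digit()) {
        i += 1;
    }
    if i > 0 && i < input.len() && input[i] == b'Z' {
        return Ok(("TermWithZ", i + 1));
    }
    if remember_last_accept {
        last_accept.ok_or(i)
    } else {
        Err(i) // committed DFA: dead end, report an error
    }
}
```

This is consistent with the symptom reported: the input never contains a Z, but the mere existence of the overlapping longer pattern changes how the shared prefix is consumed.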