Scopes on Recursive Regex Cause Problems #208

jeff-hykin · 2023-06-03T12:35:18Z

Non-issue example:

Adding scopes to the this capture group (a)+ will only apply the scopes to the last match. While annoying, that behavior is the expected behavior and it's easy enough to work around.

A similar behavior happens with recursion
(a)b\g<1>? this can match ababab but only tags the last a with a scope.

The issue

Let's say (a)\g<1>? is trivial recursion (just repetion)

And (basecase|(\[)\g<1>(\])) is basic recursion (balanced square brackets around the word "basecase")

Anything more complex than those, such as adding alternation, dual-recusion, branching recursion, becomes impossible to reliably highlight.

For illustration, applying a balanced parentheses pattern to [[basecase]] would only add scopes to "basecase". And while there might be a hacky way to color the outside brackets, that hack will fall apart as soon as there's a more complex matcher applied to something like [basecase lambda {basecase[basecase][basecase]}]. Only the last basecase will have scopes. It might only be possible to find "lambda" using a recursive pattern, but by using a recursive pattern we are unable to tag the "lambda", because it only adds scopes to the inner most basecase.

TLDR; this is fundamentally limiting, not just an inconvenience that can be worked around.

Not only is the behavior incredibly unintuitve, but add multiple-recusion (e.g. \g<1>|\g<2>) on top of it and it's debatably undefined behavior.

Rare? Well...

Even hello-world tutorials for defining languages (like a BNF grammar) use recursion like this. When defining a syntax it's easy to define it with this kind of recusion, so it's necessary that this work if those languages are ever going to be correctly parsed by textmate.

Requested solution

There are only a handful of textmate grammars that use complex recusion, and the ones that do almost certainly don't expect this behavior. So changing the scoping behavior shouldn't affect anyone negatively regardless of whether or matches the original textmate implementation.

With that in mind, having the FIRST recursive case be given scopes rather than the last basecase should avoid this problem.

The text was updated successfully, but these errors were encountered:

RedCMD · 2024-10-05T19:58:55Z

dup:
#127
#164

slevithan · 2025-01-19T16:42:56Z

@RedCMD its not clear to me that any of these 3 issues are duplicates. Especially #127 seems different since it doesn't reference regex subroutines.

@jeff-hykin TextMate grammars use the Oniguruma regex engine, and Oniguruma has particular, non-portable (and not particularly intuitive) behavior about how subroutines affect captures and backreferences (e.g. see the details I mentioned in #164). The concept of non-participating captures (and their specific handling in various cases) is also likely relevant, and doesn't work the same in every regex flavor.

Given that, it's not clear if you're saying that you wish Oniguruma worked differently (in which case IMO this should probably be closed, and any potential solution to this issue should probably be reframed around an assumption that Oniguruma subroutines will continue to work the way they work) or if you're saying there's an issue with vscode-textmate and it's not giving you the results you'd expect, given how Oniguruma works. This distinction is important since it frames any potential solution.

jeff-hykin · 2025-01-21T13:27:15Z

Framing is important, to a degree. If this was merely a bug in oniguruma, then yeah, it should be reported there so basically all textmate engines could "get the fix".

But with the issue being a design flaw, meaning if oniguruma changed, it would break downstream applications, we can't push this issue onto upstream.

That doesnt mean this is a non-issue. At the end of the day, its the responsibility of the end-application to use tools that create a good user experience.

I'm sure I won't be taken up on the offer, but the short of it is: the only way basic BNF defined grammars will ever be highlighted correctly in VS Code is if VS code makes a small fork of oniguruma, or switches regex engines.

jeff-hykin · 2025-01-21T13:29:08Z

I do want to point out, swapping engines might not be as insane as it initially sounds.

Translating oniguruma regex to PCRE is straightforward, and I don't know of any (desireable) features that are unique to oniguruma.

Its the C api of Oniguruma VS PCRE that I'm unaware of, and where all the work would probably be.

jeff-hykin · 2025-01-21T13:39:29Z

TLDR; I expect this to be closed as "an issue, but too much work to fix"

If the effort on this could be spent on getting the tree sitter into vs code, I'd prefer that.

slevithan · 2025-01-21T16:06:48Z

swapping engines might not be as insane as it initially sounds.

Unfortunately, it probably is. There's a long tail of TM grammars that rely on specific quirks of Oniguruma syntax and behavior.

Translating oniguruma regex to PCRE is straightforward

That's far from true if you don't want to break the long tail of TM grammars. This is an area I know a lot about since I created both Oniguruma-To-ES (an extremely precise Oniguruma to JavaScript regex translator) and Regex+ (which, among other features, emulates PCRE subroutines).

Comprehensive and precise regex flavor-to-flavor translations that sweat the details can be extremely complicated (much more than most people assume), and Oniguruma is one of the weirdest modern flavors out there when you dig into all the edge cases. It has a lot of syntax and behavior quirks that are nonportable and defy prior art from PCRE/Perl.

Subroutines are used very heavily in TM grammars, so I'd bet on some grammars relying on nearly every edge case of Oniguruma's subroutine behavior. Which is a shame, because I agree that PCRE's subroutines are superior (more intuitive, more useful, less complicated).

jeff-hykin · 2025-01-22T00:46:05Z

That's far from true if you don't want to break the long tail of TM grammars. This is an area I know a lot about since I created both Oniguruma-To-ES

Oh cool! Well I'm glad to be corrected before I tried going off to make my own translator. And I will probably make use of your library at some point!

Subroutines are used very heavily in TM grammars

Sorry, I probably contributed a good bit to that long tail 😅.

I am very surprised by that though. A few years ago I did some scraping/indexing of vs code grammars to build a complete-ish list of TM scopes. I decided to look through the regex too, and remeber being surprised at how few features were used.

slevithan · 2025-01-22T02:21:45Z

Yeah, there are many esoteric Oniguruma features and sub-features that are rarely used (or not used at all) in popular TM grammars (I'm primarily looking at grammars provided by Shiki). But one or two grammars might use them in one regex, making the whole grammar reliant on it. This applies to a lot of features collectively. And of the features that are more commonly used... there are a lot of subtle syntax and behavior edge cases to get right if you want identical matches, subpattern matches/indexes, and errors for all grammars and all potential target strings.

On the specific point of how subroutines replace the values in their capture slots: it wasn't relied on by many grammars, but there were some that did, and I had to build a complex system (that uses a JS RegExp subclass) to transfer the captured values into the right slots on match results in order to support this in Shiki's JS RegExp engine. There are also various other subroutine-related quirks that work differently from PCRE, including complicated interactions with backreference multiplexing (an Oniguruma-specific feature) and duplicate group names (which subroutines can't reference directly but might reference indirectly).

This was referenced Nov 5, 2023

Repository works only when it defined at top level of grammar file #140

Open

Subroutines breaking capture tokenizing inside of referenced capture group #164

Open

Syntax Highlighter Crashing jeff-hykin/better-cpp-syntax#637

Open

RedCMD referenced this issue in RedCMD/TmLanguage-Syntax-Highlighter Nov 19, 2023

Fix character class range bug and improve \\x{}&\\o{} code points

f48d1bd

RedCMD mentioned this issue Oct 5, 2024

multiply applied capture groups seems to ignore some captures #127

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scopes on Recursive Regex Cause Problems #208

Scopes on Recursive Regex Cause Problems #208

jeff-hykin commented Jun 3, 2023 •

edited

Loading

RedCMD commented Oct 5, 2024

slevithan commented Jan 19, 2025 •

edited

Loading

jeff-hykin commented Jan 21, 2025 •

edited

Loading

jeff-hykin commented Jan 21, 2025

jeff-hykin commented Jan 21, 2025 •

edited

Loading

slevithan commented Jan 21, 2025 •

edited

Loading

jeff-hykin commented Jan 22, 2025

slevithan commented Jan 22, 2025 •

edited

Loading

Scopes on Recursive Regex Cause Problems #208

Scopes on Recursive Regex Cause Problems #208

Comments

jeff-hykin commented Jun 3, 2023 • edited Loading

Non-issue example:

The issue

Rare? Well...

Requested solution

RedCMD commented Oct 5, 2024

slevithan commented Jan 19, 2025 • edited Loading

jeff-hykin commented Jan 21, 2025 • edited Loading

jeff-hykin commented Jan 21, 2025

jeff-hykin commented Jan 21, 2025 • edited Loading

slevithan commented Jan 21, 2025 • edited Loading

jeff-hykin commented Jan 22, 2025

slevithan commented Jan 22, 2025 • edited Loading

jeff-hykin commented Jun 3, 2023 •

edited

Loading

slevithan commented Jan 19, 2025 •

edited

Loading

jeff-hykin commented Jan 21, 2025 •

edited

Loading

jeff-hykin commented Jan 21, 2025 •

edited

Loading

slevithan commented Jan 21, 2025 •

edited

Loading

slevithan commented Jan 22, 2025 •

edited

Loading