
Oniguruma-To-ES

A lightweight Oniguruma to JavaScript RegExp transpiler that runs in the browser or on your server. Use it to:

  • Take advantage of Oniguruma's extended regex capabilities in JavaScript.
  • Run regexes intended for Oniguruma in JavaScript, such as those used in TextMate grammars (used by VS Code, Shiki syntax highlighter, etc.).
  • Share regexes across your Ruby and JavaScript code.

Compared to running the actual Oniguruma C library in JavaScript via WASM bindings (e.g. via vscode-oniguruma or node-oniguruma), this library is much lighter weight and its regexes run much faster since they run as native JavaScript.

Oniguruma-To-ES deeply understands all of the hundreds of large and small differences between Oniguruma and JavaScript regex syntax and behavior, across multiple JavaScript version targets. It's obsessive about precisely following Oniguruma syntax rules and ensuring that the emulated features it supports have exactly the same behavior, even in extreme edge cases. A few uncommon features can't be perfectly emulated and allow rare differences, but if you don't want to allow this, you can disable the allowBestEffort option to throw for such patterns (see details below).

πŸ•ΉοΈ Install and use

npm install oniguruma-to-es
import {compile} from 'oniguruma-to-es';

In browsers:

<script type="module">
  import {compile} from 'https://esm.run/oniguruma-to-es';
  compile(String.raw`…`);
</script>
Using a global name (no import)
<script src="https://cdn.jsdelivr.net/npm/oniguruma-to-es/dist/index.min.js"></script>
<script>
  const {compile} = OnigurumaToES;
</script>
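
The String.raw tag in the browser example matters because regex patterns are full of backslashes, which ordinary JavaScript string literals consume as escapes. A small illustration using only native JavaScript (the pattern here is just an example chosen for this sketch):

```javascript
// With String.raw, backslashes reach the transpiler unchanged.
const rawPattern = String.raw`\h+\.\d`;

// The same pattern as an ordinary string literal needs every backslash doubled.
const escapedPattern = '\\h+\\.\\d';

console.log(rawPattern === escapedPattern); // true

// Without String.raw or doubling, an unrecognized escape like "\h"
// silently loses its backslash.
const lossyPattern = '\h+';
console.log(lossyPattern); // h+
```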

🔑 API

compile

Transpiles an Oniguruma regex pattern and flags to native JavaScript.

function compile(
  pattern: string,
  flags?: OnigurumaFlags,
  options?: CompileOptions
): {
  pattern: string;
  flags: string;
};

The returned pattern and flags can be provided directly to the JavaScript RegExp constructor. Various JavaScript flags might have been added or removed compared to the Oniguruma flags provided, as part of the emulation process.
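
Because compile returns plain strings rather than a RegExp, its result can be stored (for example, in a build-time cache) and turned into a RegExp later. A minimal sketch of that round trip follows. The compiled object is a hand-written stand-in for a compile result (the library's real output for a given pattern may look different), and reviveRegExp is a hypothetical helper name.

```javascript
// Hand-written stand-in for the {pattern, flags} object that compile() returns.
const compiled = {pattern: '(?<year>\\d{4})-(?<month>\\d{2})', flags: 'u'};

// Plain strings survive JSON serialization.
const serialized = JSON.stringify(compiled);

// Hypothetical helper: rebuild a native RegExp from the stored result.
function reviveRegExp(json) {
  const {pattern, flags} = JSON.parse(json);
  return new RegExp(pattern, flags);
}

const revived = reviveRegExp(serialized);
console.log(revived.exec('2024-11').groups.month); // 11
```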

Type OnigurumaFlags

A string with i, m, and x in any order (all optional).

Important

Oniguruma and JavaScript both have an m flag but with different meanings. Oniguruma's m is equivalent to JavaScript's s (dotAll).
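
The difference is easy to see with native JavaScript regexes alone. The following sketch does not use the library; it only shows what JavaScript's own s and m flags do:

```javascript
const text = 'a\nb';

// JS flag s (dotAll): dot also matches newlines. This is what Oniguruma's m means.
const dotAllMatches = /a.b/s.test(text);
console.log(dotAllMatches); // true

// JS flag m only changes ^ and $; dot still stops at the newline.
const jsMultilineDot = /a.b/m.test(text);
console.log(jsMultilineDot); // false

// With JS flag m, ^ and $ match at line boundaries.
const anchorsWithM = /^b$/m.test(text);
const anchorsWithoutM = /^b$/.test(text);
console.log(anchorsWithM, anchorsWithoutM); // true false
```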

Type CompileOptions

type CompileOptions = {
    allowBestEffort?: boolean;
    maxRecursionDepth?: number | null;
    optimize?: boolean;
    target?: 'ES2018' | 'ES2024' | 'ESNext';
};

See Options for more details.

toRegExp

Transpiles an Oniguruma regex pattern and flags and returns a native JavaScript RegExp.

function toRegExp(
  pattern: string,
  flags?: string,
  options?: CompileOptions
): RegExp;

The flags string can be any combination of Oniguruma flags i, m, and x, plus JavaScript flags d and g. Oniguruma's flag m is equivalent to JavaScript's flag s. See Options for more details.

Tip

Try it in the demo REPL.

toOnigurumaAst

Generates an Oniguruma AST from an Oniguruma pattern and flags.

function toOnigurumaAst(
  pattern: string,
  flags?: OnigurumaFlags
): OnigurumaAst;

toRegexAst

Generates a regex AST from an Oniguruma pattern and flags.

function toRegexAst(
  pattern: string,
  flags?: OnigurumaFlags
): RegexAst;

The regex library's syntax and behavior are a strict superset of native JavaScript, so the AST is very close to representing native ESNext JavaScript RegExp but with some added features (atomic groups, possessive quantifiers, recursion). The regex AST doesn't use some of the library's extended features, such as flag x or subroutines, because they follow PCRE behavior and work somewhat differently than in Oniguruma. The AST represents what's needed to precisely reproduce the Oniguruma behavior using regex.
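
Atomic groups have no native JavaScript syntax, which is why they appear as an added feature. One widely known way to emulate them is a lookahead followed by a backreference, since JavaScript never backtracks into a completed lookahead. The sketch below illustrates that general technique with hand-written regexes; it is not a statement about what this library emits.

```javascript
// Emulates the atomic group in (?>a+)ab: the lookahead captures all the
// "a"s it can, and \1 consumes them without giving any back.
const atomicLike = /^(?=(a+))\1ab/;

// The ordinary version can backtrack, releasing one "a" for the literal.
const backtracking = /^(?:a+)ab/;

console.log(backtracking.test('aaab')); // true
console.log(atomicLike.test('aaab'));   // false

// When no backtracking is needed, both forms agree.
console.log(/^(?=(a+))\1b/.test('aaab')); // true
```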

🔩 Options

These options are shared by functions compile and toRegExp.

allowBestEffort

Allows results that differ from Oniguruma in rare cases. If false, throws if the pattern can't be emulated with identical behavior for the given target.

Default: true.

More details

Specifically, this option enables the following additional features, depending on target:

  • All targets (ESNext and earlier):
    • Enables use of \X using a close approximation of a Unicode extended grapheme cluster.
    • Enables recursion (e.g. via \g<0>) using a depth limit specified via option maxRecursionDepth.
  • ES2024 and earlier:
    • Enables use of case-insensitive backreferences to case-sensitive groups.
  • ES2018:
    • Enables use of POSIX classes [:graph:] and [:print:] using ASCII versions rather than the Unicode versions available for ES2024 and later. Other POSIX classes always use Unicode.
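
Since disabling allowBestEffort makes inexact patterns throw, a caller can try strict mode first and fall back only when needed, recording which path was taken. The following is a sketch under assumptions: compileStrictFirst and the exact property are hypothetical names, and compileFn is expected to be a function that behaves like the library's compile.

```javascript
// Hypothetical helper: prefer exact emulation, fall back to best effort.
// `compileFn` is assumed to have the same signature as the library's compile.
function compileStrictFirst(compileFn, pattern, flags = '', options = {}) {
  try {
    const result = compileFn(pattern, flags, {...options, allowBestEffort: false});
    return {...result, exact: true};
  } catch (err) {
    // Retry with best effort allowed. If the pattern is invalid or unsupported
    // for other reasons, this second call throws and the error propagates.
    const result = compileFn(pattern, flags, {...options, allowBestEffort: true});
    return {...result, exact: false};
  }
}
```

With the real library, this would be called as compileStrictFirst(compile, pattern) after importing compile. A result with exact set to false signals that the regex relies on one of the approximations listed above.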

maxRecursionDepth

If null, any use of recursion throws. If an integer between 2 and 100 (and allowBestEffort is true), common recursion forms are supported and recurse up to the specified max depth.

Default: 6.

More details

Using a high limit is not a problem if needed. Although there can be a performance cost (minor unless it's exacerbating an existing issue with runaway backtracking), there is no effect on regexes that don't use recursion.
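
To build intuition for what a depth limit means, recursion can be unrolled by hand into a non-recursive pattern that nests a fixed number of times. The sketch below does this for a balanced-parentheses pattern using native JavaScript only. It is a conceptual illustration: unrollBalancedParens is a hypothetical helper, and how the library counts depth or what source it generates may differ.

```javascript
// Conceptual sketch: unroll balanced-parentheses recursion to a fixed depth.
// Recursive Oniguruma form being imitated: \((?:[^()]|\g<0>)*\)
function unrollBalancedParens(depth) {
  // Below the deepest allowed level, (?!) never matches, so nesting stops.
  let pattern = '(?!)';
  for (let i = 0; i < depth; i++) {
    pattern = `\\((?:[^()]|${pattern})*\\)`;
  }
  return pattern;
}

const depth2 = new RegExp(`^${unrollBalancedParens(2)}$`);
console.log(depth2.test('(a(b)c)')); // true
console.log(depth2.test('((()))'));  // false: needs three levels of nesting
```

The unrolled source grows with each level, which is consistent with a cost that applies only to regexes that actually use recursion.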

optimize

Simplifies the generated pattern when doing so doesn't change its meaning.

Default: true.

target

Sets the JavaScript language version for generated patterns and flags. Later targets allow faster processing, simpler generated source, and support for additional features.

Default: 'ES2024'.

More details
  • ES2018: Uses JS flag u.
    • Emulation restrictions: Character class intersection, nested negated character classes, and Unicode properties added after ES2018 are not allowed.
    • Generated regexes might use ES2018 features that require Node.js 10 or a browser version released during 2018 to 2023 (in Safari's case). Minimum requirement for any regex is Node.js 6 or a 2016-era browser.
  • ES2024: Uses JS flag v.
    • No emulation restrictions.
    • Generated regexes require Node.js 20 or a 2023-era browser (compat table).
  • ESNext: Uses JS flag v and allows use of flag groups and duplicate group names.
    • Benefits: Faster transpilation, simpler generated source, and duplicate group names are preserved across separate alternation paths.
    • Generated regexes might use features that require Node.js 23 or a 2024-era browser (except Safari, which lacks support).
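
When transpiling at runtime, the target could be chosen by probing what the current engine accepts. The following is a minimal sketch of that idea, not something the library is documented to provide: canCompile and detectTarget are hypothetical helpers, and the probes (flag v, a flag group, and duplicate group names) are taken from the target descriptions above.

```javascript
// Returns true if the engine accepts the given pattern and flags.
function canCompile(pattern, flags) {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: pick the most capable target the engine supports.
// Assumes the engine supports at least the ES2018 feature set.
function detectTarget() {
  if (!canCompile('', 'v')) return 'ES2018';
  const hasFlagGroups = canCompile('(?i:a)', 'v');
  const hasDuplicateNames = canCompile('(?<n>a)|(?<n>b)', 'v');
  return hasFlagGroups && hasDuplicateNames ? 'ESNext' : 'ES2024';
}

console.log(detectTarget());
```

The returned string could then be passed as the target option. For regexes that are pre-transpiled and shipped to unknown browsers, a fixed conservative target is likely the safer choice.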

✅ Supported features

Following are the supported features by target. Targets ES2024 and ESNext have the same emulation capabilities, although resulting regexes might differ (though not in the strings they match).

Notice that nearly every feature has at least subtle differences from JavaScript. Some features and sub-features listed as unsupported can be added in future versions, but some are not emulatable with native JavaScript regexes. Unsupported features throw an error.

| Feature | | Example | ES2018 | ES2024+ | Subfeatures & JS differences |
| --- | --- | --- | --- | --- | --- |
| Flags | i | i | ✅ | ✅ | ✔ Unicode case folding (same as JS with flag u, v) |
| | m | m | ✅ | ✅ | ✔ Equivalent to JS flag s (dotAll) |
| | x | x | ✅ | ✅ | ✔ Unicode whitespace ignored |
| | | | | | ✔ Line comments with # |
| | | | | | ✔ Whitespace/comments allowed between a token and its quantifier |
| | | | | | ✔ Whitespace/comments between a quantifier and the ?/+ that makes it lazy/possessive changes it to a chained quantifier |
| | | | | | ✔ Whitespace/comments separate tokens (ex: \1 0) |
| | | | | | ✔ Whitespace and # not ignored in char classes |
| Flag modifiers | Groups | (?im-x:…) | ✅ | ✅ | ✔ Unicode case folding for i |
| | | | | | ✔ Allows enabling and disabling the same flag (priority: disable) |
| | | | | | ✔ Allows lone or multiple - |
| | Directives | (?im-x) | ✅ | ✅ | ✔ Continues until end of pattern or group (spanning alternatives) |
| Characters | Literal | E, ! | ✅ | ✅ | ✔ Code point based matching (same as JS with flag u, v) |
| | | | | | ✔ Standalone ], {, } don't require escaping |
| | Identity escape | \E, \! | ✅ | ✅ | ✔ Different allowed set than JS |
| | | | | | ✔ Invalid for multibyte chars |
| | Char escapes | \t | ✅ | ✅ | ✔ JS set plus \a, \e |
| | \x | \xA0 | ✅ | ✅ | ✔ 1 hex digit \xA |
| | | | | | ✔ 2 hex digits \xA0 (same as JS) |
| | | | | | ✔ Incomplete \x is invalid (like JS with flag u, v) |
| | \u | \uFFFF | ✅ | ✅ | ✔ Incomplete \u is invalid (like JS with flag u, v) |
| | \u{…} | \u{A} | ✅ | ✅ | ✔ Allows whitespace padding |
| | | | | | ✔ Allows leading 0s up to 6 total hex digits (JS allows unlimited) |
| | | | | | ✔ Incomplete \u{ is invalid (like JS with flag u, v) |
| | Escaped num | \20 | ✅ | ✅ | ✔ Can be backref, error, null, octal, identity escape, or any of these combined with literal digits, based on complex rules that differ from JS |
| | | | | | ✔ Always handles escaped single digit 1-9 outside char class as backref |
| | | | | | ✔ Allows null with 1-3 0s (unlike JS in any mode) |
| | Control | \cA, \C-A | ✅ | ✅ | ✔ With A-Za-z (JS: only \c form) |
| | | | | | ✔ Incomplete \c is invalid (like JS with flag u, v) |
| | Other (very rare) | | ❌ | ❌ | Not yet supported: |
| | | | | | ● \cx, \C-x with non-A-Za-z |
| | | | | | ● Meta-code \M-x, \M-\C-x |
| Character sets | Digit, word | \d, \w, etc. | ✅ | ✅ | ✔ Same as JS (ASCII) |
| | Hex digit | \h, \H | ✅ | ✅ | ✔ ASCII |
| | Whitespace | \s, \S | ✅ | ✅ | ✔ ASCII (unlike JS) |
| | Dot | . | ✅ | ✅ | ✔ Excludes only \n (unlike JS) |
| | Unicode property | \p{L}, \P{L} | ✅[1] | ✅ | ✔ Categories |
| | | | | | ✔ Binary properties |
| | | | | | ✔ Scripts |
| | | | | | ✔ Aliases |
| | | | | | ✔ POSIX properties |
| | | | | | ✔ Negate with \p{^…}, \P{^…} |
| | | | | | ✔ Insignificant spaces, underscores, and casing in names |
| | | | | | ✔ \p, \P without { is identity escape (like JS without flag u, v) |
| | | | | | ✔ JS prefixes invalid (ex: Script=) |
| | | | | | ✔ JS properties of strings invalid |
| | | | | | ❌ Blocks (wontfix[2]) |
| Variable-length sets | Newline | \R | ✅ | ✅ | ✔ Matched atomically |
| | Grapheme | \X | ☑️ | ☑️ | ● Uses a close approximation |
| | | | | | ✔ Matched atomically |
| Character classes | Base | […], [^…] | ✅ | ✅ | ✔ Unescaped - is literal char in some contexts (different than JS rules in any mode) |
| | | | | | ✔ Fewer chars require escaping than JS |
| | | | | | ✔ No subtraction operator (from JS flag v) |
| | Empty | [], [^] | ✅ | ✅ | ✔ Invalid (unlike JS) |
| | Ranges | [a-z] | ✅ | ✅ | ✔ Same as JS with flag u, v |
| | POSIX classes | [[:word:]] | ☑️[3] | ✅ | ✔ All use Unicode interpretations |
| | | | | | ✔ Negate with [:^…:] |
| | Nested classes | […[…]] | ☑️[4] | ✅ | ✔ Same as JS with flag v |
| | Intersection | […&&…] | ❌ | ✅ | ✔ Doesn't require nested classes for union and ranges (unlike JS) |
| Assertions | Line start, end | ^, $ | ✅ | ✅ | ✔ Multiline mode only (compared to JS) |
| | | | | | ✔ Only \n as newline (unlike JS) |
| | | | | | ✔ Allows following quantifier (unlike JS) |
| | String start, end | \A, \z | ✅ | ✅ | ✔ Like JS ^ $ without JS flag m |
| | String end or before terminating newline | \Z | ✅ | ✅ | ✔ Only \n as newline |
| | Search start | \G | ☑️ | ☑️ | ● Supported when used at start of pattern (if no top-level alternation) and when at start of all top-level alternatives |
| | Word boundary | \b, \B | ✅ | ✅ | ✔ Unicode interpretation (unlike JS) |
| | | | | | ✔ Allows following quantifier (unlike JS) |
| | Lookahead | (?=…), (?!…) | ✅ | ✅ | ✔ Allows following quantifier (unlike JS with flag u, v) |
| | | | | | ✔ Values captured within min-0 quantified lookahead remain referenceable (unlike JS) |
| | Lookbehind | (?<=…), (?<!…) | ✅ | ✅ | ✔ Variable-length quantifiers within lookbehind invalid (unlike JS) |
| | | | | | ✔ Allows variable-length top-level alternatives |
| | | | | | ✔ Allows following quantifier (unlike JS in any mode) |
| | | | | | ✔ Values captured within min-0 quantified lookbehind remain referenceable |
| Quantifiers | Greedy, lazy | *, +?, {2}, etc. | ✅ | ✅ | ✔ Same as JS |
| | Possessive | ?+, *+, ++ | ✅ | ✅ | ✔ + suffix doesn't possessivize {…} quantifiers (creates a chained quantifier instead) |
| | Chained | **, ??+*, {2,3}+, etc. | ✅ | ✅ | ✔ Each applies itself to the preceding repetition |
| Groups | Noncapturing | (?:…) | ✅ | ✅ | ✔ Same as JS |
| | Atomic | (?>…) | ✅ | ✅ | ✔ Supported |
| | Capturing | (…) | ✅ | ✅ | ✔ Is noncapturing if any named capture is used |
| | Named capturing | (?<n>…), (?'n'…) | ✅ | ✅ | ✔ Allows duplicate names |
| | | | | | ✔ Error for group names invalid in Oniguruma or JS |
| Other | Comment groups | (?#…) | ✅ | ✅ | ✔ Allows escaping \), \\ |
| | | | | | ✔ Comments allowed between a token and its quantifier |
| | | | | | ✔ Comments between a quantifier and the ?/+ that makes it lazy/possessive changes it to a chained quantifier |
| | Alternation | …\|… | ✅ | ✅ | ✔ Same as JS |
| | Keep | \K | ☑️ | ☑️ | ● Supported if used at top level and no top-level alternation is used |
| | Absence operators | (?~…) | ❌ | ❌ | ● Some forms are supportable |
| | Conditionals | (?(1)…) | ❌ | ❌ | ● Some forms are supportable |
| | Unsupported JS features are handled using Oniguruma syntax rules | | ✅ | ✅ | ✔ [\q{…}] matches literal q, etc. |
| | | | | | ✔ [a--b] includes the invalid reversed range a to - |
| | Invalid Oniguruma syntax | | ✅ | ✅ | ✔ Error; not passed through |
Not yet complete…

As detailed as the table above is, it doesn't include all aspects that Oniguruma-To-ES emulates. For example, most aspects that work the same as JavaScript are omitted, as are aspects of non-JavaScript features that work the same in other regex flavors that support them.
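
Two of the table's "unlike JS" entries (dot and \s) can be checked directly against native JavaScript. The sketch below does not use the library. The onigStyle regexes are hand-written approximations of the semantics described in the table, and the exact ASCII whitespace set used here is an assumption.

```javascript
// JS dot (without flag s) excludes \n, \r, \u2028, and \u2029.
const jsDot = /^.$/u;
// Hand-written equivalent of "any character except \n".
const onigStyleDot = /^[^\n]$/u;

console.log(jsDot.test('\r'), onigStyleDot.test('\r')); // false true
console.log(jsDot.test('\n'), onigStyleDot.test('\n')); // false false

// JS \s includes Unicode spaces such as no-break space (U+00A0).
const jsSpace = /^\s$/u;
// Assumed ASCII-only whitespace set: tab, LF, VT, FF, CR, space.
const onigStyleSpace = /^[\t\n\v\f\r ]$/u;

console.log(jsSpace.test('\u00A0'), onigStyleSpace.test('\u00A0')); // true false
console.log(jsSpace.test(' '), onigStyleSpace.test(' '));           // true true
```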

Footnotes

  1. Target ES2018 doesn't allow Unicode property names added in JavaScript specifications after ES2018.
  2. Unicode blocks are easily emulatable but their character data would significantly increase library weight, and they're a flawed, arguably-unuseful feature (use Unicode scripts and other properties instead).
  3. With target ES2018, the specific POSIX classes [:graph:] and [:print:] use ASCII versions rather than the Unicode versions available for target ES2024 and later, and they result in an error if option allowBestEffort is disabled.
  4. Target ES2018 doesn't allow nested negated character classes.

γŠ—οΈ Unicode / mixed case-sensitivity

Oniguruma-To-ES fully supports mixed case-sensitivity (and handles the Unicode edge cases) regardless of JavaScript target. It also restricts Unicode properties to those supported by Oniguruma and the target JavaScript version.

Oniguruma-To-ES focuses on being lightweight to make it better for use in browsers. This is partly achieved by not including heavyweight Unicode character data, which imposes a couple of minor/rare restrictions:

  • Character class intersection and nested negated character classes are unsupported with target ES2018. Use target ES2024 or later if you need support for these Oniguruma features.
  • With targets before ESNext, a handful of Unicode properties that target a specific character case (ex: \p{Lower}) can't be used case-insensitively in patterns that contain other characters with a specific case that are used case-sensitively.
    • In other words, almost every usage is fine, including A\p{Lower}, (?i:A\p{Lower}), (?i:A)\p{Lower}, (?i:A(?-i:\p{Lower})), and \w(?i:\p{Lower}), but not A(?i:\p{Lower}).
    • Using these properties case-insensitively is basically never done intentionally, so you're unlikely to encounter this error unless it's catching a mistake.
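
Without flag groups, native JavaScript applies flag i to a whole regex, so mixed case-sensitivity has to be emulated in the generated source. One simple technique for ASCII letters is to expand each letter into a two-character class. The sketch below shows only that idea. asciiCaseInsensitive is a hypothetical helper, and it does not describe the library's internals, which also handle Unicode edge cases.

```javascript
// Hypothetical helper: build a pattern fragment that matches `text`
// case-insensitively for ASCII letters, without relying on flag i.
function asciiCaseInsensitive(text) {
  return [...text].map(ch => {
    if (/^[a-z]$/i.test(ch)) {
      return `[${ch.toLowerCase()}${ch.toUpperCase()}]`;
    }
    // Escape anything that could be regex syntax.
    return ch.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');
  }).join('');
}

// Imitates Oniguruma's (?i:ab)C: "ab" in any case, then an uppercase "C".
const mixed = new RegExp(`^${asciiCaseInsensitive('ab')}C$`, 'u');

console.log(mixed.source);      // ^[aA][bB]C$
console.log(mixed.test('AbC')); // true
console.log(mixed.test('abc')); // false
```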

👀 Similar projects

JsRegex transpiles Onigmo regexes to JavaScript (Onigmo is a fork of Oniguruma that has slightly different syntax/behavior). JsRegex is written in Ruby and relies on the Regexp::Parser Ruby gem, which means regexes must be pre-transpiled on the server to use them in JavaScript. In contrast, Oniguruma-To-ES is written in JavaScript and does its own parsing, so it can be used at runtime. JsRegex also produces regexes with more edge cases that don't perfectly follow Oniguruma's behavior, in addition to the Oniguruma/Onigmo differences.

🏷️ About

Oniguruma-To-ES was created by Steven Levithan.

If you want to support this project, I'd love your help by contributing improvements, sharing it with others, or sponsoring ongoing development.

© 2024–present. MIT License.