“I think [Oniguruma-To-ES] is very wonderful”
— K. Kosako, creator of Oniguruma
An Oniguruma to JavaScript regex translator that runs in the browser and on your server. Use it to:
- Take advantage of Oniguruma's many extended regex features in JavaScript.
- Run regexes written for Oniguruma from JavaScript, such as those used in TextMate grammars (used by VS Code, Shiki syntax highlighter, etc.).
- Share regexes across your Ruby✳︎ or PHP (mb_ereg, etc.) and JavaScript code.
- Evaluate the validity of Oniguruma regexes and traverse their ASTs.
Compared to running the Oniguruma C library via WASM bindings using vscode-oniguruma, this library is less than 4% of the size and its regexes often run much faster since they run as native JavaScript.
Oniguruma-To-ES deeply understands the hundreds of large and small differences between Oniguruma and JavaScript regex syntax and behavior, across multiple JavaScript version targets. It's obsessive about ensuring that the emulated features it supports have exactly the same behavior, even in extreme edge cases. And it's been battle-tested on tens of thousands of real-world Oniguruma regexes used in TextMate grammars.
Depending on features used, Oniguruma-To-ES might use advanced emulation via a RegExp subclass (that remains a native JavaScript regular expression).
✳︎: Ruby 2.0+ uses Onigmo, a fork of Oniguruma with similar syntax and behavior.
import {toRegExp} from 'oniguruma-to-es';
toRegExp(String.raw`(?x)
(?<n>\d) (?<n>\p{greek}) \k<n>
([0a-z&&\h]){,2}
`);
// → /(?<n>\p{Nd})(\p{sc=Greek})(?:\1|\2)(?:[[0a-z]&&\p{AHex}]){0,2}/v
Although the example above is fairly straightforward, you can see several translations that might not be obvious. Apart from the (?x) free-spacing modifier and the \h hex-digit shorthand that aren't available in JavaScript, you can also see that Oniguruma's \d is Unicode-based by default, backreferences to duplicate group names match the captured value of any of the groups, (…) groups are noncapturing by default if named groups are present, character class intersection doesn't follow JavaScript's requirement of using nested classes for union and ranges, and {…} interval quantifiers can use an implicit 0 min. Many advanced features are supported that would produce more complicated transformations.
This next example shows support for Unicode case folding with mixed case-sensitivity. Notice that code points ſ (U+017F) and K (U+212A) are added to the second, case-insensitive range if using a target prior to ES2025, and that modern JavaScript regex features (like flag groups) are used if supported by the target.
toRegExp('[a-z](?i)[a-z]', {target: 'ES2018'});
// → /[a-z][a-zA-ZſK]/u
toRegExp('[a-z](?i)[a-z]', {target: 'ES2025'});
// → /[a-z](?i:[a-z])/v
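To see why the ES2018 output needs the two extra code points, the following sketch uses only native JavaScript regexes (no library calls) to check how Unicode case folding treats ſ (U+017F) and K (U+212A). It illustrates the underlying JavaScript behavior and is not the library's implementation.

```javascript
// Native JS only: with flags i and u, JS applies Unicode simple case folding,
// so ſ (U+017F) folds to "s" and K (U+212A, Kelvin sign) folds to "k".
const longS = '\u017F';
const kelvin = '\u212A';

const caseInsensitive = /^[a-z]$/iu;
const foldsLongS = caseInsensitive.test(longS);
const foldsKelvin = caseInsensitive.test(kelvin);

// The ES2018-style output spells out both cases plus the two extra code points,
// so only the second class behaves case-insensitively.
const es2018Style = /^[a-z][a-zA-Z\u017F\u212A]$/u;
const results = {
  foldsLongS,
  foldsKelvin,
  lowerThenUpper: es2018Style.test('aB'),
  lowerThenKelvin: es2018Style.test('a' + kelvin),
  upperFirst: es2018Style.test('Ab'),
};
console.log(results);
```

Only the last check is false, since the first class stays case-sensitive.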
npm install oniguruma-to-es
import {toRegExp} from 'oniguruma-to-es';
const str = '…';
const pattern = '…';
// Works with all string/regexp methods since it returns a native regexp
str.match(toRegExp(pattern));
Using a global name (no import)
<script src="https://cdn.jsdelivr.net/npm/oniguruma-to-es/dist/index.min.js"></script>
<script>
const {toRegExp} = OnigurumaToES;
</script>
Accepts an Oniguruma pattern and returns an equivalent JavaScript RegExp.
Tip
Try it in the demo REPL.
function toRegExp(
  pattern: string,
  options?: OnigurumaToEsOptions
): RegExp | EmulatedRegExp;
type OnigurumaToEsOptions = {
  accuracy?: 'default' | 'strict';
  avoidSubclass?: boolean;
  flags?: string;
  global?: boolean;
  hasIndices?: boolean;
  rules?: {
    allowOrphanBackrefs?: boolean;
    asciiWordBoundaries?: boolean;
    captureGroup?: boolean;
    recursionLimit?: number;
  };
  target?: 'auto' | 'ES2025' | 'ES2024' | 'ES2018';
  verbose?: boolean;
};
See Options for more details.
Accepts an Oniguruma pattern and returns the details needed to construct an equivalent JavaScript RegExp.
function toDetails(
  pattern: string,
  options?: OnigurumaToEsOptions
): {
  pattern: string;
  flags: string;
  options?: EmulatedRegExpOptions;
};
Note that the returned flags might differ from those provided, as a result of the emulation process. The returned pattern, flags, and options properties can be provided as arguments to the EmulatedRegExp constructor to produce the same result as toRegExp.
If the only keys returned are pattern and flags, they can optionally be provided to JavaScript's RegExp constructor instead. Setting option avoidSubclass to true ensures that this is always the case (resulting in an error for any patterns that require EmulatedRegExp's additional handling).
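As a minimal sketch of the second case, the helper below builds a native RegExp from a toDetails-style result and refuses results that carry options. The helper name toNativeRegExp and the hand-written details object are assumptions for illustration. The code doesn't call the library, and it assumes a missing options key reads as undefined.

```javascript
// Hypothetical helper (not part of the library): build a native RegExp from a
// toDetails-style result, refusing results that need EmulatedRegExp handling.
function toNativeRegExp(details) {
  if (details.options !== undefined) {
    throw new Error('Result includes options, so it requires EmulatedRegExp');
  }
  return new RegExp(details.pattern, details.flags);
}

// Hand-written stand-in for a toDetails result (values are illustrative only)
const details = {pattern: '\\p{Nd}+', flags: 'u'};
const re = toNativeRegExp(details);
const matchesDigits = re.test('123');

// A result that carries options is rejected instead of silently losing behavior
let rejected = false;
try {
  toNativeRegExp({pattern: 'a', flags: 'u', options: {}});
} catch (err) {
  rejected = true;
}
console.log(matchesDigits, rejected);
```

Failing loudly here mirrors the intent of avoidSubclass: a pattern that needs extra handling shouldn't be silently downgraded to a plain RegExp.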
Returns an Oniguruma AST generated from an Oniguruma pattern.
function toOnigurumaAst(
  pattern: string,
  options?: {
    flags?: string;
    rules?: {
      captureGroup?: boolean;
    };
  }
): OnigurumaAst;
An error is thrown if the pattern isn't valid in Oniguruma. But unlike toRegExp and toDetails, toOnigurumaAst doesn't evaluate whether the pattern can be emulated in JavaScript.
Works the same as JavaScript's native RegExp constructor in all contexts, but can be given results from toDetails to produce the same result as toRegExp.
class EmulatedRegExp extends RegExp {
  constructor(pattern: string, flags?: string, options?: EmulatedRegExpOptions);
  constructor(pattern: EmulatedRegExp, flags?: string);
  rawArgs: {
    pattern: string;
    flags: string;
    options: EmulatedRegExpOptions;
  };
}
The rawArgs property of EmulatedRegExp instances can be used to serialize the object.
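One way this could look is a JSON round trip, sketched below. The helper names and the StandInRegExp class are hypothetical and exist only so the sketch runs without the library installed. It also assumes that rawArgs holds only JSON-safe values, which isn't stated above.

```javascript
// Hypothetical helpers (not part of the library). They assume rawArgs holds
// only JSON-safe values.
function serializeRegExp(re) {
  return JSON.stringify(re.rawArgs);
}
function deserializeRegExp(json, RegExpClass) {
  const {pattern, flags, options} = JSON.parse(json);
  return new RegExpClass(pattern, flags, options);
}

// Stand-in class used only so this sketch is self-contained; in real code,
// pass the library's EmulatedRegExp as RegExpClass instead.
class StandInRegExp extends RegExp {
  constructor(pattern, flags, options) {
    super(pattern, flags);
    this.rawArgs = {pattern, flags, options};
  }
}

const original = new StandInRegExp('a+', 'u', {});
const json = serializeRegExp(original);
const restored = deserializeRegExp(json, StandInRegExp);
console.log(json, restored.source, restored.flags);
```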
The following options are shared by functions toRegExp and toDetails.
accuracy
One of 'default' (default) or 'strict'.
Sets the level of emulation rigor/strictness.
- Default: Permits a few close approximations in order to support additional features.
- Strict: Error if the pattern can't be emulated with identical behavior (even in rare edge cases) for the given target.
More details
Using default accuracy adds support for the following features, depending on target:
- All targets (ES2025 and earlier):
  - Enables use of \X using a close approximation of a Unicode extended grapheme cluster.
  - Enables combining lookbehind with uncommon uses of \G that rely on subclass-based emulation.
- ES2024 and earlier:
  - Enables use of case-insensitive backreferences to case-sensitive groups.
- ES2018:
  - Enables use of POSIX classes [:graph:] and [:print:] using ASCII-based versions rather than the Unicode versions available for ES2024 and later. Other POSIX classes are always Unicode-based.
avoidSubclass
Default: false.
Disables advanced emulation that relies on returning a RegExp subclass. In cases when a subclass would otherwise have been used, this results in one of the following:
- An error is thrown for patterns that are not emulatable without a subclass.
- Some patterns can still be emulated accurately without a subclass, but in this case subpattern match details might differ from Oniguruma.
  - This is only relevant if you access the subpattern details of match results in your code (via subpattern array indices, groups, and indices).
flags
Oniguruma flags; a string with i, m, x, D, S, W in any order (all optional).
Flags can also be specified via modifiers in the pattern.
Important
Oniguruma and JavaScript both have an m flag but with different meanings. Oniguruma's m is equivalent to JavaScript's s (dotAll).
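The difference can be checked with native JavaScript regexes alone. This short sketch doesn't call the library. It only shows what JavaScript's own s and m flags do, so that the mapping from Oniguruma's m is easier to see.

```javascript
const text = 'a\nb';

// JS flag s (dotAll): dot also matches newlines. This is what Oniguruma's m means.
const dotAllMatches = /a.b/s.test(text);

// JS flag m (multiline) doesn't change dot...
const jsMultilineDot = /a.b/m.test(text);

// ...it only lets ^ and $ match at line breaks
const jsMultilineAnchors = /^b$/m.test(text);
const plainAnchors = /^b$/.test(text);

console.log({dotAllMatches, jsMultilineDot, jsMultilineAnchors, plainAnchors});
```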
global
Default: false.
Include JavaScript flag g (global) in the result.
hasIndices
Default: false.
Include JavaScript flag d (hasIndices) in the result.
rules
Advanced options that override standard behavior, error checking, and flags when enabled.
- allowOrphanBackrefs: Useful with TextMate grammars that merge backreferences across patterns.
- asciiWordBoundaries: Use ASCII-based \b and \B, which increases search performance of generated regexes.
- captureGroup: Allow unnamed captures and numbered calls (backreferences and subroutines) when using named capture.
  - This is Oniguruma option ONIG_OPTION_CAPTURE_GROUP; on by default in vscode-oniguruma.
- recursionLimit: Change the recursion depth limit from Oniguruma's 20 to an integer 2–20.
target
One of 'auto' (default), 'ES2025', 'ES2024', or 'ES2018'.
JavaScript version used for generated regexes. Using auto detects the best value based on your environment. Later targets allow faster processing, simpler generated source, and support for additional features.
More details
- ES2018: Uses JS flag u.
  - Emulation restrictions: Character class intersection and nested negated character classes are not allowed.
  - Generated regexes might use ES2018 features that require Node.js 10 or a browser version released during 2018 to 2023 (in Safari's case). Minimum requirement for any regex is Node.js 6 or a 2016-era browser.
- ES2024: Uses JS flag v.
  - No emulation restrictions.
  - Generated regexes require Node.js 20 or any 2023-era browser (compat table).
- ES2025: Uses JS flag v and allows use of flag groups.
  - Benefits: Faster transpilation, simpler generated source.
  - Generated regexes might use features that require Node.js 23 or a 2024-era browser (except Safari, which lacks support for flag groups).
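To make the idea of environment-based detection concrete, here is a rough sketch that probes for flag v and flag groups by trying to compile tiny regexes. The function name detectTargetSketch is hypothetical, and this is not necessarily how the library's auto detection works.

```javascript
// Hypothetical helper (not the library's actual detection logic)
function detectTargetSketch() {
  // Returns true if the environment can compile the given pattern and flags
  const compiles = (pattern, flags) => {
    try {
      new RegExp(pattern, flags);
      return true;
    } catch {
      return false;
    }
  };
  const hasFlagV = compiles('', 'v');
  const hasFlagGroups = compiles('(?i:a)', '');
  if (hasFlagV && hasFlagGroups) return 'ES2025';
  if (hasFlagV) return 'ES2024';
  return 'ES2018';
}

const detected = detectTargetSketch();
console.log(detected);
```

Probing by compilation avoids parsing user-agent or version strings, at the cost of a few throwaway RegExp constructions.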
verbose
Default: false.
Disables optimizations that simplify the pattern when doing so doesn't change its meaning.
The following table lists the supported features by target. The official Oniguruma syntax doc doesn't cover many of the finer details described here.
Note
Targets ES2024 and ES2025 have the same emulation capabilities. Resulting regexes might have different source and flags, but they match the same strings. See target.
Notice that nearly every feature below has at least subtle differences from JavaScript. Some features listed as unsupported are not emulatable using native JavaScript regexes, but support for others might be added in future versions of this library. Unsupported features throw an error.
| Category | Feature | Example | ES2018 | ES2024+ | Subfeatures & JS differences |
|---|---|---|---|---|---|
| Flags (supported in top-level flags and pattern modifiers) | Ignore case | i | ✅ | ✅ | ✔ Unicode case folding (same as JS with flag u, v) |
| | Dot all | m | ✅ | ✅ | ✔ Equivalent to JS flag s |
| | Extended | x | ✅ | ✅ | ✔ Unicode whitespace ignored ✔ Line comments with # ✔ Whitespace/comments allowed between a token and its quantifier ✔ Whitespace/comments between a quantifier and the ?/+ that makes it lazy/possessive changes it to a quantifier chain ✔ Whitespace/comments separate tokens (ex: \1 0) ✔ Whitespace and # not ignored in char classes |
| Flags (currently supported only in top-level flags) | Digit is ASCII | D | ✅ | ✅ | ✔ ASCII \d, \p{Digit}, [[:digit:]] |
| | Space is ASCII | S | ✅ | ✅ | ✔ ASCII \s, \p{Space}, [[:space:]] |
| | Word is ASCII | W | ✅ | ✅ | ✔ ASCII \b, \w, \p{Word}, [[:word:]] |
| Pattern modifiers | Group | (?im-x:…) | ✅ | ✅ | ✔ Unicode case folding for i ✔ Allows enabling and disabling the same flag (priority: disable) ✔ Allows lone or multiple - |
| | Directive | (?im-x) | ✅ | ✅ | ✔ Continues until end of pattern or group (spanning alternatives) |
| Characters | Literal | E, ! | ✅ | ✅ | ✔ Code point based matching (same as JS with flag u, v) ✔ Standalone ], {, } don't require escaping |
| | Identity escape | \E, \! | ✅ | ✅ | ✔ Different set than JS ✔ Allows multibyte chars |
| | Escaped metachar | \\, \. | ✅ | ✅ | ✔ Same as JS |
| | Control code escape | \t | ✅ | ✅ | ✔ The JS set plus \a, \e |
| | \xNN | \x7F | ✅ | ✅ | ✔ Allows 1 hex digit ✔ Above 7F, is UTF-8 encoded byte (≠ JS) ✔ Error for invalid encoded bytes |
| | \uNNNN | \uFFFF | ✅ | ✅ | ✔ Same as JS with flag u, v |
| | \x{…} | \x{A} | ✅ | ✅ | ✔ Allows leading 0s up to 8 total hex digits |
| | Escaped num | \20 | ✅ | ✅ | ✔ Can be backref, error, null, octal, identity escape, or any of these combined with literal digits, based on complex rules that differ from JS ✔ Always handles escaped single digit 1-9 outside char class as backref ✔ Allows null with 1-3 0s ✔ Error for octal > 177 |
| | Caret notation | \cA, \C-A | ✅ | ✅ | ✔ With A-Za-z (JS: only \c form) |
| Character sets | Digit | \d, \D | ✅ | ✅ | ✔ Unicode by default (≠ JS) |
| | Hex digit | \h, \H | ✅ | ✅ | ✔ ASCII |
| | Whitespace | \s, \S | ✅ | ✅ | ✔ Unicode by default ✔ No JS adjustments to Unicode set (− \uFEFF, + \x85) |
| | Word | \w, \W | ✅ | ✅ | ✔ Unicode by default (≠ JS) |
| | Dot | . | ✅ | ✅ | ✔ Excludes only \n (≠ JS) |
| | Any | \O | ✅ | ✅ | ✔ Any char (with any flags) ✔ Identity escape in char class |
| | Not newline | \N | ✅ | ✅ | ✔ Identity escape in char class |
| | Unicode property | \p{L}, \P{L} | ✅ | ✅ | ✔ Binary properties ✔ Categories ✔ Scripts ✔ Aliases ✔ POSIX properties ✔ Invert with \p{^…}, \P{^…} ✔ Insignificant spaces, hyphens, underscores, and casing in names ✔ \p, \P without { is an identity escape ✔ Error for key prefixes ✔ Error for props of strings ❌ Blocks (wontfix[1]) |
| Variable-length sets | Newline | \R | ✅ | ✅ | ✔ Matched atomically |
| | Grapheme | \X | ☑️ | ☑️ | ● Uses a close approximation ✔ Matched atomically |
| Character classes | Base | […], [^…] | ✅ | ✅ | ✔ Unescaped - outside of range is literal in some contexts (different than JS rules in any mode) ✔ Error for unescaped [ that doesn't form nested class ✔ Leading unescaped ] OK ✔ Fewer chars require escaping than JS |
| | Empty | [], [^] | ✅ | ✅ | ✔ Error |
| | Range | [a-z] | ✅ | ✅ | ✔ Same as JS with flag u, v ✔ Allows \x{…} above 10FFFF at end of range to mean last valid code point |
| | POSIX class | [[:word:]], [[:^word:]] | ☑️[2] | ✅ | ✔ All use Unicode definitions |
| | Nested class | […[…]] | ☑️[3] | ✅ | ✔ Same as JS with flag v |
| | Intersection | […&&…] | ❌ | ✅ | ✔ Doesn't require nested classes for intersection of union and ranges ✔ Allows empty segments |
| Assertions | Line start, end | ^, $ | ✅ | ✅ | ✔ Always "multiline" ✔ Only \n as newline ✔ ^ doesn't match after string-terminating \n |
| | String start, end | \A, \z | ✅ | ✅ | ✔ Same as JS ^ $ without JS flag m |
| | String end or before terminating newline | \Z | ✅ | ✅ | ✔ Only \n as newline |
| | Search start | \G | ✅ | ✅ | ✔ Matches at start of match attempt (not end of prev match; advances after 0-length match) |
| | Lookaround | (?=…), (?!…), (?<=…), (?<!…) | ✅ | ✅ | ✔ Allows variable-length quantifiers and alternation within lookbehind ✔ Lookahead invalid within lookbehind ✔ Capturing groups invalid within negative lookbehind ✔ Negative lookbehind invalid within positive lookbehind |
| | Word boundary | \b, \B | ✅ | ✅ | ✔ Unicode based (≠ JS) |
| Quantifiers | Greedy, lazy | *, +?, {2,}, etc. | ✅ | ✅ | ✔ Includes all JS forms ✔ Adds {,n} for min 0 ✔ Explicit bounds have upper limit of 100,000 (unlimited in JS) ✔ Error with assertions (same as JS with flag u, v) and directives |
| | Possessive | ?+, *+, ++, {3,2} | ✅ | ✅ | ✔ + suffix doesn't make {…} quantifiers possessive (creates a quantifier chain) ✔ Reversed {…} ranges are possessive |
| | Chained | **, ??+*, {2,3}+, etc. | ✅ | ✅ | ✔ Further repeats the preceding repetition |
| Groups | Noncapturing | (?:…) | ✅ | ✅ | ✔ Same as JS |
| | Atomic | (?>…) | ✅ | ✅ | ✔ Supported |
| | Capturing | (…) | ✅ | ✅ | ✔ Is noncapturing if named capture present |
| | Named capturing | (?<a>…), (?'a'…) | ✅ | ✅ | ✔ Duplicate names allowed (including within the same alternation path) unless directly referenced by a subroutine ✔ Error for names invalid in Oniguruma or JS |
| Backreferences | Numbered | \1 | ✅ | ✅ | ✔ Error if named capture used ✔ Refs the most recent of a capture/subroutine set |
| | Enclosed numbered, relative | \k<1>, \k'1', \k<-1>, \k'-1' | ✅ | ✅ | ✔ Error if named capture used ✔ Allows leading 0s ✔ Refs the most recent of a capture/subroutine set ✔ \k without < ' is an identity escape |
| | Named | \k<a>, \k'a' | ✅ | ✅ | ✔ For duplicate group names, rematch any of their matches (multiplex) ✔ Refs the most recent of a capture/subroutine set (no multiplex) ✔ Combination of multiplex and most recent of capture/subroutine set if duplicate name is indirectly created by a subroutine |
| | To nonparticipating groups | | ☑️ | ☑️ | ✔ Error if group to the right[4] ✔ Duplicate names (and subroutines) to the right not included in multiplex ✔ Fail to match (or don't include in multiplex) ancestor groups and groups in preceding alternation paths ❌ Some rare cases are indeterminable at compile time and use the JS behavior of matching an empty string |
| Subroutines | Numbered, relative | \g<1>, \g'1', \g<-1>, \g'-1', \g<+1>, \g'+1' | ✅ | ✅ | ✔ Error if named capture used ✔ Allows leading 0s. All subroutines (incl. named): ✔ Allowed before reffed group ✔ Can be nested (any depth) ✔ Reuses flags from the reffed group (ignores local flags) ✔ Replaces most recent captured values (for backrefs) ✔ \g without < ' is an identity escape |
| | Named | \g<a>, \g'a' | ✅ | ✅ | ● Same behavior as numbered ✔ Error if reffed group uses duplicate name |
| Recursion | Full pattern | \g<0>, \g'0' | ☑️[5] | ☑️[5] | ✔ 20-level depth limit |
| | Numbered, relative, named | (…\g<1>?…), (…\g<-1>?…), (?<a>…\g<a>?…), etc. | ☑️[5] | ☑️[5] | ✔ 20-level depth limit |
| Other | Comment group | (?#…) | ✅ | ✅ | ✔ Allows escaping \), \\ ✔ Comments allowed between a token and its quantifier ✔ Comments between a quantifier and the ?/+ that makes it lazy/possessive changes it to a quantifier chain |
| | Alternation | …\|… | ✅ | ✅ | ✔ Same as JS |
| | Keep | \K | ☑️ | ☑️ | ● Supported at top level if no top-level alternation is used |
| | JS features unknown to Oniguruma are handled using Oniguruma syntax | | ✅ | ✅ | ✔ \u{…} is an error ✔ [\q{…}] matches q, etc. ✔ [a--b] includes the invalid reversed range a to - |
| | Invalid Oniguruma syntax | | ✅ | ✅ | ✔ Error |
| Compile-time options | ONIG_OPTION_CAPTURE_GROUP | | ✅ | ✅ | ✔ Unnamed captures and numbered calls allowed when using named capture |
The table above doesn't include all aspects that Oniguruma-To-ES emulates (including error handling, subpattern details on match results, most aspects that work the same as in JavaScript, and many aspects of non-JavaScript features that work the same in the other regex flavors that support them). Where applicable, Oniguruma-To-ES follows the latest version of Oniguruma (currently 6.9.10).
1. Unicode blocks (which in Oniguruma are specified with an In prefix) are easily emulatable but their character data would significantly increase library weight. They're also rarely used, fundamentally flawed, and arguably unuseful given the availability of Unicode scripts and other properties.
2. With target ES2018, the specific POSIX classes [:graph:] and [:print:] use ASCII-based versions rather than the Unicode versions available for target ES2024 and later, and they result in an error if using strict accuracy.
3. Target ES2018 doesn't support nested negated character classes.
4. It's not an error for numbered backreferences to come before their referenced group in Oniguruma, but an error is the best path for Oniguruma-To-ES because ① most placements are mistakes and can never match (based on the Oniguruma behavior for backreferences to nonparticipating groups), ② erroring matches the behavior of named backreferences, and ③ the edge cases where they're matchable rely on rules for backreference resetting within quantified groups that are different in JavaScript and aren't emulatable. Note that it's not a backreference in the first place if using \10 or higher and not as many capturing groups are defined to the left (it's an octal or identity escape).
5. Oniguruma's recursion depth limit is 20. Oniguruma-To-ES uses the same limit by default but allows customizing it via the rules.recursionLimit option. Two rare uses of recursion aren't yet supported: overlapping recursions, and use of backreferences when a recursed subpattern contains captures. Patterns that would trigger an infinite recursion error in Oniguruma might find a match in Oniguruma-To-ES (since recursion is bounded), but future versions will detect this and error at transpilation time.
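Two of the "≠ JS" entries in the supported-features table (Digit and Dot) can be checked using native JavaScript regexes only, as sketched below. The Unicode-aware forms shown are equivalent native spellings chosen for illustration and are not necessarily what the library generates.

```javascript
// Digit: Oniguruma's \d is Unicode-based by default, while JS \d is ASCII-only.
// \p{Nd} is a native JS way to match any Unicode decimal digit.
const arabicIndicThree = '\u0663';
const jsDigit = /^\d$/u.test(arabicIndicThree);
const unicodeDigit = /^\p{Nd}$/u.test(arabicIndicThree);

// Dot: Oniguruma's dot excludes only \n, while JS dot (without flag s)
// also excludes \r, \u2028, and \u2029.
const jsDot = /^.$/u.test('\r');
const onigStyleDot = /^[^\n]$/u.test('\r');

console.log({jsDigit, unicodeDigit, jsDot, onigStyleDot});
```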
The following throw errors since they aren't yet supported. They're all extremely rare.
- Supportable:
  - Rarely-used character specifiers: Non-A-Za-z with \cx, \C-x; meta \M-x, \M-\C-x; bracketed octals \o{…}; octal UTF-8 encoded bytes (≥ \200).
  - Code point sequences: \x{H H …}, \o{O O …}.
  - Grapheme boundaries: \y, \Y.
  - Flags P (POSIX is ASCII) and y{g}/y{w} (grapheme boundary modes).
  - Whole-pattern modifier: Don't capture group (?C).
  - Callout: (*FAIL).
- Supportable for some uses:
  - Absence operators: (?~…), etc.
  - Conditionals: (?(…)…), etc.
  - Whole-pattern modifiers: Ignore-case is ASCII (?I), find longest (?L).
  - Callout pair: (*SKIP)(*FAIL).
- Not supportable:
  - Other callouts: (?{…}), (*…), etc.
Note that Oniguruma-To-ES supports 99.9+% of real-world Oniguruma regexes, based on a sample of tens of thousands of regexes used in TextMate grammars. Of the features listed above, absence operators and conditionals were used in 2–3 regexes each. The rest weren't used at all.
See also the supported features table (above) which describes some additional rarely-used sub-features that aren't currently supported.
Contributions are welcome if you want to add support for currently unsupported features.
Oniguruma-To-ES fully supports mixed case-sensitivity (ex: (?i)a(?-i)a) and handles the Unicode edge cases regardless of JavaScript target.
Oniguruma-To-ES focuses on being lightweight to make it better for use in browsers. This is partly achieved by not including heavyweight Unicode character data, which imposes a few minor/rare restrictions:
- Character class intersection and nested negated character classes are unsupported with target ES2018. Use target ES2024 (supported by Node.js 20 and 2023-era browsers) or later if you need support for these features.
- With targets before ES2025, a handful of Unicode properties that target a specific character case (ex: \p{Lower}) can't be used case-insensitively in patterns that contain other characters with a specific case that are used case-sensitively.
  - In other words, almost every usage is fine, including A\p{Lower}, (?i)A\p{Lower}, (?i:A)\p{Lower}, (?i)A(?-i)\p{Lower}, and \w(?i)\p{Lower}, but not A(?i)\p{Lower}.
  - Using these properties case-insensitively is basically never done intentionally, so you're unlikely to encounter this error unless it's catching a mistake.
- Oniguruma-To-ES uses the version of Unicode supported natively by your JavaScript environment. Using Unicode properties via \p{…} that were added in a later version of Unicode than the environment supports results in a runtime error. This is an extreme edge case since modern JavaScript environments support recent versions of Unicode.
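If you need to guard against that last case, one option is to probe the environment before relying on a property, as in this sketch. The helper name supportsUnicodeProperty is hypothetical and not part of the library. The name passed in must use JavaScript's own \p{…} spelling, which may differ from the spelling in the Oniguruma pattern.

```javascript
// Hypothetical helper (not part of the library): reports whether the current
// JavaScript environment recognizes a Unicode property in \p{…} with flag u.
function supportsUnicodeProperty(name) {
  try {
    new RegExp(`\\p{${name}}`, 'u');
    return true;
  } catch {
    // Unknown property names make the RegExp constructor throw a SyntaxError
    return false;
  }
}

const hasGreek = supportsUnicodeProperty('Script=Greek');
const hasBogus = supportsUnicodeProperty('Script=NotARealScript');
console.log(hasGreek, hasBogus);
```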
JsRegex transpiles Onigmo regexes to JavaScript (Onigmo is a fork of Oniguruma with similar syntax and behavior). It's written in Ruby and relies on the Regexp::Parser Ruby gem, which means regexes must be pre-transpiled on the server to use them in JavaScript. Note that JsRegex doesn't always translate edge case behavior differences.
Oniguruma-To-ES was created by Steven Levithan and contributors.
If you want to support this project, I'd love your help by contributing improvements, sharing it with others, or sponsoring ongoing development.
MIT License.