Misc #20519
closed
Porting regexp to pure ruby?
Description
Would there be any benefit in porting Regexp from Onigmo to a pure ruby implementation that could benefit from YJIT?
Compiling a pattern could translate it into a Ruby method, which YJIT could then optimize easily.
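To make the idea concrete, here is a minimal sketch of what "compiling a pattern to a Ruby method" might look like. Everything in it is hypothetical: the `TinyPatternCompiler` module, its step format (`[:digits, n]` and `[:literal, "str"]`), and the `iso_date?` method are made up for illustration and cover only a tiny, fully anchored subset of what a regexp such as `/\A\d{4}-\d{2}-\d{2}\z/` does. It is not how Onigmo or any existing project works.

```ruby
# Hypothetical mini-compiler: turns a tiny pattern description into Ruby
# source for an anchored matcher, then defines that source as a method.
module TinyPatternCompiler
  # spec is an array of steps in a made-up format:
  #   [:digits, n]       exactly n ASCII digits
  #   [:literal, "str"]  the exact string "str"
  def self.compile(name, spec)
    body = spec.map do |kind, arg|
      case kind
      when :digits
        <<~RUBY
          #{arg}.times do
            b = str.getbyte(pos)
            return false unless b && b >= 48 && b <= 57 # ASCII '0'..'9'
            pos += 1
          end
        RUBY
      when :literal
        <<~RUBY
          return false unless str.byteslice(pos, #{arg.bytesize}) == #{arg.dump}
          pos += #{arg.bytesize}
        RUBY
      else
        raise ArgumentError, "unknown step: #{kind.inspect}"
      end
    end.join

    # The generated method is plain Ruby, so YJIT can see and compile it.
    source = <<~RUBY
      def self.#{name}(str)
        pos = 0
        #{body}
        pos == str.bytesize
      end
    RUBY
    module_eval(source)
  end
end

# Roughly the anchored pattern \A\d{4}-\d{2}-\d{2}\z
TinyPatternCompiler.compile(
  :iso_date?,
  [[:digits, 4], [:literal, "-"], [:digits, 2], [:literal, "-"], [:digits, 2]]
)

puts TinyPatternCompiler.iso_date?("2024-05-17")
puts TinyPatternCompiler.iso_date?("2024-5-17")
```

A real implementation would also need alternation, repetition with backtracking, captures, and encoding awareness, which is where most of the difficulty lies.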
Has this been explored, or has any work been done around this kind of thing, before I look into it more?
Many thanks
Updated by shyouhei (Shyouhei Urabe) 5 months ago
- Status changed from Open to Feedback
Ruby (especially its multilingualized string) is built on top of Onigmo and not vice versa. You must first decouple them, which alone is not an easy task.
Updated by brightbits (Michael Baldry) 5 months ago
shyouhei (Shyouhei Urabe) wrote in #note-1:
Ruby (especially its multilingualized string) is built on top of Onigmo and not vice versa. You must first decouple them, which alone is not an easy task.
Ah yes, I see now that everything in enc has an Oniguruma copyright header.
I think that could all remain, with only the actual regexp matching functions changing. However, after some quick benchmarking of a Ruby implementation of the logic of a relatively simple date-parsing regexp, I couldn't get anywhere near the speed of Onigmo, even with YJIT. That doesn't mean it's not possible: I didn't dig too deep or do any kind of profiling to see what was taking the time.
The thought came about while my team was benchmarking a change. One person suggested a regexp for matching and replacing a string prefix, and it was tested against using start_with? followed by a string range accessor to drop the prefix, which seemed to be faster for that case.
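For readers who want to reproduce that kind of comparison, a rough sketch follows. The prefix `"user:"`, the method names, and the input are assumptions made for illustration, not the team's actual code. `String#delete_prefix` is included as a third option even though it was not part of the comparison described above. No timings are claimed here; results will vary by Ruby version and by whether YJIT is enabled.

```ruby
require "benchmark"

PREFIX = "user:" # hypothetical prefix, chosen for illustration

# Regexp approach: anchored substitution of the prefix.
def strip_with_regexp(str)
  str.sub(/\Auser:/, "")
end

# start_with? plus a range accessor to drop the prefix.
def strip_with_start_with(str)
  str.start_with?(PREFIX) ? str[PREFIX.length..] : str
end

# Built-in alternative (not mentioned in the comparison above).
def strip_with_delete_prefix(str)
  str.delete_prefix(PREFIX)
end

# Only run the timing loop when executed directly as a script.
if $PROGRAM_NAME == __FILE__
  input = "user:42"
  n = 1_000_000
  Benchmark.bm(16) do |x|
    x.report("sub")           { n.times { strip_with_regexp(input) } }
    x.report("start_with?")   { n.times { strip_with_start_with(input) } }
    x.report("delete_prefix") { n.times { strip_with_delete_prefix(input) } }
  end
end
```

Running it once with `ruby` and once with `ruby --yjit` gives a quick impression of how much of the difference comes from the JIT.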
I agree it sounds like a very big job and, based on my initial testing, it is unlikely to be an improvement in most cases.
Updated by kddnewton (Kevin Newton) 5 months ago
Hi @brightbits (Michael Baldry)! I've investigated this one at length, and can give some context.
As you already discovered, Onigmo stretches well beyond regular expressions. It also provides all of the encoding support within CRuby, stretching all of the way into the parser. This has led most other Ruby implementations to have to vendor Onigmo in order to match behavior 1:1. For example TruffleRuby uses it as a fallback (https://github.com/oracle/truffleruby/blob/master/lib/cext/include/ruby/onigmo.h), Artichoke uses it as a fallback (https://github.com/artichoke/artichoke/blob/77434156f30188a6e27f321b9b0f8437acfc0834/spinoso-regexp/Cargo.toml#L27), Natalie uses it as its regexp engine (https://github.com/natalie-lang/natalie/blob/556e8c195423daddf1c5aba49bb67dda22fb36d7/Rakefile#L467-L480), etc. For these reasons replacing Onigmo entirely may be possible, but it would certainly be an extremely long and arduous process because of concerns about backward compatibility.
That being said, there are things that could be done. The various options would be:
- What you already mentioned about handling subsets of regular expressions and splitting them up/enhancing them with additional APIs. You could do this today with ISEQ translation. (Check out https://github.com/k0kubun/ruby-jit-challenge for an intro to how this could work.)
- You could interpret the Onigmo bytecode in Ruby directly and attempt to work with YJIT to get performance up. Check out a couple of links here: https://speakerdeck.com/makenowjust/rubykaigi-2024-make-your-own-regex-engine and https://github.com/Shopify/onigmo.
- You could rewrite it entirely in Ruby (https://github.com/kddnewton/exreg). The only real way this matches up with performance would be having its own JIT. Certainly possible, but difficult.
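As a rough illustration of the second option (interpreting regexp bytecode in Ruby), here is a toy backtracking VM. The instruction set is a textbook-style one invented for this sketch; it is NOT Onigmo's real bytecode, and the `ToyRegexVM` class and `A_PLUS_B` program are hypothetical names.

```ruby
# Toy instruction set (made up for this sketch, not Onigmo's):
#   [:char, c]      consume one character equal to c
#   [:any]          consume any one character
#   [:split, a, b]  continue at pc a, remembering pc b as a fallback
#   [:jmp, a]       jump to pc a
#   [:match]        succeed
class ToyRegexVM
  def initialize(program)
    @program = program
  end

  # True if the program matches a prefix of str, starting at position 0.
  def match_prefix?(str)
    chars = str.chars
    stack = [[0, 0]] # pending [pc, pos] alternatives to try
    until stack.empty?
      pc, pos = stack.pop
      loop do
        op, a, b = @program[pc]
        case op
        when :char
          break unless chars[pos] == a # this path fails; backtrack
          pc += 1
          pos += 1
        when :any
          break unless pos < chars.length
          pc += 1
          pos += 1
        when :split
          stack.push([b, pos]) # save the fallback branch
          pc = a
        when :jmp
          pc = a
        when :match
          return true
        else
          raise ArgumentError, "unknown instruction: #{op.inspect}"
        end
      end
    end
    false
  end
end

# Hand-assembled program for the pattern a+b
A_PLUS_B = [
  [:char, "a"],   # 0
  [:split, 0, 2], # 1: loop back for more "a"s, or move on to "b"
  [:char, "b"],   # 2
  [:match]        # 3
].freeze

vm = ToyRegexVM.new(A_PLUS_B)
puts vm.match_prefix?("aaab")
puts vm.match_prefix?("b")
```

The interpreter loop is ordinary Ruby, so YJIT can compile it, but the dispatch on each instruction still runs for every input character. That is one reason a dedicated JIT for the pattern itself may be needed to compete on speed.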
Updated by brightbits (Michael Baldry) 5 months ago
I was at the kaigi but unfortunately missed that talk! I didn't realise that a few weeks later I'd be digging into it :) Looks like some interesting work has gone into this area already. I'm going to spend some time looking into this.
Thanks for the detailed response, I really appreciate it!