Bug #1301
closedPoor RegExp Matching Performance
Added by neo237 (Andreas Grau) over 15 years ago. Updated over 13 years ago.
Description
=begin
I noticed a very poor performance on matching regular expressions.
Running following code using ruby 1.9.0 on a Core2Duo (2x2Ghz) takes more than 20s to complete.
regexp = /[b]+.+.+.+.+.+.+.+.+[a]/
str="bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
regexp.match(str)
$ time ruby1.9 test.rb
real 0m23.029s
user 0m22.421s
sys 0m0.012s
=end
Updated by rue (Eero Saynatkari) over 15 years ago
=begin
Excerpts from Marius Mårnes Mathiesen's message of Wed Mar 18 16:53:08 +0200 2009:
Bug #1301: Poor RegExp Matching Performance
http://redmine.ruby-lang.org/issues/show/1301Author: Andreas Grau
Status: Open, Priority: Normal
Category: core
ruby -v: 1.9.0 (2008-06-20 revision 17482) [i486-linux]I noticed a very poor performance on matching regular expressions.
You mean non-matching? Aside from which, I would be hard-pressed
to call the code below a "regular expression."
Running following code using ruby 1.9.0 on a Core2Duo (2x2Ghz) takes more than
20s to complete.regexp = /[b]+.+.+.+.+.+.+.+.+[a]/
str="bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
regexp.match(str)
Are you able to produce a reasonable example of where this
is a problem and not just a pathologically horrible regexp?
The above, written as:
/[b]+.+[a]/
Or:
/[b].+.{1,}[a]/
Completes instantly.
--
Magic is insufficiently advanced technology.
=end
Updated by neo237 (Andreas Grau) over 15 years ago
=begin
Using this regexp
regexp = /^(\d*\s+){8,8}+\d+$/
This matches on any line starting with up to 8 number separated by by blanks and a final number
if the string is now in the wrong format, e.g.
str=" #"
the matching takes a very long time
regexp.match(str)
=end
Updated by rue (Eero Saynatkari) over 15 years ago
=begin
Excerpts from Marius Mårnes Mathiesen's message of Wed Mar 18 17:43:04 +0200 2009:
Issue #1301 has been updated by Andreas Grau.
Using this regexp
regexp = /^(\d*\s+){8,8}+\d+$/This matches on any line starting with up to 8 number separated by by blanks
and a final numberif the string is now in the wrong format, e.g.
str=" #"the matching takes a very long time
regexp.match(str)
OK, much more reasonable case, enough to warrant looking into
maybe eliminating some pathological cases. Your particular case
could, of course, be better (and more naturally) written as:
regexp = /^(\d+\s+){0,8}\s*\d+$/
Or something similar to fit the bill, but it eliminates the
poor performance of the example provided until and if such
fixes in the state machine are done.
--
Magic is insufficiently advanced technology.
=end
Updated by duerst (Martin Dürst) over 15 years ago
=begin
At 23:53 09/03/18, Andreas Grau wrote:
Bug #1301: Poor RegExp Matching Performance
http://redmine.ruby-lang.org/issues/show/1301Author: Andreas Grau
Status: Open, Priority: Normal
Category: core
ruby -v: 1.9.0 (2008-06-20 revision 17482) [i486-linux]I noticed a very poor performance on matching regular expressions.
Running following code using ruby 1.9.0 on a Core2Duo (2x2Ghz) takes more
than 20s to complete.regexp = /[b]+.+.+.+.+.+.+.+.+[a]/
Eero gave a much better way to write this regexp.
But why is it so bad, in particular for the string
below? The string has 37 'b's, and you are asking
it to try all different ways it can be split into
nine non-empty substrings (the first of which has
to consist of 'b's only), followed by an [a]. It's
surprising that this only takes 23 seconds!
Theory about regular expressions (formal language theory)
says there shouldn't be any difference, but Ruby regular
expressions (same for Perl, Python, and so on) are not
really regular expressions in the formal sense anymore,
because they allow capturing and many more neat practical
things. So the implementation uses backtracking, and
your regexp above creates a lot of backtracking.
For details on how to write good regular expressions,
please see Jeffrey Friedl's book
(e.g. at
http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/)
BTW, the [] in the above expression are not necessary,
it would usually be written
regexp = /b+.+.+.+.+.+.+.+.+a/
Regards, Martin.
str="bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
regexp.match(str)$ time ruby1.9 test.rb
real 0m23.029s
user 0m22.421s
sys 0m0.012s
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
=end
Updated by nobu (Nobuyoshi Nakada) over 15 years ago
- Status changed from Open to Rejected
=begin
=end
Updated by neo237 (Andreas Grau) over 15 years ago
=begin
Theory about regular expressions (formal language theory)
says there shouldn't be any difference, but Ruby regular
expressions (same for Perl, Python, and so on) are not
really regular expressions in the formal sense anymore,
because they allow capturing and many more neat practical
things. So the implementation uses backtracking, and
your regexp above creates a lot of backtracking.
I agree that python is also ineffizient, however, it is in
my tests 10 times faster when testing this regexp.
What I'm not sure if regexp parsing really requires backtracking.
The Unix command sed for matching and group capturing requires
below 10ms. (ruby ~10s, python ~1s).
regexp=/(\d*) +(\d*) +(\d*) +(\d*) +(\d*) +(\d*) +(\d*) +(\d*) +(\d+)$/
str=" #"
BTW, a trivial optimization would be to test matching of the regexp using
fast DFA/NFA automat and in case of a matching, use backtracking...
=end
Updated by WoNaDo (Wolfgang Nádasi-Donner) over 15 years ago
=begin
Andreas Grau schrieb:
BTW, a trivial optimization would be to test matching of the regexp using
fast DFA/NFA automat and in case of a matching, use backtracking...
The new Ruby regex interpreter Oniguruma works on "very extended regular
expressions" which are able to parse some context sensitive grammars.
Because a less powerful real regular expression machine may be helpful
for several simpler expressions, it may be worth to think about an
additional adaption of such a machine for Ruby.
Wolfgang Nádasi-Donner
=end