Bug #1740 [ruby-core:24188]

Ruby regexp causes 100% CPU usage.

Added by paranormal dev 257 days ago. Updated 257 days ago.

Status: Rejected
Start: 07/07/2009
Priority: Normal
Due date:
Assigned to: -
% Done: 0%
Category: -
Target version: -
ruby -v: ruby 1.8.7 (2009-04-08 patchlevel 160) [i386-freebsd6]


Description

I tested on FreeBSD with
ruby 1.8.7 (2009-04-08 patchlevel 160) [i386-freebsd6]

and on my Linux notebook with
ruby 1.8.7 (2009-06-08 patchlevel 173) [x86_64-linux]

using this code:
#######################################
require 'open-uri'
$KCODE = 'u' 

reg = %r{<.*?div\s*class\s*=\s*.entry.*?>[^<]*<.*?img\s*src\s*=\s*.([^"|']*).*?>[^<]*<.*?p\s*class\s*=\s*.date.*?>}im
#del = %r{<(?!p|div|img)[^>]*>}i

doc = open('http://www.radiokvit.com.ua/?p=1895').read

#doc.gsub!(del, ' ')
a = doc.match(reg)
p a
######################################

My ruby process uses 100% CPU for a long time. On Linux it eventually exits normally, but on FreeBSD it never exits %-(.
I submitted another bug to FreeBSD, http://www.freebsd.org/cgi/query-pr.cgi?pr=136384, but that one is FreeBSD-specific.

These patterns were written by someone else for Perl, and I have to use them here.

testfile.html - test document for the regexp, in case the URL is no longer valid. (23.3 KB) paranormal dev, 07/07/2009 11:38 PM

History

07/08/2009 02:35 AM - Eero Saynatkari

Excerpts from rubymine message of Tue Jul 07 17:38:10 +0300 2009:
> reg =
> %r{<.*?div\s*class\s*=\s*.entry.*?>[^<]*<.*?img\s*src\s*=\s*.([^"|']*).*?>[^<]*<
> .*?p\s*class\s*=\s*.date.*?>}im
> #del = %r{<(?!p|div|img)[^>]*>}i
>
> My ruby process use 100% cpu for long time and on linux exit normaly, on
> freebsd no exit %-(.
> I'm submited another bug for freebsd
> http://www.freebsd.org/cgi/query-pr.cgi?pr=136384 but this is for freebsd only. 
> 
> This templates writes another man for perl and i'm must use here.

Firstly, Ruby regexps are not PCRE, so you must allow for some
leeway when constructing the regexp. You cannot (necessarily)
just drop the Perl version in and expect it to work, or
to work the same way.

Secondly, you should be using something like Nokogiri or
hpricot rather than "parsing" the HTML yourself. For example
your div matcher will fail if the attribute is quoted.

Thirdly, it has "pathological" written all over it. You
should refactor the regexp to try to get some small case
that is reproducible to illustrate the actual problem to
see if it is something that should be fixed.
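
As a rough illustration of such a reduction (a sketch, not the reporter's actual test case): the pattern below keeps only the lazy `.*?` runs from the original, and `REDUCED`, `synthetic_page` and `time_match` are hypothetical names. The input is synthetic and deliberately contains no img tag, so the match must fail and the engine has to exhaust its alternatives before giving up. Timings depend heavily on the Ruby version and regexp engine, so none are quoted here.

```ruby
require 'benchmark'

# Hypothetical reduction of the reported pattern: under /m each lazy `.*?`
# may extend across the whole input, so a failing match explores many
# combinations of positions before the engine gives up.
REDUCED = %r{<.*?div.*?>[^<]*<.*?img.*?>[^<]*<.*?p\s*class.*?>}m

# Synthetic input (an assumption for illustration): n entry blocks and no
# <img> tag anywhere, so REDUCED can never match.
def synthetic_page(n)
  "<div class='entry'>text</div>\n" * n
end

# Returns the match result (expected to be nil) and the elapsed seconds.
def time_match(n)
  html = synthetic_page(n)
  result = nil
  seconds = Benchmark.realtime { result = REDUCED.match(html) }
  [result, seconds]
end

# Sizes are kept small on purpose; increase them carefully.
[10, 20, 40].each do |n|
  result, seconds = time_match(n)
  printf("%3d blocks: matched=%-5s %.4fs\n", n, !result.nil?, seconds)
end
```

If the elapsed time grows much faster than the input size, that small case is the one worth posting.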

I am pretty sure there was another thread about really bad
regexp performance in a pathological case a while back, if
you want to search the archives.


Eero
--
Magic is insufficiently advanced technology.

07/08/2009 03:04 PM - Nobuyoshi Nakada

  • ruby -v changed from ruby 1.8.7 (2009-06-08 patchlevel 173) [x86_64-linux] to ruby 1.8.7 (2009-04-08 patchlevel 160) [i386-freebsd6]

07/08/2009 03:40 PM - Nobuyoshi Nakada

  • Status changed from Open to Rejected
The regexp backtracks far too many times, which takes a lot of time.
You can use (?>...) to suppress backtracking:
  reg = %r{(?><div\s*class\s*=\s*.entry.*?>.*?<img\b[^<>]*\s+src\s*=\s*.([^\"|\']*).*?>).*?<p\s*class\s*=\s*.date.*?>}im
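
To illustrate what (?>...) does, here is a minimal sketch. The first two patterns are a toy example, and the HTML in SAMPLE is made up for illustration; it is not taken from the attached test file. Note that an atomic group can change what matches, not only how long the match takes.

```ruby
# Plain quantifier: a+ first takes "aaa", then gives back one "a"
# so that the trailing "ab" can still match.
PLAIN  = /a+ab/
# Atomic group: once (?>a+) has taken its "a"s it never gives any back,
# so the trailing "ab" can no longer match on this input.
ATOMIC = /(?>a+)ab/

INPUT = 'aaab'
p PLAIN.match(INPUT)   # => #<MatchData "aaab">
p ATOMIC.match(INPUT)  # => nil

# The atomic-group regexp from the comment above, applied to a tiny
# hypothetical snippet (SAMPLE is an assumption, not the real page).
NAKADA_REG = %r{(?><div\s*class\s*=\s*.entry.*?>.*?<img\b[^<>]*\s+src\s*=\s*.([^\"|\']*).*?>).*?<p\s*class\s*=\s*.date.*?>}im
SAMPLE = %q{<div class="entry"><img src="a.png"><p class="date">}

m = NAKADA_REG.match(SAMPLE)
p m[1] if m            # => "a.png"
```

Once the engine has left the group, it will not re-enter it to try a different div or img position, which is what removes most of the backtracking.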

07/09/2009 10:46 PM - paranormal dev

I'm rewriting a large program and am writing a compatibility layer until all the refactoring is done. The regexp is bad because it comes from that program.
