Bug #1740

Ruby regexp uses 100% CPU.

Added by paranormal (paranormal dev) about 10 years ago. Updated about 8 years ago.

Status: Rejected
Priority: Normal
Assignee: -
ruby -v: ruby 1.8.7 (2009-04-08 patchlevel 160) [i386-freebsd6]
[ruby-core:24188]

Description

=begin
I tested on FreeBSD with
ruby 1.8.7 (2009-04-08 patchlevel 160) [i386-freebsd6]

and on my Linux notebook with
ruby 1.8.7 (2009-06-08 patchlevel 173) [x86_64-linux]

using this code:
#######################################
require 'open-uri'
$KCODE = 'u'

reg = %r{<.?div\s*class\s=\s*.entry.?>[<]<.?img\s*src\s=\s*.(["|']).?>[<]<.?p\s*class\s*=\s*.date.*?>}im
#del = %r{<(?!p|div|img)[>]*>}i

doc = open('http://www.radiokvit.com.ua/?p=1895').read

#doc.gsub!(del, ' ')
a = doc.match(reg)
p a
######################################

The ruby process uses 100% CPU for a long time. On Linux it eventually exits normally; on FreeBSD it never exits %-(.
I have submitted a separate bug for FreeBSD, http://www.freebsd.org/cgi/query-pr.cgi?pr=136384, but that one is FreeBSD-specific.

This pattern was written by someone else for Perl, and I have to use it here.
=end


Files

testfile.html (23.3 KB): document for testing the regexp if the URL is no longer valid. paranormal (paranormal dev), 07/07/2009 11:38 PM

History

#1

Updated by rue (Eero Saynatkari) about 10 years ago

=begin
Excerpts from rubymine message of Tue Jul 07 17:38:10 +0300 2009:

reg =
%r{<.?div\s*class\s=\s*.entry.?>[<]<.?img\s*src\s=\s*.(["|']).?>[<]*<
.?p\s*class\s=\s*.date.?>}im
#del = %r{<(?!p|div|img)[>]
>}i

The ruby process uses 100% CPU for a long time. On Linux it eventually
exits normally; on FreeBSD it never exits %-(.
I have submitted a separate bug for FreeBSD,
http://www.freebsd.org/cgi/query-pr.cgi?pr=136384, but that one is FreeBSD-specific.

This pattern was written by someone else for Perl, and I have to use it here.

Firstly, Ruby regexps are not PCRE, so you must have some
leeway constructing the regexp. You cannot (necessarily)
just drop the Perl version in and expect it to work, or
work the same.

Secondly, you should be using something like Nokogiri or
hpricot rather than "parsing" the HTML yourself. For example
your div matcher will fail if the attribute is quoted.

Thirdly, it has "pathological" written all over it. You
should refactor the regexp to try to get some small case
that is reproducible to illustrate the actual problem to
see if it is something that should be fixed.

I am pretty sure there was another thread about really bad
regexp performance in a pathological case a while back, if
you want to search the archives.

Eero
--
Magic is insufficiently advanced technology.

=end
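One way to follow the "small reproducible case" advice above is to time a deliberately tiny pattern against inputs of growing length. The sketch below is only an illustration: the pattern, the input, and the helper name time_failed_match are assumptions and are not taken from the reported regexp. The input sizes are kept small so the script finishes quickly even when the match time grows exponentially.

```ruby
require 'benchmark'

# Hypothetical reduced pattern (not the one from this report): nested
# quantifiers let the engine split a run of "a" characters in many ways
# before it can conclude that the match fails.
PATTERN = /\A(a+)+\z/

# Returns [match_result, elapsed_seconds] for a run of n "a" characters
# followed by a "b" that makes the overall match fail.
def time_failed_match(n)
  input = ('a' * n) + 'b'
  result = nil
  seconds = Benchmark.realtime { result = PATTERN.match(input) }
  [result, seconds]
end

(10..18).step(2) do |n|
  result, seconds = time_failed_match(n)
  printf("n=%2d  matched=%-5s  %.6fs\n", n, !result.nil?, seconds)
end
```

If the time roughly doubles with each extra character, the pattern is backtracking pathologically. Newer Ruby releases may optimize patterns like this one, in which case the timings can stay flat.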

#2

Updated by nobu (Nobuyoshi Nakada) about 10 years ago

  • ruby -v changed from ruby 1.8.7 (2009-06-08 patchlevel 173) [x86_64-linux] to ruby 1.8.7 (2009-04-08 patchlevel 160) [i386-freebsd6]


#3

Updated by nobu (Nobuyoshi Nakada) about 10 years ago

  • Status changed from Open to Rejected

=begin
Too many backtracks consume a lot of time.
You can use (?>...) to suppress backtracking:
reg = %r{(?>.?]\s+src\s*=\s*.([\"|\']).?>).?<p\s*class\s=\s*.date.*?>}im

=end
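The effect of (?>...) is easiest to see on a toy input. The sketch below uses made-up patterns and a hypothetical helper, matched_text, rather than the regexp from this report. It shows that an atomic group keeps whatever it consumed and never hands characters back, which removes backtracking work but can also change what matches.

```ruby
# Returns the matched substring, or nil when the pattern does not match.
def matched_text(pattern, text)
  m = pattern.match(text)
  m && m[0]
end

text = 'aaab'

# Greedy a* first takes "aaa", then backtracks by one "a" so that the
# literal "ab" can match.
p matched_text(/a*ab/, text)       # "aaab"

# The atomic group takes every "a" it can and gives none back, so the
# literal "a" that follows can never match.
p matched_text(/(?>a*)ab/, text)   # nil

# When nothing after the group needs the consumed characters, the atomic
# version matches the same text as the plain one.
p matched_text(/(?>a*)b/, text)    # "aaab"
```

Because the second pattern stops matching, wrapping part of an existing regexp in (?>...) should be checked against known-good inputs and not assumed to be a pure speed-up.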

#4

Updated by paranormal (paranormal dev) about 10 years ago

=begin
I'm rewriting a large program and am writing a compatibility layer until the refactoring is finished. The regexp is bad because it comes from the original program.
=end
