Backport #3329
Closed
Segfault using nokogiri
Description
=begin
I'm using the Nokogiri gem to parse HTML and XML and to apply XSLT. My tests went fine in irb, but when I run the method from a script I get the following:
jdrowell@falcon:~/work/GEDi$ ruby crawler.rb
crawler.rb:18: [BUG] Segmentation fault
ruby 1.9.1p378 (2010-01-10 revision 26273) [i686-linux]
-- control frame ----------
c:0010 p:---- s:0033 b:0033 l:000032 d:000032 CFUNC :transform
c:0009 p:0066 s:0029 b:0029 l:000028 d:000028 METHOD crawler.rb:18
c:0008 p:0037 s:0022 b:0022 l:000013 d:000021 BLOCK crawler.rb:26
c:0007 p:---- s:0019 b:0019 l:000018 d:000018 FINISH
c:0006 p:---- s:0017 b:0017 l:000016 d:000016 CFUNC :each
c:0005 p:0032 s:0014 b:0014 l:000013 d:000013 METHOD crawler.rb:25
c:0004 p:0011 s:0010 b:0010 l:000009 d:000009 METHOD crawler.rb:31
c:0003 p:0063 s:0007 b:0007 l:00057c d:002338 EVAL crawler.rb:36
c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
c:0001 p:0000 s:0002 b:0002 l:00057c d:00057c TOP
-- Ruby level backtrace information-----------------------------------------
crawler.rb:18:in `transform'
crawler.rb:18:in `prefeitura_noticia'
crawler.rb:26:in `block in prefeitura_noticias'
crawler.rb:25:in `each'
crawler.rb:25:in `prefeitura_noticias'
crawler.rb:31:in `run'
crawler.rb:36:in `<main>'
-- C level backtrace information -------------------------------------------
0x81239f7 ruby(rb_vm_bugreport+0x47) [0x81239f7]
0x8150363 ruby() [0x8150363]
0x81503d8 ruby(rb_bug+0x28) [0x81503d8]
0x80d33c8 ruby() [0x80d33c8]
0x276410 [0x276410]
0x2ea1ff /usr/lib/libxslt.so.1(xsltApplyStylesheet+0x2f) [0x2ea1ff]
0xc3cac3 /home/jdrowell/.rvm/gems/ruby-1.9.1-p378/gems/nokogiri-1.4.1/lib/nokogiri/nokogiri.so(+0xaac3) [0xc3cac3]
0x811345d ruby() [0x811345d]
0x8113790 ruby() [0x8113790]
0x811e8ed ruby() [0x811e8ed]
0x811856d ruby() [0x811856d]
0x811b3c6 ruby() [0x811b3c6]
0x812077a ruby(rb_yield+0x1aa) [0x812077a]
0x812e191 ruby(rb_ary_each+0x41) [0x812e191]
0x8113790 ruby() [0x8113790]
0x811e8ed ruby() [0x811e8ed]
0x811856d ruby() [0x811856d]
0x811b3c6 ruby() [0x811b3c6]
0x811b5f9 ruby(rb_iseq_eval_main+0x99) [0x811b5f9]
0x805d64f ruby(ruby_exec_node+0x9f) [0x805d64f]
0x805e9e6 ruby(ruby_run_node+0x46) [0x805e9e6]
0x805c09a ruby(main+0x5a) [0x805c09a]
0x126bd6 /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x126bd6]
0x805bfa1 ruby() [0x805bfa1]
Please advise if any additional information would be useful. I can provide both the HTML and the XSLT file that caused the segfault. I'll continue to work on the issue and will leave more comments later.
=end
Updated by jdrowell (John Rowell) almost 14 years ago
=begin
This may be caused by character encodings. Calling
res = xslt.transform(page.search('//body'))
(where 'page' is a Mechanize instance) causes a segfault, while calling
res = xslt.transform(Nokogiri::HTML(page.content, nil, page.encoding))
does not. The original page is encoded as ISO-8859-1, and Mechanize doesn't always convert text to UTF-8 (#text is converted, #content is not). Perhaps libxslt only accepts UTF-8 and Nokogiri is not converting the encoding properly before passing the text along.
=end
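As a rough illustration of the encoding hypothesis in the comment above, the following sketch uses only Ruby's built-in String encoding methods (no Nokogiri or Mechanize) to show how ISO-8859-1 bytes can be transcoded to UTF-8 before being handed to a parser. The helper name to_utf8 and the sample string are hypothetical, and nothing here is confirmed to be the cause of, or a fix for, the segfault.

```ruby
# Hypothetical helper (not part of Nokogiri or Mechanize): reinterpret the
# raw bytes as the given source encoding, then transcode them to UTF-8.
def to_utf8(raw, source_encoding = 'ISO-8859-1')
  raw.dup.force_encoding(source_encoding).encode('UTF-8')
end

# Sample input, assumed for illustration: "noticia" with an i-acute,
# stored as ISO-8859-1 bytes (0xED is i-acute in Latin-1).
latin1_bytes = "not\xEDcia".force_encoding('BINARY')

utf8 = to_utf8(latin1_bytes)

puts utf8.encoding        # UTF-8
puts utf8.valid_encoding? # true
```

With Nokogiri itself, the counterpart is to state the source encoding when parsing, as in the Nokogiri::HTML(page.content, nil, page.encoding) call quoted in the comment, rather than transcoding by hand.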
Updated by tenderlovemaking (Aaron Patterson) almost 14 years ago
=begin
On Fri, May 21, 2010 at 09:50:54AM +0900, John Rowell wrote:
> Issue #3329 has been updated by John Rowell.
>
> This may be happening due to character encodings. A
> res = xslt.transform(page.search('//body'))
> (where 'page' is a Mechanize instance) causes a segfault, while a
> res = xslt.transform(Nokogiri::HTML(page.content, nil, page.encoding))
> does not. The original page is encoded with ISO-8859-1, and Mechanize doesn't always convert text to UTF-8 (#text is converted, #content is not). Maybe libxslt only accepts UTF-8 and Nokogiri is not properly converting the encodings before sending the text.
This sounds like it may be a bug in Nokogiri and not Ruby. Can you
please add a ticket to our tracker here:
http://github.com/tenderlove/nokogiri/issues
Also, if you provide the output of 'nokogiri -v' and a script to
reproduce the problem, that would be extremely helpful. Thanks!
--
Aaron Patterson
http://tenderlovemaking.com/
=end
Updated by jeremyevans0 (Jeremy Evans) almost 5 years ago
- Description updated (diff)
- Status changed from Open to Closed