Project

General

Profile

Actions

Backport #4339

closed

Segmentation fault during Marshal.load

Added by rhythmx (Sean Bradly) about 13 years ago. Updated about 12 years ago.


Description

=begin
Conditions that seem to have to be satisfied:

  1. A call to Marshal.load must be interrupted by a context switch.
  2. Another thread calls GC.start manually.
  3. The object being marshaled needs to have a lot of child objects to be reliable.

This appears to be a thread-safety issue affecting Ruby 1.8.x only. It triggers reliably for me on:

Ubuntu -> ruby 1.8.7 (2010-06-23 patchlevel 299) [i686-linux]
Ubuntu-rvm -> ruby 1.8.7 (2010-12-23 patchlevel 330) [i686-linux]
MacRuby -> ruby 1.8.7 (2009-06-12 patchlevel 174) [universal-darwin10.0]
Mac-rvm -> ruby 1.8.7 (2010-08-16 patchlevel 302) [i686-darwin10.4.0]
=end


Files

REPRO4.rb (906 Bytes) REPRO4.rb Reproduces the segfault rhythmx (Sean Bradly), 01/29/2011 09:09 AM
ruby-1.8.7-trac22417.patch (3.49 KB) ruby-1.8.7-trac22417.patch Patch file rhythmx (Sean Bradly), 02/01/2011 08:56 AM
ruby-1.8.7-marshal.patch (6.09 KB) ruby-1.8.7-marshal.patch backport of marshall patch from upstream vo.x (Vit Ondruch), 11/15/2011 11:16 PM
Actions #1

Updated by rhythmx (Sean Bradly) about 13 years ago

=begin
Including a gdb trace. This issue happens in the mark_load_arg function when st_foreach encounters a corrupt bin ptr. I have seen similar failures occur in marshal_dump_arg. Also, very rarely the st_table_entry list of one of the bins will become circularly linked, causing an infinite loop.

Program received signal SIGSEGV, Segmentation fault.
0x001c86b6 in st_foreach (table=0x3cb3c0, func=0x178f40 <mark_entry>, arg=0) at st.c:486
486 for(ptr = table->bins[i]; ptr != 0;) {
(gdb) bt
#0 0x001c86b6 in st_foreach (table=0x3cb3c0, func=0x178f40 <mark_entry>, arg=0) at st.c:486
#1 0x001786c3 in mark_tbl (tbl=0x0) at gc.c:716
#2 rb_mark_tbl (tbl=0x0) at gc.c:723
#3 0x00189c04 in mark_load_arg (ptr=0xbfffca3c) at marshal.c:841
#4 0x00178d7e in gc_mark_children (ptr=3086829960, lev=1) at gc.c:1025
#5 0x00178a71 in mark_locations_array (x=, n=) at gc.c:684
#6 0x00157b95 in thread_mark (th=0x8087300) at eval.c:10466
#7 0x00178d7e in gc_mark_children (ptr=3086830120, lev=3) at gc.c:1025
#8 0x00178d06 in gc_mark_children (ptr=, lev=) at gc.c:1006
#9 0x00178e18 in gc_mark_children (ptr=, lev=) at gc.c:1057
#10 0x001791de in garbage_collect () at gc.c:1465
#11 0x00179be7 in rb_gc () at gc.c:1530
#12 0x00179c17 in rb_gc_start () at gc.c:1547
#13 0x00159a9d in call_cfunc (func=0x179c00 <rb_gc_start>, recv=3978176, len=0, argc=0, argv=0x0) at eval.c:5781
#14 0x00164a09 in rb_call0 (klass=, recv=, id=5313, oid=5313, argc=0, argv=0x0, body=0xb7fd88bc, flags=) at eval.c:5928
#15 0x00164baa in rb_call (klass=3086846160, recv=3086846180, mid=5313, argc=0, argv=0x0, scope=0, self=3086911820) at eval.c:6176
#16 0x00161e7b in rb_eval (self=, n=) at eval.c:3506
#17 0x00163404 in rb_yield_0 (val=, self=, klass=0, flags=, avalue=0) at eval.c:5095
#18 0x0016e657 in rb_yield (val=3) at eval.c:5179
#19 0x0018e641 in int_dotimes (num=201) at numeric.c:2960
#20 0x00159a9d in call_cfunc (func=0x18e5f0 <int_dotimes>, recv=3978176, len=0, argc=0, argv=0x0) at eval.c:5781
#21 0x00164a09 in rb_call0 (klass=, recv=, id=5753, oid=5753, argc=0, argv=0x0, body=0xb7fe4c5c, flags=) at eval.c:5928
#22 0x00164baa in rb_call (klass=3086896500, recv=201, mid=5753, argc=0, argv=0x0, scope=0, self=3086911820) at eval.c:6176
#23 0x00161e7b in rb_eval (self=, n=) at eval.c:3506
#24 0x0016292e in rb_eval (self=, n=) at eval.c:3236
#25 0x00163404 in rb_yield_0 (val=, self=, klass=0, flags=, avalue=2) at eval.c:5095
#26 0x0016374a in rb_thread_yield (arg=3086829920, th=0x8087658) at eval.c:12553
#27 0x0016c2f9 in rb_thread_start_0 (fn=, arg=, th=0x8087658) at eval.c:12471
#28 0x00159ade in call_cfunc (func=0x16c420 <rb_thread_initialize>, recv=3978176, len=-2, argc=0, argv=0x0) at eval.c:5775
#29 0x00164a09 in rb_call0 (klass=, recv=, id=2961, oid=2961, argc=0, argv=0x0, body=0xb7fe59b8, flags=) at eval.c:5928
#30 0x00164baa in rb_call (klass=3086899740, recv=3086829940, mid=2961, argc=0, argv=0x0, scope=1, self=6) at eval.c:6176
#31 0x00165459 in rb_funcall2 (recv=3978176, mid=2961, argc=0, argv=0x0) at eval.c:6312
#32 0x001654f7 in rb_obj_call_init (obj=3086829940, argc=0, argv=0x0) at eval.c:7825
#33 0x00165552 in rb_thread_s_new (argc=0, argv=0x0, klass=3086899740) at eval.c:12584
#34 0x00159ab8 in call_cfunc (func=0x165510 <rb_thread_s_new>, recv=3978176, len=-1, argc=0, argv=0x0) at eval.c:5778
#35 0x00164a09 in rb_call0 (klass=, recv=, id=3361, oid=3361, argc=0, argv=0x0, body=0xb7fe59e0, flags=) at eval.c:5928
#36 0x00164baa in rb_call (klass=3086899720, recv=3086899740, mid=3361, argc=0, argv=0x0, scope=0, self=3086911820) at eval.c:6176
#37 0x00161e7b in rb_eval (self=, n=) at eval.c:3506
#38 0x0016292e in rb_eval (self=, n=) at eval.c:3236
#39 0x0015f659 in rb_eval (self=, n=) at eval.c:3501
#40 0x00170d66 in ruby_exec_internal () at eval.c:1654
#41 0x00170db2 in ruby_exec () at eval.c:1674
#42 0x00170df5 in ruby_run () at eval.c:1684
#43 0x0804869d in main (argc=2, argv=0xbfffed24, envp=0xbfffed30) at main.c:48

=end

Actions #2

Updated by nobu (Nobuyoshi Nakada) about 13 years ago

  • Category set to core
  • Status changed from Open to Assigned
  • Assignee set to shyouhei (Shyouhei Urabe)

=begin

=end

Actions #3

Updated by rhythmx (Sean Bradly) about 13 years ago

=begin
Fwiw, this resolves the issue for me. The general problem is that "struct load_arg arg" is on the stack AND wrapped by an RData. When GC starts, the RData refers to arg (a stack pointer) but the current (GC) thread's stack has clobbered this address. My fix simply switches arg to be dynamically allocated, and the same for mark_dump_arg. However, it appears this will leak a small amount of memory if Marshal throws an exception. This seems ok for my application, so I'm not going to look into it any more.
=end

Actions #4

Updated by nobu (Nobuyoshi Nakada) about 13 years ago

=begin
r25230 has fixed this bug in trunk and 1.8.
=end

Updated by shyouhei (Shyouhei Urabe) almost 13 years ago

  • Assignee changed from shyouhei (Shyouhei Urabe) to nobu (Nobuyoshi Nakada)

... and r25230 has a bug. cf: #2386

Updated by vo.x (Vit Ondruch) over 12 years ago

Hi Shyouhei,

Could you please translate for me what is wrong with this patch? I'd like to apply this patch for Ruby in RHEL 6, since it fixes one bug we are facing when using mcollective.

Updated by shyouhei (Shyouhei Urabe) over 12 years ago

  • Assignee changed from nobu (Nobuyoshi Nakada) to shyouhei (Shyouhei Urabe)

It's a rather simple story. Instead of applying the proposed patch Nobu wrote something different (against trunk) for his own and told me to backport that. But I found Nobu's introduced a test failure. That's all.

So we had not take a close look at the patch attatched here (I read this one then, but didn't evaluated). Sorry. I tested it today and it seems OK. If you plan to fix this issue use the one attatched here instead of applying Nobu's. I'll also apply this in the next patchlevel.

Updated by nobu (Nobuyoshi Nakada) over 12 years ago

This patch seems to leak memory if an exception raises during dump/load.
You should use Data_Make_Struct() with dfree -1, I guess.

Updated by kosaki (Motohiro KOSAKI) over 12 years ago

  • Status changed from Assigned to Feedback

Vit, please send us an updated patch.

thank you.

Actions #11

Updated by vo.x (Vit Ondruch) over 12 years ago

Hi,

Could you please review the attached patch? It is basically backported the upstream marshal.c, i.e. it contains slightly more stuff then just r25230. "make test-all" passes as well as the original reproducer. Please apply this patch to 1.8

Updated by vo.x (Vit Ondruch) about 12 years ago

Bump.

Can somebody take look on the ruby-1.8.7-marshal.patch patch, please?

Updated by meyering (Jim Meyering) about 12 years ago

Hello,
I encountered this bug on RHEL-6's ruby-1.8.7.299 and with .352 and
reported it against RHEL-6 via http://bugzilla.redhat.com/781561
Applying the patch,
http://bugs.ruby-lang.org/attachments/2243/ruby-1.8.7-marshal.patch
solved our problem nicely. Please consider it for upstream ruby 1.8.7.

The reproducer, http://bugs.ruby-lang.org/attachments/1446/REPRO4.rb
also triggers a segmentation fault with Fedora 16's ruby:

$ ruby /t/REPRO4.rb
/t/REPRO4.rb:35: [BUG] Segmentation fault
ruby 1.8.7 (2011-12-28 patchlevel 357) [x86_64-linux]

zsh: abort (core dumped)  ruby /t/REPRO4.rb
Actions #14

Updated by nobu (Nobuyoshi Nakada) about 12 years ago

  • Tracker changed from Backport to Bug
  • Project changed from Backport187 to Ruby 1.8
  • Category changed from core to core
Actions #15

Updated by nobu (Nobuyoshi Nakada) about 12 years ago

  • Status changed from Feedback to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r34866.
Sean, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • marshal.c (mark_dump_arg): mark destination string. patch by
    Vit Ondruch. [Bug #4339]
  • marshal.c (clear_dump_arg, clear_load_arg): clean up also data
    tables as same as symbols tables.
Actions #16

Updated by nobu (Nobuyoshi Nakada) about 12 years ago

  • Tracker changed from Bug to Backport
  • Project changed from Ruby 1.8 to Backport187
  • Category changed from core to core

Updated by vo.x (Vit Ondruch) about 12 years ago

Thank you for accepting the patch.

Updated by vo.x (Vit Ondruch) about 12 years ago

BTW shouldn't it go into ruby-187 branch instead of ruby_1_8?

Updated by marcandre (Marc-Andre Lafortune) about 12 years ago

Hi,

Vit Ondruch wrote:

BTW shouldn't it go into ruby-187 branch instead of ruby_1_8?

Nobu applied it also to -187 as r34867

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0