Bug #12095
closedruby_vm_at_exit can sometime cause a crash.
Description
This behavior has been seen erratically, but one of our users got it to reproduce almost systematically. We didn't managed to understand what made his system special that it would get that crash to reproduce so well.
Here's one of the reports:
https://gist.github.com/blowmage/7ebe774039013bc8c990
The current workaround to that one (alongside a few other comments) is done here: https://github.com/grpc/grpc/pull/5337/files
Note that removing the call to ruby_vm_at_exit makes everything load fine. Also note that the removed comment from that pull request is wrong: this has been happening on versions of Ruby other than 2.0.
It's interesting to note from the backtrace information that this is happening during a garbage collection. The fact that a garbage collection happens at that exact moment is probably the reason that bug is so difficult to reproduce. Perhaps a modified version of ruby might help reproducing it. Or very specific garbage collector settings.
The fault address (0x88) seems to indicate that a NULL pointer into a struct was being dereferenced.
Disassembling the corresponding execution address seems to point at a crash inside obj_info, from the first line of gc_writebarrier_incremental, but this is after a very quick inspection of the code, so don't take my word from it.
This problem has been repoted to us on Ruby 2.0.0, Ruby 2.2.0, Ruby 2.2.3, Ruby 2.3.0, at least.
Files
Updated by evanphx (Evan Phoenix) over 8 years ago
I'm hitting this as well, and looking over the code in question on 2.3.0, I wondering if the problem is that the at_exit pseudo-object is actually allocated within the body of rb_vm_t. It's address is taken and passed to rb_ary_push
, which perform OBJ_WRITE. That's where wb_incremental is invoked from.
Because the mark bits are not located with the object header anymore, the mark bitmap is consulted but the position in the mark bitmap is calculated against the address of at_exit, which isn't located on the main ruby heap at all!
The path to the bad pointer, given X as the address of at_exit within rb_vm_t is: RVALUE_BLACK_P(X) => RVALUE_MARKED(X) => RVALUE_MARK_BITMAP(X) => GET_HEAP_MARK_BITS(X) => GET_HEAP_PAGE(X) => GET_PAGE_HEADER(X) => GET_PAGE_BODY(X) => ((struct heap_page_body *)((bits_t)(x) & ~(HEAP_ALIGN_MASK))).
The value returned by that above sequence is supposed to return a page header that can itself be dereferenced to find the mark bits. But because the at_exit is in a random place, the page header is basically random bytes, and thus the deference crashes.
Updated by evanphx (Evan Phoenix) over 8 years ago
- File at_exit_fix.patch at_exit_fix.patch added
Attached is a patch that fixes this issue by replacing the troublesome usage of a VALUE to store the at_exit functions with a simple linked list. This patch was created against the ruby_2_3 branch. It should apply cleanly to most branches because of it's small size.
Updated by nobu (Nobuyoshi Nakada) over 8 years ago
Thank you for the investigation and the patch, I've missed this.
you should:
-
free
the list inruby_vm_run_at_exit_hooks
, - use the argument
vm
instead ofGET_VM()
, and - replace the existing
typedef
ofrb_vm_t
with merestruct
, as multipletypedef
s are not allowed in C, IIRC.
Updated by evanphx (Evan Phoenix) over 8 years ago
- File at_exit_fix.patch at_exit_fix.patch added
Thank you for the feedback nobu!
Attached is an updated version of the patch with the changes.
Updated by naruse (Yui NARUSE) over 8 years ago
- Backport changed from 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN, 2.3: UNKNOWN to 2.1: REQUIRED, 2.2: REQUIRED, 2.3: REQUIRED
Updated by nobu (Nobuyoshi Nakada) over 8 years ago
at_exit
functions should be called in the inverse order, LIFO.
Updated by nobu (Nobuyoshi Nakada) over 8 years ago
- Status changed from Open to Closed
Applied in changeset r54484.
at_exit list
- vm_core.h (rb_vm_struct): make at_exit a single linked list but
not RArray, not to mark the registered functions by the write
barrier. based on the patches by Evan Phoenix.
[ruby-core:73908] [Bug #12095]
Updated by naruse (Yui NARUSE) over 8 years ago
- Backport changed from 2.1: REQUIRED, 2.2: REQUIRED, 2.3: REQUIRED to 2.1: REQUIRED, 2.2: REQUIRED, 2.3: DONE
ruby_2_3 r54633 merged revision(s) 54484.
Updated by usa (Usaku NAKAMURA) over 8 years ago
- Backport changed from 2.1: REQUIRED, 2.2: REQUIRED, 2.3: DONE to 2.1: REQUIRED, 2.2: DONE, 2.3: DONE
ruby_2_2 r54683 merged revision(s) 54484.