Bug #19234
closed[3.2.0dev] YJIT code GC can lead to crashes
Description
Filing this bug here in case some people may have observed it too and may have more information, and also to keep track of it for the upcoming 3.2.0 release.
After changing some settings on our CI to make sure YJIT's code_gc
would trigger, we discovered that it sometimes cause crashes.
The crash can take many different form (e.g. [BUG] Segmentation fault at 0x00005604a8e78006
or [BUG] Illegal instruction at 0x0000aaaacc0ce4c0
), and happens on both x86
and arm64
.
It however happens very consistently on our CI, but only after running for 15 to 20 minutes and we haven't been able to reduce it to a local reproduction script.
When it happens however the backtrace isn't really helpful:
-- C level backtrace information -------------------------------------------
/usr/local/ruby/bin/real-ruby(rb_print_backtrace+0x11) [0x5604a8a6df7d] vm_dump.c:770
/usr/local/ruby/bin/real-ruby(rb_vm_bugreport) vm_dump.c:1065
/usr/local/ruby/bin/real-ruby(rb_bug_for_fatal_signal+0xee) [0x5604a8ba927e] error.c:813
/usr/local/ruby/bin/real-ruby(sigsegv+0x4d) [0x5604a89c3ded] signal.c:964
/lib/x86_64-linux-gnu/libpthread.so.0(__restore_rt+0x0) [0x7fb5b4285420]
[0x5604acd31079]
Like regular GC bugs, it is likely that the code GC need to trigger at a very specific place for the bug to happen. Our attempts at triggering it manually with RubyVM::YJIT.code_gc
or to set the executable memory very low to trigger it more often didn't allow for a simpler reproduction.
Both @k0kubun (Takashi Kokubun) and @alanwu (Alan Wu) are investigating it right now.
Updated by byroot (Jean Boussier) almost 2 years ago
- Backport changed from 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN to 2.7: DONTNEED, 3.0: DONTNEED, 3.1: DONTNEED
Updated by alanwu (Alan Wu) almost 2 years ago
- Status changed from Open to Closed
Applied in changeset git|5fa608ed79645464bf80fa318d89745159301471.
YJIT: Fix code GC freeing stubs with a trampoline (#6937)
Stubs we generate for invalidation don't necessarily co-locate with the
code that jump to the stub. Since we rely on co-location to keep stubs
alive as they are in the outlined code block, it used to be possible for
code GC inside branch_stub_hit() to free the stub that's its direct
caller, leading us to return to freed code after.
Stubs used to look like:
mov arg0, branch_ptr
mov arg1, target_idx
mov arg2, ec
call branch_stub_hit
jmp return_reg
Since the call and the jump after the call is the same for all stubs, we
can extract them and use a static trampoline for them. That makes
branch_stub_hit() always return to static code. Stubs now look like:
mov arg0, branch_ptr
mov arg1, target_idx
jmp trampoline
Where the trampoline is:
mov arg2, ec
call branch_stub_hit
jmp return_reg
Code GC can now free stubs without problems since we'll always return
to the trampoline, which we generate once on boot and lives forever.
This might save a small bit of memory due to factoring out the static
part of stubs, but it's probably minor.
[Bug #19234]
Co-authored-by: Takashi Kokubun takashikkbn@gmail.com