Bug #17573
closedCrashes in profiling tools when signals arrive in non-Ruby threads
Description
Stackprof (and likely similar tools) works by setting up a timer to sends it a unix signal on an interval. From that signal handler it does a small amount of internal bookkeeping and calls rb_postponed_job_register_one
.
This is a problem because unix signals arrive on an arbitrary thread, and as of Ruby 3.0 the execution context (which rb_postponed_job_register_one
relies on) is stored as a thread-local.
This reproduction crashes reliably for me on macos. It doesn't seem to on linux, maybe because the timer thread is different or the kernel has a different "arbitrary" choice. It feels like this is just one of the circumstances this crash could happen.
require "stackprof"
StackProf.run(interval: 100) do
1000.times do
GC.start
end
end
$ ruby crash_stackprof.rb
[BUG] Segmentation fault at 0x0000000000000038
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin19]
-- Crash Report log information --------------------------------------------
See Crash Report log file under the one of following:
* ~/Library/Logs/DiagnosticReports
* /Library/Logs/DiagnosticReports
for more details.
Don't forget to include the above Crash Report log file in bug reports.
-- Machine register context ------------------------------------------------
rax: 0x0000000000000000 rbx: 0x0000000107fbb780 rcx: 0x0000000000000000
rdx: 0x0000000000000000 rdi: 0x0000000106982c28 rsi: 0x0000000107fbb780
rbp: 0x000070000eb47a10 rsp: 0x000070000eb479f0 r8: 0x000070000eb47eb0
r9: 0xd44931e7344c235f r10: 0x00007fff6ef49501 r11: 0x0000000000000202
r12: 0xd44931e7344c235f r13: 0x00000000ffffffff r14: 0x0000000000000000
r15: 0x0000000000000000 rip: 0x00000001068c85fd rfl: 0x0000000000010202
-- C level backtrace information -------------------------------------------
/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(rb_vm_bugreport+0x6cf) [0x1068c2d5f]
/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(rb_bug_for_fatal_signal+0x1d6) [0x1066dc556]
/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(sigsegv+0x5b) [0x10681aa0b]
/usr/lib/system/libsystem_platform.dylib(_sigtramp+0x1d) [0x7fff6efff5fd]
/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(rb_postponed_job_register_one+0x1d) [0x1068c85fd]
/usr/lib/system/libsystem_platform.dylib(0x7fff6efff5fd) [0x7fff6efff5fd]
0x38
is the address of ((rb_execution_context_t *)0)->vm
.
lldb shows that it comes from a second thread which was running timer_pthread_fn
$ lldb =ruby -- ./crash_stackprof.rb
(lldb) target create "/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby" ruCurrent executable set to '/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby' (x86_64).
(lldb) settings set -- target.run-args "./crash_stackprof.rb" (lldb) run
Process 92893 launched: '/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby' (x86_64) Process 92893 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGALRM
frame #0: 0x00000001000dfbcd ruby`rgengc_check_relation [inlined] RVALUE_OLD_P_RAW(obj=4303689480) at gc.c:1419:32
1416 RVALUE_OLD_P_RAW(VALUE obj)
1417 {
1418 const VALUE promoted = FL_PROMOTED0 | FL_PROMOTED1;
-> 1419 return (RBASIC(obj)->flags & promoted) == promoted;
1420 }
1421
1422 static inline int thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x38)
frame #0: 0x000000010029a5fd ruby`rb_postponed_job_register_one(flags=3492904, func=(stackprof.bundle`stackprof_gc_job_handler at stackprof.c:598), data=0x0000000000000000) at vm_trace.c:1622:19 1619 rb_postponed_job_register_one(unsigned int flags, rb_postponed_job_func_t func, void *data)
1620 { 1621 rb_execution_context_t *ec = GET_EC();
-> 1622 rb_vm_t *vm = rb_ec_vm_ptr(ec);
1623 rb_postponed_job_t *pjob;
1624 rb_atomic_t i, index;
1625
Target 0: (ruby) stopped.
(lldb) t 2
* thread #2
frame #0: 0x000000010029a5fd ruby`rb_postponed_job_register_one(flags=3492904, func=(stackprof.bundle`stackprof_gc_job_handler at stackprof.c:598), data=0x0000000000000000) at vm_trace.c:1622:19
1619 rb_postponed_job_register_one(unsigned int flags, rb_postponed_job_func_t func, void *data)
1620 {
1621 rb_execution_context_t *ec = GET_EC();
-> 1622 rb_vm_t *vm = rb_ec_vm_ptr(ec);
1623 rb_postponed_job_t *pjob;
1624 rb_atomic_t i, index;
1625
(lldb) bt
* thread #2
* frame #0: 0x000000010029a5fd ruby`rb_postponed_job_register_one(flags=3492904, func=(stackprof.bundle`stackprof_gc_job_handler at stackprof.c:598), data=0x0000000000000000) at vm_trace.c:1622:19
frame #1: 0x00007fff6efff5fd libsystem_platform.dylib`_sigtramp + 29
frame #2: 0x00007fff6ef4e3d7 libsystem_kernel.dylib`poll + 11
frame #3: 0x0000000100238e1e ruby`timer_pthread_fn(p=<unavailable>) at thread_pthread.c:2189:15
frame #4: 0x00007fff6f00b109 libsystem_pthread.dylib`_pthread_start + 148
frame #5: 0x00007fff6f006b8b libsystem_pthread.dylib`thread_start + 15
Attached is my attempted fix (also available at https://github.com/ruby/ruby/pull/4108) which uses the main-ractor's EC if there is none on the current thread. I hope this works (it seems to and fixes the crash) because before Ruby 3.0 there was a global EC, but I'm not entirely sure if this will cause other problems.
If accepted this should be backported to the 3.0 branch.
Files
Updated by jhawthorn (John Hawthorn) almost 4 years ago
Updated by ko1 (Koichi Sasada) almost 4 years ago
- Assignee set to ko1 (Koichi Sasada)
Updated by ko1 (Koichi Sasada) almost 4 years ago
Ah, OK. This issue doesn't expose on recent Linux system.
Hmm. How to debug it...
Updated by byroot (Jean Boussier) almost 4 years ago
This issue doesn't expose on recent Linux system
I do have 3.0 receive SEGV avout 50% of the time when using stackprof in CPU mode on ubuntu 20.04. Unfortunately All I get as output is Segmentation fault
, so I don't know wether it's the same issue or not.
Updated by byroot (Jean Boussier) over 3 years ago
I don't know wether it's the same issue or not.
So I tested this patch on top of the current ruby_3_0
branch, and it does fix the stackprof issue I had.
Updated by alanwu (Alan Wu) over 3 years ago
Ah, OK. This issue doesn't expose on recent Linux system.
I can somewhat reliably repro by running ruby --jit repro.rb
with the following:
# repro.rb
require "stackprof"
StackProf.run(interval: 100) do
1000.times do
GC.start
end
end
I get a crash with the ruby:3.0.0 image from DockerHub.
It happens with --jit
because MJIT spawns a worker thread that doesn't have
an execution context. It could happen with the timer thread too, but not all build
configurations spawn the timer thread.
Since this problem is specific to rb_postponed_job_register_one()
, I think it
would be better to fix it there rather than touching the much more popular rb_current_execution_context()
:
diff --git a/vm_trace.c b/vm_trace.c
index 383f255799..b012e946e9 100644
--- a/vm_trace.c
+++ b/vm_trace.c
@@ -1599,8 +1599,8 @@ postponed_job_register(rb_execution_context_t *ec, rb_vm_t *vm,
int
rb_postponed_job_register(unsigned int flags, rb_postponed_job_func_t func, void *data)
{
- rb_execution_context_t *ec = GET_EC();
- rb_vm_t *vm = rb_ec_vm_ptr(ec);
+ rb_vm_t *vm = GET_VM();
+ rb_execution_context_t *ec = rb_vm_main_ractor_ec(vm);
begin:
switch (postponed_job_register(ec, vm, flags, func, data, MAX_POSTPONED_JOB, vm->postponed_job_index)) {
@@ -1618,8 +1618,8 @@ rb_postponed_job_register(unsigned int flags, rb_postponed_job_func_t func, void
int
rb_postponed_job_register_one(unsigned int flags, rb_postponed_job_func_t func, void *data)
{
- rb_execution_context_t *ec = GET_EC();
- rb_vm_t *vm = rb_ec_vm_ptr(ec);
+ rb_vm_t *vm = GET_VM();
+ rb_execution_context_t *ec = rb_vm_main_ractor_ec(vm);
rb_postponed_job_t *pjob;
rb_atomic_t i, index;
Basically make the postpond job API always deliver to the main ractor. I think it makes sense
given that signal don't necessarily land on a Ruby thread.
Available as a GitHub PR: https://github.com/ruby/ruby/pull/4336
Updated by alanwu (Alan Wu) over 3 years ago
- Related to Bug #15263: [PATCH] vm_trace.c (postponed_job_register): only hit main thread added
Updated by alanwu (Alan Wu) about 3 years ago
- Backport changed from 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN to 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: REQUIRED
Updated by alanwu (Alan Wu) about 3 years ago
- Status changed from Open to Closed
Applied in changeset git|f5d20411386ff2552ff27661387ddc4bae1ebc30.
Avoid assert failure when NULL EC is expected
After 5680c38c75aeb5cbd219aafa8eb48c315f287d97, postponed job APIs now
expect to be called on native threads not managed by Ruby and handles
getting a NULL execution context. However, in debug builds the change
runs into an assertion failure with GET_EC() which asserts that EC is
non-NULL. Avoid the assertion failure by passing false
for expect_ec
instead as the intention is to handle when there is no EC.
Add a test from John Crepezzi and John Hawthorn to exercise this
situation.
See GH-4108
See GH-5094
[Bug #17573]
Co-authored-by: John Hawthorn john@hawthorn.email
Co-authored-by: John Crepezzi john.crepezzi@gmail.com
Updated by nagachika (Tomoyuki Chikanaga) about 3 years ago
- Backport changed from 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: REQUIRED to 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: DONE
ruby_3_0 949af69408e44b69cc7437b58e8edbe3cd77c966 merged revision(s) 5680c38c75aeb5cbd219aafa8eb48c315f287d97,f5d20411386ff2552ff27661387ddc4bae1ebc30.