Bug #20243
openM:N threading VM_ASSERT failure in rb_current_execution_context with clang 17 (on Linux)
Description
When building with Clang 17 and -DVM_CHECK_MODE=1
(with the following configure)
optflags="-ggdb3 -fno-omit-frame-pointer -fno-optimize-sibling-calls -O3" cflags="-DVM_CHECK_MODE=1" CC=clang ../configure --prefix=/home/kj/ruby/installed --enable-yjit=dev --disable-install-doc
And then running the following script with the built ./miniruby
(which is actually from bootstraptest/test_ractor.rb
):
counts = []
counts << Ractor.count
ractors = (1..3).map { Ractor.new { Ractor.receive } }
counts << Ractor.count
ractors[0].send('End 0').take
sleep 0.1 until ractors[0].inspect =~ /terminated/
counts << Ractor.count
ractors[1].send('End 1').take
sleep 0.1 until ractors[1].inspect =~ /terminated/
counts << Ractor.count
ractors[2].send('End 2').take
sleep 0.1 until ractors[2].inspect =~ /terminated/
counts << Ractor.count
counts.inspect
I get the following crash:
Assertion Failed: ../vm_core.h:1957:rb_current_execution_context:ec == rb_current_ec_noinline()
ruby 3.4.0dev (2024-02-07T07:52:06Z ktsanaktsidis/igno.. 5cc6d944c2) [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0003 p:0003 s:0010 e:000009 METHOD <internal:ractor>:431
c:0002 p:0004 s:0006 e:000005 BLOCK ractor_crash.rb:3 [FINISH]
c:0001 p:---- s:0003 e:000002 DUMMY [FINISH]
-- Ruby level backtrace information ----------------------------------------
ractor_crash.rb:3:in `block (2 levels) in <main>'
<internal:ractor>:431:in `receive'
-- Threading information ---------------------------------------------------
Total ractor count: 2
Ruby thread count for this ractor: 1
-- C level backtrace information -------------------------------------------
/home/kj/ruby/build/miniruby(rb_print_backtrace+0x14) [0x55faa97a4ebd] ../vm_dump.c:820
/home/kj/ruby/build/miniruby(rb_vm_bugreport) ../vm_dump.c:1151
/home/kj/ruby/build/miniruby(rb_assert_failure+0x81) [0x55faa94d2719] ../error.c:1131
./miniruby(thread_sched_wait_running_turn+0x2e9) [0x55faa9726f59]
/home/kj/ruby/build/miniruby(rb_ractor_sched_sleep+0x10b) [0x55faa972687b] ../thread_pthread.c:1348
/home/kj/ruby/build/miniruby(ractor_check_ints+0x0) [0x55faa968b328] ../ractor.c:683
/home/kj/ruby/build/miniruby(ractor_sleep_with_cleanup) ../ractor.c:684
/home/kj/ruby/build/miniruby(ractor_sleep+0x15) [0x55faa968adf4] ../ractor.c:701
/home/kj/ruby/build/miniruby(ractor_wait_receive) ../ractor.c:748
/home/kj/ruby/build/miniruby(ractor_receive+0x1f) [0x55faa968768e] ../ractor.c:762
/home/kj/ruby/build/miniruby(builtin_inline_class_431) ../ractor.rb:432
/home/kj/ruby/build/miniruby(builtin_invoker0+0x6) [0x55faa978fc66] ../vm_insnhelper.c:6746
/home/kj/ruby/build/miniruby(invoke_bf+0x39) [0x55faa979816e] ../vm_insnhelper.c:6886
/home/kj/ruby/build/miniruby(vm_invoke_builtin_delegate) ../vm_insnhelper.c:6909
/home/kj/ruby/build/miniruby(rb_vm_check_ints+0x0) [0x55faa9771fac] ../insns.def:1533
/home/kj/ruby/build/miniruby(vm_pop_frame) ../vm_insnhelper.c:419
/home/kj/ruby/build/miniruby(vm_exec_core) ../insns.def:1537
/home/kj/ruby/build/miniruby(vm_exec_loop+0x0) [0x55faa9767f02] ../vm.c:2489
/home/kj/ruby/build/miniruby(rb_vm_exec) ../vm.c:2492
/home/kj/ruby/build/miniruby(invoke_block+0x6f) [0x55faa9781a58] ../vm.c:1512
/home/kj/ruby/build/miniruby(invoke_iseq_block_from_c) ../vm.c:1582
/home/kj/ruby/build/miniruby(invoke_block_from_c_proc) ../vm.c:1680
/home/kj/ruby/build/miniruby(vm_invoke_proc) ../vm.c:1710
/home/kj/ruby/build/miniruby(rb_vm_invoke_proc_with_self+0x5a) [0x55faa9781eaa] ../vm.c:1745
/home/kj/ruby/build/miniruby(thread_do_start_proc+0x199) [0x55faa9739e19] ../thread.c:574
/home/kj/ruby/build/miniruby(thread_do_start+0x6c) [0x55faa973933f] ../thread.c:618
/home/kj/ruby/build/miniruby(thread_start_func_2) ../thread.c:668
/home/kj/ruby/build/miniruby(rb_native_mutex_lock+0x0) [0x55faa973a141] ../thread_pthread.c:2234
/home/kj/ruby/build/miniruby(thread_sched_lock_) ../thread_pthread.c:387
/home/kj/ruby/build/miniruby(call_thread_start_func_2) ../thread_pthread_mn.c:436
/home/kj/ruby/build/miniruby(co_start) ../thread_pthread_mn.c:434
The failing assertion is this one in vm_core.h: https://github.com/ruby/ruby/blob/42c36269403baac67b0d5dc1d6d6e31168cf6a1f/vm_core.h#L1957. It actually has a very helpful comment.
/* On the shared objects, `__tls_get_addr()` is used to access the TLS
* and the address of the `ruby_current_ec` can be stored on a function
* frame. However, this address can be mis-used after native thread
* migration of a coroutine.
* 1) Get `ptr =&ruby_current_ec` op NT1 and store it on the frame.
* 2) Context switch and resume it on the NT2.
* 3) `ptr` is used on NT2 but it accesses to the TLS on NT1.
* This assertion checks such misusage.
*
* To avoid accidents, `GET_EC()` should be called once on the frame.
* Note that inlining can produce the problem.
*/
VM_ASSERT(ec == rb_current_ec_noinline());
What seems to be happening is exactly that. This is a disassembly of the relevant bits of thread_sched_wait_running_turn
:
........
# This is the only bits of the entire function which access the TLS base register %fs.
# It seems to have spilled the value of ruby_current_ec into %r13.
0x000055603d2e1cf8 <+136>: mov $0xffffffffffffff90,%rax
0x000055603d2e1cff <+143>: mov %fs:0x0,%r12
0x000055603d2e1d08 <+152>: add %rax,%r12
0x000055603d2e1d0b <+155>: mov %fs:(%rax),%r13
........
# There's a call to coroutine_transfer, so after this point we're returned to on a
# different thread
0x000055603d2e1e90 <+544>: call 0x55603d7fce84 <coroutine_transfer>
# But nothing ever loads the address of ruby_current_ec from %fs again (i didn't trace
# exactly the data flow from %r13 at 0x000055603d2e1d0b to here, but i assume it spilled
# somewhere and now got loaded back into %r15 here). In any case, that means %r15 here
# contains the value of ruby_current_ec from the _old_ thread, not the current one.
0x000055603d2e1e95 <+549>: mov %rbx,0x28(%r14)
0x000055603d2e1e99 <+553>: mov (%r12),%r15
0x000055603d2e1e9d <+557>: call 0x55603d33a010 <rb_current_ec_noinline>
0x000055603d2e1ea2 <+562>: cmp %rax,%r15
=> 0x000055603d2e1ea5 <+565>: jne 0x55603d2e1f3a <thread_sched_wait_running_turn+714>
........
# assertion failure code path.
0x000055603d2e1f3a <+714>: lea 0x542c0c(%rip),%rdi # 0x55603d824b4d
0x000055603d2e1f41 <+721>: lea 0x542c12(%rip),%rdx # 0x55603d824b5a
0x000055603d2e1f48 <+728>: lea 0x542c28(%rip),%rcx # 0x55603d824b77
0x000055603d2e1f4f <+735>: mov $0x7a5,%esi
0x000055603d2e1f54 <+740>: call 0x55603d08d698 <rb_assert_failure>
if we look at the register values from 0x000055603d2e1ea2
:
(rr) print/x $rax
$2 = 0x55603e159ad0
(rr) print/x $r15
$3 = 0x0
So the value from %rax
which came from ruby_current_ec_noinline
is correctly the value of ruby_current_ec
for this thread, and %r15
contains a stale value from a previous thread.
Now, what can we do about this, is a different question :/ There's a really good stackoverflow answer about it here: https://stackoverflow.com/questions/75592038/how-to-disable-clang-expression-elimination-for-thread-local-variable, but to summarise
- longstanding GCC and Clang bugs for this exist and have been marked as WONTFIX (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461, https://github.com/llvm/llvm-project/issues/19551)
- It's even worse than this EC problem - things like
errno
also might be incorrectly persisted across coroutine switches (so e.g. an inlined C library function could in theory seterrno
in another thread, for example) - C++ actually has coroutines now, so this must work for those. Clang at least has fixed some TLS problems in their C++ coroutine implementation (https://github.com/llvm/llvm-project/issues/47179)
Other than reimplementing all of our coroutine stuff on top of C++ coroutines, I'm not sure what else we can do. AFAICT there's no way to tell the compiler that we clobbered the %fs
register because that's just not a thing in its model (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66631, but i assume clang is similar).
Thoughts? For now I think my workaround is to disable M:N at build time when building with ASAN (or turn optimizations down). At least this isn't a problem with Fiber
because we never move them across threads (probably for this reason in part).