Bug #20243: M:N threading VM_ASSERT failure in rb_current_execution_context with clang 17 (on Linux) - Ruby - Ruby Issue Tracking System

Actions

Copy link

Bug #20243

open

M:N threading VM_ASSERT failure in rb_current_execution_context with clang 17 (on Linux)

Bug #20243: M:N threading VM_ASSERT failure in rb_current_execution_context with clang 17 (on Linux)

Added by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago. Updated over 1 year ago.

Status:

Open

Assignee:

Target version:

ruby -v:

Backport:

3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN

[ruby-core:116615]

Description

When building with Clang 17 and -DVM_CHECK_MODE=1 (with the following configure)

optflags="-ggdb3 -fno-omit-frame-pointer -fno-optimize-sibling-calls -O3" cflags="-DVM_CHECK_MODE=1" CC=clang  ../configure --prefix=/home/kj/ruby/installed --enable-yjit=dev --disable-install-doc

And then running the following script with the built ./miniruby (which is actually from bootstraptest/test_ractor.rb):

counts = []
counts << Ractor.count
ractors = (1..3).map { Ractor.new { Ractor.receive } }
counts << Ractor.count

ractors[0].send('End 0').take
sleep 0.1 until ractors[0].inspect =~ /terminated/
counts << Ractor.count

ractors[1].send('End 1').take
sleep 0.1 until ractors[1].inspect =~ /terminated/
counts << Ractor.count

ractors[2].send('End 2').take
sleep 0.1 until ractors[2].inspect =~ /terminated/
counts << Ractor.count

counts.inspect

I get the following crash:

Assertion Failed: ../vm_core.h:1957:rb_current_execution_context:ec == rb_current_ec_noinline()
ruby 3.4.0dev (2024-02-07T07:52:06Z ktsanaktsidis/igno.. 5cc6d944c2) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0003 p:0003 s:0010 e:000009 METHOD <internal:ractor>:431
c:0002 p:0004 s:0006 e:000005 BLOCK  ractor_crash.rb:3 [FINISH]
c:0001 p:---- s:0003 e:000002 DUMMY  [FINISH]

-- Ruby level backtrace information ----------------------------------------
ractor_crash.rb:3:in `block (2 levels) in <main>'
<internal:ractor>:431:in `receive'

-- Threading information ---------------------------------------------------
Total ractor count: 2
Ruby thread count for this ractor: 1

-- C level backtrace information -------------------------------------------
/home/kj/ruby/build/miniruby(rb_print_backtrace+0x14) [0x55faa97a4ebd] ../vm_dump.c:820
/home/kj/ruby/build/miniruby(rb_vm_bugreport) ../vm_dump.c:1151
/home/kj/ruby/build/miniruby(rb_assert_failure+0x81) [0x55faa94d2719] ../error.c:1131
./miniruby(thread_sched_wait_running_turn+0x2e9) [0x55faa9726f59]
/home/kj/ruby/build/miniruby(rb_ractor_sched_sleep+0x10b) [0x55faa972687b] ../thread_pthread.c:1348
/home/kj/ruby/build/miniruby(ractor_check_ints+0x0) [0x55faa968b328] ../ractor.c:683
/home/kj/ruby/build/miniruby(ractor_sleep_with_cleanup) ../ractor.c:684
/home/kj/ruby/build/miniruby(ractor_sleep+0x15) [0x55faa968adf4] ../ractor.c:701
/home/kj/ruby/build/miniruby(ractor_wait_receive) ../ractor.c:748
/home/kj/ruby/build/miniruby(ractor_receive+0x1f) [0x55faa968768e] ../ractor.c:762
/home/kj/ruby/build/miniruby(builtin_inline_class_431) ../ractor.rb:432
/home/kj/ruby/build/miniruby(builtin_invoker0+0x6) [0x55faa978fc66] ../vm_insnhelper.c:6746
/home/kj/ruby/build/miniruby(invoke_bf+0x39) [0x55faa979816e] ../vm_insnhelper.c:6886
/home/kj/ruby/build/miniruby(vm_invoke_builtin_delegate) ../vm_insnhelper.c:6909
/home/kj/ruby/build/miniruby(rb_vm_check_ints+0x0) [0x55faa9771fac] ../insns.def:1533
/home/kj/ruby/build/miniruby(vm_pop_frame) ../vm_insnhelper.c:419
/home/kj/ruby/build/miniruby(vm_exec_core) ../insns.def:1537
/home/kj/ruby/build/miniruby(vm_exec_loop+0x0) [0x55faa9767f02] ../vm.c:2489
/home/kj/ruby/build/miniruby(rb_vm_exec) ../vm.c:2492
/home/kj/ruby/build/miniruby(invoke_block+0x6f) [0x55faa9781a58] ../vm.c:1512
/home/kj/ruby/build/miniruby(invoke_iseq_block_from_c) ../vm.c:1582
/home/kj/ruby/build/miniruby(invoke_block_from_c_proc) ../vm.c:1680
/home/kj/ruby/build/miniruby(vm_invoke_proc) ../vm.c:1710
/home/kj/ruby/build/miniruby(rb_vm_invoke_proc_with_self+0x5a) [0x55faa9781eaa] ../vm.c:1745
/home/kj/ruby/build/miniruby(thread_do_start_proc+0x199) [0x55faa9739e19] ../thread.c:574
/home/kj/ruby/build/miniruby(thread_do_start+0x6c) [0x55faa973933f] ../thread.c:618
/home/kj/ruby/build/miniruby(thread_start_func_2) ../thread.c:668
/home/kj/ruby/build/miniruby(rb_native_mutex_lock+0x0) [0x55faa973a141] ../thread_pthread.c:2234
/home/kj/ruby/build/miniruby(thread_sched_lock_) ../thread_pthread.c:387
/home/kj/ruby/build/miniruby(call_thread_start_func_2) ../thread_pthread_mn.c:436
/home/kj/ruby/build/miniruby(co_start) ../thread_pthread_mn.c:434

The failing assertion is this one in vm_core.h: https://github.com/ruby/ruby/blob/42c36269403baac67b0d5dc1d6d6e31168cf6a1f/vm_core.h#L1957. It actually has a very helpful comment.

    /* On the shared objects, `__tls_get_addr()` is used to access the TLS
     * and the address of the `ruby_current_ec` can be stored on a function
     * frame. However, this address can be mis-used after native thread
     * migration of a coroutine.
     *   1) Get `ptr =&ruby_current_ec` op NT1 and store it on the frame.
     *   2) Context switch and resume it on the NT2.
     *   3) `ptr` is used on NT2 but it accesses to the TLS on NT1.
     * This assertion checks such misusage.
     *
     * To avoid accidents, `GET_EC()` should be called once on the frame.
     * Note that inlining can produce the problem.
     */
    VM_ASSERT(ec == rb_current_ec_noinline());

What seems to be happening is exactly that. This is a disassembly of the relevant bits of thread_sched_wait_running_turn:

........
   # This is the only bits of the entire function which access the TLS base register %fs.
   # It seems to have spilled the value of ruby_current_ec into %r13.
   0x000055603d2e1cf8 <+136>:	mov    $0xffffffffffffff90,%rax
   0x000055603d2e1cff <+143>:	mov    %fs:0x0,%r12
   0x000055603d2e1d08 <+152>:	add    %rax,%r12
   0x000055603d2e1d0b <+155>:	mov    %fs:(%rax),%r13
........
   # There's a call to coroutine_transfer, so after this point we're returned to on a
   # different thread
   0x000055603d2e1e90 <+544>:	call   0x55603d7fce84 <coroutine_transfer>
   # But nothing ever loads the address of ruby_current_ec from %fs again (i didn't trace
   # exactly the data flow from %r13 at 0x000055603d2e1d0b to here, but i assume it spilled
   # somewhere and now got loaded back into %r15 here). In any case, that means %r15 here
   # contains the value of ruby_current_ec from the _old_ thread, not the current one.
   0x000055603d2e1e95 <+549>:	mov    %rbx,0x28(%r14)
   0x000055603d2e1e99 <+553>:	mov    (%r12),%r15
   0x000055603d2e1e9d <+557>:	call   0x55603d33a010 <rb_current_ec_noinline>
   0x000055603d2e1ea2 <+562>:	cmp    %rax,%r15
=> 0x000055603d2e1ea5 <+565>:	jne    0x55603d2e1f3a <thread_sched_wait_running_turn+714>
........
   # assertion failure code path.
   0x000055603d2e1f3a <+714>:	lea    0x542c0c(%rip),%rdi        # 0x55603d824b4d
   0x000055603d2e1f41 <+721>:	lea    0x542c12(%rip),%rdx        # 0x55603d824b5a
   0x000055603d2e1f48 <+728>:	lea    0x542c28(%rip),%rcx        # 0x55603d824b77
   0x000055603d2e1f4f <+735>:	mov    $0x7a5,%esi
   0x000055603d2e1f54 <+740>:	call   0x55603d08d698 <rb_assert_failure>

if we look at the register values from 0x000055603d2e1ea2:

(rr) print/x $rax
$2 = 0x55603e159ad0
(rr) print/x $r15
$3 = 0x0

So the value from %rax which came from ruby_current_ec_noinline is correctly the value of ruby_current_ec for this thread, and %r15 contains a stale value from a previous thread.

Now, what can we do about this, is a different question :/ There's a really good stackoverflow answer about it here: https://stackoverflow.com/questions/75592038/how-to-disable-clang-expression-elimination-for-thread-local-variable, but to summarise

longstanding GCC and Clang bugs for this exist and have been marked as WONTFIX (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461, https://github.com/llvm/llvm-project/issues/19551)
It's even worse than this EC problem - things like errno also might be incorrectly persisted across coroutine switches (so e.g. an inlined C library function could in theory set errno in another thread, for example)
C++ actually has coroutines now, so this must work for those. Clang at least has fixed some TLS problems in their C++ coroutine implementation (https://github.com/llvm/llvm-project/issues/47179)

Other than reimplementing all of our coroutine stuff on top of C++ coroutines, I'm not sure what else we can do. AFAICT there's no way to tell the compiler that we clobbered the %fs register because that's just not a thing in its model (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66631, but i assume clang is similar).

Thoughts? For now I think my workaround is to disable M:N at build time when building with ASAN (or turn optimizations down). At least this isn't a problem with Fiber because we never move them across threads (probably for this reason in part).

Related issues 1 (1 open — 0 closed)

Actions

Copy link

Also available in: PDF Atom

Project

General

Profile

Ruby

Tags

Custom queries

Bug #20243

M:N threading VM_ASSERT failure in rb_current_execution_context with clang 17 (on Linux)

Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago Actions
Copy link
#1 [ruby-core:116616]

Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago Actions
Copy link
#2

Project

General

Profile

Ruby

Tags

Custom queries

Bug #20243

M:N threading VM_ASSERT failure in rb_current_execution_context with clang 17 (on Linux)

Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago ActionsCopy link #1 [ruby-core:116616]

Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago ActionsCopy link #2

Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago Actions
Copy link
#1 [ruby-core:116616]

Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago Actions
Copy link
#2