Bug #19680
closedtest_process.rb tests fail sometimes on FreeBSD
Description
I've been investigating the repeated failures of test_process.rb on FreeBSD on rubyci. I'm still working on it but I wanted to open this ticket just to keep others in the loop and gather any pointers any of you might have!
These are some of the failures I've been investigating -
- http://rubyci.s3.amazonaws.com/freebsd13/ruby-master/log/20230516T063002Z.fail.html.gz
- http://rubyci.s3.amazonaws.com/freebsd13/ruby-master/log/20230515T103002Z.fail.html.gz
- http://rubyci.s3.amazonaws.com/freebsd13/ruby-master/log/20230512T083002Z.fail.html.gz
- http://rubyci.s3.amazonaws.com/freebsd13/ruby-master/log/20230506T023001Z.fail.html.gz
- http://rubyci.s3.amazonaws.com/freebsd13/ruby-master/log/20230505T103002Z.fail.html.gz
- http://rubyci.s3.amazonaws.com/freebsd13/ruby-master/log/20230517T003001Z.fail.html.gz
I have been able to reproduce one of them fairly reliably on my laptop (a 4-core, 8-thread Intel thinkpad running FreeBSD 13.2 under Linux KVM)
1) Timeout:
TestProcess#test_daemon_no_threads
simply by running while ./miniruby -I./lib -I. -I.ext/common ./tool/runruby.rb --extout=.ext -- ./test/runner.rb test/ruby/test_process.rb -n test_daemon_no_threads; do echo "ok"; done;
The test in question:
def test_daemon_no_threads
data = EnvUtil.timeout(3) do
IO.popen("-") do |f|
break f.readlines.map(&:chomp) if f
th = Thread.start {sleep 3}
Process.daemon(true, true)
puts Thread.list.size, th.status.inspect
end
end
assert_equal(["1", "false"], data)
end
This seems to have two different causes.
- Sometimes, the test appears to be stuck, but if you change the timeout to be high enough, this stops happening. i.e. this test would pass, provided it was allowed to run for longer than 3 seconds. I think I've narrowed this down to some kind of problem with Ruby's UBF & timer thread mechanism.
- Sometimes, the test is properly deadlocked and won't make any forward progress no matter the timeout. I believe this is actually a bug in FreeBSD.
Problems in the Ruby UBF/timer thread mechanism¶
I've been using this branch to dump some debug info while the test is running. Fun note, I had to dump the debug logs into an in-memory buffer because adding actual console printf's actually changes things enough that the test stops hanging.
My finding is that, when the test gets into the "stuck" state...
- It's the parent process that is stuck (i.e. the process that calls IO.popen).
- The parent process has two threads
- One thread doing in IO (doing
f.readlines
) - The other thread is the thread created by the timeout library, which is blocked in
rb_sigwait_sleep
waiting to be woken up by something.
When the SIGCHLD signal from the popen'd process exiting arrives, if one is sufficiently unlucky, the following sequence of events can occur:
- The signal handler in
sighandler
will wake up the timeout thread by callingrb_thread_wakeup_timer_thread
- That will prompt the timeout thread to eventually call
ubf_wakeup_thread
on the main thread throughcheck_signals_nogvl
- That will send SIGVTALRM to the main thread
- The timeout thread, having been interrupted, exits
rb_sigwait_sleep
, and on its way out ofnative_sleep
, callsTHREAD_BLOCKING_END
. -
THREAD_BLOCKING_END
callsthread_to_sched_running
, which will actually callrb_thread_wakeup_timer_thread
again, writing into the communication pipe - The timeout thread still has sleep to do (it has not yet been three seconds), so it loops back around and calls
rb_sigwait_sleep
again - but because there's a pending read on the communication pipe, it immediately notices that, and winds up not sleeping at all and callscheck_signals_nogvl
again - This kicks of the whole chain of events all over again, and sends another SIGVTALRM signal to the main thread.
This winds up working almost all of the time because when the main thread gets to run, it will handle the signal and the timeout thread will stop being woken up. However, it seems that on FreeBSD, the timeout thread hits the main thread with signals so hard that the main thread winds up almost unable to make any forward progress - especially as the target thread takes the th->interrupt_lock
to send the signal, that the main thread needs to exit its blocking region! In fact, when this test hangs, I see hundreds of thousands of SIGVTALRM signals sent from the timeout thread to the main thread, which seems... excessive. I guess this manifests on FreeBSD specifically for scheduler reasons.
Here's a pair of backtraces of the two threads I took from gdb - https://gist.github.com/KJTsanaktsidis/3eee77cb308f5760c5ae7cc19f4f43b5
This pair of stacks was fairly common in my investigation - the main thread is trying to exit its blocking region, and it's blocked on th->interrupt_lock
; it's then handling a SIGVTALRM signal inside the pthread_mutex_lock
call. The timeout thread holds th->interrupt_lock
and is pthread_kill
'ing SIGVTALRM at the main thread.
This patch SEEMS to fix the problem on my machine: https://github.com/ruby/ruby/commit/da705cd5efd6561d58a0dd08ec0ea94757ffe7dc - it should stop the timeout thread waking itself up just because the main thread has not yet processed a pending signal.
I still need to:
* Burn this in properly overnight to make sure it stopped all the process_test.rb flake on my machine
* Also check it against some other likely tests overnight like thread-related ones
* Also check it against a wide range of other platforms - AFAICT this code affects all platforms.
Problems with hard deadlocks¶
Sometimes, the child process that got popen'd gets deadlocked while calling Process.daemon
. The stacks wind up looking like this: https://gist.github.com/KJTsanaktsidis/11df4ab633f63c3c1a2f1bca55a88ce9
The main thread is running libc's before fork hooks (preparing the dynamic linker for forking) whilst the other thread is in its thread-creation routines.
This seems to be a bug in FreeBSD's jemalloc implementation, which I reported here - https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=271490. They seem to agree.
I'll work with the FreeBSD developers to see if this can be fixed (it might possibly affect jemalloc on other platforms too - I haven't looked).
Next steps¶
If anybody has any ideas on other fixes for the SIGVTALRM spam, I'd love to hear them! Otherwise, I'll test my patch more exhaustively and open a PR if it seems to work everywhere that I can try.
As for the jemalloc deadlock, I'll try the fix suggested by the FreeBSD developers in that bug and see if that fixes things on my machine.
Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago
Apologies, I accidentally submitted the issue while I was still writing it, just editing the issue to provide some detail now... (OK - done now)
Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago
- Description updated (diff)
Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago
I made a fair bit more progress over the weekend with this:
- For the
TestProcess#test_daemon_no_threads
test - I updated my patch slightly: https://github.com/ruby/ruby/commit/2f306cbd15de9899906a563012c92fd02b805300 - For the bug in FreeBSD - It really is a bug, and the FreeBSD developers have patched it - https://reviews.freebsd.org/D40178. I also discovered that it can be worked around with
LD_BIND_NOW
, so this might be a good thing to add to the test runner for FreeBSD for existing versions.
I also found a similar issue in the TestIO#test_race_gets_and_close
test. This test can hang or take a very long time to finish, because:
- The main thread goes to close one of the pipes here:
https://github.com/ruby/ruby/blob/872249e209fdb7b7c890a93b0f93a74a62d21aec/test/ruby/test_io.rb#L3838
- The other threads are currently using those files here:
https://github.com/ruby/ruby/blob/872249e209fdb7b7c890a93b0f93a74a62d21aec/test/ruby/test_io.rb#L3829
- So, when closing the pipe, the main thread calls
rb_notify_fd_close
here: https://github.com/ruby/ruby/blob/872249e209fdb7b7c890a93b0f93a74a62d21aec/io.c#L5643. This builds up a list of other threads currently using the pipe we're trying to close - It later waits for that list to become empty here: https://github.com/ruby/ruby/blob/872249e209fdb7b7c890a93b0f93a74a62d21aec/io.c#L5470 by calling
rb_thread_schedule()
in a loop until it is. -
rb_thread_schedule
eventually winds up whacking the other threads that are trying to leave the blocking region ofgets
with SIGVTALRM. - It seems on FreeBSD, those threads very often wind up not being ready in time and
rb_thread_schedule
ends up with nothing to run but the main thread again - Which kicks off this whole loop again
I thought it was suspicious that this thread has a 200 second timeout on it; there is no need for this to take that long.
This patch stops the busy-waiting through rb_thread_schedule
and instead uses a dedicated condition variable which is woken up whenever one of the gets
threads is done with the file descriptor that's attempting to be closed: https://github.com/ruby/ruby/commit/a17304b3c4eeffbe945c8d0d4555c096c1183045
After applying this patch, TestIO#test_race_gets_and_close
passes in a matter of milliseconds, not hundreds of seconds.
Next steps¶
- I ran the test suite in a loop for 12 hours, 2x in parallel, on my FreeBSD VM - no failures at all! of any kind!
- Now I need to do the same thing to at least Linux, MacOS, and Windows, since these are generic changes to
thread.c
that will affect all platforms. - There are also a few other tests skipped on FreeBSD, I will review these and see if my patches fix them.
- If that comes up OK, I'll tidy up my two patches & submit PR's for them
- I'll also submit a PR to make the test runner use
LD_BIND_NOW
on FreeBSD < 13.3
Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago
OK so https://github.com/ruby/ruby/pull/7864 and https://github.com/ruby/ruby/pull/7865 were merged, so this should be fixed. I'll keep an eye out on the CI tests over the weekend and see if this clears things up.
I also have https://github.com/ruby/ruby/pull/7867 open which works around the freebsd bug I found but that's probably less critical.
Updated by kjtsanaktsidis (KJ Tsanaktsidis) over 1 year ago
FreeBSD 13.1 CI hasn't failed since these fixes were merged so touch wood I think we can call this done.
Updated by ioquatix (Samuel Williams) over 1 year ago
- Status changed from Open to Closed
Closed per request.