Bug #1525
closed
Deadlock in Ruby 1.9's VM caused by ConditionVariable.wait and fork?
Description
=begin
The following code seems to cause a VM-wide deadlock on 1.9:
require 'thread'
lock = Mutex.new
cond = ConditionVariable.new
t = Thread.new do
  lock.synchronize do
    cond.wait(lock)
  end
end
pid = fork do
  # Child
  STDOUT.write "This is the child process.\n"
  STDOUT.write "Child process exiting.\n"
end
STDOUT.write("Child PID = #{pid}\n")
Process.waitpid(pid)
The expected output is:
Child PID = xxxx
This is the child process.
Child process exiting.
After the exit message, Ruby should exit.
Instead, Ruby 1.9 gives:
Child PID = 15493
This is the child process.
(process hangs here)
Ruby 1.8 does not suffer from this problem.
Upon debugging Ruby, I've found that Ruby is stuck in blocking_region_end(), at the following line:
native_mutex_lock(&th->vm->global_vm_lock);
blocking_region_end() was called as part of rb_write_internal(), right after writing "This is the child process\n" to stdout. This problem only occurs if there's a background thread that's waiting on a ConditionVariable. If you remove the thread then the deadlock does not occur.
=end
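A minimal sketch of a variant of the reproduction above that reports a hang instead of blocking forever, which may be easier to run repeatedly. The helper name `run_repro`, the 5-second deadline, and the short sleep that lets the background thread reach `cond.wait` are assumptions added for illustration; none of them come from the original report.

```ruby
require 'thread'

# Hypothetical helper: runs the reported scenario, but polls for the child
# with WNOHANG instead of blocking in Process.waitpid.
def run_repro(deadline_seconds = 5)
  lock = Mutex.new
  cond = ConditionVariable.new

  # Background thread parked on the condition variable, as in the report.
  waiter = Thread.new do
    lock.synchronize { cond.wait(lock) }
  end
  sleep 0.1 # assumption: enough time for the thread to reach cond.wait

  pid = fork do
    STDOUT.write "This is the child process.\n"
    STDOUT.write "Child process exiting.\n"
  end

  deadline = Time.now + deadline_seconds
  outcome = nil
  until outcome
    if Process.waitpid(pid, Process::WNOHANG)
      # waitpid returned the pid, so the child has terminated.
      outcome = :exited
    elsif Time.now >= deadline
      # The child is still running after the deadline: treat it as hung.
      Process.kill('KILL', pid)
      Process.waitpid(pid)
      outcome = :hung
    else
      sleep 0.05
    end
  end

  waiter.kill
  outcome
end

result = run_repro
STDOUT.write "Child #{result}\n"
```

On an unaffected build this should report that the child exited. On an affected build it would presumably report a hang once the deadline passes, though that has not been verified here.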
Updated by hongli (Hongli Lai) over 15 years ago
=begin
It appears that this bug is OS X-specific. On Ubuntu 8.04 it behaves correctly: ruby 1.9.1p129 (2009-05-12 revision 23412) [x86_64-linux]
=end
Updated by yugui (Yuki Sonoda) over 15 years ago
- Assignee set to ko1 (Koichi Sasada)
- Target version set to 1.9.1
Updated by vanjab (Vanja Bucic) over 15 years ago
=begin
Just to chime in on this issue.
It is affecting our company as well. We run our software on Apple machines in a server environment.
Our application is a multithreaded daemon process that accesses a MySQL database pretty often. Some of these threads fork off a task that may take some time to complete. It worked well prior to Ruby 1.9. Since then we have tried to find a workaround, but to no avail. Any attempt to use fork in our application results in a deadlock about 8 times out of 10.
It deadlocks in the weirdest places, like 'puts' or mysql.query (which we know sets a global lock internally, but should be thread safe).
Any ideas and attempts to resolve this ASAP are welcome.
=end
Updated by vanjab (Vanja Bucic) over 15 years ago
=begin
To add my test case:
------------------------
require 'thread'
$stderr.puts RUBY_VERSION
$pid = 0
$t1 = Thread.new do
  $pid = fork {
    sleep(1)
    $stderr.puts "thread 1 exiting"
    exit
  }
end
$stdout.puts "ok, thread with fork spawned"
$stdout.puts "la la la la"
Process.waitpid($pid)
$stderr.puts "never done"
------ outputs ----------
1.9.2
ok, thread with fork spawned
la la la la
=end
Updated by vanjab (Vanja Bucic) over 15 years ago
=begin
In Reply to:
-- IMHO, from the perspective of the user, although it is hard, try not to use
-- any non-async-signal-safe functions in a forked child process before any
-- exec functions are called.
-- - Tetsu
Just so I can understand the logic, could you rewrite my test case above so that it does not deadlock?
It is not clear to me which of the functions I used in the test case are non-async-signal-safe.
Thanks.
=end
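One way to read the quoted advice is to run no Ruby code of one's own between fork and exec. The sketch below illustrates that idea with `Process.spawn`, which starts a separate program directly. The helper name `spawn_task` and the one-line child script are hypothetical stand-ins for the long-running task, and nothing in this thread establishes that this avoids the deadlock on the affected builds.

```ruby
require 'rbconfig'

# Hypothetical stand-in for the long-running task: a separate Ruby process
# started with Process.spawn rather than a fork block.
def spawn_task
  # RbConfig.ruby is the path of the interpreter running this script.
  Process.spawn(RbConfig.ruby, '-e', 'sleep(1); $stderr.puts "task exiting"')
end

$stderr.puts RUBY_VERSION

# Start the task from a thread, as in the test case above.
worker = Thread.new { spawn_task }

# Thread#value joins the thread and returns the block's result (the pid),
# so the pid is known before waitpid is called.
pid = worker.value

$stdout.puts "ok, thread with spawn started"
Process.waitpid(pid)
$stderr.puts "done, child exit status: #{$?.exitstatus}"
```

The trade-off is that the child no longer shares the parent's in-memory state, so anything it needs has to be passed through arguments, the environment, or a pipe.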
Updated by matz (Yukihiro Matsumoto) over 15 years ago
=begin
Hi,
In message "Re: [ruby-core:24565] Re: [Bug #1525] Deadlock in Ruby 1.9's VM caused by ConditionVariable.wait and fork?"
on Sun, 26 Jul 2009 22:11:41 +0900, Hongli Lai hongli@plan99.net writes:
|In any case, not being able to create threads or doing anything
|complicated in child processes is a serious limitation. This makes
|forking-without-exec in Ruby 1.9 as good as useless. Even
|forking-with-exec is dangerous now. For example, suppose that the child
|process creates a command string to pass to exec(), and creating this
|command string involves malloc()ing memory. Even this isn't safe anymore.
|
|I think Kernel#fork should be made safe as much as possible.
I know what you mean. But we cannot override the underlying platform
behavior (i.e. it is impossible). If it were possible, we would be glad to adopt it.
But in Vanja's case, it might be possible to support it by adjusting the
timing of launching the internal worker thread. I am not sure yet.
matz.
=end
Updated by normalperson (Eric Wong) over 15 years ago
=begin
Looking at trunk, there doesn't seem to be any accounting of mutexes to pass to
handlers for pthread_atfork; so the child process will just inherit the mutexes
in an unknown state.
It should be possible to fix the problem by keeping track of all mutexes as
they're created/initialized and registering pthread_atfork handlers
to ensure all mutexes are unlocked when the child starts running.
I'm pretty sure forking in the presence of threads in the parent will always
require a GVL, but I don't think it's too big of an issue otherwise.
=end
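The comment above concerns native mutexes and `pthread_atfork` handlers inside the interpreter. As a loose Ruby-level analogy only, the sketch below tracks application mutexes in a registry and replaces them with fresh ones in the child right after forking. `LockRegistry`, `fork_with_reset`, and the `:db` lock name are hypothetical; this is not the mechanism proposed for the VM and not part of any patch in this thread.

```ruby
require 'thread'

# Hypothetical application-level registry of mutexes.
class LockRegistry
  def initialize
    @locks = {}
  end

  # Create a mutex and remember it under a name.
  def create(name)
    @locks[name] = Mutex.new
  end

  def [](name)
    @locks.fetch(name)
  end

  # Replace every tracked mutex with a fresh, unlocked one.
  def reset_all!
    @locks.keys.each { |name| @locks[name] = Mutex.new }
  end

  # Fork, resetting all tracked mutexes in the child before running the block.
  def fork_with_reset
    fork do
      reset_all!
      yield
    end
  end
end

registry = LockRegistry.new
registry.create(:db)

# A parent thread takes the lock and keeps holding it across the fork.
ready = Queue.new
holder = Thread.new do
  registry[:db].synchronize do
    ready << true
    sleep
  end
end
ready.pop

reader, writer = IO.pipe
pid = registry.fork_with_reset do
  reader.close
  # try_lock never blocks, so the child cannot hang on this line.
  writer.write(registry[:db].try_lock ? "acquired" : "busy")
  writer.close
end
writer.close
child_report = reader.read
Process.waitpid(pid)
holder.kill

puts "child: #{child_report}"
```

The registry only covers locks created through it, which mirrors the objection raised later in the thread: mutexes created by the platform or by libraries stay out of reach.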
Updated by normalperson (Eric Wong) over 15 years ago
=begin
"none <" tetsu.soh.dev@gmail.com wrote:
> Eric Wong wrote:
> > It should be possible to fix the problem by keeping track of all mutexes as
> > they're created/initialized and registering pthread_atfork handlers
> > to ensure all mutexes are unlocked when the child starts running.
>
> In fact, it is impossible to track all mutexes because the usage of mutexes
> really depends on the underlying implementation.
> For example, the deadlock in this issue doesn't happen on Linux systems,
> not even on FreeBSD 7.2, but it does happen on FreeBSD 6.4.
Yes, it's not easy; but I think we can start making a best effort and
wait for OSes to catch up. This lets us start paving the way towards
reducing the reliance on the GVL:
The big system-side offenders are stdio, malloc and resolver...
- Ruby 1.9 already removed most of stdio dependencies.
- malloc still happens under a GVL, but I think replacing it with a
  Ruby-aware memory allocator that's better integrated with the GC and
  thread management would be a good thing anyways.
- Maybe look at c-ares or even resolv.rb since they'd play nicer
  with timeouts anyways... (not too sure on this one).
There's probably a few other things, but I think those are the main
ones that server applications (the ones most likely to use threads+fork)
will care about...
--
Eric Wong
=end
Updated by hongli (Hongli Lai) about 15 years ago
- File vm_deadlock_fix.diff added
=begin
The attached patch fixes the problem. Before forking there might be an arbitrary number of threads waiting on the lock, causing it to enter an undefined state after forking, which in turn causes a deadlock on some platforms. This patch reinitializes the global interpreter lock right after forking, which should be safe because all threads are gone right after forking.
I tried this before and it didn't work, but I suspect that that was caused by bug #2371. Now that #2371 has been fixed it would seem that this patch works.
=end
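The patch description relies on the observation that only the forking thread exists in the child. A small sketch of how that can be observed from Ruby code follows; the three parked threads and the pipe used to report back to the parent are illustrative assumptions. It shows Ruby-level threads only and says nothing about the state of the native lock that the patch reinitializes.

```ruby
require 'thread'

# Start a few parent threads that simply block.
parked = 3.times.map { Thread.new { sleep } }
sleep 0.1 # assumption: enough time for the threads to start

reader, writer = IO.pipe
pid = fork do
  reader.close
  # Thread.list returns the threads that are alive in this process.
  writer.write(Thread.list.size.to_s)
  writer.close
end
writer.close
threads_in_child = reader.read.to_i
Process.waitpid(pid)

puts "threads in parent: #{Thread.list.size}"
puts "threads in child:  #{threads_in_child}"

parked.each(&:kill)
```

The child should report a single thread, the one that called fork, while the parent still sees its parked threads.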
Updated by vanjab (Vanja Bucic) about 15 years ago
=begin
Very good news, thanks. Where do I fetch the latest sources that include your patch so that I can test with our use case?
Thanks.
=end
Updated by hongli (Hongli Lai) about 15 years ago
=begin
The patch is to be applied on top of Ruby's SVN sources.
=end
Updated by nobu (Nobuyoshi Nakada) about 15 years ago
- Status changed from Open to Closed
- % Done changed from 0 to 100
=begin
This issue was solved with changeset r25844.
Hongli, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.
=end
Updated by daniel (Daniel Cavanagh) about 15 years ago
=begin
On 25/11/2009, at 5:57 PM, Tanaka Akira wrote:
In article 4B07C4C5.8060102@plan99.net,
Hongli Lai hongli@plan99.net writes:
% ./ruby -e 'fork { puts }'
-e:1: [BUG] native_mutex_unlock return non-zero: 1
ruby 1.9.2dev (2009-11-19 trunk 25848) [x86_64-freebsd6.4]
This is what I get on FreeBSD 7.1-RELEASE:
FreeBSD 8.0-RELEASE behaves similar to FreeBSD 6.4.
% uname -mrsv
FreeBSD 8.0-RELEASE FreeBSD 8.0-RELEASE #0: Sat Nov 21 15:48:17 UTC 2009 root@almeida.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC i386
% ./ruby -e 'fork { puts }'
-e:1: [BUG] pthread_mutex_unlock: Operation not permitted (EPERM)
ruby 1.9.2dev (2009-11-25 trunk 25911) [i386-freebsd8.0]
-- control frame ----------
c:0009 p:---- s:0020 b:0020 l:000019 d:000019 CFUNC :write
c:0008 p:---- s:0018 b:0018 l:000017 d:000017 CFUNC :puts
c:0007 p:---- s:0016 b:0016 l:000015 d:000015 CFUNC :puts
c:0006 p:0009 s:0013 b:0013 l:0010a4 d:000012 BLOCK -e:1
c:0005 p:---- s:0011 b:0011 l:000010 d:000010 FINISH
c:0004 p:---- s:0009 b:0009 l:000008 d:000008 CFUNC :fork
c:0003 p:0009 s:0006 b:0006 l:0010a4 d:000004 EVAL -e:1
c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
c:0001 p:0000 s:0002 b:0002 l:0010a4 d:0010a4 TOP
-e:1:in `<main>'
-e:1:in `fork'
-e:1:in `block in <main>'
-e:1:in `puts'
-e:1:in `puts'
-e:1:in `write'
Exactly the same thing happens on NetBSD 5.0.1, if that helps.
=end