Bug #1525

Deadlock in Ruby 1.9's VM caused by ConditionVariable.wait and fork?

Added by hongli (Hongli Lai) about 10 years ago. Updated about 8 years ago.

Status: Closed
Priority: Normal
Target version:
ruby -v: ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-darwin9.6.0]
Backport:
[ruby-core:23572]

Description

=begin
The following code seems to cause a VM-wide deadlock on 1.9:

require 'thread'

lock = Mutex.new
cond = ConditionVariable.new
t = Thread.new do
  lock.synchronize do
    cond.wait(lock)
  end
end

pid = fork do
  # Child
  STDOUT.write "This is the child process.\n"
  STDOUT.write "Child process exiting.\n"
end
STDOUT.write("Child PID = #{pid}\n")
Process.waitpid(pid)

The expected output is:

Child PID = xxxx
This is the child process.
Child process exiting.

After the exit message, Ruby should exit.

Instead, Ruby 1.9 gives:

Child PID = 15493
This is the child process.
(process hangs here)

Ruby 1.8 does not suffer from this problem.

Upon debugging Ruby, I've found that Ruby is stuck in blocking_region_end(), at the following line:

native_mutex_lock(&th->vm->global_vm_lock);

blocking_region_end() was called as part of rb_write_internal(), right after writing "This is the child process.\n" to stdout. The problem only occurs when a background thread is waiting on a ConditionVariable; if you remove the thread, the deadlock does not occur.
=end


Files

vm_deadlock_fix.diff (466 Bytes), hongli (Hongli Lai), 11/17/2009 09:58 PM

Related issues

Related to Ruby master - Bug #2025: problem with pthread handling on non NPTL platform (Closed, 08/31/2009)

History

#1

Updated by hongli (Hongli Lai) about 10 years ago

=begin
It appears that this bug is OS X-specific. On Ubuntu 8.04 it behaves correctly: ruby 1.9.1p129 (2009-05-12 revision 23412) [x86_64-linux]
=end

#2

Updated by yugui (Yuki Sonoda) about 10 years ago

  • Assignee set to ko1 (Koichi Sasada)
  • Target version set to 1.9.1


#3

Updated by vanjab (Vanja Bucic) almost 10 years ago

=begin
Just to chime in on this issue.

It is affecting our company as well. We run our software on Apple machines in a server environment.
Our application is a multithreaded daemon process that accesses a MySQL database quite often. Some of its threads fork off a task that may take some time to complete. This worked well prior to Ruby 1.9. Since then we have tried to find a workaround, but to no avail. Any attempt to use fork in our application results in a deadlock about 8 times out of 10.
It deadlocks in the weirdest places, like 'puts' or mysql.query (which we know sets a global lock internally, but should be thread safe).

Any ideas and attempts to resolve this ASAP are welcome.
=end

#4

Updated by vanjab (Vanja Bucic) almost 10 years ago

=begin
To add my test case:
# ------------------------
require 'thread'

$stderr.puts RUBY_VERSION

$pid = 0
$t1 = Thread.new do
  $pid = fork {
    sleep(1)
    $stderr.puts "thread 1 exiting"
    exit
  }
end

$stdout.puts "ok, thread with fork spawned"
$stdout.puts "la la la la"

Process.waitpid($pid)

$stderr.puts "never done"

# ------ outputs ----------
1.9.2
ok, thread with fork spawned
la la la la

=end

#5

Updated by vanjab (Vanja Bucic) almost 10 years ago

=begin
In Reply to:
-- IMHO, from the perspective of a user, although it is hard, try not to use
-- any non-async-signal-safe functions in a forked child process before any
-- exec functions are called.

-- - Tetsu

Just so I can understand the logic, could you rewrite my test case above so that it does not deadlock?
It is not clear to me which of the functions I used in the test case are non-async-signal-safe.

Thanks.
=end
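One possible reading of the quoted advice, as a minimal sketch rather than an authoritative answer: do no Ruby-level work in a forked child at all, and let Process.spawn start a fresh interpreter for the child's task. The helper name run_in_fresh_interpreter and the snippet passed to it are assumptions made for illustration; they do not come from this thread. A background thread waits on a ConditionVariable, as in the original report, to show that the parent stays multithreaded.

```ruby
require 'rbconfig'
require 'thread'

# Hypothetical helper (not from this thread): run a Ruby snippet in a
# brand-new interpreter instead of running Ruby code inside a forked child.
def run_in_fresh_interpreter(code)
  reader, writer = IO.pipe
  pid = Process.spawn(RbConfig.ruby, '-e', code, out: writer)
  writer.close                       # the parent keeps only the read end
  output = reader.read               # returns once the child closes its stdout
  reader.close
  _, status = Process.waitpid2(pid)
  [output, status]
end

# Background thread parked on a ConditionVariable, as in the original report.
lock = Mutex.new
cond = ConditionVariable.new
done = false
waiter = Thread.new do
  lock.synchronize { cond.wait(lock) until done }
end

child_output, child_status =
  run_in_fresh_interpreter('STDOUT.write "This is the child process.\n"')

# Wake the waiter so the script can exit cleanly.
lock.synchronize do
  done = true
  cond.signal
end
waiter.join
```

The trade-off is that the child no longer shares the parent's heap, so any state it needs has to be passed explicitly (arguments, environment, or a pipe).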

#6

Updated by matz (Yukihiro Matsumoto) almost 10 years ago

=begin
Hi,

In message "Re: [ruby-core:24565] Re: [Bug #1525] Deadlock in Ruby 1.9's VM caused by ConditionVariable.wait and fork?"
on Sun, 26 Jul 2009 22:11:41 +0900, Hongli Lai hongli@plan99.net writes:

|In any case, not being able to create threads or doing anything
|complicated in child processes is a serious limitation. This makes
|forking-without-exec in Ruby 1.9 as good as useless. Even
|forking-with-exec is dangerous now. For example, suppose that the child
|process creates a command string to pass to exec(), and creating this
|command string involves malloc()ing memory. Even this isn't safe anymore.
|
|I think Kernel#fork should be made safe as much as possible.

I know what you mean. But we cannot override the underlying platform
behavior (i.e. it is impossible). If it were possible, we would be glad to adopt it.

# but in case of Vanja, it might be able to support by adjusting the
# timing of launching the internal worker thread. I am not sure yet.

                        matz.

=end

#7

Updated by normalperson (Eric Wong) almost 10 years ago

=begin
Looking at trunk, there doesn't seem to be any accounting of mutexes to pass to
handlers for pthread_atfork; so the child process will just inherit the mutexes
in an unknown state.

It should be possible to fix the problem by keeping track of all mutexes as
they're created/initialized and registering pthread_atfork handlers
to ensure all mutexes are unlocked when the child starts running.

I'm pretty sure forking in the presence of threads in the parent will always
require a GVL, but I don't think it's too big of an issue otherwise.

=end
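The bookkeeping idea above can be illustrated at the Ruby level, with the caveat that the bug itself concerns native pthread mutexes inside the VM, which Ruby code cannot reach. The sketch below is only an analogy: LockRegistry, create, reset_all and fork_with_reset are hypothetical names invented here, and replacing tracked locks with fresh ones in the child is an assumption about how such a hook might behave, not what any patch does.

```ruby
require 'thread'

# Hypothetical registry, for illustration only; not part of Ruby or of any patch.
class LockRegistry
  def initialize
    @locks = {}
  end

  # Create a lock and remember it under the given name.
  def create(name)
    @locks[name] = Mutex.new
  end

  def [](name)
    @locks.fetch(name)
  end

  # Replace every tracked lock with a fresh, unlocked one.
  def reset_all
    @locks.keys.each { |name| @locks[name] = Mutex.new }
  end

  # Fork, resetting all tracked locks in the child before running the block.
  def fork_with_reset
    Process.fork do
      reset_all
      yield
      exit!(0) # skip at_exit handlers inherited from the parent
    end
  end
end

registry = LockRegistry.new
registry.create(:log)

# A parent thread holds the lock at the moment of the fork.
holder_ready = Queue.new
holder = Thread.new do
  registry[:log].synchronize do
    holder_ready << true
    sleep
  end
end
holder_ready.pop

reader, writer = IO.pipe
pid = registry.fork_with_reset do
  reader.close
  acquired = registry[:log].try_lock
  writer.write(acquired ? "acquired" : "blocked")
  writer.close
end
writer.close
child_report = reader.read
reader.close
_, child_status = Process.waitpid2(pid)
holder.kill
holder.join
```

Because the child looks the lock up through the registry after the reset, it gets a fresh Mutex whatever state the parent's copy was in when fork was called.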

#8

Updated by normalperson (Eric Wong) almost 10 years ago

=begin
tetsu.soh.dev@gmail.com wrote:
> Eric Wong wrote:
> > It should be possible to fix the problem by keeping track of all mutexes as
> > they're created/initialized and registering pthread_atfork handlers
> > to ensure all mutexes are unlocked when the child starts running.
>
> In fact, it is impossible to track all mutexes because the usage of mutexes
> really depends on the underlying implementation.
> For example, the deadlock in this issue doesn't happen on Linux, nor even
> on FreeBSD 7.2, but it does happen on FreeBSD 6.4.

Yes, it's not easy; but I think we can start making a best effort and
wait for OSes to catch up. This lets us start paving the way towards
reducing the reliance on the GVL:

The big system-side offenders are stdio, malloc and resolver...

  1. Ruby 1.9 already removed most of stdio dependencies.

  2. malloc still happens under a GVL, but I think replacing it with a
    Ruby-aware memory allocator that's better integrated with the GC and
    thread management would be a good thing anyways.

  3. Maybe look at c-ares or even resolv.rb since they'd play nicer
    with timeouts anyways... (not too sure on this one).

There's probably a few other things, but I think those are the main
ones that server applications (the ones most likely to use threads+fork)
will care about...

--
Eric Wong

=end

#9

Updated by hongli (Hongli Lai) over 9 years ago

=begin
The attached patch fixes the problem. Before forking there might be an arbitrary number of threads waiting on the lock, causing it to enter an undefined state after forking, which in turn causes a deadlock on some platforms. This patch reinitializes the global interpreter lock right after forking, which should be safe because all threads are gone right after forking.

I tried this before and it didn't work, but I suspect that that was caused by bug #2371. Now that #2371 has been fixed it would seem that this patch works.
=end

#10

Updated by vanjab (Vanja Bucic) over 9 years ago

=begin
Very good news, thanks. Where do I fetch the latest sources that include your patch so that I can test with our use case?
Thanks.
=end

#11

Updated by hongli (Hongli Lai) over 9 years ago

=begin
The patch is to be applied on top of Ruby's SVN sources.
=end

#12

Updated by nobu (Nobuyoshi Nakada) over 9 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

=begin
This issue was solved with changeset r25844.
Hongli, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.

=end

#13

Updated by daniel (Daniel Cavanagh) over 9 years ago

=begin
On 25/11/2009, at 5:57 PM, Tanaka Akira wrote:

In article 4B07C4C5.8060102@plan99.net,
Hongli Lai hongli@plan99.net writes:

% ./ruby -e 'fork { puts }'
-e:1: [BUG] native_mutex_unlock return non-zero: 1
ruby 1.9.2dev (2009-11-19 trunk 25848) [x86_64-freebsd6.4]

This is what I get on FreeBSD 7.1-RELEASE:

FreeBSD 8.0-RELEASE behaves similar to FreeBSD 6.4.

% uname -mrsv
FreeBSD 8.0-RELEASE FreeBSD 8.0-RELEASE #0: Sat Nov 21 15:48:17 UTC 2009 root@almeida.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC i386
% ./ruby -e 'fork { puts }'
-e:1: [BUG] pthread_mutex_unlock: Operation not permitted (EPERM)
ruby 1.9.2dev (2009-11-25 trunk 25911) [i386-freebsd8.0]

-- control frame ----------
c:0009 p:---- s:0020 b:0020 l:000019 d:000019 CFUNC :write
c:0008 p:---- s:0018 b:0018 l:000017 d:000017 CFUNC :puts
c:0007 p:---- s:0016 b:0016 l:000015 d:000015 CFUNC :puts
c:0006 p:0009 s:0013 b:0013 l:0010a4 d:000012 BLOCK -e:1
c:0005 p:---- s:0011 b:0011 l:000010 d:000010 FINISH
c:0004 p:---- s:0009 b:0009 l:000008 d:000008 CFUNC :fork
c:0003 p:0009 s:0006 b:0006 l:0010a4 d:000004 EVAL -e:1
c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
c:0001 p:0000 s:0002 b:0002 l:0010a4 d:0010a4 TOP

-e:1:in `<main>'
-e:1:in `fork'
-e:1:in `block in <main>'
-e:1:in `puts'
-e:1:in `puts'
-e:1:in `write'

Exactly the same thing happens on NetBSD 5.0.1, if that helps.

=end
