Bug #14841
closedVery rarely IO#readpartial does not raise EOFError
Description
Hi,
I'm now writing a multi thread application, which uses IO.pipe, IO#close and IO#readpartial.
During running a code many times, I found that, even though calling IO#close on a write side object instantiated by IO.pipe, IO#readpartial of the read side object very rarely does not raise EOFError and keeps blocking.
The below code is working fine in most cases.
- Thread t2 generates '0' and '1' values and writes them on w object.
- Thread t1 read from r object with IO#readpartial
- After thread t2 exited, close w object to finish thread t1
require 'logger'
logger = Logger.new 'logger.log', 1, 10000000
logger.level = Logger::DEBUG
r, w = IO.pipe
t0 = Thread.new{
# This is a dummy thread to avoid being killed by Ruby interpreter automatically due to deadlock
loop do
sleep 1
Thread.pass
end
}
t1 = Thread.new{
loop do
begin
logger.debug { "r.readpartial(10240)" }
data_read = r.readpartial(10240)
logger.debug { "data_read: #{data_read}" }
rescue EOFError
logger.info { "EOFError" }
break
rescue => e
logger.error { e.message }
end
end
logger.info { "exiting t1" }
}
t2 = Thread.new{
2.times{ |i|
sleep 0.1
w.write i.to_s
}
logger.info { "exiting t2" }
}
begin
t2.join
logger.info { "t2 exited" }
logger.info { "close w" }
w.close
logger.info { "w closed" }
rescue IOError
logger.info { "IOError when closing w" }
end
t1.join
logger.info { "t1 exited" }
begin
r.close
rescue IOError
logger.info { "IOError when closing r" }
end
And I run the above code many times like:
$ i=0; while :; do echo $i; ruby readpartial_and_eof_test.rb; i=$(expr $i + 1); done
0
1
(Snip)
3892
3893
3894
(Stopped here)
As seen above, during 3894th iteration, the code is not finished.
The output of logger is the below. Closing w object is done, but EOFError looks not being raised.
$ tail -n 32 logger.log
I, [2018-06-10T22:09:58.738087 #19893] INFO -- : t1 exited
D, [2018-06-10T22:09:58.871906 #19916] DEBUG -- : r.readpartial(10240)
D, [2018-06-10T22:09:58.972198 #19916] DEBUG -- : data_read: 0
D, [2018-06-10T22:09:58.972327 #19916] DEBUG -- : r.readpartial(10240)
I, [2018-06-10T22:09:59.072294 #19916] INFO -- : exiting t2
D, [2018-06-10T22:09:59.072441 #19916] DEBUG -- : data_read: 1
D, [2018-06-10T22:09:59.072758 #19916] DEBUG -- : r.readpartial(10240)
I, [2018-06-10T22:09:59.072667 #19916] INFO -- : t2 exited
I, [2018-06-10T22:09:59.072915 #19916] INFO -- : close w
I, [2018-06-10T22:09:59.073011 #19916] INFO -- : w closed
I, [2018-06-10T22:09:59.073101 #19916] INFO -- : EOFError
I, [2018-06-10T22:09:59.073166 #19916] INFO -- : exiting t1
I, [2018-06-10T22:09:59.073268 #19916] INFO -- : t1 exited
D, [2018-06-10T22:09:59.165037 #19944] DEBUG -- : r.readpartial(10240)
D, [2018-06-10T22:09:59.265229 #19944] DEBUG -- : data_read: 0
D, [2018-06-10T22:09:59.265279 #19944] DEBUG -- : r.readpartial(10240)
D, [2018-06-10T22:09:59.365360 #19944] DEBUG -- : data_read: 1
D, [2018-06-10T22:09:59.365469 #19944] DEBUG -- : r.readpartial(10240)
I, [2018-06-10T22:09:59.365252 #19944] INFO -- : exiting t2
I, [2018-06-10T22:09:59.365640 #19944] INFO -- : t2 exited
I, [2018-06-10T22:09:59.365709 #19944] INFO -- : close w
I, [2018-06-10T22:09:59.365754 #19944] INFO -- : w closed
I, [2018-06-10T22:09:59.365819 #19944] INFO -- : EOFError
I, [2018-06-10T22:09:59.365959 #19944] INFO -- : exiting t1
I, [2018-06-10T22:09:59.366078 #19944] INFO -- : t1 exited
D, [2018-06-10T22:09:59.472484 #19967] DEBUG -- : r.readpartial(10240)
D, [2018-06-10T22:09:59.572743 #19967] DEBUG -- : data_read: 0
D, [2018-06-10T22:09:59.572891 #19967] DEBUG -- : r.readpartial(10240)
I, [2018-06-10T22:09:59.672976 #19967] INFO -- : exiting t2
I, [2018-06-10T22:09:59.673165 #19967] INFO -- : t2 exited
I, [2018-06-10T22:09:59.673199 #19967] INFO -- : close w
I, [2018-06-10T22:09:59.673220 #19967] INFO -- : w closed
$
This issue occurred in CentOS 7.2.1511 and Ubuntu 16.04.4 LTS.
- CentOS
$ cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
$ uname -a
Linux mfz-remote-gw-01 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ ruby -v
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
$ rbenv -v
rbenv 1.1.1-30-gc8ba27f
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)
- Ubuntu
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
$ uname -a
Linux epsilon 4.4.0-122-generic #146-Ubuntu SMP Mon Apr 23 15:34:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ ruby -v
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]
$ rbenv -v
rbenv 1.1.1-37-g1c772d5
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.9' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
Files
Updated by hirura (Hiroyuki URANISHI) over 6 years ago
- File gdb_trace_normal.txt gdb_trace_normal.txt added
- File gdb_trace_reproduced.txt gdb_trace_reproduced.txt added
- ruby -v changed from ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux] to ruby 2.6.0dev (2018-06-15 trunk 63671) [x86_64-linux]
Hi,
I tried getting gdb backtrace of a reproduced, means stopped in this case, process.
Attached files are the outputs of reproduced case and normal case.
Could anyone help on this ?
hirura (Hiroyuki URANISHI) wrote:
Hi,
I'm now writing a multi thread application, which uses IO.pipe, IO#close and IO#readpartial.
During running a code many times, I found that, even though calling IO#close on a write side object instantiated by IO.pipe, IO#readpartial of the read side object very rarely does not raise EOFError and keeps blocking.The below code is working fine in most cases.
- Thread t2 generates '0' and '1' values and writes them on w object.
- Thread t1 read from r object with IO#readpartial
- After thread t2 exited, close w object to finish thread t1
require 'logger' logger = Logger.new 'logger.log', 1, 10000000 logger.level = Logger::DEBUG r, w = IO.pipe t0 = Thread.new{ # This is a dummy thread to avoid being killed by Ruby interpreter automatically due to deadlock loop do sleep 1 Thread.pass end } t1 = Thread.new{ loop do begin logger.debug { "r.readpartial(10240)" } data_read = r.readpartial(10240) logger.debug { "data_read: #{data_read}" } rescue EOFError logger.info { "EOFError" } break rescue => e logger.error { e.message } end end logger.info { "exiting t1" } } t2 = Thread.new{ 2.times{ |i| sleep 0.1 w.write i.to_s } logger.info { "exiting t2" } } begin t2.join logger.info { "t2 exited" } logger.info { "close w" } w.close logger.info { "w closed" } rescue IOError logger.info { "IOError when closing w" } end t1.join logger.info { "t1 exited" } begin r.close rescue IOError logger.info { "IOError when closing r" } end
And I run the above code many times like:
$ i=0; while :; do echo $i; ruby readpartial_and_eof_test.rb; i=$(expr $i + 1); done 0 1 (Snip) 3892 3893 3894 (Stopped here)
As seen above, during 3894th iteration, the code is not finished.
The output of logger is the below. Closing w object is done, but EOFError looks not being raised.
$ tail -n 32 logger.log I, [2018-06-10T22:09:58.738087 #19893] INFO -- : t1 exited D, [2018-06-10T22:09:58.871906 #19916] DEBUG -- : r.readpartial(10240) D, [2018-06-10T22:09:58.972198 #19916] DEBUG -- : data_read: 0 D, [2018-06-10T22:09:58.972327 #19916] DEBUG -- : r.readpartial(10240) I, [2018-06-10T22:09:59.072294 #19916] INFO -- : exiting t2 D, [2018-06-10T22:09:59.072441 #19916] DEBUG -- : data_read: 1 D, [2018-06-10T22:09:59.072758 #19916] DEBUG -- : r.readpartial(10240) I, [2018-06-10T22:09:59.072667 #19916] INFO -- : t2 exited I, [2018-06-10T22:09:59.072915 #19916] INFO -- : close w I, [2018-06-10T22:09:59.073011 #19916] INFO -- : w closed I, [2018-06-10T22:09:59.073101 #19916] INFO -- : EOFError I, [2018-06-10T22:09:59.073166 #19916] INFO -- : exiting t1 I, [2018-06-10T22:09:59.073268 #19916] INFO -- : t1 exited D, [2018-06-10T22:09:59.165037 #19944] DEBUG -- : r.readpartial(10240) D, [2018-06-10T22:09:59.265229 #19944] DEBUG -- : data_read: 0 D, [2018-06-10T22:09:59.265279 #19944] DEBUG -- : r.readpartial(10240) D, [2018-06-10T22:09:59.365360 #19944] DEBUG -- : data_read: 1 D, [2018-06-10T22:09:59.365469 #19944] DEBUG -- : r.readpartial(10240) I, [2018-06-10T22:09:59.365252 #19944] INFO -- : exiting t2 I, [2018-06-10T22:09:59.365640 #19944] INFO -- : t2 exited I, [2018-06-10T22:09:59.365709 #19944] INFO -- : close w I, [2018-06-10T22:09:59.365754 #19944] INFO -- : w closed I, [2018-06-10T22:09:59.365819 #19944] INFO -- : EOFError I, [2018-06-10T22:09:59.365959 #19944] INFO -- : exiting t1 I, [2018-06-10T22:09:59.366078 #19944] INFO -- : t1 exited D, [2018-06-10T22:09:59.472484 #19967] DEBUG -- : r.readpartial(10240) D, [2018-06-10T22:09:59.572743 #19967] DEBUG -- : data_read: 0 D, [2018-06-10T22:09:59.572891 #19967] DEBUG -- : r.readpartial(10240) I, [2018-06-10T22:09:59.672976 #19967] INFO -- : exiting t2 I, [2018-06-10T22:09:59.673165 #19967] INFO -- : t2 exited I, [2018-06-10T22:09:59.673199 #19967] INFO -- : close w I, [2018-06-10T22:09:59.673220 #19967] INFO -- : w closed $
This issue occurred in CentOS 7.2.1511 and Ubuntu 16.04.4 LTS.
- CentOS
$ cat /etc/redhat-release CentOS Linux release 7.2.1511 (Core) $ uname -a Linux mfz-remote-gw-01 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux $ ruby -v ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux] $ rbenv -v rbenv 1.1.1-30-gc8ba27f $ gcc -v Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper Target: x86_64-redhat-linux Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux Thread model: posix gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)
- Ubuntu
$ cat /etc/os-release NAME="Ubuntu" VERSION="16.04.4 LTS (Xenial Xerus)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 16.04.4 LTS" VERSION_ID="16.04" HOME_URL="http://www.ubuntu.com/" SUPPORT_URL="http://help.ubuntu.com/" BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" VERSION_CODENAME=xenial UBUNTU_CODENAME=xenial $ uname -a Linux epsilon 4.4.0-122-generic #146-Ubuntu SMP Mon Apr 23 15:34:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux $ ruby -v ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux] $ rbenv -v rbenv 1.1.1-37-g1c772d5 $ gcc -v Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.9' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
Updated by normalperson (Eric Wong) over 6 years ago
hirura@gmail.com wrote:
File gdb_trace_reproduced.txt added
ruby -v changed from ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux] to ruby 2.6.0dev (2018-06-15 trunk 63671) [x86_64-linux]
Odd, it is stuck in rb_mutex_sleep based on your backtrace (from
Logger, it looks like); and you can reproduce it on recent trunk.
I'm trying to reproduce the bug, but I cannot. I will let it
run while I do other things...
I suspect the problem is a bug in our deadlock detection
(vm->sleeper (*))
In your code, you have a t0 dummy thread. Is it really needed?
Curious, how many CPU cores and what model do you have?
Can you reproduce it if you use taskset or schedtool to limit
it to a single core?
Do you have a different hardware to try on?
(*) I also wonder why vm->sleeper is volatile, it should not
need to be.
Updated by normalperson (Eric Wong) over 6 years ago
Sorry, I left this out: If you can reproduce it again, can you
print vm->sleeper and vm->living_thread_num values from gdb?
Thanks.
Updated by hirura (Hiroyuki URANISHI) over 6 years ago
Hi Eric,
Thank you for replying.
I tried and reproduced this on
- Ubuntu Linux on Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (4cores / 8threads)
- CentOS Linux on Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (1cores /
2threads; a VM)
Unfortunately I don't have a Linux machine with AMD or other vendors'.
As for additional gdb outputs, I have a dumped core file of the
stopped process, and I tried printing the instructed.
But I couldn't get "vm" variable like
(gdb) print vm
No symbol "vm" in current context.
Could you give me a bit detailed way to get the output? I'm sorry I'm
not familiar with gdb.
The t0 dummy thread is used so that the process is not killed by
deadlock detection and I can get the core-dump.
Without the thread, the process is killed as deadlock.
The output of the case is the below. I suppose deadlock detection is
working as expected.
(Line 45 is t1.join in this case)
Traceback (most recent call last):
1: from readpartial_and_eof_test.rb:45:in `<main>'
readpartial_and_eof_test.rb:45:in `join': No live threads left.
Deadlock? (fatal)
2 threads, 2 sleeps current:0x00007f8c498cfc20 main thread:0x00007f8c49566470
* #<Thread:0x00007f8c49598c60 sleep_forever>
rb_thread_t:0x00007f8c49566470 native:0x00007f8c47674740 int:0
readpartial_and_eof_test.rb:45:in `join'
readpartial_and_eof_test.rb:45:in `<main>'
* #<Thread:0x00007f8c4988ce80@readpartial_and_eof_test.rb:11 sleep_forever>
rb_thread_t:0x00007f8c498cfc20 native:0x00007f8c3f908700 int:0
depended by: tb_thread_id:0x00007f8c49566470
/home/hirura/.rbenv/versions/2.5.1/lib/ruby/2.5.0/monitor.rb:185:in `lock'
/home/hirura/.rbenv/versions/2.5.1/lib/ruby/2.5.0/monitor.rb:185:in
`mon_enter'
/home/hirura/.rbenv/versions/2.5.1/lib/ruby/2.5.0/monitor.rb:224:in
`mon_synchronize'
/home/hirura/.rbenv/versions/2.5.1/lib/ruby/2.5.0/logger.rb:688:in `write'
/home/hirura/.rbenv/versions/2.5.1/lib/ruby/2.5.0/logger.rb:471:in `add'
/home/hirura/.rbenv/versions/2.5.1/lib/ruby/2.5.0/logger.rb:493:in `debug'
readpartial_and_eof_test.rb:16:in `block (2 levels) in <main>'
readpartial_and_eof_test.rb:12:in `loop'
readpartial_and_eof_test.rb:12:in `block in <main>'
Thanks,
Updated by normalperson (Eric Wong) over 6 years ago
hirura hirura@gmail.com wrote:
I tried and reproduced this on
- Ubuntu Linux on Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (4cores / 8threads)
- CentOS Linux on Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (1cores /
2threads; a VM)
Unfortunately I don't have a Linux machine with AMD or other vendors'.
OK, I'm on AMD (Phenom II) with 4 cores and no HT.
As for additional gdb outputs, I have a dumped core file of the
stopped process, and I tried printing the instructed.
But I couldn't get "vm" variable like(gdb) print vm No symbol "vm" in current context.
Could you give me a bit detailed way to get the output? I'm sorry I'm
not familiar with gdb.
You can use "up" and "down" to move up/down stack frames until you're
in a function where "vm" is accessible. (I'm not a gdb expert, either)
The t0 dummy thread is used so that the process is not killed by
deadlock detection and I can get the core-dump.
Without the thread, the process is killed as deadlock.
Odd, I can't reproduce a deadlock with t0 commented out, either.
The output of the case is the below. I suppose deadlock detection is
working as expected.
(Line 45 is t1.join in this case)Traceback (most recent call last): 1: from readpartial_and_eof_test.rb:45:in `<main>' readpartial_and_eof_test.rb:45:in `join': No live threads left. Deadlock? (fatal) 2 threads, 2 sleeps current:0x00007f8c498cfc20 main thread:0x00007f8c49566470
OK, so that's 2 for both vm->living_thread_num and vm->sleeper;
in that case that's normal.
I'm wondering if a stuck process will show some strange values
(e.g. negative due to underflow).
Perhaps `volatile' is unnecessary and the compiler does
something strange with sleeper... Does removing it change
things?
--- a/vm_core.h
+++ b/vm_core.h
@@ -565,7 +565,7 @@ typedef struct rb_vm_struct {
unsigned int safe_level_: 1;
int trace_running;
- volatile int sleeper;
+ int sleeper;
/* object management */
VALUE mark_object_ary;
Updated by hirura (Hiroyuki URANISHI) over 6 years ago
Thank you for letting me know how to touch gdb.
The below is the output, it seems that sleeper and living_thread_num
have reasonable values.
(gdb) t 1
[Switching to thread 1 (Thread 0x7f46c51c8700 (LWP 827))]
#0 pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185 in ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
(gdb) up 3
#3 0x000055bec706df9a in thread_join_sleep (arg=140725177158608) at
thread.c:932
932 native_sleep(th, 0);
(gdb) p th->vm
$1 = (rb_vm_t *) 0x55bec8085f20
(gdb) p th->vm->sleeper
$2 = 2
(gdb) p th->vm->living_thread_num
$3 = 3
(gdb)
(gdb) t 3
[Switching to thread 3 (Thread 0x7f46c37aa700 (LWP 841))]
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
225 ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No
such file or directory.
(gdb) up 3
#2 0x000055bec7069920 in native_sleep (th=0x55bec8468fa0,
timeout_rel=0x7f46c37a7b50) at thread_pthread.c:1073
1073 native_cond_timedwait(cond, lock, &timeout);
(gdb) p th->vm
$1 = (rb_vm_t *) 0x55bec8085f20
(gdb) p th->vm->sleeper
$2 = 2
(gdb) p th->vm->living_thread_num
$3 = 3
(gdb)
(gdb) t 4
[Switching to thread 4 (Thread 0x7f46c35a8700 (LWP 842))]
#0 pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185 in ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
(gdb) up 2
#2 0x000055bec7069907 in native_sleep (th=0x55bec8425390,
timeout_rel=0x0) at thread_pthread.c:1071
1071 rb_native_cond_wait(cond, lock);
(gdb) p th->vm
$1 = (rb_vm_t *) 0x55bec8085f20
(gdb) p th->vm->sleeper
$2 = 2
(gdb) p th->vm->living_thread_num
$3 = 3
(gdb)
Nextly, I will try the same code with limited CPU cores, and will try
patching vm_core.h.
Updated by normalperson (Eric Wong) over 6 years ago
hirura hirura@gmail.com wrote:
Nextly, I will try the same code with limited CPU cores, and will try
patching vm_core.h.
Thanks. I'm failing to reproduce the problem on my end.
I also got my netbook to overheat, but that's not Ruby's fault :x
Updated by normalperson (Eric Wong) over 6 years ago
mofezilla@gmail.com wrote:
I think the problem is mutex->th may be altered by
RUBY_VM_CHECK_INTS_BLOCKING(th->ec), and we miss
a point where "mutex->th = 0;"
Can you test?
--- a/thread_sync.c
+++ b/thread_sync.c
@@ -272,6 +272,7 @@ rb_mutex_lock(VALUE self)
list_add_tail(&mutex->waitq, &w.node);
native_sleep(th, timeout); /* release GVL */
list_del(&w.node);
+ RUBY_VM_CHECK_INTS_BLOCKING(th->ec);
if (!mutex->th) {
mutex->th = th;
}
@@ -289,8 +290,6 @@ rb_mutex_lock(VALUE self)
th->vm->sleeper--;
if (mutex->th == th) mutex_locked(th, self);
-
- RUBY_VM_CHECK_INTS_BLOCKING(th->ec);
}
}
return self;
I think the bug existed in <2.4, but 2.5 made it easier-to-hit.
Updated by normalperson (Eric Wong) over 6 years ago
Eric Wong normalperson@yhbt.net wrote:
mofezilla@gmail.com wrote:
Can you test?
Sorry, right idea, bad patch (broke during "make exam")
This should work:
--- a/thread_sync.c
+++ b/thread_sync.c
@@ -289,8 +289,11 @@ rb_mutex_lock(VALUE self)
th->vm->sleeper--;
if (mutex->th == th) mutex_locked(th, self);
-
- RUBY_VM_CHECK_INTS_BLOCKING(th->ec);
+ RUBY_VM_CHECK_INTS_BLOCKING(th->ec); /* may release mutex */
+ if (!mutex->th) {
+ mutex->th = th;
+ mutex_locked(th, self);
+ }
}
}
return self;
Updated by normalperson (Eric Wong) over 6 years ago
- Status changed from Open to Closed
Applied in changeset trunk|r63711.
thread_sync.c (rb_mutex_lock): fix deadlock
- thread_sync.c (rb_mutex_lock): fix deadlock
[ruby-core:87467] [Bug #14841]
Updated by normalperson (Eric Wong) over 6 years ago
- Backport changed from 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN to 2.3: REQUIRED, 2.4: REQUIRED, 2.5: REQUIRED
Updated by normalperson (Eric Wong) over 6 years ago
- Status changed from Closed to Open
Updated by normalperson (Eric Wong) over 6 years ago
This affects all versions, but is most easily triggered in 2.5+
Updated by Anonymous over 6 years ago
- Status changed from Open to Closed
Applied in changeset commit:ruby-git|501069b8a4013f2e3fdde35c50e9527ef0061963.
thread_sync.c (rb_mutex_lock): fix deadlock
- thread_sync.c (rb_mutex_lock): fix deadlock
[ruby-core:87467] [Bug #14841]
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@63711 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
Updated by hirura (Hiroyuki URANISHI) over 6 years ago
Hi,
Sorry to reply late.
Thank you for your quick help and fix.
I have tried the same code more than 1,000,000 times on "ruby 2.6.0dev (2018-06-22 trunk 63723) [x86_64-linux]" version that includes the fix, and I confirmed the issue is not reproduced.
And just for information, I tried the older version, that does not include the fix, with "taskset" tool, and the issue was not reproduced as well.
Thanks,
Updated by nagachika (Tomoyuki Chikanaga) about 6 years ago
- Backport changed from 2.3: REQUIRED, 2.4: REQUIRED, 2.5: REQUIRED to 2.3: REQUIRED, 2.4: REQUIRED, 2.5: DONE
ruby_2_5 r64998 merged revision(s) 63711,64398.
Updated by usa (Usaku NAKAMURA) about 6 years ago
- Backport changed from 2.3: REQUIRED, 2.4: REQUIRED, 2.5: DONE to 2.3: REQUIRED, 2.4: DONE, 2.5: DONE
ruby_2_4 r65115 merged revision(s) 63711,64398.