Bug #5306
closedApplication Hangs Due to Recent rb_thread_select Changes
Description
This commit:
4e9438bc9153f7a1f4ea0af85c8dbe359e1a55d8
Changed the implementation of rb_thread_select.
It causes eventmachine to hang on CentOS 5.5. Not sure what the issue is, but its easily reproduced by by running the test eventmachine/tests/test_epoll.rb.
We noticed this because it also causes the tweetstream gem to hang.
The same setup works on Fedora 14 and an up-to-date arch linux. Specific version information included below.
We temporarily fixed this by reverting the commit.
Since Centos is a common production environment (and the one we are using), this seems to us a blocker for 1.9.3.
We are happy to provide any additional information or test fixes.
Thanks - Charlie
We are running this version of CentOS:
Linux app1.zerista.com 2.6.18-238.19.1.el5.centos.plus #1 SMP Mon Jul 18 10:05:09 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
And this version of Fedora:
Linux ammonite.internal.zerista.com 2.6.35.14-95.fc14.x86_64 #1 SMP Tue Aug 16 21:01:58 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
And this version of eventmachine:
eventmachine (1.0.0.beta.3)
And this version of tweetstream:
tweetstream (1.0.4)
Files
Updated by kosaki (Motohiro KOSAKI) over 13 years ago
I think it's duplicated with 5299.
Updated by cfis (Charlie Savage) over 13 years ago
Unfortunately it is not. That was the first problem - and resulted in segmentation faults. We manually backported the fix for #5299 to our local copy of ruby 193. Once we did that, it fixed the segmentation faults, but resulted in this problem.
So this is a new problem with that particular commit.
Updated by normalperson (Eric Wong) over 13 years ago
Charlie Savage cfis@savagexi.com wrote:
It causes eventmachine to hang on CentOS 5.5. Not sure what the issue
is, but its easily reproduced by by running the test
eventmachine/tests/test_epoll.rb.
I have CentOS 5.4, x86_64, kernel 2.6.18-164.11.1.el5
rake compile
ruby -I .:lib:tests/ tests/test_epoll.rb
Works for me on an unpacked eventmachine-1.0.0.beta.3 tree with
ruby_1_9_3 branch. However, only 2 tests appeared enabled.
We noticed this because it also causes the tweetstream gem to hang.
The same setup works on Fedora 14 and an up-to-date arch linux.
Specific version information included below.
Yes, unable to reproduce on a more modern Debian testing machine
(x86_64)
Linux app1.zerista.com 2.6.18-238.19.1.el5.centos.plus #1 SMP Mon Jul
18 10:05:09 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
I'll try to find a machine closer to the above.
Updated by cfis (Charlie Savage) over 13 years ago
Hi Eric,
It causes eventmachine to hang on CentOS 5.5.
Sorry, these machines are actually CentOS 5.6. The latest patches were applied via yum update about a week ago, so its pretty up-to-date.
I have CentOS 5.4, x86_64, kernel 2.6.18-164.11.1.el5
rake compile
ruby -I .:lib:tests/ tests/test_epoll.rbWorks for me on an unpacked eventmachine-1.0.0.beta.3 tree with
ruby_1_9_3 branch. However, only 2 tests appeared enabled.
So what we see is this test hanging:
def test_datagrams
$in = $out = ""
EM.run {
EM.open_datagram_socket "127.0.0.1", @port, TestDatagramServer
EM.open_datagram_socket "127.0.0.1", 0, TestDatagramClient, @port
}
assert_equal( "1234567890", $in )
assert_equal( "abcdefghij", $out )
end
It hangs on the first EM.open_datagram_socket call.
Here is another one, this time from test_pure_ruby.rb (which in fact seems misnamed, it is using the C code):
def test_connrefused
assert_nothing_raised do
EM.run {
setup_timeout(2)
EM.connect "127.0.0.1", @port, TestConnrefused
}
end
In this one, its the EM connect call that hangs.
I'll try to find a machine closer to the above.
Probably a yum update will get you there...
Let me know if there is anything we can do to help debug this. Its happens across 8 servers (all of which are at the same CentOS release, albeit they did start as the same VM image a while back).
Charlie
Updated by normalperson (Eric Wong) over 13 years ago
Charlie Savage cfis@savagexi.com wrote:
Sorry, these machines are actuall CentOS 5.6. The latest patches were
applied via yum update about a week ago, so its pretty up-to-date.
OK, I'm closer with 2.6.18-238.9.1.el5xen but still can't reproduce it.
I don't have permission to upgrade kernels on CentOS images,
unfortunately. It's the weekend so the folks that do have permission
aren't around...
So what we see is this test hanging:
def test_datagrams
$in = $out = ""
EM.run {
EM.open_datagram_socket "127.0.0.1", @port, TestDatagramServer
EM.open_datagram_socket "127.0.0.1", 0, TestDatagramClient, @port
}
assert_equal( "1234567890", $in )
assert_equal( "abcdefghij", $out )
endIt hangs on the first EM.open_datagram_socket call.
Can you show us "strace -f -v" output from that test?
Maybe sprinkle some `fprintf(stderr, "%s:%d\n", FILE, LINE);'
or similar inside EventMachine_t::OpenDatagramSocket and see where it
gets to? It shouldn't hit gethostbyname()...
Here is another one, this time from test_pure_ruby.rb (which in fact seems misnamed, it is using the C code):
def test_connrefused
assert_nothing_raised do
EM.run {
setup_timeout(2)
EM.connect "127.0.0.1", @port, TestConnrefused
}
endIn this one, its the EM connect call that hangs.
I can't reproduce this, either...
Also, can you extract these tests and run with a hand-picked port?
Let me know if there is anything we can do to help debug this. Its
happens across 8 servers (all of which are at the same CentOS release,
albeit they did start as the same VM image a while back).
I assume you tried a clean build/install of Ruby to make sure all
objects got rebuilt and reinstalled?
Can you also try running `pmap $PID' on the hung processes to make sure
it's loading the correct libs + versions?
Updated by cfis (Charlie Savage) over 13 years ago
- File strace_hangs.log strace_hangs.log added
- File strace_completes.log strace_completes.log added
- File strace_pure.log strace_pure.log added
- File pmap.log pmap.log added
Ok, on the first test, strange results. Running this command:
strace -f -v ruby -I.:lib:tests tests/test_epoll.rb -n test_datagrams
Hangs the test as expected. But running this command:
strace -f -v ruby -I.:lib:tests tests/test_epoll.rb -n test_datagrams &> /tmp/strace1.log
Causes the test runs to completion. And then annoyingly enough that one particular test works after that. If I reboot the machine, then the test hangs again.
I have attached 2 logs, strace_completes.log and strace_hangs.log. stace_hangs.log is only the last few hundred lines (rest scrolled off the top), but what I saw matches strace_completes.log to line 2,271. After that, the two diverge.
The story is different for the second test, it always hangs:
strace -v -v ruby -I.:lib:tests tests/test_pure.rb -n test_connrefused 2>&1 | tee /tmp/strace_pure.log
That log is attached.
As for your other questions:
Also, can you extract these tests and run with a hand-picked port?
Sure. The connection refused one is intentionally picking the first unused port. It turns out to be 9001.
I assume you tried a clean build/install of Ruby to make sure all > objects got rebuilt and reinstalled?
Yes.
$cd /usr/src/ruby
$git pull (on the ruby 193 branch)
$git clean -fx
$autoconf
$./configure --prefix=/usr --enable-shared=true
$make
$make install
Can you also try running `pmap $PID' on the hung processes to make > sure it's loading the correct libs + versions?
$ps -ef | grep ruby
cfis 16185 15381 4 01:51 pts/1 00:00:00 ruby -I.:lib:tests
$pmap 16185
(see attached log)
Hope this info helps.
Updated by cfis (Charlie Savage) over 13 years ago
And a bit more info. Running the datagrams test under GDB.
$gdb --args ruby -I.:lib:tests tests/test_epoll.rb -n test_datagrams
(gdb) run
... hangs ...
hit ctrl+c
Program received signal SIGINT, Interrupt.
0x000000375200d91b in read () from /lib64/libpthread.so.0
(gdb) bt
#0 0x000000375200d91b in read () from /lib64/libpthread.so.0
#1 0x00002aaaae9ea3ce in EventMachine_t::_ReadLoopBreaker (this=0xd61b50)
at em.cpp:998
#2 0x00002aaaae9ebc9a in EventMachine_t::_RunSelectOnce (this=0xd61b50)
at em.cpp:935
#3 0x00002aaaae9ec4f5 in EventMachine_t::_RunOnce (this=0x9) at em.cpp:498
#4 0x00002aaaae9ee183 in EventMachine_t::Run (this=0xd61b50) at em.cpp:478
#5 0x00002aaaae9e86a9 in t_run_machine_without_threads (self=9)
at rubymain.cpp:219
#6 0x00002aaaaac1b2d0 in vm_call_cfunc (th=0x602520, cfp=0x2aaaae5c7778,
num=0, blockptr=0x1, flag=24, id=0, me=0x8e9f90, recv=9127040)
at vm_insnhelper.c:404
etc.
(gdb) frame 1
#1 0x00002aaaae9ea3ce in EventMachine_t::_ReadLoopBreaker (this=0xd61ce0)
at em.cpp:998
998 read (LoopBreakerReader, buffer, sizeof(buffer));
(gdb) list
993 /* The loop breaker has selected readable.
994 * Read it ONCE (it may block if we try to read it twice)
995 * and send a loop-break event back to user code.
996 */
997 char buffer [1024];
998 read (LoopBreakerReader, buffer, sizeof(buffer));
999 if (EventCallback)
Running the other test, gdb --args ruby -I.:lib:tests tests/test_pure.rb -n test_connrefused, shows the same backtrace in gdb.
Updated by normalperson (Eric Wong) over 13 years ago
- File 0001-thread.c-rb_thread_select-mark-original-fd_sets-prop.patch 0001-thread.c-rb_thread_select-mark-original-fd_sets-prop.patch added
Thanks for the straces, I was able to tell the EM pipe was stuck on a
false-positive and calling a blocking read() on a pipe that had no data.
Attached is a patch which should fix the issue, sorry for the bug :x
Updated by kosaki (Motohiro KOSAKI) over 13 years ago
- ruby -v changed from ruby 1.9.3dev (2011-09-09 revision 33236) [x86_64-linux] to -
2011/9/11 Eric Wong normalperson@yhbt.net:
Issue #5306 has been updated by Eric Wong.
File 0001-thread.c-rb_thread_select-mark-original-fd_sets-prop.patch added
Thanks for the straces, I was able to tell the EM pipe was stuck on a
false-positive and calling a blocking read() on a pipe that had no data.Attached is a patch which should fix the issue, sorry for the bug :x
Your patch will break non linux platform. I can't apply it. :x
Updated by cfis (Charlie Savage) over 13 years ago
Ok, some questions so I can understand this code:
How is the false-positive happening?
Why does this break on non-linux platforms?
And then obviously, what is the next step?
Thanks for looking into this and the quick responses.
Updated by normalperson (Eric Wong) over 13 years ago
- File 0001-thread.c-rb_thread_select-mark-original-fd_sets-prop.patch 0001-thread.c-rb_thread_select-mark-original-fd_sets-prop.patch added
Hopefully a better patch is attached. I have no way of testing on non-Linux,
but I did test successfully without HAVE_RB_FD_INIT defined. _WIN32 tester (and
potential fixer) is needed.
Updated by normalperson (Eric Wong) over 13 years ago
Charlie Savage cfis@savagexi.com wrote:
Ok, some questions so I can understand this code:
How is the false-positive happening?
rb_thread_select() needs to modify the arguments passed to it (and clear
out not-ready descriptors). My patch fixed that for Linux and other
platforms with NFDBITS && HAVE_RB_FD_INIT.
Why does this break on non-linux platforms
I missed the (NFDBITS && HAVE_RB_FD_INIT) code paths completely.
And then obviously, what is the next step?
I am testing a patch, I manually disabled the HAVE_RB_FD_INIT code
paths to test, but I cannot test _WIN32 path.
Updated by cfis (Charlie Savage) over 13 years ago
Thanks for the explanations.
I can test on windows - I have mswin and mingw builds. How to test though? Are there any tests in the test suite I should run to verify? Would love to run the whole test suite, but sadly that doesn't work on windows.
Updated by normalperson (Eric Wong) over 13 years ago
Charlie Savage cfis@savagexi.com wrote:
I can test on windows - I have mswin and mingw builds. How to test
though? Are there any tests in the test suite I should run to verify?
Would love to run the whole test suite, but sadly that doesn't work on
windows.
./ruby -I .ext/$PLATFORM test/-ext-/old_thread_select/test_old_thread_select.rb
For me, I have PLATFORM=x86_64-linux
Updated by cfis (Charlie Savage) over 13 years ago
Hmm, I take is this is against head? On the 1.9.3 branch there is already this method (line 2384):
void
rb_fd_copy(rb_fdset_t *dst, const fd_set *src, int max)
The patch then adds this right below it (line 2399):
static void
rb_fd_rcopy(fd_set *dst, rb_fdset_t *src)
And then lower down (line 2690):
if (read) {
rfds = &fdsets[0];
rb_fd_init(rfds);
rb_fd_copy(rfds, read, max);
}
So that rb_fd_copy call would no longer work.
Updated by normalperson (Eric Wong) over 13 years ago
Charlie Savage cfis@savagexi.com wrote:
Issue #5306 has been updated by Charlie Savage.
Hmm, I take is this is against head? On the 1.9.3 branch there is
already this method (line 2384):
It should apply cleanly to r33236 (ruby_1_9_3)
void
rb_fd_copy(rb_fdset_t *dst, const fd_set *src, int max)The patch then adds this right below it (line 2399):
static void
rb_fd_rcopy(fd_set *dst, rb_fdset_t *src)
The new function is "rcopy" (reverse copy). I named it based on memrchr()
vs memchr(). Maybe someone can think of a better name?
Updated by cfis (Charlie Savage) over 13 years ago
Ah, totally missed that r - its not obvious if you aren't looking for it.
Patch doesn't compile on Windows:
thread.c
./../thread.c(2466) : error C2143: syntax error : missing ')' before ';'
NMAKE : fatal error U1077: '"c:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\BIN\cl.EXE"' : return code '0x2'
Stop.
As the error says, easy to fix, its missing a ) at the end of the line.
The test passes on mswin. One fails on mingw:
- Failure:
test_old_select_false_positive(TestOldThreadSelect) [ruby/test/-ext-/old_thread_select/test_old_thread_select.rb:34]:
[ruby-core:39435].
<[5]> expected but was
<[3, 5]>.
4 tests, 12 assertions, 1 failures, 0 errors, 0 skips
Updated by kosaki (Motohiro KOSAKI) over 13 years ago
- File old_thread_select.patch old_thread_select.patch added
static void
rb_fd_rcopy(fd_set *dst, rb_fdset_t *src)
{
size_t size = howmany(rb_fd_max(src), NFDBITS) * sizeof(fd_mask);
if (size < sizeof(fd_set)) size = sizeof(fd_set);
memcpy(dst, rb_fd_ptr(src), size);
}
If size > sizeof(fd_set), this code makes memory corruption.
static void
rb_fd_rcopy(fd_set *dst, rb_fdset_t *src)
{
memcpy(dst->fd_array, src->fdset->fd_array,
dst->fd_count * sizeof(dst->fd_array[0]);
dst->fd_count = src->fdset->fd_count;
}
Bad indentation of coding style violation.
Also, if src->fdset->fd_count > FD_SETSIZE, we should return an error or raise an exception.
Attached new patch. It works both linux and windows. Can you please review it?
Updated by normalperson (Eric Wong) over 13 years ago
Motohiro KOSAKI kosaki.motohiro@gmail.com wrote:
Attached new patch. It works both linux and windows. Can you please
review it?
Thanks! I can confirm it's good on Linux, Charlie?
Updated by kosaki (Motohiro KOSAKI) over 13 years ago
- Status changed from Open to Closed
- % Done changed from 0 to 100
This issue was solved with changeset r33256.
Charlie, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.
-
thread.c (rb_thread_select): fix to ignore an argument
modification of rb_thread_fd_select().
based on a patch by Eric Wong. [Bug #5306] [ruby-core:39435] -
thread.c (rb_fd_rcopy): New. for reverse fd copy.
-
test/-ext-/old_thread_select/test_old_thread_select.rb
(test_old_select_false_positive): test for bug5306. -
ext/-test-/old_thread_select/old_thread_select.c (fdset2array):
New. convert fdsets to array. -
ext/-test-/old_thread_select/old_thread_select.c (old_thread_select):
return 'read', 'write', 'except' argument of rb_thread_select()
to ruby script.
Updated by kosaki (Motohiro KOSAKI) over 13 years ago
committed both trunk and ruby_1_9_3.
Updated by cfis (Charlie Savage) over 13 years ago
- File mingw_backtrace.txt mingw_backtrace.txt added
Thanks for all the effort. But sorry, not fixed yet. This version segfaults on MinGW. Trace attached. Will check mswin next.
Updated by usa (Usaku NAKAMURA) over 13 years ago
Hello,
In message "[ruby-core:39483] [Ruby 1.9 - Bug #5306] Application Hangs Due to Recent rb_thread_select Changes"
on Sep.13,2011 01:29:28, cfis@savagexi.com wrote:
File mingw_backtrace.txt added
Thanks for all the effort. But sorry, not fixed yet. This version segfaults on MinGW. Trace attached. Will check mswin next.
Hmm, did you do make install before running the test?
C level backtrace information of your trace says that the ruby
core dll is c:\MinGW\local\ruby\bin\msvcrt-ruby191.dll .
I guess that your build path is c:/MinGW/local/src/ruby, and
the built ruby core dll is c:/MinGW/local/src/ruby/msvcrt-ruby191.dll .
If you want to test safely, run as follows:
make test-all TESTS="-- -ext-/old_thread_select"
BTW, I've checked kosaki-san's patch with x64-mswin64.
No problem was reported in test.
P.S.
kosaki-san, I want to add a guard to your patch.
--- thread.c.bak 2011-09-13 03:40:05.948172400 +0900
+++ thread.c 2011-09-13 03:40:24.308222500 +0900
@@ -2469,7 +2469,9 @@ rb_fd_rcopy(fd_set *dst, rb_fdset_t *src
{
int max = rb_fd_max(src);
- if (max > FD_SETSIZE) {
- /* we assume src is the result of select() with dst, so dst should be
-
* larger or equal than src. */
- if (max > FD_SETSIZE || max > dst->fd_count) {
rb_raise(rb_eArgError, "too large fdsets");
}
Regards,
U.Nakamura usa@garbagecollect.jp
Updated by cfis (Charlie Savage) over 13 years ago
Ok, I rebuilt everything from scratch and did not encounter any errors - sorry for the false alarm. mswin also checked out fine.
We will next test this fix on the original servers where we encountered the problem. If any issues remain, I will reopen the ticket.
Thanks again for the help.
Updated by normalperson (Eric Wong) over 13 years ago
Motohiro KOSAKI kosaki.motohiro@gmail.com wrote:
File old_thread_select.patch added
static void
rb_fd_rcopy(fd_set *dst, rb_fdset_t *src)
{
size_t size = howmany(rb_fd_max(src), NFDBITS) * sizeof(fd_mask);
if (size < sizeof(fd_set)) size = sizeof(fd_set);
memcpy(dst, rb_fd_ptr(src), size);
}If size > sizeof(fd_set), this code makes memory corruption.
I just thought of this again and think rb_bug() is better than
rb_raise() here. While unlikely to hit either case, rb_raise()
will leak memory since the rb_fd_term() call gets skipped.
Updated by kosaki (Motohiro KOSAKI) over 13 years ago
BTW, I've checked kosaki-san's patch with x64-mswin64.
No problem was reported in test.P.S.
kosaki-san, I want to add a guard to your patch.--- thread.c.bak     2011-09-13 03:40:05.948172400 +0900
+++ thread.c   2011-09-13 03:40:24.308222500 +0900
@@ -2469,7 +2469,9 @@ rb_fd_rcopy(fd_set *dst, rb_fdset_t *src
 {
  int max
Updated by kosaki (Motohiro KOSAKI) over 13 years ago
2011/9/13 Eric Wong normalperson@yhbt.net:
Motohiro KOSAKI kosaki.motohiro@gmail.com wrote:
File old_thread_select.patch added
static void
rb_fd_rcopy(fd_set *dst, rb_fdset_t *src)
{
  size_t size