Bug #1993

closed

IO.select fails when called in multiple threads on 1.8.7p174

Added by dazuma (Daniel Azuma) over 14 years ago. Updated almost 13 years ago.

Status:
Closed
ruby -v:
ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-darwin9.8.0]
[ruby-core:25114]

Description

=begin
IO.select (Kernel#select) fails when called on different sets of IO objects in different threads. This affects release versions 1.8.7p160, 1.8.7p173, and 1.8.7p174. It does NOT seem to affect the recent versions of 1.9.1 that I have tested, nor does it affect release version 1.8.7p72. I have not tested any 1.8.6 versions. The repro steps were tested mostly on Mac OS X 10.5.8 on an Intel-based MacBook Pro; I have, however, seen similar behavior on a recent Fedora Linux i686.

To reproduce, run the following script. (Replace the two filenames with distinct known readable files on your system.)

Begin code

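FILENAME1 = "Rakefile"
FILENAME2 = "README"
TWO_THREADS = true  # set to false to run only the first thread

f1 = File.open(FILENAME1)
f2 = File.open(FILENAME2)

# Thread 1: poll f1 with a zero timeout and print how many streams are readable.
t1 = Thread.new do
  c1 = 0
  loop do
    c1 += 1
    s1 = IO.select([f1], nil, nil, 0)
    n1 = s1 ? s1.first.size : 0
    puts "t1: num=#{n1} iter=#{c1}"
  end
end

# Thread 2: the same loop on f2, started only when TWO_THREADS is true.
t2 = Thread.new do
  c2 = 0
  loop do
    c2 += 1
    s2 = IO.select([f2], nil, nil, 0)
    n2 = s2 ? s2.first.size : 0
    puts "t2: num=#{n2} iter=#{c2}"
  end
end if TWO_THREADS

# The loops never exit, so this blocks until the script is interrupted.
t1.join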

End code

The code simply calls IO.select repeatedly on IO objects known to have readable bytes, in either one thread or two. When run in one thread (TWO_THREADS=false), it behaves as expected, printing "num=1" to indicate that select has detected the readable stream. However, when run in two threads (TWO_THREADS=true), both threads print "num=0", indicating that neither thread detects readable data on its stream.
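Because the script above loops forever, a bounded variant can make the symptom easier to compare across interpreters. The following is a minimal sketch, not part of the original report: the names ITERATIONS, count_ready, and run_bounded_repro are illustrative, and it creates its own temporary files instead of relying on Rakefile and README. It avoids newer syntax, but compatibility with 1.8.7 is an assumption and has not been verified here.

```ruby
require "tempfile"

# Illustrative iteration count; not taken from the original report.
ITERATIONS = 200

# Calls IO.select with a zero timeout `iterations` times and returns how many
# calls reported `io` as readable.
def count_ready(io, iterations)
  ready = 0
  iterations.times do
    result = IO.select([io], nil, nil, 0)
    ready += 1 if result && !result.first.empty?
  end
  ready
end

# Creates two non-empty temporary files, polls each from its own thread,
# and returns the two per-thread counts.
def run_bounded_repro(iterations)
  temps = Array.new(2) do |i|
    t = Tempfile.new("select_repro_#{i}")
    t.write("data")
    t.flush
    t
  end
  readers = temps.map { |t| File.open(t.path) }
  threads = readers.map { |io| Thread.new { count_ready(io, iterations) } }
  threads.map { |th| th.value }
ensure
  (readers || []).each { |io| io.close }
  (temps || []).each { |t| t.close! }
end

counts = run_bounded_repro(ITERATIONS)
puts "t1: ready=#{counts[0]}/#{ITERATIONS}"
puts "t2: ready=#{counts[1]}/#{ITERATIONS}"
```

On an unaffected interpreter each count should equal ITERATIONS. Going by the behavior described in this report, an affected 1.8.7 build would be expected to report counts at or near zero.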

The relevant code appears to be the function rb_thread_schedule() in eval.c, and I believe this issue is related to revision 21165. I haven't been able to untangle everything in this code yet, but here's what I've been able to determine:

  • The code that collects file descriptors for the system select() call (lines 11063-11073 of the 1.8.7 branch as of revision 24104) DOES NOT RUN for a given thread unless the thread has a THREAD_STOPPED status at that time (because of line 11051). Therefore, any thread with a THREAD_RUNNABLE status at that time is effectively shut out of receiving select() results unless its fd list overlaps another thread's.

  • Given the sample code above, the next qualifying thread (that is, the thread that will later be assigned to the "next" variable) tends to be in the THREAD_RUNNABLE state at this time. Since such threads are shut out of the select() call, they can never be assigned to "th_found" (see lines 11208-11212). As a result, "th_found" is assigned to a later thread in the list, rather than, as appears to be the intent, the first qualifying thread in the list (note the break on line 11214).

  • Unfortunately, this conflicts with lines 11230ff. Those lines, which choose the "next" thread, always prefer the first thread given equal priority (line 11231). Since "th_found" tends not to be the first qualifying thread, the conditions on lines 11231 and 11232 are never both true; as a result, th->select_value is never set, and the select calls never succeed.

  • The code appeared to work pre-revision-21165 (e.g. 1.8.7p72) because that version of the code set select_value on every qualifying thread, whereas the current code sets it on only one thread.

Here's where I'm unsure about how to proceed with a patch. I would like to move lines 11058 through 11073 to immediately above line 11051. This would add each thread's file descriptors to the select call, regardless of whether the thread has status THREAD_STOPPED or THREAD_RUNNABLE. This change appears to fix the test case above. And I believe it is the correct behavior; however, I'm new to this part of the code and do not have enough understanding of the intent of thread->status to assert that this is correct. I was hoping someone with more knowledge of this area could use this analysis as a starting point.
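The analysis and the proposed change above can be illustrated with a toy model. This is a sketch under stated assumptions, written in Ruby for readability; it is not a transcription of rb_thread_schedule() in eval.c. ToyThread, toy_schedule, toy_run, the collect_runnable flag, and the descriptor numbers 3 and 4 are all hypothetical, and the four steps only mirror the behavior as described in this report.

```ruby
# Toy stand-in for a scheduler thread record; all names here are hypothetical.
ToyThread = Struct.new(:name, :status, :fd, :select_value)

# Runs one simplified scheduling pass and returns the thread chosen to run next.
# collect_runnable = false models the behavior described in the report;
# collect_runnable = true models the proposed change.
def toy_schedule(threads, ready_fds, collect_runnable)
  # Step 1: decide which threads contribute descriptors to the select() call.
  contributors = threads.select { |t| t.status == :stopped || collect_runnable }

  # Step 2: th_found is the first contributing thread whose descriptor is ready.
  th_found = contributors.find { |t| ready_fds.include?(t.fd) }

  # Step 3: with equal priorities, the first qualifying thread in the list wins.
  nxt = threads.find { |t| t.status == :runnable || t.equal?(th_found) }

  # Step 4: the select result is recorded only when both choices agree.
  nxt.select_value = 1 if nxt && nxt.equal?(th_found)
  nxt
end

# Builds the two-thread situation from the analysis: the first thread is
# runnable, the second is stopped, and both descriptors are readable.
def toy_run(collect_runnable)
  t1 = ToyThread.new("t1", :runnable, 3, nil)
  t2 = ToyThread.new("t2", :stopped, 4, nil)
  chosen = toy_schedule([t1, t2], [3, 4], collect_runnable)
  [chosen.name, chosen.select_value]
end

p toy_run(false)  # descriptors collected only from stopped threads
p toy_run(true)   # descriptors collected from every thread
```

In this model the first call picks t1 but leaves its select_value unset, because th_found landed on t2; the second call picks t1 and records the result, because t1's descriptor was included in the select() set.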
=end


Related issues 2 (0 open, 2 closed)

Related to Backport186 - Backport #2039: Backport 24413, 24416, 24442 to fix IO#select threading issue (Closed, wyhaines (Kirk Haines), 09/04/2009)
Is duplicate of Backport187 - Backport #1484: Ruby 1.8.6_p368 and Ruby 1.8.7_p160 have threading regressions (Closed)