Bug #6358: arm-linux : sleep() time dependent threading bug - Ruby - Ruby Issue Tracking System

Actions

Copy link

Bug #6358

closed

arm-linux : sleep() time dependent threading bug

Added by stevegoobermanhill (stephen gooberman-hill) over 13 years ago. Updated almost 13 years ago.

Status:

Closed

Assignee:

Target version:

ruby -v:

ruby 1.9.2p136 (2010-12-25 revision 30365) [arm-linux]

Backport:

[ruby-core:44621]

Description

Hi,
I have a multi-threading bug on a low-speed (180MHz) arm-linux platform. The bug can be consistantly produced or removed by varying the length of a sleep() interval at the end of a piece of code invoked in a Thread (the GServer#serve(io) method)

I am unable to completely isolate the bug, and the trace I get off it is intermittent, but when it occurs it terminates the program instantly. The debugging info that I get when the program fails is dependent on the sleep timer.

I have once seen a ruby fault, with a control frame backtrace (3 levels) showing a cfp consistancy fault (sorry - not sure which one, but send I think - my fault for losing the trace). However, I can consistently get a trace claiming a ThreadError

The lines referenced are marked in the code.

OK - the following piece of code (log statements removed) is taken from a class derived from GServer, and implements the #serve method. The line marked with the arrow (==>) at the bottom is critical in controlling the appearance of the bug.

The bug can be made to appear by running a sequence of unit tests to stress this piece of code (there are 5 test methods in the sequence)

If the sleep length is 0.01, the program (a test suite) will crash on the first test, with no backtrace or debugging info
If the sleep length is 0.1, or if the sleep length is 0.01 AND ruby is invoked with --debug then the program will crash at a variable point
in the test sequence, giving the following trace
Exception ThreadError' at /mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:88 - must be called with a block must be called with a block Exception ThreadError' at /mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:118 - must be called with a block
/mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:88:in initialize' /mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:88:in eof'
/mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:88:in `serve'
This sleep length has also (once) given some control frame information.
If the sleep length is 0.5, the test sequence will always run correctly.

I suspect that the issue is somewhere deep in the Thread handling code - possibly the ruby interpreter is not giving the underlying linux system enough time to clear up and reallocate threads; this results in the crash.

Hope this is of some help. I do have a workaround at this point (lengthen the timer to 0.5s), but this may be an issue on slower platforms.

Kind regards

Steve

====================== simplified code (/mnt/nand-user/trident_v1.1_alpha/lib/rmodbus/tcp_multi_server.rb:88) ===================
====================== lines referred to in error are commented ===================
====================== sleep statement which can trigger or remove the bug marked with ==> ===================

class TCPMultiServer < GServer
def serve(io)
begin
io.sync=true
io.fcntl(Fcntl::F_SETFL, io.fcntl(Fcntl::F_GETFL) | Fcntl::O_NONBLOCK)

    while not stopped? 
      req=nil
      unless io.eof                                                           #line 88 (referenced in ThreadError)
        req = io.read(7) 
                  
        destination=req.getbyte(6)
        multicast= ( destination == 0 ) #true if multicast message
        
        active_devices= multicast ? @devices.values : [@devices[destination]]
        active_devices.compact!
        
        tr = req[0,2]
        len = req[4,2].unpack('n')[0]
        req = io.read(len - 1) unless io.eof
        
        active_devices.each do |device|    
          pdu = device.serve(req)

          resp = tr + "\0\0" + (pdu.size + 1).to_word + device.uid.chr + pdu
          io.write resp
        end
      end

==> sleep(0.01)
end
rescue Exception => ex
STDERR.puts ex
raise ex #line 118 (referenced in ThreadError)
end
end
end

Related issues 1 (0 open — 1 closed)

Actions

Copy link

#1 [ruby-core:44624]

Updated by mame (Yusuke Endoh) over 13 years ago

Status changed from Open to Feedback

At first, please try Ruby trunk.

If the problem still occurs, could you please create a patch yourself?
There is no arm maintainer unfortunately.

If your patch looks benign for other platforms, we may import it.

Thanks,

--
Yusuke Endoh mame@tsg.ne.jp

Actions

Copy link

#2 [ruby-core:44627]

Updated by stevegoobermanhill (stephen gooberman-hill) over 13 years ago

Hi Yusuke-san,
I'm not sure that I will be able to create a patch - you can blame Matz :-) If he hadn't written such a brilliant language I would still be competent in C. :-) As it is, while I can still competently read and understand C code, I don't think I have written any since about 2003.

Thinking about the problem a bit, my suspicion is that somewhere (either in ruby, or the pthreads library, or in the interaction between them), a thread is being reused before it is properly torn down, and this is causing the problem. A gross simplification of the code (which I have yet to try and use to reproduce the bug) is

timeout=0.01 #this is the critical value
while true
Thread.new do
Thread.new { sleep(timeout) }
end
end

My reasoning is that the bug may be quite deep as it generally exits without giving any backtrace, and is very dependent on the critical value. I don't know about the deep design of the ruby code or the pthreads library, but I am wondering whether it is possible that the there is a timer used to give the system time to tidy up before a thread is reused, and that this timer simply times out too quickly for the platform.

Are you aware of any similar problem on a slower (embedded) platform? It may be a problem that is specific to arm-linux, or to the ulibc pthreads library, or one that only turns up on slow processors.

In the mean time, I will see if I can produce a simple test case which demonstrates the problem. That would be a start.

FYI, I am putting a significant number of embedded arm-linux devices into the field this year (on the Techbase NPE series GPRS modem / industrial computer). I would be happy to continue to contribute bug reports - I have ruby 1.8.7 working happily on this device, but 1.9 is a little more challenging at the moment. Anything that I can contribute, I will do so.

Kind regards,

Steve

Actions

Copy link

#3 [ruby-core:44653]

Updated by naruse (Yui NARUSE) over 13 years ago

Could you write a reproducible code with C?
If you can, it is an issue of ulibc's pthread implementation.

The root of sleep(timeout) is pthread_cond_timedwait(3).
So I suspect it has a bug.

Actions

Copy link

#4 [ruby-core:44654]

Updated by stevegoobermanhill (stephen gooberman-hill) over 13 years ago

Hi Yusuke-san
Resolved - duplicated #5343

further investigation leads me to believe that this is actually a duplicate of Bug #5343 as I can consistently make that bug happen on the arm-linux platform. Therefore please close this bug.

However, the actual conditions that give rise to this bug are made more likely by the design of the GServer#serve method in the standard library. I will therefore open a separate bug report for the GServer issue - I have a patch which I believe improves the design of GServer

Kind regards

Steve

Actions

Copy link

#5 [ruby-core:48749]

Updated by mame (Yusuke Endoh) almost 13 years ago

Status changed from Feedback to Closed

Actions

Copy link

Also available in: Atom PDF

Like0

Like0Like0Like0Like0Like0

Project

General

Profile

Ruby

Tags

Custom queries

Bug #6358

arm-linux : sleep() time dependent threading bug

Updated by mame (Yusuke Endoh) over 13 years ago

Updated by stevegoobermanhill (stephen gooberman-hill) over 13 years ago

Updated by naruse (Yui NARUSE) over 13 years ago

Updated by stevegoobermanhill (stephen gooberman-hill) over 13 years ago

Updated by mame (Yusuke Endoh) almost 13 years ago