Bug #1696
closedhttp downloads are unuseably slow
Description
=begin
I've been using a a bit of code which use open-uri and http urls to download files and have found this is unuseably slow.
Example:
require 'open-uri'
open( )
I did some benchmarking and the download of a 157MB file from the local machine was taking over 20 seconds and using 100% where as wget for the same file only took 0.7seconds.
Digging some more and profiling the code with RubyProf revealed that during the execution of the download over 11,000 threads where being created. This I've tracked down to the net/protocol module and the Net::BufferIO::rbuf_fill method which is using a timeout block to wrap the @io.sysread(1024) call. This is clearly an extremely bad way to do this, totally crippling the performance.
Below are benchmarks and some profiles of this issue and various quick fixes I tested here
== 1.8.6 core ==
user system total real
open 15.664062 6.164062 21.828125 ( 22.810859)
%self total self wait child calls name
14.04 5.27 3.00 2.27 0.00 109426 Class::Thread#start (ruby_runtime:0}
10.97 13.14 2.34 0.00 10.80 109427 Module::Timeout#timeout (/usr/local/lib/ruby/1.8/timeout.rb:52}
6.65 2.52 1.42 1.10 0.00 109426 Thread#kill (ruby_runtime:0}
5.78 21.36 1.23 0.00 20.12 1 Net::BufferedIO#read (/usr/local/lib/ruby/1.8/net/protocol.rb:79}
5.27 14.27 1.12 0.00 13.14 109427 Object#timeout (/usr/local/lib/ruby/1.8/timeout.rb:92}
5.12 1.11 1.09 0.02 0.00 109426 IO#sysread (ruby_runtime:0}
4.75 2.61 1.02 0.00 1.59 109426 OpenURI::Buffer#<< (/usr/local/lib/ruby/1.8/open-uri.rb:297}
3.40 3.62 0.73 0.00 2.89 109426 Net::ReadAdapter#call_block (/usr/local/lib/ruby/1.8/net/protocol.rb:380}
2.52 0.54 0.54 0.00 0.00 109435 String#slice! (ruby_runtime:0}
2.30 1.33 0.49 0.00 0.84 109415 #Class:0x801ac6f30#<< ((eval):1}
2.08 0.50 0.45 0.00 0.05 109427 Class#initialize (ruby_runtime:0}
2.08 0.98 0.45 0.00 0.54 109435 Net::BufferedIO#rbuf_consume (/usr/local/lib/ruby/1.8/net/protocol.rb:137}
2.01 4.05 0.43 0.00 3.62 109426 Net::ReadAdapter#<< (/usr/local/lib/ruby/1.8/net/protocol.rb:371}
1.94 0.41 0.41 0.00 0.00 109416 IO#write (ruby_runtime:0}
1.72 1.03 0.37 0.00 0.66 109435 Class#new (ruby_runtime:0}
1.50 0.32 0.32 0.00 0.00 328286 String#size (ruby_runtime:0}
1.32 14.55 0.28 0.00 14.27 109426 Net::BufferedIO#rbuf_fill (/usr/local/lib/ruby/1.8/net/protocol.rb:131}
1.21 0.84 0.26 0.00 0.58 109417 Kernel#send (ruby_runtime:0}
1.13 0.24 0.24 0.00 0.00 109429 String#<< (ruby_runtime:0}
0.99 0.30 0.21 0.00 0.09 109428 Fixnum#== (ruby_runtime:0}
0.84 0.18 0.18 0.00 0.00 328289 Fixnum#+ (ruby_runtime:0}
0.77 0.16 0.16 0.00 0.00 109427 Class::Class#allocate (ruby_runtime:0}
0.77 0.58 0.16 0.00 0.41 109415 IO#<< (ruby_runtime:0}
0.73 0.28 0.16 0.00 0.12 109475 Hash#[] (ruby_runtime:0}
...
== 1.8.6 core + no timeout block ==
user system total real
open 1.757812 0.765625 2.523438 ( 2.873194)
%self total self wait child calls name
21.62 2.63 2.63 0.00 0.00 109426 IO#sysread (ruby_runtime:0}
15.72 1.91 1.91 0.00 0.00 109429 String#<< (ruby_runtime:0}
13.98 1.70 1.70 0.00 0.00 109416 IO#write (ruby_runtime:0}
9.56 12.16 1.16 0.00 10.99 1 Net::BufferedIO#read (/usr/local/lib/ruby/1.8/net/protocol.rb:79}
7.83 3.73 0.95 0.00 2.77 109426 OpenURI::Buffer#<< (/usr/local/lib/ruby/1.8/open-uri.rb:297}
5.58 4.68 0.68 0.00 4.00 109426 Net::ReadAdapter#call_block (/usr/local/lib/ruby/1.8/net/protocol.rb:385}
3.66 4.99 0.45 0.00 4.55 109426 Net::BufferedIO#rbuf_fill (/usr/local/lib/ruby/1.8/net/protocol.rb:138}
3.34 2.38 0.41 0.00 1.97 109415 #Class:0x801ac82b8#<< ((eval):1}
3.27 0.53 0.40 0.00 0.13 109435 Net::BufferedIO#rbuf_consume (/usr/local/lib/ruby/1.8/net/protocol.rb:142}
2.95 5.04 0.36 0.00 4.68 109426 Net::ReadAdapter#<< (/usr/local/lib/ruby/1.8/net/protocol.rb:376}
2.25 0.27 0.27 0.00 0.00 328289 Fixnum#+ (ruby_runtime:0}
1.54 0.19 0.19 0.00 0.00 328286 String#size (ruby_runtime:0}
1.28 0.27 0.16 0.00 0.12 109475 Hash#[] (ruby_runtime:0}
1.22 0.15 0.15 0.00 0.00 109431 Module#=== (ruby_runtime:0}
1.22 1.85 0.15 0.00 1.70 109415 IO#<< (ruby_runtime:0}
1.09 0.13 0.13 0.00 0.00 109435 String#slice! (ruby_runtime:0}
0.96 1.97 0.12 0.00 1.85 109417 Kernel#send (ruby_runtime:0}
0.96 0.12 0.12 0.00 0.00 109426 String#length (ruby_runtime:0}
0.96 0.12 0.12 0.00 0.00 109435 Hash#default (ruby_runtime:0}
0.71 0.09 0.09 0.00 0.00 109439 Fixnum#< (ruby_runtime:0}
...
== 1.8.6 core + no timeout block + 10K sysread ==
user system total real
open 0.570312 1.406250 1.976562 ( 2.521614)
%self total self wait child calls name
27.32 0.39 0.39 0.00 0.00 10953 String#slice! (ruby_runtime:0}
18.58 0.27 0.27 0.00 0.00 10943 IO#write (ruby_runtime:0}
12.02 0.17 0.17 0.00 0.00 10944 IO#sysread (ruby_runtime:0}
8.74 1.42 0.12 0.00 1.30 1 Net::BufferedIO#read (/usr/local/lib/ruby/1.8/net/protocol.rb:79}
7.10 0.50 0.10 0.00 0.40 10944 OpenURI::Buffer#<< (/usr/local/lib/ruby/1.8/open-uri.rb:297}
5.46 0.37 0.08 0.00 0.29 10942 #Class:0x801ac8b28#<< ((eval):1}
3.28 0.05 0.05 0.00 0.00 32840 String#size (ruby_runtime:0}
2.73 0.55 0.04 0.00 0.52 10944 Net::ReadAdapter#call_block (/usr/local/lib/ruby/1.8/net/protocol.rb:385}
2.73 0.04 0.04 0.00 0.00 32843 Fixnum#+ (ruby_runtime:0}
2.73 0.59 0.04 0.00 0.55 10944 Net::ReadAdapter#<< (/usr/local/lib/ruby/1.8/net/protocol.rb:376}
1.64 0.21 0.02 0.00 0.19 10944 Net::BufferedIO#rbuf_fill (/usr/local/lib/ruby/1.8/net/protocol.rb:138}
1.64 0.41 0.02 0.00 0.39 10953 Net::BufferedIO#rbuf_consume (/usr/local/lib/ruby/1.8/net/protocol.rb:142}
1.09 0.02 0.02 0.00 0.00 10944 String#length (ruby_runtime:0}
1.09 0.02 0.02 0.00 0.00 10947 String#<< (ruby_runtime:0}
1.09 0.02 0.02 0.00 0.00 10953 Hash#default (ruby_runtime:0}
1.09 0.28 0.02 0.00 0.27 10942 IO#<< (ruby_runtime:0}
...
== 1.8.6 core + no timeout block + 100K sysread ==
user system total real
open 0.390625 0.484375 0.875000 ( 1.013516)
%self total self wait child calls name
45.45 0.43 0.43 0.00 0.00 1098 IO#sysread (ruby_runtime:0}
33.88 0.32 0.32 0.00 0.00 1097 IO#write (ruby_runtime:0}
8.26 0.08 0.08 0.00 0.00 1107 String#slice! (ruby_runtime:0}
1.65 0.45 0.02 0.00 0.44 1098 Net::BufferedIO#rbuf_fill (/usr/local/lib/ruby/1.8/net/protocol.rb:138}
1.65 0.02 0.02 0.00 0.00 3302 String#size (ruby_runtime:0}
1.65 0.35 0.02 0.00 0.34 1098 OpenURI::Buffer#<< (/usr/local/lib/ruby/1.8/open-uri.rb:297}
1.65 0.93 0.02 0.00 0.91 1 Net::BufferedIO#read (/usr/local/lib/ruby/1.8/net/protocol.rb:79}
0.83 0.01 0.01 0.00 0.00 1 URI::HTTP#initialize (/usr/local/lib/ruby/1.8/uri/http.rb:77}
0.83 0.09 0.01 0.00 0.08 1107 Net::BufferedIO#rbuf_consume (/usr/local/lib/ruby/1.8/net/protocol.rb:142}
0.83 0.01 0.01 0.00 0.00 357 Module#method_added (ruby_runtime:0}
0.83 0.01 0.01 0.00 0.00 37 Class::Dir#[] (ruby_runtime:0}
0.83 0.01 0.01 0.00 0.00 1101 String#<< (ruby_runtime:0}
0.83 0.36 0.01 0.00 0.35 1098 Net::ReadAdapter#call_block (/usr/local/lib/ruby/1.8/net/protocol.rb:385}
0.83 0.33 0.01 0.00 0.32 1096 IO#<< (ruby_runtime:0}
...
== 1.8.6 core + no timeout block + 1M sysread ==
user system total real
open 0.445312 0.335938 0.781250 ( 0.913736)
%self total self wait child calls name
29.66 0.27 0.27 0.00 0.00 157 IO#sysread (ruby_runtime:0}
28.81 0.27 0.27 0.00 0.00 166 String#slice! (ruby_runtime:0}
28.81 0.27 0.27 0.00 0.00 156 IO#write (ruby_runtime:0}
7.63 0.07 0.07 0.00 0.00 160 String#<< (ruby_runtime:0}
1.69 0.30 0.02 0.00 0.28 157 Net::ReadAdapter#call_block (/usr/local/lib/ruby/1.8/net/protocol.rb:385}
0.85 0.01 0.01 0.00 0.00 37 Class::Dir#[] (ruby_runtime:0}
0.85 0.01 0.01 0.00 0.00 98 Class#inherited (ruby_runtime:0}
0.85 0.02 0.01 0.00 0.02 2 Kernel#gem_original_require (ruby_runtime:0}
0.85 0.28 0.01 0.00 0.27 157 OpenURI::Buffer#<< (/usr/local/lib/ruby/1.8/open-uri.rb:297}
...
== 1.8.6 core + no timeout block + 1M sysread + slice bypass ==
user system total real
open 0.093750 0.523438 0.617188 ( 0.717748)
%self total self wait child calls name
66.67 0.23 0.23 0.00 0.00 52 IO#write (ruby_runtime:0}
13.33 0.05 0.05 0.00 0.00 52 IO#sysread (ruby_runtime:0}
4.44 0.02 0.02 0.00 0.00 55 String#<< (ruby_runtime:0}
4.44 0.02 0.02 0.00 0.00 1 TCPSocket#initialize (ruby_runtime:0}
2.22 0.01 0.01 0.00 0.00 101 Hash#[] (ruby_runtime:0}
2.22 0.01 0.01 0.00 0.00 10 String#slice! (ruby_runtime:0}
2.22 0.02 0.01 0.00 0.01 2 Kernel#gem_original_require (ruby_runtime:0}
2.22 0.25 0.01 0.00 0.24 52 OpenURI::Buffer#<< (/usr/local/lib/ruby/1.8/open-uri.rb:297}
2.22 0.01 0.01 0.00 0.00 37 Class::Dir#[] (ruby_runtime:0}
...
Note: The change does also raises questions about the core slice method not performing as well as it might for the no-op case.
== 1.8.6 core + select timeout + 1M sysread + slice bypass ==
user system total real
open 0.070312 0.226562 0.296875 ( 0.898285)
%self total self wait child calls name
64.10 0.20 0.20 0.00 0.00 49 IO#write (ruby_runtime:0}
15.38 0.05 0.05 0.00 0.00 49 IO#sysread (ruby_runtime:0}
7.69 0.02 0.02 0.00 0.00 52 String#<< (ruby_runtime:0}
5.13 0.23 0.02 0.00 0.21 49 OpenURI::Buffer#<< (/usr/local/lib/ruby/1.8/open-uri.rb:297}
2.56 0.02 0.01 0.00 0.02 2 Kernel#gem_original_require (ruby_runtime:0}
2.56 0.01 0.01 0.00 0.00 37 Class::Dir#[] (ruby_runtime:0}
2.56 0.02 0.01 0.00 0.01 3 Array#each (ruby_runtime:0}
...
So overall these changes improve the speed from 22seconds to 0.7seconds in my test case making ruby comparable with wget's performance.
The patch for this last run is attached and should be compatible with existing functionality through the use of select + raise Timeout::Error instead of timeout( @read_timeout ) { .. }
Hope this helps, any questions let me know
=end
Files
Updated by steveh (Steven Hartland) over 15 years ago
=begin
The patch description above should read: patch for select timeout + 1M sysread + slice bypass
=end
Updated by steveh (Steven Hartland) over 15 years ago
- File protocol.rb.patch protocol.rb.patch added
=begin
Silly bug in original patch, fixed version here, results still consistent with original.
It may be interesting to note that this also brings net/http speed inline with that of libcurl ruby gems
open.core 15.296875 1.437500 16.734375 ( 22.416853)
open.patched 0.093750 0.218750 0.312500 ( 0.340683)
taf2-curb 0.171875 0.273438 0.445312 ( 0.546568)
=end
Updated by oldmoe (Muhammad Ali) over 15 years ago
=begin
Net/HTTP in 1.9.2dev is already working as you described with two
exceptions:
1 - It uses nonblocking I/O
2 - It uses a 16KB buffer
The 1MB buffer you are using is too large for several reasons:
1 - You use blocking I/O calls, this works great if you are downloading from
the local machine. OTOH if the server you are connecting to is slow,
attempting to collect such a large buffer in a blocking manner is guaranteed
to cause serious delays.
2 - If multiple threads are downloading data together (which is normal for
crawlers) you will be allocating a lot of memory at once this way (mutliple
strings up to 1MB each). This will propably tax the GC.
You might want to upgrade to 1.8.7 (which implements read_nonblock) and
modify your patch for a non blocking behavior
oldmoe
On Sat, Jun 27, 2009 at 9:47 PM, Steven Hartland redmine@ruby-lang.orgwrote:
Issue #1696 has been updated by Steven Hartland.
File protocol.rb.patch added
Silly bug in original patch, fixed version here, results still consistent
with original.It may be interesting to note that this also brings net/http speed inline
with that of libcurl ruby gems
open.core 15.296875 1.437500 16.734375 ( 22.416853)
open.patched 0.093750 0.218750 0.312500 ( 0.340683)
taf2-curb 0.171875 0.273438 0.445312 ( 0.546568)
http://redmine.ruby-lang.org/issues/show/1696
File protocol.rb.patch added
Silly bug in original patch, fixed version here, results still consistent with original.
It may be interesting to note that this also brings net/http speed inline with that of libcurl ruby gems
open.core 15.296875 1.437500 16.734375 ( 22.416853)
open.patched 0.093750 0.218750 0.312500 ( 0.340683)
taf2-curb 0.171875 0.273438 0.445312 ( 0.546568)
----------------------------------------
----------------------------------------
=end
Updated by oldmoe (Muhammad Ali) over 15 years ago
=begin
Not sure really, never tested it.
On Sun, Jun 28, 2009 at 2:29 AM, Roger Pack rogerdpack@gmail.com wrote:
Net/HTTP in 1.9.2dev is already working as you described with two
exceptions:
1 - It uses nonblocking I/O
2 - It uses a 16KB bufferdoes 1.9.2dev work as quickly then?
> Net/HTTP in 1.9.2dev is already working as you described with twodoes 1.9.2dev work as quickly then?
> exceptions:
> 1 - It uses nonblocking I/O
> 2 - It uses a 16KB buffer
=end
Updated by rue (Eero Saynatkari) over 15 years ago
=begin
Excerpts from rogerdpack's message of Sun Jun 28 02:29:12 +0300 2009:
Net/HTTP in 1.9.2dev is already working as you described with two
exceptions:
1 - It uses nonblocking I/O
2 - It uses a 16KB bufferdoes 1.9.2dev work as quickly then?
Whether it does or not, at least decreasing the buffer size
is absolutely necessary. The performance of a HTTP download
cannot be tuned based on a local, large file.
Eero¶
Magic is insufficiently advanced technology.
=end
Updated by steveh (Steven Hartland) over 15 years ago
=begin
As you can see from the test results a 16Kb buffer is way to small for high bandwidth connection, so as Eero mentioned this definitely needs to be documenting. N.B. I assume this has been moved to either a class variable or other accessible option.
None blocking support should be trivial, simply setting the flag on the socket should be sufficient, but with the current patch select will be achieving this anyway assuming sysread is actually using recv under the covers and returns with what ever data is available instead of waiting for size data. This is not clear from the docs as sysread, readpartial and read_nonblock contain conflicting info and in places some English which just doesn't make sense. If not just changing that one line should be sufficient.
Having a quick look at the current nightly snapshot the code:
- It may not be ordered in the best way for performance especially on slower connections. I would suggest testing if inverting the order and allowing select to do its work before doing a none blocking read, or in fact a standard recv ( no need for none blocking, as its already guaranteed to work due to the select check ). Careful benchmarking of differing speed connections would be needed to confirm which is better.
- slice is still in place, which could also still be causing an issue unless the underlying slice implementation has been fixed for the no-op case.
=end
Updated by oldmoe (Muhammad Ali) over 15 years ago
=begin
I believe if the descriptor is set to non blocking then sysread will return
whatever data available or raise an error if there isn't any.
But non block is critical for large buffer sizes as select will return the
socket as ready even if it has only one byte to read.
I would also recommend the nonblocking call happens before the select. This
way fast connections will not be taxed by two system calls, while slow
connections are slow anyway so they can wait for the Errno::EAGAIN exception
to be handled
oldmoe
On Sun, Jun 28, 2009 at 9:26 PM, Steven Hartland redmine@ruby-lang.orgwrote:
Issue #1696 has been updated by Steven Hartland.
As you can see from the test results a 16Kb buffer is way to small for high
bandwidth connection, so as Eero mentioned this definitely needs to be
documenting. N.B. I assume this has been moved to either a class variable or
other accessible option.None blocking support should be trivial, simply setting the flag on the
socket should be sufficient, but with the current patch select will be
achieving this anyway assuming sysread is actually using recv under the
covers and returns with what ever data is available instead of waiting for
size data. This is not clear from the docs as sysread, readpartial and
read_nonblock contain conflicting info and in places some English which just
doesn't make sense. If not just changing that one line should be sufficient.Having a quick look at the current nightly snapshot the code:
- It may not be ordered in the best way for performance especially on
slower connections. I would suggest testing if inverting the order and
allowing select to do its work before doing a none blocking read, or in fact
a standard recv ( no need for none blocking, as its already guaranteed to
work due to the select check ). Careful benchmarking of differing speed
connections would be needed to confirm which is better.- slice is still in place, which could also still be causing an issue
unless the underlying slice implementation has been fixed for the no-op
case.
http://redmine.ruby-lang.org/issues/show/1696
As you can see from the test results a 16Kb buffer is way to small for high bandwidth connection, so as Eero mentioned this definitely needs to be documenting. N.B. I assume this has been moved to either a class variable or other accessible option.
None blocking support should be trivial, simply setting the flag on the socket should be sufficient, but with the current patch select will be achieving this anyway assuming sysread is actually using recv under the covers and returns with what ever data is available instead of waiting for size data. This is not clear from the docs as sysread, readpartial and read_nonblock contain conflicting info and in places some English which just doesn't make sense. If not just changing that one line should be sufficient.
Having a quick look at the current nightly snapshot the code:
1. It may not be ordered in the best way for performance especially on slower connections. I would suggest testing if inverting the order and allowing select to do its work before doing a none blocking read, or in fact a standard recv ( no need for none blocking, as its already guaranteed to work due to the select check ). Careful benchmarking of differing speed connections would be needed to confirm which is better.
- slice is still in place, which could also still be causing an issue unless the underlying slice implementation has been fixed for the no-op case.
----------------------------------------
http://redmine.ruby-lang.org/issues/show/1696
----------------------------------------
http://redmine.ruby-lang.org
=end
Updated by steveh (Steven Hartland) over 15 years ago
=begin
But non block is critical for large buffer sizes as select will return the
socket as ready even if it has only one byte to read.
This makes no difference to the socket level recv, even if the socket isn't
set as none blocking, it will always return what it has. It doesn't wait till
the data size matches size of the buffer passed, so again it depends on how the
lower level methods actually deal with this; which is not clear from the docs.
I would also recommend the nonblocking call happens before the select. This
way fast connections will not be taxed by two system calls, while slow
connections are slow anyway so they can wait for the Errno::EAGAIN exception
to be handled.
This really should be profiled instead of making assumptions. I say that as I
suspect exception handling is actually quite a bit heavier than a select call,
so if its used in any significant amount, it could cause noticeably higher CPU
load. It could go either way, but its always better to know and document why a
design decision was made than to make just make it on a supposition.
A very quick test of 10 requests for a 157MB from a high speed local connection
shows:
user system total real
open.none-blocking 5.195312 4.023438 9.218750 ( 11.924775)
open.blocking-patched 2.312500 4.562500 6.875000 ( 9.244287)
N.B. The none-blocking test was done using 1.8 install just replacing protocol.rb
with the nightly snapshot, increasing the buffer to 1MB and removing the rescue
type for compatibility.
=end
Updated by rue (Eero Saynatkari) over 15 years ago
=begin
Excerpts from Tanaka Akira's message of Mon Jun 29 21:17:58 +0300 2009:
In article 4a464441bf3f7_13bd3907d016634@redmine.ruby-lang.org,
Steven Hartland redmine@ruby-lang.org writes:This I've tracked down to the net/protocol module and the Net::BufferIO::rbuf_fill method which is using a timeout block to wrap the @io.sysread(1024) call. This is clearly an extremely bad way to do this, totally crippling the performance.
There is a reson to use timeout(). It is used to fix
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=265429Before the fix, IO.select was used in rbuf_fill.
On the face of it, using timeout does not seem like the
optimal fix to the problem...speaking of, can anyone
reproduce this issue currently, or has it gotten fixed
otherwise in the 5 intervening years?
Eero¶
Magic is insufficiently advanced technology.
=end
Updated by rue (Eero Saynatkari) over 15 years ago
=begin
Excerpts from James Gray's message of Tue Jun 30 00:12:23 +0300 2009:
Well, this definitely sounds like the kind of thing I am seeing.
Which system are you on, and do you have some reasonably
small test case? Presumably requires recompiling without
the timeout?
Eero¶
Magic is insufficiently advanced technology.
=end
Updated by ysh (Young Hyun) over 15 years ago
=begin
On Jun 29, 2009, at 2:12 PM, James Gray wrote:
On Jun 29, 2009, at 1:38 PM, Eero Saynatkari wrote:
Excerpts from Tanaka Akira's message of Mon Jun 29 21:17:58 +0300
2009:In article 4a464441bf3f7_13bd3907d016634@redmine.ruby-lang.org,
Steven Hartland redmine@ruby-lang.org writes:This I've tracked down to the net/protocol module and the
Net::BufferIO::rbuf_fill method which is using a timeout block to
wrap the @io.sysread(1024) call. This is clearly an extremely bad
way to do this, totally crippling the performance.There is a reson to use timeout(). It is used to fix
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=265429Before the fix, IO.select was used in rbuf_fill.
On the face of it, using timeout does not seem like the
optimal fix to the problem...speaking of, can anyone
reproduce this issue currently, or has it gotten fixed
otherwise in the 5 intervening years?Well, this definitely sounds like the kind of thing I am seeing.
For what it's worth, I've been fighting a similar problem for a while
now, trying Ruby 1.8.5, 1.8.6, and 1.8.7. The circumstances are
similar: TCP connection with SSL, Ruby hanging on sysread, and very
rare failure (once every 3 months with persistent connections).
One possibility is that IO.select is returning a false readiness
notification, so a subsequent sysread hangs. Linux manpages note that
select(2) "may report a socket file descriptor as 'ready for reading',
while nevertheless a subsequent read blocks" [1], and the recommended
solution is to work with nonblocking sockets. (Incidentally, my
failures were under FreeBSD.) However, until Ruby 1.9, SSL sockets
didn't support nonblocking I/O, so we were always susceptible to this
problem; we just lucked out because of the rarity of this failure mode.
Another possibility, and the one I have some suspicion of, is that
there's some pathological interaction with SSL; that is, IO.select
indicates readiness (and there actually are some bytes available to
read), the client invokes SSLSocket#sysread, and the SSL library does
some low-level SSL re-negotiations (which according to the SSL
protocol can happen at any time) and then blocks waiting on actual
user-level data. Or instead of a low-level SSL re-negotiation,
perhaps the connection with the peer is lost (say, the remote box
loses power so that there's no notification of connection loss) in an
inopportune moment in SSLSocket#sysread (for example, in the middle of
reading an SSL protocol message), and SSLSocket#sysread hangs since it
has no way of detecting the connection loss (assuming SSLSocket isn't
using keepalives).
Anyway, this is all speculation, but I've had multiple failures over
the past 18 months, and a perusal of the Ruby core dump shows the
exact same pattern of a hang on select(2) called underneath in the
implementation of rb_io_wait_readable() and similar internal functions
(I'm not talking about a hang in select(2) as a direct result of a
user-level code invoking IO.select()).
I'm going to try using the nonblocking SSL operations under Ruby 1.9
to see whether it prevents my particular problem. In general, though,
it seems safest to always use nonblocking sockets with Ruby if you're
going to use IO.select.
--Young
[1] http://linux.die.net/man/2/select
=end
Updated by jeremyevans0 (Jeremy Evans) about 5 years ago
- Project changed from Backport186 to Ruby master
- Description updated (diff)
- Status changed from Open to Closed
- Backport set to 2.5: UNKNOWN, 2.6: UNKNOWN