Bug #7584
closedRuby hangs when shutting down an ssl connection in gc finalization
Description
It appears that during garbage collection ruby will sometimes attempt to shutdown an ssl connection that has already been shutdown. At least in certain cases this causes the SSL_shutdown call to block forever.
(gdb) bt
#0 0xb76fa424 in __kernel_vsyscall ()
#1 0xb740d17b in read () from /lib/i386-linux-gnu/tls/i686/nosegneg/libc.so.6
#2 0xb6faabae in ?? () from /lib/i386-linux-gnu/libcrypto.so.1.0.0
#3 0xb6fa8412 in BIO_read () from /lib/i386-linux-gnu/libcrypto.so.1.0.0
#4 0xb70c3157 in ?? () from /lib/i386-linux-gnu/libssl.so.1.0.0
#5 0xb70c42d5 in ?? () from /lib/i386-linux-gnu/libssl.so.1.0.0
#6 0xb70c1371 in ?? () from /lib/i386-linux-gnu/libssl.so.1.0.0
#7 0xb70dc931 in SSL_shutdown () from /lib/i386-linux-gnu/libssl.so.1.0.0
#8 0xb711ba9d in ossl_ssl_shutdown (ssl=0xb21c1f18) at ossl_ssl.c:987
#9 0xb711bb31 in ossl_ssl_free (ssl=0xb21c1f18) at ossl_ssl.c:998
#10 0xb753bba5 in run_final (obj=229474180, objspace=) at gc.c:2932
#11 finalize_list (objspace=0x98805c0, p=0xdad7f84) at gc.c:1946
(gdb) f 9
#9 0xb711bb31 in ossl_ssl_free (ssl=0xb21c1f18) at ossl_ssl.c:998
998 in ossl_ssl.c
(gdb) p (*ssl).shutdown
$8 = 1
I'm not sure exactly what is causing this condition, but it seems like we should be checking if the connection is already shutdown (ssl_get_shutdown()) before trying to shut it down in ossl_ssl_free.
Files
Updated by drbrain (Eric Hodel) almost 12 years ago
- Category set to ext
- Assignee set to MartinBosslet (Martin Bosslet)
- Target version set to 2.0.0
Updated by usa (Usaku NAKAMURA) almost 12 years ago
- Status changed from Open to Assigned
Updated by mame (Yusuke Endoh) over 11 years ago
I'm not sure how significant this is.
Martin, what do you think?
--
Yusuke Endoh mame@tsg.ne.jp
Updated by bpot (Bob Potter) over 11 years ago
For what it's worth, we were able to work around this by explicitly shutting down ssl connections instead of letting it happen during GC.
Updated by mame (Yusuke Endoh) over 11 years ago
- Target version changed from 2.0.0 to 2.1.0
Thank you Bob!
Then I postpone this to 2.1.0.
If there is a good patch available soon, we may backport it to ruby_2_0_0.
--
Yusuke Endoh mame@tsg.ne.jp
Updated by bpot (Bob Potter) over 11 years ago
=begin
Digging into this some more because it cropped up in another project.
I've had some luck reproducing this issue with a small script: https://gist.github.com/bpot/5225920
Sometimes it reproduces easily, sometimes it doesn't. I've been testing with "ruby 1.9.3p374". I haven't tested to see if it happens with the latest trunk.
I've attached a file with some gdb logs of the state of the SSL struct and lsof
output for the process after it hangs. One oddity is that the SSL connection appears to be for FD 11 but lsof doesn't have output for such a file descriptor. You would expect the hanging read to just receive EBADF in this case, right?
I don't know if this is related to the hanging issue but I've noticed that often during GC the underlying TCPSocket object's finalizer will be called before the SSLSocket's finalizer preventing us from closing down SSL connection correctly:
19:28:56.520556 close(13) = 0
19:28:56.520969 write(13, "\25\3\3\0\26\177\n\232\255{9iWfz4\255\240 \276\254!\222\374\10C{", 27) = -1 EBADF (Bad file descriptor)
Should SSLSocket be setting a custom finalizer on the underlying rb_io_t?
=end
Updated by bpot (Bob Potter) over 11 years ago
It looks like it is a race condition where the TCPSocket finalizer closes a fd, that fd is reused by another thread and then we try to do SSL_shutdown on this new connection and bad things happen.
Would it make sense to have the TCPSocket finalizer be a no-op (when it's used for ssl) and have SSLSocket's free operation take care of shutting everything down? Is there a better way to solve this?
Updated by nobu (Nobuyoshi Nakada) over 11 years ago
The order of finalizers is indeterminate.
You'd be able to leave FDs open with IO#autoclose.
Updated by MartinBosslet (Martin Bosslet) over 11 years ago
Bob, could you please check if the patch applied in https://bugs.ruby-lang.org/issues/8240 fixed this? Thanks!
Updated by bpot (Bob Potter) over 11 years ago
Looks like that fixes it! I was unable to reproduce with "ruby 2.1.0dev (2013-04-15 trunk 40308) [x86_64-linux]".
Updated by MartinBosslet (Martin Bosslet) over 11 years ago
- Status changed from Assigned to Closed
Great, thanks for checking! I'll close this, too, then. If anyone still runs into this or a similar issue, please let me know!