Bug #9525
closedStuck with Socket.pack_sockaddr_in
Description
We met this trouble with Fluentd https://github.com/fluent/fluentd.
Fluentd is sometimes stuck at Socket.pack_sockaddr_in
line on shutdown.
Here is the gist https://gist.github.com/sonots/9047653 to explain details.
Updated by akr (Akira Tanaka) over 10 years ago
- Status changed from Open to Feedback
Socket.pack_sockaddr_in can block with a name resolution.
If you think the name resolution doesn't block, please explain why.
Updated by sonots (Naotoshi Seo) over 10 years ago
Are there any ways to timeout Socket.pack_sockaddr_in?
If so, this problem should be resolved.
EDIT: And, I was pointing an IP address at the concerned line. Will it still block for the name resolution?
Updated by akr (Akira Tanaka) over 10 years ago
I think it is very difficult to add timeout to getaddrinfo() function in C.
Updated by akr (Akira Tanaka) over 10 years ago
I guess getaddrinfo() doesn't block if an IP address is given.
At least, following command on Debian GNU/Linux (jessie) doesn't show anything.
% strace -e socket ruby -rsocket -e 'Socket.pack_sockaddr_in(80, "192.168.1.1")'
It means the command create no socket and no communication to another site.
Updated by sonots (Naotoshi Seo) over 10 years ago
Following command on CentOS release 6.2 x86_64 returned a line
$ strace -e socket /usr/lib64/fluent/ruby/bin/ruby -rsocket -e 'Socket.pack_sockaddr_in(80, "192.168.1.1")'
socket(PF_NETLINK, SOCK_RAW, 0) = 5
/usr/lib64/fluent/ruby/bin/ruby -v #=> ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]
Updated by akr (Akira Tanaka) over 10 years ago
I think that PF_NETLINK is not a socket to communicate to another host.
Updated by naruse (Yui NARUSE) over 10 years ago
- Status changed from Feedback to Third Party's Issue
gdb says __check_pf is guilty.
https://gist.github.com/sonots/9047653#file-gistfile3-txt
(gdb) bt
#0 0x00000032f2ce659d in recvmsg () from /lib64/libc.so.6
#1 0x00000032f2d0c2c5 in make_request () from /lib64/libc.so.6
#2 0x00000032f2d0c6fa in __check_pf () from /lib64/libc.so.6
#3 0x00000032f2ccfb47 in getaddrinfo () from /lib64/libc.so.6
#4 0x00007fb3139d3818 in nogvl_getaddrinfo (arg=<value optimized out>) at raddrinfo.c:161
#5 0x00000000004f600c in rb_thread_blocking_region (func=0x7fb3139d3800 <nogvl_getaddrinfo>, data1=0x7fb0d5dd0140, ubf=<value optimized out>, data2=<value optimized out>) at thread.c:1129
#6 0x00007fb3139d2eaf in rb_getaddrinfo (node=<value optimized out>, service=<value optimized out>, hints=<value optimized out>, res=<value optimized out>) at raddrinfo.c:181
#7 0x00007fb3139d2fcb in rsock_getaddrinfo (host=<value optimized out>, port=22017, hints=0x7fb0d5dd0600, socktype_hack=1) at raddrinfo.c:359
#8 0x00007fb3139d37ed in rsock_addrinfo (host=0, port=140397479001552, socktype=<value optimized out>, flags=659) at raddrinfo.c:379
#9 0x00007fb3139c939a in sock_s_pack_sockaddr_in (self=<value optimized out>, port=140397479001552, host=35158400) at socket.c:1307
By glibc's commits, once __check_pf is always called 6f3914d5a3269c00e70506bd95f816fef6b635ce
(it is fixed at fa3fc0fe5f452d0aa7e435d8f32e992958683819)
The difference between akr's and sonots' seems because of this.
Therefore this is glibc's old bug and RHEL/CentOS's backport issue.
Updated by sonots (Naotoshi Seo) over 10 years ago
Thank you! Let me write a note here.
On my current environment, the glibc version was glibc-2.12-1.47.el6.x86_64.
I checked the latest centos rpm http://vault.centos.org/6.5/os/Source/SPackages/glibc-2.12-1.132.el6.src.rpm,
but it looked RHED/CentOS is not backporting it yet (cf. https://gist.github.com/sonots/9065060)
Updated by akr (Akira Tanaka) over 10 years ago
I think we can accept a workaround if there is a good patch.
Updated by kosaki (Motohiro KOSAKI) over 10 years ago
I checked a glibc code. I don't think fa3fc0fe5f fixed this issue. It is a mere optimization patch.
I'm not sure why kernel's netlink doesn't reply anything. But it may be worth to try upgrade your kernel.
Thanks.
Updated by kosaki (Motohiro KOSAKI) over 10 years ago
Akr-san,
I'm not an expert this area. But I guess we don't need to call getaddrinfo() in this case
because the name was already resolved.
How about bypass to call getaddrinfo when 'host' is given by IP address?
Updated by akr (Akira Tanaka) over 10 years ago
Motohiro KOSAKI wrote:
I'm not an expert this area. But I guess we don't need to call getaddrinfo() in this case
because the name was already resolved.
How about bypass to call getaddrinfo when 'host' is given by IP address?
Yes. It is the workaround I said.
Updated by akr (Akira Tanaka) over 10 years ago
I tried to workaround this issue at r45047.
However I don't have an environment to reproduce the problem.
Would anyone test the problem at latest trunk?
Updated by normalperson (Eric Wong) over 10 years ago
akr@fsij.org wrote:
I tried to workaround this issue at r45047.
However I don't have an environment to reproduce the problem.
I don't know how to reproduce the problem, either. netlink sockets
should not block like that.
r45047 looks correct. A few minor comments:
- backwards for loop for list is confusing to me,
any reason for not reversing list declaration? - why xmalloc + MEMZERO? xcalloc is shorter and generates smaller code
Updated by kosaki (Motohiro KOSAKI) over 10 years ago
- Related to Backport #9538: backport r45047 added
Updated by kosaki (Motohiro KOSAKI) over 10 years ago
- Related to Backport #9536: backport r45047 added
Updated by kosaki (Motohiro KOSAKI) over 10 years ago
- Status changed from Third Party's Issue to Closed
- Backport changed from 1.9.3: UNKNOWN, 2.0.0: UNKNOWN, 2.1: UNKNOWN to 1.9.3: REQUIRED, 2.0.0: REQUIRED, 2.1: REQUIRED
Updated by akr (Akira Tanaka) over 10 years ago
Eric Wong wrote:
- backwards for loop for list is confusing to me,
any reason for not reversing list declaration?
I prefer ordering consistency between "list" and "res".
- why xmalloc + MEMZERO? xcalloc is shorter and generates smaller code
xcalloc is good idea.
Updated by sonots (Naotoshi Seo) over 10 years ago
Akira Tanaka wrote:
Would anyone test the problem at latest trunk?
I had never ran fluentd with ruby 2.1.1, but now I am trying it.
If it works well, I will try ruby-trunk next.
PS. Sorry, I struggled, but could not create a small subset to reproduce this issue so that anyone can check.
What I can tell here is only that this issue would sometimes occur on an environment which is heavy-loaded, and uses many threads, but I am not sure.
Updated by sonots (Naotoshi Seo) over 10 years ago
I applied following consecutive patches for ruby 2.1.1, and tried.
https://github.com/ruby/ruby/commit/948ce9decb97e5ff0833e53a392aa9f1d42c9b0d
https://github.com/ruby/ruby/commit/dd1c3a75096b97c1ebcb8597c001761ddfb3c1bf
https://github.com/ruby/ruby/commit/2e6b97a45d077979121b29484a8831034d47ef50
Before (ruby 2.1.1):
The stuck was able to be reproduced with ruby 2.1.1.
I restarted my entire fluentd cluster 12 times, and it occurred at 4th and 12th, that is, 2 times out of 12.
After (ruby 2.1.1 + patch)
I restarted the fluentd cluster 36 times, and got stuck 3 times, but the stuck point was changed.
See https://gist.github.com/sonots/9392668
I did not see that Fluentd was stuck at Socket.pack_sockaddr_in anymore.
I will continue to work for the new stuck point at Fluentd issue https://github.com/fluent/fluentd/pull/257. I think the problem of stuck at Socket.pack_sockaddr_in was resolved.
Updated by usa (Usaku NAKAMURA) over 10 years ago
memo: related commits: 45046,45047,45063,45087,45150,45151,45152
Updated by nagachika (Tomoyuki Chikanaga) about 10 years ago
Is there any progress?
Updated by nagachika (Tomoyuki Chikanaga) about 10 years ago
I have confirmed that this issue is already fixed.
https://twitter.com/sonots/status/507804708153483264
Add related commits: r45146
Updated by nagachika (Tomoyuki Chikanaga) about 10 years ago
- Backport changed from 1.9.3: REQUIRED, 2.0.0: REQUIRED, 2.1: REQUIRED to 1.9.3: REQUIRED, 2.0.0: REQUIRED, 2.1: DONE
r45046, r45047, r45063, r45087, r45146, r45150, r45151 and r45152 were backported into ruby_2_1
branch at r47415.
Updated by mariakatosvich (maria katosvich) over 8 years ago
- Backport deleted (
1.9.3: REQUIRED, 2.0.0: REQUIRED, 2.1: DONE)
Related issue: https://bugs.ruby-lang.org/issues/9536
This problem is a big issue for Fluentd.
We hope to backport this commit to Latest Rubies.
Updated by nobu (Nobuyoshi Nakada) over 8 years ago
- Backport set to 2.1: DONE