Bug #9525: Stuck with Socket.pack_sockaddr_in - Ruby - Ruby Issue Tracking System

Bug #9525

closed

Stuck with Socket.pack_sockaddr_in

Added by sonots (Naotoshi Seo) over 11 years ago. Updated almost 9 years ago.

Status: Closed
Assignee:
Target version:
ruby -v: 1.9.3p194
Backport: 2.1: DONE

[ruby-core:60801]

Description

We ran into this problem with Fluentd (https://github.com/fluent/fluentd).

Fluentd sometimes gets stuck at the Socket.pack_sockaddr_in line on shutdown.
This gist explains the details: https://gist.github.com/sonots/9047653

#1 [ruby-core:60802]

Updated by akr (Akira Tanaka) over 11 years ago

Status changed from Open to Feedback

Socket.pack_sockaddr_in can block during name resolution.
If you think name resolution doesn't block here, please explain why.
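
For readers who want to rule out a DNS lookup at the Ruby level, here is a minimal sketch using the standard AI_NUMERICHOST flag, which tells the resolver to accept only numeric addresses. The helper name `numeric_addrinfo` is hypothetical and not part of this report. The sketch still calls getaddrinfo() underneath, so it does not avoid anything the C library does internally before parsing the address.

```ruby
require "socket"

# Hypothetical helper (not from the report): accept `host` only if it is
# already a numeric address, so no DNS query is attempted.
def numeric_addrinfo(host, port)
  Addrinfo.getaddrinfo(host, port, nil, :STREAM, nil, Socket::AI_NUMERICHOST).first
end

ai = numeric_addrinfo("192.168.1.1", 80)
puts ai.ip_address
puts ai.ip_port

begin
  # A hostname is rejected instead of being looked up.
  numeric_addrinfo("example.invalid", 80)
rescue SocketError => e
  puts "refused to resolve a hostname: #{e.class}"
end
```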

#2 [ruby-core:60803]

Updated by sonots (Naotoshi Seo) over 11 years ago

Is there any way to time out Socket.pack_sockaddr_in?
If so, that would resolve this problem.

EDIT: Also, I was passing an IP address at the line in question. Will it still block on name resolution?

#3 [ruby-core:60804]

Updated by akr (Akira Tanaka) over 11 years ago

I think it is very difficult to add a timeout to the getaddrinfo() function in C.
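
As a caller-side illustration only, the sketch below bounds how long the caller waits by running the call in a worker thread and using Thread#join with a limit. The method name `pack_sockaddr_in_with_deadline` is hypothetical. This does not cancel the underlying call: if the worker is blocked inside the C library it stays blocked, and only the caller moves on.

```ruby
require "socket"

# Hypothetical helper (not from the report): wait at most `seconds` for
# Socket.pack_sockaddr_in. Returns the packed sockaddr, or nil on timeout.
# The worker thread is NOT cancelled; it may remain blocked in the background.
def pack_sockaddr_in_with_deadline(port, host, seconds)
  worker = Thread.new { Socket.pack_sockaddr_in(port, host) }
  # Thread#join(limit) returns nil if the thread has not finished in time.
  worker.join(seconds) ? worker.value : nil
end

packed = pack_sockaddr_in_with_deadline(80, "127.0.0.1", 5)
puts packed.nil? ? "timed out" : "packed #{packed.bytesize} bytes"
```

A leaked worker thread is the price of this approach, so it suits a shutdown path better than a hot loop.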

#4 [ruby-core:60805]

Updated by akr (Akira Tanaka) over 11 years ago

I guess getaddrinfo() doesn't block if an IP address is given.

At least, the following command on Debian GNU/Linux (jessie) doesn't show anything.

% strace -e socket ruby -rsocket -e 'Socket.pack_sockaddr_in(80, "192.168.1.1")'

It means the command creates no socket and does not communicate with another host.
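
For reference, a small sketch of what the call in the command above does with a numeric address: it packs the port and address into a sockaddr string, and Socket.unpack_sockaddr_in reverses that. The values are the ones from the command line.

```ruby
require "socket"

# Pack a port and a numeric IPv4 address into a sockaddr_in string.
packed = Socket.pack_sockaddr_in(80, "192.168.1.1")

# Unpacking returns the port and the address as [port, ip_address].
port, host = Socket.unpack_sockaddr_in(packed)
puts "#{host}:#{port}"
```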

#5 [ruby-core:60806]

Updated by sonots (Naotoshi Seo) over 11 years ago

The following command on CentOS release 6.2 x86_64 printed one line:

$ strace -e socket /usr/lib64/fluent/ruby/bin/ruby -rsocket -e 'Socket.pack_sockaddr_in(80, "192.168.1.1")'
socket(PF_NETLINK, SOCK_RAW, 0)         = 5

/usr/lib64/fluent/ruby/bin/ruby -v #=> ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]

#6 [ruby-core:60807]

Updated by akr (Akira Tanaka) over 11 years ago

I think a PF_NETLINK socket is not a socket for communicating with another host.

#7 [ruby-core:60809]

Updated by naruse (Yui NARUSE) over 11 years ago

Status changed from Feedback to Third Party's Issue

gdb says __check_pf is guilty.
https://gist.github.com/sonots/9047653#file-gistfile3-txt

(gdb) bt
#0  0x00000032f2ce659d in recvmsg () from /lib64/libc.so.6
#1  0x00000032f2d0c2c5 in make_request () from /lib64/libc.so.6
#2  0x00000032f2d0c6fa in __check_pf () from /lib64/libc.so.6
#3  0x00000032f2ccfb47 in getaddrinfo () from /lib64/libc.so.6
#4  0x00007fb3139d3818 in nogvl_getaddrinfo (arg=<value optimized out>) at raddrinfo.c:161
#5  0x00000000004f600c in rb_thread_blocking_region (func=0x7fb3139d3800 <nogvl_getaddrinfo>, data1=0x7fb0d5dd0140, ubf=<value optimized out>, data2=<value optimized out>) at thread.c:1129
#6  0x00007fb3139d2eaf in rb_getaddrinfo (node=<value optimized out>, service=<value optimized out>, hints=<value optimized out>, res=<value optimized out>) at raddrinfo.c:181
#7  0x00007fb3139d2fcb in rsock_getaddrinfo (host=<value optimized out>, port=22017, hints=0x7fb0d5dd0600, socktype_hack=1) at raddrinfo.c:359
#8  0x00007fb3139d37ed in rsock_addrinfo (host=0, port=140397479001552, socktype=<value optimized out>, flags=659) at raddrinfo.c:379
#9  0x00007fb3139c939a in sock_s_pack_sockaddr_in (self=<value optimized out>, port=140397479001552, host=35158400) at socket.c:1307

According to glibc's commit history, __check_pf was at one point always called (6f3914d5a3269c00e70506bd95f816fef6b635ce);
this was fixed in fa3fc0fe5f452d0aa7e435d8f32e992958683819.

The difference between akr's and sonots' results seems to come from this.

Therefore this is an old glibc bug and a RHEL/CentOS backport issue.

#8 [ruby-core:60810]

Updated by sonots (Naotoshi Seo) over 11 years ago

Thank you! Let me write a note here.

In my current environment, the glibc version is glibc-2.12-1.47.el6.x86_64.

I checked the latest CentOS source RPM, http://vault.centos.org/6.5/os/Source/SPackages/glibc-2.12-1.132.el6.src.rpm,
but it looks like RHEL/CentOS has not backported the fix yet (cf. https://gist.github.com/sonots/9065060)

#9 [ruby-core:60829]

Updated by akr (Akira Tanaka) over 11 years ago

I think we can accept a workaround if there is a good patch.

#10 [ruby-core:60830]

Updated by kosaki (Motohiro KOSAKI) over 11 years ago

I checked the glibc code. I don't think fa3fc0fe5f fixed this issue. It is merely an optimization patch.
I'm not sure why the kernel's netlink doesn't reply with anything, but it may be worth trying a kernel upgrade.

Thanks.

#11 [ruby-core:60831]

Updated by kosaki (Motohiro KOSAKI) over 11 years ago

Akr-san,

I'm not an expert in this area, but I guess we don't need to call getaddrinfo() in this case
because the name is already resolved.
How about skipping the getaddrinfo() call when 'host' is given as an IP address?
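
To illustrate the idea at the Ruby level: a numeric address can be recognized and converted to raw bytes locally, with no resolver involved. The sketch below uses the standard IPAddr class. The helper name `parse_numeric_host` is hypothetical, and this is not the C patch discussed in this thread, only a model of the check it relies on.

```ruby
require "ipaddr"

# Hypothetical helper (not from the report): return the address bytes in
# network byte order if `host` is a numeric IPv4/IPv6 literal, else nil.
# Note: IPAddr also accepts prefix notation such as "10.0.0.0/8", which a
# real sockaddr packer would have to reject separately.
def parse_numeric_host(host)
  IPAddr.new(host).hton
rescue IPAddr::Error
  nil
end

%w[192.168.1.1 ::1 example.com].each do |host|
  raw = parse_numeric_host(host)
  if raw
    puts "#{host}: numeric, #{raw.bytesize} address bytes, no lookup needed"
  else
    puts "#{host}: not numeric, would need name resolution"
  end
end
```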

#12 [ruby-core:60846]

Updated by akr (Akira Tanaka) over 11 years ago

Motohiro KOSAKI wrote:

I'm not an expert in this area, but I guess we don't need to call getaddrinfo() in this case
because the name is already resolved.
How about skipping the getaddrinfo() call when 'host' is given as an IP address?

Yes. That is the workaround I meant.

#13 [ruby-core:60890]

Updated by akr (Akira Tanaka) over 11 years ago

I tried to work around this issue in r45047.
However, I don't have an environment to reproduce the problem.

Would anyone test this on the latest trunk?

#14 [ruby-core:60891]

Updated by normalperson (Eric Wong) over 11 years ago

akr@fsij.org wrote:

I tried to work around this issue in r45047.
However, I don't have an environment to reproduce the problem.

I don't know how to reproduce the problem, either. netlink sockets
should not block like that.

r45047 looks correct. A few minor comments:

backwards for loop for list is confusing to me,
any reason for not reversing list declaration?

why xmalloc + MEMZERO? xcalloc is shorter and generates smaller code

#17 [ruby-core:60898]

Updated by kosaki (Motohiro KOSAKI) over 11 years ago

Status changed from Third Party's Issue to Closed
Backport changed from 1.9.3: UNKNOWN, 2.0.0: UNKNOWN, 2.1: UNKNOWN to 1.9.3: REQUIRED, 2.0.0: REQUIRED, 2.1: REQUIRED

#18 [ruby-core:60907]

Updated by akr (Akira Tanaka) over 11 years ago

Eric Wong wrote:

backwards for loop for list is confusing to me,
any reason for not reversing list declaration?

I prefer ordering consistency between "list" and "res".

why xmalloc + MEMZERO? xcalloc is shorter and generates smaller code

xcalloc is a good idea.

#19 [ruby-core:61276]

Updated by sonots (Naotoshi Seo) over 11 years ago

Akira Tanaka wrote:

Would anyone test this on the latest trunk?

I have never run fluentd with ruby 2.1.1, but I am trying it now.
If it works well, I will try ruby-trunk next.

PS. Sorry, I struggled but could not create a small test case that reproduces this issue so that anyone can check it.
All I can say is that this issue seems to occur occasionally in an environment that is heavily loaded and uses many threads, but I am not sure.
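
As a starting point for anyone attempting a reproduction, here is a minimal stress sketch that calls Socket.pack_sockaddr_in from many threads at once. The method name, thread count, iteration count, address, and port are all assumptions chosen for illustration, and this script is not known to reproduce the hang.

```ruby
require "socket"

# Hypothetical stress helper (not from the report): call
# Socket.pack_sockaddr_in concurrently and return the number of calls made.
def hammer_pack_sockaddr(threads: 8, iterations: 1_000, host: "127.0.0.1", port: 24224)
  workers = Array.new(threads) do
    Thread.new do
      iterations.times { Socket.pack_sockaddr_in(port, host) }
      iterations # value reported by this worker
    end
  end
  # Thread#value joins each worker and returns its result.
  workers.sum(&:value)
end

puts "completed #{hammer_pack_sockaddr} calls"
```

If the process stops making progress, attaching gdb to it and printing a backtrace, as was done earlier in this thread, shows where it is blocked.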

#20 [ruby-core:61334]

Updated by sonots (Naotoshi Seo) over 11 years ago

I applied the following consecutive patches to ruby 2.1.1 and tested it.

https://github.com/ruby/ruby/commit/948ce9decb97e5ff0833e53a392aa9f1d42c9b0d
https://github.com/ruby/ruby/commit/dd1c3a75096b97c1ebcb8597c001761ddfb3c1bf
https://github.com/ruby/ruby/commit/2e6b97a45d077979121b29484a8831034d47ef50

Before (ruby 2.1.1):

The hang could be reproduced with ruby 2.1.1.
I restarted my entire fluentd cluster 12 times, and it occurred on the 4th and 12th restarts, that is, 2 times out of 12.

After (ruby 2.1.1 + patch):

I restarted the fluentd cluster 36 times, and it got stuck 3 times, but the stuck point had changed.
See https://gist.github.com/sonots/9392668
I no longer saw Fluentd stuck at Socket.pack_sockaddr_in.

I will continue investigating the new stuck point in the Fluentd issue https://github.com/fluent/fluentd/pull/257. I think the problem of getting stuck at Socket.pack_sockaddr_in is resolved.

#21 [ruby-core:62241]