Project

General

Profile

Actions

Bug #9525

closed

Stuck with Socket.pack_sockaddr_in

Added by sonots (Naotoshi Seo) over 10 years ago. Updated over 8 years ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
1.9.3p194
Backport:
[ruby-core:60801]

Description

We met this trouble with Fluentd https://github.com/fluent/fluentd.

Fluentd is sometimes stuck at Socket.pack_sockaddr_in line on shutdown.
Here is the gist https://gist.github.com/sonots/9047653 to explain details.


Related issues 2 (0 open2 closed)

Related to Backport193 - Backport #9538: backport r45047Closed02/20/2014Actions
Related to Backport21 - Backport #9536: backport r45047ClosedActions

Updated by akr (Akira Tanaka) over 10 years ago

  • Status changed from Open to Feedback

Socket.pack_sockaddr_in can block with a name resolution.
If you think the name resolution doesn't block, please explain why.

Updated by sonots (Naotoshi Seo) over 10 years ago

Are there any ways to timeout Socket.pack_sockaddr_in?
If so, this problem should be resolved.

EDIT: And, I was pointing an IP address at the concerned line. Will it still block for the name resolution?

Updated by akr (Akira Tanaka) over 10 years ago

I think it is very difficult to add timeout to getaddrinfo() function in C.

Updated by akr (Akira Tanaka) over 10 years ago

I guess getaddrinfo() doesn't block if an IP address is given.

At least, following command on Debian GNU/Linux (jessie) doesn't show anything.

% strace -e socket ruby -rsocket -e 'Socket.pack_sockaddr_in(80, "192.168.1.1")'

It means the command create no socket and no communication to another site.

Updated by sonots (Naotoshi Seo) over 10 years ago

Following command on CentOS release 6.2 x86_64 returned a line

$ strace -e socket /usr/lib64/fluent/ruby/bin/ruby -rsocket -e 'Socket.pack_sockaddr_in(80, "192.168.1.1")'
socket(PF_NETLINK, SOCK_RAW, 0)         = 5

/usr/lib64/fluent/ruby/bin/ruby -v #=> ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]

Updated by akr (Akira Tanaka) over 10 years ago

I think that PF_NETLINK is not a socket to communicate to another host.

Updated by naruse (Yui NARUSE) over 10 years ago

  • Status changed from Feedback to Third Party's Issue

gdb says __check_pf is guilty.
https://gist.github.com/sonots/9047653#file-gistfile3-txt

(gdb) bt
#0  0x00000032f2ce659d in recvmsg () from /lib64/libc.so.6
#1  0x00000032f2d0c2c5 in make_request () from /lib64/libc.so.6
#2  0x00000032f2d0c6fa in __check_pf () from /lib64/libc.so.6
#3  0x00000032f2ccfb47 in getaddrinfo () from /lib64/libc.so.6
#4  0x00007fb3139d3818 in nogvl_getaddrinfo (arg=<value optimized out>) at raddrinfo.c:161
#5  0x00000000004f600c in rb_thread_blocking_region (func=0x7fb3139d3800 <nogvl_getaddrinfo>, data1=0x7fb0d5dd0140, ubf=<value optimized out>, data2=<value optimized out>) at thread.c:1129
#6  0x00007fb3139d2eaf in rb_getaddrinfo (node=<value optimized out>, service=<value optimized out>, hints=<value optimized out>, res=<value optimized out>) at raddrinfo.c:181
#7  0x00007fb3139d2fcb in rsock_getaddrinfo (host=<value optimized out>, port=22017, hints=0x7fb0d5dd0600, socktype_hack=1) at raddrinfo.c:359
#8  0x00007fb3139d37ed in rsock_addrinfo (host=0, port=140397479001552, socktype=<value optimized out>, flags=659) at raddrinfo.c:379
#9  0x00007fb3139c939a in sock_s_pack_sockaddr_in (self=<value optimized out>, port=140397479001552, host=35158400) at socket.c:1307

By glibc's commits, once __check_pf is always called 6f3914d5a3269c00e70506bd95f816fef6b635ce
(it is fixed at fa3fc0fe5f452d0aa7e435d8f32e992958683819)

The difference between akr's and sonots' seems because of this.

Therefore this is glibc's old bug and RHEL/CentOS's backport issue.

Updated by sonots (Naotoshi Seo) over 10 years ago

Thank you! Let me write a note here.

On my current environment, the glibc version was glibc-2.12-1.47.el6.x86_64.

I checked the latest centos rpm http://vault.centos.org/6.5/os/Source/SPackages/glibc-2.12-1.132.el6.src.rpm,
but it looked RHED/CentOS is not backporting it yet (cf. https://gist.github.com/sonots/9065060)

Updated by akr (Akira Tanaka) over 10 years ago

I think we can accept a workaround if there is a good patch.

Updated by kosaki (Motohiro KOSAKI) over 10 years ago

I checked a glibc code. I don't think fa3fc0fe5f fixed this issue. It is a mere optimization patch.
I'm not sure why kernel's netlink doesn't reply anything. But it may be worth to try upgrade your kernel.

Thanks.

Updated by kosaki (Motohiro KOSAKI) over 10 years ago

Akr-san,

I'm not an expert this area. But I guess we don't need to call getaddrinfo() in this case
because the name was already resolved.
How about bypass to call getaddrinfo when 'host' is given by IP address?

Updated by akr (Akira Tanaka) over 10 years ago

Motohiro KOSAKI wrote:

I'm not an expert this area. But I guess we don't need to call getaddrinfo() in this case
because the name was already resolved.
How about bypass to call getaddrinfo when 'host' is given by IP address?

Yes. It is the workaround I said.

Updated by akr (Akira Tanaka) over 10 years ago

I tried to workaround this issue at r45047.
However I don't have an environment to reproduce the problem.

Would anyone test the problem at latest trunk?

Updated by normalperson (Eric Wong) over 10 years ago

wrote:

I tried to workaround this issue at r45047.
However I don't have an environment to reproduce the problem.

I don't know how to reproduce the problem, either. netlink sockets
should not block like that.

r45047 looks correct. A few minor comments:

  • backwards for loop for list is confusing to me,
    any reason for not reversing list declaration?
  • why xmalloc + MEMZERO? xcalloc is shorter and generates smaller code

Updated by kosaki (Motohiro KOSAKI) over 10 years ago

Updated by kosaki (Motohiro KOSAKI) over 10 years ago

Updated by kosaki (Motohiro KOSAKI) over 10 years ago

  • Status changed from Third Party's Issue to Closed
  • Backport changed from 1.9.3: UNKNOWN, 2.0.0: UNKNOWN, 2.1: UNKNOWN to 1.9.3: REQUIRED, 2.0.0: REQUIRED, 2.1: REQUIRED

Updated by akr (Akira Tanaka) over 10 years ago

Eric Wong wrote:

  • backwards for loop for list is confusing to me,
    any reason for not reversing list declaration?

I prefer ordering consistency between "list" and "res".

  • why xmalloc + MEMZERO? xcalloc is shorter and generates smaller code

xcalloc is good idea.

Updated by sonots (Naotoshi Seo) over 10 years ago

Akira Tanaka wrote:

Would anyone test the problem at latest trunk?

I had never ran fluentd with ruby 2.1.1, but now I am trying it.
If it works well, I will try ruby-trunk next.

PS. Sorry, I struggled, but could not create a small subset to reproduce this issue so that anyone can check.
What I can tell here is only that this issue would sometimes occur on an environment which is heavy-loaded, and uses many threads, but I am not sure.

Updated by sonots (Naotoshi Seo) over 10 years ago

I applied following consecutive patches for ruby 2.1.1, and tried.

https://github.com/ruby/ruby/commit/948ce9decb97e5ff0833e53a392aa9f1d42c9b0d
https://github.com/ruby/ruby/commit/dd1c3a75096b97c1ebcb8597c001761ddfb3c1bf
https://github.com/ruby/ruby/commit/2e6b97a45d077979121b29484a8831034d47ef50

Before (ruby 2.1.1):

The stuck was able to be reproduced with ruby 2.1.1.
I restarted my entire fluentd cluster 12 times, and it occurred at 4th and 12th, that is, 2 times out of 12.

After (ruby 2.1.1 + patch)

I restarted the fluentd cluster 36 times, and got stuck 3 times, but the stuck point was changed.
See https://gist.github.com/sonots/9392668
I did not see that Fluentd was stuck at Socket.pack_sockaddr_in anymore.

I will continue to work for the new stuck point at Fluentd issue https://github.com/fluent/fluentd/pull/257. I think the problem of stuck at Socket.pack_sockaddr_in was resolved.

Updated by usa (Usaku NAKAMURA) over 10 years ago

memo: related commits: 45046,45047,45063,45087,45150,45151,45152

Updated by nagachika (Tomoyuki Chikanaga) about 10 years ago

Is there any progress?

Updated by nagachika (Tomoyuki Chikanaga) about 10 years ago

I have confirmed that this issue is already fixed.

https://twitter.com/sonots/status/507804708153483264

Add related commits: r45146

Updated by nagachika (Tomoyuki Chikanaga) about 10 years ago

  • Backport changed from 1.9.3: REQUIRED, 2.0.0: REQUIRED, 2.1: REQUIRED to 1.9.3: REQUIRED, 2.0.0: REQUIRED, 2.1: DONE

r45046, r45047, r45063, r45087, r45146, r45150, r45151 and r45152 were backported into ruby_2_1 branch at r47415.

Updated by mariakatosvich (maria katosvich) over 8 years ago

  • Backport deleted (1.9.3: REQUIRED, 2.0.0: REQUIRED, 2.1: DONE)

Related issue: https://bugs.ruby-lang.org/issues/9536

This problem is a big issue for Fluentd.
We hope to backport this commit to Latest Rubies.

Thanks,
sky contact number uk@fixithere.net

Updated by nobu (Nobuyoshi Nakada) over 8 years ago

  • Backport set to 2.1: DONE
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0