Feature #19965 (closed)
Make the name resolution interruptible
Description
Problem
Currently, Ruby name resolution is not interruptible.
$ cat /etc/resolv.conf
nameserver 198.51.100.1
$ ./local/bin/ruby -rsocket -e 'Addrinfo.getaddrinfo("www.ruby-lang.org", 80)'
^C^C^C^C
If you set a non-responsive IP as the nameserver, you cannot stop Addrinfo.getaddrinfo by pressing Ctrl+C. Note that Timeout.timeout does not work either.
This is because there is no way to cancel getaddrinfo(3).
Proposal
I wrote a patch to make getaddrinfo(3) work in a separate pthread.
https://github.com/ruby/ruby/pull/8695
Whenever it needs name resolution, it creates a worker pthread and executes getaddrinfo(3) in it.
The caller thread waits for the worker to complete.
When an interrupt occurs, the caller thread stops waiting and detaches the worker pthread.
The detached worker pthread will exit after getaddrinfo(3) completes (or name resolution times out).
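The wait-then-abandon pattern described above can be sketched in plain Ruby. This is only an illustration: the actual patch is written in C with pthreads, and the helper name `run_detachable` and the simulated blocking job are assumptions made for this example.

```ruby
require "timeout"

# Hypothetical helper (not part of the patch): runs a blocking job in a worker
# thread and lets the caller give up waiting without stopping the worker.
def run_detachable(wait_limit)
  result = Queue.new
  Thread.new { result << yield }                # worker performs the blocking call
  begin
    Timeout.timeout(wait_limit) { result.pop }  # caller waits for the worker
  rescue Timeout::Error
    :abandoned  # caller stops waiting; the worker finishes on its own later
  end
end

fast = run_detachable(1.0) { :resolved }
slow = run_detachable(0.1) { sleep 0.5; :resolved }  # stands in for a stalled lookup
p fast
p slow
```

The caller never joins the slow worker, which mirrors the idea of leaving the detached pthread to exit by itself once the blocking call returns.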
Evaluation
By applying this patch, name resolution is now interruptible.
$ ./local/bin/ruby -rsocket -e 'pp Addrinfo.getaddrinfo("www.ruby-lang.org", 80)'
^C-e:1:in `getaddrinfo': Interrupt
from -e:1:in `<main>'
As a drawback, name resolution performance will be degraded.
10000.times { Addrinfo.getaddrinfo("www.ruby-lang.org", 80) }
# Before patch: 2.3 sec.
# After patch: 3.0 sec.
However, I think that name resolution is typically short relative to the application's overall runtime. For example, the difference is small for the performance of URI.open.
100.times { URI.open("https://www.ruby-lang.org").read }
# Before patch: 3.36 sec.
# After patch: 3.40 sec.
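A minimal sketch of how timings like the ones above could be collected, assuming Benchmark.realtime from the standard library is an acceptable way to measure. The helper name `time_lookups` is hypothetical, and the thread does not say how the original numbers were actually taken.

```ruby
require "benchmark"
require "socket"

# Hypothetical helper: returns the wall-clock seconds spent on `count` lookups.
def time_lookups(host, port, count)
  Benchmark.realtime do
    count.times { Addrinfo.getaddrinfo(host, port) }
  end
end

# A numeric address keeps this runnable without a DNS server; substitute the
# host name you want to measure (the figures above used www.ruby-lang.org).
elapsed = time_lookups("127.0.0.1", 80, 100)
puts format("100 lookups: %.3f sec.", elapsed)
```

Running the same script on builds with and without the patch gives a before/after comparison for a given host.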
Alternative approaches
I proposed using c-ares to resolve this issue (#19430). However, there was an opinion that it would be a problem that c-ares does not respect the platform's own name resolution mechanism.
Room for improvement
- Currently, this patch works only when pthread is available.
- It might be possible to force the worker threads to stop by using pthread_cancel. However, pthread_cancel with getaddrinfo(3) still seems premature; there seems to have been a bug in glibc until recently: https://bugzilla.redhat.com/show_bug.cgi?id=1405071 https://sourceware.org/bugzilla/show_bug.cgi?id=20975
- It would be more efficient to pool worker pthreads instead of creating them each time.
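The pooling idea in the last item can be sketched in Ruby as follows. This is an illustration of the general technique only, not something the patch implements: the class name `LookupPool` and its methods are hypothetical, and a real version would live in C alongside the pthread code.

```ruby
# Hypothetical sketch of a fixed-size worker pool: threads are created once
# and reused for every job instead of being spawned per lookup.
class LookupPool
  def initialize(size)
    @jobs = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        # A nil job is the shutdown signal and ends the loop.
        while (job = @jobs.pop)
          task, reply = job
          reply << task.call
        end
      end
    end
  end

  # Queues a job and returns a one-shot Queue the caller pops the result from.
  def submit(&task)
    reply = Queue.new
    @jobs << [task, reply]
    reply
  end

  def shutdown
    @workers.size.times { @jobs << nil }
    @workers.each(&:join)
  end
end

pool = LookupPool.new(2)
replies = 5.times.map { |i| pool.submit { i * i } }  # stand-in for lookups
squares = replies.map(&:pop)
pool.shutdown
p squares
```

The sketch does not handle a job that raises (the worker would die) or a worker stuck in a lookup that never returns, so a real pool would need a policy for both cases.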
Updated by mame (Yusuke Endoh) over 1 year ago
- Related to Feature #19430: Contribution wanted: DNS lookup by c-ares library added
Updated by mame (Yusuke Endoh) about 1 year ago
- Status changed from Open to Closed
Applied in changeset git|3dc311bdc8badb680267f5a10e0c467ddd9dfe4c.
Make rb_getaddrinfo interruptible
When pthread_create is available, rb_getaddrinfo creates a pthread and
executes getaddrinfo(3) in it. The caller thread waits for the pthread
to complete, but detaches it if interrupted. This allows name resolution
to be interrupted by Timeout.timeout, etc. even if it takes a long time
(for example, when the DNS server does not respond). [Feature #19965]
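As a sketch of what the change above enables on the caller side, a lookup can be bounded with Timeout.timeout. The helper name `resolve_with_limit` is hypothetical. A numeric address is used so the example runs without a DNS server; with a stalled resolver on a Ruby that includes this change, the rescue branch would be taken instead.

```ruby
require "socket"
require "timeout"

# Hypothetical helper: returns the resolved addresses, or nil when the lookup
# does not finish within `limit` seconds.
def resolve_with_limit(host, port, limit)
  Timeout.timeout(limit) { Addrinfo.getaddrinfo(host, port) }
rescue Timeout::Error
  nil
end

addrs = resolve_with_limit("127.0.0.1", 80, 2)
p addrs&.map(&:ip_address)
```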
Updated by hsbt (Hiroshi SHIBATA) about 1 year ago
- Related to Feature #16476: Socket.getaddrinfo cannot be interrupted by Timeout.timeout added
Updated by byroot (Jean Boussier) about 1 year ago
@mame (Yusuke Endoh) we just ran into a crash on our ruby-head nightly CI that seems related:
/app/components/platform/essentials/lib/http_host_restriction.rb:50: [BUG] Segmentation fault at 0x00007ff23f795910
ruby 3.3.0dev (2023-11-06T03:01:06Z shopify a763d085e4) +YJIT [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0139 p:---- s:0684 e:000683 CFUNC :ip_address
c:0138 p:0006 s:0680 e:000679 METHOD /app/components/platform/essentials/lib/http_host_restriction.rb:50
c:0137 p:0024 s:0672 e:000671 METHOD /app/components/platform/essentials/lib/http_host_restriction.rb:86
-- C level backtrace information -------------------------------------------
/usr/local/ruby/bin/ruby(rb_print_backtrace+0x14) [0x557302ccae51] /tmp/ruby-build/ruby-3.3.0-a763d085e446d4a3cb09bd5f6bcaffc30484e804/vm_dump.c:812
/usr/local/ruby/bin/ruby(rb_vm_bugreport) /tmp/ruby-build/ruby-3.3.0-a763d085e446d4a3cb09bd5f6bcaffc30484e804/vm_dump.c:1143
/usr/local/ruby/bin/ruby(rb_bug_for_fatal_signal+0xfc) [0x557302e77a7c] /tmp/ruby-build/ruby-3.3.0-a763d085e446d4a3cb09bd5f6bcaffc30484e804/error.c:1065
/usr/local/ruby/bin/ruby(sigsegv+0x4d) [0x557302c1763d] /tmp/ruby-build/ruby-3.3.0-a763d085e446d4a3cb09bd5f6bcaffc30484e804/signal.c:920
/lib/x86_64-linux-gnu/libc.so.6(0x7ff42d333520) [0x7ff42d333520]
/lib/x86_64-linux-gnu/libc.so.6(pthread_setaffinity_np+0x4) [0x7ff42d38c524]
/usr/local/ruby/lib/ruby/3.3.0+0/x86_64-linux/socket.so(rb_getnameinfo+0xf2) [0x7ff40f1c8f92] /tmp/ruby-build/ruby-3.3.0-a763d085e446d4a3cb09bd5f6bcaffc30484e804/ext/socket/raddrinfo.c:711
/usr/local/ruby/lib/ruby/3.3.0+0/x86_64-linux/socket.so(rb_getnameinfo) (null):0
/usr/local/ruby/lib/ruby/3.3.0+0/x86_64-linux/socket.so(addrinfo_getnameinfo+0x88) [0x7ff40f1c94e8] /tmp/ruby-build/ruby-3.3.0-a763d085e446d4a3cb09bd5f6bcaffc30484e804/ext/socket/raddrinfo.c:2372
/usr/local/ruby/lib/ruby/3.3.0+0/x86_64-linux/socket.so(addrinfo_ip_address+0x59) [0x7ff40f1c95f9] /tmp/ruby-build/ruby-3.3.0-a763d085e446d4a3cb09bd5f6bcaffc30484e804/ext/socket/raddrinfo.c:2430
/usr/local/ruby/bin/ruby(vm_call_cfunc_with_frame_+0x117) [0x557302ca77e7] /tmp/ruby-build/ruby-3.3.0-a763d085e446d4a3cb09bd5f6bcaffc30484e804/vm_insnhelper.c:3503
/usr/local/ruby/bin/ruby(vm_sendish+0x9e) [0x557302cbafc7] /tmp/ruby-build/ruby-3.3.0-a763d085e446d4a3cb09bd5f6bcaffc30484e804/vm_insnhelper.c:5581
/usr/local/ruby/bin/ruby(vm_exec_core) /tmp/ruby-build/ruby-3.3.0-a763d085e446d4a3cb09bd5f6bcaffc30484e804/insns.def:822
/usr/local/ruby/bin/ruby(rb_vm_exec+0x18e) [0x557302cab87e] /tmp/ruby-build/ruby-3.3.0-a763d085e446d4a3cb09bd5f6bcaffc30484e804/vm.c:2472
Let me know if I can provide more information.
Updated by mame (Yusuke Endoh) about 1 year ago
- Status changed from Closed to Assigned
- Assignee set to mame (Yusuke Endoh)
@byroot (Jean Boussier) Thanks! Maybe I'm misunderstanding the usage of pthread_setaffinity_np. I'll check it out. If I can't figure it out, I'll stop using pthread_setaffinity_np.
Updated by byroot (Jean Boussier) about 1 year ago
Important note: in our environment we do fork a lot, so it's possible that the cause is that the thread is dead.
Updated by mame (Yusuke Endoh) about 1 year ago
Actually, I saw the same problem with CI on RedHat on s390x.
https://rubyci.s3.amazonaws.com/rhel_zlinux/ruby-master/log/20231025T093302Z.fail.html.gz
-- C level backtrace information -------------------------------------------
unknown address_size:0/home/chkbuild/build/20231025T093302Z/ruby/ruby(rb_print_backtrace+0x10) [0x2aa22b5eb06] vm_dump.c:812
/home/chkbuild/build/20231025T093302Z/ruby/ruby(rb_vm_bugreport) vm_dump.c:1143
/home/chkbuild/build/20231025T093302Z/ruby/ruby(rb_bug_for_fatal_signal+0xc2) [0x2aa22c62da2] error.c:1065
/home/chkbuild/build/20231025T093302Z/ruby/ruby(sigill+0x0) [0x2aa22a9f000] signal.c:920
/home/chkbuild/build/20231025T093302Z/ruby/ruby(sigsegv) (null):0
[0x3fef1782718]
/lib64/libpthread.so.0(pthread_setaffinity_np+0x44) [0x3ff8031103c]
/home/chkbuild/build/20231025T093302Z/ruby/.ext/s390x-linux/socket.so(rb_getnameinfo+0x290) [0x3ff567a3340]
I thought it might be specific to glibc on s390x, so I stopped using pthread_setaffinity_np on s390x only. But if it appears in other environments as well (especially x86_64), I'll have to do something.
Updated by mame (Yusuke Endoh) about 1 year ago
I guess https://github.com/ruby/ruby/pull/8852 will solve the issue.
Updated by mame (Yusuke Endoh) about 1 year ago
- Status changed from Assigned to Closed
Applied in changeset git|d0066211f2052bf1444ffeb11544860a12cebff2.
Detach a pthread after pthread_setaffinity_np
After a pthread for getaddrinfo is detached, we cannot predict when the
thread will exit. It would lead to a segfault by setting
pthread_setaffinity to the terminated pthread. I guess this problem
would be more likely to occur in high-load environments.
This change detaches the pthread after pthread_setaffinity is called.
[Feature #19965]
Updated by hsbt (Hiroshi SHIBATA) about 1 year ago
- Related to Bug #20172: Socket.addrinfo failing randomly added