Bug #20172
Closed
Socket.addrinfo failing randomly
Description
I recently updated one of my Linux systems (Gentoo) to glibc 2.38; that was the only change. Since the update, the error below occurs most of the time. Among other things, this breaks rubygems for me. After reinstalling Ruby 3.2.2 with rvm, I didn't encounter the issue. However, the issue remained after reinstalling Ruby 3.3.0, and it also occurs with Ruby master. Since this goes back to getaddrinfo (which works without any issues outside of Ruby), and there seems to be only one bigger change to the stdlib socket library, I assume the problem was introduced with https://bugs.ruby-lang.org/issues/19965
3.3.0 :001 > require 'socket'
=> true
3.3.0 :002 > Socket.getaddrinfo('rubygems.org', 443)
(irb):2:in `getaddrinfo': getaddrinfo: Temporary failure in name resolution (Socket::ResolutionError)
from (irb):2:in `<main>'
from <internal:kernel>:187:in `loop'
from /usr/local/rvm/rubies/ruby-3.3.0/lib/ruby/gems/3.3.0/gems/irb-1.11.0/exe/irb:9:in `<top (required)>'
from /usr/local/rvm/rubies/ruby-3.3.0/bin/irb:25:in `load'
from /usr/local/rvm/rubies/ruby-3.3.0/bin/irb:25:in `<main>'
3.3.0 :003 > Socket.getaddrinfo('rubygems.org', 443)
(irb):3:in `getaddrinfo': getaddrinfo: Temporary failure in name resolution (Socket::ResolutionError)
from (irb):3:in `<main>'
from <internal:kernel>:187:in `loop'
from /usr/local/rvm/rubies/ruby-3.3.0/lib/ruby/gems/3.3.0/gems/irb-1.11.0/exe/irb:9:in `<top (required)>'
from /usr/local/rvm/rubies/ruby-3.3.0/bin/irb:25:in `load'
from /usr/local/rvm/rubies/ruby-3.3.0/bin/irb:25:in `<main>'
3.3.0 :004 > Socket.getaddrinfo('rubygems.org', 443)
=>
[["AF_INET", 443, "151.101.193.227", "151.101.193.227", 2, 1, 6],
["AF_INET", 443, "151.101.193.227", "151.101.193.227", 2, 2, 17],
["AF_INET", 443, "151.101.193.227", "151.101.193.227", 2, 3, 0],
...
["AF_INET6", 443, "2a04:4e42::483", "2a04:4e42::483", 10, 1, 6],
["AF_INET6", 443, "2a04:4e42::483", "2a04:4e42::483", 10, 2, 17],
["AF_INET6", 443, "2a04:4e42::483", "2a04:4e42::483", 10, 3, 0]]
3.3.0 :005 >
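Because the failure is intermittent, it can help to count how often it occurs rather than retrying by hand in irb. The following is a minimal sketch, not part of the original report: the helper name count_resolution_failures and its attempts: and resolver: parameters are hypothetical, and it rescues SocketError because Socket::ResolutionError (Ruby 3.3+) is a subclass of it.

```ruby
require 'socket'

# Hypothetical helper (not from the report): call the resolver repeatedly
# and count how many attempts raise a name resolution error.
def count_resolution_failures(host, port, attempts: 20,
                              resolver: Socket.method(:getaddrinfo))
  failures = 0
  attempts.times do
    begin
      resolver.call(host, port)
    rescue SocketError
      # Socket::ResolutionError (Ruby 3.3+) inherits from SocketError,
      # so this also works on older Ruby versions.
      failures += 1
    end
  end
  failures
end
```

For example, count_resolution_failures('rubygems.org', 443) returns the number of failed lookups out of 20 attempts. On an affected system it would presumably be well above zero.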
Updated by hsbt (Hiroshi SHIBATA) about 1 year ago
- Related to Feature #19965: Make the name resolution interruptible added
Updated by mame (Yusuke Endoh) about 1 year ago
Yeah, it is probably due to the change of #19965. I cannot debug it soon because I don't have a gentoo environment. I suspect pthread_create is somehow failing. Does 10000.times { Thread.new {}.join } work successfully on your machine?
Updated by mame (Yusuke Endoh) about 1 year ago
Incidentally, our Arch Linux CI also uses glibc 2.38, and it is working fine. So I guess that either it is a Gentoo-specific problem, or your machine is so heavily loaded that pthread_create fails.
Updated by mwaldvogel (Michael Waldvogel) about 1 year ago
We can at least exclude that it is due to heavy load. I will provide you access to one of the VMs by tomorrow. That way it should be easier to analyze.
Updated by mwaldvogel (Michael Waldvogel) about 1 year ago
mame (Yusuke Endoh) wrote in #note-2:
Yeah, it is probably due to the change of #19965. I cannot debug it soon because I don't have a gentoo environment. I suspect pthread_create is somehow failing. Does 10000.times { Thread.new {}.join } work successfully on your machine?
Yes, 10000.times { Thread.new {}.join } works without any problems.
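The check above can be wrapped so that a failure is reported instead of aborting the session. This is only a sketch: the method name thread_creation_ok? is hypothetical. Ruby raises ThreadError when it cannot create a thread, so that is the exception rescued here. Note that this exercises only Ruby's own Thread.new path; passing it does not necessarily rule out a failure in a thread that is created with different attributes.

```ruby
# Hypothetical helper: create and join `count` threads one after another,
# returning false if Ruby fails to create one of them.
def thread_creation_ok?(count = 10_000)
  count.times { Thread.new {}.join }
  true
rescue ThreadError => e
  # Raised by Ruby when the underlying thread cannot be created.
  warn "thread creation failed: #{e.message}"
  false
end
```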
Updated by mame (Yusuke Endoh) about 1 year ago
I investigated the issue by using the VM access Michael gave me. (Thank you!) And I understand the issue.
It looks like sched_getcpu(3) returns an unexpected number in the environment. Since the number of CPUs in the VM is 2, I expect it to return 0 or 1. However, it actually returns 0 or 123. This makes pthread_create fail with EINVAL because of a wrong affinity configuration.
TBH, I don't know why sched_getcpu(3) returns a strange value, but I guess it may depend on the configuration of the virtual environment.
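To check whether another machine shows the same symptom, one could sample sched_getcpu(3) from Ruby and look for values outside the expected range. The sketch below is an illustration, not the code used in the investigation. It assumes Linux with glibc and an available Fiddle library; both method names are hypothetical. It also uses Etc.nprocessors as an approximation of the CPU count, which may be lower than the real count if the process has a restricted affinity mask.

```ruby
require 'etc'

# Hypothetical helper: return the distinct sampled CPU ids that fall
# outside the range 0...cpu_count, sorted ascending.
def out_of_range_cpu_ids(samples, cpu_count)
  samples.uniq.reject { |id| (0...cpu_count).cover?(id) }.sort
end

# Hypothetical helper: call sched_getcpu(3) `times` times via Fiddle.
# Assumption: Linux with glibc; on other platforms the symbol lookup
# raises Fiddle::DLError.
def sample_sched_getcpu(times = 1000)
  require 'fiddle'
  sched_getcpu = Fiddle::Function.new(
    Fiddle::Handle::DEFAULT['sched_getcpu'], [], Fiddle::TYPE_INT
  )
  Array.new(times) { sched_getcpu.call }
end
```

Calling out_of_range_cpu_ids(sample_sched_getcpu, Etc.nprocessors) should return an empty array on a healthy system. On the VM described above, with 2 CPUs and sched_getcpu(3) returning 0 or 123, it would be expected to report 123.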
I decided to remove the setaffinity mechanism and confirmed that it solves the issue: https://github.com/ruby/ruby/pull/9479
I introduced the mechanism to reduce the overhead of thread context switch, but a quick benchmark showed that removing it didn't seem to degrade the performance. So I'd like to simply delete the troublesome code.
Updated by mame (Yusuke Endoh) about 1 year ago
- Backport changed from 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN to 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONTNEED, 3.3: REQUIRED
Updated by mame (Yusuke Endoh) about 1 year ago
- Status changed from Open to Closed
Applied in changeset git|1bd98c820da46a05328d2d53b8f748f28e7ee8f7.
Remove setaffinity of pthread for getaddrinfo
It looks like sched_getcpu(3) returns a strange number on some
(virtual?) environments.
I decided to remove the setaffinity mechanism because the performance
does not appear to degrade on a quick benchmark even if removed.
[Bug #20172]
Updated by naruse (Yui NARUSE) 12 months ago
Merging into 3.3 is pending
https://github.com/ruby/ruby/pull/9798
Updated by ioquatix (Samuel Williams) 12 months ago
For reference, I had a user report a similar issue due to Addrinfo#ip_address: https://github.com/socketry/falcon/issues/217
Updated by naruse (Yui NARUSE) 12 months ago
- Backport changed from 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONTNEED, 3.3: REQUIRED to 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONTNEED, 3.3: DONE
ruby_3_3 53d4e9c4bbba077a569549a01a8263e5e8f59ee8 merged revision(s) 1bd98c820da46a05328d2d53b8f748f28e7ee8f7.