Bug #20048
Closed
UDPSocket#remote_address spec errors
Description
While testing on Fedora Rawhide, we have recently started to observe the following errors:
$ make -C redhat-linux-build test-spec MSPECOPT="-fs ../spec/ruby/library/socket/udpsocket"
... snip ...
1)
An exception occurred during: before :each
UDPSocket#local_address using IPv4 using an implicit hostname the returned Addrinfo uses the correct IP address ERROR
Socket::ResolutionError: getaddrinfo: Name or service not known
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/local_address_spec.rb:66:in `connect'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/local_address_spec.rb:66:in `block (4 levels) in <top (required)>'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/local_address_spec.rb:4:in `<top (required)>'
2)
An exception occurred during: before :each
UDPSocket#local_address using IPv6 using an implicit hostname the returned Addrinfo uses the correct IP address ERROR
Socket::ResolutionError: getaddrinfo: Name or service not known
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/local_address_spec.rb:66:in `connect'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/local_address_spec.rb:66:in `block (4 levels) in <top (required)>'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/local_address_spec.rb:4:in `<top (required)>'
3)
An exception occurred during: before :each
UDPSocket#remote_address using IPv4 using an implicit hostname the returned Addrinfo uses the correct IP address ERROR
Socket::ResolutionError: getaddrinfo: Name or service not known
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/remote_address_spec.rb:65:in `connect'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/remote_address_spec.rb:65:in `block (4 levels) in <top (required)>'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/remote_address_spec.rb:4:in `<top (required)>'
4)
An exception occurred during: before :each
UDPSocket#remote_address using IPv6 using an implicit hostname the returned Addrinfo uses the correct IP address ERROR
Socket::ResolutionError: getaddrinfo: Name or service not known
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/remote_address_spec.rb:65:in `connect'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/remote_address_spec.rb:65:in `block (4 levels) in <top (required)>'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/remote_address_spec.rb:4:in `<top (required)>'
Finished in 0.020615 seconds
11 files, 95 examples, 123 expectations, 0 failures, 4 errors, 0 tagged
make: *** [uncommon.mk:983: yes-test-spec] Error 1
make: Leaving directory '/builddir/build/BUILD/ruby-3.3.0-071df40495/redhat-linux-build'
Please note that the build environment does not have a network connection enabled by default. As soon as a network connection is available, the test cases pass just fine.
This started to happen between these two commits:
The likely culprit is commit d2ba8ea54a4089959afdeecdd963e3c4ff391748.
Originally reported here
Updated by mtasaka (Mamoru TASAKA) 11 months ago
Actually, reverting d2ba8ea54a4 makes the above 4 test errors disappear, especially reverting this line:
https://github.com/ruby/ruby/commit/d2ba8ea54a4089959afdeecdd963e3c4ff391748#diff-f84073f6c886ca13d2a0bcbee487d5c405d50c52c59ffdc2b754417ee4a5a38dR96
Updated by mtasaka (Mamoru TASAKA) 11 months ago
Not sure if this is the right resolution, but at least the following change makes the above tests pass:
diff --git a/ext/socket/raddrinfo.c b/ext/socket/raddrinfo.c
index 683f9fa4b4..ad4f7267fb 100644
--- a/ext/socket/raddrinfo.c
+++ b/ext/socket/raddrinfo.c
@@ -976,6 +976,11 @@ rsock_getaddrinfo(VALUE host, VALUE port, struct addrinfo *hints, int socktype_h
     }
 
     if (!resolved) {
+#ifdef HAVE_CONST_AI_ADDRCONFIG
+        if (!hostp) {
+            hints->ai_flags &= ~AI_ADDRCONFIG;
+        }
+#endif
         error = rb_getaddrinfo(hostp, portp, hints, &ai);
         if (error == 0) {
             res = (struct rb_addrinfo *)xmalloc(sizeof(struct rb_addrinfo));
man getaddrinfo says "If hints.ai_flags includes the AI_ADDRCONFIG flag, .... The loopback address is not considered for this case as valid as a configured address", but perhaps some loopback mechanism is needed.
Updated by kjtsanaktsidis (KJ Tsanaktsidis) 11 months ago
Apologies for this - I'm having a look at it now. Do you know how I can spin up an equivalent environment like the one the Fedora package tests run in? I'm running fedora 39 on my workstation so that should be fairly simple, in theory...
Updated by vo.x (Vit Ondruch) 11 months ago
On Fedora, we are using Mock for that purpose. E.g. you need to do something like:
$ sudo dnf install mock
$ sudo usermod -a -G mock [User name]
$ # To prepare the buildroot and install all the required packages
$ mock -i gcc make ...
$ mock shell --unpriv --enable-network
This will give you a shell in which you can prepare your environment, download the sources, etc. To have the network disabled, you can just drop the --enable-network option. And you can access the root in /var/lib/mock/fedora-rawhide-x86_64/root.
And after you are done, you can clean up:
$ mock --scrub all
If you'd like to try a different Fedora version, you can use e.g. mock -r fedora-38-x86_64. Rawhide is the default.
Updated by kjtsanaktsidis (KJ Tsanaktsidis) 11 months ago
OK, thank you for that, I was able to get a mock up and running and I managed to reproduce the issue.
I wrote a standalone C program to debug what happens when we call getaddrinfo with various combinations of flags. Its source is here, along with the output from running it inside and outside the mock. https://gist.github.com/KJTsanaktsidis/9f58e332d2bf3ccdbc18a3ff148b5bd4
What I found is that what doesn't work inside the mock environment is:
- A call to getaddrinfo that wants localhost (whether it's spelled as "localhost" or NULL doesn't matter),
- The AI_ADDRCONFIG flag is passed,
- And the family is explicitly set to AF_INET or AF_INET6 (i.e. NOT set to AF_UNSPEC).
I'm umming and aah'ing as to whether this is a glibc bug or not - I might spend some time tomorrow seeing how it behaves on different systems. But in any case the whole point of this feature was to work around a different glibc bug, and if this is triggering a worse one, then we should revert it.
btw @mtasaka (Mamoru TASAKA) I don't think your patch is enough - even if hostp is not NULL, it could be "localhost" and it'll still fail. So yeah, tl;dr, I think we have to revert.
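The combinations listed above can also be probed from Ruby. The following is only a minimal sketch, not the C program from the gist: the helper name probe, the port number 80 and the datagram socket type are assumptions chosen for illustration, and Socket::AI_ADDRCONFIG is only defined on platforms that provide the flag. Going by the findings in this thread, on a host that has only loopback addresses the rows that combine an explicit family with AI_ADDRCONFIG are the ones expected to fail.

```ruby
require "socket"

# Hypothetical helper: resolve `host` for the given family and flags.
# Returns the list of IP addresses found, or the resolver's error
# message if the lookup fails.
def probe(host, family, flags)
  Addrinfo.getaddrinfo(host, 80, family, :DGRAM, nil, flags).map(&:ip_address)
rescue SocketError => e
  # Socket::ResolutionError (Ruby 3.3+) is a subclass of SocketError.
  "failed: #{e.message}"
end

families = {
  "AF_UNSPEC" => Socket::AF_UNSPEC,
  "AF_INET"   => Socket::AF_INET,
  "AF_INET6"  => Socket::AF_INET6,
}
flag_sets = {
  "no flags"      => 0,
  "AI_ADDRCONFIG" => Socket::AI_ADDRCONFIG,
}

# Try every combination of host spelling, family and flags.
[nil, "localhost"].each do |host|
  families.each do |family_name, family|
    flag_sets.each do |flags_name, flags|
      result = probe(host, family, flags)
      puts "#{host.inspect} #{family_name} #{flags_name}: #{result.inspect}"
    end
  end
end
```

Running it once with the network enabled and once without should make the difference between the two environments visible in the printed rows.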
Updated by vo.x (Vit Ondruch) 11 months ago
Updated by kjtsanaktsidis (KJ Tsanaktsidis) 11 months ago
OK, I think the solution is to partially revert that commit. The problem is that for TCPSocket connects we do getaddrinfo(3) with AF_UNSPEC and then socket(2) with the family of one of the returned addresses. BUT, the UDPSocket constructor takes an explicit family option (and defaults to AF_INET if unset), and then when UDPSocket#connect is called, we call getaddrinfo(3) with that address family. Thus, UDP sockets fall into the case I outlined in my previous message, but TCP sockets don't.
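The two orderings can be sketched in Ruby roughly as follows. This is an illustration only: the method names resolve_then_socket and socket_then_resolve are hypothetical, the real logic lives in the C socket extension, and the numeric address 127.0.0.1 with port 9 is an arbitrary choice so that the sketch does not depend on name resolution.

```ruby
require "socket"

# TCP-style flow (hypothetical helper): resolve first with AF_UNSPEC,
# then create the socket with the family of the address that came back.
def resolve_then_socket(host, port)
  ai = Addrinfo.getaddrinfo(host, port, Socket::AF_UNSPEC, :STREAM).first
  [ai, Socket.new(ai.afamily, :STREAM)]
end

# UDP-style flow (hypothetical helper): the family is fixed when the
# socket is created, and the later connect call resolves the host for
# that family only.
def socket_then_resolve(host, port, family = Socket::AF_INET)
  sock = UDPSocket.new(family)
  sock.connect(host, port) # connecting a UDP socket sends no packet
  sock
end

ai, tcp = resolve_then_socket("127.0.0.1", 9)
puts "TCP flow: resolved #{ai.ip_address}, socket made as #{ai.ipv4? ? 'AF_INET' : 'AF_INET6'}"
tcp.close

udp = socket_then_resolve("127.0.0.1", 9)
puts "UDP flow: family chosen up front, connected to #{udp.remote_address.ip_address}"
udp.close
```

In the first flow the resolver is free to choose the family, which is the situation AI_ADDRCONFIG is meant for; in the second the family has already been decided by the time the lookup happens.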
If we ever wrote a UDPSocket::new(remote_host, remote_port, local_host=nil, local_port=nil) constructor like we have for TCPSocket, that constructor could do the getaddrinfo-then-socket flow and could use AI_ADDRCONFIG. But the current constructor sets the family on the socket explicitly, so AI_ADDRCONFIG doesn't make a lot of sense. AI_ADDRCONFIG says "return addresses only of types that I could conceivably use to make a connection", but if you've already committed to what address family to use because you made the socket, you know exactly what address family you're looking for.
So let's merge https://github.com/ruby/ruby/pull/9177 I think.
Updated by vo.x (Vit Ondruch) 11 months ago
I cannot judge the change. But the PR makes the tests pass. The scratch build is here and drilling through it, one could get to e.g. x86_64 build log.
@kjtsanaktsidis (KJ Tsanaktsidis) thank you for looking into this.
Updated by Anonymous 11 months ago
- Status changed from Open to Closed
Applied in changeset git|25711e7063060920d14e42a530da6f7198926629.
Partially revert "Set AI_ADDRCONFIG when making getaddrinfo(3) calls"
This partially reverts commit
d2ba8ea54a4089959afdeecdd963e3c4ff391748, but for UDP sockets only.
With TCP sockets (and other things which use rsock_init_inetsock), the
order of operations is to call getaddrinfo(3) with AF_UNSPEC, look at
the returned addresses, pick one, and then call socket(2) with the
family for that address (i.e. AF_INET or AF_INET6).
With UDP sockets, however, this is reversed; UDPSocket.new takes an
address family as an argument, and then calls socket(2) with that
family. A subsequent call to UDPSocket#connect will then call
getaddrinfo(3) with that family.
The problem here is that...
- If you are in a networking situation that only has loopback addrs,
- And you want to look up a name like "localhost" (or NULL)
- And you pass AF_INET or AF_INET6 as the ai_family argument to
  getaddrinfo(3),
- And you pass AI_ADDRCONFIG to the hints argument as well,
then glibc on Linux will not return an address. This is because
AI_ADDRCONFIG is supposed to return addresses for families we actually
have an address for and could conceivably connect to, but also is
documented to explicitly ignore localhost in that situation.
It honestly doesn't make a ton of sense to pass AI_ADDRCONFIG if you're
explicitly passing the address family anyway, because you're not looking
for "an address for this name we can connect to"; you're looking for "an
IPv(4|6) address for this name". And the original glibc bug that
d2ba8ea5 was supposed to work around was related to parallel issuance of
A and AAAA queries, which of course won't happen if an address family is
explicitly specified.
So, we fix this by not passing AI_ADDRCONFIG for calls to
rsock_addrinfo that we also pass an explicit family to (i.e. for
UDPSocket).
[Bug #20048]
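The rule described in the commit message can be pictured with a small Ruby sketch. It is an assumption-laden illustration rather than the actual change, which was made in the C socket extension: the helper name lookup_flags is hypothetical, and the sketch only shows the idea of requesting AI_ADDRCONFIG when the family is left as AF_UNSPEC and dropping it when the caller has already chosen a family.

```ruby
require "socket"

# Hypothetical helper, not part of Ruby's socket library: pick the
# getaddrinfo flags for a lookup based on the requested address family.
def lookup_flags(family, base_flags = 0)
  # AI_ADDRCONFIG is not available on every platform; treat it as 0 there.
  addrconfig = Socket.const_defined?(:AI_ADDRCONFIG) ? Socket::AI_ADDRCONFIG : 0
  if family == Socket::AF_UNSPEC
    # "Any address for this name that we could connect to."
    base_flags | addrconfig
  else
    # "An IPv4 (or IPv6) address for this name": the family is already
    # decided, so AI_ADDRCONFIG is cleared.
    base_flags & ~addrconfig
  end
end

# Usage: an implicit-host lookup with an explicit family, as done when
# UDPSocket#connect is called with a nil host.
flags = lookup_flags(Socket::AF_INET)
addrs = Addrinfo.getaddrinfo(nil, 80, Socket::AF_INET, :DGRAM, nil, flags)
puts addrs.map(&:ip_address).inspect
```

Keying the decision on the family, rather than on whether the host is NULL, also covers the "localhost" spelling mentioned earlier in the thread.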