Bug #20048

UDPSocket#remote_address spec errors

Added by vo.x (Vit Ondruch) 11 months ago. Updated 11 months ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 3.3.0dev (2023-12-07 master 071df40495) [x86_64-linux]
[ruby-core:115647]

Description

While testing on Fedora Rawhide, we have recently started to observe the following errors:

$ make -C redhat-linux-build test-spec MSPECOPT="-fs ../spec/ruby/library/socket/udpsocket"

... snip ...

1)
An exception occurred during: before :each
UDPSocket#local_address using IPv4 using an implicit hostname the returned Addrinfo uses the correct IP address ERROR
Socket::ResolutionError: getaddrinfo: Name or service not known
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/local_address_spec.rb:66:in `connect'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/local_address_spec.rb:66:in `block (4 levels) in <top (required)>'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/local_address_spec.rb:4:in `<top (required)>'

2)
An exception occurred during: before :each
UDPSocket#local_address using IPv6 using an implicit hostname the returned Addrinfo uses the correct IP address ERROR
Socket::ResolutionError: getaddrinfo: Name or service not known
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/local_address_spec.rb:66:in `connect'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/local_address_spec.rb:66:in `block (4 levels) in <top (required)>'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/local_address_spec.rb:4:in `<top (required)>'

3)
An exception occurred during: before :each
UDPSocket#remote_address using IPv4 using an implicit hostname the returned Addrinfo uses the correct IP address ERROR
Socket::ResolutionError: getaddrinfo: Name or service not known
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/remote_address_spec.rb:65:in `connect'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/remote_address_spec.rb:65:in `block (4 levels) in <top (required)>'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/remote_address_spec.rb:4:in `<top (required)>'

4)
An exception occurred during: before :each
UDPSocket#remote_address using IPv6 using an implicit hostname the returned Addrinfo uses the correct IP address ERROR
Socket::ResolutionError: getaddrinfo: Name or service not known
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/remote_address_spec.rb:65:in `connect'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/remote_address_spec.rb:65:in `block (4 levels) in <top (required)>'
/builddir/build/BUILD/ruby-3.3.0-071df40495/spec/ruby/library/socket/udpsocket/remote_address_spec.rb:4:in `<top (required)>'

Finished in 0.020615 seconds

11 files, 95 examples, 123 expectations, 0 failures, 4 errors, 0 tagged
make: *** [uncommon.mk:983: yes-test-spec] Error 1
make: Leaving directory '/builddir/build/BUILD/ruby-3.3.0-071df40495/redhat-linux-build'

Please note that the build environment does not have a network connection enabled by default. As soon as a network connection is available, the test cases pass just fine.

This started to happen between these two commits:

https://github.com/ruby/ruby/compare/c8b60c8ac2c8bbd077150792b5b207e983ab3634...071df40495e31f6d3fd14ae8686b01edf9a689e3

The likely culprit is commit d2ba8ea54a4089959afdeecdd963e3c4ff391748.

Originally reported here

Updated by mtasaka (Mamoru TASAKA) 11 months ago

Not sure if this is the right resolution, but at least the following change makes the above tests pass:

diff --git a/ext/socket/raddrinfo.c b/ext/socket/raddrinfo.c
index 683f9fa4b4..ad4f7267fb 100644
--- a/ext/socket/raddrinfo.c
+++ b/ext/socket/raddrinfo.c
@@ -976,6 +976,11 @@ rsock_getaddrinfo(VALUE host, VALUE port, struct addrinfo *hints, int socktype_h
         }
 
         if (!resolved) {
+#ifdef HAVE_CONST_AI_ADDRCONFIG
+            if (!hostp) {
+                hints->ai_flags &= ~AI_ADDRCONFIG;
+            }
+#endif
             error = rb_getaddrinfo(hostp, portp, hints, &ai);
             if (error == 0) {
                 res = (struct rb_addrinfo *)xmalloc(sizeof(struct rb_addrinfo));

man getaddrinfo says: "If hints.ai_flags includes the AI_ADDRCONFIG flag, ... The loopback address is not considered for this case as valid as a configured address". So the failure matches the documented behaviour, but perhaps some special handling for loopback is needed.

Updated by kjtsanaktsidis (KJ Tsanaktsidis) 11 months ago

Apologies for this - I'm having a look at it now. Do you know how I can spin up an equivalent environment like the one the Fedora package tests run in? I'm running fedora 39 on my workstation so that should be fairly simple, in theory...

Updated by vo.x (Vit Ondruch) 11 months ago

On Fedora, we are using Mock for that purpose. E.g. you need to do something like:

$ sudo dnf install mock

$ sudo usermod -a -G mock [User name]

$ # To prepare the buildroot and install all the required packages
$ mock -i gcc make ...

$ mock shell --unpriv --enable-network

This will give you a shell, where you can prepare your environment, download the sources etc. To have the network disabled, you can just drop the --enable-network option. You can access the root in /var/lib/mock/fedora-rawhide-x86_64/root.

And after you are done, you can clean up:

$ mock --scrub all

If you'd like to try a different Fedora version, you can use e.g. mock -r fedora-38-x86_64. Rawhide is the default.

Updated by kjtsanaktsidis (KJ Tsanaktsidis) 11 months ago

OK, thank you for that, I was able to get a mock up and running and I managed to reproduce the issue.

I wrote a standalone C program to debug what happens when we call getaddrinfo with various combinations of flags. Its source is here, along with the output from running it inside and outside the mock. https://gist.github.com/KJTsanaktsidis/9f58e332d2bf3ccdbc18a3ff148b5bd4

What I found is that the lookup fails inside the mock environment when all of the following hold:

  • The call to getaddrinfo asks for localhost (whether it's spelled as "localhost" or NULL doesn't matter),
  • The AI_ADDRCONFIG flag is passed,
  • And the family is explicitly set to AF_INET or AF_INET6 (i.e. NOT set to AF_UNSPEC).
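The same combinations can also be probed from Ruby rather than C. The following is only a minimal sketch, not the program from the gist: the probe helper, port 80 and the SOCK_DGRAM socket type are arbitrary choices made for illustration. Which rows fail depends on the libc and on which interfaces are configured, so no expected output is given.

```ruby
require "socket"

# Hypothetical helper: run one lookup and report whether it resolved.
# Port 80 and SOCK_DGRAM are arbitrary choices for this illustration.
def probe(host, family, flags)
  addrs = Addrinfo.getaddrinfo(host, 80, family, :DGRAM, nil, flags)
  [:ok, addrs.map(&:ip_address).uniq]
rescue SocketError => e
  # Socket::ResolutionError (Ruby 3.3+) is a subclass of SocketError.
  [:error, e.message]
end

if __FILE__ == $PROGRAM_NAME
  hosts    = { "NULL" => nil, "localhost" => "localhost" }
  families = { "AF_UNSPEC" => Socket::AF_UNSPEC,
               "AF_INET"   => Socket::AF_INET,
               "AF_INET6"  => Socket::AF_INET6 }
  flags    = { "0" => 0, "AI_ADDRCONFIG" => Socket::AI_ADDRCONFIG }

  # Print one line per (host, family, flags) combination.
  hosts.each do |host_name, host|
    families.each do |family_name, family|
      flags.each do |flag_name, flag|
        status, detail = probe(host, family, flag)
        puts format("%-10s %-10s %-14s %-6s %s",
                    host_name, family_name, flag_name, status, detail.inspect)
      end
    end
  end
end
```

Running this once with and once without --enable-network in the mock shell should show whether only the explicit-family rows with AI_ADDRCONFIG change.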

I'm umming and aah'ing as to whether this is a glibc bug or not - I might spend some time tomorrow seeing how it behaves on different systems. But in any case the whole point of this feature was to work around a different glibc bug, and if this is triggering a worse one, then we should revert it.

btw @mtasaka (Mamoru TASAKA) I don't think your patch is enough - even if hostp is not NULL, it could be "localhost" and it'll still fail. So yeah, tl;dr, I think we have to revert.

Updated by vo.x (Vit Ondruch) 11 months ago

Not sure if this is relevant, so just FYI: the --enable-network option controls:

  1. whether the host's resolv.conf is used
  2. the systemd-nspawn network capabilities, because systemd-nspawn is used in the background to run the buildroot inside a container

Updated by kjtsanaktsidis (KJ Tsanaktsidis) 11 months ago

OK, I think the solution is to partially revert that commit. The problem is that for TCPSocket connects we do getaddrinfo(3) with AF_UNSPEC and then socket(2) with the family of one of the returned addresses. BUT, the UDPSocket constructor takes an explicit family option (and defaults to AF_INET if unset), and then when UDPSocket#connect is called, we call getaddrinfo(3) with that address family. Thus, UDP sockets fall into the case I outlined in my previous message, but TCP sockets don't.

If we ever wrote a UDPSocket::new(remote_host, remote_port, local_host=nil, local_port=nil) constructor like we have for TCPSocket, that constructor could do the getaddrinfo-then-socket flow and could use AI_ADDRCONFIG. But the current constructor sets the family on the socket explicitly, so AI_ADDRCONFIG doesn't make a lot of sense. AI_ADDRCONFIG says "return addresses only of types that I could conceivably use to make a connection", but if you've already committed to what address family to use because you made the socket, you know exactly what address family you're looking for.
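The two orderings described above can be sketched in plain Ruby. This is an illustration only: resolve_then_socket and socket_then_resolve are hypothetical names, and the real TCP path lives in C (rsock_init_inetsock), not in code like this.

```ruby
require "socket"

# TCP-style flow: resolve first with AF_UNSPEC, then create a socket
# whose family matches one of the returned addresses.
# (Hypothetical helper; the socket is created but not connected.)
def resolve_then_socket(host, port)
  ai = Addrinfo.getaddrinfo(host, port, Socket::AF_UNSPEC, :STREAM).first
  sock = Socket.new(ai.afamily, Socket::SOCK_STREAM, 0)
  [ai, sock]
end

# UDP-style flow: the family is fixed when the socket is created, so the
# later lookup in UDPSocket#connect has to use that same family.
# (Hypothetical helper.)
def socket_then_resolve(host, port, family = Socket::AF_INET)
  sock = UDPSocket.new(family)
  sock.connect(host, port) # resolves host with the socket's own family
  sock
end
```

In the first flow the resolver is free to choose the family, which is where AI_ADDRCONFIG can help. In the second the family is already decided before any lookup happens.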

So let's merge https://github.com/ruby/ruby/pull/9177 I think.

Updated by vo.x (Vit Ondruch) 11 months ago

I cannot judge the change, but the PR makes the tests pass. The scratch build is here and, drilling through it, one can get to e.g. the x86_64 build log.

@kjtsanaktsidis (KJ Tsanaktsidis) thank you for looking into this.

Updated by Anonymous 11 months ago

  • Status changed from Open to Closed

Applied in changeset 25711e7063060920d14e42a530da6f7198926629.


Partially revert "Set AI_ADDRCONFIG when making getaddrinfo(3) calls"

This partially reverts commit
d2ba8ea54a4089959afdeecdd963e3c4ff391748, but for UDP sockets only.

With TCP sockets (and other things which use rsock_init_inetsock), the
order of operations is to call getaddrinfo(3) with AF_UNSPEC, look at
the returned addresses, pick one, and then call socket(2) with the
family for that address (i.e. AF_INET or AF_INET6).

With UDP sockets, however, this is reversed; UDPSocket.new takes an
address family as an argument, and then calls socket(2) with that
family. A subsequent call to UDPSocket#connect will then call
getaddrinfo(3) with that family.

The problem here is that...

  • If you are in a networking situation that only has loopback addrs,
  • And you want to look up a name like "localhost" (or NULL)
  • And you pass AF_INET or AF_INET6 as the ai_family argument to
    getaddrinfo(3),
  • And you pass AI_ADDRCONFIG to the hints argument as well,

then glibc on Linux will not return an address. This is because
AI_ADDRCONFIG is supposed to return addresses for families we actually
have an address for and could conceivably connect to, but also is
documented to explicitly ignore localhost in that situation.

It honestly doesn't make a ton of sense to pass AI_ADDRCONFIG if you're
explicitly passing the address family anyway, because you're not looking
for "an address for this name we can connect to"; you're looking for "an
IPv(4|6) address for this name". And the original glibc bug that
d2ba8ea5 was supposed to work around was related to parallel issuance of
A and AAAA queries, which of course won't happen if an address family is
explicitly specified.

So, we fix this by not passing AI_ADDRCONFIG for calls to
rsock_addrinfo that we also pass an explicit family to (i.e. for
UDPSocket).

[Bug #20048]
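The rule stated in the commit message can be expressed as a small Ruby sketch. This is not the actual C change in ext/socket; hint_flags_for is a hypothetical helper that only illustrates the decision "set AI_ADDRCONFIG when the family is unspecified, and leave it out when a family is given explicitly".

```ruby
require "socket"

# Hypothetical helper, not part of Ruby's socket extension: choose the
# getaddrinfo hint flags according to the rule in the commit message.
def hint_flags_for(family, base_flags = 0)
  if family == Socket::AF_UNSPEC
    # Family left to the resolver: AI_ADDRCONFIG filters unusable families.
    base_flags | Socket::AI_ADDRCONFIG
  else
    # Family already chosen by the caller: do not filter by configured addresses.
    base_flags & ~Socket::AI_ADDRCONFIG
  end
end
```

For example, hint_flags_for(Socket::AF_INET) leaves AI_ADDRCONFIG unset, so a loopback-only host can still resolve "localhost" for a UDP socket.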
