Bug #20237

closed

Unable to unshare(CLONE_NEWUSER) in Linux because of timer thread

Added by hanazuki (Kasumi Hanazuki) 9 months ago. Updated about 2 months ago.

Status:
Closed
Target version:
-
ruby -v:
ruby 3.4.0dev (2024-02-04T16:05:02Z master 8bc6fff322) [x86_64-linux]
[ruby-core:116581]

Description

Background

unshare(2) is a Linux syscall that moves the calling process into a fresh execution context. With unshare(CLONE_NEWUSER), a process can move into a new user namespace (see user_namespaces(7)), where it gains full capabilities over the resources within that namespace. This is fundamental to how Linux containers achieve privilege separation. unshare(CLONE_NEWUSER) requires the calling process to be single-threaded (that is, no background threads may be running), so it is often invoked right after fork(2), because forking propagates only the calling thread to the child process.

Problem

The problem is that Ruby 3.3 on Linux uses a timer thread even in a single-threaded application. Because Kernel#fork spawns this thread in the child process before control returns to user code, there is no chance to call unshare(CLONE_NEWUSER) from Ruby while the child is still single-threaded.

The following snippet is a reproducer of this problem. This program first forks and then shows the user namespace to which the process belongs before and after calling unshare(2). It also shows the threads of the child process after forking.

p(RUBY_DESCRIPTION:)
require 'fiddle/import'
module C
  extend Fiddle::Importer
  dlload 'libc.so.6'

  extern 'int unshare(int flags)'
  CLONE_NEWUSER = 0x10000000

  def self.raise_system_call_error
    raise SystemCallError.new(Fiddle.last_error)
  end
end

pid = fork do
  system("ps -O tid -T -p #$$")
  system("ls -l /proc/self/ns/user")

  if C.unshare(C::CLONE_NEWUSER) != 0
    C.raise_system_call_error  # => EINVAL with Ruby 3.3
  end

  system("ls -l /proc/self/ns/user")
end

p Process.wait2(pid)

The program successfully changes the user namespace with Ruby 3.2, but it raises EINVAL with Ruby 3.3. You can see Ruby 3.3 has two threads running after forking.

% rbenv shell 3.2 && ruby ./test.rb
{:RUBY_DESCRIPTION=>"ruby 3.2.3 (2024-01-18 revision 52bb2ac0a6) [x86_64-linux]"}
    PID     TID S TTY          TIME COMMAND
1585787 1585787 S pts/12   00:00:00 ruby ./test.rb
lrwxrwxrwx 1 kasumi kasumi 0 Feb  5 02:25 /proc/self/ns/user -> 'user:[4026531837]'
lrwxrwxrwx 1 nobody nogroup 0 Feb  5 02:25 /proc/self/ns/user -> 'user:[4026532675]'
[1585787, #<Process::Status: pid 1585787 exit 0>]

% rbenv shell 3.3 && ruby ./test.rb
{:RUBY_DESCRIPTION=>"ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]"}
    PID     TID S TTY          TIME COMMAND
1585849 1585849 S pts/12   00:00:00 ruby ./test.rb
1585849 1585851 S pts/12   00:00:00 ruby ./test.rb
lrwxrwxrwx 1 kasumi kasumi 0 Feb  5 02:25 /proc/self/ns/user -> 'user:[4026531837]'
./test.rb:10:in `raise_system_call_error': Invalid argument (Errno::EINVAL)
        from ./test.rb:24:in `block in <main>'
        from ./test.rb:19:in `fork'
        from ./test.rb:19:in `<main>'
[1585849, #<Process::Status: pid 1585849 exit 1>]

% rbenv shell master && ruby ./test.rb
{:RUBY_DESCRIPTION=>"ruby 3.4.0dev (2024-02-04T16:05:02Z master 8bc6fff322) [x86_64-linux]"}
    PID     TID S TTY          TIME COMMAND
1585965 1585965 S pts/12   00:00:00 ruby ./test.rb
1585965 1585967 S pts/12   00:00:00 ruby ./test.rb
lrwxrwxrwx 1 kasumi kasumi 0 Feb  5 02:25 /proc/self/ns/user -> 'user:[4026531837]'
./test.rb:10:in `raise_system_call_error': Invalid argument (Errno::EINVAL)
        from ./test.rb:24:in `block in <main>'
        from ./test.rb:19:in `fork'
        from ./test.rb:19:in `<main>'
[1585965, #<Process::Status: pid 1585965 exit 1>]
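
The thread counts shown by ps above can also be checked from Ruby itself. The following is a minimal sketch, not part of the original report: it assumes a Linux procfs where each native thread appears as an entry under /proc/<pid>/task, and the helper names native_thread_count and child_thread_count_after_fork are made up for illustration. Based on the outputs above, the child would be expected to report one thread on Ruby 3.2 and more than one on Ruby 3.3.

```ruby
# Count the native threads (tasks) of a process by listing /proc/<pid>/task.
# Returns nil where procfs is unavailable (for example on non-Linux systems).
def native_thread_count(pid = "self")
  dir = "/proc/#{pid}/task"
  return nil unless Dir.exist?(dir)
  Dir.children(dir).size
end

# Fork a child, let it count its own native threads immediately,
# and send the number back to the parent through a pipe.
def child_thread_count_after_fork
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.puts(native_thread_count.inspect)
    writer.close
    exit!(0)
  end
  writer.close
  line = reader.read.strip
  reader.close
  Process.wait(pid)
  line == "nil" ? nil : Integer(line)
end

puts "native threads in forked child: #{child_thread_count_after_fork.inspect}"
```

Any count greater than one at this point means unshare(CLONE_NEWUSER) would fail with EINVAL in the child.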

Workaround

My workaround is to rebuild ruby with rb_thread_stop_timer_thread and rb_thread_start_timer_thread exported, and to use a C extension that stops the timer thread before calling unshare. This does not seem robust, because the process cannot know when the terminated thread has been reclaimed by the kernel; only after that is the process considered single-threaded.

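#define _GNU_SOURCE 1
#include <sched.h>
#include <unistd.h>  // for usleep()
#include <ruby/ruby.h>

// Not declared in the public headers; these prototypes are assumed to match
// the interpreter-internal functions exported by the rebuilt ruby.
void rb_thread_stop_timer_thread(void);
void rb_thread_start_timer_thread(void);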

static VALUE Unshare_s_unshare(VALUE _self, VALUE rflags) {
  int const flags = NUM2INT(rflags);
  rb_thread_stop_timer_thread();
  usleep(1000);  // FIXME: It takes some time for the kernel to remove the stopped thread?
  int const ret  = unshare(flags);
  rb_thread_start_timer_thread();
  if(ret != 0) rb_sys_fail_str(rb_sprintf("unshare(%#x)", flags));
  return Qnil;
}


RUBY_FUNC_EXPORTED void
Init_unshare(void) {
  VALUE rb_mUnshare = rb_define_module("Unshare");
  rb_define_singleton_method(rb_mUnshare, "unshare", Unshare_s_unshare, 1);
  rb_define_const(rb_mUnshare, "CLONE_NEWUSER", INT2FIX(CLONE_NEWUSER));
}
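
Instead of the fixed usleep(1000), one could poll until the kernel reports a single remaining thread. The sketch below only illustrates that polling logic in Ruby; it is an assumption-laden example rather than a tested fix. The helper name wait_until_single_threaded is hypothetical, it relies on /proc/self/task listing one entry per native thread, and in the actual workaround the equivalent loop would have to live in the C extension between stopping the timer thread and calling unshare.

```ruby
# Hypothetical helper: poll /proc/self/task until only one native thread is
# left or the timeout expires.
# Returns true when single-threaded, false on timeout, nil without procfs.
def wait_until_single_threaded(timeout: 1.0, interval: 0.001)
  task_dir = "/proc/self/task"
  return nil unless Dir.exist?(task_dir)

  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if Dir.children(task_dir).size == 1
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep(interval)
  end
end

puts "single-threaded: #{wait_until_single_threaded(timeout: 0.05).inspect}"
```

A bounded wait like this at least turns a silent race into an explicit timeout, although it still cannot prevent another thread from being started afterwards.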

Questions

  • Is this a limitation of Ruby?
  • Is it safe (or even possible) to stop the timer thread during execution?
    • If so, can we export it as the public API?
    • But it may not be so useful for this problem, as explained in the workaround.
  • Is it guaranteed that no other threads are running after forks?
  • Are there any better ways to solve this issue?
    • Can we somehow delay the start of the timer thread after forking, or hook into fork to run some code in the child process immediately after it spawns?
    • Can they be Ruby API instead of C API?

Updated by mame (Yusuke Endoh) 9 months ago

  • Status changed from Open to Assigned
  • Assignee set to ko1 (Koichi Sasada)

Updated by hanazuki (Kasumi Hanazuki) 9 months ago

Another option would be to define something like a fork_then_unshare(unshare_flags:, &block) method in a C extension, but because you would usually want to set up and clean up your environment between fork and unshare, this C function could become huge and kill the flexibility of Ruby.

Updated by hanazuki (Kasumi Hanazuki) 9 months ago

hanazuki (Kasumi Hanazuki) wrote in #note-2:

Another option would be to define something like a fork_then_unshare(unshare_flags:, &block) method in a C extension, but because you would usually want to set up and clean up your environment between fork and unshare, this C function could become huge and kill the flexibility of Ruby.

After some experiments, I found that this approach doesn't work with the current API. IIUC, the only official way for a native extension to properly fork the Ruby interpreter is to call Process.fork (a plain invocation of fork(2) followed by rb_thread_atfork seems to break something). Therefore, extensions have no more control than pure-Ruby code over how the process is forked. Specifically, they can't execute any additional code before the child process starts the background thread.
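
For comparison, the Ruby-level fork hook can be sketched as follows. This is an illustration under assumptions, not something proposed in the thread: it assumes Ruby 3.1 or later, where Process._fork can be overridden by prepending a module, and the module name ReportThreadsAfterFork is made up. The hook body runs in the child only after the interpreter's own fork handling has returned, so, consistent with the observation above, it would be expected to see the background thread already running on Ruby 3.3.

```ruby
# Hypothetical module; overriding Process._fork via prepend requires Ruby 3.1+.
module ReportThreadsAfterFork
  def _fork
    pid = super
    if pid == 0
      # In the child: report how many native threads exist at this point.
      task_dir = "/proc/self/task"
      count = Dir.exist?(task_dir) ? Dir.children(task_dir).size : nil
      warn "child #{Process.pid}: native threads seen in _fork hook = #{count.inspect}"
    end
    pid
  end
end

Process.singleton_class.prepend(ReportThreadsAfterFork)

# Kernel#fork goes through Process._fork, so the hook above runs in the child.
pid = fork { exit!(0) }
Process.wait(pid)
```

If the reported count is greater than one, calling unshare(CLONE_NEWUSER) from such a hook would hit the same EINVAL as the reproducer.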

Updated by kjtsanaktsidis (KJ Tsanaktsidis) 9 months ago

or hook into fork to run some code in the child process immediately after it spawns

If your objective is "from a C extension, fork, set up the child process whilst it is still single threaded, and then return to Ruby".... you could possibly do this by registering a pthread_atfork function (and then unregistering it after you fork and after it runs, I suppose)

I've run into the same issue before (with pid namespaces though, which have the same problem).

Updated by hanazuki (Kasumi Hanazuki) 9 months ago

kjtsanaktsidis (KJ Tsanaktsidis) wrote in #note-4:

or hook into fork to run some code in the child process immediately after it spawns

If your objective is "from a C extension, fork, set up the child process whilst it is still single threaded, and then return to Ruby".... you could possibly do this by registering a pthread_atfork function (and then unregistering it after you fork and after it runs, I suppose)

Thank you for your advice. It looks something like this works:

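#include <functional>
#include <optional>
#include <pthread.h>
#include <sched.h>
#include <ruby/ruby.h>

namespace {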
  thread_local std::optional<std::function<void()>> atfork_init;
  void atfork_child() {
    if(atfork_init) (*atfork_init)();
    atfork_init.reset();
  }
  void atfork_parent() {
    atfork_init.reset();
  }

  VALUE Namespace_s_fork(VALUE _self, VALUE opts) {
    atfork_init = [flags = /*...*/]() {
      if(unshare(flags) != 0) {
        // TODO: handle error
      }
    };

    auto const pid = rb_funcall(rb_mProcess, rb_intern("_fork"), 0);
    if(FIX2INT(pid) == 0) {
      if(rb_block_given_p()) {
        int status;
        rb_protect(rb_yield, Qundef, &status);
        ruby_stop(status);
      }
      return Qnil;
    }

    return pid;
  }
}

extern "C" {
  RUBY_FUNC_EXPORTED void
  Init_namespace_ext() {
    if(pthread_atfork(nullptr, atfork_parent, atfork_child) != 0) {
      rb_sys_fail("pthread_atfork()");
    }
    auto rb_mNamespace = rb_define_module("Namespace");
    rb_define_singleton_method(rb_mNamespace, "fork", Namespace_s_fork, 1);
  }
}

Updated by kjtsanaktsidis (KJ Tsanaktsidis) 9 months ago

It looks something like this works:

Won't win any awards for beauty perhaps, but gets the job done!

Is this enough to meet your needs? If so, I can close this out. Otherwise, perhaps a good next step would be to open a new feature request issue with a proposal of what interface, specifically, you want, and why?

From a stability perspective, I would think this is something that should work in future versions of Ruby (of course, I can't make any promises though). It's used in some pretty prevalent gems, like the grpc gem for instance (https://github.com/grpc/grpc/blob/038215b504b9027ac85527f5fdcd85c76b7e3a1f/src/core/lib/iomgr/fork_posix.cc#L117).

Updated by hanazuki (Kasumi Hanazuki) 9 months ago

kjtsanaktsidis (KJ Tsanaktsidis) wrote in #note-6:

Is this enough to meet your needs? If so, I can close this out.

Maybe, yes. It's unfortunate that I have to rewrite a few lines of Ruby as 100-200 lines of C++, though.
I wonder whether this change is the inevitable price of a future with M:N threads.

Anyway, this is no longer my blocker. Thanks.

Updated by ko1 (Koichi Sasada) 9 months ago

Creating the timer thread lazily is on the task list, but I'm not sure when we can get to it.

Updated by hanazuki (Kasumi Hanazuki) about 2 months ago

Thank you @ko1 (Koichi Sasada) for sharing the current situation. I'm fine with closing this ticket as is, since the behavior is due to a design decision and (AFAIK) the original behavior was never documented.

Updated by jeremyevans0 (Jeremy Evans) about 2 months ago

  • Status changed from Assigned to Closed