Bug #20816


Potential regression in Ruby 3.3.x (compared with 3.1 and 3.2) regarding fast syscalls and multi-threading.

Added by adrienjarthon (Adrien Jarthon) 3 months ago. Updated 8 days ago.

Status:
Open
Target version:
-
[ruby-core:119621]

Description

OK, apologies in advance if this is not really a regression (maybe it's the cost to pay for other improvements), but it still showed up as a 30% performance drop in my benchmark, so I think it may be worth a look.

For the full story about how I went down this rabbit hole, I wrote a blog post to avoid cluttering the issue: https://updown.io/blog/weird-results-comparing-ruby-3-1-3-2-3-3-with-jemalloc-and-yjit (please read it first).

Here I'll summarize my latest findings and the smallest reproducible example I was able to write.

So basically I noticed that quick syscalls (at least with a good SSD) like File.exist? or File.mtime, when used in a multi-threaded environment (Sidekiq in my initial testing, but I reduced it to a few Thread.new calls), are significantly slower on 3.3.x (same results on 3.3.0, 3.3.2 and 3.3.5 so far) compared with 3.1.2 or 3.2.5: about twice as slow.

These syscalls are heavily used by the Rails reloader, which is why it had such a huge impact (30%) in my testing.
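
(For context, a reloader-style check is essentially mtime polling. Here is a generic sketch of the pattern, not ActiveSupport's actual implementation:)

class MtimeWatcher
  def initialize(paths)
    @paths = paths
    @mtimes = snapshot
  end

  # Each call re-stats every watched file, so a large app issues many
  # cheap File.mtime syscalls per check.
  def changed?
    current = snapshot
    changed = (current != @mtimes)
    @mtimes = current
    changed
  end

  private

  def snapshot
    @paths.to_h { |path| [path, File.exist?(path) && File.mtime(path)] }
  end
end

watcher = MtimeWatcher.new(Dir["app/**/*.rb"])
puts "reload!" if watcher.changed?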

Here is the small reproduction script I wrote:

#!/usr/bin/env ruby

GC.disable # just to make sure it doesn't skew results
THREADS = (ARGV.first || 10).to_i  # choose thread count from CLI
N = 10_000_000 # to make the test longer or shorter

puts "Ruby #{RUBY_VERSION} / YJIT: #{RubyVM::YJIT.enabled?} / #{THREADS} Threads"

Array.new(THREADS) do
  Thread.new do
    (N/THREADS).times { File.mtime(__FILE__) }
  end
end.each(&:join)

puts "#{N} calls completed"

The goal is simply to run a lot of fast syscalls across a varying number of threads. Here are the numbers on my machine:

Threads | Ruby 3.1.2 | Ruby 3.3.5
1       | 13.7s      | ✅ 14.2s
2       | 9.7s       | ⚠️ 33.4s
4       | 23.2s      | ⚠️ 37.0s
8       | 21.8s      | ⚠️ 39.7s
16      | 23.1s      | ⚠️ 41.5s

(more numbers in the blog post)

I suspected the new M:N thread scheduler introduced in 3.3, even though it's "disabled" by default on the main Ractor, thinking these changes may have degraded this (quite specific) scenario a bit. So I gave it a quick try with the feature ENABLED (RUBY_MN_THREADS=1), to see if it would at least change anything, and it did! (though not what I expected):

Threads | Ruby 3.1.2 | Ruby 3.3.5 | 3.3.5 with RUBY_MN_THREADS=1
1       | 13.7s      | 14.2s      | 14.5s ✅
2       | 9.7s       | ⚠️ 33.4s   | 12.3s 🎉
4       | 23.2s      | ⚠️ 37.0s   | 40.5s 🧐
8       | 21.8s      | ⚠️ 39.7s   | 37.9s 🧐
16      | 23.1s      | ⚠️ 41.5s   | 36.3s 🧐

(I can make graphs for these numbers if it helps)

Now I'm a bit stuck in my investigation: I'm not sure which change in 3.3 caused this (the new thread scheduler is a good lead IMO), or whether it even is a regression. Maybe this downside is inevitable in exchange for other benefits? I'm happy to contribute more time to this issue and perform more tests, but I would need some guidance on what would be interesting to test from people who understand a bit better than me how this works internally.

Thanks! (and sorry for the long read)



Updated by nobu (Nobuyoshi Nakada) 3 months ago

The result on my machine (Apple M2 Pro, 12 processors).
Note that N=5_000_000, half of your example; times are in seconds.

Threads | 3.1.4  | 3.2.5  | 3.3.5  | 3.4.0
1       | 8.818  | 5.628  | 5.943  | 6.066
2       | 10.389 | 6.920  | 11.252 | 12.315
4       | 18.643 | 15.347 | 20.285 | 20.184
8       | 18.766 | 14.994 | 20.155 | 19.923
16      | 17.530 | 13.459 | 19.988 | 19.705

Updated by adrienjarthon (Adrien Jarthon) 3 months ago

Thanks, these numbers are slightly more "expected" than mine (Ubuntu 24.04), especially for 2 threads, but other than that it looks like the "regression" is also visible on macOS.
For the record, I also tested with ruby master (3.4.0 as of 2024-10-25), but the results were similar to 3.3.5 (as they were for you).

Updated by byroot (Jean Boussier) 3 months ago

I can also repro on macOS. Comparing profiles between 3.2 and 3.3, in both cases the overwhelming majority of the time is spent in the thread scheduler.

The one big difference I can see is that 3.2 seems to use poll to wait on threads, while 3.3 uses pthread_cond_wait.

Updated by luke-gru (Luke Gruber) 8 days ago

I think this is due to the refactor for M:N threads. In the 3.2 release, which is fast even though it still uses condition variables in thread_sched_to_running_common, a switch to another thread was never forced during a blocking operation.

That is, in BLOCKING_REGION (thread.c), it goes:

  1. thread_sched_to_waiting(th) -- this releases the per-ractor GVL (TH_SCHED(th)->running changes from the current ruby thread (GET_THREAD()) to NULL). Then, signal the next ruby thread in this ractor's readyq, if any.
  2. Execute the blocking operation, like File.mtime.
  3. thread_sched_to_running(th) -- blocks until another thread signals it if another thread is running and has the GVL; otherwise continues without blocking.

Lots of times, since #2 can be so fast, when we get to #3 sched->running is still NULL from #1, because the other thread we signalled hasn't woken up yet and taken control of sched->running. The OS scheduler hasn't switched, so there are fewer context switches. I see this in some custom printf logging I added, where one thread runs many iterations of the syscall, then another thread runs many, etc.
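
To make the shape of that fast path concrete, here is a toy Ruby model of the 3.2-style "lazy" handoff (an illustrative sketch with made-up names; the real code in thread.c is C):

# Toy model of the 3.2-style lazy handoff (illustration only, not thread.c):
class LazyGVL
  def initialize
    @lock = Mutex.new
    @cond = ConditionVariable.new
    @running = nil # the thread currently holding the toy GVL, or nil
  end

  # ~ thread_sched_to_waiting: give up the GVL and wake one waiter,
  # but do NOT hand ownership over; the GVL simply becomes free.
  def to_waiting
    @lock.synchronize do
      @running = nil
      @cond.signal
    end
  end

  # ~ thread_sched_to_running: fast path -- if nobody claimed the GVL
  # while we were in the quick syscall, retake it without ever sleeping.
  def to_running(th)
    @lock.synchronize do
      @cond.wait(@lock) while @running
      @running = th
    end
  end
end

If the syscall finishes before the signalled waiter is actually scheduled by the OS, @running is still nil and the caller retakes the GVL without a context switch, which matches the batching visible in the logs.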

In the new system,

  1. Similar to before, but if there's a next thread ready (in the readyq) we actually force it to hold the GVL (we set sched->running = next_th), then we signal its condvar telling it to run
  2. Execute blocking operation
  3. Put ourselves in the back of the readyq, and wait until we're signalled

So in summary, there is more context switching, which I believe is the main slowdown. There is also more overhead now due to timeslice bookkeeping, which I think can be reduced.
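
In the same toy terms, the new eager handoff looks roughly like this (again an illustrative sketch, not the actual scheduler code): ownership is transferred up front, so the returning thread re-queues and sleeps even when the syscall was nearly instant:

# Toy model of the 3.3-style eager handoff (illustration only):
class EagerGVL
  def initialize
    @lock = Mutex.new
    @cond = ConditionVariable.new
    @readyq = []
    @running = nil
  end

  # ~ step 1: hand the GVL to the next ready thread NOW and wake it.
  def to_waiting
    @lock.synchronize do
      @running = @readyq.shift # nil if nobody is waiting
      @cond.broadcast
    end
  end

  # ~ step 3: join the back of the queue; unless nobody else wants the
  # GVL, we always sleep here, even after a microsecond-long syscall.
  def to_running(th)
    @lock.synchronize do
      if @running.nil? && @readyq.empty?
        @running = th # uncontended: take the GVL directly
      else
        @readyq << th
        @cond.wait(@lock) until @running == th
      end
    end
  end
end

Because ownership is transferred before the blocking call even finishes, the returning thread almost always finds @running pointing at someone else and has to sleep: one context switch per fast syscall.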

The new ractor/thread scheduling system is more complex than the old one, but I think we can fix this issue. I read that Go does some fancy things with fast syscalls: when a known-fast syscall is made, the goroutine scheduler doesn't signal the next goroutine, but instead registers with a monitoring thread using a small timeout value. If the syscall exceeds that timeout, the next goroutine is scheduled; otherwise the next goroutine is never signalled at all, and the calling goroutine takes itself off the monitoring thread and continues. There's some information about that here: https://utcc.utoronto.ca/~cks/space/blog/programming/GoSchedulerAndSyscalls.
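
A rough Ruby sketch of that idea, just to make the mechanism concrete (hypothetical, with made-up names and an arbitrary threshold; this is neither CRuby nor Go code):

# Hypothetical sketch of a Go-style "sysmon" handoff (illustration only):
class MonitoredGVL
  THRESHOLD = 0.0001 # 100us; an arbitrary illustrative value

  def initialize
    @lock = Mutex.new
    @cond = ConditionVariable.new
    @readyq = []
    @running = nil
    @in_syscall = {} # thread => syscall start time
    start_monitor
  end

  # Take the GVL, queueing behind the current holder if necessary.
  def acquire(th)
    @lock.synchronize do
      if @running.nil? && @readyq.empty?
        @running = th
      else
        @readyq << th
        @cond.wait(@lock) until @running == th
      end
    end
  end

  def release(th)
    @lock.synchronize do
      @running = @readyq.shift
      @cond.broadcast
    end
  end

  # Run a blocking call while KEEPING the GVL; only the monitor can take
  # it away, and only once the call has overrun THRESHOLD.
  def blocking_region(th)
    @lock.synchronize { @in_syscall[th] = clock }
    result = yield
    @lock.synchronize do
      @in_syscall.delete(th)
      unless @running == th # the monitor revoked the GVL while we blocked
        @readyq << th
        @cond.wait(@lock) until @running == th
      end
    end
    result
  end

  private

  def clock = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  def start_monitor
    Thread.new do
      loop do
        sleep THRESHOLD
        @lock.synchronize do
          started = @in_syscall[@running]
          if started && !@readyq.empty? && clock - started > THRESHOLD
            @running = @readyq.shift # revoke: schedule the next thread
            @cond.broadcast
          end
        end
      end
    end
  end
end

In this shape the fast path never touches the condvar at all; the handoff cost moves to the monitor thread, which only intervenes for genuinely slow calls.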

We could do something similar to Go or try to incorporate more of how it used to work into the new system. I think doing something similar to Go makes sense with how the new scheduler works.

Updated by byroot (Jean Boussier) 8 days ago

  • Assignee set to ko1 (Koichi Sasada)

@ko1 (Koichi Sasada) your input would be welcome on this issue.
