Bug #21685

open

Unnecessary context-switching, especially bad on multi-core machines.

Added by jpl-coconut (Jacob Lacouture) about 21 hours ago. Updated about 12 hours ago.

Status: Open
Assignee: -
Target version: -
ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux]
[ruby-core:123795]

Description

While debugging a performance issue in a large Rails application, I wrote a minimal microbenchmark that reproduces the issue [here]. I was surprised to see that the benchmark takes ~3.6sec on a single-core machine and ~36sec (10x slower) on a machine with 2 or more cores. Initially I thought this was a bug in the implementation of Thread::Queue, but I soon realized it relates to how Ruby reschedules threads around system calls.
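For readers without access to the linked script, here is a minimal sketch of the kind of benchmark being described. It is a guess at the general shape, not the actual qtest_simple.rb: the method names, the two-queue ping-pong layout, and the iteration count are all assumptions. It prints the context-switch counters that Linux exposes in /proc/self/status, which is where lines such as `voluntary_ctxt_switches` come from.

```ruby
# Hypothetical stand-in for qtest_simple.rb; the original script is not shown here.
# Two threads bounce a token back and forth through a pair of Thread::Queue objects.

def ping_pong(iterations)
  requests = Thread::Queue.new
  replies  = Thread::Queue.new

  # Worker: echo each request back, stop on the :done sentinel.
  worker = Thread.new do
    while (msg = requests.pop) != :done
      replies << msg
    end
  end

  received = 0
  iterations.times do |i|
    requests << i
    received += 1 if replies.pop == i
  end
  requests << :done
  worker.join
  received
end

# Linux only: return the context-switch lines reported by /proc/self/status.
# Returns an empty array on systems without that file.
def ctxt_switch_lines
  path = "/proc/self/status"
  return [] unless File.readable?(path)
  File.readlines(path, chomp: true).grep(/ctxt_switches/)
end

ITERATIONS = 20_000 # arbitrary; not taken from the report
ping_pong(ITERATIONS)
puts ctxt_switch_lines
```

Running it under `time taskset --cpu-list 1` and then `--cpu-list 1,2`, as in the results below, would show whether a given build exhibits a similar single-core versus multi-core gap. Whether this particular sketch reproduces the reported slowdown is not guaranteed.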

I prepared a fix in [this branch], which is based on Ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable.
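To illustrate the idea only, here is a toy model of the policy in Ruby. It is not the actual change (which lives in the VM's C code and is not shown here). The method name and the sample durations are made up; only the 100usec threshold comes from the description above.

```ruby
# Toy model of the deferral policy; not the actual VM change.
DEFER_THRESHOLD_USEC = 100 # the interval mentioned in the report

# Given syscall durations in microseconds, count how many would result in the
# calling thread being suspended. A threshold of 0 models suspending around
# every syscall; a positive threshold models deferring the suspension until
# the syscall has been running at least that long.
def suspensions(durations_usec, threshold_usec)
  durations_usec.count { |d| d >= threshold_usec }
end

durations = [2, 5, 3, 250, 4, 1_000, 6] # made-up sample: mostly short syscalls
eager    = suspensions(durations, 0)
deferred = suspensions(durations, DEFER_THRESHOLD_USEC)
puts "eager: #{eager} suspensions, deferred: #{deferred} suspensions"
# prints "eager: 7 suspensions, deferred: 2 suspensions"
```

In this model, short syscalls never pay for a thread handoff, while long-running ones still let other threads run once the threshold has passed.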

I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4.

After the fix:

  • Single-core runtime drops by about 44%, from 3.6sec to 2sec.
  • Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower.
  • Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80
  • system_time/user_time ratio drops from (1.2 - 1.6) to 0.65

Here are the benchmark results before my change:

# time taskset --cpu-list 1 ./ruby qtest_simple.rb
voluntary_ctxt_switches:	1140773
nonvoluntary_ctxt_switches:	9487
real	0m3.619s
user	0m1.653s
sys	0m1.950s

# time taskset --cpu-list 1,2 ./ruby qtest_simple.rb
voluntary_ctxt_switches:	1400110
nonvoluntary_ctxt_switches:	3
real	0m36.223s
user	0m9.380s
sys	0m14.927s

And after:

# time taskset --cpu-list 1 ./ruby qtest_simple.rb
voluntary_ctxt_switches:	88
nonvoluntary_ctxt_switches:	899
real	0m2.031s
user	0m1.209s
sys	0m0.743s

# time taskset --cpu-list 1,2 ./ruby qtest_simple.rb
voluntary_ctxt_switches:	75
nonvoluntary_ctxt_switches:	8
real	0m2.062s
user	0m1.279s
sys	0m0.783s
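
The summary figures can be re-derived from the raw timings above. The following sketch just does that arithmetic; the hash layout and method names are illustrative, and the numbers are copied from the four qtest_simple.rb runs.

```ruby
# Timings in seconds, copied from the qtest_simple.rb runs above.
RUNS = {
  before_1cpu: { real: 3.619,  user: 1.653, sys: 1.950 },
  before_2cpu: { real: 36.223, user: 9.380, sys: 14.927 },
  after_1cpu:  { real: 2.031,  user: 1.209, sys: 0.743 },
  after_2cpu:  { real: 2.062,  user: 1.279, sys: 0.783 },
}.freeze

# Ratio of kernel time to user time for one run.
def sys_user_ratio(run)
  run[:sys] / run[:user]
end

# Percentage by which wall-clock time shrank between two runs.
def runtime_reduction_pct(before, after)
  100.0 * (before[:real] - after[:real]) / before[:real]
end

RUNS.each do |name, run|
  puts format("%-12s sys/user = %.2f", name, sys_user_ratio(run))
end
puts format("1 CPU runtime reduction: %.1f%%",
            runtime_reduction_pct(RUNS[:before_1cpu], RUNS[:after_1cpu]))
puts format("2 CPU runtime reduction: %.1f%%",
            runtime_reduction_pct(RUNS[:before_2cpu], RUNS[:after_2cpu]))
```

Note that a "percent improvement" depends on whether it is measured as reduced runtime or as increased throughput, so it helps to state which one is meant.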

I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [here].
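
As a rough illustration of what a Thread::Queue-free repro could look like, here is a sketch in which each thread issues many very short system calls on its own pipe. This is not the linked mbenchmark.rb; the method names, the pipe-based workload, and the counts are assumptions, and it may or may not show the same slowdown.

```ruby
# Hypothetical stand-in for mbenchmark.rb; the original script is not shown here.
# Each thread performs many very short system calls on its own pipe and never
# touches a Thread::Queue.

def short_syscalls(count)
  reader, writer = IO.pipe
  count.times do
    writer.syswrite("x") # write a single byte
    reader.sysread(1)    # read it back; the byte is already buffered
  end
  count
ensure
  reader&.close
  writer&.close
end

# Start the given number of threads and return the total number of
# write/read pairs they completed.
def run_workers(threads:, calls_per_thread:)
  workers = Array.new(threads) { Thread.new { short_syscalls(calls_per_thread) } }
  workers.sum(&:value)
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
total   = run_workers(threads: 2, calls_per_thread: 20_000) # arbitrary sizes
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts format("%d write/read pairs in %.3fs", total, elapsed)
```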

Results summary:

  • Single-core performance improves (this time by only 30%)
  • Multi-core penalty drops from 4x to 0.
  • No change to context-switching rates.
  • system_time/user_time ratio drops from (0.5-1) to 0.15

Before fix:

# time taskset --cpu-list 1 ./ruby mbenchmark.rb
voluntary_ctxt_switches:	60
real	0m0.336s
user	0m0.211s
sys	0m0.118s

# time taskset --cpu-list 1,2 ./ruby mbenchmark.rb
voluntary_ctxt_switches:	60
real	0m1.424s
user	0m0.468s
sys	0m0.496s

After fix:

# time taskset --cpu-list 1 ./ruby mbenchmark.rb
voluntary_ctxt_switches:	59
real	0m0.241s
user	0m0.202s
sys	0m0.032s

# time taskset --cpu-list 1,2 ./ruby mbenchmark.rb
voluntary_ctxt_switches:	60
real	0m0.238s
user	0m0.195s
sys	0m0.035s

Updated by byroot (Jean Boussier) about 12 hours ago #1 [ruby-core:123797]

> The fix simply defers suspending the thread until the syscall has been running for some short interval.

That's an idea we discussed in the past with @jhawthorn (John Hawthorn) @tenderlovemaking (Aaron Patterson) and @luke-gru (Luke Gruber). IIRC that's something Go does?
