Project

General

Profile

Actions

Feature #22121

open

Introduce Parallel Sweep feature

Feature #22121: Introduce Parallel Sweep feature
1

Added by luke-gru (Luke Gruber) 10 days ago. Updated 6 days ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:125796]

Description

Abstract

Ruby's GC sweep implementation is currently incremental and lazy. This is to reduce pause times when sweeping. However, the sweep implementation doesn't take advantage of parallelism (multi-core). Sweeping in a GC is not an "embarassingly parallel" problem, but extra threads can help. I would like to introduce such a feature to Ruby so that users can take advantage of their multi-core CPUs to reduce GC pause times.

Design

I would like to have an additional "sweep thread" that sweeps alongside the Ruby GC thread (in parallel) and at the same time as Ruby code is running (concurrent). When the sweep thread is working alongside the Ruby GC thread, both threads grab pages from the current heap. If the sweep thread has already swept a page, the Ruby GC thread finishes it by clearing its bitmaps and adding the page to the free pages or empty pages lists. If there are no swept pages from the sweep thread, the Ruby GC thread sweeps a page by itself instead of waiting for swept pages.

When an incremental sweep step is over, the sweep thread sweeps 1 incremental step's budget worth of slots while Ruby code is running. This is so that during the next incremental step, the Ruby GC thread just has to finish the pages off instead of sweeping the objects.

Limitations

Certain objects aren't safe to be freed by the sweep thread. T_DATA types from native extensions cannot be swept in general (although most are safe) because the user's sweep function may not be thread-safe. That function may modify global state in such a way that when it is called from both the sweep thread AND the Ruby GC thread at once, it behaves badly. In order to get around this, we introduce a new TypedData flag. T_DATA internal to the VM mostly have this flag set, and native extension authors can set this flag if it is defined to allow their type to be swept concurrently.

This feature is only available for the default GC. Ruby's MMTK garbage collector has its own implementation of concurrent sweeping that is not affected by this feature.

This feature is only available for pthread platforms, although that restriction could be lifted with a bit of work.

This feature is not on by default (see section Building).

Implementation

The PR is currently a draft, but it's in a working state. Please play with it and tell me what you think!

Building

./configure --enable-parallel-sweep
make -j
./ruby --enable-parallel-sweep -v
#=> ruby 4.1.0dev (2026-06-19T15:49:17Z parallel-sweep cd7e59d45b) +PRISM +Parallel-Sweep [arm64-darwin25]

Benchmarks

This image below is from running ruby-bench --headline benchmarks 10 times and taking the median of means of the runs.
The ruby-bench command was:

./run_benchmarks.rb --no-pinning --interleave --chruby="ruby_ctl::ruby_ctl --yjit" --chruby="psweep_lockfree::psweep_lockfree --yjit --enable-parallel-sweep" --headline

If you want sweep info per run, you can run it with the GC harness:

./run_benchmarks.rb --no-pinning --interleave --chruby="ruby_ctl::ruby_ctl --yjit" --chruby="psweep_lockfree::psweep_lockfree --yjit --enable-parallel-sweep" --harness=gc --headline


I encourage others to experiment with this feature, run your own benchmarks and share them here. When running your benchmarks, make sure not to pin the process to a CPU with taskset, otherwise the sweep thread will not run in parallel.

Note

Sometimes there are regressions run to run. Run the benchmark 5-10 times to get more accurate results.
GC micro-benchmarks may not see any improvements or even slight regressions. The implementation targets a workload where there is sufficient time in between sweep steps for the sweep thread to do its work.

Warning

Benchmarking on MacOS tends to be less accurate than Linux. However, if that's all you have and want to share your results, please do.

Future Work

In the future, I would like it if we could process the metadata of a page in the sweep thread. For example, if we could clear the bitmaps and unlink an empty page, or even add a page to the free pages list then the Ruby pause time would get even smaller (or none at all). I did have a prototype of this, but it caused issues because it was creating too many empty pages too fast, and that had unintended consequences for the rest of the GC.

I also believe parallel marking would have a large benefit. Major GCs take a long time due to marking, and if you have lots of threads of fibers than it takes even longer. Marking is a more naturally parallel problem, so would likely benefit from more than 1 worker thread.


Files

clipboard-202606191327-7umxq.png (342 KB) clipboard-202606191327-7umxq.png luke-gru (Luke Gruber), 06/19/2026 05:27 PM

Updated by ko1 (Koichi Sasada) 7 days ago Actions #1 [ruby-core:125809]

I would like to have an additional "sweep thread" that sweeps alongside the Ruby GC thread (in parallel) and at the same time as Ruby code is running (concurrent). When the sweep thread is working alongside the Ruby GC thread, both threads grab pages from the current heap. If the sweep thread has already swept a page, the Ruby GC thread finishes it by clearing its bitmaps and adding the page to the free pages or empty pages lists. If there are no swept pages from the sweep thread, the Ruby GC thread sweeps a page by itself instead of waiting for swept pages.

Could you explain more about whole of sweep algorithm?

  • The definition of "Ruby GC thread" and "the sweep thread" (I got confusion with mutator and collector threads)
  • Whole sweep algorithm (maybe in pseudo-code), especially (I want to know):
    • initial sweep
    • conflict management with the sweep thread and mutator thread
  • how to manage the sweep thread
  • Table 3
    • what is "3"?
    • what is CTL / Lockfree?
    • 3% is valuable for apps? (if no disadvantage, I think so)

Updated by luke-gru (Luke Gruber) 6 days ago Actions #2 [ruby-core:125825]

ko1 (Koichi Sasada) wrote in #note-1:

The definition of "Ruby GC thread" and "the sweep thread"

The Ruby GC thread is the mutator thread that started GC and is currently doing GC work. The sweep thread is the thread whose only work is to sweep (the new thread in this feature). I'll call the Ruby GC thread the mutator from now on to avoid confusion.

ko1 (Koichi Sasada) wrote in #note-1:

Could you explain more about whole of sweep algorithm?

Sure. Sweeping is still incremental and lazy and the sweep steps start at the same time as before. The sweep thread does:

heaps.each do |heap|
  notified_move_on = false # atomically set by mutator
  until notified_move_on
    pages = heap.grab_next_pages_atomically(4)
    pages.each do |page|
      pre_sweep_page(heap, page)
    end
    heap.pre_swept_pages.concat(pages) # atomically
  end
end

# Only do sweep work. Don't update page bitmaps and don't put the page on any pages linked list (free_pages, empty_pages).
def pre_sweep_page(heap, page)
  page.sweep_planes do |plane|
    if plane.any?
       while obj = plane.next
         if can_pre_sweep_obj?(obj)
           sweep(obj)
         else
           page.register_deferred_sweep(obj)
         end
       end
    end
  end
end

The mutator does:

# When incremental step's budget is filled, set `notified_move_on = true`
heaps.each do |heap|
  if page = heap.pre_swept_pages.shift # atomic
    finish_page(page) # updates bitmap, adds to pages linked_list, frees any deferred objects that sweep thread couldn't free safely
  elsif page = heap.swept_pages.shift
    sweep_page(page) # full sweep of page because page has not been pre-swept
  else
    heap.swept_pages.concat heap.grab_next_pages_atomically(4)
    page = heap.swept_pages.shift
    sweep_page(page)
  end
end    

When exiting from sweep step, the sweep thread starts up again (signal it if it's already waiting). It does a similar thing as before with a small change:

# Now we run concurrently with Ruby code.
heaps.each do |heap|
  swept_slots = 0
  until swept_slots >= INCR_STEP_SLOTS_BUDGET || mutator_sweep_started?
    pages = heap.grab_next_pages_atomically(4)
    pages.each do |page|
      swept_slots += pre_sweep_page(heap, page)
    end
    heap.pre_swept_pages.concat(pages) # atomically
  end
  if !mutator_sweep_started?
    heap.skip_next_pre_sweep_parallel = true # We already swept 1 incremental step ahead. We'll pre-sweep other heaps that need it, or skip next step if all are done.
  end
end

ko1 (Koichi Sasada) wrote in #note-1:

how to manage the sweep thread

The sweep thread is started:

  • When entering an incremental sweep step.
  • At the beginning of immediate sweep (ex: GC.start).
  • When exiting sweeping, if there is more work to be done.

The sweep thread is stopped:

  • When all heaps are done being pre-swept by 1 incremental step's budget (for incrementel sweeping).
  • When all heaps are completely finished (for immediate sweeping).
  • During special operations (ex: we set dont_gc_on(), we expect no sweeping to occur).
  • User calls GC.disable.
  • Before fork, so that objects are in a consistent state.
  • Before VM exit.

ko1 (Koichi Sasada) wrote in #note-1:

Table 3

  • what is "3"?
  • what is CTL / Lockfree?

I had 3 Parallel Sweep branches with small changes and I was benchmarking the differences. ctl in this table means the "control", and lockfree is the branch of parallel sweep (the "experiment"). It was called lockfree because it uses atomic operations for the queue of pages between the mutator and the sweep thread.

Updated by luke-gru (Luke Gruber) 6 days ago Actions #3 [ruby-core:125827]

Sorry, I forget to address the last concern.

3% is valuable for apps? (if no disadvantage, I think so)

I think so, but we would like to improve upon it over time.

Actions

Also available in: PDF Atom