Feature #22121
openIntroduce Parallel Sweep feature
Description
Abstract¶
Ruby's GC sweep implementation is currently incremental and lazy. This is to reduce pause times when sweeping. However, the sweep implementation doesn't take advantage of parallelism (multi-core). Sweeping in a GC is not an "embarassingly parallel" problem, but extra threads can help. I would like to introduce such a feature to Ruby so that users can take advantage of their multi-core CPUs to reduce GC pause times.
Design¶
I would like to have an additional "sweep thread" that sweeps alongside the Ruby GC thread (in parallel) and at the same time as Ruby code is running (concurrent). When the sweep thread is working alongside the Ruby GC thread, both threads grab pages from the current heap. If the sweep thread has already swept a page, the Ruby GC thread finishes it by clearing its bitmaps and adding the page to the free pages or empty pages lists. If there are no swept pages from the sweep thread, the Ruby GC thread sweeps a page by itself instead of waiting for swept pages.
When an incremental sweep step is over, the sweep thread sweeps 1 incremental step's budget worth of slots while Ruby code is running. This is so that during the next incremental step, the Ruby GC thread just has to finish the pages off instead of sweeping the objects.
Limitations¶
Certain objects aren't safe to be freed by the sweep thread. T_DATA types from native extensions cannot be swept in general (although most are safe) because the user's sweep function may not be thread-safe. That function may modify global state in such a way that when it is called from both the sweep thread AND the Ruby GC thread at once, it behaves badly. In order to get around this, we introduce a new TypedData flag. T_DATA internal to the VM mostly have this flag set, and native extension authors can set this flag if it is defined to allow their type to be swept concurrently.
This feature is only available for the default GC. Ruby's MMTK garbage collector has its own implementation of concurrent sweeping that is not affected by this feature.
This feature is only available for pthread platforms, although that restriction could be lifted with a bit of work.
This feature is not on by default (see section Building).
Implementation¶
The PR is currently a draft, but it's in a working state. Please play with it and tell me what you think!
Building¶
./configure --enable-parallel-sweep
make -j
./ruby --enable-parallel-sweep -v
#=> ruby 4.1.0dev (2026-06-19T15:49:17Z parallel-sweep cd7e59d45b) +PRISM +Parallel-Sweep [arm64-darwin25]
Benchmarks¶
This image below is from running ruby-bench --headline benchmarks 10 times and taking the median of means of the runs.
The ruby-bench command was:
./run_benchmarks.rb --no-pinning --interleave --chruby="ruby_ctl::ruby_ctl --yjit" --chruby="psweep_lockfree::psweep_lockfree --yjit --enable-parallel-sweep" --headline
If you want sweep info per run, you can run it with the GC harness:
./run_benchmarks.rb --no-pinning --interleave --chruby="ruby_ctl::ruby_ctl --yjit" --chruby="psweep_lockfree::psweep_lockfree --yjit --enable-parallel-sweep" --harness=gc --headline

I encourage others to experiment with this feature, run your own benchmarks and share them here. When running your benchmarks, make sure not to pin the process to a CPU with taskset, otherwise the sweep thread will not run in parallel.
Note
Sometimes there are regressions run to run. Run the benchmark 5-10 times to get more accurate results.
GC micro-benchmarks may not see any improvements or even slight regressions. The implementation targets a workload where there is sufficient time in between sweep steps for the sweep thread to do its work.
Warning
Benchmarking on MacOS tends to be less accurate than Linux. However, if that's all you have and want to share your results, please do.
Future Work¶
In the future, I would like it if we could process the metadata of a page in the sweep thread. For example, if we could clear the bitmaps and unlink an empty page, or even add a page to the free pages list then the Ruby pause time would get even smaller (or none at all). I did have a prototype of this, but it caused issues because it was creating too many empty pages too fast, and that had unintended consequences for the rest of the GC.
I also believe parallel marking would have a large benefit. Major GCs take a long time due to marking, and if you have lots of threads of fibers than it takes even longer. Marking is a more naturally parallel problem, so would likely benefit from more than 1 worker thread.
Files
Updated by ko1 (Koichi Sasada) 7 days ago
I would like to have an additional "sweep thread" that sweeps alongside the Ruby GC thread (in parallel) and at the same time as Ruby code is running (concurrent). When the sweep thread is working alongside the Ruby GC thread, both threads grab pages from the current heap. If the sweep thread has already swept a page, the Ruby GC thread finishes it by clearing its bitmaps and adding the page to the free pages or empty pages lists. If there are no swept pages from the sweep thread, the Ruby GC thread sweeps a page by itself instead of waiting for swept pages.
Could you explain more about whole of sweep algorithm?
- The definition of "Ruby GC thread" and "the sweep thread" (I got confusion with mutator and collector threads)
- Whole sweep algorithm (maybe in pseudo-code), especially (I want to know):
- initial sweep
- conflict management with the sweep thread and mutator thread
- how to manage the sweep thread
- Table 3
- what is "3"?
- what is CTL / Lockfree?
- 3% is valuable for apps? (if no disadvantage, I think so)
Updated by luke-gru (Luke Gruber) 6 days ago
ko1 (Koichi Sasada) wrote in #note-1:
The definition of "Ruby GC thread" and "the sweep thread"
The Ruby GC thread is the mutator thread that started GC and is currently doing GC work. The sweep thread is the thread whose only work is to sweep (the new thread in this feature). I'll call the Ruby GC thread the mutator from now on to avoid confusion.
ko1 (Koichi Sasada) wrote in #note-1:
Could you explain more about whole of sweep algorithm?
Sure. Sweeping is still incremental and lazy and the sweep steps start at the same time as before. The sweep thread does:
heaps.each do |heap|
notified_move_on = false # atomically set by mutator
until notified_move_on
pages = heap.grab_next_pages_atomically(4)
pages.each do |page|
pre_sweep_page(heap, page)
end
heap.pre_swept_pages.concat(pages) # atomically
end
end
# Only do sweep work. Don't update page bitmaps and don't put the page on any pages linked list (free_pages, empty_pages).
def pre_sweep_page(heap, page)
page.sweep_planes do |plane|
if plane.any?
while obj = plane.next
if can_pre_sweep_obj?(obj)
sweep(obj)
else
page.register_deferred_sweep(obj)
end
end
end
end
end
The mutator does:
# When incremental step's budget is filled, set `notified_move_on = true`
heaps.each do |heap|
if page = heap.pre_swept_pages.shift # atomic
finish_page(page) # updates bitmap, adds to pages linked_list, frees any deferred objects that sweep thread couldn't free safely
elsif page = heap.swept_pages.shift
sweep_page(page) # full sweep of page because page has not been pre-swept
else
heap.swept_pages.concat heap.grab_next_pages_atomically(4)
page = heap.swept_pages.shift
sweep_page(page)
end
end
When exiting from sweep step, the sweep thread starts up again (signal it if it's already waiting). It does a similar thing as before with a small change:
# Now we run concurrently with Ruby code.
heaps.each do |heap|
swept_slots = 0
until swept_slots >= INCR_STEP_SLOTS_BUDGET || mutator_sweep_started?
pages = heap.grab_next_pages_atomically(4)
pages.each do |page|
swept_slots += pre_sweep_page(heap, page)
end
heap.pre_swept_pages.concat(pages) # atomically
end
if !mutator_sweep_started?
heap.skip_next_pre_sweep_parallel = true # We already swept 1 incremental step ahead. We'll pre-sweep other heaps that need it, or skip next step if all are done.
end
end
ko1 (Koichi Sasada) wrote in #note-1:
how to manage the sweep thread
The sweep thread is started:
- When entering an incremental sweep step.
- At the beginning of immediate sweep (ex:
GC.start). - When exiting sweeping, if there is more work to be done.
The sweep thread is stopped:
- When all heaps are done being pre-swept by 1 incremental step's budget (for incrementel sweeping).
- When all heaps are completely finished (for immediate sweeping).
- During special operations (ex: we set
dont_gc_on(), we expect no sweeping to occur). - User calls
GC.disable. - Before fork, so that objects are in a consistent state.
- Before VM exit.
ko1 (Koichi Sasada) wrote in #note-1:
Table 3
- what is "3"?
- what is CTL / Lockfree?
I had 3 Parallel Sweep branches with small changes and I was benchmarking the differences. ctl in this table means the "control", and lockfree is the branch of parallel sweep (the "experiment"). It was called lockfree because it uses atomic operations for the queue of pages between the mutator and the sweep thread.
Updated by luke-gru (Luke Gruber) 6 days ago
Sorry, I forget to address the last concern.
3% is valuable for apps? (if no disadvantage, I think so)
I think so, but we would like to improve upon it over time.