Bug #19288

closed

Ractor JSON parsing significantly slower than linear parsing

Added by maciej.mensfeld (Maciej Mensfeld) almost 3 years ago. Updated about 2 months ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 3.2.0 (2022-12-25 revision a528908271) [x86_64-linux]
[ruby-core:111526]

Description

A simple benchmark:

require 'json'
require 'benchmark'

CONCURRENT = 5
RACTORS = true
ELEMENTS = 100_000

data = CONCURRENT.times.map do
  ELEMENTS.times.map do
    {
      rand => rand,
      rand => rand,
      rand => rand,
      rand => rand
    }.to_json
  end
end

ractors = CONCURRENT.times.map do
  Ractor.new do
    Ractor.receive.each { JSON.parse(_1) }
  end
end

result = Benchmark.measure do
  if RACTORS
    CONCURRENT.times do |i|
      ractors[i].send(data[i], move: false)
    end

    ractors.each(&:take)
  else
    # Linear without any threads
    data.each do |piece|
      piece.each { JSON.parse(_1) }
    end
  end
end

puts result

This gives the following results on my 8-core machine:

# without ractors:
  2.731748   0.003993   2.735741 (  2.736349)

# with ractors
12.580452   5.089802  17.670254 (  5.209755)

I would expect Ractors not to be about two times slower (in wall-clock time) on CPU-intensive work.
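
For reading the numbers above: `puts result` prints a Benchmark::Tms as user CPU time, system CPU time, total CPU time, and wall-clock time in parentheses. By those figures the Ractor run takes about 1.9x the wall-clock time (5.21s vs 2.74s) while using roughly 6.5x the CPU time (17.67s vs 2.74s). A minimal sketch of reading those fields individually; the workload is a placeholder chosen for illustration, not the benchmark itself:

```ruby
require 'benchmark'

# Placeholder workload (an assumption for illustration only).
tms = Benchmark.measure { 200_000.times { |i| Math.sqrt(i) } }

# utime/stime: CPU time spent in user/kernel mode; real: elapsed wall-clock time.
# total sums the CPU time, including that of any child processes.
puts format('user=%.6f system=%.6f total=%.6f real=%.6f',
            tms.utime, tms.stime, tms.total, tms.real)
```

With several Ractors running in parallel, total CPU time can legitimately exceed wall-clock time, so the column in parentheses is the one to compare against the serial run.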


Files

json_parse_float.rb (727 Bytes), luke-gru (Luke Gruber), 08/28/2025 07:04 PM
json_parse_string.rb (1.22 KB), luke-gru (Luke Gruber), 08/28/2025 07:05 PM

Updated by Eregon (Benoit Daloze) almost 3 years ago Actions #1 [ruby-core:111537]

It would be fairer to call Ractor.make_shareable(data) first.
But even with that, the Ractor version is slower:

no Ractor:
  2.748311   0.003002   2.751313 (  2.763541)
Ractor
  9.939530   5.816431  15.755961 (  4.289792)

This high system time seems strange.
Probably lock contention for allocations?
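
As a small sketch of what Ractor.make_shareable does to the benchmark's input: it deep-freezes the object graph in place, and shareable objects are passed to another Ractor by reference instead of being deep-copied on send. The data below is a scaled-down stand-in for `data` in the script, not the original payload:

```ruby
# Build nested arrays of JSON strings, like `data` in the benchmark (smaller here).
data = Array.new(2) { Array.new(3) { |i| %({"n":#{i}}) } }

puts Ractor.shareable?(data)    # mutable arrays and strings are not shareable
Ractor.make_shareable(data)     # deep-freezes the whole object graph in place
puts Ractor.shareable?(data)
puts data.frozen? && data.all? { |batch| batch.frozen? && batch.all?(&:frozen?) }
```

This removes the copy cost from the measurement, so whatever slowdown remains comes from the parsing itself.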

Updated by Eregon (Benoit Daloze) almost 3 years ago Actions #2 [ruby-core:111538]

Also that script creates Ractors even in "linear" mode.
With the fixed script below:

  2.040496   0.002988   2.043484 (  2.048731)

i.e. it's also quite a bit slower if any Ractor is created.

Script:

require 'json'
require 'benchmark'

CONCURRENT = 5
RACTORS = ARGV.first == "ractor"
ELEMENTS = 100_000

data = CONCURRENT.times.map do
  ELEMENTS.times.map do
    {
      rand => rand,
      rand => rand,
      rand => rand,
      rand => rand
    }.to_json
  end
end

if RACTORS
  Ractor.make_shareable(data)

  ractors = CONCURRENT.times.map do
    Ractor.new do
      Ractor.receive.each { JSON.parse(_1) }
    end
  end
end

result = Benchmark.measure do
  if RACTORS
    CONCURRENT.times do |i|
      ractors[i].send(data[i], move: false)
    end

    ractors.each(&:take)
  else
    # Linear without any threads
    data.each do |piece|
      piece.each { JSON.parse(_1) }
    end
  end
end

puts result

Updated by luke-gru (Luke Gruber) almost 3 years ago Actions #3 [ruby-core:111887]

I just took a look at this, and it looks like the culprit is the C dtoa function that's called in the JSON parser, specifically a helper function, Balloc. It uses a lock for some reason (shrug).

Edit: It looks like in Ruby's missing/dtoa.c, the lock function is a no-op. If that version of dtoa.c is used in your Ruby, then it isn't that. My Ruby is using missing/dtoa.c, and running the perf tool on this script points to Balloc as the main issue. Something funny is going on in that Balloc function. I think it's the malloc() calls that are taking the malloc arena lock, and the lock contention is there, but that's just a guess.

Updated by maciej.mensfeld (Maciej Mensfeld) almost 3 years ago Actions #4 [ruby-core:111986]

I find this issue important and if mitigated, it would allow me to release production-grade functionalities that would benefit users of the Ruby language.

I run an OSS project called Karafka (https://github.com/karafka/karafka) that allows for processing Kafka messages using multiple threads in parallel. For non-IO-bound cases, the majority of the time (> 80%) for the users whose use cases I know is spent on data deserialization. JSON is by far the most popular format and is also conveniently supported natively by Ruby. While providing true parallelism around the whole processing may not be easy due to a ton of synchronization around the whole process, the atomicity of message deserialization makes it an ideal case for using Ractors.

  • Data can be sent there, and results can be transferred without interdependencies.
  • Each message is atomic; hence their deserialization can run in parallel.
  • All message deserialization requests can be sent to a generic queue from which Ractors could consume.
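
The points above could be sketched roughly as follows. This is an illustration under stated assumptions, not Karafka's implementation: `parse_in_ractors` and `ractor_result` are hypothetical helper names, payloads are split into fixed batches rather than pulled from a shared queue, and `ractor_result` only exists to cope with `Ractor#take` having been replaced by `Ractor#value` in newer Ruby versions.

```ruby
require 'json'

# Hypothetical compatibility helper: newer Rubies expose Ractor#value,
# older ones Ractor#take.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

# Hypothetical helper: parse raw JSON payloads in up to `workers` Ractors.
def parse_in_ractors(payloads, workers: 4)
  slice_size = [(payloads.size / workers.to_f).ceil, 1].max

  ractors = payloads.each_slice(slice_size).map do |batch|
    # The batch is deep-copied into the Ractor; each message is independent.
    Ractor.new(batch) do |strings|
      # Freeze the parsed results so they can be handed back without a copy.
      Ractor.make_shareable(strings.map { |s| JSON.parse(s) })
    end
  end

  # Collect results in the original order.
  ractors.flat_map { |r| ractor_result(r) }
end

messages = ['{"id":1}', '{"id":2}', '{"id":3}']
p parse_in_ractors(messages, workers: 2)
```

Because the parsed results come back frozen, any later mutation by the consumer would need a copy; that trade-off is part of the sketch, not a claim about how a production backend should behave.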

I am not an expert in the Ruby code, but if there is anything I could help with to move this forward, please just ping me.

Updated by luke-gru (Luke Gruber) almost 3 years ago Actions #5 [ruby-core:111994]

I've notified the flori/json people (https://github.com/flori/json/issues/511)

So to update everyone, the dtoa function is called during JSON generation, not parsing. As this script does both, it's hard to measure using perf tools. You have to run the generation part of the script alone and look at the perf report, then compare it against running both the generation and the parsing (with and without Ractors).
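
One way to keep the two phases apart, sketched here with plain timers rather than perf: time generation and parsing in separate blocks so each can be profiled or compared on its own. The element count and hash shape are assumptions, scaled down from the original script.

```ruby
require 'json'
require 'benchmark'

ELEMENTS = 10_000 # scaled down from the original script (assumption)

hashes = Array.new(ELEMENTS) { { rand => rand, rand => rand } }

docs = nil
# Phase 1: generation only (Float -> String conversion happens here).
generation = Benchmark.realtime { docs = hashes.map(&:to_json) }

# Phase 2: parsing only (String -> Float conversion happens here).
parsing = Benchmark.realtime { docs.each { |doc| JSON.parse(doc) } }

puts format('generate: %.3fs  parse: %.3fs', generation, parsing)
```

Running each phase as its own script under perf, as suggested above, gives cleaner profiles than timing both in a single process.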

Updated by luke-gru (Luke Gruber) almost 3 years ago Actions #6 [ruby-core:112001]

Here's a simple reproduction showing that the problem is not send/receive:

require 'json'

RACTORS = ARGV.first == "ractor"
J = { rand => rand }.to_json
Ractor.make_shareable(J)
if RACTORS
  rs = []
  10.times.each do
    rs << Ractor.new do
      i = 0
      while i < 100_000
        JSON.parse(J)
        i+=1
      end
    end
  end
  rs.each(&:take)
else
  1_000_000.times do
    JSON.parse(J)
  end
end

The ractor example should take less time, but it doesn't.

Updated by Eregon (Benoit Daloze) almost 3 years ago Actions #7 [ruby-core:112002]

maciej.mensfeld (Maciej Mensfeld) wrote in #note-4:

I find this issue important and if mitigated, it would allow me to release production-grade functionalities that would benefit users of the Ruby language.

Note that Ractor is far from production-ready.
It has many issues, as can be found on this bug tracker and when using it, and as the warning itself says ("Also there are many implementation issues.").
Also, the fact that the main Ruby test suites don't run any Ractor test in the same process seems to be an indication of instability.

And then of course there is the issue that Ractor is incompatible with most gems/code out there.
While JSON loading might work, any non-trivial processing after using a gem is unlikely to work well.
Other Rubies have solved this in a much more efficient, usable and reliable way, by having no GVL.

Updated by maciej.mensfeld (Maciej Mensfeld) almost 3 years ago Actions #8 [ruby-core:112003]

Note that Ractor is far from production-ready.

I am well aware. I am just providing a justification, and since my case seems to fit the limited scope of this functionality, I wanted to draw attention to it.

While JSON loading might work, any non-trivial processing after using a gem is unlikely to work well.

This is exactly why I want to get a limited functionality that anyhow would allow me to parallelize the processing.

Other Rubies have solved this in a much more efficient, usable and reliable way, by having no GVL.

I am also aware of this :)

Updated by luke-gru (Luke Gruber) almost 3 years ago Actions #9 [ruby-core:112012]

It has many issues, as can be found on this bug tracker and when using it, and as the warning itself says ("Also there are many implementation issues.").

I think the implementation issues are solvable, but the bigger-picture issue of adoption is of course up in the air. IMO, if we are allowed to have an API such as Ractor.disable_isolation_checks! { ... } for use around thread-safe code, that would be a big win in my book.

Also about the test-suite, I do want to add in-process ractor tests. I hope the ruby core team isn't against it.

Updated by maciej.mensfeld (Maciej Mensfeld) almost 3 years ago Actions #10 [ruby-core:112014]

I think the implementation issues are solvable but the bigger picture issue of adoption is of course up in the air.

The first step to adoption is to have a case for it that could be used. I believe the case I presented is viable and should be considered.

Updated by luke-gru (Luke Gruber) over 2 years ago Actions #11 [ruby-core:112104]

This PR I made to JSON repository is related: https://github.com/flori/json/pull/512

Updated by duerst (Martin Dürst) over 2 years ago Actions #12 [ruby-core:112109]

Eregon (Benoit Daloze) wrote in #note-7:

And then of course there is the issue that Ractor is incompatible with most gems/code out there.
While JSON loading might work, any non-trivial processing after using a gem is unlikely to work well.
Other Rubies have solved this in a much more efficient, usable and reliable way, by having no GVL.

But don't other Rubies rely on the programmer to know how to program with threads? That's only usable if you're used to programming with threads and avoid the related issues. The idea (where the implementation and many gems may still have to catch up) behind Ractor is that thread-related issues such as data races can be avoided at the level of the programming model.

Updated by maciej.mensfeld (Maciej Mensfeld) over 2 years ago Actions #13 [ruby-core:112968]

And then of course there is the issue that Ractor is incompatible with most gems/code out there.
While JSON loading might work, any non-trivial processing after using a gem is unlikely to work well.

We need to start somewhere. Even if trivial/isolated cases work, if they work well, they can act as the first milestone to usage of this API for commercial benefits and I am willing to take the risk ;)

Updated by Eregon (Benoit Daloze) over 2 years ago Actions #14 [ruby-core:112970]

duerst (Martin Dürst) wrote in #note-12:

But don't other Rubies rely on the programmer to know how to program with threads? That's only usable if you're used to programming with threads and avoid the related issues. The idea (where the implementation and many gems may still have to catch up) behind Ractor is that thread-related issues such as data races can be avoided at the level of the programming model.

We're getting a bit off-topic, but I believe not necessarily. And the GIL doesn't prevent most Ruby-level threading issues, so in that matter it's almost the same on CRuby.
For example, I would think many Ruby on Rails devs don't know threading well, and they don't need to, even though webservers like Puma use threads.
Deep knowledge of multithreading is needed e.g. when creating concurrent data structures, but using them OTOH doesn't require much.
I would think for most programmers, using threads is much easier and more intuitive than having the big limitations of Ractor which prevent sharing any state, especially in an imperative and stateful language like Ruby where almost everything is mutable.
IMO Ractors are way more difficult to use than threads. They can also have some sorts of race conditions due to message order, so it's not that much safer either. And it's a lot less efficient for any communication between Ractors vs threads (Ractor copy or move both need an object graph walk).

Updated by maciej.mensfeld (Maciej Mensfeld) about 2 years ago Actions #15 [ruby-core:114924]

I want to revisit our discussion about leveraging Ruby Ractors for parallel JSON parsing. It appears there hasn't been much activity on this thread for a long time.

I found it pertinent to mention that during the recent RubyKaigi conference, Koichi Sasada highlighted the need for real-life/commercial use cases to showcase Ractors' potential. To that end, I wanted to point out that I do have a practical, commercial scenario. Karafka handles parsing of thousands or more JSON documents in parallel. Having Ractor support in such a context could substantially enhance performance, providing a tangible benefit to the end users.

Given this real-life use case, are there any updates or plans to continue work on allowing Ractors to operate faster in the scenario I presented? It would indeed be invaluable for many users working with Kafka in Ruby. While the end-user processing of data will still have to happen in a single Ractor, parsing seems like a great example where an immutable raw payload can be shipped to independent Ractors and frozen deserialized payloads can be shipped back.

Updated by byroot (Jean Boussier) 9 months ago Actions #16 [ruby-core:120923]

I profiled this repro out of curiosity, and ractors spend 32% of their time waiting for the VM lock (vm_lock_enter + the unlock) to be able to lookup in the fstring table: https://share.firefox.dev/4152X8a

Currently this is done explicitly by the json gem, but even if json wasn't attempting to do it, Ruby would do the same thing once we're trying to insert string keys: https://share.firefox.dev/4hsVPbC

I suppose there are various solutions to this with their own tradeoffs:

  • Don't intern hash keys when not on the main ractor.
  • Protect the fstring table with its own dedicated read-write lock, so that concurrent Ractors can look up strings (but only one can insert at a time). This would also reduce contention for other areas still protected by the remaining VM lock.
  • Somehow replace the fstring table by a truly lockfree hash table.
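
For context on what the fstring table does at the Ruby level, here is a minimal sketch using String#-@, which returns a frozen, deduplicated string, plus the Hash behaviour of storing a frozen copy of an unfrozen string key. The strings are built at runtime so that literal deduplication does not get in the way; variable names are illustrative only.

```ruby
# Two equal strings built at runtime are distinct objects...
a = -('ke' + 'y')
b = -('k' + 'ey')

# ...but String#-@ hands back one shared, frozen instance for both.
puts a.equal?(b)
puts a.frozen?

# Hash#[]= stores a frozen copy of an unfrozen String key, which is where
# key interning comes into play when JSON objects are turned into Hashes.
h = {}
k = 'ke' + 'y'
h[k] = 1
stored = h.keys.first
puts stored.frozen?
puts stored.equal?(k) # the stored key is not the caller's mutable string
```

With random keys, as in this benchmark, nearly every lookup misses and turns into an insert, which is the worst case for any shared table.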

Updated by tenderlovemaking (Aaron Patterson) 9 months ago Actions #17 [ruby-core:120927]

byroot (Jean Boussier) wrote in #note-16:

I profiled this repro out of curiosity, and ractors spend 32% of their time waiting for the VM lock (vm_lock_enter + the unlock) to be able to lookup in the fstring table: https://share.firefox.dev/4152X8a

Currently this is done explicitly by the json gem, but even if json wasn't attempting to do it, Ruby would do the same thing once we're trying to insert string keys: https://share.firefox.dev/4hsVPbC

I suppose there are various solutions to this with their own tradeoffs:

  • Don't intern hash keys when not on the main ractor.
  • Protect the fstring table with its own dedicated read-write lock, so that concurrent Ractors can look up strings (but only one can insert at a time). This would also reduce contention for other areas still protected by the remaining VM lock.

I'm not sure if this one is possible. If some Ractor is updating the hash, it could be in an inconsistent state when another Ractor is trying to read. Maybe st table updates are atomic, but I don't know (and I kind of doubt it).

  • Somehow replace the fstring table by a truly lockfree hash table.

This would be ideal IMO, but seems hard 😅

We could also add an fstring table to each Ractor. I know the purpose of the fstring table is to limit the number of instances of a string, but at least we would be limited to the number of Ractors (rather than no limit when there are multiple Ractors).
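
A Ruby-level sketch of the per-Ractor idea, using Ractor-local storage. This only illustrates the semantics; the actual proposal would live in the VM's C code. `ractor_local_intern` and `ractor_result` are hypothetical helpers, the latter only bridging `Ractor#take` and the newer `Ractor#value`.

```ruby
# Hypothetical helper: one intern table per Ractor, kept in Ractor-local storage.
def ractor_local_intern(str)
  table = (Ractor.current[:intern_table] ||= {})
  table[str] ||= str.dup.freeze
end

# Hypothetical compatibility helper for older and newer Ractor APIs.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

a = ractor_local_intern('ke' + 'y')
b = ractor_local_intern('k' + 'ey')
puts a.equal?(b) # same Ractor, same table: one shared instance

# A different Ractor has its own table and therefore builds its own instance.
other_id = ractor_result(Ractor.new { ractor_local_intern('ke' + 'y').object_id })
puts other_id == a.object_id
```

No lock is needed because each table is only ever touched by its owning Ractor; the cost is up to one copy of each string per Ractor.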

Updated by byroot (Jean Boussier) 9 months ago Actions #18 [ruby-core:120928]

I'm not sure if this one is possible. If some Ractor is updating the hash, it could be in an inconsistent state when another Ractor is trying to read.

A RW-lock doesn't allow reads while the write lock is held.
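
To make the read-write lock semantics concrete, here is a small Ruby sketch built on Mutex and ConditionVariable and exercised with threads. `ReadWriteLock` and `InternTable` are hypothetical classes written for this illustration; the real fstring table is C code inside the VM, and this sketch makes no attempt at writer priority or fairness.

```ruby
# Hypothetical read-write lock: many readers OR one writer at a time.
class ReadWriteLock
  def initialize
    @mutex = Mutex.new
    @cond = ConditionVariable.new
    @readers = 0
    @writer = false
  end

  def with_read_lock
    @mutex.synchronize do
      @cond.wait(@mutex) while @writer # readers wait while a writer is active
      @readers += 1
    end
    begin
      yield
    ensure
      @mutex.synchronize do
        @readers -= 1
        @cond.broadcast if @readers.zero?
      end
    end
  end

  def with_write_lock
    @mutex.synchronize do
      @cond.wait(@mutex) while @writer || @readers.positive?
      @writer = true
    end
    begin
      yield
    ensure
      @mutex.synchronize do
        @writer = false
        @cond.broadcast
      end
    end
  end
end

# Hypothetical intern table: lookups share the read lock, inserts are exclusive.
class InternTable
  def initialize
    @lock = ReadWriteLock.new
    @table = {}
  end

  def intern(str)
    found = @lock.with_read_lock { @table[str] }
    return found if found

    # Re-check under the write lock: another thread may have inserted meanwhile.
    @lock.with_write_lock { @table[str] ||= str.dup.freeze }
  end
end

table = InternTable.new
threads = Array.new(8) do
  Thread.new { Array.new(100) { |i| table.intern("key#{i % 10}") } }
end
interned = threads.flat_map(&:value)
puts interned.uniq(&:object_id).size
```

Readers never observe a half-updated table because a writer only proceeds once the reader count has dropped to zero.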

We could also add an fstring table to each Ractor.

That's an interesting idea.

Updated by Eregon (Benoit Daloze) 9 months ago Actions #19 [ruby-core:120932]

tenderlovemaking (Aaron Patterson) wrote in #note-17:

We could also add an fstring table to each Ractor [...] but at least we would be limited to the number of Ractors

The number of Ractors can be high, though, now that there are M:N threads.
Also, one might rely on the result of String#-@ always being the same object for equivalent strings, even across Ractors.
Concretely, this means a string that is interned in one Ractor won't be interned in another Ractor, which seems surprising.

(FWIW TruffleRuby solves this using a ConcurrentHashMap with weak values)

Updated by jhawthorn (John Hawthorn) 6 months ago Actions #20 [ruby-core:121682]

  • Status changed from Open to Closed

I've made a couple of changes which should improve this. This is a bit of an odd benchmark, as the keys are always random (which I think would be quite uncommon in real JSON parsing situations). But this makes it an interesting benchmark for some worst-case behaviours.

The two things this benchmark is exercising, as others have pointed out, are float parsing and insertion into the fstring table.

The first change is https://github.com/ruby/ruby/pull/12991, which removes a spinlock in dtoa.c for bignum memory management (Balloc/Bfree).

The second change is https://bugs.ruby-lang.org/issues/21268 / https://github.com/ruby/ruby/pull/12921, which I've just merged (NB. it isn't in preview1).

With these two I believe the main bottlenecks of this benchmark have been removed. Though performance can always be improved, these are now faster under Ractors than serial. The main contention is now GC pressure caused by the JSON being parsed (in particular the random string keys, I would expect a more reasonable set of JSON to perform even better).


Original benchmark

Serial - Ruby 3.4.2
  0.995186   0.030131   1.025317 (  1.025439)

Serial - Ruby master@3a29e835e69187330bb3fee035f1289561de5dad
  0.933650   0.014119   0.947769 (  0.947780)

Ractors - Ruby 3.4.2
 10.875507   1.600797  12.476304 (  3.380926)

Ractors - Ruby master@3a29e835e69187330bb3fee035f1289561de5dad
  1.490369   0.192624   1.682993 (  0.749747)

Eregon's improved benchmark - with shareable objects

Serial - Ruby 3.4.2
  0.982018   0.035003   1.017021 (  1.017816)

Serial - Ruby master@3a29e835e69187330bb3fee035f1289561de5dad
  0.981966   0.026635   1.008601 (  1.008593)

Ractors - Ruby 3.4.2
 12.232766   1.772653  14.005419 (  3.578559)

Ractors - Ruby master@3a29e835e69187330bb3fee035f1289561de5dad
  1.301200   0.122725   1.423925 (  0.470395)

Luke-gru's Benchmark - with no send/recv

Benchmark 1: ruby 3.4.2 serial
  Time (mean ± σ):     459.4 ms ±  10.5 ms    [User: 449.8 ms, System: 8.8 ms]
  Range (min … max):   450.7 ms … 478.2 ms    10 runs

Benchmark 2: ruby master serial
  Time (mean ± σ):     391.2 ms ±  60.4 ms    [User: 381.7 ms, System: 8.5 ms]
  Range (min … max):   220.0 ms … 419.0 ms    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 3: ruby 3.4.2 ractors
  Time (mean ± σ):      2.258 s ±  0.685 s    [User: 13.585 s, System: 0.367 s]
  Range (min … max):    0.310 s …  2.484 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 4: ruby master ractors
  Time (mean ± σ):     191.2 ms ±  13.8 ms    [User: 547.1 ms, System: 110.3 ms]
  Range (min … max):   144.3 ms … 201.5 ms    14 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  ruby master ractors ran
    2.05 ± 0.35 times faster than ruby master serial
    2.40 ± 0.18 times faster than ruby 3.4.2 serial
   11.81 ± 3.68 times faster than ruby 3.4.2 ractors

Updated by luke-gru (Luke Gruber) about 2 months ago Actions #21 [ruby-core:123111]

The situation with ractors has improved a lot since the last update to this thread. I encourage you to try out ruby master if you can.

Running benchmark "json_parse_float" (1/2)
+ /Users/luke/.rubies/master-release/bin/ruby -I harness-ractor /Users/luke/workspace/yjit-bench/benchmarks/ractor/json_parse_float.rb
JSON 2.13.2
r:   itr:   time
0    #1:  488ms
0    #2:  477ms
0    #3:  458ms
0    #4:  448ms
0    #5:  445ms
1    #1:  384ms
1    #2:  411ms
1    #3:  395ms
1    #4:  422ms
1    #5:  390ms
2    #1:  262ms
2    #2:  260ms
2    #3:  244ms
2    #4:  248ms
2    #5:  241ms
4    #1:  168ms
4    #2:  179ms
4    #3:  145ms
4    #4:  175ms
4    #5:  145ms
6    #1:  166ms
6    #2:  181ms
6    #3:  151ms
6    #4:  133ms
6    #5:  139ms
8    #1:  116ms
8    #2:  140ms
8    #3:  136ms
8    #4:  117ms
8    #5:  148ms
12   #1:  130ms
12   #2:  123ms
12   #3:  117ms
12   #4:  136ms
12   #5:  129ms
16   #1:  100ms
16   #2:  111ms
16   #3:   96ms
16   #4:  122ms
16   #5:   90ms
32   #1:  100ms
32   #2:   76ms
32   #3:  100ms
32   #4:   76ms
32   #5:   91ms

Running benchmark "json_parse_string" (2/2)
+ /Users/luke/.rubies/master-release/bin/ruby -I harness-ractor /Users/luke/workspace/yjit-bench/benchmarks/ractor/json_parse_string.rb

JSON 2.13.2
r:   itr:   time
0    #1:  116ms
0    #2:  130ms
0    #3:  121ms
0    #4:  126ms
0    #5:  128ms
1    #1:   91ms
1    #2:  117ms
1    #3:  102ms
1    #4:  101ms
1    #5:   97ms
2    #1:   70ms
2    #2:   83ms
2    #3:   64ms
2    #4:   65ms
2    #5:   80ms
4    #1:   58ms
4    #2:   57ms
4    #3:   58ms
4    #4:   60ms
4    #5:   60ms
6    #1:   55ms
6    #2:   77ms
6    #3:   75ms
6    #4:   57ms
6    #5:   58ms
8    #1:   64ms
8    #2:   61ms
8    #3:   58ms
8    #4:   56ms
8    #5:   61ms
12   #1:   57ms
12   #2:   54ms
12   #3:   55ms
12   #4:   55ms
12   #5:   57ms
16   #1:   49ms
16   #2:   59ms
16   #3:   74ms
16   #4:   53ms
16   #5:   55ms
32   #1:   52ms
32   #2:   53ms
32   #3:   44ms
32   #4:   51ms
32   #5:   51ms

Updated by maciej.mensfeld (Maciej Mensfeld) about 2 months ago Actions #22 [ruby-core:123187]

I can confirm significant performance improvements with the new Ractor APIs. I will now proceed to benchmark this against real Karafka user payloads. If the performance gains match what we're seeing in these synthetic benchmarks, I'll move forward with implementing a Ractor-based deserialization backend for Karafka. Amazing work!
