Bug #17497 (Closed)
Ractor performance issue
Description
There's a strange performance issue with Ractor (at least on macOS; I didn't run it on other OSes).
I ran a benchmark doing 3 different types of work:
- "fib": method calls (naive fibonacci calculation)
- "cpu": (0...1000).inject(:+)
- "sleep": call sleep
I get the kind of results I was expecting for the fib and sleep workloads, but the results for the "cpu" workload show a problem.
The "cpu" workload is so slow under Ractor that my pure-Ruby backport (using Threads) is 65x faster 😮 on my Mac Pro, despite its 6 cores. I would expect the backport to be about 6x slower, so Ractor is roughly 400x slower than it should be 😿
On my MacBook (2 cores) the results are not as bad: the cpu workload is only 3x faster with my pure-Ruby backport instead of ~2x slower, so Ractor is about 6x too slow.
$ gem install backports
Successfully installed backports-3.20.0
1 gem installed
$ ruby ractor_test.rb
<internal:ractor>:267: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
fib: 110 ms | cpu: 22900 ms | sleep: 206 ms
$ B=t ruby ractor_test.rb
Using pure Ruby implementation
fib: 652 ms | cpu: 337 ms | sleep: 209 ms
Notice the sleep run takes similar time, which is good, and fib is ~6x faster on my 6-core CPU (and ~2x faster on my 2-core MacBook). Again, that's good, as the pure-Ruby version uses Threads and thus runs with a single GVL.
The cpu version is the problem.
Script is here: https://gist.github.com/marcandre/bfed626e538a3d0fc7cad38dc026cf0e
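The "400x" figure can be reproduced from the two runs above. This is only a small arithmetic sketch: the timings are copied from the output, the 6-core count comes from the description, and the variable names are made up for illustration.

```ruby
# Timings in ms, copied from the two benchmark runs above.
ractor  = { fib: 110, cpu: 22_900, sleep: 206 }
threads = { fib: 652, cpu: 337,    sleep: 209 }
CORES = 6 # Mac Pro core count stated in the description

# How many times faster the Ractor run is than the Thread-based backport.
ratios = ractor.to_h { |name, ms| [name, threads[name].to_f / ms] }
ratios.each { |name, ratio| puts format('%-5s threads/ractor = %.2f', name, ratio) }

# If "cpu" had scaled like "fib", the Ractor run would take about 337 / 6 ms.
expected_cpu_ms = threads[:cpu].to_f / CORES
shortfall = ractor[:cpu] / expected_cpu_ms
puts format('cpu: Ractor is about %.0fx slower than ideal scaling', shortfall)
```

The fib ratio lands close to the core count, while the cpu ratio is far below 1, which is the anomaly this report is about.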
Updated by MSP-Greg (Greg L) almost 4 years ago
Using various 2021-01-03 versions of master, got the following times, a few were averaged by eye:
—————————— Windows 10 mingw ——————————
           fib   |    cpu  |  sleep
native    41 ms  |  940 ms | 220 ms
ruby     345 ms  |  205 ms | 243 ms
—————————— WSL2/Ubuntu 20.04 —————————
           fib   |    cpu  |  sleep
native    40 ms  |  530 ms | 202 ms
ruby     330 ms  |  150 ms | 203 ms
—————————— Windows 10 mswin ——————————
           fib   |    cpu  |  sleep
native    88 ms  | 1080 ms | 215 ms
ruby     690 ms  |  290 ms | 235 ms
New, fast desktop. Always interested in comparing the three platforms/OS's. Ubuntu usually wins...
Updated by marcandre (Marc-Andre Lafortune) almost 4 years ago
Thanks for running this on other platforms.
From the numbers, it looks like you are running on an 8-core machine, right? If so, the "fib" and "sleep" numbers are what should be expected, but "cpu" is 3-5x slower on Ractor when it should be 8x faster...
Updated by MSP-Greg (Greg L) almost 4 years ago
10 core i9. I've set up enough systems in my life (I used DOS); prefer new systems to last a while...
Updated by ko1 (Koichi Sasada) almost 4 years ago
Thank you for the report. Let me investigate more.
(just curious) why is the name "CPU"?
Updated by marcandre (Marc-Andre Lafortune) almost 4 years ago
ko1 (Koichi Sasada) wrote in #note-4:
Thank you for the report. Let me investigate more.
(just curious) why is the name "CPU"?
I wanted to compare CPU-bound processes (which should benefit from Ractor) to IO-bound processes (which should have similar benchmarks if my backport isn't too inefficient). I used sleep instead of IO because I'm lazy 😅. It's only when I saw the issue that I tested other CPU-bound methods and added "fib".
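A minimal sketch of that comparison, for readers who don't want to open the gist. This is not the gist's code: the workload sizes, the helper names and the choice of 4 workers are assumptions, and plain Threads stand in for the backport.

```ruby
require 'benchmark'

Warning[:experimental] = false # hide the "Ractor is experimental" warning

def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

# Workload sizes are arbitrary, picked only to keep the sketch quick.
def fib_work
  fib(20)
end

def cpu_work
  sum = 0
  1_000.times { sum = (0...1000).inject(:+) }
  sum
end

def sleep_work
  sleep(0.05)
end

# Some Ruby versions provide Ractor#value instead of Ractor#take; use whichever exists.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

def run_with_threads(work, count)
  Array.new(count) { Thread.new { __send__(work) } }.map(&:value)
end

def run_with_ractors(work, count)
  # __send__ is used because Ractor#send means "send a message", not "call a method".
  ractors = Array.new(count) { Ractor.new(work) { |name| __send__(name) } }
  ractors.map { |r| ractor_result(r) }
end

WORKERS = 4 # assumed worker count
results = {}
%i[fib_work cpu_work sleep_work].each do |work|
  t = Benchmark.realtime { results[[work, :threads]] = run_with_threads(work, WORKERS) }
  r = Benchmark.realtime { results[[work, :ractors]] = run_with_ractors(work, WORKERS) }
  puts format('%-10s threads: %5d ms | ractors: %5d ms', work, t * 1000, r * 1000)
end
```

On a healthy build, the two CPU-bound rows should favour ractors by roughly the core count, and the sleep row should be about equal.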
Updated by ko1 (Koichi Sasada) almost 4 years ago
Quick report:
Warning[:experimental] = false if defined? Warning[]

def task_inject
  (1..10_000_000).inject(:+)
end

alias task task_inject
# p method(:task)

MODE = (ARGV.shift || :r_parallel).to_sym
TN = 4

case MODE
when :serial
  TN.times{ task }
when :r_serial
  exit(1) unless defined? Ractor
  TN.times{
    Ractor.new{
      task
    }.take
  }
when :r_parallel
  exit(1) unless defined? Ractor
  TN.times.map{
    Ractor.new{
      task
    }
  }.each{|r| r.take}
else
  raise
end

print "%4d" % GC.count
and
user system total real
serial/26_mini 0 0.000000 0.000248 1.318308 ( 1.318555)
serial/27_mini 0 0.000000 0.000627 1.209881 ( 1.209730)
serial/master_mini 0 0.000000 0.000430 1.997904 ( 1.997656)
serial/miniruby 0 0.000000 0.000254 1.723801 ( 1.723786)
serial/26_ruby 0 0.000000 0.000481 1.256867 ( 1.256746)
serial/27_ruby 0 0.000000 0.008709 1.098332 ( 1.098257)
serial/master_ruby 0 0.000000 0.000312 1.915706 ( 1.916034)
serial/ruby 0 0.000000 0.000288 1.921821 ( 1.921793)
r_serial/26_mini N/A
r_serial/27_mini N/A
r_serial/master_mini 1 0.000000 0.000388 2.460095 ( 2.460922)
r_serial/miniruby 1 0.000000 0.000359 2.784072 ( 2.784779)
r_serial/26_ruby N/A
r_serial/27_ruby N/A
r_serial/master_ruby 1 0.000000 0.000216 2.690338 ( 2.690321)
r_serial/ruby 1 0.000000 0.000237 2.982560 ( 2.983885)
r_parallel/26_mini N/A
r_parallel/27_mini N/A
r_parallel/master_mini 1 0.000000 0.000210 23.172113 ( 6.316598)
r_parallel/miniruby 1 0.000000 0.000248 25.933848 ( 7.054210)
r_parallel/26_ruby N/A
r_parallel/27_ruby N/A
r_parallel/master_ruby 1 0.000000 0.000214 25.243151 ( 6.805798)
r_parallel/ruby 1 0.000000 0.000181 28.647737 ( 7.991565)
- on serial execution, master is ~2x slower than 2.6/2.7
- on serial execution with a quiet ractor (multi-ractor mode), master is ~2.5x slower than 2.6/2.7
- on parallel execution with ractors, master is ~7x slower than 2.6/2.7
Updated by ko1 (Koichi Sasada) almost 4 years ago
master_ruby and ruby were the same, so the master_* entries are omitted.
With version information:
26_mini ruby 2.6.7p148 (2020-06-14 revision 67884) [x86_64-linux]
27_mini ruby 2.7.3p139 (2020-10-11 revision d1ba554551) [x86_64-linux]
miniruby ruby 3.1.0dev (2021-01-05T07:50:00Z master e91160f757) [x86_64-linux]
26_ruby ruby 2.6.7p148 (2020-06-14 revision 67884) [x86_64-linux]
27_ruby ruby 2.7.3p139 (2020-10-11 revision d1ba554551) [x86_64-linux]
ruby ruby 3.1.0dev (2021-01-05T07:50:00Z master e91160f757) [x86_64-linux]
method: task_range_inject
user system total real
serial/26_mini 0 0.000330 0.000000 1.334362 ( 1.334173)
serial/27_mini 0 0.000401 0.000000 1.111259 ( 1.111157)
serial/miniruby 0 0.000388 0.000000 1.803708 ( 1.803610)
serial/26_ruby 0 0.000283 0.000000 1.274182 ( 1.274051)
serial/27_ruby 0 0.000243 0.000000 1.102547 ( 1.102493)
serial/ruby 0 0.000340 0.000000 1.892771 ( 1.892611)
r_serial/26_mini N/A
r_serial/27_mini N/A
r_serial/miniruby 1 0.000271 0.000000 2.484844 ( 2.485867)
r_serial/26_ruby N/A
r_serial/27_ruby N/A
r_serial/ruby 1 0.000308 0.000000 2.736252 ( 2.737059)
r_parallel/26_mini N/A
r_parallel/27_mini N/A
r_parallel/miniruby 1 0.000232 0.000000 20.958254 ( 5.828039)
r_parallel/26_ruby N/A
r_parallel/27_ruby N/A
r_parallel/ruby 1 0.000396 0.000000 22.323984 ( 6.165640)
Updated by ko1 (Koichi Sasada) almost 4 years ago
With perf (using the --call-graph dwarf option), I may have found the source of the big difference:
master:
- 63.72% 6.23% miniruby miniruby [.] inject_op_i
- 57.49% inject_op_i
- 55.13% rb_funcallv_public
- rb_call (inlined)
- 26.94% rb_call0
+ 17.69% rb_callable_method_entry_with_refinements
+ 3.30% stack_check (inlined)
2.75% rb_method_call_status (inlined)
+ 1.03% rb_class_of (inlined)
- 25.91% rb_vm_call0
- vm_call0_body (inlined)
+ 13.27% vm_call0_cfunc (inlined)
0.50% vm_passed_block_handler (inlined)
0.50% rb_vm_check_ints (inlined)
1.21% rb_current_execution_context (inlined)
0.59% rb_vm_call_kw
1.73% rb_sym2id
0.63% rb_enum_values_pack
+ 4.38% _start
+ 0.53% 0x564cbc05b517
ruby 2.7:
- 46.93% 9.09% miniruby miniruby [.] inject_op_i
- 37.85% inject_op_i
- 34.18% rb_funcallv_with_cc
- 23.15% vm_call0_body
+ 17.07% vm_call0_cfunc (inlined)
0.81% rb_vm_check_ints (inlined)
+ 4.22% vm_search_method (inlined)
2.94% rb_sym2id
0.73% rb_enum_values_pack
+ 8.13% _start
Updated by marcandre (Marc-Andre Lafortune) almost 4 years ago
Just to be clear, there may be two different issues:
- Ruby 2.x vs Ruby 3.0 performance regression. This can be important to figure out, but is not why I opened this issue.
- Threads vs Ractor in Ruby 3.0. My tests are all in Ruby 3.0. This is what this issue is about.
Updated by inversion (Yura Babak) almost 4 years ago
I also made 2 posts about strange performance testing results (with sources) and some conclusions.
In my case, 2 ractors take 3 times longer than doing the same payload in the main thread.
https://www.reddit.com/r/ruby/comments/kpmt73/ruby_30_ractors_performance_test_strange_results/
https://www.reddit.com/r/ruby/comments/krq5xe/when_ruby3_ractors_are_good_and_when_not_yet/
The test:
require 'digest/sha2'

N = 1_500_000

def workload
  N.times do |n|
    Digest::SHA2.base64digest(n.to_s)
  end
end

puts 'in the main thread'
t_start = Time.now
workload
puts "Total: %.3f" % (Time.now - t_start)
puts

worker1 = Ractor.new do
  Ractor.receive
  print '['; workload; ']'
end

worker2 = Ractor.new do
  Ractor.receive
  print '['; workload; ']'
end

puts 'in 2 ractors'
t_start = Time.now
worker1.send 'start'
worker2.send 'start'
print '='
print worker1.take
print worker2.take
puts
puts "Total: %.3f" % (Time.now - t_start)
Updated by keithrbennett (Keith Bennett) almost 4 years ago
I too have seen strange results testing ractors. I used the code at https://github.com/keithrbennett/keithrbennett-ractor-test/blob/master/my_ractor.rb to do some arbitrary but predictable work. I have a 24-core Ryzen 9 CPU, and I compared using 1 ractor with using 24. With 24, htop reported that all the CPU's were at 100% most of the time, yet the elapsed time using 24 CPU's was only about a third less than when using 1 CPU. Also, the CPU's seemed to be working collectively about ten times harder with 24 CPU's. Here is the program output:
1 CPU:
Many HTOP readings are < 100% for all CPU's
time ractor/my_ractor.rb ruby '*.rb'
Running the following command to find all filespecs to process: find -L ruby -type f -name '*.rb' -print
Processing 8218 files in 1 slices, whose sizes are:
[8218]
ractor/my_ractor.rb ruby '*.rb' 2513.90s user 6.75s system 99% cpu 42:03.01 total
24 CPU's:
% time ractor/my_ractor.rb ruby '*.rb' ; espeak finished
Running the following command to find all filespecs to process: find -L ruby -type f -name '*.rb' -print
Processing 8218 files in 24 slices, whose sizes are:
[343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 343, 329]
ractor/my_ractor.rb ruby '*.rb' 22986.42s user 14.98s system 1134% cpu 33:47.96 total
(In the command, ruby refers to the directory in which I've cloned the GitHub Ruby repo.)
Here is the current content of the test program:
#!/usr/bin/env ruby

require 'amazing_print'
require 'etc'
require 'set'
require 'shellwords'
require 'yaml'

raise "This script requires Ruby version 3 or later." unless RUBY_VERSION.split('.').first.to_i >= 3

# An instance of this parser class is created for each ractor.
class RactorParser

  attr_reader :dictionary_words

  def initialize(dictionary_words)
    @dictionary_words = dictionary_words
  end

  def parse(filespecs)
    filespecs.inject(Set.new) do |found_words, filespec|
      found_words | process_one_file(filespec)
    end
  end

  private def word?(string)
    dictionary_words.include?(string)
  end

  private def strip_punctuation(string)
    punctuation_regex = /[[:punct:]]/
    string.gsub(punctuation_regex, ' ')
  end

  private def file_lines(filespec)
    command = "strings #{Shellwords.escape(filespec)}"
    text = `#{command}`
    strip_punctuation(text).split("\n")
  end

  private def line_words(line)
    line.split.map(&:downcase).select { |text| word?(text) }
  end

  private def process_one_file(filespec)
    file_words = Set.new
    file_lines(filespec).each do |line|
      line_words(line).each { |word| file_words << word }
    end
    # puts "Found #{file_words.count} words in #{filespec}."
    file_words
  end
end

class Main

  BASEDIR = ARGV[0] || '.'
  FILEMASK = ARGV[1]
  CPU_COUNT = Etc.nprocessors

  def call
    check_arg_count
    slices = get_filespec_slices
    ractors = create_and_populate_ractors(slices)
    all_words = collate_ractor_results(ractors)
    yaml = all_words.to_a.sort.to_yaml
    File.write('ractor-words.yaml', yaml)
    puts "Words are in ractor-words.yaml."
  end

  private def check_arg_count
    if ARGV.length > 2
      puts "Syntax is ractor [base_directory] [filemask], and filemask must be quoted so that the shell does not expand it."
      exit -1
    end
  end

  private def collate_ractor_results(ractors)
    ractors.inject(Set.new) do |all_words, ractor|
      all_words | ractor.take
    end
  end

  private def get_filespec_slices
    all_filespecs = find_all_filespecs
    slice_size = (all_filespecs.size / CPU_COUNT) + 1
    # slice_size = all_filespecs.size # use this line instead of previous to test with 1 ractor
    slices = all_filespecs.each_slice(slice_size).to_a
    puts "Processing #{all_filespecs.size} files in #{slices.size} slices, whose sizes are:\n#{slices.map(&:size).inspect}"
    slices
  end

  private def create_and_populate_ractors(slices)
    words = File.readlines('/usr/share/dict/words').map(&:chomp).map(&:downcase).sort
    slices.map do |slice|
      ractor = Ractor.new do
        filespecs = Ractor.receive
        dictionary_words = Ractor.receive
        RactorParser.new(dictionary_words).parse(filespecs)
      end
      ractor.send(slice)
      ractor.send(words)
      ractor
    end
  end

  private def find_all_filespecs
    filemask = FILEMASK ? %Q{-name '#{FILEMASK}'} : ''
    command = "find -L #{BASEDIR} -type f #{filemask} -print"
    puts "Running the following command to find all filespecs to process: #{command}"
    `#{command}`.split("\n")
  end
end

Main.new.call
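The slice sizes printed in the 24-CPU run follow directly from the formula in get_filespec_slices. The sketch below reproduces them with plain integers instead of file names. The ceiling-division variant is not part of the script; it is shown only as a possible alternative, because the "+ 1" formula can yield fewer slices than workers when the file count divides evenly.

```ruby
# Same formula as get_filespec_slices, applied to a count instead of real filespecs.
def slice_sizes(file_count, worker_count)
  slice_size = (file_count / worker_count) + 1
  (0...file_count).each_slice(slice_size).map(&:size)
end

# Hypothetical alternative: ceiling division, so evenly divisible counts fill every worker.
def ceil_slice_sizes(file_count, worker_count)
  slice_size = (file_count + worker_count - 1) / worker_count
  (0...file_count).each_slice(slice_size).map(&:size)
end

p slice_sizes(8218, 24).tally       # the 8218-file run from the output above
p slice_sizes(48, 24).size          # number of slices when 48 files go to 24 workers
p ceil_slice_sizes(48, 24).size     # same input with ceiling division
```

For the 8218-file run both formulas give the same slices, so this does not explain the timings reported above.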
Updated by keithrbennett (Keith Bennett) almost 4 years ago
I've updated the software I used to measure this, and moved it to https://github.com/keithrbennett/keithrbennett-ractor-test.
Updated by ko1 (Koichi Sasada) almost 4 years ago
- Status changed from Open to Closed
Applied in changeset git|1ecda213668644d656eb0d60654737482447dd92.
global call-cache cache table for rb_funcall*
rb_funcall* (rb_funcall(), rb_funcallv(), ...) functions invokes
Ruby's method with given receiver. Ruby 2.7 introduced inline method
cache with static memory area. However, Ruby 3.0 reimplemented the
method cache data structures and the inline cache was removed.
Without inline cache, rb_funcall* searched methods everytime.
Most of cases per-Class Method Cache (pCMC) will be helped but
pCMC requires VM-wide locking and it hurts performance on
multi-Ractor execution, especially all Ractors calls methods
with rb_funcall*.
This patch introduced Global Call-Cache Cache Table (gccct) for
rb_funcall*. Call-Cache was introduced from Ruby 3.0 to manage
method cache entry atomically and gccct enables method-caching
without VM-wide locking. This table solves the performance issue
on multi-ractor execution.
[Bug #17497]
Ruby-level method invocation does not use gccct because it has
inline-method-cache and the table size is limited. Basically
rb_funcall* is not used frequently, so 1023 entries can be enough.
We will revisit the table size if it is not enough.
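To see the distinction the commit message draws, one can time the same sum written two ways: with a symbol, where the perf trace earlier in this thread shows the operator being called through rb_funcallv_public, and with a block, where `+` is an ordinary Ruby-level call. This is a rough sketch; the range size, the ractor count and the helper names are assumptions, and on a build that already contains the fix (or other fast paths) the two timings may be close.

```ruby
require 'benchmark'

Warning[:experimental] = false # hide the "Ractor is experimental" warning

N = 200_000 # arbitrary size, small enough to finish quickly

def sum_with_symbol(n)
  (1..n).inject(:+)              # operator named by symbol, dispatched from C
end

def sum_with_block(n)
  (1..n).inject { |a, b| a + b } # `+` called from Ruby code inside the block
end

# Some Ruby versions provide Ractor#value instead of Ractor#take; use whichever exists.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

def in_ractors(method_name, n, count: 4)
  # __send__ is used because Ractor#send means "send a message", not "call a method".
  ractors = Array.new(count) { Ractor.new(method_name, n) { |m, arg| __send__(m, arg) } }
  ractors.map { |r| ractor_result(r) }
end

sums = {}
%i[sum_with_symbol sum_with_block].each do |m|
  secs = Benchmark.realtime { sums[m] = in_ractors(m, N) }
  puts format('%-16s %.3f s', m, secs)
end
```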
Updated by ko1 (Koichi Sasada) almost 4 years ago
quoted from https://github.com/ruby/ruby/pull/4129#issuecomment-769613184
call the following methods as a task:
def task_range_inject
(1..20_000_000).inject(:+)
end
with
- 4 times sequentially
- 4 times sequentially with a sleeping ractor
- 4 ractors in parallel
on
26_mini ruby 2.6.7p150 (2020-12-09 revision 67888) [x86_64-linux]
27_mini ruby 2.7.3p140 (2020-12-09 revision 9b884df6dd) [x86_64-linux]
master_mini ruby 3.1.0dev (2021-01-29T05:27:53Z master 9241211538) [x86_64-linux]
miniruby ruby 3.1.0dev (2021-01-29T06:21:39Z gh-4129 f996e15ff6) [x86_64-linux]
26_ruby ruby 2.6.7p150 (2020-12-09 revision 67888) [x86_64-linux]
27_ruby ruby 2.7.3p139 (2020-10-11 revision d1ba554551) [x86_64-linux]
master_ruby ruby 3.1.0dev (2021-01-29T05:27:53Z master 9241211538) [x86_64-linux]
ruby ruby 3.1.0dev (2021-01-29T06:21:39Z gh-4129 f996e15ff6) [x86_64-linux]
result:
user system total real
serial/26_mini 0 0.000145 0.000039 2.782685 ( 2.783691)
serial/27_mini 0 0.000133 0.000036 2.320257 ( 2.320305)
serial/master_mini 0 0.000141 0.000039 3.756926 ( 3.756963)
serial/miniruby 0 0.000136 0.000037 2.598088 ( 2.598126)
serial/26_ruby 0 0.000143 0.000038 2.695175 ( 2.704443)
serial/27_ruby 0 0.000139 0.000038 2.391679 ( 2.401067)
serial/master_ruby 0 0.000139 0.000038 4.391577 ( 4.391626)
serial/ruby 0 0.000128 0.000035 3.109923 ( 3.109991)
r_serial/26_mini N/A
r_serial/27_mini N/A
r_serial/master_mini 1 0.000146 0.000040 5.133049 ( 5.133056)
r_serial/miniruby 1 0.000133 0.000037 2.597336 ( 2.597300)
r_serial/26_ruby N/A
r_serial/27_ruby N/A
r_serial/master_ruby 1 0.000147 0.000040 5.910876 ( 5.910907)
r_serial/ruby 1 0.000135 0.000037 2.875752 ( 2.875727)
r_parallel/26_mini N/A
r_parallel/27_mini N/A
r_parallel/master_mini 1 0.000100 0.000028 39.297123 ( 10.160359)
r_parallel/miniruby 1 0.000110 0.000030 2.703400 ( 0.695634)
r_parallel/26_ruby N/A
r_parallel/27_ruby N/A
r_parallel/master_ruby 1 0.000122 0.000034 38.941810 ( 10.072950)
r_parallel/ruby 1 0.000131 0.000036 2.980137 ( 0.757672)
Updated by ko1 (Koichi Sasada) almost 4 years ago
inversion (Yura Babak) wrote in #note-10:
I also made 2 posts about strange performance testing results (with sources) and some conclusions.
In my case, 2 ractors work 3-times longer than doing the same payload in the main thread.
With digest benchmark:
user system total real
serial/26_ruby 0 0.000239 0.000079 2.497317 ( 2.503279)
serial/27_ruby 0 0.000972 0.000324 2.306552 ( 2.310275)
serial/master_ruby 0 0.000293 0.000098 3.776824 ( 3.776623)
serial/ruby 0 0.000190 0.000063 2.668395 ( 2.668287)
r_serial/26_ruby N/A
r_serial/27_ruby N/A
r_serial/master_ruby 1 0.000407 0.000000 5.579597 ( 5.579969)
r_serial/ruby 1 0.000476 0.000000 2.682626 ( 2.683627)
r_parallel/26_ruby N/A
r_parallel/27_ruby N/A
r_parallel/master_ruby 1 0.000242 0.000000 48.298173 ( 12.959255)
r_parallel/ruby 1 0.000242 0.000000 3.782164 ( 0.984832)
seems solved.
Updated by ko1 (Koichi Sasada) almost 4 years ago
keithrbennett (Keith Bennett) wrote in #note-11:
I too have seen strange results testing ractors. I used the code at https://github.com/keithrbennett/keithrbennett-ractor-test/blob/master/my_ractor.rb to do some arbitrary but predictable work. I have a 24-core Ryzen 9 CPU, and I compared using 1 ractor with using 24. With 24, htop reported that all the CPU's were at 100% most of the time, yet the elapsed time using 24 CPU's was only about a third less than when using 1 CPU. Also, the CPU's seemed to be working collectively about ten times harder with 24 CPU's. Here is the program output:
could you check it again?
Updated by keithrbennett (Keith Bennett) almost 4 years ago
@ko1 (Koichi Sasada) - My apologies for not responding sooner. I guess I have not configured this forum correctly to receive notifications, I'll look into that.
I've tested my benchmark against Ruby head, and performance with multiple cores seems to have degraded. Perhaps I have made an error in my approach, I don't know. I will paste my results below. My OS is "Ubuntu 20.04.2 LTS" (Kubuntu).
In case it's useful, I've made my script easier to use; it automatically tests and compares 1 ractor with (CPU_count) ractors. You can find it at https://github.com/keithrbennett/keithrbennett-ractor-test/blob/master/ractor-file-strings-test.rb. Information about configuring it, what it does, etc., is included in comments at the top of the script.
Small Data Set:
Ruby 3.0.0:
1 CPU 24 CPU's Factor
----------------------------------------------------------------
User 6.75500 52.34200 7.74863
System 0.00000 0.02800 0.00415
Total 6.83900 52.40700 7.66296
Real 6.83300 4.10500 0.60076
2021-01-31 Ruby head (ruby 3.1.0dev (2021-01-31T09:48:28Z master 22b8ddfd10) [x86_64-linux]):
1 CPU 24 CPU's Factor
----------------------------------------------------------------
User 6.18000 56.27400 9.10583
System 0.00400 0.02800 0.00453
Total 6.26800 56.34200 8.98883
Real 6.26100 4.26200 0.68072
================================================================
Larger Data Set:
Ruby 3.0.0:
1 CPU 24 CPU's Factor
----------------------------------------------------------------
User 51.01000 499.67200 9.79557
System 0.04000 0.25900 0.00508
Total 51.32300 500.12900 9.74473
Real 51.31200 45.56600 0.88802
2021-01-31 Ruby head (ruby 3.1.0dev (2021-01-31T09:48:28Z master 22b8ddfd10) [x86_64-linux]):
1 CPU 24 CPU's Factor
----------------------------------------------------------------
User 47.08900 486.34400 10.32819
System 0.03200 0.20300 0.00431
Total 47.39500 486.74800 10.27003
Real 47.38400 43.95100 0.92755
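The "Factor" column is the 24-CPU value divided by the 1-CPU value, so the speedup is its inverse for the Real row. The sketch below recomputes speedup, parallel efficiency and the growth in user CPU time from the numbers in the tables; the Struct and field names are made up for illustration.

```ruby
# Real and User times (seconds) copied from the four tables above.
Run = Struct.new(:label, :real_1, :real_n, :user_1, :user_n)

CPUS = 24
runs = [
  Run.new('small, 3.0.0', 6.833,  4.105,  6.755,  52.342),
  Run.new('small, head',  6.261,  4.262,  6.180,  56.274),
  Run.new('large, 3.0.0', 51.312, 45.566, 51.010, 499.672),
  Run.new('large, head',  47.384, 43.951, 47.089, 486.344)
]

speedups = runs.map do |run|
  speedup    = run.real_1 / run.real_n # wall-clock gain from using 24 ractors
  efficiency = speedup / CPUS          # 1.0 would be perfect scaling
  extra_cpu  = run.user_n / run.user_1 # how much more CPU time was burned
  puts format('%-13s speedup %.2fx | efficiency %4.1f%% | user time x%.1f',
              run.label, speedup, efficiency * 100, extra_cpu)
  speedup
end
```

In all four runs the speedup stays below 2x while user time grows by roughly 8x to 10x, which suggests the ractors spend most of their time contending rather than doing useful work.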
Updated by ko1 (Koichi Sasada) almost 4 years ago
- Status changed from Closed to Assigned
keithrbennett (Keith Bennett) wrote in #note-17:
I've tested my benchmark against Ruby head, and performance with multiple cores seems to have degraded. Perhaps I have made an error in my approach, I don't know. I will paste my results below. My OS is "Ubuntu 20.04.2 LTS" (Kubuntu).
I confirmed with the following script
WORDS = Ractor.make_shareable File.readlines('/usr/share/dict/words').map(&:chomp).map(&:downcase).sort

def try
  File.readlines(__dir__ + '/compar.c').each{|line|
    line.split.map(&:downcase).select { |text|
      WORDS.include? text
    }
  }
end

Warning[:experimental] = false

require 'benchmark'

Benchmark.bm{|x|
  x.report{
    4.times{try}
  }
  x.report{
    4.times.map{
      Ractor.new{ try }
    }.each(&:take)
  }
}
__END__
user system total real
4.501388 0.001541 4.502929 ( 4.502980)
16.763446 0.000018 16.763464 ( 4.335964)
It compares 4 sequential calls of the try method with 4 calls of the try method on ractors in parallel.
Comparing real time, it is 4.5 vs 4.3 sec. It is not slow, but not fast for 4 cores.
The reason seems to be WORDS.include? text. I'll investigate more.
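The script above differs from the original test program in one respect worth spelling out: the word list is made shareable once and read through a constant, whereas the original passed it with ractor.send, which copies an unshareable array for each ractor. A minimal sketch of the shareable approach, with a tiny made-up word list standing in for /usr/share/dict/words:

```ruby
Warning[:experimental] = false # hide the "Ractor is experimental" warning

# Hypothetical stand-in for the dictionary file.
words = %w[ractor thread ruby fiber].sort

# make_shareable deep-freezes the array and its strings and returns the same object.
WORDS = Ractor.make_shareable(words)
puts Ractor.shareable?(WORDS)

# Some Ruby versions provide Ractor#value instead of Ractor#take; use whichever exists.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

# A constant holding a shareable object can be read from any ractor without copying.
lookups = %w[ruby python].map do |candidate|
  Ractor.new(candidate) { |text| WORDS.include?(text) }
end
found = lookups.map { |r| ractor_result(r) }
p found
```

Sharing removes the copying cost, but as the timings above show, it does not by itself make the include? calls scale.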
Updated by ko1 (Koichi Sasada) almost 4 years ago
- Status changed from Assigned to Closed
Applied in changeset git|813fe4c256f89babebb8ab53821ae5eb6bb138c6.
opt_equality_by_mid for rb_equal_opt
This patch improves the performance of sequential and parallel
execution of rb_equal() (and rb_eql()).
[Bug #17497]
rb_equal_opt (and rb_eql_opt) does not have own cd and it waste
a time to initialize cd. This patch introduces opt_equality_by_mid()
to check equality without cd.
Furthermore, current master uses "static" cd on rb_equal_opt
(and rb_eql_opt) and it hurts CPU caches on multi-thread execution.
Now they are gone so there are no bottleneck on parallel execution.
Updated by ko1 (Koichi Sasada) almost 4 years ago
maybe git|813fe4c256f89babebb8ab53821ae5eb6bb138c6 solved the issue.
could you check it?
Updated by ko1 (Koichi Sasada) almost 4 years ago
- Backport changed from 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN to 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: REQUIRED
Updated by keithrbennett (Keith Bennett) almost 4 years ago
Koichi -
Huge improvement! Thank you!
I installed Ruby head and now have the following output from ruby -v:
ruby 3.1.0dev (2021-02-15T09:29:35Z master 37b90bcdc1) [x86_64-linux]
I made minor modifications to your script (see https://gist.github.com/keithrbennett/18f10124354d62eb8ba5feafaa9b39dc) and then ran it in the Ruby project root directory and got the following results:
On my Linux (Kubuntu 20.04.2) desktop:
Measuring first sequentially on main ractor and then with 24 ractors:
user system total real
14.969907 0.003891 14.973798 ( 14.977699)
29.087580 0.051934 29.139514 ( 1.243316)
0.515 User time difference factor
12.047 Real time difference factor
And then on my 2015 Mac:
Measuring first sequentially on main ractor and then with 4 ractors:
user system total real
10.477194 0.047028 10.524222 ( 10.605862)
18.226199 0.068098 18.294297 ( 5.101498)
0.575 User time difference factor
2.079 Real time difference factor
It's interesting that the real time difference factor on both machines is so close to ((the number of CPU's and ractors) / 2.0).
The original script I used to test (at https://github.com/keithrbennett/keithrbennett-ractor-test/blob/master/ractor-file-strings-test.rb) was not very good at distributing work among the ractors equally, and this made the real time observations less reliable, since the real time was really the real time of the longest running ractor. Your script is much better in that way. It would be interesting to test more parts of the standard library though, such as the Set instantiations and merges I had used; if I have time I'll look into that.
P.S. Sorry it took so long to respond; given that notifications don't seem to work, I need to develop a habit of manually checking here every day.
Updated by naruse (Yui NARUSE) over 3 years ago
- Backport changed from 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: REQUIRED to 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: DONE
ruby_3_0 de6072a22edbaab3793cf7f976cc9e0118d0df40 merged revision(s) abdc634f64a440afcdc7f23c9757d27aab4db8a9,083c5f08ec4e95c9b75810d46f933928327a5ab3,1ecda213668644d656eb0d60654737482447dd92,813fe4c256f89babebb8ab53821ae5eb6bb138c6.