Misc #16125
Updated by nobu (Nobuyoshi Nakada) over 4 years ago
References PR https://github.com/ruby/ruby/pull/2396

I noticed that since the introduction of the `GC.compact` API, the struct `rb_data_type_t` spans multiple cache lines due to the newly introduced `dcompact` function pointer / callback:

```C
struct rb_data_type_struct {
        const char *               wrap_struct_name;        /*     0     8 */
        struct {
                void               (*dmark)(void *);        /*     8     8 */
                void               (*dfree)(void *);        /*    16     8 */
                size_t             (*dsize)(const void *);  /*    24     8 */
                void               (*dcompact)(void *);     /*    32     8 */ <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                void *             reserved[1];             /*    40     8 */
        } function;                                         /*     8    40 */
        const rb_data_type_t *     parent;                  /*    48     8 */
        void *                     data;                    /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        VALUE                      flags;                   /*    64     8 */

        /* size: 72, cachelines: 2, members: 5 */
        /* last cacheline: 8 bytes */
};
```

I'm wondering what the `reserved` member was originally intended for, given that introducing the `dcompact` member already broke binary compatibility by growing the struct from `64` to `72` bytes while preserving the `reserved` member as well. This struct is defined in `include/ruby.h` and used extensively in MRI, but also in extensions, and is thus "public API".

If there's the off chance that the `reserved` member isn't needed moving forward (maybe it was set aside for compaction or a similar GC feature?), could we remove it and prefer aligning on cache line boundaries instead?
Packed, with the `reserved` member removed, the struct fits in a single cache line:

```C
struct rb_data_type_struct {
        const char *               wrap_struct_name;        /*     0     8 */
        struct {
                void               (*dmark)(void *);        /*     8     8 */
                void               (*dfree)(void *);        /*    16     8 */
                size_t             (*dsize)(const void *);  /*    24     8 */
                void               (*dcompact)(void *);     /*    32     8 */
        } function;                                         /*     8    32 */
        const rb_data_type_t *     parent;                  /*    40     8 */
        void *                     data;                    /*    48     8 */
        VALUE                      flags;                   /*    56     8 */

        /* size: 64, cachelines: 1, members: 5 */
};
```

### Usage in MRI

Examples of internal APIs that use it, showing how, with the initializer style used in MRI, the typed data declarations never spell out the tail of the `function` struct (I realize this may not be true for all extensions):

#### AST

```C
static const rb_data_type_t rb_node_type = {
    "AST/node",
    {node_gc_mark, RUBY_TYPED_DEFAULT_FREE, node_memsize,},
    0, 0,
    RUBY_TYPED_FREE_IMMEDIATELY,
};
```

#### Fiber

```C
static const rb_data_type_t fiber_data_type = {
    "fiber",
    {fiber_mark, fiber_free, fiber_memsize, fiber_compact,},
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

#### Enumerator

And related generator etc. types.
```C
static const rb_data_type_t enumerator_data_type = {
    "enumerator",
    {
        enumerator_mark,
        enumerator_free,
        enumerator_memsize,
        enumerator_compact,
    },
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

#### Encoding

```C
static const rb_data_type_t encoding_data_type = {
    "encoding",
    {0, 0, 0,},
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

#### Proc, Binding and methods

```C
static const rb_data_type_t proc_data_type = {
    "proc",
    {
        proc_mark,
        RUBY_TYPED_DEFAULT_FREE,
        proc_memsize,
        proc_compact,
    },
    0, 0,
    RUBY_TYPED_FREE_IMMEDIATELY | RUBY_TYPED_WB_PROTECTED
};
```

```C
const rb_data_type_t ruby_binding_data_type = {
    "binding",
    {
        binding_mark,
        binding_free,
        binding_memsize,
        binding_compact,
    },
    0, 0, RUBY_TYPED_WB_PROTECTED | RUBY_TYPED_FREE_IMMEDIATELY
};
```

```C
static const rb_data_type_t method_data_type = {
    "method",
    {
        bm_mark,
        RUBY_TYPED_DEFAULT_FREE,
        bm_memsize,
        bm_compact,
    },
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

#### Threads

```C
#define thread_data_type ruby_threadptr_data_type

const rb_data_type_t ruby_threadptr_data_type = {
    "VM/thread",
    {
        thread_mark,
        thread_free,
        thread_memsize,
        thread_compact,
    },
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

And *many* others, both internal and in `ext/`. Looking at the definitions in MRI at least, I don't see:

* any pattern of a typed data definition explicitly initializing the `reserved` member
* how this would negatively affect "in the wild" extensions, as the more popular ones I referenced also follow the MRI initialization style.

### Benchmarks

Focused on the benchmarks from the standard suite that exercise the typed data objects mentioned above.

Prelude:

```
lourens@CarbonX1:~/src/ruby/ruby$ make benchmark COMPARE_RUBY=~/src/ruby/trunk/ruby OPTS="-v --repeat-count 10"
./revision.h unchanged
/usr/local/bin/ruby --disable=gems -rrubygems -I./benchmark/lib ./benchmark/benchmark-driver/exe/benchmark-driver \
    --executables="compare-ruby::/home/lourens/src/ruby/trunk/ruby -I.ext/common --disable-gem" \
    --executables="built-ruby::./miniruby -I./lib -I. -I.ext/common ./tool/runruby.rb --extout=.ext -- --disable-gems --disable-gem" \
    $(find ./benchmark -maxdepth 1 -name '' -o -name '**.yml' -o -name '**.rb' | sort) -v --repeat-count 10
compare-ruby: ruby 2.7.0dev (2019-08-20T13:33:32Z master 235d810c2e) [x86_64-linux]
built-ruby: ruby 2.7.0dev (2019-08-20T15:03:21Z pack-rb_data_type_t 92b8641ccd) [x86_64-linux]
```

Left side `compare-ruby` (master), right side `built-ruby` (this branch):

```
require_thread              0.035    0.049 i/s -   1.000 times in 28.932403s 20.426896s
vm1_blockparam_call       18.885M  18.907M i/s -  30.000M times in 1.588571s 1.586713s
vm1_blockparam_pass       15.159M  15.434M i/s -  30.000M times in 1.978964s 1.943805s
vm1_blockparam_yield      20.560M  20.673M i/s -  30.000M times in 1.459127s 1.451188s
vm1_blockparam            32.733M  33.358M i/s -  30.000M times in 0.916513s 0.899344s
vm1_block                 33.796M  34.215M i/s -  30.000M times in 0.887692s 0.876808s
vm2_fiber_reuse_gc         98.480  104.688 i/s - 100.000 times in 1.015439s 0.955219s
vm2_fiber_reuse           364.082  397.878 i/s - 200.000 times in 0.549327s 0.502667s
vm2_fiber_switch          11.548M  11.730M i/s -  20.000M times in 1.731852s 1.704978s
vm2_proc                  36.025M  36.278M i/s -   6.000M times in 0.166552s 0.165389s
vm_thread_alive_check    108.273k 109.290k i/s -  50.000k times in 0.461794s 0.457499s
vm_thread_close             1.415    1.432 i/s -   1.000 times in 0.706720s 0.698509s
vm_thread_condvar1          1.287    1.287 i/s -   1.000 times in 0.776782s 0.777074s
vm_thread_condvar2          1.653    1.615 i/s -   1.000 times in 0.604922s 0.619380s
vm_thread_create_join       0.913    0.921 i/s -   1.000 times in 1.094693s 1.085227s
vm_thread_mutex1            2.537    2.581 i/s -   1.000 times in 0.394181s 0.387481s
vm_thread_mutex2            2.571    2.577 i/s -   1.000 times in 0.388932s 0.388020s
vm_thread_mutex3            1.110    1.660 i/s -   1.000 times in 0.900852s 0.602422s
vm_thread_pass_flood        5.867    9.997 i/s -   1.000 times in 0.170431s 0.100032s
vm_thread_pass              0.349    0.350 i/s -   1.000 times in 2.865303s 2.854191s
vm_thread_pipe              6.923    7.093 i/s -   1.000 times in 0.144447s 0.140993s
vm_thread_queue             1.297    1.287 i/s -   1.000 times in 0.771302s 0.777274s
vm_thread_sized_queue2      1.538    1.479 i/s -   1.000 times in 0.650188s 0.676074s
vm_thread_sized_queue3      1.421    1.456 i/s -   1.000 times in 0.703753s 0.686595s
vm_thread_sized_queue4      1.347    1.342 i/s -   1.000 times in 0.742653s 0.745130s
vm_thread_sized_queue       5.473    5.377 i/s -   1.000 times in 0.182710s 0.185966s
```

### Further cache utilization info

Used `perf stat` on a Rails console, using the integration session helper to load the Redmine homepage 100 times (this removes network round trips and other variance, and is easier for reviewers to reproduce with fewer tools).

Master:

```
lourens@CarbonX1:~/src/redmine$ sudo perf stat -d bin/rails c -e production
Loading production environment (Rails 5.2.3)
irb(main):001:0> 100.times { app.get('/') }
----- truncated -----
Processing by WelcomeController#index as HTML
  Current user: anonymous
  Rendering welcome/index.html.erb within layouts/base
  Rendered welcome/index.html.erb within layouts/base (0.5ms)
Completed 200 OK in 13ms (Views: 5.1ms | ActiveRecord: 1.3ms)
=> 100
irb(main):002:0> RUBY_DESCRIPTION
=> "ruby 2.7.0dev (2019-08-20T13:33:32Z master 235d810c2e) [x86_64-linux]"
irb(main):003:0> exit

 Performance counter stats for 'bin/rails c -e production':

       4373,155316      task-clock (msec)       #    0,093 CPUs utilized
               819      context-switches        #    0,187 K/sec
                30      cpu-migrations          #    0,007 K/sec
             82376      page-faults             #    0,019 M/sec
       13340422873      cycles                  #    3,051 GHz                    (50,18%)
       17274934973      instructions            #    1,29  insn per cycle         (62,74%)
        3558147880      branches                #  813,634 M/sec                  (62,42%)
          77703222      branch-misses           #    2,18% of all branches        (62,39%)
        4625597415      L1-dcache-loads         # 1057,725 M/sec                  (62,22%)
         216886763      L1-dcache-load-misses   #    4,69% of all L1-dcache hits  (62,54%)
          66242477      LLC-loads               #   15,148 M/sec                  (50,19%)
          13766303      LLC-load-misses         #   20,78% of all LL-cache hits   (50,05%)

      47,171186591 seconds time elapsed
```

This branch:

```
lourens@CarbonX1:~/src/redmine$ sudo perf stat -d bin/rails c -e production
Loading production environment (Rails 5.2.3)
irb(main):001:0> 100.times { app.get('/') }
----- truncated -----
Started GET "/" for 127.0.0.1 at 2019-08-20 23:40:43 +0100
Processing by WelcomeController#index as HTML
  Current user: anonymous
  Rendering welcome/index.html.erb within layouts/base
  Rendered welcome/index.html.erb within layouts/base (0.6ms)
Completed 200 OK in 13ms (Views: 5.1ms | ActiveRecord: 1.4ms)
=> 100
irb(main):002:0> p RUBY_DESCRIPTION
"ruby 2.7.0dev (2019-08-20T15:03:21Z pack-rb_data_type_t 92b8641ccd) [x86_64-linux]"
=> "ruby 2.7.0dev (2019-08-20T15:03:21Z pack-rb_data_type_t 92b8641ccd) [x86_64-linux]"
irb(main):003:0> exit

 Performance counter stats for 'bin/rails c -e production':

       4318,441633      task-clock (msec)       #    0,112 CPUs utilized
               599      context-switches        #    0,139 K/sec
                14      cpu-migrations          #    0,003 K/sec
             81011      page-faults             #    0,019 M/sec
       13241070220      cycles                  #    3,066 GHz                    (49,56%)
       17323594358      instructions            #    1,31  insn per cycle         (62,27%)
        3553794043      branches                #  822,934 M/sec                  (62,89%)
          76390145      branch-misses           #    2,15% of all branches        (63,12%)
        4595415722      L1-dcache-loads         # 1064,138 M/sec                  (62,83%)
         202269349      L1-dcache-load-misses   #    4,40% of all L1-dcache hits  (62,66%)
          66193702      LLC-loads               #   15,328 M/sec                  (49,44%)
          12548399      LLC-load-misses         #   18,96% of all LL-cache hits   (49,49%)

      38,464764876 seconds time elapsed
```

Conclusions:

* Minor improvement in instructions per cycle (`1,29` -> `1,31`)
* `L1-dcache-loads`: `1057,725 M/sec` -> `1064,138 M/sec` (higher rate of L1 cache loads)
* `L1-dcache-load-misses`: `4,69%` -> `4,40%` (reduced L1 cache miss rate)

Thoughts?