Feature #16837
closed
Can we make Ruby 3.0 as fast as Ruby 2.7 with the new assertions?
Description
Problem
How can we make Ruby 3.0 as fast as (or faster than) Ruby 2.7?
Background
- "Split ruby.h" https://github.com/ruby/ruby/pull/2991 added some new assertions
- While they have been helpful for revealing various bugs, they also made some Ruby programs notably slower, especially Optcarrot https://benchmark-driver.github.io/benchmarks/optcarrot/commits.html
Possible approaches
I have no strong preference yet. Here are some random ideas:
- Optimize the assertion code somehow
- Enable the new assertions only on CIs, at least ones in hot spots
- Not sure which places have large impact on Optcarrot yet
- Make some other not-so-important assertions CI-only to offset the impact from new ones
- Provide .so for an assertion-enabled mode? (ko1's idea)
I hope people will contribute more ideas in this ticket.
Updated by k0kubun (Takashi Kokubun) over 4 years ago
- Tracker changed from Bug to Feature
- Backport deleted (2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN)
Updated by k0kubun (Takashi Kokubun) over 4 years ago
- Description updated (diff)
Updated by k0kubun (Takashi Kokubun) over 4 years ago
- Description updated (diff)
Updated by shyouhei (Shyouhei Urabe) over 4 years ago
I would like to suggest that if a user really favors speed over sanity checks, they should just compile everything with -DNDEBUG. This has been the standard C manner since long before Ruby's birth.
Updated by shyouhei (Shyouhei Urabe) over 4 years ago
Some analysis of the slowdown.
Looking at the generated binary and perf output, the slowdown is because some functions are not inlined. This might depend on the compiler, but for me rb_array_len() is one such victim:
zsh % gdb -batch -ex 'file miniruby' -ex 'disassemble rb_array_len'
Dump of assembler code for function rb_array_len:
0x0000000000295540 <+0>: push %rbx
0x0000000000295541 <+1>: mov %rdi,%rbx
0x0000000000295544 <+4>: test $0x7,%bl
0x0000000000295547 <+7>: jne 0x2955be <rb_array_len+126>
0x0000000000295549 <+9>: mov %rbx,%rax
0x000000000029554c <+12>: and $0xfffffffffffffff7,%rax
0x0000000000295550 <+16>: je 0x2955be <rb_array_len+126>
0x0000000000295552 <+18>: mov (%rbx),%rax
0x0000000000295555 <+21>: mov %eax,%edx
0x0000000000295557 <+23>: and $0x1f,%edx
0x000000000029555a <+26>: mov $0x7,%ecx
0x000000000029555f <+31>: cmp $0x7,%edx
0x0000000000295562 <+34>: jne 0x295585 <rb_array_len+69>
0x0000000000295564 <+36>: test $0x2000,%eax ;; <- This is `RB_FL_ANY_RAW(a, RARRAY_EMBED_FLAG)`
0x0000000000295569 <+41>: jne 0x295571 <rb_array_len+49>
0x000000000029556b <+43>: mov 0x10(%rbx),%rax ;; <-
0x000000000029556f <+47>: pop %rbx ;; <- This is `return RARRAY(a)->as.heap.len;`
0x0000000000295570 <+48>: retq ;; <-
0x0000000000295571 <+49>: cmp $0x7,%ecx
0x0000000000295574 <+52>: jne 0x2955a2 <rb_array_len+98>
0x0000000000295576 <+54>: test $0x2000,%eax
0x000000000029557b <+59>: je 0x2955ea <rb_array_len+170>
0x000000000029557d <+61>: shr $0xf,%eax ;; <-
0x0000000000295580 <+64>: and $0x3,%eax ;; <- This is `return RARRAY_EMBED_LEN(a);`
0x0000000000295583 <+67>: pop %rbx ;; <-
0x0000000000295584 <+68>: retq ;; <-
0x0000000000295585 <+69>: mov %rbx,%rdi
0x0000000000295588 <+72>: mov $0x7,%esi
0x000000000029558d <+77>: callq 0xcaea2 <rb_check_type>
0x0000000000295592 <+82>: mov (%rbx),%rax
0x0000000000295595 <+85>: mov %eax,%ecx
0x0000000000295597 <+87>: and $0x1f,%ecx
0x000000000029559a <+90>: cmp $0x1b,%rcx
0x000000000029559e <+94>: jne 0x295564 <rb_array_len+36>
0x00000000002955a0 <+96>: jmp 0x2955cb <rb_array_len+139>
0x00000000002955a2 <+98>: mov %rbx,%rdi
0x00000000002955a5 <+101>: mov $0x7,%esi
0x00000000002955aa <+106>: callq 0xcaea2 <rb_check_type>
0x00000000002955af <+111>: mov (%rbx),%rax
0x00000000002955b2 <+114>: mov %eax,%ecx
0x00000000002955b4 <+116>: and $0x1f,%ecx
0x00000000002955b7 <+119>: cmp $0x1b,%ecx
0x00000000002955ba <+122>: jne 0x295576 <rb_array_len+54>
0x00000000002955bc <+124>: jmp 0x2955cb <rb_array_len+139>
0x00000000002955be <+126>: mov %rbx,%rdi
0x00000000002955c1 <+129>: mov $0x7,%esi
0x00000000002955c6 <+134>: callq 0xcaea2 <rb_check_type>
0x00000000002955cb <+139>: lea 0x142fe(%rip),%rdi # 0x2a98d0
0x00000000002955d2 <+146>: lea 0x1432f(%rip),%rdx # 0x2a9908
0x00000000002955d9 <+153>: lea 0x14337(%rip),%rcx # 0x2a9917
0x00000000002955e0 <+160>: mov $0xea,%esi
0x00000000002955e5 <+165>: callq 0xcad86 <rb_assert_failure>
0x00000000002955ea <+170>: lea 0x14338(%rip),%rdi # 0x2a9929
0x00000000002955f1 <+177>: lea 0x1436d(%rip),%rdx # 0x2a9965
0x00000000002955f8 <+184>: lea 0x14377(%rip),%rcx # 0x2a9976
0x00000000002955ff <+191>: mov $0x79,%esi
0x0000000000295604 <+196>: callq 0xcad86 <rb_assert_failure>
End of assembler dump.
Here, assertions practically never fail. This means the jumps are predicted almost 100% of the time (they are almost no-ops), so they don't slow things down. The problem is the unreachable branches. If you can read the assembly, you can see that almost 2/3 of the function above is never reached. Those branches blow up the generated binary significantly. rb_array_len is thus now considered too big to be inlined, by my compiler at least.
An obvious ad-hoc remedy is to supply __attribute__((__always_inline__)) for everything. But I don't think that's a good idea, because what is inlined and what is not depends very much on compilers, versions, target architectures, and almost everything else.
Updated by shyouhei (Shyouhei Urabe) over 4 years ago
If you recompile everything using ./configure cppflags=-DNDEBUG, those assertions are eliminated, which lets compilers inline rb_array_len again.
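As a rough illustration of the standard C behaviour being relied on here (not Ruby's actual code): assert() from <assert.h> expands to nothing when NDEBUG is defined before the header is included, so the checks and their failure paths disappear from the function body. The names checked_array and checked_array_len are hypothetical.

```c
#include <assert.h>   /* assert() becomes a no-op if NDEBUG is defined first */
#include <stddef.h>

/* Toy stand-in for an object with a type tag and a length. */
struct checked_array {
    int  is_array;
    long len;
};

/* Without -DNDEBUG: two checks plus their failure branches are compiled in.
 * With -DNDEBUG:    the body is just the load of a->len. */
static inline long
checked_array_len(const struct checked_array *a)
{
    assert(a != NULL);
    assert(a->is_array);
    return a->len;
}
```

Building the same file once with `cc -O2 -c file.c` and once with `cc -O2 -DNDEBUG -c file.c`, then comparing the disassembly, is one way to see the difference in function size.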
Updated by shevegen (Robert A. Heiler) over 4 years ago
I have a question concerning one point mentioned above.
k0kubun wrote:
Provide .so for an assertion-enabled mode? (ko1's idea)
Could someone briefly explain the general idea behind this? I assume that for a .so
file the Ruby user would have to require/load that file, but what would be the
perceived benefits/disadvantages of doing so?
Updated by k0kubun (Takashi Kokubun) over 4 years ago
I would like to suggest that if a user really favors speed over sanity checks, they should just compile everything with -DNDEBUG. This has been the standard C manner since long before Ruby's birth.
Got it. I'll consider using -DNDEBUG on benchmark servers at least. Also, maybe it's worth noting this in NEWS for those who package Ruby for performance-sensitive uses?
An obvious ad-hoc remedy is to supply __attribute__((__always_inline__)) for everything. But I don't think that's a good idea, because what is inlined and what is not depends very much on compilers, versions, target architectures, and almost everything.
Agreed. While it's not a good idea to always inline everything, some functions may be worth considering, though.
I assume for a .so file the ruby user would have to require/load that file
His idea was to install the .so file to the Ruby prefix by default and add a --debug-xxx option to load it.
Updated by k0kubun (Takashi Kokubun) over 4 years ago
- Related to Bug #16840: Decrease in Hash#[]= performance with object keys added
Updated by nobu (Nobuyoshi Nakada) over 4 years ago
It's not only assertions; some optimizations can no longer be applied.
For instance, rb_str_new_cstr was defined as follows in 2.7,
#define rb_str_new_cstr(str) RB_GNUC_EXTENSION_BLOCK( \
    (__builtin_constant_p(str)) ? \
        rb_str_new_static((str), (long)strlen(str)) : \
        rb_str_new_cstr(str) \
)
and rb_str_new_cstr("...") has been expected to be compiled as rb_str_new_static("...", 3).
The below is the master version.
static inline VALUE
ruby3_str_new_cstr(const char *str)
{
    if /* constexpr */ (! RUBY3_CONSTANT_P(str)) {
        return rb_str_new_cstr(str);
    }
    else {
        long len = ruby3_strlen(str);
        return rb_str_new_static(str, len);
    }
}
As str is an argument variable and RUBY3_CONSTANT_P(str) is always false here, the _static function is never used (in Apple clang 11.0.3 and gcc 10.1.0-RC-20200430_0).
I'm uncertain how much this particular case affects overall performance, but there might be more de-optimizations like it.
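The difference between the two forms can be sketched in isolation. Everything below is hypothetical illustration code: sketch_new_static and sketch_new_dynamic stand in for rb_str_new_static and rb_str_new_cstr and only record which path was taken, and SKETCH_CONSTANT_P is assumed to wrap __builtin_constant_p the way RUBY3_CONSTANT_P does. In the macro, the builtin sees the caller's expression (possibly a string literal); in the inline function, it only sees the parameter str.

```c
#include <assert.h>
#include <string.h>

/* Assumed wrapper: use the GCC/clang builtin where available, else 0. */
#if defined(__GNUC__) || defined(__clang__)
# define SKETCH_CONSTANT_P(x) __builtin_constant_p(x)
#else
# define SKETCH_CONSTANT_P(x) 0
#endif

enum sketch_path { SKETCH_STATIC, SKETCH_DYNAMIC };
struct sketch_str { enum sketch_path path; long len; };

/* Stand-in for rb_str_new_static(): length supplied by the caller. */
static struct sketch_str
sketch_new_static(const char *s, long len)
{
    struct sketch_str r = { SKETCH_STATIC, len };
    (void)s;
    return r;
}

/* Stand-in for rb_str_new_cstr(): length computed at run time. */
static struct sketch_str
sketch_new_dynamic(const char *s)
{
    struct sketch_str r;
    assert(s != NULL);
    r.path = SKETCH_DYNAMIC;
    r.len = (long)strlen(s);
    return r;
}

/* 2.7-style macro: the builtin is applied to the argument expression itself. */
#define SKETCH_NEW_CSTR_MACRO(str) \
    (SKETCH_CONSTANT_P(str) ? \
        sketch_new_static((str), (long)strlen(str)) : \
        sketch_new_dynamic(str))

/* master-style inline function: the builtin is applied to a parameter. */
static inline struct sketch_str
sketch_new_cstr_inline(const char *str)
{
    if (!SKETCH_CONSTANT_P(str)) {
        return sketch_new_dynamic(str);
    }
    return sketch_new_static(str, (long)strlen(str));
}
```

Both forms return the same length. Which path each one takes for a literal such as "abc" can be checked by inspecting the .path field of the result, and the answer depends on the compiler and optimization level, so no particular output is claimed here.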
Updated by shyouhei (Shyouhei Urabe) over 4 years ago
nobu (Nobuyoshi Nakada) wrote in #note-10:
As str is an argument variable and RUBY3_CONSTANT_P(str) is always false here,
Well, thank you for pointing this out. As I wrote in include/ruby/3/constant_p.h, you can apply __builtin_constant_p to an inline function argument. I thought that RUBY3_CONSTANT_P(str) would not always be false. However, https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html says:
You may use this built-in function in either a macro or an inline function. However, if you use it in an inlined function and pass an argument of the function as the argument to the built-in, GCC never returns 1 when you call the inline function with a string constant or ...
In the particular case of ruby3_str_new_cstr(), the argument is a string, so there is no chance of the built-in returning 1. This is in fact wrong. We have to fix it.
Updated by naruse (Yui NARUSE) over 4 years ago
I want Ruby 2.8/3.0 to be faster than 2.7 by default.
NDEBUG is not acceptable.
I think Microsoft's _DEBUG approach is more reasonable.
Updated by shyouhei (Shyouhei Urabe) over 4 years ago
naruse (Yui NARUSE) wrote in #note-12:
NDEBUG is not acceptable.
NDEBUG is not my invention. Please file a bug report to upstream (ISO/IEC JTC1/SC22/WG14).
I'm not against defining it by default, though.
Updated by ko1 (Koichi Sasada) over 4 years ago
- Status changed from Open to Closed
Applied in changeset git|21991e6ca59274e41a472b5256bd3245f6596c90.
Use RUBY_DEBUG instead of NDEBUG
Assertions in header files slow down the interpreter, so they should be
turned off by default (simple make). To enable them, define the macro
RUBY_DEBUG=1 (e.g. make cppflags=-DRUBY_DEBUG, or use #define at
the very beginning of the file). Note that even if NDEBUG=1 is defined,
RUBY_DEBUG=1 enables all assertions.
[Feature #16837]
related: https://github.com/ruby/ruby/pull/3120
assert() lines in MRI *.c files are not disabled even if RUBY_DEBUG=0;
they can be disabled with NDEBUG=1. So please consider using
RUBY_ASSERT() if you want them to be disabled when RUBY_DEBUG=0.
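A minimal sketch of the split described in this commit message, under the assumption that a project-specific assertion macro is keyed on its own debug flag while plain assert() stays keyed on NDEBUG. SKETCH_DEBUG, SKETCH_ASSERT, sketch_assert_failure and sketch_len are hypothetical names chosen for illustration; this is not the definition of RUBY_ASSERT() in Ruby's headers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for RUBY_DEBUG: off unless requested,
 * e.g. with -DSKETCH_DEBUG on the compiler command line. */
#ifndef SKETCH_DEBUG
# define SKETCH_DEBUG 0
#endif

#if SKETCH_DEBUG
/* Report the failed expression and abort. */
static void
sketch_assert_failure(const char *file, int line, const char *expr)
{
    fprintf(stderr, "%s:%d: assertion failed: %s\n", file, line, expr);
    abort();
}
# define SKETCH_ASSERT(expr) \
    ((expr) ? (void)0 : sketch_assert_failure(__FILE__, __LINE__, #expr))
#else
/* Default build: the check and its failure branch vanish entirely. */
# define SKETCH_ASSERT(expr) ((void)0)
#endif

static long
sketch_len(const long *len_field)
{
    SKETCH_ASSERT(len_field != NULL); /* controlled by SKETCH_DEBUG only */
    assert(len_field != NULL);        /* controlled by NDEBUG only */
    return *len_field;
}
```

In this sketch a default build keeps the plain assert() and drops SKETCH_ASSERT(), which mirrors the note above that assert() lines in *.c files need NDEBUG to be disabled.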