Bug #21715
Open
Miscompilation on x86-64-v2 due to undefined behavior in search_nonascii in string.c
Description
Building the following Dockerfile fails on an x86-64 machine in the last step (the make command):
FROM opensuse/leap:16.0
RUN zypper --non-interactive install wget make gcc
RUN wget 'https://cache.ruby-lang.org/pub/ruby/3.4/ruby-3.4.7.tar.gz'
RUN tar xaf ruby-3.4.7.tar.gz
WORKDIR ruby-3.4.7/build
RUN ../configure
RUN make
The failing command (during make) is: ./miniruby -I../lib -I. -I.ext/common ../tool/mkconfig.rb -arch=x86_64-linux -version=3.4.7 -install_name=ruby -so_name=ruby -unicode_version=15.0.0 -unicode_emoji_version=15.0 > rbconfig.tmp
Excerpt from the crash report:
../tool/mkconfig.rb: [BUG] Segmentation fault at 0x0000000000000000
ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0001 p:0000 s:0003 E:000ec0 DUMMY [FINISH]
-- Threading information ---------------------------------------------------
Total ractor count: 1
Ruby thread count for this ractor: 1
-- Machine register context ------------------------------------------------
RIP: 0x0000556c2da74760 RBP: 0x0000000000000027 RSP: 0x00007ffd24a195f0
RAX: 0x0000000000000028 RBX: 0x0000556c64acc420 RCX: 0x0000000000000000
RDX: 0x0000000000000000 RDI: 0x0000000000000014 RSI: 0x00007f49f7d6c123
R8: 0x46ea57707c6b1df2 R9: 0x00007f49f7d6c123 R10: 0x2afb945fcb545f01
R11: 0x0000556c2dc3fe50 R12: 0x00007f49f7d6c263 R13: 0x00007f49f7d6c11b
R14: 0x0000556c64bdaa48 R15: 0x00007f49f7d6c25c EFL: 0x0000000000010256
-- C level backtrace information -------------------------------------------
/ruby-3.4.7/build/miniruby(rb_print_backtrace+0x5) [0x556c2db2c1b6] ../vm_dump.c:823
/ruby-3.4.7/build/miniruby(rb_vm_bugreport) ../vm_dump.c:1155
/ruby-3.4.7/build/miniruby(rb_bug_for_fatal_signal+0xf7) [0x556c2d8cdc47] ../error.c:1130
/ruby-3.4.7/build/miniruby(sigsegv+0x42) [0x556c2da58482] ../signal.c:934
/lib64/libc.so.6(__restore_rt+0x0) [0x7f49f7eb2090]
/ruby-3.4.7/build/miniruby(search_nonascii+0xcb) [0x556c2da74760] ../string.c:729
/ruby-3.4.7/build/miniruby(coderange_scan) ../string.c:767
/ruby-3.4.7/build/miniruby(rbimpl_fl_unset_raw_raw+0x0) [0x556c2da76874] ../string.c:895
/ruby-3.4.7/build/miniruby(RB_FL_UNSET_RAW) ../include/ruby/internal/fl_type.h:669
/ruby-3.4.7/build/miniruby(RB_ENC_CODERANGE_SET) ../include/ruby/internal/encoding/coderange.h:131
/ruby-3.4.7/build/miniruby(enc_coderange_scan) ../string.c:911
/ruby-3.4.7/build/miniruby(rb_enc_str_coderange) ../string.c:910
/ruby-3.4.7/build/miniruby(is_ascii_string+0x8) [0x556c2da7697e] ../internal/string.h:151
/ruby-3.4.7/build/miniruby(str_do_hash) ../string.c:393
/ruby-3.4.7/build/miniruby(register_fstring) ../string.c:554
/ruby-3.4.7/build/miniruby(rb_enc_literal_str+0x87) [0x556c2da94bb7] ../string.c:12546
/ruby-3.4.7/build/miniruby(parse_static_literal_string+0x38) [0x556c2d875991] ../prism_compile.c:312
/ruby-3.4.7/build/miniruby(pm_compile_node) ../prism_compile.c:10321
/ruby-3.4.7/build/miniruby(pm_compile_node+0x2e65) [0x556c2d875aa5] ../prism_compile.c:10309
/ruby-3.4.7/build/miniruby(pm_compile_conditional+0x18c) [0x556c2d88cfcc] ../prism_compile.c:1053
/ruby-3.4.7/build/miniruby(pm_compile_node+0x42e1) [0x556c2d876f21] ../prism_compile.c:9355
/ruby-3.4.7/build/miniruby(pm_setup_args_core+0xe4) [0x556c2d884304] ../prism_compile.c:1792
/ruby-3.4.7/build/miniruby(pm_setup_args+0x98) [0x556c2d884e98] ../prism_compile.c:1979
/ruby-3.4.7/build/miniruby(pm_compile_call+0x307) [0x556c2d885cf7] ../prism_compile.c:3673
/ruby-3.4.7/build/miniruby(pm_compile_call_node+0x2c6) [0x556c2d872326] ../prism_compile.c:7403
/ruby-3.4.7/build/miniruby(pm_compile_node+0x39dc) [0x556c2d87661c] ../prism_compile.c:8775
/ruby-3.4.7/build/miniruby(pm_compile_node+0x2e65) [0x556c2d875aa5] ../prism_compile.c:10309
/ruby-3.4.7/build/miniruby(pm_compile_conditional+0x18c) [0x556c2d88cfcc] ../prism_compile.c:1053
/ruby-3.4.7/build/miniruby(pm_compile_node+0x42e1) [0x556c2d876f21] ../prism_compile.c:9355
/ruby-3.4.7/build/miniruby(pm_compile_node+0x2e3a) [0x556c2d875a7a] ../prism_compile.c:10307
/ruby-3.4.7/build/miniruby(pm_compile_scope_node+0x104a) [0x556c2d88f5da] ../prism_compile.c:6991
/ruby-3.4.7/build/miniruby(pm_compile_node+0x35c9) [0x556c2d876209] ../prism_compile.c:10180
/ruby-3.4.7/build/miniruby(APPEND_LIST+0x0) [0x556c2d891e60] ../prism_compile.c:10481
/ruby-3.4.7/build/miniruby(pm_iseq_compile_node) ../prism_compile.c:10485
/ruby-3.4.7/build/miniruby(pm_iseq_new_with_opt_try+0x10) [0x556c2d94c790] ../iseq.c:1042
/ruby-3.4.7/build/miniruby(rb_protect+0xd6) [0x556c2d8db9c6] ../eval.c:1054
/ruby-3.4.7/build/miniruby(pm_iseq_new_with_opt+0x177) [0x556c2d9525c7] ../iseq.c:1095
/ruby-3.4.7/build/miniruby(pm_iseq_new_main+0x85) [0x556c2d952895] ../iseq.c:943
/ruby-3.4.7/build/miniruby(process_options+0x12fd) [0x556c2da519cd] ../ruby.c:2616
/ruby-3.4.7/build/miniruby(ruby_process_options+0x157) [0x556c2da52657] ../ruby.c:3174
/ruby-3.4.7/build/miniruby(ruby_options+0x97) [0x556c2d8da977] ../eval.c:117
/ruby-3.4.7/build/miniruby(rb_main+0x19) [0x556c2d7eb578] ../prism/prism.c:21769
/ruby-3.4.7/build/miniruby(main) ../main.c:68
/lib64/libc.so.6(__libc_start_call_main+0x82) [0x7f49f7e9b340]
/lib64/libc.so.6(__libc_start_main+0x8b) [0x7f49f7e9b409]
/ruby-3.4.7/build/miniruby(_start+0x25) [0x556c2d7eb5c5] ../main.c:69
The failing instruction at 0x556c2da74760 is: movdqa xmm0, XMMWORD PTR [rsi+rcx*1]. At this point, register rsi contains 0x7f49f7d6c123, which is p + 8, where p = 0x7f49f7d6c11b is the parameter p of the function search_nonascii, and register rcx contains 0. So the whole instruction means “move aligned packed integer values from memory at 0x7f49f7d6c123 to register xmm0”. The segmentation fault happened because movdqa expects the address to be aligned on a 16-byte boundary, but it is not.
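The alignment claim can be checked with a few lines of C. This is only a sketch for verifying the arithmetic; the helper name misalignment is made up for illustration and does not appear in Ruby's source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes by which addr lies past the previous
 * `alignment` boundary (0 means aligned). `alignment` must be a power of two. */
static unsigned misalignment(uint64_t addr, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (unsigned)(addr & (alignment - 1));
}
```

With the values from the crash report, misalignment(0x7f49f7d6c123, 16) is 3, so the movdqa operand is not 16-byte aligned. p itself (0x7f49f7d6c11b) is 3 bytes past an 8-byte boundary and 11 bytes past a 16-byte one, so advancing it by 8 cannot reach a 16-byte boundary.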
The instruction is part of a loop at https://github.com/ruby/ruby/blob/v3_4_7/string.c#L728 that gets auto-vectorized by GCC. On x86-64,
- UNALIGNED_WORD_ACCESS is 1
- p doesn’t get aligned to anything because of #if !UNALIGNED_WORD_ACCESS in line 700
- aligned_ptr(value) is expanded to (uintptr_t *)(value) according to line 723
- p is therefore cast to type uintptr_t * in line 725
- uintptr_t is typedefed to unsigned long int, which has an alignment of 8 bytes
As a result, a pointer p to potentially unaligned memory is cast to a pointer to a type with an alignment of 8 bytes. That is undefined behavior according to C99 6.3.2.3p7: “A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined.” Compilers can use this rule to assume that the pointed-to memory is 8-byte aligned. In this case, the GCC loop auto-vectorizer adds code that advances the supposedly 8-byte-aligned address to a 16-byte boundary. A subsequent instruction that requires 16-byte alignment can therefore fail.
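For comparison, here is a minimal sketch of reading a word from a possibly unaligned address without the pointer cast. The helper names load_word and word_has_nonascii are hypothetical, and the mask is built the way I assume NONASCII_MASK is meant to look (0x80 in every byte); this is not the code in string.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read one machine word from any address.
 * memcpy has no alignment requirement, so no uintptr_t * is ever formed
 * from a misaligned address. */
static uintptr_t load_word(const char *p)
{
    uintptr_t word;
    assert(p != NULL);
    memcpy(&word, p, sizeof(word));
    return word;
}

/* Nonzero if any of the sizeof(uintptr_t) bytes starting at p has its
 * high bit set, i.e. is not ASCII. */
static int word_has_nonascii(const char *p)
{
    const uintptr_t mask = (uintptr_t)-1 / 0xFF * 0x80; /* 0x8080...80 */
    return (load_word(p) & mask) != 0;
}
```

Because the compiler never sees a uintptr_t * derived from p, it has no basis for assuming 8-byte alignment. On x86-64, compilers typically turn a fixed-size memcpy like this into a single unaligned load, though that is worth confirming in the generated assembly.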
I could reproduce this crash only on openSUSE Leap 16.0, but not on openSUSE Leap 15.6, openSUSE Tumbleweed or Arch Linux, because only the former configures GCC to emit code requiring x86-64-v2 by default. When passing -march=x86-64-v2 in CFLAGS, the crash happens on all of these distributions.
Updated by alanwu (Alan Wu) 1 day ago
Right, it's doing the unaligned read in the classic intuitive-but-UB way. Can you try the following (roughly tested) patch? It's based on the ruby_3_4 branch.
From 225f6caf914a4dd4c457d9e52ab72a79c91bd1a7 Mon Sep 17 00:00:00 2001
From: Alan Wu <XrXr@users.noreply.github.com>
Date: Wed, 26 Nov 2025 21:59:37 -0500
Subject: [PATCH] string.c: Fix UB unaligned read by replacing with memcpy
---
string.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/string.c b/string.c
index af8f493285..663d2d01c7 100644
--- a/string.c
+++ b/string.c
@@ -676,7 +676,7 @@ VALUE rb_fs;
static inline const char *
search_nonascii(const char *p, const char *e)
{
- const uintptr_t *s, *t;
+ const char *s, *t;
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L)
# if SIZEOF_UINTPTR_T == 8
@@ -720,17 +720,19 @@ search_nonascii(const char *p, const char *e)
#define aligned_ptr(value) \
__builtin_assume_aligned((value), sizeof(uintptr_t))
#else
-#define aligned_ptr(value) (uintptr_t *)(value)
+#define aligned_ptr(value) (value)
#endif
s = aligned_ptr(p);
- t = (uintptr_t *)(e - (SIZEOF_VOIDP-1));
+ t = (e - (SIZEOF_VOIDP-1));
#undef aligned_ptr
- for (;s < t; s++) {
- if (*s & NONASCII_MASK) {
+ for (;s < t; s += sizeof(uintptr_t)) {
+ uintptr_t word;
+ memcpy(&word, s, sizeof(word));
+ if (word & NONASCII_MASK) {
#ifdef WORDS_BIGENDIAN
- return (const char *)s + (nlz_intptr(*s&NONASCII_MASK)>>3);
+ return (const char *)s + (nlz_intptr(word&NONASCII_MASK)>>3);
#else
- return (const char *)s + (ntz_intptr(*s&NONASCII_MASK)>>3);
+ return (const char *)s + (ntz_intptr(word&NONASCII_MASK)>>3);
#endif
}
}
--
2.50.1
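To try the idea behind the patch outside of the Ruby tree, the loop can be reduced to a standalone function. This is a simplified sketch, not the patched search_nonascii: the name find_nonascii is made up, and instead of ntz_intptr/nlz_intptr it falls back to a byte-wise scan to locate the exact byte, which keeps it independent of endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for search_nonascii: returns a pointer to the first
 * byte in [p, e) with its high bit set, or NULL if every byte is ASCII. */
static const char *find_nonascii(const char *p, const char *e)
{
    const uintptr_t mask = (uintptr_t)-1 / 0xFF * 0x80; /* 0x80 in every byte */
    assert(p <= e);

    /* Word-at-a-time scan; memcpy makes the load valid at any alignment. */
    while ((size_t)(e - p) >= sizeof(uintptr_t)) {
        uintptr_t word;
        memcpy(&word, p, sizeof(word));
        if (word & mask) break; /* a non-ASCII byte is inside this word */
        p += sizeof(word);
    }

    /* Byte-wise scan pinpoints the byte and also covers the short tail. */
    for (; p < e; p++) {
        if ((unsigned char)*p & 0x80) return p;
    }
    return NULL;
}
```

Calling it with a deliberately odd start address, for example find_nonascii(buf + 1, buf + len), exercises the unaligned path that crashed in the original report.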
Updated by mame (Yusuke Endoh) about 19 hours ago
I wonder if the premise that "unaligned word access is feasible on x86" no longer holds in modern contexts?
We are of course aware that unaligned word access is undefined behavior in C. However, it is slightly faster, which is why we introduced this optimization specifically for x86.
I evaluated the performance on an AMD Ryzen 9 6900HX with gcc version 15.2.0 (Ubuntu 15.2.0-4ubuntu4) using the benchmark below. (I ran each test 10 times and picked the best result.)
s = ([65] * 10).pack("C*")
t = Process.clock_gettime(Process::CLOCK_MONOTONIC)
20000000.times { s.dup.force_encoding("UTF-8").scrub }
p Process.clock_gettime(Process::CLOCK_MONOTONIC) - t
It appears that -march=x86-64 -DUNALIGNED_WORD_ACCESS=1 remains the fastest.
- cflags="-march=x86-64 -DUNALIGNED_WORD_ACCESS=1": 2.918 s.
- cflags="-march=x86-64 -DUNALIGNED_WORD_ACCESS=1" with Alan's patch: 2.941 s.
- cflags="-march=x86-64 -DUNALIGNED_WORD_ACCESS=0": 3.020 s.
- cflags="-march=x86-64-v2 -DUNALIGNED_WORD_ACCESS=0": 3.175 s.
- cflags="-march=x86-64-v3 -DUNALIGNED_WORD_ACCESS=0": 3.017 s.
- cflags="-march=x86-64-v4 -DUNALIGNED_WORD_ACCESS=0": Illegal instruction
It is worth noting that x86-64-v3 performs extremely well for long strings. On the other hand, x86-64-v2 is clearly slower than x86-64, which is unfortunate.
s = ([65] * 1000000).pack("C*")
t = Process.clock_gettime(Process::CLOCK_MONOTONIC)
200000.times { s.dup.force_encoding("UTF-8").scrub }
p Process.clock_gettime(Process::CLOCK_MONOTONIC) - t
- cflags="-march=x86-64 -DUNALIGNED_WORD_ACCESS=1": 5.229 s.
- cflags="-march=x86-64 -DUNALIGNED_WORD_ACCESS=1" with Alan's patch: 5.232 s.
- cflags="-march=x86-64 -DUNALIGNED_WORD_ACCESS=0": 5.230 s.
- cflags="-march=x86-64-v2 -DUNALIGNED_WORD_ACCESS=0": 6.127 s.
- cflags="-march=x86-64-v3 -DUNALIGNED_WORD_ACCESS=0": 2.728 s.
- cflags="-march=x86-64-v4 -DUNALIGNED_WORD_ACCESS=0": Illegal instruction
However, since most strings handled in Ruby are not that long, it is likely more critical to ensure speed for short strings.
Regarding Alan's patch, it only supports search_nonascii. Since the optimization under UNALIGNED_WORD_ACCESS is applied in other places as well, the patch may be incomplete.
Looking at these benchmarks, it seems fair to say the difference is not drastic. If the performance degradation is only around 3.3%, I think it is fine to abandon the optimization and set UNALIGNED_WORD_ACCESS=0 unconditionally. I would appreciate it if others could verify this in different environments as well.
Updated by alanwu (Alan Wu) about 3 hours ago
I repeated Mame's experiment on a Xeon Platinum 8124M with gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04). The chip is from 2017 and supports x86-64-v4. I'm using slightly different scripts since I'm running with frequency scaling disabled. Also, I'm using hyperfine to get some basic stats on the results.
# short-str.rb
s = ([65] * 10).pack("C*")
4000000.times { s.dup.force_encoding("UTF-8").scrub }
$ hyperfine -L ruby x86-64-uwa-1,x86-64-uwa-1-sans-ub,x86-64-uwa-0,x86-64-v2-uwa-0,x86-64-v3-uwa-0,x86-64-v4-uwa-0 '~/.rubies/{ruby}/bin/ruby --disable-all short-str.rb'
Benchmark 1: ~/.rubies/x86-64-uwa-1/bin/ruby --disable-all short-str.rb
Time (mean ± σ): 1.165 s ± 0.001 s [User: 1.157 s, System: 0.007 s]
Range (min … max): 1.164 s … 1.166 s 10 runs
Benchmark 2: ~/.rubies/x86-64-uwa-1-sans-ub/bin/ruby --disable-all short-str.rb
Time (mean ± σ): 1.179 s ± 0.001 s [User: 1.172 s, System: 0.007 s]
Range (min … max): 1.177 s … 1.181 s 10 runs
Benchmark 3: ~/.rubies/x86-64-uwa-0/bin/ruby --disable-all short-str.rb
Time (mean ± σ): 1.142 s ± 0.001 s [User: 1.135 s, System: 0.007 s]
Range (min … max): 1.141 s … 1.144 s 10 runs
Benchmark 4: ~/.rubies/x86-64-v2-uwa-0/bin/ruby --disable-all short-str.rb
Time (mean ± σ): 1.165 s ± 0.001 s [User: 1.157 s, System: 0.007 s]
Range (min … max): 1.162 s … 1.167 s 10 runs
Benchmark 5: ~/.rubies/x86-64-v3-uwa-0/bin/ruby --disable-all short-str.rb
Time (mean ± σ): 1.150 s ± 0.001 s [User: 1.140 s, System: 0.009 s]
Range (min … max): 1.148 s … 1.153 s 10 runs
Benchmark 6: ~/.rubies/x86-64-v4-uwa-0/bin/ruby --disable-all short-str.rb
Time (mean ± σ): 1.181 s ± 0.001 s [User: 1.172 s, System: 0.008 s]
Range (min … max): 1.179 s … 1.184 s 10 runs
Summary
~/.rubies/x86-64-uwa-0/bin/ruby --disable-all short-str.rb ran
1.01 ± 0.00 times faster than ~/.rubies/x86-64-v3-uwa-0/bin/ruby --disable-all short-str.rb
1.02 ± 0.00 times faster than ~/.rubies/x86-64-v2-uwa-0/bin/ruby --disable-all short-str.rb
1.02 ± 0.00 times faster than ~/.rubies/x86-64-uwa-1/bin/ruby --disable-all short-str.rb
1.03 ± 0.00 times faster than ~/.rubies/x86-64-uwa-1-sans-ub/bin/ruby --disable-all short-str.rb
1.03 ± 0.00 times faster than ~/.rubies/x86-64-v4-uwa-0/bin/ruby --disable-all short-str.rb
I'm seeing the same 3% difference, but cflags="-march=x86-64 -DUNALIGNED_WORD_ACCESS=0" wins. Side note, it's pretty tricky to measure the speed on short inputs. The loop overhead seems too large compared to the string operations.
# long-str.rb
s = ([65] * 100000).pack("C*")
200000.times { s.dup.force_encoding("UTF-8").scrub }
$ hyperfine -L ruby x86-64-uwa-1,x86-64-uwa-1-sans-ub,x86-64-uwa-0,x86-64-v2-uwa-0,x86-64-v3-uwa-0,x86-64-v4-uwa-0 '~/.rubies/{ruby}/bin/ruby --disable-all long-str.rb' --warmup 3
Benchmark 1: ~/.rubies/x86-64-uwa-1/bin/ruby --disable-all long-str.rb
Time (mean ± σ): 1.531 s ± 0.002 s [User: 1.527 s, System: 0.004 s]
Range (min … max): 1.529 s … 1.534 s 10 runs
Benchmark 2: ~/.rubies/x86-64-uwa-1-sans-ub/bin/ruby --disable-all long-str.rb
Time (mean ± σ): 830.5 ms ± 1.0 ms [User: 826.5 ms, System: 3.7 ms]
Range (min … max): 829.1 ms … 831.9 ms 10 runs
Benchmark 3: ~/.rubies/x86-64-uwa-0/bin/ruby --disable-all long-str.rb
Time (mean ± σ): 831.3 ms ± 2.1 ms [User: 827.4 ms, System: 3.6 ms]
Range (min … max): 828.9 ms … 834.8 ms 10 runs
Benchmark 4: ~/.rubies/x86-64-v2-uwa-0/bin/ruby --disable-all long-str.rb
Time (mean ± σ): 2.248 s ± 0.002 s [User: 2.244 s, System: 0.003 s]
Range (min … max): 2.246 s … 2.253 s 10 runs
Benchmark 5: ~/.rubies/x86-64-v3-uwa-0/bin/ruby --disable-all long-str.rb
Time (mean ± σ): 830.1 ms ± 1.7 ms [User: 827.2 ms, System: 2.6 ms]
Range (min … max): 827.6 ms … 832.9 ms 10 runs
Benchmark 6: ~/.rubies/x86-64-v4-uwa-0/bin/ruby --disable-all long-str.rb
Time (mean ± σ): 2.254 s ± 0.004 s [User: 2.249 s, System: 0.004 s]
Range (min … max): 2.249 s … 2.259 s 10 runs
Summary
~/.rubies/x86-64-v3-uwa-0/bin/ruby --disable-all long-str.rb ran
1.00 ± 0.00 times faster than ~/.rubies/x86-64-uwa-1-sans-ub/bin/ruby --disable-all long-str.rb
1.00 ± 0.00 times faster than ~/.rubies/x86-64-uwa-0/bin/ruby --disable-all long-str.rb
1.84 ± 0.00 times faster than ~/.rubies/x86-64-uwa-1/bin/ruby --disable-all long-str.rb
2.71 ± 0.01 times faster than ~/.rubies/x86-64-v2-uwa-0/bin/ruby --disable-all long-str.rb
2.71 ± 0.01 times faster than ~/.rubies/x86-64-v4-uwa-0/bin/ruby --disable-all long-str.rb
x86-64-v3 wins.
> Regarding Alan's patch, it only supports search_nonascii. Since the optimization under UNALIGNED_WORD_ACCESS is applied in other places as well, the patch may be incomplete.
Right, it's incomplete. I just wanted to offer something quickly to see if it fixes the particular crash in OP.
> I think it is fine to abandon the optimization and set UNALIGNED_WORD_ACCESS=0 unconditionally.
I agree. If we do that, I hope we can delete the code for UNALIGNED_WORD_ACCESS=1. I think it's a mistake to keep around code that intentionally triggers UB, especially after learning that it causes crashes.
Further simplification is possible after removing the dead code: do unaligned reads using memcpy unconditionally, on all platforms. That gets rid of the code for manually aligning pointers. It's a good balance between speed, C compliance, and complexity. This is optional, though, since we already simplify a lot by keeping just one side of UNALIGNED_WORD_ACCESS.
UNALIGNED_WORD_ACCESS=1 is kind of funny. Once vectorized, most of the loads in the loop are in fact aligned reads such as MOVDQA.