Bug #18286
Closed
Universal arm64/x86_64 binary built on an x86_64 machine segfaults/is killed on arm64
Description
A universal arm64/x86_64 ruby binary for macOS built on an x86_64 machine segfaults/is killed when executed on an arm64 machine.
To reproduce:
- On an Intel Mac:
  git clone https://github.com/ruby/ruby && cd ruby && git checkout v3_0_2 && ./autogen.sh && ./configure --with-arch=arm64,x86_64 && make -j$(sysctl -n hw.ncpu)
- Copy the built ./ruby binary to an Apple Silicon machine
- Attempt to execute it
Expected:
The universal ruby binary works correctly on both devices
Actual:
The universal ruby binary crashes with either Segmentation fault: 11 or Killed: 9 (this seems to occur if arm64e is used instead of arm64).
Details:
I'm attempting to build a universal Ruby for macOS that will run on both Intel (x86_64) and Apple Silicon (arm64) machines.
It seemed initially that adding --with-arch=arm64,x86_64 to ./configure would do it, as it produced a ruby binary that reports as Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit executable x86_64] [arm64]
This ruby works correctly on the Intel machine I built it on, but does not work when copied to an Apple Silicon device. The reverse, however, seems to work. That is, if I build the universal ruby on an Apple Silicon machine, the resulting ruby binary seems to work correctly on both Intel and Apple Silicon machines.
Intel:
$ ./ruby -v
ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [universal.x86_64-darwin21]
Apple Silicon:
$ ./ruby -v
Segmentation fault: 11
$ lldb ./ruby
(lldb) target create "./ruby"
Current executable set to '/Users/crc/ruby' (arm64).
(lldb) run
Process 77071 launched: '/Users/crc/ruby' (arm64)
Process 77071 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
frame #0: 0x00000001002176b8 ruby`ruby_vm_special_exception_copy + 16
ruby`ruby_vm_special_exception_copy:
-> 0x1002176b8 <+16>: ldr x0, [x0, #0x8]
0x1002176bc <+20>: bl 0x10011fed8 ; rb_class_real
0x1002176c0 <+24>: bl 0x10012070c ; rb_obj_alloc
0x1002176c4 <+28>: mov x20, x0
Target 0: (ruby) stopped.
(lldb) ^D
I also attempted the same thing with ruby 2.7.4 source, with the same result.
Updated by nobu (Nobuyoshi Nakada) about 3 years ago
Could you try with the master, and show more backtraces?
Updated by ccaviness (Clay Caviness) about 3 years ago
nobu (Nobuyoshi Nakada) wrote in #note-1:
Could you try with the master, and show more backtraces?
Sure. Similar error, though this time running the universal ruby on Apple Silicon just results in a Killed: 9 message. I'm unable to run this binary under lldb; however, I'm not familiar with debuggers, so if there's a different method you'd like me to try I'd be happy to. I did get a backtrace for the segfault on the v3_0_2 build.
ruby built on an Intel machine, from master, running on my Apple Silicon device:
$ file ruby
ruby: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit executable x86_64] [arm64]
ruby (for architecture x86_64): Mach-O 64-bit executable x86_64
ruby (for architecture arm64): Mach-O 64-bit executable arm64
$ ./ruby -v
Killed: 9
$ lldb ./ruby
(lldb) target create "./ruby"
Killed: 9
ruby built on an Intel machine, from v3_0_2, running on my Apple Silicon device:
$ lldb ruby
(lldb) target create "ruby"
Current executable set to '/Users/crc/ruby' (arm64).
(lldb) run
Process 38054 launched: '/Users/crc/ruby' (arm64)
Process 38054 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
frame #0: 0x00000001002176b8 ruby`ruby_vm_special_exception_copy + 16
ruby`ruby_vm_special_exception_copy:
-> 0x1002176b8 <+16>: ldr x0, [x0, #0x8]
0x1002176bc <+20>: bl 0x10011fed8 ; rb_class_real
0x1002176c0 <+24>: bl 0x10012070c ; rb_obj_alloc
0x1002176c4 <+28>: mov x20, x0
Target 0: (ruby) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
* frame #0: 0x00000001002176b8 ruby`ruby_vm_special_exception_copy + 16
frame #1: 0x0000000100217788 ruby`ec_stack_overflow + 56
frame #2: 0x0000000100217708 ruby`rb_ec_stack_overflow + 40
frame #3: 0x000000010023da90 ruby`rb_call0 + 1828
frame #4: 0x00000001001213bc ruby`rb_class_new_instance + 88
frame #5: 0x000000010008a6d8 ruby`rb_exc_new_str + 64
frame #6: 0x000000010022fee4 ruby`rb_vm_register_special_exception_str + 52
frame #7: 0x00000001000966cc ruby`Init_eval + 768
frame #8: 0x00000001000c4d34 ruby`rb_call_inits + 72
frame #9: 0x0000000100093e58 ruby`ruby_setup + 316
frame #10: 0x0000000100093ee0 ruby`ruby_init + 12
frame #11: 0x0000000100001be4 ruby`main + 76
frame #12: 0x00000001003fd0f4 dyld`start + 520
(lldb)
Updated by timsutton (Tim Sutton) almost 3 years ago
I have been hoping to do the same operation here for my org, as a way to distribute a universal Ruby binary that would be usable on both Intel and Apple Silicon machines, and to be able to build it on Intel. I seem to run into the same problem when building on Intel.
Updated by ecnelises (Chaofan QIU) almost 3 years ago
Can you please try codesign -s - ruby? Apple's arm chips require executables to be signed.
I encountered the same killed 9 error elsewhere, FYI: https://lists.gnu.org/archive/html/bug-gnu-emacs/2020-11/msg01480.html
Updated by timsutton (Tim Sutton) almost 3 years ago
Sure. I had suspected that at some point, so I checked the signature using codesign -dvvvvv. But I also just repeated that test, and then replaced the signature on the built binary with a new ad-hoc signature on the M1. That unfortunately did not seem to help:
# intel-built universal binary copied over
tsutton@tim-m1 ~ % cp /Volumes/ssd/ruby_274 .
tsutton@tim-m1 ~ % codesign -d -vvvvv ruby_274
Executable=/Users/tsutton/ruby_274
Identifier=-5ac6e2.out
Format=Mach-O universal (x86_64 arm64)
CodeDirectory v=20400 size=30020 flags=0x20002(adhoc,linker-signed) hashes=935+0 location=embedded
VersionPlatform=1
VersionMin=720896
VersionSDK=721664
Hash type=sha256 size=32
CandidateCDHash sha256=63eda95634ac1d1ea6c97467085ec887b45f1dde
CandidateCDHashFull sha256=63eda95634ac1d1ea6c97467085ec887b45f1dde4659262d661eccca13ba17ca
Hash choices=sha256
CMSDigest=63eda95634ac1d1ea6c97467085ec887b45f1dde4659262d661eccca13ba17ca
CMSDigestType=2
Executable Segment base=0
Executable Segment limit=2834432
Executable Segment flags=0x1
Page size=4096
CDHash=63eda95634ac1d1ea6c97467085ec887b45f1dde
Signature=adhoc
Info.plist=not bound
TeamIdentifier=not set
Sealed Resources=none
Internal requirements=none
tsutton@tim-m1 ~ % ./ruby_274
zsh: segmentation fault ./ruby_274
tsutton@tim-m1 ~ % cp ruby_274 ruby_274_copy
tsutton@tim-m1 ~ % codesign -s - ruby_274_copy
tsutton@tim-m1 ~ % ./ruby_274_copy
zsh: segmentation fault ./ruby_274_copy
# using -f to force signature replacement
tsutton@tim-m1 ~ % codesign -fs - ruby_274_copy
ruby_274_copy: replacing existing signature
tsutton@tim-m1 ~ % ./ruby_274_copy
zsh: segmentation fault ./ruby_274_copy
Updated by ccaviness (Clay Caviness) almost 3 years ago
Lack of codesigning on Apple Silicon is an excellent guess, but unfortunately does not seem to be the cause here, as Tim's demonstrated above (and I've verified as well). I first noticed this issue when testing a ruby that was fully signed with a public developer cert.
Updated by ccaviness (Clay Caviness) over 2 years ago
I don't believe any of those bugs are related.
My suspicion is that, when building on x86 and targeting universal, the small test binaries that configure builds for arm64 cannot be executed on x86, which leaves the various hints about the host machine wildly incorrect.
When building on arm64 and targeting universal, these test binaries that are built for x86 can actually run on the arm64 machine successfully, due to the Rosetta x86 compatibility layer.
There is no mechanism to run arm64 binaries on x86 Macs, though, so I think to get cross-compilation working on x86 many of the various autoconf hints will need to be manually set.
I'm not that familiar with autoconf or what these values should be, though.
Updated by benhamilton (Ben Hamilton) over 1 year ago
I think I know what the problem is.
During the build, Ruby has special logic to serialize its own builtin module to disk using the binary iseq format (I assume for speed, so it doesn't have to parse builtin every time it starts up).
However, since the iseq format is architecture-specific, when building on x86_64 for universal x86_64 + arm64, the serialized builtin module is written with the x86_64 architecture of the build machine, which fails this check whenever ruby imports its builtin module (serialized in the x86_64-specific iseq format) on arm64:
https://github.com/ruby/ruby/blob/1fdaa0666086529b3aae2d509a2e71c4247c3a12/compile.c#L13243
Thankfully, there's logic to disable this feature for cross-compiled builds:
https://github.com/ruby/ruby/blob/1fdaa0666086529b3aae2d509a2e71c4247c3a12/builtin.c#L6
We just need to enable this for universal builds as well.
Updated by benhamilton (Ben Hamilton) over 1 year ago
Sent PR https://github.com/ruby/ruby/pull/7367 with a fix.
Updated by benhamilton (Ben Hamilton) over 1 year ago
I also reproduced the SIGSEGV from the original bug using a build with debug symbols:
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
* frame #0: 0x00000001002444b4 ruby`ruby_vm_special_exception_copy [inlined] RBASIC_CLASS(obj=0) at rbasic.h:155:25
frame #1: 0x00000001002444b4 ruby`ruby_vm_special_exception_copy(exc=0) at vm_insnhelper.c:52:42
frame #2: 0x0000000100244584 ruby`ec_stack_overflow(ec=0x0000000100704920, setup=1) at vm_insnhelper.c:65:16
frame #3: 0x0000000100244504 ruby`rb_ec_stack_overflow(ec=0x0000000100704920, crit=0) at vm_insnhelper.c:97:5
frame #4: 0x000000010026b544 ruby`rb_call0(ec=0x0000000100704920, recv=4300956680, mid=3121, argc=1, argv=0x000000016fdff3e0, call_scope=<unavailable>, self=4) at rgengc.h:0:9
frame #5: 0x000000010013ca30 ruby`rb_class_new_instance [inlined] rb_class_new_instance_kw(argc=1, argv=0x000000016fdff3e0, klass=<unavailable>, kw_splat=0) at object.c:2025:5
frame #6: 0x000000010013c9f0 ruby`rb_class_new_instance(argc=1, argv=0x000000016fdff3e0, klass=<unavailable>) at object.c:2033:12
frame #7: 0x00000001000953c0 ruby`rb_exc_new_str(etype=<unavailable>, str=4300956720) at error.c:1145:12
frame #8: 0x000000010025d668 ruby`rb_vm_register_special_exception_str(sp=ruby_error_reenter, cls=<unavailable>, mesg=<unavailable>) at vm.c:2872:17
frame #9: 0x00000001000a1d50 ruby`Init_eval at eval.c:2091:5
frame #10: 0x00000001000d51fc ruby`rb_call_inits at inits.c:41:5
frame #11: 0x000000010009fa14 ruby`ruby_setup at eval.c:89:9
frame #12: 0x000000010009fa90 ruby`ruby_init at eval.c:101:17
frame #13: 0x0000000100004610 ruby`main [inlined] rb_main(argc=1, argv=0x000000016fdff938) at main.c:37:5
frame #14: 0x0000000100004604 ruby`main(argc=1, argv=0x000000016fdff938) at main.c:57:12
frame #15: 0x000000018599be50 dyld`start + 2544
(snip)
(lldb) print *ec
(rb_execution_context_t) $7 = {
vm_stack = 0x0000000108028000
vm_stack_size = 131072
cfp = 0x0000000108127fc0
tag = 0x000000016fdff4b0
interrupt_flag = 0
interrupt_mask = 0
fiber_ptr = 0x00000001007048d0
thread_ptr = 0x0000000100704260
local_storage = NULL
local_storage_recursive_hash = 4
local_storage_recursive_hash_for_trace = 4
storage = 4
root_lep = 0x0000000000000000
root_svar = 0
ensure_list = NULL
trace_arg = NULL
errinfo = 4
passed_block_handler = 0
raised_flag = '\0'
method_missing_reason = MISSING_NOENTRY
private_const_reference = 0
machine = {
stack_start = 0x000000016fdff4ac
stack_end = 0x000000016fdff2c0
stack_maxsize = 0
(snip)
(lldb) print ec->machine.stack_start
(VALUE *) $4 = 0x000000016fdff4ac
(lldb) print ec->machine.stack_end
(VALUE *) $5 = 0x000000016fdff2c0
(lldb) print ec->machine.stack_end - ec->machine.stack_start
(long) $6 = -61
(snip)
(lldb) print ruby_stack_length(NULL)
(size_t) $8 = 18446744073709551529
Looks like it's got the stack direction backwards. I checked config.log, and it's incorrectly detecting the stack growth direction as +1 (growing towards larger addresses) for arm64:
| #if defined __x86_64__
| #define STACK_GROW_DIRECTION -1
| #endif /* defined __x86_64__ */
| #if defined __arm64__
| #define STACK_GROW_DIRECTION +1
(snip)
rb_cv_stack_grow_dir_arm64=+1
rb_cv_stack_grow_dir_x86_64=-1
Sent PR https://github.com/ruby/ruby/pull/7373 with this fix as well.
Updated by ccaviness (Clay Caviness) over 1 year ago
Could someone please review and merge Ben's PR https://github.com/ruby/ruby/pull/7367 to fix this? I'd like to see these changes make the next release.
Updated by ccaviness (Clay Caviness) about 1 year ago
https://github.com/ruby/ruby/pull/7367 fixes this, and just needs to be merged.
Updated by nobu (Nobuyoshi Nakada) almost 1 year ago
Does this help it?
https://github.com/ruby/ruby/pull/8708
Updated by ccaviness (Clay Caviness) 12 months ago
@nobu (Nobuyoshi Nakada) Yes, at least in initial tests. A universal ruby built on an x86 Mac with that patch seems to work on an Apple Silicon Mac.
Updated by benhamilton (Ben Hamilton) 12 months ago
- Status changed from Open to Closed
Applied in changeset git|1d5598fe0d3470e7cab06a756d40a9221fcd501b.
Disable iseq-dumped builtin module for universal x86_64/arm64 binaries
During the build, Ruby has special logic to serialize its own builtin
module to disk using the binary iseq format during the build (I assume
for speed so it doesn't have to parse builtin every time it starts
up).
However, since iseq format is architecture-specific, when building on
x86_64 for universal x86_64 + arm64, the serialized builtin module is
written with the x86_64 architecture of the build machine, which fails
this check whenever ruby imports the builtin module on arm64:
https://github.com/ruby/ruby/blob/1fdaa0666086529b3aae2d509a2e71c4247c3a12/compile.c#L13243
Thankfully, there's logic to disable this feature for cross-compiled builds:
https://github.com/ruby/ruby/blob/1fdaa0666086529b3aae2d509a2e71c4247c3a12/builtin.c#L6
This disables the iseq logic for universal builds as well.
Fixes [Bug #18286]