Bug #18286

closed

Universal arm64/x86_64 binary built on an x86_64 machine segfaults/is killed on arm64

Added by ccaviness (Clay Caviness) about 3 years ago. Updated 12 months ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:105920]

Description

A universal arm64/x86_64 ruby binary for macOS built on an x86_64 machine segfaults/is killed when executed on an arm64 machine.

To reproduce:

  • On an Intel Mac: git clone https://github.com/ruby/ruby && cd ruby && git checkout v3_0_2 && ./autogen.sh && ./configure --with-arch=arm64,x86_64 && make -j$(sysctl -n hw.ncpu)
  • Copy the built ./ruby binary to an Apple Silicon machine
  • Attempt to execute it

Expected:
The universal ruby binary works correctly on both devices

Actual:
The universal ruby binary crashes with either Segmentation fault: 11 or Killed: 9 (this seems to occur if arm64e is used instead of arm64).

Details:
I'm attempting to build a universal Ruby for macOS that will run on both Intel (x86_64) and Apple Silicon (arm64) machines.

It seemed initially that adding --with-arch=arm64,x86_64 to ./configure would be enough, as it produced a ruby binary that reports as Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit executable x86_64] [arm64]

This ruby works correctly on the Intel machine I built it on, but does not work when copied to an Apple Silicon device. The reverse, however, seems to work. That is, if I build the universal ruby on an Apple Silicon machine, the ruby binary that's built seems to work correctly on both Intel and Apple Silicon machines.

Intel:

$ ./ruby -v
ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [universal.x86_64-darwin21]

Apple Silicon:

$ ./ruby -v
Segmentation fault: 11
$ lldb ./ruby
(lldb) target create "./ruby"
Current executable set to '/Users/crc/ruby' (arm64).
(lldb) run
Process 77071 launched: '/Users/crc/ruby' (arm64)
Process 77071 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x00000001002176b8 ruby`ruby_vm_special_exception_copy + 16
ruby`ruby_vm_special_exception_copy:
->  0x1002176b8 <+16>: ldr    x0, [x0, #0x8]
    0x1002176bc <+20>: bl     0x10011fed8               ; rb_class_real
    0x1002176c0 <+24>: bl     0x10012070c               ; rb_obj_alloc
    0x1002176c4 <+28>: mov    x20, x0
Target 0: (ruby) stopped.
(lldb) ^D

I also attempted the same thing with ruby 2.7.4 source, with the same result.

Updated by nobu (Nobuyoshi Nakada) about 3 years ago

Could you try with the master, and show more backtraces?

Updated by ccaviness (Clay Caviness) about 3 years ago

nobu (Nobuyoshi Nakada) wrote in #note-1:

Could you try with the master, and show more backtraces?

Sure. Similar error, though this time running the universal ruby on Apple Silicon just results in a Killed: 9 message. I'm unable to run this binary under lldb; however, I'm not familiar with debuggers so if there's a different method you'd like me to try I'd be happy to. I did get a backtrace for the segfault on the v3_0_2 build.

ruby built on an Intel machine, from master, running on my Apple Silicon device:

$ file ruby 
ruby: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit executable x86_64] [arm64]
ruby (for architecture x86_64):	Mach-O 64-bit executable x86_64
ruby (for architecture arm64):	Mach-O 64-bit executable arm64
$ ./ruby -v
Killed: 9
$ lldb ./ruby
(lldb) target create "./ruby"
Killed: 9

ruby built on an Intel machine, from v3_0_2, running on my Apple Silicon device:

$ lldb ruby
(lldb) target create "ruby"
Current executable set to '/Users/crc/ruby' (arm64).
(lldb) run
Process 38054 launched: '/Users/crc/ruby' (arm64)
Process 38054 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x00000001002176b8 ruby`ruby_vm_special_exception_copy + 16
ruby`ruby_vm_special_exception_copy:
->  0x1002176b8 <+16>: ldr    x0, [x0, #0x8]
    0x1002176bc <+20>: bl     0x10011fed8               ; rb_class_real
    0x1002176c0 <+24>: bl     0x10012070c               ; rb_obj_alloc
    0x1002176c4 <+28>: mov    x20, x0
Target 0: (ruby) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x00000001002176b8 ruby`ruby_vm_special_exception_copy + 16
    frame #1: 0x0000000100217788 ruby`ec_stack_overflow + 56
    frame #2: 0x0000000100217708 ruby`rb_ec_stack_overflow + 40
    frame #3: 0x000000010023da90 ruby`rb_call0 + 1828
    frame #4: 0x00000001001213bc ruby`rb_class_new_instance + 88
    frame #5: 0x000000010008a6d8 ruby`rb_exc_new_str + 64
    frame #6: 0x000000010022fee4 ruby`rb_vm_register_special_exception_str + 52
    frame #7: 0x00000001000966cc ruby`Init_eval + 768
    frame #8: 0x00000001000c4d34 ruby`rb_call_inits + 72
    frame #9: 0x0000000100093e58 ruby`ruby_setup + 316
    frame #10: 0x0000000100093ee0 ruby`ruby_init + 12
    frame #11: 0x0000000100001be4 ruby`main + 76
    frame #12: 0x00000001003fd0f4 dyld`start + 520
(lldb) 

Updated by timsutton (Tim Sutton) almost 3 years ago

I have been hoping to do the same operation here for my org, as a way to distribute a universal Ruby binary that would be usable on both Intel and Apple Silicon machines, and to be able to build it on Intel. I seem to run into the same problem when building on Intel.

Updated by ecnelises (Chaofan QIU) almost 3 years ago

Can you please try codesign -s - ruby? Apple's ARM chips require executables to be signed.

I encountered the same killed 9 error elsewhere, FYI: https://lists.gnu.org/archive/html/bug-gnu-emacs/2020-11/msg01480.html

Updated by timsutton (Tim Sutton) almost 3 years ago

Sure. I had suspected that at some point so I checked the signature using codesign -dvvvvv. But I also just repeated that test, and then replaced the built binary with a new ad-hoc signature on the M1. That unfortunately seemed to not help:

# intel-built universal binary copied over
tsutton@tim-m1 ~ % cp /Volumes/ssd/ruby_274 .

tsutton@tim-m1 ~ % codesign -d -vvvvv ruby_274 
Executable=/Users/tsutton/ruby_274
Identifier=-5ac6e2.out
Format=Mach-O universal (x86_64 arm64)
CodeDirectory v=20400 size=30020 flags=0x20002(adhoc,linker-signed) hashes=935+0 location=embedded
VersionPlatform=1
VersionMin=720896
VersionSDK=721664
Hash type=sha256 size=32
CandidateCDHash sha256=63eda95634ac1d1ea6c97467085ec887b45f1dde
CandidateCDHashFull sha256=63eda95634ac1d1ea6c97467085ec887b45f1dde4659262d661eccca13ba17ca
Hash choices=sha256
CMSDigest=63eda95634ac1d1ea6c97467085ec887b45f1dde4659262d661eccca13ba17ca
CMSDigestType=2
Executable Segment base=0
Executable Segment limit=2834432
Executable Segment flags=0x1
Page size=4096
CDHash=63eda95634ac1d1ea6c97467085ec887b45f1dde
Signature=adhoc
Info.plist=not bound
TeamIdentifier=not set
Sealed Resources=none
Internal requirements=none

tsutton@tim-m1 ~ % ./ruby_274 
zsh: segmentation fault  ./ruby_274

tsutton@tim-m1 ~ % cp ruby_274 ruby_274_copy

tsutton@tim-m1 ~ % codesign -s - ruby_274_copy 

tsutton@tim-m1 ~ % ./ruby_274_copy  
zsh: segmentation fault  ./ruby_274_copy

# using -f to force signature replacement
tsutton@tim-m1 ~ % codesign -fs - ruby_274_copy
ruby_274_copy: replacing existing signature

tsutton@tim-m1 ~ % ./ruby_274_copy             
zsh: segmentation fault  ./ruby_274_copy

Updated by ccaviness (Clay Caviness) almost 3 years ago

Lack of codesigning on Apple Silicon is an excellent guess, but unfortunately does not seem to be the cause here as Tim's demonstrated above (and I've verified as well). I first noticed this issue when testing a ruby that was fully signed with a public developer cert.

Updated by ccaviness (Clay Caviness) over 2 years ago

I don't believe any of those bugs are related.

My suspicion is that, when building on x86 and targeting universal, the small test binaries that configure builds for arm64 cannot be executed on x86, which leaves the various hints about the host machine wildly incorrect.

When building on arm64 and targeting universal, these test binaries that are built for x86 can actually run on the arm64 machine successfully, due to the Rosetta x86 compatibility layer.

There is no mechanism to run arm64 binaries on x86 Macs, though, so I think to get cross-compilation working on x86 many of the various autoconf hints will need to be manually set.

I'm not that familiar with autoconf or what these values should be, though.

Updated by benhamilton (Ben Hamilton) over 1 year ago

I think I know what the problem is.

During the build, Ruby has special logic to serialize its own builtin module to disk using the binary iseq format (I assume for speed so it doesn't have to parse builtin every time it starts up).

However, since iseq format is architecture-specific, when building on x86_64 for universal x86_64 + arm64, the serialized builtin module is written with the x86_64 architecture of the build machine, which fails this check whenever ruby imports its (serialized to x86_64-specific iseq format) builtin module on arm64:

https://github.com/ruby/ruby/blob/1fdaa0666086529b3aae2d509a2e71c4247c3a12/compile.c#L13243

Thankfully, there's logic to disable this feature for cross-compiled builds:

https://github.com/ruby/ruby/blob/1fdaa0666086529b3aae2d509a2e71c4247c3a12/builtin.c#L6

We just need to enable this for universal builds as well.

Updated by benhamilton (Ben Hamilton) over 1 year ago

I also reproduced the SIGSEGV from the original bug using a build with debug symbols:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x00000001002444b4 ruby`ruby_vm_special_exception_copy [inlined] RBASIC_CLASS(obj=0) at rbasic.h:155:25
    frame #1: 0x00000001002444b4 ruby`ruby_vm_special_exception_copy(exc=0) at vm_insnhelper.c:52:42
    frame #2: 0x0000000100244584 ruby`ec_stack_overflow(ec=0x0000000100704920, setup=1) at vm_insnhelper.c:65:16
    frame #3: 0x0000000100244504 ruby`rb_ec_stack_overflow(ec=0x0000000100704920, crit=0) at vm_insnhelper.c:97:5
    frame #4: 0x000000010026b544 ruby`rb_call0(ec=0x0000000100704920, recv=4300956680, mid=3121, argc=1, argv=0x000000016fdff3e0, call_scope=<unavailable>, self=4) at rgengc.h:0:9
    frame #5: 0x000000010013ca30 ruby`rb_class_new_instance [inlined] rb_class_new_instance_kw(argc=1, argv=0x000000016fdff3e0, klass=<unavailable>, kw_splat=0) at object.c:2025:5
    frame #6: 0x000000010013c9f0 ruby`rb_class_new_instance(argc=1, argv=0x000000016fdff3e0, klass=<unavailable>) at object.c:2033:12
    frame #7: 0x00000001000953c0 ruby`rb_exc_new_str(etype=<unavailable>, str=4300956720) at error.c:1145:12
    frame #8: 0x000000010025d668 ruby`rb_vm_register_special_exception_str(sp=ruby_error_reenter, cls=<unavailable>, mesg=<unavailable>) at vm.c:2872:17
    frame #9: 0x00000001000a1d50 ruby`Init_eval at eval.c:2091:5
    frame #10: 0x00000001000d51fc ruby`rb_call_inits at inits.c:41:5
    frame #11: 0x000000010009fa14 ruby`ruby_setup at eval.c:89:9
    frame #12: 0x000000010009fa90 ruby`ruby_init at eval.c:101:17
    frame #13: 0x0000000100004610 ruby`main [inlined] rb_main(argc=1, argv=0x000000016fdff938) at main.c:37:5
    frame #14: 0x0000000100004604 ruby`main(argc=1, argv=0x000000016fdff938) at main.c:57:12
    frame #15: 0x000000018599be50 dyld`start + 2544

(snip)

(lldb) print *ec
(rb_execution_context_t) $7 = {
  vm_stack = 0x0000000108028000
  vm_stack_size = 131072
  cfp = 0x0000000108127fc0
  tag = 0x000000016fdff4b0
  interrupt_flag = 0
  interrupt_mask = 0
  fiber_ptr = 0x00000001007048d0
  thread_ptr = 0x0000000100704260
  local_storage = NULL
  local_storage_recursive_hash = 4
  local_storage_recursive_hash_for_trace = 4
  storage = 4
  root_lep = 0x0000000000000000
  root_svar = 0
  ensure_list = NULL
  trace_arg = NULL
  errinfo = 4
  passed_block_handler = 0
  raised_flag = '\0'
  method_missing_reason = MISSING_NOENTRY
  private_const_reference = 0
  machine = {
    stack_start = 0x000000016fdff4ac
    stack_end = 0x000000016fdff2c0
    stack_maxsize = 0

(snip)

(lldb) print ec->machine.stack_start
(VALUE *) $4 = 0x000000016fdff4ac
(lldb) print ec->machine.stack_end
(VALUE *) $5 = 0x000000016fdff2c0
(lldb) print ec->machine.stack_end - ec->machine.stack_start
(long) $6 = -61

(snip)

(lldb) print ruby_stack_length(NULL)
(size_t) $8 = 18446744073709551529

Looks like it's got the stack direction backwards. I checked config.log, and it's incorrectly detecting the stack growth direction as +1 (growing towards larger addresses) for arm64:

| #if defined __x86_64__
| #define STACK_GROW_DIRECTION -1
| #endif /* defined __x86_64__ */
| #if defined __arm64__
| #define STACK_GROW_DIRECTION +1

(snip)

rb_cv_stack_grow_dir_arm64=+1
rb_cv_stack_grow_dir_x86_64=-1

Sent PR https://github.com/ruby/ruby/pull/7373 with this fix as well.

Updated by ccaviness (Clay Caviness) over 1 year ago

Could someone please review and merge Ben's PR https://github.com/ruby/ruby/pull/7367 to fix this? I'd like to see these changes make the next release.

Updated by ccaviness (Clay Caviness) about 1 year ago

https://github.com/ruby/ruby/pull/7367 fixes this, and just needs to be merged.

Updated by ccaviness (Clay Caviness) 12 months ago

@nobu (Nobuyoshi Nakada) Yes, at least in initial tests. A universal ruby built on an x86 Mac with that patch seems to work on an Apple Silicon Mac.

Updated by benhamilton (Ben Hamilton) 12 months ago

  • Status changed from Open to Closed

Applied in changeset git|1d5598fe0d3470e7cab06a756d40a9221fcd501b.


Disable iseq-dumped builtin module for universal x86_64/arm64 binaries

During the build, Ruby has special logic to serialize its own builtin
module to disk using the binary iseq format during the build (I assume
for speed so it doesn't have to parse builtin every time it starts
up).

However, since iseq format is architecture-specific, when building on
x86_64 for universal x86_64 + arm64, the serialized builtin module is
written with the x86_64 architecture of the build machine, which fails
this check whenever ruby imports the builtin module on arm64:

https://github.com/ruby/ruby/blob/1fdaa0666086529b3aae2d509a2e71c4247c3a12/compile.c#L13243

Thankfully, there's logic to disable this feature for cross-compiled builds:

https://github.com/ruby/ruby/blob/1fdaa0666086529b3aae2d509a2e71c4247c3a12/builtin.c#L6

This disables the iseq logic for universal builds as well.

Fixes [Bug #18286]
