Project

General

Profile

Actions

Bug #20203

open

`TestEnumerable` test failures with GCC 14

Added by vo.x (Vit Ondruch) 3 months ago. Updated about 1 month ago.

Status:
Open
Assignee:
-
Target version:
-
ruby -v:
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
[ruby-core:116370]

Description

There is ongoing mass rebuild in Fedora and that is first time GCC 14 is used and we observe test failures in TestEnumerable. Here are a few examples:

[ 3000/26419] TestEnumerable#test_transient_heap_sort_bymalloc_consolidate(): unaligned fastbin chunk detected
[ 2455/26535] TestEnumerable#test_transient_heap_sort_bycorrupted size vs. prev_size in fastbins
[ 9716/26532] TestEnumerable#test_any_with_unused_blockdouble free or corruption (fasttop)

The full logs are accessible here. Please drill through Descendants and build.log


Files

sort-benchmark-ubuntu.png (233 KB) sort-benchmark-ubuntu.png alanwu (Alan Wu), 03/20/2024 06:10 PM
sort-benchmark-macos.png (232 KB) sort-benchmark-macos.png alanwu (Alan Wu), 03/20/2024 06:18 PM

Updated by vo.x (Vit Ondruch) 3 months ago

This is the backtrace I was able to get:

[72/83] TestEnumerable#test_inject_array_op_redefined[Detaching after vfork from child process 94]
 = 0.00 s
 10) Error:
TestEnumerable#test_inject_array_op_redefined:
Errno::ENOENT: No such file or directory - /usr/bin/ruby
    /builddir/build/BUILD/ruby-3.3.0/tool/lib/envutil.rb:161:in `spawn'
    /builddir/build/BUILD/ruby-3.3.0/tool/lib/envutil.rb:161:in `invoke_ruby'

malloc_consolidate(): unaligned fastbin chunk detected                  

Thread 1 "ruby" received signal SIGABRT, Aborted.
0x00007ffff7723184 in __pthread_kill_implementation () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.38.9000-33.fc40.x86_64 gmp-6.2.1-5.fc39.x86_64 libgcc-14.0.1-0.2.fc40.x86_64 libxcrypt-4.4.36-4.fc40.x86_64 zlib-ng-compat-2.1.6-1.fc40.x86_64
(gdb) bt
#0  0x00007ffff7723184 in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007ffff76cb65e in raise () from /lib64/libc.so.6
#2  0x00007ffff76b3902 in abort () from /lib64/libc.so.6
#3  0x00007ffff76b4767 in __libc_message_impl.cold () from /lib64/libc.so.6
#4  0x00007ffff772d1b5 in malloc_printerr () from /lib64/libc.so.6
#5  0x00007ffff772dd7c in malloc_consolidate () from /lib64/libc.so.6
#6  0x00007ffff772ef90 in _int_free_maybe_consolidate.part.0 () from /lib64/libc.so.6
#7  0x00007ffff772f5fa in _int_free () from /lib64/libc.so.6
#8  0x00007ffff7731e0e in free () from /lib64/libc.so.6
#9  0x00007ffff7ad52ec in objspace_xfree (old_size=11256, ptr=0x555555930250, objspace=0x55555555dc70) at /builddir/build/BUILD/ruby-3.3.0/gc.c:12823
#10 objspace_xfree (old_size=<optimized out>, ptr=0x555555930250, objspace=0x55555555dc70) at /builddir/build/BUILD/ruby-3.3.0/gc.c:12754
#11 ruby_sized_xfree (x=0x555555930250, size=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/gc.c:12927
#12 0x00007ffff7a91ec1 in cont_free (ptr=0x555555c9f3e0) at /builddir/build/BUILD/ruby-3.3.0/cont.c:1059
#13 0x00007ffff7accd01 in rb_data_free (obj=140736887130000, objspace=0x55555555dc70) at /builddir/build/BUILD/ruby-3.3.0/gc.c:3500
#14 obj_free (objspace=0x55555555dc70, obj=140736887130000) at /builddir/build/BUILD/ruby-3.3.0/gc.c:3659
#15 0x00007ffff7cd9b10 in gc_sweep_plane (heap=0x55555555dce0, ctx=<optimized out>, bitset=<optimized out>, p=140736887130000, objspace=0x55555555dc70) at /builddir/build/BUILD/ruby-3.3.0/gc.c:5680
#16 gc_sweep_page.constprop.0 (objspace=0x55555555dc70, heap=0x55555555dce0, ctx=0x7fffffffca40) at /builddir/build/BUILD/ruby-3.3.0/gc.c:5758
#17 0x00007ffff7acac51 in gc_sweep_step (objspace=objspace@entry=0x55555555dc70, size_pool=size_pool@entry=0x55555555dc90, heap=heap@entry=0x55555555dce0) at /builddir/build/BUILD/ruby-3.3.0/gc.c:6047
#18 0x00007ffff7acfa71 in gc_sweep (objspace=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/gc.c:6272
#19 0x00007ffff7adadce in gc_start (objspace=0x55555555dc70, reason=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/gc.c:9609
#20 0x00007ffff7ad35ab in heap_prepare (heap=0x55555555dce0, size_pool=0x55555555dc90, objspace=0x55555555dc70) at /builddir/build/BUILD/ruby-3.3.0/gc.c:2517
#21 heap_next_free_page (heap=0x55555555dce0, size_pool=0x55555555dc90, objspace=0x55555555dc70) at /builddir/build/BUILD/ruby-3.3.0/gc.c:2725
#22 newobj_alloc (objspace=0x55555555dc70, cr=0x55555555e960, size_pool_idx=0, vm_locked=<optimized out>, vm_locked@entry=false) at /builddir/build/BUILD/ruby-3.3.0/gc.c:2827
#23 0x00007ffff7ad3eb4 in newobj_of0 (alloc_size=<optimized out>, cr=<optimized out>, wb_protected=1, flags=<optimized out>, klass=140736918383680) at /builddir/build/BUILD/ruby-3.3.0/gc.c:2930
#24 newobj_of (alloc_size=<optimized out>, wb_protected=1, v3=0, v2=0, v1=0, flags=<optimized out>, klass=140736918383680, cr=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/gc.c:2947
#25 rb_wb_protected_newobj_of (ec=<optimized out>, klass=140736918383680, flags=<optimized out>, size=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/gc.c:2962
#26 0x00007ffff7bdf0c1 in str_alloc_embed (capa=6, klass=140736918383680) at /builddir/build/BUILD/ruby-3.3.0/vm_core.h:1954
#27 str_new0 (klass=140736918383680, ptr=0x5555555fa945 "vt100", len=5, termlen=1) at /builddir/build/BUILD/ruby-3.3.0/string.c:871
#28 0x00007ffff7bdf9fe in rb_enc_str_new (ptr=<optimized out>, len=<optimized out>, enc=0x555555579650) at /builddir/build/BUILD/ruby-3.3.0/string.c:928
#29 0x00007ffff7ae4ae5 in env_enc_str_new (enc=<optimized out>, len=5, ptr=0x5555555fa945 "vt100") at /builddir/build/BUILD/ruby-3.3.0/hash.c:4810
#30 env_str_new (len=5, ptr=0x5555555fa945 "vt100") at /builddir/build/BUILD/ruby-3.3.0/hash.c:4819
#31 env_str_new2 (ptr=0x5555555fa945 "vt100") at /builddir/build/BUILD/ruby-3.3.0/hash.c:4826
#32 env_to_hash () at /builddir/build/BUILD/ruby-3.3.0/hash.c:6257
#33 0x00007ffff7ae4bf0 in env_to_h (_=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/hash.c:6317
#34 0x00007ffff7c3aa46 in vm_call_cfunc_with_frame_ (ec=0x55555555ec10, reg_cfp=0x7ffff75fca98, calling=<optimized out>, argc=0, argv=0x7ffff74fd4a8, stack_bottom=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:3490
#35 0x00007ffff7c4233b in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:5581
#36 vm_exec_core (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/redhat-linux-build/insns.def:834
#37 0x00007ffff7c5a680 in vm_exec_loop (result=<optimized out>, tag=0x7fffffffd0b0, state=<optimized out>, ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/vm.c:2513
#38 rb_vm_exec (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/vm.c:2492
#39 0x00007ffff7c47c67 in vm_yield_with_cref (is_lambda=0, cref=0x0, kw_splat=0, argv=0x7fffffffd1e8, argc=1, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm.c:1634
#40 vm_yield (kw_splat=0, argv=0x7fffffffd1e8, argc=1, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm.c:1642
#41 rb_yield_0 (argv=0x7fffffffd1e8, argc=1) at /builddir/build/BUILD/ruby-3.3.0/vm_eval.c:1366
#42 rb_yield (val=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_eval.c:1382
#43 0x00007ffff7a4220c in rb_ary_collect (ary=140736886959720) at /builddir/build/BUILD/ruby-3.3.0/array.c:3630
#44 0x00007ffff7c3aa46 in vm_call_cfunc_with_frame_ (ec=0x55555555ec10, reg_cfp=0x7ffff75fcbb0, calling=<optimized out>, argc=0, argv=0x7ffff74fd3a8, stack_bottom=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:3490
#45 0x00007ffff7c4527d in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:5581
#46 vm_exec_core (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/redhat-linux-build/insns.def:814
#47 0x00007ffff7c5a46d in rb_vm_exec (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/vm.c:2486
#48 0x00007ffff7c47c67 in vm_yield_with_cref (is_lambda=0, cref=0x0, kw_splat=0, argv=0x7fffffffd538, argc=1, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm.c:1634
#49 vm_yield (kw_splat=0, argv=0x7fffffffd538, argc=1, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm.c:1642
#50 rb_yield_0 (argv=0x7fffffffd538, argc=1) at /builddir/build/BUILD/ruby-3.3.0/vm_eval.c:1366
#51 rb_yield (val=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_eval.c:1382
#52 0x00007ffff7a41fc4 in rb_ary_each (ary=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/array.c:2538
#53 rb_ary_each (ary=140736887015160) at /builddir/build/BUILD/ruby-3.3.0/array.c:2532
#54 0x00007ffff7c3aa46 in vm_call_cfunc_with_frame_ (ec=0x55555555ec10, reg_cfp=0x7ffff75fcc90, calling=<optimized out>, argc=0, argv=0x7ffff74fd2c0, stack_bottom=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:3490
#55 0x00007ffff7c4527d in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:5581
#56 vm_exec_core (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/redhat-linux-build/insns.def:814
#57 0x00007ffff7c5a46d in rb_vm_exec (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/vm.c:2486
#58 0x00007ffff7c47c67 in vm_yield_with_cref (is_lambda=0, cref=0x0, kw_splat=0, argv=0x7fffffffd888, argc=1, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm.c:1634
#59 vm_yield (kw_splat=0, argv=0x7fffffffd888, argc=1, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm.c:1642
--Type <RET> for more, q to quit, c to continue without paging--
#60 rb_yield_0 (argv=0x7fffffffd888, argc=1) at /builddir/build/BUILD/ruby-3.3.0/vm_eval.c:1366
#61 rb_yield (val=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_eval.c:1382
#62 0x00007ffff7a41fc4 in rb_ary_each (ary=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/array.c:2538
#63 rb_ary_each (ary=140736886986000) at /builddir/build/BUILD/ruby-3.3.0/array.c:2532
#64 0x00007ffff7c3aa46 in vm_call_cfunc_with_frame_ (ec=0x55555555ec10, reg_cfp=0x7ffff75fce18, calling=<optimized out>, argc=0, argv=0x7ffff74fd160, stack_bottom=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:3490
#65 0x00007ffff7c4527d in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/vm_insnhelper.c:5581
#66 vm_exec_core (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/redhat-linux-build/insns.def:814
#67 0x00007ffff7c5a680 in vm_exec_loop (result=<optimized out>, tag=0x7fffffffdaa0, state=<optimized out>, ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/vm.c:2513
#68 rb_vm_exec (ec=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/vm.c:2492
#69 0x00007ffff7c5b92e in rb_vm_invoke_proc (ec=<optimized out>, proc=<optimized out>, argc=<optimized out>, argv=<optimized out>, kw_splat=<optimized out>, passed_block_handler=<optimized out>)
    at /builddir/build/BUILD/ruby-3.3.0/vm.c:1728
#70 0x00007ffff7b75d61 in rb_proc_call_kw (self=<optimized out>, args=<optimized out>, kw_splat=0) at /builddir/build/BUILD/ruby-3.3.0/proc.c:978
#71 0x00007ffff7ab8e0a in exec_end_procs_chain (errp=<optimized out>, procs=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/eval_jump.c:105
#72 rb_ec_exec_end_proc (ec=ec@entry=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/eval_jump.c:120
#73 0x00007ffff7ab9d98 in rb_ec_teardown (ec=ec@entry=0x55555555ec10) at /builddir/build/BUILD/ruby-3.3.0/eval.c:159
#74 0x00007ffff7aba32c in rb_ec_cleanup (ec=ec@entry=0x55555555ec10, ex=RUBY_TAG_NONE) at /builddir/build/BUILD/ruby-3.3.0/eval.c:212
#75 0x00007ffff7aba99d in ruby_run_node (n=0x7fffdc315b30) at /builddir/build/BUILD/ruby-3.3.0/eval.c:328
#76 0x0000555555555195 in rb_main (argv=0x7fffffffe1c8, argc=7) at /builddir/build/BUILD/ruby-3.3.0/main.c:39
#77 main (argc=<optimized out>, argv=<optimized out>) at /builddir/build/BUILD/ruby-3.3.0/main.c:58

Updated by vo.x (Vit Ondruch) 3 months ago

I have also reported the issue against GCC in Fedora where the toolchain folks are already at it:

https://bugzilla.redhat.com/show_bug.cgi?id=2259845

Updated by vo.x (Vit Ondruch) 3 months ago

So far, it was attributed to "issue within glibc qsort".

Updated by vo.x (Vit Ondruch) 3 months ago

vo.x (Vit Ondruch) wrote in #note-3:

So far, it was attributed to "issue within glibc qsort".

By mistake. Back to GCC

Updated by vo.x (Vit Ondruch) 3 months ago

So there is more insights from glibc developers and it seems the issue is that "Ruby uses qsort_r in an undefined way". Let me quote @fweimer (Florian Weimer) from RH bugzilla:

In current rawhide glibc (glibc-2.38.9000-33.fc40.x86_64), a buffer allocated with malloc is used for the qsort scratch buffer. This is actually a glibc bug because the array is very short and we should use an on-stack buffer.

I need to confirm the details yet, but I think what happens is that the Ruby garbage collector runs during the sort_by callback. I suspect the collector writes to the array, which is quite undefined (“The comparison function shall not alter the contents of the array.” says the C standard). This causes problems subsequently when we copy back previous array contents from the scratch buffer. With a stack-based buffer, the collector pins objects, so the issue is not visible. Sorry, this is all very speculative, but I don't want you to spend more time chasing this.

I can reproduce the crash in Fedora 38 (with upstream Ruby sources) if I increase the size of the array being sorted so that qsort_r uses a malloc-based buffer there as well:

diff --git a/test/ruby/test_enum.rb b/test/ruby/test_enum.rb
index f7c8f012d8..23e18cc590 100644
--- a/test/ruby/test_enum.rb
+++ b/test/ruby/test_enum.rb
@@ -871,7 +871,9 @@ class << o; self; end.class_eval do
           0
         end
       end
-      [o, o, o].sort_by {|x| x }
+      l = []
+      (1..100).each {|x| l += [o] }
+      l.sort_by {|x| x }
       c.call
     end

The whole thing is probably quite sensitive to allocation patterns etc., so I have no idea how reliable this is as a trigger for the bug.

and followup

With this instrumentation patch applied to glibc:

diff --git a/stdlib/qsort.c b/stdlib/qsort.c
index 7f5a00fb33..c5263d9f5f 100644
--- a/stdlib/qsort.c
+++ b/stdlib/qsort.c
@@ -25,6 +25,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <stdbool.h>
+#include <assert.h>
 
 /* Swap SIZE bytes between addresses A and B.  These helpers are provided
    along the generic one as an optimization.  */
@@ -338,9 +339,9 @@ indirect_msort_with_tmp (const struct msort_param *p, void *b, size_t n,
       }
 }
 
-void
-__qsort_r (void *const pbase, size_t total_elems, size_t size,
-	   __compar_d_fn_t cmp, void *arg)
+static void
+__qsort_r_real (void *const pbase, size_t total_elems, size_t size,
+		__compar_d_fn_t cmp, void *arg)
 {
   if (total_elems <= 1)
     return;
@@ -396,6 +397,43 @@ __qsort_r (void *const pbase, size_t total_elems, size_t size,
   if (buf != tmp)
     free (buf);
 }
+
+struct qsort_r_data
+{
+  __compar_d_fn_t cmp;
+  void *arg;
+  void *array;
+  size_t size;
+  void *copy;
+};
+
+static int
+qsort_compare_wrapper (const void *a, const void *b, void *data1)
+{
+  struct qsort_r_data *data = data1;
+  memcpy (data->copy, data->array, data->size);
+  int ret = data->cmp (a, b, data->arg);
+  assert (memcmp (data->array, data->copy, data->size) == 0);
+  return ret;
+}
+
+void
+__qsort_r (void *pbase, size_t total_elems, size_t size,
+	   __compar_d_fn_t cmp, void *arg)
+{
+  struct qsort_r_data data =
+    {
+      .cmp = cmp,
+      .arg = arg,
+      .array = pbase,
+      .size = total_elems * size,
+    };
+  data.copy = malloc (data.size);
+  assert (data.copy != NULL);
+  __qsort_r_real (pbase, total_elems, size, qsort_compare_wrapper, &data);
+  free (data.copy);
+}
+
 libc_hidden_def (__qsort_r)
 weak_alias (__qsort_r, qsort_r)

And using the Fedora rawhide glibc variant with the heap allocation and the unchanged Ruby test case, I get:

[54/83] TestEnumerable#test_callccFatal glibc error: qsort.c:416 (qsort_compare_wrapper): assertion failed: memcmp (data->array, data->copy, data->size) == 0

Thread 1 "ruby" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, 
    no_tid=no_tid@entry=0) at pthread_kill.c:44
44	      return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, 
    signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff7c57423 in __pthread_kill_internal (signo=6, 
    threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff7c0493e in __GI_raise (sig=sig@entry=6)
    at ../sysdeps/posix/raise.c:26
#3  0x00007ffff7bec8ff in __GI_abort () at abort.c:79
#4  0x00007ffff7bed7d5 in __libc_message_impl (
    fmt=fmt@entry=0x7ffff7d6cba0 "Fatal glibc error: %s:%s (%s): assertion failed: %s\n") at ../sysdeps/posix/libc_fatal.c:132
#5  0x00007ffff7bfcaa9 in __libc_assert_fail (
    assertion=assertion@entry=0x7ffff7d6cd70 "memcmp (data->array, data->copy, data->size) == 0", file=file@entry=0x7ffff7d67d51 "qsort.c", 
    line=line@entry=416, 
    function=function@entry=0x7ffff7d71390 <__PRETTY_FUNCTION__.1> "qsort_compare_wrapper") at __libc_assert_fail.c:31
#6  0x00007ffff7c0873c in qsort_compare_wrapper (a=a@entry=0x7fffdc852fe0, 
    b=b@entry=0x7fffdc852ff0, data1=data1@entry=0x7fffffffd520) at qsort.c:416
#7  0x00007ffff7c08923 in msort_with_tmp (p=p@entry=0x7fffffffd0a0, 
    b=b@entry=0x7fffdc852fe0, n=n@entry=2) at qsort.c:276
#8  0x00007ffff7c08ced in msort_with_tmp (n=2, b=0x7fffdc852fe0, 
    p=0x7fffffffd0a0) at qsort.c:202
#9  __qsort_r_real (pbase=pbase@entry=0x7fffdc852fe0, 
    total_elems=total_elems@entry=2, size=size@entry=16, 
    arg=arg@entry=0x7fffffffd520, cmp=0x7ffff7c086c0 <qsort_compare_wrapper>)
    at qsort.c:394
#10 0x00007ffff7c09140 in __GI___qsort_r (pbase=0x7fffdc852fe0, total_elems=2, 
    size=size@entry=16, cmp=cmp@entry=0x5555559709a0 <sort_by_cmp>, 
    arg=arg@entry=0x7fffdc852fd0) at qsort.c:433
#11 0x000055555596f3ad in enum_sort_by (obj=<optimized out>) at enum.c:1691

 
I think that's pretty good evidence that ruby uses qsort_r in an undefined way

Updated by fweimer (Florian Weimer) 3 months ago

If you want to keep using qsort_r, please use it to sort a temporary array of array indices, and not an array of GC-managed pointers.

Updated by alanwu (Alan Wu) 3 months ago

The GC by default doesn't move objects so it shouldn't write to the array. I guess it's a use-after-free and double free caused by callcc in the test jumping into the middle of qsort_r like the glibc devs noticed https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=dfa3394a605c8f6f25e4f827789bc89eca1d206c

The c.call line in the test jumps back into the middle of qsort_r (with a saved stack) after it already returned, so if the buffer is malloc'd it'd be freed already the second time around.

Updated by alanwu (Alan Wu) 3 months ago

So there are currently 2 issues with using qsort_r that I see.

  1. As noticed in this issue, if the comparison function uses coroutine/fiber to reenter the middle of qsort_r, that results in heap corruption. This happens on older glibc too. Valgrind shows this issue:
require 'continuation'
c = nil
o = Object.new
class << o; self; end.class_eval do
  define_method(:<=>) do |x|
    callcc {|c2| c ||= c2 }
    0
  end
end
Array.new(1000, o).sort_by {|x| x }
c.call
==8321== Invalid read of size 8
==8321==    at 0x484DE5E: memmove (vg_replace_strmem.c:1410)
==8321==    by 0x4E42270: msort_with_tmp (msort.c:44)
==8321==    by 0x4E42270: msort_with_tmp.part.0 (msort.c:53)
<snip>
==8321==    by 0x4E427B5: qsort_r (msort.c:296)
==8321==    by 0x492BFEA: enum_sort_by (enum.c:1293)
==8321==  Address 0x9bb55a0 is 0 bytes inside a block of size 16,000 free'd
==8321==    at 0x484488F: free (vg_replace_malloc.c:985)
==8321==    by 0x4E427C2: qsort_r (msort.c:298)
<snip>
==8321==  Block was alloc'd at
==8321==    at 0x4841828: malloc (vg_replace_malloc.c:442)
==8321==    by 0x4E42641: qsort_r (msort.c:221)
==8321==    by 0x492BFEA: enum_sort_by (enum.c:1293)
  1. With GC compaction, the GC can update references inside the comparison function. This is undefined behavior:

    The application shall ensure that the comparison function pointed to by compar does not alter the contents of the array.

    https://pubs.opengroup.org/onlinepubs/9699919799/functions/qsort.html
    In practice, I've seen this causing issue with GC.auto_compact in our app's CI, where sorting leaving moved objects in the array.

It seems that to fix these issues we need to stop using qsort_r and use our own ruby_qsort implementation that doesn't malloc.

Updated by ko1 (Koichi Sasada) 2 months ago

@alanwu (Alan Wu) do you have measurements with system qsort and ruby's qsort?

Updated by mame (Yusuke Endoh) about 1 month ago

This was discussed at the February dev meeting and @matz (Yukihiro Matsumoto) said "give it a try." @alanwu (Alan Wu) Can you please discuss with @ko1 (Koichi Sasada) and proceed it?

Updated by alanwu (Alan Wu) about 1 month ago

I ran some benchmarks comparing the builtin ruby_qsort() and qsort_r() on macOS with an M1 chip and on Ubuntu 22.04 (glibc 2.35) wtih a Xeon Platinum 8000 chip. The rubies are built off of 3f5f04afa7 and the ruby_qsort() one is built with configure ac_cv_func_qsort_r=no to use the builtin sort (verified with a debugger). Note that I used benchmark-driver in time mode, which picks the number times to repeat based on the workload, so larger input size doesn't necessarily run longer.


With shuffled inputs, the builtin sort is about 5% faster on macOS but 10 to 20% slower on Ubuntu. The builtin sort seems very good with ordered inputs and outperforms qsort_r() across the board.

Considering it's faster on macOS, I think the builtin sort has acceptable performance.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0