Project

General

Profile

Actions

Bug #19386

closed

`test_hmac.rb` of openssl is timeout on RHEL9

Added by hsbt (Hiroshi SHIBATA) almost 2 years ago. Updated over 1 year ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:112072]

Description

I prepared new test instance of RHEL9 for https://rubyci.org/. But it stuck with openssl tests.

see http://rubyci.s3.amazonaws.com/rhel9/ruby-master/log/20230127T003004Z.fail.html.gz

I did bisect this issue. This issue caused by https://github.com/ruby/ruby/blob/master/test/openssl/test_hmac.rb#L23

  def test_dup
    h1 = OpenSSL::HMAC.new("KEY", "MD5")
    h1.update("DATA")
    h = h1.dup
    assert_equal(h1.digest, h.digest, "dup digest")
  end

Stuck is happened after running this test.

/cc @vo.x @jaruga (Jun Aruga) @rhenium (Kazuki Yamaguchi)

Updated by jaruga (Jun Aruga) almost 2 years ago

Thank you for the report! I just logged in the instance of the RHEL 9 server used for the Ruby CI. Here is the info of the dependencies.

[jaruga@rhel9 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.1 (Plow)

[jaruga@rhel9 ~]$ rpm -q gcc
gcc-11.3.1-2.1.el9.x86_64

[jaruga@rhel9 ~]$ rpm -q openssl
openssl-3.0.1-43.el9_0.x86_64

[jaruga@rhel9 ~]$ rpm -q openssl-devel
openssl-devel-3.0.1-43.el9_0.x86_64

Updated by jaruga (Jun Aruga) almost 2 years ago

I debugged the issue on the latest master branch 47b66a5acd1511089ac9585371d93e1b5ea07232 on the RHEL 9 instance.

I installed the following RPM packages to compile and debug. The gdb asked the following debuginfo RPM packages are missing. So, I installed these ones.

$ sudo dnf install gdb
$ sudo dnf install ruby
$ sudo dnf debuginfo-install glibc-2.34-40.el9_1.1.x86_64 gmp-6.2.0-10.el9.x86_64 libgcc-11.3.1-2.1.el9.x86_64 libxcrypt-4.4.18-3.el9.x86_64 zlib-1.2.11-35.el9_1.x86_64
$ sudo dnf debuginfo-install openssl-libs-3.0.1-43.el9_0.x86_64

I compiled the Ruby with the following flags.

$ ./autogen.sh
$ optflags=-O0 debugflags="-g3 -ggdb3 -gdwarf-4" \
  ./configure \
  --prefix=$HOME/.local/ruby-47b66a5acd-O0 \
  --enable-shared
$ make

Then I confirmed the stuck or coredump happened with the command below.

[jaruga@rhel9 ruby]$ make test-all V=1 TESTS="-v test/openssl/test_hmac.rb -n OpenSSL::TestHMAC#test_dup"
exec ./miniruby -I./lib -I. -I.ext/common  ./tool/runruby.rb --extout=.ext  -- --disable-gems "./test/runner.rb" --ruby="./miniruby -I./lib -I. -I.ext/common  ./tool/runruby.rb --extout=.ext  -- --disable-gems" -
-excludes-dir=./test/excludes --name=!/memory_leak/  -v test/openssl/test_hmac.rb -n OpenSSL::TestHMAC#test_dup
Run options:
  --seed=25634
  "--ruby=./miniruby -I./lib -I. -I.ext/common  ./tool/runruby.rb --extout=.ext  -- --disable-gems"
  --excludes-dir=./test/excludes
  --name=!/memory_leak/
  -v
  -n
  OpenSSL::TestHMAC#test_dup

# Running tests:

[1/0] OpenSSL::TestHMAC#test_dup = 0.00 s
Finished tests in 0.025968s, 38.5083 tests/s, 154.0333 assertions/s.
1 tests, 4 assertions, 0 failures, 0 errors, 0 skips

ruby -v: ruby 3.3.0dev (2023-01-27T11:07:06Z master 47b66a5acd) [x86_64-linux]

^Cmake: *** [uncommon.mk:850: yes-test-all] Interrupt

Here is the gdb backtrace.

[jaruga@rhel9 ruby]$ gdb -q -ex "set breakpoint pending on" --args   ./miniruby -I./lib -I. -I.ext/common  ./tool/runruby.rb --extout=.ext -- --disable-gems "./test/runner.rb"   --ruby="./miniruby -I./lib -I. -I.ext/common  ./tool/runruby.rb --extout=.ext  -- --disable-gems" --excludes-dir=./test/excludes   -v test/openssl/test_hmac.rb -n OpenSSL::TestHMAC#test_dup
(gdb) b main
...<snip>...
(gdb) n
38	    return ruby_run_node(ruby_options(argc, argv));
(gdb) n
Run options:
  --seed=43937
  "--ruby=./miniruby -I./lib -I. -I.ext/common  ./tool/runruby.rb --extout=.ext  -- --disable-gems"
  --excludes-dir=./test/excludes
  -v
  -n
  OpenSSL::TestHMAC#test_dup

# Running tests:

[Detaching after vfork from child process 363930]
[1/0] OpenSSL::TestHMAC#test_dup = 0.00 s
Finished tests in 0.026546s, 37.6711 tests/s, 150.6844 assertions/s.
1 tests, 4 assertions, 0 failures, 0 errors, 0 skips

ruby -v: ruby 3.3.0dev (2023-01-27T11:07:06Z master 47b66a5acd) [x86_64-linux]
^C
Program received signal SIGINT, Interrupt.
0x00007ffff76c7df6 in __pthread_rwlock_wrlock_full64 (abstime=0x0, clockid=0,
    rwlock=0x850540)
    at /usr/src/debug/glibc-2.34-40.el9_1.1.x86_64/nptl/pthread_rwlock_common.c:941
941	      if ((atomic_load_acquire (&rwlock->__data.__readers)
(gdb) bt
#0  0x00007ffff76c7df6 in __pthread_rwlock_wrlock_full64 (abstime=0x0,
    clockid=0, rwlock=0x850540)
    at /usr/src/debug/glibc-2.34-40.el9_1.1.x86_64/nptl/pthread_rwlock_common.c:941
#1  ___pthread_rwlock_wrlock (rwlock=0x850540) at pthread_rwlock_wrlock.c:27
#2  0x00007ffff3669f5d in CRYPTO_THREAD_write_lock (lock=<optimized out>)
    at crypto/threads_pthread.c:100
#3  0x00007ffff364b747 in evp_keymgmt_util_clear_operation_cache (locking=1,
    pk=pk@entry=0x97e350) at crypto/evp/keymgmt_lib.c:223
#4  evp_pkey_free_it (x=x@entry=0x97e350) at crypto/evp/p_lib.c:1742
#5  0x00007ffff364bf60 in EVP_PKEY_free (x=0x97e350) at crypto/evp/p_lib.c:1768
#6  EVP_PKEY_free (x=0x97e350) at crypto/evp/p_lib.c:1756
#7  0x00007ffff364c252 in EVP_PKEY_CTX_free (ctx=0x861c70)
    at crypto/evp/pmeth_lib.c:427
#8  0x00007ffff36266a9 in EVP_MD_CTX_reset (ctx=0x8fd770)
    at crypto/evp/digest.c:76
#9  EVP_MD_CTX_reset (ctx=ctx@entry=0x8fd770) at crypto/evp/digest.c:65
#10 0x00007ffff36266d2 in EVP_MD_CTX_free (ctx=0x8fd770)
    at crypto/evp/digest.c:123
#11 EVP_MD_CTX_free (ctx=0x8fd770) at crypto/evp/digest.c:118
#12 0x00007ffff39a796d in ossl_hmac_free (ctx=0x8fd770) at ossl_hmac.c:37
#13 0x00007ffff7b10e40 in run_final (objspace=0x4097e0, zombie=140737275023000) at gc.c:4388
#14 0x00007ffff7b10efb in finalize_list (objspace=0x4097e0, zombie=140737275023000) at gc.c:4407
#15 0x00007ffff7b11016 in finalize_deferred_heap_pages (objspace=0x4097e0) at gc.c:4436
#16 0x00007ffff7b114f7 in rb_objspace_call_finalizer (objspace=0x4097e0) at gc.c:4573
#17 0x00007ffff7af8b95 in rb_ec_finalize (ec=0x40a6c0) at eval.c:168
#18 0x00007ffff7af9047 in rb_ec_cleanup (ec=0x40a6c0, ex=RUBY_TAG_RAISE) at eval.c:262
#19 0x00007ffff7af9361 in ruby_run_node (n=0x7ffff7482b38) at eval.c:330
#20 0x00000000004011d9 in rb_main (argc=9, argv=0x7fffffffe638) at ./main.c:38
#21 0x0000000000401236 in main (argc=9, argv=0x7fffffffe638) at ./main.c:57
(gdb)

Updated by jprokop (Jarek Prokop) almost 2 years ago

We have observed similar build failure on RHEL 9.1 system with Ruby 3.2.

I executed a few builds, there were 2 kinds of issues:

  1. tests got stuck as observed initially
  2. tests failed after memory corruption (varying messages) when GC was interacting with OpenSSL memory.
    I observed the following messages in the logs just before build end: "double free or corruption (fasttop)", "corrupted double-linked list", "malloc chunk alignment error", "malloc_consolidate(): unaligned fastbin chunk detected"
    No other info to help debug was present in the logs.

This happened on the OpenSSL 3.0.1 that Jun pointed out is on RHEL 9.1. These failures are no longer present with CentOS 9 Stream (preview for RHEL 9.2) where the openssl is in newer version: openssl-3.0.7. The environment is as follows:

<mock-chroot> sh-5.1$ cat /etc/os-release
NAME="CentOS Stream"
VERSION="9"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="9"
PLATFORM_ID="platform:el9"
PRETTY_NAME="CentOS Stream 9"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:centos:centos:9"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_SUPPORT_PRODUCT_VERSION="CentOS Stream"
<mock-chroot> sh-5.1$ rpm -qa openssl
openssl-3.0.7-2.el9.x86_64
<mock-chroot> sh-5.1$ rpm -qa openssl-devel
openssl-devel-3.0.7-2.el9.x86_64

It is around OpenSSL memory management, as isolated OpenSSL tests fail the same as when you execute the whole test suite.

To support this further, when I execute a build on RHEL 9.1, it either gets stuck or fails with the messages I outlined above. When I execute a build on an RHEL 9.2 equivalent, the build passes.
There was no change in the contents of Ruby sources between these builds.

Therefore I'd be inclined to believe that there was a problem with OpenSSL that was fixed in the rebase.

Updated by rhenium (Kazuki Yamaguchi) almost 2 years ago

I think this is very likely due to https://github.com/openssl/openssl/issues/17261, for which the fix was merged into OpenSSL 3.0.2.

Updated by jaruga (Jun Aruga) almost 2 years ago

I think this is very likely due to https://github.com/openssl/openssl/issues/17261, for which the fix was merged into OpenSSL 3.0.2.

Thanks for sharing the ticket. So, this issue should be fixed with newer version of the openssl in the next version of the RHEL, RHEL 9.2.

Updated by jaruga (Jun Aruga) almost 2 years ago

As a note, CentOS Stream is a downstream Linux distro of the Fedora and the upstream Linux distro of the RHEL. CentOS 9 Stream is the upstream of the RHEL 9. And here is the opessl RPM package's spec (recipe) file. The openssl version is currently 3.0.7 in the CentOS 9 Stream. This should be the openssl version of the next RHEL 9.y.

Updated by vo.x (Vit Ondruch) almost 2 years ago

@jaruga (Jun Aruga) if we could confirm that the patch referenced by @rhenium (Kazuki Yamaguchi) is the culprit, we could ask backporting it into 9.1. Not sure what are the chances though.

Updated by jaruga (Jun Aruga) almost 2 years ago

@jaruga (Jun Aruga) (Jun Aruga) if we could confirm that the patch referenced by @rhenium (Kazuki Yamaguchi) (Kazuki Yamaguchi) is the culprit, we could ask backporting it into 9.1. Not sure what are the chances though.

OK. That makes sense. Let me work for that.

Updated by hsbt (Hiroshi SHIBATA) almost 2 years ago

Thanks for all.

How can we skip this test until RHEL9 fix it?

[hsbt@rhel9 ~]$ rpm -qa openssl
openssl-3.0.1-43.el9_0.x86_64
irb(main):003:0> OpenSSL::OPENSSL_VERSION_NUMBER.to_s(16)
=> "30000010"

Can I skip this test when using OpenSSL 3.0.1?

Updated by jaruga (Jun Aruga) over 1 year ago

RHEL 9.2 was released. The openssl RPM (maybe version 3.0.7. I haven't checked it by myself) on RHEL 9.2 should fix this issue.
https://developers.redhat.com/articles/2023/05/10/how-new-rhel-92-improves-developer-experience

Updated by hsbt (Hiroshi SHIBATA) over 1 year ago

Thanks to track this.

I did upgrade our CI instance.

[hsbt@rhel9 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)

I try to this test with RHEL 9.2 again.

Updated by hsbt (Hiroshi SHIBATA) over 1 year ago

  • Status changed from Open to Closed

I confirmed to resolve test_hmac.rb issue with RHEL 9.2. Thanks.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like1Like1Like0Like0Like0Like0Like0Like0Like0Like0Like0