Bug #19156 (Closed)
ObjectSpace.dump_all segfault during string inspection
Added by mk (Matthias Käppler) almost 2 years ago. Updated almost 2 years ago.
Description
I am working on a feature that would allow our application to capture heap dumps during shutdown for later inspection. These heap dumps are captured via ObjectSpace.dump_all(output: io).
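For reference, the capture is roughly shaped like this (a minimal sketch; the at_exit hook and file path are illustrative, not our exact code):

require 'objspace'

at_exit do
  File.open("/tmp/heap_dump.ndjson", "w") do |io|
    # Streams one JSON document per live object to the given IO.
    ObjectSpace.dump_all(output: io)
  end
end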
While walking the object space, MRI occasionally segfaults while inspecting string objects in search_nonascii of string.c:
/usr/local/lib/ruby/3.0.0/objspace.rb:87: [BUG] Segmentation fault at 0x00007efee4201000
ruby 3.0.4p208 (2022-04-12 revision 3fa771dded) [x86_64-linux]
...
-- Control frame information -----------------------------------------------
c:0053 p:---- s:0312 e:000311 CFUNC :_dump_all
c:0052 p:0130 s:0305 e:000304 METHOD /usr/local/lib/ruby/3.0.0/objspace.rb:87
c:0051 p:0023 s:0295 e:000294 METHOD /home/git/gitlab/lib/gitlab/memory/reports/heap_dump.rb:26
...
-- C level backtrace information -------------------------------------------
/usr/local/lib/libruby.so.3.0(rb_print_backtrace+0x11) [0x7efee4ad0c5e] vm_dump.c:758
/usr/local/lib/libruby.so.3.0(rb_vm_bugreport) vm_dump.c:998
/usr/local/lib/libruby.so.3.0(rb_bug_for_fatal_signal+0xf8) [0x7efee48d0b08] error.c:787
/usr/local/lib/libruby.so.3.0(sigsegv+0x55) [0x7efee4a23db5] signal.c:963
/lib/x86_64-linux-gnu/libpthread.so.0(__restore_rt+0x0) [0x7efee4f12140] ../sysdeps/pthread/funlockfile.c:28
/usr/local/lib/libruby.so.3.0(search_nonascii+0x30) [0x7efee4a3ca60] string.c:552
/usr/local/lib/libruby.so.3.0(coderange_scan) string.c:585
/usr/local/lib/libruby.so.3.0(enc_coderange_scan+0x1b) [0x7efee4a3e28a] string.c:709
/usr/local/lib/libruby.so.3.0(rb_enc_str_coderange) string.c:727
/usr/local/lib/ruby/3.0.0/x86_64-linux/objspace.so(is_broken_string+0x8) [0x7efeced9c304] ../../internal/string.h:116
/usr/local/lib/ruby/3.0.0/x86_64-linux/objspace.so(dump_object) objspace_dump.c:388
/usr/local/lib/ruby/3.0.0/x86_64-linux/objspace.so(heap_i+0x39) [0x7efeced9caa9] objspace_dump.c:521
/usr/local/lib/libruby.so.3.0(objspace_each_objects_without_setup+0xaf) [0x7efee48e878f] gc.c:3232
/usr/local/lib/libruby.so.3.0(objspace_each_objects_protected+0x14) [0x7efee48e87c4] gc.c:3242
/usr/local/lib/libruby.so.3.0(rb_ensure+0x12a) [0x7efee48d96aa] eval.c:1162
/usr/local/lib/libruby.so.3.0(objspace_each_objects+0x28) [0x7efee48fb458] gc.c:3310
/usr/local/lib/libruby.so.3.0(rb_objspace_each_objects) gc.c:3298
/usr/local/lib/ruby/3.0.0/x86_64-linux/objspace.so(objspace_dump_all+0x88) [0x7efeced9b068] objspace_dump.c:616
...
Unfortunately I couldn't get my hands on that memory region to see which strings are causing this since this doesn't always happen.
I suspect this is also a problem with MRI master since the code looks unchanged from 3.0.4.
Updated by byroot (Jean Boussier) almost 2 years ago
I suspect this is also a problem with MRI master since the code looks unchanged from 3.0.4.
Well, it might not be a bug in objspace_dump.c.
Here coderange_scan / search_nonascii end up reading invalid memory regions, which suggests that you have a corrupted string in your heap. Either that String's memory was freed, or it was pointing to an invalid region in the first place.
This might be a bug in Ruby, but I suspect it is a C extension creating an invalid string. It will be very hard to track down without a repro or a core dump, though.
Do you have a core dump? It would be interesting to inspect what this string looks like.
Updated by mk (Matthias Käppler) almost 2 years ago
byroot (Jean Boussier) wrote in #note-2:
Do you have a core dump? It would be interesting to inspect what this string look like.
Yes, I just obtained one. This is a little tricky to debug, however. The object space dump is streamed to disk as ndjson, but the last entry I see written to disk changes randomly so I suspect the heap slot that references the broken string crashes the VM before it even starts writing the object address to the dump.
In other words I have no idea what this string is or where it is located in memory.
I did look at the memory region of the last entry I can actually see in the dump:
{"address":"0x7f584f4a0930", "type":"SYMBOL", "class":"0x7f58791d0410", "frozen":true, "bytesize":11, "value":"GitalyCheck"
Note the missing closing brace: dump_all crashed before finalizing this entry. I am not sure whether this is a red herring, though; this is a symbol, not a string? And as mentioned above, it does not always crash here. In a repeat run, the last entry was an iseq instead.
Anyway, I tried inspecting the memory around this address to see if anything stands out, but nothing does, to me at least:
(gdb) dump memory /tmp/worker_mem 0x7f584f4a0930 0x7f584f4a1000
xxd -l 128 /tmp/worker_mem
00000000: 7408 8000 0000 0000 1004 1d79 587f 0000 t..........yX...
00000010: a6fa 64f8 dfb2 23c9 5809 4a4f 587f 0000 ..d...#.X.JOX...
00000020: 4a6d 2500 0000 0000 65c8 9220 0000 0000 Jm%.....e.. ....
00000030: 2828 1d79 587f 0000 4769 7461 6c79 4368 ((.yX...GitalyCh
00000040: 6563 6b00 0072 6200 0000 0000 0000 0000 eck..rb.........
00000050: 0800 0000 0000 0000 f881 1b79 587f 0000 ...........yX...
00000060: 0000 0000 0000 0000 0800 0000 0000 0000 ................
00000070: 0000 0000 0000 0000 0120 0000 0000 0000 ......... ......
I'm not sure that's helpful. Is there anything else I could do?
Updated by byroot (Jean Boussier) almost 2 years ago
I suspect the heap slot that references the broken string crashes the VM before it even starts writing the object address to the dump.
Yes, there is some buffering, so the last written objects might be entirely different. Hence why I'm suggesting looking at the core dump with gdb or lldb to see what that string looks like.
Is there anything else I could do?
Ideally you'd use gdb to go to the dump_object frame (objspace_dump.c:388) and inspect what VALUE obj looks like.
is_broken_string is only called for T_STRING, so we know for sure it's either a String or a subclass of it, certainly not a Symbol.
Updated by byroot (Jean Boussier) almost 2 years ago
Also, based on the backtrace I believe that ObjectSpace.each_object(String, &:valid_encoding?) should cause the same crash.
Updated by mk (Matthias Käppler) almost 2 years ago
byroot (Jean Boussier) wrote in #note-4:
Ideally you'd use gdb to go to the dump_object frame (objspace_dump.c:388) and inspect what VALUE obj looks like.
Remind me how to do that? I tried the following, but it segfaults too:
(gdb) break _exit
Breakpoint 1 at 0x7fe712b74f70: file ../sysdeps/unix/sysv/linux/_exit.c, line 27.
(gdb) continue
Continuing.
Thread 8 "background_task" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fe6e5bfd700 (LWP 227)]
0x00007fe71303ca60 in search_nonascii (e=0x7fe7131f4000 "X\016", p=0x7fe7131f3000 <error: Cannot access memory at address 0x7fe7131f3000>) at string.c:552
552 string.c: No such file or directory.
(gdb) f 5
#5 dump_object (obj=140629680052600, dc=0x7fe6e5af9be0) at objspace_dump.c:388
388 objspace_dump.c: No such file or directory.
(gdb) p rb_p(140629680052600)
Thread 8 "background_task" received signal SIGSEGV, Segmentation fault.
onigenc_single_byte_mbc_to_code (p=0x7fe7131f3000 <error: Cannot access memory at address 0x7fe7131f3000>, end=0x7fe7131f4000 "X\016", enc=0x7fe711c54000) at regenc.c:624
The VALUE obj here would just hold a pointer to a GC slot, correct? So is 140629680052600 the address we are looking for, or do I have to convert it somehow first? (I remember MRI doing funny things like embedding ints in VALUEs and using a tag bit to identify whether it is actually a heap address.) Or is this just the same problem, i.e. we have a VALUE that points to an invalid memory address, maybe something that was already GC'ed but dump_all did not know about it?
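(For reference, a quick way to compare that decimal VALUE against the addresses in the crash output, assuming heap-allocated objects really are represented directly by their slot address:)

# Prints the decimal VALUE from gdb as a zero-padded hex address,
# e.g. "0x00007fe6e61fd978", for comparison with the [BUG] output.
puts format("%#018x", 140629680052600)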
Updated by mk (Matthias Käppler) almost 2 years ago
Oh yeah it's actually telling me it's the same address from the original segfault.
So would we conclude from this that ObjectSpace.dump_all is trying to serialize objects that have already been GC'ed?
Updated by byroot (Jean Boussier) almost 2 years ago
I tried the following, but it segfaults too:
Yeah, if the RString points to an invalid memory region, I think rb_p will just segfault like you saw.
I'm not very good at GDB, but I think instead you want to print the contents of the RString struct. I think you can try p *140629680052600 or p *(RString *)140629680052600 (again, I'm not good at gdb, this might not work).
maybe something that was already GC'ed but dump_all did not know about it?
I highly doubt it. My bet is a C extension creating a string using str_new_static, i.e. it creates an RString struct pointing to a memory region not owned by Ruby, and then either frees that region too soon, or points to an invalid region right from the start.
The hard part now is to figure out where this string is allocated. ObjectSpace.trace_object_allocations_start can help here.
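For example (a minimal sketch; getting a handle on the suspect object is the hard part):

require 'objspace'

# Enable as early as possible (e.g. during boot) so allocation sites get recorded.
ObjectSpace.trace_object_allocations_start

obj = "some suspect string"                  # stand-in for the object under investigation
p ObjectSpace.allocation_sourcefile(obj)     # => file where it was allocated, or nil
p ObjectSpace.allocation_sourceline(obj)     # => line number, or nil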
Updated by byroot (Jean Boussier) almost 2 years ago
Also, assuming my str_new_static theory is correct, you could run grep -IR str_new_static $(bundle show --paths) and see if anything stands out.
Updated by mk (Matthias Käppler) almost 2 years ago
byroot (Jean Boussier) wrote in #note-5:
Also, based on the backtrace I believe that ObjectSpace.each_object(String, &:valid_encoding?) should cause the same crash.
Indeed!
bundle exec rbtrace -p $(pgrep -f 'worker 1') -e 'ObjectSpace.each_object(String, &:valid_encoding?)'
*** run `sudo sysctl kernel.msgmnb=1048576` to prevent losing events (currently: 16384 bytes)
*** attached to process 214
*** timed out waiting for eval response
*** could not detach cleanly from process 214
(eval):1: [BUG] Segmentation fault at 0x00007fea85bf5000
ruby 3.0.4p208 (2022-04-12 revision 3fa771dded) [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0006 p:---- s:0025 e:000024 CFUNC :valid_encoding?
c:0005 p:---- s:0022 e:000021 CFUNC :each_object
c:0004 p:0023 s:0017 e:000016 EVAL (eval):1 [FINISH]
c:0003 p:---- s:0014 e:000013 CFUNC :eval
c:0002 p:0022 s:0009 e:000005 BLOCK eval:6 [FINISH]
c:0001 p:---- s:0003 e:000002 (none) [FINISH]
-- Ruby level backtrace information ----------------------------------------
eval:6:in `block in eval_and_inspect'
eval:6:in `eval'
(eval):1:in `eval_context'
(eval):1:in `each_object'
(eval):1:in `valid_encoding?'
-- Machine register context ------------------------------------------------
RIP: 0x00007fea85a3ca60 RBP: 0x00007fea576f2720 RSP: 0x00007fea576f26e0
RAX: 0x8080808080808080 RBX: 0x00007fea84654000 RCX: 0x00007fea85bf5ff9
RDX: 0x8080808080808080 RDI: 0x00007fea84654000 RSI: 0x0000000000001000
R8: 0x0000000000000000 R9: 0x0000000000000000 R10: 0x0000000000000000
R11: 0x0000000000000000 R12: 0x00007fea85bf5000 R13: 0x0000000000001000
R14: 0x00007fea85bf6000 R15: 0x00007fea845d06d0 EFL: 0x0000000000010293
-- C level backtrace information -------------------------------------------
/usr/local/lib/libruby.so.3.0(rb_print_backtrace+0x11) [0x7fea85ad0c5e] vm_dump.c:758
/usr/local/lib/libruby.so.3.0(rb_vm_bugreport) vm_dump.c:998
/usr/local/lib/libruby.so.3.0(rb_bug_for_fatal_signal+0xf8) [0x7fea858d0b08] error.c:787
/usr/local/lib/libruby.so.3.0(sigsegv+0x55) [0x7fea85a23db5] signal.c:963
/lib/x86_64-linux-gnu/libpthread.so.0(__restore_rt+0x0) [0x7fea86041140] ../sysdeps/pthread/funlockfile.c:28
/usr/local/lib/libruby.so.3.0(search_nonascii+0x30) [0x7fea85a3ca60] string.c:552
/usr/local/lib/libruby.so.3.0(coderange_scan) string.c:585
/usr/local/lib/libruby.so.3.0(enc_coderange_scan+0x1b) [0x7fea85a3e28a] string.c:709
/usr/local/lib/libruby.so.3.0(rb_enc_str_coderange) string.c:727
/usr/local/lib/libruby.so.3.0(rb_str_valid_encoding_p+0x9) [0x7fea85a3ef99] string.c:10474
The hard part now is to figure out where this string is allocated. ObjectSpace.trace_object_allocations_start can help here.
Also, assuming my str_new_static theory is correct, you could look at: grep -IR str_new_static $(bundle show --paths) see if anything stands out.
Thank you, these are good ideas, I will have a look.
Updated by byroot (Jean Boussier) almost 2 years ago
str_new_static sets a specific flag on the string:
RBASIC(str)->flags |= STR_NOFREE;
So from GDB you should be able to confirm or rule out whether that string was created through str_new_static.
Updated by mk (Matthias Käppler) almost 2 years ago
It looks like I can't do anything useful with that address:
(gdb) p *0x7fced23f4000
Cannot access memory at address 0x7fced23f4000
This is getting mysterious. I looked at the process maps to see what kind of memory region this is, and it looks like it's a memory-mapped file from prometheus-client-mmap:
pmap -x 216
...
00007fced23f3000 4 4 0 rw-s- gauge_max_puma_0-0.db
00007fced23f5000 4 4 0 r---- wait.so
...
I also cannot dump this region, but I can print the value for the object starting at 0x...3000:
(gdb) dump memory /tmp/memdmp 0x7fced23f3000 0x7fced23f5000
Cannot access memory at address 0x7fced23f4000
but:
(gdb) p *0x7fced23f3000
$1 = 104
That doesn't tell me anything by itself, but I know this library uses a binary format to serialize metric samples, converting Ruby hashes into JSON strings in a C extension. The only Ruby strings in here are metric names and labels, IIRC.
I will investigate more in this direction. It does indeed sound like this is not strictly a bug in MRI, though I wonder if it should be more resilient to these kinds of rogue objects and just filter them out?
Updated by mk (Matthias Käppler) almost 2 years ago
I'm not sure this makes sense but: we know the address being referenced is somewhere (half-way) into a memory region that is a memory-mapped file using that Prometheus library's binary encoding.
This is the file:
xxd gauge_max_puma_0-0.db
00000000: 6800 0000 0000 0000 5100 0000 5b22 6465 h.......Q...["de
00000010: 706c 6f79 6d65 6e74 7322 2c22 6465 706c ployments","depl
00000020: 6f79 6d65 6e74 7322 2c5b 2272 6576 6973 oyments",["revis
00000030: 696f 6e22 2c22 7665 7273 696f 6e22 5d2c ion","version"],
00000040: 5b22 6335 3365 6336 3336 6435 3022 2c22 ["c53ec636d50","
00000050: 3135 2e37 2e30 2d70 7265 225d 5d20 2020 15.7.0-pre"]]
00000060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
...
00000ff0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
We can see there is only a single sample in here, the application version (we track this into a metric).
The library zeroes these files out with a minimum size of 4KB. So somehow there is an object slot in MRI's ObjectSpace that shares an address with this empty mmap'ed space? That doesn't make any sense. Of course it would explain why it doesn't pass the test "are you a valid string", but how did we even get here to begin with?
prometheus-client-mmap also doesn't use str_new_static, but it does allocate Ruby strings in its C extension, e.g. here: https://gitlab.com/gitlab-org/prometheus-client-mmap/-/blob/b4eae83927efd5132b9b2726f21d9a3ea49e9336/ext/fast_mmaped_file/fast_mmaped_file.c#L81
I'm not sure at this point whether this is just another red herring.
Updated by byroot (Jean Boussier) almost 2 years ago
- Status changed from Open to Third Party's Issue
It indeed does sound like it is not strictly a bug with MRI
I'll mark the ticket as 3rd party issue. I'd suggest reporting this to that project.
I wonder if it should be more resilient to these kind of rogue objects and just filter them out?
I don't think it's possible. Ruby would need to somehow check if the memory region the string points to is still valid.
So somehow there is an object slot in MRI's ObjectSpace that shares an address with this empty mmap'ed space?
Not exactly. In Ruby, string objects contain a char * pointer to an address in memory, and you can point it at whatever you want. In this case it's made to point to an mmap'ed region.
I'll have a quick look at that gem to see if I can find the source, but the fix might not be simple.
Updated by byroot (Jean Boussier) almost 2 years ago
Ok, I found the function that causes this bug:
static VALUE mm_str(VALUE obj, int modify) {
mm_ipc *i_mm;
VALUE ret = Qnil;
GET_MMAP(obj, i_mm, modify & ~MM_ORIGIN);
if (modify & MM_MODIFY) {
if (i_mm->t->flag & MM_FROZEN) rb_error_frozen("mmap");
}
ret = rb_obj_alloc(rb_cString);
RSTRING(ret)->as.heap.ptr = i_mm->t->addr;
RSTRING(ret)->as.heap.aux.capa = i_mm->t->len;
RSTRING(ret)->as.heap.len = i_mm->t->real;
DEBUGF("RString capa: %d, len: %d", RSTRING(ret)->as.heap.aux.capa, RSTRING(ret)->as.heap.len);
if (modify & MM_ORIGIN) {
#if HAVE_RB_DEFINE_ALLOC_FUNC
RSTRING(ret)->as.heap.aux.shared = obj;
FL_SET(ret, RSTRING_NOEMBED);
FL_SET(ret, FL_USER18);
#else
RSTRING(ret)->orig = ret;
#endif
}
if (i_mm->t->flag & MM_FROZEN) {
ret = rb_obj_freeze(ret);
}
return ret;
}
It's even worse than using str_new_static: it's creating a regular string and rewriting the pointer. This is terrible code.
Updated by byroot (Jean Boussier) almost 2 years ago
I can't be bothered to create an account to submit a PR, but I suspect the following diff should fix your bug:
diff --git a/ext/fast_mmaped_file/mmap.c b/ext/fast_mmaped_file/mmap.c
index 213281c..a3f646f 100644
--- a/ext/fast_mmaped_file/mmap.c
+++ b/ext/fast_mmaped_file/mmap.c
@@ -44,6 +44,8 @@ static VALUE mm_str(VALUE obj, int modify) {
if (i_mm->t->flag & MM_FROZEN) rb_error_frozen("mmap");
}
ret = rb_obj_alloc(rb_cString);
+ rb_ivar_set(ret, obj);
+
RSTRING(ret)->as.heap.ptr = i_mm->t->addr;
RSTRING(ret)->as.heap.aux.capa = i_mm->t->len;
RSTRING(ret)->as.heap.len = i_mm->t->real;
TL;DR: you store the object that holds the mmap'ed region as an instance variable on the String, ensuring it's not garbage-collected until the string is too.
It's quite dirty, but at this point...
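A Ruby-level illustration of the same idea (the objects are made up, just to show the GC anchoring):

backing = "pretend this object owns the mmap'ed region"
view    = "pretend this string's data lives inside that region"
# Holding a reference on the view pins the backing object: GC cannot collect
# the backing object while the view (and therefore its ivar) is still reachable.
view.instance_variable_set(:@mmap, backing)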
Updated by mk (Matthias Käppler) almost 2 years ago
byroot (Jean Boussier) wrote in #note-15:
It's even worse than using str_new_static: it's creating a regular string and rewriting the pointer. This is terrible code.
Yikes -- good catch! This library isn't particularly active or well maintained, unfortunately. We've been meaning to move away from it for a long time now and use the official prometheus-ruby client, but it's no easy task at this point.
This may just be another reason to accelerate that process, to be honest.
Updated by byroot (Jean Boussier) almost 2 years ago
Sorry, I meant:
rb_ivar_set(ret, rb_intern("mmap"), obj);
Updated by mk (Matthias Käppler) almost 2 years ago
Thank you, this helps a lot! I will test the patch you suggest and see if that addresses the immediate problem.
If it does, I can take a stab at putting a PR together. I would credit you of course. Really appreciate your time spent on this.
Updated by hsbt (Hiroshi SHIBATA) almost 2 years ago
- Related to Bug #19159: ObjectSpace segfaults added
Updated by mk (Matthias Käppler) almost 2 years ago
@byroot (Jean Boussier) I wonder if you could help me understand the underlying issue better. I found a minimal, executable test case that reproduces this issue reliably:
Prometheus::Client::MmapedValue.new(:counter, :counter, 'ordered_counter', { label_1: 'x' * 1024**2 })
# This will crash
ObjectSpace.each_object(String, &:valid_encoding?)
I am trying to turn this into an automated test to make sure we won't regress on this again. This test does not crash if the metric label is relatively short, a few characters perhaps. It does crash once the label string grows above a certain size, however.
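The rough shape of the test I have in mind (a sketch; it runs the crashing sequence in a child process, so a segfault surfaces as a failed exit status instead of taking down the test suite):

require 'rbconfig'

script = <<~RUBY
  require 'prometheus/client'
  Prometheus::Client::MmapedValue.new(:counter, :counter, 'ordered_counter', { label_1: 'x' * 1024**2 })
  ObjectSpace.each_object(String, &:valid_encoding?)
RUBY

ok = system(RbConfig.ruby, "-e", script)
raise "regression: child Ruby crashed while scanning strings" unless ok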
Here is what I don't understand yet:
- prometheus-client-mmap calls into new_str0, which, when the string is large enough, will be malloc'ed by MRI, correct? I had to make the string much larger than sizeof(RVALUE) for the crash to occur. Aren't strings malloc'ed as soon as they no longer fit into a heap slot?
- The library proceeds to overwrite the internal pointer to the malloc'ed memory region and lets it point to the mapped file memory instead. But how come MRI does not notice this when accessing the string memory through a function like valid_encoding? Shouldn't that result in traversing the same pointer, now pointing to the string data in the memory map? From MRI's perspective, why does it matter where the actual string data resides?
I also don't think it's because of the MmapedValue object being GC'ed; I ran this example under GC.stress and even nulled out the parent object creating this string, but it never crashed in response to that. In fact, I can completely disable GC in this test case and it will still crash, which leads me to think this is not related to the mmap memory being freed before the Ruby string is, or vice versa?
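Roughly what I tried (a sketch; it requires the gem the same way as the test case above):

require 'prometheus/client'

GC.stress = true
value = Prometheus::Client::MmapedValue.new(:counter, :counter, 'ordered_counter', { label_1: 'x' * 1024**2 })
value = nil
GC.start
GC.stress = false

# Still crashes here; a separate run with GC disabled entirely (GC.disable) crashes as well.
ObjectSpace.each_object(String, &:valid_encoding?)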
Updated by byroot (Jean Boussier) almost 2 years ago
calls into new_str0, which when the string is large enough, will be malloc'ed by MRI, correct?
No. It calls rb_str_new("", 0), which means it's always an empty String, so it only allocates an object slot.
It then never grows that string, but makes it point to the underlying mmap'ed region.
From MRIs perspective, why does it matter where the actual string data resides?
It doesn't. That is why it works generally, until the mmaped region is freed.
I also don't think it's because of the MmapValue object being GC'ed;
That was a guess; it might just free that region without being GC'ed, I'm not sure.
What is certain, though, is that at some point it frees that region while the String objects it created still point to it, and it does nothing about it.
Updated by byroot (Jean Boussier) almost 2 years ago
Here's the repro:
require 'prometheus'
require 'prometheus/client'
File.write("/tmp/mmap.txt", "a" * 100)
file = FastMmapedFile.new("/tmp/mmap.txt")
str = file.slice(0, 100)
p str
file.munmap
p str
So yeah it's not necessarily GC, but it's indeed when the FastMmapedFile is unmapped that all the strings created through it become broken.
I can't really think of any reliable way to make this work.
Updated by byroot (Jean Boussier) almost 2 years ago
And to confirm, FastMmapedFile does unmap when it's GC'ed too:
static void mm_free(mm_ipc *i_mm) {
if (i_mm->t->path) {
if (munmap(i_mm->t->addr, i_mm->t->len) != 0) {
So yeah, can happen either way.
Updated by mk (Matthias Käppler) almost 2 years ago
Thanks for sharing this; I still don't think this is related to the memory map going away. Not exclusively anyway. My original test case does not close or unmap the counter file, but it still crashes. The counter is still in scope when we walk the object space, so it hasn't been GC'ed either. I also verified that mm_unmap is not called in this case.
So there is something else at play where, even during the lifetime of the metric file, i.e. while it still resides in memory, string memory gets corrupted. This also means that the originally proposed fix (keeping a reference to the memory-mapped object on the string itself) is not sufficient.
I will move this discussion to the library issue tracker; thanks for helping out!
Updated by mk (Matthias Käppler) almost 2 years ago
I hope I summarized it correctly: https://gitlab.com/gitlab-org/ruby/gems/prometheus-client-mmap/-/issues/39
I'm not sure whether your test case and mine crash for precisely the same reason. If they do, then I am missing how unmapping fits into a case where no memory is (or should be) released at all.