Bug #19041
closed
Weakref is still alive after major garbage collection
Description
I am able to get into an infinite loop waiting for garbage collection to take a WeakRef.
Reproduction Process
The following script prints a "0", then a "1", and then hangs forever. I expect it to keep printing numbers.
require "weakref"
iterations = 0
loop do
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  GC.start while obj.weakref_alive?
  iterations += 1
end
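To rule out the possibility that the collection is simply not a major one, the check below may help. It is only a sketch: the helper name `gc_and_report` is made up for illustration, and it relies on `GC.stat(:major_gc_count)` increasing when `GC.start` runs with its default arguments.

```ruby
require "weakref"

# Hypothetical helper: run one GC.start and report how many major
# collections it triggered and whether the WeakRef target survived.
def gc_and_report(ref)
  before = GC.stat(:major_gc_count)
  GC.start # full_mark: true and immediate_sweep: true by default
  after = GC.stat(:major_gc_count)
  # weakref_alive? can return nil rather than false, so normalize it.
  { major_runs: after - before, alive: !!ref.weakref_alive? }
end

ref = WeakRef.new(Object.new)
report = gc_and_report(ref)
puts "major GC runs: #{report[:major_runs]}, still alive: #{report[:alive]}"
```

If `major_runs` is at least 1 and `alive` is still true, the object survived a major collection, which matches the behavior described here.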
Ruby Version
I have tested this on Ruby 3.1.2, 3.1.0, 3.0.4, 3.0.0, 2.7.6, and 2.7.0 on macOS. All exhibit this behavior.
Further Investigation
Sleeping
Sleeping before the garbage collection allows the loop to continue. The below exhibits the expected behavior:
require "weakref"
iterations = 0
loop do
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  (sleep(0.5); GC.start) while obj.weakref_alive?
  iterations += 1
end
However, sleeping after the garbage collection still shows the buggy behavior (loop hangs):
require "weakref"
iterations = 0
loop do
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  (GC.start; sleep(0.5)) while obj.weakref_alive?
  iterations += 1
end
Running Garbage Collection Multiple Times
Explicitly running garbage collection multiple times allows the loop to continue. This has the expected behavior, and more numbers continue to be printed:
require "weakref"
iterations = 0
loop do
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  while obj.weakref_alive?
    GC.start
    GC.start
    GC.start
  end
  iterations += 1
end
However, with certain rubies, running those garbage collection calls in a times block prevents even a single iteration from completing. The following prints only "0" with ruby 3.0.4 on macOS, ruby 2.7.6 on macOS, and ruby 3.1.2 on linux (ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux] on a virtual machine). It shows the expected behavior on ruby 3.1.2 on macOS.
require "weakref"
iterations = 0
loop do
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  3.times { GC.start } while obj.weakref_alive?
  iterations += 1
end
Files
Updated by byroot (Jean Boussier) about 2 years ago
I don't think this is a bug per se. The Ruby GC is conservative. That means it goes over the whole stack in search of potential references to objects, and marks them.
As a result, it can happen that an object ref stays in an unused saved register and prevents an object from being collected.
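One heuristic that sometimes helps with stale references like this is sketched below. It is an assumption-laden illustration, not a guarantee: `make_weakref` and `scrub_stack` are hypothetical helper names, and a conservative GC may still find a leftover copy of the pointer in a register or on the stack.

```ruby
require "weakref"

# Hypothetical helper: build the WeakRef in its own method so that the
# temporary strong reference lives in a frame that has already returned.
def make_weakref
  WeakRef.new(Object.new)
end

# Hypothetical helper: recurse a little so later calls reuse (and likely
# overwrite) the stack area used during allocation. Heuristic only.
def scrub_stack(depth = 64)
  depth.zero? ? 0 : 1 + scrub_stack(depth - 1)
end

ref = make_weakref
scrub_stack
GC.start
puts(ref.weakref_alive? ? "still alive (allowed)" : "collected")
```

Either output is acceptable here. The point is only that what ran before `GC.start` can change what the GC finds.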
Updated by jeremyevans0 (Jeremy Evans) about 2 years ago
- Status changed from Open to Closed
Updated by parker (Parker Finch) about 2 years ago
Thanks @byroot (Jean Boussier)! I think this could be considered a bug in the documentation, since the docs for WeakRef imply that a WeakRef should be collected after a garbage collection. Perhaps we could call this corner-case out?
I'm also curious to learn more about this case. (I'm unfamiliar with Ruby's use of registers and how that interacts with live objects and garbage collection.) It seems like calling the weakref_alive? method is continually forcing the object ref into a register, and sleeping after calling that method gives time for the register to clear. Is that understanding correct? (I'm surprised that calling a method on the WeakRef object prevents the underlying object from being collected, since shouldn't that underlying one be collected even though the WeakRef itself still has a reference? Does the method call put the underlying object ref in a register?)
Is there a more reliable/direct way to get rid of the reference than sleeping?
One aspect of this where I'm still confused is why the loop given to reproduce this issue completes an iteration before hanging. What is different on the first iteration that allows this to succeed?
Updated by chrisseaton (Chris Seaton) about 2 years ago
The documentation could be clearer, but also note that this isn't in any way specific to Ruby - I would say that this is expected behaviour for a managed language. A weak-ref may be cleared if no other references exist. That should be the extent of the guarantee offered.
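Given that weak guarantee, calling code usually has to handle both outcomes. A minimal sketch, assuming a hypothetical helper named `with_target`: it takes a strong reference through `__getobj__` for the duration of the block and treats `WeakRef::RefError` as "already collected".

```ruby
require "weakref"

# Hypothetical helper: yield the target if it is still alive, or return
# nil when the target has already been collected.
def with_target(ref)
  yield ref.__getobj__
rescue WeakRef::RefError
  nil
end

strong = "cached value"
ref = WeakRef.new(strong)
p with_target(ref) { |s| s.upcase } # prints "CACHED VALUE"
```

Fetching the target once avoids the gap between checking `weakref_alive?` and using the object, during which a collection could otherwise happen.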
Updated by tenderlovemaking (Aaron Patterson) about 2 years ago
parker (Parker Finch) wrote in #note-3:
Thanks @byroot (Jean Boussier)! I think this could be considered a bug in the documentation, since the docs for WeakRef imply that a WeakRef should be collected after a garbage collection. Perhaps we could call this corner-case out?
I'm also curious to learn more about this case. (I'm unfamiliar with Ruby's use of registers and how that interacts with live objects and garbage collection.)
Ruby's garbage collector is conservative. Ruby objects that are allocated inside of C code must be kept alive. Let's look at a simple example:
void neat_function(void) {
    VALUE list = rb_ary_new();
    rb_gc_start();
    rb_ary_push(list, Qnil);
}
The above C code is compiled into machine code, but the array's life span is managed by the garbage collector. How can the garbage collector ensure that the array stays alive even after the call to rb_gc_start()? We humans can clearly see that the array is used in the C code, but the GC cannot read the C code. In fact there is no C code for the GC because it's all machine code now! So how can the GC keep the reference alive? It will scan the machine registers as well as the stack memory looking for addresses that might be Ruby objects. The C compiler will probably have generated machine code that puts a reference to the local variable list in either a register or stack memory (there are cases where this doesn't happen, and we have to deal with it manually. See RB_GC_GUARD).
The GC will look at the values stored in the machine registers, as well as any values in stack memory, then check if those values are within the bounds of Ruby's GC heap memory. If the address is inside the bounds, then the GC will consider the object to be alive. The GC cannot know if a pointer stored in a machine register will ever be used again, so it takes a conservative approach and keeps the reference alive.
This conservative approach can lead to the behavior that you are seeing with the weak reference: a value that nobody is actually using or referencing is kept alive because the GC can't know that fact for sure. The reference may or may not stay alive, but it depends on what machine code has executed, if the value is in the stack, if any registers have been overwritten, etc.
I hope this helps.
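Because collection cannot be guaranteed, a loop like the one in the report is safer with an upper bound. The following is only a sketch; `collected_within` and its `max_attempts` parameter are hypothetical names chosen for illustration.

```ruby
require "weakref"

# Hypothetical helper: run up to max_attempts full collections. Returns the
# number of GC runs performed before the target was seen as collected, or
# nil if it was still alive after the last attempt.
def collected_within(ref, max_attempts: 10)
  max_attempts.times do |runs_so_far|
    return runs_so_far unless ref.weakref_alive?
    GC.start
  end
  ref.weakref_alive? ? nil : max_attempts
end

ref = WeakRef.new(Object.new)
attempts = collected_within(ref)
puts attempts ? "collected after #{attempts} GC run(s)" : "still alive; giving up"
```

A `nil` result is not an error under the conservative model described above. It only means some register or stack slot still looked like a reference.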
Updated by parker (Parker Finch) about 2 years ago
Thanks for that explanation @tenderlovemaking (Aaron Patterson), it helps and I truly appreciate it!
One misunderstanding I had was that I was thinking about this in terms of the Ruby VM. But it seems like garbage collection actually occurs down at the machine level (which makes much more sense now that I think about it) and that's why we're dealing with registers. (And the stack we're talking about is the C stack and not the Ruby VM stack.)
The recommendation to take a look at RB_GC_GUARD was helpful as well, that's a great comment there.
I'm still curious why calling #weakref_alive? on the WeakRef seems to put the underlying Object (that the WeakRef delegates to) in a register or on the stack. But the fact that this is happening so close to the actual machine makes it seem like it would be tricky to figure out.
Anyway, I'll keep learning more about how memory management works, thank you for the info here! I think the docs are fine as-is, so it makes sense to me to close this one.
Thank you all for your time and explanations!
Updated by tenderlovemaking (Aaron Patterson) about 2 years ago
parker (Parker Finch) wrote in #note-6:
I'm still curious why calling #weakref_alive? on the WeakRef seems to put the underlying Object (that the WeakRef delegates to) in a register or on the stack. But the fact that this is happening so close to the actual machine makes it seem like it would be tricky to figure out.
That method may not be putting the object in a register. Something else may have put it in a register or in the stack, and it just happens that no other machine code has overwritten the register or stack memory. If you dump the heap (ObjectSpace.dump_all), you'll probably see one of the roots (probably VM?) pointing at the object. Unfortunately the heap dump won't tell you how it found the reference, just that the reference exists. You could find whether it's a register or stack memory by adding some debugging code to the GC or by tracing the machine code via lldb.
It might be nice if ObjectSpace.dump_all could indicate whether the reference came from the stack or machine registers as I've also tried to figure that out. But it is work. 😅
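A rough sketch of how such a heap-dump search might look is below. The helper names `heap_address` and `referrers_of` are hypothetical, and the sketch assumes each dump line is a JSON object whose `"references"` array lists the addresses it points at.

```ruby
require "json"
require "objspace"

# Hypothetical helper: read an object's heap address from its own dump entry.
def heap_address(obj)
  JSON.parse(ObjectSpace.dump(obj)).fetch("address")
end

# Hypothetical helper: return the parsed dump entries whose "references"
# list contains the given address. Root entries carry a "root" key, while
# ordinary objects carry an "address" key.
def referrers_of(address)
  needle = %("#{address}")
  ObjectSpace.dump_all(output: :string).each_line.filter_map do |line|
    next unless line.include?(needle)
    entry = begin
      JSON.parse(line)
    rescue JSON::ParserError, EncodingError
      nil # skip lines that cannot be parsed
    end
    entry if entry && Array(entry["references"]).include?(address)
  end
end

target = Object.new
holder = [target]
referrers_of(heap_address(target)).each do |entry|
  puts entry["root"] ? "root: #{entry["root"]}" : "#{entry["type"]} at #{entry["address"]}"
end
```

Here the `holder` array should show up as a referrer. For an object kept alive only by a stale register or stack value, the interesting entries would be the root lines, if any appear.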
Updated by parker (Parker Finch) about 2 years ago
tenderlovemaking (Aaron Patterson) wrote in #note-7:
That method may not be putting the object in a register. Something else may have put it in a register or in the stack, and it just happens that no other machine code has overwritten the register or stack memory.
There's some evidence that the weakref_alive? method is putting it in a register or the stack. Running garbage collection immediately after calling weakref_alive? will fail to collect the underlying object. But if there's a sleep between the weakref_alive? and running garbage collection then the garbage collection will succeed in collecting the underlying object.
To test if it was the weakref_alive? call itself that was causing the issue I ran a few different scenarios:
# This version does not manifest the issue. (It makes it through two iterations
# and terminates.)
require "weakref"
iterations = 0
while iterations < 2
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  while obj.weakref_alive?
    # Sleep to give registers a chance to clear.
    sleep(0.5)
    GC.start
  end
  iterations += 1
end
# This version does manifest the issue. (It gets stuck in the inner loop and
# never terminates.)
require "weakref"
iterations = 0
while iterations < 2
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  while obj.weakref_alive?
    # Sleep to give registers a chance to clear.
    sleep(0.5)
    # Call the `WeakRef#weakref_alive?` method to see if that causes the issue
    # to manifest. (It does, GC does _not_ clear out the underlying Object after
    # this.)
    obj.weakref_alive?
    GC.start
  end
  iterations += 1
end
# This version does not manifest the issue. (It makes it through two iterations
# and terminates.)
require "weakref"
iterations = 0
while iterations < 2
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  while obj.weakref_alive?
    # Sleep to give registers a chance to clear.
    sleep(0.5)
    # Reference the WeakRef object to see if that causes the issue to
    # manifest. (It does not, GC still clears out the underlying Object here.)
    obj
    GC.start
  end
  iterations += 1
end
# This version does not manifest the issue. (It makes it through two iterations
# and terminates.)
require "weakref"
iterations = 0
while iterations < 2
  print "\r#{iterations}"
  obj = WeakRef.new(Object.new)
  while obj.weakref_alive?
    # Sleep to give registers a chance to clear.
    sleep(0.5)
    # Call another method on the WeakRef object to see if that causes the issue
    # to manifest. (It does not, GC still clears out the underlying Object
    # here.)
    obj.object_id
    GC.start
  end
  iterations += 1
end
Sorry for the wall of code there. The summary is that the issue only seems to manifest when the weakref_alive? method is called immediately before garbage collecting.
The fact that the behavior is predictable in those different scenarios makes me think that the weakref_alive? method is doing something that adds a reference to the underlying Object to a register or the stack. Is there another explanation for the behavior there that I'm missing?
If you dump the heap (ObjectSpace.dump_all), you'll probably see one of the roots (probably VM?) pointing at the object. Unfortunately the heap dump won't tell you how it found the reference, just that the reference exists. You could find whether it's a register or stack memory by adding some debugging code to the GC or by tracing the machine code via lldb.
Thanks @tenderlovemaking (Aaron Patterson)! I didn't know about ObjectSpace.dump_all. I'll try exploring those options to see if I can pin down how it's finding the reference to the Object. Heads up that it will likely take me a while since I'm not yet familiar with C and lldb.
Updated by parker (Parker Finch) almost 2 years ago
Hi @tenderlovemaking (Aaron Patterson)! I'm having difficulty interpreting the results of the ObjectSpace dump and I'm hoping you can help.
I've adjusted the script to print out the address of the underlying object, and then (when the issue manifests) print all lines from ObjectSpace.dump_all that match that address. The code is attached, here's some example output:
Ruby version: 3.3.0
Iteration: 0
Object address: 0x1051cd788
Inner iterations: 1
Iteration: 1
Object address: 0x105205ae8
Inner iterations: 1
Inner iterations: 2
Inner iterations: 3
{"address":"0x105205ae8", "type":"OBJECT", "shape_id":5, "slot_size":40, "class":"0x1029bfe80", "embedded":true, "ivars":0, "memsize":40, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}
{"address":"0x10520da90", "type":"STRING", "shape_id":0, "slot_size":40, "class":"0x1029beda0", "embedded":true, "bytesize":11, "value":"0x105205ae8", "encoding":"UTF-8", "coderange":"7bit", "memsize":40, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}
In that example, the underlying object was at 0x105205ae8. But as far as I can tell, there's nothing else that points at it. (The other object there is the String used to hold that address.) I would have expected that, if nothing was referencing it, it would be collected by GC.
One interesting tidbit is that just calling ObjectSpace.dump_all prevents the issue from manifesting. Is it possible that something was referencing the object address, then running dump_all caused that reference to be removed?
Updated by byroot (Jean Boussier) over 1 year ago
- Related to Bug #19460: Class not able to be garbage collected added