Bug #15362
closed[PATCH] Avoid GCing dead stack after switching away from a fiber
Description
Hello! I have a patch that fixes Bug #14561. It's not a platform specific issue but
it affects the default build configuration for MacOS and is causing segfaults on 2.5.x.
I've put the test for this in a separate patch because I'm not sure if we want to have
a 5 second test that only matters for non-default build configs and doesn't catch things reliably on Linux.
I tested this on both trunk and ruby_2_5, on MacOS and on Linux, on various build configs.
Please let me know if anything in my understanding is wrong. I've pasted my commit message below.
Fibers save execution contextes, and execution contexts include a native
stack pointer. It may happen that a Fiber outlive the native thread
it executed on. Consider the following code adapted from Bug #14561:
enum = Enumerator.new { |y| y << 1 }
thread = Thread.new { enum.peek } # fiber constructed inside the
# block and saved inside `enum`
thread.join
sleep 5 # thread finishes and thread cache wait time runs out.
# Native thread exits, possibly freeing its stack.
GC.start # segfault because GC tires to mark the dangling stack pointer
# inside `enum`'s fiber
The problem is masked by FIBER_USE_COROUTINE and FIBER_USE_NATIVE,
as those implementations already do what this commit does.
Generally on Linux systems, FIBER_USE_NATIVE is 1 even when
one uses ./configure --disable-fiber-coroutine
, since most
Linux systems have getcontext() and setcontext() which
turns on FIBER_USE_NATIVE. (compile with `make
DEFS="-DFIBER_USE_NATIVE=0" to explicitly disable it)
Furthermore, when both FIBER_USE_COROUTINE and FIBER_USE_NATIVE
are off, and the GC reads from the stack of a dead native
thread, MRI does not segfault on Linux. This is probably due to
libpthread not marking the page where the dead stack lives as
unreadable. Nevertheless, this use-after-free is visible through
Valgrind.
On ruby_2_5, this is an acute problem, since it doesn't have FIBER_USE_COROUTINE.
Thread cache is also unavailable for 2.5.x, triggering this issue
more often. (thread cache gives this bug a grace period since
it makes native threads wait a little before exiting)
This issue is very visible on MacOS on 2.5.x since libpthread marks
the dead stack as unreadable, consistently turning this use-after-free
into a segfault.
Fixes Bug #14561
- cont.c: Set saved_ec.machine.stack_end to NULL when switching away from a
fiber to keep the GC marking it.saved_ec
gets rehydrated with a
stack pointer if/when the fiber runs again.
Files
Updated by aselder (Andrew Selder) about 6 years ago
Alan,
You're amazing... We're completely blocked on upgrading by this bug. You're a savior
Thanks,
Andrew
Updated by ioquatix (Samuel Williams) about 6 years ago
- Assignee set to ioquatix (Samuel Williams)
- Target version set to 2.6
- Backport changed from 2.4: UNKNOWN, 2.5: UNKNOWN to 2.5: REQUIRED
I will take look at it.
Updated by ioquatix (Samuel Williams) about 6 years ago
I looked at the patch it seemed okay.
I committed it to trunk r66111
Once we've confirmed it solves the issue, I will backport 2.5 branch, hopefully it can be included 2.5.4 or next patch release.
Updated by ioquatix (Samuel Williams) about 6 years ago
After I merged this patch, https://travis-ci.org/ruby/ruby/jobs/462071967 fails. I am testing locally in more detail.
Updated by ioquatix (Samuel Williams) about 6 years ago
Okay, it was random failure. That concerns me too. But, it passed now. I'll merge to ruby_2_5 too.
Updated by normalperson (Eric Wong) about 6 years ago
samuel@oriontransfer.net wrote:
After I merged this patch,
https://travis-ci.org/ruby/ruby/jobs/462071967 fails. I am
testing locally in more detail.
Alan + Sam: thanks for investigating this.
Perhaps an alternative fix would be to avoid calling
rb_gc_mark_machine_stack() if the owner thread is dead.
I couldn't reproduce the original problem on FreeBSD 11.2
with both FIBER_USE_NATIVE and FIBER_USE_COROUTINE disabled;
so I can't verify my hypothesis
(Didn't bother testing on Linux)
Updated by ioquatix (Samuel Williams) about 6 years ago
@normalperson (Eric Wong) I am open to ideas. I don't know a lot about how thread and GC interact within Ruby, unfortunately it's not an area I spent much time on. So, if you have better proposal, I suggest you implement it. This change included a spec, so I was happy to merge it even thought I didn't fully understand what is going wrong. Confirmed green build matrix so I am happy to close. If you want to make a change, let me know, otherwise I will close this and it would be backported to 2.5.
Updated by ioquatix (Samuel Williams) about 6 years ago
@normalperson (Eric Wong) I thought about it a bit more, perhaps both ideas should be implemented for maximum protection. We could add some notes in the source code too about the issue/invariants.
Updated by normalperson (Eric Wong) about 6 years ago
samuel@oriontransfer.net wrote:
@normalperson (Eric Wong) I thought about it a bit more, perhaps both
ideas should be implemented for maximum protection. We could
add some notes in the source code too about the
issue/invariants.
NAK. I'm against belt-and-suspenders fixes because it favors
lack-of-understanding and makes Ruby slower with more overhead.
Looking at Alan's original fix (which you committed), it seems
to make sense. Anyways, my hypothesis is here:
https://80x24.org/spew/20181201063852.30438-1-e@80x24.org/raw
It may allow Alan's simple one-line fix to be reverted.
However, I prefer Alan's one-liner fix since it's smaller
with instructions with no extra branches.
Updated by ioquatix (Samuel Williams) about 6 years ago
- Status changed from Open to Closed
Okay, I think we are in agreement then.
Updated by alanwu (Alan Wu) about 6 years ago
A comment for future maintainers:
Since &fib->cont.saved_ec == ruby_current_execution_context_ptr
in the place my patch makes the change, if GC starts below where I set stack_end
to NULL, the stack doesn't get marked.
This is not a problem right now since there is no call to any function that could trigger GC (correct me if I'm wrong), but future changes should be careful to not introduce such calls.
I wish there was some way to codify this constraint and have the compiler statically check this for us.
Updated by dazuma (Daniel Azuma) about 6 years ago
Did this get backported to ruby_2_5? I don't see a corresponding commit in the github mirror https://github.com/ruby/ruby/commits/ruby_2_5
Updated by aselder (Andrew Selder) almost 6 years ago
This still looks like it's waiting on a backport to Ruby 2.5. Also, does anyone know when the next release of the Ruby 2.5 branch will be done?
Thanks!
Updated by nagachika (Tomoyuki Chikanaga) almost 6 years ago
- Has duplicate Bug #14561: Consistent 2.5.0 seg fault in GC, related to accessing an enumerator in a thread added
Updated by nagachika (Tomoyuki Chikanaga) almost 6 years ago
- Backport changed from 2.5: REQUIRED to 2.5: DONE
ruby_2_5 r66777 merged revision(s) 66111.
Updated by wanabe (_ wanabe) almost 6 years ago
- Related to Bug #14714: Ruby 2.5.1 Segmentation Fault in GC added
Updated by wanabe (_ wanabe) almost 6 years ago
- Related to Bug #15375: Crash report for Ruby 2.5.3p105 added