Bug #19473
opencan't be called from trap context (ThreadError) is too limiting
Description
Simple reproducer:
$ ruby -ve 'm=Mutex.new; trap(:HUP) { m.synchronize { p :OK } }; Process.kill :HUP, Process.pid; sleep 0.1'
ruby 3.2.1 (2023-02-08 revision 31819e82c8) [x86_64-linux]
-e:1:in `synchronize': can't be called from trap context (ThreadError)
from -e:1:in `block in <main>'
from -e:1:in `kill'
from -e:1:in `<main>'
Expected behavior:
$ ruby -ve 'm=Mutex.new; trap(:HUP) { m.synchronize { p :OK } }; Process.kill :HUP, Process.pid; sleep 0.1'
truffleruby 22.3.1, like ruby 3.0.3, GraalVM CE Native [x86_64-linux]
:OK
$ ruby -ve 'm=Mutex.new; trap(:HUP) { m.synchronize { p :OK } }; Process.kill :HUP, Process.pid; sleep 0.1'
jruby 9.4.0.0 (3.1.0) 2022-11-23 95c0ec159f OpenJDK 64-Bit Server VM 17.0.6+10 on 17.0.6+10 +jit [x86_64-linux]
:OK
This exception is highly problematic, for instance it breaks Timeout.timeout
in trap
:
https://github.com/ruby/timeout/issues/17#issuecomment-1142035939
I suppose this behavior is because sometimes it's problematic to lock a Mutex in trap, e.g., if it's already locked by the main thread/fiber.
But that would otherwise already raise deadlock; recursive locking (ThreadError)
, so there is no point to fail early.
And that's just one case, not all, so we should not always raise an exception.
There seems to be no valid reason to prevent all Mutex#synchronize
in trap
.
After all, if the Mutex for instance is only used in trap
, it's well-defined AFAIK.
For instance a given trap handler does not seem executed concurrently:
$ ruby -ve 'trap(:HUP) { puts "in trap\n"+caller.join("\n")+"\n\n"; sleep 0.1 }; pid = Process.pid; Process.wait fork { 20.times { Process.kill :HUP, pid } }; sleep 1'
ruby 3.2.1 (2023-02-08 revision 31819e82c8) [x86_64-linux]
in trap
-e:1:in `wait'
-e:1:in `<main>'
in trap
-e:1:in `wait'
-e:1:in `<main>'
in trap
-e:1:in `wait'
-e:1:in `<main>'
in trap
-e:1:in `wait'
-e:1:in `<main>'
in trap
-e:1:in `wait'
-e:1:in `<main>'
in trap
-e:1:in `wait'
-e:1:in `<main>'
And if the trap handler using the Mutex is never called while the Mutex is held by the main thread/fiber, there is also no problem.
Updated by Eregon (Benoit Daloze) over 2 years ago
Also Monitor
has the same problem on CRuby:
$ ruby -ve 'm=Monitor.new; trap(:HUP) { m.synchronize { p :OK } }; Process.kill :HUP, Process.pid; sleep 0.1'
ruby 3.2.1 (2023-02-08 revision 31819e82c8) [x86_64-linux]
-e:1:in `synchronize': can't be called from trap context (ThreadError)
from -e:1:in `block in <main>'
from -e:1:in `kill'
from -e:1:in `<main>'
And again it works fine on truffleruby and jruby
And since Monitor
is reentrant there is no reason at all (AFAIK) to prevent it in trap
.
Updated by Eregon (Benoit Daloze) over 2 years ago
- ruby -v set to ruby 3.2.1 (2023-02-08 revision 31819e82c8) [x86_64-linux]
Updated by Eregon (Benoit Daloze) over 2 years ago
Another approach would be to do it like e.g. Java which runs signal handlers on a separate thread.
Then there is no potential problem of the signal handler being reentrant to the main thread/fiber.
But that is a much bigger semantic change of course, so I think just removing the can't be called from trap context (ThreadError)
is much more actionable.
Updated by ioquatix (Samuel Williams) over 2 years ago
I was under the impression that very few operations are safe to call from a signal handler. Is that not the case here?
Updated by Eregon (Benoit Daloze) over 2 years ago
ioquatix (Samuel Williams) wrote in #note-5:
I was under the impression that very few operations are safe to call from a signal handler. Is that not the case here?
I think you are thinking about C signal handlers and C async-safety. This is not relevant here, the Ruby signal handler doesn't run inside the C signal handler.
Updated by ko1 (Koichi Sasada) over 2 years ago
The current limitation is introduced to protect user from dead-lock error like that:
m = Mutex.new
trap(:INT){m.synchronizaton{ do_something } }
m.lock; sleep
# C-c here
will cause deadlock and it is hard to predict and coding for it. ( https://github.com/ruby/timeout/issues/17#issuecomment-1461498517 is one example)
Another approach would be to do it like e.g. Java which runs signal handlers on a separate thread.
Yes. There was a discussion to introduce a thread for signal handlers. But it was not introduced maybe because it is too much for many cases.
akr had proposed that if someone want to use such case, how about to make a thread in a trap handler for workaround. I think it is reasonable.
Updated by Eregon (Benoit Daloze) over 2 years ago
Right, like I mentioned above.
But the deadlock might not happen (or even cannot happen for some cases I mentioned above), and then CRuby fails needlessly.
If the deadlock does happen, then we would get this error: deadlock; recursive locking (ThreadError)
. Which means the program doesn't hang and there is a stacktrace and it shows relevant information which is useful to fix it.
So I think it is just a mistake to fail early with can't be called from trap context (ThreadError)
.
In fact CRuby even cheats this rule for IO I saw.
There are fully correct use cases for using a Mutex inside trap
as I mentioned in the description.
Updated by Eregon (Benoit Daloze) over 2 years ago
https://github.com/ruby/timeout/issues/17#issuecomment-1464039853 seems a proof this restriction is wrong, and actually prevents a correct solution to that issue.
Also this restriction makes no sense for Monitor, since that's reentrant by design.
In fact, one could make their own Mutex using Queue, since Queue is allowed in trap handlers to workaround this restriction. It doesn't make any sense to reimplement Mutex in such an inefficient way of course.
Updated by Eregon (Benoit Daloze) over 2 years ago
ko1 (Koichi Sasada) wrote in #note-7:
Yes. There was a discussion to introduce a thread for signal handlers. But it was not introduced maybe because it is too much for many cases.
I think this is ultimately the proper solution to this problem.
It's extremely hard to reason about a signal handler running on top of any line of code of the main thread.
akr had proposed that if someone want to use such case, how about to make a thread in a trap handler for workaround. I think it is reasonable.
I cannot really use that here, or I would need an efficient way to know I'm in a trap handler (I suppose I could catch the can't be called from trap context (ThreadError)
but that's really ugly and likely quite slow).
Because Timeout doesn't define the trap handler, the user does.
And asking every user to workaround like that doesn't seem good, that actually creates more threads than a single signal-handling thread created by Ruby.
Updated by ko1 (Koichi Sasada) over 2 years ago
Eregon (Benoit Daloze) wrote in #note-8:
Right, like I mentioned above.
But the deadlock might not happen (or even cannot happen for some cases I mentioned above), and then CRuby fails needlessly.
If the deadlock does happen, then we would get this error:deadlock; recursive locking (ThreadError)
. Which means the program doesn't hang and there is a stacktrace and it shows relevant information which is useful to fix it.
My understandings are:
(1) It is hard to detect such deadlock risk because such errors occur with with very low frequency.
(2) It is hard to expect that the Mutex users care about signal handlers.
(3) It is hard to modify fixing with signal handlers.
So I think current limitation makes sense.
akr had proposed that if someone want to use such case, how about to make a thread in a trap handler for workaround. I think it is reasonable.
I cannot really use that here, or I would need an efficient way to know I'm in a trap handler (I suppose I could catch the can't be called from trap context (ThreadError) but that's really ugly and likely quite slow).
Because Timeout doesn't define the trap handler, the user does.
And asking every user to workaround like that doesn't seem good, that actually creates more threads than a single signal-handling thread created by Ruby.
akr's idea is not about Timeout library, but for Timeout users in trap handers.
If someone wants to use Timeout (or other complex/thread-safety needed code), making a thread like trap(...){Thread.new{ ... }}
seems feasible workaround.
Updated by Eregon (Benoit Daloze) over 2 years ago
ko1 (Koichi Sasada) wrote in #note-11:
My understandings are:
(1) It is hard to detect such deadlock risk because such errors occur with with very low frequency.
(2) It is hard to expect that the Mutex users care about signal handlers.
(3) It is hard to modify fixing with signal handlers.So I think current limitation makes sense.
No, it doesn't make sense, because it actually prevents the proper fix for TIMEOUT_THREAD_MUTEX
for Timeout:
https://github.com/ruby/timeout/issues/17#issuecomment-1464039853
So the limitation is unsound (it prevents valid and correct usages) and so we must remove it.
Also it doesn't catch try_lock
and Queue#pop
, etc, so it's very rough and inaccurate.
We could make it a $VERBOSE = true
warning maybe, but again it would be false positives and impossible to address, so it seems not good.
Another idea would be being able to mark which Mutex are "trap-safe", maybe with an argument to Mutex#initialize.
Sort of like how IO does it.
ko1 (Koichi Sasada) wrote in #note-11:
akr's idea is not about Timeout library, but for Timeout users in trap handers.
If someone wants to use Timeout (or other complex/thread-safety needed code), making a thread liketrap(...){Thread.new{ ... }}
seems feasible workaround.
But users might use some gem or some library code in trap handler and they might not know whether that uses Timeout or not.
And even if they know, it should just work, without needing any workaround from the user.
My conclusion is doing nothing is unacceptable, this CRuby limitation breaks valid and correct code such as https://github.com/ruby/timeout/issues/17#issuecomment-1464039853.
So either:
- We remove this unsound limitation, same as on other Rubies
- We add an argument to Mutex#initialize to mark it as trap-safe
- We execute trap handlers on a separate thread
Which one do we choose?
I think 3 is the best (much easier to reason about), but of course there is some potential for incompatibility there.
1 or 2 seem easy, so we could do them fast and maybe even backport it.
Updated by ko1 (Koichi Sasada) over 2 years ago
No, it doesn't make sense, because it actually prevents the proper fix for TIMEOUT_THREAD_MUTEX for Timeout:
I think this reason violates (1).
Updated by Eregon (Benoit Daloze) over 2 years ago
ko1 (Koichi Sasada) wrote in #note-13:
I think this reason violates (1).
What do you mean?
AFAIK https://github.com/ruby/timeout/issues/17#issuecomment-1464039853 is correct.
If it's incorrect, please point the mistake in my reasoning.
We can't have understanding (1)
by using solution 1.
(remove the trap+Mutex check), indeed those are exclusive.
IMO it's still worth it because we are always breaking when it's only rarely a problem.
So CRuby is causing hard failures for things that might just work.
Here is a list of your 3 understandings and how the 3 solutions address them, I also add (0):
(0) Fixes soundness, i.e. doesn't fail when the code correctly handles signal handler reentrancy
Solutions:
- fixes (0)
- fixes (0), (1) because no change there, (2) because opt-in
- fixes (0), (1), (2), (3)
Updated by ioquatix (Samuel Williams) 3 months ago
I agree the current behaviour seems too limiting. If user code does something incorrect, it should fail, or deadlock. But we should not prevent valid programs just because some users will write invalid programs.
Updated by ioquatix (Samuel Williams) about 1 month ago
· Edited
Folks, what do you think of https://github.com/ruby/ruby/pull/13545?
In my PR, I noticed that IO
's internal write_lock
explicitly bypasses this limitation. In other words, it shows that there is a need for this behaviour and we have to implement special cases. I'd prefer we don't have any special cases, so I agree even more strongly with (1) We remove this unsound limitation, same as on other Rubies.
Updated by ioquatix (Samuel Williams) 16 days ago
Had to special case another mutex: https://github.com/ruby/ruby/pull/13684
Updated by ioquatix (Samuel Williams) 16 days ago
I agree, it would be nice to relax this restriction. But I also understand that it's unpredictable since the trap handler can run at any point the Ruby interpreter checks for interrupts (which for user code is fairly frequent)
As an alternative (to relaxing the restriction), we could expose the existing logic/flag, e.g. Mutex#safe_in_trap_context = true/false
. It's a bit ugly, but it's one idea that might allow us to move forward while retaining the default existing behaviour.
It's extremely hard to reason about a signal handler running on top of any line of code of the main thread.
In Async, we handle this using Thread.handle_interrupt
to defer interrupts until a specific known point in the event loop execution: https://github.com/socketry/async/blob/4206e32da4e04b65a4074080f9a59d66b827a347/lib/async/scheduler.rb#L520-L540. This helps to make interrupt handling more predictable. This makes asynchronous interrupts a little bit safer, IMHO.
There seems to be no valid reason to prevent all
Mutex#synchronize
in trap.
pthread_mutex_
functions are not async-signal-safe. However, Ruby does not run code directly in the signal handler, so it's not the same risk - but I feel like in principle it's still the same semantic issue. So, it's not entirely clear to me that it's safe to use any kind of blocking operation in principle, in trap handlers.
Updated by ioquatix (Samuel Williams) 16 days ago
As one other idea, what about introducing Mutex#reentrant = true/false
. This seems like a more general model than safe_in_trap_context
and I think we could then allow reentrant mutex to be used safely in trap contexts. The user would be explicitly opting into that scenario.
Updated by Eregon (Benoit Daloze) 15 days ago
I think a simple way to look at this issue is to analyze all cases:
- The Mutex is not held by any Thread, and the signal handler acquires it (the common case): this is fully correct and would work just fine without the unnecessary
can't be called from trap context (ThreadError)
from CRuby which breaks this. - The Mutex is held by the main Thread, and the signal handler tries to acquire it: it would be
deadlock; recursive locking (ThreadError)
which is as good as the currentcan't be called from trap context (ThreadError)
. - The Mutex is held by another Thread, and the signal handler tries to acquire it: the signal handler will wait until that Thread releases the Mutex. This is totally normal, it is Mutex semantics. You can't argue the Thread might keep the Mutex forever because that would be a bug in the first place. In the worst case it hangs, the user would Ctrl+C and see the stacktrace, which is a reasonable outcome for that case. CRuby is even sometimes able to detect such a deadlock and raises a
fatal
error in that case.
So removing the check is clearly better for the 1st case, is equivalent for the 2nd case, and is better for the 3rd case (at least for the subcase where the other Thread doesn't hold the Mutex forever).
If anyone disagrees, can they explain how it is better for all of these cases to fail with can't be called from trap context (ThreadError)
?
Updated by ko1 (Koichi Sasada) 2 days ago
Eregon (Benoit Daloze) wrote in #note-20:
I think a simple way to look at this issue is to analyze all cases:
- The Mutex is not held by any Thread, and the signal handler acquires it (the common case): this is fully correct and would work just fine without the unnecessary
can't be called from trap context (ThreadError)
from CRuby which breaks this.- The Mutex is held by the main Thread, and the signal handler tries to acquire it: it would be
deadlock; recursive locking (ThreadError)
which is as good as the currentcan't be called from trap context (ThreadError)
.- The Mutex is held by another Thread, and the signal handler tries to acquire it: the signal handler will wait until that Thread releases the Mutex. This is totally normal, it is Mutex semantics. You can't argue the Thread might keep the Mutex forever because that would be a bug in the first place. In the worst case it hangs, the user would Ctrl+C and see the stacktrace, which is a reasonable outcome for that case. CRuby is even sometimes able to detect such a deadlock and raises a
fatal
error in that case.
If anyone disagrees, can they explain how it is better for all of these cases to fail with
can't be called from trap context (ThreadError)
?
In most cases, (1) occurs, so it is difficult to notice the possibility of (2).
With the current behavior, it is possible to recognize the risk of (2) even in case (1). This is the reason why this behavior is merged into Ruby.
(In this thread, I mentioned as "hard to predict" — this is what it refers to.)
Updated by mame (Yusuke Endoh) 2 days ago
If Mutex#lock
were permitted inside a trap handler, a program like the following would have a very hard-to-reproduce race condition bug.
m = Mutex.new
trap(:INT) { m.synchronize { p :signalled } }
while true
sleep 1
m.synchronize { p :hi! }
end
If you press Ctrl+C while this program is running, it would, in most cases, print :signalled
and continue execution, which is an expected behavior. However, if the signal arrives at a particularly unlucky moment, it would cause a deadlock; recursive locking (ThreadError)
.
The problem isn't just that the bug is hard to reproduce; it's that programmers often don't even realize they've introduced such a race condition bug. To prevent this, Mutex#lock
is designed to consistently raise an exception when called inside a trap handler.
Updated by tompng (tomoya ishida) 2 days ago
Locking multiple mutex with a fixed order is considered to be deadlock-safe, but permitting mutex in signal_handler, I think fixed order locking is hard or impossible.
The code below looks deadlock-safe, but it's not.
trap(:INT){A.synchronize{}}
Thread.new{A.synchronize{sleep 0.1; B.synchronize{}}}
B.synchronize{sleep}
- Main thread locks B
- Another thread locks A and then tries to lock B
- Signal handler locks A
Main thread can lock B and then A if signal handler is called while main thread is locking B.
I don't think there's an easy way to fix this race condition except trap(...){Tread.new{...}}
, so I think the restriction is reasonable.
Updated by Eregon (Benoit Daloze) 1 day ago
OK, thank you all 3 for clarifying.
I understand that point of view, even though I don't fully agree with it.
One problem is this limitation makes it impossible to support Timeout.timeout
in trap
, see https://github.com/ruby/timeout/issues/17, but that would work 100% correctly without this limitation.
In fact it already works correctly on JRuby and TruffleRuby which don't raise for case 1.
I see two ways to resolve this:
- Run signal handlers not on the main thread but another thread. Then there is no need to prevent Mutex/Monitor/etc in
trap
since there won't be such problems. It seems the best solution and already e.g. what the JVM does. It might have some incompatibility due to running that code on another thread, but it shouldn't be the case since even currently trap handlers run "concurrently" to the main thread (they can run at any interrupt point which is pretty much anywhere). For exceptions from trap handlers (like the default SIGINT handler), they can simply be forwarded to the main thread so that case is easy to stay compatible. I think the main thing missing here is someone taking the time to implement this, anyone interested? - Allow specific Mutex & Monitor to be marked as "trap-safe" via a new method (like
rb_mutex_allow_trap()
but as a Ruby method) as proposed by @ioquatix (Samuel Williams) in https://bugs.ruby-lang.org/issues/19473#note-18. That method documentation can then show the examples above and examples of safe usages of it. The default is still safe, but then at least it's possible to use Mutex in trap handler when we have checked the logic would be correct and CRuby no longer prevents such correct cases.
@ko1 (Koichi Sasada) @mame (Yusuke Endoh) @tompng (tomoya ishida) Any opinion on which of these two we should do?
Updated by mame (Yusuke Endoh) 1 day ago
Run signal handlers not on the main thread but another thread.
While I don't have a strong opinion on this, I personally feel this approach makes more sense.
However, ko1 said that users can just write this explicitly themselves. Indeed, creating a new thread for each signal arrival is extremely simple.
trap { ... }
# is rewritten to:
trap { Thread.new { ... } }
Setting up a single, dedicated thread to execute signal handlers serially is more complex, but it still doesn't require a lot of code.
trap { ... }
# is rewritten to:
Q = Queue.new
Thread.new do
Q.each { ... }
end
trap { Q << it }
If core language support were to be added for this, I suppose it would be for cases where you want to force a trap handler written by someone else (e.g., in a gem) to run in a separate thread. But I wonder, is such a scenario realistic?
Updated by Eregon (Benoit Daloze) about 23 hours ago
mame (Yusuke Endoh) wrote in #note-25:
Indeed, creating a new thread for each signal arrival is extremely simple.
It feels like a hack though and it's a pretty high cost to create a new Thread for every signal.
If this was the official solution, then the can't be called from trap context (ThreadError)
should mention it, but I think it's not good enough to be an official solution, just a hacky workaround.
It seems nobody mentioned this workaround on https://github.com/ruby/timeout/issues/17. I think that's because it's unnatural and not a proper solution.
BTW another problem with that workaround is if the same signal is sent twice in quick succession (e.g. two Ctrl+C) the handlers will run concurrently in two threads, which is a recipe for complicated problems.
Currently trap handlers run serially (see issue description) and that seems clearly safer.
Setting up a single, dedicated thread to execute signal handlers serially is more complex, but it still doesn't require a lot of code.
The problem there is then all trap
handlers need to Q << it
which is too difficult to guarantee in general.
I suppose it would be for cases where you want to force a trap handler written by someone else (e.g., in a gem) to run in a separate thread. But I wonder, is such a scenario realistic?
Yes, for example gems like webservers like Puma set trap handlers.
There are many gems with trap handlers:
$ gem-codesearch '\bSignal\.trap\b' | wc -l
5528
(more with \btrap\b
but that takes too long to list)
Updated by mame (Yusuke Endoh) about 21 hours ago
Yes, for example gems like webservers like Puma set trap handlers.
The trap handler code that has already been written and published probably does not use Mutex#lock
, so it is not related to this problem. Such benign trap handlers do not require invoking a new thread.
In terms of using Timeout.timeout
in a trap handler, I don't think it is a good idea to run in a trap handler time-consuming code that requires Timeout.timeout
, because the trap handler may be executed with the main thread suspended in a strange state.
Updated by ioquatix (Samuel Williams) about 4 hours ago
We know that signal handlers run in a trap context, but are there other ways this can happen? I was under the impression GC finalizers also run in a pseudo-trap context, preventing the use of Mutex too. Of course, using a mutex during GC is probably a bad idea.