Feature #19790
closedOptionally write Ruby crash reports into a file rather than STDERR
Description
Use case¶
On our servers we set /proc/sys/kernel/core_pattern
to point to a small utility that report all the crashes happening in production with the associated core dump into BugSnag.
This allowed us to find and fix many Ruby and native extensions bugs in the last few years.
However, these are hard to triage, so we'd like to augment these crash reports with the output of rb_vm_bugreport()
.
Problem¶
rb_vm_bugreport()
is hard coded to print to STDERR, this makes it hard to extract and parse the report in a production environment, as very often STDERR is shared with other processes, so the crash report is intertwined with logs from other processes.
Feature Request¶
It would be very useful if Ruby could write the crash report to an arbitrary path rather than STDERR, akin to kernel/core_pattern
.
Especially it would be useful if it supported interpolating the crashing process PID with %p
like kernel/core_pattern
, as it would make it easier to map that report with the core file.
This could be controller by an environment variable such as RUBY_BUGREPORT_PATH
. e.g.
RUBY_BUGREPORT_PATH=/var/log/ruby/ruby-crash-pid-%p.log
Optional Features¶
kernel/core_pattern
supports other interpolations, however not all of them would make sense for Ruby to support.
%% A single % character.
%c Core file size soft resource limit of crashing process
(since Linux 2.6.24).
%d Dump mode—same as value returned by prctl(2)
PR_GET_DUMPABLE (since Linux 3.7).
%e The process or thread's comm value, which typically is
the same as the executable filename (without path prefix,
and truncated to a maximum of 15 characters), but may
have been modified to be something different; see the
discussion of /proc/pid/comm and /proc/pid/task/tid/comm
in proc(5).
%E Pathname of executable, with slashes ('/') replaced by
exclamation marks ('!') (since Linux 3.0).
%g Numeric real GID of dumped process.
%h Hostname (same as nodename returned by uname(2)).
%i TID of thread that triggered core dump, as seen in the
PID namespace in which the thread resides (since Linux
3.18).
%I TID of thread that triggered core dump, as seen in the
initial PID namespace (since Linux 3.18).
%p PID of dumped process, as seen in the PID namespace in
which the process resides.
%P PID of dumped process, as seen in the initial PID
namespace (since Linux 3.12).
%s Number of signal causing dump.
%t Time of dump, expressed as seconds since the Epoch,
1970-01-01 00:00:00 +0000 (UTC).
%u Numeric real UID of dumped process.
Additionally, if kernel/core_pattern
starts with a pipe (|
), then it allows to pipe the core dump to another program, this may also make sense as a feature.
Prior Art¶
Aside from kernel/core_pattern
, some other virtual machine have a similar feature, for instance the JVM has a configurable crash report:
-XX:ErrorFile=/var/log/java/hs_err_pid%p.log
Updated by nobu (Nobuyoshi Nakada) over 1 year ago
https://github.com/nobu/ruby/tree/bugreport_path
Implemented %%
, %e
, %E
, %f
, %F
, %p
, %t
and %NNN
(octal).
When invoking pipe, the string is split with white spaces first without quoting, %040
would be useful to represent a space character, especially on Windows.
Updated by byroot (Jean Boussier) about 1 year ago
Thank you @nobu (Nobuyoshi Nakada), I wasn't expecting someone to implement it :)
I take it that you are supportive of the feature?
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
byroot (Jean Boussier) wrote in #note-2:
Thank you @nobu (Nobuyoshi Nakada), I wasn't expecting someone to implement it :)
I take it that you are supportive of the feature?
I'm positive, but also afraid if this can work well in a signal handler.
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
Note that the branch includes a testing feature, and is not ready to merge as-is.
Updated by kjtsanaktsidis (KJ Tsanaktsidis) about 1 year ago
afraid if this can work well in a signal handler.
The current bugreporter does lots of things which are not really safe to do in a signal handler... even fprintf
is not async-signal-safe (at least on Linux, according to signal-safety(7)). It also accesses the Ruby stack which of course may be in an inconsistent state. I guess it all works "well enough" most of the time.
Looking at your patch, it looks about as signal-safe as the current implementation is. Notably the sprintf
calls are unsafe (but we're already doing that), and I'm not sure if the RSTRING_PTR
etc manipulations are safe (but we do things like that in the Ruby backtrace-printing code anyway). Nothing else jumps out to my eyes as notably more hazardous. The only thing I might suggest is to push back the opening of the bugreport file (and parsing the %
's) until after the if (crashing)
check in rb_vm_bugreport
, so that these unsafety's can't cause recursive crashes.
(Also - unrelated to the issue at hand, but I think the crashing = true
assignment might strictly speaking need a memory fence after it)
If we wanted Ruby crashdumps to be signal-safe, I guess the best option would be to simply ask the operating system to generate coredumps on crashes, and write some external tooling to do nice things like extract Ruby backtraces out of those coredumps. However that code would be quite platform-specific, and also duplicate decent amounts of Ruby's string/backtrace handling code, so I can see why this hasn't been the approach so far. Not to mention that dealing with coredump files is kind of radioactive because they contain whatever potentially sensitive data that the application was processing at the time.
Updated by byroot (Jean Boussier) about 1 year ago
also afraid if this can work well in a signal handler.
Yeah, as @kjtsanaktsidis (KJ Tsanaktsidis) said, in theory the current one already use things that aren't async signal safe, so I don't think it really change much here.
open(2)
should be async-signal safe according to https://man7.org/linux/man-pages/man7/signal-safety.7.html, so unless I'm missing something that's the only real addition here.
Updated by ko1 (Koichi Sasada) about 1 year ago
Does it print to both STDERR and a file? or only to a file?
Updated by byroot (Jean Boussier) about 1 year ago
I would say only a file if the environment variable is set, with perhaps just a message on stderr saying the report was written into that path.
But I don't have a strong opinion about this, if others think it should be written in both, that's workable for me, I just personally don't think it has value.
Updated by matz (Yukihiro Matsumoto) about 1 year ago
I accept this proposal, at least we can experiment.
I am a bit worried about calling open(2) in other functions/system calls from the signal handler.
Matz.
Updated by byroot (Jean Boussier) about 1 year ago
I accept this proposal
Thank you!
I am a bit worried about calling open(2) in other functions/system calls from the signal handler.
Unless I'm misreading it, POSIX says open(2)
should be async signal safe: https://man7.org/linux/man-pages/man7/signal-safety.7.html so we should be good.
Either way, I think the crash reporter is understood as being "best effort", in rare cases, it can already happens that generating it cause a recursive crash, as long as it works in the vast majority of cases, it's not the end of the world. At this stage all bets are off anyway.
@nobu (Nobuyoshi Nakada) do you wish to finish your branch? Let me know if I can help in any way.
Updated by byroot (Jean Boussier) about 1 year ago
A note from the dev meeting logs, the env var that was accepted is RUBY_CRASH_REPORT
(not RUBY_BUGREPORT_PATH
).
Updated by byroot (Jean Boussier) about 1 year ago
I've backported @nobu's branch on our 3.2 rubies and it's been working very well. Not too sure what is missing.
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
- Status changed from Open to Closed
Applied in changeset git|bab01d284c86b39952baaddb15901fd4a15e4151.
[Feature #19790] Rename BUGREPORT_PATH as CRASH_REPORT