Bug #20586
closedSome filesystem calls in dir.c are missing error handling and can return incorrect results if interrupted
Description
Background¶
Hey! I work for Datadog on the Ruby profiler part of the datadog
(previously ddtrace
) gem.
A customer reached out with an issue where enabling the profiler made Dir.glob
return no files for a given folder:
Without profiler:
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> ["/gcsfuse/test.html", "/gcsfuse/test.txt"]
With profiler:
irb(main):002:0> Dir.glob('/gcsfuse/t*')
=> []
It turns out the issue is related to missing error handling in dir.c
.
The Datadog Ruby profiler, like stackprof, pf2 and vernier, uses unix signals to interrupt the currently-active thread and take a sample (usually SIGPROF
). When some system calls get interrupted by a signal, they return an EINTR error code back to the caller.
Consider for instance the implementation of dir_each_entry
in dir.c
:
static VALUE
dir_each_entry(VALUE dir, VALUE (*each)(VALUE, VALUE), VALUE arg, int children_only)
{
struct dir_data *dirp;
struct dirent *dp;
IF_NORMALIZE_UTF8PATH(int norm_p);
GetDIR(dir, dirp);
rewinddir(dirp->dir);
IF_NORMALIZE_UTF8PATH(norm_p = need_normalization(dirp->dir, RSTRING_PTR(dirp->path)));
while ((dp = READDIR(dirp->dir, dirp->enc)) != NULL) {
// ... do things
}
return dir;
}
If READDIR
returns NULL
, then dir_each_entry
assumes it has iterated the entire directory. But looking at the man page for readdir
we see the following sharp edge (emphasis mine):
It returns NULL on reaching the end of the directory stream or if an error occurred.
So what's happening in this situation is: readdir
gets interrupted, returns NULL
+ sets errno to EINTR
. But dir_each_entry
doesn't check errno
, so rather than raising an exception to flag the issue, it treats it as if the end of the directory has been reached.
How to reproduce¶
Reproducing this is somewhat annoying, because it's dependent on timing: the signal must arrive at the exact time the dir API is getting executed.
I was able to reproduce this every time by using the google cloud gcsfuse
tool. This somewhat makes sense -- a remote filesystem is much slower than a local one, so there's a much bigger window of opportunity for a signal to arrive while the system call is blocked.
Here's an example I included in https://github.com/DataDog/dd-trace-rb/pull/3720:
# Not shown: Set up a trial google cloud account, install gcsfuse, create a cloud storage bucket and put in some test files
$ gcsfuse test_fs_dd_trace_rb fuse-testing/
$ ls fuse-testing/
hello.txt test.html test.txt
# Not shown: Add `datadog` gem to `Gemfile`
$ DD_PROFILING_ENABLED=true DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED=false bundle exec ddprofrb exec ruby -e "Datadog::Profiling.wait_until_running; pp Dir.children('fuse-testing/')"
[]
Let me know if you'd like me to try to create a reproducer that does not depend on the datadog
gem.
Additional notes¶
I've spent quite some time looking at the dir.c
sources, and here's the full list of APIs that suffer from issues:
-
dir_each_entry
does not checkerrno
; all of its users have interruption bugs -
dir_tell
will return -1 instead of the correct position (which means that passing -1 todir_seek
/dir_set_pos
will cause it to not list the directory properly) -
do_opendir
an error in system calls will only sometimes be turned into a raised exception- Indirect callers that pass in rb_glob_error as errfunc: rb_glob, Dir.[], Dir.glob
- Indirect callers that pass in 0 as errfunc: ruby_glob, ruby_brace_glob
-
glob_opendir
does not check errno; all of its users have interruption bugs -
glob_getent
does not check errno; all of its users have interruption bugs -
nogvl_dir_empty_p
does not check errno (of readdir! it actually checks for opendir); all of its users have interruption bugs
Less sure about these:
-
do_stat
/do_lstat
will turn errors into warnings (unclear if enabled or disabled by default) -
need_normalization
callsfgetattrlist
/getattrlist
and all errors(ret != 0)
are treated in the same way -
rb_glob_error
is andrb_glob_caller
leave exceptions as pending and rely on callers to raise them properly - Error handling of
rb_home_dir_of
andrb_default_home_dir
are a bit suspicious
As a workaround in the Datadog Ruby profiler, in https://github.com/DataDog/dd-trace-rb/pull/3720 I've added monkey patches to all of the Ruby-level APIs that use the above functions and mask out SIGPROF
so these calls are never interrupted.
This solution is does successfully work around the issue, although it prevents the profiler from sampling during these system calls, which will mean less visibility if e.g. these calls are taking a long time. And well, maintaining monkey patches is always problematic for future Ruby compatibility.
Files