Bug #18882
closedFile.read cuts off a text file with special characters when reading it on MS Windows
Added by magynhard (Matthäus Johannes Beyrle) over 2 years ago. Updated over 2 years ago.
Description
When using File.read to read a text file (in this case a javascript file) with special characters, the content is cut off at special characters.
It occurs only when running ruby on Windows, tried several versions, including the latest.
Does not occur on Linux or WSL (Windows Subsystem for Linux).
I created a github repo including a test script and the source file as the result inside a file as well:
https://github.com/grob-net4industry/ruby_win_file_bug
Files
copy_pdfmake.min.js (582 KB) copy_pdfmake.min.js | magynhard (Matthäus Johannes Beyrle), 06/27/2022 11:12 AM | ||
pdfmake.min.js (1.29 MB) pdfmake.min.js | magynhard (Matthäus Johannes Beyrle), 06/27/2022 11:12 AM | ||
diff.png (55.9 KB) diff.png | magynhard (Matthäus Johannes Beyrle), 06/27/2022 11:12 AM |
Updated by chrisseaton (Chris Seaton) over 2 years ago
Don't you need to use binary mode to read these characters?
Updated by magynhard (Matthäus Johannes Beyrle) over 2 years ago
chrisseaton (Chris Seaton) wrote in #note-1:
Don't you need to use binary mode to read these characters?
As it is a text file and reading from it behaves correctly on Linux, i expect not to need to use binary mode.
Otherwise, should it be windows specific for a reason, that cannot be changed, at least an error message would be appropriate? Because it fails silently.
Updated by austin (Austin Ziegler) over 2 years ago
This is Windows-specific, and there is not an error condition to check for reporting on this.
The distinction between "binary" and "text" mode is simply the way that DOS works (yes, that’s how old this behaviour is), and the definition of text is essentially just characters 0x20 to 0x1e. If I remember correctly.
Using open(filename, 'rb')
is the only way to do this on Windows, and there is no such thing as "text mode" and "binary mode" on any system other than an old DOS-based mechanism.
This is not a bug.
Updated by chrisseaton (Chris Seaton) over 2 years ago
it behaves correctly on Linux, i expect not to need to use binary mode
I think binary mode is a no-op on Linux though? Different systems behave differently - Ruby can't completely remove those differences when you're doing IO.
Updated by alanwu (Alan Wu) over 2 years ago
File.read
(actually IO.read
) interprets the file based on the
external encoding, which is a function of the platform and the environment.
If you put a p [content.encoding, ::Encoding.default_external]
in your script
I think you should see a difference between the Windows Ruby and WSL Ruby.
Assuming that's the problem, I don't know how one could issue a warning for it
because Ruby can't know if the external encoding matches user expectation.
We could potentially improve the docs, though.
Updated by austin (Austin Ziegler) over 2 years ago
alanwu (Alan Wu) wrote in #note-5:
File.read
(actuallyIO.read
) interprets the file based on the
external encoding, which is a function of the platform and the environment.
If you put ap [content.encoding, ::Encoding.default_external]
in your script
I think you should see a difference between the Windows Ruby and WSL Ruby.
This is true, but not the source of the problem here. See https://docs.microsoft.com/en-us/cpp/c-runtime-library/text-and-binary-mode-file-i-o?view=msvc-170 and https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170.
The only real improvement to the documentation would be to suggest opening files for read as 'rb'
all the time if your software might be used on Windows or might be anything other than ASCII text with only a few control characters supported (0x1A
is particularly dangerous, but others like 0x00
might also cause problems). This is well-known Windows behaviour.
Updated by magynhard (Matthäus Johannes Beyrle) over 2 years ago
alanwu (Alan Wu) wrote in #note-5:
File.read
(actuallyIO.read
) interprets the file based on the
external encoding, which is a function of the platform and the environment.
If you put ap [content.encoding, ::Encoding.default_external]
in your script
I think you should see a difference between the Windows Ruby and WSL Ruby.Assuming that's the problem, I don't know how one could issue a warning for it
because Ruby can't know if the external encoding matches user expectation.
We could potentially improve the docs, though.
Added it to the script, both have the same output in this case:
[#<Encoding:UTF-8>, #<Encoding:UTF-8>]
But as @alanwu (Alan Wu) stated, there is a 0x1A charachter (26, = CTRL+Z) inside the javascript file, which is interpreted as end of file on text mode on windows:
https://stackoverflow.com/a/230878/11205329
Might be nice if File.read would use binary mode by default, but that would be a breaking change. So extending the documentation might be the better choice. Or do you think that there is a smart way to throw an error in cases, where a 0x1A character is included and therefore the read file is incomplete?
Updated by austin (Austin Ziegler) over 2 years ago
magynhard (Matthäus Johannes Beyrle) wrote in #note-7:
alanwu (Alan Wu) wrote in #note-5:
Might be nice if File.read would use binary mode by default, but that would be a breaking change. So extending the documentation might be the better choice. Or do you think that there is a smart way to throw an error in cases, where a 0x1A character is included and therefore the read file is incomplete?
There is no way to throw an error in this case, because there’s no error condition. 0x1a
is read just fine on non-Windows systems in "text" mode, but ignored on Windows, because the underlying operating system APIs respect file mode, which is not really accessible to Ruby. The OS sees an EOF and reports that back to Ruby. Ruby does not know this.
Updated by alanwu (Alan Wu) over 2 years ago
Huh, Ruby might be exposing a C runtime behavior here.
Assuming I'm reading the call graph right, and there
isn't some Ruby level defaulting going on, a File.read()
with just a path and no mode maps to a call to _wopen()
with neither _O_TEXT
nor _O_BINARY
set. In this
case the runtime uses a global _fmode
to decide the
actual mode, according to the docs here: https://docs.microsoft.com/en-us/cpp/c-runtime-library/text-and-binary-mode-file-i-o?view=msvc-170
So potentially, someone could use Fiddle or some C extension
to call _set_fmode(_O_BINARY)
to change the behavior of plain
File.read()
calls.
Anyways clearly I'm out of my depth here as you can tell how
I got the issue wrong initially. I agree that it's surprising
that File.read
doesn't read every byte, though. I will
add this to the next dev meeting because I'm interested in what
other people think.
The call graph from File.read()
to _wopen()
(again, it might be wrong):
rb_io_s_read
open_key_args
rb_io_open
rb_io_open_generic
rb_sysopen
rb_sysopen_internal
sysopen_func
rb_cloexec_open
open [(rb_w32_uopen when _WIN32)](https://github.com/ruby/ruby/blob/aba804ef91a5b2aa88efdd74205026aca3f943b2/io.c#L178-L181)
_wopen
Updated by nobu (Nobuyoshi Nakada) over 2 years ago
- Status changed from Open to Rejected
You can use File.binread
.
Updated by nobu (Nobuyoshi Nakada) over 2 years ago
alanwu (Alan Wu) wrote in #note-9:
So potentially, someone could use Fiddle or some C extension
to call_set_fmode(_O_BINARY)
to change the behavior of plain
File.read()
calls.
You mean we should use O_TEXT
explicitly when calling _wopen
?
Updated by nobu (Nobuyoshi Nakada) over 2 years ago
Actually, NEED_NEWLINE_DECORATOR_ON_READ_CHECK()
and do_writeconv()
set O_TEXT
or O_BINARY
always on each I/O.
Updated by mame (Yusuke Endoh) over 2 years ago
We discussed this ticket at the dev meeting.
@usa (Usaku NAKAMURA) and @nobu (Nobuyoshi Nakada) said that File.read
reads a file in text mode. And the VC runtime (msvcrt) does EOF character handling, CRLF conversion, etc under text mode. There is no fine-grained control on the Ruby side.
It is possible for File.read
to read a file in binary mode, but it would require careful consideration about implementation, side effect, compatibility, etc. For now, it is good to add a sentence like "This method reads a file in text mode." to the rdoc of File.read
(and File.write
).
Updated by mame (Yusuke Endoh) almost 2 years ago
- Related to Feature #19193: drop DOS TEXT mode support added