Bug #19192
openIO has third data mode, document is incomplete.
Description
The documentation on the mode parameter of File.open is incomplete, I would like to clarify IO's data mode actual behavior here.
document says
To specify whether data is to be treated as text or as binary data, either of the following may be suffixed to any of the string read/write modes above:
't': Text data; sets the default external encoding to Encoding::UTF_8; on Windows, enables conversion between EOL and CRLF and enables interpreting 0x1A as an end-of-file marker.
'b': Binary data; sets the default external encoding to Encoding::ASCII_8BIT; on Windows, suppresses conversion between EOL and CRLF and disables interpreting 0x1A as an end-of-file marker.
If neither is given, the stream defaults to text data.
But actually it's more complicated than that.
There is three Data Mode
- text mode. Can convert encoding and newline.
- binary mode. Cannot convert encoding nor newline. Encoding is treated as Encoding::ASCII_8BIT.
- third mode: DOS TEXT mode. That enables conversion between EOL and CRLF and enables interpreting 0x1A as an end-of-file marker.
On Windows platform
't' textmode with universal newline conversion.
'b' binary mode.
If neither is given, DOS TEXT mode.
On other platforms
't' textmode with universal newline conversion.
'b' binary mode.
If neither is given, textmode without newline conversion.
On Windows, there are some special cases.
If Encoding conversion is specified, DOS TEXT mode is ignored and universal newline conversion applied.
If access mode is "a+", last (only one) EOF charactor is overwritten when DOS TEXT mode.
There are more parameter combinations, see https://gist.github.com/YO4/262e9bd5e44a37a7a2fa9118e271b30b
Is this all? I have not fully investigated.
Since the topic of data mode spanned access mode and encoding conversion, I don't think my English skills will allow me to summarize this into rdoc without breaking something...
Updated by alanwu (Alan Wu) about 2 years ago
Ugh, it's quite weird. The :crlf_newline
option is an encoding option,
but File.open
uses it in a way different from String#encode
.
With String#encode
, it replaces \n
with \r\n
:
p "\n".encode(crlf_newline: true) # => "\r\n"
With File.open
, it controls DOS TEXT mode, which is Windows-specific and
does the inverse conversion when reading.
Also, when used in conjunction with encoding conversion, it doesn't do any newline conversion on Windows:
require 'tempfile'
content = "\x1a \r \r\n \n".freeze
tmp = Tempfile.new.binmode
tmp.write(content)
tmp.flush
File.open(tmp.path, "r:US-ASCII:UTF-8", crlf_newline: true) do
read_content = _1.read
p content
p read_content
p content == read_content
end
__END__
F:\> ruby.exe -v .\19192.rb
ruby 3.2.0dev (2022-12-09T14:34:17Z master 12b5268679) [x64-mswin64_140]
"\u001A \r \r\n \n"
"\u001A \r \r\n \n"
true
There seems to be no way to get just CRLF conversion without also getting
special treatment for 0x1A
. This is probably because Ruby has to rely on the Windows
system library to do the conversion. The :universal_newline
option is built into
Ruby so it works cross-platform.