Bug #14127
closed
(CSV) generating UTF-16LE encoded file without BOM
Description
This file should contain BOM information so that it is properly detected as a UTF-16LE file.
How to generate such a file:
file = CSV.generate(encoding: 'UTF-16LE') do |csv|
  csv << ['something', 'ľščťžýáíé']
end
According to file -I file.csv, this file is recognized as application/octet-stream; charset=binary because it is missing the BOM information.
According to Wikipedia https://en.wikipedia.org/wiki/UTF-16 it should contain "\xFF\xFE" at the beginning of the document so that everyone knows it's UTF-16LE.
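For illustration, the kind of check a BOM-sniffing tool performs can be sketched in Ruby. The helper name utf16le_bom? is hypothetical, and the sample strings are built in memory rather than taken from CSV.generate.

```ruby
# Byte sequence of U+FEFF when encoded as UTF-16LE.
UTF16LE_BOM = [0xFF, 0xFE].freeze

# Hypothetical helper: true if the data starts with the UTF-16LE BOM.
def utf16le_bom?(data)
  data.bytes.first(2) == UTF16LE_BOM
end

without_bom = "something".encode("UTF-16LE")
with_bom    = "\uFEFF".encode("UTF-16LE") + without_bom

puts utf16le_bom?(without_bom) # false
puts utf16le_bom?(with_bom)    # true
```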
Here is someone trying to fix this in a similar way: https://stackoverflow.com/a/22950912/1632815 I did it: manually adding that BOM information.
## Adds BOM, albeit in a somewhat hacky way.
new_html_file = File.open("foo.txt", "w:UTF-8")
new_html_file << "\xFF\xFE".force_encoding('utf-16le') + some_text.force_encoding('utf-8').encode('utf-16le')
Updated by nobu (Nobuyoshi Nakada) about 7 years ago
laykou (Ladislav Gallay) wrote:
This file should contain BOM information so that it is properly detected as a UTF-16LE file.
How to generate such a file:
file = CSV.generate(encoding: 'UTF-16LE') do |csv|
  csv << ['something', 'ľščťžýáíé']
end
csv.rb seems to have bugs in its support for ASCII-incompatible encodings.
According to file -I file.csv, this file is recognized as application/octet-stream; charset=binary because it is missing the BOM information.
According to Wikipedia https://en.wikipedia.org/wiki/UTF-16 it should contain "\xFF\xFE" at the beginning of the document so that everyone knows it's UTF-16LE.
CSV.generate just builds a CSV string; it doesn't create a file.
Writing the result to a file with a BOM is the application's responsibility.
CSV.open("utf16.csv", "w:UTF-16LE:utf-8") do |csv|
  csv.to_io.write "\uFEFF"
  csv << ['something', 'ľščťžýáíé']
end
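When the CSV text is built with CSV.generate instead, one possible approach is to generate it in the default UTF-8 encoding and transcode afterwards. This is only a sketch under that assumption; the variable names are illustrative.

```ruby
require "csv"

# Build the CSV text in the default (UTF-8) encoding first.
csv_text = CSV.generate do |csv|
  csv << ["something", "ľščťžýáíé"]
end

# Transcode to UTF-16LE and prepend the BOM (U+FEFF, bytes FF FE).
utf16_csv = "\uFEFF".encode("UTF-16LE") + csv_text.encode("UTF-16LE")

p utf16_csv.encoding
p utf16_csv.bytes.first(2)
```

The resulting string can then be written as raw bytes, for example with File.binwrite, so that no further transcoding takes place.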
Here is someone trying to fix this in a similar way: https://stackoverflow.com/a/22950912/1632815 I did it: manually adding that BOM information.
new_html_file = File.open("foo.txt", "w:UTF-16LE")
new_html_file << "\uFEFF" << some_text
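A quick way to confirm the effect of this approach is to read the first two bytes back. The sketch below assumes a temporary directory and a placeholder string in place of some_text.

```ruby
require "tmpdir"

first_bytes = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "foo.txt")

  # "w:UTF-16LE" makes Ruby transcode written strings to UTF-16LE.
  File.open(path, "w:UTF-16LE") do |f|
    f << "\uFEFF" << "some text" # placeholder for some_text
  end

  # Read the raw bytes back without any decoding.
  first_bytes = File.binread(path, 2).bytes
end

puts first_bytes.map { |b| format("%02X", b) }.join(" ") # FF FE
```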
Updated by hsbt (Hiroshi SHIBATA) almost 7 years ago
- Status changed from Open to Assigned
- Assignee set to kou (Kouhei Sutou)
Updated by kou (Kouhei Sutou) almost 7 years ago
- Status changed from Assigned to Rejected
nobu has already said almost everything.
You should write the BOM yourself when you use CSV.generate.
If you don't want to write the BOM yourself, you should use CSV.open(..., "w:UTF-16"):
CSV.open("utf16.csv", "w:UTF-16:utf-8") do |csv|
  csv << ['something', 'ľščťžýáíé']
end
But it generates big-endian UTF-16.
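The difference in byte order can be seen on a single character. This small sketch compares Ruby's "UTF-16" encoding, which adds a big-endian BOM on its own, with "UTF-16LE" plus a manually prepended BOM.

```ruby
# "UTF-16" prepends a BOM automatically and uses big-endian byte order.
big_endian = "a".encode("UTF-16")

# "UTF-16LE" adds no BOM, so it is prepended by hand here.
little_endian = "\uFEFF".encode("UTF-16LE") + "a".encode("UTF-16LE")

p big_endian.bytes    # [254, 255, 0, 97]
p little_endian.bytes # [255, 254, 97, 0]
```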
Updated by printercu (Max Melentiev) over 6 years ago
WDYT about adding a file_header option or something like this?
It's quite tricky to add this in streaming mode:
CSV.open(file, 'wb', encoding: 'utf-16le', headers: headers_row, write_headers: true) do |csv|
  bom_written = false
  for_each_row do |row|
    unless bom_written
      csv.to_io.write(BOM)
      bom_written = true
    end
    csv << row
  end
end
Updated by kou (Kouhei Sutou) over 6 years ago
Why do you need to use bom_written?
CSV.open(file, 'wb', encoding: 'utf-16le', headers: headers_row, write_headers: true) do |csv|
  csv.to_io.write(BOM)
  for_each_row do |row|
    csv << row
  end
end
Updated by printercu (Max Melentiev) over 6 years ago
It has different behaviour. In my example the file is empty if csv.<< is never called; in the suggested example it contains the BOM anyway.
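The "BOM only if something is written" behaviour could be factored out into a small wrapper around the output IO. The class below is a hypothetical sketch, not part of csv, and it is only exercised against a binary StringIO here; whether CSV accepts such a wrapper as its output is not verified.

```ruby
require "stringio"

# Hypothetical wrapper: emits the UTF-16LE BOM right before the first
# write, so an output that never receives data stays empty.
class LazyBomIO
  BOM = "\uFEFF".encode("UTF-16LE").freeze

  def initialize(io)
    @io = io
    @bom_written = false
  end

  def write(data)
    unless @bom_written
      @io.write(BOM)
      @bom_written = true
    end
    @io.write(data)
  end

  def <<(data)
    write(data)
    self
  end
end

empty_io = StringIO.new(String.new(encoding: Encoding::BINARY))
LazyBomIO.new(empty_io) # nothing written, so no BOM either

used_io = StringIO.new(String.new(encoding: Encoding::BINARY))
LazyBomIO.new(used_io) << "a".encode("UTF-16LE") << "b".encode("UTF-16LE")

p empty_io.string.bytes # []
p used_io.string.bytes  # [255, 254, 97, 0, 98, 0]
```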