Bug #14127

(CSV) generating UTF-16LE encoded file without BOM

Added by laykou (Ladislav Gallay) over 1 year ago. Updated 6 months ago.

Status:
Rejected
Priority:
Normal
Target version:
-
ruby -v:
2.4.1
[ruby-core:83865]

Description

The generated file should contain a BOM so that it is properly detected as a UTF-16LE file.

How to generate such a file:

file = CSV.generate(encoding: 'UTF-16LE') do |csv|
    csv << ['something', 'ľščťžýáíé']
end

According to file -I file.csv, this file is recognized as application/octet-stream; charset=binary because it is missing the BOM.

According to Wikipedia (https://en.wikipedia.org/wiki/UTF-16), it should contain "\xFF\xFE" at the beginning of the document so that everyone knows it's UTF-16LE.

Here is someone fixing this in a similar way: https://stackoverflow.com/a/22950912/1632815 I did the same, manually adding the BOM:

## Adds BOM, albeit in a somewhat hacky way.
new_html_file = File.open("foo.txt", "w:UTF-8")
new_html_file << "\xFF\xFE".force_encoding('utf-16le') + some_text.force_encoding('utf-8').encode('utf-16le')

History

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

laykou (Ladislav Gallay) wrote:

The generated file should contain a BOM so that it is properly detected as a UTF-16LE file.

How to generate such a file:

file = CSV.generate(encoding: 'UTF-16LE') do |csv|
    csv << ['something', 'ľščťžýáíé']
end

csv.rb seems to have bugs in its support for ASCII-incompatible encodings.

According to file -I file.csv, this file is recognized as application/octet-stream; charset=binary because it is missing the BOM.

According to Wikipedia (https://en.wikipedia.org/wiki/UTF-16), it should contain "\xFF\xFE" at the beginning of the document so that everyone knows it's UTF-16LE.

CSV.generate just builds a CSV string; it doesn't create a file.
Writing the result to a file with a BOM is the application's responsibility.

CSV.open("utf16.csv", "w:UTF-16LE:utf-8") do |csv|
  csv.to_io.write "\uFEFF"
  csv << ['something', 'ľščťžýáíé']
end

Here is someone fixing this in a similar way: https://stackoverflow.com/a/22950912/1632815 I did the same, manually adding the BOM:

new_html_file = File.open("foo.txt", "w:UTF-16LE")
new_html_file << "\uFEFF" << some_text

Updated by hsbt (Hiroshi SHIBATA) about 1 year ago

  • Assignee set to kou (Kouhei Sutou)
  • Status changed from Open to Assigned

Updated by kou (Kouhei Sutou) about 1 year ago

  • Status changed from Assigned to Rejected

nobu has already said most of this.

You should write the BOM yourself when you use CSV.generate.
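
For reference, a minimal sketch of doing this by hand with CSV.generate: build the string as UTF-8, prepend U+FEFF, transcode the whole thing to UTF-16LE, and write the bytes unchanged. The output path and the variable names are illustrative assumptions, not something taken from this ticket.

```ruby
require "csv"
require "tmpdir"

# Illustrative output location (an assumption for this sketch).
path = File.join(Dir.tmpdir, "utf16le.csv")

# Build the CSV as a UTF-8 string first.
csv_utf8 = CSV.generate(encoding: "UTF-8") do |csv|
  csv << ["something", "ľščťžýáíé"]
end

# Prepend U+FEFF, then transcode everything at once. In UTF-16LE the
# BOM is serialized as the bytes FF FE.
csv_utf16le = ("\uFEFF" + csv_utf8).encode("UTF-16LE")

# binwrite stores the bytes as-is, with no further transcoding.
File.binwrite(path, csv_utf16le)
```

Transcoding after generation keeps CSV itself working on an ASCII-compatible string, which sidesteps the ASCII-incompatible encoding problems mentioned earlier in the thread.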

If you don't want to write the BOM yourself, you should use CSV.open(..., "w:UTF-16"):

CSV.open("utf16.csv", "w:UTF-16:utf-8") do |csv|
  csv << ['something', 'ľščťžýáíé']
end

Note, however, that this generates big-endian UTF-16, not UTF-16LE.
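
The byte-order difference can be checked directly on strings, without involving CSV. This is a small sketch relying on Ruby's transcoder: the generic UTF-16 encoding is expected to emit a big-endian BOM, while UTF-16LE emits none on its own.

```ruby
# Transcoding to the generic "UTF-16" encoding prepends a BOM and
# uses big-endian byte order.
generic = "a".encode("UTF-16")
puts generic.bytes.map { |b| format("%02X", b) }.join(" ")  # FE FF 00 61

# "UTF-16LE" has a fixed byte order and adds no BOM by itself,
# so U+FEFF has to be included explicitly.
little = "\uFEFFa".encode("UTF-16LE")
puts little.bytes.map { |b| format("%02X", b) }.join(" ")   # FF FE 61 00
```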

Updated by printercu (Max Melentiev) 6 months ago

What do you think about adding a file_header option or something like it?

It's quite tricky to add the BOM in streaming mode:

CSV.open(file, 'wb', encoding: 'utf-16le', headers: headers_row, write_headers: true) do |csv|
  bom_written = false
  for_each_row do |row|
    unless bom_written
      csv.to_io.write(BOM)
      bom_written = true
    end
    csv << row
  end
end

Updated by kou (Kouhei Sutou) 6 months ago

Why do you need to use bom_written?

CSV.open(file, 'wb', encoding: 'utf-16le', headers: headers_row, write_headers: true) do |csv|
  csv.to_io.write(BOM)
  for_each_row do |row|
    csv << row
  end
end

Updated by printercu (Max Melentiev) 6 months ago

It behaves differently. In my example the file is empty if csv.<< is never called; in the suggested example it contains the BOM regardless.
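
One way to get the "BOM only if at least one row is written" behaviour without repeating the flag in every loop is a thin wrapper around the CSV object. The sketch below is only an illustration: LazyBomCSV is a hypothetical helper, not part of the csv library, and it relies only on CSV#to_io and CSV#<<, both of which are used in the examples above.

```ruby
require "csv"

# Hypothetical helper (not part of the csv library): writes the BOM
# to the underlying IO right before the first row, and never otherwise.
class LazyBomCSV
  BOM = "\uFEFF"

  def initialize(csv)
    @csv = csv
    @bom_written = false
  end

  # Same shape as CSV#<<; returns self so calls can be chained.
  def <<(row)
    unless @bom_written
      @csv.to_io.write(BOM)
      @bom_written = true
    end
    @csv << row
    self
  end
end
```

Inside a CSV.open block this would be used as out = LazyBomCSV.new(csv), with rows appended through out << row instead of csv << row. If no row is ever appended, nothing is written to the file.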
