Project

General

Profile

Bug #4603

lib/csv.rb: when the :encoding parameter is not provided, the encoding of CSV data is treated as ASCII-8BIT

Added by nobuoka (yu nobuoka) over 8 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Target version:
ruby -v:
-
Backport:
[ruby-core:35866]

Description

=begin
This issue is involved in three methods, CSV::open, CSV::read and CSV::foreach.

The document of CSV::read says "This method also understands an additional
:encoding parameter that you can use to specify the Encoding of the data
in the file to be read. You must provide this unless your data is in
Encoding::default_external()."
However, when the :encoding parameter is not provided, the encoding of the CSV data
is treated as ASCII-8BIT. Not as Encoding.default_external.
CSV::open and CSV::foreach are also similar.

I think the actual behaviour of these methods doesn't conform to the document of these.
=end


Files

csv_test.rb (814 Bytes) csv_test.rb nobuoka (yu nobuoka), 04/24/2011 03:33 PM

Associated revisions

Revision e858442f
Added by nobu (Nobuyoshi Nakada) over 8 years ago

  • lib/csv.rb (CSV::open): suppress universal newline decorator. fixes #4603

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@31370 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

Revision 31370
Added by nobu (Nobuyoshi Nakada) over 8 years ago

  • lib/csv.rb (CSV::open): suppress universal newline decorator. fixes #4603

Revision 31370
Added by nobu (Nobuyoshi Nakada) over 8 years ago

  • lib/csv.rb (CSV::open): suppress universal newline decorator. fixes #4603

Revision 31370
Added by nobu (Nobuyoshi Nakada) over 8 years ago

  • lib/csv.rb (CSV::open): suppress universal newline decorator. fixes #4603

Revision 31370
Added by nobu (Nobuyoshi Nakada) over 8 years ago

  • lib/csv.rb (CSV::open): suppress universal newline decorator. fixes #4603

Revision 31370
Added by nobu (Nobuyoshi Nakada) over 8 years ago

  • lib/csv.rb (CSV::open): suppress universal newline decorator. fixes #4603

Revision 31370
Added by nobu (Nobuyoshi Nakada) over 8 years ago

  • lib/csv.rb (CSV::open): suppress universal newline decorator. fixes #4603

History

Updated by naruse (Yui NARUSE) over 8 years ago

  • Status changed from Open to Assigned
  • Assignee set to JEG2 (James Gray)
  • Target version set to 1.9.2

=begin

=end

Updated by naruse (Yui NARUSE) over 8 years ago

  • ruby -v changed from ruby 1.9.2p188 (2011-03-28 revision 31204) [x86_64-linux] to -

=begin
2011/4/25 James Gray james@graysoftinc.com:

On Sun, Apr 24, 2011 at 1:33 AM, yu nobuoka nobuoka@r-definition.com
wrote:

The document of CSV::read says "This method also understands an additional
:encoding parameter that you can use to specify the Encoding of the data
in the file to be read. You must provide this unless your data is in
Encoding::default_external()."
However, when the :encoding parameter is not provided, the encoding of the
CSV data
is treated as ASCII-8BIT. Not as Encoding.default_external.
CSV::open and CSV::foreach are also similar.

I think the actual behaviour of these methods doesn't conform to the
document of these.

It seems this was an intentional change not made by me:
r25362 | naruse | 2009-10-15 22:04:38 -0500 (Thu, 15 Oct 2009) | 2 lines

  • lib/csv.rb (CSV#raw_encoding): returns ASCII-8BIT when the io doesn't have encoding. This seems like a wrong choice. Why would we not want to support the default encodings? Can someone please explain to me why this was done?

Ah, sorry, that commit message doesn't explain the intention.
It is for IO-like object which doesn't have encoding method, for example Zlib::GzipReader
test_gzip_reader_bug_fix in test/csv/test_features.rb.

Anyway, even if I applied following patch, the problem is still reproduced.

diff --git a/lib/csv.rb b/lib/csv.rb
index 45273f9..ee35ccc 100644
--- a/lib/csv.rb
+++ b/lib/csv.rb
@@ -2296,7 +2296,7 @@ class CSV
elsif @io.respond_to? :encoding
@io.encoding
else

  • default
  • Encoding.default_internal || Encoding.default_external end end end

--
NARUSE, Yui naruse@airemix.jp
=end

Updated by naruse (Yui NARUSE) over 8 years ago

  • ruby -v changed from - to ruby 1.9.2p188 (2011-03-28 revision 31204) [x86_64-linux]

=begin

=end

Updated by Anonymous over 8 years ago

=begin
On Sun, Apr 24, 2011 at 1:33 AM, yu nobuoka nobuoka@r-definition.comwrote:

The document of CSV::read says "This method also understands an additional
:encoding parameter that you can use to specify the Encoding of the data
in the file to be read. You must provide this unless your data is in
Encoding::default_external()."
However, when the :encoding parameter is not provided, the encoding of the
CSV data
is treated as ASCII-8BIT. Not as Encoding.default_external.
CSV::open and CSV::foreach are also similar.

I think the actual behaviour of these methods doesn't conform to the
document of these.

It seems this was an intentional change not made by me:

r25362 | naruse | 2009-10-15 22:04:38 -0500 (Thu, 15 Oct 2009) | 2 lines

  • lib/csv.rb (CSV#raw_encoding): returns ASCII-8BIT when the io doesn't have encoding.

This seems like a wrong choice. Why would we not want to support the
default encodings? Can someone please explain to me why this was done?

James Edward Gray II
=end

Updated by Anonymous over 8 years ago

=begin
On Sun, Apr 24, 2011 at 11:29 PM, NARUSE, Yui naruse@airemix.jp wrote:

Ah, sorry, that commit message doesn't explain the intention.
It is for IO-like object which doesn't have encoding method, for
example Zlib::GzipReader
test_gzip_reader_bug_fix in test/csv/test_features.rb.

Anyway, even if I applied following patch, the problem is still reproduced.

diff --git a/lib/csv.rb b/lib/csv.rb
index 45273f9..ee35ccc 100644
--- a/lib/csv.rb
+++ b/lib/csv.rb
@@ -2296,7 +2296,7 @@ class CSV
elsif @io.respond_to? :encoding
@io.encoding
else

  • default
  • Encoding.default_internal || Encoding.default_external end end end

OK, I see the problem.

The issue is that a mode of "rb" is being used to suppress newline
translation on Windows. However, that's also switching my Encoding to
ASCII-8BIT. Usually I love that feature, but here it's not what I want. Is
there anyway to shut off the translation and not get the encoding change?

James Edward Gray II
=end

Updated by nobu (Nobuyoshi Nakada) over 8 years ago

  • ruby -v changed from ruby 1.9.2p188 (2011-03-28 revision 31204) [x86_64-linux] to -

=begin
Hi,

At Mon, 25 Apr 2011 22:57:52 +0900,
James Gray wrote in [ruby-core:35878]:

The issue is that a mode of "rb" is being used to suppress newline
translation on Windows. However, that's also switching my Encoding to
ASCII-8BIT. Usually I love that feature, but here it's not what I want. Is
there anyway to shut off the translation and not get the encoding change?

Now fixed so that universal_newline: false can work. What's
about the following patch?

diff --git i/lib/csv.rb w/lib/csv.rb
index 1aad2f3..d51cb55 100644
--- i/lib/csv.rb
+++ w/lib/csv.rb
@@ -1334,10 +1334,18 @@ class CSV
def self.open(*args)
# find the +options+ Hash
options = if args.last.is_a? Hash then args.pop else Hash.new end

  • # default to a binary open mode
  • args << "rb" if args.size == 1 and !options.key?(:mode)
  • # wrap a File opened with the remaining +args+
  • csv = new(File.open(*args, options), options)
  • # wrap a File opened with the remaining +args+ with no newline
  • # decorator
  • file_opts = {universal_newline: false}.merge(options)
  • begin
  • f = File.open(*args, file_opts)
  • rescue ArgumentError => e
  • throw unless /needs binmode/ =~ e.message and args.size == 1
  • args << "rb"
  • file_opts = {encoding: Encoding.default_external}.merge(file_opts)
  • retry
  • end
  • csv = new(f, options)

    # handle blocks like Ruby's open(), not like the CSV library
    if block_given?
    @@ -1398,11 +1406,8 @@ class CSV

    encoding: "UTF-32BE:UTF-8" would read UTF-32BE data from the file

    but transcode it to UTF-8 before CSV parses it.

    #

  • def self.read(path, options = Hash.new)

  • encoding = options.delete(:encoding)

  • mode = "rb"

  • mode << ":#{encoding}" if encoding

  • open(path, mode, options) { |csv| csv.read }

  • def self.read(path, *options)

  • open(path, *options) { |csv| csv.read }
    end

    Alias for CSV::read().

    diff --git i/test/csv/test_encodings.rb w/test/csv/test_encodings.rb
    index 3880f3a..54c34f3 100755
    --- i/test/csv/test_encodings.rb
    +++ w/test/csv/test_encodings.rb
    @@ -79,6 +79,21 @@ class TestCSV::Encodings < TestCSV
    end
    end

  • def test_read_with_default_encoding

  • data = "abc"

  • default_external = Encoding.default_external

  • each_encoding do |encoding|

  •  File.open(@temp_csv_path, "wb", encoding: encoding) {|f| f << data}
    
  •  begin
    
  •    Encoding.default_external = encoding
    
  •    result = CSV.read(@temp_csv_path)[0][0]
    
  •  ensure
    
  •    Encoding.default_external = default_external
    
  •  end
    
  •  assert_equal(encoding, result.encoding)
    
  • end

  • end
    +
    #######################################################################

    Stress Test ASCII Compatible and Non-ASCII Compatible Encodings

    #######################################################################

--
Nobu Nakada
=end

#7

Updated by nobuoka (yu nobuoka) over 8 years ago

=begin
Hi,

Nobuyoshi Nakada wrote:

Now fixed so that universal_newline: false can work. What's
about the following patch?

diff --git i/lib/csv.rb w/lib/csv.rb
index 1aad2f3..d51cb55 100644
--- i/lib/csv.rb
+++ w/lib/csv.rb
@@ -1334,10 +1334,18 @@ class CSV
def self.open(*args)
# find the +options+ Hash
options = if args.last.is_a? Hash then args.pop else Hash.new end

  • # default to a binary open mode
  • args << "rb" if args.size == 1 and !options.key?(:mode)
  • # wrap a File opened with the remaining +args+
  • csv = new(File.open(*args, options), options)
  • # wrap a File opened with the remaining +args+ with no newline
  • # decorator
  • file_opts = {universal_newline: false}.merge(options)
  • begin
  • f = File.open(*args, file_opts)
  • rescue ArgumentError => e
  • throw unless /needs binmode/ =~ e.message and args.size == 1
  • args << "rb"
  • file_opts = {encoding: Encoding.default_external}.merge(file_opts)
  • retry
  • end
  • csv = new(f, options)

In +rescue+ clause, a +throw+ expression is used. Is it correct?
I think a +raise+ expression should be used instead. Or my idea is wrong...?
=end

Updated by Anonymous over 8 years ago

=begin
On Tue, Apr 26, 2011 at 3:13 PM, yu nobuoka nobuoka@r-definition.comwrote:

Issue #4603 has been updated by yu nobuoka.

Hi,

Nobuyoshi Nakada wrote:

Now fixed so that universal_newline: false can work. What's
about the following patch?

diff --git i/lib/csv.rb w/lib/csv.rb
index 1aad2f3..d51cb55 100644
--- i/lib/csv.rb
+++ w/lib/csv.rb
@@ -1334,10 +1334,18 @@ class CSV
def self.open(*args)
# find the +options+ Hash
options = if args.last.is_a? Hash then args.pop else Hash.new end

  • # default to a binary open mode
  • args << "rb" if args.size == 1 and !options.key?(:mode)
  • # wrap a File opened with the remaining +args+
  • csv = new(File.open(*args, options), options)
  • # wrap a File opened with the remaining +args+ with no newline
  • # decorator
  • file_opts = {universal_newline: false}.merge(options)
  • begin
  • f = File.open(*args, file_opts)
  • rescue ArgumentError => e
  • throw unless /needs binmode/ =~ e.message and args.size == 1
  • args << "rb"
  • file_opts = {encoding: Encoding.default_external}.merge(file_opts)
  • retry
  • end
  • csv = new(f, options)

In +rescue+ clause, a +throw+ expression is used. Is it correct?
I think a +raise+ expression should be used instead. Or my idea is
wrong...?

Good point. I think raise() is correct.

James Edward Gray II
=end

#9

Updated by nobu (Nobuyoshi Nakada) over 8 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

=begin
This issue was solved with changeset r31370.
yu, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • lib/csv.rb (CSV::open): suppress universal newline decorator. fixes #4603 =end

Updated by theirishpenguin (Declan McGrath) over 7 years ago

Hi,

I seem to be still getting this problem on Ruby 1.9.2p290 revision 32553. This issue should be fixed in revision 32553, correct? (as 32553 > 31370)

See below for test case.

Regards,
Declan

Test case

1) Contents of test.csv are:
á,b
1,2

2) Steps to reproduce issue:
declan@foo:~$ ruby -v
ruby 1.9.2p290 (2011-07-09 revision 32553) [i686-linux]
declan@foo:~$ irb
irb(main):004:0> Encoding::default_internal = 'UTF-8'
=> "UTF-8"
irb(main):007:0> Encoding::default_external = 'UTF-8'
=> "UTF-8"
irb(main):009:0> require 'csv'
=> true
irb(main):010:0> CSV.read('test.csv')
=> "\xC3\xA1", "b"], ["1", "2"

Updated by naruse (Yui NARUSE) over 7 years ago

  • Description updated (diff)

theirishpenguin (Declan McGrath) wrote:

I seem to be still getting this problem on Ruby 1.9.2p290 revision 32553.
This issue should be fixed in revision 32553, correct? (as 32553 > 31370)

No, revision numbers are repository global number.
r32553 > r31370 doesn't mean it because they are different branch.
r32553 changes trunk and ruby_1_9_3 branch but doesn't change ruby_1_9_2 branch.
r31370 is not merged to ruby_1_9_2.

Also available in: Atom PDF