Bug #931

closed

[m17n] TestCSVFeatures fails because of r20905

Added by duerst (Martin Dürst) almost 17 years ago. Updated over 14 years ago.

Status: Closed
Assignee: -
Target version: -
ruby -v:
Backport:
[ruby-core:20884]

Description

=begin
Hello James,

Akira wrote the text below, and Matz said it should
somehow get to you. I'm not sure whether Akira has the
time to do this, so here's a short summary.

Akira thinks that what you tried to do in CSV#inspect
is to somehow produce an ASCII-compatible encoding.
If that's your intent, then a simple force_encoding
won't work well for UTF-16, because it will leave
some 0x00 bytes in the string.
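
To make the 0x00 issue concrete, here is a minimal sketch (the sample string "abc" is only an illustration, not taken from the CSV code). It relabels a UTF-16BE string as ASCII-8BIT and compares the bytes with a real conversion:

```ruby
# "abc" in UTF-16BE uses two bytes per character.
s = "abc".encode("UTF-16BE")

# force_encoding only relabels the bytes; dup keeps s itself untouched.
forced = s.dup.force_encoding("ASCII-8BIT")
forced_bytes = forced.bytes        # => [0, 97, 0, 98, 0, 99]

# encode actually converts, so the NUL bytes disappear.
converted = s.encode("UTF-8")
converted_bytes = converted.bytes  # => [97, 98, 99]
```

Every other byte of the relabeled string is NUL, even though all the characters are in the ASCII range.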

What Akira proposes is to use

e = Encoding::Converter.asciicompat_encoding(s.encoding)
e ? s.encode(e) : s.force_encoding("ASCII-8BIT")

i.e. to convert to an ASCII-compatible encoding from
the current encoding if necessary and possible, otherwise
to force the data to be interpreted as ASCII-8BIT.
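
As a sketch only, the two lines above could be wrapped in a helper like the one below. The method name `to_ascii_compatible` is hypothetical, and the `dup` is an addition so that the caller's string is not relabeled in place; neither is part of Akira's proposal or of CSV.

```ruby
# Hypothetical helper around the proposed two-liner.
def to_ascii_compatible(s)
  # Returns an ASCII-compatible counterpart for encodings such as
  # UTF-16BE, or nil if s is already in an ASCII-compatible encoding.
  e = Encoding::Converter.asciicompat_encoding(s.encoding)
  # dup is an assumption added here to avoid mutating the argument.
  e ? s.encode(e) : s.dup.force_encoding("ASCII-8BIT")
end

utf16  = "a,b".encode("UTF-16BE")
result = to_ascii_compatible(utf16)   # converted, no NUL bytes left

plain  = "a,b"                        # already ASCII-compatible
forced = to_ascii_compatible(plain)   # only relabeled as ASCII-8BIT
```

With this, the UTF-16BE input comes back as UTF-8, while input that is already ASCII-compatible is only relabeled.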

I have to admit that I didn't think about UTF-16 at all.
My guess is that the above code might not (at least not
by itself) solve the problem that pieces of data with
different encodings will be concatenated: a piece in
ISO-2022-JP will be converted to something called
"stateless-ISO-2022-JP", whereas some other piece,
originally in an ASCII-compatible encoding (e.g. UTF-8),
will be forced to ASCII-8BIT.
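
The concern can be sketched as follows. The helper name `to_ascii_compatible` and the sample text are hypothetical; the helper only wraps the two proposed lines (with a `dup` added so the input is not mutated). Two non-ASCII pieces end up in different encodings and can no longer be joined:

```ruby
# Hypothetical helper wrapping the proposed two-liner.
def to_ascii_compatible(s)
  e = Encoding::Converter.asciicompat_encoding(s.encoding)
  e ? s.encode(e) : s.dup.force_encoding("ASCII-8BIT")
end

utf8 = "\u65E5\u672C\u8A9E"          # "日本語" as a UTF-8 string
jis  = utf8.encode("ISO-2022-JP")    # the same text in ISO-2022-JP

a = to_ascii_compatible(jis)         # converted to stateless-ISO-2022-JP
b = to_ascii_compatible(utf8)        # relabeled as ASCII-8BIT

outcome =
  begin
    a + b
    :joined
  rescue Encoding::CompatibilityError
    # Both pieces contain non-ASCII bytes in different encodings.
    :incompatible
  end
```

Pieces that contain only ASCII characters would still concatenate, so the failure shows up only with non-ASCII data.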

On the same problem, Yugui suggested that the encoding
of the string returned by inspect should be the encoding
of the file.

Regards, Martin.

Date: Fri, 26 Dec 2008 13:13:29 +0900
From: Tanaka Akira
Subject: [ruby-dev:37603] Re: [BUG:trunk] [m17n] TestCSVFeatures fails
because of r20905
To: (ruby developers list)

In article ,
"NARUSE, Yui" writes:

Intuitively, I feel that String#encode("ASCII-8BIT") should
have the same effect as String#force_encode("ASCII-8BIT").

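That does not seem very intuitive to me. encode is supposed to
convert the byte sequence so that the characters are preserved,
but that would not be the case here.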

Looking at CSV#inspect, I sense that the intent is to get an
ASCII-compatible encoding. Is that wrong? A safeguard for when
UTF-16 comes in, so to speak.

Considering UTF-16, if you use force_encoding, then even when
the contents are within the ASCII range as characters, a \0
ends up in every other position, which is hardly desirable.

I don't remember exactly how the discussion about UTF-16 turned
out, but if the conclusion was that UTF-16 need not be handled,
how about simply removing .encode("ASCII-8BIT")?

Alternatively, if UTF-16 is to be handled, the idea would be to
convert to the ASCII-compatible encoding that corresponds to
UTF-16, along these lines:

e = Encoding::Converter.asciicompat_encoding(s.encoding)
e ? s.encode(e) : s.force_encoding("ASCII-8BIT")

How about something like that?

[田中 哲][たなか あきら][Tanaka Akira]

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp
=end
