Ruby Issue Tracking System (https://redmine.ruby-lang.org/)<br>
Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch<br>
https://redmine.ruby-lang.org/issues/680?journal_id=1446 | 2008-10-24T15:54:30Z | duerst (Martin Dürst) duerst@it.aoyama.ac.jp
<ul></ul><p>=begin<br>
A default for the source encoding has been discussed quite a long<br>
time ago (in some Japanese meetings or on ruby-dev, I don't remember),<br>
and the conclusion was that the source encoding has to be given<br>
(with a magic comment) in the file itself (unless the file is all ASCII).</p>
<p>The reason for this is that the source encoding is a property of the<br>
source, and nothing else. On very simple scripts, it might occasionally<br>
be slightly easier if it were the same as default_external or<br>
default_internal, but this is only the case as long as you stay<br>
in exactly the same environment, and don't move the script.<br>
But scripts grow and move, so it's better to get the settings<br>
right at the start.</p>
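<p>A minimal sketch of the point above, assuming a current Ruby: the source encoding travels with the file through its magic comment, independent of the environment. The helper <code>source_encoding_of</code> and the global <code>$demo_source_encoding</code> are hypothetical names used only for this illustration.</p>

```ruby
require "tempfile"

# Hypothetical helper: writes `source` to a temporary .rb file, loads it,
# and returns the value the snippet stored in $demo_source_encoding.
def source_encoding_of(source)
  Tempfile.create(["enc_demo", ".rb"]) do |file|
    file.binmode
    file.write(source)
    file.flush
    load file.path
  end
  $demo_source_encoding
end

# The same body, declared with two different magic comments.
body = "$demo_source_encoding = __ENCODING__\n"
utf8_result = source_encoding_of("# -*- coding: utf-8 -*-\n" + body)
sjis_result = source_encoding_of("# -*- coding: shift_jis -*-\n" + body)

puts utf8_result
puts sjis_result
```

<p>Each loaded file reports the encoding named in its own magic comment, whatever default_external or default_internal happen to be.</p>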
<p>However, as far as I remember, the idea was that for -e,<br>
default_external should be used, because that's what one<br>
is using in a shell. I'm not sure why this doesn't work below.<br>
(assuming Takeyuki is working in a Shift_JIS environment,<br>
which isn't completely sure).</p>
<p>Regards, Martin.</p>
<p>At 12:12 08/10/24, Michael Selig wrote:</p>
<blockquote>
<p>Hi,</p>
<p>This bug actually brings up an interesting issue - should the source<br>
encoding default to something other than UTF-8 (ie: if it is not specified<br>
in the "magic comment")?</p>
<p>Perhaps it should default to the encoding specified by the user's locale?<br>
Or perhaps it should default to the value of "default_internal" if it is<br>
set? Or even default_external?</p>
<p>I suggest that it should default to "default_internal" if that is set, and<br>
then to the locale encoding if not.</p>
<p>What do others think?<br>
Having it default to the locale in this case would probably avoid the<br>
encoding mismatch entirely (and the resulting confusion).</p>
<p>Cheers<br>
Mike</p>
<p>On Fri, 24 Oct 2008 11:58:33 +1100, Takeyuki Fujioka<br>
<a href="mailto:redmine@ruby-lang.org" class="email">redmine@ruby-lang.org</a> wrote:</p>
<blockquote>
<p>Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: csv.rb: CSV.parse is too late when encoding is mismatch (Closed)" href="https://redmine.ruby-lang.org/issues/680">#680</a>: csv.rb: CSV.parse is too late when encoding is mismatch<br>
<a href="http://redmine.ruby-lang.org/issues/show/680" class="external">http://redmine.ruby-lang.org/issues/show/680</a></p>
<p>Author: Takeyuki Fujioka<br>
Status: Open, Priority: Normal<br>
Category: lib, Target version: 1.9.x</p>
<p>I think this result is correct, but the encoding-mismatch error is raised too late.</p>
<p>see:<br>
% time ruby19 -rcsv -e<br>
'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000).force_encoding("shift_jis"))'<br>
ruby19 -rcsv -e 0.30s user 0.02s system 96% cpu 0.330 total</p>
<p>% time ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000))'<br>
/Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `=~': broken UTF-8 string (ArgumentError)<br>
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `init_separators'<br>
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1563:in `initialize'<br>
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `new'<br>
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `parse'<br>
from -e:1:in `&lt;main&gt;'<br>
ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000))' 1.55s user<br>
2.57s system 90% cpu 4.530 total</p>
<hr>
<p><a href="http://redmine.ruby-lang.org" class="external">http://redmine.ruby-lang.org</a></p>
</blockquote>
</blockquote>
<p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br>
#-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1448 | 2008-10-24T17:06:30Z | matz (Yukihiro Matsumoto) matz@ruby.or.jp
<ul></ul><p>=begin<br>
Hi,</p>
<p>In message "Re: <a href="https://blade.ruby-lang.org/ruby-core/19473">[ruby-core:19473]</a> Re: Default source encoding (Was: [Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: csv.rb: CSV.parse is too late when encoding is mismatch (Closed)" href="https://redmine.ruby-lang.org/issues/680">#680</a>] csv.rb: CSV.parse is too late when encoding is mismatch)"<br>
on Fri, 24 Oct 2008 16:48:04 +0900, "Michael Selig" <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a> writes:</p>
<p>|The problem I am trying to solve is the compatibility of string literals<br>
|in your source and strings from other sources.<br>
|<br>
|"default_internal" was introduced to try to make all strings the same<br>
|encoding to avoid incompatibilities. But at the moment string literals<br>
|seem to default to the source encoding or to UTF-8 if it is not set<br>
|(please correct me if I am wrong). What I was suggesting was a way to make<br>
|string literals be compatible.</p>
<p>You are correct here.</p>
<p>|This normally isn't a problem if:<br>
|a) All string literals are 7 bit ASCII, or<br>
|b) The source encoding matches "default_internal"<br>
|<br>
|If the source encoding of a program containing non-ascii string literals<br>
|is set different from default_internal, you are asking for trouble, and<br>
|would defeat the purpose of default_internal. Therefore to prevent the<br>
|programmer from having to remember to specify both, it makes sense to me<br>
|that the source encoding should default to default_internal. I think this<br>
|is important.</p>
<p>The point is that when source code is written in a given source<br>
encoding, its literals are naturally encoded in that encoding. So do we<br>
need to convert string literals into the default encoding? But<br>
conversion can bring more trouble, since it tends to change the<br>
meaning. For example, what does /[&lt;a&gt;-&lt;b&gt;]/ mean when &lt;a&gt; and &lt;b&gt; are<br>
multibyte characters whose corresponding codepoints (and sorting<br>
order) differ in the converted encoding?</p>
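<p>To make the ordering concern concrete, here is a small sketch, assuming Ruby's built-in Shift_JIS transcoder. It compares the codepoint order of two characters (U+3042, hiragana A, and U+FF71, half-width katakana A) in UTF-8 and in Shift_JIS. The two characters are chosen for illustration and do not come from the thread.</p>

```ruby
hiragana_a  = "\u3042"  # HIRAGANA LETTER A
halfwidth_a = "\uFF71"  # HALFWIDTH KATAKANA LETTER A

# Codepoints of the pair as seen in each encoding.
utf8_codepoints = [hiragana_a, halfwidth_a].map { |ch| ch.ord }
sjis_codepoints = [hiragana_a, halfwidth_a].map { |ch| ch.encode("Shift_JIS").ord }

# True when the first character sorts before the second one.
utf8_ascending = (utf8_codepoints == utf8_codepoints.sort)
sjis_ascending = (sjis_codepoints == sjis_codepoints.sort)

puts "UTF-8 order ascending:     #{utf8_ascending}"
puts "Shift_JIS order ascending: #{sjis_ascending}"
```

<p>Because the relative order flips, a character range written with these two endpoints would not describe the same set after conversion.</p>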
<p>|(By the way, I am not talking about libraries here. As I have stressed<br>
|previously, libraries should be carefully written to either use ASCII<br>
|string literals only, or to make sure that it transcodes them properly.)</p>
<p>That makes me feel much better, so we can limit the issue to<br>
scripts only.</p>
<p>|Finally, are you suggesting that "-e" should perform differently to a<br>
|single-line ruby script? That seems non-intuitive to me.</p>
<p>-e takes programs from command line shell, which probably yields<br>
strings in locale encoding anyway. But we cannot assume that for<br>
scripts contained in files.</p>
<pre><code> matz.
</code></pre>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1450 | 2008-10-24T19:25:30Z | xibbar (Takeyuki FUJIOKA) xibbar@gmail.com
<ul><li><strong>File</strong> <a href="/attachments/116">sample.csv</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/116/sample.csv">sample.csv</a> added</li></ul><p>=begin<br>
Please save the attached file as 'sample.csv'.<br>
The file contains a Japanese UTF-8 string in its first line.<br>
All other lines are US-ASCII. The file has 5001 lines.</p>
<p>% time ruby19 -Eutf-8 -rcsv -e 'CSV.parse(open("sample.csv","r").read)'<br>
ruby19 -Eutf-8 -rcsv -e 'CSV.parse(open("sample.csv","r").read)' 0.23s user 0.01s system 96% cpu 0.254 total</p>
<p>This is fine and very fast.<br>
But:</p>
<p>% time ruby19 -Eeuc-jp -rcsv -e 'CSV.parse(open("sample.csv","r").read)'<br>
/Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `=~': broken EUC-JP string (ArgumentError)<br>
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `init_separators'<br>
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1563:in `initialize'<br>
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `new'<br>
from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `parse'<br>
from -e:1:in `&lt;main&gt;'<br>
ruby19 -Eeuc-jp -rcsv -e 'CSV.parse(open("sample.csv","r").read)' 3.93s user 6.38s system 98% cpu 10.457 total</p>
<p>This result is very slow.<br>
I hope the error can be raised as soon as the encoding mismatch is found.</p>
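<p>One way to picture "raise as soon as the mismatch is found" is to check the string's validity up front, before any parsing work. The sketch below is an assumption-laden illustration, not the change made in csv.rb: <code>ensure_valid_encoding!</code> is a hypothetical helper, and the data reuses the bytes from the original report.</p>

```ruby
# Hypothetical helper, not part of csv.rb: fail fast before any parsing work.
def ensure_valid_encoding!(text)
  return text if text.valid_encoding?
  raise ArgumentError, "broken #{text.encoding} string"
end

# Raw bytes only; no encoding is claimed yet.
raw = ("\x82\xA0,\x82\xA2\n" * 10_000).b

as_sjis = raw.dup.force_encoding("Shift_JIS")
as_utf8 = raw.dup.force_encoding("UTF-8")

# Passes: the bytes form valid Shift_JIS characters.
ensure_valid_encoding!(as_sjis)

# Fails immediately: the same bytes are not valid UTF-8.
early_error = nil
begin
  ensure_valid_encoding!(as_utf8)
rescue ArgumentError => e
  early_error = e.message
end

puts early_error
```

<p>The check is a single pass over the data, so a mismatch is reported without first attempting to split the input into rows and fields.</p>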
<p># Sorry, I don't understand M17N's default_external and default_internal behavior.<br>
# I can't reply about M17N's problem.</p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1453 | 2008-10-25T01:01:26Z | nobu (Nobuyoshi Nakada) nobu@ruby-lang.org
<ul></ul><p>=begin<br>
Hi,</p>
<p>At Fri, 24 Oct 2008 23:00:27 +0900,<br>
James Gray wrote in <a href="https://blade.ruby-lang.org/ruby-core/19481">[ruby-core:19481]</a>:</p>
<blockquote>
<blockquote>
<p>I work on TextMate and we use Ruby all over the place inside that<br>
application. I'm sure we have hundreds of scripts in there. We try<br>
hard to make sure everything in TextMate is UTF-8, so now we get<br>
errors out of Ruby 1.9. To fix, we need to add hundreds of magic<br>
comments and worse, train our users who often write their own<br>
automations in Ruby why they have to do the same to make their code<br>
work.</p>
</blockquote>
<p>The real issue here is that you can argue the user doesn't even know<br>
the proper encoding these scripts should be using. Only TextMate<br>
really knows the encoding it's going to hand-off the data in.</p>
</blockquote>
<p>Though I don't know about TextMate at all, ruby-mode.el in 1.9<br>
deals with magic comments automatically.</p>
<p>--<br>
Nobu Nakada</p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1456 | 2008-10-25T09:57:06Z | JEG2 (James Gray) jeg2@ruby-lang.org
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Closed</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li></ul><p>=begin<br>
Applied in changeset r19931.<br>
=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1457 | 2008-10-25T09:58:38Z | JEG2 (James Gray) jeg2@ruby-lang.org
<ul><li><strong>Assignee</strong> set to <i>JEG2 (James Gray)</i></li></ul><p>=begin<br>
Thanks for finding the bug in my logic. It should be much faster now:</p>
<p>$ time ruby_dev -Eeuc-jp -rlib/csv -e 'CSV.parse(open("/Users/james/Desktop/sample.csv","r").read)'<br>
/Users/james/Documents/ruby_source/trunk/lib/csv.rb:1981:in `=~': broken EUC-JP string (ArgumentError)<br>
from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1981:in `init_separators'<br>
from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1563:in `initialize'<br>
from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1350:in `new'<br>
from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1350:in `parse'<br>
from -e:1:in `&lt;main&gt;'</p>
<p>real 0m0.053s<br>
user 0m0.039s<br>
sys 0m0.011s</p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1460 | 2008-10-26T15:26:58Z | nobu (Nobuyoshi Nakada) nobu@ruby-lang.org
<ul></ul><p>=begin<br>
Hi,</p>
<p>At Sun, 26 Oct 2008 11:25:58 +0900,<br>
Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19515">[ruby-core:19515]</a>:</p>
<blockquote>
<p>1.</p>
<p>My preference would be to <em>always</em> encode string literals constructed with<br>
"\x.." as ASCII-8BIT, ignoring the source encoding. This means that if you<br>
really want to use such a literal as an encoded string, you must use<br>
"force_encoding". I think this would be much clearer and get rid of the<br>
"ambiguity".</p>
</blockquote>
<blockquote>
<p>2.</p>
<p>My suggestion for "defaulting" the source encoding was an attempt to avoid<br>
having to do this (but probably not a good way!). It isn't a big deal, and<br>
I understand the argument that the source encoding is a property of the<br>
script. My original suggestion (last month) of a special magic comment was<br>
to have a way of specifying BOTH the default_internal and source encoding<br>
once, but this idea was rejected.</p>
</blockquote>
<p>I'd prefer to default the internal encoding to the source<br>
encoding of the main script.</p>
<blockquote>
<p>3.</p>
<p>Perhaps this check could be based on the library's source encoding? If<br>
this were done, most libraries would have to use a source encoding of<br>
US-ASCII (or just have no encoding magic comment) <em>not</em> UTF-8, so that<br>
non-Unicode default_internal's will work. Perhaps Ruby could be smarter,<br>
and only flag an error if there actually is an incompatible string literal<br>
in the library?</p>
</blockquote>
<p>What about comments? I suspect it might not be a good idea.</p>
<blockquote>
<p>4.</p>
<p>Also it means that:<br>
ruby test.rb<br>
may perform differently than:<br>
ruby -e "`cat test.rb`"</p>
</blockquote>
<p>magic comments are effective with -e too.</p>
<p>$ ruby19 -e 'p __ENCODING__'<br>
#&lt;Encoding:EUC-JP&gt;</p>
<p>$ ruby19 -e '#-*- encoding:utf-8 -*-' -e 'p __ENCODING__'<br>
#&lt;Encoding:UTF-8&gt;</p>
<p>Therefore no differences if the file has the magic comment.</p>
<p>--<br>
Nobu Nakada</p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1463 | 2008-10-26T21:34:52Z | nobu (Nobuyoshi Nakada) nobu@ruby-lang.org
<ul></ul><p>=begin<br>
Hi,</p>
<p>At Sun, 26 Oct 2008 17:20:17 +0900,<br>
Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19518">[ruby-core:19518]</a>:</p>
<blockquote>
<blockquote>
<p>I'd prefer to default the internal encoding to the source<br>
encoding of the main script.</p>
</blockquote>
<p>But then how do you tell Ruby NOT to set "default_internal"?</p>
</blockquote>
<p>I think defaulting the internal encoding to something other is<br>
bad.</p>
<blockquote>
<p>It also means that comments must be in the default_internal encoding (see<br>
your comment below).</p>
</blockquote>
<p>I don't follow you here. All comments should be written in the<br>
source encoding. Why would default_internal affect them?</p>
<blockquote>
<blockquote>
<p>Therefore no differences if the file has the magic comment.</p>
</blockquote>
<p>That's true, but my point was "why should a simple non-m17n non-ascii ruby<br>
program have to contain the magic comment"?</p>
</blockquote>
<p>Because, non-ascii. It's definitely enough reason.</p>
<p>--<br>
Nobu Nakada</p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1466 | 2008-10-27T14:08:26Z | nobu (Nobuyoshi Nakada) nobu@ruby-lang.org
<ul></ul><p>=begin<br>
Hi,</p>
<p>At Mon, 27 Oct 2008 07:28:42 +0900,<br>
Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19525">[ruby-core:19525]</a>:</p>
<blockquote>
<p>Yes you are right, and I was not suggesting doing that.<br>
But Matz wants to default default_internal to nil. With your proposal, how<br>
do you do that and still set the source encoding?</p>
</blockquote>
<p>I don't like the idea of setting default_internal from the source<br>
encoding; by "prefer" I only meant that it feels less bad.</p>
<blockquote>
<p>My original suggestion was to use an extended "magic comment" to set both.</p>
</blockquote>
<p>But it can't keep the source encoding unset, and<br>
"internal_encoding" has no effect for Emacs.</p>
<blockquote>
<p>Isn't backward compatibility with 1.8 scripts more important?<br>
You are now forcing anyone with 1.8 scripts containing non-ascii string<br>
literals to put in a magic comment, otherwise you get "invalid multibyte<br>
char (US-ASCII)" error in 1.9.</p>
</blockquote>
<p>In other words, what you want is -K option?</p>
<p>--<br>
Nobu Nakada</p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1467 | 2008-10-27T15:28:29Z | nobu (Nobuyoshi Nakada) nobu@ruby-lang.org
<ul></ul><p>=begin<br>
Hi,</p>
<p>At Mon, 27 Oct 2008 14:48:41 +0900,<br>
Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19532">[ruby-core:19532]</a>:</p>
<blockquote>
<p>OK, I don't use Emacs, and no one told me that before, thanks! I assumed<br>
it would work, but I admit I didn't test it.<br>
Then is there another form of magic comment that can be used - eg:<br>
"internal encoding: XXXX" or "encoding: XXXX internal" that does work with<br>
Emacs?</p>
</blockquote>
<p>No. Magic comments without -*- markers are for VIM, like</p>
<p># vim: set encoding=UTF-8</p>
<p>and neither VIM nor Emacs would work with your examples.</p>
<blockquote>
<p>What I am saying is that we need to consider backward compatibility of<br>
Ruby scripts. James Gray brought up an example with his "Textmate scripts"<br>
which contain UTF-8 multibyte string literals, which used to work with<br>
1.8, but do not in 1.9, because they need either a "magic comment" or, as<br>
you say "-KU". Either way, 1.9 is not truly backward compatible when it<br>
comes to simple, non-m17n, non-ascii scripts, because you have to either<br>
modify the script or add a flag to the ruby options. There must be lots of<br>
Japanese ruby scripts which will have a similar issue.</p>
</blockquote>
<p>Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS<br>
sources, so they have had -K in the shebang lines already.</p>
<blockquote>
<p>Defaulting source encoding to locale encoding (like -e does) should fix<br>
this (as long as the end-user's locale is correct), right?</p>
</blockquote>
<p>Yes if they match.</p>
<blockquote>
<p>I guess if necessary James can put "-KU" in the RUBYOPT environment<br>
variable to save having to add multiple magic comments, but I feel this<br>
shouldn't be necessary.</p>
</blockquote>
<p>-U option would be better.</p>
<p>--<br>
Nobu Nakada</p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1468 | 2008-10-27T18:56:06Z | nobu (Nobuyoshi Nakada) nobu@ruby-lang.org
<ul></ul><p>=begin<br>
Hi,</p>
<p>At Mon, 27 Oct 2008 15:57:03 +0900,<br>
Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19535">[ruby-core:19535]</a>:</p>
<blockquote>
<blockquote>
<p>Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS<br>
sources, so they have had -K in the shebang lines already.</p>
</blockquote>
<p>Why then can I write a ruby 1.8 script which does a "puts" of a Shift_JIS<br>
string (no shebang or magic comment), and have it run fine without -Ks?</p>
</blockquote>
<p>Because you are avoiding troublesome chars. Without such<br>
chars, we can't write the words "display", "table", "software"<br>
and "ruby".</p>
<blockquote>
<blockquote>
<blockquote>
<p>I guess if necessary James can put "-KU" in the RUBYOPT environment<br>
variable to save having to add multiple magic comments, but I feel this<br>
shouldn't be necessary.</p>
</blockquote>
<p>-U option would be better.</p>
</blockquote>
<p>I don't think that will work:</p>
<p>t2.rb is a single line script which does a puts of a short UTF-8 multibyte<br>
string.</p>
</blockquote>
<p>Indeed. -U sets only the internal encoding, whereas -Ku also sets the<br>
external and source encodings. Therefore -U isn't a direct<br>
replacement for -Ku.</p>
<p>But it's very ambiguous and dangerous to imply encodings. We<br>
can't trust locale for this purpose, at least.</p>
<p>You can use BOM to mean that the source is written in UTF-8.</p>
<p>--<br>
Nobu Nakada</p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1469 | 2008-10-27T19:38:37Z | duerst (Martin Dürst) duerst@it.aoyama.ac.jp
<ul></ul><p>=begin<br>
At 07:28 08/10/27, Michael Selig wrote:</p>
<blockquote>
<p>I thought one of your points was that you would like to be able to write<br>
Japanese (or other non-ascii) comments which is otherwise only ascii<br>
(which may use "\u" in literals, and want default_internal to be UTF-8).<br>
This means that the source encoding should be Japanese. Your suggestion of<br>
defaulting default_internal to the source encoding means that it will be<br>
set to Japanese. I am not sure that this is always desirable. (This is<br>
very minor - you can always override it)</p>
</blockquote>
<p>I'm not sure what you mean by "Japanese". It's no problem at all<br>
to use UTF-8 to write Japanese. And I guess if somebody uses<br>
\u literals and wants default_internal to be UTF-8, they'll<br>
in most cases use UTF-8 for the source encoding (comments or<br>
whatever else).</p>
<p>If you mean Japanese legacy encodings (such as Shift_JIS and<br>
EUC-JP), then you are correct, but it would be very rare<br>
for somebody to use Shift_JIS or EUC-JP for comments when<br>
the program is otherwise supposed to run all-UTF-8.</p>
<blockquote>
<p>Isn't backward compatibility with 1.8 scripts more important?<br>
You are now forcing anyone with 1.8 scripts containing non-ascii string<br>
literals to put in a magic comment, otherwise you get "invalid multibyte<br>
char (US-ASCII)" error in 1.9.</p>
</blockquote>
<p>Well, yes, that's actually the point of it. Wherever necessary,<br>
get everybody to declare their encoding. It may be somewhat suboptimal<br>
in the transition phase, but after that, we know what we're dealing<br>
with.</p>
<p>Regards, Martin.</p>
<p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br>
#-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1470 | 2008-10-27T19:38:45Z | duerst (Martin Dürst) duerst@it.aoyama.ac.jp
<ul></ul><p>=begin<br>
At 14:48 08/10/27, Michael Selig wrote:</p>
<blockquote>
<p>I am not sure why you would want to keep the source encoding unset when<br>
setting default_internal at the top of a script. Perhaps you could explain.</p>
</blockquote>
<p>The simplest case is a script in US-ASCII only, but where you want<br>
the data to be handled e.g. in UTF-8.</p>
<p>Regards, Martin.</p>
<p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br>
#-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1471 | 2008-10-27T19:39:53Z | duerst (Martin Dürst) duerst@it.aoyama.ac.jp
<ul></ul><p>=begin<br>
At 12:24 08/10/27, James Gray wrote:</p>
<blockquote>
<p>They sure could, yeah. Our policy for TextMate development has always<br>
been that UTF-8 is king. We use it heavily and I'm sure some scripts<br>
do contain multibyte characters in UTF-8.</p>
</blockquote>
<p>Wouldn't it be only these scripts (including those that contain<br>
\x escapes for UTF-8) that need the encoding indication at the top?<br>
(please note that literals with \u escapes are automatically UTF-8).</p>
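<p>The remark about \u escapes can be checked with a short sketch, assuming a current Ruby. A script declared as US-ASCII is written to a temporary file and loaded; the global <code>$demo_literal_encodings</code> and the file name prefix are hypothetical names used only here.</p>

```ruby
require "tempfile"

# A tiny script whose declared source encoding is US-ASCII.
# Single quotes keep the backslash-u sequence literal in the generated file.
script = [
  "# -*- coding: us-ascii -*-",
  '$demo_literal_encodings = { plain: "abc".encoding, u_escape: "\u3042".encoding }',
  ""
].join("\n")

Tempfile.create(["literal_demo", ".rb"]) do |file|
  file.write(script)
  file.flush
  load file.path
end

encodings = $demo_literal_encodings
encodings.each { |kind, enc| puts "#{kind}: #{enc}" }
```

<p>The plain literal follows the declared source encoding, while the literal containing a \u escape is UTF-8 regardless of it.</p>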
<p>Regards, Martin.</p>
<p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br>
#-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1472 | 2008-10-27T19:59:29Z | duerst (Martin Dürst) duerst@it.aoyama.ac.jp
<ul></ul><p>=begin<br>
At 19:17 08/10/27, Michael Selig wrote:</p>
<blockquote>
<p>On Mon, 27 Oct 2008 20:55:32 +1100, Nobuyoshi Nakada <a href="mailto:nobu@ruby-lang.org" class="email">nobu@ruby-lang.org</a><br>
wrote:</p>
<blockquote>
<p>Hi,</p>
<p>At Mon, 27 Oct 2008 15:57:03 +0900,<br>
Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19535">[ruby-core:19535]</a>:</p>
<blockquote>
<blockquote>
<p>Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS<br>
sources, so they have had -K in the shebang lines already.</p>
</blockquote>
<p>Why then can I write a ruby 1.8 script which does a "puts" of a<br>
Shift_JIS<br>
string (no shebang or magic comment), and have it run fine without -Ks?</p>
</blockquote>
<p>Because you are avoiding troublesome chars. Without such<br>
chars, we can't write the words "display", "table", "software"<br>
and "ruby".</p>
</blockquote>
<p>OK, I'm sure you know more about Japanese encodings than I do.</p>
</blockquote>
<p>To give you the details, these characters, in Shift_JIS, are<br>
encoded with two bytes, the second of which is the same byte<br>
as e.g. a backslash.</p>
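<p>As a concrete check of the byte-level detail above, the following sketch (assuming Ruby's built-in Shift_JIS transcoder) encodes two such characters and looks at their final byte. U+8868 is the first character of the Japanese words for "display" and "table", and U+30BD is the first character of "software" in katakana.</p>

```ruby
BACKSLASH_BYTE = 0x5C

samples = { "U+8868" => "\u8868", "U+30BD" => "\u30BD" }

# Final byte of each character once encoded as Shift_JIS.
trailing_bytes = samples.transform_values do |char|
  char.encode("Shift_JIS").bytes.last
end

trailing_bytes.each do |name, byte|
  puts format("%s ends with 0x%02X in Shift_JIS", name, byte)
end
```

<p>A parser that is not told the source is Shift_JIS sees that final byte as an ordinary backslash, which is why such scripts needed -Ks.</p>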
<blockquote>
<p>But my original point is that 1.8 scripts exist which contain multibyte<br>
characters (eg UTF-8) which work fine under 1.8 without -K, but will fail<br>
under 1.9 unless a magic comment or -K is provided.</p>
</blockquote>
<p>Yes, that's because 1.8 is essentially garbage-in, garbage-out.<br>
If you are careful about certain bytes, you can essentially have<br>
arbitrary byte sequences in your script, and Ruby 1.8 won't complain.</p>
<p>Regards, Martin.</p>
<p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br>
#-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1473 | 2008-10-27T21:07:51Z | nobu (Nobuyoshi Nakada) nobu@ruby-lang.org
<ul></ul><p>=begin<br>
Hi,</p>
<p>At Mon, 27 Oct 2008 19:17:45 +0900,<br>
Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19540">[ruby-core:19540]</a>:</p>
<blockquote>
<p>But my original point is that 1.8 scripts exist which contain multibyte<br>
characters (eg UTF-8) which work fine under 1.8 without -K, but will fail<br>
under 1.9 unless a magic comment or -K is provided.</p>
</blockquote>
<p>It just seemed working by chance.</p>
<blockquote>
<blockquote>
<p>But it's very ambiguous and dangerous to imply encodings. We<br>
can't trust locale for this purpose, at least.</p>
</blockquote>
<p>It's a trade-off between that and backward compatibility. I think the<br>
"danger" is not high and it gives backward compatibility, so my vote would<br>
be to use it.</p>
</blockquote>
<p>And it will suddenly crash or behave weirdly when moved to other<br>
locales.</p>
<p>Anyway, I think I understand the need to specify the source<br>
encoding without magic comments. Would an option for that<br>
purpose be an acceptable solution?</p>
<p>--<br>
Nobu Nakada</p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1474 | 2008-10-27T21:12:18Z | nobu (Nobuyoshi Nakada) nobu@ruby-lang.org
<ul></ul><p>=begin<br>
Hi,</p>
<p>At Mon, 27 Oct 2008 19:37:58 +0900,<br>
Martin Duerst wrote in <a href="https://blade.ruby-lang.org/ruby-core/19541">[ruby-core:19541]</a>:</p>
<blockquote>
<p>If you mean Japanese legacy encodings (such as Shift_JIS and<br>
EUC-JP), then you are correct, but it would be very rare<br>
for somebody to use Shift_JIS or EUC-JP for comments when<br>
the program is otherwise supposed to run all-UTF-8.</p>
</blockquote>
<p>I don't do it of course, but know that some people love to do<br>
it.</p>
<p>--<br>
Nobu Nakada</p>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1476 | 2008-10-28T01:16:04Z | matz (Yukihiro Matsumoto) matz@ruby.or.jp
<ul></ul><p>=begin<br>
Hi,</p>
<p>In message "Re: <a href="https://blade.ruby-lang.org/ruby-core/19550">[ruby-core:19550]</a> Re: String literal encoding (Was: Default source encoding (Was: [Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: csv.rb: CSV.parse is too late when encoding is mismatch (Closed)" href="https://redmine.ruby-lang.org/issues/680">#680</a>] csv.rb: CSV.parse is too late when encoding is mismatch))"<br>
on Tue, 28 Oct 2008 00:12:46 +0900, James Gray <a href="mailto:james@grayproductions.net" class="email">james@grayproductions.net</a> writes:</p>
<p>|I wasn't aware -KU still worked though, as Michael pointed out. I<br>
|thought for sure I had tried that and got a warning about it being<br>
|ignored now.<br>
|<br>
|It may be that the TextMate team could use that. What all does it set<br>
|in 1.9? Source encoding obviously. It seems to affect<br>
|default_external as well, but not touch default_internal. Do I have<br>
|that right? Does it have any other special effects?</p>
<p>-Ku (or -KU) specifies to</p>
<ul>
<li>default script encoding to be UTF-8</li>
<li>default_external encoding to be UTF-8 unless it's specified<br>
previously by -E or -U</li>
<li>do not touch default_internal</li>
</ul>
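<p>The three settings listed above can be inspected from inside a script. This is only a small observation aid, assuming a current Ruby; the hash name <code>settings</code> is illustrative.</p>

```ruby
# Collect the three encodings the discussion distinguishes.
settings = {
  source: __ENCODING__,                       # encoding of this source file
  default_external: Encoding.default_external, # assumed for data read from IO
  default_internal: Encoding.default_internal  # nil unless explicitly set
}

settings.each { |name, enc| puts "#{name}: #{enc.inspect}" }
```

<p>Running the same file under different command-line switches and comparing the output shows which of the three values a given switch touches.</p>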
<p>|Will -KU stay supported for the foreseeable future?</p>
<p>Yes.</p>
<pre><code> matz.
</code></pre>
<p>=end</p>
https://redmine.ruby-lang.org/issues/680?journal_id=1504 | 2008-10-31T18:39:33Z | nobu (Nobuyoshi Nakada) nobu@ruby-lang.org
<ul></ul><p>=begin<br>
Hi,</p>
<p>At Mon, 27 Oct 2008 21:07:16 +0900,<br>
Nobuyoshi Nakada wrote in <a href="https://blade.ruby-lang.org/ruby-core/19546">[ruby-core:19546]</a>:</p>
<blockquote>
<p>Anyway, I think I understand the need to specify the source<br>
encoding without magic comments. Would an option for that<br>
purpose be an acceptable solution?</p>
</blockquote>
<p>Here is the patch to add options:</p>
<p>--encoding=external:internal:source<br>
--external-encoding=enc<br>
--internal-encoding=enc<br>
--source-encoding=enc</p>
<pre><code>Index: ruby.c
===================================================================
--- ruby.c (revision 20075)
+++ ruby.c (working copy)
@@ -623,5 +623,5 @@ dump_option(const char *str, int len, vo
 
 static void
-set_internal_encoding_once(struct cmdline_options *opt, const char *e, int elen)
+set_option_encoding_once(const char *type, VALUE *name, const char *e, int elen)
 {
     VALUE ename;
@@ -630,27 +630,16 @@ set_internal_encoding_once(struct cmdlin
     ename = rb_str_new(e, elen);
 
-    if (opt->intern.enc.name &&
-        rb_funcall(ename, rb_intern("casecmp"), 1, opt->intern.enc.name) != INT2FIX(0)) {
+    if (*name &&
+        rb_funcall(ename, rb_intern("casecmp"), 1, *name) != INT2FIX(0)) {
         rb_raise(rb_eRuntimeError,
-                 "default_intenal already set to %s", RSTRING_PTR(opt->intern.enc.name));
+                 "%s already set to %s", type, RSTRING_PTR(*name));
     }
-    opt->intern.enc.name = ename;
+    *name = ename;
 }
 
-static void
-set_external_encoding_once(struct cmdline_options *opt, const char *e, int elen)
-{
-    VALUE ename;
-
-    if (!elen) elen = strlen(e);
-    ename = rb_str_new(e, elen);
-
-    if (opt->ext.enc.name &&
-        rb_funcall(ename, rb_intern("casecmp"), 1, opt->ext.enc.name) != INT2FIX(0)) {
-        rb_raise(rb_eRuntimeError,
-                 "default_external already set to %s", RSTRING_PTR(opt->ext.enc.name));
-    }
-    opt->ext.enc.name = ename;
-}
+#define set_internal_encoding_once(opt, e, elen) \
+    set_option_encoding_once("default_intenal", &opt->intern.enc.name, e, elen)
+#define set_external_encoding_once(opt, e, elen) \
+    set_option_encoding_once("default_extenal", &opt->ext.enc.name, e, elen)
 
 static int
@@ -956,13 +945,29 @@ proc_options(int argc, char **argv, stru
                     char *p;
                   encoding:
-                    p = strchr(s, ':');
-                    if (p) {
-                        if (p > s)
-                            set_external_encoding_once(opt, s, p-s);
-                        if (*++p)
-                            set_internal_encoding_once(opt, p, 0);
-                    }
-                    else
-                        set_external_encoding_once(opt, s, 0);
+                    do {
+# define set_encoding_part(type) \
</code></pre>
<ul>
<li>
<pre><code> if (!(p = strchr(s, ':'))) { \
</code></pre>
</li>
<li>
<pre><code> set_##type##_encoding_once(opt, s, 0); \
</code></pre>
</li>
<li>
<pre><code> break; \
</code></pre>
</li>
<li>
<pre><code> } \
</code></pre>
</li>
<li>
<pre><code> else if (p > s) { \
</code></pre>
</li>
<li>
<pre><code> set_##type##_encoding_once(opt, s, p-s); \
</code></pre>
</li>
<li>
<pre><code> }
</code></pre>
</li>
<li>
<pre><code> set_encoding_part(external);
</code></pre>
</li>
<li>
<pre><code> if (!*(s = ++p)) break;
</code></pre>
</li>
<li>
<pre><code> set_encoding_part(internal);
</code></pre>
</li>
<li>
<pre><code> if (!*(s = ++p)) break;
</code></pre>
</li>
<li>
<pre><code> set_encoding_part(source);
</code></pre>
</li>
</ul>
<p>+# undef set_encoding_part</p>
<ul>
<li>
<pre><code> } while (0);
</code></pre>
</li>
<li>
<pre><code> }
</code></pre>
</li>
<li>
<pre><code> else if (is_option_with_arg("internal-encoding", Qfalse, Qtrue)) {
</code></pre>
</li>
<li>
<pre><code> set_internal_encoding_once(opt, s, 0);
</code></pre>
</li>
<li>
<pre><code> }
</code></pre>
</li>
<li>
<pre><code> else if (is_option_with_arg("external-encoding", Qfalse, Qtrue)) {
</code></pre>
</li>
<li>
<pre><code> set_external_encoding_once(opt, s, 0);
</code></pre>
</li>
<li>
<pre><code> }
</code></pre>
</li>
<li>
<pre><code> else if (is_option_with_arg("source-encoding", Qfalse, Qtrue)) {
</code></pre>
</li>
<li>
<pre><code> set_source_encoding_once(opt, s, 0);
}
else if (strcmp("version", s) == 0) {
</code></pre>
</li>
</ul>
<p></p>
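<p>For readers who want to see the colon-splitting logic of the proposed <code>--encoding=external:internal:source</code> option outside of ruby.c, here is a minimal standalone sketch. It is not the patch itself: <code>struct enc_parts</code>, <code>copy_part</code>, <code>parse_encoding_arg</code> and <code>ENC_NAME_MAX</code> are hypothetical names invented for illustration, and the sketch simply overwrites each part instead of applying the "set once" check. It is meant to mirror how an empty part (as in <code>:UTF-8</code>) leaves the corresponding setting untouched.</p>

```c
#include <stddef.h>
#include <string.h>

#define ENC_NAME_MAX 64  /* arbitrary buffer size chosen for this sketch */

/* Hypothetical holder for the three parts; not a structure from ruby.c. */
struct enc_parts {
    char external[ENC_NAME_MAX];
    char internal[ENC_NAME_MAX];
    char source[ENC_NAME_MAX];
};

/* Copy at most len bytes of s into dst and NUL-terminate, truncating if needed. */
static void copy_part(char *dst, const char *s, size_t len)
{
    if (len >= ENC_NAME_MAX) len = ENC_NAME_MAX - 1;
    memcpy(dst, s, len);
    dst[len] = '\0';
}

/* Split "external:internal:source"; empty or missing parts are left untouched. */
static void parse_encoding_arg(const char *s, struct enc_parts *out)
{
    char *slots[3];
    int i;

    slots[0] = out->external;
    slots[1] = out->internal;
    slots[2] = out->source;

    if (!*s) return;                    /* nothing to parse */
    for (i = 0; i < 3; i++) {
        const char *p = strchr(s, ':');
        if (!p) {                       /* last part: runs to end of string */
            copy_part(slots[i], s, strlen(s));
            break;
        }
        if (p > s)                      /* non-empty part before the colon */
            copy_part(slots[i], s, (size_t)(p - s));
        s = p + 1;                      /* step past the colon */
        if (!*s) break;                 /* trailing colon: nothing more to set */
    }
}
```

<p>Under these assumptions, calling <code>parse_encoding_arg("::EUC-JP", &amp;parts)</code> would fill only the <code>source</code> field, which corresponds to the case where only the source encoding is given on the command line.</p>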
<p>--<br>
Nobu Nakada</p>
<p>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch<br>
https://redmine.ruby-lang.org/issues/680?journal_id=1505<br>
2008-10-31T18:59:11Z<br>
nobu (Nobuyoshi Nakada) nobu@ruby-lang.org</p>
<p>
Hi,</p>
<p>At Fri, 31 Oct 2008 18:38:24 +0900,<br>
Nobuyoshi Nakada wrote in <a href="https://blade.ruby-lang.org/ruby-core/19655">[ruby-core:19655]</a>:</p>
<blockquote>
<pre><code>+#define set_internal_encoding_once(opt, e, elen) \
+    set_option_encoding_once("default_intenal", &opt->intern.enc.name, e, elen)
+#define set_external_encoding_once(opt, e, elen) \
+    set_option_encoding_once("default_extenal", &opt->ext.enc.name, e, elen)
</code></pre>
</blockquote>
<p>Sorry, I missed these two lines:</p>
<p>#define set_source_encoding_once(opt, e, elen) \<br>
set_option_encoding_once("source", &opt->src.enc.name, e, elen)</p>
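<p>The "set once" rule that <code>set_option_encoding_once</code> applies (accept a repeated setting only if it names the same encoding, ignoring case) can be sketched in plain C without the Ruby C API. The names <code>same_name</code> and <code>set_name_once</code> are hypothetical, and returning -1 stands in for the <code>rb_raise</code> call in the patch; this is an illustration of the idea, not the actual ruby.c code.</p>

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive equality for ASCII encoding names. */
static int same_name(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;    /* equal only if both strings ended together */
}

/*
 * Store name in *slot unless *slot already holds a different name.
 * Returns 0 on success and -1 on a conflict; the patch raises a
 * RuntimeError ("... already set to ...") at the conflict point.
 */
static int set_name_once(const char **slot, const char *name)
{
    if (*slot && !same_name(*slot, name))
        return -1;
    *slot = name;
    return 0;
}
```

<p>With this sketch, setting <code>UTF-8</code> and then <code>utf-8</code> succeeds, while a later <code>Shift_JIS</code> is rejected. The same helper serves the external, internal and source slots, which is the point of folding the two original functions into one.</p>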
<p>--<br>
Nobu Nakada</p>
<p>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch<br>
https://redmine.ruby-lang.org/issues/680?journal_id=1506<br>
2008-10-31T19:06:37Z<br>
duerst (Martin Dürst) duerst@it.aoyama.ac.jp</p>
<p>
At 18:38 08/10/31, Nobuyoshi Nakada wrote:</p>
<blockquote>
<p>Hi,</p>
<p>At Mon, 27 Oct 2008 21:07:16 +0900,<br>
Nobuyoshi Nakada wrote in <a href="https://blade.ruby-lang.org/ruby-core/19546">[ruby-core:19546]</a>:</p>
<blockquote>
<p>Anyway, I think I understand the need to specify the source<br>
encoding without magic comments. Would an option for that<br>
purpose be an acceptable solution?</p>
</blockquote>
<p>Here is the patch to add options:</p>
</blockquote>
<p>Great work!</p>
<blockquote>
<p>--encoding=external:internal:source<br>
--external-encoding=enc<br>
--internal-encoding=enc<br>
--source-encoding=enc</p>
</blockquote>
<p>I personally don't like the last one, and the :source in the first<br>
one, but I guess there are situations where they can be very helpful<br>
(e.g. testing with different encodings).</p>
<p>I also think that it would be good to have the values of --encoding<br>
and -E look/work the same, so unless :source already works on -E,<br>
I think having just --source-encoding for the case that the<br>
source encoding must be set by an option should be okay.<br>
This will also make it easier to distinguish in documentation<br>
that --source-encoding is really only for very special occasions,<br>
and declaring the source encoding in the script itself is strongly<br>
preferred.</p>
<p>Regards, Martin.</p>
<p>#-#-# Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University<br>
#-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch<br>
https://redmine.ruby-lang.org/issues/680?journal_id=1508<br>
2008-10-31T21:49:36Z<br>
nobu (Nobuyoshi Nakada) nobu@ruby-lang.org</p>
<p>
Hi,</p>
<p>At Fri, 31 Oct 2008 19:05:25 +0900,<br>
Martin Duerst wrote in <a href="https://blade.ruby-lang.org/ruby-core/19657">[ruby-core:19657]</a>:</p>
<blockquote>
<blockquote>
<p>--encoding=external:internal:source<br>
--external-encoding=enc<br>
--internal-encoding=enc<br>
--source-encoding=enc</p>
</blockquote>
<p>I personally don't like the last one, and the :source in the first<br>
one, but I guess there are situations where they can be very helpful<br>
(e.g. testing with different encodings).</p>
<p>I also think that it would be good to have the values of --encoding<br>
and -E look/work the same, so unless :source already works on -E,<br>
I think having just --source-encoding for the case that the<br>
source encoding must be set by an option should be okay.</p>
</blockquote>
<p>-E is equivalent to --encoding.</p>
<blockquote>
<p>This will also make it easier to distinguish in documentation<br>
that --source-encoding is really only for very special occasions,<br>
and declaring the source encoding in the script itself is strongly<br>
preferred.</p>
</blockquote>
<p>Since these four options are separate, it's easy to remove<br>
some of them.</p>
<p>--<br>
Nobu Nakada</p>