Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-24T15:54:30Z</p> <ul></ul><p>=begin<br> A default for the source encoding was discussed quite a long<br> time ago (in some Japanese meetings or on ruby-dev, I don't remember),<br> and the conclusion was that the source encoding has to be given<br> (with a magic comment) in the file itself (unless the file is all ascii).</p> <p>The reason for this is that the source encoding is a property of the<br> source, and nothing else. On very simple scripts, it might occasionally<br> be slightly easier if it were the same as default_external or<br> default_internal, but this is only the case as long as you stay<br> in exactly the same environment, and don't move the script.<br> But scripts grow and move, so it's better to get the settings<br> right at the start.</p> <p>However, as far as I remember, the idea was that for -e,<br> default_external should be used, because that's what one<br> is using in a shell. I'm not sure why this doesn't work below.<br> (assuming Takeyuki is working in a Shift_JIS environment,<br> which isn't completely certain).</p> <p>Regards, Martin.</p> <p>At 12:12 08/10/24, Michael Selig wrote:</p> <blockquote> <p>Hi,</p> <p>This bug actually brings up an interesting issue - should the source<br> encoding default to something other than UTF-8 (ie: if it is not specified<br> in the "magic comment")?</p> <p>Perhaps it should default to the encoding specified by the user's locale?<br> Or perhaps it should default to the value of "default_internal" if it is<br> set? 
Or even default_external?</p> <p>I suggest that it should default to "default_internal" if that is set, and<br> then to the locale encoding if not.</p> <p>What do others think?<br> Having it default to the locale in this case would probably avoid the<br> encoding mismatch entirely (and the resulting confusion).</p> <p>Cheers<br> Mike</p> <p>On Fri, 24 Oct 2008 11:58:33 +1100, Takeyuki Fujioka<br> <a href="mailto:redmine@ruby-lang.org" class="email">redmine@ruby-lang.org</a> wrote:</p> <blockquote> <p>Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: csv.rb: CSV.parse is too late when encoding is mismatch (Closed)" href="https://redmine.ruby-lang.org/issues/680">#680</a>: csv.rb: CSV.parse is too late when encoding is mismatch<br> <a href="http://redmine.ruby-lang.org/issues/show/680" class="external">http://redmine.ruby-lang.org/issues/show/680</a></p> <p>Author: Takeyuki Fujioka<br> Status: Open, Priority: Normal<br> Category: lib, Target version: 1.9.x</p> <p>I think this result is correct, but the encoding mismatch error is raised too late.</p> <p>see:<br> % time ruby19 -rcsv -e<br> 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000).force_encoding("shift_jis"))'<br> ruby19 -rcsv -e 0.30s user 0.02s system 96% cpu 0.330 total</p> <p>% time ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000))'<br> /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `=~': broken UTF-8 string (ArgumentError)<br> from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `init_separators'<br> from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1563:in `initialize'<br> from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `new'<br> from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `parse'<br> from -e:1:in `&lt;main&gt;'<br> ruby19 -rcsv -e 'CSV.parse(("\x82\xA0,\x82\xA2\n"*10000))' 1.55s user<br> 2.57s system 90% cpu 4.530 total</p> <hr> <p><a href="http://redmine.ruby-lang.org" class="external">http://redmine.ruby-lang.org</a></p> </blockquote> 
</blockquote> <p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br> #-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-24T17:06:30Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>In message "Re: <a href="https://blade.ruby-lang.org/ruby-core/19473">[ruby-core:19473]</a> Re: Default source encoding (Was: [Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: csv.rb: CSV.parse is too late when encoding is mismatch (Closed)" href="https://redmine.ruby-lang.org/issues/680">#680</a>] csv.rb: CSV.parse is too late when encoding is mismatch)"<br> on Fri, 24 Oct 2008 16:48:04 +0900, "Michael Selig" <a href="mailto:michael.selig@fs.com.au" class="email">michael.selig@fs.com.au</a> writes:</p> <p>|The problem I am trying to solve is the compatibility of string literals<br> |in your source and strings from other sources.<br> |<br> |"default_internal" was introduced to try to make all strings the same<br> |encoding to avoid incompatibilities. But at the moment string literals<br> |seem to default to the source encoding or to UTF-8 if it is not set<br> |(please correct me if I am wrong). What I was suggesting was a way to make<br> |string literals be compatible.</p> <p>You are correct here.</p> <p>|This normally isn't a problem if:<br> |a) All string literals are 7 bit ASCII, or<br> |b) The source encoding matches "default_internal"<br> |<br> |If the source encoding of a program containing non-ascii string literals<br> |is set different from default_internal, you are asking for trouble, and<br> |would defeat the purpose of default_internal. 
Therefore to prevent the<br> |programmer from having to remember to specify both, it makes sense to me<br> |that the source encoding should default to default_internal. I think this<br> |is important.</p> <p>The point is that when we have source code written in a source<br> encoding, the literals are naturally encoded in that encoding. So do we<br> need to convert string literals into the default encoding? But<br> conversion can bring us more trouble, since it tends to change the<br> meaning. For example, what does /[&lt;a&gt;-&lt;b&gt;]/ mean, where &lt;a&gt; and &lt;b&gt; are<br> multibyte characters whose corresponding codepoints (and sorting<br> order) differ in the converted encoding?</p> <p>|(By the way, I am not talking about libraries here. As I have stressed<br> |previously, libraries should be carefully written to either use ASCII<br> |string literals only, or to make sure that it transcodes them properly.)</p> <p>That makes me feel much better, so we can limit the issue to<br> scripts only.</p> <p>|Finally, are you suggesting that "-e" should perform differently to a<br> |single-line ruby script? That seems non-intuitive to me.</p> <p>-e takes programs from the command line shell, which probably yields<br> strings in the locale encoding anyway. But we cannot assume that for<br> scripts contained in files.</p> <pre><code> matz. </code></pre> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-24T19:25:30Z</p> <ul><li><strong>File</strong> <a href="/attachments/116">sample.csv</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/116/sample.csv">sample.csv</a> added</li></ul><p>=begin<br> Please save the attached file as 'sample.csv'.<br> This file includes a Japanese UTF-8 string in the first line.<br> The other lines are us-ascii. 
The line count is 5001.</p> <p>% time ruby19 -Eutf-8 -rcsv -e 'CSV.parse(open("sample.csv","r").read)'<br> ruby19 -Eutf-8 -rcsv -e 'CSV.parse(open("sample.csv","r").read)' 0.23s user 0.01s system 96% cpu 0.254 total</p> <p>This is OK, very fast.<br> But:</p> <p>% time ruby19 -Eeuc-jp -rcsv -e 'CSV.parse(open("sample.csv","r").read)'<br> /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `=~': broken EUC-JP string (ArgumentError)<br> from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1981:in `init_separators'<br> from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1563:in `initialize'<br> from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `new'<br> from /Users/fujioka/local/lib/ruby/1.9.0/csv.rb:1350:in `parse'<br> from -e:1:in `&lt;main&gt;'<br> ruby19 -Eeuc-jp -rcsv -e 'CSV.parse(open("sample.csv","r").read)' 3.93s user 6.38s system 98% cpu 10.457 total</p> <p>This result is very slow.<br> I hope it raises as soon as the encoding mismatch is found.</p> <p># Sorry, I don't understand M17N's default_external and default_internal behavior.<br> # I can't reply about M17N's problem.</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-25T01:01:26Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>At Fri, 24 Oct 2008 23:00:27 +0900,<br> James Gray wrote in <a href="https://blade.ruby-lang.org/ruby-core/19481">[ruby-core:19481]</a>:</p> <blockquote> <blockquote> <p>I work on TextMate and we use Ruby all over the place inside that<br> application. I'm sure we have hundreds of scripts in there. We try<br>
We try<br> hard to make sure everything in TextMate is UTF-8, so now we get<br> errors out of Ruby 1.9. To fix, we need to add hundreds of magic<br> comments and worse, train our users who often write their own<br> automations in Ruby why they have to do the same to make their code<br> work.</p> </blockquote> <p>The real issue here is that you can argue the user doesn't even know<br> the proper encoding these scripts should be using. Only TextMate<br> really knows the encoding it's going to hand-off the data in.</p> </blockquote> <p>Though I don't know about TextMate at all, ruby-mode.el in 1.9<br> deals with magic comments automatically.</p> <p>--<br> Nobu Nakada</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-25T09:57:06Z</p> <ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Closed</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li></ul><p>=begin<br> Applied in changeset r19931.<br> =end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-25T09:58:38Z</p> <ul><li><strong>Assignee</strong> set to <i>JEG2 (James Gray)</i></li></ul><p>=begin<br> Thanks for finding the bug in my logic. 
It should be much faster now:</p> <p>$ time ruby_dev -Eeuc-jp -rlib/csv -e 'CSV.parse(open("/Users/james/Desktop/sample.csv","r").read)'<br> /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1981:in `=~': broken EUC-JP string (ArgumentError)<br> from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1981:in `init_separators'<br> from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1563:in `initialize'<br> from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1350:in `new'<br> from /Users/james/Documents/ruby_source/trunk/lib/csv.rb:1350:in `parse'<br> from -e:1:in `&lt;main&gt;'</p> <p>real 0m0.053s<br> user 0m0.039s<br> sys 0m0.011s</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-26T15:26:58Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>At Sun, 26 Oct 2008 11:25:58 +0900,<br> Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19515">[ruby-core:19515]</a>:</p> <blockquote> <ol> <li> </ol> <p>My preference would be to <em>always</em> encode string literals constructed with<br> "\x.." as ASCII-8BIT, ignoring the source encoding. This means that if you<br> really want to use such a literal as an encoded string, you must use<br> "force_encoding". I think this would be much clearer and get rid of the<br> "ambiguity".</p> </blockquote> <blockquote> <ol start="2"> <li> </ol> <p>My suggestion for "defaulting" the source encoding was an attempt to avoid<br> having to do this (but probably not a good way!). It isn't a big deal, and<br> I understand the argument that the source encoding is a property of the<br> script. 
My original suggestion (last month) of a special magic comment was<br> to have a way of specifying BOTH the default_internal and source encoding<br> once, but this idea was rejected.</p> </blockquote> <p>I'd prefer to default the internal encoding to the source<br> encoding of the main script.</p> <blockquote> <ol start="3"> <li> </ol> <p>Perhaps this check could be based on the library's source encoding? If<br> this were done, most libraries would have to use a source encoding of<br> US-ASCII (or just have no encoding magic comment) <em>not</em> UTF-8, so that<br> non-Unicode default_internal's will work. Perhaps Ruby could be smarter,<br> and only flag an error if there actually is an incompatible string literal<br> in the library?</p> </blockquote> <p>What about comments? I suspect it might not be a good idea.</p> <blockquote> <ol start="4"> <li> </ol> <p>Also it means that:<br> ruby test.rb<br> may perform differently than:<br> ruby -e "`cat test.rb`"</p> </blockquote> <p>Magic comments are effective with -e too.</p> <p>$ ruby19 -e 'p __ENCODING__'<br> #&lt;Encoding:EUC-JP&gt;</p> <p>$ ruby19 -e '#-*- encoding:utf-8 -*-' -e 'p __ENCODING__'<br> #&lt;Encoding:UTF-8&gt;</p> <p>Therefore there is no difference if the file has the magic comment.</p> <p>--<br> Nobu Nakada</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-26T21:34:52Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>At Sun, 26 Oct 2008 17:20:17 +0900,<br> Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19518">[ruby-core:19518]</a>:</p> <blockquote> <blockquote> <p>I'd prefer to default the internal encoding to the source<br> encoding of the main script.</p> </blockquote> <p>But then how do you tell Ruby NOT to set "default_internal"?</p> </blockquote> <p>I think defaulting the internal encoding 
to something other is<br> bad.</p> <blockquote> <p>It also means that comments must be in the default_internal encoding (see<br> your comment below).</p> </blockquote> <p>I don't follow you here, all comments should be written in the<br> source encoding. Why default_internal affects?</p> <blockquote> <blockquote> <p>Therefore no differences if the file has the magic comment.</p> </blockquote> <p>That's true, but my point was "why should a simple non-m17n non-ascii ruby<br> program have to contain the magic comment"?</p> </blockquote> <p>Because, non-ascii. It's definitely enough reason.</p> <p>--<br> Nobu Nakada</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-27T14:08:26Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>At Mon, 27 Oct 2008 07:28:42 +0900,<br> Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19525">[ruby-core:19525]</a>:</p> <blockquote> <p>Yes you are right, and I was not suggesting doing that.<br> But Matz wants to default default_internal to nil. 
With your proposal, how<br> do you do that and still set the source encoding?</p> </blockquote> <p>I don't like the idea of setting default_internal from the source<br> encoding, but meant "it feels less worse" by "prefer".</p> <blockquote> <p>My original suggestion was to use an extended "magic comment" to set both.</p> </blockquote> <p>But it can't keep the source encoding unset, and<br> "internal_encoding" has no effect for Emacs.</p> <blockquote> <p>Isn't backward compatibility with 1.8 scripts more important?<br> You are now forcing anyone with 1.8 scripts containing non-ascii string<br> literals to put in a magic comment, otherwise you get "invalid multibyte<br> char (US-ASCII)" error in 1.9.</p> </blockquote> <p>In other words, what you want is the -K option?</p> <p>--<br> Nobu Nakada</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-27T15:28:29Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>At Mon, 27 Oct 2008 14:48:41 +0900,<br> Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19532">[ruby-core:19532]</a>:</p> <blockquote> <p>OK, I don't use Emacs, and no one told me that before, thanks! I assumed<br> it would work, but I admit I didn't test it.<br> Then is there another form of magic comment that can be used - eg:<br> "internal encoding: XXXX" or "encoding: XXXX internal" that does work with<br> Emacs?</p> </blockquote> <p>No. Magic comments without -*- markers are for VIM, like</p> <p># vim: set encoding=UTF-8</p> <p>and neither VIM nor Emacs would work with your examples.</p> <blockquote> <p>What I am saying is that we need to consider backward compatibility of<br> Ruby scripts. 
James Grey brought up an example with his "Textmate scripts"<br> which contain UTF-8 multibyte string literals, which used to work with<br> 1.8, but do not in 1.9, because they need either a "magic comment" or, as<br> you say "-KU". Either way, 1.9 is not truly backward compatible when it<br> comes to simple, non-m17n, non-ascii scripts, because you have to either<br> modify the script or add a flag to the ruby options. There must be lots of<br> Japanese ruby scripts which will have a similar issue.</p> </blockquote> <p>Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS<br> sources, so they have had -K in the shebang lines already.</p> <blockquote> <p>Defaulting source encoding to locale encoding (like -e does) should fix<br> this (as long as the end-user's locale is correct), right?</p> </blockquote> <p>Yes if they match.</p> <blockquote> <p>I guess if necessary James can put "-KU" in the RUBYOPT environment<br> variable to save having to add multiple magic comments, but I feel this<br> shouldn't be necessary.</p> </blockquote> <p>-U option would be better.</p> <p>--<br> Nobu Nakada</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-27T18:56:06Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>At Mon, 27 Oct 2008 15:57:03 +0900,<br> Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19535">[ruby-core:19535]</a>:</p> <blockquote> <blockquote> <p>Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS<br> sources, so they have had -K in the shebang lines already.</p> </blockquote> <p>Why then can I write a ruby 1.8 script which does a "puts" of a Shift_JIS<br> string (no shebang or magic comment), and have it run fine without -Ks?</p> </blockquote> <p>Because you are avoiding troublesome chars. 
Without such<br> chars, we can't write the words "display", "table", "software"<br> and "ruby".</p> <blockquote> <blockquote> <blockquote> <p>I guess if necessary James can put "-KU" in the RUBYOPT environment<br> variable to save having to add multiple magic comments, but I feel this<br> shouldn't be necessary.</p> </blockquote> <p>-U option would be better.</p> </blockquote> <p>I don't think that will work:</p> <p>t2.rb is a single line script which does a puts of a short UTF-8 multibyte<br> string.</p> </blockquote> <p>Indeed. -U sets only internal encoding, whereas -Ku sets also<br> external and source encodings. Therefore -U isn't direct<br> replacement for -Ku.</p> <p>But it's very ambiguous and dangerous to imply encodings. We<br> can't trust locale for this purpose, at least.</p> <p>You can use BOM to mean that the source is written in UTF-8.</p> <p>--<br> Nobu Nakada</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-27T19:38:37Z</p> <ul></ul><p>=begin<br> At 07:28 08/10/27, Michael Selig wrote:</p> <blockquote> <p>I thought one of your points was that you would like to be able to write<br> Japanese (or other non-ascii) comments which is otherwise only ascii<br> (which may use "\u" in literals, and want default_internal to be UTF-8).<br> This means that the source encoding should be Japanese. Your suggestion of<br> defaulting default_internal to the source encoding means that it will be<br> set to Japanese. I am not sure that this is always desirable. (This is<br> very minor - you can always override it)</p> </blockquote> <p>I'm not sure what you mean by "Japanese". It's no problem at all<br> to use UTF-8 to write Japanese. 
And I guess if somebody uses<br> \u literals and wants default_internal to be UTF-8, they'll<br> in most cases use UTF-8 for the source encoding (comments or<br> whatever else).</p> <p>If you mean Japanese legacy encodings (such as Shift_JIS and<br> EUC-JP), then you are correct, but it would be very rare<br> for somebody to use Shift_JIS or EUC-JP for comments when<br> the program is otherwise supposed to run all-UTF-8.</p> <blockquote> <p>Isn't backward compatibility with 1.8 scripts more important?<br> You are now forcing anyone with 1.8 scripts containing non-ascii string<br> literals to put in a magic comment, otherwise you get "invalid multibyte<br> char (US-ASCII)" error in 1.9.</p> </blockquote> <p>Well, yes, that's actually the point of it. Wherever necessary,<br> get everybody to declare their encoding. It may be somewhat suboptimal<br> in the transition phase, but after that, we know what we're dealing<br> with.</p> <p>Regards, Martin.</p> <p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br> #-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-27T19:38:45Z</p> <ul></ul><p>=begin<br> At 14:48 08/10/27, Michael Selig wrote:</p> <blockquote> <p>I am not sure why you would want to keep the source encoding unset when<br> setting default_internal at the top of a script. Perhaps you could explain.</p> </blockquote> <p>The simplest case is a script in US-ASCII only, but where you want<br> the data to be handled e.g. in UTF-8.</p> <p>Regards, Martin.</p> <p>#-#-# Martin J. Du"rst, Assoc. 
Professor, Aoyama Gakuin University<br> #-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-27T19:39:53Z</p> <ul></ul><p>=begin<br> At 12:24 08/10/27, James Gray wrote:</p> <blockquote> <p>They sure could, yeah. Our policy for TextMate development has always<br> been that UTF-8 is king. We use it heavily and I'm sure some scripts<br> do contain multibyte characters in UTF-8.</p> </blockquote> <p>Wouldn't it be only these scripts (including those that contain<br> \x escapes for UTF-8) that need the encoding indication at the top?<br> (please note that literals with \u escapes are automatically UTF-8).</p> <p>Regards, Martin.</p> <p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br> #-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-27T19:59:29Z</p> <ul></ul><p>=begin<br> At 19:17 08/10/27, Michael Selig wrote:</p> <blockquote> <p>On Mon, 27 Oct 2008 20:55:32 +1100, Nobuyoshi Nakada <a href="mailto:nobu@ruby-lang.org" class="email">nobu@ruby-lang.org</a><br> wrote:</p> <blockquote> <p>Hi,</p> <p>At Mon, 27 Oct 2008 15:57:03 +0900,<br> Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19535">[ruby-core:19535]</a>:</p> <blockquote> <blockquote> <p>Even in 1.8 or prior, -Ks has been mandatory for Shift_JIS<br> sources, so they have had -K in the shebang lines already.</p> </blockquote> <p>Why then can I write a ruby 1.8 script which does a "puts" of a<br> Shift_JIS<br> string (no shebang or magic 
comment), and have it run fine without -Ks?</p> </blockquote> <p>Because you are avoiding troublesome chars. Without such<br> chars, we can't write the words "display", "table", "software"<br> and "ruby".</p> </blockquote> <p>OK, I'm sure you know more about Japanese encodings that I do.</p> </blockquote> <p>To give you the details, these characters, in Shift_JIS, are<br> encoded with two bytes, the second of which is the same byte<br> as e.g. a backslash.</p> <blockquote> <p>But my original point is that 1.8 scripts exist which contain multibyte<br> characters (eg UTF-8) which work fine under 1.8 without-K, but will fail<br> under 1.9 unless a magic comment or -K is provided.</p> </blockquote> <p>Yes, that's because 1.8 is essentially garbage-in-garbage out.<br> If you are careful about certain bytes, you can essentially have<br> arbitrary byte sequences in your script, and Ruby 1.8 won't complain.</p> <p>Regards, Martin.</p> <p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br> #-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-27T21:07:51Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>At Mon, 27 Oct 2008 19:17:45 +0900,<br> Michael Selig wrote in <a href="https://blade.ruby-lang.org/ruby-core/19540">[ruby-core:19540]</a>:</p> <blockquote> <p>But my original point is that 1.8 scripts exist which contain multibyte<br> characters (eg UTF-8) which work fine under 1.8 without-K, but will fail<br> under 1.9 unless a magic comment or -K is provided.</p> </blockquote> <p>It just seemed working by chance.</p> <blockquote> <blockquote> <p>But it's very ambiguous and dangerous to imply encodings. 
We<br> can't trust locale for this purpose, at least.</p> </blockquote> <p>It's a trade-off between that and backward compatibility. I think the<br> "danger" is not high and it gives backward compatibility, so my vote would<br> be to use it.</p> </blockquote> <p>And it will suddenly crash or behave weirdly by moving other<br> locales.</p> <p>Anyway, I think I understand the needs to specify source<br> encoding without magic comments. Is the option for that<br> purpose an acceptable solution?</p> <p>--<br> Nobu Nakada</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-27T21:12:18Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>At Mon, 27 Oct 2008 19:37:58 +0900,<br> Martin Duerst wrote in <a href="https://blade.ruby-lang.org/ruby-core/19541">[ruby-core:19541]</a>:</p> <blockquote> <p>If you mean Japanese legacy encodings (such as Shift_JIS and<br> EUC-JP), then your are correct, but it would be very rare<br> for somebody to use Shift_JIS or EUC-JP for comments when<br> the program is otherwise supposed to run all-UTF-8.</p> </blockquote> <p>I don't do it of course, but know that some people love to do<br> it.</p> <p>--<br> Nobu Nakada</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-28T01:16:04Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>In message "Re: <a href="https://blade.ruby-lang.org/ruby-core/19550">[ruby-core:19550]</a> Re: String literal encoding (Was: Default source encoding (Was: [Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: csv.rb: CSV.parse is too late when encoding is mismatch (Closed)" href="https://redmine.ruby-lang.org/issues/680">#680</a>] csv.rb: CSV.parse is toolate when encoding is mismatch))"<br> on Tue, 28 Oct 2008 00:12:46 +0900, James Gray <a href="mailto:james@grayproductions.net" class="email">james@grayproductions.net</a> writes:</p> <p>|I 
wasn't aware -KU still worked though, as Michael pointed out. I<br> |thought for sure I had tried that and got a warning about it being<br> |ignored now.<br> |<br> |It may be that the TextMate team could use that. What all does it set<br> |in 1.9? Source encoding obviously. It seems to affect<br> |default_external as well, but not touch default_internal. Do I have<br> |that right? Does it have any other special effects?</p> <p>-Ku (or -KU) specifies to</p> <ul> <li>default script encoding to be UTF-8</li> <li>default_external encoding to be UTF-8 unless it's specified<br> previously by -E or -U</li> <li>do not touch default_internal</li> </ul> <p>|Will -KU stay supported for the foreseeable future?</p> <p>Yes.</p> <pre><code> matz. </code></pre> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-31T18:39:33Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>At Mon, 27 Oct 2008 21:07:16 +0900,<br> Nobuyoshi Nakada wrote in <a href="https://blade.ruby-lang.org/ruby-core/19546">[ruby-core:19546]</a>:</p> <blockquote> <p>Anyway, I think I understand the needs to specify source<br> encoding without magic comments. 
Is the option for that<br> purpose an acceptable solution?</p> </blockquote> <p>Here is the patch to add options:</p> <p>--encoding=external:internal:source<br> --external-encoding=enc<br> --internal-encoding=enc<br> --source-encoding=enc</p>
<pre><code>Index: ruby.c
===================================================================
--- ruby.c (revision 20075)
+++ ruby.c (working copy)
@@ -623,5 +623,5 @@ dump_option(const char *str, int len, vo
 
 static void
-set_internal_encoding_once(struct cmdline_options *opt, const char *e, int elen)
+set_option_encoding_once(const char *type, VALUE *name, const char *e, int elen)
 {
     VALUE ename;
@@ -630,27 +630,16 @@ set_internal_encoding_once(struct cmdlin
     ename = rb_str_new(e, elen);
 
-    if (opt->intern.enc.name &&
-        rb_funcall(ename, rb_intern("casecmp"), 1, opt->intern.enc.name) != INT2FIX(0)) {
+    if (*name &&
+        rb_funcall(ename, rb_intern("casecmp"), 1, *name) != INT2FIX(0)) {
         rb_raise(rb_eRuntimeError,
-                 "default_intenal already set to %s", RSTRING_PTR(opt->intern.enc.name));
+                 "%s already set to %s", type, RSTRING_PTR(*name));
     }
-    opt->intern.enc.name = ename;
+    *name = ename;
 }
 
-static void
-set_external_encoding_once(struct cmdline_options *opt, const char *e, int elen)
-{
-    VALUE ename;
-
-    if (!elen) elen = strlen(e);
-    ename = rb_str_new(e, elen);
-
-    if (opt->ext.enc.name &&
-        rb_funcall(ename, rb_intern("casecmp"), 1, opt->ext.enc.name) != INT2FIX(0)) {
-        rb_raise(rb_eRuntimeError,
-                 "default_external already set to %s", RSTRING_PTR(opt->ext.enc.name));
-    }
-    opt->ext.enc.name = ename;
-}
+#define set_internal_encoding_once(opt, e, elen) \
+    set_option_encoding_once("default_intenal", &opt->intern.enc.name, e, elen)
+#define set_external_encoding_once(opt, e, elen) \
+    set_option_encoding_once("default_extenal", &opt->ext.enc.name, e, elen)
 
 static int
@@ -956,13 +945,29 @@ proc_options(int argc, char **argv, stru
                     char *p;
                   encoding:
-                    p = strchr(s, ':');
-                    if (p) {
-                        if (p > s)
-                            set_external_encoding_once(opt, s, p-s);
-                        if (*++p)
-                            set_internal_encoding_once(opt, p, 0);
-                    }
-                    else
-                        set_external_encoding_once(opt, s, 0);
+                    do {
+#   define set_encoding_part(type) \
+                        if (!(p = strchr(s, ':'))) { \
+                            set_##type##_encoding_once(opt, s, 0); \
+                            break; \
+                        } \
+                        else if (p > s) { \
+                            set_##type##_encoding_once(opt, s, p-s); \
+                        }
+                        set_encoding_part(external);
+                        if (!*(s = ++p)) break;
+                        set_encoding_part(internal);
+                        if (!*(s = ++p)) break;
+                        set_encoding_part(source);
+#   undef set_encoding_part
+                    } while (0);
+                }
+                else if (is_option_with_arg("internal-encoding", Qfalse, Qtrue)) {
+                    set_internal_encoding_once(opt, s, 0);
+                }
+                else if (is_option_with_arg("external-encoding", Qfalse, Qtrue)) {
+                    set_external_encoding_once(opt, s, 0);
+                }
+                else if (is_option_with_arg("source-encoding", Qfalse, Qtrue)) {
+                    set_source_encoding_once(opt, s, 0);
                 }
                 else if (strcmp("version", s) == 0) {
</code></pre>
<p>--<br> Nobu Nakada</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-31T18:59:11Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>At Fri, 31 Oct 2008 18:38:24 +0900,<br> Nobuyoshi Nakada wrote in <a href="https://blade.ruby-lang.org/ruby-core/19655">[ruby-core:19655]</a>:</p> <blockquote>
<pre><code>+#define set_internal_encoding_once(opt, e, elen) \
+    set_option_encoding_once("default_intenal", &opt->intern.enc.name, e, elen)
+#define set_external_encoding_once(opt, e, elen) \
+    set_option_encoding_once("default_extenal", &opt->ext.enc.name, e, elen)
</code></pre>
</blockquote> <p>Sorry, missed these 2 lines.</p>
<pre><code>#define set_source_encoding_once(opt, e, elen) \
    set_option_encoding_once("source", &opt->src.enc.name, e, elen)
</code></pre>
<p>--<br> Nobu Nakada</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-31T19:06:37Z</p> <ul></ul><p>=begin<br> At 18:38 08/10/31, Nobuyoshi Nakada wrote:</p> <blockquote> <p>Hi,</p> <p>At Mon, 27 Oct 2008 21:07:16 +0900,<br> Nobuyoshi Nakada wrote in <a href="https://blade.ruby-lang.org/ruby-core/19546">[ruby-core:19546]</a>:</p> <blockquote> <p>Anyway, I think I understand the needs to specify source<br> encoding without magic comments. 
Is the option for that<br> purpose an acceptable solution?</p> </blockquote> <p>Here is the patch to add options:</p> </blockquote> <p>Great work!</p> <blockquote> <p>--encoding=external:internal:source<br> --external-encoding=enc<br> --internal-encoding=enc<br> --source-encoding=enc</p> </blockquote> <p>I personally don't like the last one, and the :source in the first<br> one, but I guess there are situations where they can be very helpful<br> (e.g. testing with different encodings).</p> <p>I also think that it would be good to have the values of --encoding<br> and -E look/work the same, so unless :source already works on -E,<br> I think having just --source-encoding for the case that the<br> source encoding must be set by an option should be okay.<br> This will also make it easier to distinguish in documentation<br> that --source-encoding is really only for very special occasions,<br> and declaring the source encoding in the script itself is strongly<br> preferred.</p> <p>Regards, Martin.</p> <p>#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University<br> #-#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #680: csv.rb: CSV.parse is too late when encoding is mismatch</h1> <p>2008-10-31T21:49:36Z</p> <ul></ul><p>=begin<br> Hi,</p> <p>At Fri, 31 Oct 2008 19:05:25 +0900,<br> Martin Duerst wrote in <a href="https://blade.ruby-lang.org/ruby-core/19657">[ruby-core:19657]</a>:</p> <blockquote> <blockquote> <p>--encoding=external:internal:source<br> --external-encoding=enc<br> --internal-encoding=enc<br> --source-encoding=enc</p> </blockquote> <p>I personally don't like the last one, and the :source in the first<br> one, but I guess there are situations where they can be very helpful<br> (e.g. 
testing with different encodings).</p> <p>I also think that it would be good to have the values of --encoding<br> and -E look/work the same, so unless :source already works on -E,<br> I think having just --source-encoding for the case that the<br> source encoding must be set by an option should be okay.</p> </blockquote> <p>-E is equivalent to --encoding.</p> <blockquote> <p>This will also make it easier to distinguish in documentation<br> that --source-encoding is really only for very special occasions,<br> and declaring the source encoding in the script itself is strongly<br> preferred.</p> </blockquote> <p>Since these four options are separate, it's easy to remove<br> some of them.</p> <p>--<br> Nobu Nakada</p> <p>=end</p> </article> </main></body></html>