Feature #6679: Default Ruby source file encoding to utf-8

Added by claytrump (Clay Trump) over 13 years ago. Updated almost 13 years ago.

Status: Closed
Target version:
[ruby-core:46021]

Description

Let's change the default encoding for Ruby source files from US-ASCII
to UTF-8 in Ruby 2.0.

• Convention over Configuration
• Ruby 1.9 forced an explicit encoding for code that was not pure ASCII, so
existing codebases already have magic comments.

In Ruby 2.0, "# encoding: utf-8" can be the default.
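To make the proposal concrete, here is a minimal sketch that asks the parser which encoding it assumed for a source file, once without and once with a magic comment. The helper `source_encoding_of` and the global `$demo_source_encoding` are hypothetical names invented for this illustration, and the sketch assumes Ruby 2.1 or later (for `Tempfile.create`) with no `-K` or similar option in effect.

```ruby
require "tempfile"

# Write `source` to a temporary .rb file, load it, and return the encoding
# the parser assumed for that file (reported through a global variable).
def source_encoding_of(source)
  Tempfile.create(["encoding_demo", ".rb"]) do |file|
    file.write(source)
    file.flush
    load file.path
  end
  $demo_source_encoding
end

probe = "$demo_source_encoding = __ENCODING__\n"

without_magic = source_encoding_of(probe)
with_magic    = source_encoding_of("# encoding: iso-8859-1\n" + probe)

puts "no magic comment:   #{without_magic}"
puts "with magic comment: #{with_magic}"
```

On a Ruby that includes this change, the first line should report UTF-8, while the magic comment still overrides the default for the second file.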


Files

utf.pdf (36.3 KB) utf.pdf claytrump (Clay Trump), 07/01/2012 07:23 AM
utf.pdf (37.1 KB) utf.pdf claytrump (Clay Trump), 07/03/2012 12:23 AM

Updated by claytrump (Clay Trump) over 13 years ago Actions #1 [ruby-core:46022]

Oh, and here's a slide for the feature meetup. It's ugly, I know.

Updated by mame (Yusuke Endoh) over 13 years ago Actions #3 [ruby-core:46080]

  • Status changed from Open to Assigned
  • Assignee set to naruse (Yui NARUSE)

Received. Thank you!

Naruse-san, what do you think?

--
Yusuke Endoh

Updated by nobu (Nobuyoshi Nakada) over 13 years ago Actions #4 [ruby-core:46098]

claytrump (Clay Trump) wrote:

• Ruby 1.9 forced encoding for code that was not pure ASCII,

Could you elaborate?

Updated by duerst (Martin Dürst) over 13 years ago Actions #5 [ruby-core:46101]

I think this is the right direction to go, and doing it for a major
version (2.0) is the right timing.

Maybe we can also on this occasion abolish the -U option and make its
action the default? Matz proposed that quite a long time ago, when we
were moving to 1.9.

Regards, Martin.

Updated by claytrump (Clay Trump) over 13 years ago Actions #6 [ruby-core:46111]

nobu (Nobuyoshi Nakada) wrote:

claytrump (Clay Trump) wrote:

• Ruby 1.9 forced encoding for code that was not pure ASCII,

Could you elaborate?

Sure. Ruby 1.9 forced us to specify the encoding for code that was not pure
ASCII.

I'm no expert, but I think that in Ruby 1.8 you could write code in an
ASCII-compatible encoding like ISO-8859-1. Things would kind of work: it
would output the expected sequence of bytes, at least as long as you were
using and expecting that encoding everywhere.

If Ruby 1.9 had assumed UTF-8, that legacy code would now output the wrong
stuff, and you might not notice right away. Subtle errors, etc. So it's
cool that Ruby 1.9 raises an error instead; you have to state the encoding.

So any code like that has the right # coding comment by now.

I've attached a slide with a clearer sentence.
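As a rough illustration of why silently assuming UTF-8 would have been risky for legacy files, the sketch below takes the bytes an ISO-8859-1 editor would save for "café" and labels them with each encoding in turn. The variable names are made up for this example; only core `String` and `Array` methods are used.

```ruby
# The bytes of "café" as an ISO-8859-1 editor would have saved them.
legacy_bytes = [0x63, 0x61, 0x66, 0xE9].pack("C*")

as_latin1 = legacy_bytes.dup.force_encoding(Encoding::ISO_8859_1)
as_utf8   = legacy_bytes.dup.force_encoding(Encoding::UTF_8)

latin1_ok = as_latin1.valid_encoding?  # the bytes are well-formed ISO-8859-1
utf8_ok   = as_utf8.valid_encoding?    # 0xE9 alone is a truncated UTF-8 sequence

# Transcoding from the correct label recovers the intended text.
recovered = as_latin1.encode(Encoding::UTF_8)

puts "valid as ISO-8859-1: #{latin1_ok}"
puts "valid as UTF-8:      #{utf8_ok}"
puts "recovered text:      #{recovered}"
```

The same byte sequence is well-formed under one label and malformed under the other, which is the kind of mismatch an explicit magic comment rules out.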

Updated by claytrump (Clay Trump) over 13 years ago Actions #7 [ruby-core:46112]

On Mon, Jul 2, 2012 at 2:34 AM, "Martin J. Dürst" wrote:

I think this is the right direction to go, and doing it for a major
version (2.0) is the right timing.

Maybe we can also on this occasion abolish the -U option and make its
action the default? Matz proposed that quite a long time ago, when we were
moving to 1.9.

Cool, sounds like a plan.

Updated by drbrain (Eric Hodel) over 13 years ago Actions #8 [ruby-core:46120]

duerst (Martin Dürst) wrote:

I think this is the right direction to go, and doing it for a major
version (2.0) is the right timing.

Maybe we can also on this occasion abolish the -U option and make its
action the default? Matz proposed that quite a long time ago, when we
were moving to 1.9.

#5206 (make -K warn) may be relevant to removing -U

Updated by naruse (Yui NARUSE) over 13 years ago Actions #9 [ruby-core:46123]

= Default Ruby source file encoding to utf-8

It can keep compatibility for the most part, but it breaks:

  • escaped bytes in a string literal like "a\xff": its encoding changes from ASCII-8BIT to UTF-8.
  • escaped bytes in a regexp literal, in the same way
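The string-literal case can be observed directly. The sketch below assumes it is saved in a file without a magic comment and run on a Ruby where the default source encoding is already UTF-8; `String#b` (available since Ruby 2.0) is shown as one way to get a binary copy, not as the fix the thread settled on.

```ruby
# With a UTF-8 source encoding, the literal is tagged UTF-8 even though
# the byte 0xFF can never appear in well-formed UTF-8.
literal = "a\xff"

puts literal.encoding
puts literal.valid_encoding?

# String#b returns a copy tagged as binary (ASCII-8BIT) with the same bytes.
binary = "a\xff".b

puts binary.encoding
puts binary.bytes == literal.bytes
```

Code that relied on such literals being ASCII-8BIT therefore needs either a binary copy like this or a magic comment that pins the file to its old encoding.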

= -U as default

What is the expected merit of this?

Updated by rosenfeld (Rodrigo Rosenfeld Rosas) over 13 years ago Actions #10 [ruby-core:46141]

You could at least consider it for 3.0 and emit a deprecation warning for such strings in 2.0. Although I think many more people are currently complaining about UTF-8 not being the default than would complain because they were using ASCII-8BIT escaped bytes in strings.

Updated by duerst (Martin Dürst) over 13 years ago Actions #11 [ruby-core:46171]

On 2012/07/03 10:33, naruse (Yui NARUSE) wrote:

Issue #6679 has been updated by naruse (Yui NARUSE).

= Default Ruby source file encoding to utf-8

It can keep compatibility for the most part, but it breaks:

  • escaped bytes in a string literal like "a\xff": its encoding changes from ASCII-8BIT to UTF-8.
  • escaped bytes in a regexp literal, in the same way

Good point. Thinking about it, the rule that \x in strings means these
strings are in the source encoding seems to work well for non-UTF-8
strings. For UTF-8, because we have \u, we could make strings containing
\x be ASCII-8BIT.

But maybe that's too complicated.

Regards, Martin.
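For readers unfamiliar with the two escapes being compared, here is a small sketch of how \u and \x behave in a UTF-8 source file today. It does not implement the rule floated above (making \x strings ASCII-8BIT); it only shows the current behaviour that the idea would change, and it assumes a Ruby with the UTF-8 default and no magic comment in the file.

```ruby
from_u = "\u00E9"     # code point escape: always yields a UTF-8 string
from_x = "\xC3\xA9"   # the same character spelled as raw UTF-8 bytes

puts from_u.encoding
puts from_x.encoding
same_text = (from_x == from_u)   # equal: same bytes, same encoding

# A binary copy has identical bytes but no longer compares equal,
# because the two strings carry incompatible encodings.
binary = from_x.b
same_bytes        = (binary.bytes == from_u.bytes)
equal_when_binary = (binary == from_u)

puts same_text
puts same_bytes
puts equal_when_binary
```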

Updated by mame (Yusuke Endoh) over 13 years ago Actions #12 [ruby-core:46653]

Clay Trump,

I'm happy to inform you that matz has (basically) accepted your
proposal.

But note that the decision may be cancelled if the compatibility
impact is considered serious.
Naruse-san will implement it and experiment with it.

--
Yusuke Endoh

Updated by Anonymous over 13 years ago Actions #13 [ruby-core:46655]

If this seems to have too large a performance impact, please consider making it a configuration option.

Making it a configuration option may be nice anyway.

Updated by rosenfeld (Rodrigo Rosenfeld Rosas) over 13 years ago Actions #14 [ruby-core:46672]

You mean the default would be UTF-8 right?

In Ruby I believe happiness > performance :)

On 23-07-2012 10:57, Perry Smith wrote:

If this seems to have too large a performance impact, please consider making it a configuration option.

Making it a configuration option may be nice anyway.

Updated by naruse (Yui NARUSE) over 13 years ago Actions #15 [ruby-core:46681]

Anonymous user wrote:

If this seems to have too large a performance impact, please consider making it a configuration option.

Making it a configuration option may be nice anyway.

Benchmark it yourself, and if it shows a performance impact, please report it.
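One possible way to run such a benchmark is sketched below. It is only a rough starting point: it relies on the MRI-specific `RubyVM::InstructionSequence.compile`, it assumes the parser takes the source encoding from the string's encoding tag, and the helper `seconds_to_compile` and the synthetic source text are invented for this example. Real measurements should use real files and many more rounds.

```ruby
# ASCII-only source text, so it is valid under both encodings.
ASCII_SOURCE = ("x = 1 + 2\n" * 2_000).freeze

# Compile the same text `rounds` times with the given encoding tag and
# return the elapsed wall-clock seconds.
def seconds_to_compile(source, encoding, rounds: 20)
  tagged = source.dup.force_encoding(encoding)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  rounds.times { RubyVM::InstructionSequence.compile(tagged) }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

us_ascii_time = seconds_to_compile(ASCII_SOURCE, Encoding::US_ASCII)
utf8_time     = seconds_to_compile(ASCII_SOURCE, Encoding::UTF_8)

printf("US-ASCII: %.4fs\n", us_ascii_time)
printf("UTF-8:    %.4fs\n", utf8_time)
```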

Updated by ko1 (Koichi Sasada) over 13 years ago Actions #16 [ruby-core:46698]

(2012/07/23 22:57), Perry Smith wrote:

Making it a configuration option may be nice anyway.

+1

--
// SASADA Koichi at atdot dot net

Updated by naruse (Yui NARUSE) over 13 years ago Actions #17 [ruby-core:46703]

mame (Yusuke Endoh) wrote:

I'm happy to inform you that matz has (basically) accepted your
proposal.

But note that the decision may be cancelled if the compatibility
impact is considered serious.
Naruse-san will implement it and experiment with it.

diff --git a/lib/rexml/encoding.rb b/lib/rexml/encoding.rb
index d1d5172..23e912f 100644
--- a/lib/rexml/encoding.rb
+++ b/lib/rexml/encoding.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
module REXML
module Encoding
# ID ---> Encoding name
diff --git a/lib/rexml/source.rb b/lib/rexml/source.rb
index 112393c..7ecb98f 100644
--- a/lib/rexml/source.rb
+++ b/lib/rexml/source.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'rexml/encoding'

module REXML
diff --git a/parse.y b/parse.y
index 049e356..00e80a2 100644
--- a/parse.y
+++ b/parse.y
@@ -10558,7 +10558,7 @@ parser_initialize(struct parser_params *parser)
 #ifdef YYMALLOC
     parser->heap = NULL;
 #endif
-    parser->enc = rb_usascii_encoding();
+    parser->enc = rb_utf8_encoding();
 }

 #ifdef RIPPER
diff --git a/ruby.c b/ruby.c
index ab4b674..5ab5ca2 100644
--- a/ruby.c
+++ b/ruby.c
@@ -1630,7 +1630,7 @@ load_file_internal(VALUE arg)
         enc = rb_locale_encoding();
     }
     else {
-        enc = rb_usascii_encoding();
+        enc = rb_utf8_encoding();
     }
     if (NIL_P(f)) {
         f = rb_str_new(0, 0);
diff --git a/test/base64/test_base64.rb b/test/base64/test_base64.rb
index 9ae54cb..c5e61b3 100644
--- a/test/base64/test_base64.rb
+++ b/test/base64/test_base64.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require "test/unit"
require "base64"

diff --git a/test/dl/test_import.rb b/test/dl/test_import.rb
index 26b9f3c..41def7c 100644
--- a/test/dl/test_import.rb
+++ b/test/dl/test_import.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require_relative 'test_base'
require 'dl/import'

diff --git a/test/logger/test_logger.rb b/test/logger/test_logger.rb
index 8fc02f8..100c1ea 100644
--- a/test/logger/test_logger.rb
+++ b/test/logger/test_logger.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'logger'
require 'tempfile'
diff --git a/test/net/http/test_http.rb b/test/net/http/test_http.rb
index fc7bfa9..cb8bf44 100644
--- a/test/net/http/test_http.rb
+++ b/test/net/http/test_http.rb
@@ -1,5 +1,4 @@
-# $Id$
-
+# coding: US-ASCII
require 'test/unit'
require 'net/http'
require 'stringio'
diff --git a/test/net/http/test_httpresponse.rb b/test/net/http/test_httpresponse.rb
index d57614b..ccff224 100644
--- a/test/net/http/test_httpresponse.rb
+++ b/test/net/http/test_httpresponse.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'net/http'
require 'test/unit'
require 'stringio'
diff --git a/test/openssl/test_x509name.rb b/test/openssl/test_x509name.rb
index 90c0992..968ad97 100644
--- a/test/openssl/test_x509name.rb
+++ b/test/openssl/test_x509name.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require_relative 'utils'

if defined?(OpenSSL)
diff --git a/test/psych/test_yaml.rb b/test/psych/test_yaml.rb
index 807c058..796a44f 100644
--- a/test/psych/test_yaml.rb
+++ b/test/psych/test_yaml.rb
@@ -1,4 +1,4 @@
-# -*- mode: ruby; ruby-indent-level: 4; tab-width: 4 -*-
+# -*- coding: us-ascii; mode: ruby; ruby-indent-level: 4; tab-width: 4 -*-
 # vim:sw=4:ts=4
 # $Id$

diff --git a/test/psych/visitors/test_to_ruby.rb b/test/psych/visitors/test_to_ruby.rb
index 5b0702c..ee473c9 100644
--- a/test/psych/visitors/test_to_ruby.rb
+++ b/test/psych/visitors/test_to_ruby.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'psych/helper'

module Psych
diff --git a/test/ripper/test_ripper.rb b/test/ripper/test_ripper.rb
index 72dc52d..1d6e893 100644
--- a/test/ripper/test_ripper.rb
+++ b/test/ripper/test_ripper.rb
@@ -17,7 +17,7 @@ class TestRipper::Ripper < Test::Unit::TestCase
   end

   def test_encoding
-    assert_equal Encoding::US_ASCII, @ripper.encoding
+    assert_equal Encoding::UTF_8, @ripper.encoding
   end

   def test_end_seen_eh
diff --git a/test/ruby/test_array.rb b/test/ruby/test_array.rb
index fff55e1..856a994 100644
--- a/test/ruby/test_array.rb
+++ b/test/ruby/test_array.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require_relative 'envutil'

diff --git a/test/ruby/test_io.rb b/test/ruby/test_io.rb
index d1edaaf..93967c6 100644
--- a/test/ruby/test_io.rb
+++ b/test/ruby/test_io.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'tmpdir'
require "fcntl"
diff --git a/test/ruby/test_io_m17n.rb b/test/ruby/test_io_m17n.rb
index b6358e0..3cc8437 100644
--- a/test/ruby/test_io_m17n.rb
+++ b/test/ruby/test_io_m17n.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'tmpdir'
require 'timeout'
diff --git a/test/ruby/test_m17n.rb b/test/ruby/test_m17n.rb
index dfcaa94..ce94886 100644
--- a/test/ruby/test_m17n.rb
+++ b/test/ruby/test_m17n.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require_relative 'envutil'

diff --git a/test/ruby/test_pack.rb b/test/ruby/test_pack.rb
index c72035c..4810c6e 100644
--- a/test/ruby/test_pack.rb
+++ b/test/ruby/test_pack.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'

class TestPack < Test::Unit::TestCase
diff --git a/test/ruby/test_parse.rb b/test/ruby/test_parse.rb
index 563e2ce..b5d31db 100644
--- a/test/ruby/test_parse.rb
+++ b/test/ruby/test_parse.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'stringio'

diff --git a/test/ruby/test_regexp.rb b/test/ruby/test_regexp.rb
index 7e31e99..781af50 100644
--- a/test/ruby/test_regexp.rb
+++ b/test/ruby/test_regexp.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'envutil'

diff --git a/test/syck/test_yaml.rb b/test/syck/test_yaml.rb
index 132bc92..c286b03 100644
--- a/test/syck/test_yaml.rb
+++ b/test/syck/test_yaml.rb
@@ -1,4 +1,4 @@
-# -*- mode: ruby; ruby-indent-level: 4; tab-width: 4; indent-tabs-mode: t -*-
+# -*- coding: us-ascii; mode: ruby; ruby-indent-level: 4; tab-width: 4; indent-tabs-mode: t -*-
 # vim:sw=4:ts=4
 # $Id$

diff --git a/test/syslog/test_syslog_logger.rb b/test/syslog/test_syslog_logger.rb
index 9224296..d382b4a 100644
--- a/test/syslog/test_syslog_logger.rb
+++ b/test/syslog/test_syslog_logger.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'tempfile'
require 'syslog/logger'
diff --git a/test/webrick/test_cgi.rb b/test/webrick/test_cgi.rb
index d930c26..282183e 100644
--- a/test/webrick/test_cgi.rb
+++ b/test/webrick/test_cgi.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require_relative "utils"
require "webrick"
require "test/unit"
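The patch above pins many test files to US-ASCII with a magic comment. A minimal sketch of what that pinning does to a "\xff" literal follows; the helper `literal_encoding_with`, the constant `PROBE`, and the global `$demo_literal_encoding` are hypothetical names for this illustration, and the sketch assumes Ruby 2.1 or later with the UTF-8 default in place.

```ruby
require "tempfile"

# Source line whose literal contains an escaped high byte.
PROBE = '$demo_literal_encoding = "a\xff".encoding' + "\n"

# Load a temporary file made of `header` plus the probe line and return
# the encoding the parser gave to the "a\xff" literal.
def literal_encoding_with(header)
  Tempfile.create(["pinned_demo", ".rb"]) do |file|
    file.write(header + PROBE)
    file.flush
    load file.path
  end
  $demo_literal_encoding
end

unpinned = literal_encoding_with("")
pinned   = literal_encoding_with("# coding: US-ASCII\n")
emacs    = literal_encoding_with("# -*- coding: us-ascii; mode: ruby -*-\n")

puts "no magic comment:   #{unpinned}"
puts "# coding: US-ASCII: #{pinned}"
puts "Emacs-style line:   #{emacs}"
```

Both magic-comment spellings used in the patch keep the literal binary, which is presumably why those tests were pinned rather than rewritten.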

Updated by duerst (Martin Dürst) over 13 years ago Actions #18 [ruby-core:46709]

On 2012/07/24 3:27, naruse (Yui NARUSE) wrote:

Issue #6679 has been updated by naruse (Yui NARUSE).

Anonymous user wrote:

If this seems to have too large a performance impact, please consider making it a configuration option.

Making it a configuration option may be nice anyway.

Benchmark it yourself, and if it shows a performance impact, please report it.

I agree. For a file that's ASCII only, I can't imagine that performance
decreases much (but of course I might be wrong). For a file that's
UTF-8, there's no change. Same for a file that's in another encoding
(because that can't use the default).

Regards, Martin.

Updated by naruse (Yui NARUSE) over 13 years ago Actions #19 [ruby-core:46712]

ko1 (Koichi Sasada) wrote:

(2012/07/23 22:57), Perry Smith wrote:

Making it a configuration option may be nice anyway.

+1

diff --git a/ruby.c b/ruby.c
index ab4b674..d6a8a91 100644
--- a/ruby.c
+++ b/ruby.c
@@ -702,6 +702,7 @@ static long
 proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
 {
     long n, argc0 = argc;
+    int opt_K_p = FALSE;
     const char *s;

     if (argc == 0)
@@ -909,6 +910,7 @@ proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
                 break;
             }
             if (enc_name) {
+                opt_K_p = TRUE;
                 opt->src.enc.name = rb_str_new2(enc_name);
                 if (!opt->ext.enc.name)
                     opt->ext.enc.name = opt->src.enc.name;
@@ -1013,10 +1015,8 @@ proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
                     if (!*(s = ++p)) break;
                     set_encoding_part(internal);
                     if (!*(s = ++p)) break;
-#if defined ALLOW_DEFAULT_SOURCE_ENCODING && ALLOW_DEFAULT_SOURCE_ENCODING
                     set_encoding_part(source);
                     if (!*(s = ++p)) break;
-#endif
                     rb_raise(rb_eRuntimeError, "extra argument for %s: %s",
                              (arg[1] == '-' ? "--encoding" : "-E"), s);
 #undef set_encoding_part
@@ -1028,11 +1028,9 @@ proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
                 else if (is_option_with_arg("external-encoding", Qfalse, Qtrue)) {
                     set_external_encoding_once(opt, s, 0);
                 }
-#if defined ALLOW_DEFAULT_SOURCE_ENCODING && ALLOW_DEFAULT_SOURCE_ENCODING
                 else if (is_option_with_arg("source-encoding", Qfalse, Qtrue)) {
                     set_source_encoding_once(opt, s, 0);
                 }
-#endif
                 else if (strcmp("version", s) == 0) {
                     if (envopt) goto noenvopt_long;
                     opt->dump |= DUMP_BIT(version);
@@ -1097,6 +1095,9 @@ proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
     }

   switch_end:
+    if (opt_K_p)
+        rb_warning("-K is specified; it is for 1.8 compatibility and may cause odd behavior");
+
     return argc0 - argc;
 }

@@ -1268,9 +1269,6 @@ process_options(int argc, char **argv, struct cmdline_options *opt)
         opt->intern.enc.name = int_enc_name;
     }

-    if (opt->src.enc.name)
-        rb_warning("-K is specified; it is for 1.8 compatibility and may cause odd behavior");
-
     if (opt->dump & DUMP_BIT(version)) {
         ruby_show_version();
         return Qtrue;

Updated by naruse (Yui NARUSE) almost 13 years ago Actions #20

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r37485.
Clay, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • ruby.c (load_file_internal): set default source encoding as
    UTF-8 instead of US-ASCII. [ruby-core:46021] [Feature #6679]

  • parse.y (parser_initialize): set default parser encoding as
    UTF-8 instead of US-ASCII.

Updated by mame (Yusuke Endoh) almost 13 years ago Actions #21

  • Target version set to 2.0.0