Feature #10391: Provide %eISO-8859-1'string \xAA literal' string literals with explicit encoding - Ruby - Ruby Issue Tracking System

Added by duerst (Martin Dürst) almost 11 years ago. Updated almost 11 years ago.

Status: Open

[ruby-core:65743]

Description

There is occasionally a need to use a string literal with an Encoding different from the source encoding.
This proposes to use %e (e for encoding) to introduce such string literals.
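For comparison, here is a minimal sketch of how such a string can be built today without the proposed syntax, assuming a UTF-8 source file. The variable name latin1 is only for illustration, and the .dup is there so the snippet also works when string literals are frozen.

```ruby
# Runtime workaround available today: write the bytes with \x escapes,
# then relabel the string with the intended encoding.
latin1 = "string \xAA literal".dup.force_encoding("ISO-8859-1")

latin1.encoding          # the ISO-8859-1 encoding object
latin1.valid_encoding?   # true: every byte is a valid ISO-8859-1 character
latin1.encode("UTF-8")   # transcodes byte 0xAA to U+00AA (feminine ordinal indicator)
```

The proposed literal would express the same thing at parse time, without the extra method call.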

The syntax used in the subject relies on the fact that the set of characters used in Encoding names and the set of characters used to surround the actual string in a %-literal are completely disjoint (or if they currently aren't, can be made completely disjoint). Alternatives would be to use % as a separator before and/or after the encoding, e.g. like this:

%eISO-8859-1'string \xAA literal' # original proposal
%e%ISO-8859-1%'string \xAA literal' # before and after
%e%ISO-8859-1'string \xAA literal' # before only
%eISO-8859-1%'string \xAA literal' # after only
%e(ISO-8859-1)(string \xAA literal) # surrounding the encoding name

The most frequent use of this would be with binary, so we probably want to allow a shortcut for binary, e.g.

%eB'binary \x80 string'
or even just
%b'binary \x08 string'
We could then, in the long term, deprecate String#b and go back to checking string validity at creation.
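As a point of reference, the following sketch shows what String#b does today and why validity cannot currently be checked when the literal is created. It assumes a UTF-8 source file; the variable names are illustrative.

```ruby
# The literal itself is tagged with the source encoding (assumed UTF-8 here),
# so the lone byte 0x80 makes it an invalid UTF-8 string.
raw = "binary \x80 string"
raw.valid_encoding?   # false under a UTF-8 source encoding

# String#b returns a copy tagged as ASCII-8BIT (BINARY), where any byte is valid.
bin = raw.b
bin.encoding          # Encoding::ASCII_8BIT, also reachable as Encoding::BINARY
bin.valid_encoding?   # true
```

With a dedicated binary literal, the intermediate invalid string would never need to exist.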

The upper/lowercase distinction can be used to distinguish single-quoted strings (%e) and double-quoted strings (%E). We probably also want something for regular expressions, but I'm not sure which letter is best.

There is one question about semantics: What's the meaning of e.g. %eGB2312'松本' in a program with a source encoding of UTF-8 or Shift_JIS? In some cases, it might be convenient to have the result contain the same characters. But that would mean that the data needs to be transcoded, and that could fail. The easier way to define this is that the result is the same as '松本'.force_encoding('GB2312'), i.e. just using the byte values.
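The two possible readings can be compared with existing methods. This is only a sketch of the semantics question, not of the proposed syntax; the string is written with \u escapes so that it is UTF-8 regardless of the source encoding, and the variable names are illustrative.

```ruby
utf8 = "\u677E\u672C"   # 松本 as UTF-8, three bytes per character

# Byte-preserving reading: same bytes, different label.
relabeled = utf8.dup.force_encoding("GB2312")

# Character-preserving reading: transcode the characters, which raises
# if a character has no equivalent in the target encoding.
transcoded = utf8.encode("GB2312")

relabeled.bytes == utf8.bytes        # true: the bytes are untouched
transcoded.bytesize                  # 4: two bytes per character in GB2312
transcoded.encode("UTF-8") == utf8   # true: the characters survive a round trip
```

The byte-preserving reading never fails but can produce a string whose bytes do not form the intended characters in the new encoding.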

#1 [ruby-core:65923]

Updated by duerst (Martin Dürst) almost 11 years ago

Related to Feature #8848: Syntax for binary strings added

#2 [ruby-core:65942]

Updated by akr (Akira Tanaka) almost 11 years ago

It is useful when string literals are frozen.
So I think this feature is good to have.
The syntax is a problem, though.
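A small sketch of why frozen literals matter here: a frozen string cannot be relabeled in place, so force_encoding is not available and only copying methods such as String#b work. The variable names are illustrative.

```ruby
frozen = "binary \x80 string".freeze

error = nil
begin
  # In-place relabeling modifies the receiver, so it is rejected on a frozen string.
  frozen.force_encoding(Encoding::BINARY)
rescue => e
  # FrozenError on Ruby 2.5 and later, a plain RuntimeError on older versions.
  error = e
end

# String#b does not modify the receiver; it returns a new, relabeled copy.
copy = frozen.b
```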

However, I feel that %eGB2312'文字列' should preserve characters, not bytes.
That is, I think %eGB2312'文字列' should be interpreted as '文字列'.encode('GB2312').
I would expect a SyntaxError when encode() fails.
(A similar situation: /*/ is an invalid regexp, which is also a SyntaxError.)
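To illustrate the failure case, this sketch shows what encode() does at run time today when a character has no equivalent in the target encoding. Under the character-preserving reading, that condition would presumably be reported when the program is parsed instead. ISO-8859-1 is used as the target only because it cannot represent these characters; the variable names are illustrative.

```ruby
text = "\u6587\u5B57\u5217"   # 文字列 as UTF-8

failure = nil
begin
  # ISO-8859-1 has no representation for these characters.
  text.encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError => e
  failure = e
end

# A character that does exist in the target encoding converts without error.
ok = "\u00AA".encode("ISO-8859-1")
ok.bytes   # [170], i.e. the single byte 0xAA
```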
