Feature #10391
Open
Provide %eISO-8859-1'string \xAA literal' string literals with explicit encoding
Description
There is occasionally a need to use a string literal with an Encoding different from the source encoding.
This proposes to use %e (e for encoding) to introduce such string literals.
The syntax used in the subject relies on the fact that the set of characters used in Encoding names and the set of characters used to surround the actual string in a %-literal are completely disjoint (or, if they currently aren't, can be made completely disjoint). Alternatives would be to use % as a separator before and/or after the encoding, or to give the encoding name its own delimiters, e.g.:
- %eISO-8859-1'string \xAA literal' # original proposal
- %e%ISO-8859-1%'string \xAA literal' # before and after
- %e%ISO-8859-1'string \xAA literal' # before only
- %eISO-8859-1%'string \xAA literal' # after only
- %e(ISO-8859-1)(string \xAA literal) # surrounding the encoding name
The most frequent use of this would be with binary, so we probably want to allow a shortcut for binary, e.g.
- %eB'binary \x80 string'
or even just
- %b'binary \x08 string'
We could then, in the long term, deprecate String#b and go back to checking string validity at creation.
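For comparison, here is a minimal sketch of how a binary string has to be written today, assuming a UTF-8 source encoding. The variable names are illustrative only. It uses a double-quoted literal so that the \x80 escape is interpreted.

```ruby
# encoding: UTF-8
# Assumption: the source encoding is UTF-8 (the default since Ruby 2.0).

# The literal is tagged with the source encoding, so the lone \x80 byte
# produces a string that is not valid UTF-8.
utf8_literal = "binary \x80 string"
puts utf8_literal.encoding
puts utf8_literal.valid_encoding?

# String#b returns a copy of the same bytes tagged as binary (ASCII-8BIT),
# in which every byte sequence is valid.
binary = utf8_literal.b
puts binary.encoding == Encoding::BINARY
puts binary.valid_encoding?
```

A %b or %eB literal as proposed would presumably yield the second string directly, without the invalid intermediate one.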
The upper/lowercase distinction can be used to distinguish single-quoted strings (%e) and double-quoted strings (%E). We probably also want something for regular expressions, but I'm not sure which letter is best.
There is one question about semantics: What's the meaning of e.g. %eGB2312'松本' in a program with a source encoding of UTF-8 or Shift_JIS? In some cases, it might be convenient to have the result contain the same characters. But that would mean that the data needs to be transcoded, and that could fail. The easier way to define this is that the result is the same as '松本'.force_encoding('GB2312'), i.e. just using the byte values.
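The two candidate semantics can be compared using methods that exist today. The following is only a sketch: the variable names are made up for illustration, and the string is written with \u escapes so that it is UTF-8 regardless of the file's source encoding.

```ruby
# "松本" as a UTF-8 string (U+677E U+672C).
source = "\u677E\u672C"

# Byte-preserving interpretation: keep the bytes, change only the label.
# This cannot fail, but the bytes are still the UTF-8 bytes.
relabelled = source.dup.force_encoding('GB2312')
puts relabelled.bytes == source.bytes

# Character-preserving interpretation: transcode to the target encoding.
# The bytes change, and this raises if a character has no mapping.
transcoded = source.encode('GB2312')
puts transcoded.bytes == source.bytes
puts transcoded.encode('UTF-8') == source
```

Both results report GB2312 as their encoding; they differ in which bytes they contain and in whether creating them can fail.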
Updated by duerst (Martin Dürst) about 10 years ago
- Related to Feature #8848: Syntax for binary strings added
Updated by akr (Akira Tanaka) about 10 years ago
It is useful when string literals are frozen.
So I think this feature is good to have.
The syntax is a problem, though.
However, I feel that %eGB2312'文字列' should preserve characters, not bytes.
That is, I think %eGB2312'文字列' should be interpreted as '文字列'.encode('GB2312').
I would expect a SyntaxError when encode() fails.
(A similar situation: /*/ is an invalid regexp, which is also a SyntaxError.)
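The points in this comment can be illustrated with current Ruby behaviour. This is a rough sketch with illustrative variable names. It assumes Ruby 2.5 or later for FrozenError, and it uses eval only so that the parse-time error can be rescued inside a single script.

```ruby
# A frozen string cannot be relabelled in place, which is why a literal
# syntax helps once string literals are frozen.
frozen_error = begin
  "abc".freeze.force_encoding('GB2312')
  nil
rescue FrozenError => e
  e.class
end

# Today a failed transcoding is reported at run time.
# U+677E has no ISO-8859-1 mapping.
runtime_error = begin
  "\u677E".encode('ISO-8859-1')
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end

# An invalid regexp literal is rejected when the code is parsed.
# SyntaxError is not a StandardError, so it must be rescued explicitly.
parse_error = begin
  eval('/*/')
  nil
rescue SyntaxError => e
  e.class
end

puts frozen_error, runtime_error, parse_error
```

Under the character-preserving interpretation, a failing %e literal would presumably be reported like the third case rather than the second.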