Bug #8630
closed
Transcoding high-bit bytes from ASCII-8BIT to a text encoding should be :invalid, not :undef
Added by headius (Charles Nutter) almost 11 years ago.
Updated almost 11 years ago.
Description
When transcoding from ASCII-8BIT (BINARY) to a text encoding (e.g. UTF-8), MRI will raise an error for high-bit bytes:
"\xC3".encode("utf-8", "binary") # => Encoding::UndefinedConversionError
This can be disabled by passing :undef => :replace as an option to the encode call.
I believe that "undef" is the wrong treatment for this error. Undef means that the input character has no representation in the target encoding. In this case, the error is raised because only US-ASCII range of bytes are valid for transcoding, so the transcoding of high-bit bytes is by definition invalid, not undefined. In other words, high-bit bytes in ASCII-8BIT/BINARY are invalid as characters.
The error raised should be InvalidByteSequenceError and it should be prevented by using :invalid => :replace option.
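The behaviour described above can be checked with a short script. This is a minimal sketch rather than part of the original report; the variable names are illustrative, and `String#b` is used only so that the input is unambiguously tagged ASCII-8BIT.

```ruby
# One high-bit byte, tagged ASCII-8BIT (BINARY).
bytes = "\xC3".b

# Without options, the byte is reported as an undefined conversion.
default_error =
  begin
    bytes.encode("UTF-8")
    nil
  rescue EncodingError => e
    e.class
  end

# :undef => :replace suppresses the error and substitutes U+FFFD.
replaced = bytes.encode("UTF-8", undef: :replace)

# :invalid => :replace does not help, because MRI does not treat
# the byte as an invalid sequence.
invalid_error =
  begin
    bytes.encode("UTF-8", invalid: :replace)
    nil
  rescue EncodingError => e
    e.class
  end

p default_error, replaced, invalid_error
```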
2013/7/13 headius (Charles Nutter) headius@headius.com:
Bug #8630: Transcoding high-bit bytes from ASCII-8BIT to a text encoding should be :invalid, not :undef
https://bugs.ruby-lang.org/issues/8630
When transcoding from ASCII-8BIT (BINARY) to a text encoding (e.g. UTF-8), MRI will raise an error for high-bit bytes:
"\xC3".encode("utf-8", "binary") # => Encoding::UndefinedConversionError
This can be disabled by passing :undef => :replace as an option to the encode call.
I believe that "undef" is the wrong treatment for this error. Undef means that the input character has no representation in the target encoding. In this case, the error is raised because only US-ASCII range of bytes are valid for transcoding, so the transcoding of high-bit bytes is by definition invalid, not undefined. In other words, high-bit bytes in ASCII-8BIT/BINARY are invalid as characters.
No.
ASCII-8BIT consists of 128 ASCII characters and 128 special characters to
represent 0x80 to 0xff binary bytes.
The special characters are not representable in UTF-8.
So UndefinedConversionError is raised.
The validity of a character is defined by encoding, not transcoding.
Tanaka Akira
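The distinction drawn in this reply can be illustrated by tagging the same byte with two different encodings. This is a sketch under the assumption that `valid_encoding?` reflects the "validity" meant here; the helper `encode_error` is hypothetical and exists only for the example.

```ruby
# The same byte, once tagged ASCII-8BIT and once tagged UTF-8.
binary = "\xC3".b
utf8   = "\xC3".b.force_encoding(Encoding::UTF_8)

# Validity is a property of the string in its own encoding.
binary_valid = binary.valid_encoding?  # 0xC3 is a legal ASCII-8BIT byte
utf8_valid   = utf8.valid_encoding?    # a lone 0xC3 is a truncated UTF-8 sequence

# Hypothetical helper: returns the error class raised by encode, or nil.
def encode_error(str, target)
  str.encode(target)
  nil
rescue EncodingError => e
  e.class
end

# Valid source byte with no mapping in the target: undefined conversion.
binary_error = encode_error(binary, "UTF-8")

# Malformed source data: invalid byte sequence.
utf8_error = encode_error(utf8, "UTF-16LE")

# Bytes in the ASCII range transcode from ASCII-8BIT without error.
ascii_ok = "abc".b.encode("UTF-8")

p binary_valid, utf8_valid, binary_error, utf8_error, ascii_ok
```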
- Status changed from Open to Rejected
Hello Charles,
On 2013/07/13 6:26, Tanaka Akira wrote:
2013/7/13 headius (Charles Nutter) headius@headius.com:
Bug #8630: Transcoding high-bit bytes from ASCII-8BIT to a text encoding should be :invalid, not :undef
https://bugs.ruby-lang.org/issues/8630
When transcoding from ASCII-8BIT (BINARY) to a text encoding (e.g. UTF-8), MRI will raise an error for high-bit bytes:
"\xC3".encode("utf-8", "binary") # => Encoding::UndefinedConversionError
I believe that "undef" is the wrong treatment for this error. Undef means that the input character has no representation in the target encoding. In this case, the error is raised because only US-ASCII range of bytes are valid for transcoding, so the transcoding of high-bit bytes is by definition invalid, not undefined. In other words, high-bit bytes in ASCII-8BIT/BINARY are invalid as characters.
No.
I fully agree.
ASCII-8BIT consists of 128 ASCII characters and 128 special characters to
represent 0x80 to 0xff binary bytes.
That's one way to put it, but a better way is to say that ASCII-8BIT
consists of 128 ASCII characters and 128 unassigned codepoints. This is
similar to unassigned codepoints in UTF-8.
The special characters are not representable in UTF-8.
So UndefinedConversionError is raised.
The validity of a character is defined by encoding, not transcoding.
Yes. Valid means that the original data, as is, is valid, nothing more. It
does not depend on the target encoding. And ASCII-8BIT of course can
contain bytes 0x80 and beyond, that's its job.
Regards, Martin.
2013/7/13 "Martin J. Dürst" duerst@it.aoyama.ac.jp:
That's one way to put it, but a better way is to say that ASCII-8BIT
consists of 128 ASCII characters and 128 unassigned codepoints. This is
similar to unassigned codepoints in UTF-8.
Your interpretation forbids us to convert binary between encodings.
For example, Emacs has charsets for binary such as eight-bit-control or
eight-bit-graphic (or eight-bit? I'm not familiar with recent Emacs).
If we support an encoding which supports them and ASCII, we can convert
binary string between the encoding and ASCII-8BIT.
In your interpretation, such conversion would raise
UndefinedConversionError because unassigned codepoints can't have
character mapping for another encoding.
Tanaka Akira
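One practical consequence of treating the high bytes as characters that merely lack a mapping is that a caller can supply the mapping. The sketch below uses the `:fallback` option of `String#encode`, which is consulted for undefined conversions; the choice of a Latin-1 style mapping and the name `latin1_guess` are assumptions made purely for illustration, not something proposed in this thread.

```ruby
# Hypothetical mapping: treat each high byte as the Unicode code point
# with the same number (that is, interpret the byte as ISO-8859-1).
latin1_guess = ->(ch) { ch.ord.chr(Encoding::UTF_8) }

# 0xE9 has no mapping from ASCII-8BIT, so the fallback is called for it.
mapped = "caf\xE9".b.encode("UTF-8", fallback: latin1_guess)

p mapped
```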