Feature #15995

Add encoding conversion for CESU-8 from and to UTF-8

Added by duerst (Martin Dürst) 11 months ago. Updated 11 months ago.


As discussed in issue #15931, encoding conversion (transcoding) from/to CESU-8 is missing, so we should add it. We can then hopefully make CESU-8 a dummy encoding.
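To make concrete what such a conversion has to do, here is a minimal pure-Ruby sketch. It is not the proposed transcoder (which would live in Ruby's transcoding tables); the method names `utf8_to_cesu8`, `cesu8_to_utf8` and `three_byte_form` are hypothetical, the result of the first method is returned as a binary string, and the input is assumed to be well-formed. Characters in the BMP are byte-identical in UTF-8 and CESU-8; only supplementary characters differ, being written as a surrogate pair with each surrogate in three-byte form.

```ruby
# Hypothetical helpers for illustration only; none of these are part of Ruby's API.

# Write one 16-bit code unit (0x0800..0xFFFF) in the three-byte UTF-8 bit layout.
def three_byte_form(unit)
  [0xE0 | (unit >> 12), 0x80 | ((unit >> 6) & 0x3F), 0x80 | (unit & 0x3F)].pack("C*")
end

# UTF-8 string -> CESU-8 bytes (binary string). Assumes valid UTF-8 input.
def utf8_to_cesu8(str)
  str.encode("UTF-8").each_char.map { |ch|
    cp = ch.ord
    if cp < 0x10000
      ch.b                          # BMP: same bytes in UTF-8 and CESU-8
    else
      v    = cp - 0x10000
      high = 0xD800 | (v >> 10)     # high surrogate
      low  = 0xDC00 | (v & 0x3FF)   # low surrogate
      three_byte_form(high) + three_byte_form(low)
    end
  }.join.b
end

# Matches a high surrogate followed by a low surrogate, both in three-byte form.
CESU8_PAIR = /\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]/n

# CESU-8 bytes -> UTF-8 string. Assumes well-formed CESU-8 input.
def cesu8_to_utf8(bytes)
  bytes.b.gsub(CESU8_PAIR) { |m|
    b    = m.bytes
    high = ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)
    low  = ((b[3] & 0x0F) << 12) | ((b[4] & 0x3F) << 6) | (b[5] & 0x3F)
    cp   = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    [cp].pack("U").b                # four-byte UTF-8 form of the code point
  }.force_encoding("UTF-8")
end

p utf8_to_cesu8("\u{1F600}").unpack("C*").map { |x| "%02X" % x }
# => ["ED", "A0", "BD", "ED", "B8", "80"]
```

The sketch performs no validation; a real transcoder would also have to reject unpaired surrogates and other ill-formed input.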

Related issues

Related to Ruby master - Feature #15931: encoding for CESU-8 (Open, naruse (Yui NARUSE))

Updated by duerst (Martin Dürst) 11 months ago

Issue #15931 mentions both UTR 26 (the Unicode definition) and a Java definition as definitions of CESU-8, but they are not identical.

The difference is in how they treat U+0000 (NULL) characters: UTR 26 does not treat it in any special way (i.e. it is encoded as "\x00"), but the Java definition treats it specially, encoding it as "\xC0\x80". The IANA registration refers to the Unicode definition. UTR 26 explains that "CESU-8 is useful in 8-bit processing environments where binary collation with UTF-16 is required." For this to work, U+0000 has to be encoded as "\x00".
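The collation argument can be checked with plain byte comparisons. The following sketch writes the byte sequences out by hand (the variable names are illustrative only) and compares them as binary strings, so it does not depend on any CESU-8 support in Ruby itself.

```ruby
# All byte sequences are written out by hand; variable names are illustrative.

# Binary order of two characters in UTF-16 (big-endian code units).
def utf16_order(a, b)
  a.encode("UTF-16BE").b <=> b.encode("UTF-16BE").b
end

# Supplementary vs. high BMP character: U+10000 and U+FFFD.
cesu8_u10000 = "\xED\xA0\x80\xED\xB0\x80".b  # surrogate pair D800 DC00, three bytes each
utf8_u10000  = "\u{10000}".b                 # F0 90 80 80
u_fffd       = "\uFFFD".b                    # EF BF BD, same in UTF-8 and CESU-8

puts utf16_order("\u{10000}", "\uFFFD")      # -1 (0xD800 < 0xFFFD)
puts(cesu8_u10000 <=> u_fffd)                # -1, agrees with UTF-16
puts(utf8_u10000 <=> u_fffd)                 # 1, disagrees with UTF-16

# U+0000 vs. U+0001 under the two definitions.
nul_utr26 = "\x00".b                         # U+0000 per UTR 26
nul_java  = "\xC0\x80".b                     # U+0000 per the Java definition
u_0001    = "\x01".b                         # U+0001, same in both

puts utf16_order("\u0000", "\u0001")         # -1
puts(nul_utr26 <=> u_0001)                   # -1, agrees with UTF-16
puts(nul_java <=> u_0001)                    # 1, disagrees with UTF-16
```

With the "\xC0\x80" form, U+0000 sorts after every ASCII character in a binary comparison, which is why that variant cannot give UTF-16 binary order.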

Issue #15931 currently implements CESU-8 as defined in UTR 26:

$ ruby -e 'puts "\xC0\x80".force_encoding("cesu-8").valid_encoding?'

$ ruby -e 'puts "\x00".force_encoding("cesu-8").valid_encoding?'

It is unclear whether this is what the originator of issue #15931 wanted; his use case seems to be Java.


Updated by duerst (Martin Dürst) 11 months ago

  • Status changed from Open to Closed
