Project

General

Profile

Actions

Bug #12052

closed

String#encode with xml option returns wrong result for totally non-ASCII-compatible encodings

Added by nobu (Nobuyoshi Nakada) over 5 years ago. Updated 14 days ago.

Status:
Closed
Priority:
Normal
Target version:
-
ruby -v:
[ruby-dev:49493]

Description

String#encodeをASCII非互換エンコーディングから同じエンコーディングへ、xml:オプション付きで呼ぶとおかしな結果を返します。
バイナリとして変換してしまっているようです。

p "<\0>\0".encode("utf-16le", "utf-16le", xml: :text)
#=> "\u6C26\u3B74\u2600\u7467;"

Updated by jeremyevans0 (Jeremy Evans) about 1 month ago

  • Status changed from Assigned to Rejected

After an extensive session with gdb, I've determined that this isn't an issue with String#encode, and it isn't a bug.

"<\0>\0".encode("utf-16le", "utf-16le", xml: :text) returns the same string as "&lt;\0&gt;\0".force_encoding("utf-16le"). I think that's the correct behavior for String#encode, since you are specifying the source and destination encodings match.

"&lt;\0&gt;\0".force_encoding("utf-16le") is the same string as "\u6C26\u3B74\u2600\u7467;".encode("utf-16le"). The 10 ASCII bytes are the same as the bytes for the 5 codepoints in UTF16-LE encoding.

String#inspect processes the string, and formats each of the non-ASCII codepoints using the \u syntax, and the final codepoint (59) as a regular ASCII character.

As an example:

"<\0>\0".encode("utf-16le", "utf-16le", xml: :text) == "&lt;\0&gt;\0".force_encoding("utf-16le")
=> true

"&lt;\0&gt;\0".force_encoding("utf-16le").codepoints
=> [27686, 15220, 9728, 29799, 59]

"&lt;\0&gt;\0".force_encoding("utf-16le").codepoints.map{|x| x >= 128 ? '-u%X'%x : x.chr}.join
"-u6C26-u3B74-u2600-u7467;"

Updated by duerst (Martin Dürst) about 1 month ago

  • Status changed from Rejected to Open
  • Subject changed from String#encode with xml option returns wrong result to String#encode with xml option returns wrong result for totally non-ASCII-compatible encodings

Sorry to jeremyevans0 (Jeremy Evans), but I have to disagree. This is a bug. We can disagree about how important it is to fix this bug, but it's a bug nevertheless. First, xml: :text works correctly in other encodings even if the source and destination encodings match.

"<q&".force_encoding("shift_JIS").encode("shift_JIS", xml: :text)
=> "&lt;q&amp;"

The bug is that we process UTF-16LE as if it consisted of 1-byte ASCII-based code units. I still have to identify exactly where and when that happens.

I have changed the subject to indicate what I understand is the extent of the problem. By using "totally", I want to distinguish this from encodings such as Shift_JIS which are also not as ASCII-compatible as say UTF-8, but still more so than UTF-16 (in its various variants).

Updated by jeremyevans0 (Jeremy Evans) about 1 month ago

duerst (Martin Dürst) wrote in #note-2:

Sorry to jeremyevans0 (Jeremy Evans), but I have to disagree. This is a bug. We can disagree about how important it is to fix this bug, but it's a bug nevertheless. First, xml: :text works correctly in other encodings even if the source and destination encodings match.

"<q&".force_encoding("shift_JIS").encode("shift_JIS", xml: :text)
=> "&lt;q&amp;"

The bug is that we process UTF-16LE as if it consisted of 1-byte ASCII-based code units. I still have to identify exactly where and when that happens.

Ah. So you are saying that "<\0>\0".encode("utf-16le", "utf-16le", xml: :text) needs to have the same result as:
"<\0>\0".encode("utf-8", "utf-16le", xml: :text).encode("utf-16le"). I agree, that makes more sense and this is a bug.

It looks like this issue occurs when using both multibyte source and destination encoding. If either the source or destination encoding is not multibyte, the issue doesn't occur:

# Multibyte source, single-byte destination
"<\0>\0".encode("utf-8", "utf-16le", xml: :text).bytes
=> [38, 108, 116, 59, 38, 103, 116, 59]

# Single-byte source, multibyte destination
"<>".encode("utf-16le", "utf-8", xml: :text).bytes
=> [38, 0, 108, 0, 116, 0, 59, 0, 38, 0, 103, 0, 116, 0, 59, 0]

# Multibyte source, multibyte destination
"<\0>\0".encode("utf-16le", "utf-16le", xml: :text).bytes
=> [38, 108, 116, 59, 0, 38, 103, 116, 59, 0]

So a possible way to work around the issue until it can be properly fixed would be to detect the case where both source and destination are multibyte, switch the destination to UTF-8, then encode the result of that to the desired destination encoding.

Updated by jeremyevans0 (Jeremy Evans) about 1 month ago

jeremyevans0 (Jeremy Evans) wrote in #note-3:

So a possible way to work around the issue until it can be properly fixed would be to detect the case where both source and destination are multibyte, switch the destination to UTF-8, then encode the result of that to the desired destination encoding.

I've submitted a patch that does this, and confirmed it fixes this issue: https://github.com/ruby/ruby/pull/4605

Updated by duerst (Martin Dürst) about 1 month ago

jeremyevans0 (Jeremy Evans) wrote in #note-3:

It looks like this issue occurs when using both multibyte source and destination encoding. If either the source or destination encoding is not multibyte, the issue doesn't occur:

# Multibyte source, single-byte destination
"<\0>\0".encode("utf-8", "utf-16le", xml: :text).bytes
=> [38, 108, 116, 59, 38, 103, 116, 59]

# Single-byte source, multibyte destination
"<>".encode("utf-16le", "utf-8", xml: :text).bytes
=> [38, 0, 108, 0, 116, 0, 59, 0, 38, 0, 103, 0, 116, 0, 59, 0]

# Multibyte source, multibyte destination
"<\0>\0".encode("utf-16le", "utf-16le", xml: :text).bytes
=> [38, 108, 116, 59, 0, 38, 103, 116, 59, 0]

True, except that usually the term "multibyte encoding" includes encodings such as UTF-8, and we are speaking here about encodings with code units longer than one byte.

But thinking about it, it may also include encodings such as EBCDIC (IBM037) and Shift_JIS and ISO-2020-JP. In the former case, I get

"<>".encode("IBM037")
=> "\x4C\x6E"
"<>".encode("IBM037").encode("IBM037", xml: :text)
=> "\x4C\x6E"```
"<>".encode("IBM037").force_encoding("US-ASCII")
=> "Ln"

This is explained rather easily: '<' and '>' are \x4C and \x6E in EBCDIC, but because the xml: :text processing runs in ASCII, these are interpreted as 'L' and 'n' and left alone. Shift_JIS actually seems safe, because the characters to be converted are encoded as plain ASCII, and because they all fall into the range 0x20..0x3F, which isn't used as a second byte in Shift_JIS.

For ISO-2022-JP, we are not so lucky. Take the string "<>湿" (the Kanji stands for 'wet', which is appropriate here in Japan because we are in the Rainy Season now :-; it's there because in ISO-2022-JP, its value is encoded with the same bytes as "<>", after switching with the necessary escape sequences):

"<>青".encode("ISO-2022-JP")
=> "\x3C\x3E\e\x24\x42\x40\x44\e\x28\x42"
"<>湿".encode("ISO-2022-JP").force_encoding("US-ASCII")
=> "<>\e$B<>\e(B"
"<>湿".encode("ISO-2022-JP").encode("ISO-2022-JP", xml: :text)
=> "\x26\x6C\x74\x3B\x26\x67\x74\x3B\e\x24\x42\x26\x6C\x74\x3B\x26\x67\x74\x3B\e\x28\x42"
"<>湿".encode("ISO-2022-JP").encode("ISO-2022-JP", xml: :text).force_encoding("US-ASCII")
=> "&lt;&gt;\e$B&lt;&gt;\e(B"

Trying to further transcode the result of encode("ISO-2022-JP", xml: :text) leads to an encoding error.

So a possible way to work around the issue until it can be properly fixed would be to detect the case where both source and destination are multibyte, switch the destination to UTF-8, then encode the result of that to the desired destination encoding.

The condition seems to be slightly more narrow. Even if both encodings have code units of more than one byte, things work as long as the encodings are not the same, most probably because these cases already get transcoded via UTF-8:

"<\0>\0".force_encoding("UTF-16LE").encode("UTF-16BE", xml: :text)
=> "&lt;&gt;"
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-32BE", xml: :text)
=> "&lt;&gt;"
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-32LE", xml: :text)
=> "&lt;&gt;"
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-16LE", xml: :text)
=> "\u6C26\u3B74\u2600\u7467;"

I'll have a look at your patch later, but just wanted to get this out. Sorry to be more quick with encodings than with the actual code :-(.

Actions #6

Updated by jeremyevans (Jeremy Evans) about 1 month ago

  • Status changed from Open to Closed

Applied in changeset git|e86c1f6fc53433ef5c82ed2b7a4cc9a12c153e4c.


Work around issue transcoding issue with non-ASCII compatible encodings and xml escaping

When using a non-ASCII compatible source and destination encoding
and xml escaping (the :xml option to String#encode), the resulting
string was broken, as it used the correct non-ASCII compatible
encoding, but contained data that was ASCII-compatible instead of
compatible with the string's encoding.

Work around this issue by detecting the case where both the
source and destination encoding are non-ASCII compatible, and
transcoding the source string from the non-ASCII compatible
encoding to UTF-8. The xml escaping code will correctly handle
the UTF-8 source string and the return the correctly encoded
and escaped value.

Fixes [Bug #12052]

Co-authored-by: Nobuyoshi Nakada nobu@ruby-lang.org

Actions #7

Updated by nagachika (Tomoyuki Chikanaga) about 1 month ago

  • Backport changed from 2.0.0: REQUIRED, 2.1: REQUIRED, 2.2: REQUIRED, 2.3: REQUIRED to 2.7: REQUIRED, 3.0: REQUIRED

Updated by nagachika (Tomoyuki Chikanaga) 29 days ago

  • Backport changed from 2.7: REQUIRED, 3.0: REQUIRED to 2.7: REQUIRED, 3.0: DONE

ruby_3_0 e62cccaeb0986d43480bccbd365cb20056bda4d7 merged revision(s) e86c1f6fc53433ef5c82ed2b7a4cc9a12c153e4c,f6539202c52a051a4e6946a318a1d9cd29002990.

Updated by nagachika (Tomoyuki Chikanaga) 29 days ago

  • Backport changed from 2.7: REQUIRED, 3.0: DONE to 2.7: REQUIRED, 3.0: REQUIRED

The test introduced at e62cccaeb0986d43480bccbd365cb20056bda4d7 failed on almost all environments.
I reverted the changesets for the time being.

Updated by duerst (Martin Dürst) 28 days ago

nagachika (Tomoyuki Chikanaga) wrote in #note-9:

The test introduced at e62cccaeb0986d43480bccbd365cb20056bda4d7 failed on almost all environments.
I reverted the changesets for the time being.

nagachika (Tomoyuki Chikanaga): Can you tell us exactly what, or how, things failed? It's very difficult to imagine a reason for such a failure, because I don't know of any recent (e.g. 2.7) changes to the involved code (i.e. transcode.c).

Updated by nagachika (Tomoyuki Chikanaga) 28 days ago

The multiple GitHub Actions environments show the failure of make check.
The error messages are all the same.

Ubuntu: https://github.com/ruby/ruby/runs/2977522707
macOS: https://github.com/ruby/ruby/runs/2977522761
MinGW: https://github.com/ruby/ruby/runs/2977522628

  1) Failure:
TestTranscode#test_encode_xml_multibyte [/home/runner/work/ruby/ruby/src/test/ruby/test_transcode.rb:140]:
failed encoding UTF-16LE to UTF-16LE with xml: :text.
<"&lt;&gt;"> expected but was
<"&lt;\u0000&gt;\u0000">.

Updated by nobu (Nobuyoshi Nakada) 21 days ago

nagachika (Tomoyuki Chikanaga) wrote in #note-11:

The multiple GitHub Actions environments show the failure of make check.

391abc543cea118a9cd7d6310acadbfa352668ef is needed.

Updated by nagachika (Tomoyuki Chikanaga) 14 days ago

  • Backport changed from 2.7: REQUIRED, 3.0: REQUIRED to 2.7: REQUIRED, 3.0: DONE

ruby_3_0 b93a2d9d2cac5d3efe72537debedb089d447d33a merged revision(s) 391abc543cea118a9cd7d6310acadbfa352668ef,e86c1f6fc53433ef5c82ed2b7a4cc9a12c153e4c,f6539202c52a051a4e6946a318a1d9cd29002990.

Actions

Also available in: Atom PDF