Bug #18955
Closed
Kernel#sprintf - %c ignores a non-ASCII character's encoding
Description
I haven't found any similar existing issue, so I decided to create a new one.
I noticed that sprintf("%c", string) doesn't behave as expected when the encodings of the format string and the string argument differ and the string argument contains a non-ASCII character.
In this case it seems that sprintf just takes the binary representation of the character and assigns to it (or interprets it in) the encoding of the format string.
I would expect sprintf to negotiate an encoding and convert everything (the character and the format string) to the chosen one, and to raise an error when negotiation fails.
Examples to illustrate this behavior:
format = "%c".encode("Windows-1251")
string = "Й".encode(Encoding::KOI8_U)
r = sprintf(format, string)
r.encoding
# => #<Encoding:Windows-1251>
r == "Й".encode("Windows-1251")
# => false
r.codepoints
# => [234]
string.codepoints
# => [234]
In this example the result's encoding is the format's encoding, but the codepoint is unchanged: it equals the codepoint of the character in the original string's encoding. It should be different:
"Й".encode("Windows-1251").codepoints
# => [201]
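One possible workaround, sketched here under the assumption that the caller knows both encodings are convertible, is to transcode the argument to the format string's encoding before calling sprintf. The helper name sprintf_char is hypothetical and not part of Ruby:

```ruby
# Hypothetical helper (not part of Ruby): transcode the character to the
# format string's encoding first, so %c sees bytes that are already valid
# in the encoding of the result.
def sprintf_char(format, char)
  sprintf(format, char.encode(format.encoding))
end

format = "%c".encode("Windows-1251")
string = "Й".encode(Encoding::KOI8_U)

r = sprintf_char(format, string)
r.encoding    # the format's encoding, Windows-1251
r.codepoints  # the Windows-1251 codepoint of "Й", i.e. [201]
```

String#encode raises an error itself if the character cannot be represented in the target encoding, which roughly matches the "raise when negotiation fails" expectation above.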
Another example:
string = "À".encode(Encoding::CP1252)
sprintf("%c", string)
# => in `sprintf': invalid byte sequence in UTF-8 (ArgumentError)
In this example the error means that sprintf doesn't properly encode the codepoint (from the string's encoding) in UTF-8; it just uses the raw bytes.
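The following sketch shows why raw bytes would produce that error. It only inspects the string and does not call sprintf, so it illustrates the likely cause rather than the actual implementation:

```ruby
string = "À".encode(Encoding::CP1252)

# In CP1252 the character is the single byte 0xC0.
cp1252_bytes = string.bytes

# Reinterpreting that lone byte as UTF-8 gives an invalid sequence,
# which is what "invalid byte sequence in UTF-8" complains about.
raw_is_valid_utf8 = string.dup.force_encoding(Encoding::UTF_8).valid_encoding?

# Proper transcoding yields the two-byte UTF-8 form of U+00C0 instead.
utf8_bytes = string.encode(Encoding::UTF_8).bytes

p cp1252_bytes       # [192]
p raw_is_valid_utf8  # false
p utf8_bytes         # [195, 128]
```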
Updated by andrykonchin (Andrew Konchin) about 2 years ago
- Description updated (diff)
Updated by andrykonchin (Andrew Konchin) about 2 years ago
- Subject changed from Kernel#sprintf - %c doesn't convert non-ASCII characters to Kernel#sprintf - %c ignores a non-ASCII character's encoding
Updated by nobu (Nobuyoshi Nakada) about 2 years ago
A codepoint is expected for %c, so the former example is currently the expected behavior, I think.
The latter example is a bug.
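For comparison, this is how %c treats an integer argument: the number is taken as a codepoint in the format string's encoding. A small sketch, with variable names chosen only for illustration:

```ruby
# With a UTF-8 format string, the integer is a Unicode codepoint.
utf8_char = sprintf("%c", 0x0419)

# With a Windows-1251 format string, the same letter is codepoint 201.
cp1251_char = sprintf("%c".encode("Windows-1251"), 201)

p utf8_char             # "Й"
p cp1251_char.encoding  # #<Encoding:Windows-1251>
p cp1251_char.bytes     # [201]
```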
Updated by mame (Yusuke Endoh) about 2 years ago
At the dev-meeting, @akr (Akira Tanaka) proposed that the format %c behave like %s (with the one-codepoint restriction), and @matz (Yukihiro Matsumoto) agreed with it.
Updated by nobu (Nobuyoshi Nakada) about 2 years ago
- Status changed from Open to Closed
Applied in changeset git|ce384ef5a95b809f248e089c1608e60753dabe45.
[Bug #18955] Check length of argument for %c in proper encoding