Bug #20663
Reading characters from IO does not recover gracefully from bad data pushed via IO#ungetc
Status: Closed
Description
If bytes that form at least 2 invalid characters for the internal encoding of an IO object are pushed into the internal buffer with IO#ungetc, reading from the stream returns invalid characters composed of both the pushed-back bytes from the internal buffer and the converted bytes from the stream, even if the next character in the stream itself is completely valid.
require 'tempfile'

# External encoding UTF-8, internal encoding UTF-16LE.
char_bytes = Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write("🍣")
  f.rewind
  # Push back all but the last byte of the character's UTF-16LE encoding.
  f.ungetc("🍣".encode('utf-16le').b[0..-2])
  f.each_char.map(&:bytes)
end
puts char_bytes.inspect
The above outputs:
[[60, 216], [99, 60], [216, 99], [223]]
I expect it to output:
[[60, 216], [99], [60, 216, 99, 223]]
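For reference, "🍣" is U+1F363, which UTF-16LE encodes as the surrogate pair D83C DF63, so the byte values above decompose like this:
"🍣".encode('utf-16le').bytes          # => [60, 216, 99, 223]
"🍣".encode('utf-16le').b[0..-2].bytes # => [60, 216, 99]  (the bytes pushed back via ungetc)
The actual output is just those three pushed-back bytes followed by the converted character's four bytes, re-chunked into characters that straddle the boundary between the two, whereas the expected output keeps them separate: the broken high surrogate, the stray byte, and then the intact converted character.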
In other words, I expect it to first completely drain the internal character buffer, returning as many characters as necessary (invalid or otherwise), before reading the next character from the stream, converting it, and returning it. After a bit of testing, the way it seems to behave is this (a rough sketch modelling these steps follows the list):
1. Return the next character from the internal buffer if it is validly encoded, or if there is more than one character in the buffer (valid or not)
2. Otherwise, read another character from the stream, convert it, and append the converted bytes to the buffer
3. Go back to step 1
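The sketch below only models those three steps against the example above; it is a guess at the observed behavior, not the real IO internals. The buffer is a plain binary String, StringIO stands in for the rewound tempfile, and the helper names (first_char_bytes, valid_char?, model_each_char) are made up for illustration.

require 'stringio'

# Hypothetical model of the observed behavior, not the actual implementation.
# The buffer is a binary String of pending bytes; characters are assumed to be
# 2-byte UTF-16LE code units, with surrogate pairs forming 4-byte characters.
def first_char_bytes(buf)
  return nil if buf.empty?
  return buf.bytes if buf.bytesize < 2                # lone trailing byte
  unit = buf.byteslice(0, 2).unpack1('v')             # first 16-bit LE code unit
  if (0xD800..0xDBFF).cover?(unit) && buf.bytesize >= 4
    buf.byteslice(0, 4).bytes                         # surrogate pair
  else
    buf.byteslice(0, 2).bytes
  end
end

def valid_char?(bytes)
  bytes.pack('C*').force_encoding('UTF-16LE').valid_encoding?
end

def model_each_char(buffer, stream)
  chars = []
  loop do
    char = first_char_bytes(buffer)
    if char && (valid_char?(char) || buffer.bytesize > char.length)
      # Step 1: hand out the next buffered character if it is valid, or if
      # the buffer holds more than one character (valid or not).
      buffer = buffer.byteslice(char.length, buffer.bytesize - char.length)
      chars << char
    elsif (c = stream.getc)
      # Step 2: convert one more character from the stream and append its
      # bytes to the buffer, then go back to step 1.
      buffer += c.encode('UTF-16LE').b
    else
      chars << buffer.bytes unless buffer.empty?      # EOF: drain what's left
      return chars
    end
  end
end

buffer = "🍣".encode('utf-16le').b[0..-2]              # bytes pushed back via ungetc
stream = StringIO.new("🍣")                            # stands in for the rewound file
p model_each_char(buffer, stream)
# => [[60, 216], [99, 60], [216, 99], [223]]  (same as the actual output above)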
Maybe this is desired behavior, but I can't understand why. It can't recover from the kind of erroneous data demonstrated in the example above.