Bug #20663
Reading characters from IO does not recover gracefully from bad data pushed via IO#ungetc
Status: Closed
Description
If bytes that result in at least 2 invalid characters for the internal encoding of an IO object are pushed into the internal buffer with IO#ungetc, reading from the stream returns invalid characters composed of both bytes from the internal buffer and converted bytes from the stream, even when the next character in the stream itself is completely valid.
require 'tempfile'

char_bytes = Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write("🍣")
  f.rewind
  f.ungetc("🍣".encode('utf-16le').b[0..-2])
  f.each_char.map(&:bytes)
end
puts char_bytes.inspect
The above outputs:
[[60, 216], [99, 60], [216, 99], [223]]
I expect it to output:
[[60, 216], [99], [60, 216, 99, 223]]
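For reference (not part of the original report), the byte values above can be checked directly; the sushi character encodes to the surrogate pair D83C DF63 in UTF-16LE:

"🍣".encode('utf-16le').bytes           # => [60, 216, 99, 223]
"🍣".encode('utf-16le').b[0..-2].bytes  # bytes pushed back via ungetc: [60, 216, 99]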
In other words, I expect it to first completely drain the internal character buffer, returning as many characters as necessary (invalid or otherwise), before reading the next character from the stream, converting it, and returning it. After a bit of testing, the way it seems to behave is this (a rough simulation follows the list):
- Return the next character from the internal buffer if it forms a valid character or if the buffer holds more than 1 character, valid or not
- Otherwise, read another character from the stream, convert it, and append the converted bytes to the buffer
- Go back to step 1
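Below is a self-contained toy model of that apparent loop, not Ruby's actual implementation; the names each_char_model, buffer, and converted are made up for illustration. The buffer starts with the IO#ungetc bytes, and the stream side is modeled as data already converted to the internal encoding (UTF-16LE). It reproduces the output shown above:

# Toy model of the observed behavior; not the real io.c logic.
# buffer is a binary string holding the pushed-back bytes; converted is an
# array of stream data already converted to the internal encoding (UTF-16LE).
def each_char_model(buffer, converted)
  result = []
  loop do
    chars = buffer.dup.force_encoding('UTF-16LE').chars
    first = chars.first
    if first && (first.valid_encoding? || chars.length > 1)
      result << buffer.slice!(0, first.bytesize)   # step 1: take from the buffer
    elsif (more = converted.shift)
      buffer << more.b                             # step 2: refill from the stream
    elsif first
      result << buffer.slice!(0, first.bytesize)   # drain leftover bytes at EOF
    else
      break
    end                                            # step 3: go back to step 1
  end
  result.map(&:bytes)
end

buffer    = "🍣".encode('UTF-16LE').b[0..-2]   # bytes pushed via IO#ungetc
converted = ["🍣".encode('UTF-16LE')]          # stream data after conversion
p each_char_model(buffer, converted)
# => [[60, 216], [99, 60], [216, 99], [223]]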
Maybe this is desired behavior, but I can't understand why. It can't recover from the kind of erroneous data demonstrated in the example above.
Updated by javanthropus (Jeremy Bopp) 3 months ago
- ruby -v set to ruby 3.3.4 (2024-07-09 revision be1089c8ec) [x86_64-linux]
Updated by javanthropus (Jeremy Bopp) 3 months ago
- Subject changed from Reading from IO does not recover gracefully from bad data pushed via IO#ungetc to Reading characters from IO does not recover gracefully from bad data pushed via IO#ungetc
Updated by javanthropus (Jeremy Bopp) 3 months ago
After reviewing the sources, I see that this behavior is a consequence of how character conversion is handled. Fetching the next character always looks at an internal buffer for bytes to compose the character. When that buffer is empty, bytes are fetched from the stream, converted from the external encoding to the internal encoding, and then stored in the buffer. As far as IO#each_char is concerned, the bytes pushed into the buffer via IO#ungetc may as well have been left from the previous buffer fill operation, so there's no way to tell that the next bytes in the buffer should be handled in any special way to assist with character reading recovery from the underlying stream.
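If that explanation is right, pushing back a complete, valid character should drain cleanly, since its bytes are indistinguishable from leftover converted data. A quick check of that, assuming the same setup as the original example:

require 'tempfile'

Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write("🍣")
  f.rewind
  f.ungetc("A".encode('utf-16le'))   # a complete, valid UTF-16LE character
  p f.each_char.map(&:bytes)         # expected: [[65, 0], [60, 216, 99, 223]]
end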
In short, please close this issue as a non-bug.
Updated by jeremyevans0 (Jeremy Evans) 3 months ago
- Status changed from Open to Rejected