Bug #20663 (closed)

Reading characters from IO does not recover gracefully from bad data pushed via IO#ungetc

Added by javanthropus (Jeremy Bopp) 3 months ago. Updated 3 months ago.

Status: Rejected
Assignee: -
Target version: -
ruby -v: ruby 3.3.4 (2024-07-09 revision be1089c8ec) [x86_64-linux]
[ruby-core:118782]

Description

If bytes that amount to at least two invalid characters in the internal encoding of an IO object are pushed into the internal buffer with IO#ungetc, reading from the stream returns invalid characters composed of bytes from both the internal buffer and the converted stream data, even when the next character in the stream itself is completely valid.

require 'tempfile'

char_bytes = Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write("🍣")
  f.rewind
  # Push back all but the last byte of the character's UTF-16LE encoding.
  f.ungetc("🍣".encode('utf-16le').b[0..-2])
  f.each_char.map(&:bytes)
end
puts char_bytes.inspect

The above outputs:

[[60, 216], [99, 60], [216, 99], [223]]

I expect it to output:

[[60, 216], [99], [60, 216, 99, 223]]
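
For reference, "🍣" (U+1F363) encodes to the surrogate pair D83C DF63 in UTF-16LE, so the byte sequences involved are:

p "🍣".encode('utf-16le').bytes           # => [60, 216, 99, 223] (what the stream converts to)
p "🍣".encode('utf-16le').b[0..-2].bytes  # => [60, 216, 99]      (what is pushed back via ungetc)

[60, 216] is a lone high surrogate and [99] is the truncated remainder, which is why I group the expected output that way.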

In other words, I expect it to first completely drain the internal character buffer, returning as many characters as necessary (invalid or otherwise), before reading from the stream, converting, and returning the next character. After a bit of testing, the behavior appears to be as follows (see the sketch after this list):

  1. Return the next character from the internal buffer if it is validly encoded or if the buffer holds more than one character, valid or not
  2. Otherwise, read another character from the stream, convert it, and append the converted bytes to the buffer
  3. Go back to step 1
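
The following rough simulation reproduces the observed output under that model. It is only a sketch of my reading of the behavior, not the actual IO internals; the buffer, stream, and valid names and the 2-byte code unit handling are my own:

# A hypothetical model of the steps above, not the actual IO internals.
buffer = "🍣".encode('utf-16le').b[0..-2].bytes  # bytes pushed back via ungetc
stream = ["🍣".encode('utf-16le').bytes]         # already-converted stream characters
valid  = ->(bytes) { bytes.pack('C*').force_encoding('utf-16le').valid_encoding? }

out = []
loop do
  char = buffer.first(2)                         # a UTF-16LE code unit is 2 bytes
  if buffer.size > 2 || (!char.empty? && valid.(char))
    out << buffer.shift(char.size)               # step 1: emit from the buffer
  elsif stream.any?
    buffer.concat(stream.shift)                  # step 2: convert another character
  else
    out << buffer.dup unless buffer.empty?       # EOF: flush the invalid remainder
    break
  end
end
p out  # => [[60, 216], [99, 60], [216, 99], [223]]

Running this prints [[60, 216], [99, 60], [216, 99], [223]], matching the actual output above.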

Maybe this is desired behavior, but I can't understand why. It can't recover from the kind of erroneous data demonstrated in the example above.
