Bug #20663
Reading characters from IO does not recover gracefully from bad data pushed via IO#ungetc
Status: Closed
Description
If bytes that result in at least 2 invalid characters for the internal encoding of an IO object are pushed into the internal buffer with IO#ungetc, reading from the stream returns invalid characters composed of both bytes from the internal buffer and converted bytes from the stream, even when the next character in the stream itself is completely valid.
require 'tempfile'

char_bytes = Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write("🍣")
  f.rewind
  f.ungetc("🍣".encode('utf-16le').b[0..-2])
  f.each_char.map(&:bytes)
end
puts char_bytes.inspect
The above outputs:
[[60, 216], [99, 60], [216, 99], [223]]
I expect it to output:
[[60, 216], [99], [60, 216, 99, 223]]
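For reference (not part of the original report), the byte values above can be checked directly; the sushi character encodes to the surrogate pair D83C DF63 in UTF-16LE:

"🍣".encode('utf-16le').bytes           # => [60, 216, 99, 223]
"🍣".encode('utf-16le').b[0..-2].bytes  # bytes pushed back via ungetc: [60, 216, 99]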
In other words, I expect it to first completely drain the internal character buffer, returning as many characters as necessary (invalid or otherwise), before reading the next character from the stream, converting it, and returning it. After a bit of testing, the way it seems to behave is this (a rough simulation follows the list):
- Return the next character from the internal buffer if it forms a valid character or if the buffer holds more than 1 character, valid or not
- Otherwise, read another character from the stream, convert it, and append the converted bytes to the buffer
- Go back to step 1
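Below is a self-contained toy model of that apparent loop, not Ruby's actual implementation; the names each_char_model, buffer, and converted are made up for illustration. The buffer starts with the IO#ungetc bytes, and the stream side is modeled as data already converted to the internal encoding (UTF-16LE). It reproduces the output shown above:

# Toy model of the observed behavior; not the real io.c logic.
# buffer is a binary string holding the pushed-back bytes; converted is an
# array of stream data already converted to the internal encoding (UTF-16LE).
def each_char_model(buffer, converted)
  result = []
  loop do
    chars = buffer.dup.force_encoding('UTF-16LE').chars
    first = chars.first
    if first && (first.valid_encoding? || chars.length > 1)
      result << buffer.slice!(0, first.bytesize)   # step 1: take from the buffer
    elsif (more = converted.shift)
      buffer << more.b                             # step 2: refill from the stream
    elsif first
      result << buffer.slice!(0, first.bytesize)   # drain leftover bytes at EOF
    else
      break
    end                                            # step 3: go back to step 1
  end
  result.map(&:bytes)
end

buffer    = "🍣".encode('UTF-16LE').b[0..-2]   # bytes pushed via IO#ungetc
converted = ["🍣".encode('UTF-16LE')]          # stream data after conversion
p each_char_model(buffer, converted)
# => [[60, 216], [99, 60], [216, 99], [223]]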
Maybe this is desired behavior, but I can't understand why. It can't recover from the kind of erroneous data demonstrated in the example above.
Updated by javanthropus (Jeremy Bopp) 3 months ago
- ruby -v set to ruby 3.3.4 (2024-07-09 revision be1089c8ec) [x86_64-linux]
Updated by javanthropus (Jeremy Bopp) 3 months ago
- Subject changed from Reading from IO does not recover gracefully from bad data pushed via IO#ungetc to Reading characters from IO does not recover gracefully from bad data pushed via IO#ungetc
Updated by javanthropus (Jeremy Bopp) 3 months ago
After reviewing the sources, I see that this behavior is a consequence of how character conversion is handled. Fetching the next character always looks at an internal buffer for bytes to compose the character. When that buffer is empty, bytes are fetched from the stream, converted from the external encoding to the internal encoding, and then stored in the buffer. As far as IO#each_char is concerned, the bytes pushed into the buffer via IO#ungetc may as well have been left from the previous buffer fill operation, so there's no way to tell that the next bytes in the buffer should be handled in any special way to assist with character reading recovery from the underlying stream.
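If that explanation is right, pushing back a complete, valid character should drain cleanly, since its bytes are indistinguishable from leftover converted data. A quick check of that, assuming the same setup as the original example:

require 'tempfile'

Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write("🍣")
  f.rewind
  f.ungetc("A".encode('utf-16le'))   # a complete, valid UTF-16LE character
  p f.each_char.map(&:bytes)         # expected: [[65, 0], [60, 216, 99, 223]]
end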
In short, please close this issue as a non-bug.
Updated by jeremyevans0 (Jeremy Evans) 3 months ago
- Status changed from Open to Rejected