Bug #20663
Updated by javanthropus (Jeremy Bopp) 3 months ago
If bytes that result in at least 2 invalid characters for the internal encoding of an IO object are pushed into the internal buffer with IO#ungetc, reading from the stream returns invalid characters composed of both bytes from the internal buffer and the converted bytes from the stream, even if the next character in the stream itself is completely valid.

``` ruby
char_bytes = Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write("🍣")
  f.rewind
  f.ungetc("🍣".encode('utf-16le').b[0..-2])
  f.each_char.map(&:bytes)
end

puts char_bytes.inspect
```

The above outputs:

```
[[60, 216], [99, 60], [216, 99], [223]]
```

I expect it to output:

```
[[60, 216], [99], [60, 216, 99, 223]]
```

In other words, I expect it to first completely drain the internal character buffer, returning as many characters as necessary (invalid or otherwise), before reading from the stream and converting and returning the next character.

Interestingly, if there are only bytes sufficient for 1 invalid character in the internal buffer, it behaves that way:

``` ruby
char_bytes = Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write("🍣")
  f.rewind
  f.ungetc("🍣".encode('utf-16le').b[0..-3]) # <- Note the -3 here vs. the -2 earlier
  f.each_char.map(&:bytes)
end

puts char_bytes.inspect
```

This outputs:

```
[[60, 216], [60, 216, 99, 223]]
```

The first character is invalid, but returning it first clears the buffer. Then the next character is read, converted, and returned in full.

Text removed from the description by this update:

After a bit of testing, the way it seems to behave is this:

1. Return the next character from the internal buffer either if it's a valid encoding or if there is more than 1 invalid character in the internal buffer, valid or not
2. Otherwise, read another character from the stream, convert it, and append the converted bytes to the buffer
3. Go back to step 1

Maybe this is desired behavior, but I can't understand why. It can't recover from the kind of erroneous data demonstrated in the example above.
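As a reading aid for the two reproductions, here is a minimal sketch (not part of the original report) that uses only String methods to show which bytes each `ungetc` call pushes back and that neither slice is valid UTF-16LE on its own. The variable names are illustrative assumptions, and the sketch does not touch IO at all, so it says nothing about how the buffer itself is drained.

``` ruby
# The sample character transcoded to the internal encoding used in the report.
sushi_utf16 = "🍣".encode('utf-16le')
full_bytes  = sushi_utf16.bytes      # one surrogate pair, little-endian

# The truncated byte strings that the two reproductions pass to ungetc.
three_bytes = sushi_utf16.b[0..-2]   # high surrogate plus half of the low surrogate
two_bytes   = sushi_utf16.b[0..-3]   # lone high surrogate

[three_bytes, two_bytes].each do |pushed|
  # Reinterpret the raw bytes as UTF-16LE to check whether they form valid text.
  as_utf16 = pushed.dup.force_encoding('utf-16le')
  puts "#{pushed.bytes.inspect} valid UTF-16LE? #{as_utf16.valid_encoding?}"
end
```

The three-byte slice corresponds to the first reproduction and the two-byte slice to the second, which is the only difference between the two calls.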