Project

General

Profile

Actions

Bug #20889

open

IO#ungetc and IO#ungetbyte should not cause IO#pos to report an inaccurate position

Bug #20889: IO#ungetc and IO#ungetbyte should not cause IO#pos to report an inaccurate position

Added by javanthropus (Jeremy Bopp) about 1 year ago. Updated 24 minutes ago.

Status:
Open
Assignee:
-
Target version:
-
ruby -v:
ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [x86_64-linux]
[ruby-core:119895]

Description

require 'tempfile'

Tempfile.open(encoding: 'utf-8') do |f|
  f.write('0123456789')
  f.rewind
  f.ungetbyte(93)
  f.pos       # => -1; negative value is surprising!
end

Tempfile.open(encoding: 'utf-8') do |f|
  f.write('0123456789')
  f.rewind
  f.ungetc('a'.encode('utf-8'))
  f.pos       # => -1; similar to the ungetbyte case
end

Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write('0123456789')
  f.rewind
  f.ungetc('a'.encode('utf-16le'))
  f.pos       # => 0; maybe should be -2 to match the previous ungetc case?
end

It doesn't seem logical that IO#pos should ever be affected by IO#ungetc or IO#ungetbyte. The pushed characters or bytes aren't really in the stream source. The value of IO#pos implies that jumping directly to that position via IO#seek and reading from there would return the same character or byte that was pushed, but the pushed characters or bytes are lost when the operation to seek in the stream is performed. In the case where IO#pos is a negative value, attempting to seek to that position actually raises an exception.

In the IO#ungetc with character conversion case above, it seems unreasonable to make IO#pos report an even less correct position. In that case, the position would need to be adjusted by 2 bytes in reverse due to the internal encoding of the stream, but that is completely inconsistent with the behavior of IO#pos when reading from the stream normally where it reports the underlying stream's byte position and not the number of transcoded bytes that have been read:

require 'tempfile'

Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write('0123456789')
  f.rewind
  f.getc.bytesize # => 2; due to the internal encoding of the stream
  f.pos           # => 1; reports actual bytes read from the stream, not transcoded bytes
end

Attempting to use IO#pos when there are characters or bytes pushed into the read buffer by way of IO#ungetc or IO#ungetbyte should result in one of the following behaviors:

  1. Raise and exception
  2. Return the stream's position, clearing the read buffer entirely
  3. Return the stream's position, ignoring the pushed characters or bytes, and produce a warning

Updated by YO4 (Yoshinao Muramatsu) 24 minutes ago Actions #1 [ruby-core:123791]

I checked #20869 and DevMeeting-2024-11-07.
On that basis, I will now state my current opinion.

IO#ungetc

  • As stated in #21682, the character buffer is considered isolated from the underlying stream, and IO#ungetc does not change the file position.

IO#getbyte

  1. To enable the operation of reading and then reverting back, it is preferable for the file position to change. This is a way that can be used even on devices where seeking is not possible.
  2. To allow reverting back using the same number of ungetbyte/getbyte operations, negative file positions should be permitted. If the operation to move to a negative file position were prohibited, it would be less convenient.
  3. Clearing the buffer while at a negative file position prevents the file position and subsequent reads from being determined, and cannot be restored using IO#getbyte. Therefore, IO#pos should not clear the buffer.

IO#pos

  • IO#pos is an operation that does not affect the target, analogous to how peek relates to read, as compared to seek.

IO#seek

  • IO#seek changes the file position and clear the buffer.
Actions

Also available in: PDF Atom