Project

General

Profile

Actions

Bug #20869

closed

IO buffer handling is inconsistent when seeking

Added by javanthropus (Jeremy Bopp) 16 days ago. Updated 9 days ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 3.3.4 (2024-07-09 revision be1089c8ec) [x86_64-linux]
[ruby-core:119741]

Description

When performing any of the seek based operations on IO (IO#seek, IO#pos=, or IO#rewind), the read buffer is inconsistently cleared:

require 'tempfile'

Tempfile.open do |f|
  f.write('0123456789')
  f.rewind

  # Calling #ungetbyte as the first read buffer
  # operation uses a buffer that is preserved during
  # seek operations
  f.ungetbyte(97)
  # Byte buffer will not be cleared
  f.seek(2, :SET)

  f.getbyte       # => 97
end

Tempfile.open do |f|
  f.write('0123456789')
  f.rewind

  # Calling #getbyte before #ungetbyte uses a
  # buffer that is not preserved when seeking
  f.getbyte
  f.ungetbyte(97)
  # Byte buffer will be cleared
  f.seek(2, :SET)

  f.getbyte       # => 50
end

Similar behavior happens when reading characters:

require 'tempfile'

Tempfile.open do |f|
  f.write('0123456789')
  f.rewind

  # Calling #ungetc as the first read buffer
  # operation uses a buffer that is preserved during
  # seek operations
  f.ungetc('a')
  # Character buffer will not be cleared
  f.seek(2, :SET)

  f.getc       # => 'a'
end

Tempfile.open do |f|
  f.write('0123456789')
  f.rewind

  # Calling #getc before #ungetc uses a
  # buffer that is not preserved when seeking
  f.getc
  f.ungetc('a')
  # Character buffer will be cleared
  f.seek(2, :SET)

  f.getc       # => '2'
end

When transcoding, however, the character buffer is never cleared when seeking:

require 'tempfile'

Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write('0123456789')
  f.rewind

  f.ungetc('a'.encode('utf-16le'))
  # Character buffer will not be cleared
  f.seek(2, :SET)

  f.getc       # => 'a'.encode('utf-16le')
end

Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write('0123456789')
  f.rewind

  f.getc
  f.ungetc('a'.encode('utf-16le'))
  # Character buffer will not be cleared
  f.seek(2, :SET)

  f.getc       # => 'a'.encode('utf-16le')
end

I would expect the buffers to be cleared in all cases except possibly when the seek operation doesn't actually move the file pointer such as when calling IO#pos or IO#seek(0, :CUR). The inconsistent behavior demonstrated here is a problem regardless though.

Updated by byroot (Jean Boussier) 16 days ago ยท Edited

I just looked into this a bit, I'm not quite familiar enough with the code to really propose a fix, but I get what is happening:

ungetbyte just shift the buffer offset, but the FD offset in unchanged.

static void
io_ungetbyte(VALUE str, rb_io_t *fptr)
{
    // snip...
    // ungetbyte just shift the buffer offset, but the FD offset in unchanged
    fptr->rbuf.off-=(int)len;
    fptr->rbuf.len+=(int)len;
    MEMMOVE(fptr->rbuf.ptr+fptr->rbuf.off, RSTRING_PTR(str), char, len);
}

fptr->rbuf.len == 1, but real FD offset is 0
So we're doing lseek(-1) which fail with EINVAL

static void
io_unread(rb_io_t *fptr)
{
    rb_off_t r;
    rb_io_check_closed(fptr);
    if (fptr->rbuf.len == 0 || fptr->mode & FMODE_DUPLEX)
        return;
    /* xxx: target position may be negative if buffer is filled by ungetc */
    errno = 0;
    // fptr->rbuf.len == 1, but real FD offset is 0
    // So we're doing lseek(-1) which fail with EINVAL
    r = lseek(fptr->fd, -fptr->rbuf.len, SEEK_CUR);
    if (r < 0 && errno) {
        if (errno == ESPIPE)
            fptr->mode |= FMODE_DUPLEX;
        return;
    }
    fptr->rbuf.off = 0;
    fptr->rbuf.len = 0;
    return;
}

So I suppose some more tracking info is needed to know that the real FD position and the buffer offset are desynced.

Updated by byroot (Jean Boussier) 16 days ago

Just a quick proof of concept that fixes the first case: https://github.com/ruby/ruby/commit/7481a12fef3df934ab0d9db7f8f2d36341a1562e

But I think a lot more codepath would need to consider and update that new offset for the entire class of bug to be fixed.

Updated by javanthropus (Jeremy Bopp) 15 days ago

I think your code change highlights another bug caused by the current behavior where IO#pos can report negative values. Oddly, IO#seek(0, :CUR) still returns 0:

require 'tempfile'

Tempfile.open do |f|
  f.write('0123456789')
  f.rewind

  f.ungetbyte(97)
  f.pos             # => -1
  f.seek(0, :CUR)   # => 0
end

Note that IO#pos works correctly when used with IO#ungetc while transcoding since that cases causes an entirely different buffer to be used which is currently ignored by the seeking functions. As demonstrated in the issue description though, that buffer isn't ever cleared when using the seeking functions.

Conceptually, it makes sense to me that the seeking functions should only care about bytes from the underlying stream since that's what they operate on. They should ignore read buffer manipulation by IO#ungetbyte and IO#ungetc since the data pushed by those methods have no relationship to the bytes of the stream. What I don't see anywhere I've looked is a statement regarding how IO#ungetbyte and IO#ungetc should interact with seeking operations. The existing specs and docs don't seem to cover those cases.

It would be great to get clarification here before working on solutions. While I think the best solution would be to disregard the bytes added by IO#ungetbyte and IO#ungetc and to clear the relevant buffers when seeking, I can imagine others may prefer to preserve the buffers. Maybe the solution is to leave the behavior deliberately undefined and just warn people against mixing these methods via documentation.

Updated by byroot (Jean Boussier) 15 days ago

I'll add this to the next developer meeting.

Updated by javanthropus (Jeremy Bopp) 15 days ago

Considering a parallel, the manpage for the fseek(3) function clearly states that the effects of ungetc(3) are undone on successful calls to fseek(3).

Updated by nobu (Nobuyoshi Nakada) 14 days ago

The buffers and Encoding::Converters should be discarded at positioning, we think.

Updated by byroot (Jean Boussier) 14 days ago

Right, but just discarding the buffers isn't enough, because in the case of SEEK_CUR, you need to know exactly how much ahead of the expected cursor you really are.

The only solution I see is to keep a count of how many bytes were added through ungetc etc, which is what I outlined in https://github.com/ruby/ruby/commit/7481a12fef3df934ab0d9db7f8f2d36341a1562e

Actions #8

Updated by nobu (Nobuyoshi Nakada) 14 days ago

  • Status changed from Open to Closed

Applied in changeset git|ee29aade1a432cbd5e4d5ec6ea6ccb4fa87361be.


[Bug #20869] Discard read buffer and encoding converters at seeking

Updated by javanthropus (Jeremy Bopp) 14 days ago

The documentation that was added says that IO#tell and IO#pos would clear the buffers, but they appear to have a special code path now to avoid it. IO#seek(0, :CUR) doesn't share this behavior, and that's a curious inconsistency.

Updated by nobu (Nobuyoshi Nakada) 14 days ago

The documentation is outdated, I'll fix it.

Since io.pos (not assignment) looks mere attribute, differentiated from seek.
Doesn't this make sense?

Updated by javanthropus (Jeremy Bopp) 13 days ago

Since io.pos (not assignment) looks mere attribute, differentiated from seek.

If not for the fact that IO#seek always returns 0 regardless of its arguments (something I've never understood), IO#pos could be implemented as IO#seek(0, :CUR). Why not avoid busting the buffer in that case? On the other hand, why not simplify the implementation and bust the buffer in all cases? Maybe I'm too hung up on implementation and am unable to see IO#pos as merely an attribute.

I also just installed the latest master branch build to check the changes, and there are still a couple of issues:

  1. It's still possible for IO#pos to return negative values:
    require 'tempfile'
    
    Tempfile.open do |f|
      f.write('0123456789')
      f.rewind
    
      f.ungetbyte(97)
      f.pos             # => -1
    end
    
  2. The character buffer isn't cleared when transcoding and seeking without first calling IO#getc:
    require 'tempfile'
    
    Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
      f.write('0123456789')
      f.rewind
    
      f.ungetc('a'.encode('utf-16le'))
      # Character buffer will not be cleared
      f.seek(2, :SET)
    
      f.getc       # => 'a'.encode('utf-16le'); should be '2'.encode('utf-16le')
    end
    

Updated by javanthropus (Jeremy Bopp) 9 days ago

I found another issue while looking for more sharp edges on this, and I've opened #20889 as a result.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like1Like1Like0Like0Like0Like0Like0Like0Like0Like0