Bug #18833: Documentation for IO#gets is inaccurate (bytes versus characters) - Ruby - Ruby Issue Tracking System

Bug #18833

closed

Documentation for IO#gets is inaccurate (bytes versus characters)

Added by adh1003 (Andrew Hodgkinson) about 3 years ago. Updated about 3 years ago.

Status: Rejected
Assignee:
Target version:
ruby -v: N/A
Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN

[ruby-core:108943]

Description

Please see https://ruby-doc.org/core-3.1.2/IO.html#method-i-gets:

With integer argument limit given, returns up to limit+1 bytes:

In relation to https://github.com/janko/down/pull/74, I discovered a difference between IO#read and IO#gets. When asked to read a specific number of bytes, IO#read ignores the stream's specified encoding and does exactly that: it reads the requested number of 8-bit bytes. IO#gets, on the other hand, respects the encoding when given a limit, and the number provided is characters, not bytes. This means that more actual bytes might be read from the file (advancing its file pointer accordingly), both because of things like a BOM and because of multibyte encodings. Moreover, the number of bytes in the returned data can exceed the number passed to the method (because it's a number of characters, contrary to the documentation), and the data won't necessarily include some bytes from the very start of the file (a UTF-8 BOM is stripped, for example). IO#gets does correctly handle a multibyte character being split at the limit of the requested read position if taken as bytes: it continues reading more bytes until it has read the requested number of complete characters.
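A minimal sketch of the read-versus-gets observation, assuming a throwaway file that holds the two characters "\u1234a" (4 bytes in UTF-8). The helper name sample_reads and the temp-file prefix are made up for illustration and do not come from the linked PR. It calls IO#read and IO#gets with the same number on the same data:

```ruby
require "tempfile"

# Hypothetical helper: write a 3-byte character followed by a 1-byte character
# to a temp file, then call read(limit) and gets(limit) on fresh handles.
def sample_reads(limit)
  Tempfile.create("gets_vs_read") do |tmp|
    File.binwrite(tmp.path, "\u1234a")
    raw  = File.open(tmp.path, "r:UTF-8") { |f| f.read(limit) }
    line = File.open(tmp.path, "r:UTF-8") { |f| f.gets(limit) }
    [raw, line]
  end
end

raw, line = sample_reads(1)
p raw.bytesize   # => 1
p raw.encoding   # binary (ASCII-8BIT), despite the "r:UTF-8" mode
p line.bytesize  # => 3
p line.encoding  # UTF-8
```

With the number 1, read hands back a single raw byte, while gets hands back one complete 3-byte character, so the returned byte count exceeds the number passed in.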

(It is in fact clearly unavoidable that it works in an encoding-aware fashion, else it would be unable to accurately interpret the sep parameter. Coercing everything down to a pure 8-bit byte stream and trying to dumb-match the stream that way would risk mismatching a separator byte sequence within the wider file byte stream at a non-character boundary.)

This is causing confusion for people implementing IO subclasses or IO-like classes and I'm sure you recognise that it is of critical importance that the distinction between bytes and characters is made accurately, especially in such a crucial low-level piece of documentation as IO.

If you wish, I can have a go at figuring out a PR for it (not really done that outside of GitHub before, so something of a learning curve!).

Updated by adh1003 (Andrew Hodgkinson) about 3 years ago

Subject changed from Documentation for IO#gets is in accurate (bytes versus characters) to Documentation for IO#gets is inaccurate (bytes versus characters)

#2 [ruby-core:108944]

Updated by adh1003 (Andrew Hodgkinson) about 3 years ago

For avoidance of doubt, the behaviour of Ruby itself is (IMHO) sensible and working well. The only change needed is to alter the word "bytes" to "characters" for the IO#gets description of the limit parameter.

#3 [ruby-core:108947]

Updated by adh1003 (Andrew Hodgkinson) about 3 years ago

Correction - the IO#gets data for a UTF-8 input stream including BOM does include the BOM as an invisible first character. I didn't notice at first because it's, well, invisible!

Doesn't change the documentation issue at hand, but wanted to correct my incorrect assertion in case it confused or distracted anyone reading.
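A small sketch of the BOM behaviour described in this correction, assuming a temp file that starts with a UTF-8 BOM. The helper name first_lines is hypothetical, and the "r:BOM|UTF-8" open mode is not mentioned in this thread; it is included only as a point of comparison:

```ruby
require "tempfile"

# Hypothetical helper: write `content` to a temp file and return the first
# line as read under each of the given open modes.
def first_lines(content, modes)
  Tempfile.create("bom_demo") do |tmp|
    File.binwrite(tmp.path, content)
    modes.to_h { |mode| [mode, File.open(tmp.path, mode, &:gets)] }
  end
end

lines     = first_lines("\uFEFFhi\n", ["r:UTF-8", "r:BOM|UTF-8"])
plain     = lines["r:UTF-8"]
bom_aware = lines["r:BOM|UTF-8"]

p plain.codepoints.first.to_s(16)  # => "feff" (the BOM is the first character)
p plain.length                     # => 4
p bom_aware                        # => "hi\n"
```

Printing the plain result to a terminal usually shows just "hi", which is why the leading BOM character is easy to miss.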

#4 [ruby-core:109024]

Updated by jeremyevans0 (Jeremy Evans) about 3 years ago

Status changed from Open to Rejected

The documentation is correct; the limit is in bytes, not characters:

File.write("a", "\u1234a") # => 4 # bytes written
File.open('a', 'r:UTF-8').read.length # => 2 # characters in file
File.open('a', 'r:UTF-8').gets(1) # => "\u1234"
File.open('a', 'r:UTF-8').gets(2) # => "\u1234"
File.open('a', 'r:UTF-8').gets(3) # => "\u1234"
File.open('a', 'r:UTF-8').gets(4) # => "\u1234a"

If limit were in characters and not bytes, gets(2) and gets(3) would return "\u1234a", since there are only two characters.

For multibyte encodings, the limit actually sets a limit on the starting byte of a multibyte character. That's why gets(1), gets(2), and gets(3) all return "\u1234" (a 3-byte character). The current documentation accurately describes this: https://docs.ruby-lang.org/en/master/IO.html#class-IO-label-Line+Limit
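To see the "starting byte" rule with a different character width, here is a rough sketch that assumes a temp file holding three 2-byte characters (U+00E9). The helper name limit_table is made up for illustration. It records the character count, the byte count, and the file position after a single gets(limit) call on a fresh handle:

```ruby
require "tempfile"

# Hypothetical helper: for each limit, open a fresh handle on a temp file that
# holds `content`, call gets(limit) once, and record what came back.
def limit_table(content, limits)
  Tempfile.create("line_limit") do |tmp|
    File.binwrite(tmp.path, content)
    limits.map do |limit|
      File.open(tmp.path, "r:UTF-8") do |f|
        text = f.gets(limit)
        { limit: limit, chars: text.length, bytes: text.bytesize, pos: f.pos }
      end
    end
  end
end

# Three 2-byte characters, 6 bytes in total.
limit_table("\u00E9" * 3, 1..4).each { |row| p row }
# limits 1 and 2 give chars: 1, bytes: 2, pos: 2
# limits 3 and 4 give chars: 2, bytes: 4, pos: 4
```

A limit of 3 lands in the middle of the second character, so that character is completed and 4 bytes come back. A limit of 4 stops at the same place, because the third character would start at byte offset 4, which is not below the limit.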
