Feature #13626 (Open)
Add String#byteslice!
Added by ioquatix (Samuel Williams) over 7 years ago. Updated about 2 years ago.
Description
It's a common pattern in IO buffering to consume part of a string while leaving the remainder.
# Consume only part of the read buffer:
result = @read_buffer.byteslice(0, size)
@read_buffer = @read_buffer.byteslice(size, @read_buffer.bytesize)
It would be nice if this code could be simplified to:
result = @read_buffer.byteslice!(size)
Additionally, this would allow a significantly more efficient implementation in the interpreter.
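For reference, a pure-Ruby sketch of the intended semantics (hypothetical; an actual implementation would live in C and could avoid the extra copy):
# Hypothetical pure-Ruby equivalent of the proposed String#byteslice!(size):
# it removes and returns the first `size` bytes, leaving the rest in self.
class String
  def byteslice!(size)
    result = byteslice(0, size)
    rest = byteslice(size, bytesize) || byteslice(0, 0) # keep self's encoding
    replace(rest)
    result
  end
end
buffer = "hello world".b
buffer.byteslice!(5) # => "hello"
buffer               # => " world"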
Updated by normalperson (Eric Wong) over 7 years ago
samuel@oriontransfer.org wrote:
I used to want this, too; but then I realized IO#read and
similar methods will always return a binary string when given a
length limit.
So String#slice! should be enough.
(And IO#read and friends without a length limit is suicidal, anyways :)
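For illustration, the pattern Eric describes might look like this (@io and size are assumed from the surrounding context; the buffer is only ever fed by IO#read with a length argument, which returns ASCII-8BIT strings):
# Because read(maxlen) returns a binary string, byte index = character
# index, so the existing String#slice! already does the right thing.
@read_buffer = String.new.b
chunk = @io.read(4096)
@read_buffer << chunk if chunk
result = @read_buffer.slice!(0, size) # consume the first `size` bytes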
Updated by ioquatix (Samuel Williams) over 7 years ago
Thanks for that idea.
If that's the case, when appending to the write buffer:
write_buffer = String.new.b
unicode_string = "\u1234".force_encoding("UTF-8")
write_buffer << unicode_string
write_buffer.encoding # Changed from ASCII-8BIT to Encoding:UTF-8
The only way I can think of to fix this is to run +force_encoding+ on the write buffer after every append, but this seems hugely inefficient.
Ideas?
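A minimal sketch of that workaround, for illustration:
# Re-tag the buffer after every append so it stays binary.
write_buffer = String.new.b
write_buffer << "\u1234"
write_buffer.force_encoding(Encoding::BINARY)
write_buffer.encoding # => #<Encoding:ASCII-8BIT>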
Updated by normalperson (Eric Wong) over 7 years ago
samuel@oriontransfer.org wrote:
Thanks for that idea.
If that's the case, when appending to the write buffer:
write_buffer = String.new.b
unicode_string = "\u1234".force_encoding("UTF-8")
write_buffer << unicode_string
write_buffer.encoding # Changed from ASCII-8BIT to Encoding:UTF-8
The only way I can think of to fix this is to run +force_encoding+ on the write buffer after every append, but this seems hugely inefficient.
Ideas?
String#force_encoding is done in-place, so it should not be
that slow; the String#<< would be the slow part, since it
involves at least one memcpy (worst case is realloc + 2 memcpy).
But I'm not sure why you would want to be setting data to
UTF-8; I guess you got it from some 3rd-party library?
Maybe String#b! could be a shorter alias for
force_encoding(Encoding::BINARY); but yeah, exposing writev via
[Feature #9323] is probably the best option, anyways.
Fwiw, I'm also not convinced String#<< behavior about changing
write_buffer to Encoding::UTF-8 in your above example is good
behavior on Ruby's part... But I don't know much about human
language encodings, I am just a *nix plumber where a byte is a
byte.
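A quick check of the in-place point above: force_encoding re-tags the receiver rather than copying it.
s = "\u1234"
s.equal?(s.force_encoding(Encoding::BINARY)) # => true, same object
s.encoding # => #<Encoding:ASCII-8BIT>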
Updated by ioquatix (Samuel Williams) over 7 years ago
Fwiw, I'm also not convinced String#<< behavior about changing
write_buffer to Encoding::UTF-8 in your above example is good
behavior on Ruby's part...
Agreed.
Updated by matz (Yukihiro Matsumoto) about 7 years ago
Sounds OK to me.
Matz.
Updated by akr (Akira Tanaka) about 7 years ago
At the developer meeting, we discussed that byteslice! and byteslice should take the same arguments.
Updated by duerst (Martin Dürst) about 7 years ago
normalperson (Eric Wong) wrote:
Fwiw, I'm also not convinced String#<< behavior about changing
write_buffer to Encoding::UTF-8 in your above example is good
behavior on Ruby's part... But I don't know much about human
language encodings, I am just a *nix plumber where a byte is a
byte.
This behavior may not be the best for this specific case, but in general, if one string is effectively US-ASCII (contains only ASCII bytes) and the other is UTF-8, then, because UTF-8 is a superset of US-ASCII, concatenating the two will produce a string in UTF-8. Dropping the encoding would lose important information.
Please also note that you are actually on dangerous ground here. The above only works because the string doesn't contain any non-ASCII (high bit set) bytes. As soon as there is such a byte, there will be an error.
s = "abcde".b
s.encoding # => #<Encoding:ASCII-8BIT>
s << "αβγδε" # => "abcdeαβγδε"
s.encoding # => #<Encoding:UTF-8>
but:
t = "αβγδε".b # => "\xCE\xB1\xCE\xB2\xCE\xB3\xCE\xB4\xCE\xB5"
t.encoding # => #<Encoding:ASCII-8BIT>
t << "λμπρ" # => Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
So if you have an ASCII-8BIT buffer, and want to append something, always make sure you make the appended stuff also ASCII-8BIT.
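For example, following that advice keeps the buffer's encoding stable and avoids the error above:
t = "αβγδε".b
t << "λμπρ".b # works, both sides are ASCII-8BIT
t.encoding # => #<Encoding:ASCII-8BIT>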
Updated by ioquatix (Samuel Williams) over 6 years ago
If you round trip UTF-8 to ASCII-8BIT and back again, the result should be the same IMHO. It's just the interpretation of the bytes which is different, but the underlying data should be the same. I still think adding String#byteslice! is a good idea. Has there been any progress?
Updated by ioquatix (Samuel Williams) over 6 years ago
By the way, I ended up implementing https://github.com/socketry/async-io/blob/master/lib/async/io/binary_string.rb which I guess is okay but it's not ideal.
Updated by janko (Janko Marohnić) over 6 years ago
I support adding String#byteslice!. I've been using String#byteslice in custom IO-like objects that implement IO#read semantics, as the strings I work with don't necessarily have to be in binary encoding (otherwise I'd just use String#slice); they can also be in UTF-8. Since IO#read needs to work in terms of bytes, I needed String#byteslice.
I've used the exact idiom from Samuel's original description in three different projects already:
- https://github.com/janko-m/down/blob/ac4a32f296cb9cd8c12fc46a01a7e2f7c5fcd1b2/lib/down/chunked_io.rb#L169-L170
- https://github.com/janko-m/goliath-rack_proxy/blob/7b359ff3ddfa3cba23c32220389abb39481735a9/lib/goliath/rack_proxy.rb#L134-L135
- https://github.com/socketry/falcon/blob/12b8818812b23c920e545e6b4c91e08e5348ee04/lib/falcon/adapters/input.rb#L80-L81
String#byteslice! would allow reducing the code and would probably end up allocating fewer strings.
Updated by byroot (Jean Boussier) over 2 years ago
- Related to Bug #18972: String#byteslice should return BINARY (aka ASCII-8BIT) Strings added
Updated by Eregon (Benoit Daloze) over 2 years ago
Why not simply use String#slice! if the string encoding is BINARY?
result = @read_buffer.slice!(0, size) # @read_buffer must be in the BINARY encoding
For IO buffers, I think it's reasonable to ensure every string appended is BINARY, so the << gotcha is just a small inconvenience.
And if it's not BINARY (or another fixed-width encoding), how do you ensure you are not cutting, e.g., in the middle of a UTF-8 character?
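For example, slicing a UTF-8 string at an arbitrary byte offset can cut a multibyte character in half:
s = "héllo" # "é" is two bytes in UTF-8
s.byteslice(0, 2) # => "h\xC3"
s.byteslice(0, 2).valid_encoding? # => false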
Updated by Eregon (Benoit Daloze) over 2 years ago
I think there is a misunderstanding of what the byte* methods are for.
byte* methods are for dealing with byte indices, avoiding the conversion between byte and character indices (which can be expensive for UTF-8).
byte* methods are not "methods for BINARY strings". For BINARY strings it's fine (or even better) to use the regular String methods, since byte index = character index for BINARY and other fixed-width encodings.
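To illustrate the distinction:
s = "héllo" # UTF-8: 5 characters, 6 bytes
s.slice(0, 2) # => "hé" (2 characters)
s.byteslice(0, 3) # => "hé" (3 bytes: "h" plus the 2-byte "é")
b = s.b # BINARY: byte index = character index
b.slice(0, 3) == b.byteslice(0, 3) # => true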
Updated by byroot (Jean Boussier) over 2 years ago
The PR is here in case someone feels like reviewing: https://github.com/ruby/ruby/pull/6275
As for the recently raised concerns, I don't really have any strong opinion. I implemented this at @ioquatix (Samuel Williams)'s demand; I personally believe that, given Ruby's String implementation, calling slice! (or byteslice!) on a buffer is terrible for performance (cf. https://github.com/ruby/net-protocol/pull/14).
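As a rough illustration of that concern (a sketch of the offset-tracking idea, not the actual net-protocol code): consuming bytes by advancing an offset avoids rewriting the buffer on every read.
# Sketch only: consume by advancing an offset and compact occasionally,
# instead of calling slice! (which rewrites the buffer) on every read.
class ReadBuffer
  def initialize
    @buffer = String.new.b
    @offset = 0
  end

  def <<(chunk)
    @buffer << chunk
    self
  end

  def read(size)
    result = @buffer.byteslice(@offset, size)
    @offset += result.bytesize
    compact if @offset > 64 * 1024
    result
  end

  private

  def compact
    @buffer = @buffer.byteslice(@offset, @buffer.bytesize - @offset)
    @offset = 0
  end
end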
That said, it's always very awkward to see code that mixes bytesize, byteslice, and slice!; every time I see some, I think I've found a bug, until I audit it and figure out that the string is indeed Encoding::BINARY. So for that alone I'm in favor of this method.
Updated by ioquatix (Samuel Williams) about 2 years ago
Just to clarify, I hope I have not demanded anything. "But if you want to have a go at it, that would be awesome" was all I said.
I have been trying buffer.force_encoding(Encoding::BINARY) followed by slice!, but you are right, it does look awkward, and given how easily a string can change to a non-binary encoding, I get a similar feeling about whether it's a bug or not (or could become one in some unexpected scenario).
Updated by byroot (Jean Boussier) about 2 years ago
I hope I have not demanded anything
Yes, sorry, that's not what I meant; it's one of those words that has a similar meaning in French, yet a radically different connotation.
Updated by Eregon (Benoit Daloze) about 2 years ago
I think the underlying issue is that we want a string append method which does not change the receiver's encoding (and instead raises an EncodingError if it would need to change it).
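As a rough sketch of that behaviour (a hypothetical helper, not an existing or proposed API):
# Hypothetical: append while keeping the receiver's encoding, raising an
# EncodingError instead of silently re-tagging the receiver as << does today.
def append_keeping_encoding(buffer, str)
  result_encoding = Encoding.compatible?(buffer, str)
  unless result_encoding == buffer.encoding
    raise EncodingError, "appending a #{str.encoding} string would change the buffer's encoding"
  end
  buffer << str
end
buf = String.new.b
append_keeping_encoding(buf, "data".b) # fine, buf stays ASCII-8BIT
append_keeping_encoding(buf, "\u1234") # raises EncodingError instead of re-tagging buf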