Project

General

Profile

Actions

Feature #20394

closed

Add an offset parameter to `String#to_i`

Added by byroot (Jean Boussier) about 1 month ago. Updated 30 days ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:117324]

Description

Context

I maintain the redis-client gem, and it comes with an optional swapable implementation in C that binds the hiredis C client, which used to performs up to 5 times faster in some cases.

I recently paired with @tenderlovemaking (Aaron Patterson) to try to close this gap, or even try to make the pure Ruby version faster, and we came up with several optimizations that now almost make both version on par (assuming YJIT is enabled).

An important source of performance loss, is that the Redis protocol is line based and to parse it in Ruby requires to slice a lot of small strings from the buffer. To give an example, here's how an Array with two String (["foo", "plop"]) is serialized in RESP3 (Redis protocol):

*2\r\n
$3\r\n
foo\r\n
$4\r\n
plop\r\n

From this you can understand that a big hotspot in the parser is essentially Integer(gets).

With @tenderlovemaking (Aaron Patterson) we managed to get a fairly significant perf boost by avoiding these string allocation using String#getbyte and basically implementing a rudimentary String#to_i(offset: ) in Ruby.

But while the gains are huge with YJIT enabled, they are much more tame with the interpreter. And it feels a bit wrong to have to implement this sorts of things for performance reasons.

String#to_i(offset: )

Similar to String#unpack(offset:) ([Feature #18254]), I believe String#to_i(offset: ) would be useful.

Alternative new String#unpack format

Another possibility would be to add a new format to String#pack String#unpack for decimal numbers. It sounds a bit weird at first, but given it supports things like Base64 and hexadecimal, perhaps it's not that much of a stretch?

Updated by Eregon (Benoit Daloze) about 1 month ago

I think #19315 is a more general solution for this.
It feels unidiomatic to add offset kwargs to core methods just to avoid substrings, I think we should make substrings faster instead.

For unpack it's more natural to have an offset as it basically operates on a byte array/binary data and not really a string + there was @ (skip to the offset given by the length argument) already.

Which makes me think, if the proposed offset kwarg would be in characters that could be quite slow, but then it seems potentially also surprising if it was in number of bytes (given String#to_i at least conceptually is iterating/interpreting characters).

Updated by Eregon (Benoit Daloze) about 1 month ago

BTW I think the custom String#to_i makes a lot of sense, and might be faster than String#to_i because it needs to handle fewer cases and can e.g. handle reading directly from an IO vs going through a String in between.

Updated by byroot (Jean Boussier) about 1 month ago · Edited

I think #19315 is a more general solution for this.

I don't think so, because for the to_i case at least, you can already do this today with byteslice:

def parse_int(offset)
  @buffer.byteslice(offset, -1).to_i
end

But the performance is much worse because this byteslice while it avoid copying, does two allocations:

  • It allocated a frozen "root" string, and make it the owner of the byte array.
  • @buffer become a shared string pointing to the new frozen root
  • Then byteslice returns a new shared string.

Additionally, doing this to @buffer means the next time I read more bytes from the IO, Ruby will allocate a new buffer instead of reusing the old one.

if the proposed offset kwarg would be in characters that could be quite slow, but then it seems potentially also surprising if it was in number of bytes

Good point. unpack might make more sense, but as you say it's a bit of a stretch compared to its current capabilities.

handle reading directly from an IO vs going through a String in between.

We tried this, but it was way slower (https://github.com/redis-rb/redis-client/pull/150), we haven't dug much as to why. But generally I'd love if I could just rely on the internal IO buffer instead of having to maintain a buffer string an an offset, but we're far from there I'm afraid.

Updated by Dan0042 (Daniel DeLorme) about 1 month ago

byroot (Jean Boussier) wrote in #note-3:

We tried this, but it was way slower (https://github.com/redis-rb/redis-client/pull/150), we haven't dug much as to why. But generally I'd love if I could just rely on the internal IO buffer instead of having to maintain a buffer string an an offset, but we're far from there I'm afraid.

The idea of an offset in a string is perfectly represented via the cursor in IO/StringIO, so it would be ideal to just use that instead of adding offset kwargs to various core methods.

If the only issue is performance I tried to get an idea of the problem with some cursory benchmarking for #getbyte and...

require "stringio"
fd = File.open("1.2MB-file")
str = fd.read
io = StringIO.new(str)

bm{ fd.rewind; nil while fd.getbyte } #=> 0.0270s
bm{ io.rewind; nil while io.getbyte } #=> 0.0195s
bm{ i=0; i+=1 while str.getbyte(i)  } #=> 0.0237s

It doesn't seem like String#getbyte is much faster than File#getbyte, and StringIO#getbyte is fastest of all. Maybe the result would be different if using sockets? In that case it might be worth buffering to a StringIO.

Of course even with that it's not ideal to call #getbyte multiple times to build an integer, so maybe we could have something like IO#get_i(base=10) which reads as many numeric characters as possible and return them as an integer. Returns nil if no numeric characters at cursor. WDYT?

Updated by shan (Shannon Skipper) about 1 month ago

Dan0042 (Daniel DeLorme) wrote in #note-4:

It doesn't seem like String#getbyte is much faster than File#getbyte, and StringIO#getbyte is fastest of all.

I'm seeing a similar result to what you show above with YJIT disabled, but str.getbyte(i) seems to pull ahead substantially with YJIT enabled on macOS and Linux with both Ruby 3.3 and nightly.

ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [arm64-darwin23]
Calculating -------------------------------------
          fd.getbyte    114.407 (± 0.9%) i/s -    575.000 in   5.026157s
          io.getbyte    148.602 (± 0.7%) i/s -    756.000 in   5.087645s
      str.getbyte(i)    261.846 (± 0.8%) i/s -      1.310k in   5.003151s

Comparison:
      str.getbyte(i):      261.8 i/s
          io.getbyte:      148.6 i/s - 1.76x  slower
          fd.getbyte:      114.4 i/s - 2.29x  slower

Updated by mame (Yusuke Endoh) about 1 month ago

As @Eregon (Benoit Daloze) said, String#to_i(offset:) with byte offset looks strange to me.

I like IO#get_i(base=10) and StringIO#get_i (despite the name).

I was a little concerned that it returns infinitely large Bignum when the IO reads "99999999..." in terms of Denial of Service. But this is essentially the same for IO#gets, so it may be acceptable to prohibit using get_i for unreliable IO.

About the method name, I'd propose IO#get_integer or IO#parse_integer or IO#scan_integer. I don't think it will be used so frequently enough to give an abbreviated name like get_i.

BTW, I think other reader than Integer will be wanted. Recently I saw The One Billion Row Challenge and thought that I want IO#parse_float. Generalizing, we may want IO#scanf, but that's probably overkill?

(This comment includes ideas from chatting with @osyoyu (Daisuke Aritomo))

Updated by byroot (Jean Boussier) about 1 month ago

The idea of an offset in a string is perfectly represented via the cursor in IO/StringIO

Indeed, but the problem is that you then have very few methods to parse values or peak in the buffer to find elements. Lots of methods needed for parsing various protocols are present on String, but not IO.

Also until recently, methods like IO#gets didn't have a way to be timedout, so they weren't safe to use in such context. In 3.2+ there's now IO#timeout=, but as I said parsing directly from the IO turned out much slower in my attempts, but perhaps I didn't do it well.

Updated by Dan0042 (Daniel DeLorme) about 1 month ago

mame (Yusuke Endoh) wrote in #note-6:

Generalizing, we may want IO#scanf, but that's probably overkill?

This previously existed in the stdlib but was removed: #16170

byroot (Jean Boussier) wrote in #note-7:

Indeed, but the problem is that you then have very few methods to parse values or peak in the buffer to find elements. Lots of methods needed for parsing various protocols are present on String, but not IO.

With StringIO you have the best of both worlds! You can use the methods with built-in #pos, like #getbyte, or drop down to the underlying #string if you want to use specialized methods like #unpack or #scan ! We could even add #unpack and #scan to StringIO and they would start the unpack/scan operation at the current #pos. As well as any other methods needed for parsing various protocols.

It comes down to: we can either make String a better buffer (adding #to_i(offset:)), or make StringIO a better buffer (adding #get_integer). IMHO the latter seems a better direction in the long run.

Also until recently, methods like IO#gets didn't have a way to be timedout, so they weren't safe to use in such context. In 3.2+ there's now IO#timeout=, but as I said parsing directly from the IO turned out much slower in my attempts, but perhaps I didn't do it well.

I think the timeout stuff is orthogonal to parsing with offsets. Let's compare String vs StringIO so we can ignore timeouts and focus on the API design of #to_i vs #get_integer

# @str is String
@str << io.gets
case @str[@offset] #@offset is character position, so this may be O(n) operation
when "*"
  @offset += 1 
  nb_elements = @str.to_i(offset: @offset)
  @offset += ???
end

# @io is StringIO
@io.string << io.gets
case @io.getc
when "*"
  nb_elements = @io.get_integer
end

It seems to me that the Redis protocol, like most wire protocols, is designed to be parsed in a streaming fashion, so a stream-oriented API like StringIO works pretty well.

But I recognize the speed aspect is important, with String#getbyte much faster than StringIO#getbyte when YJIT is enabled... which I find very strange. Unlike IO, StringIO is just a wrapper over a String object, so in theory there's no reason why they should be so different.

Updated by mame (Yusuke Endoh) about 1 month ago

I thought of a security concern.

I suppose all methods proposed in this ticket would allow underscores as a digit separator. This is natural in Ruby, but usually an unnecessary feature outside of Ruby, including the Redis protocol.
This interpretation mismatch could be a source of vulnerability. I don't know if this will actually cause a problem in the Redis protocol, but in HTTP it is called "HTTP request/response smuggling".

https://cwe.mitre.org/data/definitions/444.html

Integer(gets) or IO#scan_integer converts the input to Integer without making sure that the contents are /^[0-9]+$/. I think that it is okay for a script that handles only trusted sources, but in general it is not recommended for a library used by various people.

Updated by byroot (Jean Boussier) about 1 month ago

  • Status changed from Open to Closed

This interpretation mismatch could be a source of vulnerability.

Good catch @mame (Yusuke Endoh), that does indeed make the to_i proposal much more problematic. I guess I don't really have much of a proposal anymore.

With StringIO you have the best of both worlds!

Not today no. You can look at gems that implement network protocols, I'm yet to find one that uses StringIO as a buffer, StringIO isn't as convenient as you make it out to be. Maybe it could become that, but it isn't today.

It comes down to: we can either make String a better buffer [...] or make StringIO a better buffer [...]. IMHO the latter seems a better direction in the long run.

The thing is, was StringIO even thought as a buffer in the first place? My understanding is that it's just meant as a facade to pass strings to APIs that expect an IO, nothing more. I don't see any extra methods on it that suggest using it as a buffer.

Now for better or for worse, since Ruby strings are mutable, and historically they didn't have an encoding, they're used as buffers everywhere. i.e. look at methods like IO#read_nonblock, they take a string parameter called outbuf, look at core gems like net-http, they use that same "read into a string, then parse the string" pattern, etc.

I'd be all for a dedicated Buffer class that allow to efficiently parse text protocols like HTTP and RESP3, but right now all we got is String.

I think the timeout stuff is orthogonal to parsing with offsets.

It's not. If you are parsing a stream directly from an IO using blocking methods, you must be able to timeout if the character or pattern you are waiting for never comes. In your example, that io.gets could potentially block forever if the IO is a socket and the server is unresponsive or malicious. Hence why most if not all protocol clients use read_nonblock into a String and then parse the string.

with String#getbyte much faster than StringIO#getbyte when YJIT is enabled... which I find very strange. Unlike IO, StringIO is just a wrapper over a String object, so in theory there's no reason why they should be so different.

It's not that strange. YJIT has special optimization for String#getbyte and some other String methods because they are such hotspots. Additionally, StringIO is implemented in C so YJIT can't optimize it, and even if it was re-written in Ruby, it would be some extra method calls that YJIT isn't yet capable of inlining.

So in the end I think I'll just implement some small gems that provide the capabilities I need.

Updated by zverok (Victor Shepelev) about 1 month ago

I'd be all for a dedicated Buffer class that allow to efficiently parse text protocols like HTTP and RESP3, but right now all we got is String.

What about IO::Buffer? It was introduced somewhat stealthily and its API is somewhat unique, and yet by its idea it seems to be exactly that.

Updated by byroot (Jean Boussier) about 1 month ago

What about IO::Buffer? It was introduced somewhat stealthily and its API is somewhat unique, and yet by its idea it seems to be exactly that.

I'm aware of it, but it only offer binary oriented methods. Perhaps it could be extended to offer text protocols oriented methods, but it's really starting from zero right now.

If you look at @ioquatix (Samuel Williams) 's own HTTP gem, it uses its own Async::IO::Stream class which funnily enough is a String subclass... https://github.com/socketry/async-io/blob/085b506b9bc5b4669f1d65b6fe1b972f09442932/lib/async/io/buffer.rb, which kinda confirms my point that in Ruby today, the best buffer is String.

Updated by ioquatix (Samuel Williams) 30 days ago

My protocol gems have been around since Ruby 2.3+ so there are compatibility issues, but rest assured once we can start using IO::Buffer exclusively, I will do so. Using String for IO buffers is equally fraught with odd issues like encoding, memory usage, etc.

Updated by byroot (Jean Boussier) 30 days ago

but rest assured once we can start using IO::Buffer exclusively, I will do so.

I'd be curious to see an HTTP1 parser using IO::Buffer. I don't see how you'd do it today.

Updated by Dan0042 (Daniel DeLorme) 30 days ago

byroot (Jean Boussier) wrote in #note-10:

StringIO isn't as convenient as you make it out to be. Maybe it could become that, but it isn't today.

Hmm, it's not like it matters very much, but I get the weird feeling you misunderstood something in what I said. It's not like we'd ever limit ourselves to just the StringIO interface; my point was that StringIO provides a byte-oriented cursor interface ON TOP of String. Since we can still use the underlying String buffer, that means StringIO+String is a strict superset of String. There's no way that can be any less convenient than just String. Same thing for IO::Buffer afaict.

Actions

Also available in: Atom PDF

Like0
Like1Like0Like0Like1Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0