Feature #18254: Add an `offset` parameter to String#unpack and String#unpack1 - Ruby - Ruby Issue Tracking System

Added by byroot (Jean Boussier) about 4 years ago. Updated about 4 years ago.

Status: Closed

[ruby-core:105660]

Description

When working with binary protocols it's common to have to first unpack some kind of header or type prefix, and then based on that unpack another part of the string.

For instance here's a code snippet from Dalli, the most common Memcached client:

while buf.bytesize - pos >= 24
  header = buf.slice(pos, 24)
  (key_length, _, body_length, cas) = header.unpack(KV_HEADER)

  if key_length == 0
    # all done!
    @multi_buffer = nil
    @position = nil
    @inprogress = false
    break

  elsif buf.bytesize - pos >= 24 + body_length
    flags = buf.slice(pos + 24, 4).unpack1("N")
    key = buf.slice(pos + 24 + 4, key_length)
    value = buf.slice(pos + 24 + 4 + key_length, body_length - key_length - 4) if body_length - key_length - 4 > 0

    pos = pos + 24 + body_length

    begin
      values[key] = [deserialize(value, flags), cas]
    rescue DalliError
    end

  else
    # not enough data yet, wait for more
    break
  end
end
@position = pos

Proposal

If unpack and unpack1 had an offset: parameter, this kind of code could extract the fields it needs without allocating and copying as many strings, e.g.:

flags = buf.slice(pos + 24, 4).unpack1("N")

could be:

buf.unpack1("N", offset: pos + 24)
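To make the equivalence concrete, here is a minimal sketch. The buffer contents are made up for illustration, and it assumes a Ruby version where the offset: keyword is available (3.1 or later, where this feature landed).

```ruby
# Illustrative buffer: a 24-byte placeholder header followed by a
# 32-bit big-endian integer. The layout is an assumption for this sketch.
buf = ("\x00".b * 24) + [0xDEADBEEF].pack("N")
pos = 0

# Current pattern: allocates an intermediate 4-byte string.
via_slice = buf.slice(pos + 24, 4).unpack1("N")

# Proposed pattern: reads directly from the original buffer.
via_offset = buf.unpack1("N", offset: pos + 24)

puts via_slice == via_offset
```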

Updated by znz (Kazuhiro NISHIYAMA) about 4 years ago
#1 [ruby-core:105661]

You can use unpack1("@#{pos + 24}N").
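For readers unfamiliar with it, the `@` directive moves to an absolute byte position before the next directive is decoded. A small sketch with an invented buffer:

```ruby
# Illustrative buffer: 24 filler bytes, then a 32-bit big-endian integer.
buf = ("\x00".b * 24) + [42].pack("N")
pos = 0

# The format string is built at call time, so a new string is created
# (and the integer is converted to text) on every call.
format = "@#{pos + 24}N"
flags = buf.unpack1(format)

puts flags
```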

Updated by byroot (Jean Boussier) about 4 years ago
#2 [ruby-core:105662]

Ah, I didn't know about it, but then you just allocated a string and converted an integer to string, so it's even slower than the slice pattern:

# frozen_string_literal: true
require 'benchmark/ips'

STRING = Random.bytes(200)
POS = 12
Benchmark.ips do |x|
  x.report("no-offset") { STRING.unpack1("N") }
  x.report("slice-offset") { STRING.slice(POS, 4).unpack1("N")}
  x.report("unpack-offset") { STRING.unpack1("@#{POS}N") }
  x.compare!
end

# Ruby 2.7.2
Warming up --------------------------------------
           no-offset     1.016M i/100ms
        slice-offset   532.173k i/100ms
       unpack-offset   321.805k i/100ms
Calculating -------------------------------------
           no-offset     10.090M (± 1.2%) i/s -     50.782M in   5.033549s
        slice-offset      5.318M (± 2.1%) i/s -     26.609M in   5.005346s
       unpack-offset      3.205M (± 1.8%) i/s -     16.090M in   5.021922s

Comparison:
           no-offset: 10090269.9 i/s
        slice-offset:  5318453.9 i/s - 1.90x  (± 0.00) slower
       unpack-offset:  3205017.9 i/s - 3.15x  (± 0.00) slower

Based on this, an offset parameter could make the current code almost 2x more efficient.

Updated by byroot (Jean Boussier) about 4 years ago
#3 [ruby-core:105663]

I submitted a pull request for it, https://github.com/ruby/ruby/pull/4984.

Updated by matz (Yukihiro Matsumoto) about 4 years ago
#4 [ruby-core:105766]

Sounds reasonable. Accepted.

Matz.

Updated by mame (Yusuke Endoh) about 4 years ago
#5 [ruby-core:105769]

Just a confirmation: the offset is byte-oriented, not character-oriented, right? There is a "U" format, which decodes UTF-8, so the behavior should be explained clearly in the documentation.

Updated by nobu (Nobuyoshi Nakada) about 4 years ago
#6 [ruby-core:105774]

As the RDoc of String#unpack states:

  # Decodes <i>str</i> (which may contain binary data) according to the
  # format string, returning an array of each value extracted. The

Isn't it clear that it is counted as binary?

Updated by byroot (Jean Boussier) about 4 years ago
#7 [ruby-core:105775]

Just a confirmation: the offset is byte-oriented, not character-oriented, right?

Yes.

Updated by duerst (Martin Dürst) about 4 years ago
#8 [ruby-core:105783]

mame (Yusuke Endoh) wrote in #note-5:

Just a confirmation: the offset is byte-oriented, not character-oriented, right? There is a "U" format, which decodes UTF-8, so the behavior should be explained clearly in the documentation.

This is not only a problem of "explain it in the document". In order for this offset to work well, there should be a way to know how many bytes an invocation of String#unpack consumes. In many cases, that's very easy to calculate from the format string, but in others, in particular for UTF-8, it's not easy.

Updated by byroot (Jean Boussier) about 4 years ago
#9 [ruby-core:105784]

That argument will indeed be pretty much worthless if you use the U format, but I don't really see it as a blocker. It is meant to help binary parsers, I don't see U making sense for these.

As for the documentation, we indeed need to be clear that it's a byte offset.
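A short sketch of what "byte offset" means in practice, assuming Ruby 3.1 or later. The data is invented: a two-byte UTF-8 character followed by a 16-bit big-endian integer.

```ruby
# "\u00E9" (e with acute accent) is one character but two bytes in UTF-8.
prefix = "\u00E9"
data = prefix.b + [7].pack("n")

# Counting in bytes lands exactly on the integer.
by_bytes = data.unpack1("n", offset: prefix.bytesize)

# Counting in characters lands in the middle of the prefix
# and decodes unrelated bytes.
by_chars = data.unpack1("n", offset: prefix.length)

puts by_bytes
puts by_chars == by_bytes
```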

Updated by byroot (Jean Boussier) about 4 years ago
#10 [ruby-core:105788]

I extended the pull request to clearly document the offset keyword and stress that it's a byte offset. Hopefully that clears that concern.

Updated by mame (Yusuke Endoh) about 4 years ago
#11 [ruby-core:105792]

@byroot (Jean Boussier) Thank you for adding documentation. I agree with merging.

there should be a way to know how many bytes an invocation of String#unpack consumes.

In fact, some committers discussed this point at the dev-meeting. However, in many cases it is trivial (or at least possible) for a programmer to calculate how many bytes are consumed. Also, it looks difficult to provide the feature by just extending the current API design of String#unpack. So, matz concluded that those who really want the feature should create another ticket with a use case discussion and a concrete API proposal.
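As an illustration of the "able to calculate" case, one possible approach for formats made only of fixed-width numeric directives is to pack dummy values and measure the result. The helper name `fixed_width` and the record layout below are hypothetical, not part of Ruby or of the pull request, and the sketch assumes Ruby 3.1 or later for offset:.

```ruby
# Hypothetical helper: byte width of a format that contains only
# fixed-width numeric directives, measured by packing zeros.
def fixed_width(format, field_count)
  Array.new(field_count, 0).pack(format).bytesize
end

HEADER = "nCCN" # 2 + 1 + 1 + 4 bytes, an invented layout
size = fixed_width(HEADER, 4)

# Two records back to back; the second starts `size` bytes in.
buf = [1, 2, 3, 4].pack(HEADER) + [5, 6, 7, 8].pack(HEADER)
first = buf.unpack(HEADER, offset: 0)
second = buf.unpack(HEADER, offset: size)

puts size
p first, second
```

This does not help with variable-width directives such as U, which is the case raised earlier in the thread.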

Updated by byroot (Jean Boussier) about 4 years ago
#12 [ruby-core:105794]

Agreed. The goal is to avoid slicing anyway, and to slice you need to know how many bytes you consumed.

If there are no other objections I'll merge in a day or two.
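The point about tracking consumed bytes can be sketched with a small parsing loop. The record layout here (16-bit key length, 32-bit value length, then key and value) and the `parse_records` name are invented for illustration and are not Dalli's actual protocol; offset: requires Ruby 3.1 or later.

```ruby
# Hypothetical record layout: key length (n), value length (N), key, value.
HEADER_FORMAT = "nN"
HEADER_SIZE = 6 # 2 + 4 bytes

def parse_records(buf)
  records = {}
  pos = 0
  while buf.bytesize - pos >= HEADER_SIZE
    # Read the header in place instead of slicing it out first.
    key_length, value_length = buf.unpack(HEADER_FORMAT, offset: pos)
    total = HEADER_SIZE + key_length + value_length
    break if buf.bytesize - pos < total # incomplete record, wait for more data

    key = buf.byteslice(pos + HEADER_SIZE, key_length)
    value = buf.byteslice(pos + HEADER_SIZE + key_length, value_length)
    records[key] = value
    pos += total
  end
  [records, pos]
end

# One complete record followed by a truncated one.
buf = [3, 5].pack("nN") + "foo" + "hello" + [1, 1].pack("nN") + "a"
records, pos = parse_records(buf)
p records
puts pos
```

The caller still advances `pos` by hand, which works here because every directive in the header has a known width.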

Updated by byroot (Jean Boussier) about 4 years ago
#13

Status changed from Open to Closed

Applied in changeset git|e5319dc9856298f38aa9cdc6ed55e39ad0e8e070.

pack.c: add an offset argument to unpack and unpack1

[Feature #18254]

This is useful to avoid repeatedly copying strings when parsing binary formats
