Feature #20576: Add MatchData#bytebegin and MatchData#byteend

Added by shugo (Shugo Maeda) about 1 year ago. Updated about 1 year ago.

Status: Closed
Assignee:
Target version: 3.4

[ruby-core:118299]

Description

I'd like to propose MatchData#bytebegin and MatchData#byteend.
These methods are similar to MatchData#begin and MatchData#end, but return offsets in bytes instead of codepoints.
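
To make the difference between character and byte offsets concrete, here is a minimal sketch. It assumes Ruby 3.2 or later for MatchData#byteoffset, and it calls bytebegin/byteend only when the running Ruby provides them, since those are the names proposed in this ticket. The helper name match_offsets is hypothetical and used only for illustration.

```ruby
# Character vs. byte offsets of a match.
# MatchData#byteoffset exists since Ruby 3.2; bytebegin/byteend are the
# methods proposed in this ticket, so they are only called when present.
def match_offsets(text, pattern)
  m = pattern.match(text)
  return nil unless m

  # Fallback: byteoffset returns [begin, end] in bytes as a new Array.
  first, last = m.byteoffset(0)
  if m.respond_to?(:bytebegin) && m.respond_to?(:byteend)
    first = m.bytebegin(0)
    last = m.byteend(0)
  end
  { chars: [m.begin(0), m.end(0)], bytes: [first, last] }
end

# "あ" takes 3 bytes in UTF-8, so "abc" starts at character 2 but byte 6.
result = match_offsets("ああabc", /abc/)
puts "chars: #{result[:chars].inspect}, bytes: #{result[:bytes].inspect}"
# prints: chars: [2, 5], bytes: [6, 9]
```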

Pull request: https://github.com/ruby/ruby/pull/10973

One of the use cases is scanning strings: https://github.com/ruby/net-imap/pull/286/files
MatchData#byteend is faster than MatchData#byteoffset because there is no need to allocate an Array.
Here's a benchmark result:

voyager:ruby$ cat b.rb 
require "benchmark"
require "strscan"

text = "あ" * 100000

Benchmark.bmbm do |b|
  b.report("byteoffset(0)[1]") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteoffset(0)[1]
    end
  end

  b.report("byteend(0)") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteend(0)
    end
  end
end
voyager:ruby$ ./tool/runruby.rb b.rb           
Rehearsal ----------------------------------------------------
byteoffset(0)[1]   0.020558   0.000393   0.020951 (  0.020963)
byteend(0)         0.018149   0.000000   0.018149 (  0.018151)
------------------------------------------- total: 0.039100sec

                       user     system      total        real
byteoffset(0)[1]   0.020821   0.000000   0.020821 (  0.020822)
byteend(0)         0.017455   0.000000   0.017455 (  0.017455)
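
The scanning loop from the benchmark can be wrapped into a small tokenizer, sketched below. The helpers match_byte_end and scan_tokens are hypothetical names, not part of Ruby or net-imap. The sketch assumes Ruby 3.2 or later (String#byteindex, MatchData#byteoffset) and uses byteend only when it is available.

```ruby
# Return the byte offset where match `m` ends.
# Prefers MatchData#byteend (the method proposed here) and falls back to
# byteoffset(0)[1], which allocates a two-element Array on every call.
def match_byte_end(m)
  m.respond_to?(:byteend) ? m.byteend(0) : m.byteoffset(0)[1]
end

# Hypothetical helper: collect the tokens matched by `pattern`,
# advancing a byte position the same way as the benchmark loop.
# `pattern` should start with \G so each match begins exactly at `pos`.
def scan_tokens(text, pattern)
  tokens = []
  pos = 0
  while text.byteindex(pattern, pos)
    m = Regexp.last_match
    stop = match_byte_end(m)
    break if stop == pos # guard against empty matches looping forever
    tokens << m[0]
    pos = stop
  end
  tokens
end

p scan_tokens("あ,い,う", /\G[^,]+,?/)
# prints: ["あ,", "い,", "う"]
```

Working in byte positions keeps each step cheap on multibyte strings, because no character-to-byte conversion is needed when resuming the search.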

#1 [ruby-core:118301]

Updated by Eregon (Benoit Daloze) about 1 year ago

Does this difference matter in realistic usages (e.g. that net-imap one)? How much of an improvement is it there?

Regarding naming, byteend seems hard to read; I think byte_begin/byte_end is much clearer.

#2 [ruby-core:118309]

Updated by shugo (Shugo Maeda) about 1 year ago

Eregon (Benoit Daloze) wrote in #note-1:

> Does this difference matter in realistic usages (e.g. that net-imap one)? How much of an improvement is it there?

I guess the difference doesn't matter much compared to I/O etc., but it's frustrating to write code like $~.byteoffset(0)[1] when only the end offset is needed.

> Regarding naming, byteend seems hard to read; I think byte_begin/byte_end is much clearer.

I proposed byteend for consistency with existing methods such as byteoffset.
If we choose byte_end, it may be better to introduce new aliases for such existing methods.

#3 [ruby-core:118310]

Updated by matz (Yukihiro Matsumoto) about 1 year ago

I understand the use case. I agree with the addition of the feature, but I don't like the names. The names bytebegin and byteend follow the byteindex tradition, but they are very hard to read (especially byteend). Any other name suggestions?

Matz.

#4 [ruby-core:118313]

Updated by shugo (Shugo Maeda) about 1 year ago

matz (Yukihiro Matsumoto) wrote in #note-3:

> I understand the use case. I agree with the addition of the feature, but I don't like the names. The names bytebegin and byteend follow the byteindex tradition, but they are very hard to read (especially byteend). Any other name suggestions?

I came up with the names begin_in_bytes and end_in_bytes, but byte_begin / byte_end, as suggested by Eregon, may be better.

#5 [ruby-core:118601]