Feature #20576
Closed
Add MatchData#bytebegin and MatchData#byteend
Description
I'd like to propose MatchData#bytebegin and MatchData#byteend.
These methods are similar to MatchData#begin and MatchData#end, but return offsets in bytes instead of codepoints.
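As a rough illustration of the difference between codepoint offsets and byte offsets, here is a minimal sketch. The helpers byte_begin_of and byte_end_of are hypothetical names introduced only for this example; they call the proposed methods when the running Ruby has them and otherwise derive the same value from the character offset, so this is not how the pull request implements anything.

```ruby
# Hypothetical helpers for illustration only; they are not part of the proposal.
# Each uses the new MatchData method when the running Ruby provides it and
# otherwise derives the byte offset from the character offset.
def byte_begin_of(match, group = 0)
  return match.bytebegin(group) if match.respond_to?(:bytebegin)
  match.string[0, match.begin(group)].bytesize
end

def byte_end_of(match, group = 0)
  return match.byteend(group) if match.respond_to?(:byteend)
  match.string[0, match.end(group)].bytesize
end

m = /い/.match("あいう")  # each of these characters is 3 bytes in UTF-8

char_offsets = [m.begin(0), m.end(0)]              # counted in characters
byte_offsets = [byte_begin_of(m), byte_end_of(m)]  # counted in bytes

p char_offsets  # => [1, 2]
p byte_offsets  # => [3, 6]
```

The byte offsets are what byte-oriented methods such as String#byteindex expect as a starting position, which is why they are convenient when scanning.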
Pull request: https://github.com/ruby/ruby/pull/10973
One of the use cases is scanning strings: https://github.com/ruby/net-imap/pull/286/files
MatchData#byteend is faster than MatchData#byteoffset because there is no need to allocate an Array.
Here's a benchmark result:
voyager:ruby$ cat b.rb
require "benchmark"
require "strscan"

text = "あ" * 100000

Benchmark.bmbm do |b|
  b.report("byteoffset(0)[1]") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteoffset(0)[1]
    end
  end

  b.report("byteend(0)") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteend(0)
    end
  end
end
voyager:ruby$ ./tool/runruby.rb b.rb
Rehearsal ----------------------------------------------------
byteoffset(0)[1]   0.020558   0.000393   0.020951 (  0.020963)
byteend(0)         0.018149   0.000000   0.018149 (  0.018151)
------------------------------------------- total: 0.039100sec

                       user     system      total        real
byteoffset(0)[1]   0.020821   0.000000   0.020821 (  0.020822)
byteend(0)         0.017455   0.000000   0.017455 (  0.017455)
Updated by Eregon (Benoit Daloze) 7 months ago
Does this difference matter in realistic usages (e.g. that net-imap one)? How much improvement is it there?
Regarding naming, byteend seems hard to read, I think byte_begin/byte_end is much clearer.
Updated by shugo (Shugo Maeda) 7 months ago
Eregon (Benoit Daloze) wrote in #note-1:
Does this difference matter in realistic usages (e.g. that net-imap one)? How much improvement is it there?
I guess the difference doesn't matter much compared to I/O etc., but it's frustrating to write code like $~.byteoffset(0)[1] when only the end offset is needed.
Regarding naming, byteend seems hard to read, I think byte_begin/byte_end is much clearer.
I proposed byteend for consistency with existing methods such as byteoffset.
If we choose byte_end, it may be better to introduce new aliases for such existing methods.
Updated by matz (Yukihiro Matsumoto) 7 months ago
I understand the use-case. I agree with the addition of the feature, but I don't like the name. The names bytebegin and byteend follow the byteindex tradition, but they are very hard to read (especially byteend). Any other name suggestions?
Matz.
Updated by shugo (Shugo Maeda) 7 months ago
matz (Yukihiro Matsumoto) wrote in #note-3:
I understand the use-case. I agree with the addition of the feature, but I don't like the name. The names bytebegin and byteend follow the byteindex tradition, but they are very hard to read (especially byteend). Any other name suggestions?
I came up with the names begin_in_bytes and end_in_bytes, but byte_begin / byte_end, suggested by Eregon, may be better.
Updated by matz (Yukihiro Matsumoto) 6 months ago
OK. I didn't like the names (especially byteend), but after looking at them for a while I got used to it and was ready to compromise.
Matz.
Updated by shugo (Shugo Maeda) 6 months ago
- Status changed from Open to Closed
Applied in changeset git|e048a073a3cba04576b8f6a1673c283e4e20cd90.
Add MatchData#bytebegin and MatchData#byteend
These methods return the byte-based offset of the beginning or end of the specified match.
[Feature #20576]