Bug #9680 (closed)
String#sub and siblings should not use regex when String pattern is passed
Description
Currently String#sub, #sub!, #gsub, and #gsub! all accept a String pattern, but immediately create a Regexp from it and use the regex engine to search for the pattern. This is slow and allocates many short-lived objects. For example, "123:456".gsub(":", "_") creates the following objects, most of which are immediately up for GC:
- dup of the original String
- result String
- 2x ":"<US-ASCII>
- 2x ":"<ASCII-8BIT>
- Regexp from pattern: /:/
- #<MatchData ":">
- #<MatchData nil>
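To check the allocation count on a given build, something like the following can be used. This is a minimal sketch: the helper name `allocations` is hypothetical, it assumes Ruby 2.2 or later (where `GC.stat(:total_allocated_objects)` is available), and the number it reports will vary with the Ruby version and with whether this change is applied.

```ruby
# Hypothetical helper: counts objects allocated while the block runs.
# GC is disabled during the measurement so the counter is not disturbed.
def allocations
  GC.disable
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end

str = "123:456"
count = allocations { str.gsub(":", "_") }
puts "gsub with a String pattern allocated #{count} objects"
```

The count includes the pattern and replacement string literals created inside the block, so it is an upper bound on what gsub itself allocates.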
I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to rb_reg_search() are replaced with calls to a new function, rb_pat_search(), which conditionally calls rb_reg_search() or rb_str_index(), depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.
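The dispatch can be modeled in Ruby roughly as follows. This is only an illustration of the idea, not the C code from the patch: `pat_search` and `sketch_sub` are hypothetical names, and the sketch ignores backreference handling in the replacement string.

```ruby
# Hypothetical Ruby model of the dispatch; the actual patch is written in C.
# Returns [start, length] of the first match at or after pos, or nil.
def pat_search(str, pat, pos = 0)
  if pat.is_a?(String)
    start = str.index(pat, pos)            # plain substring search, no Regexp
    start && [start, pat.length]           # matched span is the pattern itself
  else
    m = pat.match(str, pos)                # regex engine path
    m && [m.begin(0), m.end(0) - m.begin(0)]
  end
end

# Simplified sub: replaces the first match and returns a new String.
def sketch_sub(str, pat, replacement)
  start, len = pat_search(str, pat)
  return str.dup unless start
  str[0, start] + replacement + str[start + len..-1]
end

puts sketch_sub("123:456", ":", "_")
puts sketch_sub("123:456", /\d+/, "x")
```

With a String pattern the length of the replaced substring is simply the pattern's length, whereas with a Regexp it has to come from the match data. That is the difference in "calculating the substring" mentioned above.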
Runtime of each method is dramatically reduced:
require 'benchmark'

n = 4_000_000
Benchmark.bm(7) do |bm|
  str1 = "123:456"; str2 = "123_456"
  colon = ":"; underscore = "_"
  # each benchmark runs the substitution method twice so that the bang methods
  # can perform the same number of substitutions to str1 each go around.
  bm.report("sub")   { n.times { str1.sub(colon, underscore);   str2.sub(underscore, colon) } }
  bm.report("sub!")  { n.times { str1.sub!(colon, underscore);  str1.sub!(underscore, colon) } }
  bm.report("gsub")  { n.times { str1.gsub(colon, underscore);  str2.gsub(underscore, colon) } }
  bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end
# trunk
             user     system      total        real
sub     40.450000   0.580000  41.030000 ( 41.209658)
sub!    39.780000   0.580000  40.360000 ( 40.656789)
gsub    58.500000   0.820000  59.320000 ( 59.603923)
gsub!   59.400000   0.770000  60.170000 ( 60.435687)

# this patch
             user     system      total        real
sub      3.060000   0.010000   3.070000 (  3.091920)
sub!     2.380000   0.010000   2.390000 (  2.390769)
gsub     7.130000   0.130000   7.260000 (  7.299139)
gsub!    7.660000   0.150000   7.810000 (  7.846190)
When using a String pattern, runtime is reduced by 87% to 94%.
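For reference, the percentages can be recomputed from the "total" column of the two runs. The numbers below are copied from the benchmark output; the variable names are just for illustration.

```ruby
# "total" column (seconds) from the trunk and patched benchmark runs
trunk = { "sub" => 41.03, "sub!" => 40.36, "gsub" => 59.32, "gsub!" => 60.17 }
patch = { "sub" => 3.07,  "sub!" => 2.39,  "gsub" => 7.26,  "gsub!" => 7.81 }

# percentage reduction = (1 - patched / trunk) * 100, rounded to one decimal
reductions = trunk.map do |name, before|
  [name, ((1 - patch[name] / before) * 100).round(1)]
end.to_h

reductions.each { |name, pct| puts format("%-6s %.1f%%", name, pct) }
```

This gives roughly 92.5% for sub, 94.1% for sub!, 87.8% for gsub, and 87.0% for gsub!.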
There is only one incompatibility that I am aware of: $& will not be set after using a sub method with a String pattern. (Subgroups ($1, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)
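Code that relies on $& after a String-pattern substitution could sidestep the incompatibility by passing an explicit Regexp, which always goes through the regex engine. A small sketch (the method name `matched_text_after_sub` is hypothetical):

```ruby
# Hypothetical helper: performs a literal substitution but via an explicit
# Regexp, so $& is set regardless of how String patterns are handled.
def matched_text_after_sub(str, pat)
  str.sub(Regexp.new(Regexp.escape(pat)), "_")
  $&   # match variables are local to this method's frame
end

puts matched_text_after_sub("123:456", ":")

# String patterns are taken literally, so parentheses never form a subgroup:
puts "a(b)c".sub("(b)", "x")
```

Regexp.escape keeps the pattern literal, matching what a String pattern means, so the result of the substitution is unchanged.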
Only 3 more methods use get_pat(), the function that creates a Regexp from the String pattern: #split, #scan, and #match. I think this fix could be applied to these as well in the future.
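Before extending the fix, it may be worth checking how each of these methods interprets a String argument, since a plain substring search is only equivalent where the String is taken literally. The following shows the observable behavior as far as I know it. Treat it as something to verify against the target Ruby version rather than as a statement about get_pat() internals.

```ruby
# How a String argument is interpreted by each method (observable behavior)
split_result = "a.b.c".split(".")    # "." is a literal delimiter
scan_result  = "a.b.c".scan(".")     # "." is matched literally
match_result = "a.b".match(".")[0]   # "." is compiled as regex source (any char)

p split_result
p scan_result
p match_result
```

If #match does treat the String as regex source, as this suggests, it could not be switched to a substring search without changing behavior.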