Bug #9680
closedString#sub and siblings should not use regex when String pattern is passed
Description
Currently String#sub
, #sub!
, #gsub, and
#gsub!all accept a String pattern, but immediately create a Regexp from it, and use the regex engine to search for the pattern. This is not performant. For example,
"123:456".gsub(":", "_")` creates the following objects, most of which are immediately up for GC:
- dup of the original String
- result String
- 2x
":"<US-ASCII>
- 2x
":"<ASCII-8BIT>
- Regexp from pattern:
/:/
#<MatchData ":">
#<MatchData nil>
I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to rb_reg_search()
are replaced with calls to a new function, rb_pat_search()
, which conditionally calls rb_reg_search()
or rb_str_index()
, depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.
Runtime of each method is dramatically reduced:
require 'benchmark'
n = 4_000_000
Benchmark.bm(7) do |bm|
str1 = "123:456"; str2 = "123_456";
colon = ":"; underscore = "_"
# each benchmark runs the substring method twice so that the bang methods can
# perform the same number of substitutions to str1 each go around.
bm.report("sub") { n.times { str1.sub(colon, underscore); str2.sub(underscore, colon) } }
bm.report("sub!") { n.times { str1.sub!(colon, underscore); str1.sub!(underscore, colon) } }
bm.report("gsub") { n.times { str1.gsub(colon, underscore); str2.gsub(underscore, colon) } }
bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end
# trunk
user system total real
sub 40.450000 0.580000 41.030000 ( 41.209658)
sub! 39.780000 0.580000 40.360000 ( 40.656789)
gsub 58.500000 0.820000 59.320000 ( 59.603923)
gsub! 59.400000 0.770000 60.170000 ( 60.435687)
# this patch
user system total real
sub 3.060000 0.010000 3.070000 ( 3.091920)
sub! 2.380000 0.010000 2.390000 ( 2.390769)
gsub 7.130000 0.130000 7.260000 ( 7.299139)
gsub! 7.660000 0.150000 7.810000 ( 7.846190)
When using a String pattern, runtime is reduced by 87% to 94%.
There is only one incompatibility that I am aware of: $&
will not be set after using a sub method with a String pattern. (Subgroups ($1
, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)
In the future, only 3 more methods use the function, get_pat()
, that creates a Regexp from the String pattern: #split
, #scan
, and #match
. I think this fix could be applied to these as well.
Files
Updated by normalperson (Eric Wong) over 10 years ago
I had an idea to replace the current reg_cache with memoization for
converting string literals, but never got around to it. That would
also reduce garbage while preserving $& compatibility.
Updated by srawlins (Sam Rawlins) over 10 years ago
I think the speedup in this patch comes almost entirely from skipping the regex engine, not from the GC savings.
Preserving $&
(and $~
and friends) while still not firing up the regex engine might be possible (constructing the basic backref, with no subgroups, by hand), but very very ugly code (an RMatch
has an RRegexp
and an rmatch
which has a re_registers
, etc). This might only a ~20 line function, but feels so dirty...
I think an improvement (or replacement) to reg_cache
would also be welcome.
Updated by Eregon (Benoit Daloze) over 10 years ago
It would be interesting to run the benchmark on a more realistic example. One should use String#tr or String#tr! if it is only to replace a single character.
Updated by srawlins (Sam Rawlins) over 10 years ago
Good point, Benoit! This is actually why I started looking into #gsub in the first place. I benchmarked ActiveSupport::Inflector [1], which does operations like gsub!('/', '::')
and gsub('::', '/')
. Here are the benchmarks, before and after Nobu's patch, based on my patch:
BEFORE
user system total real
#underscore 24.000000 0.160000 24.160000 ( 24.254921)
#camelize 3.040000 0.010000 3.050000 ( 3.060907)
AFTER
user system total real
#underscore 23.690000 0.160000 23.850000 ( 24.012497)
#camelize 2.680000 0.010000 2.690000 ( 2.706418)
So #underscore is 1% faster; #camelize is 11% faster.
I also benchmarked Psych with YAML.dump(["one string", "two string"])
:
user system total real
YAML.dump BEFORE 11.380000 0.250000 11.630000 ( 11.680545)
YAML.dump AFTER 11.030000 0.240000 11.270000 ( 11.313147)
The patch seems to shave 3% off here. Take all of these benchmarks with a grain of salt :)
[1] https://github.com/rails/rails/blob/master/activesupport/lib/active_support/inflector/methods.rb
Updated by nobu (Nobuyoshi Nakada) over 10 years ago
- Status changed from Open to Closed
I missed to include this ticket reference in the commit log.
Inflector seems using other replacements with RE, so this improvement might not be significant much.
Updated by usa (Usaku NAKAMURA) over 10 years ago
- Backport changed from 2.0.0: UNKNOWN, 2.1: UNKNOWN to 2.0.0: WONTFIX, 2.1: UNKNOWN
Updated by nagachika (Tomoyuki Chikanaga) over 10 years ago
- Backport changed from 2.0.0: WONTFIX, 2.1: UNKNOWN to 2.0.0: WONTFIX, 2.1: WONTFIX