Bug #9680: String#sub and siblings should not use regex when String pattern is passed - Ruby - Ruby Issue Tracking System

Actions

Copy link

Bug #9680

closed

String#sub and siblings should not use regex when String pattern is passed

Bug #9680: String#sub and siblings should not use regex when String pattern is passed

Added by srawlins (Sam Rawlins) over 12 years ago. Updated almost 12 years ago.

Status:

Closed

Assignee:

Target version:

ruby -v:

trunk

Backport:

2.0.0: WONTFIX, 2.1: WONTFIX

[ruby-core:61706]

Description

Currently String#sub, #sub!, #gsub, and #gsub!all accept a String pattern, but immediately create a Regexp from it, and use the regex engine to search for the pattern. This is not performant. For example,"123:456".gsub(":", "_")` creates the following objects, most of which are immediately up for GC:

dup of the original String
result String
2x ":"<US-ASCII>
2x ":"<ASCII-8BIT>
Regexp from pattern: /:/
#<MatchData ":">
#<MatchData nil>

I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to rb_reg_search() are replaced with calls to a new function, rb_pat_search(), which conditionally calls rb_reg_search() or rb_str_index(), depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.

Runtime of each method is dramatically reduced:

require 'benchmark'

n = 4_000_000
Benchmark.bm(7) do |bm|
  str1 = "123:456"; str2 = "123_456";
  colon = ":"; underscore = "_"
  # each benchmark runs the substring method twice so that the bang methods can
  # perform the same number of substitutions to str1 each go around.
  bm.report("sub")   { n.times { str1.sub(colon, underscore);   str2.sub(underscore, colon) } }
  bm.report("sub!")  { n.times { str1.sub!(colon, underscore);  str1.sub!(underscore, colon) } }
  bm.report("gsub")  { n.times { str1.gsub(colon, underscore);  str2.gsub(underscore, colon) } }
  bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end

# trunk
              user     system      total        real
sub      40.450000   0.580000  41.030000 ( 41.209658)
sub!     39.780000   0.580000  40.360000 ( 40.656789)
gsub     58.500000   0.820000  59.320000 ( 59.603923)
gsub!    59.400000   0.770000  60.170000 ( 60.435687)

# this patch
              user     system      total        real
sub       3.060000   0.010000   3.070000 (  3.091920)
sub!      2.380000   0.010000   2.390000 (  2.390769)
gsub      7.130000   0.130000   7.260000 (  7.299139)
gsub!     7.660000   0.150000   7.810000 (  7.846190)

When using a String pattern, runtime is reduced by 87% to 94%.

There is only one incompatibility that I am aware of: $& will not be set after using a sub method with a String pattern. (Subgroups ($1, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)

In the future, only 3 more methods use the function, get_pat(), that creates a Regexp from the String pattern: #split, #scan, and #match. I think this fix could be applied to these as well.

Files

ruby-579.diff (5.12 KB) ruby-579.diff

srawlins (Sam Rawlins), 03/27/2014 12:09 AM

Updated by normalperson (Eric Wong) over 12 years ago Actions
Copy link
#1 [ruby-core:61707]

I had an idea to replace the current reg_cache with memoization for
converting string literals, but never got around to it. That would
also reduce garbage while preserving $& compatibility.

Updated by srawlins (Sam Rawlins) over 12 years ago Actions
Copy link
#2 [ruby-core:61710]

I think the speedup in this patch comes almost entirely from skipping the regex engine, not from the GC savings.

Preserving $& (and $~ and friends) while still not firing up the regex engine might be possible (constructing the basic backref, with no subgroups, by hand), but very very ugly code (an RMatch has an RRegexp and an rmatch which has a re_registers, etc). This might only a ~20 line function, but feels so dirty...

I think an improvement (or replacement) to reg_cache would also be welcome.

Updated by Eregon (Benoit Daloze) over 12 years ago Actions
Copy link
#3 [ruby-core:61725]

It would be interesting to run the benchmark on a more realistic example. One should use String#tr or String#tr! if it is only to replace a single character.

Updated by srawlins (Sam Rawlins) over 12 years ago Actions
Copy link
#4 [ruby-core:61728]

Good point, Benoit! This is actually why I started looking into #gsub in the first place. I benchmarked ActiveSupport::Inflector [1], which does operations like gsub!('/', '::') and gsub('::', '/'). Here are the benchmarks, before and after Nobu's patch, based on my patch:

BEFORE
                   user     system      total        real
#underscore   24.000000   0.160000  24.160000 ( 24.254921)
#camelize      3.040000   0.010000   3.050000 (  3.060907)

AFTER
                   user     system      total        real
#underscore   23.690000   0.160000  23.850000 ( 24.012497)
#camelize      2.680000   0.010000   2.690000 (  2.706418)

So #underscore is 1% faster; #camelize is 11% faster.

I also benchmarked Psych with YAML.dump(["one string", "two string"]):

                       user     system      total        real
YAML.dump BEFORE  11.380000   0.250000  11.630000 ( 11.680545)
YAML.dump AFTER   11.030000   0.240000  11.270000 ( 11.313147)

The patch seems to shave 3% off here. Take all of these benchmarks with a grain of salt :)

[1] https://github.com/rails/rails/blob/master/activesupport/lib/active_support/inflector/methods.rb

Updated by nobu (Nobuyoshi Nakada) over 12 years ago Actions
Copy link
#5 [ruby-core:61733]

Status changed from Open to Closed

I missed to include this ticket reference in the commit log.

Inflector seems using other replacements with RE, so this improvement might not be significant much.

Updated by usa (Usaku NAKAMURA) about 12 years ago Actions
Copy link
#6 [ruby-core:63491]

Backport changed from 2.0.0: UNKNOWN, 2.1: UNKNOWN to 2.0.0: WONTFIX, 2.1: UNKNOWN

Updated by nagachika (Tomoyuki Chikanaga) almost 12 years ago Actions
Copy link
#7 [ruby-core:64260]

Backport changed from 2.0.0: WONTFIX, 2.1: UNKNOWN to 2.0.0: WONTFIX, 2.1: WONTFIX

Actions

Copy link

Also available in: PDF Atom

Project

General

Profile

Ruby

Custom queries

Bug #9680

String#sub and siblings should not use regex when String pattern is passed

Updated by normalperson (Eric Wong) over 12 years ago Actions
Copy link
#1 [ruby-core:61707]

Updated by srawlins (Sam Rawlins) over 12 years ago Actions
Copy link
#2 [ruby-core:61710]

Updated by Eregon (Benoit Daloze) over 12 years ago Actions
Copy link
#3 [ruby-core:61725]

Updated by srawlins (Sam Rawlins) over 12 years ago Actions
Copy link
#4 [ruby-core:61728]

Updated by nobu (Nobuyoshi Nakada) over 12 years ago Actions
Copy link
#5 [ruby-core:61733]

Updated by usa (Usaku NAKAMURA) about 12 years ago Actions
Copy link
#6 [ruby-core:63491]

Updated by nagachika (Tomoyuki Chikanaga) almost 12 years ago Actions
Copy link
#7 [ruby-core:64260]

Project

General

Profile

Ruby

Custom queries

Bug #9680

String#sub and siblings should not use regex when String pattern is passed

Updated by normalperson (Eric Wong) over 12 years ago ActionsCopy link #1 [ruby-core:61707]

Updated by srawlins (Sam Rawlins) over 12 years ago ActionsCopy link #2 [ruby-core:61710]

Updated by Eregon (Benoit Daloze) over 12 years ago ActionsCopy link #3 [ruby-core:61725]

Updated by srawlins (Sam Rawlins) over 12 years ago ActionsCopy link #4 [ruby-core:61728]

Updated by nobu (Nobuyoshi Nakada) over 12 years ago ActionsCopy link #5 [ruby-core:61733]

Updated by usa (Usaku NAKAMURA) about 12 years ago ActionsCopy link #6 [ruby-core:63491]

Updated by nagachika (Tomoyuki Chikanaga) almost 12 years ago ActionsCopy link #7 [ruby-core:64260]

Updated by normalperson (Eric Wong) over 12 years ago Actions
Copy link
#1 [ruby-core:61707]

Updated by srawlins (Sam Rawlins) over 12 years ago Actions
Copy link
#2 [ruby-core:61710]

Updated by Eregon (Benoit Daloze) over 12 years ago Actions
Copy link
#3 [ruby-core:61725]

Updated by srawlins (Sam Rawlins) over 12 years ago Actions
Copy link
#4 [ruby-core:61728]

Updated by nobu (Nobuyoshi Nakada) over 12 years ago Actions
Copy link
#5 [ruby-core:61733]

Updated by usa (Usaku NAKAMURA) about 12 years ago Actions
Copy link
#6 [ruby-core:63491]

Updated by nagachika (Tomoyuki Chikanaga) almost 12 years ago Actions
Copy link
#7 [ruby-core:64260]