Feature #15446

Add a method `String#each_match` to the Ruby core

Added by CaryInVictoria (Cary Swoveland) 6 months ago. Updated 4 months ago.

Status: Open
Priority: Normal
Assignee: -
Target version: -
[ruby-core:90652]

Description

String#each_match would have two forms:

each_match(pattern) { |match| block } → str
each_match(pattern) → an_enumerator

The latter would be identical to the form gsub(pattern) → enumerator of String#gsub. The former would simply yield the matches to a block and return the receiver.
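
To make the two forms concrete, here is a minimal sketch of the proposed behaviour written as a stand-alone helper. The name `each_match_sketch` is hypothetical and is not part of Ruby; the helper only assumes the existing block-less form of `String#gsub`, which returns an enumerator over the matches.

```ruby
# Hypothetical helper emulating the proposed String#each_match.
# It is an illustration only; no such method exists in Ruby core.
def each_match_sketch(str, pattern)
  # Block-less gsub returns an Enumerator that yields each match.
  enum = str.gsub(pattern)
  return enum unless block_given?

  # Driving the enumerator with a block makes gsub build a substituted
  # copy of the string; that copy is discarded here, so this sketch is
  # not as cheap as a native implementation could be.
  enum.each { |match| yield match }

  # Unlike gsub, the proposed block form returns the receiver itself.
  str
end

words = []
result = each_match_sketch("Viv and Bob", /\b([a-z])[a-z]*\1\b/i) { |w| words << w }
p words                 #=> ["Viv", "Bob"]
p result                #=> "Viv and Bob"
```

Without a block the helper simply hands back the gsub enumerator, which is the behaviour the proposal describes for the second form.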

I frequently use the form of gsub that returns an enumerator instead of scan when chaining to Enumerable methods. That's because scan returns an unneeded temporary array. This use of gsub can also be useful when the pattern contains capture groups, which can be a complication when using scan, as in the following example.

Suppose we are given a string and wish to count the number of occurrences of each word that begins and ends with the same letter (case-insensitive).

 str = "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."

 r = /\b(?:[a-z]|([a-z])[a-z]*\1)\b/i

This regular expression reads, "match a word break, followed by one letter or by two or more letters with the last matching the first (case insensitive), all followed by a word break".
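
The same reading can be written into the pattern itself using Ruby's free-spacing (`x`) flag. The name `r_verbose` and the sample string below are illustrative and do not appear in the original proposal.

```ruby
# Same pattern as r, written in free-spacing mode so each part can be annotated.
r_verbose = /
  \b            # word boundary
  (?:
    [a-z]       # either a one-letter word
    |
    ([a-z])     # or a first letter, captured as group 1,
    [a-z]*      # followed by zero or more letters,
    \1          # followed by the first letter again
  )
  \b            # word boundary
/xi

sample = "Viv and Bob are a regular pair"
p sample.gsub(r_verbose).to_a  #=> ["Viv", "Bob", "a", "regular"]
```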

 enum = str.each_match(r)
    #=> #<Enumerator: "Viv and Bob are party...a regular guy.":gsub(/\b(?:[a-z]|([a-z])[a-z]*\1)\b/i)> 

We can convert enum to an array to see the words that will be generated by the enumerator and passed to the block.

enum.to_a
    #=> ["Viv", "Bob", "Bob", "Eve", "a", "Eve", "Bob", "a", "regular"] 

Continuing,

enum.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
   #=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1} 
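
Since each_match is only a proposal, the same pipeline can be run today with the block-less form of gsub, which the enumerator form of each_match is described as being identical to. The sketch below repeats the example that way; the use of `Enumerable#tally` is an addition not in the original post and assumes Ruby 2.7 or later.

```ruby
str = "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."
r = /\b(?:[a-z]|([a-z])[a-z]*\1)\b/i

# Block-less gsub stands in for the proposed each_match(pattern).
counts = str.gsub(r).each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
p counts  #=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1}

# On Ruby 2.7+ the counting step can be delegated to Enumerable#tally.
p str.gsub(r).tally == counts  #=> true
```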

We could alternatively use each_match with a block.

 h = Hash.new(0)
 str.each_match(r) { |word| h[word] += 1 }
    #=> "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."
 h #=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1} 

This form of each_match has no counterpart with gsub.

Consider now how scan would be used here. Because of the way scan treats capture groups, we cannot write

str.scan(r)
   #=> [["V"], ["B"], ["B"], ["E"], [nil], ["E"], ["B"], [nil], ["r"]] 

Instead we must add a second capture group.

arr = str.scan(/\b((?:[a-z]|([a-z])[a-z]*\2))\b/i)
   #=> [["Viv", "V"], ["Bob", "B"], ["Bob", "B"], ["Eve", "E"], ["a", nil], ["Eve", "E"], ["Bob", "B"], ["a", nil], ["regular", "r"]]

Then

arr.each_with_object(Hash.new(0)) { |(word,_),h| h[word] += 1 }
   #=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1}

This works but it's a bit of a dog's breakfast when compared to the use of the proposed method.
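
The difference comes down to one rule: scan returns whole matches when the pattern has no capture groups and arrays of captures when it has any, whereas the enumerator returned by gsub always yields the whole match. A small illustration on a made-up string:

```ruby
s = "aa bc dd"

# No capture groups: scan returns the matched substrings.
p s.scan(/\b[a-z]{2}\b/)          #=> ["aa", "bc", "dd"]

# One capture group: scan returns the captures, not the matches.
p s.scan(/\b([a-z])\1\b/)         #=> [["a"], ["d"]]

# The gsub enumerator yields the whole match regardless of groups.
p s.gsub(/\b([a-z])\1\b/).to_a    #=> ["aa", "dd"]
```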

The problem with using gsub in this way is that it is confusing to readers who are expecting character substitutions to be performed. I also believe that the name of this method (the "sub" in gsub) has resulted in the form of the method that returns an enumerator being under-appreciated and under-used.

Some comments below propose that this suggestion be adopted and, in time, the form of gsub that returns an enumerator be deprecated.

History

Updated by duerst (Martin Dürst) 6 months ago

This looks like a good idea. Actually, I might suggest that we go even further: we introduce a new method and deprecate (and ultimately remove) the functionality of producing an enumerator by gsub.

(I wouldn't mind keeping producing an enumerator with gsub, but only if that resulted in actual substitutions.)

Updated by shevegen (Robert A. Heiler) 6 months ago

The idea suggested by Cary seems fine to me. We have to ask
matz what he thinks about the proposal, including the name
and the functionality.

I would suggest, however, that any deprecation be done at a
later time, or at least be decoupled from this suggestion for
now.

The main reason is that deprecating (and then removing)
functionality is a bit different from adding new
functionality (e.g. #matches or any other name in class
String). I think the deprecation could be done at a later
step or in another proposal. (I don't know if anyone depends
on gsub producing an enumerator, but in my opinion it would
be simpler to bypass that question for now and focus only
on the method addition Cary proposed.)

Updated by sos4nt (Stefan Schüßler) 6 months ago

Regarding the name – I'd prefer String#each_match.

And it should accept an optional block which yields the matches and (as opposed to gsub) returns the receiver (i.e. no substitution):

each_match(pattern) { |match| block } → str
each_match(pattern) → an_enumerator

Updated by CaryInVictoria (Cary Swoveland) 6 months ago

Stefan, I've incorporated both of your suggestions. Thanks.

Updated by CaryInVictoria (Cary Swoveland) 6 months ago

  • Description updated (diff)
  • Subject changed from Add a method `String#matches` to the Ruby core to Add a method `String#each_match` to the Ruby core

Updated by sawa (Tsuyoshi Sawada) 4 months ago

I would rather propose to have String#scan take an optional second argument that is comparable to the optional second argument capture of String#[] after a regexp argument:

r = /\b([a-z]|([a-z])[a-z]*\2)\b/i
str[r] # => "Viv"
str[r, 0] # => "Viv"
str[r, 1] # => "Viv"
str[r, 2] # => "V"

so that it should work like this:

str.scan(r) # => [["Viv", "V"], ["Bob", "B"], ["Bob", "B"], ["Eve", "E"], ["a", nil], ["Eve", "E"], ["Bob", "B"], ["a", nil], ["regular", "r"]]
str.scan(r, 0) # => ["Viv", "Bob", "Bob", "Eve", "a", "Eve", "Bob", "a", "regular"]
str.scan(r, 1) # => ["Viv", "Bob", "Bob", "Eve", "a", "Eve", "Bob", "a", "regular"]
str.scan(r, 2) # => ["V", "B", "B", "E", nil, "E", "B", nil, "r"]
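
The alternative above can be approximated with existing methods. The following is a rough sketch, not a proposed implementation: `scan_capture_sketch` is a hypothetical helper name, it assumes the pattern is a Regexp, and it steps through the string with `Regexp#match(str, pos)`. The pattern is written with the backreference `\2`, since with the outer group added the first letter is the second capture.

```ruby
# Hypothetical helper emulating scan with a capture argument.
# capture may be 0 (whole match) or the number of a capture group.
def scan_capture_sketch(str, pattern, capture)
  results = []
  pos = 0
  while (m = pattern.match(str, pos))
    results << m[capture]
    # Resume after this match; step one character past an empty match
    # so the loop always makes progress.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  results
end

sample = "Viv and Bob are a pair"
r2 = /\b([a-z]|([a-z])[a-z]*\2)\b/i

p scan_capture_sketch(sample, r2, 0)  #=> ["Viv", "Bob", "a"]
p scan_capture_sketch(sample, r2, 1)  #=> ["Viv", "Bob", "a"]
p scan_capture_sketch(sample, r2, 2)  #=> ["V", "B", nil]
```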
