Feature #15580

Proposal: method addition to class String called .indices ( String#indices )

Added by shevegen (Robert A. Heiler) 8 months ago. Updated 7 months ago.

Status: Open
Priority: Normal
Assignee: -
Target version: -
[ruby-core:91368]

Description

Hello,

I am not sure whether this proposal has a realistic chance of being added to Ruby, but
I think it is ok to suggest it nonetheless and let matz and the core team decide
whether it would be a useful addition to ruby (at least a bit), or whether
it is not necessary. Also, I am trying to learn from
sawa on the issue tracker here by making useful suggestions. :)

I propose to add the following new method to class String directly:

String#indices

This would behave similarly to String#index in that it returns the position
of a substring, but rather than returning a single number or nil, it should return
an Array of all positions at which the substring matches within the main
(target) String. If no match is found, nil should be returned, as with String#index.
(It may be possible to extend String#index to provide this functionality, but
I do not want to get into the problem of backwards compatibility; and #indices
seems to make more sense to me when reading it than #index, since the intent is
a different one - hence why I suggest this new method addition.)

Right now .index on class String will return a result like this:

'abcabcabc'.index 'a' # => 0
'abcabcabc'.index 'd' # => nil

So it returns either the position of the first match found ('a', at 0), or nil
if no match is found (as in the example of 'd').

In general, the proposal here is to keep the behaviour of #indices the very
same as #index, with the sole difference being that an Array
is returned when at least one index is found, and all positions
that are found are stored in that array.
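
For illustration, the behaviour described above could be sketched in plain Ruby roughly as follows. This is only a sketch under assumptions: the helper name string_indices is hypothetical (it is not part of Ruby), it counts overlapping matches, and it returns nil when nothing is found, mirroring String#index.

```ruby
# Hypothetical helper, not part of Ruby core: collects every position at
# which +substring+ occurs in +string+ (overlapping matches are included).
# Returns nil when there is no match, mirroring String#index.
def string_indices(string, substring)
  positions = []
  offset = string.index(substring)
  while offset
    positions << offset
    # Continue searching one character after the previous match.
    offset = string.index(substring, offset + 1)
  end
  positions.empty? ? nil : positions
end

string_indices('abcabcabc', 'a') # => [0, 3, 6]
string_indices('abcabcabc', 'd') # => nil
```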

What is the use case for this proposal or why would I suggest it?

Actually, the use case I have had was a very simple one: to find a
DNA/RNA "subsequence" of just a single nucleotide in a longer DNA/RNA
string. As you may know, most organisms use double stranded DNA (dsDNA)
consisting of four different bases (A,T,C,G); and RNA that is usually
single stranded (ssRNA), with the four different bases being (A,U,C,G).

For example, given an RNA sequence as a String like
'AUGCUUCAGAAAGAGAAAGAGAAAGGUCUUACGUAG' or a similar String, I wanted to
know at which positions 'U' (Uracil) occurs in that string, ideally
as an Array of the positions. So that was my use case for
String#indices.

We can of course already get the above as-is via existing ruby features.

One solution is to use .find_all - which I am actually using (and adding
+1, because nucleotide positions by default start not at 0 but at 1). So
I do not really need this addition to class String to begin with, since
I can use find_all or other useful features that ruby has as-is just
fine.
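
One possible way to write the find_all approach mentioned above is sketched here. The exact code I use may differ; the variable names are just for illustration, and the only methods used are Range#find_all (from Enumerable) and Array#map.

```ruby
rna = 'AUGCUUCAGAAAGAGAAAGAGAAAGGUCUUACGUAG'

# 0-based positions of every 'U' in the sequence.
zero_based = (0...rna.length).find_all { |i| rna[i] == 'U' }

# Nucleotide positions are conventionally counted from 1, hence the +1.
one_based = zero_based.map { |i| i + 1 }

p one_based # prints [2, 5, 6, 27, 29, 30, 34]
```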

However, I also thought that it may be useful for others if a
String#indices method existed directly, which is why I propose it here.
Perhaps it may simplify some existing code bases out there to a limited
extent if ruby users could use the same method/functionality.

There may be other use cases for String#indices, but I will only refer
to the use case that I have found here. If others wish to add their use
case please feel free to do so at your own leisure if you feel like it.

Please also do feel free to close this issue here at any moment in time if
it is considered to be not necessary. It is not really a high priority
suggestion at all - just mostly a convenience feature (possibly).

Thanks!

PS: I should also add that of course in bioinformatics you often deal with
very large datasets, gigabytes/terabytes of genome sequencing data / next
generation sequencing datasets, but if you need more speed anyway then you may
use C or another language to do the "primary" work; and ruby does just fine
with smaller datasets; "big data" is not necessarily everywhere.

I only wanted to mention this in the event that it may be pointed out that
String#indices may not be very fast for very long target strings/substrings -
there are still many use cases for smaller substrings, for example. Perl
was used very early in the bioinformatics field to good success, for
instance.

As for documentation, I think the documentation for String#index could be
used for String#indices too, just with the change that an Array of the
positions found may be returned.

History

Updated by duerst (Martin Dürst) 7 months ago

Just a quick question: Should the results include overlaps or not? I.e. is it
'abababa'.indices('aba') # => [0, 2, 4]
or is it just
'abababa'.indices('aba') # => [0, 4]?
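
The two readings can be compared with a small sketch. The helper name substring_positions and its overlap: keyword are hypothetical and only serve to illustrate the difference; nothing here is an existing or agreed-upon API.

```ruby
# Hypothetical helper illustrating both interpretations.
# overlap: true  -> resume the search one character after each match
# overlap: false -> resume the search after the end of each match
# Returns nil when there is no match, as the proposal suggests.
def substring_positions(string, substring, overlap: true)
  step = overlap ? 1 : [substring.length, 1].max
  positions = []
  offset = string.index(substring)
  while offset
    positions << offset
    offset = string.index(substring, offset + step)
  end
  positions.empty? ? nil : positions
end

substring_positions('abababa', 'aba')                 # => [0, 2, 4]
substring_positions('abababa', 'aba', overlap: false) # => [0, 4]
```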
