Feature #15562

`String#split` option to suppress the initial empty substring

Added by sawa (Tsuyoshi Sawada) 11 months ago. Updated 10 months ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:91256]

Description

String#split returns an empty substring at the beginning of the result if the original string starts with the separator, even though it does not return an empty substring at the end:

"aba".split("a") # => ["", "b"]

This is probably inherited from Perl or AWK and may have some use cases, but in some (if not most) use cases it looks asymmetric; the initial empty string is unnatural and often requires additional code to remove it. I propose adding an option to String#split to suppress it, perhaps like this (with true being the default):

"aba".split("a", initial_empty_string: false) # => ["b"]
"aba".split("a", initial_empty_string: true) # => ["", "b"]
"aba".split("a") # => ["", "b"]

This does not mean to suppress empty strings in the middle. So it should work like this:

"aaaba".split("a", initial_empty_string: false) # => ["", "", "b"]
"aaaba".split("a", initial_empty_string: true) # => ["", "", "", "b"]

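The behavior described above can be approximated in current Ruby. The following is a minimal sketch only; `split_without_initial_empty` is a hypothetical helper name invented for illustration and is not part of Ruby or of this proposal. It assumes that only the single leading empty substring is dropped, as in the "aaaba" example.

```ruby
# Hypothetical helper (not part of Ruby): emulates the proposed
# `initial_empty_string: false` using only the existing String#split.
def split_without_initial_empty(str, sep)
  parts = str.split(sep)
  # Drop only the very first element, and only if it is empty;
  # empty strings further in are left untouched.
  parts.shift if parts.first == ""
  parts
end

split_without_initial_empty("aba", "a")   # => ["b"]
split_without_initial_empty("aaaba", "a") # => ["", "", "b"]
```
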
Or maybe we can go even further and control both the initial and the final ones, like this (with :initial being the default):

"aba".split("a", terminal_empty_string: :none) # => ["b"]
"aba".split("a", terminal_empty_string: :initial) # => ["", "b"]
"aba".split("a", terminal_empty_string: :final) # => ["b", ""]
"aba".split("a", terminal_empty_string: :both) # => ["", "b", ""]

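As a rough illustration of the four modes, here is a sketch built on the existing limit argument of -1, which keeps all empty substrings. `split_terminal` and its `keep` parameter are hypothetical names. The thread does not say how several trailing empty strings should be treated under :final or :both, so keeping all of them is an assumption of this sketch.

```ruby
# Hypothetical helper (not part of Ruby): emulates the proposed
# `terminal_empty_string:` option on top of String#split with limit -1.
def split_terminal(str, sep, keep = :initial)
  parts = str.split(sep, -1) # -1 keeps every empty substring

  # Unless final empties are requested, strip them all,
  # which is what plain String#split already does.
  unless %i[final both].include?(keep)
    parts.pop while parts.last == ""
  end

  # Unless the initial empty is requested, drop the first one only.
  unless %i[initial both].include?(keep)
    parts.shift if parts.first == ""
  end

  parts
end

split_terminal("aba", "a", :none)    # => ["b"]
split_terminal("aba", "a", :initial) # => ["", "b"]
split_terminal("aba", "a", :final)   # => ["b", ""]
split_terminal("aba", "a", :both)    # => ["", "b", ""]
```
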
History

#1

Updated by sawa (Tsuyoshi Sawada) 11 months ago

  • Description updated (diff)

Updated by znz (Kazuhiro NISHIYAMA) 11 months ago

String#split with -1 does not remove empty strings.

>> "aba".split("a", -1)
=> ["", "b", ""]
>> "abaa".split("a", -1)
=> ["", "b", "", ""]

Updated by sawa (Tsuyoshi Sawada) 11 months ago

znz (Kazuhiro NISHIYAMA) wrote:

String#split with -1 does not remove empty strings.

>> "aba".split("a", -1)
=> ["", "b", ""]
>> "abaa".split("a", -1)
=> ["", "b", "", ""]

I (particularly) want ["b"].

Updated by sawa (Tsuyoshi Sawada) 11 months ago

An example of a frequent use case of split("a", initial_empty_string: false) is when we have a text like `text` below and want to extract the paragraphs that follow each SECTION:

text = <<~_
  SECTION
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam in massa eget mauris lobortis fermentum non in risus. Etiam sit amet dui et velit laoreet pulvinar. Donec convallis, nisi ut lobortis volutpat, est sapien bibendum ante, ac laoreet enim neque at nulla. Aliquam ex urna, porttitor nec mi vitae, suscipit lacinia diam. Maecenas semper, enim id eleifend viverra, lorem velit facilisis tellus, sit amet efficitur nulla nibh sit amet eros. Cras erat mauris, rutrum id mattis nec, auctor eu diam. Aenean mattis at nisl sit amet aliquam. Proin euismod hendrerit eros, quis rhoncus ipsum.

  SECTION
  Curabitur eget quam quis nulla lacinia dapibus ut quis mauris. Maecenas volutpat molestie pulvinar. Mauris porttitor semper arcu. Fusce congue tempor urna in suscipit. Duis a neque lacinia, consectetur elit id, ullamcorper neque. Morbi sit amet eleifend ipsum, sit amet porta libero. Mauris euismod ipsum sit amet ante porttitor consequat. Suspendisse malesuada nunc quis orci posuere dapibus. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Nulla quis massa ut tortor pulvinar egestas in ut nunc. Aenean vitae malesuada elit, nec posuere massa. Nullam risus ipsum, fermentum at fringilla eget, tincidunt nec ante. Pellentesque malesuada pulvinar bibendum. Cras massa erat, tristique vitae vehicula et, aliquet vestibulum magna.
_

text.split(/^SECTION\n/, initial_empty_string: false).map(&:strip)

Updated by knu (Akinori MUSHA) 11 months ago

Isn't the new option name too long? I'd use .drop_while(&:empty?).
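For reference, the drop_while approach works in current Ruby without any new option. The short sample text below is made up for illustration. Note one difference from the proposal: drop_while removes every leading empty string, not just the first.

```ruby
# Existing Ruby only: strip leading empty strings after splitting.
simple = "aba".split("a").drop_while(&:empty?)
# => ["b"]

# Unlike the proposed option, all leading empties are removed here.
repeated = "aaaba".split("a").drop_while(&:empty?)
# => ["b"]

# The SECTION use case, with a shortened made-up sample text.
sample = "SECTION\nfirst paragraph\n\nSECTION\nsecond paragraph\n"
paragraphs = sample.split(/^SECTION\n/).drop_while(&:empty?).map(&:strip)
# => ["first paragraph", "second paragraph"]
```
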

Updated by shevegen (Robert A. Heiler) 11 months ago

This reminds me a bit of Dir['*'] versus Dir.entries(Dir.pwd). The latter
also has . and .. entries such as:

=> ["foobar.md", "..", "."]

To me the . and .. entries were never useful. I ended up switching to Dir[]
consistently anyway so I don't see . or .., but I am bringing this example
because I agree with sawa's statement that empty strings are not
terribly useful in the result if you want to work with it. Perhaps
they are useful if you wish to .join the result again, but if you are only
interested in non-empty results (or non-empty strings) then I think it may
be ok to have an additional way to return only the entries you are interested
in. Of course you can process the result on your own as-is, via .reject or
.select (or .filter), but it may be more convenient to simply pass in another
option to .split as second argument.

So from this point of view I agree with sawa, even though I personally probably
don't need this much at all (oddly enough, I think almost all of the use cases
I personally have had only needed a single argument to .split()).

The only adaptation I would suggest is that I think the proposed syntax is too
long.

"aba".split("a", terminal_empty_string: :none) # => ["b"]
"aba".split("a", terminal_empty_string: :initial) # => ["", "b"]

I understand that, I assume, sawa proposes flexibility, which is fine,
but it is a bit clumsy and long, IMO. Perhaps something simpler?

ignore_empty: true

Can't think of many more. Rails/Active* has .blank? which I do not like
as a name, but from a conceptual point of view, being able to have a
short way to refer to something like the following, may be nice to
have in general:

"ruby, please ignore nil and empty strings as results, as I need
the alternative only".

In my own code I (mis)use symbols a lot, so I may propose
:ignore_empty_string too. :)

(It's actually almost as long as sawa's suggestion, but when I just tried
it, making this shorter was not easy, since we lose a bit of the meaning
we try to convey here. That is also one reason why it may be useful to
somehow refer to situations where we could easily filter away nil and
'' empty strings, via a single word/command. Even .blank? may become a
bit more verbose if you try to use it via the API above, such as
ignore_blanks: true - or something like that. Good API design is hard...)

Updated by shevegen (Robert A. Heiler) 11 months ago

Isn't the new option name too long? I'd use .drop_while(&:empty?).

I personally agree with your observation here; but I think that
.drop_while(&:empty?) is also not ideal. I'd then actually prefer
sawa's longer variant over the combined .drop_while(&:empty?)
syntax. :)

Updated by sawa (Tsuyoshi Sawada) 10 months ago

(shevegen (Robert A. Heiler):) This reminds me a bit of Dir['*'] versus Dir.entries(Dir.pwd). The latter
also has . and .. entries

Actually, I had the same thing in mind. I have never felt the initial "" in String#split useful (nor the . and .. in Dir.entries). They are along the same lines to me.

And I agree with knu that the name for the option was too long. I had felt that too. So I came up with a different name. What about leader?

"aba".split("a", leader: false) # => ["b"]

Updated by knu (Akinori MUSHA) 10 months ago

I believe an initial empty string should often be useful and significant, so it is a reasonable default to include one. String#split is used for splitting strings like key=value and /path/components, not to mention CSV, where key= and =value need to be differentiated and elements.join('/') should round-trip.
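The cases mentioned above can be checked in current Ruby. This is a small illustration only; the variable names are made up.

```ruby
# key=value: the position of the empty string shows which side is missing.
key_only   = "key=".split("=", -1) # => ["key", ""]
value_only = "=value".split("=")   # => ["", "value"]

# Path components: the initial empty string makes join round-trip.
path = "/path/components"
elements = path.split("/")         # => ["", "path", "components"]
round_trip = elements.join("/")    # => "/path/components"

# Without the initial empty string, the leading slash is lost.
lossy = elements.drop_while(&:empty?).join("/") # => "path/components"
```
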
