Feature #15771

closed

Add `String#split` option to set `split_type string` with a single space separator

Added by 284km (kazuma furuhashi) almost 6 years ago. Updated over 4 years ago.

Status:
Feedback
Assignee:
-
Target version:
-
[ruby-core:92301]

Description

When String#split's separator is a single space character, it executes under split_type: awk.
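As a quick illustration of the awk-style mode described above, the following sketch compares it with a regexp separator. The sample string is made up for this example, and `$;` is assumed to be unset (the default).

```ruby
# Sample input invented for illustration; assumes $; is nil (the default).
str = " a  b \t c\n"

# awk-style: leading whitespace is ignored and any run of whitespace
# (spaces, tabs, newlines) counts as one separator.
awk_fields = str.split(" ")

# Calling split without an argument behaves the same way.
default_fields = str.split

# A regexp splits at every single space character and nothing else.
regexp_fields = str.split(/ /)

p awk_fields     # => ["a", "b", "c"]
p default_fields # => ["a", "b", "c"]
p regexp_fields  # => ["", "a", "", "b", "\t", "c\n"]
```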

If you want to split on exactly one space " " rather than on a run of whitespace, you need to take special care. For example, the CSV library works around this behavior like this:

if @column_separator == " ".encode(@encoding)
  @split_column_separator = Regexp.new(@escaped_column_separator)
else
  @split_column_separator = @column_separator
end

Unfortunately, using a regexp here makes it slower than using a string. The following result shows it is about nine times slower.

$ be benchmark-driver string_split_string-regexp.yml --rbenv '2.6.2'
Comparison:
              string:   3161117.6 i/s
              regexp:    344448.0 i/s - 9.18x  slower
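
The benchmark definition file is not shown in this ticket. A rough way to make a similar comparison with only the standard library might look like the sketch below. The sample line, the comma separator, and the iteration count are assumptions made for this example, and the timings will vary with the Ruby version and machine.

```ruby
require "benchmark"

# Sample data and iteration count are arbitrary choices for this sketch.
line = "aaa,bbb,ccc,ddd,eee"
iterations = 100_000

# Time splitting with a plain string separator.
string_time = Benchmark.realtime { iterations.times { line.split(",") } }

# Time splitting with the equivalent regexp separator.
regexp_time = Benchmark.realtime { iterations.times { line.split(/,/) } }

puts format("string: %.4fs", string_time)
puts format("regexp: %.4fs", regexp_time)
```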

I want to add a :literal option to execute the method under split_type: string as follows:

" a  b   c    ".split(" ")                    # => ["a", "b", "c"]
" a  b   c    ".split(" ", literal: true)     # => ["", "a", "", "b", "", "", "c"]
" a  b   c    ".split(" ", -1)                # => ["a", "b", "c", ""]
" a  b   c    ".split(" ", -1, literal: true) # => ["", "a", "", "b", "", "", "c", "", "", "", ""]

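Until such an option exists, the results shown above can be reproduced with a regexp. The helper below is a hypothetical sketch (the name `split_literal` is invented here and is not part of Ruby); it pays the regexp cost mentioned earlier.

```ruby
# Hypothetical helper, not part of Ruby: emulates the proposed `literal: true`
# by converting the separator into an escaped Regexp, so that " " no longer
# triggers the awk-style mode.
def split_literal(str, sep, limit = 0)
  str.split(Regexp.new(Regexp.escape(sep)), limit)
end

p split_literal(" a  b   c    ", " ")
# => ["", "a", "", "b", "", "", "c"]
p split_literal(" a  b   c    ", " ", -1)
# => ["", "a", "", "b", "", "", "c", "", "", "", ""]
```
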
Implementation

Actions #2

Updated by sawa (Tsuyoshi Sawada) over 4 years ago

  • Subject changed from Add `String#split` option to set split_type string when a single space separator to Add `String#split` option to set `split_type string` with a single space separator
  • Description updated (diff)
Actions #3

Updated by sawa (Tsuyoshi Sawada) over 4 years ago

  • Description updated (diff)
Actions #4

Updated by sawa (Tsuyoshi Sawada) over 4 years ago

  • Description updated (diff)

Updated by Eregon (Benoit Daloze) over 4 years ago

Since splitting on whitespace is the default (ignoring $; which is deprecated in 2.7), maybe we could make split(" ") not special longer-term?

Updated by Dan0042 (Daniel DeLorme) over 4 years ago

I've often thought that the default behavior should be tied to nil rather than " ", but in terms of compatibility I don't really think it's worth the change.
The proposed option makes it easy to avoid special-casing the " " separator; imho str.split(sep, literal: true) feels cleaner than str.split(sep==" " ? / / : sep).
Not too sure about literal: true though, maybe awk: false would be more meaningful?

Updated by sawa (Tsuyoshi Sawada) over 4 years ago

My guess is that, perhaps, even now, it is very rare to use a single space string argument with the expectation to match single or multiple spaces. In such use cases, normally, split is used without an argument, or with a regex argument such as /\s+/ or / +/. If it turns out that it is rare enough, then the behavior of split could be changed to split_type: string altogether.

Updated by znz (Kazuhiro NISHIYAMA) over 4 years ago

How about optimizing split(/ /) (when the regexp is a single space only) instead of changing split(" ")?

Updated by nobu (Nobuyoshi Nakada) over 4 years ago

znz (Kazuhiro NISHIYAMA) wrote in #note-8:

How about optimizing split(/ /) (when the regexp is a single space only) instead of changing split(" ")?

Sounds nice. https://github.com/ruby/ruby/pull/3103

Updated by Dan0042 (Daniel DeLorme) over 4 years ago

That optimization is nice to have, but I think the point of this ticket is that it's currently not possible to have an arbitrary string separator.

str = "aaabababbbabbabaabaaabbbabab"
sep = "x" #or anything except " "
str.gsub("a",sep).split(sep).size #=> 15
sep = " "
str.gsub("a",sep).split(sep).size #=> 9
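
Another way to see the same point is a join/split round trip. In this sketch (the helper name and the sample fields are made up for illustration), joining fields with a separator and splitting on it again restores the fields for every separator tried except " ".

```ruby
# Hypothetical helper for illustration: does joining and re-splitting
# with the same separator give back the original fields?
def round_trips?(fields, sep)
  fields.join(sep).split(sep, -1) == fields
end

fields = ["x", "", "y", "", ""] # made-up data containing empty fields

[",", "|", " "].each do |sep|
  puts "#{sep.inspect}: #{round_trips?(fields, sep) ? 'round trip ok' : 'fields lost'}"
end

# With a regexp the single space behaves like any other separator.
p fields.join(" ").split(/ /, -1) == fields # => true
```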

sawa (Tsuyoshi Sawada) wrote in #note-7:

My guess is that, perhaps, even now, it is very rare to use a single space string argument with the expectation to match single or multiple spaces.

It looks like using a single space string argument is not so rare:
https://pastebin.com/pPyEf2GA

Updated by matz (Yukihiro Matsumoto) over 4 years ago

  • Status changed from Open to Feedback

I get the point, but we still need a concrete use-case. Unlike tab- or comma-separated data, space-separated CSV is not common, and consecutive spaces are usually treated as one space. I still feel like it's a theoretical concern.

Besides that, literal does not seem to be the right name for the option. In programming languages, the term literal means literal constants (e.g. "literal" or 3), so it can be confusing.

Matz.

Updated by sawa (Tsuyoshi Sawada) over 4 years ago

Dan0042 (Daniel DeLorme) wrote in #note-10:

That optimization is nice to have, but I think the point of this ticket is that it's currently not possible to have an arbitrary string separator.

I agree.

Dan0042 (Daniel DeLorme) wrote in #note-10:

sawa (Tsuyoshi Sawada) wrote in #note-7:

My guess is that, perhaps, even now, it is very rare to use a single space string argument with the expectation to match single or multiple spaces.

It looks like using a single space string argument is not so rare:
https://pastebin.com/pPyEf2GA

I didn't write that using a single space string argument is rare, I wrote that using a single space string argument with the expectation to match single or multiple spaces is rare.

In fact, my guess is that using a single space string argument is not rare, and that most such uses expect to match only a single space. I have not confirmed this. If it turns out to be correct, then those uses would constitute the use cases that were asked for.

Updated by sawa (Tsuyoshi Sawada) over 4 years ago

I have a particular use case. I was creating a file that describes UTF-8 characters, which included lines like this:

002  !"#$%&'()*+,-./
003 0123456789:;<=>?
004 @ABCDEFGHIJKLMNO

The first three characters of each line describe the significant hex-digits of the UTF-8 code. They are followed by a space character that separates them from the sixteen characters that belong to that line.

Notice that the first space character after 002 is used as a separator, and the space character right after it is intended to express a literal space character.

I tried to parse each line with code like this:

significant_digits, characters = line.split(" ", 2)

but it did not work as I expected, which is:

significant_digits # => "002"
characters # => ' !"#$%&'()*+,-./'

and I realized the issue mentioned on this ticket.

Updated by matz (Yukihiro Matsumoto) over 4 years ago

@sawa, for your "use-case", line.split(/ /, 2) is far better than line.split(" ", 2, literal: true), I think, no matter what keyword we'd choose.

  • shorter
  • clearer
  • need to rewrite anyway

Matz.
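
For reference, the difference between the two separators on the first sample line from the use case can be checked with a short sketch like this (the variable names are chosen for this example):

```ruby
# First sample line from the use case; %q|...| avoids escaping the quotes.
line = %q|002  !"#$%&'()*+,-./|

# awk-style split: the whitespace run after "002" is consumed entirely,
# so the leading space of the character list is lost.
_, awk_chars = line.split(" ", 2)

# Regexp split: only the first space acts as the separator.
digits, regexp_chars = line.split(/ /, 2)

p digits              # => "002"
p awk_chars.length    # => 15
p regexp_chars.length # => 16
```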

Updated by Dan0042 (Daniel DeLorme) over 4 years ago

I think it's worth mentioning nobu's comment from the dev meeting:

have wanted to deprecate that behavior for years, and made non-nil $/ warned.

So what about finally deprecating this behavior? If the incompatibility is too bad it's always possible to go back. sawa's use-case, rather than being about the literal option, is more about the benefit of treating " " as split_type string. And I remember being very surprised about the behavior of str.split(" ") when I started out in ruby.

Maybe:
if $VERBOSE show a warning "use nil or / /" if separator is " "
if !$VERBOSE show a warning "use nil" if separator is " " and matches differently from / /
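
A minimal sketch of that rule in plain Ruby might look like this. The method name, its `verbose:` parameter, and the message strings are invented for illustration; this is not an actual Ruby API, and a real implementation would live inside `String#split` itself.

```ruby
# Hypothetical helper illustrating the rule above. It returns the warning
# text that would be shown, or nil when no warning applies.
def space_split_warning(str, sep, verbose: $VERBOSE)
  return nil unless sep == " "

  # In verbose mode, always point at the two explicit alternatives.
  return "use nil or / /" if verbose

  # Otherwise warn only when awk-style splitting and / / give different results.
  str.split(" ") == str.split(/ /) ? nil : "use nil"
end

p space_split_warning("a b", " ", verbose: false)  # => nil
p space_split_warning("a  b", " ", verbose: false) # => "use nil"
p space_split_warning("a b", " ", verbose: true)   # => "use nil or / /"
```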
