Feature #15771
closed
Add `String#split` option to set `split_type string` with a single space separator
Description
When String#split's separator is a single space character, it executes under split_type: awk.
When you want to split literally by a single space " ", and not by a sequence of space characters, you need to take special care. For example, the CSV library works around this behavior like this:
if @column_separator == " ".encode(@encoding)
  @split_column_separator = Regexp.new(@escaped_column_separator)
else
  @split_column_separator = @column_separator
end
Unfortunately, using a regexp here makes it slower than using a string. The following result shows it is about nine times slower.
$ be benchmark-driver string_split_string-regexp.yml --rbenv '2.6.2'
Comparison:
string: 3161117.6 i/s
regexp: 344448.0 i/s - 9.18x slower
I want to add a :literal option to execute the method under split_type: string as follows:
" a  b   c    ".split(" ") # => ["a", "b", "c"]
" a  b   c    ".split(" ", literal: true) # => ["", "a", "", "b", "", "", "c"]
" a  b   c    ".split(" ", -1) # => ["a", "b", "c", ""]
" a  b   c    ".split(" ", -1, literal: true) # => ["", "a", "", "b", "", "", "c", "", "", "", ""]
Implementation
Updated by 284km (kazuma furuhashi) over 5 years ago
pull request: https://github.com/ruby/ruby/pull/2132
Updated by sawa (Tsuyoshi Sawada) over 4 years ago
- Subject changed from Add `String#split` option to set split_type string when a single space separator to Add `String#split` option to set `split_type string` with a single space separator
- Description updated (diff)
Updated by Eregon (Benoit Daloze) over 4 years ago
Since splitting on whitespace is the default (ignoring $; which is deprecated in 2.7), maybe we could make split(" ") not special longer-term?
Updated by Dan0042 (Daniel DeLorme) over 4 years ago
I've often thought that the default behavior should be tied to nil rather than " ", but in terms of compatibility I don't really think it's worth the change.
The proposed option makes it easy to avoid special-casing the " " separator; imho str.split(sep, literal: true) feels cleaner than str.split(sep==" " ? / / : sep).
Not too sure about literal: true though, maybe awk: false would be more meaningful?
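For illustration, the special-casing discussed here can be wrapped in a small helper that works on current Ruby. This is only a sketch: `split_exact` is a hypothetical name, not part of Ruby or of the pull request, and it simply swaps the single-space string for the regexp / / so that the awk-style mode is never triggered.

```ruby
# Hypothetical helper (not a Ruby core method): split on the separator
# exactly as given, even when it is a single space.
def split_exact(str, sep, limit = 0)
  # A " " string separator triggers awk-style splitting, so replace it
  # with an equivalent regexp; every other separator is passed through.
  str.split(sep == " " ? / / : sep, limit)
end

s = " a  b   c    " # runs of one, two, three, and four spaces

s.split(" ")             # => ["a", "b", "c"]
split_exact(s, " ")      # => ["", "a", "", "b", "", "", "c"]
split_exact(s, " ", -1)  # => ["", "a", "", "b", "", "", "c", "", "", "", ""]
split_exact("a,b,,c", ",") # => ["a", "b", "", "c"]
```

The regexp branch is the slower path measured in the description, which is the cost the proposed option is meant to avoid.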
Updated by sawa (Tsuyoshi Sawada) over 4 years ago
My guess is that, perhaps, even now, it is very rare to use a single space string argument with the expectation to match single or multiple spaces. In such use cases, normally, split is used without an argument, or with a regex argument such as /\s+/ or / +/. If it turns out that it is rare enough, then the behavior of split could be changed to split_type: string altogether.
Updated by znz (Kazuhiro NISHIYAMA) over 4 years ago
How about optimizing split(/ /) (when regexp is single space only) instead of changing split(" ")?
Updated by nobu (Nobuyoshi Nakada) over 4 years ago
znz (Kazuhiro NISHIYAMA) wrote in #note-8:
How about optimizing split(/ /) (when regexp is single space only) instead of changing split(" ")?
Sounds nice. https://github.com/ruby/ruby/pull/3103
Updated by Dan0042 (Daniel DeLorme) over 4 years ago
That optimization is nice to have, but I think the point of this ticket is that it's currently not possible to have an arbitrary string separator.
str = "aaabababbbabbabaabaaabbbabab"
sep = "x" #or anything except " "
str.gsub("a",sep).split(sep).size #=> 15
sep = " "
str.gsub("a",sep).split(sep).size #=> 9
sawa (Tsuyoshi Sawada) wrote in #note-7:
My guess is that, perhaps, even now, it is very rare to use a single space string argument with the expectation to match single or multiple spaces.
It looks like using a single space string argument is not so rare:
https://pastebin.com/pPyEf2GA
Updated by matz (Yukihiro Matsumoto) over 4 years ago
- Status changed from Open to Feedback
I get the point, but we still need a concrete use-case. (Unlike tabs and commas) Space-separated CSV is not common, and consecutive spaces are usually considered as one space. I still feel like it's a theoretical concern.
Besides that, literal does not seem to be the right name for the option. In programming languages, the term literal means literal constants (e.g. "literal" or 3), so it can be confusing.
Matz.
Updated by sawa (Tsuyoshi Sawada) over 4 years ago
Dan0042 (Daniel DeLorme) wrote in #note-10:
That optimization is nice to have, but I think the point of this ticket is that it's currently not possible to have an arbitrary string separator.
I agree.
Dan0042 (Daniel DeLorme) wrote in #note-10:
sawa (Tsuyoshi Sawada) wrote in #note-7:
My guess is that, perhaps, even now, it is very rare to use a single space string argument with the expectation to match single or multiple spaces.
It looks like using a single space string argument is not so rare:
https://pastebin.com/pPyEf2GA
I didn't write that using a single space string argument is rare; I wrote that using a single space string argument with the expectation to match single or multiple spaces is rare.
In fact, my guess is that using a single space string argument is not rare, and that most of them expect to match only single space. I have not confirmed this. If it turns out to be correct, then that would constitute the use cases that are asked for.
Updated by sawa (Tsuyoshi Sawada) over 4 years ago
I have a particular use case. I was creating a file that describes UTF-8 characters, which included lines like this:
002  !"#$%&'()*+,-./
003 0123456789:;<=>?
004 @ABCDEFGHIJKLMNO
The first three characters of each line describe the significant hex-digits of the UTF-8 code, which are followed by a space character that separates the following sixteen characters that belong to that line.
Notice that the first space character after 002 is used as a separator, and the space character right after it is intended to express a literal space character.
I tried to parse each line with code like this:
significant_digits, characters = line.split(" ", 2)
but it did not work as I expected, which is:
significant_digits # => "002"
characters # => ' !"#$%&'()*+,-./'
and I realized the issue mentioned on this ticket.
Updated by matz (Yukihiro Matsumoto) over 4 years ago
@sawa (Tsuyoshi Sawada), for your "use-case", line.split(/ /, 2) is far better than line.split(" ", 2, literal: true), I think, no matter what keyword we'd choose.
- shorter
- clearer
- need to rewrite anyway
Matz.
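To make the difference in this use case concrete, here is a minimal sketch comparing the two calls on the first sample line. It assumes the line contains two spaces after 002 (the separator followed by a literal space character), as the use case describes; the variable names are illustrative only.

```ruby
# First sample line: "002", a separator space, then sixteen characters
# starting with a literal space. %q{} avoids any escaping or interpolation.
line = %q{002  !"#$%&'()*+,-./}

# String separator " ": awk-style splitting skips the whole run of spaces.
awk_digits, awk_chars = line.split(" ", 2)

# Regexp separator / /: splits at the first single space only.
re_digits, re_chars = line.split(/ /, 2)

awk_digits                  # => "002"
awk_chars.start_with?(" ")  # => false (the literal space is lost)
awk_chars.size              # => 15

re_digits                   # => "002"
re_chars.start_with?(" ")   # => true
re_chars.size               # => 16
```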
Updated by Dan0042 (Daniel DeLorme) over 4 years ago
I think it's worth mentioning nobu's comment from the dev meeting:
have wanted to deprecate that behavior for years, and made non-nil $/ warned.
So what about finally deprecating this behavior? If the incompatibility is too bad it's always possible to go back. sawa's use-case, rather than being about the literal option, is more about the benefit of treating " " as split_type string. And I remember being very surprised about the behavior of str.split(" ") when I started out in ruby.
Maybe:
- if $VERBOSE show a warning "use nil or / /" if separator is " "
- if !$VERBOSE show a warning "use nil" if separator is " " and matches differently from / /
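The two conditions above could be prototyped in plain Ruby roughly as follows. This is a sketch under stated assumptions, not how the interpreter would implement it: `awk_split_differs?` and `split_with_warning` are hypothetical names, the warning texts are placeholders, and "matches differently" is approximated by comparing the two results.

```ruby
# Hypothetical predicate: does awk-style split(" ") give a different
# result from splitting on exactly one space?
def awk_split_differs?(str)
  str.split(" ") != str.split(/ /)
end

# Hypothetical wrapper mimicking the proposed warnings.
def split_with_warning(str, sep)
  if sep == " "
    if $VERBOSE
      # Verbose mode: always point out the special-cased separator.
      warn 'split(" ") is special-cased; use nil or / /'
    elsif awk_split_differs?(str)
      # Non-verbose mode: warn only when the result actually differs.
      warn 'split(" ") matched differently from / /; use nil'
    end
  end
  str.split(sep)
end

awk_split_differs?("a b")    # => false
awk_split_differs?("a  b")   # => true
awk_split_differs?(" a")     # => true
awk_split_differs?("a\tb")   # => true (awk mode also splits on tabs)
```

The wrapper still returns the current awk-style result, so existing behavior is unchanged; only the warning is added.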