Bug #4298
closedDuration of calling String#[] with the same index is strangely related to string length.
Added by radarek (Radosław Bułat) over 13 years ago. Updated about 13 years ago.
Description
=begin
The longer string is the slower is String#[] (for the same index > 0). See the example benchmark below:
# encoding: utf-8
require 'benchmark'
s1 = "ąćęłńóśżź!0123456789"
puts Benchmark.measure {
1000.times { s1[1] }
}
s2 = "ąćęłńóśżź!0123456789" * 100_000
puts Benchmark.measure {
1000.times { s2[1] }
}
Running above code:
$ ruby-trunk x.rb 0.010000 0.000000 0.010000 ( 0.000341) 1.780000 0.020000 1.800000 ( 1.998366)
=end
Files
benchmark.rb (228 Bytes) benchmark.rb | Benchmark showing described problem | radarek (Radosław Bułat), 01/20/2011 06:21 PM |
Updated by radarek (Radosław Bułat) over 13 years ago
=begin
I copied&pasted wrong info about ruby's version. It's actually:
$ ruby-trunk -v
ruby 1.9.3dev (2010-09-06 trunk 29190) [x86_64-darwin10.4.0]
=end
Updated by phasis68 (Heesob Park) over 13 years ago
=begin
This bug(or feature) is due to the repetitive calling of str_strlen() function.
Here is a sentry patch for rb_str_substr() function of string.c
--- string.c 2011-01-22 21:25:38.000000000 +0900
+++ string.c.new 2011-01-22 21:28:05.000000000 +0900
@@ -1603,7 +1603,7 @@
if (beg < 0) return Qnil;
}
}
- else if (beg > 0 && beg > str_strlen(str, enc)) {
- else if (beg > 0 && beg > RSTRING_LEN(str) && beg > str_strlen(str, enc)) {
return Qnil;
}
if (len == 0) {
Before patch:
phasis:~ phasis$ /usr/local/bin/ruby t1.rb
0.000000 0.000000 0.000000 ( 0.000398)
1.690000 0.010000 1.700000 ( 1.731168)
After patch:
phasis:~ phasis$ /usr/local/bin/ruby t1.rb
0.000000 0.000000 0.000000 ( 0.000355)
0.000000 0.000000 0.000000 ( 0.000356)
=end
Updated by headius (Charles Nutter) over 13 years ago
=begin
Perhaps I'm mistaken, but isn't RSTRING_LEN the byte length of the string? Isn't that always guaranteed to be >= str_strlen, which is the character length? And doesn't this patch basically break completely if beg is greater than the encoded character length but not greater than the raw byte length?
It seems to me that the problem is more complicated. For many encodings you must either save the new length each time you mutate it (doing str_strlen on every mutation) or recalculate it as needed (such as in rb_str_substr). It seems ruby-core has chosen the latter, and the behavior also exists in JRuby (since we have tried to mimic behavior very closely).
This is a peril of using variable-length encodings like UTF-8. str_strlen does do a coderange check, to see if it can do a simple byte offset, but if it can't it must walk the characters one at a time.
As an experiment, I modified the benchmark to encode the text as UTF-32BE, so there would be a constant size for each character. And as expected, the performance improves dramatically:
encoding: utf-8¶
require 'benchmark'
s1 = "ąćęłńóśżź!0123456789".encode('UTF-32BE')
puts Benchmark.measure {
1000.times { s1[1] }
}
s2 = "ąćęłńóśżź!0123456789".encode('UTF-32BE') * 100_000
puts Benchmark.measure {
1000.times { s2[1] }
}
Results:
~/projects/jruby ➔ ruby1.9 strlen_thing.rb
0.000000 0.000000 0.000000 ( 0.000283)
0.000000 0.000000 0.000000 ( 0.000259)
I believe this is one reason Python 3+ allows you to force all strings to be either UTF-8 or UTF-32 in-memory, and only supports those two encodings. It is also one of the original reasons Java and .NET chose to use UTF-16, though that obviously became a small problem when Unicode expanded to 24 bits...
=end
Updated by matz (Yukihiro Matsumoto) over 13 years ago
=begin
Hi,
In message "Re: [ruby-core:34793] [Ruby 1.9-Bug#4298] Duration of calling String#[] with the same index is strangely related to string length."
on Sun, 23 Jan 2011 03:42:20 +0900, Charles Nutter redmine@ruby-lang.org writes:
|Perhaps I'm mistaken, but isn't RSTRING_LEN the byte length of the string? Isn't that always guaranteed to be >= str_strlen, which is the character length? And doesn't this patch basically break completely if beg is greater than the encoded character length but not greater than the raw byte length?
RSTRING_LEN is the byte length of the string, but we can assume
RSTRING_LEN > str_strlen(str,enc), I think.
|I believe this is one reason Python 3+ allows you to force all strings to be either UTF-8 or UTF-32 in-memory, and only supports those two encodings. It is also one of the original reasons Java and .NET chose to use UTF-16, though that obviously became a small problem when Unicode expanded to 24 bits...
Unfortunately, UTF-16 is no longer fixed length encoding since
introduction of surrogate pairs. Almost same for UTF-32 with IVS.
matz.
=end
Updated by phasis68 (Heesob Park) over 13 years ago
=begin
Hi,
2011/1/23 Charles Nutter redmine@ruby-lang.org:
Issue #4298 has been updated by Charles Nutter.
Perhaps I'm mistaken, but isn't RSTRING_LEN the byte length of the string? Isn't that always guaranteed to be >= str_strlen, which is the character length? And doesn't this patch basically break completely if beg is greater than the encoded character length but not greater than the raw byte length?
Sorry, you are right. It is a wrong patch.
I think that storing str_strlen value to a variable only when string
length changes is a possible workaround.
Regards,
Park Heesob
=end
Updated by nobu (Nobuyoshi Nakada) over 13 years ago
- Status changed from Open to Closed
- % Done changed from 0 to 100
=begin
This issue was solved with changeset r30636.
Radosław, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.
-
string.c (str_nth_len, str_utf8_nth): return the rest length together.
- string.c (rb_str_substr): get rid of measure the length always
to improve performance for huge string. [ruby-core:34648]
=end
- string.c (rb_str_substr): get rid of measure the length always
Updated by headius (Charles Nutter) over 13 years ago
=begin
I think it's worth pointing out that Nobu's fix does not prevent String#[] from needing to do a character scan to reach the nth character, but it does fix the issue of always scanning all characters.
I'll check if JRuby needs a similar fix.
=end
Updated by headius (Charles Nutter) over 13 years ago
=begin
JRuby does need a similar fix. I filed http://jira.codehaus.org/browse/JRUBY-5411 to cover it.
=end
Updated by enebo (Thomas Enebo) over 13 years ago
=begin
Marcin and I were wondering the other day why character length is not cached in the string. It seems whenever code range has to be updated str length cache could also be updated with little extra cost. Is there a thread that explains why caching character length was not done.
=end
Updated by naruse (Yui NARUSE) over 13 years ago
=begin
2011/1/24 Thomas Enebo redmine@ruby-lang.org:
Marcin and I were wondering the other day why character length is not cached in the string. It seems whenever code range has to be updated str length cache could also be updated with little extra cost. Is there a thread that explains why caching character length was not done.
Simply because there is no such space.
--
NARUSE, Yui naruse@airemix.jp
=end
Updated by matz (Yukihiro Matsumoto) over 13 years ago
=begin
Hi,
In message "Re: [ruby-core:34814] [Ruby 1.9-Bug#4298] Duration of calling String#[] with the same index is strangely related to string length."
on Mon, 24 Jan 2011 09:51:14 +0900, Thomas Enebo redmine@ruby-lang.org writes:
|Marcin and I were wondering the other day why character length is not cached in the string. It seems whenever code range has to be updated str length cache could also be updated with little extra cost. Is there a thread that explains why caching character length was not done.
Not a big reason. We didn't have room in the RString structure, and
we hadn't met big problem without length cache so far. Of course JRuby
is free to have string length cache.
matz.
=end
Updated by enebo (Thomas Enebo) over 13 years ago
=begin
Thanks for the update. We will look into a cache in JRuby...
=end