Actions

Copy link

Bug #4298

closed

Duration of calling String#[] with the same index is strangely related to string length.

Added by radarek (Radosław Bułat) over 14 years ago. Updated about 14 years ago.

Status:

Closed

Assignee:

Target version:

ruby -v:

ruby 1.8.7 (2009-12-24 patchlevel 248) [i686-darwin10.0.0], MBARI 0x6770, Ruby Enterprise Edition 2010.01

Backport:

[ruby-core:34648]

Description

=begin
The longer string is the slower is String#[] (for the same index > 0). See the example benchmark below:


 # encoding: utf-8
 
 require 'benchmark'
 
 s1 = "ąćęłńóśżź!0123456789"
 
 puts Benchmark.measure {
   1000.times { s1[1] }
 }
 
 s2 = "ąćęłńóśżź!0123456789" * 100_000
 
 puts Benchmark.measure {
   1000.times { s2[1] }
 }

Running above code:

 $ ruby-trunk x.rb 
   0.010000   0.000000   0.010000 (  0.000341)
   1.780000   0.020000   1.800000 (  1.998366)

=end

Files

benchmark.rb (228 Bytes) benchmark.rb

Benchmark showing described problem

radarek (Radosław Bułat), 01/20/2011 06:21 PM

Actions

Copy link

Updated by radarek (Radosław Bułat) over 14 years ago

=begin
I copied&pasted wrong info about ruby's version. It's actually:

$ ruby-trunk -v
ruby 1.9.3dev (2010-09-06 trunk 29190) [x86_64-darwin10.4.0]

=end

Actions

Copy link

Updated by phasis68 (Heesob Park) over 14 years ago

=begin
This bug(or feature) is due to the repetitive calling of str_strlen() function.

Here is a sentry patch for rb_str_substr() function of string.c

--- string.c 2011-01-22 21:25:38.000000000 +0900
+++ string.c.new 2011-01-22 21:28:05.000000000 +0900
@@ -1603,7 +1603,7 @@
if (beg < 0) return Qnil;
}
}

else if (beg > 0 && beg > str_strlen(str, enc)) {

else if (beg > 0 && beg > RSTRING_LEN(str) && beg > str_strlen(str, enc)) {
return Qnil;
}
if (len == 0) {

Before patch:
phasis:~ phasis$ /usr/local/bin/ruby t1.rb
0.000000 0.000000 0.000000 ( 0.000398)
1.690000 0.010000 1.700000 ( 1.731168)

After patch:
phasis:~ phasis$ /usr/local/bin/ruby t1.rb
0.000000 0.000000 0.000000 ( 0.000355)
0.000000 0.000000 0.000000 ( 0.000356)

=end

Actions

Copy link

Updated by headius (Charles Nutter) over 14 years ago

=begin
Perhaps I'm mistaken, but isn't RSTRING_LEN the byte length of the string? Isn't that always guaranteed to be >= str_strlen, which is the character length? And doesn't this patch basically break completely if beg is greater than the encoded character length but not greater than the raw byte length?

It seems to me that the problem is more complicated. For many encodings you must either save the new length each time you mutate it (doing str_strlen on every mutation) or recalculate it as needed (such as in rb_str_substr). It seems ruby-core has chosen the latter, and the behavior also exists in JRuby (since we have tried to mimic behavior very closely).

This is a peril of using variable-length encodings like UTF-8. str_strlen does do a coderange check, to see if it can do a simple byte offset, but if it can't it must walk the characters one at a time.

As an experiment, I modified the benchmark to encode the text as UTF-32BE, so there would be a constant size for each character. And as expected, the performance improves dramatically:

encoding: utf-8¶

require 'benchmark'

s1 = "ąćęłńóśżź!0123456789".encode('UTF-32BE')

puts Benchmark.measure {
1000.times { s1[1] }
}

s2 = "ąćęłńóśżź!0123456789".encode('UTF-32BE') * 100_000

puts Benchmark.measure {
1000.times { s2[1] }
}

Results:

~/projects/jruby ➔ ruby1.9 strlen_thing.rb
0.000000 0.000000 0.000000 ( 0.000283)
0.000000 0.000000 0.000000 ( 0.000259)

I believe this is one reason Python 3+ allows you to force all strings to be either UTF-8 or UTF-32 in-memory, and only supports those two encodings. It is also one of the original reasons Java and .NET chose to use UTF-16, though that obviously became a small problem when Unicode expanded to 24 bits...
=end

Actions

Copy link

Updated by matz (Yukihiro Matsumoto) over 14 years ago

=begin
Hi,

In message "Re: [ruby-core:34793] [Ruby 1.9-Bug#4298] Duration of calling String#[] with the same index is strangely related to string length."
on Sun, 23 Jan 2011 03:42:20 +0900, Charles Nutter redmine@ruby-lang.org writes:

|Perhaps I'm mistaken, but isn't RSTRING_LEN the byte length of the string? Isn't that always guaranteed to be >= str_strlen, which is the character length? And doesn't this patch basically break completely if beg is greater than the encoded character length but not greater than the raw byte length?

RSTRING_LEN is the byte length of the string, but we can assume
RSTRING_LEN > str_strlen(str,enc), I think.

|I believe this is one reason Python 3+ allows you to force all strings to be either UTF-8 or UTF-32 in-memory, and only supports those two encodings. It is also one of the original reasons Java and .NET chose to use UTF-16, though that obviously became a small problem when Unicode expanded to 24 bits...

Unfortunately, UTF-16 is no longer fixed length encoding since
introduction of surrogate pairs. Almost same for UTF-32 with IVS.

						matz.

=end

Actions

Copy link

Updated by phasis68 (Heesob Park) over 14 years ago

=begin
Hi,

2011/1/23 Charles Nutter redmine@ruby-lang.org:

Issue #4298 has been updated by Charles Nutter.

Perhaps I'm mistaken, but isn't RSTRING_LEN the byte length of the string? Isn't that always guaranteed to be >= str_strlen, which is the character length? And doesn't this patch basically break completely if beg is greater than the encoded character length but not greater than the raw byte length?

Sorry, you are right. It is a wrong patch.

I think that storing str_strlen value to a variable only when string
length changes is a possible workaround.

Regards,
Park Heesob

=end

Actions

Copy link

Updated by nobu (Nobuyoshi Nakada) over 14 years ago

Status changed from Open to Closed
% Done changed from 0 to 100

=begin
This issue was solved with changeset r30636.
Radosław, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.

string.c (str_nth_len, str_utf8_nth): return the rest length together.
- string.c (rb_str_substr): get rid of measure the length always
  to improve performance for huge string. [ruby-core:34648]
  =end

Actions

Copy link

Updated by headius (Charles Nutter) over 14 years ago

=begin
I think it's worth pointing out that Nobu's fix does not prevent String#[] from needing to do a character scan to reach the nth character, but it does fix the issue of always scanning all characters.

I'll check if JRuby needs a similar fix.
=end

Actions

Copy link

Updated by headius (Charles Nutter) over 14 years ago

=begin
JRuby does need a similar fix. I filed http://jira.codehaus.org/browse/JRUBY-5411 to cover it.
=end

Actions

Copy link

Updated by enebo (Thomas Enebo) over 14 years ago

=begin
Marcin and I were wondering the other day why character length is not cached in the string. It seems whenever code range has to be updated str length cache could also be updated with little extra cost. Is there a thread that explains why caching character length was not done.
=end

Actions

Copy link

#10

Updated by naruse (Yui NARUSE) over 14 years ago

=begin
2011/1/24 Thomas Enebo redmine@ruby-lang.org:

Marcin and I were wondering the other day why character length is not cached in the string. It seems whenever code range has to be updated str length cache could also be updated with little extra cost. Is there a thread that explains why caching character length was not done.

Simply because there is no such space.

--
NARUSE, Yui naruse@airemix.jp

=end

Actions

Copy link

#11

Updated by matz (Yukihiro Matsumoto) over 14 years ago

=begin
Hi,

In message "Re: [ruby-core:34814] [Ruby 1.9-Bug#4298] Duration of calling String#[] with the same index is strangely related to string length."
on Mon, 24 Jan 2011 09:51:14 +0900, Thomas Enebo redmine@ruby-lang.org writes:

|Marcin and I were wondering the other day why character length is not cached in the string. It seems whenever code range has to be updated str length cache could also be updated with little extra cost. Is there a thread that explains why caching character length was not done.

Not a big reason. We didn't have room in the RString structure, and
we hadn't met big problem without length cache so far. Of course JRuby
is free to have string length cache.

						matz.

=end

Actions

Copy link

#12

Updated by enebo (Thomas Enebo) over 14 years ago

=begin
Thanks for the update. We will look into a cache in JRuby...
=end

Actions

Copy link

Also available in: Atom PDF

Like0

Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0

Project

General

Profile

Ruby

Tags

Custom queries

Bug #4298

Duration of calling String#[] with the same index is strangely related to string length.

Updated by radarek (Radosław Bułat) over 14 years ago

Updated by phasis68 (Heesob Park) over 14 years ago

Updated by headius (Charles Nutter) over 14 years ago

encoding: utf-8¶

Updated by matz (Yukihiro Matsumoto) over 14 years ago

Updated by phasis68 (Heesob Park) over 14 years ago

Updated by nobu (Nobuyoshi Nakada) over 14 years ago

Updated by headius (Charles Nutter) over 14 years ago

Updated by headius (Charles Nutter) over 14 years ago

Updated by enebo (Thomas Enebo) over 14 years ago

Updated by naruse (Yui NARUSE) over 14 years ago

Updated by matz (Yukihiro Matsumoto) over 14 years ago

Updated by enebo (Thomas Enebo) over 14 years ago