Project

General

Profile

Actions

Bug #4298

closed

Duration of calling String#[] with the same index is strangely related to string length.

Added by radarek (Radosław Bułat) about 13 years ago. Updated almost 13 years ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 1.8.7 (2009-12-24 patchlevel 248) [i686-darwin10.0.0], MBARI 0x6770, Ruby Enterprise Edition 2010.01
Backport:
[ruby-core:34648]

Description

=begin
The longer string is the slower is String#[] (for the same index > 0). See the example benchmark below:


 # encoding: utf-8
 
 require 'benchmark'
 
 s1 = "ąćęłńóśżź!0123456789"
 
 puts Benchmark.measure {
   1000.times { s1[1] }
 }
 
 s2 = "ąćęłńóśżź!0123456789" * 100_000
 
 puts Benchmark.measure {
   1000.times { s2[1] }
 }
 

Running above code:

 $ ruby-trunk x.rb 
   0.010000   0.000000   0.010000 (  0.000341)
   1.780000   0.020000   1.800000 (  1.998366)
 

=end


Files

benchmark.rb (228 Bytes) benchmark.rb Benchmark showing described problem radarek (Radosław Bułat), 01/20/2011 06:21 PM
Actions #1

Updated by radarek (Radosław Bułat) about 13 years ago

=begin
I copied&pasted wrong info about ruby's version. It's actually:

$ ruby-trunk -v
ruby 1.9.3dev (2010-09-06 trunk 29190) [x86_64-darwin10.4.0]

=end

Actions #2

Updated by phasis68 (Heesob Park) about 13 years ago

=begin
This bug(or feature) is due to the repetitive calling of str_strlen() function.

Here is a sentry patch for rb_str_substr() function of string.c

--- string.c 2011-01-22 21:25:38.000000000 +0900
+++ string.c.new 2011-01-22 21:28:05.000000000 +0900
@@ -1603,7 +1603,7 @@
if (beg < 0) return Qnil;
}
}

  • else if (beg > 0 && beg > str_strlen(str, enc)) {
  • else if (beg > 0 && beg > RSTRING_LEN(str) && beg > str_strlen(str, enc)) {
    return Qnil;
    }
    if (len == 0) {

Before patch:
phasis:~ phasis$ /usr/local/bin/ruby t1.rb
0.000000 0.000000 0.000000 ( 0.000398)
1.690000 0.010000 1.700000 ( 1.731168)

After patch:
phasis:~ phasis$ /usr/local/bin/ruby t1.rb
0.000000 0.000000 0.000000 ( 0.000355)
0.000000 0.000000 0.000000 ( 0.000356)

=end

Actions #3

Updated by headius (Charles Nutter) about 13 years ago

=begin
Perhaps I'm mistaken, but isn't RSTRING_LEN the byte length of the string? Isn't that always guaranteed to be >= str_strlen, which is the character length? And doesn't this patch basically break completely if beg is greater than the encoded character length but not greater than the raw byte length?

It seems to me that the problem is more complicated. For many encodings you must either save the new length each time you mutate it (doing str_strlen on every mutation) or recalculate it as needed (such as in rb_str_substr). It seems ruby-core has chosen the latter, and the behavior also exists in JRuby (since we have tried to mimic behavior very closely).

This is a peril of using variable-length encodings like UTF-8. str_strlen does do a coderange check, to see if it can do a simple byte offset, but if it can't it must walk the characters one at a time.

As an experiment, I modified the benchmark to encode the text as UTF-32BE, so there would be a constant size for each character. And as expected, the performance improves dramatically:

encoding: utf-8

require 'benchmark'

s1 = "ąćęłńóśżź!0123456789".encode('UTF-32BE')

puts Benchmark.measure {
1000.times { s1[1] }
}

s2 = "ąćęłńóśżź!0123456789".encode('UTF-32BE') * 100_000

puts Benchmark.measure {
1000.times { s2[1] }
}

Results:

~/projects/jruby ➔ ruby1.9 strlen_thing.rb
0.000000 0.000000 0.000000 ( 0.000283)
0.000000 0.000000 0.000000 ( 0.000259)

I believe this is one reason Python 3+ allows you to force all strings to be either UTF-8 or UTF-32 in-memory, and only supports those two encodings. It is also one of the original reasons Java and .NET chose to use UTF-16, though that obviously became a small problem when Unicode expanded to 24 bits...
=end

Actions #4

Updated by matz (Yukihiro Matsumoto) about 13 years ago

=begin
Hi,

In message "Re: [ruby-core:34793] [Ruby 1.9-Bug#4298] Duration of calling String#[] with the same index is strangely related to string length."
on Sun, 23 Jan 2011 03:42:20 +0900, Charles Nutter writes:

|Perhaps I'm mistaken, but isn't RSTRING_LEN the byte length of the string? Isn't that always guaranteed to be >= str_strlen, which is the character length? And doesn't this patch basically break completely if beg is greater than the encoded character length but not greater than the raw byte length?

RSTRING_LEN is the byte length of the string, but we can assume
RSTRING_LEN > str_strlen(str,enc), I think.

|I believe this is one reason Python 3+ allows you to force all strings to be either UTF-8 or UTF-32 in-memory, and only supports those two encodings. It is also one of the original reasons Java and .NET chose to use UTF-16, though that obviously became a small problem when Unicode expanded to 24 bits...

Unfortunately, UTF-16 is no longer fixed length encoding since
introduction of surrogate pairs. Almost same for UTF-32 with IVS.

						matz.

=end

Actions #5

Updated by phasis68 (Heesob Park) about 13 years ago

=begin
Hi,

2011/1/23 Charles Nutter :

Issue #4298 has been updated by Charles Nutter.

Perhaps I'm mistaken, but isn't RSTRING_LEN the byte length of the string? Isn't that always guaranteed to be >= str_strlen, which is the character length? And doesn't this patch basically break completely if beg is greater than the encoded character length but not greater than the raw byte length?

Sorry, you are right. It is a wrong patch.

I think that storing str_strlen value to a variable only when string
length changes is a possible workaround.

Regards,
Park Heesob

=end

Actions #6

Updated by nobu (Nobuyoshi Nakada) about 13 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

=begin
This issue was solved with changeset r30636.
Radosław, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • string.c (str_nth_len, str_utf8_nth): return the rest length together.

    • string.c (rb_str_substr): get rid of measure the length always
      to improve performance for huge string. [ruby-core:34648]
      =end
Actions #7

Updated by headius (Charles Nutter) about 13 years ago

=begin
I think it's worth pointing out that Nobu's fix does not prevent String#[] from needing to do a character scan to reach the nth character, but it does fix the issue of always scanning all characters.

I'll check if JRuby needs a similar fix.
=end

Actions #8

Updated by headius (Charles Nutter) about 13 years ago

=begin
JRuby does need a similar fix. I filed http://jira.codehaus.org/browse/JRUBY-5411 to cover it.
=end

Actions #9

Updated by enebo (Thomas Enebo) about 13 years ago

=begin
Marcin and I were wondering the other day why character length is not cached in the string. It seems whenever code range has to be updated str length cache could also be updated with little extra cost. Is there a thread that explains why caching character length was not done.
=end

Actions #10

Updated by naruse (Yui NARUSE) about 13 years ago

=begin
2011/1/24 Thomas Enebo :

Marcin and I were wondering the other day why character length is not cached in the string.  It seems whenever code range has to be updated str length cache could also be updated with little extra cost.  Is there a thread that explains why caching character length was not done.

Simply because there is no such space.

--
NARUSE, Yui  

=end

Actions #11

Updated by matz (Yukihiro Matsumoto) about 13 years ago

=begin
Hi,

In message "Re: [ruby-core:34814] [Ruby 1.9-Bug#4298] Duration of calling String#[] with the same index is strangely related to string length."
on Mon, 24 Jan 2011 09:51:14 +0900, Thomas Enebo writes:

|Marcin and I were wondering the other day why character length is not cached in the string. It seems whenever code range has to be updated str length cache could also be updated with little extra cost. Is there a thread that explains why caching character length was not done.

Not a big reason. We didn't have room in the RString structure, and
we hadn't met big problem without length cache so far. Of course JRuby
is free to have string length cache.

						matz.

=end

Actions #12

Updated by enebo (Thomas Enebo) about 13 years ago

=begin
Thanks for the update. We will look into a cache in JRuby...
=end

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0