Custom queries

Backport 3.2
Backport 3.3
Backport 3.4
Backport 4.0
bugs: unassigned
DevMeeting
matz
Open issues with attachment
Windows

Actions

Copy link

Bug #4298

closed

Duration of calling String#[] with the same index is strangely related to string length.

Bug #4298: Duration of calling String#[] with the same index is strangely related to string length.

Added by radarek (Radosław Bułat) about 15 years ago. Updated almost 15 years ago.

Status:

Closed

Assignee:

Target version:

ruby -v:

ruby 1.8.7 (2009-12-24 patchlevel 248) [i686-darwin10.0.0], MBARI 0x6770, Ruby Enterprise Edition 2010.01

Backport:

[ruby-core:34648]

Description

=begin
The longer string is the slower is String#[] (for the same index > 0). See the example benchmark below:


 # encoding: utf-8
 
 require 'benchmark'
 
 s1 = "ąćęłńóśżź!0123456789"
 
 puts Benchmark.measure {
   1000.times { s1[1] }
 }
 
 s2 = "ąćęłńóśżź!0123456789" * 100_000
 
 puts Benchmark.measure {
   1000.times { s2[1] }
 }

Running above code:

 $ ruby-trunk x.rb 
   0.010000   0.000000   0.010000 (  0.000341)
   1.780000   0.020000   1.800000 (  1.998366)

=end

Files

benchmark.rb (228 Bytes) benchmark.rb

Benchmark showing described problem

radarek (Radosław Bułat), 01/20/2011 06:21 PM

History
Notes
Property changes

Updated by radarek (Radosław Bułat) about 15 years ago Actions
Copy link
#1

=begin
I copied&pasted wrong info about ruby's version. It's actually:

$ ruby-trunk -v
ruby 1.9.3dev (2010-09-06 trunk 29190) [x86_64-darwin10.4.0]

=end

Updated by phasis68 (Heesob Park) about 15 years ago Actions
Copy link
#2

=begin
This bug(or feature) is due to the repetitive calling of str_strlen() function.

Here is a sentry patch for rb_str_substr() function of string.c

--- string.c 2011-01-22 21:25:38.000000000 +0900
+++ string.c.new 2011-01-22 21:28:05.000000000 +0900
@@ -1603,7 +1603,7 @@
if (beg < 0) return Qnil;
}
}

else if (beg > 0 && beg > str_strlen(str, enc)) {

else if (beg > 0 && beg > RSTRING_LEN(str) && beg > str_strlen(str, enc)) {
return Qnil;
}
if (len == 0) {

Before patch:
phasis:~ phasis$ /usr/local/bin/ruby t1.rb
0.000000 0.000000 0.000000 ( 0.000398)
1.690000 0.010000 1.700000 ( 1.731168)

After patch:
phasis:~ phasis$ /usr/local/bin/ruby t1.rb
0.000000 0.000000 0.000000 ( 0.000355)
0.000000 0.000000 0.000000 ( 0.000356)

=end

Updated by headius (Charles Nutter) about 15 years ago Actions
Copy link
#3

=begin
Perhaps I'm mistaken, but isn't RSTRING_LEN the byte length of the string? Isn't that always guaranteed to be >= str_strlen, which is the character length? And doesn't this patch basically break completely if beg is greater than the encoded character length but not greater than the raw byte length?

It seems to me that the problem is more complicated. For many encodings you must either save the new length each time you mutate it (doing str_strlen on every mutation) or recalculate it as needed (such as in rb_str_substr). It seems ruby-core has chosen the latter, and the behavior also exists in JRuby (since we have tried to mimic behavior very closely).

This is a peril of using variable-length encodings like UTF-8. str_strlen does do a coderange check, to see if it can do a simple byte offset, but if it can't it must walk the characters one at a time.

As an experiment, I modified the benchmark to encode the text as UTF-32BE, so there would be a constant size for each character. And as expected, the performance improves dramatically:

encoding: utf-8¶

require 'benchmark'

s1 = "ąćęłńóśżź!0123456789".encode('UTF-32BE')

puts Benchmark.measure {
1000.times { s1[1] }
}

s2 = "ąćęłńóśżź!0123456789".encode('UTF-32BE') * 100_000

puts Benchmark.measure {
1000.times { s2[1] }
}

Results:

~/projects/jruby ➔ ruby1.9 strlen_thing.rb
0.000000 0.000000 0.000000 ( 0.000283)
0.000000 0.000000 0.000000 ( 0.000259)

I believe this is one reason Python 3+ allows you to force all strings to be either UTF-8 or UTF-32 in-memory, and only supports those two encodings. It is also one of the original reasons Java and .NET chose to use UTF-16, though that obviously became a small problem when Unicode expanded to 24 bits...
=end

Updated by matz (Yukihiro Matsumoto) about 15 years ago Actions
Copy link
#4

=begin
Hi,

In message "Re: [ruby-core:34793] [Ruby 1.9-Bug#4298] Duration of calling String#[] with the same index is strangely related to string length."
on Sun, 23 Jan 2011 03:42:20 +0900, Charles Nutter redmine@ruby-lang.org writes:

|Perhaps I'm mistaken, but isn't RSTRING_LEN the byte length of the string? Isn't that always guaranteed to be >= str_strlen, which is the character length? And doesn't this patch basically break completely if beg is greater than the encoded character length but not greater than the raw byte length?

RSTRING_LEN is the byte length of the string, but we can assume
RSTRING_LEN > str_strlen(str,enc), I think.

|I believe this is one reason Python 3+ allows you to force all strings to be either UTF-8 or UTF-32 in-memory, and only supports those two encodings. It is also one of the original reasons Java and .NET chose to use UTF-16, though that obviously became a small problem when Unicode expanded to 24 bits...

Unfortunately, UTF-16 is no longer fixed length encoding since
introduction of surrogate pairs. Almost same for UTF-32 with IVS.

						matz.

=end

Updated by phasis68 (Heesob Park) about 15 years ago Actions
Copy link
#5

=begin
Hi,

2011/1/23 Charles Nutter redmine@ruby-lang.org:

Issue #4298 has been updated by Charles Nutter.

Perhaps I'm mistaken, but isn't RSTRING_LEN the byte length of the string? Isn't that always guaranteed to be >= str_strlen, which is the character length? And doesn't this patch basically break completely if beg is greater than the encoded character length but not greater than the raw byte length?

Sorry, you are right. It is a wrong patch.

I think that storing str_strlen value to a variable only when string
length changes is a possible workaround.

Regards,
Park Heesob

=end

Updated by nobu (Nobuyoshi Nakada) about 15 years ago Actions
Copy link
#6

Status changed from Open to Closed
% Done changed from 0 to 100

=begin
This issue was solved with changeset r30636.
Radosław, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.

string.c (str_nth_len, str_utf8_nth): return the rest length together.
- string.c (rb_str_substr): get rid of measure the length always
  to improve performance for huge string. [ruby-core:34648]
  =end

Updated by headius (Charles Nutter) about 15 years ago Actions
Copy link
#7

=begin
I think it's worth pointing out that Nobu's fix does not prevent String#[] from needing to do a character scan to reach the nth character, but it does fix the issue of always scanning all characters.

I'll check if JRuby needs a similar fix.
=end

Updated by headius (Charles Nutter) about 15 years ago Actions
Copy link
#8

=begin
JRuby does need a similar fix. I filed http://jira.codehaus.org/browse/JRUBY-5411 to cover it.
=end

Updated by enebo (Thomas Enebo) about 15 years ago Actions
Copy link
#9

=begin
Marcin and I were wondering the other day why character length is not cached in the string. It seems whenever code range has to be updated str length cache could also be updated with little extra cost. Is there a thread that explains why caching character length was not done.
=end

Updated by naruse (Yui NARUSE) about 15 years ago Actions
Copy link
#10

=begin
2011/1/24 Thomas Enebo redmine@ruby-lang.org:

Marcin and I were wondering the other day why character length is not cached in the string. It seems whenever code range has to be updated str length cache could also be updated with little extra cost. Is there a thread that explains why caching character length was not done.

Simply because there is no such space.

--
NARUSE, Yui naruse@airemix.jp

=end

Updated by matz (Yukihiro Matsumoto) about 15 years ago Actions
Copy link
#11

=begin
Hi,

In message "Re: [ruby-core:34814] [Ruby 1.9-Bug#4298] Duration of calling String#[] with the same index is strangely related to string length."
on Mon, 24 Jan 2011 09:51:14 +0900, Thomas Enebo redmine@ruby-lang.org writes:

|Marcin and I were wondering the other day why character length is not cached in the string. It seems whenever code range has to be updated str length cache could also be updated with little extra cost. Is there a thread that explains why caching character length was not done.

Not a big reason. We didn't have room in the RString structure, and
we hadn't met big problem without length cache so far. Of course JRuby
is free to have string length cache.

						matz.

=end

Updated by enebo (Thomas Enebo) about 15 years ago Actions
Copy link
#12

=begin
Thanks for the update. We will look into a cache in JRuby...
=end

Actions

Copy link

Also available in: PDF Atom

Project

General

Profile

Ruby

Custom queries

Bug #4298

Duration of calling String#[] with the same index is strangely related to string length.

Updated by radarek (Radosław Bułat) about 15 years ago Actions
Copy link
#1

Updated by phasis68 (Heesob Park) about 15 years ago Actions
Copy link
#2

Updated by headius (Charles Nutter) about 15 years ago Actions
Copy link
#3

encoding: utf-8¶

Updated by matz (Yukihiro Matsumoto) about 15 years ago Actions
Copy link
#4

Updated by phasis68 (Heesob Park) about 15 years ago Actions
Copy link
#5

Updated by nobu (Nobuyoshi Nakada) about 15 years ago Actions
Copy link
#6

Updated by headius (Charles Nutter) about 15 years ago Actions
Copy link
#7

Updated by headius (Charles Nutter) about 15 years ago Actions
Copy link
#8

Updated by enebo (Thomas Enebo) about 15 years ago Actions
Copy link
#9

Updated by naruse (Yui NARUSE) about 15 years ago Actions
Copy link
#10

Updated by matz (Yukihiro Matsumoto) about 15 years ago Actions
Copy link
#11

Updated by enebo (Thomas Enebo) about 15 years ago Actions
Copy link
#12

Project

General

Profile

Ruby

Custom queries

Bug #4298

Duration of calling String#[] with the same index is strangely related to string length.

Updated by radarek (Radosław Bułat) about 15 years ago ActionsCopy link #1

Updated by phasis68 (Heesob Park) about 15 years ago ActionsCopy link #2

Updated by headius (Charles Nutter) about 15 years ago ActionsCopy link #3

encoding: utf-8¶

Updated by matz (Yukihiro Matsumoto) about 15 years ago ActionsCopy link #4

Updated by phasis68 (Heesob Park) about 15 years ago ActionsCopy link #5

Updated by nobu (Nobuyoshi Nakada) about 15 years ago ActionsCopy link #6

Updated by headius (Charles Nutter) about 15 years ago ActionsCopy link #7

Updated by headius (Charles Nutter) about 15 years ago ActionsCopy link #8

Updated by enebo (Thomas Enebo) about 15 years ago ActionsCopy link #9

Updated by naruse (Yui NARUSE) about 15 years ago ActionsCopy link #10

Updated by matz (Yukihiro Matsumoto) about 15 years ago ActionsCopy link #11

Updated by enebo (Thomas Enebo) about 15 years ago ActionsCopy link #12

Updated by radarek (Radosław Bułat) about 15 years ago Actions
Copy link
#1

Updated by phasis68 (Heesob Park) about 15 years ago Actions
Copy link
#2

Updated by headius (Charles Nutter) about 15 years ago Actions
Copy link
#3

Updated by matz (Yukihiro Matsumoto) about 15 years ago Actions
Copy link
#4

Updated by phasis68 (Heesob Park) about 15 years ago Actions
Copy link
#5

Updated by nobu (Nobuyoshi Nakada) about 15 years ago Actions
Copy link
#6

Updated by headius (Charles Nutter) about 15 years ago Actions
Copy link
#7

Updated by headius (Charles Nutter) about 15 years ago Actions
Copy link
#8

Updated by enebo (Thomas Enebo) about 15 years ago Actions
Copy link
#9

Updated by naruse (Yui NARUSE) about 15 years ago Actions
Copy link
#10

Updated by matz (Yukihiro Matsumoto) about 15 years ago Actions
Copy link
#11

Updated by enebo (Thomas Enebo) about 15 years ago Actions
Copy link
#12