Bug #20512
Closed
Order of magnitude performance difference in single character slicing UTF-8 strings before and after length method is executed
Description
Slicing a single character out of a UTF-8 string becomes ~15 times faster after the method "length" has been executed on the string.
# Single byte symbols
letters = ("a".."z").to_a
length = 100000
str = length.times.map{letters[rand(26)]}.join
# Slow
start = Time.now
length.times{|i| str[i]}
puts Time.now - start # 0.169156201
str.length # performance hack
# Fast
start = Time.now
length.times{|i| str[i]}
puts Time.now - start # 0.009883919
# UTF-8 Symbols
letters = ("а".."я").to_a
length = 10000
str = length.times.map{letters[rand(26)]}.join
# Slow
start = Time.now
length.times{|i| str[i]}
puts Time.now - start # 0.326204007
str.length # performance hack
# Fast
start = Time.now
length.times{|i| str[i]}
puts Time.now - start # 0.016943093
Updated by byroot (Jean Boussier) 6 months ago
What is happening here is that length triggers scanning the string coderange.
And when the coderange is unknown, String#[] is slower for variable-length character encodings (like UTF-8).
On 3.3:
require 'json'
require 'objspace'
require 'benchmark'
# Single byte symbols
letters = ("a".."z").to_a
length = 100000
str = length.times.map{letters[rand(26)]}.join
# Slow
p Benchmark.realtime { length.times{|i| str[i]} }
p Benchmark.realtime { length.times{|i| str[i]} }
puts JSON.parse(ObjectSpace.dump(str))["coderange"]
p Benchmark.realtime { str.length } # performance hack
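puts JSON.parse(ObjectSpace.dump(str))["coderange"]
# Fast (this call produces the sixth line of the output below)
p Benchmark.realtime { length.times{|i| str[i]} }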
$ ruby -v /tmp/str.rb
ruby 3.3.1 (2024-04-23 revision c56cd86388) [arm64-darwin23]
0.17216699989512563
0.1763450000435114
unknown
5.999580025672913e-06
7bit
0.004894999787211418
See how coderange changes from unknown to 7bit, allowing String#[] to treat the string as pure ASCII, so it can directly compute the substring position with a simple offset.
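To make the difference concrete, here is a rough pure-Ruby sketch of the two lookup strategies. It is only an illustration and not how CRuby implements String#[]: the helper names (utf8_char_width, char_at_by_scanning, char_at_by_offset) are made up for this example, and the code assumes valid UTF-8 input and a non-negative index.

```ruby
# Width in bytes of a UTF-8 character, judged from its lead byte.
# Assumption: the input is valid UTF-8.
def utf8_char_width(lead_byte)
  if lead_byte < 0x80 then 1
  elsif lead_byte < 0xE0 then 2
  elsif lead_byte < 0xF0 then 3
  else 4
  end
end

# Variable-width case: the byte position of character N is only known
# after walking over the N characters in front of it.
def char_at_by_scanning(str, char_index)
  pos = 0
  char_index.times do
    return nil if pos >= str.bytesize
    pos += utf8_char_width(str.getbyte(pos))
  end
  return nil if pos >= str.bytesize
  str.byteslice(pos, utf8_char_width(str.getbyte(pos)))
end

# Known-ASCII case: every character is one byte, so the character
# index is also the byte offset and no walk is needed.
def char_at_by_offset(str, char_index)
  str.byteslice(char_index, 1)
end

mixed = "aяb"
p char_at_by_scanning(mixed, 1) == mixed[1]
p char_at_by_scanning(mixed, 2) == mixed[2]
p char_at_by_offset("abc", 1) == "abc"[1]
```

The scanning version does work proportional to the index on every call, which is why indexing each position of a long string adds up quickly, while the offset version does a constant amount of work per call.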
The question here is whether String#[] should trigger scanning the coderange. It would definitely make some code faster, but it may slow down other code, so it's a bit debatable, but I'd be in favor of it.
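One way to get a feel for the "may slow down other code" side is to time a single access near the start of a large string, with and without a full scan in front of it. This is a rough sketch under assumptions: build_string is a hypothetical helper, str.length is used as a stand-in for a scan that String#[] might trigger, and the timings will vary by machine and Ruby version.

```ruby
require 'benchmark'

# Hypothetical helper: builds a fresh string of +count+ Cyrillic letters.
def build_string(count)
  letters = ("а".."я").to_a
  Array.new(count) { letters.sample }.join
end

count = 200_000

# One-off access at the very start of the string, no scan beforehand.
one_off = build_string(count)
t_index = Benchmark.realtime { one_off[0] }

# The same access, but paying for a pass over the whole string first.
# str.length stands in here for a scan triggered by String#[].
scanned = build_string(count)
t_scan_then_index = Benchmark.realtime { scanned.length; scanned[0] }

puts "index only:      #{t_index}"
puts "scan then index: #{t_scan_then_index}"
```

Code that indexes a string many times would recover the cost of the scan quickly, whereas code that touches a large string only once would pay for a scan it never benefits from.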
Updated by nobu (Nobuyoshi Nakada) 6 months ago
- Status changed from Open to Closed
Applied in changeset git|7d144781a93df66379922717da711a09d1cf78ff.
[Bug #20512] Set coderange in Range#each of strings
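To see what a given Ruby build does with strings produced by a string range, something like the following can be used. It is a small sketch reusing the ObjectSpace.dump approach from the comment above; coderange_of is a hypothetical helper, and it returns nil on Ruby versions whose dump output has no "coderange" field. The printed values depend on whether the running Ruby includes this changeset, so no expected output is shown.

```ruby
require 'json'
require 'objspace'

# Hypothetical helper: reads the "coderange" field from ObjectSpace.dump.
# Returns nil if the running Ruby does not report a coderange.
def coderange_of(str)
  JSON.parse(ObjectSpace.dump(str))["coderange"]
end

letters = ("a".."z").to_a

# Coderange of the individual strings yielded by the range.
p letters.map { |s| coderange_of(s) }.uniq

# Coderange of a string assembled from them, as in the original report.
p coderange_of(letters.join)
```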