Bug #6509

closed

String#gsub is too slow if receiver includes a binary

Added by okkez (okkez _) over 12 years ago. Updated over 12 years ago.

Status:
Closed
Target version:
ruby -v:
ruby 2.0.0dev (2012-05-28 trunk 35830) [x86_64-linux]
Backport:
[ruby-dev:45688]

Description

=begin

String#gsub becomes slow with code like the following. Elapsed times are in seconds.

  • With b = "" (A): 0.2840230464935303
  • With b = "\xB9" (B): 4.183771848678589

# -*- coding: utf-8 -*-

a = ("abcde\n"*50000).force_encoding("binary")
#b = ""
b = "\xB9".force_encoding("binary")
c = ("efghi\n"*50000).force_encoding("binary")

d = "#{a}#{b}#{c}"

start = Time.now.to_f
d.gsub(/\n/) { "" }
puts(Time.now.to_f - start)

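I profiled both cases and attached the results.

In case (B), search_nonascii is called about 200,000 times and accounts for 92% of the processing time.
In case (A), it is called only about 100,000 times, and the processing time is short.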

=end
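
The measurement in the description can also be wrapped in small helpers so that cases (A) and (B) run side by side. The following is only a sketch: the helper names `build_input` and `time_gsub` are hypothetical and not part of the report, and `Benchmark.realtime` is used in place of the manual `Time.now` subtraction.

```ruby
require "benchmark"

# Hypothetical helper: builds the input from the report.
# `middle` is the string placed between the two ASCII-only halves.
def build_input(middle, repeat = 50_000)
  a = ("abcde\n" * repeat).force_encoding("binary")
  c = ("efghi\n" * repeat).force_encoding("binary")
  a + middle.b + c
end

# Hypothetical helper: returns [elapsed_seconds, gsub_result].
def time_gsub(str)
  result = nil
  elapsed = Benchmark.realtime { result = str.gsub(/\n/) { "" } }
  [elapsed, result]
end

if __FILE__ == $PROGRAM_NAME
  { "A" => "", "B" => "\xB9" }.each do |label, middle|
    elapsed, = time_gsub(build_input(middle))
    puts "(#{label}) #{elapsed}"
  end
end
```

On a Ruby build that includes the fix for this issue, the two timings would be expected to be much closer than the figures reported above.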


Files

callgrind.out.9937 (521 KB), case (A), okkez (okkez _), 05/29/2012 10:03 AM
callgrind.out.10091 (521 KB), case (B), okkez (okkez _), 05/29/2012 10:03 AM

Updated by shyouhei (Shyouhei Urabe) over 12 years ago

  • Category changed from core to M17N
  • Status changed from Open to Assigned
  • Assignee set to naruse (Yui NARUSE)

Within str_gsub, once dest has become non-ASCII, calling search_nonascii again from then on seems pointless, but I would like to hear an expert's opinion.
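
The idea in this comment can be illustrated with a toy model in plain Ruby. This is not Ruby's C implementation: `ToyBuffer`, its `forget_state` flag, and `total_scanned` are hypothetical names, and the model assumes that a buffer whose state is unknown must be rescanned in full on the next append.

```ruby
# Toy model only: counts how many bytes are scanned for non-ASCII content
# while chunks are appended to a growing buffer.
class ToyBuffer
  attr_reader :bytes_scanned

  def initialize(forget_state:)
    @data = String.new(encoding: Encoding::BINARY)
    @state = :seven_bit # an empty buffer is trivially ASCII-only
    @forget_state = forget_state
    @bytes_scanned = 0
  end

  def append(chunk)
    if @state == :unknown
      # Assumption: an unknown state forces a rescan of the whole buffer.
      @bytes_scanned += @data.bytesize
      @state = @data.ascii_only? ? :seven_bit : :non_ascii
    end
    if @state == :seven_bit
      # Only an ASCII-only buffer needs the new chunk inspected.
      @bytes_scanned += chunk.bytesize
      @state = :non_ascii unless chunk.ascii_only?
    end
    @data << chunk
    # Models the slow path: the known non-ASCII state is thrown away.
    @state = :unknown if @forget_state && @state == :non_ascii
    self
  end
end

# Appends one non-ASCII byte followed by 1000 ASCII chunks.
def total_scanned(forget_state:)
  buf = ToyBuffer.new(forget_state: forget_state)
  buf.append("\xB9".b)
  1000.times { buf.append("abcde".b) }
  buf.bytes_scanned
end
```

Under these assumptions, `total_scanned(forget_state: true)` grows quadratically with the number of appended chunks, while `total_scanned(forget_state: false)` stops scanning as soon as the first non-ASCII byte has been seen.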

Actions #2

Updated by naruse (Yui NARUSE) over 12 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r35863.
okkez, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • string.c (rb_enc_cr_str_buf_cat): don't reset coderange as unknown.
    the condition 'ptr_a8 && str_cr != ENC_CODERANGE_7BIT' means not
    unknown, str is also ASCII-8BIT because str_encindex == ptr_encindex,
    and not (str_cr == ENC_CODERANGE_UNKNOWN) and
    str_cr != ENC_CODERANGE_7BIT means str_cr is valid because ASCII-8BIT
    can't be broken. [ruby-dev:45688] [Bug #6509]
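
The claim that ASCII-8BIT "can't be broken" can be observed from Ruby code. The sketch below uses `String#ascii_only?` and `String#valid_encoding?`; the helper `coderange_label` and its mapping onto the three states named in the commit message are illustrative assumptions, not an API that Ruby exposes.

```ruby
# Hypothetical helper: approximates the code range states at Ruby level.
def coderange_label(str)
  if str.ascii_only?
    :seven_bit # roughly ENC_CODERANGE_7BIT
  elsif str.valid_encoding?
    :valid     # non-ASCII bytes, but well-formed in this encoding
  else
    :broken    # bytes that are ill-formed in this encoding
  end
end

binary = "\xB9".b                                  # ASCII-8BIT
utf8   = "\xB9".dup.force_encoding(Encoding::UTF_8) # same byte, read as UTF-8

p coderange_label("abcde\n".b)
p coderange_label(binary)
p coderange_label(utf8)
```

Every byte sequence is well-formed in ASCII-8BIT, so a binary string that is not ASCII-only can only be valid. That is why the commit can keep the known state instead of resetting it to unknown.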