Bug #8585
Closed
Time for CSV.generate grows quadratic with number of rows
Description
Hi,
I want to generate a CSV string from millions of rows.
I see that the time to create the string grows quadratically
with the number of rows. Because of this, I cannot use
ruby 2.0.0 to create the CSV file.
I did not see this problem in ruby 1.9.3.
I do see it in ruby 2.0.0 and ruby-head.
Using ruby-head
Installed with rvm reinstall ruby-head (built from version 3a01b9e)
peter_v@peter64:~/p/dbd$ rvm use ruby-head
Using /home/peter_v/.rvm/gems/ruby-head
peter_v@peter64:~/p/dbd$ ruby -v
ruby 2.1.0dev (2013-06-30) [x86_64-linux]
peter_v@peter64:~/p/dbd$ uname -a
Linux peter64 3.5.0-34-generic #55~precise1-Ubuntu SMP Fri Jun 7 16:25:50 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
peter_v@peter64:~/p/dbd$ rvm current
ruby-head
peter_v@peter64:~/p/dbd$ cat bin/test_4.rb
#!/usr/bin/env ruby
count = ARGV[0].to_i
unless count > 0
  puts "Give a 'count' as first argument."
  exit(1)
end
require 'csv'
row_data = [
  "59ffbb3b-1e48-4c1f-81d8-d93afc84c966",
  "2013-06-28 19:14:55.975000806 UTC",
  "a11f290e-c441-41bc-8b8c-4e6c27b1b6fc",
  "c73e6241-d46f-4952-8377-c11372346d15",
  "test",
  "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0"]
puts "starting CSV.generate"
start_time = Time.now
csv_string = CSV.generate(force_quotes: true) do |csv|
  count.times do
    csv << row_data
  end
end
puts "CSV.generate took #{Time.now - start_time} seconds"
peter_v@peter64:~/p/dbd$ time bin/test_4.rb 10_000
starting CSV.generate
CSV.generate took 1.01238478 seconds
real	0m1.045s
user	0m1.044s
sys	0m0.004s
peter_v@peter64:~/p/dbd$ time bin/test_4.rb 20_000
starting CSV.generate
CSV.generate took 3.815373614 seconds
real	0m3.847s
user	0m3.844s
sys	0m0.000s
peter_v@peter64:~/p/dbd$ time bin/test_4.rb 40_000
starting CSV.generate
CSV.generate took 17.176208859 seconds
real	0m17.212s
user	0m17.177s
sys	0m0.020s
peter_v@peter64:~/p/dbd$ time bin/test_4.rb 80_000
starting CSV.generate
CSV.generate took 71.400916725 seconds
real	1m11.436s
user	1m11.320s
sys	0m0.036s
peter_v@peter64:~/p/dbd$
Using ruby-1.9.3-p448
As expected, the time here grows LINEARLY with the number of rows.
peter_v@peter64:~/p/dbd$ rvm use ruby-1.9.3
Using /home/peter_v/.rvm/gems/ruby-1.9.3-p448
peter_v@peter64:~/p/dbd$ ruby -v
ruby 1.9.3p448 (2013-06-27 revision 41675) [x86_64-linux]
peter_v@peter64:~/p/dbd$ rvm current
ruby-1.9.3-p448
peter_v@peter64:~/p/dbd$ time bin/test_4.rb 10_000
starting CSV.generate
CSV.generate took 0.125396387 seconds
real	0m0.150s
user	0m0.140s
sys	0m0.008s
peter_v@peter64:~/p/dbd$ time bin/test_4.rb 20_000
starting CSV.generate
CSV.generate took 0.249746069 seconds
real	0m0.274s
user	0m0.268s
sys	0m0.004s
peter_v@peter64:~/p/dbd$ time bin/test_4.rb 40_000
starting CSV.generate
CSV.generate took 0.498180989 seconds
real	0m0.522s
user	0m0.504s
sys	0m0.016s
peter_v@peter64:~/p/dbd$ time bin/test_4.rb 80_000
starting CSV.generate
CSV.generate took 0.991481147 seconds
real	0m1.015s
user	0m1.000s
sys	0m0.016s
peter_v@peter64:~/p/dbd$ time bin/test_4.rb 100_000
starting CSV.generate
CSV.generate took 1.243347153 seconds
real	0m1.265s
user	0m1.240s
sys	0m0.020s
peter_v@peter64:~/p/dbd$ time bin/test_4.rb 1_000_000
starting CSV.generate
CSV.generate took 12.461711974 seconds
real	0m12.492s
user	0m12.405s
sys	0m0.080s
peter_v@peter64:~/p/dbd$
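To put a number on "quadratic" versus "linear", the timings above can be turned into growth exponents. This is a minimal sketch: `growth_exponents` is a hypothetical helper (not part of the report), and it assumes the time follows roughly `rows**k`, estimating `k` from each pair of consecutive runs. The figures are copied from the runs above.

```ruby
# Seconds reported by "CSV.generate took ..." in the runs above,
# keyed by row count.
head_timings = { 10_000 => 1.01238478, 20_000 => 3.815373614,
                 40_000 => 17.176208859, 80_000 => 71.400916725 }
v193_timings = { 10_000 => 0.125396387, 20_000 => 0.249746069,
                 40_000 => 0.498180989, 80_000 => 0.991481147 }

# Hypothetical helper: for each pair of consecutive measurements,
# estimate k in "time ~ rows**k" as log(time ratio) / log(row ratio).
def growth_exponents(timings)
  timings.sort.each_cons(2).map do |(rows_a, secs_a), (rows_b, secs_b)|
    Math.log(secs_b / secs_a) / Math.log(rows_b.to_f / rows_a)
  end
end

puts "ruby-head:  #{growth_exponents(head_timings).map { |k| k.round(2) }.inspect}"
puts "ruby 1.9.3: #{growth_exponents(v193_timings).map { |k| k.round(2) }.inspect}"
```

With these numbers the exponents come out close to 2 for ruby-head (each doubling of rows costs about four times as long) and close to 1 for ruby 1.9.3.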
Updated by peter_v (Peter Vandenabeele) over 12 years ago
Using CSV.open(filename, 'w'), I can write large CSV files to disk in Ruby 2.0.0
(e.g. 10 M rows in 132 seconds).
Only writing to a string is a problem in
ruby 2.0.0 and ruby-head.
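For anyone needing the workaround, a minimal sketch of the file-based variant follows. `write_rows_to_file` is a hypothetical helper name, and the temporary file and short row are only there to keep the example self-contained; the original test used the longer `row_data` from the description.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: write the rows straight to a file with CSV.open
# instead of accumulating them in one large String via CSV.generate.
def write_rows_to_file(path, row_data, count)
  CSV.open(path, 'w', force_quotes: true) do |csv|
    count.times { csv << row_data }
  end
end

# Small demonstration against a temporary file.
Tempfile.create(['rows', '.csv']) do |file|
  write_rows_to_file(file.path, ["test", "B 0"], 3)
  puts File.read(file.path)
end
```

If the data is needed as a String afterwards, it can be read back from the file, at the cost of the extra disk round trip.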
        
Updated by Eregon (Benoit Daloze) over 12 years ago
      Good find!
A git bisect led to r37485 aka 58ef0f06:
Author: naruse
Date:   Tue Nov 6 00:49:57 2012 +0000
* ruby.c (load_file_internal): set default source encoding as
  UTF-8 instead of US-ASCII. [ruby-core:46021] [Feature #6679]
* parse.y (parser_initialize): set default parser encoding as
  UTF-8 instead of US-ASCII.
So it definitely looks encoding-related.
It is worrying that this change causes such a performance regression.
        
Updated by Eregon (Benoit Daloze) over 12 years ago
      Adding "# encoding: US-ASCII" at the top of the script makes it identical to the previous behavior, therefore taking the same time. I would certainly not call this a solution though.
        
Updated by Anonymous over 12 years ago
This is most likely due to character indexing in UTF-8 being O(n).
I'd suggest reworking CSV.generate to not use character indexing, or converting input strings to UTF-32 first.
        
Updated by nobu (Nobuyoshi Nakada) over 12 years ago
- File bug-8585.diff added
Eregon (Benoit Daloze) wrote:
Adding "# encoding: US-ASCII" at the top of the script makes it identical to the previous behavior, therefore taking the same time. I would certainly not call this a solution though.
The file already has that line.
This sluggishness seems to be because String#encode in the do_quote lambda in init_separators is called for each field.
        
Updated by Eregon (Benoit Daloze) over 12 years ago
      nobu (Nobuyoshi Nakada) wrote:
The file already has that line.
I meant at the top of the test script provided in the description.
This sluggishness seems to be because String#encode in the do_quote lambda in init_separators is called for each field.
Any idea why this makes the whole process quadratic?
        
Updated by nobu (Nobuyoshi Nakada) over 12 years ago
      - Status changed from Open to Closed
- % Done changed from 0 to 100
This issue was solved with changeset r41722.
Peter, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.
csv.rb: get rid of discarding coderange
- lib/csv.rb (CSV#<<): use StringIO#set_encoding instead of creating
 new StringIO instance with String#force_encoding, forcing encoding
 discards the cached coderange bits and can make further operations
 very slow. [ruby-core:55714] [Bug #8585]
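The two approaches named in the commit message can be contrasted with a simplified sketch. This is an illustration only, not the actual code from lib/csv.rb: both method names are hypothetical, and the per-line loop stands in for repeated calls to CSV#<<. One plausible reading of the commit message is that, if each discarded coderange forces a later operation to rescan the whole accumulated string, appending row k costs work proportional to k, so n rows cost on the order of n²/2. That would match the quadratic timings in the description, but it is not measured here.

```ruby
require 'stringio'

# Pattern the commit message describes as slow: for every appended line,
# force the encoding of the accumulated string and wrap it in a new StringIO.
def build_by_rewrapping(lines, encoding = Encoding::UTF_8)
  io = StringIO.new(String.new)
  lines.each do |line|
    io = StringIO.new(io.string.force_encoding(encoding))
    io.seek(0, IO::SEEK_END) # a new StringIO starts at position 0
    io << line
  end
  io.string
end

# Pattern the commit message switches to: keep a single StringIO and
# set its encoding with StringIO#set_encoding.
def build_with_set_encoding(lines, encoding = Encoding::UTF_8)
  io = StringIO.new(String.new)
  lines.each do |line|
    io.set_encoding(encoding)
    io << line
  end
  io.string
end

lines = ["\"a\",\"b\"\n", "\"c\",\"d\"\n"]
puts build_by_rewrapping(lines) == build_with_set_encoding(lines)
```

Both methods produce the same content; the difference the commit targets is only in how much work each append triggers.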