Feature #4017


[PATCH] CSV parsing speedup

Added by ender672 (Timothy Elliott) about 14 years ago. Updated over 6 years ago.

Status:
Rejected
Target version:
-
[ruby-core:33026]

Description

=begin
ruby_19_csv_parser_split_methods.patch
This patch breaks the CSV parser into multiple methods that are easier to understand, and it allows for the performance optimizations in the second patch. It removes all regular expressions from the parser, resulting in a ~25% speed improvement in the CSV test suite. It also adds a new CSV parser option, :io_read_limit, which determines the maximum size of IO reads. This option defaults to 2048, which was the fastest in my benchmarks.

ruby_19_csv_parser_speedup.patch
This patch adds two shortcuts to the patch above that significantly improve parsing of CSV files that have many quoted columns. It has to be applied on top of the first patch.

On large CSV files I observed that these patches resulted in a 20%-60% reduction in parsing time. If this patchset looks good, I would like to experiment with further improvements that take advantage of io_read_limit to always read from IO in large chunks (right now it only does so with CSV files that have no quote characters).

These patches maintain m17n support and multi-character separator support (and boy, it's tough to make those tests happy :)
=end
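
For illustration, a minimal sketch of how the proposed option would be passed (the option name and default come from the description above; the option was never merged, so this is hypothetical):

require 'csv'

# :io_read_limit caps the size of each read from the underlying IO.
# 2048 is the patch's default; larger values trade memory for fewer reads.
CSV.foreach("large_file.csv", io_read_limit: 4096) do |row|
  # handle each parsed row here
end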


Files

ruby_19_csv_parser_split_methods.patch (11.9 KB) Patch 1/2 ender672 (Timothy Elliott), 11/03/2010 09:43 AM
ruby_19_csv_parser_speedup.patch (1.82 KB) Patch 2/2 ender672 (Timothy Elliott), 11/03/2010 09:43 AM

Updated by JEG2 (James Gray) about 14 years ago

=begin
I'm for adding the read_limit option (I prefer that name to io_read_limit).

However, I'm not sure how I feel about defaulting it to some number. That's going to mean that some documents CSV used to parse correctly could now fail. While a limit is a good idea in practice, I don't feel comfortable making that choice for the programmer. I think you need to opt in to a sensible limit for your purposes and, until you do, we assume you want the best parsing we can provide.

So, my main question is, can we drop the default limit and keep the gains?

I would definitely be for applying a patch like this if we:

  • Rename the limit to just read_limit
  • Default to using no limit

I hope that makes sense.
=end


Updated by ender672 (Timothy Elliott) about 14 years ago

=begin

  • Rename the limit to just read_limit

That name sounds good. I'll submit an updated patch with the better name.

  • Default to using no limit

I didn't do a good job explaining how this patch uses the read_limit option. Trunk currently uses IO#gets(newline_character) to read from @io. This patch gets a performance boost by using IO#gets(quote_character, read_limit) instead. The performance advantage is that we don't always care about newline characters (for example when reading data in a quoted column) but we do always care about quote characters.

One side effect of reading up to quote characters instead of newline characters is that it exposes an issue with using IO#gets without a limit:

If you parse a huge file without newline characters (on the trunk parser) or without quote characters (in the parser from this patch) you can run out of memory.

A huge file without newlines would probably only happen if it was done intentionally, but it is quite common for CSV files to have no quote characters in them. If the default for the parser in this patch were no read_limit, these files would consume a lot of memory and possibly cause an OOM situation.

Sorry for providing the patch with so little explanation. That change from IO#gets(newline_char) to IO#gets(quote_char, read_limit) is significant and should have been mentioned earlier.

=end
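
A minimal sketch of the read-strategy change described above (IO#gets has accepted a separator plus a byte limit since Ruby 1.9; 2048 is the patch's default read_limit):

io = File.open("data.csv")

# Trunk: read up to the next row separator with no size bound. A huge
# file containing no newlines would be pulled into memory in one call.
line = io.gets("\r\n")

# This patch: read up to the next quote character OR read_limit bytes,
# whichever comes first, so no single read can exhaust memory.
chunk = io.gets('"', 2048)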


Updated by JEG2 (James Gray) about 14 years ago

=begin
Ah, sorry. I misunderstood. I think I get it now.

Basically, read_limit isn't a limit so much as a read-at-a-time control dial that can be used to tune parsing speed. It's set to a reasonable default and users can tune it to their specific needs. Am I understanding better now?

I was thinking it was bubbling the limit used by methods like IO#gets() up to the user. I've long wanted to do that, to account for data like:

"...gigs of data containing anything but a quote

Beyond that, the refactoring sounds neat to me. Under what circumstances does it turn out to be slower than the old parser? It seems like there would have to be some. I don't say that because the old parser was well tuned either. It's just that we've now shifted the focus of the parser (from newlines to quotes, if I'm understanding correctly). For example, if each line is 20 simple fields, all unnecessarily but legally quoted, do the increased IO operations slow things down?

I guess if there are such cases, we just need to decide if the ones we are optimizing for are common. My thoughts there are probably that the test suite isn't an ideal speed measuring tool. We might want to grab several realistic CSV documents to use for that purpose.

Any thoughts on my babbling here, Tim? Am I making sense?
=end


Updated by shyouhei (Shyouhei Urabe) about 14 years ago

  • Status changed from Open to Assigned


Updated by mame (Yusuke Endoh) about 12 years ago

  • Description updated (diff)
  • Target version set to 2.6

Updated by naruse (Yui NARUSE) almost 7 years ago

  • Target version deleted (2.6)

Updated by mame (Yusuke Endoh) over 6 years ago

  • Assignee changed from JEG2 (James Gray) to kou (Kouhei Sutou)

@kou (Kouhei Sutou), could you check this ticket?

Updated by kou (Kouhei Sutou) over 6 years ago

  • Status changed from Assigned to Feedback

We need some benchmark scripts based on the benchmark library (or a similar library) to work on the speedup.

Can someone work on creating benchmark scripts?

Updated by tomog105 (Tomohiro Ogoke) over 6 years ago

I tried to create a benchmark script for CSV.parse (using the benchmark-ips gem).

Script

# Benchmark script for CSV.parse, using the benchmark-ips gem.
# Usage: `ruby $0 [rows count (default: 1000)]`
require 'csv'
require 'benchmark/ips'

Benchmark.ips do |x|
  rows = ARGV.fetch(0, "1000").to_i

  # Plain ASCII fields: bare, quoted, and quoted fields that embed
  # the column separator or the row separator.
  alphas = ['AAAAA'] * 50
  unquoted = (alphas.join(',') + "\r\n") * rows
  quoted = (alphas.map { |s| %("#{s}") }.join(',') + "\r\n") * rows
  inc_col_sep = (alphas.map { |s| %(",#{s}") }.join(',') + "\r\n") * rows
  inc_row_sep = (alphas.map { |s| %("#{s}\r\n") }.join(',') + "\r\n") * rows

  # Multibyte fields, in UTF-8 and in Windows-31J (Shift_JIS).
  hiraganas = ['あああああ'] * 50
  enc_utf8 = (hiraganas.join(',') + "\r\n") * rows
  enc_sjis = enc_utf8.encode('Windows-31J')

  x.report("unquoted") { CSV.parse(unquoted) }
  x.report("quoted") { CSV.parse(quoted) }
  x.report("include col_sep") { CSV.parse(inc_col_sep) }
  x.report("include row_sep") { CSV.parse(inc_row_sep) }
  x.report("encode utf-8") { CSV.parse(enc_utf8) }
  x.report("encode sjis") { CSV.parse(enc_sjis) }
end
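
For reference, the results below come from invocations like the following (the script filename is arbitrary):

$ ruby csv_parse_benchmark.rb 1000
$ ruby csv_parse_benchmark.rb 10000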

Result

Rows: 1000

            unquoted     41.142  (± 0.0%) i/s -    208.000  in   5.055839s
              quoted     23.093  (± 0.0%) i/s -    116.000  in   5.024081s
     include col_sep     14.826  (± 0.0%) i/s -     75.000  in   5.059138s
     include row_sep      7.136  (± 0.0%) i/s -     36.000  in   5.045395s
        encode utf-8     34.350  (± 0.0%) i/s -    174.000  in   5.066178s
         encode sjis     34.230  (± 0.0%) i/s -    174.000  in   5.083444s

Rows: 10000

            unquoted      4.021  (± 0.0%) i/s -     21.000  in   5.230854s
              quoted      2.266  (± 0.0%) i/s -     12.000  in   5.327204s
     include col_sep      1.527  (± 0.0%) i/s -      8.000  in   5.242055s
     include row_sep      0.692  (± 0.0%) i/s -      4.000  in   5.780656s
        encode utf-8      3.215  (± 0.0%) i/s -     16.000  in   5.010681s
         encode sjis      3.400  (± 0.0%) i/s -     17.000  in   5.012165s

Updated by tomog105 (Tomohiro Ogoke) over 6 years ago

In addition, I ported the patches from this issue and got benchmark results.

Result

Rows: 1000

            unquoted     91.088  (± 2.2%) i/s -    459.000  in   5.041720s
              quoted     22.547  (± 0.0%) i/s -    114.000  in   5.056618s
     include col_sep     22.167  (± 0.0%) i/s -    112.000  in   5.053609s
     include row_sep     22.369  (± 0.0%) i/s -    112.000  in   5.007715s
        encode utf-8     43.500  (± 0.0%) i/s -    220.000  in   5.058088s
         encode sjis     34.559  (± 0.0%) i/s -    174.000  in   5.035380s

Rows: 10000

            unquoted      9.395  (± 0.0%) i/s -     47.000  in   5.014335s
              quoted      2.116  (± 0.0%) i/s -     11.000  in   5.208314s
     include col_sep      2.103  (± 0.0%) i/s -     11.000  in   5.241502s
     include row_sep      2.146  (± 0.0%) i/s -     11.000  in   5.128535s
        encode utf-8      4.217  (± 0.0%) i/s -     22.000  in   5.226532s
         encode sjis      3.370  (± 0.0%) i/s -     17.000  in   5.047115s

Compare

Rows: 1000

test              trunk    patched   ratio
unquoted          41.142   91.088    221%
quoted            23.093   22.547    97.6%
include col_sep   14.826   22.167    150%
include row_sep    7.136   22.369    313%
encode utf-8      34.350   43.500    127%
encode sjis       34.230   34.559    101%

Rows: 10000

test              trunk    patched   ratio
unquoted           4.021    9.395    234%
quoted             2.266    2.116    93.4%
include col_sep    1.527    2.103    138%
include row_sep    0.692    2.146    310%
encode utf-8       3.215    4.217    131%
encode sjis        3.400    3.370    99.1%

Updated by kou (Kouhei Sutou) over 6 years ago

Thanks. I've added the benchmark script to the repository.

Does the comparison result say "the patch will improve performance"?
If so, can you complete your work to adapt the patch to master and create a pull request?

Updated by tomog105 (Tomohiro Ogoke) over 6 years ago

> Does the comparison result say "the patch will improve performance"?
> If so, can you complete your work to adapt the patch to master and create a pull request?

I attempted to solve the problem that CSV#line does not work, but the fix clearly degrades performance compared to the original patch.
(Commit with the fix: https://github.com/tomog105/csv/commit/be3261baa0ebd8246f75408c503572e0d111a139)
For example, in the "quoted" benchmark case, it is about 20% slower than trunk.

Therefore, I cannot say "the patch improves performance" now.
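
As a guess at where the overhead comes from (a sketch only; the actual change is in the commit linked above): CSV#line must return the raw text of the most recently read row, so the quote-bounded reader has to keep a copy of every chunk it consumes, roughly:

chunk = io.gets('"', read_limit)  # the patch's fast, bounded read
@line << chunk                    # extra String append per chunk, kept only
                                  # so CSV#line can return the raw row text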

Updated by kou (Kouhei Sutou) over 6 years ago

  • Status changed from Feedback to Rejected

OK. I'll close this.
