Feature #4017
[PATCH] CSV parsing speedup (closed)
Description
=begin
ruby_19_csv_parser_split_methods.patch
This patch breaks the CSV parser into multiple methods that are easier to understand, and it allows for the performance optimizations in the second patch. It removes all regular expressions from the parser, resulting in a ~25% speed improvement in the CSV test suite. It adds a new CSV parser option, :io_read_limit, which determines the maximum size of IO reads. This option defaults to 2048, which was the fastest value in my benchmarks.
ruby_19_csv_parser_split_methods.patch
This patch adds two shortcuts to the patch above that significantly improve parsing of CSV files that have many quoted columns. It has to be applied on top of the first patch.
On large CSV files I observed that these patches resulted in a 20%-60% reduction in parse time. If this patchset looks good, I would like to experiment with further improvements that take advantage of io_read_limit to always read from IO in large chunks (right now it only does so with CSV files that have no quote characters).
These patches maintain m17n support and multi-character separator support (and boy, it's tough to make those tests happy :)
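For reference, a minimal sketch of how the new option would be passed once the first patch is applied (the option name and its 2048 default are as described above; it is not part of stock CSV, and "large.csv" is just a placeholder file):

# Hypothetical usage of the :io_read_limit option added by this patch.
require 'csv'

CSV.foreach("large.csv", io_read_limit: 4096) do |row|
  p row
end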
=end
Updated by JEG2 (James Gray) about 14 years ago
=begin
I'm for adding the read_limit option (I prefer that name to io_read_limit).
However, I'm not sure how I feel about defaulting it to some number. That's going to mean that some documents CSV used to correctly parse could now fail. While a limit is a good idea in practice, I don't feel comfortable making that choice for the programmer. I think you need to opt-in to a sensible limit for your purposes and, until you do, we assume you want the best parsing we can provide.
So, my main question is, can we drop the default limit and keep the gains?
I would definitely be for applying a patch like this if we:
- Rename the limit to just read_limit
- Default to using no limit
I hope that makes sense.
=end
Updated by ender672 (Timothy Elliott) about 14 years ago
=begin
- Rename the limit to just read_limit
That name sounds good. I'll submit an updated patch with the better name.
- Default to using no limit
I didn't do a good job explaining how this patch uses the read_limit option. Trunk currently uses IO#gets(newline_character) to read from @io. This patch gets a performance boost by using IO#gets(quote_character, read_limit) instead. The performance advantage is that we don't always care about newline characters (for example when reading data in a quoted column) but we do always care about quote characters.
One side effect of reading up to quote characters instead of newline characters is that it exposes an issue with using IO#gets without a limit:
If you parse a huge file without newline characters (on the trunk parser) or without quote characters (in the parser from this patch) you can run out of memory.
A huge file without newlines would probably only happen intentionally, but it is quite common for CSV files to contain no quote characters at all. If the parser in this patch defaulted to no read_limit, such files would consume a lot of memory and possibly cause an OOM situation.
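A standalone sketch of the two read strategies described above (illustrative only, using StringIO so it runs on its own; it is not code from the patch):

require 'stringio'

io = StringIO.new(%Q{a,"b, with comma",c\n})

# Trunk-style read: everything up to the next newline, with no size cap.
io.gets("\n")        # => "a,\"b, with comma\",c\n"

# Patch-style read: everything up to the next quote character, capped at
# read_limit bytes, so a quote-free run of data is never pulled into
# memory in a single read.
io.rewind
io.gets('"', 2048)   # => "a,\""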
Sorry for providing the patch with so little explanation. That change from IO#gets(newline_char) to IO#gets(quote_char, read_limit) is significant and should have been mentioned earlier.
=end
Updated by JEG2 (James Gray) about 14 years ago
=begin
Ah, sorry. I misunderstood. I think I get it now.
Basically, read_limit isn't a limit so much as a read-at-a-time control dial that can be used to tune parsing speed. It's set to a reasonable default and users can tune it to their specific needs. Am I understanding better now?
I was thinking it was bubbling the limit used by methods like IO#gets() up to the user. I've long wanted to do that, to account for data like:
"...gigs of data containing anything but a quote
Beyond that, the refactoring sounds neat to me. Under what circumstances does it turn out to be slower than the old parser? It seems like there would have to be some. I don't say that because the old parser was well tuned either. It's just that we've now shifted the focus of the parser (from newlines to quotes, if I'm understanding correctly). For example, if each line is 20 simple fields, all unnecessarily but legally quoted, do the increased IO operations slow things down?
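For concreteness, a hypothetical line of the kind described, with every field legally but unnecessarily quoted (illustrative data only, not taken from the patch or its tests):

# 20 simple fields, all quoted, so a read-to-the-next-quote loop
# returns roughly twice per field instead of once per line.
row = (1..20).map { |i| %("field#{i}") }.join(',') + "\r\n"
# row begins: "field1","field2","field3",...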
I guess if there are such cases, we just need to decide whether the ones we are optimizing for are common. My thought there is that the test suite probably isn't an ideal speed-measuring tool. We might want to grab several realistic CSV documents to use for that purpose.
Any thoughts on my babbling here, Tim? Am I making sense?
=end
Updated by shyouhei (Shyouhei Urabe) about 14 years ago
- Status changed from Open to Assigned
Updated by mame (Yusuke Endoh) about 12 years ago
- Description updated (diff)
- Target version set to 2.6
Updated by mame (Yusuke Endoh) almost 7 years ago
- Assignee changed from JEG2 (James Gray) to kou (Kouhei Sutou)
@kou (Kouhei Sutou), could you check this ticket?
Updated by kou (Kouhei Sutou) almost 7 years ago
- Status changed from Assigned to Feedback
We need some benchmark scripts based on the benchmark library (or a similar library) to work on this speedup.
Can someone work on creating benchmark scripts?
Updated by tomog105 (Tomohiro Ogoke) almost 7 years ago
I tried to create a benchmark script for CSV.parse (using the benchmark-ips gem).
Script
# benchmark script for CSV.parse
# Usage: `ruby $0 [rows count(default: 1000)]`
require 'csv'
require 'benchmark/ips'

Benchmark.ips do |x|
  rows = ARGV.fetch(0, "1000").to_i

  alphas = ['AAAAA'] * 50
  unquoted = (alphas.join(',') + "\r\n") * rows
  quoted = (alphas.map { |s| %("#{s}") }.join(',') + "\r\n") * rows
  inc_col_sep = (alphas.map { |s| %(",#{s}") }.join(',') + "\r\n") * rows
  inc_row_sep = (alphas.map { |s| %("#{s}\r\n") }.join(',') + "\r\n") * rows

  hiraganas = ['あああああ'] * 50
  enc_utf8 = (hiraganas.join(',') + "\r\n") * rows
  enc_sjis = enc_utf8.encode('Windows-31J')

  x.report("unquoted") { CSV.parse(unquoted) }
  x.report("quoted") { CSV.parse(quoted) }
  x.report("include col_sep") { CSV.parse(inc_col_sep) }
  x.report("include row_sep") { CSV.parse(inc_row_sep) }
  x.report("encode utf-8") { CSV.parse(enc_utf8) }
  x.report("encode sjis") { CSV.parse(enc_sjis) }
end
Result
- Ruby version: ruby 2.5.0p0 (2017-12-25 revision 61468) [x86_64-darwin16]
- Processor: Intel Core i7 6700K @ 4 GHz
- Memory: 16 GB
- Target revision: https://github.com/ruby/csv/commit/583cee457773c0ac9e7234b885e1a82df35a47cb
Rows: 1000
unquoted 41.142 (± 0.0%) i/s - 208.000 in 5.055839s
quoted 23.093 (± 0.0%) i/s - 116.000 in 5.024081s
include col_sep 14.826 (± 0.0%) i/s - 75.000 in 5.059138s
include row_sep 7.136 (± 0.0%) i/s - 36.000 in 5.045395s
encode utf-8 34.350 (± 0.0%) i/s - 174.000 in 5.066178s
encode sjis 34.230 (± 0.0%) i/s - 174.000 in 5.083444s
Rows: 10000
unquoted 4.021 (± 0.0%) i/s - 21.000 in 5.230854s
quoted 2.266 (± 0.0%) i/s - 12.000 in 5.327204s
include col_sep 1.527 (± 0.0%) i/s - 8.000 in 5.242055s
include row_sep 0.692 (± 0.0%) i/s - 4.000 in 5.780656s
encode utf-8 3.215 (± 0.0%) i/s - 16.000 in 5.010681s
encode sjis 3.400 (± 0.0%) i/s - 17.000 in 5.012165s
Updated by tomog105 (Tomohiro Ogoke) almost 7 years ago
In addition, I ported the patches from this issue and ran the same benchmark.
Result
- Ruby version: ruby 2.5.0p0 (2017-12-25 revision 61468) [x86_64-darwin16]
- Processor: Intel Core i7 6700K @ 4 GHz
- Memory: 16 GB
- Target revision: https://github.com/tomog105/csv/commit/c56ed71097fc7e61ce2a896f013095a5ef36b548
- Since many changes have been added to the code base, I adapted the features from the original patches rather than applying them directly.
- However, CSV#line does not work properly in this revision.
Rows 1000
unquoted 91.088 (± 2.2%) i/s - 459.000 in 5.041720s
quoted 22.547 (± 0.0%) i/s - 114.000 in 5.056618s
include col_sep 22.167 (± 0.0%) i/s - 112.000 in 5.053609s
include row_sep 22.369 (± 0.0%) i/s - 112.000 in 5.007715s
encode utf-8 43.500 (± 0.0%) i/s - 220.000 in 5.058088s
encode sjis 34.559 (± 0.0%) i/s - 174.000 in 5.035380s
Rows 10000
unquoted 9.395 (± 0.0%) i/s - 47.000 in 5.014335s
quoted 2.116 (± 0.0%) i/s - 11.000 in 5.208314s
include col_sep 2.103 (± 0.0%) i/s - 11.000 in 5.241502s
include row_sep 2.146 (± 0.0%) i/s - 11.000 in 5.128535s
encode utf-8 4.217 (± 0.0%) i/s - 22.000 in 5.226532s
encode sjis 3.370 (± 0.0%) i/s - 17.000 in 5.047115s
Compare
Rows 1000

| test            | trunk  | patched | ratio |
| ---             | ---    | ---     | ---   |
| unquoted        | 41.142 | 91.088  | 221%  |
| quoted          | 23.093 | 22.547  | 97.6% |
| include col_sep | 14.826 | 22.167  | 150%  |
| include row_sep | 7.136  | 22.369  | 313%  |
| encode utf-8    | 34.350 | 43.500  | 127%  |
| encode sjis     | 34.230 | 34.559  | 101%  |
Rows 10000

| test            | trunk | patched | ratio |
| ---             | ---   | ---     | ---   |
| unquoted        | 4.021 | 9.395   | 234%  |
| quoted          | 2.266 | 2.116   | 93.4% |
| include col_sep | 1.527 | 2.103   | 138%  |
| include row_sep | 0.692 | 2.146   | 310%  |
| encode utf-8    | 3.215 | 4.217   | 131%  |
| encode sjis     | 3.400 | 3.370   | 99.1% |
Updated by kou (Kouhei Sutou) almost 7 years ago
Thanks. I've added the benchmark script to the repository.
Does the comparison result say "the patch will improve performance"?
If so, can you complete your work to adapt the patch to the master and create a pull request?
Updated by tomog105 (Tomohiro Ogoke) almost 7 years ago
> Does the comparison result say "the patch will improve performance"?
> If so, can you complete your work to adapt the patch to the master and create a pull request?
I attempted to solve the problem that CSV#line does not work, but the fix seems to noticeably degrade performance compared to the original patch.
(Commit after the fix: https://github.com/tomog105/csv/commit/be3261baa0ebd8246f75408c503572e0d111a139)
For example, in the "quoted" benchmark case it seems to be about 20% slower than trunk.
Therefore, I cannot say that the patch improves performance at this point.
Updated by kou (Kouhei Sutou) over 6 years ago
- Status changed from Feedback to Rejected
OK. I close this.