Bug #19402

closed

CSV skip_lines option not behaving as documented

Added by jamie_ca (Jamie Macey) over 2 years ago. Updated over 2 years ago.

Status:
Third Party's Issue
Target version:
-
ruby -v:
ruby 3.2.0 (2022-12-25 revision a528908271) [x86_64-darwin21]
[ruby-core:112185]

Description

The CSV documentation for the skip_lines parser option says "If a String, converts it to a Regexp, ignores lines that match it." Both the observed behaviour and the source appear to normalize the string's encoding and run a simple substring check instead. Given the existing behaviour, perhaps this only needs a documentation update that describes it accurately?

I stumbled across this on a project still on ruby 2.6.9 (2.6 docs), but it still applies to 3.2.0.

Reproduction script:

require 'csv'

data = <<CSV
data,data
test,data
data,test
CSV

puts "Parsing with regexp skip_lines /^test/, expect 2 rows"
CSV.parse(data, skip_lines: /^test/).each { |row| pp row }
puts

puts "Parsing with text skip_lines \"^test\", expect 2 rows"
CSV.parse(data, skip_lines: "^test").each { |row| pp row }
puts

puts "Parsing with unanchored text skip_lines \"test\", expect 1 row"
CSV.parse(data, skip_lines: "test").each { |row| pp row }
puts

Output:

$ ruby csv_test.rb
Parsing with regexp skip_lines /^test/, expect 2 rows
["data", "data"]
["data", "test"]

Parsing with text skip_lines "^test", expect 2 rows
["data", "data"]
["test", "data"]
["data", "test"]

Parsing with unanchored text skip_lines "test", expect 1 row
["data", "data"]

Updated by sawa (Tsuyoshi Sawada) over 2 years ago

I agree with you that the description in the documentation is bad, but for a different reason from the one you give. The problem is that it is ambiguous. It says that the string is converted to a Regexp, but it does not specify how. That leaves room for the reader to interpret it in more than one way.

Perhaps you assumed that a string str passed as the skip_lines: argument is converted to a Regexp by:

Regexp.new(str)

However, I believe the intended interpretation was to convert the string argument to a Regexp by:

Regexp.new(Regexp.escape(str))

in which case matching against the resulting Regexp is equivalent to a substring check against the original string, and the results you got are just as described in the documentation.
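
To make the difference between the two interpretations concrete, here is a minimal sketch in plain Ruby. It does not call into the csv library; the sample lines (including the hypothetical "a^test,b") are made up for illustration only.

```ruby
# Illustrative sample lines; "a^test,b" is a made-up line containing a literal "^".
lines = ["test,data", "data,test", "a^test,b"]
str   = "^test"

# Interpretation 1: the string is taken as Regexp source, so "^" is an anchor.
as_pattern = Regexp.new(str)

# Interpretation 2: the string is escaped first, so "^" is a literal character.
as_literal = Regexp.new(Regexp.escape(str))

pattern_hits   = lines.select { |line| as_pattern.match?(line) }
literal_hits   = lines.select { |line| as_literal.match?(line) }
substring_hits = lines.select { |line| line.include?(str) }

pp pattern_hits    # lines that start with "test"
pp literal_hits    # lines that contain the text "^test"
pp substring_hits  # same lines as literal_hits
```

Under the second interpretation the escaped Regexp and String#include? select the same lines, which is why the "^test" case in the report skipped nothing.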

I agree with you that the documentation can be improved. The relevant part should be changed to:

If a String or a Regexp, ignores lines that match it.

Updated by kou (Kouhei Sutou) over 2 years ago

  • Status changed from Open to Third Party's Issue
  • Assignee set to kou (Kouhei Sutou)

It's intentional. A String skip_lines: value is matched as-is. (Special characters such as ^ are not interpreted as Regexp syntax.)

If you would like to discuss this further, including how to improve the current documentation, could you report it upstream? https://github.com/ruby/csv
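
For anyone who needs anchored matching, a possible workaround (a sketch based on the behaviour shown in the report above, not an official recommendation) is to pass a Regexp, or to build one explicitly from the string. The variable names here are illustrative.

```ruby
require 'csv'

data = <<~CSV
  data,data
  test,data
  data,test
CSV

# A Regexp is applied as a pattern, so "^" anchors to the start of the line.
anchored = CSV.parse(data, skip_lines: /^test/)

# A pattern held in a String can be converted explicitly before passing it in.
pattern_source = "^test"
converted = CSV.parse(data, skip_lines: Regexp.new(pattern_source))

# A plain String is matched as-is, i.e. as a substring of the line.
literal = CSV.parse(data, skip_lines: "test")

pp anchored
pp converted
pp literal
```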
