Bug #14804

GzipReader cannot read Freebase dump (but gzcat/zless can)

Added by amadan (Goran Topic) over 2 years ago. Updated 2 months ago.

Target version:
ruby -v:
Ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-darwin17]


This is likely related to a Stack Overflow report from C# (and its accepted answer).

The file in question:
(watch out, it's 30 GB compressed!)

Steps to reproduce:

require "zlib"
Zlib::GzipReader.open("freebase-rdf-latest.gz") { |f| f.each_line.count }
# => 14374340

However, the correct answer is different:

$ gzcat freebase-rdf-latest.gz | wc -l
3130753066

Another experiment showed that the last f.tell was 1945715682, while there are considerably more bytes in the uncompressed version. This fits well with the Stack Overflow report from C# linked above, which states that the first "substream" contains exactly that many bytes.
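The truncation can be reproduced without the 30 GB download. The following is a minimal sketch, assuming the dump is simply several gzip members written back to back; the tiny `payload` string is a hypothetical stand-in for the real file, not data from it.

```ruby
require "zlib"
require "stringio"

# Hypothetical stand-in for the dump: two gzip members back to back,
# the same layout that `cat a.gz b.gz > both.gz` would produce.
payload = Zlib.gzip("a\nb\n") + Zlib.gzip("c\nd\ne\n")

# A plain GzipReader stops at the end of the first member, so only the
# first member's lines are counted.
line_count = Zlib::GzipReader.new(StringIO.new(payload)).each_line.count

puts line_count
```

The payload holds five lines in total, but the reader only reports those from the first member, which mirrors the short count seen with the Freebase dump.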

If this is a hard constraint from the wrapped library (and thus should be fixed upstream), at least the documentation should mention it.

Related issues

Related to Ruby master - Bug #9790: Zlib::GzipReader only decompressed the first of concatenated files (Closed, drbrain (Eric Hodel))

Updated by amadan (Goran Topic) over 2 years ago

(Note that f.each_line.count would return the wrong result anyway, since 3130753066 is outside the int32 range, but it doesn't get the chance to do so because it stops prematurely.)


Updated by jeremyevans0 (Jeremy Evans) 11 months ago

  • Related to Bug #9790: Zlib::GzipReader only decompressed the first of concatenated files added

Updated by jeremyevans0 (Jeremy Evans) 2 months ago

  • Status changed from Open to Closed

This can now be handled using Zlib::GzipReader.zcat, which was recently added to zlib.
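As a rough illustration of that approach, here is a sketch that counts lines across every gzip member, assuming a zlib version that provides Zlib::GzipReader.zcat. The in-memory `payload` is a hypothetical stand-in for the dump; for the real file, an IO opened in binary mode (for example with `File.open(path, "rb")`) would be passed instead of the StringIO.

```ruby
require "zlib"
require "stringio"

# Hypothetical stand-in for the dump: two concatenated gzip members.
payload = Zlib.gzip("a\nb\n") + Zlib.gzip("c\nd\ne\n")

# With a block, zcat yields strings of uncompressed data from every member,
# so nothing has to be held in memory at once. Counting newline characters
# works no matter how the data is split between yields.
line_count = 0
Zlib::GzipReader.zcat(StringIO.new(payload)) do |chunk|
  line_count += chunk.count("\n")
end

puts line_count
```

Accumulating the count in a plain Integer inside the block also sidesteps any limit in each_line.count, since Ruby integers grow as needed.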
