Bug #14804
[Closed] GzipReader cannot read Freebase dump (but gzcat/zless can)
Description
This is likely related to https://stackoverflow.com/questions/35354951/gzipstream-quietly-fails-on-large-file-stream-ends-at-2gb (and its accepted answer).
The file in question: http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz
(watch out, it's 30 GB compressed!)
Steps to reproduce:
require "zlib"
Zlib::GzipReader.open("freebase-rdf-latest.gz") { |f| f.each_line.count }
# => 14374340
However, the correct answer is different:
$ gzcat freebase-rdf-latest.gz | wc -l
3130753066
Another experiment showed that the last f.tell was 1945715682, while there are considerably more bytes in the uncompressed version. This fits well with the Stack Overflow report from C# linked above, which states that the first "substream" contains exactly that many bytes.
If this is a hard constraint from the wrapped library (and thus should be fixed upstream), the documentation should at least mention it.
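For what it's worth, the multi-member behaviour can be worked around by hand, following the loop-with-unused pattern shown in the Zlib::GzipReader documentation: decompress one member at a time, then rewind the underlying IO to the bytes the reader consumed past the end of that member. A minimal sketch, using the same file name as above:

require "zlib"

count = 0
File.open("freebase-rdf-latest.gz") do |file|
  loop do
    gz = Zlib::GzipReader.new(file)
    gz.each_line { count += 1 }
    unused = gz.unused          # raw bytes read past the end of this member
    gz.finish                   # finish the reader without closing the file
    break if unused.nil?        # nothing left over: this was the last member
    file.pos -= unused.length   # rewind to the start of the next member
  end
end
puts count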
Updated by amadan (Goran Topic) over 6 years ago
(Note that f.each_line.count would return the wrong result anyway, due to https://bugs.ruby-lang.org/issues/14805 , since 3130753066 is outside the int32 range; it just never gets the chance to, because reading stops prematurely.)
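For the record, accumulating into a plain Ruby Integer sidesteps that overflow (Ruby integers have arbitrary precision), assuming the truncation happens in a C-level counter as the linked issue suggests. A sketch, still cut short by the premature stop described above:

require "zlib"

count = 0
Zlib::GzipReader.open("freebase-rdf-latest.gz") do |f|
  f.each_line { count += 1 }   # Integer accumulator: no int32 overflow
end
puts count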
Updated by jeremyevans0 (Jeremy Evans) about 5 years ago
- Related issue added: Bug #9790 (Zlib::GzipReader only decompressed the first of concatenated files)
Updated by jeremyevans0 (Jeremy Evans) over 4 years ago
- Status changed from Open to Closed
This can now be handled using Zlib::GzipReader.zcat, which was recently added to zlib.
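A minimal sketch of the fixed behaviour, counting newlines across every gzip member (chunk boundaries may split lines, but counting newline characters is unaffected by that):

require "zlib"

count = 0
File.open("freebase-rdf-latest.gz") do |file|
  # zcat yields decompressed chunks from every gzip member, not just the first
  Zlib::GzipReader.zcat(file) do |chunk|
    count += chunk.count("\n")
  end
end
puts count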