Bug #14804

closed

GzipReader cannot read Freebase dump (but gzcat/zless can)

Added by amadan (Goran Topic) over 6 years ago. Updated over 4 years ago.

Status: Closed
Assignee: -
Target version: -
ruby -v: Ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-darwin17]
[ruby-core:87339]

Description

This is likely related to https://stackoverflow.com/questions/35354951/gzipstream-quietly-fails-on-large-file-stream-ends-at-2gb (and its accepted answer).

The file in question: http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz
(watch out, it's 30 GB compressed!)

Steps to reproduce:

require "zlib"
Zlib::GzipReader.open("freebase-rdf-latest.gz") { |f| f.each_line.count }
# => 14374340

However, the correct answer is different:

$ gzcat freebase-rdf-latest.gz | wc -l
3130753066

Another experiment showed that the last f.tell was 1945715682, while there are considerably more bytes in the uncompressed version. This fits well with the Stack Overflow report from C# linked above, which states that the first "substream" contains exactly that many bytes.

If this is a hard constraint from the wrapped library (and thus should be fixed upstream), at least the documentation should mention it.
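For anyone hitting the same thing, here is a minimal sketch of a workaround. It assumes the dump is a concatenation of several gzip members, as the linked Stack Overflow answer and the related Bug #9790 suggest. The helper name count_lines_all_members is made up for illustration, and the small in-memory stream only stands in for the real file. The idea is to reopen a new GzipReader on the same IO after each member ends, rewinding by whatever the previous reader had buffered past the end of its member (GzipReader#unused).

```ruby
require "zlib"
require "stringio"

# Two gzip members concatenated back to back, standing in for the
# (much larger) multi-member dump. This layout is an assumption.
data = Zlib.gzip("a\nb\n") + Zlib.gzip("c\nd\ne\n")

# Naive approach from the report: stops at the end of the first member.
naive_count = Zlib::GzipReader.new(StringIO.new(data)).each_line.count

# Hypothetical helper: count lines across all members of a seekable IO.
def count_lines_all_members(io)
  total = 0
  loop do
    gz = Zlib::GzipReader.new(io)
    total += gz.each_line.count
    unused = gz.unused   # bytes read past the end of this member, or nil
    gz.finish            # release the reader without closing io
    # Rewind so the next reader starts at the next member's header.
    io.pos -= unused.bytesize if unused
    break if io.eof?
  end
  total
end

full_count = count_lines_all_members(StringIO.new(data))

puts naive_count  # 2
puts full_count   # 5
```

With a real file, the same helper could be called as File.open("freebase-rdf-latest.gz", "rb") { |f| count_lines_all_members(f) }. Note that a line split across a member boundary would be counted twice by this sketch, so it is only exact when each member ends on a newline.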


Related issues 1 (0 open, 1 closed)

Related to Ruby master - Bug #9790: Zlib::GzipReader only decompressed the first of concatenated files - Closed - drbrain (Eric Hodel)