Bug #20289
closedBug in Zlib::GzipReader#eof? breaks reading certain sizes of gzipped files.
Description
Hello,
A bug in the implementation of Zlib::GzipReader#eof? makes it very difficult to read certain rubygems using readpartial without receiving an EOFError.
The bug is caused when the chunk size read from the gzip leaves only empty gzip "overhead" bytes at the end of the unread portion.
The result is that a simple readpartial until eof?
fails on some real world gems.
Bug Explanation¶
Imagine a gzip file, it has compressed data and some bytes that indicate the start and end of a chunk and then end of a file.
gzip file = "[... 2048 chunk of gzip data ...][ \0 bytes, chucksum info, with no new readable bytes ]"
When reading this file, zlib.c will pull a chunk of data according to READ_SIZE and return that data uncompressed.
With exactly the right size of data (within exactly the range of READ-SIZEn - 24 to READ_SIZEn - 15 for any multiple of READ_SIZE) all of the file data is read and returned on the first read.
The bug happens when there are bytes remaining on the gzip file that haven't been read, but contain no new data that can be returned to the buffer.
Before reading the final empty chunk, #eof? returns false. After reading the final chunk, EOFError is raised because no bytes were returned into the buffer.
What should happen is similar to how a Socket eof? is checked. On a socket, #eof? is not "passive" but actively reads ahead, filling the buffer. If that read-ahead fails to find new data, then the EOF is reached and eof? returns true.
In zlib, #eof? does not read ahead to check if the end has been reached, causing this bug.
Solution¶
Samuel Giddins and I have both submitted alternate, but very similar solutions for this bug. Samuel's PR and My PR. We'd leave it to ruby core to decide which is the best solution.
Unfortunately, these bug fix PRs have been waiting for more than 2 months now with no movement on fixing this bug.
The original issue that I opened to point out this bug was submitted Aug 16, 2023. We've been working to find a solution for this bug for more than half a year.
This bug was discovered because rubygems wanted to improve the efficiency of reading a gem by using readpartial to conserve memory. Sam submitted an additional PR that vastly improves the performance of the zlib gem by managing buffers better.
We've already proven that this dramatically decreases memory usage when parsing gems. The performance PR by Samuel has been open since September 13th, 2023.
Plea¶
We've been trying for months to get this solved. Opening a ruby-lang bug is my next attempt after communicating through our contacts at Ruby core has gone nowhere.
Please merge and release these fixes, or tell us what is wrong with them so we can fix them so they can be merged. It will improve rubygems for everyone.