Feature #16352
open
Modify Marshal to dump objects larger than 2 GiB
Added by seoanezonjic (Pedro Seoane) about 5 years ago.
Updated over 4 years ago.
Description
Using a gem called numo-narray to handle matrix operations, I found the following error while saving a large matrix:
in `dump': long too big to dump (TypeError)
Github thread is https://github.com/ruby-numo/numo-narray/issues/144. Digging with the authors, I found the following code that reproduces the error:
ruby -e 'Marshal.dump(" "*2**31)'
Executed in:
ruby 2.7.0dev (2019-11-12T12:03:22Z master 3816622fbe) [x86_64-linux]
The marshal library has a limit based on the constant SIZEOF_LONG. This check is performed in marshal.c. I don't understand the motivation for this limit. It has a great impact on libraries that need to serialize large objects such as numeric matrices. In this case, the 2 GiB limit is reached easily, and it blocks Ruby development. I found another related bug report, #1560, but the Marshal problem was not addressed in it.
- Description updated (diff)
This behaviour has been there since the beginning. No ruby version since 0.49 has successfully dumped such a long string. The same thing happens for a very big bignum, a very long array, a class that has a very long classpath (Q::W::E::R::...), an object with 2**31 instance variables (which isn't impossible these days), and much more.
The limitation is due to marshal's binary format. I guess the reason behind this is that at the time the format was designed (back in the 1990s), there simply was no such thing as a 64-bit integer type. To fix this properly we would have to reconsider every use of long in the marshal format. I guess that is essentially a format change. That would hurt data portability, so it is not that easy.
Any nice idea to fix the situation?
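For illustration, the size field behind the error can be sketched in Ruby. This is a minimal sketch, assuming the variable-length integer encoding of the current marshal format (a length byte followed by at most four little-endian payload bytes, hence a signed 32-bit range). `encode_marshal_long` is a hypothetical helper name and not part of Ruby's API.

```ruby
# Hypothetical helper: mirrors how the marshal format encodes a "long".
# Sizes of strings, arrays, etc. go through the same encoding, which is
# why anything at or above 2**31 cannot be written.
def encode_marshal_long(x)
  unless x.between?(-2**31, 2**31 - 1)
    raise TypeError, "long too big to dump"
  end

  # Zero and small values are packed into a single byte.
  return [0].pack("C") if x.zero?
  return [x + 5].pack("C") if x > 0 && x < 123
  return [(x - 5) & 0xff].pack("C") if x < 0 && x > -124

  # Otherwise: a signed byte count followed by 1 to 4 little-endian bytes.
  bytes = []
  loop do
    bytes << (x & 0xff)
    x >>= 8 # arithmetic shift: negative values converge to -1
    return ([bytes.size] + bytes).pack("C*") if x == 0
    return ([-bytes.size & 0xff] + bytes).pack("C*") if x == -1
  end
end

# Small integers are dumped as "i" followed by this encoding.
header = "\x04\bi".b
puts Marshal.dump(300).b == header + encode_marshal_long(300)
```

Because the byte count never exceeds four, widening the size field means changing what readers expect to find at that position in the stream.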
I don't understand the motivation of this limit and has a great impact in libraries that need to serialize large objects as numeric matrix.
In this case, the limit of >= 2 GiB it's reached easily and it blocks the ruby development in scientifical projects as cited.
Shyouhei already pointed out the historic reason. I believe you can quite easily convince the ruby core team that a change may
be necessary in the long run (most likely past ruby 3.0) based on use cases. Matz likes to hear real world use cases, so the
more information may be given the better. :)
As for possibility of change, I guess the Marshal format could be kept by default, but another variant could perhaps be added
where people could switch to another format - a bit like syck and psych could be used interchangeably for yaml to some extent
(I used syck for quite some time even after psych was added, before I transitioned into Unicode finally; I used to specify
the yaml engine via e. g. YAML.engine = or something like that).
- Tracker changed from Bug to Feature
- Subject changed from Marshal limit of >= 2 GiB to Modify Marshal to dump objects larger than 2 GiB
- ruby -v deleted (ruby 2.7.0dev (2019-11-12T12:03:22Z master 3816622fbe) [x86_64-linux])
- Backport deleted (2.5: UNKNOWN, 2.6: UNKNOWN)
It's currently expected that Marshal cannot dump objects larger than 2 GiB, so this isn't a bug, though arguably RangeError would be more appropriate than TypeError if the data is too large. Supporting the dumping of larger objects does seem like a useful feature, but as @shyouhei (Shyouhei Urabe) mentioned, it requires a format change, which would break backwards compatibility as a marshal dump from Ruby 3 would not be restorable on Ruby 2.7. It does seem like Ruby 3 would be a good time to implement such a format change if we want to support the marshaling of larger objects. We would probably want to keep supporting the old format so that a marshal dump from Ruby 2.7 will work in Ruby 3, and maybe consider working on a gem that you could install in older Ruby versions to support the new marshal format.
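To make the compatibility concern concrete, here is a small sketch of how a reader reacts to a dump written with a newer format version. Every dump starts with a major and a minor version byte. The "future" dump below is simulated by bumping the major byte by hand, which is purely a hypothetical scenario and not a proposal for how a new format would be numbered.

```ruby
# Every dump starts with two version bytes: major, then minor.
data = Marshal.dump("hello")
major, minor = data.bytes.first(2)
puts "format version #{major}.#{minor}"

# Simulate a dump produced by a hypothetical future format
# by bumping the major version byte.
future = data.dup
future.setbyte(0, major + 1)

result =
  begin
    Marshal.load(future)
  rescue TypeError => e
    e.class
  end

# The current reader refuses the newer format instead of guessing.
puts result
```

An older Ruby would reject a new-format dump in the same way, which is why a gem providing the new format for older versions was suggested.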
jeremyevans0 (Jeremy Evans) wrote in #note-3:
Supporting the dumping of larger objects does seem like a useful feature, but as @shyouhei (Shyouhei Urabe) mentioned, it requires a format change, which would break backwards compatibility as a marshal dump from Ruby 3 would not be restorable on Ruby 2.7. It does seem like Ruby 3 would be a good time to implement such a format change if we want to support the marshaling of larger objects. We would probably want to keep supporting the old format so that a marshal dump from Ruby 2.7 will work in Ruby 3, and maybe consider working on a gem that you could install in older Ruby versions to support the new marshal format.
It sure would be useful to be able to dump huge stuff. But how frequent is this? My guess is that it would be better to by default use the current format, and switch to the new format with an explicit option. That might give the better interoperability story.
- Description updated (diff)
Couldn't we dedicate a special "size" value to indicate "extended marshal size" (say SIZEOF_LONG - 1), such that compatibility with all current and future marshal dumps is maintained, with the exception of a marshal object that would actually happen to have exactly a size of SIZEOF_LONG - 1?
def marshal_dump
  if size < SIZEOF_LONG - 1
    # business as usual, proceed with old dump
  else
    # write the marker, then the real size as a 64-bit integer
    io << SIZEOF_LONG - 1 << size_as_int_64 # presumably the rest of the output is similar...
  end
end

def marshal_load
  io >> size
  if size == SIZEOF_LONG - 1
    # Assume new format: the real size follows as a 64-bit integer
    size = io.read_int_64
    # ...
  else
    # ... as before
  end
end
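The pseudocode above can be turned into a small runnable sketch. It is only an illustration under stated assumptions: the marker value `2**31 - 1`, the fixed 4-byte legacy size field (the real format uses a variable-length integer), and the names `EXTENDED_SIZE_MARKER`, `write_size` and `read_size` are all hypothetical and not part of Marshal.

```ruby
require "stringio"

# Hypothetical marker: the largest value a signed 32-bit size field can hold.
EXTENDED_SIZE_MARKER = 2**31 - 1

# Writes a size field. Sizes below the marker use the legacy 4-byte layout;
# anything else is written as the marker followed by a 64-bit size.
def write_size(io, size)
  if size < EXTENDED_SIZE_MARKER
    io.write([size].pack("l<"))
  else
    io.write([EXTENDED_SIZE_MARKER].pack("l<"))
    io.write([size].pack("q<"))
  end
end

# Reads a size field written by write_size.
def read_size(io)
  size = io.read(4).unpack1("l<")
  return size unless size == EXTENDED_SIZE_MARKER

  # Marker seen: assume the new format and read the real 64-bit size.
  io.read(8).unpack1("q<")
end

io = StringIO.new(String.new) # binary buffer
write_size(io, 5 * 2**30)     # 5 GiB, too large for the legacy field
io.rewind
puts read_size(io)
```

Data written by an old writer whose size field happens to equal the marker would be misread by `read_size`, which is exactly the collision case raised in the discussion.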
marcandre (Marc-Andre Lafortune) wrote in #note-6:
Couldn't we dedicate a special "size" value to indicate "extended marshal size" (say SIZEOF_LONG - 1), such that compatibility with all current and future marshal dumps is maintained, with the exception of a marshal object that would actually happen to have exactly a size of SIZEOF_LONG - 1?
This sounds like a good idea because the chance of old files with a value of exactly SIZEOF_LONG - 1 is low. The problem is that, as far as I understand, it significantly increases the difficulty of diagnosing/debugging problems in the cases where SIZEOF_LONG - 1 was actually used.
If the risk of collision with SIZEOF_LONG - 1 is deemed too high, then add 64 bits of fixed data afterwards (pick a random value). If the 64 bits after the size match, then it is the extended format. If they don't, then the size really does happen to be SIZEOF_LONG - 1... I haven't checked the format of Marshal closely enough, but I would not be surprised if there were some bit sequences following the size that would actually be invalid. If so, there would be no risk at all.
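The disambiguation step can be sketched as follows. Again, this is only a sketch under assumptions: the marker value, the 8 magic bytes, the fixed-width size fields, and the names `SIZE_MARKER`, `EXTENDED_MAGIC`, `write_size_with_magic` and `read_size_with_magic` are hypothetical choices made for illustration.

```ruby
require "stringio"

SIZE_MARKER = 2**31 - 1 # hypothetical marker value
# Hypothetical 64 bits of fixed data; any arbitrary constant would do.
EXTENDED_MAGIC = [0xC3, 0x5A, 0x91, 0x0E, 0x7B, 0xD2, 0x48, 0xA6].pack("C*")

def write_size_with_magic(io, size)
  if size < SIZE_MARKER
    io.write([size].pack("l<"))
  else
    # marker, then the fixed 64 bits, then the real 64-bit size
    io.write([SIZE_MARKER].pack("l<") + EXTENDED_MAGIC + [size].pack("q<"))
  end
end

def read_size_with_magic(io)
  size = io.read(4).unpack1("l<")
  return size unless size == SIZE_MARKER

  pos = io.pos
  if io.read(8) == EXTENDED_MAGIC
    io.read(8).unpack1("q<") # extended format
  else
    io.pos = pos # legacy data whose size really was the marker value
    size
  end
end

io = StringIO.new(String.new) # binary buffer
write_size_with_magic(io, 2**40)
io.rewind
puts read_size_with_magic(io)
```

A legacy payload would now have to both have a size equal to the marker and begin with the same 8 bytes to be misread.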