Bug #20009
openMarshal.load raises exception when load dumped class include non-ASCII
Description
Reproduction code¶
class Cクラス; end
Marshal.load(Marshal.dump(Cクラス))
Actual result¶
<internal:marshal>:34:in `load': undefined class/module C\xE3\x82\xAF\xE3\x83\xA9\xE3\x82\xB9 (ArgumentError)
from marshal.rb:2:in `<main>'
Expected result¶
Returns Cクラス
Impacted area¶
An exception is raised in Rails under the following conditions
- minitest is used with default settings
- Parallel execution with parallelize
- test class names contain non-ASCII characters
The default parallelization uses DRb, and Marshal is used inside DRb.
Other¶
After trying various things, I thought I could fix it by making rb_path_to_class
support strings containing non-ASCII characters, but I couldn't find anything more than that.
Updated by byroot (Jean Boussier) over 1 year ago
I dug into this bug, and I'm not sure if it's possible to fix it.
Classes are serialized this way:
case T_CLASS:
w_byte(TYPE_CLASS, arg);
{
VALUE path = class2path(obj);
w_bytes(RSTRING_PTR(path), RSTRING_LEN(path), arg);
RB_GC_GUARD(path);
}
break;
We write the TYPE_CLASS
prefix, and then write the bytes of the class name, without any encoding indication.
Then on load
, we just read the bytes and try to lookup the class:
case TYPE_CLASS:
{
VALUE str = r_bytes(arg);
v = path2class(str);
So on load
we're looking for "Cクラス".b.to_sym
, which doesn't match :"Cクラス"
.
To fix this we'd need to include the encoding in the format, but that would mean breaking backward and forward compatibility which is a huge deal.
Half-way solution¶
Some possible half-way solution would be:
- Assume non-ASCII class names are UTF-8
- Raise on dump for class names with non-UTF8 compatible class names.
It's far from ideal though.
Updated by make_now_just (Hiroya Fujinami) 18 days ago
· Edited
In my opinion, we need to introduce a new format for dumping classes/modules correctly.
Marshal uses c
and m
(TYPE_CLASS
and TYPE_MODULE
) as format prefixes currently, so the format is the following:
| 1 byte | ... | ... |
| 'c'/'m' | path size | path name binary string |
And, this format lacks the encoding, then the bug is happened.
To solve this issue, adding the encoding information to dump results is necessary, and introducing a new format seems a natural way to me. Therefore, a new format is the following (as a new type prefix is K
):
| 1 byte | ... |
| 'K' | a dump of the path symbol (`:`, `;`, or `I`) |
Such a format is used for dumping other kinds of objects. e.g., o
and S
(TYPE_OBJECT
and TYPE_STRUCT
).
This idea does not break the backward compatibility, but we need to increment MARSHAL_MINOR
.
Updated by Eregon (Benoit Daloze) 18 days ago
byroot (Jean Boussier) wrote in #note-1:
Half-way solution¶
Some possible half-way solution would be:
- Assume non-ASCII class names are UTF-8
- Raise on dump for class names with non-UTF8 compatible class names.
It's far from ideal though.
I think this would be a pretty good solution actually, it seems very unlikely to have class names which can't be encoded in UTF-8.
Updated by nobu (Nobuyoshi Nakada) 15 days ago
· Edited
Currently instance variables of class/module are prohibited.
It may be possible to put the encoding information there.
Updated by nobu (Nobuyoshi Nakada) 15 days ago
· Edited
I made a patch which works well except for tests added by ruby/spec@907cb35.
Since the dumped strings in these tests have not been loadable, I think they are useless actually.
Updated by Eregon (Benoit Daloze) 15 days ago
Mmh, but Marshal.dump+load of such non-7-bit modules/classes works on TruffleRuby, although it needs a tiny fix:
diff --git a/src/main/ruby/truffleruby/core/marshal.rb b/src/main/ruby/truffleruby/core/marshal.rb
index 102468e774..ea7469ea4a 100644
--- a/src/main/ruby/truffleruby/core/marshal.rb
+++ b/src/main/ruby/truffleruby/core/marshal.rb
@@ -786,19 +786,19 @@ module Marshal
end
def construct_class
- obj = const_lookup(get_byte_sequence.to_sym, Class)
+ obj = const_lookup(get_byte_sequence.force_encoding(Encoding::UTF_8).to_sym, Class)
store_unique_object obj
obj
end
def construct_module
- obj = const_lookup(get_byte_sequence.to_sym, Module)
+ obj = const_lookup(get_byte_sequence.force_encoding(Encoding::UTF_8).to_sym, Module)
store_unique_object obj
obj
end
def construct_old_module
- obj = const_lookup(get_byte_sequence.to_sym)
+ obj = const_lookup(get_byte_sequence.force_encoding(Encoding::UTF_8).to_sym)
store_unique_object obj
obj
end
class MultibyteぁあぃいClass
end
source_object = MultibyteぁあぃいClass
p Marshal.dump(source_object)
p Marshal.load(Marshal.dump(source_object))
$ ruby -v marshal_class.rb
truffleruby 25.0.0-dev-a65bde3d, like ruby 3.3.7, Interpreted JVM [x86_64-linux]
"\x04\bc\x1FMultibyte\xE3\x81\x81\xE3\x81\x82\xE3\x81\x83\xE3\x81\x84Class"
MultibyteぁあぃいClass
I think at least if no encoding information is present we should assume UTF-8, because it's by far the most common source encoding.
I think there is no value to look up the name in BINARY encoding as currently, such a constant wouldn't even print well.
(FWIW TruffleRuby stores constant names as Java Strings, which means no encoding information. I'm not convinced it's a good idea to e.g. have two constants É
in e.g. UTF-8 and ISO-8859-1 on the same module, it just seems needless confusion. Having non-7-bit BINARY-encoded constants seems no good either.)
Updated by Eregon (Benoit Daloze) 15 days ago
· Edited
So my suggestion would be:
- Interpret the serialized module/class name as UTF-8, not as BINARY. And of course if it's only 7-bit as US-ASCII (already the case).
- If the module/class name uses another encoding, we could either transcode it to UTF-8, or use nobu's trick of serializing the encoding name as a fake instance variable. Transcoding to UTF-8 seems simpler, and the only case it wouldn't work is for BINARY with non-7-bit, which seems of no value anyway. EDIT: mmh but I suppose transcoding to UTF-8 wouldn't work on CRuby because the lookup would fail.
Updated by Eregon (Benoit Daloze) 15 days ago
@nobu (Nobuyoshi Nakada) What happens when using your patch to Marshal.dump a class and then trying to load it on an older Ruby? Will it create an instance variable E
or error?
I guess if there is no matching BINARY constant it would error anyway, but if there is it would set that instance variable E
or different error.
Updated by nobu (Nobuyoshi Nakada) 14 days ago
Eregon (Benoit Daloze) wrote in #note-8:
@nobu (Nobuyoshi Nakada) What happens when using your patch to Marshal.dump a class and then trying to load it on an older Ruby? Will it create an instance variable
E
or error?
Just an error.
I guess if there is no matching BINARY constant it would error anyway, but if there is it would set that instance variable
E
or different error.
If the matching BINARY class/module exists, the instance variable causes an error.
Updated by nobu (Nobuyoshi Nakada) 14 days ago
Eregon (Benoit Daloze) wrote in #note-7:
So my suggestion would be:
- Interpret the serialized module/class name as UTF-8, not as BINARY. And of course if it's only 7-bit as US-ASCII (already the case).
- If the module/class name uses another encoding, we could either transcode it to UTF-8, or use nobu's trick of serializing the encoding name as a fake instance variable. Transcoding to UTF-8 seems simpler, and the only case it wouldn't work is for BINARY with non-7-bit, which seems of no value anyway. EDIT: mmh but I suppose transcoding to UTF-8 wouldn't work on CRuby because the lookup would fail.
I'm not against to interpret the default encoding as UTF-8, but don't think transcoding is intuitive as different encoding symbols are different even if they are same when transcoded.