Feature #21963: A solution to completely avoid allocated-but-uninitialized objects - Ruby - Ruby Issue Tracking System

Added by Eregon (Benoit Daloze) 3 months ago. Updated about 2 months ago.

Status: Open

Assignee:

Target version:

[ruby-core:125117]

Description

A common issue when defining a class is to handle allocated-but-uninitialized objects.
For example:

obj = MyClass.allocate
obj.some_method

This can easily segfault for classes defined in C and raise an unclear exception for classes defined in Ruby.
As a workaround many core (and non-core) classes add a check that they are initialized in every instance method.
This is suboptimal for both performance and correctness: classes should not need to care about allocated-but-uninitialized objects.
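To make the problem concrete for a class defined in Ruby, here is a minimal sketch. Circle and GuardedCircle are hypothetical classes invented for this illustration; GuardedCircle shows the kind of per-method check described above.

```ruby
# Hypothetical classes, used only to illustrate the problem.
class Circle
  def initialize(radius)
    @radius = radius
  end

  def area
    @radius * @radius * Math::PI
  end
end

# The workaround: every instance method first checks that initialize ran.
class GuardedCircle < Circle
  def area
    raise TypeError, "uninitialized GuardedCircle" unless defined?(@radius)
    super
  end
end

# Small helper: returns the exception raised by the block, or nil.
def error_for
  yield
  nil
rescue StandardError => e
  e
end

# allocate skips initialize, so @radius is never set.
unclear = error_for { Circle.allocate.area }         # error about nil, not about initialization
guarded = error_for { GuardedCircle.allocate.area }  # explicit error from the guard
```

The unguarded class fails with an error about nil that says nothing about the missing initialization, while the guarded one reports the real cause at the cost of a check in every method.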

Fundamentally, to solve this we need to guarantee that every use of the allocation function is followed by a call to initialize, initialize_dup or initialize_clone.
And we can't guarantee that for Class#allocate.

The current workarounds are:

undef allocate, but this does not prevent Class.instance_method(:allocate).bind_call(Foo).
rb_undef_alloc_func() but this breaks dup, clone and Marshal.

The idea is to have, in addition to the public alloc function (in rb_classext_struct.as.class.allocator), an internal alloc function.
Then:

Class#new, dup, clone and Marshal always use the internal alloc function, because they guarantee to call initialize, initialize_dup or initialize_clone.
rb_define_alloc_func() sets both fields.
rb_undef_alloc_func() sets both fields.
rb_get_alloc_func() reads the public alloc function (unchanged)
Class#allocate uses the public alloc function (unchanged)

We add a new method on Class, for example Class#safe_initialization, which:

Sets the public alloc function to UNDEF_ALLOC_FUNC, same as rb_undef_alloc_func(), so Class#allocate and rb_get_alloc_func() will raise if they are used (as they are unsafe).
Preserves the internal alloc function so Class#new, dup, clone and Marshal keep working.

After that the class has fully safe initialization and does not need to worry about allocated-but-uninitialized objects anymore.
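The two-allocator scheme above can be modeled in plain Ruby. This is only a toy model of the bookkeeping, not CRuby code: ToyClass, its methods and the UNDEF_ALLOC marker are all hypothetical names standing in for the C-level fields and functions.

```ruby
# Toy model only: none of these names exist in CRuby.
UNDEF_ALLOC = :undef

class ToyClass
  # Models rb_define_alloc_func(): sets both fields.
  def define_alloc_func(func)
    @public_alloc = func
    @internal_alloc = func
  end

  # Models rb_undef_alloc_func(): sets both fields.
  def undef_alloc_func
    @public_alloc = UNDEF_ALLOC
    @internal_alloc = UNDEF_ALLOC
  end

  # Models the proposed Class#safe_initialization:
  # only the public field is undefined, the internal one is preserved.
  def safe_initialization
    @public_alloc = UNDEF_ALLOC
  end

  # Models Class#allocate: uses the public alloc function.
  def allocate
    call_alloc(@public_alloc)
  end

  # Models Class#new: uses the internal alloc function, then initializes.
  def new_instance(*args)
    obj = call_alloc(@internal_alloc)
    obj[:initialized_with] = args
    obj
  end

  private

  def call_alloc(func)
    raise TypeError, "allocator undefined" if func == UNDEF_ALLOC
    func.call
  end
end

klass = ToyClass.new
klass.define_alloc_func(-> { {} })
klass.safe_initialization

allocate_error = begin
  klass.allocate
rescue TypeError => e
  e.message
end
instance = klass.new_instance(1, 2)
```

In this model the unsafe entry point raises after safe_initialization, while the path that is guaranteed to initialize keeps working.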

From https://bugs.ruby-lang.org/issues/21852#note-7

Related issues 2 (1 open — 1 closed)

Updated by Eregon (Benoit Daloze) 3 months ago
#1

Related to Feature #21852: New improved allocator function interface added

Updated by Eregon (Benoit Daloze) 3 months ago
#2 [ruby-core:125121]

PR implementing that idea and applying it for MatchData and Regexp, removing many checks which are no longer necessary:
https://github.com/ruby/ruby/pull/16528

Instead of using 2 fields it's using the existing allocator field + a boolean flag to tell if the allocator is public (default) or internal (set by rb_class_safe_initialization()).

Updated by jhawthorn (John Hawthorn) 3 months ago
#3 [ruby-core:125125]

Eregon (Benoit Daloze) wrote:

Class#new, dup, clone and Marshal always use the internal alloc function, because they guarantee to call initialize, initialize_dup or initialize_clone.

Users have control over initialize, initialize_dup or initialize_clone. What's to stop them from replacing those methods with a no-op?

On your branch:

>> RUBY_DESCRIPTION
=> "ruby 4.1.0dev (2026-03-24T15:12:19Z internal_alloc_fun.. b3a027d207) +PRISM [x86_64-linux]"
>> match = "a".match(/./)
=> #<MatchData "a">
>> match.clone
=> #<MatchData "a">
>> def match.initialize_copy(x); end
=> :initialize_copy
>> match.clone
=> #<MatchData:0x00007fd8a78022c0> # <- uninitialized match data

I thought about introducing a flag like this in #21267, but I just don't see a way that it guarantees the inability to create one of these uninitialized objects (rather than just making it slightly more difficult).

Updated by Eregon (Benoit Daloze) 3 months ago · Edited
#4 [ruby-core:125127]

Regarding the name in the C API, it could be rb_class_safe_initialization() to match Class#safe_initialization, or maybe the more intuitive rb_define_internal_alloc_func() or so (which would do both rb_define_alloc_func() + set it as internal-only).
The disadvantage of the latter is that it wouldn't be a good name for a Ruby method, and this functionality is useful for classes defined in Ruby too.

Updated by Eregon (Benoit Daloze) 3 months ago
#5

Related to Bug #21267: respond_to check in Class#allocate is inconsistent added

Updated by Eregon (Benoit Daloze) 3 months ago · Edited
#6 [ruby-core:125129]

@jhawthorn (John Hawthorn) That's a good point, thank you.
I reread https://bugs.ruby-lang.org/issues/21267 and back then I also wanted to have a way for safe initialization but didn't look yet at how to achieve it.

First, I think this proposal still has value because it ensures that initialize/initialize_dup/initialize_clone are called after allocation, and that wasn't the case before (because the user could just call Class#allocate and never follow with initialize*).

Indeed, initialize/initialize_clone/initialize_dup can still be overwritten to produce a logically-broken object, but that is already the case today.
Overwriting these methods effectively breaks the object and is a bad case of monkey-patching, so I think any exception or different behavior is fair enough there: the user is breaking the object, we cannot prevent that override, but they cannot expect things to work after they broke it. However, it must not segfault in that case (I suppose we all agree on that; I would be tempted to say it's the user's fault, but I don't think that will fly).
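The "already the case today" part can be seen on a released Ruby with a core class other than MatchData. A minimal sketch, assuming Struct copies its members in initialize_copy (CoordPair is a hypothetical name for this illustration):

```ruby
CoordPair = Struct.new(:left, :right)
pair = CoordPair.new(1, 2)

# A normal clone goes through Struct#initialize_copy and copies the members.
intact = pair.clone

# Replace the copy hook with a no-op on this one object, as in the session above.
def pair.initialize_copy(original); end

# clone also copies the singleton class, so the no-op hook is the one that runs.
broken = pair.clone
```

Here the broken copy is merely logically wrong (its members were never filled in) rather than dangerous, because the Struct allocator leaves the object in a state every method can handle.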

Currently my PR removes the checks so it could segfault.
So one way to make progress without introducing segfaults would be to keep those checks.
I think that's valuable enough on its own, though not fully satisfying as it keeps these easy-to-forget checks in every instance method.

I'd like to avoid those checks. To do that without risking segfaults, I think we then need to improve the reliability of initialization and copying for classes defined in C (classes defined in Ruby should not be able to cause a segfault anyway, so that part is not a concern).

What if one could provide initialization and copy functions/hooks for TypedData / rb_data_type_t?
Then .new/.dup/.clone would call these hooks before initialize/initialize_clone/initialize_dup, so we have the guarantee they are always run before handing the object to the user.

So we'd have something like:

static const rb_data_type_t my_data_type = {
  ...,
  .init = my_initialize, // VALUE (*)(int argc, VALUE *argv, VALUE self)
  .copy = my_init_copy   // VALUE (*)(VALUE copy, VALUE original)
};

The function signatures would match the signatures typically used for initialize and initialize_copy so it would be easier to share logic with older Ruby versions not having those hooks.

One extra complication here is MatchData is not a TypedData but a raw struct RMatch.
Concretely we could redefine dup and clone on MatchData to achieve the same and call match_init_copy before initialize_dup/initialize_clone (by reusing rb_obj_dup_setup/rb_obj_clone_setup).
We'd also call rb_undef_alloc_func() for MatchData to make sure Kernel#dup/Kernel#clone is not used to bypass the initialization logic in the overwritten dup/clone.
MatchData doesn't have initialize or new so we don't need to worry about that one, but if it did we could override new to call match_initialize before the initialize method (e.g. with rb_obj_call_init_kw).

What do you think?

Another idea would be to prevent redefining these crucial hooks (initialize/initialize_clone/initialize_dup/initialize_copy) for classes using Class#safe_initialization, and subclasses of them.
Preventing override of these methods entirely would be too limiting for subclasses which override the hooks correctly.
So instead we could ensure that any override would super into the original hook, that would be safe and it could be checked by looking at the AST/bytecode/IR of the overriding method.
It might be somewhat complicated if a module is later included and defines e.g. initialize_copy but it should be possible to check that it calls super too when including in a safe_initialization class (directly or indirectly).
Preventing monkey-patching in Ruby is unusual, but maybe it would make sense here?
Such monkey-patches or overrides which don't call super seem inherently broken, so maybe we'd only forbid broken definitions, which is then a good thing?
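For contrast, this is what an override that would pass such a check looks like. It is a minimal sketch using Struct as a stand-in for a class with a crucial copy hook; TaggedPair is a hypothetical name, and the check itself (inspecting the AST/bytecode for a super call) is not shown.

```ruby
class TaggedPair < Struct.new(:left, :right)
  attr_reader :copied

  def initialize_copy(original)
    super           # runs the original hook first, so the members are copied
    @copied = true  # then adds the subclass's own behavior
  end
end

copy = TaggedPair.new(1, 2).dup
```

Because the override calls super, the original hook still runs and the copy is fully initialized before the subclass adds its own state.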

Updated by Eregon (Benoit Daloze) 3 months ago
#7 [ruby-core:125130]

If these init & copy C function hooks would be on RClass instead of rb_data_type_t they could be called from the (confusingly-named) function init_copy which is used by rb_obj_dup_setup/rb_obj_clone_setup and so by dup/clone before initialize_dup/initialize_clone. And then we could just use these new function hooks for MatchData and other core types which are not TypedData.
init_copy already does copying of the ivars, flags and GC attributes so it seems a good fit for "minimal initialization to make the object not segfault" for classes defined in C.
That would be quite elegant I think.

The main problem there is RClass is currently using all of its 160 bytes slot size, and bumping it to twice that doesn't seem great.

Updated by Eregon (Benoit Daloze) 3 months ago · Edited
#8 [ruby-core:125145]

I realized these init & copy C function hooks could actually be done partly with the proposal in #21852, cc @byroot (Jean Boussier).
(and that solves the issue about needing more space in RClass!)
Specifically, the rb_copy_alloc_func_t gets the original object, so that's equivalent to copy and it would be called almost at the right time. And that function can then correctly initialize the C parts of the object so it's valid (at least can't cause segfaults) after it returns.
The one difference in timing is that for clone it would be called before the singleton class is copied & set (in case the original object has a singleton class), which doesn't seem to be much of an issue.

The missing part is that rb_copy_alloc_func_t, when called from Class#new, doesn't receive the arguments, and so it is hard to properly initialize the C structs.

So maybe the new allocator function should be like:

typedef VALUE (*rb_copy_alloc_func_t)(VALUE klass, VALUE original, int initialize_argc, const VALUE *initialize_argv);

or so, and either original (when called from dup/clone) or initialize_argc + initialize_argv would be set (when called from Class#new).
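The intended effect of such an allocator can be modeled in plain Ruby. This is a toy model under stated assumptions, not CRuby code: ToyMatch, initializing_alloc and the @native hash are hypothetical stand-ins for a C-defined class, the proposed alloc function and its native struct.

```ruby
# Toy model only: none of these names are CRuby API.
class ToyMatch
  attr_reader :native

  # Stand-in for the proposed alloc function: it receives either the original
  # (from dup) or the initialize arguments (from new), and sets up the
  # "native" state before any user-overridable hook runs.
  def self.initializing_alloc(original, init_args)
    obj = allocate
    state = original ? original.native.dup : { source: init_args.first.to_s }
    obj.instance_variable_set(:@native, state)
    obj
  end

  def self.new(*args)
    obj = initializing_alloc(nil, args)
    obj.send(:initialize, *args)
    obj
  end

  def initialize(source); end

  def dup
    copy = self.class.initializing_alloc(self, nil)
    copy.send(:initialize_dup, self)
    copy
  end

  def source
    @native.fetch(:source)
  end
end

# A user replaces the copy hook with a no-op...
class ToyMatch
  def initialize_copy(original); end
end

# ...but the copy is still valid, because the allocator already filled @native.
copy = ToyMatch.new("abc").dup
```

In this model the user-visible hooks remain overridable, yet no object reaches user code without its native state set.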

Updated by Eregon (Benoit Daloze) 2 months ago
#9 [ruby-core:125266]

Eregon (Benoit Daloze) wrote in #note-8:

So maybe the new allocator function should be like:

typedef VALUE (*rb_copy_alloc_func_t)(VALUE klass, VALUE original, int initialize_argc, const VALUE *initialize_argv);

I'm thinking a more explicit name would be good, as rb_copy_alloc_func_t might sound like it's only for copying.

So how about rb_initializing_alloc_func_t?

typedef VALUE (*rb_initializing_alloc_func_t)(VALUE klass, VALUE original, int initialize_argc, const VALUE *initialize_argv);

That clearly indicates such an alloc func also does some initialization (for the native parts), and the docs can make it clear initialize is still called, this is just extra native initialization to avoid dangerously-uninitialized objects.

If that's deemed too long, we could go rb_safe_alloc_func_t but that's much less explicit.
Because it's a type used once per class I think it's fine to have a longer name.

Updated by matz (Yukihiro Matsumoto) about 2 months ago
#10 [ruby-core:125296]

Two comments on this proposal.

First, I am not fond of exposing this as a Ruby-level opt-in (Class#safe_initialization). The classes that actually benefit are C-implemented ones, where uninitialized state can cause segfaults. Pure Ruby classes don't have that risk; at worst they raise NoMethodError. I would prefer a C-only mechanism, for example as a flag on rb_data_type_t or as a variant of rb_define_alloc_func. That keeps the Ruby-level surface unchanged and confines the new concept to where it is actually needed.

Second, I don't think safe_initialization is a good name. "Safe" is too vague and doesn't convey what the mechanism actually does. The real intent is closer to "this class requires initialize to be called", so a more direct name like REQUIRE_INITIALIZE (or similar) would be easier for C extension authors to understand without reading documentation.

Matz.

Updated by Eregon (Benoit Daloze) about 2 months ago · Edited
#11 [ruby-core:125312]

Thank you for reviewing this ticket.

I agree with both points.
I thought it would be nice-to-have to also have this for classes defined in Ruby but as you say the worst case is NoMethodError, and the specifics of the proposal are mostly focused on C classes anyway.

I've been discussing with @jhawthorn (John Hawthorn) and @byroot (Jean Boussier) to find a good solution for this.
One tricky part is handling of Marshal in case the class defines the marshal_load protocol (no problem if it defines _load as that controls the allocation and no problem if the class defines neither).
In the marshal_load case Marshal needs to allocate an instance without giving any arguments (I think to support cyclic references), and so it is hard to properly initialize the native state of that object in that case.
I think we then have 3 choices:

Have a C hook called before marshal_load, to initialize the native state, but we still can't pass arguments to it so it doesn't seem helpful.
Accept segfault if users redefine marshal_load on a class defined in C (they should never do that), or somehow prevent that redefinition (I guess one could use Class#freeze for that maybe?)
Give up on the combination class defined in C + marshal_load, and such classes must check if they are initialized in every instance method (like currently).

Note that for classes defined in C in which the alloc func can define a safe native state without knowing the arguments (safe as in any instance method can work on such an uninitialized object) then this is not a problem.
But that is not possible for some classes for which there is no usable "default/zero state".
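The marshal_load case can be observed with a class defined in Ruby: Marshal allocates the instance and calls marshal_load on it, without ever calling initialize. A minimal sketch (CachedSession is a hypothetical class for this illustration):

```ruby
class CachedSession
  attr_reader :user, :log

  def initialize(user)
    @user = user
    @log = []  # state that only initialize sets up
  end

  def marshal_dump
    @user
  end

  def marshal_load(user)
    @user = user
  end
end

restored = Marshal.load(Marshal.dump(CachedSession.new("alice")))
restored_user = restored.user  # set by marshal_load
restored_log = restored.log    # never set: initialize was not called
```

For a Ruby class this only leaves an instance variable unset, but for a class defined in C the equivalent is native state that marshal_load alone is responsible for filling in.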

Related issues:
Related to Ruby - Feature #21852: New improved allocator function interface (Open)
Related to Ruby - Bug #21267: respond_to check in Class#allocate is inconsistent (Closed)
