Feature #21963: A solution to completely avoid allocated-but-uninitialized objects

Added by Eregon (Benoit Daloze) 1 day ago. Updated about 7 hours ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:125117]

Description

A common issue when defining a class is to handle allocated-but-uninitialized objects.
For example:

obj = MyClass.allocate
obj.some_method

This can easily segfault for classes defined in C and raise an unclear exception for classes defined in Ruby.
As a workaround many core (and non-core) classes add a check that they are initialized in every instance method.
This is suboptimal for both performance and correctness: classes should not need to care about allocated-but-uninitialized objects.
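For a class defined in Ruby, the problem and the per-method check workaround can be sketched roughly as follows. The Playlist class and its method names are made up for illustration and are not taken from the issue.

```ruby
# Hypothetical example class; not part of Ruby or of the proposal.
class Playlist
  def initialize
    @tracks = []
  end

  # No guard: fails with an unclear error on an uninitialized object.
  def count
    @tracks.size
  end

  # Guarded variant: the kind of check that would have to be repeated
  # in every instance method.
  def checked_count
    unless instance_variable_defined?(:@tracks)
      raise TypeError, "uninitialized Playlist"
    end
    @tracks.size
  end
end

obj = Playlist.allocate # initialize never runs

unchecked_error = begin
  obj.count
rescue NoMethodError => e
  e # calling #size on nil, which says nothing about the real cause
end

checked_error = begin
  obj.checked_count
rescue TypeError => e
  e # clearer, but only because of the extra check
end
```

The guarded method gives a better error, at the cost of one more check per call and one more thing to forget when adding a method.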

Fundamentally, to solve this we need to guarantee that after the allocation function is used, one of initialize, initialize_dup or initialize_clone is always called.
And we can't guarantee that for Class#allocate.

The current workarounds are:

  • undef allocate, but this does not prevent Class.instance_method(:allocate).bind_call(Foo).
  • rb_undef_alloc_func(), but this breaks dup, clone and Marshal.
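The first workaround can be tried from Ruby with a sketch like the one below. The Guarded class is hypothetical. The outcome of the bind_call line is assumed to depend on the Ruby version: a version that guards Class#allocate with the respond_to check discussed in Bug #21267 may raise TypeError instead of returning an object, so the sketch captures either result rather than asserting one.

```ruby
# Hypothetical example class; not part of Ruby or of the proposal.
class Guarded
  def initialize
    @ready = true
  end

  def ready?
    @ready == true
  end

  class << self
    undef_method :allocate # the "undef allocate" workaround
  end
end

# The direct call is blocked.
direct = begin
  Guarded.allocate
rescue NoMethodError => e
  e
end

# Guarded.new still works: Class#new does not go through the
# allocate method.
via_new = Guarded.new

# Bypass attempt described in the list above. Depending on the Ruby
# version this returns an uninitialized Guarded or raises TypeError.
bypass = begin
  Class.instance_method(:allocate).bind_call(Guarded)
rescue TypeError => e
  e
end
```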

The idea is to have, in addition to the public alloc function (in rb_classext_struct.as.class.allocator), an internal alloc function.
Then:

  • Class#new, dup, clone and Marshal always use the internal alloc function, because they guarantee to call initialize, initialize_dup or initialize_clone.
  • rb_define_alloc_func() sets both fields.
  • rb_undef_alloc_func() sets both fields.
  • rb_get_alloc_func() reads the public alloc function (unchanged)
  • Class#allocate uses the public alloc function (unchanged)

We add a new method on Class, for example Class#safe_initialization, which:

  • Sets the public alloc function to UNDEF_ALLOC_FUNC, same as rb_undef_alloc_func(), so Class#allocate and rb_get_alloc_func() will raise if they are used (as they are unsafe).
  • Preserves the internal alloc function so Class#new, dup, clone and Marshal keep working.

After that the class has fully safe initialization and does not need to worry about allocated-but-uninitialized objects anymore.
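The two-allocator scheme can be modeled in plain Ruby to make the rules concrete. This is only a toy simulation: ModelClass, new_instance and the hash-based "objects" are invented for illustration, and the real change would live in the C implementation of Class.

```ruby
# Toy model only: ModelClass and its methods are hypothetical stand-ins,
# not Ruby APIs. Comments name the function or method each one imitates.
class ModelClass
  UNDEF_ALLOC = :undef_alloc # stand-in for UNDEF_ALLOC_FUNC

  def initialize(name)
    @name = name
    @public_alloc = UNDEF_ALLOC
    @internal_alloc = UNDEF_ALLOC
  end

  # Imitates rb_define_alloc_func(): sets both fields.
  def define_alloc_func(func)
    @public_alloc = func
    @internal_alloc = func
  end

  # Imitates rb_undef_alloc_func(): sets both fields.
  def undef_alloc_func
    @public_alloc = UNDEF_ALLOC
    @internal_alloc = UNDEF_ALLOC
  end

  # Imitates the proposed Class#safe_initialization: only the public
  # allocator is undefined, the internal one is preserved.
  def safe_initialization
    @public_alloc = UNDEF_ALLOC
  end

  # Imitates Class#allocate: public allocator, no initialization.
  def allocate
    call_alloc(@public_alloc)
  end

  # Imitates Class#new: internal allocator, then initialization.
  def new_instance
    obj = call_alloc(@internal_alloc)
    obj[:initialized] = true
    obj
  end

  private

  def call_alloc(func)
    raise TypeError, "allocator undefined for #{@name}" if func == UNDEF_ALLOC
    func.call
  end
end

klass = ModelClass.new("Example")
klass.define_alloc_func(-> { { initialized: false } })
klass.allocate            # an uninitialized object can leak out
klass.safe_initialization
klass.new_instance        # still works through the internal allocator
# klass.allocate          # would now raise TypeError
```

In this model dup, clone and Marshal would follow the same path as new_instance: internal allocator first, then the matching initialize hook.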

From https://bugs.ruby-lang.org/issues/21852#note-7


Related issues 2 (1 open, 1 closed)

Related to Ruby - Feature #21852: New improved allocator function interface (Open)
Related to Ruby - Bug #21267: respond_to check in Class#allocate is inconsistent (Closed)

Updated by Eregon (Benoit Daloze) 1 day ago Actions #1

  • Related to Feature #21852: New improved allocator function interface added

Updated by Eregon (Benoit Daloze) 1 day ago Actions #2 [ruby-core:125121]

PR implementing that idea and applying it for MatchData and Regexp, removing many checks which are no longer necessary:
https://github.com/ruby/ruby/pull/16528

Instead of using 2 fields it's using the existing allocator field + a boolean flag to tell if the allocator is public (default) or internal (set by rb_class_safe_initialization()).

Updated by jhawthorn (John Hawthorn) about 16 hours ago Actions #3 [ruby-core:125125]

Eregon (Benoit Daloze) wrote:

Class#new, dup, clone and Marshal always use the internal alloc function, because they guarantee to call initialize, initialize_dup or initialize_clone.

Users have control over initialize, initialize_dup or initialize_clone. What's to stop them from replacing those methods with a no-op?

On your branch:

>> RUBY_DESCRIPTION
=> "ruby 4.1.0dev (2026-03-24T15:12:19Z internal_alloc_fun.. b3a027d207) +PRISM [x86_64-linux]"
>> match = "a".match(/./)
=> #<MatchData "a">
>> match.clone
=> #<MatchData "a">
>> def match.initialize_copy(x); end
=> :initialize_copy
>> match.clone
=> #<MatchData:0x00007fd8a78022c0> # <- uninitialized match data

I thought about introducing a flag like this in #21267, but I just don't see a way that it guarantees the inability to create one of these uninitialized objects (rather than just making it slightly more difficult).
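The same pattern can be reproduced with a class defined in Ruby, where it produces a logically broken copy rather than an uninitialized C struct. The Stack class below is hypothetical and only illustrates that the copy hook is under user control.

```ruby
# Hypothetical example class; not part of Ruby or of the proposal.
class Stack
  attr_reader :items

  def initialize
    @items = []
  end

  # The class relies on this hook to give each copy its own array.
  def initialize_copy(orig)
    super
    @items = orig.items.dup
  end

  def push(value)
    @items.push(value)
    self
  end
end

stack = Stack.new.push(1)

good_copy = stack.clone
good_copy.items.equal?(stack.items) # false: the hook ran

# Replace the hook with a no-op on this one object. clone copies the
# singleton class, so the copy uses the no-op as well.
def stack.initialize_copy(orig); end

bad_copy = stack.clone
bad_copy.items.equal?(stack.items)  # true: both share one array
```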

Updated by Eregon (Benoit Daloze) about 9 hours ago · Edited Actions #4 [ruby-core:125127]

Regarding the name in the C API, it could be rb_class_safe_initialization() to match Class#safe_initialization, or maybe the more intuitive rb_define_internal_alloc_func() or so (which would do both rb_define_alloc_func() + set it as internal-only).
The disadvantage of the latter is that it wouldn't be a good name for a Ruby method, and this functionality is useful for classes defined in Ruby too.

Updated by Eregon (Benoit Daloze) about 9 hours ago Actions #5

  • Related to Bug #21267: respond_to check in Class#allocate is inconsistent added

Updated by Eregon (Benoit Daloze) about 7 hours ago · Edited Actions #6 [ruby-core:125129]

@jhawthorn (John Hawthorn) That's a good point, thank you.
I reread https://bugs.ruby-lang.org/issues/21267 and back then I also wanted to have a way for safe initialization but didn't look yet at how to achieve it.

First, I think this proposal still has value because it ensures that initialize/initialize_dup/initialize_clone are called after allocation, and that wasn't the case before (because the user could just call Class#allocate and never follow with initialize*).

Indeed, initialize/initialize_clone/initialize_dup can still be overwritten to produce a logically-broken object; that is already the case today.
Overwriting these methods effectively breaks the object and is a bad case of monkey-patching, so I think any exception or different behavior is fair enough there (the user is breaking the object; we cannot prevent that override, but they cannot expect things to work after they broke it). However, it must not segfault in that case (I suppose we all agree on that, though I would be tempted to say it's the user's fault, but I don't think that will fly).

Currently my PR removes the checks so it could segfault.
So one way to make progress without introducing segfaults would be to keep those checks.
I think that's valuable enough on its own, though not fully satisfying as it keeps these easy-to-forget checks in every instance method.

I'd like to avoid those checks. To do that without risking segfaults, I think we then need to improve the reliability of initialization and copying for classes defined in C (classes defined in Ruby should not be able to cause a segfault anyway, so that part is not a concern).

What if one could provide initialization and copy functions/hooks for TypedData / rb_data_type_t?
Then .new/.dup/.clone would call these hooks before initialize/initialize_clone/initialize_dup, so we have the guarantee they are always run before handing the object to the user.

So we'd have something like:

static const rb_data_type_t my_data_type = {
  ...,
  .init = my_initialize, // VALUE (*)(int argc, VALUE *argv, VALUE self)
  .copy = my_init_copy   // VALUE (*)(VALUE copy, VALUE original)
};

The function signatures would match the signatures typically used for initialize and initialize_copy so it would be easier to share logic with older Ruby versions not having those hooks.
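The intended ordering (a mandatory hook first, then the user-overridable method) can be modeled in Ruby. This is a rough analogy under stated assumptions: Buffer, BrokenBuffer and the CORE_INIT lambda are invented, and the lambda only stands in for the proposed C-level .init hook.

```ruby
# Hypothetical example classes; not part of Ruby or of the proposal.
class Buffer
  # Stand-in for the proposed .init hook: establishes the minimal
  # state every instance method depends on.
  CORE_INIT = ->(obj, size) { obj.instance_variable_set(:@bytes, "\0" * size) }

  # Model of .new: allocate, run the mandatory hook, then call the
  # initialize method, which subclasses may override.
  def self.new(size = 0)
    obj = allocate
    CORE_INIT.call(obj, size)
    obj.send(:initialize, size)
    obj
  end

  def initialize(size = 0); end

  def bytesize
    @bytes.bytesize
  end
end

# A subclass that overrides initialize without calling super.
class BrokenBuffer < Buffer
  def initialize(size = 0); end
end

BrokenBuffer.new(4).bytesize # still works: the hook already ran
```

In Ruby the constant could itself be replaced, so the model only shows the call order; the guarantee in the proposal comes from the hook being a C function on the data type.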

One extra complication here is MatchData is not a TypedData but a raw struct RMatch.
Concretely we could redefine dup and clone on MatchData to achieve the same and call match_init_copy before initialize_dup/initialize_clone (by reusing rb_obj_dup_setup/rb_obj_clone_setup).
We'd also call rb_undef_alloc_func() for MatchData to make sure Kernel#dup/Kernel#clone is not used to bypass the initialization logic in the overwritten dup/clone.
MatchData doesn't have initialize or new so we don't need to worry about that one, but if it did, we could override new to call match_initialize before the initialize method (e.g. with rb_obj_call_init_kw).

What do you think?

Another idea would be to prevent redefining these crucial hooks (initialize/initialize_clone/initialize_dup/initialize_copy) for classes using Class#safe_initialization, and subclasses of them.
Preventing override of these methods entirely would be too limiting for subclasses which override the hooks correctly.
So instead we could ensure that any override would super into the original hook, that would be safe and it could be checked by looking at the AST/bytecode/IR of the overriding method.
It might be somewhat complicated if a module is later included and defines e.g. initialize_copy but it should be possible to check that it calls super too when including in a safe_initialization class (directly or indirectly).
Preventing monkey-patching in Ruby is unusual, but maybe it would make sense here?
Such monkey-patches or overrides which don't call super seem inherently broken, so maybe we'd only forbid broken definitions, which is then a good thing?
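As a very rough illustration of the bytecode-inspection idea, CRuby can already tell whether a method body contains a super call. The sketch below is a crude heuristic, not the proposed check: the calls_super? helper and both classes are hypothetical, it only works on CRuby (RubyVM is implementation-specific), and it does not verify that super is reached on every code path.

```ruby
# Hypothetical helper: true if the method's bytecode contains a super
# call instruction. Methods without bytecode (e.g. defined in C) give
# false.
def calls_super?(meth)
  iseq = RubyVM::InstructionSequence.of(meth)
  return false unless iseq
  iseq.disasm.include?("invokesuper")
end

# Hypothetical example classes.
class WithSuper
  def initialize_copy(orig)
    super
  end
end

class WithoutSuper
  def initialize_copy(orig); end
end

calls_super?(WithSuper.new.method(:initialize_copy))
calls_super?(WithoutSuper.new.method(:initialize_copy))
```

A real check would have to run when the method is defined or when a module is included, as described above, and would need to reject a super call that sits behind a condition.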

Updated by Eregon (Benoit Daloze) about 7 hours ago Actions #7 [ruby-core:125130]

If these init & copy C function hooks would be on RClass instead of rb_data_type_t they could be called from the (confusingly-named) function init_copy which is used by rb_obj_dup_setup/rb_obj_clone_setup and so by dup/clone before initialize_dup/initialize_clone. And then we could just use these new function hooks for MatchData and other core types which are not TypedData.
init_copy already does copying of the ivars, flags and GC attributes so it seems a good fit for "minimal initialization to make the object not segfault" for classes defined in C.
That would be quite elegant I think.

The main problem there is that RClass is currently using all of its 160-byte slot size, and bumping it to twice that doesn't seem great.
