Project

General

Profile

ActionsLike1

Feature #21216

closed

Implement Set as a core class

Added by jeremyevans0 (Jeremy Evans) 21 days ago. Updated about 16 hours ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:121540]

Description

I propose to implement Set as a core class. Set has been an autoloaded standard library since Ruby 3.2. The standard library Set is less efficient than it could be, as it uses Hash for storage, which stores unnecessary values for each item in the set.

I've submitted a pull request that implements Set as a core class: https://github.com/ruby/ruby/pull/13074

Implementation details for the pull request:

  • Core Set uses a modified version of st_table, named set_table. Other than s/st_/set_/, the main difference is that the stored records do not have values, making them 1/3 smaller. st_table_entry stores hash, key, and record (value), while set_table_entry only stores hash and key. This results in large sets using ~33% less memory compared to stdlib Set. For small sets, core Set uses 20% less memory (160 bytes, while stdlib set uses 40 for Set and 160 for Hash).

  • All methods are implemented as cfuncs, except the pretty_print methods, which were moved to lib/pp.rb (which is where the pretty_print methods for other core classes are defined). As is typical for core classes, internal calls call C functions and not Ruby methods. For example, to check if something is a Set, rb_obj_is_kind_of is used, instead of calling is_a?(Set) on the related object.

  • Almost all methods use the same algorithm that the pure-Ruby implementation used. The exception is when calling Set#divide with a block with 2-arity. stdlib Set used a lazy-loaded tsort to implement this. I developed an algorithm that only allocates a single intermediate hash and does not need tsort.

  • The flatten_merge protected method is no longer necessary, so it is not implemented (it could be).

  • Similar to Hash/Array, subclasses of Set are no longer reflected in inspect output.

  • Documentation from stdlib Set was moved to core Set, with minor updates.

I developed a comprehensive benchmark suite for all public Set methods (results attached). As you would expect, core Set is faster than stdlib Set in the majority of cases (90%), and multiple times faster in many cases (47% of cases are at least 2x faster, 31% of cases are at least 4x faster). There are a few cases where it is significantly slower:

  • Set.new with no arguments (~1.6x)
  • Set#compare_by_identity for small sets (~1.3x)
  • Set#clone for small sets (~1.5x)
  • Set#dup for small sets (~1.7x)

Some of these are slower as Set does not currently use the AR table optimization that Hash does, so a new set_table is initialized for each Set. I'm not sure it's worth the complexity to have an AR table-like optimization for small sets (for hashes it makes sense, as small hashes are used everywhere in Ruby).

The rbs and repl_type_completor bundled gems will need updates to support core Set. The pull request marks them as allowed failures.

This passes all set tests with no changes. The following specs needed modification:

  • Modifying frozen set error message (changed for the better)
  • Set#divide when passed a 2-arity block no longer yields the same object as both the first and second argument (this seems like an issue with the previous implementation).
  • Set-like objects that override is_a? such that is_a?(Set) return true are no longer treated as Set instances.
  • Set.allocate.hash is no longer the same as nil.hash
  • Set#join no longer calls Set#to_a (it calls the underlying C function).
  • Set#flatten_merge protected method is not implemented.

Previously, set.rb added a SortedSet autoload, which loads set/sorted_set.rb. This replaces the Set autoload in prelude.rb with a SortedSet autoload, but I recommend removing it and set/sorted_set.rb before the release of Ruby 3.5.

This moves test/set/test_set.rb to test/ruby/test_set.rb, reflecting that switch to a core class. This does not move the spec files, as I'm not sure how they should be handled currently. Eventually, the set spec files should be moved to spec/ruby/core, but maybe not until Ruby 3.4 is no longer supported.

This does not add any functions to the C-API (all functions in set.c are static). I think such functions can be added later as we find other places in core where it makes sense to use a Set instead of a Hash or Array, or as we get requests from extension authors.

This is partially inspired by changes in Python 2.4 (released in November 2004), which changed the set implementation from a standard library class backed by a hash (dict in Python terms) to having it as a core class.


Files

set-benchmark-output.txt (74.9 KB) set-benchmark-output.txt jeremyevans0 (Jeremy Evans), 04/06/2025 03:44 AM

Related issues 1 (1 open0 closed)

Related to Ruby - Feature #16989: Sets: need ♥️Assignedknu (Akinori MUSHA)Actions
Like0Actions #1

Updated by hsbt (Hiroshi SHIBATA) 19 days ago

Updated by knu (Akinori MUSHA) 12 days ago

This is wonderful, and I feel like I should have done this long ago.

Here's some thoughts.

Set was initially designed with extensibility in mind because Hash was hard to extend via inheritance. It was also almost impossible to make a Hash-like object class when there were too many methods to implement.

  • SortedSet could be implemented only by changing the data structure of @hash from Hash to RBTree.
  • Although not completely enforced, most methods internally rely on a small set of primitive methods like add and delete so that subclasses could easily customize behavior.

However, there may not have been much demand to subclass Set to customize the behavior. (Further investigation is needed via code search)
If so, it might make more sense to shift the design focus toward performance and thread safety, as this new implementation in C suggests.

I'm in favor of this proposal. Let's discuss the technical details separately.

Like0Actions #3

Updated by jeremyevans (Jeremy Evans) about 16 hours ago

  • Status changed from Open to Closed

Applied in changeset git|e4f85bfc311a3812de7bc2e9d068934e8b364574.


Implement Set as a core class

Set has been an autoloaded standard library since Ruby 3.2.
The standard library Set is less efficient than it could be, as it
uses Hash for storage, which stores unnecessary values for each key.

Implementation details:

  • Core Set uses a modified version of st_table, named set_table.
    than s/st_/set_/, the main difference is that the stored records
    do not have values, making them 1/3 smaller. st_table_entry stores
    hash, key, and record (value), while set_table_entry only
    stores hash and key. This results in large sets using ~33% less
    memory compared to stdlib Set. For small sets, core Set uses 12% more
    memory (160 byte object slot and 64 malloc bytes, while stdlib set
    uses 40 for Set and 160 for Hash). More memory is used because
    the set_table is embedded and 72 bytes in the object slot are
    currently wasted. Hopefully we can make this more efficient and have
    it stored in an 80 byte object slot in the future.

  • All methods are implemented as cfuncs, except the pretty_print
    methods, which were moved to lib/pp.rb (which is where the
    pretty_print methods for other core classes are defined). As is
    typical for core classes, internal calls call C functions and
    not Ruby methods. For example, to check if something is a Set,
    rb_obj_is_kind_of is used, instead of calling is_a?(Set) on the
    related object.

  • Almost all methods use the same algorithm that the pure-Ruby
    implementation used. The exception is when calling Set#divide with a
    block with 2-arity. The pure-Ruby method used tsort to implement this.
    I developed an algorithm that only allocates a single intermediate
    hash and does not need tsort.

  • The flatten_merge protected method is no longer necessary, so it
    is not implemented (it could be).

  • Similar to Hash/Array, subclasses of Set are no longer reflected in
    inspect output.

  • RDoc from stdlib Set was moved to core Set, with minor updates.

This includes a comprehensive benchmark suite for all public Set
methods. As you would expect, the native version is faster in the
vast majority of cases, and multiple times faster in many cases.
There are a few cases where it is significantly slower:

  • Set.new with no arguments (~1.6x)
  • Set#compare_by_identity for small sets (~1.3x)
  • Set#clone for small sets (~1.5x)
  • Set#dup for small sets (~1.7x)

These are slower as Set does not currently use the AR table
optimization that Hash does, so a new set_table is initialized for
each call. I'm not sure it's worth the complexity to have an AR
table-like optimization for small sets (for hashes it makes sense,
as small hashes are used everywhere in Ruby).

The rbs and repl_type_completor bundled gems will need updates to
support core Set. The pull request marks them as allowed failures.

This passes all set tests with no changes. The following specs
needed modification:

  • Modifying frozen set error message (changed for the better)
  • Set#divide when passed a 2-arity block no longer yields the same
    object as both the first and second argument (this seems like an issue
    with the previous implementation).
  • Set-like objects that override is_a? such that is_a?(Set) return
    true are no longer treated as Set instances.
  • Set.allocate.hash is no longer the same as nil.hash
  • Set#join no longer calls Set#to_a (it calls the underlying C
    function).
  • Set#flatten_merge protected method is not implemented.

Previously, set.rb added a SortedSet autoload, which loads
set/sorted_set.rb. This replaces the Set autoload in prelude.rb
with a SortedSet autoload, but I recommend removing it and
set/sorted_set.rb.

This moves test/set/test_set.rb to test/ruby/test_set.rb,
reflecting that switch to a core class. This does not move the spec
files, as I'm not sure how they should be handled.

Internally, this uses the st_* types and functions as much as
possible, and only adds set_* types and functions as needed.
The underlying set_table implementation is stored in st.c, but
there is no public C-API for it, nor is there one planned, in
order to keep the ability to change the internals going forward.

For internal uses of st_table with Qtrue values, those can
probably be replaced with set_table. To do that, include
internal/set_table.h. To handle symbol visibility (rb_ prefix),
internal/set_table.h uses the same macro approach that
include/ruby/st.h uses.

The Set class (rb_cSet) and all methods are defined in set.c.
There isn't currently a C-API for the Set class, though C-API
functions can be added as needed going forward.

Implements [Feature #21216]

Co-authored-by: Jean Boussier
Co-authored-by: Oliver Nutter

ActionsLike1

Also available in: Atom PDF