Project

General

Profile

Actions

Feature #16038

closed

Provide a public WeakMap that compares by equality rather than by identity

Added by byroot (Jean Boussier) over 4 years ago. Updated about 2 years ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:94114]

Description

I know ObjectSpace::WeakMap isn't really supposed to be used, and that the blessed interface is WeakRef. However, I'd like to make a case for a better public WeakMap.

Usage

As described in [Feature #16035], WeakMap is useful for deduplicating "value objects". A typical use case is as follows:

class Position
  REGISTRY = {}
  private_constant :REGISTRY

  class << self
    def new(*)
      instance = super
      REGISTRY[instance] ||= instance
    end
  end

  attr_reader :x, :y, :z

  def initialize(x, y, z)
    @x = x
    @y = y
    @z = z
    freeze
  end

  def hash
    self.class.hash ^
      x.hash >> 1 ^
      y.hash >> 2 ^
      y.hash >> 3
  end

  def ==(other)
    other.is_a?(Position) &&
      other.x == x &&
      other.y == y &&
      other.z == z
  end
  alias_method :eql?, :==
end

p Position.new(1, 2, 3).equal?(Position.new(1, 2, 3))

That's pretty much the pattern I used in Rails to deduplicate database metadata and save lots of memory.

The big downside here is that these value objects can't be GCed anymore, so this pattern is not viable in many case.

Why not use WeakRef

A couple of reasons. First, when using this pattern, the goal is to reduce memory usage, so having one extra WeakRef for every single value object is a bit counter productive.

Then it's a bit annoying to work with, as you have to constantly check wether the reference is still alive, and/or rescue WeakRef::RefError.

Often, these two complications make the tradeoff not worth it.

Ruby 2.7

Since [Feature #13498] WeakMap is a bit more usable as you can now use an interned string as the unique key, e.g.

class Position
  REGISTRY = ObjectSpace::WeakMap.new
  private_constant :REGISTRY

  class << self
    def new(*)
      instance = super
      REGISTRY[instance.unique_id] ||= instance
    end
  end

  attr_reader :x, :y, :z, :unique_id

  def initialize(x, y, z)
    @x = x
    @y = y
    @z = z
    @unique_id = -"#{self.class}-#{x},#{y},#{z}"
    freeze
  end

  def hash
    self.class.hash ^
      x.hash >> 1 ^
      y.hash >> 2 ^
      y.hash >> 3
  end

  def ==(other)
    other.is_a?(Position) &&
      other.x == x &&
      other.y == y &&
      other.z == z
  end
  alias_method :eql?, :==
end

p Position.new(1, 2, 3).equal?(Position.new(1, 2, 3))

That makes the pattern much easier to work with than dealing with WeakRef, but there is still that an extra instance.

Proposal

What would be ideal would be a WeakMap that works by equality, so that the first snippet could simply replace {} by WeakMap.new.

Changing ObjectSpace::WeakMap's behavior would cause issues, and I see two possibilities:

  • The best IMO would be to have a new top level ::WeakMap be the equality based map, and have ObjectSpace::WeakMap remain as a semi-private interface for backing up WeakRef.
  • Or alternatively, ObjectSpace::WeakMap could have a compare_by_equality method (inverse of Hash#compare_by_identity) to change its behavior post instantiation.

I personally prefer the first one.


Related issues 2 (1 open1 closed)

Related to Ruby master - Feature #16471: Two feature requests for WeakRef: get original object, callback featureOpenActions
Related to Ruby master - Feature #18498: Introduce a public WeakKeysMap that compares by equalityClosedActions
Actions #1

Updated by sawa (Tsuyoshi Sawada) over 4 years ago

  • Subject changed from Provide a public WeakMap which compare by equality rather than identity to Provide a public WeakMap that compares by equality rather than by identity
  • Description updated (diff)

Updated by matz (Yukihiro Matsumoto) over 4 years ago

I am not sure if the proposal has real-world use-case. Can you elaborate?

Matz.

Updated by byroot (Jean Boussier) over 4 years ago

@matz (Yukihiro Matsumoto) Of course. What prompted me to open this feature request is a feature I implemented in Rails where I use this exact pattern: https://github.com/rails/rails/pull/35891 (more specifically https://github.com/rails/rails/blob/48ae1ba36132d7deec0bd43ee50e076786011a5e/activerecord/lib/active_record/connection_adapters/deduplicable.rb)

In short Rails ORM keep a data structure in memory that reflect the database schema, and that structure contains quite a lot of duplications, for instance all tables have an id BIGINT PRIMARY KEY column, so if you have 100 tables, you'll end up with 100 identical Column objects in memory.

By using the pattern described in this issue, I was able to get rid of all the deduplications, which resulted in massive memory saving (over 115MB for us, but granted our schema is massive).

But my worry now is that I made all these objects totally non garbage collectable, so if someone is instantiating these dynamically, it might cause a RAM based DOS vulnerability, an attacker could trigger a lot of instantiations until the application runs out of RAM.

That is why I'd like to use a WeakMap instead, but for that I need it to have an equality semantic like a regular hash. If it existed, I believe this pattern could be used in many libraries and applications which can't do it today because it's too risky.

Actions #4

Updated by nobu (Nobuyoshi Nakada) about 4 years ago

  • Related to Feature #16471: Two feature requests for WeakRef: get original object, callback feature added

Updated by byroot (Jean Boussier) almost 3 years ago

I'd like this to be reconsidered. I'll add it to the next developer meeting.

Updated by marcandre (Marc-Andre Lafortune) almost 3 years ago

Seems reasonable to me.

I also wish that WeakMap was officially public, it can be quite useful. I wish it's interface was closer to Hash.

I'd suggest an option be added to the initializer of ObjectSpace::WeakMap.

Updated by Eregon (Benoit Daloze) almost 3 years ago

marcandre (Marc-Andre Lafortune) wrote in #note-6:

I'd suggest an option be added to the initializer of ObjectSpace::WeakMap.

I think we should have a separate class if this feature is accepted.
It's already complicated enough to implement ObjectSpace::WeakMap, I wouldn't want to have to dynamically handle how comparison is done for keys.
For instance on TruffleRuby one would likely need to wrap keys if we compare by hash/eql?. If we have to selectively wrap it gets ugly and error-prone.

Updated by nobu (Nobuyoshi Nakada) almost 3 years ago

Your example always allocates a new instance first, then deduplicates it.
Why not:

    def new(x, y, z)
      REGISTRY[-"#{self}-#{x},#{y},#{z}"] ||= super
    end

This would work even in 2.7 too.

Updated by byroot (Jean Boussier) almost 3 years ago

Your example always allocates a new instance first, then deduplicates it.

Yes, for more complex cases that's kind of the only way. What I'm after here is mostly memory retention, not so much allocations.

Why not:

Yes, I used this in some cases when the number of property was slow enough. But:

  • The object need to hold that string.
  • That just doubled the number of retained objects, just like WeakRef.
  • Not everything is well suited to be concatenated as a string like this.

And even for your approach, equality based WeakMap would offer a much cleaner interface:

    def new(x, y, z)
      REGISTRY[[x, y, z]] ||= super
    end

Updated by mame (Yusuke Endoh) almost 3 years ago

This ticket was discussed in the previous dev meeting. The use case is understandable, but @matz (Yukihiro Matsumoto) was not completely convinced that WeakMap is a suitable solution for that. He said he would reply.

In addition, @ko1 (Koichi Sasada) said that it would be (possible but) tough to implement the feature. For lookup, it is safe to assume that a key is reachable (otherwise, the key is never looked up). However, the assumption is not safe for compare_by_equality. Also, we need to care a GC during the equality check.


This is just my opinion. I want this kind of feature (non-memory-leakable hash for object dedup), but I'm unsure whether WeakMap is actually usable for that because it is difficult to predict its behavior. It has no configuable size limit, and it may forget almost all contents at every GC (depending upon an application). Instead of exploiting WeakMap, a dedicated cache library makes more sense.

Updated by Eregon (Benoit Daloze) almost 3 years ago

mame (Yusuke Endoh) wrote in #note-10:

This is just my opinion. I want this kind of feature (non-memory-leakable hash for object dedup), but I'm unsure whether WeakMap is actually usable for that because it is difficult to predict its behavior. It has no configuable size limit, and it may forget almost all contents at every GC (depending upon an application).

It's the same or similar behavior for Java WeakHashMap.
When used as a cache, then whatever uses it should keep a strong reference to the key (value if it's a weak values map), so the entry stays alive as long as there are users of it, and it's bounded by the number of usages.
Not keeping a reference to the key (value if it's a weak values map) is a bug, then that cache is of no use as it would indeed be empty after every GC.

@byroot (Jean Boussier) Actually, something is not clear in this proposal: do you want weak keys or weak values?
For deduplication, weak values often make sense, then one only needs to hold to the value to keep the entry alive.

My understanding is that ObjectSpace::WeakMap is weak values, and compare_by_identity.

Updated by byroot (Jean Boussier) almost 3 years ago

My understanding is that ObjectSpace::WeakMap is weak values

I believe it's both weak keys and weak values:

>> m = ObjectSpace::WeakMap.new
=> #<ObjectSpace::WeakMap:0x00007fa56bf75ea0>
>> Object.new.tap { |o| m[o] = o }
=> #<Object:0x00007fa575c167f0>
>> m.size
=> 1
>> 4.times { GC.start }
=> 4
>> m.size
=> 0

And that's essentially the behavior I'm after, as the deduplication code, as defined in the description: REGISTRY[instance] ||= instance. E.g. you create a new instance, and then check for a pre-existing equal one.

Same as String#-@, you first need to build the string, and then can deduplicate it.

Updated by byroot (Jean Boussier) almost 3 years ago

Instead of exploiting WeakMap, a dedicated cache library makes more sense.

I'm not sure how one could write such library today without a WeakMap (or other Weak* datastructure) like I'm requesting.

Updated by Eregon (Benoit Daloze) almost 3 years ago

byroot (Jean Boussier) wrote in #note-12:

My understanding is that ObjectSpace::WeakMap is weak values

I believe it's both weak keys and weak values:

I actually asked this question on the ruby-core Slack a while ago and IIRC I could not get a clear answer.
Maybe @nobu (Nobuyoshi Nakada) knows?

I'm not sure both weak keys and weak values make sense.
For instance, if one of them is no longer referenced (e.g., the key), should the map remove the entry (even if the value is still referenced; which means two equivalent instances could be alive => sounds bad)?

If the logic is to keep the entry as long as either the key or value is alive, that sounds like it needs very deep GC support.
At least it doesn't seem possible to implement just with WeakReferences (unless I'm missing something).

TruffleRuby and JRuby currently implement ObjectSpace::WeakMap with weak values (and strong keys).
I am concerned it might not be possible to have both weak keys and weak values on the JVM, and probably it's also hard with some GCs (seems to need: an object cannot die while another, unrelated by reference, object is alive).
cc @headius (Charles Nutter)


The specific case you illustrate sounds like a "deduplication set" (my words, probably not a well-known term).
That specifically might be easier to implement and safer (since the key and value of an entry are the same, so they naturally GC at the same time).

Updated by mame (Yusuke Endoh) almost 3 years ago

byroot (Jean Boussier) wrote in #note-13:

I'm not sure how one could write such library today without a WeakMap (or other Weak* datastructure) like I'm requesting.

I guess that it is possible to write such library in C extension by using finalizers, but some APIs might lack.

Updated by byroot (Jean Boussier) almost 3 years ago

I guess that it is possible to write such library in C extension by using finalizers

Ah I see. Yeah I didn't think about C-exts. That could work, would probably need to pin all references to avoid compaction problems etc.

But it could allow to experiment more with the idea.

Updated by Eregon (Benoit Daloze) over 2 years ago

This came up again in https://github.com/rails/rails/pull/43723.

I think adding WeakKeysMap and WeakValuesMap would be the best.
That's clear and implementable on all Ruby implementations which support weak references in general.

They could be top-level, or nested under ObjectSpace, I don't really mind either.

Adding this is necessary to have access to a weak-keys map on JVM-based Ruby implementations, since a map that's both weak keys and weak values seems impossible on JVM (and likely other VMs as well).

Updated by Eregon (Benoit Daloze) over 2 years ago

Regarding comparing keys, I think eql?/hash comparison is more intuitive (consistent with {}) and more useful than identity, so I would just go with that, always.

If one actually needs to compare by identity they could wrap the keys, or simply rely on the default #eql?/#hash methods.
The other way is not possible, i.e., emulate equality on top of identity comparisons.

Updated by byroot (Jean Boussier) over 2 years ago

They could be top-level, or nested under ObjectSpace, I don't really mind either.

They could also be under WeakRef, e.g. WeakRef::WeakKeysMap.

Updated by Dan0042 (Daniel DeLorme) over 2 years ago

What is the use case for WeakKeysMap? It doesn't seem like it would allow the deduplication described above. Assuming the same interface as Hash, I don't see an efficient way to get the key from the map.

if !REGISTRY[instance]
  REGISTRY[instance] = true
else
  #??????
  REGISTRY.each_key do |k|
    break instance=k if k.eql?(instance)
  end
end

Updated by byroot (Jean Boussier) over 2 years ago

It doesn't seem like it would allow the deduplication described above.

The subject was enlarged a bit over time.

The initial deduplication scenarios assume both weak keys and weak values, which is currently the case on MRI, but seems complicated to do on Java based implementations.

But I suppose you could do this:

REGISTRY = WeakKeysMap.new

if (ref = REGISTRY[instance])
  ref.__getobj__ # would need to deal with dead ref here but whatever
else
  REGISTRY[instance] = WeakRef.new(instance)
end

Updated by Eregon (Benoit Daloze) over 2 years ago

For the deduplication use-case as in the description, WeakValuesMap is possible too:

REGISTRY = WeakValuesMap.new

key = [x, y, z]
if instance = REGISTRY[key]
  instance
else
  REGISTRY[key] = instance
end

Then that mapping stays in the map as long as the instance is referenced somewhere.
We can't use the instance itself as the key, otherwise no mapping would ever GC, since the key is referenced strongly by the map.

Which to use between the two is based on what you want to refer to keep the mapping alive:
If one wants the mapping to exist as long as the key is referenced, then it's WeakKeysMap.
If one wants the mapping to exist as long as the value is referenced, then it's WeakValuesMap.

As examples from TruffleRuby, the internal map for Symbols or for frozen literal/interned strings use a WeakValuesMap. They need to guarantee uniqueness, so as long as the value is alive it makes sure to to avoid any duplicate value.
WeakKeysMap is used to keep some extra data about an object outside of it, as long as the object is alive (e.g., the eval line offset for a given Source file object, the excluded_descendants in that rails PR).
One can also build a weak set based on WeakKeysMap, by just using e.g. true as the value. This is used e.g. to track subclasses of a class in a weakly manner.

Updated by mame (Yusuke Endoh) over 2 years ago

Let me confirm. The use case that we are considering now is only object deduplication? Or, other use cases are also being discussed?


As far as I understand, eregon says "JRuby/TruffleRuby cannot implement WeakMap (in which both keys and values are weak) due to JVM limitation. So, please do not enhance WeakMap. Instead, let's introduce WeakKeysMap and WeakValuesMap into Ruby, and use them for object deduplication." Am I right? It sounds like the deprecation of WeakMap, and it is far from the original proposal.

The original proposal is to work the following code just by rewriting REGISTRY = {} to REGISTRY = ObjectSpace::WeakMap.new(compare_by_equality: true) or something.

  REGISTRY = {}
  private_constant :REGISTRY

  class << self
    def new(*)
      instance = super
      REGISTRY[instance] ||= instance
    end
  end

I think WeakKeysMap and WeakValuesMap seem not to meet the requirement.

Anyway, I feel that the proposal is drastically changed. Maybe it is a good idea to create another ticket with the latest proposal.

Updated by byroot (Jean Boussier) over 2 years ago

The use case that we are considering now is only object deduplication?

That's my main use case for it yes, but there are plenty of other use cases.

The original proposal is to work the following code just by rewriting REGISTRY = {} to REGISTRY = ObjectSpace::WeakMap.new(compare_by_equality: true) or something.

I posted an updated snippet accommodating Java based VMs:

REGISTRY = WeakKeysMap.new

if (ref = REGISTRY[instance])
  ref.__getobj__ # would need to deal with dead ref here but whatever
else
  REGISTRY[instance] = WeakRef.new(instance)
end

I don't know if a new ticket is necessary, I suppose that since ObjectSpace::WeakMap isn't really officially public API, if we were to add features to it and make it public, it kind of make sense to discussed the exposed semantic at the same time.

Updated by matz (Yukihiro Matsumoto) over 2 years ago

I am a bit confusing. Could you summarize the current proposal?

Matz.

Updated by byroot (Jean Boussier) over 2 years ago

The current proposal is:

  • Two new classes WeakKeysMap and WeakValuesMap
  • Not too sure about the namespace, maybe inside WeakRef?
  • Equality semantic for keys.
Value = Struct.new(:foo)

map = WeakKeysMap.new
map[Value.new(42)] = "something"
map[Value.new(42)] # => "something"

Updated by mame (Yusuke Endoh) over 2 years ago

byroot (Jean Boussier) wrote in #note-24:

The use case that we are considering now is only object deduplication?

That's my main use case for it yes, but there are plenty of other use cases.

What use case do you have for example?

I posted an updated snippet accommodating Java based VMs:

Well. You said in the ticket description:

Why not use WeakRef
A couple of reasons. First, when using this pattern, the goal is to reduce memory usage, so having one extra WeakRef for every single value object is a bit counter productive.

I'm unsure how you intended to use WeakRef for dedup, but is this no longer the requirement? I think the combination of the current WeakMap and compare_by_equality was understandable.

IMO, a description is the most important part in a feature proposal. At the meeting, we tend to read only the description, and a comment written in the dev-meeting ticket. Reading the discussion is time-consuming for us. If the proposal was changed significantly, I like you to close the ticket and create another one with self-contained and complete description.

Updated by byroot (Jean Boussier) over 2 years ago

I like you to close the ticket and create another one with self-contained and complete description.

Fair. I'll chat some more with @Eregon (Benoit Daloze) to make sure we agree on the proposal and do that for next month.

Updated by mame (Yusuke Endoh) over 2 years ago

Eregon (Benoit Daloze) wrote in #note-22:

For the deduplication use-case as in the description, WeakValuesMap is possible too:
...
Then that mapping stays in the map as long as the instance is referenced somewhere.
We can't use the instance itself as the key, otherwise no mapping would ever GC, since the key is referenced strongly by the map.

As far as I understand from the description, what byroot wanted is to use the instance itself as the key.

BTW, is it really impossible for JVM to implement CRuby's current WeakMap in which both keys and values are weak? If the code written in byroot (WeakKeysMap with WeakRef values) works in JVM, the hack can be also used to implement WeakMap.

As examples from TruffleRuby, the internal map for Symbols or for frozen literal/interned strings use a WeakValuesMap. They need to guarantee uniqueness, so as long as the value is alive it makes sure to to avoid any duplicate value.
WeakKeysMap is used to keep some extra data about an object outside of it, as long as the object is alive (e.g., the eval line offset for a given Source file object, the excluded_descendants in that rails PR).
One can also build a weak set based on WeakKeysMap, by just using e.g. true as the value. This is used e.g. to track subclasses of a class in a weakly manner.

Thanks, this summary is very clear to me.

Actions #30

Updated by byroot (Jean Boussier) over 2 years ago

  • Status changed from Open to Feedback
Actions #31

Updated by byroot (Jean Boussier) about 2 years ago

  • Related to Feature #18498: Introduce a public WeakKeysMap that compares by equality added

Updated by byroot (Jean Boussier) about 2 years ago

  • Status changed from Feedback to Closed

Closing in favor of https://bugs.ruby-lang.org/issues/18498 as requested.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0