Bug #20301
open`Set#add?` does two hash look-ups
Description
A common usage of Set
s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.:
SEEN_VALUES = Set.new
def receive_value(value)
if SEEN_VALUES.add?(value)
puts "Saw #{value} for the first time."
else
puts "Already seen #{value}, ignoring."
end
end
receive_value(1) # Saw 1 for the first time.
receive_value(2) # Saw 2 for the first time.
receive_value(3) # Saw 3 for the first time.
receive_value(1) # Already seen 1, ignoring.
Readers might reasonably assume that add?
is only looking up into the set a single time, but it's actually doing two separate look-ups! (source)
class Set
def add?(o
# 1. `include?(o)` looks up into `@hash`
# 2. if the value isn't there, `add(o)` does a second look-up into `@hash`
add(o) unless include?(o)
end
end
This gets especially expensive if the values are large hash/arrays/objects, whose #hash
is expensive to compute.
We can optimize this if it was possible to set a value in hash, and retrieve the value that was already there, in a single go. I propose adding Hash#exchange_value
to do exactly that. If that existed, we can re-implement #add?
as:
class Set
def add?(o)
# Only requires a single look-up into `@hash`!
self unless @hash.exchange_value(o, true)
end
Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093
Theory¶
How much of a benefit this has depends on 2 factors:
- How much
#hash
is called, which depends on how many new objects are added to the set.- If every object is new, then
#hash
used to be called twice on every#add?
.- This is where this improvement makes the biggest (2x!) change.
- If every object has already been seen, then
#hash
was never being called twice before anyway, so there would be no improvement.- It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set.
- Every other case lies somewhere in between those two, depending on the % of objects which are new.
- If every object is new, then
- How slow
#hash
is to compute for the key- If the hash is slow to compute, this change will make a bigger improvement
- If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact.
Benchmark summary¶
All objects are new | All objects are preexisting | |
---|---|---|
objects with slow #hash
|
100.0% | ~0.0% |
objects with fast #hash
|
24.5% | 4.6% |
As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't.
For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093
Updated by AMomchilov (Alexander Momchilov) 8 months ago
- Description updated (diff)
Updated by AMomchilov (Alexander Momchilov) 8 months ago
- Description updated (diff)
Updated by Dan0042 (Daniel DeLorme) 8 months ago ยท Edited
Now I understand why you proposed #20300 Hash#update_value
However I'd like to suggest an alternative approach for your consideration:
def add?(k)
added = false
@hash.add(k){ added = true } #call block only if k not in @hash; return existing or added value
self if added
end
This is likely to be a bit less efficient than your approach, however Hash#add
is a method I've been missing from ruby for a long time, and would find infinitely more useful than Hash#update_value
Updated by Eregon (Benoit Daloze) 8 months ago
@Dan0042 (Daniel DeLorme) That's related to #17342 then, and also known as compute_if_absent
in concurrent-ruby.
Updated by Eregon (Benoit Daloze) 8 months ago
- Related to Feature #20300: Hash: set value and get pre-existing value in one call added
Updated by AMomchilov (Alexander Momchilov) 8 months ago
I don't mind it @Dan0042 (Daniel DeLorme), but that's a secondary issue IMO. The block call defeats the benefit of this optimization. It'll even slow down the case where you're looking up pre-existing objects (that's currently net-even perf after these changes), and that's a big no-no.
Updated by AMomchilov (Alexander Momchilov) 8 months ago
- Description updated (diff)
Updated by shyouhei (Shyouhei Urabe) 8 months ago
Why not:
def add?(o)
n = size
add(o)
m = size
return n == m ? self : nil
end
This implementation involves only one hash lookup.
Updated by nobu (Nobuyoshi Nakada) 8 months ago
Updated by shyouhei (Shyouhei Urabe) 8 months ago
Updated by Eregon (Benoit Daloze) 8 months ago
That implementation using size
is not thread-safe, even on CRuby AFAIK.
For example, if T2 calls add?
with a new element while T1 calls add?
with an existing element.
If T1 is just before m = size
when T2 executes add(o)
, then both threads return "element added" but T1 did not add an element (incorrect result).
The original code with two lookups does not have that race condition.
However it can have the race condition that two threads adding the same new element both return "element added".
Hash#exchange_value
would fix that.
Updated by shyouhei (Shyouhei Urabe) 8 months ago
Yes. add(o) unless include?(o)
isn't thread safe already. My implementation just doesn't care to improve that.
Updated by AMomchilov (Alexander Momchilov) 8 months ago
shyouhei (Shyouhei Urabe) wrote in #note-8:
Why not:
Because I didn't think of that :)
I would be okay with it, but I think the thread safety issue is also worth solving. The implementation I'm proposing solves both the performance and safety problems.