Project

General

Profile

Bug #20301

Updated by AMomchilov (Alexander Momchilov) 2 months ago

A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.: 

 ```ruby 
 SEEN_VALUES = Set.new 
	
 def receive_value(value) 
	 if SEEN_VALUES.add?(value) 
		 puts "Saw #{value} for the first time." 
	 else 
		 puts "Already seen #{value}, ignoring." 
	 end 
 end 

 receive_value(1) # Saw 1 for the first time. 
 receive_value(2) # Saw 2 for the first time. 
 receive_value(3) # Saw 3 for the first time. 
 receive_value(1) # Already seen 1, ignoring. 
 ``` 

 Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525)) 

 ```rb 
 class Set 
   def add?(o 
     # 1. `include?(o)` looks up into `@hash` 
     # 2. if the value isn't there, `add(o)` does a second look-up into `@hash` 
     add(o) unless include?(o) 
   end 
 end 
 ``` 

 This gets especially expensive if the values are large hash/arrays/objects, whose `#hash` is expensive to compute. 

 We can optimize this if it was possible to set a value in hash, *and* retrieve the value that was already there, in a single go. I propose adding `Hash#update_value` to do exactly that. If that existed, we can re-implement `#add?` as: 

 ```rb 
 class Set 
   def add?(o) 
     # Only requires a single look-up into `@hash`! 
     self unless @hash.update_value(o, true) 
   end 
 ``` 

 Here's a proof-of-concept implementation: PR: https://github.com/ruby/ruby/pull/10093 

 # Theory 

 How much of a benefit this has depends on 2 factors: 

 1. How much `#hash` is called, which depends on how many **new** new objects are added to the set. 
     * If every object is new, then `#hash` used to be is called twice on every `#add?`. 
         * This is where this improvement makes the biggest (2x!) change. 
     * If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement. 
         * It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set. improvement 
     * Every other case lies somewhere in between those two, depending on the % of objects which are new. two. 
 2. How slow `#hash` is to compute for the key 
     * If the hash is slow to compute, this change will make a bigger improvement 
     * If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact. 

 # Benchmark summary The full benchmark results are in the PR, but here's a summary: 

 |                             | All objects are new | All objects are preexisting | 
 |---------------------------|-------:|------:| 
 | objects with slow `#hash` | 100.0% | ~0.0% | 
 | objects with fast `#hash` |    24.5% |    4.6% | 

 As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't. 

 For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093

Back