Bug #16996

Hash should avoid doing unnecessary rehash

Added by marcandre (Marc-Andre Lafortune) 3 months ago. Updated 3 months ago.

Target version:


Pop quiz: Which is the fastest way to get a copy of a Hash h?

If, like me, you thought h.dup (of course, right?), you are actually wrong.

The fastest way is to call h.merge. Try it:

require 'benchmark/ips'

lengths = 1..50

h = lengths.to_h { |i| ['x' * i, nil] }

Benchmark.ips do |x|"dup")        { h.dup }"merge")      { h.merge }

I get

Calculating -------------------------------------
                 dup    259.233k (± 9.2%) i/s -      1.285M in   5.013445s
               merge    944.095k (± 8.2%) i/s -      4.693M in   5.005315s

Yup, it's 3.5x faster with this example!!

Why? Because Hash#dup does a rehash, and merge does not.

Pop quiz 2: which methods of Hash that produce a new hash do a rehash?

Answer: it depends on the method and on the Ruby version

| Does this rehash?               | head | 2.7 | 2.6 | 2.5 | 2.4 | 2.3 | 2.2 | 2.1 | 2.0 |
| h.dup / clone                   |  Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
|{true} / reject{false}  |  Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
|!{true} / reject!{false}|   Ø  |  Ø  |  Ø  |  Ø  |  Ø  |  Ø  |  Ø  |  Ø  |  Ø  |
| sub_h.to_h                      |   Ø  |  Ø  |  Ø  |  Ø  |  Ø  |  Ø  |  Ø  |  Ø  |  Ø  |
| h.merge({})                     |   Ø  |  Ø  |  Ø  |  Ø  | Yes | Yes | Yes | Yes | Yes |
| h.merge                         |   Ø  |  Ø  |  Ø  |             n/a                   |
| h.transform_values(&:itself)    |   Ø  |  Ø  | Yes | Yes | Yes |          n/a          |
(where `sub_h =`, Ø = no rehash)

So in Ruby head, doing h.merge({}) or even h.transform_values(&:itself) will be much faster than h.dup (but slower in Ruby 2.4) (*)

Notice that select rehashes, but select! doesn't, so the fastest way to do a select in Ruby is... not to call select and instead to actually do a!! (*)

*: on hashes with non-negligible hash functions

class Hash
  def fast_select(&block)!(&block) # don't call dup because it's slow

Benchmark.ips do |x|"select")           {{true} }"fast_select")      { h.fast_select{true} }

On my test case above, fast_select is 2.5x faster than select. fast_select will always return exactly the same result (unless the receiver needed a rehash).

Pop quiz 3: Is this a bug or a feature?

It should be clear that no feature of Ruby should be re-implementable in Ruby with a 3.5x / 2.5x speed gain, so many would think "of course it's a bug".

Well, seems to think that Hash#dup's rehash is a feature...
Because there is actually a test that dup does a rehash
Because a test of Set was failing otherwise!
Short discussion:
Actual test:
This test construct a Set that needs to be rehashed (by mutating an element of the set after it is added), and then checks that rehash_me == rehash_me.clone.
That test is bogus. It passes for obscure and undocumented reasons, and rehash_me.clone == rehash_me doesn't pass.
Today, it is official that sets with elements that are later mutated must be Set#reset, so it is official that this should not be relied upon.

Probably more clear is the case of select/reject (but I didn't check for failing test), and even more clear that merge changed in Ruby 2.5 and transform_values in 2.7, but not a single NEWS file mentions the word "rehash".

My conclusion is that Hash should avoid doing an unnecessary rehash: dup/clone/select/reject. We probably should add a reminder in the NEWS that if anyone mutates a key of a Hash, or an element of a Set and does not call rehash/reset, improper behavior should be expected.

Let's make Hash#dup/clone/select/reject fast please.

Any objection?

Also available in: Atom PDF