Project

General

Profile

Feature #15281

Speed up Set#intersect with size check.

Added by RGBD (Oleg Zubchenko) 10 months ago. Updated 3 months ago.

Status:
Assigned
Priority:
Normal
Target version:
-
[ruby-core:89696]

Description

Current implementation computes set intersection s1 & s2 in O(s2.size) time.
It can be reduced to O([s1.size, s2.size].min) time.

Additional speedup comes from using #each instead of #do_with_enum.

See files attached for benchmarks.

Pull Request

P.S. using benchmark-ips gem


Files

intersect.rb (1.91 KB) intersect.rb RGBD (Oleg Zubchenko), 11/03/2018 12:57 PM
intersect_standalone.rb (671 Bytes) intersect_standalone.rb RGBD (Oleg Zubchenko), 11/03/2018 12:58 PM
benchmark_results.txt (3.68 KB) benchmark_results.txt RGBD (Oleg Zubchenko), 11/03/2018 01:19 PM

History

#1

Updated by RGBD (Oleg Zubchenko) 10 months ago

  • Description updated (diff)
#2

Updated by RGBD (Oleg Zubchenko) 10 months ago

  • Description updated (diff)

Updated by marcandre (Marc-Andre Lafortune) 10 months ago

Thanks for the patch.

Sadly, I don't think we can do that as Sets are ordered. That optimization would change the order of the resulting Set in some cases.

Updated by RGBD (Oleg Zubchenko) 10 months ago

marcandre (Marc-Andre Lafortune) wrote:

Thanks for the patch.

Sadly, I don't think we can do that as Sets are ordered. That optimization would change the order of the resulting Set in some cases.

Where in the documentation can I read that?

this says order is uncertain. I thought it means noone should make assumptions about it.
Top of the page states:

Set implements a collection of unordered values with no duplicates. This is a hybrid of Array's intuitive inter-operation facilities and Hash's fast lookup.

Updated by marcandre (Marc-Andre Lafortune) 10 months ago

  • Assignee set to matz (Yukihiro Matsumoto)

RGBD (Oleg Zubchenko) wrote:

Set implements a collection of unordered values with no duplicates. This is a hybrid of Array's intuitive inter-operation facilities and Hash's fast lookup.

Good point.

Note that this documentation is 15 years old and predates Hash being ordered.

I'd be enclined to say that Set should be officially ordered, even if "mathematically" speaking that doesn't make sense. I'm assigning this to Matz.

Updated by duerst (Martin Dürst) 10 months ago

marcandre (Marc-Andre Lafortune) wrote:

I'd be enclined to say that Set should be officially ordered, even if "mathematically" speaking that doesn't make sense. I'm assigning this to Matz.

I'd be inclined to say that Set should be officially UNordered, because they are mathematically unordered.

Updated by ahorek (Pavel Rosický) 10 months ago

we can do that as Sets are ordered

IMO if they are, it's an implementation detail that you shouldn't rely on. Also there's more room to optimize unordered sets.

if you need an ordered (or indexed?) set, it should be a subclass that implements this behaviour on the top of the generic unordered set. Something like https://ruby-doc.org/stdlib-2.5.3/libdoc/set/rdoc/SortedSet.html

Updated by Eregon (Benoit Daloze) 4 months ago

I think making Set unordered would be a big breaking change, similar to making Hash unordered (as it was in 1.8).
Maybe people forgot, but I find it pretty bad to work with unordered Hashes,
e.g., every iteration method like each yields a confusing order changing on potentially every mutation.

I'm also fairly confident there is code out there relying on Set ordering.

This proposal doesn't propose to make Set unordered though, but it would introduce non-determinism in the order of elements returned by Set#&.
That seems not ideal.

Also, this trivial optimization can already be done at the application level if needed, and there the change of ordering is obvious and conscious:

set1, set2 = ...
if set1.size < set2.size
  set1 & set2
else
  set2 & set1
end

So I don't think it's worth introducing non-determinism/changing order in Set for something that can be expressed so simply in application code, with the advantage of the ordering trade-off being clear if done in application code.

Updated by nobu (Nobuyoshi Nakada) 3 months ago

  • Assignee changed from matz (Yukihiro Matsumoto) to knu (Akinori MUSHA)

The author of set.rb is knu.

#11

Updated by nobu (Nobuyoshi Nakada) 3 months ago

  • Status changed from Open to Assigned

Updated by knu (Akinori MUSHA) 3 months ago

It is clearly stated that Set is an unordered collection and it doesn't even guarantee any determinacy about the order of elements, and as you know, it's just it happened to become ordered and deterministic when the underlying data structure (Hash) got an order.

That being said, I can understand the need for an ordered set if there are already some users using Set as such. We could probably introduce OrderedSet for easier migration before making this "breaking" change (and maybe some others to follow) to Set, I guess?

Updated by sawa (Tsuyoshi Sawada) 3 months ago

There is a mathematical term "ordered set", which means a different thing: an algebra in which the elements are given intrinsic order.

The concept in question should be named Tuple or UniqueTuple.

Updated by chrisseaton (Chris Seaton) 3 months ago

I'm not sure it should be called Tuple - to me that implies that elements can appear more than once - that it's a multi-set - and this isn't the case.

Updated by Eregon (Benoit Daloze) 3 months ago

Just a note, for compatibility there is often a gap between what the documentation states and what Ruby programs assume.
So even though the documentation says it's unordered, I would guess there are programs relying on Sets being ordered since Ruby 1.9.

Maybe not so many programs, but I think we should evaluate before changing.

Also available in: Atom PDF