Feature #11076

closed

Enumerable method count_by

Added by haraldb (Harald Böttiger) over 9 years ago. Updated over 5 years ago.

Status:
Closed
Target version:
-
[ruby-core:<unknown>]

Description

I very often use Hash[array.group_by{|x|x}.map{|x,y|[x,y.size]}].

It would be nice to have a method called count_by:

array = ['aa', 'aA', 'bb', 'cc']
p array.count_by(&:downcase) #=> {'aa'=>2,'bb'=>1,'cc'=>1}
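A minimal sketch of the requested behaviour, written as a standalone helper rather than a core method. The name count_by and its signature are hypothetical here; they follow the proposal above and are not an existing Enumerable method.

```ruby
# Hypothetical helper (not part of Ruby core): counts the elements of
# +enum+, keyed by the result of the given block.
def count_by(enum)
  enum.each_with_object(Hash.new(0)) do |item, counts|
    counts[yield(item)] += 1
  end
end

array = ['aa', 'aA', 'bb', 'cc']
count_by(array, &:downcase) #=> {"aa"=>2, "bb"=>1, "cc"=>1}
```

This does the counting in a single pass and never builds the intermediate arrays that group_by would create.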

Updated by shevegen (Robert A. Heiler) over 9 years ago

Can you also add a sentence or two for documentation? :-)

It may lower the entry barrier for adding a method such as the above (I assume it must be documented by someone before it could be added).

Updated by duerst (Martin Dürst) over 9 years ago

Having this would definitely be very useful. I remember having searched for a 'count_by' method more than once in the past.

Updated by ko1 (Koichi Sasada) over 9 years ago

+1

Updated by haraldb (Harald Böttiger) over 9 years ago

Robert A. Heiler wrote:

Can you also add a sentence or two for documentation? :-)

I am sorry, but I am not sure how to format this properly. The documentation would be something like:

Syntax:
  count_by { |obj| block } → a_hash
  count_by → an_enumerator

Description:
  Groups the collection by the result of the block. Returns a hash where the keys are the evaluated results from the block and the values are the number of elements in the collection that correspond to each key.

  If no block is given, an enumerator is returned.

Examples:
  ['a','a','a','b','c'].count_by { |x| x } #=> {'a'=>3, 'b'=>1, 'c'=>1}
  (1..7).count_by { |i| i%3 }   #=> {0=>2, 1=>3, 2=>2}

Updated by baweaver (Brandon Weaver) over 6 years ago

Has there been any thought on this as a language feature?

There was an earlier conversation demonstrating a practical use for this feature, and I had mentioned a few of the core maintainers to bring the subject back into consideration:

https://twitter.com/keystonelemur/status/1012434696909852672

nobu had recently updated his patch here:

https://github.com/ruby/ruby/compare/trunk...nobu:feature/11076-Enumerable%23count_by

I still believe this would be an incredibly useful feature to have in the core of the language, as a very common pattern to work around it is unintuitive for newer programmers:

# Most common
array
  .group_by { |v| v }
  .map { |k, v| [k, v.size] }
  .to_h

# In older versions:
Hash[array.group_by { |v| v }.map { |k, v| [k, v.size] }]

# or in more recent versions:
array
  .group_by { |v| v }
  .transform_values(&:size)

# or using reduce / ewo:
array.each_with_object(Hash.new(0)) { |v, h| h[v] += 1 }

By giving a name to this concept, we've made it more accessible as well. Given the current trend of 2.6, I believe this would be a welcome addition.
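The workarounds listed above can be checked against each other. The sketch below wraps them in a hypothetical helper, counting_workarounds (a name chosen only for this illustration), and assumes Ruby 2.4 or later for transform_values.

```ruby
# Hypothetical helper: runs each workaround from the list above on the
# same input and returns the results keyed by a descriptive label.
def counting_workarounds(array)
  {
    map_to_h:         array.group_by { |v| v }.map { |k, v| [k, v.size] }.to_h,
    hash_brackets:    Hash[array.group_by { |v| v }.map { |k, v| [k, v.size] }],
    transform_values: array.group_by { |v| v }.transform_values(&:size),
    each_with_object: array.each_with_object(Hash.new(0)) { |v, h| h[v] += 1 }
  }
end

results = counting_workarounds(%w[a b a c a b])
results[:map_to_h]                                    #=> {"a"=>3, "b"=>2, "c"=>1}
results.values.all? { |r| r == results[:map_to_h] }   #=> true
```

All four produce the same hash, which is part of the argument for giving the concept a single name.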

Updated by knu (Akinori MUSHA) over 6 years ago

In today's developer meeting, Matz understood the need for the feature but didn't like the name. One point he made was that existing pairs like sort/sort_by and max/max_by share their features, so count_by() might not go well with count().

Updated by baweaver (Brandon Weaver) over 6 years ago

group_count? It's half-way between group_by and count

Updated by janfri (Jan Friedrich) over 6 years ago

As Naruse in DevelopersMeeting20180809 mentioned: It is a histogram function.
How about histogram_by (and for the block-less counterpart histogram)?

Updated by djones (David Jones) over 6 years ago

How about tally?

array = ['aa', 'aA', 'bb', 'cc']
p array.tally(&:downcase) #=> {'aa'=>2,'bb'=>1,'cc'=>1}

tally describes quite well to me what this method does and avoids clashing with group or count.
tally_by might be worthy of consideration too.

Definition of "Tally"

Current score or amount: that takes his tally to 10 goals in 10 games.

  1. a record of a score or amount: I kept a running tally of David's debt on a note above my desk.
  2. a particular number taken as a group or unit to facilitate counting.
  3. a mark registering a number or amount.
  4. an account kept by means of a tally.

Updated by baweaver (Brandon Weaver) about 6 years ago

@matz (Yukihiro Matsumoto) / @ko1 (Koichi Sasada): Any chance of this making it into 2.6? The code is already done (thanks @nobu (Nobuyoshi Nakada)) and the only consideration left is the name. Would tally_by be an acceptable compromise?

Updated by janfri (Jan Friedrich) about 6 years ago

Just my 2 cents: I'm not a native English speaker, and I had never heard the word "tally" before. So I would never remember it and would always have to look it up in the API docs.

Updated by odlp (Oliver Peate) about 6 years ago

For me the definition of tally does seem to fit the use case, so +1 to tally(_by).

Couple of alternatives, how about:

  • census (as in census_by(&:downcase))
  • inventory (either inventory or inventory_by)

Both are more widely used than tally (although I think tally is the better choice):

https://books.google.com/ngrams/graph?content=tally%2Ccensus%2Ccount%2Cinventory&case_insensitive=on&year_start=1900&year_end=2018

Updated by inopinatus (Joshua GOODALL) almost 6 years ago

A histogram refers to counts of items in ranges of otherwise continuous data. But this function is more general than that, so I think histogram is too specific a term.

For this native English speaker, tally is the most precisely fitted method name.

Updated by mame (Yusuke Endoh) almost 6 years ago

I have learnt the word "tally" in this thread. Thank you. It looks good to me, a non-native speaker. I have put this on the agenda of the next developers' meeting.

By the way, what is the precise semantics of the method?

Question 1. What identity is the object in the keys?

str1 = "a"
str2 = "a"
t = [str1, str2].tally

p t  #=> { "a" => 2 }

p t.keys.first.object_id #=> str1.object_id or str2.object_id ?

IMO: I think it should prefer the first element, so it should be equal to str1.object_id.

Question 2. What is the key of tally_by?

str1 = "a"
str2 = "A"
t = [str1, str2].tally_by(&:upcase)

p t  #=> { "a" => 2 } or { "A" => 2 } ?

p t.keys.first.object_id #=> str1.object_id, str2.object_id, or otherwise?

IMO: The return values of sort_by and max_by contain the original elements, not the return values of the block. By analogy with them, I think that t should be { "a" => 2 } and the key's object_id should be str1.object_id.
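The analogy in Question 2 can be illustrated with existing methods only. This sketch uses an arbitrary sample array and shows that sort_by and max_by hand back original elements, while group_by keys its result by the block's return values.

```ruby
words = %w[bb a ccc]

# sort_by and max_by return the original elements; the block result is
# used only for comparison.
words.sort_by(&:length)  #=> ["a", "bb", "ccc"]
words.max_by(&:length)   #=> "ccc"

# group_by, in contrast, uses the block results themselves as hash keys.
words.group_by(&:length) #=> {2=>["bb"], 1=>["a"], 3=>["ccc"]}
```

Which of the two precedents a tally_by method should follow is exactly the open question here.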

Updated by mrkn (Kenta Murata) almost 6 years ago

enumerable-statistics provides value_counts method.
https://github.com/mrkn/enumerable-statistics/blob/master/ext/enumerable/statistics/extension/statistics.c#L1651-L1668
It is designed to follow pandas’s Series.value_counts.

Updated by baweaver (Brandon Weaver) almost 6 years ago

mame (Yusuke Endoh) wrote:

I have learnt the word "tally" in this thread. Thank you. It looks good to me, a non-native speaker. I have put this on the agenda of the next developers' meeting.

By the way, what is the precise semantics of the method?

Question 1. What identity is the object in the keys?

str1 = "a"
str2 = "a"
t = [str1, str2].tally

p t  #=> { "a" => 2 }

p t.keys.first.object_id #=> str1.object_id or str2.object_id ?

IMO: I think it should prefer the first element, so it should be equal to str1.object_id.

Question 2. What is the key of tally_by?

str1 = "a"
str2 = "A"
t = [str1, str2].tally_by(&:upcase)

p t  #=> { "a" => 2 } or { "A" => 2 } ?

p t.keys.first.object_id #=> str1.object_id, str2.object_id, or otherwise?

IMO: The return values of sort_by and max_by contain the original elements, not the return values of the block. By analogy with them, I think that t should be { "a" => 2 } and the key's object_id should be str1.object_id.

Answer 1: I would say the first, but tally could also be effectively represented by tally_by(&:itself) as shown in an implementation below:

Answer 2: The transformed value, like group_by:

[1, 2, 3].group_by(&:even?)
=> {false=>[1, 3], true=>[2]}

[1, 2, 3].tally_by(&:even?)
=> {false => 2, true => 1}

The implementation is similar to this:

module Enumerable
  # Implementing via group_by
  def tally_by(&fn)
    group_by(&fn).to_h { |k, vs| [k, vs.size] }
  end

  # Implementing via reduction
  def tally_by2(&fn)
    each_with_object(Hash.new(0)) { |v, a| a[fn[v]] += 1 }
  end
end

...which would result in the first object_id I believe.

Updated by nobu (Nobuyoshi Nakada) almost 6 years ago

https://github.com/nobu/ruby/pull/new/feature/11076-Enumerable%23tally

As Hash#[]= copies string keys, the object_id will be unique unless the item is frozen.
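The key-copying behaviour of Hash#[]= mentioned here can be observed directly. This is a small sketch with ordinary (non-identity) hashes; the variable names are for illustration only.

```ruby
mutable = String.new("a") # explicitly unfrozen string
frozen  = "a".freeze

h = {}
h[mutable] = 1
stored = h.keys.first
stored.equal?(mutable) #=> false (the hash stored a frozen copy)
stored.frozen?         #=> true

g = {}
g[frozen] = 1
g.keys.first.equal?(frozen) #=> true (already frozen, so stored as-is)
```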

Updated by Eregon (Benoit Daloze) almost 6 years ago

For this kind of method, I wish we would implement it in Ruby even in MRI: it's much simpler, more readable, and every Ruby implementation could use it.

Updated by sawa (Tsuyoshi Sawada) almost 6 years ago

knu (Akinori MUSHA) wrote:

In today's developer meeting, Matz understood the need for the feature but didn't like the name. One point he made was that existing pairs like sort/sort_by and max/max_by share their features, so count_by() might not go well with count().

Since this feature is an inferior variant of group_by in the sense that it reduces the value arrays into their lengths, what about naming the method group?

Then, group can be read as "group the block evaluation (with their counts provided as additional information)" while group_by can be read as "group the receiver by the block evaluation".

I personally feel that it is overkill to give a new unrelated name (such as tally) for such a feature that looks quite specific and narrow in nature.

And it is also a good opportunity to fill in the empty slot for the by-less variant of group_by, which has made group_by stand out and a bit awkward.

Updated by duerst (Martin Dürst) almost 6 years ago

sawa (Tsuyoshi Sawada) wrote:

Since this feature is an inferior variant of group_by in the sense that it reduces the value arrays into their lengths, what about naming the method group?

Please not. The _by indicates that there is some specific criterion for grouping. This is the same for this method, so removing the _by is very strange. Also, the fact that the result contains numbers, not the actual groups, is completely lost.

Compared with this, count_by is much better, and so is tally. Other possibilities might be group_by_and_count or count_by_group or something similar.

Updated by mame (Yusuke Endoh) almost 6 years ago

baweaver (Brandon Weaver) wrote:

Answer 2: The transformed value, like group_by:

[1, 2, 3].group_by(&:even?)
=> {false=>[1, 3], true=>[2]}

[1, 2, 3].tally_by(&:even?)
=> {false => 2, true => 1}

If we have tally, we can implement this behavior easily: [1, 2, 3].map {|x| x.even? }.tally. Is a new method really needed just for a shorthand of this behavior?
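For reference, the map-then-tally form suggested here, next to the older group_by workaround. This assumes a Ruby version in which Enumerable#tally is available (2.7 or later).

```ruby
numbers = [1, 2, 3]

# map first, then count the mapped values.
numbers.map { |x| x.even? }.tally                     #=> {false=>2, true=>1}

# The same counts via group_by, as used before tally existed.
numbers.group_by(&:even?).transform_values(&:size)    #=> {false=>2, true=>1}
```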

Updated by matz (Yukihiro Matsumoto) almost 6 years ago

OK, tally sounds reasonable. Accepted.

Matz.

Updated by mame (Yusuke Endoh) almost 6 years ago

  • Status changed from Open to Assigned
  • Assignee set to mame (Yusuke Endoh)

Thanks, I'll implement it.

Note that tally_by is not accepted yet. We need to discuss the detail later (if needed).

Updated by mame (Yusuke Endoh) almost 6 years ago

  • Assignee changed from mame (Yusuke Endoh) to nobu (Nobuyoshi Nakada)

Nobu has already started creating a patch. Leave it to him.

Updated by nobu (Nobuyoshi Nakada) almost 6 years ago

  • Status changed from Assigned to Closed

Applied in changeset trunk|r67020.


enum.c: Enumerable#tally

  • enum.c (enum_tally): new methods Enumerable#tally, which group
    and count elements of the collection. [Feature #11076]

Updated by baweaver (Brandon Weaver) almost 6 years ago

mame (Yusuke Endoh) wrote:

baweaver (Brandon Weaver) wrote:

Answer 2: The transformed value, like group_by:

[1, 2, 3].group_by(&:even?)
=> {false=>[1, 3], true=>[2]}

[1, 2, 3].tally_by(&:even?)
=> {false => 2, true => 1}

If we have tally, we can implement this behavior easily: [1, 2, 3].map {|x| x.even? }.tally. Is a new method really needed just for a shorthand of this behavior?

It's common enough that the syntax may be justified. It could be argued that a lot of shorthand expressions aren't technically necessary, but I feel that this is what makes Ruby Ruby: the ability to say something common with less.

That, and there's established precedent of count / count_by, max / max_by, and others that would make this an easily adopted syntax. If it's not adopted I would not be surprised to see a follow-up request to add it.

I would see tally_by and other *_by methods as the base for their counterparts, such that:

[1,2,3].tally == [1,2,3].tally_by(&:itself)

Where the non-*_by method is effectively the *_by method implemented with the itself identity function.

Updated by mame (Yusuke Endoh) almost 6 years ago

baweaver (Brandon Weaver) wrote:

It's a common enough that the syntax may be justified.

That's just because "map + something" is frequent. However, blindly adding a "map" feature to anything does not make sense to me. In fact, "map + select" is much more frequent, but it has not been introduced yet (#5663, #15323). If we add "tally_by" as a shorthand for "map + tally", we should first confirm that the combination is truly frequent (i.e., that "tally" is rarely used without "map"). We can do that after "tally" alone has been released.

Updated by jonathanhefner (Jonathan Hefner) over 5 years ago

"map + select" is much more frequent, but it is not introduced yet

I think it would also be nice if filter_map was added. However, a specific justification for adding tally_by is to avoid an extra array allocation. filter_map can already be expressed as map { ... }.compact! to avoid allocating an extra array. But there is no way to avoid an extra allocation with map { ... }.tally.
