Feature #21518: Statistical helpers to `Enumerable` - Ruby - Ruby Issue Tracking System

Added by Amitleshed (Amit Leshed) 12 months ago. Updated 5 months ago.

Status: Open

[ruby-core:122842]

Description

Summary

I'd like to add two statistical helpers to Enumerable:

Enumerable#average (arithmetic mean)
Enumerable#median

Both are small, well-defined operations that many Rubyists re-implement in apps and gems. Providing them in core avoids repeated, ad-hoc code and aligns with Enumerable#sum, which Ruby already ships.

Motivation

These are among the most common “roll-your-own” helpers for arrays/ranges of numbers.
They are conceptually simple and universally useful beyond web/Rails.
Similar to sum, they’re primitives for quick data analysis, ETL scripts, CLI tooling, etc.
Including them encourages consistent semantics (what to do with empty sets, mixed numerics, etc.).

Proposed API & Semantics

Enumerable#average -> Float or nil
Enumerable#median  -> Numeric or nil

[1, 2, 3, 4].average      # => 2.5
(1..4).average            # => 2.5
[].average                # => nil

[1, 3, 2].median          # => 2
[1, 2, 3, 10].median      # => 2.5
(1..6).median             # => 3.5
[].median                 # => nil

Ruby implementation

module Enumerable
  def average
    count = 0
    total = 0.0
    each do |x|
      raise TypeError, "non-numeric value for average" unless x.is_a?(Numeric)
      total += x
      count += 1
    end
    count.zero? ? nil : total / count
  end

  def median
    arr = to_a
    return nil if arr.empty?
    arr.each { |x| raise TypeError, "non-numeric value for median" unless x.is_a?(Numeric) }
    arr.sort!
    mid = arr.length / 2
    arr.length.odd? ? arr[mid] : (arr[mid - 1] + arr[mid]) / 2.0
  end
end

Upon approval I'm more than willing to implement spec and code in C.

Related issues: 4 (1 open, 3 closed)

Updated by Dan0042 (Daniel DeLorme) 12 months ago · Edited
#1 [ruby-core:122856]

In favor; just be careful about the bug in #median:

x = [1, 3, 2]
x.median #=> 2
x #=> [1, 2, 3] modified by #median

You'll want to use arr = entries rather than arr = to_a
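
To illustrate why the two calls differ, here is a small sketch (the variable names are only for illustration). Array#to_a returns the receiver itself, while Enumerable#entries builds a new Array, so sorting the result of entries in place leaves the caller's data alone:

```ruby
original = [1, 3, 2]

# Array#to_a hands back the very same object, so sort! on it would mutate the caller's array.
same_object = original.to_a.equal?(original)

# Enumerable#entries allocates a fresh Array, which is safe to sort in place.
copy = original.entries
copy.sort!
```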

Updated by Amitleshed (Amit Leshed) 12 months ago
#2 [ruby-core:122858]

Thanks, great catch!

Updated by herwin (Herwin W) 12 months ago
#3 [ruby-core:122864]

Ranges might need their own specialised implementation: this implementation will time out on infinite ranges, and (1..100000).average (or .median) can be calculated without having to create an intermediate array. (Why anyone would want to calculate these values from these kinds of Ranges is beyond me, but that's another issue.)
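
As a rough sketch of the closed-form idea for ranges, assuming only finite Integer ranges need the shortcut: for consecutive integers the mean and the median are both the midpoint of the first and last element. The helper name integer_range_midpoint is hypothetical and not part of the proposal.

```ruby
# Hypothetical helper: mean (and median) of a finite Integer range in O(1).
# For consecutive integers, mean == median == (first + last) / 2.
def integer_range_midpoint(range)
  first = range.begin
  last = range.end
  # Endless or beginless ranges have nil at one end and are rejected here.
  unless first.is_a?(Integer) && last.is_a?(Integer)
    raise ArgumentError, "only finite Integer ranges are supported"
  end
  last -= 1 if range.exclude_end?   # 1...5 covers 1, 2, 3, 4
  return nil if last < first        # empty range, e.g. 5..1
  (first + last) / 2.0
end
```

Float ranges and infinite ranges are deliberately left out of this sketch, since an exclusive Float range has no well-defined last element.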

Updated by Amitleshed (Amit Leshed) 12 months ago
#4 [ruby-core:122865]

Thanks for the engagement, everyone.

Here's a refactored version:

module Enumerable
  def average
    return nil if none?
    return range_midpoint if numeric_range?

    total = 0.0
    count = 0
    each do |x|
      raise TypeError, "non-numeric value for average" unless x.is_a?(Numeric)
      total += x
      count += 1
    end
    total / count
  end

  def median
    return nil if none?
    return range_midpoint if numeric_range?

    arr = entries
    arr.each { |x| raise TypeError, "non-numeric value for median" unless x.is_a?(Numeric) }
    arr.sort!
    mid = arr.length / 2
    arr.length.odd? ? arr[mid] : (arr[mid - 1] + arr[mid]) / 2.0
  end

  private

  def numeric_range?
    is_a?(Range) && first.is_a?(Numeric) && last.is_a?(Numeric)
  end

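  def range_midpoint
    # Range#step does not return a number, so subtract 1 directly;
    # this adjustment is only exact for Integer ranges.
    max = exclude_end? ? (last - 1) : last
    (first + max) / 2.0
  end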
end

Updated by mame (Yusuke Endoh) 12 months ago
#5

Related to Feature #2321: [PATCH] Array Module sum and mean features added
Related to Feature #18057: Introduce Array#mean added

Updated by mame (Yusuke Endoh) 12 months ago
#6

Related to Feature #10228: Statistics module added

Updated by mame (Yusuke Endoh) 12 months ago
#8

Related to Feature #12222: Introducing basic statistics methods for Enumerable (and optimized implementation for Array) added

Updated by mame (Yusuke Endoh) 12 months ago
#9 [ruby-core:122866]

Naturally, these methods have been desired by some people for a very long time, but Ruby has historically been very cautious about introducing them. Even the obviously useful #sum method was only added in 2016, which is relatively recent in Ruby's history.

One reason behind this caution is the reluctance to add methods to Array that assume all elements are Integer or Float. Since Array can contain Strings or other non-numeric objects, there's a question of whether it is appropriate to add methods that make no sense in such cases.

The reason why #sum was eventually added was the growing attention to an algorithm called the Kahan-Babuska Summation Algorithm. This is a clever algorithm that reduces floating-point error when summing, and it is actually implemented in Array#sum. Before this algorithm gained attention, I remember the prevailing opinion was that it should be written explicitly, like ary.inject(0, &:+).
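
For readers unfamiliar with the algorithm, here is a minimal sketch of Kahan-Babuska compensated summation in plain Ruby (the Neumaier formulation). It is an illustration only, not the C code behind Array#sum, and the method name compensated_sum is made up for this example.

```ruby
# Compensated summation: keep a running correction term for the low-order
# bits that are lost each time two floats of different magnitude are added.
def compensated_sum(values)
  sum = 0.0
  compensation = 0.0
  values.each do |x|
    t = sum + x
    if sum.abs >= x.abs
      compensation += (sum - t) + x   # low-order digits of x were lost
    else
      compensation += (x - t) + sum   # low-order digits of sum were lost
    end
    sum = t
  end
  sum + compensation
end

values = [1.0, 1e100, 1.0, -1e100]
naive = values.inject(0.0, :+)        # => 0.0, both 1.0 terms are swallowed
stable = compensated_sum(values)      # => 2.0
```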

For now, you may want to try using https://github.com/red-data-tools/enumerable-statistics to get a better idea of what you actually need.

Updated by matheusrich (Matheus Richard) 11 months ago
#10 [ruby-core:122890]

I wonder if these helpers could be inside Math::Statistics:

Math::Statistics.average(some_enumerable)

I think it would be okay for this module to assume the arguments are numeric.
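
A minimal sketch of what such a namespaced helper could look like, assuming the hypothetical Math::Statistics module suggested above (no such module exists in Ruby today) and keeping the nil-on-empty behaviour from the original proposal:

```ruby
module Math
  # Hypothetical module; shown only to illustrate the call style.
  module Statistics
    module_function

    # Arithmetic mean of any enumerable of numerics; nil when it is empty.
    def average(enumerable)
      values = enumerable.to_a
      return nil if values.empty?
      values.sum.fdiv(values.size)
    end
  end
end

Math::Statistics.average([1, 2, 3, 4])   # => 2.5
```

Because the helper takes its data as an argument, nothing is added to Array or Enumerable, which sidesteps the concern about non-numeric elements.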

Updated by matz (Yukihiro Matsumoto) 11 months ago
#11 [ruby-core:123011]

I am positive about adding those methods, but I am no expert on Mathematics nor Statistics.

Matz.

Updated by mrkn (Kenta Murata) 11 months ago · Edited
#12 [ruby-core:123046]

Hi. I'm the creator of the enumerable-statistics gem and the original proposer of Array#sum and Enumerable#sum.

In general, adding only mean (I prefer mean over average, see below) and median won't cover real-world statistical needs. When a sample mean is required, variance or standard deviation usually follow; where a sample median is used, quantiles or percentiles typically follow. Truly “median-only” scenarios are rare in my experience.

If these are added to core, we should set a high bar: numerically stable, one-pass algorithms with a C implementation for performance; and for median/percentiles computations, avoid full sort in favor of selection algorithms such as quickselect.

The enumerable-statistics gem already provides simple one-pass combined methods such as mean_variance and mean_stdev. median and percentile for Enumerable remain to be implemented.
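
To make "one-pass and numerically stable" concrete, here is a sketch using Welford's online algorithm. This is a well-known technique chosen for illustration; it is not claimed to be what the gem does internally, and the method name welford_mean_variance is hypothetical.

```ruby
# One pass over the data, updating the running mean and the sum of squared
# deviations (m2) incrementally instead of computing sum(x**2) - n * mean**2.
def welford_mean_variance(values)
  count = 0
  mean = 0.0
  m2 = 0.0
  values.each do |x|
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
  end
  return [nil, nil] if count.zero?
  # Sample variance (n - 1 denominator); undefined for a single observation.
  variance = count > 1 ? m2 / (count - 1) : nil
  [mean, variance]
end

mean, variance = welford_mean_variance([1, 2, 3, 4])
```

It works on any Enumerable, including lazy or streaming sources, because it never needs the data a second time.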

On naming: I strongly prefer mean over average for consistency with other programming languages and libraries (cf. #18057 note-8). Across Python/NumPy/Pandas, R, Julia, MATLAB, etc., mean is the standard term and API name. Aligning with that convention keeps Ruby familiar to users who work across stacks (acknowledging that a few general-purpose APIs, e.g., LINQ, use average).

Updated by matheusrich (Matheus Richard) 11 months ago
#13 [ruby-core:123054]

An average alias would be nice, though.

Updated by Eregon (Benoit Daloze) 11 months ago
#14 [ruby-core:123056]

mrkn (Kenta Murata) wrote in #note-12:

In general, adding only mean (I prefer mean over average, see below) and median won't cover real-world statistical needs. When a sample mean is required, variance or standard deviation usually follow; where a sample median is used, quantiles or percentiles typically follow. Truly “median-only” scenarios are rare in my experience.

I think mean and median are frequently needed (at least I have reimplemented them many times) and would be worth adding to Array.
Not sure of the value to add them to Enumerable instead of Array (it would be much slower implemented on Enumerable).

I typically use the median absolute deviation as a robust measure of the variability when using the median, and that can be trivially implemented on top of #median.
So for that case, only median is enough.
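
As a sketch of that point, the median absolute deviation is just the median of the absolute distances from the median. Since #median does not exist yet, this example defines a plain sort-based median helper first; both method names are placeholders.

```ruby
# Simple sort-based median, standing in for the proposed #median.
def median(values)
  sorted = values.sort          # Array#sort returns a new array
  return nil if sorted.empty?
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Median absolute deviation: median of |x - median(values)|.
def median_absolute_deviation(values)
  center = median(values)
  return nil if center.nil?
  median(values.map { |x| (x - center).abs })
end

median_absolute_deviation([1, 1, 2, 2, 4, 6, 9])   # => 1
```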

Regarding variance or standard deviation, those are not robust and are over-influenced by outliers, so I think it would make sense not to provide them, because they are often no longer recommended.

mrkn (Kenta Murata) wrote in #note-12:

avoid full sort in favor of selection algorithms such as quickselect.

That seems like one good reason to add it to core: the optimal algorithm is actually non-trivial and cannot easily be written as a Ruby one-liner for median.
mean is trivial but would still be nice to provide given it's so frequently used (also, data.sum / data.size.to_f is not so pretty).
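
To give a feel for why selection is more involved than a one-liner, here is a compact quickselect sketch. It is a simplified, allocation-heavy variant for illustration (a core implementation would presumably partition in place in C); the method names are hypothetical, and it assumes mutually comparable values with no NaN.

```ruby
# Returns the k-th smallest element (0-based) in expected linear time by
# repeatedly partitioning around a random pivot and keeping one side.
def quickselect(values, k)
  pool = values
  loop do
    pivot = pool.sample
    lows, rest = pool.partition { |x| x < pivot }
    highs, pivots = rest.partition { |x| x > pivot }
    if k < lows.size
      pool = lows
    elsif k < lows.size + pivots.size
      return pivot
    else
      k -= lows.size + pivots.size
      pool = highs
    end
  end
end

def median_by_selection(values)
  arr = values.to_a
  return nil if arr.empty?
  mid = arr.size / 2
  return quickselect(arr, mid) if arr.size.odd?
  # Even length: average the two middle order statistics.
  (quickselect(arr, mid - 1) + quickselect(arr, mid)) / 2.0
end
```

Array#partition returns new arrays, so the input is never mutated, at the cost of extra allocations.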

Percentiles would be nice, especially if there is a more efficient algorithm for them than just sorting + indexing.
Percentiles are frequently useful e.g. to characterize response time/latency and also for boxplots. It's also a more robust way (e.g. with quartiles, so just 25 and 75 percentiles) to measure the variability than the standard deviation.

Updated by trinistr (Alexander Bulancov) 5 months ago
#15 [ruby-core:124710]

Percentiles would be nice, especially if there is a more efficient algorithm for them than just sorting + indexing.

It may be reasonable to implement percentiles without sorting, expecting the array to be pre-sorted, in the same way as Array#bsearch works. I think in most cases where percentiles are used, several values will be needed, and using some tricky algorithm will probably be more expensive than 1 sort + N fetches.
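
A sketch of the "1 sort + N fetches" approach, assuming the caller passes an already sorted array (as with Array#bsearch, nothing here checks that) and using linear interpolation between neighbouring ranks, which is only one of several percentile conventions. The method name percentile_sorted is hypothetical.

```ruby
# Percentile of a pre-sorted array; p is between 0 and 100.
def percentile_sorted(sorted, p)
  return nil if sorted.empty?
  rank = (p / 100.0) * (sorted.size - 1)   # fractional index into the array
  lower = rank.floor
  upper = rank.ceil
  return sorted[lower] if lower == upper   # rank falls exactly on an element
  sorted[lower] + (sorted[upper] - sorted[lower]) * (rank - lower)
end

latencies = [40, 10, 30, 20].sort          # sort once
quartiles = [25, 50, 75].map { |p| percentile_sorted(latencies, p) }
```

Each lookup after the sort is O(1), so requesting many percentiles costs little beyond the single sort.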


Related to Ruby - Feature #2321: [PATCH] Array Module sum and mean features (Rejected)
Related to Ruby - Feature #18057: Introduce Array#mean (Open)
Related to Ruby - Feature #10228: Statistics module (Feedback)
Related to Ruby - Feature #12222: Introducing basic statistics methods for Enumerable (and optimized implementation for Array) (Closed, akr (Akira Tanaka))
