Feature #18262
openEnumerator::Lazy#partition
Description
(Part of my set of proposals about making .lazy
more useful/popular.)
Currently:
file = File.open('very-large-file.txt')
lines_with_errors, lines_without_errors = file.lazy.partition { _1.start_with?('E:') }
lines_with_errors.class
# => Array, all file is read by this moment
This might be not very practical performance-wise and memory-wise.
I am thinking that maybe returning a pair of lazy enumerators might be a good addition to Enumerator::Lazy
Naive prototype:
class Enumerator::Lazy
def partition(&block)
buffer1 = []
buffer2 = []
source = self
[
Enumerator.new { |y|
loop do
if buffer1.empty?
begin
item = source.next
if block.call(item)
y.yield(item)
else
buffer2.push(item)
end
rescue StopIteration
break
end
else
y.yield buffer1.shift
end
end
}.lazy,
Enumerator.new { |y|
loop do
if buffer2.empty?
begin
item = source.next
if !block.call(item)
y.yield(item)
else
buffer1.push(item)
end
rescue StopIteration
break
end
else
y.yield buffer2.shift
end
end
}.lazy
]
end
end
Testing it:
Enumerator.produce(1) { |i| puts "processing #{i}"; i + 1 }.lazy
.take(30)
.partition(&:odd?)
.then { |odd, even|
p odd.first(3), even.first(3)
}
# Prints:
# processing 1
# processing 2
# processing 3
# processing 4
# processing 5
# [1, 3, 5]
# [2, 4, 6]
As you might notice by the "processing" log, it only fetched the amount of entries that was required by produced enumerators.
The drawback would be—as my prototype implementation shows—the need of internal "buffering" (I don't think it is possible to implement lazy partition without it), but it still might be worth a shot?
Updated by Dan0042 (Daniel DeLorme) over 3 years ago
+1
Since a lazy enumerator is produced for both #select and #reject, it would make sense for #partition as well.
Updated by knu (Akinori MUSHA) about 3 years ago
I agree this would be a good addition, and I think the existing users of lazy
would understand the incompatibility this would bring is a necessary step to make partition
more useful.
However, the buffering could be a pitfall for new users. In today's developer meeting, Matz and I agreed to suggest that the behavior should be well documented. If you were dividing a huge (or infinite) list into two where one enumerator would yield a value extremely less likely than the other and you read from the first enumerator, the buffer could become huge. That is not straightforward from what you normally expect from "lazy", so it should be noted in the documentation.
Updated by Dan0042 (Daniel DeLorme) about 3 years ago
I don't think buffering is a problem. If you consider that the current implementation fully buffers the two sub-lists before returning them, the lazy implementation literally cannot cause more buffering than the current non-lazy implementation.
Updated by knu (Akinori MUSHA) about 3 years ago
Dan0042 (Daniel DeLorme) wrote in #note-3:
I don't think buffering is a problem. If you consider that the current implementation fully buffers the two sub-lists before returning them, the lazy implementation literally cannot cause more buffering than the current non-lazy implementation.
Buffering may be a whole new idea in the history of Enumerator::Lazy, so it won't hurt to document that the new Lazy#partition
implementation can consume a lot of memory.
Updated by Eregon (Benoit Daloze) about 3 years ago
Mmh, I think the whole point of Enumerator::Lazy is it uses constant memory and does not buffer arbitrary amount of elements.
Shouldn't it simply behave like:
class Enumerator::Lazy
def partition(&block)
[select(&block), reject(&block)]
end
end
No buffering, no surprises, consistent with general Lazy methods.
The block will be executed 2 times per element if both enumerators iterate that element.
That seems completely fine since it's a predicate, and expected given you'd do the same with select
/reject
manually.
Updated by Dan0042 (Daniel DeLorme) about 3 years ago
I wouldn't say that constant memory is the "whole point" of Enumerator::Lazy. It's more about performing the minimum amount of computation needed, only when needed. Executing the block twice for each element is not minimal, and much more surprising to me than any amount of buffering. What if there are side effects in the block?
Updated by zverok (Victor Shepelev) about 3 years ago
@Eregon (Benoit Daloze) your version is not the same for effectfull enumerators (where .lazy
is extremely useful):
require 'stringio'
str = StringIO.new(<<~ROWS)
1: OK
2: Err
3: OK
4: Err
5: OK
6: Err
ROWS
err, ok = str.each_line(chomp: true).lazy.partition { _1.include?('Err') }
p [err.first(2), ok.first(2)]
# mine: [["2: Err", "4: Err"], ["1: OK", "3: OK"]]
# yours: [["2: Err", "4: Err"], ["5: OK"]]
...because yours is consuming both kinds of rows while producing err
s.
I agree that if not a "whole point", the "it consumes much less memory" is an implicit expectation of a lazy enumerator. But I believe having (well-documented) quirk in partition
is better than not having lazy partition
at all.