Feature #19061
Open
Proposal: make a concept of "consuming enumerator" explicit
Description
The problem
Let's imagine this synthetic data:
lines = [
  "--EMAIL--",
  "From: zverok.offline@gmail.com",
  "To; bugs@ruby-lang.org",
  "Subject: Consuming Enumerators",
  "",
  "Here, I am presenting the following proposal.",
  "Let's talk about consuming enumerators..."
]
The logic of parsing it is more or less clear:
- skip the first line
- take lines until an empty one is met, to read the header
- take the rest of the lines to read the body
It can be easily translated into Ruby code, almost literally:
def parse(enumerator)
  puts "Testing: #{enumerator.inspect}"
  enumerator.next
  p enumerator.take_while { !_1.empty? }
  p enumerator.to_a
end
Now, let's try this code with two different enumerators on those lines:
require 'stringio'
enumerator1 = lines.each
enumerator2 = StringIO.new(lines.join("\n")).each_line(chomp: true)
puts "Array#each"
parse(enumerator1)
puts
puts "StringIO#each_line"
parse(enumerator2)
Output (as you probably already guessed):
Array#each
Testing: #<Enumerator: [...]:each>
["--EMAIL--", "From: zverok.offline@gmail.com", "To; bugs@ruby-lang.org", "Subject: Consuming Enumerators"]
["--EMAIL--", "From: zverok.offline@gmail.com", "To; bugs@ruby-lang.org", "Subject: Consuming Enumerators", "", "Here, I am presenting the following proposal.", "Let's talk about consuming enumerators..."]
StringIO#each_line
Testing: #<Enumerator: #<StringIO:0x00005581018c50a0>:each_line(chomp: true)>
["From: zverok.offline@gmail.com", "To; bugs@ruby-lang.org", "Subject: Consuming Enumerators"]
["Here, I am presenting the following proposal.", "Let's talk about consuming enumerators..."]
Only the second enumerator behaves the way we wanted it to.
Things to notice here:
- Both enumerators are of the same class, "just enumerator," but they behave differently: one of them is consuming data on each iteration method, the other does not; but there is no programmatic way to tell whether some enumerator instance is consuming
- There is no easy way to make a non-consuming enumerator behave in a consuming way, to open a possibility of a sequence of processing "skip this, take that, take the rest"
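The difference can be reduced to a minimal sketch. The `probe` helper below is hypothetical (it is not part of the proposal); it just calls two iteration methods in a row on the same enumerator, so the second result shows whether the first call consumed anything.

```ruby
require 'stringio'

# Hypothetical helper (not from the proposal): runs two iteration methods in a
# row on the same enumerator and returns both results.
def probe(enumerator)
  [enumerator.first(1), enumerator.to_a]
end

array_result = probe(%w[a b c].each)
io_result    = probe(StringIO.new("a\nb\nc\n").each_line(chomp: true))

p array_result # the second call restarts from "a"
p io_result    # the second call continues after "a"
```

Both enumerators report the same class, so a check like this is the only way to tell them apart today, and the check itself consumes data from the second one.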
Concrete proposal
- Introduce an Enumerator#consuming? method that will allow telling one from the other (and make core enumerators like #each_line properly report they are consuming).
- Introduce consuming: true parameter for Enumerator.new so it would be easy for user's code to specify the flag
- Introduce Enumerator#consuming method to produce a consuming enumerator from a non-consuming one:
# reference implementation is trivial:
class Enumerator
  def consuming
    source = self
    Enumerator.new { |y| loop { y << source.next } }
  end
end
enumerator3 = lines.each.consuming
parse(enumerator3)
Output:
["From: zverok.offline@gmail.com", "To; bugs@ruby-lang.org", "Subject: Consuming Enumerators"]
["Here, I am presenting the following proposal.", "Let's talk about consuming enumerators..."]
Updated by mame (Yusuke Endoh) about 2 years ago
Here is my understanding:
[1, 2, 3].each.consuming? #=> false
$stdin.each_line.consuming? #=> true
# A user must guarantee whether it is consuming or not.
Enumerator.new {}.consuming? #=> false
Enumerator.new(consuming: true) {}.consuming? #=> true
e = [1, 2, 3].each.consuming
p e.consuming? #=> true
p e.next #=> 1
p e.to_a #=> [2, 3]
I think there are two problems of this proposal.
Problem 1: The consuming flag depends on the underlying IO
An enumerator created from a normal file is not consuming.
e = File.foreach("normal-file")
e.next #=> "first line\n"
e.to_a #=> ["first line\n", "second line\n", "third line\n"]
However, an enumerator created from a named FIFO is consuming.
File.mkfifo("fifo-file")
fork do
  ["first line\n", "second line\n"].each do |s|
    sleep 1
    File.write("fifo-file", s)
  end
end
e = File.foreach("fifo-file")
e.next #=> "first line\n"
e.to_a #=> ["second line\n"]
I am unsure if there is a portable way to determine whether the IO is consuming or not.
Problem 2: The result of Enumerator#consuming shares the state with the original Enumerator
After Enumerator#consuming is called, calling #next and/or #rewind on the original Enumerator affects the consuming Enumerator and vice versa.
e1 = (1..5).to_enum
e2 = e1.consuming
# This call affects the state of e2
p e1.next #=> 1
p e2.next #=> 2 (is this okay?)
# Also, e2.next affects the state of e1 vice versa
p e1.next #=> 3 (is this okay again?)
# e2.rewind has no effect (as intended), but you can still rewind e2 by calling e1.rewind
e1.rewind
p e2.next #=> 1 (rewound; is this okay?)
I don't think this is intentional, but it is very difficult to implement correctly. One possible solution I came up with is to prohibit #next and #rewind on the original Enumerator, i.e., the right to call the methods is completely transferred to the consuming one. But it introduces yet another new type of Enumerator (unrewindable Enumerator?), which is very complicated.
Updated by zverok (Victor Shepelev) about 2 years ago
Here is my understanding
This is correct.
Problem 1: The consuming flag depends on the underlying IO
That's an interesting problem indeed! I'll look deeper into it.
But for now, I consider it an edge case that can be, in the worst case, just covered by docs. E.g. something like "File.foreach reports itself as not consuming, but depending on IO properties this might not be true...", while, say, File#each_line is consuming by design, if I understand correctly.
The distinction of "consuming"/"non-consuming" [by design] still seems helpful.
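A small sketch of that "by design" difference on a regular file. The `next_then_rest` helper is hypothetical; it calls #next and then #to_a on the same enumerator, as in the examples above. File.foreach reopens the path for every iteration, while File#each_line keeps reading from the one open handle.

```ruby
require 'tempfile'

# Hypothetical helper: take one element externally, then collect "the rest".
def next_then_rest(enumerator)
  [enumerator.next, enumerator.to_a]
end

Tempfile.create('lines') do |file|
  file.write("first line\nsecond line\n")
  file.flush

  # Path-based: #next and #to_a each open the file on their own.
  p next_then_rest(File.foreach(file.path))

  # Handle-based: both calls share the position of the same File object.
  File.open(file.path) { |io| p next_then_rest(io.each_line) }
end
```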
Problem 2: The result of Enumerator#consuming shares the state with the original Enumerator
It is just because my reference implementation was too naive :)
By simply changing it to
class Enumerator
  def consuming
    source = dup
    Enumerator.new { |y| loop { y << source.next } }
  end
end
...for all I can tell, breaks all the ties with the original enumerator's state, and all of the examples behave reasonably:
e1 = (1..5).to_enum
e2 = e1.consuming
p e1.next #=> 1
p e2.next #=> 1 (unaffected by e1.next)
p e1.next #=> 2 (unaffected by e2.next)
e1.rewind
p e2.next #=> 2 (unaffected by rewind)
Do you see a problem with this solution?..
Updated by ioquatix (Samuel Williams) about 2 years ago
For problem 1 you can check if an IO is seekable, and this would tell you whether you could restart from the beginning.
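One way such a check could be sketched: Ruby's IO has no built-in "seekable" predicate as far as I know, so the hypothetical `seekable?` helper below asks for the current position and treats Errno::ESPIPE as "not seekable". This is an assumption about how the check might look, and the answer depends on the platform and on the kind of IO.

```ruby
require 'tempfile'

# Hypothetical helper, not part of Ruby's IO API: IO#pos is backed by a seek
# system call, which fails with ESPIPE on pipes and FIFOs.
def seekable?(io)
  io.pos
  true
rescue Errno::ESPIPE
  false
end

reader, writer = IO.pipe
p seekable?(reader)                      # a pipe cannot be repositioned
Tempfile.create('seek') { |f| p seekable?(f) } # a regular file can
reader.close
writer.close
```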
Updated by Dan0042 (Daniel DeLorme) about 2 years ago
mame (Yusuke Endoh) wrote in #note-2:
But it introduces yet another new type of Enumerator (unrewindable Enumerator?), which is very complicated.
It's more complicated, but unrewindable enumerators already exist in practice (as shown by FIFO), so making them visible and explicit should be useful I think. Maybe #consuming? could return 3 values like [nil, :rewindable, :nonrewindable]
Updated by mame (Yusuke Endoh) about 2 years ago
zverok (Victor Shepelev) wrote in #note-3:
File#each_line is consuming by design, if I understand correctly.
Well, I guess so. To be honest, I'm not sure which ones are consuming and which ones are not.
Problem 2: The result of Enumerator#consuming shares the state with the original Enumerator
Do you see a problem with this solution?..
I think this is also a possible solution. Note that the Enumerator in the middle of #next will not be able to return #consuming. Is this okay?
e1 = (1..5).to_enum
e1.next
e1.consuming #=> can't copy execution context (TypeError)
ioquatix (Samuel Williams) wrote in #note-4:
For problem 1 you can check if an IO is seekable, and this would tell you whether you could restart from the beginning.
I think you misunderstand Problem 1 (maybe due to my bad explanation). Enumerator does not use IO#seek or anything like it. Calling #next and #to_a on the Enumerator created from File.foreach will each open the file separately.
Dan0042 (Daniel DeLorme) wrote in #note-5:
It's more complicated, but unrewindable enumerators already exist in practice (as shown by FIFO), so making them visible and explicit should be useful I think. Maybe #consuming? could return 3 values like [nil, :rewindable, :nonrewindable]
The word "unrewindable" was not a good name, which might have confused you. I meant an Enumerator whose #next and #rewind raise an exception, say, "you cannot use #next because you have already called #consuming".
Updated by zverok (Victor Shepelev) about 2 years ago
To be honest, I'm not sure which ones are consuming and which ones are not.
Which is one of the points of this ticket! The distinction is internally present (as displayed in original code samples) but never spelled out and can't be introspected. I believe that introducing the explicit concept will make it much more obvious and make people aware of it.
Note that the Enumerator in the middle of #next will not be able to return #consuming. Is this okay?
I think it is totally Ok for the first implementation, especially if #consuming will raise a bit more friendly error like "The enumerator is mid-enumeration and can't be turned into consuming" or something.
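A minimal sketch of what that friendlier error could look like, building on the dup-based reference implementation. Keeping the TypeError class and only replacing the message is an assumption made here for illustration; the thread does not settle on an error class.

```ruby
class Enumerator
  # Sketch only: the dup-based #consuming from this thread, with the
  # "can't copy execution context" TypeError replaced by a clearer message.
  def consuming
    source =
      begin
        dup
      rescue TypeError
        raise TypeError,
              "The enumerator is mid-enumeration and can't be turned into consuming"
      end
    Enumerator.new { |y| loop { y << source.next } }
  end
end

e1 = (1..5).to_enum
e1.next
begin
  e1.consuming
rescue TypeError => error
  puts error.message
end
```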
Updated by zverok (Victor Shepelev) about 2 years ago
Re:
"But I'm skeptical about the usefulness of the consuming? flag" (from dev.log)
I believe it is extremely useful for introspection. For example, a method like the one shown in the original ticket:
def parse(enumerator)
  puts "Testing: #{enumerator.inspect}"
  enumerator.next
  p enumerator.take_while { !_1.empty? }
  p enumerator.to_a
end
...will work properly (enumerator.next and enumerator.take[_while] advance the enumerator) with a consuming enumerator, and surprisingly with a non-consuming one. As it is too late to make all enumerators consuming :), at least the presence of an explicit notion of "consuming-ness" will make it somewhat easier to explain and understand.
And also adjust when needed with enumerator = enumerator.consuming unless enumerator.consuming? or something.
Updated by mame (Yusuke Endoh) about 2 years ago
zverok (Victor Shepelev) wrote in #note-7:
I think it is totally Ok for the first implementation
Not only "the first implementation". I think it is impossible to implement the method even in the future because a Fiber cannot be duplicated.
Updated by zverok (Victor Shepelev) about 2 years ago
I think it is impossible to implement the method even in the future because a Fiber cannot be duplicated.
Of course, it is impossible directly.
I just might imagine that if it would be a common stumbling question for consuming enumerators (hardly so, but who knows), there might be some workarounds, like, IDK, trying to duplicate the initial state and make consuming enumerator start from the start if possible, or something like that.
Anyway, it is out of the scope of the current proposal :)
Updated by hsbt (Hiroshi SHIBATA) about 2 years ago
- Related to Feature #19069: Default value assignment with `Hash.new` in block form added
Updated by hsbt (Hiroshi SHIBATA) about 2 years ago
- Related to deleted (Feature #19069: Default value assignment with `Hash.new` in block form)
Updated by matz (Yukihiro Matsumoto) almost 2 years ago
Regarding the concrete proposals:
- Introduce an Enumerator#consuming? method
  The consuming information is not reliable, especially with I/O (some IO may not be rewindable, but lseek(2) may not return an error for the IO, e.g. on macOS). Thus we cannot implement a trustworthy consuming? method.
- Introduce consuming: true parameter for Enumerator.new
  Since the consuming? state of enumerators is unreliable, this keyword argument is useless.
- Introduce Enumerator#consuming method to produce a consuming enumerator from a non-consuming one
  The original PoC code modifies the original; the modified one raises an error for duping the internal fiber. It's not acceptable behavior (but the former may be). In theory, we can overhaul the implementation of enumerators, but I don't think it's worth the cost.
The final decision may be up to the actual use-case. But I doubt the benefit.
Matz.
Updated by zverok (Victor Shepelev) almost 2 years ago
@matz (Yukihiro Matsumoto) Thanks for your answer. I'll gather more evidence/real-life examples and will adjust the proposal.
My main concern, though, was not so much any particular usage as general awareness of the difference between the two types of enumerators.
The latest evidence of the fact that it is a problem is bug #19294 in the new feature of Ruby 3.2, where even the core team member implementing new functionality hasn't considered that some enumerators would be "consumed" by the first iteration.
I believe it to be a pretty important distinction frequently leading to idiosyncrasies and not just a random feature request. But I need to think about how to communicate my intentions and proposals better.