Misc #20260
closedISEQ flag for prism compiler
Description
In order to support error highlight, there's needs to be a way to tell which compiler generated an instruction sequence (compile.c or prism_compile.c). That's because when the file is reparsed to find the node based on the node id, we need to know which parser to use. We can't look at the current parser because it may have been dumped to binary and the parser/compile pair might not match up to the current running process.
I would like to add a flag to rb_iseq_constant_body
that indicates it came from prism, as well as the :prism
hash key to the misc hash in the array form of iseqs. This will allow us to determine the source of the iseq from both C and Ruby.
Since this is user-facing (albeit from an experimental library that shouldn't be relied upon) I wanted to make sure this way okay before merging.
Updated by kddnewton (Kevin Newton) 10 months ago
The PR for this work is here: https://github.com/ruby/ruby/pull/9934.
Updated by ko1 (Koichi Sasada) 10 months ago
I'm okay to change the internal data structure because we can change them in future.
How to change the user facing interface? Changing only to_a
format?
For example, errror_highlight needs to know the code which comes from.
Endoh-san propose me to raise an error if RubyVM::AST.of()
is called for ISeqs from prism.
Updated by kddnewton (Kevin Newton) 10 months ago
Yes, the user-facing part for iseq is only changing RubyVM::InstructionSequence.to_a
to include prism: true/false
.
For error highlight to know, we will either need to change RubyVM::AST.of
to know that it is coming from prism or we will need to add something to Thread::Backtrace::Location
like prism?
.
This code here: https://github.com/ruby/error_highlight/blob/80ede6b8ca3219310f30cff42cd17177a7e47f25/lib/error_highlight/base.rb#L55C7-L59 does:
return nil unless Thread::Backtrace::Location === loc
node = RubyVM::AbstractSyntaxTree.of(loc, keep_script_lines: true)
Spotter.new(node, **opts).spot
That would need to be either:
return nil unless Thread::Backtrace::Location === loc
if loc.prism?
node = Prism.node_for(loc)
PrismSpotter.new(node, **opts).spot
else
node = RubyVM::AbstractSyntaxTree.of(loc, keep_script_lines: true)
Spotter.new(node, **opts).spot
end
or it would need to be:
return nil unless Thread::Backtrace::Location === loc
node = RubyVM::AbstractSyntaxTree.of(loc, keep_script_lines: true)
if node.is_a?(Prism::Node)
PrismSpotter.new(node, **opts).spot
else
Spotter.new(node, **opts).spot
end
or maybe some other kind of API change. Either way we would need to access the prism flag that is on the iseq associated with the backtrace location. I'm fine with any kind of API that gets us there.
Updated by kddnewton (Kevin Newton) 10 months ago
Sorry to respond again, but in thinking about it, I think Prism.node_for
might not be right. Maybe it should be RubyVM::PrismSyntaxTree.of
, or something else under RubyVM
because it is specific to CRuby.
Updated by kddnewton (Kevin Newton) 10 months ago
Ahh sorry again, but actually if we introduce Thread::Backtrace::Location#iseq
, all of this can be done in Ruby, because we could make it:
return nil unless Thread::Backtrace::Location === loc
if loc.iseq.to_a[4][:prism]
# reparse with prism
else
node = RubyVM::AbstractSyntaxTree.of(loc, keep_script_lines: true)
Spotter.new(node, **opts).spot
end
Updated by mame (Yusuke Endoh) 9 months ago
How about the following API?
- Make
RubyVM::AbstractSyntaxTree.of
raise a RuntimeError for objects with iseq complied from Prism. - Introduce
Thread::Backtrace::Location#node_id
, which returns node_id at the current instruction, regardless of compiled from parse.y or Prism.
begin
node = RubyVM::AbstractSyntaxTree.of(loc, keep_script_lines: true)
Spotter.new(node, **opt).spot
rescue RuntimeError
if $!.message =~ /prism/
node = Prism.parse(loc.path).node_for(loc.node_id)
PrismSpotter.new(node, **opt).spot
else
raise
end
end
The rationale of this API design is as follows.
In principle, we don't want to introduce a temporal method outside of RubyVM that will be useless in the near future. Location#prism?
will become useless when Prism becomes Ruby's parser successfully. Ruby VM::InstructionSequence.complied_by_prism?(loc)
is a possible option. However, we need to make RubyVM::AbstractSyntaxTree.of
raise an exception for the Prism-derived iseq anyway. If so, you can use it to determine if the iseq is compiled from Prism without having any extra methods.
Location#iseq
is not good because it returns CRuby's internal data structure. Location#node_id
is not best too, but it seems acceptable and necessary to tell Prism the node_id.
Updated by mame (Yusuke Endoh) 9 months ago
BTW, I prefer to have ErrorHighlight::Spotter.new
accept both RubyVM::AST::Node
and Prism::Node
instead of adding ErrorHighlight::PrismSpotter
. (I guess you showed it just for clarity.)
Updated by kddnewton (Kevin Newton) 9 months ago
Okay! That makes sense, thank you for the discussion. I'm also fine making ErrorHighlight::Spotter
handle both, as you say it was just for illustrative purposes. I'll update this ticket when I have updated RubyVM::AST.of
and Thread::Backtrace::Location.node_id
in the PR.
Updated by Eregon (Benoit Daloze) 9 months ago
It would be great if ErrorHighlight
can work on other Ruby implementations too, which means not relying on RubyVM
.
Thread::Backtrace::Location#node_id
might not be ideal because other Ruby implementations AFAIK don't keep a node_id
(for footprint reasons).
TruffleRuby OTOH keeps the byte offset and byte length of each node, so this could be used to identify a given call node.
That also seems similar to the 2nd paragraph in https://github.com/ruby/prism/issues/2401.
I think ErrorHighlight could even use the byte offset + byte length to reparse just that part of the file and find where the .
/[
/etc is.
Then the generic implementation-agnostic API could be Thread::Backtrace::Location#{byte_offset,byte_length}
(or {first_column,first_lineno,last_column,last_lineno}
but it seems less efficient/convenient).
Or even Thread::Backtrace::Location#code
which would return a String containing the source code of that call node.
@mame (Yusuke Endoh) WDYT, would one of these be enough for ErrorHighlight
? If so it sounds ideal because it avoids needing implementation details like node_id
, iseq
, etc.
Updated by kddnewton (Kevin Newton) 9 months ago
@mame (Yusuke Endoh) I've made these changes here: https://github.com/ruby/ruby/pull/9934. Could you review? Thanks!
Updated by mame (Yusuke Endoh) 9 months ago
@Eregon (Benoit Daloze) I don't think it is a good idea to try to guess a node from a code range. To identify a unique node, the Prism AST would have to avoid creating multiple nodes from the exact same code range. It may be possible to carefully design the AST that way, but I am afraid if it will prevent from future syntax extensions.
Though I don't think node_id is super cool, it will no longer be an implementation detail, assuming the Prism AST. Calculating byte_offset + byte_length from node_id is trivial, and the reverse is not.
By the way, is there any advantage of the representation of byte_offset and byte_length? It is larger in size than (or at best equal to) node_id. To display to the user, we need to convert it to lineno + column. It may be useful to blindly extract the corresponding code fragment from a file, but I think such a use case is rare.
Updated by Eregon (Benoit Daloze) 9 months ago
mame (Yusuke Endoh) wrote in #note-11:
@Eregon (Benoit Daloze) I don't think it is a good idea to try to guess a node from a code range. To identify a unique node, the Prism AST would have to avoid creating multiple nodes from the exact same code range. It may be possible to carefully design the AST that way, but I am afraid if it will prevent from future syntax extensions.
I think for call nodes it would identify uniquely, but it may not be true for all nodes.
By the way, is there any advantage of the representation of byte_offset and byte_length? It is larger in size than (or at best equal to) node_id. To display to the user, we need to convert it to lineno + column. It may be useful to blindly extract the corresponding code fragment from a file, but I think such a use case is rare.
It's basically byte_offset and byte_length vs line and node_id.
If all 4 are the same integer size then it's the same in memory footprint, but from byte_offset and byte_length we also have precise column information readily available without parsing again (although that means keeping the source code in memory or rereading the file, could be avoided if a file is only US-ASCII characters then only line offsets is enough), which is nice for tools, debugger, etc.
Updated by kddnewton (Kevin Newton) 9 months ago
@matz (Yukihiro Matsumoto) The work is done to add support for the flag on the ISEQ, but I think we additionally need your approval on adding Thread::Backtrace::Location#node_id
.
Updated by kddnewton (Kevin Newton) 9 months ago
- Status changed from Open to Closed
@mame (Yusuke Endoh) remembered that we have RubyVM::AbstractSyntaxTree.node_id_for_backtrace_location(loc)
, so we don't actually need Thread::Backtrace::Location#node_id
. In this case we've only implemented the flag for the body.