Feature #21795
openMethods for retrieving ASTs
Description
I would like to propose a handful of methods for retrieving ASTs from various objects that correspond to locations in code. This includes:
- Proc#ast
- Method#ast
- UnboundMethod#ast
- Thread::Backtrace::Location#ast
- TracePoint#ast (on call/return events)
The purpose of this is to make tooling easier to write and maintain. Specifically, this would be able to be used in irb, power_assert, error_highlight, and various other tools both in core and not that make use of source code.
There have been many previous discussions of retrieving node_id, source_location, source, etc. All of these use cases are covered by returning the AST for some entity. In this case node_id becomes an implementation detail, invisible to the user. Source location can be derived from the information on the AST itself. Similarly, source can be derived from the AST.
Internally, I do not think we have to store any more information than we already do (since we have node_id for the first four of these, it becomes rather trivial). For TracePoint we can have a larger discussion about it, but I think it should not be too much work. In terms of implementation, the only caveat I would put is that if the ISEQ were compiled through the old parser/compiler, this should return nil, as the node ids do not match up and we do not want to further propagate the RubyVM::AST API.
The reason I am opening up this ticket with 5 different methods requested in it is to get approval first for the direction, then I can open individual tickets or just PRs for each method. I believe this feature would ease the maintenance burden of many core libraries, and unify otherwise disparate efforts to achieve the same thing.
Updated by mame (Yusuke Endoh) 3 months ago
I anticipated that we would consider this eventually, but incorporating it into the core presents significant challenges.
Here are two major issues regarding feasibility.
(Based on chats with @ko1 (Koichi Sasada), @tompng (tomoya ishida), and @yui-knk (Kaneko Yuichiro), though these are my personal views.)
The Implementation Approach¶
CRuby currently discards source code and ASTs after ISeq generation. The proposed #ast method would have to re-read and re-parse the source, which causes two problems:
- If the file is modified after loading,
#astmay return the wrong node. - It does not work for
evalstrings.
error_highlight accepts this fragility because it displays just "hints". But I don't think that it is allowed for a built-in method. At least, we must avoid returning an incorrect node, and clarify when failures occur.
I propose two approaches:
- Keep loaded source in memory (e.g.,
RubyVM.keep_script_lines = trueby default). This supportsevalbut increase memory usage. - Validate source hash. Store a hash in the ISeq and check it to ensure the file hasn't changed.
The Parser Switching Problem¶
What is the node definition returned by #ast?
As noted in #21618, built-in Prism is not exposed as a Ruby API. If Gemfile.lock specifies an older version of prism gem, even require "prism" won't provide the expected definition.
IMO, it would be good to have a node definition that does not depend on prism gem (maybe Ruby::Node?). I am not sure how much effort is needed for this. We would also need to consider where to place what in the ruby/prism and ruby/ruby repositories for development.
We also need to decide if #ast should return RubyVM::AST::Node when --parser=parse.y is specified.
Updated by Eregon (Benoit Daloze) 3 months ago
· Edited
I think this would be great to have, and abstract over implementation details like node_id.
It's also very powerful as e.g. Thread::Backtrace::Location#ast would be able to return a Prism::CallNode with all the relevant information, which error_highlight and others could then use very conveniently.
mame (Yusuke Endoh) wrote in #note-1:
The Implementation Approach¶
error_highlightaccepts this fragility because it displays just "hints".
But I don't think that it is allowed for a built-in method. At least, we must avoid returning an incorrect node, and clarify when failures occur.
I think a built-in method doesn't imply it must work perfectly, it's e.g. not really possible to succeed when the file is modified (without keeping the source in memory).
I propose two approaches:
- Keep loaded source in memory (e.g.,
RubyVM.keep_script_lines = trueby default). This supportsevalbut increase memory usage.
I think we could keep only eval code (IOW, non-file-based code) in memory, then the memory increase wouldn't be so big, and it would work as long as the files aren't modified.
Keeping all code in memory would be convenient, and interesting to measure how much an overhead it is (source code is often quite a compact representation actually), but I suspect given the general focus of CRuby on memory footprint it would be considered only as last resort.
- Validate source hash. Store a hash in the ISeq and check it to ensure the file hasn't changed.
This is a great idea, it should be reasonably fast and avoid the pitfall about modified files.
Although modified files are probably very rare in practice, so I'm not sure how much this matters, but it does seem nicer to fail properly than potentially return the wrong node.
The Parser Switching Problem¶
What is the node definition returned by
#ast?
A Prism::Node because Matz has agreed that going forward the official parser API for Ruby will be the Prism API.
Actually more specific nodes where known:
-
Proc#ast:LambdaNode | CallNode | ForNode | DefNode(DefNodebecauseMethod#to_proc) -
Method#ast&UnboundMethod#ast:DefNode | LambdaNode | CallNode | ForNode(block-related nodes becausedefine_method(&Proc)) -
Thread::Backtrace::Location#ast&TracePoint#ast:Call*Node | Index*Node | YieldNode, and maybe a few more.
As noted in #21618, built-in Prism is not exposed as a Ruby API. If
Gemfile.lockspecifies an older version of prism gem, evenrequire "prism"won't provide the expected definition.
This is basically a solved problem, as discussed there.
In that case, Prism.parse(foo, version: "current") fails with a clear exception explaining one needs to use a newer prism gem.
This only happens if one explicitly downgrades the Prism gem, which is expected to be pretty rare.
IMO, it would be good to have a node definition that does not depend on prism gem (maybe
Ruby::Node?). I am not sure how much effort is needed for this. We would also need to consider where to place what in the ruby/prism and ruby/ruby repositories for development.
IMO there should only be Prism::Node, otherwise tools would have to switch between two APIs whether they want to use the current-running syntax or another syntax.
From the discussion in #21618 my take away is it's unnecessary to have a different API.
We also need to decide if
#astshould returnRubyVM::AST::Nodewhen--parser=parse.yis specified.
It must not, the users of these new methods expect a Prism::Node.
Matz has said the official parser API for Ruby is the Prism API, so it doesn't make sense to return RubyVM::AST::Node there.
Also that wouldn't be reasonable when considering alternative Ruby implementations.
This does mean these methods wouldn't work with --parser=parse.y, until parse.y can be used to create a Prism::Node AST.
Since that's the official Ruby parsing API, it's already a goal anyway for parse.y to do that (#21825), so that shouldn't be a blocker.
Updated by Eregon (Benoit Daloze) 3 months ago
One idea to make it work with --parser=parse.y until universal parser supports the Prism API (#21825) would be to:
- Get the
RubyVM::AST::Nodeof that object, then extract the start/end line & start/end columns. Or do the same internally without needing aRubyVM::AST::Node, it's just converting from parse.y node_id to "bounds". - With those values, use something similar to Prism.node_for to find a node base on those bounds.
- There should be a single node matching those bounds because we are only looking for specific nodes.
Updated by kddnewton (Kevin Newton) 3 months ago
Thanks @mame (Yusuke Endoh) for the detailed reply! I appreciate your thoughtfulness here.
With regard to the implementation approach problem, I love your solution of keeping a source hash on the iseq. I think that makes a lot of sense, and could be used in error highlight today as well. That could potentially even be used by other tools in the case of code reloading. I think we could potentially store the code for eval, but I would be tempted to say let us not change anything for now and return nil or raise an error in that case. (In the same way we would need to return nil or raise an error for C methods.)
For the parser switching problem, I think I would like to introduce a Prism ABI version (alongside the Prism gem version). I would update this version whenever a structural change is made (field added/renamed/removed/etc.). Then, if we could store the Prism ABI version on the ISEQ as well, we could require prism and check if the ABI version matches before attempting to re-parse. We could be clear through the error message that the Prism ABI version is a mismatch and therefore we cannot re-parse.
I am not sure if we should return RubyVM::AST nodes in the case the ISEQ was compiled with parse.y/compile.c, but I am okay with it if that's the direction you would like to go.
Updated by Eregon (Benoit Daloze) 3 months ago
- Related to Feature #21826: Deprecating RubyVM::AbstractSyntaxTree added
Updated by Eregon (Benoit Daloze) 2 months ago
- Related to deleted (Feature #21826: Deprecating RubyVM::AbstractSyntaxTree)
Updated by Eregon (Benoit Daloze) 2 months ago
- Blocks Feature #21826: Deprecating RubyVM::AbstractSyntaxTree added
Updated by Eregon (Benoit Daloze) 2 months ago
Rails' _callable_to_source_string would be a good use case for this, see https://github.com/rails/rails/pull/56624
Updated by mame (Yusuke Endoh) about 1 month ago
Eregon (Benoit Daloze) wrote in #note-2:
As noted in #21618, built-in Prism is not exposed as a Ruby API. If
Gemfile.lockspecifies an older version of prism gem, evenrequire "prism"won't provide the expected definition.This is basically a solved problem, as discussed there.
In that case,Prism.parse(foo, version: "current")fails with a clear exception explaining one needs to use a newer prism gem.
I believe #21618 primarily discusses released Ruby versions. My concern is specifically about the behavior on the master branch.
When new syntax is introduced to the Ruby master branch, the built-in prism.c is updated immediately. In this scenario, if we attempt to retrieve #ast using the node definitions from a released prism gem, I am concerned that we will not get a correct AST due to the node definition mismatch.
kddnewton (Kevin Newton) wrote in #note-4:
For the parser switching problem, I think I would like to introduce a Prism ABI version (alongside the Prism gem version). I would update this version whenever a structural change is made (field added/renamed/removed/etc.). Then, if we could store the Prism ABI version on the ISEQ as well, we could require prism and check if the ABI version matches before attempting to re-parse. We could be clear through the error message that the Prism ABI version is a mismatch and therefore we cannot re-parse.
While this is certainly a feasible solution, I don't feel it is the optimal one.
I acknowledge the engineering challenges involved, but ideally, I believe having a built-in node definition (like Ruby::Node) within Ruby core itself would be the simplest and best approach.
Updated by kddnewton (Kevin Newton) about 1 month ago
Would a Ruby::Node be the same thing as a Prism::Node? As in, would it basically be a Ruby API that duplicates the Prism interface?
I'm not sure about how to maintain it. For example, if we add more features to Prism's Ruby API (for example the work we've been doing on the translation layers to Ripper recently) would we also duplicate it to the various live branches of the Ruby::Node API? Or would it just be a trimmed down version? Either way, I'm not sure when I would recommend using Ruby::Node, because it seems like it would always be an out-of-date version of Prism::Node.
Updated by matz (Yukihiro Matsumoto) 8 days ago
I have two concerns before we move forward.
On the name AST
I'm not sure ast is the right name. The nodes returned by Prism retain concrete information such as positions, whitespace, and comments, making them closer to a Concrete Syntax Tree than an Abstract Syntax Tree. A name like node or syntax_tree might be more accurate.
On the ABI version approach.
Embedding a Prism ABI version in the ISeq sounds reasonable at first, but I'm worried it would make these methods reliably broken during active development on master — any time prism.c is updated ahead of a released gem, callers would get nil or an exception as a matter of course. That's a poor developer experience for people working on master. This concern points back to the suggestion from @mame (Yusuke Endoh): perhaps we need the built-in Prism to be exposed as a gem-independent API first, before we can ship these methods in a stable way.
I'm positive about the overall direction. I just want to make sure we resolve these two points before committing to the API shape.
Matz.
Updated by Eregon (Benoit Daloze) 8 days ago
mame (Yusuke Endoh) wrote in #note-9:
When new syntax is introduced to the Ruby master branch, the built-in
prism.cis updated immediately. In this scenario, if we attempt to retrieve#astusing the node definitions from a released prism gem, I am concerned that we will not get a correct AST due to the node definition mismatch.
AFAIK the parser and node definitions always match.
If the node definitions need changes, they would be updated at the same time than prism.c.
If updating node definitions is somehow forgotten, then it needs to be fixed anyway (regardless of this issue), but it will only result in e.g. not having a new node field yet, not a big deal.
As such there will never be "an incorrect AST".
The scenario you mention would only be an issue when all of these are the case:
- Using ruby-master and not a release
- Using
#aston a file which uses new syntax not in the latest prism release (very rare already) - Using Bundler (just using RubyGems would pick the prism default gem which has the latest syntax changes)
- In Gemfile, depending on a release version of prism and not using
bundle install --prefer-local(which would pick the prism default gem)
This seems such a rare case, and there are solutions like bundle install --prefer-local for those cases, or releasing prism (e.g. if new syntax is being adopted quickly and widely).
In such a case, it wouldn't return an incorrect AST, it would raise a SyntaxError for the new syntax being used and not being recognized (or a Prism::CurrentVersionError if using an old Prism release).
Isn't that good enough?
I think it's worth highlighting that users of Ruby releases wouldn't have this problem at all, they would just need to depend on a recent enough prism, which is already a requirement and is fine for the many existing usages of Prism.
If we do want to raise when using an older Prism, one idea here is: Method#ast would do require "prism" and after that require it would check that Prism::VERSION >= EMBEDDED_PRISM_VERSION (i.e. the version of Prism used by the interpreter to parse). If not, raise an exception.
It's similar to the Prism ABI idea but simpler and doesn't require to maintain such an ABI version manually.
One change we would need is to bump to the next version immediately in ruby/prism whenever doing a release, so e.g. on master the prism gem would be reported as 1.10.0 (or 1.10.0.dev to be more explicit), and not as 1.9.0 (which is the current latest release). That way the last release would be considered incompatible since there might be syntax changes since then.
matz (Yukihiro Matsumoto) wrote in #note-11:
I'm not sure
astis the right name. The nodes returned by Prism retain concrete information such as positions, whitespace, and comments, making them closer to a Concrete Syntax Tree than an Abstract Syntax Tree. A name likenodeorsyntax_treemight be more accurate.
The parser and ast gems and RubyVM::AbstractSyntaxTree call them "AST" and they also have positions.
I'm not sure what is meant by whitespace, Prism itself doesn't return objects for whitespace (even in the lexer).
Regarding comments, those would be ignored for these new methods as they would return a Prism::Node, not a Prism::ParseResult, but even then I think it's a minor detail.
Prism is also not a pure Concrete Syntax Tree as e.g. postfix-if and regular if are both Prism::IfNode.
And it's clearly used successfully as an abstract syntax tree in Ruby implementations.
I think ast is the name most Rubyists would expect.
syntax_tree sounds fine to me too if ast is not acceptable.
node sounds rather ambiguous to me.