Project

General

Profile

Actions

Feature #21005

open

Update the source location method to include line start/stop and column start/stop details

Added by bkuhlmann (Brooke Kuhlmann) 3 days ago. Updated about 11 hours ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:120488]

Description

Why

πŸ‘‹ Hello. After discussing with Kevin Newton and Benoit Daloze in Feature 20999, I'd like to propose adding line start/stop and column start/stop information to the #source_location method for the following objects:

At the moment, when using #source_location, you only get the following information:

def demo = "A demonstration."

# From disk.
method(:demo).source_location  # ["/Users/bkuhlmann/Engineering/Misc/demo", 15]

# From memory.
method(:demo).source_location  # ["(irb)", 3]

Notice, when asking for the source location, we only get the path/location as the first element and the line number as the second element but I'd like to obtain a much richer set of data which includes line start/stop and column start/stop so I can avoid leaning on the RubyVM for this information. Example:

def demo = "A demonstration."

# From disk.
instructions = RubyVM::InstructionSequence.of method(:demo)
puts [instructions.absolute_path, *instructions.to_a.dig(4, :code_location)]

[
  "/Users/bkuhlmann/Engineering/Misc/demo",  # Source path.
  15,                                        # Line start.
  0,                                         # Column start.
  15,                                        # Line stop.
  29                                         # Column stop.
]

# From memory.
instructions = RubyVM::InstructionSequence.of method(:demo)
puts instructions.script_lines

[
  "def demo = \"A demonstration.\"\n",
  ""
]

By having access to the path (or lack thereof in case of IRB), line start/stop, and column start/stop, this means we could avoid using the RubyVM to obtain raw source code for any of these objects. This would not only enhance debugging situations but also improve Domain Specific Languages that wish to leverage this information for introducing new features and/or new debugging capabilities to the language.

How

Building upon the examples provided above, I'd like to see Binding, Proc, Method, and UnboundMethod respond to #source_location as follows:

[
  "/Users/bkuhlmann/Engineering/Misc/demo",  # Source path.
  15,                                        # Line start.
  15,                                        # Line stop.
  0,                                         # Column start.
  29                                         # Column stop.
]

Notice, for data grouping purposes, I changed the array structure to always start with the path as the first element, followed by line information, and ending with column information. Alternatively, it could might be nice to improve upon the above by answering a hash each time, instead, for a more self-describing data structure. Example:

{
  path: "/Users/bkuhlmann/Engineering/Misc/demo",
  line_start: 15,
  line_stop: 15,
  column_start: 0,
  column_stop: 29
}

For in-memory, situations like IRB, it would be nice to answer the equivalent of RubyVM::InstructionSequence#script_lines which would always be an Array with no line or column information since only the source code is necessary. Example:

[
  "def demo = \"A demonstration.\"\n",
  ""
]

From a pattern matching perspective, this could provide the best of both worlds especially if information is answered as either a Hash or and Array. Example:

def demo = "A demonstration."

case method(:demo).source_location
  in Hash then puts "Source information obtained from disk."
  in Array then puts "Source obtained from memory."
  else fail TypeError, "Unrecognized source location type."
end

This above is only a simple example but there's a lot we could do with this information if the above pattern match was enhanced to deal with the extraction and formatting of the actual source code!

Notes

This feature request is related to the following discussions in case more context is of help:


Related issues 2 (1 open1 closed)

Related to Ruby master - Feature #20999: Add RubyVM object source supportRejectedActions
Related to Ruby master - Feature #6012: Proc#source_location also return the columnAssignednobu (Nobuyoshi Nakada)Actions
Actions #1

Updated by Eregon (Benoit Daloze) 3 days ago

Actions #2

Updated by Eregon (Benoit Daloze) 3 days ago

  • Related to Feature #6012: Proc#source_location also return the column added

Updated by Eregon (Benoit Daloze) 3 days ago Β· Edited

I think adding end line and start/end columns is a straightforward and compatible extension of source_location.
I will add that to the dev meeting agenda: https://bugs.ruby-lang.org/issues/20949

Changing to return a Hash (or an Array of 2 Strings) instead would be too incompatible, that would need to be a new method then.
If it's a new method, I think we should return a "code location" object (could be Ruby::CodeLocation or Ruby::Location or Ruby::SourceLocation or so) and have the following methods (inspired from https://bugs.ruby-lang.org/issues/6012#note-19):

  • start_line
  • start_column
  • start_offset
  • end_line
  • end_column
  • end_offset
  • code or maybe source_code: gets the source of the Proc/Method/UnboundMethod if available, nil otherwise

That last one seems particularly nice because it abstracts how it gets that source code, and allows for multiple implementations.
For instance if the Ruby implementation keeps the source code in memory (or a compressed version of it) it could just use that, if it doesn't it would re-read from disk, etc.

Updated by bkuhlmann (Brooke Kuhlmann) 2 days ago Β· Edited

Yeah, eager to see what the consensus on this becomes from the next dev meeting.

I like the idea of a Ruby::SourceLocation or even a Ruby::Source object? The latter would be nice because you could message the instance as source.code for situations in which the source could be obtained from memory, re-read from disk, or answer nil as you point out. Super helpful for situations where you're using IRB or Binding.

Updated by mame (Yusuke Endoh) 2 days ago

I read the use case for #20999.

In short, by the following DSL,

class Demo
  include Initable[[:key, :default, proc { Object.new }]]
end

you want to generate the following code string, right?

def initialize(default = Object.new)
  @default = default
end

I believe that the last position of Proc is not sufficient for your case, because of a here document. Consider the following proc.

proc { <<END }
  heredoc
END

Where do you expect the end position of the Proc to be? The end brace? Or the end of β€œEND” delimiter? Either way, your DSL will generate a broken Ruby code string.

What you really need is not the end position information, but an AST node of Method, Proc, etc.

f = proc { <<END }
  heredoc
END

# What you really need
node = Prism.node_for(f)

# You can get all the locations including the here document
pp node.block.body.body
# =>
# [@ StringNode (location: (1,7)-(1,12))
#  β”œβ”€β”€ flags: newline
#  β”œβ”€β”€ opening_loc: (1,7)-(1,12) = "<<END"
#  β”œβ”€β”€ content_loc: (2,0)-(3,0) = "  heredoc\n"
#  β”œβ”€β”€ closing_loc: (3,0)-(3,3) = "END"
#  └── unescaped: "  heredoc\n"]

# You can even get the end position of the proc, if it is really sufficient for you
p [node.location.end_line, node.location.end_column] #=> [1, 14]

Updated by byroot (Jean Boussier) 1 day ago

I think adding end line and start/end columns is a straightforward and compatible extension of source_location.

I fear it may break code that splats the array, e.g.:

some_method(*proc.source_location)

So perhaps it should use an argument, e.g. proc.source_location(true) ?

I like the idea of a Ruby::SourceLocation

Me too, because otherwise it would be an array with many positional numbers, and that would be quite cryptic. Returning an actual object with named methods make it easier to understand what's being used and easier to extend later on.

Updated by bkuhlmann (Brooke Kuhlmann) 1 day ago Β· Edited

Yusuke: I went hunting for the Prism.node_for method/documentation but couldn't find it. If I understand you correctly, I think you are proposing adding the .node_for implementation to Prism? If so, I like the detailed information in your example especially when dealing with heredocs and being able to access content location. That would yield even greater flexibility for anyone implementing atop this data. To be more exact, I'm guessing the following could be amended to what Benoit proposed above then?:

  • content_line_start
  • content_line_end
  • content_column_start
  • content_column_end

By the way, one slight correction to what you were asking above. This is what my DSL generates:

# DSL
class Demo
  include Initable[[:key, :default, proc { Object.new }]]
end

# End result.
class Demo
  def initialize(default: Object.new)
    @default = default
  end

  private
  
  attr_reader :default
end

The emphasis is on key because key is an optional keyword parameter as defined in the Method#parameters implementation. Essentially, my DSL reconstitutes the raw parameters answered back into a method signature where the third element in the 3-tuple is additional sugar (specific to my DSL) for providing a default value since Method#parameters only answers an array of tuples.

Jean

I fear it may break code that splats the array

Ugh, yes. What about this instead?

proc.source_location(as: :verbose) 

This would give us room to expand the as keyword with different values to support different object shapes in the future (assuming this design expands and grows in future Ruby versions) without backing us into a corner.

Returning an actual object with named methods make it easier to understand what's being used and easier to extend later on.

Yeah, this would be most welcome. I know I gave an example of using a Hash above but icing on the cake would be to answer a Data object instead so you could simply ask for whatever attribute you need.

Updated by byroot (Jean Boussier) 1 day ago

Ugh, yes. What about [keyword argument] instead?

I think it may be a bit much.

But if we decide with returning a new class, instead of just retunring a longer array, I think the better API would be another method entirely.

Perhaps #source, #source_info, or something along these lines.

Updated by Earlopain (Earlopain _) 1 day ago

But if we decide with returning a new class, instead of just retunring a longer array, I think the better API would be another method entirely.

Yes, I agree with that. caller and caller_location are also different methods. I don't think source would be a good name, it could just as easily return the source string. source_info definitly sounds better to me.

Updated by tenderlovemaking (Aaron Patterson) 1 day ago

It seems like a very similar solution to what @mame (Yusuke Endoh) is proposing was merged at one point as Feature #14836. It seems RubyVM::AST doesn't exist anymore, but it should be possible to do something similar.

For example, this code should work:

f = proc { <<END }
  heredoc
END

iseq = RubyVM::InstructionSequence.of(f)

require "prism"

node_id = iseq.to_a[4][:node_id]

ast = Prism.parse(File.binread(__FILE__))
p ast.value.breadth_first_search { |node| node.node_id == node_id }

It does not currently work due to a bug in Prism, but I'm trying to address that in #21014

Updated by tenderlovemaking (Aaron Patterson) 1 day ago

Also just to be clear, I don't think the code I pasted above is nice code, but it does work after the Prism bug is addressed.

In particular this line is pretty heavy:

node_id = iseq.to_a[4][:node_id]

I think we should add a reader method on RubyVM::InstructionSequence to get the node id. It already has some methods for accessing location information, so I think adding the node id could be acceptable. However, I think such an addition is outside the scope of this ticket, so I will open a new ticket.

Updated by mame (Yusuke Endoh) 1 day ago

I think you are proposing adding the .node_for implementation to Prism?

Yes.

To be more exact, I'm guessing the following could be amended to what Benoit proposed above then?

Prism.node_for is my proposal, but the others are not my proposal but what Prism already provides.

require "prism"
node = Prism.parse("proc { <<END }
  heredoc
END")

pp node
$ ruby test.rb
#<Prism::ParseResult:0x00007f17ceafc278
 @comments=[],
 @data_loc=nil,
 @errors=[],
 @magic_comments=[],
 @source=#<Prism::ASCIISource:0x00007f17cdd4a2d8 @offsets=[0, 15, 25], @source="proc { <<END }\n  heredoc\nEND", @start_line=1>,
 @value=
  @ ProgramNode (location: (1,0)-(1,14))
  β”œβ”€β”€ flags: βˆ…
  β”œβ”€β”€ locals: []
  └── statements:
      @ StatementsNode (location: (1,0)-(1,14))
      β”œβ”€β”€ flags: βˆ…
      └── body: (length: 1)
          └── @ CallNode (location: (1,0)-(1,14))
              β”œβ”€β”€ flags: newline, ignore_visibility
              β”œβ”€β”€ receiver: βˆ…
              β”œβ”€β”€ call_operator_loc: βˆ…
              β”œβ”€β”€ name: :proc
              β”œβ”€β”€ message_loc: (1,0)-(1,4) = "proc"
              β”œβ”€β”€ opening_loc: βˆ…
              β”œβ”€β”€ arguments: βˆ…
              β”œβ”€β”€ closing_loc: βˆ…
              └── block:
                  @ BlockNode (location: (1,5)-(1,14))
                  β”œβ”€β”€ flags: βˆ…
                  β”œβ”€β”€ locals: []
                  β”œβ”€β”€ parameters: βˆ…
                  β”œβ”€β”€ body:
                  β”‚   @ StatementsNode (location: (1,7)-(1,12))
                  β”‚   β”œβ”€β”€ flags: βˆ…
                  β”‚   └── body: (length: 1)
                  β”‚       └── @ StringNode (location: (1,7)-(1,12))
                  β”‚           β”œβ”€β”€ flags: newline
                  β”‚           β”œβ”€β”€ opening_loc: (1,7)-(1,12) = "<<END"
                  β”‚           β”œβ”€β”€ content_loc: (2,0)-(3,0) = "  heredoc\n"
                  β”‚           β”œβ”€β”€ closing_loc: (3,0)-(3,3) = "END"
                  β”‚           └── unescaped: "  heredoc\n"
                  β”œβ”€β”€ opening_loc: (1,5)-(1,6) = "{"
                  └── closing_loc: (1,13)-(1,14) = "}",
 @warnings=[]>

tenderlovemaking (Aaron Patterson) wrote in #note-10:

It seems like a very similar solution to what @mame (Yusuke Endoh) is proposing was merged at one point as Feature #14836. It seems RubyVM::AST doesn't exist anymore, but it should be possible to do something similar.

Yeah, we need to reimplement Prism version of RubyVM::AbstractSyntaxTree.of. This is what I tried to mean by Prism.node_for.

tenderlovemaking (Aaron Patterson) wrote in #note-11:

I think we should add a reader method on RubyVM::InstructionSequence to get the node id.

I don't think it is good for a user to handle node_id explicitly.
Rather, I think Prism should provide a simple method to directly retrieve a node subtree of a Method/Proc object given, like RubyVM::AST.of.

Updated by tenderlovemaking (Aaron Patterson) about 23 hours ago

mame (Yusuke Endoh) wrote in #note-12:

tenderlovemaking (Aaron Patterson) wrote in #note-11:

I think we should add a reader method on RubyVM::InstructionSequence to get the node id.

I don't think it is good for a user to handle node_id explicitly.
Rather, I think Prism should provide a simple method to directly retrieve a node subtree of a Method/Proc object given, like RubyVM::AST.of.

I agree. The problem is that RubyVM::AST is built in to Ruby, so it has access to rb_iseq_t members. Prism cannot access fields on rb_iseq_t.

My idea is to apply a patch like this to iseq.c: https://github.com/tenderlove/ruby/commit/9a54230012d8837a981e0ddec88384ab6ef4db89

Then add code to Prism like this:

require "prism"

# Put the following code in Prism
module Prism
  def self.ast_for iseq
    ast = Prism.parse(File.read(iseq.absolute_path))
    node_id = iseq.node_id
    ast.value.breadth_first_search { |node| node.node_id == node_id }
  end
end

# User code is below
f = proc { <<END }
  heredoc
END

iseq = RubyVM::InstructionSequence.of(f)
tree = Prism.ast_for(iseq)
p tree

I implemented the Prism method as a monkey patch just as demonstration. But the idea is the same as RubyVM::AbstractSyntaxTree.of, where Prism.ast_for takes an iseq object and returns the AST for that iseq.

Updated by Eregon (Benoit Daloze) about 18 hours ago Β· Edited

It's very important that this new feature does not expect users to use RubyVM::InstructionSequence or anything under RubyVM since RubyVM is CRuby-only.
The feature itself is possible on any Ruby implementation.

So something like Prism.node_for(Proc|Method|UnboundMethod) is good, and Prism.ast_for(RubyVM::InstructionSequence) is not.
Internally Prism can of course use RubyVM::InstructionSequence.of(Proc|Method|UnboundMethod).node_id on CRuby, and something else on other Ruby implementations.

Note that if it's enough to locate a node by its start/end line/column, we might not need node_id at all, and then just providing start/end line/column to source_location would be enough to find the right node with Prism.
Are there cases where this would be a problem, i.e. where 2 Prism AST nodes would have the same start/end line/column?
Actually since we are only talking about Proc|Method|UnboundMethod here it would need to be two nodes which define a proc/lambda/method with the same start/end line/column.
I think that's not possible.

If that holds, then the original proposal to provide start/end line/column is enough, and we can add a convenience method in Prism using those.
That would work on all Ruby implementations, without needing a low-level implementation-specific concept of node_id:

module Prism
  def self.node_for callable
    start_line, end_line, start_column, end_column = callable.source_location(true)
    ast.value.breadth_first_search { |node|
      loc = node.location
      loc.start_line == start_line and loc.end_line == end_line and
      loc.start_column == start_column and loc.end_column == end_column
    }
  end
end

Maybe CRuby does not currently preserve the information of end line and start/end column for procs and methods?
For def it would be trivial to preserve it but I guess for blocks and define_method it might be trickier.
For such cases source_location could internally use the node_id stuff if that's easier or deemed a better trade-off on CRuby.

In summary:

  • I think we can build Prism.node_for(Proc|Method|UnboundMethod) on (Proc|Method|UnboundMethod)#source_location with start/end line/column.
  • Those would all be public APIs working on all Ruby implementations.
  • Users don't need to know about low-level implementation-specific (i.e. CRuby-only) concepts like node_id.

Updated by Dan0042 (Daniel DeLorme) about 14 hours ago

Eregon (Benoit Daloze) wrote in #note-3:

If it's a new method, I think we should return a "code location" object (could be Ruby::CodeLocation or Ruby::Location or Ruby::SourceLocation or so) and have the following methods (inspired from https://bugs.ruby-lang.org/issues/6012#note-19):

  • start_line
  • end_line

I really like the idea of a #source method that returns a Ruby::SourceLocation object. However, when there's a start and end, I believe Ruby should ideally align with its own core conventions and return a Range. For example, method.source.lines => start...end. While I understand concerns about the allocation of Range objects and performance, I feel that: 1) this might be an example of premature micro-optimization, and 2) from an API design perspective, a Range object feels like the natural default. Separate start/end accessors could remain a low-level, performance-focused API if truly necessary.

As an alternative to start/end accessors, it would be even better if the Range-returning method could be optimized via opcode. Referring back to the idea of Range#bounds in #20080, we could have something like start_line, end_line = method.source.lines.bounds, which could be optimized via opcode to avoid allocating a Range object entirely.

It would be great to see Ruby continue to embrace its own language idioms and explore such optimizations for a more elegant API.

Updated by tenderlovemaking (Aaron Patterson) about 11 hours ago

Eregon (Benoit Daloze) wrote in #note-14:

Maybe CRuby does not currently preserve the information of end line and start/end column for procs and methods?

I think we do, but I can investigate. IIRC it's on the InstructionSequence object (so we'd still have to use RubyVM::InstructionSequence on CRuby).

For def it would be trivial to preserve it but I guess for blocks and define_method it might be trickier.
For such cases source_location could internally use the node_id stuff if that's easier or deemed a better trade-off on CRuby.

I think start / column info is always there, but not 100% sure.

In summary:

  • I think we can build Prism.node_for(Proc|Method|UnboundMethod) on (Proc|Method|UnboundMethod)#source_location with start/end line/column.
  • Those would all be public APIs working on all Ruby implementations.
  • Users don't need to know about low-level implementation-specific (i.e. CRuby-only) concepts like node_id.

πŸ‘

Dan0042 (Daniel DeLorme) wrote in #note-15:

Eregon (Benoit Daloze) wrote in #note-3:

If it's a new method, I think we should return a "code location" object (could be Ruby::CodeLocation or Ruby::Location or Ruby::SourceLocation or so) and have the following methods (inspired from https://bugs.ruby-lang.org/issues/6012#note-19):

  • start_line
  • end_line

I really like the idea of a #source method that returns a Ruby::SourceLocation object. However, when there's a start and end, I believe Ruby should ideally align with its own core conventions and return a Range. For example, method.source.lines => start...end. While I understand concerns about the allocation of Range objects and performance, I feel that: 1) this might be an example of premature micro-optimization, and 2) from an API design perspective, a Range object feels like the natural default. Separate start/end accessors could remain a low-level, performance-focused API if truly necessary.

As an alternative to start/end accessors, it would be even better if the Range-returning method could be optimized via opcode. Referring back to the idea of Range#bounds in #20080, we could have something like start_line, end_line = method.source.lines.bounds, which could be optimized via opcode to avoid allocating a Range object entirely.

It would be great to see Ruby continue to embrace its own language idioms and explore such optimizations for a more elegant API.

As @mame (Yusuke Endoh) was pointing out, I don't think a single range for lines makes sense.

Consider the following code:

def foo; <<FOO; end; def bar; <<BAR; end
foo method
FOO
bar method
BAR

What should the lines method report for the source code for bar? It cannot be a single Range because lines 2 and 3 and part of the foo method. bar is only defined line lines 1, 4, and 5. If we were to provide a source method to return the text of the bar method object, what text would it return?

Since heredocs are allowed to extend beyond the end of the method / block, I really don't think it makes sense to try to provide a single start / end line. In order to truly provide the source location of the method, we would have to return multiple objects and I think that type of interface would just be too cumbersome to use. I completely agree that we should provide a way to get the AST for a method rather than try to provide line / column information alone.

Actions

Also available in: Atom PDF

Like1
Like0Like0Like2Like0Like0Like0Like0Like0Like0Like0Like0Like1Like0Like1Like1Like0