Bug #21783
{Method,UnboundMethod,Proc}#source_location returns columns in bytes and not in characters
Description
The documentation says:
= Proc.source_location
(from ruby core)
------------------------------------------------------------------------
prc.source_location -> [String, Integer, Integer, Integer, Integer]
------------------------------------------------------------------------
Returns the location where the Proc was defined. The returned Array
contains:
(1) the Ruby source filename
(2) the line number where the definition starts
(3) the column number where the definition starts
(4) the line number where the definition ends
(5) the column number where the definitions ends
This method will return nil if the Proc was not defined in Ruby (i.e.
native).
The documentation talks about column numbers, so I would expect a number of characters, not bytes.
But currently it's a number of bytes:
$ ruby --parser=prism -ve 'def été; end; p method(:été).source_location'
ruby 4.0.0dev (2025-12-14T07:11:02Z master 711d14992e) +PRISM [x86_64-linux]
["-e", 1, 0, 1, 14]
$ ruby --parser=parse.y -ve 'def été; end; p method(:été).source_location'
ruby 4.0.0dev (2025-12-14T07:11:02Z master 711d14992e) [x86_64-linux]
["-e", 1, 0, 1, 14]
The last number should be 12 because "def été; end".size is 12 characters.
This is a Ruby-level API, so I would never expect "byte columns" here. I think it's clear it should be a number of "editor columns", i.e. a number of characters.
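As a quick check of those two numbers, here is a minimal sketch, assuming the source is UTF-8 encoded (Ruby's default source encoding):

```ruby
# Assumes UTF-8 source encoding (Ruby's default).
source = "def été; end"

char_count = source.size      # counts characters (code points)
byte_count = source.bytesize  # counts bytes; each "é" takes 2 bytes in UTF-8

puts "characters: #{char_count}, bytes: #{byte_count}"
```

The difference of 2 comes from the two "é" characters, each encoded as two bytes.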
Updated by Eregon (Benoit Daloze) 1 day ago
- Description updated (diff)
Updated by Eregon (Benoit Daloze) 1 day ago
- Related to Feature #21005: Update the source location method to include line start/stop and column start/stop details added
Updated by Eregon (Benoit Daloze) 1 day ago
- Related to Feature #6012: Proc#source_location also return the column added
Updated by kddnewton (Kevin Newton) about 22 hours ago
I think this is a documentation issue, as both parsers/compilers operate in terms of bytes. Changing this to characters would likely be a noticeable difference in speed, and quite a bit of code change. (Either both parsers/compilers would have to do this work initially, as that's where the numbers come from, or the source_location function would have to re-parse the source, which is not possible in some cases.) All of that is to say, please do not change this, it will be a ton of work for minimal benefit.
Updated by Eregon (Benoit Daloze) about 18 hours ago
Updating the docs is one solution, so at least it's consistent between docs and behavior.
I think as a Ruby-facing API it's weird that it operates in terms of bytes (and source_location does not have a byte prefix to indicate that).
I think when most programmers hear "line 4, column 6" they expect the 6th character on the 4th line, not the character starting at the 6th byte. That byte position is actually hard to find in an editor: most editors don't show "byte columns", and it's not even possible to place the cursor at some byte positions. Programmers think in characters when looking at source code.
For example, one might expect that highlighting with ^ based on the return values from source_location works, but it doesn't:
def underline(callable)
  file, start_line, start_column, end_line, end_column = callable.source_location
  raise unless start_line == end_line
  source = File.readlines(file)[start_line-1]
  puts source
  puts ' '*start_column + '^'*(end_column-start_column)
end

my_proc = proc { ascii-only }
underline my_proc

my_proc = proc { il était une fois un été }
underline my_proc
gives
$ ruby underline.rb
my_proc = proc { ascii-only }
               ^^^^^^^^^^^^^^
my_proc = proc { il était une fois un été }
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
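If the columns are taken to be byte columns, the underline can be lined up by converting them to character counts first. The following is only a sketch: `underline_bytes` is a hypothetical helper, and the values 15 and 46 are the byte columns assumed for the block on this line (start of `{`, end just past `}`).

```ruby
# Hypothetical helper, not part of Ruby: builds the underline for a span that
# is given in byte columns (0-based start, exclusive end) within one line.
def underline_bytes(line, start_byte, end_byte)
  # Number of characters before the span, and inside the span.
  prefix_chars = line.byteslice(0, start_byte).size
  span_chars   = line.byteslice(start_byte, end_byte - start_byte).size
  ' ' * prefix_chars + '^' * span_chars
end

line = "my_proc = proc { il était une fois un été }"
marker = underline_bytes(line, 15, 46)
puts line
puts marker
```

This still assumes one terminal column per character, so it would not be enough for wide or combining characters.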
> Either both parsers/compilers would have to do this work initially, as that's where the numbers come from, or the source_location function would have to re-parse the source, which is not possible in some cases.
This is a good point, I didn't realize that.
I think it would still be worth changing the parsers/compilers to compute the proper character column for literal lambdas, blocks and methods. It probably wouldn't be very expensive, given that most source files are ASCII-only. The parsers could potentially even use the knowledge that a given line is ASCII-only, so it would still be as fast even if the file contains a few non-ASCII characters.
If columns would e.g. appear in error messages, I think everyone would expect them to be character columns, not byte columns.
For example gcc shows character columns, as one would expect:
int main() {
    /* été */ notexist
}
$ gcc test.c
test.c: In function ‘main’:
test.c:2:15: error: ‘notexist’ undeclared (first use in this function)
    2 |     /* été */ notexist
      |               ^~~~~~~~
Note it's 2:15 (i.e. character columns), not 2:17 (byte columns).
The highlighting also needs to use character columns of course.
Updated by Eregon (Benoit Daloze) about 17 hours ago
In https://bugs.ruby-lang.org/issues/6012#note-25 @matz (Yukihiro Matsumoto) said adding column was OK, but not byte offsets.
I'm not sure what his reasons were, but maybe it's that byte offsets are too low-level for source_location?
If so, I would think byte columns are also too low level and it should be character columns instead.
From a user POV character columns seem better and more expected.
OTOH, I understand the reservation from @kddnewton (Kevin Newton) and I share it as a Ruby implementer, it's much simpler to return byte columns.
For example in TruffleRuby we currently save location information by having int32_t start_offset; int32_t length; in every Truffle AST node, i.e. byte offset and byte length.
Returning byte columns from that is easy and only requires the "newline offsets" array, and not the actual source code.
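As an illustration of why the "newline offsets" array is enough for byte columns, here is a minimal sketch. The method names are hypothetical and this is not TruffleRuby's actual code; it only assumes a node stores a byte offset into the source.

```ruby
# Byte offset at which each line starts. The final entry is the offset just
# past the last line, which is harmless for the lookup below.
def line_start_offsets(source)
  offsets = [0]
  offset = 0
  source.each_line do |line|
    offset += line.bytesize
    offsets << offset
  end
  offsets
end

# Maps a byte offset to [1-based line, 0-based byte column] using only the
# offsets array, without looking at the source text again.
def line_and_byte_column(offsets, byte_offset)
  next_line = offsets.bsearch_index { |start| start > byte_offset } || offsets.size
  line_index = next_line - 1
  [line_index + 1, byte_offset - offsets[line_index]]
end

source = "x = 1\ndef été; end\n"
offsets = line_start_offsets(source)
location = line_and_byte_column(offsets, 20) # byte offset just past "end"
p offsets
p location
```

Once `offsets` has been computed, the source string itself is no longer needed for this lookup.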
To return character columns, TruffleRuby would need to read from the beginning of the line to the byte offset to find how many characters that is, and keep the source code in memory (currently TruffleRuby does keep it in memory, but it might not in the future).
I have also seen this in the context of adding Prism.node_for. For that usage, having byte columns is actually easier than character columns. OTOH, it's not hard to convert from character columns to byte columns in that case, and I already wrote the logic for that (because I expected source_location would return character columns, even before reading the docs).
It is of course possible to convert from character column to byte column and vice versa, but it requires access to the source code, which is not always available (e.g. eval).
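A sketch of both conversions, to make the dependency on the source text concrete. The helper names are hypothetical; both take the text of the line, which is exactly the part that may be unavailable.

```ruby
# Hypothetical helpers. Columns are 0-based offsets from the start of the line.

# Byte column -> character column: count the characters in the byte prefix.
def byte_to_char_column(line, byte_column)
  line.byteslice(0, byte_column).size
end

# Character column -> byte column: count the bytes in the character prefix.
def char_to_byte_column(line, char_column)
  line[0, char_column].bytesize
end

line = "def été; end"
char_column = byte_to_char_column(line, 14)
byte_column = char_to_byte_column(line, 12)
puts "byte 14 -> char #{char_column}, char 12 -> byte #{byte_column}"
```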
Updated by kddnewton (Kevin Newton) about 16 hours ago
Honestly if we're interpreting column as something visual like you're implying, we're also going to run into issues with grapheme clusters and east asian width and all the other implications for whatever "character" actually means. I think we would also have to return the encoding of the source file inside that array in order for it to make any sense.
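To illustrate the grapheme cluster point, here is a small sketch (assuming UTF-8) comparing two strings that render the same way but give different counts:

```ruby
composed   = "\u00E9"   # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

counts = [composed, decomposed].map do |s|
  # bytes, code points ("characters" in String#size terms), grapheme clusters
  [s.bytesize, s.size, s.grapheme_clusters.size]
end

counts.each { |bytes, chars, graphemes| puts "bytes=#{bytes} chars=#{chars} graphemes=#{graphemes}" }
```

Display width is yet another unit: a character such as "日" is one code point and one grapheme cluster, but typically occupies two terminal columns, and Ruby core has no method that reports that width.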
Updated by matz (Yukihiro Matsumoto) about 10 hours ago
I'd like to cancel source_location to have column information in 4.0, due to this concern. In my personal opinion, I am leaning toward byte index, though.
Matz.
Updated by Eregon (Benoit Daloze) about 4 hours ago
matz (Yukihiro Matsumoto) wrote in #note-8:
> I'd like to cancel source_location to have column information in 4.0, due to this concern.
> Matz.
Thank you for the quick reply. I think that would be the worst outcome though: https://bugs.ruby-lang.org/issues/6012 was opened 14 years ago, and I have seen multiple users needing this in recent years.
It is one of the features I'm most looking forward to in Ruby 4.0 (in fact it's one of two features that really interest me in 4.0: that and Ractor improvements, the rest looks rather unexciting to me).
IOW, I would much rather have byte columns in 4.0 than no columns at all.
If we delay this, we'll implicitly tell people that using RubyVM::AbstractSyntaxTree is the only way to get column information, and that's bad because it only works on CRuby and it's not a proper API.
Alternative Ruby implementations might have to define their own API to get column information, vs just using the one we have agreed on in #6012 (I'd much rather not get there).
> In my personal opinion, I am leaning toward byte index, though.
Let's go with byte columns then? I can make a PR to document that.
It seems you and @kddnewton (Kevin Newton) agree on that. I'm basically undecided about which is best, but byte columns are definitely better than no columns.
I see it has already been reverted though :/ https://github.com/ruby/ruby/commit/065c48cdf11a1c4cece84db44ed8624d294f8fd5
Updated by byroot (Jean Boussier) about 3 hours ago
Maybe I'm totally off, but I expect this data to be used to extract the source code, e.g show a snippet of code in an error message, or something akin to that, hence byte offsets seem actually more convenient? (and performant).
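A sketch of that use case, assuming the location uses 1-based lines and 0-based byte columns with an exclusive end, as in the examples earlier in this thread. `extract_source` is a hypothetical helper and takes the file's lines as an array.

```ruby
# Hypothetical helper: returns the source text covered by a location given in
# byte columns. `lines` is an array of lines, each including its newline.
def extract_source(lines, start_line, start_column, end_line, end_column)
  selected = lines[(start_line - 1)..(end_line - 1)]
  if start_line == end_line
    selected[0].byteslice(start_column, end_column - start_column)
  else
    first = selected.first
    head = first.byteslice(start_column, first.bytesize - start_column)
    tail = selected.last.byteslice(0, end_column)
    [head, *selected[1...-1], tail].join
  end
end

lines = ["my_proc = proc { il était une fois un été }\n"]
snippet = extract_source(lines, 1, 15, 1, 46)
puts snippet
```

With byte columns, `String#byteslice` can be applied directly. With character columns the same helper would use `String#[]` instead, which has to scan non-ASCII lines to find the character boundaries.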
But yes, it definitely needs to be explicitly documented.
Updated by Eregon (Benoit Daloze) about 3 hours ago
I made a PR to re-add {Method,UnboundMethod,Proc}#source_location and fix all known issues: https://github.com/ruby/ruby/pull/15580
@matz (Yukihiro Matsumoto) Would it be OK to merge it? 🙏
For context I opened this issue because I was surprised at the semantics and it was documented in character columns.
If the documentation stated the columns are in bytes I would have thought "somewhat unexpected for me, but I can deal with it, moving on".
So for the next person, if they look at the docs it should be clear now with this PR.
I would have never expected this issue to revert the feature, this is certainly not what I want.
I opened this issue to show the doc & implementation inconsistency, explain my expectations, and discuss what fix makes sense.
I think @kddnewton (Kevin Newton) makes a good point about grapheme clusters and east asian width, where even character width is not enough.
And it's probably not reasonable to ask parsers to handle those cases either.
So I now think byte columns is a good choice, as long as it's properly documented.
I think this is a case where we shouldn't let perfection (which is not really achievable here) get in the way of usefulness, as in the saying "perfect is the enemy of good".
BTW using a variant of my C example from above with different compilers shows some variety:
int main() {
    char* s = "🎉🎉🎉"; oops
}
$ gcc test.c
test.c: In function ‘main’:
test.c:2:25: error: ‘oops’ undeclared (first use in this function)
    2 |     char* s = "🎉🎉🎉"; oops
      |                         ^~~~
$ clang test.c
test.c:2:31: error: use of undeclared identifier 'oops'
    2 |     char* s = "🎉🎉🎉"; oops
      |                         ^
gcc shows column 25, which matches neither count: '    char* s = "🎉🎉🎉"; ' is 21 characters and 30 bytes.
clang shows the column in bytes, so there is clearly some variety there, and at least clang chose to show byte columns.
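The different column units for that line can be computed with a short sketch. Two assumptions here are not stated in the thread: the line is indented by four spaces (which matches the 21-character and 30-byte counts), and each emoji occupies two terminal columns. Under those assumptions, gcc's 25 is consistent with a display-width column, though that is only one possible reading.

```ruby
# Assumption: four spaces of indentation, as implied by the 21/30 counts.
line = '    char* s = "🎉🎉🎉"; oops'
prefix = line[0, line.index("oops")]

char_column = prefix.size + 1      # 1-based column counted in characters
byte_column = prefix.bytesize + 1  # 1-based column counted in bytes

# Rough display width: assumes code points from U+1F300 upward take two
# terminal columns and everything else takes one. Not a general width function.
display_column = prefix.each_char.sum { |c| c.ord >= 0x1F300 ? 2 : 1 } + 1

puts "chars=#{char_column} bytes=#{byte_column} display=#{display_column}"
```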
Updated by Eregon (Benoit Daloze) about 3 hours ago
byroot (Jean Boussier) wrote in #note-10:
> Maybe I'm totally off, but I expect this data to be used to extract the source code, e.g show a snippet of code in an error message, or something akin to that, hence byte offsets seem actually more convenient? (and performant).
Yes, for that case which might indeed be the most common one, it's more convenient to have the columns in bytes.
And I suspect many of these cases would likely use Prism and specifically Prism.node_for to get more information, such as showing the method call to which the block was given, highlight some specific part of the block or method, etc.
Updated by mame (Yusuke Endoh) about 3 hours ago
Eregon (Benoit Daloze) wrote in #note-12:
> And I suspect many of these cases would likely use Prism and specifically Prism.node_for to get more information, such as showing the method call to which the block was given, highlight some specific part of the block or method, etc.
I agree. But then, when would Proc#source_location be useful?
Updated by Eregon (Benoit Daloze) about 3 hours ago
mame (Yusuke Endoh) wrote in #note-13:
> I agree. But then, when would Proc#source_location be useful?
There are cases where it's useful to show just the block, i.e. { block body }.
And also to be able to use Prism.node_for one of course needs Proc#source_location.
Maybe you're saying Proc#source_location should always include the method call? I think it's good as-is and does not need changing (a little bit discussed in #21784).
Updated by byroot (Jean Boussier) about 2 hours ago
Isn't Proc#source_location very useful for things akin to power_assert?
If it was easy to extract a proc's source code, I'd certainly integrate it into a testing framework to improve failure rendering.
Updated by Eregon (Benoit Daloze) 37 minutes ago
Yes, it'd be useful for e.g. assert_raise(SomeException) { ... } in test-unit, expect { ... }.to in RSpec, -> { ... }.should raise_error(...) in MSpec, etc.
For example -> { 13 - "10" }.should raise_error(ArgumentError) in MSpec currently gives:
1)
Integer#- fixnum raises a TypeError when given a non-Integer ERROR
Expected ArgumentError
but got: TypeError (String can't be coerced into Integer)
/home/eregon/code/rubyspec/core/integer/minus_spec.rb:21:in 'Integer#-'
But it could show something like:
1)
Integer#- fixnum raises a TypeError when given a non-Integer ERROR
Expected ArgumentError
but got: TypeError (String can't be coerced into Integer)
For expression `13 - "10"`
/home/eregon/code/rubyspec/core/integer/minus_spec.rb:21:in 'Integer#-'