Feature #15940

Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals

Added by byroot (Jean Boussier) about 1 month ago. Updated 9 days ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:93250]

Description

Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.

class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end

Since the default source encoding is UTF-8, and symbols coerce their internal fstring to US-ASCII when possible, the above snippet will actually keep two instances of "name" in the fstring registry: one in US-ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases allows reusing the equivalent string literals.
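This is easy to observe on a pre-patch Ruby (assuming the default UTF-8 source encoding, no magic comment):

```ruby
# String literals take the source encoding (UTF-8 by default), while
# ASCII-only symbols coerce their internal fstring to US-ASCII.
'name'.encoding       # => #<Encoding:UTF-8>
:name.to_s.encoding   # => #<Encoding:US-ASCII>

# Non-ASCII symbols keep the source encoding, so no coercion happens there.
:"olé".to_s.encoding  # => #<Encoding:UTF-8>
```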

The only notable behavioral change is Symbol#to_s.

Previously :name.to_s.encoding would be #<Encoding:US-ASCII>.
After this patch it's #<Encoding:UTF-8>. I can't foresee any significant compatibility impact of this change on existing code.

However, there are several ruby specs asserting this behavior, but I don't know if they can be changed or not: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification is impossible to change, then we could consider changing back the encoding of the String returned by Symbol#to_s, e.g. in Ruby pseudo-code:

def to_s
  str = fstr.dup
  # Downgrade back to US-ASCII when the name is 7-bit only.
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end
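Made self-contained, the idea behaves like this (symbol_name_to_s and the string argument standing in for the symbol's internal fstring are hypothetical, just for illustration):

```ruby
# Simulate Symbol#to_s on top of a UTF-8 internal fstring: return a copy,
# downgraded back to US-ASCII whenever the content is 7-bit only.
def symbol_name_to_s(fstr)
  str = fstr.dup
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end

p symbol_name_to_s('name').encoding  # => #<Encoding:US-ASCII>
p symbol_name_to_s('olé').encoding   # => #<Encoding:UTF-8>
```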

History

Updated by byroot (Jean Boussier) about 1 month ago

In order to provide some data, I counted the duplicates in a Redmine heap dump (ObjectSpace.dump_all):

Here is the counting code:

#!/usr/bin/env ruby
# frozen_string_literal: true
require 'json'

# Read an ObjectSpace.dump_all heap dump (one JSON object per line) from
# stdin and keep only the interned strings (fstrings).
fstrings = []
STDIN.each do |line|
  object = JSON.parse(line)
  fstrings << object if object['fstring']
end

# Count how many fstrings share the same content, regardless of encoding.
counts = Hash.new(0)
fstrings.each { |str| counts[str['value']] += 1 }
duplicates = counts.select { |_value, count| count > 1 }.keys

puts "total fstrings: #{fstrings.size}"
puts "dups: #{duplicates.size}"
puts "sample:"
puts duplicates.first(20)
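For reference, a heap dump in the line-per-object JSON format this script expects can be produced with ObjectSpace.dump_all; the file name below is just an example:

```ruby
require 'objspace'

# Write every live object as one JSON document per line. dump_all also
# accepts output: :string or output: :stdout instead of an IO.
File.open('heap.json', 'w') do |io|
  ObjectSpace.dump_all(output: io)
end
```

The counting script can then read the dump from stdin, e.g. `ruby count_fstrings.rb < heap.json` (script name hypothetical).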

And the results for Redmine:

total fstrings: 84678
dups: 3686
sample:
changes
absent
part
EVENTS
RANGE
OBJECT
Silent
EXCEPTION
Settings
DATE
Index
Graph
COMPLEX
Definition
fcntl
inline
lockfile
update
gemfile
oth

That's about 4% of the fstring table being duplicates.

I also ran the script against one much bigger private app, and the duplicate ratio was similar, but the table was an order of magnitude bigger.

Updated by duerst (Martin Dürst) about 1 month ago

I think this change makes a lot of sense, not only because of the memory savings, but also because of the streamlining for encodings in general.

I agree that the chance of backwards incompatibilities in actual programs is very low. I think this is a good change for Ruby 2.7.

Updated by naruse (Yui NARUSE) 26 days ago

Note that one incompatibility caused by the change of string encoding is String#<<(integer).

Maybe String#<<(n) should be deprecated when n > 127, requiring the encoding to be specified explicitly.

Updated by byroot (Jean Boussier) 26 days ago

naruse (Yui NARUSE) Interesting, I actually had no idea you could call String#<<(integer).

In my humble opinion, appending an integer to the string returned by Symbol#to_s is quite specific, and unlikely to be common in the wild.

Additionally, UTF-8 strings accept everything an ASCII string would, so to break existing code, something would need to expect String#<<(127+) to blow up.

>> 'a' << 234324
=> "a\u{39354}"
>> 'a'.force_encoding(Encoding::BINARY) << 234324
RangeError: 234324 out of char range

That being said, it's not my role to judge the backward compatibility impact, and I'm likely to overlook parts of it.

More generally, let me know if there's anything I can do to push this forward. I think I'll update the test suite in the pull request to acknowledge the encoding change and get a green test suite.

Updated by duerst (Martin Dürst) 25 days ago

naruse (Yui NARUSE) wrote:

Note that one incompatibility caused by the change of string encoding is String#<<(integer).

Maybe String#<<(n) should be deprecated when n > 127, requiring the encoding to be specified explicitly.

If I understand this correctly, the proposal is to change the encoding of Symbols from ASCII to UTF-8. So if such a symbol is converted to a String (which in itself may not be that frequent), and then an Integer is 'shifted' into that String with <<, then the only incompatibility that we get is that until now, it was an error to do that with a number > 127.

So the overall consequence is that something that produced an error up to now no longer produces one. I guess that's an incompatibility we should be able to tolerate. It's much more of a problem if something that worked until now stops working, or if something that worked one way suddenly works another way.

As for explicitly specifying an encoding argument for String#<<, I understand that it may be the conceptually correct thing to do (we are using the Integer as a character number, so we had better know what encoding that character number is expressed in). But the encoding is already available from the string, and in most cases will be the source encoding anyway, which will be UTF-8 in most cases. Also, because << is a binary operator, it would be difficult to add additional parameters.

Updated by Eregon (Benoit Daloze) 24 days ago

duerst (Martin Dürst) wrote:

If I understand this correctly, the proposal is to change the encoding of Symbols from ASCII to UTF-8. So if such a symbol is converted to a String (which in itself may not be that frequent), and then an Integer is 'shifted' into that String with <<, then the only incompatibility that we get is that until now, it was an error to do that with a number > 127.
So the overall consequence is that something that produced an error up to now no longer produces one. I guess that's an incompatibility we should be able to tolerate. It's much more of a problem if something that worked until now stops working, or if something that worked one way suddenly works another way.

It's not raising an error:

$ ruby -ve 's=:abc.to_s; s<<233; p s; p s.encoding'                     
ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-linux]
"abc\xE9"
#<Encoding:ASCII-8BIT>

$ ruby -ve 's=:abc.to_s.force_encoding("UTF-8"); s<<233; p s; p s.encoding'
ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-linux]
"abcé"
#<Encoding:UTF-8>

I'm a bit concerned about compatibility; I think we should evaluate with a few gems, and see how much of test-all and the specs fail with this change.

I agree in general having a consistent encoding for Symbol literals seems simpler for semantics.

TruffleRuby reuses the underlying memory (byte[], aka char*) for interned Strings of different encodings, so only the metadata (encoding, coderange, etc) is duplicated, but not the actual bytes. Probably MRI could do the same, and that would be transparent and not need to change semantics.

Updated by Eregon (Benoit Daloze) 24 days ago

byroot (Jean Boussier) wrote:

However, there are several ruby specs asserting this behavior, but I don't know if they can be changed or not.

Specs can always be changed, along with ruby_version_is guards to specify which behavior on which version (https://github.com/ruby/ruby/blob/trunk/spec/README.md).
Needing to change them indicates a potential incompatibility, though.

Updated by Eregon (Benoit Daloze) 24 days ago

If we change this, the encoding of Symbol literals should be the same as String literals, i.e., use the file's magic encoding comment or UTF-8 if there isn't one.

Updated by nirvdrum (Kevin Menard) 23 days ago

I generally like the idea, but really from a semantics perspective rather than a memory-savings one. It's confusing to implementers and end users alike that Symbols take on a different encoding from Strings if they happen to be ASCII-only. So the other nice benefit of the change is that String#{intern,to_sym} can be made much cheaper. Having said all of that, I'm sure the current behavior was maintained for a reason when non-ASCII-only Symbols were introduced. I think it'd be good to look back and see what the rationale was.

If the solution is to convert the String's encoding when calling Symbol#to_s on an ASCII-only Symbol, then I think you're going to want to investigate knock-on effects. E.g., String#force_encoding currently unconditionally clears the String's code range, and that's metadata you really don't want to lose. But by setting the encoding to US-ASCII, you may be okay most of the time, because there are code paths that just check whether the encoding uses single-byte characters without doing a full code range scan. Likewise, if you decide to skip the US-ASCII conversion, you could have the inverse problem: now you have a UTF-8 string, and if it doesn't have its code range set, you've turned some O(1) operations into O(n).

Please note, I haven't really traced all the String and Symbol code. These were potential pitfalls that stood out to me when reviewing the PR and looking briefly at the CRuby source. My general point is that even if things come out correct, you could still alter the execution profile in such a way as to introduce a performance regression, by changing from a fixed-width to a variable-width encoding or by not taking proper care of the code range value. None of that is insurmountable, of course.

Updated by byroot (Jean Boussier) 20 days ago

Sorry for the late reply; somehow I can't make email notifications work on Redmine...

Specs can always be changed, along with ruby_version_is guards to specify which behavior on which version

Thanks for letting me know. I updated the PR; I expect it to pass CI, but will do further updates if it doesn't.

If we change this, the encoding of Symbol literals should be the same as String literals, i.e., use the file's magic encoding comment or UTF-8 if there isn't one.

Yes and no.

First, it's kind of already the case and stays that way: if the symbol name can't be expressed as pure ASCII, it keeps the string's encoding, hence the file encoding.

However, one of the reasons the encoding is coerced is the following situation:

# encoding: iso-8859-1
ISO_SYMBOL = :foo

# encoding: utf-8
UTF_SYMBOL = :foo

You do want both constants to reference the same symbol. From what I gathered it was the whole reason behind the ASCII coercion.

I'm sure the current behavior was maintained when non-ASCII-only Symbols were introduced for a reason.

I believe it's the reason I described above.

If the solution then is to convert the String's encoding when calling Symbol#to_s

Yeah, that was just a suggestion to retain to_s backward compatibility, but I really don't think it's a good idea.

Updated by Eregon (Benoit Daloze) 20 days ago

byroot (Jean Boussier) wrote:

You do want both constants to reference the same symbol. From what I gathered it was the whole reason behind the ASCII coercion.

That makes sense, thanks for the explanation.

US-ASCII is the natural subset for 7-bit characters, so it makes perfect sense to me that it's used for 7-bit symbols.
UTF-8 is not, and is less precise than US-ASCII for that matter.
At least performance-wise it shouldn't matter too much since the coderange will be CR_7BIT.

I'm unsure, it seems a bit arbitrary to give "ascii" symbols a UTF-8 encoding.
And many core methods return US-ASCII Strings and I would say that it is expected when they only return 7-bit characters.
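For instance, conversions that can only ever produce 7-bit output already behave that way (a couple of easy-to-check examples):

```ruby
# Integer#to_s can only emit 7-bit characters, so it returns US-ASCII
# strings regardless of the source encoding of the file calling it.
42.to_s.encoding        # => #<Encoding:US-ASCII>
255.to_s(16).encoding   # => #<Encoding:US-ASCII>
(10**20).to_s.encoding  # => #<Encoding:US-ASCII>
```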

Updated by Eregon (Benoit Daloze) 20 days ago

Sharing char* is a more general optimization, and could apply to more cases (e.g., frozen Strings with identical bytes but different encodings).
So I'm thinking that would be better rather than changing semantics for the (rather obscure to end users) purpose of fitting better with the current fstring representation.

If we do this, I'd like another reason than the internal optimization (which can be done another way), but it's just my opinion.

Updated by byroot (Jean Boussier) 20 days ago

US-ASCII is the natural subset for 7-bit characters, so it makes perfect sense to me that it's used for 7-bit symbols. UTF-8 is not, and is less precise than US-ASCII for that matter.

I don't disagree with this, but my point is that UTF-8 is a superset of US-ASCII, and much more likely to be the encoding of the various frozen string literals.

At least performance-wise it shouldn't matter too much

What do you mean by performance? String comparisons? If so, it doesn't really matter much for symbols AFAIK.

I'm unsure, it seems a bit arbitrary to give "ascii" symbols a UTF-8 encoding.

IMO there are two arguments here:

  • Consistency / Least surprise: UTF-8 is now the default source file encoding, it would make sense that the symbols created out of these files (not just Symbol instances, but module names, method names etc) would be UTF-8 as well.
  • Memory usage: as explained in the original issue description, it saves some memory.

Honestly, what is surprising to me is this:

'foo'.encoding # => UTF-8
:foo.to_s.encoding # => US-ASCII
module Foo; end
Foo.name.encoding # => US-ASCII
Foo.method(:name).name.encoding # => US-ASCII
:"olé".to_s.encoding # => UTF-8

Sharing char* is a more general optimization, and could apply to more cases (e.g., frozen Strings with identical bytes but different encodings).

The problem is that the different encodings have to be kept somewhere. So you end up with the original string, plus some form of shared string that points to the original one and holds the different encoding.

So unless the string is too big to be embedded (rarely the case for symbols), you haven't actually saved anything.
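The embedding distinction can be observed with objspace; the exact byte counts are CRuby implementation details that vary by version, so only the ordering is asserted here:

```ruby
require 'objspace'

short = 'name'     # short enough to live inside its own RString slot
long  = 'x' * 1000 # needs a separately malloc'ed buffer

# An embedded string costs just the object slot (40 B at the time of
# this discussion); a non-embedded one also pays for its heap buffer.
p ObjectSpace.memsize_of(short)
p ObjectSpace.memsize_of(long)
```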

Updated by Hanmac (Hans Mackowiak) 20 days ago

I didn't check the code in depth, but would it help if Symbol#to_s or Module#name returned a shared string?

In that case, memory would only be allocated when the returned string itself gets changed.

Or why not return a frozen string?

Updated by Eregon (Benoit Daloze) 20 days ago

byroot (Jean Boussier) wrote:

What do you mean by performance? String comparisons? If so, it doesn't really matter much for symbols AFAIK.

I mean performance of String operations on a UTF-8 vs a US-ASCII String.
As nirvdrum (Kevin Menard) said above, some optimizations might only apply to US-ASCII, although in most cases the coderange should make it apply to UTF-8 too.

IMO there are two arguments here:

  • Consistency / Least surprise: UTF-8 is now the default source file encoding, it would make sense that the symbols created out of these files (not just Symbol instances, but module names, method names etc) would be UTF-8 as well.

Right, that argument makes sense to me.

Does this PR also address module and method names?
FWIW, TruffleRuby already uses UTF-8 for module and method names, and it seems not to be a compatibility problem.

It will be a bit weird if there is a magic encoding comment, though: Symbols and module/method names will then be UTF-8 if 7-bit, but the specified magic encoding if not 7-bit.

Updated by byroot (Jean Boussier) 20 days ago

would it help if Symbol#to_s or Module#name returned a shared string?

It's not really about the returned string, it's about the internal frozen string that is kept in the symbol table.

or why not return a frozen string?

I already proposed it, but it was rejected over backward compatibility concerns: https://bugs.ruby-lang.org/issues/15836

And again, it's kind of another topic entirely.

I mean performance of String operations on a UTF-8 vs a US-ASCII String.

Right. What I was trying to say is that most of the time you compare symbols directly, which doesn't involve string comparisons.

However it's true that performance might be impacted for operations done on the strings returned by Symbol#to_s.

I wonder whether the coderange could be eagerly set, as in this case we do know it's 7-bit. I suppose so; I need to dig into that part of strings.

Does this PR also address module and method names?

Yes, I think it does.

it seems not to be a compatibility problem.

That doesn't surprise me one bit. I bet the vast majority of the strings returned by Symbol#to_s and Module#name end up converted to UTF-8 because they are concatenated with string literals which are UTF-8.
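That conversion is a consequence of Ruby's encoding negotiation: concatenating a 7-bit US-ASCII string with a UTF-8 literal yields a UTF-8 result. A minimal pre-patch illustration:

```ruby
s = :name.to_s
s.encoding               # => #<Encoding:US-ASCII>

# The 7-bit content is valid in both encodings, and Ruby picks the
# richer compatible one, so the result comes out as UTF-8.
('user: ' + s).encoding  # => #<Encoding:UTF-8>
```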

Updated by Eregon (Benoit Daloze) 20 days ago

I'm fine with this proposal, it would be interesting to hear what others think.
byroot (Jean Boussier) Could you add this issue to #15930 for discussion?

Updated by matz (Yukihiro Matsumoto) 11 days ago

First of all, this pull-request itself breaks non-UTF-8 programs. It should use the source encoding instead of UTF-8 directly.
Second, 4% of the fstring table is only a fraction of total memory consumption. I am not sure how effective it would be.

If you update the pull-request to use the source encoding, we will merge it for the experiment.

Matz.

Updated by Eregon (Benoit Daloze) 10 days ago

matz (Yukihiro Matsumoto) wrote:

If you update the pull-request to use the source encoding, we will merge it for the experiment.

matz (Yukihiro Matsumoto) I thought the same, but that then makes :foo in files with different source encodings no longer the same object:
https://bugs.ruby-lang.org/issues/15940#note-10
That sounds like a larger breaking change to me.

Updated by matz (Yukihiro Matsumoto) 10 days ago

That's why I said experiment. We need to measure in a real-world application, to ensure the change reduces memory consumption before the release. If not, we should give up the idea. I worry a 4% reduction of the fstring table is not big enough, especially considering the mix of source encodings. But we need a number anyway.

Matz.

Updated by byroot (Jean Boussier) 9 days ago

First of all, this pull-request itself breaks non-UTF-8 programs.

Could you elaborate on this? I don't understand what breaks in non-UTF-8 programs. I ran some tests with # encoding: EUC-JP and couldn't find anything breaking.

However, there was indeed a bug that broke the build, which I just fixed.

It should use the source encoding instead of UTF-8 directly.

Like Eregon (Benoit Daloze), I don't understand the rationale. Currently, regardless of the file encoding, ASCII-only symbols are coerced to US-ASCII.

4% of fstring table is only a fraction of total memory consumption. I am not sure how much effective.

Yes, I know it's not a big saving. I only submitted it because I couldn't see any real drawback, so a small gain for a small effort seemed worth it.

I can already tell approximately how much it saves based on the Redmine benchmark: 3,686 string instances saved, the vast majority of them embedded (40 B each), so 147,440 B (~147 kB).

Compared to ObjectSpace.memsize_of_all in the same process, which reports 48_149_329 B, that's a 0.3% saving overall (or even less, because ObjectSpace.memsize_of_all isn't perfectly accurate). Which indeed isn't impressive at all.

That being said, on our internal app that has a ~10x bigger fstring table, the duplication ratio is similar, so the saving would be over 1 MB, which, while still small relatively speaking, is significant in absolute terms.

So if you are sure this change would cause issues, I'd rather close it now, because I know the savings it brings won't ever justify it.

Updated by matz (Yukihiro Matsumoto) 9 days ago

If I understand your patch correctly:

# encoding: euc-jp

# old behavior
p "foo".encoding     # => "euc-jp"
p :foo.to_s.encoding # => "us-ascii"

# new behavior
p "foo".encoding     # => "euc-jp"
p :foo.to_s.encoding # => "utf-8"

I feel this is an inconsistent and confusing behavior change. Am I wrong?

Besides that, I am not sure this change is worth saving 147 KB, or even 1.4 MB, in apps that might consume a few hundred GB of memory.

Matz.

Updated by byroot (Jean Boussier) 9 days ago

If I understand your patch correctly

Yes you do.

I feel this is an inconsistent and confusing behavior change. Am I wrong?

I don't know if you are wrong, but at least we don't agree.

My reasoning is as follows:

  • Simple symbols (read: pure ASCII) have to be coerced into a common encoding, so that :foo in a # encoding: euc-jp file equals :foo in a # encoding: iso-8859-1 file.
  • UTF-8 is a strict superset of ASCII. Any valid ASCII is valid UTF-8.
  • Simple symbols being UTF-8 encoded isn't any weirder to me than them being ASCII encoded.
  • UTF-8 being the default Ruby source encoding, it makes sense for it to be the default internal symbol encoding.
  • If, like most Ruby users, my source is UTF-8 encoded, then it removes one source of surprise.
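The superset point is visible in plain Ruby: two ASCII-only strings compare (and hash) equal even when their encodings differ, so ordinary code can't tell which coercion target was picked. It's only the fstring table, which keys on encoding as well as bytes, that ends up with duplicates:

```ruby
utf8  = 'foo'                                 # UTF-8, the default source encoding
ascii = 'foo'.dup.force_encoding('US-ASCII')  # same bytes, different encoding

# Equality and Hash lookups ignore the encoding for 7-bit strings.
utf8 == ascii        # => true
{ utf8 => 1 }[ascii] # => 1
```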

Besides that, I am not sure this change is worth saving 147 KB, or even 1.4 MB, in apps that might consume a few hundred GB of memory.

That is entirely your call. I personally don't see any downside to this change, hence the minor memory saving is welcome to me; but if you do see a downside, then I agree it's not a big enough saving to justify it.

Also, small nitpick: the 1.4 MB saving is for an app consuming hundreds of MB, not GB.

Updated by byroot (Jean Boussier) 9 days ago

Just to extend your example:

# old behavior
p "foo".encoding          # => "euc-jp"
p :foo.to_s.encoding      # => "us-ascii"
p :"注意書".to_s.encoding  # => "euc-jp"

# new behavior
p "foo".encoding          # => "euc-jp"
p :foo.to_s.encoding      # => "utf-8"
p :"注意書".to_s.encoding  # => "euc-jp"

If the symbol is not pure ASCII, the current behavior doesn't change.

Updated by matz (Yukihiro Matsumoto) 9 days ago

UTF-8 is a strict superset of ASCII. Any valid ASCII is valid UTF-8.

So is EUC-JP (and Shift_JIS, and many other encodings). Even though UTF-8 is the current default source encoding of Ruby, ALL ASCII symbols being in UTF-8 encoding makes me feel weird.

I understand the need and room to reduce fstring table size, but probably in a different way.

Matz.
