Feature #20792
open
String#forcible_encoding?
Added by kddnewton (Kevin Newton) 26 days ago.
Updated 15 days ago.
Description
I would like to add a method to String called forcible_encoding?(encoding)
. This would return true or false depending on whether the receiver can be forced into the given encoding without breaking the string. It would effectively be an alias for:
def forcible_encoding?(enc)
original = encoding
result = force_encoding(enc).valid_encoding?
force_encoding(original)
result
end
I would like this method because there are extremely rare but possible circumstances where source files are marked as binary but contain UTF-8-encoded characters. In that case I would like to check if it's possible to cleanly force UTF-8 before actually doing it. The code I'm trying to replace is here: https://github.com/ruby/prism/blob/d6e9b8de36b4d18debfe36e4545116539964ceeb/lib/prism/parse_result.rb#L15-L30.
The pull request for the code is here: https://github.com/ruby/ruby/pull/11851.
I have wanted this feature too, how about adding an optional argument to String#valid_encoding?
?
Like binary_string.valid_encoding?(Encoding::UTF_8)
.
In terms of performance though, both methods need 2 code range scans if the String is then force_encoding(Encoding::UTF_8)
if valid and used later for some operation needing the code range.
Maybe the String should stay in the argument encoding if it's valid in it?
I think I discussed this with @byroot (Jean Boussier) a couple times as well, what about String#with_encoding(Encoding)
, which is like force_encoding
but doesn't mutate the receiver.
Then this would be efficient in that it would scan the code range once:
utf8 = binary_str.with_encoding(Encoding::UTF_8)
if utf8.valid_encoding?
return utf8
else
return binary_str
end
And of course it would shorten str.dup.force_encoding(Encoding::UTF_8)
to str.with_encoding(Encoding::UTF_8)
.
Eregon (Benoit Daloze) wrote in #note-3:
I think I discussed this with @byroot (Jean Boussier) a couple times as well, what about String#with_encoding(Encoding)
, which is like force_encoding
but doesn't mutate the receiver.
Then this would be efficient in that it would scan the code range once:
utf8 = binary_str.with_encoding(Encoding::UTF_8)
if utf8.valid_encoding?
return utf8
else
return binary_str
end
And of course it would shorten str.dup.force_encoding(Encoding::UTF_8)
to str.with_encoding(Encoding::UTF_8)
.
A variation of this would be something like:
if utf8 = binary_str.try_encoding(Encoding::UTF_8)
return utf8
else
return binary_str
end
The implementation would be the equivalent of:
def try_encoding(encoding)
target = self.dup.force_encoding(encoding)
target.valid_encoding? ? target : nil
end
Right, that would also solve this specific usage.
However I think String#with_encoding
is a good general method to have, as for instance there are lots of .dup.force_encoding
.
It's one of the rare cases where Ruby core API don't have a nice way to do something in a non-mutating manner so that would fix it.
It's especially striking for frozen literal strings where one just wants it in some specific encoding, but currently one has to "foo".dup.force_encoding(SOME_ENCODING)
.
There is String#b
but that's only for the BINARY encoding (and the name is not really clear).
Eregon (Benoit Daloze) wrote in #note-1:
I have wanted this feature too, how about adding an optional argument to String#valid_encoding?
?
Like binary_string.valid_encoding?(Encoding::UTF_8)
.
I'm partial to this one. Alternatively, it could be nice to have the inverse: Encoding#valid_string?
(or Encoding#valid_bytes?
). I like the symmetry, but you'd give up the short-hand of specifying the encoding by its string name.
nirvdrum (Kevin Menard) wrote in #note-6:
I'm partial to this one. Alternatively, it could be nice to have the inverse: Encoding#valid_string?
(or Encoding#valid_bytes?
).
I prefer valid_sequence?
over valid_bytes?
.
I like the symmetry, but you'd give up the short-hand of specifying the encoding by its string name.
Encoding.valid_string?(str, enc)
may be possible as a short-hand for Encoding.find(enc).valid_string?(str)
.
as a short-hand for Encoding.find(enc).valid_string?(str).
I suspect most of the time you'd check a specific encoding, so Encoding::UTF_8.valid_sequence?(str)
would be enough.
That said I think that what @Eregon (Benoit Daloze) mentioned in term of performance makes sense. Most of the time when using such method, the next step will be to dup the string and call force_encoding
on it.
So a higher level method would have the advantage of being able to store the computed coderange on the resulting string.
I think the advantage right now is that it doesn't require a mutable string to check. It seems like all of these other options would? Unless you mean to make with_encoding
return nil
if it wasn't valid?
I think the advantage right now is that it doesn't require a mutable string to check.
with_encoding
would always be the same as .dup.force_encoding
(except slightly more efficient).
It doesn't mutate the receiver.
For the description use case you could then use valid_encoding?
like in https://bugs.ruby-lang.org/issues/20792#note-3
That might compute the code range (if not already computed), but that's needed to know if the encoding is valid anyway, and interestingly it will remember this coderange for further operations on that new String.
Using forcible_encoding?
it can't remember the code range and so it would have to be recomputed on force_encoding
or the next operation needing it.
Yeah I'm saying it doesn't require a mutable string because it just checks the current string, it doesn't require a dup/allocation. So this avoids having to allocate a new string to check if it's possible.
Right.
But if it is valid in that encoding, wouldn't you always or almost always then want the String (or a copy of it) in that encoding?
If you do, .with_encoding
would be more efficient as it keeps the computed coderange.
If you don't then indeed just a predicate would avoid the String instance allocation (Strings are copy-on-write, so it's just one object allocation, string bytes are not copied).
Do you have an example where you wouldn't want a String in that encoding?
The linked example needs a UTF-8 String on line 19. So with_encoding
seems a perfect fit there and more efficient than the predicate (1 vs 2 coderange scans).
The code also has to explicitly workaround force_encoding
being inplace (a common inconvenience with force_encoding
) on line 27, which with_encoding
addresses.
Also available in: Atom
PDF
Like0
Like1Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0