Project

General

Profile

Actions

Feature #20792

open

String#forcible_encoding?

Added by kddnewton (Kevin Newton) 26 days ago. Updated 15 days ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:119495]

Description

I would like to add a method to String called forcible_encoding?(encoding). This would return true or false depending on whether the receiver can be forced into the given encoding without breaking the string. It would effectively be an alias for:

def forcible_encoding?(enc)
  original = encoding
  result = force_encoding(enc).valid_encoding?
  force_encoding(original)
  result
end

I would like this method because there are extremely rare but possible circumstances where source files are marked as binary but contain UTF-8-encoded characters. In that case I would like to check if it's possible to cleanly force UTF-8 before actually doing it. The code I'm trying to replace is here: https://github.com/ruby/prism/blob/d6e9b8de36b4d18debfe36e4545116539964ceeb/lib/prism/parse_result.rb#L15-L30.

The pull request for the code is here: https://github.com/ruby/ruby/pull/11851.

Updated by Eregon (Benoit Daloze) 26 days ago ยท Edited

I have wanted this feature too, how about adding an optional argument to String#valid_encoding??
Like binary_string.valid_encoding?(Encoding::UTF_8).

Updated by Eregon (Benoit Daloze) 26 days ago

In terms of performance though, both methods need 2 code range scans if the String is then force_encoding(Encoding::UTF_8) if valid and used later for some operation needing the code range.
Maybe the String should stay in the argument encoding if it's valid in it?

Updated by Eregon (Benoit Daloze) 26 days ago

I think I discussed this with @byroot (Jean Boussier) a couple times as well, what about String#with_encoding(Encoding), which is like force_encoding but doesn't mutate the receiver.
Then this would be efficient in that it would scan the code range once:

utf8 = binary_str.with_encoding(Encoding::UTF_8)
if utf8.valid_encoding?
  return utf8
else
  return binary_str
end

And of course it would shorten str.dup.force_encoding(Encoding::UTF_8) to str.with_encoding(Encoding::UTF_8).

Updated by austin (Austin Ziegler) 26 days ago

Eregon (Benoit Daloze) wrote in #note-3:

I think I discussed this with @byroot (Jean Boussier) a couple times as well, what about String#with_encoding(Encoding), which is like force_encoding but doesn't mutate the receiver.
Then this would be efficient in that it would scan the code range once:

utf8 = binary_str.with_encoding(Encoding::UTF_8)
if utf8.valid_encoding?
  return utf8
else
  return binary_str
end

And of course it would shorten str.dup.force_encoding(Encoding::UTF_8) to str.with_encoding(Encoding::UTF_8).

A variation of this would be something like:

if utf8 = binary_str.try_encoding(Encoding::UTF_8)
  return utf8
else
  return binary_str
end

The implementation would be the equivalent of:

def try_encoding(encoding)
  target = self.dup.force_encoding(encoding)
  target.valid_encoding? ? target : nil
end

Updated by Eregon (Benoit Daloze) 26 days ago

Right, that would also solve this specific usage.
However I think String#with_encoding is a good general method to have, as for instance there are lots of .dup.force_encoding.
It's one of the rare cases where Ruby core API don't have a nice way to do something in a non-mutating manner so that would fix it.
It's especially striking for frozen literal strings where one just wants it in some specific encoding, but currently one has to "foo".dup.force_encoding(SOME_ENCODING).
There is String#b but that's only for the BINARY encoding (and the name is not really clear).

Updated by nirvdrum (Kevin Menard) 26 days ago

Eregon (Benoit Daloze) wrote in #note-1:

I have wanted this feature too, how about adding an optional argument to String#valid_encoding??
Like binary_string.valid_encoding?(Encoding::UTF_8).

I'm partial to this one. Alternatively, it could be nice to have the inverse: Encoding#valid_string? (or Encoding#valid_bytes?). I like the symmetry, but you'd give up the short-hand of specifying the encoding by its string name.

Updated by nobu (Nobuyoshi Nakada) 25 days ago

nirvdrum (Kevin Menard) wrote in #note-6:

I'm partial to this one. Alternatively, it could be nice to have the inverse: Encoding#valid_string? (or Encoding#valid_bytes?).

I prefer valid_sequence? over valid_bytes?.

I like the symmetry, but you'd give up the short-hand of specifying the encoding by its string name.

Encoding.valid_string?(str, enc) may be possible as a short-hand for Encoding.find(enc).valid_string?(str).

Updated by byroot (Jean Boussier) 24 days ago

as a short-hand for Encoding.find(enc).valid_string?(str).

I suspect most of the time you'd check a specific encoding, so Encoding::UTF_8.valid_sequence?(str) would be enough.

That said I think that what @Eregon (Benoit Daloze) mentioned in term of performance makes sense. Most of the time when using such method, the next step will be to dup the string and call force_encoding on it.

So a higher level method would have the advantage of being able to store the computed coderange on the resulting string.

Updated by kddnewton (Kevin Newton) 21 days ago

I think the advantage right now is that it doesn't require a mutable string to check. It seems like all of these other options would? Unless you mean to make with_encoding return nil if it wasn't valid?

Updated by Eregon (Benoit Daloze) 21 days ago

I think the advantage right now is that it doesn't require a mutable string to check.

with_encoding would always be the same as .dup.force_encoding (except slightly more efficient).
It doesn't mutate the receiver.

For the description use case you could then use valid_encoding? like in https://bugs.ruby-lang.org/issues/20792#note-3
That might compute the code range (if not already computed), but that's needed to know if the encoding is valid anyway, and interestingly it will remember this coderange for further operations on that new String.

Using forcible_encoding? it can't remember the code range and so it would have to be recomputed on force_encoding or the next operation needing it.

Updated by kddnewton (Kevin Newton) 18 days ago

Yeah I'm saying it doesn't require a mutable string because it just checks the current string, it doesn't require a dup/allocation. So this avoids having to allocate a new string to check if it's possible.

Updated by Eregon (Benoit Daloze) 15 days ago

Right.
But if it is valid in that encoding, wouldn't you always or almost always then want the String (or a copy of it) in that encoding?
If you do, .with_encoding would be more efficient as it keeps the computed coderange.
If you don't then indeed just a predicate would avoid the String instance allocation (Strings are copy-on-write, so it's just one object allocation, string bytes are not copied).

Do you have an example where you wouldn't want a String in that encoding?
The linked example needs a UTF-8 String on line 19. So with_encoding seems a perfect fit there and more efficient than the predicate (1 vs 2 coderange scans).
The code also has to explicitly workaround force_encoding being inplace (a common inconvenience with force_encoding) on line 27, which with_encoding addresses.

Actions

Also available in: Atom PDF

Like0
Like1Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0