Feature #6321

Find and repair bad bytes in encodings, without transcoding

Added by jrochkind (jonathan rochkind) about 8 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Target version: [webmaster@ruby-lang.org:<unknown>]

Description

If I use String#encode to transcode from one encoding to another, invalid bytes in the source encoding will raise an exception, unless I pass the :invalid and :replace options to tell it to do something different with them.

Sometimes I do not want to transcode to a new encoding. I have a string which ought to be, say, UTF-8:

string = something.force_encoding("UTF-8")

However, like all input from an external source that I don't have complete control over, it's possible that it contains invalid bytes. I'd like to check it right away, sometimes raising right away, sometimes using :invalid/:replace functionality similar to String#encode.

As far as I can tell, Ruby gives me no way to do it. This does not work; it's a no-op even when there are invalid bytes:

string.encoding # => #<Encoding:UTF-8>
string.encode("UTF-8") # Does NOT raise even if there are bad bytes
string.encode("UTF-8", :invalid => :replace) # Does NOT replace bad bytes
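
For readers on later Ruby versions, here is a minimal sketch of checking and repairing without transcoding. It assumes Ruby 2.1 or newer, where String#scrub is available; String#valid_encoding? only detects the problem and does not repair it. The sample bytes and variable names are made up for illustration.

```ruby
# Bytes from an external source, tagged as UTF-8 without any checking.
# "\xFF" can never appear in well-formed UTF-8.
raw = "abc\xFFdef".b
string = raw.force_encoding("UTF-8")

# Detection: valid_encoding? scans the bytes and reports whether they
# are well-formed in the string's current encoding.
valid = string.valid_encoding?

# Repair (Ruby 2.1+): scrub returns a copy with each invalid byte
# sequence replaced. For Unicode encodings the default is U+FFFD.
repaired_default = string.scrub
repaired_custom  = string.scrub("?")

# A block receives the offending bytes, so they can be logged or escaped.
repaired_block = string.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }
```

To raise instead of repairing, a caller can test valid_encoding? and raise whichever exception suits the application.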

So this is a feature request for a built-in way to do this. It is actually a pretty common thing to want to do: strings sometimes come from external sources and are not what they claim to be. It's very useful to be able to check/validate them, and possibly repair them, right away, rather than waiting for an "invalid byte sequence" error to crop up at some indeterminate point in the future.

I don't know if this functionality should be provided by String#encode as above, even when the source encoding is the same as the destination encoding, or if it needs to be a new method name, say #validate_encoding. Either way is fine with me.

Here's a pure-Ruby partial implementation showing what I need, but it's not as full-featured as the relevant options in #encode for transcoding, and it's probably much slower too. This ought to be built in, and in C.

https://gist.github.com/2416043
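
As a rough illustration of the idea (not the contents of the gist above), a pure-Ruby version might look like the sketch below. The method name validate_encoding, its keyword options, and the choice of ArgumentError are all hypothetical. It also assumes the replacement string is compatible with the string's encoding, and it does not try to define how a truncated multibyte sequence is grouped.

```ruby
# Hypothetical helper; the name and options are illustrative only and
# are not part of Ruby's String API.
def validate_encoding(str, invalid: :raise, replace: "\uFFFD")
  # Fast path: nothing to do for well-formed strings.
  return str if str.valid_encoding?

  if invalid == :replace
    # each_char yields invalid bytes as separate "characters" whose own
    # valid_encoding? is false, so they can be swapped out one by one.
    str.each_char.map { |ch| ch.valid_encoding? ? ch : replace }.join
  else
    # The exception class is an arbitrary choice for this sketch.
    raise ArgumentError, "invalid byte sequence in #{str.encoding}"
  end
end
```

A call such as validate_encoding(input, invalid: :replace, replace: "?") would then mirror the :invalid/:replace options of String#encode, though walking the string character by character in Ruby is likely to be much slower than a C implementation.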


Related issues

Related to Ruby master - Feature #6752: Replacing ill-formed subsequence (Closed, matz (Yukihiro Matsumoto), 07/19/2012)
Related to Ruby master - Bug #7967: String#encode invalid: :replace doesn't replace invalid chars (Rejected, 02/26/2013)
