Bug #10584

String#valid_encoding? and String#ascii_only? fail to account for BOM.

Added by geoff-codes (Geoff Nixon) about 6 years ago. Updated about 3 years ago.

Status: Open
Priority: Normal
Assignee: -
Target version: -
ruby -v: ruby 2.2.0preview2 (2014-11-28 trunk 48628) [x86_64-darwin14]
[ruby-core:66761]

Description

IMO:

  • A UTF-16 or UTF-32 string that starts with a valid BOM should not be reported as validly encoded when it is reinterpreted with the opposite endianness.

  • A UTF-8 string with a BOM should not treat the BOM as a codepoint.

> file utf-16be-file
utf-16be-file: POSIX shell script, Big-endian UTF-16 Unicode text executable

> file utf-16le-file
utf-16le-file: POSIX shell script, Little-endian UTF-16 Unicode text executable

> file utf-8-with-bom-file
utf-8-with-bom-file: POSIX shell script, UTF-8 Unicode (with BOM) text executable

> ruby -e "p File.binread('utf-16le-file').force_encoding('UTF-16BE').valid_encoding?"
true # expected: false

> ruby -e "p File.binread('utf-16be-file').force_encoding('UTF-16LE').valid_encoding?"
true # expected: false
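
For what it's worth, the mismatch in the two UTF-16 cases can be detected by hand today. This is only a sketch of the check I have in mind: `UTF16_BOMS` and `utf16_bom_consistent?` are hypothetical names, not part of Ruby, and it assumes that a string opening with the byte-swapped BOM for its current encoding should be treated as mislabelled.

```ruby
# Hypothetical table (not part of Ruby): the BOM bytes each UTF-16 variant
# would start with if a BOM is present.
UTF16_BOMS = {
  Encoding::UTF_16BE => "\xFE\xFF".b,
  Encoding::UTF_16LE => "\xFF\xFE".b
}.freeze

# Hypothetical helper (not part of Ruby): returns false when the string
# begins with the BOM of the opposite endianness, true otherwise.
def utf16_bom_consistent?(str)
  expected = UTF16_BOMS[str.encoding]
  return true if expected.nil? # not UTF-16BE/UTF-16LE, nothing to check

  leading = str.byteslice(0, 2).to_s.b     # first two raw bytes
  swapped = UTF16_BOMS.values - [expected] # BOM of the other endianness
  !swapped.include?(leading)
end

# "hi" in UTF-16LE with a BOM, built from raw bytes for illustration.
le_bytes = "\xFF\xFEh\x00i\x00".b

as_le = le_bytes.dup.force_encoding("UTF-16LE")
as_be = le_bytes.dup.force_encoding("UTF-16BE")

p as_be.valid_encoding?        # the behaviour reported above
p utf16_bom_consistent?(as_le)
p utf16_bom_consistent?(as_be)
```

The helper only looks at the first two bytes, so a string without any BOM is accepted under either endianness.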

> ruby -e "p File.read('utf-8-with-bom-file').ascii_only?"
false # expected: true

> ruby -e "p File.read('utf-8-with-bom-file')[0]"
"" # expected: "#" (the character returned is the invisible BOM)

No?
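
For the UTF-8 case there is at least a workaround: opening the file with the `BOM|UTF-8` read mode makes Ruby strip a leading BOM. A minimal sketch, assuming a temp file whose contents are made up here (it is not the attached file); `read_without_bom` and `bom_demo` are hypothetical names, not part of Ruby.

```ruby
require "tempfile"

# Hypothetical helper (not part of Ruby): read a file as UTF-8 and drop a
# leading BOM if one is present, using the "BOM|UTF-8" read mode.
def read_without_bom(path)
  File.open(path, "r:BOM|UTF-8") { |f| f.read }
end

# Hypothetical demo: writes a small BOM-prefixed script to a temp file and
# compares a plain UTF-8 read with the BOM-stripping read.
def bom_demo
  Tempfile.create("utf-8-with-bom") do |f|
    f.binmode
    f.write("\xEF\xBB\xBF#!/bin/sh\n".b) # UTF-8 BOM followed by ASCII text
    f.flush

    raw      = File.read(f.path, encoding: "UTF-8") # BOM kept as U+FEFF
    stripped = read_without_bom(f.path)             # BOM removed

    {
      raw_ascii_only:      raw.ascii_only?,
      raw_first:           raw[0],
      stripped_ascii_only: stripped.ascii_only?,
      stripped_first:      stripped[0]
    }
  end
end

p bom_demo
```

With the BOM stripped, `ascii_only?` and `[0]` give the results I would have expected from a plain `File.read`.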


Files

utf-8-with-bom-file (14 Bytes), geoff-codes (Geoff Nixon), 12/10/2014 12:54 AM
utf-16le-file (2.46 KB), geoff-codes (Geoff Nixon), 12/10/2014 12:54 AM
utf-16be-file (2.45 KB), geoff-codes (Geoff Nixon), 12/10/2014 12:54 AM
