Bug #16402

closed

UTF-16LE BOM causing regex match to fail with "invalid byte sequence in UTF-8"

Added by PikachuEXE (Pikachu EXE) over 4 years ago. Updated over 4 years ago.

Status:
Third Party's Issue
Target version:
-
ruby -v:
ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin18]
[ruby-core:96118]

Description

$ ruby -e 'File.binwrite("u.txt", "\xff\xfe\x00\x01")'
$ file u.txt 
u.txt: Little-endian UTF-16 Unicode text, with no line terminators
$ ruby -e 'p /\w+/.match?(File.read("u.txt"))'
Traceback (most recent call last):
	1: from -e:1:in `<main>'
-e:1:in `match?': invalid byte sequence in UTF-8 (ArgumentError)

No error should be raised, just as when matching against the same string without the BOM:

$ ruby -e 'p /\w+/.match?(File.read("u.txt")[2..-1])'
false

Updated by shyouhei (Shyouhei Urabe) over 4 years ago

  • Status changed from Open to Feedback

I bet your locale setting is UTF-8, hence the error message. You have to be explicit about the encoding in that case: File.read("u.txt", mode: "rb:bom|utf-16") would give you a correct String instance.
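To make the suggestion above concrete, here is a minimal sketch. The helper name `read_bom_text` is hypothetical, and the mode string `"rb:BOM|UTF-16LE"` is one spelling of the BOM-aware mode that the IO documentation describes. It assumes the file really is UTF-16 with a BOM. A UTF-16 string cannot be matched directly against an ASCII regexp such as /\w+/, so the sketch transcodes to UTF-8 before matching.

```ruby
require "tmpdir"

# Hypothetical helper, for illustration only: read a BOM-prefixed UTF-16
# file and return its content transcoded to UTF-8.
def read_bom_text(path)
  # "BOM|" asks Ruby to detect and strip the BOM; binary mode ("rb") is
  # required for ASCII-incompatible encodings such as UTF-16.
  utf16 = File.read(path, mode: "rb:BOM|UTF-16LE")
  utf16.encode(Encoding::UTF_8)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "u.txt")
  File.binwrite(path, "\xFF\xFEh\x00i\x00") # BOM followed by "hi" in UTF-16LE

  # Forcing UTF-8 here simulates what a UTF-8 locale does by default:
  # the bytes are tagged as UTF-8 although 0xFF 0xFE is not valid UTF-8.
  raw = File.read(path, encoding: "UTF-8")
  p raw.valid_encoding?

  text = read_bom_text(path)
  p text.encoding
  p(/\w+/.match?(text))
end
```

Calling `valid_encoding?` before matching is a cheap way to spot this class of problem: it reports the mismatch that `match?` otherwise turns into an ArgumentError.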

Updated by PikachuEXE (Pikachu EXE) over 4 years ago

Thanks for your answer.
But I actually encountered this when processing text input from a remote data source,
so I would not be using File.read.

text = HTTPClient.new.get(
  data_feed_url,
  follow_redirect: true,
).tap do |response|
  raise "Unexpected response code: #{response.status}" unless response.ok?
end.body
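When the bytes arrive as a String rather than from a file, one possible workaround is to inspect the BOM by hand. The sketch below is not something HTTPClient provides: `BOMS` and `decode_with_bom` are hypothetical names. It assumes that a missing BOM means the body is already in the fallback encoding (UTF-8 here), which may not hold for every feed.

```ruby
# Hypothetical lookup table, not part of Ruby or HTTPClient. The UTF-32
# entries come before the UTF-16 ones because the UTF-32LE BOM starts
# with the same two bytes as the UTF-16LE BOM.
BOMS = [
  ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
  ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
  ["\xEF\xBB\xBF".b,     Encoding::UTF_8],
  ["\xFE\xFF".b,         Encoding::UTF_16BE],
  ["\xFF\xFE".b,         Encoding::UTF_16LE],
].freeze

# Hypothetical helper: strip a leading BOM, tag the remaining bytes with
# the encoding the BOM announces, and transcode the result to UTF-8.
def decode_with_bom(raw, fallback: Encoding::UTF_8)
  bytes = raw.b # binary copy, so the caller's string is left untouched
  bom, encoding = BOMS.find { |prefix, _| bytes.start_with?(prefix) }
  if bom
    bytes = bytes.byteslice(bom.bytesize..-1)
  else
    encoding = fallback # assumption: no BOM means the fallback encoding
  end
  # Transcoding can raise if the payload is malformed for its encoding,
  # for example UTF-16 data with an odd number of bytes.
  bytes.force_encoding(encoding).encode(Encoding::UTF_8)
end

body = "\xFF\xFEo\x00k\x00" # stands in for the response body
text = decode_with_bom(body)
p text.encoding
p(/\w+/.match?(text))
```

If the server sends a charset in its Content-Type header, that value is probably a better fallback than a hard-coded UTF-8.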

Updated by shyouhei (Shyouhei Urabe) over 4 years ago

  • Status changed from Feedback to Third Party's Issue
  • Assignee set to nahi (Hiroshi Nakamura)

PikachuEXE (Pikachu Leung) wrote:

Thanks for your answer.
But I actually encountered this when processing text input from a remote data source,
so I would not be using File.read.

Well that's... complicated. There is a lot of debate about how to determine the content type of remote network content. Though not a direct answer, issue #2567 may be interesting to read.

Not sure, but maybe HTTPClient provides a way to specify the encoding. Can you ask the author?
