Bug #16402: UTF-16LE BOM causing regex match to fail with "invalid byte sequence in UTF-8" - Ruby - Ruby Issue Tracking System

Bug #16402

closed

UTF-16LE BOM causing regex match to fail with "invalid byte sequence in UTF-8"

Added by PikachuEXE (Pikachu EXE) over 5 years ago. Updated over 5 years ago.

Status: Third Party's Issue
Assignee: nahi (Hiroshi Nakamura)
Target version:
ruby -v: ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin18]
Backport: 2.5: UNKNOWN, 2.6: UNKNOWN

[ruby-core:96118]

Description

$ ruby -e 'File.binwrite("u.txt", "\xff\xfe\x00\x01")'
$ file u.txt 
u.txt: Little-endian UTF-16 Unicode text, with no line terminators
$ ruby -e 'p /\w+/.match?(File.read("u.txt"))'
Traceback (most recent call last):
	1: from -e:1:in `<main>'
-e:1:in `match?': invalid byte sequence in UTF-8 (ArgumentError)

No error should be raised, just as when matching against the same string without the BOM:

$ ruby -e 'p /\w+/.match?(File.read("u.txt")[2..-1])'
false
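
For readers following along, here is a minimal sketch of why the two cases differ. It rebuilds the four bytes from the report in memory and tags them as UTF-8, which is an assumption about what File.read returns under a UTF-8 locale. The names WITH_BOM, WITHOUT_BOM and match_or_error are made up for this illustration.

```ruby
# The four bytes from the report, tagged as UTF-8 (assumed to be what
# File.read yields when the default external encoding is UTF-8).
WITH_BOM    = "\xFF\xFE\x00\x01".b.force_encoding(Encoding::UTF_8)
# Same data with the two BOM bytes dropped (byte-wise here; the report
# used [2..-1], which gives the same result for this input).
WITHOUT_BOM = WITH_BOM.byteslice(2..-1)

# Hypothetical helper: returns the match result, or the error message
# if the regexp engine rejects the string.
def match_or_error(str)
  /\w+/.match?(str)
rescue ArgumentError => e
  e.message
end

p WITH_BOM.valid_encoding?     # 0xFF and 0xFE never occur in valid UTF-8
p WITHOUT_BOM.valid_encoding?  # 0x00 and 0x01 are plain ASCII control bytes
p match_or_error(WITH_BOM)
p match_or_error(WITHOUT_BOM)
```

The BOM-less string only "works" because its remaining bytes happen to be valid UTF-8; it is still UTF-16 data read under the wrong encoding.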

#1 [ruby-core:96119]

Updated by shyouhei (Shyouhei Urabe) over 5 years ago

Status changed from Open to Feedback

I bet your locale setting is UTF-8? Hence the error message. You have to be explicit then: File.read("u.txt", mode: "rb:bom|utf-16") would give you a correct String instance.
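
A hedged sketch of the explicit-encoding approach, for anyone trying it: it uses the "BOM|UTF-16LE" spelling documented for IO open modes rather than the lowercase form above, and it transcodes to UTF-8 afterwards on the assumption that the text will be matched against an ordinary regexp literal. The helper name read_utf16_with_bom and the sample content "hi" are invented for the example.

```ruby
require "tempfile"

# Hypothetical helper: read a UTF-16LE file, let Ruby consume the BOM,
# then transcode to UTF-8 so ordinary regexp literals can match it.
def read_utf16_with_bom(path)
  File.read(path, mode: "rb:BOM|UTF-16LE").encode(Encoding::UTF_8)
end

Tempfile.create("u16") do |f|
  f.binmode
  # BOM followed by "hi" encoded as UTF-16LE.
  f.write("\xFF\xFE".b + "hi".encode(Encoding::UTF_16LE).b)
  f.flush

  text = read_utf16_with_bom(f.path)
  p text.encoding
  p(/\w+/.match?(text))
end
```

The transcoding step matters because a String left in UTF-16LE is not ASCII-compatible, so matching it directly against /\w+/ would fail for a different reason.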

#2 [ruby-core:96124]

Updated by PikachuEXE (Pikachu EXE) over 5 years ago

Thanks for your answer.
But I actually encountered this when processing text input from a remote data source,
and would not be using File.read there.

text = HTTPClient.new.get(
  data_feed_url,
  follow_redirect: true,
).tap do |response|
  raise "Unexpected response code: #{response.status}" unless response.ok?
end.body

#3 [ruby-core:96125]

Updated by shyouhei (Shyouhei Urabe) over 5 years ago

Status changed from Feedback to Third Party's Issue
Assignee set to nahi (Hiroshi Nakamura)

PikachuEXE (Pikachu Leung) wrote:

Thanks for your answer.
But I actually encountered this when processing text input from a remote data source,
and would not be using File.read there.

Well, that's... complicated. There is a lot of debate about how to determine the content type of remote network content. Though not a direct answer, issue #2567 may be interesting to read.

Not sure, but maybe HTTPClient provides a way to specify the encoding. Can you ask the author?
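
Independently of what HTTPClient offers, a response body can be decoded by hand once it is available as a String. The following is only a sketch under stated assumptions: decode_by_bom and BOMS are hypothetical names, it makes no claim about how HTTPClient tags the body, and it only recognises the UTF-8 and UTF-16 BOMs (a UTF-32LE BOM also starts with FF FE and would be misdetected).

```ruby
# Hypothetical BOM table: byte prefix => encoding it signals.
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFF\xFE".b     => Encoding::UTF_16LE,
  "\xFE\xFF".b     => Encoding::UTF_16BE,
}.freeze

# Hypothetical helper: treat `body` as raw bytes, strip a recognised BOM,
# and return UTF-8 text. Without a BOM the bytes are tagged with `default`
# and not validated.
def decode_by_bom(body, default: Encoding::UTF_8)
  bytes = body.b
  bom, enc = BOMS.find { |prefix, _| bytes.start_with?(prefix) }
  if bom
    bytes.byteslice(bom.bytesize..-1).force_encoding(enc).encode(Encoding::UTF_8)
  else
    bytes.force_encoding(default)
  end
end

# Usage with in-memory bytes standing in for a response body:
text = decode_by_bom("\xFF\xFEh\x00i\x00".b)
p text
p(/\w+/.match?(text))
```

A charset declared in the Content-Type header, when present, would normally take priority over BOM sniffing; this sketch ignores headers entirely.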

#4 [ruby-core:96147]

Updated by PikachuEXE (Pikachu EXE) over 5 years ago

Submitted a question to httpclient on https://github.com/nahi/httpclient/issues/413
