Misc #20406: Question about Regexp encoding negotiation - Ruby - Ruby Issue Tracking System

Added by andrykonchin (Andrew Konchin) about 2 years ago. Updated about 2 years ago.

Status: Open
Assignee:
[ruby-core:117408]

Description

I am wondering what the rules are for calculating a Regexp literal's encoding when an encoding modifier is specified.

From the documentation:

By default, a regexp with only US-ASCII characters has US-ASCII encoding:
...
A regular expression containing non-US-ASCII characters is assumed to use the source encoding. This can be overridden with one of the following modifiers.
//n ...
//u ...
//e ...
//s ...
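
As a quick check of the quoted rules, here is a minimal sketch that prints the encoding reported by a few ASCII-only literals, with and without a modifier. The helper name `literal_encodings` is made up for this illustration, and the `//n` modifier is deliberately left out.

```ruby
# Hypothetical helper (not part of Ruby): maps a label to the encoding
# reported by an ASCII-only Regexp literal written with that modifier.
def literal_encodings
  {
    "no modifier" => /a/.encoding,
    "//u"         => /a/u.encoding,
    "//e"         => /a/e.encoding,
    "//s"         => /a/s.encoding,
  }
end

literal_encodings.each { |label, enc| puts "#{label}: #{enc}" }
```

With a modifier, an ASCII-only literal reports the modifier's encoding; without one it reports US-ASCII.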

Looking at the following examples, I would assume that these rules are followed, except in one case:

 p /\xc2\xa1/e     .encoding # EUC-JP
 p /#{ }\xc2\xa1/e .encoding # EUC-JP

 p /a/e            .encoding # EUC-JP
 p /a #{} a/e      .encoding # EUC-JP
 p /#{} a/e        .encoding # US-ASCII

The last Regexp /#{} a/e is supposed to have EUC-JP encoding but has US-ASCII. So I am wondering what rule is applied in this case.

Related issues: 2 (1 open, 1 closed)

Related to Ruby - Misc #20407: Question about applying encoding modifier to an interpolated Regexp (Closed)
Related to Ruby - Misc #20434: Deprecate encoding-related regular expression modifiers (Open)


History

Updated by andrykonchin (Andrew Konchin) about 2 years ago
#1

Description updated

Updated by shyouhei (Shyouhei Urabe) about 2 years ago
#2 [ruby-core:117421]

Seems like a real bug to me.

% docker run --rm -it -e 'ALL_RUBY_SINCE=ruby-1.8.7' rubylang/all-ruby ./all-ruby -e 'p(/#{} a/e.encoding)'
ruby-1.8.7          -e:1: undefined method `encoding' for / a/e:Regexp (NoMethodError)
                exit 1
...
ruby-1.8.7-p374     -e:1: undefined method `encoding' for / a/e:Regexp (NoMethodError)
                exit 1
ruby-1.9.0-0        #<Encoding:EUC-JP>
...
ruby-1.9.2-preview1 #<Encoding:EUC-JP>
ruby-1.9.2-preview3 #<Encoding:US-ASCII>
...
ruby-3.3.0          #<Encoding:US-ASCII>

Updated by Eregon (Benoit Daloze) about 2 years ago
#3 [ruby-core:117426]

By default, a regexp with only US-ASCII characters has US-ASCII encoding:

I was wondering what kind of check is used for that and it seems to be checking the Regexp source when building it (makes sense):

$ ruby -e 'p /a/.encoding'
#<Encoding:US-ASCII>
$ ruby -e 'p /a#{}b/.encoding'
#<Encoding:US-ASCII>
$ ruby -e 'p /a#{"c"}b/.encoding'
#<Encoding:US-ASCII>
$ ruby -e 'p /a#{"é"}b/.encoding'
#<Encoding:UTF-8>
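
The same source check can be observed without literals. This is a minimal sketch that assumes `Regexp.new` goes through the same 7-bit check as an interpolated literal, which matches the outputs above; `regexp_encoding_for` is a hypothetical helper name.

```ruby
# Hypothetical helper: build a Regexp from a plain string and report
# the encoding the resulting Regexp ends up with.
def regexp_encoding_for(source)
  Regexp.new(source).encoding
end

ascii_source = "a" + "c" + "b"        # only 7-bit characters
utf8_source  = "a" + "\u00E9" + "b"   # contains "é", a non-ASCII character

puts ascii_source.ascii_only?            # true
puts regexp_encoding_for(ascii_source)   # US-ASCII
puts utf8_source.ascii_only?             # false
puts regexp_encoding_for(utf8_source)    # UTF-8
```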

Updated by Eregon (Benoit Daloze) about 2 years ago · Edited
#4 [ruby-core:117427]

I found another case which does not seem to respect those rules:

$ ruby -ve 'p /#{"é".dup}/e.encoding'
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
#<Encoding:UTF-8>
$ ruby -e 'p /a#{"é".dup}b/e.encoding'       
#<Encoding:UTF-8>

It seems to behave a bit like string interpolation/concatenation here, but that's very confusing when mixed with the above rules.
How can one know which rule is applied when, and which one has precedence?

When mixing two incompatible encodings there is an error, which makes sense:

$ ruby -e 'p /a#{"é".dup}\xc2\xa1/e.encoding'
-e:1:in `<main>': encoding mismatch in dynamic regexp : UTF-8 and EUC-JP (RegexpError)
$ ruby -e '"é" + "\xc2\xa1".force_encoding("EUC-JP")'
-e:1:in `+': incompatible character encodings: UTF-8 and EUC-JP (Encoding::CompatibilityError)
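
The string half of that comparison can be checked up front with `Encoding.compatible?`. This is only a sketch of the String side; `negotiated_encoding` is a hypothetical wrapper, and it says nothing about how the regexp compiler performs its own check.

```ruby
# Hypothetical helper: returns the encoding both strings can be combined in,
# or nil when they are incompatible.
def negotiated_encoding(left, right)
  Encoding.compatible?(left, right)
end

utf8  = "\u00E9"                                          # "é" in UTF-8
eucjp = "\xC2\xA1".dup.force_encoding(Encoding::EUC_JP)   # non-ASCII EUC-JP bytes

p negotiated_encoding(utf8, eucjp)   # nil

begin
  utf8 + eucjp
rescue Encoding::CompatibilityError => e
  puts e.message
end
```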

Without the .dup there is a compile error:

$ ruby -e 'p /#{"é"}/e.encoding' 
-e:1: regexp encoding option 'e' differs from source encoding 'UTF-8'
-e:1: regexp encoding option 'e' differs from source encoding 'UTF-8'
-e: compile error (SyntaxError)

Which is not so nice because this breaks referential transparency (e.g. a string literal can be replaced by a variable referencing that string literal) and adds more edge cases. I think the compiler should not look inside #{} for interpolated regexps.

OTOH this error seems OK, because it's something that can be detected at parse time:

$ ruby -e 'p /é/e.encoding' 
-e:1: regexp encoding option 'e' differs from source encoding 'UTF-8'
-e: compile error (SyntaxError)

I think it would be tempting semantically to assign the encoding of the static parts of an interpolated regexp with a /nesu flag to that encoding.
IOW, /nesu would take precedence over the source encoding for the "static parts/static string literals" in an interpolated regexp.
That would however allow /é/e, unless there is an extra check for such parts being all 7-bit or not, which seems OK to have.

Updated by Eregon (Benoit Daloze) about 2 years ago · Edited
#5 [ruby-core:117428]

It seems to behave a bit like string interpolation/concatenation here

Specifically:

$ ruby -e '# encoding: EUC-JP
p ("a" + "\xC3\xA9".force_encoding("UTF-8") + "c").encoding'
#<Encoding:UTF-8>

But it seems very much unexpected for a /e regexp to have a UTF-8 encoding.
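
The same concatenation rule can be reproduced in a UTF-8 source file without a magic comment. A minimal sketch, where `concat_encoding` is a hypothetical helper and the EUC-JP tag on the ASCII-only part stands in for the static parts of a /e regexp:

```ruby
# Hypothetical helper: encoding of the concatenation of the given parts.
def concat_encoding(*parts)
  parts.reduce(:+).encoding
end

ascii_eucjp = "a".encode(Encoding::EUC_JP)   # ASCII-only, tagged EUC-JP
utf8_char   = "\u00E9"                       # "é", non-ASCII, UTF-8

puts concat_encoding(ascii_eucjp, utf8_char, ascii_eucjp)   # UTF-8
```

The ASCII-only parts give way to the encoding of the one non-ASCII part, whatever encoding they were tagged with.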

Updated by duerst (Martin Dürst) about 2 years ago
#6 [ruby-core:117437]

This is a more general comment, but my impression is that the encoding flags on regular expressions may be outdated. They have existed since before Ruby introduced encoding information for Strings, ... in Ruby 1.9. It may be time now to look into how and when they can be deprecated.

Updated by Eregon (Benoit Daloze) about 2 years ago
#7 [ruby-core:117441]

Indeed, on a similar topic, I wonder how much encoding negotiation at Regexp creation time matters,
because there is another encoding negotiation between the regexp and the string being matched, which happens at match time.
Maybe the Regexp encoding should always be US-ASCII if there are only 7-bit characters in the Regexp source,
or maybe always UTF-8 in that case, since a regexp will most likely be matched against UTF-8 strings.
This illustrates that the Regexp encoding doesn't really matter for the 7-bit source case.

Or maybe Regexp literals should just always use the source encoding; that would make things a lot simpler and closer to string literals.
The /nesu flags would then just override the source encoding (and maybe eventually be deprecated, but that is probably not worth it if their semantics are clear).

I'm also not sure what the point of Regexp#fixed_encoding? is: regardless of its value, it seems a Regexp can be matched against strings of different but compatible encodings (the docs about this in ri Regexp are incorrect).
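
The match-time negotiation can be probed with a small sketch like the following. `try_match` is a hypothetical helper, and the cases are only a sample, not an exhaustive description of the rules.

```ruby
# Hypothetical helper: returns the match index, or :incompatible when the
# regexp and string encodings cannot be reconciled at match time.
def try_match(regexp, string)
  regexp =~ string
rescue Encoding::CompatibilityError
  :incompatible
end

eucjp_ascii = "abc".dup.force_encoding(Encoding::EUC_JP)          # ASCII-only
eucjp_wide  = "\xA1\xA2 a".dup.force_encoding(Encoding::EUC_JP)   # has a non-ASCII char

p try_match(/a/,  "\u6666 a")    # 2
p try_match(/a/,  eucjp_wide)    # 2
p try_match(/a/u, eucjp_ascii)   # 0
p try_match(/a/u, eucjp_wide)    # :incompatible
```

Only the last case fails: a fixed-encoding UTF-8 regexp against a string with non-ASCII characters in another encoding.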

Updated by Eregon (Benoit Daloze) about 2 years ago
#8

Related to Misc #20407: Question about applying encoding modifier to an interpolated Regexp added

Updated by Eregon (Benoit Daloze) about 2 years ago
#9

Related to Misc #20434: Deprecate encoding-related regular expression modifiers added
