Misc #20407
Updated by andrykonchin (Andrew Konchin) 8 months ago
I am wondering how Regexp encoding modifiers (`u`, `s`, `e`, `n`) affect encoding negotiation between the fragments of an interpolated Regexp literal.
Example #1
```ruby
# encoding: us-ascii
# Unicode: Ф - U+0424
# windows-1251: Ф - 0xD4
# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding # ASCII-8BIT
# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding # ASCII-8BIT
# string literals concatenation
puts ("a" + "\xd4".force_encoding("windows-1251") + "c").encoding # Windows-1251
puts ("a" + "b".encode("windows-1251") + "c").encoding # US-ASCII
puts ("a" + "\u0424".force_encoding("UTF-8") + "c").encoding # UTF-8
puts ("a" + "\xc2\xa1".b + "c").encoding # ASCII-8BIT
```
Example #2
```ruby
# encoding: utf-8
# windows-1251: Ф - 0xD4
# unicode: Ф - U+0424
# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding # ASCII-8BIT
# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding # ASCII-8BIT
# string literals concatenation
puts ("a" + "\xd4".force_encoding("windows-1251") + "c").encoding # Windows-1251
puts ("a" + "b".encode("windows-1251") + "c").encoding # UTF-8
puts ("a" + "\u0424".force_encoding("UTF-8") + "c").encoding # UTF-8
puts ("a" + "\xc2\xa1".b + "c").encoding # ASCII-8BIT
```
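The string concatenation rows can also be checked without relying on the magic comment. This is a minimal sketch that assumes `Encoding.compatible?` reports the same encoding that `+` would pick for two strings. The `negotiate` helper and the variable names are hypothetical and only used for illustration. The trailing `"c"` is left out because an ASCII-only fragment at the end does not change the outcome in the rows above.

```ruby
# Hypothetical helper: ask Ruby which encoding two strings would negotiate to.
def negotiate(left, right)
  Encoding.compatible?(left, right)
end

# Build every fragment with an explicit encoding, so the source file
# encoding does not matter.
utf8_a     = String.new("a", encoding: Encoding::UTF_8)
ascii_a    = String.new("a", encoding: Encoding::US_ASCII)
cyrillic   = [0xD4].pack("C").force_encoding("Windows-1251") # non-ASCII byte
ascii_1251 = "b".encode("Windows-1251")                      # ASCII-only content
binary     = [0xC2, 0xA1].pack("C*")                         # ASCII-8BIT bytes

p negotiate(utf8_a, cyrillic)    # non-ASCII right-hand side
p negotiate(utf8_a, ascii_1251)  # both sides ASCII-only, left is UTF-8
p negotiate(ascii_a, ascii_1251) # both sides ASCII-only, left is US-ASCII
p negotiate(utf8_a, binary)      # binary right-hand side
```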
In the examples above, the `e` modifier changes the Regexp's encoding in only one case: when the encoding would be `US-ASCII` without the modifier:
```ruby
# encoding: us-ascii
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
```
```ruby
# encoding: utf-8
puts /a #{ "b".encode("windows-1251") } c/.encoding # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding # EUC-JP
```
In all the other cases, where the encoding without the modifier is `Windows-1251`, `UTF-8`, or `ASCII-8BIT`, the `e` modifier doesn't change the Regexp's final encoding.
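A related way to look at the one case that does change is `Regexp#fixed_encoding?`. This is a minimal sketch, and `encoding_state` is a hypothetical helper, not part of Ruby. It prints the encoding together with the fixed flag for the ASCII-only interpolation, with and without the `e` modifier. It does not depend on the magic comment, because the interpolated string is encoded explicitly.

```ruby
# Hypothetical helper (not part of Ruby): pair a Regexp's encoding with
# whether that encoding is fixed.
def encoding_state(regexp)
  [regexp.encoding, regexp.fixed_encoding?]
end

# ASCII-only content that merely carries a Windows-1251 label.
ascii_part = "b".encode("windows-1251")

without_modifier = encoding_state(/a #{ascii_part} c/)
with_modifier    = encoding_state(/a #{ascii_part} c/e)

p without_modifier
p with_modifier
```

If the flag differs between the two, that would suggest the `US-ASCII` result is just the default for an ASCII-only pattern with no fixed encoding, which is why a modifier can replace it.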
Looking at the following example:
```ruby
# encoding: us-ascii
# without modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/.encoding # ASCII-8BIT
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/.encoding # EUC-JP
p /a #{ "\xc2\xa1".b } b/.encoding # ASCII-8BIT
# with modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/e.encoding # EUC-JP
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/e.encoding # EUC-JP
p /a #{ "\xc2\xa1".b } b/e.encoding # ASCII-8BIT
```
we can see that the `e` modifier changes `ASCII-8BIT` to `EUC-JP` in the first case (`/\xc2\xa1 #{ "a" }\xc2\xa1/`) but not in the third one (`/a #{ "\xc2\xa1".b } b/`). So I assume that the `e` modifier may be applied to the literal Regexp fragments (`\xc2\xa1 ` and `\xc2\xa1`) before encoding negotiation, and not to the whole result after negotiation.
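To make the assumption concrete, here is a rough model of the two orderings built from plain strings. It is only a sketch and not Ruby's actual implementation: `fragment_first`, `whole_result`, `CASES`, and `OBSERVED` are hypothetical names, and string `+` stands in for whatever negotiation the Regexp compiler performs. `OBSERVED` holds the encodings from the "with modifier" rows above.

```ruby
EUC = Encoding::EUC_JP

# Each fragment is [string, literal?]. Literal fragments come from the Regexp
# source text, the others from #{...} interpolation.
CASES = {
  first:  [["\xC2\xA1 ".b, true], ["a", false], ["\xC2\xA1".b, true]],
  second: [["a ", true], ["\xC2\xA1".b.force_encoding(EUC), false], [" b", true]],
  third:  [["a ", true], ["\xC2\xA1".b, false], [" b", true]]
}

# Encodings reported for the /e variants in the example above.
OBSERVED = { first: EUC, second: EUC, third: Encoding::ASCII_8BIT }

# Model A: relabel the literal fragments with the modifier's encoding,
# then let concatenation negotiate the result.
def fragment_first(fragments, modifier)
  fragments
    .map { |text, literal| literal ? text.dup.force_encoding(modifier) : text }
    .reduce(:+)
    .encoding
end

# Model B: negotiate first, then relabel the whole result.
# By construction this always ends up with the modifier's encoding.
def whole_result(fragments, modifier)
  fragments.map(&:first).reduce(:+).force_encoding(modifier).encoding
end

CASES.each do |name, fragments|
  p [name, fragment_first(fragments, EUC), whole_result(fragments, EUC), OBSERVED[name]]
end
```

In this model only the fragment-first ordering agrees with all three observed rows. The whole-result ordering would give `EUC-JP` for the third case as well.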
Could you please clarify how it works?