Misc #20407: Question about applying encoding modifier to an interpolated Regexp - Ruby - Ruby Issue Tracking System

Added by andrykonchin (Andrew Konchin) about 2 years ago. Updated almost 2 years ago.

Status: Closed
Assignee:

[ruby-core:117431]

Description

I am wondering how the Regexp encoding modifiers (u, s, e, n) affect the encoding negotiation between the parts/fragments of an interpolated Regexp literal.
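For reference, here is a minimal sketch of the baseline the question builds on: with a static (non-interpolated), ASCII-only literal, the modifier alone decides the encoding. The constant name STATIC_LITERALS is only illustrative, and the n modifier is left out to keep the sketch short.

```ruby
# Static, ASCII-only literals: no interpolation, so there is nothing to negotiate.
STATIC_LITERALS = {
  "none" => /a/,   # no modifier
  "u"    => /a/u,  # UTF-8
  "s"    => /a/s,  # Windows-31J
  "e"    => /a/e,  # EUC-JP
}

STATIC_LITERALS.each do |modifier, regexp|
  # Regexp#fixed_encoding? tells whether the Regexp is pinned to its encoding.
  puts format("%-4s %-12s fixed=%s", modifier, regexp.encoding, regexp.fixed_encoding?)
end
```

Without a modifier an ASCII-only literal is US-ASCII and not fixed, while u, s and e each pin the literal to their own encoding.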

Example #1

# encoding: us-ascii

# Unicode: Ф - U+0424
# windows-1251: Ф - 0xD4

# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding    # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding               # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding         # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding                             # ASCII-8BIT

# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding   # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding              # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding        # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding                            # ASCII-8BIT

# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding    # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding               # Windows-1251
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding         # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding                             # ASCII-8BIT

Example #2

# encoding: utf-8

# windows-1251: Ф - 0xD4
# unicode: Ф - U+0424

# without encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/.encoding    # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/.encoding               # US-ASCII
puts /a #{ "\u0424".force_encoding("UTF-8") } c/.encoding         # UTF-8
puts /a #{ "\xc2\xa1".b } c/.encoding                             # ASCII-8BIT

# with encoding modifier
puts /a #{ "\xd4".force_encoding("windows-1251") } c/e.encoding   # Windows-1251
puts /a #{ "b".encode("windows-1251") } c/e.encoding              # EUC-JP
puts /a #{ "\u0424".force_encoding("UTF-8") } c/e.encoding        # UTF-8
puts /a #{ "\xc2\xa1".b } c/e.encoding                            # ASCII-8BIT

# string interpolation
puts "a #{ "\xd4".force_encoding("windows-1251") } c".encoding    # Windows-1251
puts "a #{ "b".encode("windows-1251") } c".encoding               # UTF-8
puts "a #{ "\u0424".force_encoding("UTF-8") } c".encoding         # UTF-8
puts "a #{ "\xc2\xa1".b } c".encoding                             # ASCII-8BIT

In the examples above, the e modifier changes the Regexp's encoding in only one case, namely when the Regexp's encoding would be US-ASCII without the modifier:

# encoding: us-ascii

puts /a #{ "b".encode("windows-1251") } c/.encoding                        # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding                       # EUC-JP

# encoding: utf-8

puts /a #{ "b".encode("windows-1251") } c/.encoding                        # US-ASCII
puts /a #{ "b".encode("windows-1251") } c/e.encoding                       # EUC-JP

In all the other cases, where the Regexp's encoding without the modifier is either a source file encoding or ASCII-8BIT, the e modifier doesn't change the Regexp's final encoding.
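The comparison above can be wrapped in a small helper to make it easy to try other fragments. This is only a sketch: the method name encodings_with_and_without_e and the constant ASCII_ONLY_FRAGMENT are hypothetical, and only the ASCII-only case is exercised here, because the other results depend on the fragment's bytes and encoding.

```ruby
# Hypothetical helper: build the same interpolated Regexp twice,
# once without a modifier and once with the e modifier.
def encodings_with_and_without_e(fragment)
  {
    without: /a #{fragment} c/.encoding,
    with_e:  /a #{fragment} c/e.encoding,
  }
end

# The ASCII-only Windows-1251 fragment from the examples above.
ASCII_ONLY_FRAGMENT = encodings_with_and_without_e("b".encode("windows-1251"))
p ASCII_ONLY_FRAGMENT
```

For the ASCII-only fragment this reproduces the US-ASCII versus EUC-JP difference reported above; passing the other fragments from the examples lets the remaining rows be checked the same way.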

Looking at the following example:

# encoding: us-ascii

# without modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/.encoding                                 # ASCII-8BIT
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/.encoding              # EUC-JP
p /a #{ "\xc2\xa1".b } b/.encoding                                     # ASCII-8BIT

# with modifier
p /\xc2\xa1 #{ "a" }\xc2\xa1/e.encoding                                # EUC-JP
p /a #{ "\xc2\xa1".force_encoding("EUC-JP") } b/e.encoding             # EUC-JP
p /a #{ "\xc2\xa1".b } b/e.encoding                                    # ASCII-8BIT

we can see that the e modifier changes ASCII-8BIT to EUC-JP in the first case (/\xc2\xa1 #{ "a" }\xc2\xa1/) but not in the third one (/a #{ "\xc2\xa1".b } b/). So I assume that the e modifier is applied to the literal Regexp fragments (\xc2\xa1 and \xc2\xa1) before encoding negotiation, and not to the whole result after negotiation.
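The assumption can be modelled with plain Strings. The sketch below is a hypothetical model and not MRI's actual implementation: it tags each literal fragment with the modifier's encoding first, then lets ordinary String concatenation negotiate the final encoding. The method name model_regexp_encoding and the :literal / :interpolated tags are made up for illustration.

```ruby
# Hypothetical model (not MRI's implementation): apply the modifier's encoding
# to literal fragments only, then concatenate all parts and read the
# encoding that String#+ negotiated.
def model_regexp_encoding(parts, modifier_encoding)
  tagged = parts.map do |kind, text|
    kind == :literal ? text.dup.force_encoding(modifier_encoding) : text
  end
  tagged.reduce(:+).encoding
end

# First case: /\xc2\xa1 #{ "a" }\xc2\xa1/e
FIRST_CASE = model_regexp_encoding(
  [[:literal, "\xc2\xa1 "], [:interpolated, "a"], [:literal, "\xc2\xa1"]],
  Encoding::EUC_JP
)

# Third case: /a #{ "\xc2\xa1".b } b/e
THIRD_CASE = model_regexp_encoding(
  [[:literal, "a "], [:interpolated, "\xc2\xa1".b], [:literal, " b"]],
  Encoding::EUC_JP
)

p FIRST_CASE  # EUC-JP: the non-ASCII bytes sit in the tagged literal fragments
p THIRD_CASE  # binary (ASCII-8BIT): the non-ASCII bytes sit in the interpolated fragment
```

Under this model the ASCII-only fragments yield to whichever fragment carries non-ASCII bytes, which is consistent with the first and third results in the example above. Whether the real Regexp compiler works this way is exactly the open question.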

Could you please clarify how it works?

Related issues 2 (2 open, 0 closed)

Related to Ruby - Misc #20406: Question about Regexp encoding negotiation (Open)
Related to Ruby - Bug #20466: Interpolated regular expressions have different encoding than interpolated strings (Open)


Updated by andrykonchin (Andrew Konchin) about 2 years ago #1
Updated by andrykonchin (Andrew Konchin) about 2 years ago #2
Updated by andrykonchin (Andrew Konchin) about 2 years ago #3
Updated by andrykonchin (Andrew Konchin) about 2 years ago #4
Updated by Eregon (Benoit Daloze) about 2 years ago #5
Updated by Eregon (Benoit Daloze) about 2 years ago #6
Updated by naruse (Yui NARUSE) almost 2 years ago #7 [ruby-core:117903]
Updated by nobu (Nobuyoshi Nakada) almost 2 years ago #8 [ruby-core:118197]
Updated by matz (Yukihiro Matsumoto) almost 2 years ago #9 [ruby-core:118224]
Updated by naruse (Yui NARUSE) almost 2 years ago #10 [ruby-core:118549]
Updated by matz (Yukihiro Matsumoto) almost 2 years ago #11 [ruby-core:118550]
