Bug #19784: String#delete_prefix! problem - Ruby - Ruby Issue Tracking System

Actions

Copy link

Bug #19784

closed

String#delete_prefix! problem

Bug #19784: String#delete_prefix! problem

Added by inversion (Yura Babak) almost 3 years ago. Updated almost 3 years ago.

Status:

Closed

Assignee:

Target version:

ruby -v:

ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]

Backport:

3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN

[ruby-core:114276]

Description

Here is the snipped and the question is in the comments:

fp = 'with_BOM_16.txt'
body = File.read(fp).force_encoding('UTF-8')

p body                            # "\xFF\xFE1\u00001\u0000"
p body.start_with?("\xFF\xFE")    # true

body.delete_prefix!("\xFF\xFE")   # !!! why doesn't work?
p body                            # "\xFF\xFE1\u00001\u0000"
p body.start_with?("\xFF\xFE")    # true

body[0, 2] = ''
p body                            # "1\u00001\u0000"
p body.start_with?("\xFF\xFE")    # false

Works same
on Linux (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux])
and Windows (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x64-mingw-ucrt])

Updated by byroot (Jean Boussier) almost 3 years ago Actions
Copy link
#1 [ruby-core:114278]

I suspect it's because "\xFF\xFE1\u00001\u0000" (UTF-8) has invalid encoding.

That said, if starts_with? returns true, delete_prefix should arguably work.

I'll try to see if I can find why, but in the meantime I tested that you can workaround this by casting the string to Encoding::BINARY:

str = "\xff\xfe1\u00001\u0000"
str.force_encoding(Encoding::BINARY)
str.delete_prefix!("\xff\xfe".b)
str.force_encoding(Encoding::UTF_8)
p str # => "1\u00001\u0000"

Updated by byroot (Jean Boussier) almost 3 years ago Actions
Copy link
#2 [ruby-core:114279]

Ok, so as suspected, in the code strings with a broken encoding are very explicitly rejected: https://github.com/ruby/ruby/commit/10082360b9124c3eaabfccf4fe10a3640c40a05e#diff-430d86fdb6c4a558ab0f1b6648bbfae1720e8bde84f026e95a52740014752040R9201

if (is_broken_string(prefix)) return 0;

This is from the initial implementation decided in [Feature #12694], and the PR implementing it was https://github.com/ruby/ruby/pull/1632

I don't see much discussion of that behavior in either the ticket or PR, but there are tests for it so it was intentional from the PR author, not necessarily from committers.

I'd be in favor of changing it so that it matches start_with? behavior, but I think that needs to be discussed at the developer meeting. It's also unclear whether if accepted it would be as a feature or a bug fix.

Updated by byroot (Jean Boussier) almost 3 years ago 1Actions
Copy link
#3 [ruby-core:114281]

Ticket added to the next dev meeting.

Updated by duerst (Martin Dürst) almost 3 years ago Actions
Copy link
#4 [ruby-core:114294]

I agree that start_with? and delete_prefix! should probably be aligned. But in general, encoding problems should lead to early errors, not be deferred.

Updated by mame (Yusuke Endoh) almost 3 years ago Actions
Copy link
#5 [ruby-core:114486]

Discussed at the dev meeting.
We agreed that both start_with? and delete_prefix should compare character by character if valid, and byte by byte for invalid.

"\xFF\xFE".start_with?("\xFF")   #=> true   # not changed
"\xFF\xFE".delete_prefix("\xFF") #=> "\xFE" # changed
"Ä".start_with?("\xC3")          #=> false  # changed
"Ä".delete_prefix("\xC3")        #=> "Ä"    # not changed

Note that encoding validity is checked on a position-by-position basis:

"\xFFÄ".delete_prefix("\xFF") #=> should be "Ä"
"\xFFÄ".delete_prefix("\xFF\xC3") #=> should be "\xFFÄ"
"\xFFÄ".delete_prefix("\xFF\xC3\x84") #=> should be ""

Updated by Eregon (Benoit Daloze) almost 3 years ago Actions
Copy link
#6 [ruby-core:114494]

mame (Yusuke Endoh) wrote in #note-5:

Note that encoding validity is checked on a position-by-position basis:

That's not clear to me.
Do you mean that if the argument or if the receiver String is not String#valid_encoding?, then we compare byte-by-byte, and otherwise we compare character-by-character ?

I think we should not consider whether a given substring is valid, that would be very expensive to check.

I wonder if always comparing byte-by-byte for both is not a lot simpler, faster, and also what most users expect.
And of course still have the encoding check to ensure both strings have a compatible encoding.

Updated by mame (Yusuke Endoh) almost 3 years ago Actions
Copy link
#7 [ruby-core:114527]

Do you mean that if the argument or if the receiver String is not String#valid_encoding?, then we compare byte-by-byte, and otherwise we compare character-by-character ?

No.

I think we should not consider whether a given substring is valid

I think it's this way. In the case of "\xFF\xC3\x84" (= "\xFFÄ"), the byte sequence from byteoffset 0 is invalid, so we take out one byte "\xFF", and the next byte sequence from byteoffset 1 is valid, so we take out two bytes (one character) "\xC3\x84", and so on, I think @akr (Akira Tanaka) or @naruse (Yui NARUSE) can explain the rationale.

Updated by nobu (Nobuyoshi Nakada) almost 3 years ago 1Actions
Copy link
#8

Status changed from Open to Closed

Applied in changeset git|b054c2fe06598f1141fdc337b10046f41f0e227c.

[Bug #19784] Fix behaviors against prefix with broken encoding

String#start_with?
String#delete_prefix
String#delete_prefix!

Updated by byroot (Jean Boussier) almost 3 years ago 1Actions
Copy link
#9 [ruby-core:114563]

Note that we have the same issue with end_with? and delete_suffix.

Updated by jhawthorn (John Hawthorn) almost 3 years ago 2Actions
Copy link
#10 [ruby-core:114608]

@ywenc (CY Wen) and I found a regression from this patch. We have some code handling a broken UTF-8 String with a combination of valid and invalid bytes (UTF-8 followed by binary, which IMO should probably be binary encoded, but it's surprising that the behaviour changed).

"hello\xBE".start_with?("hello") #=> false in trunk, was true on 3.2
"hello\xFE".start_with?("hello") #=> true (both 3.2 and trunk, intended behaviour)
"hello\xBE".delete_prefix("hello") => "\xBE" (both on 3.2 and trunk), because we skip the check when the prefix is valid
"\xFFhello\xBE".delete_prefix("\xFFhello") => "\xFFhello\xBE" in trunk, "\xBE" on 3.2

This is because we're looking at character following the prefix, observing that it looks like a UTF-8 continuation byte, and so returns false.

This approach might work for ends_with?/delete_suffix, where we don't break on an invalid character in the suffix, but doesn't feel right for prefixes. It sounds like the intended design is that to the user this should feel like we were comparing from the start of the strings char-by-char for valid and byte-by-byte for invalid.

We added tests and tried using the end of the previous character, rather than the "start" of the current, to determine if the prefix ends at a char boundary. https://github.com/ruby/ruby/pull/8348

Actions

Copy link

Also available in: PDF Atom

Project

General

Profile

Ruby

Custom queries

Bug #19784

String#delete_prefix! problem

Updated by byroot (Jean Boussier) almost 3 years ago Actions
Copy link
#1 [ruby-core:114278]

Updated by byroot (Jean Boussier) almost 3 years ago Actions
Copy link
#2 [ruby-core:114279]

Updated by byroot (Jean Boussier) almost 3 years ago 1Actions
Copy link
#3 [ruby-core:114281]

Updated by duerst (Martin Dürst) almost 3 years ago Actions
Copy link
#4 [ruby-core:114294]

Updated by mame (Yusuke Endoh) almost 3 years ago Actions
Copy link
#5 [ruby-core:114486]

Updated by Eregon (Benoit Daloze) almost 3 years ago Actions
Copy link
#6 [ruby-core:114494]

Updated by mame (Yusuke Endoh) almost 3 years ago Actions
Copy link
#7 [ruby-core:114527]

Updated by nobu (Nobuyoshi Nakada) almost 3 years ago 1Actions
Copy link
#8

Updated by byroot (Jean Boussier) almost 3 years ago 1Actions
Copy link
#9 [ruby-core:114563]

Updated by jhawthorn (John Hawthorn) almost 3 years ago 2Actions
Copy link
#10 [ruby-core:114608]

Project

General

Profile

Ruby

Custom queries

Bug #19784

String#delete_prefix! problem

Updated by byroot (Jean Boussier) almost 3 years ago ActionsCopy link #1 [ruby-core:114278]

Updated by byroot (Jean Boussier) almost 3 years ago ActionsCopy link #2 [ruby-core:114279]

Updated by byroot (Jean Boussier) almost 3 years ago 1ActionsCopy link #3 [ruby-core:114281]

Updated by duerst (Martin Dürst) almost 3 years ago ActionsCopy link #4 [ruby-core:114294]

Updated by mame (Yusuke Endoh) almost 3 years ago ActionsCopy link #5 [ruby-core:114486]

Updated by Eregon (Benoit Daloze) almost 3 years ago ActionsCopy link #6 [ruby-core:114494]

Updated by mame (Yusuke Endoh) almost 3 years ago ActionsCopy link #7 [ruby-core:114527]

Updated by nobu (Nobuyoshi Nakada) almost 3 years ago 1ActionsCopy link #8

Updated by byroot (Jean Boussier) almost 3 years ago 1ActionsCopy link #9 [ruby-core:114563]

Updated by jhawthorn (John Hawthorn) almost 3 years ago 2ActionsCopy link #10 [ruby-core:114608]

Updated by byroot (Jean Boussier) almost 3 years ago Actions
Copy link
#1 [ruby-core:114278]

Updated by byroot (Jean Boussier) almost 3 years ago Actions
Copy link
#2 [ruby-core:114279]

Updated by byroot (Jean Boussier) almost 3 years ago 1Actions
Copy link
#3 [ruby-core:114281]

Updated by duerst (Martin Dürst) almost 3 years ago Actions
Copy link
#4 [ruby-core:114294]

Updated by mame (Yusuke Endoh) almost 3 years ago Actions
Copy link
#5 [ruby-core:114486]

Updated by Eregon (Benoit Daloze) almost 3 years ago Actions
Copy link
#6 [ruby-core:114494]

Updated by mame (Yusuke Endoh) almost 3 years ago Actions
Copy link
#7 [ruby-core:114527]

Updated by nobu (Nobuyoshi Nakada) almost 3 years ago 1Actions
Copy link
#8

Updated by byroot (Jean Boussier) almost 3 years ago 1Actions
Copy link
#9 [ruby-core:114563]

Updated by jhawthorn (John Hawthorn) almost 3 years ago 2Actions
Copy link
#10 [ruby-core:114608]