Feature #9111: Encoding-free String comparison - Ruby - Ruby Issue Tracking System

Updated by nobu (Nobuyoshi Nakada) over 11 years ago

sawa (Tsuyoshi Sawada) wrote:

I suggest that the comparison String#<=> should not be based on the respective encoding of the strings, but all the strings should be internally converted to UTF-8 for the purpose of comparison.

It's unacceptable to always convert all strings to UTF-8; the conversion should be restricted to comparisons with an ASCII-8BIT string.
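
For readers following the thread, here is a minimal sketch of the behavior under discussion. It assumes "é" as the sample character and ISO-8859-1 as the second encoding; neither choice comes from the original report.

```ruby
# The same character "é", stored in two different encodings.
utf8   = "\u00E9"                    # UTF-8 bytes: C3 A9
latin1 = utf8.encode("ISO-8859-1")   # ISO-8859-1 byte: E9

# Compared as they are, the strings differ because their bytes differ.
as_is = (utf8 <=> latin1)

# After transcoding one side to the other's encoding, they compare equal.
transcoded = (utf8 <=> latin1.encode("UTF-8"))

puts "as is:      #{as_is.inspect}"
puts "transcoded: #{transcoded.inspect}"
```

The explicit `encode` call is what the proposal would make implicit; nobu's objection is to doing that for every comparison.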

#2 [ruby-core:58339]

Updated by sawa (Tsuyoshi Sawada) over 11 years ago

Following nobu's suggestion, I came up with the following possibilities:

When two strings with different encodings are to be compared by String#<=>, then one of the following options should be taken:

Issue a warning
Raise an error
Convert one of the strings to the encoding of the other.

I am not sure which option would be the best, but I feel the behavior should not be left as it is now.
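
The three options above can be sketched as a helper in user code. This is only an illustration: `compare_across_encodings` and its `on_mismatch:` keyword are hypothetical names, not part of Ruby, and the sketch treats any difference in encoding as a mismatch, which is stricter than Ruby's own compatibility rules for ASCII-only strings.

```ruby
# Hypothetical helper (not a Ruby API) illustrating the three options.
def compare_across_encodings(a, b, on_mismatch: :convert)
  # Same encoding: nothing special to do.
  return a <=> b if a.encoding == b.encoding

  case on_mismatch
  when :warn
    # Option 1: warn, then fall back to the existing comparison.
    warn "comparing #{a.encoding} string with #{b.encoding} string"
    a <=> b
  when :raise
    # Option 2: refuse to compare.
    raise Encoding::CompatibilityError,
          "cannot compare #{a.encoding} with #{b.encoding}"
  when :convert
    # Option 3: transcode b to a's encoding, then compare.
    # String#encode may itself raise if b cannot be represented in a.encoding.
    a <=> b.encode(a.encoding)
  else
    raise ArgumentError, "unknown option: #{on_mismatch.inspect}"
  end
end

utf8   = "\u00E9"
latin1 = utf8.encode("ISO-8859-1")

puts compare_across_encodings(utf8, latin1, on_mismatch: :convert).inspect
```

Which string gets converted in option 3 is a design decision the thread leaves open; the sketch arbitrarily converts the second argument.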

#3 [ruby-core:58343]

Updated by Hanmac (Hans Mackowiak) over 11 years ago

What about strings with the same encoding but different content that are rendered the same?
For example, "â" can be made from "a" + "^" somehow; should they also be treated as equal?

#4 [ruby-core:58354]

Updated by sawa (Tsuyoshi Sawada) over 11 years ago

Hanmac: "â" can be made from "a" + "^"

Treating them as the same is too much, I think. There are various notations for such characters; for example, â is written differently in TeX. Assuming they are equal goes too far. They should be treated as different.

#5 [ruby-core:58364]

Updated by Hanmac (Hans Mackowiak) over 11 years ago

I found the Wikipedia source: http://en.wikipedia.org/wiki/Combining_character
It's not about treating "^a" or "a^" the same as "â", but there is a way to glue the characters together.

I think that's also a reason for http://api.rubyonrails.org/classes/String.html#method-i-mb_chars ?

I found another interesting gem: http://rubygems.org/gems/unicode_utils
With it, it is also possible to do something like this: "ä".upcase => "Ä"

There is another page about combining characters: http://sbp.so/supercombiner

#6 [ruby-core:58459]

Updated by naruse (Yui NARUSE) over 11 years ago

Hanmac (Hans Mackowiak) wrote:

What about strings with the same encoding but different content that are rendered the same?
For example, "â" can be made from "a" + "^" somehow; should they also be treated as equal?

The standard practice is to compare the normalized forms: NFD("â") == NFD("a" + "^").
To apply NFD, you can use one of several libraries.
See also http://bibwild.wordpress.com/2013/11/19/benchmarking-ruby-unicode-normalization-alternatives/
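
As a rough illustration of the NFD comparison, the sketch below uses `String#unicode_normalize`, which is built into Ruby 2.2 and later; at the time of this comment an external library would have been needed instead. It assumes that "^" in the discussion stands for the combining circumflex accent (U+0302), since a plain ASCII caret (U+005E) does not combine with the preceding letter.

```ruby
precomposed = "\u00E2"     # "â" as a single code point
decomposed  = "a\u0302"    # "a" followed by COMBINING CIRCUMFLEX ACCENT
ascii_caret = "a^"         # "a" followed by a plain ASCII caret

# Without normalization the two spellings of "â" are different strings.
raw_equal = (precomposed == decomposed)

# After NFD normalization both are the same code point sequence.
nfd_equal = (precomposed.unicode_normalize(:nfd) ==
             decomposed.unicode_normalize(:nfd))

# The ASCII caret is left alone by normalization, so "a^" never becomes "â".
caret_equal = (ascii_caret.unicode_normalize(:nfc) == precomposed)

puts "raw:   #{raw_equal}"
puts "NFD:   #{nfd_equal}"
puts "caret: #{caret_equal}"
```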

#7 [ruby-core:63961]

Updated by duerst (Martin Dürst) almost 11 years ago

Added relation: Related to Feature #10084: Add Unicode String Normalization to String class
