Feature #14919
Closed
Add String#byteinsert
Description
It's important for multibyte String editing. A Unicode grapheme cluster sometimes consists of multiple code points. In text editing, software sometimes needs to add a new code point to an existing grapheme. String#byteinsert is important for this.
I implemented it in pure Ruby in my own code:
https://github.com/aycabta/reline/blob/b17e5fd61092adfd7e87d576301e4e19a4d9e6d8/lib/reline/line_editor.rb#L255-L260
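For readers who do not follow the link, a pure-Ruby helper of this kind might look roughly like the sketch below. It is not copied from the linked file; the method name and its (str, byte_pointer, other) signature are assumptions made for illustration.

```ruby
# Hypothetical helper (name and signature are assumptions, not the linked code):
# returns a new string with `other` inserted at byte offset `byte_pointer`.
def byteinsert(str, byte_pointer, other)
  head = str.byteslice(0, byte_pointer)
  tail = str.byteslice(byte_pointer, str.bytesize - byte_pointer)
  head + other + tail
end

line   = "I \u2665 Ruby!"        # "I ♥ Ruby!"; the heart is 3 bytes in UTF-8
offset = "I \u2665".bytesize     # byte offset just after the heart => 5
result = byteinsert(line, offset, "\uFE0F")

puts result                      # the heart now carries a variation selector
puts result.valid_encoding?      # true, because the offset is on a character boundary
```

The result stays valid only because the offset falls between two characters; nothing in this sketch checks for that.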
Updated by aycabta (aycabta .) over 6 years ago
- Tracker changed from Bug to Feature
- Backport deleted (2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN)
Updated by duerst (Martin Dürst) over 6 years ago
aycabta (aycabta .) wrote:
It's important for multibyte String editing. A Unicode grapheme cluster sometimes consists of multiple code points. In text editing, software sometimes needs to add a new code point to an existing grapheme. String#byteinsert is important for this.
Can you explain this a bit more? Editing of code points is easily possible with String#[]=; there is no need to use byteinsert.
Updated by aycabta (aycabta .) over 6 years ago
duerst (Martin Dürst) wrote:
Editing of code points is easily possible with String#[]=; there is no need to use byteinsert.
Input from CLI
In a CLI tool, all characters arrive as individual bytes, so multibyte characters are split. In the middle of a line, software should insert a new character rather than replace an existing one.
Yank
In the middle of a line, a yank operation needs #byteinsert for multibyte editing.
Updated by duerst (Martin Dürst) over 6 years ago
aycabta (aycabta .) wrote:
duerst (Martin Dürst) wrote:
Editing of code points is easily possible with String#[]=; there is no need to use byteinsert.
Input from CLI
In a CLI tool, all characters arrive as individual bytes, so multibyte characters are split.
On the lowest level, characters indeed come in as a string of bytes. But it would be wrong to insert individual bytes into a string unless these bytes are also characters. It would just lead to mojibake.
The right thing to do is to collect a (small) number of bytes, check how many bytes are needed to form one or more characters, insert these characters into the string, and keep the remaining bytes for further processing (wait until more bytes arrive so that we get more complete codepoints/characters).
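The collect-and-check procedure described above could be sketched as follows, assuming UTF-8 input. ByteCollector and feed are hypothetical names invented for this illustration, not part of Ruby or of any library.

```ruby
# Hypothetical helper: buffers raw input bytes and releases only whole
# UTF-8 characters, keeping any incomplete trailing sequence for later.
class ByteCollector
  def initialize
    @pending = String.new(encoding: Encoding::BINARY)
  end

  # Accepts newly arrived bytes and returns the complete characters
  # (possibly an empty string) that can be formed so far.
  # Limitation: a genuinely invalid byte is never released by this sketch.
  def feed(bytes)
    @pending << bytes.b
    candidate = @pending.dup.force_encoding(Encoding::UTF_8)
    complete = String.new(encoding: Encoding::UTF_8)
    candidate.each_char do |ch|
      break unless ch.valid_encoding?   # stop at the first incomplete sequence
      complete << ch
    end
    @pending = @pending.byteslice(complete.bytesize..)
    complete
  end
end

collector = ByteCollector.new
line   = +"I  Ruby!"      # cursor sits between the two spaces
cursor = 2                # a character index, not a byte offset

"\u2764".bytes.each do |byte|          # simulate the bytes arriving one by one
  chars = collector.feed(byte.chr)
  line[cursor, 0] = chars              # zero-length String#[]= inserts
  cursor += chars.length
end

puts line                 # the heart appears only once all 3 bytes have arrived
```

Because only complete characters ever reach the line, it is validly encoded after every step.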
In the middle of a line, software should insert a new character rather than replace an existing one.
Insertion of characters can be done with String#[]=.
Yank
In the middle of a line, a yank operation needs #byteinsert for multibyte editing.
I still don't see why. You don't want to insert bytes, you want to insert characters, so that the String is correctly encoded at all times.
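As a minimal sketch of character-level insertion, the snippet below uses a zero-length String#[]= and String#insert, both of which take character indices. The sample text and the byte_pointer variable are made up for illustration; the conversion from a byte offset to a character index is one possible approach for an editor that tracks its cursor in bytes.

```ruby
line = +"\u65E5\u672C Ruby"        # "日本 Ruby"

# Insert one character at character index 2 (zero-length replacement).
line[2, 0] = "\u8A9E"              # line is now "日本語 Ruby"

# Yank: insert a whole string at the cursor. If the cursor is tracked as a
# byte offset, it can be converted to a character index first.
byte_pointer = "\u65E5\u672C\u8A9E ".bytesize          # => 10
char_index   = line.byteslice(0, byte_pointer).length  # => 4
yanked = "\u2764\uFE0F "
line.insert(char_index, yanked)

puts line
puts line.valid_encoding?          # true: only whole characters were inserted
```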
Updated by shevegen (Robert A. Heiler) over 6 years ago
I don't have a specific opinion on the suggestion itself; Martin raised some valid
points, in my opinion. But I wanted to comment on something else.
There have been some suggestions to the developer meeting, as recently as 8 hours
ago; so probably just shortly before the developer meeting started:
https://bugs.ruby-lang.org/issues/14861
This is a very short time frame. I would like to suggest to give a little bit more
time before the developer meeting, so that other people can also comment on the
suggestions. Something like +24 hours or so if it has not yet been discussed; I feel
that ~8 hours without any real possibility for a discussion is very, very short.
Updated by noraj (Alexandre ZANNI) about 2 years ago
Yes, a grapheme can be composed of several code points.
An example is the variation selector:
irb(main):001:0> a = "\u2665\n\u2764\n\u2665\ufe0f\n\u2764\ufe0f"
=> "♥\n❤\n♥️\n❤️"
irb(main):002:0> puts a
♥
❤
♥️
❤️
=> nil
irb(main):003:0> a.chars
=> ["♥", "\n", "❤", "\n", "♥", "️", "\n", "❤", "️"]
Fortunately, in Ruby, string indices already address characters (code points), not graphemes. So, as Martin highlighted, String#[]= already covers all the use cases I can think of.
irb(main):007:0> r = "I \u2665 Ruby!"
=> "I ♥ Ruby!"
irb(main):009:0> r[2] = "\u2764\ufe0f"
=> "❤️"
irb(main):010:0> r
=> "I ❤️ Ruby!"
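Adding a code point to an existing grapheme, the case from the original description, also seems to work with character indices. A small sketch using String#insert and String#grapheme_clusters (the sample string is the one from above):

```ruby
r = +"I \u2665 Ruby!"                               # "I ♥ Ruby!"
before = [r.length, r.grapheme_clusters.length]    # => [9, 9]

# Append U+FE0F (variation selector-16) to the heart, which is character 2,
# by inserting at character index 3.
r.insert(3, "\uFE0F")
after = [r.length, r.grapheme_clusters.length]     # => [10, 9]

p before
p after
```

The string gains one code point, but the number of grapheme clusters stays the same.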
The only use I can think of for String#byteinsert would be to manipulate the UTF-8 bytes directly in order to forge an invalid encoding on purpose. But such a use case is rare and advanced, so maybe it can be handled with pack and unpack rather than by adding a new byteinsert method?
irb(main):014:0> r.unpack1('a*')
=> "I \xE2\x9D\xA4\xEF\xB8\x8F Ruby!"
@aycabta (aycabta .) Maybe you could give a handy example of a String#byteinsert use case that I can't think of?
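To make the pack/unpack route mentioned above concrete, here is a minimal sketch that deliberately forges an invalid UTF-8 string by inserting a stray byte. The byte value 0xFF and the position are arbitrary choices for illustration.

```ruby
r = "I \u2764\uFE0F Ruby!"

bytes = r.unpack("C*")          # array of byte values
bytes.insert(3, 0xFF)           # lands inside the 3-byte heart sequence
forged = bytes.pack("C*").force_encoding(Encoding::UTF_8)

puts forged.valid_encoding?     # false: the string is now mojibake on purpose
puts forged.bytesize - r.bytesize
```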
Updated by ufuk (Ufuk Kayserilioglu) 6 months ago
Given that we have had String#bytesplice since Ruby 3.2, these kinds of operations should be possible using "xxxxx".bytesplice(byte_pointer, 0, other) to insert the bytes of other at byte_pointer, and "xxxxx".bytesplice(byte_pointer, num, "") to remove num bytes at byte_pointer.
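A short sketch of both calls, assuming Ruby 3.2 or later. It relies only on the receiver being modified in place, not on the return value, and the sample string is made up for illustration.

```ruby
line = +"I \u2665 Ruby!"                 # "I ♥ Ruby!"
byte_pointer = "I \u2665".bytesize       # => 5, just after the heart

# Insert: replace zero bytes at byte_pointer with the bytes of `other`.
other = "\uFE0F"
line.bytesplice(byte_pointer, 0, other)
inserted = line.dup                      # "I ♥️ Ruby!"

# Remove: replace num bytes at byte_pointer with an empty string.
line.bytesplice(byte_pointer, other.bytesize, "")
removed = line.dup                       # back to "I ♥ Ruby!"

# An offset inside a multibyte character is rejected with IndexError.
begin
  line.bytesplice(3, 0, "x")
  misaligned = :accepted
rescue IndexError
  misaligned = :rejected
end

p inserted, removed, misaligned
```

Because misaligned offsets raise instead of silently splitting a character, the string stays validly encoded.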
Updated by jeremyevans0 (Jeremy Evans) 6 months ago
- Status changed from Open to Closed