Feature #6361
closedBitwise string operations
Added by MartinBosslet (Martin Bosslet) over 12 years ago. Updated over 12 years ago.
Description
I know this has been discussed a lot in the past (and if there's still
an open issue for this, I apologize, I couldn't find one), for example
in [1]. While it is generally no problem to implement this on the fly,
I still find that built-in support would be a real improvement. There
are quite some use cases in cryptography where this would come in very
handy, but I'm sure there are lots of other areas, too.
While of course I understand the reasons that were given in the previous
threads that ultimately lead to rejection, I still would like to reopen
the discussion as I felt that in every thread so far the consensus was
that having bitwise string operations would indeed be quite valuable.
[1] http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/34586
Files
bytestring001.tar.gz (3.39 KB) bytestring001.tar.gz | MartinBosslet (Martin Bosslet), 04/30/2012 11:52 AM | ||
bytestring-002.tar.gz (3.47 KB) bytestring-002.tar.gz | MartinBosslet (Martin Bosslet), 05/03/2012 09:07 AM | ||
6361-proposal.pdf (25.4 KB) 6361-proposal.pdf | jballanc (Joshua Ballanco), 06/30/2012 12:13 AM |
Updated by headius (Charles Nutter) over 12 years ago
+1 from me...
There are many cases where high-perf treatment of a homogeneous array would be useful in Ruby, as evidenced by libraries like NArray and friends. NArray has been proposed for inclusion in Ruby in the past, but I don't know why it never was.
Providing a few bitwise operations on String that operate against the raw bytes would fulfill one of the most common use cases, that of a byte[] and symmetric manipulations of all elements.
Updated by naruse (Yui NARUSE) over 12 years ago
- Status changed from Open to Feedback
Propose a detailed spec with use cases.
For example the behavior of the case "aa" ^ "bbbb"
Updated by trans (Thomas Sawyer) over 12 years ago
Would it be better to have a dedicated class, e.g. a Blob ?
Updated by alexeymuranov (Alexey Muranov) over 12 years ago
+1 for Blob, strings are not the only data that one might wish to encrypt.
Updated by jballanc (Joshua Ballanco) over 12 years ago
+1 for Blob (or, my preferred name: Data). I've mentioned this in the past, but we can look at Obj-C's example: NSString => a string class with encoding and code-point/character-wise manipulation, NSData => a class to encapsulate a random array of bytes.
Updated by sdaubert (Sylvain Daubert) over 12 years ago
+1 : I use them every day.
Updated by nobu (Nobuyoshi Nakada) over 12 years ago
Then what kind of methods should Blob have?
And does it need to be built-in?
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
nobu (Nobuyoshi Nakada) wrote:
Then what kind of methods should Blob have?
And does it need to be built-in?
A real advantage of having it built-in could be
that this gives us the chance to fix #5741 at
the same time. I could imagine that we have two
kinds of "byte array" classes - one, mutable,
that shares COW semantics and all the other
optimizations with String, but with no notion of
encoding and a yet-to-be-defined interface.
And then a second class, which is basically the
immutable version of the first one. By sharing
only a reference we could ensure that the content
would not be proliferated and we could securely
erase its contents after use.
Updated by kosaki (Motohiro KOSAKI) over 12 years ago
On Fri, Apr 27, 2012 at 8:53 PM, MartinBosslet (Martin Bosslet)
Martin.Bosslet@googlemail.com wrote:
Issue #6361 has been updated by MartinBosslet (Martin Bosslet).
nobu (Nobuyoshi Nakada) wrote:
Then what kind of methods should Blob have?
And does it need to be built-in?
A real advantage of having it built-in could be
that this gives us the chance to fix #5741 at
the same time. I could imagine that we have two
kinds of "byte array" classes - one, mutable,
that shares COW semantics and all the other
optimizations with String, but with no notion of
encoding and a yet-to-be-defined interface.And then a second class, which is basically the
immutable version of the first one. By sharing
only a reference we could ensure that the content
would not be proliferated and we could securely
erase its contents after use.
I don't dislike a bult-in idea. But you haven't show a detailed spec
and I don't think I clearly understand your idea. Can you spend a
few time for writing a spec? (probably rough a few line explanation
is enough)
Updated by nobu (Nobuyoshi Nakada) over 12 years ago
Hi,
(12/04/28 9:53), MartinBosslet (Martin Bosslet) wrote:
A real advantage of having it built-in could be
that this gives us the chance to fix #5741 at
the same time.
It doesn't explain anything about why they need to be built-in. You
can just make them an external library altogether, can't you?
--
Nobu Nakada
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
kosaki (Motohiro KOSAKI) wrote:
I don't dislike a bult-in idea. But you haven't show a detailed spec
and I don't think I clearly understand your idea. Can you spend a
few time for writing a spec? (probably rough a few line explanation
is enough)
Yes, I thought about writing a proof of concept that could serve as
the blueprint for further discussion. I just wanted to make sure
first that there is a general interest and I also wanted to get your
ideas and suggestions first :) Nobuyoshi's point is a very valid one,
it helped me get a better view on the subject, and to rethink my own
position. (More on that below)
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
nobu (Nobuyoshi Nakada) wrote:
Hi,
(12/04/28 9:53), MartinBosslet (Martin Bosslet) wrote:
A real advantage of having it built-in could be
that this gives us the chance to fix #5741 at
the same time.It doesn't explain anything about why they need to be built-in. You
can just make them an external library altogether, can't you?
I thought about this, and yes, you are absolutely right, there's
nothing I couldn't do in a separate C extension. The only remaining
argument I have for built-in support:
I would have liked to use the "secure memory erasure" feature within
the OpenSSL extension, which I couldn't do if it's a separate
library outside of the stdlib.
I'm not sure if that's enough. Especially since we could add the
functionality to OpenSSL directly. Being a "nice to have" feature
in itself probably would not justify adding built-in support, but
I'm fine with either decision.
Updated by jballanc (Joshua Ballanco) over 12 years ago
On Saturday, April 28, 2012 at 8:52 AM, KOSAKI Motohiro wrote:
On Fri, Apr 27, 2012 at 8:53 PM, MartinBosslet (Martin Bosslet)
<Martin.Bosslet@googlemail.com (mailto:Martin.Bosslet@googlemail.com)> wrote:Issue #6361 has been updated by MartinBosslet (Martin Bosslet).
nobu (Nobuyoshi Nakada) wrote:
Then what kind of methods should Blob have?
And does it need to be built-in?
A real advantage of having it built-in could be
that this gives us the chance to fix #5741 at
the same time. I could imagine that we have two
kinds of "byte array" classes - one, mutable,
that shares COW semantics and all the other
optimizations with String, but with no notion of
encoding and a yet-to-be-defined interface.And then a second class, which is basically the
immutable version of the first one. By sharing
only a reference we could ensure that the content
would not be proliferated and we could securely
erase its contents after use.I don't dislike a bult-in idea. But you haven't show a detailed spec
and I don't think I clearly understand your idea. Can you spend a
few time for writing a spec? (probably rough a few line explanation
is enough)
If I may intrude for a moment… I think the advantage to having a built in Data/Blob library would be that it could be used in all places where a data class is more appropriate than a string. For example, the Socket library currently returns Strings for data read in from a socket. I think a Data class is more appropriate here since the socket itself does not contain encoding information (i.e. either an arbitrary default encoding needs to be set, a heuristic can be used to guess the encoding, or the encoding is set by a previously agreed up convention; but you cannot ask a socket for its encoding).
As for a spec, I think it should be kept relatively simple. The one interesting optimization from NSData that might be useful is the option of copying bytes on instantiation. Copying is the default, but it is also possible to create a Data object that merely points at the storage of another live object and allows byte-wise manipulation. This is particularly interesting for the case of strings, since I would guess that String and Data would have identical storage layout, allowing one to optimize the case of creating a Data from a String with no copying.
A quick attempt at a spec:
Data.new #=> New, dynamically resizable container to store some bytes
Data.new('Test') #=> Can be created from any object that responds to #bytes with an enumerator
Data.new('Hello', copy_bytes: false) #=> Creates the Data from the String by merely pointing to the same storage
Data.open('./foo/test.txt') #=> Create a Data object from a File
Data.open('./bar/test.txt', copy_bytes: false) #=> Same as open above, but manipulates IO#pos for access
Data.write('./baz/test.txt') #=> Writes the bytes to disk.
d = Data.new(a_string)
d[2] #=> Returns the third byte, same as a_string.bytes.to_a[2]
d[2] = 42 #=> Same as a_string.setbyte(2, 42)
d.each #=> Equivalent to a_string.each_byte
d.length #=> Number of bytes currently being stored
d.slice(2, 4) #=> Similar to String#slice
d.slice(2, 4, copy_bytes: false) #=> New data object from slice shares storage with the original
d << other_data #=> Appends bytes from other_data
d.to_s #=> Returns a string using the default internal encoding
d.string_with_encoding('UTF-16') #=> Returns a string using the encoding passed
I know it seems like this class is just wrapping String and always defaulting to byte-wise operations, but it's more fundamental than that. Because there is no encoding on the bytes, there will never be an encoding error when working with them. This could be extremely useful for applications that combine bytes from multiple sources (e.g. Socket data + a file on disk + immediate strings in code) that could potentially have different encodings. By operating on bytes, you can defer the encoding checks until later, if at all.
- Josh
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
- File bytestring001.tar.gz added
If I may intrude for a moment…
No such thing - thank you for your ideas!
I think the advantage to having a built in Data/Blob library would be that it could be used in all places where a data class is more appropriate than a string. For example, the Socket library currently returns Strings for data read in from a socket. I think a Data class is more appropriate here since the socket itself does not contain encoding information (i.e. either an arbitrary default encoding needs to be set, a heuristic can be used to guess the encoding, or the encoding is set by a previously agreed up convention; but you cannot ask a socket for its encoding).
That's a good point. IO would definitely benefit from this
feature, Strings would only be needed when reading line by
line. Encoding would not be a problem anymore when reading
raw bytes.
A quick attempt at a spec:
I picked up Joshua's ideas and implemented a fully functional
ByteString class (simulating the desired behavior in Ruby for
now). I like ByteString better than Data, but that's me, I
would be fine with whatever name the majority likes best.
The class features a reduced String interface like Joshua
proposed already, plus bit-level operations and a
#to_hex/from_hex (I need those all the time).
In addition, there is ByteString::Immutable, which tries to
illustrate the behavior I had in mind for a class that would
allow secure in-memory erasure of its contents. It can only
be created from an IO directly (we can't use a String because
this would already leak the sensitive information).
By default, ByteString copies the contents it is given, but
if we want to include referencing as described by Joshua,
this could be integrated in Immutable, and the security
aspects could be implemented in yet another class, that
borrows most of its functionality from Immutable.
To demonstrate the behavior and as a basis for further
discussion, I included specs for both classes, I find those
to be the easiest way for discussing an interface.
The code can also be found at [1], in case somebody would
like to hack on it and improve it, I would update this
thread in that case.
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
- File deleted (
bytestring001.tar.gz)
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
- File bytestring001.tar.gz bytestring001.tar.gz added
Updated by trans (Thomas Sawyer) over 12 years ago
I'm not so sure using "String" in the name is a good idea.
The reason I suggested Blob is b/c that's what it is often called: http://en.wikipedia.org/wiki/Binary_large_object
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
trans (Thomas Sawyer) wrote:
I'm not so sure using "String" in the name is a good idea.
The reason I suggested Blob is b/c that's what it is often called: http://en.wikipedia.org/wiki/Binary_large_object
Hmm, I guess names are a very subtle topic. I stole from [1],
but I see your point why "String" could be confusing. I don't
really care about the name, I'm fine with anything that's
reasonable enough :)
[1] http://hackage.haskell.org/packages/archive/bytestring/0.9.2.1/doc/html/Data-ByteString.html
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
- File bytestring-002.tar.gz bytestring-002.tar.gz added
Added the possibility to specify an encoding when converting
to a String, forgot about that in the first version.
Updated by duerst (Martin Dürst) over 12 years ago
On 2012/04/30 1:50, Joshua Ballanco wrote:
I know it seems like this class is just wrapping String and always defaulting to byte-wise operations, but it's more fundamental than that. Because there is no encoding on the bytes, there will never be an encoding error when working with them. This could be extremely useful for applications that combine bytes from multiple sources (e.g. Socket data + a file on disk + immediate strings in code) that could potentially have different encodings. By operating on bytes, you can defer the encoding checks until later, if at all.
I'm not saying I'm totally against this, but "extremely useful" could
also mean "too useful". There are clearly cases where one needs to put
things together at the byte level. But there are also quite some cases
that seem to "just work" when using byte-wise operations, at least as
long as nothing else but US-ASCII gets used. Things then blow up
terribly once some other characters get into the mix.
Actually, the binary/ASCII-8bit encoding is very close to a Blob. It was
mostly Akira Tanaka who didn't want to distinguish between "true" binary
and ASCII-8bit, because that would have made the use of regular
expressions with binary impossible or convoluted.
Despite the title of this issue, I didn't see any bitwise operations
(e.g. bitwise and/or/xor/not) proposed. Were you just taking them for
granted? What about adding these to String, maybe limiting them to
binary/ASCII-8bit?
Regards, Martin.
Updated by jballanc (Joshua Ballanco) over 12 years ago
On Thursday, May 3, 2012 at 9:16 AM, "Martin J. Dürst" wrote:
On 2012/04/30 1:50, Joshua Ballanco wrote:
I know it seems like this class is just wrapping String and always defaulting to byte-wise operations, but it's more fundamental than that. Because there is no encoding on the bytes, there will never be an encoding error when working with them. This could be extremely useful for applications that combine bytes from multiple sources (e.g. Socket data + a file on disk + immediate strings in code) that could potentially have different encodings. By operating on bytes, you can defer the encoding checks until later, if at all.
I'm not saying I'm totally against this, but "extremely useful" could
also mean "too useful". There are clearly cases where one needs to put
things together at the byte level. But there are also quite some cases
that seem to "just work" when using byte-wise operations, at least as
long as nothing else but US-ASCII gets used. Things then blow up
terribly once some other characters get into the mix.
So, as an addendum to the spec, what about adding a flag when doing a string conversion:
d.string_with_encoding('UTF-8', reject_if_invalid: true)
So that we could ensure that the return value is always either nil or a string with valid encoding.
Actually, the binary/ASCII-8bit encoding is very close to a Blob. It was
mostly Akira Tanaka who didn't want to distinguish between "true" binary
and ASCII-8bit, because that would have made the use of regular
expressions with binary impossible or convoluted.
My problem with String and ASCII-8BIT/BINARY encoding currently is that you can't just set a string's encoding to binary and forget about encodings. You will still run into issues working with binary data using Ruby 1.9 strings. I demonstrated the issue here: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/40269 (where I, consequently, also made a plea for a Data/Blob type).
Despite the title of this issue, I didn't see any bitwise operations
(e.g. bitwise and/or/xor/not) proposed. Were you just taking them for
granted? What about adding these to String, maybe limiting them to
binary/ASCII-8bit?
I was taking bit-wise operations for granted. Ideally, a Data/Blob type would just represent N groupings of 8 1s and/or 0s, with byte-wise access and bit-wise manipulation. i.e. Less structured than an Array, less restrictive than a String. Just data.
Updated by kernigh (George Koehler) over 12 years ago
=begin
A new BinaryString (or Blob) class would entail several changes. For example, Array#pack and IO#read(n) would need to return BinaryString, not String. I prefer to keep String for binary strings.
The main reason, to perform bitwise operations on a String, is to use this String as an array of bits. The purpose of each bitwise operation is to clear or set some bits in this array. For example, a sieve of Erathosthenes might clear a bit in a String to show that a number is not prime.
String has no bitwise operations. So, the code to clear a bit is long, not simple.
sieve = "\xAC(\x8A\xA2("
25 is not prime. Clear bit 0x02 of byte 3.¶
sieve.setbyte(3, sieve.getbyte(3) & ~0x02)
A simpler way, with current Ruby, is to unpack this String into an Array of Integers. The code can then perform bitwise operations with these Integers.
sieve = "\xAC(\x8A\xA2(".unpack("C*")
25 is not prime. Clear bit 0x02 of byte 3.¶
sieve[3] &= ~0x02
The sieve in [1] uses an Array of 16-bit Fixnums. This works because Fixnum has bitwise operations. A sieve at [2] uses an Array of true and false. This also works, though it uses more memory.
[1] https://github.com/ruby/ruby/blob/trunk/lib/prime.rb
[2] http://rosettacode.org/wiki/Sieve_of_Eratosthenes#Ruby
If String would have bitwise operations, a sieve of Eratosthenes might use them. If String#& and String#~ would exist, then they can clear a bit.
sieve = "\xAC(\x8A\xA2("
sieve[3] &= ~"\x02"
With this design, String#& and String#~ never modify the original string. This design forgot that Strings are mutable. String#& must allocate a new string, before String#[]= modifies the original string. A better design might provide destructive methods that use bitwise operations to modify a slice of the original string.
=end
Updated by jballanc (Joshua Ballanco) over 12 years ago
Just throwing this out there, but a new class could be introduced first, then used as a replacement for Pack/IO/etc. in a later version (to allay breaking change concerns).
I understand the desire to simply work on an array of bits, but I think your sieve example is a little too simplistic. Are you never going to pass this string holding bits to another method? Are you never going to concatenate it with another string? If the answer is yes, then your solution would work. However, if there is ever the chance that someone might call #encode or #<< on your string, then the current Ruby string implementation is insufficient for working with bits.
Updated by nobu (Nobuyoshi Nakada) over 12 years ago
=begin
: kernigh (George Koehler) wrote:
The main reason, to perform bitwise operations on a String, is to use this String as an array of bits. The purpose of each bitwise operation is to clear or set some bits in this array. For example, a sieve of Erathosthenes might clear a bit in a String to show that a number is not prime.
What you want seems (({BitArray})), not a kind of char string.
: kernigh (George Koehler) wrote:
If String would have bitwise operations, a sieve of Eratosthenes might use them. If String#& and String#~ would exist, then they can clear a bit.
sieve = "\xAC(\x8A\xA2("
sieve[3] &= ~"\x02"
Isn't (({BitArray})) clearer?
sieve = BitArray("\xAC(\x8A\xA2(")
sieve[26] = false
or¶
sieve.clear(26)
=end
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
nobu (Nobuyoshi Nakada) wrote:
What you want seems (({BitArray})), not a kind of char string.
Thought the same, sounds much like this[1].
I think having both bit and byte operations in the same class at the same time would become a mess real soon. But apart from the "resolution" (bit, byte, multi-byte) the operations one would wish to perform on the array/vector are pretty much the same everywhere. So I had this idea: what if we made the resolution a parameter on initialization, and this would determine the further behavior? Something like
bits = ByteString.new(2)
bits[7] = 1 #Sets the 8th bit to 1, argument may only be in the range 0..1 (= [0;2^1 - 1])
bits[2] = 1 #Sets the 3rd bit to 1
bytes = ByteString.new(8)
bytes[0] = 0xcc #Sets the first byte to 0xcc, arguments are in the range of 0..255 (=[0;2^8-1])
uint32s = ByteString.new(32)
uint32s[0] = 65536 #Sets the first word (32bit) to 65536, arguments are in the range [0;2^32-1])
...
The initialization parameter would be a power of two that indicates the "width" in bits of a single unit of data that you wish to manipulate.
The behavior of "xor", "and", "or", "not" would follow naturally and would be implied by the underlying "resolution". This allows everyone to have their version of what fits best in their current domain, much like pack/unpack does.
[1] http://docs.oracle.com/javase/6/docs/api/java/util/BitSet.html
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
duerst (Martin Dürst) wrote:
Despite the title of this issue, I didn't see any bitwise operations
(e.g. bitwise and/or/xor/not) proposed. Were you just taking them for
granted? What about adding these to String, maybe limiting them to
binary/ASCII-8bit?
Sorry for the confusion, you are right, I should have actually written "bytewise"
in the title for my initial proposal.
There is a risk of exaggerating things, but on the other hand I think
a feature like this would simplify implementing low-level binary protocols
a lot by reducing the amount of bit shuffling which obviously is a common
source of errors, for example I really like what Erlang allows you to do [1].
I'm not against adding the functionality to String, it could just be that
a separate class might help to separate concerns more clearly. Adding more
methods to String would would mean more work when duck-typing a String etc.
[1] http://www.erlang.org/documentation/doc-5.6/doc/programming_examples/bit_syntax.html
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
kernigh (George Koehler) wrote:
=begin
With this design, String#& and String#~ never modify the original string. This design forgot that Strings are mutable. String#& must allocate a new string, before String#[]= modifies the original string. A better design might provide destructive methods that use bitwise operations to modify a slice of the original string.
=end
Good point, I like the idea of having an alternative destructive version, should boost performance a lot.
I think you actually need both because the "secure" version, ByteString::Immutable, needs to be immutable.
But in the general case both versions have their use. I'll add it to the proposal.
Updated by jballanc (Joshua Ballanco) over 12 years ago
- File 6361-proposal.pdf 6361-proposal.pdf added
Second feature request. I picked this issue for this feature request because it contains the most recent discussion on this topic. Please let me know if a new, more specific, feature request would be more appropriate.
Updated by mame (Yusuke Endoh) over 12 years ago
Your slide is received. Thank you!
A new class, not bitwise string operations, is now being discussed in this thread?
(I don't read the long discussion ;-)
If so, it may be good to get a fresh start by opening another ticket.
Anyway, your slide (and proposal) looks attractive and will be presented at the 7/21 meeting. Thank you again!
--
Yusuke Endoh mame@tsg.ne.jp
Updated by mame (Yusuke Endoh) over 12 years ago
- Status changed from Feedback to Rejected
Martin Bosslet and Joshua Ballanco,
Sorry but this proposal was rejected at the developer meeting (7/21).
Matz explicitly said he will not accept another class than String.
He hates Python way.
Two additional notes:
-
The original request (bitwise string operation) was not considered
because it is not essential for ByteString; if you want, please file
a new ticket for the specific feature with the detailed spec including
corner cases (e.g., [ruby-core:44646]). -
Matz wants to save the motivating example in the slide (below)
in other approach than the proposal:"\xff".force_encoding("BINARY") << "\xff"
#=> incompatible character encodings:
ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)We discussed three approaches:
- adding String#b that returns BINARY string:
"\xff".force_encoding("BINARY") << "\xff".b
- adding a new `percent' syntax for binary string:
"\xff".force_encoding("BINARY") << %b{\xff}
- suppressing the exception in some way
Some people (including Matz) liked (3), but akr showed strong objection
because it causes breaking a character (Sorry I didn't understand his
opinion precisely). (2) is difficult because of syntax extension.
So we are now keen on (1).
This is discussed in #6767 (Sorry, in Japanese!)
--
Yusuke Endoh mame@tsg.ne.jp
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
mame (Yusuke Endoh) wrote:
Martin Bosslet and Joshua Ballanco,
Sorry but this proposal was rejected at the developer meeting (7/21).
Matz explicitly said he will not accept another class than String.
He hates Python way.
No problem! It's very specific and I understand well that it certainly
adds complexity that is not needed often enough. I still believe that
having something like ByteString would boost low-level IO performance a
lot - I'll take the gem route then, I have some ideas there. Maybe I'll
get the chance to discuss some of them with Matz at LSRC next week :)
Two additional notes:
- The original request (bitwise string operation) was not considered
because it is not essential for ByteString; if you want, please file
a new ticket for the specific feature with the detailed spec including
corner cases (e.g., [ruby-core:44646]).
OK, I'll open another issue for this.