Feature #22082
openIntroduce Bit Operations into String
Description
Relevant tickets¶
- Introduce #bit_count method on Integer --- #20163
Abstract¶
Ruby's String is already a byte sequence, but it lacks high-level bit operations.
As a result, packed binary data must be handled with manual byte arithmetic.
For example, checking set-bit at offset 10 currently requires calculating the byte position and the bit's position within that byte:
data = "\xAA\xAA\xAA\xAA"
# The "classic" way
byte_offset = 10 / 8
byte = data.getbyte(byte_offset)
bit_offset = 10 % 8
((byte >> bit_offset) & 1) == 1 #=> false
# The "concise" way using Integer#[]
data.getbyte(10 / 8)[10 % 8] == 1 #=> false
The concise form is a single line, but it still leaks implementation details into every call site: the caller writes 10 / 8 and 10 % 8 by hand (a recurring source of off-by-one errors at byte boundaries), and the Integer result must be compared against 1 to be used as a boolean. With a bit-addressed API, data.bit_at(10) takes a bit position and returns true/false directly. The cost compounds for iteration, run-length scanning, or splicing, where each operation otherwise needs its own byte/bit arithmetic.
I propose native bit-level APIs, treating the String class as a first-class bit sequence:
data = "\xAA\xAA\xAA\xAA"
data.bit_at(10) #=> false
By providing a flat bit-addressing model that handles the underlying bit-to-byte mapping, we allow developers to focus on the logical layout of their data (e.g., an Apache Arrow bitmap or a pixel buffer), making the code more readable and less error-prone.
This proposal presents a family of methods for bit-oriented use of String, organized into groups below.
The immediate goal is agreement on the overall direction and feedback on which subset should be pursued first.
Presenting the full menu matters because some design questions only become clear at that level:
- bit numbering keyword (
lsb_first:) and its consistent application across methods - naming symmetry such as
bits/each_bit - behavior for out-of-range bit indices
Implementation (Prototype)¶
You can try an actual working implementation with gem install string_bits.
The source code can be seen in https://github.com/hasumikin/string_bits
The reason I want to include this feature in the Ruby core rather than a third-party gem is that I want this API to be common across Ruby implementations such as mruby and Spinel.
Proposed Methods¶
Full prototype and documentation:
https://github.com/hasumikin/string_bits/blob/0.2.0/docs/ProposedMethods.md
Read
-
bit_at(bit_offset, lsb_first: true) -> true | false | nil-- read a single bit -
bit_count -> Integer-- count of set-bits (popcount)
bit_count(bit_offset, bit_length, lsb_first: true) -> Integer
bit_count(bit_range, lsb_first: true) -> Integer -
bit_run_count(bit, bit_offset, lsb_first: true) -> Integer | nil-- length of the run ofbitstarting at bit_offset
Iterator
-
each_bit(start_offset=0, lsb_first: true) { |bool| ... } -> self-- yield each bit as true/false
each_bit(start_offset=0, lsb_first: true) -> Enumerator -
bits(start_offset=0, lsb_first: true) -> Array-- Array form ofeach_bit
bits(start_offset=0, lsb_first: true) { |bool| ... } -> self -
each_bit_run(start_offset=0, lsb_first: true) { |bool, offset, len| } -> self-- yield(bool, offset, run_length)pairs
each_bit_run(start_offset=0, lsb_first: true) -> Enumerator -
bit_runs(start_offset=0, lsb_first: true) -> Array-- Array form ofeach_bit_run
bit_runs(start_offset=0, lsb_first: true) { |bool, len| } -> self -
each_bit_offset(bit, start_offset=0, lsb_first: true) { |offset| ... } -> self-- yield position of each bit equal tobit
each_bit_offset(bit, start_offset=0, lsb_first: true) -> Enumerator -
bit_offsets(bit, start_offset=0, lsb_first: true) -> Array-- Array form ofeach_bit_offset
bit_offsets(bit, start_offset=0, lsb_first: true) { |offset| ... } -> self
Mutation
-
bit_set(bit_offset, bit_length=1, lsb_first: true) -> self-- set one bit or alogical bitbit_range to 1
bit_set(bit_range, lsb_first: true) -> self -
bit_clear(bit_offset, bit_length=1, lsb_first: true) -> self-- set one bit or alogical bitbit_range to 0
bit_clear(bit_range, lsb_first: true) -> self -
bit_flip(bit_offset, bit_length=1, lsb_first: true) -> self-- toggle one bit or alogical bitbit_range
bit_flip(bit_range, lsb_first: true) -> self -
bit_splice(bit_offset, bit_length, str, str_bit_offset=0, lsb_first: true) -> self-- write a sub-sequence of bits in place (bit-granularitybytesplice)
bit_splice(bit_range, str, str_bit_offset=0, lsb_first: true) -> self
Slice
-
bit_slice(bit_offset, bit_length, lsb_first: true) -> String | nil-- extract a sub-sequence of bits (bit-granularitybyteslice)
bit_slice(bit_range, lsb_first: true) -> String | nil
Bitwise
-
bitwise_not -> String/bitwise_not! -> self-- invert every bit -
bitwise_and(other) -> String/bitwise_and!(other) -> self-- bitwise AND -
bitwise_or(other) -> String/bitwise_or!(other) -> self-- bitwise OR -
bitwise_xor(other) -> String/bitwise_xor!(other) -> self-- bitwise XOR
Performance¶
This is not only about convenience.
In a prototype implementation (string_bits gem), bulk operations such as bitwise_and, bitwise_or, and bit_count are also substantially faster than Ruby-level loops over bytes (see the Benchmark link below).
I do not think performance alone is the reason to add the feature, but it is a practical benefit.
Notes¶
Benchmarks, discussion, and prior art:
- Proposed methods (with use cases): https://github.com/hasumikin/string_bits/blob/0.2.0/docs/ProposedMethods.md
- Benchmark: https://github.com/hasumikin/string_bits/blob/0.2.0/docs/Benchmark.md
- Discussion: https://github.com/hasumikin/string_bits/blob/0.2.0/docs/Discussion.md
- Why extend
Stringrather than introduce a new class? - Naming convention: symmetry with
bytes/each_byte - Bit Position Numbering of the String bit API
- Why
lsb_first: trueis the default? - Bit ordering across domains
- Apache Arrow Compatibility
- Error behavior for out-of-range bit indices
- Why extend
- Prior art: https://github.com/hasumikin/string_bits/blob/0.2.0/docs/PriorArt.md
Updated by kou (Kouhei Sutou) 8 days ago
I want this feature for Apache Arrow. Apache Arrow uses bitmap https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps for null. I'm maintaining the official pure Ruby Apache Arrow library ( https://github.com/apache/arrow/tree/main/ruby/red-arrow-format ). If (fast implementation of) this feature is provided by Ruby, I don't need to re-implement this feature.
bit_set(bit_offset, bit_length=1, lsb_first: true) -> self-- set one bit or a logical bit bit_range to 1
What does "logical bit" mean here?
Updated by hasumikin (hitoshi hasumi) 7 days ago
What does "logical bit" mean here?
The "logical bit" refers to the Bit Position Numbering determined by the lsb_first: keyword.
Please see this page:
https://github.com/hasumikin/string_bits/blob/0.2.0/docs/Discussion.md#bit-position-numbering-of-the-string-bit-api
(Perhaps I shouldn't have written it as if it had any special meaning because lsb_first: is a concept that spans this entire group of methods)
Updated by kou (Kouhei Sutou) 7 days ago
hasumikin (hitoshi hasumi) wrote in #note-2:
(Perhaps I shouldn't have written it as if it had any special meaning because
lsb_first:is a concept that spans this entire group of methods)
+1 Could you update the description?
Updated by byroot (Jean Boussier) 7 days ago
ยท Edited
Just a couple comments
lsb_first: true¶
No strong opinion on whether LSB or MSB should be the default, but in my opinion the default value of the keyword argument should be false, because otherwise if I want MSB first I have to write str.bit_at(42, lsb_first: false), which is some kind of double negation.
Would be way more readable to write str.bit_at(42, msb_first: true) (or str.bit_at(42, lsb_last: true) or str.bit_at(42, msb: true)).
Integer?¶
While adding these to String make a lot of sense (arbitrary size and mutability), most of the time when I use bit operation, it's because I'm working on a very performance sensitive piece of code, and I'd tend to try to favor immediate types like integer.
So I wonder if at least some of these methods should also be available on Integer.
Updated by hasumikin (hitoshi hasumi) 7 days ago
- Description updated (diff)
kou (Kouhei Sutou) wrote in #note-3:
+1 Could you update the description?
I updated it with <del>...</del>
byroot (Jean Boussier) wrote in #note-4:
lsb_first: true¶
Thank you. That'd be one of the next points if Matz approved the overall direction. I'm also OK with msb_first: false as the default. I feel solely msb: and lsb: are a little weak though
Integer?¶
+1 I agree
Updated by Eregon (Benoit Daloze) 7 days ago
- Related to Feature #20163: Introduce #bit_count method on Integer added
Updated by Eregon (Benoit Daloze) 7 days ago
Strangely this ticket was not sent to the ruby-core mailing list, and replies on it neither.
Updated by Eregon (Benoit Daloze) 7 days ago
This seems like a lot of methods to add to String, which I think already has (too) many methods.
I think a new BitSet class or so would be much cleaner, and remove the need for all these bit_ prefixes.
It would also enable to e.g. set the lsb_first as a kwarg of initialize and nowhere else, the current design seems quite messy passing that everywhere.
I wonder if we really need lsb_first, Java's BitSet doesn't have it.
Maybe this would also becomes much less relevant with a BitSet class and only matter e.g. when converting to String?
Then it could be a kwarg only for that conversion.
But maybe there isn't even a use case to convert a BitSet to a String and supporting Marshal would be enough?
Updated by Eregon (Benoit Daloze) 7 days ago
Regarding bitwise operations, that's already available on IO::Buffer, so maybe this is redundant?
Updated by shugo (Shugo Maeda) 7 days ago
Eregon (Benoit Daloze) wrote in #note-7:
Strangely this ticket was not sent to the ruby-core mailing list, and replies on it neither.
It's because the ticket's preferred language was configured as "ruby-dev in Japanese".