Project

General

Profile

Actions

Feature #22082

open

Introduce Bit Operations into String

Feature #22082: Introduce Bit Operations into String
1

Added by hasumikin (hitoshi hasumi) 12 days ago. Updated 7 days ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-dev:<unknown>]

Description

Relevant tickets

  • Introduce #bit_count method on Integer --- #20163

Abstract

Ruby's String is already a byte sequence, but it lacks high-level bit operations.
As a result, packed binary data must be handled with manual byte arithmetic.

For example, checking set-bit at offset 10 currently requires calculating the byte position and the bit's position within that byte:

data = "\xAA\xAA\xAA\xAA"

# The "classic" way
byte_offset = 10 / 8
byte = data.getbyte(byte_offset)
bit_offset = 10 % 8
((byte >> bit_offset) & 1) == 1 #=> false

# The "concise" way using Integer#[]
data.getbyte(10 / 8)[10 % 8] == 1 #=> false

The concise form is a single line, but it still leaks implementation details into every call site: the caller writes 10 / 8 and 10 % 8 by hand (a recurring source of off-by-one errors at byte boundaries), and the Integer result must be compared against 1 to be used as a boolean. With a bit-addressed API, data.bit_at(10) takes a bit position and returns true/false directly. The cost compounds for iteration, run-length scanning, or splicing, where each operation otherwise needs its own byte/bit arithmetic.

I propose native bit-level APIs, treating the String class as a first-class bit sequence:

data = "\xAA\xAA\xAA\xAA"
data.bit_at(10) #=> false

By providing a flat bit-addressing model that handles the underlying bit-to-byte mapping, we allow developers to focus on the logical layout of their data (e.g., an Apache Arrow bitmap or a pixel buffer), making the code more readable and less error-prone.

This proposal presents a family of methods for bit-oriented use of String, organized into groups below.
The immediate goal is agreement on the overall direction and feedback on which subset should be pursued first.

Presenting the full menu matters because some design questions only become clear at that level:

  • bit numbering keyword (lsb_first:) and its consistent application across methods
  • naming symmetry such as bits / each_bit
  • behavior for out-of-range bit indices

Implementation (Prototype)

You can try an actual working implementation with gem install string_bits.
The source code can be seen in https://github.com/hasumikin/string_bits

The reason I want to include this feature in the Ruby core rather than a third-party gem is that I want this API to be common across Ruby implementations such as mruby and Spinel.

Proposed Methods

Full prototype and documentation:
https://github.com/hasumikin/string_bits/blob/0.2.0/docs/ProposedMethods.md

Read

  • bit_at(bit_offset, lsb_first: true) -> true | false | nil -- read a single bit
  • bit_count -> Integer -- count of set-bits (popcount)
    bit_count(bit_offset, bit_length, lsb_first: true) -> Integer
    bit_count(bit_range, lsb_first: true) -> Integer
  • bit_run_count(bit, bit_offset, lsb_first: true) -> Integer | nil -- length of the run of bit starting at bit_offset

Iterator

  • each_bit(start_offset=0, lsb_first: true) { |bool| ... } -> self -- yield each bit as true/false
    each_bit(start_offset=0, lsb_first: true) -> Enumerator
  • bits(start_offset=0, lsb_first: true) -> Array -- Array form of each_bit
    bits(start_offset=0, lsb_first: true) { |bool| ... } -> self
  • each_bit_run(start_offset=0, lsb_first: true) { |bool, offset, len| } -> self -- yield (bool, offset, run_length) pairs
    each_bit_run(start_offset=0, lsb_first: true) -> Enumerator
  • bit_runs(start_offset=0, lsb_first: true) -> Array -- Array form of each_bit_run
    bit_runs(start_offset=0, lsb_first: true) { |bool, len| } -> self
  • each_bit_offset(bit, start_offset=0, lsb_first: true) { |offset| ... } -> self -- yield position of each bit equal to bit
    each_bit_offset(bit, start_offset=0, lsb_first: true) -> Enumerator
  • bit_offsets(bit, start_offset=0, lsb_first: true) -> Array -- Array form of each_bit_offset
    bit_offsets(bit, start_offset=0, lsb_first: true) { |offset| ... } -> self

Mutation

  • bit_set(bit_offset, bit_length=1, lsb_first: true) -> self -- set one bit or a logical bit bit_range to 1
    bit_set(bit_range, lsb_first: true) -> self
  • bit_clear(bit_offset, bit_length=1, lsb_first: true) -> self -- set one bit or a logical bit bit_range to 0
    bit_clear(bit_range, lsb_first: true) -> self
  • bit_flip(bit_offset, bit_length=1, lsb_first: true) -> self -- toggle one bit or a logical bit bit_range
    bit_flip(bit_range, lsb_first: true) -> self
  • bit_splice(bit_offset, bit_length, str, str_bit_offset=0, lsb_first: true) -> self -- write a sub-sequence of bits in place (bit-granularity bytesplice)
    bit_splice(bit_range, str, str_bit_offset=0, lsb_first: true) -> self

Slice

  • bit_slice(bit_offset, bit_length, lsb_first: true) -> String | nil -- extract a sub-sequence of bits (bit-granularity byteslice)
    bit_slice(bit_range, lsb_first: true) -> String | nil

Bitwise

  • bitwise_not -> String / bitwise_not! -> self -- invert every bit
  • bitwise_and(other) -> String / bitwise_and!(other) -> self -- bitwise AND
  • bitwise_or(other) -> String / bitwise_or!(other) -> self -- bitwise OR
  • bitwise_xor(other) -> String / bitwise_xor!(other) -> self -- bitwise XOR

Performance

This is not only about convenience.
In a prototype implementation (string_bits gem), bulk operations such as bitwise_and, bitwise_or, and bit_count are also substantially faster than Ruby-level loops over bytes (see the Benchmark link below).
I do not think performance alone is the reason to add the feature, but it is a practical benefit.

Notes

Benchmarks, discussion, and prior art:


Related issues 1 (1 open0 closed)

Related to Ruby - Feature #20163: Introduce #bit_count method on IntegerOpenActions

Updated by kou (Kouhei Sutou) 8 days ago 1Actions #1

I want this feature for Apache Arrow. Apache Arrow uses bitmap https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps for null. I'm maintaining the official pure Ruby Apache Arrow library ( https://github.com/apache/arrow/tree/main/ruby/red-arrow-format ). If (fast implementation of) this feature is provided by Ruby, I don't need to re-implement this feature.

bit_set(bit_offset, bit_length=1, lsb_first: true) -> self -- set one bit or a logical bit bit_range to 1

What does "logical bit" mean here?

Updated by hasumikin (hitoshi hasumi) 7 days ago Actions #2

What does "logical bit" mean here?

The "logical bit" refers to the Bit Position Numbering determined by the lsb_first: keyword.
Please see this page:
https://github.com/hasumikin/string_bits/blob/0.2.0/docs/Discussion.md#bit-position-numbering-of-the-string-bit-api

(Perhaps I shouldn't have written it as if it had any special meaning because lsb_first: is a concept that spans this entire group of methods)

Updated by kou (Kouhei Sutou) 7 days ago Actions #3

hasumikin (hitoshi hasumi) wrote in #note-2:

(Perhaps I shouldn't have written it as if it had any special meaning because lsb_first: is a concept that spans this entire group of methods)

+1 Could you update the description?

Updated by byroot (Jean Boussier) 7 days ago ยท Edited Actions #4

Just a couple comments

lsb_first: true

No strong opinion on whether LSB or MSB should be the default, but in my opinion the default value of the keyword argument should be false, because otherwise if I want MSB first I have to write str.bit_at(42, lsb_first: false), which is some kind of double negation.

Would be way more readable to write str.bit_at(42, msb_first: true) (or str.bit_at(42, lsb_last: true) or str.bit_at(42, msb: true)).

Integer?

While adding these to String make a lot of sense (arbitrary size and mutability), most of the time when I use bit operation, it's because I'm working on a very performance sensitive piece of code, and I'd tend to try to favor immediate types like integer.

So I wonder if at least some of these methods should also be available on Integer.

Updated by hasumikin (hitoshi hasumi) 7 days ago Actions #5

  • Description updated (diff)

kou (Kouhei Sutou) wrote in #note-3:

+1 Could you update the description?

I updated it with <del>...</del>


byroot (Jean Boussier) wrote in #note-4:

lsb_first: true

Thank you. That'd be one of the next points if Matz approved the overall direction. I'm also OK with msb_first: false as the default. I feel solely msb: and lsb: are a little weak though

Integer?

+1 I agree

Updated by Eregon (Benoit Daloze) 7 days ago Actions #6

  • Related to Feature #20163: Introduce #bit_count method on Integer added

Updated by Eregon (Benoit Daloze) 7 days ago Actions #7

Strangely this ticket was not sent to the ruby-core mailing list, and replies on it neither.

Updated by Eregon (Benoit Daloze) 7 days ago Actions #8

This seems like a lot of methods to add to String, which I think already has (too) many methods.
I think a new BitSet class or so would be much cleaner, and remove the need for all these bit_ prefixes.
It would also enable to e.g. set the lsb_first as a kwarg of initialize and nowhere else, the current design seems quite messy passing that everywhere.

I wonder if we really need lsb_first, Java's BitSet doesn't have it.
Maybe this would also becomes much less relevant with a BitSet class and only matter e.g. when converting to String?
Then it could be a kwarg only for that conversion.
But maybe there isn't even a use case to convert a BitSet to a String and supporting Marshal would be enough?

Updated by Eregon (Benoit Daloze) 7 days ago Actions #9

Regarding bitwise operations, that's already available on IO::Buffer, so maybe this is redundant?

Updated by shugo (Shugo Maeda) 7 days ago Actions #10

Eregon (Benoit Daloze) wrote in #note-7:

Strangely this ticket was not sent to the ruby-core mailing list, and replies on it neither.

It's because the ticket's preferred language was configured as "ruby-dev in Japanese".

Actions

Also available in: PDF Atom