Feature #21943: Add StringScanner#get_int to extract capture group as Integer without intermediate String

Added by jinroq (Jinroq SAITOH) 3 months ago. Updated 14 days ago.

Status: Closed
Assignee: -
Target version: -
[ruby-core:124928]

Description

Motivation

The date library is being rewritten from C to pure Ruby. During this effort, Date._strptime was identified as a major performance bottleneck. Profiling revealed that the root cause is the overhead of extracting capture groups as Strings and then converting them to Integers:

sc.scan(/(\d{4})-(\d{2})-(\d{2})/)
year = sc[1].to_i   # allocates String "2024", converts to Integer, discards String
mon  = sc[2].to_i   # allocates String "06",   converts to Integer, discards String
mday = sc[3].to_i   # allocates String "15",   converts to Integer, discards String

Each sc[n].to_i call allocates a temporary String object that is immediately discarded. When parsing dates, only the integer values are needed — the intermediate Strings serve no purpose.
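The allocation cost described above can be observed directly. The following is a minimal sketch, assuming that the `GC.stat(:total_allocated_objects)` counter is an adequate proxy for temporary objects; the `allocations` helper is hypothetical and not part of strscan or date.

```ruby
require "strscan"

# Hypothetical helper: returns how many objects were allocated while the
# block ran, using the interpreter's cumulative allocation counter.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

sc = StringScanner.new("2024-06-15")
sc.scan(/(\d{4})-(\d{2})-(\d{2})/)

count = allocations do
  sc[1].to_i # each sc[n] call builds a new String for the capture
  sc[2].to_i
  sc[3].to_i
end

puts "objects allocated by three sc[n].to_i calls: #{count}"
```

The exact count may vary between Ruby versions, but each sc[n] call is expected to contribute at least one String.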

In the C implementation of date, matched byte ranges are converted directly to integers without any String allocation. The pure Ruby version cannot do this with the current StringScanner API.

Proposal

Add StringScanner#get_int(index) that returns the captured substring at the given index as an Integer, converting directly from the matched byte range at the C level without allocating an intermediate String object.

scanner = StringScanner.new("2024-06-15")
scanner.scan(/(\d{4})-(\d{2})-(\d{2})/)
scanner.get_int(1)  # => 2024
scanner.get_int(2)  # => 6
scanner.get_int(3)  # => 15

It returns nil in the same cases where scanner[index] would return nil (no match, index out of range, optional group did not participate).
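The intended return values can be modelled in plain Ruby. This is only a reference sketch of the semantics, assuming they match `scanner[index]&.to_i`; the helper name `capture_as_int` is hypothetical, and unlike the proposed method it still allocates the intermediate String.

```ruby
require "strscan"

# Hypothetical reference helper: same results as the proposed method is
# described to return, but without the allocation savings.
def capture_as_int(scanner, index)
  scanner[index]&.to_i
end

sc = StringScanner.new("2024-06-15")
sc.scan(/(\d{4})-(\d{2})-(\d{2})/)
year         = capture_as_int(sc, 1) # 2024
mon          = capture_as_int(sc, 2) # 6 (leading zero is dropped)
out_of_range = capture_as_int(sc, 4) # nil: there is no fourth group

opt = StringScanner.new("12")
opt.scan(/(\d+)(?:-(\d+))?/)
skipped = capture_as_int(opt, 2)     # nil: optional group did not participate

miss = StringScanner.new("abc")
miss.scan(/\d+/)
no_match = capture_as_int(miss, 0)   # nil: the last scan did not match

p [year, mon, out_of_range, skipped, no_match]
```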

Use case

The primary use case is Date._strptime in the pure Ruby date library. The fast path for %Y-%m-%d format currently does:

# Current: 3 temporary String allocations
sc.scan(/(\d{4})-(\d{2})-(\d{2})/)
year = sc[1].to_i
mon  = sc[2].to_i
mday = sc[3].to_i

With get_int:

# Proposed: 0 temporary String allocations
sc.scan(/(\d{4})-(\d{2})-(\d{2})/)
year = sc.get_int(1)
mon  = sc.get_int(2)
mday = sc.get_int(3)

This pattern appears throughout _strptime for every date/time component (%H, %M, %S, %m, %d, etc.), so the cumulative impact is significant.

Benchmark

Environment: Ruby 4.0.1, x86_64-linux

Operation        i/s            per iteration   Comparison
sc.get_int(n)    1,029,041.7    971.78 ns/i     (Reference)
sc[n].to_i       791,945.6      1.26 μs/i       1.30x slower

get_int is 1.30x faster than sc[n].to_i for a typical date parsing scenario (3 capture groups). The improvement comes from eliminating 3 temporary String allocations per call.
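A comparison of this kind could be reproduced roughly as follows. This is a sketch under stated assumptions, not the script behind the numbers above: the `time_per_iteration` helper and the iteration count are hypothetical, and the direct-conversion branch uses integer_at (the name the feature was eventually committed under) only when the installed strscan provides it.

```ruby
require "strscan"

INPUT      = "2024-06-15"
PATTERN    = /(\d{4})-(\d{2})-(\d{2})/
ITERATIONS = 100_000 # arbitrary; increase for more stable numbers

# Hypothetical helper: average wall-clock time of the block in nanoseconds.
def time_per_iteration(iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  elapsed / iterations * 1e9
end

sc = StringScanner.new(INPUT)

baseline_ns = time_per_iteration(ITERATIONS) do
  sc.reset
  sc.scan(PATTERN)
  sc[1].to_i
  sc[2].to_i
  sc[3].to_i
end
puts format("sc[n].to_i: %.1f ns/i", baseline_ns)

# Skipped on strscan versions that do not define integer_at.
if sc.respond_to?(:integer_at)
  direct_ns = time_per_iteration(ITERATIONS) do
    sc.reset
    sc.scan(PATTERN)
    sc.integer_at(1)
    sc.integer_at(2)
    sc.integer_at(3)
  end
  puts format("integer_at: %.1f ns/i", direct_ns)
end
```

Both timings include the reset and scan calls, so the difference between them reflects only the capture extraction.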

In the context of Date._strptime("%Y-%m-%d"), this overhead is a significant portion of the total parse time, as shown in earlier profiling:

Operation                              Time
C ext _strptime (reference)            408 ns
SC.new + scan + captures + .to_i x3    1,210 ns
Pure Ruby _strptime_ymd total          1,290 ns

The capture extraction + .to_i conversion accounts for roughly 40% of the total parse time. get_int directly reduces this portion.

Implementation

A working implementation is available. It reuses the same index resolution logic as StringScanner#[] (including negative indices) but calls rb_cstr2inum on the matched byte range instead of extract_range, avoiding String object allocation entirely.
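The idea of converting a matched byte range without materializing a substring can be illustrated in Ruby. This is not the C implementation referred to above; `byte_range_to_i` is a hypothetical helper that handles unsigned decimal digits only, and the MatchData used to obtain the offsets is itself an allocation that the C code would not need.

```ruby
# Hypothetical helper: reads decimal digits from the byte range [from, to)
# of str and accumulates them into an Integer, never creating a substring.
def byte_range_to_i(str, from, to)
  value = 0
  pos = from
  while pos < to
    byte = str.getbyte(pos)
    # Stop at the first non-digit byte, as String#to_i does.
    break unless byte&.between?(48, 57) # "0".ord..."9".ord
    value = value * 10 + (byte - 48)
    pos += 1
  end
  value
end

input = "2024-06-15"
m = /(\d{4})-(\d{2})-(\d{2})/.match(input)

# MatchData#begin and #end return character offsets; they coincide with
# byte offsets here only because the input is ASCII.
year = byte_range_to_i(input, m.begin(1), m.end(1))
mon  = byte_range_to_i(input, m.begin(2), m.end(2))
mday = byte_range_to_i(input, m.begin(3), m.end(3))

p [year, mon, mday] # => [2024, 6, 15]
```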


Related issues: 1 (0 open, 1 closed)

Related to Ruby - Feature #21932: `MatchData#get_int` (Closed)

Updated by Eregon (Benoit Daloze) 3 months ago #1

Updated by Eregon (Benoit Daloze) 3 months ago · Edited #2 [ruby-core:124940]

jinroq (Jinroq SAITOH) wrote:

> In the context of Date._strptime("%Y-%m-%d"), this overhead is a significant portion of the total parse time, as shown in earlier profiling:
>
> Operation                              Time
> C ext _strptime (reference)            408 ns
> SC.new + scan + captures + .to_i x3    1,210 ns
> Pure Ruby _strptime_ymd total          1,290 ns
>
> The capture extraction + .to_i conversion accounts for roughly 40% of the total parse time. get_int directly reduces this portion.

This part is not clear to me. In particular, what does that 40% refer to?

What I would expect is a measurement of the pure-Ruby strptime with and without StringScanner#get_int.
Then we could see how much it helps for the Date use case.

Updated by Anonymous 14 days ago #3

  • Status changed from Open to Closed

Applied in changeset git|e73e4f2d4cbbd741649572b840e3a9816c31bb17.


[ruby/strscan] [Feature #21943] Add StringScanner#integer_at
(https://github.com/ruby/strscan/pull/205)

See also: https://bugs.ruby-lang.org/issues/21943

This is semantically equivalent to scanner[specifier]&.to_i(base) but
this is faster than scanner[specifier]&.to_i(base) because
integer_at doesn't create a temporary String when possible.

This PR also includes a benchmark for them:

$ ruby -v -S benchmark-driver benchmark/integer_at.yaml
ruby 4.1.0dev (2026-05-01T19:25:51Z master https://github.com/ruby/strscan/commit/f2845eab29) +PRISM [x86_64-linux]
Warming up --------------------------------------
             [].to_i    24.272M i/s -     25.109M times in 1.034481s (41.20ns/i, 32clocks/i)
          integer_at    61.188M i/s -     62.491M times in 1.021289s (16.34ns/i, 62clocks/i)
Calculating -------------------------------------
             [].to_i    26.831M i/s -     72.816M times in 2.713883s (37.27ns/i, 169clocks/i)
          integer_at    81.331M i/s -    183.564M times in 2.256998s (12.30ns/i, 43clocks/i)

Comparison:
          integer_at:  81331225.5 i/s
             [].to_i:  26831046.3 i/s - 3.03x  slower

In this environment, integer_at is 3.03x faster than [].to_i.

https://github.com/ruby/strscan/commit/8a60879b2d

Co-authored-by: jinroq
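
The equivalence stated in the commit message suggests a simple fallback for code that must also run on strscan versions without integer_at. The sketch below assumes that integer_at accepts the capture specifier as its first argument, which the commit message implies but does not spell out; the helper name `capture_to_i` is hypothetical.

```ruby
require "strscan"

# Hypothetical compatibility helper: uses StringScanner#integer_at when the
# installed strscan defines it, and otherwise falls back to the expression
# the commit message describes as semantically equivalent.
def capture_to_i(scanner, specifier)
  if scanner.respond_to?(:integer_at)
    scanner.integer_at(specifier) # assumed call shape; base left at its default
  else
    scanner[specifier]&.to_i
  end
end

sc = StringScanner.new("2024-06-15")
sc.scan(/(\d{4})-(\d{2})-(\d{2})/)

year = capture_to_i(sc, 1)
mday = capture_to_i(sc, 3)
p [year, mday] # => [2024, 15]
```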
