Feature #19315: Lazy substrings in CRuby - Ruby - Ruby Issue Tracking System


Lazy substrings in CRuby

Added by Eregon (Benoit Daloze) over 2 years ago. Updated about 2 years ago.

Status: Open

[ruby-core:111678]

Description

CRuby should implement lazy substrings, i.e., "abcdef"[1..3] must not copy bytes.

Currently CRuby only reuses the char* if the substring extends to the end of the buffer.
But it should also work wherever the substring starts and ends.
Yes, it means RSTRING_PTR() might need to allocate in order to \0-terminate; so be it, it's worth it.
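
To observe the behaviour described above on a given build, a minimal sketch along these lines may help. It uses `ObjectSpace.memsize_of` from the standard `objspace` library; the helper name `substring_memsizes` is made up for illustration, and the reported numbers depend on the Ruby version and implementation, so none are shown here.

```ruby
require "objspace"

# Hypothetical helper: builds one large string and compares the memory
# size reported for a substring that reaches the end of the buffer with
# one that stops short of it.
def substring_memsizes(size = 1_000_000)
  source = "x" * size
  half   = size / 2
  tail   = source[half..]   # extends to the end of the buffer
  middle = source[1, half]  # ends before the end of the buffer
  {
    tail:   ObjectSpace.memsize_of(tail),
    middle: ObjectSpace.memsize_of(middle),
  }
end

substring_memsizes.each { |name, bytes| puts "#{name}: #{bytes} bytes" }
```

If the middle substring reports roughly its own length in bytes while the tail substring reports only a small fixed size, the middle one was copied and the tail one shares the original buffer.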

There is already code for this (SHARABLE_MIDDLE_SUBSTRING), but it's disabled by default and RSTRING_PTR() needs to be changed to deal with this.
It seems a good idea to introduce a variant of RSTRING_PTR which doesn't guarantee \0-termination, so such callers can then use the existing bytes always without copy.

There are countless workarounds for this missing optimization; none of them would be worth it with lazy substrings, and all of them are less readable:

https://bugs.ruby-lang.org/issues/19314
https://bugs.ruby-lang.org/issues/18598#note-3
https://github.com/ruby/net-protocol/pull/14
Manual lazy substrings which track string + index + length
There are more, but I don't remember them all right now; feel free to comment or link more URLs/tickets.
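
For readers unfamiliar with the "string + index + length" workaround mentioned in the list above, here is a minimal sketch of what such a manual lazy substring can look like. The class name `SubstringView` and its methods are hypothetical and are not taken from any of the linked tickets or pull requests.

```ruby
# Hypothetical illustration only: a manual "lazy substring" that records
# the source string plus a byte offset and length, and copies bytes only
# when a real String is requested.
class SubstringView
  attr_reader :offset, :length

  def initialize(source, offset, length)
    if offset < 0 || length < 0 || offset + length > source.bytesize
      raise ArgumentError, "range out of bounds"
    end
    @source = source
    @offset = offset
    @length = length
  end

  # Narrow the view without touching the underlying bytes.
  def slice(start, len)
    raise ArgumentError, "start out of bounds" if start < 0 || start > @length
    SubstringView.new(@source, @offset + start, [len, @length - start].min)
  end

  # Read a single byte through the view; no new String is created.
  def getbyte(index)
    return nil if index < 0 || index >= @length
    @source.getbyte(@offset + index)
  end

  # Materialize a real String; this is the only method that may copy.
  def to_s
    @source.byteslice(@offset, @length)
  end
end

view = SubstringView.new("abcdef", 1, 3)
puts view.to_s
puts view.slice(1, 2).to_s
```

Every caller has to carry the view around and remember to call `to_s` at the boundary, which is the readability cost the description refers to.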

Related issues 2 (0 open — 2 closed)

Updated by Eregon (Benoit Daloze) over 2 years ago

Related to Feature #19314: String#bytesplice should support partial copy added

#2 [ruby-core:111679]

Updated by Eregon (Benoit Daloze) over 2 years ago

The documentation of RSTRING_PTR() doesn't specify that it returns a \0-terminated char*, but this seems to be assumed in various places, and it would likely be a security issue if the result were not always \0-terminated.
So RSTRING_PTR() needs to realloc and \0-terminate if the data at RSTRING_END(str) is not already \0 (the terminator can be multiple zero bytes for encodings with minwidth > 1; one way to deal with that is to always terminate with 4 \0 bytes).
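
As a rough illustration of why the terminator can be wider than one byte, the sketch below encodes a NUL character in a few encodings and counts the resulting bytes. The helper `terminator_width` is hypothetical, and treating this count as equivalent to the terminator length CRuby uses internally is an assumption, not something confirmed by this ticket.

```ruby
# Hypothetical helper: approximates the number of zero bytes needed to
# terminate a string in the given encoding by encoding a single NUL
# character and counting its bytes.
def terminator_width(encoding)
  "\0".encode(encoding).bytesize
end

[Encoding::UTF_8, Encoding::UTF_16LE, Encoding::UTF_32LE].each do |enc|
  puts "#{enc.name}: #{terminator_width(enc)} zero byte(s)"
end
```

Always writing 4 zero bytes, as suggested above, would cover the widest of these cases without needing to look at the encoding.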

#3 [ruby-core:111681]

Updated by byroot (Jean Boussier) over 2 years ago

SHARABLE_MIDDLE_SUBSTRING was introduced circa 2014 in https://github.com/ruby/ruby/commit/a707ab4bc8a by @nobu (Nobuyoshi Nakada).

@nobu (Nobuyoshi Nakada) maybe you have some insights to share on this topic? Do you remember why this flag was never enabled by default? I assume compatibility issues with C extensions, but there might be more.

Updated by Eregon (Benoit Daloze) over 2 years ago

Related to Feature #18598: Add String#bytesplice added

Updated by Eregon (Benoit Daloze) over 2 years ago

Description updated

#7 [ruby-core:111847]

Updated by mame (Yusuke Endoh) over 2 years ago

I heard that Java stopped using the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks.

I don't disagree with this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.

#8 [ruby-core:111851]

Updated by Eregon (Benoit Daloze) over 2 years ago

mame (Yusuke Endoh) wrote in #note-7:

I don't disagree with this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.

https://github.com/ruby/net-protocol/pull/14 shows gains between 2% and 27%, and that's with the overhead of doing it manually.
Also the workaround makes the code far more complicated, see https://github.com/ruby/net-protocol/pull/14/files#diff-038ee4fdc5401fa2ae8da1c0a0e340167119af07b12696b403cb385be8008005R266

#9 [ruby-core:112163]

Updated by ianks (Ian Ker-Seymer) over 2 years ago

It seems a good idea to introduce a variant of RSTRING_PTR which doesn't guarantee \0-termination, so such callers can then use the existing bytes always without copy.

It would be nice to have a way to get the raw parts of a string ([ptr, len]) as part of the official Ruby C API. As you mentioned, RSTRING_PTR has some caveats:

It may reallocate
It relies on inline code (not accessible via dylib)

As a workaround, I’ve seen a lot of hacks in the wild that manually implement this logic, and it gets hairy since you have to consider embedded strings, etc.

So if we are going to add a feature, we should add something like rb_string_raw_parts which can return a tuple of [ptr, len].

#10 [ruby-core:113902]

Updated by Dan0042 (Daniel DeLorme) about 2 years ago

Bumping this because it's kinda shocking to me that strings don't already work this way. My mental model of ruby strings has always been that

m = rx.match(very_large_string)
before, match, after = m.pre_match, m[0], m.post_match

is memory-wise a cheap operation because we only allocate 3 object slots which point to the same string data. I have a lot of code built on this assumption. But it turns out this was false! The before and match strings actually copy the string data as well.

Same thing for File.read(very_large_file).split("\n") which I assumed allocated one large blob and then had pointers to various parts of that blob for each string of the resulting array. But actually it needs double the memory.

Allocating and copying memory is not free; I expect fixing this will lead to a large performance improvement.
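
The claim about pre_match, the match, and post_match can be checked on a given Ruby build with something like the following sketch. The helper name `match_part_sizes` is made up for illustration, and the memsize values depend on the Ruby version, so no expected numbers are given.

```ruby
require "objspace"

# Hypothetical helper: splits `text` around the first match of `pattern`
# and reports, for each part, its length in bytes and the memory size
# that ObjectSpace attributes to the resulting String object.
def match_part_sizes(text, pattern)
  m = pattern.match(text)
  return nil unless m

  parts = { before: m.pre_match, match: m[0], after: m.post_match }
  parts.transform_values do |part|
    { bytesize: part.bytesize, memsize: ObjectSpace.memsize_of(part) }
  end
end

text = ("a" * 200_000) + "NEEDLE" + ("b" * 200_000)
match_part_sizes(text, /NEEDLE/).each do |name, info|
  puts "#{name}: #{info[:bytesize]} bytes of content, memsize #{info[:memsize]}"
end
```

A part whose memsize is close to its bytesize owns a private copy of its bytes; a part with a small memsize despite a large bytesize is sharing the original buffer.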

#11 [ruby-core:113914]

Updated by Hanmac (Hans Mackowiak) about 2 years ago

It confused me too; I thought copy-on-write was the default for shared strings.

https://patshaughnessy.net/2012/1/18/seeing-double-how-ruby-shares-string-values

#12 [ruby-core:113915]

Updated by duerst (Martin Dürst) about 2 years ago

Hanmac (Hans Mackowiak) wrote in #note-11:

It confused me too; I thought copy-on-write was the default for shared strings.

https://patshaughnessy.net/2012/1/18/seeing-double-how-ruby-shares-string-values

Pat Shaughnessy in his blog describes exactly the same thing as Benoit Daloze above: Ruby shares string data as long as the ends of the strings align.

The reason for this is that (C)Ruby uses NUL-terminated string data.

#13 [ruby-core:113970]

Updated by Dan0042 (Daniel DeLorme) about 2 years ago

duerst (Martin Dürst) wrote in #note-12:

Pat Shaughnessy in his blog describes exactly the same thing as Benoit Daloze above: Ruby shares string data as long as the ends of the strings align.

On first skimming the blog I actually didn't notice that. It's mentioned in one sentence and everything else is about how great Ruby is for avoiding unneeded allocations thanks to copy-on-write.

I realize that RSTRING_PTR is used everywhere, but would it be in the realm of possibility to deprecate it and replace it with something like RSTRING_CSTR and RSTRING_START?

#14 [ruby-core:113974]

Updated by Eregon (Benoit Daloze) about 2 years ago

Dan0042 (Daniel DeLorme) wrote in #note-13:

I realize that RSTRING_PTR is used everywhere, but would it be in the realm of possibility to deprecate it and replace it with something like RSTRING_CSTR and RSTRING_START?

I think that would be great (I didn't find a good name myself but these names are perfect IMO).
At the very least we can add both of these macros/functions.

The deprecation would be useful to guide people to use the more efficient RSTRING_START (+ RSTRING_LEN/RSTRING_END) whenever possible, as RSTRING_CSTR & RSTRING_PTR would need to copy the bytes for a lazy substring which does not reach the end of the original string.

Related to Ruby - Feature #19314: String#bytesplice should support partial copy (Closed)
Related to Ruby - Feature #18598: Add String#bytesplice (Closed)
