Project

General

Profile

Actions

Feature #19315

open

Lazy substrings in CRuby

Added by Eregon (Benoit Daloze) over 1 year ago. Updated 10 months ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:111678]

Description

CRuby should implement lazy substrings, i.e., "abcdef"[1..3] must not copy bytes.

Currently CRuby only reuse the char* if the substring is until the end of the buffer.
But it should also work wherever the substring starts and ends.
Yes, it means RSTRING_PTR() might need to allocate to \0-terminate, so be it, it's worth it.

There is already code for this (SHARABLE_MIDDLE_SUBSTRING), but it's disabled by default and RSTRING_PTR() needs to be changed to deal with this.
It seems a good idea to introduce a variant of RSTRING_PTR which doesn't guarantee \0-termination, so such callers can then use the existing bytes always without copy.

There are countless workarounds for this missing optimization, all not worth it with lazy substring and all less readable:


Related issues 2 (0 open2 closed)

Related to Ruby master - Feature #19314: String#bytesplice should support partial copyClosedActions
Related to Ruby master - Feature #18598: Add String#bytespliceClosedActions
Actions #1

Updated by Eregon (Benoit Daloze) over 1 year ago

  • Related to Feature #19314: String#bytesplice should support partial copy added

Updated by Eregon (Benoit Daloze) over 1 year ago

The documentation of RSTRING_PTR() doesn't specify it returns a \0-terminated char*, but it seems assumed in various places and it would likely be a security issue if that's not always \0-terminated.
So RSTRING_PTR() needs to realloc and \0-terminate if RSTRING_END(str) is not already \0 (can be multiple zeros for minwidth > 1 encodings, one way to deal with that is always terminate with 4 \0).

Updated by byroot (Jean Boussier) over 1 year ago

SHARABLE_MIDDLE_SUBSTRING was introduced circa 2014 in https://github.com/ruby/ruby/commit/a707ab4bc8a by @nobu (Nobuyoshi Nakada).

@nobu (Nobuyoshi Nakada) maybe you have some insights to share on this topic? Is there a reason you remember why this flag was never enabled by default? I assume compatibility issues with C extensions but there might be more.

Actions #4

Updated by Eregon (Benoit Daloze) over 1 year ago

Actions #5

Updated by Eregon (Benoit Daloze) over 1 year ago

  • Description updated (diff)
Actions #6

Updated by Eregon (Benoit Daloze) over 1 year ago

  • Description updated (diff)

Updated by mame (Yusuke Endoh) over 1 year ago

I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks

I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.

Updated by Eregon (Benoit Daloze) over 1 year ago

mame (Yusuke Endoh) wrote in #note-7:

I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.

https://github.com/ruby/net-protocol/pull/14 shows gains between 2% and 27%, and that's with the overhead of doing it manually.
Also the workaround makes the code far more complicated, see https://github.com/ruby/net-protocol/pull/14/files#diff-038ee4fdc5401fa2ae8da1c0a0e340167119af07b12696b403cb385be8008005R266

Updated by ianks (Ian Ker-Seymer) about 1 year ago

It seems a good idea to introduce a variant of RSTRING_PTR which doesn't guarantee \0-termination, so such callers can then use the existing bytes always without copy.

It would be nice to have a way to get the raw parts of a string ([ptr, len]) as part of the official ruby C api. As you mentioned, RSTRING_PTR has some caveats:

  1. It may reallocate
  2. It relies on inline code (not accessibly via dylib)

As a workaround, I’ve seen a lot of hacks in the wild that manually implement this logic, and it gets hairy since you have to consider embedded strings, etc.

So if we are going to add a feature, we should add something like rb_string_raw_parts which can return a tuple of [ptr, len].

Updated by Dan0042 (Daniel DeLorme) 11 months ago

Bumping this because it's kinda shocking to me that strings don't already work this way. My mental model of ruby strings has always been that

m = rx.match(very_large_string)
before, match, after = m.pre_match, m[0], m.post_match

is memory-wise a cheap operation because we only allocate 3 objects slots which point to the same string data. I have a lot of code built on this assumption. But it turns out this was false! The before and match strings actually copy the string data as well.

Same thing for File.read(very_large_file).split("\n") which I assumed allocated one large blob and then had pointers to various parts of that blob for each string of the resulting array. But actually it needs double the memory.

Allocating and copying memory is not free; I expect fixing this will lead to a large performance improvement.

Updated by Hanmac (Hans Mackowiak) 11 months ago

it confused me too, i thought Copy On Write was default for shared strings

https://patshaughnessy.net/2012/1/18/seeing-double-how-ruby-shares-string-values

Updated by duerst (Martin Dürst) 11 months ago

Hanmac (Hans Mackowiak) wrote in #note-11:

it confused me too, i thought Copy On Write was default for shared strings

https://patshaughnessy.net/2012/1/18/seeing-double-how-ruby-shares-string-values

Pat Shaughnessy in his blog describes exactly the same thing as Benoit Daloze above: Ruby shares string data as long as the ends of the strings align.

The reason for this is that (C)Ruby uses NULL-terminated string data.

Updated by Dan0042 (Daniel DeLorme) 10 months ago

duerst (Martin Dürst) wrote in #note-12:

Pat Shaughnessy in his blog describes exactly the same thing as Benoit Daloze above: Ruby shares string data as long as the ends of the strings align.

On first skimming the blog I actually didn't notice that. It's mentioned in one sentence and everything else is about how great Ruby is for avoiding unneeded allocations thanks to copy-on-write.

I realize that RSTRING_PTR is used everywhere, but would it be in the realm of possibility to deprecate it and replace it by something like RSTRING_CSTR and RSTRING_START.

Updated by Eregon (Benoit Daloze) 10 months ago

Dan0042 (Daniel DeLorme) wrote in #note-13:

I realize that RSTRING_PTR is used everywhere, but would it be in the realm of possibility to deprecate it and replace it by something like RSTRING_CSTR and RSTRING_START.

I think that would be great (I didn't find a good name myself but these names are perfect IMO).
At the very least we can add both of these macros/functions.

The deprecation would be useful to guide people to use the more efficient RSTRING_START (+ RSTRING_LEN/RSTRING_END) whenever possible, as RSTRING_CSTR & RSTRING_PTR would need to copy the byte for a lazy substring which does not reaches the end of the original string.

Actions

Also available in: Atom PDF

Like2
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0