Feature #19315
openLazy substrings in CRuby
Description
CRuby should implement lazy substrings, i.e., "abcdef"[1..3] must not copy bytes.
Currently CRuby only reuse the char* if the substring is until the end of the buffer.
But it should also work wherever the substring starts and ends.
Yes, it means RSTRING_PTR() might need to allocate to \0-terminate, so be it, it's worth it.
There is already code for this (SHARABLE_MIDDLE_SUBSTRING
), but it's disabled by default and RSTRING_PTR()
needs to be changed to deal with this.
It seems a good idea to introduce a variant of RSTRING_PTR
which doesn't guarantee \0-termination, so such callers can then use the existing bytes always without copy.
There are countless workarounds for this missing optimization, all not worth it with lazy substring and all less readable:
- https://bugs.ruby-lang.org/issues/19314
- https://bugs.ruby-lang.org/issues/18598#note-3
- https://github.com/ruby/net-protocol/pull/14
- Manual lazy substrings which track string + index + length
- More but I don't remember all now, feel free to comment or link more urls/tickets.
Updated by Eregon (Benoit Daloze) about 2 years ago
- Related to Feature #19314: String#bytesplice should support partial copy added
Updated by Eregon (Benoit Daloze) about 2 years ago
The documentation of RSTRING_PTR() doesn't specify it returns a \0-terminated char*
, but it seems assumed in various places and it would likely be a security issue if that's not always \0-terminated.
So RSTRING_PTR() needs to realloc and \0-terminate if RSTRING_END(str) is not already \0 (can be multiple zeros for minwidth > 1 encodings, one way to deal with that is always terminate with 4 \0).
Updated by byroot (Jean Boussier) about 2 years ago
SHARABLE_MIDDLE_SUBSTRING
was introduced circa 2014 in https://github.com/ruby/ruby/commit/a707ab4bc8a by @nobu (Nobuyoshi Nakada).
@nobu (Nobuyoshi Nakada) maybe you have some insights to share on this topic? Is there a reason you remember why this flag was never enabled by default? I assume compatibility issues with C extensions but there might be more.
Updated by Eregon (Benoit Daloze) about 2 years ago
- Related to Feature #18598: Add String#bytesplice added
Updated by mame (Yusuke Endoh) about 2 years ago
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Updated by Eregon (Benoit Daloze) about 2 years ago
mame (Yusuke Endoh) wrote in #note-7:
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
https://github.com/ruby/net-protocol/pull/14 shows gains between 2% and 27%, and that's with the overhead of doing it manually.
Also the workaround makes the code far more complicated, see https://github.com/ruby/net-protocol/pull/14/files#diff-038ee4fdc5401fa2ae8da1c0a0e340167119af07b12696b403cb385be8008005R266
Updated by ianks (Ian Ker-Seymer) almost 2 years ago
It seems a good idea to introduce a variant of
RSTRING_PTR
which doesn't guarantee \0-termination, so such callers can then use the existing bytes always without copy.
It would be nice to have a way to get the raw parts of a string ([ptr, len]) as part of the official ruby C api. As you mentioned, RSTRING_PTR has some caveats:
- It may reallocate
- It relies on inline code (not accessibly via dylib)
As a workaround, I’ve seen a lot of hacks in the wild that manually implement this logic, and it gets hairy since you have to consider embedded strings, etc.
So if we are going to add a feature, we should add something like rb_string_raw_parts
which can return a tuple of [ptr, len].
Updated by Dan0042 (Daniel DeLorme) over 1 year ago
Bumping this because it's kinda shocking to me that strings don't already work this way. My mental model of ruby strings has always been that
m = rx.match(very_large_string)
before, match, after = m.pre_match, m[0], m.post_match
is memory-wise a cheap operation because we only allocate 3 objects slots which point to the same string data. I have a lot of code built on this assumption. But it turns out this was false! The before
and match
strings actually copy the string data as well.
Same thing for File.read(very_large_file).split("\n")
which I assumed allocated one large blob and then had pointers to various parts of that blob for each string of the resulting array. But actually it needs double the memory.
Allocating and copying memory is not free; I expect fixing this will lead to a large performance improvement.
Updated by Hanmac (Hans Mackowiak) over 1 year ago
it confused me too, i thought Copy On Write was default for shared strings
https://patshaughnessy.net/2012/1/18/seeing-double-how-ruby-shares-string-values
Updated by duerst (Martin Dürst) over 1 year ago
Hanmac (Hans Mackowiak) wrote in #note-11:
it confused me too, i thought Copy On Write was default for shared strings
https://patshaughnessy.net/2012/1/18/seeing-double-how-ruby-shares-string-values
Pat Shaughnessy in his blog describes exactly the same thing as Benoit Daloze above: Ruby shares string data as long as the ends of the strings align.
The reason for this is that (C)Ruby uses NULL-terminated string data.
Updated by Dan0042 (Daniel DeLorme) over 1 year ago
duerst (Martin Dürst) wrote in #note-12:
Pat Shaughnessy in his blog describes exactly the same thing as Benoit Daloze above: Ruby shares string data as long as the ends of the strings align.
On first skimming the blog I actually didn't notice that. It's mentioned in one sentence and everything else is about how great Ruby is for avoiding unneeded allocations thanks to copy-on-write.
I realize that RSTRING_PTR
is used everywhere, but would it be in the realm of possibility to deprecate it and replace it by something like RSTRING_CSTR
and RSTRING_START
.
Updated by Eregon (Benoit Daloze) over 1 year ago
Dan0042 (Daniel DeLorme) wrote in #note-13:
I realize that
RSTRING_PTR
is used everywhere, but would it be in the realm of possibility to deprecate it and replace it by something likeRSTRING_CSTR
andRSTRING_START
.
I think that would be great (I didn't find a good name myself but these names are perfect IMO).
At the very least we can add both of these macros/functions.
The deprecation would be useful to guide people to use the more efficient RSTRING_START
(+ RSTRING_LEN
/RSTRING_END
) whenever possible, as RSTRING_CSTR
& RSTRING_PTR
would need to copy the byte for a lazy substring which does not reaches the end of the original string.