Feature #19315
openLazy substrings in CRuby
Description
CRuby should implement lazy substrings, i.e., "abcdef"[1..3] must not copy bytes.
Currently CRuby only reuse the char* if the substring is until the end of the buffer.
But it should also work wherever the substring starts and ends.
Yes, it means RSTRING_PTR() might need to allocate to \0-terminate, so be it, it's worth it.
There is already code for this (SHARABLE_MIDDLE_SUBSTRING), but it's disabled by default and RSTRING_PTR() needs to be changed to deal with this.
It seems a good idea to introduce a variant of RSTRING_PTR which doesn't guarantee \0-termination, so such callers can then use the existing bytes always without copy.
There are countless workarounds for this missing optimization, all not worth it with lazy substring and all less readable:
- https://bugs.ruby-lang.org/issues/19314
- https://bugs.ruby-lang.org/issues/18598#note-3
- https://github.com/ruby/net-protocol/pull/14
- Manual lazy substrings which track string + index + length
- More but I don't remember all now, feel free to comment or link more urls/tickets.
Updated by Eregon (Benoit Daloze) over 3 years ago
- Related to Feature #19314: String#bytesplice should support partial copy added
Updated by Eregon (Benoit Daloze) over 3 years ago
The documentation of RSTRING_PTR() doesn't specify it returns a \0-terminated char*, but it seems assumed in various places and it would likely be a security issue if that's not always \0-terminated.
So RSTRING_PTR() needs to realloc and \0-terminate if RSTRING_END(str) is not already \0 (can be multiple zeros for minwidth > 1 encodings, one way to deal with that is always terminate with 4 \0).
Updated by byroot (Jean Boussier) over 3 years ago
SHARABLE_MIDDLE_SUBSTRING was introduced circa 2014 in https://github.com/ruby/ruby/commit/a707ab4bc8a by @nobu (Nobuyoshi Nakada).
@nobu (Nobuyoshi Nakada) maybe you have some insights to share on this topic? Is there a reason you remember why this flag was never enabled by default? I assume compatibility issues with C extensions but there might be more.
Updated by Eregon (Benoit Daloze) over 3 years ago
- Related to Feature #18598: Add String#bytesplice added
Updated by Eregon (Benoit Daloze) over 3 years ago
- Description updated (diff)
Updated by Eregon (Benoit Daloze) over 3 years ago
- Description updated (diff)
Updated by mame (Yusuke Endoh) over 3 years ago
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Updated by Eregon (Benoit Daloze) over 3 years ago
mame (Yusuke Endoh) wrote in #note-7:
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
https://github.com/ruby/net-protocol/pull/14 shows gains between 2% and 27%, and that's with the overhead of doing it manually.
Also the workaround makes the code far more complicated, see https://github.com/ruby/net-protocol/pull/14/files#diff-038ee4fdc5401fa2ae8da1c0a0e340167119af07b12696b403cb385be8008005R266
Updated by ianks (Ian Ker-Seymer) over 3 years ago
It seems a good idea to introduce a variant of
RSTRING_PTRwhich doesn't guarantee \0-termination, so such callers can then use the existing bytes always without copy.
It would be nice to have a way to get the raw parts of a string ([ptr, len]) as part of the official ruby C api. As you mentioned, RSTRING_PTR has some caveats:
- It may reallocate
- It relies on inline code (not accessibly via dylib)
As a workaround, I’ve seen a lot of hacks in the wild that manually implement this logic, and it gets hairy since you have to consider embedded strings, etc.
So if we are going to add a feature, we should add something like rb_string_raw_parts which can return a tuple of [ptr, len].
Updated by Dan0042 (Daniel DeLorme) about 3 years ago
Bumping this because it's kinda shocking to me that strings don't already work this way. My mental model of ruby strings has always been that
m = rx.match(very_large_string)
before, match, after = m.pre_match, m[0], m.post_match
is memory-wise a cheap operation because we only allocate 3 objects slots which point to the same string data. I have a lot of code built on this assumption. But it turns out this was false! The before and match strings actually copy the string data as well.
Same thing for File.read(very_large_file).split("\n") which I assumed allocated one large blob and then had pointers to various parts of that blob for each string of the resulting array. But actually it needs double the memory.
Allocating and copying memory is not free; I expect fixing this will lead to a large performance improvement.
Updated by Hanmac (Hans Mackowiak) almost 3 years ago
it confused me too, i thought Copy On Write was default for shared strings
https://patshaughnessy.net/2012/1/18/seeing-double-how-ruby-shares-string-values
Updated by duerst (Martin Dürst) almost 3 years ago
Hanmac (Hans Mackowiak) wrote in #note-11:
it confused me too, i thought Copy On Write was default for shared strings
https://patshaughnessy.net/2012/1/18/seeing-double-how-ruby-shares-string-values
Pat Shaughnessy in his blog describes exactly the same thing as Benoit Daloze above: Ruby shares string data as long as the ends of the strings align.
The reason for this is that (C)Ruby uses NULL-terminated string data.
Updated by Dan0042 (Daniel DeLorme) almost 3 years ago
duerst (Martin Dürst) wrote in #note-12:
Pat Shaughnessy in his blog describes exactly the same thing as Benoit Daloze above: Ruby shares string data as long as the ends of the strings align.
On first skimming the blog I actually didn't notice that. It's mentioned in one sentence and everything else is about how great Ruby is for avoiding unneeded allocations thanks to copy-on-write.
I realize that RSTRING_PTR is used everywhere, but would it be in the realm of possibility to deprecate it and replace it by something like RSTRING_CSTR and RSTRING_START.
Updated by Eregon (Benoit Daloze) almost 3 years ago
Dan0042 (Daniel DeLorme) wrote in #note-13:
I realize that
RSTRING_PTRis used everywhere, but would it be in the realm of possibility to deprecate it and replace it by something likeRSTRING_CSTRandRSTRING_START.
I think that would be great (I didn't find a good name myself but these names are perfect IMO).
At the very least we can add both of these macros/functions.
The deprecation would be useful to guide people to use the more efficient RSTRING_START (+ RSTRING_LEN/RSTRING_END) whenever possible, as RSTRING_CSTR & RSTRING_PTR would need to copy the byte for a lazy substring which does not reaches the end of the original string.
Updated by Eregon (Benoit Daloze) 3 months ago
FWIW, StringValueCStr() already exists and checks for no \0 bytes and implies a terminating \0.
It still seems too risky to change RSTRING_PTR() to not terminate though.
FWIW, TruffleRuby already has lazy substrings and always guarantee to \0-terminate RSTRING_PTR().
Updated by kou (Kouhei Sutou) about 1 month ago
I think that we should not deprecate RSTRING_PTR() for backward compatibility.
How about adding a new API such as RSTRING_RAW_PTR() for char * that may not be \0-terminated instead?
RSTRING_RAW_PTR() works like the current RSTRING_PTR() (no \0-terminated check). New RSTRING_PTR() ensures \0-terminated char * automatically like str_fill_term() does.
If many C extension codes migrate to RSTRING_RAW_PTR() from RSTRING_PTR(), we can avoid data copy.
Updated by Eregon (Benoit Daloze) about 1 month ago
Yes I think adding RSTRING_RAW_PTR() is a good way, I would just suggest another name: RSTRING_START().
Actually now I see this name was already proposed in https://bugs.ruby-lang.org/issues/19315#note-13.
Why I think RSTRING_START() is better than RSTRING_RAW_PTR():
- Pairs better with
RSTRING_END(), which is crucial to use in combination with the new macro/function. -
_RAW_sounds internal/low-level likeFL_TEST_RAW(which probably shouldn't be public or is dangerous) which might discourage its use. This new API is not low level, it represents the true start of the Ruby String, which can contain\0in the middle and might not end with\0, soRSTRING_END()must be used. - I think it's good
RSTRING_PTR()sounds more low-level/internal thanRSTRING_START(), because it becomes more expensive, just likeRARRAY_PTR(low-level, slow) vsRARRAY_AREF()/rb_ary_entry()(higher-level, fast).
@matz (Yukihiro Matsumoto) How about this for Ruby 4.1:
- Add
RSTRING_START() - Document
RSTRING_PTR()as more expensive in some cases and recommend usingRSTRING_START()instead. People can easily use it on older versions with:
#ifndef RSTRING_START
#define RSTRING_START RSTRING_PTR
#endif
- Enable
SHARABLE_MIDDLE_SUBSTRINGand changeRSTRING_PTR()to re-allocate with a final\0if it's not already there.
We could do 3 later, but given the idea to not actually deprecate RSTRING_PTR() I think there is no value to delay it.
Updated by himura467 (Akito Shitara) 12 days ago
· Edited
Following the direction suggested by @Eregon (Benoit Daloze) in #note-17, I created a prototype implementation for this issue: https://github.com/ruby/ruby/pull/17045
While maintaining backward compatibility with RSTRING_PTR() and RSTRING_END(), there are two design points to consider:
1. Handling RSTRING_PTR() without the GVL¶
To ensure null-termination for shared middle substrings, RSTRING_PTR() now calls str_make_independent_expand. Since this function requires the GVL, I added a separate code path for when RSTRING_PTR() is called without it.
However, it is unclear if RSTRING_PTR() is ever called without the GVL in practice. I would appreciate feedback on whether this edge-case handling is actually necessary or redundant.
2. Naming the new API¶
To prevent breaking existing code that uses RSTRING_PTR() and RSTRING_END() together, I think RSTRING_END() must now guarantee null-termination in the same way as RSTRING_PTR(). Otherwise, they could point to different buffers after unsharing.
While the name RSTRING_START() would pair well with RSTRING_END(), the new API and RSTRING_END() cannot safely be mixed because RSTRING_END() now also triggers unsharing. Therefore, I used the name RSTRING_RAW_PTR() as proposed in #note-16.
Updated by byroot (Jean Boussier) 12 days ago
However, it is unclear if RSTRING_PTR() is ever called without the GVL in practice. I would appreciate feedback on whether this edge-case handling is actually necessary or redundant.
It's not necessary. It's already unsafe to call RSTRING_PTR without the GVL.
Updated by himura467 (Akito Shitara) 12 days ago
byroot (Jean Boussier) wrote in #note-19:
However, it is unclear if RSTRING_PTR() is ever called without the GVL in practice. I would appreciate feedback on whether this edge-case handling is actually necessary or redundant.
It's not necessary. It's already unsafe to call
RSTRING_PTRwithout the GVL.
I see, is this excessive? Thanks.
Updated by byroot (Jean Boussier) 12 days ago
· Edited
I see, is this excessive?
I would say yes.
Updated by Eregon (Benoit Daloze) 12 days ago
himura467 (Akito Shitara) wrote in #note-18:
2. Naming the new API¶
To prevent breaking existing code that uses
RSTRING_PTR()andRSTRING_END()together, I thinkRSTRING_END()must now guarantee null-termination in the same way asRSTRING_PTR(). Otherwise, they could point to different buffers after unsharing.
What's the problem?
Using RSTRING_END() means one is aware of null bytes in the middle and does not assume null-terminated.
Is it because of this case?
char* end = RSTRING_END(str);
char* ptr = RSTRING_PTR(str);
Updated by rhenium (Kazuki Yamaguchi) 12 days ago
Eregon (Benoit Daloze) wrote in #note-22:
What's the problem?
UsingRSTRING_END()means one is aware of null bytes in the middle and does not assume null-terminated.
Is it because of this case?char* end = RSTRING_END(str); char* ptr = RSTRING_PTR(str);
I think it would also be a problem in some_function_call(RSTRING_PTR(str), RSTRING_END(str)), where the evaluation order is not guaranteed.
himura467 (Akito Shitara) wrote in #note-18:
To ensure null-termination for shared middle substrings,
RSTRING_PTR()now callsstr_make_independent_expand.
I wonder if RSTRING_PTR() could instead allocate a read-only copy dedicated for RSTRING_PTR(), rather than permanently converting the String away from STR_SHARED.
I think the current patch in https://github.com/ruby/ruby/pull/17045 introduces an unfortunate corner case that could be difficult to reproduce or debug:
VALUE shared_root = rb_str_freeze(rb_str_new(NULL, 100));
VALUE str = rb_str_subseq(shared_root, 0, 50); // STR_SHARED
rb_str_freeze(str);
shared_root = Qnil; // Forget it
char *ptr;
long len;
RSTRING_GETMEM(str, ptr, len); // ptr = RSTRING_RAW_PTR(str);
// This is legal if `str` is frozen or `rb_str_locktmp()`'ed, but would become unsafe with the PR.
//
// I also suspect there are carelessly written core/extension methods that accidentally allow
// Ruby code to run after taking a pointer from a non-`rb_str_locktmp()`'ed mutable string.
// Such code is unsafe even without the PR, but may happen to work in practice unless someone
// actively tries to break it by modifying the string content.
{
// Converts `str` into an independent string, making `shared_root` unreachable from GC roots
(void)RSTRING_PTR(str);
rb_gc();
}
// `ptr` may no longer be valid at this point.
printf("%.*s", (int)len, ptr);
This also means RSTRING_RAW_PTR() is slightly harder to use safely, even when a NUL terminator is not required. For example, in ruby/openssl, there is code that needs a pointer that survives across execution of arbitrary Ruby code:
https://github.com/ruby/openssl/blob/c7cdca78f5ee19576113d6d600fd22650f0e37c4/ext/openssl/ossl_ssl.c#L2119
I think it's also worth noting that some users of RSTRING_PTR() assume it never fail and use it where longjmp isn't allowed. However, since allocation failure should be rare in normal circumstances, this may be an acceptable incompatibility.
Updated by himura467 (Akito Shitara) 12 days ago
Eregon (Benoit Daloze) wrote in #note-22:
What's the problem?
If RSTRING_END() remains unchanged, it can cause inconsistency due to the undefined evaluation order of C.
This happens if RSTRING_END() is evaluated before RSTRING_PTR() in expressions like search_nonascii(RSTRING_PTR(dest), RSTRING_END(dest)).
Given that this pattern actually appears in several places within the Ruby core, I think we should preserve backward compatibility here to be safe.
Ref: https://github.com/ruby/ruby/blob/2ff671c93d337cc7c4793f9958557f3f8f4eb623/string.c#L916
Updated by himura467 (Akito Shitara) 11 days ago
rhenium (Kazuki Yamaguchi) wrote in #note-23:
I wonder if
RSTRING_PTR()could instead allocate a read-only copy dedicated forRSTRING_PTR(), rather than permanently converting the String away fromSTR_SHARED.
Interesting. The current prototype already uses a similar approach for the no-GVL case, where it allocates a temporary, null-terminated copy. It might be better to extend this behavior to cases with the GVL or something like that.
rhenium (Kazuki Yamaguchi) wrote in #note-23:
I think the current patch in https://github.com/ruby/ruby/pull/17045 introduces an unfortunate corner case that could be difficult to reproduce or debug:
You are right, it can indeed occur. This may need to be addressed somehow, such as by your suggestion above or another approach.
rhenium (Kazuki Yamaguchi) wrote in #note-23:
This also means
RSTRING_RAW_PTR()is slightly harder to use safely
Good point. I will add a note to clarify the safety considerations around holding a pointer across Ruby code execution.
rhenium (Kazuki Yamaguchi) wrote in #note-23:
I think it's also worth noting that some users of
RSTRING_PTR()assume it never fail and use it where longjmp isn't allowed. However, since allocation failure should be rare in normal circumstances, this may be an acceptable incompatibility.
Thanks. While this may be acceptable, it is worth keeping in mind.
Updated by Dan0042 (Daniel DeLorme) 11 days ago
I agree that completely deprecating RSTRING_PTR might not be strictly necessary, but for the sake of a smooth migration, I highly recommend introducing RSTRING_CSTR as an alias for the new, NUL-terminated RSTRING_PTR behavior.
Introducing this alias provides a clear upgrade path. Maintainers can simply grep through their codebases for RSTRING_PTR and systematically replace it with either RSTRING_START or RSTRING_CSTR. Without RSTRING_CSTR, it becomes much harder to audit existing code, as maintainers will look at an instance of RSTRING_PTR and have to wonder: "Was this intended to rely on the new NUL-terminated behavior, or should I change it to RSTRING_START?"
The compatibility issue involving the RSTRING_PTR + RSTRING_END pattern perfectly highlights this advantage. If a developer sees RSTRING_PTR paired with RSTRING_END, they can confidently convert RSTRING_PTR to RSTRING_START to ensure seamless compatibility without any guesswork.
Updated by kou (Kouhei Sutou) 11 days ago
rhenium (Kazuki Yamaguchi) wrote in #note-23:
I wonder if
RSTRING_PTR()could instead allocate a read-only copy dedicated forRSTRING_PTR(), rather than permanently converting the String away fromSTR_SHARED.I think the current patch in https://github.com/ruby/ruby/pull/17045 introduces an unfortunate corner case that could be difficult to reproduce or debug:
How about doing it only when the target String is frozen or STR_TMPLOCK?
We can use https://github.com/ruby/ruby/pull/17045/changes#diff-430d86fdb6c4a558ab0f1b6648bbfae1720e8bde84f026e95a52740014752040R3030-R3050 for it as @himura467 (Akito Shitara) already mentioned.
Do we need it for non frozen String?
This also means
RSTRING_RAW_PTR()is slightly harder to use safely, even when a NUL terminator is not required. For example, in ruby/openssl, there is code that needs a pointer that survives across execution of arbitrary Ruby code:
https://github.com/ruby/openssl/blob/c7cdca78f5ee19576113d6d600fd22650f0e37c4/ext/openssl/ossl_ssl.c#L2119
I think that the current RSTRING_PTR() for non frozen String also has similar situation. If arbitrary Ruby code changes the target String, pointer returned by RSTRING_PTR() may be invalid:
char *ptr = RSTRING_PTR(str);
// Run "str << large_str" in Ruby
// ptr may be invalid here.
So I don't feel that this is a new problem of new RSTRING_RAW_PTR() API.
Updated by kou (Kouhei Sutou) 11 days ago
Dan0042 (Daniel DeLorme) wrote in #note-26:
I agree that completely deprecating RSTRING_PTR might not be strictly necessary, but for the sake of a smooth migration, I highly recommend introducing RSTRING_CSTR as an alias for the new, NUL-terminated RSTRING_PTR behavior.
I don't object the idea but CSTR may not be a good name. Because we already have StringValueCStr() that not only ensures the target content is \0 terminated but also ensures the target content doesn't include \0. RSTRING_CSTR() doesn't check whether \0 is included in the target content or not.
Updated by Dan0042 (Daniel DeLorme) 10 days ago
· Edited
RSTRING_CSTR()doesn't check whether\0is included in the target content or not.
That is a very good point. In that case, perhaps the new NUL-terminated macro should instead alias or behave similarly to StringValueCStr(). Otherwise, we risk allowing a NUL-terminated string that contains embedded NUL bytes in the middle. I struggle to see a valid use case for that scenario; it feels like an invitation for bugs and undefined behavior.
I think
RSTRING_END()must now guarantee null-termination in the same way asRSTRING_PTR(). Otherwise, they could point to different buffers after unsharing.
I would like to point out that this introduces the exact opposite problem for raw pointers: if someone uses RSTRING_RAW_PTR alongside RSTRING_END, they could end up pointing to different buffers. To maintain consistency, would we then need to introduce RSTRING_RAW_END as well?
Updated by Eregon (Benoit Daloze) 10 days ago
· Edited
Right, if one wants a NUL-terminated C string, i.e. a char* with no \0 in the middle and only at the end, then StringValueCStr() is already the correct answer.
I' believe migrating usages of RSTRING_PTR() which want \0-terminated to StringValueCStr() is the right thing to do.
And in fact current usages of RSTRING_PTR() that expect \0-termination are incorrect, if \0 can be in the middle of the String (though it results in a smaller C string rather than reading out of bounds).
How about this:
- Deprecate
RSTRING_PTR()with a message saying to use eitherStringValueCStr()(if want to use it as a NUL-terminated C string) orRSTRING_START()(for efficiency & fixing the warning, must be paired with eitherRSTRING_LEN()orRSTRING_END()). - Don't change the behavior of
RSTRING_PTR()yet (so noSHARABLE_MIDDLE_SUBSTRING), let some time for people to migrate off ofRSTRING_PTR(). - Then some release(s) later enable
SHARABLE_MIDDLE_SUBSTRING.RSTRING_PTR()at that point is the same asRSTRING_START()(so no issues with using a different buffer thanRSTRING_END()and already-correct usages ofRSTRING_PTRare preserved even if they didn't migrate), but no longer guarantees to be NUL-terminated (which was already incorrect for the\0in the middle case and should have usedStringValueCStr()already).
I was thinking maybe we could also change RSTRING_PTR() to be StringValueCStr() in that last step, but then we get the different buffer issues mentioned above, and I think that's worse.
Notably because we'd break correct usages of RSTRING_PTR() to help incorrect usages of RSTRING_PTR() (which should be StringValueCStr() anyway).
I think we do need to deprecate RSTRING_PTR(), otherwise we have no way to tell existing users of RSTRING_PTR() that they should change to StringValueCStr() if they want NUL-termination (well, only NEWS file & docs or so, but very few will see that).
Updated by byroot (Jean Boussier) 10 days ago
so no SHARABLE_MIDDLE_SUBSTRING), let some time for people to migrate off
I wonder if it could be turned into a runtime flag. That would offer the benefit to help with testing, get a lot more feedback, etc.
Updated by kou (Kouhei Sutou) 7 days ago
Dan0042 (Daniel DeLorme) wrote in #note-29:
RSTRING_CSTR()doesn't check whether\0is included in the target content or not.That is a very good point. In that case, perhaps the new NUL-terminated macro should instead alias or behave similarly to
StringValueCStr(). Otherwise, we risk allowing a NUL-terminated string that contains embedded NUL bytes in the middle. I struggle to see a valid use case for that scenario; it feels like an invitation for bugs and undefined behavior.
I think that we don't need a new macro for the CStr behavior. We can just use existing StringValueCStr().
I think
RSTRING_END()must now guarantee null-termination in the same way asRSTRING_PTR(). Otherwise, they could point to different buffers after unsharing.I would like to point out that this introduces the exact opposite problem for raw pointers: if someone uses
RSTRING_RAW_PTRalongsideRSTRING_END, they could end up pointing to different buffers. To maintain consistency, would we then need to introduceRSTRING_RAW_ENDas well?
Correct.
The proposed PR includes a note about the case:
@warningThe returned pointer is invalidated if the string is
subsequently unshared. Operations such as RSTRING_PTR()
and RSTRING_END() may unshare the string and reallocate
its buffer, making this pointer dangling. Only call this
function after all unsharing operations onstrhave
completed, or whenstris known to be independent. To
obtain the end pointer without risking invalidation, use
RSTRING_RAW_PTR(str) + RSTRING_LEN(str).
I think that we don't need to introduce RSTRING_RAW_END() or something. We can just recommend RSTRING_RAW_PTR() + RSTRING_LEN() instead of RSTRING_END().
Updated by kou (Kouhei Sutou) 7 days ago
Eregon (Benoit Daloze) wrote in #note-30:
- Deprecate
RSTRING_PTR()with a message saying to use eitherStringValueCStr()(if want to use it as a NUL-terminated C string) orRSTRING_START()(for efficiency & fixing the warning, must be paired with eitherRSTRING_LEN()orRSTRING_END()).
...- Then some release(s) later enable
SHARABLE_MIDDLE_SUBSTRING.RSTRING_PTR()at that point is the same asRSTRING_START()(so no issues with using a different buffer thanRSTRING_END()and already-correct usages ofRSTRING_PTRare preserved even if they didn't migrate), but no longer guarantees to be NUL-terminated (which was already incorrect for the\0in the middle case and should have usedStringValueCStr()already).
I object deprecating RSTRING_PTR() and changing its behavior later. Because it breaks backward compatibility. We should not break backward compatibility as much as possible to not increase maintenance costs of existing gem maintainers.
I think that adding a new API and keep compatibility of existing APIs is better.
Updated by himura467 (Akito Shitara) 6 days ago
IMO, backward compatibility of RSTRING_PTR should be preserved within the scope of this feature.
Eregon (Benoit Daloze) wrote in #note-30:
I' believe migrating usages of
RSTRING_PTR()which want\0-terminated toStringValueCStr()is the right thing to do.
And in fact current usages ofRSTRING_PTR()that expect\0-termination are incorrect, if\0can be in the middle of the String (though it results in a smaller C string rather than reading out of bounds).How about this:
- Deprecate
RSTRING_PTR()with a message saying to use eitherStringValueCStr()(if want to use it as a NUL-terminated C string) orRSTRING_START()(for efficiency & fixing the warning, must be paired with eitherRSTRING_LEN()orRSTRING_END()).
...- Then some release(s) later enable
SHARABLE_MIDDLE_SUBSTRING.RSTRING_PTR()at that point is the same asRSTRING_START()(so no issues with using a different buffer thanRSTRING_END()and already-correct usages ofRSTRING_PTRare preserved even if they didn't migrate), but no longer guarantees to be NUL-terminated (which was already incorrect for the\0in the middle case and should have usedStringValueCStr()already).
I understand the motivation, but I think that broadens the impact of this feature well beyond what it warrants.
The primary goal here is enabling zero-copy substring reuse via SHARABLE_MIDDLE_SUBSTRING, with RSTRING_RAW_PTR (or RSTRING_START) as the new API. Ensuring RSTRING_PTR still guarantees null termination is a necessary protective measure to minimize breakage of existing C extensions.
Deprecating and changing the behavioral guarantees of RSTRING_PTR itself would affect all existing callers of RSTRING_PTR, which is a much larger blast radius than this feature itself justifies.
I think our basic policy should be adding RSTRING_RAW_PTR, while keeping RSTRING_PTR compatible by default, and leaving any deprecation as a separate future discussion.
Updated by rhenium (Kazuki Yamaguchi) 6 days ago
kou (Kouhei Sutou) wrote in #note-27:
I think that the current
RSTRING_PTR()for non frozenStringalso has similar situation. If arbitrary Ruby code changes the targetString, pointer returned byRSTRING_PTR()may be invalid:char *ptr = RSTRING_PTR(str); // Run "str << large_str" in Ruby // ptr may be invalid here.So I don't feel that this is a new problem of new
RSTRING_RAW_PTR()API.
That's a fair point. My concern is that such code is probably common in extensions, and currently it often happens to work fine unless a user intentionally tries to trigger the bug.
If we go down the route of having RSTRING_PTR() lazily allocate a NUL-terminated buffer in it, that is a breaking change on its own because it introduces the possibility of GC and longjmp where they previously weren't possible. This will force some extensions to migrate away from RSTRING_PTR(). I think this may be acceptable, but when migrating to RSTRING_RAW_PTR(), users should ideally only need to worry about the absence of the NUL terminator.
kou (Kouhei Sutou) wrote in #note-32:
Dan0042 (Daniel DeLorme) wrote in #note-29:
RSTRING_CSTR()doesn't check whether\0is included in the target content or not.That is a very good point. In that case, perhaps the new NUL-terminated macro should instead alias or behave similarly to
StringValueCStr(). Otherwise, we risk allowing a NUL-terminated string that contains embedded NUL bytes in the middle. I struggle to see a valid use case for that scenario; it feels like an invitation for bugs and undefined behavior.I think that we don't need a new macro for the
CStrbehavior. We can just use existingStringValueCStr().
On a related note, while StringValueCStr() can work here, I've sometimes wished for a function to ensure a C string without the StringValue part (implicit type conversion by #to_str) for clarity. Something like:
const char *rb_str_cstr(VALUE caller_must_ensure_t_string);
Updated by kou (Kouhei Sutou) 5 days ago
rhenium (Kazuki Yamaguchi) wrote in #note-35:
If we go down the route of having
RSTRING_PTR()lazily allocate a NUL-terminated buffer in it, that is a breaking change on its own because it introduces the possibility of GC and longjmp where they previously weren't possible.
What case do you imagine? Calling RSTRING_PTR() without GVL case?
On a related note, while
StringValueCStr()can work here, I've sometimes wished for a function to ensure a C string without theStringValuepart (implicit type conversion by#to_str) for clarity. Something like:const char *rb_str_cstr(VALUE caller_must_ensure_t_string);
Interesting. How about discussing it as a separated issue?
FYI: We have internal rb_str_to_cstr() function: https://github.com/ruby/ruby/blob/1c004aa92f8e93750df6edff5de4599912622974/string.c#L2939-L2944
It doesn't raise an exception nor report a warning but you want rb_str_cstr() to do them, right?
Updated by Dan0042 (Daniel DeLorme) 5 days ago
· Edited
I think that we don't need to introduce
RSTRING_RAW_END()or something. We can just recommendRSTRING_RAW_PTR() + RSTRING_LEN()instead ofRSTRING_END().
Then what was the original purpose of introducing RSTRING_END() in the first place? Why not always RSTRING_PTR() + RSTRING_LEN() ? I imagine it was introduced to provide a better developer experience (DX) and cleaner code, so going back to only ptr + len feels like going back on that API decision.
I am highly sympathetic to the argument for backward compatibility, but it must be weighted against other factors. In order to migrate to the new API while maintaining compatibility for RSTRING_PTR() + RSTRING_END() we must change
from char *beg = RSTRING_PTR(str), *end = RSTRING_END(str);
to char *beg = RSTRING_RAW_PTR(str), *end = beg + RSTRING_LEN(str);
and any misuses of RSTRING_PTR() to StringValueCStr()
and as a result be left with no RSTRING_PTR() or RSTRING_END() anywhere, which is ironic given the effort to preserve their compatibility.
That feels very awkward to me, to the point that a clean break might be better. Important to note: because this only impacts C extensions and not Ruby code, compatibility issues are caught immediately at compile time, making them straightforward (although annoying) for maintainers to identify and fix.
Updated by kou (Kouhei Sutou) 5 days ago
Dan0042 (Daniel DeLorme) wrote in #note-37:
I think that we don't need to introduce
RSTRING_RAW_END()or something. We can just recommendRSTRING_RAW_PTR() + RSTRING_LEN()instead ofRSTRING_END().Then what was the original purpose of introducing
RSTRING_END()in the first place? Why not alwaysRSTRING_PTR() + RSTRING_LEN()? I imagine it was introduced to provide a better developer experience (DX) and cleaner code, so going back to onlyptr + lenfeels like going back on that API decision.
I'm not sure the original purpose but RSTRING_END() was introduced by Matz by a25fbe3b3e531bbe479f344af24eaf9d2eeae6ea .
Matz may share the original purpose.
I don't object that we add RSTRING_RAW_END() or something.
I am highly sympathetic to the argument for backward compatibility, but it must be weighted against other factors. In order to migrate to the new API while maintaining compatibility for
RSTRING_PTR() + RSTRING_END()we must change
fromchar *beg = RSTRING_PTR(str), *end = RSTRING_END(str);
tochar *beg = RSTRING_RAW_PTR(str), *end = beg + RSTRING_LEN(str);
and any misuses ofRSTRING_PTR()toStringValueCStr()
and as a result be left with noRSTRING_PTR()orRSTRING_END()anywhere, which is ironic given the effort to preserve their compatibility.
If it happens, it's a success story. It means that there are no problems by this optimization. We can deprecate/remove RSTRING_PTR()/RSTRING_END() after it. We may reuse them like we did for Data.
That feels very awkward to me, to the point that a clean break might be better. Important to note: because this only impacts C extensions and not Ruby code, compatibility issues are caught immediately at compile time, making them straightforward (although annoying) for maintainers to identify and fix.
I concern about gems that don't have any active maintainers. They may not be fixed soon. Users of them may need to migrate to other gems or re-implement features provided by them. I'm not sure whether it justifies the RSTRING_PTR() (that is a widely used API) breaking change or not.
Updated by Dan0042 (Daniel DeLorme) 4 days ago
You have mostly convinced me about backward compatibility, but I still find that new API awkward.
I may be bikeshedding here, but is there a precedent in Ruby for a macro that assigns multiple variables? Given the new API is meant to get a start+end pointer pair, it could something along the lines of
#define RSTRING_START_END(beg, end, str) *beg=RSTRING_RAW_PTR(str), *end=beg+RSTRING_LEN(str)
char RSTRING_START_END(beg, end, str);
Updated by byroot (Jean Boussier) 4 days ago
is there a precedent in Ruby for a macro that assigns multiple variables?
Yes: RSTRING_GETMEM.
Updated by rhenium (Kazuki Yamaguchi) 4 days ago
kou (Kouhei Sutou) wrote in #note-36:
rhenium (Kazuki Yamaguchi) wrote in #note-35:
If we go down the route of having
RSTRING_PTR()lazily allocate a NUL-terminated buffer in it, that is a breaking change on its own because it introduces the possibility of GC and longjmp where they previously weren't possible.What case do you imagine? Calling
RSTRING_PTR()without GVL case?
On top of my head: I think some users will have to choose between adding new RB_GC_GUARD() for temporary objects on the stack or migrating to RSTRING_RAW_PTR(). Some users have malloc()'ed memory and currently do not expect RSTRING_PTR() might raise an exception.
On a related note, while
StringValueCStr()can work here, I've sometimes wished for a function to ensure a C string without theStringValuepart (implicit type conversion by#to_str) for clarity. Something like:const char *rb_str_cstr(VALUE caller_must_ensure_t_string);Interesting. How about discussing it as a separated issue?
FYI: We have internalrb_str_to_cstr()function: https://github.com/ruby/ruby/blob/1c004aa92f8e93750df6edff5de4599912622974/string.c#L2939-L2944
It doesn't raise an exception nor report a warning but you wantrb_str_cstr()to do them, right?
Indeed, I've created a new feature request for this: https://bugs.ruby-lang.org/issues/22102
Updated by byroot (Jean Boussier) 4 days ago
currently do not expect RSTRING_PTR() might raise an exception.
I don't think the exception is really the bigger concern, as most programs assume overcommit and don't handle OOM errors at all.
The bigger concern IMO is that RSTRING_PTR might need to reallocate the string buffer to be able to fit the NUL terminator, which might invalidate some previously acquired pointers inside the string, and lead to a use-after-free.
Updated by mame (Yusuke Endoh) 3 days ago
This ticket was discussed at the dev meeting.
Please provide a quantitative evaluation of performance and compatibility at the application level. Does this change bring an observable performance improvement? Does memory usage increase or decrease?
It would also be useful to observe how often RSTRING_PTR is actually called without the GVL in practice, for example by inserting rb_bug into that code path and running test suites. Evaluating with yjit-bench might be a good idea.
By the way, this did not come up during the meeting, but I think the fact that RSTRING_PTR can now trigger GC could itself be a serious compatibility problem. There is probably a fair amount of code written under the assumption that RSTRING_PTR never triggers GC, and such code may start to fail with very low probability, depending on subtle behaviors of the conservative GC. Evaluating this risk is very difficult.
Updated by Eregon (Benoit Daloze) 2 days ago
· Edited
mame (Yusuke Endoh) wrote in #note-43:
By the way, this did not come up during the meeting, but I think the fact that
RSTRING_PTRcan now trigger GC could itself be a serious compatibility problem. There is probably a fair amount of code written under the assumption thatRSTRING_PTRnever triggers GC, and such code may start to fail with very low probability, depending on subtle behaviors of the conservative GC. Evaluating this risk is very difficult.
Regarding that, I think there is no need for RSTRING_PTR to trigger GC, so the implementation could make sure it never does.
Updated by himura467 (Akito Shitara) 1 day ago
mame (Yusuke Endoh) wrote in #note-43:
Please provide a quantitative evaluation of performance and compatibility at the application level. Does this change bring an observable performance improvement? Does memory usage increase or decrease?
It would also be useful to observe how oftenRSTRING_PTRis actually called without the GVL in practice, for example by insertingrb_buginto that code path and running test suites. Evaluating with yjit-bench might be a good idea.
Thanks, I'll share the results.
Eregon (Benoit Daloze) wrote in #note-44:
mame (Yusuke Endoh) wrote in #note-43:
By the way, this did not come up during the meeting, but I think the fact that
RSTRING_PTRcan now trigger GC could itself be a serious compatibility problem. There is probably a fair amount of code written under the assumption thatRSTRING_PTRnever triggers GC, and such code may start to fail with very low probability, depending on subtle behaviors of the conservative GC. Evaluating this risk is very difficult.Regarding that, I think there is no need for
RSTRING_PTRto trigger GC, so the implementation could make sure it never does.
Makes sense. I'll work on updating the implementation with these in mind.
Updated by byroot (Jean Boussier) 1 day ago
If RSTRING_PTR never allocates, it means it doesn't provide NUL termination anymore, right?
It may already be wrong, but a lot of code our there assume RSTRING_PTR returns a NUL terminated strings.
I'll reiterate that I believe for the ecosystem to catch up to such a change, we need a way to turn it on/off at runtime, otherwise if we enable this in a new Ruby version with no way to turn it back off, people will run into lots of bugs.
Updated by himura467 (Akito Shitara) 1 day ago
byroot (Jean Boussier) wrote in #note-46:
If
RSTRING_PTRnever allocates, it means it doesn't provide NUL termination anymore, right?
Regardless of whether it's the optimal approach, I think an implementation like https://github.com/ruby/ruby/pull/17045/changes#diff-430d86fdb6c4a558ab0f1b6648bbfae1720e8bde84f026e95a52740014752040R3030-R3050 could guarantee NUL termination without triggering GC.
Updated by byroot (Jean Boussier) 1 day ago
I think an implementation like [...] could guarantee NUL termination without triggering GC.
Alright, I missed/forgot that part. But I can't say I like that solution, as it's a new contention point for Ractors, and if the main thread fork while a background ractor is holding this lock, the child will likely end up deadlocking at some point.
Updated by himura467 (Akito Shitara) 1 day ago
· Edited
byroot (Jean Boussier) wrote in #note-48:
as it's a new contention point for Ractors, and if the main thread fork while a background ractor is holding this lock, the child will likely end up deadlocking at some point.
That's a valid concern. However, it might be structurally addressable via internal synchronization, similar to how jemalloc manages its internal states for concurrency.
Ref: https://github.com/jemalloc/jemalloc/blob/dev/src/jemalloc_fork.c
That said, doing so would add implementation complexity. Given that code combining Ractors with RSTRING_PTR() expecting NUL termination might not be very common yet, it is worth discussing how much effort we should put into handling this specific case.
byroot (Jean Boussier) wrote in #note-48:
But I can't say I like that solution
I can't say I like that solution either, so I'm trying to think if there might be a better approach to implement this.