Project

General

Profile

Actions

Feature #20878

open

A new C API to create a String by adopting a pointer: `rb_enc_str_adopt(const char *ptr, long len, long capa, rb_encoding *enc)`

Added by byroot (Jean Boussier) 11 days ago. Updated 10 days ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:119801]

Description

Context

A common use case when writing C extensions is to generate text or bytes into a buffer, and to return it back
wrapped into a Ruby String. Examples are JSON.generate(obj) -> String, and all other format serializers,
compression libraries such as ZLib.deflate, etc, but also methods such as Time.strftime,

Current Solution

Work in a buffer and copy the result

The most often used solution is to work with a native buffer and to manage a native allocated buffer,
and once the generation is done, call rb_str_new* to copy the result inside memory managed by Ruby.

It works, but isn't very efficient because it cause an extra copy and an extra free().

On ruby/json macro-benchmarks, this represent around 5% of the time spent in JSON.generate.

static void fbuffer_free(FBuffer *fb)
{
    if (fb->ptr && fb->type == FBUFFER_HEAP_ALLOCATED) {
        ruby_xfree(fb->ptr);
    }
}

static VALUE fbuffer_to_s(FBuffer *fb)
{
    VALUE result = rb_utf8_str_new(FBUFFER_PTR(fb), FBUFFER_LEN(fb));
    fbuffer_free(fb);
    return result;
}

Work inside RString allocated memory

Another way this is currently done, is to allocate an RString using rb_str_buf_new,
and write into it with various functions such as rb_str_catf,
or writing past RString.len through RSTRING_PTR and then resize it with rb_str_set_len.

The downside with this approach is that it contains a lot of inefficiencies, as rb_str_set_len will perform
numerous safety checks, compute coderange, and write the string terminator on every invocation.

Another major inneficiency is that this API make it hard to be in control of the buffer
growth, so it can result in a lot more realloc() calls than manually managing the buffer.

This method is used by Kernel#sprintf, Time#strftime etc, and when I attempted to improve Time#strftime
performance, this problem showed up as the biggest bottleneck:

Proposed API

I think a more effcient way to do this would be to work with a native buffer, and then build a RString
that "adopt" the memory region.

Technically, you can currently do this by reaching directly into RString members, but I don't think it's clean,
and a dedicated API would be preferable:

/**
 * Similar to rb_str_new(), but it adopts the pointer instead of copying.
 *
 * @param[in]  ptr             A memory region of `capa` bytes length. MUST have been allocated with `ruby_xmalloc`
 * @param[in]  len             Length  of the string,  in bytes,  not including  the
 *                             terminating NUL character, not including extra capacity.
 * @param[in]  capa            The usable length of `ptr`, in bytes,  including  the
 *                             terminating NUL character.
 * @param[in]  enc             Encoding of `ptr`.
 * @exception  rb_eArgError    `len` is negative.
 * @return     An instance  of ::rb_cString,  of `len`  bytes length, `capa - 1` bytes capacity,
 *             and of `enc` encoding.
 * @pre        At  least  `capa` bytes  of  continuous  memory region  shall  be
 *             accessible via `ptr`.
 * @pre        `ptr` MUST have been allocated with `ruby_xmalloc`.
 * @pre        `ptr` MUST not be manually freed after `rb_enc_str_adopt` has been called.
 * @note       `enc` can be a  null pointer.  It can also be  seen as a routine
 *             identical to rb_usascii_str_new() then.
 */
rb_enc_str_adopt(const char *ptr, long len, long capa, rb_encoding *enc);

An alternative to the adopt term, could be move.

Updated by Eregon (Benoit Daloze) 11 days ago

LGTM, +1.
Maybe simply rb_str_adopt() for the name?
That way it's closer to rb_str_new(), and these days all String C API taking a C string should also take an encoding anyway so we don't need enc_ and enc-less variants.

Updated by byroot (Jean Boussier) 11 days ago

Maybe simply rb_str_adopt() for the name?

I don't have a strong opinion here, I just went with the current convention.

On another note:

ptr MUST have been allocated with ruby_xmalloc.

I'm actually not sure this really need to be a MUST, I suppose what is a MUST is that the pointer should be freeable with ruby_xfree, but that's it.

Updated by nobu (Nobuyoshi Nakada) 11 days ago

I think it is unsafe for memory leak, in comparison with "RString allocated memory".

Updated by byroot (Jean Boussier) 11 days ago

I think it is unsafe for memory leak, in comparison with "RString allocated memory".

I'm sorry I don't follow, could you expand on what you mean is unsafe? The entire "adopt" idea?

Updated by shyouhei (Shyouhei Urabe) 11 days ago

byroot (Jean Boussier) wrote in #note-4:

I think it is unsafe for memory leak, in comparison with "RString allocated memory".

I'm sorry I don't follow, could you expand on what you mean is unsafe? The entire "adopt" idea?

There is no reason for us to believe that the const char *ptr was allocated by malloc. It could be done by mmap or dlopen or anything. Ruby cannot garbage collect the string because it simply doesn't know how. Memory leak here is kind of inevitable.

Updated by nobu (Nobuyoshi Nakada) 11 days ago

byroot (Jean Boussier) wrote in #note-4:

I think it is unsafe for memory leak, in comparison with "RString allocated memory".

I'm sorry I don't follow, could you expand on what you mean is unsafe? The entire "adopt" idea?

Whenever you allocate a new object, there is a risk of a memory error.
In that case, who will look after the pointer that is about to be "adopted"?

Updated by byroot (Jean Boussier) 10 days ago

There is no reason for us to believe that the const char *ptr was allocated by malloc.

The proposed function documentation state that the pointer MUST have been allocated with ruby_xmalloc.

henever you allocate a new object, there is a risk of a memory error. In that case, who will look after the pointer that is about to be "adopted"?

I see. From my understanding, the only possible error is OutOfMemory, what if rb_enc_str_adopt would directly call ruby_xfree on the pointer in such case? Would that cover your concern?

Updated by shyouhei (Shyouhei Urabe) 10 days ago

byroot (Jean Boussier) wrote in #note-7:

There is no reason for us to believe that the const char *ptr was allocated by malloc.

The proposed function documentation state that the pointer MUST have been allocated with ruby_xmalloc.

If that's okay that's okay. For instance a return value of asprintf cannot be "adopt"ed then because obviously, that's not allocated by ruby_xmalloc.

Updated by rhenium (Kazuki Yamaguchi) 10 days ago

byroot (Jean Boussier) wrote:

Work inside RString allocated memory

[...]
The downside with this approach is that it contains a lot of inefficiencies, as rb_str_set_len will perform
numerous safety checks, compute coderange, and write the string terminator on every invocation.

I thought rb_str_set_len() was supposed to be the efficient alternative to rb_str_resize() meant for such a purpose.

I think an assert on the capacity or filling the terminator is cheap enough that it won't matter. That it computes coderange is news to me - I found it was since commit 6b66b5fdedb2c9a9ee48e290d57ca7f8d55e01a2 / [Bug #19902] in 2023. I think correcting coderange after directly modifying the RString-managed buffer is the caller's responsibility. Perhaps it could be reversed?

Updated by byroot (Jean Boussier) 10 days ago

If that's okay that's okay. For instance a return value of asprintf cannot be "adopt"ed then because obviously, that's not allocated by ruby_xmalloc.

Yes, that's why I'm wondering if this requirement should be relaxed to "MUST be freeable by ruby_xfree", which I believe would be true for asprintf.

I think an assert on the capacity or filling the terminator is cheap enough that it won't matter.

It seemed to matter when I profiled. In some cases like strftime the string is written byte by byte, so it basically double the cost of appending a byte.

Updated by kddnewton (Kevin Newton) 10 days ago

I would use this in Prism as well. There are many cases where we allocate a string in the parser and then when we reify the Ruby AST we have to copy the string over. But the string content was allocated with ruby_xmalloc. So it would be nice to just hand over the string content without having to make a copy.

Personally I would prefer move as a naming convention, just because it mirrors what I would expect from std::move.

Updated by mdalessio (Mike Dalessio) 10 days ago

This would likely be useful in Nokogiri as well. The two key places I have in mind are

  1. returning a large serialization string generated within libxml2 (which is configured to use ruby_xmalloc by default)
  2. assembling an HTML5-compliant serialization within the extension (which currently uses rb_enc_str_buf_cat)
Actions

Also available in: Atom PDF

Like3
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0