Project

General

Profile

Feature #16029

Expose fstring related APIs to C-extensions

Added by byroot (Jean Boussier) 3 months ago. Updated 3 days ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:94064]

Description

As discussed with @tenderlove here: https://github.com/ruby/ruby/pull/2287#issuecomment-513865160

We'd like to update various data format parsers (JSON, MessagePack, etc) to add the possibility to deduplicate strings while parsing.

But unfortunately the rb_fstring_* family of functions isn't available to C-extensions, so the only available fallback is rb_funcall(str, rb_intern("-@")) which most parsers will likely consider too slow. So the various rb_fstring_* functions would need to be public.

Proposed patch: https://github.com/ruby/ruby/pull/2299

History

Updated by byroot (Jean Boussier) about 2 months ago

mame (Yusuke Endoh) this feature was added by nobu (Nobuyoshi Nakada) in the last developer's meeting issue [#15996], but I suppose it wasn't discussed by lack of time.

Should I just add it back to the next developer's meeting issue once it is created?

Updated by mame (Yusuke Endoh) about 2 months ago

ko1 (Koichi Sasada) and nobu (Nobuyoshi Nakada) discussed the issue. The memo says:

ko1, nobu: now it is difficult to expose them because of current implementation.

I didn't follow the discussion, so I don't know the detail. ko1 (Koichi Sasada) , nobu (Nobuyoshi Nakada) , could you explain them?

Updated by sam.saffron (Sam Saffron) 12 days ago

I think the larger change here is allowing for a new type of API.

From a performance perspective the people using the new API would like to avoid allocating an RVALUE unless needed, the current fstring APIs require an RVALUE unless you give it a constant string I think?

Perhaps what we need here is the ability to ask Ruby: "Hey do you have an fstring for cstr X? If so then you use it, otherwise you do the slow path of allocating an RVALUE and passing it in to the fstring function.

Updated by byroot (Jean Boussier) 12 days ago

sam.saffron (Sam Saffron) Unless I'm missing something, that's exactly what rb_fstring_new / rb_fstring_cstr does.

VALUE rb_fstring_new(const char *ptr, long len);
VALUE rb_fstring_cstr(const char *str);

AFAICT It does allocate an RVALUE to lookup the table, but it does it on the stack, so I think it's fine GC wise.

Updated by sam.saffron (Sam Saffron) 11 days ago

I think when it gets called it expects to reuse the memory allocated by the cstr eventually

https://github.com//blob/96753e8475ee69537134ab3d966c3d25cb5c467c/string.c#L287-L292

So if your library is in charge of the memory for the object this is not desirable, you want to simply ask the question "do you have an fstring for the current string eg"

VALUE rb_fstring_lookup(char *ptr);

This non const char means it would not take ownership and the thing can return Qnil if there is no fstring.

Then if 99.999% of the string your library has are already fstrings, the lookup becomes a super cheap function you can use to lookup.

Updated by sam.saffron (Sam Saffron) 6 days ago

I was thinking something like?

VALUE
rb_fstring_lookup(char *ptr, rb_encoding *enc)
{
    st_data_t fstring;
    struct RString fake_str;
    setup_fake_str(&fake_str, ptr, len, ENCINDEX_US_ASCII)

    st_table *frozen_strings = rb_vm_fstring_table();

    if (!st_lookup(frozen_strings, (st_data_t)fake_str, &fstring)) {
    return Qnil;
    }

    return ret;
}

Updated by ko1 (Koichi Sasada) 3 days ago

Hi.

(1) implementation

Current implementation can have issue (related to shared string) and this issue can cause something wrong behavior for C-extension.
Sorry, we need a time to confirm.

(2) naming

i think it should be rb_str_... prefix.

Updated by sam.saffron (Sam Saffron) 3 days ago

Koichi,

What about rb_str_fstring_lookup and rb_str_fstring_lookup_enc? Both will not create strings so shared strings should not be a problem.

To be honest creation can be somewhat inefficient, the one place I can see this being used is in DB drivers like PG where a consumer keeps asking for field names over and over.

Also available in: Atom PDF