Project

General

Profile

Feature #16029

Expose fstring related APIs to C-extensions

Added by byroot (Jean Boussier) 4 months ago. Updated about 1 month ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:94064]

Description

As discussed with @tenderlove here: https://github.com/ruby/ruby/pull/2287#issuecomment-513865160

We'd like to update various data format parsers (JSON, MessagePack, etc) to add the possibility to deduplicate strings while parsing.

But unfortunately the rb_fstring_* family of functions isn't available to C-extensions, so the only available fallback is rb_funcall(str, rb_intern("-@")) which most parsers will likely consider too slow. So the various rb_fstring_* functions would need to be public.

Proposed patch: https://github.com/ruby/ruby/pull/2299

History

Updated by byroot (Jean Boussier) 3 months ago

mame (Yusuke Endoh) this feature was added by nobu (Nobuyoshi Nakada) in the last developer's meeting issue [#15996], but I suppose it wasn't discussed by lack of time.

Should I just add it back to the next developer's meeting issue once it is created?

Updated by mame (Yusuke Endoh) 3 months ago

ko1 (Koichi Sasada) and nobu (Nobuyoshi Nakada) discussed the issue. The memo says:

ko1, nobu: now it is difficult to expose them because of current implementation.

I didn't follow the discussion, so I don't know the detail. ko1 (Koichi Sasada) , nobu (Nobuyoshi Nakada) , could you explain them?

Updated by sam.saffron (Sam Saffron) 2 months ago

I think the larger change here is allowing for a new type of API.

From a performance perspective the people using the new API would like to avoid allocating an RVALUE unless needed, the current fstring APIs require an RVALUE unless you give it a constant string I think?

Perhaps what we need here is the ability to ask Ruby: "Hey do you have an fstring for cstr X? If so then you use it, otherwise you do the slow path of allocating an RVALUE and passing it in to the fstring function.

Updated by byroot (Jean Boussier) 2 months ago

sam.saffron (Sam Saffron) Unless I'm missing something, that's exactly what rb_fstring_new / rb_fstring_cstr does.

VALUE rb_fstring_new(const char *ptr, long len);
VALUE rb_fstring_cstr(const char *str);

AFAICT It does allocate an RVALUE to lookup the table, but it does it on the stack, so I think it's fine GC wise.

Updated by sam.saffron (Sam Saffron) about 2 months ago

I think when it gets called it expects to reuse the memory allocated by the cstr eventually

https://github.com//blob/96753e8475ee69537134ab3d966c3d25cb5c467c/string.c#L287-L292

So if your library is in charge of the memory for the object this is not desirable, you want to simply ask the question "do you have an fstring for the current string eg"

VALUE rb_fstring_lookup(char *ptr);

This non const char means it would not take ownership and the thing can return Qnil if there is no fstring.

Then if 99.999% of the string your library has are already fstrings, the lookup becomes a super cheap function you can use to lookup.

Updated by sam.saffron (Sam Saffron) about 2 months ago

I was thinking something like?

VALUE
rb_fstring_lookup(char *ptr, rb_encoding *enc)
{
    st_data_t fstring;
    struct RString fake_str;
    setup_fake_str(&fake_str, ptr, len, ENCINDEX_US_ASCII)

    st_table *frozen_strings = rb_vm_fstring_table();

    if (!st_lookup(frozen_strings, (st_data_t)fake_str, &fstring)) {
    return Qnil;
    }

    return ret;
}

Updated by ko1 (Koichi Sasada) about 2 months ago

Hi.

(1) implementation

Current implementation can have issue (related to shared string) and this issue can cause something wrong behavior for C-extension.
Sorry, we need a time to confirm.

(2) naming

i think it should be rb_str_... prefix.

Updated by sam.saffron (Sam Saffron) about 2 months ago

Koichi,

What about rb_str_fstring_lookup and rb_str_fstring_lookup_enc? Both will not create strings so shared strings should not be a problem.

To be honest creation can be somewhat inefficient, the one place I can see this being used is in DB drivers like PG where a consumer keeps asking for field names over and over.

Updated by byroot (Jean Boussier) about 1 month ago

What about rb_str_fstring_lookup and rb_str_fstring_lookup_enc?

I don't think a lookup would be enough for what I'd like to do.

A typical use case would be a JSON document with lots of duplicated strings.

If we only lookup the fstring table, then only the strings already present in the codebase would be deduplicated. IMHO it would be much better to create them all as fstrings.

Also available in: Atom PDF