Feature #13077

[PATCH] introduce String#fstring method

Added by normalperson (Eric Wong) over 2 years ago. Updated about 2 years ago.

Status: Closed
Priority: Normal
Assignee: -
Target version: -
[ruby-core:78852]

Description

introduce String#fstring method

This exposes the rb_fstring internal function to return a
deduped and frozen string. This is useful for all sorts of
record processing where arbitrary keys and values may be stored,
but certain keys and values are duplicated at a high frequency,
so the memory savings can be noticeable.

Use cases are many:

  • email/NNTP header processing

There are some standard header keys everybody uses
(From/To/Cc/Date/Subject/Received/Message-ID/References/In-Reply-To),
as well as common ones specific to certain lists:
(ruby-core has X-Redmine-* headers)
It is also useful to dedupe values, as most inboxes have
multiple messages from the same sender, or MUA.

  • package management systems -
    things like RubyGems store identical strings for licenses,
    dependency names, author names/emails, etc.

  • HTTP headers/trailers -
    standard headers (Host/Accept/Accept-Encoding/User-Agent/...)
    are common, but there are also uncommon ones.
    Values may be deduped, as well, as it is likely a user
    agent will make multiple/parallel requests to the same
    server.

  • version control systems -
    this can be useful for deduplicating names of frequent
    committers (like "nobu" :)

In linux.git and git.git, there are also common
trailers such as Signed-off-by/Acked-by/Reviewed-by/Fixes/...
as well as less common ones.

  • audio metadata -

There are commonly used tags (Artist/Album/Title/Tracknumber),
but Vorbis comments allow arbitrary key/value pairs to be stored.
Music collections contain songs by the same artist or multiple
songs from the same album, so deduplicating values will be
helpful there, too.

  • JSON, YAML, XML, HTML processing

Certain fields, tags and attributes are commonly used
within a single document and across multiple documents.


Files

0001-introduce-String-fstring-method.patch (3.47 KB), normalperson (Eric Wong), 12/27/2016 01:50 AM

Associated revisions

Revision 4e90dcc9
Added by normal about 2 years ago

string.c (str_uminus): deduplicate strings

This exposes the rb_fstring internal function to return a
deduped and frozen string when a non-frozen string is given.
This is useful for all sorts of record processing where
arbitrary keys and values may be stored, but certain keys and
values are duplicated at a high frequency, so the memory savings
can be noticeable.

Use cases are many:

  • email/NNTP header processing

There are some standard header keys everybody uses
(From/To/Cc/Date/Subject/Received/Message-ID/References/In-Reply-To),
as well as common ones specific to certain lists:
(ruby-core has X-Redmine-* headers)
It is also useful to dedupe values, as most inboxes have
multiple messages from the same sender, or MUA.

  • package management systems -
    things like RubyGems store identical strings for licenses,
    dependency names, author names/emails, etc.

  • HTTP headers/trailers -
    standard headers (Host/Accept/Accept-Encoding/User-Agent/...)
    are common, but there are also uncommon ones.
    Values may be deduped, as well, as it is likely a user
    agent will make multiple/parallel requests to the same
    server.

  • version control systems -
    this can be useful for deduplicating names of frequent
    committers (like "nobu" :)

In linux.git and git.git, there are also common
trailers such as Signed-off-by/Acked-by/Reviewed-by/Fixes/...
as well as less common ones.

  • audio metadata -

There are commonly used tags (Artist/Album/Title/Tracknumber),
but Vorbis comments allow arbitrary key/value pairs to be stored.
Music collections contain songs by the same artist or multiple
songs from the same album, so deduplicating values will be
helpful there, too.

  • JSON, YAML, XML, HTML processing

Certain fields, tags and attributes are commonly used
within a single document and across multiple documents.

There is no security concern about this becoming a DoS vector by
creating immortal strings. The fstring table is not a GC-root
and not walked during the mark phase. GC-able dynamic symbols
since Ruby 2.2 are handled in the same manner, and that
implementation also relies on the non-immortality of fstrings.

[Feature #13077] [ruby-core:79663]

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@57698 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
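
The header-processing use case from the commit message above can be sketched in Ruby. This is only an illustration, assuming Ruby 2.5 or later (where this String#-@ change is available); the sample header lines and variable names are hypothetical and do not come from the patch.

```ruby
# Hypothetical raw header lines, as they might arrive from two messages.
raw = [
  "From: alice@example.com",
  "Subject: hello",
  "From: alice@example.com",
  "Subject: re: hello",
]

# String#split allocates a fresh String for every key and value.
# Unary minus (String#-@) swaps each one for a frozen, deduplicated
# string, so repeated keys and values end up sharing one object.
headers = raw.map do |line|
  key, value = line.split(": ", 2)
  [-key, -value]
end

from_keys = headers.select { |k, _| k == "From" }.map(&:first)
puts from_keys[0].equal?(from_keys[1])     # same object for both "From" keys
puts headers[0][1].equal?(headers[2][1])   # same object for the repeated sender
```

Without the unary minus, each parsed line would keep its own copy of "From" and of the sender address.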

Revision 15ef28a9
Added by normal about 2 years ago

NEWS: document String#-@ change

  • test/ruby/test_string.rb (test_uplus_minus): test deduplication [ruby-core:79747] [Feature #13077]

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@57710 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

History

Updated by shevegen (Robert A. Heiler) over 2 years ago

I have no particular pro or con opinion on the proposal in itself so feel free to ignore this.

The only comment I have is that the name .fstring() is a bit strange. On first read, I
assumed that it was short for "format_string" like % on class String or sprintf.

In the proposal I read that it is for frozen_string, e.g. rb_fstring. While I don't have
anything against the functionality, and I also don't fully mind a method called
fstring(), I think that at the least a longer alias name to it would be nice to have too
such as frozen_string or something that is more readable on a first look. (I can't comment
on whether the functionality in itself is useful or not but I assume that Eric has had
a good reason which he described too, so I have no qualms at all with the functionality
in itself, only the method-name part.)

Updated by normalperson (Eric Wong) over 2 years ago

shevegen@gmail.com wrote:

The only comment I have is that the name .fstring() is a bit strange. On first read, I
assumed that it was short for "format_string" like % on class String or sprintf.

Yeah, the name isn't final, of course; naming is the hardest
problem in computer science :<

Maybe "dedup" is better and still short (I would expect the user
to know deduplication implicitly requires a frozen string)

Updated by Eregon (Benoit Daloze) over 2 years ago

So this is essentially like Java's String.intern()?

There is already String#intern in Ruby but it returns a Symbol.
Depending on the use-case, I guess this might be less convenient than getting a de-duplicated String.
String#dedup sounds better than #fstring.

Updated by normalperson (Eric Wong) over 2 years ago

eregontp@gmail.com wrote:

So this is essentially like Java's String.intern()?

There is already String#intern in Ruby but it returns a Symbol.
Depending on the use-case, I guess this might be less convenient than getting a de-duplicated String.

Yeah, I considered using intern/to_sym for my use case;
but the problem is that it still creates a new string object
whenever it needs to be written/printed/concatenated.

And I also feel using symbol like this is ugly (just a gut
feeling), despite having GC-able symbols since 2.2.
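
The allocation behavior described above can be checked with a short sketch. It uses only core String and Symbol methods; the header name is an arbitrary example.

```ruby
sym = "Message-ID".to_sym    # String#intern is an alias and gives the same Symbol

# Symbol#to_s builds a new String on every call, so writing or
# concatenating a symbol-based key allocates again each time.
first  = sym.to_s
second = sym.to_s

puts first == second         # same content
puts first.equal?(second)    # but two separate objects
```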

String#dedup sounds better than #fstring.

Yes. Let's wait for Matz to comment.

Updated by nobu (Nobuyoshi Nakada) over 2 years ago

Why not String#-@?

Updated by normalperson (Eric Wong) over 2 years ago

nobu@ruby-lang.org wrote:

Why not String#-@?

As in the following? (short patch, full below)

--- a/string.c
+++ b/string.c
@@ -10002,7 +9989,7 @@ Init_String(void)
     rb_define_method(rb_cString, "scrub!", str_scrub_bang, -1);
     rb_define_method(rb_cString, "freeze", rb_str_freeze, 0);
     rb_define_method(rb_cString, "+@", str_uplus, 0);
-    rb_define_method(rb_cString, "-@", str_uminus, 0);
+    rb_define_method(rb_cString, "-@", rb_fstring, 0);

     rb_define_method(rb_cString, "to_i", rb_str_to_i, -1);
     rb_define_method(rb_cString, "to_f", rb_str_to_f, 0);

Changing the behavior of an existing method might break compatibility,
but test-all and test-rubyspec seem to pass...

full: https://80x24.org/spew/20161228024937.9345-1-e@80x24.org/raw

Updated by matz (Yukihiro Matsumoto) about 2 years ago

For the time being, let us make -@ call rb_fstring.
If users want a more descriptive name, let's discuss later.
In my opinion, fstring is not acceptable.

Matz.

Updated by normalperson (Eric Wong) about 2 years ago

matz@ruby-lang.org wrote:

For the time being, let us make -@ call rb_fstring.
If users want a more descriptive name, let's discuss later.
In my opinion, fstring is not acceptable.

OK, I think the following is always backwards compatible,
unlike my previous [ruby-core:78884]:

--- a/string.c
+++ b/string.c
@@ -2530,7 +2530,7 @@ str_uminus(VALUE str)
    return str;
     }
     else {
-   return rb_str_freeze(rb_str_dup(str));
+   return rb_fstring(str);
     }
 }

Will commit in a day or two.
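
Seen from Ruby code, the effect of this change on unfrozen strings can be sketched as follows. This assumes Ruby 2.5 or later; the header name and variable names are made up for illustration.

```ruby
# Build two separate, unfrozen strings with identical content.
a = "X-Redmine-Project".dup
b = "X-Redmine-Project".dup

da = -a   # String#-@ on an unfrozen string: frozen, deduplicated copy
db = -b

puts a.equal?(b)      # false: distinct objects
puts da.equal?(db)    # true: both map to one shared frozen string
puts da.frozen?       # true
puts a.frozen?        # false: the receiver itself is left unfrozen
```

Because the receiver is not frozen in place, code that still holds a reference to `a` can keep mutating it.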

Updated by shyouhei (Shyouhei Urabe) about 2 years ago

A bit of security consideration:

Am I correct that rb_vm_fstring_table() is never GCed? If so, feeding user-generated strings to that table needs extra care. Malicious user input might exhaust memory.

Updated by nobu (Nobuyoshi Nakada) about 2 years ago

Shyouhei Urabe wrote:

Am I correct that rb_vm_fstring_table() is never GCed?

That table is not a GC-root, and registered strings get GCed as usual.

Updated by shyouhei (Shyouhei Urabe) about 2 years ago

Nobuyoshi Nakada wrote:

Shyouhei Urabe wrote:

Am I correct that rb_vm_fstring_table() is never GCed?

That table is not a GC-root, and registered strings get GCed as usual.

So this is a kind of weak reference? No security concern then.

Updated by Anonymous about 2 years ago

  • Status changed from Open to Closed

Applied in changeset r57698.


Updated by normalperson (Eric Wong) about 2 years ago

shyouhei@ruby-lang.org wrote:

Nobuyoshi Nakada wrote:

Shyouhei Urabe wrote:

Am I correct that rb_vm_fstring_table() is never GCed?

That table is not a GC-root, and registered strings get GCed as usual.

So this is a kind of weak reference? No security concern then.

Right. Also, keep in mind that dynamic GC-able symbols from
2.2+ also store symbol names as fstrings. Thus GC-able symbols
would not work if fstrings could not be GC-ed.

Anyways, committed as r57698

Updated by Eregon (Benoit Daloze) about 2 years ago

Eric Wong wrote:

Anyways, committed as r57698

This should have a NEWS entry and tests since it changes the semantics.

BTW, should my_string.freeze behave similarly to String#-@?
Otherwise String#freeze only dedups if the String is a literal.
Always deduping for String#freeze would make the semantics more consistent.

Updated by normalperson (Eric Wong) about 2 years ago

eregontp@gmail.com wrote:

Eric Wong wrote:

Anyways, committed as r57698

This should have a NEWS entry and tests since it changes the semantics.

Sorry, I forgot; will do. Thanks for the reminder.

BTW, should my_string.freeze behave similarly to String#-@?
Otherwise String#freeze only dedups if the String is a literal.
Always deduping for String#freeze would make the semantics more consistent.

No. There is existing code which assumes #freeze always returns
the same object as its receiver. Changing #freeze will break
existing code.

We can only cheat with String literals (opt_str_freeze) because
literals are not assigned to user-visible variables, yet.
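
The difference between the two cases can be sketched in Ruby. This is an illustration of the behavior described above, not code from the patch; the string contents are arbitrary.

```ruby
s = "Received".dup          # an unfrozen string held in a variable

frozen = s.freeze           # String#freeze freezes in place...
puts frozen.equal?(s)       # ...and returns the receiver itself

# A literal followed directly by .freeze has no other reference yet,
# so the compiler can substitute a shared, deduplicated string.
x = "Received".freeze
y = "Received".freeze
puts x.equal?(y)
```

Code that relies on `s.freeze` returning `s` would break if #freeze started returning a different, deduplicated object.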

Updated by Eregon (Benoit Daloze) about 2 years ago

Eric Wong wrote:

No. There is existing code which assumes #freeze always returns
the same object as its receiver. Changing #freeze will break
existing code.

We can only cheat with String literals (opt_str_freeze) because
literals are not assigned to user-visible variables, yet.

Oh indeed, that slipped my mind, thanks for the explanation.
