Project

General

Profile

Actions

Feature #19090

closed

Do not duplicate an unescaped string in CGI.escapeHTML

Added by k0kubun (Takashi Kokubun) over 1 year ago. Updated over 1 year ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:110542]

Description

Proposal

Stop guaranteeing that GGI.escapeHTML returns a new string even if there's nothing to be escaped.

More specifically, stop calling this rb_str_dup https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.

Background

My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why rb_str_dup was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original gsub-based implementation always returns a new object. As a result, even while many people use CGI.escapeHTML as an optimized implementation for escaping HTML today, it ended up having a compromised performance.

Motivation

The motivation is to improve performance. By just doing so, escaping a pre-allocated "string" becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.

The most major use case of CGP.escapeHTML is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of '"&<> characters. So we should stop wasting that to optimize that case.

[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with gem-codesearch today.

The only reason to maintain the current behavior would be to allow using a return value of CGI.escapeHTML as a buffer for creating another longer string starting with the escaped value, but using CGI.escapeHTML to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.

Why not an optional flag like CGI.escapeHTML(str, dup: false)?

Two reasons:

  • The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using CGI.escapeHTML less readable just for maintaining a use case that doesn't exist.
  • Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use Primitive to address that, but this is a default gem and we can't use that.
    • We could workaround that if we choose CGI.escapeHTML(str, false), but again it'd spoil the readability for maintaining an invalid use case.

Why not a new method?

It's a good idea actually, but with escapeHTML, escape_html, and h aliased to it already, I can't think of a good name for it. And again, not calling it escapeHTML or escape_html would spoil the readability for no valid reason.


Related issues 2 (0 open2 closed)

Related to Ruby master - Bug #11858: CGI.escapeHTML should NOT return frozen stringClosedActions
Related to Ruby master - Feature #19102: Optimize ERB::Util.html_escape more than CGI.escapeHTML for template enginesClosedActions
Actions #1

Updated by k0kubun (Takashi Kokubun) over 1 year ago

  • Description updated (diff)
Actions #2

Updated by k0kubun (Takashi Kokubun) over 1 year ago

  • Related to Bug #11858: CGI.escapeHTML should NOT return frozen string added
Actions #3

Updated by k0kubun (Takashi Kokubun) over 1 year ago

  • Description updated (diff)

Updated by Eregon (Benoit Daloze) over 1 year ago

Isn't rb_str_dup copy-on-write and so should be fairly cheap?

Updated by Dan0042 (Daniel DeLorme) over 1 year ago

I agree the dup is unnecessary since 99.99% of the time you're just doing buf << CGI.escapeHTML(str). Not sure the performance gain would be measurable. A new method like CGI.escapeHTML! would make sense to me, as it indicates the danger of mutating the return value.

I did a search for CGI.escapeHTML in gems: https://pastebin.com/7HYUTASZ
There's a lot of safe usage where the result is directly used in interpolation or concatenation.
There's also some "potentially unsafe" usage like

      def escape_html(string)
        CGI.escapeHTML(string)
      end

which depends on how the escape_html method is called, but in general the only valid use case for these methods is when appending to a buffer. So I'd say it's a low-risk change.

Updated by k0kubun (Takashi Kokubun) over 1 year ago

Isn't rb_str_dup copy-on-write and so should be fairly cheap?

As I wrote in the description, calling rb_str_dup is making the whole method 1.34x slower no matter how efficient CoW is. Whether the 34% slowdown is cheap or not depends on how often the method is used and hits the non-escaped case. What I'm saying is that this method is used for almost all embedded expressions in templates and it hits the non-escaped case most of the time. Let's say you have a page that lists 100 resources with 5 fields, it would call CGI.escapeHTML at least 500 times and many of them could have no '"&<> characters.

Generally, escaping an HTML is known to be one of the largest bottlenecks in template engine benchmarks that enable HTML escaping, which is why I've literally spent years optimizing this method. I would never say a 34% slowdown in CGI.escapeHTML is cheap. When I modified the benchmark created by the Slim team to escape embedded expressions, the benchmark became 1.1x faster by just removing this rb_str_dup call. So the 34% slowdown in the microbenchmark of this method translates to a 10% slowdown in the template rendering.

A new method like CGI.escapeHTML! would make sense to me, as it indicates the danger of mutating the return value.

It's not doing any mutation by itself, and I don't know of any existing method named like that. Thus Matz would not like the name, and I'm not sure if the library maintainer (@nobu (Nobuyoshi Nakada)) likes it either. But thanks for proposing an alternative name.

Updated by Dan0042 (Daniel DeLorme) over 1 year ago

You know, if speed is that much a concern, I think optimized_escape_html2 should seek the first escapable character and only if found then do the ALLOCV_N (and memcpy from cstr to buf).

Updated by Dan0042 (Daniel DeLorme) over 1 year ago

Actually an even better approach may be to append the escaped bytes directly to the final output buffer, instead of generating any intermediary string at all. As such I would like to propose:
CGI.escapeHTML(str, "") append to new string (current behavior)
CGI.escapeHTML(str, nil) append to new string or return unmodified str (new default?)
CGI.escapeHTML(str, buffer) append to buffer (most efficient when rendering mutliple strings in template)

Updated by k0kubun (Takashi Kokubun) over 1 year ago

You know, if speed is that much a concern, I think optimized_escape_html2 should seek the first escapable character and only if found then do the ALLOCV_N (and memcpy from cstr to buf).

ALLOCV_N is essentially alloca. I'd say this one is indeed "fairly cheap". Feel free to prove it otherwise by submitting a patch and benchmarking it though.

As to memcpy, while technically memcpy isn't called for the case we're discussing, *dest++ = c; feels like an unneeded overhead and I have an idea to improve it. I'd discuss it in a different ticket though.

Actually an even better approach may be to append the escaped bytes directly to the final output buffer, instead of generating any intermediary string at all.

You seem to assume that the buffer is bare String, but it's not necessarily the case for Rails. I believe it could/should be fixed, but it's a different discussion.

Updated by k0kubun (Takashi Kokubun) over 1 year ago

We might be able to make it efficient for a non-String buffer too, so I'll give it a try first. Thanks for the idea.

Updated by k0kubun (Takashi Kokubun) over 1 year ago

  • Status changed from Open to Closed

On second thought, it might not be necessarily more efficient even with a bare String buffer because the point of the current CGI.escapeHTML implementation is to pre-allocate a 6x-size buffer on the stack to avoid multiple reallocations for buffer extension. That optimization is hard if you need to directly write to the argument String from the beginning; the best thing you could do would be to realloc it to a 6x size first and then realloc it again to shrink it. Doing object reallocation twice seems expensive compared to just manipulating a stack pointer, even combined with extra memcpy.

Plus, as a maintainer of many template engines, this interface seems hard to use in some cases. It seems to help only limited places with a lot of complexity.


I came up with a completely different solution. I'll file it as a different ticket after verifying the idea. So closing this. Thank you for the discussion.

Updated by Dan0042 (Daniel DeLorme) over 1 year ago

It's true ALLOCV_N is almost free but I was thinking more about the cost of copying between buffers
a) when nothing needs escaping: copy bytes to tmp buffer then discard
b) when something needs escaping: copy bytes to tmp buffer then copy again to final string

You seem to assume that the buffer is bare String, but it's not necessarily the case for Rails.

If buffer is bare String we can use CGI.escapeHTML(str, buffer), and if something else (Array?) we can use CGI.escapeHTML(str). I think it's nice to support both cases.

6x allocation is optimal for worst cast like '"' * 1000 but personally I would optimize for common case. So grow the string by an amount sufficient for 99.9% of cases (maybe something like (N+30)*1.2) and accept a realloc for degenerate cases. It's a different kind of choice/trade-off.

Thank you for the discussion, it's a fun topic.

Actions #13

Updated by Eregon (Benoit Daloze) over 1 year ago

  • Related to Feature #19102: Optimize ERB::Util.html_escape more than CGI.escapeHTML for template engines added

Updated by Eregon (Benoit Daloze) over 1 year ago

I did not expect rb_str_dup() is so costly on CRuby, I guess the allocation is slow and of course CRuby can't escape-analyze it.
I think it is important to keep the optimized HTML escape in core/stdlib (e.g., in CGI), as the fastest way is implementation-dependent.
See https://bugs.ruby-lang.org/issues/19102#note-4

I think we should add an optional argument to avoid copying the string when unchanged, that's easy and avoids yet another place/method with that escaping logic.

Updated by k0kubun (Takashi Kokubun) over 1 year ago

I filed a PR for truffleruby https://github.com/ruby/erb/pull/39.

I think it is important to keep the optimized HTML escape in core/stdlib (e.g., in CGI)

I didn't get what you mean in this part. CGI and ERB are both default gems. There's no difference in its "coreness" between them.

Updated by Eregon (Benoit Daloze) over 1 year ago

Right, I forgot CGI is a default gem too.
I think it seems cleaner for other template engines (e.g. haml, slim, etc) to depend (as in require "cgi") on CGI vs depending on ERB, i.e. CGI feels smaller/like a subset of ERB. In other words, it doesn't seem ideal for other template engines to depend on the ERB template engine just for escaping (although practically speaking there is no issue, just from a design perspective).

Updated by k0kubun (Takashi Kokubun) over 1 year ago

I get what you're saying. My position on this issue is:

  • CGI is not a good place either unless you're writing a CGI application. ERB also has ERB::Escape now, and I'd say "embedded Ruby escape" is a better module name than "CGI" for the purpose.
  • ERB 4.0.0 was shipped with a new file erb/util.rb, which allows you to load only a couple of escape methods, not loading an extra template engine.
  • The way we defined the method was designed by @jeremyevans0 (Jeremy Evans) for Erubi. Loading the ERB template engine would be the last thing the Erubi maintainer would like to do. So, thanks to his idea, it's possible to load only escape methods.
  • ERB::Util.html_escape is monkey-patched by Rails and it's been the only official html_safe-compatible HTML escape method for years (while it's been using Erubis or Erubi). Since it's the only Rails-official way to do it, moving it to another place would be unnecessarily disruptive.

Updated by Eregon (Benoit Daloze) over 1 year ago

Thank you for the explanation, that's very good arguments and they address my concerns.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0