Bug #9262


global_method_cache should be configurable or grow automatically

Added by tmm1 (Aman Karmani) almost 11 years ago. Updated almost 8 years ago.

Status: Closed
Target version:
ruby -v: -
Backport:
[ruby-core:59198]

Description

The global_method_cache is currently fixed at 2048 entries. This is not configurable and can be inadequate for large applications with thousands of classes and methods.

In our app, increasing the size of the cache to 32768 entries reduced time spent in search_method and overall pressure on st_lookup:

before:
  420  14.0%  st_lookup
  182   6.1%  vm_search_method (inline)

after:
  265   9.5%  st_lookup
  125   4.5%  vm_search_method (inline)

It's possible the VM could grow global_method_cache automatically, using some heuristic based on the number of long-lived classes or method entries that are defined.
However, this would break the hashing in the current implementation.
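For background, the global method cache is essentially a direct-mapped table whose size must be a power of two so that indexing can use a bitmask. The sketch below is illustrative only (the struct layout, field names, and hash function are simplified stand-ins, not MRI's actual definitions); it shows both why a fixed table thrashes and why simply growing it invalidates the existing hashing:

```c
#include <stddef.h>
#include <stdint.h>

#define GLOBAL_METHOD_CACHE_SIZE 2048            /* must stay a power of two */
#define GLOBAL_METHOD_CACHE_MASK (GLOBAL_METHOD_CACHE_SIZE - 1)

/* Illustrative cache entry: a (class, method id) key plus the cached result. */
struct cache_entry {
    uint64_t class_serial;  /* invalidation token for the class */
    uint64_t mid;           /* method id */
    void    *method_entry;  /* cached lookup result */
};

static struct cache_entry global_method_cache[GLOBAL_METHOD_CACHE_SIZE];

/* Direct-mapped index: one fixed slot per (class, mid) pair. Two pairs that
 * hash to the same slot evict each other, which is the thrashing described
 * in this ticket. Growing the table changes the mask, so every previously
 * computed slot becomes invalid: that is why naive automatic growth breaks
 * the hashing. */
static size_t
cache_index(uint64_t class_serial, uint64_t mid)
{
    return (size_t)((class_serial ^ mid) & GLOBAL_METHOD_CACHE_MASK);
}
```

With 2048 slots, any two keys whose hashes differ by a multiple of 2048 collide into the same slot and evict each other on every alternating lookup.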

Updated by sam.saffron (Sam Saffron) almost 11 years ago

Before I measure the changes proposed by funny_falcon, I wanted to present the sad state of affairs.

I applied this patch to gather method cache statistics: https://github.com/SamSaffron/ruby/commit/e6ecf6d8390b2e4036b91778268a7544f364b8f1

I then wrote some middleware to measure the impact of the method cache while browsing around Discourse in production mode:

class CacheStatMiddleware
  def initialize(app, config = {})
    @app = app
  end

  def call(env)
    start  = Time.new
    hits   = Kernel.cache_hits
    misses = Kernel.cache_misses
    time   = Kernel.cache_miss_time

    r = @app.call(env)
    lost  = Kernel.cache_miss_time - time
    total = ((Time.new - start) * 1000 * 1000).to_i

    pct = (lost.to_f / total) * 100

    puts "hits #{Kernel.cache_hits - hits} " <<
         "misses #{Kernel.cache_misses - misses} " <<
         "microsecs lost #{lost} #{pct.round(2)}% " <<
         "cache usage #{Kernel.cache_entries} / #{Kernel.cache_size} " <<
         "req duration #{total}"

    r
  end
end

config.middleware.insert 0, CacheStatMiddleware


Here are some results:

hits 74383 misses 5095 microsecs lost 4620 7.75% cache usage 2048 / 2048 req duration 59613
hits 74033 misses 5093 microsecs lost 4750 8.04% cache usage 2048 / 2048 req duration 59070
hits 73962 misses 5106 microsecs lost 4509 7.74% cache usage 2048 / 2048 req duration 58253
hits 677 misses 229 microsecs lost 393 6.23% cache usage 2048 / 2048 req duration 6305
hits 523 misses 35 microsecs lost 211 3.68% cache usage 2048 / 2048 req duration 5738
hits 549 misses 31 microsecs lost 279 3.58% cache usage 2048 / 2048 req duration 7804
hits 4412 misses 143 microsecs lost 97 3.14% cache usage 2048 / 2048 req duration 3088
hits 47 misses 18 microsecs lost 21 8.97% cache usage 2048 / 2048 req duration 234
hits 3405 misses 130 microsecs lost 152 4.88% cache usage 2048 / 2048 req duration 3115
hits 519 misses 158 microsecs lost 214 8.1% cache usage 2048 / 2048 req duration 2641
hits 114133 misses 14584 microsecs lost 14338 9.31% cache usage 2048 / 2048 req duration 154000
hits 44835 misses 2754 microsecs lost 2850 8.16% cache usage 2048 / 2048 req duration 34914
hits 75084 misses 8601 microsecs lost 8048 10.95% cache usage 2048 / 2048 req duration 73475


Conclusion:

For an app like Discourse, 3-10% of request time is spent looking up methods, due to cache inefficiency.

Out of the box, the cache is just too small to fit all the entries required, so it's thrashing.


If we increase the method cache to 16x its size we see:

hits 720 misses 5 microsecs lost 11 0.44% cache usage 17045 / 32768 req duration 2509
hits 40 misses 0 microsecs lost 0 0.0% cache usage 17061 / 32768 req duration 242
hits 53076 misses 266 microsecs lost 685 1.67% cache usage 17090 / 32768 req duration 41001
hits 47234 misses 238 microsecs lost 315 1.24% cache usage 17117 / 32768 req duration 25418
hits 53012 misses 281 microsecs lost 850 2.03% cache usage 17145 / 32768 req duration 41803
hits 720 misses 5 microsecs lost 15 0.65% cache usage 17163 / 32768 req duration 2308
hits 52353 misses 261 microsecs lost 605 1.63% cache usage 17190 / 32768 req duration 37031
hits 46785 misses 190 microsecs lost 429 1.32% cache usage 17207 / 32768 req duration 32392
hits 52893 misses 273 microsecs lost 610 1.71% cache usage 17240 / 32768 req duration 35571
hits 720 misses 5 microsecs lost 7 0.45% cache usage 17258 / 32768 req duration 1551
hits 52295 misses 261 microsecs lost 709 1.75% cache usage 17287 / 32768 req duration 40552
hits 46784 misses 191 microsecs lost 393 1.35% cache usage 17308 / 32768 req duration 29146
hits 86409 misses 2013 microsecs lost 2655 3.4% cache usage 17829 / 32768 req duration 78033
hits 644 misses 7 microsecs lost 22 0.64% cache usage 17843 / 32768 req duration 3458
hits 46729 misses 295 microsecs lost 423 1.73% cache usage 17864 / 32768 req duration 24385
hits 82777 misses 893 microsecs lost 1807 2.57% cache usage 17972 / 32768 req duration 70341
hits 46819 misses 191 microsecs lost 400 1.18% cache usage 17989 / 32768 req duration 33862
hits 82412 misses 865 microsecs lost 1426 2.12% cache usage 18099 / 32768 req duration 67163


Even when the cache has room, too many collisions are happening, forcing evictions when they are not needed and leaving about 2% of performance on the table. Still, this is significantly better than the 3-10% seen before.

This could possibly be fixed by introducing quadratic probing, or perhaps a cuckoo-hashing-like algorithm.
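As a sketch of the quadratic-probing idea (illustrative only, not a patch against MRI): instead of a single fixed slot per key, a short bounded probe sequence is tried, so a colliding entry can land in a nearby slot instead of evicting immediately.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SIZE 2048
#define CACHE_MASK (CACHE_SIZE - 1)
#define MAX_PROBES 4   /* bound the probe sequence to keep lookups cheap */

/* Quadratic probe sequence: slot, slot+1, slot+3, slot+6, ... (mod size).
 * On lookup, each candidate slot is checked before declaring a miss; on
 * insert, the first empty candidate is preferred before evicting. This
 * trades a few extra comparisons for far fewer unnecessary evictions. */
static size_t
probe_slot(uint64_t hash, unsigned attempt)
{
    return (size_t)((hash + (uint64_t)attempt * (attempt + 1) / 2) & CACHE_MASK);
}
```

Because CACHE_SIZE is a power of two, the triangular-number offsets visit distinct slots for the first probes, which is the property that makes quadratic probing attractive here.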

The global method cache has a couple of advantages over a per-class cache, and a few disadvantages.

Advantages:

  1. Size is cleanly capped
  2. Very easy to gather stats and info about it
  3. Expiry semantics are fairly clean, can force a method_state increment when utilization is too high

Disadvantages:

  1. Extra hashing required to look up elements, you need the klass hash
  2. Extra storage required (klass_seq stored per element)

Updated by tmm1 (Aman Karmani) almost 11 years ago

I ran some benchmarks with funny-falcon's patch. Memory usage is increased by 5-10mb RSS in our app, and response time is improved by 4-8%.

before:

500 requests to https://github.com/dashboard as tmm1
peak memory: 349.4 MB RSS
cpu time: 7,171ms total (14ms avg/req, 13ms - 21ms)
500 requests to https://github.com/github/github/pull/17695 as tmm1
peak memory: 404.6 MB RSS
cpu time: 98,713ms total (197ms avg/req, 184ms - 285ms)

after:

500 requests to https://github.com/dashboard as tmm1
peak memory: 359.2 MB RSS
cpu time: 6,806ms total (13ms avg/req, 12ms - 20ms)
500 requests to https://github.com/github/github/pull/17695 as tmm1
peak memory: 410.3 MB RSS
cpu time: 95,559ms total (191ms avg/req, 177ms - 280ms)

Updated by sam.saffron (Sam Saffron) almost 11 years ago

There was a concern that gettimeofday is expensive and adds too much time to my results.

I measured:

static VALUE
rb_timing_overhead_per_100k(VALUE klass)
{
    int i, iterations = 100000;
    struct timeval temp, tv1, tv2, res;

    gettimeofday(&tv1, NULL);
    for (i = 0; i < iterations; i++) {
        gettimeofday(&temp, NULL);
    }
    gettimeofday(&tv2, NULL);

    timersub(&tv2, &tv1, &res);
    /* total elapsed time in microseconds */
    return INT2FIX(res.tv_sec * 1000000 + res.tv_usec);
}

It turns out to be about 2,000 microseconds per 100k calls, i.e. roughly 0.02 microseconds per call, so the additional time introduced by the measurement is negligible and the numbers can be trusted.

Updated by sam.saffron (Sam Saffron) almost 11 years ago

Discourse bench:

Disabled method cache vs. current cache vs. 16x larger method cache vs. funny_falcon's patch

Disabled Method cache:

Your Results: (note for timings- percentile is first, duration is second in millisecs)

home_page:
50: 43
75: 44
90: 48
99: 67
topic_page:
50: 14
75: 15
90: 16
99: 39
home_page_admin:
50: 57
75: 60
90: 64
99: 89
topic_page_admin:
50: 23
75: 24
90: 26
99: 35
timings:
load_rails: 2733
ruby-version: 2.1.0-p-1
architecture: amd64
operatingsystem: Ubuntu
kernelversion: 3.11.0
memorysize: 5.89 GB
physicalprocessorcount: 1
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

Current Method Cache:


home_page:
50: 37
75: 39
90: 43
99: 60
topic_page:
50: 12
75: 13
90: 14
99: 40
home_page_admin:
50: 48
75: 50
90: 55
99: 77
topic_page_admin:
50: 20
75: 21
90: 23
99: 32
timings:
load_rails: 2721
ruby-version: 2.1.0-p-1
architecture: amd64
operatingsystem: Ubuntu
kernelversion: 3.11.0
memorysize: 5.89 GB
physicalprocessorcount: 1
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

16x larger method cache

Your Results: (note for timings- percentile is first, duration is second in millisecs)

home_page:
50: 35
75: 38
90: 41
99: 67
topic_page:
50: 11
75: 11
90: 13
99: 34
home_page_admin:
50: 45
75: 47
90: 52
99: 79
topic_page_admin:
50: 18
75: 19
90: 20
99: 29
timings:
load_rails: 2761
ruby-version: 2.1.0-p-1
architecture: amd64
operatingsystem: Ubuntu
kernelversion: 3.11.0
memorysize: 5.89 GB
physicalprocessorcount: 1
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

Funny Falcon

Your Results: (note for timings- percentile is first, duration is second in millisecs)

home_page:
50: 34
75: 35
90: 38
99: 53
topic_page:
50: 10
75: 11
90: 12
99: 37
home_page_admin:
50: 43
75: 45
90: 49
99: 68
topic_page_admin:
50: 18
75: 19
90: 20
99: 32
timings:
load_rails: 2771
ruby-version: 2.1.0-p-1
architecture: amd64
operatingsystem: Ubuntu
kernelversion: 3.11.0
memorysize: 5.89 GB
physicalprocessorcount: 1
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz


Observations:

The funny_falcon patch is clearly the fastest; the larger cache size makes a very big difference, and the overall method caching impact is about 20%, give or take.

Updated by sam.saffron (Sam Saffron) almost 11 years ago

Note: for comparison see the results on an unpatched 2.0.0 p353


home_page:
50: 35
75: 40
90: 107
99: 119
topic_page:
50: 12
75: 13
90: 17
99: 150
home_page_admin:
50: 45
75: 55
90: 118
99: 127
topic_page_admin:
50: 19
75: 21
90: 76
99: 100
timings:
load_rails: 3644
ruby-version: 2.0.0-p353
architecture: amd64
operatingsystem: Ubuntu
kernelversion: 3.11.0
memorysize: 5.89 GB
physicalprocessorcount: 1
processor0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

Either the 16x size increase or funny_falcon's implementation means Ruby 2.1 no longer regresses performance.

Updated by funny_falcon (Yura Sokolov) almost 11 years ago

Added a couple of fixes to the patch:

  1. Fix rb_mcache_resize: it didn't copy method_state and class_serial on resize, so the cache was invalidated on the next check.
  2. Add "garbage collection" of "undefined" cache entries: do not copy them on resize, and shrink the cache size if it is too sparse after the resize.

https://github.com/funny-falcon/ruby/compare/trunk...class_local_cache
https://github.com/funny-falcon/ruby/compare/trunk...class_local_cache.diff
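The resize fix described above can be pictured with this hedged sketch (struct fields and names are illustrative stand-ins, not funny_falcon's actual code): live entries must carry their invalidation counters across the resize, and "undefined" (negative) entries are dropped rather than copied.

```c
#include <stdint.h>
#include <stdlib.h>

struct mc_entry {
    uint64_t method_state;   /* global invalidation counter at fill time */
    uint64_t class_serial;   /* per-class invalidation counter */
    uint64_t mid;            /* method id; 0 here marks an empty slot */
    int      undefined;      /* negative cache entry ("no such method") */
};

struct mcache {
    size_t           size;   /* power of two */
    struct mc_entry *slots;
};

/* Grow the cache, rehashing live entries into their new slots. Copying
 * method_state/class_serial is essential: without them, every entry looks
 * stale and gets invalidated on the next check (the bug fixed in point 1).
 * Undefined entries are garbage-collected by simply not copying them. */
static void
mcache_resize(struct mcache *mc, size_t new_size)
{
    struct mc_entry *ns = calloc(new_size, sizeof(*ns));
    for (size_t i = 0; i < mc->size; i++) {
        struct mc_entry *e = &mc->slots[i];
        if (e->mid == 0 || e->undefined) continue;   /* empty or dead */
        ns[(e->class_serial ^ e->mid) & (new_size - 1)] = *e;
    }
    free(mc->slots);
    mc->slots = ns;
    mc->size  = new_size;
}
```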

Updated by sam.saffron (Sam Saffron) almost 11 years ago

FYI, it seems the performance of method lookups has regressed in 2.1:

https://gist.github.com/SamSaffron/8232978

This makes this change particularly important.

Updated by ko1 (Koichi Sasada) almost 11 years ago

Hi,

Now, method cache techniques are important and we need more measurement
and techniques.

Since Ruby 2.0, we use the inline/global method cache aggressively, so
the performance impact of a method cache miss is huge (compared with
Ruby 1.8 and 1.9), as you have already shown.

funny_falcon's approach is very impressive for solving this problem.
However, I'm not yet sure how much it increases memory impact.

tmm1 measured the memory impact in
https://bugs.ruby-lang.org/issues/9262#change-43840 . This survey is
very impressive. I think it is better if we measure other cases too. For
example, if some classes call many methods at once, it can become a
memory issue. I think the simple limitation (cap) approach combined with
funny_falcon's patch would work fine.

Over the new year holidays (Japanese take holidays in the new year
week), I was thinking about this issue. Some ideas:

(1) inline cache
(1-1) polymorphic inline cache
(1-2) per-class inline cache outside the call site
(1-3) per-method cache invalidation
(2) global cache
(2-1) per-class method cache w/ cap

These ideas are not yet implemented or verified, and a huge effort is
needed (because we need to change the method entry data structure).
Before trying them, I need to know why and how the method cache misses.

Basically, I'm not against introducing and backporting some of the
proposed patches (with measurement, if we can). In my opinion, the
simple variable global cache entry size approach is fine for a
backport.

I will also try the above ideas for Ruby 2.2. The current patches are a
good starting point, I think.

--
// SASADA Koichi at atdot dot net

Updated by normalperson (Eric Wong) almost 11 years ago

SASADA Koichi wrote:

Since Ruby 2.0, we use the inline/global method cache aggressively, so
the performance impact of a method cache miss is huge (compared with
Ruby 1.8 and 1.9), as you have already shown.

Not a serious patch or benchmark, but I don't think it's too bad without
the global method cache (with the inline cache enabled).

OPT_GLOBAL_METHOD_CACHE did not disable writes to the method cache
before, only reads:

http://bogomips.org/ruby.git/patch/?id=a899e54e6abc13b8816e6c5f69ff560918db4be1

git://80x24.org/ruby nomethodcache

I'll rerun the test later tonight when my machine is quieter.

Updated by normalperson (Eric Wong) almost 11 years ago

It looks like the performance regression without the global method cache
is because rb_funcall and friends do not have call info, so they don't
hit the inline cache. So perhaps we should just add a call-info-aware
version of the rb_funcall-like functions so we can use the inline cache
everywhere.

I'm pretty sure ko1 already knows that, but I just discovered it :x
tmm1: what do you think?

Updated by normalperson (Eric Wong) almost 11 years ago

Eric Wong wrote:

So perhaps we should just add a call info-aware
version of rb_funcall-like functions so we can just use inline cache
everywhere.

I should add: a cheap way to do this might be to just use a do/while
macro wrapping rb_funcall with a static __thread rb_call_info_t
variable in its scope.

__thread works on gcc and clang, and maybe other compilers too, but
other compilers may be stuck with the slow version (or a non-MT-safe
one). (I prefer we use __thread in case we get rid of the GVL.)
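Concretely, the do/while pattern might look something like this sketch. It is entirely hypothetical: `fake_ci` is a stand-in for MRI's rb_call_info_t, `CACHED_CALL` is not a proposed API, and the validity check is simplified to a single serial comparison.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a per-call-site inline cache record. */
struct fake_ci {
    uint64_t serial;   /* invalidation token observed at fill time */
    void    *cached;   /* cached lookup result */
};

/* Each textual expansion of this macro gets its own static __thread cache
 * slot, so every rb_funcall-style call site acquires a private inline cache
 * without changing the public API. slow_lookup is evaluated only on a miss. */
#define CACHED_CALL(result, recv_serial, slow_lookup)                     \
    do {                                                                  \
        static __thread struct fake_ci ci;                                \
        if (ci.cached == NULL || ci.serial != (uint64_t)(recv_serial)) {  \
            ci.cached = (slow_lookup);  /* miss: do the full lookup */    \
            ci.serial = (uint64_t)(recv_serial);                          \
        }                                                                 \
        (result) = ci.cached;           /* hit: reuse the cached result */ \
    } while (0)
```

Because the cache variable is thread-local, two threads hitting the same call site never race on it, which is the motivation for __thread over a plain static.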

Updated by ko1 (Koichi Sasada) almost 11 years ago

  • ruby -v changed from trunk to -

(2014/01/30 4:17), Eric Wong wrote:

It looks like the performance regression without the global method cache
is because rb_funcall and friends do not have call info, so they don't
hit the inline cache. So perhaps we should just add a call-info-aware
version of the rb_funcall-like functions so we can use the inline cache
everywhere.

I'm pretty sure ko1 already knows that, but I just discovered it :x
tmm1: what do you think?

charliesome has proposed a similar API (sorry, I forget the URL).
He uses only a static variable (not thread-local) and it seems to work
well. However, I think it may have pitfalls (recursive calls), so I
can't decide to introduce it.

--
// SASADA Koichi at atdot dot net

Updated by funny_falcon (Yura Sokolov) almost 11 years ago

I tried this once with static variables, but had a performance regression. Perhaps I missed something.

Updated by normalperson (Eric Wong) over 10 years ago

wrote:

http://bogomips.org/ruby.git/patch/?id=a899e54e6abc13b8816e6c5f69ff560918db4be1

Btw, I'd like to commit just the vm_method.c change from this,
to avoid writing to the method cache when it is disabled.

It also replaces CPP #if with C if for readability.

Ugh, the commit message was long; here is just the vm_method.c change:
http://bogomips.org/ruby.git/diff/vm_method.c?id=a899e54e6abc1

Updated by nobu (Nobuyoshi Nakada) over 10 years ago

Eric Wong wrote:

Btw, I'd like to commit just the vm_method.c change for this
to avoid writing to method cache if disabled.

Agreed.

It also replaces CPP #if with C if for readability.

I don't think it is needed.

Updated by normalperson (Eric Wong) over 10 years ago

wrote:

Eric Wong wrote:

It also replaces CPP #if with C if for readability.

I don't think it is needed.

OK, does that mean UNDEFINED_METHOD_ENTRY_P is unnecessary
with the cache disabled? We could just do this:
http://80x24.org/gmc_disable.patch

Updated by normalperson (Eric Wong) over 10 years ago

Eric Wong wrote:

wrote:

Eric Wong wrote:

It also replaces CPP #if with C if for readability.

I don't think it is needed.

OK, does that mean UNDEFINED_METHOD_ENTRY_P is unnecessary
with the cache disabled? We could just do this:
http://80x24.org/gmc_disable.patch

r45261. I used my original hunk in rb_method_entry_get_without_cache,
since I got "make test" failures on my 32-bit machine without checking
UNDEFINED_METHOD_ENTRY_P(me) when bypassing the GMC.

Updated by nobu (Nobuyoshi Nakada) over 10 years ago

Eric Wong wrote:

wrote:

Eric Wong wrote:

It also replaces CPP #if with C if for readability.

I don't think it is needed.

OK, does that mean UNDEFINED_METHOD_ENTRY_P is unnecessary
with the cache disabled? We could just do this:
http://80x24.org/gmc_disable.patch

I meant "replace CPP #if with C".

Updated by ko1 (Koichi Sasada) almost 8 years ago

  • Status changed from Open to Closed
  • Backport deleted (1.9.3: UNKNOWN, 2.0.0: UNKNOWN)

I am closing this issue because I want to try another approach.
Besides, current MRI already has a global variable to configure this size.
