Feature #16614
Updated by ko1 (Koichi Sasada) almost 5 years ago
I'm developing a brand new method caching mechanism for Guild and further improvements.
Important ideas:
* (1) Disposable inline method cache (IMC) for a race-free inline method cache
  * Make the call-cache (CC) an RVALUE (GC target object) and allocate a new CC on cache miss
* (2) Introduce a per-Class method cache (pCMC)
  * Instead of the fixed-size global method cache (GMC), the pCMC allows a flexible cache size
  * Caching CCs reduces CC allocation and allows sharing a CC's fast-path between call-sites with the same call-info (CI)
* (3) Invalidate an inline method cache by invalidating the corresponding method entries (MEs)
  * Instead of using class serials, we set an "invalidated" flag on the method entry itself to represent cache invalidation
  * Compared with class serials, the impact of a method modification (add/overwrite/delete) is smaller:
    * updating class serials invalidates all method caches of the class and its sub-classes, while
    * the proposed approach invalidates only the caches of the single affected ME
# Acronyms
* IMC: Inline Method Cache
* GMC: Global Method Cache (and its table)
* pCMC: per-Class Method Cache
* CC: Call Cache (struct rb_call_cache, renamed to rb_callcache)
* CI: Call Info (struct rb_call_info, renamed to rb_callinfo)
* ME: method entry (method body)
# Background
Method caching is an important technique for improving the performance of method invocation in OO language interpreters.
Without method caching, we need to traverse the class tree to find the method definition on every method invocation.
![class-tree_traversak](class-tree_traversak.PNG)
Recent Ruby uses two layers of method caching: the inline method cache (IMC) and the global method cache (GMC).
The IMC caches method entries (ME / method definition) and fast-path information (in some cases, code to skip checks).
The GMC caches method entries.
A method "cache" requires an invalidation protocol, and the Ruby interpreter basically employs a class-serial approach.
On a method call, the call-cache (CC) is filled via the IMC and GMC (or by looking up the class tree), and then the method is invoked.
![method_call_flow](method_call_flow.PNG)
The following subsections introduce Ruby's implementation in brief.
## Inline method cache (IMC)
The IMC is represented by the `rb_call_cache` struct, which is allocated per call-site.
`rb_call_cache` contains keys and values:
* Keys: method_state and class_serial(s).
* Values: (v1) a method entry (ME) and (v2) info to invoke the method
In a narrow sense, (v1) is the main purpose of a method cache.
The IMC also contains (v2) to skip some checking steps by reusing the result of the last execution. We call this the fast-path.
When the cache hits (the CC is not invalidated), we can use the cached values without searching for the ME in the class tree, and we can take the fast-path.
Cache invalidation is checked by comparing the stored serials with the current serials.
![imc_hit_miss_judge](imc_hit_miss_judge.PNG)
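The serial comparison above can be sketched as a small Ruby model. This is illustrative only: `CachedCall` and its fields are stand-ins, not the actual C identifiers.

```ruby
# Toy model of the serial-based hit/miss check (illustrative names only).
CachedCall = Struct.new(:method_state, :class_serial, :me)

def imc_hit?(cc, current_method_state, current_class_serial)
  # The cache hits only if both serials recorded at fill time still match.
  cc.method_state == current_method_state &&
    cc.class_serial == current_class_serial
end

cc = CachedCall.new(1, 10, :me_for_foo)
p imc_hit?(cc, 1, 10)  # true: serials unchanged, the cached ME is usable
p imc_hit?(cc, 1, 11)  # false: the class serial was bumped, cache miss
```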
Per-class serials are updated when a method is defined, and in similar situations. Not only the modified class but also its sub-classes get new serials, so all method caches pointing to methods of those classes are invalidated.
ex1) If we modify the Object class, all method caches will be invalidated.
ex2) If we add C#foo, all caches for the methods of C and its sub-classes are invalidated. If `D < C`, then a cache for `D#bar` is also invalidated, even though `C#foo` is unrelated to `D#bar`. Essentially, we don't need to invalidate the IMCs for `D#bar`, but invalidating them this way is easy.
![imc_invalidation](imc_invalidation.PNG)
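To make ex2 concrete, here is a toy model of the serial bump (the serial table and class names are hypothetical):

```ruby
# Toy model: defining C#foo bumps the serial of C and of every
# sub-class, so D#bar's cache misses even though it is unrelated.
serials = { "C" => 1, "D" => 1 }               # D < C
cc_for_d_bar = { class_serial: serials["D"] }  # filled IMC for D#bar

# Define C#foo: bump the serial of C and of all its sub-classes.
%w[C D].each { |k| serials[k] += 1 }

p cc_for_d_bar[:class_serial] == serials["D"]  # false: unrelated cache lost
```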
## Global method cache (GMC)
The GMC is a fixed-size hash table, keys (serials) -> values (method entries), with only one table per interpreter process.
The keys are the same as the IMC's.
## Method lookup on method call using IMC and GMC
First, the IMC is checked. If it hits, the cached ME and fast-path information are used to invoke the method.
If it misses, the GMC is checked for the corresponding ME.
If the ME is not found in the GMC either, it is searched for in the class tree.
After an ME is found, the CC is set up for the IMC and the found method is invoked.
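The flow above can be modeled in a few lines of Ruby. This is a toy model: the IMC check is reduced to a receiver-class comparison, `instance_method` stands in for the class-tree walk, and none of the names are the real CRuby identifiers.

```ruby
# Toy two-level lookup: IMC -> GMC -> class tree.
GMC = {}                    # global table: [klass, mid] -> method entry
TREE_WALKS = Hash.new(0)    # counts slow-path class-tree searches

CallSite = Struct.new(:mid, :cached_klass, :cached_me) do
  def call_lookup(klass)
    return cached_me if cached_klass == klass      # IMC hit
    me = GMC[[klass, mid]]                         # GMC hit?
    unless me
      TREE_WALKS[[klass, mid]] += 1                # slow path
      me = klass.instance_method(mid)              # "class-tree search"
      GMC[[klass, mid]] = me
    end
    self.cached_klass, self.cached_me = klass, me  # refill the IMC
    me
  end
end

site = CallSite.new(:upcase)
site.call_lookup(String)         # miss: walks the "class tree" once
site.call_lookup(String)         # IMC hit: no table or tree access
p TREE_WALKS[[String, :upcase]]  # 1
```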
# Problems
There are several issues with the current method caching mechanism, especially for Guild, the parallel execution system.
## Atomic access
For the IMC, keys and values are written at arbitrary times. If guilds (parallel processing elements) share an IMC, this introduces data races.
To avoid the races, there are several traditional approaches:
(1) access the cache synchronously
(2) copy CCs per Guild
However, IMC access is so frequent in Ruby that (1) is unacceptable because of its overhead.
(2) preserves performance, but it can consume O(n) memory (n: the number of Guilds).
FYI: a very simple Rails web app has about 10,000 call-sites.
## The size of Global method cache
The GMC is a simple table, but its size is fixed. We can tune it with the `RUBY_GLOBAL_METHOD_CACHE_SIZE` environment variable at process start time, but it is not easy to predict the optimal value.
## Update references for `GC.compact`
`GC.compact` requires updating the references in the IMC/GMC. However, it is difficult to update them because of several implementation restrictions. Currently, `GC.compact` invalidates all method caches. It works, but it is not so fast.
## Invalidate extra methods
Assume that we cached `C1#bar`, `C1#baz`, ... (`C1 < C0`). When `C0#foo` is added, `C1#bar` (and the others) will be invalidated even though `C1#bar` and `C0#foo` are unrelated.
(In most cases this is not a problem, because such (re-)definitions are not applied in production mode, I guess.)
## Can't share the knowledge between call-sites.
We allocate an IMC per call-site, so there is no relation between call-sites.
However, we could share fast-path information between them.
For example, the following code calls the `require` method in the same manner three times.
```ruby
require 'x'
require 'y'
require 'z'
```
If we can share the `call_cache` information between the 3 call-sites, we can call `require` via the fast-path for the latter 2 method calls (the first call sets up the fast-path).
# Proposal
To solve the above issues, I propose a new method cache mechanism. It has two layers.
(1) Disposable inline method cache
To ensure atomic access, I use a technique similar to RCU: on invalidation, we throw the IMC data structure away instead of reusing it. An IMC is allocated as a GC target object, so a thrown-away IMC will be collected by the GC.
(2) per-Class method cache (pCMC)
Instead of the global method cache, we introduce a per-Class method cache (pCMC).
The pCMC is a map `mid -> CCs`, where "CCs" consists of (a) a cached method entry and (b) cached call-cache entries.
(a) is similar to the global method cache.
(b) is a new idea: we can reuse call-cache entries between call-sites of the same type.
In this context, the type is the "call-info", a tuple of (argc, flags and keywords).
CCs are disposable, but we can share a CC between call-sites that have the same CI.
![metho_call_flow_comparison](metho_call_flow_comparison.PNG)
## Disposable IMC
Each call-site points to a Call Info (CI) and a Call Cache (CC). The CI is static information decided at compile time. The CC is dynamic information; on master, cache information is stored by mutating a single CC per call-site.
With the disposable IMC, we make the CC a GC target object, and the ISEQ needs to mark its cached CCs.
On a cache miss, a new CC is allocated and pointed to from the call-site, and the missed CC is released.
The cache hit/miss decision is very simple: compare `klass` (the receiver's class). The code is just `CC->klass == klass`.
After that, we check the validity of the cache via the cached ME's valid bit: `CC->me->valid_bit`.
![imc_hit_miss_judge_cmp](imc_hit_miss_judge_cmp.PNG)
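A minimal Ruby model of the proposed two-step check (`ToyME`/`ToyCC` and their fields are illustrative stand-ins for the C fields):

```ruby
# Toy model: the cache hits iff the receiver class matches AND the
# cached ME has not been invalidated.
ToyME = Struct.new(:valid_bit)
ToyCC = Struct.new(:klass, :me)

def cc_hit?(cc, klass)
  # Mirrors `CC->klass == klass` and `CC->me->valid_bit` from the text.
  cc.klass == klass && cc.me.valid_bit == 1
end

me = ToyME.new(1)
cc = ToyCC.new(Integer, me)
p cc_hit?(cc, Integer)  # true
me.valid_bit = 0        # the method was redefined somewhere
p cc_hit?(cc, Integer)  # false: a fresh CC will be fetched from the pCMC
```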
If the cache does not hit, another CC is allocated or reused by asking the per-Class method cache (pCMC), and it is stored in the call-site.
The old CC is released from the call-site (ISEQ), and the GC will collect it soon if the CC is not retained by the pCMC.
## per-Class method cache (pCMC)
The pCMC consists of per-class tables.
Each pCMC is a table `mid -> CCs`. CCs is a new data structure holding two kinds of data: a cached method entry (ME) and CC entries keyed by call-info (CI).
The cached ME is the same cached value as in the GMC; the difference is the number of tables.
The CC entries are a set of CC objects, one per CI (`[[ci1, cc1], [ci2, cc2], ...]`).
For the same CI, I found that the same CC works well (or I fixed the code so that it does).
This means that call-sites which use the same CI can share the CC's fast-path.
![pCMC](pCMC.PNG)
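The sharing described above can be sketched as a toy pCMC (all names and the hash layout are hypothetical): one cached ME per method id, plus one CC per distinct CI, so call-sites with the same CI receive the very same CC object.

```ruby
# Toy pCMC: klass -> { mid -> { me: ..., ccs: { ci => cc } } }.
PCMC = Hash.new { |h, klass| h[klass] = {} }

def fetch_cc(klass, mid, ci)
  entry = (PCMC[klass][mid] ||= { me: "ME for #{klass}##{mid}", ccs: {} })
  # One CC per distinct CI; a second call-site with the same CI
  # gets the already-created CC (and its fast-path).
  entry[:ccs][ci] ||= { klass: klass, me: entry[:me], fast_path: nil }
end

ci = [1, :simple]                      # (argc, flags) as a stand-in CI
cc1 = fetch_cc(String, :require, ci)   # first call-site
cc2 = fetch_cc(String, :require, ci)   # second call-site, same CI
p cc1.equal?(cc2)  # true: both call-sites share one CC object
```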
## Invalidate protocol
There are two invalidation protocols: clearing the CC's class (`CC->klass = 0`) and clearing the ME's valid bit (`ME->valid_bit = 0`).
Which one is used depends on how the class is modified by adding/overwriting/removing methods.
### Modified class has no sub-classes
We only need to clear the CCs for the specified method id in the class's pCMC table entry,
like this: `modified_class.pCMC_table[mid].each_cc_entries{|cc| cc.klass = 0}`
The computation time is O(n) (n: the number of CIs in the cache), but n should be a small number, I guess.
![invalidate_nosubclasses.PNG](invalidate_nosubclasses.PNG)
### Modified class has sub-classes
We would need to clear the pCMC tables of all sub-classes, but the computation time would be O(n) (n: the number of sub-classes).
So, instead of clearing all corresponding caches in the sub-classes, I decided to modify the ME itself to signal invalidation:
```ruby
if ME.cached_by_anyone?
  # Re-resolve the ME and the class that actually defines it,
  # redefine it there with a copy, then invalidate the original ME.
  ME, defined_class = lookup(modified_class, mid)
  defined_class.define_method(mid, ME.dup)
  ME.valid_bit = 0
end
```
![invalidate_with_subclasses.PNG](invalidate_with_subclasses.PNG)
The computation time is O(n), where n is the class-tree height (`modified_class.ancestors.size`).
It allocates and copies the found ME, so it is not a low-cost operation.
# Discussion
## Solved issues
### Atomic access
By using the disposable IMC (CC), atomic access to the IMC is guaranteed (strictly speaking, we need a few more assumptions, but we can satisfy them). In most cases the IMC works well (hits frequently), so this is the code path we need to tune, even for parallel computing.
Other logic requires an appropriate locking mechanism. For example, to access the pCMC we need to introduce a synchronization mechanism (a global lock, etc.).
### The size of Global method cache
We don't have GMC anymore.
### Update references for `GC.compact`
CCs are objects, so we can update their references by adding simple updating code.
### Invalidate extra methods.
We now invalidate only the specified method.
However, there is still a possibility of invalidating an unrelated method.
### Can't share the knowledge between call-sites.
We can share CCs between call-sites that point to the same CI.
So we can make more use of the fast-path information.
## Introduced issues
### Memory consumption predictability
The pCMC can consume a lot of memory, because there is no limit on its size just now.
I don't think this is a problem in most cases, but we could introduce some limit on it.
# Implementation
https://github.com/ruby/ruby/pull/2888
# Measurements
* On small benchmarks and optcarrot, performance is almost the same as master.
* On a simple Rails application, I measured a slight slowdown (discussed in the next section).
# TODO
## Negative cache
We measured a slowdown on a simple Rails application, maybe because of the lack of a negative cache. Currently we don't cache the information that "the specified method is not defined", so method lookup by class-tree traversal is used more frequently than on master. We can introduce a negative cache by introducing an "undef" method entry, but this changes the current code in many places, and the patch would become bigger and bigger.
This is why I am making a ticket now.
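The negative-cache idea can be sketched with a toy model (the `UNDEF_ME` sentinel and the tables are hypothetical, not the planned implementation):

```ruby
# Toy negative cache: a failed lookup stores an "undef" sentinel, so
# later lookups of the missing method skip the class-tree traversal.
UNDEF_ME = :undef_method_entry
METHOD_CACHE = {}
TREE_SEARCHES = Hash.new(0)

def search_class_tree(klass, mid)
  TREE_SEARCHES[[klass, mid]] += 1  # count the slow-path walks
  klass.method_defined?(mid) ? klass.instance_method(mid) : nil
end

def cached_lookup(klass, mid)
  cached = METHOD_CACHE[[klass, mid]]
  return (cached == UNDEF_ME ? nil : cached) if cached
  me = search_class_tree(klass, mid)
  METHOD_CACHE[[klass, mid]] = me || UNDEF_ME  # cache the miss, too
  me
end

cached_lookup(String, :no_such_method)      # slow path: one tree search
cached_lookup(String, :no_such_method)      # negative-cache hit
p TREE_SEARCHES[[String, :no_such_method]]  # 1
```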
## Polymorphic inline cache
Ruby 2.7 introduced a limited version of polymorphic inline caching (when class serials differ but point to the same method entry, the cache entry is shared). The current proposed implementation does not have this feature.
With CCs as first-class values, I guess it is easy to extend the monomorphic inline cache to a polymorphic inline cache.
(It may be difficult to decide how many entries are needed, how to implement LRU, and so on, though.)
## Callable method entry
To keep the explanation simple, I omitted the distinction between method entries and callable method entries.
The GMC on master caches method entries, while the pCMC caches callable method entries. This can affect performance, and we need to check it more.
This is a kind of trivial issue, but we need to make clear how to treat the two kinds of method entries.
## MJIT support
It can compile with MJIT now, but the support is not complete (maybe there is a GC bug).
## CCs for `rb_funcallv()` and others
This is disabled now, but it should be easy to implement.
# Conclusion
This ticket proposes a new method cache mechanism: the algorithm and an implementation.
It is not complete, but I believe these techniques are needed for parallel execution.