Project

General

Profile

Feature #16254

MRI internal: Define built-in classes in Ruby with `__intrinsic__` syntax

Added by ko1 (Koichi Sasada) over 1 year ago. Updated 9 months ago.

Status:
Closed
Priority:
Normal
Target version:
-
[ruby-core:95344]

Description

Abstract

MRI defines most of built-in classes in C with C-APIs like rb_define_method().
However, there are several issues using C-APIs.

A few methods are defined in Ruby written in prelude.rb.
However, we can not define all of classes because we can not touch deep data structure in Ruby.
Furthermore, there are performance issues if we write all of them in Ruby.

To solve this situation, I want to suggest written in Ruby with C intrinsic functions.
This proposal is same as my RubyKaigi 2019 talk https://rubykaigi.org/2019/presentations/ko1.html.

Terminology

  • C-methods: methods defined in C (defined with rb_define_method(), etc).
  • Ruby-methods: methods defined in Ruby.
  • ISeq: The body of RUbyVM::InstructionSequence object which represents bytecode for VM.

Background / Problem / Idea

Written in C

As you MRI developers know, most of methods are written in C with C-APIs.
However, there are several issues.

(1) Annotation issues (compare with Ruby methods)

For example, C-methods defined by C-APIs doesn't have parameters information which are returned by Method#parameters, because there is way to define parameters for C methods.
There are proposals to add parameter name information for C-methods, however, I think it will introduce new complex C-APIs and introduce additional overhead on boot time.

-> Idea; Writing methods in Ruby will solve this issue.

(2) Annotation issues (for further optimization)

It is useful to know the methods attribute, for example, the method causes no side-effect (a pure method).
Labeling all of methods including user program's methods doesn't seem good idea (not Ruby-way). But I think annotating built-in methods is good way because we can manage (and we can remove them when we can make good analyzer).

There are no way to annotate this kind of attributes.

-> Idea: Writing methods in Ruby will make it easy to introduce new annotations.

(3) Performance issue

There are several features which are slower in C than written in Ruby.

  • exception handling (rb_ensure(), etc) because we need to capture context with setjmp on C-methods. Ruby-methods doesn't need to capture any context for exception handling.
  • Passing keyword parameters because Ruby-methods doesn't need to make a Hash object to pass the keyword parameters if they are passed with explicit keyword parameters (foo(k1: v1, k2: v2)).

-> Idea: Writing methods in Ruby makes them faster.

(4) Productivity

It is tough to write some features in C:

For example, it is easy to write rescue syntax in Ruby:

# in Ruby
def dummy_func_rescue
  nil
rescue
  nil
end

But it is difficult to write/read in C:

static VALUE
dummy_body(VALUE self)
{
    return Qnil;
}
static VALUE
dummy_rescue(VALUE self)
{
    return Qnil;
}
static VALUE
tdummy_func_rescue(VALUE self)
{
    return rb_rescue(dummy_body, self, dummy_rescue, self);
}

(trained MRI developer can say it is not tough, though :p)

-> Idea: Writing methods in Ruby makes them easy.

(5) API change

To introduce Guild, I want to pass a "context" parameter (as a first parameter) for each C-functions like mrb_state on mruby.
This is because getting it from TLS (Thread-local-storage) is high-cost operation on dynamic library (libruby).

Maybe nobody allow me to change the specification of functions used by rb_define_method().

-> Idea: But introduce new method definition framework, we can move and change the specification, I hope.
Of course, we can remain current rb_define_method() APIs (with additional cost on Guild available MRI).

Written in Ruby in prelude.rb

There is a file prelude.rb which are loaded at boot time.
This file is used to define several methods, to reduce keyword parameters overhead, for example (IO#read_nonblock, TracePoint#enable).

However, writing all of methods in Ruby is not possible because:

  • (1) feasibility issue (we can not access internal data structure)
  • (2) performance issue (slow in general, of course)
  • (3) atomicity issue (GVL/GIL)

To solve (1), we can provide low-level C-methods to implement high-level (normal built-in) methods. However issues (2) and (3) are not solved.
(From CS researchers perspective, making clever compiler will solve them, like JVM, etc, But we don't have it yet)

-> Idea: Writing method body in C is feasible.

Proposal

(1) Introducing intrinsic mechanism to define built-in methods in Ruby.
(2) Load from binary format to reduce startup time.

(1) Intrinsic function

Calling intrinsic function syntax in Ruby

To define built-in methods, introduce special Ruby syntax __intrinsic__.func(args).
In this case, registered intrinsic function func() is called with args.

In normal Ruby program, __intrinsic__ is a local variable or a method.
However, running on special mode, they are parsed as intrinsic function call.

Intrinsic functions can not be called with:

  • block
  • keyword arguments
  • splat arguments

Development step with intrinsic functions

(1) Write a class/module in Ruby with intrinsic function.

# string.rb
class String
  def length
    __intrinsic__.str_length
  end
end

(2) Implement intrinsic functions

It is almost same as functions used by rb_define_method().
However it will accept context parameter as the first parameter.

(rb_execution_context_t is too long, so we can rename it, rb_state for example)

static VALUE
str_length(rb_execution_context_t *ec, VALUE self)
{
  return LONG2NUM(RSTRING_LEN(self));
}

(3) Define an intrinsic function table and load .rb file with the table.

Init_String(void)
{
  ...
  static const rb_export_intrinsic_t table[] = {
    RB_EXPORT_INTRINSIC(str_length, 0), // 0 is arity
    ...
  };
  rb_vm_builtin_load("string", table);
}

Example

There are two examples:

(1) Comparable module: https://gist.github.com/ko1/7f18e66d1ae25bb30c7e823aa57f0d31
(2) TracePoint class: https://gist.github.com/ko1/969e5690cda6180ed989eb79619ca612

(2) Load from binary file with lazy loading

Loading many ".rb" files slows down startup time.

We have ISeq#to_binary method to generate compiled binary data so that we can eliminate parse/compile time.
Fortunately, [Feature #16163] makes binary data small.
Furthermore, enabling "lazy loading" feature improves startup time because we don't need to generate complete ISeqs. USE_LAZY_LOAD in vm_core.h enables this feature.

We need to combine binary. There are several way (convert into C's array, concat with objcopy if available and so on).

Evaluation

Evaluations are written in my RubyKaigi 2019 presentation: https://rubykaigi.org/2019/presentations/ko1.html

Points:

  • Calling overhead of Ruby mehtods with intrinsic functions

    • Normal case, it is almost same as C-methods using optimized VM instructions.
    • With keyword parameters, it is faster than C-methods.
    • With optional parameters, it is x2 slower so it should be solved (*1).
  • Loading overhead

    • Requiring ".rb" files is about x15 slower than defining C methods.
    • Loading binary data with lazy loading technique is about x2 slower than C methods. Not so bad result.
    • At RubyKaigi 2019, the binary data was very huge, but [Feature #16163] reduces the size of binary data.

[*1] Introducing special "overloading" specifier can solve it because we don't need to assign optional parameters. First method lookup can be slowed down, but we can cache the method lookup results (with arity).

# example syntax
overload def foo(a)
  __intrinsic__.foo1(a)
end
overload def foo(a, b)
  __intrinsic__.foo2(a, b)
end

Implementation

Done:

  • Compile calling intrinsic functions (.rb)
  • Exporting intrinsic function table (.c)

Not yet:

  • Loading from binary mechanism
  • Attribute syntax
  • most of built-in class replacement

Now, miniruby and ruby (libruby) load '*.rb' files directly. However, ruby (libruby) should load compiled binary file.

Discussion

Do we rewrite all of built-in classes at once?

No. We can try and migrate them.

Do we support intrinsic mechanism for C-extension libraries?

Maybe in future. Now we can try it on MRI cores.

__intrinsic__ keyword

On my RubyKaigi 2019 talk, I proposed __C__, but I think __intrinsic__ is more descriptive (but a bit long).
Another idea is RubyVM::intrinsic.func(...).

I have no strong opinion. We can change this syntax until we expose this syntax for C-extensions.

Can we support __intrinsic__ in normal Ruby script?

No. This feature is only for built-in features.
As I described, calling intrinsic function syntax has several restriction compare with normal method calls, so that I think they are not exposed as normal Ruby programs, IMO.

Should we maintain intrinsic function table?

Now, yes. And we need to make this table automatically because manual operations can introduce mistake very easily.

Corresponding ".rb" file (trace_point.rb, for example) knows which intrinsic functions are needed.
Parsing ".rb" file can generate the table automatically.
However, we need a latest version Ruby to parse the scripts if they uses syntax which are supported by latest version of Ruby.

For example, we need Ruby 2.7 master to parse a script which uses pattern matching syntax.
However, the system's ruby (BASE_RUBY) should be older version. This is one of bootstrap problem.
This is "chicken-and-egg" problem.

There are several ideas.

(1) Parse a ".c" file to generate a table using function attribute.

INTRINSIC_FUNCTION static VALUE
str_length(...)
...

(2) Build another ruby parser with source code, "parse-ruby".

  • 1. generate parse-ruby with C code.
  • 2. run parse-ruby to generate tables by parsing ".rb" files. This process is written in C.
  • 3. build miniruby and ruby with generated table.

We can make it, but it introduces new complex build process.

(3) Restrict ".rb" syntax

Restrict syntax which can be used by BASE_RUBY for built-in ".rb" files.
It is easy to list up intrinsic functions using Ripper or AST or ISeq#to_a.

(3) is most easy but not so cool.
(2) is flexible, but it needs implementation cost and increases build complexity.

Path of '*.rb' files and install or not

The path of prelude.rb is <internal:prelude>. We have several options.

  • (1) Don't install ".rb" files and make these path <internal:trace_point.rb>, for example.
  • (2) Install ".rb" and make these paths non-existing paths such as <internal>/installdir/lib/builtin/trace_point.rb.
  • (3) Install ".rb" and make these paths real paths.

We will translate ".rb" files into binary data and link them into ruby (libruby).
So the modification of installed ".rb" files are not affect the behavior. It can introduce confusion so that I wrote (1) and (2).

For (3), it is possible to load ".rb" files if there is modification (maybe detect by modified date) and load from them. But it will introduce an overhead (disk access overhead).

Compatibility issue?

There are several compatibility issues. For example, TracePoint c-call events are changed to call events.
And there are more incompatibles.
We need to check them carefully.

Bootstrap issue?

Yes, there are.

Loading .rb files at boot timing of an interpreter can cause problem.
For example, before initializing String class, the class of String literal is 0 (because String class is not generated).

I introduces several workarounds but we need to modify more.

Conclusion

How about to introduce this mechanism and try it on Ruby 2.7?
We can revert these changes if we found any troubles, if we don't expose this mechanism and only internal changes.


Related issues

Related to Ruby master - Misc #17502: C vs RubyOpenko1 (Koichi Sasada)Actions

Also available in: Atom PDF