Feature #14739
Improve fiber yield/resume performance
Description
I am interested in improving Fiber yield/resume performance.
I've used this library before: http://software.schmorp.de/pkg/libcoro.html and handled millions of HTTP requests using it.
I'd suggest using that library.
As fibers are used in many places in Ruby (e.g. enumerable), it could be a big performance win across the board.
Here is a nice summary of what was done for RethinkDB: https://rethinkdb.com/blog/making-coroutines-fast/
Does Ruby currently reuse stacks? This is also a big performance win if it's not being done already.
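To illustrate where fibers show up implicitly: in CRuby, external iteration with Enumerator#next is backed by a fiber, so every call involves a resume/yield pair. A minimal sketch (the sequence of squares is made up purely for illustration):

```ruby
# External enumeration: each call to #next resumes the enumerator's
# internal fiber until the block hands over the next value.
squares = Enumerator.new do |yielder|
  n = 1
  loop do
    yielder << n * n
    n += 1
  end
end

first_three = 3.times.map { squares.next }
puts first_three.inspect
```

Internal iteration (each, map and so on) does not need a fiber, so any gain would mainly apply to code that calls next/peek or uses Fiber directly.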
History
Updated by shyouhei (Shyouhei Urabe) 10 months ago
ioquatix (Samuel Williams) wrote:
Does Ruby currently reuse stacks?
Yes.
Not sure how fast libcoro is, though.
Updated by ioquatix (Samuel Williams) 10 months ago
Here is the code https://github.com/ioquatix/ruby/tree/fiber-libcoro
UPDATE: I provided some benchmark details, but it turns out they were wrong. I've retracted them until I can provide correct information, to prevent any confusion.
Updated by ioquatix (Samuel Williams) 10 months ago
Not sure how fast libcoro is, though.
In my experience, the libcoro ASM implementation is the fastest implementation I found.
It's not much slower than a (normal) C function call.
Updated by ioquatix (Samuel Williams) 10 months ago
# Without libcoro
koyoko% ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.099961
execution time for 1000 messages: 19.505909
# With libcoro
koyoko% ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.099268
execution time for 1000 messages: 8.491746
It's about 2.2x faster.
That's about what I was expecting.
Can someone else confirm? Thanks.
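For anyone who wants to try to confirm this without the original script: fiber_benchmark.rb itself is not included in this thread, so the following is only a hypothetical stand-in that measures the same kind of thing (fiber setup, then one resume/yield round trip per fiber per message). The method name and the result hash are assumptions made for this sketch.

```ruby
require "benchmark"

# Hypothetical stand-in for fiber_benchmark.rb (the real script is not shown
# in this thread): create `fiber_count` fibers, then resume each of them once
# per message, so every message costs one resume and one yield per fiber.
def run_fiber_benchmark(fiber_count, message_count)
  fibers = nil
  received = 0

  setup = Benchmark.realtime do
    fibers = Array.new(fiber_count) do
      Fiber.new do
        loop do
          Fiber.yield
          received += 1
        end
      end
    end
    # Run each fiber up to its first Fiber.yield so it is ready for messages.
    fibers.each(&:resume)
  end

  execution = Benchmark.realtime do
    message_count.times { fibers.each(&:resume) }
  end

  { setup: setup, execution: execution, received: received }
end

result = run_fiber_benchmark(1000, 100)
printf("setup time for %d fibers: %.6f\n", 1000, result[:setup])
printf("execution time for %d messages: %.6f\n", 100, result[:execution])
```

The absolute numbers will not match the ones above; only the ratio between two builds of Ruby running the same script is meaningful.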
Updated by ioquatix (Samuel Williams) 10 months ago
# Without libcoro (macOS)
^_^ > ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.298039
execution time for 1000 messages: 35.248941
# With libcoro (macOS)
^_^ > ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.167117
execution time for 1000 messages: 15.460046
On macOS, it's about the same, 2.2x faster.
Updated by ioquatix (Samuel Williams) 10 months ago
I don't know how to run a full benchmark of Ruby. Can someone help me with that? It would be interesting to get a more general idea of the performance.
Updated by vo.x (Vit Ondruch) 10 months ago
I wonder what architectures libcoro supports? It seems it supports x86 and probably some ARM, but what about s390x and ppc64?
Updated by nobu (Nobuyoshi Nakada) 10 months ago
And it seems to require gcc (or variants) and a non-Windows platform.
coro.c can't be compiled with Visual C or mingw gcc.
Also, asm needed to be replaced with __asm__ to compile with Apple clang, and it is 3% faster.
$ ruby fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.227721
execution time for 1000 messages: 74.540142
$ ./ruby fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.293740
execution time for 1000 messages: 72.180107
Updated by ioquatix (Samuel Williams) 10 months ago
You can see the supported methods here.
For the proof of concept, I forced it to use the ASM method, which supports 32-bit and 64-bit x86 CPUs and ARM (I've never tested it).
It would make sense to set up some configure tests to detect which one is available.
I'd also suggest that, if we move forward with this, we remove most of the native implementation of coroutines in Ruby, because they are slower and clutter up the implementation.
Updated by ioquatix (Samuel Williams) 10 months ago
I've compiled this on both LLVM and GCC just fine.
I've never tried compiling it on Windows but it should work. It might require some work.
Also, asm needed to be replaced with __asm__ to compile with Apple clang
I didn't have this problem. What version of the developer tools are you using?
and it is 3% faster.
If you get that, something is wrong, it's definitely a much bigger improvement than that. Did you try it on Linux?
Updated by ioquatix (Samuel Williams) 10 months ago
I am trying out your branch, and will report back. 3% is within the margin for error so it sounds like nothing changed for some reason. There will be some explanation.
Updated by nobu (Nobuyoshi Nakada) 10 months ago
ioquatix (Samuel Williams) wrote:
Also, asm needed to be replaced with __asm__ to compile with Apple clang
I didn't have this problem. What version of the developer tools are you using?
$ clang --version
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
If you get that, something is wrong, it's definitely a much bigger improvement than that. Did you try it on Linux?
On Ubuntu 18.04, it has the effect with gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3).
trunk
$ ./x86_64-linux/exe/ruby src/fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.153903
execution time for 1000 messages: 25.395488
fiber-libcoro
$ make -C x86_64-linux prog > /dev/null && ./x86_64-linux/exe/ruby src/fiber_benchmark.rb 10000 1000
In file included from ../src/libcoro/coro.c:41:0,
 from ../src/cont.c:51:
../src/cont.c: In function ‘cont_free’:
../src/libcoro/coro.h:401:28: warning: statement with no effect [-Wunused-value]
 # define coro_destroy(ctx) (void *)(ctx)
 ^~~~~~~~~~~~~
../src/cont.c:370:2: note: in expansion of macro ‘coro_destroy’
 coro_destroy((coro_context *)&fib->context);
 ^~~~~~~~~~~~
../src/cont.c: In function ‘fiber_initialize_machine_stack_context’:
../src/cont.c:862:32: warning: passing argument 2 of ‘coro_create’ from incompatible pointer type [-Wincompatible-pointer-types]
 coro_create(&fib->context, rb_fiber_start, NULL, fib->ss_sp, fib->ss_size);
 ^~~~~~~~~~~~~~
In file included from ../src/cont.c:51:0:
../src/libcoro/coro.c:331:1: note: expected ‘coro_func {aka void (*)(void *)}’ but argument is of type ‘__attribute__((noreturn)) void (*)(void)’
 coro_create (coro_context *ctx, coro_func coro, void *arg, void *sptr, size_t ssize)
 ^~~~~~~~~~~
In file included from ../src/libcoro/coro.c:41:0,
 from ../src/cont.c:51:
../src/cont.c: In function ‘rb_fiber_terminate’:
../src/libcoro/coro.h:401:28: warning: statement with no effect [-Wunused-value]
 # define coro_destroy(ctx) (void *)(ctx)
 ^~~~~~~~~~~~~
../src/cont.c:1799:5: note: in expansion of macro ‘coro_destroy’
 coro_destroy(&fib->context);
 ^~~~~~~~~~~~
../src/cont.c: At top level:
cc1: warning: unrecognized command line option ‘-Wno-self-assign’
cc1: warning: unrecognized command line option ‘-Wno-constant-logical-operand’
cc1: warning: unrecognized command line option ‘-Wno-parentheses-equality’
setup time for 10000 fibers: 0.146823
execution time for 1000 messages: 7.855211
Updated by ioquatix (Samuel Williams) 10 months ago
Yes, that supports my own test as well.
koyoko% ruby --version
ruby 2.5.0p0 (2017-12-25 revision 61468) [x86_64-linux]
koyoko% ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.094309
execution time for 1000 messages: 22.248827
koyoko% ./build/bin/ruby --version
ruby 2.6.0dev (2018-05-03 fiber-libcoro 63333) [x86_64-linux]
last_commit=Use libcoro for Fiber implementation to improve performance.
koyoko% ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.104364
execution time for 1000 messages: 19.717851
koyoko% ./build/bin/ruby --version
ruby 2.6.0dev (2018-05-03 fiber-libcoro 63333) [x86_64-linux]
last_commit=Use libcoro for Fiber implementation to improve performance.
koyoko% ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.104798
execution time for 1000 messages: 8.988672
However, on macOS, I can't reproduce my original results. I apologise. I was playing around with stack allocation. I tried to revert back to that state, but couldn't reproduce the results I gave earlier.
I will continue to investigate.
Updated by ioquatix (Samuel Williams) 10 months ago
Okay, I found out what happened.
On macOS, you need to set
#include "libcoro/coro.c"
#define FIBER_USE_NATIVE 1
Otherwise it won't take the optimal code path. My apologies, I think as I was playing with the code I made that change but didn't commit it after I started patching it to work on Linux, since it seems on Linux that's the default.
Here is the performance improvement.
^_^ > ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.165381
execution time for 1000 messages: 14.267517
^_^ > ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.160629
execution time for 1000 messages: 6.307580
So, it's a similar speed-up.
I tried to compile without libcoro but with #define FIBER_USE_NATIVE 1, and it fails because swapcontext/makecontext is deprecated on macOS.
Updated by ioquatix (Samuel Williams) 10 months ago
I updated my branch with a few changes.
I'm sorry I didn't rebase on your branch.
I think once we decide if this is a good idea or not, we can decide how best to integrate it with Ruby. I just wanted to make a proof of concept to show it was a good improvement to performance.
My suggestion would be to remove the implementations from cont.c and update libcoro to support all required platforms. The API provided by libcoro is really great and a nice wrapper.
It should be possible to build libcoro on Windows. I do have Windows with Visual Studio set up but I really have no idea how to use it :) However, it wouldn't be silly to update libcoro to make it compile without problems on all supported platforms. It's quite an "old" implementation, but it does work really well. There are some other implementations available too, some are more modern, but I found this one was pretty good.
It might make sense to fork libcoro into a separate repo, I don't mind maintaining it, I already have a fork of it actually, and it's a bit different from the one here. But, it would make sense to update it a bit.
Updated by ioquatix (Samuel Williams) 10 months ago
I was reading https://sourceware.org/ml/libc-help/2016-01/msg00008.html and noticed the following regarding *context functions:
these functions are deprecated/dead -- they no longer exist in the latest
POSIX specification. the preference would be to stop using them. i think
we might consider dropping them in a future glibc version.
Of course they still exist, but yes, they are deprecated and absent from the latest POSIX standard. I might even remove them from my fork of libcoro.
Updated by shevegen (Robert A. Heiler) 10 months ago
However, it wouldn't be silly to update libcoro to make it
compile without problems on all supported platforms.
I can't speak for matz and the ruby core team, but in the past
there were (feature-)proposals that were rejected since they
were only specific for e. g. Linux - that is, improvements
pertaining to Linux, but not other OS. I think matz wants to have
ruby be as OS-agnostic as possible; in other words to work on
as many OS as possible, too. And there are quite some people
who use ruby on windows as well, for one reason or another.
As for benchmarks, I think any noticeable improvement is a
win and may fit into the "ruby 3 is 3x as fast as ruby 2.0",
but to get to that, it may be more important to verify that
the improvements could also work on windows. Even 3% would
be considerable. :)
By the way, I think there are some ruby-devs who use windows
too ... greg I think. May take a little before the issue here
is seen by them; they could probably help. (I use linux
myself so I won't be of much help.)
Updated by ioquatix (Samuel Williams) 10 months ago
The windows code path for fibers is relatively trivial both in libcoro and cont.c, so I wouldn’t be too concerned about windows support. It shouldn’t be much effort to make it work well in libcoro or keep existing windows code path.
Thanks for your concern and support, and I hope we can get some traction with this improvement.
I use fibers a lot (https://github.com/socketry/async is a fiber [stackful coroutine] based concurrency library). My next step is to benchmark the improvement. It obviously won't be anywhere near 2.2x for real code, but I think it should at least be noticeable.
Updated by shyouhei (Shyouhei Urabe) 10 months ago
I'm neutral. This is a feature request, but the "feature" being discussed is the speed of execution, which is by nature different from a usual feature. If this improvement can be truly transparent (and currently it seems to be), I think there is a chance of acceptance. Wider support for different OSes is definitely nice to have, of course.
Updated by ioquatix (Samuel Williams) 10 months ago
Thanks for your feedback. When I made this issue, I could only select "Bug", "Feature" or "Misc". Should I have selected "Misc" instead?
Updated by ioquatix (Samuel Williams) 10 months ago
I tested some real-world applications today. The first is async, which has a performance test for read context switch overhead: https://github.com/socketry/async/blob/master/spec/async/performance_spec.rb
This isn't a direct comparison, since I'm using rvm with ruby head and my branch, but it's pretty close.
# Without libcoro fibers
Async::Wrapper
Warming up --------------------------------------
Wrapper#wait_readable     1.801k i/100ms
     Reactor#register     2.087k i/100ms
Calculating -------------------------------------
Wrapper#wait_readable    176.789k (± 5.7%) i/s -    880.689k in   5.004582s
     Reactor#register    227.882k (± 2.9%) i/s -      1.140M in   5.004740s
Comparison:
     Reactor#register:   227882.2 i/s
Wrapper#wait_readable:   176789.3 i/s - 1.29x slower
# With libcoro fibers (12% more context switch for read operations)
Async::Wrapper
Warming up --------------------------------------
Wrapper#wait_readable     2.217k i/100ms
     Reactor#register     2.380k i/100ms
Calculating -------------------------------------
Wrapper#wait_readable    197.116k (± 2.7%) i/s -    986.565k in   5.008582s
     Reactor#register    256.078k (± 4.4%) i/s -      1.278M in   5.003710s
Comparison:
     Reactor#register:   256077.8 i/s
Wrapper#wait_readable:   197115.9 i/s - 1.30x slower
Updated by ioquatix (Samuel Williams) 10 months ago
Compare async-dns with bind9 for the same workload:
# Without libcoro-fiber
                        user     system      total        real
Async::DNS::Server  0.000345   0.000029   0.000374 (  0.000381)
Bind9               0.000294   0.000025   0.000319 (  0.000328)
# With libcoro-fiber (no significant difference)
                        user     system      total        real
Async::DNS::Server  0.000320   0.000048   0.000368 (  0.000371)
Bind9               0.000218   0.000033   0.000251 (  0.000258)
This one was a toss-up, I'd say there was no significant difference.
Updated by ioquatix (Samuel Williams) 10 months ago
I tested async-http, a web server. It has a basic performance spec using wrk as the client.
I ran it several times and report the best result of each below. It's difficult to make a judgement. I'd like to say performance was improved, but if so, by less than 5%. However, this benchmark is testing an entire web server stack. Context switching only happens a few times per request. If I had to take a guess, maybe not more than 4 times (accept, read request, write response). In many cases, we only context switch if the operation would block, which is unlikely for a small request/response on the loopback interface.
# Without libcoro-fiber
Async::HTTP::Server simple response
Running 2m test @ http://127.0.0.1:9292/
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   110.06us  647.25us  67.72ms   99.33%
    Req/Sec    12.58k     3.07k   26.94k    70.77%
  12021990 requests in 2.00m, 401.28MB read
Requests/sec: 100100.72
Transfer/sec:      3.34MB
# With libcoro-fiber
Async::HTTP::Server simple response
Running 2m test @ http://127.0.0.1:9292/
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   106.47us  834.32us  99.45ms   99.46%
    Req/Sec    12.66k     2.95k   17.61k    71.12%
  12093398 requests in 2.00m, 403.66MB read
Requests/sec: 100694.76
Transfer/sec:      3.36MB
This result surprised me a little bit, but now that I think about it, it could make sense (there is also the possibility I made a mistake or the benchmark is bad), because the cost of network (read/write) and processing (parsing, generating the response, buffers, GC) far outweighs the fiber yield/resume, which is already minimised. In real world situations, the results should lean more in favour of libcoro.
Just for interest, I also collected system call stats.
# Without libcoro
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 45.76    4.635066           2   2095278           sendto
 32.47    3.288691           1   4191323           rt_sigprocmask
 20.90    2.117062           1   2095611       324 recvfrom
  0.67    0.068189        9741         7           poll
  0.07    0.006821           1      6256      5313 openat
  0.03    0.003404           1      4034         5 lstat
  0.01    0.001072           1      1158           read
  0.01    0.001049           1       987           close
  0.01    0.000805           1       901       421 stat
  0.01    0.000627          25        25           clone
  0.01    0.000624           1       793           fstat
  0.01    0.000521           4       124           mmap
  0.00    0.000475           1       798       246 fcntl
  0.00    0.000475           2       297         1 epoll_wait
  0.00    0.000402           3       140           mremap
  0.00    0.000386           1       346       322 epoll_ctl
  0.00    0.000331           1       557       552 ioctl
  0.00    0.000323          16        20           futex
  0.00    0.000321           3        94           mprotect
  0.00    0.000307           1       213           brk
  0.00    0.000255           4        62           getdents
  0.00    0.000183           1       291           getuid
  0.00    0.000180           1       292           geteuid
  0.00    0.000177           1       292           getegid
  0.00    0.000172           1       291           getgid
  0.00    0.000096           3        36           pipe2
  0.00    0.000074           6        12           munmap
  0.00    0.000066          11         6         2 execve
  0.00    0.000052           2        23        14 accept4
  0.00    0.000047           3        18           prctl
  0.00    0.000047           2        27           set_robust_list
  0.00    0.000045           2        19           getpid
  0.00    0.000040           0        81         2 rt_sigaction
  0.00    0.000028           2        16         8 access
  0.00    0.000017           1        15           getcwd
  0.00    0.000016           1        14           readlink
  0.00    0.000016           0       241       238 newfstatat
  0.00    0.000014           0        96           lseek
  0.00    0.000013           1        10           chdir
  0.00    0.000013           3         4           arch_prctl
  0.00    0.000012           0        25           setsockopt
  0.00    0.000009           0        25           getsockname
  0.00    0.000007           2         4           prlimit64
  0.00    0.000006           0        17           getsockopt
  0.00    0.000006           3         2           getrandom
  0.00    0.000004           2         2           sched_getaffinity
  0.00    0.000004           4         1           clock_gettime
  0.00    0.000003           2         2           write
  0.00    0.000003           3         1           sigaltstack
  0.00    0.000003           2         2           set_tid_address
  0.00    0.000002           2         1           vfork
  0.00    0.000001           1         1           wait4
  0.00    0.000001           1         1           getresgid
  0.00    0.000000           0         8           pipe
  0.00    0.000000           0         1           dup2
  0.00    0.000000           0         8           socket
  0.00    0.000000           0         8           bind
  0.00    0.000000           0         8           listen
  0.00    0.000000           0         1           sysinfo
  0.00    0.000000           0         1           getresuid
  0.00    0.000000           0         8           epoll_create1
------ ----------- ----------- --------- --------- ----------------
100.00   10.128563               8400935      7448 total
# With libcoro
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 65.83    5.263501           2   2708883           sendto
 32.87    2.628193           1   2709155       263 recvfrom
  1.06    0.084583       16917         5           poll
  0.09    0.006915           1      6232      5313 openat
  0.06    0.004405           1      4034         5 lstat
  0.02    0.001276           1      1123           read
  0.02    0.001207           1       833       379 stat
  0.01    0.000996           1       963           close
  0.01    0.000510           1       785           fstat
  0.01    0.000492           1       533       528 ioctl
  0.00    0.000330           2       162         1 epoll_wait
  0.00    0.000327           0       797       246 fcntl
  0.00    0.000285          11        25           clone
  0.00    0.000253           1       232           brk
  0.00    0.000253           1       284       260 epoll_ctl
  0.00    0.000239           2       123           mmap
  0.00    0.000207           2        95           mprotect
  0.00    0.000168           8        20           futex
  0.00    0.000163           3        62           getdents
  0.00    0.000142           0       291           getuid
  0.00    0.000139           1       238       235 newfstatat
  0.00    0.000133           0       292           geteuid
  0.00    0.000131           0       291           getgid
  0.00    0.000129           0       292           getegid
  0.00    0.000080           7        12           munmap
  0.00    0.000058           2        32           rt_sigprocmask
  0.00    0.000057           1        88           lseek
  0.00    0.000057           2        36           pipe2
  0.00    0.000044           1        81         2 rt_sigaction
  0.00    0.000043           3        14           readlink
  0.00    0.000039           2        16         8 access
  0.00    0.000036           2        22        13 accept4
  0.00    0.000035           1        27           set_robust_list
  0.00    0.000033           2        18           prctl
  0.00    0.000028           1        19           getpid
  0.00    0.000026           2        15           getcwd
  0.00    0.000020           2        10           chdir
  0.00    0.000013          13         1           wait4
  0.00    0.000009           5         2           getrandom
  0.00    0.000008           0        25           setsockopt
  0.00    0.000006           3         2           write
  0.00    0.000006           0        25           getsockname
  0.00    0.000003           3         1           vfork
  0.00    0.000003           1         6         2 execve
  0.00    0.000003           1         4           arch_prctl
  0.00    0.000003           2         2           set_tid_address
  0.00    0.000003           1         4           prlimit64
  0.00    0.000002           0        17           getsockopt
  0.00    0.000002           2         1           sigaltstack
  0.00    0.000001           1         1           getresuid
  0.00    0.000001           1         1           getresgid
  0.00    0.000001           1         2           sched_getaffinity
  0.00    0.000000           0         8           pipe
  0.00    0.000000           0         1           dup2
  0.00    0.000000           0         8           socket
  0.00    0.000000           0         8           bind
  0.00    0.000000           0         8           listen
  0.00    0.000000           0         1           sysinfo
  0.00    0.000000           0         1           clock_gettime
  0.00    0.000000           0         8           epoll_create1
------ ----------- ----------- --------- --------- ----------------
rt_sigprocmask is essentially gone (4191323 calls down to 32) because it's not invoked by libcoro unless using swapcontext.
Updated by ioquatix (Samuel Williams) 10 months ago
It's been a while since I played around with libcoro.
I was evaluating its performance in a C++ program.
I found that it's not thread safe due to global variables. I changed them to thread-local to fix the issue, and it works well.
I just want to reinforce that this was a proof of concept, if we decide to roll with such an implementation, it requires more work. I am happy to help with that but it would be good to get some feedback regarding whether such a contribution would be acceptable before investing so much time.
Updated by ko1 (Koichi Sasada) 10 months ago
Sorry, I can't read all of your comments because they are too long :p
As you quoted first,
Here is a nice summary of what was done for RethinkDB: https://rethinkdb.com/blog/making-coroutines-fast/
In this article:
A lightweight swapcontext implementation
It shows that swapcontext has extra overhead because of the sigprocmask system call.
rt_sigprocmask was gone because it's not invoked by libcoro unless using swapcontext.
Yes.
Last year, I tried the modified swapcontext that the article introduced, and I got good performance.
(I looked at Fiber resume/yield ping-pong, found that sigprocmask is one source of overhead, googled it, and found the same page :p)
However, the swapcontext introduced there is based on glibc, so there is a license problem and we can't merge it into the Ruby source code.
Using libcoro (I haven't looked at the library, but going by what you say) seems to use the same technique, so it is one idea to employ.
However, I'm not sure it is the best way.
No conclusion, but it is my current comment.
Thanks,
Koichi
Updated by duerst (Martin Dürst) 10 months ago
ioquatix (Samuel Williams) wrote:
Thanks for your feedback. When I made this issue, I could only select "Bug", "Feature" or "Misc". Should I have selected "Misc" instead?
"Feature" should be okay.
Updated by ioquatix (Samuel Williams) 10 months ago
Thanks Koichi, for your valuable response and I appreciate your past work in this area.
I started hacking on my own implementation for x64. It is slightly simpler than libcoro.
I have been reviewing x64 ABI, and it should be pretty trivial to support both 64-bit Windows ABI and 64-bit System V ABI (Linux, Mac, Solaris, BSD). The amount of code is < 200 lines for both ABIs.
For all other ABIs, I suggest using the existing code path. I am happy to release this code to Ruby/MRI under whatever license is suitable.
Please be patient while I finish off the patch, when it is done I will update here.
Updated by ioquatix (Samuel Williams) 10 months ago
What compiler is used to compile 64-bit Ruby on Windows?
Updated by ioquatix (Samuel Williams) 10 months ago
Here is the initial code.
https://github.com/kurocha/coroutine
It implements a semantically similar interface to libcoro, but it supports native coroutines on win32, win64 and amd64. I should add a ucontext wrapper (makecontext/swapcontext) for other platforms, then I think all platforms are supported. libcoro didn't have good windows support.
I've put this code under the MIT license.
Updated by sam.saffron (Sam Saffron) 9 months ago
Does this change move us any closer to being able to ship fibers between threads?
Updated by ko1 (Koichi Sasada) 9 months ago
sorry I missed comments.
How would we ship this library? Bundle it, or have users download it separately?
(this is a similar discussion to jemalloc :))
Updated by ioquatix (Samuel Williams) 9 months ago
ko1 (Koichi Sasada) I would suggest we make a Ruby specific version, but we can also try to make a generic static library so that it can be maintained separately. I already have some other projects using coroutines, so it's useful to me to have a C library implementation which is maintained well.
sam.saffron (Sam Saffron) This is an interesting question which I did specifically try to address in this implementation. I will give you the details.
A typical implementation of Fiber uses thread-local variables for the main fiber and the currently executing fiber (Fiber.current). Because of this, it's annoying to ship a fiber between threads. Additionally, I'd argue that moving fibers between threads is inherently not safe. I'd kindly suggest that a coroutine which can be resumed on different threads is not a "Fiber" but a "Green Thread". The fundamental difference is how Fiber is implemented, and it depends on thread-local storage. For example, how would Fiber#resume work on a different thread if it's executing already? Right now, yield and resume are VERY efficient because they don't have to check anything like this.
However, coroutines are the underlying abstraction for implementing Fiber and they CAN be moved across threads.
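The thread affinity of Fiber described above can be observed from plain Ruby. A minimal sketch, assuming CRuby's current behaviour of refusing to resume a fiber from a thread other than the one that created it:

```ruby
# A fiber belongs to the thread that created it. Trying to resume it from
# another thread raises FiberError instead of switching stacks.
fiber = Fiber.new { :done }

error = Thread.new do
  begin
    fiber.resume
    nil
  rescue FiberError => e
    e
  end
end.value

puts error.class
puts error.message
```

This check is done at the Fiber level; it says nothing about whether the underlying coroutine (its stack and saved registers) could be resumed elsewhere.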
This particular implementation was designed very carefully to allow for this. The coroutine_transfer function takes two arguments: a coroutine in which to store the current stack, and a coroutine whose stack is to be restored. coroutine_transfer passes both of these arguments to the start function and, additionally, returns the coroutine that invoked it, so returning back doesn't require any shared state. Because of this, the implementation avoids any kind of "global" state; it's all on the coroutine stack.
Therefore, with this coroutine library, we can nicely implement green threads too, but you'd need to provide additional guarantees/locking around coroutine_transfer. If you want to transfer a coroutine to another thread, you need to move the coroutine_context data structure (which contains the stack) to the new thread, and the new thread needs to call coroutine_transfer. The coroutine can simply call coroutine_transfer to return back, using either the argument from or the result of a previous coroutine_transfer.
So, the short answer is yes.
ko1 (Koichi Sasada) I also finished implementing for arm64, and hopefully can implement for arm32 soon. I test on raspberry pi :) I don't know about PowerPC, I don't have any hardware to test this. Can we test in a VM?
Updated by ioquatix (Samuel Williams) 9 months ago
Here is the test which shows coroutine arguments and coroutine_transfer result.
The reason for the COROUTINE macro is that on win32, in order to avoid lots of stack manipulation, we need to use __fastcall.
Updated by ioquatix (Samuel Williams) 9 months ago
I've made a new branch with the new implementation above.
It shows slightly better performance than libcoro.
Here is without the PR:
^_^ > ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.161763
execution time for 1000 messages: 14.018874
setup time for 10000 fibers: 1.572869
execution time for 1000 messages: 13.778874
setup time for 10000 fibers: 0.917040
execution time for 1000 messages: 13.942525
setup time for 10000 fibers: 1.616929
execution time for 1000 messages: 13.991115
setup time for 10000 fibers: 1.623587
execution time for 1000 messages: 14.281334
And here it is with the PR, on macOS (the same system used in previous benchmarks):
^_^ > ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.160637
execution time for 1000 messages: 6.009332
setup time for 10000 fibers: 0.244175
execution time for 1000 messages: 6.246711
setup time for 10000 fibers: 0.242718
execution time for 1000 messages: 6.142166
setup time for 10000 fibers: 0.233410
execution time for 1000 messages: 5.994752
setup time for 10000 fibers: 0.288830
execution time for 1000 messages: 6.216617
Performance is about 2~2.5x faster depending on your analysis. Both creation and execution time are improved. But remember, this is a micro-benchmark.
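As a rough check of that range, the execution times from the two runs above can be averaged and compared. This is only arithmetic on the numbers already posted; the helper name mean is introduced here for illustration.

```ruby
# Execution times (seconds) copied from the two runs above.
without_pr = [14.018874, 13.778874, 13.942525, 13.991115, 14.281334]
with_pr    = [6.009332, 6.246711, 6.142166, 5.994752, 6.216617]

# Arithmetic mean of a list of timings.
def mean(values)
  values.sum / values.size
end

speedup = mean(without_pr) / mean(with_pr)
puts format("mean execution speed-up: %.2fx", speedup)
```

On these figures the mean execution speed-up comes out at roughly 2.3x, which sits inside the 2~2.5x range quoted.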
I was also interested in mjit performance:
Without PR, enabled mjit:
^_^ > ./build/bin/ruby --jit ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.172145
execution time for 1000 messages: 25.176702
setup time for 10000 fibers: 1.654751
execution time for 1000 messages: 14.729177
setup time for 10000 fibers: 1.016810
execution time for 1000 messages: 15.154141
setup time for 10000 fibers: 1.726305
execution time for 1000 messages: 14.797269
setup time for 10000 fibers: 2.025997
execution time for 1000 messages: 15.124753
With PR, enabled mjit:
x_x > ./build/bin/ruby --jit ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers: 0.179744
execution time for 1000 messages: 13.793318
setup time for 10000 fibers: 0.354717
execution time for 1000 messages: 10.664870
setup time for 10000 fibers: 0.308818
execution time for 1000 messages: 6.956352
setup time for 10000 fibers: 0.378568
execution time for 1000 messages: 6.553922
setup time for 10000 fibers: 0.295583
execution time for 1000 messages: 7.274086
We can see it still needs a bit of work.
I will try to isolate some interesting results from higher level frameworks.
The updated branch is here: https://github.com/ioquatix/ruby/tree/native-fiber
It only works on Darwin x64 at the moment, because the changes to autoconf do not cover all platforms yet. I'll fix this soon.
Updated by ioquatix (Samuel Williams) 9 months ago
I fixed autoconf issues and built on Linux. The performance improvement was even more impressive.
koyoko% ruby --version
ruby 2.6.0dev (2018-06-01 native-fiber 63544) [x86_64-linux]
last_commit=Better support for amd64 platforms
koyoko% ruby ./fiber_benchmark.rb
setup time for 1000 fibers: 0.007222
execution time for 10000 messages: 3.433891
setup time for 1000 fibers: 0.015365
execution time for 10000 messages: 3.177730
setup time for 1000 fibers: 0.010035
execution time for 10000 messages: 3.205329
setup time for 1000 fibers: 0.012063
execution time for 10000 messages: 2.968101
setup time for 1000 fibers: 0.010448
execution time for 10000 messages: 2.947756
koyoko% rvm use 2.6
Using /home/samuel/.rvm/gems/ruby-2.6.0-preview2
koyoko% ruby --version
ruby 2.6.0preview2 (2018-05-31 trunk 63539) [x86_64-linux]
koyoko% ruby ./fiber_benchmark.rb
setup time for 1000 fibers: 0.006881
execution time for 10000 messages: 13.242779
setup time for 1000 fibers: 0.009869
execution time for 10000 messages: 13.468187
setup time for 1000 fibers: 0.013938
execution time for 10000 messages: 12.691139
setup time for 1000 fibers: 0.014423
execution time for 10000 messages: 12.005481
setup time for 1000 fibers: 0.013953
execution time for 10000 messages: 12.535145
nobu (Nobuyoshi Nakada) do you mind confirming?
Updated by ioquatix (Samuel Williams) 9 months ago
Here is a more realistic benchmark, in which fiber context switching is only a tiny percentage of the actual run-time.
A brief summary of the benchmark: async-http uses an event-driven stackful coroutine (fiber) based design. Each request allocates a fiber, and each blocking operation (i.e. read) results in Fiber.yield. Once the IO is ready, Fiber#resume is called. So, for each request being processed, we expect several calls to Fiber.yield. async is optimistic, so it tries to perform the operation (e.g. read) and only yields if it results in EWOULDBLOCK, so in some cases (especially in synthetic benchmarks) some scheduling may be elided.
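The optimistic pattern described above can be sketched in a few lines of plain Ruby. This is not the async implementation; optimistic_read is a hypothetical helper written for this illustration, and the "reactor" is replaced by manual calls to resume.

```ruby
# Sketch only: `optimistic_read` is a hypothetical helper, not the async API.
# Try the read first; suspend the fiber only if the read would block.
def optimistic_read(io, length)
  loop do
    result = io.read_nonblock(length, exception: false)
    return result unless result == :wait_readable
    # A real reactor would register `io` here and resume us once it is readable.
    Fiber.yield(:wait_readable)
  end
end

reader, writer = IO.pipe

fiber = Fiber.new { optimistic_read(reader, 1024) }

first = fiber.resume    # pipe is empty, so the fiber yields
writer.write("hello")
second = fiber.resume   # data is available, so the read completes

puts first.inspect
puts second.inspect
```

If the data had been written before the first resume, the read would have completed without any Fiber.yield at all, which is why a loopback benchmark with small requests exercises fewer context switches than one might expect.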
koyoko% rvm use 2.6
Using /home/samuel/.rvm/gems/ruby-2.6.0-preview2
koyoko% ruby --version
ruby 2.6.0preview2 (2018-05-31 trunk 63539) [x86_64-linux]
koyoko% bundle exec rake wrk
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    63.59us   77.52us   4.53ms   98.32%
    Req/Sec    16.68k     1.07k   18.32k    74.26%
  167544 requests in 10.10s, 14.54MB read
Requests/sec:  16589.33
Transfer/sec:      1.44MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    60.85us   34.26us   1.39ms   95.82%
    Req/Sec    16.82k     0.87k   18.49k    70.00%
  167424 requests in 10.00s, 14.53MB read
Requests/sec:  16742.19
Transfer/sec:      1.45MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    62.44us   54.34us   3.81ms   97.62%
    Req/Sec    16.62k     1.00k   18.09k    67.33%
  166959 requests in 10.10s, 14.49MB read
Requests/sec:  16530.76
Transfer/sec:      1.43MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    61.89us   32.53us 687.00us   94.29%
    Req/Sec    16.54k     1.20k   18.37k    67.33%
  166105 requests in 10.10s, 14.42MB read
Requests/sec:  16445.91
Transfer/sec:      1.43MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    60.90us   37.64us   1.70ms   95.89%
    Req/Sec    16.89k     1.22k   18.57k    72.28%
  169694 requests in 10.10s, 14.73MB read
Requests/sec:  16802.33
Transfer/sec:      1.46MB
Here are the results with the PR:
koyoko% rvm use ruby-head-fiber
Using /home/samuel/.rvm/gems/ruby-head-fiber
koyoko% ruby --version
ruby 2.6.0dev (2018-06-01 native-fiber 63544) [x86_64-linux]
last_commit=Better support for amd64 platforms
koyoko% bundle exec rake wrk
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    62.53us   73.11us   5.02ms   97.96%
    Req/Sec    16.80k     1.35k   19.46k    63.37%
  168863 requests in 10.10s, 14.65MB read
Requests/sec: 16719.77
Transfer/sec: 1.45MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    58.91us   35.19us   1.54ms   95.25%
    Req/Sec    17.49k     1.16k   19.42k    69.31%
  175719 requests in 10.10s, 15.25MB read
Requests/sec: 17399.00
Transfer/sec: 1.51MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    58.64us   45.92us   3.09ms   96.88%
    Req/Sec    17.72k     1.10k   19.42k    71.29%
  178027 requests in 10.10s, 15.45MB read
Requests/sec: 17626.32
Transfer/sec: 1.53MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    60.83us   33.93us   1.06ms   94.93%
    Req/Sec    16.86k     1.54k   19.36k    63.37%
  169307 requests in 10.10s, 14.69MB read
Requests/sec: 16764.19
Transfer/sec: 1.45MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    59.07us   39.77us   2.17ms   95.97%
    Req/Sec    17.52k     0.98k   19.32k    66.34%
  176112 requests in 10.10s, 15.28MB read
Requests/sec: 17436.64
Transfer/sec: 1.51MB
This is actually better than I expected. I would say there is a practical improvement of about 5%. In this situation it's very workload dependent, but I'm glad that I saw something.
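As a rough cross-check, the Requests/sec figures from the two sets of five runs can be compared directly. The sketch below only does arithmetic on the numbers reported in this comment; the helper names are made up for the example. The five-run means differ by about 3.4% and the medians by about 4.9%.

```ruby
# Requests/sec from the five wrk runs of each build, copied from the output.
before = [16589.33, 16742.19, 16530.76, 16445.91, 16802.33] # 2.6.0preview2
after  = [16719.77, 17399.00, 17626.32, 16764.19, 17436.64] # native-fiber branch

# Arithmetic mean of a list of numbers.
def mean(values)
  values.sum / values.size
end

# Median; averages the two middle values when the list has even length.
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Relative change from old to new, in percent.
def percent_gain(old, new)
  (new / old - 1.0) * 100.0
end

mean_gain = percent_gain(mean(before), mean(after))
median_gain = percent_gain(median(before), median(after))

puts format("mean: %+.1f%%, median: %+.1f%%", mean_gain, median_gain)
```

With only five runs per build and visible run-to-run variance, these figures are indicative rather than conclusive.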
Updated by ioquatix (Samuel Williams) 9 months ago
I've made a short blog post about this PR: https://www.codeotaku.com/journal/2018-06/improving-ruby-fibers/index
Updated by cremes (Chuck Remes) 9 months ago
I'd like to link this to another open issue regarding Fiber migration between threads. https://bugs.ruby-lang.org/issues/13821
ioquatix (Samuel Williams), please note in the above-referenced bug that I put in a link to the "boost" documentation regarding coroutine movement between threads. An explicit API to lock/unlock ownership of the fiber to a thread would probably resolve some of the complaints people raise about fiber migration. If it's explicit, more guarantees can be made. Default behavior should be the current behavior where Fibers cannot migrate.
Thanks for your work on this.
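For reference, the current non-migrating behavior mentioned above can be observed with a short script. This is only a sketch of what happens today and assumes nothing beyond the standard Fiber and Thread classes.

```ruby
# A fiber created and started on the main thread.
fiber = Fiber.new do
  Fiber.yield :started
  :finished
end

first = fiber.resume # runs on the main thread until the first Fiber.yield

# Attempting to resume the same fiber from another thread raises FiberError.
error = Thread.new do
  begin
    fiber.resume
  rescue FiberError => e
    e
  end
end.value

puts "#{error.class}: #{error.message}"
```

An explicit lock/unlock ownership API, as proposed above, would have to relax this check in a controlled way.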
Updated by ioquatix (Samuel Williams) 9 months ago
cremes (Chuck Remes) Thanks for your positive feedback and linking me to related issues.
The coroutine implementation was specifically designed to handle cross-thread migrations, in the sense that all the required state to yield/resume is passed as arguments/returns to/from the coroutine.
What this means is that no global/thread-local state is required, so when moving a coroutine to another thread there is almost no additional data to sync, which is nice from an API point of view.
The bigger challenge is how Ruby Fiber is implemented. It does make it tricky. I would be happy to work towards this. I see the following path being viable:
- Merge these changes.
- Simplify the Fiber implementation by removing all the other implementations from cont.c and if necessary move these to the coroutine code (but ideally remove them).
- With the simplified Fiber code base, explore the overheads of Fiber creation/context switching and figure out the right places to put locking/checks (e.g. for locks being held, etc).
Updated by matz (Yukihiro Matsumoto) 5 months ago
OK, it sounds reasonable. We will give you commit privilege.
Matz.
Updated by hsbt (Hiroshi SHIBATA) 5 months ago
Hi, ioquatix.
I sent you an invitation to the Ruby core team. Please check it.
Updated by ioquatix (Samuel Williams) 2 months ago
- Target version set to 2.6
- Assignee set to ioquatix (Samuel Williams)
- Status changed from Open to Closed
This is now implemented across: arm32, arm64, ppc64le, win32, win64, x86, amd64. Thanks to everyone who helped with this. This is a really awesome first step to improving Ruby Fiber performance.