Bug #10009

IO operation is 10x slower in multi-thread environment

Added by ariveira (Alexandre Riveira) over 9 years ago. Updated almost 9 years ago.

Status: Open
Target version: -
ruby -v: ruby 2.1 x ruby 1.9.2 with taskset
[ruby-core:63556]

Description

I previously filed issue #9832, but that test did not include an I/O operation.
In the attached script I simulate an I/O operation in a multi-threaded environment.
For ruby 1.9.2, I apply taskset -c -p 2 #{Process.pid} to regulate thread behavior.
The second thread performs the I/O operation.
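
For readers without the attachment, the shape of such a test can be sketched roughly as follows. This is a minimal illustration, not the attached teste_thread_schedule_2.rb: the method name `run_bench` is hypothetical, and a 1 ms `IO.select` timeout on an idle pipe stands in for the real I/O operation (the thread suggests the original script queries Postgres).

```ruby
# Two CPU-bound threads ("first", "third") compete with one I/O-bound
# thread ("second"). Each thread counts how many iterations it completes.
# Hypothetical sketch; the blocking wait is simulated with IO.select.
def run_bench(duration)
  reader, writer = IO.pipe   # nothing is ever written, so select always times out
  counts = { first: 0, second: 0, third: 0 }
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + duration
  running = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline }

  # CPU-bound worker: increments its own counter until the deadline.
  cpu = ->(key) { Thread.new { counts[key] += 1 while running.call } }

  first = cpu.call(:first)
  second = Thread.new do
    while running.call
      IO.select([reader], nil, nil, 0.001)  # blocking wait, releases the GVL
      counts[:second] += 1                  # needs the GVL again to continue
    end
  end
  third = cpu.call(:third)

  [first, second, third].each(&:join)
  counts
ensure
  reader&.close
  writer&.close
end

p run_bench(2.0)
```

A low `second` count relative to the run time indicates that the I/O thread spends most of its time waiting to get the GVL back rather than waiting on I/O.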

My results:

  1. ruby 2.1.2
    first 43500194
    second 95
    third 42184385

  2. ruby-2.0.0-p451
    first 38418401
    second 95
    third 37444470

  3. 1.9.3-p545
    first 121260313
    second 50
    third 44275164

  4. 1.9.2-p320
    first 31189901
    second 897 <============
    third 31190598

Regards

Alexandre Riveira


Files

teste_thread_schedule_2.rb (1.05 KB), ariveira (Alexandre Riveira), 07/06/2014 07:33 AM
teste_thread_schedule.py (953 Bytes), ariveira (Alexandre Riveira), 07/08/2014 09:54 AM
teste_thread_schedule.rb (955 Bytes), ariveira (Alexandre Riveira), 07/08/2014 09:56 AM
test_thread_sched_pipe.rb (1.01 KB), normalperson (Eric Wong), 07/08/2014 08:37 PM
test_thread_sched.rb (2.82 KB), ariveira (Alexandre Riveira), 08/16/2014 03:55 PM
test_thread_sched.rb (2.88 KB), ariveira (Alexandre Riveira), 08/16/2014 05:50 PM
tests.txt (2.5 KB), ariveira (Alexandre Riveira), 08/16/2014 05:50 PM
test.py (1.41 KB), ariveira (Alexandre Riveira), 10/29/2014 01:23 PM

Updated by ariveira (Alexandre Riveira) over 9 years ago

My environment is Debian 3.2.0-4-amd64

Updated by ariveira (Alexandre Riveira) over 9 years ago

Alexandre Riveira wrote:
I ran the tests using Rubinius.
Because taskset is applied, Rubinius uses only 1 processor. Results:

first 18164692
second 10007 <==========
third 18184825

Updated by normalperson (Eric Wong) over 9 years ago

I'll try resurrecting an old eventfd proposal and maybe also bare futexes
to see if that improves things.

Updated by ariveira (Alexandre Riveira) over 9 years ago

Eric Wong wrote:

I'll try resurrecting an old eventfd proposal and maybe also bare futexes
to see if that improves things.

Thanks, Eric.

If an application running on Rainbows! has even one thread using 100% CPU, the worker's database queries are badly affected. The workaround is to fork a separate worker for heavy tasks, but that is often not possible.

The GIL in Python behaves better, but Python uses 160% CPU whereas Ruby uses 100%.

Updated by normalperson (Eric Wong) over 9 years ago

eventfd doesn't help performance (but still reduces FD count),
I never expected eventfd to improve speed, though.

Lowering TIME_QUANTUM_USEC (in thread_pthread.c) helps with the I/O case
(try it yourself if you have a 1000HZ kernel); but hurts overall
throughput.
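
The effect of the time quantum can be observed from Ruby itself, without rebuilding the interpreter. The sketch below (the method name `wakeup_delays` is hypothetical) measures how much longer than requested a 1 ms I/O wait takes while another thread spins on the CPU; the extra delay is mostly time spent waiting to get the GVL back. It is an illustration under those assumptions, not one of the benchmarks attached to this issue.

```ruby
# Measure how late a thread resumes after a short I/O wait while a
# CPU-bound thread competes for the GVL. Illustrative sketch only.
def wakeup_delays(samples: 5, wait: 0.001)
  reader, writer = IO.pipe                  # idle pipe: select always times out
  stop = false
  spinner = Thread.new { nil until stop }   # CPU-bound competitor

  Array.new(samples) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    IO.select([reader], nil, nil, wait)     # releases the GVL while waiting
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    [elapsed - wait, 0.0].max               # extra delay beyond the requested wait
  end
ensure
  stop = true
  spinner&.join
  reader&.close
  writer&.close
end

wakeup_delays.each { |d| puts format("extra delay: %.1f ms", d * 1000) }
```

Running this on builds with different TIME_QUANTUM_USEC values should show the extra delay shrinking as the quantum is lowered, at the cost of more frequent thread switches.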

Attached is an I/O benchmark using pipes, with no Postgres requirement.
Increasing GVL (or any lock) performance is tricky because we need to
balance fairness and avoid starvation cases. The GVL was rewritten to
avoid starvation in 1.9.3, so that's likely the cause of the major
difference starting with 1.9.3.

I doubt I can noticeably improve performance with futexes vs mutex/condvar.

How much does GVL performance between 1.9.2 and 2.1 affect real-world
performance on Rainbows!/yahns apps for you? (not "hello world"-type
apps).

I hope to make GVL optional in a few years, but that is tricky.
Ironically, part of the reason I don't like GVL is I don't want to pay
any threading/locking costs for tiny single-threaded apps, either :)

Updated by ariveira (Alexandre Riveira) over 9 years ago

My application is not a website; it is an ERP, so reports and other very heavy tasks are run. The system then stalls, because a single thread using 100% CPU hurts the whole worker: every request is delayed by at least 2 seconds, and requests pile up.
The key point is that one thread using 100% CPU leaves all the other worker threads able to make only a few requests to Postgres.

Is Ruby without the GVL possible?

I believe Python works better because it uses part of another CPU (160%) to manage all the threads.
Running the same test with PyPy, it uses 100% CPU like Ruby and shows the same problems as Ruby.

Updated by ariveira (Alexandre Riveira) over 9 years ago

Some information that I consider important:
With BFS kernels, ruby 1.9.2 works well, as if taskset had been applied.
Other kernels, such as FreeBSD and macOS, show similar behavior with ruby 1.9.2.

http://en.wikipedia.org/wiki/Brain_Fuck_Scheduler
https://wiki.archlinux.org/index.php/linux-ck

Updated by ariveira (Alexandre Riveira) over 9 years ago

Alexandre Riveira wrote:

Some information that I consider important:
With BFS kernels, ruby 1.9.2 works well, as if taskset had been applied.
Other kernels, such as FreeBSD and macOS, show similar behavior with ruby 1.9.2.

http://en.wikipedia.org/wiki/Brain_Fuck_Scheduler
https://wiki.archlinux.org/index.php/linux-ck

My results with a Linux BFS/CK kernel + taskset:

first 103214331
second 2762 <======
third 24259986

Updated by ariveira (Alexandre Riveira) over 9 years ago

Eric Wong wrote:

Lowering TIME_QUANTUM_USEC (in thread_pthread.c) helps with the I/O case
(try it yourself if you have a 1000HZ kernel); but hurts overall
throughput.

Hello Eric!!!!

I was delighted with the result of changing TIME_QUANTUM_USEC. I changed its value to 1000; just look at the results:

ruby 2.
first 17434583
second 2754 <=============
third 16752441

If this causes any problems, I will try 10 * 1000.

It is remarkable that there was no need to apply taskset.
Since this is only a microbenchmark, I will run further tests and, if all goes well, put the change into production. I will report back afterwards.

Updated by ariveira (Alexandre Riveira) over 9 years ago

Alexandre Riveira wrote:

Eric Wong wrote:

Lowering TIME_QUANTUM_USEC (in thread_pthread.c) helps with the I/O case
(try it yourself if you have a 1000HZ kernel); but hurts overall
throughput.

Hello Eric!!!!

I was delighted with the result of changing TIME_QUANTUM_USEC. I changed its value to 1000

The tests are complete. Without the change, my system under stress testing took 30 seconds to load a page; with the change, all pages load in less than 1 second.

Updated by normalperson (Eric Wong) over 9 years ago

Good to know it works for you. Keep in mind TIME_QUANTUM_USEC=1000 is
very low and may cause problems on some systems, too.

My gut feeling is 100ms (default) is too high, but 10ms is too low
(based on kosaki's comment). Maybe 20ms - 50ms is acceptable. There is
a wide variety of configurations we must work with (even just on Linux).

Can you try 20-50ms?
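
To put candidate quantum values side by side, a rough back-of-the-envelope model can help. The model is an assumption made here for illustration, not a description of the actual scheduler: it supposes that an I/O thread, after every blocking wait, has to let each CPU-bound thread use up one full quantum before it gets the GVL back. Under that assumption the quantum caps how many blocking operations per second the I/O thread can complete. The method name `max_io_ops_per_sec` is hypothetical.

```ruby
# Upper bound on blocking operations per second for one I/O thread,
# ASSUMING it waits one full quantum per CPU-bound thread after every wait.
# This is a simplified model, not how the GVL is specified to behave.
def max_io_ops_per_sec(quantum_usec, cpu_threads)
  1_000_000.0 / (quantum_usec * cpu_threads)
end

# Two CPU-bound threads, as in the benchmark from the issue description.
[100_000, 50_000, 20_000, 10_000, 1_000].each do |quantum|
  puts format("quantum %6d usec -> at most %6.1f ops/sec",
              quantum, max_io_ops_per_sec(quantum, 2))
end
```

The measured rates reported later in this thread are lower than these bounds, which is what one would expect from an upper bound that ignores the time the operation itself takes.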

About GVL:
Replacing GVL with fine-grained locks is possible (and ko1 tried it),
but performance suffered for single-thread cases.
It should be possible to do with lock-free techniques, but that is
difficult to get right.

Updated by ariveira (Alexandre Riveira) over 9 years ago

Hi Eric !

Eric Wong wrote:

Good to know it works for you. Keep in mind TIME_QUANTUM_USEC=1000 is

What problems might I run into?

Can you try 20-50ms?

In the application I ran a stress test in which 5 threads overload the system.

With 50 ms, the latency is close to 15 seconds.
With 20 ms, the latency is close to 10 seconds.
With 10 ms, the latency is 4 or 5 seconds.

The magic number is TIME_QUANTUM_USEC=1000. There is no latency in this case.

Below are the teste_thread_schedule_2 microbenchmark results with Postgres:

TIME_QUANTUM_USEC (1000)
first 22882400
second 2654 <===
third 22642172
in 21.08 seconds

2654 / 21.08 is about 126 database connections per second.

TIME_QUANTUM_USEC (20 * 1000)
first 33003617
second 258 <==
third 33851933
in 23.07 seconds
258 / 23.07 is about 11 database connections per second. I think this is a small number of connections per second, but I welcome comments.

TIME_QUANTUM_USEC (50 * 1000)
first 42811975
second 116
third 42005480
in 25.12 seconds

116 / 25.12 is about 5 database connections per second.
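
The per-second figures above can be recomputed directly from the reported counts and run times. The snippet below only restates the numbers from this comment; the constant and method names are illustrative.

```ruby
# [second-thread count, run time in seconds] keyed by TIME_QUANTUM_USEC,
# copied from the results reported above.
RUNS = {
  1_000  => [2654, 21.08],
  20_000 => [258, 23.07],
  50_000 => [116, 25.12]
}.freeze

# Operations per second, rounded to one decimal place.
def per_second(count, seconds)
  (count / seconds).round(1)
end

RUNS.each do |quantum, (count, seconds)|
  puts "TIME_QUANTUM_USEC=#{quantum}: #{per_second(count, seconds)} per second"
end
# prints 125.9, 11.2 and 4.6 per second respectively
```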

Updated by normalperson (Eric Wong) over 9 years ago

wrote:

I doubt I can noticeably improve performance with futexes vs mutex/condvar.

Totally not-speed-optimized futex-based lock/condvar implementation at

git://bogomips.org/ruby.git (futex branch)
http://bogomips.org/ruby.git/patch?id=ae93c50c8de

I am not sure if my implementation is correct, but "make check" passes
with both 8 cores and 1 core active (8-core Vishera). I will probably
write an independent (C-only) test for more parallelism and maybe steal
some from glibc (I also plan on using this futex-based lock
implementation outside of Ruby).

Benchmarks don't seem to show much (if any) improvement, yet. Speed
improvement from reimplementing GVL around bare futex interface may be
possible (w/o using separate condvar/mutex layer).

On amd64 GNU/Linux, pthread_mutex_t is 40 bytes, but these futex-based
locks only need 4 bytes. Similarly, pthread_cond_t is 48 bytes, making
rb_nativethread_cond_t 56 bytes with pthreads; this futex implementation
currently requires only 16 bytes for a condvar.

Size improvement may be noticeable for some apps with many Mutexes:
the lock/cond reductions mean rb_mutex_struct is now 48 bytes instead
of 128 bytes.
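
To get a feel for when that reduction matters, the saving can be scaled by the number of mutexes an application keeps alive. The snippet uses only the two rb_mutex_struct sizes quoted above; the constant and method names are illustrative, and it ignores any other per-object overhead.

```ruby
# rb_mutex_struct sizes quoted above for amd64 GNU/Linux.
PTHREAD_MUTEX_STRUCT_BYTES = 128
FUTEX_MUTEX_STRUCT_BYTES   = 48

# Bytes saved in rb_mutex_struct alone for a given number of mutexes.
def mutex_struct_savings(mutex_count)
  mutex_count * (PTHREAD_MUTEX_STRUCT_BYTES - FUTEX_MUTEX_STRUCT_BYTES)
end

[1_000, 100_000, 1_000_000].each do |count|
  mib = mutex_struct_savings(count) / (1024.0 * 1024.0)
  puts format("%9d mutexes: %.2f MiB saved", count, mib)
end
```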

Updated by ariveira (Alexandre Riveira) over 9 years ago

I rewrote the test and added the --tasket and --postgres arguments so that the same test file can be used for every case.

Feel free to change whatever you want.

I will report back soon on the tests with futex.
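
A command-line switch layout like the one described could look roughly like this. It is a sketch under assumptions: the flag spellings, the `parse_flags` and `taskset_command` helper names, and the default CPU number are illustrative and not taken from the attached script; only the `taskset -c -p <cpu> <pid>` form comes from the issue description.

```ruby
require "optparse"

# Parse the two switches into a plain hash (flag names are assumptions).
def parse_flags(argv)
  flags = { taskset: false, postgres: false }
  OptionParser.new do |opts|
    opts.on("--taskset", "pin the process to a single CPU") { flags[:taskset] = true }
    opts.on("--postgres", "use Postgres queries as the I/O operation") { flags[:postgres] = true }
  end.parse(argv)
  flags
end

# Build the taskset invocation mentioned in the issue description.
def taskset_command(pid, cpu = 2)
  ["taskset", "-c", "-p", cpu.to_s, pid.to_s]
end

flags = parse_flags(%w[--postgres])   # in a real script: parse_flags(ARGV)
system(*taskset_command(Process.pid)) if flags[:taskset]
p flags
```

Keeping both modes behind switches in one file means the same measurement loop runs in every configuration, so results differ only by the option under test.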

Updated by ariveira (Alexandre Riveira) over 9 years ago

I added uname output to the test script to show kernel / platform details.
The accompanying test results follow.

Tests (test_thread_sched.rb --postgres) on debian-kfreebsd-amd64:

ruby 1.9.2
name...........: 9.0-2-amd64 x86_64
processor......: Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz with (4 processores)
taskset........: false
total..........: 101480933
postgres.......: 467
time...........: 20.232985931 (ideal value of 20 seconds)

ruby 2.1.2
name...........: 9.0-2-amd64 x86_64
processor......: Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz with (4 processores)
taskset........: false
total..........: 71870185
postgres.......: 58
time...........: 21.123303293 (ideal value of 20 seconds)

ruby 2.1.2 with TIME_QUANTUM_USEC = 1000
name...........: 9.0-2-amd64 x86_64
processor......: Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz with (4 processores)
taskset........: false
total..........: 63996063
postgres.......: 2510
time...........: 20.050760184 (ideal value of 20 seconds)

Updated by normalperson (Eric Wong) over 9 years ago

Some tests adapted from glibc:

git clone git://80x24.org/rb_futex_test

tst-cond18-f/p are micro benchmarks, -f (futex version) is roughly
twice as fast as the -p (pthreads version); but that doesn't seem
to translate to noticeable real-world speed improvements in Ruby.

Updated by ariveira (Alexandre Riveira) over 9 years ago

The attached Python script compares blocking I/O in Python versus Ruby.

Results:

ruby with unchanged TIME_QUANTUM_USEC (100 * 1000)
first..........: 32445253
second.........: 30660119
postgres.......: 61
time...........: 1.5022704 secs

ruby with TIME_QUANTUM_USEC (1 * 1000)

first..........: 17793384
second.........: 17438453
postgres.......: 4638

python

first 17498064
postgres 2027
third 18702539

Updated by hsbt (Hiroshi SHIBATA) almost 9 years ago

  • Assignee set to ko1 (Koichi Sasada)
  • Priority changed from 6 to Normal