Misc #10278

closed

[RFC] st.c: use ccan linked list

Added by normalperson (Eric Wong) over 10 years ago. Updated almost 7 years ago.


Description

Mainly posting this for documentation purposes because it seems like
an obvious thing to try given we have ccan/list nowadays.

Having shorter code along with branchless insert/delete, and using a
common linked-list API, is very appealing.

On the other hand, benchmark results are a mixed bag:

http://80x24.org/bmlog-20140922-032221.13002

Also, I may have introduced new bugs the tests didn't catch.
The st_foreach* functions get a bit strange when dealing with
packed-to-unpacked transitions while iterating.

Great thing: bighash is faster (as expected) because of branchless
linked list insertion. However, the major speedup in bighash probably
isn't too important, since most hashes are small and users will never
notice.

vm2_bighash*	1.222

Also, we could introduce rb_hash_new_with_size() for use in insns.def
(newhash) if people really care about the static bighash case (I don't
think many do).

Real regressions: iteration seems slower because the loop conditions
are more complex :<

hash_keys	0.978
hash_values	0.941

However, hash_keys/values regressions are pretty small.

Things that worry me:

vm1_attr_ivar*	0.736
vm1_attr_ivar_set*	0.845

WTF? I reran the attr_ivar tests, and the numbers got slightly better:

 ["vm1_attr_ivar",
  [[1.851297842,
    1.549076322,
    1.623306027,
    1.956916541,
    1.533218607,
    1.554089054,
    1.702590516,
    1.789863782,
    1.711815018,
    1.851260599],
   [1.825423191,
    1.824934062,
    1.542471471,
    1.868502091,
    1.79106375,
    1.884568825,
    1.850712387,
    1.797538962,
    2.165696827,
    1.866671482]]],
 ["vm1_attr_ivar_set",
  [[1.926496052,
    2.04742421,
    2.025571131,
    2.047656291,
    2.043747069,
    2.099586827,
    1.953769267,
    2.017580504,
    2.440432603,
    2.111254634],
   [2.365839125,
    2.076282818,
    2.112784977,
    2.118754445,
    2.091752673,
    2.161164561,
    2.107439445,
    2.128147747,
    2.945295069,
    2.131679632]]]]

Elapsed time: 91.963235593 (sec)
-----------------------------------------------------------
benchmark results:
minimum results in each 10 measurements.
Execution time (sec)
name	orig	stll
loop_whileloop	0.672	0.670
vm1_attr_ivar*	0.861	0.872
vm1_attr_ivar_set*	1.255	1.406

Speedup ratio: compare with the result of `orig' (greater is better)
name	stll
loop_whileloop	1.002
vm1_attr_ivar*	0.987
vm1_attr_ivar_set*	0.892

Note: these tests do not even hit st, and even if they did, these are
tiny tables which are packed, so the linked-list implementation has no
impact (especially not on lookup tests).

So yeah, probably something messy with the CPU caches.
I always benchmark with the performance CPU governor, and the
rerun ivar numbers are run with CPU pinned to a single core.
CPU: AMD FX-8320. Maybe I can access my other systems later.


Files

0001-st.c-use-ccan-linked-list.patch (13.1 KB) 0001-st.c-use-ccan-linked-list.patch normalperson (Eric Wong), 09/22/2014 05:32 AM
0001-st.c-use-ccan-linked-list-v2.patch (16.8 KB) 0001-st.c-use-ccan-linked-list-v2.patch normalperson (Eric Wong), 06/24/2015 08:20 AM

Related issues 1 (0 open, 1 closed)

Related to Ruby master - Feature #10321: [PATCH] test st_foreach{,_check} for packed-to-unpack change (Closed) - normalperson (Eric Wong), 10/03/2014

Updated by nobu (Nobuyoshi Nakada) over 10 years ago

  • Description updated (diff)

Probably, we should remove back member.

Updated by normalperson (Eric Wong) over 10 years ago

nobu (Nobuyoshi Nakada) wrote:

Probably, we should remove back member.

Just back and making it a singly-linked list?
st_delete would become O(n), unfortunately.

I also do not think going from 48 to 40 bytes makes a difference on most
x86-64 mallocs because (IIRC) the ABI requires 16-byte alignment.

If we can go from 48 => 32 bytes, great!
But I don't see what else to remove while keeping compatibility/speed :<

Updated by nobu (Nobuyoshi Nakada) over 10 years ago

  • Status changed from Open to Assigned

Indeed.

Updated by normalperson (Eric Wong) over 10 years ago

Better (at least more explainable) results on the Xeon:
http://80x24.org/spew/m/st-ccan-list-bench@meltdown.html

Will test on the old Phenom II, too.

Updated by normalperson (Eric Wong) over 10 years ago

Eric Wong wrote:

Will test on the old Phenom II, too.

bighash looks nice as expected, haven't had time to look more into
these numbers but it's more consistent than the Vishera (FX-8320):
http://80x24.org/spew/m/20140922231823.GA21644%40dcvr.yhbt.net.html

Updated by normalperson (Eric Wong) about 10 years ago

A fixup patch for packed => unpacked transitions:

http://80x24.org/spew/m/st-ccan-ll-fixup-1%4080x24.org.txt

This needs tests, but it seems the packed => unpacked transitions during
iteration are totally untested in the current Ruby implementation.
Fortunately, it seems hash.c bans such transitions.

I suppose I can write tests to explicitly exercise these changes,
but it may be easier and cheaper to bail out (possibly raise an error).

Updated by normalperson (Eric Wong) about 10 years ago

  • Related to Feature #10321: [PATCH] test st_foreach{,_check} for packed-to-unpack change added

Updated by normalperson (Eric Wong) about 10 years ago

I like my original patch a little more, now, especially since it passes
the test in #10321. I'll squash the following simplification on top
if I commit: http://80x24.org/spew/m/st-ll-foreach-simple%40whir.txt

Updated by normalperson (Eric Wong) about 10 years ago

Since we'll need it for st_reverse_foreach_check ([ruby-core:65408]),
I've implemented list_for_each_rev_safe for ccan/list:
https://lists.ozlabs.org/pipermail/ccan/2014-October/thread.html
It applies on top of two of my other patches intended for compile.c:
https://lists.ozlabs.org/pipermail/ccan/2014-September/thread.html

Also, updated bench results from the weird FX-8320 CPU after
simplifying the foreach loops a little:
http://80x24.org/spew/m/st-ll-v2-results%40whir.txt (good, I think)

Also http://80x24.org/spew/m/st-ccan-ll-fixup-1%4080x24.org.txt was
wrong and I rejected it due to improved tests in [Feature #10321]

Updated by normalperson (Eric Wong) over 9 years ago

Updated v2 patch.

I care about this more, now, since I want to make unordered hashes an
option in st.c in the future for internals. This should make future
changes easier to understand, with less code.

I'm willing to trade a little hash iteration (rare, I hope) performance
for better insert/delete performance (on big hashes).

Also, this yields minor object size reductions (on 32-bit x86):

   text    data     bss     dec     hex filename
  14718      24       0   14742    3996 st.o-before
  14166      24       0   14190    376e st.o-after

Updated by normalperson (Eric Wong) almost 7 years ago

  • Status changed from Assigned to Closed

