Performance fix in the runtime actor schedule #1521

SeanTAllen · 2017-01-18T17:31:47Z

The original testing for performance impact of the ticketing
algo change was done at Sendence. At the time, we didn't see
any impact. However, we weren't looking in the right place.

We have since found that the ticketing algo under some workloads
can result in very long pauses (up to a second) getting a ticket.
Further, we regularly see pauses while waiting to get a ticket
measured in the 100's of micro-seconds.

This commit reverts to the prior double word cas algo that we
found has far better performance characteristics in the "bad"
scenarios we found and the same or slightly better characteristics
under normal circumstances.

The reversion includes the work in a later commit to support GCC
4.7.

It does not include the VALGRIND awareness that was added later.
I'm not sure where to put that in, as such, I left it out.

There was a change in heap that went along with the original
ticketing algo change. That is not included as part of this commit
as we at Sendence didn't switch that and thus are unsure if it
should be switched or what the impact would be.

The original testing for performance impact of the ticketing algo change was done at Sendence. At the time, we didn't see any impact. However, we weren't looking in the right place. We have since found that the ticketing algo under some workloads can result in very long pauses (up to a second) getting a ticket. Further, we regularly see pauses while waiting to get a ticket measured in the 100's of micro-seconds. This commit reverts to the prior double word cas algo that we found has far better performance characteristics in the "bad" scenarios we found and the same or slightly better characteristics under normal circumstances. The reversion includes the work in a later commit to support GCC 4.7. It does not include the VALGRIND awareness that was added later. I'm not sure where to put that in, as such, I left it out. There was a change in heap that went along with the original ticketing algo change. That is not included as part of this commit as we at Sendence didn't switch that and thus are unsure if it should be switched or what the impact would be.

Praetonus · 2017-01-19T15:06:20Z

This will need some additional changes to work on GCC 4.7 and 4.8 because of this thing. I've added this assert as a safeguard because when doing a compare_exchange on an object bigger than machine word size, the object must be aligned on a 16 bits boundary on x86 and the existing function doesn't enforce that.

However, I think we can solve the ABA problem in a more efficient way in this algorithm by freeing the nodes at a different time.

void* ponyint_mpmcq_pop(mpmcq_t* q, mpmcq_node_t** old_tail)
{
  mpmcq_node_t* tail, next;

  tail = atomic_load_explicit(&q->tail, memory_order_acquire);

  do
  {
    next = atomic_load_explicit(&tail->next, memory_order_acquire);

    if(next == NULL)
      return NULL;

  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &tail, next,
    memory_order_acq_rel, memory_order_acquire));

  void* data = atomic_load_explicit(&next->data, memory_order_acquire);

  atomic_store_explicit(&next->data, NULL, memory_order_release);

  while(atomic_load_explicit(&(*old_tail)->data, memory_order_acquire) != NULL)
    ponyint_cpu_relax();

  POOL_FREE(mpmcq_node_t, *old_tail);
  *old_tail = tail;
  return data;
}

The idea here is that each thread keeps a "buffer node" that it will only free after popping another node from the queue. As a result, the ABA problem will occur if, during a single CAS iteration for one thread

another thread pops a node
the same thread pops an additional node
the first node is added back to the same queue and comes back as the tail.

I think having this chain of events in this time frame is unlikely enough to never occur in real world cases, even in the highly contended inject queue. We'll probably need to have a buffer node per thread and per queue. The memory overhead would be n_threads * (n_threads + 1) * sizeof(mpmcq_node_t*), which is approximately 33 kB for 64 threads. That seems reasonable to me.

Also, this whole thing is unnecessary on ARM as the ABA problem detection is handled by the hardware instructions (load-link/store-conditional).

@sylvanc @SeanTAllen Thoughts?

sylvanc · 2017-01-19T15:28:58Z

The buffer node idea seems very interesting. It is n^2 space complexity, so for a 4096 core machine (like an SGI) it's a 128 MB overhead... which still seems ok to me.

Can the thread keep next instead of tail as old_tail? That way, a thread frees the previous thing it popped from the queue, exactly like a messageq_t but with multiple consumers. Something like:

void* ponyint_mpmcq_pop(mpmcq_t* q, mpmcq_node_t** old_tail)
{
  mpmcq_node_t* tail, next;

  tail = atomic_load_explicit(&q->tail, memory_order_acquire);

  do
  {
    next = atomic_load_explicit(&tail->next, memory_order_acquire);

    if(next == NULL)
      return NULL;

  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &tail, next,
    memory_order_acq_rel, memory_order_acquire));

  POOL_FREE(mpmcq_node_t, *old_tail);
  *old_tail = next;
  return atomic_load_explicit(&next->data, memory_order_acquire);
}

I may be overlooking something...

Praetonus · 2017-01-19T16:00:01Z

Yes, that's a good idea. The only question is, who gets to free the original stub node? For the individual scheduler queues, the stub node of a queue could be the original old_tail of the thread owning that queue. For the inject queue, one of the scheduler threads could get the stub as its old_tail and the other threads could get a stub node not linked to anything (or NULL, but then we'd have to check for that uncommon case).

Praetonus · 2017-01-19T19:11:10Z

Also, I think that the memory ordering for some of these operations can be weakened:

void* ponyint_mpmcq_pop(mpmcq_t* q, mpmcq_node_t** old_tail)
{
  mpmcq_node_t* tail, next;

  // relaxed instead of acquire because there is no release operation on the tail (see the CAS comment)
  tail = atomic_load_explicit(&q->tail, memory_order_relaxed);

  do
  {
    // relaxed instead of acquire. The synchronisation is moved outside of the loop as a standalone fence
    next = atomic_load_explicit(&tail->next, memory_order_relaxed);

    if(next == NULL)
      return NULL;

    // On success, relaxed instead of acquire/release because we didn't do any write and there isn't anything to release, and there aren't any writes to acquire
    // On failure, same thing as the load of the tail, there isn't anything to acquire
  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &tail, next,
    memory_order_relaxed, memory_order_relaxed));

  // synchronise on tail->next and prevent the POOL_FREE of old_tail and the load of next->data from "moving" before the CAS, where we don't own the node yet
  atomic_thread_fence(memory_order_acquire);

  POOL_FREE(mpmcq_node_t, *old_tail);
  *old_tail = next;
  // relaxed instead of acquire. There is no release operation on next->data
  return atomic_load_explicit(&next->data, memory_order_relaxed);
}

Praetonus · 2017-01-21T09:48:07Z

@SeanTAllen @sylvanc Should I open a PR with that change? Or maybe add it on top of this PR?

SeanTAllen · 2017-01-21T15:45:26Z

I think it makes sense to have this PR as is, the revert.
And then another commit to weaken. That will make for a better history and easier testing if we find problems later.

Praetonus · 2017-01-21T15:53:02Z

Yes, that makes senses. Note that it won't work as is on GCC 4.7 and 4.8 though, as I explained above.

SeanTAllen · 2017-01-21T17:45:48Z

ok.

SeanTAllen · 2017-01-21T17:46:10Z

so @Praetonus @sylvanc merge this and then more work can be done over the top?

Praetonus · 2017-01-23T22:06:43Z

Yeah, let's revert first. The only concern I have left about this PR is the changelog entry. I think this deserves a manual changelog entry to explain the performance fix as the PR title isn't very descriptive.

SeanTAllen · 2017-01-24T01:52:13Z

@Praetonus suggestions on a good CHANGELOG entry?

Praetonus · 2017-01-24T10:54:52Z

@SeanTAllen Maybe "Performance fix in the runtime actor scheduler"?

jemc · 2017-01-24T19:21:52Z

If you do a manual changelog entry, please be sure to include the PR ticket number.

However, it seems to me you could simply change the title of this PR and use the automatic changelog entry?

SeanTAllen · 2017-01-24T19:53:10Z

@jemc done

SeanTAllen · 2017-02-03T04:33:51Z

@Praetonus @jemc good to merge this now?

jemc · 2017-02-06T05:45:43Z

FWIW this broke the build for me. I now see the following compiler errors:

src/libponyrt/sched/mpmcq.c:25:3: error: address argument to atomic operation must be a pointer to integer or pointer ('mpmcq_dwcas_t *' (aka 'struct mpmcq_dwcas_t *') invalid)
  atomic_store_explicit(&q->tail, tail, memory_order_relaxed);
  ^                     ~~~~~~~~
src/common/pony/detail/atomics.h:87:7: note: expanded from macro 'atomic_store_explicit'
      __atomic_store_n(PTR, VAL, MO); \
      ^
src/libponyrt/sched/mpmcq.c:30:24: error: address argument to atomic operation must be a pointer to integer or pointer ('mpmcq_dwcas_t *' (aka 'struct mpmcq_dwcas_t *') invalid)
  mpmcq_dwcas_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
                       ^                    ~~~~~~~~
src/common/pony/detail/atomics.h:81:7: note: expanded from macro 'atomic_load_explicit'
      __atomic_load_n(PTR, MO); \
      ^
src/libponyrt/sched/mpmcq.c:30:17: error: initializing 'mpmcq_dwcas_t' (aka 'struct mpmcq_dwcas_t') with an expression of incompatible type 'void'
  mpmcq_dwcas_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
                ^      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/libponyrt/sched/mpmcq.c:66:9: error: address argument to atomic operation must be a pointer to integer or pointer ('mpmcq_dwcas_t *' (aka 'struct mpmcq_dwcas_t *') invalid)
  cmp = atomic_load_explicit(&q->tail, memory_order_acquire);
        ^                    ~~~~~~~~
src/common/pony/detail/atomics.h:81:7: note: expanded from macro 'atomic_load_explicit'
      __atomic_load_n(PTR, MO); \
      ^
src/libponyrt/sched/mpmcq.c:66:7: error: assigning to 'mpmcq_dwcas_t' (aka 'struct mpmcq_dwcas_t') from incompatible type 'void'
  cmp = atomic_load_explicit(&q->tail, memory_order_acquire);
      ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/libponyrt/sched/mpmcq.c:82:12: error: address argument to atomic operation must be a pointer to integer or pointer ('mpmcq_dwcas_t *' (aka 'struct mpmcq_dwcas_t *') invalid)
  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &cmp, xchg,
           ^                                     ~~~~~~~~
src/common/pony/detail/atomics.h:99:7: note: expanded from macro 'atomic_compare_exchange_weak_explicit'
      __atomic_compare_exchange_n(PTR, EXP, DES, true, SUCC, FAIL); \
      ^
src/libponyrt/sched/mpmcq.c:82:11: error: invalid argument type 'void' to unary expression
  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &cmp, xchg,
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7 errors generated.
Makefile:585: recipe for target 'build/debug/obj/libponyrt/sched/mpmcq.o' failed
make: *** [build/debug/obj/libponyrt/sched/mpmcq.o] Error 1

SeanTAllen · 2017-02-06T12:39:57Z

@jemc what's your setup? it passes CI both for Pony via Travis and ours at Sendence. I'm assuming you have a setup that is different than what has been tested.

jemc · 2017-02-06T16:03:51Z

Fedora 22 Linux, 64bit
clang version 3.5.0
LLVM 3.9.1

I saw there are some comments in #1543 about this change breaking some versions of GCC, but I'm using clang, not GCC.

SeanTAllen · 2017-02-06T16:30:59Z

Ok so, all the CI that runs for this is using clang 3.9 with LLVM 3.9.1, that could be where the problem lies. Are you able to test with clang 3.9 to see if that is the case and if yes, we can open an issue to look into clang 3.5 support?

Praetonus · 2017-02-06T18:34:20Z

@jemc @SeanTAllen The change isn't working on Clang 3.3 through 3.5 for the same reason as GCC (no support of stdatomic.h). Sorry, I forgot to mention it previously.

SeanTAllen added the changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge label Jan 18, 2017

SeanTAllen requested review from sylvanc and Praetonus January 18, 2017 17:31

SeanTAllen force-pushed the pre-ticketing-mpmcq-pop-algo branch from 8d34ca1 to 48e6990 Compare January 18, 2017 17:33

SeanTAllen added the do not merge This PR should not be merged at this time label Jan 18, 2017

SeanTAllen force-pushed the pre-ticketing-mpmcq-pop-algo branch from 48e6990 to 925e649 Compare January 19, 2017 00:38

SeanTAllen force-pushed the pre-ticketing-mpmcq-pop-algo branch from 925e649 to 225b31b Compare January 19, 2017 11:49

SeanTAllen changed the title ~~Revert to pre-ticketing mpmcq pop algo~~ Performance fix in the runtime actor schedule Jan 24, 2017

Praetonus approved these changes Jan 24, 2017

View reviewed changes

jemc approved these changes Jan 24, 2017

View reviewed changes

SeanTAllen removed the do not merge This PR should not be merged at this time label Feb 3, 2017

jemc merged commit 81e727a into master Feb 3, 2017

jemc deleted the pre-ticketing-mpmcq-pop-algo branch February 3, 2017 04:42

ponylang-main added a commit that referenced this pull request Feb 3, 2017

Update CHANGELOG for PR #1521

a337127

Praetonus mentioned this pull request Feb 3, 2017

Release 0.11.0 #1543

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance fix in the runtime actor schedule #1521

Performance fix in the runtime actor schedule #1521

SeanTAllen commented Jan 18, 2017

Praetonus commented Jan 19, 2017

sylvanc commented Jan 19, 2017

Praetonus commented Jan 19, 2017 •

edited

Loading

Praetonus commented Jan 19, 2017 •

edited

Loading

Praetonus commented Jan 21, 2017

SeanTAllen commented Jan 21, 2017 •

edited

Loading

Praetonus commented Jan 21, 2017

SeanTAllen commented Jan 21, 2017

SeanTAllen commented Jan 21, 2017

Praetonus commented Jan 23, 2017

SeanTAllen commented Jan 24, 2017

Praetonus commented Jan 24, 2017

jemc commented Jan 24, 2017

SeanTAllen commented Jan 24, 2017

SeanTAllen commented Feb 3, 2017

jemc commented Feb 6, 2017

SeanTAllen commented Feb 6, 2017

jemc commented Feb 6, 2017 •

edited

Loading

SeanTAllen commented Feb 6, 2017

Praetonus commented Feb 6, 2017

Performance fix in the runtime actor schedule #1521

Performance fix in the runtime actor schedule #1521

Conversation

SeanTAllen commented Jan 18, 2017

Praetonus commented Jan 19, 2017

sylvanc commented Jan 19, 2017

Praetonus commented Jan 19, 2017 • edited Loading

Praetonus commented Jan 19, 2017 • edited Loading

Praetonus commented Jan 21, 2017

SeanTAllen commented Jan 21, 2017 • edited Loading

Praetonus commented Jan 21, 2017

SeanTAllen commented Jan 21, 2017

SeanTAllen commented Jan 21, 2017

Praetonus commented Jan 23, 2017

SeanTAllen commented Jan 24, 2017

Praetonus commented Jan 24, 2017

jemc commented Jan 24, 2017

SeanTAllen commented Jan 24, 2017

SeanTAllen commented Feb 3, 2017

jemc commented Feb 6, 2017

SeanTAllen commented Feb 6, 2017

jemc commented Feb 6, 2017 • edited Loading

SeanTAllen commented Feb 6, 2017

Praetonus commented Feb 6, 2017

Praetonus commented Jan 19, 2017 •

edited

Loading

Praetonus commented Jan 19, 2017 •

edited

Loading

SeanTAllen commented Jan 21, 2017 •

edited

Loading

jemc commented Feb 6, 2017 •

edited

Loading