Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance fix in the runtime actor schedule #1521

Merged
merged 1 commit into from
Feb 3, 2017

Conversation

SeanTAllen
Copy link
Member

The original testing for performance impact of the ticketing
algo change was done at Sendence. At the time, we didn't see
any impact. However, we weren't looking in the right place.

We have since found that the ticketing algo under some workloads
can result in very long pauses (up to a second) getting a ticket.
Further, we regularly see pauses while waiting to get a ticket
measured in the 100's of micro-seconds.

This commit reverts to the prior double word cas algo that we
found has far better performance characteristics in the "bad"
scenarios we found and the same or slightly better characteristics
under normal circumstances.

The reversion includes the work in a later commit to support GCC
4.7.

It does not include the VALGRIND awareness that was added later.
I'm not sure where to put that in, as such, I left it out.

There was a change in heap that went along with the original
ticketing algo change. That is not included as part of this commit
as we at Sendence didn't switch that and thus are unsure if it
should be switched or what the impact would be.

@SeanTAllen SeanTAllen added the changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge label Jan 18, 2017
@SeanTAllen SeanTAllen force-pushed the pre-ticketing-mpmcq-pop-algo branch from 8d34ca1 to 48e6990 Compare January 18, 2017 17:33
@SeanTAllen SeanTAllen added the do not merge This PR should not be merged at this time label Jan 18, 2017
@SeanTAllen SeanTAllen force-pushed the pre-ticketing-mpmcq-pop-algo branch from 48e6990 to 925e649 Compare January 19, 2017 00:38
The original testing for performance impact of the ticketing
algo change was done at Sendence. At the time, we didn't see
any impact. However, we weren't looking in the right place.

We have since found that the ticketing algo under some workloads
can result in very long pauses (up to a second) getting a ticket.
Further, we regularly see pauses while waiting to get a ticket
measured in the 100's of micro-seconds.

This commit reverts to the prior double word cas algo that we
found has far better performance characteristics in the "bad"
scenarios we found and the same or slightly better characteristics
under normal circumstances.

The reversion includes the work in a later commit to support GCC
4.7.

It does not include the VALGRIND awareness that was added later.
I'm not sure where to put that in, as such, I left it out.

There was a change in heap that went along with the original
ticketing algo change. That is not included as part of this commit
as we at Sendence didn't switch that and thus are unsure if it
should be switched or what the impact would be.
@SeanTAllen SeanTAllen force-pushed the pre-ticketing-mpmcq-pop-algo branch from 925e649 to 225b31b Compare January 19, 2017 11:49
@Praetonus
Copy link
Member

This will need some additional changes to work on GCC 4.7 and 4.8 because of this thing. I've added this assert as a safeguard because when doing a compare_exchange on an object bigger than machine word size, the object must be aligned on a 16 bits boundary on x86 and the existing function doesn't enforce that.

However, I think we can solve the ABA problem in a more efficient way in this algorithm by freeing the nodes at a different time.

void* ponyint_mpmcq_pop(mpmcq_t* q, mpmcq_node_t** old_tail)
{
  mpmcq_node_t* tail, next;

  tail = atomic_load_explicit(&q->tail, memory_order_acquire);

  do
  {
    next = atomic_load_explicit(&tail->next, memory_order_acquire);

    if(next == NULL)
      return NULL;

  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &tail, next,
    memory_order_acq_rel, memory_order_acquire));

  void* data = atomic_load_explicit(&next->data, memory_order_acquire);

  atomic_store_explicit(&next->data, NULL, memory_order_release);

  while(atomic_load_explicit(&(*old_tail)->data, memory_order_acquire) != NULL)
    ponyint_cpu_relax();

  POOL_FREE(mpmcq_node_t, *old_tail);
  *old_tail = tail;
  return data;
}

The idea here is that each thread keeps a "buffer node" that it will only free after popping another node from the queue. As a result, the ABA problem will occur if, during a single CAS iteration for one thread

  • another thread pops a node
  • the same thread pops an additional node
  • the first node is added back to the same queue and comes back as the tail.

I think having this chain of events in this time frame is unlikely enough to never occur in real world cases, even in the highly contended inject queue. We'll probably need to have a buffer node per thread and per queue. The memory overhead would be n_threads * (n_threads + 1) * sizeof(mpmcq_node_t*), which is approximately 33 kB for 64 threads. That seems reasonable to me.

Also, this whole thing is unnecessary on ARM as the ABA problem detection is handled by the hardware instructions (load-link/store-conditional).

@sylvanc @SeanTAllen Thoughts?

@sylvanc
Copy link
Contributor

sylvanc commented Jan 19, 2017

The buffer node idea seems very interesting. It is n^2 space complexity, so for a 4096 core machine (like an SGI) it's a 128 MB overhead... which still seems ok to me.

Can the thread keep next instead of tail as old_tail? That way, a thread frees the previous thing it popped from the queue, exactly like a messageq_t but with multiple consumers. Something like:

void* ponyint_mpmcq_pop(mpmcq_t* q, mpmcq_node_t** old_tail)
{
  mpmcq_node_t* tail, next;

  tail = atomic_load_explicit(&q->tail, memory_order_acquire);

  do
  {
    next = atomic_load_explicit(&tail->next, memory_order_acquire);

    if(next == NULL)
      return NULL;

  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &tail, next,
    memory_order_acq_rel, memory_order_acquire));

  POOL_FREE(mpmcq_node_t, *old_tail);
  *old_tail = next;
  return atomic_load_explicit(&next->data, memory_order_acquire);
}

I may be overlooking something...

@Praetonus
Copy link
Member

Praetonus commented Jan 19, 2017

Yes, that's a good idea. The only question is, who gets to free the original stub node? For the individual scheduler queues, the stub node of a queue could be the original old_tail of the thread owning that queue. For the inject queue, one of the scheduler threads could get the stub as its old_tail and the other threads could get a stub node not linked to anything (or NULL, but then we'd have to check for that uncommon case).

@Praetonus
Copy link
Member

Praetonus commented Jan 19, 2017

Also, I think that the memory ordering for some of these operations can be weakened:

void* ponyint_mpmcq_pop(mpmcq_t* q, mpmcq_node_t** old_tail)
{
  mpmcq_node_t* tail, next;

  // relaxed instead of acquire because there is no release operation on the tail (see the CAS comment)
  tail = atomic_load_explicit(&q->tail, memory_order_relaxed);

  do
  {
    // relaxed instead of acquire. The synchronisation is moved outside of the loop as a standalone fence
    next = atomic_load_explicit(&tail->next, memory_order_relaxed);

    if(next == NULL)
      return NULL;

    // On success, relaxed instead of acquire/release because we didn't do any write and there isn't anything to release, and there aren't any writes to acquire
    // On failure, same thing as the load of the tail, there isn't anything to acquire
  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &tail, next,
    memory_order_relaxed, memory_order_relaxed));

  // synchronise on tail->next and prevent the POOL_FREE of old_tail and the load of next->data from "moving" before the CAS, where we don't own the node yet
  atomic_thread_fence(memory_order_acquire);

  POOL_FREE(mpmcq_node_t, *old_tail);
  *old_tail = next;
  // relaxed instead of acquire. There is no release operation on next->data
  return atomic_load_explicit(&next->data, memory_order_relaxed);
}

@Praetonus
Copy link
Member

@SeanTAllen @sylvanc Should I open a PR with that change? Or maybe add it on top of this PR?

@SeanTAllen
Copy link
Member Author

SeanTAllen commented Jan 21, 2017

I think it makes sense to have this PR as is, the revert.
And then another commit to weaken. That will make for a better history and easier testing if we find problems later.

@Praetonus
Copy link
Member

Yes, that makes senses. Note that it won't work as is on GCC 4.7 and 4.8 though, as I explained above.

@SeanTAllen
Copy link
Member Author

ok.

@SeanTAllen
Copy link
Member Author

so @Praetonus @sylvanc merge this and then more work can be done over the top?

@Praetonus
Copy link
Member

Yeah, let's revert first. The only concern I have left about this PR is the changelog entry. I think this deserves a manual changelog entry to explain the performance fix as the PR title isn't very descriptive.

@SeanTAllen
Copy link
Member Author

@Praetonus suggestions on a good CHANGELOG entry?

@Praetonus
Copy link
Member

@SeanTAllen Maybe "Performance fix in the runtime actor scheduler"?

@jemc
Copy link
Member

jemc commented Jan 24, 2017

If you do a manual changelog entry, please be sure to include the PR ticket number.

However, it seems to me you could simply change the title of this PR and use the automatic changelog entry?

@SeanTAllen SeanTAllen changed the title Revert to pre-ticketing mpmcq pop algo Performance fix in the runtime actor schedule Jan 24, 2017
@SeanTAllen
Copy link
Member Author

@jemc done

@SeanTAllen SeanTAllen removed the do not merge This PR should not be merged at this time label Feb 3, 2017
@SeanTAllen
Copy link
Member Author

@Praetonus @jemc good to merge this now?

@jemc jemc merged commit 81e727a into master Feb 3, 2017
@jemc jemc deleted the pre-ticketing-mpmcq-pop-algo branch February 3, 2017 04:42
ponylang-main added a commit that referenced this pull request Feb 3, 2017
@Praetonus Praetonus mentioned this pull request Feb 3, 2017
@jemc
Copy link
Member

jemc commented Feb 6, 2017

FWIW this broke the build for me. I now see the following compiler errors:

src/libponyrt/sched/mpmcq.c:25:3: error: address argument to atomic operation must be a pointer to integer or pointer ('mpmcq_dwcas_t *' (aka 'struct mpmcq_dwcas_t *') invalid)
  atomic_store_explicit(&q->tail, tail, memory_order_relaxed);
  ^                     ~~~~~~~~
src/common/pony/detail/atomics.h:87:7: note: expanded from macro 'atomic_store_explicit'
      __atomic_store_n(PTR, VAL, MO); \
      ^
src/libponyrt/sched/mpmcq.c:30:24: error: address argument to atomic operation must be a pointer to integer or pointer ('mpmcq_dwcas_t *' (aka 'struct mpmcq_dwcas_t *') invalid)
  mpmcq_dwcas_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
                       ^                    ~~~~~~~~
src/common/pony/detail/atomics.h:81:7: note: expanded from macro 'atomic_load_explicit'
      __atomic_load_n(PTR, MO); \
      ^
src/libponyrt/sched/mpmcq.c:30:17: error: initializing 'mpmcq_dwcas_t' (aka 'struct mpmcq_dwcas_t') with an expression of incompatible type 'void'
  mpmcq_dwcas_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
                ^      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/libponyrt/sched/mpmcq.c:66:9: error: address argument to atomic operation must be a pointer to integer or pointer ('mpmcq_dwcas_t *' (aka 'struct mpmcq_dwcas_t *') invalid)
  cmp = atomic_load_explicit(&q->tail, memory_order_acquire);
        ^                    ~~~~~~~~
src/common/pony/detail/atomics.h:81:7: note: expanded from macro 'atomic_load_explicit'
      __atomic_load_n(PTR, MO); \
      ^
src/libponyrt/sched/mpmcq.c:66:7: error: assigning to 'mpmcq_dwcas_t' (aka 'struct mpmcq_dwcas_t') from incompatible type 'void'
  cmp = atomic_load_explicit(&q->tail, memory_order_acquire);
      ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/libponyrt/sched/mpmcq.c:82:12: error: address argument to atomic operation must be a pointer to integer or pointer ('mpmcq_dwcas_t *' (aka 'struct mpmcq_dwcas_t *') invalid)
  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &cmp, xchg,
           ^                                     ~~~~~~~~
src/common/pony/detail/atomics.h:99:7: note: expanded from macro 'atomic_compare_exchange_weak_explicit'
      __atomic_compare_exchange_n(PTR, EXP, DES, true, SUCC, FAIL); \
      ^
src/libponyrt/sched/mpmcq.c:82:11: error: invalid argument type 'void' to unary expression
  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &cmp, xchg,
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7 errors generated.
Makefile:585: recipe for target 'build/debug/obj/libponyrt/sched/mpmcq.o' failed
make: *** [build/debug/obj/libponyrt/sched/mpmcq.o] Error 1

@SeanTAllen
Copy link
Member Author

@jemc what's your setup? it passes CI both for Pony via Travis and ours at Sendence. I'm assuming you have a setup that is different than what has been tested.

@jemc
Copy link
Member

jemc commented Feb 6, 2017

  • Fedora 22 Linux, 64bit
  • clang version 3.5.0
  • LLVM 3.9.1

I saw there are some comments in #1543 about this change breaking some versions of GCC, but I'm using clang, not GCC.

@SeanTAllen
Copy link
Member Author

Ok so, all the CI that runs for this is using clang 3.9 with LLVM 3.9.1, that could be where the problem lies. Are you able to test with clang 3.9 to see if that is the case and if yes, we can open an issue to look into clang 3.5 support?

@Praetonus
Copy link
Member

@jemc @SeanTAllen The change isn't working on Clang 3.3 through 3.5 for the same reason as GCC (no support of stdatomic.h). Sorry, I forgot to mention it previously.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog - fixed Automatically add "Fixed" CHANGELOG entry on merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants