Performance fix in the runtime actor scheduler #1521
Conversation
The original testing for performance impact of the ticketing algo change was done at Sendence. At the time, we didn't see any impact. However, we weren't looking in the right place. We have since found that the ticketing algo under some workloads can result in very long pauses (up to a second) getting a ticket. Further, we regularly see pauses while waiting to get a ticket measured in the 100s of microseconds.

This commit reverts to the prior double word CAS algo, which we found has far better performance characteristics in the "bad" scenarios we found, and the same or slightly better characteristics under normal circumstances.

The reversion includes the work in a later commit to support GCC 4.7. It does not include the VALGRIND awareness that was added later; I'm not sure where to put that in, so I left it out.

There was a change in heap that went along with the original ticketing algo change. That is not included as part of this commit, as we at Sendence didn't switch that and are thus unsure whether it should be switched or what the impact would be.
This will need some additional changes to work on GCC 4.7 and 4.8 because of this thing. I've added this assert as a safeguard because when doing a ...

However, I think we can solve the ABA problem in a more efficient way in this algorithm by freeing the nodes at a different time:

```c
void* ponyint_mpmcq_pop(mpmcq_t* q, mpmcq_node_t** old_tail)
{
  mpmcq_node_t* tail;
  mpmcq_node_t* next;

  tail = atomic_load_explicit(&q->tail, memory_order_acquire);

  do
  {
    next = atomic_load_explicit(&tail->next, memory_order_acquire);

    if(next == NULL)
      return NULL;
  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &tail, next,
    memory_order_acq_rel, memory_order_acquire));

  void* data = atomic_load_explicit(&next->data, memory_order_acquire);
  atomic_store_explicit(&next->data, NULL, memory_order_release);

  while(atomic_load_explicit(&(*old_tail)->data, memory_order_acquire) != NULL)
    ponyint_cpu_relax();

  POOL_FREE(mpmcq_node_t, *old_tail);
  *old_tail = tail;

  return data;
}
```

The idea here is that each thread keeps a "buffer node" that it will only free after popping another node from the queue. As a result, the ABA problem will occur only if, during a single CAS iteration for one thread, ...

I think having this chain of events in this time frame is unlikely enough to never occur in real world cases, even in the highly contended inject queue. We'll probably need to have a buffer node per thread and per queue. The memory overhead would be ...

Also, this whole thing is unnecessary on ARM, as ABA problem detection is handled by the hardware instructions (load-link/store-conditional).

@sylvanc @SeanTAllen Thoughts?
The buffer node idea seems very interesting. It is ... Can the thread keep ...

```c
void* ponyint_mpmcq_pop(mpmcq_t* q, mpmcq_node_t** old_tail)
{
  mpmcq_node_t* tail;
  mpmcq_node_t* next;

  tail = atomic_load_explicit(&q->tail, memory_order_acquire);

  do
  {
    next = atomic_load_explicit(&tail->next, memory_order_acquire);

    if(next == NULL)
      return NULL;
  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &tail, next,
    memory_order_acq_rel, memory_order_acquire));

  POOL_FREE(mpmcq_node_t, *old_tail);
  *old_tail = next;

  return atomic_load_explicit(&next->data, memory_order_acquire);
}
```

I may be overlooking something...
Yes, that's a good idea. The only question is: who gets to free the original stub node? For the individual scheduler queues, the stub node of a queue could be the original ...
Also, I think that the memory ordering for some of these operations can be weakened:

```c
void* ponyint_mpmcq_pop(mpmcq_t* q, mpmcq_node_t** old_tail)
{
  mpmcq_node_t* tail;
  mpmcq_node_t* next;

  // Relaxed instead of acquire because there is no release operation on the
  // tail (see the CAS comment).
  tail = atomic_load_explicit(&q->tail, memory_order_relaxed);

  do
  {
    // Relaxed instead of acquire. The synchronisation is moved outside of
    // the loop as a standalone fence.
    next = atomic_load_explicit(&tail->next, memory_order_relaxed);

    if(next == NULL)
      return NULL;

    // On success, relaxed instead of acquire/release because we didn't do
    // any write and there isn't anything to release, and there aren't any
    // writes to acquire.
    // On failure, same thing as the load of the tail: there isn't anything
    // to acquire.
  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &tail, next,
    memory_order_relaxed, memory_order_relaxed));

  // Synchronise on tail->next and prevent the POOL_FREE of old_tail and the
  // load of next->data from "moving" before the CAS, where we don't own the
  // node yet.
  atomic_thread_fence(memory_order_acquire);

  POOL_FREE(mpmcq_node_t, *old_tail);
  *old_tail = next;

  // Relaxed instead of acquire. There is no release operation on next->data.
  return atomic_load_explicit(&next->data, memory_order_relaxed);
}
```
@SeanTAllen @sylvanc Should I open a PR with that change? Or maybe add it on top of this PR?
I think it makes sense to have this PR as is, the revert. |
Yes, that makes sense. Note that it won't work as is on GCC 4.7 and 4.8 though, as I explained above.
ok.
so @Praetonus @sylvanc merge this and then more work can be done over the top? |
Yeah, let's revert first. The only concern I have left about this PR is the changelog entry. I think this deserves a manual changelog entry to explain the performance fix as the PR title isn't very descriptive. |
@Praetonus suggestions on a good CHANGELOG entry? |
@SeanTAllen Maybe "Performance fix in the runtime actor scheduler"? |
If you do a manual changelog entry, please be sure to include the PR ticket number. However, it seems to me you could simply change the title of this PR and use the automatic changelog entry? |
@jemc done |
@Praetonus @jemc good to merge this now?
FWIW this broke the build for me. I now see the following compiler errors:

```
src/libponyrt/sched/mpmcq.c:25:3: error: address argument to atomic operation must be a pointer to integer or pointer ('mpmcq_dwcas_t *' (aka 'struct mpmcq_dwcas_t *') invalid)
  atomic_store_explicit(&q->tail, tail, memory_order_relaxed);
  ^                     ~~~~~~~~
src/common/pony/detail/atomics.h:87:7: note: expanded from macro 'atomic_store_explicit'
      __atomic_store_n(PTR, VAL, MO); \
      ^
src/libponyrt/sched/mpmcq.c:30:24: error: address argument to atomic operation must be a pointer to integer or pointer ('mpmcq_dwcas_t *' (aka 'struct mpmcq_dwcas_t *') invalid)
  mpmcq_dwcas_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
                       ^                    ~~~~~~~~
src/common/pony/detail/atomics.h:81:7: note: expanded from macro 'atomic_load_explicit'
      __atomic_load_n(PTR, MO); \
      ^
src/libponyrt/sched/mpmcq.c:30:17: error: initializing 'mpmcq_dwcas_t' (aka 'struct mpmcq_dwcas_t') with an expression of incompatible type 'void'
  mpmcq_dwcas_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
                ^      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/libponyrt/sched/mpmcq.c:66:9: error: address argument to atomic operation must be a pointer to integer or pointer ('mpmcq_dwcas_t *' (aka 'struct mpmcq_dwcas_t *') invalid)
    cmp = atomic_load_explicit(&q->tail, memory_order_acquire);
          ^                    ~~~~~~~~
src/common/pony/detail/atomics.h:81:7: note: expanded from macro 'atomic_load_explicit'
      __atomic_load_n(PTR, MO); \
      ^
src/libponyrt/sched/mpmcq.c:66:7: error: assigning to 'mpmcq_dwcas_t' (aka 'struct mpmcq_dwcas_t') from incompatible type 'void'
    cmp = atomic_load_explicit(&q->tail, memory_order_acquire);
        ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/libponyrt/sched/mpmcq.c:82:12: error: address argument to atomic operation must be a pointer to integer or pointer ('mpmcq_dwcas_t *' (aka 'struct mpmcq_dwcas_t *') invalid)
  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &cmp, xchg,
           ^                                     ~~~~~~~~
src/common/pony/detail/atomics.h:99:7: note: expanded from macro 'atomic_compare_exchange_weak_explicit'
      __atomic_compare_exchange_n(PTR, EXP, DES, true, SUCC, FAIL); \
      ^
src/libponyrt/sched/mpmcq.c:82:11: error: invalid argument type 'void' to unary expression
  } while(!atomic_compare_exchange_weak_explicit(&q->tail, &cmp, xchg,
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7 errors generated.
Makefile:585: recipe for target 'build/debug/obj/libponyrt/sched/mpmcq.o' failed
make: *** [build/debug/obj/libponyrt/sched/mpmcq.o] Error 1
```
@jemc what's your setup? it passes CI both for Pony via Travis and ours at Sendence. I'm assuming you have a setup that is different than what has been tested. |
I saw there are some comments in #1543 about this change breaking some versions of GCC, but I'm using clang, not GCC. |
Ok so, all the CI that runs for this is using clang 3.9 with LLVM 3.9.1, that could be where the problem lies. Are you able to test with clang 3.9 to see if that is the case and if yes, we can open an issue to look into clang 3.5 support? |
@jemc @SeanTAllen The change isn't working on Clang 3.3 through 3.5 for the same reason as GCC (no support of |