cmd/compile: improve interface dispatch performance #29276
Comments
I can see why it’s problematic, given the GC and stack growth, but only the dedicated struct pointer would need fixing up. The interface function pointer doesn’t need it. |
Note that I think DX would currently need to be at least spilled and reloaded on every iteration since the function executed with the interface call could clobber DX. |
That’s why maybe only on amd64: reserve one of the rN registers and make the callee preserve it if used. I think most interface calls do not end up calling another interface; it is usually a call into the concrete type after the initial one, so the save would be a slight overhead, as it is outside the loop.
… On Dec 15, 2018, at 8:59 AM, Martin Möhrmann ***@***.***> wrote:
The compiler could easily generate optimized code where the DX is loaded once and used for every interface call, as w is constant in the method.
Note that I think DX would currently need to be at least spilled and reloaded on every iteration since the function executed with the interface call could clobber DX.
|
Registers are precious on amd64. If we are going to dedicate one to a single purpose it’ll probably be for the current g. Changing the ABI (e.g. to support callee-save registers) is a major undertaking...and one which we are actively working on. With callee-save registers, it is reasonably likely that a good register allocator will already choose a callee-save register for this, but it would be good to confirm (once that hypothetical becomes reality). |
It would be nice to teach the compiler that there's no need to reload from slots of an itab - there's nothing that writes (already initialized) itabs. I'm not sure how much performance it actually costs, at least in this example. The branch predictor is going to predict the call correctly, so nothing is actually waiting for the results of these loads. As long as L1 has the bandwidth for these loads it won't slow anything down. It would be good to find a benchmark where we could see an improvement. Not sure what that would look like, or how we would know without implementing the optimization. Maybe use performance counters to find a benchmark with stalls on these instructions? |
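As a starting point for the kind of benchmark asked for above, here is a minimal sketch that times the same loop through an interface and through a concrete pointer. The names `byteSink`, `sumSink`, `viaInterface` and `viaConcrete` are hypothetical and not taken from the issue, and a gap between the two numbers includes the indirect call itself, so it is at best an upper bound on what avoiding the itab reloads could save.

```go
package main

import (
	"fmt"
	"testing"
)

// byteSink is a hypothetical one-method interface used only for this sketch.
type byteSink interface{ put(b byte) }

type sumSink struct{ sum int }

// put is kept out of line so both loops perform a real call per iteration.
//
//go:noinline
func (s *sumSink) put(b byte) { s.sum += int(b) }

// viaInterface calls put through the interface on every iteration.
// noinline stops the compiler from devirtualizing the call at the call site.
//
//go:noinline
func viaInterface(w byteSink, p []byte) {
	for _, b := range p {
		w.put(b)
	}
}

// viaConcrete makes the same calls with a statically known receiver type.
//
//go:noinline
func viaConcrete(w *sumSink, p []byte) {
	for _, b := range p {
		w.put(b)
	}
}

func main() {
	p := make([]byte, 4096)

	// testing.Benchmark runs a benchmark function outside of "go test".
	iface := testing.Benchmark(func(b *testing.B) {
		s := &sumSink{}
		for i := 0; i < b.N; i++ {
			viaInterface(s, p)
		}
	})
	direct := testing.Benchmark(func(b *testing.B) {
		s := &sumSink{}
		for i := 0; i < b.N; i++ {
			viaConcrete(s, p)
		}
	})

	fmt.Println("interface:", iface)
	fmt.Println("concrete: ", direct)
}
```

Timings depend on the machine and Go version. To attribute any difference to the loads specifically, the hardware performance counters mentioned above would still be needed.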
I am working on some hand assembly to test the performance difference. |
I was also thinking that but I don’t see what the implementation would look like. Do you have a suggestion for how? |
Initial testing shows no performance improvement when you need to save/reload the register on each call. It seems the only viable solution is to dedicate registers for interface dispatch and make them callee-save - once a GC/safepoint occurs all registers appear to be trashed - since in most cases (outside of GC/runtime) they should not need to be saved/restored. I think this would go a long way towards improving the performance of Go while encouraging interface-based design. |
Nope. If we're going to do any sort of alias analysis to enable more aggressive store-load forwarding, the itab special case would be easy to incorporate. Not sure how to do the former, though. |
Reviewing the following code:
and the generated assembly:
The generated code loads the interface call address using double indirection on every loop iteration, and for every call (lines 14 and 20).
The compiler could easily generate optimized code where the DX is loaded once and used for every interface call, as w is constant in the method.
It is my opinion that loops like this are very common in typical Go code and deserve more optimization attention. As an example, issue #29010 makes specific reference to not using interfaces as call sites due to their inefficiency.
At a minimum the call address could be placed in a stack local avoiding one indirection.
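The stack-local idea can be sketched in plain Go with a simplified model of an interface value. The types `itab` and `iface` below are hypothetical stand-ins, not the runtime's real definitions, and `counter` is invented for the example. The sketch only illustrates the difference between walking the table on every call and loading the call address once; it says nothing about what the compiler emits for real interface calls.

```go
package main

import "fmt"

type counter struct{ sum int }

func (c *counter) add(b byte) { c.sum += int(b) }

// itab is a simplified stand-in for an interface method table:
// here it holds a single method's function value.
type itab struct {
	fun func(data *counter, b byte)
}

// iface mimics the two-word interface value: table pointer plus data pointer.
type iface struct {
	tab  *itab
	data *counter
}

// callEachIteration follows w.tab and then tab.fun on every iteration,
// mirroring the double indirection described in the issue.
func callEachIteration(w iface, p []byte) {
	for _, b := range p {
		w.tab.fun(w.data, b)
	}
}

// callHoisted loads the call address and receiver once into locals,
// which is valid because w does not change inside the loop.
func callHoisted(w iface, p []byte) {
	fn, data := w.tab.fun, w.data
	for _, b := range p {
		fn(data, b)
	}
}

func main() {
	tab := &itab{fun: (*counter).add}
	p := []byte{1, 2, 3}

	a, b := &counter{}, &counter{}
	callEachIteration(iface{tab: tab, data: a}, p)
	callHoisted(iface{tab: tab, data: b}, p)

	fmt.Println(a.sum, b.sum) // prints "6 6"
}
```

Both functions produce the same result. The hoisted form is only safe because nothing writes to an already initialized table, which is the property the compiler would have to rely on.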
A more advanced change might be to reserve a couple of general purpose registers for the hot interface call address and object reference (r10/r11) and so push/pop r10/r11 on entry/exit for those routines using the optimization.
Issue #18597 has some overlap with this.