cmd/compile: improve interface dispatch performance #29276

robaho · 2018-12-14T22:21:19Z

Reviewing the following code:

and the generated assembly:

varint_bench_test.go:14       0x10f07ee               488b4c2430              MOVQ 0x30(SP), CX
varint_bench_test.go:14       0x10f07f3               488b5118                MOVQ 0x18(CX), DX
...
varint_bench_test.go:14       0x10f0807               ffd2                    CALL DX

TEXT command-line-arguments.WriteUvarint(SB) /Users/robertengels/gotest/gotest/varint_bench_test.go
  varint_bench_test.go:12       0x10f07b0               65488b0c2530000000      MOVQ GS:0x30, CX
  varint_bench_test.go:12       0x10f07b9               483b6110                CMPQ 0x10(CX), SP
  varint_bench_test.go:12       0x10f07bd               0f869f000000            JBE 0x10f0862
  varint_bench_test.go:12       0x10f07c3               4883ec28                SUBQ $0x28, SP
  varint_bench_test.go:12       0x10f07c7               48896c2420              MOVQ BP, 0x20(SP)
  varint_bench_test.go:12       0x10f07cc               488d6c2420              LEAQ 0x20(SP), BP
  varint_bench_test.go:13       0x10f07d1               488b442440              MOVQ 0x40(SP), AX
  varint_bench_test.go:13       0x10f07d6               eb09                    JMP 0x10f07e1
  varint_bench_test.go:18       0x10f07d8               488b442440              MOVQ 0x40(SP), AX
  varint_bench_test.go:18       0x10f07dd               48c1e807                SHRQ $0x7, AX
  varint_bench_test.go:13       0x10f07e1               483d80000000            CMPQ $0x80, AX
  varint_bench_test.go:13       0x10f07e7               7243                    JB 0x10f082c
  varint_bench_test.go:13       0x10f07e9               4889442440              MOVQ AX, 0x40(SP)
  varint_bench_test.go:14       0x10f07ee               488b4c2430              MOVQ 0x30(SP), CX
  varint_bench_test.go:14       0x10f07f3               488b5118                MOVQ 0x18(CX), DX
  varint_bench_test.go:14       0x10f07f7               83c880                  ORL $-0x80, AX
  varint_bench_test.go:14       0x10f07fa               88442408                MOVB AL, 0x8(SP)
  varint_bench_test.go:14       0x10f07fe               488b442438              MOVQ 0x38(SP), AX
  varint_bench_test.go:14       0x10f0803               48890424                MOVQ AX, 0(SP)
  varint_bench_test.go:14       0x10f0807               ffd2                    CALL DX
  varint_bench_test.go:14       0x10f0809               488b442418              MOVQ 0x18(SP), AX
  varint_bench_test.go:14       0x10f080e               488b4c2410              MOVQ 0x10(SP), CX
  varint_bench_test.go:15       0x10f0813               4885c9                  TESTQ CX, CX
  varint_bench_test.go:15       0x10f0816               74c0                    JE 0x10f07d8
  varint_bench_test.go:16       0x10f0818               48894c2448              MOVQ CX, 0x48(SP)
  varint_bench_test.go:16       0x10f081d               4889442450              MOVQ AX, 0x50(SP)
  varint_bench_test.go:16       0x10f0822               488b6c2420              MOVQ 0x20(SP), BP
  varint_bench_test.go:16       0x10f0827               4883c428                ADDQ $0x28, SP
  varint_bench_test.go:16       0x10f082b               c3                      RET
  varint_bench_test.go:20       0x10f082c               488b4c2430              MOVQ 0x30(SP), CX
  varint_bench_test.go:20       0x10f0831               488b4918                MOVQ 0x18(CX), CX
  varint_bench_test.go:20       0x10f0835               88442408                MOVB AL, 0x8(SP)
  varint_bench_test.go:20       0x10f0839               488b442438              MOVQ 0x38(SP), AX
  varint_bench_test.go:20       0x10f083e               48890424                MOVQ AX, 0(SP)
  varint_bench_test.go:20       0x10f0842               ffd1                    CALL CX
  varint_bench_test.go:20       0x10f0844               488b442418              MOVQ 0x18(SP), AX
  varint_bench_test.go:20       0x10f0849               488b4c2410              MOVQ 0x10(SP), CX
  varint_bench_test.go:20       0x10f084e               48894c2448              MOVQ CX, 0x48(SP)
  varint_bench_test.go:20       0x10f0853               4889442450              MOVQ AX, 0x50(SP)
  varint_bench_test.go:20       0x10f0858               488b6c2420              MOVQ 0x20(SP), BP
  varint_bench_test.go:20       0x10f085d               4883c428                ADDQ $0x28, SP
  varint_bench_test.go:20       0x10f0861               c3                      RET
  varint_bench_test.go:12       0x10f0862               e89948f6ff              CALL runtime.morestack_noctxt(SB)
  varint_bench_test.go:12       0x10f0867               e944ffffff              JMP command-line-arguments.WriteUvarint(SB)
  :-1                           0x10f086c               cc                      INT $0x3
  :-1                           0x10f086d               cc                      INT $0x3
  :-1                           0x10f086e               cc                      INT $0x3
  :-1                           0x10f086f               cc                      INT $0x3

The generated code loads the interface address using double indirection in every loop invocation, and every call (line 14 & 20).

The compiler could easily generate optimized code where the DX is loaded once and used for every interface call, as w is constant in the method.

It is my opinion that loops like this are very common in typical Go code and deserve more optimization attention. As an example, issue #29010 makes specific reference to not using interfaces as call sites due to their inefficiency.

At a minimum the call address could be placed in a stack local avoiding one indirection.

A more advanced change might be to reserve a couple of general purpose registers for the hot interface call address and object reference (r10/r11) and so push/pop r10/r11 on entry/exit for those routines using the optimization.

Issue #18597 has some overlap with this.

robaho · 2018-12-15T14:28:00Z

I can see why it’s problematic - due to the GC and stack growth, but only the dedicated struct pointer would need fix up. The interface fn pointer doesn’t need it.

martisch · 2018-12-15T14:58:44Z

The compiler could easily generate optimized code where the DX is loaded once and used for every interface call, as w is constant in the method.

Note that I think DX would currently need to be at least spilled and reloaded on every iteration since the function executed with the interface call could clobber DX.

robaho · 2018-12-15T18:09:20Z

That’s why maybe only and64 and reserve one of the rN registers and make the callee preserve if used. I think most interface calls do not end up calling another interface, it is usually into the concrete after the initial, so the save would be slight overhead as it is outside the loop.

…

On Dec 15, 2018, at 8:59 AM, Martin Möhrmann ***@***.***> wrote: The compiler could easily generate optimized code where the DX is loaded once and used for every interface call, as w is constant in the method. Note that I think DX would currently need to be at least spilled and reloaded on every iteration since the function executed with the interface call could clobber DX. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

josharian · 2018-12-15T18:42:42Z

Registers are precious on and64. If we are going to dedicate one to a single purpose it’ll probably be for the current g.

Changing the ABI (e.g. to support callee-save registers) is a major undertaking...and one which we are actively working on.

With callee-save registers, it is reasonably likely that a good register allocator will already choose a callee-save register for this, but it would be good to confirm (once that hypothetical becomes reality).

randall77 · 2018-12-15T18:55:29Z

It would be nice to teach the compiler that there's no need to reload from slots of an itab - there's nothing that writes (already initialized) itabs.

I'm not sure how much performance it actually costs, at least in this example. The branch predictor is going to predict the call correctly, so nothing is actually waiting for the results of these loads. As long as L1 has the bandwidth for these loads it won't slow anything down.

It would be good to find a benchmark where we could see an improvement. Not sure what that would look like, or how we would know without implementing the optimization. Maybe use performance counters to find a benchmark with stalls on these instructions?

robaho · 2018-12-15T19:03:52Z

I am working on some hand assembly to test the performance difference.

josharian · 2018-12-15T19:52:12Z

It would be nice to teach the compiler that there's no need to reload from slots of an itab - there's nothing that writes (already initialized) itabs.

I was also thinking that but I don’t see what the implementation would look like. Do you have a suggestion for how?

robaho · 2018-12-15T20:38:28Z

Initial testing shows no performance improvement when you need to save/reload the register on each call. It seems the only viable solution is to dedicate registers for interface dispatch and make them callee save - once the GC/safepoint occurs all registers appear to be trashed trashed - which in most cases (outside of GC/runtime) they should not need to be saved/restored.

I think this would go along ways towards improving the performance of Go while encouraging interface based design.

randall77 · 2018-12-15T21:27:15Z

I was also thinking that but I don’t see what the implementation would look like. Do you have a suggestion for how?

Nope. If we're going to do any sort of alias analysis to enable more aggressive store-load forwarding, the itab special case would be easy to incorporate. Not sure how to do the former, though.

robaho mentioned this issue Dec 14, 2018

proposal: encoding/binary: add WriteVarint WriteUVarint methods #29010

Closed

ALTree added the Performance label Dec 15, 2018

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022

mknyszek added this to Go Compiler / Runtime Jul 13, 2022

mknyszek moved this to Triage Backlog in Go Compiler / Runtime Jul 15, 2022

seankhliao added this to the Unplanned milestone Aug 20, 2022

seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 20, 2022

gabyhelp mentioned this issue Oct 18, 2024

cmd/compile: get struct member when inlining #69935

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: improve interface dispatch performance #29276

cmd/compile: improve interface dispatch performance #29276

robaho commented Dec 14, 2018

robaho commented Dec 15, 2018

martisch commented Dec 15, 2018

robaho commented Dec 15, 2018 via email

josharian commented Dec 15, 2018

randall77 commented Dec 15, 2018

robaho commented Dec 15, 2018

josharian commented Dec 15, 2018

robaho commented Dec 15, 2018

randall77 commented Dec 15, 2018

cmd/compile: improve interface dispatch performance #29276

cmd/compile: improve interface dispatch performance #29276

Comments

robaho commented Dec 14, 2018

robaho commented Dec 15, 2018

martisch commented Dec 15, 2018

robaho commented Dec 15, 2018 via email

josharian commented Dec 15, 2018

randall77 commented Dec 15, 2018

robaho commented Dec 15, 2018

josharian commented Dec 15, 2018

robaho commented Dec 15, 2018

randall77 commented Dec 15, 2018