cmd/internal/obj/x86: pad jumps to avoid Intel erratum

Intel erratum SKX102 “Processor May Behave Unpredictably Under Complex Sequence of Conditions Which Involve Branches That Cross 64-Byte Boundaries” applies to:

  • Intel® Celeron® Processor 4000 Series
  • Intel® Celeron® Processor G Series
  • 10th Generation Intel® Core™ i5 Processors
  • 10th Generation Intel® Core™ i7 Processors
  • 6th Generation Intel® Core™ i3 Processors
  • 6th Generation Intel® Core™ i5 Processors
  • 6th Generation Intel® Core™ i7 Processors
  • 6th Generation Intel® Core™ m Processors
  • 7th Generation Intel® Core™ i3 Processors
  • 7th Generation Intel® Core™ i5 Processors
  • 7th Generation Intel® Core™ i7 Processors
  • 7th Generation Intel® Core™ m Processors
  • 8th Generation Intel® Core™ i3 Processors
  • 8th Generation Intel® Core™ i5 Processors
  • 8th Generation Intel® Core™ i7 Processors
  • 8th Generation Intel® Core™ m Processors
  • 9th Generation Intel® Core™ i9 Processors
  • Intel® Core™ X-series Processors
  • Intel® Pentium® Gold Processor Series
  • Intel® Pentium® Processor G Series
  • Intel® Xeon® Processor E3 v5 Family
  • Intel® Xeon® Processor E3 v6 Family
  • 2nd Generation Intel® Xeon® Scalable Processors
  • Intel® Xeon® E Processor
  • Intel® Xeon® Scalable Processors
  • Intel® Xeon® W Processor

There is a microcode fix that can be applied by the BIOS to avoid the incorrect execution. It stops any jump (jump, jcc, call, ret, direct, indirect) from being cached in the decoded icache when the instruction ends at or crosses a 32-byte boundary. Intel says:

Intel has observed performance effects associated with the workaround [microcode fix] ranging from 0-4% on many industry-standard benchmarks. In subcomponents of these benchmarks, Intel has observed outliers higher than the 0-4% range. Other workloads not observed by Intel may behave differently.

The suggested workaround for the workaround is to insert padding so that fused branch sequences never end at or cross a 32-byte boundary. This means the whole CMP+Jcc, not just Jcc.
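As a rough illustration of the padding rule, here is a minimal Go sketch. The function name jumpPad and its signature are hypothetical and are not taken from the CL; the alignment is a parameter, with 32 matching the boundary the microcode fix checks. It treats a fused CMP+Jcc as a single unit of the combined size and assumes that unit is shorter than the alignment.

```go
package main

import "fmt"

// jumpPad is a hypothetical helper (not the code in asm6.go). It returns the
// number of padding bytes to place before an instruction sequence of the
// given size starting at pc so that the sequence neither crosses nor ends at
// an align-byte boundary. align must be a power of two and size < align.
func jumpPad(pc, size, align int) int {
	off := pc & (align - 1) // offset of the sequence within its block
	if off+size >= align {  // last byte reaches or passes the boundary
		return align - off // push the whole sequence to the next block
	}
	return 0
}

func main() {
	// A 3-byte register-register CMP fused with a 2-byte Jcc is one 5-byte unit.
	for _, pc := range []int{0, 26, 27, 30} {
		fmt.Printf("pc=%d pad=%d\n", pc, jumpPad(pc, 5, 32))
	}
}
```

With a 5-byte unit and 32-byte blocks, offsets 0 through 26 need no padding, while offsets 27 through 31 are pushed to the start of the next block.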

CL 206837 adds a new environment variable to set the padding policy. The original CL used $GO_X86_PADJUMP but the discussion has moved on to using $GOAMD64, which would avoid breaking the build cache.

There are really two questions here:

  • What is the right amount of padding to insert by default?
  • Given that default, what additional control do developers need over the padding?

In general, we try to do the right thing for developers so that they don't have to keep track of every last CPU erratum. That seems like it would suggest we should do the padding automatically. Otherwise Go programs on this very large list of processors have the possibility of behaving “unpredictably.”

If the overheads involved are small enough and we are 100% confident in the padding code, we could stop there and just leave it on unconditionally. It seems like that's what we should do rather than open the door to arbitrary compiler option configuration in $GOAMD64, and all the complexity that comes with it.

So what are the overheads? Here is an estimate.

$ ls -l $(which go)
-rwxr-xr-x  1 rsc  primarygroup  15056484 Nov 15 15:10 /Users/rsc/go/bin/go
$ go tool objdump $(which go) >go.dump
$ grep -c '^TEXT' go.dump
10362
$ cat go.dump | awk '$2~/^0x/ {print $4, length($3)/2}' | sort | uniq -c | egrep 'CALL| J|RET' >jumps
$ cat jumps | awk '{n=$3; if($2 ~ /^J[^M]/) n += 3; total += $1*(n/16)*(n+1)/2} END{print total}'
251848
$ 

This go command has 10,362 functions, and the padding required for instructions crossing or ending at a 16-byte boundary should average out to 251,848 extra bytes. The awk adds 3 to conditional jumps to simulate fusing of a preceding register-register CMP instruction.
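The per-instruction term in the awk script, (n/16)*(n+1)/2, can be checked by brute force. The sketch below assumes each start offset within a 16-byte block is equally likely and ignores the way inserted padding shifts later instructions; the function names are made up for this illustration. It enumerates every offset, sums the padding needed, and compares the average with the closed form.

```go
package main

import "fmt"

// expectedPad enumerates every start offset within an align-byte block and
// returns the average padding needed so that an n-byte instruction neither
// crosses nor ends at the block boundary. Assumes 0 < n < align and that
// all offsets are equally likely.
func expectedPad(n, align int) float64 {
	total := 0
	for off := 0; off < align; off++ {
		if off+n >= align { // crosses or ends at the boundary
			total += align - off // pad up to the next boundary
		}
	}
	return float64(total) / float64(align)
}

// closedForm is the expression from the awk script, (n/align)*(n+1)/2:
// n of the align offsets need padding, and those pads average (n+1)/2 bytes.
func closedForm(n, align int) float64 {
	return float64(n) / float64(align) * float64(n+1) / 2
}

func main() {
	for _, n := range []int{1, 2, 5, 6, 9} {
		fmt.Printf("n=%d enumerated=%.4f closed=%.4f\n",
			n, expectedPad(n, 16), closedForm(n, 16))
	}
}
```

The two columns agree for every length below 16, which is why the script can multiply each instruction count by this term and sum the results.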

Changing function alignment to 32 bytes would halve the padding added (saving 125,924 bytes) but add 16 more bytes on average to each of the functions (adding 165,792 bytes). So changing function alignment does not seem to be worthwhile.
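The trade-off arithmetic can be reproduced directly from the figures above. This is only a sketch of the estimate as stated (halve the padding, add 16 bytes per function), with a hypothetical helper name.

```go
package main

import "fmt"

// alignTradeoff is a hypothetical helper. It returns the bytes saved by
// halving the estimated jump padding and the bytes added by growing each
// function by extraPerFunc bytes on average.
func alignTradeoff(padding float64, funcs int, extraPerFunc float64) (saved, added float64) {
	return padding / 2, float64(funcs) * extraPerFunc
}

func main() {
	// Figures for the go command: 251,848 bytes of padding, 10,362 functions.
	saved, added := alignTradeoff(251848, 10362, 16)
	fmt.Printf("saved %.0f bytes, added %.0f bytes, net %+.0f bytes\n",
		saved, added, added-saved)
}
```

A positive net value means the binary grows overall, which is the basis for not changing function alignment.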

Same for a smaller binary:

$ ls -l $(which gofmt)
-rwxr-xr-x  1 rsc  primarygroup  3499584 Nov 15 15:09 /Users/rsc/go/bin/gofmt
$ go tool objdump $(which gofmt) >gofmt.dump
$ grep -c '^TEXT' gofmt.dump
2956
$ cat gofmt.dump | awk '$2~/^0x/ {print $4, length($3)/2}' | sort | uniq -c | egrep 'CALL| J|RET' >jumps
$ cat jumps | awk '{n=$3; if($2 ~ /^J[^M]/) n += 3; total += $1*(n/16)*(n+1)/2} END{print total}'
58636.8
$ 

Changing alignment to 32 would save 29,318.4 bytes but add 47,296 bytes.

Overall, the added bytes are 1.67% in both the go command and gofmt. This is not nothing, but it seems like a small price to pay for correct execution, and if it makes things faster on some systems, even better.

My tentative conclusion then would be that we should just turn this on by default and not have an option. Thoughts?

golang/go

Reply from dr2chase

Just so everyone knows, the penalty can be quite bad, as reported in #37190.

I reproduced this, comparing 1.13 (which was just lucky), two versions of aligned, and unpadded:

name \ time/op    Go-1.13     Go-1.14-vzu-align  Go-1.14-vzu-nopalign  Go-1.14-vzu
FastTest2KB-4     141ns ± 2%  115ns ± 2%         112ns ± 0%            269ns ± 1%

Note that alignment improves the best case, so the best-to-worst slowdown exceeds 100% when things line up just so.

For that set of benchmarks (excluding those affected by a not-fully-mitigated macOS bug):

name \ time/op    Go-1.13     Go-1.14-vzu-align  Go-1.14-vzu-nopalign  Go-1.14-vzu
[Geo mean]        54.9µs      53.1µs             53.3µs                55.1µs

The benchmarks were those in https://github.com/dr2chase/bent

In another benchmark run, I also checked the size and performance costs of 16-byte vs 32-byte alignment. We want 32-byte alignment: 16 gives 0.7% bigger text and 0.82% slower geomean execution, with almost no winners in the run-time column.

For reference, the two benchmark configurations:

[[Configurations]]
  Name = "Go-1.14-vzeroupper-nopalign-32-lessf2i-nopreempt"
  Root = "$HOME/work/go-quick/"
  GcEnv = ["GOAMD64=alignedjumps"]
  RunEnv = ["GODEBUG=asyncpreemptoff=1"]

[[Configurations]]
  Name = "Go-1.14-vzeroupper-nopalign-16-lessf2i-nopreempt"
  Root = "$HOME/work/go/"
  GcEnv = ["GOAMD64=alignedjumps"]
  RunEnv = ["GODEBUG=asyncpreemptoff=1"]

and git diff in go:

diff --git a/src/cmd/internal/obj/x86/asm6.go b/src/cmd/internal/obj/x86/asm6.go
index 16e73fad44..21d254d1e2 100644
--- a/src/cmd/internal/obj/x86/asm6.go
+++ b/src/cmd/internal/obj/x86/asm6.go
@@ -1982,7 +1982,7 @@ func makePjc(ctxt *obj.Link) *padJumpsCtx {
                return &padJumpsCtx{}
        }
        return &padJumpsCtx{
-               jumpAlignment: 32,
+               jumpAlignment: 16,
        }
 }
 
diff --git a/src/cmd/link/internal/amd64/obj.go b/src/cmd/link/internal/amd64/obj.go
index 3239c61864..f1f2e3e11c 100644
--- a/src/cmd/link/internal/amd64/obj.go
+++ b/src/cmd/link/internal/amd64/obj.go
@@ -40,9 +40,9 @@ func Init() (*sys.Arch, ld.Arch) {
        arch := sys.ArchAMD64
 
        fa := funcAlign
-       if objabi.GOAMD64 == "alignedjumps" {
-               fa = 32
-       }
+       //if objabi.GOAMD64 == "alignedjumps" {
+       //      fa = 32
+       //}
 
        theArch := ld.Arch{
                Funcalign:  fa,

I think we should be looking at the NOP-only patch and probably just have it turned on all the time. This seems like the least-risk way of avoiding this sometimes-terrible slowdown that will also interfere with performance-tuning work on updated-microcode Intel processors.
