
issue comment golang/go

runtime: fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?) [1.14 backport]

I checked in the fix for #37688 at 825ae71e567593d3a28b7dddede8745701273c52, https://go-review.googlesource.com/c/go/+/224581

Thanks for cherry-picking to Go 1.14.2!

gopherbot

comment created time in 4 days

issue opened golang/go

crypto/dsa: builders with long tests are failing in crypto/dsa.TestEqual

The builders with long tests (GO_TEST_SHORT=0) are failing in crypto/dsa.TestEqual. For example:

https://build.golang.org/log/eaeb045adcbf56e2df980dc0cb8c17554dc2b50a

It looks like it started happening with change 5c9bd49, https://go-review.googlesource.com/c/go/+/223754. @FiloSottile

What version of Go are you using (go version)?

Running on current gc tip

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

Happens on linux-amd64 and a bunch of other architectures.

created time in 5 days

issue comment golang/go

x/build: linux-amd64-staticlockranking consistently failing

Yes, I need to check in https://go-review.googlesource.com/c/go/+/222925 (which will probably go in today) and https://go-review.googlesource.com/c/go/+/207619 (which I will probably submit in another couple of days).

I'm not quite sure why @dmitshur enabled the builder before my checkins went in. Feel free to disable it until my second checkin goes in -- whatever makes sense.

andybons

comment created time in 5 days

issue comment golang/go

x/build: create builder that runs with GOEXPERIMENT set to 'staticlockranking' (once lock ranking change is in)

OK, thank you for doing all this, @dmitshur . Will let you know how it goes with the SlowBots when I'm ready.

danscales

comment created time in 8 days

issue comment golang/go

runtime: fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

I understand the cause and have a simple fix. Working to see if I can create a much simpler test that can reproduce the problem. Not sure if that will be doable, because of the complex interactions and non-determinism involving the GC, etc.

peterbourgon

comment created time in 8 days

issue comment golang/go

runtime: fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

Yes, thanks for bisecting it! I have reproduced it on my system, so I will start debugging it.

peterbourgon

comment created time in 9 days

issue comment golang/go

x/build: create builder that runs with GOEXPERIMENT set to 'staticlockranking' (once lock ranking change is in)

I see. For those CLs, it'll be possible to run the new builder via SlowBots by asking for that builder specifically with something like TRY=linux-amd64-staticlockranking.

I understand it should test only the main Go repository. Let me know if you think it would be helpful to test some/many/all golang.org/x repos as well.

No, there will be no need to run such a builder for any repo other than the main Go repo.

Thanks for figuring out the necessary builder work!

I hope to get my first change (222925) submitted today or tomorrow, and then it may be a few more days before I get the full 'static lock ranking' change (207619) submitted.

danscales

comment created time in 10 days

issue comment golang/go

x/build: create builder that runs with GOEXPERIMENT set to 'staticlockranking' (once lock ranking change is in)

Yes, I think a post-submit builder will be fine. This static lock ranking will only affect people who make fairly significant changes to the Go runtime, and those folks will hopefully also know when it will be useful to run the static lock ranking checks before checkin (because they changed the way certain runtime locks are used).

danscales

comment created time in 10 days

issue opened golang/go

x/build: create builder that runs with GOEXPERIMENT set to 'staticlockranking' (once lock ranking change is in)

Once my changes https://go-review.googlesource.com/c/go/+/222925 (enable a build tag corresponding to the GOEXPERIMENT value) and https://go-review.googlesource.com/c/go/+/207619 (enforce static lock ranking in the runtime when the GOEXPERIMENT is enabled) are in, we will want to enable a builder (probably just for linux/amd64) that runs with GOEXPERIMENT equal to staticlockranking.

This will not only test that the lock ranking is not being violated in the runtime (hence helping to avoid deadlocks related to lock acquisition ordering), but will also test that the build tags are correctly set for a different GOEXPERIMENT value.
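For illustration, here is a minimal sketch of how runtime code could be gated on the experiment via the corresponding build tag. The exact tag spelling (goexperiment.staticlockranking) and the file/constant names are my assumptions, not something stated in this issue:

// lockrank_on.go (hypothetical file name), compiled into the runtime only
// when the toolchain was built with GOEXPERIMENT=staticlockranking:

// +build goexperiment.staticlockranking

package runtime

// A paired lockrank_off.go file would carry the opposite constraint,
// "!goexperiment.staticlockranking", and define this as false.
const staticLockRanking = true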

created time in 10 days

issue comment golang/go

proposal: cmd/compile: make 64-bit fields be 64-bit aligned on 32-bit systems, add //go:packed directive on structs

Yes, I am working on a design proposal, but haven't been able to get back to it for a few weeks. There are lots of pros & cons for several variants of the proposed syntax/semantics, so there hasn't been an obvious consensus. But I will still aim to publish the design proposal (with a particular set of proposed choices) in the next couple of weeks.

danscales

comment created time in 17 days

issue comment golang/go

runtime: crash on 1.14 with unexpected return pc, fatal error: unknown caller pc [1.14 backport]

Fix (submitted for 1.15) is here: https://go-review.googlesource.com/c/go/+/222420

gopherbot

comment created time in 18 days

issue comment golang/go

runtime: crash on 1.14 with unexpected return pc, fatal error: unknown caller pc [1.14 backport]

This is a regression in defer behavior for 1.14 for programs that do repeated panics, recovers, and re-panics (which could be because of a recursive, interpreter-like function). The actual fix is fairly simple, so it would be good to get this into 1.14.1.
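For illustration only, here is a hypothetical shape of the kind of program affected (recursive panic/recover/re-panic; this is not the reporter's actual code):

package main

func interp(depth int) {
	defer func() {
		if r := recover(); r != nil {
			if depth > 0 {
				panic(r) // re-panic: let an outer interp frame handle it
			}
			// depth == 0: final recover; resume normally
		}
	}()
	if depth == 3 {
		panic("error in interpreted code")
	}
	interp(depth + 1)
}

func main() {
	// panic at depth 3, recover/re-panic at depths 3, 2, 1, final recover at 0
	interp(0)
}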

gopherbot

comment created time in 18 days

issue comment golang/go

runtime: crash on 1.14 with unexpected return pc, fatal error: unknown caller pc

@gopherbot please consider this for backport to 1.14, it's a regression (and the fix is quite simple).

apmckinlay

comment created time in 18 days

issue comment golang/go

runtime: fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

Though it's always possible, this doesn't look likely to be related to #37664 . In that bug, I would expect either an invalid pc during a traceback (gentraceback()) from a stale defer, or a problem during adjustdefers() or tracebackdefers() in scanstack(). It would be easy to confirm if we had a repro case (but that's hard to come by, I know).

peterbourgon

comment created time in 19 days

issue comment golang/go

runtime: crash on 1.14 with unexpected return pc, fatal error: unknown caller pc

I would have expected that you would only need the workaround for every defer that has a recover and then a possible re-panic, or that is likely to be on the stack when such panic-recover-re-panic sequences are happening. I don't expect you would have any trouble with the defers in the Go runtime (really just the one in runtime.main, I think, at the very start of the program). However, it may be a little hard to catch all such defers.

It would be helpful if you could try a bit more to apply the workaround to the various defers that seem to be related to the panic-recover-re-panic loop, and let me know through the bug if you are successful (and how many defers you tried fixing, whether successful or not).

I have the fix and a sample test that I'm about to put out for review.

apmckinlay

comment created time in 22 days

issue comment golang/go

runtime: crash on 1.14 with unexpected return pc, fatal error: unknown caller pc

Currently, it seems like it could be a good option for cherry-picking to 1.14.1, but I will have to confirm the fix and check with the release folks, etc. I'll update as I learn more.

apmckinlay

comment created time in 23 days

issue comment golang/go

runtime: crash on 1.14 with unexpected return pc, fatal error: unknown caller pc

@apmckinlay Thanks for setting up the repro case! It actually reproduces on Linux, though I had to fix the sys_nix.go file (syscall.Sysctl no longer exists on Linux).

The bug is actually related to the new open-coded defers and their interaction with panic/recover. Confusingly, 'go build' doesn't recompile all the sub-packages with the -N option unless you do:

go build -gcflags="all=-N"

The bug goes away if you do that, since the problem is a defer in interp.go. As a more targeted workaround, you can put a 1-iteration for loop around the defer statement in interp() (with no -N option needed), and the problem goes away:

for i := 0; i < 1; i++ {
    defer func() {
        // this is an optimization ...
    }()
}

The bug does require several sequences of panics and recovers with a further re-panic, before doing the final recover.

I think that I have the actual fix in the Go runtime, which is fairly simple, but I'm still working to verify it is the full fix, do more testing, etc.

apmckinlay

comment created time in 23 days

issue comment golang/go

runtime: crash on 1.14 with unexpected return pc, fatal error: unknown caller pc

@apmckinlay I'm happy to work on debugging this if you can create a code example that you are able to share. Even though it is not specifically related to the defer changes, I have also recently worked with the panic/recover implementation. One change that also went into Go 1.14 relating to panic/recover is https://go-review.googlesource.com/c/go/+/200081 . It should only affect behavior if you did a panic/recover after initiating a Goexit(). I'm assuming that was not your scenario, but the change could have had some other unintended side effect.
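For illustration, here is a hypothetical example of the scenario that CL 200081 touches (a panic/recover that happens after a Goexit has been initiated); this is not meant to reflect your program:

package main

import "runtime"

func main() {
	c := make(chan struct{})
	go func() {
		defer close(c)
		defer func() {
			// This deferred function runs as part of the Goexit below;
			// the panic/recover here happens after Goexit was initiated.
			defer func() { recover() }()
			panic("raised while the goroutine is exiting")
		}()
		runtime.Goexit()
	}()
	<-c
}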

apmckinlay

comment created time in 24 days

issue comment golang/go

proposal: cmd/compile: make 64-bit fields be 64-bit aligned on 32-bit systems, add //go:packed directive on structs

With the addition of something like '//go:align(8)' for types, could we declare that the byte alignment of the pointed-to type is now included in a pointer type's underlying type? Then pointers to otherwise identical struct types with different byte alignments would no longer be convertible.

Hence, in this example (where the default alignment of A and B is 4 bytes):

type A struct {
  x, y, z, w int32
}

type B struct {
  x, y, z, w int32
}

//go:align(8)
type C struct {
  x, y, z, w int32
}

a := &A{}
b := (*B)(a)    // allowed
c := (*C)(a)    // not allowed

converting a pointer to A to a pointer to C is not allowed (compile-time error).

danscales

comment created time in 2 months

issue comment golang/go

proposal: cmd/compile: make 64-bit fields be 64-bit aligned on 32-bit systems, add //go:packed directive on structs

@beoran In your galago code, I see very little use of uint64/int64/float64, and even less that matters for 32-bit systems. In that case, the only 64-bit type in use that I see is Time_t, which is int64 even for 32-bit systems. Long, however, looks to be int32 on a 32-bit system. So, there could possibly be a change in alignment only because of the use of Time_t, but the only use of Time_t is in Timeval, where it is the first field. So, the offsets of the fields in this struct would not change on 32-bit systems, though the overall required alignment of the structure would increase from 32-bit to 64-bit. I don't think this should affect any interaction with the OS, but let me know if you foresee some issue.

We will keep your case in mind, and please do send any other pointers to code that might require code changes with this proposal.

danscales

comment created time in 2 months

issue comment golang/go

proposal: cmd/compile: make 64-bit fields be 64-bit aligned on 32-bit systems, add //go:packed directive on structs

@mknyszek proposed that we could create an ABI wrapper to relayout such structs in the arguments/results to assembly functions. This is kind of a pain, but is certainly doable. If we go down this path, maybe we also make ABIInternal Go functions continue to take a struct-style argument frame.

@aclements: I'm not sure if I'm understanding, but are you saying we could convert an incoming struct from the new layout back to the Go 1.13 layout via the wrapper, so that the existing assembly doesn't notice the field offset change? I suppose that is possible, but remember that the assembly routine might be taking a pointer to the struct, or accessing the struct via a chain of multiple pointers. So, it will certainly never be possible to ensure that an assembly routine that hard-codes field offsets will remain compatible with the struct alignment changes.

danscales

comment created time in 2 months

issue comment golang/go

proposal: cmd/compile: make 64-bit fields be 64-bit aligned on 32-bit systems, add //go:packed directive on structs

@beoran: just to get back to the immediate proposal (which only changes the alignment of 64-bit fields on 32-bit systems), it would definitely be helpful if you can send pointers to any of your code (or any other code that you know of) that will be immediately impacted. That is, if you have assembly language code that hard-codes offsets of fields in structs. Or if you have structs (other than cgo, which we plan to handle automatically) that must match existing syscall/dll/C structures, and will no longer match because of the alignment change. Similar to the Flock_t and Stat_t structs we mentioned above, any other such structures would potentially need to use //go:packed and explicit padding to keep the exact alignment desired for syscalls/dlls or other C API (but that is not using cgo).

danscales

comment created time in 2 months

issue comment golang/go

proposal: cmd/compile: make 64-bit fields be 64-bit aligned on 32-bit systems, add //go:packed directive on structs

We expect that any assembler functions that depend on the layout of a struct argument or return value should be using unsafe.Offsetof to determine field offsets within a struct.

Does the assembler accept unsafe.Offsetof for FP offsets? I don’t recall ever seeing that in assembly code.

I was thinking that you could have a Go init routine that stores the relevant values of unsafe.Offsetof in some global variables that are accessed by the assembler routine. Also, Ian pointed out to me that in many situations you can use 'go tool compile -asmhdr go_asm.h ...' on the Go file that defines the type to get the relevant offsets written out to a go_asm.h file. I can clarify this in the proposal.
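A minimal sketch of the init-routine idea (hypothetical type and variable names; the assembly routine would load the global instead of hard-coding the offset):

package mypkg

import "unsafe"

type header struct {
	count uint32
	data  uint64
}

var h header

// dataOffset is read by the assembly routine (e.g. MOVQ ·dataOffset(SB), R8
// on amd64) instead of hard-coding the field offset.
var dataOffset uintptr

func init() {
	dataOffset = unsafe.Offsetof(h.data)
}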

The good news is that vet checks these offsets, so breakage should be caught as a matter of course. (It’ll mean teaching go/types about the annotations, but that would need to happen anyway.)

Or do I misunderstand?

I don't really understand what you mean by 'vet checks these offsets' in this case. What vet rule is this? Or are you talking about the suggested new vet checks that I mention in the proposal?

Making this change and then an ABI change would mean breaking Go assembly twice.

As I understand it, the goal with introducing a new ABI would be to use it in Go code mainly, and not break existing assembly code (which would stay with the original Go ABI, referenced as ABI0 in Austin's changes).

I do agree that this proposal could break some existing assembly code. I'm guessing that the breakage might be quite minimal (but I may well be wrong), and the possibility of breaking some assembler code is certainly relevant to deciding about the benefits/drawbacks of this proposal.

danscales

comment created time in 2 months

issue opened golang/go

proposal: cmd/compile: make 64-bit fields be 64-bit aligned on 32-bit systems, add //go:packed directive on structs

Summary: We propose to change the default layout of structs on 32-bit systems such that 64-bit fields will be 64-bit aligned. The layout of structs on 64-bit systems will not change. For compatibility reasons (and finer control of struct layout), we also propose the addition of a //go:packed directive that applies to struct types. When the //go:packed directive is specified immediately before a struct type, then that struct will have a fully-packed layout, where fields are placed in order with no padding between them and hence no alignment based on types. The developer must explicitly add padding to enforce any desired alignment.

The main goal of this change is to avoid the bugs that frequently happen on 32-bit systems, where a developer wants to be able to do 64-bit operations on a 64-bit field (such as an atomic operation), but gets an alignment error because the field is not 64-bit aligned. With the current struct layout rules, a developer must often add explicit padding in order to make sure that such a 64-bit field is on a 64-bit boundary. As shown by repeated mentions in issue #599 (18 in 2019 alone), developers still often run into this problem. They may only run into it late in the development cycle as they are testing on 32-bit architectures, or when they execute an uncommon code path that requires the alignment.

As an example, the struct for ticks in runtime/runtime.go is declared as

var ticks struct {
	lock mutex
	pad  uint32 // ensure 8-byte alignment of val on 386
	val  uint64
}

so that the val field is properly aligned on 32-bit architectures. With the change in this proposal, it could be declared as:

var ticks struct {
	lock mutex
	val  uint64
}

and the val field would always be 64-bit aligned, even if other fields are added to the struct.

We do not propose changing the alignment of 64-bit types which are arguments or return values, since that would change the Go calling ABI on 32-bit architectures. For consistency, we also don't propose to change the alignment of 64-bit types which are local variables. However, since we are changing the layout of some structs, we could be changing the ABI for functions that take a struct as an argument or return value. We expect that any assembler functions that depend on the layout of a struct argument or return value should be using unsafe.Offsetof to determine field offsets within a struct.

However, we do require a new directive for compatibility in the case of cgo and also interactions with the operating system. We propose the addition of a //go:packed directive that applies to struct types. When the //go:packed directive is specified immediately before a struct type, then that struct will have a fully-packed layout, where fields are placed in order with no padding between them and hence no alignment based on types. The developer must explicitly add padding fields to enforce any desired alignment.

We would then use the //go:packed directive for cgo-generated struct types (with explicit padding added as needed). Also, given the strict layout requirements for structures passed to system calls, we would use //go:packed for structs that are used for system calls (notably Flock_t and Stat_t, used by Linux system calls). We can enforce the use of //go:packed for structures that are passed to system calls (typically via pointers) with the use of a go vet check. I don't think we would particularly encourage the use of //go:packed in any other situations (as with most directives), but it might be useful in a few specific other cases.
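As an illustration of the directive (a sketch only; //go:packed does not exist today, and the exact fields here are illustrative rather than the real Flock_t):

//go:packed
type flockT struct {
	Type   int16
	Whence int16
	_      [4]byte // explicit padding so Start lands at offset 8
	Start  int64
	Len    int64
	Pid    int32
	_      [4]byte // explicit padding to match the C/kernel layout exactly
}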

This proposal is essentially a solution to issue #599. A related issue #19057 proposes a way to force the overall alignment of structures. That proposal does not propose changing the default alignment of 64-bit fields on 32-bit architectures (the source of the problems mentioned in issue #599), but would provide a mechanism to explicitly align 64-bit fields without using developer-computed padding. With that proposal, aligning a uint64 field (for example) in a struct to a 64-bit boundary on a 32-bit system would require replacing the uint64 type with a struct type that has the required 64-bit alignment (as I understand it).

Compatibility: since we would be changing the default layout of structs, we could affect some programs running on 32-bit systems that depend on the layout of structs. However, the layout of structs is not explicitly defined in the Go language spec, except for minimum alignments, and we are maintaining the previous minimum alignments. So, I don't believe this change breaks the Go 1 compatibility promise. Any program that depends on the layout of a struct (other than minimum alignment) should be using unsafe.Offsetof.

created time in 2 months

issue comment golang/go

runtime: use frame pointers for callers

The most profitable and critical is to optimize tracer, maybe mprof. Yes, expanding later is perfectly fine. And I think we already do this in trace. Inline frames don't have PCs anyway. So even if we expand and apply skip, we can't possibly shrink it back to an array of PCs.

The problem is that the semantics of Callers (and presumably gcallers) is that the pcbuf slice must be filled up to its max (as far as possible) after skipping 'skip' frames, and those frames must include descriptions of the inlined frames. So, if we don't understand the inlining at the time we fill up pcbuf initially, we don't know exactly how many physical frames to skip, so we may not grab enough frames to fill up the max after we do the skip. (The easier thing would be to grab all physical frame pcs until we do the inlining later, but where do we store them all -- we only have pcbuf.) So, we need a slightly looser definition for skip and for filling in pcbuf for gcallers() if we are going to use the frame pointer optimization.
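For reference, the user-facing pattern that makes late expansion possible is the standard runtime.Callers plus runtime.CallersFrames loop; a minimal, self-contained example (not tied to the tracer code):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	pc := make([]uintptr, 16)
	// skip == 1 skips the runtime.Callers frame itself.
	n := runtime.Callers(1, pc)
	frames := runtime.CallersFrames(pc[:n])
	for {
		frame, more := frames.Next() // inlined calls are expanded into logical frames here
		fmt.Println(frame.Function, frame.Line)
		if !more {
			break
		}
	}
}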

But I will proceed with trying to optimize gcallers, doing the inlining interpretation later, and see how things go with your tests.

dvyukov

comment created time in 3 months

issue comment golang/go

runtime: use frame pointers for callers

@dvyukov I took a look at this and it seems that if we want to get runtime.Callers() semantics fully correct (including a proper skip, based on any inlined frames due to mid-stack inlining), then we will lose a lot of the benefit of the optimization. See my message for my prototype code https://go-review.googlesource.com/c/go/+/212301

But then I noticed that the tracer and mprof actually use a separate routine, gcallers. We could possibly optimize only gcallers(). In that case, would it be acceptable to not expand out the logical frames that are not physically there because of mid-stack inlining? Or to do the inline expanding at CallerFrames()/Next() time only, but do the early skip processing based on physical frames?

One really rough set of numbers that I did for your benchmark (take it with a grain of salt) is 40% overhead currently [like your 55% overhead], then 24% overhead if we do the optimization but have to deal with inlined frames, and 6% overhead if we don't have to deal with inlined frames (just follow FP pointers and done).

dvyukov

comment created time in 3 months

issue opened golang/go

runtime, cmd/trace: recent regression: "failed to parse trace: no consistent ordering of events possible"

Recent regression: I ran the BenchmarkClientServerParallel4 benchmark with tracing enabled, and 'go tool trace' can't deal with the trace file. I bisected the problem to change 7148478f1b by @rhysh. I'm guessing that the semaphore optimization in that change is somehow confusing the trace tool, and the trace tool needs to be updated to understand the event ordering.

What version of Go are you using (go version)?

Tip 4b21702fdc, essentially Go 1.14 beta 1

Does this issue reproduce with the latest release?

Does not happen in go 1.13

What operating system and processor architecture are you using (go env)?

go env Output:

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/usr/local/google/home/danscales/.cache/go-build"
GOENV="/usr/local/google/home/danscales/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/usr/local/google/home/danscales/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/google/home/danscales/gerrit/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/google/home/danscales/gerrit/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/usr/local/google/home/danscales/gerrit/go/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build750268088=/tmp/go-build -gno-record-gcc-switches"

What did you do?

% go test -trace trace.out -run NONE -bench BenchmarkClientServerParallel4 net/http
goos: linux
goarch: amd64
pkg: net/http
BenchmarkClientServerParallel4-12          31840             39298 ns/op           10043 B/op        80 allocs/op
PASS
ok      net/http        2.416s
% go tool trace trace.out
2019/12/17 10:07:54 Parsing trace...
failed to parse trace: no consistent ordering of events possible

What did you expect to see?

Expected to see normal output like:

2019/12/17 10:19:17 Parsing trace...
2019/12/17 10:19:22 Splitting trace...
2019/12/17 10:19:28 Opening browser. Trace viewer is listening on http://127.0.0.1:42335

and the trace web page comes up in the browser.

What did you see instead?

The "failed to parse trace: no consistent ordering of events possible" error message.

This is related to issue #29707 , which has the same error message, but is currently only indicated for programs related to cgo or a rare test failure. The failure in this bug is consistent and a more recent regression (and seems unrelated to cgo).

created time in 3 months

issue comment golang/go

cmd/compile: eliminate runtime.conv* calls with unused results

This bug, as currently specified, was actually fixed when @mdempsky enabled his new escape analysis algorithm. The reason that the runtime.conv* calls were created in the first place was (I think) that the old escape analysis algorithm could not prove that the arguments to debug() were not escaping, probably related to the variadic function call. With the new escape analysis in place, the runtime.conv* routines are not generated. All the args are created on the stack, and therefore all operations related to the debug function call can be and are dead-coded.

I think we probably just want to close this bug. We could create a new bug or proposal for the more general case of understanding when functions (like runtime.conv* or user functions) have no side effects, and optimizing/removing when their results are not used.

Here's an example of a function where the compiler could get rid of the purefunc() call, after inlining inlinedfunc() and doing constant propagation, etc, but currently purefunc() remains in the code (of course, since the compiler doesn't record that purefunc has no side effects):

package main

func main() {
	mytest("abc", 1)
}

//go:noinline
func mytest(s string, n int) int {
	var r int
	a := purefunc(s)
	if inlinedfunc() < 4 {
		r = a
	} else {
		r = 3
	}
	return r
}

//go:noinline
func purefunc(s string) int {
	return 18
}

func inlinedfunc() int {
	return 5
}
pavel

comment created time in 3 months

issue comment golang/go

cmd/compile: make 64-bit fields 64-bit aligned on 32-bit systems

Since I was already running into this issue on 32-bit architectures when working on a different change, and then @aclements pointed me at this issue and #19057, I decided to do a prototype change for this issue and see what problems I would discover. Actually, there were surprisingly few problems -- some of this was helped out by code that had to be added for amd64p32 a few years ago. Note that I did not change the alignment of 64-bit stack variables (int64, uint64, float64), since that would change the calling ABI and break a bunch of assembly language functions.

Prototype change is here

We knew that if we changed the default alignment for 64-bit fields on 32-bit architecture, we would have to have some exception/annotation, etc. for Cgo-generated Go structs, so that we could mimic the alignment of C structs. The main unforeseen problem that I ran into was that two Syscall structures on 32-bit architectures, Stat_t and Flock_t (see ztypes_linux_386.go), have a 64-bit field which is not aligned to 64-bits naturally. Therefore, changing the alignment rules according to this issue would change the layout of those structs. That does not work, since those structs are used in Linux system calls. (This may be related to the Apr 15th comment by @schlamar)

So, if we change the default alignment rules, it seems that, for both Cgo purposes and for several system call structs, we will need some kind of solution that specifies a "packed" layout for a struct, which means the layout follows the current rules. This could be a "//go:packed" annotation on a struct type.

What do people think? I believe that many people would like to fix this issue, so that folks on 32-bit architectures do not keep running into the unaligned atomic problem. Are there any suggestions on how to deal with the syscall and Cgo issues other than an annotation on a struct type (or maybe on a struct field)? Is it worth fixing this issue at this point if we have to add a new kind of annotation to allow for the original semantics?

rsc

comment created time in 4 months

issue comment golang/go

runtime: TestCallersDeferNilFuncPanic failed in generating traceback (linux/arm64)

@shawn-xdji thanks for the bug report and the stack trace.

I can reproduce this consistently with a debug compilation of the individual runtime test on a linux-arm64 gomote:

cd /workdir/go/src; ../bin/go test -gcflags="-N -l" runtime -test.run TestCallersDeferNilFuncPanic

The issue is that we are calling gentraceback() with a callback in order to look for the next open-coded defer frame to process in the panic case. But calling gentraceback() with a callback on arm64 (LR architectures) throws an error if jmpdefer is on the stack (because jmpdefer's register manipulation is non-atomic). This only happens in the one case where the defer pointer is nil, since then there is a seg fault on the jmpdefer instruction that jumps through the defer pointer. Also, it only happens if stack-allocated defers are used (since open-coded defers don't use jmpdefer), hence why this only happens when debug flags are enabled.

The simplest fix is to force the seg fault to happen in deferreturn() just before the jmpdefer(). This will cause one extra load in cases when jmpdefer is used, but that is now rare, since stack-allocated defers are now rare.

shawn-xdji

comment created time in 4 months

issue comment golang/go

cmd/compile: panic during early copyelim crash

I built the compiler/go command, etc., at the indicated commit 6ba0be1639 and then repeatedly did this command (-a to force recompile):

go install -a github.com/kevinburke/go-bindata

I wasn't able to reproduce on linux-amd64 in over 250 repetitions of the command.

mvdan

comment created time in 4 months

issue comment golang/go

runtime: SIGSEGV in mapassign_fast64 during cmd/vet

I compiled cmd/vet at the indicated commit (0ac8739ad5) and got the exact same binary, as far as I can tell. The assembly language, with the indicated SEGV, is here:

   0x0000000000410ff1 <+33>:    mov    0x48(%rsp),%rax
   0x0000000000410ff6 <+38>:    test   %rax,%rax
   0x0000000000410ff9 <+41>:    je     0x4112f3 <runtime.mapassign_fast64+803>
   0x0000000000410fff <+47>:    movzbl 0x8(%rax),%ecx     <====================SEGV

So we are loading the argument h (type hmap) into %rax and jumping to a panic if it is nil/zero (first three assembly instructions). But then when we do a load through (%rax), we are getting a SEGV and the address for the SEGV is indicated as 0 ("unexpected fault address 0x0")

@aclements Any chance that preemption is happening between these instructions, and not restoring the %rax register (i.e. changing it from non-zero to zero)? Just a thought, since I thought that there was still no pre-emption in runtime code.

Otherwise, this is pretty mysterious, since the map was just initialized above, as pointed out by @bcmills

One other thing to note is that the h arg of runtime.mapassign_fast64 in the stacktrace looks like a bogus pointer (I think) -- 0x637469796d2f656d. But I'm not sure these stack args are always right during a panic, etc. But it is definitely not zero.

myitcv

comment created time in 4 months

issue comment golang/go

runtime: TestNetpollDeadlock flakes across the board

I reproduced this on linux-amd64 at 1da575a7bc (the latest test failure above) by repeating this 100 times:

go test runtime -test.run TestNetpollDeadlock -test.count=100

(It usually happened within about 20 runs.) Using this reproduction case, I showed that it was also fixed by Ian's commit d80ab3e85a ("runtime: wake netpoller when dropping P, don't sleep too long in sysmon"). The failure is reproducible at the commit immediately before Ian's, and not reproducible with Ian's commit.

So, I think we can close this bug as well.

FiloSottile

comment created time in 5 months

issue comment golang/go

sync: TestMutexFairness flaky on openbsd-*-62

I was able to reproduce this reliably at the change 02a5502ab8d862309aaec3c5ec293b57b913d01d listed in one of the failure instances, by repeating this 100 times:

gomote run user-danscales-openbsd-amd64-62-0 go/bin/go test sync -test.run TestMutexFairness -test.count=1 -test.timeout=15s

Using this same way to reproduce, I found that it started happening when Ian turned on the new timer code. It is fixed (not reproducible) in the current master, and by bisecting, I found it was fixed by Ian's commit

d80ab3e85a : runtime: wake netpoller when dropping P, don't sleep too long in sysmon

@ianlancetaylor does that make sense to you? (It seems reasonable to me based on the description of the change and the bugs that it fixed.) If so, I think we can close this bug.

(Also, it doesn't seem like it has been happening on the builder for the last week, as expected.)

bcmills

comment created time in 5 months

issue comment golang/go

runtime: "call not at safe point" in TestDebugCallStack

As far as I can tell, this only shows up on linux-amd64-noopt, i.e. only for non-optimized code.

bcmills

comment created time in 5 months

issue comment golang/go

image/gif: TestDecodeMemoryConsumption flake on dragonfly-amd64

Hmmm, I should have said that I reproduced it at the change where the bug was originally reported, that is, at 316fb95f4fd94fb00f7746c32ae85a82d5be1b81 (around Oct 25th). So the pacing fix for #28574 (submitted Sept 4th) would have been included, but the problem still occurred.

But I just tried with the most recent master (today), and I can't seem to reproduce the problem. So, was there any other GC pacing change made recently (between Oct. 25th and Sept 4th)?

@mknyszek

bcmills

comment created time in 5 months

issue comment golang/go

image/gif: TestDecodeMemoryConsumption flake on dragonfly-amd64

I reproduced by using dragonfly gomote, and repeatedly calling:

gomote run user-danscales-dragonfly-amd64-0 go/bin/go test -test.count=100 image/gif

Reproduced about 4 times in 50 runs.

I changed the test so that when the failure happens (because the heap is more than 30MB bigger at the end of the decode than at the beginning), the test does a runtime.GC() and then measures the heap difference again. This new code shows that the GC fully recovers the 77MB, and actually 4MB more. So, I'm guessing this is not a bug, just a case where sometimes memory is not quite scanned/freed in the same way by the initial GC call.
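A rough sketch of the extra measurement described above (a hypothetical helper, not the actual test code; it assumes the runtime and testing imports):

func logRecovered(t *testing.T) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before) // heap at the point where the test would have failed
	runtime.GC()                  // force a full collection
	runtime.ReadMemStats(&after)
	t.Logf("heap recovered by forced GC: %d bytes",
		int64(before.HeapAlloc)-int64(after.HeapAlloc))
}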

If this happens a lot, we should probably just change the test threshold from 30MB to 100MB, or something like that.

bcmills

comment created time in 5 months

issue comment golang/go

runtime: a runtime.Goexit call will cancel a panic, which seems unexpected/incorrect

any Goexit should remain permanently below the panic stack, so that if all panics are recovered the goroutine resumes the Goexit, but if any panic propagates out of the topmost call of the goroutine then the program crashes.

This option would probably be quite tricky to implement, but first there are a bunch of details that would have to be specified (a small sketch of the scenario follows the list):

  • When runtime.Goexit is called (after a panic has happened), do we continue executing normal code after the Goexit, or do we immediately end the defer that the Goexit was running in and continue to the next defer in the panicking sequence?

  • If all panics are recovered and we return to processing the Goexit, what does the stack look like at that point in runtime.Callers()? Does the stack have to look exactly like it did when the Goexit was first called?
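To make these questions concrete, here is a hypothetical shape of the scenario (an illustration only, not a proposal for specific behavior):

package main

import "runtime"

func f() {
	defer func() {
		recover() // if this recovers the panic, do we resume the Goexit (second question)?
	}()
	defer func() {
		// This deferred call is run by the panicking sequence and then calls
		// Goexit mid-panic (first question: does the panic sequence simply
		// continue with the next defer, i.e. the recover above?).
		runtime.Goexit()
	}()
	panic("boom")
}

func main() {
	c := make(chan struct{})
	go func() {
		defer close(c)
		f()
	}()
	<-c
}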

danscales

comment created time in 5 months

issue comment golang/go

runtime: "internal error: misuse of lockOSThread/unlockOSThread" in runOpenDeferFrame

Should be fixed by https://golang.org/cl/204802

bcmills

comment created time in 5 months

issue closed golang/go

runtime: "internal error: misuse of lockOSThread/unlockOSThread" in runOpenDeferFrame

https://build.golang.org/log/2ad3c689c702a04aa5f733d1be1310466392dc07:

# cmd/go/internal/web.test
fatal error: runtime: internal error: misuse of lockOSThread/unlockOSThread

runtime stack:
runtime.throw(0x1289aa7, 0x3e)
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/panic.go:1045 +0x72
runtime.badunlockosthread()
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/proc.go:3697 +0x36
runtime.systemstack(0x0)
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/asm_amd64.s:370 +0x66
runtime.mstart()
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/proc.go:1069

goroutine 1 [running]:
runtime.systemstack_switch()
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/asm_amd64.s:330 fp=0xc00075af20 sp=0xc00075af18 pc=0x105ac90
runtime.unlockOSThread()
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/proc.go:3690 +0x84 fp=0xc00075af40 sp=0xc00075af20 pc=0x1038a74
runtime.main.func2(0xc000070fae)
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/proc.go:159 +0x33 fp=0xc00075af50 sp=0xc00075af40 pc=0x1059d63
runtime.call32(0x0, 0x128be78, 0xc003ca14e8, 0x800000008)
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/asm_amd64.s:539 +0x3b fp=0xc00075af80 sp=0xc00075af50 pc=0x105b04b
runtime.runOpenDeferFrame(0xc000000180, 0xc003ca14a0, 0xc00075b0a0)
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/panic.go:816 +0x2b5 fp=0xc00075b008 sp=0xc00075af80 pc=0x102d8a5
panic(0x1238ca0, 0x142d630)
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/panic.go:909 +0x155 fp=0xc00075b0a0 sp=0xc00075b008 pc=0x102db05
runtime.panicmem(...)
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/panic.go:212
runtime.sigpanic()
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/signal_unix.go:594 +0x3da fp=0xc00075b0d0 sp=0xc00075b0a0 pc=0x10438ba
memeqbody()
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/internal/bytealg/equal_amd64.s:102 +0xd1 fp=0xc00075b0d8 sp=0xc00075b0d0 pc=0x10024e1
strings.HasPrefix(...)
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/strings/strings.go:449
cmd/link/internal/ld.deadcode(0xc000046000)
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/cmd/link/internal/ld/deadcode.go:111 +0x4f1 fp=0xc00075b300 sp=0xc00075b0d8 pc=0x11634e1
cmd/link/internal/ld.Main(0x1431fe0, 0x10, 0x20, 0x1, 0x7, 0x10, 0x1280fdc, 0x1b, 0x127daf6, 0x14, ...)
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/cmd/link/internal/ld/main.go:211 +0xb64 fp=0xc00075b458 sp=0xc00075b300 pc=0x11a6184
main.main()
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/cmd/link/main.go:65 +0x1bc fp=0xc00075bf88 sp=0xc00075b458 pc=0x120a5bc
runtime.main()
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/proc.go:203 +0x212 fp=0xc00075bfe0 sp=0xc00075bf88 pc=0x1030102
runtime.goexit()
	/private/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-10_14/go/src/runtime/asm_amd64.s:1375 +0x1 fp=0xc00075bfe8 sp=0xc00075bfe0 pc=0x105cc01
FAIL	cmd/go/internal/web [build failed]

closed time in 5 months

bcmills

issue comment golang/go

runtime: panic + recover can cancel a call to Goexit

Should the following program print "should be unreachable" or not? I tried tip; it prints it.

package main

import "fmt"
import "runtime"

func f() {
	defer func() {
		recover()
	}()
	panic("will cancel Goexit but should not")
	runtime.Goexit()
}

func main() {
	c := make(chan struct{})
	go func() {
		defer close(c)
		f()
		fmt.Println("should be unreachable")
	}()
	<-c
}

This is exactly performing to spec and is unrelated to the current bug. See https://golang.org/ref/spec#Handling_panics and https://go-review.googlesource.com/c/go/+/189377 .

The panic() call starts a panic sequence. By spec, the function that called panic will never continue running normally (hence you will never reach the Goexit). If a recover occurs in a defer directly called by the panicking sequence, then the panic will be recovered, and (after finishing any remaining defers in that frame), execution of the goroutine will continue in the caller of that frame (which in this case is f).

bcmills

comment created time in 5 months

issue comment googleapis/google-api-go-client

bundler: implement more advanced queueing

I did a slight optimization with https://code-review.googlesource.com/c/google-api-go-client/+/47790

@escholtz is investigating more extensive optimizations, including the second one mentioned above ("Coalesce") and also a rewrite that would create a more explicit queue of ready Bundles, so that a goroutine that handles a bundle is not actually created until it can run right away (i.e., it is handling the next bundle that will fit within HandlerLimit). Currently, in certain situations, a large number of goroutines can be created to send bundles, but they are not runnable yet because of the low HandlerLimit, and so are being woken up repeatedly by the Broadcasts in acquire() and release().
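A rough, hypothetical sketch of that direction (not the actual bundler code; the names here are invented, and it assumes the sync import): ready bundles wait in an explicit queue, and a handler goroutine is only started when a slot under HandlerLimit is free.

type bundle struct{ /* buffered items */ }

// readyQueue is a hypothetical structure: bundles wait here until a handler
// slot is available, instead of each bundle getting a goroutine that blocks
// on a semaphore.
type readyQueue struct {
	mu      sync.Mutex
	ready   []*bundle
	active  int // handler goroutines currently running
	limit   int // corresponds to Bundler.HandlerLimit
	handler func(*bundle)
}

func (q *readyQueue) enqueue(b *bundle) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.ready = append(q.ready, b)
	q.dispatchLocked()
}

// dispatchLocked starts handlers for queued bundles while slots are free.
func (q *readyQueue) dispatchLocked() {
	for q.active < q.limit && len(q.ready) > 0 {
		b := q.ready[0]
		q.ready = q.ready[1:]
		q.active++
		go func(b *bundle) {
			q.handler(b)
			q.mu.Lock()
			q.active--
			q.dispatchLocked() // hand the freed slot to the next ready bundle
			q.mu.Unlock()
		}(b)
	}
}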

jadekler

comment created time in 5 months

issue comment golang/go

runtime: TestLockedDeadlock2 is flaky

I wasn't able to reproduce on the darwin-amd64-10_11 gomote running this command 100 times:

go test runtime -test.count=500 -test.run TestLockedDeadlock2

I guess this could be related to #34575 (since it seems to be a hang), even though that only ever showed up on ppc64. We could wait until the fix for that is in, and see if this problem repros. The fact that we have only seen cases within the last day, though, seems to indicate it is a different issue.

bcmills

comment created time in 5 months

issue comment golang/go

runtime: "internal error: misuse of lockOSThread/unlockOSThread" in runOpenDeferFrame

I tried to reproduce by running go test -test.count=1000 cmd/go/internal/web many times on darwin-amd64-10_14, but didn't have any luck reproducing (all passed fine).

However, this might have to do with the fact that runtime.main only leaves the function via exit(0) (which is followed by an infinite loop), so there is no explicit function return/exit where inline defer code is generated. Hence, the implicit &needUnlock argument to the defer closure is not necessarily kept alive, so it may not get adjusted if there is a stack move. I should maybe just make all OpenDeferSlots live throughout the entire function.

bcmills

comment created time in 5 months

issue comment golang/go

runtime: "signal: segmentation fault (core dumped)" on several builders

I tried, but I haven't been able to reproduce at all on my workstation (using the commands above).

@mvdan Did you actually get a core that might have a stack trace? (I see '(core dumped)' above).

eliasnaur

comment created time in 5 months

issue comment golang/go

runtime: SIGSEGV during gcenable on darwin-arm64-corellium

It's not obvious to me what test or what build of a test is causing the failure (it must just be some test after sync/atomic). But I tried reproducing the failure by running the following command (which executes a sequence of tests around the point where things failed) repeatedly via gomote on darwin-arm64-corellium:

go test -test.count=1 sort strconv strings sync sync/atomic syscall testing testing/quick text/scanner text/tabwriter text/template text/template/parse time unicode unicode/utf16 unicode/utf8 cmd/addr2line cmd/api cmd/asm/internal/asm cmd/asm/internal/lex cmd/compile cmd/compile/internal/gc cmd/compile/internal/syntax

I couldn't reproduce -- never got any failures in 50 runs. So, I think we'll probably just have to wait & see if there are any other failures from builders.

bcmills

comment created time in 5 months

issue comment golang/go

runtime: defer is slow

@danscales , the results for https://golang.org/cl/202340 look great: it eliminates about 80% of defer's direct CPU cost in the application I described. A small change to crypto/tls would fix the remaining case.

For go1.12.12, go1.13.3, and be64a19, I counted profile samples that have crypto/tls.(*Conn).Write on the stack (with -focus=tls...Conn..Write) and profile samples that additionally have defer-related functions on the stack (saving the prior results and adding on -focus='^runtime\.(deferproc|deferreturn|jmpdefer)$'). With go1.12.12, about 2.5% of time in crypto/tls.(*Conn).Write is spent on defer. With go1.13.3, it's about 1.5%. With the development branch at be64a19, it's down to 0.5%.

Zooming out to the application's total CPU spend on defer-related functions, more than 90% of the samples are caused by a single use of defer "in" a loop in crypto/tls.(*Conn).Write. If that call were outside of the loop—or if the compiler could prove that it's effectively outside the loop already—then CL 202340 would all but eliminate the CPU cost of defer for the application (down to roughly 0.01%).

Thank you!

@rhysh Glad to hear the results that you have measured! And, as a bunch of folks have already pointed out, there are a bunch of further optimizations that we can do to tighten up the inline defer code, eliminate some of the checks, etc.

minux

comment created time in 5 months

issue closed golang/go

cmd/compile, runtime: a design for low-cost defers via inline code

We are working on an implementation of defer statements that makes most defers no more expensive than open-coding the deferred call, hence eliminating the incentive to avoid using this language feature to its fullest extent. We wanted to document the design in the proposal/design repo.
See issue #14939 for discussion of defer performance issues.

The design document is posted here. The current, mostly complete implementation is here

Comments are welcome on either the design or the implementation.

closed time in 5 months

danscales

issue comment golang/go

cmd/compile, runtime: a design for low-cost defers via inline code

Fixed by be64a19

danscales

comment created time in 5 months

issue comment golang/go

runtime: "morestack on gsignal" on linux-arm64-packet builder

@cherrymui Yes, that change fixed the SEGV. I just ran 100 times on linux-arm64-packet with that change, with no SEGV and no repro of the "morestack on gsignal" issue.

bcmills

comment created time in 5 months

issue comment golang/go

runtime: "morestack on gsignal" on linux-arm64-packet builder

Unfortunately, I am synced past 758eb02 (I'm at 59a6847039, which is at least Oct 26th). But on the other hand, #34391 looks like it is a hang, whereas the thing that I'm running into is a SEGV.

bcmills

comment created time in 5 months

issue comment golang/go

runtime: "morestack on gsignal" on linux-arm64-packet builder

I just tried reproducing by running go test -test.count=1 os repeatedly (currently about 100 times) on the linux-arm64-packet builder, and I wasn't able to reproduce it. I did, however, get the same SEGV twice:

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x6ed80]

runtime stack:
runtime: unexpected return pc for runtime.mcommoninit called from 0x40000259e0
stack: frame={sp:0x40002e9df0, fp:0x40002e9e30} stack=[0x40002e8000,0x40002ea000)
00000040002e9cf0:  00000040002e9d50  000000000003d914 <runtime.throw+84> 
00000040002e9d00:  0000004000130180  00000040002e9d48 
00000040002e9d10:  000000000003d914 <runtime.throw+84>  00000040002e9d28 
00000040002e9d20:  000000000003d8fc <runtime.throw+60>  000000000006a410 <runtime.fatalthrow.func1+0> 
00000040002e9d30:  0000004000130180  000000000003d914 <runtime.throw+84> 
00000040002e9d40:  00000040002e9d50  00000040002e9d78 
00000040002e9d50:  0000000000053df8 <runtime.sigpanic+1096>  00000040002e9d60 
00000040002e9d60:  000000000006a390 <runtime.throw.func1+0>  00000000001a7c01 
00000040002e9d70:  000000000000002a  00000040002e9dd8 
00000040002e9d80:  000000000006ed80 <runtime.nanotime1+96>  00000000001a7c01 
00000040002e9d90:  000000000000002a  0000000000000000 
00000040002e9da0:  0000000000000001  00000040002e9d00 
00000040002e9db0:  0000000000040cb8 <runtime.mcommoninit+152>  00000040002e9e08 
00000040002e9dc0:  0000000000040c5c <runtime.mcommoninit+60>  00000040002e9e18 
00000040002e9dd0:  0000000000044e10 <runtime.findrunnable+2144>  00000040002e9e08 
00000040002e9de0:  0000000000040cb8 <runtime.mcommoninit+152>  0000000000000000 
00000040002e9df0: <00000040000259e0  0000000000000000 
00000040002e9e00:  0000000000000001  00000040002e9e48 
00000040002e9e10:  0000000000042c88 <runtime.allocm+328>  00000000002dc2d8 
00000040002e9e20:  0000000000042c60 <runtime.allocm+288>  0000000000000380 
00000040002e9e30: >000000000019c0c0  0000000000000001 
00000040002e9e40:  0000004000308000  00000040002e9e98 
00000040002e9e50:  0000000000043530 <runtime.newm+48>  0000004000308000 
00000040002e9e60:  0000004000308000  0000000000000000 
00000040002e9e70:  0000000000000000  00000040002e9ec8 
00000040002e9e80:  0000004000308000  0000000000000000 
00000040002e9e90:  0000004000130180  00000040002e9ec8 
00000040002e9ea0:  0000000000043b1c <runtime.startm+284>  0000004000020a00 
00000040002e9eb0:  00000000001aaf20  0000000000000000 
00000040002e9ec0:  0000000000000000  00000040002e9ef8 
00000040002e9ed0:  0000000000045698 <runtime.resetspinning+184>  00000000001aaf20 
00000040002e9ee0:  0000004000020a00  0000004000020a00 
00000040002e9ef0:  0000000000000000  00000040002e9f18 
00000040002e9f00:  0000000000045b60 <runtime.schedule+704>  0000000000000000 
00000040002e9f10:  0000004000025401  00000040002e9f98 
00000040002e9f20:  00000000000422a0 <runtime.mstart1+128>  0000004000001680 
runtime.throw(0x1a7c01, 0x2a)
        /workdir/go/src/runtime/panic.go:1045 +0x54
runtime.sigpanic()
        /workdir/go/src/runtime/signal_unix.go:578 +0x448
runtime.nanotime1(0x0)
        /workdir/go/src/runtime/sys_linux_arm64.s:300 +0x60
runtime: unexpected return pc for runtime.mcommoninit called from 0x40000259e0
...

Unrelated, I assume? I also ran the same test on linux-amd64 60 times, no problems at all, as expected.

bcmills

comment created time in 5 months

issue comment golang/go

runtime: panic + recover can cancel a call to Goexit

@go101 I think we have mostly settled that the behavior of panic and recover will stay as it is right now. My current proposed changes to the Go language spec are here (don't know if you've seen the latest changeset):

https://go-review.googlesource.com/c/go/+/189377

I/we are not trying to specify every possibility for panics/recovers in the spec, but just want to specify better when recovers actually apply (only in a defer directly called by a panicking sequence) and then describe the typical case of recursive panics and also the behavior of when one panic can replace another panic. See also my comment on Oct 9th above on why the panic replacement behavior makes sense.

Goexit is a library routine, so it will not be specified in the language spec. However, as I mentioned above, we are planning to fix the current bug and never allow a panic/recover to cancel a Goexit. We have to discuss more about further interaction between panic and Goexit (as your example in your Sept 26 comment and examples from Bryan Mills illustrate). The fix for the current bug may make it into Go 1.14, but that is not definite. Since we still have to discuss things, I'm fairly sure that no further changes in Goexit/panic/recover interaction will make it into Go 1.14.

bcmills

comment created time in 5 months

issue comment golang/go

runtime: TestGoexitCrash failure on linux-ppc64le-buildlet

It turns out that the problem is present in the latest release of go1.13, but not in the very first release of 1.13 in August, so I did a 'git bisect'. The change that it came up with (not saying this is definitive at all) was:

runtime: redefine scavenge goal in terms of heap_inuse [mknyszek]

So, good guess that it might be GC/scavenger-related. Also, as I mentioned, it isn't reproducible at any commit if runtime.GC() is removed. I'm not sure why this would particularly show up only on ppc64.

Will update further when I get a chance to try out GOGC=20.

bcmills

comment created time in 5 months

issue comment golang/go

runtime: TestGoexitCrash failure on linux-ppc64le-buildlet

OK, it definitely also happens in go1.13. I haven't been able to reproduce in go1.12 yet, so it might be a slight regression in the go1.13 timeframe.

bcmills

comment created time in 5 months

issue comment golang/go

runtime: TestGoexitCrash failure on linux-ppc64le-buildlet

I am able to reproduce once every few times when I run this command on the ppc64-linux-buildlet (which each time itself runs the test 500 times):

gomote run user-danscales-linux-ppc64-buildlet-0 go/bin/go test runtime -test.count=500 -test.run TestGoexitCrash

When it fails, the test seems to hang (somehow the normal "deadlock" detection -- no runnable goroutines and no main thread, because Goexit was called -- doesn't happen), and then something forces a SIGQUIT after 60 or 120 seconds.

The same test command run locally on amd64 never fails.

The actual test program (that is supposed to deadlock when all threads end and main does the Goexit) is:

func GoexitExit() {
	println("t1")
	go func() {
		time.Sleep(time.Millisecond)
	}()
	i := 0
	println("t2")
	runtime.SetFinalizer(&i, func(p *int) {})
	println("t3")
	runtime.GC()
	println("t4")
	runtime.Goexit()
}

I can still reproduce the problem if I comment out the i := 0 line and the SetFinalizer line. However, I can't reproduce if I comment out the runtime.GC line.

I'll keep investigating and check if it is present in 1.13. However, it seems unlikely that this has to be fixed for 1.14, since it is so rare and only happens when all threads and main exit (which is most likely a programming mistake).

bcmills

comment created time in 5 months

issue comment golang/go

runtime: TestGoexitCrash failure on linux-ppc64le-buildlet

I will take a look, see if I can reproduce or make a guess as to what might be happening.

bcmills

comment created time in 5 months

issue comment golang/go

runtime: defer is slow

If we had more data on this, perhaps we could focus on those cases.

I'm working on a high-throughput HTTPS server which shows the cost of defer in crypto/tls and internal/poll fairly clearly in its CPU profiles. The following five defer sites account for several percent of the application's CPU usage, split fairly evenly between each.

  • crypto/tls.(*Conn).Write .. shows up in the source inside a for loop, but could be moved outside: https://github.com/golang/go/blob/go1.11.4/src/crypto/tls/conn.go#L1036
  • crypto/tls.(*Conn).Write again: https://github.com/golang/go/blob/go1.11.4/src/crypto/tls/conn.go#L1046
  • crypto/tls.(*Conn).Handshake: https://github.com/golang/go/blob/go1.11.4/src/crypto/tls/conn.go#L1259
  • crypto/tls.(*Conn).writeRecordLocked (for buffer management .. may be absent in go1.12beta2): https://github.com/golang/go/blob/go1.11.4/src/crypto/tls/conn.go#L868
  • internal/poll.(*FD).Write: https://github.com/golang/go/blob/go1.11.4/src/internal/poll/fd_unix.go#L258

These particular defers are in place unconditionally, so a minimal PC-range-based compiler change as @dr2chase and @aclements discussed in September 2016 might be enough.

@rhysh Go 1.13 already improved defer performance by allocating defer records on the stack. And we have just checked into the main tree (for release in Go 1.14) a bigger change, https://golang.org/cl/190098, which makes deferred calls directly (inline) at normal function exits. This should reduce the overhead even more significantly in many cases.

So, if you are still seeing defer overheads for your server, it would be great to see whether the overheads have gone down with Go 1.13 (if you haven't already upgraded) or with the changes in the main tree (or with the beta release of Go 1.14 in early November).
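For anyone who wants to quantify this on their own setup, here is a minimal benchmark sketch (package and function names are just illustrative, not taken from the server mentioned above) comparing a deferred unlock with a direct call:

package deferbench

import (
	"sync"
	"testing"
)

var mu sync.Mutex

//go:noinline
func withDefer() {
	mu.Lock()
	defer mu.Unlock()
}

//go:noinline
func withoutDefer() {
	mu.Lock()
	mu.Unlock()
}

func BenchmarkWithDefer(b *testing.B) {
	for i := 0; i < b.N; i++ {
		withDefer()
	}
}

func BenchmarkWithoutDefer(b *testing.B) {
	for i := 0; i < b.N; i++ {
		withoutDefer()
	}
}

Running go test -bench=Defer under different Go versions should make it easy to compare how the per-call gap changes across releases.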

minux

comment created time in 5 months

issue commentgolang/go

runtime: panic + recover can cancel a call to Goexit

Yes, but I think a Goexit becoming a no-op in the panic case may be quite surprising to programmers and hence lead to other weird failures in the code immediately following the Goexit.

bcmills

comment created time in 6 months

issue commentgolang/go

cmd/compile, runtime: a design for low-cost defers via inline code

We already handle conditional and non-conditional defers with this proposal. You have to set the bitmask in all cases, since you can have panics (especially run-time panics) at many points in the function, and so it is not easy to know otherwise (without a detailed pc map) whether you hit a particular unconditional defer statement before the panic happened.

For actual exit code, you can eliminate some bitmask checks if a particular defer statement dominates the exit that you are generating for. That optimization is not in this change, but seems reasonable to do soon - it is definitely on my TODO list.

It would seem to be fairly complex to combine the low-cost defers with the looped defers (which would have to be handled by the current defer record approach) in the same function. I think a bunch more information would have to be saved and new code generated to make sure that the looped defers and the inlined defers were executed in the right order. So, I'm not planning to do that (especially since those cases are very rare and not as likely to be performance-sensitive).
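To make the bitmask discussion above concrete, here is a rough sketch, in ordinary Go, of what the open-coding conceptually does for a conditional defer (the function names are hypothetical, and the real transformation happens on SSA, not on source):

package main

func cleanup() {}
func work()    {}

// Source form: the defer may or may not be active at exit.
func f(cond bool) {
	if cond {
		defer cleanup()
	}
	work()
}

// Roughly what the compiler generates for f under this design.
func fOpenCoded(cond bool) {
	var deferBits uint8 // one bit per open-coded defer site
	if cond {
		deferBits |= 1 // defer site 0 is now active
	}
	work()
	// Normal exit: run the active defers inline, in reverse order.
	if deferBits&1 != 0 {
		cleanup()
	}
	// On a panic, the runtime consults the same deferBits value (kept in a
	// stack slot described by funcdata) to decide which defers still need
	// to run, which is why the bits must be set in all cases.
}

func main() {
	f(true)
	fOpenCoded(true)
}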

danscales

comment created time in 6 months

issue commentgolang/go

runtime: panic + recover can cancel a call to Goexit

Feel free to file a separate bug that Goexit can cancel a panic. I'm not sure that fix is very high priority, since it seems very unusual to put a Goexit (or something that would do a Goexit) in a defer statement. Also, I'm not sure what a reasonable change in semantics would be. Should Goexit() become a no-op if you are in a panicking sequence? I think that would be surprising to most programmers. Or would you do something more dramatic, like aborting the entire defer call that contains the Goexit (and is being run by the panicking sequence)?

bcmills

comment created time in 6 months

issue commentgolang/go

runtime: panic + recover can cancel a call to Goexit

I figured out how to make sure that Goexits are not aborted by recovers that happen in defers below them... It didn't require any stack smashing or extra stubs -- just saving the pc/sp when making calls to deferred functions from the Goexit defer processing loop. See the mentioned change for the fix.

We discussed the abort behavior with respect to panic and Goexit, and are leaning toward only changing the Goexit behavior (which is the mentioned change). As summarized well by Robert at https://go-review.googlesource.com/c/go/+/189377/7/doc/go_spec.html#6169 [paraphrased here]:

From a programmer's point of view, if a defer contains a recover() call, I expect it to catch any pending panic, no matter where they are coming from. Otherwise it's impossible to write code that protects against (expected and unexpected) panics leaving a function/API entry point. So that would argue against "only recover the panic initiated in its own function".

And the abort (replace) behavior for panics coalesces panics precisely so that a recover in a function always stops all panics within that function or its callees.

bcmills

comment created time in 6 months

issue commentgolang/go

spec: clarify when calling recover stops a panic

Thanks for all the examples and discussion. I missed that there was a defer recover() even in the first example. @ianlancetaylor gives a very good explanation of the current behavior of defer recover(). Overall, the meaning of defer recover() isn't intuitively obvious, and its actual behavior with the current implementation is confusing. So, we are thinking about banning the use of defer recover() (since it seems likely to be only useful for test code). Instead of defer recover(), you should always at least use defer func() { recover() }() (and of course, it is always recommended to look at the return value of recover()).
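As a small illustration of the recommended form (the plain defer recover() variant is exactly the quirk under discussion, so it is deliberately not shown here):

package main

import "fmt"

// safeCall runs f and converts any panic into an error. recover is called
// directly inside the deferred function literal, and its return value is
// checked rather than ignored.
func safeCall(f func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	f()
	return nil
}

func main() {
	fmt.Println(safeCall(func() { panic("boom") })) // recovered from panic: boom
	fmt.Println(safeCall(func() {}))                // <nil>
}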

With respect to talking about (and spec'ing out) the behavior of panic/recover/recursive panics, it will be best not to use defer recover(), since that will just create more cases that tend to confuse the other issues.

go101

comment created time in 6 months

issue openedgo-delve/delve

disassemble cmd cause nil ptr exception if run before program is started (v 1.3)

Please answer the following before submitting your issue:

Note: Please include any substantial examples (debug session output, stacktraces, etc) as linked gists.

  1. What version of Delve are you using (dlv version)?

Version: 1.3.0

  2. What version of Go are you using (go version)?

Version 1.13

  3. What operating system and processor architecture are you using?

Linux-amd64

  4. What did you do?

Tried to disassemble a function immediately after starting delve using 'delve exec'. I wanted to see what the code of a function looked like before debugging. E.g. 'disassemble -l main.main'

  5. What did you expect to see?

Disassembly output.

  6. What did you see instead?

I got a runtime error because of a nil pointer:

Command failed: Internal debugger error: runtime error: invalid memory address or nil pointer dereference
runtime.gopanic (0x545491)
        go/gc/src/runtime/panic.go:679
runtime.panicmem (0x55a92b)
        go/gc/src/runtime/panic.go:199
runtime.sigpanic (0x55a768)
        go/gc/src/runtime/signal_unix.go:394
google3/third_party/golang/delve/service/debugger/debugger.(*Debugger).Disassemble (0x8da8be)
        golang/delve/service/debugger/debugger.go:1180
google3/third_party/golang/delve/service/rpc2/rpc2.(*RPCServer).Disassemble (0xad5053)
        golang/delve/service/rpc2/server.go:610
reflect.Value.call (0x5d65b5)
        go/gc/src/reflect/value.go:460
reflect.Value.Call (0x5d5d73)
        go/gc/src/reflect/value.go:321
google3/third_party/golang/delve/service/rpccommon/rpccommon.(*ServerImpl).serveJSONCodec.func2 (0xb150ca)
        golang/delve/service/rpccommon/server.go:324
google3/third_party/golang/delve/service/rpccommon/rpccommon.(*ServerImpl).serveJSONCodec (0xb132f9)
        /golang/delve/service/rpccommon/server.go:326

In the previous version of dlv, running 'disassemble -l' right after starting dlv worked fine. It looks like this change happened with commit 583d335. With that change, (*Debugger).Disassemble contains code that fetches and dereferences the current goroutine, which causes a nil pointer dereference when there is no current goroutine:

	g, err := proc.FindGoroutine(d.target, goroutineID)
	if err != nil {
		return nil, err
	}

	var regs proc.Registers
	var mem proc.MemoryReadWriter = d.target.CurrentThread()
	if g.Thread != nil {

g comes back nil (even though err is also nil), so g.Thread gives a nil pointer exception. I believe goroutineID is -1 (no current goroutine?)
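One possible shape for a guard (just a sketch against the snippet above, not necessarily the actual upstream fix) would be to keep the current-thread defaults whenever FindGoroutine returns a nil goroutine:

	g, err := proc.FindGoroutine(d.target, goroutineID)
	if err != nil {
		return nil, err
	}

	var regs proc.Registers
	var mem proc.MemoryReadWriter = d.target.CurrentThread()
	if g != nil && g.Thread != nil { // g can be nil when no goroutine is running yet
		// ... existing code that switches mem/regs to the goroutine's thread ...
	}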

created time in 6 months

issue commentgolang/go

cmd/compile: program that requires zerorange of >=32 bytes fails to compile on aix/ppc64

It seems that there are no tests at all that generate at least one ADUFFZERO. Is this the case, or are they somehow disabled on aix/ppc64?

Yes, it looks like there are no explicit tests that generate zerorange requests (which, for certain size ranges, usually cause an ADUFFZERO call on most architectures). In current tests, we only do a zerorange of 8 bytes (one pointer). If you want to make sure that your change works with my test program and check it in (after review), I can then add a bunch of small test cases for a variety of zerorange/ADUFFZERO sizes (generated similarly to my test program), probably in a file cmd/compile/internal/gc/zerorange_test.go. Alternatively, you can add the test yourself in your CL.
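As a rough sketch of what such a test file could look like (sizes, names, and the exact trigger are illustrative; the eventual test may differ), each function forces a different-sized block of pointer slots to be zeroed in the prologue by making all of its output params escape to the heap:

// zerorange_test.go (sketch)
package gc

import "testing"

var sink *int64 // taking addresses of the results forces them onto the heap

//go:noinline
func zeroRange32() (r, s, t, u int64) { // 4 pointer slots = 32 bytes to zero
	defer func() { sink = nil }() // the defer forces prologue zeroing of the slots
	sink = &r
	sink = &s
	sink = &t
	sink = &u
	return
}

//go:noinline
func zeroRange64() (a, b, c, d, e, f, g, h int64) { // 8 pointer slots = 64 bytes
	defer func() { sink = nil }()
	sink = &a
	sink = &b
	sink = &c
	sink = &d
	sink = &e
	sink = &f
	sink = &g
	sink = &h
	return
}

// TestZeroRange just exercises compilation and execution of the functions above.
func TestZeroRange(t *testing.T) {
	zeroRange32()
	zeroRange64()
}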

Thanks for the quick response to the bug!

danscales

comment created time in 6 months

issue openedgolang/go

cmd/compile: program that requires zerorange of >=32 bytes fails to compile on aix/ppc64

What version of Go are you using (go version)?

go version devel +616c39f6a6 Thu Sep 26 20:45:09 2019 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="ppc64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="aix"
GOPPC64="power8"

What did you do?

Cross-compiled this program for aix/ppc64 by setting GOOS and GOARCH:

package main

func main() {
	testDuffZero()
}

var glob = 3
var globp *int64

// Compile with GOARCH=ppc64 and GOOS=aix

//go:noinline
func testDuffZero() (r, s, t, u int64) {
	defer func() {
		glob = 4
	}()
	// Force the output params to be heap allocated. The pointer to each output
	// param must be zeroed in the prologue (see plive.go:epilogue()). So we
	// will get a block of stack slots that needs to be zeroed. For size >= 32
	// and <= 1024, this leads to a obj.ADUFFZERO call on ppc64 (see
	// ppc64/ggen.go:zerorange), which then leads to a bug because of AIX's
	// use of obj9.go. It looks like progedit() changes the ADUFFZERO to have
	// a To.Type of TYPE_BRANCH, which then causes diagnostic in
	// rewriteToUseTOC()
	//
	// Error is:
	// tmp.go:19:2: do not know how to handle 00121 (tmp.go:11)        DUFFZERO        runtime.duffzero(SB) without TYPE_MEM

	globp = &r
	globp = &s
	globp = &t
	globp = &u
	return
}

What did you expect to see?

For the .go file to compile. It compiles fine on the linux/amd64 architecture.

What did you see instead?

test.go:33:2: do not know how to handle 00121 (test.go:13)        DUFFZERO        runtime.duffzero(SB) without TYPE_MEM

I constructed this test case after I ran into a similar problem when testing my open-coded defers CL https://go-review.googlesource.com/c/go/+/190098 . The open-coded defer implementation may require zeroing stack slots that store defer args, so the use of Arch.ZeroRange with larger ranges is more likely.

As described in the comments above, the pointer to each output param must be zeroed in the prologue (see plive.go:epilogue()). So we will get a block of stack slots that needs to be zeroed, and the compiler generates a call to obj.ADUFFZERO of size 32 on aix/ppc64 (see ppc64/ggen.go:zerorange). AIX uses obj9.go, which leads to the bug. It looks like progedit() changes the ADUFFZERO to have a To.Type of TYPE_BRANCH (see the first case), which then causes a diagnostic in rewriteToUseTOC() (under the 'p.To.Name == obj.NAME_EXTERN' if branch).

Seems most likely a fix would be in progedit(), but it could also be in rewriteToUseTOC(), or we could change ppc64/ggen.go:zerorange to not use ADUFFZERO (use the other two zeroing methods exclusively).

created time in 6 months

issue commentgolang/go

runtime: sometimes recover calls fail to work

Yes, the Go language spec is very vague and imprecise on panic/recover and I am hoping to fix it:

https://go-review.googlesource.com/c/go/+/189377

but there's not quite consensus yet.

I don't think there is any disagreement on the behavior of your example (unlike the abort behavior mentioned in #29226 ). A recover only recovers a panic if it is called directly in a defer function that is directly invoked as part of the panicking process of that panic. A recover does not apply and returns nil if it is not called directly in a defer function or if it is called from a defer that was not directly invoked by the panic sequence of the panic (i.e. is nested inside some other defer or function called by the defer).
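A minimal sketch of that rule (illustrative only):

package main

import "fmt"

func nested() {
	// recover is not called directly by a deferred function invoked by the
	// panic (it is one call deeper), so it returns nil and does not apply.
	fmt.Println("nested recover:", recover())
}

func main() {
	defer func() {
		// Called directly in a deferred function of the panicking function,
		// so this recover applies and stops the panic.
		fmt.Println("direct recover:", recover())
	}()
	defer func() {
		nested()
	}()
	panic("boom")
}

The nested call prints <nil> and the panic keeps going; the direct call then recovers it and the program exits normally.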

Your second example seems to be a special little bug/quirk of doing 'defer recover()'. The detection mechanism in the implementation considers that recover is called directly in the containing function, so that recover does apply. Note that recover does not happen if you do

  defer func() {
    recover()
  }()

I think the first thing in both these bugs is to get agreement on the spec - both specifying the current behavior that people depend on and/or would generally agree on, and also the behavior (like the abort behavior in #29226 ) that has been there for a while, but has never been specified and people might like to change.

go101

comment created time in 6 months

issue commentgolang/go

cmd/compile, runtime: a design for low-cost defers via inline code

Yes, inlining happens much earlier in the compiler, so for my implementation, even though the defer calls will be directly expressed (open-coded) at exit, the calls cannot be replaced by their definition (i.e. inlined).

danscales

comment created time in 6 months

issue commentgolang/go

runtime: panic + recover can cancel a call to Goexit

Instead, you want to change the implementation so that, if an inner panic is recovered in a frame at the same level as or even further down in the stack than where the outer panic/Goexit happened, the defer processing of the frames continues under the auspices of the outer panic/Goexit.

OK, and I think the hard part about doing this is that we presumably want to clean up the stack of the inner panic and somehow make it look like we are back to the outer Goexit/panic processing. That seems really hard to do without a lot of stack gymnastics (possibly stack smashing) and/or extra stubs to allow us to get control back to the runtime after cleaning up the stack. If we don't require that the stack is cleaned up, then getting rid of the abort behavior seems more doable.

bcmills

comment created time in 6 months

issue commentgolang/go

runtime: panic + recover can cancel a call to Goexit

I agree that that's more likely, but there are realistic scenarios that could trigger this bug. For example, consider https://play.golang.org/p/vSL3LWEGpm2.

Yes, you are demonstrating the exact panic abort semantics that I mentioned. Just to describe it again, an outer panic/Goexit is aborted if the inner panic reaches & processes any defers in the same frame as, or in a higher (outer) frame than, the frame where the panic/Goexit happened. Once the outer panic/Goexit is aborted, it will no longer apply even if there is a recover() that recovers the inner panic.

I think you want to stop these unspecified abort semantics. Instead, you want to change the implementation so that, if an inner panic is recovered in a frame at the same level as or even further down in the stack than where the outer panic/Goexit happened, the defer processing of the frames continues under the auspices of the outer panic/Goexit.

Note that if the recover happens in a frame which is higher in the stack than the outer panic/Goexit, we still want the current behavior, which is that the defers in the current frame are finished, and then "normal" execution continues in the next outer frame, which is some function called by a defer caused by the outer panic. So, the outer panic will continue as expected, but it will get to complete the running of the defer function that caused the panic.

Since the behavior of recursive panics/Goexits (including the abort semantics) was never specified in the first place, I think we have leeway on whether/how we want to fix this.

I've tried to get some feedback on improving the description of panic/recover/recursive panics in the Go language spec, but don't think we have consensus yet about whether or what to improve:

https://go-review.googlesource.com/c/go/+/189377

Feel free to add comments....

bcmills

comment created time in 6 months

issue commentgolang/go

runtime: panic + recover can cancel a call to Goexit

This example is actually a bit of a special case, because the Goexit call and the recover are in the same stack frame/function:

func maybeGoexit() {
	defer func() {
		fmt.Println(recover())
	}()
	defer panic("cancelled Goexit!")
	runtime.Goexit()
}

In this case, we are running somewhat into the "abort" rule (which is currently only described in the implementation, not in any spec). If the processing of defers by a recursive panic gets to the frame where an outer panic or Goexit happened, then that outer panic/Goexit is aborted, because there is no mechanism to resume it with a recover (since a recover always finishes the defers of the current frame and then resumes execution with the next outer frame). The abort rule is exactly related to the comment by @rsc above. If we wanted to allow the Goexit to continue, I think we would have to add a bunch of mechanism (maybe some stack smashing) just to deal with this case. (The abort case is also mentioned in the Requirements section for the just-added open-coded defer proposal: https://github.com/golang/proposal/blob/master/design/34481-opencoded-defers.md#requirements )

But note that the more likely/useful case (where a panic and recover happen completely within a deferred function during a Goexit) works fine. The recover happens, and the Goexit() continues.

https://play.golang.org/p/kcQBNOAUeSa

So, for instance, it would work fine if a panic-recover happened during a Printf in a defer function (which can happen with nil pointers, etc. of the printed args).
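A sketch of that case, similar in spirit to the linked playground example (not copied from it): the panic is raised and recovered entirely inside the deferred function, so the Goexit continues and the goroutine still terminates cleanly.

package main

import (
	"fmt"
	"runtime"
)

func worker(done chan struct{}) {
	defer close(done) // still runs: the Goexit is not cancelled
	defer func() {
		defer func() {
			fmt.Println("recovered:", recover())
		}()
		panic("panic raised inside the deferred function")
	}()
	runtime.Goexit()
	fmt.Println("never reached")
}

func main() {
	done := make(chan struct{})
	go worker(done)
	<-done
	fmt.Println("worker exited via Goexit as expected")
}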

So, I'll think about this a bit more, but I would lean toward not fixing this (because it would require a bunch of added mechanism just for this one case).

bcmills

comment created time in 6 months

issue commentgolang/go

cmd/compile, runtime: a design for low-cost defers via inline code

Is there a way to avoid checking deferBits for defers that are always executed? I don’t mind that the respective bits get set, but there is no compiler optimization that’s able to remove checks for those bits (unless they are all always set and thus the variable is constant). It looks like it should be sufficient to check whether the defer is called in a block that dominates all exits (or something like that) and not emit the bitcheck in that case.

Yes, the deferBits conditionals (in the inline code) are eliminated completely in the simple cases where all defers are unconditional. If all defers are unconditional, then the stream of operations on deferBits are all unconditional and just involve constants, so the value is known everywhere through constant propagation. The 'opt' phase of the compiler does the constant propagation and then eliminates the if-check on deferBits which is always true.

As you point out (good point!), if one defer is always executed (dominates the exit) but others are not, we could potentially remove the deferBits check for that specific defer on any exits after that defer. I will put that on the TODO list (and add that as an idea for the next rev of the design doc). I may not do that optimization for the initial checkin.

danscales

comment created time in 6 months

issue commentgolang/go

cmd/compile, runtime: a design for low-cost defers via inline code

This proposal only allows up to 64 defers per function, is that correct?

Yes, there is some limit to the number of defers per function (the size of the deferBits bitmask). In the current implementation, it is 8, but it is easily changed to 64.

But to clarify, if open-coded (inlined) defers cannot be used in a function, we just revert to the current way of doing defers (create a "defer object" and add it to the defer chain, then process the appropriate objects on the defer chain at all function exits).

Our goal is to reduce the defer overhead in the common cases where it occurs, which is typically smallish functions with a few defers. In unusual cases (functions with defers in loops or lots of defer statements), we just revert to the current method.
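For example, a function like the following (hypothetical) one has a defer whose pending count is only known at run time, so under this design it would fall back to the existing defer-chain mechanism rather than being open-coded:

package example

import "os"

func processAll(files []string) error {
	for _, name := range files {
		f, err := os.Open(name)
		if err != nil {
			return err
		}
		// Deferred inside a loop: the number of pending defers depends on
		// how many iterations run, so this cannot be open-coded.
		defer f.Close()
	}
	return nil
}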

I will add some text to the design document to make this clearer.

danscales

comment created time in 6 months

issue commentgolang/go

cmd/compile, runtime: a design for low-cost defers via inline code

Yes, Go1.14 milestone is correct -- that's what were aiming for.

danscales

comment created time in 6 months

issue openedgolang/go

cmd/compile, runtime: a design for low-cost defers via inline code

We are working on an implementation of defer statements that makes most defers no more expensive than open-coding the deferred call, hence eliminating the incentives to avoid using this language to its fullest extent. We wanted to document the design in the proposal/design repo.
See issue #14939 for discussion of defer performance issues.

I will post the design document here shortly. The current, mostly complete implementation is here

Comments are welcome on either the design or the implementation.

created time in 6 months
