
aclements/latexrun 414

A 21st century LaTeX wrapper

aclements/libelfin 144

C++11 ELF/DWARF parser

aclements/go-misc 141

Miscellaneous Go hacks

aclements/go-perf 102

Go packages and tools for Linux perf

aclements/perflock 87

Locking wrapper for running benchmarks on shared hosts

aclements/commuter 74

Automated multicore scalability testing tool

aclements/mtrace 65

Memory access tracing QEMU

aclements/cpubars 49

Lightweight terminal-based multicore CPU usage monitor

aclements/biblib 38

Simple, faithful BibTeX parser and algorithms for Python 3

aclements/go-gcstats 37

Go runtime GC trace analysis and statistics tool

issue comment golang/go

runtime: can't atomic access of first word of tiny-allocated struct on 32-bit architecture

Since a type's size is rounded up to its alignment, I think this specific case with a 12 byte struct on 32-bit is really the only way that this can happen, and it's a direct consequence of the underalignment of uint64.

/cc @danscales for an interesting consequence of the current alignment rules
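
To make the failure mode concrete, here is a minimal sketch (the type and field names are invented, not from the issue). On a 32-bit platform uint64 has only 4-byte alignment, so the struct below is 12 bytes, and the tiny allocator guarantees only 4-byte alignment for 12-byte objects:

package main

import (
    "fmt"
    "sync/atomic"
    "unsafe"
)

// counter is 12 bytes on 32-bit: uint64 is only 4-byte aligned there,
// so no tail padding rounds the size up to 16.
type counter struct {
    n uint64 // accessed atomically; needs 8-byte alignment
    f uint32
}

func main() {
    // counter is small and pointer-free, so it can be tiny-allocated,
    // and a 12-byte tiny allocation may land on a 4-byte boundary.
    c := new(counter)
    fmt.Println(unsafe.Sizeof(*c), unsafe.Alignof(c.n))
    atomic.AddUint64(&c.n, 1) // can fault on 386/arm if c is misaligned
}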

NewbMiao

comment created time in 4 days

issue comment golang/go

doc: write Go 1.14 release notes

<!-- TODO: Maybe CL 200439? -->

I'm inclined to not mention this. I think explaining this is just too subtle and it shouldn't affect many users. Unless, @mknyszek, do you think you could come up with a straightforward user-facing explanation of the consequences of that change?

dmitshur

comment created time in 4 days

issue comment golang/go

all: get standard library building with -d=checkptr

I will try and report it back here.

Thanks!

Actually, WSASendTo (with the big 'T') is definitely an unsafe API, ...

You have to complain to Microsoft: https://docs.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsasendto. Even better, you should complain to the people who designed Berkeley sockets in the first place.

Yeah, it's not great. But WSASendto exposes a perfectly safe API by taking the Sockaddr interface. That's unsafe internally, but only the syscall package can implement its methods, so that unsafeness doesn't leak through the user API. Sendto on UNIX works the same way. I think the mistake was just exporting the unsafe raw WSASendTo API underlying it (and some related functions). The UNIX packages also have an unsafe sendto API, but it's not exported.
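
The pattern described here, as a minimal sketch (the package, type, and function names are illustrative; this is not the actual syscall code): the exported call takes an interface whose one method is unexported, so only the defining package can implement it and the unsafe conversion never crosses the API boundary.

package sock

import "unsafe"

// Sockaddr's only method is unexported, so no code outside this
// package can implement the interface.
type Sockaddr interface {
    sockaddr() (ptr unsafe.Pointer, len int32, err error)
}

// Inet4Addr is one in-package implementation.
type Inet4Addr struct {
    Port uint16
    Addr [4]byte
}

func (sa *Inet4Addr) sockaddr() (unsafe.Pointer, int32, error) {
    return unsafe.Pointer(sa), int32(unsafe.Sizeof(*sa)), nil
}

// SendTo is the safe, exported entry point.
func SendTo(fd uintptr, p []byte, to Sockaddr) error {
    ptr, n, err := to.sockaddr()
    if err != nil {
        return err
    }
    return rawSendTo(fd, p, ptr, n) // unexported; the unsafety stays inside
}

// rawSendTo stands in for the unexported raw system call wrapper.
func rawSendTo(fd uintptr, p []byte, ptr unsafe.Pointer, n int32) error {
    _, _, _, _ = fd, p, ptr, n
    return nil
}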

Can we just remove WSASendTo?

We can do anything you want. But we are using the function ourselves in net package. And other people might be using it too.

I may be totally missing something, but I don't see any calls to WSASendTo in net (or x/net). I do see WSASendto. In fact, I couldn't find any calls to WSASendTo in the GitHub corpus I have access to (though I'm really confused about how much coverage that has).

mdempsky

comment created time in 6 days

issue comment golang/go

runtime: go1.14rc1 fatal error: invalid runtime symbol table: runtime: unexpected return pc for runtime.sigreturn called from 0x7

Removing release-blocker, since we think this is probably fixed. If it turns out not to be, we can continue working on this and perhaps issue a fix in a point release.

tonyghita

comment created time in 9 days

issue comment golang/go

runtime: 10ms-26ms latency from GC in go1.14rc1, possibly due to 'GC (idle)' work

Thanks for trying to put something together to recreate this behavior, especially from the little scraps of information we have. :)

Based on the execution traces, I think you're probably right that the idle GC worker is the root of the problem here. The idle worker currently runs until a scheduler preemption, which happens only every 10 to 20ms. The trace shows a fifth goroutine that appears to enter the run queue a moment after the idle worker kicks in, which is presumably the user goroutine.

I agree with @randall77 that this is not likely to be related to the large map, since we do break up large objects when marking.

In general, long linked lists are not very GC-friendly, since the GC has no choice but to walk the list sequentially. That will inherently serialize that part of marking (assuming it's the only work left), though I don't understand the quoted comment about this happening "with a global lock". I don't think there's been anything like a global lock here since we switched to a concurrent GC in Go 1.5.
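
A rough illustration of the difference (types invented for the example): the GC cannot see node N+1 until it has scanned node N, while a slice of pointers exposes every element to the mark phase as soon as its backing array is scanned.

package main

// listNode forces sequential discovery: marking must follow the next
// chain one node at a time.
type listNode struct {
    next    *listNode
    payload [64]byte
}

// table's elements are all visible once the backing array is scanned,
// so multiple mark workers can process them in parallel.
type table struct {
    items []*listNode
}

func main() {
    var head *listNode
    for i := 0; i < 1_000_000; i++ {
        head = &listNode{next: head}
    }
    _ = head
    _ = table{}
}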

thepudds

comment created time in 12 days

issue comment golang/go

cmd/compile: go1.14rc1 build internal/ssa out of memory

What's the output of ulimit?

Sorry, ulimit -a.

n2vi

comment created time in 15 days

issue comment golang/go

cmd/compile: go1.14rc1 build internal/ssa out of memory

OpenBSD by default has a very low RLIMIT_DATA. What's the output of ulimit?

Even if RLIMIT_DATA is raised, OpenBSD limits anonymous mappings to a fairly small factor larger than physical memory (even if they're untouched), and Go 1.14 does create some larger sparse mappings than Go 1.13 did. For this limit, the relevant top column to watch would be VSZ, I believe.

n2vi

comment created time in 15 days

issue comment golang/go

all: 'go vet std cmd' no longer passes

@mvdan, I may be missing something, but why do you think this is related to https://go-review.googlesource.com/c/go/+/162237?

A lot of the lines it's flagging haven't changed in years... Why is it flagging them now?

josharian

comment created time in 17 days

issue comment golang/go

runtime: Interferes with Windows timer frequency

@ianlancetaylor, they're certainly related. I think this issue has become more about providing an API to give users control over the timer resolution, while #8687 is about fixing the runtime to not require high timer resolution. It may be that we need to solve both together, though, so perhaps the issues should be folded together.

stevenh

comment created time in 18 days

issue comment golang/go

runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3

Thanks, we now understand where you're getting those numbers from, which is very helpful.

What you're measuring simply isn't GC time. And it's unrelated to the 25% goal. You're measuring the wall-clock time that GC is active. While it's true that spending a higher or lower fraction of wall-clock time with GC active can affect the program, these effects are secondary, and generally not large unless that fraction gets a fair bit closer to 100%.

The GC's 25% goal is not about the fraction of wall-clock time spent with GC active. It's that, while the GC is active, it consumes 25% of CPU time. For example, the GC could be active 100% of wall-clock time, but using one of four CPUs while leaving the other three to the application's goroutines, and that would satisfy the 25% goal.

This is why we strongly encourage people to look at end-to-end performance measurements. Increased wall-clock time spent with GC active only matters if it helps explain a change in an application measurement that does matter, such as a program's overall throughput/running time or latency. We haven't been able to find any measurable impact on your application.
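
For reference, a minimal way to observe the CPU-time measure described above (rather than wall-clock time with GC active) is the GCCPUFraction field of runtime.MemStats:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var ms runtime.MemStats
    runtime.ReadMemStats(&ms)
    // GCCPUFraction is the fraction of this program's available CPU
    // time used by the GC since the program started; this is the kind
    // of number the 25% goal is about, not wall-clock GC-active time.
    fmt.Printf("GC CPU fraction: %.2f%%\n", ms.GCCPUFraction*100)
}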

ardan-bkennedy

comment created time in 19 days

issue comment golang/go

dl: corrupted output on SIGQUIT

go run (and the go command in general) traps SIGQUIT (see here and here) so it doesn't go to the runtime and cause a traceback in the go process itself.

You're right that the dl binaries could do the same. It would require just a little code that's different between UNIX/non-UNIX. Alternatively, as you point out, it could directly exec the go binary instead of using cmd.Run, which would take the dl binary out of the picture entirely. I'm pretty sure all platforms have syscall.Exec with the same signature, and it might even make the code simpler.
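
Both options in a minimal sketch (this is not the dl code, just the shape of each approach on a Unix-like system; on Windows the signal handling would differ):

package main

import (
    "os"
    "os/exec"
    "os/signal"
    "syscall"
)

func main() {
    // Option 1: ignore SIGQUIT so it never reaches the Go runtime and
    // cannot produce a traceback from this wrapper process.
    signal.Ignore(syscall.SIGQUIT)

    // Option 2: replace this process with the real go binary entirely,
    // taking the wrapper out of the picture.
    gobin, err := exec.LookPath("go")
    if err != nil {
        os.Exit(1)
    }
    args := append([]string{gobin}, os.Args[1:]...)
    if err := syscall.Exec(gobin, args, os.Environ()); err != nil {
        os.Exit(1)
    }
}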

zikaeroh

comment created time in 19 days


issue comment golang/go

runtime: cannot ReadMemStats during GC

Reopening because of revert.

aclements

comment created time in a month

issue comment golang/go

proposal: cmd/compile: make 64-bit fields be 64-bit aligned on 32-bit systems, add //go:packed directive on structs

@mknyszek pointed out to me that I completely misunderstood one of the concerns with the ABI: while Dan's proposing to maintain the current behavior for aligning the starts of arguments (which means argument frame and struct layout may differ, unlike right now), if you actually pass a struct by value to an assembly function, then it may be affected by an alignment change in that struct.

Does this actually happen in the wild? I'm not sure. It would require 32-bit assembly that takes a struct argument whose layout actually changes. vet would be able to detect it even without a new analysis, so we could try running it on a large corpus to get a sense.

@mknyszek proposed that we could create an ABI wrapper to relayout such structs in the arguments/results to assembly functions. This is kind of a pain, but is certainly doable. If we go down this path, maybe we also make ABIInternal Go functions continue to take a struct-style argument frame.

Still, I'd prefer that, like generics, the issue of changing the go ABI be well thought out

We have been thinking very carefully about how to change the ABI (it's not as flashy as generics :). We already have the system in place (modulo a few rough edges) to maintain strict calling convention compatibility with existing assembly code, even if we change the calling convention in Go code. This is what I was referring to with an "ABI wrapper" above.

danscales

comment created time in a month

issue comment golang/go

runtime: Kubernetes performance regression with 1.14

@mknyszek writes (to the person running these tests):

Austin mentioned trying https://golang.org/cl/215157, but instead of that CL I just reverted the CL that introduced the bug it was fixing. We think now that it would be better to try the tip of Go's master branch with https://go-review.googlesource.com/c/go/+/216198/. If that goes well, then we'd like you to also try just Go tip to see if Ian's change is the fix.

ianlancetaylor

comment created time in a month

issue comment golang/go

proposal: cmd/compile: make 64-bit fields be 64-bit aligned on 32-bit systems, add //go:packed directive on structs

I think that perhaps, it's already too late to make this change, as there are hundreds if not thousands of Go projects out there that use C/syscall/asm/DLL with the assumption that the struct layout in Go is as it is today, without automatic rearrangements.

I personally think this is a significant overestimate, but of course, I don't have any data to back up that claim either. If we implement a vet check, we could at least automatically scan a large corpus and get a better sense.

While letting the compiler automatically rearrange fields in a struct will save on some memory space, I don't see that as a benefit that is worth breaking millions of lines of working code over. ... Rather than making such changes, I would specify and document how one can manually lay out a struct efficiently ...

Note that we're not currently proposing field rearrangement. This proposal isn't about saving memory at all; it's about addressing a persistent source of run-time crashes in configurations that are often poorly tested.

Furthermore, a "go vet" check alone is not enough to fix this, we would also need a "go fix" for this so this can be fixed automatically.

Yes, we've actually been talking about this. It probably wouldn't be a "go fix" just because "go fix" is built to make as few assumptions about the code as possible, but it could be a new flag to "go vet", since "go vet" can do the sorts of type and flow analysis that you want for this. We've been thinking about doing this for some other vet checks as well that have clear automatic fixes.

danscales

comment created time in a month

issue comment golang/go

proposal: cmd/compile: make 64-bit fields be 64-bit aligned on 32-bit systems, add //go:packed directive on structs

A few points that have been made in in-person discussions about this proposal:

  • Some architectures will trap on unaligned access. We can either say that's up to the user to avoid, or have the compiler emit the necessary code to avoid the unaligned access. GCC, for example, will load unaligned fields one byte at a time if it has to.

  • @dr2chase pointed out one reason you may need both a "packed" annotation and an "alignment" annotation on the same struct. If "packed" has the effect of making all fields alignment 1, then the whole struct has alignment 1 and you have no control over what happens if that struct is embedded in another type. In particular, you may be laying out a type carefully so that some fields are aligned, but that alignment may be broken when the type is put in a larger type. However, this can be handled if you can specify both that the struct is packed and explicitly specify its overall alignment (see the sketch after this list).

  • @mknyszek pointed out that the garbage collector requires that pointers be word-aligned. It would be unfortunate to say "you can't have pointers in a packed struct", so I think pointers need to be word-aligned even in packed structs (which also affects overall struct alignment). We can either make this happen silently, or we can say it's an error if a pointer in a packed struct doesn't already fall on a word boundary.
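
A hypothetical sketch of how those annotations might interact (neither //go:packed nor //go:align exists; the directive syntax is invented purely for illustration):

package layout

// record is laid out by hand so that value sits at offset 8. "packed"
// alone would give the whole struct alignment 1, so embedding it in
// another type could break that; an explicit overall alignment of 8
// preserves it wherever record is embedded.
//
//go:packed
//go:align 8
type record struct {
    flags uint32
    count uint32
    value uint64 // at offset 8; 8-byte aligned only if record is
}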

danscales

comment created time in a month

issue comment golang/go

proposal: cmd/compile: make 64-bit fields be 64-bit aligned on 32-bit systems, add //go:packed directive on structs

During the discussion of changing the Go ABI, several people expressed the opinion that if we're going to break a bunch of assembly code, we'd rather do that only once. I share that opinion.

Dan's change doesn't affect argument layouts, so no FP offsets would change in assembly code. The prototype implementation actually separates the concerns of struct layout from argument frame layout in order to keep this working. Since we have the ABI0/ABIInternal split, we could in principle use struct-style layout for Go calls while still keeping the existing alignment rules for assembly calls (though I don't know if that would be worthwhile).

The one place that it does potentially affect assembly code is if the assembly hard-codes offsets into Go structures. Our hope is that any assembly that is using struct offsets is just including go_asm.h and using the symbolic values already created by the compiler. (You don't even have to use "go tool compile -asmhdr go_asm.h" like Dan mentioned above; the go tool does this automatically.)

danscales

comment created time in a month

issue comment golang/go

runtime: add a debug flag that force runtime debug logs are sent to stdout rather than stderr

I agree with Ian and would be fine with a standard prefix. It doesn't seem like making it configurable is worthwhile.

In general, stdout is for the primary output of a program: in-band data that you could, say, pipe to another program. stderr is for out-of-band data, such as logs. The presence of data on stderr shouldn't indicate failure; a non-zero exit status indicates failure. (Though I realize that reality is complicated.) However, there are also good reasons for trying to make some sense of what's on stderr, and a standard prefix for runtime messages would go a long way to helping with that.

We already try to prefix runtime crash messages with "runtime: ". Maybe we should just do that for the gc and scvg logs?

tibetsam

comment created time in a month

issue comment golang/go

runtime: make.bat hangs

Thanks for the detailed report.

Most of the threads look uninteresting, except, I think, these three:

.  4  Id: 263c.196c Suspend: 1 Teb: 000007ff`fffd4000 Unfrozen
      Start: go_bootstrap+0x64d00 (00000000`00464d00) 
      Priority: 0  Priority class: 32  Affinity: f
Child-SP          RetAddr           Call Site
00000000`28e5eb68 00000000`76f48f58 ntdll!ZwWaitForSingleObject+0xa
00000000`28e5eb70 00000000`76f48e54 ntdll!RtlDeNormalizeProcessParams+0x5a8
00000000`28e5ec20 000007fe`fa3b7f0b ntdll!RtlDeNormalizeProcessParams+0x4a4
00000000`28e5ec50 000007fe`fa3b8504 TmUmEvt64+0x17f0b
00000000`28e5eeb0 000007fe`fa3b8c96 TmUmEvt64+0x18504
00000000`28e5ef10 000007fe`fa4565ca TmUmEvt64+0x18c96
00000000`28e5efa0 000007fe`fa455f8e TmUmEvt64+0xb65ca
00000000`28e5f000 000007fe`fa410686 TmUmEvt64+0xb5f8e
00000000`28e5f150 000007fe`fa439730 TmUmEvt64+0x70686
00000000`28e5f260 00000000`7472f146 TmUmEvt64+0x99730
00000000`28e5f4a0 00000000`747e2d7d tmmon64+0x2f146
00000000`28e5f580 00000000`747e29f4 tmmon64+0xe2d7d
00000000`28e5f640 00000000`74733748 tmmon64+0xe29f4
00000000`28e5f6b0 000007fe`fcc77c3f tmmon64+0x33748
00000000`28e5f780 00000000`0046494e KERNELBASE!ResumeThread+0xf
00000000`28e5f7b0 ffffffff`ffffffff go_bootstrap+0x6494e
00000000`28e5f7b8 00000000`00000001 0xffffffff`ffffffff
00000000`28e5f7c0 ffffffff`ffffffff 0x1
00000000`28e5f7c8 00000000`28e5f928 0xffffffff`ffffffff
00000000`28e5f7d0 00000000`00000000 0x28e5f928

. 10  Id: 263c.2b14 Suspend: 1 Teb: 000007ff`fffa4000 Unfrozen
      Start: go_bootstrap+0x64d00 (00000000`00464d00) 
      Priority: 0  Priority class: 32  Affinity: f
Child-SP          RetAddr           Call Site
00000000`29eceb68 00000000`76f48f58 ntdll!ZwWaitForSingleObject+0xa
00000000`29eceb70 00000000`76f48e54 ntdll!RtlDeNormalizeProcessParams+0x5a8
00000000`29ecec20 000007fe`fa3b7f0b ntdll!RtlDeNormalizeProcessParams+0x4a4
00000000`29ecec50 000007fe`fa3b8504 TmUmEvt64+0x17f0b
00000000`29eceeb0 000007fe`fa3b8c96 TmUmEvt64+0x18504
00000000`29ecef10 000007fe`fa4565ca TmUmEvt64+0x18c96
00000000`29ecefa0 000007fe`fa455f8e TmUmEvt64+0xb65ca
00000000`29ecf000 000007fe`fa410686 TmUmEvt64+0xb5f8e
00000000`29ecf150 000007fe`fa439730 TmUmEvt64+0x70686
00000000`29ecf260 00000000`7472f146 TmUmEvt64+0x99730
00000000`29ecf4a0 00000000`747e2d7d tmmon64+0x2f146
00000000`29ecf580 00000000`747e29f4 tmmon64+0xe2d7d
00000000`29ecf640 00000000`74733748 tmmon64+0xe29f4
00000000`29ecf6b0 000007fe`fcc77c3f tmmon64+0x33748
00000000`29ecf780 00000000`0046494e KERNELBASE!ResumeThread+0xf
00000000`29ecf7b0 ffffffff`ffffffff go_bootstrap+0x6494e
00000000`29ecf7b8 00000000`00000000 0xffffffff`ffffffff

. 13  Id: 263c.13a0 Suspend: 2 Teb: 000007ff`fff9e000 Unfrozen
      Start: go_bootstrap+0x64d00 (00000000`00464d00) 
      Priority: 0  Priority class: 32  Affinity: f
Child-SP          RetAddr           Call Site
00000000`2a50ec50 000007fe`fa3b8504 TmUmEvt64+0x17fa0
00000000`2a50eeb0 000007fe`fa3b8c96 TmUmEvt64+0x18504
00000000`2a50ef10 000007fe`fa4565ca TmUmEvt64+0x18c96
00000000`2a50efa0 000007fe`fa455f8e TmUmEvt64+0xb65ca
00000000`2a50f000 000007fe`fa410686 TmUmEvt64+0xb5f8e
00000000`2a50f150 000007fe`fa439730 TmUmEvt64+0x70686
00000000`2a50f260 00000000`7472f146 TmUmEvt64+0x99730
00000000`2a50f4a0 00000000`747e2d7d tmmon64+0x2f146
00000000`2a50f580 00000000`747e29f4 tmmon64+0xe2d7d
00000000`2a50f640 00000000`74733748 tmmon64+0xe29f4
00000000`2a50f6b0 000007fe`fcc77c3f tmmon64+0x33748
00000000`2a50f780 00000000`0046494e KERNELBASE!ResumeThread+0xf
00000000`2a50f7b0 ffffffff`ffffffff go_bootstrap+0x6494e
00000000`2a50f7b8 00000000`00000000 0xffffffff`ffffffff

These are all stuck in ResumeThread, which I didn't know threads could get stuck in.

Thread 13 is also interesting because the "suspend count" is 2, suggesting that some other thread has suspended it and is failing to resume it. This may also be why threads 4 and 10 are stopped in obviously blocking operations, while thread 13 is stopped at a seemingly random point.

Do you know what "tmmon64" and "TmUmEvt64" are?

Maybe we just need to hold the suspendLock for longer (though I don't have a theory for why this would be). What happens if you move the unlock(&suspendLock) in preemptM to the very bottom of the function?

alexbrainman

comment created time in a month

issue comment golang/go

runtime: Go 1.14/Windows asynchronous preemption mechanism likely incompatible with debugging

Can you explain a bit more about how software breakpoints work under Windows (or point me to the relevant APIs)? I thought this would all be fine because SuspendThread acts like a semaphore, but I keep finding more and more little surprises with SuspendThread. :P

One possible, slightly awful workaround may be for the debugger to poke a 1 into runtime.debug.asyncpreemptoff when it attaches. (Or, if it's starting the process itself, add asyncpreemptoff=1 to GODEBUG.) Obviously not ideal, because it may change program behavior.

aarzilli

comment created time in a month

issue comment golang/go

all: get standard library building with -d=checkptr

Actually, WSASendTo (with the big 'T') is definitely an unsafe API, since it takes the byte length of the sockaddr as an argument and hence could be used to make the kernel read arbitrary memory from the Go process. We try not to expose unsafe APIs from syscall, and go to some significant lengths to make APIs safe in the UNIX syscall packages (though this is certainly not the only unsafe API in the Windows syscall package).

Can we just remove WSASendTo? syscall is not covered by Go 1 compatibility, and this would fall under the security exception anyway.

mdempsky

comment created time in a month

issue comment golang/go

all: get standard library building with -d=checkptr

@mdempsky one last crash remaining after I apply https://go-review.googlesource.com/c/go/+/208617

WSASendto calls WSASendTo (which seems a little nuts to me, but okay...), which winds up just casting the *RawSockaddrAny back to a uintptr to make the syscall. What if we were to directly invoke the syscall from WSASendto with the unsafe.Pointer that came from to.sockaddr(), instead of converting it through an intermediate *RawSockaddrAny?

mdempsky

comment created time in a month

issue comment golang/go

cmd/compile: enable -d=checkptr as part of -race and/or -msan?

I'm not inclined to make it configurable, since that seems like an unnecessary knob and it fails my knob litmus test of "I can give clear guidance on how to set this knob."

A throw seems pretty reasonable to me. If panics are being silently eaten by a system, that seems like a bug in the system to me, but I'm also fine with not making this a panic.

mdempsky

comment created time in a month

issue comment golang/go

all: get standard library building with -d=checkptr

Bumping to Go 1.15 to keep working on the fixes for Windows. When those are resolved, we should also enable -d=checkptr by default when -race or -msan are enabled (currently this is the case everywhere but Windows).

mdempsky

comment created time in a month

issue closed golang/go

cmd/compile: enable -d=checkptr as part of -race and/or -msan?

As discussed on golang-dev, the new -d=checkptr instrumentation is compatible with -race and -msan and likely cheaper than either of them (about the cost of two runtime.findObject calls per conversion involving unsafe.Pointer), so maybe it should just be enabled as part of those flags.

It would be easy to tweak cmd/compile to enable -d=checkptr by default when -race or -msan are specified, and then to still allow -race -d=checkptr=0 to turn it back off (i.e., race instrumentation without pointer checking). I can do that now so we get some extra usage testing of -d=checkptr (thanks to existing builders that use -race, etc), and then closer to release we can re-evaluate the best user experience?

/cc @aclements @bradfitz @rsc
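
For context, a minimal sketch of the kind of conversion -d=checkptr instruments (the out-of-bounds offset is contrived): the result of unsafe.Pointer arithmetic must still point into the original allocation.

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    b := make([]byte, 8)
    p := unsafe.Pointer(&b[0])

    // 9 bytes past the start of an 8-byte allocation no longer points
    // into b. Built with -gcflags=all=-d=checkptr (or with -race where
    // checkptr is enabled), the runtime throws at this conversion.
    bad := unsafe.Pointer(uintptr(p) + 9)
    fmt.Println(bad)
}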

closed time in a month

mdempsky

issue comment golang/go

cmd/compile: enable -d=checkptr as part of -race and/or -msan?

Since this seems to be working pretty smoothly, we've decided to go ahead and keep -race/-msan setting -d=checkptr by default for the release, except on Windows where it's still disabled because of some remaining issues in std. I've just updated the release notes to mention this.

Since this issue was about making this decision, I'm going to go ahead and close it. We'll continue to track remaining std fixes in #34972.

mdempsky

comment created time in a month

issue comment golang/go

runtime: thread sanitizer failing on ppc64le

Thanks for figuring that out, @laboger!

It sounds to me like there's nothing Go can do to fix or work around this (is that true?). If so, maybe we can add something to the documentation and resolve this bug?

randall77

comment created time in a month

issue comment golang/go

runtime: TestStackWrapperStackPanic failure on windows-amd64-2016

Thanks!

I assume by option 2 you mean overwriting the asyncPreempt frame. I certainly like that option better, as it's much more localized to the cause of the issue. Changing the traceback logic, on the other hand, is several dominoes away from the actual cause.

bcmills

comment created time in a month

issue comment golang/go

runtime: TestStackWrapperStackPanic failure on windows-amd64-2016

If we tweak the test to print the PC in the stack trace, we see that the runtime.asyncPreempt frame is at the very first instruction of asyncPreempt. This suggests that we injected that call and then immediately injected the sigpanic. Maybe it's possible in Windows for SuspendThread to stop a thread between when it causes an exception and when the VEH (vectored exception handler) runs? If that's the case, preemptM would see the thread at the faulting instruction, tweak the frame to inject the asyncPreempt call, then resume the thread, which would run the VEH and tweak the frame again to inject the sigpanic call.

bcmills

comment created time in a month

issue comment golang/go

runtime: TestStackWrapperStackPanic failure on windows-amd64-2016

I can easily reproduce this at current master (817afe83578d869b36e8697344bb2d557c86b264) and at f6774bc with the following patch and script:

--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -4469,6 +4469,7 @@ func sysmon() {
 		if delay > 10*1000 { // up to 10ms
 			delay = 10 * 1000
 		}
+		delay = 10
 		usleep(delay)
 		now := nanotime()
 		next := timeSleepUntil()
@@ -4563,7 +4564,8 @@ type sysmontick struct {
 
 // forcePreemptNS is the time slice given to a G before it is
 // preempted.
-const forcePreemptNS = 10 * 1000 * 1000 // 10ms
+//const forcePreemptNS = 10 * 1000 * 1000 // 10ms
+const forcePreemptNS = 10 * 1000 // 10us
 
 func retake(now int64) uint32 {
 	n := 0
cd $GOROOT/src/runtime
VM=$(gomote create windows-amd64-2016)
GOOS=windows go test -c && \
gomote put $VM runtime.test.exe && \
gomote run $VM ./runtime.test.exe -test.run TestStackWrapperStackPanic/sigpanic/CallersFrames -test.count 1000000

Running with ./runtime.test -test.run TestStackWrapperStackPanic/sigpanic/CallersFrames -test.v, we can see the traceback when the test passes:

runtime.Callers
runtime_test.testStackWrapperPanic.func1.1
runtime.gopanic
runtime.panicmem
runtime.sigpanic
runtime_test.I.M
runtime_test.TestStackWrapperStackPanic.func1.1
runtime_test.testStackWrapperPanic.func1
testing.tRunner
runtime.goexit

(normally it stops when it reaches runtime_test.I.M, but I tweaked it.)

What I don't fully understand is how we go from asyncPreempt to sigpanic in the bad traces. This is a nil panic, so the exception is synchronous. We should be injecting the sigpanic call at the exact PC/SP where the exception happens, so I don't see how asyncPreempt slips in there.

bcmills

comment created time in a month

issue comment golang/go

runtime: Windows binaries built with -race occasionally deadlock

For the record, this is what @mknyszek was using as a repro (based on dist's test):

stress -p 1 -timeout 10s -- go test -run 'TestParse|TestEcho|TestStdinCloseRace|TestClosedPipeRace|TestTypeRace|TestFdRace|TestFdReadRace|TestFileCloseRace' -race -count 1 -timeout 0 flag net os os/exec encoding/gob

We whittled this down to the following, which fails after a few minutes:

go test -c -race flag
stress -p 4 -timeout 10s -- 'flag.test.exe' '-test.run' TestParse '-test.timeout' 0

bcmills

comment created time in a month

issue comment golang/go

runtime: Windows binaries built with -race occasionally deadlock

@alexbrainman, thanks for sharing the details you had time to collect. From the Delve output it looks like the process has a lot of threads, which would mean this is a different issue. Could you confirm that through some other process explorer and paste your report into a new issue if there's more than one thread in the hung process?

bcmills

comment created time in a month

issue comment golang/go

runtime: Windows binaries built with -race occasionally deadlock

Unfortunately this raises a bigger issue: what if C code, called from Go, calls ExitProcess on Windows?

I took a stab at fixing this using @cherrymui's suggestion (varied slightly after chatting in person so there's no actual blocking behavior, just a one-shot CAS). Unfortunately, in Windows, any "system call" can potentially call ExitProcess, so we need to defend against this in reentersyscall, not just cgo calls. That's an extra atomic CAS on every system call.
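
The shape of the guard being discussed, as a minimal sketch (names invented): a one-shot CAS decides whether the Go exit path or the foreign ExitProcess path wins, at the cost of one atomic operation per syscall entry.

package main

import (
    "fmt"
    "sync/atomic"
)

var exiting uint32

// tryBecomeExiter returns true for exactly one caller; every other
// claimant must stand aside and let the winner run the exit path.
func tryBecomeExiter() bool {
    return atomic.CompareAndSwapUint32(&exiting, 0, 1)
}

func main() {
    fmt.Println(tryBecomeExiter()) // true: first claimant wins
    fmt.Println(tryBecomeExiter()) // false: the race is already decided
}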

bcmills

comment created time in 2 months

issue comment golang/go

cmd/link: ErrorUnresolved provides misleading information

See #36389 for some discussion of the issue that resulted in a revert. The fix itself is almost certainly fine; it's just a problem with the test that was added.

4a6f656c

comment created time in 2 months


issue comment golang/go

cmd/link: ErrorUnresolved provides misleading information

Re-opened by revert CL 213417

4a6f656c

comment created time in 2 months

issue comment golang/go

cmd/link: building the latest tip version fails

Looks like mostly an overly picky test that was just added. It's looking for "main(.text): relocation target undefined not defined\n", but it's getting:

darwin: main(__TEXT/__text): relocation target undefined not defined
linux/mips*: x.a(x3.o): unknown relocation type 11; compiled without -fpic?
aix: main.x: relocation target foo not defined

The linux/mips one seems a bit worse. My guess is there's some "go build" logic that the test isn't duplicating correctly.

@4a6f656c

YoshikiShibata

comment created time in 2 months

issue comment golang/go

runtime: pclntab is too big

That's what I was hinting at with a "nybble-based varint encoding" above.

Thanks for sending your notes on that.

I was thinking of more compact but harder-to-decode formats for pcline in particular, such as Huffman coding (probably with a fixed table we generate from some corpus; maybe a per-package table generated by the compiler) or possibly Golomb-Rice coding (since pcline in particular is probably roughly geometrically distributed, though this wouldn't extend well to other tables).

Looks pretty non-superlinear to me. So now we can get back to the task of shrinking it. :)

Awesome. Thanks for running that experiment!

robpike

comment created time in 2 months

issue comment golang/go

runtime: pclntab grows superlinearly

The register maps are the most of them.

Great. I'm planning on getting rid of those. :)

Of all the pc tables, the pcline is actually the biggest. I wonder if that's because of instruction interleaving and inlining. I bet that's also wasting a lot of bits writing down very small deltas. Since we only use that table for symbolization, it doesn't have to be extremely efficient to decode, so we could imagine using a more compact encoding of it.
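
A conceptual sketch of the table shape under discussion (not the actual pclntab encoder): a pc-value table is a sequence of (value delta, pc delta) pairs, varint-encoded, so long runs of tiny line deltas from inlining and instruction interleaving still cost at least a byte each.

package main

import (
    "encoding/binary"
    "fmt"
)

// encode writes (pc, line) pairs as a signed varint line delta followed
// by an unsigned varint pc delta, conceptually like a pc-value table.
func encode(pcs []uint64, lines []int64) []byte {
    var out []byte
    buf := make([]byte, binary.MaxVarintLen64)
    var prevPC uint64
    var prevLine int64
    for i := range pcs {
        n := binary.PutVarint(buf, lines[i]-prevLine)
        out = append(out, buf[:n]...)
        n = binary.PutUvarint(buf, pcs[i]-prevPC)
        out = append(out, buf[:n]...)
        prevPC, prevLine = pcs[i], lines[i]
    }
    return out
}

func main() {
    // Even +1/-1 line deltas cost a full byte each under plain varints,
    // which is the waste a Huffman or Golomb-Rice code could cut.
    enc := encode([]uint64{0, 4, 8, 12}, []int64{10, 11, 10, 12})
    fmt.Println(len(enc), "bytes")
}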

robpike

comment created time in 2 months

issue opened golang/go

runtime: sweep increased allocation count crash on arm64 [1.12 backport]

Issue #36101 should be considered for backport to the next 1.12 minor release.

created time in 2 months

issue comment golang/go

runtime: sweep increased allocation count crash on arm64

@gopherbot, please open a backport to Go 1.12

jing-rui

comment created time in 2 months

issue closed golang/go

runtime: tight loops should be preemptible

Currently goroutines are only preemptible at function call points. Hence, it's possible to write a tight loop (e.g., a numerical kernel or a spin on an atomic) with no calls or allocation that arbitrarily delays preemption. This can result in arbitrarily long pause times as the GC waits for all goroutines to stop.

In unusual situations, this can even lead to deadlock when trying to stop the world. For example, the runtime's TestGoroutineParallelism tries to prevent GC during the test to avoid deadlock. It runs several goroutines in tight loops that communicate through a shared atomic variable. If the coordinator that starts these is paused part way through, it will deadlock.

One possible fix is to insert preemption points on control flow loops that otherwise have no preemption points. Depending on the cost of this check, it may also require unrolling such loops to amortize the overhead of the check.

This has been a longstanding issue, so I don't think it's necessary to fix it for 1.5. It can cause the 1.5 GC to miss its 10ms STW goal, but code like numerical kernels is probably not latency-sensitive. And as far as I can tell, causing a deadlock like the runtime test can do requires looking for trouble.

@RLH
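
A minimal example of the kind of loop at issue (illustrative): the body contains no calls and no allocation, so before Go 1.14 it had no preemption points.

package main

import "sync/atomic"

var done uint32

// spin is a tight loop on an atomic: with only cooperative preemption,
// a stop-the-world could wait on it indefinitely.
func spin() {
    for atomic.LoadUint32(&done) == 0 {
    }
}

func main() {
    go spin()
    // Anything here that needed to stop the world could hang pre-1.14
    // (e.g. with GOMAXPROCS=1) while spin never yields.
    atomic.StoreUint32(&done, 1)
}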

closed time in 2 months

aclements

issue comment golang/go

runtime: tight loops should be preemptible

Actually, rather than bump this to 1.15, since tight loops are preemptible now, I've created a new issue to keep track of loose ends (#36365) and will go ahead and close this issue as fixed.

aclements

comment created time in 2 months

issue opened golang/go

runtime: clean up async preemption loose ends

Go 1.14 introduces asynchronous preemption so that tight loops can be preempted (#10958). However, there are still several loose ends to tie up. This is a general bug to keep track of remaining work:

  • [ ] Support on all possible platforms (js/wasm is not currently possible)
    • [ ] windows/arm (CL 207961)
    • [ ] darwin/arm
    • [ ] plan9/*
  • [ ] Redo unsafe-point representation so we can eliminate register maps
  • [ ] Remove register maps and extra stack maps (after redesigning debug call injection)
  • [ ] Possibly incorporate @cherrymui's sequence restarting (CL 208126)
  • [ ] Make large pointer-free memmoves/memclrs preemptible (#31222)
  • [ ] Make large pointer-full memmoves/memclrs preemptible (this is a little harder)
  • [ ] Fix various annoying spins in the scheduler
  • [ ] Make more of the runtime preemptible

created time in 2 months

issue comment golang/go

runtime: tight loops should be preemptible

Go 1.14 will handle loop preemption far better than previous versions, but there are still some loose ends to tie up in 1.15, so I'm bumping this issue to 1.15.

aclements

comment created time in 2 months

issue comment golang/go

runtime: sweep increased allocation count crash on arm64

@gopherbot, please open a backport to Go 1.13

jing-rui

comment created time in 2 months

issue comment golang/go

runtime: pclntab grows superlinearly

The title of this bug (as well as the blog post) claims that pclntab grows super-linearly in the number of functions in the binary, but it's completely unclear to me what that claim is based on. It seems that they're varying both the Cockroach version and the Go version, so not only did Cockroach get bigger, but we also added more pclntab tables for the garbage collector since Go 1.2. That aside, the table in the blog post that compares Cockroach in 2017 to 2019 has exactly two data points in it, with the rest "projected", so I don't know how you get any non-linear projection from that.

The pclntab is large, a significant fraction of Go binaries, and its size is starting to cause issues for cutting-edge users, so this is worth working on. I just don't understand the basis for the super-linear growth claim.

robpike

comment created time in 2 months

issue comment golang/go

runtime: pclntab grows superlinearly

As @thanm mentioned, @jeremyfaller has been looking at this problem recently and has done some preliminary analysis. He'll be back Thursday. He was going to prototype splitting the strings into two or three components (e.g., package+type+symbol), so each component can be deduped. I'm not sure what the current state of that is. As @cherrymui points out, the strings aren't a whole lot of the pclntab, but that also seems like low-hanging fruit. @jeremyfaller has also been looking at more typical binaries that have much longer package paths than cmd/compile, and hence more string data.

@cherrymui, what tool did you use to generate that breakdown? It would be good to also see the further breakdown of "pcdata", since there are several tables in that.

An important misconception in the blog post is that pclntab only contains traceback symbolization data. It also contains a lot of critical data for the garbage collector.

I did some experiments a while back with compressing stack maps. At the time, I noted that Huffman-coding the PCDATA delta stream roughly halved the size of the stack maps. However, that depended on the linker generating an optimal Huffman table for the whole binary, which would increase link time. Here's the code for those experiments, though it's probably bitrotted.

robpike

comment created time in 2 months

issue comment golang/go

runtime: "attempt to execute system stack code on user stack" during heap scavenging [1.13 backport]

@mknyszek, would you mind preparing a quick backport CL for this?

gopherbot

comment created time in 2 months

issue comment golang/go

runtime: should checkptrArithmetic accept a uintptr instead of unsafe.Pointer?

@aclements I'm curious if you have any thoughts on the best code generation for the compiler to emit for checkptr.

I'd be inclined to make checkptrArithmetic take a uintptr, since unsafe.Pointer should really only be used for "real", known-good pointers. But, of course, then we need to ensure the underlying pointer remains live. For that I'd lean toward your first option, as long as we can ensure checkptrArithmetic is recursively non-preemptible, and we can mark the whole region in the caller as non-preemptible (to prevent asynchronous preemption).

But I don't think this is a big deal, either, and I think the current state is fine for 1.14. As @cherrymui pointed out, it's going to throw one way or the other, it could just be a nicer throw.
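
The liveness concern in miniature (check is a made-up stand-in for a checker like checkptrArithmetic): once the pointer has been converted to a uintptr, the GC no longer sees it, so something must keep the original allocation alive across the call.

package main

import (
    "runtime"
    "unsafe"
)

//go:noinline
func check(addr uintptr) {
    _ = addr // a real checker would validate the address here
}

func main() {
    p := new([64]byte)
    addr := uintptr(unsafe.Pointer(p))
    check(addr)
    // Without this, p could in principle be collected before check
    // runs, since addr alone does not keep the allocation live.
    runtime.KeepAlive(p)
}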

ianlancetaylor

comment created time in 2 months

issue comment golang/go

runtime: scavenger not as effective as in previous releases

We've decided this seems a bit too high-risk for 1.14 without enough reward. In practice, these races seem quite rare, and if they do happen, they're likely to only delay scavenging by one GC cycle. Bumping to 1.15. If we get evidence that this is causing real memory pressure problems in 1.14, we have the patches ready.

mknyszek

comment created time in 2 months

issue comment golang/go

runtime, cmd/trace: recent regression: "failed to parse trace: no consistent ordering of events possible"

Tentatively marking this as a release blocker, since it could make the execution tracer unusable.

I suspect we're missing some trace event or just modeling things wrong in the new direct-yield path.

danscales

comment created time in 2 months

push event aclements/go-misc

Austin Clements

commit sha 0f1ed13b8149e9b04c7d5c6d032dbca2b1f0e062

stackmapcompress: experiments with stack map compression

I wrote this a while ago when I added stack maps and register maps everywhere. It's probably stale, but it's still useful code.

view details

push time in 2 months

issue comment golang/go

runtime: revert Windows change to boot-time timers

I don't think it's clear that program/monotonic time for everything is the obvious path forward for 1.15, so that wouldn't be an effective way to "preview" the change.

networkimprov

comment created time in 2 months

issue comment golang/go

runtime: revert Windows change to boot-time timers

@aclements what was the possible "change again" in 1.15? Addition of NewTimerAt() and friends?

That's right. Or, at least some sort of OS convergence that we've had more time to think about, whether that's NewTimerAt or something else.

A key rationale for changing to program/monotonic in 1.14 is that it's the native model for Win8/10, which has been in use for 7 years. @jstarks noted that it's odd for Go to second-guess that. At this point, most Windows laptops in use are running Win8/10.

I'm not that concerned with the "native" model of any particular OS, especially when OSes can't agree on what that model should be (e.g., Windows moving to monotonic time, Linux trying [though failing] to move away from monotonic time). I think monotonic time is definitely part of the right answer, but it's also not the whole answer, which is why I'm okay with putting this on hold to minimize design thrashing, and making headway on a whole answer for 1.15.

networkimprov

comment created time in 2 months

issue comment golang/go

runtime: revert Windows change to boot-time timers

Check out this discrepancy between Docker and non-Docker during S3 sleep: https://data.zx2c4.com/docker-uses-program-time-windows-dec-2019.mp4

Oh goodness. Just making sure I understand completely, since runtime.nanotime is reading "interrupt time" in your example, "interrupt time" in Docker for Windows is actually "unbiased interrupt time" ("program time") and there's perhaps no monotonic clock that's actually on "real time" in Docker?

THIS MEANS THAT THE ORIGINAL SIMPLE COMMIT FIXES THE DOCKER ISSUE https://go-review.googlesource.com/c/go/+/208317

Are we fairly certain this is the only cause of error 2?

This means running on bare Windows and running on Docker for Windows will behave differently, but 1) maybe there's no way around that, and 2) maybe it doesn't matter so much because people don't tend to run Docker on laptops anyway?

Thanks for working hard on the Docker issue. As you pointed out, this makes option 1 viable, where we stay on "real time" for both Sub and Sleep for Windows and try to come up with a more unified, consistent answer for 1.15. I'm okay with that because, if we do change the semantics for 1.15, we just have one big convergence of time behavior in 1.15, rather than changing Windows behavior in 1.14 and then again in 1.15.

networkimprov

comment created time in 2 months

issue comment golang/go

runtime: "attempt to execute system stack code on user stack" during heap scavenging

@gopherbot, please open a backport to 1.13.

bcmills

comment created time in 2 months

issue comment golang/go

runtime: revert Windows change to boot-time timers

So, given my current understanding, I see three options here:

  1. "Real time" for both Sub and Sleep on Windows. This is essentially the state we're in now, including the suspend mitigation. This requires fixing the issues with the mitigation: the failures under Docker (#35447) and missing sleep states. But just ignoring PowerRegisterSuspendResumeNotification isn't a good answer, because then we just wind up with unpredictable timer behavior. @zx2c4 debugged this a bit a few weeks ago. As far as I know there aren't any more ideas here, so this currently isn't a viable option.

  2. "Program time" for both Sub and Sleep everywhere. This is bad for network protocols that depend on a distributed, agreed-upon clock, but this is already the case on every other OS we support. This would converge all of our OSes, which is really good. We should definitely continue to think about how to provide real time in a useful way for network protocols in the future (across all platforms), but at least we would be starting from a point where all the OSes are in sync. This is also quite easy to implement.

  3. "Real time" for Sub and "program time" for Sleep on Windows. This effectively moves us toward my rough proposal. It has the advantage that it works whether OS sleeps are "real time" or "program time" (if they're "real time", things just wake up early and go back to sleep). This is fairly easy to implement by using QuaryUnbiasedInterruptTime just for timer deadlines. If my rough proposal is a good idea, this reduces churn in the future, but I'm quite weary of jumping to this when there hasn't been any formal review of my proposal, especially at this point in the release cycle.

Regardless, we need to have a longer term solution in 1.15 that addresses the needs of network protocols across all platforms. But we need a near-term, pragmatic solution now because this issue is directly delaying the release at this point.

I've already prototyped option 2 by changing nanotime and time.Now on Windows to use unbiased interrupt time. This only slows nanotime down from 4.3 ns to 4.73 ns on my laptop. The "stuck buckets" reproducer from #31528 still works and now behaves identically to Linux, and @ijc's test program behaves correctly (all of the "longer than expected" times are very small).

networkimprov

comment created time in 2 months

issue comment golang/go

runtime: revert Windows change to boot-time timers

But about half of my customers still use Windows 7; their computers do the job, so why change.

@alexbrainman, sorry, I'd meant that as a joke (well, mostly :).

Then I run same program built on top of CL 210437, 3 times again. And every time it failed with:

I'm pretty sure I understand what's going on here. The "stuck buckets" problem from #31528 is fixed by Ian's new timer code, even without the suspend mitigation. That's why @ijc's 0.005s post-suspend sleep now works reliably. https://play.golang.org/p/xNHWgBD9k5M is testing something else: it's testing monotonic vs boot-time timer semantics.

What's going on is: the Go process goes completely idle, but has pending timers, so all the Ps go to sleep in (I think) netpoll, which blocks on GetQueuedCompletionStatusEx with a timeout based on the next timer, and sysmon relaxes in semasleep (via notetsleep), also with a timeout based on the next timer. Since the timers' deadlines are derived from nanotime, they are all in boot-time on Windows.

Now, with the suspend mitigation, resume wakes up the semasleep, which unblocks sysmon, which will trigger all timers that expired in boot-time while the system was suspended. It doesn't wake up the GetQueuedCompletionStatusEx, but it doesn't matter because sysmon does the work.

Without the suspend mitigation, both the semasleep and the GetQueuedCompletionStatusEx block in monotonic time, causing timers to get extended by suspend time. But when some timer eventually fires, the runtime compares all the deadlines against boot-time, and at that point all timers that would have expired on boot-time fire.

So, we don't have the original "stuck buckets" problem any more, with or without the suspend mitigation. Once that first timer fires, everything goes back to normal, unlike before. But without the suspend mitigation, the first timer to fire after a resume uses monotonic time semantics, while all timers later than that will use boot-time semantics.

networkimprov

comment created time in 2 months

issue comment golang/go

runtime: revert Windows change to boot-time timers

So now I have a (cheap) Windows laptop and have Go set up on it so I can experiment and verify things. It's getting late here, so in the morning I am planning to try the reproducer from #31528 at tip both with and without the revert and will probably do some verification right around when Ian introduced the new timer system. I may also try to double-check the semantics of the timeout to GetQueuedCompletionStatusEx.

As far as I can tell from reading the code, the new timer code should not be subject to the problem described at #31528. After the system is unsuspended, timers should fire quickly, though there may be a delay of up to 10ms.

@ianlancetaylor, can you explain a bit about why you think the timers should fire quickly after an unsuspend? We check the expiration time against the boot-style interrupt time, but what wakes it up from GetQueuedCompletionStatusEx?

networkimprov

comment created time in 2 months

issue comment golang/go

runtime: revert Windows change to boot-time timers

Re @alexbrainman

I think we're definitely interested in fixing the bug, but at the moment it seems to me that the best path forward is to revert back to the buggy but compatible state and then try to roll forward without breaking any existing projects.

What about https://go-review.googlesource.com/c/go/+/208317 ? If we can make this change to only work inside Docker, I think it is fine temporary measure until we have proper solution for Docker.

If Ian's belief that the new timer code isn't subject to the original "stuck buckets" problem is true, then the PowerRegisterSuspendResumeNotification code is a no-op anyway at this point, so whether we remove that or ignore errors has the same overall effect.

But what happens to time.Sleep on Windows < 8? Wouldn't that still be "real time", causing a different kind of skew on older Windows machines?

If we use WaitForSingleObject to wait, WaitForSingleObject uses "real time" on Windows 7, and "program time" on Windows 10.

Okay. So that sounds like it would introduce a skew problem on older Windows. I hear Microsoft is EOL'ing Windows 7, so maybe we don't have to worry about this much longer. :)

I note that as far as I can tell the patch in CL 191957 has no effect on the current timer code. It only affects goroutines sleeping in semasleep, which no longer happens for timers.

This is a news to me. I suppose I should test #31528 with https://golang.org/cl/210437 applied and see what happens.

That would be great.

Unfortunately, I don't have a good way to suspend the Windows VMs I have easy access to, which makes this rather tricky to experiment with! I'll see about setting up a VM I can actually suspend.

Does anybody know how GetQueuedCompletionStatus and GetQueuedCompletionStatusEx handle their timeouts when the system goes into low power mode?

I do not know. Let me know, if you want me to find out. I need to find appropriate laptop.

If it's easy to check, that would be great. I'd be really surprised if the timeout semantics there didn't match WaitForSingleObject, though.

networkimprov

comment created time in 2 months

issue comment golang/go

runtime: revert Windows change to boot-time timers

For 1.15 I think we should aim to have Windows timers behave like timers on non-Windows systems, which is to say that they should ignore suspend time. As there is a need for timers that do not ignore suspend time, @aclements has suggested that we should introduce time.SleepUntil and time.NewTimerAt (and perhaps time.At and time.AtFunc). These new functions will take a time.Time rather than time.Duration, and will wake up or send a value on a channel or call a function at the specified time. We should also see if we can fix time.Sub to report a duration that includes suspend time when taking the difference between a call to time.Now before the suspension and a call to time.Now after the system wakes up again.

Just to expand a little on this, I have no idea if this is possible to implement (or at least how widely), but the idea is that:

  1. time.Sub would always be real time (boot time) because people think of time.Time as representing an instant in wall clock time since it renders as a wall clock time, and it's surprising when the subtraction has nothing to do with real time.

  2. time.Sleep and friends that take a duration would treat the duration as program time (monotonic time) for the same reasons the Windows kernel developers moved to this model (thundering herds, watchdogs, etc.) and because it makes for a natural new API, namely:

  3. add time.SleepUntil etc that take a time.Time to sleep until. Leaning into the idea that "time.Time is an instant in real time", this would naturally use real/boot time internally.

This isn't fully-baked; just me kicking around ideas.

I think this solves the issues with network protocols that have to roughly agree on a clock in a distributed system since it provides access to real time for both duration and sleeping, but I'm not positive. And any sort of new sleeping API is going to be a least a little confusing for users, but I think splitting it on time.Duration vs time.Time will naturally guide people to use the right one for the situation.
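
For a feel of the semantics, here is a user-space approximation (sleepUntil is a made-up helper, not the proposed standard function): with today's monotonic time.Sleep, a wait that spans a suspend can end with real time still short of the target, so the helper simply checks and sleeps again.

package main

import (
    "fmt"
    "time"
)

// sleepUntil approximates the proposed time.SleepUntil: if the sleep
// ends early relative to the wall clock (e.g. across a suspend), loop
// until the target instant has actually passed.
func sleepUntil(t time.Time) {
    for d := time.Until(t); d > 0; d = time.Until(t) {
        time.Sleep(d)
    }
}

func main() {
    target := time.Now().Add(100 * time.Millisecond)
    sleepUntil(target)
    fmt.Println("woke at or after", target)
}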

networkimprov

comment created time in 2 months

issue comment golang/go

runtime: revert Windows change to boot-time timers

@aclements you didn't address the suggestion to use QueryUnbiasedInterruptTime()

My understanding is QueryUnbiasedInterruptTime would change time.Sub to use "program time", which would make Windows 8 use program time for both duration and sleep. This would match Unix semantics, though doesn't help the network protocol issue. But what happens to time.Sleep on Windows < 8? Wouldn't that still be "real time", causing a different kind of skew on older Windows machines?

networkimprov

comment created time in 2 months

issue comment golang/go

runtime: SIGSEGV crash when OOM after goroutines leak on linux/386

@randall77, what commit was that traceback from? (mpagealloc.go:409 is a comment at current tip.)

rathann

comment created time in 2 months

issue comment golang/go

runtime: revert Windows change to boot-time timers

Summary

I'll start with a summary to make sure I have all of this straight.

There are two different clocks we're concerned with:

  1. "Real time", aka CLOCK_BOOTTIME on Linux or "interrupt time" on Windows. This measures the passage of real time and continues to pass when the system is suspended. This clock has meaning in the external world, such as to users and across networks and distributed systems.

  2. "Program time", aka CLOCK_MONOTONIC on POSIX or "unbiased interrupt time" on Windows. This also measures the passage of real time, but pauses when the system is suspended and no programs can make progress. This clock is more meaningful internal to a system.

Neither of these is affected by wall-clock adjustments. "Wall time" is another clock, but not one we're concerned with here.
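
On Unix, Go's time package already encodes this distinction inside time.Time: Now returns a wall reading plus a monotonic reading, and Sub prefers the monotonic one when both operands carry it (Round(0) strips it, forcing wall-clock arithmetic):

package main

import (
    "fmt"
    "time"
)

func main() {
    start := time.Now() // carries both wall and monotonic readings
    time.Sleep(50 * time.Millisecond)

    // Uses the monotonic readings: "program time" in the terms above.
    fmt.Println(time.Since(start))

    // Round(0) strips the monotonic reading, so this subtraction uses
    // the wall clock instead.
    fmt.Println(time.Now().Round(0).Sub(start.Round(0)))
}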

There are also two broad classes of time-related operations:

  1. Measuring duration, aka time.Sub, time.Since, time1.Before(time2.Add(duration)), etc.

  2. Sleeping, aka time.Sleep, time.After, time.Ticker, etc.

The following table is how I believe these operations behave for different OSes and different Go versions (but please correct me if I'm wrong!):

                          < 1.13.3   >= 1.13.3 & tip
Windows < 8   time.Sub    Real       Real
              time.Sleep  Real       Real
Windows >= 8  time.Sub    Real       Real
              time.Sleep  Program    Mostly real
Unix          time.Sub    Program    Program
              time.Sleep  Program    Program

A few things to note: Windows and Unix have always been inconsistent with each other, and are still inconsistent. Windows intentionally changed the semantics of sleeping in Windows 8. This led to the inconsistency between time.Sub and time.Sleep on Windows, which was pointed out in #31528. This was (partially) fixed for 1.14 in CL 191957 and backported for 1.13.3 in a way that tried to replicate the behavior on earlier versions of Windows (but did not converge it with the behavior of Unix).

The issue

However, this change had some issues. It did not converge the behavior between Windows and Unix; if anything it made it more divergent. While it did (mostly) converge the behavior of duration and sleep on Windows, as far as I can tell, the only real system we know of that was affected by this inconsistency is WireGuard. The change, on the other hand, broke Go applications running in Docker (#35447), and the only obvious "fix" for that problem seems to be to ignore some errors, which would lead to sleep behavior that is mostly real time, but sometimes program time. Finally, the change also ignores some sleep states, again leading to inconsistent sleep behavior.

My conclusion

Given the weight of the evidence, we should revert CL 191957 for the 1.14 beta as well as its 1.13 cherry-pick, CL 193607. This will return us to the state in the left column above where on Windows >= 8, time.Sub is real time and time.Sleep is program time. The old state is not the "right" answer in any absolute sense, but we’d been in the old state for years with few problems and the new state broke more users than it helped and is even more unpredictably inconsistent both between OSes and between environments and setups on Windows. I think the whole approach to clocks bears further consideration, but 1.14 testing has shown that we don’t have the right answer yet.

networkimprov

comment created time in 2 months

issue comment golang/go

runtime: memory corruption on Linux 5.2+

Do you mind elaborating how you tested 386?

I ran the go vet stress test with a toolchain built with GOHOSTARCH=386 GOARCH=386.

However, I just ran my C reproducer, changed to use XMM instead of YMM and compiled with gcc -m32 -msse4.2 -pthread testxmm.c and it failed. So I guess 386 has this problem, too. :(

aclements

comment created time in 3 months

issue comment golang/go

runtime: memory corruption on Linux 5.2+

@mdempsky: https://go-review.googlesource.com/c/go/+/209899/3/src/runtime/os_linux_386.go#7 (it's a little buried)

It may just be harder to reproduce. But we do use the X registers in memmove on 386, so I would still have expected to see it.

aclements

comment created time in 3 months

issue commentgolang/go

runtime: apparent deadlocks on Windows builders

does the runtime assume that its preemption signals are eventually delivered?

It does not. It will always try both a cooperative and a non-cooperative preemption and take whichever one it can get. If the signal goes missing, it will wait for a cooperative preemption just like before.

bcmills

comment created time in 3 months

issue commentgolang/go

runtime: corrupt binary export data seen after signal preemption CL

For everyone following along, I've closed this out with a summary over on #35777. Thanks for all of the reports and for helping to debug!

mvdan

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.2+

To close with a summary:

Linux 5.2, when compiled with GCC 9, introduced a bug that could cause vector register corruption on amd64 on return from a signal handler when the top page of the signal stack had not yet been paged in. This can affect any program in any language (assuming it uses at least SSE registers), including versions of Go before 1.14, and generally results in arbitrary memory corruption. It became significantly more likely in Go 1.14 because the addition of signal-based non-cooperative preemption significantly increased the number of asynchronous signals received by Go processes. It's also somewhat more likely in Go than in other languages because Go regularly creates new OS threads with alternate signal stacks that are likely not to be paged in.

The kernel bug was fixed in Linux 5.3.15 and 5.4.2, and the fix will appear in 5.5 and all future releases. 5.4 is a long-term support release, and 5.4.2 was released with the fix just 10 days after 5.4.

For Go 1.14, we introduced a workaround that mlocks the top page of each signal stack on the affected kernel versions to ensure it is paged in and remains paged in.
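The shape of the workaround, as a standalone sketch (assumptions: a 4 KiB page size and an mmapped stand-in for a signal stack; the real change lives in the runtime and is gated on the affected kernel versions):

package main

import "golang.org/x/sys/unix"

func main() {
	const pageSize = 4 << 10 // assumption: 4 KiB pages
	// Stand-in for a signal stack; real gsignal stacks are managed by the runtime.
	stack, err := unix.Mmap(-1, 0, 64<<10,
		unix.PROT_READ|unix.PROT_WRITE, unix.MAP_PRIVATE|unix.MAP_ANON)
	if err != nil {
		panic(err)
	}
	// Pin the top page so the kernel's signal-delivery path never has to
	// fault it in, which is what triggered the buggy kernel code path.
	if err := unix.Mlock(stack[len(stack)-pageSize:]); err != nil {
		panic(err)
	}
}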

Thanks everyone for helping track this down!

I'll keep this issue pinned until next week for anyone running a tip from before the workaround.

aclements

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.2+

I've filed a tracking bug to remove the workaround for Go 1.15: #35979.

aclements

comment created time in 3 months

issue openedgolang/go

runtime: revert Linux signal stack mlock workaround

For Go 1.14, we worked around a Linux kernel bug that caused vector register corruption on return from the signal handler (#35777, kernel bug).

The workaround is non-trivial and impossible to do 100% correctly from user space. Also, the affected Linux kernel releases are unlikely to still be in the wild by the Go 1.15 release. Hence, I propose that we revert CLs 209597 and 209899 for Go 1.15.

The bug was introduced in Linux 5.2, though it generally wasn't visible until Linux 5.3, because it also required GCC 9, which Linux 5.2's default configuration was incompatible with. It was fixed in Linux 5.3.15 and 5.4.2, and the fix will appear in 5.5 and all future releases. 5.4 is a long-term support release, and 5.4.2 was released with the fix just 10 days after 5.4, so by Go 1.15, stable distributions will have the patched kernel, and unstable distributions will have long since moved on to more recent kernels.

created time in 3 months

issue closedgolang/go

runtime: memory corruption on Linux 5.2+

We've had several reports of memory corruption on Linux 5.3.x (or later) kernels from people running tip since asynchronous preemption was committed. This is a super-bug to track these issues. I suspect they all have one root cause.

Typically these are "runtime error: invalid memory address or nil pointer dereference" or "runtime: unexpected return pc" or "segmentation violation" panics. They can also appear as self-detected data corruption.

If you encounter a crash that could be random memory corruption, are running Linux 5.3.x or later, and are running a recent tip Go (after commit 62e53b79227dafc6afcd92240c89acb8c0e1dd56), please file a new issue and add a comment here. If you can reproduce it, please try setting "GODEBUG=asyncpreemptoff=1" in your environment and seeing if you can still reproduce it.

Duplicate issues (I'll edit this comment to keep this up-to-date):

runtime: corrupt binary export data seen after signal preemption CL (#35326): Corruption in file version header observed by vet. Medium reproducible. Strong leads.

cmd/compile: panic during early copyelim crash (#35658): Invalid memory address in cmd/compile/internal/ssa.copyelim. Not reproducible. Nothing obvious in stack trace. Haven't dug into assembly.

runtime: SIGSEGV in mapassign_fast64 during cmd/vet (#35689): Invalid memory address in runtime.mapassign_fast64 in vet. Stack trace includes random pointers. Some assembly decoding work.

runtime: unexpected return pc for runtime.(*mheap).alloc (#35328): Unexpected return pc. Stack trace includes random pointers. Not reproducible.

cmd/dist: I/O error: read src/xxx.go: is a directory (#35776): Random misbehavior. Not reproducible.

runtime: "fatal error: mSpanList.insertBack" in mallocgc (#35771): Bad mspan next pointer (random and unaligned). Not reproducible.

cmd/compile: invalid memory address or nil pointer dereference in gc.convlit1 (#35621): Invalid memory address in cmd/compile/internal/gc.convlit1. Evidence of memory corruption, though no obvious random pointers. Not reproducible.

cmd/go: unexpected signal during runtime execution (#35783): Corruption in file version header observed by vet. Not reproducible.

runtime: unexpected return pc for runtime.systemstack_switch (#35592): Unexpected return pc. Stack trace includes random pointers. Not reproducible.

cmd/compile: random compile error running tests (#35760): Compiler data corruption. Not reproducible.

closed time in 3 months

aclements

issue commentgolang/go

runtime: memory corruption on Linux 5.2+

Fixed by commit 8174f7fb2b64c221f7f80c9f7fd4d7eb317ac8bb (I messed up the magic commit syntax)

aclements

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.2+

And the patch was just merged into Linux 5.4.2.

aclements

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.2+

Okay, one more shot at the workaround. @cherrymui pointed out the issue also almost certainly affects profiling signals, and neither of the workarounds I posted can be applied to profiling signals. So I went ahead and wrote the mlock solution: CL 209899. It's only slightly more complex than the workaround to disable asynchronous preemption, since the complexity of both is dominated by getting and parsing the kernel version.

aclements

comment created time in 3 months

issue commentgolang/go

cmd/compile: pointers passed to cgo escape to the heap

@cherrymui made an excellent observation that any sort of stack splitting approach isn't enough: Go could pass a pointer to C, which could pass that pointer back to Go, and the Go callback could then leak it to the heap. We completely lose the escape flows when we enter C, despite the cgo pointer passing rules, so it's not only about moving stacks.

I'm not sure there's any good solution to this. We could introduce annotations on C functions to communicate promises about what happens with pointers (the C call is already unsafe anyway). We could do something where we allocate C-escaped objects in some special heap where we can detect leaks and otherwise free them on return from the allocating frame (this sounds really hard). @dr2chase had an interesting idea that if we were to do a Rust-specific FFI, the Rust type system might help here.

FiloSottile

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.2+

I think we need a workaround for the beta, which is imminent. Beyond that, my inclination is to keep a workaround in place for 1.14 because the workaround is low cost and the impact of this bug is so subtle. But I'm okay with removing the workaround in 1.15, given the kernel release cycles.

aclements

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.2+

I don't feel very strongly about it, either. I lean toward touching the signal stack just because it's one line of code instead of 161, doesn't disable any features, and is likely to coincidentally mitigate any issues caused by this bug for other signals.

aclements

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.2+

I've mailed two possible workarounds:

  1. CL 209598 checks the kernel version and disables signal-based preemption for the affected versions. This implements what we talked about, but it's a fair amount of code, mostly to figure out the kernel version.

  2. CL 209599 just touches the signal stack before sending the preemption signal. It's a one line change and obviously harmless.

Both workarounds only affect preemption signals. There's still a danger of other asynchronous signals causing this, but working around that is much harder and still imperfect.

aclements

comment created time in 3 months

issue commentgolang/go

runtime: large address space footprint on 32-bit linux

Sorry, I'd missed this.

I think my comment was saying that it could literally just reserve less space for the heap arenas, and fall back to using fresh mmaps when mheap_.heapArenaAlloc is exhausted.

Right now, 32-bit reserves address space for all possible arenas just to avoid interleaving the memory reservations with the heap, since that would cause memory fragmentation and make it more likely that a large allocation would fail. But on 64-bit, we don't reserve any space up front for the arenas and instead just mmap them as we need them.

It's probably a good idea to still reserve some space on 32-bit for the arena metadata, but it could probably be much less. We could also try to generate hint addresses that make it less likely that the heap and the heapArenas will collide. For example, we could continue to make space for all heapArenas before the initial heap hint, but just not reserve that space. Even without the reservation, the heap is likely to grow up from the hint and not interfere with the heapArenas space unless we're really tight on address space.
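The reserve-then-commit pattern described above looks roughly like this sketch (sizes are illustrative, not the runtime's actual numbers):

package main

import "golang.org/x/sys/unix"

func main() {
	// Reserve a large range of address space without committing memory.
	reserved, err := unix.Mmap(-1, 0, 256<<20, unix.PROT_NONE,
		unix.MAP_PRIVATE|unix.MAP_ANON)
	if err != nil {
		// On 32-bit a large reservation can fail; the fallback would be
		// smaller on-demand mappings, as suggested above.
		panic(err)
	}
	// Commit a chunk only when it's actually needed.
	if err := unix.Mprotect(reserved[:1<<20], unix.PROT_READ|unix.PROT_WRITE); err != nil {
		panic(err)
	}
}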

tmm1

comment created time in 3 months

issue commentgolang/go

runtime: windows system clock resolution issue

I think we're all in agreement that the runtime shouldn't be lowering the system timer frequency. The question is what we should be doing instead.

We tried removing the timeBeginPeriod call in 2016 and it caused severe regressions in some benchmarks (#14790) because it affected sysmon's ability to promptly retake Ps from system calls and C calls. We're not doing this because specific Go applications depend on the higher frequency; if that were the case, I'd be happy to give them an API to ask for it. We're doing this because the Go runtime itself depends on the higher frequency. This doesn't affect all Go applications, but it's hard to predict what it will affect; it's clearly not just the games and multimedia applications that typically depend on a higher timer frequency.

I'd be happy to be told this is no longer the case, but as far as I know the solution is still to use UMS to detect blocked system calls rather than short timers (this would be great, in fact; UMS is a far better solution than short timers, it's just that nobody has tackled that complexity yet). Until that's implemented, my understanding is that simply removing the timeBeginPeriod calls isn't viable.

defia

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.2+

Thanks for the updates. Out of curiosity, does anybody who follows Linux kernel development closer than I do know if this would get backported to a 5.2 or 5.3 patch release? I know 5.4 is an LTS release, but I don't know how that affects further patch releases of older minor releases.

aclements

comment created time in 3 months

issue commentgolang/go

runtime: revert Windows change to boot-time timers

Hi. Sorry for the silence here. Just a heads up that I have a few other things at the top of my list right now, but then I plan to review all the history here and wade in.

networkimprov

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.2+

@knweiss, thanks for pointing that out (not sure how I would have found that out otherwise...). I've replied on the kernel bug, since I'm not subscribed to LKML.

aclements

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.2+

@mdempsky, you're right. This also affects XMM registers. That means we can't work around it by just disabling AVX. We use XMM registers all over the place.

aclements

comment created time in 3 months

issue commentgolang/go

runtime: "goroutine stack exceeds 250000000-byte limit" on linux-arm

Should we also close #35784?

bcmills

comment created time in 3 months

issue closedgolang/go

cmd/compile: random compile error running tests

What version of Go are you using (go version)?

$ go version
go version devel +0ac8739ad5 Mon Nov 18 15:11:03 2019 +0000 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output:

$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/myitcv/.cache/go-build"
GOENV="/home/myitcv/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/myitcv/gostuff"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/myitcv/gos"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/myitcv/gos/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/myitcv/gostuff/src/github.com/myitcv/govim/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build920574822=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I just got a random failure running tests on govim:

$ go test -short -count=1 ./...
# github.com/govim/govim/cmd/govim/internal/golang_org_x_tools/lsp/source
cmd/govim/internal/golang_org_x_tools/lsp/source/symbols.go:206:18: internal compiler error: unexpected untyped expression: <node XXX>

goroutine 1 [running]:
runtime/debug.Stack(0x100a7a0, 0xc00000e018, 0x0)
        /home/myitcv/dev/go/src/runtime/debug/stack.go:24 +0x9d
cmd/compile/internal/gc.Fatalf(0xe5e7ec, 0x21, 0xc001d85c70, 0x1, 0x1)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/subr.go:193 +0x291
cmd/compile/internal/gc.convlit1(0xc001e8bd80, 0xc00039e780, 0xc00039e700, 0x0, 0xd1)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/const.go:269 +0x810
cmd/compile/internal/gc.convlit(0xc001e8bd80, 0xc00039e780, 0xc000af6e10)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/const.go:208 +0x43
cmd/compile/internal/gc.defaultlit2(0xc000af7590, 0xc001e8bd80, 0xc001e8bd00, 0x0, 0xc00072dd80)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/const.go:1135 +0x241
cmd/compile/internal/gc.typecheck1(0xc001e8be00, 0x12, 0x0)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/typecheck.go:631 +0x1072
cmd/compile/internal/gc.typecheck(0xc001e8be00, 0x12, 0x0)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/typecheck.go:300 +0x704
cmd/compile/internal/gc.typecheckas(0xc001e8be80)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/typecheck.go:3172 +0xa1
cmd/compile/internal/gc.typecheck1(0xc001e8be80, 0x1, 0x0)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/typecheck.go:1900 +0x2fcc
cmd/compile/internal/gc.typecheck(0xc001e8be80, 0x1, 0x0)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/typecheck.go:300 +0x704
cmd/compile/internal/gc.walkrange(0xc00072de00, 0x0)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/range.go:443 +0x72f
cmd/compile/internal/gc.walkstmt(0xc00072de00, 0xc001e80980)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/walk.go:344 +0xba5
cmd/compile/internal/gc.walkstmtlist(0xc000ae62d0, 0x5, 0x6)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/walk.go:80 +0x46
cmd/compile/internal/gc.walkstmt(0xc00072dc80, 0xc000ae62d8)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/walk.go:262 +0xffe
cmd/compile/internal/gc.walkrange(0xc00072dc80, 0x0)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/range.go:452 +0x7e4
cmd/compile/internal/gc.walkstmt(0xc00072dc80, 0xc001e80900)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/walk.go:344 +0xba5
cmd/compile/internal/gc.walkstmtlist(0xc000afc400, 0x30, 0x40)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/walk.go:80 +0x46
cmd/compile/internal/gc.walkstmt(0xc00072cd80, 0xc00072ce00)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/walk.go:262 +0xffe
cmd/compile/internal/gc.walkstmtlist(0xc000aea240, 0x7, 0x8)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/walk.go:80 +0x46
cmd/compile/internal/gc.walkstmt(0xc00072cc80, 0xc001e80380)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/walk.go:266 +0x10f6
cmd/compile/internal/gc.walkstmtlist(0xc001e82000, 0x4a, 0x80)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/walk.go:80 +0x46
cmd/compile/internal/gc.walk(0xc0005bec60)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/walk.go:64 +0x3b0
cmd/compile/internal/gc.compile(0xc0005bec60)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/pgen.go:236 +0x6b
cmd/compile/internal/gc.funccompile(0xc0005bec60)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/pgen.go:222 +0xc1
cmd/compile/internal/gc.Main(0xe6ebd0)
        /home/myitcv/dev/go/src/cmd/compile/internal/gc/main.go:714 +0x3299
main.main()
        /home/myitcv/dev/go/src/cmd/compile/main.go:50 +0xac
ok      github.com/govim/govim  1.797s

Could not reproduce this by re-running the same command.

What did you expect to see?

No error

What did you see instead?

Compile error

Possibly related to #35689 and friends.

CC @mdempsky @aclements @mknyszek @ianlancetaylor @bcmills

closed time in 3 months

myitcv

issue commentgolang/go

cmd/compile: random compile error running tests

Thanks. Since this isn't reproducible, I'm going to close this in favor of the super-bug.

myitcv

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.2+

@klauspost, I think #35158 is unrelated to this issue. While this issue technically applies to Go 1.12 and 1.13, it requires an application that's receiving a lot of signals. The corruption in that issue's stack traces also doesn't look like the corruption we typically see as a result of this issue.

aclements

comment created time in 3 months

issue commentgolang/go

runtime: fatal error: unknown caller pc (libfuzz)

Just checking if this is related to #35777. What kernel version are you running? Does the application itself receive a lot of signals?

klauspost

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.3.x from async preemption

We just chatted about workarounds and the favorite workaround is to check the kernel version and disable AVX use on the known-bad kernels. This way it doesn't matter where the signals are coming from or who set up the signal stacks. The solution focuses on the AVX corruption. (We would, of course, also mention this in the release notes.)

aclements

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.3.x from async preemption

That's a good question.

@dr2chase suggested a clever, simple workaround, which is to use a CAS to ensure the top page of the gsignal stack is faulted in just before sending the signal (the CAS ensures it's write-faulted without the danger of a racing write). I may do that now just to head off more memory corruption bugs. Though I'm not positive this completely works with funny cgo thread and signal stack configurations.
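For the curious, the trick looks roughly like this sketch (again with an mmapped stand-in for the gsignal stack; the value-preserving CAS forces a write fault without clobbering a racing writer):

package main

import (
	"sync/atomic"
	"unsafe"

	"golang.org/x/sys/unix"
)

func main() {
	stack, err := unix.Mmap(-1, 0, 64<<10,
		unix.PROT_READ|unix.PROT_WRITE, unix.MAP_PRIVATE|unix.MAP_ANON)
	if err != nil {
		panic(err)
	}
	// CAS the last word of the stack's top page. If the word still holds v,
	// the CAS stores v back: a write fault with no visible change. If a
	// racing writer got there first, the CAS fails without changing the
	// value, and that writer's store already faulted the page in.
	p := (*uint32)(unsafe.Pointer(&stack[len(stack)-4]))
	v := atomic.LoadUint32(p)
	atomic.CompareAndSwapUint32(p, v, v)
}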

On the other hand, this only happens in really bleeding-edge kernels. Assuming it gets fixed upstream quickly, the people who are running bleeding-edge kernels will continue to run bleeding-edge kernels, and will get the kernel fix.

Maybe we put in the workaround for 1.14 and remove it later.

aclements

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.3.x from async preemption

Reproduced at torvalds/linux@b81ff1013eb8eef2934ca7e8cf53d553c1029e84, as well as v5.4, which was just released.

I've filed the upstream kernel bug here: https://bugzilla.kernel.org/show_bug.cgi?id=205663

aclements

comment created time in 3 months

issue commentgolang/go

runtime: corrupt binary export data seen after signal preemption CL

I've filed the upstream kernel bug here: https://bugzilla.kernel.org/show_bug.cgi?id=205663

mvdan

comment created time in 3 months

issue commentgolang/go

runtime: corrupt binary export data seen after signal preemption CL

For completeness, I have also reproduced at torvalds/linux@b81ff1013eb8eef2934ca7e8cf53d553c1029e84, which fixed a bug in the kernel code in question (but apparently not this bug), and at v5.4.

mvdan

comment created time in 3 months

issue commentgolang/go

runtime: corrupt binary export data seen after signal preemption CL

Here's the C reproducer. This fails almost instantly on 5.3.0-1008-gcp, and torvalds/linux@d9c9ce34ed5c892323cbf5b4f9a4c498e036316a (5.1.0-rc3+). It does not fail at the parent of that commit (torvalds/linux@a5eff7259790d5314eff10563d6e59d358cce482).

I'll work on filing this upstream with Linux.

// Build with: gcc -pthread test.c
//
// This demonstrates an issue where AVX state becomes corrupted when a
// signal is delivered where the signal stack pages aren't faulted in.
//
// There appear to be three necessary ingredients, which are marked
// with "!!!" below:
//
// 1. A thread doing AVX operations using YMM registers.
//
// 2. A signal where the kernel must fault in stack pages to write the
//    signal context.
//
// 3. Context switches. Having a single task isn't sufficient.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/wait.h>

static int sigs;

static stack_t altstack;
static pthread_t tid;

static void die(const char* msg, int err) {
  if (err != 0) {
    fprintf(stderr, "%s: %s\n", msg, strerror(err));
  } else {
    fprintf(stderr, "%s\n", msg);
  }
  exit(EXIT_FAILURE);
}

void handler(int sig __attribute__((unused)),
             siginfo_t* info __attribute__((unused)),
             void* context __attribute__((unused))) {
  sigs++;
}

void* sender(void *arg) {
  int err;

  for (;;) {
    usleep(100);
    err = pthread_kill(tid, SIGWINCH);
    if (err != 0)
      die("pthread_kill", err);
  }
  return NULL;
}

void dump(const char *label, unsigned char *data) {
  printf("%s =", label);
  for (int i = 0; i < 32; i++)
    printf(" %02x", data[i]);
  printf("\n");
}

void doAVX(void) {
  unsigned char input[32];
  unsigned char output[32];

  // Set input to a known pattern.
  for (int i = 0; i < sizeof input; i++)
    input[i] = i;
  // Mix our PID in so we detect cross-process leakage, though this
  // doesn't appear to be what's happening.
  pid_t pid = getpid();
  memcpy(input, &pid, sizeof pid);

  while (1) {
    for (int i = 0; i < 1000; i++) {
      // !!! Do some computation we can check using YMM registers.
      asm volatile(
        "vmovdqu %1, %%ymm0;"
        "vmovdqa %%ymm0, %%ymm1;"
        "vmovdqa %%ymm1, %%ymm2;"
        "vmovdqa %%ymm2, %%ymm3;"
        "vmovdqu %%ymm3, %0;"
        : "=m" (output)
        : "m" (input)
        : "memory", "ymm0", "ymm1", "ymm2", "ymm3");
      // Check that input == output.
      if (memcmp(input, output, sizeof input) != 0) {
        dump("input ", input);
        dump("output", output);
        die("mismatch", 0);
      }
    }

    // !!! Release the pages of the signal stack. This is necessary
    // because the error happens when copy_fpstate_to_sigframe enters
    // the failure path that handles faulting in the stack pages.
    // (mmap with MMAP_FIXED also works.)
    //
    // (We do this here to ensure it doesn't race with the signal
    // itself.)
    if (madvise(altstack.ss_sp, altstack.ss_size, MADV_DONTNEED) != 0)
      die("madvise", errno);
  }
}

void doTest() {
  // Create an alternate signal stack so we can release its pages.
  void *altSigstack = mmap(NULL, SIGSTKSZ, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  if (altSigstack == MAP_FAILED)
    die("mmap failed", errno);
  altstack.ss_sp = altSigstack;
  altstack.ss_size = SIGSTKSZ;
  if (sigaltstack(&altstack, NULL) < 0)
    die("sigaltstack", errno);

  // Install SIGWINCH handler.
  struct sigaction sa = {
    .sa_sigaction = handler,
    .sa_flags = SA_ONSTACK | SA_RESTART,
  };
  sigfillset(&sa.sa_mask);
  if (sigaction(SIGWINCH, &sa, NULL) < 0)
    die("sigaction", errno);

  // Start thread to send SIGWINCH.
  int err;
  pthread_t ctid;
  tid = pthread_self();
  if ((err = pthread_create(&ctid, NULL, sender, NULL)) != 0)
    die("pthread_create sender", err);

  // Run test.
  doAVX();
}

void *exiter(void *arg) {
  sleep(60);
  exit(0);
}

int main() {
  int err;
  pthread_t ctid;

  // !!! We need several processes to cause context switches. Threads
  // probably also work. I don't know if the other tasks also need to
  // be doing AVX operations, but here we do.
  int nproc = sysconf(_SC_NPROCESSORS_ONLN);
  for (int i = 0; i < 2 * nproc; i++) {
    pid_t child = fork();
    if (child < 0) {
      die("fork failed", errno);
    } else if (child == 0) {
      // Exit if the parent dies.
      prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
      doTest();
    }
  }

  // Exit after a while.
  if ((err = pthread_create(&ctid, NULL, exiter, NULL)) != 0)
    die("pthread_create exiter", err);

  // Wait for a failure.
  int status;
  if (wait(&status) < 0)
    die("wait", errno);
  if (status == 0)
    die("child unexpectedly exited with success", 0);
  fprintf(stderr, "child process failed\n");
  exit(1);
}
mvdan

comment created time in 3 months

issue commentgolang/go

runtime: memory corruption on Linux 5.3.x from async preemption

Thanks @dvyukov. I just re-confirmed that I can still reproduce it in the same way on 5.3, which includes that commit. I'll double check that I can still reproduce right at that commit, just in case it was somehow re-introduced later.

aclements

comment created time in 3 months

issue commentgolang/go

runtime: corrupt binary export data seen after signal preemption CL

I just got the C reproducer working. I'm working on tidying it up and I'll post it. Both madvising and mmaping the sigaltstack work to clear the pages (and that is necessary). The other missing ingredient was just running lots of the processes simultaneously.

mvdan

comment created time in 3 months
