Michael Knyszek mknyszek Google Boston, MA

mknyszek/rust-libquantum 7

Rust bindings for libquantum

mknyszek/schmelda-bread-rpg 3

XML-based RPG Engine Written in Haxe

mknyszek/gateron-eagle 2

A barebones eagle library for Gateron Cherry-MX-clone switches

joshim5/Provita-Website 1

Provita is a bio-tech startup company comprised of students at the Bergen County Academies in Hackensack, NJ.

mknyszek/markov-python 1

A simple implementation of nth order markov chains in python.

mknyszek/Quick 1

A bytecode interpreter for a prototype quantum scripting language

SporkList/sporklist.github.io 1

For the indecisive eater

mknyszek/adjacent-squares 0

Optimized Computational Solution to Coffee Time Challenge Problem #12

mknyszek/curnel 0

A (Python -> CUDA C) compiler for use with PyCUDA

mknyszek/curse-mania 0

A Curses-based Rhythm Game

issue commentgolang/go

runtime: VirtualAlloc of 4294967296 bytes failed with errno=1455 on windows/amd64 go 1.13.4

To be clear, the runtime is asking the OS for 4 GiB of memory on behalf of your application. The request is ultimately coming from there. Please let me know if there are any further issues, though.

glycerine

comment created time in 2 days

issue commentgolang/go

runtime: can't atomic access of first word of tiny-allocated struct on 32-bit architecture

@aclements @cherrymui Got it. I missed the fact that the size is rounded up. That makes sense, and I was surprised that this wasn't more broken...

@josharian It looks like what I was proposing is already there, so there shouldn't be any packing problems.

It looks like this problem is wholly #36606 as @cherrymui mentioned, so we might want to just close this as a duplicate?

NewbMiao

comment created time in 6 days

issue commentgolang/go

runtime: cannot ReadMemStats during GC

@dmitshur I'm pinging my reviewers. This should go in ASAP given how much trouble it gave us last release.

It should be a lot better this time though; the fix to this bug last time was a bit sloppy, and everything here has been re-done in a more principled manner (with a new benchmark).

aclements

comment created time in 7 days

issue commentgolang/go

doc: write Go 1.14 release notes

@aclements I think you're right, and I cannot. I don't want to discourage users from filing bugs about changes in their application's behavior (especially if it's for the worse) by implying that the runtime got slower in exchange for lower memory use, and I can't think of a straightforward wording that doesn't have this connotation.

dmitshur

comment created time in 8 days

issue commentgolang/go

cmd/compile: cant atomic access of first word of allocated struct occasionally?

@mknyszek would that make the tiny allocator much less efficient, by increasing the number and size of gaps? And if so, are there mitigations?

Maybe, but given that we don't have atomic types and instead have atomic operations, it's impossible to know whether we really need the full alignment, and so we must conservatively always assume we need it. In that sense, I can't think of any mitigations besides waiting for #36606 which may allow users to set a custom alignment (thereby signaling the alignment they actually need).

It might also be slower because we have to touch the type whereas we didn't have to before.

NewbMiao

comment created time in 8 days

issue commentgolang/go

cmd/compile: cant atomic access of first word of allocated struct occasionally?

@aclements Looks like the tiny allocator doesn't uphold the guarantee in the spec. The lines are:

			// Align tiny pointer for required (conservative) alignment.
			if size&7 == 0 {
				off = round(off, 8)
			} else if size&3 == 0 {
				off = round(off, 4)
			} else if size&1 == 0 {
				off = round(off, 2)
			}

In this case size is 12, and so we align to 4 (as the trace in the original post at the top shows), but it should be 8, because it contains an 8-byte field (right?). I don't think this is even just a 386 problem; we could end up misaligning in any case.

Maybe the fix is to just align to typ.align?

NewbMiao

comment created time in 8 days

issue commentgolang/go

runtime: Darwin slow when using signals + int->float instructions

@randall77 I tried also commenting out the "raise" line and leaving "vzeroupper" uncommented and the performance was somewhere in between "vzeroupper" being used and not (with raise still in there), which doesn't really make much sense to me. Might be a useful data point.

randall77

comment created time in 15 days

issue commentgolang/go

runtime: performance degradation in go 1.12

@interviewQ Right, I figured as much based on the original benchmark you provided; that's why I explicitly tried out 64 KiB allocations. I can try out larger ones, but up to 128 KiB they all have the potential to be satisfied out of the page cache, so the scalability of the allocator should be fine.

@aclements suggested to me that if there is some scalability problem in the runtime here, and your application is indeed CPU-bound, we should be able to figure out what it is (or get a strong hint) by just looking at a differential profile for the same version of Go but for different GOMAXPROCS. So for example, for Go 1.14, if you could collect a pprof CPU profile of your application running with GOMAXPROCS=36 and GOMAXPROCS=48 and GOMAXPROCS=72, you can look at the diff in pprof and see if anything is growing. If anything is, that will point heavily to whatever the bottleneck is.

You can collect a profile by using https://golang.org/pkg/runtime/pprof/, then running go tool pprof --diff_base=low_count.cpu high_count.cpu and typing the command "top 20" at the interactive prompt.
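
For reference, a minimal sketch of collecting a CPU profile in-process (the output file name and the runWorkload placeholder are illustrative, not from this issue):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// runWorkload is a placeholder for the application's allocation-heavy work.
func runWorkload() {}

func main() {
	f, err := os.Create("gomaxprocs_72.cpu") // illustrative file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Profile only the interesting part of the run.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	runWorkload()
}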

interviewQ

comment created time in 16 days

issue commentgolang/go

runtime: performance degradation in go 1.12

@mknyszek I ran tests on an EC2 instance with 36 core CPU. With this, I see a similar performance with go 1.11.13 vs 1.13.7. Go 1.14rc1 is actually slightly faster than go1.11.13. So, the performance issue which I am seeing is only with CPUs with large number of cores. This points to the lock contention issue in the memory allocator that you mention above. I do see this has improved considerably in Go 1.14, but it is still slower than Go 1.11 (only with large number of cores). Is there any way to improve this any further in Go1.14.

Thinking about this more, it doesn't make sense that Go 1.11, if it collapses at 48 cores, would do better than Go 1.14 at 72 (in terms of allocations). The one case I can think of is that the Go 1.14 allocator just completely breaks down at 72 cores, doing worse than Go 1.11.

So I set up a 72-core VM and ran some allocator scalability microbenchmarks. These show that the 1.14 allocator scales much better than Go 1.11: around 1.5-2x throughput improvement at 72 cores for a range of allocation sizes (anywhere from 1 KiB to 64 KiB).

I think that perhaps the scalability of another part of the runtime got worse in Go 1.12, and that perhaps scalability improvements are carrying things a bit. I don't know what that is, but now that I have a 72-core VM I can experiment more deeply.

I pored over the GC traces for Go 1.11 and Go 1.14 for several hours, doing a bunch of aggregate analysis, and things have certainly changed a lot (e.g. much more time spent in assists in Go 1.14 vs. older versions; this is not totally unexpected), but nothing stands out as "this is obviously the cause" to me. Though, since I don't think this relates to #35112 anymore, I'm going to take a closer look at the Go 1.13 GC traces and see what I can glean from them.

interviewQ

comment created time in 16 days

issue openedgolang/go

proposal: API for unstable runtime metrics

Proposal: API for unstable runtime metrics

Background & Motivation

Today runtime metrics are exposed in two ways.

The first way is via the struct-based sampling APIs runtime.ReadMemStats and runtime/debug.GCStats. These functions accept a pointer to a struct and then populate the struct with data from the runtime.

The problems with this type of API are:

  • Removing/renaming old metrics from the structs is impossible.
    • For example, MemStats.BySize is hard-coded to 61 size classes when there are currently 83. We cannot ever change BySize.
  • Adding implementation-specific metrics to the structs is discouraged, because it pollutes the API when inevitably they'll be deprecated.
  • runtime.ReadMemStats has a global effect on the application because it forces a STW. This has a direct effect on latency. Being able to tease apart which metrics they actually need gives users more control over performance.

The good things about this type of API are:

  • Protected by the Go 1 compatibility promise.
  • Easy for applications to ingest, use for their own purposes, or push to a metrics collection service or log.

The second way is via GODEBUG flags, which emit strings containing metrics to standard error (e.g. gctrace, gcpacertrace, scavtrace).

The problems with this type of API are:

  • Difficult for an application to ingest because it must be parsed.
  • Format of the output is not protected by the Go 1 backwards compatibility promise.

The good things about this type of API are:

  • We can freely change it and add implementation-specific metrics.
  • We never have to live with bad decisions.

I would like to propose a new API which takes the best of both approaches.

Requirements

  • The API should be easily extendable with new metrics.
  • The API should be easily retractable, to deprecate old metrics.
    • Removing a metric should not break any Go applications as per the Go 1 compatibility promise.
  • The API should be discoverable, to obtain a list of currently relevant metrics.
  • The API should be rich, allowing a variety of metrics (e.g. distributions).
  • The API implementation should minimize CPU/memory usage, such that it does not appreciably affect any of the metrics being measured.
  • The API should include useful existing metrics already exposed by the runtime.

Goals

Given the requirements, I suggest we prioritize the following concerns, in this order, when designing the API.

  1. Extensibility.
  • Metrics are “unstable” and therefore it should always be compatible to add or remove metrics.
  • Since metrics will tend to be implementation-specific, this feature is critical.
  2. Discoverability.
  • Because these metrics are “unstable,” there must be a way for the application, and for the human writing the application, to discover the set of usable metrics and be able to do something useful with that information (e.g. log the metric).
  • The API should enable collecting a subset of metrics programmatically. For example, one might want to “collect all memory-related metrics” or “collect all metrics which are efficient to collect”.
  3. Performance.
  • Must have a minimized effect on the metrics it returns in the steady-state.
  • Should scale up to 100s of metrics, an amount that a human might consider “a lot.”
  • Note that picking the right types to expose can limit the amount of metrics we need to expose. For example, a distribution type would significantly reduce the number of metrics.
  4. Ergonomics.
  • The API should be as easy to use as it can be, given the above.

Design

See full design document here. (TODO: add a link to the design document).

Highlights:

  • Expose a new sampling-based API in a new package, the runtime/metrics package.
  • Use string keys for each metric which include the unit of the metric in an easily-parseable format.
  • Expose a discovery API which provides metadata about each metric at runtime, such as whether it requires a STW and whether it's cumulative (counter as opposed to a gauge).
  • Add a Histogram interface to the package which represents a distribution.
  • Support for event-based metrics is discussed and left open, but considered outside the scope of this proposal.
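
Purely as a hypothetical sketch of the shape such an API could take (the package path runtime/metrics and the names All, Read, and Sample below are illustrative assumptions for this proposal, not a settled design):

package main

import (
	"fmt"
	"runtime/metrics" // hypothetical package from this proposal
)

func main() {
	// Discover every metric the runtime currently supports.
	descs := metrics.All()

	// Sample all of them in one call.
	samples := make([]metrics.Sample, len(descs))
	for i, d := range descs {
		samples[i].Name = d.Name
	}
	metrics.Read(samples)

	for _, s := range samples {
		fmt.Println(s.Name, s.Value)
	}
}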

Backwards Compatibility

Note that although the set of metrics the runtime exposes will not be stable across Go versions, the API to discover and access those metrics will be.

This proposal therefore strictly increases the API surface of the Go standard library without changing any existing functionality, and so it is Go 1 compatible.

created time in 19 days

issue commentgolang/go

runtime: golang 1.14.rc1 3-5% performance regression from golang 1.13 during protobuf marshalling

@ianlancetaylor Yeah that's right. @cherrymui and I dug a little deeper and it's actually a map with keys of type reflect.Type that is causing >70% of the mapaccess1 calls. :)

howardjohn

comment created time in 20 days

issue commentgolang/go

runtime: golang 1.14.rc1 3-5% performance regression from golang 1.13 during protobuf marshalling

@howardjohn I can reproduce, and I think I know the source of the problem.

First let me say that I tried pretty hard to reproduce the profiles you attached but I was unable to using the commands in the original post. Every time I compared 1.14 and 1.13 profiles the diff between them was completely different. Part of the problem is that because the number of iterations isn't fixed, each run could be doing different amounts of work. It's harder to compare along these lines.

Next, I need to point out that there's a memory leak in the benchmark. Every time BenchmarkEDS gets invoked, the heap size increases. This is especially noticeable with GODEBUG=gctrace=1 running and -count=N where N>1, but the leak also affects N=1 if the number of benchmark iterations is not fixed (since BenchmarkEDS gets called multiple times by the testing framework).

So, given this, I collected profiles on the BenchmarkEDS/100/100 benchmark with -benchtime=1000x to fix the iteration count. This is roughly the same amount of work as -benchtime=10s.

In the end, I consistently get two profiles which implicate mapaccess1, runtime.interhash, and callees as the culprit in the diff. Specifically, aeshash64 is gone in the 1.14 profile, replaced with aeshashbody (called through runtime.typehash) which takes more time.

For documentation purposes, the command I ran (after cloning istio at the PR mentioned above; if you don't do that the benchmark won't run) was:

perflock go test ./pilot/pkg/proxy/envoy/v2/ -bench=BenchmarkEDS/100$/100$ -count=1 -run=^$ -benchmem -benchtime=1000x -cpuprofile=cpu.prof

Pinging @randall77 since @cherrymui suggested you were involved in some map changes in the Go 1.14 cycle, and I don't know much about the map implementation.

howardjohn

comment created time in 20 days

issue commentgolang/go

runtime: golang 1.14.rc1 3-5% performance regression from golang 1.13 during protobuf marshalling

@howardjohn Thanks so much for the issue and the detailed information. Looking into it.

howardjohn

comment created time in 20 days

issue commentgolang/go

runtime: performance degradation in go 1.12

@interviewQ I did not mean anything else, that's exactly it. Thank you for confirming.

Running those sounds great to me. If you have the time/resources to try tip of the master branch as well that would be very helpful too.

interviewQ

comment created time in 21 days

issue commentgolang/go

runtime: performance degradation in go 1.12

@interviewQ Also, if you'd be willing to share a GC trace, that would give us a lot more insight into what the runtime is doing. You can collect one by running your application with GODEBUG=gctrace=1 set and capturing STDERR.

interviewQ

comment created time in 21 days

issue commentgolang/go

runtime: performance degradation in go 1.12

Just out of curiosity, how did you measure RSS?

I just want to be very precise on this because the virtual memory usage of Go increased significantly with 1.14 (around 600 MiB, which admittedly doesn't account for everything you're seeing), but most of that memory is not mapped in/committed.

Thanks for the quick turnaround on this and for your cooperation. 72 cores is not a level of parallelism I've personally ever tested; it's possible we have some scalability problems at that level that we haven't seen before.

As another experiment, can you try setting GOMAXPROCS to 48 or so? There are other things I'd be interested in looking at as well, but I'll wait for your reply first.

interviewQ

comment created time in 21 days

issue commentgolang/go

runtime: VirtualAlloc of 4294967296 bytes failed with errno=1455 on windows/amd64 go 1.13.4

I replied on the other issue, but copying here for visibility:

@glycerine AFAICT your application ran out of memory. It looks like your application tried to read more than 4 GiB from an io.Reader and the OS determined it didn't have enough memory left to satisfy the allocation (note that the bytes.Buffer grows, calling makeSlice with an argument of 4294966784 bytes, which was then rounded up to exactly 4 GiB because the allocation happens at page granularity).

glycerine

comment created time in 22 days

issue commentgolang/go

runtime: VirtualAlloc of 0 bytes failed with errno=1455

@glycerine AFAICT your application ran out of memory. It looks like your application tried to read more than 4 GiB from an io.Reader and the OS determined it didn't have enough memory left to satisfy the allocation (note that the bytes.Buffer grows, calling makeSlice with an argument of 4294966784 bytes, which was then rounded up to exactly 4 GiB because the allocation happens at page granularity).

ayanamist

comment created time in 22 days

issue commentgolang/go

runtime: performance degradation in go 1.12

First of all thanks a lot for these quick replies and very useful info.

I had tested with 1.14beta1 in Dec and that still showed the performance issue. Did the fix for 35112 go after 1.14 beta1 was released? I was planning on retesting with 1.14 when it gets released (was due on Feb 1). However, I could try the tip of the master branch.

There may be a regression in the default configuration, but have you looked at memory use like I mentioned? If it went down, you could increase GOGC from the default until your memory use is the same as before (in the steady-state) and see if it performs better. Since you mentioned you're relatively new to Go, check out the comment at the top of https://golang.org/pkg/runtime for an explanation of GOGC, and also see https://golang.org/pkg/runtime/debug/#SetGCPercent for more information.
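
As a minimal sketch of setting the same knob programmatically (the value 400 is an arbitrary example, not a recommendation):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to running with GOGC=400: let the heap grow to roughly
	// 5x the live heap before the next collection, trading memory for
	// less frequent GC.
	old := debug.SetGCPercent(400)
	fmt.Println("previous GOGC:", old)
}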

In our app, we have 1000 goroutines, all allocating memory in parallel, so from what you are saying above "allocating heavily and in parallel" is what applies to our app.

The number of goroutines doesn't really matter since that's just concurrency. The GOMAXPROCS value is your actual level of parallelism (up to the number of independent CPU cores on your machine). What is GOMAXPROCS for your application?
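
In case it helps, a tiny sketch for checking it from inside the process (calling runtime.GOMAXPROCS with 0 reports the current value without changing it):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) queries the current setting without modifying it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	fmt.Println("NumCPU:", runtime.NumCPU())
}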

Using sync.Pool is something I have thought about but it seems non-trivial since I will need to figure out when to put the buffer back to the pool. It is not impossible just hard to figure out using sync.Pool.

What if I have 1 goroutine whose only job is to alloc memory. These 1000 goroutines can talk to this "allocator" goroutine (via channels). That way these huge allocations happen from only 1 goroutine and contention is eliminated. Please let me know if this makes sense. Also, I come from C++ world and hence new to Go, so please pardon my ignorance.

Contention isn't really eliminated; you're just moving it from the allocator and to the channels in this case. Before trying to restructure your code for this please give my suggestion above a try. You might just have to increase GOGC to get the same level of CPU performance for the same memory usage.

interviewQ

comment created time in 22 days

issue commentgolang/go

runtime: performance degradation in go 1.12

@interviewQ Some background: there was a known regression in the slow path of the allocator in Go 1.12, which was intentional in order to support returning memory to the OS more eagerly. For the vast majority of programs, this wasn't noticed. The regression only showed up in microbenchmarks, not in any real production services or applications (as far as we could see/find). You need to be really allocation bound to notice this, as @randall77's benchmark shows. Note that @randall77's reproducer makes many 64 KiB allocations without ever touching that memory (aside from zeroing), which would already meaningfully reduce the load on the allocator.

However, it's also been known that allocating heavily and in parallel has had serious lock contention issues since at least Go 1.11. This was noticed when we worked on the regression in Go 1.12. In Go 1.14 we worked to fix these issues (#35112) and it went reasonably well. I'm fairly confident that the allocator is now faster, so this seems to me like it's the change in the default trade-off in these "way past the heap goal" cases as @randall77 says.

With that being said, @interviewQ could you:

  1. Check if using the tip of the master branch for Go results in less memory used for your application.
  2. If so, try cranking up GOGC to the point where you're using the same amount of memory as before, then measure CPU performance again.

Hopefully the net effect will be that we made performance better. :) If not, we can move on from there.

interviewQ

comment created time in 22 days

issue commentgolang/go

runtime: Kubernetes performance regression with 1.14

@ianlancetaylor I think we can probably close this now? WDYT?

ianlancetaylor

comment created time in a month

issue commentgolang/go

runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3

Hey @ardan-bkennedy, I think it would be best if we kept the chat here. My apologies for the delay in replying.

Could you share more information about how you're generating the output? Specifically this:

GC | 12,385,913 ns | 12,385,913 ns | 619,296 ns | 20
Selection start: 32,301,149 ns
Selection extent: 452,915,504 ns
Total Run time: 505.645 ms

You mentioned that you got it from a trace, but if you have more concrete details about how you extracted this out of a trace that would be very helpful.

Also, could you show a comparison of these numbers for you between Go 1.13 and tip?

I tried to produce a wall-clock time regression with both the freqConcurrent and freqNumCPU configurations (since you indicated there was an increase in total run time) on my 4-core macOS machine and have been unable to.

Now the 4000 goroutine program is running at 50% GC. I am seeing a TON of mark assist on the first GC which I never saw before.

As I mentioned earlier in this thread, more mark assists overall is expected with the new release. Note that when an application changes phases (e.g. it starts doing lots of work and moves toward a steady-state) the GC tends to over-do it and then converge to a steady-state. Given what I mentioned earlier, it makes sense that you'd see a lot more mark assists in these GCs. An increase in time spent in mark assists isn't necessarily a bad thing.

The heap size is really causing the pacer to over GC, IMHO.

I'm not quite sure what you mean by "over GC" here. In the default configuration, GOGC=100, the heap will grow to 2x the live heap and pace the mark phase to start and end before it gets there. At GOGC=1000, the heap will grow to 11x the live heap. Since the amount of mark work is proportional to only the size of the live heap, in the steady-state (the live heap isn't growing or shrinking) the application is going to spend the same amount of time GCing regardless of the heap goal. By dialing GOGC up, you're trading off a bigger heap size in exchange for fewer GCs.
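
To spell out the relationship above for this era of the pacer, the heap goal is roughly

	heap goal = live heap × (1 + GOGC/100)

so GOGC=100 gives a goal of 2x the live heap and GOGC=1000 gives 11x, while the mark work stays proportional to the live heap alone.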

ardan-bkennedy

comment created time in a month

issue commentgolang/go

runtime: cannot ReadMemStats during GC

We're pretty sure we know how to fix this properly now. I'll have a more complete patch ready early in the Go 1.15 cycle.

aclements

comment created time in a month

push eventmknyszek/mknyszek.github.io

Michael Anthony Knyszek

commit sha 286aa7b1c0a360666288bc001bb6f9cc090405ec

Clean up a sentence.

view details

push time in a month

issue openedgolang/go

runtime: sysUsed often called on non-scavenged memory

In working on #36507, I noticed that one source of regression on Darwin was that sysUsed, as of Go 1.14, is now called on the whole memory allocation, even if just one page of that is actually scavenged. This behavior was already present in Go 1.11, went away in Go 1.12, and is now back.

On systems where sysUsed is a no-op and we rely on the first access "unscavenging" the page, this has no effect, because we're already only unscavenging exactly what's needed. On systems like Darwin and Windows where the "recommit" step is explicit (i.e. sysUsed is not a no-op) this can have a non-trivial impact on performance, since the kernel probably walks over the unscavenged pages for nothing.

This contributed very slightly to the performance regression in #36218, but not enough to cause it to block the 1.14 release.

The fix here is straightforward: we just need to lower the call of sysUsed into the page allocator, where we have complete information about exactly which pages are scavenged. If a given allocation has >50% of its pages scavenged then we can probably just sysUsed the whole thing to save a bit on syscall overheads. This change also fits because we already do sysUnused at a fairly low-level part of the page allocator, so bringing sysUsed down there makes sense.
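
To make the heuristic concrete, a standalone sketch of the decision (the function name and the 50% threshold are just an illustration of the idea, not page-allocator code):

package main

import "fmt"

// recommitPlan mirrors the heuristic described above: if more than half of
// an allocation's pages are scavenged, recommit (sysUsed) the whole thing
// in one syscall; otherwise only the scavenged runs are worth the calls.
func recommitPlan(npages, scavenged int) string {
	if scavenged*2 > npages {
		return fmt.Sprintf("sysUsed all %d pages in one call", npages)
	}
	return fmt.Sprintf("sysUsed only the %d scavenged pages", scavenged)
}

func main() {
	fmt.Println(recommitPlan(16, 12)) // mostly scavenged: one big call
	fmt.Println(recommitPlan(16, 2))  // mostly clean: targeted calls
}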

created time in a month

issue commentgolang/go

runtime: RSS keeps on increasing, suspected scavenger issue

@dbakti7 The MemStats numbers indicate that everything is pretty much WAI; the OOMs however are not great and can be caused by a number of things. The high HeapSys value you're reporting suggests to me that your application has a high peak memory use though, either that or you have a really bad case of fragmentation (which is harder to diagnose).

For now, can you provide the following details?

  • What is the peak memory use for your application? (Either VmHWM from /proc/<id>/status or whatever Prometheus tells you is the peak HeapAlloc recorded for your application.)
  • Linux kernel version.
  • Linux overcommit settings (cat /proc/sys/vm/overcommit_memory and cat /proc/sys/vm/overcommit_ratio).
  • Is swap enabled on your system? If so, how much swap space do you have?
dbakti7

comment created time in a month

issue commentgolang/go

runtime: scavenger is too eager on Darwin

OK going to walk back my second thought. Turns out, yes, https://github.com/golang/go/issues/36507#issuecomment-573887201 is a real problem. It's slightly worse now, but only by a little bit. This is basically going back to Go 1.11 behavior where if any memory was scavenged you would just scavenge the whole thing.

mknyszek

comment created time in a month

issue commentgolang/go

runtime: scavenger is too eager on Darwin

On second thought, I can't really find evidence of the number of scavenged pages in an allocation being less than the size of the allocation in the above benchmarks. Perhaps this isn't a problem...

mknyszek

comment created time in a month

issue commentgolang/go

runtime: scavenger is too eager on Darwin

Digging deeper, I think I know better why sysUsed is hurting performance so much, even though it only takes around the same amount of time as sysUnused: starting with this release, we might allocate across a scavenged/unscavenged memory boundary. If one allocates several pages at once and if even only one page in that range is scavenged, we call sysUsed on that whole region. This has no effect on systems where sysUsed is a no-op, since those systems just fault in those pages on demand.

I'm not sure what to do here. The patch I uploaded helps the problem a little bit, but to completely solve the problem would require figuring out exactly which pages are scavenged and only calling sysUsed on those, but that's somewhat expensive to keep track of in the new allocator.

Because the allocator doesn't propagate up which pages are scavenged (though it does know this information precisely!) currently we just do the heavy-handed thing. But, we could instead have the allocator, which actually clears the bits, do the sysUsed operation. It would then have precise information. Because syscall overhead is also a factor here, there would need to be a heuristic, though. For example, if >50% of the memory region is scavenged, just sysUsed the whole thing, instead of piece-by-piece.

Unfortunately, lowering the sysUsed operation down into the allocator like this is annoying for tests. In most cases it's safe to sysUsed something already considered in-use by the OS, but I don't think we make that part of sysUsed's API, though I suppose we could. We also already have a flag which does this so we can test scavenging, so maybe we should just use that?

mknyszek

comment created time in a month

push eventmknyszek/mknyszek.github.io

Michael Anthony Knyszek

commit sha 56a55508be7fc28b4226da19f61a184d1c3c6fe0

change license to MIT

view details

Michael Anthony Knyszek

commit sha eb5065a1bb2c508b15840feff78fb96053bae828

Rewrite website into simpler format

view details

push time in a month

issue openedgolang/go

runtime: scavenger is too eager on Darwin

Currently on Darwin the scavenger is too eager and is causing performance regressions. The issue mainly stems from the fact that in Go 1.14 the scavenger is paced empirically according to the costs of scavenging, most of which come from sysUnused, which makes a madvise syscall on most platforms, including Darwin.

However, the problem on Darwin is that we don't just do MADV_FREE anymore, we do MADV_FREE_REUSABLE in the sysUnused path and MADV_FREE_REUSE in the sysUsed path (instead of sysUsed being a no-op). It turns out the source of the regression is mostly the fact that sysUsed is actually quite expensive relative to other systems where it's just a no-op and we instead incur an implicit page fault once the memory is actually touched. The benefits of using this API outweigh the costs, since it updates process RSS counters in the kernel appropriately and MADV_FREE_REUSABLE is properly lazy.

So since we don't account for sysUsed we end up scavenging a lot more frequently than we should to maintain the scavenger's goal of only using 1% of the CPU time of one CPU.

The actual size of the regression can be quite large, up to 5%, as seen in #36218, so we should fix this before 1.14 goes out.

The fix here is relatively simple: we just need to account for this extra cost somehow. We could measure it directly in the runtime but this then slows down allocation unnecessarily, and even then it's unclear how we should attribute that cost to the scavenger (maybe as a debt it needs to pay down?). Trying to account for this cost on non-Darwin platforms is also tricky because the costs aren't actually coming from sysUsed but from the page fault.

Instead, I think it's a good idea to do something along the lines of what we did last release: get some empirical measurements and use them to get an order-of-magnitude approximation. In this particular case, I think we should compute an empirical ratio "r" of the cost of using a scavenged page to the cost of sysUnused, and turn this into a multiplicative constant "1+r" on the time spent scavenging. So, for example, if sysUsed is roughly as expensive as sysUnused, the factor would be 2.

created time in 2 months

issue commentgolang/go

crypto: performance loss with 1.14 beta

Oh, and also I'll probably open a separate bug for that so this bug can continue to be about a crypto performance regression.

pascaldekloe

comment created time in 2 months

issue commentgolang/go

crypto: performance loss with 1.14 beta

I discovered in the last few weeks that the problem is more fundamental: we don't account for the cost of sysUsed in scavenger pacing like we do the cost of sysUnused, and as it turns out sysUsed is more expensive than (or equal to) sysUnused on Darwin and that's really what's going on here. On most other platforms sysUsed is a no-op and the only performance hit in that case is a page fault (which is non-trivial, but definitely much less expensive than the madvise).

So, the fix I used above turns out to be kind of a hack. I think the right thing to do is just make some empirical measurements of the relative costs of sysUsed and sysUnused on Darwin and multiply the measured critical time by this ratio. This is simple and likely to be the solution we'd settle on in 1.15 anyway. We can always change our minds, too.

pascaldekloe

comment created time in 2 months

issue commentgolang/go

runtime: HeapSys increases until OOM termination

@lysu This is exactly the behavior I'd expect and it is working as intended. We never unmap heap memory, we only return memory to the OS via madvise, so HeapSys is monotonically increasing. HeapIdle-HeapReleased, you'll notice, tends toward zero when you take load off the system, and that's an estimate of physical memory requirements.

savalin

comment created time in 2 months

issue commentgolang/go

runtime: RSS keeps on increasing, suspected scavenger issue

@mknyszek Thanks for the elaborate explanation! I tried your suggested approach, i.e. to build the executable first for go 1.13.5: https://pastebin.com/3jvnkBJC for go 1.13.0: https://pastebin.com/44veNiJK both have similar behavior now, which is calling MADV_FREE for big chunk with size 209715200 bytes. Not sure why it's different with gctrace reports (that seems to release memory in small chunks frequently), or maybe they refer to two different things? (i.e. I'm referring to the statement "Every time you see X MiB released that means there was at least one madvise syscall made")

That's odd. gctrace should print "X MiB released" times <= the number of syscalls you see in strace. I'll try this out.

And agree with your point regarding MADV_FREE and MADV_DONTNEED, problem is we are still unable to replicate our production load to see whether MADV_DONTNEED will improve our memory usage. :(

To be clear, it'll just improve reported memory use. MADV_FREE works fine, and will do the right thing under memory pressure, you just won't see counters go down. It depends on whether you use those counters for anything. If you don't, then you can instead estimate how much memory is being used by go by subtracting Sys - HeapReleased in MemStats.

One clarification, since MADV_DONTNEED is set with GODEBUG flag, I hope the word "Debug" here doesn't mean it's not safe to be used in production env?

It's fine to use in production. It doesn't print anything extra or do anything other than set a global variable.

Regarding our application behavior, I think the most notable characteristic is it's doing a lot of json and string manipulation, especially json marshal and unmarshal. Which I suspect is the cause of high velocity object creation. (Is memory fragmentation a possible culprit? I'm not sure how Go runtime handles memory reuse and fragmentation if service is creating many objects rapidly)

It could be fragmentation, though I don't think we have fragmentation blow-up issues in the Go runtime's allocator like some other allocators have. One way to confirm this would be a high value of float64(Sys - HeapReleased) / float64(HeapInUse) from MemStats. That would suggest a high amount of page-level fragmentation (or that the scavenger isn't doing its job). I know you're having a tough time replicating this now, but a dump of MemStats every once in a while would help a lot in understanding where the RSS is coming from. I mentioned earlier that it could be due to GC pacing, for example.
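
For example, a small sketch of the kind of periodic dump that would help, computing both estimates mentioned above (the one-minute interval is arbitrary):

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// Estimate of memory the runtime is actually holding on to.
		retained := m.Sys - m.HeapReleased
		// A high ratio here suggests page-level fragmentation (or an idle scavenger).
		frag := float64(retained) / float64(m.HeapInuse)
		fmt.Printf("Sys-HeapReleased=%d HeapInuse=%d ratio=%.2f\n",
			retained, m.HeapInuse, frag)
		time.Sleep(time.Minute)
	}
}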

dbakti7

comment created time in 2 months

issue commentgolang/go

runtime: RSS keeps on increasing, suspected scavenger issue

@dbakti7 strace -q -e trace=memory go run main.go is giving you an strace for the go command, not the executable built for main.go. Build it first (go build) then run the executable under strace. The madvise syscalls should then line up with what gctrace reports.

The 0 MiB reported by gctrace is also misleading because it rounds down (sorry about that). Every time you see X MiB released that means there was at least one madvise syscall made. Furthermore, this number is only what the background scavenger does; we also have a heap-growth scavenger which, judging by the gctrace, is doing the heavy lifting.

Finally, Linux's MADV_FREE is known to not actually modify any RSS counters until pages are purged due to memory pressure. MADV_DONTNEED does update RSS counters but is more expensive and makes the next page fault on those pages more expensive. We fall back to that if MADV_FREE is not available, or if GODEBUG=madvdontneed=1 is set. I would suggest trying that out to see if the RSS counters look more reasonable. The runtime thinks that most of the memory is in fact returned to the OS (the "released" counter in gctrace shows this).

With all that said, I'm not sure how to help with the RSS growth in your application unless you provide more information about its behavior (I don't mean specifics about the application, just how it interacts with memory). For example, is the heap itself growing really big? That is, if you run your service under gctrace, do you see the goal rising? If so, that could mean a GC pacing bug, or just that the runtime deems that your application needs that much memory and you haven't reached a steady-state. If it really is just RSS that's going higher than you expect, then it could be that simply setting GODEBUG=madvdontneed=1 will give you the behavior you expect for a minor performance hit.

dbakti7

comment created time in 2 months

issue commentgolang/go

runtime: 1.13.5 continuous CPU consumption with memory balast

@pmalekn I forgot to finish my last comment.

On a separate note, in Go 1.14, the scavenger is now paced according to how long each scavenging operation actually took, so it's more likely to be close to 1%. Trying this out with Go 1.14 and the scavenger actually... halts. Sigh. This is related to #35788. But anyway the high utilization you're seeing is gone.

pmalekn

comment created time in 2 months

issue commentgolang/go

runtime: 1.13.5 continuous CPU consumption with memory balast

@pmalekn Interesting. I notice that you're running this inside docker (and maybe a linux image?). It may be that docker's indirection makes the syscalls to return memory to the OS much more expensive, since running a linux image in docker on darwin spins up a VM.

If I run it directly on darwin, I instead see utilization on the order of 6-7%. Still higher than expected but not as bad as 30%.

pmalekn

comment created time in 2 months

issue closedgolang/go

runtime: 1.13.5 continuous CPU consumption with memory balast

What version of Go are you using (go version)?

$ go version
go version go1.13.5 darwin/amd64

Does this issue reproduce with the latest release?

Yes (1.13.5)

What operating system and processor architecture are you using (go env)?

go env Output:

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/pmalek/Library/Caches/go-build"
GOENV="/Users/pmalek/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/pmalek/.gvm/pkgsets/go1.13.5/global"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/pmalek/.gvm/gos/go1.13.5"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/pmalek/.gvm/gos/go1.13.5/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/zy/lzyp__pd7vx6762s5jcgkkl80000gn/T/go-build447181729=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Run the following snippet

package main

import "time"

func main() {
	balast := make([]byte, 10<<30) // 10G
	_ = balast
	time.Sleep(time.Hour)
}

under go1.13.5 and go1.12.14

What did you expect to see?

No CPU being consumed on both versions

What did you see instead?

Continuous CPU consumption on go1.13.5.

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
95ae3972c485        1.12.14             0.00%               332.1MiB / 7.294GiB   4.45%               758B / 0B           0B / 0B             5
e2b0b1654941        1.13.5              30.10%              332.2MiB / 7.294GiB   4.45%               578B / 0B           0B / 0B             6

after adding GODEBUG=gctrace=1 one can observe scavenger prints:

scvg: 0 MB released
scvg: inuse: 0, idle: 10303, sys: 10303, released: 108, consumed: 10195 (MB)
scvg: 0 MB released
scvg: inuse: 0, idle: 10303, sys: 10303, released: 108, consumed: 10194 (MB)
scvg: 0 MB released
scvg: inuse: 0, idle: 10303, sys: 10303, released: 108, consumed: 10194 (MB)
scvg: 0 MB released

closed time in 2 months

pmalekn

issue commentgolang/go

runtime: 1.13.5 continuous CPU consumption with memory balast

That 10 GiB ballast is freed (notice that inuse is 0) because the compiler determines the ballast value is no longer live before the sleep. The scavenger is trying to reclaim the space, so that object is not actually a ballast because it gets garbage collected. This is all WAI.

It takes a while to return 10 GiB of memory back to the OS (pacing is 1 page per ms, in this case 4 KiB per ms => ~43 minutes) if the system doesn't support transparent huge pages, which darwin does not. I'm not sure why you're seeing 30% CPU utilization specifically, but when running any non-trivial application we haven't seen additional costs associated with the scavenger on that order of magnitude (in fact, it's paced to try to use about 1% of 1 CPU, which is consistent with performance regressions I've encountered). I suspect the increased CPU utilization is actually coming from the fact that scavenging prints very often in gctrace which is a known issue we need to fix (#32952).
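
(Spelled out: 10 GiB is 10,485,760 KiB, and at 4 KiB per millisecond that is 2,621,440 ms, or roughly 43.7 minutes.)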

If you modify the program to instead be:

package main

import "time"

var ballast []byte

func main() {
	ballast = make([]byte, 10<<30)
	time.Sleep(time.Hour)
}

you'll notice with GODEBUG=gctrace=1 that the scavenger is no longer doing any work, since the ballast never dies. Alternatively, you could add a runtime.KeepAlive after the time.Sleep to achieve a similar effect.
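
A sketch of that alternative, for completeness (equivalent in effect, assuming nothing else references the slice):

package main

import (
	"runtime"
	"time"
)

func main() {
	ballast := make([]byte, 10<<30) // 10 GiB
	time.Sleep(time.Hour)
	// Keep the ballast reachable for the duration of the sleep so the GC
	// never frees it and the scavenger has nothing to return to the OS.
	runtime.KeepAlive(ballast)
}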

pmalekn

comment created time in 2 months

issue commentgolang/go

runtime: Windows binaries built with -race occasionally deadlock

@cherrymui that solution makes sense, and if we used it on all platforms, not just Windows, it would also solve other issues, such as signals sent to threads running C code resulting in surprising behavior like in #36281 (IIUC).

bcmills

comment created time in 2 months

issue commentgolang/go

runtime: Windows binaries built with -race occasionally deadlock

Alrighty thanks to @aclements and a Windows laptop we have a reproducer, a theory, and a partial fix.

The problem is a race between SuspendThread and ExitProcess on Windows. The order of events is as follows:

Thread 1: Suspend (asynchronously) Thread 2.
Thread 2: Call ExitProcess, which terminates all threads except Thread 2.
Thread 2: In ExitProcess, receives asynchronous notification to suspend, and stops.

This race is already handled in the runtime for the usual exits by putting a lock around suspending a thread (and effectively disallowing it in certain cases, like exit), but in race mode __tsan_fini (called by racefini) calls ExitProcess instead. The fix is to just grab this lock before calling into __tsan_fini.

Unfortunately this raises a bigger issue: what if C code, called from Go, calls ExitProcess on Windows? We have no way to synchronize asynchronous preemption with that like we do with exits we can actually control. One thought is that ExitProcess already calls a bunch of DLL hooks; could we throw in our own to side-step this issue maybe? More thought on this problem is required.

bcmills

comment created time in 2 months

issue commentgolang/go

runtime: TestPageAllocScavenge/AllFreeUnscavExhaust fails on OpenBSD with 2.5GB RAM

TestPageCacheFlush also fails with: fatal error: failed to reserve page summary memory.

Adding a Skip to TestPageCacheFlush allows the runtime tests to complete with -short.

Good catch. I'll add that one.

Skipping the tests helps, but there is also a noticeable increase in the actual memory requirements for building the compiler under go1.14beta1 with OpenBSD. 2GB memory continues to work for FreeBSD, but not OpenBSD:

$ ./all.bash
Building Go cmd/dist using /home/buildsrv/go1.13.5. (go1.13.5 openbsd/amd64)
Building Go toolchain1 using /home/buildsrv/go1.13.5.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
# cmd/compile/internal/ssa
fatal error: runtime: out of memory

GOGC=70 allows the build to complete successfully (with the fixes mentioned above) in 2GB. I only used 2.5GB earlier to see if the tests would succeed with a little more RAM.

What are you doing to determine if 2 GiB or 2.5 GiB of RAM is enough? Are you trying different machines? Are you varying RLIMIT_DATA?

I suspect this is a combination of the virtual memory use increase (because OpenBSD treats virtual memory like physical memory and limits it per-process) and that the Go compiler might simply be using a little bit more memory than last release. I don't think we're going to take additional steps to curb virtual memory use for this release.

mpx

comment created time in 2 months

issue commentgolang/go

crypto: performance loss with 1.14 beta

Setting the limit back to the pacing of last release brings the regression down further.

name                  old time/op  new time/op  delta
ECDSA/sign-ES256-4    28.9µs ±13%  26.0µs ± 3%  -10.07%  (p=0.002 n=10+10)
ECDSA/sign-ES384-4    4.17ms ±10%  4.11ms ± 3%     ~     (p=0.965 n=10+8)
ECDSA/sign-ES512-4    7.09ms ± 3%  7.15ms ± 1%     ~     (p=0.315 n=10+10)
ECDSA/check-ES256-4   78.4µs ± 2%  76.8µs ± 0%   -2.02%  (p=0.000 n=9+8)
ECDSA/check-ES384-4   8.24ms ± 5%  7.97ms ± 2%   -3.30%  (p=0.006 n=10+9)
ECDSA/check-ES512-4   13.4ms ± 1%  14.3ms ± 2%   +6.46%  (p=0.000 n=10+10)
EdDSA/sign-EdDSA-4    49.6µs ± 2%  49.4µs ± 1%     ~     (p=0.436 n=10+10)
EdDSA/check-EdDSA-4    134µs ± 1%   129µs ± 1%   -4.03%  (p=0.000 n=10+10)
HMAC/sign-HS256-4     1.95µs ± 4%  1.88µs ± 0%   -3.53%  (p=0.000 n=9+9)
HMAC/sign-HS384-4     2.26µs ± 4%  2.12µs ± 1%   -6.19%  (p=0.000 n=9+8)
HMAC/sign-HS512-4     2.28µs ± 2%  2.16µs ± 1%   -5.26%  (p=0.000 n=9+9)
HMAC/check-HS256-4    3.92µs ± 5%  3.96µs ± 1%     ~     (p=0.143 n=10+10)
HMAC/check-HS384-4    4.09µs ± 1%  4.21µs ± 1%   +2.98%  (p=0.000 n=10+10)
HMAC/check-HS512-4    4.52µs ±12%  4.30µs ± 1%     ~     (p=0.676 n=10+9)
RSA/sign-1024-bit-4    347µs ±26%   321µs ± 3%     ~     (p=0.400 n=10+9)
RSA/sign-2048-bit-4   1.46ms ± 7%  1.47ms ± 2%     ~     (p=0.165 n=10+10)
RSA/sign-4096-bit-4   7.95ms ± 1%  8.26ms ± 5%   +3.84%  (p=0.000 n=9+9)
RSA/check-1024-bit-4  26.1µs ± 2%  26.9µs ± 3%   +3.13%  (p=0.000 n=10+9)
RSA/check-2048-bit-4  59.1µs ± 1%  60.7µs ± 2%   +2.57%  (p=0.000 n=10+10)
RSA/check-4096-bit-4   159µs ± 1%   161µs ± 0%   +1.37%  (p=0.000 n=10+10)
[Geo mean]             107µs        105µs        -1.37%

The results aren't super reproducible (hence that -10% is now -4%) since I'm running on a laptop which has the full UI running and perflock doesn't work on darwin/amd64. I'd say that ECDSA/check-ES512 is a bit noisy on my machine (though it would be worth trying more runs, spinning up a new process every time, etc., to capture the noise), so I'm not going to dig any deeper there. Same for HMAC/check-HS384. But I think there may still be a legitimate regression in the RSA/check-*-bit benchmarks after my rate-limiting fix.

Looking at profiles for RSA/check-1024-bit more closely, there are clearly differences and costs have shifted around but nothing jumps out at me.

go1.13.5

Showing nodes accounting for 13.52s, 84.66% of 15.97s total
Dropped 212 nodes (cum <= 0.08s)
Showing top 30 nodes out of 130
      flat  flat%   sum%        cum   cum%
     4.67s 29.24% 29.24%      9.18s 57.48%  math/big.nat.divLarge
     1.73s 10.83% 40.08%      1.73s 10.83%  math/big.mulAddVWW
     1.37s  8.58% 48.65%      1.37s  8.58%  math/big.addMulVVW
     1.17s  7.33% 55.98%      1.17s  7.33%  math/big.subVV
     0.51s  3.19% 59.17%      0.51s  3.19%  runtime.madvise
     0.48s  3.01% 62.18%      0.48s  3.01%  runtime.pthread_cond_wait
     0.42s  2.63% 64.81%      1.81s 11.33%  math/big.basicMul
     0.35s  2.19% 67.00%      0.35s  2.19%  runtime.procyield
     0.29s  1.82% 68.82%      0.29s  1.82%  math/big.shlVU
     0.25s  1.57% 70.38%      0.25s  1.57%  runtime.pthread_cond_signal
     0.24s  1.50% 71.88%      0.24s  1.50%  math/big.shrVU
     0.22s  1.38% 73.26%      0.22s  1.38%  math/big.greaterThan
     0.19s  1.19% 74.45%      0.19s  1.19%  runtime.kevent
     0.17s  1.06% 75.52%      0.17s  1.06%  runtime.pthread_cond_timedwait_relative_np
     0.15s  0.94% 76.46%      1.12s  7.01%  runtime.mallocgc
     0.14s  0.88% 77.33%      0.14s  0.88%  runtime.memclrNoHeapPointers
     0.14s  0.88% 78.21%      0.15s  0.94%  runtime.stkbucket
     0.13s  0.81% 79.02%      0.13s  0.81%  runtime.usleep
     0.12s  0.75% 79.77%      0.12s  0.75%  crypto/sha512.blockAVX2
     0.12s  0.75% 80.53%      0.23s  1.44%  encoding/json.checkValid
     0.10s  0.63% 81.15%      0.10s  0.63%  runtime.procPin
     0.10s  0.63% 81.78%      0.31s  1.94%  sync.(*Pool).Get
     0.09s  0.56% 82.34%      0.09s  0.56%  math/big.nat.norm
     0.07s  0.44% 82.78%      0.11s  0.69%  encoding/base64.(*Encoding).Decode
     0.06s  0.38% 83.16%      0.12s  0.75%  runtime.heapBitsSetType
     0.05s  0.31% 83.47%      9.25s 57.92%  math/big.nat.div
     0.05s  0.31% 83.78%      1.92s 12.02%  math/big.nat.sqr
     0.05s  0.31% 84.10%      0.22s  1.38%  sync.(*Pool).Put
     0.05s  0.31% 84.41%      0.19s  1.19%  sync.(*Pool).pin
     0.04s  0.25% 84.66%      0.12s  0.75%  encoding/json.indirect

go1.14beta1

Showing nodes accounting for 13.07s, 85.37% of 15.31s total
Dropped 246 nodes (cum <= 0.08s)
Showing top 30 nodes out of 116
      flat  flat%   sum%        cum   cum%
     4.21s 27.50% 27.50%      7.45s 48.66%  math/big.nat.divBasic
     1.66s 10.84% 38.34%      1.66s 10.84%  math/big.mulAddVWW
     1.49s  9.73% 48.07%      1.49s  9.73%  math/big.addMulVVW
     1.14s  7.45% 55.52%      1.14s  7.45%  math/big.subVV
     0.96s  6.27% 61.79%      0.96s  6.27%  runtime.kevent
     0.56s  3.66% 65.45%      0.56s  3.66%  runtime.madvise
     0.47s  3.07% 68.52%      2.06s 13.46%  math/big.basicMul
     0.33s  2.16% 70.67%      0.33s  2.16%  math/big.shlVU
     0.19s  1.24% 71.91%      0.19s  1.24%  math/big.shrVU
     0.19s  1.24% 73.15%      0.19s  1.24%  runtime.pthread_cond_wait
     0.17s  1.11% 74.27%      0.65s  4.25%  runtime.mallocgc
     0.16s  1.05% 75.31%      0.16s  1.05%  runtime.memclrNoHeapPointers
     0.14s  0.91% 76.22%      0.14s  0.91%  math/big.greaterThan (inline)
     0.13s  0.85% 77.07%      0.14s  0.91%  runtime.stkbucket
     0.11s  0.72% 77.79%     10.52s 68.71%  math/big.nat.expNN
     0.10s  0.65% 78.45%      0.10s  0.65%  runtime.procyield
     0.10s  0.65% 79.10%      0.28s  1.83%  sync.(*Pool).Get
     0.10s  0.65% 79.75%      0.23s  1.50%  sync.(*Pool).Put
     0.09s  0.59% 80.34%      8.50s 55.52%  math/big.nat.divLarge
     0.09s  0.59% 80.93%      0.09s  0.59%  runtime.nextFreeFast (inline)
     0.09s  0.59% 81.52%      0.09s  0.59%  runtime.pthread_cond_signal
     0.08s  0.52% 82.04%      0.09s  0.59%  math/big.nat.bytes
     0.08s  0.52% 82.56%      0.08s  0.52%  runtime.memmove
     0.08s  0.52% 83.08%      0.30s  1.96%  runtime.newobject
     0.07s  0.46% 83.54%      0.11s  0.72%  encoding/base64.(*Encoding).Decode
     0.07s  0.46% 84.00%      0.10s  0.65%  sync.runtime_procPin
     0.06s  0.39% 84.39%      0.87s  5.68%  encoding/json.(*decodeState).object
     0.05s  0.33% 84.72%      0.12s  0.78%  encoding/json.checkValid
     0.05s  0.33% 85.04%      0.33s  2.16%  math/big.nat.make (inline)
     0.05s  0.33% 85.37%      0.28s  1.83%  math/big.putNat (inline)

@randall77 @FiloSottile any ideas?

pascaldekloe

comment created time in 2 months

issue commentgolang/go

crypto: performance loss with 1.14 beta

Alrighty, I think I found the problem.

It isn't that madvise is too slow, and in fact, the system handles that case quite well.

It's that madvise is too fast on these systems. About four times as fast as the old assumption in Go 1.13 (we assumed 4 KiB takes 10 µs, but here 4 KiB takes about 2.2 µs). Because we no longer make any assumptions, the rate isn't limited, and scavenging too often leads to outsized effects on performance because of goroutine wake-ups and lock contention.

I think a reasonable course of action here is to limit how fast the background scavenger can go. I tried one (runtime) page per millisecond, which is an assumption of 8 KiB taking 10 µs. This is roughly twice our old assumption, and I can confirm that just doing this fixes the performance regression by a lot.

name                  old time/op  new time/op  delta
ECDSA/sign-ES256-4    28.3µs ±24%  26.0µs ± 1%     ~     (p=0.095 n=10+9)
ECDSA/sign-ES384-4    4.04ms ± 2%  4.09ms ± 1%   +1.05%  (p=0.035 n=10+10)
ECDSA/sign-ES512-4    7.09ms ± 2%  7.16ms ± 2%   +0.96%  (p=0.043 n=8+10)
ECDSA/check-ES256-4   80.1µs ± 2%  77.3µs ± 1%   -3.51%  (p=0.000 n=10+8)
ECDSA/check-ES384-4   8.10ms ± 3%  8.00ms ± 2%     ~     (p=0.182 n=10+9)
ECDSA/check-ES512-4   15.0ms ±18%  14.4ms ± 3%     ~     (p=1.000 n=10+10)
EdDSA/sign-EdDSA-4    52.7µs ±14%  50.7µs ± 2%     ~     (p=0.481 n=10+10)
EdDSA/check-EdDSA-4    146µs ±11%   132µs ± 3%  -10.00%  (p=0.000 n=10+10)
HMAC/sign-HS256-4     1.92µs ± 1%  1.90µs ± 1%     ~     (p=0.145 n=9+8)
HMAC/sign-HS384-4     2.15µs ± 1%  2.19µs ± 4%   +1.92%  (p=0.013 n=9+10)
HMAC/sign-HS512-4     2.25µs ± 3%  2.20µs ± 3%     ~     (p=0.052 n=10+10)
HMAC/check-HS256-4    3.98µs ± 3%  3.93µs ± 1%     ~     (p=0.110 n=10+10)
HMAC/check-HS384-4    4.29µs ± 4%  4.31µs ± 5%     ~     (p=0.842 n=9+10)
HMAC/check-HS512-4    4.30µs ± 3%  4.47µs ± 3%   +3.83%  (p=0.001 n=10+9)
RSA/sign-1024-bit-4    315µs ± 8%   317µs ± 3%     ~     (p=0.447 n=10+9)
RSA/sign-2048-bit-4   1.45ms ± 2%  1.46ms ± 3%     ~     (p=0.315 n=8+10)
RSA/sign-4096-bit-4   8.03ms ± 2%  8.13ms ± 2%   +1.28%  (p=0.015 n=10+10)
RSA/check-1024-bit-4  26.2µs ± 1%  27.8µs ± 2%   +6.10%  (p=0.000 n=10+10)
RSA/check-2048-bit-4  59.7µs ± 1%  63.1µs ± 2%   +5.68%  (p=0.000 n=8+9)
RSA/check-4096-bit-4   161µs ± 2%   163µs ± 1%   +1.32%  (p=0.006 n=10+9)
[Geo mean]             107µs        106µs        -0.65%

There are a few benchmarks still showing a regression, which may be for a different reason, or perhaps just noise. I'm going to try going all the way back to our old order-of-magnitude assumption as well and see if that helps at all.

pascaldekloe

comment created time in 2 months

issue commentgolang/go

crypto: performance loss with 1.14 beta

@pascaldekloe @randall77 OK so I still can't reproduce on linux/amd64 after building 1.13.5 myself (go1.14beta1 is consistently a hair faster). The profiles contain runtime.madvise as the 20th entry for 1.13.5 and it doesn't appear in the top 30 for 1.14beta1.

Now I'll go and try on darwin for real.

pascaldekloe

comment created time in 2 months

issue commentgolang/go

crypto: performance loss with 1.14 beta

@FiloSottile ah hah, thank you, that's probably it. I'll keep that in mind for the future.

pascaldekloe

comment created time in 2 months

issue commentgolang/go

runtime: TestPageAllocScavenge/AllFreeUnscavExhaust fails on OpenBSD with 2.5GB RAM

@mpx could you please verify that https://golang.org/cl/212177 lets you run all.bash on your OpenBSD box?

mpx

comment created time in 2 months

issue commentgolang/go

runtime: "attempt to execute system stack code on user stack" during heap scavenging [1.13 backport]

Sorry! Slipped under my radar. On it.

gopherbot

comment created time in 2 months

issue commentgolang/go

crypto: performance loss with 1.14 beta

So here's an interesting bit: I tried to reproduce this on linux/amd64 and I couldn't.

Running under perflock, 1.14beta1 is faster by a few percent. Also, the profiles look completely different: for 1.13.4 it's dominated by runtime.cgocall, but it's just crypto functions for go1.14beta1. I don't really understand what's going on here, but that's not the platform this bug is about, so I'll put that to rest for now.

I'll try to run this on a darwin machine and see what's going on. One big difference this release with scavenging is that we're calling madvise more often (even though, I think we're doing roughly the same amount of scavenging work or less). It's possible that there are larger overheads on darwin for each call to madvise which could make this a problem. On the other hand, the pacing is now time-based to avoid using more than 1% of CPU time of a single core, so even if it was more expensive, it should work itself out.

pascaldekloe

comment created time in 2 months

issue commentgolang/go

runtime: TestPageAllocScavenge/AllFreeUnscavExhaust fails on OpenBSD with 2.5GB RAM

@bcmills OK, I'll put up a change for that.

Linux's overcommit rules in all settings (even the very strict ones!) are totally OK with large PROT_NONE mappings that haven't been touched before AFAICT.

mpx

comment created time in 2 months

issue commentgolang/go

crypto: performance loss with 1.14 beta

That's pretty surprising, since I figured scavenging only got less aggressive this release. The increase is probably related to the changes that were made there. I'll check it out.

pascaldekloe

comment created time in 2 months

issue commentgolang/go

runtime: TestPageAllocScavenge/AllFreeUnscavExhaust fails on OpenBSD with 2.5GB RAM

@bcmills the requirement is only virtual memory, not RAM. We only reserve the pages as PROT_NONE and map in a very small number (somewhere around 10 pages in the worst case). This is not a problem on non-OpenBSD systems. The physical memory footprint of each test execution, for the duration of that execution, is (IIRC) < 100 KiB.

That's the tricky bit. This only qualifies as "intensive" on OpenBSD (and maybe AIX? though I haven't seen any problems there after we made some fixes).

mpx

comment created time in 2 months

issue commentgolang/go

runtime: TestPageAllocScavenge/AllFreeUnscavExhaust fails on OpenBSD with 2.5GB RAM

This is all WAI as of Go 1.14 (see #35568). sysReserve is not allocation, only a reservation (the memory is PROT_NONE and untouched). The tests you're running here are end-to-end and create the same reservations the runtime does. The tests only ever make at most 2 such reservations. When you include the reservations made by the currently executing runtime, you get a total virtual memory footprint of ~1.8 GiB to run these tests.

From #35568 my understanding is that even if you have datasize-max=infinity set, OpenBSD still imposes a per-process virtual memory ceiling based on the amount of physical memory available.

I'd like to avoid disabling these tests on OpenBSD, but perhaps it makes sense to disable them for non-builders so folks can run all.bash locally with the virtual memory requirements of one runtime, not 3.

@aclements @bradfitz WDYT?

mpx

comment created time in 2 months

issue commentgolang/go

runtime: apparent deadlocks on Windows builders

I've had a tough time trying to reproduce this in a gomote to get a stack trace, but I was able to reproduce at least once (unfortunately, I didn't get a stack trace that time...). If it's not too hard, it might be easier to reproduce in CI and the trybots @bradfitz, though I'm not sure how best to do that.

In particular, I've been targeting the "Testing race detector" tests that tend to be the source of failure as @ianlancetaylor mentioned and haven't been able to reproduce this failure on a gomote.

Meanwhile, I'll keep trying.

bcmills

comment created time in 2 months

issue commentgolang/go

runtime: SIGSEGV crash when OOM after goroutines leak on linux/386

AFAICT this is the result of the runtime not handling allocations at the top of the address space well.

We're running on 32-bit platforms, and the arguments to both setSpans (the failing function in the original post) and grow (the failing function at tip that @randall77 produced) are a base address of 0xff000000 and 0x800 pages, which translates to 0x1000000 bytes. Adding that to the base address gives us zero with wrap-around, so it's no wonder it's failing.

I think we need a bunch of overflow checking in more than one place in the runtime to fix this properly. I'm not sure why setSpans would suddenly start having problems in Go 1.13 or even Go 1.12 as that function hasn't been touched in several releases IIRC. I suspect memory behavior just changed in recent releases which is surfacing this behavior in this instance.
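
As a hedged sketch of the kind of check that's needed (the helper name is mine, not the runtime's): the core issue is that base+size can wrap to zero when a reservation ends at the very top of a 32-bit address space, so the end address has to be computed with wrap-around in mind.

```go
// addrRangeEnd computes base+size without silently wrapping. On 32-bit,
// 0xff000000 + 0x1000000 overflows uintptr and wraps to 0; clamp instead so
// callers comparing against the end of the range don't see a tiny limit.
func addrRangeEnd(base, size uintptr) uintptr {
	end := base + size
	if end < base {
		// Overflowed: clamp to the highest representable address.
		return ^uintptr(0)
	}
	return end
}
```

Any loop of the form `for a := base; a < base+size; ...` has the same problem when the range ends at the top of the address space.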

rathann

comment created time in 3 months

issue commentgolang/go

runtime: TestArenaCollision has many "out of memory" failures on linux-ppc64le-power9osu

As a post-mortem, I suspect https://go-review.googlesource.com/c/go/+/207497 actually helped a lot here, since each L2 entry is only 1 MiB in size now, and the page bitmap had the biggest footprint.

https://go-review.googlesource.com/c/go/+/207758 reduces the footprint by an order of magnitude, which should make concerns about a discontiguous address space go away for good with the new page allocator.

bcmills

comment created time in 3 months

issue commentgolang/go

runtime: scavenger not as effective as in previous releases

Sorry, that sounds like they haven't been reviewed yet. By "iterating on them" I mean that they're being iterated on via review passes.

mknyszek

comment created time in 3 months

issue commentgolang/go

runtime: scavenger not as effective as in previous releases

All the patches for this are out for review and I'm iterating on them.

mknyszek

comment created time in 3 months

issue closedgolang/go

runtime: running Go code on OpenBSD gomote fails when not running as root

I don't know what is going on here, but recording since something is wrong.

When I use gomote run with the openbsd-amd64-62 gomote, everything works as expected. When I use gomote ssh to ssh into the gomote, the go tool consistently fails with the following stack trace.

The only obvious difference is that gomote run runs as root but gomote ssh does not.

CC @mknyszek @aclements @bradfitz

fatal error: failed to reserve page bitmap memory

runtime stack:
runtime.throw(0xa4504f, 0x24)
        /tmp/workdir/go/src/runtime/panic.go:1106 +0x72 fp=0x7f7ffffd5018 sp=0x7f7ffffd4fe8 pc=0x4331a2
runtime.(*pageAlloc).init(0xeb7b88, 0xeb7b80, 0xecff58)
        /tmp/workdir/go/src/runtime/mpagealloc.go:239 +0x162 fp=0x7f7ffffd5060 sp=0x7f7ffffd5018 pc=0x428c12
runtime.(*mheap).init(0xeb7b80)
        /tmp/workdir/go/src/runtime/mheap.go:694 +0x274 fp=0x7f7ffffd5088 sp=0x7f7ffffd5060 pc=0x425de4
runtime.mallocinit()
        /tmp/workdir/go/src/runtime/malloc.go:471 +0xff fp=0x7f7ffffd50b8 sp=0x7f7ffffd5088 pc=0x40c5af
runtime.schedinit()
        /tmp/workdir/go/src/runtime/proc.go:545 +0x60 fp=0x7f7ffffd5110 sp=0x7f7ffffd50b8 pc=0x436700
runtime.rt0_go(0x7f7ffffd5148, 0x1, 0x7f7ffffd5148, 0x0, 0x0, 0x1, 0x7f7ffffd5238, 0x0, 0x7f7ffffd523b, 0x7f7ffffd5254, ...)
        /tmp/workdir/go/src/runtime/asm_amd64.s:214 +0x125 fp=0x7f7ffffd5118 sp=0x7f7ffffd5110 pc=0x45eff5

closed time in 3 months

ianlancetaylor

issue commentgolang/go

runtime: running Go code on OpenBSD gomote fails when not running as root

@4a6f656c It's unfortunate that virtual address space is limited in this way, but c'est la vie. I'm glad the sparse array change helped.

We've also been aware of the ulimit -v failure mode on Linux for a while. The default is infinity because virtual address space is cheap (for example, I can make a 2 TiB PROT_NONE mapping without issue on Linux, with either the default or the strict overcommit rules set). Generally speaking, ulimit -v is not the most accurate way to limit physical memory use (it's only per-process!), and cgroups do much better, so I don't think folks tend to use it much anymore.
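
For anyone who wants to check how their system treats large untouched PROT_NONE reservations, here's a small program along the lines of that 2 TiB experiment. It uses only the standard syscall package; the size and flags are just for illustration, and the constant only fits in an int on 64-bit platforms.

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	const size = 2 << 40 // 2 TiB of address space, never touched (64-bit only)
	_, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_NONE, syscall.MAP_ANON|syscall.MAP_PRIVATE)
	fmt.Println("reserve 2 TiB PROT_NONE:", err) // <nil> on a typical Linux box
}
```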

@ianlancetaylor I don't think there's anything else we want to do now in terms of reducing PROT_NONE memory mapped. We tried a few things and they were either quite complicated or had other problems.

ianlancetaylor

comment created time in 3 months

issue closedgolang/go

runtime: make the page allocator scale

Over the course of the last year, we've seen several cases where making relatively minor changes to the allocator slow path, which allocates pages from the heap, caused serious performance issues (#32828, #31678). The problem stemmed largely from contention on the heap lock: a central m-based spinlock (so we can't schedule another g while it's waiting!) which guards nearly all operations on the page heap. Since Go 1.11, (and likely earlier) we've observed barging behavior on this lock in applications which allocate larger objects (~1K) frequently, which indicates a collapsing lock. Furthermore, these applications tend to stay the same or worsen in performance as the number of available cores on the machine increases. This represents a fundamental scalability bottleneck in the runtime.

Currently the page allocator is based on a treap (and formerly had a small linked-list based cache for smaller free spans), but I propose we rethink it and rewrite to something that:

  1. Is more cache-friendly, branch-predictor friendly, etc. on the whole.
  2. Has a lockless fast path.

The former just makes the page allocator faster and less likely to fall into bad paths in the microarchitecture, while the latter directly reduces contention on the lock.

While increased performance across the board is what we want, what we're most interested in solving here is the scalability bottleneck: when we increase the number of cores available to a Go application, we want to see a performance improvement.

Edit: Design doc

I've already built both an out-of-tree and in-tree prototype of a solution which is based on a bitmap representing the whole heap and a radix tree over that bitmap. The key idea with bitmaps here is that we may "land grab" several pages at once and cache them in a P. Two RPC benchmarks, a Google internal one, and one based on Tile38, show up to a 30% increase in throughput and up to a 40% drop in tail latencies (90th, 99th percentile) on a 48-core machine, with the benefit increasing with the number of cores available to the benchmark.

closed time in 3 months

mknyszek

issue commentgolang/go

runtime: make the page allocator scale

Nope! Closing.

mknyszek

comment created time in 3 months

issue commentgolang/go

runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3

@ardan-bkennedy I tried this out on both Linux and macOS and cannot reproduce a difference. On Linux it was consistently faster (but only by a few ms out of a roughly 1-second run, so I'm hesitant to claim any wins here). On macOS it takes roughly the same amount of time either way.

What is GOMAXPROCS for you?

Also, looking at GODEBUG=gcpacertrace=1, the GC utilization is similar overall in each GC (u_a, the output is a bit hard to read), and in Go 1.13.5 already goes higher than 25% for me quite often. I'm not sure whether there's a problem here with that at all.

Also, how are you producing the output you're posting in this issue? I haven't used the trace tool extensively (except for the in-browser viewer) so this is a mode I'm not familiar with.

ardan-bkennedy

comment created time in 3 months

issue commentgolang/go

runtime: HeapSys increases until OOM termination

@randall77 #14045 is related I think, since heap growth should be causing lots of small bits of memory to get scavenged (maybe the scavenger thinks, erroneously, it should be off?)

I suspect that in Go 1.12 and 1.13 it's easier to trigger this because we don't allocate across memory that is scavenged and unscavenged (and consequently this is why @randall77 you see that it maxes out a bit lower, maybe).

savalin

comment created time in 3 months

issue commentgolang/go

runtime: being unable to allocate across scavenged and unscavenged memory leads to fragmentation blow-up in some cases

@savalin My mistake. This doesn't reproduce at tip as long as you put a runtime.GC() call at the very top of the for loop (when all the heap memory is actually dead).

I can get it to increase the same as in #35890. What @randall77 mentioned in that one is pretty much what I'm seeing here: it caps out at around 20 GiB.

savalin

comment created time in 3 months

issue commentgolang/go

runtime: debug.FreeOSMemory not freeing memory

I've uploaded a fix which I've confirmed resolves the issue.

mknyszek

comment created time in 3 months

issue commentgolang/go

runtime: being unable to allocate across scavenged and unscavenged memory leads to fragmentation blow-up in some cases

@savalin I will close this once you're able to verify that this is fixed for you. If you're still experiencing problems, we'll move on from there.

savalin

comment created time in 3 months

issue openedgolang/go

runtime: debug.FreeOSMemory not freeing memory

At tip, debug.FreeOSMemory isn't doing what its documentation says it will do. Notably, HeapReleased doesn't increase all the way to HeapIdle. This failure is obvious in https://github.com/savalin/example/blob/master/main.go and was discovered while investigating #35848.

This is happening because the changes to debug.FreeOSMemory in the page allocator don't do the right thing. The fix is to reset the scavenger address, then scavenge, then reset it again.
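
A rough, self-contained model of that fix; the type and method names are illustrative stand-ins for the runtime's internals, not its real API.

```go
package main

// scavengerModel models only the part of the page allocator relevant here:
// the scavenger searches downward from scavAddr.
type scavengerModel struct {
	scavAddr uintptr // the scavenger searches downward from this address
	heapTop  uintptr // highest address in the heap
}

// reset makes the next scavenge walk the whole heap again.
func (s *scavengerModel) reset() { s.scavAddr = s.heapTop }

// scavengeAll stands in for "release every idle page", walking scavAddr to 0.
func (s *scavengerModel) scavengeAll() { s.scavAddr = 0 }

// freeOSMemory mirrors the fix described above: reset the search address so
// the whole heap is visited, scavenge, then reset again so the background
// scavenger isn't left thinking the heap has already been fully searched.
func (s *scavengerModel) freeOSMemory() {
	s.reset()
	s.scavengeAll()
	s.reset()
}

func main() {
	s := &scavengerModel{heapTop: 1 << 30}
	s.freeOSMemory()
}
```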

This may need to block the beta, but the fix is known and easy. I will put up a change ASAP.

CC @aclements @andybons

created time in 3 months

issue commentgolang/go

HeapIdle is increasing infinitely on big data volume allocations

@savalin FWIW, I can't reproduce this at tip. HeapSys stays nice and low. This leads me to an interesting conclusion. Go 1.12 introduced the concept of treating memory that has been returned to the OS separately from memory that hasn't. This policy was carried through into Go 1.13, but in Go 1.14 we no longer do this, because it would be expensive and cumbersome to do so.

The reason this decision was made in the first place was to accommodate the more aggressive mechanism for returning memory to the OS in Go 1.13 (#30333). The decision was deemed a good idea because jemalloc and other allocators follow a similar rule for returning memory to the OS. It seems like for whatever reason this decision is causing fragmentation which really hurts this application.

Anyway, it shouldn't matter for future releases. This bug is effectively fixed for the Go 1.14 release. Please give the tip of the master branch a try and see if you get the same results.

Finally, this made me realize that debug.FreeOSMemory no longer works as it's supposed to in Go 1.14, so I'll open up an issue and fix that.

CC @aclements, since this is the first consequence we're seeing of that original decision.

Please continue reading if you'd like to see some additional notes I wrote while sussing this out.

  • HeapSys is expected to never decrease since we never unmap memory in the runtime. Instead we call madvise which increases HeapReleased.
  • In Go 1.13 we added a more aggressive scavenging mechanism (#30333) which should keep HeapSys - HeapReleased closer to your actual working set, not accounting for certain accounts of fragmentation.
  • I noticed in the other thread that you also tried GODEBUG=madvdontneed=1, but here you specify darwin/amd64; note that that option only has an effect on Linux, where it will promptly drop RSS counters at the cost of additional page faults (with the default MADV_FREE, the OS lets you overcommit and only kicks out the freed pages when it needs them). We don't have an equivalent that drops RSS counters on darwin/amd64 because it's non-trivial to do, though perhaps we could now that we rely on libSystem.
  • Looking at your example, HeapIdle == HeapReleased after you call debug.FreeOSMemory, which is exactly what that does. I'm not sure why your OOM killer is triggering because HeapSys is just virtual memory. Do you have ulimit -v set? Perhaps this is a Darwin overcommit thing we need to work around. (A small program for watching these counters is sketched after this list.)
  • There were temporarily some issues in Go 1.12/1.13 around fragmentation causing HeapSys to keep getting larger (#31616) because of a bug. We also noticed that there was a problem in Go 1.12 if we decided to prefer unscavenged memory over scavenged memory in the allocator, so we decided to take the best of the two instead, which resolved the issue.
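
Here's a small program for watching the counters mentioned above. HeapSys is virtual memory and should stay roughly flat, while HeapReleased should climb toward HeapIdle after debug.FreeOSMemory; the 1 GiB allocation is just an example workload.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func printStats(tag string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%-20s HeapSys=%d HeapIdle=%d HeapReleased=%d\n",
		tag, m.HeapSys, m.HeapIdle, m.HeapReleased)
}

func main() {
	big := make([]byte, 1<<30)
	for i := 0; i < len(big); i += 4096 {
		big[i] = 1 // touch the pages so they're really backed by memory
	}
	printStats("after allocating")
	big = nil
	debug.FreeOSMemory() // forces a GC, then returns as much memory as possible
	printStats("after FreeOSMemory")
}
```
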
savalin

comment created time in 3 months

issue commentgolang/go

runtime: long GC STW after large allocation (100+ msec waiting for runtime.memclrNoHeapPointers)

@cuonghuutran #31222 is about a large allocation blocking STW in memclrNoHeapPointers as well, not about sweeping. The title is misleading: sweep termination is the phase which brings us into the new mark phase, which forces a STW. The problem is not that the STW is long, but rather that it's taking a long time to STW (the runtime manages to stop every thread except the one clearing this large allocation). In the execution trace, the attempt to STW is counted against pause time as well. In the other issue, a 1 GiB allocation is the root cause of the STW delay.

Clearing in smaller chunks would work at the expense of some throughput. I don't think we could get this in for 1.14 given that it's been an issue for so many releases (the freeze is mostly about fixing new issues) but as @aclements mentioned in the other issue, it dovetails nicely with the recent non-cooperative preemption work. We'll take a look at it again in 1.15.

@ALTree In sum, I believe this is a duplicate of #31222.

cuonghuutran

comment created time in 3 months

issue commentgolang/go

runtime: scavenger's frequent wake-ups interfere with runnext

FWIW I don't think this should block the beta. These frequent wake-ups definitely don't seem to show up in benchmarks.

bcmills

comment created time in 3 months

issue commentgolang/go

runtime: timeout with GDB tests on aix/ppc64

I'm sorry, what I really meant to ask was, do you know for sure that it's the new page allocator it doesn't like? There were a number of other big changes that went into this release (e.g. asynchronous preemption). Maybe it really dislikes getting a lot of signals? I could see that being some very slow path in a gdb port.

Helflym

comment created time in 3 months

issue commentgolang/go

runtime: timeout with GDB tests on aix/ppc64

@Helflym can you be more specific about how it doesn't like the new page allocator? Do you know if https://go-review.googlesource.com/c/go/+/207497/5 helps? (It should reduce the size of the PROT_NONE mappings from 9ish GiB to about 600 MiB).

Helflym

comment created time in 3 months

issue commentgolang/go

runtime: performance issue with the new page allocator on aix/ppc64

@Helflym Would you consider this bug fixed for now? We can open a new issue if we see a need for 60-bit addresses again.

Helflym

comment created time in 3 months

issue commentgolang/go

runtime: TestPingPongHog is flaky

@aclements informed me that TestPingPongHog is trying to check for fairness. Fairness is helped by the "next" mechanism in the scheduler, the 1 element LIFO in front of the per-P FIFO (which is the runqueue). Currently the scavenger, upon wake-up (which it's doing a lot more of now), will be readied onto this LIFO, interfering with the test.

One solution could be to just have the scavenger do more work. But perhaps a better solution is to just call "ready" with "next" as false. This way the scavenger never ends up in this LIFO, and I can confirm that this fixes the test flake (1000 consecutive runs and no failure, with the 1 GiB allocation in front of the test). This fix stays in line with the reasoning that the scavenger should mostly stay out of the way of the rest of the application, and if the scavenger takes longer to get scheduled (i.e. sleeps longer than it wanted to) it'll simply request to sleep for less to account for this overhead, effectively doing more work anyway.
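
To illustrate why next=true interferes with fairness, here's a simplified model of the scheduling structure at play (not the runtime's real scheduler): runnext is a one-element LIFO consulted before the per-P FIFO run queue, so readying the scavenger with next=true lets it jump the queue on every wake-up.

```go
package main

import "fmt"

type g struct{ name string }

type runQueue struct {
	runnext *g   // one-element LIFO, checked first
	fifo    []*g // ordinary FIFO run queue
}

// put enqueues gp. With next=true it takes over the runnext slot, kicking any
// previous occupant to the back of the FIFO.
func (q *runQueue) put(gp *g, next bool) {
	if next {
		if old := q.runnext; old != nil {
			q.fifo = append(q.fifo, old)
		}
		q.runnext = gp
		return
	}
	q.fifo = append(q.fifo, gp)
}

// get returns the next goroutine to run, preferring runnext.
func (q *runQueue) get() *g {
	if gp := q.runnext; gp != nil {
		q.runnext = nil
		return gp
	}
	if len(q.fifo) == 0 {
		return nil
	}
	gp := q.fifo[0]
	q.fifo = q.fifo[1:]
	return gp
}

func main() {
	q := &runQueue{}
	q.put(&g{name: "ping"}, false)
	q.put(&g{name: "pong"}, true)      // pong would normally run next...
	q.put(&g{name: "scavenger"}, true) // ...but a next=true wake-up displaces it
	for gp := q.get(); gp != nil; gp = q.get() {
		fmt.Println(gp.name) // scavenger, ping, pong
	}
}
```

Readying the scavenger with next=false keeps it out of that slot entirely, which is why the flake goes away.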

This also gave me an opportunity to get some numbers for the new self-paced scavenger: each cycle is on the order of 300-1000µs, which means our order-of-magnitude assumption in the old scavenger was mostly on-target. :)

bcmills

comment created time in 3 months

issue commentgolang/go

runtime: TestPingPongHog is flaky

OK! I got somewhere.

TestPingPongHog is all about scheduling goroutines. I was digging through the history around these failures and found some interesting information:

  1. They started happening VERY consistently as of 21445b0 (as I mentioned in a previous comment).
  2. Reproducing this by running the test directly is nigh impossible but...
  3. If I make some big allocation (1 GiB), free it, and runtime.GC() twice at the beginning of the test (which basically ensures the scavenger will be enabled with plenty of work to do), it fails very, very consistently; a sketch of that warm-up follows this list.
  4. At the same time it started happening consistently on Linux, it stopped happening on Windows.
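
A hedged sketch of the warm-up from item 3: dropped in at the top of the test, it leaves the scavenger a large amount of idle heap to return to the OS, so its wake-ups reliably collide with the GOMAXPROCS=1 scheduling the test depends on. The helper name is mine, not the test's.

```go
package main

import "runtime"

func primeTheScavenger() {
	big := make([]byte, 1<<30) // 1 GiB
	for i := 0; i < len(big); i += 4096 {
		big[i] = 1 // fault the pages in so the heap actually grows
	}
	big = nil
	// Two GC cycles so the allocation is truly dead and the heap goal drops,
	// giving the background scavenger plenty of work.
	runtime.GC()
	runtime.GC()
}

func main() {
	primeTheScavenger()
	// ... run the scheduling-sensitive code here ...
}
```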

This implies that the frequent wake-ups from the scavenger are having a detrimental effect, at least on Linux (and perhaps a beneficial one on Windows?). Keep in mind that GOMAXPROCS=1 for this test, so I'm more inclined to think now that it's the wake-ups themselves causing issues.

I think there are actually two issues at play here: one introduced around Oct 31st when Windows started failing, and the scavenger sleeping/waking more often. I'll tackle the latter first and come back to the former if I can get that fixed.

Note also that my fixes to #35788 make the test fail around 1 in 500 times when run directly (as opposed to not reproducible at all).

bcmills

comment created time in 3 months

issue openedgolang/go

runtime: scavenger not as effective as in previous releases

Recent page allocator changes (#35112) forced changes to scavenger as well. AFAICT these changes came with two problems.

Firstly, the scavenger maintains an address which it uses to mark which part of the address space has already been searched. Unfortunately, the updates of this value are racy, and it's difficult to determine what kind of check makes sense to prevent these races. Today these races mean that the scavenger can miss updates and end up not working for a whole GC cycle (since it only runs down the heap once), or it could end up iterating over address space that a partially concurrent scavenge (e.g. from heap growth) had already looked at. The effect on application performance is a higher average RSS than in Go 1.13.

Secondly, the scavenger is awoken on each "pacing" update, but today that only means an update to the goal. Because the scavenger is now self-paced, this wake-up is mostly errant, and is generally not a good indicator that there's new work to be done. What this means for application performance is that it might be scavenging memory further down the heap than it should (violating some principles of the original scavenge design) and thus causing undue page faults, resulting in worse CPU performance than in Go 1.13.

I suggest that we fix these for Go 1.14 (fixes are already implemented), but they need not block the beta.

created time in 3 months

issue commentgolang/go

runtime: "found bad pointer in Go heap" on linux-mips64le-mengzhuo builder

@mengzhuo Only the scavenger (mgcscavenge.go) operates in terms of physical pages, the runtime page size is always 8 KiB on all platforms, so 8 KiB alignment is exactly what we want.

It could be related to the page allocator, but the physical page size here doesn't matter, I don't think.

bcmills

comment created time in 3 months

issue commentgolang/go

runtime: TestPingPongHog is flaky

I'll take a look into this since Austin is currently focused on the memory corruption issues.

bcmills

comment created time in 3 months

issue commentgolang/go

runtime: STW GC stops working

I believe I found the problem, and it's actually the deadlock detector detecting a deadlock! Unfortunately the deadlock happens during a STW GC so checkdead observes runnable goroutines and panics.

The problem is that after 7b294cdd8 we no longer hold onto worldsema for the duration of a GC; we hold gcsema instead. However, in STW GC we don't release worldsema before calling Gosched on the user goroutine, and because of how STW GC is implemented we will need to stop the world again before the GC is complete; we'll be unable to, because that user goroutine, now descheduled, is still holding worldsema. The fix is easy: just drop worldsema before calling Gosched.

cherrymui

comment created time in 3 months

issue commentgolang/go

runtime: "fatal error: mSpanList.insertBack" in mallocgc

@myitcv Thanks!

@bcmills The span is indeed coming out of the new allocator, but the problem here is that its next field is non-nil. And in fact, it's some really bizarre value (0x4115c774f9191e87) which doesn't look like it's in any range familiar to me. In that way, it's similar to some of the other bugs (and the value is equally bizarre there as well). The span pointer itself looks reasonable.

I've just confirmed that any span returned by mheap_.alloc (called by mcentral.grow, which is where the span above is coming from) has init called on it (unless it's nil), which zeroes out the next field explicitly. In other words, even if there were a problem with, say, a freed span being left on some linked list, it would never actually manifest in this way, because we always zero the field which is being checked here.

Based on the bizarre value (disclaimer: bizarre to me, maybe someone else has better insights), my next guess is that this is the manifestation of another memory corruption bug such as #35592.

I'll keep thinking about it.

myitcv

comment created time in 3 months

issue commentgolang/go

runtime: TestPhysicalMemoryUtilization failures

Found the problem. It's that the page cache was hiding memory from the scavenger (which is intentional). Just uploaded a fix to the test to account for that.

ianlancetaylor

comment created time in 3 months

issue commentgolang/go

runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3

@ardan-bkennedy Interesting. I wonder to what extent we're seeing additional start-up costs here, considering that the application only runs for about a second (though 70 ms is kind of a lot, I wonder what the distributions look like).

ardan-bkennedy

comment created time in 3 months

issue commentgolang/go

runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3

@mknyszek I am running on a Mac. I need time to test this on linux.

Side Note: I find it interesting that you consider Darwin a less popular platform when most developers I know are working on that platform?

@ardan-bkennedy That's my mistake, I omitted it by accident. I do consider it a popular platform. Please give it a try.

ardan-bkennedy

comment created time in 3 months

issue commentgolang/go

runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3

@ianlancetaylor Ah! Sorry. I completely forgot.

@ardan-bkennedy If it's on linux, windows, freebsd, feel free to try again from tip any time. :) Still working out some issues on the less popular platforms.

ardan-bkennedy

comment created time in 3 months

issue commentgolang/go

runtime: performance issue with the new page allocator on aix/ppc64

@Helflym Yeah that's what I figured. I think for this release we're going to stick with the 48-bit address space assumption as the fix.

I tried to do an incremental mapping approach wherein we would keep making mappings twice the size, copying data between them each time (there are at most 64 times that we would perform this doubling, and the copying wouldn't actually take very long, a very conservative upper bound would be 500µs for a 1 TiB heap, and it would only happen once).
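
A sketch of that doubling idea, using a plain slice as a stand-in for the reserved summary mapping (the real version would reserve address space and copy between mappings, not call make):

```go
// growSummaries returns a summary slice large enough for need entries,
// doubling the capacity each time it runs out. At most ~64 doublings are
// possible for a 64-bit address space, and the copy is bounded and rare.
func growSummaries(cur []uint64, need int) []uint64 {
	newCap := cap(cur)
	if newCap == 0 {
		newCap = 1
	}
	for newCap < need {
		newCap *= 2
	}
	if newCap == cap(cur) {
		return cur[:need]
	}
	grown := make([]uint64, need, newCap) // stand-in for a new, larger reservation
	copy(grown, cur)
	return grown
}
```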

This seemed like a great idea in principle, but there are some sharp edges here. The following is more for me to document why we ended up not doing this, so I apologize for the low level of specifics in the rest of this comment:

Firstly, picking the contiguous section of the summaries that the mapping represents is tricky. Ideally you want some kind of "centering" in case the heap keeps growing down in the address space. Next, we want to maintain the illusion to the page allocator that this mapping is actually one enormous mapping, but it's possible to get an address from mmap that's too low to store the pointer, and we'd have to rely on the compiler's slice calculations handling the overflow correctly. It would work, but could lead to unexpected bugs in the future. The alternative would be to maintain an explicit offset and apply it everywhere, but this significantly complicates the code.

We also considered doing a sparse-array approach for the summaries, but that complicates the code a lot as well, since each summary level is a different size (generics over constants would fix this, but that's a slippery slope).

Helflym

comment created time in 3 months

issue commentgolang/go

runtime: performance issue with the new page allocator on aix/ppc64

@ianlancetaylor I +2'd both changes.

@Helflym if you can confirm that the patches work on/fix AIX, we can land them any time and unblock the builder (modulo some minor comments, submit when ready).

Helflym

comment created time in 3 months

issue commentgolang/go

runtime: performance issue with the new page allocator on aix/ppc64

@Helflym Ahhh... I spoke too soon. There are some problems. I think landing the arenaBaseOffset change is the way to go for now to unblock the builder.

Helflym

comment created time in 3 months

issue commentgolang/go

runtime: performance issue with the new page allocator on aix/ppc64

@Helflym I'm working on a couple of patches so that this shouldn't be a problem for the foreseeable future (involving incremental mapping). I hope to have them up for review today.

Helflym

comment created time in 3 months

issue commentgolang/go

runtime: TestPhysicalMemoryUtilization failures

I suspect the problem here is that the scavenger's scavAddr may already be too low (for some reason, and only by a little bit) to adequately achieve the desired physical memory utilization. I have some ideas on how to fix it and will come back to this bug shortly.

ianlancetaylor

comment created time in 3 months

issue commentgolang/go

runtime: TestArenaCollision has many "out of memory" failures on linux-ppc64le-power9osu

I suspect the problem is that TestArenaCollision makes the heap very discontiguous, and we make a contiguity assumption in the page allocator (so too much memory ends up mapped and the system can't handle it, if other processes are running; in essence, it's an overcommit issue). I'm working on a couple of patches which should hopefully fix this and a couple of other issues.

bcmills

comment created time in 3 months

issue commentgolang/go

runtime: performance issue with the new page allocator on aix/ppc64

It would be better to use mprotect and still assume a 60-bit address space just in case AIX changes its mmap policy (since this isn't documented anywhere).

What policy do you mean by this? The fact that mmap addresses start after 0x0a00000000000000? I don't think that will happen in the near future, and if it does, there will be a way to still use this segment. AIX has a strict compatibility policy: everything compiled on a previous version must run as-is on all following ones. Therefore, there are many ways to keep older behaviors when running a newly compiled process.

Yeah, that's what I meant. If it's that strict then perhaps it's OK. @aclements?

However, the fact that mmap takes 1 second to run makes this plan dead-on-arrival. Perhaps the arenaBaseOffset is the right way to go in this case, and to just deal with changes to AIX's mmap in the future?

mmap (and munmap afterwards) is taking so long because the memory area reserved is really huge. Isn't it possible to allocate s.chunks incrementally? Or have several levels of chunks (like is done in the arena with arenaL1 and arenaL2)? At the moment, only AIX is facing issues, but other OSes might have the same problem in the future, especially because amd64 already provides a way to mmap 57-bit addresses (according to malloc.go).

While that's true, very large mappings are not nearly as costly on other systems (though to be honest I only have hard numbers for Linux right now, and for 32 TiB PROT_NONE mappings it's <10 µs). A big difference between s.chunks and the arenas array is that s.chunks is mapped PROT_NONE, which theoretically means the OS shouldn't have to do anything expensive. The only other issues we've run into so far are artificial limits imposed by some systems (#35568).

With that said, we're exploring an incremental mapping approach. We could also add a layer of indirection but this complicates the code more than the incremental mapping approach @aclements suggested (IMO), and is probably safer considering we're in the freeze. Anything else that limits how much address space we map is likely going to be more complex than what we have now, which is why I proposed the current approach in the first place.

Helflym

comment created time in 3 months

issue commentgolang/go

runtime: performance issue with the new page allocator on aix/ppc64

It would be better to use mprotect and still assume a 60-bit address space just in case AIX changes its mmap policy (since this isn't documented anywhere). If it does change its policy, then existing Go binaries will break on new versions of AIX.

However, the fact that mmap takes 1 second to run makes this plan dead-on-arrival. Perhaps the arenaBaseOffset is the right way to go in this case, and to just deal with changes to AIX's mmap in the future?

Helflym

comment created time in 3 months

issue commentgolang/go

runtime: running Go code on OpenBSD gomote fails when not running as root

@ianlancetaylor I investigated this with someone who knows OpenBSD a bit.

The number that's checked in the kernel for a PROT_NONE anonymous mapping is RLIMIT_DATA, which is limited for non-root users in login.conf. We can fix this on our builders by making datasize-cur and datasize-max unlimited, but if OpenBSD has a low default then that's a problem, since Go no longer works out of the box on a newly-installed OpenBSD image.

Perhaps there's a workaround here but I need to give it some thought.

It's a little bit weird to me that a PROT_NONE mapping counts toward this on any platform. Linux does this too, but there the default virtual address space limit is unlimited for everyone, not 768 MiB.

ianlancetaylor

comment created time in 3 months

issue commentgolang/go

runtime: running Go code on OpenBSD gomote fails when not running as root

If I were to guess, there's an RLIMIT_AS set for non-root users or something. I'll look into this now.

ianlancetaylor

comment created time in 3 months

issue commentgolang/go

runtime: VirtualFree of 0 bytes failed with errno=487 (ERROR_INVALID_ADDRESS)

@heschik confirmed this fixes the problem in the x/tools test (10 consecutive runs and no failure; the tests in that package take a while to run). It's a bit wild to me how consistently that test was able to trigger this, but it was nice to have a consistent reproducer. :)

heschik

comment created time in 3 months

issue commentgolang/go

runtime: VirtualFree of 0 bytes failed with errno=487 (ERROR_INVALID_ADDRESS)

I think I found the problem. Will push a fix in a bit.

heschik

comment created time in 3 months

issue commentgolang/go

runtime: VirtualFree of 0 bytes failed with errno=487 (ERROR_INVALID_ADDRESS)

Ah, excellent. Thank you @heschik.

This supports my hypothesis, though it's curious that it's the same test every time. It's probably just quite good at tickling some edge case.

I was able to reproduce once in a gomote (unfortunately it dumped the output into my terminal window and I lost all the useful info) but it's timing out now. Now that I can reproduce though it should be easier to get to the bottom of this.

heschik

comment created time in 3 months
