Bryan Boreham (bboreham) · Weaveworks · London, UK · Director of Engineering, Weaveworks

bboreham/coatl 2

running, conducting, directing

bboreham/alertmanager 0

Prometheus Alertmanager

bboreham/amazon-vpc-cni-k8s 0

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS

bboreham/argo 0

Quest in pursuit of the Golden Fleece in Forex chaos

bboreham/arping 0

Native Go library to ping a host via an ARP datagram, or query a host's MAC address

bboreham/avalanche 0

Prometheus/OpenMetrics endpoint series generator for load testing.

bboreham/aws-workshop-for-kubernetes 0

AWS Workshop for Kubernetes

bboreham/cadvisor 0

Analyzes resource usage and performance characteristics of running containers.

bboreham/capnproto 0

Windows/MSVC port of Cap'n Proto serialization/RPC system

PR opened weaveworks/scope

Reviewers
Add tracing for pipe operations; updates weaveworks, jaeger and aws dependencies

Part of trying to track down #3684.

Need to update weaveworks/common to get a version of the tracer which works with websockets (https://github.com/weaveworks/common/pull/173).

Add some spans for Consul operations related to pipes, which in turn need .

+137274 -29864

0 comment

601 changed files

pr created time in 2 days

Pull request review comment cortexproject/cortex

Release 0.6.0

  ## master / unreleased
 
+## 0.6.0 / 2020-01-22
+
 Note that the ruler flags need to be changed in this upgrade. You're moving from a single node ruler to something that might need to be sharded. If you are running with a high `-ruler.num-workers` and if you're not able to execute all your rules in `-ruler.evaluation-interval`, then you'll need to shard.
 Further, if you're using the configs service, we've upgraded the migration library and this requires some manual intervention. See full
-instructions below to upgrade your Postgres.
+instructions below to upgrade your PostgreSQL.
 
-* [CHANGE] Remove unnecessary configs/flags from the ruler ring config to align with the pattern used in the distributor ring. #1987
-    * Ruler ring related flags are now all prefixed with `ruler.ring.` as opposed to just `ruler.`
-    * Changed the default value for `-ruler.ring.prefix` from `collectors/` to `rulers/` in order to not clash with other keys (ie. ring) stored in the same key-value store.
 * [CHANGE] The frontend component now does not cache results if it finds a `Cache-Control` header and if one of its values is `no-store`. #1974
 * [CHANGE] Flags changed with transition to upstream Prometheus rules manager:
-  * `ruler.client-timeout` is now `ruler.configs.client-timeout` in order to match `ruler.configs.url`
-  * `ruler.group-timeout`has been removed
-  * `ruler.num-workers` has been removed
-  * `ruler.rule-path` has been added to specify where the prometheus rule manager will sync rule files
+  * `ruler.client-timeout` is now `ruler.configs.client-timeout` in order to match `ruler.configs.url`.

Add change to default ruler.evaluation-interval

pracucci

comment created time in 2 days

Pull request review comment cortexproject/cortex

Release 0.6.0

  ## master / unreleased
 
+## 0.6.0 / 2020-01-22
+
 Note that the ruler flags need to be changed in this upgrade. You're moving from a single node ruler to something that might need to be sharded. If you are running with a high `-ruler.num-workers` and if you're not able to execute all your rules in `-ruler.evaluation-interval`, then you'll need to shard.

This text can go, I think. It was based on an incorrect diagnosis.

pracucci

comment created time in 2 days

Commit comment event

issue opened opentracing-contrib/go-stdlib

Add an option to force sampling before tags are set

OpenTracing conventions say that sampling.priority > 0 should capture the trace. However, when using Jaeger, any tags set before this are ignored: https://github.com/jaegertracing/jaeger-client-go/issues/391.

I think this could be done via spanObserver in nethttp.Middleware, if that were called before the method, url, etc., are set.
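For illustration, a rough sketch of the existing MWSpanObserver hook this could hang off (the option requested here does not exist yet; today the observer runs after the method/url tags are set, which is exactly the problem):

```go
package main

import (
	"net/http"

	"github.com/opentracing-contrib/go-stdlib/nethttp"
	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

func main() {
	tracer := opentracing.GlobalTracer()
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {})

	// Today MWSpanObserver fires after nethttp has already set the method/url
	// tags; the request in this issue is for an equivalent hook that fires
	// before them, so the forced sampling takes effect for those tags in Jaeger.
	mw := nethttp.Middleware(tracer, handler,
		nethttp.MWSpanObserver(func(span opentracing.Span, r *http.Request) {
			ext.SamplingPriority.Set(span, 1) // ask the tracer to capture this trace
		}))

	_ = http.ListenAndServe(":8080", mw)
}
```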

created time in 2 days

Pull request review comment cortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (d *Distributor) emitBillingRecord(ctx context.Context, buf []byte, samples
 
 	now := time.Now().UTC()
 	hasher := sha256.New()
-	hasher.Write(buf)
+	_, _ = hasher.Write(buf)

I think I had a lot of trouble stemming from the way it doesn't re-read the config file unless it thinks something else has changed. So you need to touch the file or make clean or something like that. I can't remember exactly.

zendern

comment created time in 2 days

issue comment cortexproject/cortex

Ruler never unregisters group metrics

Upstream issue https://github.com/prometheus/prometheus/issues/6689

bboreham

comment created time in 2 days

issue opened prometheus/prometheus

Rule group metrics are not unregistered when the group is removed

When Prometheus sees a new rules file it will register new metrics here

However, there is no code to remove those metrics when the rule file is removed, so metrics like prometheus_rule_group_last_evaluation_timestamp_seconds keep being scraped at their last value until Prometheus restarts. This makes those metrics hard to use in dashboards, alerts, etc.
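A minimal sketch, not the Prometheus implementation, of the kind of cleanup that is missing (the metric name is from above; the label name and helper are assumptions):

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Per-group gauge, registered once; each rule group gets its own label value.
var lastEvaluation = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "prometheus_rule_group_last_evaluation_timestamp_seconds",
		Help: "Timestamp of the last rule group evaluation.",
	},
	[]string{"rule_group"},
)

// removeGroup is the kind of call that is missing: without it the stale series
// keeps being scraped at its last value until Prometheus restarts.
func removeGroup(groupKey string) {
	lastEvaluation.DeleteLabelValues(groupKey)
}

func main() {
	prometheus.MustRegister(lastEvaluation)
	lastEvaluation.WithLabelValues("rules.yml;my-group").SetToCurrentTime()
	removeGroup("rules.yml;my-group") // called when the rule file/group is removed
}
```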

created time in 2 days

issue opened cortexproject/cortex

Ruler never unregisters group metrics

When a new group is created (e.g. on resharding, when a ruler starts or stops), it registers metrics with Prometheus, but there is no code to unregister them when the group stops being used (e.g. after another resharding). Over time this builds up a substantial number of useless metrics.

The metric registration is done in Prometheus code; ruler calls Update() with a list of files, but nobody is checking which files have disappeared since last update.
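A hedged sketch of the missing bookkeeping: remember the file list passed to the previous Update() and clean up after anything that has disappeared (unregisterGroupMetrics is a hypothetical helper, not a Cortex function):

```go
package main

// previousFiles remembers what the last update() call saw.
var previousFiles = map[string]struct{}{}

// unregisterGroupMetrics is a hypothetical helper that would delete or
// unregister the per-group metrics belonging to one rule file.
func unregisterGroupMetrics(file string) {}

func update(files []string) {
	current := make(map[string]struct{}, len(files))
	for _, f := range files {
		current[f] = struct{}{}
	}
	// Anything in the previous set but not the current one was resharded away.
	for f := range previousFiles {
		if _, stillThere := current[f]; !stillThere {
			unregisterGroupMetrics(f)
		}
	}
	previousFiles = current
}

func main() {
	update([]string{"tenant-a.yml", "tenant-b.yml"})
	update([]string{"tenant-a.yml"}) // tenant-b.yml disappeared: clean up its metrics
}
```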

created time in 2 days

Pull request review comment cortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (f *Frontend) RoundTripGRPC(ctx context.Context, req *ProcessRequest) (*Pro
 	tracer, span := opentracing.GlobalTracer(), opentracing.SpanFromContext(ctx)
 	if tracer != nil && span != nil {
 		carrier := (*httpgrpcHeadersCarrier)(req.HttpRequest)
-		tracer.Inject(span.Context(), opentracing.HTTPHeaders, carrier)
+		_ = tracer.Inject(span.Context(), opentracing.HTTPHeaders, carrier)

sure, whatever.

zendern

comment created time in 2 days

Pull request review comment cortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func InstrumentMiddleware(name string) Middleware {
 	return MiddlewareFunc(func(next Handler) Handler {
 		return HandlerFunc(func(ctx context.Context, req Request) (Response, error) {
 			var resp Response
-			err := instrument.TimeRequestHistogram(ctx, name, queryRangeDuration, func(ctx context.Context) error {
+			err := instrument.CollectedRequest(ctx, name, instrument.NewHistogramCollector(queryRangeDuration), instrument.ErrorCode, func(ctx context.Context) error {

TimeRequestHistogram is commented as deprecated.

zendern

comment created time in 2 days

Pull request review comment cortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (c *Memcached) Fetch(ctx context.Context, keys []string) (found []string, b
 
 func (c *Memcached) fetch(ctx context.Context, keys []string) (found []string, bufs [][]byte, missed []string) {
 	var items map[string]*memcache.Item
-	instr.CollectedRequest(ctx, "Memcache.GetMulti", c.requestDuration, memcacheStatusCode, func(_ context.Context) error {
+	_ = instr.CollectedRequest(ctx, "Memcache.GetMulti", c.requestDuration, memcacheStatusCode, func(_ context.Context) error {

I can't see much more we could do with the error. Maybe the inner function should just return nil? Broadly, Cortex treats caches as something we can do without.

zendern

comment created time in 2 days

Pull request review comment cortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (a dynamoDBStorageClient) BatchWrite(ctx context.Context, input chunk.Write
 			if awsErr, ok := err.(awserr.Error); ok && ((awsErr.Code() == dynamodb.ErrCodeProvisionedThroughputExceededException) || request.Retryable()) {
 				logWriteRetry(ctx, requests)
 				unprocessed.TakeReqs(requests, -1)
-				a.writeThrottle.WaitN(ctx, len(requests))
+				_ = a.writeThrottle.WaitN(ctx, len(requests))
  1. can't happen; it's a hard error from DynamoDB to go over the limit.
  2. from #1361 "I did wonder whether to remove the backoff object, but it's still handling max retries and cancellation, so left it in place."
zendern

comment created time in 2 days

pull request comment prometheus/prometheus

rules: use evalTimestamp more accurate.

This confused me too. Perhaps it should be named lastExecutionStartTimestamp?

johncming

comment created time in 2 days

Pull request review comment cortexproject/cortex

Remove error return from Cache.Stop()

 func (c *Memcached) Store(ctx context.Context, keys []string, bufs [][]byte) {
 }
 
 // Stop does nothing.
-func (c *Memcached) Stop() error {
+func (c *Memcached) Stop() {
 	if c.inputCh == nil {

The Go way is to return early and do the work outside of an if.
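Roughly the shape being suggested, as a sketch rather than the actual Cortex method (field names are assumptions based on the diff above):

```go
package cache

import "sync"

// Illustrative type only; the real Memcached struct in Cortex differs.
type Memcached struct {
	inputCh chan struct{}
	wg      sync.WaitGroup
}

// Stop uses the early-return shape: handle the "nothing to do" case first,
// then do the real work without extra indentation.
func (c *Memcached) Stop() {
	if c.inputCh == nil {
		return
	}
	close(c.inputCh)
	c.wg.Wait()
}
```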

bboreham

comment created time in 3 days

Pull request review comment cortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func ReuseSlice(slice []PreallocTimeseries) {
 	for i := range slice {
 		ReuseTimeseries(slice[i].TimeSeries)
 	}
-	slicePool.Put(slice[:0])
+	slicePool.Put(slice[:0]) //nolint:staticcheck

We cannot pool these as pointer-to-slice because the place we use them is in WriteRequest which is generated from Protobuf and we don't have an option to make it a pointer.

So the warning is correct: we generate 24 bytes of garbage every time a PreallocTimeseries slice is re-used. But since the pooled slices themselves are far, far larger, we come out ahead.

I think there should be a comment on slicePool to explain this.
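Roughly the comment being asked for, attached to an illustrative slicePool declaration (the capacity and surrounding scaffolding are placeholders, not the Cortex source):

```go
package client

import "sync"

// PreallocTimeseries stands in for the real Cortex type; fields elided.
type PreallocTimeseries struct{}

// slicePool holds []PreallocTimeseries values rather than pointers because the
// slice lives inside the protobuf-generated WriteRequest, which cannot be
// changed to hold a pointer. staticcheck (SA6002) is right that every Put
// allocates a ~24-byte slice header, but the pooled backing arrays are far
// larger, so pooling still comes out ahead.
var slicePool = sync.Pool{
	New: func() interface{} {
		return make([]PreallocTimeseries, 0, 100) // capacity is illustrative
	},
}
```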

zendern

comment created time in 3 days

Pull request review comment cortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func NewChunkFetcher(cfg cache.Config, cacheStubs bool, storage ObjectClient) (*
 func (c *Fetcher) Stop() {
 	close(c.decodeRequests)
 	c.wait.Wait()
-	c.cache.Stop()
+	_ = c.cache.Stop()

I vote to remove the error return - #2030

zendern

comment created time in 3 days

PR opened cortexproject/cortex

Remove error return from Cache.Stop()

All code paths return nil as the error, so we can simplify the code.

(Redis is the only one that looks as if it might return an error, but one level down it never does.)
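For context, a sketch of the shape of the change (the method set is approximate, not the exact Cortex interface):

```go
package cache

import "context"

// Cache is sketched here to show the simplification: Stop() used to return an
// error, but every implementation returned nil.
type Cache interface {
	Store(ctx context.Context, keys []string, bufs [][]byte)
	Fetch(ctx context.Context, keys []string) (found []string, bufs [][]byte, missing []string)
	Stop() // was: Stop() error
}
```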

+18 -26

0 comment

11 changed files

pr created time in 3 days

create branch cortexproject/cortex

branch : no-cache-stop-error

created branch time in 3 days

issue comment kubernetes/kubernetes

daemonSet pod with 'hostNetwork: true' has no route to service net

This issue has an answer at https://github.com/kubernetes/kubernetes/issues/56934#issuecomment-350243730 that was accepted by the reporter. If you are indeed having the same problem, you have an answer: you don't need to re-open.

If your problem is not exactly the same please click the "New issue" button.

gintautassulskus

comment created time in 3 days

push event weaveworks/scope

Bryan Boreham

commit sha 7dc7215a264fc8fa65e8ae9811d0502cc442fc9e

Refactor: improve readability based on review feedback

view details

push time in 3 days

push event weaveworks/scope

Bryan Boreham

commit sha 92b8a489e72cb6be442df8a7f365ead3c3500594

kubernetes: detect more 'pause' containers

Dockershim has added a label `io.kubernetes.docker.type` for at least four years, where the pause container is of type `podsandbox`. This should be more reliable than trying to keep up with everyone's name for the pause container.

view details

Bryan Boreham

commit sha 880daa78ff7daa051e18e060dd47435e61e6d366

Extend K8s tagger test to cover pause containers

view details

Bryan Boreham

commit sha 53297eb07cc900dff24e697c9ed70abfb31ee9f8

Merge pull request #3743 from weaveworks/more-pause

kubernetes: detect more 'pause' containers

view details

push time in 3 days

delete branch weaveworks/scope

delete branch : more-pause

delete time in 3 days

PR merged weaveworks/scope

Reviewers
kubernetes: detect more 'pause' containers

Dockershim has added a label io.kubernetes.docker.type for at least four years, where the pause container is of type podsandbox. This should be more reliable than trying to keep up with everyone's name for the pause container.

I don't like the way we do this in two places (one probe, one app), but it seemed a bit complicated to unwind that right now.

I left in the old name-based code just in case; I don't think it is necessary any more.
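The detection described above boils down to a check like this sketch (the label key and value come from the PR text; the function and package are illustrative):

```go
package probe

// dockerTypeLabelKey is set by dockershim on every container it creates;
// the pause container gets the value "podsandbox".
const dockerTypeLabelKey = "io.kubernetes.docker.type"

func isPauseContainer(containerLabels map[string]string) bool {
	return containerLabels[dockerTypeLabelKey] == "podsandbox"
}
```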

+40 -3

1 comment

3 changed files

bboreham

pr closed time in 3 days

push event weaveworks/scope

Bryan Boreham

commit sha 85d2f6309c8a7565c667788fd0fd4cdc9ee298a5

performance: Send active controls as a single string per node

Instead of a whole extra data structure which is quite expensive to marshal and unmarshal, just send the information in a string. No clever merging strategy is required - the states are all set in one place per node type.

view details

Bryan Boreham

commit sha c88be40b193e824c91ae3df8054d0227c6da3363

performance: Update plugins to new-style controls data

view details

Bryan Boreham

commit sha eb381f167d11b1804b9630620e1527ec9c39d609

refactor: move Report backwards-compatibility code into its own file

Makes report.go a little easier to read.

view details

Bryan Boreham

commit sha 635cea0b5614cdecc02e2aff51d8f00cb980cc9e

backwards-compatibility: unmarshall latestControls data from older probes

With a test.

view details

Bryan Boreham

commit sha 1dcdfab05af2c1991b29a523e13d82e19be022a4

fixup: from review feedback

Fix a logic error in ECS scale-down button, bad copy/paste in ActiveControls() and neaten the switch cases in container controls.

Co-Authored-By: Filip Barl <filip@weave.works>

view details

Bryan Boreham

commit sha a375a54546cd7a8769699753d377f8432f59224d

Merge pull request #3714 from weaveworks/simplify-control-serialisation

performance: send active controls as a single string per node

view details

push time in 3 days

delete branch weaveworks/scope

delete branch : simplify-control-serialisation

delete time in 3 days

PR merged weaveworks/scope

Reviewers
performance: send active controls as a single string per node

Scope "controls" are buttons to do something like restart a container. These controls can be turned on and off by the probe, e.g. depending on whether the container is running.

Previously this information was conveyed by a complicated data structure mapping each control name to a Dead boolean with a last-modified timestamp, all of which takes effort to send and merge. This PR changes the representation to a single string saying which controls are active, e.g. "pause;restart;stop". String values have a timestamp, so the latest value prevails in a merge.

I guess we have to update the plugins docs but haven't really looked.

This is a protocol change, so it requires a version bump.
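A sketch of the new representation (the separator and example come from the description above; the function names are illustrative):

```go
package report

import "strings"

// Active controls travel as one semicolon-separated string, e.g.
// "pause;restart;stop". Because string values carry a timestamp in Scope,
// the latest value simply wins when reports are merged.
func encodeActiveControls(controls []string) string {
	return strings.Join(controls, ";")
}

func decodeActiveControls(s string) []string {
	if s == "" {
		return nil
	}
	return strings.Split(s, ";")
}
```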

+275 -475

0 comment

13 changed files

bboreham

pr closed time in 3 days

push event weaveworks/scope

Bryan Boreham

commit sha 880daa78ff7daa051e18e060dd47435e61e6d366

Extend K8s tagger test to cover pause containers

view details

push time in 3 days

PR closed cortexproject/cortex

Copy label values to avoid unsafe memory use

This is an alternative solution to #2000 (and #2004) where we do more copying on the error path but need fewer comments to avoid mistakes.

Benchmarks show this alternative is significantly more expensive on the error path.

Before #2000:

BenchmarkIngesterPush/encoding=DoubleDelta	   20165041 ns/op  2590736 B/op	   12441 allocs/op
BenchmarkIngesterPushErrors/encoding=DoubleDelta   26534724 ns/op  5270153 B/op	   85194 allocs/op

After #2000:

BenchmarkIngesterPush/encoding=DoubleDelta	   20305149 ns/op  2591191 B/op	   12444 allocs/op
BenchmarkIngesterPushErrors/encoding=DoubleDelta   27959445 ns/op  5276005 B/op	   85189 allocs/op

This branch:

BenchmarkIngesterPush/encoding=DoubleDelta	   20074468 ns/op  2591331 B/op	   12598 allocs/op
BenchmarkIngesterPushErrors/encoding=DoubleDelta   35415601 ns/op 11072246 B/op	   95055 allocs/op
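The extra copying on the error path benchmarked above amounts to something like this sketch (illustrative, not the Cortex code; strings.Clone is a modern stand-in for whatever copy helper was used at the time):

```go
package ingester

import (
	"strings"

	"github.com/prometheus/prometheus/pkg/labels"
)

// copyLabels clones the label strings before they escape into an error value,
// so the error does not keep (or unsafely reference) the request buffer the
// labels were originally sliced from.
func copyLabels(ls labels.Labels) labels.Labels {
	out := make(labels.Labels, len(ls))
	for i, l := range ls {
		out[i] = labels.Label{Name: strings.Clone(l.Name), Value: strings.Clone(l.Value)}
	}
	return out
}
```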
+20 -4

2 comments

3 changed files

bboreham

pr closed time in 3 days

pull request comment cortexproject/cortex

Copy label values to avoid unsafe memory use

I think the benchmarks show that we don't want to do this, so I have updated #2004.

bboreham

comment created time in 3 days

pull request comment kubernetes/kubernetes

kubelet: ensure static pod UIDs are unique

/retest

Failed tests kinda look like storage mount errors so let's see what happens if I roll the dice again.

bboreham

comment created time in 3 days

PR opened cortexproject/cortex

Add a note to docs about rulers re-sharding on rollout

I had assumed rulers would work more like ingesters in this respect, so it seems worth mentioning.

+2 -0

0 comment

1 changed file

pr created time in 3 days

create branch cortexproject/cortex

branch : ruler-resharding

created branch time in 3 days

push event cortexproject/cortex

Bryan Boreham

commit sha 62efd50ac160015bf7489c7808e95383374ff84d

Fix some cases where flush reason is recorded as 'Immediate' (#2022)

* Make sure we return noFlush on an empty series

Signed-off-by: Bryan Boreham <bryan@weave.works>

* Pass the correct flush reason to closeHead()

Pass the reason from `shouldFlushChunk()` to `closeHead()`; sending `reasonImmediate` was a bug. We don't need to check `len(chunks)` is non-zero, since that would return `noFlush` from `shouldFlushSeries()`.

Signed-off-by: Bryan Boreham <bryan@weave.works>

view details

push time in 3 days

delete branch cortexproject/cortex

delete branch : fix-flush-immediate

delete time in 3 days

PR merged cortexproject/cortex

Fix some cases where flush reason is recorded as 'Immediate'

Pass the reason from shouldFlushChunk() to closeHead(); sending reasonImmediate was a bug introduced in #1978.

We don't need to check len(chunks) is non-zero, since that would return noFlush from shouldFlushSeries().

Also close the gap where reasonImmediate could be returned on an empty series, which would then have crashed. I don't know whether it's possible in practice to get into that state.

+10 -7

0 comment

1 changed file

bboreham

pr closed time in 3 days

pull request comment kubernetes/kubernetes

kubelet: ensure static pod UIDs are unique

/test pull-kubernetes-e2e-gce

bboreham

comment created time in 4 days

pull request comment kubernetes/kubernetes

kubelet: fix the generating the UID for pod manifest files

Re-opened as #87461

yujuhong

comment created time in 4 days

PR opened kubernetes/kubernetes

kubelet: ensure static pod UIDs are unique

What this PR does / why we need it: Fix the issue where pods like kube-proxy, created from the same static manifest across a cluster, all have the same UID. This breaks the concept of a unique ID and confuses some tools.

The underlying issue is a simple clash of behaviours: kubelet writes per-node data into an md5 hasher, but DeepHashObject was changed to reset the hash, discarding whatever was written before it. The fix is to re-order the calls so that DeepHashObject goes first and the unique data is written afterwards. This does have the side-effect that all pod UIDs will change.
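A hedged reconstruction of that re-ordering (not the exact kubelet code; DeepHashObject is the real helper from k8s.io/kubernetes/pkg/util/hash, which resets the hasher before writing):

```go
package config

import (
	"crypto/md5"
	"encoding/hex"

	hashutil "k8s.io/kubernetes/pkg/util/hash"
)

// staticPodUID sketches the fixed ordering: DeepHashObject resets the hasher,
// so the per-node and per-file data must be written after it, or it is lost.
func staticPodUID(pod interface{}, nodeName, sourceFile string) string {
	hasher := md5.New()
	hashutil.DeepHashObject(hasher, pod)    // resets, then hashes the pod spec
	_, _ = hasher.Write([]byte(nodeName))   // these writes now survive
	_, _ = hasher.Write([]byte(sourceFile)) // and make the UID unique per manifest
	return hex.EncodeToString(hasher.Sum(nil))
}
```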

Special notes for your reviewer: This is a re-opening of #43420. It and the similar #57135 were held up and ultimately abandoned over concerns that all pods would get restarted during upgrade.

Since then, several Kubernetes upgrades have restarted all pods, and there is a statement at #84443 that this is expected behaviour, so this should not be a reason to delay the fix.

Change the calculation of pod UIDs so that static pods get a unique value - will cause all containers to be killed and recreated after in-place upgrade.

/kind bug /sig node

+3 -1

0 comment

1 changed file

pr created time in 4 days

push event bboreham/kubernetes

Bryan Boreham

commit sha 5cac49efa3486105971e2d0fe1ecc95dc161bf09

kubelet: ensure pod UIDs are unique

Ensure node name and file path are included in the hash which produces the pod UID, otherwise all pods created from the same manifest have the same UID across the cluster.

The real author of this code is Yu-Ju Hong <yjhong@google.com>. I am resurrecting an abandoned PR, and changed the git author to pass CLA check.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>

view details

push time in 4 days

create branch bboreham/kubernetes

branch : fix-uid-gen

created branch time in 4 days

issue comment cortexproject/cortex

Metrics labels make it hard to join

The method and status_code labels on cortex_cache_request_duration_seconds come from library code that does metrics, tracing, etc., so they are a little tricky to change. This is probably also why two things are being stuffed into the method label.

I agree this is poor for Prometheus, although it works quite well in Jaeger.

While we're on the subject, cortex_memcache_request_duration_seconds is almost completely overlapping with cortex_cache_request_duration_seconds, right?

sylr

comment created time in 4 days

PR merged containernetworking/cni

Update Result.String implementation to work as declared in interface

String() of Result, declared in pkg/types/types.go, should return the JSON representation of the Result instead of a custom-formatted string; that string is also not in the form described in the docstring, which is missing the Routes part, for example.

This PR changes the behaviour to be consistent with what is documented in the interface.

As the String() interface has no place for errors, the output will be an empty string instead of a JSON value if marshalling fails.

Closes #581

Edit/update: as discussed in the comments, this PR now removes the funcs that turn a Result into a string.
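A sketch of the behaviour that was originally proposed, before the PR switched to removing the String() helpers entirely: marshal to JSON and fall back to an empty string, since String() cannot return an error.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonString is illustrative only; it is not part of the CNI library.
func jsonString(v interface{}) string {
	data, err := json.Marshal(v)
	if err != nil {
		return "" // String() has nowhere to report the error
	}
	return string(data)
}

func main() {
	// Stand-in for a CNI Result value.
	result := map[string]interface{}{
		"cniVersion": "0.4.0",
		"routes":     []map[string]string{{"dst": "0.0.0.0/0"}},
	}
	fmt.Println(jsonString(result))
}
```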

+0 -42

6 comments

5 changed files

jellonek

pr closed time in 4 days

push event containernetworking/cni

Piotr Skamruk

commit sha f208f19fd3c50dd8fd6a5f877cf9b366ee19a5da

Remove Result.String

Closes #581

Signed-off-by: Piotr Skamruk <piotr.skamruk@gmail.com>

view details

Piotr Skamruk

commit sha 704c56d291ff88cf367480c3c4792d9a494d915b

Update tests Signed-off-by: Piotr Skamruk <piotr.skamruk@gmail.com>

view details

Bryan Boreham

commit sha d0fd3ff4e9f25676f8166af9e53932bb5f967973

Merge pull request #583 from jellonek/resultasstring

Update Result.String implementation to work as declared in interface

view details

push time in 4 days

issue closed containernetworking/cni

Inconsistent Result documentation

It looks like there are some inconsistencies in the String() function between the Result interface and struct. The interface is documented as returning the JSON representation of the result, while the struct is documented as returning some formatted string that isn't JSON.

Not sure which direction should be fixed here, but the documentation should at least be consistent.

closed time in 4 days

andybradshaw

push event containernetworking/cni

Bruce Ma

commit sha cc6e8afa5ac53efcb4d3bb98251c4ce68ba50e49

pkg/utils: utility package should use black-box testing Signed-off-by: Bruce Ma <brucema19901024@gmail.com>

view details

Bryan Boreham

commit sha c2b68cd5950a912be522ad3a92ccc10c73831bf8

Merge pull request #738 from mars1024/black-box-testing pkg/utils: utility package should use black-box testing

view details

push time in 4 days

PR merged containernetworking/cni

pkg/utils: utility package should use black-box testing

A follow-up fix PR for #736. IMO, we should use black-box testing for our utility functions.

Signed-off-by: Bruce Ma brucema19901024@gmail.com

+5 -4

0 comment

1 changed file

mars1024

pr closed time in 4 days

issue opened containernetworking/plugins

Examine cases where plugins return error on DEL

#438 appears to have been caused by a typo where the code was special-casing 'not found', but in practice it is hardly ever a good idea to return an error from DEL because runtimes like Kubernetes will simply repeat the operation forever.

This issue is to remind us to review all error returns from DEL in the plugins in this repo.
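A hedged sketch of the pattern such a review would look for: on DEL, treat "already gone" as success so the runtime's endless retries stop (teardownNetwork is a hypothetical stand-in for a plugin's real cleanup):

```go
package plugin

import (
	"os"

	"github.com/containernetworking/cni/pkg/skel"
)

// teardownNetwork is a hypothetical stand-in for a plugin's real cleanup.
func teardownNetwork(args *skel.CmdArgs) error {
	return os.Remove("/var/run/mynet/" + args.ContainerID) // illustrative state file
}

// cmdDel treats "not found" as success: DEL must be idempotent, because
// runtimes such as Kubernetes will retry a failed DEL forever.
func cmdDel(args *skel.CmdArgs) error {
	if err := teardownNetwork(args); err != nil && !os.IsNotExist(err) {
		return err
	}
	return nil
}
```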

created time in 4 days

Pull request review comment cortexproject/cortex

Release 0.6.0

 instructions below to upgrade your Postgres.
   * Track `cortex_querier_blocks_sync_seconds` metric for the initial sync too
   * Fixed race condition
 * [BUGFIX] Fixed unnecessary CAS operations done by the HA tracker when the jitter is enabled. #1861
-* [BUGFIX] Fixed #1904 ingesters getting stuck in a LEAVING state after coming up from an ungraceful exit. #1921
+* [BUGFIX] Fixed ingesters getting stuck in a LEAVING state after coming up from an ungraceful exit. #1921
 * [BUGFIX] Reduce memory usage when ingester Push() errors. #1922
-* [BUGFIX] TSDB: Fixed handling of out of order/bound samples in ingesters with the experimental TSDB blocks storage. #1864
-* [BUGFIX] TSDB: Fixed querying ingesters in `LEAVING` state with the experimental TSDB blocks storage. #1854
-* [BUGFIX] TSDB: Fixed error handling in the series to chunks conversion with the experimental TSDB blocks storage. #1837
-* [BUGFIX] TSDB: Fixed TSDB creation conflict with blocks transfer in a `JOINING` ingester with the experimental TSDB blocks storage. #1818
-* [BUGFIX] TSDB: `experimental.tsdb.ship-interval` of <=0 treated as disabled instead of allowing panic. #1975
-* [BUGFIX] TSDB: Fixed `cortex_ingester_queried_samples` and `cortex_ingester_queried_series` metrics when using block storage. #1981
-* [BUGFIX] TSDB: Fixed `cortex_ingester_memory_series` and `cortex_ingester_memory_users` metrics when using with the experimental TSDB blocks storage. #1982
-* [BUGFIX] TSDB: Fixed `cortex_ingester_memory_series_created_total` and `cortex_ingester_memory_series_removed_total` metrics when using TSDB blocks storage. #1990
 * [BUGFIX] Table Manager: Fixed calculation of expected tables and creation of tables from next active schema considering grace period. #1976
-
-### Upgrading Postgres (if you're using configs service)
+* [BUGFIX] Experimental TSDB: Fixed handling of out of order/bound samples in ingesters with the experimental TSDB blocks storage. #1864
+* [BUGFIX] Experimental TSDB: Fixed querying ingesters in `LEAVING` state with the experimental TSDB blocks storage. #1854
+* [BUGFIX] Experimental TSDB: Fixed error handling in the series to chunks conversion with the experimental TSDB blocks storage. #1837
+* [BUGFIX] Experimental TSDB: Fixed TSDB creation conflict with blocks transfer in a `JOINING` ingester with the experimental TSDB blocks storage. #1818
+* [BUGFIX] Experimental TSDB: `experimental.tsdb.ship-interval` of <=0 treated as disabled instead of allowing panic. #1975
+* [BUGFIX] Experimental TSDB: Fixed `cortex_ingester_queried_samples` and `cortex_ingester_queried_series` metrics when using block storage. #1981
+* [BUGFIX] Experimental TSDB: Fixed `cortex_ingester_memory_series` and `cortex_ingester_memory_users` metrics when using with the experimental TSDB blocks storage. #1982
+* [BUGFIX] Experimental TSDB: Fixed `cortex_ingester_memory_series_created_total` and `cortex_ingester_memory_series_removed_total` metrics when using TSDB blocks storage. #1990

This could be one line "fixed metrics when using experimental TSDB blocks storage".

pracucci

comment created time in 4 days

Pull request review comment cortexproject/cortex

Release 0.6.0

 instructions below to upgrade your Postgres.
   * Track `cortex_querier_blocks_sync_seconds` metric for the initial sync too
   * Fixed race condition

If you want this to be of value to a reader, say more about how they might have hit the race. Or delete this line.

pracucci

comment created time in 4 days

Pull request review comment cortexproject/cortex

Release 0.6.0

 instructions below to upgrade your Postgres.
 * [CHANGE] Overrides mechanism has been renamed to "runtime config", and is now separate from limits. Runtime config is simply a file that is reloaded by Cortex every couple of seconds. Limits and now also multi KV use this mechanism.<br />New arguments were introduced: `-runtime-config.file` (defaults to empty) and `-runtime-config.reload-period` (defaults to 10 seconds), which replace previously used `-limits.per-user-override-config` and `-limits.per-user-override-period` options. Old options are still used if `-runtime-config.file` is not specified. This change is also reflected in YAML configuration, where old `limits.per_tenant_override_config` and `limits.per_tenant_override_period` fields are replaced with `runtime_config.file` and `runtime_config.period` respectively. #1749
 * [CHANGE] Cortex now rejects data with duplicate labels. Previously, such data was accepted, with duplicate labels removed with only one value left. #1964
 * [CHANGE] Changed the default value for `-distributor.ha-tracker.prefix` from `collectors/` to `ha-tracker/` in order to not clash with other keys (ie. ring) stored in the same key-value store. #1940
+* [FEATURE] Write-Ahead-Log added in ingesters for more data reliability against ingester crashes. #1103
+  * `--ingester.wal-enabled`: Setting this to `true` enables writing to WAL during ingestion.
+  * `--ingester.wal-dir`: Directory where the WAL data should be stored and/or recovered from.
+  * `--ingester.checkpoint-enabled`: Set this to `true` to enable checkpointing of in-memory chunks to disk.
+  * `--ingester.checkpoint-duration`: This is the interval at which checkpoints should be created.
+  * `--ingester.recover-from-wal`: Set this to `true` to recover data from an existing WAL.
+  * For more information, please checkout the ["Ingesters with WAL" guide](https://cortexmetrics.io/docs/guides/ingesters-with-wal/).
 * [FEATURE] The distributor can now drop labels from samples (similar to the removal of the replica label for HA ingestion) per user via the `distributor.drop-label` flag. #1726
 * [FEATURE] Added flag `debug.mutex-profile-fraction` to enable mutex profiling #1969
 * [FEATURE] Added `global` ingestion rate limiter strategy. Deprecated `-distributor.limiter-reload-period` flag. #1766
 * [FEATURE] Added support for Microsoft Azure blob storage to be used for storing chunk data. #1913
 * [FEATURE] Added readiness probe endpoint`/ready` to queriers. #1934
-* [FEATURE] EXPERIMENTAL: Added `/series` API endpoint support with TSDB blocks storage. #1830
 * [FEATURE] Added "multi" KV store that can interact with two other KV stores, primary one for all reads and writes, and secondary one, which only receives writes. Primary/secondary store can be modified in runtime via runtime-config mechanism (previously "overrides"). #1749
-* [FEATURE] EXPERIMENTAL: Added TSDB blocks `compactor` component, which iterates over users blocks stored in the bucket and compact them according to the configured block ranges. #1942
+* [FEATURE] Added support to store ring tokens to a file and read it back on startup, instead of generating/fetching the tokens to/from the ring. This feature can be enabled with the flag `-ingester.tokens-file-path` for the ingesters and `-ruler.tokens-file-path` for the ruler. #1750
+* [FEATURE] Experimental TSDB: Added `/series` API endpoint support with TSDB blocks storage. #1830
+* [FEATURE] Experimental TSDB: Added TSDB blocks `compactor` component, which iterates over users blocks stored in the bucket and compact them according to the configured block ranges. #1942
 * [ENHANCEMENT] metric `cortex_ingester_flush_reasons` gets a new `reason` value: `Spread`, when `-ingester.spread-flushes` option is enabled. #1978
 * [ENHANCEMENT] Added `password` and `enable_tls` options to redis cache configuration. Enables usage of Microsoft Azure Cache for Redis service. #1923
+* [ENHANCEMENT] Upgraded Kubernetes API version for deployments from `extensions/v1beta1` to `apps/v1`. #1941
 * [ENHANCEMENT] Experimental TSDB: Open existing TSDB on startup to prevent ingester from becoming ready before it can accept writes. #1917
   * `--experimental.tsdb.max-tsdb-opening-concurrency-on-startup`
 * [ENHANCEMENT] Experimental TSDB: Added `cortex_ingester_shipper_dir_syncs_total`, `cortex_ingester_shipper_dir_sync_failures_total`, `cortex_ingester_shipper_uploads_total` and `cortex_ingester_shipper_upload_failures_total` metrics from TSDB shipper component. #1983

"added more metrics"? If I know what "TSDB shipper component" means I probably know what metrics I expect from it.

pracucci

comment created time in 4 days

Pull request review comment cortexproject/cortex

Release 0.6.0

  ## master / unreleased
 
+## 0.6.0 / 2020-01-22
+
 Note that the ruler flags need to be changed in this upgrade. You're moving from a single node ruler to something that might need to be sharded. If you are running with a high `-ruler.num-workers` and if you're not able to execute all your rules in `-ruler.evaluation-interval`, then you'll need to shard.
 Further, if you're using the configs service, we've upgraded the migration library and this requires some manual intervention. See full
-instructions below to upgrade your Postgres.
+instructions below to upgrade your PostgreSQL.
 
 * [CHANGE] Remove unnecessary configs/flags from the ruler ring config to align with the pattern used in the distributor ring. #1987

Release 0.4 didn't have a ruler ring, so these particular differences don't need to be listed as a CHANGE. It would be good to mention the ring alongside the "transition to upstream Prometheus rules manager".

pracucci

comment created time in 4 days

Pull request review comment cortexproject/cortex

Release 0.6.0

 instructions below to upgrade your Postgres.
 * [CHANGE] Overrides mechanism has been renamed to "runtime config", and is now separate from limits. Runtime config is simply a file that is reloaded by Cortex every couple of seconds. Limits and now also multi KV use this mechanism.<br />New arguments were introduced: `-runtime-config.file` (defaults to empty) and `-runtime-config.reload-period` (defaults to 10 seconds), which replace previously used `-limits.per-user-override-config` and `-limits.per-user-override-period` options. Old options are still used if `-runtime-config.file` is not specified. This change is also reflected in YAML configuration, where old `limits.per_tenant_override_config` and `limits.per_tenant_override_period` fields are replaced with `runtime_config.file` and `runtime_config.period` respectively. #1749
 * [CHANGE] Cortex now rejects data with duplicate labels. Previously, such data was accepted, with duplicate labels removed with only one value left. #1964
 * [CHANGE] Changed the default value for `-distributor.ha-tracker.prefix` from `collectors/` to `ha-tracker/` in order to not clash with other keys (ie. ring) stored in the same key-value store. #1940
+* [FEATURE] Write-Ahead-Log added in ingesters for more data reliability against ingester crashes. #1103
+  * `--ingester.wal-enabled`: Setting this to `true` enables writing to WAL during ingestion.
+  * `--ingester.wal-dir`: Directory where the WAL data should be stored and/or recovered from.
+  * `--ingester.checkpoint-enabled`: Set this to `true` to enable checkpointing of in-memory chunks to disk.
+  * `--ingester.checkpoint-duration`: This is the interval at which checkpoints should be created.
+  * `--ingester.recover-from-wal`: Set this to `true` to recover data from an existing WAL.
+  * For more information, please checkout the ["Ingesters with WAL" guide](https://cortexmetrics.io/docs/guides/ingesters-with-wal/).
 * [FEATURE] The distributor can now drop labels from samples (similar to the removal of the replica label for HA ingestion) per user via the `distributor.drop-label` flag. #1726
 * [FEATURE] Added flag `debug.mutex-profile-fraction` to enable mutex profiling #1969
 * [FEATURE] Added `global` ingestion rate limiter strategy. Deprecated `-distributor.limiter-reload-period` flag. #1766
 * [FEATURE] Added support for Microsoft Azure blob storage to be used for storing chunk data. #1913
 * [FEATURE] Added readiness probe endpoint`/ready` to queriers. #1934
-* [FEATURE] EXPERIMENTAL: Added `/series` API endpoint support with TSDB blocks storage. #1830
 * [FEATURE] Added "multi" KV store that can interact with two other KV stores, primary one for all reads and writes, and secondary one, which only receives writes. Primary/secondary store can be modified in runtime via runtime-config mechanism (previously "overrides"). #1749
-* [FEATURE] EXPERIMENTAL: Added TSDB blocks `compactor` component, which iterates over users blocks stored in the bucket and compact them according to the configured block ranges. #1942
+* [FEATURE] Added support to store ring tokens to a file and read it back on startup, instead of generating/fetching the tokens to/from the ring. This feature can be enabled with the flag `-ingester.tokens-file-path` for the ingesters and `-ruler.tokens-file-path` for the ruler. #1750
+* [FEATURE] Experimental TSDB: Added `/series` API endpoint support with TSDB blocks storage. #1830
+* [FEATURE] Experimental TSDB: Added TSDB blocks `compactor` component, which iterates over users blocks stored in the bucket and compact them according to the configured block ranges. #1942
 * [ENHANCEMENT] metric `cortex_ingester_flush_reasons` gets a new `reason` value: `Spread`, when `-ingester.spread-flushes` option is enabled. #1978
 * [ENHANCEMENT] Added `password` and `enable_tls` options to redis cache configuration. Enables usage of Microsoft Azure Cache for Redis service. #1923
+* [ENHANCEMENT] Upgraded Kubernetes API version for deployments from `extensions/v1beta1` to `apps/v1`. #1941
 * [ENHANCEMENT] Experimental TSDB: Open existing TSDB on startup to prevent ingester from becoming ready before it can accept writes. #1917
   * `--experimental.tsdb.max-tsdb-opening-concurrency-on-startup`

It's not clear to me what this line means.

pracucci

comment created time in 4 days

Pull request review comment cortexproject/cortex

Release 0.6.0

 instructions below to upgrade your Postgres.
   * Track `cortex_querier_blocks_sync_seconds` metric for the initial sync too
   * Fixed race condition
 * [BUGFIX] Fixed unnecessary CAS operations done by the HA tracker when the jitter is enabled. #1861
-* [BUGFIX] Fixed #1904 ingesters getting stuck in a LEAVING state after coming up from an ungraceful exit. #1921
+* [BUGFIX] Fixed ingesters getting stuck in a LEAVING state after coming up from an ungraceful exit. #1921
 * [BUGFIX] Reduce memory usage when ingester Push() errors. #1922
-* [BUGFIX] TSDB: Fixed handling of out of order/bound samples in ingesters with the experimental TSDB blocks storage. #1864
-* [BUGFIX] TSDB: Fixed querying ingesters in `LEAVING` state with the experimental TSDB blocks storage. #1854
-* [BUGFIX] TSDB: Fixed error handling in the series to chunks conversion with the experimental TSDB blocks storage. #1837
-* [BUGFIX] TSDB: Fixed TSDB creation conflict with blocks transfer in a `JOINING` ingester with the experimental TSDB blocks storage. #1818
-* [BUGFIX] TSDB: `experimental.tsdb.ship-interval` of <=0 treated as disabled instead of allowing panic. #1975
-* [BUGFIX] TSDB: Fixed `cortex_ingester_queried_samples` and `cortex_ingester_queried_series` metrics when using block storage. #1981
-* [BUGFIX] TSDB: Fixed `cortex_ingester_memory_series` and `cortex_ingester_memory_users` metrics when using with the experimental TSDB blocks storage. #1982
-* [BUGFIX] TSDB: Fixed `cortex_ingester_memory_series_created_total` and `cortex_ingester_memory_series_removed_total` metrics when using TSDB blocks storage. #1990
 * [BUGFIX] Table Manager: Fixed calculation of expected tables and creation of tables from next active schema considering grace period. #1976
-
-### Upgrading Postgres (if you're using configs service)
+* [BUGFIX] Experimental TSDB: Fixed handling of out of order/bound samples in ingesters with the experimental TSDB blocks storage. #1864
+* [BUGFIX] Experimental TSDB: Fixed querying ingesters in `LEAVING` state with the experimental TSDB blocks storage. #1854
+* [BUGFIX] Experimental TSDB: Fixed error handling in the series to chunks conversion with the experimental TSDB blocks storage. #1837
+* [BUGFIX] Experimental TSDB: Fixed TSDB creation conflict with blocks transfer in a `JOINING` ingester with the experimental TSDB blocks storage. #1818
+* [BUGFIX] Experimental TSDB: `experimental.tsdb.ship-interval` of <=0 treated as disabled instead of allowing panic. #1975

Do we really need so much text about bugfixes in an experimental feature?

pracucci

comment created time in 4 days

Pull request review comment cortexproject/cortex

Release 0.6.0

 instructions below to upgrade your Postgres.
 * [CHANGE] Overrides mechanism has been renamed to "runtime config", and is now separate from limits. Runtime config is simply a file that is reloaded by Cortex every couple of seconds. Limits and now also multi KV use this mechanism.<br />New arguments were introduced: `-runtime-config.file` (defaults to empty) and `-runtime-config.reload-period` (defaults to 10 seconds), which replace previously used `-limits.per-user-override-config` and `-limits.per-user-override-period` options. Old options are still used if `-runtime-config.file` is not specified. This change is also reflected in YAML configuration, where old `limits.per_tenant_override_config` and `limits.per_tenant_override_period` fields are replaced with `runtime_config.file` and `runtime_config.period` respectively. #1749
 * [CHANGE] Cortex now rejects data with duplicate labels. Previously, such data was accepted, with duplicate labels removed with only one value left. #1964
 * [CHANGE] Changed the default value for `-distributor.ha-tracker.prefix` from `collectors/` to `ha-tracker/` in order to not clash with other keys (ie. ring) stored in the same key-value store. #1940
+* [FEATURE] Write-Ahead-Log added in ingesters for more data reliability against ingester crashes. #1103
+  * `--ingester.wal-enabled`: Setting this to `true` enables writing to WAL during ingestion.
+  * `--ingester.wal-dir`: Directory where the WAL data should be stored and/or recovered from.
+  * `--ingester.checkpoint-enabled`: Set this to `true` to enable checkpointing of in-memory chunks to disk.
+  * `--ingester.checkpoint-duration`: This is the interval at which checkpoints should be created.
+  * `--ingester.recover-from-wal`: Set this to `true` to recover data from an existing WAL.
+  * For more information, please checkout the ["Ingesters with WAL" guide](https://cortexmetrics.io/docs/guides/ingesters-with-wal/).
 * [FEATURE] The distributor can now drop labels from samples (similar to the removal of the replica label for HA ingestion) per user via the `distributor.drop-label` flag. #1726
 * [FEATURE] Added flag `debug.mutex-profile-fraction` to enable mutex profiling #1969
 * [FEATURE] Added `global` ingestion rate limiter strategy. Deprecated `-distributor.limiter-reload-period` flag. #1766
 * [FEATURE] Added support for Microsoft Azure blob storage to be used for storing chunk data. #1913
 * [FEATURE] Added readiness probe endpoint`/ready` to queriers. #1934
-* [FEATURE] EXPERIMENTAL: Added `/series` API endpoint support with TSDB blocks storage. #1830
 * [FEATURE] Added "multi" KV store that can interact with two other KV stores, primary one for all reads and writes, and secondary one, which only receives writes. Primary/secondary store can be modified in runtime via runtime-config mechanism (previously "overrides"). #1749
-* [FEATURE] EXPERIMENTAL: Added TSDB blocks `compactor` component, which iterates over users blocks stored in the bucket and compact them according to the configured block ranges. #1942
+* [FEATURE] Added support to store ring tokens to a file and read it back on startup, instead of generating/fetching the tokens to/from the ring. This feature can be enabled with the flag `-ingester.tokens-file-path` for the ingesters and `-ruler.tokens-file-path` for the ruler. #1750

Is this `ruler.ring`?

pracucci

comment created time in 4 days

pull request comment cortexproject/cortex

Release 0.6.0

@pracucci I made a fix: #2022. Do you want me to target that at the release branch?

pracucci

comment created time in 4 days

pull request comment cortexproject/cortex

Update various things in the build image

I changed the title since we have a few Dockerfiles.

sylr

comment created time in 4 days

PR opened cortexproject/cortex

Fix flush immediate

Pass the reason from shouldFlushChunk() to closeHead(); sending reasonImmediate was a bug.

We don't need to check len(chunks) is non-zero, since that would return noFlush from shouldFlushSeries().

Also close the gap where reasonImmediate could be returned on an empty series, which would then have crashed. I don't know whether it's possible in practice to get into that state.

+10 -7

0 comment

1 changed file

pr created time in 4 days

push event cortexproject/cortex

Bryan Boreham

commit sha 17c0c21e4f964140eb1b651392d05b72a327dda2

Make sure we return noFlush on an empty series Signed-off-by: Bryan Boreham <bryan@weave.works>

view details

Bryan Boreham

commit sha df8b95f2109b6fd27212a01b69a42e2518834849

Pass the correct flush reason to closeHead()

Pass the reason from `shouldFlushChunk()` to `closeHead()`; sending `reasonImmediate` was a bug. We don't need to check `len(chunks)` is non-zero, since that would return `noFlush` from `shouldFlushSeries()`.

Signed-off-by: Bryan Boreham <bryan@weave.works>

view details

push time in 4 days

create branch cortexproject/cortex

branch : fix-flush-immediate

created branch time in 4 days

push event cortexproject/cortex

Tom Wilkie

commit sha 7228c1cfa5f40bdf0b5bda66d49decf1ee9e413e

Update vendored libraries: - Add github.com/promtheus/tsdb/wal - Update github.com/prometheus/client_golang for WrapRegistererWith function. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

view details

Tom Wilkie

commit sha 831080f73a8c6d39ad63da98d0da281ed51ba384

Use TSDB's WAL for writes. Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

view details

Ganesh Vernekar

commit sha 8710a22abb3ea79bc9f9657aa92f2cbf69f245c4

Merge remote-tracking branch 'upstream/master' into wal Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha 0802aa64d849e213f59d987ee7a71ebc9facb223

Merge pull request #23 from codesome/wal Rebase with master

view details

Ganesh Vernekar

commit sha 2a6fc2db5a80997a8b055172cf1c1dad04d24785

Fix merge conflicts Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha bc25534f88f35170ee492044aa22915ed60d6bc6

Merge remote-tracking branch 'grafana/wal' into wal

view details

Ganesh Vernekar

commit sha f023ddb88d27c5c52b8f8e62887c91ca07b2083f

Fix creation of user states in WAL recover Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha 5c022dac373f587f1acec7904035b7d8e4613d23

Merge remote-tracking branch 'upstream/master' into wal Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha ac98b13ffa5da5445118d6feb3be83ce96211df5

Remove WAL recovery on start-up and allow nil Record Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha b84b02e85b1ea99ec8178d32779c6e1c97a676c2

Fix types Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha 587b8ad812aaee93f963eba9d6266846c6fc945b

Merge remote-tracking branch 'upstream/master' into wal

view details

Ganesh Vernekar

commit sha 05127b016acf3750c303fe7507e7054aa3445295

WAL compression always enabled Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha 787fbb81c7874c8310b0c2e7c9d099da1c536ca4

Change checkpoint logic to be more like prometheus's tsdb Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha 9845a5e2d1742de20ed126d5ad0f91515c132fb8

Add metrics for checkpoint and name changes Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha 5bb950f1916b947cb1e839d730d4fe6abd33a332

Merge remote-tracking branch 'upstream/master' into wal Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha f21f814ba32b869f384779bcce7f637ae4b57891

Initial attempt for flushing chunks from WAL [WIP] Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha 46eaf362566700d8b64d546d62a7c349bbf717d6

Combine checkpoint and WAL chunks before flushing Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha 2b1710341fb4137df34f8c20852366ac214bcfd2

Bring back recovery and tests Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha 935b73e447591097056fc1cfb56bca6a5f4e9526

Fix race in the test Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

Ganesh Vernekar

commit sha a7844c7fe83ecc33f0fad0a3eec95ef88923b208

Merge remote-tracking branch 'upstream/master' into wal Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

view details

push time in 4 days

Commit comment event

push event cortexproject/cortex

Bryan Boreham

commit sha d52556a6b917ecc5b9ffe7a7b2eb822470e87cb0

Wrap ingester Push errors to avoid retaining any reference to unsafe data Signed-off-by: Bryan Boreham <bryan@weave.works>

view details

push time in 4 days

push event cortexproject/cortex

Peter Štibraný

commit sha b7d5ab016a9fb2df20644f0c0fd429aefcbe77d0

Docs update (#1967) * Explain PENDING/JOINING and LEAVING states in a little more detail, and fix some incorrect statements. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Fix some dashes. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Fix dot. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Replaced mdash entities with commas and parenthesis instead. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Explain auto-join timeout little bit. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Fix argument name to -ingester.join-after. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

view details

Thor

commit sha e138d09f9f176b9339bf36bd6ba50da69d9baa80

tsdb: optionally turn off shipper (#1975) Signed-off-by: Thor <thansen@digitalocean.com>

view details

Goutham Veeramachaneni

commit sha cd9a203586b415cf8dcfe583c8da84bc8d654936

Add Marco Pracucci as a Maintainer (#1977) Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

view details

Thor

commit sha 49bceb0629e71ba1f267a677b528d291ee25db2c

CHANGELOG: changed ShipInterval to the cli flag (#1979) Signed-off-by: Thor <thansen@digitalocean.com>

view details

Vineeth Pothulapati

commit sha a6ab46fd599a004636122c1b405b8c66d213bd96

Adding docs for running cortex with cassandra (#1953) * Adding docs for running cortex with cassandra This PR contains step by step guide for running cortex as a single binary and configuring cassandra as backend storage. Signed-off-by: vineeth <vineethpothulapati@outlook.com> * Addresses review comments & added the URL in readme.md file Signed-off-by: vineeth <vineethpothulapati@outlook.com> * Addressed review comments and styling Signed-off-by: vineeth <vineethpothulapati@outlook.com> * Improving wordings in docs and forms. Signed-off-by: vineeth <vineethpothulapati@outlook.com>

view details

Marco Pracucci

commit sha 819cc0ff65fcebc81af2ac8eba1c4e56f570bf2a

Added the series to the error when a sample fail to be appended to TSDB (#1971) Signed-off-by: Marco Pracucci <marco@pracucci.com>

view details

Jacob Lisi

commit sha e319ef5c3744856087edfb9b57a78f2805334088

add flag to enable and set the mutex profile fraction (#1969) * add flag to enable and set the mutex profile fraction Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com> * update flag name and update changelog Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>

view details

Marco Pracucci

commit sha f1a9e361abe5d5b15da184406fa10a75b007efe7

Minor styling touches to the website (#1966) Signed-off-by: Marco Pracucci <marco@pracucci.com>

view details

Marco Pracucci

commit sha 6d8dee4fcde10a4eced58c86ae1323c2094fc656

Do not display the last modified timestamp in the website pages (#1965) Signed-off-by: Marco Pracucci <marco@pracucci.com>

view details

Marco Pracucci

commit sha 209fe10a17abdf2f904bdb489ee46a90fae1bdae

Website: fixed image aspect ratio for Twitter (#1962) * Fixed Website image aspect ratio for Twitter Signed-off-by: Marco Pracucci <marco@pracucci.com> * Reduced image size of the Cortex logo for Twitter card Signed-off-by: Marco Pracucci <marco@pracucci.com> * Further reduced image size Signed-off-by: Marco Pracucci <marco@pracucci.com>

view details

Bryan Boreham

commit sha ec2d8c4c5e292bceeeb4e68075c24b7c6728e2b4

Report chunks flushed by spread-flushes option under separate label (#1978) * Report chunks flushed by spread-flushes option under separate label This improves observability of flushing, creating separate labels for chunks that overflowed versus to series that reached their time under spread-flushes behaviour. The flushReason type shrinks to int8 to avoid bloating the chunk desc object. Signed-off-by: Bryan Boreham <bryan@weave.works> * Add changelog note about new flush_reasons label value Signed-off-by: Bryan Boreham <bryan@weave.works>

view details

Peter Štibraný

commit sha dd511dde6f536c00426a5f0c2ed8385c982fdb4d

TSDB: metrics for queries (#1981) * Update queriedSeries and queriedSamples metrics in v2Query. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Increment number of users when creating new TSDB. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Use length of slices to update metrics. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Update CHANGELOG.md Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Removed memUsers.Inc() in favor of PR #1982. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Removed mention of cortex_ingester_memory_users from CHANGELOG.md Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * One more small fix to CHANGELOG.md. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

view details

Marco Pracucci

commit sha 1aaa4a18d21ed87e634e39c0fc47f258aa595a55

Changed the default value for -distributor.ha-tracker.prefix in order to not clash with other keys (#1940) Signed-off-by: Marco Pracucci <marco@pracucci.com>

view details

Thor

commit sha 326828bac2ed0126d442f36e21b825c2a834c91e

ingester_v2: open TSDB on startup, limit the number of concurrently (#1917) opening Signed-off-by: Thor <thansen@digitalocean.com>

view details

Marco Pracucci

commit sha 876c8fae431513712daf57b52d8ad80f0a002dfd

Fix ingester_memory_series/users metrics for TSDB storage (#1982) * Fixed cortex_ingester_memory_users metric tracking for blocks storage Signed-off-by: Marco Pracucci <marco@pracucci.com> * Fixed cortex_ingester_memory_series metric tracking for blocks storage Signed-off-by: Marco Pracucci <marco@pracucci.com> * Updated changelog Signed-off-by: Marco Pracucci <marco@pracucci.com>

view details

Marco Pracucci

commit sha 5365b843ae1c8c282a36c6b1f8797b04efb9b118

Added tool to automatically generate config file documentation (#1952) * Added tool to automatically generate config file documentation Signed-off-by: Marco Pracucci <marco@pracucci.com> * Make doc due to config changes after rebasing master Signed-off-by: Marco Pracucci <marco@pracucci.com> * Added support for inline yaml field Signed-off-by: Marco Pracucci <marco@pracucci.com> * Fixed linter issues Signed-off-by: Marco Pracucci <marco@pracucci.com> * Fixed linter in the alertmanager Signed-off-by: Marco Pracucci <marco@pracucci.com> * Fixed Cortex module name type detection Signed-off-by: Marco Pracucci <marco@pracucci.com> * Commented why SchemaConfig doc is not auto-generated Signed-off-by: Marco Pracucci <marco@pracucci.com> * Updated config file doc Signed-off-by: Marco Pracucci <marco@pracucci.com>

view details

Giedrius Statkevičius

commit sha 8af66753479caa039989ed1e31ee783a6eb65f5b

frontend: implement cache control (#1974) * querier: do not cache results if requested Add an extra `Headers` field to the `PrometheusResponse` message which contains the headers and their values that came from Prometheus. Use them in other places to deduce if the response should be cached. If `Cache-Control` is equal to `no-store` then it is *not* cached. This will be used by the Thanos project to indicate when a partial response has been returned. In such cases the result should not be cached so that invalid data would not be stored there. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com> * queryrange: factor out cache checking + add tests Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com> * querier: gofmt Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com> * querier: tests: add missing member Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com> * querier: fix logical mistake Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com> * queryrange: fix generated code Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com> * querier: test adjustments Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com> * querier: test adjustments Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com> * querier: results_cache: cache by default Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com> * queryrange: cache: check all header values `Cache-Control` might contain more than one value so check all of them. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com> * Update according to Goutham's comments Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com> * CHANGELOG: add full stop, PR's number Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>

view details

Goutham Veeramachaneni

commit sha 2cc8d0789d2f9f9b7bc107961cbd5ab45a90326c

Release 0.5.0 (#1963) * Release 0.5.0-rc0 Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> * Fix changelog to point to 0.5.0-rc.0 Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> * Validate incoming labels for order and duplicate names. (#1964) * Validate incoming labels for order and duplicate names. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Document that cortex rejects requests with incorrectly ordered or duplicate labels. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Ignore empty metric name. {__name__=""} will be shown formatted as {__name__=""} instead of just empty string. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * As we rely on sorted labels, sort them before using them. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Updated CHANGELOG.md to reflect latest changes to PR. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Put back redundant aliases. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * wrap migration commands (#1980) * wrap migration commands Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com> * update changelog Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com> * fix missing semicolon Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com> * Fix typo to make lint pass. Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> Co-authored-by: Goutham Veeramachaneni <gouthamve+github@gmail.com> * Call out breaking changes better. Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> * Label this version as -rc1 Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> * We're abandoning the `0.5.0` release. Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> Co-authored-by: Peter Štibraný <pstibrany@gmail.com> Co-authored-by: Jacob Lisi <jacob.t.lisi@gmail.com>

view details

Peter Štibraný

commit sha 83556236a32ccf05369d2085576f201f8a104b3b

TSDB: Expose shipper metrics via ingester (#1983) * Expose TSDB shipper metrics via ingester Since we have one TSDB shipper per tenant in memory, we need to gather all of these metrics and sum them together (fortunately they are all counters). One shipper metric, thanos_shipper_upload_compacted_done ("If 1 it means shipper uploaded all compacted blocks from the filesystem.") was skipped and is not exposed, as there is no good way to group it across all tenants. Ingester metrics were moved to separate file, metrics.go. shipper metrics have their own type and are only used from TSDB code. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Added test for shipper metrics. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Use warn instead of debug, so that we can see it in logs. This should never happen. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Updated CHANGELOG.md Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com> * Updated CHANGELOG.md Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

view details

Owen Diehl

commit sha add279fd8a8fe6b77b7f8c8a0cc7e2fa598bb35c

defers putting slice back into pool on ingester push to prevent panicking Error() calls which reference underlying labels (#2000) Signed-off-by: Owen Diehl <ow.diehl@gmail.com>

view details

push time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (a s3ObjectClient) bucketFromKey(key string) string {
 	}
 
 	hasher := fnv.New32a()
-	hasher.Write([]byte(key))
+	_, _ = hasher.Write([]byte(key))

Could add hash.Hash32.Write to the project .errcheck-exclude file.
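
For reference, such an exclude file is conventionally one fully-qualified function per line; the exact method spellings below are an assumption, not copied from the repo:

io.Copy
(hash.Hash).Write
(hash.Hash32).Write
(hash.Hash64).Write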

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (c *Memcached) Stop() error {
 // HashKey hashes key into something you can store in memcached.
 func HashKey(key string) string {
 	hasher := fnv.New64a()
-	hasher.Write([]byte(key)) // This'll never error.
+	_, _ = hasher.Write([]byte(key)) // This'll never error.

Could add this to the project .errcheck-exclude file.

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 loop:
 	}
 
 	// Mark ourselved as Leaving so no more samples are send to us.
-	i.changeState(context.Background(), LEAVING)
+	_ = i.changeState(context.Background(), LEAVING)

Seems like this error might be worth warning about?
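
If it were surfaced, a minimal sketch might look like this (fragment only; the logger reference is illustrative, not necessarily the ingester's actual field):

if err := i.changeState(context.Background(), LEAVING); err != nil {
	level.Warn(util.Logger).Log("msg", "failed to change state to LEAVING", "err", err)
}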

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (d *Distributor) emitBillingRecord(ctx context.Context, buf []byte, samples
 
 	now := time.Now().UTC()
 	hasher := sha256.New()
-	hasher.Write(buf)
+	_, _ = hasher.Write(buf)

Could add to the project .errcheck-exclude file.

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (f *Frontend) handle(w http.ResponseWriter, r *http.Request) {
 		hs[h] = vs
 	}
 	w.WriteHeader(resp.StatusCode)
-	io.Copy(w, resp.Body)
+	_, _ = io.Copy(w, resp.Body)

io.Copy() could be added to the project .errcheck-exclude file.

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func SerializeProtoResponse(w http.ResponseWriter, resp proto.Message, compressi
 	case NoCompression:
 	case FramedSnappy:
 		buf := bytes.Buffer{}
-		if _, err := snappy.NewWriter(&buf).Write(data); err != nil {
+		writer := snappy.NewBufferedWriter(&buf)
+		defer writer.Close()

Don't we have to close the compressor before calling buf.Bytes()?
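
A minimal, self-contained sketch of the point (package and function names are illustrative): the buffered snappy writer holds data back until it is flushed or closed, so Close has to happen before buf.Bytes() is read, not in a defer that runs afterwards.

package example

import (
	"bytes"

	"github.com/golang/snappy"
)

func compress(data []byte) ([]byte, error) {
	buf := bytes.Buffer{}
	writer := snappy.NewBufferedWriter(&buf)
	if _, err := writer.Write(data); err != nil {
		return nil, err
	}
	// Close (or at least Flush) before reading buf.Bytes(), otherwise the
	// tail of the compressed stream is still in the writer's buffer.
	if err := writer.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}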

(we try not to use FramedSnappy at all, but it's here for backwards-compatibility)

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func NewRateLimiter(cfg *Config) grpc.UnaryClientInterceptor {
 	}
 	limiter := rate.NewLimiter(rate.Limit(cfg.RateLimit), burst)
 	return func(ctx context.Context, method string, req, reply interface{}, cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
-		limiter.Wait(ctx)
+		_ = limiter.Wait(ctx)

I think we'd want to abort on error here: it could mean the context deadline was exceeded.
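
A sketch of that (not the actual Cortex code; names are illustrative), returning whatever Wait reports, e.g. a deadline-exceeded or cancellation error:

package example

import (
	"context"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
)

func newRateLimitedInterceptor(limiter *rate.Limiter) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{}, cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		// Abort instead of silently proceeding when the wait fails.
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}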

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func ReuseSlice(slice []PreallocTimeseries) {
 	for i := range slice {
 		ReuseTimeseries(slice[i].TimeSeries)
 	}
-	slicePool.Put(slice[:0])
+	slicePool.Put(slice[:0]) //nolint:staticcheck

Why do we need a special comment here?

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func TestStoreMaxLookBack(t *testing.T) {
 	chunks, err = storeWithLookBackLimit.Get(ctx, userID, now.Add(-time.Hour), now, matchers...)
 	require.NoError(t, err)
 	require.Equal(t, 1, len(chunks))
-	chunks[0].Through.Equal(now)
+	require.Equal(t, now, chunks[0].Through)

A real bug!

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (d *Distributor) Push(ctx context.Context, req *client.WriteRequest) (*clie
 			continue
 		}
 
-		metricName, _ := extract.MetricNameFromLabelAdapters(ts.Labels)
-		samples := make([]client.Sample, 0, len(ts.Samples))
-		for _, s := range ts.Samples {
-			if err := validation.ValidateSample(d.limits, userID, metricName, s); err != nil {
-				lastPartialErr = err
-				continue
-			}
-			samples = append(samples, s)
-		}
-

What happened here? Don't we need this code?

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (t timed) GetAllConfigs(ctx context.Context) (map[string]configs.View, erro
 		cfgs map[string]configs.View
 		err  error
 	)
-	instrument.CollectedRequest(ctx, "DB.GetAllConfigs", databaseRequestDuration, instrument.ErrorCode, func(ctx context.Context) error {
+	_ = instrument.CollectedRequest(ctx, "DB.GetAllConfigs", databaseRequestDuration, instrument.ErrorCode, func(ctx context.Context) error {

I think it would be better to use the return value rather than rely on err being captured.
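
A fragment sketching that shape (the closure body is elided; everything else mirrors the diff above):

	err := instrument.CollectedRequest(ctx, "DB.GetAllConfigs", databaseRequestDuration, instrument.ErrorCode, func(ctx context.Context) error {
		var err error
		cfgs, err = // ... existing wrapped call, unchanged ...
		return err
	})
	return cfgs, err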

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func forAllFixtures(t *testing.T, storageClientTest storageClientTest) {
 		t.Run(fixture.Name(), func(t *testing.T) {
 			indexClient, objectClient, err := testutils.Setup(fixture, tableName)
 			require.NoError(t, err)
-			defer fixture.Teardown()
+			defer fixture.Teardown() //nolint:errcheck

Could this fail the test on error, rather than ignoring it?

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func decodeBase64Value(bs []byte) (model.LabelValue, error) {
 func encodeTime(t uint32) []byte {
 	// timestamps are hex encoded such that it doesn't contain null byte,
 	// but is still lexicographically sortable.
-	throughBytes := make([]byte, 4, 4)
+	throughBytes := make([]byte, 4)
 	binary.BigEndian.PutUint32(throughBytes, t)
-	encodedThroughBytes := make([]byte, 8, 8)
+	encodedThroughBytes := make([]byte, 8)
 	hex.Encode(encodedThroughBytes, throughBytes)
 	return encodedThroughBytes
 }
 
 func decodeTime(bs []byte) uint32 {
-	buf := make([]byte, 4, 4)
-	hex.Decode(buf, bs)
+	buf := make([]byte, 4)
+	_, _ = hex.Decode(buf, bs)

This function is only used in tests, so I would be happier to move it into the _test file, where the lack of error-checking will be less concerning.

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (c *seriesStore) PutOne(ctx context.Context, from, through model.Time, chun
 			return err
 		}
 	}
-	c.writeBackCache(ctx, chunks)
+	_ = c.writeBackCache(ctx, chunks)

Elsewhere we log this error as a warning.

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func TestCompositeStore(t *testing.T) {
 	} {
 		t.Run(fmt.Sprintf("%d", i), func(t *testing.T) {
 			have := []result{}
-			tc.cs.forStores(model.TimeFromUnix(tc.from), model.TimeFromUnix(tc.through), collect(&have))
+			err := tc.cs.forStores(model.TimeFromUnix(tc.from), model.TimeFromUnix(tc.through), collect(&have))
+			if err != nil {

Why not require.NoError()?

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func NewChunkFetcher(cfg cache.Config, cacheStubs bool, storage ObjectClient) (*
 func (c *Fetcher) Stop() {
 	close(c.decodeRequests)
 	c.wait.Wait()
-	c.cache.Stop()
+	_ = c.cache.Stop()

If we don't care about errors from Stop() here, where do we care about them?

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (i *instrumentedCache) Store(ctx context.Context, keys []string, bufs [][]b
 	}
 
 	method := i.name + ".store"
-	instr.CollectedRequest(ctx, method, requestDuration, instr.ErrorCode, func(ctx context.Context) error {
+	_ = instr.CollectedRequest(ctx, method, requestDuration, instr.ErrorCode, func(ctx context.Context) error {

I question why Store() has no error return.

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 func (c *store) LabelNamesForMetricName(ctx context.Context, userID string, from
 }
 
 func (c *store) validateQueryTimeRange(ctx context.Context, userID string, from *model.Time, through *model.Time) (bool, error) {
-	log, ctx := spanlogger.New(ctx, "store.validateQueryTimeRange")
+	log, _ := spanlogger.New(ctx, "store.validateQueryTimeRange")

This seems slightly dangerous: if someone adds a use of ctx later in the method they will get the wrong one. Maybe leave this line as it was and silence the warning, or add `_ = ctx` afterwards with a comment?
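
For concreteness, the second option might look like this (sketch only):

	log, ctx := spanlogger.New(ctx, "store.validateQueryTimeRange")
	_ = ctx // context unused for now; kept so a later addition picks up the span's context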

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

Update code to fix almost all pre-existing lint errors

 const (
 
 func TestChunksPartialError(t *testing.T) {
 	fixture := dynamoDBFixture(0, 10, 20)
-	defer fixture.Teardown()
+	defer func() {

This could be a one-liner. No biggie.
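
Something along these lines, assuming the deferred body is just a require.NoError around Teardown (an assumption, not copied from the PR):

	defer func() { require.NoError(t, fixture.Teardown()) }()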

zendern

comment created time in 4 days

Pull request review commentcortexproject/cortex

fix calculation of expected tables and create tables from upcoming schema considering grace period

 func (cfg *PeriodicTableConfig) periodicTables(from, through model.Time, pCfg Pr
 		nowWeek        = now / periodSecs
 		result         = []TableDesc{}
 	)
-	// If through ends on 00:00 of the day, don't include the upcoming day
-	if through.Unix()%secondsInDay == 0 {
+	// Make sure we don't have an extra table

I think this comment is now less explanatory. Could say something like “If interval ends exactly on a period boundary, don’t include the upcoming period”.

sandlis

comment created time in 4 days

push eventweaveworks/scope

Filip Barl

commit sha 51c731a1e9d5327975e7f6272e3e51232b504319

build(client): deduplicate all yarn dependencies

view details

Filip Barl

commit sha c6d5fe1ec6ac082d4a5b35ddc32a262a58a2fd19

build(client): Ran yarn install to update yarn.lock

view details

Filip Barl

commit sha 1f51495e59816a5cc796d1ef1bf53ab01147d310

Merge pull request #3733 from weaveworks/dedupe-yarn-with-integrity Dedupe yarn.lock

view details

Bryan Boreham

commit sha 2bfe2146e59728d20eea21f710cfc0a861dacf57

build: initial modules conversion Created by running `go mod init`; all versions the same as before.

view details

Bryan Boreham

commit sha 20898ea352a58c6b760743beb76b19c6e3a70cef

build: tweak version of iovisor and its dependencies Without this we get a build error: github.com/Sirupsen/logrus: github.com/Sirupsen/logrus@v1.4.2: parsing go.mod: module declares its path as: github.com/sirupsen/logrus but was required as: github.com/Sirupsen/logrus

view details

Bryan Boreham

commit sha b6c25e6cfe24004e893826e5194bcdbf25d8a36d

build: update docker dependency to fix compile error in fsouza

view details

Bryan Boreham

commit sha 0bb92227a2a7f846fbd75c593bb419eb6af3225b

build: update vendor directory

view details

Bryan Boreham

commit sha 2171e21caee91513f2201f6ee45ccf8f1cf54caa

Ensure ugorji code-generation dependencies are vendored

view details

Bryan Boreham

commit sha 335755161930cc753b4707ae46378703328f0bcf

build: don't need -i flag This was useful to cache dependent library builds, but Go does that better now so we don't need the flag.

view details

Bryan Boreham

commit sha 3a991eca99da5577ec8a4fc1bc8ba58fd59a1c3c

build: use vendored files when building

view details

Bryan Boreham

commit sha 54a513c9fc2889fc876028be791a28d71a8b37d6

fixup: manually add C files that 'go mod' doesn't bring in Need workaround until a fix upstream such as https://github.com/iovisor/gobpf/pull/206/files is merged.

view details

Bryan Boreham

commit sha aae86877c723c90aa25a8a3908c5b99c6b39e68f

Add back runsvint dependency This is a small program used to watch app and probe running inside the container. We need to go round a few houses to be able to build it from vendor directory.

view details

Bryan Boreham

commit sha bc4ec26bf581e39844c566bcf789b9e9d62ab3b8

Turn off module use for codecgen build There is no way to pass `-mod=vendor` through to the Go compilation that codecgen runs.

view details

Bryan Boreham

commit sha 15bb96bb6740eb8a3d41d76a1c6f0b5e803eb376

Remove outdated instructions about rebuilding Go libraries This bit is unnecessary with newer versions of Go that cache builds for different tags in different places.

view details

Bryan Boreham

commit sha f70f2481c77274ec9043b1efc9cf236a093dc37e

Update tcptracer-bpf version so we don't need manual workaround

view details

Bryan Boreham

commit sha 35451b4826c40e0d77c737a5a7288b4fc594ef60

Merge pull request #3742 from weaveworks/go-modules Convert to Go modules

view details

Bryan Boreham

commit sha 634e8f1158673ebe9a2228473132b364167bc2f8

Add tracing for pipe operations

view details

Bryan Boreham

commit sha 321dbeef6d09679e647a32262986f8dee1c9e25b

Update weaveworks/common and dependencies

view details

Bryan Boreham

commit sha f1b06a020efd30914fe61c6500cc765dc40defb0

Update jaeger-client package and dependencies

view details

push time in 5 days

issue commentweaveworks/weave

Intermittent unknownhost issues with errors in weave logs

Is this a joke? Generally, if something is going to go wrong, it will be towards the top of the log.

dstrimble

comment created time in 5 days

delete branch weaveworks/common

delete branch : trace-websockets

delete time in 5 days

PR merged weaveworks/common

Trace Websockets requests

Update the OpenTracing library to a version which supports Websockets, and remove the workaround for when it didn't.

+6 -13

0 comment

2 changed files

bboreham

pr closed time in 5 days

push eventweaveworks/common

Bryan Boreham

commit sha c8387ad988c78dde0253e86480ab21ae3542f4e2

Trace Websocket requests Update the OpenTracing library to a version which supports Websockets, and remove the workaround for when it didn't.

view details

Bryan Boreham

commit sha f95aa0ece94007f37f2d77e30478240653e227fa

Merge pull request #173 from weaveworks/trace-websockets Trace Websockets requests

view details

push time in 5 days

issue commentweaveworks/weave

Intermittent unknownhost issues with errors in weave logs

When we ask for "logs" we mean the whole thing.

dstrimble

comment created time in 5 days

issue closedweaveworks/weave

weave losing connections to other nodes with error: Multiple connections (Kubernetes CNI)

What you expected to happen?

Inter-node cluster-internal traffic to work

What happened?

At random times, one node's pod network becomes unreachable / can't connect to other nodes' pod networks. Node-internal traffic still works

Deleting the pod fixes the issue temporarily

Anything else we need to know?

Baremetal deployment with 3 nodes (1 master, 2 workers), metallb in L2 mode and WEAVE_MTU set to 1500 and NO_MASQ_LOCAL set to 1

Versions:

$ weave version
weave script 2.5.1
weave 2.5.1

$ docker version
 Version:           18.09.2
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        6247962
 Built:             Tue Feb 26 23:52:23 2019
 OS/Arch:           linux/amd64
 Experimental:      false

$ uname -a
Linux k8sm1 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:00:57Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Error occurred around 08:36 server time

$ kubectl logs -n kube-system <weave-net-pod> weave
INFO: 2019/03/20 08:34:47.205497 Sending ICMP 3,4 (10.32.0.71 -> 10.40.0.87): PMTU=1438
INFO: 2019/03/20 08:35:31.789047 Sending ICMP 3,4 (10.32.0.69 -> 10.40.0.87): PMTU=1438
INFO: 2019/03/20 08:36:21.533420 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:36:21.533536 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:36:21.533888 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:36:21.534236 ->[192.168.100.72:6783] attempting connection
INFO: 2019/03/20 08:36:21.534529 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:36:21.534565 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:36:21.534642 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:36:21.534686 Removed unreachable peer 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:36:21.534702 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:36:21.534815 ->[192.168.100.83:6783] attempting connection
INFO: 2019/03/20 08:36:21.535475 ->[192.168.100.72:34749] connection accepted
INFO: 2019/03/20 08:36:21.536439 ->[192.168.100.83:56185] connection accepted
INFO: 2019/03/20 08:36:21.536784 ->[192.168.100.72:34749|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:36:21.536816 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:36:21.536911 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:36:21.536958 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:36:21.536982 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:36:21.537033 ->[192.168.100.72:34749|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:36:21.537067 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:36:21.537121 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:36:21.537254 ->[192.168.100.83:56185|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:36:21.537418 ->[192.168.100.72:34749|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:36:21.537428 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:36:21.537466 ->[192.168.100.72:34749|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: Multiple connections to ae:3f:91:18:2e:cb(k8sm1) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:36:21.537573 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:36:21.538144 ->[192.168.100.83:56185|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: Multiple connections to 9e:34:c5:bd:b8:9e(k8sw2) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:36:21.538952 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using sleeve
INFO: 2019/03/20 08:36:21.538997 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection fully established
INFO: 2019/03/20 08:36:21.539047 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/20 08:36:21.539820 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using sleeve
INFO: 2019/03/20 08:36:21.539865 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection fully established
INFO: 2019/03/20 08:36:21.539889 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/20 08:36:21.540080 sleeve ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: Effective MTU verified at 1438
INFO: 2019/03/20 08:36:21.540633 sleeve ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: Effective MTU verified at 1438
INFO: 2019/03/20 08:37:21.538009 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:37:21.538103 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:37:21.538208 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:37:21.538918 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:37:21.538991 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:37:21.539300 ->[192.168.100.83:58245] connection accepted
INFO: 2019/03/20 08:37:21.539487 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:37:21.539927 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:37:21.540016 Removed unreachable peer 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:37:21.540245 ->[192.168.100.83:6783] attempting connection
INFO: 2019/03/20 08:37:21.540329 ->[192.168.100.72:6783] attempting connection
INFO: 2019/03/20 08:37:21.540428 ->[192.168.100.83:58245|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.540520 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:37:21.540564 ->[192.168.100.83:58245|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:37:21.541363 ->[192.168.100.72:34149] connection accepted
INFO: 2019/03/20 08:37:21.541773 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.541884 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:37:21.541919 ->[192.168.100.83:58245|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:37:21.542005 ->[192.168.100.83:58245|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: Multiple connections to 9e:34:c5:bd:b8:9e(k8sw2) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:37:21.542359 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:37:21.542428 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:37:21.544058 ->[192.168.100.83:60179] connection accepted
INFO: 2019/03/20 08:37:21.544313 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.544401 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:37:21.544435 ->[192.168.100.72:34149|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.544515 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:37:21.544654 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:37:21.544768 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:37:21.544806 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using sleeve
INFO: 2019/03/20 08:37:21.544841 ->[192.168.100.72:34149|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:37:21.544929 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection fully established
INFO: 2019/03/20 08:37:21.544771 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/20 08:37:21.545106 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: Multiple connections to ae:3f:91:18:2e:cb(k8sm1) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:37:21.545485 sleeve ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: Effective MTU verified at 1438
INFO: 2019/03/20 08:37:21.545499 ->[192.168.100.83:60179|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:37:21.545576 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:37:21.545624 ->[192.168.100.83:60179|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: Multiple connections to 9e:34:c5:bd:b8:9e(k8sw2) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:38:21.543908 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:38:21.544002 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:38:21.544094 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:38:21.544839 ->[192.168.100.83:6783] attempting connection
INFO: 2019/03/20 08:38:21.545145 ->[192.168.100.83:53093] connection accepted
INFO: 2019/03/20 08:38:21.545377 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using sleeve
INFO: 2019/03/20 08:38:21.545653 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:38:21.545717 ->[192.168.100.72:34149|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:38:21.545796 ->[192.168.100.72:34149|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:38:21.545840 Removed unreachable peer 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:38:21.545859 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:38:21.545971 ->[192.168.100.83:53093|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:38:21.546050 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:38:21.546107 ->[192.168.100.83:53093|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:38:21.546289 ->[192.168.100.72:6783] attempting connection
INFO: 2019/03/20 08:38:21.546325 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:38:21.546526 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:38:21.546586 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: Multiple connections to 9e:34:c5:bd:b8:9e(k8sw2) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:38:21.546786 ->[192.168.100.72:36693] connection accepted
INFO: 2019/03/20 08:38:21.547484 ->[192.168.100.72:36693|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:38:21.547587 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:38:21.547669 ->[192.168.100.72:36693|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:38:21.547876 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:38:21.547987 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:38:21.548020 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: Multiple connections to ae:3f:91:18:2e:cb(k8sm1) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:39:21.546646 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:39:21.546717 ->[192.168.100.83:53093|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: no working forwarders to 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:39:21.546826 ->[192.168.100.83:53093|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:39:21.547406 ->[192.168.100.83:6783] attempting connection
INFO: 2019/03/20 08:39:21.548006 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/03/20 08:39:21.548072 ->[192.168.100.83:40009] connection accepted
INFO: 2019/03/20 08:39:21.548155 ->[192.168.100.72:36693|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: no working forwarders to ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:39:21.548281 ->[192.168.100.72:36693|ae:3f:91:18:2e:cb(k8sm1)]: connection deleted
INFO: 2019/03/20 08:39:21.548336 Removed unreachable peer 9e:34:c5:bd:b8:9e(k8sw2)
INFO: 2019/03/20 08:39:21.548371 Removed unreachable peer ae:3f:91:18:2e:cb(k8sm1)
INFO: 2019/03/20 08:39:21.548525 ->[192.168.100.72:6783] attempting connection
INFO: 2019/03/20 08:39:21.549229 ->[192.168.100.83:40009|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:39:21.549295 ->[192.168.100.72:45581] connection accepted
INFO: 2019/03/20 08:39:21.549327 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:39:21.549346 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:39:21.549437 ->[192.168.100.83:40009|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:39:21.549462 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using fastdp
INFO: 2019/03/20 08:39:21.549797 ->[192.168.100.83:40009|9e:34:c5:bd:b8:9e(k8sw2)]: connection deleted
INFO: 2019/03/20 08:39:21.549858 ->[192.168.100.72:45581|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:39:21.549914 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:39:21.549925 ->[192.168.100.83:40009|9e:34:c5:bd:b8:9e(k8sw2)]: connection shutting down due to error: write tcp4 192.168.100.79:6783->192.168.100.83:40009: write: connection reset by peer
INFO: 2019/03/20 08:39:21.549980 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection ready; using protocol version 2
INFO: 2019/03/20 08:39:21.550006 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection added (new peer)
INFO: 2019/03/20 08:39:21.550154 overlay_switch ->[ae:3f:91:18:2e:cb(k8sm1)] using fastdp
INFO: 2019/03/20 08:39:21.550257 ->[192.168.100.72:45581|ae:3f:91:18:2e:cb(k8sm1)]: connection added (new peer)
INFO: 2019/03/20 08:39:21.550368 ->[192.168.100.72:6783|ae:3f:91:18:2e:cb(k8sm1)]: connection shutting down due to error: Multiple connections to ae:3f:91:18:2e:cb(k8sm1) added to f2:e1:3e:28:f1:80(k8sw1)
INFO: 2019/03/20 08:39:21.552197 overlay_switch ->[9e:34:c5:bd:b8:9e(k8sw2)] using sleeve
INFO: 2019/03/20 08:39:21.552260 ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: connection fully established
INFO: 2019/03/20 08:39:21.552399 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/20 08:39:21.552982 sleeve ->[192.168.100.83:6783|9e:34:c5:bd:b8:9e(k8sw2)]: Effective MTU verified at 1438

weave status in working state:

# weave status

        Version: 2.5.1 (up to date; next check at 2019/03/20 14:13:47)

        Service: router
       Protocol: weave 1..2
           Name: f2:e1:3e:28:f1:80(k8sw1)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 3
    Connections: 3 (2 established, 1 failed)
          Peers: 3 (with 6 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12

# weave status peers
ae:3f:91:18:2e:cb(k8sm1)
   -> 192.168.100.83:6783   9e:34:c5:bd:b8:9e(k8sw2)              established
   <- 192.168.100.79:50229  f2:e1:3e:28:f1:80(k8sw1)              established
9e:34:c5:bd:b8:9e(k8sw2)
   <- 192.168.100.72:38745  ae:3f:91:18:2e:cb(k8sm1)              established
   <- 192.168.100.79:39057  f2:e1:3e:28:f1:80(k8sw1)              established
f2:e1:3e:28:f1:80(k8sw1)
   -> 192.168.100.72:6783   ae:3f:91:18:2e:cb(k8sm1)              established
   -> 192.168.100.83:6783   9e:34:c5:bd:b8:9e(k8sw2)              established

weave status connections
-> 192.168.100.72:6783   established sleeve ae:3f:91:18:2e:cb(k8sm1) mtu=1438
-> 192.168.100.83:6783   established sleeve 9e:34:c5:bd:b8:9e(k8sw2) mtu=1438
-> 192.168.100.79:6783   failed      cannot connect to ourself, retry: never

I will try to get the weave outputs during failure state, but didn't have the weave script installed at the time/looked up weave troubleshooting and had to get the issue fixed asap.

closed time in 5 days

Tatsu-Kishi

issue commentweaveworks/weave

weave losing connections to other nodes with error: Multiple connections (Kubernetes CNI)

The original trail is long dead, and it's too confusing to respond to multiple threads of conversation in a single GitHub issue; please open a new issue rather than commenting on this one. The new issue template will request the info that is essential for debugging.

Note that "multiple connections" is not really a problem; it's just a transient condition that gets reported in passing. People commenting here were having other issues.

Tatsu-Kishi

comment created time in 5 days

issue closedweaveworks/weave

Node loses pod connectivity after OOM

In random situations of OOMKiller getting triggered, after the node is back up again (i.e. in Ready state) the node loses its pod connectivity. Deleting the weave pod (and consequently it getting recreated) makes the issue go away.

What you expected to happen?

I expected the node to eventually recover from the OOM, and/or report its state as NotReady if it hasn't.

What happened?

The node reports its network state as ready, but one can not access pod IPs from that node or pods running on it.

How to reproduce it?

This is not fully reproducible, but almost all occurrences have been after some random pod causes OOMKiller to be triggered. We've successfully quarantined the bug on a node, and can examine it if further information is needed.

Versions:

$ weave version
weave 2.5.2
$ docker version
...
Server: Docker Engine - Community
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       481bc77
  Built:            Sat May  4 01:59:36 2019
  OS/Arch:          linux/amd64
  Experimental:     false
$ uname -a
Linux c4-b2 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
...
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:30:26Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Lots of occurrences of the following lines:

...connection shutting down due to error: read tcp4...
...connection deleted...
...connection shutting down due to error during handshake: failed to receive remote protocol header...

but these only appear during the OOM, after that it just goes back to normal logs (e.g. Discovered remote MAC)

Network:

$ ip route
...
10.32.0.0/12 dev weave  proto kernel  scope link  src 10.33.128.0
$ ip -4 -o addr
...
6: weave    inet 10.33.128.0/12 brd 10.47.255.255 scope global weave\       valid_lft forever preferred_lft forever
$ /home/weave/weave --local status # inside the weave container
        Version: 2.5.2 (up to date; next check at 2019/09/02 21:13:01)

        Service: router
       Protocol: weave 1..2
           Name: 66:cf:c7:9d:f2:00(c4-b2)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 16
    Connections: 16 (15 established, 1 failed)
          Peers: 16 (with 240 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12

closed time in 5 days

SeMeKh

issue commentweaveworks/weave

Node loses pod connectivity after OOM

I'm going to close this on the basis that Weave Net 2.6 uses far less memory. Please open a new issue rather than commenting on this one; the template will request the info that is essential for debugging.

SeMeKh

comment created time in 5 days

issue commentweaveworks/weave

Intermittent unknownhost issues with errors in weave logs

IP allocation was seeded by different peers

See https://www.weave.works/docs/net/latest/tasks/ipam/troubleshooting-ipam/#seeded-by-different-peers

connection shutting down due to error: Multiple connections to da:a4:78:fc:54:c6(nodew00487.nonprod.company.com) added to 16:84:46:93:53:9e(nodem00402.nonprod.company.com)

This is not a problem in itself; it's just something that can happen due to timing when many connections are being added and removed, which in turn is caused by the fatal condition above.

dstrimble

comment created time in 5 days

pull request commentweaveworks/weave

add support for `ips` in ipam config

I am not sure if it makes sense to configure also checkAlive

In a Kubernetes context it won't make any difference - that parameter is only used when Weave Net is configured to talk to Docker.

nonsense

comment created time in 5 days

Pull request review commentweaveworks/weave

add support for `ips` in ipam config

 func (i *Ipam) Release(args *skel.CmdArgs) error {
 }
 
 type ipamConf struct {
-	Subnet  string         `json:"subnet,omitempty"`
-	Gateway net.IP         `json:"gateway,omitempty"`
-	Routes  []*types.Route `json:"routes"`
+	Subnet  string              `json:"subnet,omitempty"`
+	Gateway net.IP              `json:"gateway,omitempty"`
+	Routes  []*types.Route      `json:"routes"`
+	IPs     []*current.IPConfig `json:"ips,omitempty"`

Per the Conventions, this should be "string entries". Maybe you can find a plugin at https://github.com/containernetworking/plugins that implements `ips`, and try to be consistent with that?
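
For illustration, the "string entries" shape would look roughly like this (package and import names partly assumed; each entry would then be parsed at allocation time, e.g. with net.ParseCIDR):

package ipamplugin

import (
	"net"

	"github.com/containernetworking/cni/pkg/types"
)

type ipamConf struct {
	Subnet  string         `json:"subnet,omitempty"`
	Gateway net.IP         `json:"gateway,omitempty"`
	Routes  []*types.Route `json:"routes"`
	IPs     []string       `json:"ips,omitempty"`
}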

nonsense

comment created time in 5 days

Pull request review commentweaveworks/weave

add support for `ips` in ipam config

 func (i *Ipam) Allocate(args *skel.CmdArgs) (types.Result, error) {
 	}
 	var ipnet *net.IPNet
 
-	if conf.Subnet == "" {
+	if len(conf.IPs) == 1 {
+		ip := conf.IPs[0]
+		if ip.Version == "4" {
+			ipnet = &ip.Address
+			err = i.weave.ClaimIP(containerID, ipnet, false)
+		} else {
+			return nil, errors.New("Not Implemented")

Please include the thing that is not implemented in the error message; don't assume it will always be what you think it is, or that the person reading the message can figure it out.
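
For instance (wording illustrative):

	return nil, fmt.Errorf("claiming a version %q address is not implemented", ip.Version)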

nonsense

comment created time in 5 days

Pull request review commentweaveworks/weave

add support for `ips` in ipam config

 func (i *Ipam) Allocate(args *skel.CmdArgs) (types.Result, error) {
 	}
 	var ipnet *net.IPNet
 
-	if conf.Subnet == "" {
+	if len(conf.IPs) == 1 {

What happens when 2 or more IPs are supplied?
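
If only a single IP is supported, an explicit guard would make that obvious to the caller (a sketch, not a prescription):

	if len(conf.IPs) > 1 {
		return nil, fmt.Errorf("expected at most one IP in ipam config, got %d", len(conf.IPs))
	}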

nonsense

comment created time in 5 days

pull request commentweaveworks/weave

add support for `ips` in ipam config

Thanks for the PR! Would you consider adding a test? E.g. extending https://github.com/weaveworks/weave/blob/master/test/830_cni_plugin_test.sh

(the integration tests at CircleCI always fail for 3rd-party branches, but I've pushed your branch to this repo so they will run this time)

nonsense

comment created time in 5 days
