Brian Atkinson nairb774 San Jose, CA, USA

nairb774/binarylearn 3

Simple distributed implementation of a restricted Boltzmann machine.

nairb774/aether-launch 2

Lightweight launcher built upon aether

nairb774/jwebunit-scala 2

Scala DSL for jwebunit

nairb774/akka 1

Akka Transactors

nairb774/grails-easyb 1

Easyb implementation for Grails done by Gustavo Madruga and Richard Vowles

PR opened fluent/fluentd

output: Replace ${chunk_id} before logging warning.

The updated implementation makes sure to replace instances of ${chunk_id} prior to looking for and warning on unreplaced keys in the pattern. Prior to this, a warning line was printed for every chunk that was flushed even though nothing was actually wrong (replacement happened right after).

Signed-off-by: Brian Atkinson brian@atkinson.mn

<!-- Thank you for contributing to Fluentd! Your commits need to follow DCO: https://probot.github.io/apps/dco/ And please provide the following information to help us make the most of your pull request: -->

Which issue(s) this PR fixes: None filed.

What this PR does / why we need it: When using the s3 output plugin, with the path set to "${tag}/%Y/%m/%d/%H/${chunk_id}_#{ENV['RUN_ID']}_#{worker_id}" the fluentd process would print a log line like the following for every chunk flushed.

2020-09-25 00:32:49 +0000 [warn]: #1 [s3_output] chunk key placeholder 'chunk_id' not replaced. template:${tag}/%Y/%m/%d/%H/${chunk_id}_6ead9663854b0799849b47d91148d344f5fcb534329354a0dfad76ef6056_1.gz

This log line was being printed a little too early and is entirely misleading.
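
Fluentd itself is Ruby; as a rough, language-agnostic sketch of the reordering described above (the function and template below are made up for illustration, not the plugin's actual code), the idea is simply to substitute the chunk-specific placeholder first and only then scan for anything left unreplaced:

package main

import (
	"fmt"
	"regexp"
	"strings"
)

// placeholderRe matches any remaining ${...} style placeholder.
var placeholderRe = regexp.MustCompile(`\$\{[^}]+\}`)

// expandPath substitutes ${chunk_id} before checking for unreplaced
// placeholders. Doing the check before the substitution is what produced the
// spurious "placeholder not replaced" warnings for every flushed chunk.
func expandPath(template, chunkID string) string {
	out := strings.ReplaceAll(template, "${chunk_id}", chunkID)
	if leftover := placeholderRe.FindAllString(out, -1); len(leftover) > 0 {
		fmt.Printf("[warn] placeholders not replaced: %v\n", leftover)
	}
	return out
}

func main() {
	fmt.Println(expandPath("logs/%Y/%m/%d/${chunk_id}.gz", "6ead9663854b"))
}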

Note to reviewer(s): This is the first Ruby I've worked with in about 10 years. Anything that doesn't feel normal about the code is likely due to this, and any improvements are appreciated.

Docs Changes:

Release Note:

+26 -21

0 comment

2 changed files

pr created time in 17 minutes

push event nairb774/fluentd

Brian Atkinson

commit sha 37a1c0e28720ee639cdd2f4834d36fe5e4b4dbbc

output: Replace ${chunk_id} before logging warning. The updated implementation makes sure to replace instances of ${chunk_id} prior to looking for and warning on unreplaced keys in the pattern. Prior to this, a warning line was printed for every chunk that was flushed even though nothing was actually wrong (replacement happened right after). Signed-off-by: Brian Atkinson <brian@atkinson.mn>

view details

push time in 25 minutes

create branch nairb774/fluentd

branch : youthful_wilbur

created branch time in 26 minutes

fork nairb774/fluentd

Fluentd: Unified Logging Layer (project under CNCF)

https://www.fluentd.org/

fork in 26 minutes

push event nairb774/codesearch

Brian Atkinson

commit sha 3ff9125ef0a60bd0f68e8cf8add8336269dbb8ff

indexer: Handle SIGTERM/SIGINT

view details

Brian Atkinson

commit sha b61863ce3a0a8538db67a7727701efbc26dcd6f2

indexer: Create a repo if it does not exist.

view details

Brian Atkinson

commit sha d7219046c1b1a63656764cc19181325a2c9a0d47

indexing: Add http auth support via netrc.

view details

Brian Atkinson

commit sha 6bece31b562fe7161b94afdfb232f5cc2d694e16

index: Use signed ints across the board. For better or worse this will reduce the amount of casting needed, and generally be easier to work with. int64/uint64 are compatible in flatbuffers, as long as the top bit is not set. It is really hard to index super large repos at this point, so it is unlikely to impact anything.

view details

Brian Atkinson

commit sha 228e38fc2c6b1860dd5d70362cf71bd4a4730417

repo: Allow unreferencing a CREATING shard. This can happen when two shards are being created at the same time and the reference resolution process kicks in. The shard that is still in the process of being written will be in the creating state, and the resolution process will try to "unreference" it, which is fairly normal (should be left in the CREATING state).

view details

Brian Atkinson

commit sha 43101cda9cc64fa051614f88984864639560c8e2

Have the services listen on all interfaces. This makes it easier to run the systems in docker or kubernetes.

view details

Brian Atkinson

commit sha 897d0c4934dcaba90526bb477bc210179f491ee4

indexing: Improve error output when indexing fails.

view details

Brian Atkinson

commit sha dca1538c8c787209d838e2bbc7d5462d1e37ab72

storage: Start to break storage away from local files. This was intended to simplify some of the structure around reading/writing index shards. This initial implementation allows for loading data from a remote service; it simplified overall deployment on kubernetes by reducing the number of places a hostPath needed to be mounted in. In the long run it is entirely possible to see a local disk access mode come back, along with support for things like S3, GCS, or other blob storage.

view details

Brian Atkinson

commit sha 2697205cf36e813045088d1a06afd749cee01645

docs: Fix spelling.

view details

Brian Atkinson

commit sha 82eb45c013c007439ef77dc4d4c1cc31129b634c

k8s: Some basic configs for running the service on Kubernetes. Enables some level of persistence in the face of Apple's really unstable macOS operating system. Also takes a small step towards making this something that could be deployed at scale.

view details

push time in 11 days

issue comment istio/istio

Setting DestinationRule.caCertificates breaks cluster wide mTLS

Updated cluster dump: istio-dump.tar.gz

nairb774

comment created time in 14 days

issue comment istio/istio

Setting DestinationRule.caCertificates breaks cluster wide mTLS

Re port 80: I've gotten in the habit of moving things out of the <1024 range as I've been trying to move things out of running as root. I think my brain got ahead of itself. Looking at the log line above, I swapped it for one generated from the example, without fully reading beyond the 503.

I also noticed I had commented out the caCertificates attribute. Uncommenting, and restarting the httpbin/sleep pods triggered it.

Working repro: drca-repro.yaml.gz

Actual log line generated (will edit the initial report to fix):

[2020-09-11T18:34:41.098Z] "GET / HTTP/1.1" 503 UF,URX "-" "TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED" 0 91 21 - "-" "curl/7.69.1" "cadc0151-7ada-44f7-a209-f52a6fefe17f" "httpbin.sample-app.svc.cluster.local" "10.1.0.19:80" outbound|80||httpbin.sample-app.svc.cluster.local - 10.106.161.48:80 10.1.0.18:59016 - default
nairb774

comment created time in 14 days

issue opened istio/istio

Setting DestinationRule.caCertificates breaks cluster wide mTLS

Bug description

Applying the following yaml seems to break all mTLS communication in the cluster:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: google
  namespace: some-other-ns
spec:
  hosts:
  - www.google.com
  ports:
  - number: 443
    name: http
    protocol: TCP
  resolution: DNS
  location: MESH_EXTERNAL
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: google
  namespace: some-other-ns
spec:
  host: "www.google.com"
  trafficPolicy:
    tls:
      mode: SIMPLE
      caCertificates: /etc/ssl/certs/ca-certificates.crt

It may take a minute or two for SSL errors like the following to start showing up:

 [2020-09-10T18:00:07.326Z] "GET / HTTP/1.1" 503 URX "-" "-" 0 91 93 92 "-" "curl/7.69.1" "eee6fa43-b829-497c-9931-4bf0fa1f13ff" "httpbin.sample-app.svc.cluster.local" "10.1.0.228:8080" outbound|80||httpbin.sample-app.svc.cluster.local 10.1.0.229:54442 10.98.149.69:80 10.1.0.229:51568 - default

If the DestinationRule.caCertificates attribute is not set, then communication within the cluster is not broken. Removing the caCertificates attribute is not usually sufficient to have the cluster recover. In many cases it requires all affected sidecars to be restarted.

Affected product area (please put an X in all that apply)

[ ] Docs [ ] Installation [X] Networking [ ] Performance and Scalability [ ] Extensions and Telemetry [X] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[ ] Multi Cluster [ ] Virtual Machine [ ] Multi Control Plane

Expected behavior

The caCertificates feature functions and can be used to properly validate Egress TLS Origination.

Steps to reproduce the bug

The following was isolated on Docker Desktop for Mac running Kubernetes 1.16.5.

Install the Istio 1.7.0 operator (istioctl installed via brew):

$ istioctl operator init
Using operator Deployment image: docker.io/istio/operator:1.7.0
✔ Istio operator installed
✔ Installation complete

Install the control plane with the following yaml:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-system
  namespace: istio-system
spec:
  profile: default
  meshConfig:
    accessLogFile: /dev/stdout # Only needed for observing the broken communication.

Configure a few services to be able to communicate with each other. The httpbin/curl pod combo used in many of the examples is sufficient. These services can be running in any namespace you choose.

Apply the ServiceEntry/DestinationRule to any namespace. They can be applied to a brand new, empty namespace.

Make periodic attempts to communicate between the sample services installed. Within a few minutes (usually less than 10), communication will start to fail, and log lines like above will be generated by Envoy.

Full reproduction yaml: drca-repro.yaml.gz It may need to be applied a few times, and pods restarted to trigger injection, but all the parts are there.

Version (include the output of istioctl version --remote and kubectl version --short and helm version if you used Helm)

$ istioctl version --remote
client version: 1.7.0
control plane version: 1.7.0
data plane version: 1.7.0 (3 proxies)
$ kubectl version --short
Client Version: v1.19.1
Server Version: v1.16.6-beta.0

How was Istio installed?

Operator, default profile.

Environment where bug was observed (cloud vendor, OS, etc)

Initially observed on AWS EKS 1.16.13, but was able to reproduce with Docker Desktop for Mac running Kubernetes 1.16.5.

Additionally, please consider attaching a cluster state archive by attaching the dump file to this issue.

istio-dump.tar.gz

created time in 15 days

issue opened kubernetes/autoscaler

VPA ControlledResources doesn't work.

Using a VPA like the following results in the CPU recommendations being produced and applied:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: hamster
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hamster
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]

It looks like, in the process of building the container-name-to-aggregate-state map, the ControlledResources value is not copied from the working state to the per-container aggregate state in AggregateStateByContainerName. This prevents the ControlledResources information from flowing down to where the filtering is applied.

These changes, while hacky, address the problem and result in controlled resources working correctly. The main reason for not submitting a PR is that it's not clear to me that this is the right change. In some ways the ControlledResources hanging off the AggregateContainerState breaks the abstraction boundary. Alternatively, the filtering is possibly being done at the wrong place in the pipeline. Maybe the ControlledResources should be observed in the updater or the admission controller rather than in the recommender?
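
A rough sketch of the carry-over that appears to be missing, using simplified hypothetical types rather than the recommender's actual structs:

package main

import "fmt"

// Hypothetical, heavily simplified stand-ins for the recommender's state; the
// real AggregateContainerState carries usage histograms and much more.
type AggregateContainerState struct {
	ControlledResources []string // e.g. ["memory"]
}

type containerState struct {
	name      string
	aggregate *AggregateContainerState
}

// AggregateStateByContainerName sketches the merge described above: while
// building the per-container-name map, the ControlledResources value has to be
// carried over from the working state, otherwise the later filtering step sees
// no policy and recommendations are produced for every resource.
func AggregateStateByContainerName(states []containerState) map[string]*AggregateContainerState {
	out := make(map[string]*AggregateContainerState)
	for _, s := range states {
		agg, ok := out[s.name]
		if !ok {
			agg = &AggregateContainerState{}
			out[s.name] = agg
		}
		// The copy this issue suggests is missing today:
		if s.aggregate.ControlledResources != nil {
			agg.ControlledResources = s.aggregate.ControlledResources
		}
		// ... merge histograms and other fields here.
	}
	return out
}

func main() {
	m := AggregateStateByContainerName([]containerState{
		{name: "hamster", aggregate: &AggregateContainerState{ControlledResources: []string{"memory"}}},
	})
	fmt.Println(m["hamster"].ControlledResources) // [memory]
}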

@tweks you seem to have done the initial implementation - thoughts?

created time in 16 days

create branch nairb774/autoscaler

branch : upbeat_kapitsa

created branch time in 16 days

fork nairb774/autoscaler

Autoscaling components for Kubernetes

fork in 16 days

PR opened fluxcd/flux

Cleanup exported repos during sync failures

Several functions which generate a clone of the repo can result in the repository being left behind when an error is triggered. This shores up some of those failure paths to prevent the storage leaks.

One of our deployments of FluxCD was found to be eating up a bunch of space in the /tmp dir. There were many folders like /tmp/flux-workingXXXXXXX that seemed to be hanging around. This seemed to be the result of Flux being able to pull the repository, but the path it was configured to read from having been removed. Each failure caused a new copy of the repository to be generated on disk. This shores up some of the exit paths to do a better job of cleaning the data up.
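
As a rough sketch of the cleanup pattern being applied here (illustrative only, not Flux's actual functions):

package main

import (
	"fmt"
	"os"
)

// exportRepo sketches the failure-path cleanup described above: if any step
// after creating the working directory fails, the directory is removed before
// the error is returned, so failed syncs no longer leak /tmp/flux-working*
// clones. The cloneAndRead callback stands in for Flux's own clone/read steps.
func exportRepo(cloneAndRead func(dir string) error) (string, error) {
	dir, err := os.MkdirTemp("", "flux-working")
	if err != nil {
		return "", err
	}
	if err := cloneAndRead(dir); err != nil {
		os.RemoveAll(dir) // clean up the export on the failure path
		return "", err
	}
	return dir, nil
}

func main() {
	_, err := exportRepo(func(dir string) error {
		return fmt.Errorf("configured path not found in %s", dir)
	})
	fmt.Println(err) // the temporary clone has already been removed
}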

<!--

General contribution criteria

Please have a look at our contribution guidelines: https://github.com/fluxcd/flux/blob/master/CONTRIBUTING.md Particularly the sections about the:

  • DCO;
  • contribution workflow; and
  • how to get your fix accepted

To help the maintainers out when they're writing release notes, please try to include a sentence or two here describing your change for end users. See the CHANGELOG.md file in the top-level directory for examples.

Particularly for ground-breaking changes and new features, it's important to make users and developers aware of what's changing and where those changes were documented or discussed.

Even for smaller changes it's useful to see things documented as well, as it gives everybody a chance to see at a glance what's coming up in the next release. It makes the life of the project maintainer a lot easier as well.

The following short checklist can be used to make sure your PR is of good quality, and can be merged easily:

  • [ ] if it resolves an issue; is a reference (i.e. #1) to this issue included?
  • [ ] if it introduces a new functionality or configuration flag; did you document this in the references or guides?
  • [ ] optional but much appreciated; do you think many users would profit from a dedicated setting for this functionality in the Helm chart? -->
+93 -7

0 comment

4 changed files

pr created time in 18 days

push event nairb774/flux

Brian Atkinson

commit sha 1c55cafb06252f8006af3a7fe3ab6f4be0523e74

Cleanup exported repos during sync failures Several functions which generate a clone of the repo can result in the repository being left behind when an error is triggered. This shores up some of those failure paths to prevent the storage leaks.

view details

push time in 18 days

create branch nairb774/flux

branch : nostalgic_booth

created branch time in 18 days

fork nairb774/flux

The GitOps Kubernetes operator

https://docs.fluxcd.io

fork in 18 days

issue opened grpc/grpc-node

GitHub Releases/Tags missing for grpc-js 1.1.4 and 1.1.5

According to https://www.npmjs.com/package/@grpc/grpc-js there have been releases of 1.1.4 and 1.1.5, but these releases don't look to be tagged in the repo. Could the correct tags be added/pushed to the repo?

It also is entirely possible I'm holding it wrong or GitHub is lying to me, but this is what I see in the attached screenshots.

created time in 19 days

started hjacobs/kube-janitor

started time in 23 days

issue comment strongdm/terraform-provider-sdm

1.0.8 and 1.0.9 fail to install, 1.0.7 works

1.0.9 also has this problem.

nairb774

comment created time in 23 days

push event nairb774/devpi

Brian Atkinson

commit sha b2f540e56273e5db5941d21d8eaffeb1cd850c34

deps: Upgrade attrs, cffi and strictyaml.

view details

push time in a month

issue opened terraform-docs/terraform-docs

Variable default of ">" is transformed to "\u003e" in markdown

<!-- Please note, this template is for bugs report, not feature requests --> <!-- For more information, see the Contributing Guidelines at --> <!-- https://github.com/terraform-docs/terraform-docs/tree/master/CONTRIBUTING.md -->

Describe the bug

The tool looks to be over-escaping the defaults of variables when generating markdown.

To Reproduce

Steps to reproduce the behavior:

  1. Define a variable like:
    variable "monitor_operator" {
      description = "query operator"
      type        = string
      default     = ">"
    }
    
  2. Run the tool like: terraform-docs markdown --sort-by-required .
  3. Observe the output in the table has the following contents:
    | monitor\_operator | query operator | `string` | `"\u003e"` | no |
    

Expected behavior

Ideally the default value would render faithfully. For GitHub, when wrapped in backticks, it seems to not require any escaping. Something like the following should render correctly:

| monitor\_operator | query operator | `string` | `">"` | no |
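
The report doesn't pin down the cause, but the \u003e form is what Go's encoding/json produces by default, since it HTML-escapes <, >, and & when marshaling strings; a small sketch of the difference (illustrative, not terraform-docs' actual code):

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

func main() {
	// json.Marshal HTML-escapes < > & by default, which is where values such
	// as "\u003e" typically come from.
	escaped, _ := json.Marshal(">")
	fmt.Println(string(escaped)) // "\u003e"

	// An Encoder with SetEscapeHTML(false) emits the character literally,
	// which is what the expected markdown output above needs.
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	_ = enc.Encode(">")
	fmt.Print(buf.String()) // ">"
}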

Version information:

  • terraform-docs version (use terraform-docs --version):
$ terraform-docs --version
terraform-docs version v0.0.0- v0.9.1 darwin/amd64 BuildDate: 2020-04-03T02:30:49+0100
  • Go version (if you manually built. use go version): Installed via homebrew.
  • OS (e.g. Windows, MacOS): macOS

created time in a month

started istio/istio

started time in a month

issue opened strongdm/terraform-provider-sdm

1.0.8 fails to install, 1.0.7 works

The following works:

provider sdm {
  version = "1.0.7"
}

But the following fails:

provider sdm {
  version = "1.0.8"
}

With the following error:

Initializing the backend...
2020/08/25 13:12:08 [TRACE] Preserving existing state lineage "05c5744c-9ee7-c542-3c3e-3a76e2dc7ccc"
2020/08/25 13:12:08 [TRACE] Preserving existing state lineage "05c5744c-9ee7-c542-3c3e-3a76e2dc7ccc"
2020/08/25 13:12:08 [TRACE] Meta.Backend: working directory was previously initialized for "remote" backend
2020/08/25 13:12:08 [TRACE] Meta.Backend: using already-initialized, unchanged "remote" backend configuration
2020/08/25 13:12:08 [DEBUG] Service discovery for <<redacted>> at https://<<redacted>>/.well-known/terraform.json
2020/08/25 13:12:08 [TRACE] HTTP client GET request to https://<<redacted>>/.well-known/terraform.json
2020/08/25 13:12:09 [DEBUG] Retrieve version constraints for service tfe.v2.1 and product terraform
2020/08/25 13:12:09 [TRACE] HTTP client GET request to https://checkpoint-api.hashicorp.com/v1/versions/tfe.v2.1?product=terraform
2020/08/25 13:12:10 [TRACE] Meta.Backend: instantiated backend of type *remote.Remote
2020/08/25 13:12:10 [DEBUG] checking for provider in "."
2020/08/25 13:12:10 [DEBUG] checking for provider in "/usr/local/Cellar/tfenv/2.0.0/versions/0.12.26"
2020/08/25 13:12:10 [DEBUG] checking for provider in ".terraform/plugins/darwin_amd64"
2020/08/25 13:12:10 [DEBUG] found provider "terraform-provider-aws_v2.68.0_x4"
2020/08/25 13:12:10 [DEBUG] found provider "terraform-provider-null_v2.1.2_x4"
2020/08/25 13:12:10 [DEBUG] found provider "terraform-provider-sdm_v1.0.7"
2020/08/25 13:12:10 [DEBUG] found provider "terraform-provider-template_v2.1.2_x4"
2020/08/25 13:12:10 [DEBUG] found valid plugin: "aws", "2.68.0", "/Users/<<redacted>>/workspaces/<<redacted>>/.terraform/plugins/darwin_amd64/terraform-provider-aws_v2.68.0_x4"
2020/08/25 13:12:10 [DEBUG] found valid plugin: "null", "2.1.2", "/Users/<<redacted>>/workspaces/<<redacted>>/.terraform/plugins/darwin_amd64/terraform-provider-null_v2.1.2_x4"
2020/08/25 13:12:10 [DEBUG] found valid plugin: "sdm", "1.0.7", "/Users/<<redacted>>/workspaces/<<redacted>>/.terraform/plugins/darwin_amd64/terraform-provider-sdm_v1.0.7"
2020/08/25 13:12:10 [DEBUG] found valid plugin: "template", "2.1.2", "/Users/<<redacted>>/workspaces/<<redacted>>/.terraform/plugins/darwin_amd64/terraform-provider-template_v2.1.2_x4"
2020/08/25 13:12:10 [DEBUG] checking for provisioner in "."
2020/08/25 13:12:10 [DEBUG] checking for provisioner in "/usr/local/Cellar/tfenv/2.0.0/versions/0.12.26"
2020/08/25 13:12:10 [DEBUG] checking for provisioner in ".terraform/plugins/darwin_amd64"
2020/08/25 13:12:10 [TRACE] Meta.Backend: backend *remote.Remote supports operations
2020/08/25 13:12:11 [DEBUG] checking for provider in "."
2020/08/25 13:12:11 [DEBUG] checking for provider in "/usr/local/Cellar/tfenv/2.0.0/versions/0.12.26"
2020/08/25 13:12:11 [DEBUG] checking for provider in ".terraform/plugins/darwin_amd64"
2020/08/25 13:12:11 [DEBUG] found provider "terraform-provider-aws_v2.68.0_x4"
2020/08/25 13:12:11 [DEBUG] found provider "terraform-provider-null_v2.1.2_x4"
2020/08/25 13:12:11 [DEBUG] found provider "terraform-provider-sdm_v1.0.7"
2020/08/25 13:12:11 [DEBUG] found provider "terraform-provider-template_v2.1.2_x4"
2020/08/25 13:12:11 [DEBUG] found valid plugin: "null", "2.1.2", "/Users/<<redacted>>/workspaces/<<redacted>>/.terraform/plugins/darwin_amd64/terraform-provider-null_v2.1.2_x4"
2020/08/25 13:12:11 [DEBUG] found valid plugin: "sdm", "1.0.7", "/Users/<<redacted>>/workspaces/<<redacted>>/.terraform/plugins/darwin_amd64/terraform-provider-sdm_v1.0.7"
2020/08/25 13:12:11 [DEBUG] found valid plugin: "template", "2.1.2", "/Users/<<redacted>>/workspaces/<<redacted>>/.terraform/plugins/darwin_amd64/terraform-provider-template_v2.1.2_x4"
2020/08/25 13:12:11 [DEBUG] found valid plugin: "aws", "2.68.0", "/Users/<<redacted>>/workspaces/<<redacted>>.terraform/plugins/darwin_amd64/terraform-provider-aws_v2.68.0_x4"

2020/08/25 13:12:11 [DEBUG] plugin requirements: "aws"="~> 2.0,~> 2.0"
2020/08/25 13:12:11 [DEBUG] plugin requirements: "sdm"="1.0.8"
2020/08/25 13:12:11 [DEBUG] plugin requirements: "null"=""
2020/08/25 13:12:11 [DEBUG] plugin requirements: "template"=""
2020/08/25 13:12:11 [DEBUG] Service discovery for registry.terraform.io at https://registry.terraform.io/.well-known/terraform.json
2020/08/25 13:12:11 [TRACE] HTTP client GET request to https://registry.terraform.io/.well-known/terraform.json
Initializing provider plugins...
- Checking for available provider plugins...
2020/08/25 13:12:11 [DEBUG] fetching provider versions from "https://registry.terraform.io/v1/providers/-/sdm/versions"
2020/08/25 13:12:11 [DEBUG] GET https://registry.terraform.io/v1/providers/-/sdm/versions
2020/08/25 13:12:11 [TRACE] HTTP client GET request to https://registry.terraform.io/v1/providers/-/sdm/versions
2020/08/25 13:12:11 [DEBUG] fetching provider location from "https://registry.terraform.io/v1/providers/terraform-providers/sdm/1.0.8/download/darwin/amd64"
2020/08/25 13:12:11 [DEBUG] GET https://registry.terraform.io/v1/providers/terraform-providers/sdm/1.0.8/download/darwin/amd64
2020/08/25 13:12:11 [TRACE] HTTP client GET request to https://registry.terraform.io/v1/providers/terraform-providers/sdm/1.0.8/download/darwin/amd64
2020/08/25 13:12:11 [TRACE] HTTP client GET request to https://releases.hashicorp.com/terraform-provider-sdm/1.0.8/terraform-provider-sdm_1.0.8_SHA256SUMS
2020/08/25 13:12:11 [TRACE] HTTP client GET request to https://releases.hashicorp.com/terraform-provider-sdm/1.0.8/terraform-provider-sdm_1.0.8_SHA256SUMS.sig
2020/08/25 13:12:11 [DEBUG] verified GPG signature with key from HashiCorp Security <security@hashicorp.com>
2020/08/25 13:12:11 [DEBUG] getting provider "sdm" (terraform-providers/sdm) version "1.0.8"
2020/08/25 13:12:11 [DEBUG] plugin cache is disabled, so downloading sdm 1.0.8 from https://releases.hashicorp.com/terraform-provider-sdm/1.0.8/terraform-provider-sdm_1.0.8_darwin_amd64.zip?checksum=sha256:5454f95211ec0baa4ed43a532b743d390e4a71974915722951ceb19878adf77a
- Downloading plugin for provider "sdm" (terraform-providers/sdm) 1.0.8...
2020/08/25 13:12:11 [TRACE] HTTP client HEAD request to https://releases.hashicorp.com/terraform-provider-sdm/1.0.8/terraform-provider-sdm_1.0.8_darwin_amd64.zip
2020/08/25 13:12:11 [TRACE] HTTP client GET request to https://releases.hashicorp.com/terraform-provider-sdm/1.0.8/terraform-provider-sdm_1.0.8_darwin_amd64.zip
2020/08/25 13:12:12 [DEBUG] looking for the sdm 1.0.8 plugin we just installed
2020/08/25 13:12:12 [DEBUG] checking for provider in ".terraform/plugins/darwin_amd64"
2020/08/25 13:12:12 [DEBUG] found provider "terraform-provider-aws_v2.68.0_x4"
2020/08/25 13:12:12 [DEBUG] found provider "terraform-provider-null_v2.1.2_x4"
2020/08/25 13:12:12 [WARN] found legacy provider "terraform-provider-sdm_1.0.8"
2020/08/25 13:12:12 [DEBUG] found provider "terraform-provider-sdm_v1.0.7"
2020/08/25 13:12:12 [DEBUG] found provider "terraform-provider-template_v2.1.2_x4"
2020/08/25 13:12:12 [DEBUG] all plugins found discovery.PluginMetaSet{discovery.PluginMeta{Name:"aws", Version:"2.68.0", Path:"/Users/<<redacted>>/workspaces/<<redacted>>/.terraform/plugins/darwin_amd64/terraform-provider-aws_v2.68.0_x4"}:
struct {}{}, discovery.PluginMeta{Name:"null", Version:"2.1.2", Path:"/Users/<<redacted>>/workspaces/<<redacted>>/.terraform/plugins/darwin_amd64/terraform-provider-null_v2.1.2_x4"}:struct {}{}, discovery.PluginMeta{Name:"sdm", Version:"1.
0.7", Path:"/Users/<<redacted>>/workspaces/<<redacted>>/.terraform/plugins/darwin_amd64/terraform-provider-sdm_v1.0.7"}:struct {}{}, discovery.PluginMeta{Name:"sdm_1.0.8", Version:"0.0.0", Path:"/Users/<<redacted>>/workspaces/<<redacted>>/.terraform/plugins/darwin_amd64/terraform-provider-sdm_1.0.8"}:struct {}{}, discovery.PluginMeta{Name:"template", Version:"2.1.2", Path:"/Users/<<redacted>>/workspaces/<<redacted>>/.terraform/plugins/darwin_amd64/terraform-provider-template_v2.1.2_x4"}:struct {}{}}
2020/08/25 13:12:12 [DEBUG] filtered plugins discovery.PluginMetaSet{}

Error installing provider "sdm": failed to find installed plugin version 1.0.8; this is a bug in Terraform and should be reported.

Terraform analyses the configuration and state and automatically downloads
plugins for the providers used. However, when attempting to download this
plugin an unexpected error occurred.

This may be caused if for some reason Terraform is unable to reach the
plugin repository. The repository may be unreachable if access is blocked
by a firewall.

If automatic installation is not possible or desirable in your environment,
you may alternatively manually install plugins by downloading a suitable
distribution package and placing the plugin's executable file in the
following directory:
    terraform.d/plugins/darwin_amd64


Error: failed to find installed plugin version 1.0.8; this is a bug in Terraform and should be reported

This is using Terraform 0.12.26. The install of 1.0.7 doesn't seem to trigger the "found legacy provider" warning. I wonder if that is related. It is possible this is a Terraform problem like the error says, but given that version 1.0.7 installs fine, I'm guessing it has something to do with the publishing of 1.0.8.

created time in a month

issue opened terraform-providers/terraform-provider-aws

aws_s3_bucket_inventory attribute `bucket` documentation is confusing

<!--- Please note the following potential times when an issue might be in Terraform core:

If you are running into one of these scenarios, we recommend opening an issue in the Terraform core repository instead. --->

<!--- Please keep this note for the community --->

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

<!--- Thank you for keeping this note for the community --->

Affected Resource(s)

<!--- Please list the affected resources and data sources. --->

  • aws_s3_bucket_inventory

https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_inventory

Documentation says:

bucket - (Required) The name of the bucket where the inventory configuration will be stored.

which is somewhat confusing because the inventory configuration isn't really stored in the bucket. A slightly better wording might be to swap "stored" for "applied":

bucket - (Required) The name of the bucket where the inventory configuration will be applied.

created time in a month

push event nairb774/serving

Matt Moore

commit sha 5388b6efad78bfcad4c38e80c6cd9649b7b993e0

Have Envoy emit debug logs (#9020)

view details

Matt Moore

commit sha 480334d3d5a2ab0959c1ecb987ed654c965d5ba8

[master] Update net-certmanager nightly (#9016) Produced via: `curl https://storage.googleapis.com/knative-nightly/net-certmanager/latest/$x > /workspace/source/./third_party/cert-manager-0.12.0/$x` /assign tcnghia ZhiminXiang /cc tcnghia ZhiminXiang /test pull-knative-serving-autotls-tests

view details

Brian Atkinson

commit sha 8941cf3824354871d239351f01992fff72a2e7f9

channel only semaphore

view details

push time in a month

create branch nairb774/serving

branch : tender_darwin

created branch time in a month

fork nairb774/serving

Kubernetes-based, scale-to-zero, request-driven compute

https://knative.dev/docs/serving/

fork in a month

issue opened istio/istio.io

Possible alternative solution to "Provision a certificate and key for an application without sidecars"

Context: https://preliminary.istio.io/latest/blog/2020/proxy-cert/

With the work in https://github.com/istio/istio/issues/23583 and https://github.com/istio/istio/pull/25363, long term this blog post is likely to age poorly. On the other hand, I think there is a more elegant solution that became possible with the 1.6 release, and likely is going to be more maintainable for users over the long run. The one major downside, currently, is that it makes use of an undocumented setting.

First, the way in which this could be better implemented. Rather than including a manually configured proxy sidecar to mint certificates, we can make use of the proxy.istio.io/config and sidecar.istio.io/userVolumeMount annotations to tweak how the normally injected sidecar behaves. This assumes that the istio-sidecar-injector config map is mostly unmodified, but I've been able to successfully use the following pod annotations to provide certificates to our Datadog agent (in a similar way to how the prometheus stuff is done):

annotations:
  proxy.istio.io/config: |
    proxyMetadata:
      OUTPUT_CERTS: /etc/istio-certs
  sidecar.istio.io/userVolumeMount: |
    {"istio-certs": {"mountPath": "/etc/istio-certs"}}
  traffic.sidecar.istio.io/excludeOutboundIPRanges: "0.0.0.0/0"

Note: This does require an emptyDir volume to be provided, but that should be fairly self-explanatory.

As I mentioned before, this takes advantage of the undocumented proxyMetadata field in the ProxyConfig to set the OUTPUT_CERTS environment variable like the existing blog post does. If the proxyMetadata field were documented, then annotations like the above would be a much more sustainable approach to provisioning certificates than the previous manual container configuration. The other option would be to have some sort of outputCerts option which, depending on how much magic is desired, could be a path to export the certs to, a volume name to mount and export the certs to, or just a boolean indicating that a volume should be provisioned with a predetermined name and the certs written there.

No matter the direction, some sort of improved documentation in this space could go a long way toward making this uncommon but powerful escape hatch easier to maintain.

created time in a month

issue comment istio/istio

Many blackholed requests cause Envoy to consume excessive ram/cpu

Apparently I need to slow down and read, because once I did that things look like they will be in a better state.

reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=unknown;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_requests_total: 17
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=unknown;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_request_bytes: P0(nan,1200.0) P25(nan,1225.0) P50(nan,1250.0) P75(nan,1275.0) P90(nan,1290.0) P95(nan,1295.0) P99(nan,1299.0) P99.5(nan,1299.5) P99.9(nan,1299.9) P100(nan,1300.0)
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=unknown;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_request_duration_milliseconds: P0(nan,0.0) P25(nan,0.0) P50(nan,0.0) P75(nan,11.2917) P90(nan,11.7167) P95(nan,11.8583) P99(nan,11.9717) P99.5(nan,11.9858) P99.9(nan,11.9972) P100(nan,12.0)
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=unknown;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_response_bytes: P0(nan,54.0) P25(nan,54.25) P50(nan,54.5) P75(nan,54.75) P90(nan,54.9) P95(nan,54.95) P99(nan,54.99) P99.5(nan,54.995) P99.9(nan,54.999) P100(nan,55.0)

Now I need to find a way to prevent the Istio Operator from overwriting the values. I can see a few options, but I think this looks to be a much better state.

I was meant to add this option to those two blocks.

Is this something you think might be changing in a future release?

Thank you immensely for your patience in the face of my apparently lacking comprehension skills.

nairb774

comment created time in 2 months

issue comment istio/istio

Many blackholed requests cause Envoy to consume excessive ram/cpu

disable_host_header_fallback was already set to true (state prior to edits), which is likely what got me turned around. Your initial diff was clear; I mentally broke when trying to apply it.

nairb774

comment created time in 2 months

issue comment istio/istio

Many blackholed requests cause Envoy to consume excessive ram/cpu

I got some time to try out the suggestions. Turning off telemetry is a little painful as it seems to break how Datadog is getting data. We are using their default integration, and I have not dug into any sort of configuration that might exist there.

Removing disable_host_header_fallback doesn't seem to help. Assuming I'm looking at the right code (https://github.com/istio/proxy/blob/1.6.4/extensions/stats/plugin.cc#L157 and https://github.com/istio/proxy/blob/1.6.4/extensions/common/context.cc#L106-L110), I would assume it to work. I restarted the istiod pods and then restarted the problematic Knative pod. I've attached all of the envoyfilter objects in the istio-system namespace. As a last-ditch effort, I even edited the 1.4 and 1.5 objects as well, with no success. Stats lines like the following are still showing up for every errant attempt by the Knative autoscaler to reach individual pods:

reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.11.74:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_requests_total: 1
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.14.241:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_requests_total: 1
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.32.253:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_requests_total: 1
...
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.11.74:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_request_bytes: P0(nan,1200.0) P25(nan,1225.0) P50(nan,1250.0) P75(nan,1275.0) P90(nan,1290.0) P95(nan,1295.0) P99(nan,1299.0) P99.5(nan,1299.5) P99.9(nan,1299.9) P100(nan,1300.0)
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.11.74:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_request_duration_milliseconds: P0(nan,0.0) P25(nan,0.0) P50(nan,0.0) P75(nan,0.0) P90(nan,0.0) P95(nan,0.0) P99(nan,0.0) P99.5(nan,0.0) P99.9(nan,0.0) P100(nan,0.0)
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.11.74:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_response_bytes: P0(nan,54.0) P25(nan,54.25) P50(nan,54.5) P75(nan,54.75) P90(nan,54.9) P95(nan,54.95) P99(nan,54.99) P99.5(nan,54.995) P99.9(nan,54.999) P100(nan,55.0)
...

Any thoughts on next steps? In the meantime, I should update the cluster to 1.6.7 (from 1.6.4).

nairb774

comment created time in 2 months

push event nairb774/devpi

Brian Atkinson

commit sha 9d2ce9e91d17282238f0fea39cdae1e658384cea

Update python to 3.8.5, and update cffi and urllib3 to latest.

view details

push time in 2 months

started cr0hn/festin

started time in 2 months

issue comment istio/istio

Many blackholed requests cause Envoy to consume excessive ram/cpu

Interesting. I'll play with this tomorrow (time willing) and report back. Knative has its own metrics pipelines for its own operations, but I have a hunch (need to test) that Datadog's Istio integration makes use of the Istio metrics. I'll turn it off and see what happens.

nairb774

comment created time in 2 months

issue opened istio/istio

Many blackholed requests cause Envoy to consume excessive ram/cpu

(NOTE: This is used to report product bugs: To report a security vulnerability, please visit https://istio.io/about/security-vulnerabilities/ To ask questions about how to use Istio, please visit https://discuss.istio.io )

Bug description

A program that makes occasional requests to services outside of the mesh, combined with the cluster having outboundTrafficPolicy: REGISTRY_ONLY set, will result in the Envoy process accumulating an unbounded number of stats entries (the entries exposed on /stats, as far as I can tell), eventually causing Envoy to consume lots of RAM and in some cases stop functioning altogether.

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure [ ] Docs [ ] Installation [X] Networking [X] Performance and Scalability [ ] Policies and Telemetry [ ] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[ ] Multi Cluster [ ] Virtual Machine [ ] Multi Control Plane

Expected behavior

RAM/CPU stays bounded and Envoy remains functional

Steps to reproduce the bug

Important context: Istio has outboundTrafficPolicy:REGISTRY_ONLY set.

The Knative autoscaler acquired a feature which attempts to scrape statistics directly from individual pods. This direct scraping happens on autoscaler startup, and on occasion when some internal state gets reset. This results in periodic bursts of 1-10ish requests being attempted to individual pods to grab stats. After the autoscaler attempts to make these fetches, and finds out it isn't possible, it proceeds to scrape via the service (good!). As pods move around, and the autoscaler's internal state gets reset, it goes back to attempting to talk to the pods directly. This causes the number of /stats entries in Envoy to continuously grow, because each attempt to scrape a pod results in an attempt to contact a unique IP. Eventually Envoy struggles to operate correctly, and seems to either lock up or become so slow as to be non-functional. With the Knative autoscaler it seems to take roughly a week to cause full breakdown of Envoy. The direct pod probes happen at a rate of about 10ish every 15 minutes (really slow burn). Lots of details in https://github.com/knative/serving/issues/8761 for the curious.

https://github.com/nairb774/experimental/blob/28f76fcc2db73ed30c46aff7ce4b25a47515d25c/http-prober/main.go is a simplified and accelerated reproduction of the behavior of the Knative autoscaler component. This can be deployed/run with ko to simulate a simplified Knative autoscaler behavior. Within a minute or two of running the http-prober, the /stats/prometheus page on the Envoy dashboard (istioctl dashboard envoy) takes forever to return, and the response size is well over 100MiB (in the one test I ran) with about 140k rows. The memory usage of Envoy also ballooned to about 500MiB in that one minute.
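
For a rough idea of what that program does (a heavily simplified sketch of the behavior, not the linked code):

package main

import (
	"fmt"
	"net/http"
	"time"
)

// Each iteration targets a fresh, almost certainly unreachable pod IP, so with
// outboundTrafficPolicy REGISTRY_ONLY every request is answered by the
// sidecar's BlackHoleCluster and leaves behind a new set of per-destination
// stats entries that never appear to be garbage collected.
func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	for i := 0; ; i++ {
		addr := fmt.Sprintf("http://10.11.%d.%d:9090/metrics", i/256%256, i%256)
		resp, err := client.Get(addr)
		if err == nil {
			resp.Body.Close() // typically a 502 served by the blackhole cluster
		}
		time.Sleep(100 * time.Millisecond)
	}
}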

Ideally the slow burn version of the Knative autoscaler wouldn't cause Envoy to topple over (though it is just as easy to say the autoscaler is as much at fault). It doesn't look like the blackhole metrics get garbage collected if they sit idle for a long time.

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

$ istioctl version --remote
client version: 1.6.5
control plane version: 1.6.4
data plane version: 1.6.4 (57 proxies)
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:37:12Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

Knative v0.16.0 (for what it is worth)

How was Istio installed?

Istio Operator (version 1.6.4)

Environment where bug was observed (cloud vendor, OS, etc)

AWS EKS 1.16 AWS Managed Node Groups (1.16.8-20200618 / ami-0d960646974cf9e5b)

Additionally, please consider attaching a cluster state archive by attaching the dump file to this issue.

Here is an operator config for Istio: istio-system.yaml.gz which is from the cluster exhibiting the issue, both via the Knative autoscaler and via the minimal program above. I'm a little apprehensive dumping so much information from the cluster and making it publicly accessible. I'm happy to pull specific logs/info that might be useful. I'm even happy to hop on a VC and do some live debugging/reproductions if that would help out.

created time in 2 months

push event nairb774/experimental

Brian Atkinson

commit sha 28f76fcc2db73ed30c46aff7ce4b25a47515d25c

http-prober: Sample program emulating the behavior of the Knative autoscaler. The Knative autoscaler attempts to reach Pods directly on occasion, and when it does, the request is rejected by the REGISTRY_ONLY setting. This is a simplified use case for https://github.com/knative/serving/issues/8761 to present to the Istio folks.

view details

push time in 2 months

create branch nairb774/experimental

branch : master

created branch time in 2 months

created repository nairb774/experimental

created time in 2 months

issue comment knative/serving

Interaction between Autoscaler and Istio Proxy triggering memory leak

If you mean Knative Revisions, there haven't been explicit changes to those - we have only made use of the Knative Services, and they haven't been changed/deployed since July 1 - what could be triggering changes to revisions?

nairb774

comment created time in 2 months

issue comment knative/serving

Interaction between Autoscaler and Istio Proxy triggering memory leak

Here is the current hypothesis:

Knative, by way of trying to scrape metrics from individual pods, is inducing a large number of metrics (think prometheus) to be generated in Envoy which are never garbage collected.

Reasoning:

While the pod scrape logic attempts to remember if scraping pods is possible, the CreateOrUpdate method of the MetricCollector causes the replacement of the stats scraper. This seems to happen approximately at the same rate memory is consumed by the Envoy proxy:

image

Where that is trying to match log lines like the following:

{
  "caller": "metrics/stats_scraper.go:267",
  "commit": "d74ecbe",
  "level": "info",
  "logger": "autoscaler",
  "msg": "Pod 10.11.15.78 failed scraping: GET request for URL \"http://10.11.15.78:9090/metrics\" returned HTTP status 502",
  "ts": "2020-07-24T00:01:53.379Z"
}

The two graphs are of similar shape. Taking the IP from the sample log line and poking in the /stats endpoint of the Envoy proxy, there are the following lines:

reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.15.78:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_requests_total: 2
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.15.78:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_request_bytes: P0(nan,1200.0) P25(nan,1225.0) P50(nan,1250.0) P75(nan,1275.0) P90(nan,1290.0) P95(nan,1295.0) P99(nan,1299.0) P99.5(nan,1299.5) P99.9(nan,1299.9) P100(nan,1300.0)
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.15.78:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_request_duration_milliseconds: P0(nan,0.0) P25(nan,0.0) P50(nan,0.0) P75(nan,0.0) P90(nan,0.0) P95(nan,0.0) P99(nan,0.0) P99.5(nan,0.0) P99.9(nan,0.0) P100(nan,0.0)
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.15.78:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_response_bytes: P0(nan,54.0) P25(nan,54.25) P50(nan,54.5) P75(nan,54.75) P90(nan,54.9) P95(nan,54.95) P99(nan,54.99) P99.5(nan,54.995) P99.9(nan,54.999) P100(nan,55.0)

Sampling a few of the recent log lines from above reveals they are all showing up in the /stats output. Sampling a few log lines from hours ago shows that the IPs from those failed pod scraping events are still in the /stats endpoint even though it has been ~3 hours since the attempt to scrape the pod.

The Istio cluster is set up with outboundTrafficPolicy set to REGISTRY_ONLY, which might be contributing to the stats lines.

An easy fix could be to persist podsAddressable in some way across calls to updateScraper. I'm sure there is some nuance there that I don't yet understand, but it would be really helpful, as I don't anticipate that getting Envoy to drop old metrics is going to happen quickly (still looking to see if an issue is even filed yet).
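
A very rough sketch of that idea, with hypothetical types rather than Knative's actual collector/scraper code:

package main

import "fmt"

// Hypothetical, simplified types; the real collector and scraper differ. The
// only point of the sketch is carrying over the "pods are not directly
// addressable" discovery when a scraper is replaced, so each new scraper does
// not re-probe every pod IP through the mesh.
type scraper struct {
	podsAddressable bool
}

type collection struct {
	scraper *scraper
}

// updateScraper swaps in a new scraper but preserves what the old one learned
// about direct pod access.
func (c *collection) updateScraper(next *scraper) {
	if c.scraper != nil && !c.scraper.podsAddressable {
		next.podsAddressable = false
	}
	c.scraper = next
}

func main() {
	c := &collection{scraper: &scraper{podsAddressable: false}}
	c.updateScraper(&scraper{podsAddressable: true})
	fmt.Println(c.scraper.podsAddressable) // false: the earlier discovery survives
}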

nairb774

comment created time in 2 months

issue comment knative/serving

Interaction between Autoscaler and Istio Proxy triggering memory leak

I started on this side as the problem didn't manifest with 0.14.0+Istio 1.6.4, and only started once upgrading Knative to 0.15.2 (and later 0.16.0).

https://github.com/istio/istio/issues/25145 is interesting as it might share a similar root cause, the notable difference being that 25145 is about the ingressgateway, while here the issue is the sidecar of the autoscaler. Unfortunately I don't know enough about the behavior of the autoscaler to know what a common root cause might look like.

nairb774

comment created time in 2 months

issue opened knative/serving

Interaction between Autoscaler and Istio Proxy triggering memory leak

<!-- If you need to report a security issue with Knative, send an email to knative-security@googlegroups.com. -->

I'm just starting to dig into this, but I'm filing the bug to raise visibility and hopefully be able to get some help debugging.

<!--

In what area(s)?

Remove the '> ' to select

/area API /area autoscale /area build /area monitoring /area networking /area test-and-release

Other classifications:

/kind good-first-issue /kind process /kind spec -->

What version of Knative?

<!-- Delete all but your choice --> Both:

0.15.2 0.16.0

but not 0.14.0

Expected Behavior

<!-- Briefly describe what you expect to happen -->

The memory usage of the autoscaler's Istio proxy stays relatively flat, or at least proportional to some reasonable metric (number of pods, number of ksvcs, ...).

Actual Behavior

<!-- Briefly describe what is actually happening -->

The memory usage of the Istio proxy container seems to grow over time.

Steps to Reproduce the Problem

On July 8th, I updated our cluster from 0.14.0 to 0.15.2. About a week later, on July 14th, traffic going to Knative services ended up grinding to a halt. In a quick effort to get things running again, I restarted a number of pods, which quickly got things back working. I didn't spend much time looking through logs and metrics as I knew at the time there was a 0.16.0 release available. On July 15th, I updated the cluster to 0.16.0. Just a few hours ago (July 22nd), traffic seized up again.

Looking at the Knative pods I saw:

NAMESPACE                 NAME                                                             READY   STATUS             RESTARTS   AGE
knative-serving           activator-6768988647-npxzd                                       1/2     CrashLoopBackOff   59         4h24m
knative-serving           autoscaler-57cb4c8475-vvtn7                                      1/2     Running            0          5d23h
knative-serving           controller-5448b975d-k7z7q                                       2/2     Running            0          5d12h
knative-serving           istio-webhook-594cc55456-25sfj                                   2/2     Running            0          7d2h
knative-serving           networking-istio-7b9854496d-h6r56                                1/1     Running            0          7d2h
knative-serving           webhook-756b79fb8d-b2q2j                                         2/2     Running            0          7d2h

I noticed that the autoscaler only had one of the two containers running. To quickly fix the problem, the autoscaler pod was deleted, and once the new autoscaler pod was started, traffic began to flow again. I only noticed this now, but the age of the activator-6768988647-npxzd roughly aligns with the start of the outage. This cluster is serving low priority traffic (batch pipelines and whatnot), so the top line monitoring is quite delayed.

The autoscaler-57cb4c8475-vvtn7 pod was suffering the following events:

LAST SEEN   TYPE      REASON      OBJECT                            MESSAGE
20m         Warning   Unhealthy   pod/autoscaler-57cb4c8475-vvtn7   Readiness probe failed: Get http://10.11.9.139:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Graphs of memory and CPU usage over the last month show the change in behavior and the ramping memory usage in the Istio proxy container: [image]

A few notes about this cluster which might be contributing:

  • Traffic coming into the cluster is unreasonably spiky, which results in lots of scale-up/scale-down behavior: 1->100->1 within a few minutes, over and over.
  • Overall traffic isn't really high - on the order of 1k qps - but the concurrency of several ksvcs is <10, which, combined with the traffic changes, results in lots of pod churn.
  • The cluster has been running EKS 1.16 and Istio 1.6.4 for most of this period.
  • Early logs of the activator-6768988647-npxzd have a number of messages like Websocket connection could not be established, as well as a bunch of Envoy request logs with response_code: UH, which I believe means "No healthy upstream hosts in upstream cluster" in addition to the 503 response code.

I'm going to let the current autoscaler run a little, then see what metrics/stats/(core?) I can grab off it to figure out what might be contributing to the problem.

I am fully aware that the problem is manifesting in the Istio container, and there is a high likelihood that the bug is over there. Any help collecting application-specific data, or diagnostics to pin down which system has the problem, would be greatly appreciated. I started on this side because the problem didn't manifest with 0.14.0 + Istio 1.6.4, and only appeared after upgrading Knative to 0.15.2 (and later 0.16.0). My guess is that something changed on the Knative side which is having a bad interaction with Istio - I just need help finding it.

Feel free to ping me on the Knative slack (nairb774) if some live debugging would help.

created time in 2 months

startedaws/amazon-eks-pod-identity-webhook

started time in 2 months

startedpperzyna/awesome-operator-frameworks

started time in 2 months

push eventnairb774/devpi

Brian Atkinson

commit sha 37118cc84e602d4bce41257233659509aaefd347

enable root pypi: This is needed to be able to run this like a mirror.

view details

Brian Atkinson

commit sha cf955ad6c52b6921df25b95d0f49c9b3e914aa89

dependency updates

view details

push time in 2 months

issue openedaws/aws-codebuild-docker-images

aws/codebuild/standard:2.0 no longer maintained as of June 2020?

According to https://docs.aws.amazon.com/codebuild/latest/userguide/build-env-ref-available.html, it looks like the aws/codebuild/standard:2.0 image is no longer maintained. Having this information surfaced here would also be useful. The root README still references building the standard:2.0 image as an example, which might be worth updating.

I only spotted the footnote by accident, so making such things more visible would be nice; it would prevent building an image only to find out that it isn't worth using.

created time in 3 months

delete branch nairb774/istio

delete branch : exciting_morse

delete time in 3 months

issue openedistio/istio

dns: poor message ID management could result in mis-routed responses.

(NOTE: This is used to report product bugs: To report a security vulnerability, please visit https://istio.io/about/security-vulnerabilities/ To ask questions about how to use Istio, please visit https://discuss.istio.io )

Bug description

While authoring https://github.com/istio/istio/pull/25249 I noticed another issue in the pkg/dns/dns.go code.

There is a correctness bug: outID is only 16 bits, and the implementation blindly allows the value to wrap around without checking. Furthermore, if the connection established in openTLS is laggy enough, responses can be returned for the wrong messages. The process by which this could happen is the following:

  1. Receive incoming message and allocate an outID.
  2. Message is forwarded out h.conn.
  3. Response on h.conn is significantly delayed such that ~65k requests later, the outID is reused in h.pending
  4. Initial response comes back and is returned to the reused outID.

While this is unlikely, it isn't entirely out of the realm of possibility given the implementation.
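
As a stand-alone illustration (assuming a simplified forwarder, not the actual pkg/dns/dns.go code), one way to avoid the reuse is to skip over IDs that are still pending rather than blindly wrapping around:

package dnsproxy

import (
    "errors"
    "sync"
)

// forwarder is a simplified stand-in for the handler described above.
type forwarder struct {
    mu      sync.Mutex
    nextID  uint16
    pending map[uint16]chan []byte // outID -> channel the waiting caller reads from
}

// allocateID hands out an ID that is not currently in flight and fails if all
// 65536 IDs are pending, at which point no further queries can safely share
// the same upstream connection.
func (f *forwarder) allocateID() (uint16, error) {
    f.mu.Lock()
    defer f.mu.Unlock()
    if f.pending == nil {
        f.pending = make(map[uint16]chan []byte)
    }
    for i := 0; i < 1<<16; i++ {
        id := f.nextID
        f.nextID++ // uint16 arithmetic wraps around on its own
        if _, inFlight := f.pending[id]; !inFlight {
            f.pending[id] = make(chan []byte, 1)
            return id, nil
        }
    }
    return 0, errors.New("no free DNS message IDs: too many requests in flight")
}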

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure [ ] Docs [ ] Installation [X] Networking [ ] Performance and Scalability [ ] Policies and Telemetry [ ] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[ ] Multi Cluster [ ] Virtual Machine [ ] Multi Control Plane

Expected behavior

Steps to reproduce the bug

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

How was Istio installed?

Environment where bug was observed (cloud vendor, OS, etc)

Additionally, please consider attaching a cluster state archive by attaching the dump file to this issue.

created time in 3 months

push eventnairb774/istio

Brian Atkinson

commit sha 7dc600ae692672cbf052ade1a04b21b1c8b35af0

dns: Stop timeout timer on successful resolution.

view details

Brian Atkinson

commit sha b533b227a0953d56d22fe622d7fa1948e167d428

dns: Make pending channel buffered. The unconditional send inside of openTLS can result in goroutines getting stuck on an unconditional send. By making the channel buffered, the unconditional send will always succeed. It is not possible to make the send conditional, as it can race to the read from the channel.

view details

push time in 3 months

PR opened istio/istio

Reviewers
dns: Deadlock fix and performance improvements.

Please provide a description for what this PR is for.

The pkg/dns/dns.go implementation currently leaks timers and has a channel send/receive deadlock race. The included commits address the resource waste and eliminate the potential for a deadlock.
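
As a rough stand-alone sketch of the two patterns involved (the shape below is an assumption for illustration, not the dns.go code itself): give each request a reply channel with a buffer of one so the responder's send can never block, and stop the timeout timer as soon as a reply arrives so timers are not left running until they fire.

package dnsproxy

import (
    "errors"
    "time"
)

// forwardOnce sends a query and waits for a single reply or a timeout. The
// reply channel is expected to be created with make(chan []byte, 1) so that a
// late responder always completes its send even if this waiter has given up.
func forwardOnce(send func() error, reply <-chan []byte, timeout time.Duration) ([]byte, error) {
    if err := send(); err != nil {
        return nil, err
    }
    t := time.NewTimer(timeout)
    defer t.Stop() // stop the timer on the success path instead of leaking it
    select {
    case msg := <-reply:
        return msg, nil
    case <-t.C:
        return nil, errors.New("dns forward timed out")
    }
}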

And to help us figure out who should review this PR, please put an X in all the areas that this PR affects.

[ ] Configuration Infrastructure [ ] Docs [ ] Installation [X] Networking [X] Performance and Scalability [ ] Policies and Telemetry [ ] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure

+4 -3

0 comment

1 changed file

pr created time in 3 months

push eventnairb774/istio

Brian Atkinson

commit sha 950d34a2e5a7e061f66c5c8702f27ede7e7c1953

dns: Make pending channel buffered. The unconditional send inside of openTLS can result in goroutines getting stuck on an unconditional send. By making the channel buffered, the unconditional send will always succeed. It is not possible to make the send conditional, as it can race to the read from the channel.

view details

push time in 3 months

create branchnairb774/istio

branch : exciting_morse

created branch time in 3 months

fork nairb774/istio

Connect, secure, control, and observe services.

https://istio.io

fork in 3 months

push eventnairb774/codesearch

Brian Atkinson

commit sha 8422a1c14bcbfdf65f1217caa95cedd6d2ee4437

Add project specific README.

view details

push time in 3 months

issue openedterraform-providers/terraform-provider-aws

aws_eks_node_group can't set min_size of 0

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

$ terraform -v
Terraform v0.12.26
+ provider.aws v2.68.0
+ provider.template v2.1.2

Affected Resource(s)

  • aws_eks_node_group

Terraform Configuration Files

resource "aws_eks_node_group" "eks_node_group" {
  cluster_name    = locals.eks_cluster_name
  node_group_name = locals.eks_node_group_name
  node_role_arn   = locals.eks_worker_role_arn
  instance_types  = [locals.eks_instance_type]
  subnet_ids      = [locals.eks_subnet_id]

  scaling_config {
    desired_size = 0
    min_size = 0
    max_size = 50
  }

  lifecycle {
    ignore_changes = [
      scaling_config[0].desired_size,
    ]
  }
}

Error Output

Error: expected scaling_config.0.min_size to be at least (1), got 0

  on eks-cluster.tf line 1, in resource "aws_eks_node_group" "eks_node_group":
  1: resource "aws_eks_node_group" "eks_node_group" {



Error: expected scaling_config.0.desired_size to be at least (1), got 0

  on eks-cluster.tf line 1, in resource "aws_eks_node_group" "eks_node_group":
  1: resource "aws_eks_node_group" "eks_node_group" {

Expected Behavior

Be able to create the managed node group with a min size and initial (desired) size of 0. The Cluster Autoscaler is able to scale up node groups/ASGs from a size of 0, but it is not currently possible to create such node groups. Ideally, the provider should not enforce constraints that are left unspecified in the AWS documentation (see references).
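
For reference, the change being requested likely amounts to relaxing a couple of validators in the provider's schema. The sketch below is a guess at the shape of that schema using the Terraform plugin SDK; the actual resource definition in the provider may differ.

package aws

import (
    "github.com/hashicorp/terraform-plugin-sdk/helper/schema"
    "github.com/hashicorp/terraform-plugin-sdk/helper/validation"
)

// scalingConfigSchema is a hypothetical sketch of the scaling_config block,
// showing the kind of change requested: stop enforcing a minimum of 1 that the
// AWS documentation does not specify.
func scalingConfigSchema() *schema.Schema {
    return &schema.Schema{
        Type:     schema.TypeList,
        Required: true,
        MaxItems: 1,
        Elem: &schema.Resource{
            Schema: map[string]*schema.Schema{
                "desired_size": {
                    Type:         schema.TypeInt,
                    Required:     true,
                    ValidateFunc: validation.IntAtLeast(0), // rather than IntAtLeast(1)
                },
                "min_size": {
                    Type:         schema.TypeInt,
                    Required:     true,
                    ValidateFunc: validation.IntAtLeast(0), // rather than IntAtLeast(1)
                },
                "max_size": {
                    Type:         schema.TypeInt,
                    Required:     true,
                    ValidateFunc: validation.IntAtLeast(1),
                },
            },
        },
    }
}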

Actual Behavior

Steps to Reproduce

  1. terraform apply

The above errors are produced.

Important Factoids

References

created time in 3 months

issue commentkubernetes/kubernetes

Cannot drain node with pod with more than one Pod Disruption Budget

Can you explain a little bit why you want to do this?

I think multiple PDBs applying to a Pod can make sense if you think about disruptions at different levels of aggregation of an application. If you imagine the smallest unit of aggregation being a Deployment, there might be a desire to limit disruptions for that Deployment. The next level up the aggregation tree could be a Service. It isn't all that far-fetched for a Service to blend together one or more Deployments, and as a result to want to limit disruptions at that level as well.

It was based on this thinking that I ended up here, after experimenting with ways to simplify management of developer applications by automatically creating PDBs for each Deployment and Service in the cluster. The idea is that for each Deployment there should be a maximal amount of disruption it can endure at any given time; I tried to cap every Deployment at 25% max unavailable (a number pulled from thin air). Along the same lines, I capped the amount of disruption a Service could endure as well. Since a single Service can be made up of one or more constituent parts, all blended together into a single service (primary+canary deployments, for example), it seemed sensible to apply a disruption budget at this level too; I picked 10% unavailable for Services. The expectation was that if a Service was made up of a single Deployment, the 10% rule would be stricter than the 25% rule and would automatically constrain disruptions. If the Service ever became more complex, the 10% and 25% rules would still provide a sensible level of SLA at each level of aggregation. It is entirely possible that some Services or Deployments would need to be tightened further - the hope is that an additional, manual PDB would be added to restrict the behavior as needed and the automatic ones (roughly the shape sketched below) could be left in place.
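
A minimal sketch of what those generated budgets amount to, using client-go API types; the names and label selectors below are purely illustrative assumptions, not the actual tooling.

package pdbgen

import (
    policyv1beta1 "k8s.io/api/policy/v1beta1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// pdbFor caps disruption for whatever the selector matches. In the scheme
// described above this is created once per Deployment ("25%") and once per
// Service ("10%").
func pdbFor(name, namespace, maxUnavailable string, selector map[string]string) *policyv1beta1.PodDisruptionBudget {
    mu := intstr.FromString(maxUnavailable)
    return &policyv1beta1.PodDisruptionBudget{
        ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
        Spec: policyv1beta1.PodDisruptionBudgetSpec{
            MaxUnavailable: &mu,
            Selector:       &metav1.LabelSelector{MatchLabels: selector},
        },
    }
}

// Example: one budget per aggregation level for the same application.
var (
    deploymentPDB = pdbFor("my-app-deployment", "default", "25%", map[string]string{"app": "my-app"})
    servicePDB    = pdbFor("my-app-service", "default", "10%", map[string]string{"service": "my-app"})
)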

This was born out of an effort to deploy "sensible defaults" to our clusters. The lack of PDBs means that things which expect to be able to lean on them (VPA, for example) can cause fairly large service disruptions when they kick in. I have yet to deploy any sort of chaos-monkey tooling, but I'd imagine being able to partially lean on sensible PDBs is going to go a long way towards having chaos without large disasters.

I hope that explains some of the reasoning behind having multiple PDBs apply to a single Pod and how one might expect them to interact.

yaseenhamdulay

comment created time in 3 months
