crhuber HelloFresh Berlin, Germany

hellofresh/eks-rolling-update 136

EKS Rolling Update is a utility for updating the launch configuration of worker nodes in an EKS cluster.

crhuber/linux-cheatsheet 16

Handy commands for Linux

crhuber/flask-prelaunchr 2

A small and simple Flask project that is ready to be used as a prelaunch site

crhuber/cantada 1

Tracking the stuff you can't use in Canada

crhuber/golinks 1

An internal URL shortener written in Python with a Vue frontend

crhuber/aws-node-termination-handler 0

A Kubernetes Daemonset to gracefully handle EC2 instance shutdown

crhuber/charts 0

Curated applications for Kubernetes

crhuber/descheduler 0

Descheduler for Kubernetes

crhuber/django-rest-framework-mongoengine 0

Mongoengine support for Django Rest Framework

issue closed hellofresh/eks-rolling-update

Clarify documentation on termination check

The script timed out due to an instance taking longer than usual to terminate.

The environment variable that needs to be set to increase the number of checks is GLOBAL_MAX_RETRY. However, the documentation describes it as "Number of attempts of a health check". The termination check is not a health check, so the documentation should be corrected to something like "Number of attempts of a health or termination check".
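As a workaround for the timeout itself, the check count can be raised via environment variables, roughly as sketched below. GLOBAL_MAX_RETRY comes from this issue; GLOBAL_HEALTH_WAIT and the CLI invocation are assumptions based on the project README and may differ.

# Sketch: allow more attempts before the script gives up waiting for a node
# to terminate or become healthy (variable names other than GLOBAL_MAX_RETRY
# are assumed from the README, not confirmed by this issue).
export GLOBAL_MAX_RETRY=24       # attempts per check
export GLOBAL_HEALTH_WAIT=30     # seconds between attempts
python eks_rolling_update.py --cluster_name my-eks-cluster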

closed time in 18 days

farhank3389

issue comment hellofresh/eks-rolling-update

Clarify documentation on termination check

Fixed by #58

farhank3389

comment created time in 18 days

issue closed hellofresh/eks-rolling-update

cordoning node removes the node from loadbalancer causing downtime

As discussed in https://github.com/kubernetes/kubernetes/issues/65013, while rolling the node pool, cordoning the nodes removes them all from the load balancer, which might cause downtime. Instead of cordoning a node, we can taint it to make it non-schedulable.

Update: this might not cause downtime, as the scale-up of nodes in the ASG is done before the nodes are cordoned.
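For reference, the difference between the two approaches at the kubectl level (node name and taint key below are illustrative, not the project's actual values):

# Cordoning marks the node unschedulable; as discussed in the linked Kubernetes
# issue, unschedulable nodes could also be excluded from Service load balancer backends.
kubectl cordon ip-10-0-1-23.eu-west-1.compute.internal

# Tainting with NoSchedule keeps new pods off the node without marking it unschedulable.
kubectl taint nodes ip-10-0-1-23.eu-west-1.compute.internal eks-rolling-update=true:NoSchedule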

closed time in 18 days

akshaychitneni

issue comment hellofresh/eks-rolling-update

cordoning node removes the node from loadbalancer causing downtime

Resolved by https://github.com/hellofresh/eks-rolling-update/pull/49

akshaychitneni

comment created time in 18 days

push event hellofresh/eks-rolling-update

Akshay Chitneni

commit sha 1514239dc894c7c07560eff09630bea8d8385d47

fixing cordoning of nodes

Akshay Chitneni

commit sha 7cd9f852e9b477783c2ff02d9bf7b1bf7f4d3d2a

adding changes to support node taints using config

Akshay Chitneni

commit sha 7d3359022144aea1cac767b38cb89000fe44700c

fixing conflicts

Akshay Chitneni

commit sha 4f2acc740dba0d358d064a059f511cfb08b30650

fixing build issue

crhuber

commit sha 34a3b3e5213b7cc1c0a9daa904126d5a4d48ee09

Merge pull request #49 from akshaychitneni/master fixing cordoning of nodes during node rollout

push time in 18 days

PR merged hellofresh/eks-rolling-update

fixing cordoning of nodes during node rollout

fix for https://github.com/hellofresh/eks-rolling-update/issues/48

+46 -10

1 comment

3 changed files

akshaychitneni

pr closed time in 18 days

pull request comment hellofresh/eks-rolling-update

fixing cordoning of nodes during node rollout

thanks @akshaychitneni

akshaychitneni

comment created time in 18 days

issue closed hellofresh/eks-rolling-update

[Feature] Timeout waiting for instance scale up

Currently, after setting the desired capacity of the ASG, the script simply waits CLUSTER_HEALTH_WAIT seconds once (without any retries) to check whether all instances have come online. This works in best-case scenarios, but we observe a good amount of variance in AWS in the time it takes for instances to come online. (In the case that led to this issue, it took 9 minutes.)

I know I can increase CLUSTER_HEALTH_WAIT to 600s, but that means it always waits 10 minutes, which is a waste of time. So my request is to add a retry, so that we can have a worst-case timeout without increasing the rollout time in the best case.
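A retry loop of the kind being requested might look roughly like this (a generic Python sketch, not the project's actual code; it reuses the CLUSTER_HEALTH_WAIT/CLUSTER_HEALTH_RETRY names and defaults from the PR that later closed this issue):

import os
import time

def wait_for_cluster_healthy(check_health) -> bool:
    # check_health is a callable returning True once all ASG instances are online.
    wait = int(os.environ.get("CLUSTER_HEALTH_WAIT", "90"))
    retries = int(os.environ.get("CLUSTER_HEALTH_RETRY", "1"))
    for attempt in range(1, retries + 1):
        time.sleep(wait)
        if check_health():
            return True
        print(f"Attempt {attempt}/{retries}: cluster not healthy yet, retrying...")
    return False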

closed time in 18 days

johnthedev97

issue comment hellofresh/eks-rolling-update

[Feature] Timeout waiting for instance scale up

fixed by https://github.com/hellofresh/eks-rolling-update/pull/58

johnthedev97

comment created time in 18 days

push event hellofresh/eks-rolling-update

Michael Crosby

commit sha 86689560a53ffff89692c063edd27af2cb839a3f

adding environment variable CLUSTER_HEALTH_RETRY to allow x amount of retries(specified by the user) of a cluster health check after an ASG Scaling

Michael Crosby

commit sha 8f5dfd42dd6efcb3903038b75c53a22553213f8b

adjusted the order of the boolean checks

Michael Crosby

commit sha 036f5d3c92cc7d1964c7df31a7e65effbfdd5082

update variable descriptions to be more accurate on their purpose

Chad Wilson

commit sha 29e9e291751221ebf03f1656ee257b4f34c92555

Try and simplify the logging and checks while waiting for ASG/cluster health by avoiding nesting and boolean variable state checks

Michael Crosby

commit sha 89019e7fd7e48a7575565a7674afbb09f905611c

Merge pull request #1 from chadlwilson/retry-asg-health-check Try and simplify the logging while waiting for ASG/cluster health

crhuber

commit sha cfc3ee8a3957966c6cd504098d0aef8dbcab6a7c

Merge pull request #58 from crosbymichael1/retry-asg-health-check adding environment variable CLUSTER_HEALTH_RETRY to allow x amount of…

push time in 18 days

PR merged hellofresh/eks-rolling-update

adding environment variable CLUSTER_HEALTH_RETRY to allow x amount of…

In order to address the feature request '[Feature] Timeout waiting for instance scale up' (#55).

I also updated the documentation to address #54

I have added the environment variable CLUSTER_HEALTH_RETRY to allow a user-specified number of retries of the cluster health check after an ASG scaling. For some people it may take longer than 90 seconds for their instances to spin up, but this change can also speed things up for many people. For example, we can set CLUSTER_HEALTH_WAIT to only 5 seconds and CLUSTER_HEALTH_RETRY to 40. The maximum wait in this scenario is 200 seconds, but it is possible for it to finish within 20 seconds, so setting it up this way works both when more time is needed and when the scale-up happens quickly. I also made sure that by default it still runs only once and still uses the 90-second wait, for people who are already using it and like the way it is.
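Concretely, the configuration described above would look something like this (the values are the ones from the example, not recommendations):

# Poll every 5 seconds, up to 40 times => worst case ~200 seconds,
# but the rollout proceeds as soon as the ASG reports enough healthy instances.
export CLUSTER_HEALTH_WAIT=5
export CLUSTER_HEALTH_RETRY=40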

Here's an example snippet of it running and working:

2020-08-29 00:13:49,640 INFO Setting asg desired capacity from 1 to 2 and max size to 20...
2020-08-29 00:13:49,836 INFO Waiting for 5 seconds for ASG to scale before validating cluster health...

2020-08-29 00:13:54,841 INFO Describing autoscaling groups...
2020-08-29 00:13:55,390 INFO Current asg instance count in cluster is: 9. K8s node count should match this number
2020-08-29 00:13:55,390 INFO Checking asg eks-test-workers-2-spot instance count...
2020-08-29 00:13:55,504 INFO Asg eks-test-workers-2-spot does not have enough running instances to proceed
2020-08-29 00:13:55,504 INFO Actual instances: 1 Desired instances: 2
2020-08-29 00:13:55,504 INFO Validation failed for asg eks-test-workers-2-spot. Not enough instances online.
2020-08-29 00:13:55,504 INFO Waiting for 5 seconds for ASG to scale before validating cluster health...

2020-08-29 00:14:00,506 INFO Describing autoscaling groups...
2020-08-29 00:14:01,158 INFO Current asg instance count in cluster is: 9. K8s node count should match this number
2020-08-29 00:14:01,158 INFO Checking asg eks-test-workers-2-spot instance count...
2020-08-29 00:14:01,280 INFO Asg eks-test-workers-2-spot does not have enough running instances to proceed
2020-08-29 00:14:01,280 INFO Actual instances: 1 Desired instances: 2
2020-08-29 00:14:01,281 INFO Validation failed for asg eks-test-workers-2-spot. Not enough instances online.
2020-08-29 00:14:01,281 INFO Waiting for 5 seconds for ASG to scale before validating cluster health...

2020-08-29 00:14:06,285 INFO Describing autoscaling groups...
2020-08-29 00:14:06,778 INFO Current asg instance count in cluster is: 9. K8s node count should match this number
2020-08-29 00:14:06,778 INFO Checking asg eks-test-workers-2-spot instance count...
2020-08-29 00:14:06,930 INFO Asg eks-test-workers-2-spot does not have enough running instances to proceed
2020-08-29 00:14:06,930 INFO Actual instances: 1 Desired instances: 2
2020-08-29 00:14:06,931 INFO Validation failed for asg eks-test-workers-2-spot. Not enough instances online.
2020-08-29 00:14:06,931 INFO Waiting for 5 seconds for ASG to scale before validating cluster health...

2020-08-29 00:14:11,931 INFO Describing autoscaling groups...
2020-08-29 00:14:12,493 INFO Current asg instance count in cluster is: 9. K8s node count should match this number
2020-08-29 00:14:12,493 INFO Checking asg eks-test-workers-2-spot instance count...
2020-08-29 00:14:12,596 INFO Asg eks-test-workers-2-spot does not have enough running instances to proceed
2020-08-29 00:14:12,596 INFO Actual instances: 1 Desired instances: 2
2020-08-29 00:14:12,597 INFO Validation failed for asg eks-test-workers-2-spot. Not enough instances online.
2020-08-29 00:14:12,597 INFO Waiting for 5 seconds for ASG to scale before validating cluster health...

2020-08-29 00:14:17,601 INFO Describing autoscaling groups...
2020-08-29 00:14:18,145 INFO Current asg instance count in cluster is: 10. K8s node count should match this number
2020-08-29 00:14:18,145 INFO Checking asg eks-test-workers-2-spot instance count...
2020-08-29 00:14:18,249 INFO Asg eks-test-workers-2-spot scaled OK
2020-08-29 00:14:18,250 INFO Actual instances: 2 Desired instances: 2
2020-08-29 00:14:18,250 INFO Checking asg eks-test-workers-2-spot instance health...
2020-08-29 00:14:18,350 INFO Instance i-06b896811187d6cc0 - Healthy
2020-08-29 00:14:18,350 INFO Instance i-0be321879567e30aa - Healthy

+51 -44

7 comments

3 changed files

crosbymichael1

pr closed time in 18 days

pull request comment hellofresh/eks-rolling-update

adding environment variable CLUSTER_HEALTH_RETRY to allow x amount of…

Great work @crosbymichael1 and thanks @chadlwilson. I did some testing on my side as well and it looks good, thanks!

crosbymichael1

comment created time in 18 days

push event crhuber/linux-cheatsheet

crhuber

commit sha 935eb621416c157b22d7aacb363dd9b87ae66e8f

add cut

push time in 20 days

issue opened istio/istio

Add ability to turn on logging per proxy

Describe the feature request

Currently there is MeshConfig.accessLogFile, which controls access logging for the entire mesh. This is fine for small deployments, but for larger deployments enabling logging on thousands of sidecars can cause problems for logging infrastructure. It would be ideal to be able to turn on access logging via an annotation, via the Sidecar resource, and via the command line.
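For context, the mesh-wide setting referred to above is typically configured through the IstioOperator API, e.g.:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout   # enables Envoy access logs for every sidecar in the mesh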

Also, for short-term debugging, it would be great to be able to use istioctl proxy-config log to control access logging.

Previously, in 1.5.x, we were able to configure logging from the istio-ingressgateway via a telemetry rule, handler and instance. Without this functionality we cannot upgrade to telemetry-v2.

Describe alternatives you've considered

istioctl proxy-config log

EnvoyFilter - https://github.com/istio/istio/wiki/EnvoyFilter-Samples#tracing-and-access-logging

Affected product area (please put an X in all that apply)

[ ] Docs
[ ] Installation
[ ] Networking
[ ] Performance and Scalability
[x] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[ ] Multi Cluster
[ ] Virtual Machine
[ ] Multi Control Plane

Additional context

created time in 24 days

delete branch hellofresh/eks-rolling-update

delete branch: whitesource/migrate-configuration

delete time in a month

push event hellofresh/eks-rolling-update

whitesource-for-github-com[bot]

commit sha 49c881bbaf0df11f0b6942c144ef9742277ac363

Migrate .whitesource configuration file to inheritance mode

crhuber

commit sha 2178899eb7a234bb53dd89b9412de0c2313564cd

Merge pull request #57 from hellofresh/whitesource/migrate-configuration WhiteSource Configuration Migration

push time in a month

PR merged hellofresh/eks-rolling-update

WhiteSource Configuration Migration

Added .whitesource file. Configuration will now be inherited from the 'repo-config.json' file in the 'whitesource-config' repository.

+3 -0

1 comment

1 changed file

whitesource-for-github-com[bot]

pr closed time in a month

issue closed aws/eks-charts

[aws-termination-handler] Enabling PodMonitor Causes Error

Enabling PodMonitor on aws-node-termination-handler with the following values:

podMonitor:
  create: true
  interval: 30s
  sampleLimit: 5000
  labels:
    prometheus: kube-prometheus
STDERR:
  Error: YAML parse error on aws-node-termination-handler/templates/podmonitor.yaml: error converting YAML to JSON: yaml: line 14: did not find expected key
  Error: plugin "diff" exited with error

closed time in a month

crhuber

issue opened aws/eks-charts

[aws-termination-handler] Enabling PodMonitor Causes Error

Enabling PodMonitor on aws-node-termination-handler with the following values:

podMonitor:
  create: true
  interval: 30s
  sampleLimit: 5000
  labels:
    prometheus: kube-prometheus
STDERR:
  Error: YAML parse error on aws-node-termination-handler/templates/podmonitor.yaml: error converting YAML to JSON: yaml: line 14: did not find expected key
  Error: plugin "diff" exited with error

created time in a month

push event crhuber/aws-node-termination-handler

Craig Huber

commit sha 6ddf6e74d575717fe1c94f5b54bbf7a2127c7adf

add readme

push time in 2 months

pull request comment aws/aws-node-termination-handler

Add PodMonitor for Prometheus Metrics

@bwagner5 fixed

crhuber

comment created time in 2 months

push event crhuber/aws-node-termination-handler

Craig Huber

commit sha fa7e61d8c1ac0cdfa0afb52050102dfee89018c5

remove quotes on ports

push time in 2 months

issue comment aws/aws-node-termination-handler

Add optional Service and ServiceMonitor to helm chart to expose prometheus metrics

Added PR to create a PodMonitor https://github.com/aws/aws-node-termination-handler/pull/206

michaeljmarshall

comment created time in 2 months

PR closed aws/eks-charts

[aws-node-termination-handler] Add PodMonitor

Description of changes:

Since Prometheus metrics are now available, we want to be able to scrape those metrics. This PR adds a container port for the Prometheus metrics and creates a PodMonitor to scrape the /metrics endpoint.

By adding the following values

podMonitor:
  # Specifies whether PodMonitor should be created
  create: true
   # The Prometheus scrape interval
  interval: 30s
  # The number of scraped samples that will be accepted
  sampleLimit: 5000
   # Additional labels to add to the metadata
  labels:
    prometheus: kube-prometheus

Generates the following PodMonitor

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: release-name-aws-node-termination-handler
  namespace: default
  labels:
    app.kubernetes.io/name: aws-node-termination-handler
    helm.sh/chart: aws-node-termination-handler-0.9.1
    app.kubernetes.io/instance: release-name
    k8s-app: aws-node-termination-handler
    app.kubernetes.io/version: "1.6.1"
    app.kubernetes.io/managed-by: Tiller
    prometheus: kube-prometheus
    
spec:
  jobLabel: aws-node-termination-handler
  namespaceSelector:
    matchNames:
    - default
  podMetricsEndpoints:
  - interval: 30s
    path: /metrics
    port: http-metrics
  sampleLimit: 5000
  selector:
    matchLabels:
      app.kubernetes.io/name: aws-node-termination-handler

Ports

apiVersion: apps/v1
kind: DaemonSet
          ports:
          - containerPort: "9092"
            hostPort: "9092"
            name: http-metrics
            protocol: TCP
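For completeness, a hypothetical install of the chart with these values might look like the following (the release name and namespace are illustrative, and the Prometheus Operator CRDs must already be present for the PodMonitor to be created):

helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set podMonitor.create=true \
  --set podMonitor.interval=30s \
  --set podMonitor.sampleLimit=5000 \
  --set podMonitor.labels.prometheus=kube-prometheus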

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

+54 -1

2 comments

5 changed files

crhuber

pr closed time in 2 months

pull request comment aws/eks-charts

[aws-node-termination-handler] Add PodMonitor

closing in favour of https://github.com/aws/aws-node-termination-handler/pull/206

crhuber

comment created time in 2 months

PR opened aws/aws-node-termination-handler

Add PodMonitor

Description of changes:

Since Prometheus metrics are now available, we want to be able to scrape those metrics. This PR adds a container port for the Prometheus metrics and creates a PodMonitor to scrape the /metrics endpoint.

By adding the following values

podMonitor:
  # Specifies whether PodMonitor should be created
  create: true
   # The Prometheus scrape interval
  interval: 30s
  # The number of scraped samples that will be accepted
  sampleLimit: 5000
   # Additional labels to add to the metadata
  labels:
    prometheus: kube-prometheus

Generates the following PodMonitor

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: release-name-aws-node-termination-handler
  namespace: default
  labels:
    app.kubernetes.io/name: aws-node-termination-handler
    helm.sh/chart: aws-node-termination-handler-0.9.1
    app.kubernetes.io/instance: release-name
    k8s-app: aws-node-termination-handler
    app.kubernetes.io/version: "1.6.1"
    app.kubernetes.io/managed-by: Tiller
    prometheus: kube-prometheus
    
spec:
  jobLabel: aws-node-termination-handler
  namespaceSelector:
    matchNames:
    - default
  podMetricsEndpoints:
  - interval: 30s
    path: /metrics
    port: http-metrics
  sampleLimit: 5000
  selector:
    matchLabels:
      app.kubernetes.io/name: aws-node-termination-handler

Ports

apiVersion: apps/v1
kind: DaemonSet
          ports:
          - containerPort: "9092"
            hostPort: "9092"
            name: http-metrics
            protocol: TCP

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

+49 -0

0 comments

4 changed files

pr created time in 2 months

push event crhuber/aws-node-termination-handler

Craig Huber

commit sha 9e1fb1daf4753469e50d1075f9ece65b9ef5bdf6

Add PodMonitor

push time in 2 months

fork crhuber/aws-node-termination-handler

A Kubernetes Daemonset to gracefully handle EC2 instance shutdown

https://aws.amazon.com/ec2

fork in 2 months

PR opened aws/eks-charts

[aws-node-termination-handler] Add PodMonitor

Issue #, if available:

Description of changes:

Since Prometheus metrics are now available, we want to be able to scrape those metrics. This PR adds a container port for the Prometheus metrics and creates a PodMonitor to scrape the /metrics endpoint.

By adding the following values

podMonitor:
  # Specifies whether PodMonitor should be created
  create: true
   # The Prometheus scrape interval
  interval: 30s
  # The number of scraped samples that will be accepted
  sampleLimit: 5000
   # Additional labels to add to the metadata
  labels:
    prometheus: kube-prometheus

Generates the following PodMonitor

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: release-name-aws-node-termination-handler
  namespace: default
  labels:
    app.kubernetes.io/name: aws-node-termination-handler
    helm.sh/chart: aws-node-termination-handler-0.9.1
    app.kubernetes.io/instance: release-name
    k8s-app: aws-node-termination-handler
    app.kubernetes.io/version: "1.6.1"
    app.kubernetes.io/managed-by: Tiller
    prometheus: kube-prometheus
    
spec:
  jobLabel: aws-node-termination-handler
  namespaceSelector:
    matchNames:
    - default
  podMetricsEndpoints:
  - interval: 30s
    path: /metrics
    port: http-metrics
  sampleLimit: 5000
  selector:
    matchLabels:
      app.kubernetes.io/name: aws-node-termination-handler

Ports

apiVersion: apps/v1
kind: DaemonSet
          ports:
          - containerPort: "9092"
            hostPort: "9092"
            name: http-metrics
            protocol: TCP

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

+54 -1

0 comments

5 changed files

pr created time in 2 months

push event crhuber/eks-charts

Craig Huber

commit sha c8d1bd58794a6c74a05b3eb5b3359c6e227ee560

fix indent

push time in 2 months

push event crhuber/eks-charts

Craig Huber

commit sha 9506b4ff6c80b8611e703c110cadbdc77d356655

Add PodMonitor

push time in 2 months

fork crhuber/eks-charts

Amazon EKS Helm chart repository

fork in 2 months

issue comment istio/istio

Discovery Process reporting Pod Not Found - causing endpoints to be out of date

@kornholi We updated to Istio 1.5.4 and still occasionally see this issue

crhuber

comment created time in 3 months
