Jason Haugen haugenj Amazon Web Services Austin, TX https://aws.amazon.com

jenkinsci/ec2-fleet-plugin 68

The EC2 Spot Jenkins plugin launches EC2 Spot instances as worker nodes for Jenkins CI server, automatically scaling the capacity with the load.

haugenj/aws-node-termination-handler 1

A Kubernetes Daemonset to gracefully handle EC2 instance shutdown

haugenj/aws-simple-ec2-cli 0

A CLI tool that simplifies the process of launching, connecting to, and terminating an EC2 instance.

haugenj/ec2-fleet-plugin 0

The EC2 Spot Jenkins plugin launches EC2 Spot instances as worker nodes for Jenkins CI server, automatically scaling the capacity with the load.

haugenj/eks-charts 0

Amazon EKS Helm chart repository

haugenj/escalator 0

Escalator is a batch or job optimized horizontal autoscaler for Kubernetes

haugenj/homebrew-tap 0

Homebrew formulae that allow installation of AWS tools through the Homebrew package manager.

started awslabs/aws-simple-ec2-cli

started time in 3 days

PullRequestReviewEvent

push event aws/aws-node-termination-handler

Jason Haugen

commit sha 37bae7f91fdf0b63ebf53c083b2fdcb773bd3d8d

Reformat logs to support proper JSON logging (#255)

* Reformat logs to support proper JSON logging
* Make the log keys snake_case
* change webhook template log

view details

push time in 3 days

PR merged aws/aws-node-termination-handler

Reformat logs to support proper JSON logging

Issue #, if available: #249

Description of changes: Reformat logs so that they actually use json formatting for the errors and fields instead of hiding those values within the 'message' field.

Question - should we make the log key format more consistent?

  • Standard keys like 'message', 'time', and 'error' are lowercase due to zerolog defaults
  • The arguments are logged in kebab case to match the old format.
  • Custom keys like 'NodeName' are pascal case because I matched them against how .Interface() values are logged (they match their struct names which are pascal case)

I'm leaning toward making all the keys we control snake_case.
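As a rough sketch of the snake_case key format being discussed — using only the Go standard library rather than the project's zerolog setup, with hypothetical field names — the keys we control could be tagged like this:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// entry mimics the proposed snake_case key format for keys we control;
// "time", "message", and "error" stay lowercase as in the zerolog defaults.
// These field names are illustrative, not the actual NTH log schema.
type entry struct {
	Time     string `json:"time"`
	Message  string `json:"message"`
	NodeName string `json:"node_name,omitempty"`
	Error    string `json:"error,omitempty"`
}

// format renders one log entry as a single JSON line.
func format(e entry) string {
	b, _ := json.Marshal(e)
	return string(b)
}

func main() {
	e := entry{
		Time:     time.Now().UTC().Format(time.RFC3339),
		Message:  "Cordoning the node",
		NodeName: "ip-192-168-123-456.us-east-2.compute.internal",
	}
	fmt.Println(format(e))
}
```

With struct tags like `node_name`, every custom key comes out consistently snake_case regardless of the Go field name.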

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

+152 -91

2 comments

11 changed files

haugenj

pr closed time in 3 days

Pull request review comment aws/aws-node-termination-handler

upgrade test dependency versions

```
 aemm_helm_args=(
   "$AEMM_DL_URL"
   --namespace default
   --set servicePort="$IMDS_PORT"
+  --set 'rbac.pspEnabled=true'
```

why do we need to set this to true in all the tests now?

bwagner5

comment created time in 3 days

PullRequestReviewEvent
PullRequestReviewEvent

push event haugenj/aws-node-termination-handler

Supasteevo

commit sha dad76703aa4741cb9e78b89c026a37a31ebd60b8

Allow users to configure webhook message content with a template file (#253)

* added new config variable to customize webhook template from file
* added new webhookTemplateFile variable in helm chart
* fix gofmt & ineffassign errors
* set template file as configmap in helm chart
* check webhook template file in ValidateWebhookConfig function + set template content message as debug + typo nthConfig

Co-authored-by: Steven Bressey <steven.bressey@mediakeys.com>

view details

Jason Haugen

commit sha f90e90cc8f0ed5762e1c1d66e87bff4ab29f714e

Add AEMM mock interruption documentation (#256)

* Add AEMM mock interruption documentation
* fix misspelling

view details

Jason Haugen

commit sha 2fe8eb393e9fb7078357020b217a7ab832a5c2d3

Reformat logs to support proper JSON logging

view details

Jason Haugen

commit sha d1d99cc3aeeb1b11dbe9354045a7f2bb9aa3c6de

Make the log keys snake_case

view details

Jason Haugen

commit sha 2f47eecd6673155b05b06038ccf9789c3afbc7ab

change webhook template log

view details

push time in 3 days

push event haugenj/aws-node-termination-handler

Jason Haugen

commit sha 20e18c95fbc2047f9200a22d024010eeccf51aac

Make the log keys snake_case

view details

push time in 4 days

push event haugenj/aws-node-termination-handler

Jason Haugen

commit sha 399e58c7fb00333ea64d497b3cb070fe878f7ab0

fix misspelling

view details

push time in 4 days

PullRequestReviewEvent

Pull request review comment aws/aws-node-termination-handler

Add AEMM mock interruption documentation

# AWS Node Termination Handler & Amazon EC2 Metadata Mock

We have open sourced a tool called the [amazon-ec2-metadata-mock](https://github.com/aws/amazon-ec2-metadata-mock) (AEMM) that simulates spot interruption notices and more by starting a real webserver that serves data similar to the EC2 Instance Metadata Service. The tool is easily deployed to Kubernetes with a Helm chart.

Below is a short guide on how to set up AEMM with your Node Termination Handler cluster in case you'd like to verify the behavior yourself.

## Triggering AWS Node Termination Handler with Amazon EC2 Metadata Mock

Start by installing AEMM on your cluster. For full and up-to-date installation instructions, reference the AEMM repository. Here's just one way to do it.

Download the latest tarball from the releases page; at the time of writing that was v1.6.0. Then install it using Helm:

```
helm install amazon-ec2-metadata-mock amazon-ec2-metadata-mock-1.6.0.tgz \
  --namespace default
```

Once AEMM is installed, you need to change the instance metadata URL of Node Termination Handler to point to the location AEMM is serving from. If you use the default values of AEMM, the installation will look similar to this:

```
helm upgrade --install aws-node-termination-handler \
  --namespace kube-system \
  --set instanceMetadataURL="http://amazon-ec2-metadata-mock-service.default.svc.cluster.local:1338" \
  eks/aws-node-termination-handler
```

That's it! Instead of polling the real IMDS endpoint, AWS Node Termination Handler will instead poll AEMM. If you open the logs of an AWS Node Termination Handler pod you should see that it receives (mock) interruption events from AEMM and that the nodes are cordoned and drained. Keep in mind that these nodes won't actually get terminated, so you might need to manually uncordon them if you want to reset your test cluster.

### AEMM Advanced Configuration

If you run the example above you might notice that the logs are heavily populated. Here's an example output:

```
2020/09/15 21:13:41 Sending interruption event to the interruption channel
2020/09/15 21:13:41 Got interruption event from channel {InstanceID:i-1234567890abcdef0 InstanceType:m4.xlarge PublicHostname:ec2-192-0-2-54.compute-1.amazonaws.com PublicIP:192.0.2.54 LocalHostname:ip-172-16-34-43.ec2.internal LocalIP:172.16.34.43 AvailabilityZone:us-east-1a} {EventID:spot-itn-47ddfb5e39791606bec3e91fea4cdfa86f86a60ddaf014c8b4af8e008f134b19 Kind:SPOT_ITN Description:Spot ITN received. Instance will be interrupted at 2020-09-15T21:15:41Z
 State: NodeName:ip-192-168-123-456.us-east-1.compute.internal StartTime:2020-09-15 21:15:41 +0000 UTC EndTime:0001-01-01 00:00:00 +0000 UTC Drained:false PreDrainTask:0x113c8a0 PostDrainTask:<nil>}
WARNING: ignoring DaemonSet-managed Pods: default/amazon-ec2-metadata-mock-pszj2, kube-system/aws-node-bl2bj, kube-system/aws-node-termination-handler-2pvjr, kube-system/kube-proxy-fct9f
evicting pod "coredns-67bfd975c5-rgkh7"
evicting pod "coredns-67bfd975c5-6g88n"
2020/09/15 21:13:42 Node "ip-192-168-123-456.us-east-1.compute.internal" successfully cordoned and drained.
2020/09/15 21:13:43 Sending interruption event to the interruption channel
2020/09/15 21:13:43 Got interruption event from channel {InstanceID:i-1234567890abcdef0 InstanceType:m4.xlarge PublicHostname:ec2-192-0-2-54.compute-1.amazonaws.com PublicIP:192.0.2.54 LocalHostname:ip-172-16-34-43.ec2.internal LocalIP:172.16.34.43 AvailabilityZone:us-east-1a} {EventID:spot-itn-97be476b6246aba6401ba36e54437719bfdf987773e9c83fe30336eb7fea9704 Kind:SPOT_ITN Description:Spot ITN received. Instance will be interrupted at 2020-09-15T21:15:43Z
 State: NodeName:ip-192-168-123-456.us-east-1.compute.internal StartTime:2020-09-15 21:15:43 +0000 UTC EndTime:0001-01-01 00:00:00 +0000 UTC Drained:false PreDrainTask:0x113c8a0 PostDrainTask:<nil>}
WARNING: ignoring DaemonSet-managed Pods: default/amazon-ec2-metadata-mock-pszj2, kube-system/aws-node-bl2bj, kube-system/aws-node-termination-handler-2pvjr, kube-system/kube-proxy-fct9f
2020/09/15 21:13:44 Node "ip-192-168-123-456.us-east-1.compute.internal" successfully cordoned and drained.
2020/09/15 21:13:45 Sending interruption event to the interruption channel
2020/09/15 21:13:45 Got interruption event from channel...
```

This isn't a mistake: by default AEMM will respond to any request for metadata with a spot interruption occurring 2 minutes later than the request time.\* AWS Node Termination Handler polls for events every 2 seconds by default, so the effect is that new interruption events are found and processed every 2 seconds.

In reality there will only be a single interruption event, and you can mock this by setting the `spot.time` parameter of AEMM when installing it:

```
helm install amazon-ec2-metadata-mock amazon-ec2-metadata-mock-1.6.0.tgz \
  --set aemm.spot.time="2020-09-09T22:40:47Z" \
  --namespace default
```

Now when you check the logs you should only see a single event get processed.

For more ways of configuring AEMM check out the [Helm configuration page](https://github.com/aws/amazon-ec2-metadata-mock/tree/master/helm/amazon-ec2-metadata-mock).

## Node Termination Handler E2E Tests

AEMM started out as a test server for aws-node-termination-handler's end-to-end tests in this repo. We use AEMM throughout our end-to-end tests to create interruption notices.

The e2e tests install aws-node-termination-handler using Helm and set the metadata URL [here](https://github.com/aws/aws-node-termination-handler/blob/master/test/e2e/spot-interruption-test#L36). This becomes where aws-node-termination-handler looks for metadata; other applications on the node still look at the real EC2 metadata service.

We set the metadata URL environment variable [here](https://github.com/aws/aws-node-termination-handler/blob/master/test/k8s-local-cluster-test/run-test#L18) for the local tests that use a kind cluster, and [here](https://github.com/aws/aws-node-termination-handler/blob/master/test/eks-cluster-test/run-test#L117) for the eks-cluster e2e tests.

Check out the [ReadMe](https://github.com/aws/aws-node-termination-handler/tree/master/test) in our test folder for more info on the e2e tests.

---

\* Only the first two unique IPs to request data from AEMM receive spot ITN information in the response. This was introduced in AEMM v1.6.0 and can be overridden with a configuration parameter. For previous versions there is no unique IP restriction.


haugenj

comment created time in 4 days

PR opened aws/aws-node-termination-handler

Add AEMM mock interruption documentation

Issue #, if available: #208

Description of changes: Moves the information about using AEMM to simulate a spot interruption from this issue into a more permanent location

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

+91 -0

0 comments

1 changed file

pr created time in 4 days

create branch haugenj/aws-node-termination-handler

branch : aemm_demo

created branch time in 4 days

Pull request review comment aws/aws-node-termination-handler

Reformat logs to support proper JSON logging

```
 func addTaint(node *corev1.Node, nth Node, taintKey string, taintValue string, e
 		}
 		if err != nil {
-			log.Log().Msgf("Error while adding %v taint on node %v: %v", taintKey, node.Name, err)
+			log.Log().
```

what do you mean by "compared to the one you have posted"?

I thought wrapping our message in Err() would be misleading since the actual error is stored in err. So in the logs it would look like there are two errors when there's really only one. I removed err from the log statement here for a similar reason - since we're returning err it should be handled by the calling function and logging it in both places seemed redundant.

haugenj

comment created time in 5 days

PullRequestReviewEvent

PR opened aws/aws-node-termination-handler

Reformat logs to support proper JSON logging

Issue #, if available: #249

Description of changes: Reformat logs so that they actually use json formatting for the errors and fields instead of hiding those values within the 'message' field.

Question - should we make the log key format more consistent?

  • Standard keys like 'message', 'time', and 'error' are lowercase due to zerolog defaults
  • The arguments are logged in kebab case to match the old format.
  • Custom keys like 'NodeName' are pascal case because I matched them against how .Interface() values are logged (they match their struct names which are pascal case)

I'm leaning toward making all the keys we control snake_case.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

+148 -90

0 comments

11 changed files

pr created time in 6 days

push event haugenj/aws-node-termination-handler

Jason Haugen

commit sha 750d7fc1e62f44b6bcf9e217fed643d7767472e7

Reformat logs to support proper JSON logging

view details

push time in 6 days

push event haugenj/aws-node-termination-handler

Jason Haugen

commit sha ae0676e404e5313d35d3620e8834ba18e5f9ece7

Reduce event logging to only new events (#252)

* Reduce event logging to only new events
* Disable logging for benchmark test

view details

push time in 6 days

issue comment jenkinsci/ec2-fleet-plugin

Scale Executors By Weight -- is weight ignored when setting target size or am I making a mistake?

In the logs you'll see something like INFO: currentDemand 2 availableCapacity 0 (availableExecutors 0 connectingExecutors 0 plannedCapacitySnapshot 0 additionalPlannedCapacity 0). This shows that Jenkins is aware of the pending jobs, but no capacity has been allocated that can handle them.

By default Jenkins waits some amount of time before actually performing the scale-up operation, though I don't know exactly where that's set. If you use the "No Delay Provision Strategy" that's part of this plugin, we override the default and it should try to scale up right away.

I don't think using weighted capacity should affect the scale-up in any way though. As far as I can tell the executors are scaled by the weights after the new node has already come online (code link)

ianfixes

comment created time in 6 days

pull request comment jenkinsci/ec2-fleet-plugin

Fix executable resubmission behavior on instance termination

Thanks for submitting this! Code looks good to me, but as I'm still ramping up on Jenkins I'd like to get a second set of eyes on this before merging it in

SrodriguezO

comment created time in 10 days

issue comment jenkinsci/ec2-fleet-plugin

Scale Executors By Weight -- is weight ignored when setting target size or am I making a mistake?

Gotcha, is this a problem you're still seeing? I'm wondering if this could've been resolved since version 1.17.3 or if it's something that still needs to be investigated

ianfixes

comment created time in 10 days

issue comment jenkinsci/ec2-fleet-plugin

Scale Executors By Weight -- is weight ignored when setting target size or am I making a mistake?

@ianfixes

The problem is that the "target" spot fleet size seems to be getting set as a number of instances, when it should really be set as a target weight. So the scaling target was 10, but AWS only allocated 4 instances to meet that weight.

I'm not sure I totally understand your problem, can you elaborate on this? This sounds like it is interpreting the target fleet size as a target weight - the scaling target is 10 (weighted capacity), and instances are weighted higher than 1, so it makes sense to me that you get fewer than 10 instances.


@haedaal Can you confirm that the instances launched were actually c5.4xlarge when there were multiple jobs to run? I tried to reproduce this on my end but it only gave me the weight-1 instance types, so the number of executors per instance was 1 for me. I want to double-check you haven't had the same thing happen to you.
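To make the weighted-capacity arithmetic in this thread concrete — this is a sketch under my reading of EC2 Fleet semantics, and the function name is invented for illustration — a fleet target is a total weight, not an instance count, so the number of instances launched is the target divided by the per-instance weight, rounded up:

```go
package main

import (
	"fmt"
	"math"
)

// instancesForTarget returns how many instances of a given weight are needed
// to satisfy a weighted-capacity target: EC2 Fleet fulfills capacity in units
// of weight, so the count is ceil(target / weight).
func instancesForTarget(target, weight float64) int {
	return int(math.Ceil(target / weight))
}

func main() {
	// A scaling target of 10 met by instances weighted 2.5 each yields 4
	// instances, consistent with seeing "only" 4 instances for a target of 10.
	fmt.Println(instancesForTarget(10, 2.5)) // prints 4
}
```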

ianfixes

comment created time in 10 days

push event aws/aws-node-termination-handler

Jason Haugen

commit sha ae0676e404e5313d35d3620e8834ba18e5f9ece7

Reduce event logging to only new events (#252)

* Reduce event logging to only new events
* Disable logging for benchmark test

view details

push time in 11 days

PR merged aws/aws-node-termination-handler

Reduce event logging to only new events

Issue #, if available: #250

Description of changes: Reduce the congestion of event logging by only logging for new events. No behavioral changes.


Upside: Better looking logs

Human readable example

```
2020/09/09 16:06:44 Kubernetes AWS Node Termination Handler has started successfully!
2020/09/09 16:06:44 Started monitoring for SPOT_ITN events
2020/09/09 16:06:44 Started watching for event cancellations
2020/09/09 16:06:46 Adding new event to the event store event={"Description":"Spot ITN received. Instance will be interrupted at 2020-09-09T16:10:47Z \n","Drained":false,"EndTime":"0001-01-01T00:00:00Z","EventID":"spot-itn-a47ba0ecd3e0bf3a0023591eaf85caf67eb18a13a7c4a5e3d0c2a9885d19c791","Kind":"SPOT_ITN","NodeName":"ip-123-456-789-123.us-east-2.compute.internal","StartTime":"2020-09-09T16:10:47Z","State":""}
2020/09/09 16:08:47 Cordoning the node
2020/09/09 16:08:47 Draining the node
WARNING: ignoring DaemonSet-managed Pods: default/amazon-ec2-metadata-mock-98gtp, kube-system/aws-node-6wzns, kube-system/aws-node-termination-handler-8wkt2, kube-system/kube-proxy-k9kzp
evicting pod "coredns-67bfd975c5-c5xjh"
2020/09/09 16:08:47 Node "ip-123-456-789-123.us-east-2.compute.internal" successfully cordoned and drained.
```

json logging example

```
{"time":"2020-09-09T17:12:23Z","message":"Kubernetes AWS Node Termination Handler has started successfully!"}
{"time":"2020-09-09T17:12:23Z","message":"Started watching for event cancellations"}
{"time":"2020-09-09T17:12:23Z","message":"Started monitoring for SPOT_ITN events"}
{"event":{"EventID":"spot-itn-a47ba0ecd3e0bf3a0023591eaf85caf67eb18a13a7c4a5e3d0c2a9885d19c791","Kind":"SPOT_ITN","Description":"Spot ITN received. Instance will be interrupted at 2020-09-09T17:15:47Z \n","State":"","NodeName":"ip-123-456-789-123.us-east-2.compute.internal","StartTime":"2020-09-09T17:15:47Z","EndTime":"0001-01-01T00:00:00Z","Drained":false},"time":"2020-09-09T17:12:25Z","message":"Adding new event to the event store"}
{"time":"2020-09-09T17:13:47Z","message":"Cordoning the node"}
{"time":"2020-09-09T17:13:47Z","message":"Draining the node"}
WARNING: ignoring DaemonSet-managed Pods: default/amazon-ec2-metadata-mock-pbqgx, kube-system/aws-node-bl2bj, kube-system/aws-node-termination-handler-j8ktr, kube-system/kube-proxy-fct9f
evicting pod "coredns-67bfd975c5-qbnjd"
{"time":"2020-09-09T17:13:47Z","message":"Node \"ip-123-456-789-123.us-east-2.compute.internal\" successfully cordoned and drained."}
```

Downside: node metadata was removed from the event logs. As a k8s application, I don't think it's necessary to include that information. There's also no "heartbeat" logging anymore (event logs used to come in every 2 seconds).

old logging example:

```
2020/09/09 16:13:02 Kubernetes AWS Node Termination Handler has started successfully!
2020/09/09 16:13:02 Started watching for event cancellations
2020/09/09 16:13:02 Started monitoring for SPOT_ITN events
2020/09/09 16:13:04 Sending interruption event to the interruption channel
2020/09/09 16:13:04 Got interruption event from channel {InstanceID:i-1234567890abcdef0 InstanceType:m4.xlarge PublicHostname:ec2-123-456-789-123.compute-1.amazonaws.com PublicIP:123.456.789.123 LocalHostname:ip-123-456-789-123.ec2.internal LocalIP:123.456.789.123 AvailabilityZone:us-east-1a} {EventID:spot-itn-793d64d900d62573e95ef93881f5290b46df70d62b994613f296ca1152672b7c Kind:SPOT_ITN Description:Spot ITN received. Instance will be interrupted at 2020-09-09T16:17:47Z
 State: NodeName:ip-123-456-789-123.us-east-2.compute.internal StartTime:2020-09-09 16:17:47 +0000 UTC EndTime:0001-01-01 00:00:00 +0000 UTC Drained:false PreDrainTask:0x113c8a0 PostDrainTask:<nil>}
2020/09/09 16:13:06 Sending interruption event to the interruption channel
2020/09/09 16:13:06 Got interruption event from channel {InstanceID:i-1234567890abcdef0 InstanceType:m4.xlarge PublicHostname:ec2-123-456-789-123.compute-1.amazonaws.com PublicIP:123.456.789.123 LocalHostname:ip-123-456-789-123.ec2.internal LocalIP:123.456.789.123 AvailabilityZone:us-east-1a} {EventID:spot-itn-793d64d900d62573e95ef93881f5290b46df70d62b994613f296ca1152672b7c Kind:SPOT_ITN Description:Spot ITN received. Instance will be interrupted at 2020-09-09T16:17:47Z
 State: NodeName:ip-123-456-789-123.us-east-2.compute.internal StartTime:2020-09-09 16:17:47 +0000 UTC EndTime:0001-01-01 00:00:00 +0000 UTC Drained:false PreDrainTask:0x113c8a0 PostDrainTask:<nil>}
.
. (this repeats every 2 seconds)
.
2020/09/09 16:15:46 Sending interruption event to the interruption channel
2020/09/09 16:15:46 Got interruption event from channel {InstanceID:i-1234567890abcdef0 InstanceType:m4.xlarge PublicHostname:ec2-123-456-789-123.compute-1.amazonaws.com PublicIP:123.456.789.123 LocalHostname:ip-123-456-789-123.ec2.internal LocalIP:123.456.789.123 AvailabilityZone:us-east-1a} {EventID:spot-itn-793d64d900d62573e95ef93881f5290b46df70d62b994613f296ca1152672b7c Kind:SPOT_ITN Description:Spot ITN received. Instance will be interrupted at 2020-09-09T16:17:47Z
 State: NodeName:ip-123-456-789-123.us-east-2.compute.internal StartTime:2020-09-09 16:17:47 +0000 UTC EndTime:0001-01-01 00:00:00 +0000 UTC Drained:false PreDrainTask:0x113c8a0 PostDrainTask:<nil>}
WARNING: ignoring DaemonSet-managed Pods: default/amazon-ec2-metadata-mock-f4jrv, kube-system/aws-node-bl2bj, kube-system/aws-node-termination-handler-6zdv4, kube-system/kube-proxy-fct9f
evicting pod "coredns-67bfd975c5-hn8tz"
2020/09/09 16:15:47 Node "ip-123-456-789-123.us-east-2.compute.internal" successfully cordoned and drained.
2020/09/09 16:15:48 Sending interruption event to the interruption channel
2020/09/09 16:15:48 Got interruption event from channel {InstanceID:i-1234567890abcdef0 InstanceType:m4.xlarge PublicHostname:ec2-123-456-789-123.compute-1.amazonaws.com PublicIP:123.456.789.123 LocalHostname:ip-123-456-789-123.ec2.internal LocalIP:123.456.789.123 AvailabilityZone:us-east-1a} {EventID:spot-itn-793d64d900d62573e95ef93881f5290b46df70d62b994613f296ca1152672b7c Kind:SPOT_ITN Description:Spot ITN received. Instance will be interrupted at 2020-09-09T16:17:47Z
```

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

+16 -11

1 comment

6 changed files

haugenj

pr closed time in 11 days

push event haugenj/aws-node-termination-handler

Jason Haugen

commit sha 558e36d9dc22d558f6e8687a6fafd66730a8eb29

Disable logging for benchmark test

view details

push time in 11 days

PR opened aws/aws-node-termination-handler

Reduce event logging to only new events

Issue #, if available: #250

Description of changes: Reduce the congestion of event logging by only logging for new events. No behavioral changes.


Upside: Better looking logs

Human readable example

```
2020/09/09 16:06:44 Kubernetes AWS Node Termination Handler has started successfully!
2020/09/09 16:06:44 Started monitoring for SPOT_ITN events
2020/09/09 16:06:44 Started watching for event cancellations
2020/09/09 16:06:46 Adding new event to the event store event={"Description":"Spot ITN received. Instance will be interrupted at 2020-09-09T16:10:47Z \n","Drained":false,"EndTime":"0001-01-01T00:00:00Z","EventID":"spot-itn-a47ba0ecd3e0bf3a0023591eaf85caf67eb18a13a7c4a5e3d0c2a9885d19c791","Kind":"SPOT_ITN","NodeName":"ip-123-456-789-123.us-east-2.compute.internal","StartTime":"2020-09-09T16:10:47Z","State":""}
2020/09/09 16:08:47 Cordoning the node
2020/09/09 16:08:47 Draining the node
WARNING: ignoring DaemonSet-managed Pods: default/amazon-ec2-metadata-mock-98gtp, kube-system/aws-node-6wzns, kube-system/aws-node-termination-handler-8wkt2, kube-system/kube-proxy-k9kzp
evicting pod "coredns-67bfd975c5-c5xjh"
2020/09/09 16:08:47 Node "ip-123-456-789-123.us-east-2.compute.internal" successfully cordoned and drained.
```

json logging example

```
{"time":"2020-09-09T17:12:23Z","message":"Kubernetes AWS Node Termination Handler has started successfully!"}
{"time":"2020-09-09T17:12:23Z","message":"Started watching for event cancellations"}
{"time":"2020-09-09T17:12:23Z","message":"Started monitoring for SPOT_ITN events"}
{"event":{"EventID":"spot-itn-a47ba0ecd3e0bf3a0023591eaf85caf67eb18a13a7c4a5e3d0c2a9885d19c791","Kind":"SPOT_ITN","Description":"Spot ITN received. Instance will be interrupted at 2020-09-09T17:15:47Z \n","State":"","NodeName":"ip-123-456-789-123.us-east-2.compute.internal","StartTime":"2020-09-09T17:15:47Z","EndTime":"0001-01-01T00:00:00Z","Drained":false},"time":"2020-09-09T17:12:25Z","message":"Adding new event to the event store"}
{"time":"2020-09-09T17:13:47Z","message":"Cordoning the node"}
{"time":"2020-09-09T17:13:47Z","message":"Draining the node"}
WARNING: ignoring DaemonSet-managed Pods: default/amazon-ec2-metadata-mock-pbqgx, kube-system/aws-node-bl2bj, kube-system/aws-node-termination-handler-j8ktr, kube-system/kube-proxy-fct9f
evicting pod "coredns-67bfd975c5-qbnjd"
{"time":"2020-09-09T17:13:47Z","message":"Node \"ip-123-456-789-123.us-east-2.compute.internal\" successfully cordoned and drained."}
```

Downside: node metadata was removed from the event logs. As a k8s application, I don't think it's necessary to include that information. There's also no "heartbeat" logging anymore (event logs used to come in every 2 seconds).

old logging example:

```
2020/09/09 16:13:02 Kubernetes AWS Node Termination Handler has started successfully!
2020/09/09 16:13:02 Started watching for event cancellations
2020/09/09 16:13:02 Started monitoring for SPOT_ITN events
2020/09/09 16:13:04 Sending interruption event to the interruption channel
2020/09/09 16:13:04 Got interruption event from channel {InstanceID:i-1234567890abcdef0 InstanceType:m4.xlarge PublicHostname:ec2-123-456-789-123.compute-1.amazonaws.com PublicIP:123.456.789.123 LocalHostname:ip-123-456-789-123.ec2.internal LocalIP:123.456.789.123 AvailabilityZone:us-east-1a} {EventID:spot-itn-793d64d900d62573e95ef93881f5290b46df70d62b994613f296ca1152672b7c Kind:SPOT_ITN Description:Spot ITN received. Instance will be interrupted at 2020-09-09T16:17:47Z
 State: NodeName:ip-192-168-75-111.us-east-2.compute.internal StartTime:2020-09-09 16:17:47 +0000 UTC EndTime:0001-01-01 00:00:00 +0000 UTC Drained:false PreDrainTask:0x113c8a0 PostDrainTask:<nil>}
2020/09/09 16:13:06 Sending interruption event to the interruption channel
2020/09/09 16:13:06 Got interruption event from channel {InstanceID:i-1234567890abcdef0 InstanceType:m4.xlarge PublicHostname:ec2-123-456-789-123.compute-1.amazonaws.com PublicIP:123.456.789.123 LocalHostname:ip-123-456-789-123.ec2.internal LocalIP:123.456.789.123 AvailabilityZone:us-east-1a} {EventID:spot-itn-793d64d900d62573e95ef93881f5290b46df70d62b994613f296ca1152672b7c Kind:SPOT_ITN Description:Spot ITN received. Instance will be interrupted at 2020-09-09T16:17:47Z
 State: NodeName:ip-192-168-75-111.us-east-2.compute.internal StartTime:2020-09-09 16:17:47 +0000 UTC EndTime:0001-01-01 00:00:00 +0000 UTC Drained:false PreDrainTask:0x113c8a0 PostDrainTask:<nil>}
.
. (this repeats every 2 seconds)
.
2020/09/09 16:15:46 Sending interruption event to the interruption channel
2020/09/09 16:15:46 Got interruption event from channel {InstanceID:i-1234567890abcdef0 InstanceType:m4.xlarge PublicHostname:ec2-123-456-789-123.compute-1.amazonaws.com PublicIP:123.456.789.123 LocalHostname:ip-123-456-789-123.ec2.internal LocalIP:123.456.789.123 AvailabilityZone:us-east-1a} {EventID:spot-itn-793d64d900d62573e95ef93881f5290b46df70d62b994613f296ca1152672b7c Kind:SPOT_ITN Description:Spot ITN received. Instance will be interrupted at 2020-09-09T16:17:47Z
 State: NodeName:ip-192-168-75-111.us-east-2.compute.internal StartTime:2020-09-09 16:17:47 +0000 UTC EndTime:0001-01-01 00:00:00 +0000 UTC Drained:false PreDrainTask:0x113c8a0 PostDrainTask:<nil>}
WARNING: ignoring DaemonSet-managed Pods: default/amazon-ec2-metadata-mock-f4jrv, kube-system/aws-node-bl2bj, kube-system/aws-node-termination-handler-6zdv4, kube-system/kube-proxy-fct9f
evicting pod "coredns-67bfd975c5-hn8tz"
2020/09/09 16:15:47 Node "ip-192-168-75-111.us-east-2.compute.internal" successfully cordoned and drained.
2020/09/09 16:15:48 Sending interruption event to the interruption channel
2020/09/09 16:15:48 Got interruption event from channel {InstanceID:i-1234567890abcdef0 InstanceType:m4.xlarge PublicHostname:ec2-123-456-789-123.compute-1.amazonaws.com PublicIP:123.456.789.123 LocalHostname:ip-123-456-789-123.ec2.internal LocalIP:123.456.789.123 AvailabilityZone:us-east-1a} {EventID:spot-itn-793d64d900d62573e95ef93881f5290b46df70d62b994613f296ca1152672b7c Kind:SPOT_ITN Description:Spot ITN received. Instance will be interrupted at 2020-09-09T16:17:47Z
```

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

+11 -11

0 comments

5 changed files

pr created time in 11 days

push event haugenj/aws-node-termination-handler

Bryan™

commit sha ac6f0e79c3493c7ac621dc1aac6e7f642f9b4af7

upgrade aemm to 1.6 (#251)

view details

Jason Haugen

commit sha 40dada94a27547edb834239c455f5e6498047e69

Reduce event logging to only new events

view details

push time in 11 days

issue comment aws/aws-node-termination-handler

Spot Termination events are logged dozens of times

The Monitor function is working as intended by running continuously and finding the event, but we should be deduplicating these events when they have the same EventID hash to skip sending them to the channel here
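The deduplication check described above can be sketched like this — a stdlib-only illustration of filtering on the EventID hash before sending to the channel, not the actual NTH implementation (the type and method names here are invented):

```go
package main

import "fmt"

// dedupe tracks EventIDs that have already been forwarded, sketching the kind
// of check that could sit in front of the interruption channel.
type dedupe struct {
	seen map[string]bool
}

func newDedupe() *dedupe {
	return &dedupe{seen: make(map[string]bool)}
}

// shouldSend reports whether an event with this ID is new: the first call for
// a given EventID returns true, and repeats of the same ID return false.
func (d *dedupe) shouldSend(eventID string) bool {
	if d.seen[eventID] {
		return false
	}
	d.seen[eventID] = true
	return true
}

func main() {
	d := newDedupe()
	// The monitor loop would find the same event every poll; only the first
	// occurrence gets through.
	for _, id := range []string{"spot-itn-abc", "spot-itn-abc", "spot-itn-def"} {
		fmt.Println(id, d.shouldSend(id))
	}
}
```

Since the EventID is already a hash of the event contents, keying the map on it suppresses repeated sightings of the same interruption without any extra comparison logic.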

kemra102

comment created time in 17 days

issue comment aws/aws-node-termination-handler

JSON logging isn't truly JSON logging

Hey @kemra102 !

I agree the JSON logging could be better. When we added JSON logging with zerolog we wanted to make sure the default human-readable logs displayed in exactly the same format as before, in case anyone was already scraping and parsing them. However, as you brought up, this limits the JSON log format and keeps it from really being JSON logging. Since then we've added Prometheus metrics to NTH, so changing the format might not be as much of an issue anymore.

I haven't worked with Logstash or Elasticsearch, so if we did change the format here, what would be the ideal output? Anything different from the following?

```
{
"message" : "Got interruption event from channel",
"time" : "2020-09-03T00:00:00Z",
"level" : "info",
"metadata" :
  {
  "InstanceID" : "i-12345678",
  "InstanceType" : "t3.large",
  ...
  },
"event" :
  {
  "EventID" : "spot-itn-asdfghjkl",
  "Kind":"SPOT_ITN",
  ...
  }
}
```

The places where we're missing JSON logging are a mistake; I could've missed a few spots when I implemented JSON logging, and we might not be as vigilant on new PRs as we should be.
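A nested output of that shape can be produced with struct tags alone; this is a stdlib-only sketch where the types mirror the example above rather than the actual NTH structs, with the field sets abbreviated:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types mirror the nested log shape discussed above; they are
// illustrative stand-ins, not the real NTH types, and the field sets
// are abbreviated.
type metadata struct {
	InstanceID   string `json:"InstanceID"`
	InstanceType string `json:"InstanceType"`
}

type event struct {
	EventID string `json:"EventID"`
	Kind    string `json:"Kind"`
}

type logLine struct {
	Message  string   `json:"message"`
	Time     string   `json:"time"`
	Level    string   `json:"level"`
	Metadata metadata `json:"metadata"`
	Event    event    `json:"event"`
}

// render marshals one structured log line to a single JSON string.
func render(l logLine) string {
	b, _ := json.Marshal(l)
	return string(b)
}

func main() {
	fmt.Println(render(logLine{
		Message:  "Got interruption event from channel",
		Time:     "2020-09-03T00:00:00Z",
		Level:    "info",
		Metadata: metadata{InstanceID: "i-12345678", InstanceType: "t3.large"},
		Event:    event{EventID: "spot-itn-asdfghjkl", Kind: "SPOT_ITN"},
	}))
}
```

The point of nesting `metadata` and `event` as objects is that log pipelines can index on `event.Kind` or `metadata.InstanceID` directly instead of parsing them out of a flat message string.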

kemra102

comment created time in 17 days

push event haugenj/aws-node-termination-handler

Thomas O'Neill

commit sha 95fdc025df8f92b725f988869854917d961cc14b

Fix Helm chart comment on enableSpotInterruptionDraining default behavior (#221)

* default enableSpotInterruptionDraining to true
* enableSpotInterruptionDraining default to "true"
* Update README for helm chart, and roll back value change

Co-authored-by: Thomas O'Neill <toneill@new-innov.com>

view details

Prathibha Datta Kumar

commit sha 1943f8bce4f9a1ef73a52c8973712aac1e1c3462

fix missing new line for updateStrategy (#227)

view details

Prathibha Datta Kumar

commit sha 292e4af3844abae869fb95bcd8f37255dfbf5286

updating default updateStrategy and bumping version for helm chart release (#228)

view details

Bryan™

commit sha ca45b8f8274c25be8c5180e51efdea4b52e7e847

fix manifest updating in push-docker-images (#231)

view details

Brandon Wagner

commit sha 3e355f41607f887a3ebc7e39d38137cfb193c6eb

Add Amazon EC2 Spot Instances Integration Roadmap to readme (#232)

view details

Brandon Wagner

commit sha e48c1f8b965f7e608688fb59a555f87203bd1326

upgrade to go 1.15 (#233) * upgrade to go 1.15 * remove duplicate linker flags * remove -s linker flag from windows

view details

Paulo Martins

commit sha c685bbaf7a614305bed5e33f4aed867e15a041b3

fix identation on PodMonitor (#235)

view details

Brandon Wagner

commit sha 021b71f48a2500e54def0d04dbd59046cc086bd0

Helm Template Gen Test (#236) * add helm test to check template generation * sync enableSpotInterruption usage statement

view details

Bryan™

commit sha 97f42ef9f6c656f84ccfda982be4da4fa3529e3b

Update .travis.yml go1.15 (#238)

view details

Brandon Wagner

commit sha 713db7c6b117e638500984cbe6ab26b74783b6a2

up helm chart to use v1.7.0 image (#234)

view details

Brandon Wagner

commit sha 3fa790071c86c17ad6fbb35910f982dea3072dcf

retry aemm installs (#237)

view details

Brandon Wagner

commit sha f92ed02c5a6eb7681582b8858574c9a5853d9e90

fix release script upload to github (#240)

view details

Brandon Wagner

commit sha ef0eec0e9f4cd69f873cf1fc21744ea20e799552

only build one binary to check licenses (#239)

view details

Brandon Wagner

commit sha e5705ffd6fed89b277990bf349327f09eaadc790

clear up shellcheck warnings (#242) * clear up shellcheck warnings * changes based on PR feedback * missed some spots

view details

Brandon Wagner

commit sha 24ff89c46173c56e14ed37eb26b9525bb9194e6c

fix helm test (#243)

view details

David Pait

commit sha 92fb0c2fab2dc43c335766e3288b9b1604d198b4

Add retries when reponse from IMDSv2 retruns a 401 (#244) * Add retries when reponse from IMDSv2 retruns a 401 * Revert "Add retries when reponse from IMDSv2 retruns a 401" This reverts commit c1e34774098384db6b33c0774ec7dcf7a5c9bce7. * move IMDSv2 401 retries to Request function. move Panic to main function. add tests for 401 retries. * add missing equals to duplicateErrorThershold check in main function * fix spelling of NTH * fix report card

view details

Brandon Wagner

commit sha 33a2e9596c53773ca5c4d86be5cd51cf295e5495

Update README.md

view details

Brandon Wagner

commit sha a31e7c52926582b215b31426f407dbbc65a465b2

retry second AEMM install in maintenance-event-cancellation-test (#246)

view details

push time in 17 days

push eventhaugenj/aws-simple-ec2-cli

Jason Haugen

commit sha 31256a98cbd9fe671b1fbc0993232b188b13f35e

Add 'make fmt' target. Format project with it

view details

push time in 18 days

PR opened awslabs/aws-simple-ec2-cli

Change 'ez' to 'simple' to align with repo name

Issue #, if available: none

Description of changes: change 'ez' to 'simple' everywhere to align the code with the repository name. Done in preparation of adding this to the aws-homebrew tap.

Depending on the context, the format could be slightly different now than before. Examples:

'ez-ec2' -> 'simple-ec2'
'ezec2' -> 'simpleEc2'
'Ezec2' -> 'SimpleEc2'
'EZEC2' -> 'SIMPLE_EC2'

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

+167 -167

0 comment

37 changed files

pr created time in 18 days

push eventhaugenj/aws-simple-ec2-cli

Jason Haugen

commit sha 3fa1f54edc0e64465e51a6d132bb06beb78a0f59

Change 'ez' to 'simple' to align with repo name

view details

push time in 18 days

push eventhaugenj/aws-simple-ec2-cli

Jason Haugen

commit sha e89dd0ef443498ca758694a2b0652effae72d10c

Change 'ez' to 'simple' to align with repo name

view details

push time in 18 days

fork haugenj/aws-simple-ec2-cli

A CLI tool that simplifies the process of launching, connecting and terminating an EC2 instance.

fork in 18 days

PR closed aws/homebrew-tap

Reviewers
Add aws-simple-ec2-cli formula DO NOT MERGE

Description of changes: Add the awslabs/aws-simple-ec2-cli tool. aws-simple-ec2-cli provides customers with a simplified cli experience to make getting started with ec2 easier.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

+47 -0

1 comment

3 changed files

haugenj

pr closed time in 18 days

pull request commentaws/homebrew-tap

Add aws-simple-ec2-cli formula

Closing for now, want to make a few changes to aws-simple-ec2-cli before adding it to the tap

haugenj

comment created time in 18 days

PR opened aws/homebrew-tap

Add aws-simple-ec2-cli formula

Description of changes: Add the awslabs/aws-simple-ec2-cli tool. aws-simple-ec2-cli provides customers with a simplified cli experience to make getting started with ec2 easier.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

+47 -0

0 comment

3 changed files

pr created time in 20 days

push eventhaugenj/homebrew-tap

Jason Haugen

commit sha eae90bd9da86147b6d67911df7fb758bc1bdd916

Add aws-simple-ec2-cli formula

view details

push time in 20 days

fork haugenj/homebrew-tap

Homebrew formulae that allows installation of AWS tools through the Homebrew package manager.

fork in 20 days

fork haugenj/ec2-fleet-plugin

The EC2 Spot Jenkins plugin launches EC2 Spot instances as worker nodes for Jenkins CI server, automatically scaling the capacity with the load.

fork in a month


Pull request review commentaws/aws-node-termination-handler

Helm Template Gen Test

+# Test values for aws-node-termination-handler.+# This is a YAML-formatted file.+# Declare variables to test template rendering functionality.++image:+    repository: amazon/aws-node-termination-handler+    tag: v1.6.1+    pullPolicy: IfNotPresent+    pullSecrets: ["test"]+  +securityContext:+    runAsUserID: 1000+    runAsGroupID: 1000++nameOverride: "test-nth"+fullnameOverride: "test-aws-node-termination-handler"++priorityClassName: system-node-critical++podAnnotations: {+    test: test+}+linuxPodAnnotations: {+    test: test+}+windowsPodAnnotations: {+    test: test+}++podLabels: {+    test: test+}+linuxPodLabels: {+    test: test+}+windowsPodLabels: {+    test: test+}++resources:+    requests:+        memory: "64Mi"+        cpu: "50m"+    limits:+        memory: "128Mi"+        cpu: "100m"++## enableSpotInterruptionDraining If true, drain nodes when the spot interruption termination notice is received

should be "If false, do not..."

bwagner5

comment created time in a month

issue commentaws/aws-node-termination-handler

Does this Mitigate AWS ALB Ingress Controller?

There's no unique functionality in aws-node-termination-handler to do this, but if you're using a k8s Service you're good to go. The Service endpoint attached to the load balancer won't change and will automatically update the pod endpoints it forwards to when the pods are evicted and new pods are created.

This is assuming you're connecting to the pods via the Service and not going to the pods directly.

petrukngantuk

comment created time in a month

issue commentaws/aws-node-termination-handler

How to simulate the interruption situation manually to see what aws-node-termination-handler is doing normally

@kavicse87 The "interruption log" you posted looks like a new instance request failure log, not an actual interruption event. There isn't any relation between that error message and an interruption event that triggers the aws-node-termination-handler.

Using instances with a >20% interruption rate doesn't necessarily mean they'll be interrupted, though it does make it more likely. If you want a consistent test of actual interruption events I suggest following one of the methods that Brandon suggested before.

kavicse87

comment created time in a month

push eventhaugenj/amazon-ec2-instance-selector

Brandon Wagner

commit sha eab69ead589a01daea1c75989b0f91d36a4258e3

fix filters marshal to include regex strings (#17)

view details

Brandon Wagner

commit sha 3e751485c8a68cb083d832b88b53b5953e8318ec

fixed vcpus-to-memory ratio calc to use ceiling (#18) * fixed vcpus-to-memory ratio calc to use ceiling * fix ratio calc when vcpus is 0

view details

Brandon Wagner

commit sha a7e05ab111663c631e8fdc943ed4bd78d73967be

implement the --flexible suite filter (#20) * implement the --flexible suite filter * fix readme usage

view details

Brandon Wagner

commit sha 5b6cceeac026fab2a9c3f6ce340dc93388f4142f

Update README.md

view details

Brandon Wagner

commit sha 594137b88ea6da410921f93800f08a17cb15b505

fix removal of gpus from base-instance-type agg filter (#22) * fix removal of gpus from base-instance-type agg filter * add base-instance-type gpu unit test

view details

Brandon Wagner

commit sha e2af7ce0ff3e19def35f3b3436fcc9a7b280edf2

build tar.gz files as assets (#24)

view details

Brandon Wagner

commit sha d98d38b92726b34263861263a5a0e2ad9c51f20d

lower aggregate percent lower-band to reduce variance in instance resource specs (#23)

view details

Brandon Wagner

commit sha d8527d8a0c58495e871c04b8bbb1dbe9847db9d9

do not include os-arch suffix on tar file bins (#26)

view details

Brandon Wagner

commit sha 7f0667cb7155b52935eb71cd9fe32d76f1e287be

add homebrew-sync script (#25) * add homebrew-sync script * updates from feedback

view details

Brandon Wagner

commit sha 27f8ff63d11c378fac4e61bb7e477241f8c9cdf4

fix gh config dir on travis (#28)

view details

Brandon Wagner

commit sha 39f1dbce4db929fab4c72610257eeef636d749a9

add shellcheck test (#27) * add shellcheck test * constrain the bash grep

view details

Brandon Wagner

commit sha 902cdb0edc128dea9adf950885c5a33ca8b7d99d

Homebrew-sync gh config backup fix (#29) * fix gh config dir on travis * fix gh config bkup from failing full build

view details

Brandon Wagner

commit sha c446a12bbd4cb67c35fced91757dcdc1a64d6ab8

sync to homebrew tap on release (#30)

view details

Brandon Wagner

commit sha 5df83d29ad2bdcda82162e73652239724a6304b1

prevent caching on brew sync (#31)

view details

Brandon Wagner

commit sha 540741a7935def52c992da4d530e7f8088c3e413

add verbose output to homebrew sync script make pr body nice add build-id so we can redrive failed syncs

view details

Brandon Wagner

commit sha df059469aa69765cff8a3ed09035823f8b5bbf41

fix homebrew sync (#32)

view details

Brandon Wagner

commit sha d5b7b27e9ae7e1da492bac95525aadc84b751b7c

fix the gh args (#33)

view details

Brandon Wagner

commit sha be67b3f78fbfea010e6e2c0c58a4441cf59f2734

correct type in hb sync script (#34)

view details

Brandon Wagner

commit sha 039a9ac712d4115d15de61624f825f189080046a

hb sync: fix upstream fork sync (#37)

view details

Brandon Wagner

commit sha 42ea818c89b3a60769566ae2a722f293891faf5d

treat amd64 the same as x86_64 (#35)

view details

push time in a month

issue commentaws/aws-node-termination-handler

execute custom script during spot notification event

Hey thanks for the question! At the moment this is not a supported feature of aws-node-termination-handler.

I vaguely remember discussing such a feature with @bwagner5 a while back and I can't remember where we landed on it. He'll be back in the office on Monday; I'll reconnect with him then and get back to you.

amrragab8080

comment created time in a month

issue commentaws/aws-node-termination-handler

Clusterrole does not have 'get' permissions on pods

Hey thanks for bringing this up! Without diving too deep on this, I don't think this affects the functionality of aws-node-termination-handler (famous last words). Cordoning and draining is working in all our tests, have you seen any issues with draining in your cluster?

I audited the clusterrole permissions a while back and didn't notice any errors percolate into the aws-node-termination-handler logs during execution, is there any more info in the cluster api logs about when this get request is being made? I think we should narrow down why this get request is being made before deciding whether to modify the permissions.

ajcann

comment created time in a month

issue commentaws/aws-node-termination-handler

How to simulate the interruption situation manually to see what aws-node-termination-handler is doing normally

@kavicse87 I do think our charts are only configured for Helm 3. Can you post the logs from an aws-node-termination-handler pod that was interrupted? Specifically the beginning of the logs that show the config like:

aws-node-termination-handler arguments:
	dry-run: false,
	node-name: ip-.us-east-2.compute.internal,
	metadata-url: http://amazon-ec2-metadata-mock-service.default.svc.cluster.local:1338,
	kubernetes-service-host: 10.100.0.1,
	kubernetes-service-port: 443,
	delete-local-data: true,
	ignore-daemon-sets: true,
	pod-termination-grace-period: -1,
	node-termination-grace-period: 120,
	enable-scheduled-event-draining: false,
	enable-spot-interruption-draining: true,
	metadata-tries: 3,
	cordon-only: false,
	taint-node: false,
	json-logging: false,
	webhook-proxy: ,
	uptime-from-file: ,
	enable-prometheus-server: false,
	prometheus-server-port: 9092,

Can you also post the exact Spot interruption message you're getting?

kavicse87

comment created time in a month

issue commentaws/aws-node-termination-handler

How to simulate the interruption situation manually to see what aws-node-termination-handler is doing normally

@kavicse87 Hey, sorry for the delay, Brandon's out until next week. Can you clarify what the issue is? Is it that you aren't seeing pods being drained off the nodes when there's an interruption? Or are you still trying to validate this behavior and need more help with simulating an interruption event?

To simulate a Spot interruption event with amazon-ec2-metadata-mock:

Install amazon-ec2-metadata-mock. Here's one way:

kubectl apply -f https://github.com/aws/amazon-ec2-metadata-mock/releases/download/v1.2.0/all-resources.yaml

Then change the EC2 metadata service endpoint of aws-node-termination-handler by Helm installing with the instanceMetadataURL flag like:

helm upgrade --install aws-node-termination-handler \
 --set instanceMetadataURL="http://amazon-ec2-metadata-mock-service.default.svc.cluster.local:1338" \
 --namespace kube-system \
 eks/aws-node-termination-handler

Then when you check the aws-node-termination-handler logs (kubectl logs -n kube-system <aws-node-termination-handler-pod>) you should see a reference to draining, and if you check the nodes they should be marked unschedulable. I'd give you an example log output but my test cluster is broken right now 😔

kavicse87

comment created time in a month

issue commentaws/aws-node-termination-handler

How to simulate the interruption situation manually to see what aws-node-termination-handler is doing normally

@YakobovLior The e2e tests install aws-node-termination-handler using Helm and set the metadata url here. This becomes where aws-node-termination-handler looks for metadata; other applications on the node still look at the real EC2 metadata service.

We set that environment variable here for the local tests that use a kind cluster, and here for the eks-cluster e2e tests.

Check out the ReadMe in our test folder for more info on the e2e tests.

kavicse87

comment created time in a month

push eventaws/aws-node-termination-handler

Thomas O'Neill

commit sha 95fdc025df8f92b725f988869854917d961cc14b

Fix Helm chart comment on enableSpotInterruptionDraining default behavior (#221) * default enableSpotInterruptionDraining to true * enableSpotInterruptionDraining default to "true" * Update README for helm chart, and roll back value change Co-authored-by: Thomas O'Neill <toneill@new-innov.com>

view details

push time in 2 months

PR merged aws/aws-node-termination-handler

default enableSpotInterruptionDraining to true

Description of changes: According to the README, the default value for enableSpotInterruptionDraining is true, but the value in values.yaml had it set to empty string ("").

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

+1 -1

2 comments

1 changed file

toneill818

pr closed time in 2 months

pull request commentaws/aws-node-termination-handler

default enableSpotInterruptionDraining to true

Thanks for submitting this! Without a value set for enableSpotInterruptionDraining in the Helm chart we use the default value in config.go which is set to true.

I think rather than change the default value in the Helm chart away from "", which would be inconsistent with the other Helm values, it would be better for us to improve the wording of the comment so it's clearer that true is indeed the default value.

toneill818

comment created time in 2 months

push eventhaugenj/aws-node-termination-handler

Brandon Wagner

commit sha 466060d6348b0977a233a79213a41f76bbf65c15

stabilize e2e tests (#174) * stabilize e2e tests * tidy up the changes

view details

Brandon Wagner

commit sha cc6170de36fd583b24e8ba122c07face3f1b610c

prepare for 1.5.0 release (#175)

view details

Brandon Wagner

commit sha 8febad4fcf28bd9eab55aff92a183802bb4c9e20

update licenses (#176)

view details

Brandon Wagner

commit sha fc625bb7187077c9bb25f615671d2fcf9385629f

travis build improvements (#177) * travis build improvements * remove name from e2e tests so that version is more accessible * remove direct downloads when building images and binaries * only run helm validation on release * update k8s 1.18 kind node image * add makefile help target * remove --force on reinstall of emtp * sort assertion scripts so that order is deterministic between different versions of find

view details

Brandon Wagner

commit sha b21fff3ff5f55c25db53a06eea331c64853ebd5b

fix docker sync readme (#178)

view details

Brandon Wagner

commit sha 4c11208c1379ec4ff8f594b1aeaf9415768e3e43

Update .travis.yml

view details

Brandon Wagner

commit sha 2112381d71ae28a19deb323e4ac783896427af3f

update kubectl apply instructions to v1.5.0

view details

Abdul Muqtadir Mohammed

commit sha 9c6268b6936ce3c519e59529d6a6baeba5c2c53e

Add support for Webhook URL as a Secret (#179) * Add support for Webhook URL as a Secret

view details

Matheus Weber da Conceição

commit sha f16d712da4aef224278b52f097785380ffbdc8fe

Add new podLabels parameter for adding labels to each pod (#180)

view details

Andrey Klimentyev

commit sha e8cce3e0ec48a43388fd7200bbd255a79624f203

Fixed typo in "pre-dain" NodeAction (#184) Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>

view details

Leo Palmer Sunmo

commit sha 049947cd675206fe7e219bb4bb8f5ef4f977d4b8

Add metadata endpoint and Kiam documentation (#188) This adds documentation on which EC2 metadata endpoints the termination handler relies on. It also provides a config option to enable the termination handler to function properly on a cluster that has Kiam deployed to manage AWS IAM credentials.

view details

Jerad C

commit sha f5f8766609022dd8c2df18367089205e3a8ce92e

add Windows node support (#185) * add Windows node support * fix system test mock for k8s <1.14 * fix syntax error * fix unit tests * fix gofmt issues * increase test coverage * add make target to run unit tests in linux container * add uptime stub for darwin

view details

Brandon Wagner

commit sha 20b816957ddb9f37a3820f2ddca60fc3ef74666c

do not fail when scheduled event does not have end date (#182) * do not fail when scheduled event does not have end date. It's nice-to-have information only. * fix unit-test for scheduled events

view details

Brandon Wagner

commit sha b6cbb2a9c8bb9b7cfd91dc072ea238864a105f35

prepare for v1.6.0 release (#189)

view details

Brandon Wagner

commit sha e16065c826fc2fc350e15476ffca34f9222431e8

Kube version helm (#190) * fix kube version retrieval in helm template helpers * add helm lint test * change kube-version retrieval on emtp

view details

Brandon Wagner

commit sha eb7d8114daa5286e93d531bb224e8d5e68a83501

Update README.md

view details

Ed Lim

commit sha e769d9e2dc42fa34aba5f7e47c6b81b7656c3757

Don't schedule aws-node-termination-handler on fargate nodes (#191) * Don't schedule node termination handler if compute type is of type fargate * Bump version

view details

Takayuki Watanabe

commit sha bc39bfbbfad9d7e633d2bcace41069affaab1b99

Fix explanation of ignore-daemon-sets flag (#194)

view details

Brandon Wagner

commit sha 43a69e0f41f2d7ed97e7199f5026dfc2c7eec6e4

remove + from major minor server version (#195)

view details

Brandon Wagner

commit sha 216ed72ff9e85dd93e8d11bbef270f68e26729ea

up for patch release (#196)

view details

push time in 2 months

Pull request review commentaws/amazon-ec2-instance-selector

accept memory and gpu memory in GiB instead of MiB

 func isInAllowList(allowRegex *regexp.Regexp, instanceTypeName string) bool { 	} 	return allowRegex.MatchString(instanceTypeName) }++func gbToMBRange(gbRange *Float64RangeFilter) *IntRangeFilter {

gibToMibRange? little confusing that this is the only place where you switch the syntax

bwagner5

comment created time in 2 months

Pull request review commentaws/aws-node-termination-handler

refactor event handlers to monitors with an interface

 func TestMonitorForSpotITNEventsSuccess(t *testing.T) { 			"Expected description to contain: "+startTime+" but is actually: "+result.Description) 	}() -	err := interruptionevent.MonitorForSpotITNEvents(drainChan, cancelChan, imds)+	nodeName := "test-node"+	spotITNMonitor := interruptionevent.NewSpotInterruptionMonitor(imds, drainChan, cancelChan, nodeName)+	err := spotITNMonitor.Monitor()

That sounds good to me

bwagner5

comment created time in 2 months

Pull request review commentaws/aws-node-termination-handler

refactor event handlers to monitors with an interface

 func TestMonitorForSpotITNEventsSuccess(t *testing.T) { 			"Expected description to contain: "+startTime+" but is actually: "+result.Description) 	}() -	err := interruptionevent.MonitorForSpotITNEvents(drainChan, cancelChan, imds)+	nodeName := "test-node"+	spotITNMonitor := interruptionevent.NewSpotInterruptionMonitor(imds, drainChan, cancelChan, nodeName)+	err := spotITNMonitor.Monitor()

this is pedantic but should we change these function names? They're referencing functions that are removed. TestMonitor would be better imo

bwagner5

comment created time in 2 months

Pull request review commentaws/aws-node-termination-handler

refactor event handlers to monitors with an interface

 func TestMonitorForScheduledEventsSuccess(t *testing.T) { 	drainChan := make(chan interruptionevent.InterruptionEvent) 	cancelChan := make(chan interruptionevent.InterruptionEvent) 	imds := ec2metadata.New(server.URL, 1)+	nodeName := "test-node"

const?

bwagner5

comment created time in 2 months

Pull request review commentaws/aws-node-termination-handler

refactor event handlers to monitors with an interface

 func main() { 	cancelChan := make(chan interruptionevent.InterruptionEvent) 	defer close(cancelChan) -	monitoringFns := map[string]monitorFunc{}+	monitoringFns := map[string]interruptionevent.Monitor{} 	if nthConfig.EnableSpotInterruptionDraining {-		monitoringFns[spotITN] = interruptionevent.MonitorForSpotITNEvents+		imdsSpotMonitor := interruptionevent.NewSpotInterruptionMonitor(imds, interruptionChan, cancelChan, nthConfig.NodeName)+		monitoringFns[spotITN] = imdsSpotMonitor 	} 	if nthConfig.EnableScheduledEventDraining {-		monitoringFns[scheduledMaintenance] = interruptionevent.MonitorForScheduledEvents+		imdsScheduledEventMonitor := interruptionevent.NewScheduledEventMonitor(imds, interruptionChan, cancelChan, nthConfig.NodeName)+		monitoringFns[scheduledMaintenance] = imdsScheduledEventMonitor 	} -	for eventType, fn := range monitoringFns {-		go func(monitorFn monitorFunc, eventType string) {-			log.Log().Msgf("Started monitoring for %s events", eventType)+	for _, fn := range monitoringFns {+		go func(monitor interruptionevent.Monitor) {

is type monitorFunc ... on line 40 necessary anymore?

bwagner5

comment created time in 2 months
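The pattern under review here — each event source implementing a common `Monitor` interface and main fanning them out as goroutines — can be sketched roughly as follows. This is an illustration of the shape being discussed, not NTH's exact types; `fakeMonitor`, `Kind()`, and `runMonitors` are names invented for this sketch:

```go
package main

import (
	"fmt"
	"sync"
)

// Monitor is the common interface each interruption-event source implements.
type Monitor interface {
	Monitor() error
	Kind() string
}

// fakeMonitor is a stand-in event source that succeeds immediately.
type fakeMonitor struct{ kind string }

func (f fakeMonitor) Monitor() error { return nil }
func (f fakeMonitor) Kind() string   { return f.kind }

// runMonitors starts each monitor in its own goroutine and collects any errors.
func runMonitors(monitors []Monitor) []error {
	var (
		mu   sync.Mutex
		errs []error
		wg   sync.WaitGroup
	)
	for _, m := range monitors {
		wg.Add(1)
		go func(m Monitor) {
			defer wg.Done()
			if err := m.Monitor(); err != nil {
				mu.Lock()
				errs = append(errs, fmt.Errorf("%s: %w", m.Kind(), err))
				mu.Unlock()
			}
		}(m)
	}
	wg.Wait()
	return errs
}

func main() {
	errs := runMonitors([]Monitor{fakeMonitor{"SPOT_ITN"}, fakeMonitor{"SCHEDULED_EVENT"}})
	fmt.Printf("%d monitor(s) failed\n", len(errs))
}
```

With an interface like this, the `map[string]monitorFunc` type from the old code becomes unnecessary, which is the point raised in the review comment above.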

Pull request review commentaws/aws-node-termination-handler

refactor event handlers to monitors with an interface

 func TestMonitorForSpotITNEventsSuccess(t *testing.T) { 			"Expected description to contain: "+startTime+" but is actually: "+result.Description) 	}() -	err := interruptionevent.MonitorForSpotITNEvents(drainChan, cancelChan, imds)+	nodeName := "test-node"

can we make this a constant?

bwagner5

comment created time in 2 months

Pull request review commentaws/aws-node-termination-handler

refactor event handlers to monitors with an interface

 import ( 	h "github.com/aws/aws-node-termination-handler/pkg/test" ) +const (+	node1 = "test-node-1"+	node2 = "test-node-2"

unused?

bwagner5

comment created time in 2 months

issue commentaws/aws-node-termination-handler

Can't use nodeSelector with helm chart

Thanks for bringing this up!

We released chart version 1.6 last night that includes support for Windows, and after a quick look it seems we might not be passing the value for nodeSelector correctly. We'll investigate further as soon as we can

kuzaxak

comment created time in 2 months

Pull request review commentaws/aws-node-termination-handler

build .tar.gz bin files

 USAGE=$(cat << 'EOM'    Example: build-binaries -p "linux/amd64,linux/arm"           Optional:+            -b          Base bin name [DEFAULT: output of "make binary-name"]             -p          Platform pair list (os/architecture) [DEFAULT: linux/amd64]             -d          DIRECT: Set GOPROXY=direct to bypass go proxies-            -v          VERSION: The application version of the docker image [DEFAULT: output of `make version`]+            -v          VERSION: The application version of the docker image [DEFAULT: output of `make version`]        

extra whitespace?

bwagner5

comment created time in 2 months

Pull request review commentaws/aws-node-termination-handler

add shellcheck and spellcheck

 for dep in "${deps[@]}"; do     if [ ! -x "$path_to_executable" ]; then         echoerr "You are required to have $dep installed on your system..."         echoerr "Please install $dep and try again. "-        exit -1+        exit 3

y tho

bwagner5

comment created time in 2 months

push eventhaugenj/escalator

Jason Haugen

commit sha 991f956aafeea2d5890f83a0b3f1c94b0a94115b

make AWS resource tagging optional via a config flag

view details

push time in 2 months

pull request commentatlassian/escalator

Add tags to Auto Scaling Group and Fleet request

Gotcha. They definitely don't need to be enabled by default if there are real problems, I just personally lean towards less user configuration whenever possible and was hoping it would work in this case. I'll work on a revision with the requested changes, thanks!

haugenj

comment created time in 3 months

pull request commentatlassian/escalator

Add tags to Auto Scaling Group and Fleet request

Thanks for taking a look at this so quickly!

I don't believe the tags will break deployments if the IAM policy isn't updated; the error from the request is caught and logged but it doesn't force an early return or stop the nodegroup from being added. I've tested this on an EKS cluster with both setDesiredCapacity and createFleet scaling strategies and can confirm in both cases Escalator works as before. Perhaps logging with a 'warning' level is more appropriate than an 'error' level?

@awprice can you elaborate on the potential Terraform/tag removing issue you mentioned? I only have a passing knowledge of how Terraform works so I've scanned through their docs a little to try and get some context. This page makes me think that the Escalator tags could be removed if Terraform updates the ASG, so there would be some potential back and forth of adding/removing the tags if there's a lot of Escalator deployments mixed with Terraform ASG updates. That doesn't seem like it would cause any real problems, but let me know if I'm misunderstanding this or missing something.

haugenj

comment created time in 3 months

PR opened atlassian/escalator

Add tags to Auto Scaling Group and Fleet request

Tags are pieces of metadata customers can use to organize resources. They're particularly helpful to customers for breaking down costs, so they can see how much they're spending per tag. Functionally, they don't affect the performance of a resource in any way.

This change checks the ASGs when the nodegroups are created to see if the escalator tag is present on the ASG. If the tag isn't present, a request is made to add the tag. The tag will propagate to any instances launched within the ASG, so instances created with either setDesiredCapacity or CreateFleet are tagged.

Additionally, when CreateFleet is used the request itself is tagged.

The tag names are arbitrary, but I had them follow the same style as the ones cluster autoscaler uses

+84 -4

0 comment

4 changed files

pr created time in 3 months

push eventhaugenj/escalator

Jason Haugen

commit sha 153e22f8a7380b3477d3410b43af091d999e43fc

Add tags to Auto Scaling Group and Fleet request

view details

push time in 3 months

push eventhaugenj/escalator

Jason Haugen

commit sha bd128aa2241356c86e274fea58bc19d5ec35c8a5

Fix the IAM Policy permissions (#191)

view details

push time in 3 months

Pull request review commentaws/amazon-ec2-instance-selector

implement the --flexible suite filter

 Filter Flags:  Suite Flags:       --base-instance-type string   Instance Type used to retrieve similarly spec'd instance types+      --flexible              Retrieves a group of instance types spanning multiple generations based on opinionated defaults and user overridden resource filters

Needs another tab :)

bwagner5

comment created time in 3 months

pull request commentaws/amazon-ec2-instance-selector

implement int suite flag in cli pkg

do we need to add this to the readme?

bwagner5

comment created time in 3 months

Pull request review commentaws/amazon-ec2-instance-selector

implement int suite flag in cli pkg

 func (cl *CommandLineInterface) SuiteBoolFlag(name string, shorthand *string, de 	cl.BoolFlagOnFlagSet(cl.suiteFlags, name, shorthand, defaultValue, description) } +// SuiteIntFlag creates and registers a flag accepting a boolean for aggregate filters.

boolean -> int

bwagner5

comment created time in 3 months

Pull request review commentaws/amazon-ec2-instance-selector

fixed vcpus-to-memory ratio calc to use ceiling

 func calculateVCpusToMemoryRatio(vcpusVal *int64, memoryVal *int64) *float64 { 		return nil 	} 	// normalize vcpus to a mebivcpu value-	return aws.Float64(float64(*memoryVal) / float64(*vcpusVal*1024))+	result := math.Ceil(float64(*memoryVal) / float64(*vcpusVal*1024))

unrelated to your changes, can vcpus be 0?

bwagner5

comment created time in 3 months
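The two points in this review — rounding the ratio up with a ceiling and guarding against zero vCPUs — can be sketched together like this. The function name and the nil-return choice for the zero case are assumptions for illustration, not the exact instance-selector implementation:

```go
package main

import (
	"fmt"
	"math"
)

// calcVCpusToMemoryRatio normalizes memory (MiB) per vCPU and rounds up
// with math.Ceil. It returns nil when vcpus is 0 to avoid a divide-by-zero.
func calcVCpusToMemoryRatio(vcpus, memoryMiB int64) *float64 {
	if vcpus == 0 {
		return nil // guard: no meaningful ratio for zero vCPUs
	}
	r := math.Ceil(float64(memoryMiB) / float64(vcpus*1024))
	return &r
}

func main() {
	if r := calcVCpusToMemoryRatio(4, 4096); r != nil {
		fmt.Printf("1:%v\n", *r) // 4 vCPU, 4096 MiB => ratio 1
	}
	fmt.Println(calcVCpusToMemoryRatio(0, 4096)) // nil: zero-vCPU guard
}
```

With the ceiling in place, a 1 vCPU / 512 MiB shape that previously computed to 0.5 rounds up to 1, which matches the intent of the fix being reviewed.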

Pull request review commentaws/amazon-ec2-instance-selector

fixed vcpus-to-memory ratio calc to use ceiling

 func TestCalculateVCpusToMemoryRatio(t *testing.T) { 	vcpus := aws.Int64(4) 	memory := aws.Int64(4096) 	ratio := calculateVCpusToMemoryRatio(vcpus, memory)-	h.Assert(t, *ratio == 1.00, "nil should evaluate to nil")+	h.Assert(t, *ratio == 1.00, "ratio should equal 1:1")  	vcpus = aws.Int64(2) 	memory = aws.Int64(4096) 	ratio = calculateVCpusToMemoryRatio(vcpus, memory)-	h.Assert(t, *ratio == 2.00, "nil should evaluate to nil")+	h.Assert(t, *ratio == 2.00, "ration should equal 1:2")  	vcpus = aws.Int64(1) 	memory = aws.Int64(512) 	ratio = calculateVCpusToMemoryRatio(vcpus, memory)-	h.Assert(t, *ratio == 0.50, "nil should evaluate to nil")

were these messages just off before or am i missing something?

bwagner5

comment created time in 3 months

Pull request review commentaws/amazon-ec2-instance-selector

fixed vcpus-to-memory ratio calc to use ceiling

 func TestCalculateVCpusToMemoryRatio(t *testing.T) { 	vcpus := aws.Int64(4) 	memory := aws.Int64(4096) 	ratio := calculateVCpusToMemoryRatio(vcpus, memory)-	h.Assert(t, *ratio == 1.00, "nil should evaluate to nil")+	h.Assert(t, *ratio == 1.00, "ratio should equal 1:1")  	vcpus = aws.Int64(2) 	memory = aws.Int64(4096) 	ratio = calculateVCpusToMemoryRatio(vcpus, memory)-	h.Assert(t, *ratio == 2.00, "nil should evaluate to nil")+	h.Assert(t, *ratio == 2.00, "ration should equal 1:2")

ration

bwagner5

comment created time in 3 months

fork haugenj/amazon-ec2-instance-selector

A CLI tool and go library which recommends instance types based on resource criteria like vcpus and memory

fork in 3 months

Pull request review commentaws/amazon-ec2-instance-selector

cleanup travis build file

 language: minimal
 services:
 - docker

-matrix:
+jobs:
   include:
     - stage: Test
       language: go
       go: "1.14.x"
       script: make unit-test
-      env: GO_UNIT_TESTS=TRUE

Are these vars used anywhere else? Forgive my lack of Travis knowledge but my thinking is that there's another location these need to be removed from

bwagner5

comment created time in 3 months

Pull request review commentaws/amazon-ec2-instance-selector

allow-list and deny-list implemented

 func TestRetrieveInstanceTypesSupportedInAZs_DescribeAZErr(t *testing.T) {
 	_, err := itf.RetrieveInstanceTypesSupportedInLocations([]string{"us-east-2a"})
 	h.Nok(t, err)
 }
+
+func TestFilter_AllowList(t *testing.T) {
+	ec2Mock := mockedEC2{
+		DescribeInstanceTypesPagesResp:    setupMock(t, describeInstanceTypesPages, "25_instances.json").DescribeInstanceTypesPagesResp,
+		DescribeInstanceTypeOfferingsResp: setupMock(t, describeInstanceTypeOfferings, "us-east-2a.json").DescribeInstanceTypeOfferingsResp,
+	}
+	itf := selector.Selector{
+		EC2: ec2Mock,
+	}
+	allowRegex, err := regexp.Compile("c4.large")
+	h.Ok(t, err)
+	filters := selector.Filters{
+		AllowList: allowRegex,
+	}
+	results, err := itf.Filter(filters)
+	h.Ok(t, err)
+	h.Assert(t, len(results) == 1, "c4.large should return 1 instance type matching regex")
+}
+
+func TestFilter_DenyList(t *testing.T) {
+	ec2Mock := mockedEC2{
+		DescribeInstanceTypesPagesResp:    setupMock(t, describeInstanceTypesPages, "25_instances.json").DescribeInstanceTypesPagesResp,
+		DescribeInstanceTypeOfferingsResp: setupMock(t, describeInstanceTypeOfferings, "us-east-2a.json").DescribeInstanceTypeOfferingsResp,
+	}
+	itf := selector.Selector{
+		EC2: ec2Mock,
+	}
+	denyRegex, err := regexp.Compile("c4.large")
+	h.Ok(t, err)
+	filters := selector.Filters{
+		DenyList: denyRegex,
+	}
+	results, err := itf.Filter(filters)
+	h.Ok(t, err)
+	h.Assert(t, len(results) == 24, "c4.large should return 24 instance type matching regex but returned %d", len(results))

[nit] can we change the error message to make it more explicit that c4.large is the regex? just something like putting single quotes around it or saying "Regex _____ should ..."

bwagner5

comment created time in 3 months

Pull request review commentaws/amazon-ec2-instance-selector

allow-list and deny-list implemented

 ec2-instance-selector --vcpus 4 --region us-east-2 --availability-zones us-east- ec2-instance-selector --memory-min 4096 --memory-max 8192 --vcpus-min 4 --vcpus-max 8 --region us-east-2  Filter Flags:+      --allow-list string                 List of allowed instance types to select from w/ regex syntax (Example: m[3-5]\.*)       --availability-zone string          [DEPRECATED] use --availability-zones instead   -z, --availability-zones strings        Availability zones or zone ids to check EC2 capacity offered in specific AZs       --baremetal                         Bare Metal instance types (.metal instances)   -b, --burst-support                     Burstable instance types   -a, --cpu-architecture string           CPU architecture [x86_64, i386, or arm64]       --current-generation                Current generation instance types (explicitly set this to false to not return current generation instance types)+      --deny-list string                  List of instance types which should be excluded w/ regex syntax (Example: m[1-2]\.*)

should we rename this "block-list" ? That seems to be the new standard

bwagner5

comment created time in 3 months
