Max Jonas Werner makkes D2iQ Hamburg, Germany https://makk.es Boldly going where no one has gone before.

makkes/gitlab-cli 53

A simple command line interface to GitLab

makkes/cluster-diagnostics 1

PoC for automated cluster diagnostics

makkes/fileboy-node 1

Simple one-click hosting application

makkes/k8s-examples 1

Examples for interacting with a K8s cluster: operators, clients etc.

makkes/action-notify 0

GitHub Action for sending notifications to a Slack channel

makkes/assert 0

Assertion library for Golang

makkes/dcos 0

DC/OS - The Datacenter Operating System

makkes/dcos-cli 0

The command line for DC/OS.

makkes/dcos-core-cli 0

Core plugin for the DC/OS CLI

push event kubernetes/enhancements

Aldo Culquicondor

commit sha 7bcfc35d0618e978e21887362b2823f933dc0ae6

Add comment on PRR approver for reviewed scheduling KEPs Signed-off-by: Aldo Culquicondor <acondor@google.com>


Kubernetes Prow Robot

commit sha 858fd2a2e98b9a7b9d826de22f819c815f3e3171

Merge pull request #1809 from alculquicondor/prr Add comment on PRR approver for reviewed scheduling KEPs


push time in 2 minutes

PR merged kubernetes/enhancements

Add comment on PRR approver for reviewed scheduling KEPs

Labels: approved, cncf-cla: yes, kind/kep, lgtm, sig/architecture, sig/scheduling, size/XS

/assign @ahg-g cc @wojtek-t

+2 -2

2 comments

2 changed files

alculquicondor

pr closed time in 2 minutes

pull request comment kubernetes/enhancements

Add comment on PRR approver for reviewed scheduling KEPs

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: <a href="https://github.com/kubernetes/enhancements/pull/1809#issuecomment-633708581" title="Approved">ahg-g</a>, <a href="https://github.com/kubernetes/enhancements/pull/1809#" title="Author self-approved">alculquicondor</a>

The full list of commands accepted by this bot can be found here.

The pull request process is described here

<details > Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
</details>
<!-- META={"approvers":[]} -->

alculquicondor

comment created time in 4 minutes

pull request comment kubernetes/enhancements

Add comment on PRR approver for reviewed scheduling KEPs

/lgtm /approve

alculquicondor

comment created time in 4 minutes

pull request comment kubernetes/enhancements

Add comment on PRR approver for reviewed scheduling KEPs

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: <a href="https://github.com/kubernetes/enhancements/pull/1809#" title="Author self-approved">alculquicondor</a>
To complete the pull request process, please assign ahg-g
You can assign the PR to them by writing /assign @ahg-g in a comment when ready.

The full list of commands accepted by this bot can be found here.

<details open> Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
</details>
<!-- META={"approvers":["ahg-g"]} -->

alculquicondor

comment created time in 8 minutes
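
The bot comments above end with a machine-readable trailer of the form `<!-- META={"approvers":[...]} -->`. In the comments shown here the list names `ahg-g` while the PR is not yet approved and is empty once it is approved, so it appears to hold the approvers still being asked for. A minimal sketch of reading that trailer follows; the function name `pending_approvers` and the interpretation of the field are assumptions, not part of any documented bot API.

```python
import json
import re
from typing import List

# Matches the HTML-comment trailer seen in the bot comments, e.g.
# <!-- META={"approvers":["ahg-g"]} -->
META_PATTERN = re.compile(r"<!--\s*META=(\{.*?\})\s*-->")


def pending_approvers(comment_body: str) -> List[str]:
    """Return the names in the META trailer, or [] if no trailer is present.

    Hypothetical helper: the trailer format is copied from the comments
    above; its meaning ("approvers still needed") is an assumption.
    """
    match = META_PATTERN.search(comment_body)
    if match is None:
        return []
    meta = json.loads(match.group(1))
    return list(meta.get("approvers", []))


if __name__ == "__main__":
    not_approved = '</details> <!-- META={"approvers":["ahg-g"]} -->'
    approved = '</details> <!-- META={"approvers":[]} -->'
    print(pending_approvers(not_approved))
    print(pending_approvers(approved))
```

An empty result then corresponds to the "This PR is APPROVED" state in the later comment.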

PR opened kubernetes/enhancements

Add comment on PRR approver for reviewed scheduling KEPs

/assign @ahg-g cc @wojtek-t

+2 -2

0 comments

2 changed files

pr created time in 8 minutes

push event kubernetes/enhancements

Aldo Culquicondor

commit sha bdde5d17cca31879f1b44ca41111061c4b39eaba

PRR as approver in scheduler component config KEP According to KEP template


Kubernetes Prow Robot

commit sha c258a34ae2154db7c75dbfa4c21fa1171ddf4cd1

Merge pull request #1808 from alculquicondor/patch-2 PRR as approver in scheduler component config KEP


push time in 24 minutes

PR merged kubernetes/enhancements

PRR as approver in scheduler component config KEP

Labels: approved, cncf-cla: yes, kind/kep, lgtm, sig/architecture, sig/scheduling, size/XS

According to KEP template

/assign @ahg-g cc @wojtek-t

+1 -1

2 comments

1 changed file

alculquicondor

pr closed time in 24 minutes

pull request comment kubernetes/enhancements

PRR as approver in scheduler component config KEP

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: <a href="https://github.com/kubernetes/enhancements/pull/1808#issuecomment-633703380" title="Approved">ahg-g</a>, <a href="https://github.com/kubernetes/enhancements/pull/1808#" title="Author self-approved">alculquicondor</a>

The full list of commands accepted by this bot can be found here.

The pull request process is described here

<details > Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
</details>
<!-- META={"approvers":[]} -->

alculquicondor

comment created time in 25 minutes

pull request comment kubernetes/enhancements

PRR as approver in scheduler component config KEP

/lgtm /approve

alculquicondor

comment created time in 26 minutes

pull request comment kubernetes/enhancements

PRR as approver in scheduler component config KEP

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: <a href="https://github.com/kubernetes/enhancements/pull/1808#" title="Author self-approved">alculquicondor</a>
To complete the pull request process, please assign ahg-g
You can assign the PR to them by writing /assign @ahg-g in a comment when ready.

The full list of commands accepted by this bot can be found here.

<details open> Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
</details>
<!-- META={"approvers":["ahg-g"]} -->

alculquicondor

comment created time in 27 minutes

PR opened kubernetes/enhancements

PRR as approver in scheduler component config KEP

According to KEP template

/assign @ahg-g cc @wojtek-t

+1 -1

0 comments

1 changed file

pr created time in 27 minutes

Pull request review comment kubernetes/enhancements

Add PRR questionnaire section to New Event API KEP

 approvers:
   - "@wojtekt"
   - "@brancz"
 creation-date: 2019-01-31
-last-updated: 2020-05-01
+last-updated: 2020-05-19

Please add me as a PRR approver: https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/kep.yaml#L16

chelseychen

comment created time in 33 minutes

Pull request review comment kubernetes/enhancements

Add PRR questionnaire section to New Event API KEP

 List Events from the NamespaceSystem with field selector `reportingController =  List all Event with field selector `regarding.name = podName, regarding.namespace = podNamespace`, and `related.name = podName, related.namespace = podNamespace`. You need to join results outside of the kubernetes API.
 
+## Production Readiness Review Questionnaire
+
+### Feature enablement and rollback
+
+_This section must be completed when targeting alpha to a release._
+
+* **How can this feature be enabled / disabled in a live cluster?**
+  - [ ] Feature gate (also fill in values in `kep.yaml`)
+    - Feature gate name:
+    - Components depending on the feature gate:
+  - [x] Other
+    - Describe the mechanism:
+
+      (1) The API itself can be enabled / disabled at kube-apiserver level
+      by using `--runtime-config` flag;
+
+      (2) For the use of API, we have a fallback mechanism instead of using
+      a feature gate. That is, we simply fallback to the old Event libraries
+      if the API is diabled.
+
+      Currently this fallback is implemented purely in scheduler but we're
+      planning to move it into the library itself.
+
+    - Will enabling / disabling the feature require downtime of the control
+      plane?
+
+      (1) Yes, enabling API requires to restart apiserver.
+
+      (2) No, enabling the use of the API doesn't require that.
+
+    - Will enabling / disabling the feature require downtime or reprovisioning
+      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+
+      No.
+
+* **Does enabling the feature change any default behavior?**
+  Any change of default behavior may be surprising to users or break existing
+  automations, so be extremely careful here.
+
+  While the graduation of the API itself doesn't change default behavior,
+  migration of individual components does, as the events will be reported
+  differently.
+
+* **Can the feature be disabled once it has been enabled (i.e. can we rollback
+  the enablement)?**
+  Also set `rollback-supported` to `true` or `false` in `kep.yaml`.
+  Describe the consequences on existing workloads (e.g. if this is runtime
+  feature, can it break the existing applications?).
+
+  Yes. If the new Event API is disabled, it will fallback to the original one
+  (The new events are roundtrippable with the old `corev1.Events`).
+
+  If individual components don't implement it, rollback of client-library use
+  may not be possible (i.e. they only fallback to the old API if the new API
+  is disabled, if there is bug in the client-library, there is no way to
+  fallback as of now).
+
+* **What happens if we reenable the feature if it was previously rolled back?**
+
+  New types of Events will be generated instead of the old one.
+
+* **Are there any tests for feature enablement/disablement?**
+  The e2e framework does not currently support enabling and disabling feature
+  gates. However, unit tests in each component dealing with managing data created
+  with and without the feature are necessary. At the very least, think about
+  conversion tests if API types are being modified.
+
+  Manual tests will be performed to ensure things work when either enabling
+  or disabling the new Event API.
+
+  More information in [Test Plan](#test-plan) section.
+
+### Rollout, Upgrade and Rollback Planning
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can a rollout fail? Can it impact already running workloads?**
+  Try to be as paranoid as possible - e.g. what if some components will restart
+  in the middle of rollout?
+
+  A rollout could fail if some components restart in the middle of the rollout.
+  Then those components will continue using the old Event API.
+
+* **What specific metrics should inform a rollback?**
+
+* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?**
+  Describe manual testing that was done and the outcomes.
+  Longer term, we may want to require automated upgrade/rollback tests, but we
+  are missing a bunch of machinery and tooling and do that now.
+
+  Not yet. It could be done by enabling / disabling new Event API.
+
+* **Is the rollout accompanied by any deprecations and/or removals of features,
+  APIs, fields of API types, flags, etc.?**
+  Even if applying deprecation policies, they may still surprise some users.
+
+  State field of EventSeries will be removed from corev1.Event API.
+
+### Monitoring requirements
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can an operator determine if the feature is in use by workloads?**
+  Ideally, this should be a metrics. Operations against Kubernetes API (e.g.
+  checking if there are objects with field X set) may be last resort. Avoid
+  logs or events for this purpose.
+
+  The API, as a feature that workloads may in theory use,
+  can be determined by looking into the apiserver_requests_total metric.
+
+* **What are the SLIs (Service Level Indicators) an operator can use to
+  determine the health of the service?**
+  - [x] Metrics
+    - Metric name: apiserver_requests_total
+    - Components exposing the metric: kube-apiserver
+  - [ ] Other (treat as last resort)
+    - Details:
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+  At the high-level this usually will be in the form of "high percentile of SLI
+  per day <= X". It's impossible to provide a comprehensive guidance, but at the very
+  high level (they needs more precise definitions) those may be things like:
+  - per-day percentage of API calls finishing with 5XX errors <= 1%
+  - 99% percentile over day of absolute value from (job creation time minus expected
+    job creation time) for cron job <= 10%
+  - 99,9% of /health requests per day finish with 200 code
+
+  Events have always been "best-effort".
+  We're sticking to that with the new API too, so no SLO will be introduced.
+
+* **Are there any missing metrics that would be useful to have to improve
+  observability if this feature?**
+  Describe the metrics themselves and the reason they weren't added (e.g. cost,
+  implementation difficulties, etc.).
+
+  No.
+
+### Dependencies
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **Does this feature depend on any specific services running in the cluster?**
+  Think about both cluster-level services (e.g. metrics-server) as well
+  as node-level agents (e.g. specific version of CRI). Focus on external or
+  optional services that are needed. For example, if this feature depends on
+  a cloud provider API, or upon an external software-defined storage or network
+  control plane.
+
+  For each of the fill in the following, thinking both about running user workloads
+  and creating new ones, as well as about cluster-level services (e.g. DNS):
+
+  N/A
+
+
+### Scalability
+
+_For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them._
+
+_For beta, this section is required: reviewers must answer these questions._
+
+_For GA, this section is required: approvers should be able to confirms the
+previous answers based on experience in the field._
+
+* **Will enabling / using this feature result in any new API calls?**
+  Describe them, providing:
+
+  In the new EventRecorder, every 30 minutes a "heartbeat" call will be performed
+  to update Event status and prevent garbage collection in etcd. This heartbeat
+  is happening for events that are happening all the time (If an event didn't
+  happen for 6 minutes, it will be GC-ed).
+
+* **Will enabling / using this feature result in introducing new API types?**
+
+  Yes, a new API type "eventsv1.Event" is being introduced.
+  The migration of Event API will cause creation of new types of Event objects.
+  The number of Event objects depends on cluster state, which theoretically
+  won't be too large due to deduplication logic and reasonable-cardinality
+  of objects in the system.

This sentence is super misleading. Suggest:

The number of Event objects depends on the cluster state and its churn. Event deduplication and reasonable cardinality of the fields should keep their number within reasonable boundaries (obviously dependent on cluster size).

chelseychen

comment created time in 39 minutes
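
The Scalability answer in the diff above gives two timings: a "heartbeat" call every 30 minutes for events that keep happening, and garbage collection for an event that has not happened for 6 minutes. As a rough illustration only, the decision those two numbers imply could be sketched as below. The function `next_action` and its return values are hypothetical, and this is not the client-go EventRecorder implementation.

```python
from datetime import datetime, timedelta

# Both intervals are taken from the quoted KEP text.
HEARTBEAT_INTERVAL = timedelta(minutes=30)
GC_AFTER_IDLE = timedelta(minutes=6)


def next_action(last_occurrence: datetime,
                last_heartbeat: datetime,
                now: datetime) -> str:
    """Decide what to do with one tracked event (hypothetical helper).

    Returns "gc" if the event has been idle long enough to be collected,
    "heartbeat" if it is still active and a heartbeat is due,
    and "wait" otherwise.
    """
    if now - last_occurrence >= GC_AFTER_IDLE:
        return "gc"
    if now - last_heartbeat >= HEARTBEAT_INTERVAL:
        return "heartbeat"
    return "wait"


if __name__ == "__main__":
    t0 = datetime(2020, 5, 19, 12, 0)
    # Event seen one minute ago, last heartbeat 32 minutes ago.
    print(next_action(t0, t0 - timedelta(minutes=31), t0 + timedelta(minutes=1)))
    # Event idle for seven minutes.
    print(next_action(t0, t0, t0 + timedelta(minutes=7)))
```

Under this reading only continuously recurring events generate the extra API call, which is why the answer treats the added load as small.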

Pull request review comment kubernetes/enhancements

Add PRR questionnaire section to New Event API KEP

+* **Will enabling / using this feature result in any new calls to cloud
+  provider?**
+
+  No.
+
+* **Will enabling / using this feature result in increasing size or count
+  of the existing API objects?**
+  Describe them providing:
+
+  The difference in size of the Event object comes from new Action and Related
+  fields. We can safely estimate the increase to be smaller than 30%. We'll

Let's change the "We'll ..." sentence to the following:

... 30%.
However, more events may be emitted. As an example, a new Event will be emitted for Pod creation done by standard controllers (e.g. ReplicaSet), as they are currently deduplicated across all 'owner' objects. However, given that there are at least 5 other events being emitted during pod startup, the impact of it can be bounded by 20%.
chelseychen

comment created time in 36 minutes

Pull request review comment kubernetes/enhancements

Add PRR questionnaire section to New Event API KEP

+* **Will enabling / using this feature result in increasing size or count
+  of the existing API objects?**
+  Describe them providing:
+
+  The difference in size of the Event object comes from new Action and Related
+  fields. We can safely estimate the increase to be smaller than 30%. We'll
+  also emit additional Event per Pod creation, as currently Events for that
+  are being deduplicated. There are currently at least 6 Events emitted when
+  Pod is started, so impact of this change can be bounded by 20%. This means
+  that in the worst case the increase in Event size can be bounded by 56%.

I've never really understood where this 56 comes from, especially given that the above is just an example.

So I suggest rephrasing this sentence to:

In total, we estimate that the increase in the total size of all Events can be conservatively bounded by ~50%, but the practical bound should be much smaller.
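
For reference, the 56% in the quoted text appears to come from compounding the two bounds stated there (each Event up to 30% larger, up to 20% more Events). A minimal sketch of that arithmetic follows; the function name is hypothetical and the inputs are just the numbers from the quoted example, not measurements:

```go
package main

import "fmt"

// compoundedIncreasePct returns the combined percentage increase in total
// Event size when each Event grows by sizePct percent and the number of
// Events grows by countPct percent. The name is illustrative only.
func compoundedIncreasePct(sizePct, countPct int) int {
	return (100+sizePct)*(100+countPct)/100 - 100
}

func main() {
	// Bounds taken from the quoted KEP text: 30% per-Event size, 20% Event count.
	fmt.Println(compoundedIncreasePct(30, 20)) // prints 56
}
```

Because both inputs are themselves rough upper bounds, the product is only a loose worst case, which is why a rounded "~50%" reads more honestly than "56%".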
chelseychen

comment created time in 34 minutes

Pull request review comment kubernetes/enhancements

Add PRR questionnaire section to New Event API KEP

 List Events from the NamespaceSystem with field selector `reportingController =  List all Event with field selector `regarding.name = podName, regarding.namespace = podNamespace`, and `related.name = podName, related.namespace = podNamespace`. You need to join results outside of the kubernetes API. +## Production Readiness Review Questionnaire++### Feature enablement and rollback++_This section must be completed when targeting alpha to a release._++* **How can this feature be enabled / disabled in a live cluster?**+  - [ ] Feature gate (also fill in values in `kep.yaml`)+    - Feature gate name:+    - Components depending on the feature gate:+  - [x] Other+    - Describe the mechanism:++      (1) The API itself can be enabled / disabled at kube-apiserver level+      by using `--runtime-config` flag;++      (2) For the use of API, we have a fallback mechanism instead of using+      a feature gate. That is, we simply fallback to the old Event libraries+      if the API is diabled.++      Currently this fallback is implemented purely in scheduler but we're+      planning to move it into the library itself.++    - Will enabling / disabling the feature require downtime of the control+      plane?++      (1) Yes, enabling API requires to restart apiserver.++      (2) No, enabling the use of the API doesn't require that.++    - Will enabling / disabling the feature require downtime or reprovisioning+      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).++      No.++* **Does enabling the feature change any default behavior?**+  Any change of default behavior may be surprising to users or break existing+  automations, so be extremely careful here.++  While the graduation of the API itself doesn't change default behavior,+  migration of individual components does, as the events will be reported+  differently.++* **Can the feature be disabled once it has been enabled (i.e. 
can we rollback+  the enablement)?**+  Also set `rollback-supported` to `true` or `false` in `kep.yaml`.+  Describe the consequences on existing workloads (e.g. if this is runtime+  feature, can it break the existing applications?).++  Yes. If the new Event API is disabled, it will fallback to the original one +  (The new events are roundtrippable with the old `corev1.Events`).++  If individual components don't implement it, rollback of client-library use+  may not be possible (i.e. they only fallback to the old API if the new API+  is disabled, if there is bug in the client-library, there is no way to+  fallback as of now).++* **What happens if we reenable the feature if it was previously rolled back?**++  New types of Events will be generated instead of the old one.++* **Are there any tests for feature enablement/disablement?**+  The e2e framework does not currently support enabling and disabling feature+  gates. However, unit tests in each component dealing with managing data created+  with and without the feature are necessary. At the very least, think about+  conversion tests if API types are being modified.++  Manual tests will be performed to ensure things work when either enabling+  or disabling the new Event API.++  More information in [Test Plan](#test-plan) section.++### Rollout, Upgrade and Rollback Planning++_This section must be completed when targeting beta graduation to a release._++* **How can a rollout fail? Can it impact already running workloads?**+  Try to be as paranoid as possible - e.g. what if some components will restart+  in the middle of rollout?++  A rollout could fail if some components restart in the middle of the rollout.+  Then those components will continue using the old Event API.++* **What specific metrics should inform a rollback?**++* **Were upgrade and rollback tested? 
Was upgrade->downgrade->upgrade path tested?**+  Describe manual testing that was done and the outcomes.+  Longer term, we may want to require automated upgrade/rollback tests, but we+  are missing a bunch of machinery and tooling and do that now.++  Not yet. It could be done by enabling / disabling new Event API.++* **Is the rollout accompanied by any deprecations and/or removals of features,+  APIs, fields of API types, flags, etc.?**+  Even if applying deprecation policies, they may still surprise some users.++  State field of EventSeries will be removed from corev1.Event API.++### Monitoring requirements++_This section must be completed when targeting beta graduation to a release._++* **How can an operator determine if the feature is in use by workloads?**+  Ideally, this should be a metrics. Operations against Kubernetes API (e.g.+  checking if there are objects with field X set) may be last resort. Avoid+  logs or events for this purpose.++  The API, as a feature that workloads may in theory use,+  can be determined by looking into the apiserver_requests_total metric.++* **What are the SLIs (Service Level Indicators) an operator can use to+  determine the health of the service?**+  - [x] Metrics+    - Metric name: apiserver_requests_total+    - Components exposing the metric: kube-apiserver+  - [ ] Other (treat as last resort)+    - Details:++* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**+  At the high-level this usually will be in the form of "high percentile of SLI+  per day <= X". 
It's impossible to provide a comprehensive guidance, but at the very+  high level (they needs more precise definitions) those may be things like:+  - per-day percentage of API calls finishing with 5XX errors <= 1%+  - 99% percentile over day of absolute value from (job creation time minus expected+    job creation time) for cron job <= 10%+  - 99,9% of /health requests per day finish with 200 code++  Events have always been "best-effort".+  We're sticking to that with the new API too, so no SLO will be introduced.++* **Are there any missing metrics that would be useful to have to improve+  observability if this feature?**+  Describe the metrics themselves and the reason they weren't added (e.g. cost,+  implementation difficulties, etc.).++  No.++### Dependencies++_This section must be completed when targeting beta graduation to a release._++* **Does this feature depend on any specific services running in the cluster?**+  Think about both cluster-level services (e.g. metrics-server) as well+  as node-level agents (e.g. specific version of CRI). Focus on external or+  optional services that are needed. For example, if this feature depends on+  a cloud provider API, or upon an external software-defined storage or network+  control plane.++  For each of the fill in the following, thinking both about running user workloads+  and creating new ones, as well as about cluster-level services (e.g. 
DNS):+  +  N/A+++### Scalability++_For alpha, this section is encouraged: reviewers should consider these questions+and attempt to answer them._++_For beta, this section is required: reviewers must answer these questions._++_For GA, this section is required: approvers should be able to confirms the+previous answers based on experience in the field._++* **Will enabling / using this feature result in any new API calls?**+  Describe them, providing:++  In the new EventRecorder, every 30 minutes a "heartbeat" call will be performed+  to update Event status and prevent garbage collection in etcd. This heartbeat+  is happening for events that are happening all the time (If an event didn't+  happen for 6 minutes, it will be GC-ed).++* **Will enabling / using this feature result in introducing new API types?**++  Yes, a new API type "eventsv1.Event" is being introduced.+  The migration of Event API will cause creation of new types of Event objects.+  The number of Event objects depends on cluster state, which theoretically+  won't be too large due to deduplication logic and reasonable-cardinality+  of objects in the system.++* **Will enabling / using this feature result in any new calls to cloud+  provider?**++  No.++* **Will enabling / using this feature result in increasing size or count+  of the existing API objects?**+  Describe them providing:+  +  The difference in size of the Event object comes from new Action and Related+  fields. We can safely estimate the increase to be smaller than 30%. We'll+  also emit additional Event per Pod creation, as currently Events for that+  are being deduplicated. There are currently at least 6 Events emitted when+  Pod is started, so impact of this change can be bounded by 20%. 
This means+  that in the worst case the increase in Event size can be bounded by 56%.++* **Will enabling / using this feature result in increasing time taken by any+  operations covered by [existing SLIs/SLOs][]?**+  +  No++* **Will enabling / using this feature result in non-negligible increase of+  resource usage (CPU, RAM, disk, IO, ...) in any components?**+  +  The potential increase of Event size might cause non-negligible storage+  increase in Etcd.

Which also means:

  • network bandwidth to send them
  • CPU to process them

[neither should dominate what we already have, but both should be mentioned]

chelseychen

comment created time in 34 minutes

Pull request review comment kubernetes/enhancements

Add PRR questionnaire section to New Event API KEP

 List Events from the NamespaceSystem with field selector `reportingController =  List all Event with field selector `regarding.name = podName, regarding.namespace = podNamespace`, and `related.name = podName, related.namespace = podNamespace`. You need to join results outside of the kubernetes API. +## Production Readiness Review Questionnaire++### Feature enablement and rollback++_This section must be completed when targeting alpha to a release._++* **How can this feature be enabled / disabled in a live cluster?**+  - [ ] Feature gate (also fill in values in `kep.yaml`)+    - Feature gate name:+    - Components depending on the feature gate:+  - [x] Other+    - Describe the mechanism:++      (1) The API itself can be enabled / disabled at kube-apiserver level+      by using `--runtime-config` flag;++      (2) For the use of API, we have a fallback mechanism instead of using+      a feature gate. That is, we simply fallback to the old Event libraries+      if the API is diabled.++      Currently this fallback is implemented purely in scheduler but we're+      planning to move it into the library itself.++    - Will enabling / disabling the feature require downtime of the control+      plane?++      (1) Yes, enabling API requires to restart apiserver.++      (2) No, enabling the use of the API doesn't require that.

This one is a bit misleading, and also isn't fully true imho.

Basically, the currently implemented fallback happens only at component initialization: whatever state the API was in at that point, we never recheck it later. So I would say that if you enable/disable the API, you also need to restart the components using it.
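
To illustrate the point, here is a minimal, self-contained sketch of a fallback that is decided once at initialization. It is not the scheduler's actual code: `apiEnabled`, `recorder`, and `newRecorder` are hypothetical stand-ins, and a real component would presumably ask the apiserver rather than read a variable.

```go
package main

import "fmt"

// apiEnabled stands in for "is the new Event API currently enabled?".
// It is a plain variable here so the example stays self-contained.
var apiEnabled = true

// recorder is a hypothetical stand-in for an event recorder.
type recorder struct{ api string }

// newRecorder picks the API once, at construction time, and never rechecks.
func newRecorder() *recorder {
	if apiEnabled {
		return &recorder{api: "new Event API"}
	}
	return &recorder{api: "old Event API"}
}

func main() {
	r := newRecorder() // component initialization
	apiEnabled = false // the API is disabled while the component keeps running
	fmt.Println(r.api) // still the choice made at startup

	r = newRecorder() // only re-initialization (a restart) picks up the change
	fmt.Println(r.api)
}
```

With this shape, toggling the API has no effect on an already running component, which is the reason a restart is needed.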

chelseychen

comment created time in an hour

Pull request review comment kubernetes/enhancements

Add PRR questionnaire section to New Event API KEP

 List Events from the NamespaceSystem with field selector `reportingController =  List all Event with field selector `regarding.name = podName, regarding.namespace = podNamespace`, and `related.name = podName, related.namespace = podNamespace`. You need to join results outside of the kubernetes API. +## Production Readiness Review Questionnaire++### Feature enablement and rollback++_This section must be completed when targeting alpha to a release._++* **How can this feature be enabled / disabled in a live cluster?**+  - [ ] Feature gate (also fill in values in `kep.yaml`)+    - Feature gate name:+    - Components depending on the feature gate:+  - [x] Other+    - Describe the mechanism:++      (1) The API itself can be enabled / disabled at kube-apiserver level+      by using `--runtime-config` flag;++      (2) For the use of API, we have a fallback mechanism instead of using+      a feature gate. That is, we simply fallback to the old Event libraries+      if the API is diabled.++      Currently this fallback is implemented purely in scheduler but we're+      planning to move it into the library itself.++    - Will enabling / disabling the feature require downtime of the control+      plane?++      (1) Yes, enabling API requires to restart apiserver.++      (2) No, enabling the use of the API doesn't require that.++    - Will enabling / disabling the feature require downtime or reprovisioning+      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).++      No.++* **Does enabling the feature change any default behavior?**+  Any change of default behavior may be surprising to users or break existing+  automations, so be extremely careful here.++  While the graduation of the API itself doesn't change default behavior,+  migration of individual components does, as the events will be reported+  differently.++* **Can the feature be disabled once it has been enabled (i.e. 
can we rollback+  the enablement)?**+  Also set `rollback-supported` to `true` or `false` in `kep.yaml`.+  Describe the consequences on existing workloads (e.g. if this is runtime+  feature, can it break the existing applications?).++  Yes. If the new Event API is disabled, it will fallback to the original one +  (The new events are roundtrippable with the old `corev1.Events`).++  If individual components don't implement it, rollback of client-library use+  may not be possible (i.e. they only fallback to the old API if the new API+  is disabled, if there is bug in the client-library, there is no way to+  fallback as of now).++* **What happens if we reenable the feature if it was previously rolled back?**++  New types of Events will be generated instead of the old one.++* **Are there any tests for feature enablement/disablement?**+  The e2e framework does not currently support enabling and disabling feature+  gates. However, unit tests in each component dealing with managing data created+  with and without the feature are necessary. At the very least, think about+  conversion tests if API types are being modified.++  Manual tests will be performed to ensure things work when either enabling+  or disabling the new Event API.++  More information in [Test Plan](#test-plan) section.++### Rollout, Upgrade and Rollback Planning++_This section must be completed when targeting beta graduation to a release._++* **How can a rollout fail? Can it impact already running workloads?**+  Try to be as paranoid as possible - e.g. what if some components will restart+  in the middle of rollout?++  A rollout could fail if some components restart in the middle of the rollout.

I don't really understand this one - can you clarify?

chelseychen

comment created time in an hour

Pull request review comment kubernetes/enhancements

Add PRR questionnaire section to New Event API KEP

 List Events from the NamespaceSystem with field selector `reportingController =  List all Event with field selector `regarding.name = podName, regarding.namespace = podNamespace`, and `related.name = podName, related.namespace = podNamespace`. You need to join results outside of the kubernetes API. +## Production Readiness Review Questionnaire++### Feature enablement and rollback++_This section must be completed when targeting alpha to a release._++* **How can this feature be enabled / disabled in a live cluster?**+  - [ ] Feature gate (also fill in values in `kep.yaml`)+    - Feature gate name:+    - Components depending on the feature gate:+  - [x] Other+    - Describe the mechanism:++      (1) The API itself can be enabled / disabled at kube-apiserver level+      by using `--runtime-config` flag;++      (2) For the use of API, we have a fallback mechanism instead of using+      a feature gate. That is, we simply fallback to the old Event libraries+      if the API is diabled.++      Currently this fallback is implemented purely in scheduler but we're+      planning to move it into the library itself.++    - Will enabling / disabling the feature require downtime of the control+      plane?++      (1) Yes, enabling API requires to restart apiserver.++      (2) No, enabling the use of the API doesn't require that.++    - Will enabling / disabling the feature require downtime or reprovisioning+      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).++      No.++* **Does enabling the feature change any default behavior?**+  Any change of default behavior may be surprising to users or break existing+  automations, so be extremely careful here.++  While the graduation of the API itself doesn't change default behavior,+  migration of individual components does, as the events will be reported+  differently.++* **Can the feature be disabled once it has been enabled (i.e. 
can we rollback+  the enablement)?**+  Also set `rollback-supported` to `true` or `false` in `kep.yaml`.+  Describe the consequences on existing workloads (e.g. if this is runtime+  feature, can it break the existing applications?).++  Yes. If the new Event API is disabled, it will fallback to the original one +  (The new events are roundtrippable with the old `corev1.Events`).++  If individual components don't implement it, rollback of client-library use+  may not be possible (i.e. they only fallback to the old API if the new API+  is disabled, if there is bug in the client-library, there is no way to+  fallback as of now).++* **What happens if we reenable the feature if it was previously rolled back?**++  New types of Events will be generated instead of the old one.++* **Are there any tests for feature enablement/disablement?**+  The e2e framework does not currently support enabling and disabling feature+  gates. However, unit tests in each component dealing with managing data created+  with and without the feature are necessary. At the very least, think about+  conversion tests if API types are being modified.++  Manual tests will be performed to ensure things work when either enabling+  or disabling the new Event API.++  More information in [Test Plan](#test-plan) section.++### Rollout, Upgrade and Rollback Planning++_This section must be completed when targeting beta graduation to a release._++* **How can a rollout fail? Can it impact already running workloads?**+  Try to be as paranoid as possible - e.g. what if some components will restart+  in the middle of rollout?++  A rollout could fail if some components restart in the middle of the rollout.+  Then those components will continue using the old Event API.++* **What specific metrics should inform a rollback?**++* **Were upgrade and rollback tested? 
Was upgrade->downgrade->upgrade path tested?**+  Describe manual testing that was done and the outcomes.+  Longer term, we may want to require automated upgrade/rollback tests, but we+  are missing a bunch of machinery and tooling and do that now.++  Not yet. It could be done by enabling / disabling new Event API.++* **Is the rollout accompanied by any deprecations and/or removals of features,+  APIs, fields of API types, flags, etc.?**+  Even if applying deprecation policies, they may still surprise some users.++  State field of EventSeries will be removed from corev1.Event API.++### Monitoring requirements++_This section must be completed when targeting beta graduation to a release._++* **How can an operator determine if the feature is in use by workloads?**+  Ideally, this should be a metrics. Operations against Kubernetes API (e.g.+  checking if there are objects with field X set) may be last resort. Avoid+  logs or events for this purpose.++  The API, as a feature that workloads may in theory use,+  can be determined by looking into the apiserver_requests_total metric.++* **What are the SLIs (Service Level Indicators) an operator can use to+  determine the health of the service?**+  - [x] Metrics+    - Metric name: apiserver_requests_total+    - Components exposing the metric: kube-apiserver+  - [ ] Other (treat as last resort)+    - Details:++* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**+  At the high-level this usually will be in the form of "high percentile of SLI+  per day <= X". 
It's impossible to provide a comprehensive guidance, but at the very+  high level (they needs more precise definitions) those may be things like:+  - per-day percentage of API calls finishing with 5XX errors <= 1%+  - 99% percentile over day of absolute value from (job creation time minus expected+    job creation time) for cron job <= 10%+  - 99,9% of /health requests per day finish with 200 code++  Events have always been "best-effort".+  We're sticking to that with the new API too, so no SLO will be introduced.++* **Are there any missing metrics that would be useful to have to improve+  observability if this feature?**+  Describe the metrics themselves and the reason they weren't added (e.g. cost,+  implementation difficulties, etc.).++  No.++### Dependencies++_This section must be completed when targeting beta graduation to a release._++* **Does this feature depend on any specific services running in the cluster?**+  Think about both cluster-level services (e.g. metrics-server) as well+  as node-level agents (e.g. specific version of CRI). Focus on external or+  optional services that are needed. For example, if this feature depends on+  a cloud provider API, or upon an external software-defined storage or network+  control plane.++  For each of the fill in the following, thinking both about running user workloads+  and creating new ones, as well as about cluster-level services (e.g. 
DNS):+  +  N/A+++### Scalability++_For alpha, this section is encouraged: reviewers should consider these questions+and attempt to answer them._++_For beta, this section is required: reviewers must answer these questions._++_For GA, this section is required: approvers should be able to confirms the+previous answers based on experience in the field._++* **Will enabling / using this feature result in any new API calls?**+  Describe them, providing:++  In the new EventRecorder, every 30 minutes a "heartbeat" call will be performed+  to update Event status and prevent garbage collection in etcd. This heartbeat+  is happening for events that are happening all the time (If an event didn't+  happen for 6 minutes, it will be GC-ed).++* **Will enabling / using this feature result in introducing new API types?**++  Yes, a new API type "eventsv1.Event" is being introduced.+  The migration of Event API will cause creation of new types of Event objects.

Let's remove this sentence. Given that they have a common representation in etcd (they are roundtrippable), it's not fully true.

chelseychen

comment created time in 41 minutes

Pull request review comment kubernetes/enhancements

Add PRR questionnaire section to New Event API KEP

 List Events from the NamespaceSystem with field selector `reportingController =  List all Event with field selector `regarding.name = podName, regarding.namespace = podNamespace`, and `related.name = podName, related.namespace = podNamespace`. You need to join results outside of the kubernetes API. +## Production Readiness Review Questionnaire++### Feature enablement and rollback++_This section must be completed when targeting alpha to a release._++* **How can this feature be enabled / disabled in a live cluster?**+  - [ ] Feature gate (also fill in values in `kep.yaml`)+    - Feature gate name:+    - Components depending on the feature gate:+  - [x] Other+    - Describe the mechanism:++      (1) The API itself can be enabled / disabled at kube-apiserver level+      by using `--runtime-config` flag;++      (2) For the use of API, we have a fallback mechanism instead of using+      a feature gate. That is, we simply fallback to the old Event libraries+      if the API is diabled.++      Currently this fallback is implemented purely in scheduler but we're+      planning to move it into the library itself.++    - Will enabling / disabling the feature require downtime of the control+      plane?++      (1) Yes, enabling API requires to restart apiserver.++      (2) No, enabling the use of the API doesn't require that.++    - Will enabling / disabling the feature require downtime or reprovisioning+      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).++      No.++* **Does enabling the feature change any default behavior?**+  Any change of default behavior may be surprising to users or break existing+  automations, so be extremely careful here.++  While the graduation of the API itself doesn't change default behavior,+  migration of individual components does, as the events will be reported+  differently.++* **Can the feature be disabled once it has been enabled (i.e. 
can we rollback+  the enablement)?**+  Also set `rollback-supported` to `true` or `false` in `kep.yaml`.+  Describe the consequences on existing workloads (e.g. if this is runtime+  feature, can it break the existing applications?).++  Yes. If the new Event API is disabled, it will fallback to the original one +  (The new events are roundtrippable with the old `corev1.Events`).++  If individual components don't implement it, rollback of client-library use+  may not be possible (i.e. they only fallback to the old API if the new API+  is disabled, if there is bug in the client-library, there is no way to+  fallback as of now).++* **What happens if we reenable the feature if it was previously rolled back?**++  New types of Events will be generated instead of the old one.++* **Are there any tests for feature enablement/disablement?**+  The e2e framework does not currently support enabling and disabling feature+  gates. However, unit tests in each component dealing with managing data created+  with and without the feature are necessary. At the very least, think about+  conversion tests if API types are being modified.++  Manual tests will be performed to ensure things work when either enabling+  or disabling the new Event API.++  More information in [Test Plan](#test-plan) section.++### Rollout, Upgrade and Rollback Planning++_This section must be completed when targeting beta graduation to a release._++* **How can a rollout fail? Can it impact already running workloads?**+  Try to be as paranoid as possible - e.g. what if some components will restart+  in the middle of rollout?++  A rollout could fail if some components restart in the middle of the rollout.+  Then those components will continue using the old Event API.++* **What specific metrics should inform a rollback?**

A too high or too low value (vs. what is expected) of apiserver_request_total: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go#L66

[that may suggest a bug in the library]
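
One way to read "too high/too low" is as a band around an expected request rate. The sketch below only illustrates that idea: the function name, the sample rates, and the 50% tolerance are all made up, and obtaining the observed rate from the metric is left out.

```go
package main

import "fmt"

// outsideExpected reports whether an observed request rate deviates from the
// expected rate by more than the given fraction (e.g. 0.5 for +/-50%).
// All names and thresholds here are illustrative assumptions.
func outsideExpected(observed, expected, tolerance float64) bool {
	low := expected * (1 - tolerance)
	high := expected * (1 + tolerance)
	return observed < low || observed > high
}

func main() {
	fmt.Println(outsideExpected(100, 100, 0.5)) // within the band
	fmt.Println(outsideExpected(400, 100, 0.5)) // far too many Event requests
	fmt.Println(outsideExpected(10, 100, 0.5))  // far too few Event requests
}
```

What counts as "expected" depends on the cluster, so any real threshold would have to come from a baseline taken before the rollout.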

chelseychen

comment created time in 44 minutes

Pull request review comment kubernetes/enhancements

Add PRR questionnaire section to New Event API KEP

 List Events from the NamespaceSystem with field selector `reportingController =  List all Event with field selector `regarding.name = podName, regarding.namespace = podNamespace`, and `related.name = podName, related.namespace = podNamespace`. You need to join results outside of the kubernetes API. +## Production Readiness Review Questionnaire++### Feature enablement and rollback++_This section must be completed when targeting alpha to a release._++* **How can this feature be enabled / disabled in a live cluster?**+  - [ ] Feature gate (also fill in values in `kep.yaml`)+    - Feature gate name:+    - Components depending on the feature gate:+  - [x] Other+    - Describe the mechanism:++      (1) The API itself can be enabled / disabled at kube-apiserver level+      by using `--runtime-config` flag;++      (2) For the use of API, we have a fallback mechanism instead of using+      a feature gate. That is, we simply fallback to the old Event libraries+      if the API is diabled.++      Currently this fallback is implemented purely in scheduler but we're+      planning to move it into the library itself.++    - Will enabling / disabling the feature require downtime of the control+      plane?++      (1) Yes, enabling API requires to restart apiserver.++      (2) No, enabling the use of the API doesn't require that.++    - Will enabling / disabling the feature require downtime or reprovisioning+      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).++      No.++* **Does enabling the feature change any default behavior?**+  Any change of default behavior may be surprising to users or break existing+  automations, so be extremely careful here.++  While the graduation of the API itself doesn't change default behavior,+  migration of individual components does, as the events will be reported+  differently.++* **Can the feature be disabled once it has been enabled (i.e. 
can we rollback
+  the enablement)?**
+  Also set `rollback-supported` to `true` or `false` in `kep.yaml`.
+  Describe the consequences on existing workloads (e.g. if this is runtime
+  feature, can it break the existing applications?).
+
+  Yes. If the new Event API is disabled, it will fallback to the original one
+  (The new events are roundtrippable with the old `corev1.Events`).
+
+  If individual components don't implement it, rollback of client-library use
+  may not be possible (i.e. they only fallback to the old API if the new API
+  is disabled, if there is bug in the client-library, there is no way to
+  fallback as of now).
+
+* **What happens if we reenable the feature if it was previously rolled back?**
+
+  New types of Events will be generated instead of the old one.
+
+* **Are there any tests for feature enablement/disablement?**
+  The e2e framework does not currently support enabling and disabling feature
+  gates. However, unit tests in each component dealing with managing data created
+  with and without the feature are necessary. At the very least, think about
+  conversion tests if API types are being modified.
+
+  Manual tests will be performed to ensure things work when either enabling
+  or disabling the new Event API.
+
+  More information in [Test Plan](#test-plan) section.
+
+### Rollout, Upgrade and Rollback Planning
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can a rollout fail? Can it impact already running workloads?**
+  Try to be as paranoid as possible - e.g. what if some components will restart
+  in the middle of rollout?
+
+  A rollout could fail if some components restart in the middle of the rollout.
+  Then those components will continue using the old Event API.
+
+* **What specific metrics should inform a rollback?**
+
+* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?**
+  Describe manual testing that was done and the outcomes.
+  Longer term, we may want to require automated upgrade/rollback tests, but we
+  are missing a bunch of machinery and tooling and do that now.
+
+  Not yet. It could be done by enabling / disabling new Event API.
+
+* **Is the rollout accompanied by any deprecations and/or removals of features,
+  APIs, fields of API types, flags, etc.?**
+  Even if applying deprecation policies, they may still surprise some users.
+
+  State field of EventSeries will be removed from corev1.Event API.
+
+### Monitoring requirements
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can an operator determine if the feature is in use by workloads?**
+  Ideally, this should be a metrics. Operations against Kubernetes API (e.g.
+  checking if there are objects with field X set) may be last resort. Avoid
+  logs or events for this purpose.
+
+  The API, as a feature that workloads may in theory use,
+  can be determined by looking into the apiserver_requests_total metric.
+
+* **What are the SLIs (Service Level Indicators) an operator can use to
+  determine the health of the service?**
+  - [x] Metrics
+    - Metric name: apiserver_requests_total
+    - Components exposing the metric: kube-apiserver
+  - [ ] Other (treat as last resort)
+    - Details:
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+  At the high-level this usually will be in the form of "high percentile of SLI
+  per day <= X". It's impossible to provide a comprehensive guidance, but at the very
+  high level (they needs more precise definitions) those may be things like:
+  - per-day percentage of API calls finishing with 5XX errors <= 1%
+  - 99% percentile over day of absolute value from (job creation time minus expected
+    job creation time) for cron job <= 10%
+  - 99,9% of /health requests per day finish with 200 code
+
+  Events have always been "best-effort".
+  We're sticking to that with the new API too, so no SLO will be introduced.
+
+* **Are there any missing metrics that would be useful to have to improve
+  observability if this feature?**
+  Describe the metrics themselves and the reason they weren't added (e.g. cost,
+  implementation difficulties, etc.).
+
+  No.
+
+### Dependencies
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **Does this feature depend on any specific services running in the cluster?**
+  Think about both cluster-level services (e.g. metrics-server) as well
+  as node-level agents (e.g. specific version of CRI). Focus on external or
+  optional services that are needed. For example, if this feature depends on
+  a cloud provider API, or upon an external software-defined storage or network
+  control plane.
+
+  For each of the fill in the following, thinking both about running user workloads
+  and creating new ones, as well as about cluster-level services (e.g. DNS):
+
+  N/A
+
+
+### Scalability
+
+_For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them._
+
+_For beta, this section is required: reviewers must answer these questions._
+
+_For GA, this section is required: approvers should be able to confirms the
+previous answers based on experience in the field._
+
+* **Will enabling / using this feature result in any new API calls?**
+  Describe them, providing:
+
+  In the new EventRecorder, every 30 minutes a "heartbeat" call will be performed
+  to update Event status and prevent garbage collection in etcd. This heartbeat
+  is happening for events that are happening all the time (If an event didn't

nit: s/all the time/periodically/ ?

chelseychen

comment created time in 43 minutes

issue comment kubernetes/enhancements

Immutable Secrets and ConfigMaps

@mikejoh - opened https://github.com/kubernetes/website/pull/21189

wojtek-t

comment created time in an hour

pull request comment kubernetes/enhancements

An initial version of the External TLS certificate authenticator KEP

Based on the feedback from @enj and @awly, we have come up with a draft of a protocol for the communication between kubectl/client-go and an external signer, using gRPC over a Unix socket. Please find it below.

syntax = "proto3";

package v1alpha1;

// This service defines the public APIs for external signer plugin.
service ExternalSignerService {
    // Version returns the version of the external signer plugin.
    rpc Version(VersionRequest) returns (VersionResponse) {}
    // Get certificate from the external signer.
    rpc GetCertificate(CertificateRequest) returns (stream CertificateResponse) {}
    // Execute signing operation in the external signer plugin.
    rpc Sign(SignatureRequest) returns (stream SignatureResponse) {}
}
message VersionRequest {
    // Version of the external signer plugin API.
    string version = 1;
}
message VersionResponse {
    // Version of the external signer plugin API.
    string version = 1;
}
message CertificateRequest {
    // Version of the external signer plugin API.
    string version = 1;
    // Name of the Kubernetes cluster.
    string clusterName = 2;
    // Configuration of the external signer plugin. This configuration is specific to the external signer, but stored in KUBECONFIG for the user's convenience to allow multiplexing a single external signer for several K8s users.
    map<string, string> configuration = 3;
}
message CertificateResponse {
    oneof content {
        // Client certificate.
        bytes certificate = 1;
        // User prompt.
        string userPrompt = 2;
    }
}
message SignatureRequest {
    // Version of the external signer plugin API.
    string version = 1;
    // Name of the Kubernetes cluster.
    string clusterName = 2;
    // Configuration of the external signer plugin (HSM protocol specific).
    map<string, string> configuration = 3;
    // Digest to be signed.
    bytes digest = 4;
    // Enumeration of supported signer types.
    enum SignerType {
        RSAPSS = 0;
    }
    // Type of signer.
    SignerType signerType = 5;
    // Definition of options for creating the PSS signature.
    message RSAPSSOptions {
        // Length of the salt for creating the PSS signature.
        int32 saltLenght = 1;
        // Hash function for creating the PSS signature.
        uint32 hash = 2;
    }
    // Options for creating the PSS signature (used when signerType is set to RSAPSS).
    RSAPSSOptions signerOptsRSAPSS = 6;
}
message SignatureResponse {
    oneof content {
        // Signature.
        bytes signature = 1;
        // User prompt.
        string userPrompt = 2;
    }
}
jakubkrzywda

comment created time in an hour

pull request comment kubernetes/enhancements

Update release-notes KEP to reflect the current state

@saschagrunert thanks for the PR. I think your changes capture most of the current direction I understand the project to be taking. There are three different areas we could address when thinking about the KEP:

  1. Updating general design issues to where they are now and where they are heading. I think you already mentioned most of them.
  2. Addressing those areas where the original ideas from @jeefy 's KEP have already been implemented.
  3. Future plans and direction

Here is what I think about those three items:

  1. As I said before, I think your changes reflect the current design well, including the areas that have shifted since the KEP was written a year ago.
  2. Regarding the actual implementation, I like your checklist because it shows what has been implemented without altering the KEP much. If we are trying to reflect the current progress of the implementation in the KEP itself, there are other areas we ought to note as well. For example, the fact that the website is already up, with its own domain and out of the personal repo.
  3. Finally, there are the plans that lie ahead of us (as of 2020). I think we are due for a good talk on the focus of the tools and how they are used, drawing mostly on the current status of the code but also on the human/organizational side of things.

But perhaps this last point should be left out of the KEP. After all, the original intent of the KEP was this:

this KEP would graduate once we have a dedicated release notes website that is automatically updated with minimal human interaction.

And we are on the brink of that. In fact, if we were to leave the bucket issue out of the scope of the KEP, we could say that the original mission of the KEP has already been fulfilled, as it only takes one command to go from nothing to the PR that updates the website.

What do you think?

saschagrunert

comment created time in 2 hours

PR opened mesosphere/kubernetes-base-addons

Move to KUDO based istio operator

What type of PR is this? Feature

What this PR does / why we need it: Istio 1.5.x now comes from the KUDO-based Istio operator. The operator repo is mesosphere/kudo-istio.

Which issue(s) this PR fixes: no issue

Special notes for your reviewer:

Does this PR introduce a user-facing change?: NONE

Checklist

  • [ ] The commit message explains the changes and why they are needed.
  • [ ] The code builds and passes lint/style checks locally.
  • [ ] The relevant subset of integration tests pass locally.
  • [ ] The core changes are covered by tests.
  • [ ] The documentation is updated where needed.
+39 -0

0 comment

1 changed file

pr created time in 2 hours

pull request comment kubernetes/enhancements

optionally disable node ports for Service Type=LoadBalancer

I think a flag is fine, but not a boolean. If possible, coordinate some --lb-default-mode with #1392. If I have got it right, some state must be added to the Service object, like "bind-always" or "bind-never"; could "disable-nodeport" also be an option here? The --lb-default-mode would then be the default for new services.
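
To make the suggestion concrete, here is a rough Go sketch of an enumerated (non-boolean) flag. Everything in it is an assumption for illustration: the type `lbMode`, the helper `parseMode`, the choice of `bind-always` as the default, and the exact set of accepted values are not taken from #1392 or from any existing Kubernetes component.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// lbMode is a hypothetical enumerated value for --lb-default-mode.
type lbMode string

const (
	bindAlways      lbMode = "bind-always"
	bindNever       lbMode = "bind-never"
	disableNodePort lbMode = "disable-nodeport"
)

// String and Set make *lbMode satisfy flag.Value.
func (m *lbMode) String() string {
	if m == nil {
		return ""
	}
	return string(*m)
}

func (m *lbMode) Set(v string) error {
	switch lbMode(v) {
	case bindAlways, bindNever, disableNodePort:
		*m = lbMode(v)
		return nil
	}
	return fmt.Errorf("unknown mode %q", v)
}

// parseMode parses the flag from args; the default is an assumption.
func parseMode(args []string) (lbMode, error) {
	mode := bindAlways
	fs := flag.NewFlagSet("controller", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	fs.Var(&mode, "lb-default-mode", "default load balancer mode for new services")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return mode, nil
}

func main() {
	mode, err := parseMode([]string{"--lb-default-mode=disable-nodeport"})
	if err != nil {
		panic(err)
	}
	fmt.Println("default mode for new services:", mode)
}
```

Unlike a boolean, a string-valued flag of this kind can gain further modes later without adding a second flag.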

andrewsykim

comment created time in 2 hours

create branch mesosphere/kubernetes-base-addons

branch : deepak/istio

created branch time in 2 hours

issue comment kubernetes/enhancements

Seccomp

@palnabarun here are the current ones: https://github.com/kubernetes/kubernetes/pull/91381 https://github.com/kubernetes/kubernetes/pull/91408 https://github.com/kubernetes/kubernetes/pull/91182

I also created an umbrella issue that contains all of them.

pweil-

comment created time in 3 hours

issue comment kubernetes/enhancements

Seccomp

@pjbgf -- Can you please link to all the implementation PRs here, in k/k or otherwise? :slightly_smiling_face:


The current release schedule is:

  • ~Monday, April 13: Week 1 - Release cycle begins~
  • ~Tuesday, May 19: Week 6 - Enhancements Freeze~
  • Thursday, June 25: Week 11 - Code Freeze
  • Thursday, July 9: Week 14 - Docs must be completed and reviewed
  • Tuesday, August 4: Week 17 - Kubernetes v1.19.0 released
pweil-

comment created time in 3 hours
