Qian Zhang (qianzhangxa), Mesosphere, Xi'an, China. Apache Mesos committer and PMC member.

qianzhangxa/cni 0

Container Network Interface - networking for Linux containers

qianzhangxa/convoy 0

A Docker volume plugin, managing persistent container volumes.

qianzhangxa/dcos 0

DC/OS Build and Release tools

qianzhangxa/dcos-docs 0

Documentation for DC/OS

qianzhangxa/dcos-docs-site 0

D2iQ Product Documentation and Docs Website Code

qianzhangxa/dcos-mesos-modules 0

Mesos Modules used in DC/OS

qianzhangxa/homebrew-core 0

🍻 Default formulae for the missing package manager for macOS

qianzhangxa/https-server 0

Launched a simple https server for unit test.

qianzhangxa/image-spec 0

OCI Image Format

fork qianzhangxa/spec

Container Storage Interface (CSI) Specification.

fork in 16 days

pull request comment dcos/dcos

[1.13] Bump Mesos to f8b8f1e9c0fbf0a3ee769eebcebf0b4e98e0bfa6.

@mesosphere-mergebot changelog-not-required

qianzhangxa

comment created time in 18 days

pull request comment dcos/dcos

[1.13] Bump Mesos to f8b8f1e9c0fbf0a3ee769eebcebf0b4e98e0bfa6.

@mesosphere-mergebot label Ready For Review

qianzhangxa

comment created time in 18 days

pull request comment dcos/dcos

[1.13] Bump Mesos to f8b8f1e9c0fbf0a3ee769eebcebf0b4e98e0bfa6.

@mesosphere-mergebot label Ready For Review

qianzhangxa

comment created time in 18 days

PR opened dcos/dcos

[1.13] Bump Mesos to f8b8f1e9c0fbf0a3ee769eebcebf0b4e98e0bfa6.

High-level description

Fix D2IQ-65497.

Corresponding DC/OS tickets (required)

  • D2IQ-65497 COPS-5920: Unmounting volume error after restarting the task (External persistent volumes in Marathon)

+1 -1

0 comment

1 changed file

pr created time in 18 days

create branch qianzhangxa/dcos

branch : D2IQ-65497-1-13

created branch time in 18 days

issue comment kubernetes-sigs/aws-ebs-csi-driver

Implement ListVolumes

@msau42 I think the timeout you are talking about is maxWaitForUnmountDuration, right? In code comments I see:

// maxWaitForUnmountDuration is the max amount of time the reconciler will wait
// for the volume to be safely unmounted, after this it will detach the volume
// anyway (to handle crashed/unavailable nodes). If during this time the volume
// becomes used by a new pod, the detach request will be aborted and the timer
// cleared.

So the reconciler will wait for maxWaitForUnmountDuration before detaching the volume anyway. My question is: when a node goes down, who detaches the volume in the first place, before the reconciler falls back to detaching it anyway? Or, which component notifies the reconciler that a volume needs to be detached because the node is down?
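The timer behaviour described in the quoted code comment can be sketched as a single reconcile pass. This is only an illustration of the comment's wording, not the Kubernetes implementation: `AttachedVolume`, `reconcile`, and the returned action names are hypothetical, and the sketch does not answer which component reports the node as down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttachedVolume:
    # Hypothetical bookkeeping for one attached volume; not a Kubernetes type.
    mounted: bool = True                         # node still reports the volume as mounted
    detach_requested_at: Optional[float] = None  # when the detach timer was started


def reconcile(vol, desired, now, max_wait_for_unmount):
    """Return the action for one reconcile pass: 'keep', 'wait' or 'detach'."""
    if desired:
        # A new pod uses the volume: abort the detach and clear the timer.
        vol.detach_requested_at = None
        return "keep"
    if vol.detach_requested_at is None:
        # First pass in which the volume is no longer wanted: start the timer.
        vol.detach_requested_at = now
    if not vol.mounted:
        # Safely unmounted, so detaching is fine right away.
        return "detach"
    if now - vol.detach_requested_at >= max_wait_for_unmount:
        # Waited long enough (e.g. crashed or unavailable node): detach anyway.
        return "detach"
    return "wait"
```

Calling `reconcile` repeatedly with an increasing `now` returns "wait" until either the volume is reported unmounted or `max_wait_for_unmount` seconds have passed since the timer started.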

msau42

comment created time in 24 days

issue comment kubernetes-sigs/aws-ebs-csi-driver

Implement ListVolumes

Thanks @msau42 for your reply!

For your case, assuming the instance is down and then the Kubernetes Node gets deleted as a result, then the volume will be force detached from the node after 5 minutes

Could you please let me know who is responsible for forcibly detaching the volume after 5 minutes? Is it kube-controller-manager that creates a VolumeAttachment object in this case to trigger csi-attacher to forcibly detach the volume?

msau42

comment created time in 25 days

issue comment kubernetes-sigs/aws-ebs-csi-driver

Implement ListVolumes

I am thinking about this case: an EBS volume has been attached to an AWS instance via a ControllerPublishVolume call, and then this instance goes down unexpectedly. The container orchestrator may choose to issue another ControllerPublishVolume call to attach the volume to another healthy instance, but I guess this call will fail (since the volume is still attached to the failed instance), right?

@msau42 Do you think the latest csi-attacher can help resolve the above problem?

msau42

comment created time in a month

pull request comment dcos/dcos

[2.0] Bump Mesos to 802a50f4902f1f5ca3829dca4a472d8a582f7b9b.

@mesosphere-mergebot label Ready For Review

qianzhangxa

comment created time in a month

PR opened dcos/dcos

[2.0] Bump Mesos to 802a50f4902f1f5ca3829dca4a472d8a582f7b9b.

High-level description

Fix D2IQ-65497.

Corresponding DC/OS tickets (required)

  • D2IQ-65497 COPS-5920: Unmounting volume error after restarting the task (External persistent volumes in Marathon)

+1 -1

0 comment

1 changed file

pr created time in a month

create branch qianzhangxa/dcos

branch : D2IQ-65497

created branch time in a month

Pull request review comment apache/mesos

Keep retrying to remove cgroup on EBUSY.

 Future<Nothing> remove(const string& hierarchy, const string& cgroup)
       [=](const Nothing&) mutable -> Future<ControlFlow<Nothing>> {
         if (::rmdir(path.c_str()) == 0) {
           return process::Break();
-        } else if ((errno == EBUSY) && (retry > 0)) {
-          --retry;
+        } else if (errno == EBUSY) {
+          LOG(WARNING) << "Removal of cgroup " << path
+                       << " failed with EBUSY, will try again";

A newline here.
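For readers unfamiliar with the pattern under review, here is a minimal Python sketch of retrying a directory removal for as long as it fails with EBUSY. It is not the Mesos code: `remove_dir_retrying`, the `delay` between attempts, and the injectable `rmdir`/`sleep` parameters are assumptions made for illustration.

```python
import errno
import os
import time


def remove_dir_retrying(path, delay=0.1, rmdir=os.rmdir, sleep=time.sleep):
    """Remove `path`, retrying without a cap while rmdir fails with EBUSY."""
    while True:
        try:
            rmdir(path)
            return
        except OSError as e:
            if e.errno != errno.EBUSY:
                # Any error other than EBUSY is reported to the caller.
                raise
            print(f"Removal of {path} failed with EBUSY, will try again")
            # Assumed pause between attempts; the delay value is illustrative.
            sleep(delay)
```

Dropping the retry counter means the loop ends only on success or on a different error, so a caller that needs an upper bound has to impose its own timeout.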

cf-natali

comment created time in 2 months

Pull request review comment apache/mesos

Keep retrying to remove cgroup on EBUSY.

 class Destroyer : public Process<Destroyer>
   // The killer processes used to atomically kill tasks in each cgroup.
   vector<Future<Nothing>> killers;
+  // Future used to destroy the cgroup once the tasks have been killed.

A newline before.

s/cgroup/cgroups/

cf-natali

comment created time in 2 months

issue closed kubernetes-sigs/kubefed

Failed to join host cluster `TLS handshake timeout`

What happened: I installed kubefed in my Kubernetes cluster via Helm chart.

$ helm list
NAME    REVISION        UPDATED                         STATUS          CHART                   APP VERSION     NAMESPACE             
kubefed 1               Sun Apr 12 17:40:36 2020        DEPLOYED        kubefed-0.2.0-alpha.1                   kube-federation-system

$ kubectl get deploy -n kube-federation-system
NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
kubefed-admission-webhook    1/1     1            1           3h57m
kubefed-controller-manager   2/2     2            2           3h57m

And then I tried to join the host cluster itself first, but it failed.

$ kubefedctl join local-up-cluster --cluster-context local-up-cluster --host-cluster-context local-up-cluster --v=2
I0412 21:29:00.211409   13215 join.go:159] Args and flags: name local-up-cluster, host: local-up-cluster, host-system-namespace: kube-federation-system, kubeconfig: , cluster-context: local-up-cluster, secret-name: , dry-run: false
I0412 21:29:00.317977   13215 join.go:240] Performing preflight checks.
I0412 21:29:00.321766   13215 join.go:246] Creating kube-federation-system namespace in joining cluster
I0412 21:29:00.326105   13215 join.go:382] Already existing kube-federation-system namespace
I0412 21:29:00.326131   13215 join.go:254] Created kube-federation-system namespace in joining cluster
I0412 21:29:00.326137   13215 join.go:403] Creating service account in joining cluster: local-up-cluster
I0412 21:29:00.337572   13215 join.go:413] Created service account: local-up-cluster-local-up-cluster in joining cluster: local-up-cluster
I0412 21:29:00.337596   13215 join.go:441] Creating cluster role and binding for service account: local-up-cluster-local-up-cluster in joining cluster: local-up-cluster
I0412 21:29:00.358396   13215 join.go:450] Created cluster role and binding for service account: local-up-cluster-local-up-cluster in joining cluster: local-up-cluster
I0412 21:29:00.358421   13215 join.go:809] Creating cluster credentials secret in host cluster
I0412 21:29:00.365974   13215 join.go:835] Using secret named: local-up-cluster-local-up-cluster-token-7cgbg
I0412 21:29:00.371421   13215 join.go:878] Created secret in host cluster named: local-up-cluster-rqkfg
I0412 21:29:10.379445   13215 join.go:354] Could not create federated cluster local-up-cluster due to Internal error occurred: failed calling webhook "kubefedclusters.core.kubefed.io": Post "https://kubefed-admission-webhook.kube-federation-system.svc:443/apis/validation.core.kubefed.io/v1beta1/kubefedclusters?timeout=30s": net/http: TLS handshake timeout
I0412 21:29:10.380578   13215 join.go:278] Failed to create federated cluster resource: Internal error occurred: failed calling webhook "kubefedclusters.core.kubefed.io": Post "https://kubefed-admission-webhook.kube-federation-system.svc:443/apis/validation.core.kubefed.io/v1beta1/kubefedclusters?timeout=30s": net/http: TLS handshake timeout
F0412 21:29:10.381185   13215 join.go:126] Error: Internal error occurred: failed calling webhook "kubefedclusters.core.kubefed.io": Post "https://kubefed-admission-webhook.kube-federation-system.svc:443/apis/validation.core.kubefed.io/v1beta1/kubefedclusters?timeout=30s": net/http: TLS handshake timeout

What you expected to happen: Successfully join the host cluster.

How to reproduce it (as minimally and precisely as possible): See the above.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version)
Client Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.0-alpha.1.317+3ff54eb10ce8a7", GitCommit:"3ff54eb10ce8a780a21262c18bccf8fd01380596", GitTreeState:"clean", BuildDate:"2020-04-04T08:26:00Z", GoVersion:"go1.14.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.0-alpha.1.317+3ff54eb10ce8a7", GitCommit:"3ff54eb10ce8a780a21262c18bccf8fd01380596", GitTreeState:"clean", BuildDate:"2020-04-04T08:26:00Z", GoVersion:"go1.14.1", Compiler:"gc", Platform:"linux/amd64"}
  • KubeFed version kubefed-0.2.0-alpha.1

  • Scope of installation (namespaced or cluster)

  • Others

<!-- DO NOT EDIT BELOW THIS LINE --> /kind bug

closed time in 3 months

qianzhangxa

issue comment kubernetes-sigs/kubefed

Failed to join host cluster `TLS handshake timeout`

This issue disappeared when I ran local-up-cluster.sh with sudo.

qianzhangxa

comment created time in 3 months
