pods in "CrashLoopBackOff" status after restoring from backup

RKE version:

$ rke --version
rke version v0.2.4

Docker version:

$ docker version
Client:
 Version:           18.09.7
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        2d0083d
 Built:             Thu Jun 27 17:56:17 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       481bc77
  Built:            Sat May  4 01:59:36 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Operating system and kernel:

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
$ uname -r
4.15.0-46-generic

Type/provider of hosts: AWS

cluster.yml file (also reproduced with the flannel CNI; a flannel variant of the network block is sketched after the file):

nodes:
- address: x.x.x.x
  port: ""
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: xxx
  docker_socket: ""
  ssh_key: |
    -----BEGIN RSA PRIVATE KEY-----
    xxx
    -----END RSA PRIVATE KEY-----

network:
    plugin: canal
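
For reference, the flannel reproduction presumably differs only in the network plugin. A minimal variant of the network block (an assumption based on RKE's built-in plugin names, not copied from the original report) would be:

network:
    plugin: flannel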

Steps to Reproduce:

  1. create cluster:
$ rke up
  2. create test resource:
$ export KUBECONFIG=$(pwd)/kube_config_cluster.yml
$ kubectl create ns test
$ kubectl -n test run nginx --image=nginx --replicas=1
$ kubectl get pod --all-namespaces
NAMESPACE       NAME                                      READY   STATUS      RESTARTS   AGE
ingress-nginx   default-http-backend-78fccfc5d9-27fxf     1/1     Running     0          87s
ingress-nginx   nginx-ingress-controller-jk8dk            1/1     Running     0          87s
kube-system     canal-6b4v5                               2/2     Running     0          2m39s
kube-system     kube-dns-58bd5b8dd7-t9sll                 3/3     Running     0          2m14s
kube-system     kube-dns-autoscaler-77bc5fd84-6rgq8       1/1     Running     0          2m13s
kube-system     metrics-server-58bd5dd8d7-sqk8k           1/1     Running     0          102s
kube-system     rke-ingress-controller-deploy-job-sxqw2   0/1     Completed   0          99s
kube-system     rke-kube-dns-addon-deploy-job-45rjd       0/1     Completed   0          2m26s
kube-system     rke-metrics-addon-deploy-job-jchqj        0/1     Completed   0          2m2s
kube-system     rke-network-plugin-deploy-job-v46f7       0/1     Completed   0          2m45s
test            nginx-7cdbd8cdc9-tr2fs                    1/1     Running     0          22s
  3. create backup:
$ rke etcd snapshot-save --name test-backup
...
INFO[0010] Finished saving snapshot [test-backup] on all etcd hosts
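As a sanity check before tearing the cluster down, the snapshot file can be confirmed on the etcd host; the path below is RKE's default snapshot directory and the SSH target is a placeholder:

$ ssh xxx@x.x.x.x 'ls -lh /opt/rke/etcd-snapshots/'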
  4. re-create cluster (reproduced on clean instance too):
$ rke remove
$ rke up
$ kubectl get pod --all-namespaces
NAMESPACE       NAME                                      READY   STATUS      RESTARTS   AGE
ingress-nginx   default-http-backend-78fccfc5d9-4tg2r     1/1     Running     0          13s
ingress-nginx   nginx-ingress-controller-b4cbc            1/1     Running     0          13s
kube-system     canal-qtltm                               2/2     Running     0          42s
kube-system     kube-dns-58bd5b8dd7-bpvns                 3/3     Running     0          34s
kube-system     kube-dns-autoscaler-77bc5fd84-n7q96       1/1     Running     0          33s
kube-system     metrics-server-58bd5dd8d7-pffj9           1/1     Running     0          24s
kube-system     rke-ingress-controller-deploy-job-lkhrv   0/1     Completed   0          17s
kube-system     rke-kube-dns-addon-deploy-job-rmxk7       0/1     Completed   0          37s
kube-system     rke-metrics-addon-deploy-job-w65wf        0/1     Completed   0          30s
kube-system     rke-network-plugin-deploy-job-rrhwh       0/1     Completed   0          45s
  5. restore cluster from "test-backup" snapshot:
$ rke etcd snapshot-restore --name test-backup
  6. RBAC looks correct (an impersonation check is sketched after the dumps below):
$ kubectl -n kube-system get sa canal -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"name":"canal","namespace":"kube-system"}}
  creationTimestamp: "2019-07-02T11:45:20Z"
  name: canal
  namespace: kube-system
  resourceVersion: "363"
  selfLink: /api/v1/namespaces/kube-system/serviceaccounts/canal
  uid: deb34e0a-9cbe-11e9-94c5-0af5369073f0
secrets:
- name: canal-token-s256l
$ kubectl -n kube-system get clusterrolebinding canal-calico -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1beta1","kind":"ClusterRoleBinding","metadata":{"annotations":{},"name":"canal-calico"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"calico"},"subjects":[{"kind":"ServiceAccount","name":"canal","namespace":"kube-system"},{"apiGroup":"rbac.authorization.k8s.io","kind":"Group","name":"system:nodes"}]}
  creationTimestamp: "2019-07-02T11:45:20Z"
  name: canal-calico
  resourceVersion: "353"
  selfLink: /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/canal-calico
  uid: deaf1b73-9cbe-11e9-94c5-0af5369073f0
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico
subjects:
- kind: ServiceAccount
  name: canal
  namespace: kube-system
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes
$ kubectl -n kube-system get clusterrole calico -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1beta1","kind":"ClusterRole","metadata":{"annotations":{},"name":"calico"},"rules":[{"apiGroups":[""],"resources":["pods","nodes","namespaces"],"verbs":["get"]},{"apiGroups":[""],"resources":["endpoints","services"],"verbs":["watch","list","get"]},{"apiGroups":[""],"resources":["nodes/status"],"verbs":["patch","update"]},{"apiGroups":["networking.k8s.io"],"resources":["networkpolicies"],"verbs":["watch","list"]},{"apiGroups":[""],"resources":["pods","namespaces","serviceaccounts"],"verbs":["list","watch"]},{"apiGroups":[""],"resources":["pods/status"],"verbs":["patch"]},{"apiGroups":["crd.projectcalico.org"],"resources":["globalfelixconfigs","felixconfigurations","bgppeers","globalbgpconfigs","bgpconfigurations","ippools","globalnetworkpolicies","globalnetworksets","networkpolicies","clusterinformations","hostendpoints"],"verbs":["get","list","watch"]},{"apiGroups":["crd.projectcalico.org"],"resources":["ippools","felixconfigurations","clusterinformations"],"verbs":["create","update"]},{"apiGroups":[""],"resources":["nodes"],"verbs":["get","list","watch"]},{"apiGroups":["crd.projectcalico.org"],"resources":["bgpconfigurations","bgppeers"],"verbs":["create","update"]}]}
  creationTimestamp: "2019-07-02T11:45:20Z"
  name: calico
  resourceVersion: "349"
  selfLink: /apis/rbac.authorization.k8s.io/v1/clusterroles/calico
  uid: deabd410-9cbe-11e9-94c5-0af5369073f0
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  - serviceaccounts
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - clusterinformations
  - hostendpoints
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ippools
  - felixconfigurations
  - clusterinformations
  verbs:
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - bgpconfigurations
  - bgppeers
  verbs:
  - create
  - update
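
A quicker way to confirm that authorization (as opposed to token authentication) is intact is to impersonate the canal service account with the admin kubeconfig; if the role and binding above are in effect, this should answer "yes":

$ kubectl auth can-i get pods --as=system:serviceaccount:kube-system:canal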

Results:

$ kubectl get pod --all-namespaces
NAMESPACE       NAME                                      READY   STATUS             RESTARTS   AGE
ingress-nginx   default-http-backend-78fccfc5d9-27fxf     1/1     Running            9          32m
ingress-nginx   nginx-ingress-controller-sk5rg            0/1     CrashLoopBackOff   9          18m
kube-system     canal-t8xls                               0/2     CrashLoopBackOff   16         18m
kube-system     kube-dns-58bd5b8dd7-hxxz9                 0/3     CrashLoopBackOff   19         18m
kube-system     kube-dns-autoscaler-77bc5fd84-r952d       1/1     Running            0          18m
kube-system     metrics-server-58bd5dd8d7-jq4tt           0/1     CrashLoopBackOff   7          18m
kube-system     rke-ingress-controller-deploy-job-sxqw2   0/1     Completed          0          32m
kube-system     rke-kube-dns-addon-deploy-job-45rjd       0/1     Completed          0          33m
kube-system     rke-metrics-addon-deploy-job-jchqj        0/1     Completed          0          32m
kube-system     rke-network-plugin-deploy-job-v46f7       0/1     Completed          0          33m
test            nginx-7cdbd8cdc9-tr2fs                    1/1     Running            0          30m
$ kubectl -n kube-system logs canal-t8xls --all-containers
ls: /calico-secrets: No such file or directory
Wrote Calico CNI binaries to /host/opt/cni/bin
CNI plugin version: v3.4.0
/host/secondary-bin-dir is non-writeable, skipping
Using CNI config template from CNI_NETWORK_CONFIG environment variable.
CNI config: {
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "log_level": "WARNING",
      "datastore_type": "kubernetes",
      "nodename": "x.x.x.x",
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
          "type": "k8s"
      },
      "kubernetes": {
          "kubeconfig": "/etc/kubernetes/ssl/kubecfg-kube-node.yaml"
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    }
  ]
}
Created CNI config 10-canal.conflist
Done configuring CNI.  Sleep=false
2019-07-02 12:01:50.479 [INFO][8] startup.go 244: Early log level set to info
2019-07-02 12:01:50.479 [INFO][8] startup.go 260: Using NODENAME environment for node name
2019-07-02 12:01:50.479 [INFO][8] startup.go 272: Determined node name: x.x.x.x
2019-07-02 12:01:50.481 [INFO][8] startup.go 304: Checking datastore connection
2019-07-02 12:01:50.488 [WARNING][8] startup.go 316: Connection to the datastore is unauthorized
2019-07-02 12:01:50.488 [WARNING][8] startup.go 1004: Terminating
Calico node failed to start
I0702 12:01:52.092644       1 main.go:475] Determining IP address of default interface
I0702 12:01:52.093330       1 main.go:488] Using interface with name ens3 and address x.x.x.x
I0702 12:01:52.093352       1 main.go:505] Defaulting external address to interface address (x.x.x.x)
E0702 12:01:52.105173       1 main.go:232] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/canal-t8xls': Unauthorized
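
To confirm that this is the restored token being rejected (authentication) rather than an RBAC problem, the token from the restored secret can be presented to the API server directly; the secret name comes from this cluster, while the address and port are placeholders for the controlplane node:

$ TOKEN=$(kubectl -n kube-system get secret canal-token-s256l -o jsonpath='{.data.token}' | base64 -d)
$ curl -sk -H "Authorization: Bearer $TOKEN" https://x.x.x.x:6443/api

An "Unauthorized" response here would match the canal and flannel errors above.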

It looks like some pods are no longer authorized in the cluster after the restoration. The workaround is to re-create the pod's service-account token secret (the one restored from the backup) and then the pod itself, like this (a generalized sketch for all affected service accounts follows the commands):

$ kubectl -n kube-system get secret canal-token-s256l
NAME                TYPE                                  DATA   AGE
canal-token-s256l   kubernetes.io/service-account-token   3      25m

$ kubectl -n kube-system delete secret canal-token-s256l
secret "canal-token-s256l" deleted

$ kubectl -n kube-system delete pod canal-t8xls
pod "canal-t8xls" deleted

$ kubectl -n kube-system get pod canal-svn42
NAME          READY   STATUS    RESTARTS   AGE
canal-svn42   2/2     Running   0          2m4s
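
When several add-ons are affected, the same workaround can be applied to every service-account token secret at once. This is only a sketch (the namespace list is an example, and deleting all pods in kube-system is disruptive): deleting the restored token secrets makes the token controller mint fresh ones, and recreating the pods lets them mount the new tokens.

for ns in kube-system ingress-nginx; do
  # delete the restored (stale) service-account token secrets so they are re-issued
  kubectl -n "$ns" delete $(kubectl -n "$ns" get secret --field-selector type=kubernetes.io/service-account-token -o name)
  # recreate the pods so they pick up the fresh tokens
  kubectl -n "$ns" delete pod --all
done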

Answer (eroji):

This worked for me.

