
dockerd stopped responding to API requests; no installed keys could decrypt message

Description

I have a 10-node Docker swarm running Docker CE 17.06.0. One of the manager nodes stopped responding to API commands, and all ten nodes started logging messages of the form memberlist: failed to receive: No installed keys could decrypt the message from=…. The swarm remained in a broken state until the unresponsive node was restarted.

Steps to reproduce the issue:

Not sure how to reproduce the issue. This is on a 10-node Docker swarm that is heavily used for development.

Describe the results you received:

On the node that stopped responding, the daemon logged warnings that it could not decrypt messages from the other nine nodes:

Jul 16 10:02:56 itrmsdev02.ucalgary.ca dockerd[1142]: time="2017-07-16T10:02:56.003807997-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.137:56134"
Jul 16 10:02:56 itrmsdev02.ucalgary.ca dockerd[1142]: time="2017-07-16T10:02:56.013832761-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.147:39506"
Jul 16 10:02:56 itrmsdev02.ucalgary.ca dockerd[1142]: time="2017-07-16T10:02:56.036012506-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.148:40842"
Jul 16 10:02:56 itrmsdev02.ucalgary.ca dockerd[1142]: time="2017-07-16T10:02:56.275404103-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.84:49264"
Jul 16 10:02:56 itrmsdev02.ucalgary.ca dockerd[1142]: time="2017-07-16T10:02:56.644752645-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.87:35600"
Jul 16 10:02:56 itrmsdev02.ucalgary.ca dockerd[1142]: time="2017-07-16T10:02:56.691320840-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.149:44430"
Jul 16 10:02:56 itrmsdev02.ucalgary.ca dockerd[1142]: time="2017-07-16T10:02:56.811744743-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.152:37456"

The other nine nodes all logged warnings that they could not decrypt messages from the misbehaving node:

Jul 16 10:04:41 itrmsdev04.ucalgary.ca dockerd[1105]: time="2017-07-16T10:04:41.417179382-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.139:59514"
Jul 16 10:04:45 itrmsdev04.ucalgary.ca dockerd[1105]: time="2017-07-16T10:04:45.417459159-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.139:59522"
Jul 16 10:04:47 itrmsdev04.ucalgary.ca dockerd[1105]: time="2017-07-16T10:04:47.417319707-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.139:59526"
Jul 16 10:04:48 itrmsdev04.ucalgary.ca dockerd[1105]: time="2017-07-16T10:04:48.417226763-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.139:59528"
Jul 16 10:05:16 itrmsdev04.ucalgary.ca dockerd[1105]: time="2017-07-16T10:05:16.417069436-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.139:59586"
Jul 16 10:05:18 itrmsdev04.ucalgary.ca dockerd[1105]: time="2017-07-16T10:05:18.417402441-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.139:59590"
Jul 16 10:05:35 itrmsdev04.ucalgary.ca dockerd[1105]: time="2017-07-16T10:05:35.417190950-06:00" level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.139:59624"
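For triage, warnings like the ones above can be tallied per peer with standard text tools. A minimal sketch (the log excerpt below mirrors the dockerd lines in this issue; in practice you would feed it the output of `journalctl -u docker` or your gelf log store):

```shell
# Count "No installed keys could decrypt" warnings by peer IP.
# A sample excerpt standing in for the real daemon log:
cat > /tmp/memberlist.log <<'EOF'
level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.137:56134"
level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.137:39506"
level=warning msg="memberlist: failed to receive: No installed keys could decrypt the message from=10.41.149.148:40842"
EOF

# Extract the from= address, drop the port, and count per IP.
grep -o 'from=[0-9.]*' /tmp/memberlist.log | cut -d= -f2 | sort | uniq -c
```

Running this on every node quickly shows whether one peer's memberlist keyring has drifted (every other node counts only that IP) or the whole cluster disagrees.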

Docker would not respond to any API calls on the failed node. I sent a SIGUSR1 signal to get a stack dump, then restarted dockerd. I'm not sure how to interpret the dump; it's attached as a zip file because the attachment size limit is 10 MB.

goroutine-stacks-2017-07-17T094958-0600.log.zip
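The stack-dump step can be scripted. A hedged sketch, assuming a systemd-managed dockerd; the dump directory /var/run/docker/ matches the attached filename but can vary by version:

```shell
# Request a goroutine stack dump from a hung dockerd.
# SIGUSR1 makes dockerd write goroutine-stacks-<timestamp>.log
# without stopping the daemon. Guarded so this is a no-op when
# dockerd is not running.
if pgrep -x dockerd >/dev/null; then
  sudo kill -SIGUSR1 "$(pgrep -x dockerd)"
  ls /var/run/docker/goroutine-stacks-*.log 2>/dev/null
  # Once the dump is saved, restarting the daemon is the last resort:
  # sudo systemctl restart docker
fi
```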

Describe the results you expected:

The swarm should operate normally.

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

This output is from the failed node after it was restarted.

-bash-4.2$ docker version
Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:20:36 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:21:56 2017
 OS/Arch:      linux/amd64
 Experimental: true

Output of docker info:

-bash-4.2$ docker info
Containers: 63
 Running: 12
 Paused: 0
 Stopped: 51
Images: 488
Server Version: 17.06.0-ce
Storage Driver: overlay
 Backing Filesystem: xfs
 Supports d_type: true
Logging Driver: gelf
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: nj1kr88iu2ai9947u8neoskdw
 Is Manager: true
 ClusterID: m0gz05zqgmhx67noqdvko2npr
 Managers: 3
 Nodes: 10
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Root Rotation In Progress: false
 Node Address: 10.41.149.139
 Manager Addresses:
  10.41.149.137:2377
  10.41.149.138:2377
  10.41.149.139:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfb82a876ecc11b5ca0977d1733adbe58599088a
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-514.21.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.51GiB
Name: itrmsdev02.ucalgary.ca
ID: VPME:AFML:SQU3:5WR5:UPXE:B5YZ:BBFP:JAEU:A5R4:UYGT:IFO5:AEKF
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

All 10 nodes are on VMs running RHEL 7.3.

docker/swarmkit#1954 introduced a fix, first included in 1.13.x, to avoid encryption key rotation on leader change. I don't know whether that is the situation I ran into; this is the first time I've seen a message-decryption problem in any of my swarms.


Answer (alfonsodg)

Hi,

Today (6/3/19) my cluster went down because of this bug. Docker engine version: 18.09.6. Nodes: 8 (3 managers and 5 workers).

The failure apparently started when one manager node had a hardware problem.
