swarm node lost leader status

I have a Docker swarm with 3 manager nodes (Debian 9.1 VMs on ESXi 5.5) and a local registry.

We use several layered docker-compose.yml files, so a script builds the COMPOSE_FILE environment variable from the paths of all of them.

I deploy with this script:

# Build the sources volume image, push it to the local swarm registry, and pull all images in case the remote ones changed
docker-compose build sources && docker-compose push sources && docker-compose pull

# Export merged config for docker stack deploy
docker-compose config > "$PROD_PATH/docker-compose.yml"

# Go to the deployment folder and deploy the stack
cd "$PROD_PATH" && docker stack deploy -c docker-compose.yml "$PROJECT_NAME" --prune

While the stack is deploying, I get this error:

failed to create service foo_apache: Error response from daemon: rpc error: code = Unknown desc = raft: failed to process the request: node lost leader status

Did I do something wrong?

Steps to reproduce the issue:

  1. Create my source image
  2. Deploy the stack

Describe the results you received:

Some services start, but I sometimes get this error (not always for the same service):

failed to create service foo_apache: Error response from daemon: rpc error: code = Unknown desc = raft: failed to process the request: node lost leader status

Describe the results you expected:

All services start successfully without this error :/

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:42:09 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:40:48 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: wcrvmo1q7zalv9y21e8umr3zs
 Is Manager: true
 ClusterID: w74skx7rusf129ubuh4lolmeh
 Managers: 3
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address:
 Manager Addresses:
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
  Profile: default
Kernel Version: 4.9.0-3-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 3.863GiB
Name: ren1web1l
Docker Root Dir: /home/docker
Debug Mode (client): false
Debug Mode (server): false
Experimental: false
Insecure Registries:
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

All 3 manager nodes run on ESXi 5.5 with Debian 9.1.

Before upgrading to 17.09-ce, we had the same issue on 17.06-ce.


Answer by m4r10k

@mwaeckerlin Are you running workloads on your Docker managers? If so, prevent it by labelling your manager nodes as "manager" and adding appropriate placement constraints to all of your Docker Swarm services, like this:

          - node.labels.maintenance != yes
          - node.labels.worker == yes
          - .....
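For illustration, here is a rough sketch of how the `worker == yes` label used above could be set and enforced from the CLI. The helper names (`run`, `label_worker`, `constrain_service`), the `DRY_RUN` switch, and any node or service names you pass in are assumptions made for this example, not something from the original setup. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: label nodes that may run workloads, then pin a service to them.
# Helper names and the DRY_RUN switch are hypothetical.

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Mark a node as allowed to run workloads (matches node.labels.worker == yes).
label_worker() {
    run docker node update --label-add worker=yes "$1"
}

# Add the placement constraint to an already deployed service.
constrain_service() {
    run docker service update --constraint-add 'node.labels.worker == yes' "$1"
}
```

Calling `label_worker <node>` for each node that should carry workloads, then `constrain_service <service>`, would apply the same rule as the compose fragment. For stack deployments the constraint normally lives in the compose file, so the CLI form is mostly useful for a quick test.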

The reason is that if the managers are under high CPU load (or if you have fast-spinning containers with no health checks), they will probably miss the Raft messages of the other managers. The Raft timeout by default is something like 500 ms, if I remember correctly. Missed Raft messages can lead to split brain, and this can hurt the Swarm badly. We have faced this situation multiple times with similar symptoms. Therefore our managers run purely as managers (zero workload), and we have never faced this situation again.
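Besides labels and constraints, another way to get zero-workload managers is to set their availability to drain. This is a different technique from the one described above, sketched here under the assumption that the swarm also has worker nodes to take over the tasks (with only three manager nodes and no workers, draining all of them would leave nowhere to schedule services). The helper names and `DRY_RUN` are hypothetical; the commands are only printed by default.

```shell
#!/bin/sh
# Sketch: keep swarm managers free of workloads by draining them.
# Helper names and the DRY_RUN switch are hypothetical.

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop scheduling tasks on one node; its running tasks are rescheduled elsewhere.
drain_manager() {
    run docker node update --availability drain "$1"
}

# Drain every node name passed as an argument.
drain_managers() {
    for node in "$@"; do
        drain_manager "$node"
    done
}
```

A drained manager still takes part in Raft and can still be the leader; it just no longer receives service tasks.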

If nothing else helps, you can try to stop all managers except one. Let the other managers leave the cluster, stop Docker, delete the data in the Docker folder (the Raft logs), clean up everything, and then rejoin them as managers. By doing this, the Raft database should come back clean and in sync, but do not run workloads on the managers!
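The reset steps above could look roughly like the sketch below. It is not a tested procedure. It assumes the swarm state lives in a `swarm` directory under the Docker root dir (`/home/docker` in the `docker info` output above; a stock install would use `/var/lib/docker/swarm`), that Docker is managed by systemd, and that the surviving manager may need `--force-new-cluster` if it has lost quorum. The function names, `SWARM_DIR`, and `DRY_RUN` are hypothetical, and the commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of the manager reset procedure; prints commands unless DRY_RUN=0.
# SWARM_DIR is an assumption derived from "Docker Root Dir: /home/docker".
SWARM_DIR="${SWARM_DIR:-/home/docker/swarm}"

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run on each manager that is being reset (not on the survivor).
reset_manager() {
    run docker swarm leave --force
    run systemctl stop docker
    run rm -rf "$SWARM_DIR"
    run systemctl start docker
}

# Run on the surviving manager if it can no longer reach quorum on its own.
recover_survivor() {
    run docker swarm init --force-new-cluster
}

# Run on a reset node. $1 is the manager join token, $2 the survivor's address;
# both are placeholders supplied by the caller.
rejoin_manager() {
    run docker swarm join --token "$1" "$2:2377"
}
```

The manager token can be read on the surviving node with `docker swarm join-token -q manager`. Stale entries for the old managers may still show up in `docker node ls` on the survivor and can be removed there with `docker node demote` followed by `docker node rm`.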


