
swarm node lost leader status


Description

I have a Docker swarm with 3 manager nodes (Debian 9.1 VMs on ESXi 5.5) and a local registry.

We use docker-compose.yml inheritance, so we have a script that builds the COMPOSE_FILE environment variable with the paths of all the docker-compose.yml files.
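For context, that variable is built roughly like this (a sketch only; the file names and paths are illustrative, not the actual project layout):

# COMPOSE_FILE takes a colon-separated list of compose files;
# docker-compose merges them in the order given
export COMPOSE_FILE="base/docker-compose.yml:prod/docker-compose.prod.yml"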

When I try to deploy with this script:

# Build the sources volume image, push it to the local swarm registry, and pull all images in case remote images have changed
docker-compose build sources && docker-compose push sources && docker-compose pull

# Export merged config for docker stack deploy
docker-compose config > $PROD_PATH/docker-compose.yml

# Go to folder to deploy and deploy the stack
cd $PROD_PATH && docker stack deploy -c docker-compose.yml $PROJECT_NAME --prune

After that, when deploying the stack, I get this error:

failed to create service foo_apache: Error response from daemon: rpc error: code = Unknown desc = raft: failed to process the request: node lost leader status

Did I do something wrong?

Steps to reproduce the issue:

  1. create my source image
  2. deploy the stack

Describe the results you received:

Some services start, but I sometimes get this error (the affected service is not always the same):

failed to create service foo_apache: Error response from daemon: rpc error: code = Unknown desc = raft: failed to process the request: node lost leader status

Describe the results you expected:

All services start successfully without this error :/

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:      17.09.0-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:42:09 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.0-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   afdb6d4
 Built:        Tue Sep 26 22:40:48 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: wcrvmo1q7zalv9y21e8umr3zs
 Is Manager: true
 ClusterID: w74skx7rusf129ubuh4lolmeh
 Managers: 3
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.1.124.2
 Manager Addresses:
  10.1.124.1:2377
  10.1.124.2:2377
  10.1.124.3:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.0-3-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 3.863GiB
Name: ren1web1l
ID: P2RT:IM7W:3U3U:MF7U:3FAD:IPD6:LCWB:EZ46:6XDA:ZS5G:7ZFB:3X6W
Docker Root Dir: /home/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

All 3 manager nodes are running Debian 9.1 on ESXi 5.5.

Before upgrading to 17.09-ce, we had the same issue with 17.06-ce.


Answer from m4r10k

@mwaeckerlin Are you running workloads on your Docker managers? If so, prevent this by labelling your manager nodes as "manager" and adding appropriate constraints to all of your Docker Swarm services, like this:

      placement:
        constraints:
          - node.labels.maintenance != yes
          - node.labels.worker == yes
          - .....
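For constraints like these to match, the labels have to be set on the nodes first. A minimal sketch, assuming hypothetical node names (the label key is the one used in the snippet above):

# Label each worker node so that "node.labels.worker == yes" matches it
docker node update --label-add worker=yes <worker-node-name>

# Nodes without the label (e.g. the managers) are then skipped by any
# service that carries the constraint.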

The reason is that if the managers face high CPU load (or if you have fast-spinning containers with no health checks), they will probably miss the RAFT messages from the other managers. The RAFT timeout is something like 500 ms by default, if I remember correctly. Missing RAFT log entries can lead to split-brain situations, and this can hurt the Swarm badly. We have faced this situation multiple times with similar symptoms. Since our managers run only as managers (zero workload), we have never faced it again.
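An alternative way to reach the same "zero workload on managers" goal, without labels (a general Swarm option, not something discussed in this thread), is to drain the manager nodes so the scheduler never places tasks on them:

# Mark a manager as unavailable for task scheduling;
# any existing tasks on it are rescheduled elsewhere
docker node update --availability drain <manager-node-name>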

If nothing else helps, you can try stopping all managers but one. Have the other managers leave the cluster, stop Docker on them, delete the data in the Docker folder (the RAFT logs), clean everything up, and then rejoin them as managers. This should bring the RAFT database back into a clean, in-sync state. But do not run workloads on the managers!
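Roughly, that recovery procedure looks like this. It is only a sketch: node names and the join token are placeholders, and the swarm state path assumes the default /var/lib/docker data directory (the poster's Docker Root Dir is /home/docker, so adjust accordingly):

# On the remaining healthy manager: demote the manager you want to rebuild
docker node demote <node-name>

# On the node being rebuilt: leave the swarm and wipe the local RAFT state
docker swarm leave
systemctl stop docker
rm -rf /var/lib/docker/swarm
systemctl start docker

# On the healthy manager: print a manager join token
docker swarm join-token manager

# On the rebuilt node: rejoin as a manager using the printed command, e.g.
docker swarm join --token <manager-token> 10.1.124.1:2377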

