Docker stack fails to allocate IP on an overlay network, and tasks get stuck in the `NEW` state

Description: I have a Docker swarm cluster with 5 managers and 4 worker nodes. A while ago, it was 3 managers and 4 worker nodes. We use immutable infrastructure for the hosts.

EDIT: I managed to get reproducible tests in

The majority of our containers connect to a specific overlay network with a /24 CIDR:

    my_network:
      driver: overlay
      driver_opts:
        encrypted: ""
      ipam:
        driver: default
        config:
          - subnet:

Docker stack deployments happen all the time in this specific cluster.

Occasionally (and I cannot tell what triggers it), the swarm is unable to allocate an IP for a task:

    "Failed allocation for service <service>" error="could not find an available IP while allocating VIP"

So I assumed we had run out of IPs in the CIDR. But when I counted the tasks currently attached to the network, there were fewer than 40 running tasks. I also counted all the stopped/historical tasks and all the IPs on that network; still, there were fewer than 120 IPs in use, a lot fewer than the 200-and-something I'd expect.
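For reference, this is roughly how the counting can be done (a sketch; `my_network` is the network name, and the verbose output layout may vary by Docker version):

```shell
# Containers attached to the network on the local node
docker network inspect my_network --format '{{ len .Containers }}'

# Cluster-wide view (run on a manager): --verbose lists every service's
# VIP and every task's IP on the swarm-scoped network
docker network inspect --verbose my_network
```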

I tried to restrict the task history size, but that by itself didn't make any difference. I deleted almost all stacks, and some containers were then able to get a new IP, but the problem reappeared as soon as everything was redeployed.
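Restricting the task history is done with the standard swarm setting (shown here as a sketch; the `docker info` output below confirms it ended up at 1):

```shell
# Keep only one historical task per service slot; old task records
# should be garbage-collected along with their network attachments
docker swarm update --task-history-limit 1
```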

I looked at the NetworkDB stats while the problem was happening, and they were all lines like:

    NetworkDB stats <leader host>(<node>) - netID:<my network> leaving:false netPeers:8 entries:14 Queue qLen:0 netMsg/s:0
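These NetworkDB stats lines are emitted periodically by the daemon itself; a sketch of how to pull them on a systemd-based host (the log location may differ on other distributions):

```shell
# dockerd logs "NetworkDB stats" lines every few minutes
journalctl -u docker.service --since "1 hour ago" | grep 'NetworkDB stats'
```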

After we 'recycled' all the managers (including the leader), the problem appeared to be resolved. All the tasks which were stuck then received an IP:

    NetworkDB stats <host>(<node>) - netID:<my network> leaving:false netPeers:7 entries:49 Queue qLen:0 netMsg/s:0

It appears that somehow some IPs are not returned to the pool, but I'm not sure where to look for more information. Can anyone help me work out how to investigate this problem?
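One place worth checking is the per-service VIPs, since each service holds one VIP per attached network on top of one IP per task; a sketch for listing them (the format string follows the `docker service inspect` JSON field names):

```shell
# Print each service's name and its virtual IPs (CIDR-formatted)
for s in $(docker service ls -q); do
  docker service inspect "$s" \
    --format '{{ .Spec.Name }}: {{ range .Endpoint.VirtualIPs }}{{ .Addr }} {{ end }}'
done
```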

My problem appears similar to what was described here.

Steps to reproduce the issue:

  1. Create a docker stack that connects to the /24 overlay network
  2. docker stack deploy -c file.yaml my-stack
  3. docker stack ps my-stack
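As a runnable sketch of the steps above (the subnet value here is a placeholder; ours is redacted):

```shell
# 1. /24 overlay network, encrypted, default IPAM driver
docker network create --driver overlay --opt encrypted --subnet 10.0.9.0/24 my_network
# 2. deploy the stack whose services attach to that network
docker stack deploy -c file.yaml my-stack
# 3. stuck tasks show up with CURRENT STATE = New
docker stack ps my-stack
```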

Describe the results you received: Tasks get stuck in the 'NEW' state.

Describe the results you expected: With fewer than 200 containers attached to the /24 network, I'd expect the tasks to run.
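As a back-of-the-envelope check (assumptions: default IPAM reserves three addresses in the subnet, and each service consumes one VIP on top of one IP per task; the service count here is hypothetical):

```shell
SUBNET_BITS=24
TOTAL=$((1 << (32 - SUBNET_BITS)))   # 256 addresses in a /24
RESERVED=3                           # network, broadcast, gateway (assumption)
USABLE=$((TOTAL - RESERVED))         # 253 allocatable addresses
TASKS=200                            # containers attached to the network
SERVICES=40                          # hypothetical; one VIP each
NEEDED=$((TASKS + SERVICES))
echo "usable=$USABLE needed=$NEEDED"
```

Even with 40 services, that is 240 addresses needed out of 253 available, so allocation should not fail.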

Additional information you deem important (e.g. issue happens only occasionally): We've seen this problem before.

The problem can persist for days. Sometimes, after a few hours of waiting, some of the containers receive an IP and start; I've seen containers stuck in that state for more than a day.

Output of docker version:

$ docker version
  Version:	18.03.1-ce
  API version:	1.37 (minimum version 1.12)
  Go version:	go1.9.5
  Git commit:	9ee9f40
  Built:	Thu Apr 26 07:23:03 2018
  OS/Arch:	linux/amd64
  Experimental:	false

Output of docker info:

Containers: 3
 Running: 2
 Paused: 0
 Stopped: 1
Images: 3
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: okljpo50f39c74me2qzem67qw
 Is Manager: true
 ClusterID: 2ufszb0kyswdcmi7nzxfqjb47
 Managers: 5
 Nodes: 9
  Task History Retention Limit: 1
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Manager Addresses:
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
  Profile: default
Kernel Version: 4.9.107-linuxkit
Operating System: Alpine Linux v3.7
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.951GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Experimental: false
Insecure Registries:
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.): AWS.


Answer from cintiadr:

Coming back to report: before stopping a node, we now drain it and wait for all tasks/containers to stop.

That appears to have solved the problem for good. Still, we cannot guarantee this will always happen.
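Concretely, the drain-before-stop procedure looks roughly like this (the node name is a placeholder):

```shell
NODE=worker-1   # hypothetical node name
docker node update --availability drain "$NODE"
# wait until no tasks are left running on the node before stopping it
while [ -n "$(docker node ps "$NODE" --filter desired-state=running -q)" ]; do
  sleep 5
done
```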

