Docker stack fails to allocate IP on an overlay network, and tasks get stuck in the `NEW` state
Description: I have a Docker swarm cluster with 5 managers and 4 worker nodes. A while ago, it was 3 managers and 4 worker nodes. We use immutable infrastructure for the hosts.
EDIT: I managed to get reproducible tests in https://github.com/moby/moby/issues/37338#issuecomment-437558916
The majority of our containers connect to a specific overlay network, with a /24 CIDR:

```yaml
my_network:
  driver: overlay
  driver_opts:
    encrypted: ""
  ipam:
    driver: default
    config:
      - subnet: 10.100.2.0/24
```
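For anyone reproducing this outside a stack file, a roughly equivalent network can be created directly with the CLI (the network name and subnet are taken from the snippet above):

```
docker network create \
  --driver overlay \
  --opt encrypted \
  --subnet 10.100.2.0/24 \
  my_network
```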
Docker stack deployments happen all the time in this specific cluster.
Occasionally (and I cannot understand what causes it), the swarm is unable to allocate an IP to the task:

```
Failed allocation for service <service>  error="could not find an available IP while allocating VIP"
```
So I assumed we had run out of IPs in the CIDR. But when I counted the tasks currently attached to the network, there were fewer than 40 running tasks. I also counted all the stopped/historical tasks and all the IPs on that network; still, there were fewer than 120 IPs, a lot less than the 200-and-something I'd expect.
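For reference, this is roughly how the addresses in use can be counted (a sketch, assuming the `my_network` name from the snippet above; run it on a manager, since `--verbose` adds swarm-wide service/task information, and the grep may over- or under-count depending on the output format of your Docker version):

```
# On a manager: dump everything attached to the overlay network,
# then count the distinct IPv4 addresses from the 10.100.2.0/24 subnet.
docker network inspect --verbose my_network \
  | grep -oE '10\.100\.2\.[0-9]+' \
  | sort -u \
  | wc -l
```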
I tried to restrict the task history size, but that by itself didn't make any difference. I deleted almost all stacks, and some containers were able to get a new IP, but the problem came back as soon as everything was redeployed.
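For completeness, the task history restriction was done with the standard swarm setting (it matches the `Task History Retention Limit: 1` shown in the `docker info` output below):

```
# Keep only one historical task per slot, so stopped tasks are pruned sooner.
docker swarm update --task-history-limit 1
```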
I also looked at the NetworkDB stats while the problem was happening, and it was all lines like:

```
NetworkDB stats <leader host>(<node>) - netID:<my network> leaving:false netPeers:8 entries:14 Queue qLen:0 netMsg/s:0
```
After we 'recycled' all the managers (including the leader), the problem appears to be resolved. All the tasks which were stuck then received a new IP.
```
NetworkDB stats <host>(<node>) - netID:<my network> leaving:false netPeers:7 entries:49 Queue qLen:0 netMsg/s:0
```
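In case it helps others looking for the same data: the `NetworkDB stats` lines above are emitted periodically by dockerd in the daemon logs. On a systemd-based host something like this pulls them out (the unit name and log location may differ on this linuxkit/AWS image):

```
# Compare NetworkDB entry counts across managers by grepping the daemon logs.
journalctl -u docker.service --since "1 hour ago" | grep "NetworkDB stats"
```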
It appears that somehow some IPs are not returned to the pool, but I'm not even sure where to look for more information. Is anyone able to help me figure out how to investigate this problem?
My problem appears similar to what was described in https://github.com/docker/for-aws/issues/104.
Steps to reproduce the issue:
Describe the results you received: Tasks get stuck in the 'NEW' state.
Describe the results you expected: With fewer than 200 containers attached to the /24 network, I'd expect the task to be running.
Additional information you deem important (e.g. issue happens only occasionally): We've seen this problem before.
The problem apparently persists for days. Eventually, after a few hours of waiting, some of the containers receive an IP and start, but I've seen containers stuck in that state for more than a day.
Output of `docker version`:

```
$ docker version
Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:23:03 2018
  OS/Arch:      linux/amd64
  Experimental: false
```
Output of `docker info`:

```
Containers: 3
 Running: 2
 Paused: 0
 Stopped: 1
Images: 3
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: okljpo50f39c74me2qzem67qw
 Is Manager: true
 ClusterID: 2ufszb0kyswdcmi7nzxfqjb47
 Managers: 5
 Nodes: 9
 Orchestration:
  Task History Retention Limit: 1
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Manager Addresses:
  ...
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.107-linuxkit
Operating System: Alpine Linux v3.7
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.951GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
```
Additional environment details (AWS, VirtualBox, physical, etc.): AWS.
Answer (cintiadr):
So I can come back and say that now, before stopping a node, we drain it and wait for all tasks/containers to stop.
That appears to have solved the problem for good. Still, we cannot guarantee that will always happen.
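For reference, a minimal sketch of the drain-and-wait step we now run before a host is recycled (the node name and the 5-second poll interval are just placeholders):

```
#!/bin/sh
# Hypothetical helper: drain a node and wait for its tasks to stop
# before the underlying host is terminated.
NODE="$1"

docker node update --availability drain "$NODE"

# Poll until 'docker node ps' reports no tasks that should still be running.
while docker node ps "$NODE" --filter desired-state=running --format '{{.ID}}' | grep -q .; do
  sleep 5
done
```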