profile
viewpoint

Ask questionsDeploy tidb with docker swarm, fail to send snap between tikv of different hosts

Bug Report

What version of TiKV are you using? 3.0.1

What operating system and CPU are you using? docker swarm

What did you do? Deploying tidb with docker swarm, deploying a pd, three tikvs. If tikv is deployed on the same host, there is no problem with network communication between tikvs. But if tikv is deployed on different hosts, there is a problem with communication between tikvs.

What did you expect to see? no error What did you see instead?


tikv_1.log: [2019/07/29 16:15:18.742 +08:00] [ERROR] [snap.rs:393] ["failed to send snap"] [err="Grpc(RpcFinished(Some(RpcStatus { status: Cancelled, details: Some("Cancelled") })))"] [to_addr=192.168.48.156:20164] [2019/07/29 16:15:18.742 +08:00] [INFO] [peer.rs:596] ["report snapshot status"] [status=Failure] [to="id: 1012 store_id: 1001 is_learner: true"] [peer_id=51] [region_id=2] [2019/07/29 16:15:20.740 +08:00] [INFO] [peer_storage.rs:789] ["requesting snapshot"] [request_index=0] [peer_id=51] [region_id=2] [2019/07/29 16:15:22.759 +08:00] [INFO] [snap.rs:721] ["scan snapshot"] [takes=17.839635ms] [size=2041113] [key_count=0] [snapshot=/data/ssdtidb2_kv_1/snap/gen_2_8_252_(default|lock|write).sst] [region_id=2] [2019/07/29 16:15:24.748 +08:00] [ERROR] [snap.rs:393] ["failed to send snap"] [err="Grpc(RpcFinished(Some(RpcStatus { status: Cancelled, details: Some("Cancelled") })))"] [to_addr=192.168.48.156:20164] [2019/07/29 16:15:24.748 +08:00] [INFO] [peer.rs:596] ["report snapshot status"] [status=Failure] [to="id: 1012 store_id: 1001 is_learner: true"] [peer_id=51] [region_id=2] [2019/07/29 16:15:26.746 +08:00] [INFO] [peer_storage.rs:789] ["requesting snapshot"] [request_index=0] [peer_id=51] [region_id=2] [2019/07/29 16:15:28.765 +08:00] [INFO] [snap.rs:721] ["scan snapshot"] [takes=17.319925ms] [size=2041113] [key_count=0] [snapshot=/data/ssdtidb2_kv_1/snap/gen_2_8_252_(default|lock|write).sst] [region_id=2] [2019/07/29 16:15:30.754 +08:00] [ERROR] [snap.rs:393] ["failed to send snap"] [err="Grpc(RpcFinished(Some(RpcStatus { status: Cancelled, details: Some("Cancelled") })))"] [to_addr=192.168.48.156:20164] [2019/07/29 16:15:30.754 +08:00] [INFO] [peer.rs:596] ["report snapshot status"] [status=Failure] [to="id: 1012 store_id: 1001 is_learner: true"] [peer_id=51] [region_id=2] [2019/07/29 16:15:32.752 +08:00] [INFO] [peer_storage.rs:789] ["requesting snapshot"] [request_index=0] [peer_id=51] [region_id=2] [2019/07/29 16:15:34.771 +08:00] [INFO] [snap.rs:721] ["scan snapshot"] [takes=17.307295ms] [size=2041113] [key_count=0] [snapshot=/data/ssdtidb2_kv_1/snap/gen_2_8_252_(default|lock|write).sst] [region_id=2] [2019/07/29 16:15:36.760 +08:00] [ERROR] [snap.rs:393] ["failed to send snap"] [err="Grpc(RpcFinished(Some(RpcStatus { status: Cancelled, details: Some("Cancelled") })))"] [to_addr=192.168.48.156:20164] [2019/07/29 16:15:36.760 +08:00] [INFO] [peer.rs:596] ["report snapshot status"] [status=Failure] [to="id: 1012 store_id: 1001 is_learner: true"] [peer_id=51] [region_id=2] [2019/07/29 16:15:38.758 +08:00] [INFO] [peer_storage.rs:789] ["requesting snapshot"] [request_index=0] [peer_id=51] [region_id=2] [2019/07/29 16:15:40.775 +08:00] [INFO] [snap.rs:721] ["scan snapshot"] [takes=15.937298ms] [size=2041113] [key_count=0] [snapshot=/data/ssdtidb2_kv_1/snap/gen_2_8_252_(default|lock|write).sst] [region_id=2] [2019/07/29 16:15:42.766 +08:00] [ERROR] [snap.rs:393] ["failed to send snap"] [err="Grpc(RpcFinished(Some(RpcStatus { status: Cancelled, details: Some("Cancelled") })))"] [to_addr=192.168.48.156:20164] [2019/07/29 16:15:42.766 +08:00] [INFO] [peer.rs:596] ["report snapshot status"] [status=Failure] [to="id: 1012 store_id: 1001 is_learner: true"] [peer_id=51] [region_id=2] [2019/07/29 16:15:44.764 +08:00] [INFO] [peer_storage.rs:789] ["requesting snapshot"] [request_index=0] [peer_id=51] [region_id=2] [2019/07/29 16:15:46.783 +08:00] [INFO] [snap.rs:721] ["scan snapshot"] [takes=17.345586ms] [size=2041113] [key_count=0] [snapshot=/data/ssdtidb2_kv_1/snap/gen_2_8_252_(default|lock|write).sst] [region_id=2] [2019/07/29 16:15:48.773 +08:00] [ERROR] [snap.rs:393] ["failed to send snap"] [err="Grpc(RpcFinished(Some(RpcStatus { status: Cancelled, details: Some("Cancelled") })))"] [to_addr=192.168.48.156:20164] [2019/07/29 16:15:48.774 +08:00] [INFO] [peer.rs:596] ["report snapshot status"] [status=Failure] [to="id: 1012 store_id: 1001 is_learner: true"] [peer_id=51] [region_id=2]



tikv_4.log:

[2019/07/29 16:21:02.721 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped] [2019/07/29 16:21:07.084 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped] [2019/07/29 16:21:07.729 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped] [2019/07/29 16:21:13.090 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped] [2019/07/29 16:21:17.445 +08:00] [WARN] [raft_client.rs:117] ["batch_raft RPC finished fail"] [err="RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))"] [2019/07/29 16:21:17.445 +08:00] [WARN] [raft_client.rs:117] ["batch_raft RPC finished fail"] [err="RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))"] [2019/07/29 16:21:17.445 +08:00] [WARN] [raft_client.rs:131] ["batch_raft/raft RPC finally fail"] [err="RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))"] [to_addr=192.168.48.156:20161] [2019/07/29 16:21:17.445 +08:00] [WARN] [raft_client.rs:131] ["batch_raft/raft RPC finally fail"] [err="RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))"] [to_addr=192.168.48.156:20162] [2019/07/29 16:21:17.445 +08:00] [INFO] [store.rs:1949] ["broadcasting unreachable"] [unreachable_store_id=5] [store_id=1001] [2019/07/29 16:21:17.445 +08:00] [INFO] [store.rs:1949] ["broadcasting unreachable"] [unreachable_store_id=4] [store_id=1001] [2019/07/29 16:21:17.703 +08:00] [WARN] [raft_client.rs:206] ["send to 192.168.48.156:20161 fail, the gRPC connection could be broken"] [2019/07/29 16:21:17.703 +08:00] [ERROR] [transport.rs:318] ["send raft msg err"] [err="Other("[src/server/raft_client.rs:215]: RaftClient send fail")"] [2019/07/29 16:21:17.703 +08:00] [WARN] [raft_client.rs:206] ["send to 192.168.48.156:20162 fail, the gRPC connection could be broken"] [2019/07/29 16:21:17.703 +08:00] [ERROR] [transport.rs:318] ["send raft msg err"] [err="Other("[src/server/raft_client.rs:215]: RaftClient send fail")"] [2019/07/29 16:21:18.704 +08:00] [WARN] [raft_client.rs:117] ["batch_raft RPC finished fail"] [err="RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))"] [2019/07/29 16:21:18.704 +08:00] [WARN] [raft_client.rs:131] ["batch_raft/raft RPC finally fail"] [err="RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("OS Error") }))"] [to_addr=192.168.48.154:20163] [2019/07/29 16:21:18.704 +08:00] [INFO] [store.rs:1949] ["broadcasting unreachable"] [unreachable_store_id=1] [store_id=1001] [2019/07/29 16:21:18.705 +08:00] [INFO] [transport.rs:299] ["resolve store address ok"] [addr=192.168.48.156:20161] [store_id=5] [2019/07/29 16:21:18.705 +08:00] [INFO] [raft_client.rs:49] ["server: new connection with tikv endpoint"] [addr=192.168.48.156:20161] [2019/07/29 16:21:18.706 +08:00] [INFO] [transport.rs:299] ["resolve store address ok"] [addr=192.168.48.156:20162] [store_id=4] [2019/07/29 16:21:18.706 +08:00] [INFO] [raft_client.rs:49] ["server: new connection with tikv endpoint"] [addr=192.168.48.156:20162] [2019/07/29 16:21:18.706 +08:00] [WARN] [raft_client.rs:206] ["send to 192.168.48.154:20163 fail, the gRPC connection could be broken"] [2019/07/29 16:21:18.706 +08:00] [ERROR] [transport.rs:318] ["send raft msg err"] [err="Other("[src/server/raft_client.rs:215]: RaftClient send fail")"] [2019/07/29 16:21:18.707 +08:00] [INFO] [transport.rs:299] ["resolve store address ok"] [addr=192.168.48.154:20163] [store_id=1] [2019/07/29 16:21:18.707 +08:00] [INFO] [raft_client.rs:49] ["server: new connection with tikv endpoint"] [addr=192.168.48.154:20163] [2019/07/29 16:21:18.847 +08:00] [INFO] [kv.rs:681] ["batch_raft RPC is called, new gRPC stream established"] [2019/07/29 16:21:18.847 +08:00] [INFO] [kv.rs:681] ["batch_raft RPC is called, new gRPC stream established"] [2019/07/29 16:21:18.849 +08:00] [INFO] [kv.rs:681] ["batch_raft RPC is called, new gRPC stream established"] [2019/07/29 16:21:21.706 +08:00] [INFO] [raft.rs:1109] ["[region 28] 1008 is starting a new election at term 10"] [2019/07/29 16:21:21.706 +08:00] [INFO] [raft.rs:780] ["[region 28] 1008 became pre-candidate at term 10"] [2019/07/29 16:21:21.706 +08:00] [INFO] [raft.rs:873] ["[region 28] 1008 received MsgRequestPreVoteResponse from 1008 at term 10"] [2019/07/29 16:21:21.706 +08:00] [INFO] [raft.rs:847] ["[region 28] 1008 [logterm: 10, index: 17] sent MsgRequestPreVote request to 29 at term 10"] [2019/07/29 16:21:21.706 +08:00] [INFO] [raft.rs:847] ["[region 28] 1008 [logterm: 10, index: 17] sent MsgRequestPreVote request to 45 at term 10"] [2019/07/29 16:21:22.742 +08:00] [INFO] [kv.rs:681] ["batch_raft RPC is called, new gRPC stream established"] [2019/07/29 16:21:23.743 +08:00] [INFO] [kv.rs:681] ["batch_raft RPC is called, new gRPC stream established"] [2019/07/29 16:21:23.746 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped] [2019/07/29 16:21:23.879 +08:00] [INFO] [kv.rs:681] ["batch_raft RPC is called, new gRPC stream established"] [2019/07/29 16:21:23.881 +08:00] [INFO] [raft.rs:738] ["[region 28] 1008 became follower at term 10"] [2019/07/29 16:21:24.708 +08:00] [INFO] [raft.rs:1109] ["[region 6] 1010 is starting a new election at term 9"] [2019/07/29 16:21:24.708 +08:00] [INFO] [raft.rs:780] ["[region 6] 1010 became pre-candidate at term 9"] [2019/07/29 16:21:24.708 +08:00] [INFO] [raft.rs:873] ["[region 6] 1010 received MsgRequestPreVoteResponse from 1010 at term 9"] [2019/07/29 16:21:24.708 +08:00] [INFO] [raft.rs:847] ["[region 6] 1010 [logterm: 9, index: 262] sent MsgRequestPreVote request to 30 at term 9"] [2019/07/29 16:21:24.708 +08:00] [INFO] [raft.rs:847] ["[region 6] 1010 [logterm: 9, index: 262] sent MsgRequestPreVote request to 30 at term 9"] [2019/07/29 16:21:24.708 +08:00] [INFO] [raft.rs:847] ["[region 6] 1010 [logterm: 9, index: 262] sent MsgRequestPreVote request to 7 at term 9"] [2019/07/29 16:21:25.878 +08:00] [INFO] [raft.rs:738] ["[region 6] 1010 became follower at term 9"] [2019/07/29 16:21:27.104 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped] [2019/07/29 16:21:29.751 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped] [2019/07/29 16:21:33.110 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped] [2019/07/29 16:21:35.757 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped] [2019/07/29 16:21:39.116 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped] [2019/07/29 16:21:41.763 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped] [2019/07/29 16:21:45.122 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped] [2019/07/29 16:21:47.768 +08:00] [ERROR] [snap.rs:354] ["failed to recv snapshot"] [err=RemoteStopped]


pd.log: [2019/07/29 15:41:39.270 +08:00] [INFO] [operator_controller.go:99] ["operator finish"] [region-id=18] [operator=""balance-leader (kind:leader,balance, region:18(8,11), createAt:2019-07-29 15:41:39.025326548 +0800 CST m=+3073.615721365, startAt:2019-07-29 15:41:39.025426834 +0800 CST m=+3073.615821645, currentStep:1, steps:[transfer leader from store 5 to store 1001]) finished""] [2019/07/29 15:41:39.271 +08:00] [INFO] [cluster_info.go:568] ["leader changed"] [region-id=26] [from=5] [to=1001] [2019/07/29 15:41:39.271 +08:00] [INFO] [operator_controller.go:99] ["operator finish"] [region-id=26] [operator=""balance-leader (kind:leader,balance, region:26(12,8), createAt:2019-07-29 15:41:39.005097052 +0800 CST m=+3073.595491867, startAt:2019-07-29 15:41:39.005331137 +0800 CST m=+3073.595725953, currentStep:1, steps:[transfer leader from store 5 to store 1001]) finished""] [2019/07/29 15:41:39.272 +08:00] [INFO] [cluster_info.go:568] ["leader changed"] [region-id=16] [from=5] [to=1001] [2019/07/29 15:41:39.273 +08:00] [INFO] [operator_controller.go:99] ["operator finish"] [region-id=16] [operator=""balance-leader (kind:leader,balance, region:16(7,11), createAt:2019-07-29 15:41:39.035513726 +0800 CST m=+3073.625908540, startAt:2019-07-29 15:41:39.035664284 +0800 CST m=+3073.626059100, currentStep:1, steps:[transfer leader from store 5 to store 1001]) finished""] [2019/07/29 15:50:25.447 +08:00] [INFO] [periodic.go:135] ["starting auto periodic compaction"] [revision=739] [compact-period=1h0m0s] [2019/07/29 15:50:25.448 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=739] [2019/07/29 15:50:25.448 +08:00] [INFO] [periodic.go:146] ["completed auto periodic compaction"] [revision=739] [compact-period=1h0m0s] [took=1h0m0.002921824s] [2019/07/29 15:50:25.450 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=739] [took=1.538461ms] [2019/07/29 16:21:30.457 +08:00] [ERROR] [heartbeat_streams.go:121] ["send keepalive message fail"] [target-store-id=4] [error=EOF]

tikv/tikv

Answer questions NormanZhang

是docker swarm网络的问题,换了个网络好了

useful!

Related questions

Test failed with error `too many open files` hot 1
combine all binaries into one hot 1
PCP: test hot 1
Fix design of engine_traits for mutual associated types / type bound equality hot 1
Needs protoc dependency hot 1
Failpoint tests will not run when make dev hot 1
Running cargo bench causes panic hot 1
Better memory observation for tidb_query executors hot 1
how to disable gRPC batch mechanism hot 1
Github User Rank List