BusyJay/jemallocator 3

Rust allocator using jemalloc as a backend

BusyJay/libressl-src 2

Libressl source expected to be consumed by openssl-sys.

BusyJay/docs 1

TiDB/TiKV/PD documents.

BusyJay/docs-cn 1

TiDB/TiKV/PD documents in Chinese.

BusyJay/grpc-rs 1

The rust language implementation of gRPC. HTTP/2 based RPC

BusyJay/jinkela 1

Don't fight, peace & love

BusyJay/agatedb 0

A persistent key-value storage in rust.

BusyJay/atty 0

are you or are you not a tty?

BusyJay/bank-race 0

A naive simulation about race used for FOSDEM '19.

create branch BusyJay/etcd

branch : log-less-election

created branch time in an hour

push event BusyJay/raft-rs

Jay Lee

commit sha f16927c18ec57acc6535bfaa75f1c37a5b04823c

log less Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

view details

push time in 2 hours

pull request comment tikv/raft-rs

*: introduce joint quorum

@Fullstop000 PTAL if you have time.

BusyJay

comment created time in 2 hours

PR opened tikv/raft-rs

Reviewers
*: introduce joint quorum Feature

This PR ports joint quorum. It renames ProgressSet to ProgressTracker and keeps both the name and field layout the same as etcd.

In addition, the PR fixes a performance issue discovered earlier: campaigning prints too many logs.

joint.rs is ported from etcd/raft/quorum/joint.go and tracker is from etcd/raft/tracker/tracker.go.

+412 -212

0 comment

13 changed files

pr created time in 2 hours

create branch BusyJay/raft-rs

branch : port-joint-quorum

created branch time in 2 hours

issue opened tikv/tikv

Don't allow conf remove until leader has applied to current term

Bug Report

We allow green GC to scan the store data directly without a leader. @youjiali1995 suggested that it can miss data if an old leader is removed before the new leader has applied all the logs from the last term. To get around the problem, we can prevent proposing any conf change that removes a peer until the leader is able to read.

/cc @gengliqi @NingLin-P

created time in 8 hours

Pull request review comment tikv/tikv

Use iterator instead of slice in Latches::gen_locks

 macro_rules! gen_lock {     };     ($field: ident) => {         fn gen_lock(&self, latches: &Latches) -> latch::Lock {-            latches.gen_lock(&[&self.$field])+            latches.gen_lock(iter::once(&self.$field))

Why? They should be the same.

longfangsong

comment created time in 9 hours

delete branch BusyJay/raft-rs

delete branch : quorum

delete time in a day

push event tikv/raft-rs

Jay

commit sha 5ca2e467cd34a9b104bc9ee6074fb338cbf532e8

*: add quorum package (#380) Adds quorum package and majority configuration. Configuration in tracker is also updated. The quorum package is ported from etcd master. Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

view details

push time in a day

PR merged tikv/raft-rs

*: add quorum package

Adds quorum package and majority configuration. Configuration in tracker is also updated. The quorum package is ported from etcd master.

+273 -167

2 comments

6 changed files

BusyJay

pr closed time in a day

Pull request review comment tikv/raft-rs

*: add quorum package

+// Copyright 2020 TiKV Project Authors. Licensed under Apache-2.0.++pub mod majority;++use std::collections::HashMap;+use std::fmt::{self, Debug, Display, Formatter};++/// VoteResult indicates the outcome of a vote.+#[derive(Clone, Copy, Debug, PartialEq)]+pub enum VoteResult {+    /// Pending indicates that the decision of the vote depends on future+    /// votes, i.e. neither "yes" or "no" has reached quorum yet.+    Pending,+    // Lost indicates that the quorum has voted "no".+    Lost,+    // Won indicates that the quorum has voted "yes".+    Won,+}++/// Index is a Raft log position.+#[derive(Default, Clone, Copy)]+pub struct Index {+    pub index: u64,+    pub group_id: u64,+}++impl Display for Index {+    #[inline]+    fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {+        if self.index != u64::MAX {+            write!(f, "[{}]{}", self.group_id, self.index)+        } else {+            write!(f, "[{}]∞", self.group_id)+        }+    }+}++impl Debug for Index {+    #[inline]+    fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {+        Display::fmt(self, f)+    }+}++pub trait AckedIndexer {

It's not a public trait; it can only be accessed within the crate. We have a #![deny(missing_docs)] check.

BusyJay

comment created time in a day

Pull request review comment tikv/raft-rs

*: add quorum package

 impl ProgressSet {         let pr = Progress::new(next_idx, max_inflight);         for id in &meta.conf_state.as_ref().unwrap().voters {             self.progress.insert(*id, pr.clone());-            self.configuration.voters.insert(*id);+            self.configuration.voters.voters.insert(*id);

Yes. It will be added when Joint configuration is ported.

BusyJay

comment created time in a day

issue comment etcd-io/etcd

How about make configuration changes effective on received instead of applied

Another limitation of taking effect on apply is that it prevents a practical rollback of joint consensus when only a quorum is alive. For example, when removing C from ABCDE and DE are isolated, the removal apparently can't succeed when using joint consensus. But there is no way to roll back the change either, as no logs can be committed any more, so the cluster is stuck.

Related discussion tikv/raft-rs#192

hicqu

comment created time in a day

issue comment tikv/tikv

Investigate performance bottleneck when using large value (>8M)

I think you also need to see how reads perform in the future.

yiwu-arbug

comment created time in a day

issue comment tikv/tikv

Change max replicas may cause region unavailable

There is no way to guarantee complete availability during a conf change. I suggest changing replicas only after changing the state, either automatically or manually. And PD should pick the down node over a healthy node.

rleungx

comment created time in a day

Pull request review comment tikv/tikv

Use iterator instead of slice in Latches::gen_locks

 impl Latches {     }      /// Creates a lock which specifies all the required latches for a command.-    pub fn gen_lock<H>(&self, keys: &[H]) -> Lock+    pub fn gen_lock<'a, H: 'a, I>(&'a self, keys: I) -> Lock     where         H: Hash,+        I: Iterator<Item = &'a H>,

You can use IntoIterator so that changes to https://github.com/tikv/tikv/pull/8167/files#diff-21f8c57a4ef3ae7c3bb60229864218d0R345 and https://github.com/tikv/tikv/pull/8167/files#diff-21f8c57a4ef3ae7c3bb60229864218d0R350 are unnecessary.
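
A minimal sketch of the suggestion, with `Latches` and `Lock` reduced to hypothetical stand-ins rather than the actual TiKV types: bounding the key parameter on `IntoIterator` keeps both slice-based and iterator-based call sites compiling.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug)]
struct Lock {
    required_slots: Vec<u64>,
}

struct Latches {
    size: u64,
}

impl Latches {
    /// Accepting `IntoIterator` lets both `gen_lock(&[&key])` and
    /// `gen_lock(iter::once(&key))` compile without touching call sites.
    fn gen_lock<'a, H: Hash + 'a>(&self, keys: impl IntoIterator<Item = &'a H>) -> Lock {
        let mut slots: Vec<u64> = keys
            .into_iter()
            .map(|k| {
                let mut hasher = DefaultHasher::new();
                k.hash(&mut hasher);
                hasher.finish() % self.size
            })
            .collect();
        slots.sort_unstable();
        slots.dedup();
        Lock { required_slots: slots }
    }
}

fn main() {
    let latches = Latches { size: 256 };
    let key = b"key1".to_vec();
    // A slice-based call site keeps working...
    let slice_lock = latches.gen_lock(&[&key]);
    // ...and so does an iterator-based one; both resolve to the same slots.
    let iter_lock = latches.gen_lock(std::iter::once(&key));
    assert_eq!(slice_lock.required_slots, iter_lock.required_slots);
    println!("{:?}", slice_lock);
}
```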

longfangsong

comment created time in a day

push event BusyJay/tlaplus-specs

Jay Lee

commit sha ee6406d348d5cf07336c4314d77243e7a28672b0

Make it a bit faster Signed-off-by: Jay Lee <busyjaylee@gmail.com>

view details

push time in 2 days

issue comment pingcap/tiup

make useful tool like pd-ctl available

Local mode means running tikv-ctl on the tikv machine. You need to stop the tikv node first, and then use the tool to modify the data.

BusyJay

comment created time in 2 days

pull request comment tikv/tikv

*: Improve rocksdb logger

It seems to be still WIP. Is it ready for review?

yiwu-arbug

comment created time in 2 days

issue comment pingcap/tiup

make useful tool like pd-ctl available

...user should never execute ctls directly...

On the one hand, I have never suggested executing it directly. What I suggest is matching the version automatically. Managing *ctl via cluster is one way to do it. For example, tiup cluster ctl xxx instead of tiup ctl xxx.

On the other hand, tikv-ctl sometimes needs to be run in local mode. Hence it's unavoidable to allow users to execute tikv-ctl directly.

BusyJay

comment created time in 2 days

issue comment tikv/grpc-rs

`grpcio-proto` v0.6.0 missing from crates.io

Oh, sorry, I forgot. Now it should be available.

mtp401

comment created time in 2 days

issue comment tikv/tikv

make build failed on master branch

I suspect it's due to corrupted local metadata. Maybe make clean && make build will fix it.

gotoxu

comment created time in 3 days

Pull request review comment tikv/raft-rs

*: add quorum package

+// Copyright 2020 TiKV Project Authors. Licensed under Apache-2.0.++use super::{AckedIndexer, Index, VoteResult};+use crate::{DefaultHashBuilder, HashSet};+use std::mem::MaybeUninit;+use std::{cmp, slice, u64};++/// A set of IDs that uses majority quorums to make decisions.+#[derive(Clone, Debug, Default, PartialEq)]+pub struct Configuration {+    pub(crate) voters: HashSet<u64>,+}++impl Configuration {+    /// Creates a new configuration using the given IDs.+    pub fn new(voters: HashSet<u64>) -> Configuration {+        Configuration { voters }+    }++    /// Creates an empty configuration with given capacity.+    pub fn with_capacity(cap: usize) -> Configuration {+        Configuration {+            voters: HashSet::with_capacity_and_hasher(cap, DefaultHashBuilder::default()),+        }+    }++    /// Returns the MajorityConfig as a sorted slice.+    pub fn slice(&self) -> Vec<u64> {+        let mut voters: Vec<_> = self.voters.iter().cloned().collect();+        voters.sort();+        voters+    }++    /// Computes the committed index from those supplied via the+    /// provided AckedIndexer (for the active config).+    ///+    /// The bool flag indicates whether the index is computed by group commit algorithm+    /// successfully.+    ///+    /// Eg. If the matched indexes are [2,2,2,4,5], it will return 2.+    /// If the matched indexes and groups are `[(1, 1), (2, 2), (3, 2)]`, it will return 1.+    pub fn committed_index(&self, use_group_commit: bool, l: &impl AckedIndexer) -> (u64, bool) {+        if self.voters.is_empty() {+            // This plays well with joint quorums which, when one half is the zero+            // MajorityConfig, should behave like the other half.+            return (u64::MAX, false);+        }++        let mut stack_arr: [MaybeUninit<Index>; 7] = unsafe { MaybeUninit::uninit().assume_init() };

I didn't use SmallVec on purpose. On the one hand, it's such a small piece of code that it doesn't need another dependency; on the other hand, it has fewer conditional branches since every operation is inlined and optimized when using an array.
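
A minimal sketch of the idea behind that choice (a hypothetical standalone function, not the raft-rs code): copy the acked indexes into a fixed-size array on the stack for small configurations, falling back to a heap allocation only for larger ones, so no extra crate like SmallVec is needed.

```rust
/// Hypothetical helper: returns the largest index acknowledged by a majority.
/// E.g. for matched indexes [2, 2, 2, 4, 5] it returns 2.
fn committed_index(matched: &[u64]) -> u64 {
    let n = matched.len();
    assert!(n > 0, "an empty configuration is handled separately");

    // Small clusters (the common case) stay entirely on the stack.
    let mut stack_buf = [0u64; 7];
    let mut heap_buf;
    let buf: &mut [u64] = if n <= stack_buf.len() {
        stack_buf[..n].copy_from_slice(matched);
        &mut stack_buf[..n]
    } else {
        heap_buf = matched.to_vec();
        &mut heap_buf[..]
    };

    // After sorting ascending, the element at position `n - quorum` is the
    // largest index acknowledged by at least a quorum of voters.
    buf.sort_unstable();
    let quorum = n / 2 + 1;
    buf[n - quorum]
}

fn main() {
    // Matches the doc comment in the diff: [2, 2, 2, 4, 5] commits up to 2.
    assert_eq!(committed_index(&[2, 2, 2, 4, 5]), 2);
    assert_eq!(committed_index(&[1, 5, 5]), 5);
    println!("ok");
}
```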

BusyJay

comment created time in 3 days

issue comment tikv/tikv

make build failed on master branch

The trace comes from the rustc binary; the string is the path of the sources on rustc's release build machine. It has nothing to do with your local registry.

gotoxu

comment created time in 3 days

pull request comment tikv/tikv

raftstore: add peer msg process duration metrics

/release

Connor1996

comment created time in 3 days

pull request comment tikv/tikv

add lockless raft client implement

/release

BusyJay

comment created time in 3 days

issue comment pingcap/tiup

make useful tool like pd-ctl available

What's the point of "decouple" when those should have connections?

I have seen complaints from DBAs in earlier discussions that the ctls are not managed by cluster. It's quite natural to expect the ctl to match the selected version of a cluster, just like tikv-servers or tidb-servers, since we use tiup to manage everything.

BusyJay

comment created time in 3 days

issue opened tikv/tikv

Active written leaders is wrong

Bug Report

What version of TiKV are you using?

Master

What happened?

https://github.com/tikv/tikv/blob/bcea06ca85daab94f47608d7d98976ec086fb24c/components/raftstore/src/store/worker/pd.rs#L607

PD updates the metrics regardless of whether there are keys/bytes written. So the active written leaders graph actually shows all leaders instead.

created time in 3 days

issue comment tikv/tikv

make build failed on master branch

I tried to reproduce it, but everything went well.

gotoxu

comment created time in 3 days

issue comment pingcap/tiup

tiup not support package from release bot

So how do I solve it? Do I have to download the package, extract it, and then repackage it to meet tiup's requirement? 🤣

BusyJay

comment created time in 3 days

pull request comment tikv/raft-rs

remove storage error other

It may be used by implementers of the storage trait.

hicqu

comment created time in 3 days

issue comment tikv/tikv

make build failed on master branch

This seems to be a bug in rustc. Can you build it with previous commits? Can you also report it to the rust-lang/rust project?

gotoxu

comment created time in 3 days

Pull request review comment tikv/tikv

raftstore: add peer msg process duration metrics

 impl<'a, T: Transport, C: PdClient> PeerFsmDelegate<'a, T, C> {                 PeerMsg::ApplyRes { res } => {                     self.on_apply_res(res);                 }-                PeerMsg::SignificantMsg(msg) => self.on_significant_msg(msg),-                PeerMsg::CasualMessage(msg) => self.on_casual_msg(msg),+                PeerMsg::SignificantMsg(msg) => {+                    let timer = TiInstant::now_coarse();+                    self.on_significant_msg(msg);+                    RAFT_EVENT_DURATION

There can be a lot of Unreachable and StoreResolved messages in a short time.

Connor1996

comment created time in 3 days

issue comment pingcap/tiup

Unable to change PD address

Supporting force reload should get around the problem.

BusyJay

comment created time in 3 days

issue comment pingcap/tiup

tiup not support package from release bot

I'm not sure what the tiup CI is. I built the binary with the command "/release" in a PR. It's a common workflow in TiKV.

BusyJay

comment created time in 3 days

pull request comment tikv/raft-rs

*: add quorum package

@Fullstop000 Do you have time to review this? You can use the command "LGTM" to give your approval.

BusyJay

comment created time in 3 days

Pull request review comment tikv/tikv

UCP: Support JSON log format

 mod tests {          log_format_cases!(logger); -        let expect = r#"{"time":"2020/05/16 15:49:52.449 +08:00","level":"INFO","caller":"mod.rs","message":""}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"INFO","caller":"mod.rs","message":"Welcome"}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"INFO","caller":"mod.rs","message":"Welcome TiKV"}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"INFO","caller":"mod.rs","message":"欢迎"}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"INFO","caller":"mod.rs","message":"欢迎 TiKV"}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"INFO","caller":"mod.rs","message":"failed to fetch URL","backoff":"3s","attempt":3,"url":"http://example.com"}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"INFO","caller":"mod.rs","message":"failed to \"fetch\" [URL]: http://example.com"}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"DEBUG","caller":"mod.rs","message":"Slow query","process keys":1500,"duration":"123ns","sql":"SELECT * FROM TABLE WHERE ID=\"abc\""}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"WARN","caller":"mod.rs","message":"Type","Other":null,"Score":null,"Counter":null}-{"time":"2020/05/16 15:49:52.451 +08:00","level":"INFO","caller":"mod.rs","message":"more type tests","str_array":"[\"💖\", \"�\", \"☺☻☹\", \"日a本b語ç日ð本Ê語þ日¥本¼語i日©\", \"日a本b語ç日ð本Ê語þ日¥本¼語i日©日a本b語ç日ð本Ê語þ日¥本¼語i日©日a本b語ç日ð本Ê語þ日¥本¼語i日©\", \"\\\\x80\\\\x80\\\\x80\\\\x80\", \"<car><mirror>XML</mirror></car>\"]","u8":34,"is_None":null,"is_false":false,"is_true":true,"store ids":"[1, 2, 3]","url-peers":"[\"peer1\", \"peer 2\"]","urls":"[\"http://xxx.com:2347\", \"http://xxx.com:2432\"]","field2":"in quote","field1":"no_quote"}+        let expect = r#"{"time":"2020/05/16 15:49:52.449 +08:00","level":"INFO","caller":"mod.rs:469","message":""}

Got it.

weihanglo

comment created time in 3 days

issue comment pingcap/tiup

make useful tool like pd-ctl available

It seems to just download the v3.0.x version. I need a version that matches the cluster.

So can it match the version automatically?

BusyJay

comment created time in 4 days

Pull request review comment tikv/tikv

UCP: Support JSON log format

 mod tests {          log_format_cases!(logger); -        let expect = r#"{"time":"2020/05/16 15:49:52.449 +08:00","level":"INFO","caller":"mod.rs","message":""}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"INFO","caller":"mod.rs","message":"Welcome"}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"INFO","caller":"mod.rs","message":"Welcome TiKV"}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"INFO","caller":"mod.rs","message":"欢迎"}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"INFO","caller":"mod.rs","message":"欢迎 TiKV"}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"INFO","caller":"mod.rs","message":"failed to fetch URL","backoff":"3s","attempt":3,"url":"http://example.com"}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"INFO","caller":"mod.rs","message":"failed to \"fetch\" [URL]: http://example.com"}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"DEBUG","caller":"mod.rs","message":"Slow query","process keys":1500,"duration":"123ns","sql":"SELECT * FROM TABLE WHERE ID=\"abc\""}-{"time":"2020/05/16 15:49:52.450 +08:00","level":"WARN","caller":"mod.rs","message":"Type","Other":null,"Score":null,"Counter":null}-{"time":"2020/05/16 15:49:52.451 +08:00","level":"INFO","caller":"mod.rs","message":"more type tests","str_array":"[\"💖\", \"�\", \"☺☻☹\", \"日a本b語ç日ð本Ê語þ日¥本¼語i日©\", \"日a本b語ç日ð本Ê語þ日¥本¼語i日©日a本b語ç日ð本Ê語þ日¥本¼語i日©日a本b語ç日ð本Ê語þ日¥本¼語i日©\", \"\\\\x80\\\\x80\\\\x80\\\\x80\", \"<car><mirror>XML</mirror></car>\"]","u8":34,"is_None":null,"is_false":false,"is_true":true,"store ids":"[1, 2, 3]","url-peers":"[\"peer1\", \"peer 2\"]","urls":"[\"http://xxx.com:2347\", \"http://xxx.com:2432\"]","field2":"in quote","field1":"no_quote"}+        let expect = r#"{"time":"2020/05/16 15:49:52.449 +08:00","level":"INFO","caller":"mod.rs:469","message":""}

Better not to hardcode the line number to avoid unnecessary updates in the future.
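
One possible way to follow that advice, sketched with the `regex` crate and a hypothetical `normalize_caller` helper: strip the concrete line number from the `caller` field before comparing against the expected output.

```rust
use regex::Regex;

// Rewrites `"caller":"mod.rs:469"` into `"caller":"mod.rs:<line>"` so the
// expected string in the test never has to track real line numbers.
fn normalize_caller(log: &str) -> String {
    let re = Regex::new(r#""caller":"([^":]+):\d+""#).unwrap();
    re.replace_all(log, r#""caller":"${1}:<line>""#).into_owned()
}

fn main() {
    let got = r#"{"caller":"mod.rs:469","message":""}"#;
    let expect = r#"{"caller":"mod.rs:<line>","message":""}"#;
    assert_eq!(normalize_caller(got), expect);
    println!("normalized: {}", normalize_caller(got));
}
```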

weihanglo

comment created time in 4 days

Pull request review comment tikv/tikv

Accelerate makefile build speed

 run-test: 	export RUST_BACKTRACE=1 && \ 	cargo test --workspace \ 		--exclude fuzzer-honggfuzz --exclude fuzzer-afl --exclude fuzzer-libfuzzer \-		--features "${ENABLE_FEATURES}" ${EXTRA_CARGO_ARGS} -- --nocapture && \+		--features "${ENABLE_FEATURES} mem-profiling" ${EXTRA_CARGO_ARGS} -- --nocapture && \ 	if [[ "`uname`" == "Linux" ]]; then \ 		export MALLOC_CONF=prof:true,prof_active:false && \ 		cargo test --features "${ENABLE_FEATURES} mem-profiling" ${EXTRA_CARGO_ARGS} -p tikv_alloc -- --nocapture --ignored && \-		cargo test --features "${ENABLE_FEATURES} mem-profiling" ${EXTRA_CARGO_ARGS} -p tikv --lib -- -- nocapture --ignored; \+		cargo test --workspace \

It seems I misunderstood how --ignored worked.

hunterlxt

comment created time in 4 days

Pull request review comment tikv/tikv

raftstore: add peer msg process duration metrics

 impl<'a, T: Transport, C: PdClient> PeerFsmDelegate<'a, T, C> {                 PeerMsg::ApplyRes { res } => {                     self.on_apply_res(res);                 }-                PeerMsg::SignificantMsg(msg) => self.on_significant_msg(msg),-                PeerMsg::CasualMessage(msg) => self.on_casual_msg(msg),+                PeerMsg::SignificantMsg(msg) => {+                    let timer = TiInstant::now_coarse();+                    self.on_significant_msg(msg);+                    RAFT_EVENT_DURATION

Why observe them? Are there any messages that can take a long time to finish?

Connor1996

comment created time in 4 days

Pull request review comment tikv/tikv

Accelerate makefile build speed

 run-test: 	export RUST_BACKTRACE=1 && \ 	cargo test --workspace \ 		--exclude fuzzer-honggfuzz --exclude fuzzer-afl --exclude fuzzer-libfuzzer \-		--features "${ENABLE_FEATURES}" ${EXTRA_CARGO_ARGS} -- --nocapture && \+		--features "${ENABLE_FEATURES} mem-profiling" ${EXTRA_CARGO_ARGS} -- --nocapture && \ 	if [[ "`uname`" == "Linux" ]]; then \ 		export MALLOC_CONF=prof:true,prof_active:false && \ 		cargo test --features "${ENABLE_FEATURES} mem-profiling" ${EXTRA_CARGO_ARGS} -p tikv_alloc -- --nocapture --ignored && \-		cargo test --features "${ENABLE_FEATURES} mem-profiling" ${EXTRA_CARGO_ARGS} -p tikv --lib -- -- nocapture --ignored; \+		cargo test --workspace \

I'm afraid tests will be run twice.

hunterlxt

comment created time in 4 days

issue comment pingcap/tiup

Support custom binaries when deploying a cluster

Last time I checked, the command didn't work. What I need are:

  1. Support both compressed and raw binaries.
  2. Display information in tiup cluster show.

BusyJay

comment created time in 4 days

Pull request review comment tikv/tikv

raftstore: support smoothly switch replication mode

 fn test_loading_label_after_rolling_start() {     assert_eq!(state.state_id, 1);     assert_eq!(state.state, RegionReplicationState::IntegrityOverLabel); }++// Delay replication mode switch if groups consistent can't reach immediately,+// until groups consistent reached or timeout reached.+#[test]+fn test_delaying_switch_replication_mode() {+    let mut cluster = prepare_cluster();+    let region = cluster.get_region(b"k1");+    cluster.add_send_filter(IsolationFilterFactory::new(3));+    cluster+        .pd_client+        .switch_replication_mode(DrAutoSyncState::Async, None);+    thread::sleep(Duration::from_millis(100));+    cluster.must_put(b"k2", b"v2");+    thread::sleep(Duration::from_millis(100));+    let state = cluster.pd_client.region_replication_status(region.get_id());+    assert_eq!(state.state_id, 2);+    assert_eq!(state.state, RegionReplicationState::SimpleMajority);++    // Replication mode not switch yet, so log entry still can be committed+    cluster+        .pd_client+        .switch_replication_mode(DrAutoSyncState::SyncRecover, Some(1)); // Delay for 1s+    thread::sleep(Duration::from_millis(100));+    cluster.must_put(b"k3", b"v3");+    thread::sleep(Duration::from_millis(100));+    let state = cluster.pd_client.region_replication_status(region.get_id());+    assert_eq!(state.state_id, 3);+    assert_eq!(state.state, RegionReplicationState::SimpleMajority);++    // Replication mode switch because timeout reached+    thread::sleep(Duration::from_millis(1000));+    let rx = cluster+        .async_request(put_request(&region, 1, b"k4", b"v4"))+        .unwrap();+    assert_eq!(+        rx.recv_timeout(Duration::from_millis(100)),+        Err(mpsc::RecvTimeoutError::Timeout)+    );+    must_get_none(&cluster.get_engine(1), b"k4");+    let state = cluster.pd_client.region_replication_status(region.get_id());+    assert_eq!(state.state_id, 3);+    assert_eq!(state.state, RegionReplicationState::SimpleMajority);++    // Replication mode switch because groups consistent reached+    cluster.clear_send_filters();+    cluster

How do you know the groups are consistent when this line is executed?

NingLin-P

comment created time in 4 days

Pull request review comment tikv/tikv

Add missing feature for `trace.rs`

 pub fn encode_spans(span_sets: Vec<SpanSet>) -> impl Iterator<Item = spanpb::Spa                 s.set_event(span.event);                  #[cfg(feature = "prost-codec")]-                use minitrace::Link;--                #[cfg(feature = "prost-codec")]-                match span.link {-                    Link::Root => {-                        s.link = spanpb::Link::Root;-                    }-                    Link::Parent(id) => {-                        s.link = spanpb::Link::Parent(id);-                    }-                    Link::Continue(id) => {-                        s.link = spanpb::Link::Continue(id);-                    }+                {+                    use minitrace::Link;

Any tests?

Renkai

comment created time in 4 days

push event Fullstop000/raft-rs

Jay

commit sha b482475292d0091ca8e512db73df89c2a63ac21f

*: extract tracker package (#379) Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

view details

Jay

commit sha b7c662e5141ab6fa4756e84575c82ea745ac6d49

Merge branch 'master' into committed_entries_pagination

view details

push time in 4 days

Pull request review comment tikv/tikv

raftstore: reduce raftstore error size

 quick_error! {     } } +pub struct Error(pub Box<ErrorInner>);

How about just boxing some variants instead? Errors like NotLeader are very common and should already be small enough.
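
A minimal sketch of the reviewer's alternative, with a hypothetical error type rather than the actual raftstore one: keep the cheap, frequent variants inline and box only the large payloads, so the enum stays small without wrapping the whole error in a Box.

```rust
// Hypothetical, simplified types for illustration only.
#[allow(dead_code)]
#[derive(Debug)]
struct RegionMeta {
    // Imagine many fields here; this is what makes the variant large.
    start_key: Vec<u8>,
    end_key: Vec<u8>,
    peers: Vec<u64>,
}

#[allow(dead_code)]
#[derive(Debug)]
enum Error {
    // Very common and small: keep it inline so constructing it stays cheap.
    NotLeader { region_id: u64, leader_store: Option<u64> },
    // Rare but large: box the payload so it doesn't inflate every Error value.
    EpochNotMatch(Box<RegionMeta>),
    Other(Box<dyn std::error::Error + Send + Sync>),
}

fn main() {
    // The enum size is bounded by its largest inline variant plus the tag.
    println!("size_of::<Error>() = {}", std::mem::size_of::<Error>());
    let _e = Error::NotLeader { region_id: 1, leader_store: None };
}
```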

Fullstop000

comment created time in 4 days

Pull request review comment tikv/tikv

UCP: Support JSON log format

 pub fn initial_logger(config: &TiKvConfig) {                     e                 );             });-            let drainer = logger::LogDispatcher::new(drainer, slow_log_drainer);-            logger::init_log(-                drainer,-                config.log_level,-                true,-                true,-                vec![],-                config.slow_log_threshold.as_millis(),-            )-            .unwrap_or_else(|e| {-                fatal!("failed to initialize log: {}", e);-            });++            match config.log_format {+                LogFormat::Text => build_loggers(+                    logger::text_format(writer),+                    logger::text_format(slow_log_writer),+                    config,+                ),+                LogFormat::Json => build_loggers(+                    logger::json_format(writer),+                    logger::json_format(slow_log_writer),+                    config,+                ),+            };++            fn build_loggers<N, S>(normal: N, slow: S, config: &TiKvConfig)

The name is similar to build_logger; better to give it a different name or just use the same function.

weihanglo

comment created time in 4 days

Pull request review comment tikv/tikv

UCP: Support JSON log format

 mod tests {         }     } -    #[test]-    fn test_log_format() {-        use std::time::Duration;-        let decorator = PlainSyncDecorator::new(TestWriter);-        let drain = TikvFormat::new(decorator).fuse();-        let logger = slog::Logger::root_typed(drain, slog_o!());+    macro_rules! log_format_cases {

Better to use a function.

weihanglo

comment created time in 4 days

Pull request review comment tikv/tikv

UCP: Support JSON log format

 where             .add_rotator(RotateBySize::new(rotation_size))             .build()?,     );-    let decorator = PlainDecorator::new(logger);-    let drain = TikvFormat::new(decorator);-    Ok(drain)+    Ok(logger)+}++/// Constructs a new terminal writer which outputs logs to stderr.+pub fn term_writer() -> io::Stderr {+    io::stderr() } -/// Constructs a new terminal drainer which outputs logs to stderr.-pub fn term_drainer() -> TikvFormat<TermDecorator> {-    let decorator = TermDecorator::new().stderr().build();+/// Formats output logs to "TiDB Log Format".+pub fn text_format<W>(io: W) -> TikvFormat<PlainDecorator<W>>+where+    W: io::Write,+{+    let decorator = PlainDecorator::new(io);     TikvFormat::new(decorator) } +/// Formats output logs to JSON format.+pub fn json_format<W>(io: W) -> slog_json::Json<W>+where+    W: io::Write,+{+    slog_json::Json::new(io)+        .set_newlines(true)+        .set_flush(true)+        .add_key_value(slog_o!(+            "message" => PushFnValue(|record, ser| ser.emit(record.msg())),+            "caller" => FnValue(|record| Path::new(record.file())+                .file_name()+                .and_then(|path| path.to_str()).unwrap_or("<unknown>")+            ),+            "level" => FnValue(|record| get_unified_log_level(record.level())),+            "time" => FnValue(|_| chrono::Local::now().format(TIMESTAMP_FORMAT).to_string()),

The file line number seems to be missing.
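
A sketch of how the line number could be folded into `caller`, mirroring the `json_format` helper from the diff; `Record::line()` is part of slog's public API. The TiKV-specific pieces (TIMESTAMP_FORMAT, get_unified_log_level) are replaced here with plain slog equivalents.

```rust
use slog::{o, Drain, FnValue, PushFnValue};
use std::io;
use std::path::Path;
use std::sync::Mutex;

pub fn json_format<W: io::Write>(io: W) -> slog_json::Json<W> {
    slog_json::Json::new(io)
        .set_newlines(true)
        .set_flush(true)
        .add_key_value(o!(
            "message" => PushFnValue(|record, ser| ser.emit(record.msg())),
            // "mod.rs" becomes "mod.rs:123" by appending `record.line()`.
            "caller" => FnValue(|record| {
                let file = Path::new(record.file())
                    .file_name()
                    .and_then(|p| p.to_str())
                    .unwrap_or("<unknown>");
                format!("{}:{}", file, record.line())
            }),
            "level" => FnValue(|record| record.level().as_str()),
        ))
        .build()
}

fn main() {
    let drain = Mutex::new(json_format(io::stderr()).fuse()).fuse();
    let logger = slog::Logger::root(drain, o!());
    slog::info!(logger, "hello from the json logger");
}
```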

weihanglo

comment created time in 4 days

PR opened tikv/tlaplus-specs

Reviewers
*: add basic raft spec

Add a basic raft spec that model checking can be run against.

The spec is written according to the implementation details of raft-rs. But it doesn't model ticks or messages, in order to reduce the state space.

+262 -0

0 comment

3 changed files

pr created time in 5 days

push event BusyJay/tlaplus-specs

Jay Lee

commit sha ae54abf638648350bbac613cebbaa2bb906651c9

rename directory Signed-off-by: Jay Lee <busyjaylee@gmail.com>

view details

push time in 5 days

push event BusyJay/tlaplus-specs

Jay Lee

commit sha db29438ca8320c47dd79960c84216dd554a0ceb1

add comments and optimize commit Signed-off-by: Jay Lee <busyjaylee@gmail.com>

view details

push time in 5 days

create branch BusyJay/tlaplus-specs

branch : basic-raft

created branch time in 5 days

fork BusyJay/tlaplus-specs

TiKV TLA+ specifications

fork in 5 days

PR opened tikv/raft-rs

Reviewers
*: add quorum package

Adds quorum package and majority configuration. Configuration in tracker is also updated. The quorum package is ported from etcd master.

+273 -167

0 comment

6 changed files

pr created time in 6 days

create branch BusyJay/raft-rs

branch : quorum

created branch time in 6 days

issue comment tikv/raft-rs

port etcd joint consensus

Thanks @accelsao! I'm already working on it.

BusyJay

comment created time in 7 days

delete branch BusyJay/raft-rs

delete branch : extract-tracker

delete time in 8 days

issue comment pingcap/tidb-operator

grafana version is outdated

I suggest keeping it consistent across products instead of just upgrading to the latest.

BusyJay

comment created time in 9 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl<B: Backoff + Send + 'static> RequestExecutor<B> {         };         spawn!(client, keep_running, "streaming ping pong", f);     }++    fn execute_stream_from_client(mut self) {

...I did not fully refer to the implementation of grpc-cpp

You should. It's expected to work across implementations.

hunterlxt

comment created time in 9 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl SinkBase {         self.batch_f.take();         Poll::Ready(Ok(()))     }++    #[inline]+    fn poll_flush<C: ShareCallHolder>(+        &mut self,+        cx: &mut Context,+        call: &mut C,+    ) -> Poll<Result<()>> {+        if self.batch_f.is_some() {+            ready!(self.poll_ready(cx)?);+        }+        if self.buf_flags.is_some() {+            self.start_send_buffer_message(self.last_buf_hint, call)?;+            ready!(self.poll_ready(cx)?);+        }+        Poll::Ready(Ok(()))+    }++    #[inline]+    fn start_send_buffer_message<C: ShareCallHolder>(+        &mut self,+        buffer_hint: bool,+        call: &mut C,+    ) -> Result<()> {+        // `start_send` is supposed to be called after `poll_ready` returns ready.+        assert!(self.batch_f.is_none());++        let mut flags = self.buf_flags.clone().unwrap();+        flags = flags.buffer_hint(buffer_hint);+        let write_f = call.call(|c| {+            c.call+                .start_send_message(&self.buffer, flags.flags, self.send_metadata)+        })?;+        self.batch_f = Some(write_f);+        self.buf_flags.take();+        // NOTE: Content of `self.buf` is copied into grpc internal.+        self.buffer.clear();+        if self.buffer.capacity() > BUF_SHRINK_SIZE {+            self.buffer.truncate(BUF_SHRINK_SIZE);

This is wrong. The whole point of the buffer is to reduce allocation; this implementation will make it allocate more.

hunterlxt

comment created time in 9 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl SinkBase {         self.batch_f.take();         Poll::Ready(Ok(()))     }++    #[inline]+    fn poll_flush<C: ShareCallHolder>(+        &mut self,+        cx: &mut Context,+        call: &mut C,+    ) -> Poll<Result<()>> {+        if self.batch_f.is_some() {+            ready!(self.poll_ready(cx)?);+        }+        if self.buf_flags.is_some() {+            self.start_send_buffer_message(self.last_buf_hint, call)?;+            ready!(self.poll_ready(cx)?);+        }+        Poll::Ready(Ok(()))+    }++    #[inline]+    fn start_send_buffer_message<C: ShareCallHolder>(+        &mut self,+        buffer_hint: bool,+        call: &mut C,+    ) -> Result<()> {+        // `start_send` is supposed to be called after `poll_ready` returns ready.+        assert!(self.batch_f.is_none());++        let mut flags = self.buf_flags.clone().unwrap();+        flags = flags.buffer_hint(buffer_hint);+        let write_f = call.call(|c| {+            c.call+                .start_send_message(&self.buffer, flags.flags, self.send_metadata)+        })?;+        self.batch_f = Some(write_f);+        self.buf_flags.take();+        // NOTE: Content of `self.buf` is copied into grpc internal.+        self.buffer.clear();+        if self.buffer.capacity() > BUF_SHRINK_SIZE {+            self.buffer.truncate(BUF_SHRINK_SIZE);

Because it's cleared at L749, truncate will do nothing, and shrink_to_fit may shrink it to 0.
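
A standalone illustration of the behaviour being pointed out here: once the buffer has been cleared, `truncate` is a no-op and `shrink_to_fit` can drop the capacity all the way down, defeating the point of reusing the allocation.

```rust
fn main() {
    const BUF_SHRINK_SIZE: usize = 4 * 1024;

    // Pretend a large message was just serialized into the send buffer.
    let mut buf: Vec<u8> = Vec::with_capacity(64 * 1024);
    buf.extend_from_slice(&[0u8; 16 * 1024]);

    // Order used in the reviewed code: clear first, then try to shrink.
    buf.clear();
    if buf.capacity() > BUF_SHRINK_SIZE {
        // `truncate` only shortens `len`, which is already 0, so it is a no-op...
        buf.truncate(BUF_SHRINK_SIZE);
        // ...and `shrink_to_fit` shrinks capacity towards `len`, i.e. towards 0,
        // so the allocation the buffer was meant to reuse is thrown away.
        buf.shrink_to_fit();
    }
    assert!(buf.capacity() < BUF_SHRINK_SIZE);
    println!("capacity after clear + truncate + shrink_to_fit: {}", buf.capacity());
}
```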

hunterlxt

comment created time in 9 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl WriteFlags {     }      /// Get whether buffer hint is enabled.-    pub fn get_buffer_hint(self) -> bool {+    pub fn get_buffer_hint(&self) -> bool {         (self.flags & grpc_sys::GRPC_WRITE_BUFFER_HINT) != 0     }      /// Get whether compression is disabled.-    pub fn get_force_no_compress(self) -> bool {+    pub fn get_force_no_compress(&self) -> bool {         (self.flags & grpc_sys::GRPC_WRITE_NO_COMPRESS) != 0     } }  /// A helper struct for constructing Sink object for batch requests.+///+/// By default the sink works in normal mode, that is, `start_send` always starts to send message+/// immediately. But when the `enhance_buffer_strategy` is enabled, the stream will be batched+/// together as much as possible. The specific rule is listed below:+/// Set the `buffer_hint` of the non-end message in the stream to true. And set the `buffer_hint`+/// of the last message to false in `poll_flush` only when there is at least one message with the+/// `buffer_hint` false, so that the previously bufferd messages will be sent out. struct SinkBase {+    // Batch job to be executed in `poll_ready`.     batch_f: Option<BatchFuture>,-    buf: Vec<u8>,     send_metadata: bool,+    // Flag to indicate if enhance batch strategy. This behavior will modify the `buffer_hint` to batch+    // messages as much as possible.+    enhance_buffer_strategy: bool,+    // Buffer used to store the data to be sent, send out the last data in this round of `start_send`.+    // Note: only used in enhanced buffer strategy.+    buffer: Vec<u8>,+    // Write flags used to control the data to be sent in `buffer`.+    // Note: only used in enhanced buffer strategy.+    buf_flags: Option<WriteFlags>,+    // Used to records whether a message in which `buffer_hint` is false exists.+    // Note: only used in enhanced buffer strategy.+    last_buf_hint: bool, }  impl SinkBase {     fn new(send_metadata: bool) -> SinkBase {         SinkBase {             batch_f: None,-            buf: Vec::new(),+            buffer: Vec::new(),+            buf_flags: None,+            last_buf_hint: true,             send_metadata,+            enhance_buffer_strategy: false,         }     }      fn start_send<T, C: ShareCallHolder>(         &mut self,         call: &mut C,         t: &T,-        mut flags: WriteFlags,+        flags: WriteFlags,         ser: SerializeFn<T>,     ) -> Result<()> {-        // `start_send` is supposed to be called after `poll_ready` returns ready.-        assert!(self.batch_f.is_none());

Why remove the assert?

hunterlxt

comment created time in 9 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl WriteFlags {     }      /// Get whether buffer hint is enabled.-    pub fn get_buffer_hint(self) -> bool {+    pub fn get_buffer_hint(&self) -> bool {         (self.flags & grpc_sys::GRPC_WRITE_BUFFER_HINT) != 0     }      /// Get whether compression is disabled.-    pub fn get_force_no_compress(self) -> bool {+    pub fn get_force_no_compress(&self) -> bool {         (self.flags & grpc_sys::GRPC_WRITE_NO_COMPRESS) != 0     } }  /// A helper struct for constructing Sink object for batch requests.+///+/// By default the sink works in normal mode, that is, `start_send` always starts to send message+/// immediately. But when the `enhance_buffer_strategy` is enabled, the stream will be batched+/// together as much as possible. The specific rule is listed below:+/// Set the `buffer_hint` of the non-end message in the stream to true. And set the `buffer_hint`+/// of the last message to false in `poll_flush` only when there is at least one message with the+/// `buffer_hint` false, so that the previously bufferd messages will be sent out. struct SinkBase {+    // Batch job to be executed in `poll_ready`.     batch_f: Option<BatchFuture>,-    buf: Vec<u8>,     send_metadata: bool,+    // Flag to indicate if enhance batch strategy. This behavior will modify the `buffer_hint` to batch+    // messages as much as possible.+    enhance_buffer_strategy: bool,+    // Buffer used to store the data to be sent, send out the last data in this round of `start_send`.+    // Note: only used in enhanced buffer strategy.

I think it's used no matter what buffer strategy is used.

hunterlxt

comment created time in 9 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl WriteFlags {     }      /// Get whether buffer hint is enabled.-    pub fn get_buffer_hint(self) -> bool {+    pub fn get_buffer_hint(&self) -> bool {         (self.flags & grpc_sys::GRPC_WRITE_BUFFER_HINT) != 0     }      /// Get whether compression is disabled.-    pub fn get_force_no_compress(self) -> bool {+    pub fn get_force_no_compress(&self) -> bool {         (self.flags & grpc_sys::GRPC_WRITE_NO_COMPRESS) != 0     } }  /// A helper struct for constructing Sink object for batch requests.+///+/// By default the sink works in normal mode, that is, `start_send` always starts to send message

Why paste the comment here?

hunterlxt

comment created time in 9 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl<Req> StreamingCallSink<Req> {         }     } +    /// By default the sink works in normal mode, that is, `start_send` always starts to send message+    /// immediately. But when the `enhance_batch` is enabled, the stream will be batched together as+    /// much as possible. The specific rule is listed below:
    /// much as possible. The specific rules are listed below:
hunterlxt

comment created time in 9 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl<B: Backoff + Send + 'static> RequestExecutor<B> {         };         spawn!(client, keep_running, "streaming ping pong", f);     }++    fn execute_stream_from_client(mut self) {

Can you reference the original implementation?

hunterlxt

comment created time in 9 days

issue opened pingcap/tidb-operator

grafana version is outdated

Bug Report

TiDB 4.0 now uses Grafana 6.1.6, but when deploying with tidb-operator, it uses 6.0.1 instead.

created time in 9 days

push event BusyJay/raft-rs

Jay Lee

commit sha a2bd0224ba1c631e3e4ae6d147f243b2710c5bac

bump minimun supported rustc Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

view details

push time in 9 days

Pull request review comment tikv/tikv

raftstore: handle the race between creating new peer and splitting correctly

 where             {                 peer.set_id(*peer_id);             }-            write_peer_state(kv_wb_mut, &new_region, PeerState::Normal, None)+            new_regions_map.insert(+                new_region.get_id(),+                (+                    util::find_peer(&new_region, ctx.store_id).unwrap().get_id(),+                    None,+                ),+            );+            regions.push(new_region);+        }++        if right_derive {+            derived.set_start_key(keys.pop_front().unwrap());+            regions.push(derived.clone());+        }++        for (region_id, (_, reason)) in new_regions_map.iter_mut() {+            let region_state_key = keys::region_state_key(*region_id);+            match ctx+                .engine+                .get_msg_cf::<RegionLocalState>(CF_RAFT, &region_state_key)+            {+                Ok(None) => (),+                Ok(Some(state)) => {+                    *reason = Some(format!("state {:?} exist in kv engine", state));+                }+                e => panic!(+                    "{} failed to get regions state of {}: {:?}",+                    self.tag, region_id, e+                ),+            }+        }++        // Note that the following execution sequence is possible.+        // Apply thread:    check `RegionLocalState`(None)+        // Store thread:    create peer+        // Peer thread:     apply snapshot and then be destroyed+        // Apply thread:    check `StoreMeta` and find it's ok to create this new region.+        // It's **very unlikely** to happen because the step 2 and step 3 should take far more time than the time interval+        // between step 1 and step 4. (A similiar case can happen in create-peer process, see details in `maybe_create_peer`)+        // Even it happens, this new region will be destroyed in future when it communicates to other TiKVs or PD.+        // Now it seems there is no other side effects.+        let mut meta = ctx.store_meta.lock().unwrap();

I think you can move L1885~L1912 to L1858 and clean up the metadata when a conflict is detected. Then the corner case described in the comment can't happen.

gengliqi

comment created time in 10 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl BenchmarkService for Benchmark {     fn streaming_from_client(         &mut self,         ctx: RpcContext,-        _: RequestStream<SimpleRequest>,+        mut stream: RequestStream<SimpleRequest>,         sink: ClientStreamingSink<SimpleResponse>,     ) {-        let f = sink.fail(RpcStatus::new(RpcStatusCode::UNIMPLEMENTED, None));+        let f = async move {+            let mut req = SimpleRequest::default();+            while let Some(r) = stream.try_next().await? {+                req = r;+            }+            if req.get_response_size() > 0 {+                sink.success(gen_resp(&req)).await?;+            }+            Ok(())

When the response size is zero, cpp will still respond with a default response.

hunterlxt

comment created time in 10 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl BenchmarkService for Benchmark {     fn streaming_from_client(         &mut self,         ctx: RpcContext,-        _: RequestStream<SimpleRequest>,+        mut stream: RequestStream<SimpleRequest>,         sink: ClientStreamingSink<SimpleResponse>,     ) {-        let f = sink.fail(RpcStatus::new(RpcStatusCode::UNIMPLEMENTED, None));+        let f = async move {+            let mut resp: Option<SimpleResponse> = None;+            while let Some(req) = stream.try_next().await? {

Then the implementation is wrong here; cpp constructs the response using the last request.
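
A self-contained sketch of the behaviour being asked for, using only the `futures` crate and hypothetical trimmed-down request/response types in place of the grpcio service code: consume the whole client stream, remember the last request, and always produce a response from it.

```rust
use futures::{executor, pin_mut, stream, Stream, StreamExt};

// Hypothetical stand-ins; the real types come from the benchmark proto.
#[derive(Default, Debug)]
struct SimpleRequest {
    response_size: i32,
}

#[derive(Debug)]
struct SimpleResponse {
    payload_len: i32,
}

fn gen_resp(req: &SimpleRequest) -> SimpleResponse {
    SimpleResponse { payload_len: req.response_size }
}

async fn handle_client_stream(requests: impl Stream<Item = SimpleRequest>) -> SimpleResponse {
    pin_mut!(requests);
    // Remember only the *last* request in the client stream.
    let mut last_req = SimpleRequest::default();
    while let Some(r) = requests.next().await {
        last_req = r;
    }
    // Respond unconditionally: even response_size == 0 still gets a reply,
    // matching the grpc-cpp behaviour described in the review.
    gen_resp(&last_req)
}

fn main() {
    let reqs = stream::iter(vec![
        SimpleRequest { response_size: 16 },
        SimpleRequest { response_size: 0 },
    ]);
    let resp = executor::block_on(handle_client_stream(reqs));
    println!("responding with {:?}", resp);
}
```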

hunterlxt

comment created time in 10 days

Pull request review comment tikv/tikv

raftstore: handle the race between creating new peer and splitting correctly

 where             {                 peer.set_id(*peer_id);             }-            write_peer_state(kv_wb_mut, &new_region, PeerState::Normal, None)+            new_regions_map.insert(+                new_region.get_id(),+                (+                    util::find_peer(&new_region, ctx.store_id).unwrap().get_id(),+                    None,+                ),+            );+            regions.push(new_region);+        }++        if right_derive {+            derived.set_start_key(keys.pop_front().unwrap());+            regions.push(derived.clone());+        }++        for (region_id, (_, reason)) in new_regions_map.iter_mut() {+            let region_state_key = keys::region_state_key(*region_id);+            match ctx+                .engine+                .get_msg_cf::<RegionLocalState>(CF_RAFT, &region_state_key)+            {+                Ok(None) => (),+                Ok(Some(state)) => {+                    *reason = Some(format!("state {:?} exist in kv engine", state));+                }+                e => panic!(+                    "{} failed to get regions state of {}: {:?}",+                    self.tag, region_id, e+                ),+            }+        }++        // Note that the following execution sequence is possible.+        // Apply thread:    check `RegionLocalState`(None)+        // Store thread:    create peer+        // Peer thread:     apply snapshot and then be destroyed+        // Apply thread:    check `StoreMeta` and find it's ok to create this new region.+        // It's **very unlikely** to happen because the step 2 and step 3 should take far more time than the time interval+        // between step 1 and step 4. (A similiar case can happen in create-peer process, see details in `maybe_create_peer`)+        // Even it happens, this new region will be destroyed in future when it communicates to other TiKVs or PD.+        // Now it seems there is no other side effects.+        let mut meta = ctx.store_meta.lock().unwrap();+        for (region_id, (peer_id, reason)) in new_regions_map.iter_mut() {+            if reason.is_some() {+                continue;+            }+            if let Some(r) = meta.regions.get(region_id) {+                if util::is_region_initialized(r) {+                    *reason = Some(format!("region {:?} has already initialized", r));+                } else {+                    // If the region in `meta.regions` is not initialized, it must exist in `meta.pending_create_peers`.+                    let status = meta.pending_create_peers.get_mut(region_id).unwrap();+                    // If they are the same peer, the new one from splitting can replace it.+                    // Because it must be uninitialized. 
See detailes in `check_snapshot`.+                    if *status == (*peer_id, false) {+                        *status = (*peer_id, true);+                    } else {+                        *reason = Some(format!("status {:?} is not expected", status));+                    }+                }+            } else {+                assert_eq!(+                    meta.pending_create_peers+                        .insert(*region_id, (*peer_id, true)),+                    None+                );+            }+        }+        drop(meta);++        let kv_wb_mut = ctx.kv_wb.as_mut().unwrap();+        for new_region in &regions {+            if new_region.get_id() == derived.get_id() {+                continue;+            }+            let (new_peer_id, reason) = new_regions_map.get(&new_region.get_id()).unwrap();+            if let Some(r) = reason {+                warn!(

Should be info.

gengliqi

comment created time in 10 days

Pull request review comment tikv/tikv

raftstore: handle the race between creating new peer and splitting correctly

 where             {                 peer.set_id(*peer_id);             }-            write_peer_state(kv_wb_mut, &new_region, PeerState::Normal, None)+            new_regions_map.insert(+                new_region.get_id(),+                (+                    util::find_peer(&new_region, ctx.store_id).unwrap().get_id(),+                    None,+                ),+            );+            regions.push(new_region);+        }++        if right_derive {+            derived.set_start_key(keys.pop_front().unwrap());+            regions.push(derived.clone());+        }++        for (region_id, (_, reason)) in new_regions_map.iter_mut() {+            let region_state_key = keys::region_state_key(*region_id);+            match ctx+                .engine+                .get_msg_cf::<RegionLocalState>(CF_RAFT, &region_state_key)+            {+                Ok(None) => (),+                Ok(Some(state)) => {+                    *reason = Some(format!("state {:?} exist in kv engine", state));+                }+                e => panic!(+                    "{} failed to get regions state of {}: {:?}",+                    self.tag, region_id, e+                ),+            }+        }++        // Note that the following execution sequence is possible.+        // Apply thread:    check `RegionLocalState`(None)+        // Store thread:    create peer+        // Peer thread:     apply snapshot and then be destroyed+        // Apply thread:    check `StoreMeta` and find it's ok to create this new region.+        // It's **very unlikely** to happen because the step 2 and step 3 should take far more time than the time interval+        // between step 1 and step 4. (A similiar case can happen in create-peer process, see details in `maybe_create_peer`)+        // Even it happens, this new region will be destroyed in future when it communicates to other TiKVs or PD.+        // Now it seems there is no other side effects.+        let mut meta = ctx.store_meta.lock().unwrap();

Why not just lock it at L1859?

gengliqi

comment created time in 10 days

Pull request review comment tikv/tikv

raftstore: handle the race between creating new peer and splitting correctly

 where             derived.set_end_key(keys.front().unwrap().to_vec());             regions.push(derived.clone());         }-        let kv_wb_mut = ctx.kv_wb.as_mut().unwrap();++        // region_id -> (peer_id, None or Some(split failed reason))+        let mut new_regions_map: HashMap<u64, (u64, Option<String>)> = HashMap::default();

I think you can make the tuple an enum or a struct to make it more readable.

gengliqi

comment created time in 10 days

push event BusyJay/raft-rs

Jay Lee

commit sha a24773718e060e6b3d1cc57b423205db99af6b08

make clippy happy Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

view details

push time in 10 days

PR opened tikv/raft-rs

Reviewers
*: extract tracker package

Just moving code around, no logic is changed.

+216 -208

0 comment

7 changed files

pr created time in 10 days

create branch BusyJay/raft-rs

branch : extract-tracker

created branch time in 10 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl BenchmarkService for Benchmark {     fn streaming_from_client(         &mut self,         ctx: RpcContext,-        _: RequestStream<SimpleRequest>,+        mut stream: RequestStream<SimpleRequest>,         sink: ClientStreamingSink<SimpleResponse>,     ) {-        let f = sink.fail(RpcStatus::new(RpcStatusCode::UNIMPLEMENTED, None));+        let f = async move {+            let mut resp: Option<SimpleResponse> = None;+            while let Some(req) = stream.try_next().await? {

Can you reference the original implementation?

hunterlxt

comment created time in 10 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl<Req> StreamingCallSink<Req> {         }     } +    /// By default the sink works in normal mode, that is, `start_send` always starts to send message+    /// immediately. But when the `enhance_batch` is enabled, the stream will be batched together as+    /// much as possible. The specific rule is listed below:+    /// Set the `buffer_hint` of the non-end message in the stream to true, and set the `buffer_hint` of+    /// the last message to false in `poll_flush`, so that the previously bufferd messages will be sent out.

The implementation doesn't match the description. It sets the buffer hint to false only when there is at least one false flag.

hunterlxt

comment created time in 10 days

issue comment tikv/raft-rs

port etcd joint consensus

I will send a new PR and do it in a different way.

BusyJay

comment created time in 10 days

push event ti-srebot/docs

Jay Lee

commit sha aa6784de0421771666ab1c67bec4186529b0841a

resolve conflict Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

view details

push time in 10 days

issue opened tikv/raft-rs

port etcd joint consensus

We have developed our own implementation, but as discussed in the past, we decided to go with the community and port the etcd implementation instead.

There will be several PRs:

  • [ ] https://github.com/etcd-io/etcd/pull/10779
  • [ ] https://github.com/etcd-io/etcd/pull/10865
  • [ ] https://github.com/etcd-io/etcd/pull/10884
  • [ ] https://github.com/etcd-io/etcd/pull/10889
  • [ ] https://github.com/etcd-io/etcd/pull/10914
  • [ ] https://github.com/etcd-io/etcd/pull/11003
  • [ ] https://github.com/etcd-io/etcd/pull/11005
  • [ ] https://github.com/etcd-io/etcd/pull/11046

The PRs don't have to be picked one by one. And we will probably sync the code with upstream first.

created time in 13 days

issue closed tikv/tikv

raft: support joint consensus for cluster membership change

Etcd uses a simple implementation for membership change (adding/removing one peer at a time when applying the raft log).

This works well most of the time, but it may still carry risk, especially when PD does balancing.

E.g., with three racks 1, 2 and 3, each rack has 2 machines (we use h11, h12 for the machines in rack1, and so on). PD first schedules three peers p1, p2, p3 to h11, h21 and h31, then it finds that h11 has a high load, so it decides to add a new peer p4 to h12 and remove p1 from h11.

If rack 1 goes down after adding p4, the region can't provide service. To avoid this, we must add p4 and remove p1 atomically, but right now we can't support that.

Supporting joint consensus can fix this problem, but this is different from etcd, and we must do many tests to verify the correctness and cover the corner cases.

/cc @ngaut @xiang90 @BusyJay @hhkbp2

closed time in 13 days

siddontang

issue comment tikv/tikv

raft: support joint consensus for cluster membership change

It's tracked by #7587 now.

siddontang

comment created time in 13 days

Pull request review comment tikv/tikv

raftstore: handle the race between creating new peer and splitting correctly

 where                 end_key,             } => {                 fail_point!("on_region_worker_destroy", true, |_| {});-                // try to delay the range deletion because-                // there might be a coprocessor request related to this range-                self.ctx-                    .insert_pending_delete_range(region_id, &start_key, &end_key);--                // try to delete stale ranges if there are any-                self.ctx.clean_stale_ranges();+                // If `region_id` is 0, it's used for deleting extra range from splitting.+                if region_id == 0 {+                    self.ctx.cleanup_range(

Why delete it immediately? There may still be running queries.

gengliqi

comment created time in 13 days

Pull request review comment tikv/raft-rs

implement committed entries pagination

 fn test_set_priority() {
         assert_eq!(raw_node.raft.priority, p);
     }
 }
+
+// test_append_pagination ensures that a message will never be sent with entries size overflowing the `max_msg_size`
+#[test]
+fn test_append_pagination() {
+    use std::cell::Cell;
+    use std::rc::Rc;
+    let l = default_logger();
+    let mut config = new_test_config(1, 10, 1);
+    let max_size_per_msg = 2048;
+    config.max_size_per_msg = max_size_per_msg;
+    let mut nt = Network::new_with_config(vec![None, None, None], &config, &l);
+    let seen_full_msg = Rc::new(Cell::new(false));
+    let b = seen_full_msg.clone();
+    nt.msg_hook = Some(Box::new(move |m: &Message| -> bool {
+        if m.msg_type == MessageType::MsgAppend {
+            let total_size = m.entries.iter().fold(0, |acc, e| acc + e.data.len());
+            if total_size as u64 > max_size_per_msg {
+                panic!("sent MsgApp that is too large: {} bytes", total_size);
+            }
+            if total_size as u64 > max_size_per_msg / 2 {
+                b.set(true);
+            }
+        }
+        true
+    }));
+    nt.send(vec![new_message(1, 1, MessageType::MsgHup, 0)]);
+    nt.isolate(1);
+    for _ in 0..5 {
+        let data = "a".repeat(1000);
+        nt.send(vec![new_message_with_entries(
+            1,
+            1,
+            MessageType::MsgPropose,
+            vec![new_entry(0, 0, Some(&data))],
+        )]);
+    }
+    nt.recover();
+    // After the partition recovers, tick the clock to wake everything
+    // back up and send the messages.
+    nt.send(vec![new_message(1, 1, MessageType::MsgBeat, 0)]);
+    assert!(
+        seen_full_msg.get(),
+        "didn't see any messages more than half the max size; something is wrong with this test"
+    );
+}
+
+// test_commit_pagination ensures that the max size of committed entries must be limit under `max_committed_size_per_ready` to per ready
+#[test]
+fn test_commit_pagination() {
+    let l = default_logger();
+    let storage = MemStorage::new_with_conf_state((vec![1], vec![]));
+    let mut config = new_test_config(1, 10, 1);
+    config.max_committed_size_per_ready = 2048;
+    let mut raw_node = RawNode::new(&config, storage, &l).unwrap();
+    raw_node.campaign().unwrap();
+    let rd = raw_node.ready();
+    let committed_len = rd.committed_entries.as_ref().unwrap().len();
+    assert_eq!(
+        committed_len, 1,
+        "expected 1 (empty) entry, got {}",
+        committed_len
+    );
+    raw_node.mut_store().wl().append(rd.entries()).unwrap();
+    raw_node.advance(rd);
+    let blob = "a".repeat(1000).into_bytes();
+    for _ in 0..3 {
+        raw_node.propose(vec![], blob.clone()).unwrap();
+    }
+    // The 3 proposals will commit in two batches.
+    let rd = raw_node.ready();
+    let committed_len = rd.committed_entries.as_ref().unwrap().len();
+    assert_eq!(
+        committed_len, 2,
+        "expected 2 entries in first batch, got {}",
+        committed_len
+    );
+    raw_node.mut_store().wl().append(rd.entries()).unwrap();
+    raw_node.advance(rd);
+
+    let rd = raw_node.ready();
+    let committed_len = rd.committed_entries.as_ref().unwrap().len();
+    assert_eq!(
+        committed_len, 1,
+        "expected 1 entry in second batch, got {}",
+        committed_len
+    );
+    raw_node.mut_store().wl().append(rd.entries()).unwrap();
+    raw_node.advance(rd);
+}
+
+// test_commit_pagination_after_restart regression tests a scenario in which the
+// Storage's Entries size limitation is slightly more permissive than Raft's
+// internal one
+//
+// - node learns that index 11 is committed
+// - next_entries returns index 1..10 in committed_entries (but index 10 already
+//   exceeds maxBytes), which isn't noticed internally by Raft
+// - Commit index gets bumped to 10
+// - the node persists the HardState, but crashes before applying the entries
+// - upon restart, the storage returns the same entries, but `slice` takes a
+//   different code path and removes the last entry.
+// - Raft does not emit a HardState, but when the app calls advance(), it bumps
+//   its internal applied index cursor to 10 (when it should be 9)
+// - the next Ready asks the app to apply index 11 (omitting index 10), losing a
+//    write.
+#[test]
+fn test_commit_pagination_after_restart() {
+    let mut persisted_hard_state = HardState::default();
+    persisted_hard_state.set_term(1);
+    persisted_hard_state.set_vote(1);
+    persisted_hard_state.set_commit(10);
+    let s = IgnoreSizeHintMemStorage::default();
+    s.inner.wl().set_hardstate(persisted_hard_state);
+    let ents_count = 10;
+    let mut ents = Vec::with_capacity(ents_count);
+    let mut size = 0u64;
+    for i in 0..ents_count as u64 {
+        let e = new_entry(1, i + 1, Some("a"));
+        size += u64::from(e.compute_size());
+        ents.push(e);
+    }
+    s.inner.wl().append(&ents).unwrap();
+
+    let mut cfg = new_test_config(1, 10, 1);
+    // Set a max_size_per_msg that would suggest to Raft that the last committed entry should
+    // not be included in the initial rd.committed_entries. However, our storage will ignore
+    // this and *will* return it (which is how the Commit index ended up being 10 initially).
+    cfg.max_size_per_msg = size - 1;

I think it should be size - uint64(s.ents[len(s.ents)-1].Size()) - 1. The purpose is to let raft return 9 entries instead of 10, so that the entry at index 10 gets missed.
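A minimal sketch of that suggestion in the test's own terms, assuming the cfg, size, and ents bindings from the excerpt above:

// Hedged sketch: also subtract the last entry's encoded size, so raft's internal
// limit admits only the first 9 entries while the permissive storage still
// returns all 10, recreating the off-by-one the regression test is after.
cfg.max_size_per_msg = size - u64::from(ents.last().unwrap().compute_size()) - 1;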

Fullstop000

comment created time in 13 days

pull request comment tikv/tikv

raftstore: prevent unsafe local read during merging

/bench

gengliqi

comment created time in 13 days

Pull request review comment tikv/tikv

raftstore: prevent unsafe local read during merging

 impl Peer {
             ctx.current_time.replace(monotonic_raw_now());
         }
 
-        // The leader can write to disk and replicate to the followers concurrently
-        // For more details, check raft thesis 10.2.1.
         if self.is_leader() {
+            if let Some(hs) = ready.hs() {
+                // Correctness depends on the fact that the leader lease must be suspected before
+                // other followers know the `PrepareMerge` log is committed, i.e. sends msg to others.
+                // Because other followers may complete the merge process, if so, the source region's
+                // leader may get a stale data.
+                //
+                // Check the committed entries.
+                // TODO: It can change to not rely on the `committed_entries` must have the latest committed entry
+                // and become O(1) by maintaining these not-committed admin requests that changes epoch.
+                if hs.get_commit() > self.get_store().committed_index() {
+                    assert_eq!(

Can be changed to debug_assert.
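For context, a tiny self-contained illustration of the difference (the operands of the original assert_eq! are cut off in the excerpt, so the values below are made up): debug_assert_eq! only runs when debug assertions are enabled, so the check costs nothing in release builds.

fn main() {
    // Hypothetical values standing in for the elided operands of the assert.
    let committed_term = 5u64;
    let current_term = 5u64;
    // assert_eq! panics in every build profile; debug_assert_eq! is compiled
    // out in release builds.
    debug_assert_eq!(committed_term, current_term);
}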

gengliqi

comment created time in 13 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl SinkBase {
         &mut self,
         call: &mut C,
         t: &T,
-        mut flags: WriteFlags,
+        flags: WriteFlags,
         ser: SerializeFn<T>,
     ) -> Result<()> {
-        // `start_send` is supposed to be called after `poll_ready` returns ready.
-        assert!(self.batch_f.is_none());
-
-        self.buf.clear();
-        ser(t, &mut self.buf);
-        if flags.get_buffer_hint() && self.send_metadata {
-            // temporary fix: buffer hint with send meta will not send out any metadata.
-            flags = flags.buffer_hint(false);
+        // temporary fix: buffer hint with send meta will not send out any metadata.
+        // note: only the first message can enter this code block.
+        if self.send_metadata {
+            ser(t, &mut self.buf);
+            self.buf_flags = Some(flags);
+            self.start_send_buffer_message(false, call)?;
+            self.send_metadata = false;
+            return Ok(());
         }
-        let write_f = call.call(|c| {
-            c.call
-                .start_send_message(&self.buf, flags.flags, self.send_metadata)
-        })?;
-        // NOTE: Content of `self.buf` is copied into grpc internal.
-        if self.buf.capacity() > BUF_SHRINK_SIZE {
-            self.buf.truncate(BUF_SHRINK_SIZE);
-            self.buf.shrink_to_fit();
+        // If there is already a buffered message waiting to be sent, set `buffer_hint` to true to indicate
+        // that this is not the last message.
+        if self.buf_flags.is_some() {
+            // `start_send` is supposed to be called after `poll_ready` returns ready.
+            assert!(self.batch_f.is_none());
+            self.start_send_buffer_message(true, call)?;

Any tests?

hunterlxt

comment created time in 13 days

Pull request review comment tikv/grpc-rs

Enhance sinks to make them batchable

 impl SinkBase {
         self.batch_f.take();
         Poll::Ready(Ok(()))
     }
+
+    #[inline]
+    fn poll_flush<C: ShareCallHolder>(
+        &mut self,
+        cx: &mut Context,
+        call: &mut C,
+    ) -> Poll<Result<()>> {
+        if self.batch_f.is_some() {
+            ready!(self.poll_ready(cx)?);
+        }
+        if self.buf_flags.is_some() {
+            self.start_send_buffer_message(self.buf_buffer_hint, call)?;
+            ready!(self.poll_ready(cx)?);
+        }
+        Poll::Ready(Ok(()))
+    }
+
+    #[inline]
+    fn start_send_buffer_message<C: ShareCallHolder>(
+        &mut self,
+        buffer_hint: bool,
+        call: &mut C,
+    ) -> Result<()> {
+        let mut flags = self.buf_flags.clone().unwrap();
+        flags = flags.buffer_hint(buffer_hint);
+        let write_f = call.call(|c| {
+            c.call
+                .start_send_message(&self.buf, flags.flags, self.send_metadata)
+        })?;
+        self.batch_f = Some(write_f);

This can go wrong. Say the user sends [(msg1, not_buffered), (msg2, buffered)]: the old implementation will only keep msg2 in the buffer, but the new implementation will keep both.
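A toy model in plain Rust (not the grpcio types) of the counts described above; it only restates the comment's claim, it is not taken from the PR:

fn main() {
    // (payload, buffer_hint) pairs as in the example above.
    let msgs = vec![("msg1", false), ("msg2", true)];

    // Old behaviour as described: a message without buffer_hint goes out
    // immediately, so only hinted messages stay pending.
    let old_pending = msgs.iter().filter(|(_, hint)| *hint).count();

    // New behaviour as read from this diff: every message waits in the
    // internal buffer until poll_flush pushes it out.
    let new_pending = msgs.len();

    assert_eq!(old_pending, 1);
    assert_eq!(new_pending, 2);
}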

hunterlxt

comment created time in 13 days

Pull request review comment tikv/tikv

raftstore: support smoothly switch replication mode

 impl Peer {
         None
     }
 
-    fn region_replication_status(&mut self) -> Option<RegionReplicationStatus> {
+    fn region_replication_status<T, C>(
+        &mut self,
+        ctx: &PollContext<T, C>,
+    ) -> Option<RegionReplicationStatus> {
         if self.replication_mode_version == 0 {
             return None;
         }
         let mut status = RegionReplicationStatus::default();
         status.state_id = self.replication_mode_version;
         let state = if !self.replication_sync {
             if self.dr_auto_sync_state != DrAutoSyncState::Async {
-                let res = self.raft_group.raft.check_group_commit_consistent();
+                let res = self.check_group_commit_consistent(ctx.cfg.group_consistent_log_gap);
+                self.check_wait_sync_deadline(res.unwrap_or(false), ctx.get_current_time());

This works at minute-level cycles, which may not be frequent enough.

NingLin-P

comment created time in 14 days

Pull request review comment tikv/tikv

raftstore: support smoothly switch replication mode

 fn test_loading_label_after_rolling_start() {
     assert_eq!(state.state_id, 1);
     assert_eq!(state.state, RegionReplicationState::IntegrityOverLabel);
 }
+
+// Delay replication mode switch if groups consistent can't reach immediately,
+// until groups consistent reached or timeout reached.
+#[test]
+fn test_delaying_switch_replication_mode() {
+    let mut cluster = prepare_cluster();
+    let region = cluster.get_region(b"k1");
+    cluster.add_send_filter(IsolationFilterFactory::new(3));
+    cluster
+        .pd_client
+        .switch_replication_mode(DrAutoSyncState::Async, None);
+    thread::sleep(Duration::from_millis(100));
+    cluster.must_put(b"k2", b"v2");
+    thread::sleep(Duration::from_millis(100));
+    let state = cluster.pd_client.region_replication_status(region.get_id());
+    assert_eq!(state.state_id, 2);
+    assert_eq!(state.state, RegionReplicationState::SimpleMajority);
+
+    // Replication mode not switch yet, so log entry still can be committed
+    cluster
+        .pd_client
+        .switch_replication_mode(DrAutoSyncState::SyncRecover, Some(1)); // Delay for 1s
+    thread::sleep(Duration::from_millis(100));
+    cluster.must_put(b"k3", b"v3");
+    thread::sleep(Duration::from_millis(100));
+    let state = cluster.pd_client.region_replication_status(region.get_id());
+    assert_eq!(state.state_id, 3);
+    assert_eq!(state.state, RegionReplicationState::SimpleMajority);
+
+    // Replication mode switch because timeout reached
+    thread::sleep(Duration::from_millis(1000));
+    let rx = cluster
+        .async_request(put_request(&region, 1, b"k4", b"v4"))
+        .unwrap();
+    assert_eq!(
+        rx.recv_timeout(Duration::from_millis(100)),
+        Err(mpsc::RecvTimeoutError::Timeout)
+    );
+    must_get_none(&cluster.get_engine(1), b"k4");
+    let state = cluster.pd_client.region_replication_status(region.get_id());
+    assert_eq!(state.state_id, 3);
+    assert_eq!(state.state, RegionReplicationState::SimpleMajority);
+
+    // Replication mode switch because groups consistent reached
+    cluster.clear_send_filters();
+    cluster

Why switch it again?

NingLin-P

comment created time in 14 days

delete branch BusyJay/raft-rs

delete branch : remove-context

delete time in 14 days

Pull request review comment tikv/tikv

raftstore: handle the race between creating new peer and splitting correctly

 impl Filter for PrevoteRangeFilter {
         Ok(())
     }
 }
+
+#[test]
+fn test_split_not_to_split_exist_region() {

Missing comments.

gengliqi

comment created time in 14 days

Pull request review comment tikv/tikv

raftstore: handle the race between creating new peer and splitting correctly

 pub fn region_on_same_stores(lhs: &metapb::Region, rhs: &metapb::Region) -> bool
     })
 }
 
+#[inline]
+pub fn is_region_initialized(r: &metapb::Region) -> bool {

Any tests?
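A sketch of the kind of unit test being asked for, under the assumption (not confirmed by the excerpt) that a region counts as initialized once it carries at least one peer:

#[test]
fn test_is_region_initialized() {
    // Assumption: an uninitialized region created from a raft message has an
    // empty peer list, and adding a peer makes it initialized.
    let mut region = metapb::Region::default();
    assert!(!is_region_initialized(&region));
    region.mut_peers().push(metapb::Peer::default());
    assert!(is_region_initialized(&region));
}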

gengliqi

comment created time in 14 days

pull request comment tikv/grpc-rs

Enhance sinks to make them batchable

Can you give an example of how to utilize the optimization with async/await?
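One possible shape of such an example, as a sketch only: queue each message with buffer_hint set via feed() and flush once at the end. The send_batch helper, the generic sink bound, and the Vec<u8> payload type are assumptions for illustration, not part of the PR.

use futures::SinkExt;
use grpcio::WriteFlags;

// Sketch: feed() queues each message without flushing; the single flush() at
// the end lets the sink push the whole batch onto the wire.
async fn send_batch<S>(sink: &mut S, msgs: Vec<Vec<u8>>) -> grpcio::Result<()>
where
    S: futures::Sink<(Vec<u8>, WriteFlags), Error = grpcio::Error> + Unpin,
{
    for msg in msgs {
        let flags = WriteFlags::default().buffer_hint(true);
        sink.feed((msg, flags)).await?;
    }
    sink.flush().await?;
    Ok(())
}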

hunterlxt

comment created time in 14 days

pull request comment tikv/grpc-rs

Enhance sinks to make them batchable

If adapted to send_all, how does it perform?

Since std::future is used now, async/await is the preferred style. The optimization will be less useful if it can't work with that.
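A sketch of the send_all variant under the same assumptions (generic sink of (message, WriteFlags) pairs, Vec<u8> payloads, names invented for illustration); send_all drains the stream and flushes once it is exhausted, so marking every item with buffer_hint should keep the batching effective:

use futures::{stream, SinkExt};
use grpcio::WriteFlags;

// Sketch: adapt a batch to send_all by turning it into a stream of
// Result<(message, WriteFlags), Error> items; send_all flushes at the end.
async fn send_all_batched<S>(sink: &mut S, msgs: Vec<Vec<u8>>) -> grpcio::Result<()>
where
    S: futures::Sink<(Vec<u8>, WriteFlags), Error = grpcio::Error> + Unpin,
{
    let mut batch = stream::iter(msgs.into_iter().map(|m| {
        Ok::<_, grpcio::Error>((m, WriteFlags::default().buffer_hint(true)))
    }));
    sink.send_all(&mut batch).await?;
    Ok(())
}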

hunterlxt

comment created time in 15 days

create branch tikv/tlaplus-specs

branch : master

created branch time in 15 days
