Toby Lawrence (tobz) · @DataDog · Boston, MA · https://tobz.github.io
Software Engineer at @DataDog. Working on @metrics-rs for fun.

tobz/elistrix 12

A latency / fault tolerance library to help isolate your applications from an uncertain world of slow or failed services.

blt/lading 3

A suite of data generation and load testing tools

tobz/ecaggregate 3

ElastiCache configuration endpoint aggregator

tobz/cadastre 2

A ledger for what MySQL is doing at any given moment

nuclearfurnace/ecad-tooling 0

A set of helper scripts, test harnesses, etc., designed to maximize Eagle CAD productivity.

tobz/adventures-in-compression 0

All of the code I use to walk through the explanations in my Adventures in Compression blog series.

tobz/aerospike-client-rust 0

Rust client for the Aerospike database

tobz/are-you-sure 0

A program to help you avoid fat-fingering those important, potentially career-destroying commands

issue comment: metrics-rs/quanta

Switch to CLOCK_BOOTTIME and friends to improve accuracy.

Thinking about this more...

The issue is specific to when the system is suspended, which means being in a "sleep state", or in ACPI parlance, S1-S5. The aforementioned snippet from the Intel developer's manual states that the TSC runs at a constant rate in P/C/T-states, but those are mutually exclusive with sleep states.

Thus, if quanta is in counter/TSC mode, it cannot correctly handle the transition from S0 (not suspended, essentially) to S1-S5 and back to S0. Regardless of whether or not our calibration is correct, the TSC simply stops, so we're going to lose time.

Given the performance goals of quanta, there are likely two things we should do here:

  • implement the CLOCK_BOOTTIME/CLOCK_MONOTONIC cascaded logic mentioned in the stdlib PR
  • document that quanta is not guaranteed to maintain wall-time through system suspend

The first just makes sense: if we can, we might as well make the monotonic reference clock as close to wall-clock accurate as possible. The second, well, also just makes sense: quanta is about performance, but time is important, so we need to call out this limitation.
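
For the first point, a minimal sketch of what that cascaded lookup could look like on Linux (purely illustrative, not quanta's actual code; the function name is made up):

```rust
use std::time::Duration;

// Prefer CLOCK_BOOTTIME, which keeps counting across suspend, and fall back
// to CLOCK_MONOTONIC if the kernel/libc rejects it.
#[cfg(target_os = "linux")]
fn reference_now() -> Duration {
    let mut ts = libc::timespec { tv_sec: 0, tv_nsec: 0 };
    // Older kernels return an error for CLOCK_BOOTTIME; in that case we take
    // the same value CLOCK_MONOTONIC would have given us anyway.
    let rc = unsafe { libc::clock_gettime(libc::CLOCK_BOOTTIME, &mut ts) };
    if rc != 0 {
        unsafe { libc::clock_gettime(libc::CLOCK_MONOTONIC, &mut ts) };
    }
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}
```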

There's likely a future change we could make to allow only using the monotonic reference clock, for when users want quanta for its ability to mock time, but also need that time to advance with wall time.

tobz

comment created time in 6 hours

pull request comment: vectordotdev/vector

enhancement(unit tests): Add testing for component specification features

Agreed, this looks a lot more succinct now. Good work. 👍🏻

bruceg

comment created time in 7 hours

issue comment: vectordotdev/vector

Change `StreamSink::run` to take `self` instead of `&mut self`

Seems fine to me. A quick run of that change through the Rust Playground doesn't turn up any weird quirks with async_trait.

fuchsnj

comment created time in 11 hours

issue opened: metrics-rs/quanta

Switch to CLOCK_BOOTTIME and friends to improve accuracy.

Per the discussion happening on rust-lang/rust#88714, there's a meaningful difference between CLOCK_BOOTTIME and CLOCK_MONOTONIC when it comes to time across system suspends. According to the issue, the problem they're trying to solve is that CLOCK_MONOTONIC stops ticking during suspend, while CLOCK_BOOTTIME does not. This raises two problems for quanta:

Monotonic mode

When invariant TSC support is not detected, we fall back to the "monotonic" mode where we query the time directly. This is all fine and good, but we're also querying with CLOCK_MONOTONIC, and similar variants on other platforms. This leaves us open to the exact same problem described in the above issue.

Counter (TSC) mode

While I have not fully traced whether or not this matters, there's a potential scenario where CLOCK_MONOTONIC stops ticking during lower CPU power states, such that as we go through the calibration loop, our reference drifts with every iteration. While the invariant TSC should be guaranteed to tick at a constant rate -- recent Intel manuals specifically state that "the invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states" -- this is moot if our initial reference/source calibration is off, since we need that calibration to go from TSC cycles to real time units.
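
To illustrate why that matters, here's a rough sketch of the calibration relationship (illustrative only, not quanta's actual internals): the scaling factor is derived from the reference clock, so any reference drift during calibration skews every later conversion.

```rust
// Illustrative sketch: calibration captures paired readings of the reference
// clock and the TSC, then derives a nanoseconds-per-cycle factor from them.
struct Calibration {
    ref_start_ns: u64, // reference clock reading at calibration start
    tsc_start: u64,    // TSC reading at calibration start
    ns_per_cycle: f64, // (ref_end_ns - ref_start_ns) / (tsc_end - tsc_start)
}

impl Calibration {
    // Convert a raw TSC reading into nanoseconds. If `ns_per_cycle` was
    // derived against a reference that paused or drifted, every result
    // produced here is skewed.
    fn scaled(&self, tsc_now: u64) -> u64 {
        let delta_cycles = tsc_now.wrapping_sub(self.tsc_start);
        self.ref_start_ns + (delta_cycles as f64 * self.ns_per_cycle) as u64
    }
}
```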

At any rate, switching shouldn't do anything but make things more accurate, but to reference the issue again, there are also some concerns about when the support for it was introduced, and on which platforms it matters. With that in mind, we likely need to wait for that PR to shake out to make sure we have a good example of where we'll need to make our changes.

created time in 12 hours

Pull request review comment: vectordotdev/vector

chore: Add buffer max size metric to buffer specification

(Diff context, buffer specification: adds a `BufferCreated` section stating that all buffers MUST emit a `BufferCreated` event immediately upon creation, with properties `max_size` (the maximum number of events or byte size of the buffer) and `id` (the ID of the component associated with the buffer), and MUST emit the `buffer_max_event_size` gauge for in-memory buffers or `buffer_max_byte_size` gauge for disk buffers, with the defined `max_size` value and `id` as a tag.)

Basically a separate task with a timer or something, yeah. You could theoretically clone the `Arc<AtomicUsize>` we use for "current size" in disk buffers, do roughly the same for the in-memory buffers, and then hand that off to your reporting task. The trickiest part is preserving tags, but that could likely be handled pretty easily, e.g. by wrapping the task spawn in a span when building it, since you should be in the same place where you have access to the sink component name/type/ID/etc.
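
A rough sketch of what I mean, with illustrative names rather than Vector's actual types/events:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};
use std::time::Duration;

use tracing::Instrument;

// Clone the shared "current size" counter and hand it to a periodic reporting
// task; spawning inside a span preserves the component tags for anything
// emitted from within the task.
fn spawn_buffer_usage_reporter(current_size: Arc<AtomicUsize>, component_id: String) {
    let span = tracing::info_span!("buffer_usage", component_id = %component_id);
    tokio::spawn(
        async move {
            let mut interval = tokio::time::interval(Duration::from_secs(1));
            loop {
                interval.tick().await;
                let size = current_size.load(Ordering::Relaxed);
                // Stand-in for emitting the real gauge/event.
                tracing::debug!(buffer_size = size, "Reporting current buffer size.");
            }
        }
        .instrument(span),
    );
}
```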

001wwang

comment created time in 14 hours

issue closed: vectordotdev/vector

Component in/out metrics should be better aligned with the lifecycle of the events.

Currently, we generically emit metrics for components related to the lifecycle of events: events in, events out, processed bytes (how many bytes we shipped out), and so on. There's a natural alignment with the event lifecycle that people would typically imagine: an event is "in" when the component receives it, "out" when we know it has left the component, and so on.

In practice, we do not always do this.

In some scenarios, like the Elasticsearch or generic HTTP sinks, we emit the number of processed bytes when events are encoded, which does not account for whether we actually sent the event payload, successfully or otherwise. Merely batching the event triggers this, and while internal backpressure, which is generally coupled to requests being sent and responded to, tends to smooth out how often these metrics are emitted, they're not quite representative of reality.

The good news is that we often have common framework points -- Pipeline, HttpClient, etc -- where we could more accurately emit these metrics.

closed time in a day

tobz

issue comment: vectordotdev/vector

Component in/out metrics should be better aligned with the lifecycle of the events.

@bruceg I do believe it does, yes. :)

tobz

comment created time in a day

push event: vectordotdev/vector

Toby Lawrence

commit sha ae2de653770c565b36faa9396611976edf91aad8

temporary commit to avoid losing too much work before more futzing

Signed-off-by: Toby Lawrence <toby@nuclearfurnace.com>

push time in 3 days

Pull request review comment: vectordotdev/vector

feat(topology): add transforms with multiple outputs

(Diff context, `build_pieces` in the topology builder: reworks transform task construction so `Transform::Function` keeps the existing `Fanout`-based stream pipeline, while a new `Transform::FallibleFunction` variant drives the input manually, feeding primary events to `output` and error events to a separate `errors_output` fanout, with `feed`/`flush` calls that `expect("unit error")`.)

I would agree, but I didn't want to pigeonhole Luke's response. :P

lukesteensen

comment created time in 3 days

create branch: tobz/ring-buffer-on-disk-rs

branch : main

created branch time in 5 days

created repository: tobz/ring-buffer-on-disk-rs

Experimental crate for a simplistic disk-backed ring buffer.

created time in 5 days

Pull request review comment: vectordotdev/vector

feat(topology): add transforms with multiple outputs

(Diff context: same `build_pieces` hunk as above, covering the `Transform::FallibleFunction` error-output handling.)

Do we actually enforce attaching a downstream component to the error output of a fallible transform? Or allow errors to be silently discarded?

If the former, wouldn't the above eventually fill up the channel and block?

lukesteensen

comment created time in 7 days

Pull request review comment: vectordotdev/vector

chore(architecture): consolidate sink I/O driver logic into reusable component

(Diff context, `S3RequestBuilder::build_request` for the S3 sink: changes the receiver from `&mut self` to `&self` and retains a large commented-out block discussing whether the batch filename should use the timestamp of the last event in the batch rather than the current time, before generating the object key, encoding/compressing the batch, and building the `S3Request`.)

Yes, but admittedly I'm torn. Maybe I'll pose the hypothetical to the entire team to get their thoughts.

tobz

comment created time in 7 days

Pull request review comment: vectordotdev/vector

chore(architecture): consolidate sink I/O driver logic into reusable component

(Diff context, new `Driver` component: adds `Driver<St, Svc, Req>`, which takes a `Stream` of requests, a Tower `Service`, and an `Acker`, and drives the stream through the service while handling readiness, concurrency, finalization, and acknowledgement. The reviewed lines are at the top of `Driver::run`, which sets up `let in_flight = FuturesUnordered::new();` and `let mut pending_acks = HashMap::new();`.)

I had the exact same thought. There's actually a nohash-hasher crate we could use, since our keys are already u64. I guess if we've both had this thought, maybe I should just go ahead and do it.
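
Something like this, assuming the nohash-hasher crate's API (a sketch of the idea, not code from the PR):

```rust
use std::collections::HashMap;

use nohash_hasher::BuildNoHashHasher;

// The pending-ack keys are already u64 sequence numbers, so an identity
// "hash" is enough; this just skips the SipHash work on every insert/lookup.
type PendingAcks = HashMap<u64, usize, BuildNoHashHasher<u64>>;

fn new_pending_acks() -> PendingAcks {
    HashMap::with_hasher(BuildNoHashHasher::default())
}
```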

tobz

comment created time in 7 days

PR opened: vectordotdev/vector

chore(architecture): consolidate sink I/O driver logic into reusable component

This PR moves the logic previously baked into the `run_io` method for the S3 sink into a reusable component called `Driver`. Simply put, you give it a `Stream` of items that can be used as requests for a `Service`, and it handles building the calls for each item, as well as providing finalization and acking as the responses come through.

We've done a small amount of work to integrate it with the S3 sink, and it does indeed provide a cleaner sink implementation, although at this point it's really just shuffling code around.
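
As a hedged usage sketch, based only on the `Driver::new`/`run` signatures visible in this PR's diff (the module path and the wrapper function here are illustrative):

```rust
use buffers::{Ackable, Acker};
use futures::Stream;
use tower::Service;

use crate::event::{EventStatus, Finalizable};
// Illustrative import; wherever `Driver` ends up living in the tree.
use crate::sinks::util::driver::Driver;

// Wire a stream of already-built requests into a Tower service; `Driver`
// handles readiness, driving the in-flight calls, and acking responses.
async fn drive_requests<St, Svc, Req>(requests: St, service: Svc, acker: Acker) -> Result<(), ()>
where
    St: Stream<Item = Req>,
    Svc: Service<Req>,
    Svc::Error: std::fmt::Debug + 'static,
    Svc::Future: Send + 'static,
    Svc::Response: AsRef<EventStatus>,
    Req: Ackable + Finalizable,
{
    Driver::new(requests, service, acker).run().await
}
```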

Readiness checklist:

  • [ ] actual, basic tests for Driver
  • [ ] maybe even property-based testing for Driver
  • [ ] can we consolidate S3RequestBuilder with S3EventEncoding? (nice to have, not required)
+298 -248 · 0 comments · 11 changed files

pr created time in 7 days

create branch: vectordotdev/vector

branch : tobz/streamify-new-sink-io-task

created branch time in 7 days

pull request comment: metrics-rs/metrics

Update atomic-shim requirement from 0.1 to 0.2

Yeah, my thought is to just remove atomic-shim completely and switch to AtomicCell<T> as we did in quanta. Crossbeam feels more stable long-term, although there's the tricky issue that fetch_add isn't supported under MIPS/ARM for AtomicCell<u64>, if I remember correctly...

I'll think about this a little more tonight, and either merge this or do a PR to remove it entirely.
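
For reference, a minimal sketch of the replacement shape, assuming crossbeam-utils' `AtomicCell` (whether `fetch_add` is actually available and lock-free on those targets is exactly the open question above):

```rust
use crossbeam_utils::atomic::AtomicCell;

// Counter without atomic-shim: AtomicCell uses native atomics where the
// target supports them; behavior elsewhere is the concern noted above.
static COUNTER: AtomicCell<u64> = AtomicCell::new(0);

fn record(delta: u64) -> u64 {
    // Returns the previous value, like the std atomics' fetch_add.
    COUNTER.fetch_add(delta)
}
```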

dependabot[bot]

comment created time in 7 days

pull request comment: vectordotdev/vector

enhancement(datadog_metrics sink): add support for aggregated histograms and summaries

Note to self: right now the internal aggregated histograms we ship do not look right in the DD metrics UI. We should be sending absolute (monotonically increasing, at that) bucket values, but currently the bucket values fluctuate a bit, as if they're not being interpreted as gauges correctly or the data itself is actually going up and down.

tobz

comment created time in 7 days

pull request comment: vectordotdev/vector

fix(aws_s3 sink): ship final batches when shutting down

Nah, the system we have is fine. This problem was purely about coordination between two intertwined tasks, not a deficiency in our approach of shutting down sources and letting their closure cascade to downstream components.

tobz

comment created time in 7 days

push event: vectordotdev/vector

Toby Lawrence

commit sha 5c702182e39e351406e420a4b7bcb6692f77ae97

fix(aws_s3 sink): ship final batches when shutting down (#9184)

* fix(aws_s3 sink): ship final batches when shutting down

Signed-off-by: Toby Lawrence <toby@nuclearfurnace.com>

push time in 7 days