profile
Robert Fratto · rfratto · @grafana · twitter.com/robertfratto · Dev at @grafana working on the Loki team

grafana/agent 74

A lightweight subset of Prometheus and more, optimized for Grafana Cloud

rfratto/localenv 2

My personal local dev environment for work

rfratto/annoybot 1

annoy coworkers on slack in one easy script

rfratto/FreeImage 1

CMake-based FreeImage fork

rfratto/accidental-noise-library 0

Automatically exported from code.google.com/p/accidental-noise-library

rfratto/agent 0

A lightweight subset of Prometheus for Grafana Cloud.

rfratto/awaybot 0

A Slack bot to track away statuses

rfratto/bbparse 0

JavaScript library to parse BBCode-like expressions.

rfratto/beats 0

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash

rfratto/bootkube 0

bootkube - Launch a self-hosted Kubernetes cluster

delete branch rfratto/agent

delete branch : prometheus-config-validations

delete time in 3 hours

push event grafana/agent

Robert Fratto

commit sha 2d590fd990a21a6ccfc3627b69b66551c0d03111

use remote_write and scrape_config validations from Prometheus (#92)

* use remote_write and scrape_config validations from Prometheus

  Prometheus was doing some validations on the remote_write and scrape_config sections per config that we weren't doing yet. To make this a little easier, ApplyDefaults and Validate have been merged. The scraping service PutConfiguration endpoint will now apply the defaults to the config before passing it through to the KV store; invalid configs will be prevented from reaching the KV store and errors will be returned to the client. When migrating to this change, if any invalid configs exist in the KV store, the Agent will refuse to start them and log an error. The configs in the KV store should be updated to fix the invalid settings before the Agent will be able to run it. Fixes #80.

* add tests for new validations
* s/ok/exists

view details

push time in 3 hours

PR merged grafana/agent

use remote_write and scrape_config validations from Prometheus

Prometheus was doing some validations on the remote_write and scrape_config sections per config that we weren't doing yet. To make this a little easier, ApplyDefaults and Validate have been merged.

The scraping service PutConfiguration endpoint will now apply the defaults to the config before passing it through to the KV store; invalid configs will be prevented from reaching the KV store and errors will be returned to the client.

When migrating to this change, if any invalid configs exist in the KV store, the Agent will refuse to start them and log an error. The configs in the KV store should be updated to fix the invalid settings before the Agent will be able to run it.
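
For illustration, a minimal sketch of the kind of per-config checks this moves into the merged ApplyDefaults; the types below are simplified stand-ins, not the Agent's actual ones:

package instance

import "fmt"

// Simplified stand-ins for the real config types.
type ScrapeConfig struct{ JobName string }
type RemoteWriteConfig struct{ URL string }

type Config struct {
	ScrapeConfigs []*ScrapeConfig
	RemoteWrite   []*RemoteWriteConfig
}

// ApplyDefaults applies defaults and validates the config. Returning an
// error here keeps the config out of the KV store, and the error is
// returned to the PutConfiguration client instead.
func (c *Config) ApplyDefaults() error {
	seen := map[string]struct{}{}
	for _, sc := range c.ScrapeConfigs {
		if sc == nil || sc.JobName == "" {
			return fmt.Errorf("empty or null scrape config section")
		}
		if _, ok := seen[sc.JobName]; ok {
			return fmt.Errorf("found multiple scrape configs with job name %q", sc.JobName)
		}
		seen[sc.JobName] = struct{}{}
	}
	for _, rw := range c.RemoteWrite {
		if rw == nil || rw.URL == "" {
			return fmt.Errorf("empty or null remote write config section")
		}
	}
	return nil
}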

Fixes #80.

+238 -94

0 comment

10 changed files

rfratto

pr closed time in 3 hours

issue closed grafana/agent

Instance configs missing Prometheus validation checks

The Agent is missing some validation checks that Prometheus has:

Check the links to see what Prometheus does.

The instance configs don't do validation checks on unmarshaling like Prometheus does; rather it's broken up into ApplyDefaults and Validate. However, ApplyDefaults and Validate aren't being called from the instance config API; this should also be fixed.

closed time in 3 hours

rfratto

push event rfratto/agent

Robert Fratto

commit sha 4464d0603a4fafd802c3293d2f814c6834da99c5

s/ok/exists

view details

push time in 3 hours

push event grafana/agent

Robert Fratto

commit sha 36817fbe55af71757e492c5bbd687b41ac7533a2

Fix race conditions and flaky CI (#90)

* enable testing for race conditions
* stop using mock etcd

  mock etcd was the source of a lot of flakes

* fix reported race conditions
* fix lint error

view details

push time in a day

PR merged grafana/agent

Reviewers
Fix race conditions and flaky CI

This PR fixes the various flakes CI has been having lately.

+167 -47

2 comments

5 changed files

rfratto

pr closed time in a day

issue opened grafana/agent

Cut a 0.3.2 release

The various issues found in the scraping service warrant a new patch release.

Before 0.3.2, we need the following PRs merged: #92 and #90, and all the code to run for at least half a day to make sure everything looks good.

/cc @gotjosh @hoenn

created time in a day

push event rfratto/agent

Robert Fratto

commit sha 67c0a7b72cc3f26c9645c055f8e8636dfb3d934c

add example config for running agent with integrations

view details

push time in 5 days

PR opened grafana/agent

Add integrations system

This PR adds a system to run integrations (i.e., embedded exporters) and automatically collect metrics from them. It's currently in a rough state, but I want to open this up as a draft to track the work that needs to be done.

TODO:

  • [ ] Clean up the code
  • [ ] Add hostname as a label to allow multiple agents to run the same set of integrations out of the box with the same config
  • [ ] Make the labels used for integrations less obtuse (currently __integration_agent_0)
  • [ ] Tests
+333 -6

0 comment

8 changed files

pr created time in 5 days

create branch rfratto/agent

branch : integrations-system

created branch time in 5 days

issue comment grafana/agent

Implement an integrations system

Something to think about: are there going to be any unique identifiers if two agents are running and both run the agent integration? Or will their series collide when remote writing to Cortex?

rfratto

comment created time in 5 days

push event grafana/loki

Aditya C S

commit sha f984e71192d41723c16f01c83f04d191b0bb2a8c

Fix Promtail journal seeking known position (#2111)

view details

push time in 5 days

PR merged grafana/loki

Fix Promtail journal seeking known position (size/XS)

What this PR does / why we need it: Fix Promtail journal seeking known position

Which issue(s) this PR fixes: Fixes #2104

+9 -0

6 comments

1 changed file

adityacs

pr closed time in 5 days

issue closed grafana/loki

promtail/journal: cannot seek back to saved position

Describe the bug: When Promtail is configured to scrape logs from journald, it is supposed to remember the position of the logs in the positions file. When Promtail is restarted, it can't read back the saved position from the journal, failing with the following error:

level=error ts=2020-05-21T05:53:29.024009194Z caller=journaltarget.go:219 msg="received error reading saved journal position" err="failed to get realtime timestamp: cannot assign requested address"

Note: on Ubuntu, the error is not cannot assign requested address but 99, which amounts to the same thing (error message vs. error code).

The side effect is that it then re-reads logs from the journal up to journal.max_age, which can cause trouble such as:

  • a burst of level=error ts=2020-05-21T05:58:14.901208619Z caller=client.go:247 component=client host=172.30.0.101:6902 msg="final error sending batch" status=400 error="server returned HTTP status 400 Bad Request (400): entry with timestamp 2020-05-21 04:40:42.937458 +0000 UTC ignored, reason: 'entry out of order' for stream: {host=\"node1.novalocal\", job=\"systemd-journal\", log_type=\"access\"},"
  • Loki can then complain about too many requests with level=warn ts=2020-05-21T06:15:39.861839553Z caller=client.go:242 component=client host=172.30.0.101:6902 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded (limit: 4194304 bytes/sec) while attempting to ingest '333' lines totaling '102198' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"

To Reproduce: Steps to reproduce the behavior:

  1. Started Loki 1.5.0
  2. Started Promtail 1.5.0
  3. check error logs

Expected behavior: Promtail should be able to detect where it stopped fetching logs and start again from there. Instead, it re-reads logs that have already been pushed.

Environment:

  • Infrastructure: baremetal Centos7 or Ubuntu 18.04
  • Deployment tool: release binary from github or local compilation
  • configuration file:
server:
  http_listen_address: 172.30.0.111
  http_listen_port: 6922
  grpc_listen_port: 0

positions:
  filename: /var/log/positions.yml

clients:
  - url: "http://172.30.0.101:6902/loki/api/v1/push"

scrape_configs:
  - job_name: journal
    journal:
      json: false
#      max_age: 1s
      labels:
        job: systemd-journal

closed time in 5 days

fatpat

push event rfratto/agent

Robert Fratto

commit sha b0811238640cc4d9cac5144bffdaf9136fa58e22

add tests for new validations

view details

push time in 5 days

PR opened grafana/agent

use remote_write and scrape_config validations from Prometheus

Prometheus was doing some validations on the remote_write and scrape_config sections per config that we weren't doing yet. To make this a little easier, ApplyDefaults and Validate have been merged.

The scraping service PutConfiguration endpoint will now apply the defaults to the config before passing it through to the KV store; invalid configs will be prevented from reaching the KV store and errors will be returned to the client.

When migrating to this change, if any invalid configs exist in the KV store, the Agent will refuse to start them and log an error. The configs in the KV store should be updated to fix the invalid settings before the Agent will be able to run it.

Fixes #80.

+132 -94

0 comment

10 changed files

pr created time in 5 days

create branch rfratto/agent

branch : prometheus-config-validations

created branch time in 5 days

issue opened grafana/agent

Run go mod vendor, tidy as part of CI validations

It's easy to make changes to the Agent and forget to run go mod tidy and go mod vendor before checking it in. It would be nice if the CI that runs on pull requests checked to see if the PR author needs to rerun the commands.

The CI check should fail if running go mod tidy and go mod vendor caused any git diffs.

created time in 5 days

pull request comment grafana/loki

Fix Promtail journal seeking known position

What I thought was: if there is no entry after the cursor, calling Next would throw an error. However, calling Previous would re-read the entry, as you mentioned.

Right, so calling Next in this case would be valid, but it would be set to the EOF. I think getting the entry would just return the last entry before EOF, but it's not clear (sorry, my C skills aren't what they used to be).

Even if that's wrong and getting the entry fails, I think that's ok; trying to restore the position is best-effort. Your PR here will help a lot with the majority of issues, but I think there's always going to be the possibility of some edge case that's going to prevent us from restoring.

That being said, I think the real question here is agreeing on what the behavior should be when we do fail to read the saved entry. It's either:

  1. We reread from the max age and the client ends up resending some logs they already sent (seeing 4xxs)
  2. We skip to the tail of the journal and stop reading from there. No 4xxs, but then journal entries that appeared in the promtail downtime will never be sent.

Currently I'm using the first approach, and IMO that's the right thing to do, even if the 4xxs might raise some red flags when we fail to read the entry.
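
For reference, a rough sketch of the seek-then-Next flow under discussion, using github.com/coreos/go-systemd/sdjournal directly; this is a simplified illustration rather than Promtail's actual journaltarget code, and the cursor string is a placeholder:

package main

import (
	"fmt"
	"log"

	"github.com/coreos/go-systemd/sdjournal"
)

func main() {
	j, err := sdjournal.NewJournal()
	if err != nil {
		log.Fatal(err)
	}
	defer j.Close()

	// Placeholder: in Promtail this cursor comes from the positions file.
	savedCursor := "s=...;i=...;b=..."

	// Seek to the saved cursor, then advance with Next so entries that were
	// already shipped are never re-read, even if the saved cursor is stale.
	if err := j.SeekCursor(savedCursor); err != nil {
		log.Printf("could not seek to saved position: %v", err)
	}
	if _, err := j.Next(); err != nil {
		log.Printf("could not advance past saved position: %v", err)
	}

	entry, err := j.GetEntry()
	if err != nil {
		// Best effort: failing here falls back to re-reading from max_age.
		log.Printf("could not read entry: %v", err)
		return
	}
	fmt.Println(entry.Fields["MESSAGE"])
}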

adityacs

comment created time in 5 days

pull request comment grafana/loki

Fix Promtail journal seeking known position

Thanks for looking into this @adityacs! I'm looking at the systemd docs for this, and while it seems like calling Next or Previous should work, I think we want to use Next here instead; that way Promtail never re-reads already-sent entries if the cursor we seek to is invalid.

adityacs

comment created time in 5 days

pull request comment grafana/agent

Fix race conditions and flaky CI

I've tested it for a good hour and everything passed every time; I'm pretty confident this fixes the flakiness.

rfratto

comment created time in 6 days

pull request comment grafana/agent

Fix race conditions and flaky CI

:crossed_fingers: going to rerun the CI a bunch of times and hopefully nothing fails. If everything is good, I'll open this up for review

rfratto

comment created time in 6 days

push event rfratto/agent

Robert Fratto

commit sha ea44f306785e20c69ecb94018615bcc41ea5de76

fix lint error

view details

push time in 6 days

push event rfratto/agent

Robert Fratto

commit sha 282c6682e64afb34467c545ae29bd2ba16f84b5e

fix reported race conditions

view details

push time in 6 days

PR opened grafana/agent

Fix race conditions and flaky CI
+134 -33

0 comment

2 changed files

pr created time in 6 days

create branch rfratto/agent

branch : race-conditions

created branch time in 6 days

delete branch rfratto/agent

delete branch : rename-prometheus-package

delete time in 6 days

push event grafana/agent

Robert Fratto

commit sha 1fa02c25c0c785cc8a811efc83d1d600dd453438

rename pkg/prometheus to pkg/prom (#89)

view details

push time in 6 days

PR merged grafana/agent

rename pkg/prometheus to pkg/prom

pkg/prometheus collided with github.com/prometheus/prometheus in annoying ways - let's just call it pkg/prom (plus it makes the import paths slightly shorter!)

+37 -37

0 comment

31 changed files

rfratto

pr closed time in 6 days

PR opened grafana/agent

Reviewers
rename pkg/prometheus to pkg/prom

pkg/prometheus collided with github.com/prometheus/prometheus in annoying ways - let's just call it pkg/prom (plus it makes the import paths slightly shorter!)

+37 -37

0 comment

31 changed files

pr created time in 6 days

create branch rfratto/agent

branch : rename-prometheus-package

created branch time in 6 days

push event grafana/agent

Robert Fratto

commit sha c929f9fc4e5eaae179c0c6328f7efc8ec57054c3

couple discovery and host filtering (#87)

* couple discovery and host filtering

  The implementation of #83 will require the ability to replace the Prometheus discovery manager with a custom component that outputs discovered targets. The first half of this is to couple the Prometheus discovery manager and host filtering as a single "actor" that can be swapped out with another implementation.

* rename newRunGroup to runGroupWithContext
* fix silly typo
* don't have multiple different names used for referring to the discovery manager

view details

push time in 6 days

delete branch rfratto/agent

delete branch : couple-discovery-and-filter

delete time in 6 days

PR merged grafana/agent

couple discovery and host filtering

The implementation of #83 will require the ability to replace the Prometheus discovery manager with a custom component that outputs discovered targets. The first half of this is to couple the Prometheus discovery manager and host filtering as a single "actor" that can be swapped out with another implementation.
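
For context, a minimal sketch of the pattern, assuming github.com/oklog/run: actors coupled in one run.Group that a shared context can shut down. It is similar in spirit to the helper this PR adds, but simplified and not the Agent's exact code:

package main

import (
	"context"
	"log"
	"time"

	"github.com/oklog/run"
)

// runGroupWithContext wires a context into a run.Group so that canceling
// the context interrupts every actor in the group.
func runGroupWithContext(ctx context.Context) (*run.Group, context.CancelFunc) {
	ctx, cancel := context.WithCancel(ctx)

	var g run.Group
	g.Add(func() error {
		<-ctx.Done()
		return nil
	}, func(error) {
		cancel()
	})
	return &g, cancel
}

func main() {
	g, cancel := runGroupWithContext(context.Background())
	defer cancel()

	// The discovery manager and host filter would each be added here as
	// actors. This demo actor simply exits after a second, which interrupts
	// the whole group and makes Run return.
	g.Add(func() error {
		time.Sleep(time.Second)
		return nil
	}, func(error) {})

	if err := g.Run(); err != nil {
		log.Fatal(err)
	}
}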

+99 -64

1 comment

1 changed file

rfratto

pr closed time in 6 days

push event rfratto/agent

Robert Fratto

commit sha dc1cfddd20757d0f55e89fc6b0976b701e036b24

don't have multiple different names used for referring to the discovery manager

view details

push time in 6 days

push event rfratto/agent

Robert Fratto

commit sha 32ddf298f6f81e10f759115bc38b9e9bd6448325

fix silly typo

view details

push time in 6 days

delete branch rfratto/agent

delete branch : use-yaml-hook

delete time in 6 days

push event grafana/agent

Robert Fratto

commit sha fee72e6fd9a13ca3af0fc4af88f7a11c8561a755

add yaml hook to never marshal secrets as "<secret>" (#88)

* add yaml hook to not marshal secrets as "<secret>"

  Fixes #86
  Fixes #81 (because it removes the broken)
  Supplants #82

* replace other uses of yaml marshaling a config with the package methods
* add in protection that prevents instance configs from being marshaled directly

  instance.Configs should always be marshaled to YAML through the marshal functions within the instance package, never using the YAML package directly.

* allow "normal" marshaling of instance.Config to keep secrets intact

  Marshaling without the custom methods will now be slower but will at least work and not unexpectedly remove secrets. Calling the custom methods is still required for scrubbing secrets.

* add test to validate instance config marshaling

view details

push time in 6 days

PR merged grafana/agent

add yaml hook to never marshal secrets as "<secret>"

Supplants #82 with a better fix for keeping secrets on marshal that doesn't require reproducing all of Prometheus' types. I've also added in protection to make sure that we don't run across this again; directly marshaling a config will always fail unless you call the utility methods in the instance package.

This switches us to a go-yaml fork for now, but I'll work on seeing if upstream is interested in this.

Fixes #86 Fixes #81 (because it removes the broken code)

/cc @gotjosh

+163 -197

1 comment

14 changed files

rfratto

pr closed time in 6 days

issue closed grafana/agent

remote_write config stanzas ignore password_file

Per the docs, it looks like one should be able to use password_file in a remote_write configuration (as in Prometheus; desirable because it plays well with K8s secrets, etc.).

Best I can tell though, a password_file value appears to be checked for / validated, but never makes it into the actual runtime configuration used? (I was led to this by 401 Unauthorized errors which I believe are attributable to this.)

closed time in 6 days

jgraettinger

issue closed grafana/agent

Some secrets still get mangled in scraping service mode

See full configuration reference; there are many fields marked as secret that aren’t being converted to a string for storage. For example, the basic auth credentials in the scrape configs.

We need a better solution than replicating all the types that have a secret field.

closed time in 6 days

rfratto

push event rfratto/agent

Robert Fratto

commit sha d7b29d6504791c06727b097e1d2f4503753f25a5

add test to validate instance config marshaling

view details

push time in 6 days

pull request comment grafana/agent

add yaml hook to never marshal secrets as "<secret>"

@hoenn sorry for all the changes; this is in a reviewable state now

rfratto

comment created time in 6 days

push event rfratto/agent

Robert Fratto

commit sha 339d5d1c6ebc8d60f0b79f2aa7a19ea8cfb52fff

allow "normal" marshaling of instance.Config to keep secrets intact Marshaling without the custom methods will now be slower but will at least work and not unexpectedly remove secrets. Calling the custom methods is still required for scrubbing secrets.

view details

push time in 6 days

create branch rfratto/agent

branch : use-yaml-v3-hook

created branch time in 6 days

push event rfratto/go-yaml

Robert Fratto

commit sha c4517cec828d3103caa7e0e57ed84df69a59b785

add in encoding hook

view details

push time in 6 days

push event rfratto/agent

Robert Fratto

commit sha 691de0a2cbb44175a06cef8ccec91507b4e62b20

add in protection that prevents instance configs from being marshaled directly

instance.Configs should always be marshaled to YAML through the marshal functions within the instance package, never using the YAML package directly.

view details

push time in 6 days

push event rfratto/agent

Robert Fratto

commit sha ddd68d211fe9bb1370a6a6f5a66f7b430fd9d8df

replace other uses of yaml marshaling a config with the package methods

view details

push time in 6 days

PR opened grafana/agent

Reviewers
add yaml hook to not marshal secrets as "<secret>"

Supplants #82 with a better fix for keeping secrets on marshal that doesn't require reproducing all of Prometheus' types.

Fixes #86 Fixes #81 (because it removes the broken code)

/cc @gotjosh

+69 -177

0 comment

11 changed files

pr created time in 6 days

create branch rfratto/agent

branch : use-yaml-hook

created branch time in 6 days

push event rfratto/go-yaml

Robert Fratto

commit sha 984fc90c8a04c9ba2e8ef7d25150f087df9a861f

add in encoding hook

view details

push time in 6 days

push event rfratto/agent

Robert Fratto

commit sha 325fd56c645a89b7ad4859db42361b199f05db3a

rename newRunGroup to runGroupWithContext

view details

push time in 6 days

Pull request review comment grafana/agent

couple discovery and host filtering

 func (vc *MetricValueCollector) GetValues(label string, labelValues ...string) (
 	return vals, nil
 }
+
+type runGroup struct {
+	cancel context.CancelFunc
+
+	g *run.Group
+}
+
+// newRunGroup wraps around run.Group but accepts a context. The run group will
+// be stopped if the context gets canceled.
+func newRunGroup(ctx context.Context) *runGroup {

makes sense to me, changing it

rfratto

comment created time in 6 days

fork rfratto/yaml

YAML support for the Go language.

fork in 6 days

pull request comment grafana/agent

couple discovery and host filtering

I'm going to test this locally for a few hours to make sure nothing is broken here.

rfratto

comment created time in 6 days

PR opened grafana/agent

couple discovery and host filtering

The implementation of #83 will require the ability to replace the Prometheus discovery manager with a custom component that outputs discovered targets. The first half of this is to couple the Prometheus discovery manager and host filtering as a single "actor" that can be swapped out with another implementation.

+98 -64

0 comment

1 changed file

pr created time in 6 days

create branch rfratto/agent

branch : couple-discovery-and-filter

created branch time in 6 days

delete branch rfratto/agent

delete branch : add_password_file

delete time in 6 days

PR closed grafana/agent

Reviewers
add password_file when converting to Prometheus basic_auth

Fixes #81

+3 -2

1 comment

1 changed file

rfratto

pr closed time in 6 days

pull request comment grafana/agent

add password_file when converting to Prometheus basic_auth

This fix isn't sufficient to address all the problems v0.3.0 introduced; see #86. I'm going to close this in favor of a better solution that I'll publish later today.

rfratto

comment created time in 6 days

issue comment grafana/agent

Some secrets still get mangled in scraping service mode

Two approaches here: see if Prometheus is interested in scrubbing the secrets outside the marshal function, or fork the YAML parser and add a hook that ignores the marshal function for secrets (and see if upstream is interested in taking it).

The latter seems like less work, so I might try that first.
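
For background on why a hook is needed at all: prometheus/common's Secret type deliberately scrubs itself when marshaled to YAML, so a config round-tripped through the default marshaler loses the real value. A small illustrative snippet (not the Agent's code):

package main

import (
	"fmt"

	"github.com/prometheus/common/config"
	"gopkg.in/yaml.v2"
)

type basicAuth struct {
	Username string        `yaml:"username"`
	Password config.Secret `yaml:"password"`
}

func main() {
	in := basicAuth{Username: "agent", Password: "hunter2"}

	// config.Secret implements yaml.Marshaler and replaces its value with
	// the literal string "<secret>", which is what was ending up in the KV
	// store when configs were re-marshaled for storage.
	out, err := yaml.Marshal(in)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out)) // password comes back as <secret>, not hunter2
}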

/cc @hoenn

rfratto

comment created time in 7 days

issue opened grafana/agent

Some secrets still get mangled in scraping service mode

See full configuration reference; there are many fields marked as secret that aren’t being converted to a string for storage. For example, the basic auth credentials in the scrape configs.

We need a better solution than replicating all the types that have a secret field.

created time in 7 days

create branch rfratto/agent

branch : instance-options

created branch time in 7 days

delete branch rfratto/agent

delete branch : refactor-instance

delete time in 7 days

push event grafana/agent

Robert Fratto

commit sha 59c1c447e9a3f4bcdfb9dc222b3601a7386a1746

simplify instance creation and lifecycle (#85)

* simplify instance creation and lifecycle

  As part of #83, the instance package will have to be utilized in a new way for running integrations (i.e., running an instance that does not have a discovery component). To prepare for this change, the instance package was in need of a refactoring to simplify how the instance objects are used. Now, instead of an instance immediately starting when it is created, it must be started explicitly through the Run function. Run is a blocking operation and will only exit on error or when the context is canceled. Changing to this allowed for cleaning up a lot of code callers required to watch the lifecycle of the Instance. Note that the code for unregistering WAL metrics is no longer needed as the WAL is now created when Run is called. Running an instance with a discovery manager is still required as of this commit; future work is still needed to make the individual components of the Instance optional.

* rename fanoutStorage -> storage
* update instance.New godoc
* wrap errors for creating host filter and getting hostname

view details

push time in 7 days

PR merged grafana/agent

simplify instance creation and lifecycle

As part of #83, the instance package will have to be utilized in a new way for running integrations (i.e., running an instance that does not have a discovery component).

To prepare for this change, the instance package needed a refactoring to simplify how instance objects are used. Now, instead of an instance starting immediately when it is created, it must be started explicitly through the Run function. Run is a blocking operation and will only exit on error or when the context is canceled. This change allowed cleaning up a lot of the code callers previously needed to watch the lifecycle of the Instance.
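
A rough sketch of the new lifecycle from a caller's point of view; the types here are hypothetical stand-ins that only assume the Run-based contract described above:

package main

import (
	"context"
	"log"
	"time"
)

// Instance is a stand-in for the Agent's instance type: New only constructs
// it, and Run blocks until ctx is canceled or the instance fails.
type Instance struct{}

func New() *Instance { return &Instance{} }

func (i *Instance) Run(ctx context.Context) error {
	// The real instance would create its WAL and start scraping/remote_write
	// here, tearing everything down when ctx is canceled.
	<-ctx.Done()
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	inst := New()
	done := make(chan error, 1)
	go func() { done <- inst.Run(ctx) }()

	// The caller owns the lifecycle: canceling the context stops the
	// instance, and Run returns.
	time.Sleep(100 * time.Millisecond)
	cancel()
	if err := <-done; err != nil {
		log.Fatal(err)
	}
}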

Note that the code for unregistering WAL metrics is no longer needed as the WAL is now created when Run is called.

Running an instance with a discovery manager is still required as of this PR; future work is still needed to make the individual components of the Instance optional.

/cc @gotjosh

+217 -238

0 comment

5 changed files

rfratto

pr closed time in 7 days

push event rfratto/agent

Robert Fratto

commit sha a2833ed4f7b1e06e30c89d41b77bbb55e436c148

wrap errors for creating host filter and getting hostname

view details

push time in 7 days

push event rfratto/agent

Robert Fratto

commit sha 65229ba5a6df6bab971fe8dba24488e6d073f103

update instance.New godoc

view details

push time in 7 days

push event rfratto/agent

Robert Fratto

commit sha ffad7112adb64db50ac17d1bd3e657ae1e7e9b2c

rename fanoutStorage -> storage

view details

push time in 7 days

PR opened grafana/agent

Reviewers
simplify instance creation and lifecycle

As part of #83, the instance package will have to be utilized in a new way for running integrations (i.e., running an instance that does not have a discovery component).

To prepare for this change, the instance package needed a refactoring to simplify how instance objects are used. Now, instead of an instance starting immediately when it is created, it must be started explicitly through the Run function. Run is a blocking operation and will only exit on error or when the context is canceled. This change allowed cleaning up a lot of the code callers previously needed to watch the lifecycle of the Instance.

Note that the code for unregistering WAL metrics is no longer needed as the WAL is now created when Run is called.

Running an instance with a discovery manager is still required as of this PR; future work is still needed to make the individual components of the Instance optional.

/cc @gotjosh

+208 -232

0 comment

5 changed files

pr created time in 7 days

create branch rfratto/agent

branch : refactor-instance

created branch time in 7 days

issue closed grafana/agent

Research on what is needed to embed `node_exporter`

What we need to do

As part of the Agent, we want to be able to bundle multiple exporters into the agent itself; node_exporter is one of them.

We'd like to understand what work is needed to embed it. There are multiple options here worth considering:

  • We can get the agent to scrape itself with whatever node_exporter exposes. This translates to just exposing the exporter metrics at /metrics and letting the agent scrape the endpoint.

  • We can get the contents of the node_exporter metrics and ship them through the remote-write storage.

Implementation notes

  • It is important that everything goes through the WAL.
  • Robert thinks this is just a matter of using a custom gatherer and then exposing it.
  • A good starting point for understanding what the node_exporter is doing is the main file.
  • If we go down the route of the agent scraping itself, should we have it on a different /metrics endpoint? We believe Prometheus might have increased memory usage if we scrape endpoints with many metrics, plus we'd like to keep concerns separate per exporter.

Expected output

  • We don't necessarily need to write code as an output; writing up next steps should suffice. Specifically, a new set of issues on how to tackle it.

closed time in 7 days

gotjosh

issue comment grafana/agent

Research on what is needed to embed `node_exporter`

After discussion internally, the first approach is clearly preferable, as it is quicker to implement and less risky. We can always implement the second method down the road.

I'm going to close this issue; this has been effectively split into #83 and #84.

gotjosh

comment created time in 7 days

issue opened grafana/agent

Implement a node_exporter integration

Depends on #83

A node_exporter integration should be added to the Agent, implementing the Integration interface.

created time in 7 days

issue opened grafana/agent

Implement an integrations system

As part of the project to embed exporters within the Agent, we need a generalized system that can run an "integration." An integration will initially be defined as:

  1. A metrics endpoint scoped to the specific integration (e.g., /integrations/<integration>/metrics).
  2. A set of optional extra endpoints for extra visibility into and control over integrations.
  3. The ability to Start and Stop an integration.

Integrations will have to implement an interface to expose these functionalities, roughly something like the following:

type Integration interface {
  // Name returns the name of the integration. Must be unique. 
  Name() string 
  
  // RegisterRoutes should register HTTP handlers for the integration.
  RegisterRoutes(r *mux.Router) error 

  // MetricsEndpoints should return the endpoint(s) for the integration
  // that expose Prometheus metrics. 
  MetricsEndpoints() []string 

  // Run should start the integration and block until it fails or ctx is canceled. 
  Run(ctx context.Context) error
}

The Agent must run each integration and make sure it stays alive - if an integration exits unexpectedly, it should be restarted.

The Agent should also create a special non-configurable Prometheus Instance that doesn't run any SD and collects its targets from the running integrations. This special Prometheus instance will be dedicated to scraping metrics from integrations.

There should be one "sample" integration added as part of the initial implementation: an agent integration, where the Agent collects its own metrics. A second integration, an embedded node_exporter, should eventually be added, but is out of scope for this issue.

The implementation for this feature should be split across multiple PRs.

/cc @gotjosh @hoenn

created time in 7 days

issue comment grafana/agent

remote_write config stanzas ignore password_file

Hi! This was another bug unfortunately introduced by v0.3.0; if you're not using the scraping service mode, you might want to roll back to v0.2.0 until we can do a new release that includes the fix for this (#82).

jgraettinger

comment created time in 7 days

PR opened grafana/agent

Reviewers
add password_file when converting to Prometheus basic_auth

Fixes #81

+3 -2

0 comment

1 changed file

pr created time in 7 days

create branch rfratto/agent

branch : add_password_file

created branch time in 7 days

push event grafana/agent

Cyril Tovena

commit sha e5b487e3294b2a22f27b4cb938b4422f84a50208

Add grafanaDashboardFolder in the mixin (#44) * Add grafanaDashboardFolder to put all dashboards in a Agent folder by default. Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com> * run make example-dashboards example-kubernetes Latter command not totally necessary for this, but it was out of date anyway. Co-authored-by: Robert Fratto <robertfratto@gmail.com>

view details

Robert Fratto

commit sha 12b779ab678e71b727e2a16d82936cda7f601ff6

tanka: add configmap hash as annotation to daemonset (#46) If the configmap is changed, the daemonset should be rolled out with the new changes. This does not apply for the Kubernetes manifest used in the install script, so disabling the annotation hash is configurable. This commit also removes an unused config variable.

view details

Robert Fratto

commit sha 1cbb13cb2849ca7e27182ef3415213a1318883d9

add HTTP API for storing instance configs in KV store (#42)

view details

Evan

commit sha 5ed464385123a37113bce32a913c72e84118be46

Enable strict yaml parsing (#54)

view details

Robert Fratto

commit sha c875742f8182ba10e59e3e5251c989cf814d48c8

cmd/agent: Dockerfile should be a release build by default (#59) Fixes #50

view details

Robert Fratto

commit sha f9e55a67e484e7273fe1feec2a11a6a5f97731d0

Support clustering agents (#53)

* split prometheus package into prometheus, prometheus/ha, and prometheus/instance
* add lifecycler and ring
* config change detection, sharding
* fix test errors
* fix lint errors
* add scraping service node resharding

  This commit adds support for resharding a node in the scraping service agent cluster. This is implemented using gRPC servers that expose a Reshard RPC method. While the gRPC servers are tiny right now, it is anticipated that more gRPC methods will be added in the future. The scraping service node lifecycle is now complete with this commit. When the agent starts, it does the following: 1. Join the Cortex ring 2. Connect to all nodes in ring (except self) 3. Tell nodes to reshard. When the agent is running, it reshards itself on an interval. This is done so config drift against the KV store is caught at runtime so the agents will eventually correct themselves in case of a bug. When the agent shuts down, it does the following: 1. Mark itself as leaving in the Cortex ring. 2. Connect to all nodes in ring (except self) 3. Tell nodes to reshard. 4. Leave the Cortex ring. The resharding process is accomplished as follows: 1. Retrieve all currently running configs. 2. Retrieve all configs currently registered in the KV store. 3. For each config in the config store, check to see if that config needs to be removed, added, or changed within the local agent. 4. Remove any remaining configs that are currently registered but not present within the config store.

* fix issues found in testing

  1. Staleness markers now default to not being written on shutdown. They are only useful when deploying the Agent as a daemonset and not in the other two operational modes. 2. A small script has been added to quickly deploy all of the config files to the agent-1 container. 3. An /agent/api/v1/instances endpoint has been added to see which configs an agent is currently running. 4. Initial reshard now happens in the background. This is required because otherwise the resharding process would block the server from starting, which breaks multiple agents booting at once (as the gRPC server isn't listening yet so calling Reshard will fail). 5. The instance label has changed to instance_name. 6. WAL metrics will now be unregistered when the WAL shuts down.

* pkg/prometheus/ha: split up sharding code into its own file
* pkg/prometheus/ha: propagate ctx to backoff when doing cluster-wide reshard
* write test for new /agent/api/v1/instances endpoint
* pkg/prometheus/ha: write tests for sharding mechanism

  This commit changes sharding by wrapping the inner ConfigManager with the logic for caching hashes and sharding, making the concerns for ha.Server a bit tighter and easier to test.

* add test and fix issue where recreating an instance with the same name crashes
* pkg/prometheus/ha: add lifecycle for HA server
* force replication factor to be 1 for scraping service
* documentation for scraping service
* address review feedback on scraping service docs

view details

Robert Fratto

commit sha 0e38c03aafc5b9e684730a9957e18ee76c68caf8

Add healthiness and readiness handlers (#61) * cmd/agent: add health and ready endpoints to API Closes #52. * address review feedback

view details

Robert Fratto

commit sha 3dd3fe84957b5b929b67a86b06120400f0ca7e95

pkg/prometheus/ha: store and hash secrets properly (#63) The Bearer Token and Basic Auth Password for remote_write configs were not being stored or hashed properly; those values are of a Secret type that implements a custom yaml.Marshaler and removes the underlying secret value. So when configs were being stored in the KV store, the secrets were being removed and replaced with the text "<secret>". This caused agents who loaded the config to run an instance to use the wrong password. Hashing a config had the same problem; since the hash only marshaled a config to YAML, secrets were not being calculated properly with the hash. This commit fixes both problems; secrets are stored separately now in a raw string type and hashing adds the values for secrets before summing the input. Fixes #62.

view details

Robert Fratto

commit sha c54d2c4c2dcc89679b344f072dcda029f20c2891

Introduce `agentctl` (#60)

* cmd/agentctl: initial commit
* pkg/client: add client for agent http/1 API
* implement agentctl and hook up to CI for builds
* fix test fail
* address review feedback
* agentctl: exit with failure when config-sync fails
* Update docs/scraping-service.md

  Co-authored-by: gotjosh <josue.abreu@gmail.com>

* Update pkg/agentctl/sync.go

  Co-authored-by: gotjosh <josue.abreu@gmail.com>

* address review feedback
* return error at the end of syncing if anything failed

Co-authored-by: gotjosh <josue.abreu@gmail.com>

view details

Robert Fratto

commit sha edac817be3786771aa3fd9f7c55f4f41e0c0bcf5

ci: fix typo when pushing agentctl image (#66)

view details

Robert Fratto

commit sha 2bdb8429df4be8e7e3f1c0cef4aeac6f1e0705de

Duplicate RemoteWriteConfig without the Secret type (#67) * Duplicate RemoteWriteConfig without the Secret type This commit completely removes the Secret type to allow RemoteWriteConfigs to be arbitrarily marshalled back and forth between YAML. Sanitization now happens on demand rather than being forced by the marshaling. Closes #62 * fix lint error

view details

Evan

commit sha 0eff751fc91ee31f9257e7b60d71838e964f48b8

Added dry run flag to agentctl config-sync cmd (#68) * Added dry run flag to agentctl config-sync cmd * Simplify test, replace print with logger

view details

Robert Fratto

commit sha bc7a5f87c6ed8450dcf4a6f6354d0dfde23281d7

prepare for 0.3.0 release (#69)

view details

gotjosh

commit sha c364c0736b906751cc2691139bfbf1dce6d34ec9

Update scraping-service.md

view details

Craig Peterson

commit sha 1b361a22d270c8efc0b518aacbecf0dc865a5566

Fix typo in tanka config (#71)

view details

Robert Fratto

commit sha a4e3c97bed640a2adadee2fa579742752c3a9be8

update to latest release-2.17.1-grafana (#41)

* update to latest 2.17.1-grafana
* Update docs/maintaining.md

  Co-Authored-By: gotjosh <josue.abreu@gmail.com>

* Update docs/maintaining.md

  Co-Authored-By: gotjosh <josue.abreu@gmail.com>

* address PR feedback
* some extra corrections to try to get GitHub to detect a new commit :)

Co-authored-by: gotjosh <josue.abreu@gmail.com>

view details

Robert Fratto

commit sha f7d9b25d59c0a09dd5c08ecf58c1c193bd50a027

strictly parse yaml in agentctl config-sync (#74) * strictly parse yaml in agentctl config-sync * change invalid/valid file to invalid/valid config

view details

Robert Fratto

commit sha cb169172bc130948d0322bf000fd8bc13077c672

enable strict YAML parsing for uploading a new config to config API (#75) * enable strict YAML parsing for uploading a new config to config API The instance.UnmarshalConfig will now do strict decoding. This commit also changes agentctl to use this function instead of reimplementing the same functionality. * update CHANGELOG

view details

Robert Fratto

commit sha 5310fed2826ed294f07c963ab144a43484f3c2c1

Update Prometheus vendor to latest release-grafana-2.18.1 (#72) * update Prometheus vendor to latest release-grafana-2.18.1 The k8s replace directive had to be removed for Prometheus to compile. It is now using the same version Prometheus depends on, and does not seem to break any internal usage of the Cortex ring. * update CHANGELOG to mention vendor bump

view details

Robert Fratto

commit sha 3cbb06b56291f6a96631b584042386fc0dcef0a4

fix issue where agent dashboards weren't properly assigned a folder (#76) * fix issue where agent dashboards weren't properly being added into a folder * make grafanaDashboards private

view details

push time in 7 days

issue comment grafana/agent

Instance configs missing Prometheus validation checks

ApplyDefaults and Validate also aren't being called by the config management API, which they should be. Ideally that should also be fixed as part of this issue.

rfratto

comment created time in 7 days

issue opened grafana/agent

Agent doesn't perform the same validation checks on unmarshal that Prometheus does

The Agent is missing some validation checks that Prometheus has:

Check the links to see what Prometheus does.

created time in 7 days

issue comment grafana/agent

v0.3's strict YAML parsing uncovered a typo in agent.yaml

Hi @jgraettinger, I just released v0.3.1 that includes the fix for this. Thanks again!

jgraettinger

comment created time in 7 days

created tag grafana/agent

tag v0.3.1

A remote_write-only Prometheus with some bonus features

created time in 7 days

delete branch rfratto/agent

delete branch : 0.3.1-release-prep

delete time in 7 days

push event grafana/agent

Robert Fratto

commit sha ac14fbd26b6c176406ea89a9f5a5132ac0aca875

prepare for 0.3.1 release (#79) * prepare for 0.3.1 release * don't repeat myself in the CHANGELOG * Refer to agentctl with backticks and don't uppercase it

view details

push time in 7 days

PR merged grafana/agent

Reviewers
prepare for 0.3.1 release
+40 -9

0 comment

6 changed files

rfratto

pr closed time in 7 days

push event rfratto/agent

Robert Fratto

commit sha fb2d9463ca4399e2c3aafbd12af1ab47a69f923e

Refer to agentctl with backticks and don't uppercase it

view details

push time in 7 days

push event rfratto/agent

Robert Fratto

commit sha 941ae13548c101f1450a9bc46ef545708d43f38c

don't repeat myself in the CHANGELOG

view details

push time in 7 days

PR opened grafana/agent

Reviewers
prepare for 0.3.1 release
+42 -6

0 comment

6 changed files

pr created time in 7 days

create branch rfratto/agent

branch : 0.3.1-release-prep

created branch time in 7 days

push event grafana/agent

Robert Fratto

commit sha 66c5507a473480b8e4951dd1d8b45db2847b799e

re-render Kubernetes manifest to include fix from #71

view details

Robert Fratto

commit sha 7ea4c04ff25aee606988f314003cfd0d71ecc624

Merge pull request #78 from rfratto/rerender-k8s re-render Kubernetes manifest to include fix from #71

view details

push time in 7 days

PR merged grafana/agent

re-render Kubernetes manifest to include fix from #71

Reran make example-kubernetes to rerender the Kubernetes manifest and had to rerun jb install, which was required before the render would work.

Fixes #77.

+25 -16

1 comment

4 changed files

rfratto

pr closed time in 7 days

issue closed grafana/agent

v0.3's strict YAML parsing uncovered a typo in agent.yaml

https://github.com/grafana/agent/blob/3cbb06b56291f6a96631b584042386fc0dcef0a4/production/kubernetes/agent.yaml#L171

Should be metric_relabel_configs instead, I believe?

closed time in 7 days

jgraettinger

pull request comment grafana/agent

re-render Kubernetes manifest to include fix from #71

I'm going to want to do a 0.3.1 release to include this, the Prometheus 2.18.1 update, and our fixes for strict YAML parsing.

rfratto

comment created time in 7 days

PR opened grafana/agent

re-render Kubernetes manifest to include fix from #71

Reran make example-kubernetes to rerender the Kubernetes manifest and had to rerun jb install, which was required before the render would work.

Fixes #77.

+25 -16

0 comment

4 changed files

pr created time in 7 days

create branch rfratto/agent

branch : rerender-k8s

created branch time in 7 days

issue comment grafana/agent

v0.3's strict YAML parsing uncovered a typo in agent.yaml

Yep, you're totally right. #71 fixed part of this but we didn't update the Kubernetes YAML to include the change. I'll open a PR to fix this and do a release tomorrow to include the fix.

jgraettinger

comment created time in 7 days

started slim-bean/leafbus

started time in 8 days

issue comment grafana/agent

Research on what is needed to embed `node_exporter`

After some investigation, embedding node_exporter isn't the real challenge here; the main challenge is how we're going to collect data from an embedded exporter.

As @gotjosh mentioned in the issue, there are two approaches:

  1. Expose a custom metrics endpoint for an integration that uses an isolated Prometheus Registerer and has no chance of overlapping with any other Agent metrics. Then the Agent can run an instance that automatically scrapes these endpoints.
  2. Collect metrics in a custom gatherer that bypasses the Prometheus scraper and goes directly into the WAL appender.

The first approach is nice because integrations aren't treated any differently when it comes to sample collection; they act like any other target. The second approach might save more memory by bypassing the scraping code, but I'm nervous about it in general, as the scraping code is complex and does a lot more than "just call a /metrics endpoint every n seconds."

For now, I'm partial to the first approach. We can always change this decision down the road, but the first approach is simpler to implement and less error-prone, as we don't have to try to reproduce every behavior of the Prometheus scrape manager.
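
To make the first approach concrete, here is a minimal standalone sketch of an isolated Registerer behind its own metrics endpoint, using prometheus/client_golang; the metric and path names are illustrative only:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Isolated registry: metrics registered here cannot collide with the
	// Agent's default registry or with any other integration.
	reg := prometheus.NewRegistry()

	requests := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "demo_integration_requests_total",
		Help: "Example counter owned by the integration.",
	})
	reg.MustRegister(requests)

	// The Agent would then scrape this endpoint like any other target.
	http.Handle("/integrations/demo/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}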

WDYT @gotjosh @hoenn? If we're all on board with this, I'll create issues for building a system to handle integrations and close this issue as completed.

gotjosh

comment created time in 8 days
