
cockroachdb/cockroach 22117

CockroachDB - the open source, cloud-native distributed SQL database.

seth/ej 175

Helper module for working with Erlang terms representing JSON

nathwill/chef-systemd 41

resource-driven chef cookbook for managing linux systems via systemd

stevendanna/cookbook-r 22

Chef Cookbook for R

chef-boneyard/delivery-cluster 20

DEPRECATED: Deployment cookbook for standing up Delivery clusters using chef-provisioning.

stevendanna/cookbook-xinetd 5

Chef cookbook for xinetd

paulmooring/knife-stalenodes 4

Knife plugin for listing stale nodes

stevendanna/basebox 2

Clean & Uniform Vagrant Basebox Definitions

stevendanna/chef 1

A systems integration framework, built to bring the benefits of configuration management to your entire infrastructure.

stevendanna/chef-sugar 1

Chef Sugar is a Gem & Chef Recipe that includes a series of helpful sugars on top of the Chef core and other resources to make a cleaner, leaner recipe DSL, enforce DRY principles, and make writing Chef recipes an awesome experience!

PR opened cockroachdb/cockroach

release-21.1: roachtest: retry java install

Backport 1/1 commits from #70491.

/cc @cockroachdb/release


Occasionally, the apt mirrors in GCP return 503 Service Unavailable.

Here, we retry the install attempt 3 times with some backoff between attempts.

I've also added the --no-install-recommends flag, although it does very little in this case.

Release note: None
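For illustration, a minimal Go sketch of the retry-with-backoff pattern the description refers to; the helper name, package name, retry count, and backoff duration are assumptions, not the actual roachtest code:

package main

import (
    "fmt"
    "os/exec"
    "time"
)

// aptGetInstall retries an apt-get install a few times with a fixed
// backoff, since the GCP apt mirrors occasionally return 503 Service
// Unavailable. Hypothetical helper, not the actual roachtest code.
func aptGetInstall(pkg string) error {
    var err error
    for attempt := 1; attempt <= 3; attempt++ {
        cmd := exec.Command("sudo", "apt-get", "install", "-y", "--no-install-recommends", pkg)
        if err = cmd.Run(); err == nil {
            return nil
        }
        fmt.Printf("install attempt %d failed: %v; backing off\n", attempt, err)
        time.Sleep(30 * time.Second)
    }
    return err
}

func main() {
    // Illustrative package name.
    if err := aptGetInstall("openjdk-8-jdk-headless"); err != nil {
        fmt.Println("install failed after retries:", err)
    }
}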

+18 -2

0 comments

1 changed file

pr created time in 6 hours

create branch stevendanna/cockroach

branch: backport21.1-70491

created branch time in 6 hours

pull request comment cockroachdb/cockroach

jobs,changefeedccl: propagate session ID to changeFrontier

Still a draft as I'm writing some tests for this. I'm pretty tempted to change some of the jobs APIs further, but I want to put some more thought into them before doing that.

stevendanna

comment created time in 8 hours

PR opened cockroachdb/cockroach

jobs,changefeedccl: propagate session ID to changeFrontier

The changeFrontier updates the job multiple times during its run. It does this via a *job.Job object that it explicitly loads at startup using a JobID that is passed to it via its distflow spec.

However, when the job is loaded this way, the sessionID associated with the initial adoption of the job is not set. This means the changeFrontier can continue updating the job without error long after the session from the original adoption has expired, potentially leading to a single changefeed job running twice.

This change addresses that by passing the original session ID to the changeFrontier. When the changeFrontier attempts to update the job record (i.e., checkpoint progress), it will fail if its session has expired.

Note that for this to be correct it requires that the changeFrontier is only ever scheduled on the sql instance that adopts the job.

Release note (enterprise change): Fixes a bug that could have led to duplicate instances of a single changefeed job running for prolonged periods of time.
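A minimal Go sketch of the guard described above, using simplified hypothetical types standing in for the actual CockroachDB jobs API:

package main

import "fmt"

type SessionID string

// Job is a simplified stand-in: SessionID is empty when the job was
// loaded without the claim session from its original adoption.
type Job struct {
    ID        int64
    SessionID SessionID
}

// update refuses to touch the job record unless the session that
// originally adopted the job is known and still live.
func (j *Job) update(isLive func(SessionID) bool, write func(id int64) error) error {
    if j.SessionID == "" {
        return fmt.Errorf("job %d: refusing update without a claim session", j.ID)
    }
    if !isLive(j.SessionID) {
        return fmt.Errorf("job %d: claim session %q has expired", j.ID, j.SessionID)
    }
    return write(j.ID)
}

func main() {
    j := &Job{ID: 1} // loaded without its claim session, as in the bug
    err := j.update(func(SessionID) bool { return true }, func(int64) error { return nil })
    fmt.Println(err)
}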

+183 -50

0 comments

10 changed files

pr created time in 8 hours

push event stevendanna/cockroach

Steven Danna

commit sha 87f4d1a4b3a2682326be3dd4588adf210236a6c4

jobs,changefeedccl: propagate session ID to changeFrontier

The changeFrontier updates the job multiple times during its run. It does this via a *job.Job object that it explicitly loads at startup using a JobID that is passed to it via its distflow spec.

However, when the job is loaded this way, the sessionID associated with the initial adoption of the job is not set. This means the changeFrontier can continue updating the job without error long after the session from the original adoption has expired, potentially leading to a single changefeed job running twice.

This change addresses that by passing the original session ID to the changeFrontier. When the changeFrontier attempts to update the job record (i.e., checkpoint progress), it will fail if its session has expired.

Note that for this to be correct it _requires_ that the changeFrontier is only ever scheduled on the sql instance that adopts the job.

Release note (enterprise change): Fixes a bug that could have led to duplicate instances of a single changefeed job running for prolonged periods of time.

view details

push time in 8 hours

create branch stevendanna/cockroach

branch: propagate-session-id

created branch time in 8 hours

PullRequestReviewEvent

issue opened cockroachdb/cockroach

changefeedccl: changefeed jobs can be adopted by multiple nodes

Describe the problem

At a customer, we noted a changefeed job whose changeFrontier was running on more than one node, which would imply that the changefeed job itself was running twice.

The job may run on two nodes for some period when a node is partitioned from the rest of the cluster, but the claim_session_id in the jobs table should prevent two different nodes from updating the same job.

However, the claim_session_id is only checked in Go, not in the SQL update statement itself, and if the claim_session_id is not set on the job being updated, no such check is done at all:

https://github.com/cockroachdb/cockroach/blob/cfb433ab29dd9a72361a6c3298680cd9cd400a37/pkg/jobs/update.go#L169-L181

In this case two different nodes can continue to happily update the same job. While the sessionID is initially set on all newly created jobs, the changefeed code explicitly reloads the job in two places:

https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/changefeedccl/changefeed_processors.go#L1089
https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/changefeedccl/changefeed_stmt.go#L747

and the LoadJob method used there does not populate the sessionID:

https://github.com/cockroachdb/cockroach/blob/cfb433ab29dd9a72361a6c3298680cd9cd400a37/pkg/jobs/jobs.go#L763
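For illustration, a Go sketch of what pushing the claim check into SQL could look like: the UPDATE matches zero rows when the claim is stale. The table and column names follow the issue text, but the query shape and helper are assumptions, not the actual jobs code:

package example

import "database/sql"

// updateJobProgress only touches the job row if the caller still holds
// the claim: a stale claim_session_id makes the UPDATE match zero rows.
// Hypothetical helper; the real jobs-table schema has more columns.
func updateJobProgress(db *sql.DB, jobID int64, claimSessionID, progress []byte) (bool, error) {
    res, err := db.Exec(
        `UPDATE system.jobs SET progress = $1 WHERE id = $2 AND claim_session_id = $3`,
        progress, jobID, claimSessionID,
    )
    if err != nil {
        return false, err
    }
    n, err := res.RowsAffected()
    if err != nil {
        return false, err
    }
    return n == 1, nil // false: job missing or claimed by another session
}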

created time in a day

pull request comment cockroachdb/cockroach

release-21.1: changefeedccl: allow changefeeds on regional by row tables

@miretskiy I think this backport is ready for review now.

stevendanna

comment created time in a day

PullRequestReviewEvent
PullRequestReviewEvent

pull request comment cockroachdb/cockroach

testutils: run roachvet on test files

Sorry about any backporting troubles this has caused. I've opened a backport here for discussion: https://github.com/cockroachdb/cockroach/pull/70561

stevendanna

comment created time in a day

PR opened cockroachdb/cockroach

release-21.2: testutils: run roachvet on test files

Backport 1/1 commits from #70194.

/cc @cockroachdb/release


The previously constructed extended regular expression would end up looking like:

(pkg/*_test.go)|(pkg/workload)|...|(pkg/testutil):.*Use of go keyword not allowed

As a result, roachvet was being skipped on all test files and all but the last package in this list.
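The root cause is alternation precedence: the trailing `:.*Use of go keyword not allowed` binds only to the last alternative, so every other alternative filters lines on its own. A runnable Go sketch of the difference, with illustrative patterns rather than the actual roachvet filter:

package main

import (
    "fmt"
    "regexp"
    "strings"
)

func main() {
    pkgs := []string{`pkg/.*_test\.go`, `pkg/workload`, `pkg/testutil`}
    msg := `:.*Use of go keyword not allowed`

    // Broken: the message suffix binds only to the last alternative, so
    // the bare patterns match any vet output mentioning those paths.
    broken := strings.Join(pkgs, "|") + msg

    // Fixed: grouping the alternation applies the suffix to every pattern.
    fixed := "(" + strings.Join(pkgs, "|") + ")" + msg

    // An unrelated vet failure that should NOT be filtered out.
    line := `pkg/workload/foo_test.go:12:3: printf: wrong number of args`
    fmt.Println(regexp.MustCompile(broken).MatchString(line)) // true (over-matches, error suppressed)
    fmt.Println(regexp.MustCompile(fixed).MatchString(line))  // false (error surfaces)
}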

Fixes #70193

Release note: None

+119 -134

0 comments

41 changed files

pr created time in a day

create branch stevendanna/cockroach

branch: backport21.2-70194

created branch time in a day

push event stevendanna/cockroach

Paul Bardea

commit sha 51ea2a548bf3edbe9b8591e804a750b02b603a83

backupccl: stop including historical databases in cluster backup Descs

A previous commit attempted to fix a bug where cluster backup would not include tables in dropped databases between incremental backups. That fix aimed to find dropped databases and add them to the set of descriptors. However, this causes issues when a database is recreated with the same name.

Rather than adding the dropped DBs to the Descriptors field on the backup manifest, this commit updates how DescriptorChanges are populated for cluster backups with revision history. Now, the initial scan of descriptors as of the start time will look for all descriptors in the cluster rather than just those that were resolved as of the end time of the backup.

Release note (bug fix): Fix a bug where cluster revision-history backups may have included dropped descriptors in the "current" snapshot of descriptors on the cluster.

Release justification: bug fix. Fix a bug where cluster revision-history backups may have included dropped descriptors in the "current" snapshot of descriptors on the cluster.

view details

Tobias Grieger

commit sha be6c8efceedc46229bbcbae5f4db9f8b30dfcc8a

kvserver: remove extraneous circuit breaker check in Raft transport

See https://github.com/cockroachdb/cockroach/issues/68419.

We now use `DialNoBreaker` for the raft transport, taking into account the previous `Ready()` check. `DialNoBreaker` was previously bypassing the breaker as it ought to but was also *not reporting to the breaker* the result of the operation; this is not ideal and was caught by the tests. This commit changes `DialNoBreaker` to report the result (i.e. fail or success).

Release justification: bug fix

Release note (bug fix): Previously, after a temporary node outage, other nodes in the cluster could fail to connect to the restarted node due to their circuit breakers not resetting. This would manifest in the logs via messages "unable to dial nXX: breaker open", where `XX` is the ID of the restarted node. (Note that such errors are expected for nodes that are truly unreachable, and may still occur around the time of the restart, but for no longer than a few seconds).

view details

Yahor Yuzefovich

commit sha 034c189365804f9a61bde83323975af32e45164d

distsql: restore EvalCtx.Mon on the flow cleanup

In `setupFlow`, if we're setting up a flow on the gateway, we're using `LocalState` to save on deserialization of some state. Notably, we pass the eval context that we used during the physical planning. That eval context can be mutated (in particular, we're updating its `Mon` field to the "flow" memory monitor), and previously this could cause issues when automatically retrying stats collection jobs (possibly there could be other issues). This commit introduces a callback to restore the local eval context to its original state which is done on the flow cleanup.

Release note (bug fix): Previously, table stats collection issued via `ANALYZE` statement or via `CREATE STATISTICS` statement without specifying `AS OF SYSTEM TIME` option could run into `flow: memory budget exceeded`, and this has been fixed.

Release justification: fix to a long standing bug.

view details

Yahor Yuzefovich

commit sha 7fa3094d919547b38dcfe189b1b1aeb4a09ae90a

distsql: fix cleaning up resources in an error case in setupFlow

Release note: None

Release justification: low-risk improvement to resources' cleanup in an edge case.

view details

Oliver Tan

commit sha 2a230b86e372446c5c93f581705c1744b0132869

sql: allow ALTER TYPE ... OWNER TO ... for multi-region enum

Release justification: fix for old functionality

Release note (sql change): Previously, one could not alter the owner of the crdb_internal_region type which is created by initiating a multi-region database. This is now possible.

view details

Oliver Tan

commit sha 3c072d55b1290f8a471244822286b86f249db940

sql: apply zone configs for copied INDEX for ALTER PK on RBR tables

Release justification: bug fix for existing feature

Release note (bug fix): Previously, when using ALTER PRIMARY KEY on a REGIONAL BY ROW table, the copied unique index from the old PRIMARY KEY would not have the correct zone configurations applied. This commit fixes that. Users who encountered this bug should re-create the index.

view details

Paul Bardea

commit sha 93e0a9f22cfb57f7bb0310d7ce345a61f87aa118

backupccl: ignore old database revisions in cluster backups

Cluster backups could briefly include dropped databases in the backup manifest's `Descriptors` rather than its `DescriptorChanges`.

Release note (bug fix): RESTORE will now correctly ignore dropped databases that may have been included in cluster backups with revision history.

view details

Rafi Shamim

commit sha 1c10b7c7a61d5e0b6f10b73db01b356f8dd1fc8c

sql/pgwire: fix race condition with default_int_size

The default_int_size session variable needs to be read in a different goroutine from the session itself. It's read at parsing time, when reading from the pgwire stmtBuf, but is written inside the session. This was a race condition. Now, the value is changed atomically.

Release justification: bug fix

Release note: None

view details

Aayush Shah

commit sha 2e21e785cf491f06b97a88dfde51d165adbad9e0

kvserver: stop transferring leases to replicas that may need snapshots

This commit disallows the `replicateQueue` from initiating lease transfers to replicas that may be in need of a raft snapshot. Note that the `StoreRebalancer` already has a stronger form of this check since it disallows lease transfers to replicas that are lagging behind the raft leader (which includes the set of replicas that need a snapshot).

In cases where the raft leader is not the leaseholder, we disallow the replicateQueue from any sort of lease transfer until leaseholdership and leadership are collocated. We rely on calls to `maybeTransferRaftLeadershipToLeaseholderLocked()` (called on every raft tick) to make sure that such periods of leadership / leaseholdership misalignment are ephemeral and rare.

Release justification: bug fix

Release note (bug fix): Fixes a bug that can cause prolonged unavailability due to lease transfer to a replica that may be in need of a raft snapshot.

view details

Rebecca Taft

commit sha f2d4be6725b89d75643732ae4bd429db788b2820

opt: add a small cost for all operators with unbounded cardinality

This commit updates the coster so that a small cost is added for any operator with unbounded cardinality. This ensures we prefer plans that push limits as far down the tree as possible, all else being equal.

Release note (performance improvement): Updated the optimizer cost model so that all else being equal, the optimizer prefers plans in which LIMIT operators are pushed as far down the tree as possible. This can reduce the number of rows that need to be processed by higher operators in the plan tree, thus improving performance.

Release justification: low risk, high benefit change to existing functionality

view details

Yahor Yuzefovich

commit sha 0620019f596f1b1963d2be6d13af29bae18ed9e4

builtins: harden some builtins against OidWrappers

This commit replaces several usages of unsafe `*tree.DString` interface conversions with safe `tree.AsDString` function calls.

Release note (bug fix): Previously, an internal error or a crash could occur when some `crdb_internal` builtin functions took string-like type arguments (e.g. `name`). This is now fixed.

Release justification: low-risk bug fix.

view details

Steven Danna

commit sha 982928c1edcafc12ce43615c83adbfb7b2c58ad1

changefeedccl: signal changeAggregator shutdown from the kvfeed

During acceptance testing, we observed that changefeeds did not correctly restart on primary key changes and did not correctly stop when schema_change_policy was set to 'stop' when the changeFrontier and changeAggregator were running on different nodes (most production changefeeds).

The root cause of this was a bad assumption in the changeAggregator shutdown logic. Namely, we assumed that the changeAggregator (and kv feed) would see context cancellations as a result of the changeFrontier moving to draining. However, this is not guaranteed. When the changeFrontier moves to draining, all of its inputs will be drained. But, a DrainRequest message is only sent to the input lazily when the next message is received from that input. In the case of a schema change, the kv feed would stop sending messages to the changeAggregator, so no further messages would be sent to the changeFrontier and the drain request was never triggered.

With this change, we now shut down the changeAggregator when the kvfeed indicates that no more messages will be returned.

Fixes #68791

Release note (enterprise change): Fixed a bug where CHANGEFEEDs would fail to correctly handle a primary key change.

Release justification: Critical fix for previously released functionality.

view details

Steven Danna

commit sha 489ceecf46cb49f5837525740e76906b32384788

testutils: extend SucceedsSoon timeout under race

Release justification: non-production code change

Release note: None

view details

Rebecca Taft

commit sha 9e6af3fc9b11b8775a8fd6a182ca6729983b84f9

Merge pull request #69977 from rytaft/backport21.1-69688

release-21.1: opt: add a small cost for all operators with unbounded cardinality

view details

Rebecca Taft

commit sha f067e4571962e279c422504593e88f9248dbae63

opt: fix regression in stats estimation for large tables

This commit fixes a regression in statistics estimation that could cause the optimizer to severely underestimate the number of rows returned by an expression.

Fixes #69709

Release note (bug fix): Fixed a regression in statistics estimation in the optimizer for very large tables. The bug, which has been present since 20.2.14 and 21.1.7, could cause the optimizer to severely underestimate the number of rows returned by an expression. This has now been fixed.

Release justification: low risk bug fix to existing functionality

view details

Bilal Akhtar

commit sha a6bf186446e16a1b755cdd20d327fb660d9e12fd

roachtest: Stop node in disk-full roachtest before restarting.

Pebble doesn't have a distinction for fatal background errors, so no space left on device during a flush/compaction ends up being a suppressed background error. Seeing as this isn't going to be fixed right away, this change updates the disk-full roachtest to stop the node being stressed, in case it hasn't crashed on its own, before attempting to restart it. The test still checks for the restarts failing, so it's still a useful roachtest with this change.

Fixes #61686. Backport of #61954.

Release note: None.

view details

Bilal Akhtar

commit sha 8c3c2f2fe0e351dd07bedbd86fb1bdcbfd059f3c

Merge pull request #70028 from itsbilal/disk-full-stop-node-21.1

[release-21.1] roachtest: Stop node in disk-full roachtest before restarting.

view details

Michael Erickson

commit sha 766806caf3c7a4d985f39cd6d9f87099d5265470

sql: add sql.mutations.max_row_size.log guardrail (large row logging)

Addresses: #67400

Add sql.mutations.max_row_size.log, a new cluster setting which controls large row logging. Rows larger than this size will have their primary keys logged to the SQL_PERF or SQL_INTERNAL_PERF channels whenever the SQL layer puts them into the KV layer. This logging takes place in rowHelper, which is used by both row.Inserter and row.Updater. Most of the work is plumbing settings.Values and SessionData into rowHelper, and adding a new structured event type.

Release note (ops change): A new cluster setting, sql.mutations.max_row_size.log, was added, which controls large row logging. Whenever a row larger than this size is written (or a single column family if multiple column families are in use) a LargeRow event is logged to the SQL_PERF channel (or a LargeRowInternal event is logged to SQL_INTERNAL_PERF if the row was added by an internal query). This could occur for INSERT, UPSERT, UPDATE, CREATE TABLE AS, CREATE INDEX, ALTER TABLE, ALTER INDEX, IMPORT, or RESTORE statements. SELECT, DELETE, TRUNCATE, and DROP are not affected by this setting.

Release justification: Low risk, high benefit change to existing functionality. This adds logging whenever a large row is written to the database. Default is 0 in 21.1 (meaning disabled).

view details

Michael Erickson

commit sha 38a778efb072ebb962b1fc353112df40a98d4d6f

sql: add sql.mutations.max_row_size.err guardrail (large row errors)

Addresses: #67400

Add sql.mutations.max_row_size.err, a new cluster setting similar to sql.mutations.max_row_size.log, which limits the size of rows written to the database. Statements trying to write a row larger than this will fail with an error. (Internal queries will not fail with an error, but will log a LargeRowInternal event to the SQL_INTERNAL_PERF channel.) We're reusing eventpb.CommonLargeRowDetails as the error type, out of convenience.

Release note (ops change): A new cluster setting, sql.mutations.max_row_size.err, was added, which limits the size of rows written to the database (or individual column families, if multiple column families are in use). Statements trying to write a row larger than this will fail with a code 54000 (program_limit_exceeded) error. (Internal queries writing a row larger than this will not fail, but will log a LargeRowInternal event to the SQL_INTERNAL_PERF channel.) This limit is enforced for INSERT, UPSERT, and UPDATE statements. CREATE TABLE AS, CREATE INDEX, ALTER TABLE, ALTER INDEX, IMPORT, and RESTORE will not fail with an error, but will log LargeRowInternal events to the SQL_INTERNAL_PERF channel. SELECT, DELETE, TRUNCATE, and DROP are not affected by this limit. **Note that existing rows violating the limit *cannot* be updated, unless the update shrinks the size of the row below the limit, but *can* be selected, deleted, altered, backed-up, and restored.** For this reason we recommend using the accompanying setting sql.mutations.max_row_size.log in conjunction with SELECT pg_column_size() queries to detect and fix any existing large rows before lowering sql.mutations.max_row_size.err.

Release justification: Low risk, high benefit change to existing functionality. This causes statements adding large rows to fail with an error. Default is 0 in 21.1 (meaning disabled).

view details

Michael Erickson

commit sha 5385fa0b5cfc8b5806d940db0cc4f867742f3e2d

sql: rename max_row_size guardrails to match transaction row limits

Addresses: https://github.com/cockroachdb/cockroach/issues/67400

Rename sql.mutations.max_row_size.{log|err} to sql.guardrails.max_row_size_{log|err} for consistency with transaction_rows_{read|written}_{log|err} and upcoming metrics.

Release justification: Low-risk update to new functionality.

Release note (ops change): New variables sql.mutations.max_row_size.{log|err} were renamed to sql.guardrails.max_row_size_{log|err} for consistency with other variables and metrics.

view details

push time in a day

push event stevendanna/cockroach

Steven Danna

commit sha b757086e955c2d338534d069bfb37e823a75eb58

changefeedccl: allow changefeeds on regional by row tables

Previously we explicitly disallowed changefeeds on regional by row tables because: 1) the Avro encoder lacked enum support, 2) the schema feed did not correctly identify RBR schema changes as both a column addition and a primary key change, and 3) we were unsure whether users would be OK seeing the new crdb_region column in their changefeed output.

Both (1) and (2) have since been fixed. For (3) we've decided that changefeeds should rather faithfully emit what KV tells us about. Future work to support projections and predicate filtering may allow _users_ to filter what gets output, but until then we want to err on the side of avoiding ad-hoc modifications to the changefeed output. Thus, we are OK emitting the crdb_region column as-is both in the row data and as part of the primary key.

Release note (enterprise change): CHANGEFEEDs no longer fail when started on REGIONAL BY ROW tables. Note that in REGIONAL BY ROW tables, the crdb_region column becomes part of the primary index. Thus, changing an existing table to REGIONAL BY ROW will trigger a changefeed backfill with new messages emitted using the new composite primary key.

view details

push time in a day

pull request comment cockroachdb/cockroach

Makefile: delete problematic bindata.go file

bors r=rickystewart,erikgrinaker

stevendanna

comment created time in a day

PR opened cockroachdb/cockroach

Makefile: delete problematic bindata.go file

This file gets left on disk when moving between branches. The build failure it leads to:

pkg/ui/distccl/bindata.go:1138:15: undefined: ui.Asset
pkg/ui/distccl/bindata.go:1138:33: undefined: ui.AssetDir
pkg/ui/distccl/bindata.go:1138:57: undefined: ui.AssetInfo

leaves many confused. This file was never checked in, and the entire directory is gitignored, increasing confusion since git status doesn't show any problem.

This adds a Makefile target that deletes the offending file.

Release note: None

+20 -1

0 comments

2 changed files

pr created time in 2 days

create branch stevendanna/cockroach

branch: nuke-from-orbit

created branch time in 2 days

pull request comment cockroachdb/cockroach

changefeedccl: check that rangefeeds are enabled earlier

bors r=miretskiy

stevendanna

comment created time in 2 days

pull request comment cockroachdb/cockroach

workload: log histogram write/encode failures, close output file

bors r=erikgrinaker

stevendanna

comment created time in 2 days

pull request comment cockroachdb/cockroach

workload: log histogram write/encode failures, close output file

TFTR!

I'm seeing that you're adding an explicit Sync() here - so maybe this does indeed fix the problem. Worth a try for sure.

Definitely possible. I think I had overlooked this because we were able to Get() the file and then decode it. But those reads could have all come from the cache even though the data hadn't been flushed to disk yet.
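A minimal Go sketch of the failure mode under discussion, with an illustrative file name and payload; without the explicit Sync(), a read-back can be served from the OS page cache even though nothing was durably flushed:

package main

import (
    "log"
    "os"
)

func main() {
    f, err := os.Create("histograms.json") // illustrative file name
    if err != nil {
        log.Fatal(err)
    }
    if _, err := f.Write([]byte(`{"hist":[]}`)); err != nil {
        log.Fatal(err)
    }
    // Without an explicit Sync, the written bytes may only live in the
    // OS page cache: reading the file back succeeds (served from that
    // same cache) even though nothing has reached the disk yet.
    if err := f.Sync(); err != nil {
        log.Fatal(err)
    }
    if err := f.Close(); err != nil {
        log.Fatal(err)
    }
}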

stevendanna

comment created time in 2 days

pull request comment cockroachdb/cockroach

roachtest: retry java install

bors r=miretskiy

stevendanna

comment created time in 2 days

push event stevendanna/cockroach

Steven Danna

commit sha 229f505afb94405a527d10a033bb32b84aa5e288

changefeedccl: Don't test non-error condition in error test

This tested a successful statement in our TestChangefeedErrors test, which left us in a somewhat unexpected state when investigating #70301. Further, I think passing an empty string to ExpectErr is confusing, so I've disallowed it.

Release note: None

view details

push time in 2 days

pull request comment cockroachdb/cockroach

roachtest: retry java install

bors r=miretskiy

stevendanna

comment created time in 2 days

PullRequestReviewEvent

Pull request review comment cockroachdb/cockroach

changefeedccl: Don't test non-error condition in error test

func (sr *SQLRunner) ExecRowsAffected(

// matching the given regex.
func (sr *SQLRunner) ExpectErr(t testutils.TB, errRE string, query string, args ...interface{}) {
	t.Helper()
+	if errRE == "" {
+		t.Fatal("Empty error regular expression was provided to ExpectErr. This would assert that no error occurred. Use Exec instead, it is less confusing.")
+	}

I'll need to check the call sites of this. I'm going to let CI do the first round of work for me. :D

stevendanna

comment created time in 2 days

PullRequestReviewEvent

push event stevendanna/cockroach

Steven Danna

commit sha c3387e0489e08c028be66f93375d7ed0bb64247a

changefeedccl: Don't test non-error condition in error test

This tested a successful statement in our TestChangefeedErrors test, which left us in a somewhat unexpected state when investigating #70301. Further, I think passing an empty string to ExpectErr is confusing, so I've disallowed it.

Release note: None

view details

push time in 2 days

PR opened cockroachdb/cockroach

changefeedccl: Don't test non-error condition in error test

This tested a successful statement in our TestChangefeedErrors test, which left us in a somewhat unexpected state when investigating #70301.

Further, I think passing an empty string to ExpectErr is confusing, so I've disallowed it.

Release note: None

+3 -5

0 comments

2 changed files

pr created time in 2 days