
asimshankar/go-tensorflow 21

Toying around with TensorFlow in Go

asimshankar/java-tensorflow 18

Toying around with the TensorFlow Java API

asimshankar/amarok-remote 5

Remotely control amarok from a web browser like Safari on the iPhone

asimshankar/bn256 3

Experimenting with C-bindings for bn256 pairings (https://godoc.org/golang.org/x/crypto/bn256)

asimshankar/triangles 3

Go+V23+Mobile

asimshankar/imagej-tensorflow 2

Toying around with TensorFlow image models in an ImageJ plugin

asimshankar/visionapi 1

Toying around with the Google and Microsoft vision APIs

asimshankar/aws-sdk-go-v2 0

AWS SDK for the Go programming language.

asimshankar/edward 0

A probabilistic programming language in TensorFlow. Deep generative models, variational inference.

asimshankar/prometheus-operator 0

Prometheus Operator creates/configures/manages Prometheus clusters atop Kubernetes

issue comment scylladb/scylla

Node unreachable, unresponsive to clients (nodetool status disagrees)

Yup, single-row tombstones in that case.

I may have oversimplified; the table schema is actually:

CREATE TABLE table(
  x text,
  y text,
  z text,
  data text,
  PRIMARY KEY (x, y, z)
)

And the delete statement is:

DELETE FROM table USING TIMESTAMP ? WHERE x = ? AND y = ?

(No z, so it is deleting multiple rows in a single statement, but no ranges are involved.)

asimshankar

comment created time in 14 days

issue comment scylladb/scylla

Node unreachable, unresponsive to clients (nodetool status disagrees)

Apologies for my shallow understanding of the terminology, but I think our deletes would qualify as "tombstones over whole partition".

Specifically, the DELETE statements were of the form:

DELETE FROM table USING TIMESTAMP ? WHERE x = ?

where the table's schema is of the form:

CREATE TABLE table (
    x text,
    y text,
    z text,
    PRIMARY KEY (x, y)
) 
asimshankar

comment created time in 14 days

fork asimshankar/aws-sdk-go-v2

AWS SDK for the Go programming language.

fork in a month

issue comment scylladb/scylla

Node unreachable, unresponsive to clients (nodetool status disagrees)

@denesb - thanks for digging in. From the PromQL query sum(scylla_cql_deletes) / sum(scylla_cql_inserts), deletes are about 10-12% of inserts.

asimshankar

comment created time in a month

Pull request review comment neevaco/antch

Retry on 429s

 func (c *Crawler) scanRequestWork(workCh chan chan *http.Request, closeCh chan i
 			spider.reqch <- requestAndChan{req: req, ch: resc}
 			select {
 			case re := <-resc:
-				closeRequest(req)
 				if re.err != nil {
 					c.logf("crawler: send HTTP request got error: %v", re.err)
 				} else {
-					go func(res *http.Response) {
+					go func(clonedReq *http.Request, res *http.Response) {
 						defer closeResponse(res)
 						defer func() {
 							if r := recover(); r != nil {
 								c.logf("crawler: Handler got panic error: %v", r)
 							}
 						}()
+
+						var retryResponseCode bool
+						if len(c.RetryHTTPResponseCodes) > 0 {
+							for _, v := range c.RetryHTTPResponseCodes {
+								if res.StatusCode == v {
+									retryResponseCode = true
+									break
+								}
+							}
+						}
+						url := clonedReq.URL.String()
+						var recrawl bool
+						c.urlNumRetriesMu.Lock()
+						if c.urlNumRetries == nil {
+							c.urlNumRetries = make(map[string]int)
+						}
+						if retryResponseCode {
+							val, _ := c.urlNumRetries[url]
+							if val < c.maxRetries() {
+								c.urlNumRetries[url]++
+								recrawl = true
+							}
+						} else {
+							// Finally, we succeeded! Reset the retry count.
+							c.urlNumRetries[url] = 0
+						}
+						c.urlNumRetriesMu.Unlock()
+
+						if recrawl {
+							timeSleep := c.secondsBetweenRetries()
+							if timeSleepHeader, err := strconv.Atoi(res.Header.Get("Retry-After")); err == nil && timeSleepHeader > 0 {
+								timeSleep = timeSleepHeader
+							}
+							select {
+							case <-clonedReq.Context().Done():
+								c.logf("crawler: aborted because context done")
+							case <-time.After(time.Second * time.Duration(rand.Intn(1+2*timeSleep))):
+								// Randomize the sleeps to spread out retries.
+								c.Crawl(clonedReq)

Ah, sorry, my bad. Yeah, this is fine since it only queues it up.

vivekraghunathan

comment created time in a month

Pull request review comment neevaco/antch

Retry on 429s

 func (c *Crawler) requestTimeout() time.Duration {
 	return 30 * time.Second
 }
 
+func (c *Crawler) secondsBetweenRetries() int {
+	if v := c.SecondsBetweenRetries; v > 0 {
+		return v
+	}
+	return 10
+}
+
+func (c *Crawler) maxRetries() int {

Why do we need this accessor? It seems that if this were simply:

func (c *Crawler) maxRetries() int { return c.MaxRetries }

it would work just as well, and at that point we might as well remove the method and read the field directly?

vivekraghunathan

comment created time in a month

Pull request review comment neevaco/antch

Retry on 429s

 func (c *Crawler) scanRequestWork(workCh chan chan *http.Request, closeCh chan i
 			spider.reqch <- requestAndChan{req: req, ch: resc}
 			select {
 			case re := <-resc:
-				closeRequest(req)
 				if re.err != nil {
 					c.logf("crawler: send HTTP request got error: %v", re.err)
 				} else {
-					go func(res *http.Response) {
+					go func(clonedReq *http.Request, res *http.Response) {
 						defer closeResponse(res)
 						defer func() {
 							if r := recover(); r != nil {
 								c.logf("crawler: Handler got panic error: %v", r)
 							}
 						}()
+
+						var retryResponseCode bool
+						if len(c.RetryHTTPResponseCodes) > 0 {
+							for _, v := range c.RetryHTTPResponseCodes {
+								if res.StatusCode == v {
+									retryResponseCode = true
+									break
+								}
+							}
+						}
+						url := clonedReq.URL.String()
+						var recrawl bool
+						c.urlNumRetriesMu.Lock()
+						if c.urlNumRetries == nil {
+							c.urlNumRetries = make(map[string]int)
+						}
+						if retryResponseCode {
+							val, _ := c.urlNumRetries[url]
+							if val < c.maxRetries() {
+								c.urlNumRetries[url]++
+								recrawl = true
+							}
+						} else {
+							// Finally, we succeeded! Reset the retry count.
+							c.urlNumRetries[url] = 0
+						}
+						c.urlNumRetriesMu.Unlock()
+
+						if recrawl {
+							timeSleep := c.secondsBetweenRetries()
+							if timeSleepHeader, err := strconv.Atoi(res.Header.Get("Retry-After")); err == nil && timeSleepHeader > 0 {
+								timeSleep = timeSleepHeader
+							}
+							select {
+							case <-clonedReq.Context().Done():
+								c.logf("crawler: aborted because context done")
+							case <-time.After(time.Second * time.Duration(rand.Intn(1+2*timeSleep))):
+								// Randomize the sleeps to spread out retries.
+								c.Crawl(clonedReq)

This is still recursive, right? Just making sure we're okay with that.

vivekraghunathan

comment created time in a month

Pull request review comment neevaco/antch

Retry on 429s

 func (c *Crawler) scanRequestWork(workCh chan chan *http.Request, closeCh chan i
 			spider.reqch <- requestAndChan{req: req, ch: resc}
 			select {
 			case re := <-resc:
-				closeRequest(req)
 				if re.err != nil {
 					c.logf("crawler: send HTTP request got error: %v", re.err)
 				} else {
-					go func(res *http.Response) {
+					go func(clonedReq *http.Request, res *http.Response) {
 						defer closeResponse(res)
 						defer func() {
 							if r := recover(); r != nil {
 								c.logf("crawler: Handler got panic error: %v", r)
 							}
 						}()
+
+						var retryResponseCode bool
+						if len(c.RetryHTTPResponseCodes) > 0 {
+							for _, v := range c.RetryHTTPResponseCodes {
+								if res.StatusCode == v {
+									retryResponseCode = true
+									break
+								}
+							}
+						}
+						url := clonedReq.URL.String()
+						var recrawl bool
+						c.urlNumRetriesMu.Lock()
+						if c.urlNumRetries == nil {
+							c.urlNumRetries = make(map[string]int)
+						}
+						if retryResponseCode {
+							val, _ := c.urlNumRetries[url]
+							if val < c.maxRetries() {
+								c.urlNumRetries[url]++
+								recrawl = true
+							}
+						} else {
+							// Finally, we succeeded! Reset the retry count.
+							c.urlNumRetries[url] = 0

Should we delete the entry from the map instead of setting it to zero, i.e., delete(c.urlNumRetries, url)? Otherwise the map is ever-growing, right?
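
To make that concrete, here's a rough sketch of the success path with the entry removed instead of zeroed. It reuses the urlNumRetries map, mutex, and helpers from the diff above; it's just an illustration, not the final code:

c.urlNumRetriesMu.Lock()
if retryResponseCode {
	if c.urlNumRetries == nil {
		c.urlNumRetries = make(map[string]int)
	}
	if c.urlNumRetries[url] < c.maxRetries() {
		c.urlNumRetries[url]++
		recrawl = true
	}
} else {
	// Success: drop the key entirely so the map only tracks URLs that are
	// currently failing, instead of growing by one entry per crawled URL.
	delete(c.urlNumRetries, url)
}
c.urlNumRetriesMu.Unlock()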

vivekraghunathan

comment created time in a month

Pull request review comment neevaco/antch

Retry on 429s

 func (c *Crawler) scanRequestWork(workCh chan chan *http.Request, closeCh chan i
 			spider.reqch <- requestAndChan{req: req, ch: resc}
 			select {
 			case re := <-resc:
-				closeRequest(req)
 				if re.err != nil {
 					c.logf("crawler: send HTTP request got error: %v", re.err)
 				} else {
-					go func(res *http.Response) {
+					go func(clonedReq *http.Request, res *http.Response) {
 						defer closeResponse(res)
 						defer func() {
 							if r := recover(); r != nil {
 								c.logf("crawler: Handler got panic error: %v", r)
 							}
 						}()
+
+						var retryResponseCode bool

It may make sense to factor this out into a helper method to make it easier to read, so that this reads as:

if !c.shouldRetry(res) {
  closeRequest(req)
  h, _ := c.Handler(res)
  h.ServeSpider(c.writeCh, res)
  return
}
// Retry the request
...
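
A rough sketch of what that helper could look like, reusing the RetryHTTPResponseCodes field added in this PR (the exact shape is illustrative):

// shouldRetry reports whether the response carries one of the configured
// retryable status codes.
func (c *Crawler) shouldRetry(res *http.Response) bool {
	for _, code := range c.RetryHTTPResponseCodes {
		if res.StatusCode == code {
			return true
		}
	}
	return false
}
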
vivekraghunathan

comment created time in a month

Pull request review comment neevaco/antch

Retry on 429s

 type Crawler struct {
 	// Default is 32.
 	MaxConcurrentItems int
 
+	// MaxNumRetries specifies the maximum number of retries we'll make for a particular URL.
+	// Default is 0 (excluding the original attempt)
+	MaxRetries int
+
+	// RetryHTTPResponseCodes specifies the response codes for which we'll retry for a particular URL.
+	RetryHTTPResponseCodes []int
+
+	// Time in seconds between retries.
+	// Default is 10.
+	SecondsBetweenRetries int

Make this DefaultRetryPeriod time.Duration, i.e.,

(a) time.Duration instead of int, and (b) "Default..." because it is overridden if the response provides a Retry-After, right?
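
For concreteness, a sketch of the renamed field and a matching accessor. The names follow the suggestion above and are not the final code:

type Crawler struct {
	// ...

	// DefaultRetryPeriod is the wait between retries when the response does
	// not provide a usable Retry-After header. Defaults to 10s if unset.
	DefaultRetryPeriod time.Duration
}

func (c *Crawler) defaultRetryPeriod() time.Duration {
	if c.DefaultRetryPeriod > 0 {
		return c.DefaultRetryPeriod
	}
	return 10 * time.Second
}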

vivekraghunathan

comment created time in a month

Pull request review comment neevaco/antch

Retry on 429s

 func (c *Crawler) scanRequestWork(workCh chan chan *http.Request, closeCh chan i
 			spider.reqch <- requestAndChan{req: req, ch: resc}
 			select {
 			case re := <-resc:
-				closeRequest(req)
 				if re.err != nil {
 					c.logf("crawler: send HTTP request got error: %v", re.err)
 				} else {
-					go func(res *http.Response) {
+					go func(clonedReq *http.Request, res *http.Response) {
 						defer closeResponse(res)
 						defer func() {
 							if r := recover(); r != nil {
 								c.logf("crawler: Handler got panic error: %v", r)
 							}
 						}()
+						// TODO(vivek): Don't want to retry forever. How should I escape out?
+						// TODO(vivek): Should I respect Retry-After perfectly?
+						if res.StatusCode == http.StatusTooManyRequests {
+							timeSleep := 10
+							if timeSleepHeader, err := strconv.Atoi(res.Header.Get("Retry-After")); err == nil && timeSleepHeader > 0 && timeSleepHeader < timeSleep {
+								timeSleep = timeSleepHeader
+							}
+							time.Sleep(time.Second * time.Duration(timeSleep))

You want to respect the context here, so I think this should be something like:

select {
case <-clonedReq.Context().Done():
  c.logf("crawler: aborted because context done")
  return
case <-time.After(time.Duration(timeSleep) * time.Second):
  ...
}
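
Putting the pieces together, one way this could look as a helper. The waitBeforeRetry name is illustrative; it reuses the secondsBetweenRetries accessor that appears elsewhere in this PR:

// waitBeforeRetry waits for the configured retry period, or for the
// server-provided Retry-After if present, and returns false if the request's
// context is cancelled before the wait completes.
func (c *Crawler) waitBeforeRetry(req *http.Request, res *http.Response) bool {
	wait := time.Duration(c.secondsBetweenRetries()) * time.Second
	if secs, err := strconv.Atoi(res.Header.Get("Retry-After")); err == nil && secs > 0 {
		wait = time.Duration(secs) * time.Second
	}
	select {
	case <-req.Context().Done():
		c.logf("crawler: aborted because context done")
		return false
	case <-time.After(wait):
		return true
	}
}
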
vivekraghunathan

comment created time in a month

Pull request review comment neevaco/antch

Retry on 429s

 func (c *Crawler) scanRequestWork(workCh chan chan *http.Request, closeCh chan i
 			spider.reqch <- requestAndChan{req: req, ch: resc}
 			select {
 			case re := <-resc:
-				closeRequest(req)
 				if re.err != nil {
 					c.logf("crawler: send HTTP request got error: %v", re.err)
 				} else {
-					go func(res *http.Response) {
+					go func(clonedReq *http.Request, res *http.Response) {
 						defer closeResponse(res)
 						defer func() {
 							if r := recover(); r != nil {
 								c.logf("crawler: Handler got panic error: %v", r)
 							}
 						}()
+						// TODO(vivek): Don't want to retry forever. How should I escape out?
+						// TODO(vivek): Should I respect Retry-After perfectly?
+						if res.StatusCode == http.StatusTooManyRequests {
+							timeSleep := 10
+							if timeSleepHeader, err := strconv.Atoi(res.Header.Get("Retry-After")); err == nil && timeSleepHeader > 0 && timeSleepHeader < timeSleep {
+								timeSleep = timeSleepHeader
+							}
+							time.Sleep(time.Second * time.Duration(timeSleep))
+							c.Crawl(clonedReq)

Isn't this prone to infinite recursion?

vivekraghunathan

comment created time in a month

create branch neevaco/webrisk

branch : neeva

created branch time in 2 months

Pull request review comment neevaco/antch

Allow option for antch to set HTTP2ForceUpgrade transport param

 func TestRobotstxtHandler(t *testing.T) {
 	}{
 		{"/", "", true},
 		{"/shopping/", "", true},
-		{"account", "", false},

Why this change?

vivekraghunathan

comment created time in 2 months

Pull request review comment neevaco/antch

Allow option for antch to set HTTP2ForceUpgrade transport param

+module github.com/neevaco/antch

Optional: Do you want to split out the go.mod changes into a separate PR, or at least attempt to upstream that separately?

vivekraghunathan

comment created time in 2 months

Pull request review comment neevaco/antch

Allow option for antch to set HTTP2ForceUpgrade transport param

 type Crawler struct {
 	// UserAgent specifies the user-agent for the remote server.
 	UserAgent string
 
+	// ForceAttemptHTTP2 specifies the corresponding field in the underlying HTTP transport.
+	ForceAttemptHTTP2 bool

Instead of making this an option, I'd just remove the option and force it to true in line 202. It seems that doing so is better behavior now (and consistent with http.DefaultTransport?)
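
For reference, a minimal sketch of a transport with the flag hard-coded, matching http.DefaultTransport's behavior since Go 1.13. The constructor name and the other fields shown are illustrative, not antch's actual settings:

func newTransport() *http.Transport {
	return &http.Transport{
		Proxy:               http.ProxyFromEnvironment,
		ForceAttemptHTTP2:   true, // always attempt the HTTP/2 upgrade
		MaxIdleConns:        100,
		IdleConnTimeout:     90 * time.Second,
		TLSHandshakeTimeout: 10 * time.Second,
	}
}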

vivekraghunathan

comment created time in 2 months

issue comment scylladb/scylla

Node unreachable, unresponsive to clients (nodetool status disagrees)

Apologies, I'm still figuring out how to prune our application's metrics from the data before uploading it (we have a single monitoring stack for both our application and all the services we use; any tips on how to do so would be much appreciated).

Some more information on our workload: this problem seems to be easily triggered by a job that scans the entire table (that job follows the recommendations for such scans). We added BYPASS CACHE to it and to other readers, and as of right now it appears that doing so mitigates the problem.
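
For illustration, a minimal sketch of a scan query issued with BYPASS CACHE from a Go client. The gocql driver, contact point, keyspace, table, and column names here are assumptions for the example, not our actual job:

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("192.168.219.199")
	cluster.Keyspace = "ks" // assumed keyspace name
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// BYPASS CACHE (a Scylla CQL extension) asks the server not to populate
	// the row cache with the data read by this scan.
	iter := session.Query(`SELECT x, y, data FROM t BYPASS CACHE`).Iter()
	var x, y, data string
	for iter.Scan(&x, &y, &data) {
		fmt.Println(x, y, data)
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}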

asimshankar

comment created time in 2 months

issue opened scylladb/scylla

Node unreachable, unresponsive to clients (nodetool status disagrees)

Installation details
Scylla version (or git commit hash): 3.2.3
Cluster size: 10 nodes, ~1TB of data total (3x replication)
OS (RHEL/CentOS/Ubuntu/AWS AMI): Running on Kubernetes on Amazon EKS, AWS AMI amazon-eks-node-1.14-v20191213 (ami-087a82f6b78a07557)

Hardware details (for performance issues)
Platform (physical/VM/cloud instance type/docker): docker on i3.2xlarge (NVMe SSD)

Scylla flags: scylla --developer-mode 0 --smp 7

Our Scylla cluster has been running into issues where a node becomes “unavailable”: nodetool status on all nodes thinks it is fine, but clients are unable to connect to the node and requests time out. This happens multiple times a day, so any suggestions on how to mitigate this would be much appreciated!

Specifically:

  1. The bad node uses ~100% CPU constantly (other nodes are using <10%), and top shows that it’s all on a single core. The specific core that’s pegged changes as this runs.
  2. cqlsh fails to connect with Connection error: ('Unable to connect to any servers', {'192.168.219.199': OperationTimedOut('errors=None, last_host=None',)})
  3. Prometheus metrics can’t be collected (curl localhost:9180/metrics never returns)
  4. scyllatop and node_health_check both hang
  5. A bunch of connections are stuck in CLOSE_WAIT (lsof -p $(pidof scylla) | grep CLOSE_WAIT)
  6. nodetool status on both the bad node and the other nodes thinks the node is fine. For example:
    UN  192.168.149.124  57.14 GB   256          ?       e69a4d0e-8ca5-4292-b1e3-9c99b70e00b4  1a
    UN  192.168.172.108  64.67 GB   256          ?       c2f7248a-fc95-4a69-952f-42f07a955a50  1a
    UN  192.168.213.30   55.19 GB   256          ?       5654b8f3-cf62-4015-925a-dc7a31bfaa84  1b
    UN  192.168.168.63   51.56 GB   256          ?       0c37b902-4c32-4202-a590-b8f4f2027a8e  1a
    UN  192.168.250.207  61.56 GB   256          ?       f8f0bd97-e5bd-4c78-862f-45ad11b1aa87  1b
    UN  192.168.233.132  59.76 GB   256          ?       3ce1c6e1-a83e-47d6-a4cd-9922922e33cb  1b
    UN  192.168.198.21   60.35 GB   256          ?       652ba539-9a29-444b-bf90-1453e1280c58  1b
    UN  192.168.232.22   62.89 GB   256          ?       37525c68-5b1d-4df7-b20f-9532144998cb  1b
    UN  192.168.219.199  65.42 GB   256          ?       770285bd-5174-44c5-90d2-3db5936ce19e  1b
    UN  192.168.157.215  61 GB      256          ?       a420b6af-7d7d-4b3b-a18b-53383885e5ea  1a

Logs from one of the unreachable nodes:

    INFO  2020-03-16 23:02:39,176 [shard 6] compaction - Compacting [/var/lib/scylla/data/system/peers-37f71aca7dc2383ba70672528af04d4f/mc-4430-big-Data.db:level=0, /var/lib/scylla/data/system/peers-37f71aca7dc2383ba70672528af04d4f/mc-4423-big-Data.db:level=0, ]
    INFO  2020-03-16 23:02:39,185 [shard 2] compaction - Compacted 2 sstables to [/var/lib/scylla/data/system/peers-37f71aca7dc2383ba70672528af04d4f/mc-3012-big-Data.db:level=0, ]. 16751 bytes to 11343 (~67% of original) in 10ms = 1.08MB/s. ~256 total partitions merged to 1.
    INFO  2020-03-16 23:02:39,187 [shard 6] compaction - Compacted 2 sstables to [/var/lib/scylla/data/system/peers-37f71aca7dc2383ba70672528af04d4f/mc-4437-big-Data.db:level=0, ]. 67757 bytes to 56844 (~83% of original) in 11ms = 4.93MB/s. ~256 total partitions merged to 3.
    INFO  2020-03-16 23:02:50,356 [shard 0] storage_service - Node 192.168.213.30 state jump to normal
    INFO  2020-03-16 23:02:50,366 [shard 6] compaction - Compacting [/var/lib/scylla/data/system/peers-37f71aca7dc2383ba70672528af04d4f/mc-4444-big-Data.db:level=0, /var/lib/scylla/data/system/peers-37f71aca7dc2383ba70672528af04d4f/mc-4437-big-Data.db:level=0, ]
    INFO  2020-03-16 23:02:50,375 [shard 6] compaction - Compacted 2 sstables to [/var/lib/scylla/data/system/peers-37f71aca7dc2383ba70672528af04d4f/mc-4451-big-Data.db:level=0, ]. 67761 bytes to 56812 (~83% of original) in 9ms = 6.02MB/s. ~256 total partitions merged to 3.
    ERROR 2020-03-17 00:30:17,686 [shard 6] storage_proxy - Exception when communicating with 192.168.219.199: seastar::semaphore_timed_out (Semaphore timedout)

Note that the daemon went unresponsive at around 00:00 (UTC), about an hour after the second-to-last log message, and that 192.168.219.199 is the IP address of this node.

Some screenshots from the Grafana dashboards are attached below. The bad behavior begins on two nodes at around 5pm (PST) on the graphs, which corresponds to 00:00 in the logs (whose timestamps are in UTC). Note that metrics from the bad nodes aren’t available because those nodes aren’t responding to requests for Prometheus metrics, so all the metrics shown are from the remaining nodes.

[Attached: four Grafana dashboard screenshots, captured 2020-03-16 between 6:03 PM and 6:04 PM]

Thanks a bunch for your time!

created time in 2 months
