profile

Josh Wills (jwills) · San Francisco, CA · http://twitter.com/josh_wills · Gainfully unemployed data person

grpc/grpc-web 4602

gRPC for Web Clients

broadinstitute/gatk 861

Official code repository for GATK versions 4 and up

HopkinsIDD/COVIDScenarioPipeline 73

Public shared code for doing scenario forecasting and creating reports for various governmental entities.

jwills/avro-json 29

Utilities for converting to and from JSON from Avro records via Hadoop streaming or Hive.

broadinstitute/gatk-dataflow 4

Development dataflow

jwills/crunch-demo 4

A demo application for getting started with Apache Crunch.

jwills/avroplay 3

Me messing around with some Avro stuff

jwills/datafu 3

Hadoop library for large-scale data processing

jwills/dbt-materialize 3

materialize.io plugin for dbt

jwills/attribution 2

MapReduce job for creating multitouch attribution models.

fork jwills/crunch

Mirror of Apache Crunch (Incubating)

fork time in a month

GollumEvent (wiki edit) × 2

pull request comment HopkinsIDD/COVIDScenarioPipeline

Trying to get some docs up on how to run things on AWS

Now wiki-fied: https://github.com/HopkinsIDD/COVIDScenarioPipeline/wiki/AWS-Instructions

jwills

comment created time in 2 months

GollumEvent (wiki edit) × 4

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha ac5453087ec02b91de68b5f15a159093d6156a99

Writing more stuff about launching inference jobs

view details

push time in 2 months

Pull request review comment HopkinsIDD/COVIDScenarioPipeline

Trying to get some docs up on how to run things on AWS

## Introduction

This document contains instructions for setting up and running the two different kinds of SEIR modeling jobs supported by the COVIDScenarioPipeline repository on AWS:

1. *Inference* jobs, using AWS Batch to coordinate hundreds/thousands of jobs across a fleet of servers, and
2. *Planning* jobs, using a single relatively large EC2 instance (usually an `r5.24xlarge`) to run one or more planning scenarios on a single high-powered machine.

Most of the steps required to set up and run the two different types of jobs on AWS are identical, and I will explicitly call out the places where different steps are required. Throughout the document, we assume that your client machine is a UNIX-like environment (e.g., OS X, Linux, or WSL).

## Local Client Setup

I need a few things to be true about the local machine that you will be using to connect to AWS that I'll outline here:

1. You have created and downloaded a `.pem` file for connecting to an EC2 instance to your `~/.ssh` directory. When we provision machines, you'll need to use the `.pem` file for connecting.
2. You have created a `~/.ssh/config` file that contains an entry that looks like this so we can use `staging` as an alias for your provisioned EC2 instance in the rest of the runbook:

    ```
    host staging
    HostName <IP address of provisioned server goes here>
    IdentityFile ~/.ssh/<your .pem file goes here>
    User ec2-user
    IdentitiesOnly yes
    StrictHostKeyChecking no
    ```
3. You can [connect to Github via SSH](https://help.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh). This is important because we will need to use your Github SSH key to interact with private repositories from the `staging` server on EC2.

## Provisioning The Staging Server

If you are running an *Inference* job, you should use a small instance type for your staging server (e.g., an `m5.xlarge` will be more than enough). If you are running a *Planning* job, you should provision a beefy instance type (I am especially partial to the memory-and-CPU heavy `r5.24xlarge`, but given how fast the planning code has become, an `r5.8xlarge` should be perfectly adequate).

If you have access to the `jh-covid` account, you should use the *IDD Staging AMI* (`ami-03641dd0c8554e5d0`) to provision and launch new staging servers; it is already set up with all of the dependencies described in this section. You can find the AMI [here](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Images:sort=imageId), select it, and press the *Launch* button to walk through the Launch Wizard to choose your instance type and `.pem` file for your staging server. Once your instance is provisioned, be sure to put its IP address into the `HostName` section of the `~/.ssh/config` file on your local client so that you can connect to it from your client by simply typing `ssh staging` in your terminal window.

If you do *not* have access to the `jh-covid` account, you should walk through the regular EC2 Launch Wizard flow and be sure to choose the *Amazon Linux 2 AMI (HVM), SSD Volume Type* (`ami-0e34e7b9ca0ace12d`, the 64-bit x86 version) AMI. Once the machine is up and running and you can

Yeah; to make things as consistent as possible, I wrote up the instructions for Amazon Linux instead of Ubuntu. There are a number of small differences in how the machines need to be provisioned and I felt like the option that would maximize sanity was running stuff on staging the exact same way the batch jobs themselves are run.

jwills

comment created time in 2 months

Pull request review comment HopkinsIDD/COVIDScenarioPipeline

Trying to get some docs up on how to run things on AWS

(Review context: the same Introduction / Local Client Setup / Provisioning hunk quoted above, continuing from where it left off:)

Once the machine is up and running and you can SSH to it, you will need to run the following code to install the software you will need for the rest of the run:

```
sudo yum -y update
sudo yum -y install awscli
sudo yum -y install git
sudo yum -y install docker.io
sudo yum -y install pbzip2

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
sudo yum -y install git-lfs
git lfs install
```

## Connect to Github

Once your staging server is provisioned and you can connect to it, you should `scp` the private key file that you use for connecting to Github to the `/home/ec2-user/.ssh` directory on the staging server (e.g., if the local file is named `~/.ssh/id_rsa`, then you should run `scp ~/.ssh/id_rsa staging:/home/ec2-user/.ssh` to do the copy). For convenience, you should create a `/home/ec2-user/.ssh/config` file on the staging server that has the following entry:

```
host github.com
 HostName github.com
 IdentityFile ~/.ssh/id_rsa
 User git
```

This way, the `git clone`, `git pull`, etc. operations that you run on the staging server will use your SSH key without constantly prompting you to login. You should now be able to clone a COVID19 data repository into your home directory on the staging server to do work against. For this example, I'm going to use the `COVID19_Minimal` repo, so I would run `git clone git@github.com:HopkinsIDD/COVID19_Minimal.git` to get it onto the staging server. By convention, I usually do runs (for both Planning and Inference jobs) with the `COVIDScenarioPipeline` repository nested inside of the data repository, so I would then do `cd COVID19_Minimal; git clone git@github.com:HopkinsIDD/COVIDScenarioPipeline.git` to clone the modeling code itself into a child directory of the data repository.

## Getting and Launching the Docker Container

The previous section is only for getting a minimal set of dependencies set up on your staging server. To do an actual run, you will need to download the Docker container that contains the more extensive set of dependencies we need for running the code in the `COVIDScenarioPipeline` repository. For *Inference* jobs, please run `sudo docker pull hopkinsidd/covidscenariopipeline:latest-dev`;

That would certainly simplify things. 🤔

jwills

comment created time in 2 months

Pull request review comment HopkinsIDD/COVIDScenarioPipeline

Trying to get some docs up on how to run things on AWS

(Review context: the same provisioning hunk quoted above, through the `yum install` commands.)

Ooh, TIL!

jwills

comment created time in 2 months

pull request comment HopkinsIDD/COVIDScenarioPipeline

Trying to get some docs up on how to run things on AWS

thank you for writing this up! FYI: there's a CSP wiki, not sure where these docs should go.

I am more-or-less incapable of writing anything in a browser b/c Twitter will distract me, so I do everything in vim. But happy to migrate the text to the wiki once I finish the draft if that is the preferred venue!

jwills

comment created time in 2 months

PR opened HopkinsIDD/COVIDScenarioPipeline

Trying to get some docs up on how to run things on AWS

Creating this as a draft so I can get some feedback while I work through the rest of the instructions.

+106 -0

0 comments

1 changed file

pr created time in 2 months

create branch HopkinsIDD/COVIDScenarioPipeline

branch : jwills_runbooking

created branch time in 2 months

PR closed HopkinsIDD/COVIDScenarioPipeline

WIP: batch job for post-processing output of large-scale inference jobs

I wrote the pieces here to move the output of the sims to a single location in S3:

  1. aggregate_inference_job.py: the controller/launcher job, similar to the other batch launcher jobs,
  2. aggregate_runner.sh: the wrapper runner script for the batch worker, and
  3. output_mover.py: the script that does the actual work on the batch worker, copying sim outputs from S3 to a local directory on the machine before they are all copied out to a target S3 location by aggregate_runner.sh.

My thought here was that we could then set up and run Sam's Spark jobs on the big box using the output that is copied locally by the output mover.
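
The feed only names those three pieces, so here is a rough Python sketch of the copy loop that output_mover.py is described as performing; every function and argument name below is an assumption for illustration, not the PR's actual code:

```python
import os
import boto3

def move_outputs(bucket, sims_prefix, local_dir):
    """Copy every sim output under sims_prefix from S3 to local_dir; the
    wrapper script (aggregate_runner.sh) then ships local_dir out to the
    target S3 location in one pass."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=sims_prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Mirror the S3 key structure under local_dir.
            dest = os.path.join(local_dir, os.path.relpath(key, sims_prefix))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(bucket, key, dest)
```

Staging everything on local disk first lets aggregate_runner.sh ship the whole directory to the target S3 location in a single pass, which is the division of labor the PR text describes.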

+199 -0

0 comments

3 changed files

jwills

pr closed time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : jwills_pkg_updates

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : dataseed_batch

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : jwills_output_offsets

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : run_covid19_california_jwills_20200422040225

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : run_covid19_california_jwills_20200422034723

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : dvc_filter_USA

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : dataseed_batch2

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : jwills_dvc_filter_USA_by_scenario

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : jwills_dev_simulate

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : inference_dev_merge

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : jwills_inference_runnable

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : jwills_inference

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : jwills_inference_batch_org

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : inference_25

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : jwills_inference_improvements

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : inference_env_vars

delete time in 2 months

delete branch HopkinsIDD/COVIDScenarioPipeline

delete branch : jwills_better_retry

delete time in 2 months

PR closed HopkinsIDD/COVIDScenarioPipeline

Do a better job of tarring up the CSP code + add a manifest.json file for each run

Also updating the default memory setting to be 4G instead of 2G.

+24 -4

0 comments

1 changed file

jwills

pr closed time in 2 months

PR closed HopkinsIDD/COVIDScenarioPipeline

WIP for launching batch runs on AWS directly from COVIDScenarioPipeline

Going to be iterating on this for a bit, but posting it for feedback from folks now.

+269 -2

2 comments

7 changed files

jwills

pr closed time in 2 months

pull request comment HopkinsIDD/COVIDScenarioPipeline

error handling logic for inference runs

Verified that this works with some test jobs that intentionally called the error_handler, triggered some fake "successes" and saw that downstream jobs failed once the threshold of 10 failures was exceeded.

jwills

comment created time in 2 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 95d597c61fb6ce7a1bf6888fac30bc6294082bb9

load-bearing typo

view details

push time in 2 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 02aaeb8c6f9b7c5c698fb814c73d8540572f952c

fix typo

view details

Josh Wills

commit sha 18db77ab3c79aaf7c5b29b65fc1661463b639a0b

consistency fix

view details

push time in 2 months

push event HopkinsIDD/COVIDScenarioPipeline

Sam Shah

commit sha fe1054a57aa9a4ec29508b51b79c1cb2135d99a8

Minor fixes
- `config$affected_geoids` now a set instead of a list. This deals with an issue during a Maryland run where a geoid was specified twice, which broke the code.
- Moved csv_to_csr.py to scripts dir

view details

kkintaro

commit sha 8e757addc2f8ff423840292174f49f712f723f51

Merge pull request #314 from HopkinsIDD/qol_improvements: Minor fixes

view details

kkintaro

commit sha 378410cd337befd6c73797e642cde1323a04f0fc

Calculate incidence data for USAFacts (#308)
- Calculate incidents and fix negative incident counts, missing dates, and NA values
- Add a test.

view details

Justin Lessler

commit sha f0e0a6f63dd4aaed4e9033aaa5326fb57f0fb165

Fixed some problems with week and reading single states for filter_MC

view details

eclee25

commit sha 07f4436ffe3ae89ff2e39676cbcf0550860b56c7

USAFacts fulljoin case/death and add test

view details

Justin Lessler

commit sha 277635cbebc2fcfc3f796cdd536f64999797f5a5

Update NAESPACE after deleting filter_MC

view details

eclee25

commit sha 293a464f85d825f5e1d42a096a57ca54fe9dee89

dplyr::full_join

view details

eclee25

commit sha 55f24490b35a3f0fb0068a811c602db7103e40ae

import dopar

view details

jkamins7

commit sha d4ebe699196f94d030ad7dd47d73693e57b92ed2

Updated to fix week aggregation issue

view details

eclee25

commit sha d014e141d3fab4d922363d557490116d56c3eed4

commit man pages

view details

Justin Lessler

commit sha 71aa5dedbcc3fcc7409e0d3f30187556ef71d2fb

adds facilities for making more forecast like cumDeath forecasts

view details

Justin Lessler

commit sha 8a765178e071383618bbedc1403c83cd017cdfa2

Fixed NAMESPACE conflict

view details

javierps

commit sha a94a797da1bc4ebbf223c1473383b280a71421e5

Second try commit inference test

view details

javierps

commit sha 3538a24a92ba7c82c361531ce691d0bbe334e16b

clean inference test markdown

view details

javierps

commit sha 020cf90c39eb91b8fcb96734801caa878092e134

set redo_fit to true

view details

Justin Lessler

commit sha ad6c5e87cba9f549e4c8252ebbce6274ebdf986f

Added incident deaths

view details

Justin Lessler

commit sha 1f1d6261aa40e927539f36de8453391c1c7c265e

Merge branch 'inference' of https://github.com/HopkinsIDD/COVIDScenarioPipeline into inference

view details

Sam Shah

commit sha eae9daecab82e35db8053c10852162453d43dc70

POC of environmental vars

view details

Josh Wills

commit sha bd5c0fb4753c95fc83b6353f3996165dbbd020df

WIP for the batch job launcher using new env vars

view details

Josh Wills

commit sha 9d71e925a3a6ba23159fd7373c7a0574b9e2819d

Fixes to make the new launcher/envvar code work correctly in batch

view details

push time in 2 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha c7a15285f1d1afbaa1fdaea083312f7bf4b5d4da

Fixes to make the new launcher/envvar code work correctly in batch

view details

push time in 2 months

PR opened HopkinsIDD/COVIDScenarioPipeline

error handling logic for inference runs

We currently have the problem where a single failure in one of the child jobs will cause the parent job for a run to be marked as failed, which prevents the copy job from running during the last phase of the pipeline. A small number of child job failures isn't actually a problem for us, so I'm adding a handler function that will:

  1. Trigger a retry of the job on a failure when the retry count is less than 3, and
  2. Exit the job on its third failure as "successful," but write a failure indicator file to S3 marking the job as having had a failure.

To prevent the problem where every job we run fails but gets marked as succeeding, I'm adding a check to the start of each inference run that will count up the number of failure indicator files in the failure indicator directory, and if it is greater than 10, every job in the run will fail.

Note that a consequence of this change is that some inference runs will re-start from scratch again after an initial failure, which will mean that some slots will have been generated using fewer sims than other slots. I'm not sure of the right way to handle this, but I'm assuming that dropping that slot entirely is an acceptable fallback.
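
A minimal sketch of that retry-and-indicator scheme, assuming failure indicators are empty S3 objects under a hypothetical `failures/` prefix; the names and boto3 usage here are illustrative, and only the thresholds (3 retries, 10 failures) come from the PR text:

```python
import boto3

MAX_RETRIES = 3        # from the PR text: retry until the third failure
FAILURE_THRESHOLD = 10 # from the PR text: >10 indicator files fails the run

s3 = boto3.client("s3")

def check_run_health(bucket, run_prefix):
    """Run at the start of each inference job: abort the whole run if too
    many sibling jobs have already recorded failures."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=f"{run_prefix}/failures/")
    if resp.get("KeyCount", 0) > FAILURE_THRESHOLD:
        raise SystemExit("too many failed child jobs; failing the run")

def error_handler(bucket, run_prefix, job_id, attempt):
    """On the first two failures, exit nonzero so AWS Batch retries the job;
    on the third, exit 'successfully' but leave a failure indicator in S3 so
    the parent job (and the final copy job) can still run."""
    if attempt < MAX_RETRIES:
        raise SystemExit(1)  # nonzero exit triggers a Batch retry
    s3.put_object(Bucket=bucket, Key=f"{run_prefix}/failures/{job_id}", Body=b"")
```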

+23 -6

0 comments

1 changed file

pr created time in 3 months

create branch HopkinsIDD/COVIDScenarioPipeline

branch : jwills_better_retry

created branch time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 68eded3906ee7a922ef6e37ff547d0a269fdc539

WIP for the batch job launcher using new env vars

view details

push time in 3 months

create branch HopkinsIDD/COVIDScenarioPipeline

branch : inference_env_vars

created branch time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 7e5d95773637d06c6ab70da70a2f0d070a68e99c

Update inference_job_status.py to print the state of all queues, not just the ones with a special prefix

view details

push time in 3 months
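
The commit message above describes the behavior but the script itself isn't shown in the feed; a hedged boto3 sketch of what "print the state of all queues, not just the ones with a special prefix" amounts to (output format and names assumed):

```python
import boto3

# Describe every Batch job queue, with no filtering on a name prefix,
# and print a rough count of jobs in each active state.
batch = boto3.client("batch")
for queue in batch.describe_job_queues()["jobQueues"]:
    name = queue["jobQueueName"]
    for status in ("SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING"):
        # NB: list_jobs paginates via nextToken; one page is enough for a sketch.
        jobs = batch.list_jobs(jobQueue=name, jobStatus=status)["jobSummaryList"]
        print(f"{name:40s} {status:10s} {len(jobs)}")
```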

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 7ff546fc7d30ee89101afd327b0543baa1fc4512

Make inference_copy.sh more forgiving of missing files

view details

push time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 061b03309081c4b7dc96acd0b66680b4fb019413

Various improvements to running batch jobs (#311)
- Trying to make things in inference job running more better
- Need to be smarter about sample_data

view details

push time in 3 months

PR merged HopkinsIDD/COVIDScenarioPipeline

Various improvements to running batch jobs
  1. Allowing us to change the number of sims run per job from 10 to some other number (esp. useful when doing the CA/NY/MD runs, which are smaller/faster than the USA run),
  2. removing the need for a target .dvc file in favor of just specifying output folders directly (with the usual suspects as the default settings),
  3. removing the need for the -p option and just always parallelizing by scenario/death config,
  4. ensuring we always build + install the python module instead of just installing it,
  5. and skipping the copying of some potentially large subdirs of COVIDScenarioPipeline (e.g. sample_data/united-states-commutes and build)
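
To make items 1 and 2 above concrete, here is a hedged argparse sketch of the launcher's new knobs; the `-o/--outputs` rename matches a commit elsewhere in this feed, while the exact flag name for sims per job and the default folder list are assumptions:

```python
import argparse

parser = argparse.ArgumentParser(description="launch a COVIDScenarioPipeline batch run")
# Item 1: sims per job is now configurable instead of hard-coded at 10.
parser.add_argument("--sims-per-job", type=int, default=10,
                    help="simulations per AWS Batch job; smaller state runs can afford more")
# Item 2: output folders are named directly, replacing the target .dvc file
# (this default list is a guess at the 'usual suspects').
parser.add_argument("-o", "--outputs", nargs="+",
                    default=["model_output", "model_parameters", "hospitalization"],
                    help="output folders to exclude from the tarball and sync back from S3")
args = parser.parse_args()
print(args.sims_per_job, args.outputs)
```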
+42 -62

0 comments

2 changed files

jwills

pr closed time in 3 months

Pull request review comment HopkinsIDD/COVIDScenarioPipeline

Various improvements to running batch jobs

```
 def launch(self, job_name, config_file, batch_job_queue):

     # Prepare to tar up the current directory, excluding any dvc outputs, so it
     # can be shipped to S3
-    dvc_outputs = get_dvc_outputs()
     tarfile_name = f"{job_name}.tar.gz"
     tar = tarfile.open(tarfile_name, "w:gz", dereference=True)
     for p in os.listdir('.'):
         if p == 'COVIDScenarioPipeline':
             for q in os.listdir('COVIDScenarioPipeline'):
-                if not (q == 'packrat' or q.startswith('.')):
+                if not (q == 'packrat' or q == 'sample_data' or q == 'build' or q.startswith('.')):
                     tar.add(os.path.join('COVIDScenarioPipeline', q))
-        elif not (p.startswith(".") or p.endswith("tar.gz") or p in dvc_outputs or p == "batch"):
+                elif q == 'sample_data':
+                    for r in os.listdir('COVIDScenarioPipeline/sample_data'):
+                        if r != 'united-states-commutes':
+                            tar.add(os.path.join('COVIDScenarioPipeline', 'sample_data', r))
+        elif not (p.startswith(".") or p.endswith("tar.gz") or p in self.outputs):
```

or i'll do it in a follow-on PR

jwills

comment created time in 3 months

Pull request review comment HopkinsIDD/COVIDScenarioPipeline

Various improvements to running batch jobs

(Review context: the same `launch()` tarring hunk shown above.)

yep, let me wire it up

jwills

comment created time in 3 months

PR opened HopkinsIDD/COVIDScenarioPipeline

Various improvements to running batch jobs
  1. Allowing us to change the number of sims run per job from 10 to some other number (esp. useful when doing the CA/NY/MD runs, which are smaller/faster than the USA run),
  2. removing the need for a target .dvc file in favor of just specifying output folders directly (with the usual suspects as the default settings),
  3. removing the need for the -p option and just always parallelizing by scenario/death config,
  4. ensuring we always build + install the python module instead of just installing it,
  5. and skipping the copying of some potentially large subdirs of COVIDScenarioPipeline (e.g. sample_data/united-states-commutes and build)
+42 -62

0 comments

2 changed files

pr created time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 4017e3fce23c99a27d68bd52b58e6db9ca9495b0

Need to be smarter about sample_data

view details

push time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

jkamins7

commit sha a779400aa7948781efa53e7591a4e97f5b073a16

Removed file move that didn't do anything

view details

jkamins7

commit sha 199c30a0e6aa31713d1e8664aefba468c6a5b007

Fixed build_US_setup to not symmetrize mobility

view details

Josh Wills

commit sha e4c76bee53b7498ed36bfc2ac87cf2a6fbae7fe1

Trying to make things in inference job running more better

view details

push time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha aefe6bd553742ded444334ac4403d409c1e7dc09

Do a build and install of the python SEIR module

view details

push time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 7884b0e35d5fd81e25ac81d6d9b483f3d946395b

Add build to the list of dirs to ignore; rename -t and targets to -o and outputs

view details

push time in 3 months

create branch HopkinsIDD/COVIDScenarioPipeline

branch : jwills_inference_improvements

created branch time in 3 months

create branch HopkinsIDD/COVID19_InferenceTest

branch : run_minimal-20200514T153842

created branch time in 3 months

create branch HopkinsIDD/COVID19_InferenceTest

branch : run_minimal-20200514T153640

created branch time in 3 months

create branch HopkinsIDD/COVIDScenarioPipeline

branch : inference_25

created branch time in 3 months

push event HopkinsIDD/COVID19_InferenceTest

Josh Wills

commit sha 8b2081454d2adb5ed3a4f3c4144b33f3e1bb6936

updated data and the importation.dvc file

view details

push time in 3 months

push event apache/crunch

Ben Roling

commit sha c57b55bd3c8eaa3dae124af0eda771144abe90a5

CRUNCH-696 update FormatBundle.readFields() compatibility

Make FormatBundle.readFields() compatible with FormatBundles serialized with an older version of Crunch. This ensures jobs don't fail during an upgrade to a cluster-provided Crunch dependency. Without this some jobs get submitted without the filesystem field in the serialized FormatBundle and then encounter EOFException when the job gets scheduled to run and uses the newer Crunch to deserialize the FormatBundle.

view details

Ben Roling

commit sha a5c5a7ed7fad7314d80be61fa7d251f0739332c5

Update crunch-core/src/main/java/org/apache/crunch/io/FormatBundle.java

Co-authored-by: Andrew Olson <930946+noslowerdna@users.noreply.github.com>

view details

Josh Wills

commit sha a2a879661b5148a1cdd2b4af615130d0e8c61444

Merge pull request #33 from ben-roling/CRUNCH-696: CRUNCH-696 update FormatBundle.readFields() compatibility

view details

push time in 3 months

PR merged apache/crunch

CRUNCH-696 update FormatBundle.readFields() compatibility

Make FormatBundle.readFields() compatible with FormatBundles serialized with an older version of Crunch. This ensures jobs don't fail during an upgrade to a cluster-provided Crunch dependency. Without this some jobs get submitted without the filesystem field in the serialized FormatBundle and then encounter EOFException when the job gets scheduled to run and uses the newer Crunch to deserialize the FormatBundle.

+16 -1

0 comments

1 changed file

ben-roling

pr closed time in 3 months

Pull request review comment apache/crunch

CRUNCH-696 update FormatBundle.readFields() compatibility

```
 public void readFields(DataInput in) throws IOException {
     String value = Text.readString(in);
     extraConf.put(key, value);
   }
-  if (in.readBoolean()) {
+  boolean hasFilesystem;
+  try {
+    hasFilesystem = in.readBoolean();
+  } catch (EOFException e) {
+    // This can be a normal occurrence when Crunch is treated as a  cluster-provided
+    // dependency and the version is upgraded.  Some jobs will have been submitted with
+    // code that does not contain the filesystem field.  If those jobs run later with
+    // this code that does contain the field, EOFException will occur trying to read
+    // the non-existent field.
+    LOG.debug("EOFException caught attempting to read filesystem field.  This condition"
+        + "may temporarily occur with jobs that are submitted before but run after a"
```

Patch LGTM w/the whitespace fix included; assuming no objections, I'll merge it later today.

ben-roling

comment created time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 794afeafb9f69c4e50f9394ba8e1d56308414be6

fix typo

view details

push time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha c8de6733d26788954e689f887e91269cf7fc3e8a

bump default RAM

view details

push time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 7d87741388f17d7b4f3589895fe027c971efd670

with that

view details

push time in 3 months

create branch HopkinsIDD/COVIDScenarioPipeline

branch : jwills_inference_batch_org

created branch time in 3 months

PR opened HopkinsIDD/COVIDScenarioPipeline

Do a better job of tarring up the CSP code + add a manifest.json file for each run

Also updating the default memory setting to be 4G instead of 2G.

+24 -4

0 comments

1 changed file

pr created time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 1999c24714165ede8c7f9759a26e5e313998bf6a

add the cmdline call to the manifest

view details

push time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 9119eeed012ad1308bde9ac08e87cb632f47cb3d

now make it pretty

view details

push time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha 5adfb19949a303e04000d2d089d6c4c7454bf4a6

important addition

view details

push time in 3 months

push event HopkinsIDD/COVIDScenarioPipeline

Josh Wills

commit sha db1b5a113905bac448b1d91c74c31604b35880f3

Make 4GB the default RAM

view details

push time in 3 months

create branch HopkinsIDD/COVIDScenarioPipeline

branch : jwills_inference

created branch time in 3 months
