Nelson Liu nelson-liu @stanfordnlp Stanford, California https://nelsonliu.me language x computation. @stanfordnlp CS PhD student. Formerly: @uwnlp @Noahs-ARK undergrad / @allenai / @isi-nlp / GSoC @scikit-learn.

allenai/allennlp 8761

An open-source NLP research library, built on PyTorch.

ericmjl/nxviz 265

Visualization Package for NetworkX

nelson-liu/contextual-repr-analysis 196

A toolkit for evaluating the linguistic knowledge and transferability of contextual representations. Code for "Linguistic Knowledge and Transferability of Contextual Representations", to appear at NAACL 2019.

allenai/allennlp-as-a-library-example 109

A simple example for how to build your own model using AllenNLP as a dependency.

codalab/codalab-worksheets 84

A collaborative platform for reproducible research (web interface and CLI).

nelson-liu/cython-crash-course 29

A quick intro to Cython for Python users who don't know C

allenai/allennlp-language-modeling 16

An experimental plugin adding language model implementations and training methods to AllenNLP.

nelson-liu/ASLSpeak 9

:microphone: DubHacks 2015 project. Decode sign language using the Leap Motion, and speak it!

nelson-liu/BitStation-App 3

:money_with_wings: The MIT Kerberos-integrated social wallet. Winner of BitComp 2014 Improving MIT Award.

delete branch nelson-liu/codalab-worksheets

delete branch : aws_worker_manager_filter

delete time in 3 days

push event codalab/codalab-worksheets

Nelson Liu

commit sha 231438dd76e33aafd5dcca15e3060b26c90a69a3

Enable filtering jobs in the AWSBatchWorkerManager (#2565)
* Enable filtering jobs in the AWSBatchWorkerManager
* Remove redundant condition
* Remove negation in cli helpstring

view details

push time in 3 days

PR merged codalab/codalab-worksheets

Reviewers
Enable filtering jobs in the AWSBatchWorkerManager

Currently, the AWSBatchWorkerManager treats all jobs on the provided job queue as cl-worker jobs. However, this isn't a valid assumption in many cases (namely, when the queue is shared, like in the Stanford NLP AWS account).

This PR adds an argument that enables filtering the workermanager jobs with a regex. The default behavior should not be changed.
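As a rough illustration of the filtering idea, the sketch below keeps only the jobs on a shared queue whose names match a regex, with a default pattern that matches everything so existing behavior is preserved. The function name `filter_worker_jobs`, the `job_filter` parameter, and the sample job names are hypothetical and are not taken from the PR; only the `jobName` key mirrors the shape of AWS Batch job summaries.

```python
import re
from typing import Dict, List


def filter_worker_jobs(jobs: List[Dict[str, str]], job_filter: str = ".*") -> List[Dict[str, str]]:
    """Return only the jobs whose name fully matches `job_filter`.

    Hypothetical helper: the default pattern ".*" matches every job name,
    so callers that pass no filter see all jobs on the queue.
    """
    pattern = re.compile(job_filter)
    return [job for job in jobs if pattern.fullmatch(job["jobName"])]


# Illustrative queue contents: two worker jobs and one unrelated job.
queue_jobs = [
    {"jobName": "cl-worker-1"},
    {"jobName": "training-run-7"},
    {"jobName": "cl-worker-2"},
]

all_jobs = filter_worker_jobs(queue_jobs)
worker_jobs = filter_worker_jobs(queue_jobs, r"cl-worker-\d+")
print([job["jobName"] for job in all_jobs])
print([job["jobName"] for job in worker_jobs])
```

Using `fullmatch` rather than `search` is a design choice in this sketch: it avoids accidentally counting jobs whose names merely contain the worker prefix.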

+13 -3

0 comment

1 changed file

nelson-liu

pr closed time in 3 days

push event nelson-liu/codalab-worksheets

Nelson Liu

commit sha b345b471ad0fe2436675ba68c533524bffa133d1

Add --worker-tag-exclusive to WorkerManager (#2564)

view details

Nelson Liu

commit sha 10b432a2fabf1703c3b5aff8fa3587f07a56bd7f

Merge branch 'master' into aws_worker_manager_filter

view details

push time in 3 days

push event nelson-liu/codalab-worksheets

Nelson Liu

commit sha e0a291d0b1c216e0ce0d727421ccb649a898e3c7

Remove redundant condition

view details

Nelson Liu

commit sha aa3235f6e380736204577aefa0c8ffb401ca88d8

Remove negation in cli helpstring

view details

push time in 3 days

PR opened codalab/codalab-worksheets

Reviewers
Enable filtering jobs in the AWSBatchWorkerManager

Currently, the AWSBatchWorkerManager treats all jobs on the provided job queue as cl-worker jobs. However, this isn't a valid assumption in many cases (namely, when the queue is shared, like in the Stanford NLP AWS account).

This PR adds an argument that enables filtering the workermanager jobs with a regex. The default behavior should not be changed.

+15 -3

0 comment

1 changed file

pr created time in 3 days

create branch nelson-liu/codalab-worksheets

branch : aws_worker_manager_filter

created branch time in 3 days

delete branch nelson-liu/codalab-worksheets

delete branch : worker_manager_tag_exclusive

delete time in 4 days

push event codalab/codalab-worksheets

Nelson Liu

commit sha b345b471ad0fe2436675ba68c533524bffa133d1

Add --worker-tag-exclusive to WorkerManager (#2564)

view details

push time in 4 days

PR merged codalab/codalab-worksheets

Reviewers
Add --worker-tag-exclusive to WorkerManager

Is it better to search for request_queue=<X> or request_queue=tag=<X>?

+12 -0

0 comment

4 changed files

nelson-liu

pr closed time in 4 days

PR opened codalab/codalab-worksheets

Reviewers
Add --worker-tag-exclusive to WorkerManager

Is it better to search for request_queue=<X> or request_queue=tag=<X>?

+12 -0

0 comment

4 changed files

pr created time in 4 days

create branch nelson-liu/codalab-worksheets

branch : worker_manager_tag_exclusive

created branch time in 4 days

push event nelson-liu/codalab-worksheets

Nelson Liu

commit sha 771c2d57b5ed0e4d80411784b8a6a80ffd0205db

Don't filter workermanager to current user's staged bundles if user is codalab (#2560)

view details

Jing Ge

commit sha 0a14c1c32abbbc5653c41eb4820b1ce862eb78c1

Minimize intermediate state transition for bundle manager (#2438) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> Co-authored-by: Nelson Liu <nelson-liu@users.noreply.github.com> Co-authored-by: Nelson Liu <nfliu@nelsonliu.me>

view details

push time in 4 days

push event codalab/codalab-worksheets

Jing Ge

commit sha 0a14c1c32abbbc5653c41eb4820b1ce862eb78c1

Minimize intermediate state transition for bundle manager (#2438) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> Co-authored-by: Nelson Liu <nelson-liu@users.noreply.github.com> Co-authored-by: Nelson Liu <nfliu@nelsonliu.me>

view details

push time in 5 days

delete branch codalab/codalab-worksheets

delete branch : termination-column

delete time in 5 days

PR merged codalab/codalab-worksheets

Minimize intermediate state transition for bundle manager

Fixed #2485

Copied from the above ref: There are essentially two issues here:

  1. The bundle manager keeps assigning bundles to a worker that is going to be terminated soon.
    a. A bundle that is assigned to a terminating worker will stay in the starting state for 5 minutes until it gets picked up by the bundle manager.
    b. After the starting bundle gets picked up by the bundle manager, the worker goes offline, but there is still a record of this offline worker in the database.
  2. The bundle manager keeps assigning bundles to a worker that is offline (the worker is disconnected, but its record still exists in the database).

Based on the above summary, issues 1 and 2 partly overlap and partly differ:

  1. Overlap: 1.b and 2.
    a. Goal: minimize the number of bundles that will be assigned to an offline worker.
    b. Approach: detect the offline worker before dispatching a bundle.
  2. Difference: 1.a.
    a. Goal: minimize the number of bundles that will be assigned to a terminating worker.
    b. Approach: detect the terminating worker before dispatching a bundle.

This PR aims to fix 1.a: minimize the number of bundles that will be assigned to a worker that will be terminating soon.

In this PR, I added a new column, is_terminating, to the worker table (please bear with my poor English, and feel free to suggest changes to the naming). Whenever the termination signal is sent, the worker updates the database with is_terminating=True as early as possible. Then, when the bundle manager dispatches bundles, it always checks this field before starting a new bundle on a worker. This way, we minimize the number of bundles that will be sent to a worker that is going to be terminated soon.
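The dispatch-time check described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the PR's code: `WorkerRecord` and `dispatchable_workers` are hypothetical names, and a plain list stands in for the worker table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerRecord:
    """Hypothetical stand-in for one row of the worker table."""
    worker_id: str
    is_terminating: bool = False


def dispatchable_workers(workers: List[WorkerRecord]) -> List[WorkerRecord]:
    """Return the workers that may still receive new bundles.

    Workers that have flagged themselves as terminating are skipped, so no
    new bundle is started on a worker that is about to go away.
    """
    return [worker for worker in workers if not worker.is_terminating]


workers = [WorkerRecord("w1"), WorkerRecord("w2")]
# Simulate w2 receiving a termination signal and recording it.
workers[1].is_terminating = True

print([worker.worker_id for worker in dispatchable_workers(workers)])
```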

  1. Pros:
    a. Avoids bundles getting stuck in the starting state for 5 minutes (which disrupts the bundles' execution priority).
    b. Minimizes the intermediate state transition between starting and staged.
  2. Cons:
    a. One more database call, which adds performance overhead.

+40 -0

0 comment

7 changed files

candicegjing

pr closed time in 5 days

issue closed codalab/codalab-worksheets

Ensure dispatching jobs in the order that user specified

Based on discussions in https://github.com/codalab/codalab-worksheets/issues/2423#issuecomment-646269225

The bundle manager keeps assigning bundles to a worker that is going to be terminated soon.

  1. A bundle that is assigned to a terminating worker will stay in the starting state for 5 minutes until it gets picked up by the bundle manager.
  2. After the starting bundle gets picked up by the bundle manager, the worker goes offline, but there is still a record of this offline worker in the database.
  3. Because the offline worker still resides in the database, the bundle manager will keep assigning jobs to it. Jobs that are assigned to this offline worker will be restaged after 5 minutes and wait for newly available workers again.

The above workflow disrupts the job order that the user specified. The goal of this ticket is to avoid assigning bundles to terminated workers as much as we can, to ensure a better user experience.

closed time in 5 days

candicegjing

push event codalab/codalab-worksheets

Nelson Liu

commit sha 852d66b61354cdffe01248172b3d8c37d5ed5d5d

Add default value for is_terminating to REST checkin

view details

push time in 5 days

push event codalab/codalab-worksheets

Tony Lee

commit sha adaf6cf76118e8e4c05de82847d8455cc50da8de

Merge master into staging (#2121)
* Update pre-commit script to cleanup venv on failure and add new troubleshooting steps (#1992)
* Update stress test script. (#2076)
* Fix build issue in Travis (#2091)
  Closed via #2090
* worksheet title should not be editable when user doesn't have permission' (#2085)
  Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>
* Improve error handling for CLI (#2049)
  Closed via #1958
* Support tagging CodaLab's public instances (#2093)
  Closed via #2031
* Use "//" comment for editor ctrl + / behavior (#2063)
* change blockComment style
  Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>
* Bump psutil from 5.4.6 to 5.6.6 (#2087)
* Bump psutil from 5.4.6 to 5.6.6
  Bumps [psutil](https://github.com/giampaolo/psutil) from 5.4.6 to 5.6.6.
  - [Release notes](https://github.com/giampaolo/psutil/releases)
  - [Changelog](https://github.com/giampaolo/psutil/blob/master/HISTORY.rst)
  - [Commits](https://github.com/giampaolo/psutil/compare/release-5.4.6...release-5.6.6)
  Signed-off-by: dependabot[bot] <support@github.com>
* don't pass version arg to .travis.yml
* Update .travis.yml
* use old travis with --version
* fix format
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
  Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>
* get rid of invalid argument file when logging (#2103)
* Improve CLI documentation and add -m option for auto generating CLI docs (#2057)
  Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>
* Check the existence of valid_bundle before further actions (#2107)
  Closed via #2106
* move cpu and gpu resource check to the beginning of _transition_from_… (#2104)
  Closed via #2096
* Adjust timezone information when calculating current container running time (#2113)
  Closed via #2112
* Binding the actual cores that can be used to the current process (#2110)
  Closed via #1045
* Make Travis badge link to build status (#2116)
  Co-authored-by: Percy Liang <percyliang@gmail.com>
* Add queue in new run field (#2040)
* add queue in new run field
  Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>
* Make cl workers show user-owned workers for non-root users (#2118)
* Show cl workers for non-root users, limiting to ones they own
* Black format
* Update CLI reference
* Fix CLI reference string for cl workers.
* Cosmetic changes to /workers/info
* Simplify cl workers CLI help string
* Lint
* 2108: Handle incompatible Cuda version specified in the user's Docker image (#2119)
Co-authored-by: armantajback <armantajback@users.noreply.github.com>
Co-authored-by: Jing Ge <jingge2@illinois.edu>
Co-authored-by: yipenghe <yipenghe@stanford.edu>
Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>
Co-authored-by: Nelson Liu <nelson-liu@users.noreply.github.com>
Co-authored-by: Percy Liang <percyliang@gmail.com>

view details

Ashwin Ramaswami

commit sha 6a9a49c75d694312e685b71495af1e850ba6f621

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha d7ca58cfc503f16151f8d9c3a63e181b08c2918f

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha d2ebc232d3192f52da5bb29055d43b53f347e3e0

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Tony Lee

commit sha 6eefba4f0d5296f17a11eaddbf1cff701833309f

Merge branch 'master' of https://github.com/codalab/codalab-worksheets into staging

view details

Tony Lee

commit sha 88aabe6e50e69f4d6cf3b59aa7de6401a3aa453d

debug

view details

Tony Lee

commit sha 95809d8bb631242535586b356e1235915a3f001c

debug

view details

Tony Lee

commit sha c54110df23deb6722d4ece13b39c94f21e59b7d5

Disable readthedocs test for now

view details

armantajback

commit sha 79e8351fc6ced4be5becacc57b9254cd31ca7e02

Merge Master Into Staging (#2469)
* Fix free disk bytes calculation (#2370)
  Closed via #2362
* Fix WorkerManager launching (#2372)
  Co-authored-by: Jing Ge <jingge2@illinois.edu>
* Make worker id match output directory and job name (#2369)
* Make worker id match output directory and job name
* Blacken
  Co-authored-by: Jing Ge <jingge2@illinois.edu>
* Pass down max work dir size and delete workdir on exit to slurm workers (#2354)
* Pass down max work dir size and delete workdir on exit to slurm workers
* Fix typo.
  Co-authored-by: Jing Ge <jingge2@illinois.edu>
* == -> === (#2357)
  Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>
* Refactor dialogs into worksheetDialog component (#2263)
* merge openX to one open dialog variable
* refactor error message, delete worksheet message, bundle action fail dialogs into worksheet dialog component
* add check when using enter keyword
  Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>
* remove context menu (#2376)
* Update print information (#2382)
  Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>
* Add a CLI option to limit the number of jobs allowed to run on a worker (#2289)
  Closed via #2287
* Set cli verbose default to 0 (#2378)
  Closed via #2350
* Mounting dependencies on paths specified by keys (#2295)
* Fix time displaying issue (#2381)
* Fix datetime displaying issue
* Add a comment
  Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>
  Co-authored-by: yipenghe <yipenghe@stanford.edu>
* overdue (#2393)
* fix migration (#2403)
* Make strings more consistent in terms of terminology / organization (#2409)
* tweak dialog prompts
* add cut/copied
* organize keyboard shortcuts
* fix formatting
* remove dummy file
* ignore a d when a dialog is opened
  Co-authored-by: yipenghe <yipenghe@stanford.edu>
* Replace sacct with scontrol, since sacct is only available on the slurm head node (#2368)
* Handle sort key is 0 condition on frontend and use correct sort key for new runs (#2412)
* handle sort key is 0 condition
* handle sort key is 0 in getAfterSortKey function
* use this.props.after_sort_key instead
* Clear force delete bit after deletion (#2413)
* clear force delete bit
* clear force delete bit fails
* bump to v0.5.15 (#2416)
* 2417: Fix mkdocs Travis failure (#2420)
* debug
* reenable test
* Fix SlurmWorkerManager overlaunching (#2391)
  Co-authored-by: Jing Ge <jingge2@illinois.edu>
* Fix accessing worker information from WorkerInfoAccessor (#2419)
* Don't fail WorkerManager if a network exception is encountered (#2422)
* Rename actionbar => terminal (#2429)
* rename actionbar->terminal
* remove unused line
* Update frontend/src/components/worksheets/Worksheet/Worksheet.js
  Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>
* action bar => terminal in comments
  Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>
* Remove select all for table (#2428)
* clear force delete bit
* remove select-all checkbox
* remove selectAll handler
* Remove detach from the frontend (#2427)
* clear force delete bit
* remove detach from frontend
* remove constant
* Use GitHub Actions for CI (#2185)
  for #2094
  Based off of Jane's work in https://github.com/candicegjing/codalab-worksheets
  Time: 40 minutes -> 15-20 minutes
  Several improvements:
  Build images split into three parallel steps - rest-server, worker, frontend
  Tests split into around 10 parallel steps - each step runs about 4 tests, and there is a step which runs the UI tests.
  If a CLI test gives an error, it archives all Docker containers' logs so that they can be downloaded and inspected.
  If the UI test gives an error, it archives the screenshot images so that they can be downloaded and inspected.
  Additionally, the publish to pypi process has changed. Now, publishes to pypi happen on every GitHub release as opposed to every tag push. Effectively, this means that the release workflow has changed a bit:
  Old workflow:
  Wait 40 mins for Travis build on master to complete
  Push a tag to master
  Wait 40 mins for Travis build on master to complete, which also deploys to pypi
  New workflow:
  Wait for GitHub Actions build on master to complete
  Create a new GitHub release
  Wait for action to complete, which only deploys to pypi
* CI: github.head_ref -> github.ref
* CI: properly populate VERSION variable with branch name
* Add exit-after-num-runs=1 to slurm worker manager (#2373)
  Closed via #2289
* Fix showing file contents for record item (#2455)
  fix #2446
  to test: create a schema with a field using files:
  % schema a
  % add hello /stdout
  display bundles in record mode
  % display record a
* Fix long test startup times by using python:3.6.10-slim-buster for default test runs (#2449)
  Fixes #2388 by using python:3.6.10-slim-buster for default test runs. Following @nelson-liu 's suggestion in #2388 (comment). This image is still around ~40 MB (as opposed to the 5 GB default-cpu image).
  Time changes from ~20 mins -> 10 mins
* Fix github actions on forks by not calling --push (#2457)
  This conditional expression was in the old `.travis.yml`, but it isn't there in the new github actions workflow. This PR adds that expression so that `--push` is not called from forks.
  Fixes #2456
* Prettier CLI Reference Doc (#2458)
  Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>
* Actually fix builds from forks by building docker images when needed (#2461)
* Test worker restart in GHA (#2466)
  Resolves #2465
  Adding back this test that we used to have in Travis
  Appended .log for to err log files
Co-authored-by: Jing Ge <jingge2@illinois.edu>
Co-authored-by: Nelson Liu <nelson-liu@users.noreply.github.com>
Co-authored-by: yipenghe <yipenghe@stanford.edu>
Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>
Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>
Co-authored-by: Tony Lee <tonyh.lee@yahoo.com>
Co-authored-by: Percy Liang <percyliang@gmail.com>

view details

armantajback

commit sha 9f4eaeaa3bec9a73fc555cec5cc90ee32287f8de

Update Staging Version (#2473)

view details

armantajback

commit sha aad44ea588e7f9e3d54aea8945b1f0dabad47541

Merge branch 'master' into staging

view details

Ashwin Ramaswami

commit sha 7c7624dc7254abc63bbdd9752e91c75d8140e541

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha aeac16afabefad725e9f365e4888c3b11e268af1

Merge branch 'master' into staging

view details

Jing Ge

commit sha 13bbbc800324b82d062be367e5e42c1a5792264c

Prettier CLI Reference: remove end colon for each single command (#2462)

view details

Jing Ge

commit sha 9cf494b7a87f3a1a0c6449c83348f41482156fa0

Improve monitor script when sending worker offline alert (#2503) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>

view details

Nelson Liu

commit sha f05d3ce07b3728da739f1ea13f0a1bfc6b894e8b

Add --worker-pass-down-termination flag to WorkerManager (#2508)

view details

Ashwin Ramaswami

commit sha e6e56d8832c10167d40a1dfe61ba8c3bf1e9b7a7

Create SECURITY.md (#2501) * Create SECURITY.md Fixes #2226 Based off https://github.com/Azure/azure-sdk-for-python/blob/master/SECURITY.md * Update SECURITY.md

view details

Tony Lee

commit sha 4d5983eae9d55ccfce178af9fce724b09c44a43b

expect failure for infinite memory tests (#2507) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Nelson Liu

commit sha ac11edc772c1a7aef0dc12e18a70ddfe149f90b1

Re-create missing docker networks before launching containers (#2468)

view details

Jing Ge

commit sha 37d948d6be5068ebb5d60b54ee544fd90b7330d0

Fix state transition in finalizing worker run state (#2451) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Tony Lee <tonyh.lee@yahoo.com>

view details

push time in 5 days

delete branch nelson-liu/codalab-worksheets

delete branch : codalab_user_worker_manager

delete time in 5 days

push event codalab/codalab-worksheets

Nelson Liu

commit sha 771c2d57b5ed0e4d80411784b8a6a80ffd0205db

Don't filter workermanager to current user's staged bundles if user is codalab (#2560)

view details

push time in 5 days

push event nelson-liu/codalab-worksheets

Nelson Liu

commit sha 82eae2cf9981567bbaec56b11df73942ca7ea0be

Add some details to the comment.

view details

push time in 5 days

Pull request review comment codalab/codalab-worksheets

Don't filter workermanager to current user's staged bundles if user is codalab

 def run_loop(self):

     def run_one_iteration(self):
         # Get staged bundles for the current user.
-        keywords = ['state=' + State.STAGED] + [".mine"] + self.args.search
+        keywords = ['state=' + State.STAGED] + self.args.search
+        # If the current user is "codalab", don't filter by .mine because the workers owned
+        # by "codalab" can be shared by all users. But, for all other users, we only
+        # want to see their staged bundles.
+        if os.environ.get('CODALAB_USERNAME') != "codalab":

yes, for both the aws and slurm worker managers.
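The diff excerpt is cut off at the `if` statement, so the sketch below only guesses at the overall shape of the logic: build the search keywords, and add `.mine` unless the user is `codalab`. The function `build_search_keywords`, the `STAGED` constant, and the placement of `.mine` at the end of the list are assumptions for illustration, not the code from the PR.

```python
from typing import List, Optional

# Assumed stand-in for State.STAGED from the diff above.
STAGED = "staged"


def build_search_keywords(search: List[str], username: Optional[str]) -> List[str]:
    """Build bundle-search keywords for a worker manager (hypothetical helper).

    Workers owned by "codalab" can be shared by all users, so that user sees
    every staged bundle; any other user is restricted to their own bundles.
    """
    keywords = ["state=" + STAGED] + list(search)
    if username != "codalab":
        # Assumption: the elided branch re-adds the ".mine" filter.
        keywords.append(".mine")
    return keywords


print(build_search_keywords(["request_queue=tag=gpu"], "codalab"))
print(build_search_keywords(["request_queue=tag=gpu"], "alice"))
```

In the real worker manager the username would come from the `CODALAB_USERNAME` environment variable, as the diff shows; it is passed in as a parameter here only to keep the sketch easy to exercise.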

nelson-liu

comment created time in 5 days

create branch nelson-liu/codalab-worksheets

branch : codalab_user_worker_manager

created branch time in 5 days

issue opened codalab/codalab-worksheets

Workers should not run bundles that are deleted

Suppose a worker runs a bundle. If it disconnects for some reason, the bundle is marked as worker_offline. This bundle can then be deleted.

However, suppose that, after deleting the bundle, the worker comes back online. Right now, the worker would continue running the bundle in docker, and then encounter an error when trying to upload results. Instead, it'd be nice if workers, when they check back in, could detect whether the bundles they're running still exist. If they don't exist, then the worker should stop executing those bundles and wait for new things to run.
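One way the proposed check-in behavior could look is sketched below, assuming the server can tell the worker which bundle UUIDs still exist. The function `reconcile_runs`, the dictionary of running containers, and the sample UUIDs are all hypothetical; this issue does not describe an implementation.

```python
from typing import Dict, Iterable, Set


def reconcile_runs(running: Dict[str, str], existing_uuids: Iterable[str]) -> Set[str]:
    """Drop runs whose bundles no longer exist and return the dropped UUIDs.

    `running` maps bundle UUID -> container name on this worker (assumed
    shape). `existing_uuids` is what the server reports as still present.
    """
    existing = set(existing_uuids)
    stale = {uuid for uuid in running if uuid not in existing}
    for uuid in stale:
        # A real worker would also stop the Docker container here.
        del running[uuid]
    return stale


running = {"0xaaa": "container-1", "0xbbb": "container-2"}
# Simulate a check-in after bundle 0xaaa was deleted on the server.
stopped = reconcile_runs(running, ["0xbbb"])
print(sorted(stopped), running)
```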

created time in 5 days

push event nelson-liu/codalab-worksheets

Tony Lee

commit sha adaf6cf76118e8e4c05de82847d8455cc50da8de

Merge master into staging (#2121)

view details

Ashwin Ramaswami

commit sha 6a9a49c75d694312e685b71495af1e850ba6f621

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha d7ca58cfc503f16151f8d9c3a63e181b08c2918f

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha d2ebc232d3192f52da5bb29055d43b53f347e3e0

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Tony Lee

commit sha 6eefba4f0d5296f17a11eaddbf1cff701833309f

Merge branch 'master' of https://github.com/codalab/codalab-worksheets into staging

view details

Tony Lee

commit sha 88aabe6e50e69f4d6cf3b59aa7de6401a3aa453d

debug

view details

Tony Lee

commit sha 95809d8bb631242535586b356e1235915a3f001c

debug

view details

Tony Lee

commit sha c54110df23deb6722d4ece13b39c94f21e59b7d5

Disable readthedocs test for now

view details

armantajback

commit sha 79e8351fc6ced4be5becacc57b9254cd31ca7e02

Merge Master Into Staging (#2469)

view details

armantajback

commit sha 9f4eaeaa3bec9a73fc555cec5cc90ee32287f8de

Update Staging Version (#2473)

view details

armantajback

commit sha aad44ea588e7f9e3d54aea8945b1f0dabad47541

Merge branch 'master' into staging

view details

Ashwin Ramaswami

commit sha 7c7624dc7254abc63bbdd9752e91c75d8140e541

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha aeac16afabefad725e9f365e4888c3b11e268af1

Merge branch 'master' into staging

view details

Tony Lee

commit sha ce7dd073cde1a08709de7121eeae3496bb6460af

fix heartbeat (#2516)

view details

Tony Lee

commit sha a0e44dc4c71bd32d18baf944029fe284be6c9510

fix heartbeat (#2517)

view details

Ashwin Ramaswami

commit sha 1bf72e6adcf5d01614b7a5c4cd12fbe2d051ed2d

reenable gen-readthedocs

view details

Ashwin Ramaswami

commit sha ba9ed2f032f4627b14915454daa3ef6d161c4408

Fix some frontend build warnings (#2500) * Fix some frontend build warnings * Make sure eslint errors are caught by CI Co-authored-by: yipenghe <yipenghe@stanford.edu>

view details

Jing Ge

commit sha 710992c7123ad1b5105b6d9cdb2d219e07a833d1

Set default value of exit_after_num_runs to maxint in rest API call (#2511)

view details

Ashwin Ramaswami

commit sha 96e7d9f6667c502a1f741bed9fff18d123c18230

Remove node-gyp dependency (#2497) * Remove node-gyp dependency * Add react-bootstrap Co-authored-by: yipenghe <yipenghe@stanford.edu>

view details

Ashwin Ramaswami

commit sha a7b4caf624cef45ecff536de40763704ef58736c

Remove unused dependencies, jquery-ui-bundle (#2498) * Remove node-gyp dependency * Add react-bootstrap * Remove unused dependencies, jquery-ui-bundle * Remove jquery ui and ace js * Remove flow Co-authored-by: yipenghe <yipenghe@stanford.edu>

view details

push time in 5 days

push event nelson-liu/codalab-worksheets

Tony Lee

commit sha adaf6cf76118e8e4c05de82847d8455cc50da8de

Merge master into staging (#2121) * Update pre-commit script to cleanup venv on failure and add new troubleshooting steps (#1992) * Update stress test script. (#2076) * Fix build issue in Travis (#2091) Closed via #2090 * worksheet title should not be editable when user doesn't have permission' (#2085) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Improve error handling for CLI (#2049) Closed via #1958 * Support tagging CodaLab's public instances (#2093) Closed via #2031 * Use "//" comment for editor ctrl + / behavior (#2063) * change blockComment style Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Bump psutil from 5.4.6 to 5.6.6 (#2087) * Bump psutil from 5.4.6 to 5.6.6 Bumps [psutil](https://github.com/giampaolo/psutil) from 5.4.6 to 5.6.6. - [Release notes](https://github.com/giampaolo/psutil/releases) - [Changelog](https://github.com/giampaolo/psutil/blob/master/HISTORY.rst) - [Commits](https://github.com/giampaolo/psutil/compare/release-5.4.6...release-5.6.6) Signed-off-by: dependabot[bot] <support@github.com> * don't pass version arg to .travis.yml * Update .travis.yml * use old travis with --version * fix format Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * get rid of invalid argument file when logging (#2103) * Improve CLI documentation and add -m option for auto generating CLI docs (#2057) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> * Check the existence of valid_bundle before further actions (#2107) Closed via #2106 * move cpu and gpu resource check to the beginning of _transition_from_… (#2104) Closed via #2096 * Adjust timezone information when calculating current container running time (#2113) Closed via #2112 * Binding the actual cores that can be used to the current process (#2110) Closed via #1045 * Make Travis badge link to build status (#2116) Co-authored-by: Percy Liang <percyliang@gmail.com> * Add queue in 
new run field (#2040) * add queue in new run field Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Make cl workers show user-owned workers for non-root users (#2118) * Show cl workers for non-root users, limiting to ones they own * Black format * Update CLI reference * Fix CLI reference string for cl workers. * Cosmetic changes to /workers/info * Simplify cl workers CLI help string * Lint * 2108: Handle incompatible Cuda version specified in the user's Docker image (#2119) Co-authored-by: armantajback <armantajback@users.noreply.github.com> Co-authored-by: Jing Ge <jingge2@illinois.edu> Co-authored-by: yipenghe <yipenghe@stanford.edu> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Nelson Liu <nelson-liu@users.noreply.github.com> Co-authored-by: Percy Liang <percyliang@gmail.com>

view details

Ashwin Ramaswami

commit sha 6a9a49c75d694312e685b71495af1e850ba6f621

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha d7ca58cfc503f16151f8d9c3a63e181b08c2918f

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha d2ebc232d3192f52da5bb29055d43b53f347e3e0

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Tony Lee

commit sha 6eefba4f0d5296f17a11eaddbf1cff701833309f

Merge branch 'master' of https://github.com/codalab/codalab-worksheets into staging

view details

Tony Lee

commit sha 88aabe6e50e69f4d6cf3b59aa7de6401a3aa453d

debug

view details

Tony Lee

commit sha 95809d8bb631242535586b356e1235915a3f001c

debug

view details

Tony Lee

commit sha c54110df23deb6722d4ece13b39c94f21e59b7d5

Disable readthedocs test for now

view details

armantajback

commit sha 79e8351fc6ced4be5becacc57b9254cd31ca7e02

Merge Master Into Staging (#2469) * Fix free disk bytes calculation (#2370) Closed via #2362 * Fix WorkerManager launching (#2372) Co-authored-by: Jing Ge <jingge2@illinois.edu> * Make worker id match output directory and job name (#2369) * Make worker id match output directory and job name * Blacken Co-authored-by: Jing Ge <jingge2@illinois.edu> * Pass down max work dir size and delete workdir on exit to slurm workers (#2354) * Pass down max work dir size and delete workdir on exit to slurm workers * Fix typo. Co-authored-by: Jing Ge <jingge2@illinois.edu> * == -> === (#2357) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Refactor dialogs into worksheetDialog component (#2263) * merge openX to one open dialog variable * refactor error message, delete worksheet message, bundle action fail dialogs into worksheet dialog component * add check when using enter keyword Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * remove context menu (#2376) * Update print information (#2382) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> * Add a CLI option to limit the number of jobs allowed to run on a worker (#2289) Closed via #2287 * Set cli verbose default to 0 (#2378) Closed via #2350 * Mounting dependencies on paths specified by keys (#2295) * Fix time displaying issue (#2381) * Fix datetime displaying issue * Add a comment Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: yipenghe <yipenghe@stanford.edu> * overdue (#2393) * fix migration (#2403) * Make strings more consistent in terms of terminology / organization (#2409) * tweak dialog prompts * add cut/copied * organize keyboard shortcuts * fix formatting * remove dummy file * ignore a d when a dialog is opened Co-authored-by: yipenghe <yipenghe@stanford.edu> * Replace sacct with scontrol, since sacct is only available on the slurm head node (#2368) * Handle sort key is 0 condition on frontend and use correct sort key for new runs (#2412) * handle sort key 
is 0 condition * handle sort key is 0 in getAfterSortKey function * use this.props.after_sort_key instead * Clear force delete bit after deletion (#2413) * clear force delete bit * clear force delete bit fails * bump to v0.5.15 (#2416) * 2417: Fix mkdocs Travis failure (#2420) * debug * reenable test * Fix SlurmWorkerManager overlaunching (#2391) Co-authored-by: Jing Ge <jingge2@illinois.edu> * Fix accessing worker information from WorkerInfoAccessor (#2419) * Don't fail WorkerManager if a network exception is encountered (#2422) * Rename actionbar => terminal (#2429) * rename actionbar->terminal * remove unused line * Update frontend/src/components/worksheets/Worksheet/Worksheet.js Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * action bar => terminal in comments Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Remove select all for table (#2428) * clear force delete bit * remove select-all checkbox * remove selectAll handler * Remove detach from the frontend (#2427) * clear force delete bit * remove detach from frontend * remove constant * Use GitHub Actions for CI (#2185) for #2094 Based off of Jane's work in https://github.com/candicegjing/codalab-worksheets Time: 40 minutes -> 15-20 minutes Several improvements: Build images split into three parallel steps - rest-server, worker, frontend Tests split into around 10 parallel steps - each step runs about 4 tests, and there is a step which runs the UI tests. If a CLI test gives an error, it archives all Docker containers' logs so that they can be downloaded and inspected. If the UI test gives an error, it archives the screenshot images so that they can be downloaded and inspected. Additionally, the publish to pypi process has changed. Now, publishes to pypi happen on every GitHub release as opposed to every tag push. 
Effectively, this means that the release workflow has changed a bit: Old workflow: Wait 40 mins for Travis build on master to complete Push a tag to master Wait 40 mins for Travis build on master to complete, which also deploys to pypi New workflow: Wait for GitHub Actions build on master to complete Create a new GitHub release Wait for action to complete, which only deploys to pypi * CI: github.head_ref -> github.ref * CI: properly populate VERSION variable with branch name * Add exit-after-num-runs=1 to slurm worker manager (#2373) Closed via #2289 * Fix showing file contents for record item (#2455) fix #2446 to test: create a schema with a field using files: % schema a % add hello /stdout display bundles in record mode % display record a * Fix long test startup times by using python:3.6.10-slim-buster for default test runs (#2449) Fixes #2388 by using python:3.6.10-slim-buster for default test runs. Following @nelson-liu 's suggestion in #2388 (comment). This image is still around ~40 MB (as opposed to the 5 GB default-cpu image). Time changes from ~20 mins -> 10 mins * Fix github actions on forks by not calling --push (#2457) This conditional expression was in the old `.travis.yml`, but it isn't there in the new github actions workflow. This PR adds that expression so that `--push` is not called from forks. 
Fixes #2456 * Prettier CLI Reference Doc (#2458) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> * Actually fix builds from forks by building docker images when needed (#2461) * Test worker restart in GHA (#2466) Resolves #2465 Adding back this test that we used to have in Travis Appended .log for to err log files Co-authored-by: Jing Ge <jingge2@illinois.edu> Co-authored-by: Nelson Liu <nelson-liu@users.noreply.github.com> Co-authored-by: yipenghe <yipenghe@stanford.edu> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Tony Lee <tonyh.lee@yahoo.com> Co-authored-by: Percy Liang <percyliang@gmail.com>

view details

armantajback

commit sha 9f4eaeaa3bec9a73fc555cec5cc90ee32287f8de

Update Staging Version (#2473)

view details

armantajback

commit sha aad44ea588e7f9e3d54aea8945b1f0dabad47541

Merge branch 'master' into staging

view details

Ashwin Ramaswami

commit sha 7c7624dc7254abc63bbdd9752e91c75d8140e541

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha aeac16afabefad725e9f365e4888c3b11e268af1

Merge branch 'master' into staging

view details

Ashwin Ramaswami

commit sha e6e56d8832c10167d40a1dfe61ba8c3bf1e9b7a7

Create SECURITY.md (#2501) * Create SECURITY.md Fixes #2226 Based off https://github.com/Azure/azure-sdk-for-python/blob/master/SECURITY.md * Update SECURITY.md

view details

Tony Lee

commit sha 4d5983eae9d55ccfce178af9fce724b09c44a43b

expect failure for infinite memory tests (#2507) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Nelson Liu

commit sha ac11edc772c1a7aef0dc12e18a70ddfe149f90b1

Re-create missing docker networks before launching containers (#2468)

view details

Jing Ge

commit sha 37d948d6be5068ebb5d60b54ee544fd90b7330d0

Fix state transition in finalizing worker run state (#2451) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Tony Lee <tonyh.lee@yahoo.com>

view details

Nelson Liu

commit sha 06f8521fbbfb0c7241201e43168f18308efe98a5

Don't limit workermanager search by provided tags (#2506) Co-authored-by: Jing Ge <jingge2@illinois.edu>

view details

Nelson Liu

commit sha c5eae4c25b4ae596e525f2af57989a8b815b59d6

Handle simultaneous image deletion in DockerImageManager (#2444)

view details

Jing Ge

commit sha 5076777c5e1fea50fc42e75b9364da66cc8f9d1a

Remove encoding and decoding (#2512) * Remove encoding and decoding * Black format Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>

view details

push time in 5 days

push event nelson-liu/codalab-worksheets

Ashwin Ramaswami

commit sha c9d503dda4c5fa252fc8f9b6641bef1ff04d3f86

Delete .pre-commit-config.yaml (#2539) Co-authored-by: Tony Lee <tonyh.lee@yahoo.com>

view details

dependabot[bot]

commit sha 73fa19357e99b87fdb442b0d057bdf04b98b6c65

Bump sass from 1.23.3 to 1.26.9 in /frontend (#2536) Bumps [sass](https://github.com/sass/dart-sass) from 1.23.3 to 1.26.9. - [Release notes](https://github.com/sass/dart-sass/releases) - [Changelog](https://github.com/sass/dart-sass/blob/master/CHANGELOG.md) - [Commits](https://github.com/sass/dart-sass/compare/1.23.3...1.26.9) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

dependabot[bot]

commit sha 4e3755d05e02f823e5b6d2243c86fb448f939a29

Bump docker from 3.7.0 to 4.2.2 (#2551) Bumps [docker](https://github.com/docker/docker-py) from 3.7.0 to 4.2.2. - [Release notes](https://github.com/docker/docker-py/releases) - [Commits](https://github.com/docker/docker-py/compare/3.7.0...4.2.2) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

view details

dependabot[bot]

commit sha bc16d7242cc52c25ecc86856466b46507b6b533b

Bump react-stickynode from 2.1.1 to 3.0.3 in /frontend (#2553) Bumps [react-stickynode](https://github.com/yahoo/react-stickynode) from 2.1.1 to 3.0.3. - [Release notes](https://github.com/yahoo/react-stickynode/releases) - [Commits](https://github.com/yahoo/react-stickynode/commits/v3.0.3) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

view details

Nelson Liu

commit sha 096df808a9f99bead441790f209f1c710cee4a93

Don't try to start bundles on offline / timed-out workers (#2424)

view details

Ashwin Ramaswami

commit sha fb6b8bd9ab65503250d5e753317ebb932cdb11ef

Re-add ace dependency (#2556)

view details

push time in 5 days

delete branch nelson-liu/codalab-worksheets

delete branch : fix_bundle_manager_stalling

delete time in 6 days

push event codalab/codalab-worksheets

Nelson Liu

commit sha 096df808a9f99bead441790f209f1c710cee4a93

Don't try to start bundles on offline / timed-out workers (#2424)

view details

push time in 6 days

PR merged codalab/codalab-worksheets

Don't try to start bundles on offline / timed-out workers

Fixes #2423

We need to both _cleanup_dead_workers and re-fetch the workers, and then try to run bundles on only the workers that are online.
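
A minimal sketch of that idea, not the actual BundleManager code: the worker records, the `checkin_time` and `free_slots` fields, the timeout value, and both function names below are hypothetical stand-ins chosen for illustration.

```python
import time

# Hypothetical timeout: a worker that has not checked in for this long is treated as dead.
WORKER_TIMEOUT_SECONDS = 60.0


def cleanup_dead_workers(workers, now, timeout=WORKER_TIMEOUT_SECONDS):
    """Return only the workers whose last check-in is recent enough."""
    return [w for w in workers if now - w["checkin_time"] <= timeout]


def schedule(staged_bundles, fetch_workers, now=None):
    """Assign each staged bundle to an online worker, skipping offline ones.

    `fetch_workers` is a callable so the worker list is re-fetched (and
    re-filtered) right before scheduling instead of being reused from earlier.
    """
    now = time.time() if now is None else now
    online = cleanup_dead_workers(fetch_workers(), now)
    assignments = {}
    for bundle in staged_bundles:
        for worker in online:
            if worker["free_slots"] > 0:
                worker["free_slots"] -= 1
                assignments[bundle] = worker["worker_id"]
                break
    return assignments
```

Because offline workers are dropped before the inner loop runs, no bundle is ever offered to a worker that cannot respond, which is the slow path the linked issue describes.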

Empirically, this doesn't seem to affect runtime that much at all, even with 40+ workers connected.

I also thought about fixing this issue by making workers check out when they exit, but there's no guarantee of a clean termination (e.g., if a worker gets SIGKILLED).

+34 -0

6 comments

1 changed file

nelson-liu

pr closed time in 6 days

issue closed codalab/codalab-worksheets

BundleManager sometimes stalls for a very long time (~1 hour)

Describe the bug

Sometimes, the BundleManager appears unresponsive for a very long time. New runs are not staged or assigned to run on workers and dead workers are not cleaned up.

This is running on the latest master branch.

After investigating, I figured out that the BundleManager isn't stuck--it simply takes an inordinate amount of time to try to schedule each bundle. In BundleManager._schedule_run_bundles_on_workers, we iterate through the staged bundles and try to start them on each of the workers that have the resources to take the bundle (https://github.com/codalab/codalab-worksheets/blob/master/codalab/server/bundle_manager.py#L386).

However, sometimes a worker goes offline in this process (this happens with the SlurmWorkerManager, when a worker gets pre-empted [1]). As a result, we have to iterate through every bundle and try to run it on a worker that's offline. This can take a large amount of time, upwards of an hour. This is bad because no bundles are staged for an hour+, which also means that new workers won't pick up jobs and might exit (if they have --exit-on-idle set to true).

[1]: I think there's an especially bad interaction with the SlurmWorkerManager and the restage-on-terminate feature—this issue always comes up when a job is pre-empted.

closed time in 6 days

nelson-liu

issue opened codalab/codalab-worksheets

Invariances in bundle manager can be confusing

The invariants in the bundle manager code are extremely hard to reason about, especially with regard to what's stale. A long-term issue would be to think closely about how to make it better.

Relevant PRs: https://github.com/codalab/codalab-worksheets/pull/2336 , https://github.com/codalab/codalab-worksheets/pull/2424 , https://github.com/codalab/codalab-worksheets/pull/2549

created time in 6 days

push event nelson-liu/codalab-worksheets

Ashwin Ramaswami

commit sha c9d503dda4c5fa252fc8f9b6641bef1ff04d3f86

Delete .pre-commit-config.yaml (#2539) Co-authored-by: Tony Lee <tonyh.lee@yahoo.com>

view details

dependabot[bot]

commit sha 73fa19357e99b87fdb442b0d057bdf04b98b6c65

Bump sass from 1.23.3 to 1.26.9 in /frontend (#2536) Bumps [sass](https://github.com/sass/dart-sass) from 1.23.3 to 1.26.9. - [Release notes](https://github.com/sass/dart-sass/releases) - [Changelog](https://github.com/sass/dart-sass/blob/master/CHANGELOG.md) - [Commits](https://github.com/sass/dart-sass/compare/1.23.3...1.26.9) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

dependabot[bot]

commit sha 4e3755d05e02f823e5b6d2243c86fb448f939a29

Bump docker from 3.7.0 to 4.2.2 (#2551) Bumps [docker](https://github.com/docker/docker-py) from 3.7.0 to 4.2.2. - [Release notes](https://github.com/docker/docker-py/releases) - [Commits](https://github.com/docker/docker-py/compare/3.7.0...4.2.2) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

view details

dependabot[bot]

commit sha bc16d7242cc52c25ecc86856466b46507b6b533b

Bump react-stickynode from 2.1.1 to 3.0.3 in /frontend (#2553) Bumps [react-stickynode](https://github.com/yahoo/react-stickynode) from 2.1.1 to 3.0.3. - [Release notes](https://github.com/yahoo/react-stickynode/releases) - [Commits](https://github.com/yahoo/react-stickynode/commits/v3.0.3) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

view details

Nelson Liu

commit sha d0927ccb06f1f2e821052f15ce8c483393465120

Merge branch 'master' into fix_bundle_manager_stalling

view details

push time in 6 days

Pull request review comment codalab/codalab-worksheets

Fix indentation when updating workers' exit_after_num_runs in BundleManager

 def _schedule_run_bundles_on_workers(self, workers, staged_bundles_to_run, user_
                     worker['exit_after_num_runs'] -= 1
                     break
 
-        # To avoid the potential race condition between bundle manager's dispatch frequency and
-        # worker's checkin frequency, update the column "exit_after_num_runs" in worker table
-        # before bundle manager's next scheduling loop
-        for worker in workers_list:
-            # Update workers that have "exit_after_num_runs" manually set from CLI.
-            if (
-                worker['exit_after_num_runs']
-                < workers._workers[worker['worker_id']]['exit_after_num_runs']
-            ):
-                self._worker_model.update_workers(
-                    worker["user_id"],
-                    worker['worker_id'],
-                    {'exit_after_num_runs': worker['exit_after_num_runs']},
-                )
+            # To avoid the potential race condition between bundle manager's dispatch frequency and

Hmm, how was this code working before, then? As-is, the value of workers_list depends on which workers were eligible for the last bundle (i.e., it's bundle-order dependent). So you could get different results depending on the bundle order, which doesn't seem ideal. Thoughts?

nelson-liu

comment created time in 6 days

PR opened codalab/codalab-worksheets

Fix indentation when updating workers' exit_after_num_runs in BundleManager

Right now, the BundleManager only updates exit_after_num_runs for workers that are eligible to run the last bundle, since workers_list is shadowed in the loop and recreated for each bundle. I'm pretty sure this is a bug rather than the intended behavior.

I'm pretty sure we want to be running this after every bundle dispatch instead, ref: https://github.com/codalab/codalab-worksheets/pull/2424#discussion_r447828225
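To make the bundle-order dependence concrete, here is a minimal sketch of the pattern. It is not the BundleManager code: the function names, the `eligible_workers` mapping, and the worker and bundle names are all hypothetical, and the "update" is reduced to recording which workers would be touched.

```python
def update_after_loop(bundles, eligible_workers):
    """Mimics the reported behavior: the update runs once, after the loop."""
    updated = []
    workers_list = []
    for bundle in bundles:
        # workers_list is recreated for every bundle.
        workers_list = eligible_workers[bundle]
    # Only the workers eligible for the *last* bundle are visible here.
    for worker in workers_list:
        updated.append(worker)
    return updated


def update_inside_loop(bundles, eligible_workers):
    """Mimics the proposed behavior: the update runs after every bundle."""
    updated = []
    for bundle in bundles:
        workers_list = eligible_workers[bundle]
        for worker in workers_list:
            if worker not in updated:
                updated.append(worker)
    return updated


# Hypothetical eligibility data: each bundle can run on a different worker.
eligible = {"bundle-a": ["worker-1"], "bundle-b": ["worker-2"]}
print(update_after_loop(["bundle-a", "bundle-b"], eligible))
print(update_after_loop(["bundle-b", "bundle-a"], eligible))
print(update_inside_loop(["bundle-a", "bundle-b"], eligible))
```

In the first function, reversing the bundle order changes which worker gets updated; in the second, every eligible worker is updated regardless of order.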

+14 -15

0 comment

1 changed file

pr created time in 6 days

create branch nelson-liu/codalab-worksheets

branch : exit_num_runs_indentation

created branch time in 6 days

push event nelson-liu/codalab-worksheets

Tony Lee

commit sha adaf6cf76118e8e4c05de82847d8455cc50da8de

Merge master into staging (#2121) * Update pre-commit script to cleanup venv on failure and add new troubleshooting steps (#1992) * Update stress test script. (#2076) * Fix build issue in Travis (#2091) Closed via #2090 * worksheet title should not be editable when user doesn't have permission' (#2085) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Improve error handling for CLI (#2049) Closed via #1958 * Support tagging CodaLab's public instances (#2093) Closed via #2031 * Use "//" comment for editor ctrl + / behavior (#2063) * change blockComment style Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Bump psutil from 5.4.6 to 5.6.6 (#2087) * Bump psutil from 5.4.6 to 5.6.6 Bumps [psutil](https://github.com/giampaolo/psutil) from 5.4.6 to 5.6.6. - [Release notes](https://github.com/giampaolo/psutil/releases) - [Changelog](https://github.com/giampaolo/psutil/blob/master/HISTORY.rst) - [Commits](https://github.com/giampaolo/psutil/compare/release-5.4.6...release-5.6.6) Signed-off-by: dependabot[bot] <support@github.com> * don't pass version arg to .travis.yml * Update .travis.yml * use old travis with --version * fix format Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * get rid of invalid argument file when logging (#2103) * Improve CLI documentation and add -m option for auto generating CLI docs (#2057) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> * Check the existence of valid_bundle before further actions (#2107) Closed via #2106 * move cpu and gpu resource check to the beginning of _transition_from_… (#2104) Closed via #2096 * Adjust timezone information when calculating current container running time (#2113) Closed via #2112 * Binding the actual cores that can be used to the current process (#2110) Closed via #1045 * Make Travis badge link to build status (#2116) Co-authored-by: Percy Liang <percyliang@gmail.com> * Add queue in 
new run field (#2040) * add queue in new run field Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Make cl workers show user-owned workers for non-root users (#2118) * Show cl workers for non-root users, limiting to ones they own * Black format * Update CLI reference * Fix CLI reference string for cl workers. * Cosmetic changes to /workers/info * Simplify cl workers CLI help string * Lint * 2108: Handle incompatible Cuda version specified in the user's Docker image (#2119) Co-authored-by: armantajback <armantajback@users.noreply.github.com> Co-authored-by: Jing Ge <jingge2@illinois.edu> Co-authored-by: yipenghe <yipenghe@stanford.edu> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Nelson Liu <nelson-liu@users.noreply.github.com> Co-authored-by: Percy Liang <percyliang@gmail.com>

view details

Ashwin Ramaswami

commit sha 6a9a49c75d694312e685b71495af1e850ba6f621

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha d7ca58cfc503f16151f8d9c3a63e181b08c2918f

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha d2ebc232d3192f52da5bb29055d43b53f347e3e0

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Tony Lee

commit sha 6eefba4f0d5296f17a11eaddbf1cff701833309f

Merge branch 'master' of https://github.com/codalab/codalab-worksheets into staging

view details

Tony Lee

commit sha 88aabe6e50e69f4d6cf3b59aa7de6401a3aa453d

debug

view details

Tony Lee

commit sha 95809d8bb631242535586b356e1235915a3f001c

debug

view details

Tony Lee

commit sha c54110df23deb6722d4ece13b39c94f21e59b7d5

Disable readthedocs test for now

view details

armantajback

commit sha 79e8351fc6ced4be5becacc57b9254cd31ca7e02

Merge Master Into Staging (#2469) * Fix free disk bytes calculation (#2370) Closed via #2362 * Fix WorkerManager launching (#2372) Co-authored-by: Jing Ge <jingge2@illinois.edu> * Make worker id match output directory and job name (#2369) * Make worker id match output directory and job name * Blacken Co-authored-by: Jing Ge <jingge2@illinois.edu> * Pass down max work dir size and delete workdir on exit to slurm workers (#2354) * Pass down max work dir size and delete workdir on exit to slurm workers * Fix typo. Co-authored-by: Jing Ge <jingge2@illinois.edu> * == -> === (#2357) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Refactor dialogs into worksheetDialog component (#2263) * merge openX to one open dialog variable * refactor error message, delete worksheet message, bundle action fail dialogs into worksheet dialog component * add check when using enter keyword Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * remove context menu (#2376) * Update print information (#2382) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> * Add a CLI option to limit the number of jobs allowed to run on a worker (#2289) Closed via #2287 * Set cli verbose default to 0 (#2378) Closed via #2350 * Mounting dependencies on paths specified by keys (#2295) * Fix time displaying issue (#2381) * Fix datetime displaying issue * Add a comment Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: yipenghe <yipenghe@stanford.edu> * overdue (#2393) * fix migration (#2403) * Make strings more consistent in terms of terminology / organization (#2409) * tweak dialog prompts * add cut/copied * organize keyboard shortcuts * fix formatting * remove dummy file * ignore a d when a dialog is opened Co-authored-by: yipenghe <yipenghe@stanford.edu> * Replace sacct with scontrol, since sacct is only available on the slurm head node (#2368) * Handle sort key is 0 condition on frontend and use correct sort key for new runs (#2412) * handle sort key 
is 0 condition * handle sort key is 0 in getAfterSortKey function * use this.props.after_sort_key instead * Clear force delete bit after deletion (#2413) * clear force delete bit * clear force delete bit fails * bump to v0.5.15 (#2416) * 2417: Fix mkdocs Travis failure (#2420) * debug * reenable test * Fix SlurmWorkerManager overlaunching (#2391) Co-authored-by: Jing Ge <jingge2@illinois.edu> * Fix accessing worker information from WorkerInfoAccessor (#2419) * Don't fail WorkerManager if a network exception is encountered (#2422) * Rename actionbar => terminal (#2429) * rename actionbar->terminal * remove unused line * Update frontend/src/components/worksheets/Worksheet/Worksheet.js Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * action bar => terminal in comments Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Remove select all for table (#2428) * clear force delete bit * remove select-all checkbox * remove selectAll handler * Remove detach from the frontend (#2427) * clear force delete bit * remove detach from frontend * remove constant * Use GitHub Actions for CI (#2185) for #2094 Based off of Jane's work in https://github.com/candicegjing/codalab-worksheets Time: 40 minutes -> 15-20 minutes Several improvements: Build images split into three parallel steps - rest-server, worker, frontend Tests split into around 10 parallel steps - each step runs about 4 tests, and there is a step which runs the UI tests. If a CLI test gives an error, it archives all Docker containers' logs so that they can be downloaded and inspected. If the UI test gives an error, it archives the screenshot images so that they can be downloaded and inspected. Additionally, the publish to pypi process has changed. Now, publishes to pypi happen on every GitHub release as opposed to every tag push. 
Effectively, this means that the release workflow has changed a bit: Old workflow: Wait 40 mins for Travis build on master to complete Push a tag to master Wait 40 mins for Travis build on master to complete, which also deploys to pypi New workflow: Wait for GitHub Actions build on master to complete Create a new GitHub release Wait for action to complete, which only deploys to pypi * CI: github.head_ref -> github.ref * CI: properly populate VERSION variable with branch name * Add exit-after-num-runs=1 to slurm worker manager (#2373) Closed via #2289 * Fix showing file contents for record item (#2455) fix #2446 to test: create a schema with a field using files: % schema a % add hello /stdout display bundles in record mode % display record a * Fix long test startup times by using python:3.6.10-slim-buster for default test runs (#2449) Fixes #2388 by using python:3.6.10-slim-buster for default test runs. Following @nelson-liu 's suggestion in #2388 (comment). This image is still around ~40 MB (as opposed to the 5 GB default-cpu image). Time changes from ~20 mins -> 10 mins * Fix github actions on forks by not calling --push (#2457) This conditional expression was in the old `.travis.yml`, but it isn't there in the new github actions workflow. This PR adds that expression so that `--push` is not called from forks. 
Fixes #2456 * Prettier CLI Reference Doc (#2458) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> * Actually fix builds from forks by building docker images when needed (#2461) * Test worker restart in GHA (#2466) Resolves #2465 Adding back this test that we used to have in Travis Appended .log for to err log files Co-authored-by: Jing Ge <jingge2@illinois.edu> Co-authored-by: Nelson Liu <nelson-liu@users.noreply.github.com> Co-authored-by: yipenghe <yipenghe@stanford.edu> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Tony Lee <tonyh.lee@yahoo.com> Co-authored-by: Percy Liang <percyliang@gmail.com>

view details

armantajback

commit sha 9f4eaeaa3bec9a73fc555cec5cc90ee32287f8de

Update Staging Version (#2473)

view details

armantajback

commit sha aad44ea588e7f9e3d54aea8945b1f0dabad47541

Merge branch 'master' into staging

view details

Ashwin Ramaswami

commit sha 7c7624dc7254abc63bbdd9752e91c75d8140e541

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha aeac16afabefad725e9f365e4888c3b11e268af1

Merge branch 'master' into staging

view details

Tony Lee

commit sha a0e44dc4c71bd32d18baf944029fe284be6c9510

fix heartbeat (#2517)

view details

Ashwin Ramaswami

commit sha 1bf72e6adcf5d01614b7a5c4cd12fbe2d051ed2d

reenable gen-readthedocs

view details

Ashwin Ramaswami

commit sha 42d800efa42dfdcdf3770ee54a4f7f6ecba057b0

Re-deploy docker images with "v" prefix

view details

Ashwin Ramaswami

commit sha 849d58edc2d07be40459855885ba776547437f1c

Re-enable other steps of travis-deploy

view details

Ashwin Ramaswami

commit sha a5481a6329f41afe5fbb5bb0d4cbbba1b9319bdd

Update test_cli.py

view details

Nelson Liu

commit sha a13cd5a5c5bb66bb6aedfb6a2222e8c0b4289840

Make WorkerManagers sleep on JsonApiException (#2523)

view details

Ashwin Ramaswami

commit sha dbbbfe8696d322626166a68be3c48e152db87ec9

CI: Remove some parallelism (#2478) * CI: Remove some parallelism Remove some unnecessary parallelism in github actions, so that we require less parallel workers per job. * Update test.yml * shorten line * combine netcurl

view details

push time in 6 days

pull request comment codalab/codalab-worksheets

Don't try to start bundles on offline / timed-out workers

OK, @percyliang / @teetone, here's a summary of what's going on here (and a recap of #2423):

#2423 tracks an issue where the BundleManager spends an inordinate amount of time in the _schedule_run_bundles_on_workers function (https://github.com/codalab/codalab-worksheets/blob/7426f4ca70a65a3f9e42926ea51cda7197b0d9ee/codalab/server/bundle_manager.py#L315). After doing some digging, I figured out that the issue occurs because the worker information we have is stale: the BundleManager thinks that there are workers that can run jobs, when really they're offline. Since the BundleManager thinks these workers are available, it tries to schedule each bundle on them in turn (assuming the bundle meets resource requirements).

However, this scheduling is doomed to fail, since the worker is offline. When the _try_start_bundle function (https://github.com/codalab/codalab-worksheets/blob/7426f4ca70a65a3f9e42926ea51cda7197b0d9ee/codalab/server/bundle_manager.py#L398) fails, it sleeps for 0.3 seconds, so each bundle takes at least 0.3 seconds to process. When we have lots of bundles (say, 10,000), the BundleManager is stuck in this loop for at least 50 minutes (10,000 × 0.3 s = 3,000 s). While it's in this loop, no new staged bundles can be run, no new workers will receive staged bundles, etc.
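For a rough sense of scale, the lower bound on the stall can be computed directly. This is only back-of-the-envelope arithmetic: the helper name is made up, and the only inputs are the 0.3-second sleep and the bundle count mentioned above.

```python
def minimum_stall_seconds(num_bundles, sleep_per_failure_seconds=0.3):
    """Lower bound on time spent in the loop if every start attempt fails."""
    return num_bundles * sleep_per_failure_seconds


for num_bundles in (100, 1_000, 10_000):
    minutes = minimum_stall_seconds(num_bundles) / 60
    print(f"{num_bundles} bundles: at least {minutes:.1f} minutes")
```

The bound grows linearly with the number of staged bundles, so the stall only becomes noticeable once the queue is large.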

The fix I've applied here is to re-fetch the online workers before trying to schedule each bundle. The WorkerInfoAccessor automatically refreshes every 60 seconds, so by querying it before each bundle we schedule, a worker that goes offline stays stale in the cache for at most 60 seconds, which is negligible. I also keep a running set of workers that have gone offline, to respect prioritization in the case where a worker is online, goes offline during bundle scheduling, and comes back online: we don't want that worker to be assigned bundles, since those bundles might have jumped the queue.
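Roughly, the idea can be sketched as below. This is a simplified illustration and not the code in the PR: `CachedWorkerInfo` is a hypothetical stand-in for a cache that refreshes on a timer, `schedule_bundles` is a made-up name, and worker capacity and resource checks are left out entirely.

```python
import time


class CachedWorkerInfo:
    """Hypothetical cache of online workers that refreshes on a timer."""

    def __init__(self, fetch_online_workers, refresh_interval_seconds=60.0,
                 clock=time.monotonic):
        self._fetch = fetch_online_workers
        self._interval = refresh_interval_seconds
        self._clock = clock
        self._last_fetch = None
        self._workers = set()

    def online_workers(self):
        now = self._clock()
        # Re-fetch only if the cached copy is older than the interval.
        if self._last_fetch is None or now - self._last_fetch >= self._interval:
            self._workers = set(self._fetch())
            self._last_fetch = now
        return set(self._workers)


def schedule_bundles(bundles, worker_info):
    """Assign each bundle to an online worker, re-checking before each one."""
    assignments = {}
    known = set()         # every worker seen online during this pass
    gone_offline = set()  # workers that disappeared at some point in this pass
    for bundle in bundles:
        online = worker_info.online_workers()
        gone_offline |= known - online
        known |= online
        # Skip workers that dropped out earlier, even if they are back online.
        candidates = sorted(online - gone_offline)
        if candidates:
            assignments[bundle] = candidates[0]
    return assignments
```

Querying the cache before every bundle is cheap because most calls return the cached set; the expensive fetch still happens at most once per refresh interval.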

Let me know if that made sense / needs more clarification, happy to elaborate.

nelson-liu

comment created time in 6 days

push event nelson-liu/codalab-worksheets

Ashwin Ramaswami

commit sha e6e56d8832c10167d40a1dfe61ba8c3bf1e9b7a7

Create SECURITY.md (#2501) * Create SECURITY.md Fixes #2226 Based off https://github.com/Azure/azure-sdk-for-python/blob/master/SECURITY.md * Update SECURITY.md

view details

Tony Lee

commit sha 4d5983eae9d55ccfce178af9fce724b09c44a43b

expect failure for infinite memory tests (#2507) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Nelson Liu

commit sha ac11edc772c1a7aef0dc12e18a70ddfe149f90b1

Re-create missing docker networks before launching containers (#2468)

view details

Jing Ge

commit sha 37d948d6be5068ebb5d60b54ee544fd90b7330d0

Fix state transition in finalizing worker run state (#2451) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Tony Lee <tonyh.lee@yahoo.com>

view details

Nelson Liu

commit sha 06f8521fbbfb0c7241201e43168f18308efe98a5

Don't limit workermanager search by provided tags (#2506) Co-authored-by: Jing Ge <jingge2@illinois.edu>

view details

Nelson Liu

commit sha c5eae4c25b4ae596e525f2af57989a8b815b59d6

Handle simultaneous image deletion in DockerImageManager (#2444)

view details

Jing Ge

commit sha 5076777c5e1fea50fc42e75b9364da66cc8f9d1a

Remove encoding and decoding (#2512) * Remove encoding and decoding * Black format Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>

view details

push time in 6 days

push event nelson-liu/codalab-worksheets

Nelson Liu

commit sha 1c9a7bf6f429ef3b066ac862a296a2669fd1813c

Revert changes to exit_after_num_runs logic.

view details

push time in 6 days

delete branch nelson-liu/codalab-worksheets

delete branch : worker_gpu_isolation

delete time in 6 days

push event codalab/codalab-worksheets

Nelson Liu

commit sha 0746961549421a8049ef0c2282261d792c99b430

Run nvidia-smi directly on worker host to respect GPU isolation (#2415)

view details

push time in 6 days

PR merged codalab/codalab-worksheets

Run nvidia-smi directly on worker host to respect GPU isolation

Currently, workers parse their GPUSET with docker, which doesn't respect GPU isolation. I'm not sure what the original reason for doing this was / if there's anything I'm missing—any ideas?

Most of this new functionality is taken directly from https://github.com/codalab/codalab-worksheets/blob/master/codalab/worker/docker_utils.py#L106-L113 . I didn't think the overlap was large enough to warrant an extra function, especially since it's subprocess vs. docker run.
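For reference, querying the GPUs from the host process looks roughly like the sketch below. This is an illustration, not the code in the PR: the function names are made up, and it assumes `nvidia-smi` is on the PATH and prints one `index, uuid` pair per line for the query shown.

```python
import subprocess


def parse_gpu_listing(output):
    """Parse lines of the form 'index, uuid' into a dict of index -> UUID."""
    gpus = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        index, uuid = (field.strip() for field in line.split(",", 1))
        gpus[index] = uuid
    return gpus


def get_visible_gpus():
    """Run nvidia-smi in this process; return {} if it is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,uuid", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # No NVIDIA driver or binary on this host, or the query failed.
        return {}
    return parse_gpu_listing(result.stdout)


if __name__ == "__main__":
    print(get_visible_gpus())
```

Because the command runs as a child of the worker process instead of inside a fresh container, it should report whatever GPUs that process is allowed to see on the host.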

+16 -2

0 comment

1 changed file

nelson-liu

pr closed time in 6 days

push eventnelson-liu/codalab-worksheets

Tony Lee

commit sha adaf6cf76118e8e4c05de82847d8455cc50da8de

Merge master into staging (#2121) * Update pre-commit script to cleanup venv on failure and add new troubleshooting steps (#1992) * Update stress test script. (#2076) * Fix build issue in Travis (#2091) Closed via #2090 * worksheet title should not be editable when user doesn't have permission' (#2085) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Improve error handling for CLI (#2049) Closed via #1958 * Support tagging CodaLab's public instances (#2093) Closed via #2031 * Use "//" comment for editor ctrl + / behavior (#2063) * change blockComment style Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Bump psutil from 5.4.6 to 5.6.6 (#2087) * Bump psutil from 5.4.6 to 5.6.6 Bumps [psutil](https://github.com/giampaolo/psutil) from 5.4.6 to 5.6.6. - [Release notes](https://github.com/giampaolo/psutil/releases) - [Changelog](https://github.com/giampaolo/psutil/blob/master/HISTORY.rst) - [Commits](https://github.com/giampaolo/psutil/compare/release-5.4.6...release-5.6.6) Signed-off-by: dependabot[bot] <support@github.com> * don't pass version arg to .travis.yml * Update .travis.yml * use old travis with --version * fix format Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * get rid of invalid argument file when logging (#2103) * Improve CLI documentation and add -m option for auto generating CLI docs (#2057) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> * Check the existence of valid_bundle before further actions (#2107) Closed via #2106 * move cpu and gpu resource check to the beginning of _transition_from_… (#2104) Closed via #2096 * Adjust timezone information when calculating current container running time (#2113) Closed via #2112 * Binding the actual cores that can be used to the current process (#2110) Closed via #1045 * Make Travis badge link to build status (#2116) Co-authored-by: Percy Liang <percyliang@gmail.com> * Add queue in 
new run field (#2040) * add queue in new run field Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Make cl workers show user-owned workers for non-root users (#2118) * Show cl workers for non-root users, limiting to ones they own * Black format * Update CLI reference * Fix CLI reference string for cl workers. * Cosmetic changes to /workers/info * Simplify cl workers CLI help string * Lint * 2108: Handle incompatible Cuda version specified in the user's Docker image (#2119) Co-authored-by: armantajback <armantajback@users.noreply.github.com> Co-authored-by: Jing Ge <jingge2@illinois.edu> Co-authored-by: yipenghe <yipenghe@stanford.edu> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Nelson Liu <nelson-liu@users.noreply.github.com> Co-authored-by: Percy Liang <percyliang@gmail.com>

view details

Ashwin Ramaswami

commit sha 6a9a49c75d694312e685b71495af1e850ba6f621

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha d7ca58cfc503f16151f8d9c3a63e181b08c2918f

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha d2ebc232d3192f52da5bb29055d43b53f347e3e0

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Tony Lee

commit sha 6eefba4f0d5296f17a11eaddbf1cff701833309f

Merge branch 'master' of https://github.com/codalab/codalab-worksheets into staging

view details

Tony Lee

commit sha 88aabe6e50e69f4d6cf3b59aa7de6401a3aa453d

debug

view details

Tony Lee

commit sha 95809d8bb631242535586b356e1235915a3f001c

debug

view details

Tony Lee

commit sha c54110df23deb6722d4ece13b39c94f21e59b7d5

Disable readthedocs test for now

view details

armantajback

commit sha 79e8351fc6ced4be5becacc57b9254cd31ca7e02

Merge Master Into Staging (#2469) * Fix free disk bytes calculation (#2370) Closed via #2362 * Fix WorkerManager launching (#2372) Co-authored-by: Jing Ge <jingge2@illinois.edu> * Make worker id match output directory and job name (#2369) * Make worker id match output directory and job name * Blacken Co-authored-by: Jing Ge <jingge2@illinois.edu> * Pass down max work dir size and delete workdir on exit to slurm workers (#2354) * Pass down max work dir size and delete workdir on exit to slurm workers * Fix typo. Co-authored-by: Jing Ge <jingge2@illinois.edu> * == -> === (#2357) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Refactor dialogs into worksheetDialog component (#2263) * merge openX to one open dialog variable * refactor error message, delete worksheet message, bundle action fail dialogs into worksheet dialog component * add check when using enter keyword Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * remove context menu (#2376) * Update print information (#2382) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> * Add a CLI option to limit the number of jobs allowed to run on a worker (#2289) Closed via #2287 * Set cli verbose default to 0 (#2378) Closed via #2350 * Mounting dependencies on paths specified by keys (#2295) * Fix time displaying issue (#2381) * Fix datetime displaying issue * Add a comment Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: yipenghe <yipenghe@stanford.edu> * overdue (#2393) * fix migration (#2403) * Make strings more consistent in terms of terminology / organization (#2409) * tweak dialog prompts * add cut/copied * organize keyboard shortcuts * fix formatting * remove dummy file * ignore a d when a dialog is opened Co-authored-by: yipenghe <yipenghe@stanford.edu> * Replace sacct with scontrol, since sacct is only available on the slurm head node (#2368) * Handle sort key is 0 condition on frontend and use correct sort key for new runs (#2412) * handle sort key 
is 0 condition * handle sort key is 0 in getAfterSortKey function * use this.props.after_sort_key instead * Clear force delete bit after deletion (#2413) * clear force delete bit * clear force delete bit fails * bump to v0.5.15 (#2416) * 2417: Fix mkdocs Travis failure (#2420) * debug * reenable test * Fix SlurmWorkerManager overlaunching (#2391) Co-authored-by: Jing Ge <jingge2@illinois.edu> * Fix accessing worker information from WorkerInfoAccessor (#2419) * Don't fail WorkerManager if a network exception is encountered (#2422) * Rename actionbar => terminal (#2429) * rename actionbar->terminal * remove unused line * Update frontend/src/components/worksheets/Worksheet/Worksheet.js Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * action bar => terminal in comments Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * Remove select all for table (#2428) * clear force delete bit * remove select-all checkbox * remove selectAll handler * Remove detach from the frontend (#2427) * clear force delete bit * remove detach from frontend * remove constant * Use GitHub Actions for CI (#2185) for #2094 Based off of Jane's work in https://github.com/candicegjing/codalab-worksheets Time: 40 minutes -> 15-20 minutes Several improvements: Build images split into three parallel steps - rest-server, worker, frontend Tests split into around 10 parallel steps - each step runs about 4 tests, and there is a step which runs the UI tests. If a CLI test gives an error, it archives all Docker containers' logs so that they can be downloaded and inspected. If the UI test gives an error, it archives the screenshot images so that they can be downloaded and inspected. Additionally, the publish to pypi process has changed. Now, publishes to pypi happen on every GitHub release as opposed to every tag push. 
Effectively, this means that the release workflow has changed a bit: Old workflow: Wait 40 mins for Travis build on master to complete Push a tag to master Wait 40 mins for Travis build on master to complete, which also deploys to pypi New workflow: Wait for GitHub Actions build on master to complete Create a new GitHub release Wait for action to complete, which only deploys to pypi * CI: github.head_ref -> github.ref * CI: properly populate VERSION variable with branch name * Add exit-after-num-runs=1 to slurm worker manager (#2373) Closed via #2289 * Fix showing file contents for record item (#2455) fix #2446 to test: create a schema with a field using files: % schema a % add hello /stdout display bundles in record mode % display record a * Fix long test startup times by using python:3.6.10-slim-buster for default test runs (#2449) Fixes #2388 by using python:3.6.10-slim-buster for default test runs. Following @nelson-liu 's suggestion in #2388 (comment). This image is still around ~40 MB (as opposed to the 5 GB default-cpu image). Time changes from ~20 mins -> 10 mins * Fix github actions on forks by not calling --push (#2457) This conditional expression was in the old `.travis.yml`, but it isn't there in the new github actions workflow. This PR adds that expression so that `--push` is not called from forks. 
Fixes #2456 * Prettier CLI Reference Doc (#2458) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> * Actually fix builds from forks by building docker images when needed (#2461) * Test worker restart in GHA (#2466) Resolves #2465 Adding back this test that we used to have in Travis Appended .log for to err log files Co-authored-by: Jing Ge <jingge2@illinois.edu> Co-authored-by: Nelson Liu <nelson-liu@users.noreply.github.com> Co-authored-by: yipenghe <yipenghe@stanford.edu> Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Tony Lee <tonyh.lee@yahoo.com> Co-authored-by: Percy Liang <percyliang@gmail.com>

view details

armantajback

commit sha 9f4eaeaa3bec9a73fc555cec5cc90ee32287f8de

Update Staging Version (#2473)

view details

armantajback

commit sha aad44ea588e7f9e3d54aea8945b1f0dabad47541

Merge branch 'master' into staging

view details

Ashwin Ramaswami

commit sha 7c7624dc7254abc63bbdd9752e91c75d8140e541

Merge branch 'master' of github.com:codalab/codalab-worksheets into staging

view details

Ashwin Ramaswami

commit sha aeac16afabefad725e9f365e4888c3b11e268af1

Merge branch 'master' into staging

view details

Ashwin Ramaswami

commit sha e6e56d8832c10167d40a1dfe61ba8c3bf1e9b7a7

Create SECURITY.md (#2501) * Create SECURITY.md Fixes #2226 Based off https://github.com/Azure/azure-sdk-for-python/blob/master/SECURITY.md * Update SECURITY.md

view details

Tony Lee

commit sha 4d5983eae9d55ccfce178af9fce724b09c44a43b

expect failure for infinite memory tests (#2507) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Nelson Liu

commit sha ac11edc772c1a7aef0dc12e18a70ddfe149f90b1

Re-create missing docker networks before launching containers (#2468)

view details

Jing Ge

commit sha 37d948d6be5068ebb5d60b54ee544fd90b7330d0

Fix state transition in finalizing worker run state (#2451) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Tony Lee <tonyh.lee@yahoo.com>

view details

Nelson Liu

commit sha 06f8521fbbfb0c7241201e43168f18308efe98a5

Don't limit workermanager search by provided tags (#2506) Co-authored-by: Jing Ge <jingge2@illinois.edu>

view details

Nelson Liu

commit sha c5eae4c25b4ae596e525f2af57989a8b815b59d6

Handle simultaneous image deletion in DockerImageManager (#2444)

view details

Jing Ge

commit sha 5076777c5e1fea50fc42e75b9364da66cc8f9d1a

Remove encoding and decoding (#2512) * Remove encoding and decoding * Black format Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>

view details

push time in 6 days

issue openedcodalab/codalab-worksheets

Workers on AWS can't reconnect to main bundle service

I'm running some workers on AWS, and they were working before the new deployment. Now, they're all logging:

2020-06-30 21:24:42,349 Disconnected from server! Failed check in: Unable to check in with bundle service: Internal Server Error - Internal Server Error
2020-06-30 21:24:42,349 Checkin failed twice in a row, sleeping 5 seconds
2020-06-30 21:24:47,780 Disconnected from server! Failed check in: Unable to check in with bundle service: Internal Server Error - Internal Server Error
2020-06-30 21:24:47,781 Checkin failed twice in a row, sleeping 5 seconds

etc...

It seems like workers that I have on slurm are going fine, so I'm not sure what the issue could be / what this internal server error could be. Maybe someone could check the bundle service logs?

created time in 7 days

fork nelson-liu/builder

Continuous builder and binary build scripts for pytorch

fork in 7 days

Pull request review commentcodalab/codalab-worksheets

Don't try to start bundles on offline / timed-out workers

 def _schedule_run_bundles_on_workers(self, workers, staged_bundles_to_run, user_
         # To avoid the potential race condition between bundle manager's dispatch frequency and
         # worker's checkin frequency, update the column "exit_after_num_runs" in worker table
         # before bundle manager's next scheduling loop
-        for worker in workers_list:
+        for worker in workers.workers():

I think workers_list is what I want here because workers_list contains the minimum number of workers that would need to update their exit_after_num_runs value in the database (it's the list of eligible workers that could potentially run the current bundle)

oh, got it. I think you meant to indent this for loop by 4, then? Right now, it runs after all of the bundles are finished scheduling.
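The indentation point can be shown with a small, self-contained sketch. The bundle and worker names below are hypothetical stand-ins, not the scheduler's real objects; the only thing illustrated is that a loop placed after the scheduling loop sees just the `workers_list` left over from the final iteration.

```python
# Hypothetical stand-ins: the real scheduler works with bundle and worker objects.
eligible_workers_by_bundle = {
    "bundle-a": ["worker-1", "worker-2"],
    "bundle-b": ["worker-3"],
}

updated_inside_loop = []
updated_after_loop = []

for bundle, workers_list in eligible_workers_by_bundle.items():
    # Indented version: runs once per bundle, with that bundle's eligible workers.
    for worker in workers_list:
        updated_inside_loop.append(worker)

# Dedented version: runs once, after all bundles are handled, and only sees
# the workers_list bound during the last iteration of the outer loop.
for worker in workers_list:
    updated_after_loop.append(worker)

print(updated_inside_loop)
print(updated_after_loop)
```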

nelson-liu

comment created time in 8 days

push eventnelson-liu/codalab-worksheets

Nelson Liu

commit sha ef2017f52a36c4fe4f3b8f3b72392b229a5a8e28

Store offline worker IDs

view details

push time in 8 days

PR opened codalab/codalab-worksheets

Make WorkerManagers sleep on JsonApiException

If the JsonApiClient encounters an exception, it doesn't pass that exception through. Instead, it raises a JsonApiException. This PR makes the WorkerManager catch these errors and sleep when it sees them, rather than crash.
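A minimal sketch of the catch-and-sleep pattern, assuming a polling loop. The `JsonApiException` class here is a local stand-in for the real one, and `run_loop`, `step`, and the injectable `sleep` argument are hypothetical names used only for illustration; this is not the PR's actual diff.

```python
import logging
import time

logger = logging.getLogger(__name__)


class JsonApiException(Exception):
    """Local stand-in for the client's exception type (assumption)."""


def run_loop(step, sleep_seconds, max_iterations, sleep=time.sleep):
    """Call step() repeatedly; on JsonApiException, log it and keep going.

    Returns the number of iterations that failed with JsonApiException.
    """
    failures = 0
    for _ in range(max_iterations):
        try:
            step()
        except JsonApiException as err:
            # Treat the API error as transient instead of letting it crash the loop.
            failures += 1
            logger.warning("API error, retrying after sleep: %s", err)
        sleep(sleep_seconds)
    return failures
```

Passing `sleep` as a parameter is only a convenience so the loop can be exercised without real delays.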

(all of my workermanagers crashed with the deployment to 0.5.16 due to this)

+8 -1

0 comment

1 changed file

pr created time in 8 days

create branch nelson-liu/codalab-worksheets

branch : workermanager_robust_jsonapi

created branch time in 8 days

push eventnelson-liu/codalab-worksheets

Nelson Liu

commit sha 7aa5da874dc4739707a4079750c5b2288cfb3cb9

Lint

view details

push time in 8 days

Pull request review commentcodalab/codalab-worksheets

Don't try to start bundles on offline / timed-out workers

 def _schedule_run_bundles_on_workers(self, workers, staged_bundles_to_run, user_         # To avoid the potential race condition between bundle manager's dispatch frequency and         # worker's checkin frequency, update the column "exit_after_num_runs" in worker table         # before bundle manager's next scheduling loop-        for worker in workers_list:+        for worker in workers.workers():

@candicegjing this is an unrelated fix, but I noticed that workers_list is essentially just the workers corresponding to the last bundle, not all workers (as I think you intended it). Maybe you can check and verify this makes sense?

nelson-liu

comment created time in 8 days

push eventnelson-liu/codalab-worksheets

Nelson Liu

commit sha fea532a5f005f08348420e75e6834b202ebefa79

Accumluate offline workers to respect bundle priority

view details

push time in 8 days

push eventnelson-liu/codalab-worksheets

Tony Lee

commit sha ce7dd073cde1a08709de7121eeae3496bb6460af

fix heartbeat (#2516)

view details

Ashwin Ramaswami

commit sha ba9ed2f032f4627b14915454daa3ef6d161c4408

Fix some frontend build warnings (#2500) * Fix some frontend build warnings * Make sure eslint errors are caught by CI Co-authored-by: yipenghe <yipenghe@stanford.edu>

view details

Jing Ge

commit sha 710992c7123ad1b5105b6d9cdb2d219e07a833d1

Set default value of exit_after_num_runs to maxint in rest API call (#2511)

view details

Ashwin Ramaswami

commit sha 96e7d9f6667c502a1f741bed9fff18d123c18230

Remove node-gyp dependency (#2497) * Remove node-gyp dependency * Add react-bootstrap Co-authored-by: yipenghe <yipenghe@stanford.edu>

view details

Ashwin Ramaswami

commit sha a7b4caf624cef45ecff536de40763704ef58736c

Remove unused dependencies, jquery-ui-bundle (#2498) * Remove node-gyp dependency * Add react-bootstrap * Remove unused dependencies, jquery-ui-bundle * Remove jquery ui and ace js * Remove flow Co-authored-by: yipenghe <yipenghe@stanford.edu>

view details

Nelson Liu

commit sha 526027b492c9f7fafad4f8df4cd133ce3460ab94

Remove SlurmWorkerManager jobs when they timeout (#2521)

view details

push time in 8 days

pull request commentcodalab/codalab-worksheets

Don't try to start bundles on offline / timed-out workers

Sure, after 25 seconds, the worker will be refreshed, and bundle manager won't stall anymore. But during these 25 seconds, there are bundles that will be dispatched to the dead workers, which mess up the order of execution.

Ah, I see your point. This is why, in this PR, I make sure to explicitly remove only workers that are dead. That is, if a bundle is dispatched to a dead worker and fails, there are two options:

(1) There is another worker available to run it. In this case, the bundlemanager will try dispatching the bundle onto this worker.
(2) There is not another worker available to run it, so it's fine to skip it anyway.

Although we're refreshing the workers, the number of resources is monotonically decreasing in general. There is a rare case where the filter wouldn't necessarily work: a worker goes offline but then comes back online with the same ID. I can edit the PR to fix that and make the worker list cumulative.

Not sure if any of the above made sense, thoughts @candicegjing ?
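One way to read "cumulative" is sketched below, under stated assumptions: workers are plain dicts with hypothetical `id` and `online` keys, and `eligible_workers` is an illustrative helper, not a function from the codebase. Once an ID has been seen offline, it stays filtered out even if a later refresh reports it as online again.

```python
def eligible_workers(workers, offline_worker_ids):
    """Return IDs of workers that are online and were never seen offline.

    offline_worker_ids is a set that is mutated in place, so it accumulates
    across calls (for example, across bundles in one scheduling pass).
    """
    eligible = []
    for worker in workers:
        if worker["id"] in offline_worker_ids:
            # Seen offline earlier: keep skipping it, even if it reappears.
            continue
        if not worker["online"]:
            offline_worker_ids.add(worker["id"])
            continue
        eligible.append(worker["id"])
    return eligible


offline_ids = set()
first = eligible_workers(
    [{"id": "w1", "online": True}, {"id": "w2", "online": False}], offline_ids
)
# w2 comes back with the same ID, but the accumulated set still excludes it.
second = eligible_workers(
    [{"id": "w1", "online": True}, {"id": "w2", "online": True}], offline_ids
)
```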

nelson-liu

comment created time in 8 days

pull request commentcodalab/codalab-worksheets

Don't try to start bundles on offline / timed-out workers

Suppose the workers were refreshed 30 seconds ago. When we call the function cleanup_dead_workers(), we still use the outdated worker information. If any worker gets disconnected from the server during this 30-second interval, we won't be able to remove it from the workers object. Hence, we cannot avoid dispatching bundles to the dead worker.

Let me know if I'm misunderstanding: yeah, we can't avoid dispatching bundles to dead workers wholesale. But, if it was refreshed 30 seconds earlier, it'd be re-refreshed 25 seconds later. At this point, we'd fetch from the database, clean up the dead workers, and then the bundles would be speedily dispatched to workers (or skipped, since there are no workers). So, the bundlemanager wouldn't stall any longer.

nelson-liu

comment created time in 8 days

delete branch nelson-liu/codalab-worksheets

delete branch : timeout_state

delete time in 8 days

push eventnelson-liu/codalab-worksheets

Tony Lee

commit sha ce7dd073cde1a08709de7121eeae3496bb6460af

fix heartbeat (#2516)

view details

Ashwin Ramaswami

commit sha ba9ed2f032f4627b14915454daa3ef6d161c4408

Fix some frontend build warnings (#2500) * Fix some frontend build warnings * Make sure eslint errors are caught by CI Co-authored-by: yipenghe <yipenghe@stanford.edu>

view details

Nelson Liu

commit sha 0071dd08d0999aee080da593a19fb16f31be0926

Merge branch 'master' into timeout_state

view details

push time in 9 days

PR opened codalab/codalab-worksheets

Remove SlurmWorkerManager jobs when they timeout

When a batch job reaches its allotted time (e.g., 10 days is our default), the slurm job terminates and is given the state TIMEOUT. This PR edits the code such that these TIMEOUT jobs are removed from the SlurmWorkerManager's internal state.
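A rough sketch of the idea, assuming the manager tracks job IDs in a set and can look up each job's Slurm state. `FINISHED_STATES` and `prune_finished_jobs` are hypothetical names, and which states the real code already handled is not shown in the PR text; `TIMEOUT` is the one this PR adds.

```python
# Assumed set of Slurm job states that mean the job will never run again.
# TIMEOUT is the state named in the PR description.
FINISHED_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


def prune_finished_jobs(tracked_job_ids, job_states):
    """Return the tracked job IDs whose state is not a finished state.

    job_states maps job ID -> Slurm state string; jobs with no known state
    are kept, since nothing indicates they have ended.
    """
    return {
        job_id
        for job_id in tracked_job_ids
        if job_states.get(job_id) not in FINISHED_STATES
    }


remaining = prune_finished_jobs(
    {"101", "102", "103"},
    {"101": "RUNNING", "102": "TIMEOUT", "103": "PENDING"},
)
```

Without `TIMEOUT` in the set, job `102` would stay in the manager's internal state even though it has ended.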

+1 -1

0 comment

1 changed file

pr created time in 9 days

create branch nelson-liu/codalab-worksheets

branch : timeout_state

created branch time in 9 days

push eventnelson-liu/codalab-worksheets

Nelson Liu

commit sha 3dcbdde7be701ddbf679e35f15502f19be20e3bc

Enable customizing the cl-worker executable used in the WorkerManager (#2513)

view details

Nelson Liu

commit sha 1c43a29d02cc5d2e5c676ee0310488eb99ca6d9d

Run nvidia-smi directly on worker host to respect GPU isolation

view details

Nelson Liu

commit sha 39077f126c96553fa890d7b403fb7821767d17ee

Don't try to start bundles on offline / timed-out workers

view details

Nelson Liu

commit sha c5dce99f0867010db726d9cc9d80777b60af7fcc

Add CodalabManagerState classes, with support for json and sqlite3

view details

push time in 11 days

push eventnelson-liu/codalab-worksheets

Nelson Liu

commit sha 3dcbdde7be701ddbf679e35f15502f19be20e3bc

Enable customizing the cl-worker executable used in the WorkerManager (#2513)

view details

Nelson Liu

commit sha 2fffde40dc672f7d1725d3837aaefe5d973f5969

Run nvidia-smi directly on worker host to respect GPU isolation

view details

Nelson Liu

commit sha e2c61ad0a73681a8176c52a6363eb40016b5a317

Don't try to start bundles on offline / timed-out workers

view details

Nelson Liu

commit sha e1fc0817989fe3c7755e19846d20bb93bb97e4d5

Add CodalabManagerState classes, with support for json and sqlite3

view details

push time in 11 days

push eventnelson-liu/codalab-worksheets

Nelson Liu

commit sha 3dcbdde7be701ddbf679e35f15502f19be20e3bc

Enable customizing the cl-worker executable used in the WorkerManager (#2513)

view details

push time in 11 days

delete branch nelson-liu/codalab-worksheets

delete branch : customize_cl_worker_executable

delete time in 11 days

push eventcodalab/codalab-worksheets

Nelson Liu

commit sha 3dcbdde7be701ddbf679e35f15502f19be20e3bc

Enable customizing the cl-worker executable used in the WorkerManager (#2513)

view details

push time in 11 days

PR merged codalab/codalab-worksheets

Enable customizing the cl-worker executable used in the WorkerManager

Sometimes it's not ideal to use the cl-worker that is found first in the $PATH; you'd rather provide an explicit path to an executable.
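A sketch of how such an option could look, assuming an argparse-based CLI. The flag name `--worker-executable` and the helper `build_worker_command` are hypothetical and chosen for illustration; the PR text does not name the actual option.

```python
import argparse


def build_worker_command(argv):
    """Build the base command for launching a worker.

    Defaults to "cl-worker" (resolved via $PATH) unless an explicit
    executable path is given. The flag name is an assumption.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--worker-executable",
        default="cl-worker",
        help="Executable used to launch workers (name on $PATH or explicit path).",
    )
    args = parser.parse_args(argv)
    return [args.worker_executable]


default_cmd = build_worker_command([])
custom_cmd = build_worker_command(["--worker-executable", "/opt/envs/cl/bin/cl-worker"])
```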

+5 -2

0 comment

3 changed files

nelson-liu

pr closed time in 11 days

PR opened codalab/codalab-worksheets

Enable customizing the cl-worker executable used in the WorkerManager

Sometimes it's not ideal to use the cl-worker that is found first in the $PATH; you'd rather provide an explicit path to an executable.

+5 -2

0 comment

3 changed files

pr created time in 11 days

create branch nelson-liu/codalab-worksheets

branch : customize_cl_worker_executable

created branch time in 11 days

push eventnelson-liu/codalab-worksheets

Ashwin Ramaswami

commit sha e6e56d8832c10167d40a1dfe61ba8c3bf1e9b7a7

Create SECURITY.md (#2501) * Create SECURITY.md Fixes #2226 Based off https://github.com/Azure/azure-sdk-for-python/blob/master/SECURITY.md * Update SECURITY.md

view details

Tony Lee

commit sha 4d5983eae9d55ccfce178af9fce724b09c44a43b

expect failure for infinite memory tests (#2507) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Nelson Liu

commit sha ac11edc772c1a7aef0dc12e18a70ddfe149f90b1

Re-create missing docker networks before launching containers (#2468)

view details

Jing Ge

commit sha 37d948d6be5068ebb5d60b54ee544fd90b7330d0

Fix state transition in finalizing worker run state (#2451) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Tony Lee <tonyh.lee@yahoo.com>

view details

Nelson Liu

commit sha 06f8521fbbfb0c7241201e43168f18308efe98a5

Don't limit workermanager search by provided tags (#2506) Co-authored-by: Jing Ge <jingge2@illinois.edu>

view details

Nelson Liu

commit sha c5eae4c25b4ae596e525f2af57989a8b815b59d6

Handle simultaneous image deletion in DockerImageManager (#2444)

view details

Jing Ge

commit sha 5076777c5e1fea50fc42e75b9364da66cc8f9d1a

Remove encoding and decoding (#2512) * Remove encoding and decoding * Black format Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>

view details

Nelson Liu

commit sha 0168fb8a3a791827971856868cbc4c7add66635d

Run nvidia-smi directly on worker host to respect GPU isolation

view details

Nelson Liu

commit sha 967a3594b9fe7a151ef585d34f1cbebdbec32707

Don't try to start bundles on offline / timed-out workers

view details

Nelson Liu

commit sha 44d2cd59c4ad3326d92ae9472334c3b7c7ad2f5e

Add CodalabManagerState classes, with support for json and sqlite3

view details

push time in 11 days

push eventnelson-liu/codalab-worksheets

Ashwin Ramaswami

commit sha e6e56d8832c10167d40a1dfe61ba8c3bf1e9b7a7

Create SECURITY.md (#2501) * Create SECURITY.md Fixes #2226 Based off https://github.com/Azure/azure-sdk-for-python/blob/master/SECURITY.md * Update SECURITY.md

view details

Tony Lee

commit sha 4d5983eae9d55ccfce178af9fce724b09c44a43b

expect failure for infinite memory tests (#2507) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Nelson Liu

commit sha ac11edc772c1a7aef0dc12e18a70ddfe149f90b1

Re-create missing docker networks before launching containers (#2468)

view details

Jing Ge

commit sha 37d948d6be5068ebb5d60b54ee544fd90b7330d0

Fix state transition in finalizing worker run state (#2451) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Tony Lee <tonyh.lee@yahoo.com>

view details

Nelson Liu

commit sha 06f8521fbbfb0c7241201e43168f18308efe98a5

Don't limit workermanager search by provided tags (#2506) Co-authored-by: Jing Ge <jingge2@illinois.edu>

view details

Nelson Liu

commit sha c5eae4c25b4ae596e525f2af57989a8b815b59d6

Handle simultaneous image deletion in DockerImageManager (#2444)

view details

Jing Ge

commit sha 5076777c5e1fea50fc42e75b9364da66cc8f9d1a

Remove encoding and decoding (#2512) * Remove encoding and decoding * Black format Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>

view details

Nelson Liu

commit sha c565c6f1c96631e1bee4b0c79bd1a4fe3220b990

Run nvidia-smi directly on worker host to respect GPU isolation

view details

Nelson Liu

commit sha 9717af83f8a73a4826e63478d17fcfd7f7e8e553

Don't try to start bundles on offline / timed-out workers

view details

Nelson Liu

commit sha 8607dc03b46f794a9a36964c999f7a1091580b62

Add CodalabManagerState classes, with support for json and sqlite3

view details

push time in 11 days

push eventnelson-liu/codalab-worksheets

Ashwin Ramaswami

commit sha e6e56d8832c10167d40a1dfe61ba8c3bf1e9b7a7

Create SECURITY.md (#2501) * Create SECURITY.md Fixes #2226 Based off https://github.com/Azure/azure-sdk-for-python/blob/master/SECURITY.md * Update SECURITY.md

view details

Tony Lee

commit sha 4d5983eae9d55ccfce178af9fce724b09c44a43b

expect failure for infinite memory tests (#2507) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Nelson Liu

commit sha ac11edc772c1a7aef0dc12e18a70ddfe149f90b1

Re-create missing docker networks before launching containers (#2468)

view details

Jing Ge

commit sha 37d948d6be5068ebb5d60b54ee544fd90b7330d0

Fix state transition in finalizing worker run state (#2451) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Tony Lee <tonyh.lee@yahoo.com>

view details

Nelson Liu

commit sha 06f8521fbbfb0c7241201e43168f18308efe98a5

Don't limit workermanager search by provided tags (#2506) Co-authored-by: Jing Ge <jingge2@illinois.edu>

view details

Nelson Liu

commit sha c5eae4c25b4ae596e525f2af57989a8b815b59d6

Handle simultaneous image deletion in DockerImageManager (#2444)

view details

Jing Ge

commit sha 5076777c5e1fea50fc42e75b9364da66cc8f9d1a

Remove encoding and decoding (#2512) * Remove encoding and decoding * Black format Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>

view details

push time in 11 days

delete branch nelson-liu/codalab-worksheets

delete branch : fix_docker_cache_race_condition

delete time in 12 days

push eventcodalab/codalab-worksheets

Nelson Liu

commit sha c5eae4c25b4ae596e525f2af57989a8b815b59d6

Handle simultaneous image deletion in DockerImageManager (#2444)

view details

push time in 12 days

PR merged codalab/codalab-worksheets

Handle simultaneous image deletion in DockerImageManager

Fixes #2442

Sometimes, Docker image downloads fail with an error message saying that the image was downloaded successfully but cannot be found locally due to an unhandled client error:

81348:2020-06-16 18:21:30,684 http://localhost:None "DELETE /v1.35/images/codalab-image-cache/last-used:1592356889.1950865?force=False&noprune=False HTTP/1.1" 404 78
81349:2020-06-16 18:21:30,685 Failed to download Docker image: Image codalab/default-cpu:latest was downloaded successfully, but it cannot be found locally due to unhandled error 404 Client Error: Not Found ("No such image: codalab-image-cache/last-used:1592356889.1950865")

If you look at the preceding line, docker is trying to DELETE an image, but it fails because it can't find it (404). Investigating the logs of other workers, I found this line in the logs of a different worker:

564853:2020-06-16 18:21:30,684 http://localhost:None "DELETE /v1.35/images/codalab-image-cache/last-used:1592356889.1950865?force=False&noprune=False HTTP/1.1" 200 66

This indicates that this particular different worker was able to delete the image, before the first worker got to it (the timestamp is the exact same, probably because the workers started at the same time on the same node, and finished the download at the same time). Thus, the first worker wasn't able to delete the image, and threw a 404 on deletion.

This PR simply circumvents the issue by catching a 404 on deletion.
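The race and the fix can be sketched without a Docker daemon. In the snippet below, `ImageNotFound` is a local stand-in for `docker.errors.ImageNotFound`, and `FakeImages` and `remove_stale_tag` are hypothetical helpers; the real change wraps `self._docker.images.remove(tag)` in the DockerImageManager.

```python
class ImageNotFound(Exception):
    """Local stand-in for docker.errors.ImageNotFound (so no daemon is needed)."""


class FakeImages:
    """Tiny fake of an image store shared by two workers on one node."""

    def __init__(self, tags):
        self._tags = set(tags)

    def remove(self, tag):
        if tag not in self._tags:
            raise ImageNotFound(f"No such image: {tag}")
        self._tags.remove(tag)


def remove_stale_tag(images, tag):
    """Remove a stale cache tag; return False if another worker removed it first."""
    try:
        images.remove(tag)
        return True
    except ImageNotFound:
        # Another worker deleted the same tag already; nothing left to do.
        return False


shared = FakeImages(["codalab-image-cache/last-used:1592356889.1950865"])
first_worker = remove_stale_tag(shared, "codalab-image-cache/last-used:1592356889.1950865")
second_worker = remove_stale_tag(shared, "codalab-image-cache/last-used:1592356889.1950865")
```

The second call corresponds to the worker that previously failed with the 404; with the exception caught, it simply moves on.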

+12 -1

1 comment

1 changed file

nelson-liu

pr closed time in 12 days

issue closedcodalab/codalab-worksheets

Runs fail because docker image (in codalab cache) cannot be found

Describe the bug

Sometimes, runs fail with error messages like:

Failed to download Docker image: Image nfliu/architecture-generalization-reading-comprehension-albert:latest was downloaded successfully, but it cannot be found locally due to unhandled error 404 Client Error: Not Found ("No such image: codalab-image-cache/last-used:1592516003.596136")

Looking at some worker logs, I see the following happening:

50607-2020-06-06 15:38:32,817 http://localhost:None "DELETE /v1.35/images/codalab-image-cache/last-used:1591433958.0564198?force=False&noprune=False HTTP/1.1" 404 78
50608:2020-06-06 15:38:32,818 Failed to download Docker image: Image nfliu/architecture-generalization-reading-comprehension-fusionnet:latest was downloaded successfully, but it cannot be found locally due to unhandled error 404 Client Error: Not Found ("No such image: codalab-image-cache/last-used:1591433958.0564198")

I haven't dug into the code, but it seems like images are getting pruned right before they have to get used, so when the worker tries to start a container on them, it can't.

closed time in 12 days

nelson-liu

pull request commentcodalab/codalab-worksheets

Backward compatibility

defaulting the fields with the value None for now, although it might not be the best default value to use. So please feel free to suggest.

Would it work to use the same default as https://github.com/codalab/codalab-worksheets/blob/ac11edc772c1a7aef0dc12e18a70ddfe149f90b1/codalab/worker/worker.py#L536-L553 ?

candicegjing

comment created time in 12 days

push eventnelson-liu/codalab-worksheets

Ashwin Ramaswami

commit sha e6e56d8832c10167d40a1dfe61ba8c3bf1e9b7a7

Create SECURITY.md (#2501) * Create SECURITY.md Fixes #2226 Based off https://github.com/Azure/azure-sdk-for-python/blob/master/SECURITY.md * Update SECURITY.md

view details

Tony Lee

commit sha 4d5983eae9d55ccfce178af9fce724b09c44a43b

expect failure for infinite memory tests (#2507) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Nelson Liu

commit sha ac11edc772c1a7aef0dc12e18a70ddfe149f90b1

Re-create missing docker networks before launching containers (#2468)

view details

Jing Ge

commit sha 37d948d6be5068ebb5d60b54ee544fd90b7330d0

Fix state transition in finalizing worker run state (#2451) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local> Co-authored-by: Tony Lee <tonyh.lee@yahoo.com>

view details

Nelson Liu

commit sha 06f8521fbbfb0c7241201e43168f18308efe98a5

Don't limit workermanager search by provided tags (#2506) Co-authored-by: Jing Ge <jingge2@illinois.edu>

view details

Nelson Liu

commit sha 1f287009c4c5d3c3e55d8541238fd73fba32feba

Merge branch 'master' into fix_docker_cache_race_condition

view details

push time in 12 days

Pull request review commentcodalab/codalab-worksheets

Handle simultaneous image deletion in DockerImageManager

 def image_availability_state(image_spec, success_message, failure_message):
                     tag_label, timestamp = tag.split(":")
                     # remove any other timestamp but not the current one
                     if tag_label == self.CACHE_TAG and timestamp != new_timestamp:
-                        self._docker.images.remove(tag)
+                        try:
+                            self._docker.images.remove(tag)
+                        except docker.errors.ImageNotFound as err:
+                            # It's possible that we get a 404 not found error here when removing the image,

Here's where I got it from:

(cl_dev) nfliu at mazama in ~/git/codalab-worksheets on master
$ pip list | grep docker
docker              3.7.0
docker-pycreds      0.4.0
(cl_dev) nfliu at mazama in ~/git/codalab-worksheets on master
$ ipython
Python 3.6.10 |Anaconda, Inc.| (default, Jan  7 2020, 15:01:53)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.12.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import docker

In [2]: docker = docker.from_env()

In [3]: docker.images.remove("animagethatdoesntexist")
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
~/miniconda3/envs/cl_dev/lib/python3.6/site-packages/docker/api/client.py in _raise_for_status(self, response)
    255         try:
--> 256             response.raise_for_status()
    257         except requests.exceptions.HTTPError as e:

~/miniconda3/envs/cl_dev/lib/python3.6/site-packages/requests/models.py in raise_for_status(self)
    940         if http_error_msg:
--> 941             raise HTTPError(http_error_msg, response=self)
    942

HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.35/images/animagethatdoesntexist?force=False&noprune=False

During handling of the above exception, another exception occurred:

ImageNotFound                             Traceback (most recent call last)
<ipython-input-3-9ab3846e26db> in <module>
----> 1 docker.images.remove("animagethatdoesntexist")

~/miniconda3/envs/cl_dev/lib/python3.6/site-packages/docker/models/images.py in remove(self, *args, **kwargs)
    457
    458     def remove(self, *args, **kwargs):
--> 459         self.client.api.remove_image(*args, **kwargs)
    460     remove.__doc__ = APIClient.remove_image.__doc__
    461

~/miniconda3/envs/cl_dev/lib/python3.6/site-packages/docker/utils/decorators.py in wrapped(self, resource_id, *args, **kwargs)
     17                     'Resource ID was not provided'
     18                 )
---> 19             return f(self, resource_id, *args, **kwargs)
     20         return wrapped
     21     return decorator

~/miniconda3/envs/cl_dev/lib/python3.6/site-packages/docker/api/image.py in remove_image(self, image, force, noprune)
    479         params = {'force': force, 'noprune': noprune}
    480         res = self._delete(self._url("/images/{0}", image), params=params)
--> 481         return self._result(res, True)
    482
    483     def search(self, term):

~/miniconda3/envs/cl_dev/lib/python3.6/site-packages/docker/api/client.py in _result(self, response, json, binary)
    260     def _result(self, response, json=False, binary=False):
    261         assert not (json and binary)
--> 262         self._raise_for_status(response)
    263
    264         if json:

~/miniconda3/envs/cl_dev/lib/python3.6/site-packages/docker/api/client.py in _raise_for_status(self, response)
    256             response.raise_for_status()
    257         except requests.exceptions.HTTPError as e:
--> 258             raise create_api_error_from_http_exception(e)
    259
    260     def _result(self, response, json=False, binary=False):

~/miniconda3/envs/cl_dev/lib/python3.6/site-packages/docker/errors.py in create_api_error_from_http_exception(e)
     29         else:
     30             cls = NotFound
---> 31     raise cls(e, response=response, explanation=explanation)
     32
     33

ImageNotFound: 404 Client Error: Not Found ("No such image: animagethatdoesntexist:latest")

It's possible that other exceptions could lead to this...but we can add those exceptions later if need be?
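
For reference, the pattern in the diff above can be sketched on its own, outside the worker. This is only an illustration: `remove_stale_cache_tags` is a hypothetical helper, the `ImageNotFound` class is a stand-in so the snippet runs without Docker installed, and `images` is assumed to be anything with a `remove(tag)` method (such as `client.images` from the Docker SDK).

```python
class ImageNotFound(Exception):
    """Stand-in for docker.errors.ImageNotFound so the sketch runs without Docker."""


def remove_stale_cache_tags(images, tags, cache_tag, new_timestamp,
                            not_found=ImageNotFound):
    """Remove every `cache_tag:<timestamp>` tag except the current one.

    Tags that are already gone are recorded and skipped instead of
    aborting the whole cleanup. Returns (removed, missing).
    """
    removed, missing = [], []
    for tag in tags:
        # Split on the last colon so a label that itself contains ":" still parses.
        tag_label, _, timestamp = tag.rpartition(":")
        if tag_label == cache_tag and timestamp != new_timestamp:
            try:
                images.remove(tag)
                removed.append(tag)
            except not_found:
                # The image disappeared between listing and removal.
                missing.append(tag)
    return removed, missing
```

With the real SDK one would presumably pass `not_found=docker.errors.ImageNotFound`, which is the exception class shown in the traceback above. Any other exception still propagates, matching the "add those later if need be" approach.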

nelson-liu

comment created time in 12 days

pull request comment codalab/codalab-worksheets

Set default value of exit_after_num_runs to maxint in rest API call

+1 to making all 3 consistent

candicegjing

comment created time in 12 days

delete branch nelson-liu/codalab-worksheets

delete branch : worker_robust_docker_network

delete time in 13 days

PR merged codalab/codalab-worksheets

Re-create missing docker networks before launching containers

Fixes #2439

Docker networks are getting weirdly removed from hosts. These are usually networks that haven't been used in a while, so maybe something on the host is auto-pruning them? This PR makes the worker more robust to such changes by re-creating the docker networks (if they don't exist) before launching containers.

+23 -8

3 comments

1 changed file

nelson-liu

pr closed time in 13 days

issue closed codalab/codalab-worksheets

Worker is not robust to docker network removal

Describe the bug

It seems like sometimes, the worker docker networks get removed. On an already-existing worker that is trying to run another job, I got the following error:

2020-06-18 16:48:18,019 Cannot start Docker container: Unable to start Docker container: 404 Client Error: Not Found ("network cl_worker_nfliu-codalab-slurm-worker-main-standard-4dd7cd2c_network_int not found")
2020-06-18 16:48:18,034 Traceback (most recent call last):
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/api/client.py", line 256, in _raise_for_status
    response.raise_for_status()
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.35/containers/c5e44547ac0562ec3c62614b6d9600d44321a74c7e408a164aa31781b8b75395/start

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/docker_utils.py", line 51, in wrapper
    return f(*args, **kwargs)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/docker_utils.py", line 186, in start_bundle_container
    stdin_open=tty,
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/models/containers.py", line 791, in run
    container.start()
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/models/containers.py", line 392, in start
    return self.client.api.start(self.id, **kwargs)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/api/container.py", line 1091, in start
    self._raise_for_status(res)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/api/client.py", line 258, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.NotFound: 404 Client Error: Not Found ("network cl_worker_nfliu-codalab-slurm-worker-main-standard-4dd7cd2c_network_int not found")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/worker_run_state.py", line 339, in _transition_from_PREPARING
    runtime=self.docker_runtime,
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/docker_utils.py", line 55, in wrapper
    check_for_user_error(e)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/docker_utils.py", line 48, in check_for_user_error
    raise DockerException(error_message)
codalab.worker.docker_utils.DockerException: Unable to start Docker container: 404 Client Error: Not Found ("network cl_worker_nfliu-codalab-slurm-worker-main-standard-4dd7cd2c_network_int not found")

Traceback (most recent call last):
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/api/client.py", line 256, in _raise_for_status
    response.raise_for_status()
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.35/containers/c5e44547ac0562ec3c62614b6d9600d44321a74c7e408a164aa31781b8b75395/start

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/docker_utils.py", line 51, in wrapper
    return f(*args, **kwargs)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/docker_utils.py", line 186, in start_bundle_container
    stdin_open=tty,
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/models/containers.py", line 791, in run
    container.start()
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/models/containers.py", line 392, in start
    return self.client.api.start(self.id, **kwargs)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/api/container.py", line 1091, in start
    self._raise_for_status(res)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/api/client.py", line 258, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.NotFound: 404 Client Error: Not Found ("network cl_worker_nfliu-codalab-slurm-worker-main-standard-4dd7cd2c_network_int not found")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/worker.py", line 199, in start
    self.process_runs()
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/worker.py", line 365, in process_runs
    self.runs[uuid] = self.run_state_manager.transition(run_state)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/fsm.py", line 35, in transition
    return self._transition_functions[state.stage](state)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/worker_run_state.py", line 339, in _transition_from_PREPARING
    runtime=self.docker_runtime,
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/docker_utils.py", line 55, in wrapper
    check_for_user_error(e)
  File "/juice/scr/nfliu/miniconda3/envs/cl_nfliu_main/lib/python3.6/site-packages/codalab/worker/docker_utils.py", line 48, in check_for_user_error
    raise DockerException(error_message)
codalab.worker.docker_utils.DockerException: Unable to start Docker container: 404 Client Error: Not Found ("network cl_worker_nfliu-codalab-slurm-worker-main-standard-4dd7cd2c_network_int not found")
2020-06-18 16:48:18,038 Sleeping for 1 hour due to exception...please help me!

Logging on to the machine, it appears that the network is indeed not there. Looking at the list, each docker network prefix should come with a triple of networks, with the suffixes ext, general, and int. However, some of these triples are missing the int network.

$ docker network ls
NETWORK ID          NAME                                                                          DRIVER              SCOPE
d3b08a9ce680        bridge                                                                        bridge              local
58379732b6b9        cl_worker_nfliu-codalab-slurm-worker-main-standard-4dd7cd2c_network_ext       bridge              local
a7fe7b588c3b        cl_worker_nfliu-codalab-slurm-worker-main-standard-4dd7cd2c_network_general   bridge              local
6e4cdbb88be5        cl_worker_nfliu-codalab-slurm-worker-main-standard-be8ee085_network_ext       bridge              local
7931e9798632        cl_worker_nfliu-codalab-slurm-worker-main-standard-be8ee085_network_general   bridge              local
48f7179e258a        cl_worker_nfliu-codalab-slurm-worker-main-standard-be8ee085_network_int       bridge              local
e3be37f75563        cl_worker_nfliu-codalab-slurm-worker-stanford-low-21f7d0d3_network_ext        bridge              local
1cfda3e85361        cl_worker_nfliu-codalab-slurm-worker-stanford-low-21f7d0d3_network_general    bridge              local
e833baff02f1        cl_worker_nfliu-codalab-slurm-worker-stanford-low-21f7d0d3_network_int        bridge              local
d0cc4f087587        cl_worker_nfliu-codalab-slurm-worker-stanford-low-283b6698_network_ext        bridge              local
2b93ef5bbd35        cl_worker_nfliu-codalab-slurm-worker-stanford-low-283b6698_network_general    bridge              local
f3f52a9ca04d        cl_worker_nfliu-codalab-slurm-worker-stanford-low-06378f73_network_ext        bridge              local
2513f5792155        cl_worker_nfliu-codalab-slurm-worker-stanford-low-06378f73_network_general    bridge              local
1d48c87c0537        cl_worker_nfliu-codalab-slurm-worker-stanford-low-62773d58_network_ext        bridge              local
7ef521854022        cl_worker_nfliu-codalab-slurm-worker-stanford-low-62773d58_network_general    bridge              local
19b30eff830b        cl_worker_nfliu-codalab-slurm-worker-stanford-low-b7c9bc22_network_ext        bridge              local
9c4614a07ee2        cl_worker_nfliu-codalab-slurm-worker-stanford-low-b7c9bc22_network_general    bridge              local
d4c17fc1632a        cl_worker_nfliu-codalab-slurm-worker-stanford-low-b7c9bc22_network_int        bridge              local
e6c87b4b8737        cl_worker_nfliu-codalab-slurm-worker-stanford-low-f3a582a4_network_ext        bridge              local
ae03195fb837        cl_worker_nfliu-codalab-slurm-worker-stanford-low-f3a582a4_network_general    bridge              local
283a6cf62094        cl_worker_nfliu-codalab-slurm-worker-stanford-low-f3a582a4_network_int        bridge              local
574ec1166bbe        host                                                                          host                local
b24a9a7bd93e        none                                                                          null                local
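
A quick way to spot the incomplete triples in a listing like the one above is to group the names by prefix. The following is a rough sketch, not part of CodaLab: `missing_network_suffixes` is a hypothetical helper, and the `_network_` marker and the three suffixes are taken from the names in the listing.

```python
from collections import defaultdict

EXPECTED_SUFFIXES = ("ext", "general", "int")


def missing_network_suffixes(network_names, marker="_network_"):
    """Map each worker network prefix to the expected suffixes it lacks."""
    seen = defaultdict(set)
    for name in network_names:
        prefix, sep, suffix = name.rpartition(marker)
        # Names without the marker (bridge, host, none) are not worker networks.
        if sep and suffix in EXPECTED_SUFFIXES:
            seen[prefix].add(suffix)
    return {
        prefix: [s for s in EXPECTED_SUFFIXES if s not in suffixes]
        for prefix, suffixes in seen.items()
        if len(suffixes) < len(EXPECTED_SUFFIXES)
    }


# A few names copied from the listing above.
names = [
    "bridge",
    "cl_worker_nfliu-codalab-slurm-worker-main-standard-4dd7cd2c_network_ext",
    "cl_worker_nfliu-codalab-slurm-worker-main-standard-4dd7cd2c_network_general",
    "cl_worker_nfliu-codalab-slurm-worker-main-standard-be8ee085_network_ext",
    "cl_worker_nfliu-codalab-slurm-worker-main-standard-be8ee085_network_general",
    "cl_worker_nfliu-codalab-slurm-worker-main-standard-be8ee085_network_int",
]
print(missing_network_suffixes(names))
```

On this sample only the 4dd7cd2c prefix is reported, with int as the missing suffix.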

So, one issue is to figure out where the int network is going. Another issue is to make the worker robust to these changes in docker networks. Currently, the worker creates docker networks (or reuses existing ones) when it initially launches, but maybe it's worth doing it before each container the worker starts?
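
The second idea (check the networks before each container start) could look roughly like the sketch below. It is a minimal illustration under assumptions, not the worker's actual code: `ensure_networks` is a hypothetical name, `networks` is assumed to behave like `client.networks` in the Docker SDK for Python (`list(names=[...])` and `create(name, internal=...)`), and marking only the int network as internal is a guess.

```python
def ensure_networks(networks, prefix, log=print):
    """Create any of the worker's three networks that no longer exist.

    Returns the names that had to be re-created, so the caller can log
    or count them.
    """
    created = []
    # Assumption: only the "int" network is created as internal.
    for suffix, internal in (("general", False), ("ext", False), ("int", True)):
        name = f"{prefix}_network_{suffix}"
        # Docker's name filter matches partial names, so compare exactly.
        if any(net.name == name for net in networks.list(names=[name])):
            continue
        networks.create(name, internal=internal)
        log(f"Re-created missing docker network {name}")
        created.append(name)
    return created
```

Calling this right before each container launch costs one list request per network, in exchange for not depending on the networks surviving since worker startup.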

closed time in 13 days

nelson-liu

push event nelson-liu/codalab-worksheets

Nelson Liu

commit sha f2195995bfe9599023e0c55427f9d4485b3e602f

Lint

view details

push time in 13 days

pull request comment codalab/codalab-worksheets

Re-create missing docker networks before launching containers

@epicfaace mind taking another look?

nelson-liu

comment created time in 13 days

push event nelson-liu/codalab-worksheets

Ashwin Ramaswami

commit sha e6e56d8832c10167d40a1dfe61ba8c3bf1e9b7a7

Create SECURITY.md (#2501) * Create SECURITY.md Fixes #2226 Based off https://github.com/Azure/azure-sdk-for-python/blob/master/SECURITY.md * Update SECURITY.md

view details

Tony Lee

commit sha 4d5983eae9d55ccfce178af9fce724b09c44a43b

expect failure for infinite memory tests (#2507) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Nelson Liu

commit sha ee61d2a36ca96747e33e9d5c9499d7dc4a69852d

Merge branch 'master' into worker_robust_docker_network

view details

push time in 13 days

push event nelson-liu/codalab-worksheets

Nelson Liu

commit sha 6a066ae521589012f68b37702b5d63d1007c879b

Add log text messages for when a docker network is actually created

view details

push time in 13 days

pull request comment codalab/codalab-worksheets

Re-create missing docker networks before launching containers

yeah, I want to add a verbosity argument—I'll get around to it later today.

nelson-liu

comment created time in 13 days

pull request comment codalab/codalab-worksheets

Re-create missing docker networks before launching containers

I want to take another look at this: it adds a lot of unnecessary logging to the worker when the --verbose flag is set. It's a bunch of lines that look like:

2020-06-25 00:48:31,114 Creating docker network cl_worker_nfliu-codalab-slurm-worker-stanford-low-4db9b593_network_general
2020-06-25 00:48:31,120 http://localhost:None "POST /v1.35/networks/create HTTP/1.1" 409 122
2020-06-25 00:48:31,121 Network cl_worker_nfliu-codalab-slurm-worker-stanford-low-4db9b593_network_general already exists, reusing
2020-06-25 00:48:31,126 http://localhost:None "GET /v1.35/networks?filters=%7B%22name%22%3A+%5B%22cl_worker_nfliu-codalab-slurm-worker-stanford-low-4db9b593_network_general%22%5D%7D HTTP/1.1" 200 510
2020-06-25 00:48:31,127 Creating docker network cl_worker_nfliu-codalab-slurm-worker-stanford-low-4db9b593_network_ext
2020-06-25 00:48:31,130 http://localhost:None "POST /v1.35/networks/create HTTP/1.1" 409 118
2020-06-25 00:48:31,131 Network cl_worker_nfliu-codalab-slurm-worker-stanford-low-4db9b593_network_ext already exists, reusing
2020-06-25 00:48:31,135 http://localhost:None "GET /v1.35/networks?filters=%7B%22name%22%3A+%5B%22cl_worker_nfliu-codalab-slurm-worker-stanford-low-4db9b593_network_ext%22%5D%7D HTTP/1.1" 200 507
2020-06-25 00:48:31,136 Creating docker network cl_worker_nfliu-codalab-slurm-worker-stanford-low-4db9b593_network_int
2020-06-25 00:48:31,139 http://localhost:None "POST /v1.35/networks/create HTTP/1.1" 409 118
2020-06-25 00:48:31,139 Network cl_worker_nfliu-codalab-slurm-worker-stanford-low-4db9b593_network_int already exists, reusing
2020-06-25 00:48:31,143 http://localhost:None "GET /v1.35/networks?filters=%7B%22name%22%3A+%5B%22cl_worker_nfliu-codalab-slurm-worker-stanford-low-4db9b593_network_int%22%5D%7D HTTP/1.1" 200 499

Since it logs every time init_docker_networks is called.
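
One way to cut this noise down, sketched under assumptions: log a message only when a network is actually created, and mention reuse only on request. `report_network` and the `worker_sketch` logger name are hypothetical, and the real signature of init_docker_networks is not shown here. The per-request HTTP lines look like urllib3 connection-pool debug output, so raising that logger's level is assumed to silence them.

```python
import logging

logger = logging.getLogger("worker_sketch")  # hypothetical logger name


def report_network(name, created, verbose=False):
    """Always log real creations; log reuse only when `verbose` is set."""
    if created:
        logger.info("Created docker network %s", name)
    elif verbose:
        logger.info("Network %s already exists, reusing", name)


def quiet_http_logging():
    """Assumption: the per-request lines come from urllib3's DEBUG logging."""
    logging.getLogger("urllib3").setLevel(logging.WARNING)
```

With this split, a healthy worker that re-checks its networks before every container stays silent, and a line only shows up when a network really had to be created.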

nelson-liu

comment created time in 13 days

push event nelson-liu/codalab-worksheets

yipenghe

commit sha 105c50f9aa9d9c0afe99ce1c66f28002d444fff0

Clear force delete bit after deletion (#2413) * clear force delete bit * clear force delete bit fails

view details

Tony Lee

commit sha 689e416c30f7bd30f308b54bdd8f5c3bac6eab12

bump to v0.5.15 (#2416)

view details

Tony Lee

commit sha 5097dcebad2f67ee9c5041e48ec5ac28a427fc2c

2417: Fix mkdocs Travis failure (#2420) * debug * reenable test

view details

Nelson Liu

commit sha 14dd5b64819da0b8cb54e2c48a561a0136321f6d

Fix SlurmWorkerManager overlaunching (#2391) Co-authored-by: Jing Ge <jingge2@illinois.edu>

view details

Nelson Liu

commit sha 7426f4ca70a65a3f9e42926ea51cda7197b0d9ee

Fix accessing worker information from WorkerInfoAccessor (#2419)

view details

Nelson Liu

commit sha a916f2604129bacf8f73332962509a50549bcc98

Don't fail WorkerManager if a network exception is encountered (#2422)

view details

yipenghe

commit sha 4751da662087c3ce812bf604c148f09d84e8bb62

Rename actionbar => terminal (#2429) * rename actionbar->terminal * remove unused line * Update frontend/src/components/worksheets/Worksheet/Worksheet.js Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * action bar => terminal in comments Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

yipenghe

commit sha abd8cf12d6b3cdb0a2484c3ac83191ef28af1f3f

Remove select all for table (#2428) * clear force delete bit * remove select-all checkbox * remove selectAll handler

view details

yipenghe

commit sha 2e8a72adb62cd53db544607fa372f0f9e3c106f4

Remove detach from the frontend (#2427) * clear force delete bit * remove detach from frontend * remove constant

view details

Ashwin Ramaswami

commit sha 0038a0f471d3ebc06f1599a3cc0ec7113885dc83

Use GitHub Actions for CI (#2185) for #2094 Based off of Jane's work in https://github.com/candicegjing/codalab-worksheets Time: 40 minutes -> 15-20 minutes Several improvements: Build images split into three parallel steps - rest-server, worker, frontend Tests split into around 10 parallel steps - each step runs about 4 tests, and there is a step which runs the UI tests. If a CLI test gives an error, it archives all Docker containers' logs so that they can be downloaded and inspected. If the UI test gives an error, it archives the screenshot images so that they can be downloaded and inspected. Additionally, the publish to pypi process has changed. Now, publishes to pypi happen on every GitHub release as opposed to every tag push. Effectively, this means that the release workflow has changed a bit: Old workflow: Wait 40 mins for Travis build on master to complete Push a tag to master Wait 40 mins for Travis build on master to complete, which also deploys to pypi New workflow: Wait for GitHub Actions build on master to complete Create a new GitHub release Wait for action to complete, which only deploys to pypi

view details

Ashwin Ramaswami

commit sha d3118bb685091fef005ea69fdddd906ce9c2065b

CI: github.head_ref -> github.ref

view details

Ashwin Ramaswami

commit sha c03c9c34e423ebde811161329229bf888a36d39f

CI: properly populate VERSION variable with branch name

view details

Jing Ge

commit sha 43c77d12d76cf29ca209e2088a276a3d44b0514c

Add exit-after-num-runs=1 to slurm worker manager (#2373) Closed via #2289

view details

yipenghe

commit sha 9ce75b09a05b8698db161afcdd8112c6b6d9aa9b

Fix showing file contents for record item (#2455) fix #2446 to test: create a schema with a field using files: % schema a % add hello /stdout display bundles in record mode % display record a

view details

Ashwin Ramaswami

commit sha b21918ea5afae02de90d33c8a1416103826b81c9

Fix long test startup times by using python:3.6.10-slim-buster for default test runs (#2449) Fixes #2388 by using python:3.6.10-slim-buster for default test runs. Following @nelson-liu 's suggestion in #2388 (comment). This image is still around ~40 MB (as opposed to the 5 GB default-cpu image). Time changes from ~20 mins -> 10 mins

view details

Ashwin Ramaswami

commit sha 433c007744bd6a1b9a157b384bfebb44f4f27855

Fix github actions on forks by not calling --push (#2457) This conditional expression was in the old `.travis.yml`, but it isn't there in the new github actions workflow. This PR adds that expression so that `--push` is not called from forks. Fixes #2456

view details

Jing Ge

commit sha f8dcf57ec59bc5562ee2324e53846edabc5cfbe0

Prettier CLI Reference Doc (#2458) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>

view details

Ashwin Ramaswami

commit sha 50706ab2e76cceddd289cccfb2c7e11a208da8cc

Actually fix builds from forks by building docker images when needed (#2461)

view details

Tony Lee

commit sha e31a55153a855d276efcd7a1e4df91e6ba3246f0

Test worker restart in GHA (#2466) Resolves #2465 Adding back this test that we used to have in Travis Appended .log for to err log files

view details

Nelson Liu

commit sha 697b8f58b79671b2e0c65f9ee5210ec851105a27

Fix worker restart CI on non-branch PRs (#2476) * Fix worker restart tests on non-branch PRs. * Empty commit since frontend tests are flaky

view details

push time in 13 days

push event nelson-liu/codalab-worksheets

Tony Lee

commit sha 689e416c30f7bd30f308b54bdd8f5c3bac6eab12

bump to v0.5.15 (#2416)

view details

Tony Lee

commit sha 5097dcebad2f67ee9c5041e48ec5ac28a427fc2c

2417: Fix mkdocs Travis failure (#2420) * debug * reenable test

view details

Nelson Liu

commit sha 14dd5b64819da0b8cb54e2c48a561a0136321f6d

Fix SlurmWorkerManager overlaunching (#2391) Co-authored-by: Jing Ge <jingge2@illinois.edu>

view details

Nelson Liu

commit sha 7426f4ca70a65a3f9e42926ea51cda7197b0d9ee

Fix accessing worker information from WorkerInfoAccessor (#2419)

view details

Nelson Liu

commit sha a916f2604129bacf8f73332962509a50549bcc98

Don't fail WorkerManager if a network exception is encountered (#2422)

view details

yipenghe

commit sha 4751da662087c3ce812bf604c148f09d84e8bb62

Rename actionbar => terminal (#2429) * rename actionbar->terminal * remove unused line * Update frontend/src/components/worksheets/Worksheet/Worksheet.js Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com> * action bar => terminal in comments Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

yipenghe

commit sha abd8cf12d6b3cdb0a2484c3ac83191ef28af1f3f

Remove select all for table (#2428) * clear force delete bit * remove select-all checkbox * remove selectAll handler

view details

yipenghe

commit sha 2e8a72adb62cd53db544607fa372f0f9e3c106f4

Remove detach from the frontend (#2427) * clear force delete bit * remove detach from frontend * remove constant

view details

Ashwin Ramaswami

commit sha 0038a0f471d3ebc06f1599a3cc0ec7113885dc83

Use GitHub Actions for CI (#2185) for #2094 Based off of Jane's work in https://github.com/candicegjing/codalab-worksheets Time: 40 minutes -> 15-20 minutes Several improvements: Build images split into three parallel steps - rest-server, worker, frontend Tests split into around 10 parallel steps - each step runs about 4 tests, and there is a step which runs the UI tests. If a CLI test gives an error, it archives all Docker containers' logs so that they can be downloaded and inspected. If the UI test gives an error, it archives the screenshot images so that they can be downloaded and inspected. Additionally, the publish to pypi process has changed. Now, publishes to pypi happen on every GitHub release as opposed to every tag push. Effectively, this means that the release workflow has changed a bit: Old workflow: Wait 40 mins for Travis build on master to complete Push a tag to master Wait 40 mins for Travis build on master to complete, which also deploys to pypi New workflow: Wait for GitHub Actions build on master to complete Create a new GitHub release Wait for action to complete, which only deploys to pypi

view details

Ashwin Ramaswami

commit sha d3118bb685091fef005ea69fdddd906ce9c2065b

CI: github.head_ref -> github.ref

view details

Ashwin Ramaswami

commit sha c03c9c34e423ebde811161329229bf888a36d39f

CI: properly populate VERSION variable with branch name

view details

Jing Ge

commit sha 43c77d12d76cf29ca209e2088a276a3d44b0514c

Add exit-after-num-runs=1 to slurm worker manager (#2373) Closed via #2289

view details

yipenghe

commit sha 9ce75b09a05b8698db161afcdd8112c6b6d9aa9b

Fix showing file contents for record item (#2455) fix #2446 to test: create a schema with a field using files: % schema a % add hello /stdout display bundles in record mode % display record a

view details

Ashwin Ramaswami

commit sha b21918ea5afae02de90d33c8a1416103826b81c9

Fix long test startup times by using python:3.6.10-slim-buster for default test runs (#2449) Fixes #2388 by using python:3.6.10-slim-buster for default test runs. Following @nelson-liu 's suggestion in #2388 (comment). This image is still around ~40 MB (as opposed to the 5 GB default-cpu image). Time changes from ~20 mins -> 10 mins

view details

Ashwin Ramaswami

commit sha 433c007744bd6a1b9a157b384bfebb44f4f27855

Fix github actions on forks by not calling --push (#2457) This conditional expression was in the old `.travis.yml`, but it isn't there in the new github actions workflow. This PR adds that expression so that `--push` is not called from forks. Fixes #2456

view details

Jing Ge

commit sha f8dcf57ec59bc5562ee2324e53846edabc5cfbe0

Prettier CLI Reference Doc (#2458) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>

view details

Ashwin Ramaswami

commit sha 50706ab2e76cceddd289cccfb2c7e11a208da8cc

Actually fix builds from forks by building docker images when needed (#2461)

view details

Tony Lee

commit sha e31a55153a855d276efcd7a1e4df91e6ba3246f0

Test worker restart in GHA (#2466) Resolves #2465 Adding back this test that we used to have in Travis Appended .log for to err log files

view details

Nelson Liu

commit sha 697b8f58b79671b2e0c65f9ee5210ec851105a27

Fix worker restart CI on non-branch PRs (#2476) * Fix worker restart tests on non-branch PRs. * Empty commit since frontend tests are flaky

view details

Ashwin Ramaswami

commit sha b7d11d4d324422b66ba1a21badde766a3930d24c

Save / upload logs on both success and failure (#2450)

view details

push time in 13 days

push event nelson-liu/codalab-worksheets

Nelson Liu

commit sha 9534fc0ef1c02e1f937351a55896284c1f054be0

Run nvidia-smi directly on worker host to respect GPU isolation

view details

push time in 13 days

push event nelson-liu/codalab-worksheets

Ashwin Ramaswami

commit sha b7d11d4d324422b66ba1a21badde766a3930d24c

Save / upload logs on both success and failure (#2450)

view details

Nelson Liu

commit sha 3adcf6bf6689e99eb49206eab872f217f9ae4b43

Fix typo in docs/Server-Setup.md (#2418) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Nelson Liu

commit sha 524c79d8cefb4b6875e8b0fd686ba5550352afad

Update instructions for self-hosting GPU workers (#2481) * Update instructions for self-hosting GPU workers * Make docker images consistent * Specify that it tests Docker, through CodaLab

view details

Jing Ge

commit sha 69f2aaab58f0480ef41c218c8c603f459410a244

Fix invalid datetime and missing created field in metadata object (#2475) Closed via #2475

view details

Nelson Liu

commit sha 9714930b6e5713721e4dacc766fd1d23a90cee20

Fix scontrol state parsing regex in SlurmWorkerManager (#2484)

view details

Nelson Liu

commit sha 53692675df540639ea08755518020bae595198de

Enable workers to exit on exceptions, instead of sleeping (#2467)

view details

armantajback

commit sha 9a11c8334df03988d8960f564cad4a4d9b0bc2cb

Update version to 0.5.16 (#2472) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Tony Lee

commit sha bb33a2827feb2a1c28373b571a213ecd70a88979

Disable EditWorksheetTest Selenium for now (#2493)

view details

Jing Ge

commit sha 13bbbc800324b82d062be367e5e42c1a5792264c

Prettier CLI Reference: remove end colon for each single command (#2462)

view details

Jing Ge

commit sha 9cf494b7a87f3a1a0c6449c83348f41482156fa0

Improve monitor script when sending worker offline alert (#2503) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>

view details

Nelson Liu

commit sha f05d3ce07b3728da739f1ea13f0a1bfc6b894e8b

Add --worker-pass-down-termination flag to WorkerManager (#2508)

view details

Nelson Liu

commit sha 0677251266b9fa5a632bdf1800234283d73f0b33

Merge branch 'master' into worker_gpu_isolation

view details

push time in 13 days

push event nelson-liu/codalab-worksheets

Nelson Liu

commit sha 20289ab7834242e320afffcfb3af0c42d19a8e26

Don't try to start bundles on offline / timed-out workers

view details

push time in 13 days

push event nelson-liu/codalab-worksheets

Ashwin Ramaswami

commit sha b7d11d4d324422b66ba1a21badde766a3930d24c

Save / upload logs on both success and failure (#2450)

view details

Nelson Liu

commit sha 3adcf6bf6689e99eb49206eab872f217f9ae4b43

Fix typo in docs/Server-Setup.md (#2418) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Nelson Liu

commit sha 524c79d8cefb4b6875e8b0fd686ba5550352afad

Update instructions for self-hosting GPU workers (#2481) * Update instructions for self-hosting GPU workers * Make docker images consistent * Specify that it tests Docker, through CodaLab

view details

Jing Ge

commit sha 69f2aaab58f0480ef41c218c8c603f459410a244

Fix invalid datetime and missing created field in metadata object (#2475) Closed via #2475

view details

Nelson Liu

commit sha 9714930b6e5713721e4dacc766fd1d23a90cee20

Fix scontrol state parsing regex in SlurmWorkerManager (#2484)

view details

Nelson Liu

commit sha 53692675df540639ea08755518020bae595198de

Enable workers to exit on exceptions, instead of sleeping (#2467)

view details

armantajback

commit sha 9a11c8334df03988d8960f564cad4a4d9b0bc2cb

Update version to 0.5.16 (#2472) Co-authored-by: Ashwin Ramaswami <aramaswamis@gmail.com>

view details

Tony Lee

commit sha bb33a2827feb2a1c28373b571a213ecd70a88979

Disable EditWorksheetTest Selenium for now (#2493)

view details

Jing Ge

commit sha 13bbbc800324b82d062be367e5e42c1a5792264c

Prettier CLI Reference: remove end colon for each single command (#2462)

view details

Jing Ge

commit sha 9cf494b7a87f3a1a0c6449c83348f41482156fa0

Improve monitor script when sending worker offline alert (#2503) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>

view details

Nelson Liu

commit sha f05d3ce07b3728da739f1ea13f0a1bfc6b894e8b

Add --worker-pass-down-termination flag to WorkerManager (#2508)

view details

Nelson Liu

commit sha af786a4970cb9d97bd8a3df68f206719afe801d6

Merge branch 'master' into fix_bundle_manager_stalling

view details

push time in 13 days

push event nelson-liu/codalab-worksheets

Nelson Liu

commit sha e8627bf978941836a8be4b7bfda7d661fddb9f77

Re-create missing docker networks before launching containers

view details

push time in 13 days

push event nelson-liu/codalab-worksheets

Jing Ge

commit sha 13bbbc800324b82d062be367e5e42c1a5792264c

Prettier CLI Reference: remove end colon for each single command (#2462)

view details

Jing Ge

commit sha 9cf494b7a87f3a1a0c6449c83348f41482156fa0

Improve monitor script when sending worker offline alert (#2503) Co-authored-by: Jing Ge <stanford@Stanfords-MacBook-Pro.local>

view details

Nelson Liu

commit sha f05d3ce07b3728da739f1ea13f0a1bfc6b894e8b

Add --worker-pass-down-termination flag to WorkerManager (#2508)

view details

Nelson Liu

commit sha f12b070e569b204c8072ecaa506cca06bef59c76

Merge branch 'master' into worker_robust_docker_network

view details

push time in 13 days
