
bazelbuild/rules_rust 233

Rust rules for Bazel

google/starlark-rust 176

Starlark (https://github.com/bazelbuild/starlark) in Rust

apple/apple_rules_lint 63

A framework for adding lint checks to Bazel projects

illicitonion/cargo-ensure-installed 4

Like cargo install but if you already have a suitable version, simply leaves it as-is.

illicitonion/cargo-serve-doc 4

A cargo plugin for serving a unified tree of crate-local and std documentation (as well as the Rust book).

illicitonion/cargo-ensure-prefix 1

Cargo subcommand to check that all target files have a fixed prefix.

issue comment pantsbuild/pex

pex binary hangs on startup at atomic_directory

Thanks @mbakhoff. That's a facepalm bug. Thanks for identifying it; I'll get a fix out for this tomorrow.

mbakhoff

comment created time in 4 hours

PR opened CodeYourFuture/syllabus

updated the file with the new requirements

What does this change?

Module: Week(s):

Description

<!-- Add a description of what your PR changes here -->
I have updated this file with the requirements from Google Classroom.
<!-- For ease of review, consider adding a "rendered" version (using GitHub's markdown renderer) of the file(s) that you changed by adding a link in this format:

Rendered version -->

Who needs to know about this?

<!--- @frosas -->

+5 -6

0 comment

1 changed file

pr created time in 4 hours

issue opened pantsbuild/pex

pex binary hangs on startup at atomic_directory

My pex binary is hanging on startup. Using python3.7, pex 2.1.21, ubuntu 18.04. Here's the stack:

Traceback (most recent call first):
  <built-in method lockf of module object at remote 0x7fdf05a08f50>
  File "/var/zt/zt_consumers/current/.bootstrap/pex/common.py", line 385, in atomic_directory
  <built-in method next of module object at remote 0x7fdf06709d10>
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/var/zt/zt_consumers/current/.bootstrap/pex/util.py", line 182, in cache_distribution
  File "/var/zt/zt_consumers/current/.bootstrap/pex/environment.py", line 173, in _write_zipped_internal_cache
  File "/var/zt/zt_consumers/current/.bootstrap/pex/environment.py", line 197, in _load_internal_cache
  File "/var/zt/zt_consumers/current/.bootstrap/pex/environment.py", line 227, in _update_candidate_distributions
  File "/var/zt/zt_consumers/current/.bootstrap/pex/environment.py", line 416, in _activate
  File "/var/zt/zt_consumers/current/.bootstrap/pex/environment.py", line 260, in activate
  File "/var/zt/zt_consumers/current/.bootstrap/pex/pex.py", line 103, in _activate
  File "/var/zt/zt_consumers/current/.bootstrap/pex/pex.py", line 444, in execute
  File "/var/zt/zt_consumers/current/.bootstrap/pex/pex_bootstrapper.py", line 360, in bootstrap_pex
  File "/var/zt/zt_consumers/current/__main__.py", line 68, in <module>
  <built-in method exec of module object at remote 0x7fdf06709d10>
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)

It's a bit tricky to reproduce because it's timing-sensitive. The basic steps should be: build a pex binary containing some wheels, then launch multiple copies of the binary in parallel. All processes must use the same pex_root.

The hang happens in atomic_directory in pex/common.py. What I think happens is that the first process takes the file lock for the atomic dir, finalizes the directory, and releases the lock. If the timing is right, another process reaches https://github.com/pantsbuild/pex/blob/v2.1.21/pex/common.py#L390, so it also grabs the lock but never releases it.
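To make the sequence concrete, here's a rough sketch of the shape of the bug (written in Go with made-up names purely for illustration; pex's actual common.py differs): the early-return path acquires the lock but never releases it, so every later process blocks forever on the lock call.

package atomicdir

import (
	"os"
	"syscall"
)

// ensureAtomicDir is a hypothetical stand-in for atomic_directory.
func ensureAtomicDir(workDir, lockPath string) error {
	if _, err := os.Stat(workDir); err == nil {
		return nil // fast path: another process already finalized the directory
	}

	lockFile, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	// A second process can reach this point just after the first process has
	// finalized the directory and released its lock, so it still acquires the
	// exclusive lock here.
	if err := syscall.Flock(int(lockFile.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}

	if _, err := os.Stat(workDir); err == nil {
		// Bug: returning here without syscall.Flock(..., syscall.LOCK_UN)
		// leaves the lock held, so every later process hangs on the Flock
		// call above.
		return nil
	}
	defer syscall.Flock(int(lockFile.Fd()), syscall.LOCK_UN)

	return os.MkdirAll(workDir, 0o755)
}

Releasing the lock on every exit path (e.g. via defer, or a context manager in Python) avoids the hang.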

The issue was introduced in https://github.com/pantsbuild/pex/pull/1062

created time in 5 hours

pull request comment bazelbuild/rules_rust

rust_doc: add web server for docs

From a high-level design perspective, do we want to be generating a server target for every rustdoc target, or just a single binary target that serves as a tool that can handle any rustdoc output?

If you want to be able to ibazel run it and have the server automatically restart when the Rust code is updated, there needs to be a build path from the target that you run to your code, so I think that there needs to be a server target for each rustdoc target. Now, these could be trivial shims that just delegate to a single core server, but given that they’re just Python files it seems easier to just let each one be its own self-contained server. I don’t feel strongly about this.

On the build side, I’m having trouble getting the docs sub-repo to work with rules_python without crashing the build. So I’m feeling inclined to abandon this. On TensorBoard, we just have our own version of this server as a local py_binary, so I’m not personally blocked by this. If anyone happens to know an easy fix for the build issue, I’d be happy to incorporate it, or if you want to poke at it yourself, please feel free to yoink my server code.

@dfreese, does this sound reasonable to you?

wchargin

comment created time in 6 hours

pull request comment bazelbuild/rules_rust

rust_doc: strip directory prefix from archive names

Okay: I’ve rebased master. This patch seems to work fine. I’m having trouble with the build setup for #475, but I agree that this patch is nice to have by itself, so feel free to merge if you like.

wchargin

comment created time in 6 hours

pull request comment bazelbuild/rules_rust

fail build on any warnings when running clippy

I am happy with it as it is, but if David presents a use case where it won't work in #514, it may need rethinking.

dae

comment created time in 6 hours

PR opened CodeYourFuture/syllabus

/docs/git/homework.md fixed

What does this change?

Module: Week(s):

Description

<!-- Add a description of what your PR changes here -->

<!-- For ease of review, consider adding a "rendered" version (using GitHub's markdown renderer) of the file(s) that you changed by adding a link in this format:

Rendered version -->

Who needs to know about this?

<!--- Tag anyone who might want to be notified about this PR -->

+5 -6

0 comment

1 changed file

pr created time in 6 hours

issue comment bazelbuild/rules_rust

compilation_mode changes during build for proc-macro deps

Sure, RustEmbed is specifically what I'm working with, but there's no reason it couldn't be generally true in other crates: https://github.com/pyros2097/rust-embed/blob/master/impl/src/lib.rs

I've also come to the conclusion that your PR wouldn't have changed this behavior; it was probably always this way. Having thought about it more, this just seems like a bad pattern from the proc-macro author, and the best thing to do is probably to patch the crate. For example, an attribute on target CPU would be invalid, so why should an attribute on compilation mode be expected to be valid? However, cargo does behave this way, so we might see it in more crates.

djmarcin

comment created time in 6 hours

issue comment bazelbuild/starlark

Python bindings for starlark-go

there doesn't seem to be a way to pass non-primitive types between Go and Python.

Even if there isn't a way, you shouldn't expose raw pointers as ints in your API as clients can fake them and cause you to write to (say) arbitrary memory. The int should be encapsulated so that clients can't see it.
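One common way to do that, sketched here in Go with hypothetical names (this is not pystarlark's actual code), is to keep the Go values in a mutex-guarded registry and hand Python only an opaque ID, similar in spirit to runtime/cgo.Handle; a forged or stale ID then fails a map lookup instead of being dereferenced as memory.

package bindings

import (
	"errors"
	"sync"
)

// registry hands out opaque IDs for Go values instead of exposing pointers.
type registry struct {
	mu     sync.Mutex
	nextID uint64
	items  map[uint64]interface{}
}

func (r *registry) put(v interface{}) uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.items == nil {
		r.items = make(map[uint64]interface{})
	}
	r.nextID++
	r.items[r.nextID] = v
	return r.nextID // the only value that crosses the language boundary
}

func (r *registry) get(id uint64) (interface{}, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	v, ok := r.items[id]
	if !ok {
		return nil, errors.New("invalid handle") // forged/stale IDs fail safely
	}
	return v, nil
}

func (r *registry) drop(id uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.items, id)
}

On the Python side the ID would still be wrapped in an object that keeps it private, so callers never handle the integer directly.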

As for thread safety, I only did some cursory research on whether it would be an issue since I believe the GIL would protect multiple Starlark threads from running at the same time

The GIL protects you from certain kinds of data race in Python itself (in CPython; not every implementation of Python even has a GIL), but it won't stop two Python threads executing Go code from concurrently accessing the same data (e.g. the global maps), nor does it prevent higher-level non-atomicity problems in code such as x = get(); modify(x); set(x).
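A small Go sketch of both hazards (illustrative names, not pystarlark's API): the shared map needs its own mutex, and even with every call individually locked, a get-modify-set sequence is only safe if the whole thing runs as one critical section.

package bindings

import "sync"

var (
	mu      sync.Mutex
	globals = map[string]int64{}
)

func get(key string) int64 {
	mu.Lock()
	defer mu.Unlock()
	return globals[key]
}

func set(key string, v int64) {
	mu.Lock()
	defer mu.Unlock()
	globals[key] = v
}

// incrementNaive is racy even though get and set each take the lock:
// another caller can update key between the read and the write.
func incrementNaive(key string) {
	x := get(key) // x = get()
	x++           // modify(x)
	set(key, x)   // set(x)
}

// incrementAtomic performs the read-modify-write under a single lock.
func incrementAtomic(key string) {
	mu.Lock()
	defer mu.Unlock()
	globals[key]++
}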

ColdHeat

comment created time in 7 hours

issue comment bazelbuild/starlark

Python bindings for starlark-go

Ah gotcha now!

I settled on this approach because there doesn't seem to be a way to pass non-primitive types between Go and Python. Probably for good reason.

As for thread safety, I only did some cursory research on whether it would be an issue since I believe the GIL would protect multiple Starlark threads from running at the same time, but I'm not 100% sure. Thank you for the advice, though; I'll dig into it in the next few versions of pystarlark.

ColdHeat

comment created time in 8 hours

issue closed bazelbuild/rules_rust

Cannot reference non-rust files in source tree at compile time

I've run into the following problems trying to convert an existing codebase to bazel. I think they have the same root cause, which is that it's not possible (or at least I can't figure out how) to reference non-rust files at compile time in a static way.

Example 1, RustEmbed

#[derive(rust_embed::RustEmbed)]
#[folder = "../web/public/"]
struct Bundle;

The effect of this attribute is to embed the contents of the ../web/public/ folder from the source tree in the binary and make it available via functions on the Bundle struct. My first thought at how to solve this was to add a filegroup on ../web/public and add it to srcs; however, rust_library targets only allow *.rs files as sources, so this doesn't work. Next, I tried adding the filegroup to the cargo_build_script rule and copying the files to OUT_DIR. However the #[folder = ] attribute must take a literal, so I cannot do something like #[folder = concat!(env!("OUT_DIR"), "/web/public")] even though it's a compile-time constant. Luckily, rust-embed offers a way to interpolate env vars in this string, but this is not standard. Additionally, the need to manually copy the incoming filegroup to OUT_DIR in build.rs in order to make this work is awkward. Is there a nicer way?

Example 2, sqlx::migrate

sqlx::migrate!("db/migrations").run(&pool).await?;

Much like the RustEmbed example, this example reads the contents of the db/migrations directory at compile time and stores them as embedded objects in the binary, which run the migrations when the binary executes. Unlike RustEmbed, sqlx offers no way to interpolate an env var into this string, and must take a string literal. Nothing I have tried allows me to successfully use this macro in library code. Is there a solution to this I'm missing?

I am able to use this macro in build.rs since the path to data dependencies is static so I can write a string literal that works, but that reveals a further problem with the sqlx macro. It checks that the detected *.sql files are regular files. For some reason, when bazel symlinks the files into the sandbox directory for build.rs, even though sqlx uses the symlink-following std::fs::metadata method, these are not detected as is_file. Is there something about how the sandbox works that would prevent std::fs::metadata from working correctly?

closed time in 8 hours

djmarcin

pull request comment bazelbuild/rules_rust

Add separate proc_macro_deps attr

I suspect this PR may be related to #502, where all proc-macro deps become unconditionally compiled in opt, which breaks crates that detect the presence of debug_assertions internally to do different things (e.g. RustEmbed).

illicitonion

comment created time in 8 hours

issue opened bazelbuild/rules_rust

compilation_mode changes during build for proc-macro deps

When compiling a target in fastbuild, all proc-macro dependencies compile in opt rather than fastbuild. This breaks crates like rust-embed which expect to be compiled with the same settings as the crates using them. I've searched through rules_rust and I don't see where this could be changing. In fact, changing the very first line of _rust_library_impl to print(ctx.var["COMPILATION_MODE"]) will print opt for proc-macro crates.

I suspect this has something to do with this code, though I don't really understand it yet. Maybe @illicitonion has more context. https://github.com/bazelbuild/rules_rust/blob/feb8642761ba923de735a102bf43d090254b92ff/rust/private/rust.bzl#L401-L411

created time in 8 hours

pull request comment bazelbuild/rules_rust

Updated stardoc and regenerated docs

@damienmg Small PR, would you mind taking a quick look next time you're online? 🙏

UebelAndre

comment created time in 9 hours

create branch pantsbuild/pex

branch : 6zO9p1WAJznesv9s

created branch time in 10 hours

PR opened pantsbuild/example-python

Add toolchain buildsense plugin (not enabled yet)

testing stuff, ignore this PR for now.

+23 -0

0 comment

3 changed files

pr created time in 11 hours

Pull request review comment bazelbuild/rules_rust

Fix linking against versioned shared library

 def collect_deps(label, deps, proc_macro_deps, aliases, toolchain):
             transitive_dylibs.append(depset([
                 lib
                 for lib in libs.to_list()
-                if lib.basename.endswith(toolchain.dylib_ext)
+                # Dynamic libraries may have a version number nowhere, or before (linux) or after (macos) the extension.

whoops, thanks :)

djmarcin

comment created time in 11 hours

PR opened pantsbuild/pex

Fix `safe_open` for single element relative paths.
+37 -1

0 comment

2 changed files

pr created time in 12 hours

PR opened bazelbuild/rules_rust

Updated stardoc and regenerated docs

There are some features in newer releases of Bazel that are not compatible with the current version of stardoc (e.g. incompatible_use_toolchain_transition). Updating stardoc will prevent issues in other PRs.

+15 -9

0 comment

4 changed files

pr created time in 13 hours

push event CodeYourFuture/DocsV2

Chris Owen

commit sha 2edd4634ba0261b6af608bb2b197fd6059181b3f

GitBook: [master] 3 pages modified

view details

push time in 13 hours

push event CodeYourFuture/DocsV2

Chris Owen

commit sha cd7d89322d8e3b63425200dd4839ae92045c5a80

GitBook: [master] one page modified

view details

push time in 13 hours

pull request comment bazelbuild/rules_rust

Added `proto_compile_deps` and `grpc_compile_deps` to `@io_bazel_rules_rust//proto:toolchain`

I think the minimum supported version would have to go up to 3.4 for this, as per https://github.com/bazelbuild/bazel/commit/099cf2f9c57936617e912cff73399bbf65f14e64.

UebelAndre

comment created time in 14 hours

pull request comment bazelbuild/rules_rust

Added `proto_compile_deps` and `grpc_compile_deps` to `@io_bazel_rules_rust//proto:toolchain`

@UebelAndre I don't understand; what missing change? CI seems to be green @HEAD.

Oh, and I was referring to the fact that CI in this PR had failed and I didn't expect it to. The recent builds are passing except for the Minimum Supported Version test. This makes sense due to the addition of incompatible_use_toolchain_transition (as per https://github.com/bazelbuild/bazel/issues/11584). This is what I was referring to when I was saying I don't think CI will pass since this functionality seems relatively new.

UebelAndre

comment created time in 14 hours

Pull request review comment twitter/scoot

added new scheduler

 func (fw *FakeWorker) Run(cmd *runner.Command) (runner.RunStatus, error) {
 	fw.state = runner.RUNNING
 	go func(fw *FakeWorker) {
-		time.Sleep(time.Duration(duration) * time.Second)
+		t := time.NewTicker(time.Duration(duration) * time.Second)

could also just skip assigning to t: `<-time.NewTicker(time.Second).C`

JeanetteBruno

comment created time in a day

Pull request review comment twitter/scoot

added new scheduler

(diff context: new file adding LoadBasedAlg, a load-based scheduling algorithm that allocates job tasks to workers according to per-class target load percents; the comment below refers to the normalization step in setClassLoadPercents)

+// setClassLoadPercents set the scheduler's class load pcts with a copy of the input class load pcts
+func (lbs *LoadBasedAlg) setClassLoadPercents(classLoadPercents map[string]int32) {
+	lbs.config.classLoadPercentsMu.Lock()
+	defer lbs.config.classLoadPercentsMu.Unlock()
+
+	// build a list that orders the classes by descending pct.
+	keys := []string{}
+	for key := range classLoadPercents {
+		keys = append(keys, key)
+	}
+	sort.Slice(keys, func(i, j int) bool {
+		return classLoadPercents[keys[i]] > classLoadPercents[keys[j]]
+	})
+	lbs.config.classByDescLoadPct = keys
+
+	// set the load pcts - normalizing them if the don't add up to 100
+	lbs.config.classLoadPercents = map[string]int{}
+	pctTotal := 0
+	for _, val := range classLoadPercents {
+		pctTotal += int(val)
+	}
+	if pctTotal != 100 {
+		log.Errorf("LoadBalanced scheduling %%'s don't add up to 100, normalizing them")

let's break normalization out into a helper so we can more easily test it
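For example (hypothetical name and signature, not the existing scoot code), a pure function that takes the raw percents and returns a map normalized to sum to 100, with any rounding remainder going to the largest class, would be easy to test in isolation; setClassLoadPercents would then just call it.

package server

import "sort"

// normalizeClassLoadPercents scales the given percents so they sum to 100,
// giving any rounding remainder to the class with the largest share.
func normalizeClassLoadPercents(in map[string]int32) map[string]int {
	out := make(map[string]int, len(in))
	total := 0
	for _, v := range in {
		total += int(v)
	}
	if total == 0 || len(in) == 0 {
		return out
	}

	// order classes by descending percent so the remainder lands on the largest
	keys := make([]string, 0, len(in))
	for k := range in {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return in[keys[i]] > in[keys[j]] })

	assigned := 0
	for _, k := range keys[1:] {
		out[k] = int(in[k]) * 100 / total
		assigned += out[k]
	}
	out[keys[0]] = 100 - assigned
	return out
}

A table-driven test can then cover inputs that sum above or below 100 and assert that the outputs always total exactly 100.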

JeanetteBruno

comment created time in 14 hours

Pull request review comment twitter/scoot

added new scheduler

 func (c *CloudScootClient) GetSchedulerStatus() (*scoot.SchedulerStatus, error)
 	return schedulerStatus, err
 }
 
+// GetClassLoadPercents get the target load pcts for the classes
+func (c *CloudScootClient) GetClassLoadPercents() (map[string]int32, error) {
+	if err := c.checkForClient(); err != nil {
+		return nil, err
+	}
+	classLoadPercents, err := c.client.GetClassLoadPercents()
+	// if an error occurred reset the connection, could be a broken pipe or other

Going further - why don't we just reset the connection regardless? Is there any value in persisting the client? Could be another cleanup item to add to MGCI-1310

JeanetteBruno

comment created time in a day

Pull request review comment twitter/scoot

added new scheduler

+package cli
+
+import (
+	"encoding/json"
+	"fmt"
+	"io/ioutil"
+	"os"
+
+	log "github.com/sirupsen/logrus"
+	"github.com/spf13/cobra"
+)
+
+// lbsSchedAlgParams load based scheduling params.  The setter API is associated with
+// this structure.
+type lbsSchedAlgParams struct {
+	ClassLoadPercents    map[string]int32
+	RequestorMap         map[string]string
+	RebalanceMinDuration int
+	RebalanceThreshold   int
+
+	clpFilePath    string
+	reqMapFilePath string
+}

(diff context: getLBSSchedAlgParams and its registerFlags/run implementation for the get_scheduling_alg_params command)

+func (s *lbsSchedAlgParams) registerFlags() *cobra.Command {

I think I'd prefer that approach - isolating the setter

JeanetteBruno

comment created time in a day

Pull request review comment twitter/scoot

added new scheduler

(diff context: LoadBasedAlg, the load-based scheduling algorithm in the server package)
This means that the we will not+// iterate more than the number of classes.+func (lbs *LoadBasedAlg) workerLoanAllocation(numIdleWorkers int) {+	i := 0+	for ; i < len(lbs.jobClasses); i++ {+		lbs.computeLoanPercents()++		// compute loan %'s and allocate idle workers+		numTasksAllocated, haveWaitingTasks := lbs.getTaskAllocations(numIdleWorkers)++		numIdleWorkers -= numTasksAllocated++		if !haveWaitingTasks {+			break+		}+		if numIdleWorkers <= 0 {+			break+		}+	}+}++// getTaskAllocations given the normalized allocation %s for each class, working from highest % (largest allocation) to smallest,+// allocate that class's % of the idle workers (update the class's numTasksToStart and numWaitingTasks), but not to exceed the+// classs' number of waiting tasks. Return the total number of tasks allocated to workers and a boolean indicating if there are+// still tasks waiting to be allocated+func (lbs *LoadBasedAlg) getTaskAllocations(numIdleWorkers int) (int, bool) {+	totalTasksAllocated := 0+	haveWaitingTasks := false++	for _, className := range lbs.classByDescLoadPct {+		jc := lbs.jobClasses[className]+		numTasksToStart := min(jc.numWaitingTasks, ceil(float32(numIdleWorkers)*(float32(jc.tempNormalizedPct)/100.0)))++		if (totalTasksAllocated + numTasksToStart) > numIdleWorkers {+			numTasksToStart = numIdleWorkers - totalTasksAllocated+		}+		jc.numTasksToStart += numTasksToStart+		jc.numWaitingTasks -= numTasksToStart+		if jc.numWaitingTasks > 0 {+			haveWaitingTasks = true+		}+		totalTasksAllocated += numTasksToStart+	}+	return totalTasksAllocated, haveWaitingTasks+}++// computeEntitlementPercents computes each class's current entitled % of total entitlements (from the current)+// entitlement values+func (lbs *LoadBasedAlg) computeEntitlementPercents() {+	// get the entitlements total+	entitlementTotal := 0+	for _, jc := range lbs.jobClasses {+		entitlementTotal += jc.tempEntitlement+	}++	// compute the % for all but the class with the largest %.  Add up all computed %s and assign+	// 100 - sum of % to the class with largest % (this eliminates rounding errors, forcing the+	// % to add up to 100%)+	totalPercents := 0+	firstClass := true+	for _, className := range lbs.classByDescLoadPct {+		if firstClass {+			firstClass = false+			continue+		}+		jc := lbs.jobClasses[className]+		jc.tempNormalizedPct = int(math.Floor(float64(jc.tempEntitlement) * 100.0 / float64(entitlementTotal)))+		totalPercents += jc.tempNormalizedPct+	}+	lbs.jobClasses[lbs.classByDescLoadPct[0]].tempNormalizedPct = 100 - totalPercents+}++// computeLoanPercents as orig load %'s normalized to exclude classes that don't have waiting tasks+func (lbs *LoadBasedAlg) computeLoanPercents() {+	// get the sum of all the original load pcts for classes that have waiting tasks+	pctsTotal := 0+	for _, jc := range lbs.jobClasses {+		if jc.numWaitingTasks > 0 {+			pctsTotal += jc.origTargetLoadPct+		}+	}++	if pctsTotal == 0 {+		return+	}++	// compute the % for all but the class with the largest %.  
Add up all computed %s and assign+	// 100 - sum of % to the class with the largest % from the range (this eliminates rounding errors, forcing the+	// sum or % to go to 100%)+	totalPercents := 0+	firstClass := true+	firstClassName := ""+	for _, className := range lbs.classByDescLoadPct {+		jc := lbs.jobClasses[className]+		if jc.numWaitingTasks > 0 {+			if firstClass {+				firstClass = false+				firstClassName = className+				continue+			}+			jc.tempNormalizedPct = int(math.Floor(float64(jc.origTargetLoadPct) * 100.0 / float64(pctsTotal)))+			totalPercents += jc.tempNormalizedPct+		} else {+			jc.tempNormalizedPct = 0+		}+	}+	lbs.jobClasses[firstClassName].tempNormalizedPct = 100 - totalPercents+}++// buildTaskStartList builds the list of tasks to be started for each jobClass.+func (lbs *LoadBasedAlg) buildTaskStartList() []*taskState {+	tasks := []*taskState{}+	for _, jc := range lbs.jobClasses {+		if jc.numTasksToStart <= 0 {+			continue+		}+		classTasks := lbs.getTasksToStartForJobClass(jc)+		tasks = append(tasks, classTasks...)+	}+	return tasks+}++// getTasksToStartForJobClass get the tasks to start list for a given jobClass.  The jobClass's numTasksToStart+// field will contain the number of tasks to start for this job class.  The jobClass's jobsByNumRunningTasks is+// a map from an integer value (number of tasks running) to the list of jobs with that number of tasks running+// For a given jobClass, we start adding tasks from the jobs with the least number of tasks running.+// (Note: when a task is started for a job, the job is moved to the ‘next’ bin and placed at the end of that bin’s job list.)+func (lbs *LoadBasedAlg) getTasksToStartForJobClass(jc *jobClass) []*taskState {+	tasks := []*taskState{}++	startingTaskCnt := 0+	// work our way through the class's jobs, starting with jobs with the least number of running tasks,+	// till we've added the class's numTasksToStart number of tasks to the task list+	for numRunningTasks := 0; numRunningTasks <= jc.maxTaskRunningMapIndex; numRunningTasks++ {+		var jobs []jobWaitingTaskIds+		var ok bool+		if jobs, ok = jc.jobsByNumRunningTasks[numRunningTasks]; !ok {+			// there are no jobs with numRunningTasks running tasks, move on to jobs with more running tasks+			continue+		}+		// jobs contains list of jobs and their waiting taskIds. (Each job in this list has the same number of running tasks.)+		// Allocate one task from each job till we've allocated numTasksToStart for the jobClass, or have allocated 1 task from+		// each job in this list.  As we allocate a task for a job, move the job to the end of jc.jobsByNumRunningTasks[numRunningTasks+1].+		for _, job := range jobs {+			if job.waitingTaskIDs != nil && len(job.waitingTaskIDs) > 0 {+				// get the next task to start from the job+				tasks = append(tasks, lbs.getJobsNextTask(job))++				// move the job to jobsByRunning tasks with numRunningTasks + 1 entry.  Note: we don't have to pull it from+				// its current numRunningTasks bucket since this is a 1 time pass through the jobsByNumRunningTasks map.  
The map+				// will be rebuilt with the next scheduling iteration+				if len(job.waitingTaskIDs) > 1 {+					job.waitingTaskIDs = job.waitingTaskIDs[1:]+					jc.jobsByNumRunningTasks[numRunningTasks+1] = append(jc.jobsByNumRunningTasks[numRunningTasks+1], job)+					if numRunningTasks == jc.maxTaskRunningMapIndex {+						jc.maxTaskRunningMapIndex+++					}+				}++				startingTaskCnt+++				if startingTaskCnt == jc.numTasksToStart {+					return tasks+				}+			}++		}+	}++	return tasks // note: we should never hit this line

why not? why include it?

JeanetteBruno

comment created time in a day
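
The computeEntitlementPercents and computeLoanPercents comments in the diff above describe a rounding scheme: floor every class's share of the total except the largest class's, then hand whatever is left to the largest class so the percentages always sum to exactly 100. Below is a minimal, self-contained sketch of that idea; the class names and numbers are made up for illustration and the helper is not part of the PR.

package main

import (
	"fmt"
	"math"
	"sort"
)

// normalizePercents mirrors the rounding scheme described in the diff's
// computeEntitlementPercents/computeLoanPercents comments: floor each
// class's share of the total except the largest, then give the largest
// class the remainder so the results always add up to 100.
// The class names and entitlement values used here are hypothetical.
func normalizePercents(entitlements map[string]int) map[string]int {
	total := 0
	for _, e := range entitlements {
		total += e
	}
	out := map[string]int{}
	if total == 0 {
		return out
	}

	// order class names by descending entitlement so the largest comes first
	names := make([]string, 0, len(entitlements))
	for n := range entitlements {
		names = append(names, n)
	}
	sort.Slice(names, func(i, j int) bool { return entitlements[names[i]] > entitlements[names[j]] })

	allocated := 0
	for _, n := range names[1:] {
		pct := int(math.Floor(float64(entitlements[n]) * 100.0 / float64(total)))
		out[n] = pct
		allocated += pct
	}
	out[names[0]] = 100 - allocated // absorb the rounding error in the largest class
	return out
}

func main() {
	fmt.Println(normalizePercents(map[string]int{"land": 7, "diff": 5, "sandbox": 3})) // map[diff:33 land:47 sandbox:20]
}

Flooring each share keeps every class at or below its exact proportion, and pushing the leftover into the largest class puts the rounding error where it is proportionally smallest.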

Pull request review comment twitter/scoot

added new scheduler

+	// work our way through the class's jobs, starting with jobs with the least number of running tasks,

why least running tasks vs lowest pct of running tasks relative to waiting?

JeanetteBruno

comment created time in a day
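
The review question above is about the bin-walk in getTasksToStartForJobClass: jobs are bucketed purely by how many tasks they currently have running, and each time a job contributes a task it is promoted one bucket up. A small stand-alone sketch of that selection order follows; the job and task IDs are hypothetical and the helper is only an approximation of the diff's logic.

package main

import "fmt"

// waitingJob is a stand-in for the diff's jobWaitingTaskIds: a job ID plus
// the IDs of its not-yet-started tasks.
type waitingJob struct {
	jobID   string
	waiting []string
}

// pickTasks walks the buckets from emptiest to fullest, takes one waiting
// task per job, and moves each contributing job up one bucket, until n
// tasks have been chosen or the buckets are exhausted.
func pickTasks(byRunning map[int][]waitingJob, maxBin, n int) []string {
	picked := []string{}
	for bin := 0; bin <= maxBin && len(picked) < n; bin++ {
		for _, job := range byRunning[bin] {
			if len(job.waiting) == 0 {
				continue
			}
			picked = append(picked, job.waiting[0])
			if len(job.waiting) > 1 {
				// the job now "has one more running task": park it in the next bucket
				job.waiting = job.waiting[1:]
				byRunning[bin+1] = append(byRunning[bin+1], job)
				if bin == maxBin {
					maxBin++
				}
			}
			if len(picked) == n {
				break
			}
		}
	}
	return picked
}

func main() {
	bins := map[int][]waitingJob{
		0: {{jobID: "jobA", waiting: []string{"a1", "a2"}}},
		2: {{jobID: "jobB", waiting: []string{"b1"}}},
	}
	fmt.Println(pickTasks(bins, 2, 3)) // [a1 a2 b1]
}

Under this ordering a job that already has many running tasks only contributes again once the emptier jobs have caught up, regardless of how many tasks it still has waiting.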

Pull request review comment twitter/scoot

added new scheduler

+				// move the job to jobsByRunning tasks with numRunningTasks + 1 entry.  Note: we don't have to pull it from

*jobsByRunningTasks

JeanetteBruno

comment created time in a day
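
For completeness, here is a stand-alone sketch of the requestor-to-class lookup that the diff's GetRequestorClass performs: map keys are treated as regular expressions and the first matching key decides the class. The patterns and requestor strings below are made up; the PR's DefaultRequestorToClassMap simply maps every requestor to a single class.

package main

import (
	"fmt"
	"regexp"
)

// classForRequestor approximates the diff's GetRequestorClass: return the
// class of the first pattern that matches the requestor, or "" when nothing
// matches (the caller then falls back to the lowest-percentage class).
func classForRequestor(requestor string, requestorToClass map[string]string) string {
	for re, class := range requestorToClass {
		if matched, _ := regexp.MatchString(re, requestor); matched {
			return class
		}
	}
	return ""
}

func main() {
	// hypothetical mapping in the spirit of DefaultRequestorToClassMap
	m := map[string]string{"^ci-.*": "batch", "^interactive$": "interactive"}
	fmt.Println(classForRequestor("ci-presubmit", m)) // batch
	fmt.Println(classForRequestor("unknown", m))      // "" -> fall back to lowest-% class
}

Because Go map iteration order is randomized, overlapping patterns could match non-deterministically; the sketch assumes the configured patterns are disjoint, as the single-pattern default trivially is.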
