Ticket #42 (new enhancement)

Opened 1 year ago

Last modified 3 weeks ago

Add load-balancing support

Reported by: stefan
Assigned to: warner
Priority: major
Milestone: 0.8.0
Component: other
Version:
Keywords:
Cc: Pike, bhearsum@mozilla.com, maruel@chromium.org

Description

When running lots of schedulers / builders on relatively few machines it becomes increasingly important to manage resources, i.e. to restrict the number of builders that can run simultaneously. A first step in that direction may be the use of semaphores to indicate how many builders may run in parallel on a given machine.
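To make the semaphore idea concrete, here is a minimal sketch in plain Python. It is not buildbot code: the `MachineSlots` class and its method names are hypothetical, and it only models "at most N builds at once on one machine".

```python
import threading


class MachineSlots:
    """Hypothetical per-machine limiter: at most max_builds builds at once."""

    def __init__(self, max_builds):
        self._slots = threading.BoundedSemaphore(max_builds)

    def try_start(self):
        # Non-blocking acquire: True if a slot was free, False if the
        # machine is already running its maximum number of builds.
        return self._slots.acquire(blocking=False)

    def finish(self):
        # Give the slot back when a build ends.
        self._slots.release()


slots = MachineSlots(max_builds=2)
started = [slots.try_start() for _ in range(3)]  # third request has to wait
slots.finish()                                   # one build completes
retry = slots.try_start()                        # the delayed request now fits
```

A non-blocking acquire is used so that a full machine results in a delayed build rather than a blocked caller.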

However, when managing multiple projects with multiple buildbot setups the above isn't enough either. Scheduling builds needs to take into account the 'outside' world, such as the current load of a machine. If there is only a single machine to which a builder is assigned, it may mean that a build has to be delayed. However, if there are multiple equivalent machines available, a single builder could be assigned to a class of equivalent machines, instead of a single machine, and the scheduler can then assign a build to the machine with the lowest load.
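A rough sketch of the selection step described above, assuming each machine in a class of equivalent machines reports a numeric load. The function name `pick_machine` and the `max_load` cutoff are illustrative assumptions, not existing buildbot API.

```python
def pick_machine(loads, max_load=None):
    """Return the least-loaded machine name, or None to delay the build.

    loads: dict mapping machine name -> current load (for example the
           1-minute load average).
    max_load: hypothetical cutoff; machines at or above it are skipped.
    """
    candidates = {
        name: load
        for name, load in loads.items()
        if max_load is None or load < max_load
    }
    if not candidates:
        return None  # every machine is too busy, so the build is delayed
    return min(candidates, key=candidates.get)


pool = {"linux-1": 2.5, "linux-2": 0.4, "linux-3": 1.1}
chosen = pick_machine(pool, max_load=2.0)                # lowest load wins
delayed = pick_machine({"linux-1": 2.5}, max_load=2.0)   # single busy machine
```

With only one machine assigned, the same function covers the "build has to be delayed" case by returning None.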

As a different use case taking advantage of the same feature, consider a project with a large number of builders and a large community of contributors who are willing to share hardware remotely for testing.

On connection, the build master would assign short-lived builders to them, aggregating results as they arrive.

Here again, the main new feature is a loosened relationship between builder and build slave.

Change History

06/06/07 01:54:38 changed by Pike

  • cc set to Pike.

One obstacle I found to distributing load across slaves is the early assignment of build requests to slaves. My hacks could be far less intrusive if the pairing of buildrequest and slave happened at the time the slave actually starts instead of at the time the buildrequest is submitted. (I hope I got buildrequest and buildset right here.)

Background:

I hacked some semi load distribution into my buildbot install, moving all the logic into a scheduler and a build factory. I'm potentially running a whole lot of builds on a single source check-out. Thus, the build factory makes sure that I only check out once per builder, where I only use one slave per builder. The builds have distinct properties, which I pull out of the reason string I create in the scheduler.

The actual foo is in the scheduler, which throttles build requests such that I only have one per builder at a time. The builders in turn don't really make up a builder in the classical buildbot sense, but are more like queues of builds with different properties. Empty queues, even. I can adjust the load on a slave by putting multiple builders on a single slave, or just one.

I admit, it's kinda backwards.

07/02/07 18:51:33 changed by warner

> I'd have to be way less intrusive in my hacks if the pairing of buildrequest and slave happened at the time the slave actually starts instead of at the time the buildrequest is submitted.

Hrm?

The BuildRequest sits in a queue (inside the Builder) until it gets to the top of the list. At that point (inside maybeStartBuild), we see if there's an available slave. If so, we put the request (along with any other mergeable BuildRequests) into a new Build and hand it to the slave. The slave selection does not happen until the BuildRequest is pulled off the Builder.buildable queue.
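For readers following along, the flow described above can be modelled with a toy class. This is a simplified sketch, not the real Builder code: `ToyBuilder`, its attributes, and the "same source means mergeable" rule are all assumptions made for illustration.

```python
from collections import deque


class ToyBuilder:
    """Toy model only; names and merge rule are assumptions, not buildbot code."""

    def __init__(self, slaves):
        self.buildable = deque()          # pending requests, oldest first
        self.idle_slaves = list(slaves)   # slaves free to take a build
        self.started = []                 # history of (slave, [requests])

    def submit(self, request):
        # Submitting only queues the request; no slave is chosen here.
        self.buildable.append(request)
        self.maybe_start_build()

    def maybe_start_build(self):
        # Nothing to do without both a pending request and a free slave.
        if not self.buildable or not self.idle_slaves:
            return
        slave = self.idle_slaves.pop(0)   # the slave is picked only now
        first = self.buildable.popleft()
        merged, remaining = [first], deque()
        for req in self.buildable:
            # Assumed merge rule: same source revision means mergeable.
            (merged if req["source"] == first["source"] else remaining).append(req)
        self.buildable = remaining
        self.started.append((slave, merged))

    def slave_finished(self, slave):
        self.idle_slaves.append(slave)
        self.maybe_start_build()


builder = ToyBuilder(["slave-a"])
r1, r2, r3 = ({"id": 1, "source": "rev1"}, {"id": 2, "source": "rev2"},
              {"id": 3, "source": "rev2"})
for r in (r1, r2, r3):
    builder.submit(r)
builder.slave_finished("slave-a")
```

In this model the second and third requests wait in the queue without a slave and are merged into one build once the slave becomes free.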

Does this line up with your experiences? I might be missing something here, or we might have some terminology mismatches.

07/05/07 15:08:14 changed by stefan

I have just now realized that quite a lot of work related to load balancing already went into the repository. That's great!

Trying to understand the architecture a bit, I'm looking into where best to add calls to os.getloadavg() in an attempt to rank available slaves. It seems the 'slaves' variable held by a Builder object (and which gets set via the 'slavenames' config variable) contains SlaveBuilder instances. Is SlaveBuilder the right place to add a 'loadavg()' method? Or should that go into the BuildSlave class? (But SlaveBuilder doesn't appear to hold a reference to that...)
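As a sketch of the ranking idea: Python's standard library exposes the load average as os.getloadavg(), which returns the 1, 5 and 15 minute averages on Unix-like systems. The helper names below (`local_load`, `rank_slaves`) are hypothetical, and how a slave would report its load to the master is left open here.

```python
import os


def local_load():
    """1-minute load average of this machine, or None if unavailable."""
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        # os.getloadavg() is not available on every platform.
        return None


def rank_slaves(reported):
    """Sort slave names by reported load, lowest first; unknown loads last."""
    return sorted(
        reported,
        key=lambda name: (reported[name] is None, reported[name] or 0.0),
    )


# Loads as they might be reported by three slaves (made-up numbers).
reported = {"slave-a": 1.7, "slave-b": None, "slave-c": 0.3}
ranking = rank_slaves(reported)
```

Slaves that cannot report a load are ranked last rather than dropped, so they can still be used when nothing else is available.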

Thanks for any advice. Keep up the good work!

07/10/07 04:36:53 changed by Pike

Ticket #48 might have patches that help further in terms of hooks.

Re Brian's question, I triggered multiple build requests at once, which called into Builder.maybeStartBuild in a tight loop. I recall that all those build requests ended up with the same slave, thus my randomizer patch. Back then I blamed this on the SlaveBuilder status being set to non-IDLE only in a callback, though I think it should actually be set to PINGING without one. No idea why all slaves were claiming to be idle.

07/28/07 13:10:29 changed by warner

  • milestone changed from 0.7.6 to 0.7.7.

12/21/07 20:34:43 changed by warner

  • milestone changed from 0.7.7 to 0.7.8.

bumping out to the next release

02/24/08 15:08:12 changed by stefan

I just noticed this has been postponed again. That's a pity! What is missing here? Last time I looked at the code it seemed it should almost work, i.e. all that is missing is a call to loadavg() at an appropriate place to signal the given slave's availability. What am I missing?

03/19/08 10:10:27 changed by bhearsum

  • cc changed from Pike to Pike, bhearsum@mozilla.com.

11/11/08 06:44:52 changed by maruel

  • cc changed from Pike, bhearsum@mozilla.com to Pike, bhearsum@mozilla.com, maruel@chromium.org.

I've implemented something similar in http://src.chromium.org/viewvc/chrome/trunk/tools/buildbot/config/master.tryserver/builders_pools.py?view=markup but it's definitely not the best way to implement this.

The use case is: run a try server on 3 platforms. When a user wants to test a patch, one builder for each of the 3 supported platforms is selected to build and test the patch.

When a new BuildSet is queued, this class selects one builder in each pool of the array, scheduling one Build independently of the other pools.

In my case, each pool is a different OS to try a change on. The current caveat is that once the pool is 100% used, selection becomes random, so it doesn't take into account pending scheduler builds or ETA to make a more informed selection. A just-in-time (JIT) Builder selection would be nice there.
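The selection behaviour described here can be sketched roughly as follows. This is not the code from builders_pools.py: the function names are hypothetical, and taking the first idle builder (rather than some other idle one) is an assumption.

```python
import random


def select_builder(pool, busy, rng=random):
    """Pick one builder from a pool: an idle one if any, else a random one."""
    idle = [name for name in pool if name not in busy]
    if idle:
        return idle[0]           # assumption: first idle builder wins
    return rng.choice(pool)      # pool fully used: fall back to random


def select_for_each_pool(pools, busy, rng=random):
    """Select one builder per pool, each pool decided independently."""
    return [select_builder(pool, busy, rng) for pool in pools]


# One pool per OS; builder names are made up for the example.
pools = [["win-1", "win-2"], ["mac-1"], ["linux-1", "linux-2"]]
busy = {"win-1", "mac-1"}
picks = select_for_each_pool(pools, busy)
```

The random fallback is where a more informed choice, such as the builder with the fewest pending builds or the shortest ETA, could be plugged in.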

Besides that, it works fine for our needs.

11/15/08 16:28:57 changed by maruel

/me feels ashamed. I should have known about the 'slavenames' key in c['builders'] before. :(