Ticket #35 (new defect)

Opened 2 years ago

Last modified 4 weeks ago

reconfig doesn't pick up new dependent schedulers

Reported by: Pike
Assigned to: warner
Priority: major
Milestone: 0.8.0
Component: buildprocess
Version: 0.7.5
Keywords:
Cc: bhearsum@mozilla.com

Description

I use Dependent schedulers to do multiple builds based on one source checkout, and adding new Dependent builds doesn't get their deps picked up.

If I check in, all the original builds kick off, but new ones don't. Triggering them via admin works, and after I nuked all status in the master, it worked.

Change History

05/12/07 11:22:37 changed by warner

Yeah, this one bit me recently too. I think it's the same sort of equal-vs-Equal problem that we had with the Locks last year.

When the config file is re-read, it compares the objects therein for equality against the running objects, and only updates them if they've changed. That way we minimize interruptions when, say, the reconfig changes a status plugin and leaves the schedulers alone.

The problem here is that we wind up with two instances of the same upstream Scheduler. So if we started with:

 A = Scheduler("foo") # call this instance A1
 C = DependentScheduler("downstream", [A]) # call this instance C1

and then we move to a new config like:

 A = Scheduler("foo") # call this A2
 B = Scheduler("bar") # call this B2
 C = DependentScheduler("downstream", [A,B]) # call this C2, note it uses [A2,B2]

then what happens is that the reconfig process compares A2 against the active A1, decides they're the same, and leaves A1 alone. It sees B2, notices that it is new, and activates it. It sees C2, compares C1's upstream list [A1] against C2's upstream list [A2,B2], and decides that they're different (note that [A1] == [A2], and [A1,B2] == [A2,B2], but [A2] != [A2,B2]), so it shuts down C1 and activates C2.

But now C2 is referencing [A2,B2], whereas A2 hasn't been activated because it looked like A1 was sufficient. So A1 is still receiving Changes and triggering builds, but A2 isn't. C2 will never fire because it's watching A2, which is effectively dead.
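The sequence above can be reproduced with a small standalone model. This is only a sketch of the compare-and-keep logic as described in this comment, not Buildbot's actual code: the `Scheduler` and `DependentScheduler` classes, the `running` flag handling, and the `reconfig` helper are all simplified stand-ins invented for illustration.

```python
# Hypothetical stand-ins for the real scheduler classes (illustration only).
class Scheduler:
    def __init__(self, name):
        self.name = name
        self.running = False

    def __eq__(self, other):
        # "Equal" means same type and same configuration, not same instance.
        return type(self) is type(other) and self.name == other.name


class DependentScheduler(Scheduler):
    def __init__(self, name, upstreams):
        super().__init__(name)
        self.upstreams = list(upstreams)

    def __eq__(self, other):
        return Scheduler.__eq__(self, other) and self.upstreams == other.upstreams


def reconfig(active, new_config):
    """Naive reconfig: keep an active instance whenever an equal one is configured."""
    result = []
    for new in new_config:
        kept = next((old for old in active if old == new), None)
        if kept is not None:
            result.append(kept)       # old instance stays; the new one is discarded
        else:
            new.running = True        # genuinely new or changed: activate it
            result.append(new)
    for old in active:
        if not any(old is r for r in result):
            old.running = False       # no longer configured: shut it down
    return result


# Original config: A1 and C1 are active.
A1 = Scheduler("foo")
C1 = DependentScheduler("downstream", [A1])
A1.running = C1.running = True
active = [A1, C1]

# New config: A2 equals A1, B2 is new, C2 differs from C1.
A2 = Scheduler("foo")
B2 = Scheduler("bar")
C2 = DependentScheduler("downstream", [A2, B2])
active = reconfig(active, [A2, B2, C2])

# C2 is active but still points at A2, which was never started.
for up in C2.upstreams:
    print(up.name, "running" if up.running else "NOT running")
```

In this model the active list ends up as A1, B2, C2, while C2's first upstream is the discarded A2 instance, which matches the dead-upstream situation described above.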

To fix this cleanly, the following approaches come to mind:

  • modify loadConfig_Schedulers to look inside Dependent schedulers and compare the instances they use against the list of Schedulers that we've decided to activate or leave activated. If there's a difference, we need to cycle out some of the upstream schedulers (replace A1 with A2)
    • alternatively, if there's a difference, we somehow modify the downstream scheduler to reference the old A1 instead of the A2 that it was instantiated with
  • once loadConfig_Schedulers is done setting everything up, Dependent schedulers need to check the service .running flag inside their upstreams. If any of the upstream schedulers are not running, that indicates this sort of problem. It's not easy to fix it once we've hit that point, though. Maybe just an error message in the logs that tells the admin to restart the buildmaster.
    • maybe if we get through loading schedulers and see a 's.running==False' somewhere, dump all the schedulers and then trigger a config-file reload. That would lose any pending builds that were hanging out in the schedulers but it would be most likely to fix the situation. We could also just call loadConfig_Schedulers with some new flag that says "please pretend that everything is new", which would do less work than reloading the whole config file (including Builders and status).
  • point to upstream schedulers by name rather than by instance (perhaps implicitly: use the instance reference to snarf the name, then search through all active schedulers to find one with a matching name). This is effectively how we did it with Locks, except that in that case we declared the Lock instances that you see in the config file to be the "names", and they reference "real" Lock instances that are created later.
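Two of the ideas in this list, checking the upstreams' running flag and resolving upstreams by name, can be sketched together. This is a hedged illustration rather than a patch: the classes below are simplified stand-ins, and `find_dead_upstreams` and `rebind_upstreams_by_name` are hypothetical helper names that do not exist in Buildbot.

```python
# Hypothetical stand-ins for the real scheduler classes (illustration only).
class Scheduler:
    def __init__(self, name):
        self.name = name
        self.running = False


class DependentScheduler(Scheduler):
    def __init__(self, name, upstreams):
        super().__init__(name)
        self.upstreams = list(upstreams)


def find_dead_upstreams(schedulers):
    """Return (downstream name, upstream name) pairs whose upstream is not running."""
    return [
        (s.name, up.name)
        for s in schedulers
        if isinstance(s, DependentScheduler)
        for up in s.upstreams
        if not up.running
    ]


def rebind_upstreams_by_name(schedulers):
    """Point each downstream at the active scheduler that has the same name."""
    by_name = {s.name: s for s in schedulers}
    for s in schedulers:
        if isinstance(s, DependentScheduler):
            # Fall back to the original reference if no active scheduler matches.
            s.upstreams = [by_name.get(up.name, up) for up in s.upstreams]


# State after the problematic reconfig: A1 kept, A2 discarded, C2 watching A2.
A1, A2, B2 = Scheduler("foo"), Scheduler("foo"), Scheduler("bar")
C2 = DependentScheduler("downstream", [A2, B2])
A1.running = B2.running = C2.running = True
active = [A1, B2, C2]

dead_before = find_dead_upstreams(active)   # detects the stale A2 reference
rebind_upstreams_by_name(active)            # C2 now watches A1 instead
dead_after = find_dead_upstreams(active)
print(dead_before, dead_after)
```

The detection step alone would be enough to log a warning telling the admin to restart. The rebinding step assumes scheduler names are unique among the active schedulers, which is an assumption of this sketch.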

09/29/07 01:49:03 changed by warner

  • version set to 0.7.5.
  • milestone set to 0.7.7.

12/28/07 00:58:43 changed by warner

  • milestone changed from 0.7.7 to 0.7.8.

no progress made on this yet, bumping to 0.7.8

03/19/08 10:14:39 changed by bhearsum

  • cc set to bhearsum@mozilla.com.

05/02/08 16:22:47 changed by warner

  • milestone changed from 0.7.8 to 0.7.9.

this is bothering me a lot, but I don't think I can fix it within the next week. Pushing it out to 0.7.9.

06/27/08 10:07:05 changed by ste

I doubt this is a surprise, but I just bumped into this bug with Trigger/Triggerable too, where the same problem can evidently occur. Inside Trigger I can see that self.build.builder.botmaster.parent.allSchedulers() returns a Triggerable with a name that no longer exists in my config. A restart cured this.

11/07/08 09:28:10 changed by bhearsum

Another symptom of this may be Dependent schedulers failing to fire if they or their upstream scheduler(s) are changed with a reconfig while the upstream is running. Not sure if what's described above would fix this or not. (Should I file a new ticket?)