Ticket #211 (closed defect: fixed)

Opened 8 months ago

Last modified 2 months ago

slave notify_on_missing is sending too many messages

Reported by: warner Assigned to: warner
Priority: major Milestone: 0.7.9
Component: other Version: 0.7.6
Keywords: slave review Cc:

Description

The notify_on_missing= functionality, which is supposed to send email if a buildslave has been offline for too long, doesn't seem to be picking up on reconnects. The slave disconnects at time 0, reconnects at time 5, but the master sends out mail anyways at time 10.

Attachments

patches (0.9 kB) - added by zooko on 04/15/08 13:23:51.
patch (1.8 kB) - added by zooko on 04/22/08 19:13:31.

Change History

04/07/08 12:29:45 changed by warner

  • owner set to warner.
  • priority changed from minor to major.
  • status changed from new to assigned.

this is getting annoying

04/15/08 13:23:51 changed by zooko

  • attachment patches added.

04/15/08 13:23:58 changed by zooko

Tue Apr 15 14:20:22 MDT 2008 zooko@zooko.com

  • don't create a new missing-timer task if there is already one there Doing that would mean that the one that is already there will not be cleaned up if the buildslave reattaches before the timer goes off, so this patch probably fixed #211.

New patches:

[don't create a new missing-timer task if there is already one there zooko@zooko.com**20080415202022

Doing that would mean that the one that is already there will not be cleaned up if the buildslave reattaches before the timer goes off, so this patch probably fixed #211.

] hunk ./buildbot/buildslave.py 185 - if self.notify_on_missing and self.parent: + if self.notify_on_missing and self.parent and not self.missing_timer:

Context:

[TAG buildbot-0.7.7 warner@lothar.com**20080330025903] Patch bundle hash: 5b3689d6ecd24a330c01c6955820d16f2a0a1570

04/15/08 13:26:58 changed by zooko

Tue Apr 15 14:20:22 MDT 2008  zooko@zooko.com
  * don't create a new missing-timer task if there is already one there
  Doing that would mean that the one that is already there will not be cleaned up if the buildslave reattaches before the timer goes off, so this patch probably fixed #211.

New patches:

[don't create a new missing-timer task if there is already one there
zooko@zooko.com**20080415202022
 Doing that would mean that the one that is already there will not be cleaned up if the buildslave reattaches before the timer goes off, so this patch probably fixed #211.
] hunk ./buildbot/buildslave.py 185
-        if self.notify_on_missing and self.parent:
+        if self.notify_on_missing and self.parent and not self.missing_timer:

Context:

[TAG buildbot-0.7.7
warner@lothar.com**20080330025903] 
Patch bundle hash:
5b3689d6ecd24a330c01c6955820d16f2a0a1570

04/15/08 13:36:06 changed by zooko

Zandr: thanks for the useful bug report, which told me right where to look.

04/15/08 13:37:50 changed by warner

So, I agree that this is a good thing to do, and I'll apply it, but I don't see how it could be the cause of this problem, because that would imply that we have multiple detach() calls taking place without intervening attach() calls.

We'll give it a try. If it fixes the problem, then that will at least tell me that I understand the problem even worse than I thought..

04/22/08 19:13:06 changed by zooko

Here is an excerpt from the tahoe buildmaster twistd.log:

2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] duplicate slave slave-etch-tahoe replacing old one
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] old slave was connected from IPv4Address(TCP, '208.80.68.108', 2056)
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] new slave is from IPv4Address(TCP, '208.80.68.108', 1767)
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] disconnecting old slave slave-etch-tahoe now
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] waiting for slave to finish disconnecting
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] <Builder 'deb-etch' at 1085436556>.detached slave-etch-tahoe
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] Buildslave slave-etch-tahoe detached from deb-etch
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] <BuildSlave 'slave-etch-tahoe', current builders: deb-etch,etch> removing <SlaveBuilder build
er=deb-etch slave=slave-etch-tahoe>
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] <Builder 'etch' at 1086368908>.detached slave-etch-tahoe
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] Buildslave slave-etch-tahoe detached from etch
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] <BuildSlave 'slave-etch-tahoe', current builders: deb-etch,etch> removing <SlaveBuilder build
er=etch slave=slave-etch-tahoe>
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] BuildSlave.detached(slave-etch-tahoe)
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] Got slaveinfo from 'slave-etch-tahoe'
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] bot attached
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] <BuildSlave 'slave-etch-tahoe', current builders: deb-etch,etch> adding <SlaveBuilder builde
r=deb-etch slave=slave-etch-tahoe>
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] Buildslave slave-etch-tahoe attached to deb-etch
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] <BuildSlave 'slave-etch-tahoe', current builders: deb-etch,etch> adding <SlaveBuilder builde
r=etch slave=slave-etch-tahoe>
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] Buildslave slave-etch-tahoe attached to etch
2008-04-22 15:35:54-0700 [-] maybeStartBuild <Builder 'deb-etch' at 1085436556>: [] [<SlaveBuilder builder=deb-etch slave=slave-etch-tahoe>]
2008-04-22 15:35:54-0700 [-] maybeStartBuild <Builder 'etch' at 1086368908>: [] [<SlaveBuilder builder=etch slave=slave-etch-tahoe>]

This resulted in a mail about the detached buildslave even though it was reattached. Here is a patch which (in combination with the patch that I earlier submitted in this ticket) will hopefully fix this.

Copying pristine 180/204 : buildstep.py
Tue Apr 22 20:03:02 MDT 2008  zooko@zooko.com
  * don't retire a missing-notification-timer until you finish accepting a slave
  Previously, we retired the timer when we began accepting the slave.  However, if we then detected duplicate connection and rejected one of the connections, then the resulting detach would start a new missing-notification-timer.  With this patch, the missing-notification-timer that is started by detaching a connection is soon retired by the other connection leading to acceptance of the buildslave.  So this probably fixes the other way that issue #211 was happening (after the previous patch that was intended to fix #211).
diff -rN -u old-trunk/buildbot/buildslave.py new-trunk/buildbot/buildslave.py
--- old-trunk/buildbot/buildslave.py	2008-04-22 20:12:44.000000000 -0600
+++ new-trunk/buildbot/buildslave.py	2008-04-22 20:12:45.000000000 -0600
@@ -92,10 +92,6 @@
         @return: a Deferred that fires with a suitable pb.IPerspective to
                  give to the slave (i.e. 'self')"""
 
-        if self.missing_timer:
-            self.missing_timer.cancel()
-            self.missing_timer = None
-
         if self.slave:
             # uh-oh, we've got a duplicate slave. The most likely
             # explanation is that the slave is behind a slow link, thinks we
@@ -164,6 +160,10 @@
             self.slave = bot
             log.msg("bot attached")
             self.messageReceivedFromSlave()
+            if self.missing_timer:
+                self.missing_timer.cancel()
+                self.missing_timer = None
+
             return self.updateSlave()
         d.addCallback(_accept_slave)
 

04/22/08 19:13:31 changed by zooko

  • attachment patch added.

09/04/08 18:21:15 changed by warner

  • keywords changed from slave to slave review.

09/05/08 08:50:20 changed by zooko

I submitted some tested and surely correct patches for this in e-mail to Brian Warner a few weeks ago. I intend at some point soon to push those patches directly into buildbot trunk.

09/05/08 16:16:46 changed by zooko

okay [694] and [690]

09/16/08 11:21:23 changed by warner

  • status changed from assigned to closed.
  • resolution set to fixed.

released in 0.7.9