Ticket #255 (new defect)

Opened 7 months ago

Last modified 2 weeks ago

Immediate SIGHUP on slave commands, but not with with 'usepty=0' or 'twistd --spew'

Reported by: strank Assigned to:
Priority: major Milestone: undecided
Component: buildprocess Version: 0.7.9
Keywords: Cc: strank, afri, dustin, dwlocks

Description

A week ago or so, one buildslave started getting immediate SIGHUP on a task. I think this is related to issues surrounding using PTY and closing stdin also mentioned by warner in #198, so I am posting this here as another data point.

An excerpt of the twistd.log:

2008/04/25 15:57 +0200 [Broker,client] ShellCommand._startCommand
2008/04/25 15:57 +0200 [Broker,client]  rm -rf /home/wrstl/Buildbot/slave-atuin/bbbatuin-bot/build
2008/04/25 15:57 +0200 [Broker,client]   in dir /home/wrstl/Buildbot/slave-atuin/bbbatuin-bot (timeout 1200 secs)
<snip>
2008/04/25 15:57 +0200 [Broker,client]   closing stdin
2008/04/25 15:57 +0200 [Broker,client]   using PTY: True
2008/04/25 15:57 +0200 [Broker,client] ShellCommandPP.connectionMade
2008/04/25 15:57 +0200 [Broker,client]  assigning self.command.process: <PTYProcess pid=12493 status=-1>
2008/04/25 15:57 +0200 [Broker,client]  closing stdin
2008/04/25 15:57 +0200 [-] ShellCommandPP.processEnded [Failure instance: Traceback (failure with no frames):
<class 'twisted.internet.error.ProcessTerminated'>: A process has ended with a probable error condition: process ended by signal 1. ]
2008/04/25 15:57 +0200 [-] command finished with signal 1, exit code None
2008/04/25 15:57 +0200 [-] _checkAbandoned [Failure instance: Traceback: <class 'buildbot.slave.commands.AbandonChain'>: -1
        /usr/lib/python2.5/site-packages/buildbot/slave/commands.py:158:processEnded
        /usr/lib/python2.5/site-packages/buildbot/slave/commands.py:482:finished
        /usr/lib/python2.5/site-packages/twisted/internet/defer.py:239:callback
        /usr/lib/python2.5/site-packages/twisted/internet/defer.py:304:_startRunCallbacks
        --- <exception caught here> ---
        /usr/lib/python2.5/site-packages/twisted/internet/defer.py:317:_runCallbacks
        /usr/lib/python2.5/site-packages/buildbot/slave/commands.py:707:_abandonOnFailure
        ]
2008/04/25 15:57 +0200 [-]  abandoning chain -1

So an rm -rf xyz is interrupted immediately upon starting with a SIGHUP. A day earlier the same task completed normally, and the same task continues to work on other slaves. This particular rm is part of a Darcs command, but sporadically it would succeed and the next task would be interrupted with a HUP.

The change that triggered the 'bug' (not sure who's bug it is) was an update of Gentoo Linux, updating (among some others) the gentoo packages:

sys-apps/coreutils-6.9-r1      -> sys-apps/coreutils-6.10-r1
sys-libs/readline-5.2_p7       -> sys-libs/readline-5.2_p12-r1
sys-apps/baselayout-1.12.10-r5 -> sys-apps/baselayout-1.12.11.1
dev-lang/python-2.5.1-r5       -> dev-lang/python-2.5.2-r2

At the time, I had dev-util/buildbot-0.7.5 installed.

  • The first thing I tried was upgrading to dev-util/buildbot-0.7.7 (master and all slaves, updating all configs). This did not help.
  • Then I found the notes about closing stdin in the code. Commenting out all two calls to self.transport.closeStdin() in buildbot.slave.commands.ShellCommandPP did not help (although I believe that the sporadic successes of one task started then, but I am sorry I did not record that so I am not sure).
  • Then I tried to diagnose by starting buildbot with twistd --spew... the problem was gone. (with and without the closeStdin())
  • To avoid immense logfiles, I also tried setting usepty=0 in buildbot.tac, which also resolves the issue.

Hope this helps :-)

Change History

05/02/08 04:42:31 changed by strank

  • version changed from 0.7.6 to 0.7.7.

05/02/08 04:42:42 changed by strank

  • summary changed from Immediate SIGHUP on slave commands, but not with with 'usepty=0'^ or 'twistd --spew' to Immediate SIGHUP on slave commands, but not with with 'usepty=0' or 'twistd --spew'.

09/26/08 14:23:01 changed by afri

  • cc changed from strank to strank, afri.

09/27/08 17:55:10 changed by dustin

  • cc changed from strank, afri to strank, afri, dustin, dwlocks.

I see this too, also on Gentoo.

10/22/08 03:57:02 changed by fabrice

  • version changed from 0.7.7 to 0.7.8.

I use buildbot 0.7.8 on Debian Linux. My slave buildbot.tac has usepty = 1 and I also see that my 'rm -rf <dir>' command (the first one used by all my builds) receives signal 1 that is 'HUP' signal. Exactly the same error message in my twistd.log file, sporadically.

I need to use a PTY because I run legacy TCL/Expect test scripts on that build and if I disable PTY (usepty = 0 in buildbot.tac) these test scripts will not work anymore and write this strange message:

error writing "stdout": bad file number

while executing

So am I stuck with usepty = 1 and inexplicable HUP signals.

That's funny because I thought that by replacing the old daily build application with Buildbot, it would have saved me from this, but I am stuck again.

11/04/08 00:07:20 changed by fabrice

  • version changed from 0.7.8 to 0.7.9.

I use buildbot 0.7.9 on Debian Linux (Lenny) with usepty = 1. I have modified my local slave/commands.py file and added "nohup" before "rm" and restarted my buildbot slave:

% grep -n nohup /usr/share/pyshared/buildbot/slave/commands.py
1392:        command = ["nohup", "rm", "-rf", d]

I have run buildbot with this change for two weeks and it seems to have solved the problem of immediate SIGHUP on slave commands (in my case "rm" was the slave command involved). However, I have noticed something else, It seems that "rm" runs two times before the actual cvs checkout:

starting cvs operation
nohup rm -rf /home/bbot1/builds/products/207MW_daily/build
 in dir /home/bbot1/builds/products/207MW_daily (timeout 1800 secs)
 watching logfiles {}
 argv: ['nohup', 'rm', '-rf', u'/home/bbot1/builds/products/207MW_daily/build']
 environment:
...
 closing stdin
 using PTY: True
nohup: ignoring input and appending output to `nohup.out'
elapsedTime=0.111772
nohup rm -rf /home/bbot1/builds/products/207MW_daily/build
 in dir /home/bbot1/builds/products/207MW_daily (timeout 1800 secs)
 watching logfiles {}
 argv: ['nohup', 'rm', '-rf', u'/home/bbot1/builds/products/207MW_daily/build']
 environment:
...
 closing stdin
 using PTY: True
nohup: ignoring input and appending output to `nohup.out'
elapsedTime=0.006152

/usr/bin/cvs -d :ext:buildbot@dev-cvs:/usr/local/cvs/linux -z3 checkout -d build -r HEAD ...
...