Ticket #255 (closed defect: fixed)
Immediate SIGHUP on slave commands, but not with with 'usepty=0' or 'twistd --spew'
| Reported by: | strank | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | 0.7.10 |
| Version: | 0.7.9 | Keywords: | |
| Cc: | strank, afri, dustin, dwlocks |
Description
A week ago or so, one buildslave started getting immediate SIGHUP on a task. I think this is related to issues surrounding using PTY and closing stdin also mentioned by warner in #198, so I am posting this here as another data point.
An excerpt of the twistd.log:
2008/04/25 15:57 +0200 [Broker,client] ShellCommand._startCommand
2008/04/25 15:57 +0200 [Broker,client] rm -rf /home/wrstl/Buildbot/slave-atuin/bbbatuin-bot/build
2008/04/25 15:57 +0200 [Broker,client] in dir /home/wrstl/Buildbot/slave-atuin/bbbatuin-bot (timeout 1200 secs)
<snip>
2008/04/25 15:57 +0200 [Broker,client] closing stdin
2008/04/25 15:57 +0200 [Broker,client] using PTY: True
2008/04/25 15:57 +0200 [Broker,client] ShellCommandPP.connectionMade
2008/04/25 15:57 +0200 [Broker,client] assigning self.command.process: <PTYProcess pid=12493 status=-1>
2008/04/25 15:57 +0200 [Broker,client] closing stdin
2008/04/25 15:57 +0200 [-] ShellCommandPP.processEnded [Failure instance: Traceback (failure with no frames):
<class 'twisted.internet.error.ProcessTerminated'>: A process has ended with a probable error condition: process ended by signal 1. ]
2008/04/25 15:57 +0200 [-] command finished with signal 1, exit code None
2008/04/25 15:57 +0200 [-] _checkAbandoned [Failure instance: Traceback: <class 'buildbot.slave.commands.AbandonChain'>: -1
/usr/lib/python2.5/site-packages/buildbot/slave/commands.py:158:processEnded
/usr/lib/python2.5/site-packages/buildbot/slave/commands.py:482:finished
/usr/lib/python2.5/site-packages/twisted/internet/defer.py:239:callback
/usr/lib/python2.5/site-packages/twisted/internet/defer.py:304:_startRunCallbacks
--- <exception caught here> ---
/usr/lib/python2.5/site-packages/twisted/internet/defer.py:317:_runCallbacks
/usr/lib/python2.5/site-packages/buildbot/slave/commands.py:707:_abandonOnFailure
]
2008/04/25 15:57 +0200 [-] abandoning chain -1
So an rm -rf xyz is interrupted immediately upon starting with a SIGHUP. A day earlier the same task completed normally, and the same task continues to work on other slaves. This particular rm is part of a Darcs command, but sporadically it would succeed and the next task would be interrupted with a HUP.
The change that triggered the 'bug' (not sure who's bug it is) was an update of Gentoo Linux, updating (among some others) the gentoo packages:
sys-apps/coreutils-6.9-r1 -> sys-apps/coreutils-6.10-r1 sys-libs/readline-5.2_p7 -> sys-libs/readline-5.2_p12-r1 sys-apps/baselayout-1.12.10-r5 -> sys-apps/baselayout-1.12.11.1 dev-lang/python-2.5.1-r5 -> dev-lang/python-2.5.2-r2
At the time, I had dev-util/buildbot-0.7.5 installed.
- The first thing I tried was upgrading to dev-util/buildbot-0.7.7 (master and all slaves, updating all configs). This did not help.
- Then I found the notes about closing stdin in the code. Commenting out all two calls to self.transport.closeStdin() in buildbot.slave.commands.ShellCommandPP did not help (although I believe that the sporadic successes of one task started then, but I am sorry I did not record that so I am not sure).
- Then I tried to diagnose by starting buildbot with twistd --spew... the problem was gone. (with and without the closeStdin())
- To avoid immense logfiles, I also tried setting usepty=0 in buildbot.tac, which also resolves the issue.
Hope this helps :-)
![[Buildbot Logo]](/trac/chrome/site/header-text-transparent.png)