Many folks have been frustrated by the fact that network connectivity problems (especially flaky firewalls) are likely to cause a build to get interrupted. This is particularly annoying when each build takes several hours to complete.
One short-term step is to decrease the frequency of false disconnects, specifically the one addressed by SF#1500669.
But the long-term solution is to move back to the less connection-oriented approach used in the early buildbot days (and abandoned because it was too hard to make it work properly). In those days, each build was given a number, each piece of RemoteCommand output was queued for delivery to the buildmaster if and when a connection was established, and the build number was used as a target rather than a RemoteReference that would go away once the connection was lost.
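As a rough illustration of the queued-delivery idea, here is a minimal sketch in plain Python. It is not Buildbot code: the class UpdateQueue, its methods, and the per-update sequence number are all hypothetical names invented for this example, and the real transport (Twisted PB) is replaced by a simple callable.

```python
from collections import deque


class UpdateQueue:
    """Hypothetical slave-side buffer; not part of Buildbot's API."""

    def __init__(self):
        self.pending = deque()  # (build_number, seq, data) awaiting delivery
        self.send = None        # callable while connected, None otherwise
        self._seq = 0           # lets the receiver detect duplicates

    def add(self, build_number, data):
        """Queue one piece of command output, addressed by build number."""
        self._seq += 1
        self.pending.append((build_number, self._seq, data))
        self.flush()

    def connected(self, send):
        """Call when a connection is (re)established."""
        self.send = send
        self.flush()

    def disconnected(self):
        """Call when the connection is lost; queued output is kept."""
        self.send = None

    def flush(self):
        """Deliver queued updates in order for as long as sending works."""
        while self.send is not None and self.pending:
            update = self.pending[0]
            try:
                self.send(*update)
            except ConnectionError:
                self.disconnected()
                return
            # Drop the update only after the send call returned.
            self.pending.popleft()
```

Because an update is removed only after the send succeeds, a connection that drops mid-send can cause the same update to be delivered twice. In this sketch the receiving side would be expected to discard repeats using the build number and sequence number.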
The biggest issue is for each end to be able to determine which "incarnation" of the other it is currently talking to. If the buildmaster sees the buildslave go away, was it because of a network problem (in which case both programs remain running), or was it because the buildslave process stopped for some reason? In the latter case, the step needs to be restarted, and some steps may not handle this well (I seem to remember CVS getting confused if you interrupted it at some critical moment).
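One possible way to tell the two cases apart, sketched below, is for each process to pick a random identifier at startup and announce it whenever a connection is made. This is only an assumption about how it could work, not an existing Buildbot mechanism; Peer and classify_reconnect are hypothetical names.

```python
import uuid


class Peer:
    """Hypothetical stand-in for a buildmaster or buildslave process."""

    def __init__(self):
        # Chosen once per process start, so it changes only on restart.
        self.incarnation = uuid.uuid4().hex


def classify_reconnect(last_seen, current):
    """Compare the incarnation seen before the disconnect with the new one."""
    if last_seen is None:
        return "first-contact"
    if last_seen == current:
        # Same process on the other end: only the network went away,
        # so a running step can carry on.
        return "network-blip"
    # Different process: whatever was running there is gone,
    # so the step has to be restarted.
    return "process-restarted"
```

For the buildmaster to make this comparison after its own restart, it would have to persist the last incarnation it saw for each buildslave.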
This will probably be made slightly easier by having a build status database (#24), since we could persist more things from one run of the buildmaster to the next (like which builds were still running and which had finished).