Cluster Failures



No, “cluster failures” is not a euphemism for a problem encountered in the military service and widely known to be caused by officers with clusters on their shoulders. Nor does this page relate to failures of computing clusters.

This is about failures of independent computer systems and other devices that occur very close together in time. In some cases, the failures may also be of a similar type, such as a series of hard drive or motherboard failures.

And recently I really have had a cluster of problems, some with computers and some not. Actually a cluster of clusters.

I have been repairing things for about sixty years. I started with TVs, ours and our friends’ and neighbors’. Then I worked at a couple of now-defunct audio stores. I went on to fixing unit record equipment and computers for IBM. And when personal computers started hitting the streets, I started fixing those as well, both hardware and software.

Cluster failures are a real thing. When working at the audio stores, there would be days or weeks when the most common things people brought in for repair were receivers. And they were frequently similar failures such as the power output stage, or the IF stage. Other days it would be turntables with broken belts or component tuners that would not tune.

When I worked for IBM, there was one week when I fixed three punched card readers with nearly identical feed failures. And later, when working at the IBM PC National Support Center, there were days when most of the calls I fielded were memory (RAM) problems.

Sure, randomness abounds in the universe, including the failures of electronic and electromechanical devices. But sometimes that randomness expresses itself in clusters of failures.

The Church Cluster

It all began on Wednesday, September 9, when the server at my church began failing; really crashing hard. The computer we use as a firewall was also failing with intermittent crashes. In addition, one of the office computers froze up and the telephone system began failing.

It took a couple of days for this to play out and for me to repair the computers. I don’t do the phone system; someone else does that.

The server, a donated Dell, was clearly having hard drive problems. There were specific errors on the console pointing to one of the hard drives. Murphy rules, though, and there were no errors recorded in the logs because the failing hard drive was where the logs were kept. And because I was not on site, I had to rely on the person who rebooted the server for me, who could not read the console errors: the display had timed out into power-saving mode, and pressing keys on the keyboard did not wake it.

In addition, the computer we use for a firewall was locked up so I rebooted it.

I took the server home to rebuild and by the next day had mostly completed that task, but not without discovering serious hardware issues. One of the hard drives had failed catastrophically. That was easy enough to fix. But in attempting to install a new operating system on the replacement hard drive, it became obvious that there were other problems. The motherboard was also failing, and it was impossible to boot or even to get through the BIOS POST. So I installed the spare Dell motherboard we kept on hand for just this event, and was able to proceed.

I was able to restore the data for our web site and email servers from the good and well-tested backups I designed. I was also able to restore the data for the DHCP and name service (BIND) servers.

However, soon after I returned home with the server that I needed to rebuild, a different computer, the firewall, began locking up more frequently. The office staff were still able to get out to the Internet so long as that firewall was working, but our web site and email were down because they are all housed on our server. Without the firewall, all external access was gone.

The next day, Thursday, I returned to the church and installed the rebuilt server, which worked fine.

However, after installing the server, the firewall started failing so frequently that I could not leave the premises before it would do so again. I made a quick trip back home to obtain a spare computer that had been given to me and had been used as a firewall itself. I installed it at the church, made a couple of simple configuration changes, and the replacement firewall was up and running.

The Home Cluster

While all of that was going on at the church, my home network was also embroiled in a cluster of problems.

First, my own server started failing. One of the four 1GB memory DIMMs had failed. One of my workstations had a motherboard failure, and another system developed a defective power supply. A fan then failed on my server, and a video adapter failed on a different workstation. And, oh yes, a hard drive failed on my own workstation. And then a hard drive failed on my server.

And don’t get me started on my refrigerator and car.

What’s it all about, Alfie?

All of this took place within the space of a week, both at church and home. So it was a very trying week.

But what does it mean?

Well, as much as we like to assign meaning to things, there really is none. Things fail. Most of the time they work for years without a problem. Sometimes the failures are spread out evenly over time, and sometimes many things seem to fail at once.

So, sometimes when you get something fixed one day and it fails with another problem the next, that is just the randomness of the universe in which we live.
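For anyone who wants to see that kind of randomness for themselves, here is a minimal sketch in Python. It is only an illustration, not a model of my actual equipment: the number of devices, the daily failure chance, the ten-year span and the function names are all made-up assumptions. Each device fails independently of every other device, yet the weekly tallies still come out lumpy.

```python
import random


def simulate_weekly_failures(devices=20, daily_fail_prob=0.002, weeks=520, seed=1):
    """Return a list holding the number of failures seen in each week.

    All of the default values are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    weekly = []
    for _ in range(weeks):
        count = 0
        # One independent trial per device per day of the week.
        for _ in range(devices * 7):
            if rng.random() < daily_fail_prob:
                count += 1
        weekly.append(count)
    return weekly


def summarize(weekly):
    """Tally how many weeks saw 0, 1, 2, ... failures."""
    tally = {}
    for failures in weekly:
        tally[failures] = tally.get(failures, 0) + 1
    return dict(sorted(tally.items()))


if __name__ == "__main__":
    weekly = simulate_weekly_failures()
    for failures, count in summarize(weekly).items():
        print(f"{failures} failures: {count} weeks")
```

With these assumed numbers the average works out to 20 x 7 x 0.002 = 0.28 failures per week, so most weeks should be quiet. The point of the tally is that a handful of weeks can still show two or more failures, even though nothing connects one failure to another.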

And now, weeks after the events described, all is well with the computers at church and at home, and with my car and fridge, until the next time.