Three months into my new job as a system administrator, disaster struck one of our mission-critical servers last weekend, before I had time to implement any preventive measures. Had we not been fortunate, this breakdown could have forced the company to shut down its business.
Outdated IT Structure
In this company, we have a very outdated IT structure. There are about ten file servers, and each of them stores a different set of data for a different department. All of the servers are at least 7 years old. The one with the highest capacity can hold only about 150GB of data. Most likely, whenever a server’s hard disks filled up, another server was installed to hold more data.
Since each server holds a different set of data, there was no redundancy of the data. The server that broke down last week holds the most critical data in our company. We produce a new product every day based on that data. Without it, we wouldn’t be able to make the product, and we could lose all our clients at once.
Improper and Insufficient Backup Plan
There are also two tape backup devices, one external and one internal. Each tape backup device is responsible for backing up the data on a group of servers. We use a software package to perform the backups every night, and it keeps its own database of what has been backed up. However, the device that backs up the broken server happened to be installed on that very machine. The result? With the server down, the backup software’s database was inaccessible, so we couldn’t retrieve the data from the backup tapes. Even if we had been able to retrieve the data from the tapes, we wouldn’t have had enough time to do so before our deadline.
Whose Fault?
The cause of the whole situation was determined to be a bad memory stick. The server had two 256MB sticks of PC133 ECC memory, and one of them had gone bad. The server requires two sticks at a time to operate, and we were unable to find any spare RAM of the same type in our office. After many hours of thinking and searching for a solution, we ended up pulling a stick of RAM from each of two other servers and installing them in the critical machine temporarily, until we could get replacement memory.
Lessons for Me (and You)
It was a simple problem, but it evolved into a near-fatal situation because of several factors:
- Insufficient stocking of spare parts, especially for outdated hardware
- No real-time redundancy of critical data
- Backup device installed on the same machine as the data being backed up
The first point is easy to address if you keep spare parts on hand. However, if the hardware is too old, spare parts can be very expensive or may not be available at all. It is therefore justifiable to upgrade your hardware to newer technology after a certain period of time, ideally while the old hardware is still working.
When planning for data backups and redundancy, it is important to factor in the amount of time it takes to retrieve data from the backup media. The importance of real-time data redundancy depends on how critical the data is to your daily business operations. If it is highly critical, tape backups alone are not sufficient to keep your data available at all times.
If your company has an IT structure similar to mine, you should start looking into the three areas listed above immediately, before anything goes wrong, which it very likely will.
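To make the restore-time point concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: the function names are my own, the 150GB figure is the capacity of our largest server mentioned above, and the 5 MB/s tape throughput and the 8-hour deadline are placeholder assumptions, not measurements from our hardware. Plug in your own numbers.

```python
def restore_hours(data_gb, throughput_mb_per_s, overhead_hours=0.0):
    """Estimate the hours needed to restore `data_gb` from backup media.

    throughput_mb_per_s: sustained read rate of the media (an assumption
        you should measure on your own equipment).
    overhead_hours: fixed extra time, e.g. rebuilding the backup catalog
        or swapping tapes (hypothetical allowance, defaults to none).
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    transfer_seconds = (data_gb * 1024) / throughput_mb_per_s
    return transfer_seconds / 3600 + overhead_hours


def meets_deadline(data_gb, throughput_mb_per_s, deadline_hours, overhead_hours=0.0):
    """Return True if the estimated restore fits within the deadline."""
    return restore_hours(data_gb, throughput_mb_per_s, overhead_hours) <= deadline_hours


if __name__ == "__main__":
    # Illustrative values only: 150 GB of data, an assumed 5 MB/s tape drive,
    # and an assumed 8-hour window before the daily product is due.
    hours = restore_hours(150, 5)
    print(f"Estimated restore time: {hours:.1f} hours")
    print("Fits the deadline:", meets_deadline(150, 5, deadline_hours=8))
```

If the estimate comes out longer than the time you can afford to be down, that is a sign tape alone won’t do for that data set and some form of real-time redundancy is worth the cost.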