Thursday, June 21, 2007

My hard drive has code?

Have you ever needed to update the code (the firmware) that makes your hard drive work? I had never even stopped to think that hard drives have software of their own until I needed to update it on two of our hard drives.

It all started in March 2006, when our central file server simply stopped doing its job, with nothing in the error logs, forcing us to fail over to our standby server. This "crash" repeated itself several times in March and April 2006, and about a dozen more times through the end of 2006 and into 2007.

We blamed a host of things along the way, including the Linux quota daemon, NFS and bad RAM, until we realized that our disk array would, for no known reason, become overwhelmed with disk writes until it simply stopped. We didn't know why, so we wrote a bunch of scripts that would try to identify an upcoming crash and halt network traffic to give the server a "break". This workaround was effective and let the webmasters sleep better at night, but the underlying problem was still there, and we feared that the added demand of the Europa release would push the server over the edge.
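For the curious, here is a rough idea of what such a watchdog could look like. This is not our actual script, just a minimal Python sketch under a few assumptions: that the write load can be read from the sectors-written counter in Linux's /proc/diskstats, that a crash is preceded by several consecutive samples above some threshold, and that the device name and threshold are whatever fits your setup. The function names are made up for illustration, and the step that actually halts network traffic is left to the caller.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def sectors_written(diskstats_text, device):
    """Return the cumulative sectors-written counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Fields: major, minor, name, then the I/O counters;
        # sectors written is the 10th column (index 9).
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9])
    raise ValueError(f"device {device!r} not found")


def write_rate(before, after, interval_s):
    """Bytes written per second between two counter samples."""
    return (after - before) * SECTOR_BYTES / interval_s


def should_pause(rates, threshold, sustained):
    """True if the last `sustained` samples all exceed `threshold`."""
    recent = rates[-sustained:]
    return len(recent) == sustained and all(r > threshold for r in recent)


def wait_for_overload(device, threshold, sustained=3, interval_s=5.0):
    """Block until the write rate stays above threshold, then return the samples."""
    rates = []
    with open("/proc/diskstats") as f:
        previous = sectors_written(f.read(), device)
    while not should_pause(rates, threshold, sustained):
        time.sleep(interval_s)
        with open("/proc/diskstats") as f:
            current = sectors_written(f.read(), device)
        rates.append(write_rate(previous, current, interval_s))
        previous = current
    return rates[-sustained:]
```

A caller would run something like `wait_for_overload("sda", 50_000_000)` in a loop and, each time it returns, pause traffic for a while by whatever means the site uses. Requiring several sustained samples rather than a single spike is a guess at how to avoid false alarms during normal bursts.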

Long story short: two of the drives in our 14-disk tray were a different model, and there was a code update for them that fixed a problem where the disks "Entered read/write protection mode when a self-test timeout occurs". The update sounded interesting, so I went ahead and applied it to the two disks. Updating disks in a live (and busy) array is scary stuff, but the IBM tools make it easy and painless.

It's still too early to tell if this fixes the crashes we've been having, but I'm confident it does. Talk about weird bugs.

