Server Stability Update

As you will all have noticed, the server has been increasingly unstable lately and finally came to a point where it wouldn’t even stay alive for a day. Clearly not viable and very frustrating.
It was time to try something new, so I decided to spend some downtime experimenting with kernels. However, I was unable to even finish compiling a kernel before it crashed again. Because I was now using a local console, I started spotting an odd “Disabling Interrupt #185” line being emitted shortly before every crash. Sitting just beside the server also allowed me to hear the faint click-click of a hard disk being reset, so the obvious conclusion was that this might be hardware related after all.

So, I pulled out the Promise controller used for the RAID array, which has been offline for a while anyway. I was instantly able to finish the kernel compilation, and the “would not last an hour”-kernel has now been operating for a whopping 3 hours.

This makes me hopeful that the cause of the instability problems has finally been located, and that brighter days lie ahead.

I’ll keep you updated, but right now I’m off to find a replacement for the defunct controller card.

Yet Another Kernel Crash

Linux crashed again – and I hit the reset button at 5:30 am on my way out of the door for two days.

Unfortunately, Gentoo seems to be loading modules in a weird way (that is, not in the order in which they are listed in the modules.autoload files, which is how it used to be), because the network interface cards got switched again. So despite the server actually coming back up at 5:35, I wasn’t able to replug the network cables until I got back. Sorry all for the inconvenience.
I am actually considering dumping the entire setup – it’s no fun running an ill-behaved server. I’ve upgraded to the latest kernel and VServer 2.1.0 final – let’s see if that is able to stay alive for more than a few days.
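
For anyone bitten by the same card shuffle, here is a minimal sketch of how one could check after a reboot whether the interface names still map to the same cards. It assumes a Linux system that exposes /sys/class/net/<name>/address; the function names and the MAC addresses in EXPECTED are made up for illustration and are not taken from my setup.

```python
from pathlib import Path


def interface_macs(sysfs_net="/sys/class/net"):
    """Return {interface name: MAC address} for every entry under sysfs_net."""
    result = {}
    for entry in sorted(Path(sysfs_net).iterdir()):
        address = entry / "address"
        if address.is_file():
            result[entry.name] = address.read_text().strip()
    return result


def swapped(expected, actual):
    """Return the interface names whose current MAC differs from the expected one."""
    return sorted(name for name, mac in expected.items() if actual.get(name) != mac)


# Hypothetical mapping: replace with the real MAC addresses of your own cards.
EXPECTED = {
    "eth0": "00:11:22:33:44:55",
    "eth1": "66:77:88:99:aa:bb",
}

if __name__ == "__main__":
    mismatched = swapped(EXPECTED, interface_macs())
    if mismatched:
        print("Interfaces not on their usual card:", ", ".join(mismatched))
    else:
        print("All interfaces are where they are expected to be.")
```

Run from a boot script, something like this would at least flag a swap without anyone having to stand next to the machine; it does not fix the naming itself.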

Another kernel oops

I believe it is time to get rid of ReiserFS on the server. It seems that all of the recent crashes have been in the reiserfs_clear_inode function, and it’s the only hint I’m able to extract from the kernel dumps as to what could be wrong. ReiserFS used to be very fast and rock solid (even on 64-bit), so I’m surprised it has devolved again. However, now that Hans Reiser looks set to spend 20 years behind bars (for the alleged murder of his wife), one might as well smell the coffee and move to a more reliable alternative.

Nov 9 03:10:47 harmony Unable to handle kernel NULL pointer dereference at 0000000000000001 RIP:
Nov 9 03:10:47 harmony {reiserfs_clear_inode+63}
Nov 9 03:10:47 harmony PGD 14d10067 PUD 6d6d7067 PMD 0
Nov 9 03:10:47 harmony Oops: 0002 [1]
Nov 9 03:10:47 harmony Pid: 152[#0], comm: kswapd0 Not tainted #2
Nov 9 03:10:47 harmony RIP: 0010:[] {reiserfs_clear_inode+63}

Nov 9 10:06:36 harmony Memory: 3025952k/3145408k available (2311k kernel code, 118568k reserved, 854k data, 168k init)
So, the server was offline for roughly 7 hours this time. Clearly not the stability one was used to from Linux :-(
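
Picking the faulting function and the downtime out of the log by hand gets old quickly, so here is a rough sketch of how it could be scripted. It works on the syslog lines quoted above and rests on a few assumptions of mine: the faulting symbol appears as {name+offset}, the first "Memory:" line after an oops marks the next boot, and the year (which syslog does not record) is supplied by the caller. The function names are just illustrative.

```python
import re
from collections import Counter
from datetime import datetime

LOG = """\
Nov 9 03:10:47 harmony Unable to handle kernel NULL pointer dereference at 0000000000000001 RIP:
Nov 9 03:10:47 harmony {reiserfs_clear_inode+63}
Nov 9 03:10:47 harmony Oops: 0002 [1]
Nov 9 03:10:47 harmony RIP: 0010:[] {reiserfs_clear_inode+63}
Nov 9 10:06:36 harmony Memory: 3025952k/3145408k available (2311k kernel code, 118568k reserved, 854k data, 168k init)
"""

SYMBOL = re.compile(r"\{(\w+)\+\d+\}")               # e.g. {reiserfs_clear_inode+63}
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)")  # e.g. Nov 9 03:10:47


def timestamp(line, year=2000):
    """Parse the leading syslog timestamp; the year is an assumption."""
    match = STAMP.match(line)
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def faulting_symbols(lines):
    """Count how often each {symbol+offset} shows up in the given lines."""
    counts = Counter()
    for line in lines:
        match = SYMBOL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def downtime(lines):
    """Time between the first 'Oops:' line and the next 'Memory:' boot line."""
    crash = None
    for line in lines:
        if crash is None and "Oops:" in line:
            crash = timestamp(line)
        elif crash is not None and "Memory:" in line:
            return timestamp(line) - crash
    return None


if __name__ == "__main__":
    lines = LOG.splitlines()
    print(faulting_symbols(lines).most_common(1))
    print(downtime(lines))
```

On the excerpt above this points at reiserfs_clear_inode and gives a gap of a little under seven hours between the oops and the next boot.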