mertner.com

Stability seems to be back!

— Morten @ 16:19

After eliminating the faulty hardware controller, I did see one more kernel crash. As a consequence, I changed the kernel CPU target to generic i686 (as opposed to AMD64), and the server has now been running smoothly for 8 days.

So, if anyone else is having stability problems with Linux when using an AMD-optimized kernel, I can only recommend recompiling it with a generic CPU target. I think it’s sad that a change like this can cause so many problems – it just goes to show that software development practices still have a long way to go before software “just works”. The kernel was compiled using gcc 4.1.1, which is the most likely culprit.
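
For anyone wanting to try the same on a Gentoo box, the change boils down to something like the following (assuming the usual /usr/src/linux symlink and a manually managed kernel; the exact menu location can vary between kernel versions):

    # Pick a generic processor family instead of the CPU-specific one
    cd /usr/src/linux
    make menuconfig    # Processor type and features -> Processor family
    # Rebuild and install the kernel and its modules
    # (remember to update the bootloader entry afterwards and reboot)
    make && make modules_install && make install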

Server Stability Update

— Morten @ 18:37

As you will all have noticed, the server has been increasingly unstable lately and finally reached the point where it wouldn’t even stay alive for a day. Clearly not viable, and very frustrating.

It was time to try something new, so I decided to spend some of the downtime experimenting with kernels. However, I was unable to even finish compiling a kernel before it crashed again. Because I was now using a local console, I started spotting an odd “Disabling IRQ #185” line being emitted shortly before every crash. Sitting right beside the server also allowed me to hear the faint click-click of a hard disk being reset, so the obvious conclusion was that this might be hardware related after all.
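
If you are seeing the same message, matching the IRQ number to a device is straightforward (the IRQ number is of course specific to my box):

    # See which driver currently owns IRQ 185
    grep 185 /proc/interrupts
    # Look for the "Disabling IRQ" warning and any related driver errors
    dmesg | grep -i irq
    # List PCI devices and their drivers to identify the offending controller
    lspci -v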

So, I pulled out the Promise controller used for the RAID array, which had been offline for a while anyway. I was instantly able to finish the kernel compilation, and the “would not last an hour” kernel has now been running for a whopping 3 hours.

This makes me hopeful that the cause of the instability problems has finally been located, and that brighter days lie ahead.

I’ll keep you updated, but right now I’m off to find a replacement for the defunct controller card…

Yet Another Kernel Crash

— Morten @ 22:18

Linux crashed again – and I hit the reset button at 5:30 am on my way out of the door for two days.

Unfortunately, Gentoo seems to be loading modules in a weird way (that is, not in the order in which they are listed in the modules.autoload files, which is how it used to work), judging by the fact that the network interface cards got their names switched again. So despite the server actually coming back up at 5:35, I wasn’t able to replug the network cables until I got back. Sorry all for the inconvenience.
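
The proper fix is probably to stop relying on module load order altogether and pin the interface names to the cards’ MAC addresses with a udev rule, roughly like this (the file name and MAC addresses are placeholders, and older udev versions spell the match key SYSFS{address} rather than ATTR{address}):

    # /etc/udev/rules.d/10-network.rules
    # Pin each NIC to a fixed name based on its MAC address
    SUBSYSTEM=="net", ATTR{address}=="00:11:22:33:44:55", NAME="eth0"
    SUBSYSTEM=="net", ATTR{address}=="00:66:77:88:99:aa", NAME="eth1"
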
I am actually considering dumping the entire setup – it’s no fun running an ill-behaved server. I’ve upgraded to the latest kernel and VServer 2.1.0 final – let’s see if that is able to stay alive for more than a few days.

Another kernel oops

— Morten @ 12:37

I believe it is time to get rid of ReiserFS on the server. It seems that all of the recent crashes have been in the reiserfs_clear_inode function, and it’s the only hint I’m able to extract from the kernel dumps as to what could be wrong. ReiserFS used to be very fast and rock solid (even on 64-bit), so I’m surprised it has regressed again. However, now that Hans Reiser looks set to spend 20 years behind bars (for the alleged murder of his wife), one might as well smell the coffee and move to a more reliable alternative.

Nov 9 03:10:47 harmony Unable to handle kernel NULL pointer dereference at 0000000000000001 RIP:
Nov 9 03:10:47 harmony {reiserfs_clear_inode+63}
Nov 9 03:10:47 harmony PGD 14d10067 PUD 6d6d7067 PMD 0
Nov 9 03:10:47 harmony Oops: 0002 [1]
Nov 9 03:10:47 harmony Pid: 152[#0], comm: kswapd0 Not tainted 2.6.16.19-vs2.1.1-rc22-gentoo #2
Nov 9 03:10:47 harmony RIP: 0010:[] {reiserfs_clear_inode+63}

Nov 9 10:06:36 harmony Memory: 3025952k/3145408k available (2311k kernel code, 118568k reserved, 854k data, 168k init)

So, the server was offline for roughly 7 hours this time. Clearly not the stability one was used to from Linux :-(
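
As for getting rid of ReiserFS, the plan is nothing fancy: create a new file system on a spare partition, copy the data across, and update fstab. Roughly like this – device and mount names are placeholders, and ext3 is merely my current favourite:

    # Create and mount the replacement file system
    mkfs.ext3 /dev/sdb1
    mount /dev/sdb1 /mnt/newfs
    # Copy the data, preserving ownership, permissions and hard links
    rsync -aH --numeric-ids /home/ /mnt/newfs/
    # Finally change the corresponding /etc/fstab entry to ext3 and remount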

Disk Failure

— Morten @ 02:40

The server just crashed for some reason – I really wish they’d put out some stable releases! Linux didn’t use to be crash-prone, and it’s horrible when you can’t just rely on it being there when you need it.

The RAID got degraded and is currently offline – until I can find a spare or figure out how to tell EVMS to retry using the existing disk, which seems to be in perfect working order after all.
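
I have not found the right EVMS incantation yet; for comparison, on a plain mdadm-managed array re-adding a disk that turns out to be healthy would look roughly like this (array and partition names are placeholders):

    # Check the current state of the degraded array
    mdadm --detail /dev/md0
    # Re-add the disk that was kicked out but appears to be fine
    mdadm --manage /dev/md0 --re-add /dev/sdb1
    # Watch the rebuild progress
    cat /proc/mdstat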

Update: The server will be taken offline on Sunday (October 22nd) at 14:00 in order to add a new spare to the RAID array. Since I don’t have a whole bunch of 500 GB disks lying around, I may need to reboot the server a few times while shuffling disks around, but I’ll try to keep it to a minimum.

Software Updates

— Morten @ 11:56

I’ve just finished updating Tomcat, JIRA, Confluence and FishEye/Crucible – let me know if you spot anything that needs fixing as a result of this.

This may have caused a few strange “bad gateway” errors for visitors to those services – I apologize for the inconvenience.

PS: If anyone knows how to make Tomcat’s APR work, or how to configure syslog logging for JIRA/Atlassian, do send me an email ;)
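
For the syslog part, what I imagine is needed is something along these lines in JIRA’s log4j.properties – untested, and the appender name and facility are pure guesswork on my part:

    # Hypothetical log4j 1.x setup routing JIRA's logging to the local syslog daemon
    log4j.rootLogger=WARN, syslog
    log4j.appender.syslog=org.apache.log4j.net.SyslogAppender
    log4j.appender.syslog.SyslogHost=localhost
    log4j.appender.syslog.Facility=LOCAL1
    log4j.appender.syslog.layout=org.apache.log4j.PatternLayout
    log4j.appender.syslog.layout.ConversionPattern=jira: %-5p [%c{1}] %m%n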

Kernel Upgrade

— Morten @ 08:56

I’ve upgraded the kernel to 2.6.18 and had to reboot the server for this, causing a bit of downtime this morning.

I also took time out to perform a deep check of all file systems, which took a while and caused a minor delay in getting back online. Apart from a single unlinked sector, everything seemed to be fine though.
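
The “deep check” simply means forcing a full fsck rather than trusting the journal replay; from single-user mode or a rescue shell it amounts to something like this (device names are examples, and ReiserFS volumes get the dedicated tool):

    # Force a full check even if the file system is flagged clean
    fsck -f -C /dev/sda3
    # Equivalent for a ReiserFS volume
    reiserfsck --check /dev/sda4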

Downtime

— Morten @ 21:19

Amazing what can go wrong just by rebooting the server. Apart from the troubles with glibc, suddenly LDAP wasn’t working and MySQL refused to start. At first I suspected the new kernel was responsible, but after many hours of troubleshooting it turned out to be a whole series of other things.

First, openssl 0.9.8 broke everything linked to 0.9.7 due to API changes, which resulted in much recompilation and waiting.
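
On Gentoo, the recompilation part boils down to letting revdep-rebuild from gentoolkit find and re-emerge everything still linked against the removed 0.9.7 libraries:

    # Install the helper tools if they are missing
    emerge gentoolkit
    # Scan for binaries with broken library links and rebuild the owning packages
    revdep-rebuild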

Next, MySQL needed to be started manually and the mysql_fix_privilege_tables script executed to upgrade the privilege tables. One wonders why it couldn’t do this automatically. More by chance and sheer persistence than anything else, I eventually came across a blog post explaining how to fix the problem, and 2 minutes later it was up and running again.
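
For anyone hitting the same thing after a MySQL upgrade, the manual fix amounted to roughly the following (the exact invocation may differ a little between versions):

    # Start the server by hand, since the init script kept giving up
    /etc/init.d/mysql start
    # Bring the privilege tables up to the layout the new version expects
    # (the script may need the MySQL root password supplied)
    mysql_fix_privilege_tables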

The system does appear to be stable and in working condition with all packages updated; however, this has certainly been a -5 score for Gentoo as a production/server platform.

Update Complete

— Morten @ 17:13

I just added an additional 1 GB of memory to the server, fixed the glibc problem from the last post, and upgraded the kernel (to a hardened 2.6.17 with vserver support). Let me know if you spot any problems with the new setup.

Server Issue

— Morten @ 01:46

Updating glibc in the host environment on the server seems to have broken a library dependency, causing basically everything to stop working, with the exception of processes that were already running.

The weird thing about having a Linux-VServer installation is that only the root operating system is hit – all the virtual servers have their own files and aren’t affected. However, I cannot open new SSH sessions to the machine (sshd runs in the root environment), and cron jobs are also likely failing (but all the interesting ones are in the guests anyway).

I need to get up early tomorrow and so cannot be bothered to boot from a CD to fix this right away, but do expect the server to be unavailable for an hour some time tomorrow (presumably late afternoon) as I perform the required recovery steps.
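
For the curious, the recovery will roughly consist of booting a live CD, chrooting into the installation and re-emerging glibc from there – device and mount points are placeholders, and this assumes the chroot still has a working shell; otherwise the damaged files have to be repaired straight from the live environment:

    # From the live CD: mount the server's root file system
    mount /dev/sda3 /mnt/gentoo
    mount -t proc proc /mnt/gentoo/proc
    # Enter the installation and reinstall glibc
    chroot /mnt/gentoo /bin/bash
    emerge --oneshot glibc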