Fiber Link Failure

My ISP currently has a problem with a fiber link that affects 25% of their customers – including me. This means that the link to 217.116.235.96 is currently unavailable. Domains resolving to 217.157.3.210 (the DSL link to the server) can still be used (e.g. by visiting mirror.mertner.com instead of www.mertner.com).
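
If you are unsure which link a given domain currently points at, a small sketch along these lines can tell you. It only uses the two addresses mentioned above; the names `LINKS`, `describe_link` and `describe_host` are made up for illustration.

```python
import socket

# Addresses are taken from the post; the labels are illustrative.
LINKS = {
    "217.116.235.96": "fiber (currently down)",
    "217.157.3.210": "DSL (still reachable)",
}


def describe_link(ip_address: str) -> str:
    """Return which link an IP address belongs to, or 'unknown'."""
    return LINKS.get(ip_address, "unknown")


def describe_host(hostname: str) -> str:
    """Resolve a hostname and describe the link it points at."""
    try:
        ip_address = socket.gethostbyname(hostname)
    except OSError:
        # Covers resolution failures (socket.gaierror is an OSError).
        return f"{hostname}: could not be resolved"
    return f"{hostname} -> {ip_address}: {describe_link(ip_address)}"
```

Calling `describe_host("mirror.mertner.com")` performs a DNS lookup, so the result depends on what the name resolves to at the time.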

The problem should be solved within an hour or two.

Status – Final Update

The server migration seems to be close to complete at this point. FishEye is back online, and SpamAssassin is once again fully operational. I’ve also had time to track down and fix most of the known issues. For instance, Tomcat would die for no apparent reason with an OutOfMemoryError (it turns out it needed a larger PermSize, probably because it’s using a 64-bit JDK). I’ve also fixed a network card driver issue that caused the system to slow down quite a bit, so I am hoping that the system is stabilizing. FTP users have been given new passwords, so if you can’t log in just send me an IM or email.

There are still a few issues to be fixed, so should it happen that a service is down for a minute, please bear with me. If anything is down for more than a few minutes, feel free to start complaining again ;-)

Status – Getting Closer

JIRA and Confluence are back online, and I’ve restored the missing plugins and corrected a few minor issues. FishEye should be back tomorrow too.

Subversion is also back, but the Gentle repository has a new URL (it is now http://www.mertner.com/svn/gentle instead of http://www.mertner.com/svn/repos). Note that the “projects” folder has been stripped on import. This means that you will have to relocate your working copy to the new URL. If you previously checked out /svn/repos/projects/gentle then you should relocate this to /svn/gentle/gentle (I tried to strip out the extra gentle too, but then Subversion would not be able to import it for some reason). The NProf repository has not moved and is still available at /svn/nprof.
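
The URL change described above can be sketched as a small mapping function. This is only an illustration: the names `OLD_ROOT`, `NEW_ROOT`, `new_url` and `relocate_command` are hypothetical, and it assumes that Subversion will accept `svn switch --relocate` against the re-imported repository (it checks the repository UUID, so this may not hold).

```python
# Roots taken from the post; everything else here is an illustrative assumption.
OLD_ROOT = "http://www.mertner.com/svn/repos"
NEW_ROOT = "http://www.mertner.com/svn/gentle"
STRIPPED_FOLDER = "projects"  # top-level folder dropped during the import


def new_url(old_url: str) -> str:
    """Map an old working-copy URL to its new location."""
    if old_url != OLD_ROOT and not old_url.startswith(OLD_ROOT + "/"):
        return old_url  # e.g. /svn/nprof, which has not moved
    parts = [p for p in old_url[len(OLD_ROOT):].split("/") if p]
    if parts and parts[0] == STRIPPED_FOLDER:
        parts = parts[1:]
    return "/".join([NEW_ROOT] + parts)


def relocate_command(old_url: str) -> str:
    """Build the command to run inside the working copy (assumed to work)."""
    return f"svn switch --relocate {old_url} {new_url(old_url)}"
```

For a checkout of /svn/repos/projects/gentle this yields a relocation to /svn/gentle/gentle, matching the example above. If Subversion refuses the relocation, a fresh checkout from the new URL is the fallback.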

Apparently the PHP developers have decided that using recode along with mysql is a bad thing, which has broken international characters on some domains. I’ve tried to convince PHP to compile in support for both anyway, since it used to work just fine, but alas, it just won’t, no matter what.

Status – Day 3

Confluence and JIRA are now back online. The Confluence upgrade from 1.4 to 2.1 caused quite a few problems, as it was unable to upgrade the existing database. Also, the restore process halted at 55% during the “applying special processing” step, but as far as I can see it has restored all content and attachments correctly. If you spot anything amiss with either Confluence or JIRA, please let me know.

Status – Day 2

Mail should now be operational again across all domains. I ran into a ton of problems with the Exim configuration file on the new system (which uses LDAP and virtual mail accounts rather than system users), and it has taken most of today to iron out all of the problems this caused.

There are still a few quirks: SpamAssassin is somehow incapable of setting its home dir, so it cannot find the bayes database (this is not critical, but does mean that more spam than usual will be slipping through, at least until it gets fixed). Some users have received a new login and/or password. If you cannot log in, please catch me on IM to get your updated account information.

Apart from webmail still not being online, email should now be in perfect working order. Let me know if you discover anything that might indicate a problem.

Now, on to the next problem…

Server Upgrade Status

Of course, nothing is ever as easy as expected. I won’t go into too much detail at this hour, but as you can easily observe the server upgrade is still far from complete. I’ll post status updates at regular intervals tomorrow (that is, later today, but after catching some sleep) as the various services come online (web and email support is almost, but not quite, working – sigh).

At this point it seems unlikely that I’ll be able to get all of the Java-based services (Confluence, JIRA and FishEye), Subversion, FTP and webmail restored to working order tomorrow, so please be patient – it will get there as soon as humanly possible.

Web-based services unavailable

Apache was unable to start after a MySQL upgrade from 4.0 to 5.0 (due to invalid library references). Unfortunately, it took quite some time for Gentoo to recompile all the affected libraries required for everything to come back online.

I’m looking forward to the new vserver setup, which I’m confident will make downtime due to issues such as this a thing of the past.

Server Downtime

The server was down today from around 8:30 to 01:00 CET due to a boot manager without a default choice. Actual downtime was roughly 10 minutes, so it’s a bit of a bummer not to have detected this any sooner. However, I was moving to a new flat and didn’t get connected again until this evening. Sorry for any inconvenience this may have caused.

Note that the server will be going down again August 25th from 12:00 to 16:00 CET, as that is when the DSL line is installed and switched over to the new place.

Router Upgrade

The router has been fairly unstable over the last 5 days. Its firmware has just been updated and everything reconfigured from scratch, but only time will tell if this has solved the problem. According to my ISP the old firmware did have a number of bugs in it – I just think it’s strange that these haven’t cropped up before, but there you go.

On another note, the site will be going down for maintenance right about now, but should be back up again within 15 minutes or so. I need to physically move the server, and thus have to disconnect it.

Gentoo Trouble

The server has been unavailable to most users from (at a guess) yesterday evening until 13:30 (CET) today. Services were suddenly being denied access to various system locations (such as /tmp and /dev/null, which are rather vital).

While the exact cause is still being investigated, the most likely culprit is Gentoo’s package management software (Portage). Whether it’s a general bug or "just" an error in one of the package ebuilds I do not know, but I’ll certainly try to find out.

I guess this means it’s time to go for a proper ACL system and Tripwire (or some other integrity checking tool).
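
To give an idea of what an integrity checking tool does, here is a minimal sketch. It is not how Tripwire works internally, just the basic idea: record permission bits and content hashes once, then compare a later snapshot against that baseline. The function names `snapshot` and `changed_paths` are made up for this example.

```python
import hashlib
import stat
from pathlib import Path
from typing import Dict, List


def snapshot(root: Path) -> Dict[str, str]:
    """Record permission bits (and a SHA-256 for regular files) under root."""
    entries = {}
    for path in sorted(root.rglob("*")):
        info = path.lstat()
        mode = oct(stat.S_IMODE(info.st_mode))
        digest = ""
        if stat.S_ISREG(info.st_mode):
            # Only regular files are hashed; devices and directories are not read.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
        entries[str(path.relative_to(root))] = f"{mode}:{digest}"
    return entries


def changed_paths(baseline: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return paths that were added, removed or modified, in sorted order."""
    paths = set(baseline) | set(current)
    return sorted(p for p in paths if baseline.get(p) != current.get(p))
```

Taking a `snapshot` before a package update and another one afterwards, then passing both to `changed_paths`, would list any entries whose permissions or contents changed in between.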