Confluence and JIRA are now back online. The Confluence upgrade from 1.4 to 2.1 caused quite a few problems, as it was unable to upgrade the existing database. Also, the restore process halted at 55% during the “applying special processing” step, but as far as I can see it has restored all content and attachments correctly. If you spot anything amiss with either Confluence or JIRA, please let me know.
Mail should now be operational again across all domains. I ran into a ton of problems with the Exim configuration file on the new system (which uses LDAP and virtual mail accounts rather than system users), and it has taken most of today to iron out all of the problems this caused.
There are still a few quirks: SpamAssassin is somehow incapable of setting its home dir, so it cannot find the Bayes database (this is not critical, but does mean that more spam than usual will be slipping through, at least until it gets fixed). Some users have received a new login and/or password. If you cannot log in, please catch me on IM to get your updated account information.
Apart from webmail still not being online, email should now be in perfect working order. Let me know if you discover anything that might indicate a problem.
Now, on to the next problem…
Of course, nothing is ever as easy as expected. I won’t go into too much detail at this hour, but as you can easily observe, the server upgrade is still far from complete. I’ll post status updates at regular intervals tomorrow (that is, later today, but after catching some sleep) as the various services come online (web and email support are almost, but not quite, working – sigh).
At this point it seems unlikely that I’ll be able to get all of the Java-based services (Confluence, JIRA and FishEye), Subversion, FTP and webmail restored to working order tomorrow, so please be patient – it will get there as soon as humanly possible.
Apache was unable to start after a MySQL upgrade from 4.0 to 5.0 (due to invalid library references). Unfortunately, it took quite some time for Gentoo to recompile all the affected libraries required for everything to come back online.
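For anyone curious what "invalid library references" looks like in practice: running ldd against a binary lists each shared library it needs, and marks the ones that can no longer be resolved as "not found". The snippet below is only a minimal sketch of picking those out of ldd's output. The sample text is made up for illustration and was not captured from this server, and the function name is my own. On Gentoo, revdep-rebuild from gentoolkit automates the same kind of search across the whole system.

```python
import re


def missing_libraries(ldd_output: str) -> list[str]:
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "<tab>libfoo.so.1 => not found"
        match = re.match(r"\s*(\S+)\s+=>\s+not found", line)
        if match:
            missing.append(match.group(1))
    return missing


# Illustrative ldd output only (hypothetical library versions and addresses).
SAMPLE = """\
\tlibmysqlclient.so.12 => not found
\tlibc.so.6 => /lib/libc.so.6 (0x40021000)
"""

print(missing_libraries(SAMPLE))
```

Anything this reports has to be rebuilt against the new library version before the program will start again.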
I’m looking forward to the new vserver-setup, which I’m confident will make downtime due to issues such as this a thing of the past.
The server is being physically relocated in a bit, as announced earlier. It should be back up again shortly…
The server was down today from around 8:30 to 01:00 CET due to a boot manager without a default choice. Actual downtime was roughly 10 minutes, so it’s a bit of a bummer not to have detected this any sooner. However, I was moving to a new flat and didn’t get connected again until this evening. Sorry for any inconvenience this may have caused.
Note that the server will be going down again August 25th from 12:00 to 16:00 CET, as that is when the DSL line is installed and switched over to the new place.
The router has been fairly unstable over the last 5 days. Its firmware has just been updated and everything reconfigured from scratch, but only time will tell if this has solved the problem. According to my ISP the old firmware did have a number of bugs in it – I just think it’s strange that these haven’t cropped up before, but there you go.
On another note, the site will be going down for maintenance right about now, but should be back up again within 15 minutes or so. I need to physically move the server, and thus have to disconnect it.
The server has been unavailable to most users from roughly yesterday evening until 13:30 (CET) today. Services were suddenly being denied access to various system locations (such as /tmp and /dev/null, which are rather vital).
While the exact cause is still being investigated, the most likely culprit is Gentoo’s package management software (Portage). Whether it’s a general bug or "just" an error in one of the package ebuilds I do not know, but I’ll certainly try to find out.
I guess this means it’s time to go for a proper ACL system and Tripwire (or some other integrity checking tool).
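Until a proper integrity checker is in place, even a tiny script would have caught this particular failure. The following is a minimal sketch, not what is actually running on the server. It assumes the conventional Linux permissions for the two locations mentioned above (/tmp as a world-writable directory with the sticky bit, /dev/null as a world-writable character device), and the function name and the expectations table are my own.

```python
import os
import stat


def describe_problems(path: str, expected_mode: int, want_char_device: bool = False) -> list[str]:
    """Return human-readable problems found for one path (empty list if all is well)."""
    try:
        info = os.stat(path)
    except OSError as exc:
        return [f"{path}: cannot stat ({exc.strerror})"]

    problems = []
    actual = stat.S_IMODE(info.st_mode)  # permission bits, including sticky/setuid/setgid
    if actual != expected_mode:
        problems.append(f"{path}: mode {actual:04o}, expected {expected_mode:04o}")
    if want_char_device and not stat.S_ISCHR(info.st_mode):
        problems.append(f"{path}: not a character device")
    return problems


# Assumed expectations: (path, permission bits, must be a character device?)
EXPECTED = [
    ("/tmp", 0o1777, False),
    ("/dev/null", 0o666, True),
]

if __name__ == "__main__":
    for path, mode, chardev in EXPECTED:
        for problem in describe_problems(path, mode, chardev):
            print(problem)
```

Run from cron, any output at all would be a sign that something has changed permissions behind my back. A real tool such as Tripwire also checks file contents, which this sketch does not.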
Because the server now has two separate internet connections, I had to set up policy-based routing so that the server responds on the same interface on which the request was received.
I took the opportunity to clean up my firewall rules, and the combination of these changes took a while to get right. This may have caused problems for some users from 4:00-6:00 (CET) this morning.
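For the curious, the policy-based routing mentioned above usually boils down to one extra routing table per connection plus a rule that selects the table by the source address of the reply. The sketch below only generates the corresponding iproute2 commands. It is one common way of doing this, not necessarily the exact setup on this server, and the interface names, addresses (taken from the documentation ranges) and table numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    name: str      # interface name (hypothetical, e.g. "eth0")
    address: str   # the server's own IP address on this uplink
    gateway: str   # next-hop router for this uplink
    table: int     # number of the routing table dedicated to this uplink


def policy_routing_commands(uplinks: list[Uplink]) -> list[str]:
    """Build the iproute2 commands for source-based policy routing."""
    commands = []
    for link in uplinks:
        # A default route for this uplink, kept in its own routing table.
        commands.append(
            f"ip route add default via {link.gateway} dev {link.name} table {link.table}"
        )
        # Replies sent from this uplink's address consult that table.
        commands.append(f"ip rule add from {link.address} table {link.table}")
    return commands


# Hypothetical example values, not the server's real configuration.
UPLINKS = [
    Uplink("eth0", "192.0.2.10", "192.0.2.1", 101),
    Uplink("eth1", "198.51.100.10", "198.51.100.1", 102),
]

for command in policy_routing_commands(UPLINKS):
    print(command)
```

Since a reply is sent from the address the request was addressed to, selecting the table by source address is enough to make it leave through the matching connection.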
The server will be taken offline Saturday (April 2nd) afternoon (~16:00 CET) for a few hardware upgrades. I expect it will be unavailable for anywhere between 20 minutes and a full hour, depending on how smoothly everything goes.