The servers are built and going through stress tests at the moment. They'll be racked on Wednesday, but that's too late to mess with the prod cluster before the weekend.
We'll start by switching the database to one of the new hosts; I just finished setting up and tuning PostgreSQL today and got 3k TPS out of pgbench with basically zero CPU load, and the WoL code doesn't seem to be able to stress it at all either. It sure helps that it's on SSDs instead of just 2.5" spinning disks: iostat showed a peak of 25k write ops in a second. We can set Exporter.LIVERANK back to true after the migration.
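For reference, the benchmark run was along these lines; the scale factor and client counts here are assumptions for illustration, not the exact numbers used:

```shell
# Initialize a pgbench database at scale factor 100 (assumed) and run a
# 60-second read/write benchmark; the TPS figure is in pgbench's output.
pgbench -i -s 100 bench
pgbench -c 16 -j 4 -T 60 bench

# In another terminal, watch disk and CPU load while it runs:
iostat -x 1
```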
After that, the backend is probably next to be moved. Well, actually, not moved, but split: we'll use a custom sticky round-robin method to split the load between two machines, basically doubling the processing capacity. Some logs are silly large, but with 80G of memory available in total, it should be okay for now.
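A minimal sketch of what "sticky round robin" could look like; the hashing scheme and backend names here are assumptions, the real module lives in the frontend code:

```shell
#!/bin/sh
# Map a client IP to a fixed backend: hash the IP, take it modulo the
# number of backends. The same IP always lands on the same backend
# (sticky), while different IPs spread roughly evenly across the pool.
pick_backend() {
    ip="$1"
    hash=$(printf '%s' "$ip" | cksum | cut -d' ' -f1)
    case $(( hash % 2 )) in
        0) echo backend01 ;;
        1) echo backend02 ;;
    esac
}

pick_backend 192.0.2.10
```

Repeated calls with the same IP must return the same backend, which is exactly the property to assert in the "sticky routing is sticky" test step below.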
The final step is migrating the data pool. We'll do that when we have spare time; otherwise, it can wait another week.
So between Friday and Sunday, between 12:00 and 15:00 CEST each day, there's a chance of a service interruption. We don't expect to be down for that long; that's kinda the worst-case estimate. The actual plan is:
Before Day 1
- Set up all the separate services and make sure everything runs. Add everything to the central monitoring services.
- Set up cross-site monitoring between the Germany and Amsterdam locations; it's long overdue.
- Set up host monitoring & notifications. Schedule the downtimes so we don't get Nagios spam.
- Disable site. Revoke access to the database and dump it (~30m)
- Import @ new db server (~30m)
- Restart the site, run test suite, enable at load balancer (~30m)
- Migrate all the other PostgreSQL DBs too: forums, JIRA, etc.
- Yup. Depending on how fast we can dump/reload the huge database, it can take up to 1.5h to get everything up again.
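The dump/import could be as simple as piping pg_dump straight into the new server. A sketch only; hostnames and the database name are placeholders, and a custom-format dump with parallel restore may well be faster for the big database:

```shell
# Dump from the old host and restore on the new one in a single pass;
# -C makes pg_dump emit CREATE DATABASE, so we connect psql to the
# maintenance database "postgres" on the target.
pg_dump -h olddb -U postgres -C wol | psql -h newdb -U postgres postgres

# Sanity check afterwards: the table list on the new server should
# match the old one before re-enabling the site.
psql -h newdb -U postgres -d wol -c '\dt'
```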
- Set up the backends.
- Set load balancer to serve from anything but frontend02.
- On frontend02, enable backend sticky routing module, set to backend01 only.
- Point frontend02 to new backend, run test suite.
- Point frontend02 to both backends, retest, and assert sticky routing is actually sticky.
- Set load balancer to serve from frontend02 only.
- Live test. Watch the mailbox.
- No warnings? Good. Do the setup on all other frontends.
- Reset load balancer back to serving from every host.
- There shouldn't be any downtime at all, unless we seriously mess up on one of the steps.
- Migrate the NFS host to twin2's new disks. The new disk controller has better Linux support (NCQ with a queue depth of 31); the old one was running through the Adaptec option ROM, and Linux didn't dare to queue anything there.
- Pre-rsync most of the data (hours)
- Stop WoL web service (downtime starts here)
- Final sync for changed files and verification (15-30m to verify 300k reports)
- Reconfigure all NFS mounts (10m)
- Revoke access to the backend01 data pool so no one accidentally uses it (10m)
- Restart WoL web service
- Use the original disks as backup. Change the NFS export to read-only.
- RAID1 @ twin2, RAID1 @ backend01 as a warm standby, periodically synced, and an offline copy at an off-site location. You can't be too careful with data; disks tend to die at the worst possible moments.
Future: reinstall the older servers with a newer OS; Ubuntu Server 8.04 LTS is stuck on Python 2.5 and has a boatload of outdated packages. That shouldn't cause any downtime; we'll just move the main IP to another node and serve everything via the online servers.
And add all the features that were previously impossible due to load. 50% more processing power and 100x the TPS to disk make a lot more possible. We just need to keep the DB size under 40G (it's 5G now); SSDs are still tiny compared to SAS disks.
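Keeping an eye on that 40G ceiling is a one-liner, e.g. from cron; the database name is a placeholder:

```shell
# Print the on-disk size of the database so growth toward the 40G SSD
# limit shows up early; wire the output into monitoring or mail.
psql -At -c "SELECT pg_size_pretty(pg_database_size('wol'))"
```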