Throughout December 2022 we saw a number of intermittent outages on the MastodonApp production site, which appeared to cause the site to lock up or go entirely offline for 2-3 minutes every hour.
This blog post will talk through the root cause of this issue and what we've since done to prevent it from happening again.
To understand the root cause, it's important to understand a bit of the basic architecture that was in place at the time. The Mastodon software is made up of a number of components that all have to work together to serve the content you expect to see. At the bedrock of this architecture is our Postgres database, which stores nearly all of the configuration and data for the site. We have an external media store (AWS S3) that holds all of the pictures and videos uploaded to the site, and within Mastodon itself we have Sidekiq to handle content processing (ingesting from and exporting to other sites) as well as the web server itself. To manage content processing there is also a Redis server, which both Sidekiq and the front-end web server interact with. All of this ran in virtual machines on our current production multi-node Proxmox cluster, with a large backup server behind the scenes: the Proxmox cluster would snapshot the virtual machine disks and write them to the remote backup server.
In normal operation the web tier calls on the database and Redis frequently, and Sidekiq and Redis in turn interact with the database tier in one form or another. In all cases the upstream software expects a fairly quick response to its queries. All of our VMs were set for daily backups, with the database server VM set to hourly snapshots to reduce the impact if we ever had to roll back and restore for any reason.
Our issue started as the database continued to grow at a significant rate, writing more and more to the virtual machine storage, and we began seeing very intermittent 30-second outages across our infrastructure. These were generally short enough that our active monitoring would not detect them, and on the occasions it did, by the time the pager alarm told me things were broken the site had magically fixed itself. This meant that for quite some time we assumed the fault was outside our infrastructure, such as our provider moving us onto their DDoS mitigation network while we were being attacked, or our CDN's proxy having a hiccup or failover event.
It was not until quite some time later that members of the community started to report the site going down, often at a set time on the hour and for up to 60 seconds each time, something I was then able to reproduce myself. It still took us a while to narrow in on the cause; we originally looked at higher levels in the stack, assuming it might have been an issue with our own internal load balancer.
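For anyone curious, a quick-and-dirty probe along these lines is enough to catch and timestamp sub-minute outages that a once-a-minute checker will happily miss. This is just a sketch rather than our actual monitoring, and the URL is a placeholder:

```python
# Hypothetical high-frequency probe: hit the site every few seconds and log
# the wall-clock time of every failure so a pattern (e.g. "on the hour")
# becomes obvious. The URL below is a placeholder, not our real endpoint.
import time
from datetime import datetime

import requests

URL = "https://your-instance.example/health"  # placeholder; any lightweight page works

while True:
    started = datetime.now()
    try:
        ok = requests.get(URL, timeout=5).status_code == 200
    except requests.RequestException:
        ok = False
    if not ok:
        # Failures clustered on the hour point at a scheduled job rather
        # than random load or an upstream network wobble.
        print(f"{started.isoformat()} probe failed")
    time.sleep(5)  # check every 5 seconds rather than once a minute
```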
After longer than it really should have taken, we noticed the graphs for the database server were showing exceptionally high I/O wait times every hour, on the hour, coinciding with a part of the backup process that requires a short disk lock. We took immediate action and temporarily reduced the frequency of the backups to lessen the impact on our community and restore some stability to the site. We also started working on longer-term designs, which have since been reworked and revisited a few times!
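For illustration, this is roughly the kind of check that would have pointed us at the culprit much sooner: sample the database server's I/O wait every second and shout when it spikes. It's a rough sketch using the psutil library rather than what our metrics stack actually does, and the threshold is arbitrary:

```python
# Sketch only: watch for sustained I/O wait spikes on a Linux host.
import time

import psutil

THRESHOLD = 20.0  # percent of CPU time spent waiting on disk; tune to taste

while True:
    # cpu_times_percent() reports the share of time spent in each CPU state
    # since the previous call; 'iowait' is only reported on Linux.
    pct = psutil.cpu_times_percent(interval=1.0)
    if getattr(pct, "iowait", 0.0) > THRESHOLD:
        print(f"{time.strftime('%H:%M:%S')} high iowait: {pct.iowait:.1f}%")
```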
As this blog post is many, many months late, I'm also going to include where we are as of July 2023, along with some of the other issues that surfaced behind the scenes as a direct result of the original root cause analysis and the steps we took to remedy them at the time.
Our first step (and probably the best long-term option for our design anyway) was to deploy a replica Postgres server onto one of the other physical hosts. We currently rely on resiliency at the host / VM level and do not have active failover from host to host, mostly because it adds storage complexity and, in some cases, we have different bare-metal specs for different VMs (the database being one!). It took some time to configure (or so we thought) both the replica VM and its replication, and once we had some confidence replication was operational, we left it alone. As with a lot of our infrastructure, our server and application monitoring was (and still is) very incomplete, so we assumed things were working.
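For reference, a check along these lines is all it would have taken to tell us whether a standby was actually attached to the primary and how far behind it was. The connection details are placeholders; pg_stat_replication itself is standard PostgreSQL:

```python
# Sketch of the replication health check we should have had from day one.
import psycopg2

# Placeholder DSN -- point this at the primary, not the replica.
conn = psycopg2.connect("dbname=mastodon host=db-primary")

with conn, conn.cursor() as cur:
    # pg_stat_replication lists every standby currently attached to the
    # primary; replay_lag is available from PostgreSQL 10 onwards.
    cur.execute("SELECT client_addr, state, replay_lag FROM pg_stat_replication")
    rows = cur.fetchall()

if not rows:
    # No standbys connected at all -- exactly the silent failure we hit.
    print("ALERT: no replicas attached to the primary")
else:
    for addr, state, lag in rows:
        print(f"replica {addr}: state={state}, replay_lag={lag}")
```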
In February 2023 we identified that the replication had not been working at all: the Postgres database on the replica server had not changed since we originally set the system up. At this point we took the replica server offline and wrote a bash script that takes a full export of the database, writes it to a compressed file and then uploads it to our Backblaze remote S3-compatible object storage. This system has been running well and is still in use to this day, although the growing size of the database means the script takes longer and longer, sometimes multiple hours from when the backup starts until it's fully uploaded.
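For the curious, the job boils down to something like the following. This is a simplified Python sketch of the same approach (the real thing is a bash script), with placeholder database, bucket and endpoint names:

```python
# Sketch: dump the database, compress it, push it to Backblaze's
# S3-compatible API. Names and endpoint below are placeholders.
import subprocess
from datetime import datetime, timezone

import boto3

dump_path = f"/tmp/mastodon-{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.sql.gz"

# pg_dump the database and gzip the output in a single pipeline.
with open(dump_path, "wb") as out:
    dump = subprocess.Popen(["pg_dump", "mastodon"], stdout=subprocess.PIPE)
    subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=out, check=True)
    dump.stdout.close()
    if dump.wait() != 0:
        raise RuntimeError("pg_dump failed")

# Backblaze B2 exposes an S3-compatible endpoint, so boto3 works unchanged;
# credentials are assumed to come from the environment.
s3 = boto3.client("s3", endpoint_url="https://s3.eu-central-003.backblazeb2.com")
s3.upload_file(dump_path, "db-backups", dump_path.rsplit("/", 1)[-1])
```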
We will be looking to move back to a fully operational replication model in the near future, and will continue to improve the monitoring of our infrastructure. We are in a much better place than we were in November 2022, but we are still not where we want to be, and there is much we can do to improve our observability of the production system.