This blog post is a little different, as we're trying something new. As many of you will be aware, towards the start of February 2025 we significantly enhanced the observability and monitoring of our federated social media sites, as part of our ongoing commitment to service improvements and to ensuring the maximum possible availability of the sites. As part of this we've been tracking the health of a number of the public-facing components and have set an initial Service Level Objective (SLO) of 99% availability over a rolling 30-day window.
All of our monitoring statistics are based on checks from multiple geographic areas worldwide, so that we can better detect regional issues or latency problems that might affect specific geographic regions. We currently run checks from the following locations:
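To put the 99% target in context, here is a minimal sketch of how much downtime such an SLO allows over a 30-day window. The function name `error_budget` is our own for illustration and is not part of any monitoring tool we use.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the total downtime allowed within `window` at the given SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be greater than 0 and at most 100")
    allowed_fraction = (100.0 - slo_percent) / 100.0
    return window * allowed_fraction


# A 99% SLO over a rolling 30-day window.
budget = error_budget(99.0, timedelta(days=30))
print(f"Allowed downtime: {budget}")
```

In other words, a 99% objective leaves roughly 7 hours and 12 minutes of allowable downtime in any 30-day window.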
- Europe - United Kingdom
- North America - US East
- North America - US West
- Asia / Pacific - Japan
The Universeodon Relay has an additional check from Europe - Germany, as we are aware of a large number of instances that federate from networks based in Germany.
Summary
Across both MastodonApp.UK and Universeodon.com, our monitoring shows we maintained above 99% availability during March 2025. Universeodon had significantly more incidents than MastodonApp.UK, and there are a series of improvements we will need to make to our monitoring in these areas.
MastodonAppUK
Overview Statistics:
MastodonAppUK Media - 100% Availability, 0 Errors, 110.4ms Median Duration
MastodonAppUK Main Site - 99.933% Availability, 38 Errors, 221.5ms Median Duration
MastodonAppUK Streaming - 99.947% Availability, 28 Errors, 217.1ms Median Duration
General Updates:
We performed a patch uplift for both MastodonApp.UK and Universeodon.com to bring them up to date with the latest (at the time) patch releases of Mastodon 4.2. We have an outstanding action to bring both sites up to the Mastodon 4.3 release, which will introduce various improvements across the sites and should (hopefully!) allow us to upgrade more quickly to 4.4 when Mastodon releases their next major version.
Incident Reviews:
21st March 2025 - Media Uploads Failed
On March 21st 2025 Cloudflare experienced a major issue with their R2 service which affected our ability to upload new media on all our sites. Existing media continued to display without issue, and the upload failures were intermittent. Further details on the Cloudflare outage and its root cause can be found here - https://www.bleepingcomputer.com/news/security/cloudflare-r2-service-outage-caused-by-password-rotation-error/
Universeodon
Overview Statistics:
Universeodon Media - 100% Availability, 0 Errors, 110.5ms Median Duration
Universeodon Main Site - 99.547% Availability, 161 Errors, 221.3ms Median Duration
Universeodon Streaming - 99.494% Availability, 175 Errors, 216.7ms Median Duration
Universeodon Relay - 99.813% Availability, 16 Errors, 502.5ms Median Duration
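As a rough illustration of how these figures compare with the 99% SLO, the sketch below works out what share of the error budget each component used. The availability values are the ones reported above. Treating the calendar month as if it were the rolling 30-day window is a simplifying assumption, and the helper name `budget_consumed` is ours.

```python
SLO_TARGET = 99.0  # percent, the SLO stated at the top of this post

# Availability figures for March 2025 as reported above.
reported = {
    "Universeodon Media": 100.0,
    "Universeodon Main Site": 99.547,
    "Universeodon Streaming": 99.494,
    "Universeodon Relay": 99.813,
}


def budget_consumed(availability: float, slo: float = SLO_TARGET) -> float:
    """Fraction of the error budget used (1.0 means the SLO is only just met)."""
    if slo >= 100.0:
        raise ValueError("a 100% SLO leaves no error budget to measure against")
    return (100.0 - availability) / (100.0 - slo)


for name, availability in reported.items():
    status = "within SLO" if availability >= SLO_TARGET else "SLO breached"
    print(f"{name}: {budget_consumed(availability):.1%} of budget used ({status})")
```

On these numbers the streaming component used about half of its monthly allowance, which is why the incidents below matter even though the overall target was met.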
General Updates:
We performed a patch uplift for both MastodonApp.UK and Universeodon.com to bring them up to date with the latest (at the time) patch releases of Mastodon 4.2. We have an outstanding action to bring both sites up to the Mastodon 4.3 release, which will introduce various improvements across the sites and should (hopefully!) allow us to upgrade more quickly to 4.4 when Mastodon releases their next major version.
Incident Reviews:
2nd March 2025 - Server Migration Activities
On March 2nd 2025 there were delays to Universeodon content processing due to server migration activities which we needed to undertake to re-balance our underlying infrastructure. This work was semi-planned and would not have had a direct impact on the site itself; however, it did affect the speed at which we federated content with other sites, which was delayed by up to around 60 minutes during the maintenance and for a short while after. As this was planned / routine maintenance, there is no additional action we intend to take at this time to directly address the issue. However, thanks to the re-balancing of compute resources, we now have additional content processing capacity at our disposal, which will allow us to better handle such incidents and routine maintenance in the future.
14th March 2025 - Media Uploads Failed
On March 14th 2025 members on Universeodon.com were unable to upload any media to the site and were presented with 500 server errors when attempting to share media on their posts. This issue was reported by users of the site and is not currently something we have a good way to test for proactively. While our server logs would have flagged the issue, we don't currently have alerts configured to look for new or out-of-the-ordinary events in those logs, though this is something we are looking into for the future. The root cause was a DNS failure on both of our web nodes for Universeodon.com. These were set to use Cloudflare's public DNS (1.1.1.1 and 1.0.0.1), which failed to resolve Cloudflare's own R2 endpoints from these servers for an extended amount of time. Our current best guess is that the servers' IPs were somehow being rate limited, as they share an external IP address with a large number of other VMs and containers running on the same underlying hosts, and the same R2 addresses were accessible from other devices using the same DNS servers. Our mitigation was to switch to Google's DNS servers, which resolved the issue. We have an outstanding action to review the DNS configuration for our entire fleet and to ensure we run sufficient DNS servers ourselves to cache DNS requests more effectively.
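One possible way to catch this class of failure proactively is a small check, run on each web node, that confirms the hostnames the site depends on still resolve. The following is only a sketch under assumptions: the function names are our own, the hostnames are placeholders rather than our real R2 endpoints, and the demonstration uses a simulated resolver so it runs without network access.

```python
import socket
from typing import Callable, Dict, Iterable, List

Resolver = Callable[[str], List[str]]


def system_resolve(hostname: str) -> List[str]:
    """Resolve using the DNS servers configured on this host."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def check_dns(hostnames: Iterable[str],
              resolve: Resolver = system_resolve) -> Dict[str, str]:
    """Return hostname -> error message for every lookup that failed."""
    failures: Dict[str, str] = {}
    for hostname in hostnames:
        try:
            if not resolve(hostname):
                failures[hostname] = "no addresses returned"
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures[hostname] = str(exc)
    return failures


# Demonstration with a simulated resolver (placeholder hostnames).
def fake_resolve(hostname: str) -> List[str]:
    if hostname == "broken.example":
        raise socket.gaierror("simulated lookup failure")
    return ["192.0.2.10"]


print(check_dns(["ok.example", "broken.example"], resolve=fake_resolve))
```

In a real deployment the default `system_resolve` would be used, and a non-empty result would raise an alert. This only exercises whichever DNS servers the node is configured with, which is the point here, since the failure was specific to the affected servers.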
20th March 2025 - Content Processing Crash
On March 20th 2025 Universeodon.com experienced major issues with content processing, which went almost entirely offline with only a small number of services remaining operational. By the time we were made aware of the issue, content processing was delayed by 5 hours. Content processing is again an area where we know we lack monitoring right now, and our main method of being alerted to issues is users reporting delays and noticing no new posts in their timelines for an extended amount of time. We restarted the services to restore queue processing, after which content processing resumed and slowly caught up. There is an ongoing issue affecting multiple Mastodon servers where the Sidekiq content processing consumes all available memory and then locks up. There are workarounds, such as regularly rebooting the servers, though this is not something we currently have in place. We have outstanding actions both to improve our monitoring in this area and to overhaul our infrastructure setup for content processing, making it more dynamic with suitable auto-scaling in place, which should ensure this issue does not present itself again in the future.
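As an illustration of the kind of monitoring that could catch this sooner, the sketch below flags queues whose oldest waiting job is older than a threshold, which is essentially what Sidekiq reports as queue "latency". It makes several assumptions: that each job payload is JSON with an `enqueued_at` field in epoch seconds (as in the Sidekiq versions we believe Mastodon 4.2 uses), that something else fetches the oldest job per queue, and that 300 seconds is a sensible threshold. The function names are ours.

```python
import json
import time
from typing import Dict, List, Optional


def queue_latency_seconds(oldest_job_json: Optional[str], now: float) -> float:
    """Age of the oldest waiting job; 0.0 when the queue is empty."""
    if oldest_job_json is None:
        return 0.0
    job = json.loads(oldest_job_json)
    # Assumption: enqueued_at is a Unix timestamp in seconds.
    return max(0.0, now - float(job["enqueued_at"]))


def stalled_queues(oldest_jobs: Dict[str, Optional[str]],
                   now: Optional[float] = None,
                   max_latency_seconds: float = 300.0) -> List[str]:
    """Return the names of queues whose latency exceeds the threshold."""
    current = time.time() if now is None else now
    return sorted(
        name for name, payload in oldest_jobs.items()
        if queue_latency_seconds(payload, current) > max_latency_seconds
    )


# Demonstration with made-up payloads: one queue is 5 hours behind.
now = 1_000_000.0
sample = {
    "ingress": json.dumps({"enqueued_at": now - 5 * 3600}),
    "push": json.dumps({"enqueued_at": now - 2}),
    "mailers": None,  # empty queue
}
print(stalled_queues(sample, now=now))
```

A check like this would have alerted long before the backlog reached 5 hours, rather than relying on users noticing stale timelines.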
21st March 2025 - Media Uploads Failed
On March 21st 2025 Cloudflare experienced a major issue with their R2 service which affected our ability to upload new media on all our sites. Existing media continued to display without issue, and the upload failures were intermittent. Further details on the Cloudflare outage and its root cause can be found here - https://www.bleepingcomputer.com/news/security/cloudflare-r2-service-outage-caused-by-password-rotation-error/
Your Feedback
This is a new format of blog post for us. If you found it useful, or have any feedback on what you would like to see us include in future updates like this, please drop me a message on Mastodon - @Wild1145@MastodonApp.UK