Today we had a Purple Mash incident between 11:23 and 12:27. The site was intermittently unavailable during this period, and total downtime was approximately half an hour.
The problem was caused by our replica MongoDB database running out of CPU credits. This caused slowdowns, which in turn caused the main database to run out of connections. As this was a completely new problem, it took a bit more time than we would have liked to diagnose and fix.
We have now added more capacity to this replica server and put additional monitoring in place to prevent this from happening again. We have also identified a faster way to recover the platform if this type of problem occurs.