DRAFT: Update on recent site issues
Draft post to be made somewhere on Meta. There are definitely technical gaps and almost certainly errors here, to be corrected by other people. Don't rely on what you read here.
We've had some issues with site availability and odd bugs like not being able to vote or being signed out unexpectedly. This is an update on what we know so far.
Summary
- We've been under a DDoS wave (whether malicious is unknown) for two weeks. Since it started, we've been in a loop: monitor, mitigate, and see what else emerges.
- If you encounter problems, try logging out, deleting the `codidact_acct` cookie, and logging back in. This works for most people.
- Some of our mitigations have exposed cases where we previously failed silently in various ways, though very infrequently. We've improved error reporting in those cases, so sometimes you'll now see messages containing server response codes where previously things "just didn't work". We have changes almost ready to deploy that we hope will fix most of these.
Details
Starting around August 22, and intensifying from August 27, we began seeing a new pattern of service degradation. While we're used to blips where the site is slow for a few minutes (bots and scrapers), we were now seeing sustained loads affecting us for hours at a time. Even though Cloudflare blocked the vast majority of unwanted traffic, enough was getting through to cause us problems.
We began to cache assets (like images) more aggressively, so that repeated requests would be served by Cloudflare instead of by us. We also noticed and blocked some specific patterns of traffic from bad actors, and we added rate limits that are mostly hit by bots but can be hit by humans. If you've seen occasional "checking if you're a human" messages from Cloudflare recently, that's why. These rate limits helped somewhat but did not solve the problem.
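To illustrate why a rate limit aimed at bots can occasionally catch a person, here is a minimal sketch of a sliding-window limiter in Python. It is not our configuration and not how Cloudflare implements its rules; the class name, the limit of 3 requests, and the 10-second window are all made up for the example.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical limiter: at most `limit` requests per client per `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # client id -> timestamps of allowed requests

    def allow(self, client, now):
        hits = self._hits[client]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: this is where a challenge would be shown
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=3, window=10.0)
# Four quick clicks from the same client within two seconds:
results = [limiter.allow("203.0.113.7", t) for t in (0.0, 0.5, 1.0, 1.5)]
print(results)
```

The limiter only sees request timing, not intent, so a person opening several tabs in quick succession looks the same as a bot for a moment. Once the older requests age out of the window, the same client is allowed again.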
While we were reviewing our logs, we looked for pages that were getting hit a lot (including legitimate traffic), and reviewed the code for those pages. We found and fixed some performance issues, including the causes of some intermittent server (500) errors. We deployed the first batch of those code changes on September 1.
Then we started getting reports that some people couldn't use some features, like voting, but other people, or the same people on different devices, didn't see these problems. We found some issues with CSRF tokens and mitigated them. We advised affected people to log out, remove the `codidact_acct` (session) cookie, and log back in again, which fixed it for most people but not everybody.
While investigating that, we found that for people using 2FA, the session cookie is being flushed when the browser session ends. If you start your browser and restore your sessions and tabs you won't see the problem, but if you start fresh, you'll be logged out. We're not yet sure what's causing this (possibly something in a dependent library).
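For readers unfamiliar with the distinction, the sketch below shows the general difference between a cookie that lasts only for the browser session and one that persists. It uses Python's standard `http.cookies` module purely for illustration; it is not our code, the cookie value and the 14-day lifetime are invented, and we have not established that a missing lifetime attribute is the cause of the 2FA behavior described above.

```python
from http.cookies import SimpleCookie


def build_cookie(name, value, max_age=None):
    """Return a Set-Cookie header value; without max_age the cookie is session-only."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = "/"
    cookie[name]["httponly"] = True
    if max_age is not None:
        # A lifetime in seconds tells the browser to keep the cookie across restarts.
        cookie[name]["max-age"] = max_age
    return cookie[name].OutputString()


# Hypothetical values, for illustration only.
session_only = build_cookie("codidact_acct", "abc123")
persistent = build_cookie("codidact_acct", "abc123", max_age=14 * 24 * 3600)

print(session_only)
print(persistent)
```

A cookie sent without `Max-Age` or `Expires` is normally discarded when the browser closes. Browsers that restore the previous session often restore such cookies too, which is consistent with the problem not appearing when tabs are restored.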
We deployed some improvements to both error-handling and error-reporting on September 9, and we have more fixes awaiting deployment.
Meanwhile, we're seeing some severe but short-lived load spikes a few times a day now -- not the sustained heavy loads that started a couple of weeks ago, but we're not out of the woods yet either. We continue to investigate problems and improve code robustness in parallel. We're a small open-source project; if anyone reading this has relevant skills and wants to help, we're happy to have you.