It was the week before pandemic lockdowns began.
Like many companies, we were thinking about what a fully remote workforce might mean for us. In an attempt to get ahead of things, we scheduled a ‘Game Day’ to test our work-from-home incident response capabilities. You know, just in case we’d need to abandon the office (spoiler: we would).
The timing was perfect. We had several new team members and others who were almost ready to join the on-call rotation but needed a confidence boost. Our goal was simple: familiarise people with our systems and processes, deliberately steering away from stress testing the infrastructure itself.
The whole point was to focus on how we’d respond remotely, so we planned to break things in obvious ways and watch how the incident response unfolded.
As the designated villain for the day, I had a straightforward plan:
- Detach a disk from our Postgres VM (running in a three-node cluster using Stolon and Etcd)
- Fill up an Elasticsearch disk with junk (running in Kubernetes)
- Create tons of consoles to exhaust cluster capacity
Pretty standard stuff. In fact, it sounded like it might even be fun.
All according to plan
For a while it was: the Game Day started smoothly.
When Postgres died and recovered, the team opened an incident and handled it well. As they were wrapping up the Postgres recovery, I started filling up the Elasticsearch disk, hoping to nudge the fairly large (~30TB) cluster into a degraded state.
Alerts began firing, and the team efficiently split into two groups to handle both issues. Perfect! This is exactly the type of response we were after, proving we could handle ramping pressure even while remote.
It was time, then, to make things more interesting. So I started creating a load of consoles in the Kubernetes cluster.
The console creation was meant to gradually fill the Kubernetes cluster with batch jobs, exhausting our staging cluster’s capacity. This should have been safe – we had pod priorities ensuring staging workloads were assigned much lower priorities than production ones. No matter how many staging consoles were created in the cluster, there should have been no impact on production workloads, as Kubernetes evicts lower-priority staging pods before production ones.
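To make that concrete, here’s a minimal sketch of launching one of those consoles using the Kubernetes Python client – the pod name, namespace, image and class name are illustrative placeholders rather than our real configuration, but the important detail is the explicit priorityClassName marking the workload as safe to preempt first.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig; inside the cluster you'd use
# config.load_incluster_config() instead.
config.load_kube_config()

# Illustrative console pod: the name, namespace and image are placeholders.
# The key detail is priority_class_name, which tells the scheduler this
# workload ranks below production and can be preempted first.
console = client.V1Pod(
    metadata=client.V1ObjectMeta(name="game-day-console", namespace="staging"),
    spec=client.V1PodSpec(
        priority_class_name="staging",
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="console",
                image="ubuntu:20.04",
                command=["sleep", "3600"],
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="staging", body=console)
```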
So when someone from support raised the alarm that the production app was down, I was more than a little confused.
Going sideways
It was at this point that everything started going sideways.
Alerts were firing everywhere: Postgres alerts, HAProxy reporting no backends, blackbox alerts confirming our API was unresponsive, and – most worryingly – Etcd cluster out of quorum. That last one was particularly problematic because Stolon relies on Etcd for leadership election. No Etcd quorum means no database connections, which means no API requests, no batch jobs… no nothing.
The root cause? A perfect storm of “tomorrow’s problems” catching up with us.
At the time, we were running everything (production and staging) in a single large Kubernetes cluster, separating environments and cluster infrastructure using namespaces. We used pod priorities to ensure workloads would arrange themselves correctly – cluster-level infrastructure above production, production above staging, and so on.
But we’d made a critical mistake, one that would come back to haunt us in the worst possible way. It all traced back to when we initially rolled out pod priorities…
The mistake
When we initially rolled out pod priorities, we thought we’d be physically separating the cluster within a month. To avoid doing substantial operational work on something that would be rebuilt “soon,” we’d skipped applying pod priorities to some sensitive deployments, including Etcd.
Here’s where things get interesting: in Kubernetes, pod priorities work through a numeric class system. We had set this up pretty sensibly, with system workloads at priority 300, production at 200, and staging at 100. The idea is simple – when the cluster needs to make space, it evicts pods with lower priorities first.
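As a rough sketch of what that setup looks like with the Kubernetes Python client – the class names are my own placeholders, but the values match the tiers above:

```python
from kubernetes import client, config

config.load_kube_config()
scheduling = client.SchedulingV1Api()

# Class names are placeholders; the values mirror the tiers described above.
tiers = [
    ("system", 300, "Cluster-level infrastructure"),
    ("production", 200, "Production workloads"),
    ("staging", 100, "Staging workloads"),
]

for name, value, description in tiers:
    scheduling.create_priority_class(
        body=client.V1PriorityClass(
            metadata=client.V1ObjectMeta(name=name),
            value=value,
            global_default=False,
            description=description,
        )
    )
```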
But there’s a subtle gotcha that bit us hard. Before you introduce any priority classes to your cluster, every pod essentially has equal priority. They’re all implicitly set to 0, but it doesn’t matter because they’re all the same. The moment you add even a single priority class, though, that changes completely – now any pod without an explicit priority gets assigned 0, making it lower priority than everything else.
Priority classes do have a ‘globalDefault’ property: any pod without an explicit class is assigned the priority of the class marked as the global default. We didn’t have this set in our cluster, though I’d advise setting it if you’re using pod priorities.
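For illustration, a global default looks something like the sketch below – the class name and value are arbitrary choices of mine, and note that only one PriorityClass in a cluster can be marked as the global default:

```python
from kubernetes import client, config

config.load_kube_config()

# Illustrative fallback class: pods created without an explicit
# priorityClassName (as our Etcd pods were) would inherit this priority
# rather than the implicit 0. Only one PriorityClass per cluster may set
# global_default=True, and the value here is an arbitrary example.
fallback = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="default-priority"),
    value=200,
    global_default=True,
    description="Fallback priority for pods without an explicit class",
)

client.SchedulingV1Api().create_priority_class(body=fallback)
```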
By not setting a priority on our Etcd pods while having priorities everywhere else, we’d accidentally marked our most critical infrastructure component as the first thing to be evicted under pressure. When I came along and began filling the cluster with staging consoles (a job I admit to doing with enthusiasm), we started evicting Etcd pods. By the time alerts fired, we’d evicted enough pods to lose quorum, having eaten well into our pod disruption budgets.
This prevented the Etcd pods from successfully rejoining the cluster when restarted. We were, in technical terms, pretty screwed.
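For context on those pod disruption budgets: a PDB for a three-member Etcd cluster might look like the sketch below (the member count, names and namespace are assumptions, using the policy/v1 API in recent versions of the Python client). Preemption by higher-priority pods honours PDBs only on a best-effort basis, which is how a budget like this can still get eaten into.

```python
from kubernetes import client, config

config.load_kube_config()

# Hypothetical PDB for a three-member Etcd cluster: keep at least two members
# (a quorum) available. Priority-based preemption honours PDBs only on a
# best-effort basis, so this limits damage rather than guaranteeing quorum.
pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="etcd", namespace="infrastructure"),
    spec=client.V1PodDisruptionBudgetSpec(
        min_available=2,
        selector=client.V1LabelSelector(match_labels={"app": "etcd"}),
    ),
)

client.PolicyV1Api().create_namespaced_pod_disruption_budget(
    namespace="infrastructure", body=pdb
)
```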
Righting the ship
The immediate focus was restoring database access. We deactivated the Stolon Postgres cluster manager and booted Postgres manually, relying on muscle memory for the right commands. We bypassed both PgBouncer and the Stolon fencing proxy, pointing applications directly at Postgres. This brought things back up, albeit in an unmanaged state.
After taking a moment to contemplate our life choices, we formulated a plan using Etcd’s disaster recovery procedure. It took about six hours to bring everything back and perform a Stolon failover. A good chunk of that time was spent running pg_basebackup to restore another node – our database was about 6TB at this point.
It was a long day, and not at all what we’d planned. As drills go, though, this was probably the most intense way we could ever have tested our production readiness in a remote context, even if it cost us more than we’d anticipated.
Infrastructure debt doesn’t age well
There’s a pattern in infrastructure work that’s worth recognising: we often defer work because we think a better solution is coming “soon.” But “soon” has a way of stretching into weeks or months, and the work we carefully documented and planned to revisit gets buried under newer, more urgent tasks.
Over time, this creates hidden trapdoors in our infrastructure. When we leave configuration half-applied or work partially complete, we’re not just creating technical debt – we’re fragmenting our team’s mental model of the system. Every “we’ll fix it later” adds another dimension of complexity, often temporal, that someone needs to keep in mind. Eventually, someone (in this case, me) will forget about that complexity and fall right through the trapdoor.
The real test isn’t the outage
Of everything I took from this incident, though, it was how our organisation responded that I found most interesting.
In preparing for an important change (remote work during COVID), while trying to do the right thing for the company (training new incident responders), we accidentally caused a significant outage. It would have been easy to say “well, we won’t be doing Game Days again!” or, worse, to fire the person (me) who pressed the buttons and wash our hands of their incompetence (ouch).
Instead, my boss, our CTO, caught up with me afterward to say “thank god we found this while the whole team were on a call, ready to respond!”. It’s true that this would have been far worse had it happened out of hours, but you can bet he took a lot of heat for this outage. It would have been easy to push the blame onto me, but he didn’t, and in that moment he created the essential conditions of a ‘blameless’ incident culture – one where an outage like this can balance its cost with the lessons learned from it.
Sometimes the most valuable lessons come from the most uncomfortable situations. The trick is making sure your team feels safe enough to learn from them.
If you liked this post and want to see more, follow me at @lawrjones.