Tips and tricks for handling incidents better, learned over years of dealing
with production issues. Included are opinions on strategy, process, tools
and how to handle the all-important human element.
Read this if you're new to incident response and want a starter-pack of
advice, or to contrast your own perspective with someone else's.
As a team's infrastructure estate grows, it becomes increasingly beneficial
to create a global registry of all people, services, and components. Once
you do, you can integrate with tools like Terraform, Chef, and Kubernetes to
help provision your infrastructure according to a single authoritative
source. This post explains how we built our registry at GoCardless, and some
of the uses we’ve put it to.
Most Prometheus metrics recording durations are subject to a
time-of-measurement bias, causing misleading graphs that can derail
investigations. See how an open-source Tracer can help solve this problem.
This post covers the implementation of pgreplay-go, a tool to realistically
simulate captured Postgres traffic. I'll explain why existing tools didn't
fit and walk through some challenges in the implementation, focusing on what
I learned personally from the process.
Diving into the Postgres query planner to understand its decisions and,
occasionally, its mistakes. Explore a query plan that went wrong, and
discover the statistics that informed the bad decision.