From reliability engineering to wrestling with LLMs, my fourth year at incident.io pushed me harder than I'd expected. We launched On-call, weathered some tough times as a team, and I ended the year diving fully into AI.
Continue reading
A story about how incident response training went wrong, with valuable
lessons about pod priorities, isolation, and the importance of a healthy
incident response culture.
Continue reading
Of the mental models and rules I use in my life, by far the most useful is
to learn only one thing at any given time.
Continue reading
My reflections on 2023, now my second full year at incident.io. We doubled the
team this year (34 to 77), launched Status Pages and Catalog, and spent the
last six months building a really exciting new product.
Continue reading
Whenever a system has access to a consistent store, you can extend that
consistency through compare-and-swap to the system's users. This post shows
how you can add CAS to an HTTP API using example code and real-world
examples.
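As a rough illustration of the idea (not the code from the post), the sketch below models compare-and-swap as a conditional write: a read returns a version token, and a write only succeeds if the caller presents the token it last saw. The `Store` class, its method names, and the use of HTTP-style status codes with an `If-Match`-like token are all assumptions made for this example.

```python
import threading
from dataclasses import dataclass


@dataclass
class Resource:
    body: str
    version: int = 1


class Store:
    """Hypothetical in-memory stand-in for a consistent store behind an HTTP API."""

    def __init__(self):
        self._items = {}
        self._lock = threading.Lock()  # makes compare and swap one atomic step

    def get(self, key):
        """Return (body, etag); the etag is the token a client sends back later."""
        with self._lock:
            resource = self._items[key]
            return resource.body, str(resource.version)

    def put(self, key, body, if_match=None):
        """Write only if if_match equals the current version; return a status code."""
        with self._lock:
            current = self._items.get(key)
            if current is None:
                if if_match is not None:
                    return 412  # caller expected a resource that does not exist
                self._items[key] = Resource(body)
                return 201
            if if_match != str(current.version):
                return 412  # stale or missing token: someone else wrote first
            current.body = body
            current.version += 1
            return 200
```

A client that reads, modifies, and writes back with the token it received will get a 412 if another writer got there first, and can then re-read and retry instead of silently overwriting the other change.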
Continue reading
If you build a state machine on top of a relational database you can
abstract concurrency problems away from your business logic and allow
developers to write safe-by-default code without dealing with concurrency
concerns.
This post explains how to build a library that offers those protections, and
how they work under the hood.
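One common way to get this kind of protection, sketched here under assumptions rather than taken from the post, is to make every transition a guarded `UPDATE` that only matches the row while it is still in the expected state. The `items` table, the `ALLOWED` transitions, and the `transition` helper are hypothetical names for this example, which uses SQLite from the Python standard library.

```python
import sqlite3

# Hypothetical set of legal (from_state, to_state) pairs.
ALLOWED = {("pending", "active"), ("active", "closed")}


def transition(conn, item_id, from_state, to_state):
    """Move a row between states; return True only if this caller won the race."""
    if (from_state, to_state) not in ALLOWED:
        raise ValueError(f"illegal transition {from_state} -> {to_state}")
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "UPDATE items SET state = ? WHERE id = ? AND state = ?",
            (to_state, item_id, from_state),
        )
    # rowcount is 0 if the row was no longer in from_state when we tried.
    return cursor.rowcount == 1
```

Because the state check and the write happen in a single statement, two concurrent callers attempting the same transition cannot both succeed: the database applies one update, and the other matches no rows.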
Continue reading
From the moment you learn programming people tell you "don't repeat
yourself!"
So what I'm about to suggest might sound odd. But I'm here to say that if
you want to ship high-quality software at pace, you should be investing in
abstractions that are designed to enable copy-and-paste.
Continue reading
For the last three months we've been building out the incident.io catalog.
This project always made me nervous – it would've been easy to build
something pretty but useless – but it's ended up game-changingly good.
This post talks about some of the decisions that got us there.
Continue reading
From the evergreen AWS status page to hardcoded 100% uptime, no one fully
trusts a status page anymore.
But why is this? Companies often start with good intentions, aiming for full
transparency. So why do so many change along the way: what pressures people
into an evergreen status page with uptime numbers that don't reflect reality?
Continue reading
ULIDs are an alternative to UUIDs that solve several problems, but it's not
all plain sailing.
This post shares experience using ULIDs in production, exploring some of the
drawbacks with the aim of helping others pick an ID format.
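For readers unfamiliar with the format, here is a minimal sketch of how a ULID is put together: a 48-bit millisecond timestamp followed by 80 bits of randomness, encoded as 26 Crockford base32 characters. The `new_ulid` function and its parameters are hypothetical names for this example, not a particular library's API, and the sketch does not attempt the monotonic ordering of IDs generated within the same millisecond.

```python
import os
import time

# Crockford base32 alphabet: no I, L, O or U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None, randomness=None):
    """Build a 26-character ULID from a millisecond timestamp and 10 random bytes."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if randomness is None:
        randomness = os.urandom(10)
    if not 0 <= timestamp_ms < 2**48 or len(randomness) != 10:
        raise ValueError("need a 48-bit timestamp and exactly 10 random bytes")

    # Timestamp in the high 48 bits, randomness in the low 80 bits.
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")

    # Emit 5 bits per character, least significant first, then reverse.
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

Because the timestamp occupies the leading characters, IDs created in different milliseconds sort lexicographically by creation time, which is the main property that distinguishes them from random UUIDs.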
Continue reading