Slack’s having a wobble right now, which meant we had an incident on our end too (we have a large Slack surface). Not that fun, but one silver lining was being shocked at the quality and depth of the automated investigations our AI SRE produced.

I joined incident.io as a founding engineer and have built a lot of the system this investigation runs on. Even with everything I know about it, the automated investigation quickly identified behaviour and changes I’d never have thought about, within minutes of my pager going off.

Things that shocked me (positively):

  1. Worker retry behaviour: Our Slack workers are retrying as intended, but it’s making some parts of our recovery slower than necessary. That needs looking at!

  2. Deployment separation: We separated deployments for Slack webhook handling just the other week to isolate them. The investigation linked me to the exact Slack message where the infra team shared this news.

  3. Race condition caught: Canvas writes (which create this report in Slack, quite meta!) are racing incident channel creation. I’ve already added better backoff here — there’s a rough sketch of what I mean just after this list.
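For the curious, here’s a minimal sketch of the jittered exponential backoff idea, assuming the canvas write fails with a "channel not found"-style error while the incident channel is still being created. The function and error names are hypothetical stand-ins, not our actual code.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errChannelNotFound stands in for the error returned when the incident
// channel hasn't finished being created yet (hypothetical name).
var errChannelNotFound = errors.New("channel_not_found")

// writeCanvas is a placeholder for the call that posts the investigation
// canvas into the incident channel.
func writeCanvas(channelID string) error {
	// ...call Slack here; assume it returns errChannelNotFound if the
	// channel doesn't exist yet.
	return nil
}

// writeCanvasWithBackoff retries the canvas write with jittered exponential
// backoff, so a write that races channel creation waits instead of failing.
func writeCanvasWithBackoff(channelID string, maxAttempts int) error {
	delay := 200 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := writeCanvas(channelID)
		if err == nil {
			return nil
		}
		// Only keep retrying the specific "channel isn't there yet" case.
		if !errors.Is(err, errChannelNotFound) || attempt == maxAttempts {
			return fmt.Errorf("writing canvas: %w", err)
		}
		// Sleep with jitter, then double the delay for the next attempt.
		time.Sleep(delay + time.Duration(rand.Int63n(int64(delay))))
		delay *= 2
	}
	return nil
}

func main() {
	if err := writeCanvasWithBackoff("C0123456789", 5); err != nil {
		fmt.Println("canvas write failed:", err)
	}
}
```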

Increasingly, jumping to the investigation is the first thing I do when paged, because it surfaces so much useful context, even things like changes across the company that I might have missed.

I end up chatting to our bot (@incident.io) to clarify things or have it point me at useful dashboards, following along with its updates in the channel. When it finds something it thinks is important, it’s normally something I would otherwise have missed, which makes responding way less stressful.

We’re putting a huge effort into accuracy and expanding our data sources (deepening support for metrics, logs, and traces) over the next quarter, and that work is already paying off in spades. It’s extremely apparent to me that every engineering team is going to have a system like this within a year, and we’re racing to make sure ours is the most capable one those teams could buy.