I’ve started using our AI SRE product for general production questions rather than exclusively for incidents, and I’m really loving it.
On Friday I wanted to look into what’s been driving increases in our database CPU utilisation. We run a highly parallelised monolith with hundreds of tasks running at any given moment, which makes it hard to identify which of those tasks is responsible for database load, so I asked our bot for help.
I opened an incident and, as I do at the start of most technical investigations, wrote out my thoughts and the directions I’d like to explore in Slack. AI SRE notices, kicks off an investigation, and nudges me 5 minutes later to say:
From here I can boot up Claude directly from the incident and get on with fixing things, even sending anything I find in my IDE back into the incident to share with other responders.
It’s genuinely so good having all your production data in one place and being able to ask questions of it in plain human language, instead of crafting Prom/LogQL or whatever other query language you’d previously have to wrestle with. And having the bot reason alongside you as if it were a human colleague makes the investigation feel really engaging.