<p>I’ve started using our AI SRE product to ask general production questions, not just to handle incidents, and I’m really loving it.</p>

<p>On Friday I wanted to look into what’s been driving increases in our database CPU utilisation. We run a highly parallelised monolith with hundreds of tasks running at any given moment. That makes it hard to tell which of them is responsible for database load, so I asked our bot for help.</p>

<p>I opened an incident and, as I do at the start of most technical investigations, wrote out my thoughts and the directions I’d like to explore in Slack. AI SRE notices, kicks off an investigation, and five minutes later nudges me to say:</p>

<ul>
  <li>It’s found the jobs that are driving the utilisation</li>
  <li>There’s a <code>SELECT *</code> pattern we should eliminate</li>
  <li>Our cron jobs are clustered too tightly</li>
</ul>

<p>From there, I can boot up Claude directly from the incident and get on with fixing things, even sending anything I find in my IDE back into the incident to share with other responders.</p>

<p>It’s genuinely so good having all your production data in one place and being able to ask it questions in plain human language, instead of crafting PromQL/LogQL or whatever other query language you’d previously have had to wrestle with. And having the bot reason alongside you as if it were a human colleague makes the investigation feel really engaging.</p>