Like many others, we were excited about OpenAI releasing GPT-4.1 last night.

That’s because:

What’s fun is that our investigations agent makes it quick to test switching to the new models, so we began running tests to see which prompts could benefit from the switch (a sketch of the setup is below).
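To give a feel for what "quickly testing" looks like, here's a minimal sketch of running the same prompt against several candidate models and collecting the outputs for side-by-side grading. The helper names and structure are illustrative only, not our actual harness:

```python
from openai import OpenAI

client = OpenAI()

CANDIDATE_MODELS = ["gpt-4o", "gpt-4.1"]


def run_prompt(model: str, system_prompt: str, user_input: str) -> str:
    """Run a single prompt against a given model and return the raw text output."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
        temperature=0,  # keep runs as repeatable as possible so outputs are comparable
    )
    return response.choices[0].message.content


def compare_models(system_prompt: str, test_cases: list[str]) -> dict[str, list[str]]:
    """Run every test case against every candidate model so results can be graded side by side."""
    return {
        model: [run_prompt(model, system_prompt, case) for case in test_cases]
        for model in CANDIDATE_MODELS
    }
```

Because the model name is just a parameter, swapping in a newly released model is a one-line change per experiment.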

We found that:

I wanted to share one of our tests, where we compare different models on PromptInvestigationCodeChangeCauseAnalysis, a prompt that receives a code change (a GitHub/GitLab PR) and tries to determine whether that change is a primary cause of an incident.
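For a rough idea of the shape of that prompt, here's a hypothetical sketch: the real system prompt, inputs, and output schema aren't shown in this post, so the field names and wording below are assumptions.

```python
import json

from openai import OpenAI

client = OpenAI()

# Hypothetical system prompt: the production version is more detailed.
SYSTEM_PROMPT = """You are investigating an incident. Given a summary of the
incident and the diff of a recent code change, decide whether the change is a
likely primary cause. Respond as JSON: {"is_cause": bool, "reasoning": str}."""


def analyse_code_change(model: str, incident_summary: str, pr_diff: str) -> dict:
    """Ask the model whether a single PR is a plausible primary cause of the incident."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Incident:\n{incident_summary}\n\nCode change:\n{pr_diff}",
            },
        ],
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    return json.loads(response.choices[0].message.content)
```

Running the same incident/PR pairs through this with different values of `model` is what lets us compare 4o, 4.1, and Sonnet on identical inputs.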

There are lots of interesting factors; in summary:

In short, 4.1 is a totally different beast to 4o when it comes to software tasks, and at a much lower price point than Sonnet 3.7, we’ll be considering it carefully across our agents.

We have also yet to find a metric on which 4.1 is worse than 4o, so at a minimum this release means >20% cost savings for us.
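As a back-of-envelope check on that saving, here's the arithmetic using OpenAI's published per-1M-token list prices at the time of writing; treat the figures as assumptions and check current pricing before relying on them.

```python
# Assumed list prices in USD per 1M tokens (check OpenAI's pricing page).
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4.1": {"input": 2.00, "output": 8.00},
}


def token_cost(model: str, input_tokens_m: float, output_tokens_m: float) -> float:
    """Cost in USD for a given volume of tokens, expressed in millions."""
    p = PRICES[model]
    return input_tokens_m * p["input"] + output_tokens_m * p["output"]


# Hypothetical monthly volume: 100M input tokens, 20M output tokens.
old = token_cost("gpt-4o", 100, 20)   # 250 + 200 = 450 USD
new = token_cost("gpt-4.1", 100, 20)  # 200 + 160 = 360 USD
print(f"Saving: {(old - new) / old:.0%}")  # -> 20%, before any prompt-caching discounts
```

On list prices alone a like-for-like swap saves about 20%; prompt-caching discounts on repeated context can push the effective saving higher.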