Like many others, we were excited about OpenAI releasing GPT-4.1 last night.
That’s because:
- 4.1 is materially cheaper than both 4o (20% cheaper on uncached tokens, 60% on cached) and Sonnet 3.7 (>30%)
- A 1M token context window means 4.1 can be used where Gemini was previously the only option
- OpenAI emphasised that 4.1 has been improved on real software benchmarks, which is where we care most about model performance
What’s fun is that our investigations agent lets us quickly test switching to the new models, so we began running tests to see which prompts could benefit from the switch.
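In practice that looks roughly like the sketch below: point the same system prompt at each candidate model and collect the outputs for comparison. The helper names, model registry, and dataset shape are illustrative assumptions, not our actual agent code.

```python
# Run the same prompt against each candidate model and collect outputs.
# Helper names and the dataset shape are illustrative, not our agent code.
from openai import OpenAI
import anthropic

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def run_openai(model: str, system: str, user: str) -> str:
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content

def run_sonnet(system: str, user: str) -> str:
    resp = anthropic_client.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return resp.content[0].text

# One entry per candidate: trying a prompt on 4.1 becomes a one-line change.
MODELS = {
    "gpt-4.1": lambda s, u: run_openai("gpt-4.1", s, u),
    "gpt-4o": lambda s, u: run_openai("gpt-4o", s, u),
    "sonnet-3.7": run_sonnet,
}

def compare(system_prompt: str, cases: list[dict]) -> dict[str, list[str]]:
    """Run every eval case against every candidate model."""
    return {
        name: [run(system_prompt, case["input"]) for case in cases]
        for name, run in MODELS.items()
    }
```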
We found that:
- 4.1 can replace Sonnet 3.7 in straightforward prompts, such as shortlisting steps, with a barely noticeable performance impact (but a significant cost saving)
- 4.1 is much better than 4o at software tasks, but Sonnet 3.7 continues to be the most sophisticated (for our dataset, at least)
I wanted to share one of our tests, where we compare different models on PromptInvestigationCodeChangeCauseAnalysis: a prompt that receives a code change (a GitHub/GitLab PR) and tries to determine whether that change is a primary cause of an incident.
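To make the task concrete, here is a rough sketch of the shape of that prompt: a PR diff plus incident context goes in, and a structured verdict comes out. The system prompt wording, the JSON fields, and the `analyse_code_change` helper are assumptions for illustration, not our production prompt.

```python
# Illustrative cause-analysis call: diff + incident context in,
# structured verdict out. Wording and field names are assumptions.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You are investigating an incident. Given a code change and the "
    "incident's symptoms, decide whether the change is a primary cause. "
    'Reply with JSON: {"is_cause": bool, "confidence": float, "reasoning": str}'
)

def analyse_code_change(model: str, pr_diff: str, incident_summary: str) -> dict:
    resp = client.chat.completions.create(
        model=model,  # e.g. "gpt-4.1", "gpt-4o"
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM},
            {
                "role": "user",
                "content": f"Incident:\n{incident_summary}\n\nCode change:\n{pr_diff}",
            },
        ],
    )
    return json.loads(resp.choices[0].message.content)
```

A structured verdict like this makes it easy to score each model against a labelled dataset, which is where the numbers below come from.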
There are lots of interesting factors at play; in summary:
- 4.1 is much fussier than Sonnet 3.7 about claiming a code change caused an incident, leading to a 38% drop in recall (see the metrics sketch after this list)
- When 4.1 does suggest a PR caused an incident, it’s right 33% more often than Sonnet 3.7
- 4.1 blows 4o out of the water: 4o found just 3 of the 31 causal code changes in our dataset, showing how much of an upgrade 4.1 is on this task
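For clarity on those metrics: recall is the share of genuinely causal code changes a model flags, and precision is how often its "this PR caused it" verdicts are right. A toy computation (the 31-case dataset size is real; the helper itself is just illustration):

```python
def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Precision: how often a positive verdict is right.
    Recall: how many truly causal changes get flagged."""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    precision = true_pos / sum(predicted) if any(predicted) else 0.0
    recall = true_pos / sum(actual) if any(actual) else 0.0
    return precision, recall

# e.g. 4o flagged only 3 of the 31 causal changes: recall = 3/31 ≈ 0.10
```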
In short, 4.1 is a totally different beast to 4o when it comes to software tasks, and at a much lower price point than Sonnet 3.7, we’ll be considering it carefully across our agents.
We’re also yet to find a metric where 4.1 is worse than 4o, so at a minimum this release means >20% cost savings for us.