Like many others, we were excited about OpenAI releasing GPT-4.1 last night.
That’s because:
- 4.1 is materially cheaper than both 4o (20% cheaper on uncached tokens, 60% on cached) and Sonnet 3.7 (>30%)
- A 1M token context window means 4.1 can be used where Gemini was previously the only option
- OpenAI emphasised that 4.1 has been improved on real software benchmarks, which is where we care most about model performance
What’s fun is that our investigations agent lets us quickly test switching to the new models, so we began running tests to see which prompts could benefit from the switch.
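In practice that looks roughly like the sketch below: point the same system prompt at each candidate model and collect the outputs for comparison. The helper names, model registry, and dataset shape are illustrative assumptions, not our actual agent code.

```python
# Run the same prompt against each candidate model and collect outputs.
# Helper names and the dataset shape are illustrative, not our agent code.
from openai import OpenAI
import anthropic

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def run_openai(model: str, system: str, user: str) -> str:
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content

def run_sonnet(system: str, user: str) -> str:
    resp = anthropic_client.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return resp.content[0].text

# One entry per candidate: trying a prompt on 4.1 becomes a one-line change.
MODELS = {
    "gpt-4.1": lambda s, u: run_openai("gpt-4.1", s, u),
    "gpt-4o": lambda s, u: run_openai("gpt-4o", s, u),
    "sonnet-3.7": run_sonnet,
}

def compare(system_prompt: str, cases: list[dict]) -> dict[str, list[str]]:
    """Run every eval case against every candidate model."""
    return {
        name: [run(system_prompt, case["input"]) for case in cases]
        for name, run in MODELS.items()
    }
```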
We found that:
- 4.1 can replace Sonnet 3.7 in straightforward prompts, such as shortlisting steps, with a barely noticeable performance impact (but a significant cost saving)
- 4.1 is much better than 4o at software tasks, but Sonnet 3.7 continues to be the most sophisticated (for our dataset, at least)
I wanted to share one of our tests, where we compare different models on PromptInvestigationCodeChangeCauseAnalysis: a prompt that receives a code change (a GitHub/GitLab PR) and tries to determine whether that change is a primary cause of an incident.
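To make the task concrete, here is a rough sketch of the shape of that prompt: a PR diff plus incident context goes in, and a structured verdict comes out. The system prompt wording, the JSON fields, and the `analyse_code_change` helper are assumptions for illustration, not our production prompt.

```python
# Illustrative cause-analysis call: diff + incident context in,
# structured verdict out. Wording and field names are assumptions.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You are investigating an incident. Given a code change and the "
    "incident's symptoms, decide whether the change is a primary cause. "
    'Reply with JSON: {"is_cause": bool, "confidence": float, "reasoning": str}'
)

def analyse_code_change(model: str, pr_diff: str, incident_summary: str) -> dict:
    resp = client.chat.completions.create(
        model=model,  # e.g. "gpt-4.1", "gpt-4o"
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM},
            {
                "role": "user",
                "content": f"Incident:\n{incident_summary}\n\nCode change:\n{pr_diff}",
            },
        ],
    )
    return json.loads(resp.choices[0].message.content)
```

A structured verdict like this makes it easy to score each model against a labelled dataset, which is where the numbers below come from.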
There are lots of interesting factors at play; in summary:
- 4.1 is much fussier than Sonnet 3.7 about claiming a code change caused an incident, leading to a 38% drop in recall (see the metrics sketch after this list)
- When 4.1 does suggest a PR caused an incident, it’s right 33% more often than Sonnet 3.7
- 4.1 blows 4o out of the water: 4o found just 3 of the 31 causal code changes in our dataset, showing how much of an upgrade 4.1 is on this task
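For clarity on those metrics: recall is the share of genuinely causal code changes a model flags, and precision is how often its "this PR caused it" verdicts are right. A toy computation (the 31-case dataset size is real; the helper itself is just illustration):

```python
def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Precision: how often a positive verdict is right.
    Recall: how many truly causal changes get flagged."""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    precision = true_pos / sum(predicted) if any(predicted) else 0.0
    recall = true_pos / sum(actual) if any(actual) else 0.0
    return precision, recall

# e.g. 4o flagged only 3 of the 31 causal changes: recall = 3/31 ≈ 0.10
```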
In short, 4.1 is a totally different beast to 4o when it comes to software tasks, and at a much lower price point than Sonnet 3.7, we’ll be considering it carefully across our agents.
We’re also yet to find a metric where 4.1 is worse than 4o, so at a minimum this release means >20% cost savings for us.