<p>I keep seeing companies talk about balancing AI operations across multiple providers and I don’t understand how they’re doing this without compromising the quality of their product.</p>

<p>My suspicion is most companies don’t have the ability to know if and how switching providers impacts their product, and actually there is a big impact when balancing workload between multiple models.</p>

<p>Last week we tried upgrading from 4o-2024-08-06 -&gt; 2024-11-20. Same model, just updated knowledge and weights, so you’d assume it would perform the same, right?</p>

<p>Not the case! On one of our core prompts we see:</p>
<ul>
  <li>4o-2024-08-06 pass 100% of the time</li>
  <li>4o-2024-11-20 drop to 79%!</li>
  <li>In comparison to Anthropic Sonnet 3.5 passing 95%</li>
</ul>

<p>Kinda crazy that Anthropic is comparatively better than an upgrade of the same model, right? But either way, if you’re shipping AI product experiences that can’t afford to go wrong, you can’t be switching between models on-the-fly and expecting things to work the same.</p>