Recently I was struggling with an LLM prompt that was running slowly: it was the prompt that picks which Grafana dashboards to analyze at the start of an incident, e.g. “this looks like a CPU issue, let’s scan the Kubernetes dashboard”.
At >10s, it was slowing down our initial investigation, which in turn slowed our response. Thankfully we found a number of ways to improve performance without changing the prompt's behaviour, eventually getting the same prompt to run in just ~2s.
The experience of hacking the prompt confirmed some valuable lessons about LLM performance:
⏱️ Output tokens are overwhelmingly more important than input tokens
📋 JSON is very token-inefficient (costly on inputs, but especially painful in outputs)
🗜 Custom compact formats can reduce latency by >70%
📈 Response time scales directly with output token count
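To make the JSON point concrete, here's a minimal sketch (not from the article) comparing token counts for the same dashboard selection expressed as JSON vs. a compact pipe-delimited line format. The dashboard names and fields are hypothetical, and tiktoken's cl100k_base encoding stands in for whatever tokenizer your model uses:

```python
# Illustrative sketch: the same model output as JSON vs. a compact line format.
# Dashboard names/fields are hypothetical; cl100k_base is a stand-in tokenizer.
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical output the model might produce when picking dashboards.
selection = [
    {"dashboard": "kubernetes-cluster", "reason": "CPU saturation on nodes"},
    {"dashboard": "node-exporter", "reason": "check host-level CPU and load"},
]

json_output = json.dumps(selection, indent=2)

# Same information, one compact "name|reason" line per dashboard.
compact_output = "\n".join(f"{s['dashboard']}|{s['reason']}" for s in selection)

print("JSON tokens:   ", len(enc.encode(json_output)))
print("Compact tokens:", len(enc.encode(compact_output)))
```

Since response time grows with output tokens, trimming the braces, quotes, and key names out of the output format is where most of the latency win comes from.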
I’ve written a log of everything I tried and how much each change impacted performance, both to share tricks others can use in their own applications and to help people build intuition about the drivers of LLM latency.
Link to article in the comments 👇