Recently I was struggling with an LLM prompt that was running slowly: it was the prompt that picks which Grafana dashboards to analyze at the start of an incident, e.g. “this looks like a CPU issue, let’s scan the Kubernetes dashboard”.
At >10s, it was slowing down our initial investigation, which in turn slowed our response. Thankfully we found a number of ways to improve performance without changing the prompt's behaviour, eventually getting the same prompt to run in just ~2s.
The experience of hacking the prompt confirmed some valuable lessons about LLM performance:
⏱️ Output tokens are overwhelmingly more important than input tokens
📋 JSON is very token-inefficient (costly on inputs, but especially painful in outputs)
🗜 Custom compact formats can reduce latency by >70%
📈 Response time scales directly with output token count
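To make the JSON point concrete, here's a minimal sketch (not from the article) comparing token counts for the same dashboard selection expressed as JSON vs. a compact pipe-delimited line format. The dashboard names and fields are hypothetical, and tiktoken's cl100k_base encoding stands in for whatever tokenizer your model uses:

```python
# Illustrative sketch: the same model output as JSON vs. a compact line format.
# Dashboard names/fields are hypothetical; cl100k_base is a stand-in tokenizer.
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical output the model might produce when picking dashboards.
selection = [
    {"dashboard": "kubernetes-cluster", "reason": "CPU saturation on nodes"},
    {"dashboard": "node-exporter", "reason": "check host-level CPU and load"},
]

json_output = json.dumps(selection, indent=2)

# Same information, one compact "name|reason" line per dashboard.
compact_output = "\n".join(f"{s['dashboard']}|{s['reason']}" for s in selection)

print("JSON tokens:   ", len(enc.encode(json_output)))
print("Compact tokens:", len(enc.encode(compact_output)))
```

Since response time grows with output tokens, trimming the braces, quotes, and key names out of the output format is where most of the latency win comes from.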
I’ve written a log of everything I tried and how much each change impacted performance, both to share tricks others can use in their own applications and to help people build intuition about the drivers of LLM latency.
Link to article in the comments 👇