AI Model Speed Benchmark: Time-to-First-Token and Throughput
Speed matters. Whether you are building a chatbot that needs instant responses or processing thousands of documents, the latency and throughput of your AI model directly affect user experience and cost. We benchmarked the major models on both metrics.
AI model comparisons are based on publicly available benchmarks and editorial testing. Results may vary by use case.
Speed Metrics Explained
Time-to-First-Token (TTFT): How long until the model starts generating output. This is what users “feel” in interactive applications. Lower is better.
Throughput (tokens/second): How fast the model generates output once it starts. Higher is better. This determines how quickly you get a complete response.
Total Response Time: The combined time to generate a full response. Depends on both TTFT and throughput, plus response length.
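These three metrics combine in a simple way: total response time is TTFT plus output length divided by throughput. A minimal sketch of that arithmetic (the function name is ours, not from any provider SDK):

```python
def estimate_total_time(ttft_s: float, throughput_tps: float, n_tokens: int) -> float:
    """Estimate total response time: time-to-first-token plus generation time."""
    return ttft_s + n_tokens / throughput_tps

# Reproduce a row of the benchmark table below: 0.2s TTFT at 180 tokens/sec
# for a 500-token response works out to roughly 3.0 seconds.
flash_total = round(estimate_total_time(0.2, 180, 500), 1)
```

The same formula explains why TTFT dominates perceived speed for short replies while throughput dominates for long ones.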
Speed Benchmark Results
| Model | TTFT (median) | Throughput (tokens/sec) | Total Time (500 tokens) | Tier |
|---|---|---|---|---|
| Gemini Flash | 0.2s | 180 | ~3.0s | Budget |
| Claude Haiku 4 | 0.3s | 150 | ~3.6s | Budget |
| GPT-4o mini | 0.3s | 140 | ~3.9s | Budget |
| GPT-4o | 0.5s | 90 | ~6.1s | Mid |
| Claude Sonnet 4 | 0.5s | 85 | ~6.4s | Mid |
| Gemini Pro | 0.6s | 80 | ~6.9s | Mid |
| Mistral Large | 0.5s | 75 | ~7.2s | Mid |
| Claude Opus 4 | 0.8s | 55 | ~9.9s | Premium |
| Gemini Ultra | 0.9s | 50 | ~10.9s | Premium |
| o3 | 2-15s | 45 | 15-60s+ | Reasoning |
Benchmarks measured using standard API endpoints under normal load. Results vary by time of day, prompt complexity, and server load.
Key Observations
Budget Models Are Remarkably Fast
Gemini Flash and Claude Haiku 4 respond almost instantly (0.2-0.3s TTFT) and generate text at 150-180 tokens per second. For interactive chatbots and real-time applications, these models provide the snappiest user experience.
Reasoning Models Are Slow by Design
o3 and similar reasoning models are intentionally slow because they generate internal “thinking” tokens before responding. A simple question might take 5 seconds; a complex math problem might take 60+ seconds. This is a feature, not a bug. The thinking time is what enables superior accuracy.
Premium Models Are 3-4x Slower Than Budget
Claude Opus 4 and Gemini Ultra generate output at roughly one-third the speed of their budget counterparts. This is the tradeoff for higher quality: more parameters mean more computation per token.
Streaming Masks Latency
Most applications use streaming (showing tokens as they are generated). With streaming, users perceive the model as responsive even when total generation takes 10+ seconds, because they see output appearing after just 0.5-0.9 seconds (TTFT).
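Both metrics can be measured directly from any streaming API by timestamping tokens as they arrive. A sketch against a generic token iterator (`fake_stream` below is a simulated stand-in for a real streaming response, with illustrative timings):

```python
import time
from typing import Iterable, Tuple

def measure_stream(tokens: Iterable[str]) -> Tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            ttft = time.perf_counter() - start  # time until first token arrived
        count += 1
    elapsed = time.perf_counter() - start
    gen_time = elapsed - (ttft or 0.0)  # time spent generating after the first token
    throughput = count / gen_time if gen_time > 0 else float("inf")
    return ttft or 0.0, throughput

def fake_stream(n=50, ttft=0.05, per_token=0.002):
    """Simulated model: waits out the TTFT, then emits tokens at a fixed rate."""
    time.sleep(ttft)
    for i in range(n):
        yield f"tok{i}"
        time.sleep(per_token)

ttft, tps = measure_stream(fake_stream())
```

Swapping `fake_stream()` for a real streamed API response gives you the two headline numbers from the table above for your own prompts and region.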
Speed by Use Case
| Use Case | Key Speed Metric | Recommended Models |
|---|---|---|
| Chatbot | TTFT (<0.5s ideal) | Haiku 4, Flash, GPT-4o mini |
| Real-time suggestions | TTFT (<0.3s ideal) | Flash, Haiku 4 |
| Document processing (batch) | Throughput | Any (batch = not time-sensitive) |
| Coding assistant | TTFT + throughput | Sonnet 4, GPT-4o |
| Complex analysis | Quality > speed | Opus 4, o3 (speed less important) |
| Autocomplete | TTFT (<0.2s ideal) | Flash, Haiku 4 |
Factors That Affect Speed
- Prompt length. Longer prompts increase TTFT because the model must process more input before generating output. A 100K-token prompt has significantly higher TTFT than a 100-token prompt.
- Server load. Response times vary by time of day and overall demand. Peak hours (US business hours) tend to be slower.
- Region. API endpoints closer to your location provide lower latency. Most providers have multi-region deployments.
- Streaming vs. non-streaming. Streaming starts delivering tokens immediately but may have slightly lower throughput.
- Max tokens setting. Setting a lower max_tokens limit does not speed up generation but prevents unexpectedly long responses.
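The prompt-length effect is easy to reason about: TTFT for long prompts is dominated by prefill, since the model must process every input token before emitting the first output token. A rough model of this (the prefill rate and overhead below are illustrative assumptions, not provider figures):

```python
def estimate_ttft(prompt_tokens: int, prefill_tps: float = 5000.0,
                  overhead_s: float = 0.15) -> float:
    """Rough TTFT model: fixed network/queue overhead plus prompt prefill time."""
    return overhead_s + prompt_tokens / prefill_tps

# At the same assumed prefill rate, a 100-token prompt and a 100K-token
# prompt produce very different TTFTs.
short_ttft = estimate_ttft(100)       # dominated by fixed overhead
long_ttft = estimate_ttft(100_000)    # dominated by prefill
```

This is also why prompt caching helps: reusing the prefill computation removes the `prompt_tokens / prefill_tps` term for the cached prefix.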
Optimizing for Speed
- Use the right model tier. Do not use Opus for tasks that Haiku can handle.
- Minimize prompt size. Only include necessary context. Use retrieval rather than stuffing all information into the prompt.
- Implement streaming. Users perceive streaming responses as faster even when total time is the same.
- Use prompt caching. Cached prompts reduce TTFT because the prefill computation is reused.
- Consider parallel requests. For batch processing, send multiple requests simultaneously to increase overall throughput.
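For batch workloads, overall throughput comes from concurrency rather than per-request speed. A sketch using a thread pool (`call_model` is a stand-in for a real API call; with a real provider, cap `max_workers` to respect rate limits):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    """Stand-in for a real API call; sleeps to simulate network + generation."""
    time.sleep(0.1)
    return f"response to: {prompt}"

prompts = [f"summarize document {i}" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(call_model, prompts))
elapsed = time.perf_counter() - start
# 20 simulated 0.1s requests finish in roughly 0.2s with 10 workers,
# versus roughly 2s if sent one at a time.
```

Since each request mostly waits on the network, threads (or an async client) are enough; there is no need for multiprocessing.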
Key Takeaways
- Budget models (Gemini Flash, Claude Haiku 4) are 3-4x faster than premium models, making them ideal for interactive applications.
- Time-to-first-token (TTFT) is the most important metric for chatbots and real-time applications. Budget models achieve 0.2-0.3 second TTFT.
- Reasoning models (o3) are intentionally slow, trading speed for accuracy. They are not suitable for latency-sensitive applications.
- Streaming is essential for any user-facing application. It masks total generation time by showing partial output immediately.
- Prompt length significantly affects TTFT. Minimizing context improves responsiveness.
Next Steps
- Compare model pricing alongside speed: AI API Pricing Comparison: Cost Per Million Tokens.
- Test response speed yourself in our playground: AI Model Playground: Side-by-Side Comparison.
- Compare context windows that affect processing time: AI Model Context Window Comparison: 8K to 1M Tokens.
- Learn to optimize prompts for efficiency: Prompt Engineering 101: Get Better Results from Any AI.
This content is for informational purposes only and reflects independently researched comparisons. AI model capabilities change frequently — verify current specs with providers. Not professional advice.