AI Model Speed Benchmark: Time-to-First-Token and Throughput
Speed matters. Whether you are building a chatbot that needs instant responses or processing thousands of documents, the latency and throughput of your AI model directly affect user experience and cost. We benchmarked the major models on both metrics.
AI model comparisons are based on publicly available benchmarks and editorial testing. Results may vary by use case.
Speed Metrics Explained
Time-to-First-Token (TTFT): How long until the model starts generating output. This is what users “feel” in interactive applications. Lower is better.
Throughput (tokens/second): How fast the model generates output once it starts. Higher is better. This determines how quickly you get a complete response.
Total Response Time: The combined time to generate a full response. Depends on both TTFT and throughput, plus response length.
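These three metrics combine in a simple way: total response time is TTFT plus output length divided by throughput. A minimal sketch of that arithmetic (the function name is ours, not from any provider SDK):

```python
def estimate_total_time(ttft_s: float, throughput_tps: float, n_tokens: int) -> float:
    """Estimate total response time: time-to-first-token plus generation time."""
    return ttft_s + n_tokens / throughput_tps

# Reproduce a row of the benchmark table below: 0.2s TTFT at 180 tokens/sec
# for a 500-token response works out to roughly 3.0 seconds.
flash_total = round(estimate_total_time(0.2, 180, 500), 1)
```

The same formula explains why TTFT dominates perceived speed for short replies while throughput dominates for long ones.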
Speed Benchmark Results
| Model | TTFT (median) | Throughput (tokens/sec) | Total Time (500 tokens) | Tier |
|---|---|---|---|---|
| Gemini Flash | 0.2s | 180 | ~3.0s | Budget |
| Claude Haiku 4 | 0.3s | 150 | ~3.6s | Budget |
| GPT-4o mini | 0.3s | 140 | ~3.9s | Budget |
| GPT-4o | 0.5s | 90 | ~6.1s | Mid |
| Claude Sonnet 4 | 0.5s | 85 | ~6.4s | Mid |
| Gemini Pro | 0.6s | 80 | ~6.9s | Mid |
| Mistral Large | 0.5s | 75 | ~7.2s | Mid |
| Claude Opus 4 | 0.8s | 55 | ~9.9s | Premium |
| Gemini Ultra | 0.9s | 50 | ~10.9s | Premium |
| o3 | 2-15s | 45 | 15-60s+ | Reasoning |
Benchmarks measured using standard API endpoints under normal load. Results vary by time of day, prompt complexity, and server load.
Key Observations
Budget Models Are Remarkably Fast
Gemini Flash and Claude Haiku 4 respond almost instantly (0.2-0.3s TTFT) and generate text at 150-180 tokens per second. For interactive chatbots and real-time applications, these models provide the snappiest user experience.
Reasoning Models Are Slow by Design
o3 and similar reasoning models are intentionally slow because they generate internal “thinking” tokens before responding. A simple question might take 5 seconds; a complex math problem might take 60+ seconds. This is a feature, not a bug. The thinking time is what enables superior accuracy.
Premium Models Are 3-4x Slower Than Budget
Claude Opus 4 and Gemini Ultra generate output at roughly one-third the speed of their budget counterparts. This is the tradeoff for higher quality: more parameters mean more computation per token.
Streaming Masks Latency
Most applications use streaming (showing tokens as they are generated). With streaming, users perceive the model as responsive even when total generation takes 10+ seconds, because they see output appearing after just 0.5-0.9 seconds (TTFT).
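Both metrics can be measured directly from any streaming API by timestamping tokens as they arrive. A sketch against a generic token iterator (`fake_stream` below is a simulated stand-in for a real streaming response, with illustrative timings):

```python
import time
from typing import Iterable, Tuple

def measure_stream(tokens: Iterable[str]) -> Tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            ttft = time.perf_counter() - start  # time until first token arrived
        count += 1
    elapsed = time.perf_counter() - start
    gen_time = elapsed - (ttft or 0.0)  # time spent generating after the first token
    throughput = count / gen_time if gen_time > 0 else float("inf")
    return ttft or 0.0, throughput

def fake_stream(n=50, ttft=0.05, per_token=0.002):
    """Simulated model: waits out the TTFT, then emits tokens at a fixed rate."""
    time.sleep(ttft)
    for i in range(n):
        yield f"tok{i}"
        time.sleep(per_token)

ttft, tps = measure_stream(fake_stream())
```

Swapping `fake_stream()` for a real streamed API response gives you the two headline numbers from the table above for your own prompts and region.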
Speed by Use Case
| Use Case | Key Speed Metric | Recommended Models |
|---|---|---|
| Chatbot | TTFT (<0.5s ideal) | Haiku 4, Flash, GPT-4o mini |
| Real-time suggestions | TTFT (<0.3s ideal) | Flash, Haiku 4 |
| Document processing (batch) | Throughput | Any (batch = not time-sensitive) |
| Coding assistant | TTFT + throughput | Sonnet 4, GPT-4o |
| Complex analysis | Quality > speed | Opus 4, o3 (speed less important) |
| Autocomplete | TTFT (<0.2s ideal) | Flash, Haiku 4 |
Factors That Affect Speed
- Prompt length. Longer prompts increase TTFT because the model must process more input before generating output. A 100K-token prompt has significantly higher TTFT than a 100-token prompt.
- Server load. Response times vary by time of day and overall demand. Peak hours (US business hours) tend to be slower.
- Region. API endpoints closer to your location provide lower latency. Most providers have multi-region deployments.
- Streaming vs. non-streaming. Streaming starts delivering tokens immediately but may have slightly lower throughput.
- Max tokens setting. Setting a lower max_tokens limit does not speed up generation but prevents unexpectedly long responses.
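The prompt-length effect is easy to reason about: TTFT for long prompts is dominated by prefill, since the model must process every input token before emitting the first output token. A rough model of this (the prefill rate and overhead below are illustrative assumptions, not provider figures):

```python
def estimate_ttft(prompt_tokens: int, prefill_tps: float = 5000.0,
                  overhead_s: float = 0.15) -> float:
    """Rough TTFT model: fixed network/queue overhead plus prompt prefill time."""
    return overhead_s + prompt_tokens / prefill_tps

# At the same assumed prefill rate, a 100-token prompt and a 100K-token
# prompt produce very different TTFTs.
short_ttft = estimate_ttft(100)       # dominated by fixed overhead
long_ttft = estimate_ttft(100_000)    # dominated by prefill
```

This is also why prompt caching helps: reusing the prefill computation removes the `prompt_tokens / prefill_tps` term for the cached prefix.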
Optimizing for Speed
- Use the right model tier. Do not use Opus for tasks that Haiku can handle.
- Minimize prompt size. Only include necessary context. Use retrieval rather than stuffing all information into the prompt.
- Implement streaming. Users perceive streaming responses as faster even when total time is the same.
- Use prompt caching. Cached prompts reduce TTFT because the prefill computation is reused.
- Consider parallel requests. For batch processing, send multiple requests simultaneously to increase overall throughput.
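For batch workloads, overall throughput comes from concurrency rather than per-request speed. A sketch using a thread pool (`call_model` is a stand-in for a real API call; with a real provider, cap `max_workers` to respect rate limits):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    """Stand-in for a real API call; sleeps to simulate network + generation."""
    time.sleep(0.1)
    return f"response to: {prompt}"

prompts = [f"summarize document {i}" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(call_model, prompts))
elapsed = time.perf_counter() - start
# 20 simulated 0.1s requests finish in roughly 0.2s with 10 workers,
# versus roughly 2s if sent one at a time.
```

Since each request mostly waits on the network, threads (or an async client) are enough; there is no need for multiprocessing.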
Key Takeaways
- Budget models (Gemini Flash, Claude Haiku 4) are 3-4x faster than premium models, making them ideal for interactive applications.
- Time-to-first-token (TTFT) is the most important metric for chatbots and real-time applications. Budget models achieve 0.2-0.3 second TTFT.
- Reasoning models (o3) are intentionally slow, trading speed for accuracy. They are not suitable for latency-sensitive applications.
- Streaming is essential for any user-facing application. It masks total generation time by showing partial output immediately.
- Prompt length significantly affects TTFT. Minimizing context improves responsiveness.
Next Steps
- Compare model pricing alongside speed: AI API Pricing Comparison: Cost Per Million Tokens.
- Test response speed yourself in our playground: AI Model Playground: Side-by-Side Comparison.
- Compare context windows that affect processing time: AI Model Context Window Comparison: 8K to 1M Tokens.
- Learn to optimize prompts for efficiency: Prompt Engineering 101: Get Better Results from Any AI.
This content is for informational purposes only and reflects independently researched comparisons. AI model capabilities change frequently — verify current specs with providers. Not professional advice.