🔬 Inference Benchmark · March 2026

LLM Inference Showdown: Fox vs Ollama vs vLLM

A Reddit post claimed Fox achieves 2× Ollama throughput. I benchmarked all three on dual RTX 3090s. Here's what actually happened.

⚙️ 2× RTX 3090 (24 GB each) 🧠 Qwen3-14B 📅 March 24, 2026
Ollama ✓ single user · vLLM ✓ production · Fox ✓ patched & running · Fox claim ~ partially true

TL;DR

vLLM vs Ollama · TTFT (concurrent): 213× faster first token under 4-way concurrency
vLLM vs Ollama · throughput (concurrent): 3.4× more tokens/s total under load
Fox vs Ollama · concurrent throughput: 1.7× (Fox 129 tok/s vs Ollama 74 tok/s, 4-way)
Fox vs Ollama · sequential TTFT: ~6× worse (Fox 1000 ms vs Ollama 175 ms, the continuous-batching cost)

Fox (patched): Three root-cause bugs fixed and submitted upstream as PR #1. After patching: Fox handles 8 concurrent requests cleanly. Throughput advantage is real (~1.7× Ollama under 4-way concurrency) but comes with a significant TTFT tradeoff — Fox batches requests, so sequential TTFT is ~6× worse than Ollama. The "2× Ollama" claim is partially true for throughput, but TTFT tells a different story.

Ollama vs vLLM: Nearly identical for a single user. vLLM dominates the moment multiple requests arrive simultaneously — by 2–3 orders of magnitude in latency, because the architectures are fundamentally different. Ollama queues. vLLM batches.

Hardware & Setup

CPU: AMD EPYC 7C13, 64 cores @ 2.0 GHz
GPUs: 2× RTX 3090, 24 GB each · PCIe (no NVLink)
Model: Qwen3-14B · Q4_K_M GGUF (Ollama/Fox) · AWQ (vLLM)

Framework versions: Ollama v0.18.0 (system daemon, SCHED_SPREAD=1 — model split across both GPUs), vLLM 0.15.1 (single GPU, PagedAttention), Fox built from source HEAD 2026-03-24.

Methodology: Each framework was benchmarked in isolation — model evicted from VRAM before loading the next framework. Three workloads: 10 sequential requests (serial), 10 concurrent requests (4 in-flight), 5 multi-turn conversations (3 turns each). Same prompt across all tests. Warm-up request sent before timing begins.

Fox: Three Bugs Found, Fixed, and Benchmarked

Before running anything, I reviewed the source code. The good news: no malicious code, no telemetry, no backdoors. The Rust code quality is genuinely solid — PagedAttention, KV cache, and prefix caching are correctly implemented. Three runtime bugs prevented Fox from running on this hardware. All three are fixed and submitted upstream as PR #1.

Bug 1: n_ctx OOM — KV context size tied to batch size (model.rs)

n_ctx = max_context_len × max_batch_size = 4096 × 32 = 131,072 tokens. llama.cpp allocates this as a single contiguous KV buffer upfront. At fp16 for a 14B model, that exceeds any single 24GB GPU before the weights even load.

Fix: Decouple n_seq_max from max_batch_size. Hardcoded to 8 — the Fox KV block manager handles logical multiplexing above this layer.
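To see the scale of that allocation, here is a back-of-envelope sketch of the fp16 KV footprint. The Qwen3-14B shape numbers (40 layers, 8 KV heads under GQA, head dim 128) are my assumptions for illustration, not taken from the post:

```rust
// fp16 KV cache size for a contiguous buffer of `n_ctx` tokens:
// K and V tensors per layer, 2 bytes per element.
fn kv_cache_bytes(n_ctx: usize, n_layers: usize, n_kv_heads: usize, head_dim: usize) -> usize {
    n_ctx * n_layers * 2 * n_kv_heads * head_dim * 2
}

fn main() {
    // The buggy sizing: max_context_len × max_batch_size = 131,072 tokens.
    let n_ctx = 4096 * 32;
    let gib = kv_cache_bytes(n_ctx, 40, 8, 128) as f64 / (1u64 << 30) as f64;
    println!("KV buffer: {gib:.1} GiB"); // ≈ 20 GiB before any weights load
}
```

Under these assumed numbers the KV buffer alone is roughly 20 GiB; stack ~8 GB of Q4 weights and scratch buffers on top and the allocation cannot fit on one 24 GB card.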

Bug 2: seq_id pool/context mismatch — decode crashes with invalid seq_id (scheduler.rs)

Scheduler pool was 0..max_batch_size (0–31), but the llama.cpp context only has 8 slots. Every decode failed with: init: invalid seq_id[0][0] = 31 >= 8

Fix: Cap pool to N_SEQ_MAX = 8 so assigned seq IDs are always valid.
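A minimal sketch of the fix, with names assumed (the real code lives in scheduler.rs): the pool of assignable seq IDs is clamped to the slot count the llama.cpp context was actually created with.

```rust
// Slots actually allocated in the llama.cpp context (see Bug 1 fix).
const N_SEQ_MAX: u32 = 8;

// Before the fix the pool was 0..max_batch_size (0–31), so ids 8..31
// hit `invalid seq_id ... >= 8` on the first decode. Clamping keeps
// every assignable id inside the context's slot range.
fn seq_id_pool(max_batch_size: u32) -> Vec<u32> {
    (0..max_batch_size.min(N_SEQ_MAX)).collect()
}
```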

Bug 3: Multi-GPU nvidia-smi parsing (cli/mod.rs)

s.trim().parse::<usize>() on the nvidia-smi output fails when 2 GPUs produce two lines ("24576\n24576"), silently falling back to a hardcoded 8 GiB default and underestimating available VRAM 3×.

Fix: Take only s.lines().next() before parsing.
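A hedged reconstruction of the one-line fix (function name is mine): a per-GPU query like nvidia-smi --query-gpu=memory.total prints one line per GPU, so only the first line should be parsed.

```rust
// With two GPUs, stdout is "24576\n24576\n". Parsing the whole blob
// with `s.trim().parse::<usize>()` fails (the interior newline survives
// trim), which is what triggered the silent 8 GiB fallback.
fn first_gpu_vram_mib(smi_output: &str) -> Option<usize> {
    smi_output.lines().next()?.trim().parse::<usize>().ok()
}
```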

Bug 4 (follow-on): Prefix cache seq_cp assert in llama.cpp

After fixing bugs 1–3, concurrent requests triggered GGML_ASSERT(is_full && "seq_cp() is only supported for full KV buffers") in the prefix cache path. With n_seq=8 and the pool capped to 8, there are no spare slots for prefix cache entries — Fox was copying sequences from slots not yet fully committed. Fixed by disabling prefix cache when pool size is constrained.
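The guard can be sketched like this (names assumed; a simplification of the actual fix): a prefix-cache entry needs a spare context slot to seq_cp into, so the cache is disabled whenever the request pool claims every slot.

```rust
// With n_seq_max = 8 and the seq_id pool capped at 8 (Bug 2 fix),
// pool_size == n_seq_max leaves zero spare slots, so seq_cp would
// copy from a slot that is not yet a full KV buffer.
fn prefix_cache_enabled(n_seq_max: usize, pool_size: usize) -> bool {
    pool_size < n_seq_max
}
```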

✅ After all four fixes: Fox loads, serves inference, handles 8 concurrent requests without crashing. Benchmarks below.

The original Fox benchmark used a 3B model on an RTX 4060 — a fundamentally different regime. The bugs above only manifest with 14B+ models and/or multi-GPU systems.


Since I was reviewing the code anyway, one security note: Fox ships with no authentication and wildcard CORS, so treat it as local-only for now.

Fox vs Ollama vs vLLM: The Real Numbers

Three-way comparison: Fox (patched) (dual GPU, Qwen3-14B-Q4KM, with --system-prompt "/no_think") vs Ollama (dual GPU, SCHED_SPREAD=1, same model) vs vLLM (single GPU, Qwen3-14B-AWQ, PagedAttention). Each framework loaded in isolation with VRAM cleared between runs.

Note on Fox TTFT: Fox uses continuous batching — it holds a request in the queue until a batch slot is free, then prefills everything together. This makes sequential TTFT appear high (~1000ms) versus Ollama's per-request dispatch (~175ms). Under concurrency, the calculus flips: Fox's batched throughput outperforms Ollama's serial queue.

Time to First Token (P50)
(chart: lower is better, milliseconds)

Response Latency (P50)
(chart: full response time, milliseconds)

Throughput: Tokens per Second
(chart: higher is better, output tokens ÷ response duration)
Sequential verdict: Ollama TTFT 175ms · Fox TTFT 1000ms · vLLM TTFT 27ms. Fox's ~1000ms is the cost of continuous batching scheduling overhead. Throughput: Fox 78 tok/s ≈ Ollama 68 tok/s ≈ vLLM 77 tok/s — all within noise. For single-user interactive use, vLLM and Ollama win on feel; Fox feels sluggish despite equal throughput.

TTFT Across All Three Workloads

How first-token latency changes from 1 user to concurrent — the defining story of this benchmark

Why the Gap Exists: Queuing vs Batching

The 213× TTFT gap under concurrency isn't a performance tuning issue — it's a consequence of fundamentally different architectural choices. Understanding the gap means understanding what happens when two requests arrive at the same time.

Ollama: Serial Queue

4 requests arrive simultaneously →
R1: inference starts immediately
R2: waits for R1
R3: waits for R1 + R2
R4: waits for R1 + R2 + R3
R4 TTFT = 3 full inference durations

vLLM: PagedAttention Batch

4 requests arrive simultaneously →
R1, R2, R3, R4: one batched forward pass (same GPU kernel)
R4 TTFT ≈ R1 TTFT

Ollama processes one request at a time, blocking the next until the current one finishes. Request 4 waits for requests 1, 2, and 3 to complete — so its first-token latency is the sum of three full inference durations (~8 seconds at this model size).

vLLM's PagedAttention treats concurrent requests as a single batched forward pass. All four requests compute their first token simultaneously in the same GPU kernel. The TTFT for request 4 is essentially the same as for request 1 — because they ran together.
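The arithmetic behind the two disciplines, as a toy model (timings are illustrative, not measured):

```rust
// Serial queue: request k (0-indexed) waits for k full inference
// durations before its own prefill even starts.
fn serial_ttft_s(rank: usize, t_infer_s: f64, t_prefill_s: f64) -> f64 {
    rank as f64 * t_infer_s + t_prefill_s
}

// Continuous batch: every request prefills in the same forward pass,
// so arrival rank barely matters.
fn batched_ttft_s(_rank: usize, t_prefill_s: f64) -> f64 {
    t_prefill_s
}

fn main() {
    // Assuming ~2.7 s per full response, request 4 in the serial queue
    // waits ~8 s for its first token; in the batch it waits ~0.2 s.
    println!("serial R4:  {:.1} s", serial_ttft_s(3, 2.7, 0.2));
    println!("batched R4: {:.1} s", batched_ttft_s(3, 0.2));
}
```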

This is a categorical architectural difference, not a tuning gap. You can't get vLLM-level concurrent performance from Ollama by changing config flags, because Ollama's design choice is serial execution. It's the right choice for single-user simplicity. It's the wrong choice for any multi-user workload.

What to Use and When

🦙 Ollama

Personal use, interactive chat, local dev. Trivial setup, model management, persistent daemon. 71 tok/s for a single user is excellent. The right default.

vLLM

Multi-user APIs, automation pipelines, anything with concurrent requests or latency requirements. 252 tok/s wall-clock under load. More setup but incomparable performance.

🦊 Fox

If you need throughput over latency and run 14B+ models: worth trying with the patched binary. PR #1 fixes the three crashes. Concurrent throughput beats Ollama by ~1.7×. TTFT is poor for sequential use (~1s vs <200ms). Security gaps (no auth, wildcard CORS) mean local-only for now.

On this machine specifically: Ollama runs as a persistent system daemon (KEEP_ALIVE=24h). vLLM is launched manually when needed via bash ~/vllm-srv/start.sh. No need to choose — they coexist and serve different jobs.

Methodology Notes

All frameworks benchmarked in isolation (other evicted from VRAM before loading). Warmup request sent before timing. Prompt: "Explain the difference between TCP and UDP in 3 sentences." max_tokens=200, temperature=0.

Ollama / think-mode: qwen3-uncensored generates reasoning tokens by default and ignores think: false in the API body. A local proxy on port 11435 injects the flag at the request level. The 14B GGUF model used in the main benchmark doesn't exhibit this issue.

Ollama and Fox used both GPUs (~8 GB VRAM each) with Qwen3-14B-Q4_K_M GGUF. vLLM used one GPU (~20 GB) with Qwen3-14B-AWQ — same base model, same parameter count, same hardware budget.

TTFT = time from request send to first token. Tok/s = output tokens ÷ total response duration. Throughput = total output tokens ÷ wall-clock time (concurrent tests only).
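The same metric definitions as code (struct and field names are mine, not from the benchmark harness):

```rust
// Per-request timestamps in seconds, plus the token count produced.
struct Timing {
    sent: f64,
    first_token: f64,
    done: f64,
    out_tokens: usize,
}

// TTFT: request send → first token, in milliseconds.
fn ttft_ms(t: &Timing) -> f64 {
    (t.first_token - t.sent) * 1000.0
}

// Per-request tok/s: output tokens over full response duration.
fn tok_per_s(t: &Timing) -> f64 {
    t.out_tokens as f64 / (t.done - t.sent)
}

// Concurrent throughput: all output tokens over wall-clock time.
fn wall_throughput(runs: &[Timing], wall_s: f64) -> f64 {
    runs.iter().map(|t| t.out_tokens).sum::<usize>() as f64 / wall_s
}
```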

Posted to r/LocalLLM thread. More writing at bayesianpersuasion.com.