A Reddit post claimed Fox achieves 2× Ollama throughput. I benchmarked all three on dual RTX 3090s. Here's what actually happened.
Fox (patched): Three root-cause bugs fixed and submitted upstream as PR #1. After patching: Fox handles 8 concurrent requests cleanly. Throughput advantage is real (~1.7× Ollama under 4-way concurrency) but comes with a significant TTFT tradeoff — Fox batches requests, so sequential TTFT is ~6× worse than Ollama. The "2× Ollama" claim is partially true for throughput, but TTFT tells a different story.
Ollama vs vLLM: Nearly identical for a single user. vLLM dominates the moment multiple requests arrive simultaneously — by 2–3 orders of magnitude in latency, because the architectures are fundamentally different. Ollama queues. vLLM batches.
Framework versions: Ollama v0.18.0 (system daemon, SCHED_SPREAD=1, model split across both GPUs), vLLM 0.15.1 (single GPU, PagedAttention), Fox built from source at HEAD as of 2026-03-24.
Methodology: Each framework was benchmarked in isolation — model evicted from VRAM before loading the next framework. Three workloads: 10 sequential requests (serial), 10 concurrent requests (4 in-flight), 5 multi-turn conversations (3 turns each). Same prompt across all tests. Warm-up request sent before timing begins.
Before running anything, I reviewed the source code. The good news: no malicious code, no telemetry, no backdoors. The Rust code quality is genuinely solid — PagedAttention, KV cache, and prefix caching are correctly implemented. Three runtime bugs prevented Fox from running on this hardware. All three are fixed and submitted upstream as PR #1.
Bug 1: n_ctx OOM — KV context size tied to batch size (model.rs)
n_ctx = max_context_len × max_batch_size = 4096 × 32 = 131,072 tokens. llama.cpp allocates this as a single contiguous KV buffer upfront. At fp16 for a 14B model, that exceeds any single 24GB GPU before the weights even load.
Fix: Decouple n_seq_max from max_batch_size. Hardcoded to 8 — the Fox KV block manager handles logical multiplexing above this layer.
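To see why tying n_ctx to batch size blows up, here's the back-of-envelope sizing, sketched in Rust. The layer/head dimensions (48 layers, 8 GQA KV heads, head_dim 128) are hypothetical values for a ~14B-class model, not Fox's actual config; the point is the 4× difference between a pool of 32 and a pool of 8.

```rust
// Back-of-envelope KV buffer sizing. Layer and head counts below are
// hypothetical GQA dims for a ~14B model, NOT read from Fox's config.
fn kv_bytes(n_ctx: usize, n_layers: usize, n_kv_heads: usize, head_dim: usize) -> usize {
    // K and V tensors, fp16 (2 bytes per element), one slot per context token
    n_ctx * n_layers * 2 * n_kv_heads * head_dim * 2
}

fn main() {
    let (layers, kv_heads, dim) = (48, 8, 128);
    // Buggy: n_ctx = max_context_len * max_batch_size = 4096 * 32
    let buggy = kv_bytes(4096 * 32, layers, kv_heads, dim);
    // Patched: n_seq_max decoupled from batch size and capped at 8
    let fixed = kv_bytes(4096 * 8, layers, kv_heads, dim);
    println!("buggy: {:.1} GiB", buggy as f64 / (1u64 << 30) as f64); // 24.0 GiB
    println!("fixed: {:.1} GiB", fixed as f64 / (1u64 << 30) as f64); // 6.0 GiB
}
```

Under these assumed dims the buggy allocation alone fills a 24 GB card; the capped pool leaves room for weights.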
Bug 2: seq_id pool/context mismatch — decode crashes with invalid seq_id (scheduler.rs)
Scheduler pool was 0..max_batch_size (0–31), but llama.cpp context only has 8 slots. Every decode failed:
init: invalid seq_id[0][0] = 31 >= 8
Fix: Cap pool to N_SEQ_MAX = 8 so assigned seq IDs are always valid.
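A minimal sketch of the fix, assuming a free-list-style seq-id pool; the struct and names here are illustrative, inferred from the bug description rather than copied from scheduler.rs.

```rust
// Illustrative seq-id pool. N_SEQ_MAX mirrors the llama.cpp context
// slots actually allocated; the pool shape itself is a sketch.
const N_SEQ_MAX: usize = 8;

struct SeqIdPool {
    free: Vec<i32>,
}

impl SeqIdPool {
    // Buggy version drew ids from 0..max_batch_size (0..32), so llama.cpp
    // rejected ids like 31: "invalid seq_id[0][0] = 31 >= 8".
    fn new(max_batch_size: usize) -> Self {
        let cap = max_batch_size.min(N_SEQ_MAX); // the fix: cap the pool
        Self { free: (0..cap as i32).rev().collect() }
    }

    // Hands out an id that is guaranteed valid for the context, or None
    // when all 8 slots are in flight (the request waits in the queue).
    fn acquire(&mut self) -> Option<i32> {
        self.free.pop()
    }
}

fn main() {
    let mut pool = SeqIdPool::new(32);
    while let Some(id) = pool.acquire() {
        assert!(id < N_SEQ_MAX as i32); // every assigned id is a real slot
    }
}
```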
Bug 3: Multi-GPU nvidia-smi parsing (cli/mod.rs)
s.trim().parse::<usize>() on the nvidia-smi output fails when two GPUs produce two lines ("24576\n24576"), silently falling back to a hardcoded 8 GiB and underestimating available VRAM by ~3×.
Fix: Take only s.lines().next() before parsing.
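The fix in sketch form; the helper name and surrounding plumbing are paraphrased from the description, but the core change is exactly taking the first line before parsing.

```rust
// Parse per-GPU VRAM (MiB) from nvidia-smi query output. The buggy
// version called .trim().parse() on the whole string, which fails on
// multi-GPU output like "24576\n24576" and silently fell back to 8 GiB.
// Function name is illustrative; the real code lives in cli/mod.rs.
fn parse_vram_mib(nvidia_smi_output: &str) -> Option<usize> {
    // The fix: take only the first line, then parse.
    nvidia_smi_output.lines().next()?.trim().parse().ok()
}

fn main() {
    // Two-GPU output that broke the original parser:
    assert_eq!(parse_vram_mib("24576\n24576"), Some(24576));
    // Single-GPU output still parses the same way:
    assert_eq!(parse_vram_mib("24576"), Some(24576));
}
```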
Bug 4 (follow-on): Prefix cache seq_cp assert in llama.cpp
After fixing bugs 1–3, concurrent requests triggered GGML_ASSERT(is_full && "seq_cp() is only supported for full KV buffers") in the prefix cache path. With n_seq=8 and the pool capped to 8, there are no spare slots for prefix cache entries — Fox was copying sequences from slots not yet fully committed. Fixed by disabling prefix cache when pool size is constrained.
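The guard can be sketched as below; the condition and field names are inferred from the description ("disable prefix cache when pool size is constrained"), not taken from Fox's actual code.

```rust
// Illustrative guard for Bug 4. Names are hypothetical.
struct KvConfig {
    n_seq_max: usize, // llama.cpp context slots (8 after the Bug 1 fix)
    pool_size: usize, // scheduler seq-id pool size (8 after the Bug 2 fix)
}

impl KvConfig {
    // A prefix-cache entry needs a spare, fully-committed KV slot to
    // seq_cp() from. With pool_size == n_seq_max there are no spares,
    // which is what tripped the GGML_ASSERT above.
    fn prefix_cache_allowed(&self) -> bool {
        self.pool_size < self.n_seq_max
    }
}

fn main() {
    let cfg = KvConfig { n_seq_max: 8, pool_size: 8 };
    assert!(!cfg.prefix_cache_allowed()); // constrained pool: cache disabled
}
```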
✅ After all four fixes: Fox loads, serves inference, handles 8 concurrent requests without crashing. Benchmarks below.
The original Fox benchmark used a 3B model on an RTX 4060 — a fundamentally different regime. The bugs above only manifest with 14B+ models and/or multi-GPU systems.
Since I was reviewing the code anyway:
Fox binds 0.0.0.0:8080 with zero auth. Anyone on your network can submit inference, pull models, or delete model files.
CORS is wide open (CorsLayer::permissive()). Any webpage in any browser tab can call your Fox API; if you visit a malicious site while Fox is running, it can query your model.
pull_handler.rs constructs a "digest" from the filename and never actually hashes the file, so the verification is a no-op.
Model paths are restricted to .gguf files, but there's no canonicalize() check.
install.sh downloads the binary without a checksum.
The /metrics endpoint is unauthenticated.
Three-way comparison: Fox (patched) (dual GPU, Qwen3-14B-Q4KM, with --system-prompt "/no_think") vs
Ollama (dual GPU, SCHED_SPREAD=1, same model) vs
vLLM (single GPU, Qwen3-14B-AWQ, PagedAttention).
Each framework loaded in isolation with VRAM cleared between runs.
Note on Fox TTFT: Fox uses continuous batching — it holds a request in the queue until a batch slot is free, then prefills everything together. This makes sequential TTFT appear high (~1000ms) versus Ollama's per-request dispatch (~175ms). Under concurrency, the calculus flips: Fox's batched throughput outperforms Ollama's serial queue.
Sequential verdict: Ollama TTFT 175ms · Fox TTFT 1000ms · vLLM TTFT 27ms. Fox's ~1000ms is the cost of continuous batching scheduling overhead. Throughput: Fox 78 tok/s ≈ Ollama 68 tok/s ≈ vLLM 77 tok/s — all within noise. For single-user interactive use, vLLM and Ollama win on feel; Fox feels sluggish despite equal throughput.
The 213× TTFT gap under concurrency isn't a performance tuning issue — it's a consequence of fundamentally different architectural choices. Understanding the gap means understanding what happens when two requests arrive at the same time.
Ollama processes one request at a time, blocking the next until the current one finishes. Request 4 waits for requests 1, 2, and 3 to complete — so its first-token latency is the sum of three full inference durations (~8 seconds at this model size).
vLLM's PagedAttention treats concurrent requests as a single batched forward pass. All four requests compute their first token simultaneously in the same GPU kernel. The TTFT for request 4 is essentially the same as for request 1 — because they ran together.
This is a categorical architectural difference, not a tuning gap. You can't get vLLM-level concurrent performance from Ollama by changing config flags, because Ollama's design choice is serial execution. It's the right choice for single-user simplicity. It's the wrong choice for any multi-user workload.
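The two architectures reduce to two different TTFT formulas, sketched below. The 2.6 s per-request duration is an illustrative assumption chosen to roughly match the ~8 s figure quoted above, and the 0.03 s batched prefill is likewise assumed; neither is a measured number.

```rust
// Toy model of the serial-queue vs batched-prefill difference.
// All timing constants here are illustrative assumptions.

// Serial queue (Ollama-style): request i's first token waits for all
// i earlier requests to finish completely.
fn serial_ttft(req_index: usize, per_request_secs: f64) -> f64 {
    req_index as f64 * per_request_secs
}

// Batched prefill (vLLM-style): all in-flight requests compute their
// first token in the same forward pass, so TTFT is one prefill.
fn batched_ttft(_req_index: usize, prefill_secs: f64) -> f64 {
    prefill_secs
}

fn main() {
    let per_req = 2.6; // assumed full-response time at this model size
    // Request 4 (index 3) under a serial queue: sum of three requests.
    println!("serial  req4 TTFT: {:.1} s", serial_ttft(3, per_req)); // 7.8 s
    // Same request under batching: bounded by a single prefill.
    println!("batched req4 TTFT: {:.2} s", batched_ttft(3, 0.03));
}
```

No config flag changes the left formula into the right one; that is the sense in which the gap is architectural.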
Ollama verdict: Personal use, interactive chat, local dev. Trivial setup, model management, persistent daemon. 71 tok/s for a single user is excellent. The right default.
vLLM verdict: Multi-user APIs, automation pipelines, anything with concurrent requests or latency requirements. 252 tok/s wall-clock under load. More setup but incomparable performance.
Fox verdict: If you need throughput over latency and run 14B+ models, worth trying with the patched binary. PR #1 fixes the three crashes. Concurrent throughput beats Ollama by ~1.7×. TTFT is poor for sequential use (~1s vs <200ms). Security gaps (no auth, wildcard CORS) mean local-only for now.
On this machine specifically: Ollama runs as a persistent system daemon (KEEP_ALIVE=24h).
vLLM is launched manually when needed via bash ~/vllm-srv/start.sh.
No need to choose — they coexist and serve different jobs.
All frameworks benchmarked in isolation (other evicted from VRAM before loading). Warmup request sent before timing. Prompt: "Explain the difference between TCP and UDP in 3 sentences." max_tokens=200, temperature=0.
Ollama / think-mode: qwen3-uncensored generates reasoning tokens by default and ignores think: false in the API body. A local proxy on port 11435 injects the flag at the request level. The 14B GGUF model used in the main benchmark doesn't exhibit this issue.
Ollama and Fox used both GPUs (~8 GB VRAM each) with Qwen3-14B-Q4_K_M GGUF. vLLM used one GPU (~20 GB) with Qwen3-14B-AWQ — same base model, same parameter count, same hardware budget.
TTFT = time from request send to first token. Tok/s = output tokens ÷ total response duration. Throughput = total output tokens ÷ wall-clock time (concurrent tests only).
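The three metric definitions, expressed as code so there's no ambiguity about denominators. The sample timestamps in main are invented for illustration, not taken from the benchmark runs.

```rust
// Metric definitions used in this writeup. Sample numbers are made up.
struct Req {
    send: f64,        // request send time (s)
    first_token: f64, // first token arrival (s)
    done: f64,        // last token arrival (s)
    out_tokens: usize,
}

// TTFT = time from request send to first token.
fn ttft(r: &Req) -> f64 {
    r.first_token - r.send
}

// Tok/s = output tokens / total response duration (per request).
fn tok_per_s(r: &Req) -> f64 {
    r.out_tokens as f64 / (r.done - r.send)
}

// Throughput = total output tokens / wall-clock time (concurrent tests).
fn throughput(reqs: &[Req], wall_secs: f64) -> f64 {
    reqs.iter().map(|r| r.out_tokens).sum::<usize>() as f64 / wall_secs
}

fn main() {
    let r = Req { send: 0.0, first_token: 0.175, done: 2.975, out_tokens: 200 };
    println!("TTFT: {:.0} ms", ttft(&r) * 1000.0); // 175 ms
    println!("tok/s: {:.1}", tok_per_s(&r));       // 67.2
}
```

Note the per-request tok/s divides by that request's own duration, while throughput divides by shared wall-clock time; under concurrency the two diverge, which is why Fox can lose on TTFT and still win on throughput.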
Posted to r/LocalLLM thread. More writing at bayesianpersuasion.com.