A Reddit post claimed Fox achieves 2× Ollama throughput. I benchmarked all three on dual RTX 3090s. Here's what actually happened.
Fox (patched): Three root-cause bugs fixed and submitted upstream as PR #1. After patching: Fox handles 8 concurrent requests cleanly. Throughput advantage is real (~1.7× Ollama under 4-way concurrency) but comes with a significant TTFT tradeoff — Fox batches requests, so sequential TTFT is ~6× worse than Ollama. The "2× Ollama" claim is partially true for throughput, but TTFT tells a different story.
Ollama vs vLLM: Nearly identical for a single user. vLLM dominates the moment multiple requests arrive simultaneously — by 2–3 orders of magnitude in latency, because the architectures are fundamentally different. Ollama queues. vLLM batches.
Framework versions: Ollama v0.18.0 (system daemon, SCHED_SPREAD=1, model split across both GPUs), vLLM 0.15.1 (single GPU, PagedAttention), Fox built from source at HEAD as of 2026-03-24.
Methodology: Each framework was benchmarked in isolation — model evicted from VRAM before loading the next framework. Three workloads: 10 sequential requests (serial), 10 concurrent requests (4 in-flight), 5 multi-turn conversations (3 turns each). Same prompt across all tests. Warm-up request sent before timing begins.
Before running anything, I reviewed the source code. The good news: no malicious code, no telemetry, no backdoors. The Rust code quality is genuinely solid — PagedAttention, KV cache, and prefix caching are correctly implemented. Three runtime bugs prevented Fox from running on this hardware. All three are fixed and submitted upstream as PR #1.
Bug 1: n_ctx OOM — KV context size tied to batch size (model.rs)
n_ctx = max_context_len × max_batch_size = 4096 × 32 = 131,072 tokens. llama.cpp allocates this as a single contiguous KV buffer upfront. At fp16 for a 14B model, that exceeds any single 24GB GPU before the weights even load.
Fix: Decouple n_seq_max from max_batch_size. Hardcoded to 8 — the Fox KV block manager handles logical multiplexing above this layer.
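To see why tying n_ctx to batch size blows up, here's the back-of-envelope sizing, sketched in Rust. The layer/head dimensions (48 layers, 8 GQA KV heads, head_dim 128) are hypothetical values for a ~14B-class model, not Fox's actual config; the point is the 4× difference between a pool of 32 and a pool of 8.

```rust
// Back-of-envelope KV buffer sizing. Layer and head counts below are
// hypothetical GQA dims for a ~14B model, NOT read from Fox's config.
fn kv_bytes(n_ctx: usize, n_layers: usize, n_kv_heads: usize, head_dim: usize) -> usize {
    // K and V tensors, fp16 (2 bytes per element), one slot per context token
    n_ctx * n_layers * 2 * n_kv_heads * head_dim * 2
}

fn main() {
    let (layers, kv_heads, dim) = (48, 8, 128);
    // Buggy: n_ctx = max_context_len * max_batch_size = 4096 * 32
    let buggy = kv_bytes(4096 * 32, layers, kv_heads, dim);
    // Patched: n_seq_max decoupled from batch size and capped at 8
    let fixed = kv_bytes(4096 * 8, layers, kv_heads, dim);
    println!("buggy: {:.1} GiB", buggy as f64 / (1u64 << 30) as f64); // 24.0 GiB
    println!("fixed: {:.1} GiB", fixed as f64 / (1u64 << 30) as f64); // 6.0 GiB
}
```

Under these assumed dims the buggy allocation alone fills a 24 GB card; the capped pool leaves room for weights.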
Bug 2: seq_id pool/context mismatch — decode crashes with invalid seq_id (scheduler.rs)
Scheduler pool was 0..max_batch_size (0–31), but llama.cpp context only has 8 slots. Every decode failed:
init: invalid seq_id[0][0] = 31 >= 8
Fix: Cap pool to N_SEQ_MAX = 8 so assigned seq IDs are always valid.
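A minimal sketch of the fix, assuming a free-list-style seq-id pool; the struct and names here are illustrative, inferred from the bug description rather than copied from scheduler.rs.

```rust
// Illustrative seq-id pool. N_SEQ_MAX mirrors the llama.cpp context
// slots actually allocated; the pool shape itself is a sketch.
const N_SEQ_MAX: usize = 8;

struct SeqIdPool {
    free: Vec<i32>,
}

impl SeqIdPool {
    // Buggy version drew ids from 0..max_batch_size (0..32), so llama.cpp
    // rejected ids like 31: "invalid seq_id[0][0] = 31 >= 8".
    fn new(max_batch_size: usize) -> Self {
        let cap = max_batch_size.min(N_SEQ_MAX); // the fix: cap the pool
        Self { free: (0..cap as i32).rev().collect() }
    }

    // Hands out an id that is guaranteed valid for the context, or None
    // when all 8 slots are in flight (the request waits in the queue).
    fn acquire(&mut self) -> Option<i32> {
        self.free.pop()
    }
}

fn main() {
    let mut pool = SeqIdPool::new(32);
    while let Some(id) = pool.acquire() {
        assert!(id < N_SEQ_MAX as i32); // every assigned id is a real slot
    }
}
```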
Bug 3: Multi-GPU nvidia-smi parsing (cli/mod.rs)
s.trim().parse::<usize>() on the nvidia-smi output fails when two GPUs produce two lines ("24576\n24576"), silently falling back to a hardcoded 8 GiB and underestimating available VRAM by ~3×.
Fix: Take only s.lines().next() before parsing.
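The fix in sketch form; the helper name and surrounding plumbing are paraphrased from the description, but the core change is exactly taking the first line before parsing.

```rust
// Parse per-GPU VRAM (MiB) from nvidia-smi query output. The buggy
// version called .trim().parse() on the whole string, which fails on
// multi-GPU output like "24576\n24576" and silently fell back to 8 GiB.
// Function name is illustrative; the real code lives in cli/mod.rs.
fn parse_vram_mib(nvidia_smi_output: &str) -> Option<usize> {
    // The fix: take only the first line, then parse.
    nvidia_smi_output.lines().next()?.trim().parse().ok()
}

fn main() {
    // Two-GPU output that broke the original parser:
    assert_eq!(parse_vram_mib("24576\n24576"), Some(24576));
    // Single-GPU output still parses the same way:
    assert_eq!(parse_vram_mib("24576"), Some(24576));
}
```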
Bug 4 (follow-on): Prefix cache seq_cp assert in llama.cpp
After fixing bugs 1–3, concurrent requests triggered GGML_ASSERT(is_full && "seq_cp() is only supported for full KV buffers") in the prefix cache path. With n_seq=8 and the pool capped to 8, there are no spare slots for prefix cache entries — Fox was copying sequences from slots not yet fully committed. Fixed by disabling prefix cache when pool size is constrained.
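The guard can be sketched as below; the condition and field names are inferred from the description ("disable prefix cache when pool size is constrained"), not taken from Fox's actual code.

```rust
// Illustrative guard for Bug 4. Names are hypothetical.
struct KvConfig {
    n_seq_max: usize, // llama.cpp context slots (8 after the Bug 1 fix)
    pool_size: usize, // scheduler seq-id pool size (8 after the Bug 2 fix)
}

impl KvConfig {
    // A prefix-cache entry needs a spare, fully-committed KV slot to
    // seq_cp() from. With pool_size == n_seq_max there are no spares,
    // which is what tripped the GGML_ASSERT above.
    fn prefix_cache_allowed(&self) -> bool {
        self.pool_size < self.n_seq_max
    }
}

fn main() {
    let cfg = KvConfig { n_seq_max: 8, pool_size: 8 };
    assert!(!cfg.prefix_cache_allowed()); // constrained pool: cache disabled
}
```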
✅ After all four fixes: Fox loads, serves inference, handles 8 concurrent requests without crashing. Benchmarks below.
The original Fox benchmark used a 3B model on an RTX 4060 — a fundamentally different regime. The bugs above only manifest with 14B+ models and/or multi-GPU systems.
Since I was reviewing the code anyway:
Fox binds 0.0.0.0:8080 with zero auth. Anyone on your network can submit inference, pull models, or delete model files.
CORS is wide open (CorsLayer::permissive()). Any webpage in any browser tab can call your Fox API; if you visit a malicious site while Fox is running, it can query your model.
pull_handler.rs constructs a "digest" from the filename and never actually hashes the file, so the verification is a no-op.
Model paths are restricted to .gguf files, but there's no canonicalize() check.
install.sh downloads the binary without a checksum.
The /metrics endpoint is unauthenticated.
Three-way comparison: Fox (patched) (dual GPU, Qwen3-14B-Q4KM, with --system-prompt "/no_think") vs
Ollama (dual GPU, SCHED_SPREAD=1, same model) vs
vLLM (single GPU, Qwen3-14B-AWQ, PagedAttention).
Each framework loaded in isolation with VRAM cleared between runs.
Note on Fox TTFT: Fox uses continuous batching — it holds a request in the queue until a batch slot is free, then prefills everything together. This makes sequential TTFT appear high (~1000ms) versus Ollama's per-request dispatch (~175ms). Under concurrency, the calculus flips: Fox's batched throughput outperforms Ollama's serial queue.
Sequential verdict: Ollama TTFT 175ms · Fox TTFT 1000ms · vLLM TTFT 27ms. Fox's ~1000ms is the cost of continuous batching scheduling overhead. Throughput: Fox 78 tok/s ≈ Ollama 68 tok/s ≈ vLLM 77 tok/s — all within noise. For single-user interactive use, vLLM and Ollama win on feel; Fox feels sluggish despite equal throughput.
The 213× TTFT gap under concurrency isn't a performance tuning issue — it's a consequence of fundamentally different architectural choices. Understanding the gap means understanding what happens when two requests arrive at the same time.
Ollama processes one request at a time, blocking the next until the current one finishes. Request 4 waits for requests 1, 2, and 3 to complete — so its first-token latency is the sum of three full inference durations (~8 seconds at this model size).
vLLM's PagedAttention treats concurrent requests as a single batched forward pass. All four requests compute their first token simultaneously in the same GPU kernel. The TTFT for request 4 is essentially the same as for request 1 — because they ran together.
This is a categorical architectural difference, not a tuning gap. You can't get vLLM-level concurrent performance from Ollama by changing config flags, because Ollama's design choice is serial execution. It's the right choice for single-user simplicity. It's the wrong choice for any multi-user workload.
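The two architectures reduce to two different TTFT formulas, sketched below. The 2.6 s per-request duration is an illustrative assumption chosen to roughly match the ~8 s figure quoted above, and the 0.03 s batched prefill is likewise assumed; neither is a measured number.

```rust
// Toy model of the serial-queue vs batched-prefill difference.
// All timing constants here are illustrative assumptions.

// Serial queue (Ollama-style): request i's first token waits for all
// i earlier requests to finish completely.
fn serial_ttft(req_index: usize, per_request_secs: f64) -> f64 {
    req_index as f64 * per_request_secs
}

// Batched prefill (vLLM-style): all in-flight requests compute their
// first token in the same forward pass, so TTFT is one prefill.
fn batched_ttft(_req_index: usize, prefill_secs: f64) -> f64 {
    prefill_secs
}

fn main() {
    let per_req = 2.6; // assumed full-response time at this model size
    // Request 4 (index 3) under a serial queue: sum of three requests.
    println!("serial  req4 TTFT: {:.1} s", serial_ttft(3, per_req)); // 7.8 s
    // Same request under batching: bounded by a single prefill.
    println!("batched req4 TTFT: {:.2} s", batched_ttft(3, 0.03));
}
```

No config flag changes the left formula into the right one; that is the sense in which the gap is architectural.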
Ollama verdict: Personal use, interactive chat, local dev. Trivial setup, model management, persistent daemon. 71 tok/s for a single user is excellent. The right default.
vLLM verdict: Multi-user APIs, automation pipelines, anything with concurrent requests or latency requirements. 252 tok/s wall-clock under load. More setup but incomparable performance.
Fox verdict: If you need throughput over latency and run 14B+ models, worth trying with the patched binary. PR #1 fixes the three crashes. Concurrent throughput beats Ollama by ~1.7×. TTFT is poor for sequential use (~1s vs <200ms). Security gaps (no auth, wildcard CORS) mean local-only for now.
On this machine specifically: Ollama runs as a persistent system daemon (KEEP_ALIVE=24h).
vLLM is launched manually when needed via bash ~/vllm-srv/start.sh.
No need to choose — they coexist and serve different jobs.
All frameworks benchmarked in isolation (other evicted from VRAM before loading). Warmup request sent before timing. Prompt: "Explain the difference between TCP and UDP in 3 sentences." max_tokens=200, temperature=0.
Ollama / think-mode: qwen3-uncensored generates reasoning tokens by default and ignores think: false in the API body. A local proxy on port 11435 injects the flag at the request level. The 14B GGUF model used in the main benchmark doesn't exhibit this issue.
Ollama and Fox used both GPUs (~8 GB VRAM each) with Qwen3-14B-Q4_K_M GGUF. vLLM used one GPU (~20 GB) with Qwen3-14B-AWQ — same base model, same parameter count, same hardware budget.
TTFT = time from request send to first token. Tok/s = output tokens ÷ total response duration. Throughput = total output tokens ÷ wall-clock time (concurrent tests only).
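The three metric definitions, expressed as code so there's no ambiguity about denominators. The sample timestamps in main are invented for illustration, not taken from the benchmark runs.

```rust
// Metric definitions used in this writeup. Sample numbers are made up.
struct Req {
    send: f64,        // request send time (s)
    first_token: f64, // first token arrival (s)
    done: f64,        // last token arrival (s)
    out_tokens: usize,
}

// TTFT = time from request send to first token.
fn ttft(r: &Req) -> f64 {
    r.first_token - r.send
}

// Tok/s = output tokens / total response duration (per request).
fn tok_per_s(r: &Req) -> f64 {
    r.out_tokens as f64 / (r.done - r.send)
}

// Throughput = total output tokens / wall-clock time (concurrent tests).
fn throughput(reqs: &[Req], wall_secs: f64) -> f64 {
    reqs.iter().map(|r| r.out_tokens).sum::<usize>() as f64 / wall_secs
}

fn main() {
    let r = Req { send: 0.0, first_token: 0.175, done: 2.975, out_tokens: 200 };
    println!("TTFT: {:.0} ms", ttft(&r) * 1000.0); // 175 ms
    println!("tok/s: {:.1}", tok_per_s(&r));       // 67.2
}
```

Note the per-request tok/s divides by that request's own duration, while throughput divides by shared wall-clock time; under concurrency the two diverge, which is why Fox can lose on TTFT and still win on throughput.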
Posted to r/LocalLLM thread. More writing at bayesianpersuasion.com.