AI Model Zoo — open-weight models for local inference
The most important open-weight models in June 2026. Which model fits which use case, what hardware it requires, and which licence applies. Updated with every new release.
Pull models via Ollama
# All-purpose hero, runs on Apple Silicon M4 Ultra (192 GB) or NVIDIA with 48+ GB VRAM
ollama pull llama3.3:70b
# Excellent multilingual (esp. German), fits an RTX 5090 (32 GB)
ollama pull qwen2.5:32b
# Mixture-of-Experts, very strong tokens-per-dollar
ollama pull mixtral:8x7b
# Reasoning specialist (distill variant for realistic hardware)
ollama pull deepseek-r1:32b
# Small and strong, feasible on a single RTX 4090 (24 GB)
ollama pull phi4:14b
ollama pull gemma3:27b
# OpenAI's open-weight model (released Aug 2025), 20B Q4 needs ~12 GB VRAM
ollama pull gpt-oss:20b
Every listed model is available with one command. The size variant (e.g. ':70b' vs. ':8b') adapts the model to the available hardware. Source: ollama.com/library.
Seven models compared
Every model has its own strengths: size, licence, context length, use case. A compact overview, with links to each vendor's official model page.
Llama 3.3
Meta
General purpose
Meta flagship
Meta's all-purpose hero. Very balanced at text generation, knowledge tasks, multilingual work. Llama 3.3 Community License allows commercial SMB use.
Mistral 7B as a compact model, Mixtral 8x7B/8x22B as a Mixture-of-Experts with a very strong performance/VRAM ratio. All Apache-2.0 — fully commercially usable.
Alibaba model with excellent German quality, even at smaller sizes. Variants from 0.5B to 72B. Apache-2.0 for most sizes (the 3B and 72B variants ship under Qwen's own licences), so licensing is unproblematic even for GDPR-sensitive SMBs.
DeepSeek R1 sets the bar for logical reasoning and mathematics. Full 671B version only on enterprise hardware, distill variants 7B/14B/32B/70B usable on SMB hardware. MIT license.
Phi 4 with only 14B parameters shows: well-curated training data beats sheer size. Very good performance on a single RTX 4090. MIT license, commercially unrestricted.
In August 2025 OpenAI released gpt-oss in two sizes (20B and 120B) under Apache-2.0. Clearly stronger than older open-weight models and fully cleared for commercial use.
An open-weight model is an AI model whose trained weights (the parameter file) are publicly available, typically as a download on Hugging Face, on the vendor website or through Ollama. That distinguishes it from ChatGPT, Claude or Gemini Pro, whose weights are not released and which are accessible only via APIs.
Important: 'open weight' is not the same as 'open source'. Licences vary widely, from genuine MIT/Apache-2.0 (Mistral, Qwen, Phi, GPT-oss) through custom licences with branding clauses (Llama, Gemma) to restricted 'research only' licences. Review every licence before commercial SMB use; each model card carries a licence footnote.
Which model for what?
Recommendations from our own client practice. Not 'the best model', because there is no best, but 'for this use case, this model is the most pragmatic'.
General purpose in SMBs (text, translation, Q&A)
<b>Llama 3.3 70B</b> or <b>Qwen 2.5 32B</b>. Llama 70B needs 48 GB VRAM (Mac Studio M4 Ultra, RTX 5090+RTX 4090), Qwen 32B runs on an RTX 4090. Both very good in German.
Coding & code completion
<b>Qwen 2.5 Coder 32B</b> is the pragmatic standard for coding. Plus <b>DeepSeek R1 Distill 32B</b> for more complex refactoring tasks. Both usable in IDE plugins (Continue, Cody).
Reasoning, logic, mathematics
<b>DeepSeek R1</b> is clearly ahead of everyone else here. The distill variants (32B and 70B) bring a large part of the reasoning strength onto SMB hardware. Full 671B version only on enterprise hardware.
Small and fast (edge, mobile, embedded)
<b>Phi 4 14B</b>, <b>Gemma 3 9B</b> or <b>Llama 3.2 3B</b>. All run on a single consumer GPU or even on powerful CPUs. For embedded features in apps.
Multilingual with a GDPR bonus
<b>Qwen 2.5 32B</b> for multilingual work (also less common languages) plus an Apache-2.0 licence. <b>Mistral Large 2</b> as a European alternative (Mistral is based in Paris, EU hosting).
When nobody is allowed to use the OpenAI cloud
<b>GPT-oss 120B</b>, available since August 2025 under Apache-2.0, is OpenAI's own open-weight alternative and free for commercial use. For compliance-strict industries that want ChatGPT quality but cannot use the OpenAI cloud.
What do all open-weight models have in common?
Six properties that make these seven models a class of their own — and distinguish them from cloud frontier models.
Local inference
Model runs on your own hardware (Mac, workstation, server). No request leaves the network. Mandatory in industries with confidentiality or GDPR obligations — lawyers, doctors, tax advisors, engineers with IP protection.
Quantisation available
Each model is available in quantised variants (Q4_K_M, Q5_K_M, Q8_0) that cut the RAM/VRAM footprint by a factor of 2–4 compared with FP16, with a quality loss that remains acceptable. Q4_K_M is the standard for self-hosting.
OpenAI-API compatible
Via Ollama, vLLM or llama.cpp, all these models are served behind an OpenAI-compatible endpoint. Existing OpenAI client libraries (Python SDK, nodes) work once their base URL points at the local server; no other code change is needed.
Multi-modal optional
Several of the models (Llama 3.2 Vision, Qwen 2.5 VL, Gemma 3 vision) accept images. Use: analyse scanned invoices, interpret part photos for complaints, process whiteboard sketches.
Fine-tuning possible
With your own data (glossary, style guides, domain knowledge) the models can be further trained — usually via LoRA/QLoRA for moderate hardware needs. Domain-specific strengths without full pre-training.
No vendor lock-in
Model switching without code change: swap the tag, the OpenAI-compatible endpoint stays the same. Anyone using Llama 3.3 today can switch to Qwen 3 or GPT-oss 120B tomorrow, all within the open-weight ecosystem.
OpenAI-compatible API call against a local model
# Pick a model (variable)
MODEL="llama3.3:70b"
# Alternative: qwen2.5:32b · mixtral:8x7b · deepseek-r1:32b · gpt-oss:20b
# OpenAI-compatible chat-completion call
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'$MODEL'",
"messages": [
{"role": "system", "content": "You are a pragmatic AI consultant for SMBs."},
{"role": "user", "content": "Which model do you recommend for an 8-person law firm with an RTX 4090?"}
],
"temperature": 0.2
}'
# The same request also works with the OpenAI Python SDK:
# set OPENAI_BASE_URL to http://localhost:11434/v1 (or pass base_url to the client)
# and set OPENAI_API_KEY to any non-empty placeholder value.
A single curl call against the local Ollama API. Model name as a variable — swappable across all seven models. Source: own practice.
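For readers who prefer Python to curl, the same chat-completion request can be sketched with the standard library alone. This is a minimal illustration, not the page's own tooling: the helper names build_chat_request and ask are hypothetical, and the base URL is the local Ollama address from the curl call above.

```python
import json
import urllib.request

# Local Ollama endpoint, as used in the curl example above.
BASE_URL = "http://localhost:11434/v1"


def build_chat_request(model: str, user_prompt: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible chat-completion request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a pragmatic AI consultant for SMBs."},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, user_prompt: str) -> str:
    """Send the request and return the reply text. Needs a running local server."""
    with urllib.request.urlopen(build_chat_request(model, user_prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (with Ollama running and the model pulled):
#   print(ask("qwen2.5:32b", "Which model fits an RTX 4090?"))
```

Switching models means changing only the first argument; the request shape stays identical.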
Cloud alternatives honestly compared
If open-weight is not enough — what else?
Three cloud frontier models that are still often ahead in absolute quality. Trade-off: data flow and per-token costs against top quality.
Frontier cloud (USA)
ChatGPT (GPT-5)
OpenAI
+ Best general-purpose frontier vendor
+ Tools, function calling, vision very mature
− US cloud, no self-hosting
− Data flow even with an enterprise DPA
Frontier cloud (USA/EU)
Claude (Anthropic)
Anthropic
+ Very strong on long texts
+ EU endpoints available
− Also no self-hosting option
− Per-token cost rises quickly at volume
Frontier cloud (USA)
Gemini Pro
Google
+ Very long context window (1M+ tokens)
+ Strong multi-modal
− Google cloud, US data location
− Data policy less transparent than OpenAI/Anthropic
Rule of thumb: with GDPR obligations or IP protection — open-weight local. For maximum quality on uncritical content — cloud frontier. For both — multi-routing in (sensitive content local, general research cloud).
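The routing idea in this rule of thumb (sensitive content stays local, general research may go to the cloud) can be sketched in a few lines. Everything here is illustrative: the keyword list and the function names are hypothetical, and the cloud URL is just one example of an OpenAI-compatible endpoint.

```python
LOCAL_BASE_URL = "http://localhost:11434/v1"   # local Ollama endpoint
CLOUD_BASE_URL = "https://api.openai.com/v1"   # example cloud endpoint

# Hypothetical markers; a real deployment needs its own classification rules.
SENSITIVE_MARKERS = ("client", "patient", "contract", "invoice")


def is_sensitive(prompt: str) -> bool:
    """Very rough check: does the prompt mention any sensitive marker?"""
    text = prompt.lower()
    return any(marker in text for marker in SENSITIVE_MARKERS)


def pick_base_url(prompt: str) -> str:
    """Route sensitive prompts to the local model, everything else to the cloud."""
    return LOCAL_BASE_URL if is_sensitive(prompt) else CLOUD_BASE_URL
```

Keyword matching will miss cases, and a miss sends data to the cloud. Where confidentiality obligations apply, it is safer to default to local and mark only clearly uncritical requests for the cloud.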
Pricing
Hardware investment vs. per-token costs.
License
Mix of Apache-2.0 (Mistral, Qwen, GPT-oss), MIT (DeepSeek, Phi) and custom community licences (Llama, Gemma). Unproblematic for in-house SMB use in nearly all constellations; verify per model on the respective card.
Hardware costs
Mid-range (for models up to 32B): RTX 4090 24GB + 32GB RAM workstation, around €3,000. Premium (for 70B+): Apple Mac Studio M4 Ultra 192 GB from €8,000 or an NVIDIA server with RTX 5090. Power: 150–300 W during inference, <50 W idle.
vs. cloud API
Cloud costs scale linearly with volume. At 1,000 tokens/day per 10 staff: GPT-5 around €60/month, Claude Sonnet similar. Self-hosted: hardware amortisation in 12–18 months. From 50+ staff clearly cheaper.
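The break-even arithmetic behind these figures can be made explicit. The sketch below uses only the numbers quoted above (a workstation at around €3,000, cloud spend of around €60/month per 10 staff) and assumes strictly linear scaling; power, maintenance and staff time are left out, and the function names are hypothetical.

```python
def cloud_cost_per_month(staff: int, eur_per_10_staff: float = 60.0) -> float:
    """Monthly cloud spend, assuming cost scales linearly with headcount."""
    return staff / 10 * eur_per_10_staff


def breakeven_months(hardware_eur: float, cloud_eur_per_month: float) -> float:
    """Months until the hardware price equals the cloud spend it replaces."""
    if cloud_eur_per_month <= 0:
        raise ValueError("cloud cost must be positive")
    return hardware_eur / cloud_eur_per_month


if __name__ == "__main__":
    for staff in (10, 30, 50):
        months = breakeven_months(3000, cloud_cost_per_month(staff))
        print(f"{staff} staff: break-even after {months:.1f} months")
```

Under these assumptions, 10 staff need 50 months to recoup €3,000, while 50 staff need 10. An amortisation of 12–18 months corresponds to a cloud bill of roughly €167–250/month, which at this rate is about 28 to 42 staff.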
Important: quantisation quality is not linear. Q4_K_M is standard and 'good enough' for 95 % of use cases. Q5/Q6 for higher demands. Full FP16 only for research. Anyone trying to make do with less VRAM should first try a lower quantisation before picking a smaller model.
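To see how much quantisation buys before stepping down a model size, the weight footprint can be estimated as parameters times bits per weight. This is a rough sketch: the bits-per-weight values are approximate assumptions for GGUF files rather than exact figures, and the estimate covers weights only, without KV cache or runtime overhead.

```python
# Approximate bits per weight; assumed values, real files vary by model.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def weight_footprint_gb(params_billion: float, quant: str) -> float:
    """Estimate the size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    for size in (14, 32, 70):
        for quant in BITS_PER_WEIGHT:
            gb = weight_footprint_gb(size, quant)
            print(f"{size}B {quant}: ~{gb:.0f} GB")
```

With these assumed values a 32B model at Q4_K_M comes to about 19 GB of weights, which is why it fits a 24 GB card, and a 70B model at Q4_K_M to about 42 GB, in line with the 48 GB recommendation. Context length adds memory on top.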
Related topics
Models need an inference server and a frontend
Models alone are useless. loads them, provides the user surface, the server solution sits behind: