Model zoo · snapshot 2026-06

AI Model Zoo — open-weight models for local inference

The most important open-weight models as of June 2026: which model fits which use case, what hardware it requires, and which licence applies. Updated with every new release.

Pull models via Ollama

# All-purpose hero, runs on Apple Silicon M4 Ultra (192 GB) or NVIDIA with 48+ GB VRAM
ollama pull llama3.3:70b

# Excellent multilingual (esp. German), fits an RTX 5090 (32 GB)
ollama pull qwen2.5:32b

# Mixture-of-Experts, very strong tokens-per-dollar
ollama pull mixtral:8x7b

# Reasoning specialist (distill variant for realistic hardware)
ollama pull deepseek-r1:32b

# Small and strong, feasible on a single RTX 4090 (24 GB)
ollama pull phi4:14b
ollama pull gemma3:27b

# OpenAI's open-weight model (released Aug 2025), 20B Q4 needs ~12 GB VRAM
ollama pull gpt-oss:20b

Every listed model is available with one command. The size variant (e.g. ':70b' vs. ':8b') adapts the model to the available hardware. Source: ollama.com/library.
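As a rough illustration of matching the tag to the hardware, the following shell sketch maps available VRAM to one of the tags above. The function name `suggest_tag` is hypothetical, and the thresholds are simply the approximate Q4 figures quoted on this page (48 GB for the 70B model, about 20 GB for 32B, and so on), so treat them as assumptions rather than measured limits.

```shell
#!/bin/sh
# suggest_tag VRAM_GB
# Prints the largest tag from this page whose quoted Q4 footprint fits
# into VRAM_GB (whole number). Thresholds are the page's approximate
# figures, not measurements; Mixtral is left out to keep one tag per tier.
suggest_tag() {
  vram_gb=$1
  if   [ "$vram_gb" -ge 48 ]; then echo "llama3.3:70b"
  elif [ "$vram_gb" -ge 20 ]; then echo "qwen2.5:32b"
  elif [ "$vram_gb" -ge 16 ]; then echo "gemma3:27b"
  elif [ "$vram_gb" -ge 12 ]; then echo "gpt-oss:20b"
  elif [ "$vram_gb" -ge 9 ];  then echo "phi4:14b"
  else
    echo "no listed tag fits ${vram_gb} GB" >&2
    return 1
  fi
}

# Example (not executed here): pull whatever fits a 24 GB card
# ollama pull "$(suggest_tag 24)"
```

The thresholds leave no headroom for long contexts, so in practice it may be worth stepping down one tier when VRAM is tight.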

Seven models compared

Every model has its own strengths: size, licence, context length. A compact overview follows.

Llama 3.3

Mistral / Mixtral

Mistral AI

Mixture-of-Experts

French powerhouse

Mistral 7B as a compact model, Mixtral 8x7B/8x22B as a Mixture-of-Experts with a very strong performance/VRAM ratio. All Apache-2.0 — fully commercially usable.

Sizes: 7B · 8x7B · 8x22B
Context: 64k
VRAM Q4: ~28 GB (8x7B Q4)
Licence: Apache-2.0

ollama pull mixtral:8x7b


Qwen 2.5

Alibaba

Multilingual

Multilingual specialist

Alibaba model with excellent German quality, even at smaller sizes. Variants from 0.5B to 72B. Apache-2.0 — unproblematic even for GDPR-sensitive SMBs.

Sizes: 0.5B · 7B · 14B · 32B · 72B
Context: 128k
VRAM Q4: ~20 GB (32B Q4)
Licence: Apache-2.0

ollama pull qwen2.5:32b


Gemma 3

Google

Small & efficient

Google small model

Gemma 3 in sizes from 1B to 27B, very efficient for its weight class. Gemma licence (similar to Llama: commercial use possible, some obligations).

Sizes: 1B · 4B · 12B · 27B
Context: 128k
VRAM Q4: ~16 GB (27B Q4)
Licence: Gemma licence

ollama pull gemma3:27b


DeepSeek R1

DeepSeek

Reasoning & logic

Reasoning specialist

DeepSeek R1 sets the bar for logical reasoning and mathematics. Full 671B version only on enterprise hardware, distill variants 7B/14B/32B/70B usable on SMB hardware. MIT license.

Sizes: 671B + Distill 7B/14B/32B/70B
Context: 128k
VRAM Q4: ~20 GB (Distill-32B Q4)
Licence: MIT

ollama pull deepseek-r1:32b


Phi 4

Microsoft

Small & efficient

Microsoft small champion

Phi 4 with only 14B parameters shows: well-curated training data beats sheer size. Very good performance on a single RTX 4090. MIT license, commercially unrestricted.

Sizes: 14B · 3.8B (mini)
Context: 16k
VRAM Q4: ~9 GB (14B Q4)
Licence: MIT

ollama pull phi4:14b


GPT-oss

OpenAI

General purpose

OpenAI's open-weight model

In August 2025 OpenAI released gpt-oss in two sizes (20B and 120B) under Apache-2.0. Model quality is clearly better than that of older open-weight models, and commercial use is fully permitted.

Sizes: 20B · 120B
Context: 128k
VRAM Q4: ~12 GB (20B Q4)
Licence: Apache-2.0

ollama pull gpt-oss:20b


What are open-weight models?

An open-weight model is an AI model whose trained weights (the parameter file) are publicly available, typically as a download on Hugging Face, on the vendor website or through Ollama. That distinguishes them from ChatGPT, Claude or Gemini Pro, whose weights are not released and which are accessible only via APIs.

Important: 'open weight' is not the same as 'open source'. Licences vary widely, from genuine MIT/Apache-2.0 (Mistral, Qwen, Phi, GPT-oss) through custom licences with branding clauses (Llama, Gemma) to restricted 'research only' licences. For commercial SMB use, review every licence in advance; the licence is noted on each model's card.

Which model for what?

Recommendations from our own client practice. Not 'the best model', because there is no best, but 'for this use case, this model is the most pragmatic choice'.

General purpose in SMBs (text, translation, Q&A)

Llama 3.3 70B or Qwen 2.5 32B. Llama 70B needs 48 GB VRAM (Mac Studio M4 Ultra, RTX 5090+RTX 4090), Qwen 32B runs on an RTX 4090. Both very good in German.

Coding & code completion

Qwen 2.5 Coder 32B is the pragmatic standard for coding. Plus DeepSeek R1 Distill 32B for more complex refactoring tasks. Both usable in IDE plugins (Continue, Cody).

Reasoning, logic, mathematics

DeepSeek R1 is clearly ahead of everyone else here. The distill variants (32B and 70B) bring a large part of the reasoning strength onto SMB hardware. Full 671B version only on enterprise hardware.

Small and fast (edge, mobile, embedded)

Phi 4 14B, Gemma 3 4B or Llama 3.2 3B. All run on a single consumer GPU or even on powerful CPUs. Suited to embedded features in apps.

Multilingual with a GDPR bonus

Qwen 2.5 32B for multilingual work (also less common languages) plus an Apache-2.0 licence. Mistral Large 2 as a European alternative (Mistral is based in Paris, EU hosting).

When nobody is allowed to use the OpenAI cloud

GPT-oss 120B, available since August 2025 under Apache-2.0, is a commercially usable open-weight alternative from OpenAI. Suited to compliance-strict industries that want ChatGPT quality but cannot use the OpenAI cloud.

What do all open-weight models have in common?

Six properties that make these seven models a class of their own — and distinguish them from cloud frontier models.

Local inference

Model runs on your own hardware (Mac, workstation, server). No request leaves the network. Mandatory in industries with confidentiality or GDPR obligations — lawyers, doctors, tax advisors, engineers with IP protection.

Quantisation available

Quantised builds (Q4_K_M, Q5_K_M, Q8_0) are available for each model. They reduce the RAM/VRAM footprint by a factor of 2–4, with a small and usually acceptable quality loss. Q4_K_M is the standard for self-hosting.

OpenAI-API compatible

Via Ollama, vLLM or llama.cpp, all these models are served behind an OpenAI-compatible endpoint. Existing OpenAI client libraries (Python SDK, nodes) keep working once their base URL points at the local server.

Multi-modal optional

Several of the models (Llama 3.2 Vision, Qwen 2.5 VL, Gemma 3 vision) accept images. Uses: analysing scanned invoices, interpreting part photos for complaints, processing whiteboard sketches.

Fine-tuning possible

With your own data (glossary, style guides, domain knowledge) the models can be further trained — usually via LoRA/QLoRA for moderate hardware needs. Domain-specific strengths without full pre-training.

No vendor lock-in

Switching models needs no code change: swap the tag, the OpenAI-compatible endpoint stays the same. Anyone using Llama 3.3 today can switch to Qwen 3 or GPT-oss 120B tomorrow, all within the open-weight ecosystem.

OpenAI-compatible API call against a local model

# Pick a model (variable)
MODEL="llama3.3:70b"
# Alternative: qwen2.5:32b · mixtral:8x7b · deepseek-r1:32b · gpt-oss:20b

# OpenAI-compatible chat-completion call
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$MODEL"'",
    "messages": [
      {"role": "system", "content": "You are a pragmatic AI consultant for SMBs."},
      {"role": "user", "content": "Which model do you recommend for an 8-person law firm with an RTX 4090?"}
    ],
    "temperature": 0.2
  }'

# The same request also works with the OpenAI Python SDK:
# set OPENAI_BASE_URL to http://localhost:11434/v1 and OPENAI_API_KEY to any placeholder value.

A single curl call against the local Ollama API. Model name as a variable — swappable across all seven models. Source: own practice.

Cloud alternatives honestly compared

If open-weight is not enough — what else?

Three cloud frontier models that are still often ahead in absolute quality. Trade-off: data flow and per-token costs against top quality.

Frontier cloud (USA)

ChatGPT (GPT-5)

OpenAI

+ Best general-purpose frontier vendor
+ Tools, function calling, vision very mature
− US cloud, no self-hosting
− Data flow even with an enterprise DPA

Frontier cloud (USA/EU)

Claude (Anthropic)

Anthropic

+ Very strong on long texts
+ EU endpoints available
− Also no self-hosting option
− Per-token cost rises quickly at volume

Frontier cloud (USA)

Gemini Pro

Google

+ Very long context window (1M+ tokens)
+ Strong multi-modal
− Google cloud, US data location
− Data policy less transparent than OpenAI/Anthropic

Rule of thumb: with GDPR obligations or IP protection, run open-weight models locally. For maximum quality on uncritical content, use a cloud frontier model. For both, route per request (sensitive content local, general research in the cloud).
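The routing rule can be sketched as a small shell helper that picks the base URL per request class. This is only an illustration: the function name `base_url_for` and the class labels `general` and `sensitive` are hypothetical, and the sketch deliberately keeps anything not marked `general` on the local Ollama endpoint used elsewhere on this page.

```shell
#!/bin/sh
# base_url_for CLASS
# Returns the OpenAI-compatible base URL for a request class.
# Only requests explicitly marked "general" go to the cloud; everything
# else (including unknown or empty classes) stays on the local server.
base_url_for() {
  case "$1" in
    general) echo "https://api.openai.com/v1" ;;
    *)       echo "http://localhost:11434/v1" ;;
  esac
}

# Example (not executed here; request.json is a placeholder file):
# curl "$(base_url_for sensitive)/chat/completions" \
#   -H "Content-Type: application/json" -d @request.json
```

Defaulting to the local endpoint means a mislabelled request stays inside the network. A cloud call would additionally need an Authorization header with an API key.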

Pricing

Hardware investment vs. per-token costs.

License

Mix of Apache-2.0 (Mistral, Qwen, GPT-oss), MIT (DeepSeek, Phi) and custom community licences (Llama, Gemma). For in-house SMB use this is unproblematic in nearly all constellations, but verify the licence per model on the respective card.

Hardware costs

Mid-range (for models up to 32B): RTX 4090 24GB + 32GB RAM workstation, around €3,000. Premium (for 70B+): Apple Mac Studio M4 Ultra 192 GB from €8,000 or an NVIDIA server with RTX 5090. Power: 150–300 W during inference, <50 W idle.

vs. cloud API

Cloud costs scale linearly with volume. At 1,000 tokens/day per 10 staff: GPT-5 around €60/month, Claude Sonnet similar. Self-hosted: hardware amortisation in 12–18 months. From 50+ staff clearly cheaper.
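The amortisation argument comes down to one division, sketched here in shell. The function name `break_even_months` and all input values except the €3,000 workstation price are hypothetical; plug in your own monthly cloud bill and running costs.

```shell
#!/bin/sh
# break_even_months HARDWARE_EUR CLOUD_EUR_PER_MONTH LOCAL_EUR_PER_MONTH
# Months until the one-off hardware cost is covered by what is saved each
# month (cloud bill minus local running costs such as electricity).
break_even_months() {
  awk -v hw="$1" -v cloud="$2" -v run="$3" 'BEGIN {
    saving = cloud - run
    if (saving <= 0) { print "never"; exit }
    printf "%.1f\n", hw / saving
  }'
}

# Hypothetical inputs: the page's EUR 3,000 workstation, an assumed
# EUR 220 monthly cloud bill and EUR 20 of electricity per month.
break_even_months 3000 220 20   # -> 15.0
```

The sketch ignores maintenance time and hardware resale value, both of which shift the break-even point in practice.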

Important: quantisation quality is not linear. Q4_K_M is the standard and 'good enough' for 95 % of use cases. Q5/Q6 for higher demands. Full FP16 only for research. Anyone trying to make do with less VRAM should check the quantisation level first, before picking a smaller model.
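To see how much the quantisation level changes the footprint, the weight size can be estimated as parameters times bits per weight divided by 8. The shell sketch below does that. The function name `estimate_gb` is hypothetical and the bits-per-weight values are approximate averages assumed for illustration; they vary slightly between models.

```shell
#!/bin/sh
# estimate_gb PARAMS_BILLION BITS_PER_WEIGHT
# Approximate size of the weights in GB: parameters x bits per weight / 8.
# Excludes KV cache and runtime overhead, so real VRAM use is higher.
estimate_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

# Assumed average bits per weight (approximate):
#   Q4_K_M ~ 4.85   Q5_K_M ~ 5.7   Q8_0 ~ 8.5   FP16 = 16
estimate_gb 32 4.85   # 32B at Q4_K_M -> 19.4
estimate_gb 32 8.5    # 32B at Q8_0   -> 34.0
estimate_gb 32 16     # 32B at FP16   -> 64.0
estimate_gb 14 8.5    # 14B at Q8_0   -> 14.9
```

Under these assumptions a 24 GB card holds a 32B model at Q4_K_M but not at Q8_0, which is why it can pay to check the quantisation before stepping down to a 14B model.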

Models need an inference server and a frontend

Models alone are useless. Ollama loads them, Open WebUI provides the user interface, and the server solution sits behind both:

→ Ollama (inference server)
→ Open WebUI (frontend)
→ Your own AI server (solution context)
→ Cloud vs. local: which model when?
