Solution in detail

AI model practice — which models we use for what

An honest list of the models we use daily at KaaTai-Beratung. With a concrete distribution, matrix and the routing pattern that pragmatically combines cloud and local inference.

No single model for everything

There is no 'best model'. There are models that are more pragmatic for certain tasks than others. We use four cloud vendors daily (Claude, Perplexity, Gemini, OpenAI) plus local open-weight models via for client- or patient-related content. The distribution below is our reality — as of 2026-06-02. Shifts as soon as new models serve the respective better.

Our cloud distribution

These four vendors cover the vast majority of our cloud work. Percentages are an honest approximation — as of 2026-06-02, can rearrange with every release.

Claude80%

Perplexity10%

Gemini5%

OpenAI5%

Claude

Anthropic

80%

Standard workhorse

Claude (esp. Opus 4.7 with 1M context) is our main model for text, code, long documents, strategy outlines, briefs. Very good German quality, consistent answers, very strong . EU endpoint at Anthropic available.

Modelle: Opus 4.7 (1M ctx) · Sonnet 4.6 · Haiku
Lizenz: SaaS, EU-Endpunkt verfügbar

Perplexity

Perplexity AI

10%

Web research with sources

Perplexity is RAG-first — web search combined with answer and source citations. Mandatory for research tasks: current subsidy logic, competitor analysis, industry updates. Comet browser agent for autonomous research.

Modelle: Pro Search · Deep Research · Comet
Lizenz: SaaS, RAG-First-Plattform

Gemini

Google DeepMind

Very long contexts + multimodal

Gemini (Pro/Ultra) brings 1M+ token context and very strong multimodal (images, video, audio). We use it for tasks where very large document collections have to be processed at once or image understanding is in the foreground.

Modelle: Gemini 3 Pro · Gemini 3 Ultra · Flash
Lizenz: SaaS, Google Cloud + Vertex AI

OpenAI

Special tasks + open-weight

OpenAI (GPT-5/o3) for tasks where function-calling depth is strong (e.g. code interpreter, DALL-E images). Plus: GPT-oss as an open-weight variant for local inference. Share is small because other vendors usually have the edge for our .

Modelle: GPT-5 · GPT-5.5 · o3-pro · GPT-oss
Lizenz: SaaS + Open-Weight (gpt-oss)

Use-case matrix — which model for what

Eight typical from our consulting practice. Per the concrete model recommendation — cloud and local separated, because sensitive content must not go into a cloud.

Text, translation, briefs

Cloud: Claude Opus (long texts, house conventions) or Sonnet (faster). Local: Llama 3.3 70B or Qwen 2.5 32B. Routing: client texts local, general marketing text cloud.

Code, refactoring, code review

Cloud: Claude Opus (very strong on refactoring) or GPT-5 (better at ). Local: Qwen 2.5 Coder 32B + DeepSeek R1 Distill for complex logic. Routing: open-source code cloud OK, customer code local.

Web research with sources

Cloud: Perplexity Pro Search or Deep Research. Local: with SearXNG ( search) + Llama 3.3. Routing: general research cloud, competitor research local.

RAG (own documents)

Cloud: Claude Projects or Gemini (1M context). Local: knowledge bases + Llama 3.3 or Qwen. Routing: client documents ALWAYS local. Industry studies cloud OK.

Reasoning, logic, mathematics

Cloud: Claude Opus 4.7 or GPT-5 o3-pro. Local: DeepSeek R1 Distill 32B or 70B. Routing: for complex calculations with sensitive numbers local, general analysis questions cloud.

Multimodal (images, whiteboards, PDFs)

Cloud: Gemini 3 Pro/Ultra (very strong) or Claude (also good). Local: Qwen 2.5 VL or Llama 3.2 Vision. Routing: client whiteboards/scans local, stock-image analysis cloud.

Long documents (>100 pages)

Cloud: Claude Opus (1M context), Gemini Ultra (2M context). Local: Llama 3.3 70B with 128k context. Routing: for very large PDFs cloud is often practical, for sensitive data local with chunking.

Agent workflows (autonomous)

Cloud: Claude (very strong on ), GPT-5 (good). Perplexity Comet for browser agents. Local: still limited — autonomous agents usually need frontier quality. Routing: for agent tasks usually cloud.

The routing pattern

We do not combine cloud and local inference at random — but along a clear rule that every staff member can apply without thinking.

Rule 1 — client/patient data = always local

Anything that identifies clients, patients or persons (names, addresses, concrete file numbers, session notes, patient data) ALWAYS runs through the local instance with Llama 3.3 or Qwen 2.5. No exceptions, no 'just upload it briefly'.

Rule 2 — own code base = local, open-source code = cloud OK

Customer project code or own IP-relevant code bases → local Qwen 2.5 Coder. Open-source libraries, demo code, general code questions → Claude/GPT-5. No own code ends up in cloud prompts.

Rule 3 — general research + text = cloud OK

Marketing text, general industry research, updates, training material: Claude (standard) or Perplexity (with sources). Data flow is unproblematic here, quality beats sovereignty.

Rule 4 — multi-endpoint frontend for staff

connects all four cloud vendors plus the local instance. Staff pick the right model per task from a dropdown — routing not in their head, but in the UI.

Open WebUI multi-endpoint configuration

# Open WebUI Connections (Settings → Admin → Connections)

# Local Ollama instance (default)
OLLAMA_BASE_URL=http://ollama:11434

# Anthropic (Claude — standard 80%)
ANTHROPIC_API_BASE=https://api.anthropic.com
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODELS=claude-opus-4-7,claude-sonnet-4-6,claude-haiku-4-5

# Perplexity (research — 10%)
PERPLEXITY_API_BASE=https://api.perplexity.ai
PERPLEXITY_API_KEY=pplx-...
PERPLEXITY_MODELS=sonar-pro,sonar-reasoning-pro

# Google Gemini (multimodal + long contexts — 5%)
GOOGLE_API_BASE=https://generativelanguage.googleapis.com
GOOGLE_API_KEY=AIza...
GOOGLE_MODELS=gemini-3-pro,gemini-3-ultra,gemini-flash

# OpenAI (special tasks + GPT-oss — 5%)
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-...
OPENAI_MODELS=gpt-5,gpt-5.5,o3-pro

# Workspaces with a pre-selected model per use case:
#   - Workspace 'Clients'   → default: llama3.3:70b (local)
#   - Workspace 'Marketing' → default: claude-opus-4-7 (cloud)
#   - Workspace 'Research'  → default: sonar-pro (Perplexity)
#   - Workspace 'Vision'    → default: gemini-3-pro (cloud)

Open WebUI connects all four cloud vendors plus the local Ollama instance. Staff see every model in one dropdown and pick per use case. Source: docs.openwebui.com, BSD-3 + branding.

Honest notes

This distribution is ours — as of 2026-06-02. It is not a promise of salvation but the result of our own practice. Other consultancies route differently: many OpenAI-first (because of the ChatGPT brand), some Gemini-first (because of Google Workspace). We landed at Claude because Anthropic has consistently delivered the qualitatively best models for our over the past 18 months — that may change tomorrow.

Also honest: local models are not quite at frontier cloud quality in absolute top-tier yet. Llama 3.3 70B is very good, but Claude Opus 4.7 is measurably better on complex tasks. The trade-off is data sovereignty — and for sensitive content that is non-negotiable.

Models need infrastructure and a routing frontend

The model zoo shows the model universe, is the local inference server, is the multi-endpoint frontend, the GDPR server blueprint is the platform underneath:

→ AI model zoo (open-weight overview)→ Open WebUI (routing frontend)→ GDPR server blueprint

Ready for the next step?

Free intro call, no strings attached. In 30 minutes you'll know whether and how AI can help your business.

Book a call Check eligibility