Solution — Autonomous

Your own AI server & RAG knowledge base

We build your stack, train your team and hand it over, in three setup tiers from a VPS to on-premise hardware. With an honest view of where self-hosting pays off and where a hybrid approach with frontier cloud is the better choice.

An in-house AI server does not pay off for every use case. For occasional requests the cloud API is usually cheaper and delivers better quality. Self-hosting becomes interesting when three conditions coincide: regular use with meaningful volume, data-sensitive processes that may not be transferred to third countries, and the will to keep long-term sovereignty over models, data and cost.

Important upfront: local open-source models still lag the current frontier cloud model in quality. For German standard correspondence, classification and RAG queries that is enough; for complex reasoning and multilingual strategic copy it often is not. The clean path is therefore frequently a hybrid setup: self-hosted for data-sensitive routines, frontier cloud (with DPA and EU region) for the demanding strategic layer.
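
The hybrid idea can be sketched as a small routing rule. This is a minimal illustration, not our production logic: the `Request` type, its two flags and the backend labels are hypothetical names introduced for this example, and in practice the flags would come from your own classification or from the workflow that submits the request.

```python
from dataclasses import dataclass


@dataclass
class Request:
    text: str
    contains_sensitive_data: bool   # hypothetical flag, e.g. client or patient data
    needs_complex_reasoning: bool   # hypothetical flag, e.g. strategic copy


def route(req: Request) -> str:
    """Return the backend label a request should be sent to."""
    # Data-sensitive content never leaves the self-hosted server,
    # even if a frontier model would answer it better.
    if req.contains_sensitive_data:
        return "self-hosted"
    # Demanding, non-sensitive work goes to the frontier cloud model.
    if req.needs_complex_reasoning:
        return "frontier-cloud"
    # Routine work stays local by default.
    return "self-hosted"
```

The order of the checks is the design choice: data sensitivity is tested first, so a request that is both sensitive and demanding stays on your own server.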

There are three setup tiers, from a lean VPS for first steps to on-premise hardware for strictly regulated sectors, and you can start with the tier that fits today without throwing the setup away later.

Three setup tiers

Which tier fits depends on user volume, latency requirements and compliance obligations. Moving between tiers leaves the software stack identical — only the hardware scales with it.

Tier 1

Starter — VPS, CPU-only

Tool mix

  • Mid-sized VPS at a German provider (e.g. netcup, Hetzner), 16–32 GB RAM, NVMe SSD, no GPU
  • Ollama with smaller models (Gemma 4 in 4B–9B, Phi-4, Llama-3 8B) — CPU inference, slower but functional
  • Open WebUI as multi-user frontend, local embeddings via Ollama, ChromaDB as vector store
  • n8n for workflows, Authentik or Keycloak for login
  • Optionally a simple RAG index on your own Markdown documents

Best fit

Small teams (2–10 users), few parallel requests, use cases with German standard correspondence and classic RAG queries. First steps into self-hosting before investing in a GPU.

Effort & cost

Setup 3–6 days. Running costs around €30–80 / month (VPS rental + backups). Models and software are open source.

Tradeoff

CPU inference is significantly slower than GPU — answering a short question often takes 10–30 seconds instead of 1–3. Too slow for real-time use cases like live chat, very usable for asynchronous workflows (receipt classification, batch processing).

Tier 2

Professional — server with GPU

Tool mix

  • Dedicated server in a German data center with a consumer or workstation GPU (e.g. RTX 4090, RTX 6000 Ada) — or rented GPU cloud server (Hetzner, OVH) with an explicit GDPR assurance
  • Ollama or vLLM for inference, mid-sized models (Llama-3 70B quantized, Gemma 2 27B, Qwen 32B, Mistral Large) with response times under 5 seconds
  • Full RAG pipeline: Ollama embeddings, ChromaDB or pgvector, optionally a knowledge graph (KuzuDB) for relations, hybrid retrieval with RRF fusion
  • Open WebUI with RBAC and audit log, n8n workflows, Authentik / Keycloak
  • Monitoring stack with Grafana and Prometheus, Telegram or Slack alerts

Best fit

Active use in a team of 10–50 people, real-time use cases (chat, fast RAG), specific knowledge domains with a larger corpus.

Effort & cost

Setup 6–12 days. Running costs around €150–500 / month (hardware rental + electricity + backups), depending on GPU class and provider.

Tradeoff

Even a 70B model with good quantization lags the current frontier cloud model in quality. For standard tasks (business correspondence, classification, RAG answers) it is enough; for demanding strategic copy or complex reasoning, cloud frontier remains the better choice — often sensible as a hybrid setup.

Tier 3

On-premise — hardware in your own server room

Tool mix

  • Hardware to spec: GPU workstation (e.g. with RTX 6000 Ada), tower or rack server with ECC RAM, redundant NVMe storage
  • Tier 2 in full, plus physical control: no mandatory cloud connection, network isolation possible (air-gapped setup)
  • Extended monitoring tools (Netdata, Prometheus, Loki) and disaster recovery concept (offsite backup, hardware redundancy)
  • Optional fine-tuning pipeline with LoRA or QLoRA on your own hardware — possible, but only economically sensible with a clear use case

Best fit

Sectors with hard data protection or security requirements (regulated markets, research, defense, public sector), organizations with their own IT team and existing server room infrastructure, requirement for air-gapped operation.

Effort & cost

Hardware investment typically €8,000–25,000 one-off (depending on GPU class and redundancy). Setup 10–20 days. Running costs mainly electricity, maintenance, backups — usually €50–150 / month.
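
As a rough orientation, the one-off hardware investment can be set against the monthly hosting fees of a rented GPU server. The sketch below only uses the cost ranges named on this page and deliberately ignores setup days, staff time and hardware depreciation, so treat the result as a first estimate. The function name `break_even_months` is our own choice for this example.

```python
import math


def break_even_months(hardware_cost: float,
                      onprem_monthly: float,
                      rental_monthly: float) -> float:
    """Months until bought hardware is cheaper than a rented server."""
    monthly_saving = rental_monthly - onprem_monthly
    if monthly_saving <= 0:
        # On-premise never catches up on cost alone.
        return math.inf
    return hardware_cost / monthly_saving


# Cheapest hardware against the most expensive rental from the ranges above.
favourable = break_even_months(8_000, 50, 500)        # about 17.8 months
# Most expensive hardware against the cheapest rental.
unfavourable = break_even_months(25_000, 150, 150)    # no break-even
```

In the unfavourable case the purchase never pays off on cost alone, which is why the decision for Tier 3 usually rests on compliance requirements rather than on price.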

Tradeoff

Full data control and no recurring hosting fees, but complete operations responsibility in-house: hardware failures, power outages, backup restores, cooling. Not operable without an in-house IT team or external service provider.

What runs on the server

The software stack is identical across all three tiers — open source, Docker-based, swappable. Nine building blocks, each with a clear function:

Inference engine

Ollama for simple setups, vLLM for higher load and concurrency. Both open source, both self-hosted. Which fits better depends on model size and request volume.

Language models

Gemma 4 (Google), Llama 3/4 (Meta), Qwen (Alibaba), Mistral. Check licenses — most allow commercial use, some with restrictions above a certain user count.

Embedding model

Local embedding for vector representation, e.g. qwen3-embedding (1,024 dim.) or BGE-M3. Important: German material needs a German-trained or multilingual model, otherwise recall suffers.

Vector store

ChromaDB for simple setups (Docker-native), pgvector for PostgreSQL integration, Qdrant for higher load and filter complexity. Choice depends on data volume and query pattern.

Knowledge graph (optional)

KuzuDB as an embedded graph DB for dependency and relation queries. Useful when not just text is searched, but also "what connects to what".

RAG pipeline

Chunking, embedding, retrieval with hybrid strategy (vector + BM25, RRF fusion), reranking, prompt composition. Most of the work is not in the model, but in pipeline quality.
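
To make the fusion step concrete, here is a minimal sketch of reciprocal rank fusion (RRF). It assumes the vector search and the BM25 search each return an ordered list of document IDs. The function name `rrf_fuse` and the sample IDs are made up for this example, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists into one list with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector search result
bm25_hits = ["doc_b", "doc_d", "doc_a"]     # hypothetical keyword search result
fused = rrf_fuse([vector_hits, bm25_hits])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

RRF only looks at rank positions, not at raw scores, so the two retrievers do not need comparable score scales. Documents found by both retrievers move to the top.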

Frontend

Open WebUI for multi-user chat with RBAC and audit log, custom web apps for special applications, n8n for workflow integration.

Authentication

Keycloak or Authentik for single sign-on, Active Directory integration and 2FA. Important: AI access follows the same permission model as other enterprise tools.

Monitoring & backup

Grafana for dashboards, Prometheus for metrics, Uptime Kuma for external availability. Backup strategy: daily automatic, offsite copy, regular restore tests.

Where self-hosting concretely pays off

Six sector profiles in which an in-house AI server is typically the right choice — either for data protection, volume or compliance reasons:

Law firms & tax advisors

Statutes, rulings, client correspondence in the RAG system. Data protection strict, no US cloud. Augments legal research and speeds up standard queries — but does not replace professional judgement.

Healthcare & practices

Patient records and treatment guidelines must not leave the practice. Self-hosted enables AI-supported research and documentation help without data protection conflict.

Trade & technical operations

Hundreds of data sheets, manufacturer instructions, VDE standards and your own specifications in one system, queryable by the technician on site.

Internal knowledge base

Manuals, SOPs, training material, email archives — everything searchable in natural language. New hires find answers in seconds, knowledge transfer at staff turnover becomes easier.

Insurance and financial services

Tariffs, conditions, claim processes as a RAG knowledge base — internal queries are answered consistently. With MaRisk and compliance requirements in mind.

Public sector and authorities

High demands on data sovereignty, often air-gapped. On-premise setups with local models and a clean audit trail meet typical compliance requirements.

What we teach

So that you can operate the server yourselves, we build up six areas of competence in the workshop and during the accompanied pilot:

Model selection

Which open-source model fits which use case (size, quality, RAM footprint, license). When quantization is worth it, and when a larger model at fewer tokens per second is the better trade.
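
For the RAM footprint, a rule of thumb helps: the memory needed for the weights is roughly the parameter count times the bits per weight, divided by eight. The sketch below covers only the weights. Context cache and runtime overhead come on top, so real usage is higher. The function name is our own.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


full_precision_8b = weight_memory_gb(8, 16)   # 16.0 GB at 16 bits per weight
quantized_70b = weight_memory_gb(70, 4)       # 35.0 GB at 4 bits per weight
```

This shows why quantization decides which model fits on which hardware: at 4 bits per weight, a 70B model needs about a quarter of the memory it would need at 16 bits.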

RAG tuning

Chunking strategy, embedding choice, hybrid retrieval, reranking — and how to measure Recall@5 against a golden query set instead of relying on gut feel.
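
Measuring Recall@5 needs little code. The sketch assumes a golden query set of pairs (question, set of relevant document IDs) and a `retrieve` function that returns ranked document IDs. Both are placeholders for your own pipeline, and the function names are chosen for this example.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("each golden query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(golden_set: list[tuple[str, set[str]]],
                     retrieve: Callable[[str], list[str]],
                     k: int = 5) -> float:
    """Average Recall@k over all golden queries."""
    scores = [recall_at_k(retrieve(query), relevant, k)
              for query, relevant in golden_set]
    return sum(scores) / len(scores)
```

Run it before and after every pipeline change. If the mean drops, the change made retrieval worse on the golden set, regardless of how the answers feel.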

Workflow literacy

How to read, adjust and debug an n8n workflow. Where triggers sit, where errors arise, where retry and dead-letter queues take over.
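
The retry and dead-letter idea is independent of n8n and can be shown in a few lines. This is a concept sketch in Python, not an n8n configuration: `process_with_retry` and its parameters are hypothetical, and a real workflow would usually also wait between attempts.

```python
from typing import Any, Callable


def process_with_retry(items: list[Any],
                       handler: Callable[[Any], Any],
                       max_attempts: int = 3) -> tuple[list[Any], list[tuple[Any, str]]]:
    """Run handler on each item; park items that keep failing."""
    done: list[Any] = []
    dead_letter: list[tuple[Any, str]] = []
    for item in items:
        for attempt in range(1, max_attempts + 1):
            try:
                done.append(handler(item))
                break  # success, move on to the next item
            except Exception as exc:
                if attempt == max_attempts:
                    # Keep the item and the last error for manual review.
                    dead_letter.append((item, repr(exc)))
    return done, dead_letter
```

The point of the dead-letter list is that one broken document does not stop the whole run. It is set aside with its error message, and someone looks at it later.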

Security & access

RBAC per role and department, audit log for compliance, VPN access for admin interfaces, 2FA on all access.

Monitoring & backups

Which metrics are watched, how alerts are sensibly configured, how backup restores are tested regularly — not only when the server burns.

Update discipline

Roll out model updates in a controlled way, security patches promptly, a test environment for larger changes. An AI stack is a software stack — same maintenance obligations.

What gets automated

Six routine steps that workflows take over in operation — so the server runs stable without someone looking at it every day:

Data sync into the RAG index

New or changed documents are re-embedded automatically via n8n workflows and pushed into the vector store.
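
One common way to detect "new or changed" is a content hash per document. The following sketch assumes that the index stores the hash of the version it last embedded. The function names and the dictionary layout are assumptions made for this example, not the format of a specific vector store.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(documents: dict[str, str],
              indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (IDs to embed again, IDs to delete from the index)."""
    to_embed = sorted(
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    # Documents that no longer exist at the source leave the index too.
    to_delete = sorted(set(indexed_hashes) - set(documents))
    return to_embed, to_delete
```

Unchanged documents are skipped, which matters on a CPU-only server where every embedding run costs noticeable time.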

Model and container updates

Automated update routines pull new models or container images, check health endpoints and only go live after a successful smoke test.
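
The gating logic behind such a routine is simple and can be sketched without committing to a specific tool. All five callables below are placeholders that you would wire to your own deployment commands, and the name `safe_rollout` is hypothetical.

```python
from typing import Callable


def safe_rollout(deploy_candidate: Callable[[], None],
                 health_check: Callable[[], bool],
                 smoke_test: Callable[[], bool],
                 promote: Callable[[], None],
                 rollback: Callable[[], None]) -> bool:
    """Start a candidate and go live only if both checks pass."""
    deploy_candidate()
    # The smoke test only runs if the health endpoint already answers.
    if health_check() and smoke_test():
        promote()
        return True
    rollback()
    return False
```

A smoke test here can be as small as one fixed prompt with a known acceptable answer. What matters is that the check runs against the new version before users see it.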

Health checks & alerts

Service availability, response times, GPU utilization, disk usage are continuously monitored — Telegram or Slack alert on outliers.
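
What counts as an outlier is a threshold decision. The sketch below shows the comparison step only. The threshold values and metric names are illustrative assumptions, not recommendations, and sending the message to Telegram or Slack is left out on purpose.

```python
# Illustrative thresholds; tune them to your own hardware and use cases.
THRESHOLDS = {
    "response_time_s": 5.0,
    "gpu_utilization_pct": 95.0,
    "disk_usage_pct": 85.0,
}


def find_outliers(metrics: dict[str, float],
                  thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return one alert message per metric above its threshold."""
    return [
        f"{name}={value} exceeds {thresholds[name]}"
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    ]


alerts = find_outliers({"response_time_s": 2.1, "disk_usage_pct": 91.0})
# alerts == ["disk_usage_pct=91.0 exceeds 85.0"]
```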

Backup routines

Databases, vector stores, configurations and custom models are backed up automatically every day, offsite copy encrypted, monthly restore test.

Audit logging

Who sent which prompt when, which data was retrieved, which tool was called — fully logged and exportable for compliance audits.
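
A structured record per request keeps the log exportable. This is a minimal sketch of one JSON line per prompt. The field names are our own choice for this example, and which fields your compliance audit actually needs should be clarified beforehand.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(user: str,
                 prompt: str,
                 retrieved_doc_ids: list[str],
                 tools_called: list[str],
                 now: Optional[datetime] = None) -> str:
    """Build one JSON line describing who asked what and what was used."""
    if now is None:
        now = datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(),
        "user": user,
        "prompt": prompt,
        "retrieved_doc_ids": retrieved_doc_ids,
        "tools_called": tools_called,
    }
    # ensure_ascii=False keeps umlauts readable in the exported log.
    return json.dumps(record, ensure_ascii=False)
```

Appending one such line per request to a file gives a log that can be filtered by user or document without a separate database.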

Capacity tracking

Tokens per day, requests per hour, cost indicators (electricity, GPU utilization) as a dashboard, so scaling decisions rest on data.

What stays MANUAL on purpose

Self-hosting means responsibility. These six disciplines belong in human hands — workflow automation does not replace them:

Strategic model decisions

Which models you deploy, which license you accept, which tradeoffs you make (quality vs. speed vs. data control) — that is a business decision, not a workflow.

Choice of data in the RAG index

Which documents may go into the index, which may not, which need explicit approval — a human decision with data protection and confidentiality context.

Quality review of answers

Spot checks, recognising hallucinations, deriving model or pipeline tuning — observation that cannot be automated because it needs content context.

Disaster recovery drills

Actually run a backup restore every quarter, test failover, and document what did not work. A routine that does not happen without human discipline.

Security audits

Review access rights regularly, remove departed staff, penetration test for larger setups — not a workflow job.

Scaling decisions

When does the hardware become too small, when is moving to a bigger GPU or a hybrid setup with cloud frontier worth it — monitoring delivers the data, you decide.

How the handover into self-operation works

Getting from the requirements workshop to full self-operation typically takes 8–14 weeks, depending on tier and data volume in the RAG index:

1

Requirements workshop

Which use cases should be covered, what volume to expect, which data may go into the RAG index, which sector compliance applies?

2

Pick a setup tier

Starter, Professional or On-Premise — recommendation with reasoning based on use cases, budget and existing IT know-how on the team.

3

Configure hardware and hosting

Choose VPS provider or order hardware, set up network and access structures, prepare VPN and SSO integration.

4

Build the stack

Inference engine, models, embedding, vector store, RAG pipeline, frontend, n8n workflows, auth and monitoring — as a reproducible Docker stack.

5

Curate and index RAG data

Gather first data sources, test chunking strategy, define a golden query set against which every pipeline change is later measured.

6

Training & handover

1–2 day workshop with IT and subject matter owners: understand architecture, read and adjust workflows, maintain RAG data, use monitoring.

7

Accompanied pilot month

Weekly sparring sessions, measure RAG quality, curate prompts, document first edge cases. You operate, we step in only when needed.

8

Self-operation with maintenance discipline

From there the setup belongs to you. Optional: quarterly refreshers on model swaps, new components or compliance updates.

Effort and investment depend on the chosen tier and the level of accompaniment — concrete numbers come after the requirements workshop, in the context of our pricing overview.

Ready for the next step?

Free intro call, no strings attached. In 30 minutes you'll know whether and how AI can help your business.

Book a call

BAFA funding