Tool in production

Ollama

The local LLM server. Llama, Mistral, Qwen: every common open-weight model, ready to run in 15 minutes, with an OpenAI-compatible API. The platform for SMBs that want frontier quality without handing data to a US cloud.

Project profile

Ollama

Run large language models locally

As of: June 1, 2026

GitHub stars

173k

Forks

16k

Open issues

3.3k

License

MIT

Latest version

v0.24.0

Language

Go

First release

June 26, 2023

Last commit

June 1, 2026

→ GitHub repository → Official website → Documentation

Third-party source · Wikidata (CC0)

Wikidata profile

Ollama

Q124636097

License

MIT License

→ Wikidata entry

What is Ollama?

Ollama is a server written in Go that manages, loads and serves open-weight LLMs through an OpenAI-compatible API. Models are downloaded with one command (`ollama pull llama3.3:70b`), stay in memory and answer requests locally, without a request ever leaving the network.

The software is licensed under MIT. The models you load carry their own licenses: Llama models sit under the Meta Community License (not an OSI-approved license), while many Mistral and Qwen models are released under Apache-2.0. The self-hosting stack is open source; the model license decides what is allowed commercially.

Why a machine-builder uses Ollama

At a custom machine-builder, every specification, every CAD description, every competitor patent search is a question of competitive position. ChatGPT, Claude or Copilot Cloud are technically impressive — but uploading a packaging-machine specification there means handing IP to a US cloud vendor.

A local stack with Ollama as the LLM server + Open WebUI as the interface + n8n for automation: 80 staff use frontier quality on their own server. Every query stays in the building, every answer is traceable, and no token counts against an API bill. Shadow IT solved, quality not reduced.

Client case study

Maschinenbau Wagner

Custom machine builder for packaging technology, 80 staff, located in Lower Saxony, Germany. ChatGPT was banned internally for IP reasons. An audit revealed that 30 staff used it anyway (shadow IT). The answer: a local LLM server with Ollama, in production for 11 months. Shadow IT rate: 0%.

IP protection for CAD and patents

Specifications, CAD descriptions, own patent filings and competitor research are competitive assets. They must not be seen by a US cloud vendor — not even 'DPA-compliant', not even 'with EU endpoint'.

Datasheet generation in 4 languages

Machine datasheets are required in DE/EN/CN/FR. Manual translation takes weeks, and classic DeepL use uploads the contents to DeepL. The local LLM translates within the same infrastructure, with consistent brand names.

Code completion for PLC programming

PLC programming in Structured Text (IEC 61131-3) gets code completion in the in-house Codium IDE via Ollama with Qwen 2.5 Coder 32B. No GitHub Copilot, no call to a US service: the code logic stays in the building.

Translations without SaaS

Marketing texts, trade-fair brochures, complaint replies in several languages: all generated locally, with glossary consistency for brand names, technical terms and product codes.

RAG over competitor patents

An in-house database with 12,000+ patents from DEPATIS (the patent database of the German Patent and Trade Mark Office, DPMA) is semantically searchable. The question 'Which patents deal with film welding using hot-air flow?' returns relevant excerpts with file numbers.
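The retrieval step behind such a semantic search can be sketched in a few lines. This is a minimal illustration, not Wagner's implementation: the three-dimensional vectors stand in for real embeddings from an embedding model, and the file numbers and excerpts are hypothetical placeholders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Rank (file_number, excerpt, vector) entries by similarity to the query."""
    scored = [(cosine(query_vec, vec), file_no, text) for file_no, text, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical toy index; real vectors would come from an embedding model.
INDEX = [
    ("DE-EXAMPLE-001", "Film welding with a hot-air nozzle", [0.9, 0.1, 0.0]),
    ("DE-EXAMPLE-002", "Conveyor belt tensioning device", [0.0, 0.2, 0.9]),
    ("DE-EXAMPLE-003", "Heat sealing of packaging film", [0.7, 0.6, 0.1]),
]

query = [1.0, 0.2, 0.0]  # stand-in for the embedded user question
hits = top_k(query, INDEX, k=2)
for score, file_no, text in hits:
    print(f"{file_no}  {score:.3f}  {text}")
```

The top excerpts and their file numbers are then placed into the prompt so the model can answer with citations.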

API hookup for CAD software

Ollama's OpenAI-compatible API is called by an internal CAD plugin: a question about the part is asked in CAD context, and the answer comes from the local LLM, with no cloud component.

What the workforce actually does with it

Eight use cases from 11 months of practice at Wagner. Each replaces either shadow IT (clandestine ChatGPT) or something that simply was not possible without a local LLM.

Specification analysis

A customer sends an 80-page specification as a PDF. The shop-floor lead uploads it to Open WebUI and asks: 'Extract all technical requirements as a table, sorted into mandatory and optional.' Ollama (Llama 3.3 70B) returns 47 rows in 90 seconds, which are reviewed and taken into the quoting process.

Datasheet generation in 4 languages

Master content lives as Markdown. An n8n workflow passes each datasheet to Ollama (Llama 3.2), which translates it into DE/EN/CN/FR, glossary-consistent for brand names. Before: 2 weeks per datasheet set. Now: 2 hours including a review pass.
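One common way to keep brand names untouched is to mask them before translation and restore them afterwards. The sketch below assumes that approach; the source does not say how Wagner enforces its glossary. The placeholder format and the uppercase stand-in for the translation step are hypothetical, and whether a given model leaves placeholders intact has to be checked, which is what `lost_tokens` is for.

```python
GLOSSARY = ["WrapPro 3000"]  # brand name taken from the brochure example in this article


def protect(text, terms):
    """Replace glossary terms with placeholder tokens the model should leave alone."""
    mapping = {}
    # Longest terms first, so a short term never splits a longer one.
    for i, term in enumerate(sorted(terms, key=len, reverse=True)):
        token = f"[[TERM{i}]]"
        if term in text:
            text = text.replace(term, token)
            mapping[token] = term
    return text, mapping


def lost_tokens(translated, mapping):
    """Return placeholders that did not survive translation (should be empty)."""
    return [token for token in mapping if token not in translated]


def restore(text, mapping):
    """Put the original glossary terms back in place of the placeholders."""
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text


source = "Die WrapPro 3000 verpackt 40 Einheiten pro Minute."
masked, mapping = protect(source, GLOSSARY)
translated = masked.upper()  # stand-in for the real LLM translation call
assert not lost_tokens(translated, mapping)
print(restore(translated, mapping))
```

If `lost_tokens` returns anything, the translation is best retried or sent to the review pass rather than restored blindly.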

PLC code completion

An in-house Codium plugin talks to Qwen 2.5 Coder 32B via the Ollama API. PLC Structured Text gets context-aware completion: variable definitions, function calls, comments. Senior developers validate, junior developers become more productive.

Patent research assistant

A local RAG setup has indexed 12,000 competitor patents from DEPATIS. A question is asked in the Open WebUI interface, and Llama 3.3 70B answers with context excerpts and patent file numbers. Research hours per project: from 8 down to 1.5.

Complaint reply drafts

An incoming complaint is fed by sales support into Open WebUI: 'Draft a reply, friendly but legally watertight, referencing contract clause X.' Llama 3.3 writes the draft; the sales support lead edits and sends it. Time per reply: 8 minutes instead of 25.

Translations for trade-fair brochures

Marketing produces trade-fair brochures in DE and has them translated via the local LLM into EN/CN/FR. The output is glossary-consistent (brand names like 'WrapPro 3000' stay untouched) and stylistically appropriate (trade-fair tone instead of stilted translation English).

Email drafting in sales support

Standard prompt: 'Confirm delivery time for order #12345 in a polite tone.' Ollama returns 4 variants; the sales-support clerk picks one and extends it. Email backlog in sales support reduced from 40 to 8 per day.

Voice control for CAD (POC)

Pilot with two designers: a spoken instruction goes to a CAD plugin, Llama 3.3 interprets it ('Move the drive unit 30 mm in X direction, collision-free'), and the plugin executes it. Not yet in production, but a fascinating use case for the next 12 months.

Core capabilities of Ollama

What Ollama delivers as an LLM server, and which capabilities really carry an SMB setup.

100+ models from the registry

Llama 3.3, Mistral, Qwen 2.5, Gemma 3, Phi 4, DeepSeek: every model is loaded with `ollama pull <model>`. Versions are tags (`:70b`, `:7b`, `:q4_K_M`). Updates arrive via ollama.com/library, with new models typically available within days of release.

OpenAI-compatible API

Drop-in replacement for the OpenAI API: same endpoints (/v1/chat/completions, /v1/embeddings), same request structure. n8n nodes, LangChain and the Python OpenAI SDK all work against Ollama once the base URL points at the local server.
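To show what "same request structure" means in practice, here is a minimal sketch that builds a chat request with the Python standard library only. The host and port are taken from the curl example on this page; the helper names `build_chat_request` and `chat` are illustrative, not part of any SDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # default local port, as in the curl example


def build_chat_request(model, system, user, temperature=0.2):
    """Build an OpenAI-style chat completion request aimed at the local server."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, system, user):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, system, user)
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a running server and a pulled model, `chat("llama3.3:70b", "You are a specification expert.", "List the mandatory requirements.")` returns the reply as a string. Existing OpenAI clients need the same single change: a base URL that points at the local server.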

Quantisation for RAM/VRAM efficiency

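Models are provided in quantised form (Q4, Q5, Q8). Llama 3.3 70B in Q4_K_M fits in 48 GB of VRAM (for example an RTX 5090 plus an RTX 4090, across which Ollama splits the model without SLI, or a high-memory Apple Mac Studio); the full FP16 model would need about 140 GB for the weights alone.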
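The figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below reproduces it. The 16 bits for FP16 are exact; the roughly 4.8 bits per weight for Q4_K_M is an approximation assumed here, since K-quants mix several bit widths. The result covers weights only, so context (KV cache) and runtime overhead come on top.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


fp16_gb = weight_memory_gb(70, 16)    # full-precision 70B model
q4_gb = weight_memory_gb(70, 4.8)     # assumed average bit width for Q4_K_M

print(f"FP16:   {fp16_gb:.0f} GB")
print(f"Q4_K_M: {q4_gb:.0f} GB (approximate)")
```

Under these assumptions the quantised 70B model needs roughly 42 GB for weights, which is why 48 GB works but leaves limited room for long contexts.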

GPU acceleration

NVIDIA CUDA (all modern RTX), AMD ROCm (RDNA3+), Apple Metal (Apple Silicon Macs), Intel oneAPI: everything is auto-detected and used. Small models fall back to the CPU when no GPU is present.

Model library (ollama.com/library)

Central repository with every common open-weight model. Includes model cards with details such as size and license. Custom models (e.g. fine-tunes) can be defined in a Modelfile, analogous to a Dockerfile, and built locally.

Multi-modal (vision models)

Vision models like Llama 3.2 Vision, LLaVa, Qwen 2.5 VL accept images via Base64 or URL and reply in text. Use at Wagner: scanned delivery notes, part photos for complaint handling, whiteboard sketches for specification workflows.
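For the Base64 path, a request message can be sketched as follows. It uses the OpenAI-style multi-part message format with a data URL; that the local endpoint accepts this exact shape for a given vision model is an assumption to verify against the Ollama documentation. The helper name `image_message` is illustrative.

```python
import base64


def image_message(prompt, image_bytes, mime="image/png"):
    """Build a user message that carries a text prompt plus one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


# Placeholder bytes; in practice: open("delivery_note.png", "rb").read()
message = image_message("Read the delivery note number.", b"\x89PNG")
print(message["content"][1]["image_url"]["url"][:30])
```

The resulting dictionary goes into the `messages` list of a chat completion request for a vision-capable model.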

Honest alternatives

If Ollama is not a fit — what else?

Three alternatives for local inference, each with its own profile. Ollama covers the widest range of practical cases.

CLI / library

llama.cpp

Georgi Gerganov, MIT

+ Very fine control (quantisation, batch size)
+ Very resource-efficient, no container needed
− Steep learning curve; the bundled HTTP server (llama-server) has to be set up by hand
− Model management entirely your own

Desktop GUI

LM Studio

Element Labs, proprietary

+ Very good UX for single users
+ Chat interface integrated directly
− Built around the desktop app; its local server mode is secondary
− Proprietary, not open source

Production inference

vLLM

UC Berkeley, Apache-2.0

+ High performance, PagedAttention
+ OpenAI API, multi-user capable
− Setup more complex than Ollama
− No integrated model management

Rule of thumb: anyone with a GPU server or an Apple Silicon Mac who wants to be productive quickly is up and running on Ollama in 15 minutes. llama.cpp is the right choice when you need deep control over inference parameters and quantisation. LM Studio fits single-seat professionals. vLLM pays off with several hundred parallel requests.

Pricing

MIT server. Model license separate. Hardware dominates.

License

Ollama itself: MIT, an OSI-approved open-source license for the server software. The models you load carry their own licenses: Llama models use the Meta Community License (not OSI-approved), many Mistral and Qwen models are Apache-2.0, and Gemma ships under Google's own Gemma terms. For commercial use, check the license of each model.

Running costs

Hardware-dominated. Mid-range: RTX 4090 + 32 GB RAM workstation from €3,000. Premium: Apple Mac Studio M4 Ultra with 192 GB unified memory from €8,000 (runs everything including 70B models). Power: roughly 150–300 W during inference, far less at idle.

Effort

Install Ollama: 10 minutes (Homebrew, Docker or Linux installer). First model pull: 5–60 minutes depending on size. Production SMB setup with Open WebUI, RAG, workflow hookup and staff training: 5–10 consulting days.

Important: unlike with pure server software such as Caddy, with Ollama the license question shifts onto the model weights. Llama 3.x is not classic open source (the Meta Community License has commercial restrictions for very large operators and an acceptable-use policy), whereas many Mistral and Qwen models are Apache-2.0. For SMBs with under 700 million monthly active users, commercial Llama use is free of charge too.

Pull models + API call

# Load models
docker exec ollama ollama pull llama3.3:70b
docker exec ollama ollama pull qwen2.5-coder:32b
docker exec ollama ollama pull llama3.2-vision:11b

# API call (OpenAI compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:70b",
    "messages": [
      {"role": "system", "content": "You are a mechanical-engineering specification expert."},
      {"role": "user", "content": "Extract requirements from this PDF: ..."}
    ],
    "temperature": 0.2
  }'

Three pull commands and one API call give you a working setup. The API call is compatible with OpenAI clients: n8n nodes, LangChain and the Python OpenAI SDK work once their base URL points at the local server. Source: docs.ollama.com.

Ollama as a Docker container with GPU pass-through

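services:
  ollama:
    image: ollama/ollama:0.24.0        # pinned to the version named in the profile
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"                  # Ollama's default API port
    volumes:
      - ollama_models:/root/.ollama    # pulled models persist here
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia           # needs the NVIDIA Container Toolkit on the host
              count: all
              capabilities: [gpu]
    environment:
      - OLLAMA_HOST=0.0.0.0            # listen on all interfaces inside the container
      - OLLAMA_KEEP_ALIVE=24h          # keep loaded models in memory for 24 hours
      - OLLAMA_NUM_PARALLEL=4          # parallel requests per loaded model
    networks:
      - ai-backend

volumes:
  ollama_models:

networks:
  ai-backend:
    external: true                     # must already exist: docker network create ai-backend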

One container that pulls models, keeps them warm and serves them through an OpenAI-compatible API. NVIDIA runtime for GPU acceleration. Models land in the named volume. Source: docs.ollama.com, MIT license.
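After the container is up, it is worth checking that the models from the pull commands are actually present. The sketch below does that against Ollama's native `/api/tags` endpoint, which lists local models. The required list mirrors the three pulls on this page; the helper names are illustrative, and the response is assumed to contain a `models` list whose entries carry a `name` field.

```python
import json
import urllib.request

REQUIRED = ["llama3.3:70b", "qwen2.5-coder:32b", "llama3.2-vision:11b"]


def missing_models(tags_response, required):
    """Return the required model tags that the server does not list."""
    present = {model["name"] for model in tags_response.get("models", [])}
    return [name for name in required if name not in present]


def fetch_tags(base_url="http://localhost:11434"):
    """Query the running server for its locally available models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as response:
        return json.load(response)


# Offline illustration with a hand-written response of the assumed shape.
sample = {"models": [{"name": "llama3.3:70b"}, {"name": "qwen2.5-coder:32b"}]}
print(missing_models(sample, REQUIRED))
```

Against a live server, `missing_models(fetch_tags(), REQUIRED)` returns an empty list once all three pulls have finished.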

Ollama is the engine — what drives it?

Ollama is the inference server. The user-facing interface comes from Open WebUI, workflows hook in via n8n, and the whole platform sits in the solution 'Your own AI server':

→ Open WebUI (front-end for Ollama) → Your own AI server & RAG (solution context) → Cluster: Workflow & AI
