Ollama
The local LLM server. Llama, Mistral, Qwen: every common open-weight model, ready to run in 15 minutes, with an OpenAI-compatible API. The platform for SMBs that want frontier quality without handing data to a US cloud.
Project profile
Ollama: Run large language models locally
As of: June 1, 2026
GitHub stars: 173k
Forks: 16k
Open issues: 3.3k
License: MIT
Latest version: v0.24.0
Language: Go
Third-party source · Wikidata (CC0)
What is Ollama?
Ollama is a server written in Go that manages, loads and serves open-weight models through an OpenAI-compatible API. Models are downloaded with one command (`ollama pull llama3.3:70b`), stay loaded in memory for a configurable keep-alive period and answer requests locally, without a request ever leaving the network.
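A minimal Python sketch of what "OpenAI-compatible" means in practice, assuming a default local install listening on `http://localhost:11434` and a model that has already been pulled. The helper names `build_chat_request` and `ask` are hypothetical and only the standard library is used; nothing is sent until `ask` is called.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"  # assumption: default local port


def build_chat_request(model, user_prompt, system_prompt=None, temperature=0.2):
    """Build (but do not send) a POST request for the OpenAI-style chat endpoint."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    payload = {"model": model, "messages": messages, "temperature": temperature}
    return urllib.request.Request(
        OLLAMA_BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, user_prompt, system_prompt=None, timeout=120):
    """Send the request to the local server and return the assistant's text."""
    request = build_chat_request(model, user_prompt, system_prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses carry the answer in choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With a running server, `ask("llama3.3:70b", "Summarise this specification: ...")` would return the model's reply as a string. Because the endpoint follows the OpenAI request shape, existing OpenAI client code can typically be pointed at the local base URL instead.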
The Ollama software is licensed under MIT. The models you load have their own licenses, and this matters: Llama models are released under the Meta Llama Community License (not an OSI-approved license), while many Mistral and Qwen models are released under Apache-2.0. The self-hosting stack is open source; the model license decides what is allowed commercially.
Why a machine-builder uses Ollama
At a custom machine-builder, every specification, every CAD description, every competitor patent search is a question of competitive position. ChatGPT, Claude or Copilot Cloud are technically impressive — but uploading a packaging-machine specification there means handing IP to a US cloud vendor.
A local stack with Ollama as the server, a web chat interface such as Open WebUI, and a workflow tool for automation: 80 staff use frontier quality on their own server. Every query stays in the building, every answer is traceable, and no token counts against an API bill. Shadow IT solved, quality not reduced.
Client case study
Maschinenbau Wagner
Custom machine builder for packaging technology, 80 staff, based in Lower Saxony, Germany. ChatGPT was banned internally for IP reasons. An audit revealed that 30 staff used it anyway (shadow IT). The answer: a local server with Ollama, running productively for 11 months. Shadow IT rate: 0%.
IP protection for CAD and patents
Datasheet generation in 4 languages
Code completion for PLC programming
Translations without SaaS
RAG over competitor patents
API hookup for CAD software
What the workforce actually does with it
Eight productive use cases from 11 months of practice at Wagner. Each replaces either shadow IT (clandestine ChatGPT) or something that simply was not possible without a local LLM.
Specification analysis
Datasheet generation in 4 languages
PLC code completion
Patent research assistant
Complaint reply drafts
Translations for fair brochures
Email drafting in sales support
Voice control for CAD (POC)
Core capabilities of Ollama
What Ollama delivers as an inference server, and which capabilities really carry an SMB setup.
100+ models from the registry
OpenAI-compatible API
Quantisation for RAM/VRAM efficiency
GPU acceleration
Model library (ollama.com/library)
Multi-modal (vision models)
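Why quantisation matters for RAM and VRAM can be shown with simple arithmetic. The sketch below is a rough estimate under stated assumptions: it counts the weights only (parameters times bits per weight) and ignores the KV cache, activations and runtime overhead. The function name `estimate_weight_memory_gb` is hypothetical.

```python
def estimate_weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare a 70B model at three nominal precisions.
for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{estimate_weight_memory_gb(70, bits):.0f} GB")
```

Under these assumptions a 70B model needs roughly 140 GB at 16-bit, 70 GB at 8-bit and 35 GB at 4-bit. Real quantised files are usually somewhat larger than the nominal bit width suggests, and the context window adds memory on top, so treat the figures as a lower bound when sizing hardware.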
Honest alternatives
If Ollama is not a fit — what else?
Three alternatives for local inference, each with its own profile. Ollama covers the widest range of practical setups.
CLI / library
llama.cpp
Georgi Gerganov, MIT
- + Very fine control (quantisation, batch size)
- + Very resource-efficient, no container needed
- − Steep learning curve; the bundled `llama-server` API has to be set up by hand
- − Model management entirely your own
Desktop GUI
LM Studio
Element Labs, proprietary
- + Very good UX for single users
- + Chat interface integrated directly
- − Built around the desktop app; its local server mode is secondary
- − Proprietary, not open source
Production inference
vLLM
UC Berkeley, Apache-2.0
- + High performance, PagedAttention
- + OpenAI API, multi-user capable
- − Setup more complex than Ollama
- − No integrated model management
Rule of thumb: anyone with a GPU server or an Apple Silicon Mac who wants to be productive quickly is up and running with Ollama in 15 minutes. llama.cpp is the right choice when you need deep control over inference parameters and quantisation. LM Studio fits single-seat professionals. vLLM pays off with several hundred parallel requests.
Pricing
MIT server. Model license separate. Hardware dominates.
License
Ollama itself: MIT, a true OSI-approved open-source license for the server software. The models you load have their own licenses: Llama models use the Meta Llama Community License (not OSI-approved), while many Mistral and Qwen models are Apache-2.0. Terms vary by family and release (Gemma included), so check the model license before commercial use.
Running costs
Hardware-dominated. Mid-range: workstation with an RTX 4090 (24 GB VRAM) and 32 GB RAM from €3,000. Premium: Apple Mac Studio with an Ultra-class chip and 192 GB unified memory from €8,000 (runs everything including 70B models). Power: roughly 150–300 W during inference, far less at idle.
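To put the power figure in perspective, here is a back-of-the-envelope sketch. The electricity price (€0.30 per kWh), 8 hours of inference per day and 250 working days per year are illustrative assumptions, not figures from this page; the function name is hypothetical.

```python
def yearly_energy_cost_eur(avg_watts, hours_per_day, price_eur_per_kwh, days_per_year=250):
    """Energy cost per year for a machine drawing avg_watts while under load."""
    kwh_per_year = avg_watts / 1000 * hours_per_day * days_per_year
    return kwh_per_year * price_eur_per_kwh


# Assumed inputs: 8 h of inference per working day at 0.30 EUR/kWh.
for watts in (150, 300):
    cost = yearly_energy_cost_eur(watts, hours_per_day=8, price_eur_per_kwh=0.30)
    print(f"{watts} W: ~{cost:.0f} EUR per year")
```

With these assumed inputs the inference load costs on the order of €90 to €180 per year, which is small next to the hardware purchase. Idle consumption and cooling are not included.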
Effort
Install Ollama: 10 minutes (Homebrew, Docker or the Linux installer). First model pull: 5–60 minutes depending on model size and bandwidth. Productive SMB setup with Open WebUI, RAG, workflow hookup and staff training: 5–10 consulting days.
Important: unlike with pure server software such as Caddy, with Ollama the license question shifts to the model weights. Llama 3.x is not classic open source: the Meta Llama Community License requires a separate license from Meta for operators with more than 700 million monthly active users and comes with an acceptable-use policy. Mistral and Qwen, on the other hand, are largely Apache-2.0. For SMBs, which sit far below that threshold, commercial Llama use is free of charge within those terms.
Pull models + API call
# Load models
docker exec ollama ollama pull llama3.3:70b
docker exec ollama ollama pull qwen2.5-coder:32b
docker exec ollama ollama pull llama3.2-vision:11b
# API call (OpenAI compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:70b",
    "messages": [
      {"role": "system", "content": "You are a mechanical-engineering specification expert."},
      {"role": "user", "content": "Extract requirements from this PDF: ..."}
    ],
    "temperature": 0.2
  }'
Ollama as a Docker container with GPU pass-through
services:
  ollama:
    image: ollama/ollama:0.24.0
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=24h
      - OLLAMA_NUM_PARALLEL=4
    networks:
      - ai-backend

volumes:
  ollama_models:

networks:
  ai-backend:
    external: true
Related topics
Ollama is the engine — what drives it?
Ollama is the inference server. The user-facing surface comes from a chat front end such as Open WebUI, workflows hook in via an automation tool, and the whole platform sits in the solution 'Your own server'.