Stack in production

Observability stack — Prometheus + Grafana + Loki

Seven components that together form a complete stack: collect metrics, aggregate logs, build dashboards, dispatch alerts. A concrete alternative to Datadog and New Relic for hosting providers and IT teams that want to keep their own eyes on their infrastructure.

Compose for the central observability host

services:
  prometheus:
    image: prom/prometheus:v3.5.0
    container_name: prometheus
    restart: unless-stopped
    ports: ["9090:9090"]
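    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      # Assumption: alert rule files such as alerts/disk.yml sit next to prometheus.yml
      - ./alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus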
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
    networks: [observability]

  grafana:
    image: grafana/grafana:11.5.0
    container_name: grafana
    restart: unless-stopped
    ports: ["3000:3000"]
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SERVER_ROOT_URL=https://obs.hoster.com
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASS}
    networks: [observability]

  loki:
    image: grafana/loki:3.4.2
    container_name: loki
    restart: unless-stopped
    ports: ["3100:3100"]
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks: [observability]

  alertmanager:
    image: prom/alertmanager:v0.28.1
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    networks: [observability]

volumes:
  prometheus_data:
  grafana_data:
  loki_data:

networks:
  observability:

Four containers on one central host. Each monitored VPS additionally runs node_exporter, cAdvisor and Promtail. Netdata is optional as a second, all-in-one realtime monitor. Source: own practice, open-source licences.
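The agents on each monitored VPS could be started with a second Compose file. The following is a minimal sketch, not the author's actual file: the image tags, the file name promtail-config.yaml and the mount layout are assumptions based on the usual container setups for these three tools.

```yaml
# docker-compose.yml on each monitored VPS (sketch under stated assumptions)
services:
  node_exporter:
    image: prom/node-exporter:v1.8.2        # assumed version
    container_name: node_exporter
    restart: unless-stopped
    network_mode: host                       # serves host metrics on :9100
    pid: host
    command:
      - --path.rootfs=/host
    volumes:
      - /:/host:ro,rslave

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1  # assumed version
    container_name: cadvisor
    restart: unless-stopped
    ports: ["8080:8080"]
    privileged: true
    devices:
      - /dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro

  promtail:
    image: grafana/promtail:3.4.2            # assumed to match the Loki version above
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail-config.yaml:/etc/promtail/config.yml:ro   # hypothetical file name
      - /var/log:/var/log:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    command: -config.file=/etc/promtail/config.yml
```

Ports 9100 and 8080 expose unauthenticated metrics, so in a setup like this they would normally be firewalled to the central observability host only.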

Seven components in one stack

Each component does one thing well. Together they form a stack that runs fully on your own server with no SaaS dependency. All seven are real open source.

Prometheus

64k ★

Time-series database

Scrapes metrics every 15 seconds, stores them in its own TSDB, answers queries via PromQL. The heart of the stack.

Apache-2.0 · Go

Grafana

74k ★

Dashboards + alerting

Front end for Prometheus and Loki. Dashboards for hosts, containers, customers. Alertmanager integration for Slack/email.

AGPL-3.0 · TypeScript

Loki

28k ★

Log aggregation

Like Prometheus, but for logs: no full-text index, just labels. Very resource-efficient. Queries via LogQL.

AGPL-3.0 · Go

Promtail

28k ★

Log shipper on every host

Collects logs from journald, containers and file paths and sends them to Loki. Installed on every monitored host.

AGPL-3.0 · Go

cAdvisor

19k ★

Container metrics

Reads CPU, RAM and network metrics per container. Exports them in Prometheus format. One instance runs on every container host.

Apache-2.0 · Go

node_exporter

13k ★

Host metrics

Measures CPU load, RAM, disk IO, network counters, system load on the host itself. On every monitored host as a service or container.

Apache-2.0 · Go

Netdata

79k ★

Realtime monitoring (optional)

All-in-one realtime monitor with its own UI. Complementary to Prometheus + Grafana: per-second resolution for reacting to acute issues.

GPL-3.0 · C

What does the stack do together?

The seven components work in a clear data flow: node_exporter and cAdvisor collect metrics on each monitored host and serve them on ports 9100 and 8080. Prometheus scrapes those endpoints every 15 seconds and stores them in its time-series DB. Promtail collects logs from containers and journald and sends them to Loki for aggregation. Grafana shows both, metrics and logs, in dashboards with alerting.
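The scrape side of this data flow could look roughly like the following prometheus.yml. This is a sketch under assumptions: the host names are hypothetical, and it presumes that rule files are available under /etc/prometheus/alerts inside the container and that Alertmanager is reachable under its Compose service name on its default port 9093.

```yaml
# prometheus.yml on the central host (sketch, host names are hypothetical)
global:
  scrape_interval: 15s        # the 15-second scrape described above
  evaluation_interval: 30s

rule_files:
  - /etc/prometheus/alerts/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # Compose service name, default port

scrape_configs:
  - job_name: node              # node_exporter on every monitored host
    static_configs:
      - targets:
          - vps-01.example.internal:9100
          - vps-02.example.internal:9100

  - job_name: cadvisor          # cAdvisor on every container host
    static_configs:
      - targets:
          - vps-01.example.internal:8080
          - vps-02.example.internal:8080
```

Adding a host then means adding one target line per job and reloading Prometheus.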

The result: one UI for every host, every container, every service. Anyone who wants to know whether a customer's webshop is still up, whether DB load is high, or whether a specific error appears in the log looks at Grafana, not at 18 separate SSH sessions. Datadog delivers the same but costs around €5,000/year at 18 hosts; the self-hosted stack is a one-off setup plus one VPS.

Why a hosting provider self-hosts observability

A small hosting provider with 15–25 client VPS on its own hardware has two problems without central monitoring. First, every piece of information comes from the customer: 'my shop is down', 'the server reacts slowly', 'my mail does not come through'. Reactive. Second, finding out for yourself takes 18 SSH sessions, htop, journalctl. That does not scale.

An observability stack flips both: you see in Grafana that the disk on VPS-12 is 91 % full before the customer files a ticket. Alerts go out via Slack. Logs are searchable across every host at once. Datadog delivers the same, but for a 6-person hoster €5,000/year in licence costs makes a big difference.

Client case study

Schmidt-Werlich Hosting

Small managed-services provider in Lower Saxony, 6 people, 18 client VPS on their own Hetzner servers. Migrated 8 months ago from 'everyone SSHs individually' to a central observability stack. Today: one dashboard for everything, Slack alerts for every disk/memory/service issue, 30 days of metrics history for post-mortems.

Unified overview of 18 VPS

One dashboard that shows the state of all 18 client VPS at a glance: CPU, RAM, disk, service status. Instead of 18× SSH and htop.

Proactive alerts instead of tickets

Disk > 85 %, memory > 90 %, service unreachable — Slack alert in the hoster channel. Before the customer notices, action is taken.
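The memory and reachability thresholds named here could be expressed as Prometheus alert rules along these lines. This is a sketch: the group name, alert names and the 2-minute window for unreachable targets are assumptions; the metric names are the standard node_exporter ones.

```yaml
# /etc/prometheus/alerts/host.yml (hypothetical file name)
groups:
  - name: host
    rules:
      - alert: MemoryUsageHigh
        # Share of RAM that is not available to new processes, in percent
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory on {{ $labels.instance }} > 90% used"

      - alert: TargetDown
        # up is 0 when Prometheus cannot scrape the target
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is unreachable"
```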

30 days of metrics history

For a post-mortem: 'Why was server X unreachable last Wednesday between 14:00 and 16:00?' The answer comes from Prometheus, with graphs, instead of guesswork.

Container metrics per customer

Which customer consumes how much RAM/CPU on which VPS? Important for fair billing and capacity planning. cAdvisor delivers the data per container.
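One way to get per-customer numbers is a pair of Prometheus recording rules over the cAdvisor metrics. The sketch below assumes that every customer container is started with a Docker label customer=<name>, which cAdvisor then exposes as the metric label container_label_customer; the rule names are hypothetical.

```yaml
# /etc/prometheus/alerts/customer_usage.yml (hypothetical file name)
groups:
  - name: customer_usage
    rules:
      # RAM in use per customer and host
      - record: customer:container_memory_working_set_bytes:sum
        expr: >
          sum by (instance, container_label_customer)
          (container_memory_working_set_bytes{container_label_customer!=""})

      # CPU cores in use per customer and host, averaged over 5 minutes
      - record: customer:container_cpu_usage_seconds:rate5m
        expr: >
          sum by (instance, container_label_customer)
          (rate(container_cpu_usage_seconds_total{container_label_customer!=""}[5m]))
```

A Grafana panel or a billing export can then query the precomputed series instead of aggregating raw container metrics on every load.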

Cross-host log analysis

Find error patterns across all 18 VPS: 'Who has 5xx spikes in the last 24 hours?' One LogQL query instead of 18 SSH sessions with grep.

Optional: public status page

Generate a public status page for customers from the data. Builds trust, reduces tickets during known maintenance windows.

Eight productive patterns in operation

Concrete setups Schmidt-Werlich has used daily for 8 months. Each pattern replaces either a reactive activity or a gap that would have stayed invisible without central monitoring.

Host overview in Grafana

Main dashboard 'Hosts Overview': one row per VPS, columns for CPU%, RAM%, disk%, load, uptime. Red cells above thresholds. Daily glance: 30 seconds. Before: 30 minutes of round-robin SSH.

Container overview

Dashboard 'Containers': shows all running containers on every host with their RAM/CPU consumption. Filter by customer, image, status. cAdvisor delivers the data, Grafana renders the table.

Slack alerts via Alertmanager

Prometheus rule, in shorthand: 'disk usage on any instance > 85 % for 5m' → Alertmanager → Slack in #hosting-alerts. Plus email fallback. Resolved notifications follow once the issue is fixed.
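A matching alertmanager.yml could look roughly like this. It is a sketch, not the author's configuration: the webhook URL, mail relay and addresses are placeholders, and a real mail relay will usually also need SMTP credentials.

```yaml
# alertmanager.yml (sketch, all addresses are placeholders)
global:
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alerts@example.com"

route:
  receiver: hosting-alerts
  group_by: [alertname, instance]
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: hosting-alerts
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK"
        channel: "#hosting-alerts"
        send_resolved: true      # post a message when the alert clears
    email_configs:
      - to: "oncall@example.com"
        send_resolved: true
```

With both blocks under one receiver, every alert goes to Slack and to email, so the mail acts as a fallback when Slack is unavailable.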

Log aggregation via Promtail

Promtail runs as a service on every VPS and reads container logs and journald. It ships them centrally to Loki. Label convention: {host, container, customer, severity}, which enables precise filtering.
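A Promtail configuration following this label convention might look like the sketch below. The central host name, the host value vps-12 and the Docker label customer are assumptions for illustration; the severity label would additionally need a pipeline stage that parses it from the log line, which is left out here.

```yaml
# promtail-config.yaml on a monitored VPS (sketch, names are hypothetical)
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://obs.example.internal:3100/loki/api/v1/push
    external_labels:
      host: vps-12               # attached to every log stream from this host

scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ["__journal__systemd_unit"]
        target_label: unit

  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.*)"
        target_label: container
      # Assumes containers carry a Docker label customer=<name>
      - source_labels: ["__meta_docker_container_label_customer"]
        target_label: customer
```

When Promtail itself runs in a container, the journal job also needs the host's journal directory mounted into that container.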

LogQL for error patterns

Example: sum by (host) (rate({severity="error"} |= "connection refused" [1m])) shows the per-host rate of connection errors; with the query range set to the last hour, spikes across all hosts stand out. Anomalies visible in seconds, no ssh + grep.

Custom dashboards per customer

One dashboard per customer: only their VPS, their containers, their logs. In consulting calls Schmidt-Werlich shows the customer concretely how their infrastructure evolves.

Prometheus TSDB backup

Nightly snapshot backup of the TSDB (continuous replication would double stack effort). 30 days retention, older data moved to a separate NAS for cases.
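Prometheus takes TSDB snapshots through its admin API, which is disabled by default. A sketch of the change to the prometheus service in the Compose file, assuming the nightly job then sends a POST request to /api/v1/admin/tsdb/snapshot and copies the resulting directory from the snapshots folder of the data volume:

```yaml
# Fragment of the central docker-compose.yml (sketch)
services:
  prometheus:
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
      - --web.enable-admin-api   # enables the snapshot endpoint; keep port 9090 private
```

The admin API can also delete data, so with this flag set, port 9090 should not be reachable from outside the observability host.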

Public status page

status.schmidt-werlich.com shows a public traffic light per customer. Generated from Prometheus data via a Statping container. Reduces tickets during known maintenance windows by about 40 %.

What the stack delivers as a whole

Six stack-level capabilities — properties that only emerge from the interplay of the components.

Single pane of glass

One UI for every datum. Whether metrics, logs or alerts: Grafana is the central touch point. Staff only need to learn one system, not seven.
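Both data sources can be wired into Grafana declaratively through a provisioning file. A minimal sketch, assuming the file is mounted into the Grafana container under /etc/grafana/provisioning/datasources/ and that Prometheus and Loki are reachable under their Compose service names:

```yaml
# provisioning/datasources/datasources.yaml (hypothetical path on the host)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

Grafana reads the file at startup, so a rebuilt container comes up with both data sources already configured.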

Time-series database with retention

Prometheus stores metrics efficiently (around 1–2 bytes per sample, so roughly 1–2 GB per billion data points). Default retention is 15 days, easily extended to 30 or 90 days. Post-mortems become routine.

Log aggregation without full-text overkill

Loki does not index the full text as Elasticsearch does, only labels. Result: roughly 10× less RAM/disk than the ELK stack for comparable log search functionality.
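The loki-config.yaml mounted in the Compose file is not shown in this article. A single-node sketch, modelled on the local configuration that ships with Loki 3.x and assuming filesystem storage under /loki:

```yaml
# loki-config.yaml (sketch for a single-node setup, not the author's file)
auth_enabled: false

server:
  http_listen_port: 3100

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb              # label index only, no full-text index
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
```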

Multi-host collection

Prometheus federation or simple multi-target scraping: one central Prometheus pulls metrics from any number of hosts. Scales from 5 to 500 hosts without architectural change.
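For the federation variant, the central Prometheus scrapes the /federate endpoint of another Prometheus server. A sketch, assuming a second Prometheus at a hypothetical address and that only the node and cadvisor jobs should be pulled:

```yaml
# Additional scrape job in the central prometheus.yml (sketch)
scrape_configs:
  - job_name: federate
    scrape_interval: 15s
    honor_labels: true           # keep the job/instance labels of the source server
    metrics_path: /federate
    params:
      "match[]":
        - '{job="node"}'
        - '{job="cadvisor"}'
    static_configs:
      - targets:
          - prometheus-dc2.example.internal:9090   # hypothetical second Prometheus
```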

Alerting with routing

Alertmanager: define recipients per alert rule (Slack, email or other receivers). Silencing for maintenance windows. Inhibition: if the cluster is down, no 200 individual service alerts.
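Routing and inhibition are configured in alertmanager.yml; silences are created at runtime in the Alertmanager UI or with amtool rather than in the config file. The fragment below is a sketch: the receiver names and the alert name TargetDown are hypothetical, and the receivers are left empty to keep the focus on the routing tree.

```yaml
# alertmanager.yml, routing and inhibition (sketch)
route:
  receiver: hosting-alerts            # default for everything
  routes:
    - matchers:
        - 'severity = "critical"'
      receiver: oncall-email          # critical alerts go to a second receiver

inhibit_rules:
  # While a host is reported down, suppress its warning-level alerts
  - source_matchers:
      - 'alertname = "TargetDown"'
    target_matchers:
      - 'severity = "warning"'
    equal: [instance]

receivers:
  - name: hosting-alerts
  - name: oncall-email
```

The equal list restricts the suppression to alerts that share the same instance label, so an outage on one host does not hide warnings from the others.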

Containers + hosts together

node_exporter for host level (RAM, disk, load), cAdvisor for container level (per-container CPU, RAM, network). Together they give the full picture: host pressure and container pressure.

Example Prometheus alert rule

# /etc/prometheus/alerts/disk.yml
groups:
  - name: disk
    interval: 30s
    rules:
      - alert: DiskUsageHigh
        expr: |
          (
            (node_filesystem_size_bytes - node_filesystem_avail_bytes)
            / node_filesystem_size_bytes
          ) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} > 85% full"
          description: |
            Mountpoint {{ $labels.mountpoint }} on host
            {{ $labels.instance }} is {{ $value | printf "%.1f" }}%
            full. Maintenance required.
        # → Alertmanager → Slack #hosting-alerts

A rule that fires an alert when disk usage on any host is above 85 % and stays so for at least 5 minutes. Source: own practice, MIT snippet.

Honest alternatives

If the self-hosted stack is not a fit — what else?

Three alternatives with different strengths. The stack is the pragmatic default for SMBs and small hosting providers.

SaaS market leader

Datadog

Datadog Inc., USA

+ Excellent UX, very wide integration library
+ Out of the box, productive in 30 minutes
− Very expensive (from $15/host/month, cumulative)
− US cloud, metric data leaves the building

SaaS with free tier

New Relic

New Relic Inc., USA

+ 100 GB ingest/month free
+ APM (Application Performance Monitoring) integrated
− Past the free tier prices comparable to Datadog
− US provider; metric data may be processed outside the EU, which complicates GDPR compliance

Classic self-hosted

Zabbix

Zabbix LLC, AGPL-3.0 (GPL-2.0 before version 7.0)

+ Very mature, productive since 2001
+ Usable even without Docker
− Monolithic (one large app)
− UI feels dated

Rule of thumb: with 5–100 hosts, IT-affine staff and consulting support, the self-hosted stack is the pragmatic choice. Datadog and New Relic scale only via the wallet: at 50+ hosts they cost noticeably more than an extra mini PC. Zabbix remains an option for pure infrastructure monitoring without container depth.

Pricing

Open source. One extra host. No SaaS.

License

All 7 components are open source: Prometheus, cAdvisor and node_exporter under Apache-2.0; Grafana, Loki and Promtail under AGPL-3.0; Netdata under GPL-3.0. Running unmodified builds for your own use, without redistribution, creates no obligations.

Running costs

One additional observability host: VPS with 4–8 GB RAM, 100 GB storage (Hetzner CPX31 from €15/month). Plus minimal overhead on every monitored host (node_exporter + cAdvisor + Promtail = around 50 MB RAM/host). At 18 hosts: 0.9 GB RAM extra in total.

Effort

Initial setup with all 7 components + first 5 hosts: 2–3 days. Roll-out to additional hosts: 30 minutes per host. Dashboard build for a hosting setup (multi-tenant, customer dashboards, alerts): 2 consulting days.

Datadog for 18 hosts: around €5,000/year. New Relic free tier covers about 10 hosts. Observability stack: one-off setup (5–8 consulting days) + €15/month VPS. Break-even against Datadog for hosting providers typically after 2–4 months.

Observability complements the rest of the toolset

Uptime Kuma does external checks (HTTPS, DNS), Portainer does container inspection, Docker is the platform, and the observability stack delivers the infrastructure view.
