Observability — Healthcheck, Metrics & Monitoring

Gubernator exposes all observability endpoints on port 4002. The port requires no authentication, so Prometheus and other monitoring tools can scrape it without extra configuration.

Port 4002 is intentionally public

This port is designed for internal infrastructure monitoring. "Public" here means unauthenticated, not internet-facing: in production, firewall it from the public internet and expose it only to your Prometheus / monitoring network.

Endpoints Summary

Endpoint	Method	Description
`/health`	GET	JSON health check — use for load balancers and readiness probes
`/metrics`	GET	Prometheus-format metrics (Gubernator + Go runtime)
`/swagger/index.html`	GET	Interactive Swagger UI for the REST API

Health Check

HTTP Endpoint

curl http://localhost:4002/health

Response when healthy:

{"status": "healthy"}

The endpoint always returns 200 OK as long as the Gubernator process is running. It can be used as a Docker HEALTHCHECK, in Docker Compose, as a Kubernetes probe, or from a CLI / script. The examples below appear in that order.

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD ["/app/gbnt", "health"]

This HEALTHCHECK is already built into the official Dockerfile and is self-contained: it needs no shell, curl, or wget inside the image.

services:
  gubernator:
    image: gubernator:latest
    healthcheck:
      test: ["CMD", "/app/gbnt", "health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

livenessProbe:
  httpGet:
    path: /health
    port: 4002
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: 4002
  initialDelaySeconds: 5
  periodSeconds: 10

# Wait until Gubernator is healthy before running commands
until curl -sf http://localhost:4002/health; do
  echo "Waiting for Gubernator..."
  sleep 2
done
echo "Gubernator is healthy!"
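
The loop above retries forever if Gubernator never becomes healthy. A bounded variant might look like the following sketch. The function name wait_for_health and its timeout and interval arguments are illustrative assumptions, not something Gubernator ships.

```shell
# Hypothetical helper: poll a health URL until it succeeds or a deadline passes.
# Usage: wait_for_health URL [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_health() {
  url=$1
  timeout=${2:-60}
  interval=${3:-2}
  waited=0
  # -s silences progress output, -f makes curl exit non-zero on HTTP errors.
  until curl -sf "$url" >/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Gubernator not healthy after ${timeout}s" >&2
      return 1
    fi
    echo "Waiting for Gubernator..."
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "Gubernator is healthy!"
}
```

In a deployment script this could be called as `wait_for_health http://localhost:4002/health 60 2 || exit 1`, so that later steps run only once the endpoint has answered.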

Prometheus Metrics

Available Metrics

Gubernator Custom Metrics

Metric	Type	Description
`gbnt_total_nodes`	Gauge	Number of nodes currently registered in the cluster
`gbnt_total_tasks`	Gauge	Number of tasks (containers) scheduled in the cluster

Both gauges are refreshed every 15 seconds by the internal startWatchtowers() loop, so scraping more often than every 15 seconds will not yield fresher values.

Go Runtime and Process Metrics (automatic)

Metric	Description
`go_goroutines`	Number of running goroutines
`go_memstats_alloc_bytes`	Heap memory in use
`go_memstats_sys_bytes`	Total memory obtained from OS
`go_gc_duration_seconds`	GC pause durations (summary with quantiles 0, 0.25, 0.5, 0.75 and 1, i.e. min, p25, p50, p75, max)
`go_info`	Go version information
`process_cpu_seconds_total`	CPU time consumed
`process_open_fds`	Number of open file descriptors

Scraping Manually

# Raw Prometheus text format
curl http://localhost:4002/metrics

# Filter only Gubernator metrics (including their HELP and TYPE lines)
curl -s http://localhost:4002/metrics | grep "gbnt_"

Example output:

# HELP gbnt_total_nodes Current number of nodes registered in the cluster.
# TYPE gbnt_total_nodes gauge
gbnt_total_nodes 2
# HELP gbnt_total_tasks Current number of tasks scheduled in the cluster.
# TYPE gbnt_total_tasks gauge
gbnt_total_tasks 5
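
For scripts that need a single number rather than the whole exposition, a small helper along these lines may be enough. It is a sketch that assumes the metric carries no labels, as in the output above. The name metric_value is hypothetical.

```shell
# Hypothetical helper: print the value of one unlabelled metric from
# Prometheus text-format input on stdin.
metric_value() {
  # Match lines whose first field is exactly the metric name. This skips
  # the "# HELP" and "# TYPE" comment lines, whose first field is "#".
  awk -v name="$1" '$1 == name { print $2; exit }'
}
```

For example, `curl -s http://localhost:4002/metrics | metric_value gbnt_total_nodes` would print 2 for the sample output shown above.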

Prometheus Configuration

Minimal scrape config

Add this to your existing prometheus.yml:

scrape_configs:
  - job_name: 'gubernator'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['<gubernator-host>:4002']
        labels:
          service: 'gubernator'

Replace <gubernator-host> with:

- localhost, if Prometheus runs on the same machine
- gubernator, if both run in the same Docker Compose network
- host.docker.internal, if Prometheus runs in Docker and Gubernator runs on the host (Mac/Windows)

Monitoring Stack (Prometheus + Grafana)

The repository includes a ready-to-use monitoring/ directory with:

monitoring/
├── docker-compose.yml                          # Gubernator + Prometheus + Grafana
├── prometheus/
│   └── prometheus.yml                          # Pre-configured scrape job
└── grafana/
    ├── provisioning/
    │   ├── datasources/prometheus.yml          # Auto-connects Prometheus
    │   └── dashboards/dashboards.yml           # Auto-loads dashboards
    └── dashboards/
        └── gubernator.json                     # Pre-built Gubernator dashboard

Start the Full Stack

# Build Gubernator image first (if not using Docker Hub)
docker build -t gubernator:latest .

# Launch everything
cd monitoring/
docker compose up -d

Verify

# Check all containers are healthy
docker compose ps

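# Check Gubernator is being scraped
# (-G with --data-urlencode stops curl from treating the braces as a URL glob)
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="gubernator"}' | python3 -m json.tool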
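
To turn that check into a pass/fail step, the JSON response can be inspected instead of read by eye. The sketch below assumes the standard Prometheus instant-query response shape (a status field and data.result entries whose value is a [timestamp, "1" or "0"] pair) and that python3 is available, as in the command above. The function name all_targets_up is hypothetical.

```shell
# Hypothetical helper: read a Prometheus instant-query response on stdin and
# succeed only if the query worked and every returned sample has value "1".
all_targets_up() {
  python3 -c '
import json, sys
body = json.load(sys.stdin)
samples = body.get("data", {}).get("result", [])
# Fail when the query errored, matched no series, or any target is down.
ok = body.get("status") == "success" and samples and all(s["value"][1] == "1" for s in samples)
sys.exit(0 if ok else 1)
'
}
```

A possible invocation is `curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="gubernator"}' | all_targets_up && echo "scrape OK"`.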

Access

Service	URL	Credentials
Gubernator Web UI	http://localhost:4001	admin / admin
Prometheus	http://localhost:9090	—
Grafana	http://localhost:3000	admin / admin
Gubernator Metrics	http://localhost:4002/metrics	—
Health Check	http://localhost:4002/health	—

Grafana Dashboard

The Gubernator — Cluster Overview dashboard is provisioned automatically. It shows:

┌────────────────────────────────────────────────────────────────┐
│  Active Nodes  │  Total Tasks  │  Status  │  Goroutines        │
│      ██ 1      │     ██ 3      │  ✅ UP   │      ██ 12         │
├────────────────────────────────────────────────────────────────┤
│  Nodes & Tasks Over Time        │  Memory Usage                │
│  ─────────────────────────────  │  ─────────────────────────── │
│  nodes ─────── 1                │  heap ────────────────        │
│  tasks ╱──────╱ 3               │  sys  ─────────────────       │
├────────────────────────────────────────────────────────────────┤
│  Goroutines Over Time  │  GC Rate  │  GC Pause Duration        │
└────────────────────────────────────────────────────────────────┘

Panels included:

Panel	Metric	Type
Active Nodes	`gbnt_total_nodes`	Stat
Total Tasks	`gbnt_total_tasks`	Stat
Gubernator Status	`up{job="gubernator"}`	Stat (UP/DOWN)
Goroutines	`go_goroutines`	Stat
Nodes & Tasks Over Time	`gbnt_total_nodes`, `gbnt_total_tasks`	Time series
Memory Usage	`go_memstats_alloc_bytes`, `go_memstats_sys_bytes`	Time series
Goroutines Over Time	`go_goroutines`	Time series
GC Rate	`rate(go_gc_duration_seconds_count[5m])`	Time series
GC Pause Duration	`go_gc_duration_seconds` p50/p99	Time series

Tear Down

cd monitoring/
docker compose down       # stop containers, keep volumes
docker compose down -v    # stop containers AND delete volumes (reset all data)