mem_limit · healthcheck and start_period · restart backoff · json-file rotation and disk checks
Teams running OpenClaw on a VPS 24/7 often get hit by three classes of failure: a cold-start Gateway killed by a too-short healthcheck, a mem_limit set below WASM peaks triggering exit 137, and default json-file logs filling the host disk. This article lays out the smallest diff set between dev and prod compose files, a six-step reproducible baseline, and the point at which restart jitter requires a human halt, cross-linked with the Exit 137 and allowedOrigins triage article and the image pin and rollback guide.
Dev stacks often drop resource caps, shorten health windows, and stream logs to the console. On unattended hosts those choices stack into restart storms: the container is marked unhealthy before the Gateway is ready, restart: always respawns it every few seconds, and log files exhaust inode or disk quotas before any backoff helps.
Unlike a plain API container, OpenClaw's first compile and model load produce RSS spikes; if mem_limit tracks only steady-state averages, the OOM killer removes the process quietly at night. The baseline here bakes in observable signals: every knob should map to a field in docker inspect or a host metric on the incident ticket.
- start_period too short: probes fail before port 18789 listens; the panel shows endless restarts.
- mem_limit below peaks: first-run WASM or dependency spikes exceed the cgroup cap, leaving exit 137 and thin application logs.
- Unbounded json-file: channel callbacks and model debug logs fill the root disk during busy PR validation.
- Restart without backoff: misconfiguration plus false unhealthy states starve CPU and IO together.
- Dev bind mounts in prod: hot-reload trees and loose permissions expose secrets outside backup scripts.
Assume a two-file layout in one repo: docker-compose.yml holds the shared service definitions, while docker-compose.prod.yml appends only production deltas, so you avoid copy-paste drift.
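A minimal sketch of that layout; compose merges the files left to right, so prod deltas override dev defaults:

```sh
# Review the merged result before deploying; the filenames follow the convention above.
docker compose -f docker-compose.yml -f docker-compose.prod.yml config

# Bring up prod with the merged configuration.
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```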
| Dimension | Dev default | Prod baseline |
|---|---|---|
| Memory | No cap or host-wide only | Explicit mem_limit with headroom for cold spikes; cross-check the Exit 137 article |
| Healthcheck | Short interval, no start_period | start_period covers cold boot; retries and timeout match your SLA language |
| Restart | unless-stopped or none | on-failure or bounded always plus host-level alerts |
| Logging driver | json-file defaults or local | max-size + max-file; do not rely on manual truncate |
| Volumes and secrets | Bind-mount source trees | Read-only config mounts; secrets from env_file or Docker secrets, not image layers |
Every production-only line should trace to a Grafana panel or ticket field; otherwise it is wishful thinking.
These steps assume images follow the pinning playbook; if you still ship :latest, finish one full cold-start sample on staging before moving the override to prod.
1. Sample peaks: on staging, run docker stats --no-stream and in-container ps, logging RSS for ten minutes around the first Gateway Ready (see the sketch after this list).
2. Write mem_limit: multiply the peak by your agreed safety factor; document the host swap policy to avoid silent OOM.
3. Define readiness: HTTP or CMD probes should hit the same loopback address the Gateway binds, not the public edge path.
4. Set start_period: cover the first dependency install and WASM compile; size to cold-start p95, not the mean.
5. Tighten json-file: declare a logging block per service with max-size and max-file; alert on log directory growth rate.
6. Drill restarts: inject one failed probe and confirm that intervals, backoff, and paging match the on-call runbook.
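A sampling sketch for steps 1 and 2, assuming the service is named openclaw and docker stats is available on the host:

```sh
# Step 1 sketch: log container RSS every 10 s for ten minutes around the first "Gateway Ready".
for i in $(seq 1 60); do
  docker stats --no-stream --format '{{.Name}} {{.MemUsage}}' openclaw
  sleep 10
done | tee rss-samples.log

# Step 2 sketch: take the observed peak and apply the agreed safety factor.
# Example with placeholder numbers: a 1.3 GiB peak at a 1.5x factor rounds to the 2g limit below.
```

The prod override that falls out of those steps looks like this baseline: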
```yaml
services:
  openclaw:
    mem_limit: "2g"        # observed cold-start peak x safety factor (step 2)
    logging:
      driver: json-file
      options:
        max-size: "20m"    # rotate each file at 20 MiB
        max-file: "5"      # keep at most 5 rotated files (~100 MiB hot retention)
    healthcheck:
      test: ["CMD", "curl", "-f", "http://127.0.0.1:18789/health"]
      interval: 30s
      timeout: 5s
      retries: 5
      start_period: 180s   # sized to cold-start p95, not the mean (step 4)
    restart: on-failure:5  # bounded restarts; host-level alerts catch the storm
```
Note: health paths must match your image; if only TCP is available, use CMD-SHELL with nc, but document the higher false-positive surface versus HTTP.
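A TCP-only fallback sketch (service-level fragment), assuming nc ships in the image:

```yaml
healthcheck:
  # TCP reachability only; the port can listen before the Gateway is actually ready.
  test: ["CMD-SHELL", "nc -z 127.0.0.1 18789 || exit 1"]
```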
When docker events shows more restarts per five minutes than your runbook allows, suspect misconfiguration and false unhealthy states before “just add RAM.” Blind restarts also amplify log writes.
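A counting sketch, assuming the container is named openclaw:

```sh
# Count restart events in the last five minutes; compare against the runbook threshold.
docker events --since 5m --until "$(date +%s)" \
  --filter 'container=openclaw' --filter 'event=restart' | wc -l
```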
If the root disk is already read-only, drain traffic or stop the reverse proxy first, then follow the space-freeing order in the Exit 137 article; avoid forcing docker compose up -d over the volumes before disk health is confirmed.
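A pre-flight disk sketch, assuming the default Docker root under /var/lib/docker (run as root):

```sh
# Confirm free space before restarting anything.
df -h / /var/lib/docker

# Find the fattest per-container json-file log directories.
du -sh /var/lib/docker/containers/*/ 2>/dev/null | sort -rh | head -5
```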
Warning: lowering max-file rotates hot logs away faster; if compliance needs longer retention, ship cold logs to object storage or a logging host instead of raising single-file caps without bounds.
1. Freeze releases: disable the deploy pipeline entry point when restart storms exceed the threshold.
2. Capture state: export Health and OOM snippets from docker inspect plus the last two hundred log lines (see the sketch after this list).
3. Rollback compose: restore the previous override and keep a repro bundle for the postmortem.
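A minimal capture sketch for step 2; the container name is a placeholder:

```sh
# Health history and OOM/exit state for the incident ticket.
docker inspect --format '{{json .State.Health}}' openclaw > health.json
docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' openclaw

# Last two hundred log lines, stderr included, for the repro bundle.
docker logs --tail 200 openclaw > last200.log 2>&1
```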
The numbers below are starters for charter and on-call docs; replace them with your real cold-start histogram and disk growth curves, and do not advertise them as a customer-facing SLA without measurement.
When the same VPS also hosts a reverse proxy or a small database, mem_limit and log IO contend with those neighbors; reviewers should require container RSS, host free memory, and disk write latency on one dashboard.
| Host shape | Logging starter | Relation to image pins |
|---|---|---|
| 2 vCPU / 4 GB | Smaller max-size, shorter retention, more aggressive shipping | Digest pins required to avoid surprise cold-start growth |
| 4 vCPU / 8 GB | Can relax max-file slightly; never disable rotation | Staging and prod use different tags validated against the same digest |
| Mixed workloads | Separate data disk or log volume away from database partitions | Upgrade windows align with the pinning article |
Parking OpenClaw on a throwaway small VPS under-provisions memory, disk, and secret rotation at once; bare-metal DIY trades those limits for more power but adds line and SLA risk.
Teams that need contracted compute, selectable regions, and auditable bandwidth, while keeping the Gateway and channels observable in 24/7 mode, often do better on rented cloud Mac Minis; VpsMesh Mac Mini cloud rental is usually the stronger fit because compose baselines, reverse proxies, and backup stories can be evaluated in one capacity narrative.
- Probe keeps failing after deploy: check the probe user and PATH, and whether start_period covers cold start; if it still fails, compare loopback behavior with the allowedOrigins and reverse-proxy article (see the sketch below).
- Exit 137 overnight: start with host dmesg and the container exit code, then re-sample peaks before changing limits; follow the Exit 137 triage article. Compare plans on the pricing page.
- Worried rotation drops logs: it shortens hot local retention; ship archives to centralized storage when policy requires longer windows; ops policy is in the help center.
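To rule out probe-environment issues by hand, a sketch assuming the default service name and curl present in the image:

```sh
# Run the exact probe command inside the container, as the container's default user.
docker exec openclaw curl -f http://127.0.0.1:18789/health

# If the manual probe succeeds while the healthcheck fails, suspect PATH, user, or start_period.
docker inspect --format '{{json .Config.Healthcheck}}' openclaw
```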