mem_limit · healthcheck and start_period · restart backoff · json-file rotation and disk checks
Teams running OpenClaw on a VPS 24/7 often get hit by three classes of failure: a cold-start Gateway killed by a too-short healthcheck, a mem_limit set below WASM peaks triggering exit 137, and default json-file logs filling the host disk. This article lays out the smallest diff set between dev and prod compose files, a six-step reproducible baseline, and the point at which restart jitter requires a human halt, cross-linked with the Exit 137 and allowedOrigins triage article and the image pin and rollback guide.
Dev stacks often drop resource caps, shorten health windows, and stream logs to the console. On unattended hosts those choices stack into restart storms: the container is marked unhealthy before the Gateway is ready, restart: always respawns it every few seconds, and log files exhaust inode or disk quotas before any backoff helps.
Unlike a plain API container, OpenClaw's first compile and model load produce RSS spikes; if mem_limit tracks only steady-state averages, the OOM killer removes the process quietly at night. The baseline here bakes in observable signals: every knob should map to a field in docker inspect or a host metric on the incident ticket.
- start_period too short: probes fail before port 18789 listens; the panel shows endless restarts.
- mem_limit below peaks: first-run WASM or dependency spikes exceed the cgroup cap, leaving exit 137 and thin application logs.
- Unbounded json-file: channel callbacks and model debug logs fill the root disk during busy PR validation.
- Restart without backoff: misconfiguration plus false unhealthy states starve CPU and IO together.
- Dev bind mounts in prod: hot-reload trees and loose permissions expose secrets outside backup scripts.
Assume a two-file layout in one repo: docker-compose.yml holds the shared service definitions, while docker-compose.prod.yml appends only production deltas, so you avoid copy-paste drift.
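A minimal sketch of that layout; compose merges the files left to right, so prod deltas override dev defaults:

```sh
# Review the merged result before deploying; the filenames follow the convention above.
docker compose -f docker-compose.yml -f docker-compose.prod.yml config

# Bring up prod with the merged configuration.
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```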
| Dimension | Dev default | Prod baseline |
|---|---|---|
| Memory | No cap or host-wide only | Explicit mem_limit with headroom for cold spikes; cross-check the Exit 137 article |
| Healthcheck | Short interval, no start_period | start_period covers cold boot; retries and timeout match your SLA language |
| Restart | unless-stopped or none | on-failure or bounded always plus host-level alerts |
| Logging driver | json-file defaults or local | max-size + max-file; do not rely on manual truncate |
| Volumes and secrets | Bind-mount source trees | Read-only config mounts; secrets from env_file or Docker secrets, not image layers |
Every production-only line should trace to a Grafana panel or ticket field; otherwise it is wishful thinking.
These steps assume images follow the pinning playbook; if you still ship :latest, finish one full cold-start sample on staging before moving the override to prod.
1. Sample peaks: on staging, run docker stats --no-stream and in-container ps, logging RSS for ten minutes around the first Gateway Ready (see the sketch after this list).
2. Write mem_limit: multiply the peak by your agreed safety factor; document the host swap policy to avoid silent OOM.
3. Define readiness: HTTP or CMD probes should hit the same loopback address the Gateway binds, not the public edge path.
4. Set start_period: cover the first dependency install and WASM compile; size to cold-start p95, not the mean.
5. Tighten json-file: declare a logging block per service with max-size and max-file; alert on log directory growth rate.
6. Drill restarts: inject one failed probe and confirm that intervals, backoff, and paging match the on-call runbook.
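A sampling sketch for steps 1 and 2, assuming the service is named openclaw and docker stats is available on the host:

```sh
# Step 1 sketch: log container RSS every 10 s for ten minutes around the first "Gateway Ready".
for i in $(seq 1 60); do
  docker stats --no-stream --format '{{.Name}} {{.MemUsage}}' openclaw
  sleep 10
done | tee rss-samples.log

# Step 2 sketch: take the observed peak and apply the agreed safety factor.
# Example with placeholder numbers: a 1.3 GiB peak at a 1.5x factor rounds to the 2g limit below.
```

The prod override that falls out of those steps looks like this baseline: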
```yaml
services:
  openclaw:
    mem_limit: "2g"        # observed cold-start peak x safety factor (step 2)
    logging:
      driver: json-file
      options:
        max-size: "20m"    # rotate each file at 20 MiB
        max-file: "5"      # keep at most 5 rotated files (~100 MiB hot retention)
    healthcheck:
      test: ["CMD", "curl", "-f", "http://127.0.0.1:18789/health"]
      interval: 30s
      timeout: 5s
      retries: 5
      start_period: 180s   # sized to cold-start p95, not the mean (step 4)
    restart: on-failure:5  # bounded restarts; host-level alerts catch the storm
```
Note: health paths must match your image; if only TCP is available, use CMD-SHELL with nc, but document the higher false-positive surface versus HTTP.
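A TCP-only fallback sketch (service-level fragment), assuming nc ships in the image:

```yaml
healthcheck:
  # TCP reachability only; the port can listen before the Gateway is actually ready.
  test: ["CMD-SHELL", "nc -z 127.0.0.1 18789 || exit 1"]
```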
When docker events shows more restarts per five minutes than your runbook allows, suspect misconfiguration and false unhealthy states before “just add RAM.” Blind restarts also amplify log writes.
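A counting sketch, assuming the container is named openclaw:

```sh
# Count restart events in the last five minutes; compare against the runbook threshold.
docker events --since 5m --until "$(date +%s)" \
  --filter 'container=openclaw' --filter 'event=restart' | wc -l
```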
If the root disk is already read-only, drain traffic or stop the reverse proxy first, then follow the space-freeing order in the Exit 137 article; avoid forcing docker compose up -d over the volumes before disk health is confirmed.
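A pre-flight disk sketch, assuming the default Docker root under /var/lib/docker (run as root):

```sh
# Confirm free space before restarting anything.
df -h / /var/lib/docker

# Find the fattest per-container json-file log directories.
du -sh /var/lib/docker/containers/*/ 2>/dev/null | sort -rh | head -5
```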
Warning: lowering max-file rotates hot logs away faster; if compliance needs longer retention, ship cold logs to object storage or a logging host instead of raising single-file caps without bounds.
1. Freeze releases: disable the deploy pipeline entry point when restart storms exceed the threshold.
2. Capture state: export Health and OOM snippets from docker inspect plus the last two hundred log lines (see the sketch after this list).
3. Rollback compose: restore the previous override and keep a repro bundle for the postmortem.
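A minimal capture sketch for step 2; the container name is a placeholder:

```sh
# Health history and OOM/exit state for the incident ticket.
docker inspect --format '{{json .State.Health}}' openclaw > health.json
docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' openclaw

# Last two hundred log lines, stderr included, for the repro bundle.
docker logs --tail 200 openclaw > last200.log 2>&1
```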
The numbers below are starters for charter and on-call docs; replace them with your real cold-start histogram and disk growth curves, and do not advertise them as a customer-facing SLA without measurement.
When the same VPS also hosts a reverse proxy or a small database, mem_limit and log IO contend with those neighbors; reviewers should require container RSS, host free memory, and disk write latency on one dashboard.
| Host shape | Logging starter | Relation to image pins |
|---|---|---|
| 2 vCPU / 4 GB | Smaller max-size, shorter retention, more aggressive shipping | Digest pins required to avoid surprise cold-start growth |
| 4 vCPU / 8 GB | Can relax max-file slightly; never disable rotation | Staging and prod use different tags validated against the same digest |
| Mixed workloads | Separate data disk or log volume away from database partitions | Upgrade windows align with the pinning article |
Parking OpenClaw on a throwaway small VPS under-provisions memory, disk, and secret rotation at once; bare-metal DIY trades those limits for more power but adds line and SLA risk.
Teams that need contracted compute, selectable regions, and auditable bandwidth, while keeping the Gateway and channels observable in 24/7 mode, often do better on rented cloud Mac Minis; VpsMesh Mac Mini cloud rental is usually the stronger fit because compose baselines, reverse proxies, and backup stories can be evaluated in one capacity narrative.
- Probe keeps failing after deploy: check the probe user and PATH, and whether start_period covers cold start; if it still fails, compare loopback behavior with the allowedOrigins and reverse-proxy article (see the sketch below).
- Exit 137 overnight: start with host dmesg and the container exit code, then re-sample peaks before changing limits; follow the Exit 137 triage article. Compare plans on the pricing page.
- Worried rotation drops logs: it shortens hot local retention; ship archives to centralized storage when policy requires longer windows; ops policy is in the help center.
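To rule out probe-environment issues by hand, a sketch assuming the default service name and curl present in the image:

```sh
# Run the exact probe command inside the container, as the container's default user.
docker exec openclaw curl -f http://127.0.0.1:18789/health

# If the manual probe succeeds while the healthcheck fails, suspect PATH, user, or start_period.
docker inspect --format '{{json .Config.Healthcheck}}' openclaw
```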