Shared memory · memory peaks · capability allowlists · symptom tree and six-step runbook
OpenClaw on Docker often passes the first smoke test: channels answer, openclaw doctor looks fine. That only proves the control plane and model path are roughly healthy. As soon as you enable page-driving skills, screenshots, or headless Chromium, the workload shifts to a browser rendering stack with bursty memory and heavy use of shared memory. On a VPS, the failure mode is rarely a polite log line: you see intermittent blank pages, random tab crashes, or sudden OOM exits that are misread as slow models or anti-bot pages. This playbook gives you a symptom-to-parameter matrix for shm_size and mem_limit, then a six-step runbook that keeps changes bisectable. Pair it with the Exit 137 VPS primer and the Compose production baseline so networking, WASM warm-up, and browser peaks are not debugged as one tangled incident.
Many teams validate OpenClaw on a VPS by proving that messages flow and doctor is green. That is necessary but not sufficient for browser-class skills. Headless Chromium creates large anonymous mappings and shared-memory-backed buffers; when those collide with Docker's default 64MB /dev/shm or an aggressive cgroup memory cap, the symptom is often a blank UI, tab crashes, or screenshot timeouts rather than an immediate Exit 137. The incident is then misrouted to model latency, site anti-bot rules, or channel retries. Operations engineers waste hours tuning model timeouts when the real constraint is shared memory and the peak RSS of the renderer stack.
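The 64MB default is cheap to verify before any tuning: run df -h /dev/shm inside the container and compare the size column against the band this article recommends. The helper below is a minimal sketch for turning that check into a guard; the function name and the messages are our assumptions, and in practice you would feed it the tail of docker exec &lt;container&gt; df -h /dev/shm.

```shell
#!/bin/sh
# Classify the /dev/shm size column of a "df -h /dev/shm" line
# (e.g. "64M" or "1.0G") against the 512MB-1GB band suggested for
# browser workloads. In production you would pipe in:
#   docker exec openclaw df -h /dev/shm | tail -n 1
check_shm() {
  size=$(echo "$1" | awk '{print $2}')
  case "$size" in
    *G|*g)     echo "shm ${size}: within the recommended band" ;;
    512M|512m) echo "shm ${size}: at the 512MB floor; watch long captures" ;;
    *)         echo "shm ${size}: undersized; raise shm_size toward 512m-1g" ;;
  esac
}

check_shm "shm  64M  0  64M  0% /dev/shm"   # Docker's tmpfs default
```

Running it against the default line flags /dev/shm as undersized; wire the same check into your soak-test script so regressions surface before users see white screens.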
Five recurring taxes show up when teams stop at the smoke test:

- Single-page success mistaken for load testing: loading a marketing homepage is not the same stress as multi-step login, long scrolling captures, or concurrent tabs; production traffic will spike memory and shm pressure without warning.
- Ignoring the coupling between /dev/shm and host memory: Chromium prefers large shared-memory segments; RSS in docker stats can look modest while dmesg already shows cgroup throttling or oom-kill events.
- Copy-pasting wide capabilities: adding SYS_ADMIN to bypass sandbox friction expands the blast radius from browser bugs to host compromise; reviewers need a written threat trade-off.
- Mixing this with reverse-proxy and allowedOrigins incidents: non-loopback control UI errors and WebSocket drops belong to the Compose networking runbook; do not triangulate unrelated failure trees in one change window.
- Stacking heavy browser jobs on the same instance as chatty channels: overnight batch automations can break a profile that looked stable during daytime pings unless you plan peaks and isolation profiles.
Encode the five taxes as explicit forbidden patterns and mandatory soak tests. Print them on the first page of your change request so nobody silently widens privileges to make a demo pass. The next section indexes symptoms to parameters so on-call engineers can stop bleeding without rereading every upstream doc.
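One low-effort way to encode forbidden patterns is a pre-merge lint over the compose file. The sketch below is an assumption-laden starting point, not an official OpenClaw tool: the pattern list covers only the privilege and pinning taxes above, and the function name is ours.

```shell
#!/bin/sh
# Fail a change request if the compose file contains patterns that
# should never land without a written threat trade-off:
# SYS_ADMIN, privileged mode, or an unpinned :latest image.
lint_compose() {
  file="$1"
  bad=0
  for pattern in 'SYS_ADMIN' 'privileged:[[:space:]]*true' 'image:.*:latest'; do
    if grep -Eq "$pattern" "$file"; then
      echo "FORBIDDEN: pattern '$pattern' found in $file"
      bad=1
    fi
  done
  return $bad
}
```

Run it in CI as lint_compose docker-compose.yml; a nonzero exit blocks the merge until someone documents why the pattern is needed.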
The table below is indexed by what you observe first, not by parameter names, because incidents arrive as user-visible pain. After each change, capture docker stats peaks and a short Gateway log slice; change only one knob per experiment so rollbacks stay honest.
| Symptom you see | Check first | Typical root cause and move |
|---|---|---|
| Intermittent white screens, "Aw, Snap!" errors, tab crashes | shm_size, /dev/shm utilization | The default 64MB is often too small; try 512m, then 1g, and cap concurrent pages. |
| Process disappears, Exit 137 | mem_limit, host swap, oom_kill counters | Browser peak plus Node resident set exceeded cgroup; raise limits in steps or split instances; see Exit 137 primer. |
| Immediate permission or device errors | cap_add, devices, seccomp profile | Diff against official compose snippets; add the minimum surface, not a bag of privileged caps. |
| CPU pegged but navigation stalls | Software rendering flags, infinite navigation retries in skills | Bound retries and timeouts; verify the skill is not hot-reloading in a loop. |
| Only certain sites fail | TLS fingerprinting, HTTP/2, regional egress | If signals point to network rather than cgroup, pivot to egress tests instead of stacking shm tweaks. |
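For the Exit 137 row, the exit code itself narrows the cause before you touch any parameter: codes above 128 are 128 plus the fatal signal number, so 137 means SIGKILL, which is what the kernel OOM killer sends. The classifier below is a minimal sketch; the function name and message wording are ours.

```shell
#!/bin/sh
# Map a container exit code to a likely cause. Codes above 128 are
# 128 + the fatal signal: 137 = SIGKILL (OOM killer or docker kill),
# 139 = SIGSEGV (renderer crash), 143 = SIGTERM (orderly stop).
explain_exit() {
  case "$1" in
    137) echo "SIGKILL (9): suspect cgroup OOM kill; check dmesg and mem_limit" ;;
    139) echo "SIGSEGV (11): renderer crash; suspect /dev/shm pressure or a browser bug" ;;
    143) echo "SIGTERM (15): orderly stop or restart policy, not a resource kill" ;;
    0)   echo "clean exit" ;;
    *)   echo "exit $1: check application logs" ;;
  esac
}
```

Feed it the code from docker inspect --format '{{.State.ExitCode}}' &lt;container&gt; and paste the answer into the ticket alongside the docker stats peaks.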
Stability for browser-class skills is mostly three auditable facts: peak memory, shared memory, and a capability allowlist; everything else is secondary tuning.
Community writeups and official Docker guidance in 2026 still recommend an explicit shm_size for stacks that embed browser automation (commonly in the 512MB to 1GB band), paired with a clear memory ceiling. You do not need to memorize vendor magic numbers, but you do need the phrase "defaults are not enough" in your team vocabulary, plus a separate capacity line item for overnight batch windows when skills scrape dashboards or capture evidence packs.
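One way to make that overnight capacity line item concrete is a Compose override applied only for the batch window. The fragment below is a sketch under assumptions: the file name docker-compose.batch.yml and the ceilings are ours; validate them with your own soak runs.

```yaml
# docker-compose.batch.yml (hypothetical override for the overnight window)
# Apply with:
#   docker compose -f docker-compose.yml -f docker-compose.batch.yml up -d
services:
  openclaw:
    shm_size: "1g"      # top of the community band, for long captures
    mem_limit: "6g"     # assumed headroom above the daytime ceiling
```

Because the override is a separate file, the daytime profile stays untouched and the batch-window change remains bisectable.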
The sequence below matches the Compose production baseline: observe, change one variable, soak test, archive. Paste outputs into the ticket instead of narrating changes in chat.
1. Pin the image reference: note the digest or immutable tag before touching browser parameters; avoid drifting production on :latest while debugging peaks.
2. Capture a baseline: run the same skill three times; record docker stats peaks, df -h /dev/shm inside the container, and Gateway log windows.
3. Change shm only: raise shm_size to 512m or 1g, keep everything else fixed, and rerun the same skill three times.
4. Then adjust mem_limit: if Exit 137 or oom_kill persists, raise mem_limit in roughly 25 percent steps and verify whether swap is disabled on the host.
5. Minimize capabilities: if official snippets require specific cap_add or device nodes, document the exact error you fix; avoid SYS_ADMIN unless the threat model is explicit.
6. Archive rollback points: commit the passing compose fragment and digest; keep rollback as a copy-paste compose down && compose up -d block.
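The "roughly 25 percent steps" of the mem_limit adjustment are easy to make reproducible instead of eyeballed. A minimal sketch in POSIX shell; the function name is ours, and it works in MiB so the arithmetic stays integral.

```shell
#!/bin/sh
# Compute the next mem_limit step (~25% up). Accepts values like
# "4g" or "2048m" and prints the raised limit in MiB, ready to
# paste back into the compose fragment.
next_mem_limit() {
  v="$1"
  case "$v" in
    *g) mib=$(( ${v%g} * 1024 )) ;;
    *m) mib=${v%m} ;;
    *)  echo "unsupported unit: $v" >&2; return 1 ;;
  esac
  echo "$(( mib * 5 / 4 ))m"
}
```

For example, next_mem_limit 4g prints 5120m; record each step in the ticket so the rollback point is always the previous printed value.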
```yaml
services:
  openclaw:
    image: ghcr.io/openclaw/openclaw:<pin-a-digest-not-latest>
    shm_size: "1g"
    mem_limit: "4g"
    # keep the control plane on 127.0.0.1; terminate TLS at the reverse proxy
    # align json-file rotation and healthcheck start_period with the baseline article
```
Tip: if you need a second heavier browser stack on the same host, read multi-instance isolation for ports and volumes before cloning this runbook.
This section lists only facts you can point to in config or monitoring, not vibes like "the browser feels unstable". Treat the numbers as starting bands and validate them with your own skills.
- Treat shm_size as co-equal with mem_limit: start at 512MB, validate long captures, then consider a 1GB tier (a common band in 2026 community documentation).
- If healthcheck start_period is too short, Compose restarts the container during warm-up, which looks like random flakiness; align those fields with the baseline article.
- Warning: do not rotate reverse-proxy certificates, model keys, and browser resource caps in the same change window; triple moves make rollback non-bisectable. TLS paths live in the reverse-proxy guide.
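As a sketch of the start_period point, a hedged healthcheck fragment: the endpoint, port, command, and timings here are assumptions to be replaced with the values from the baseline article.

```yaml
services:
  openclaw:
    healthcheck:
      # endpoint and port are placeholders, not the official health path
      test: ["CMD", "curl", "-fsS", "http://127.0.0.1:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 90s   # long enough to cover WASM and browser warm-up
```

The key field is start_period: failures inside that window do not count against retries, so a slow warm-up no longer reads as random restart flakiness.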
Once daytime traffic is stable, ask the organizational question: may this same instance run heavy browser batches overnight? Answering after an outage is expensive.
| Pattern | When it fits | Main risk |
|---|---|---|
| Single mixed instance | Personal pilots and light skills without long captures | Peak stacking is invisible; one OOM takes down channels and tools together. |
| Dedicated browser profile | Two compose stacks on one machine with split volumes | Requires strict isolation checklists; see multi-instance article. |
| Dedicated 24/7 node | Team production needing predictable SLA | Higher cost, but you get sign-off capacity and auditable change history. |
Ad-hoc VPS tuning is flexible early, yet production OpenClaw needs three artifacts that informal stacks often lack: reserved capacity, pinned images, and ticketed changes. When skills must coexist with iOS builds, desktop handoffs, and always-on agents, moving browser peaks to a predictable 24/7 footprint beats endless parameter whack-a-mole. For teams that need dedicated, region-stable Mac capacity with operational clarity, VpsMesh Mac Mini cloud rental is usually the better fit: easier headroom for browser bursts and disk, aligned with the Mac Mesh collaboration narrative. See pricing and help center.
Chromium-style renderers lean on shared memory for large buffers. When /dev/shm is tiny you can see intermittent white screens or tab crashes while CPU still looks fine. Raise shm_size to 512m or 1g first, then cross-check memory lines in the Exit 137 primer.
Not always. Start from official compose snippets with least privilege. If you must add privileged caps, document the threat trade-off. Channel hardening references live in the production hardening checklist.
Shared Gateways amplify resource contention when browser peaks spike. Add shm and memory rows to the team resource table and review jointly with the multi API key compartments runbook.