linger · XDG_RUNTIME_DIR · daemon install and verify · layered triage · API region vs egress checks
Platform engineers, SREs, and self-hosted agent operators hit the same failure mode in 2026: the service looks fine during an SSH session, then user-level systemd stops after logout, XDG_RUNTIME_DIR is missing under non-interactive paths, gateway, channel, and model issues get read as one log blob, and console-selected API regions disagree with the VPS egress path. This article covers five pre-production taxes, a three-way comparison of bare-metal systemd versus systemd-in-container versus Docker-only, a six-step reproducible runbook with commands, a checklist plus three citeable technical facts, and a decision matrix. Pair it with the install and doctor checklist and the Docker Compose production baseline. Order flows live on the order page.
Running OpenClaw on Linux moves long-lived processes, socket directories, logs, and restart semantics from personal habit into auditable units. The five items below usually arrive together and all point to one gate: put linger and XDG_RUNTIME_DIR on the acceptance sheet before debating Docker.
- Session binding: Without linger, ending an interactive SSH session can stop the user systemd manager, so units go quiet overnight while tickets only say “it worked yesterday” (see the reproduction sketch after this list).
- Missing runtime dir: Cron, minimal shells, or wrong service types can leave XDG_RUNTIME_DIR empty, so sockets and state paths fail with errors split between the app and systemd.
- Skipped layering: Gateway not listening, channel credentials, model routing, and upstream HTTP 429 are merged into one story called “OpenClaw is broken” without per-layer samples.
- Region vs egress drift: The console or env vars point to region A while the VPS path presents region B hints in headers, which looks like flaky auth instead of a stable 403.
- Mixed boundaries: Docker stacks plus user units on one host disagree on restart order and health semantics, so rollbacks are unclear about which layer to stop first.
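Both the session-binding and runtime-dir taxes can be reproduced before any unit exists. A minimal sketch, assuming a deploy user with UID 1000 (substitute your own): watch the per-user manager from a root shell across an SSH logout, and print XDG_RUNTIME_DIR from the execution path a unit actually sees instead of from an interactive profile.

```bash
# Tax 1 (session binding): from a root shell, check whether the per-user
# manager survives the SSH logout; UID 1000 is an assumed example.
systemctl status user@1000.service --no-pager

# Tax 2 (missing runtime dir): print XDG_RUNTIME_DIR from the same execution
# path a user unit sees; empty output means sockets will land in bad paths.
systemd-run --user --pipe --quiet /bin/sh -c 'echo "${XDG_RUNTIME_DIR}"'
```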
If you are comparing host user units with container PID 1, treat the next table as a review slide, not a slogan.
Decide who owns restart semantics, log rotation, linger semantics, and the boundary between sockets and host ports. There is no universal winner, only an ops boundary that matches your skills.
| Model | Typical fit | Main benefit | Main cost |
|---|---|---|---|
| Bare-metal systemd (user) | Single VPS, tight integration with host firewall and loopback | Aligns with distro tooling, units line up with journal | Must handle linger and login session edges |
| systemd-in-container | Multi-process supervision inside an image | Feels like a classic Linux service host | Image and privilege edges are sharper, debug spans host and container |
| Docker-only | Compose or an orchestrator already owns health and restart | Versioned artifacts and rollback paths are obvious | Host user linger semantics are not automatic |
Reproducible acceptance is not “it runs on my laptop”; it is “the unit survives SSH logout, journal reasons are legible, and region hints are captured with the same commands twice.”
Order the work as follows: keep the user manager alive unattended, confirm the runtime directory, install units, triage in layers, then capture egress snapshots. Each step should ship saved command output. Gateway baselines belong in the install and doctor checklist.
1. Pick the service user: Fix the account and primary group; avoid mixing root with a deploy user. Deliverable: id plus a short loginctl user-status snippet.
2. Enable linger: Turn on linger for the deploy user so user@ can run without an active login. Deliverable: show-user prints Linger=yes.
3. Validate XDG_RUNTIME_DIR: Print the variable from the same profile path your unit uses; expect a value shaped like /run/user/<uid>.
4. Install and enable: Place the unit in the user scope, run daemon-reload and enable --now, then confirm Active state and main PID with status (a minimal unit sketch follows the command block).
5. Sample by layer: Check gateway listen and config parse first, then channel tokens and webhook reachability, then upstream model quotas and region headers. Keep the last two hundred journal lines per layer.
6. Egress consistency: Resolve the same hostname and capture TLS-visible metadata before and after changes. Do not promote a single RTT sample to a performance claim.
```bash
# Step 1: record the service account identity.
id "${USER}"
loginctl user-status "${USER}" --no-pager | head -n 5
# Step 2: verify linger, then enable it for the deploy user.
loginctl show-user "${USER}" -p Linger
sudo loginctl enable-linger "${USER}"
# Step 3: confirm the runtime directory from both the manager and the shell.
systemctl --user show-environment | grep XDG_RUNTIME_DIR || true
echo "${XDG_RUNTIME_DIR}"
# Steps 4-5: reload, enable, check state, then keep the last 200 journal lines.
systemctl --user daemon-reload
systemctl --user enable --now openclaw-gateway.service
systemctl --user status openclaw-gateway.service --no-pager
journalctl --user -u openclaw-gateway.service -n 200 --no-pager
```
Note: Replace openclaw-gateway.service with your real unit name. If your image ships a different gateway binary, trust the ExecStart line in the unit file over memory or docs.
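For step four, a minimal unit sketch is below; the unit name matches the commands above, while the description and ExecStart path are placeholders for your real gateway binary.

```bash
# Install a minimal user unit; the ExecStart path is a placeholder and must
# match the gateway binary your image actually ships.
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/openclaw-gateway.service <<'EOF'
[Unit]
Description=OpenClaw gateway (user scope)

[Service]
ExecStart=/usr/local/bin/openclaw-gateway
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
EOF
systemctl --user daemon-reload
systemctl --user enable --now openclaw-gateway.service
```

WantedBy=default.target ties the unit to the user manager rather than to a login session, which is exactly the lifetime that linger preserves.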
Map every item to an owner and a review cadence. Region checks collect repeatable TLS and response metadata only, not invented throughput rankings.
- Linger gate: Change records must attach show-user Linger=yes output as text or a screenshot.
- Unit boundary: State which ports user units bind versus which ports containers publish, then sync firewall docs.
- Log retention: Document journal persistence or remote forwarding so debug logs cannot fill the disk and mimic a crash.
- Layered runbook: For gateway, channel, and model layers, keep at least three “pass to advance” checks as commands or URLs (a sketch follows this list).
- Region snapshots: Store resolver output and header samples before and after each release window for rollback contrast.
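For the layered runbook item, a minimal set of pass-to-advance checks, assuming the gateway listens on port 8080 and using a placeholder webhook URL (substitute your real values):

```bash
# Gate 1 (gateway): refuse to advance until the listener is up.
ss -ltnp | grep ':8080' || echo "FAIL: gateway not listening"

# Gate 2 (channel): webhook reachability; the URL is a placeholder.
curl -fsS -o /dev/null -w 'webhook HTTP %{http_code}\n' \
  https://example.com/webhook/health

# Gate 3 (model): scan the kept journal sample for quota or region hints.
journalctl --user -u openclaw-gateway.service -n 200 --no-pager \
  | grep -Ei '429|quota|region' || echo "no upstream hints in sample"
```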
- Fact: loginctl enable-linger affects whether the per-user systemd manager stays alive; it is not automatically equivalent to choosing Docker.
- Fact: XDG_RUNTIME_DIR normally resolves to /run/user/<uid>. When it is missing off-login, sockets fall back to non-writable or unstable paths.
- Warning: One successful curl is not a durable region proof after a CDN change. A fixed hostname with repeatable commands beats a single lucky sample.
If linger, unit names, port matrix, and region snapshots are not versioned, Linux residency is only half done. The other half is sharing the same responsibility language as gateway triage. Use the matrix as a review slide.
| Team posture | Default pick | Acceptance signal | Common trap |
|---|---|---|---|
| Solo maintainer, fast iteration | Docker Compose baseline | Health checks and restart policy are reviewable in compose | Ignoring mem_limit and log rotation causes false hangs |
| Multi-tenant host | Container boundary plus isolated project names | Each stack has its own data directory | Mixing with user units creates restart races |
| Host-tight coupling | User systemd plus linger | Journal stays continuous after SSH ends | Skipping XDG_RUNTIME_DIR on non-interactive paths |
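The solo-maintainer trap in the first row can be closed inside Compose itself. A minimal sketch, with the image tag, port, and limits as placeholder values, and assuming curl exists in the image:

```bash
# Declare restart, memory, log rotation, and health in one reviewable file.
cat > docker-compose.yml <<'EOF'
services:
  gateway:
    image: example/openclaw-gateway:pinned-tag   # placeholder image and tag
    restart: unless-stopped
    mem_limit: 512m            # declared, so a hang is a real hang
    logging:
      driver: json-file
      options:
        max-size: "10m"        # rotate before debug logs fill the disk
        max-file: "3"
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
EOF
```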
Interactive bash, linger-free nohup, and hand-rolled watchdog loops usually extract their price during change review and audits. Upstream region policy shifts are also harder to explain without egress snapshots. By contrast, dedicated cloud Mac capacity with selectable regions and predictable network tiers makes stable egress and golden images easier to own alongside iOS builds or desktop handoff.
Common trap: Assuming Docker removes all systemd semantics. If a user unit still fronts the gateway outside Compose, linger and the runtime directory remain hard gates.
Personal scripts and unversioned environment exports rarely survive handoff, compliance, or rollback with an external SLA. When OpenClaw must ship with upstream region policy, TLS fingerprints, and a fixed egress narrative, bash-only paths usually lack auditable change tickets. For teams that need iOS handoff, CI regression, and automation agents in one acceptance story, and want ordering and region tiers instead of self-managed egress games, VpsMesh Mac Mini cloud rental is usually the better fit: dedicated nodes simplify ACLs and hostnames, collaboration stays close to high-churn loops, and ops language can align with the team private network build-node runbook. See pricing for region mixes, and treat connection boundaries per the help center.
After interactive sessions end, systemd --user can stop and take user-level OpenClaw units with it. Verify linger before production and align connection plus residency guidance in the help center so overnight exits are not misread as upstream outages.
Pin hostname and tool versions, store resolver output, and capture TLS-visible metadata for the same endpoint. Compare console region settings with environment variables. For Compose-level health semantics, read the Docker Compose production baseline article.
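A minimal snapshot sketch, assuming a placeholder endpoint api.example.com and that dig is installed (getent covers hosts without it); run it before and after the change window and diff the saved output:

```bash
HOST=api.example.com   # placeholder: pin your real upstream hostname

# Resolver snapshot: two views of the same name, saved with the release record.
getent hosts "${HOST}"
dig +short "${HOST}"

# TLS-visible metadata: certificate subject/issuer plus region-ish headers.
curl -sv "https://${HOST}/" -o /dev/null 2>&1 \
  | grep -Ei 'subject:|issuer:|< (server|via|x-)'
```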
When restart, log rotation, limits, and health checks are fully declared in Compose or an orchestrator, and you do not rely on host user sockets plus linger semantics, Docker-only is often simpler. Finish the layered triage from the runbook above before mixing with user units.