Disk waterline · DerivedData / CocoaPods / Gradle · Three artifact layers · Six-step runbook · Hard thresholds
Ops engineers, mobile platform owners, and tech leads who must sign disk SLOs for shared Mac build pools often get the same Friday-night alert: runners are online but jobs fail with "No space left," DerivedData fills the system volume, CocoaPods and Gradle global caches have no owner, and artifact directories stay local after rsync succeeds. This article describes who hits which problem when Mac Mesh multi-tenant rotation lacks observable waterlines and tiered reclaim contracts, then states the remedy: split disk usage into L1 DerivedData, L2 dependency cache, and L3 CI artifacts, and follow a six-step runbook so cleanup becomes an auditable routine instead of firefighting. You get five hidden taxes, a cleanup-strategy table, waterline probe fields, six implementation steps, three hard thresholds, and an FAQ. Related reading: seat locks and mutex, golden image checklist, rsync and object storage, three-pool SLO matrix, and Git worktree isolation.
In 2026 Mac Mesh tickets, disk issues are rarely "we needed 100GB more." More often there is no shared contract across tenant rotation, cache locality, and artifact lifecycles, so APFS looks fine while Xcode fails writing temp files.
Unbounded DerivedData sharing: multiple repos share ~/Library/Developer/Xcode/DerivedData; indexes and module caches interleave by branch, and one clean deletes a neighbor's ModuleCache, showing up as random link failures—not disk full.
CocoaPods / Gradle global cache with no TTL: ~/Library/Caches/CocoaPods and ~/.gradle/caches only grow; old tarballs stay after Pod upgrades, and worktree multi-branch parallelism amplifies contention.
Artifacts "uploaded but still local": object storage succeeded but $CI_ARTIFACTS_DIR has no retention policy, and the rsync completion hook is not bound—IPA/dSYM slowly eat the disk.
APFS snapshots vs "available" space: local snapshots make df look healthy while real writable space runs out at compile peaks, and there are no per-volume, per-layer waterline_used_pct metrics to reveal the gap.
Cleanup vs seat-lock races: sweeping directories before lease release, or conflicting with seat lock TTL, causes "disk cleared but build red" secondary incidents.
Deliverables: a three-layer directory dictionary, warn/hard dual waterlines, LRU on lease end, and golden-image drift checks run weekly as a separate task. If any one of these is missing, do not promise "any monorepo can run in parallel" on a shared pool. The next section compares three cleanup philosophies so you avoid "everyone SSHs in Friday night and rm -rf."
Disk governance is not "clean harder." Balance build hit rate, auditable cleanup, and tenant isolation. Pin this table in change review: each layer (L1/L2/L3) gets one default strategy only.
| Strategy | L1 DerivedData | L2 Pods/Gradle | L3 Artifacts | Best for | Main risk |
|---|---|---|---|---|---|
| Manual cron | Weekend rm of global dir | Occasional pod cache prune | find by age | Tiny teams, low parallelism | Neighbor deletes, no audit trail |
| Waterline daemon | LRU per workspace hash | Evict on capacity | 48h after rsync success | Shared pool default | Needs metrics and lock contract |
| Image reset | Snapshot rollback clears | Refreshed with image | Volume replace | Drift out of control, compliance snapshots | Cold-start compile slowdown |
Bottom line: shared pools should default to the "waterline daemon" strategy. Keep image reset as a quarterly fallback, run together with the golden image drift checklist; it is not a substitute for daily LRU.
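To make the "LRU per workspace hash" and "evict on capacity" cells concrete, here is a minimal sketch of how a waterline daemon might pick eviction victims. The `Bucket` type, its `leased` flag, and `plan_lru_eviction` are hypothetical names for illustration, not part of any Mac Mesh tooling; the function only plans evictions and deletes nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    """One cache bucket, e.g. a DerivedData directory keyed by workspace hash."""
    name: str
    size_bytes: int
    last_used_ts: int      # unix seconds of the last job that touched the bucket
    leased: bool = False   # assumption: True while a seat lease still uses it


def plan_lru_eviction(buckets, cap_bytes):
    """Return bucket names to evict, least recently used first,
    until the remaining total fits under cap_bytes.
    Buckets that are still leased are never selected."""
    total = sum(b.size_bytes for b in buckets)
    victims = []
    for bucket in sorted(buckets, key=lambda b: b.last_used_ts):
        if total <= cap_bytes:
            break
        if bucket.leased:
            continue
        victims.append(bucket.name)
        total -= bucket.size_bytes
    return victims
```

Separating the plan from the delete step means the victim list can be written to the cleanup audit log before anything is removed.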
When Dedicated pools and Shared rotation coexist, L1 cache keys must carry a pool-type tag or shared-pool sweeps will evict dedicated-node locality.
L1: /var/mesh/cache/deriveddata/{workspace_hash}, bound via Xcode DERIVED_DATA_DIR. L2: /var/mesh/cache/cocoapods, /var/mesh/cache/gradle—do not write back to user-home global caches. L3: /var/mesh/artifacts/{job_id}—after upload, keep only checksum sidecar files. Monitoring can report layer_*_bytes per tier instead of a vague "/ partition 85%."
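As an illustration of the directory dictionary, the sketch below derives the three layer paths for one job. The hashing scheme (SHA-256 over pool type, repo URL, and workspace, truncated to 16 hex characters) is an assumption; the article only requires that the L1 key is per workspace and carries a pool-type tag.

```python
import hashlib
from pathlib import PurePosixPath

MESH_ROOT = PurePosixPath("/var/mesh")  # root taken from the paths above


def workspace_hash(repo_url: str, workspace: str, pool_type: str) -> str:
    """Hypothetical cache key: pool-type tag + repo + workspace, hashed."""
    key = "\0".join((pool_type, repo_url, workspace))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def layer_paths(repo_url: str, workspace: str, pool_type: str, job_id: str) -> dict:
    """Return the L1/L2/L3 directories one job should read and write."""
    bucket = workspace_hash(repo_url, workspace, pool_type)
    return {
        "l1_deriveddata": str(MESH_ROOT / "cache" / "deriveddata" / bucket),
        "l2_cocoapods": str(MESH_ROOT / "cache" / "cocoapods"),
        "l2_gradle": str(MESH_ROOT / "cache" / "gradle"),
        "l3_artifacts": str(MESH_ROOT / "artifacts" / job_id),
    }
```

Because the pool type is part of the hashed key, a shared-pool job and a dedicated-pool job for the same workspace land in different L1 buckets, so a shared-pool sweep cannot evict dedicated-node locality.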
These six steps assume runners carry Mac Mesh labels and seats are acquired before the job and released after it. Keep the order: waterlines without metrics are blind deletes.
Freeze the three-layer dictionary and paths: write L1/L2/L3 roots and warn (82%) / hard (92%) thresholds into repo mesh-disk-policy.yaml, and register default mount points in the image checklist.
Deploy disk-waterline probe: every 60s collect volume usage and per-layer bytes and export them to Prometheus/OpenTelemetry; at the hard threshold, runners enter drain and fail fast on new jobs.
Isolate DerivedData: CI injects DERIVED_DATA_DIR pointing at the workspace-hash bucket (with xcodebuild this is typically done through the -derivedDataPath flag); lease end triggers LRU on that bucket. Never sweep global DerivedData.
L2 dependency cache eviction: wrap pod cache clean so it is triggered by capacity, not by time; point GRADLE_USER_HOME at the mesh directories and cap max-cache-size.
Artifacts and rsync hooks: object-storage multipart-complete callback deletes local L3; failed retries keep 7 days—fields aligned with the artifact runbook.
Weekly check and drill: compare golden-image checksums, simulate job rejection at the hard waterline (92%), and log the cleanup audit; when coordinating Burst overflow, clear L3 before accepting interruptible jobs.
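The artifact step can be sketched as two small helpers: a retention rule and a prune that keeps only checksum sidecars. The 48-hour and 7-day windows come from this article; the `.sha256` sidecar suffix and both function names are assumptions for illustration. Set `keep_after_success_s=0` if the upload callback should delete immediately.

```python
import os

HOUR = 3600
DAY = 24 * HOUR


def l3_should_delete(upload_ok: bool, finished_ts: float, now: float,
                     keep_after_success_s: int = 48 * HOUR,
                     keep_after_failure_s: int = 7 * DAY) -> bool:
    """Uploaded artifacts expire after the success window;
    failed uploads are kept longer so retries can still find them."""
    keep = keep_after_success_s if upload_ok else keep_after_failure_s
    return now - finished_ts >= keep


def prune_job_dir(job_dir: str, sidecar_suffix: str = ".sha256") -> int:
    """Delete payload files under job_dir but keep checksum sidecars.
    Returns the number of bytes freed."""
    freed = 0
    for dirpath, _dirnames, filenames in os.walk(job_dir):
        for name in filenames:
            if name.endswith(sidecar_suffix):
                continue  # the sidecar is the audit trail for the upload
            path = os.path.join(dirpath, name)
            freed += os.lstat(path).st_size
            os.remove(path)
    return freed
```

The return value of `prune_job_dir` can be added to `cleanup_evicted_bytes_1h` so artifact cleanup shows up in the same audit trail as cache eviction.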
Probe fields: hostname, pool_type, volume_mount, waterline_used_pct, waterline_warn_threshold, waterline_hard_threshold, layer_l1_deriveddata_bytes, layer_l2_cocoapods_bytes, layer_l2_gradle_bytes, layer_l3_artifacts_bytes, seat_lease_id, last_cleanup_ts_unix, cleanup_evicted_bytes_1h, disk_waterline_hard_stop.
Note: Probe output should be the first row on your Grafana board, not only OS alerts. Plot cleanup_evicted_bytes_1h with successful builds to tell real cleanup from "fewer builds so disk looks better."
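A probe sample with these field names could be assembled roughly as follows. This is a sketch under assumptions: it covers only the capacity fields, sums logical file sizes rather than allocated blocks, and reads the same filesystem counters that df does, so the APFS snapshot caveat described earlier still applies. Lease and cleanup fields would come from the scheduler and the cleanup daemon.

```python
import os
import shutil
import socket


def dir_bytes(root: str) -> int:
    """Sum file sizes under root; a missing directory counts as 0."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished while the scan was running
    return total


def used_pct(total_bytes: int, free_bytes: int) -> float:
    """Percentage of the volume in use, rounded to one decimal."""
    return round(100.0 * (total_bytes - free_bytes) / total_bytes, 1)


def probe_sample(volume_mount, layers, warn=82, hard=92, pool_type="shared"):
    """Build one sample; `layers` maps a layer_*_bytes field name to a directory."""
    usage = shutil.disk_usage(volume_mount)
    pct = used_pct(usage.total, usage.free)
    sample = {
        "hostname": socket.gethostname(),
        "pool_type": pool_type,
        "volume_mount": volume_mount,
        "waterline_used_pct": pct,
        "waterline_warn_threshold": warn,
        "waterline_hard_threshold": hard,
        "disk_waterline_hard_stop": int(pct >= hard),
    }
    for field, path in layers.items():
        sample[field] = dir_bytes(path)
    return sample
```

Walking large DerivedData trees every 60 seconds can be expensive, so a real probe may want to cache per-layer sizes or sample them less often than the volume percentage.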
Disk alerts often overlap queue SLO symptoms. Use the table to see whether the issue is capacity, cache keys, or artifact pile-up before choosing sweep scope.
| Symptom | layer_* dominant | Likely root cause | First action |
|---|---|---|---|
| Only Xcode step fails | L1 high | DerivedData cross-talk or index corruption | Clear bucket by workspace hash |
| Mixed Android/iOS pool slow | L2 high | Pods/Gradle never evicted | Tighten L2 capacity cap |
| Upload OK, disk still full | L3 high | rsync hook not bound | Add object-storage callback |
| df OK, writes fail | snapshots | APFS local snapshots | Reduce snapshot retention + probe |
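The first three rows of the table can be turned into a small triage helper that reads a probe sample and names the first action. This is an illustrative sketch with hypothetical function names; it does not cover the snapshot row, which needs snapshot data the probe fields above do not carry.

```python
FIRST_ACTION = {
    "L1": "Clear bucket by workspace hash",
    "L2": "Tighten L2 capacity cap",
    "L3": "Add object-storage callback",
}


def dominant_layer(sample: dict) -> str:
    """Return the layer holding the most bytes (ties resolve to the earlier layer)."""
    l1 = sample.get("layer_l1_deriveddata_bytes", 0)
    l2 = (sample.get("layer_l2_cocoapods_bytes", 0)
          + sample.get("layer_l2_gradle_bytes", 0))
    l3 = sample.get("layer_l3_artifacts_bytes", 0)
    return max((("L1", l1), ("L2", l2), ("L3", l3)), key=lambda kv: kv[1])[0]


def first_action(sample: dict) -> str:
    """Look up the first action from the triage table for the dominant layer."""
    return FIRST_ACTION[dominant_layer(sample)]
```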
Warning: Do not run volume-level rm -rf while holding a seat lock. Cleanup scripts must see seat_lease_id empty or lease expired, or they delete an in-flight ModuleCache.
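A cleanup script can enforce this rule with a guard like the one below. `seat_lease_id` is the probe field listed earlier; `lease_expires_ts` is a hypothetical field that the scheduler would need to expose, and an unknown expiry is treated as "still held" to stay on the safe side.

```python
import time
from typing import Optional


def safe_to_sweep(seat_lease_id: Optional[str],
                  lease_expires_ts: Optional[float],
                  now: Optional[float] = None) -> bool:
    """True only if no lease is recorded or the recorded lease has expired."""
    if not seat_lease_id:
        return True
    if lease_expires_ts is None:
        return False  # lease present with unknown expiry: assume it is still held
    current = time.time() if now is None else now
    return current >= lease_expires_ts
```

Calling the guard immediately before each delete, rather than once at script start, narrows the window in which a new lease can be acquired mid-sweep.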
If L1 refills within 24 hours after a bucket clear, check whether missing worktree isolation is creating multiple full DerivedData trees on one node before buying more disk.
These values are field compromises from multiple 16GB/24GB shared pools. Attach them to change tickets as external SLO annexes; Dedicated pools may lower warn by 5 points for stabler index hot cache.
waterline_warn_threshold=82 triggers eviction in L3→L2→L1 order; waterline_hard_threshold=92 rejects new jobs and sets disk_waterline_hard_stop=1. On 512GB system volumes with ~60% reserved for mesh, cap L2 combined at 80GB (40GB soft cap each for CocoaPods and Gradle) and L3 per-job directories at 12GB (including dSYM).

Treating "weekend cron only" or "everyone SSH-deletes cache" as the long-term plan usually means no audit fields and no seat contracts, so neighbor deletes, compile cold starts, and half-written artifacts spike in release week.

For teams that need iOS/Android CI and disk SLOs on contract-grade cloud Mac Mini capacity, VpsMesh Mac Mini cloud rental is usually the better fit. See the pricing page, help center, and order page.
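The thresholds in this section can be expressed as a small decision function. This is a sketch under assumptions: the `Policy` class and `decide` are hypothetical names, and treating both thresholds as inclusive (`>=`) is a choice the article does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    warn_pct: float = 82.0
    hard_pct: float = 92.0
    evict_order: tuple = ("L3", "L2", "L1")  # cheapest to rebuild goes first
    l2_cap_gb: int = 80                      # combined CocoaPods + Gradle cap
    l3_per_job_cap_gb: int = 12              # per-job artifacts, including dSYM


def decide(used_pct: float, policy: Policy = Policy()) -> dict:
    """Map a waterline reading to runner behavior."""
    hard = used_pct >= policy.hard_pct
    warn = used_pct >= policy.warn_pct
    return {
        "accept_new_jobs": not hard,
        "disk_waterline_hard_stop": int(hard),
        "evict_order": list(policy.evict_order) if warn else [],
    }


def mesh_budget_gb(volume_gb: float = 512, reserved_fraction: float = 0.60) -> float:
    """Disk reserved for mesh directories on one volume, in GB."""
    return round(volume_gb * reserved_fraction, 1)
```

With the defaults, `mesh_budget_gb()` gives 307.2 GB, so the 80 GB L2 cap takes roughly a quarter of the mesh budget and leaves the rest for L1 buckets and in-flight L3 directories.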
Default to workspace-hash buckets bound to seat leases; lease end triggers LRU on that bucket. For parallel branches see the worktree isolation article; do not let global ~/Library/Developer/Xcode/DerivedData grow without bounds.
Runners should fail-fast and report disk_waterline_hard_stop to avoid half-written artifacts; schedulers route jobs to nodes with headroom or trigger Burst. Seat semantics are in the seat-lock article.
Golden-image drift checks are still required. Disk cleanup only reclaims runtime garbage; it does not replace the snapshot drift checklist. Onboarding steps are on the help center; plan comparison is on the pricing page.