2026 Mac Mesh Shared Build Pool Disk Waterline
DerivedData Cleanup and Three-Layer Cache Runbook

Disk waterline · DerivedData / CocoaPods / Gradle · Three artifact layers · Six-step runbook · Hard thresholds

Ops engineers, mobile platform owners, and tech leads who sign disk SLOs for shared Mac build pools tend to get the same Friday-night alert: runners are online but jobs fail with "No space left," DerivedData has filled the system volume, the CocoaPods and Gradle global caches have no owner, and artifact directories stay on local disk after rsync succeeds. These failures share one cause: Mac Mesh multi-tenant rotation without observable waterlines or a tiered reclaim contract. The fix proposed here is to split the disk into L1 DerivedData, L2 dependency cache, and L3 CI artifacts, and to run a six-step runbook so cleanup becomes an auditable routine instead of firefighting. The article covers five hidden taxes, a cleanup-strategy table, waterline probe fields, six implementation steps, three hard thresholds, and an FAQ. Related articles: seat locks and mutex, golden image checklist, rsync and object storage, three-pool SLO matrix, and Git worktree isolation.

01 Five hidden taxes before a shared build pool disk fills up

In 2026 Mac Mesh tickets, disk issues are rarely "we needed 100GB more." More often there is no shared contract across tenant rotation, cache locality, and artifact lifecycles, so APFS looks fine while Xcode fails writing temp files.

  1. Unbounded DerivedData sharing: multiple repos share ~/Library/Developer/Xcode/DerivedData. Indexes and module caches interleave by branch, so one clean deletes a neighbor's ModuleCache. The symptom is random link failures, not a disk-full error.

  2. CocoaPods / Gradle global cache with no TTL: ~/Library/Caches/CocoaPods and ~/.gradle/caches only grow. Old tarballs stay after Pod upgrades, and multi-branch worktree parallelism amplifies contention.

  3. Artifacts "uploaded but still local": the upload to object storage succeeded, but $CI_ARTIFACTS_DIR has no retention policy and no rsync completion hook is bound, so IPA and dSYM files slowly eat the disk.

  4. APFS snapshots vs "available" space: local snapshots make df look healthy while real writable space runs out at compile peaks, and there are no per-volume, per-layer metrics such as waterline_used_pct.

  5. Cleanup vs seat-lock races: sweeping directories before the lease is released, or in conflict with the seat-lock TTL, causes "disk cleared but build red" secondary incidents.

Deliverables: a three-layer directory dictionary, dual warn/hard waterlines, LRU eviction at lease end, and weekly golden-image drift checks kept as a separate process. If any of these is missing, do not promise "any monorepo can run in parallel" on a shared pool. The next section compares three cleanup philosophies so you avoid "everyone SSHs in Friday night and rm -rf."

02 Table: manual sweeps vs waterline daemon vs golden-image reset

Disk governance is not "clean harder." Balance build hit rate, auditable cleanup, and tenant isolation. Pin this table in change review: each layer (L1/L2/L3) gets one default strategy only.

| Strategy | L1 DerivedData | L2 Pods/Gradle | L3 Artifacts | Best for | Main risk |
| --- | --- | --- | --- | --- | --- |
| Manual cron | Weekend rm of global dir | Occasional pod cache prune | find by age | Tiny teams, low parallelism | Neighbor deletes, no audit trail |
| Waterline daemon | LRU per workspace hash | Evict on capacity | 48h after rsync success | Shared pool default | Needs metrics and lock contract |
| Image reset | Snapshot rollback clears | Refreshed with image | Volume replace | Drift out of control, compliance snapshots | Cold-start compile slowdown |

Bottom line: shared pools should default to the waterline daemon. Keep image reset as a quarterly fallback paired with the golden image drift checklist, not as a replacement for daily LRU.

When Dedicated pools and Shared rotation coexist, L1 cache keys must carry a pool-type tag or shared-pool sweeps will evict dedicated-node locality.

Three-layer directory layout (attach to runbook)

L1: /var/mesh/cache/deriveddata/{workspace_hash}, which CI binds through a DERIVED_DATA_DIR variable (xcodebuild accepts the path through its -derivedDataPath option). L2: /var/mesh/cache/cocoapods and /var/mesh/cache/gradle; do not write back to the global caches in the user's home directory. L3: /var/mesh/artifacts/{job_id}; after upload, keep only the checksum sidecar files. Monitoring can then report layer_*_bytes per tier instead of a vague "/ partition 85%."
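The layout above can be sketched as a small path helper. This is a minimal illustration, not the article's tooling: the bucket-key format in workspace_hash, the helper names, and the example repo URL are assumptions; only the /var/mesh paths, GRADLE_USER_HOME, and xcodebuild's -derivedDataPath option come from the text or from the tools themselves.

```python
import hashlib
from pathlib import PurePosixPath

MESH_ROOT = PurePosixPath("/var/mesh")  # root taken from the layout above


def workspace_hash(repo_url: str, pool_type: str) -> str:
    """Hypothetical bucket key. The only requirement taken from the text is
    that the key carries a pool-type tag so shared sweeps skip dedicated buckets."""
    digest = hashlib.sha256(repo_url.encode("utf-8")).hexdigest()[:16]
    return f"{pool_type}-{digest}"


def layer_paths(repo_url: str, pool_type: str, job_id: str) -> dict:
    """Return the L1/L2/L3 directories for one job."""
    bucket = workspace_hash(repo_url, pool_type)
    return {
        "l1_deriveddata": MESH_ROOT / "cache" / "deriveddata" / bucket,
        "l2_cocoapods": MESH_ROOT / "cache" / "cocoapods",
        "l2_gradle": MESH_ROOT / "cache" / "gradle",
        "l3_artifacts": MESH_ROOT / "artifacts" / job_id,
    }


def job_env(paths: dict) -> dict:
    """Environment a CI job would receive (variable names as used in the article)."""
    return {
        "DERIVED_DATA_DIR": str(paths["l1_deriveddata"]),
        "GRADLE_USER_HOME": str(paths["l2_gradle"]),
    }


def xcodebuild_args(scheme: str, derived_data: PurePosixPath) -> list:
    # -derivedDataPath is the xcodebuild option that relocates DerivedData.
    return ["xcodebuild", "-scheme", scheme,
            "-derivedDataPath", str(derived_data), "build"]
```

Because the pool type is part of the key, the same repository gets different L1 buckets on shared and dedicated nodes, which is the property the pool-type tag is meant to guarantee.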

03 Six-step runbook: from waterline script to three-layer auto reclaim

These six steps assume runners carry Mac Mesh labels and that a seat is acquired before each job and released after it. Keep the order: enforcing waterlines before metrics exist amounts to blind deletes.

  1. Freeze the three-layer dictionary and paths: write the L1/L2/L3 roots and the warn (82%) / hard (92%) thresholds into mesh-disk-policy.yaml in the repo, and register the default mount points in the image checklist.

  2. Deploy the disk-waterline probe: every 60s, collect volume usage and per-layer bytes and export them to Prometheus/OpenTelemetry. At the hard threshold, runners enter drain and fail fast on new jobs.

  3. Isolate DerivedData: CI points DERIVED_DATA_DIR at the workspace-hash bucket, and lease end triggers LRU on that bucket. Never sweep the global DerivedData directory.

  4. Evict the L2 dependency cache: wrap pod cache clean so it runs on capacity, not on a timer; point GRADLE_USER_HOME at the mesh directory and cap it with a max-cache-size limit.

  5. Bind artifact and rsync hooks: the object-storage multipart-complete callback deletes the local L3 copy; artifacts whose upload failed and is being retried are kept for 7 days. Field names match the artifact runbook.

  6. Run the weekly check and drill: compare golden-image checksums, simulate a job reject at a 90% waterline, and log the cleanup audit. When coordinating Burst overflow, clear L3 before accepting interruptible jobs.
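The dual-waterline decision behind steps 01 and 02 can be sketched as follows. This is a minimal sketch under assumptions: the WaterlinePolicy class, the action names, and the use of "greater than or equal" at each threshold are illustrative choices; the text fixes only the 82% and 92% values and the drain behavior at the hard threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WaterlinePolicy:
    """In-memory stand-in for the thresholds kept in mesh-disk-policy.yaml."""
    warn_pct: float = 82.0  # warn threshold from step 01
    hard_pct: float = 92.0  # hard threshold from step 01


def waterline_action(used_pct: float,
                     policy: WaterlinePolicy = WaterlinePolicy()) -> str:
    """Map a volume usage percentage to a runner action.

    "drain": stop accepting new jobs and fail fast (hard threshold).
    "evict": start tiered reclaim (warn threshold).
    "ok":    nothing to do.
    Treating the thresholds as inclusive is an assumption of this sketch.
    """
    if used_pct >= policy.hard_pct:
        return "drain"
    if used_pct >= policy.warn_pct:
        return "evict"
    return "ok"
```

For example, waterline_action(85.0) returns "evict" and waterline_action(92.0) returns "drain". Keeping the thresholds in one frozen object makes it harder for the probe and the cleanup daemon to drift apart.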

Minimum disk-waterline probe fields
hostname
pool_type
volume_mount
waterline_used_pct
waterline_warn_threshold
waterline_hard_threshold
layer_l1_deriveddata_bytes
layer_l2_cocoapods_bytes
layer_l2_gradle_bytes
layer_l3_artifacts_bytes
seat_lease_id
last_cleanup_ts_unix
cleanup_evicted_bytes_1h
disk_waterline_hard_stop
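A probe that fills most of these fields might look like the sketch below. The function names and the layers mapping are assumptions. seat_lease_id, last_cleanup_ts_unix, and cleanup_evicted_bytes_1h are left out because they would come from the seat service and the cleanup daemon. shutil.disk_usage reads filesystem-reported counters, so it may share the APFS snapshot blind spot described under tax 04.

```python
import os
import shutil
import socket


def dir_bytes(root: str) -> int:
    """Sum file sizes under root; a missing directory counts as 0."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished mid-walk, e.g. a build is running
    return total


def probe_sample(volume_mount: str, layers: dict, pool_type: str,
                 warn: float = 82.0, hard: float = 92.0) -> dict:
    """Build one probe sample.

    layers maps a field name (e.g. "layer_l1_deriveddata_bytes")
    to the directory measured for that field.
    """
    usage = shutil.disk_usage(volume_mount)
    used_pct = round(100.0 * usage.used / usage.total, 2)
    sample = {
        "hostname": socket.gethostname(),
        "pool_type": pool_type,
        "volume_mount": volume_mount,
        "waterline_used_pct": used_pct,
        "waterline_warn_threshold": warn,
        "waterline_hard_threshold": hard,
        "disk_waterline_hard_stop": int(used_pct >= hard),
    }
    for field, path in layers.items():
        sample[field] = dir_bytes(path)
    return sample
```

Walking large cache trees every 60s can itself be expensive, so a real probe would likely cache or sample the per-layer sizes; that trade-off is outside this sketch.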

Note: Probe output should be the first row on your Grafana board, not only OS alerts. Plot cleanup_evicted_bytes_1h with successful builds to tell real cleanup from "fewer builds so disk looks better."

04

Symptom matrix: triage by layer or by pool first

Disk alerts often overlap queue SLO symptoms. Use the table to see whether the issue is capacity, cache keys, or artifact pile-up before choosing sweep scope.

| Symptom | layer_* dominant | Likely root cause | First action |
| --- | --- | --- | --- |
| Only Xcode step fails | L1 high | DerivedData cross-talk or index corruption | Clear bucket by workspace hash |
| Mixed Android/iOS pool slow | L2 high | Pods/Gradle never evicted | Tighten L2 capacity cap |
| Upload OK, disk still full | L3 high | rsync hook not bound | Add object-storage callback |
| df OK, writes fail | snapshots | APFS local snapshots | Reduce snapshot retention + probe |
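The first three rows of the matrix can be turned into a lookup on probe output. This is only a sketch: picking the dominant layer by raw byte count is an assumption (the table says "high" without defining it), and the snapshot row is not covered because it cannot be detected from layer_*_bytes alone.

```python
# First actions copied from the matrix rows for L1, L2, and L3.
FIRST_ACTION = {
    "l1": "Clear bucket by workspace hash",
    "l2": "Tighten L2 capacity cap",
    "l3": "Add object-storage callback",
}


def dominant_layer(sample: dict) -> str:
    """Return the layer holding the most bytes in one probe sample.

    L2 is the sum of the CocoaPods and Gradle fields. On a tie, the
    earlier layer (l1, then l2, then l3) wins.
    """
    sizes = {
        "l1": sample.get("layer_l1_deriveddata_bytes", 0),
        "l2": sample.get("layer_l2_cocoapods_bytes", 0)
        + sample.get("layer_l2_gradle_bytes", 0),
        "l3": sample.get("layer_l3_artifacts_bytes", 0),
    }
    return max(sizes, key=sizes.get)


def first_action(sample: dict) -> str:
    """Look up the matrix's first action for the dominant layer."""
    return FIRST_ACTION[dominant_layer(sample)]
```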

Warning: Do not run volume-level rm -rf while holding a seat lock. Cleanup scripts must see seat_lease_id empty or lease expired, or they delete an in-flight ModuleCache.
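The lease check in this warning can be expressed as a guard that every cleanup script calls first. A minimal sketch: the lease_expires_ts_unix parameter is hypothetical (the probe fields list only seat_lease_id), and refusing when the expiry is unknown is a conservative choice made here, not something the text specifies.

```python
import time
from typing import Optional


def cleanup_allowed(seat_lease_id: Optional[str],
                    lease_expires_ts_unix: Optional[float],
                    now: Optional[float] = None) -> bool:
    """Return True only when no seat is held or the lease has expired."""
    now = time.time() if now is None else now
    if not seat_lease_id:
        return True  # no lease recorded: nothing in flight on this seat
    if lease_expires_ts_unix is None:
        return False  # lease held but expiry unknown: refuse to sweep
    return now >= lease_expires_ts_unix
```

A sweep would call cleanup_allowed before touching any layer and skip the node when it returns False, rather than deleting and hoping the build had finished.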

If L1 refills within 24 hours of a bucket clear, check first whether missing worktree isolation is creating multiple full DerivedData trees on one node. Do that before buying more disk.

05 Three hard thresholds and quotable ops parameters

These values are field compromises from multiple 16GB/24GB shared pools. Attach them to change tickets as external SLO annexes; Dedicated pools may lower warn by 5 points for stabler index hot cache.

  • Dual waterline: waterline_warn_threshold=82 triggers L3→L2→L1 evict order; waterline_hard_threshold=92 rejects new jobs and sets disk_waterline_hard_stop=1.
  • L1 max residency: shared pool per workspace bucket 14 days or 32GB, whichever hits first; Dedicated may use 28 days with a dedicated tag.
  • L3 local retention: delete within 48 hours after rsync/upload success; failed queue keeps 7 days, then alert and verify objects exist in object storage.
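The three thresholds above can be encoded as small rules. This is a sketch under stated assumptions: GB is taken as 10**9 bytes, the 32GB cap is assumed to apply to dedicated buckets as well, the 48-hour figure is read as a deadline (uploaded artifacts may be deleted earlier), and all function and action names are illustrative.

```python
GB = 10 ** 9                       # decimal gigabytes (assumption)
EVICT_ORDER = ("l3", "l2", "l1")   # order applied at the warn waterline


def eviction_plan(bytes_to_free: int, reclaimable: dict) -> dict:
    """Take bytes from L3, then L2, then L1 until the target is met."""
    plan = {}
    remaining = bytes_to_free
    for layer in EVICT_ORDER:
        if remaining <= 0:
            break
        take = min(remaining, reclaimable.get(layer, 0))
        if take > 0:
            plan[layer] = take
            remaining -= take
    return plan


def l1_bucket_over_limit(age_days: float, size_bytes: int,
                         dedicated: bool = False) -> bool:
    """14 days (28 for dedicated) or 32GB, whichever is hit first."""
    max_days = 28 if dedicated else 14
    return age_days >= max_days or size_bytes >= 32 * GB


def l3_action(upload_succeeded: bool, age_hours: float) -> str:
    """age_hours counts from upload success, or from the first failure."""
    if upload_succeeded:
        return "delete_overdue" if age_hours > 48 else "delete"
    return "alert_and_verify" if age_hours >= 7 * 24 else "keep"
```

For example, eviction_plan(10, {"l1": 100, "l2": 3, "l3": 5}) drains L3 and L2 completely before taking the last 2 bytes from L1, which keeps the most expensive-to-rebuild cache for last.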

On 512GB system volumes with about 60% reserved for mesh, cap L2 at 80GB combined (a 40GB soft cap each for CocoaPods and Gradle) and each L3 per-job directory at 12GB, dSYM included. Plans that rely on "weekend cron only" or "everyone SSH-deletes cache" for the long term usually lack audit fields and seat contracts, so neighbor deletes, compile cold starts, and half-written artifacts spike in release week. For teams that need iOS/Android CI and disk SLOs on contract-grade cloud Mac Mini capacity, VpsMesh Mac Mini cloud rental is usually the better fit. See the pricing page, help center, and order page.

FAQ

Top three reader questions

Q1. How should DerivedData be isolated between tenants on a shared pool?
Default to workspace-hash buckets bound to seat leases; lease end triggers LRU on that bucket. For parallel branches, see the worktree isolation article. Do not let the global ~/Library/Developer/Xcode/DerivedData grow without bounds.

Q2. What should happen when a node reaches the hard waterline?
Runners should fail fast and report disk_waterline_hard_stop to avoid half-written artifacts; schedulers route jobs to nodes with headroom or trigger Burst. Seat semantics are covered in the seat-lock article.

Q3. Is the golden-image drift checklist still needed once disk cleanup is automated?
Yes. Disk cleanup only reclaims runtime garbage; it does not replace the snapshot drift checklist. Onboarding steps are in the help center; the plan comparison is on the pricing page.