2026: Run DeepSeek V4 Flash Locally with antirez ds4 — Real Hardware Bill on 96 / 128 / 256 / 512 GB Macs and a Cloud-Rent Decision Matrix

A new Flash-only engine · The unified-memory bill · Three-tier rental matrix · ds4-server launch checklist

Redis author antirez wrote ds4 (DwarfStar 4) in roughly a week of C and made DeepSeek V4 Flash actually runnable on a single Mac. The catch is the hardware bill. The floor is a 96 GB unified-memory Mac; serious work starts at 256 GB; the comfort zone is 512 GB. List prices run from about USD 4,000 to over USD 15,000. This article gives independent developers, AI researchers, and small teams three things. First, the honest hardware bill for ds4 and Flash, plus a correction on the common myth that PRO can run on a 512 GB Mac. Second, a tiered decision matrix for 96 / 128 / 256 / 512 GB cloud-Mac nodes with a three-year TCO sketch. Third, a minimum viable launch checklist for ds4 on a VpsMesh cloud-Mac node with Cursor and opencode integration steps.

01

What ds4 actually is: why antirez did not write another general GGUF runner

ds4, short for DwarfStar 4, is by Salvatore Sanfilippo (antirez), the author of Redis. It is not a wrapper around llama.cpp. It is not a general GGUF loader. It is not yet another web UI. It is a native inference engine purpose-built for one model: DeepSeek V4 Flash. The primary backends are Metal on macOS and CUDA on Linux, including the DGX Spark. AMD ROCm lives in a separate branch. This deliberate narrowness is the reason ds4 picked up tens of thousands of GitHub stars in days and posts numbers that general runners cannot match.

The narrow scope buys real things. ds4 controls the MoE routing pipeline of DeepSeek V4 end to end, so it can apply aggressive 2-bit quantization to the routed experts while keeping the rest of the graph at higher precision. It treats the 1M-token context window as a first-class concern by spilling the KV cache to disk on demand, instead of recomputing prefill on every session. It ships a tool-calling loop and a coding agent that are part of the engine, not a separate framework glued on top. The list below summarizes the design choices and why they matter.

  1. One model, taken to the limit. The README is explicit: ds4 is not a GGUF runner, not a wrapper, not a framework. Every graph path is built for the DeepSeek V4 Flash MoE structure, so the routed experts can be quantized hard while the rest of the network keeps precision. Generic runners almost never do this, because they have to stay portable across models.

  2. Metal first, CUDA in parallel, CPU only for diagnostics. On macOS you build with make. On Linux you build with make cuda-spark or make cuda-generic. The README warns that current macOS virtual-memory behavior can trigger a kernel panic when you actually use the CPU path, so do not try to run inference without Metal on a Mac.

  3. On-disk KV cache built in. When you start ds4-server, pass --kv-disk-dir and --kv-disk-space-mb. The KV state is persisted to that directory and can be reloaded across sessions. Combined with the SSD inside a Mac, this turns the 1M-token context from a constant tax into a recoverable cost.

  4. OpenAI-compatible server, agent built in. ds4-server exposes /v1/chat/completions, so you can point Cursor, opencode, Claude Code, or any OpenAI-protocol client at it. Tool calling is native, which means a real coding-agent loop without an extra framework.

  5. Auditable by being small. The project is self-contained and does not pull in a third-party runtime. The codebase is small enough that a small team can audit the graph and the quantization choices. For anyone running large models in production, that matters.

Once you accept that ds4 is Flash-only by design, the next section follows naturally. The frequent claim that PRO will fit on a 512 GB Mac Studio needs correcting, and it is worth doing carefully.

02

The honest hardware bill: 96 / 128 / 256 / 512 GB compared, and why PRO on 512 GB is a myth

Start with the model specs. DeepSeek V4 Flash is a 284B-parameter MoE with 13B active per token. BF16 weights are around 570 GB. Q4 quantization brings the file size into the 150 GB range. The antirez q2 variant lands near 86.7 GB. That is why 96 GB is the floor that lets the model load, and why the community treats 128 GB as the realistic lab minimum. DeepSeek V4 PRO is a different story: 1.65T parameters with 49B active, roughly 3.2 TB at BF16 and about 800 GB even at Q4. It does not fit in 512 GB of unified memory, and ds4 mainline does not target PRO either. Any claim that PRO will run on a 512 GB Mac needs that correction.
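The sizes quoted above follow from bits-per-weight arithmetic. The sketch below is a rough check, assuming one uniform bit width across all weights and decimal gigabytes; real quantized files mix precisions and carry metadata, so they land somewhat off. `weights_gb` is a hypothetical helper name, not part of ds4.

```bash
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Approximate weight size in GB: billions of parameters * bits per weight / 8.
# Assumption: a single uniform bit width for every weight.
weights_gb() {
  echo $(( $1 * $2 / 8 ))
}

echo "Flash BF16: $(weights_gb 284 16) GB"
echo "Flash Q4:   $(weights_gb 284 4) GB"
echo "PRO BF16:   $(weights_gb 1650 16) GB"
echo "PRO Q4:     $(weights_gb 1650 4) GB"
```

This gives 568, 142, 3300, and 825 GB, close to the rounded figures above. A uniform 2-bit estimate for Flash would be 71 GB, so the roughly 86.7 GB q2 file is consistent with only part of the graph being quantized that hard. Either way, the PRO estimate at Q4 is well past 512 GB.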

| Unified memory | Typical Mac / list price | What ds4 can do | Reference speed | Practical role |
|---|---|---|---|---|
| 96 GB | MacBook Pro M3/M4/M5 Max top-spec, from about USD 4,000 | Flash q2 floor | q2 short prompts only | Can load; swap arrives quickly with mid-length context |
| 128 GB | MacBook Pro M3 Max max-spec or Mac Studio M2 Max, about USD 5,000–6,500 | Flash q2 lab minimum | q2 prefill about 58.5 t/s and generation about 26.7 t/s on short prompts; about 250 t/s prefill on an 11.7k-token prompt | Community-accepted lab floor; can keep q2 resident |
| 256 GB | Mac Studio M2 Ultra or mid-spec M3 Ultra, about USD 7,500–10,000 | Flash q4 viable | q4 short prompts run smoothly; mid-length context does not force swap | The serious-use target for Flash |
| 512 GB | Mac Studio M3 Ultra top-spec, about USD 14,000+ | Flash q4 plus long context comfort zone | q4 short: prefill about 79 t/s, generation about 35.5 t/s; q4 with about 12k-token prompt: prefill about 449 t/s, generation about 26.6 t/s | Long context plus a coding agent resident; still cannot hold PRO |

A few details deserve a separate note. Fitting the weights is not the same as fluent generation. The KV cache, the context window, and other system processes can eat tens of gigabytes. At 96 GB you will swap once the context passes about 100k tokens. The gap between q2 and q4 is not linear either. On a 512 GB Mac Studio, q2 short-prompt prefill is actually a touch faster than q4, but q4 wins on long context and on tool-calling quality. The DGX Spark GB10 with 128 GB on CUDA delivers about 344 t/s prefill on a 7k-token q2 prompt yet only about 13.7 t/s generation, which shows that the Mac unified-memory architecture still has a sweet spot for single-box long-context work.

ds4 drops the floor for running DeepSeek V4 Flash locally to 96 GB, but the comfort line still sits at 256–512 GB. The real cost is whether that machine stays busy across your project cycle.

03

Why it has to be a Mac: unified memory, bandwidth, and the on-disk KV cache

ds4 puts Metal first for engineering reasons, not aesthetics. Apple Silicon unified memory (UMA) shares one pool between CPU and GPU. There is no PCIe round-trip moving tensors between VRAM and system RAM. For an MoE model like Flash, where each token activates only a fraction of the experts, UMA lets the engine fetch the needed expert weights from one large pool without being constrained by a discrete-GPU memory ceiling. At consumer prices, no other platform gives you 96 GB at the low end and 512 GB at the top end as your effective inference memory.

The second factor is memory bandwidth. M3 Max sits at roughly 400 GB/s of unified-memory bandwidth, and M3 Ultra roughly doubles it to about 800 GB/s. That bandwidth helps explain the M3 Ultra Mac Studio numbers: about 449 t/s long-prompt prefill and about 35.5 t/s short-prompt generation. Bandwidth governs how quickly the engine can pull weights, which is typically the dominant bottleneck for MoE token generation; prefill leans more heavily on compute. On a Mac, that bandwidth is contiguous, not sharded across discrete GPUs.
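The bandwidth argument can be made concrete with a rough ceiling on generation speed. The sketch below assumes each generated token streams every active weight from memory exactly once at a single uniform bit width, and ignores KV-cache reads and kernel overhead; it is a back-of-the-envelope bound, not a ds4 measurement. `decode_ceiling_tps` is a hypothetical helper name.

```bash
# decode_ceiling_tps BANDWIDTH_GB_PER_S ACTIVE_PARAMS_BILLIONS BITS_PER_WEIGHT
# Upper bound on tokens/s: bandwidth divided by bytes of active weights per token.
# Assumption: all active weights are read once per token at one bit width.
decode_ceiling_tps() {
  echo $(( $1 * 8 / ($2 * $3) ))
}

echo "M3 Ultra, 800 GB/s, 13B active at 4 bits: $(decode_ceiling_tps 800 13 4) t/s ceiling"
echo "M3 Max,   400 GB/s, 13B active at 2 bits: $(decode_ceiling_tps 400 13 2) t/s ceiling"
```

Both ceilings come out near 123 t/s. The generation speeds quoted in this article sit well below that, which is expected: the non-expert weights stay at higher precision, the KV cache also has to be read, and no kernel reaches peak bandwidth.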

The third factor is often missed. Modern Mac internal NVMe SSDs pair well with the ds4 on-disk KV cache. ds4-server writes KV state into the path you pass to --kv-disk-dir and caps the footprint with --kv-disk-space-mb. When you reopen the same session, you skip seconds or minutes of prefill. Apple internal SSDs run at 5–7 GB/s sequential, which makes spill-and-reload cheaper than the alternative of paying RAM for every concurrent session.

Tip: Point --kv-disk-dir at the internal SSD. External USB-C drives often deliver only about a third of the internal SSD's random read/write speed, and KV reload then becomes the new bottleneck. Keep external storage for cold session snapshots.
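Sizing the disk cap is simple arithmetic. The sketch below converts a budget in GB into the value passed to --kv-disk-space-mb and estimates how many sessions fit under it, using this article's own planning figure of about 1–3 GB per long-context session. Both helper names are hypothetical, and the per-session figure is a planning estimate, not something ds4 reports.

```bash
# kv_space_mb GB -> value for --kv-disk-space-mb (1 GB taken as 1024 MB)
kv_space_mb() {
  echo $(( $1 * 1024 ))
}

# kv_sessions CAP_GB PER_SESSION_GB -> whole sessions that fit under the cap
kv_sessions() {
  echo $(( $1 / $2 ))
}

echo "--kv-disk-space-mb $(kv_space_mb 16)"
echo "sessions at 3 GB each: $(kv_sessions 16 3)"
echo "sessions at 1 GB each: $(kv_sessions 16 1)"
```

A 16 GB cap maps to 16384 MB and holds roughly 5 to 16 long-context sessions under that assumption, so raise the cap before adding more concurrent sessions.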

Put the three together and the conclusion is direct. In 2026 consumer hardware, nothing fits DeepSeek V4 Flash and ds4 better than a high-memory Mac. The remaining question is whether you can afford a 256 GB or 512 GB Mac, and whether you will actually keep it busy long enough for the math to work out.

04

When buying loses to renting: a tiered decision matrix and a three-year TCO sketch

Once you overlay the hardware bill onto a real project cycle, one conclusion is hard to escape. Most developers do not actually keep a 512 GB Mac Studio busy. Early exploration may need only 128 GB Flash q2. Productization may move to 256 GB at q4. A long-context coding agent may finally need 512 GB. That ladder is exactly what cloud-Mac nodes do well. A bought machine locks you into one tier.

| Typical role | Main tier | Switch frequency | Buy top-spec Mac Studio, 3-year TCO | Rent cloud Mac node, 3-year TCO |
|---|---|---|---|---|
| Independent developer or researcher (under 20 model-hours per week) | Mainly 128 GB Flash q2, occasional 256 GB | Rare upgrades | 256 GB Mac Studio, about USD 7,500; about USD 6,500+ over three years with depreciation | Weekly 128 GB plus quarterly 256 GB on demand; about USD 2,300–3,800 over three years |
| Small AI startup (30–60 hours per week, multi-project) | Mainly 256 GB Flash q4, occasional 512 GB long context | Weekly switches | 512 GB Mac Studio, about USD 14,000; about USD 12,000+ over three years | Monthly 256 GB resident plus 512 GB on burst; about USD 5,700–9,000 over three years |
| Coding-agent heavy user (60+ hours per week, steady) | Mainly 512 GB Flash q4 long context | No switching | Top-spec Mac Studio amortizes well | Monthly long-term lease of 512 GB; price gap narrows, but you keep elasticity and skip ops |
| Cross-region team (need to be close to users) | 128–256 GB per region | Parallel by region | Multiple machines, duplicated spend, hard to manage | Open by region on demand; cross-region switch is an order, not a logistics task |

The headline of this table is simple. A bought top-spec Mac Studio only wins when you keep the 512 GB tier full continuously, and most independent developers and small teams never reach that intensity. The realistic path is to use cloud nodes to find your actual tier, then decide whether to commit to a physical machine. By the time the exploration is finished, the cloud node has usually become the answer.
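To compare rows on one scale, the three-year totals in the table can be turned into monthly equivalents. The sketch below uses only the table's own figures (the lower bound where the table says "+") and a 36-month window; `monthly_equiv` is a hypothetical helper, and the result leaves out power, resale value, and anything else the table does not price.

```bash
# monthly_equiv TOTAL_USD MONTHS -> USD per month, rounded to the nearest dollar
monthly_equiv() {
  echo $(( ($1 + $2 / 2) / $2 ))
}

echo "Indie, buy 256 GB:    about USD $(monthly_equiv 6500 36) per month"
echo "Indie, rent:          about USD $(monthly_equiv 2300 36)-$(monthly_equiv 3800 36) per month"
echo "Startup, buy 512 GB:  about USD $(monthly_equiv 12000 36) per month"
echo "Startup, rent:        about USD $(monthly_equiv 5700 36)-$(monthly_equiv 9000 36) per month"
```

On these numbers the independent-developer row works out to about USD 181 per month to own versus about USD 64–106 to rent, and the startup row to about USD 333 versus about USD 158–250.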

Note: The hidden costs of buying go far beyond list price: electricity, cooling, backup storage, repairs after the warranty, and most importantly the next two or three Apple Silicon generations landing inside a three-year window. Today's top-spec machine is mid-spec in three years. A cloud node absorbs that depreciation curve for you.

05

A minimum viable launch checklist for ds4 on a VpsMesh cloud-Mac node, plus Cursor integration

The following six steps boil all of the above down into a repeatable runbook. They assume a VpsMesh cloud-Mac node: 128 GB minimum, 256 GB recommended, 512 GB for long-context comfort. Each step ships with a clear pass/fail check so your team can reuse it.

  1. Build ds4 with the Metal backend. git clone https://github.com/antirez/ds4 && cd ds4 && make. You get ./ds4 (CLI) and ./ds4-server (HTTP). Pass: both binaries exist and ./ds4 --help prints help. Do not run make cpu on macOS; the CPU path can kernel-panic.

  2. Smoke-test the Metal backend. Run ./ds4 -p "Hello" --metal to confirm device acquisition and the basic graph path. On a node with 128 GB or more, you can move straight on to loading Flash q2 weights. Pass: no "Metal device not available" error, no OOM.

  3. Pull DeepSeek V4 Flash q2 or q4 weights and verify. Use the GGUF source listed in the ds4 project. Sizes are about 86.7 GB for q2 and about 150 GB for q4. Always check SHA256. Keep weights and KV on separate volumes: weights on a large data disk with at least 500 GB free, KV on the Mac internal SSD. Pass: checksum matches; df -h shows at least 100 GB headroom on the data disk.

  4. Start ds4-server with on-disk KV. Example: ./ds4-server --ctx 200000 --kv-disk-dir /Volumes/ssd-kv/ds4-kv --kv-disk-space-mb 16384 --bind 127.0.0.1:8080. Begin with a 200k window, not 1M, to avoid early memory pressure. Pass: startup log shows Metal ready and KV directory writable; curl http://127.0.0.1:8080/v1/models returns JSON.

  5. Hook up Cursor, opencode, or Claude Code. Point the client base URL at ds4-server through an SSH tunnel that forwards the remote 8080 to local 127.0.0.1:8080. Never expose 8080 on 0.0.0.0. Set the Authorization header to whatever placeholder token your launch flags require. Pick the model name documented by the current ds4 release. Pass: a short streaming request to /v1/chat/completions returns 200 OK.

  6. Set up observability and a rollback rule. Watch memory and disk with vm_stat, memory_pressure, and iostat. Define triggers: if swap stays high, if prefill drops below 50% of baseline, or if the KV directory crosses 80% of --kv-disk-space-mb, fall back to a cloud API (OpenAI, Anthropic, or the official DeepSeek endpoint). Pass: the rollback path produces a comparable result for the same input.
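The two numeric rollback triggers in step 6 can be sketched as a small shell check. This is a minimal illustration, not part of ds4: `should_fall_back` is a hypothetical helper, its inputs are whole numbers you collect yourself, and the "swap stays high" trigger is left out because the step does not give it a threshold.

```bash
# should_fall_back BASELINE_TPS CURRENT_TPS KV_USED_MB KV_CAP_MB
# Prints the reason and returns 0 when a trigger fires; prints "ok" and returns 1 otherwise.
should_fall_back() {
  baseline=$1
  current=$2
  used=$3
  cap=$4
  # Trigger: prefill throughput below 50% of the recorded baseline.
  if [ $(( current * 100 )) -lt $(( baseline * 50 )) ]; then
    echo "prefill below 50% of baseline"
    return 0
  fi
  # Trigger: KV directory above 80% of --kv-disk-space-mb.
  if [ $(( used * 100 )) -gt $(( cap * 80 )) ]; then
    echo "KV directory above 80% of cap"
    return 0
  fi
  echo "ok"
  return 1
}

if reason=$(should_fall_back 449 200 4000 16384); then
  echo "fall back to cloud API: $reason"
fi
```

In practice the KV usage could come from something like `du -sm /Volumes/ssd-kv/ds4-kv | cut -f1`, and the throughput numbers from your own benchmark runs against the node's recorded baseline.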

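```bash
# Terminal 1: open the tunnel and start ds4-server on the node (stays in the foreground).
# Assumes ds4 was cloned and built in ~/ds4 on the node; adjust the path if not.
ssh -L 8080:127.0.0.1:8080 vpsmesh-mac-node \
  'cd ds4 && ./ds4-server \
     --ctx 200000 \
     --kv-disk-dir /Volumes/ssd-kv/ds4-kv \
     --kv-disk-space-mb 16384 \
     --bind 127.0.0.1:8080'

# Terminal 2: send a short non-streaming request through the tunnel.
# DS4_TOKEN and the model name are placeholders; use the token your launch flags
# require and the model name documented by the current ds4 release.
curl -sS http://127.0.0.1:8080/v1/chat/completions \
  -H "Authorization: Bearer $DS4_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-v4-flash-q4","messages":[{"role":"user","content":"hello"}],"stream":false}' \
  | jq .
```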

Three hard data points worth pasting into a team README.

  • Throughput baselines. On a 512 GB M3 Ultra Mac Studio, q4 long-prompt prefill is about 449 t/s and generation about 26.6 t/s. On a 128 GB M3 Max MacBook Pro, q2 long-prompt prefill is about 250 t/s and generation about 21.5 t/s. Use these as health anchors per node.
  • Memory budget. q2 weights are about 86.7 GB, a 200k-token KV cache adds about 8–14 GB, and the system needs roughly 8 GB. That is roughly 103–109 GB, so plan for about 110 GB to begin with. A 96 GB node is therefore short-context only; 128 GB is the real lab floor, and 256 GB is where KV and concurrent sessions get headroom.
  • Disk KV sizing. Start --kv-disk-space-mb at 16 GB. Reserve about 1–3 GB per long-context session. Use the internal SSD; an external drive will make KV reload the new bottleneck.
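The memory budget above can be checked per node size with a few lines of shell. This is a sketch using only the figures from the list; `mem_budget_gb` and `fits` are hypothetical helpers, and "fits" here means only that the sum is no larger than the node's unified memory, with no extra safety margin.

```bash
# mem_budget_gb WEIGHTS_GB KV_GB SYSTEM_GB -> total resident GB, one decimal place
mem_budget_gb() {
  awk -v w="$1" -v k="$2" -v s="$3" 'BEGIN { printf "%.1f\n", w + k + s }'
}

# fits TOTAL_GB NODE_GB -> exit status 0 if the budget fits in the node's memory
fits() {
  awk -v t="$1" -v n="$2" 'BEGIN { exit !(t <= n) }'
}

low=$(mem_budget_gb 86.7 8 8)     # q2 weights + low KV estimate + system
high=$(mem_budget_gb 86.7 14 8)   # q2 weights + high KV estimate + system
echo "q2 at 200k context: ${low}-${high} GB"

for node in 96 128 256; do
  if fits "$high" "$node"; then
    echo "$node GB node: fits"
  else
    echo "$node GB node: does not fit"
  fi
done
```

With these inputs the budget is 102.7–108.7 GB, which rules out the 96 GB node for a 200k-token context and leaves the 128 GB node with under 26 GB of slack.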

If you are weighing a 256 or 512 GB Mac Studio against renting a cloud Mac for ds4, weigh two things that rarely make it into spec sheets. First, the hidden bill for a physical machine: power, noise, cooling, repairs after the warranty, and the next two or three Apple Silicon generations arriving inside a three-year horizon. Second, the operational tax of self-hosting: daemonizing ds4-server across reboots, watching the KV disk waterline, and keeping the Cursor or opencode link self-healing. None of that is the work you wanted to do.

For independent developers, researchers, and small teams who would rather spend their time running models and writing code than babysitting a server, the VpsMesh high-memory cloud-Mac nodes (switchable across 96 / 128 / 256 / 512 GB on demand) are usually the more realistic and more economical choice. Start with a week of 128 GB to validate Flash q2 fit. Move to a month of 256 GB to make Cursor and a coding agent feel good. Only then decide whether to commit to a resident 512 GB node. That ladder is far less risky than dropping the price of a small car on a single top-spec Mac Studio up front.

Frequently Asked Questions

Can ds4 run DeepSeek V4 PRO on a 512 GB Mac?

No. The ds4 mainline targets DeepSeek V4 Flash only. Flash is a 284B-parameter MoE with 13B active per token. PRO is 1.65T total and 49B active, which is about 3.2 TB at BF16 and about 800 GB even at Q4. It does not fit any 512 GB Mac and is out of scope for ds4 and single-box Mac setups. For Flash specifically, see the VpsMesh pricing page and pick a 128 GB node or above.

Is a 96 GB Mac enough for DeepSeek V4 Flash?

It is the floor that lets q2 load, not a comfort zone. Long contexts and any concurrency push the machine into swap quickly, especially past 100k tokens. 128 GB is the realistic lab minimum, 256 GB is the first serious target where q4 with mid-size contexts no longer swaps, and 512 GB is the comfort zone for long contexts plus a coding agent kept resident. If you only want to verify feasibility, renting a 128 GB cloud node for two weeks is cheaper than buying a 96 GB laptop.

When does buying a Mac Studio beat renting a cloud Mac?

A simple rule: only when you fill the 512 GB tier for at least 30 hours per week for at least two years. Anything lighter typically loses to pay-as-you-go rental, once you factor in power, depreciation, and the next two or three Apple Silicon generations landing inside that window. The VpsMesh help center covers capacity planning, and you can also go straight to the order page to open a trial node sized to your real workload.

Can I use Cursor, opencode, or Claude Code with ds4-server?

Yes. ds4-server exposes /v1/chat/completions, which is OpenAI-compatible. Point the client base URL at the server, set the placeholder token, and choose a context window aligned with your launch flags. Always bind ds4-server to 127.0.0.1 in production and reach it through an SSH tunnel or a private network; never bind it to 0.0.0.0. The SSH tunnel template and rollback triggers are in section 05 of this article.