On June 27, 2026, DeepSeek open-sourced DSpark — a speculative decoding stack for DeepSeek-V4-Flash and DeepSeek-V4-Pro. Daniel Han (@danielhanchen, Unsloth) summarized the release on X: throughput gains of 51% to 400%, with DSpark also training cleanly on Gemma and Qwen targets — not only V4.
This is not a new reasoning benchmark headline or a secretly retrained 2T model. The Hugging Face model card is explicit: same checkpoint, extra draft module. If you already follow DeepSeek V4 API migration or V4-Pro agent economics, DSpark is the inference-speed layer on top of those weights — the piece that makes self-hosted V4 feel closer to what hyperscalers get from custom serving stacks.
TL;DR — what people ask first
| Question | Direct answer |
|---|---|
| New model or speed hack? | Speed hack — V4-Pro/V4-Flash plus a speculative decoding draft (HF note). |
| How much faster? | DeepSeek / community summaries cite ~51%–400% throughput uplift depending on task, batch, and hardware — read DSpark_paper.pdf for conditions. |
| Where is the code? | github.com/deepseek-ai/DeepSpec — train, eval, configs for DSpark, DFlash, Eagle3. |
| vs DFlash? | Both draft-based; paper includes Qwen/Gemma comparison tables (~page 11 per early readers). DFlash is the incumbent industry pattern; DSpark is DeepSeek's V4-focused recipe. |
| Run today? | vllm serve deepseek-ai/DeepSeek-V4-Pro-DSpark or SGLang equivalent — see HF integration snippets. |
| API model string change? | No — still deepseek-v4-pro / deepseek-v4-flash on the official API; DSpark is primarily open-weight + self-host today. |
What speculative decoding actually buys you
Autoregressive LLMs generate one token at a time. Each step runs the full target model forward pass — expensive for 1.6T-parameter MoE targets like V4-Pro (49B activated per token).
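Speculative decoding adds a small draft model that proposes a block of candidate tokens, which the target model then verifies in a single parallel forward pass. Every accepted draft token is an output token the target did not need its own sequential step for, so one target pass can emit several tokens. At the first rejected token, the rest of the block is discarded and decoding continues from the target's own choice.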
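To make the draft-then-verify loop concrete, here is a minimal greedy sketch with toy next-token functions standing in for real models. It is not DSpark's algorithm or any engine's implementation; `speculative_generate`, `toy_target`, and `toy_draft` are hypothetical names, and the "one target pass" is simulated by calling the toy target once per position.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_generate(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    block_size: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify loop. Returns (new_tokens, target_passes)."""
    out: List[Token] = []
    target_passes = 0
    while len(out) < max_new_tokens:
        context = prompt + out
        # 1) The cheap draft proposes a block of candidate tokens.
        proposal: List[Token] = []
        for _ in range(block_size):
            proposal.append(draft(context + proposal))
        # 2) One target pass checks every position in the block. A real
        #    engine batches this; the toy target is simply called per position.
        target_passes += 1
        for i in range(block_size + 1):
            expected = target(context + proposal[:i])
            out.append(expected)  # always the target's own greedy token
            # Stop at the first mismatch, or after the extra token that
            # follows a fully accepted block.
            if i == block_size or proposal[i] != expected:
                break
    return out[:max_new_tokens], target_passes


def toy_target(context: List[Token]) -> Token:
    """Toy 'large model': counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[Token]) -> Token:
    """Toy 'draft model': agrees with the target except after a 2."""
    return 0 if context[-1] == 2 else (context[-1] + 1) % 10


tokens, passes = speculative_generate(toy_target, toy_draft, [0], max_new_tokens=12)
```

Because every emitted token is the target's own greedy choice, the output matches plain target decoding; the draft only changes how many target passes it takes to get there.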
For agent workloads (long tool loops, streaming chat, batch evals), throughput often becomes the bottleneck before model quality does. That is why a 51–400% swing matters more than another 0.3 points on a static MMLU score when you are serving thousands of concurrent Claude Code–class agent traces.
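A rough way to see what that swing means in money is to convert throughput into cost per million output tokens. The sketch below assumes "51% to 400% gains" means 1.51x to 5x tokens per second on the same hardware; the GPU price, GPU count, and baseline throughput are made-up illustration values, not figures from the release.

```python
def cost_per_million_tokens(
    gpu_hour_usd: float, num_gpus: int, tokens_per_second: float
) -> float:
    """USD per one million generated tokens on a node billed per GPU-hour."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd * num_gpus / tokens_per_hour * 1_000_000


def with_uplift(tokens_per_second: float, uplift_percent: float) -> float:
    """Throughput after an uplift given in percent (51 means 1.51x)."""
    return tokens_per_second * (1 + uplift_percent / 100)


# Hypothetical node: 8 GPUs at $2 per GPU-hour, 1,000 tokens/s without a draft.
baseline_tps = 1000.0
baseline_cost = cost_per_million_tokens(2.0, 8, baseline_tps)
low_cost = cost_per_million_tokens(2.0, 8, with_uplift(baseline_tps, 51))
high_cost = cost_per_million_tokens(2.0, 8, with_uplift(baseline_tps, 400))

# Cost scales with 1 / (1 + uplift): about 66% of baseline at +51%,
# and 20% of baseline at +400%.
print(f"{baseline_cost:.2f} {low_cost:.2f} {high_cost:.2f}")
```

The absolute dollar figures are meaningless outside this example; the ratio between them is the part that carries over to your own hardware.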
DSpark is DeepSeek's recipe for draft training + verification tuned to their V4 family and published as reproducible code — not a black-box serving flag.
What shipped on June 27, 2026
| Asset | URL | What it is |
|---|---|---|
| DeepSpec repo | github.com/deepseek-ai/DeepSpec | Train/eval draft models; MIT license |
| DSpark paper | DSpark_paper.pdf | Method + benchmarks (arXiv 2606.19348 on HF) |
| V4-Pro + DSpark weights | huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark | Target checkpoint + draft module bundle |
| V4 collection | huggingface.co/collections/deepseek-ai/deepseek-v4 | Full V4-Flash / V4-Pro variants |
DeepSpec README lists three supported draft algorithms:
- DSpark — new in this release (paper PDF)
- DFlash — prior block-diffusion-style draft approach (also in repo)
- Eagle3 — third-party lineage via SpecForge adaptations
Acknowledgements credit SpecForge, DFlash, Qwen3, and Gemma codebases — which matches the social thread claim that DSpark generalizes beyond V4 targets.
DSpark vs DFlash — what builders should know
DFlash has become the shorthand many teams use for fast draft + verify serving on open models. DSpark is DeepSeek's 2026 entry in the same design space, bundled with V4-Pro-DSpark weights and first-class configs under config/dspark/ in DeepSpec.
Practical differences matter at integration time:
| Dimension | DFlash (prior art) | DSpark (this release) |
|---|---|---|
| Primary target | General open models (Qwen, etc.) | DeepSeek-V4-Flash / V4-Pro + Qwen/Gemma configs in DeepSpec |
| Weights on HF | Community / separate releases | DeepSeek-V4-Pro-DSpark official bundle |
| Training code | External DFlash repo + DeepSpec adapter | Native in DeepSpec alongside Eagle3 |
| Benchmarks | DFlash paper | DSpark_paper.pdf incl. cross-method tables |
When Lily Zhang asked on X how DSpark compares to DFlash, Daniel Han pointed readers to the Qwen/Gemma table in the paper (~page 11). For production, the honest answer is still to run eval.py on your own prompt distribution rather than cherry-pick one leaderboard cell.
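If you want a quick sanity check on your own prompts, independent of the repo's eval scripts, a tiny harness like the following is enough to compare two serving setups. It is a sketch only: `generate` is a hypothetical callable you would write around your own endpoint, returning the number of completion tokens produced for a prompt.

```python
import time
from typing import Callable, Iterable


def measure_throughput(
    generate: Callable[[str], int],
    prompts: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Completion tokens per second over a prompt set, run sequentially."""
    total_tokens = 0
    start = clock()
    for prompt in prompts:
        total_tokens += generate(prompt)  # generate returns a token count
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / elapsed


def uplift_percent(baseline_tps: float, candidate_tps: float) -> float:
    """Relative throughput gain of candidate over baseline, in percent."""
    return (candidate_tps / baseline_tps - 1) * 100
```

Run it once against a server without the draft module and once with it, on the same prompts and the same sampling settings, then feed both numbers to `uplift_percent`. Sequential requests understate what a batched server can do, so treat the result as a relative comparison rather than a capacity figure.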
How to serve DeepSeek-V4-Pro-DSpark
The Hugging Face card documents four integration paths. Most production teams start with vLLM or SGLang.
vLLM (OpenAI-compatible server)
pip install vllm
vllm serve "deepseek-ai/DeepSeek-V4-Pro-DSpark"
Then call the completions endpoint:
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "deepseek-ai/DeepSeek-V4-Pro-DSpark",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
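The same request can be sent from Python with only the standard library. This is a sketch that mirrors the curl call above; it assumes the server is listening on localhost:8000 and returns the usual OpenAI-style completions response, and the helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    base_url: str = "http://localhost:8000",
    model: str = "deepseek-ai/DeepSeek-V4-Pro-DSpark",
    max_tokens: int = 512,
    temperature: float = 0.5,
) -> urllib.request.Request:
    """Build the POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, timeout_s: float = 600.0) -> str:
    """Send the request and return the first completion's text.

    Assumes an OpenAI-style response body with a `choices` list.
    """
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("Once upon a time,")` needs the server from the previous step to be running; building the request does not.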
SGLang
pip install sglang
python3 -m sglang.launch_server \
--model-path "deepseek-ai/DeepSeek-V4-Pro-DSpark" \
--host 0.0.0.0 \
--port 30000
Chat format warning
V4 does not ship a Jinja chat template on the DSpark card — it uses the encoding_dsv4 Python helpers for OpenAI-style messages and reasoning modes. Before wiring agents, read the encoding folder in the DeepSeek-V4-Pro repo. Agent hosts that assume a generic apply_chat_template() string will mis-tokenize Think High / Think Max modes documented in the V4 tech report.
Training your own DSpark draft (DeepSpec workflow)
DeepSpec is not inference-only. The README describes a three-stage pipeline:
Data prep → Training → Evaluation
- Data preparation: download prompts, regenerate target answers, build target cache (storage-heavy; README warns ~38 TB for default Qwen/Qwen3-4B cache settings).
- Training: bash scripts/train/train.sh with configs like config/dspark/dspark_qwen3_4b.py or dspark_gemma4_12b.py.
- Evaluation: bash scripts/eval/eval.sh over gsm8k, humaneval, livecodebench, mt-bench, arena-hard-v2, and others in eval_datasets/.
Example train entrypoint from the repo:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train.py \
--config config/dspark/dspark_qwen3_4b.py \
--opts "data.target_cache_path=${HOME}/.cache/deepspec/qwen3_4b_target_cache"
Default scripts assume an 8-GPU node. Scale CUDA_VISIBLE_DEVICES and train.local_batch_size down on smaller boxes — OOM is common on 1.6T-class targets without serious hardware.
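Scaling down can be scripted rather than hand-edited. The helper below is a hypothetical sketch, not part of DeepSpec: it builds the environment and argument list for the train.py entrypoint shown above. It assumes --opts accepts several space-separated key=value pairs in one string, which the repo excerpt here does not confirm, so check the README before relying on it.

```python
import os
import subprocess
from typing import Dict, List, Optional, Tuple


def build_train_command(
    num_gpus: int,
    config: str,
    target_cache_path: str,
    local_batch_size: Optional[int] = None,
) -> Tuple[Dict[str, str], List[str]]:
    """Return (extra_env, argv) for a smaller-than-default training run."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Expose only the first num_gpus devices to the process.
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in range(num_gpus))}
    opts = [f"data.target_cache_path={target_cache_path}"]
    if local_batch_size is not None:
        # ASSUMPTION: a second override can share the same --opts string.
        opts.append(f"train.local_batch_size={local_batch_size}")
    argv = ["python", "train.py", "--config", config, "--opts", " ".join(opts)]
    return env, argv


def run_training(env: Dict[str, str], argv: List[str]) -> None:
    """Launch training with the extra variables layered over the current env."""
    subprocess.run(argv, env={**os.environ, **env}, check=True)
```

For a 4-GPU box, `build_train_command(4, "config/dspark/dspark_qwen3_4b.py", cache_path, local_batch_size=2)` yields the device list 0,1,2,3 and a single --opts argument carrying both overrides; the batch size of 2 is an arbitrary illustration, not a recommended value.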
This matters for teams running local open-source stacks: you can specialize drafts on your own agent trace distribution instead of trusting generic Eagle weights.
Who should care — and who can skip it
Care if you:
- Self-host V4-Pro or V4-Flash and pay per GPU-hour
- Run batch evals, synthetic data generation, or high-QPS chat behind vLLM/SGLang
- Build inference engines (Unsloth, SGLang, vLLM forks) and need acceptance-rate benchmarks
- Already optimize token budgets in agent context — faster tokens change economics of long traces
Skip for now if you:
- Only use DeepSeek's hosted API with no self-host plan — wait for provider-side announcements
- Run small models on laptop CPUs — V4-Pro-DSpark is not a consumer download
- Expect DSpark to fix bad agent harnesses — draft speed does not replace tool schemas, evals, or MCP wiring
Limitations and honest caveats
- Throughput range is wide (51–400%): acceptance rates collapse on out-of-distribution prompts, and the headline peak numbers were not measured on your agent traces.
- Not a quality upgrade: DSpark changes speed, not the checkpoint. When drafts are rejected you still pay verification overhead, and the worst case can be slower than baseline decoding.
- Hardware barrier: V4-Pro is a 1.6T-total / 49B-activated MoE with a 1M-context architecture, and DSpark adds draft weights on top.
- Storage for training: DeepSpec data prep is datacenter-scale; read scripts/data/README.md before budgeting.
- Engine support lag: the newest draft modules often land on HF before every inference engine exposes one-click flags, so pin versions and read release notes.
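The "can be slower than baseline" caveat falls out of the textbook speculative decoding estimate. The sketch below uses that generic model, not anything taken from the DSpark paper: it assumes each draft token is accepted independently with probability `alpha`, that the draft runs sequentially at a cost of `draft_cost_ratio` target passes per token, and that verifying a block costs one target pass. Block-parallel drafters may have a different cost profile.

```python
def expected_tokens_per_pass(alpha: float, block_size: int) -> float:
    """Expected output tokens per target pass for acceptance rate alpha.

    Geometric series 1 + alpha + ... + alpha**block_size: the accepted prefix
    plus the one token the target always contributes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(block_size + 1)
    return (1 - alpha ** (block_size + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, block_size: int, draft_cost_ratio: float) -> float:
    """Tokens per unit of compute, relative to plain one-token-per-pass decoding."""
    cost_per_round = 1 + draft_cost_ratio * block_size  # one verify + k draft steps
    return expected_tokens_per_pass(alpha, block_size) / cost_per_round


# Illustrative values only: block of 4, draft costing 5% of a target pass.
in_distribution = estimated_speedup(alpha=0.8, block_size=4, draft_cost_ratio=0.05)
out_of_distribution = estimated_speedup(alpha=0.1, block_size=4, draft_cost_ratio=0.05)
```

With these made-up inputs the in-distribution case lands near 2.8x while the low-acceptance case drops below 1.0, which is the slower-than-baseline scenario. The real lever is the acceptance rate on your own traffic, which is why it is worth measuring before cutover.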
Related reading on explainx.ai
- DeepSeek V4 preview: API and migration (deepseek-v4-pro / deepseek-v4-flash model IDs and legacy retirement)
- DeepSeek V4-Pro: benchmarks and agent coding (SWE Verified, CSA/HCA, API pricing context)
- DeepSeek V4-Pro permanent API discount — hosted API economics vs self-host
- Closed-source vs local open-source alternatives — when self-hosting wins
- What are LLM tokens? — why throughput changes agent cost
- Context engineering guide — long traces amplify inference spend
Official sources: DeepSpec GitHub · DSpark paper PDF · DeepSeek-V4-Pro-DSpark on Hugging Face · DeepSeek V4 collection · DeepSeek API docs
Throughput figures, Hugging Face integration snippets, and DeepSpec layout reflect the June 27, 2026 release. Re-verify acceptance rates on your hardware and prompt mix before production cutover.