flash-moe-inference▌
aradotso/trending-skills · updated Apr 8, 2026
MDX-style export adds YAML metadata + attribution linking explainx.ai and this canonical listing URL.
Skill by ara.so — Daily 2026 Skills collection.
Flash-MoE Inference Engine
Skill by ara.so — Daily 2026 Skills collection.
Flash-MoE is a pure C/Objective-C/Metal inference engine that runs Qwen3.5-397B-A17B (397B parameter Mixture-of-Experts) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand — no Python, no ML frameworks, just C, Objective-C, and hand-tuned Metal shaders.
Requirements
- Hardware: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free
- OS: macOS 26+ (Darwin 25+)
- Tools: Xcode Command Line Tools, Python 3.x (for weight extraction only)
- Model: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace)
Installation & Build
# Clone the repo
git clone https://github.com/danveloper/flash-moe
cd flash-moe/metal_infer
# Build everything
make
# Verify build artifacts
ls infer chat main
The Makefile compiles infer.m, chat.m, main.m with Metal shader compilation for shaders.metal.
Weight Preparation
Step 1: Extract non-expert weights
# From the metal_infer/ directory
# Point to your downloaded Qwen3.5-397B safetensors directory
python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/
# Produces:
# model_weights.bin (~5.5GB, mmap'd at runtime)
# model_weights.json (tensor manifest)
# vocab.bin (vocabulary)
# tokenizer.bin (BPE tokenizer data)
Step 2: Pack expert weights (4-bit, production)
# From repo root
python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/
# Produces packed_experts/ directory (~209GB)
# Each expert is a separate file: layer_XX_expert_YYYY.bin
Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)
# Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)
python3 metal_infer/repack_experts_2bit.py \
metal_infer/packed_experts/ \
metal_infer/packed_experts_2bit/
Key Commands
Basic inference
cd metal_infer
# 4-bit inference (production quality, tool calling works)
./infer --prompt "Explain quantum computing" --tokens 100
# 2-bit inference (faster, breaks JSON/tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit
# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
Interactive chat with tool calling
./chat
# Opens TUI with full tool calling support
# Uses 4-bit experts by default
MoE-only benchmark (measures expert throughput)
./main
# Runs pure expert forward-pass benchmark
# Reports tokens/sec without attention overhead
Project Structure
flash-moe/
├── paper/
│ └── flash_moe.pdf # Full technical paper
├── metal_infer/
│ ├── infer.m # Complete inference engine (~7000 lines)
│ ├── shaders.metal # Metal compute kernels (~1200 lines)
│ ├── chat.m # Interactive chat TUI
│ ├── tokenizer.h # Single-header C BPE tokenizer (449 lines)
│ ├── main.m # MoE-only benchmark
│ ├── Makefile
│ ├── extract_weights.py # Safetensors → model_weights.bin
│ ├── repack_experts_2bit.py # 4-bit → 2-bit requantization
│ ├── train_predictor.py # Expert routing prediction analysis
│ ├── model_weights.bin # Non-expert weights (mmap'd)
│ ├── model_weights.json # Tensor manifest
│ ├── vocab.bin
│ ├── tokenizer.bin
│ ├── packed_experts/ # 4-bit expert files (209GB)
│ └── packed_experts_2bit/ # 2-bit expert files (120GB, optional)
├── repack_experts.py # 4-bit expert packing from safetensors
├── progress.py # Results visualization
└── results.tsv # Experiment log
Architecture Overview
The model has 60 transformer layers:
- 45 GatedDeltaNet (linear attention) layers
- 15 standard full attention layers
- Each layer: 512 experts, K=4 activated per token + 1 shared expert
- Hidden dimension: 4096
Per-layer pipeline (4.28ms average at 4-bit)
CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU]
→ CPU: flush results [0.01ms CPU]
→ CMD2: o_proj + norm + routing + shared [0.55ms GPU]
→ CPU: softmax + topK routing [0.003ms]
→ I/O: parallel pread K=4 experts [2.41ms SSD]
→ CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]
Metal Shader Kernels
The shaders.metal file contains hand-written kernels. Key kernels:
// 4-bit dequantized matrix-vector multiply (FMA-optimized)
// Key insight: fma(nibble, scale*x, bias*x) instead of (nibble*scale + bias)*x
// Pre-compute scale*x and bias*x to fuse dequant+multiply in one FMA instruction
kernel void matvec_4bit_fma(
device const uint8_t* weights [[buffer(0)]],
device const float* scales [[buffer(1)]],
device const float* biases [[buffer(2)]],
device const float* x [[buffer(3)]],
device float* out [[buffer(4)]],
uint tid [[thread_position_in_threadgroup]],
uint gid [[threadgroup_position_in_grid]])
{
// ... tiled SIMD-reduced FMA kernel
// 12% faster than naive (nibble * scale + bias) * x
}
// Fused SwiGLU activation
kernel void swiglu(device float* gate [[buffer(0)]],
device const float* up [[buffer(1)]],
uint gid [[thread_position_in_grid]])
{
float g = gate[gid];
gate[gid] = (g / (1.0f + exp(-g))) * up[gid];
}
// RMS normalization (two-pass)
kernel void rms_norm_pass1(...) // sum of squares reduction
kernel void rms_norm_pass2(...) // apply normalization
// GPU RoPE (fused with Q deinterleave and K normalization)
kernel void rope_qk(...)
// MoE combine + residual + sigmoid gate (fused)
kernel void moe_combine_residual(...)
SSD Expert Streaming Pattern
The core innovation — loading only K=4 active experts per layer from SSD:
// Parallel expert loading using GCD dispatch groups
// From infer.m (conceptual pattern)
dispatch_group_t group = dispatch_group_create();
dispatch_queue_t ioQueue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);
for (int k = 0; k < K_EXPERTS; k++) {
int expert_id = top_k_indices[k];
dispatch_group_async(group, ioQueue, ^{
// Each expert: ~6.75MB at 4-bit
char path[256];
snprintf(path, sizeof(path),
"packed_experts/layer_%02d_expert_%04d.bin",
layer, expert_id);
int fd = open(path, O_RDONLY);
// pread() — non-blocking, OS page cache handles LRU
pread(fd, expert_buffer[k], expert_size, 0);
close(fd);
});
}
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
// GPU compute follows — serial pipeline is hardware-optimal on Apple Silicon
Why pread() not mmap(): mmap incurs per-page fault overhead on cold data (~5x slower). Direct pread() with OS page cache achieves ~71% hit rate naturally.
GatedDeltaNet Linear Attention (BLAS)
The recurrence update uses Accelerate BLAS — 64% faster than scalar:
// GatedDeltaNet state update per head (conceptual pattern)
// state: 128×128 float matrix, 64 heads
// From infer.m
#import <Accelerate/Accelerate.h>
for (int h = 0; h < 64; h++) {
float* S = state + h * 128 * 128; // 128×128 state matrix
float* q = Q + h * 128;
float* k = K + h * 128;
float* v = V + h * 128;
// β·(k⊗v) outer product update
// cblas_sger: S += beta * (k ⊗ v)
cblas_sger(CblasRowMajor, 128, 128,
beta[h], k, 1, v, 1, S, 128);
// Decay: S = alpha * S
cblas_sscal(128 * 128, alpha[h], S, 1);
// Output: o = S @ q
cblas_sgemv(CblasRowMajor, CblasNoTrans,
128, 128, 1.0f, S, 128, q, 1, 0.0f,
output + h * 128, 1);
}
Performance Configuration
4-bit (production default)
- Quality: Excellent — full tool calling, correct JSON
- Speed: 4.36 tok/s
- Disk: 209GB
2-bit (speed testing only)
- Quality: Good — but breaks JSON/tool calling (
\name\instead of"name") - Speed: 5.74 tok/s (7.05 peak single token with warm cache)
- Disk: 120GB
- Uses
F_NOCACHEflag to avoid page cache thrashing
What NOT to Try (Learned from 58 Experiments)
| Approach | Why it fails |
|---|---|
mmap() expert fhow to use flash-moe-inference How to use flash-moe-inference on CursorAI-first code editor with Composer 1 PrerequisitesBefore installing skills in Cursor, ensure your development environment meets these requirements:
2 Execute installation commandExecute the skills CLI command in your project's root directory to begin installation: $npx skills add https://github.com/aradotso/trending-skills --skill flash-moe-inference The skills CLI fetches 3 Select Cursor when promptedThe CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor: ◆ Which agents do you want to install to? │ │ ── Universal (.agents/skills) ── always included ──── │ • Amp │ • Antigravity │ • Cline │ • Codex │ ●Cursor(selected) │ • Cursor │ • Windsurf 4 Verify installationConfirm successful installation by checking the skill directory location: .cursor/skills/flash-moe-inference Reload or restart Cursor to activate flash-moe-inference. Access the skill through slash commands (e.g., ⚠ Security & Verification NoticeWe perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use. Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment. List & Monetize Your SkillSubmit your Claude Code skill and start earning Use Cases▌User Story & Requirements GenerationCreate detailed user stories, acceptance criteria, and feature specs Example Generate user stories for 'password reset feature' with acceptance criteria, edge cases, and test scenarios ✓ Reduce spec writing time by 50%, ensure comprehensive coverage Competitive AnalysisResearch competitors, compare features, identify gaps Example Analyze 5 competitor products, create feature comparison matrix, suggest differentiation opportunities ✓ Complete competitive research in 2 hours instead of 2 days Roadmap PrioritizationEvaluate features using frameworks (RICE, ICE, Kano) and create prioritized backlogs Example Score 20 feature ideas using RICE framework, generate prioritized roadmap with rationale ✓ Make data-driven prioritization decisions faster Stakeholder CommunicationDraft PRDs, status updates, and stakeholder presentations Example Create executive summary of Q3 roadmap, monthly progress report, feature launch announcement ✓ Save 3-5 hours/week on communication overhead Implementation Guide▌Prerequisites
Time Estimate30-60 minutes to see productivity improvements Installation Steps
Common Pitfalls
Best Practices▌✓ Do
✗ Don't
💡 Pro Tips
When to Use This▌✓ Use WhenUse for user story writing, competitive research, roadmap prioritization, stakeholder communication, and PRD drafting. Best for reducing repetitive documentation and research work. ✗ Avoid WhenAvoid for strategic product vision (requires deep customer empathy), pricing decisions (needs market and financial expertise), or when face-to-face customer discovery is more valuable than speed. Learning Path▌
DiscussionProduct Hunt–style comments (not star reviews)
general reviews Ratings4.6★★★★★39 reviews
showing 1-10 of 39 1 / 4 |