rwkv-architecture
davila7/claude-code-templates · updated Apr 8, 2026
RWKV (RwaKuv) combines Transformer parallelization (training) with RNN efficiency (inference).
RWKV - Receptance Weighted Key Value
Quick start
Installation:
# Install PyTorch
pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade
# Install RWKV
pip install rwkv
Basic usage (GPT mode + RNN mode):
import os
# These flags are read when rwkv.model is imported, so set them first
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1'  # Compile and use the custom CUDA kernel for speed
from rwkv.model import RWKV
# Load model
model = RWKV(
    model='/path/to/RWKV-4-Pile-1B5-20220903-8040',
    strategy='cuda fp16'
)
# GPT mode (parallel processing)
out, state = model.forward([187, 510, 1563, 310, 247], None)
print(out.detach().cpu().numpy()) # Logits
# RNN mode (sequential processing, same result)
out, state = model.forward([187, 510], None) # First 2 tokens
out, state = model.forward([1563], state) # Next token
out, state = model.forward([310, 247], state) # Last tokens
print(out.detach().cpu().numpy()) # Same logits as above!
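Why both modes agree can be illustrated with a toy, single-channel version of the WKV weighted average. This is a minimal sketch in plain Python, not the library's kernel: the function names, the scalar decay `w` (assumed positive), and the bonus `u` are illustrative assumptions. The "parallel" form recomputes each position from the full history, while the recurrent form carries only two running sums.

```python
import math

def wkv_parallel(ks, vs, w, u):
    """WKV at every position, each computed from the full history (GPT-style view)."""
    outs = []
    for t in range(len(ks)):
        # Current token gets the bonus u instead of a decay
        num = math.exp(u + ks[t]) * vs[t]
        den = math.exp(u + ks[t])
        for i in range(t):
            # Older tokens are down-weighted by exp(-w) per step of distance
            weight = math.exp(-(t - 1 - i) * w + ks[i])
            num += weight * vs[i]
            den += weight
        outs.append(num / den)
    return outs

def wkv_recurrent(ks, vs, w, u, state=(0.0, 0.0)):
    """Same outputs, carrying only a fixed-size state (a, b) between calls."""
    a, b = state
    outs = []
    for k, v in zip(ks, vs):
        outs.append((a + math.exp(u + k) * v) / (b + math.exp(u + k)))
        a = math.exp(-w) * a + math.exp(k) * v
        b = math.exp(-w) * b + math.exp(k)
    return outs, (a, b)

ks = [0.1, -0.3, 0.5, 0.2, -0.1]
vs = [1.0, 2.0, -1.0, 0.5, 3.0]
w, u = 0.4, 0.2

full = wkv_parallel(ks, vs, w, u)
first, state = wkv_recurrent(ks[:2], vs[:2], w, u)         # first 2 tokens
rest, state = wkv_recurrent(ks[2:], vs[2:], w, u, state)   # remaining tokens
print(all(math.isclose(x, y) for x, y in zip(full, first + rest)))
```

Splitting the sequence at any point and passing the state along yields the same outputs, which mirrors the chunked `model.forward` calls above.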
Common workflows
Workflow 1: Text generation (streaming)
Efficient token-by-token generation:
from rwkv.model import RWKV
from rwkv.utils import PIPELINE
model = RWKV(model='RWKV-4-Pile-14B-20230313-ctx8192-test1050', strategy='cuda fp16')
pipeline = PIPELINE(model, "20B_tokenizer.json")
# Initial prompt
prompt = "The future of AI is"
# Encode the whole prompt and process it in one call (GPT mode)
out, state = model.forward(pipeline.encode(prompt), None)
# Continue generation one token at a time (RNN mode)
for _ in range(100):
    token = pipeline.sample_logits(out)  # Sample the next token ID from the logits
    print(pipeline.decode([token]), end='', flush=True)
    out, state = model.forward([token], state)  # Feed the sampled token back in
Key advantage: Constant memory per token (no growing KV cache)
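The sampling step in the generation loop turns logits into a token ID. As a rough illustration of what such a step typically does, here is a generic temperature plus nucleus (top-p) sampler in plain Python. It is a sketch under assumptions, not the `rwkv` package's implementation: the function name `sample_top_p` and its defaults are hypothetical.

```python
import math
import random

def sample_top_p(logits, temperature=1.0, top_p=0.85, rng=random):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    # Softmax with max-subtraction for numerical stability
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-likely tokens whose total mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample among the kept tokens, proportionally to their probability
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

rng = random.Random(0)
print(sample_top_p([0.0, 0.0, 5.0], top_p=0.5, rng=rng))
```

Lower `top_p` or `temperature` values make the output more deterministic; higher values increase diversity.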
Workflow 2: Long context processing (infinite context)
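Stream a very long sequence through the model in fixed-size chunks. Memory stays constant, but the fixed-size state is a lossy summary, so recall of distant details is not guaranteed:
from rwkv.model import RWKV
model = RWKV(model='RWKV-4-Pile-14B', strategy='cuda fp16')

def chunks(tokens, chunk_size):
    """Yield consecutive slices of at most chunk_size token IDs."""
    for i in range(0, len(tokens), chunk_size):
        yield tokens[i:i + chunk_size]

# Process very long document
state = None
long_document = load_document()  # Placeholder: must return a list of token IDs, e.g., 1M tokens
# Stream through entire document
for chunk in chunks(long_document, chunk_size=1024):
    out, state = model.forward(chunk, state)
# State now carries a fixed-size summary of the entire 1M token document
# State memory: O(1) in sequence length (constant, not O(n))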
Workflow 3: Fine-tuning RWKV
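Standard fine-tuning workflow (a sketch: the pip rwkv package is built for inference, so the model class here is assumed to be a trainable LightningModule such as the one in the RWKV-LM training code, and train_dataloader must be defined by you):
# Training script
import pytorch_lightning as pl
from rwkv.model import RWKV  # Assumption: resolves to a trainable LightningModule
# Configure model
config = {
    'n_layer': 24,
    'n_embd': 1024,
    'vocab_size': 50277,
    'ctx_len': 1024
}
# Setup trainer
trainer = pl.Trainer(
    accelerator='gpu',
    devices=8,
    precision='bf16',
    strategy='deepspeed_stage_2',
    max_epochs=1
)
# Train
model = RWKV(config)
trainer.fit(model, train_dataloader)  # train_dataloader: your torch DataLoader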
Workflow 4: RWKV vs Transformer comparison
Memory comparison (1M token sequence, fp16 values):
# Transformer (GPT)
# Memory: O(n²) for naive attention scores, O(n) for the KV cache
# KV cache: 1M × hidden_dim × n_layers × 2 (keys + values)
# Example: 1M × 4096 × 24 × 2 ≈ 197G values ≈ 393 GB in fp16 (impractical!)
# RWKV
# Memory: O(1) in sequence length
# State: 5 vectors per layer in RWKV-4, so 5 × hidden_dim × n_layers = 5 × 4096 × 24 ≈ 0.5M values ≈ 1 MB in fp16
# Roughly 400,000× smaller in this example
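The arithmetic behind this comparison can be reproduced with a short script. This is a back-of-the-envelope sketch with assumed parameters: 2 bytes per value (fp16) and, by default, five state vectors per layer as in the RWKV-4 inference state (pass `vectors_per_layer=1` for the single-vector simplification). The function names are illustrative, and activations and weights are ignored.

```python
def kv_cache_bytes(n_tokens, hidden_dim, n_layers, bytes_per_value=2):
    """Keys + values cached for every token in every layer."""
    return n_tokens * hidden_dim * n_layers * 2 * bytes_per_value

def rwkv_state_bytes(hidden_dim, n_layers, vectors_per_layer=5, bytes_per_value=2):
    """Fixed-size recurrent state; does not depend on sequence length."""
    return hidden_dim * n_layers * vectors_per_layer * bytes_per_value

kv = kv_cache_bytes(1_000_000, 4096, 24)
st = rwkv_state_bytes(4096, 24)
print(f"KV cache:   {kv / 1e9:.1f} GB")   # 393.2 GB
print(f"RWKV state: {st / 1e6:.2f} MB")   # 0.98 MB
print(f"Ratio:      {kv / st:,.0f}x")     # 400,000x
```

The ratio grows linearly with sequence length, since only the KV cache term depends on `n_tokens`.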
Speed comparison (inference):
# Transformer: O(n) per token (quadratic overall)
# First token: 1 computation
# Second token: 2 computations
# ...
# 1000th token: 1000 computations
# RWKV: O(1) per token (linear overall)
# Every token: 1 computation
# 1000th token: 1 computation (same as first!)
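The counting argument above can be checked directly. This small sketch treats "one computation" as one position visited per generated token, which is a simplification; the function names are illustrative.

```python
def attention_steps(n_tokens):
    """Positions attended over while generating n tokens one at a time."""
    return sum(range(1, n_tokens + 1))  # 1 + 2 + ... + n = n(n+1)/2

def recurrent_steps(n_tokens):
    """One fixed-cost state update per generated token."""
    return n_tokens

n = 1000
print(attention_steps(n))   # 500500
print(recurrent_steps(n))   # 1000
```

Doubling `n` roughly quadruples the first count and only doubles the second.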
When to use vs alternatives
Use RWKV when:
- Need very long context (100K+ tokens)
- Want constant memory usage
- Building streaming applications
- Need RNN efficiency with Transformer performance
- Memory-constrained deployment
Key advantages:
- Linear time: O(n) vs O(n²) for Transformers
- No KV cache: Constant memory per token
- No fixed context window: state size is constant for any length, though recall of distant tokens is lossy
- Parallelizable training: Like GPT
- Sequential inference: Like RNN
Use alternatives instead:
- Transformers: Need absolute best performance, have compute
- Mamba: Want state-space models
- RetNet: Need retention mechanism
- Hyena: Want convolution-based approach
Common issues
Issue: Out of memory during training
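Shard optimizer state, gradients, and parameters with DeepSpeed ZeRO-3, and enable gradient checkpointing in your training code if it supports it:
import pytorch_lightning as pl
trainer = pl.Trainer(
    accelerator='gpu',
    strategy='deepspeed_stage_3',  # Full ZeRO-3
    precision='bf16'
)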
Issue: Slow inference
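Enable the custom CUDA kernel. Set the flag before importing rwkv.model. The kernel is compiled on first use, so ninja and a working CUDA toolchain must be installed:
import os
os.environ["RWKV_CUDA_ON"] = '1'
from rwkv.model import RWKV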
Issue: Model not loading
Check model path and strategy:
model = RWKV(
    model='/absolute/path/to/model.pth',
    strategy='cuda fp16'  # Or 'cpu fp32' for CPU
)
Issue: State management in RNN mode
Always pass state between forward calls:
# WRONG: State lost
out1, _ = model.forward(tokens1, None)
out2, _ = model.forward(tokens2, None) # No context from tokens1!
# CORRECT: State preserved
out1, state = model.forward(tokens1, None)
out2, state = model.forward(tokens2, state) # Has context from tokens1
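A related pattern is reusing one processed prefix for several continuations. The sketch below uses a toy stand-in class (`ToyRecurrentModel` is hypothetical, not part of `rwkv`) with the same `forward(tokens, state)` shape, and assumes the model may update the state object in place, which is why each branch gets its own copy.

```python
import copy

class ToyRecurrentModel:
    """Stand-in with the same forward(tokens, state) -> (out, state) shape."""
    def forward(self, tokens, state):
        if state is None:
            state = [0.0]
        for t in tokens:
            state[0] = 0.5 * state[0] + t  # In-place update of the state
        return state[0], state

model = ToyRecurrentModel()
_, prefix_state = model.forward([1, 2, 3], None)  # Process the shared prefix once

# Branch twice from the same prefix; deepcopy keeps the snapshot untouched
out_a, _ = model.forward([4], copy.deepcopy(prefix_state))
out_b, _ = model.forward([5], copy.deepcopy(prefix_state))
print(out_a, out_b, prefix_state)
```

Without the copies, the second branch would start from a state that already includes the first branch's token.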
Advanced topics
Time-mixing and channel-mixing: See references/architecture-details.md for WKV operation, time-decay mechanism, and receptance gates.
State management: See references/state-management.md for att_x_prev, att_kv, ffn_x_prev states, and numerical stability considerations.
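The numerical stability concern can be illustrated with a toy, single-channel WKV step. A naive state stores sums of `exp(k)` terms, which overflow for large keys; a common remedy is to store the sums scaled by a running maximum exponent. The sketch below follows that idea in plain Python. The names `wkv_step_stable` and `INIT`, the positive scalar decay `w`, and the bonus `u` are illustrative assumptions, not the referenced files' contents.

```python
import math

def wkv_step_stable(k, v, w, u, state):
    """One WKV step with state (a, b, p); a and b are stored scaled by exp(-p)."""
    a, b, p = state
    # Output: mix the (rescaled) history with the current token's bonus term
    q = max(p, u + k)
    e1, e2 = math.exp(p - q), math.exp(u + k - q)
    out = (e1 * a + e2 * v) / (e1 * b + e2)
    # State update: decay history by w, add the current token, re-normalize
    q = max(p - w, k)
    e1, e2 = math.exp(p - w - q), math.exp(k - q)
    return out, (e1 * a + e2 * v, e1 * b + e2, q)

INIT = (0.0, 0.0, -1e30)  # Empty history: zero sums, very negative exponent

# Keys this large would overflow math.exp() in a naive implementation
state = INIT
for k, v in [(1000.0, 2.0), (1001.0, 4.0)]:
    out, state = wkv_step_stable(k, v, w=0.5, u=0.0, state=state)
    print(out)
```

Every exponent passed to `math.exp` is at most zero, so the step cannot overflow regardless of key magnitude.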
RWKV-7 improvements: See …