perplexityai/pplx-garden is Perplexity AI's public commitment to open-source inference research. It is not a demo or a tutorial repo — it is the actual infrastructure Perplexity uses at scale, packaged for external teams to build on.
Two projects live here today: fabric-lib, an RDMA-based communication library and P2P MoE dispatch kernel, and pplx-unigram, a high-performance unigram tokenizer encoder. Both are written in Rust with Python bindings.
Primary repo: perplexityai/pplx-garden
Quick reference
| Item | Details |
|---|---|
| Organization | Perplexity AI |
| Purpose | Open-source inference technology |
| Primary language | Rust (Python bindings via python-ext) |
| License | MIT |
| Repository signals | ~509 stars, ~54 forks (at capture time) |
| Projects | fabric-lib, pplx-unigram |
| Research backing | MLSys'26 paper on fabric-lib |
| Hardware targets | NVIDIA ConnectX-7, AWS EFA, NVLink |
Why this matters
Scaling LLMs beyond a single node surfaces two hard problems: communication overhead between GPUs and tokenization throughput at the input boundary. Both bottlenecks compound as model size grows into the trillion-parameter range.
pplx-garden attacks both directly:
- fabric-lib makes inter-GPU communication fast enough that Mixture-of-Experts routing does not become the bottleneck
- pplx-unigram squeezes every cycle out of the CPU-side tokenization path before inference even starts
These are not prototype implementations. Perplexity published a peer-reviewed MLSys'26 paper on fabric-lib and has documented production deployment across AWS EFA and NVIDIA ConnectX-7 clusters.
fabric-lib: RDMA inference communication
What it does
fabric-lib is a two-part library:
- RDMA TransferEngine — a general-purpose RDMA communication library for moving GPU memory between nodes
- P2P MoE dispatch/combine kernel — optimized All-to-All communication for Mixture-of-Experts model routing
The combination enables trillion-parameter MoE models to route tokens across expert nodes with minimal latency overhead.
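To make the dispatch/combine terminology concrete, here is a minimal single-process sketch in plain Python. It models only the data-movement semantics (which token goes to which expert, and how results are merged back into token order), not fabric-lib's API. The function names `dispatch`, `run_experts`, and `combine`, the toy experts, and the routing weights are all illustrative assumptions.

```python
from collections import defaultdict

def dispatch(tokens, routes):
    """Group tokens by destination expert (the 'send' side of All-to-All).

    tokens: list of floats standing in for hidden states.
    routes: per-token list of (expert_id, weight) pairs, e.g. top-2 routing.
    Returns {expert_id: [(token_index, token_value, weight), ...]}.
    """
    per_expert = defaultdict(list)
    for idx, (value, picks) in enumerate(zip(tokens, routes)):
        for expert_id, weight in picks:
            per_expert[expert_id].append((idx, value, weight))
    return dict(per_expert)

def run_experts(per_expert, experts):
    """Apply each (hypothetical) expert function to the tokens routed to it."""
    return {
        eid: [(idx, experts[eid](val), w) for idx, val, w in items]
        for eid, items in per_expert.items()
    }

def combine(num_tokens, expert_outputs):
    """Weighted sum of expert results, written back in original token order."""
    out = [0.0] * num_tokens
    for results in expert_outputs.values():
        for idx, value, weight in results:
            out[idx] += weight * value
    return out

# Toy experts: expert e multiplies its input by (e + 1).
experts = {e: (lambda x, s=e + 1: x * s) for e in range(4)}
tokens = [1.0, 2.0, 3.0]
routes = [[(0, 0.5), (1, 0.5)], [(1, 1.0)], [(2, 0.25), (3, 0.75)]]

sent = dispatch(tokens, routes)
merged = combine(len(tokens), run_experts(sent, experts))
print(merged)  # [1.5, 4.0, 11.25]
```

In a real multi-node deployment each expert lives on a different GPU, so the grouping step becomes network traffic in one direction and the weighted sum becomes traffic in the other. That round trip is the cost the dispatch/combine kernel is designed to reduce.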
Architecture decisions
NVLink + RDMA hybrid: intra-node transfers use NVLink (high bandwidth, zero-copy between local GPUs), inter-node transfers use RDMA directly from GPU memory (GPUDirect RDMA). This avoids CPU involvement on the critical path.
SM-free RDMA transfer: the kernel does not consume Streaming Multiprocessor compute during data movement. On GPUs already saturated with matrix math, this is critical — network ops compete for SMs in naive implementations.
CUDA Graph support: captures the communication pattern as a CUDA Graph, enabling repeated execution without re-submission overhead. Relevant for decode loops where the same routing pattern fires thousands of times.
Split send/recv stages: dispatch and combine are split into separate send and receive phases, enabling micro-batching and overlap with compute.
Reliable unordered transport: fabric-lib uses a reliable unordered protocol rather than a reliable ordered one (such as RC in InfiniBand). Because packets may arrive in any order, a single transfer can be spread across multiple NICs per GPU without waiting on in-order delivery.
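One way to see why ordering is not required: if every chunk carries its destination offset, the receiver can place chunks in whatever order they arrive and detect completion by counting bytes. The sketch below simulates that in plain Python. The helper names `stripe` and `receive_unordered`, the round-robin NIC assignment, and the byte-count completion rule are assumptions for illustration, not a description of fabric-lib internals.

```python
import random

def stripe(payload: bytes, chunk_size: int, num_nics: int):
    """Split payload into (nic, offset, data) chunks, round-robin over NICs."""
    chunks = []
    for i, offset in enumerate(range(0, len(payload), chunk_size)):
        chunks.append((i % num_nics, offset, payload[offset:offset + chunk_size]))
    return chunks

def receive_unordered(chunks, total_len: int) -> bytes:
    """Place chunks by offset; completion is a byte count, not a sequence number."""
    buf = bytearray(total_len)
    received = 0
    for _nic, offset, data in chunks:
        buf[offset:offset + len(data)] = data
        received += len(data)
    if received != total_len:
        raise ValueError("transfer incomplete")
    return bytes(buf)

payload = bytes(range(256)) * 4          # 1 KiB of test data
chunks = stripe(payload, chunk_size=100, num_nics=2)
random.Random(0).shuffle(chunks)         # simulate arbitrary arrival order
result = receive_unordered(chunks, len(payload))
print(result == payload)                 # True
```

An ordered transport would force the receiver to wait for chunk N before accepting chunk N+1, so two NICs finishing at different times would stall each other. Offset-addressed writes remove that dependency.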
System requirements
- Linux kernel 5.12+ (DMA-BUF support)
- CUDA 12.8+
- libfabric
- libibverbs
- GDRCopy
- SYS_PTRACE and SYS_ADMIN capabilities (for pidfd_getfd)
- RDMA network with GPUDirect RDMA
Each GPU needs at least one dedicated RDMA NIC. The library supports aggregating multiple NICs per GPU.
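Before attempting a build, it can help to check the kernel requirement programmatically. The following is a small preflight sketch using only the Python standard library. The 5.12 minimum comes from the requirements listed above; the function names are hypothetical, and the check covers only the kernel version, not CUDA, libfabric, or NIC presence.

```python
import platform
import re

MIN_KERNEL = (5, 12)  # minimum from the requirements above (DMA-BUF support)

def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor) from strings like '5.15.0-105-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))

def kernel_ok(release: str, minimum: tuple = MIN_KERNEL) -> bool:
    """True if the release is at or above the minimum (major, minor)."""
    return parse_kernel_release(release) >= minimum

# Report on the machine running this script.
release = platform.release()
try:
    status = "ok" if kernel_ok(release) else "too old"
except ValueError:
    status = "unknown format"
print(f"kernel {release}: {status}")
```

Tuple comparison handles the major/minor ordering correctly, so 6.1 passes and 5.4 fails without any string comparison pitfalls.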
Performance benchmarks
Decode (128 tokens) — Dispatch and Combine latency:
| Config | pplx-EFA (D) | pplx-CX7 (D) | DeepEP-CX7 (D) | pplx-EFA (C) | pplx-CX7 (C) | DeepEP-CX7 (C) |
|---|---|---|---|---|---|---|
| EP64 | 266.7 μs | 187.5 μs | 177.9 μs | 391.2 μs | 309.1 μs | 325.0 μs |
| EP32 | 229.1 μs | 153.9 μs | 159.1 μs | 335.0 μs | 266.3 μs | 285.0 μs |
| EP16 | 214.8 μs | 110.2 μs | 123.9 μs | 241.5 μs | 185.5 μs | 203.0 μs |
| EP8 | 49.7 μs | 50.5 μs | 42.6 μs | 64.2 μs | 65.3 μs | 72.0 μs |
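D = Dispatch, C = Combine. pplx-CX7 is faster than DeepEP-CX7 on combine at every EP size shown and on dispatch at EP16 and EP32; DeepEP-CX7 is faster on dispatch at EP8 and EP64. pplx-EFA matches pplx-CX7 at EP8 and is slower at EP16 and above (for example, 214.8 μs vs 110.2 μs for dispatch at EP16), but it stays within the same order of magnitude on AWS without dedicated InfiniBand hardware.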
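The relative gaps are easier to judge as percentages. The snippet below recomputes them from the pplx-CX7 and DeepEP-CX7 columns of the table; the numbers are copied from the table, and the helper name `relative_diff` is just an illustrative choice.

```python
# Latencies in microseconds, copied from the decode table above.
# Each entry: (pplx-CX7, DeepEP-CX7).
dispatch = {"EP64": (187.5, 177.9), "EP32": (153.9, 159.1),
            "EP16": (110.2, 123.9), "EP8": (50.5, 42.6)}
combine = {"EP64": (309.1, 325.0), "EP32": (266.3, 285.0),
           "EP16": (185.5, 203.0), "EP8": (65.3, 72.0)}

def relative_diff(pplx: float, deepep: float) -> float:
    """Percent by which pplx is faster (positive) or slower (negative)."""
    return (deepep - pplx) / deepep * 100.0

for name, table in (("dispatch", dispatch), ("combine", combine)):
    for config, (pplx, deepep) in table.items():
        print(f"{name:8s} {config:4s} {relative_diff(pplx, deepep):+6.1f}%")
```

With these inputs, pplx-CX7 is about 11% faster on dispatch at EP16 and about 5% slower at EP64, while combine is faster by roughly 5% to 9% across all four configurations.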
Getting started with fabric-lib
    # Build the dev container
    docker build -t pplx-garden-dev - < docker/dev.Dockerfile
    ./scripts/run-docker.sh

    # Build and test the network benchmark
    cargo build --release --bin fabric-debug

    # Server
    ./target/release/fabric-debug 0,1,2,3,4,5,6,7 2

    # Client (replace fe80xxxx with server's printed address)
    ./target/release/fabric-debug 0,1,2,3,4,5,6,7 2 fe80xxxx
Install as Python wheel:
    export TORCH_CMAKE_PREFIX_PATH=$(python3 -c "import torch; print(torch.utils.cmake_prefix_path)")
    python3 -m build --wheel
    python3 -m pip install dist/*.whl
Run the All-to-All benchmark across multiple nodes:
    python3 -m benchmarks.bench_all_to_all \
        --world-size $((NUM_NODES * 8)) \
        --nets-per-gpu 2 \
        --init-method=tcp://$MASTER_IP:29500 \
        --node-rank=$NODE_RANK \
        --nvlink=8
pplx-unigram: fast tokenizer encoder
What it does
pplx-unigram encodes text into token IDs using the SentencePiece unigram pipeline:
- Precompiled charsmap normalization: Unicode normalization driven by a precompiled character map
- Metaspace pre-tokenization: SentencePiece-style whitespace handling
- Viterbi segmentation: optimal tokenization via Viterbi decoding
- Special-token splitting: handles `<s>`, `</s>`, `[MASK]`, etc.
It loads standard HuggingFace tokenizer.json files directly — no conversion step required.
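To illustrate the middle of that pipeline, here is a compact Python sketch of Metaspace pre-tokenization followed by Viterbi segmentation over unigram log-probabilities. It is a textbook version, not pplx-unigram's implementation: the toy vocabulary and its scores are made up, the always-prepended U+2581 prefix is one of several possible configurations, and unknown characters raise an error instead of falling back to an unknown token.

```python
import math

def metaspace(text: str) -> str:
    """SentencePiece-style pre-tokenization: spaces become U+2581, plus a prefix."""
    return "\u2581" + text.replace(" ", "\u2581")

def viterbi_segment(text: str, vocab: dict, max_piece_len: int = 16):
    """Return the highest-scoring segmentation of text under unigram log-probs.

    best[i] is the best total score for text[:i]; back[i] is the start index
    of the last piece on that best path.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)
    back = [-1] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_piece_len), end):
            piece = text[start:end]
            if piece in vocab and best[start] + vocab[piece] > best[end]:
                best[end] = best[start] + vocab[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, pos = [], n
    while pos > 0:
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    return pieces[::-1]

# Toy vocabulary with made-up log-probabilities (not from any real tokenizer).
vocab = {"\u2581": -4.0, "\u2581the": -2.0, "\u2581fox": -3.0, "\u2581f": -5.0,
         "ox": -4.5, "t": -6.0, "h": -6.0, "e": -6.0, "f": -6.0, "o": -6.0, "x": -6.0}

print(viterbi_segment(metaspace("the fox"), vocab))
```

The whole-word pieces win here because their combined score (-5.0) beats any path through shorter pieces, such as the prefix piece plus "ox" (-9.5) for the second word.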
Why it is fast
The Viterbi search runs over a double-array trie packed one node per cache line. This layout minimizes cache misses during the trie traversal, which is the hot loop in tokenization. On long inputs, cache-friendly trie access compounds into meaningful throughput gains.
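The trie operation that dominates a unigram Viterbi search is a common-prefix search: at each input position, find every vocabulary piece that starts there. The sketch below shows that operation with a plain dictionary-based trie in Python. This is deliberately not a double-array trie and says nothing about cache-line packing; it only shows the access pattern that such a layout is meant to speed up. All names are illustrative.

```python
class TrieNode:
    __slots__ = ("children", "score")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.score = None    # log-prob if a vocabulary piece ends at this node

def build_trie(vocab: dict) -> TrieNode:
    """Insert every vocabulary piece, storing its score at the final node."""
    root = TrieNode()
    for piece, score in vocab.items():
        node = root
        for ch in piece:
            node = node.children.setdefault(ch, TrieNode())
        node.score = score
    return root

def common_prefix_search(root: TrieNode, text: str, start: int):
    """Yield (end, score) for every vocabulary piece beginning at text[start]."""
    node = root
    for pos in range(start, len(text)):
        node = node.children.get(text[pos])
        if node is None:
            return               # no longer piece can match; stop walking
        if node.score is not None:
            yield pos + 1, node.score

vocab = {"a": -3.0, "ab": -2.5, "abc": -2.0, "b": -3.5, "bc": -3.0}
root = build_trie(vocab)
matches = list(common_prefix_search(root, "abcd", 0))
print(matches)  # [(1, -3.0), (2, -2.5), (3, -2.0)]
```

Each step of the walk is a dependent memory lookup, so on long inputs the cost is dominated by how many cache lines those lookups touch. A dictionary-of-objects trie scatters nodes across the heap; a flat array layout keeps each node's data contiguous.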
Quick start
    # Get a unigram tokenizer (XLM-R as example)
    # Download tokenizer.json from HuggingFace
    cargo run --release --example encode -p pplx-unigram -- \
        path/to/tokenizer.json "The quick brown fox jumps over the lazy dog."
Repo structure
    fabric-lib/            RDMA TransferEngine library
    p2p-all-to-all/        P2P MoE All-to-All implementation
    pplx-unigram/          Unigram tokenizer encoder
    python-ext/            Python extension module from Rust code
    python/pplx_garden/    Python package
    rust/                  Rust utility libraries
    benchmarks/            Performance benchmarks
    docker/                Dev container definitions
    docs/                  Documentation for each project
    scripts/               Helper scripts
    tests/                 Integration tests
The repo is a Cargo workspace with Python bindings generated via the python-ext crate, making it callable from PyTorch-based training and inference code.
Research backing
| Publication | Link |
|---|---|
| MLSys'26 paper | fabric-lib: RDMA Point-to-Point Communication for LLM Systems |
| Blog: RDMA P2P comm | RDMA Point-to-Point Communication for LLM Systems |
| Blog: AWS EFA | Enabling Trillion-Parameter Models on AWS EFA |
| Blog: RL weight transfer | Weight Transfer for RL Post-Training in under 2 seconds |
| Blog: Disaggregated prefill | Disaggregated Prefill and Decode |
| Tokenizer blog | Improving Unigram Tokenizer CPU Performance |
Who should look at this
LLM serving infrastructure teams running MoE models at multi-node scale. If your current bottleneck is all-to-all communication latency during expert routing, fabric-lib is directly relevant.
Teams on AWS with EFA: the pplx-EFA benchmark numbers show competitive latency without InfiniBand. This is meaningful for teams that need multi-node MoE without dedicated networking hardware.
Tokenization throughput teams: if you are preprocessing billions of documents and unigram tokenization is measurable in your pipeline, pplx-unigram's cache-line-aligned trie is worth benchmarking against your current path.
RL post-training teams: the weight transfer blog documents under-2-second model weight sync between training and inference replicas using fabric-lib. If you are doing RLHF or online RL at scale, this is one of the harder infrastructure problems — seeing a production solution is valuable.
Constraints and considerations
| Area | What to verify |
|---|---|
| Hardware requirement | RDMA NIC with GPUDirect RDMA support per GPU — not available everywhere |
| Kernel version | Linux 5.12+ for DMA-BUF; older kernels need fallback paths |
| CUDA version | CUDA 12.8+ required; rules out deployments pinned to older CUDA toolkits |
| Capabilities | SYS_PTRACE and SYS_ADMIN needed — requires root, sudo, or Docker with explicit cap-add |
| Rust-first codebase | Python users interact via compiled wheel; debugging inside the library requires Rust familiarity |
| MoE-specific optimization | If you run dense models, fabric-lib's All-to-All kernel is not your bottleneck |
None of these are design flaws — they reflect real hardware requirements for RDMA at this performance level.
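The capability requirement in the table is the one most likely to fail silently inside a container. A quick way to check it is to read the effective capability mask from /proc and test the two relevant bits. The sketch below does that in Python; the bit numbers (19 for CAP_SYS_PTRACE, 21 for CAP_SYS_ADMIN) come from the Linux capability header, while the function names are hypothetical.

```python
CAP_SYS_PTRACE = 19  # bit numbers as defined in <linux/capability.h>
CAP_SYS_ADMIN = 21

def has_capability(cap_eff_hex: str, cap_bit: int) -> bool:
    """Test one bit of the effective capability mask (the CapEff field)."""
    return bool((int(cap_eff_hex, 16) >> cap_bit) & 1)

def read_cap_eff(status_text: str) -> str:
    """Pull the CapEff hex string out of /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("CapEff line not found")

def missing_caps(status_text: str) -> list:
    """Names of required capabilities that are absent from the mask."""
    mask = read_cap_eff(status_text)
    required = {"SYS_PTRACE": CAP_SYS_PTRACE, "SYS_ADMIN": CAP_SYS_ADMIN}
    return [name for name, bit in required.items() if not has_capability(mask, bit)]

# Check the current process; fall back gracefully on non-Linux systems.
try:
    with open("/proc/self/status") as f:
        print("missing capabilities:", missing_caps(f.read()) or "none")
except (OSError, ValueError):
    print("could not read CapEff from /proc/self/status")
```

If either name is reported missing inside Docker, the container was likely started without `--cap-add SYS_PTRACE --cap-add SYS_ADMIN`.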
Comparison: pplx-garden vs other inference communication libraries
| Dimension | pplx-garden (fabric-lib) | DeepEP | NCCL |
|---|---|---|---|
| MoE dispatch | First-class, SM-free | First-class | Via AllToAll |
| CUDA Graph | Supported | Supported | Supported |
| AWS EFA | Supported | Not documented | Via AWS plugin |
| NIC aggregation | Multiple NICs per GPU | Not documented | Not native |
| Rust implementation | Yes | No (C++/CUDA) | No (C/CUDA) |
| MLSys peer review | Yes (MLSys'26) | No | N/A |
| License | MIT | MIT | BSD 3-Clause |
Bottom line
pplx-garden is one of the most technically credible open-source releases in the LLM inference infrastructure space in 2026. fabric-lib solves a real problem — MoE all-to-all communication at low latency — with production benchmarks, a peer-reviewed paper, and NIC aggregation support that few alternatives offer.
The hardware bar is real: you need RDMA NICs, GPUDirect support, and Linux 5.12+. This is not a drop-in replacement for NCCL in commodity setups. But for teams operating at the scale where these constraints are already satisfied, pplx-garden is worth a serious look.
For the tokenization side, pplx-unigram is narrower in scope but practically useful: a well-engineered Rust implementation that slots into any HuggingFace-based pipeline without conversion overhead.
Repository metrics and benchmark data are based on the public repo snapshot as of May 2026 and can change. Verify on the upstream project before making infrastructure decisions.