
pplx-garden: Perplexity's open-source inference technology stack explained

A deep dive into perplexityai/pplx-garden: the RDMA fabric library, P2P MoE dispatch kernels, unigram tokenizer, and what it means for teams building large-scale LLM infrastructure.

Yash Thakker · 8 min read
Tags: Open source, Inference, RDMA, LLM infrastructure, Perplexity AI, MoE


perplexityai/pplx-garden is Perplexity AI's public commitment to open-source inference research. It is not a demo or a tutorial repo — it is the actual infrastructure Perplexity uses at scale, packaged for external teams to build on.

Two projects live here today: fabric-lib, an RDMA-based communication library and P2P MoE dispatch kernel, and pplx-unigram, a high-performance unigram tokenizer encoder. Both are written in Rust with Python bindings.

Primary repo: perplexityai/pplx-garden


Quick reference

| Item | Details |
| --- | --- |
| Organization | Perplexity AI |
| Purpose | Open-source inference technology |
| Primary language | Rust (Python bindings via python-ext) |
| License | MIT |
| Repository signals | ~509 stars, ~54 forks (at capture time) |
| Projects | fabric-lib, pplx-unigram |
| Research backing | MLSys'26 paper on fabric-lib |
| Hardware targets | NVIDIA ConnectX-7, AWS EFA, NVLink |

Why this matters

Scaling LLMs beyond a single node surfaces two hard problems: communication overhead between GPUs and tokenization throughput at the input boundary. Both bottlenecks compound as model size grows into the trillion-parameter range.

pplx-garden attacks both directly:

  • fabric-lib makes inter-GPU communication fast enough that Mixture-of-Experts routing does not become the bottleneck
  • pplx-unigram squeezes every cycle out of the CPU-side tokenization path before inference even starts

These are not prototype implementations. Perplexity published a peer-reviewed MLSys'26 paper on fabric-lib and has documented production deployment across AWS EFA and NVIDIA ConnectX-7 clusters.


fabric-lib: RDMA inference communication

What it does

fabric-lib is a two-part library:

  1. RDMA TransferEngine — a general-purpose RDMA communication library for moving GPU memory between nodes
  2. P2P MoE dispatch/combine kernel — optimized All-to-All communication for Mixture-of-Experts model routing

The combination enables trillion-parameter MoE models to route tokens across expert nodes with minimal latency overhead.
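To make the dispatch/combine terminology concrete, here is a minimal single-process sketch of what the two phases compute logically. It is not fabric-lib's API: the function names, the toy experts, and the routing weights are all hypothetical, and the real kernels move GPU tensors over NVLink and RDMA rather than Python lists.

```python
from collections import defaultdict

def dispatch(tokens, routing):
    """Group tokens by destination expert (hypothetical helper, not fabric-lib API).

    routing[i] is a list of (expert_id, weight) pairs chosen for token i.
    """
    per_expert = defaultdict(list)
    for idx, (value, routes) in enumerate(zip(tokens, routing)):
        for expert, weight in routes:
            per_expert[expert].append((idx, weight, value))
    return dict(per_expert)

def combine(num_tokens, expert_outputs):
    """Weighted sum of expert outputs, restored to the original token order."""
    out = [0.0] * num_tokens
    for results in expert_outputs.values():
        for idx, weight, value in results:
            out[idx] += weight * value
    return out

# Toy "experts": each one is just a scalar function (an assumption for illustration).
experts = {0: lambda x: x * 10, 1: lambda x: x * 100, 2: lambda x: -x}

tokens = [1.0, 2.0, 3.0]
routing = [[(0, 0.5), (1, 0.5)], [(1, 1.0)], [(0, 0.25), (2, 0.75)]]

buckets = dispatch(tokens, routing)
processed = {
    e: [(idx, w, experts[e](v)) for idx, w, v in items]
    for e, items in buckets.items()
}
result = combine(len(tokens), processed)
print(result)
```

In a multi-node deployment each expert bucket lives on a different GPU, so both phases become all-to-all exchanges, which is the traffic pattern the P2P kernel targets.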

Architecture decisions

NVLink + RDMA hybrid: intra-node transfers use NVLink (high bandwidth, zero-copy between local GPUs), inter-node transfers use RDMA directly from GPU memory (GPUDirect RDMA). This avoids CPU involvement on the critical path.
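The hybrid routing rule can be sketched as a simple rank calculation. This is an illustration under stated assumptions, not code from the repo: `transport_for` is a hypothetical helper, and it assumes ranks are numbered contiguously with 8 GPUs per node.

```python
def transport_for(src_rank: int, dst_rank: int, gpus_per_node: int = 8) -> str:
    """Pick the path a transfer would take (hypothetical helper).

    Assumes ranks are assigned contiguously per node, so integer division
    by gpus_per_node yields the node index.
    """
    if src_rank == dst_rank:
        return "local"
    same_node = src_rank // gpus_per_node == dst_rank // gpus_per_node
    return "nvlink" if same_node else "rdma"

# From rank 0 in a 16-way expert-parallel group: which path reaches each peer?
paths = [transport_for(0, peer) for peer in range(16)]
print(paths.count("nvlink"), "NVLink peers,", paths.count("rdma"), "RDMA peers")
```

Under these assumptions, half or more of every rank's peers sit behind the network once expert parallelism spans more than one node, which is why the inter-node path dominates the latency budget.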

SM-free RDMA transfer: the kernel does not consume Streaming Multiprocessor compute during data movement. On GPUs already saturated with matrix math, this is critical — network ops compete for SMs in naive implementations.

CUDA Graph support: captures the communication pattern as a CUDA Graph, enabling repeated execution without re-submission overhead. Relevant for decode loops where the same routing pattern fires thousands of times.

Split send/recv stages: dispatch and combine are split into separate send and receive phases, enabling micro-batching and overlap with compute.
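One way to picture the benefit of split stages is the order in which operations get issued when micro-batches are pipelined. The sketch below only builds that issue order; the operation names are hypothetical labels, not fabric-lib calls, and real overlap also depends on CUDA stream and event semantics that are not modeled here.

```python
def issue_order(num_microbatches: int):
    """Return the issue order for a two-stage software pipeline.

    Between sending micro-batch i and waiting for it to arrive, the GPU runs
    the experts for micro-batch i-1, so the transfer is hidden behind compute.
    All operation names are illustrative labels.
    """
    ops = []
    for mb in range(num_microbatches):
        ops.append(("dispatch_send", mb))           # start the transfer, do not wait
        if mb > 0:
            ops.append(("expert_compute", mb - 1))  # overlaps with the send above
        ops.append(("dispatch_recv", mb))           # now wait for arrival
    if num_microbatches > 0:
        ops.append(("expert_compute", num_microbatches - 1))  # drain the pipeline
    return ops

for op in issue_order(2):
    print(op)
```

With a single fused dispatch call there is no point between "start" and "finish" where other work could be slotted in, which is the design motivation for the split.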

Reliable unordered transport: fabric-lib uses a reliable unordered protocol rather than a reliable ordered one (such as RC in InfiniBand). Packets may be reordered in flight, which makes it straightforward to spread one transfer across multiple NICs per GPU.
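Why unordered delivery and NIC aggregation fit together can be shown with a toy receiver. This is a sketch under assumptions, not fabric-lib's protocol: `stripe` and `Receiver` are hypothetical names, and the idea illustrated is only that each write carries its own destination offset and completion is detected by counting, so arrival order does not matter.

```python
import random

def stripe(total_bytes: int, chunk_bytes: int, num_nics: int):
    """Split one transfer into (nic, offset, length) writes, round-robin over NICs."""
    writes = []
    for i, offset in enumerate(range(0, total_bytes, chunk_bytes)):
        length = min(chunk_bytes, total_bytes - offset)
        writes.append((i % num_nics, offset, length))
    return writes

class Receiver:
    """Places each write by offset and counts arrivals (illustrative only)."""

    def __init__(self, size: int, expected_writes: int):
        self.buf = bytearray(size)
        self.remaining = expected_writes

    def on_write(self, offset: int, payload: bytes) -> bool:
        self.buf[offset:offset + len(payload)] = payload
        self.remaining -= 1
        return self.remaining == 0  # True once every write has landed

payload = bytes(range(256)) * 4            # 1024 bytes of test data
writes = stripe(len(payload), 100, 2)      # 11 writes over 2 NICs
random.Random(0).shuffle(writes)           # simulate out-of-order arrival

rx = Receiver(len(payload), len(writes))
done = False
for nic, offset, length in writes:
    done = rx.on_write(offset, payload[offset:offset + length])

print(done, bytes(rx.buf) == payload)
```

An ordered transport would have to serialize or re-sequence the two NIC streams; with offset-addressed writes and a completion counter, neither is needed.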

System requirements

Linux Kernel 5.12+      (DMA-BUF support)
CUDA 12.8+
libfabric
libibverbs
GDRCopy
SYS_PTRACE + SYS_ADMIN  (for pidfd_getfd)
RDMA network with GPUDirect RDMA

Each GPU needs at least one dedicated RDMA NIC. The library supports aggregating multiple NICs per GPU.
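Before attempting a build, a quick host check can catch the most common mismatches. The script below is a hedged sketch that covers only part of the list above (kernel version and the presence of libfabric and libibverbs); it does not verify CUDA, GDRCopy, capabilities, or GPUDirect RDMA, and `preflight` is a hypothetical helper rather than anything shipped in the repo.

```python
import ctypes.util
import platform
import re

def parse_kernel(release: str):
    """Extract (major, minor) from a kernel release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))

def preflight(release=None):
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    release = release or platform.release()
    problems = []
    try:
        if parse_kernel(release) < (5, 12):
            problems.append(f"kernel {release} is older than 5.12 (DMA-BUF support)")
    except ValueError as exc:
        problems.append(str(exc))
    # find_library returns None when the shared library cannot be located.
    for lib in ("fabric", "ibverbs"):
        if ctypes.util.find_library(lib) is None:
            problems.append(f"lib{lib} not found")
    return problems

if __name__ == "__main__":
    for problem in preflight():
        print("WARN:", problem)
```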

Performance benchmarks

Decode (128 tokens) — Dispatch and Combine latency:

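| Config | pplx-EFA (D) | pplx-CX7 (D) | DeepEP-CX7 (D) | pplx-EFA (C) | pplx-CX7 (C) | DeepEP-CX7 (C) |
| --- | --- | --- | --- | --- | --- | --- |
| EP64 | 266.7 μs | 187.5 μs | 177.9 μs | 391.2 μs | 309.1 μs | 325.0 μs |
| EP32 | 229.1 μs | 153.9 μs | 159.1 μs | 335.0 μs | 266.3 μs | 285.0 μs |
| EP16 | 214.8 μs | 110.2 μs | 123.9 μs | 241.5 μs | 185.5 μs | 203.0 μs |
| EP8 | 49.7 μs | 50.5 μs | 42.6 μs | 64.2 μs | 65.3 μs | 72.0 μs |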

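D = Dispatch, C = Combine. pplx-CX7 beats DeepEP-CX7 on both dispatch and combine at EP16 and EP32, and on combine at EP8 and EP64; DeepEP-CX7 is faster on dispatch at EP8 and EP64. pplx-EFA is slower than both ConnectX-7 columns at EP16 and above, but stays within roughly 2x of them on AWS without dedicated InfiniBand hardware.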
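The ConnectX-7 comparison is easy to recheck from the table. The snippet below copies the pplx-CX7 and DeepEP-CX7 columns verbatim and reports the faster library and the relative gap for each configuration and phase; the variable names are mine, and nothing here is measured independently.

```python
# (pplx-CX7, DeepEP-CX7) latencies in microseconds, copied from the table above.
latency_us = {
    "EP64": {"dispatch": (187.5, 177.9), "combine": (309.1, 325.0)},
    "EP32": {"dispatch": (153.9, 159.1), "combine": (266.3, 285.0)},
    "EP16": {"dispatch": (110.2, 123.9), "combine": (185.5, 203.0)},
    "EP8":  {"dispatch": (50.5, 42.6),   "combine": (65.3, 72.0)},
}

def faster(pplx: float, deepep: float) -> str:
    """Name of the library with the lower latency."""
    return "pplx-CX7" if pplx < deepep else "DeepEP-CX7"

winners = {
    (config, phase): faster(*pair)
    for config, phases in latency_us.items()
    for phase, pair in phases.items()
}

for (config, phase), winner in winners.items():
    pplx, deepep = latency_us[config][phase]
    gap_pct = abs(deepep - pplx) / max(pplx, deepep) * 100
    print(f"{config} {phase}: {winner} faster by {gap_pct:.1f}%")
```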

Getting started with fabric-lib

# Build the dev container
docker build -t pplx-garden-dev - < docker/dev.Dockerfile
./scripts/run-docker.sh

# Build and test the network benchmark
cargo build --release --bin fabric-debug

# Server
./target/release/fabric-debug 0,1,2,3,4,5,6,7 2

# Client (replace fe80xxxx with server's printed address)
./target/release/fabric-debug 0,1,2,3,4,5,6,7 2 fe80xxxx

Install as Python wheel:

export TORCH_CMAKE_PREFIX_PATH=$(python3 -c "import torch; print(torch.utils.cmake_prefix_path)")
python3 -m build --wheel
python3 -m pip install dist/*.whl

Run the All-to-All benchmark across multiple nodes:

python3 -m benchmarks.bench_all_to_all \
    --world-size $((NUM_NODES * 8)) \
    --nets-per-gpu 2 \
    --init-method=tcp://$MASTER_IP:29500 \
    --node-rank=$NODE_RANK \
    --nvlink=8

pplx-unigram: fast tokenizer encoder

What it does

pplx-unigram encodes text into token IDs using the SentencePiece unigram pipeline:

  1. Precompiled charsmap normalization — Unicode normalization via precompiled charsmap
  2. Metaspace pre-tokenization — SentencePiece-style whitespace handling
  3. Viterbi segmentation — optimal tokenization via Viterbi decoding
  4. Special-token splitting — handles <s>, </s>, [MASK] etc.

It loads standard HuggingFace tokenizer.json files directly — no conversion step required.
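Steps 2 and 3 of the pipeline can be sketched in a few lines. This is a toy illustration of Metaspace replacement and Viterbi segmentation over a unigram vocabulary, not pplx-unigram's implementation: the vocabulary and its log-probabilities are invented, and normalization, special tokens, and unknown-character handling are omitted.

```python
import math

def metaspace(text: str) -> str:
    """SentencePiece-style whitespace handling: mark word starts with U+2581."""
    return "\u2581" + text.replace(" ", "\u2581")

def viterbi(text: str, logprob: dict):
    """Return the segmentation of text with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i] = best score for text[:i]
    back = [0] * (n + 1)           # back[i] = start index of the last piece
    best[0] = 0.0
    max_len = max(len(piece) for piece in logprob)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logprob and best[start] + logprob[piece] > best[end]:
                best[end] = best[start] + logprob[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, end = [], n
    while end > 0:                 # walk the back-pointers from the end
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Invented toy vocabulary: piece -> log-probability.
vocab = {"\u2581uni": -2.0, "gram": -2.5, "\u2581un": -1.5, "\u2581": -1.0}
vocab.update({ch: -3.0 for ch in "unigram"})

pieces = viterbi(metaspace("unigram"), vocab)
print(pieces)
```

The inner loop asks "which vocabulary pieces end at this position", which in a production encoder is answered by a trie lookup rather than by slicing and hashing substrings.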

Why it is fast

The Viterbi search runs over a double-array trie packed one node per cache line. This layout minimizes cache misses during the trie traversal, which is the hot loop in tokenization. On long inputs, cache-friendly trie access compounds into meaningful throughput gains.
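For readers unfamiliar with the data structure, here is a compact sketch of a double-array trie with the common-prefix lookup that Viterbi needs. It is a generic textbook-style construction written for clarity, not the layout pplx-unigram uses: the cache-line packing is a memory-layout detail that Python cannot reproduce, and the function names are hypothetical.

```python
END = 0  # transition code reserved for "a token ends here"; bytes use codes 1..256

def build(tokens):
    """Build base/check arrays. A move from state s on code c goes to
    t = base[s] + c and is valid only if check[t] == s."""
    trie = {}
    for token_id, token in enumerate(tokens):
        node = trie
        for byte in token.encode("utf-8"):
            node = node.setdefault(byte + 1, {})
        node[END] = token_id
    base, check = [0], [-2]            # slot 0 is the root; -1 marks a free slot

    def ensure(size):
        while len(base) <= size:
            base.append(0)
            check.append(-1)

    pending = [(0, trie)]
    while pending:
        state, node = pending.pop()
        codes = sorted(node)
        b = 1
        while True:                    # first-fit search for a collision-free base
            ensure(b + codes[-1])
            if all(check[b + c] == -1 for c in codes):
                break
            b += 1
        base[state] = b
        for c in codes:                # claim every child slot before descending
            check[b + c] = state
        for c in codes:
            if c == END:
                base[b + c] = node[c]  # leaf slot stores the token id
            else:
                pending.append((b + c, node[c]))
    return base, check

def common_prefixes(base, check, data: bytes, start: int = 0):
    """Yield (end_index, token_id) for every token that is a prefix of data[start:]."""
    found, state = [], 0
    for i in range(start, len(data)):
        t = base[state] + data[i] + 1
        if t >= len(check) or check[t] != state:
            break
        state = t
        leaf = base[state] + END
        if leaf < len(check) and check[leaf] == state:
            found.append((i + 1, base[leaf]))
    return found

base, check = build(["a", "ab", "abc", "b"])
print(common_prefixes(base, check, b"abcd"))
```

Each step of the lookup is two array reads and a comparison, with no pointer chasing or hashing, which is the property a cache-conscious layout builds on.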

Quick start

# Get a unigram tokenizer (XLM-R as example)
# Download tokenizer.json from HuggingFace

cargo run --release --example encode -p pplx-unigram -- \
    path/to/tokenizer.json "The quick brown fox jumps over the lazy dog."

Repo structure

fabric-lib/        RDMA TransferEngine library
p2p-all-to-all/    P2P MoE All-to-All implementation
pplx-unigram/      Unigram tokenizer encoder
python-ext/        Python extension module from Rust code
python/pplx_garden/ Python package
rust/              Rust utility libraries
benchmarks/        Performance benchmarks
docker/            Dev container definitions
docs/              Documentation for each project
scripts/           Helper scripts
tests/             Integration tests

The repo is a Cargo workspace with Python bindings generated via the python-ext crate, making it callable from PyTorch-based training and inference code.


Research backing

| Publication | Link |
| --- | --- |
| MLSys'26 paper | fabric-lib: RDMA Point-to-Point Communication for LLM Systems |
| Blog: RDMA P2P comm | RDMA Point-to-Point Communication for LLM Systems |
| Blog: AWS EFA | Enabling Trillion-Parameter Models on AWS EFA |
| Blog: RL weight transfer | Weight Transfer for RL Post-Training in under 2 seconds |
| Blog: Disaggregated prefill | Disaggregated Prefill and Decode |
| Tokenizer blog | Improving Unigram Tokenizer CPU Performance |

Who should look at this

LLM serving infrastructure teams running MoE models at multi-node scale. If your current bottleneck is all-to-all communication latency during expert routing, fabric-lib is directly relevant.

Teams on AWS with EFA: the pplx-EFA benchmark numbers show competitive latency without InfiniBand. This is meaningful for teams that need multi-node MoE on cloud instances where InfiniBand hardware is not an option.

Tokenization throughput teams: if you are preprocessing billions of documents and unigram tokenization is measurable in your pipeline, pplx-unigram's cache-line-aligned trie is worth benchmarking against your current path.

RL post-training teams: the weight transfer blog documents under-2-second model weight sync between training and inference replicas using fabric-lib. If you are doing RLHF or online RL at scale, this is one of the harder infrastructure problems — seeing a production solution is valuable.


Constraints and considerations

| Area | What to verify |
| --- | --- |
| Hardware requirement | RDMA NIC with GPUDirect RDMA support per GPU (not available everywhere) |
| Kernel version | Linux 5.12+ for DMA-BUF; older kernels need fallback paths |
| CUDA 12.8+ | Locks out older GPU deployments on legacy CUDA versions |
| Capabilities | SYS_PTRACE and SYS_ADMIN needed; requires root, sudo, or Docker with explicit cap-add |
| Rust-first codebase | Python users interact via compiled wheel; debugging inside the library requires Rust familiarity |
| MoE-specific optimization | If you run dense models, fabric-lib's All-to-All kernel is not your bottleneck |

None of these are design flaws — they reflect real hardware requirements for RDMA at this performance level.


Comparison: pplx-garden vs other inference communication libraries

| Dimension | pplx-garden (fabric-lib) | DeepEP | NCCL |
| --- | --- | --- | --- |
| MoE dispatch | First-class, SM-free | First-class | Via AllToAll |
| CUDA Graph | Supported | Supported | Supported |
| AWS EFA | Supported | Not documented | Via AWS plugin |
| NIC aggregation | Multiple NICs per GPU | Not documented | Not native |
| Rust implementation | Yes | No (C++/CUDA) | No (C++/CUDA) |
| MLSys peer review | Yes (MLSys'26) | No | N/A |
| License | MIT | MIT | BSD 3-Clause |

Bottom line

pplx-garden is one of the most technically credible open-source releases in the LLM inference infrastructure space in 2026. fabric-lib solves a real problem — MoE all-to-all communication at low latency — with production benchmarks, a peer-reviewed paper, and NIC aggregation support that few alternatives offer.

The hardware bar is real: you need RDMA NICs, GPUDirect support, and Linux 5.12+. This is not a drop-in replacement for NCCL in commodity setups. But for teams operating at the scale where these constraints are already satisfied, pplx-garden is worth a serious look.

For the tokenization side, pplx-unigram is narrower in scope but practically useful: a well-engineered Rust implementation that slots into any HuggingFace-based pipeline without conversion overhead.




Repository metrics and benchmark data are based on the public repo snapshot as of May 2026 and can change. Verify on the upstream project before making infrastructure decisions.
