pplx-garden is Perplexity AI's open-source repository for inference technology, containing fabric-lib (an RDMA TransferEngine and P2P MoE dispatch kernel) and pplx-unigram (a fast unigram tokenizer encoder written in Rust).

What hardware does fabric-lib support?

fabric-lib supports NVIDIA ConnectX-7 and AWS EFA RDMA NICs. It uses NVLink for intra-node data transfer and RDMA for inter-node transfer, and supports CUDA Graph.
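
The intra-node versus inter-node split can be pictured with a small routing helper. This is only an illustrative sketch, not fabric-lib's API: the function name `pick_transport`, the contiguous rank numbering, and the default of 8 GPUs per node are all assumptions made for the example.

```python
def pick_transport(src_rank: int, dst_rank: int, gpus_per_node: int = 8) -> str:
    """Return 'nvlink' when both ranks sit on the same node, else 'rdma'.

    Hypothetical helper: assumes ranks are numbered contiguously per node,
    so ranks 0..7 share node 0, ranks 8..15 share node 1, and so on.
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    same_node = (src_rank // gpus_per_node) == (dst_rank // gpus_per_node)
    return "nvlink" if same_node else "rdma"
```

Under these assumptions, `pick_transport(0, 7)` stays on NVLink, while `pick_transport(7, 8)` crosses a node boundary and goes over RDMA.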

What is the pplx-unigram tokenizer?

pplx-unigram is a Rust implementation of a unigram tokenizer encoder that loads HuggingFace tokenizer.json files and encodes text via Viterbi decoding over a double-array trie packed one node per cache line.
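
To make the Viterbi step concrete, here is a minimal sketch of unigram segmentation in Python. It is not pplx-unigram's code: it swaps the double-array trie for a plain dictionary lookup, and the function name `viterbi_encode`, the `unk_score` penalty, and the single-character fallback are assumptions chosen for illustration.

```python
import math

def viterbi_encode(text, vocab, unk_score=-10.0):
    """Return the highest-scoring segmentation of `text`.

    vocab maps token string -> log-probability score.
    A plain dict stands in for the trie used by a real implementation.
    Unknown single characters get `unk_score` so every input is encodable.
    """
    n = len(text)
    max_len = max((len(t) for t in vocab), default=1)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] == -math.inf:
                continue
            score = vocab.get(text[start:end])
            if score is None:
                if end - start != 1:
                    continue      # only single characters fall back to unk
                score = unk_score
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    tokens.reverse()
    return tokens
```

The dynamic program keeps one best score per character boundary, so the cost grows with the text length times the longest token length. A trie lets a real encoder find all tokens starting at a position in one pass instead of probing each substring.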

Is pplx-garden production-ready?

The core fabric-lib has been validated in Perplexity's own production LLM serving stack and has a peer-reviewed MLSys'26 paper. That said, it requires specialized RDMA hardware and Linux kernel 5.12+, so teams should evaluate against their infrastructure.
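
As a first step in that evaluation, the kernel requirement can be checked with a few lines of Python. This is a sketch under stated assumptions: `kernel_at_least` is a hypothetical helper, not part of pplx-garden, and it only compares the major and minor numbers at the start of the release string.

```python
import re

def kernel_at_least(release: str, minimum=(5, 12)) -> bool:
    """Check that a kernel release string such as '5.15.0-91-generic'
    is at least `minimum` (major, minor). Hypothetical helper."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= tuple(minimum)
```

On a Linux host, the release string can be obtained with `platform.release()` from the standard library and passed straight in.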

How does fabric-lib compare to DeepEP?

Benchmarks in the repo for decode (128 tokens) show mixed results against DeepEP-CX7 across EP8 to EP64. On combine, pplx-CX7 is faster at every configuration. On dispatch, pplx-CX7 is faster at EP16 and EP32, while DeepEP-CX7 is faster at EP8 and EP64.

pplx-garden: Perplexity's open-source inference | explainx.ai Blog


perplexityai/pplx-garden is Perplexity AI's public commitment to open-source inference research. It is not a demo or a tutorial repo — it is the actual infrastructure Perplexity uses at scale, packaged for external teams to build on.

Two projects live here today: fabric-lib, an RDMA-based communication library and P2P MoE dispatch kernel, and pplx-unigram, a high-performance unigram tokenizer encoder. Both are written in Rust with Python bindings.

Primary repo: perplexityai/pplx-garden

Quick reference

Item	Details
Organization	Perplexity AI
Purpose	Open-source inference technology
Primary language	Rust (Python bindings via `python-ext`)
License	MIT
Repository signals	~509 stars, ~54 forks (at capture time)
Projects	fabric-lib, pplx-unigram
Research backing	MLSys'26 paper on fabric-lib
Hardware targets	NVIDIA ConnectX-7, AWS EFA, NVLink

Why this matters

Scaling LLMs beyond a single node surfaces two hard problems: communication between GPUs and tokenization at the input boundary. Both bottlenecks compound as model size grows into the trillion-parameter range.

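fabric-lib: RDMA inference communication

Performance benchmarks

Decode (128 tokens) dispatch and combine latency; lower is better.

Config	pplx-EFA dispatch	pplx-CX7 dispatch	DeepEP-CX7 dispatch	pplx-EFA combine	pplx-CX7 combine	DeepEP-CX7 combine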
EP64	266.7 μs	187.5 μs	177.9 μs	391.2 μs	309.1 μs	325.0 μs
EP32	229.1 μs	153.9 μs	159.1 μs	335.0 μs	266.3 μs	285.0 μs
EP16	214.8 μs	110.2 μs	123.9 μs	241.5 μs	185.5 μs	203.0 μs
EP8	49.7 μs	50.5 μs	42.6 μs	64.2 μs	65.3 μs	72.0 μs
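
The ConnectX-7 columns are easier to compare as ratios. The sketch below copies the pplx-CX7 and DeepEP-CX7 numbers from the table and divides DeepEP's latency by pplx's, so a value above 1 means pplx-CX7 is faster. The dictionary layout and the name `speedup_vs_deepep` are just choices made for this example.

```python
# Latencies in microseconds copied from the table above: (dispatch, combine).
LATENCY_US = {
    "EP64": {"pplx_cx7": (187.5, 309.1), "deepep_cx7": (177.9, 325.0)},
    "EP32": {"pplx_cx7": (153.9, 266.3), "deepep_cx7": (159.1, 285.0)},
    "EP16": {"pplx_cx7": (110.2, 185.5), "deepep_cx7": (123.9, 203.0)},
    "EP8": {"pplx_cx7": (50.5, 65.3), "deepep_cx7": (42.6, 72.0)},
}

def speedup_vs_deepep(latency_us):
    """Return {config: (dispatch_ratio, combine_ratio)} as DeepEP / pplx."""
    ratios = {}
    for config, row in latency_us.items():
        pplx_dispatch, pplx_combine = row["pplx_cx7"]
        deepep_dispatch, deepep_combine = row["deepep_cx7"]
        ratios[config] = (deepep_dispatch / pplx_dispatch,
                          deepep_combine / pplx_combine)
    return ratios

SPEEDUPS = speedup_vs_deepep(LATENCY_US)
for config, (dispatch, combine) in SPEEDUPS.items():
    print(f"{config}: dispatch {dispatch:.2f}x, combine {combine:.2f}x")
```

Read this way, pplx-CX7 leads on combine at every EP size and on dispatch at EP16 and EP32, while DeepEP-CX7 leads on dispatch at EP8 and EP64.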

Getting started with fabric-lib

bash

# Build the dev container
docker build -t pplx-garden-dev - < docker/dev.Dockerfile
./scripts/run-docker.sh

# Build and test the network benchmark
cargo build --release --bin fabric-debug

# Server
./target/release/fabric-debug 0,1,2,3,4,5,6,7 2

# Client (replace fe80xxxx with server's printed address)
./target/release/fabric-debug 0,1,2,3,4,5,6,7 2 fe80xxxx

pplx-unigram: fast tokenizer encoder

Quick start

bash

# Get a unigram tokenizer (XLM-R as example)
# Download tokenizer.json from HuggingFace

cargo run --release --example encode -p pplx-unigram -- \
    path/to/tokenizer.json "The quick brown fox jumps over the lazy dog."
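
For readers who want to inspect what the encoder loads, the sketch below pulls the unigram vocabulary out of a tokenizer.json document in Python. It assumes the usual HuggingFace layout for Unigram models, where `model.vocab` is a list of `[token, score]` pairs and `model.unk_id` is optional; `load_unigram_vocab` is a hypothetical helper, not part of pplx-unigram.

```python
import json

def load_unigram_vocab(tokenizer_json: str):
    """Parse tokenizer.json text and return (vocab, unk_id).

    Assumed layout: {"model": {"type": "Unigram", "unk_id": int,
                               "vocab": [[token, score], ...]}}
    vocab maps each token string to its log-probability score.
    """
    data = json.loads(tokenizer_json)
    model = data["model"]
    # Some files omit "type"; treat a missing value as Unigram.
    if model.get("type", "Unigram") != "Unigram":
        raise ValueError(f"not a Unigram tokenizer: {model.get('type')!r}")
    vocab = {token: float(score) for token, score in model["vocab"]}
    return vocab, model.get("unk_id")
```

Typical usage would read the file first, for example `load_unigram_vocab(open(path, encoding="utf-8").read())`, and then check the vocabulary size and score range before encoding.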

Repo structure

fabric-lib/        RDMA TransferEngine library
p2p-all-to-all/    P2P MoE All-to-All implementation
pplx-unigram/      Unigram tokenizer encoder
python-ext/        Python extension module from Rust code
python/pplx_garden/ Python package
rust/              Rust utility libraries
benchmarks/        Performance benchmarks
docker/            Dev container definitions
docs/              Documentation for each project
scripts/           Helper scripts
tests/             Integration tests

Research backing

Publication	Title
MLSys'26 paper	fabric-lib: RDMA Point-to-Point Communication for LLM Systems
Blog: RDMA P2P comm	RDMA Point-to-Point Communication for LLM Systems
Blog: AWS EFA	Enabling Trillion-Parameter Models on AWS EFA
Blog: RL weight transfer	Weight Transfer for RL Post-Training in under 2 seconds
Blog: Disaggregated prefill	Disaggregated Prefill and Decode
Tokenizer blog	Improving Unigram Tokenizer CPU Performance

Constraints and considerations

Area	What to verify
Hardware requirement	RDMA NIC with GPUDirect RDMA support per GPU — not available everywhere
Kernel version	Linux 5.12+ for DMA-BUF; older kernels need fallback paths
CUDA version	CUDA 12.8+ required; locks out deployments on older CUDA versions
Capabilities	`SYS_PTRACE` and `SYS_ADMIN` needed — requires root, sudo, or Docker with explicit cap-add
Rust-first codebase	Python users interact via compiled wheel; debugging inside the library requires Rust familiarity
MoE-specific optimization	The All-to-All kernel targets MoE expert dispatch; dense models do not exercise it, so it will not address their bottlenecks
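
The capability requirement in the table can be verified from inside a process or container before launching anything. The sketch below parses the `CapEff` line from the text of `/proc/self/status` and tests the bits for `CAP_SYS_PTRACE` (19) and `CAP_SYS_ADMIN` (21), which are the numbers defined in the Linux capability headers. The helper names are hypothetical and not part of pplx-garden.

```python
# Bit positions from linux/capability.h.
CAP_SYS_PTRACE = 19
CAP_SYS_ADMIN = 21

def parse_cap_eff(status_text: str) -> int:
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")

def missing_caps(cap_eff: int):
    """Return the names of required capabilities absent from the mask."""
    required = {"SYS_PTRACE": CAP_SYS_PTRACE, "SYS_ADMIN": CAP_SYS_ADMIN}
    return [name for name, bit in required.items()
            if not ((cap_eff >> bit) & 1)]
```

On Linux, `missing_caps(parse_cap_eff(open("/proc/self/status").read()))` returns an empty list when both capabilities are present; a non-empty list names what still has to be granted, for example through Docker's `--cap-add`.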

Comparison: pplx-garden vs other inference communication libraries

Dimension	pplx-garden (fabric-lib)	DeepEP	NCCL
MoE dispatch	First-class, SM-free	First-class	Via AllToAll
CUDA Graph	Supported	Supported	Supported
AWS EFA	Supported	Not documented	Via AWS plugin
NIC aggregation	Multiple NICs per GPU	Not documented	Not native
Rust implementation	Yes	No (C++/CUDA)	No (C/CUDA)
MLSys peer review	Yes (MLSys'26)	No	N/A
License	MIT	MIT	BSD-3-Clause

pplx-garden: Perplexity's open-source inference technology stack explained

Quick reference

Why this matters


fabric-lib: RDMA inference communication

What it does

Architecture decisions

System requirements

Performance benchmarks

Getting started with fabric-lib

pplx-unigram: fast tokenizer encoder

What it does

Why it is fast

Quick start

Repo structure

Research backing

Who should look at this

Constraints and considerations

Comparison: pplx-garden vs other inference communication libraries

Bottom line
