Part 2 of 3: Individuals · Business · Fortune 500
TL;DR — business decision table
| Question | Answer for 5–500 employee cos. |
|---|---|
| Why now? | Frontier APIs gated; token bills scale with headcount; clients ask where data goes |
| Minimum infra | 1× GPU server (24–80GB VRAM) + LiteLLM proxy |
| People | 0.25 FTE senior eng + existing IT |
| CapEx vs OpEx | $15k box or $3k/mo Lambda/CoreWeave GPU |
| Model pick | GLM-5.2 + Qwen3 (two-family rule) |
| Timeline | 60–90 days to production default |
| Still need Claude? | Yes, ~5–15% burst on eval failure |
Your company is not Anthropic’s trusted partner. Your engineers are not on GPT-5.6 Sol preview. If AI is embedded in delivery—agencies, SaaS, consultancies, fintech back-office—you are one policy change away from margin collapse.
Open source for business means owning the default inference path for internal work, not romantic self-sufficiency.
What “business scale” means (and what it does not)
| Business (this guide) | Individual | Fortune 500 | |
|---|---|---|---|
| Headcount | 5–500 | 1 | 5,000+ |
| GPU count | 1–8 | 0–2 | 100+ |
| Governance | Founder + eng lead | Personal | Board, procurement, legal |
| Goal | Cut API bill 50–80% | Privacy + learning | Sovereignty + regulatory |
What it takes: five business investments
1. Infrastructure (one inference plane)
Option A — On-prem / office closet
- 1× workstation: RTX 4090 24GB or used RTX 3090 — $1.5–3k
- Runs Qwen3 32B or GLM-5.2 quantized for 5–20 concurrent devs (queue-based)
Option B — Cloud GPU (no hardware ops)
- Lambda / CoreWeave / AWS g5.2xlarge — ~$1.50–3/hr
- vLLM Docker; persistent volume for weights
Option C — Managed open API
- Together / Fireworks host GLM-5.2, Llama, Qwen — you get open weights economics without GPUs
- Still vendor risk, but no Annex A problem
See Mac vs GPU for why Mac is a dev laptop, not your inference server.
2. Software stack (standardize early)
Developers → LiteLLM gateway (OpenAI-compatible)
├─ primary: glm-5.2-vllm (internal)
├─ coding: qwen3-coder-vllm
└─ fallback: claude-opus / gpt-5.5 API (gated)
- LiteLLM — one API key, budgets per team, logging
- vLLM — production throughput
- Eval suite — 100 tickets from last sprint; pass/fail scoring
Codex + Ollama OSS patterns apply if you standardize on OpenCode/Codex CLI.
3. People (roles, not headcount)
| Role | Time | Owns |
|---|---|---|
| AI platform owner (staff eng) | 25–50% | Models, upgrades, uptime |
| Security | Review once | Data classification, burst policy |
| Finance | Monthly | API vs infra TCO |
| Everyone else | 2hr onboarding | When to use local vs cloud |
You do not need to hire an ML researcher.
4. Policy (one page, enforced)
Write it down:
- Green data — internal code, drafts → local/open only
- Yellow — anonymized prod logs → local + approval
- Red — customer PII, PHI → no cloud without legal sign-off
- Burst rule — if open model fails eval twice, allowed Opus/GPT with ticket link
June 2026 export controls mean US HQ + foreign engineers on Claude Fable was already broken—self-host fixes deemed-export anxiety for internal tools (international access context).
5. Money (honest TCO)
20 engineers, heavy agent use (illustrative):
| Frontier API only | Hybrid open default | |
|---|---|---|
| Monthly tokens | $8k–25k | $1k–4k API burst |
| Infra | $0 | $500–3k (cloud GPU or amortized box) |
| Year 1 total | $96k–300k | $30k–80k |
Break-even on CapEx GPU box often <12 months at $10k+/mo API spend.
Model selection for business workloads
| Workload | Model | Why |
|---|---|---|
| Product engineering | GLM-5.2, Kimi K2.7 | Best open coding/agentic reports mid-2026 |
| Support / ops docs | Qwen3 32B | Cheap, multilingual |
| Finance / analysis | DeepSeek R1, Qwen3 235B | Reasoning chains |
| Customer-facing chatbot | Fine-tuned 8B–14B | Latency + cost; not raw GLM-5.2 |
Benchmark context vs Fable/GPT-5.6: enterprise comparison.
Kilo Code planning test: GLM-5.2 9.0 vs Fable 9.1 — viable for spec → build pipelines (planning post).
Security checklist (business minimum)
Before production:
- TLS on LiteLLM gateway; no plain HTTP inside office WiFi
- API keys per team; rotate quarterly
- No default admin keys in Slack
- Prompt logging retention policy (30–90 days max unless legal hold)
- SBOM for vLLM Docker images
- Backup weight volume — re-download is slow
For SOC 2 path, map controls to CC6 logical access and CC7 monitoring—auditors care that you control keys, not Anthropic.
Case sketch: 40-person SaaS engineering org
Before: Claude Team + ad hoc API keys; ~$12k/mo; Fable suspension broke two agent pipelines.
After (90 days):
- 2× RTX 4090 server + vLLM (GLM-5.2 primary, Qwen3 Coder secondary)
- LiteLLM with $500/mo Opus burst cap
- Result: ~$4.5k/mo infra + burst; eval within 5% of pre-ban quality on internal suite
Lesson: Pilot squad found planning tasks identical to Kilo benchmark; multi-file refactors still needed burst 20% of time.
Build vs buy for business
| Self-host GPU | Managed open API (Together/Fireworks) | |
|---|---|---|
| CapEx | High | Low |
| Ops burden | You | Vendor |
| Data control | Maximum | Good (contract-dependent) |
| Latency | Best on LAN | Internet |
| Best for | 15+ daily active devs | <15 devs, fast start |
Many businesses start managed, move self-host when API bill exceeds $6k/mo for 6 consecutive months.
Hiring and talent (business)
Open source repoints talent rather than replacing it:
- Less prompt-hacking around rate limits; more eval and RAG ownership
- Job specs shift toward LiteLLM + vLLM maintainers (higher leverage, smaller pool)
- Retention during Fable outage required a credible internal API, not “use Opus” alone
Common business mistakes
- CEO buys GPUs, no platform owner — idle hardware by month six
- Unlimited personal Claude while mandating open internally — no savings, data leaks
- Single-model religion — one license or geopolitical shock with no fallback
- Skipping eval — silent reversion to cloud; finance sees flat OPEX
- Customer data in week-one pilot — start internal-only until legal signs policy
60-day business rollout
| Week | Milestone |
|---|---|
| 1–2 | Token audit; pick primary open model; buy/provision GPU |
| 3–4 | LiteLLM + vLLM staging; eval 200 real tasks |
| 5–6 | Pilot one team (5–10 devs); daily quality Slack channel |
| 7–8 | Company-wide default; disable personal Claude on green data |
| 9–12 | Fine-tune optional; add second model family; quarterly eval |
Anti-pattern: Mandating open models without eval — engineers will secretly use ChatGPT and you lose audit trail.
Vendor shortlist (business tier)
| Vendor type | Examples | Use when |
|---|---|---|
| GPU cloud | Lambda, CoreWeave, AWS G5 | No datacenter, need burst |
| Open API | Together, Fireworks, DeepInfra | Fast start, no GPU ops |
| Gateway | LiteLLM (OSS + enterprise) | Team keys, budgets, logging |
| Vector DB | Qdrant, Weaviate self-host | RAG on internal docs |
| Burst closed | OpenRouter, direct Anthropic/OpenAI | Eval failure escape hatch |
Negotiate annual burst caps on closed APIs before you migrate—finance will ask.
Business sustainability means your LiteLLM billboards show 80%+ open traffic in the dashboard—not a one-time blog post about “we care about sovereignty.” Review routing rules every sprint; model releases in 2026 arrive faster than quarterly procurement cycles.
When business should NOT go open-first
- Customer-facing product needs frontier quality and you cannot afford eval gap
- No one owns uptime — single GPU SPOF without on-call
- Regulated burst-only workloads (some health/finance) where validated vendor required
- Team <5 with <$500/mo API spend — optimize subscriptions first
Open source + agency/client work
Agencies face client data segregation:
- Per-client LiteLLM virtual keys routing to dedicated Qdrant collections
- Never train on Client A data for Client B
- Contract language: “We run open-weight models in [region] VPC; no third-party frontier training.”
Differentiator vs competitors still on permissioned Anthropic/OpenAI tiers.
Bottom line
Business open source is one GPU plane, LiteLLM, written policy, and 60–90 days of disciplined eval—not a research program.
You buy predictable cost, data control, and survival when the next model is trusted-partners only.
Series: Individuals · Fortune 500 · Full benchmark map
Budget ranges reflect US/EU mid-market SaaS and agencies, June 28, 2026.