speak-tts

emzod/speak · updated Apr 8, 2026


$ npx skills add https://github.com/emzod/speak --skill speak-tts
summary

Real-time text-to-speech with voice cloning on Apple Silicon, entirely on-device.

  • Supports multiple input sources (text files, markdown, stdin, web articles, PDFs) and output modes (streaming, file save, playback, or both)
  • Voice cloning from 10–30 second WAV samples at 24000 Hz mono; includes emotion tags like [laugh], [sigh], and [gasp] for audible effects
  • Batch processing with auto-chunking for long documents, concatenation utilities, and resume capability for interrupted generations
skill.md

speak - Talk to your Claude!

Give your agent the ability to speak to you in real time. Local text-to-speech, voice cloning, and audio generation on Apple Silicon.

Prerequisites

Requirement        Check             Install
Apple Silicon Mac  uname -m → arm64  Intel not supported
macOS 12.0+        sw_vers           -
sox                which sox         brew install sox
ffmpeg             which ffmpeg      brew install ffmpeg
poppler (PDF)      which pdftotext   brew install poppler
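
The checks in the table can be bundled into a quick preflight. This is a minimal sketch, not part of speak itself; `missing_tools` and `is_apple_silicon` are hypothetical helper names.

```shell
# Hypothetical helpers (not part of speak): preflight for the prerequisites.

# Print each named command that is not found on PATH, one per line.
missing_tools() (
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
)

# Succeed only on Apple Silicon, where `uname -m` reports arm64.
is_apple_silicon() (
  [ "$(uname -m)" = "arm64" ]
)

# Example:
#   missing_tools sox ffmpeg pdftotext   # prints nothing when all are installed
#   is_apple_silicon || echo "Intel Macs are not supported"
```

Anything `missing_tools` prints can be installed with the matching `brew install` command from the table.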

Input Sources

Source         Example
Text file      speak article.txt
Markdown       speak doc.md
Direct string  speak "Hello"
Clipboard      pbpaste | speak
Stdin          cat file.txt | speak

Web Articles

lynx -dump -nolist "https://example.com/article" | speak --output article.wav

Converting Formats

Format  Convert Command
PDF     pdftotext doc.pdf doc.txt
DOCX    textutil -convert txt doc.docx
HTML    pandoc -f html -t plain doc.html > doc.txt

Output Modes

Goal                    Command
Save for later          speak text.txt --output file.wav
Listen now (streaming)  speak text.txt --stream
Listen now (complete)   speak text.txt --play
Both                    speak text.txt --stream --output file.wav

Default Behavior

speak article.txt          # → ~/Audio/speak/article.wav (no playback)
speak "Hello"              # → ~/Audio/speak/speak_<timestamp>.wav
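
To know where a file input will land before generating, the default naming can be mirrored in a small helper. This is a sketch under assumptions: `default_output_path` is a hypothetical name, and dropping the last extension for every input type is inferred from the article.txt example, not confirmed by the tool.

```shell
# Hypothetical helper: predict the default output path for a file input,
# following the article.txt -> ~/Audio/speak/article.wav example.
# Assumption: the last extension is dropped for any input file type.
default_output_path() (
  name=$(basename "$1")
  printf '%s/Audio/speak/%s.wav\n' "$HOME" "${name%.*}"
)

# Example:
#   default_output_path article.txt
```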

Directory Auto-Creation

Directory           Auto-Created?
~/Audio/speak/      ✓ Yes
~/.chatter/voices/  ✗ No
Custom directories  ✗ No

Always create custom directories first:

mkdir -p ~/.chatter/voices/
mkdir -p ~/Audio/custom/
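
In scripts, it can help to create the parent directory of every output path right before generating. A minimal sketch; `ensure_parent_dir` is a hypothetical helper name, not a speak command.

```shell
# Hypothetical helper: create the parent directory of an output file,
# because only ~/Audio/speak/ is created automatically.
ensure_parent_dir() (
  mkdir -p "$(dirname "$1")"
)

# Example:
#   out=~/Audio/custom/talk.wav
#   ensure_parent_dir "$out" && speak notes.txt --output "$out"
```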

Voice Cloning

Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.

Quality Expectations

  • Output captures general voice characteristics but is not a perfect replica
  • Quality depends heavily on sample quality
  • 15-25 seconds is optimal (10s minimum, 30s maximum)

Recording Your Voice

Using QuickTime:

  1. Open QuickTime Player → File → New Audio Recording
  2. Record 20 seconds of clear speech
  3. File → Export As → Audio Only (.m4a)
  4. Convert to WAV (see below)

Using sox (command line):

# -d = use default microphone
# Recording starts immediately and stops after 25 seconds
sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25

Converting to Required Format

Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.

# From MP3
ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav

# From M4A (QuickTime)
ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav

# Trim to 25 seconds
ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav

# Check sample properties
ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"
# Should show: Duration ~15-25s, 24000 Hz, mono
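
The three requirements can be checked mechanically once you have read the sample rate, channel count, and duration off the ffprobe output. A sketch only; `voice_sample_ok` is a hypothetical helper and takes the values as arguments rather than parsing ffprobe itself.

```shell
# Hypothetical check: succeed only if a sample meets the stated requirements:
# 24000 Hz, 1 channel (mono), and 10-30 seconds long.
voice_sample_ok() (
  rate=$1
  channels=$2
  seconds=$3
  [ "$rate" = "24000" ] || exit 1
  [ "$channels" = "1" ] || exit 1
  # awk handles fractional durations such as 22.5
  awk -v s="$seconds" 'BEGIN { exit !(s >= 10 && s <= 30) }'
)

# Example:
#   voice_sample_ok 24000 1 22.5 && echo "sample is usable"
```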

Using Your Voice

# Create directory
mkdir -p ~/.chatter/voices/

# Move sample
mv voice.wav ~/.chatter/voices/my_voice.wav

# Test
speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream

# Use for content
speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav

Path requirements:

  • ✓ Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
  • ✓ Works: /Users/name/.chatter/voices/my_voice.wav
  • ✗ Fails: my_voice.wav (relative path)
  • ✗ Fails: ./voices/my_voice.wav (relative path)
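
When a script receives a relative voice path, it can be made absolute before it is passed to --voice. A minimal sketch; `abs_path` is a hypothetical helper name.

```shell
# Hypothetical helper: turn a relative path into an absolute one,
# since relative --voice paths are listed as failing.
abs_path() (
  case $1 in
    /*) printf '%s\n' "$1" ;;            # already absolute
    *)  printf '%s/%s\n' "$PWD" "$1" ;;  # resolve against the current directory
  esac
)

# Example:
#   speak "Testing my voice" --voice "$(abs_path my_voice.wav)" --stream
```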

Voice Sample Tips

Good Sample     Bad Sample
Quiet room      Background noise
Natural pace    Rushed or monotone
Clear diction   Mumbling
Varied content  Repetitive phrases

Default Voice

When --voice is omitted, a built-in default voice is used:

speak "Hello world" --stream  # Uses default voice

Emotion Tags

Tags produce audible effects (actual sounds), not spoken words:

speak "[sigh] Monday again." --stream
# Output: (sigh sound) "Monday again."

Tag             Effect
[laugh]         Laughter
[chuckle]       Light chuckle
[sigh]          Sighing
[gasp]          Gasping
[groan]         Groaning
[clear throat]  Throat clearing
[cough]         Coughing
[crying]        Crying
[singing]       Sung speech
NOT supported: [pause], [whisper] (ignored)

For pauses: Use punctuation: "Wait... let me think."
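
If existing scripts already contain the unsupported tags, a pre-filter can rewrite them before the text reaches speak. This is a sketch of one possible approach; `normalize_tags` is a hypothetical name, and turning [pause] into an ellipsis is a choice made here, following the punctuation advice.

```shell
# Hypothetical pre-filter: rewrite the tags that are listed as ignored.
# [pause] becomes an ellipsis (a punctuation-based pause); [whisper] is dropped.
normalize_tags() {
  sed -e 's/ *\[pause\] */... /g' -e 's/\[whisper\] *//g'
}

# Example:
#   normalize_tags < script.txt | speak --output script.wav
```

Supported tags such as [sigh] pass through unchanged.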

Batch Processing

mkdir -p ~/Audio/book/
speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/
# Creates: ch01.wav, ch02.wav, ch03.wav

# With auto-chunking (for long files)
speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk

# Skip completed files
speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing

Auto-Chunk Behavior

When using --auto-chunk with batch processing:

  1. Each input file is chunked independently
  2. Chunks are generated and automatically concatenated per file
  3. Final output: one .wav per input file (e.g., ch01.wav)
  4. Intermediate chunks deleted (unless --keep-chunks)

You don't need to manually concatenate chunks — only concatenate final chapter files.
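
For a rough idea of how many chunks a file will produce, divide its character count by the chunk size and round up. A sketch under a loud assumption: it treats chunks as filled to the character limit, while the real splitter may break earlier (for example at sentence boundaries) and produce more. `chunks_needed` is a hypothetical name; 6000 is the --chunk-size default from the options reference.

```shell
# Hypothetical estimate: minimum number of chunks for a given character count.
# Assumption: every chunk is filled to the --chunk-size limit.
chunks_needed() (
  chars=$1
  size=${2:-6000}   # documented --chunk-size default
  echo $(( (chars + size - 1) / size ))
)

# Example (unquoted on purpose so wc's padding is trimmed):
#   chunks_needed $(wc -m < ch01.txt)
```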

Concatenating Audio

# Explicit order (recommended)
speak concat ch01.wav ch02.wav ch03.wav --output book.wav

# Glob pattern (REQUIRES zero-padded filenames)
speak concat audiobook/*.wav --output book.wav

Zero-Padding Rules

Critical for correct concatenation order:

Files  Correct             Wrong
1-9    01, 02, ..., 09     1, 2, ..., 9
10-99  01, 02, ..., 99     1, 10, 2, ...
100+   001, 002, ..., 999  1, 100, 2, ...

Why: Shell glob expansion sorts names character by character, not numerically, so unpadded names come out as 1, 10, 2 while zero-padded names come out as 01, 02, 10.
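
Padded names are easiest to get right by generating them with printf. A minimal sketch; `chapter_name` is a hypothetical helper, and the `ch` prefix simply follows the examples in this document.

```shell
# Hypothetical helper: build a zero-padded chapter name.
# Width is the number of digits in the chapter count, with a minimum of 2.
chapter_name() (
  n=$1
  total=$2
  width=${#total}
  [ "$width" -ge 2 ] || width=2
  printf "ch%0${width}d\n" "$n"
)

# Example:
#   chapter_name 3 12    # 12 chapters in total, so two digits
#   chapter_name 7 150   # 150 chapters in total, so three digits
```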

PDF to Audiobook (Complete Workflow)

Step 1: Find Chapter Boundaries

# Preview table of contents
pdftotext -f 1 -l 5 textbook.pdf toc.txt
cat toc.txt  # Note chapter page numbers

# Or search for "Chapter" markers
pdftotext textbook.pdf - | grep -n "Chapter"

Step 2: Extract Chapters (Zero-Padded!)

# For 100-page book with ~10 chapters
pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt
# ... continue for all chapters
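
For books with many chapters, the extraction commands can be generated from a list of page ranges instead of typed by hand. A sketch, assuming a plain-text list with one "first last" page pair per line; `chapter_commands` is a hypothetical name, and it prints the commands rather than running them so the ranges can be reviewed first.

```shell
# Hypothetical generator: read "first last" page pairs on stdin (one chapter
# per line) and print the matching pdftotext commands with zero-padded names.
chapter_commands() (
  pdf=$1
  i=0
  while read -r first last; do
    i=$((i + 1))
    printf 'pdftotext -f %s -l %s -layout %s ch%02d.txt\n' \
      "$first" "$last" "$pdf" "$i"
  done
)

# Example:
#   printf '1 12\n13 25\n26 38\n' | chapter_commands textbook.pdf
```

Once the printed commands look right, pipe them to `sh` to run them. PDF names containing spaces would need quoting that this sketch does not add.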

Step 3: Estimate Time

speak --estimate ch*.txt
# Shows: total audio duration, generation time, storage needed

# Quick estimates:
# 1 page ≈ 2 min audio ≈ 1 min generation
# 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
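
The quick estimates reduce to simple arithmetic: 2 minutes of audio and 1 minute of generation per page, at roughly 2.5 MB per minute of audio. A small calculator sketch; `estimate_pages` is a hypothetical name and the figures are the rules of thumb above, not measurements.

```shell
# Hypothetical calculator for the rules of thumb:
# 1 page ~ 2 min audio ~ 1 min generation, ~2.5 MB per audio minute.
estimate_pages() (
  pages=$1
  audio=$((pages * 2))
  gen=$pages
  mb=$((audio * 5 / 2))   # 2.5 MB/min, kept in integer arithmetic
  echo "$audio min audio, $gen min generation, $mb MB"
)

# Example:
#   estimate_pages 100   # prints: 200 min audio, 100 min generation, 500 MB
```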

Step 4: Generate Audio

mkdir -p audiobook/
speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk
# Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav

Step 5: Concatenate

speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav
# Or with glob (only if zero-padded):
speak concat audiobook/ch*.wav --output complete_audiobook.wav

PDF Troubleshooting

Issue               Solution
Empty/garbled text  Scanned PDF; use OCR: brew install tesseract
Wrong encoding      Try: pdftotext -enc UTF-8 doc.pdf
Check word count    pdftotext doc.pdf - | wc -w (should be >100)

Multi-Voice Content

mkdir -p podcast/scripts podcast/wav

echo "Welcome to the show." > podcast/scripts/01_host.txt
echo "Thanks for having me." > podcast/scripts/02_guest.txt

speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav

speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
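
With more than a few segments, the per-segment commands can be derived from the file names. A sketch under assumptions: segments are named NN_speaker.txt as in the example, and each speaker has a sample at ~/.chatter/voices/<speaker>.wav. `segment_command` is a hypothetical name, and it only prints the command.

```shell
# Hypothetical generator: derive the speak command for one script segment.
# Assumes NN_speaker.txt naming and one voice sample per speaker in
# ~/.chatter/voices/.
segment_command() (
  script=$1
  outdir=$2
  base=$(basename "$script" .txt)   # e.g. 01_host
  num=${base%%_*}                   # e.g. 01
  speaker=${base#*_}                # e.g. host
  printf 'speak %s --voice %s/.chatter/voices/%s.wav --output %s/%s.wav\n' \
    "$script" "$HOME" "$speaker" "$outdir" "$num"
)

# Example:
#   for f in podcast/scripts/*.txt; do segment_command "$f" podcast/wav; done
```

Because the numeric prefix is kept in the output name, the resulting files stay in order for speak concat.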

Options Reference

Option              Description               Default
--stream            Stream as it generates    false
--play              Play after complete       false
--output <path>     Output file               ~/Audio/speak/
--output-dir <dir>  Batch output directory    -
--voice <path>      Voice sample (full path)  default
--timeout <sec>     Timeout per file          300
--auto-chunk        Split long documents      false
--chunk-size <n>    Chars per chunk           6000
--resume <file>     Resume from manifest      -
--keep-chunks       Keep intermediate files   false
--skip-existing     Skip if output exists     false
--estimate          Show duration estimate    false
--dry-run           Preview only              false
--quiet             Suppress output           false

Commands

Command            Description
speak setup        Set up environment
speak health       Check system status
speak models       List TTS models
speak concat       Concatenate audio
speak daemon kill  Stop TTS server
speak config       Show configuration

Performance

Metric      Value
Cold start  ~4-8s
Warm start  ~3-8s
Speed       0.3-0.5x RTF (faster than real-time)
Storage     ~2.5 MB/min, ~150 MB/hour

Resume Capability

For interrupted long generations:

# Single file with auto-chunk — use --resume
speak long.txt --auto-chunk --output book.wav
# If interrupted, manifest saved at ~/Audio/speak/manifest.json
speak --resume ~/Audio/speak/manifest.json

# Batch processing — use --skip-existing
speak ch*.txt --output-dir audiobook/ --auto-chunk
# If interrupted, re-run same command:
speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
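
Before re-running an interrupted batch, it can be useful to see which inputs still lack an output. A sketch of that check, assuming outputs are named after the input's base name with a .wav extension, as in the batch examples; `pending_inputs` is a hypothetical helper, not a speak command.

```shell
# Hypothetical preview of the work left for --skip-existing: print every
# input file that has no matching .wav in the output directory.
pending_inputs() (
  outdir=$1
  shift
  for f in "$@"; do
    name=$(basename "$f")
    [ -e "$outdir/${name%.*}.wav" ] || printf '%s\n' "$f"
  done
)

# Example:
#   pending_inputs audiobook ch*.txt
```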

Common Errors

Error                             Cause              Solution
"Voice file not found"            Relative path      Use full path: ~/.chatter/voices/x.wav
"Invalid WAV format"              Wrong specs        Convert: ffmpeg -i in.wav -ar 24000 -ac 1 out.wav
"Voice sample too short"          <10 seconds        Record 15-25 seconds
"Output directory doesn't exist"  Not created        mkdir -p dirname/
"sox not found"                   Not installed      brew install sox
Scrambled concat order            Non-zero-padded    Use 01, 02, not 1, 2
Timeout                           >5 min generation  Use --auto-chunk or --timeout 600
"Server not running"              Stale daemon       speak daemon kill && speak health

Setup

How to use speak-tts on Cursor

AI-first code editor with Composer

1. Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your development machine
  • Node.js version 16.0+ with npm package manager (verify with node --version)
  • Active project directory or workspace where you want to add speak-tts

2. Execute installation command

Execute the skills CLI command in your project's root directory to begin installation:

$ npx skills add https://github.com/emzod/speak --skill speak-tts

The skills CLI fetches speak-tts from GitHub repository emzod/speak and configures it for Cursor.

3. Select Cursor when prompted

The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ── always included ────
│ • Amp
│ • Antigravity
│ • Cline
│ • Codex
│ ● Cursor (selected)
│ • Windsurf

4. Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/speak-tts

Reload or restart Cursor to activate speak-tts. Access the skill through slash commands (e.g., /speak-tts) or your agent's skill management interface.

Security & Verification Notice

We run automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security.

Skills execute code in your development environment. Always review the skill's source code and recent commits, verify the publisher's identity, and test in an isolated environment before production use.

Ratings

4.6 · 30 reviews
  • Hana Bhatia · Dec 28, 2024

    Registry listing for speak-tts matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Ganesh Mohane · Dec 24, 2024

    speak-tts reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Ira Patel · Dec 20, 2024

    We added speak-tts from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Hana Mehta · Dec 4, 2024

    Useful defaults in speak-tts — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.

  • Ishan Dixit · Nov 23, 2024

    speak-tts has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • Isabella Iyer · Nov 11, 2024

    Keeps context tight: speak-tts is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Ira Perez · Oct 14, 2024

    Solid pick for teams standardizing on skills: speak-tts is focused, and the summary matches what you get after install.

  • Henry White · Oct 2, 2024

    speak-tts is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.

  • Mia White · Sep 25, 2024

    Registry listing for speak-tts matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Jin Jackson · Sep 21, 2024

    Useful defaults in speak-tts — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.
