Content Core

by lfnovo
Extract text and audio from URLs, documents, videos, and images, with AI voice generation and text-to-speech, for unified content processing.
Extracts content from diverse media sources including URLs, documents, videos, audio files, and images using intelligent auto-detection and multiple extraction engines for unified content processing and analysis.
best for
- Content researchers analyzing diverse media sources
- Data analysts processing mixed document formats
- AI developers building content processing pipelines
- Anyone needing to extract text from various file types
capabilities
- Extract text from PDFs, Word docs, and other documents
- Transcribe videos and audio files to text
- Extract content from web URLs
- Perform OCR on images to extract text
- Process ZIP archives and other compressed files
- Generate AI summaries of extracted content
what it does
Extracts and processes content from URLs, documents, videos, audio files, and images into clean, structured text. Uses AI to automatically detect media types and apply the right extraction method.
about
Content Core is a community-built MCP server published by lfnovo that provides AI assistants with tools and capabilities via the Model Context Protocol. It extracts text and audio from URLs, documents, videos, and images for unified content processing. It is categorized under AI/ML and productivity. This server exposes 1 tool that AI clients can invoke during conversations and coding sessions.
how to install
You can install Content Core in your AI client of choice. Use the install panel on this page to get one-click setup for Cursor, Claude Desktop, VS Code, and other MCP-compatible clients. This server runs locally on your machine via the stdio transport.
license
MIT
Content Core is released under the MIT license. This is a permissive open-source license, meaning you can freely use, modify, and distribute the software.
readme
Content Core
Content Core is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries—all through a unified interface with multiple integration options.
🚀 What You Can Do
Extract content from anywhere:
- 📄 Documents - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
- 🎥 Media - Videos (MP4, AVI, MOV) with automatic transcription
- 🎵 Audio - MP3, WAV, M4A with speech-to-text conversion
- 🌐 Web - Any URL with intelligent content extraction
- 🖼️ Images - JPG, PNG, TIFF with OCR text recognition
- 📦 Archives - ZIP, TAR, GZ with content analysis
Process with AI:
- ✨ Clean & format extracted content automatically
- 📝 Generate summaries with customizable styles (bullet points, executive summary, etc.)
- 🎯 Context-aware processing - explain to a child, technical summary, action items
- 🔄 Smart engine selection - automatically chooses the best extraction method
🛠️ Multiple Ways to Use
🖥️ Command Line (Zero Install)
# Extract content from any source
uvx --from "content-core" ccore https://example.com
uvx --from "content-core" ccore document.pdf
# Generate AI summaries
uvx --from "content-core" csum video.mp4 --context "bullet points"
🤖 Claude Desktop Integration
One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.
🔍 Raycast Extension
Smart auto-detection commands:
- Extract Content - Full interface with format options
- Summarize Content - 9 summary styles available
- Quick Extract - Instant clipboard extraction
🖱️ macOS Right-Click Integration
Right-click any file in Finder → Services → Extract or Summarize content instantly.
🐍 Python Library
import content_core as cc
# Extract from any source
result = await cc.extract("https://example.com/article")
summary = await cc.summarize_content(result, context="explain to a child")
⚡ Key Features
- 🎯 Intelligent Auto-Detection: Automatically selects the best extraction method based on content type and available services
- 🔧 Smart Engine Selection:
- URLs: Firecrawl → Jina → Crawl4AI (optional) → BeautifulSoup fallback chain
- Documents: Docling → Enhanced PyMuPDF → Simple extraction fallback
- Media: OpenAI Whisper transcription
- Images: OCR with multiple engine support
- 📊 Enhanced PDF Processing: Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
- 🌍 Multiple Integrations: CLI, Python library, MCP server, Raycast extension, macOS Services
- ⚡ Zero-Install Options: Use `uvx` for instant access without installation
- 🧠 AI-Powered Processing: LLM integration for content cleaning and summarization
- 🔄 Asynchronous: Built with `asyncio` for efficient processing
- 🐍 Pure Python Implementation: No system dependencies required, simplified installation across all platforms
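The engine fallback chains above boil down to a try-in-order pattern: attempt the preferred engine and fall back to the next one on failure. The sketch below is purely illustrative of that idea, not Content Core's actual implementation; the engine functions are hypothetical stand-ins:

```python
from typing import Callable, Sequence

def extract_with_fallback(source: str, engines: Sequence[tuple[str, Callable[[str], str]]]) -> str:
    """Try each (name, engine) pair in order; return the first successful result."""
    errors = []
    for name, engine in engines:
        try:
            return engine(source)
        except Exception as exc:  # a real engine would raise more specific errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all engines failed: " + "; ".join(errors))

# Hypothetical stand-ins for a Firecrawl -> BeautifulSoup chain:
def flaky_engine(url: str) -> str:
    raise ConnectionError("service unavailable")

def soup_engine(url: str) -> str:
    return f"<extracted text from {url}>"

result = extract_with_fallback(
    "https://example.com",
    [("firecrawl", flaky_engine), ("bs4", soup_engine)],
)
```

The same pattern generalizes to the document chain (Docling, then enhanced PyMuPDF, then simple extraction).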
Getting Started
Installation
Install Content Core using pip - no system dependencies required!
# Basic installation (PyMuPDF + BeautifulSoup/Jina extraction)
pip install content-core
# With enhanced document processing (adds Docling)
pip install content-core[docling]
# With local browser-based URL extraction (adds Crawl4AI)
# Note: Requires Playwright browsers (~300MB). Run:
pip install content-core[crawl4ai]
python -m playwright install --with-deps
# Full installation (with all optional features)
pip install content-core[docling,crawl4ai]
Note: The core installation uses pure Python implementations and doesn't require system libraries like libmagic, ensuring consistent, hassle-free installation across Windows, macOS, and Linux. Optional features like Crawl4AI (browser automation) may require additional system dependencies.
Alternatively, if you’re developing locally:
# Clone the repository
git clone https://github.com/lfnovo/content-core
cd content-core
# Install with uv
uv sync
Command-Line Interface
Content Core provides three CLI commands for extracting, cleaning, and summarizing content: ccore, cclean, and csum. These commands support input from text, URLs, files, or piped data (e.g., via cat file | command).
Zero-install usage with uvx:
# Extract content
uvx --from "content-core" ccore https://example.com
# Clean content
uvx --from "content-core" cclean "messy content"
# Summarize content
uvx --from "content-core" csum "long text" --context "bullet points"
ccore - Extract Content
Extracts content from text, URLs, or files, with optional formatting. Usage:
ccore [-f|--format xml|json|text] [-d|--debug] [content]
Options:
- `-f, --format`: Output format (`xml`, `json`, or `text`). Default: `text`.
- `-d, --debug`: Enable debug logging.
- `content`: Input content (text, URL, or file path). If omitted, reads from stdin.
Examples:
# Extract from a URL as text
ccore https://example.com
# Extract from a file as JSON
ccore -f json document.pdf
# Extract from piped text as XML
echo "Sample text" | ccore --format xml
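The JSON output format makes `ccore` easy to drive from other programs. A minimal sketch, assuming `ccore` is on your PATH; the exact field names in its JSON payload (e.g. `"content"`) are an assumption here:

```python
import json
import subprocess

def extract_json(source: str) -> dict:
    """Invoke ccore with JSON output and parse the result (requires ccore on PATH)."""
    proc = subprocess.run(
        ["ccore", "-f", "json", source],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)

# The parsing step works on any JSON payload, for example:
sample = json.loads('{"content": "Sample text"}')
```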
cclean - Clean Content
Cleans content by removing unnecessary formatting, spaces, or artifacts. Accepts text, JSON, XML input, URLs, or file paths. Usage:
cclean [-d|--debug] [content]
Options:
- `-d, --debug`: Enable debug logging.
- `content`: Input content to clean (text, URL, file path, JSON, or XML). If omitted, reads from stdin.
Examples:
# Clean a text string
cclean " messy text "
# Clean piped JSON
echo '{"content": " messy text "}' | cclean
# Clean content from a URL
cclean https://example.com
# Clean a file’s content
cclean document.txt
csum - Summarize Content
Summarizes content with an optional context to guide the summary style. Accepts text, JSON, XML input, URLs, or file paths.
Usage:
csum [--context "context text"] [-d|--debug] [content]
Options:
- `--context`: Context for summarization (e.g., "explain to a child"). Default: none.
- `-d, --debug`: Enable debug logging.
- `content`: Input content to summarize (text, URL, file path, JSON, or XML). If omitted, reads from stdin.
Examples:
# Summarize text
csum "AI is transforming industries."
# Summarize with context
csum --context "in bullet points" "AI is transforming industries."
# Summarize piped content
cat article.txt | csum --context "one sentence"
# Summarize content from URL
csum https://example.com
# Summarize a file's content
csum document.txt
Quick Start
You can quickly integrate Content Core into your Python projects to extract, clean, and summarize content from various sources. The examples below use await, so run them inside an async function or an async-aware REPL.
import content_core as cc
# Extract content from a URL, file, or text
result = await cc.extract("https://example.com/article")
# Clean messy content
cleaned_text = await cc.clean("...messy text with [brackets] and extra spaces...")
# Summarize content with optional context
summary = await cc.summarize_content("long article text", context="explain to a child")
# Extract audio with custom speech-to-text model
from content_core.common import ProcessSourceInput
result = await cc.extract(ProcessSourceInput(
file_path="interview.mp3",
audio_provider="openai",
audio_model="whisper-1"
))
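Because the API is asynchronous, several sources can be processed concurrently with `asyncio.gather`. The sketch below uses a hypothetical `fake_extract` stand-in so it is self-contained; in real use you would substitute `cc.extract`:

```python
import asyncio

async def fake_extract(source: str) -> str:
    """Stand-in for content_core's awaitable extract function."""
    await asyncio.sleep(0)  # simulate I/O-bound work
    return f"content of {source}"

async def main() -> list[str]:
    sources = ["https://example.com", "report.pdf", "interview.mp3"]
    # Launch all extractions concurrently and collect results in order
    return await asyncio.gather(*(fake_extract(s) for s in sources))

results = asyncio.run(main())
```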
Documentation
For more information on how to use the Content Core library, including details on AI model configuration and customization, refer to our Usage Documentation.
MCP Server Integration
Content Core includes a Model Context Protocol (MCP) server that enables seamless integration with Claude Desktop and other MCP-compatible applications. The MCP server exposes Content Core's powerful extraction capabilities through a standardized protocol.
<a href="https://glama.ai/mcp/servers/@lfnovo/content-core"> <img width="380" height="200" src="https://glama.ai/mcp/servers/@lfnovo/content-core/badge" /> </a>
Quick Setup with Claude Desktop
# Install Content Core (MCP server included)
pip install content-core
# Or use directly with uvx (no installation required)
uvx --from "content-core" content-core-mcp
Add to your claude_desktop_config.json:
{
"mcpServers": {
"content-core": {
"command": "uvx",
"args": [
"--from",
"content-core",
"content-core-mcp"
]
}
}
}
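If you installed Content Core with pip instead of using uvx, the server can be launched by its console script directly. A config along these lines should work, assuming `content-core-mcp` is on the PATH that Claude Desktop sees:

```json
{
  "mcpServers": {
    "content-core": {
      "command": "content-core-mcp"
    }
  }
}
```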
For detailed setup instructions, configuration options, and usage examples, see our MCP Documentation.
Enhanced PDF Processing
Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.
Key Improvements
- 🔬 Mathematical Formula Extraction: E