monitoring-observability
ahmedasmar/devops-claude-skills · updated Apr 8, 2026
This skill provides comprehensive guidance for monitoring and observability workflows including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
Monitoring & Observability
Overview
When to use this skill:
- Setting up monitoring for new services
- Designing alerts and dashboards
- Troubleshooting performance issues
- Implementing SLO tracking and error budgets
- Choosing between monitoring tools
- Integrating OpenTelemetry instrumentation
- Analyzing metrics, logs, and traces
- Optimizing Datadog costs and finding waste
- Migrating from Datadog to an open-source stack
Core Workflow: Observability Implementation
Use this decision tree to determine your starting point:
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
    ├─ YES → Go to "9. Troubleshooting & Analysis"
    └─ NO → Are you improving existing monitoring?
        ├─ Alerts → Go to "3. Alert Design"
        ├─ Dashboards → Go to "4. Dashboard & Visualization"
        ├─ SLOs → Go to "5. SLO & Error Budgets"
        ├─ Tool selection → Read references/tool_comparison.md
        └─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
1. Design Metrics Strategy
Start with The Four Golden Signals
Every service should monitor:
- Latency: Response time (p50, p95, p99)
- Traffic: Requests per second
- Errors: Failure rate
- Saturation: Resource utilization
For request-driven services, use the RED Method:
- Rate: Requests/sec
- Errors: Error rate
- Duration: Response time
For infrastructure resources, use the USE Method:
- Utilization: % time busy
- Saturation: Queue depth
- Errors: Error count
Quick Start - Web Application Example:
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))
# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Duration (p95 latency)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
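How the p95 query arrives at a number: `histogram_quantile` works on cumulative bucket counts and interpolates linearly inside the bucket that contains the target rank. The sketch below mirrors that idea in plain Python. The function name and the sample bucket values are illustrative assumptions, not output from a real Prometheus server, and edge cases are handled more simply than in Prometheus itself.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` follows the Prometheus convention: each count includes all
    observations <= upper_bound, and the last bucket is +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # position of the target observation
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # No finite upper bound to interpolate towards
                return prev_bound
            # Linear interpolation inside the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets (seconds, cumulative count)
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, sample))
```

Because the result is interpolated, its accuracy depends on bucket boundaries: choose buckets that bracket your latency targets.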
Deep Dive: Metric Design
For comprehensive metric design guidance including:
- Metric types (counter, gauge, histogram, summary)
- Cardinality best practices
- Naming conventions
- Dashboard design principles
→ Read: references/metrics_design.md
Automated Metric Analysis
Detect anomalies and trends in your metrics:
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
--endpoint http://localhost:9090 \
--query 'rate(http_requests_total[5m])' \
--hours 24
# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
--namespace AWS/EC2 \
--metric CPUUtilization \
--dimensions InstanceId=i-1234567890abcdef0 \
--hours 48
→ Script: scripts/analyze_metrics.py
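To illustrate what anomaly detection on a metric series can look like, here is a minimal z-score sketch. It is one common approach and an assumption on our part; it does not claim to reproduce what scripts/analyze_metrics.py does internally. The function name, threshold, and sample data are hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(samples, threshold=3.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    if len(samples) < 2:
        return []
    mu = mean(samples)
    sigma = stdev(samples)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [(i, v) for i, v in enumerate(samples)
            if abs(v - mu) / sigma > threshold]


# Hypothetical requests/sec samples with one spike at the end
series = [100] * 20 + [500]
print(find_anomalies(series))
```

A z-score assumes roughly stable traffic; strongly seasonal metrics usually need a comparison against the same time window on previous days instead.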
2. Log Aggregation & Analysis
Structured Logging Checklist
Every log entry should include:
- ✅ Timestamp (ISO 8601 format)
- ✅ Log level (DEBUG, INFO, WARN, ERROR, FATAL)
- ✅ Message (human-readable)
- ✅ Service name
- ✅ Request ID (for tracing)
Example structured log (JSON):
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "order_id": "ORD-456",
  "error_type": "GatewayTimeout",
  "duration_ms": 5000
}
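A log entry like the one above can be produced with Python's standard `logging` module and a custom formatter. This is a minimal sketch: the `JsonFormatter` class and the convention of passing `request_id` and `context` through `extra` are assumptions made for illustration, not part of this skill's scripts.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service  # service name stamped on every entry

    def format(self, record):
        entry = {
            # ISO 8601 timestamp in UTC
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
            # request_id and context arrive via the logger's `extra` argument
            "request_id": getattr(record, "request_id", None),
        }
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("payment-service"))
logger = logging.getLogger("payment")
logger.addHandler(handler)

logger.error(
    "Payment processing failed",
    extra={"request_id": "550e8400-e29b-41d4-a716-446655440000",
           "context": {"order_id": "ORD-456", "error_type": "GatewayTimeout"}},
)
```

Keeping the required fields in the formatter, rather than at each call site, means every entry carries them even when a developer forgets.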
Log Aggregation Patterns
ELK Stack (Elasticsearch, Logstash, Kibana):
- Best for: Deep log analysis, complex queries
- Cost: High (infrastructure + operations)
- Complexity: High
Grafana Loki:
- Best for: Cost-effective logging, Kubernetes
- Cost: Low
- Complexity: Medium
CloudWatch Logs:
- Best for: AWS-centric applications
- Cost: Medium
- Complexity: Low
Log Analysis
Analyze logs for errors, patterns, and anomalies:
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log
# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors
# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
→ Script: scripts/log_analyzer.py
Deep Dive: Logging
For comprehensive logging guidance including:
- Structured logging implementation examples (Python, Node.js, Go, Java)
- Log aggregation patterns (ELK, Loki, CloudWatch, Fluentd)
- Query patterns and best practices
- PII redaction and security
- Sampling and rate limiting
→ Read: references/logging_guide.md
3. Alert Design
Alert Design Principles
- Every alert must be actionable - If you can't do something, don't alert
- Alert on symptoms, not causes - Alert on user experience, not components
- Tie alerts to SLOs - Connect to business impact
- Reduce noise - Only page for critical issues
Alert Severity Levels
| Severity | Response Time | Example |
|---|---|---|
| Critical | Page immediately | Service down, SLO violation |
| Warning | Ticket, review in hours | Elevated error rate, resource warning |
| Info | Log for awareness | Deployment completed, scaling event |
Multi-Window Burn Rate Alerting
Alert when error budget is consumed too quickly:
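# error_rate is a placeholder for the error ratio measured over the stated window
# (for example, a recording rule); 0.001 is the error budget of a 99.9% SLO.
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical
# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning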
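The thresholds 14.4 and 6 are easier to reason about as "share of the error budget spent". The sketch below shows the arithmetic, assuming a 30-day SLO period (an assumption; the rules above do not state the period). Function names are illustrative.

```python
def burn_rate(error_rate, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget


def budget_consumed(rate, window_hours, period_hours=30 * 24):
    """Fraction of the period's error budget used if `rate` holds for the window."""
    return rate * window_hours / period_hours


# A 1.44% error ratio against a 99.9% SLO burns the budget 14.4x too fast
print(burn_rate(0.0144, 0.999))
# Share of a 30-day budget spent by each alert's window at its threshold
print(budget_consumed(14.4, window_hours=1))
print(budget_consumed(6, window_hours=6))
```

Under that 30-day assumption, the fast-burn rule fires after roughly 2% of the monthly budget is gone in one hour, and the slow-burn rule after roughly 5% in six hours.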
Alert Quality Checker
Audit your alert rules against best practices:
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml
# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
Checks for:
- Alert naming conventions
- Required labels (severity, team)
- Required annotations (summary, description, runbook_url)
- PromQL expression quality
- 'for' clause to prevent flapping
→ Script: scripts/alert_quality_checker.py
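For a sense of what such checks involve, here is a small sketch that validates one already-parsed rule (a plain dict, as a YAML loader would return). It covers only the label, annotation, and 'for' checks listed above; the function name and message wording are hypothetical and do not reflect the actual output of scripts/alert_quality_checker.py.

```python
REQUIRED_LABELS = {"severity", "team"}
REQUIRED_ANNOTATIONS = {"summary", "description", "runbook_url"}


def lint_alert_rule(rule):
    """Return a list of problems found in a single parsed alert rule."""
    problems = []
    labels = rule.get("labels") or {}
    annotations = rule.get("annotations") or {}
    for key in sorted(REQUIRED_LABELS - set(labels)):
        problems.append(f"missing label: {key}")
    for key in sorted(REQUIRED_ANNOTATIONS - set(annotations)):
        problems.append(f"missing annotation: {key}")
    if "for" not in rule:
        # Without a 'for' clause the alert fires on a single bad evaluation
        problems.append("missing 'for' clause")
    return problems


rule = {
    "alert": "ErrorBudgetFastBurn",
    "expr": "(error_rate / 0.001) > 14.4",
    "for": "2m",
    "labels": {"severity": "critical"},
}
for problem in lint_alert_rule(rule):
    print(f"{rule['alert']}: {problem}")
```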
Alert Templates
Production-ready alert rule templates:
→ Templates:
- assets/templates/prometheus-alerts/webapp-alerts.yml - Web application alerts
- assets/templates/prometheus-alerts/kubernetes-alerts.yml - Kubernetes alerts
Deep Dive: Alerting
For comprehensive alerting guidance including:
- Alert design patterns (multi-window, rate of change, threshold with hysteresis)
- Alert annotation best practices
- Alert routing (severity-based, team-based, time-based)
- Inhibition rules
- Runbook structure
- On-call best practices
→ Read: references/alerting_best_practices.md
Runbook Template
Create comprehensive runbooks for your alerts:
→ Template: assets/templates/runbooks/incident-runbook-template.md
4. Dashboard & Visualization
Dashboard Design Principles
- Top-down layout: Most important metrics first
- Color coding: Red (critical), yellow (warning), green (healthy)
- Consistent time windows: All panels use same time range
- Limit panels: 8-12 panels per dashboard maximum
- Include context: Show related metrics together
Recommended Dashboard Structure
┌─────────────────────────────────────┐
│ Overall Health (Single Stats) │
│ [Requests/s] [Error%] [P95 Latency]│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Request Rate & Errors (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Latency Distribution (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Resource Usage (Graphs) │
└─────────────────────────────────────┘
Generate Grafana Dashboards
Automatically generate dashboards from templates:
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
--title "My API Dashboard" \
--service my_api \
--output dashboard.json
# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
--title "K8s Production" \
--namespace production \
--output k8s-dashboard.json
# Database dashboard
python3 scripts/dashboard_generator.py database \
--title "PostgreSQL" \
--db-type postgres \
--instance db.example.com:5432 \
--output db-dashboard.json
Supports:
- Web applications (requests, errors, latency, resources)
- Kubernetes (pods, nodes, resources, network)
- Databases (PostgreSQL, MySQL)
→ Script: scripts/dashboard_generator.py
5. SLO & Error Budgets
SLO Fundamentals
SLI (Service Level Indicator): Measurement of service quality
- Example: Request latency, error rate, availability
SLO (Service Level Objective): Target value for an SLI
- Example: "99.9% of requests return in < 500ms"
Error Budget: Allowed failure amount = (100% - SLO)
- Example: 99.9% SLO = 0.1% error budget = 43.2 minutes/month
Common SLO Targets
| Availability | Downtime/Month | Use Case |
|---|---|---|
| 99% | 7.2 hours | Internal tools |
| 99.9% | 43.2 minutes | Standard production |
| 99.95% | 21.6 minutes | Critical services |
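The downtime figures in the table follow directly from the error budget formula, assuming a 30-day month. A short sketch to reproduce them or to check other targets (the function name is illustrative):

```python
def allowed_downtime_minutes(slo_percent, days=30):
    """Minutes of full downtime permitted by an availability SLO over `days`."""
    error_budget = (100.0 - slo_percent) / 100.0
    return error_budget * days * 24 * 60


for target in (99.0, 99.9, 99.95):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes ({minutes / 60:.2f} hours)")
```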