it-operations
davila7/claude-code-templates · updated Apr 8, 2026
A comprehensive skill for managing IT infrastructure operations, ensuring service reliability, implementing monitoring and alerting strategies, managing incidents, and maintaining operational excellence through automation and best practices.
IT Operations Expert
Core Principles
1. Service Reliability First
- Proactive Monitoring: Implement comprehensive observability before incidents occur
- Incident Management: Structured response processes with clear escalation paths
- SLA/SLO Management: Define and maintain service level objectives aligned with business needs
- Continuous Improvement: Learn from incidents through blameless post-mortems
2. Automation Over Manual Processes
- Infrastructure as Code: Manage infrastructure configuration through version-controlled code
- Runbook Automation: Convert manual procedures into automated workflows
- Self-Healing Systems: Implement automated remediation for common issues
- Configuration Management: Maintain consistency across environments
3. ITIL Service Management
- Service Strategy: Align IT services with business objectives
- Service Design: Design resilient, scalable services
- Service Transition: Manage changes with minimal disruption
- Service Operation: Deliver and support services effectively
- Continual Service Improvement: Iteratively enhance service quality
4. Operational Excellence
- Documentation: Maintain current runbooks, procedures, and architecture diagrams
- Knowledge Management: Build searchable knowledge bases from incident resolutions
- Capacity Planning: Forecast and provision resources proactively
- Cost Optimization: Balance performance requirements with infrastructure costs
Core Workflow
Infrastructure Operations Workflow
1. MONITORING & OBSERVABILITY
├─ Define SLIs/SLOs/SLAs for critical services
├─ Implement metrics collection (infrastructure, application, business)
├─ Configure alerting with proper thresholds and escalation
├─ Build dashboards for different audiences (ops, devs, executives)
└─ Establish on-call rotation and escalation procedures
2. INCIDENT MANAGEMENT
├─ Receive alert or user report
├─ Assess severity and impact (P1/P2/P3/P4)
├─ Engage appropriate responders
├─ Investigate and diagnose root cause
├─ Implement fix or workaround
├─ Communicate status to stakeholders
├─ Document resolution in knowledge base
└─ Conduct post-incident review
3. CHANGE MANAGEMENT
├─ Submit change request with impact assessment
├─ Review and approve through CAB (Change Advisory Board)
├─ Schedule change window
├─ Execute change with rollback plan ready
├─ Validate success criteria
├─ Document actual vs planned results
└─ Close change ticket
4. CAPACITY PLANNING
├─ Collect resource utilization trends
├─ Analyze growth patterns
├─ Forecast future requirements
├─ Plan procurement or provisioning
├─ Execute capacity additions
└─ Monitor effectiveness
5. AUTOMATION & OPTIMIZATION
├─ Identify repetitive manual tasks
├─ Document current process
├─ Design automated solution
├─ Implement and test automation
├─ Deploy to production
├─ Measure time/cost savings
└─ Iterate and improve
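Step 4 of the workflow above (collect trends, analyze growth, forecast requirements) can be illustrated with a simple linear trend. This is a minimal sketch, assuming evenly spaced utilization samples in percent and a straight-line growth model; the function names `linear_trend` and `periods_until` are hypothetical, and real capacity forecasts often need seasonality-aware models.

```python
def linear_trend(values):
    """Least-squares slope and intercept for samples taken at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(values, threshold):
    """Estimate how many sampling periods remain before the trend reaches threshold.

    Returns None when utilization is flat or falling (no crossing is predicted).
    """
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    crossing_x = (threshold - intercept) / slope
    # Measure from the most recent sample; never report a negative horizon.
    return max(0.0, crossing_x - (len(values) - 1))


# Example: weekly disk utilization (assumed data) against an 80% threshold.
weekly_utilization = [50, 55, 60, 65]
print(periods_until(weekly_utilization, 80))
```

With a horizon in hand, procurement or provisioning can be scheduled with lead time rather than triggered by an alert.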
Decision Frameworks
Alert Configuration Decision Matrix
| Scenario | Alert Type | Threshold | Response Time | Escalation |
|---|---|---|---|---|
| Service completely down | Page | Immediate | < 5 min | Immediate to on-call |
| Service degraded | Page | 2-3 failures | < 15 min | After 15 min to on-call |
| High resource usage | Warning | > 80% sustained | < 1 hour | After 2 hours to team lead |
| Approaching capacity | Info | > 70% trend | < 24 hours | Weekly capacity review |
| Configuration drift | Ticket | Any deviation | < 7 days | Monthly review |
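One way to keep the matrix reviewable is to store it as data under version control. The following is a minimal sketch: the `AlertPolicy` class, the `ALERT_POLICIES` table, the `policy_for` helper, and the scenario keys are all hypothetical names chosen for illustration; only the values come from the matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertPolicy:
    alert_type: str
    threshold: str
    response_time: str
    escalation: str


# Keys are hypothetical identifiers; values mirror the rows of the matrix.
ALERT_POLICIES = {
    "service_down": AlertPolicy("Page", "Immediate", "< 5 min", "Immediate to on-call"),
    "service_degraded": AlertPolicy("Page", "2-3 failures", "< 15 min", "After 15 min to on-call"),
    "high_resource_usage": AlertPolicy("Warning", "> 80% sustained", "< 1 hour", "After 2 hours to team lead"),
    "approaching_capacity": AlertPolicy("Info", "> 70% trend", "< 24 hours", "Weekly capacity review"),
    "configuration_drift": AlertPolicy("Ticket", "Any deviation", "< 7 days", "Monthly review"),
}


def policy_for(scenario: str) -> AlertPolicy:
    """Look up the policy for a scenario; unknown scenarios fail loudly."""
    try:
        return ALERT_POLICIES[scenario]
    except KeyError:
        raise KeyError(f"no alert policy defined for scenario {scenario!r}") from None


print(policy_for("service_degraded"))
```

Failing on unknown scenarios is a deliberate choice here: a silent default would hide gaps in alert coverage.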
Incident Severity Classification
Priority 1 (Critical)
- Complete service outage affecting all users
- Data loss or security breach
- Financial impact > $10K/hour
- Response: Immediate, 24/7, all hands on deck
Priority 2 (High)
- Partial service outage affecting many users
- Significant performance degradation
- Financial impact $1K-$10K/hour
- Response: < 30 minutes during business hours
Priority 3 (Medium)
- Service degradation affecting some users
- Non-critical functionality impaired
- Workaround available
- Response: < 4 hours during business hours
Priority 4 (Low)
- Minor issues with minimal impact
- Cosmetic problems
- Enhancement requests
- Response: Next business day
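The classification above can be sketched as a small function. This is an illustration under stated assumptions: user scope is reduced to four hypothetical labels (`all`, `many`, `some`, `few`), the function name `classify_severity` is invented, and qualitative criteria such as "workaround available" are left to human judgment.

```python
VALID_SCOPES = {"all", "many", "some", "few"}


def classify_severity(users_affected: str, hourly_loss_usd: float = 0.0,
                      data_loss_or_breach: bool = False) -> str:
    """Map impact signals to P1-P4, taking the most severe matching rule."""
    if users_affected not in VALID_SCOPES:
        raise ValueError(f"users_affected must be one of {sorted(VALID_SCOPES)}")
    # P1: full outage, data loss or breach, or more than $10K/hour.
    if data_loss_or_breach or users_affected == "all" or hourly_loss_usd > 10_000:
        return "P1"
    # P2: many users affected, or $1K-$10K/hour.
    if users_affected == "many" or hourly_loss_usd >= 1_000:
        return "P2"
    # P3: some users affected.
    if users_affected == "some":
        return "P3"
    # P4: minimal impact.
    return "P4"


print(classify_severity("some", hourly_loss_usd=5_000))
```

The function evaluates rules from most to least severe, so a single P1 trigger overrides milder signals.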
Change Management Risk Assessment
Risk Level = Impact × Likelihood × Complexity
Impact (1-5):
1 = Single user
2 = Team
3 = Department
4 = Company-wide
5 = Customer-facing
Likelihood of Issues (1-5):
1 = Routine, tested
2 = Familiar, documented
3 = Some uncertainty
4 = New territory
5 = Never done before
Complexity (1-5):
1 = Single component
2 = Few components
3 = Multiple systems
4 = Cross-platform
5 = Enterprise-wide
Risk Score Interpretation:
1-20: Standard change (pre-approved)
21-50: Normal change (CAB review)
51-75: High-risk change (extensive testing, senior approval)
76-125: Emergency change only (executive approval)
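The scoring formula and its interpretation bands translate directly into code. A minimal sketch follows; the function names `risk_score` and `change_category` are hypothetical, while the ranges and labels are taken from the assessment above.

```python
def risk_score(impact: int, likelihood: int, complexity: int) -> int:
    """Risk Level = Impact x Likelihood x Complexity, each rated 1-5."""
    for name, value in (("impact", impact), ("likelihood", likelihood),
                        ("complexity", complexity)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return impact * likelihood * complexity


def change_category(score: int) -> str:
    """Map a risk score (1-125) to the change type from the interpretation bands."""
    if not 1 <= score <= 125:
        raise ValueError(f"score must be between 1 and 125, got {score}")
    if score <= 20:
        return "Standard change (pre-approved)"
    if score <= 50:
        return "Normal change (CAB review)"
    if score <= 75:
        return "High-risk change (extensive testing, senior approval)"
    return "Emergency change only (executive approval)"


# Example: a company-wide change (4) with some uncertainty (3) across platforms (4).
score = risk_score(impact=4, likelihood=3, complexity=4)
print(score, change_category(score))
```

Because the three factors multiply, a low rating on any one of them pulls the score down sharply; teams that want a single high factor to dominate may prefer a different aggregation.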
Monitoring Tool Selection
| Requirement | Prometheus + Grafana | Datadog | New Relic | ELK Stack | Splunk |
|---|---|---|---|---|---|
| Cost | Free (self-hosted) | $$$$ | $$$$ | Free-$$ | $$$$$ |
| Metrics | Excellent | Excellent | Excellent | Good | Good |
| Logs | Via Loki | Excellent | Excellent | Excellent | Excellent |
| Traces | Via Tempo | Excellent | Excellent | Limited | Good |
| Learning Curve | Steep | Moderate | Moderate | Steep | Steep |
| Cloud-Native | Excellent | Excellent | Excellent | Good | Good |
| On-Premises | Excellent | Good | Good | Excellent | Excellent |
| APM | Via exporters | Excellent | Excellent | Limited | Good |
Common Operational Challenges
Challenge 1: Alert Fatigue
Problem: Too many false positive alerts causing team burnout
Solution:
Alert Tuning Process:
1. Measure baseline alert volume and false positive rate
2. Categorize alerts by actionability:
- Actionable + Urgent = Keep as page
- Actionable + Not Urgent = Ticket
- Not Actionable = Remove or convert to dashboard metric
3. Implement alert aggregation (group similar alerts)
4. Add context to alerts (runbook links, relevant metrics)
5. Regular review meetings (weekly) to tune thresholds
6. Track metrics:
- MTTA (Mean Time to Acknowledge): < 5 min target
- False Positive Rate: < 20% target
- Alert Volume per Week: Trending down
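Step 2 of the tuning process and the false positive target can be sketched as two small helpers. The names `triage` and `false_positive_rate` and the returned labels are hypothetical; deciding whether an alert is actionable or urgent still requires a human review.

```python
def triage(actionable: bool, urgent: bool) -> str:
    """Apply the actionability rules from the tuning process."""
    if not actionable:
        return "remove_or_dashboard"
    return "page" if urgent else "ticket"


def false_positive_rate(total_alerts: int, false_positives: int) -> float:
    """Percentage of alerts that turned out to be false positives."""
    if total_alerts <= 0:
        raise ValueError("total_alerts must be positive")
    if not 0 <= false_positives <= total_alerts:
        raise ValueError("false_positives must be between 0 and total_alerts")
    return false_positives * 100 / total_alerts


# Example: 200 alerts last week (assumed data), 30 of them false positives.
rate = false_positive_rate(200, 30)
print(f"{rate:.1f}% false positives, target met: {rate < 20}")
```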
Challenge 2: Incident Documentation During Crisis
Problem: Teams skip documentation during high-pressure incidents
Solution:
- Assign dedicated scribe role (not the incident commander)
- Use incident management tools (PagerDuty, Opsgenie) with automatic timeline
- Template-based incident reports with required fields
- Post-incident review scheduled automatically (within 48 hours)
- Gamify documentation (track and recognize thorough documentation)
Challenge 3: Knowledge Silos
Problem: Critical knowledge trapped in individual team members' heads
Solution:
Knowledge Transfer Strategy:
- Pair Programming/Shadowing: 20% of sprint capacity
- Runbook Requirements: Every system must have runbook
- Lunch & Learn Sessions: Weekly 30-min knowledge sharing
- Cross-Training Matrix: Track who knows what, identify gaps
- On-Call Rotation: Everyone rotates to spread knowledge
- Post-Incident Reviews: Mandatory team sharing
- Documentation Sprints: Quarterly focus on doc completion
Challenge 4: Balancing Stability vs Innovation
Problem: Operations team resists change to maintain stability
Solution:
- Implement change windows (planned maintenance periods)
- Use blue-green or canary deployments for lower risk
- Establish "innovation time" (Google 20% time model)
- Create sandbox environments for experimentation
- Measure and reward both stability AND improvement metrics
- Include "toil reduction" as OKR target
Key Metrics & KPIs
Service Reliability Metrics
Availability:
Formula: (Total Time - Downtime) / Total Time × 100
Target: 99.9% (43.8 min/month downtime)
Measurement: Per service, monthly
MTTR (Mean Time to Recovery):
Formula: Sum of recovery times / Number of incidents
Target: < 30 minutes for P1, < 4 hours for P2
Measurement: Per severity level, monthly
MTBF (Mean Time Between Failures):
Formula: Total operational time / Number of failures
Target: > 720 hours (30 days)
Measurement: Per service, quarterly
MTTA (Mean Time to Acknowledge):
Formula: Sum of acknowledgment times / Number of alerts
Target: < 5 minutes for pages
Measurement: Per on-call engineer, weekly
Change Success Rate:
Formula: Successful changes / Total changes × 100
Target: > 95%
Measurement: Monthly
Incident Recurrence Rate:
Formula: Repeat incidents / Total incidents × 100
Target: < 10%
Measurement: Quarterly (same root cause within 90 days)
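The formulas above are simple enough to compute directly. The sketch below assumes an average month of 365.25 / 12 days, which reproduces the 43.8 min/month figure quoted for a 99.9% target; the function names are hypothetical.

```python
# Average month in minutes: 365.25 / 12 days (an assumed averaging convention).
MINUTES_PER_AVERAGE_MONTH = 365.25 / 12 * 24 * 60


def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """(Total Time - Downtime) / Total Time x 100."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes * 100


def allowed_downtime_minutes(target_pct: float,
                             period_minutes: float = MINUTES_PER_AVERAGE_MONTH) -> float:
    """Downtime permitted in a period while still meeting the availability target."""
    return period_minutes * (100 - target_pct) / 100


def mttr_minutes(recovery_times: list) -> float:
    """Sum of recovery times / number of incidents."""
    if not recovery_times:
        raise ValueError("need at least one incident")
    return sum(recovery_times) / len(recovery_times)


def mtbf_hours(operational_hours: float, failures: int) -> float:
    """Total operational time / number of failures."""
    if failures <= 0:
        raise ValueError("failures must be positive")
    return operational_hours / failures


print(round(allowed_downtime_minutes(99.9), 1))  # 43.8
print(mttr_minutes([20, 40]))                    # 30.0
print(mtbf_hours(2160, 3))                       # 720.0
```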
Operational Efficiency Metrics
Toil Percentage:
Definition: Time spent on manual, repetitive tasks
Target: < 30% of team capacity
Measurement: Weekly time tracking
Automation Coverage:
Formula: Automated tasks / Total repetitive tasks × 100
Target: > 70%
Measurement: Quarterly audit
On-Call Load:
Formula: Alerts per on-call shift
Target: < 5 actionable alerts per shift
Measurement: Per engineer, weekly
Runbook Coverage:
Formula: Services with runbooks / Total services × 100
Target: 100%
Measurement: Monthly audit
Knowledge Base Utilization:
Formula: Incidents resolved via KB / Total incidents × 100
Target: > 40%
Measurement: Monthly
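A periodic report can check several of these ratios against their targets at once. This is a minimal sketch: the `efficiency_report` function and its parameter names are hypothetical, and the input counts are assumed to come from time tracking and audits as described above. On-call load is omitted because it is a count, not a ratio.

```python
def pct(part: float, whole: float) -> float:
    """Percentage helper that rejects an empty denominator."""
    if whole <= 0:
        raise ValueError("denominator must be positive")
    return part * 100 / whole


def efficiency_report(toil_hours, capacity_hours, automated_tasks, repetitive_tasks,
                      services_with_runbooks, total_services,
                      kb_resolved, total_incidents):
    """Return {metric: (value_pct, meets_target)} using the targets listed above."""
    toil = pct(toil_hours, capacity_hours)
    automation = pct(automated_tasks, repetitive_tasks)
    runbooks = pct(services_with_runbooks, total_services)
    kb = pct(kb_resolved, total_incidents)
    return {
        "toil_percentage": (toil, toil < 30),
        "automation_coverage": (automation, automation > 70),
        "runbook_coverage": (runbooks, runbooks >= 100),
        "kb_utilization": (kb, kb > 40),
    }


# Example with assumed figures for one team.
report = efficiency_report(toil_hours=10, capacity_hours=40,
                           automated_tasks=8, repetitive_tasks=10,
                           services_with_runbooks=9, total_services=10,
                           kb_resolved=5, total_incidents=10)
for metric, (value, ok) in report.items():
    print(f"{metric}: {value:.0f}% {'OK' if ok else 'below target'}")
```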
Integration Points
With Development Teams
- Participate in design reviews for operational requirements
- Provide deployment automation and CI/CD pipeline support
- Share monitoring and logging requirements
- Collaborate on incident response and post-mortems
- Joint ownership of SLOs and error budgets
With Security Teams
- Implement security monitoring and alerting
- Manage access controls and authentication systems
- Coordinate vulnerability patching and remediation
- Conduct security incident response
- Maintain compliance with security policies
With Business Stakeholders
- Report on service availability and performance
- Communicate planned maintenance windows
- Provide capacity planning forecasts
- Translate technical metrics to business impact
- Participate in business continuity planning
Best Practices
1. Blameless Post-Mortems
Post-Incident Review Template:
- Incident Summary (what happened, when, impact)
- Timeline of Events (detailed chronology)
- Root Cause Analysis (5 Whys or Fishbone)
- What Went Well (strengths during response)
- What Could Be Improved (opportunities)
- Action
How to use it-operations on Cursor
AI-first code editor with Composer
Prerequisites
Before installing skills in Cursor, ensure your development environment meets these requirements:
- Cursor installed and configured on your development machine
- Node.js version 16.0+ with npm package manager (verify with node --version)
- Active project directory or workspace where you want to add it-operations
Execute installation command
Execute the skills CLI command in your project's root directory to begin installation:
The skills CLI fetches it-operations from GitHub repository davila7/claude-code-templates and configures it for Cursor.
Select Cursor when prompted
The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:
Verify installation
Confirm successful installation by checking the skill directory location:
Reload or restart Cursor to activate it-operations. Access the skill through slash commands (e.g., /it-operations) or your agent's skill management interface.
Security & Verification Notice
We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.
Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.
Use Cases
User Story & Requirements Generation
Create detailed user stories, acceptance criteria, and feature specs
Example
Generate user stories for 'password reset feature' with acceptance criteria, edge cases, and test scenarios
Reduce spec writing time by 50%, ensure comprehensive coverage
Competitive Analysis
Research competitors, compare features, identify gaps
Example
Analyze 5 competitor products, create feature comparison matrix, suggest differentiation opportunities
Complete competitive research in 2 hours instead of 2 days
Roadmap Prioritization
Evaluate features using frameworks (RICE, ICE, Kano) and create prioritized backlogs
Example
Score 20 feature ideas using RICE framework, generate prioritized roadmap with rationale
Make data-driven prioritization decisions faster
Stakeholder Communication
Draft PRDs, status updates, and stakeholder presentations
Example
Create executive summary of Q3 roadmap, monthly progress report, feature launch announcement
Save 3-5 hours/week on communication overhead
Implementation Guide
Prerequisites
- Claude Desktop or compatible AI client
- Access to product documentation and roadmap tools (Jira, Notion, etc.)
- Understanding of product management frameworks (RICE, Jobs-to-be-Done, etc.)
- Stakeholder contact information and communication channels
Time Estimate
30-60 minutes to see productivity improvements
Installation Steps
1. Install product management skill
2. Start with user story generation for known feature
3. Progress to competitive analysis: research 2-3 competitors
4. Use for roadmap prioritization: apply RICE/ICE scoring
5. Draft stakeholder communications and refine based on feedback
6. Build template library for recurring PM tasks
7. Share effective prompts with product team
Common Pitfalls
- Not validating competitive research: verify facts before sharing
- Accepting user stories without involving engineering team
- Over-relying on frameworks without qualitative judgment
- Not customizing outputs to company culture and communication style
- Skipping stakeholder validation of generated requirements
Best Practices
✓ Do
- Validate research and competitive analysis with real data
- Collaborate with engineering when generating technical requirements
- Customize frameworks and templates to your company context
- Use skill for first drafts, refine with stakeholder input
- Document successful prompt patterns for PM tasks
- Combine AI efficiency with human judgment and intuition
✗ Don't
- Don't publish competitive analysis without fact-checking
- Don't finalize user stories without engineering review
- Don't make prioritization decisions solely on AI scoring
- Don't skip customer validation of generated requirements
- Don't ignore company-specific context and culture
💡 Pro Tips
- Provide context: company goals, constraints, customer feedback
- Ask for alternatives: 'Show 3 ways to prioritize this roadmap'
- Request stakeholder-specific formatting: 'Executive summary vs. engineering spec'
- Use skill for 70% generation + 30% customization to company needs
When to Use This
✓ Use When
Use for user story writing, competitive research, roadmap prioritization, stakeholder communication, and PRD drafting. Best for reducing repetitive documentation and research work.
✗ Avoid When
Avoid for strategic product vision (requires deep customer empathy), pricing decisions (needs market and financial expertise), or when face-to-face customer discovery is more valuable than speed.
Learning Path
1. Basic: user stories, feature specs, status updates
2. Intermediate: competitive analysis, prioritization frameworks, PRDs
3. Advanced: product strategy, go-to-market planning, OKR setting
4. Expert: product vision, market positioning, business model innovation