Service Reliability Loop
This loop enables continuous improvement of service reliability and uptime by leveraging Service Level Objective (SLO) reports to identify and address performance gaps.
Goal
Improve uptime and reliability
How to Run
The agent iteratively analyzes SLO reports, identifies reliability issues, implements fixes, and verifies improvements until the SLO target is met or the 10-iteration limit is reached.
- 01
Initialize SLO Check
Run the 'slo report' command to get current reliability metrics and identify any violations of SLO targets.
- 02
Analyze Results
Inspect the SLO report output to determine root causes of reliability issues, focusing on error budgets and key performance indicators.
- 03
Implement Fixes
Make targeted changes to code or configuration to address identified reliability problems, prioritizing high-impact improvements.
- 04
Verify Improvements
Re-run 'slo report' to assess whether the changes have moved reliability toward the SLO target.
- 05
Iterate Until Target Met
If the SLO target is not met, repeat the process with refined analysis and actions, up to a maximum of 10 iterations.
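The control flow of the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the agent's actual implementation: `check_slo` and `apply_fix` are hypothetical callables standing in for "run 'slo report' and decide whether the target is met" and "analyze and implement one round of fixes". The output format of 'slo report' is not specified here, so parsing it is left to the caller.

```python
from typing import Callable, Tuple

MAX_ITERATIONS = 10  # iteration cap stated in the loop description


def run_reliability_loop(
    check_slo: Callable[[], bool],     # hypothetical: True when the SLO target is met
    apply_fix: Callable[[int], None],  # hypothetical: one analyze-and-fix pass
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[bool, int]:
    """Return (target_met, iterations_used)."""
    # Step 01: initial check before changing anything.
    if check_slo():
        return True, 0
    for iteration in range(1, max_iterations + 1):
        # Steps 02-03: analyze the report and implement a fix.
        apply_fix(iteration)
        # Step 04: re-check after every change.
        if check_slo():
            return True, iteration
    # Step 05: stop once the iteration cap is reached.
    return False, max_iterations
```

Checking the SLO before the first fix means a service that already meets its target exits without any changes being made.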
Workflow Steps
- 01
Generate SLO Report
Execute the SLO reporting tool to gather current reliability metrics including availability, latency, and error rates.
- 02
Review and Analyze Metrics
Examine the SLO report to identify which metrics are below target thresholds and require attention.
- 03
Identify Root Causes
Determine underlying causes of reliability issues by correlating metrics with recent changes, logs, and system behavior.
- 04
Develop and Apply Fixes
Create targeted solutions such as code patches, configuration adjustments, or infrastructure optimizations to address issues.
- 05
Validate Changes
Test implemented fixes to ensure they improve reliability without introducing new problems.
- 06
Reassess SLO Compliance
Re-run 'slo report' to verify whether the changes have brought reliability metrics within the SLO targets.
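When reviewing metrics against targets, it can help to express the gap as remaining error budget. The sketch below shows the standard arithmetic for a request-based availability SLO. The function name and its inputs are illustrative assumptions, and it does not depend on how 'slo report' formats its output.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    slo_target: availability target as a fraction, e.g. 0.999 for 99.9%.
    total:      requests served in the SLO window.
    failed:     requests that violated the SLO in that window.
    """
    # The budget is the number of failures the target tolerates.
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        raise ValueError("target leaves no error budget for this window")
    return 1.0 - failed / allowed_failures


# Example: a 99.9% target over 1,000,000 requests tolerates about 1,000 failures;
# 250 failures leaves roughly three quarters of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```

A negative result means the budget is overspent, which is a reasonable signal to prioritize that metric in the next iteration.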
Kickoff Prompt
Start the "Service Reliability Loop" loop.
Goal: Improve uptime and reliability
Max iterations: 10
Between iterations run: slo report
Exit when: SLO target met

Begin by generating an SLO report using the command 'slo report'. Analyze the results to identify any reliability issues that are preventing us from meeting our targets. Prioritize the most critical problems and implement fixes aimed at improving system uptime and performance. After each change, re-run the SLO report to assess progress. Continue this cycle of analysis and improvement until the SLO target is met or the maximum number of iterations (10) is reached. Focus on sustainable, high-impact changes that enhance overall service reliability.

Self-pace this loop. After each iteration, run `slo report` and evaluate the output, and only continue if the exit condition is not met (SLO target met). Stop when the exit condition passes or 10 iterations are reached. Give a short status update each pass.
Guardrails
- Avoid changes that could cause service downtime or significant user impact
- Ensure all modifications align with defined SLOs and error budgets
- Maintain reversibility of changes to allow quick rollbacks if needed
- Prioritize improvements that provide measurable reliability gains
- Stay within approved deployment windows and change management processes
- Do not exceed system resource limits during testing or implementation
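Some of these guardrails can be checked mechanically before a fix is applied. As one hedged example, the deployment-window guardrail could be enforced with a small check like the one below. The window itself (Monday to Thursday, 09:00 to 16:00) and the function name are assumptions made for illustration; real windows come from your change management process.

```python
from datetime import datetime, time
from typing import FrozenSet

# Assumed example window: Monday (0) through Thursday (3), 09:00 to 16:00.
ALLOWED_WEEKDAYS: FrozenSet[int] = frozenset({0, 1, 2, 3})
WINDOW_START = time(9, 0)
WINDOW_END = time(16, 0)


def in_deployment_window(
    now: datetime,
    weekdays: FrozenSet[int] = ALLOWED_WEEKDAYS,
    start: time = WINDOW_START,
    end: time = WINDOW_END,
) -> bool:
    """True when `now` falls on an allowed weekday and inside [start, end)."""
    return now.weekday() in weekdays and start <= now.time() < end
```

An iteration that fails this check would defer its change instead of applying it.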
Related loops (DevOps)
Container Security Fixer
Automatically detects and remediates security vulnerabilities in container images through iterative scanning and patching workflows.
Monitoring Coverage Builder
This loop iteratively identifies and adds missing monitoring coverage to your codebase by analyzing test coverage, identifying gaps, and implementing targeted monitoring solutions until the desired threshold is achieved.
Alert Noise Reducer
Automatically analyzes and reduces false positive alerts in your monitoring system by identifying noisy patterns and adjusting alert configurations. This agent examines alert metrics, detects recurring false positives, and modifies alert rules to improve signal-to-noise ratio without compromising critical system visibility.