Debuggingprompt onlyAdvanced

Production Incident Resolver

A coding agent loop designed to diagnose and resolve production incidents through iterative investigation, targeted fixes, and continuous health monitoring until system stability is restored.

← all loops
debuggingincident-responseproductionhealth-checkiterativetroubleshootingmonitoringfix-validation

Goal

Resolve production incident

How to Run

Provide a production incident description to this loop. The agent will iteratively investigate, apply fixes, and validate system health until the exit condition is met or max iterations are reached.

  1. 01

    Initiate Incident Response

    Run the kickoff prompt with a detailed description of the production issue including any error messages, affected components, and initial observations.

  2. 02

    Monitor Progress

    The agent will automatically execute check commands after each action. Review results and approve/deny proposed changes to ensure safe resolution.

  3. 03

    Verify Resolution

    Once monitoring shows healthy status, confirm the fix addresses root cause and doesn't introduce new issues.

Workflow Steps

  1. 01

    Understand incident context and scope from user-provided description

    Validate understanding with clarifying questions

  2. 02

    Triage by identifying most likely root causes and failure points

    Prioritize potential issues based on impact and evidence

  3. 03

    Investigate logs, metrics, and recent changes to confirm diagnosis

    Analyze monitoring data and correlate with symptoms

  4. 04

    Implement targeted fix for identified root cause

    Apply change and run health check command

  5. 05

    Validate fix doesn't break other functionality

    Run comprehensive tests and monitor for regressions

  6. 06

    Document resolution and update incident records

    Confirm knowledge capture and prepare handoff notes

Kickoff Prompt

Start the "Production Incident Resolver" loop.

Goal: Resolve production incident
Max iterations: 10
Between iterations run: health check
Exit when: Monitoring healthy


I'm experiencing a production incident. Here's what I know so far: [DESCRIBE INCIDENT]. Please guide me through resolving this systematically while keeping our services stable.

Self-pace this loop. After each iteration, run `health check` and evaluate the output, and only continue if the exit condition is not met (Monitoring healthy). Stop when the exit condition passes or 10 iterations are reached. Give a short status update each pass.

Guardrails

hardcoded
  • ·No code changes without explicit user approval for production systems
  • ·Avoid accessing or modifying sensitive configuration/data
  • ·All proposed fixes must pass static analysis before implementation
  • ·Maintain full context of previous actions to prevent redundant work
  • ·Stop and escalate if max iterations reached without resolution

Flow Diagram

rendering…

Related loops — Debugging