yprikhodko/aws-devops-agent-demo-environment

AWS DevOps Agent Demo

This demo showcases AWS DevOps Agent's capabilities in identifying root causes of system issues and accelerating incident response through a realistic Unicorn Rentals microservices architecture.

Application Overview

Unicorn Rentals is a customer booking system with rental processing logic and analytics reporting:

graph TB
    subgraph app["🦄 Unicorn Rentals Application"]
        Customer["👤 Customer Requests"]
        APIGW["🌐 API Gateway<br/><i>unicorn-rentals-api</i>"]
        Lambda["⚡ Lambda<br/><i>rental-processor</i><br/>"]
        DDB["🗄️ DynamoDB<br/><i>unicorn-rentals</i><br/>"]
        
        Customer -->|"POST /process"| APIGW
        APIGW -->|"Invoke"| Lambda
        Lambda -->|"Read/Write"| DDB
    end

    subgraph monitoring["📊 Monitoring & Alerting"]
        A1["🔴 Error Alarm<br/>Errors > 3/min"]
        A2["🟡 Duration Alarm<br/>Avg > 10s"]
        A3["🟠 Throttle Alarm<br/>Throttles ≥ 1"]
        SNS["📨 SNS Topic"]
        Forwarder["⚙️ Webhook Forwarder<br/><i>Lambda</i>"]
        
        A1 & A2 & A3 -->|"ALARM state"| SNS
        SNS --> Forwarder
    end

    subgraph devops["🤖 AWS DevOps Agent"]
        Agent["🔍 Investigation<br/>Root Cause Analysis<br/>Mitigation Plans"]
        Slack["💬 Slack<br/>Real-time findings"]
        
        Agent -->|"Posts updates"| Slack
    end

    Lambda -.->|"Metrics & Logs"| A1 & A2
    DDB -.->|"Throttle Events"| A3
    Forwarder -->|"HMAC Webhook"| Agent

    style app fill:#1a1a2e,stroke:#e94560,color:#fff
    style monitoring fill:#1a1a2e,stroke:#f5a623,color:#fff
    style devops fill:#1a1a2e,stroke:#00d2ff,color:#fff

The architecture includes intentional failure points that demonstrate DevOps Agent's diagnostic capabilities across multiple AWS services.
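
The three alarm thresholds in the diagram amount to three comparisons. The sketch below assumes one-minute metric samples; `MinuteSample` and `breached_alarms` are hypothetical names used for illustration and are not taken from the CloudFormation template. Real CloudWatch alarms also apply evaluation periods, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinuteSample:
    """One minute of metrics (hypothetical shape, for illustration only)."""
    errors: int            # Lambda errors in the minute
    avg_duration_s: float  # average Lambda duration, in seconds
    throttles: int         # DynamoDB throttle events in the minute


def breached_alarms(sample: MinuteSample) -> List[str]:
    """Return the alarms from the diagram that this sample would breach."""
    breached = []
    if sample.errors > 3:            # Error Alarm: Errors > 3/min
        breached.append("error")
    if sample.avg_duration_s > 10:   # Duration Alarm: Avg > 10s
        breached.append("duration")
    if sample.throttles >= 1:        # Throttle Alarm: Throttles >= 1
        breached.append("throttle")
    return breached
```

Note that the error and duration thresholds are strict (`>`), while a single throttle event is enough to breach the throttle alarm.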

Demo Scenarios

🧠 Scenario 1: Memory Exhaustion

Rental Analytics Overload

  • Problem: Rental processor runs out of memory (127/128 MB usage)
  • Cause: Loading large analytics datasets during processing
  • Impact: 30% booking failures with 10+ second delays
  • Symptoms: Runtime.OutOfMemory errors, processing timeouts

🗄️ Scenario 2: Database Throttling

Peak Demand Capacity Issues

  • Problem: DynamoDB capacity exceeded during high traffic
  • Cause: Provisioned throughput limits hit during batch operations
  • Impact: 5+ second booking delays, intermittent failures
  • Symptoms: ProvisionedThroughputExceededException errors

🔗 Scenario 3: Cascade Failures

Cross-Service Error Propagation

  • Problem: Multi-service error chain reaction
  • Cause: Database throttling → Lambda timeouts → API Gateway 5XX errors
  • Impact: Complete system outage affecting all customers
  • Symptoms: Service dependency failures across the stack

Getting Started

Prerequisites

  • AWS CLI configured with appropriate permissions
  • Python 3.7+ with the requests library (Python 3.10+ if you also deploy the optional MCP server)
  • CloudFormation deployment permissions

1. Deploy the Infrastructure

# Make deployment script executable
chmod +x deploy.sh

# Deploy the complete stack
./deploy.sh

The deployment creates:

  • API Gateway: demo-unicorn-rentals-api
  • Lambda Function: demo-unicorn-rental-processor (128MB memory, 30% error rate)
  • DynamoDB Table: demo-unicorn-rentals (low provisioned capacity)
  • CloudWatch Alarms: Error rate, duration, and throttling monitors

2. Load Environment Variables

# Source the generated environment file
source demo-environment.env

# Verify deployment
echo "API URL: $API_URL"
echo "Lambda: $LAMBDA_NAME" 
echo "Table: $TABLE_NAME"
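
The same check can be done from Python before starting the load generator. This is a minimal sketch: it assumes `demo-environment.env` exports the three variables echoed above so that child processes can see them, and `missing_vars` is a hypothetical helper, not part of the repository.

```python
import os

# Variable names taken from the echo commands above.
REQUIRED_VARS = ("API_URL", "LAMBDA_NAME", "TABLE_NAME")


def missing_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_vars()` with no arguments inspects the current process environment; an empty list means all three values are present.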

3. Generate Realistic Load

Install Python dependencies:

pip install requests

Start continuous background load:

# Light continuous load (5 RPS baseline)
python continuous-load-generator.py --api-url "$API_URL" --rps 5 --duration 10

# Higher load for faster error generation
python continuous-load-generator.py --api-url "$API_URL" --rps 15 --duration 10

The load generator creates realistic traffic patterns with business hours peaks and occasional spikes.
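
To illustrate what such a traffic pattern can look like, here is a minimal sketch of a rate profile. It is not the logic of `continuous-load-generator.py`: the business-hours window, both multipliers, and the function names are assumptions chosen for the example.

```python
import random


def target_rps(base_rps: float, hour: int, spike: bool = False) -> float:
    """Scale a baseline rate for the time of day (all multipliers are assumed)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    multiplier = 2.0 if 9 <= hour < 17 else 1.0  # assumed business-hours peak
    if spike:
        multiplier *= 3.0                         # assumed occasional spike
    return base_rps * multiplier


def request_delay(rps: float, rng: random.Random) -> float:
    """Draw a gap between requests so that arrivals average `rps` per second."""
    return rng.expovariate(rps)
```

Drawing exponentially distributed gaps produces bursty, Poisson-like arrivals rather than evenly spaced requests, which tends to resemble real client traffic more closely.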

DevOps Agent Analysis

Set Up a DevOps Agent Space

  1. Create DevOps Agent Space

    • Navigate to AWS Console → DevOps Agent
    • Create a new Space for the demo
    • Include resources with tag: Application = unicorn_rentals
    • This will automatically discover and include:
      • API Gateway: demo-unicorn-rentals-api
      • Lambda Function: demo-unicorn-rental-processor
      • DynamoDB Table: demo-unicorn-rentals
      • CloudWatch Alarms and Logs
  2. Open WebApp and Start Investigation

    • Launch the DevOps Agent WebApp from your Space
    • Begin investigation using the prompts below
    • The agent analyzes logs, metrics, and service relationships

Investigation Prompts

Use these prompts with AWS DevOps Agent to analyze the system:

Memory Issues Analysis

The performance of my unicorn_rentals application has degraded significantly. Customer booking response times have increased and I'm seeing more rental processing errors. Can you analyze the system behavior over the last hour?

Latency Investigation

My unicorn rental processor is experiencing high latency spikes during booking confirmations. The duration metrics show some rental processing taking much longer than others. What's causing this inconsistent booking performance?

Root Cause Analysis

Investigation details: Why is my unicorn rentals API Gateway showing increased latency and 5XX errors? Customer bookings were working fine earlier today.

Slack Integration

Once you've added your Slack Workspace as a capability provider, follow these steps to complete the integration:

  1. Associate Slack Channel with your Agent Space

    • In your Agent Space, go to Capabilities → Communications → Slack
    • Select Add Slack and enter the Channel ID of your target channel
    • Choose Create to complete the association
    • For private channels, invite the DevOps Agent bot user to the channel before it can post
  2. How Slack Works During Investigations

    • When an investigation starts (manually from the WebApp, via webhook, or from a ticketing integration), DevOps Agent automatically posts updates to the configured Slack channel
    • The channel receives key findings, root cause analyses, and mitigation plans as the investigation progresses
    • Team members can follow along in real-time without needing console access
  3. Starting Investigations for This Demo

    • Investigations are started from the DevOps Agent WebApp (Incident Response tab), not directly from Slack
    • Use the prompts from the Investigation Prompts section above, or choose a pre-configured starting point like "Latest alarm" or "Error rate spike"
    • Once started, all findings stream into your Slack channel automatically
    • You can also trigger investigations via webhooks from PagerDuty, Grafana, or custom alerting systems

Note: Slack serves as a notification and collaboration channel. The investigation itself is driven from the WebApp, ticketing integrations, or webhooks. Avoid uninstalling the Slack app during the public preview, as reinstallation may not work.

Automatic Investigation via Webhook (Optional)

You can configure CloudWatch Alarms to automatically trigger DevOps Agent investigations when errors occur. The stack includes an optional webhook integration pipeline:

CloudWatch Alarm → SNS Topic → Forwarder Lambda → DevOps Agent Webhook

To enable this:

  1. Generate a webhook in DevOps Agent

    • In your Agent Space, go to Capabilities → Webhook → Configure
    • Click Generate webhook to create an HMAC key pair
    • Save the webhook URL and secret securely (you won't see the secret again)
  2. Deploy with webhook parameters

    export DEVOPS_AGENT_WEBHOOK_URL="https://event-ai.us-east-1.api.aws/webhook/generic/YOUR_ID"
    export DEVOPS_AGENT_WEBHOOK_SECRET="your-hmac-secret"
    ./deploy.sh
  3. Trigger the pipeline

    • Run the load generator to produce errors
    • CloudWatch Alarms fire when thresholds are breached
    • The forwarder Lambda sends HMAC-signed webhook requests to DevOps Agent
    • An investigation starts automatically, with findings posted to Slack

The forwarder only triggers investigations on ALARM state transitions (not OK recoveries), and each alarm produces a unique incident ID to avoid deduplication.
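
The forwarder behaviour described above (skip non-ALARM transitions, mint a unique incident ID, sign the body with the HMAC secret) can be sketched as follows. This is not the repository's forwarder code: the payload field names, the `X-Signature` header, and the choice of SHA-256 are assumptions, and the real webhook contract should be taken from the DevOps Agent documentation. `NewStateValue`, `AlarmName`, and `NewStateReason` are fields of the standard CloudWatch alarm SNS notification.

```python
import hashlib
import hmac
import json
import uuid
from typing import Optional


def build_webhook_request(sns_message: str, secret: str) -> Optional[dict]:
    """Turn a CloudWatch alarm SNS message into a signed request, or None."""
    alarm = json.loads(sns_message)
    if alarm.get("NewStateValue") != "ALARM":
        return None  # skip OK and INSUFFICIENT_DATA transitions

    payload = {
        # Hypothetical field names; check the webhook's documented schema.
        "incidentId": f"{alarm['AlarmName']}-{uuid.uuid4()}",
        "title": alarm["AlarmName"],
        "description": alarm.get("NewStateReason", ""),
    }
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    signature = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return {
        "body": body,
        "headers": {
            "Content-Type": "application/json",
            "X-Signature": signature,  # assumed header name
        },
    }


def verify(body: bytes, signature: str, secret: str) -> bool:
    """Receiver-side check, using a constant-time comparison."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Because the incident ID embeds a fresh UUID, two firings of the same alarm yield distinct IDs, which matches the stated goal of avoiding deduplication.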

Billing & Cost Management MCP Server (Optional)

Connecting MCP servers to DevOps Agent gives it additional context and tools beyond what's available through native AWS integrations. In this case, the Billing & Cost Management MCP server provides access to pricing data, cost anomaly detection, and optimization recommendations, enabling the agent to make cost-aware recommendations during incident investigations (e.g., estimating the cost impact of increasing Lambda memory or switching DynamoDB to on-demand mode).

Deploy the MCP Server

The MCP server runs as a Lambda function behind API Gateway, wrapping the awslabs.billing-cost-management-mcp-server package using the MCP Streamable HTTP transport.

cd mcp-server-hosting
bash deploy.sh

The script outputs the endpoint URL and API key. No CDK or Docker is required: only the AWS CLI, Python 3.10+, and pip.

Connect MCP to DevOps Agent

  1. In your Agent Space, go to Capabilities → MCP Servers → Add MCP Server
  2. Enter the endpoint URL from the deploy output (e.g., https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/mcp)
  3. Select API Key as the authorization flow
  4. Configure:
    • API Key Name: bcm-mcp-key
    • API Key Header: x-api-key
    • API Key Value: the key from the deploy output
  5. Leave "Dynamic Client Registration" and "Private connection" unchecked
  6. Choose AWS owned key for encryption
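
Before connecting the server in the console, it can help to confirm that the endpoint accepts the API key. The sketch below only builds the request and does not send it. It assumes the server answers a standard MCP `initialize` call over Streamable HTTP; the protocol version string and client name are placeholders, and `mcp_initialize_request` is a hypothetical helper.

```python
import json


def mcp_initialize_request(endpoint_url: str, api_key: str) -> dict:
    """Describe an MCP `initialize` call over Streamable HTTP (not sent here)."""
    body = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",  # assumed; use what the server supports
            "capabilities": {},
            "clientInfo": {"name": "smoke-test", "version": "0.0.1"},  # placeholder
        },
    }
    return {
        "method": "POST",
        "url": endpoint_url,
        "headers": {
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
            "x-api-key": api_key,  # header name from the configuration above
        },
        "body": json.dumps(body),
    }
```

With the requests library installed, the result can be sent with `requests.post(req["url"], headers=req["headers"], data=req["body"])`; a 403 response would suggest the key or header name is wrong.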

Upload the Billing Skill

The skill instructs DevOps Agent when and how to use the billing MCP tools during investigations.

  1. In your Agent Space, go to Skills → Add Skill → Upload Skill
  2. Upload devops-agent-skill-billing-mcp.zip
  3. Select Generic agent type (applies to all investigation types)

Once connected, the agent will automatically use billing tools to check for cost anomalies correlated with incidents, estimate the cost of proposed mitigations, and include a cost impact summary in its findings.

Architecture Details

Application: unicorn_rentals

  • API Gateway: ${Environment}-unicorn-rentals-api - Customer-facing REST API
  • Lambda: ${Environment}-unicorn-rental-processor - Business logic with intentional constraints
  • DynamoDB: ${Environment}-unicorn-rentals - Rental data storage with throttling limits
  • CloudWatch: Comprehensive monitoring and alerting

Intentional Constraints:

  • Lambda: 128MB memory limit, 30-second timeout
  • DynamoDB: 2 RCU/WCU provisioned capacity
  • Error injection: 30% failure rate with realistic scenarios
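
A rough calculation shows why 2 WCU is easy to exceed. The sketch assumes one standard (non-transactional) write per request, and it ignores DynamoDB burst capacity, so short spikes may be absorbed before throttling starts. The function names are hypothetical.

```python
import math


def write_units_per_second(writes_per_second: float, item_size_kb: float) -> float:
    """WCU consumed: one unit per write for each started 1 KB of item size."""
    return writes_per_second * math.ceil(item_size_kb)


def is_throttling_likely(consumed_units: float, provisioned_units: float = 2) -> bool:
    """True when sustained demand exceeds provisioned capacity."""
    return consumed_units > provisioned_units
```

Under these assumptions, the 5 RPS baseline with items under 1 KB consumes about 5 WCU, more than double the provisioned 2 WCU, so sustained load should produce throttle events.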

Cleanup

Remove all resources when finished:

aws cloudformation delete-stack --stack-name unicorn-rentals
aws cloudformation delete-stack --stack-name bcm-mcp-server

Additional Resources

Files

  • cloudformation-template.yaml - Complete infrastructure definition
  • deploy.sh - Automated deployment with error handling
  • continuous-load-generator.py - Realistic traffic simulation
  • reset-alarms.sh - Reset CloudWatch alarms to OK state
  • mcp-server-hosting/ - Billing & Cost Management MCP server (Lambda + API Gateway)
  • devops-agent-skill/ - DevOps Agent skill for cost-aware investigations
  • devops-agent-skill-billing-mcp.zip - Ready-to-upload skill package

About

Demo environment with synthetic workload generation for various AWS DevOps agent demo scenarios
