Your Helium agents run in isolated Daytona sandboxes. This phase verifies the integration and sets up monitoring for agent performance.
Step 1: Verify Daytona Configuration
Verify your Daytona API key is correctly stored in Secrets Manager.
# Get secret value (to verify)
aws secretsmanager get-secret-value \
--secret-id helium/backend/production \
--query SecretString \
--output text | jq -r '.DAYTONA_API_KEY'
# Should output your Daytona API key
# If empty or incorrect, update it:
# aws secretsmanager update-secret ...
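# For example, one way to update only DAYTONA_API_KEY in place (a sketch; assumes
# the secret is a single JSON object like the one above, and the key value shown
# is a placeholder):
NEW_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id helium/backend/production \
  --query SecretString \
  --output text | jq '.DAYTONA_API_KEY = "dtn_your_new_key"')
aws secretsmanager update-secret \
  --secret-id helium/backend/production \
  --secret-string "$NEW_SECRET"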
# Test Daytona API from your backend
curl -X POST https://api.he2.ai/api/agents/test-sandbox \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_AUTH_TOKEN" \
-d '{
"command": "echo Hello from Daytona"
}'
# Expected response:
# {
# "status": "success",
# "output": "Hello from Daytona",
# "sandbox_id": "sandbox-xxxxx"
# }
Ensure your Daytona configuration matches your production needs.
Recommended Settings:
- Server URL: https://app.daytona.io/api
- Target: us (US region)
- Timeout: 300 seconds (5 minutes)
- Max Concurrent: 10 sandboxes
Step 2: Set Up Agent Execution Monitoring
- Go to CloudWatch → Dashboards
- Click "Create dashboard"
- Name: helium-agent-execution
- Add widgets for the following (a CLI equivalent is sketched after this list):
- Agent execution count (per hour)
- Average execution time
- Success rate (%)
- Worker task count
- Redis queue depth
- Error rate
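If you prefer to define the dashboard as code, aws cloudwatch put-dashboard creates the same widgets. A minimal sketch with two of them; the queue-depth widget assumes the custom Helium/Workers metric published in Step 4:
# Create (or overwrite) the dashboard from the CLI
aws cloudwatch put-dashboard \
  --dashboard-name helium-agent-execution \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
          "title": "Worker task count",
          "metrics": [["ECS/ContainerInsights", "RunningTaskCount",
                       "ServiceName", "helium-worker-service",
                       "ClusterName", "helium-production-cluster"]],
          "stat": "Average", "period": 300, "region": "us-east-1"
        }
      },
      {
        "type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
        "properties": {
          "title": "Redis queue depth",
          "metrics": [["Helium/Workers", "RedisQueueDepth"]],
          "stat": "Average", "period": 60, "region": "us-east-1"
        }
      }
    ]
  }' \
  --region us-east-1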
# Create SNS topic for alerts
aws sns create-topic \
--name helium-production-alerts \
--region us-east-1
# Subscribe your email
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:ACCOUNT_ID:helium-production-alerts \
--protocol email \
--notification-endpoint your-email@example.com \
--region us-east-1
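# Note: SNS emails a confirmation link; the subscription stays in
# "PendingConfirmation" until you click it.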
# Create alarm for high error rate
# (AWS/ECS does not publish an "Errors" metric by default; point --namespace and
#  --metric-name at wherever your application reports its error metric)
aws cloudwatch put-metric-alarm \
--alarm-name helium-high-error-rate \
--alarm-description "Alert when error rate exceeds 5%" \
--metric-name Errors \
--namespace AWS/ECS \
--statistic Average \
--period 300 \
--evaluation-periods 2 \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:helium-production-alerts \
--region us-east-1
# Create alarm for worker task count
aws cloudwatch put-metric-alarm \
--alarm-name helium-worker-tasks-low \
--alarm-description "Alert when worker tasks drop below 1" \
--metric-name RunningTaskCount \
--namespace ECS/ContainerInsights \
--dimensions Name=ServiceName,Value=helium-worker-service Name=ClusterName,Value=helium-production-cluster \
--statistic Average \
--period 300 \
--evaluation-periods 1 \
--threshold 1 \
--comparison-operator LessThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:helium-production-alerts \
--region us-east-1
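To confirm both alarms registered, you can list them by name (same names as above):
aws cloudwatch describe-alarms \
  --alarm-names helium-high-error-rate helium-worker-tasks-low \
  --query 'MetricAlarms[].{Name:AlarmName,State:StateValue}' \
  --region us-east-1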
Container Insights provides detailed metrics for ECS tasks including CPU, memory, network, and storage at the task level.
# Enable Container Insights for cluster
aws ecs update-cluster-settings \
--cluster helium-production-cluster \
--settings name=containerInsights,value=enabled \
--region us-east-1
# Verify it's enabled
aws ecs describe-clusters \
--clusters helium-production-cluster \
--include SETTINGS \
--region us-east-1
Step 3: Test Agent Execution Performance
Test a simple agent execution to verify everything works end-to-end.
# Create a test agent run via API
curl -X POST https://api.he2.ai/api/agents/run \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_AUTH_TOKEN" \
-d '{
"agent_id": "your-agent-id",
"prompt": "Hello! Can you tell me what 2+2 equals?",
"project_id": "your-project-id"
}'
# Monitor the execution in CloudWatch Logs
# Go to CloudWatch → Log groups → /ecs/helium-worker
# Filter by agent run ID
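# Alternatively, tail the worker logs from a terminal (requires AWS CLI v2;
# the filter pattern below is a placeholder for your agent run ID)
aws logs tail /ecs/helium-worker \
  --follow \
  --since 10m \
  --filter-pattern "your-agent-run-id" \
  --region us-east-1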
# Connect to Redis and check queue
redis-cli -h your-cluster.cache.amazonaws.com -p 6379 --tls -a YourPassword
# Check queue length
> LLEN dramatiq:default.DQ
(integer) 0
# Check active workers
> SMEMBERS dramatiq:default.workers
1) "worker-1"
2) "worker-2"
# Monitor commands in real time (MONITOR adds significant overhead; use only briefly)
> MONITOR
Test with multiple concurrent agent executions to verify scaling works.
# Simple load test script
for i in {1..10}; do
curl -X POST https://api.he2.ai/api/agents/run \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_AUTH_TOKEN" \
-d "{
\"agent_id\": \"your-agent-id\",
\"prompt\": \"Test execution $i\",
\"project_id\": \"your-project-id\"
}" &
done
wait
# Monitor CloudWatch for:
# - All executions complete successfully
# - Worker tasks scale up if needed
# - No errors in logs
Success criteria:
- All 10 executions complete within 2-3 minutes
- No errors in CloudWatch Logs
- Worker tasks scale up if the queue grows (a quick check is sketched below)
- Redis queue returns to 0 after completion
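A quick way to check the last two points (a sketch; service, cluster, and Redis details as used above):
# Running worker task count for the service
aws ecs describe-services \
  --cluster helium-production-cluster \
  --services helium-worker-service \
  --query 'services[0].runningCount' \
  --region us-east-1
# Queue depth (should return to 0 once the test finishes)
redis-cli -h your-cluster.cache.amazonaws.com -p 6379 --tls -a YourPassword \
  LLEN dramatiq:default.DQ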
Step 4: Optimize Agent Execution
Scale workers based on Redis queue depth to handle traffic spikes.
# Register scalable target for workers
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/helium-production-cluster/helium-worker-service \
--min-capacity 2 \
--max-capacity 10 \
--region us-east-1
# Create custom metric for queue depth (requires CloudWatch custom metric)
# This is typically done in your application code
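# Example sketch: publish queue depth once a minute from a cron job or sidecar.
# The endpoint, password, and key name are the same placeholders used in Step 3;
# reuse whatever your workers actually use.
QUEUE_DEPTH=$(redis-cli -h your-cluster.cache.amazonaws.com -p 6379 --tls -a YourPassword \
  LLEN dramatiq:default.DQ)
aws cloudwatch put-metric-data \
  --namespace Helium/Workers \
  --metric-name RedisQueueDepth \
  --value "$QUEUE_DEPTH" \
  --region us-east-1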
# Create scaling policy based on queue depth
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/helium-production-cluster/helium-worker-service \
--policy-name helium-worker-queue-scaling \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 10.0,
"CustomizedMetricSpecification": {
"MetricName": "RedisQueueDepth",
"Namespace": "Helium/Workers",
"Statistic": "Average"
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60
}' \
--region us-east-1
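To confirm the policy is attached, describe it against the same resource ID:
aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs \
  --resource-id service/helium-production-cluster/helium-worker-service \
  --region us-east-1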
Configure appropriate timeouts to prevent stuck executions.
Recommended Timeouts:
- API Request: 30 seconds
- Agent Execution: 5 minutes (300 seconds)
- Sandbox Creation: 60 seconds
- Task Timeout: 10 minutes (600 seconds)
Step 5: Configure Structured Logging
Create saved queries for common troubleshooting scenarios.
# Query 1: Find failed agent executions
fields @timestamp, agent_id, error_message
| filter status = "failed"
| sort @timestamp desc
| limit 20
# Query 2: Average execution time by agent
fields agent_id, execution_time
| stats avg(execution_time) as avg_time by agent_id
| sort avg_time desc
# Query 3: Find slow executions (>2 minutes)
fields @timestamp, agent_id, execution_time
| filter execution_time > 120
| sort @timestamp desc
# Query 4: Error patterns
fields @timestamp, error_message
| filter @message like /error/i
| stats count(*) as error_count by error_message
| sort error_count desc
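To save a query so it appears under Logs Insights → Saved queries, aws logs put-query-definition works from the CLI; a sketch for Query 1 (the query name is illustrative):
aws logs put-query-definition \
  --name "helium-failed-agent-executions" \
  --log-group-names /ecs/helium-worker \
  --query-string 'fields @timestamp, agent_id, error_message | filter status = "failed" | sort @timestamp desc | limit 20' \
  --region us-east-1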
# Set log retention to 30 days (saves costs)
aws logs put-retention-policy \
--log-group-name /ecs/helium-backend \
--retention-in-days 30 \
--region us-east-1
aws logs put-retention-policy \
--log-group-name /ecs/helium-worker \
--retention-in-days 30 \
--region us-east-1
CloudWatch Logs charges $0.50 per GB ingested and $0.03 per GB-month stored. Setting retention to 30 days can save 50-70% on log storage costs.
Phase 5 Verification Checklist
- Daytona API key verified in Secrets Manager
- Daytona connection tested successfully
- Daytona settings configured
- CloudWatch dashboard created
- CloudWatch alarms configured
- SNS topic created for alerts
- Email subscribed to alerts
- Container Insights enabled
- Test agent execution successful
- Load test completed (10 concurrent)
- Worker auto scaling configured
- Log Insights queries saved
- Log retention configured
- No errors in CloudWatch Logs