Phase 5 of 6

Agent Execution

Verify Daytona sandbox integration and set up monitoring for agent execution

1 day
Beginner Level
5 main steps
🤖 What is Agent Execution?
Your Helium agents run in isolated Daytona sandboxes. This phase verifies the integration and sets up monitoring for agent performance.

Step 1: Verify Daytona Configuration

1.1
Check Daytona API Key

Verify your Daytona API key is correctly stored in Secrets Manager.

bash
# Get secret value (to verify)
aws secretsmanager get-secret-value \
  --secret-id helium/backend/production \
  --query SecretString \
  --output text | jq -r '.DAYTONA_API_KEY'

# Should output your Daytona API key
# If empty or incorrect, update it:
# aws secretsmanager update-secret ...
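If the key is missing or stale, one way to update it without overwriting the other keys in the secret is to merge the new value into the existing JSON with `jq`. A minimal sketch (the secret contents below are placeholders standing in for the real `get-secret-value` output):

```shell
# Merge a new DAYTONA_API_KEY into the secret JSON without touching other keys.
# In production, $current would come from:
#   aws secretsmanager get-secret-value --secret-id helium/backend/production \
#     --query SecretString --output text
current='{"DAYTONA_API_KEY":"old","REDIS_PASSWORD":"x"}'
updated=$(printf '%s' "$current" | jq -c '. + {"DAYTONA_API_KEY":"new-key"}')
echo "$updated"

# Then push it back:
#   aws secretsmanager update-secret --secret-id helium/backend/production \
#     --secret-string "$updated"
```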
1.2
Test Daytona Connection
bash
# Test Daytona API from your backend
curl -X POST https://api.he2.ai/api/agents/test-sandbox \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
  -d '{
    "command": "echo Hello from Daytona"
  }'

# Expected response:
# {
#   "status": "success",
#   "output": "Hello from Daytona",
#   "sandbox_id": "sandbox-xxxxx"
# }
1.3
Configure Daytona Settings

Ensure your Daytona configuration matches your production needs.

Recommended Settings:

  • Server URL: https://app.daytona.io/api
  • Target: us (US region)
  • Timeout: 300 seconds (5 minutes)
  • Max Concurrent: 10 sandboxes
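These settings typically reach the backend as environment variables. The variable names below are assumptions for illustration only; confirm the actual names against your Helium backend configuration:

```shell
# Hypothetical variable names -- check your backend's config for the real ones
export DAYTONA_SERVER_URL="https://app.daytona.io/api"
export DAYTONA_TARGET="us"
export DAYTONA_TIMEOUT_SECONDS="300"
export DAYTONA_MAX_CONCURRENT_SANDBOXES="10"
```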

Step 2: Set Up Agent Execution Monitoring

2.1
Create CloudWatch Dashboard
  1. Go to CloudWatch → Dashboards
  2. Click "Create dashboard"
  3. Name: helium-agent-execution
  4. Add widgets for:
    • Agent execution count (per hour)
    • Average execution time
    • Success rate (%)
    • Worker task count
    • Redis queue depth
    • Error rate
2.2
Create CloudWatch Alarms
bash
# Create SNS topic for alerts
aws sns create-topic \
  --name helium-production-alerts \
  --region us-east-1

# Subscribe your email
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:ACCOUNT_ID:helium-production-alerts \
  --protocol email \
  --notification-endpoint your-email@example.com \
  --region us-east-1

# Create alarm for backend 5xx errors
# Note: the AWS/ECS namespace has no "Errors" metric; use the load balancer's
# 5xx target count instead (substitute your ALB's LoadBalancer dimension value)
aws cloudwatch put-metric-alarm \
  --alarm-name helium-high-error-rate \
  --alarm-description "Alert when 5xx errors exceed threshold" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --dimensions Name=LoadBalancer,Value=app/YOUR_ALB_NAME/YOUR_ALB_ID \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:helium-production-alerts \
  --region us-east-1

# Create alarm for worker task count
aws cloudwatch put-metric-alarm \
  --alarm-name helium-worker-tasks-low \
  --alarm-description "Alert when worker tasks drop below 1" \
  --metric-name RunningTaskCount \
  --namespace ECS/ContainerInsights \
  --dimensions Name=ServiceName,Value=helium-worker-service Name=ClusterName,Value=helium-production-cluster \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:helium-production-alerts \
  --region us-east-1
2.3
Enable Container Insights
✅ 2025 Best Practice:
Container Insights provides detailed metrics for ECS tasks including CPU, memory, network, and storage at the task level.
bash
# Enable Container Insights for cluster
aws ecs update-cluster-settings \
  --cluster helium-production-cluster \
  --settings name=containerInsights,value=enabled \
  --region us-east-1

# Verify it's enabled
aws ecs describe-clusters \
  --clusters helium-production-cluster \
  --include SETTINGS \
  --region us-east-1
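The `describe-clusters` output can also be checked programmatically with `jq`. In this sketch a canned response stands in for the real call:

```shell
# Parse the containerInsights setting out of describe-clusters output.
# $response stands in for:
#   aws ecs describe-clusters --clusters helium-production-cluster --include SETTINGS
response='{"clusters":[{"settings":[{"name":"containerInsights","value":"enabled"}]}]}'
state=$(printf '%s' "$response" \
  | jq -r '.clusters[0].settings[] | select(.name == "containerInsights") | .value')
echo "$state"
```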

Step 3: Test Agent Execution Performance

3.1
Run Test Agent Execution

Test a simple agent execution to verify everything works end-to-end.

bash
# Create a test agent run via API
curl -X POST https://api.he2.ai/api/agents/run \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
  -d '{
    "agent_id": "your-agent-id",
    "prompt": "Hello! Can you tell me what 2+2 equals?",
    "project_id": "your-project-id"
  }'

# Monitor the execution in CloudWatch Logs
# Go to CloudWatch → Log groups → /ecs/helium-worker
# Filter by agent run ID
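The same logs can be followed from the terminal with AWS CLI v2's `aws logs tail`; the filter pattern is whatever identifier your worker logs for the run:

```shell
# Follow worker logs and filter for a specific agent run (requires AWS CLI v2)
aws logs tail /ecs/helium-worker \
  --follow \
  --since 10m \
  --filter-pattern "your-agent-run-id" \
  --region us-east-1
```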
3.2
Monitor Redis Queue
bash
# Connect to Redis (--tls requires in-transit encryption on the cluster)
redis-cli -h your-cluster.cache.amazonaws.com -p 6379 --tls -a YourPassword

# Check main queue length (dramatiq:default.DQ holds delayed messages only)
> LLEN dramatiq:default
(integer) 0

# Check active workers
> SMEMBERS dramatiq:default.workers
1) "worker-1"
2) "worker-2"

# Watch commands in real time (use sparingly: MONITOR degrades Redis performance)
> MONITOR
3.3
Load Testing

Test with multiple concurrent agent executions to verify scaling works.

bash
# Simple load test script
for i in {1..10}; do
  curl -X POST https://api.he2.ai/api/agents/run \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer YOUR_AUTH_TOKEN" \
    -d "{
      \"agent_id\": \"your-agent-id\",
      \"prompt\": \"Test execution $i\",
      \"project_id\": \"your-project-id\"
    }" &
done

wait

# Monitor CloudWatch for:
# - All executions complete successfully
# - Worker tasks scale up if needed
# - No errors in logs
✅ Success Indicators:
• All 10 executions complete within 2-3 minutes
• No errors in CloudWatch Logs
• Worker tasks scale up if queue grows
• Redis queue returns to 0 after completion
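The last indicator can be scripted with a small poll loop. A sketch: `depth` is stubbed here and would be replaced by the `LLEN` call from step 3.2:

```shell
# Poll until a command reports zero, or give up after N tries.
poll_until_zero() {
  local tries=$1; shift
  local i d
  for i in $(seq "$tries"); do
    d=$("$@")
    [ "$d" -eq 0 ] && return 0
    sleep 5
  done
  return 1
}

# Stub standing in for:
#   redis-cli -h your-cluster.cache.amazonaws.com -p 6379 --tls \
#     -a YourPassword LLEN dramatiq:default
depth() { echo 0; }

poll_until_zero 12 depth && echo "queue drained"
```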

Step 4: Optimize Agent Execution

4.1
Configure Worker Auto Scaling

Scale workers based on Redis queue depth to handle traffic spikes.

bash
# Register scalable target for workers
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/helium-production-cluster/helium-worker-service \
  --min-capacity 2 \
  --max-capacity 10 \
  --region us-east-1

# RedisQueueDepth is a custom metric: CloudWatch does not collect it for you,
# so your application (or a scheduled job) must publish it

# Create scaling policy based on queue depth
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/helium-production-cluster/helium-worker-service \
  --policy-name helium-worker-queue-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 10.0,
    "CustomizedMetricSpecification": {
      "MetricName": "RedisQueueDepth",
      "Namespace": "Helium/Workers",
      "Statistic": "Average"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }' \
  --region us-east-1
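The policy above assumes a `RedisQueueDepth` metric is being published. One minimal way to do that outside application code is a scheduled shell job; this is a sketch, and the queue key and connection details must match your setup:

```shell
# Publish the current queue depth as a custom CloudWatch metric.
# Run on a schedule (cron, an EventBridge-triggered task, etc.).
DEPTH=$(redis-cli -h your-cluster.cache.amazonaws.com -p 6379 --tls \
  -a "$REDIS_PASSWORD" LLEN dramatiq:default)

aws cloudwatch put-metric-data \
  --namespace Helium/Workers \
  --metric-name RedisQueueDepth \
  --value "$DEPTH" \
  --unit Count \
  --region us-east-1
```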
4.2
Set Execution Timeouts

Configure appropriate timeouts to prevent stuck executions.

Recommended Timeouts:

  • API Request: 30 seconds
  • Agent Execution: 5 minutes (300 seconds)
  • Sandbox Creation: 60 seconds
  • Task Timeout: 10 minutes (600 seconds)

Step 5: Configure Structured Logging

5.1
Set Up Log Insights Queries

Create saved queries for common troubleshooting scenarios.

cloudwatch-insights
# Query 1: Find failed agent executions
fields @timestamp, agent_id, error_message
| filter status = "failed"
| sort @timestamp desc
| limit 20

# Query 2: Average execution time by agent
fields agent_id, execution_time
| stats avg(execution_time) as avg_time by agent_id
| sort avg_time desc

# Query 3: Find slow executions (>2 minutes)
fields @timestamp, agent_id, execution_time
| filter execution_time > 120
| sort @timestamp desc

# Query 4: Error patterns
fields @timestamp, error_message
| filter @message like /(?i)error/
| stats count(*) as error_count by error_message
| sort error_count desc
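These queries can also be run from the CLI via `start-query` / `get-query-results`. In the sketch below the live calls are commented out, and a canned response stands in for the real `get-query-results` output so the parsing step is visible:

```shell
# Run Query 1 from the CLI (commented out; requires live AWS access):
#   qid=$(aws logs start-query \
#     --log-group-name /ecs/helium-worker \
#     --start-time $(($(date +%s) - 3600)) --end-time $(date +%s) \
#     --query-string 'fields @timestamp, agent_id, error_message | filter status = "failed" | sort @timestamp desc | limit 20' \
#     --query queryId --output text)
#   results=$(aws logs get-query-results --query-id "$qid")

# Canned stand-in for get-query-results output:
results='{"status":"Complete","results":[[{"field":"agent_id","value":"agent-1"},{"field":"error_message","value":"sandbox timeout"}]]}'
printf '%s' "$results" | jq -r '.results[] | map(.value) | join(" | ")'
```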
5.2
Configure Log Retention
bash
# Set log retention to 30 days (saves costs)
aws logs put-retention-policy \
  --log-group-name /ecs/helium-backend \
  --retention-in-days 30 \
  --region us-east-1

aws logs put-retention-policy \
  --log-group-name /ecs/helium-worker \
  --retention-in-days 30 \
  --region us-east-1
💰 Cost Optimization:
CloudWatch Logs charges $0.50/GB ingested and $0.03 per GB-month stored. The default retention is "never expire", so capping it at 30 days cuts storage costs substantially for high-volume log groups.

Phase 5 Verification Checklist