Your Helium agents run in isolated Daytona sandboxes. This phase verifies the integration and sets up monitoring for agent performance.
Step 1: Verify Daytona Configuration
Verify your Daytona API key is correctly stored in Secrets Manager.
# Get secret value (to verify)
aws secretsmanager get-secret-value \
--secret-id helium/backend/production \
--query SecretString \
--output text | jq -r '.DAYTONA_API_KEY'
# Should output your Daytona API key
# If empty or incorrect, update it:
# aws secretsmanager update-secret ...
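# For example, one way to update only DAYTONA_API_KEY in place (a sketch; assumes
# the secret is a single JSON object like the one above, and the key value shown
# is a placeholder):
NEW_SECRET=$(aws secretsmanager get-secret-value \
  --secret-id helium/backend/production \
  --query SecretString \
  --output text | jq '.DAYTONA_API_KEY = "dtn_your_new_key"')
aws secretsmanager update-secret \
  --secret-id helium/backend/production \
  --secret-string "$NEW_SECRET"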
# Test Daytona API from your backend
curl -X POST https://api.he2.ai/api/agents/test-sandbox \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_AUTH_TOKEN" \
-d '{
"command": "echo Hello from Daytona"
}'
# Expected response:
# {
# "status": "success",
# "output": "Hello from Daytona",
# "sandbox_id": "sandbox-xxxxx"
# }
Ensure your Daytona configuration matches your production needs.
Recommended Settings:
- Server URL: https://app.daytona.io/api
- Target: us (US region)
- Timeout: 300 seconds (5 minutes)
- Max Concurrent: 10 sandboxes
Step 2: Set Up Agent Execution Monitoring
- Go to CloudWatch → Dashboards
- Click "Create dashboard"
- Name: helium-agent-execution
- Add widgets for the following (a CLI equivalent is sketched after this list):
- Agent execution count (per hour)
- Average execution time
- Success rate (%)
- Worker task count
- Redis queue depth
- Error rate
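If you prefer to define the dashboard as code, aws cloudwatch put-dashboard creates the same widgets. A minimal sketch with two of them; the queue-depth widget assumes the custom Helium/Workers metric published in Step 4:
# Create (or overwrite) the dashboard from the CLI
aws cloudwatch put-dashboard \
  --dashboard-name helium-agent-execution \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
          "title": "Worker task count",
          "metrics": [["ECS/ContainerInsights", "RunningTaskCount",
                       "ServiceName", "helium-worker-service",
                       "ClusterName", "helium-production-cluster"]],
          "stat": "Average", "period": 300, "region": "us-east-1"
        }
      },
      {
        "type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
        "properties": {
          "title": "Redis queue depth",
          "metrics": [["Helium/Workers", "RedisQueueDepth"]],
          "stat": "Average", "period": 60, "region": "us-east-1"
        }
      }
    ]
  }' \
  --region us-east-1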
# Create SNS topic for alerts
aws sns create-topic \
--name helium-production-alerts \
--region us-east-1
# Subscribe your email
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:ACCOUNT_ID:helium-production-alerts \
--protocol email \
--notification-endpoint your-email@example.com \
--region us-east-1
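# Note: SNS emails a confirmation link; the subscription stays in
# "PendingConfirmation" until you click it.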
# Create alarm for high error rate
# (AWS/ECS does not publish an "Errors" metric by default; point --namespace and
#  --metric-name at wherever your application reports its error metric)
aws cloudwatch put-metric-alarm \
--alarm-name helium-high-error-rate \
--alarm-description "Alert when error rate exceeds 5%" \
--metric-name Errors \
--namespace AWS/ECS \
--statistic Average \
--period 300 \
--evaluation-periods 2 \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:helium-production-alerts \
--region us-east-1
# Create alarm for worker task count
aws cloudwatch put-metric-alarm \
--alarm-name helium-worker-tasks-low \
--alarm-description "Alert when worker tasks drop below 1" \
--metric-name RunningTaskCount \
--namespace ECS/ContainerInsights \
--dimensions Name=ServiceName,Value=helium-worker-service Name=ClusterName,Value=helium-production-cluster \
--statistic Average \
--period 300 \
--evaluation-periods 1 \
--threshold 1 \
--comparison-operator LessThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:ACCOUNT_ID:helium-production-alerts \
--region us-east-1
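To confirm both alarms registered, you can list them by name (same names as above):
aws cloudwatch describe-alarms \
  --alarm-names helium-high-error-rate helium-worker-tasks-low \
  --query 'MetricAlarms[].{Name:AlarmName,State:StateValue}' \
  --region us-east-1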
Container Insights provides detailed metrics for ECS tasks including CPU, memory, network, and storage at the task level.
# Enable Container Insights for cluster
aws ecs update-cluster-settings \
--cluster helium-production-cluster \
--settings name=containerInsights,value=enabled \
--region us-east-1
# Verify it's enabled
aws ecs describe-clusters \
--clusters helium-production-cluster \
--include SETTINGS \
--region us-east-1
Step 3: Test Agent Execution Performance
Test a simple agent execution to verify everything works end-to-end.
# Create a test agent run via API
curl -X POST https://api.he2.ai/api/agents/run \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_AUTH_TOKEN" \
-d '{
"agent_id": "your-agent-id",
"prompt": "Hello! Can you tell me what 2+2 equals?",
"project_id": "your-project-id"
}'
# Monitor the execution in CloudWatch Logs
# Go to CloudWatch → Log groups → /ecs/helium-worker
# Filter by agent run ID
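# Alternatively, tail the worker logs from a terminal (requires AWS CLI v2;
# the filter pattern below is a placeholder for your agent run ID)
aws logs tail /ecs/helium-worker \
  --follow \
  --since 10m \
  --filter-pattern "your-agent-run-id" \
  --region us-east-1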
# Connect to Redis and check queue
redis-cli -h your-cluster.cache.amazonaws.com -p 6379 --tls -a YourPassword
# Check queue length
> LLEN dramatiq:default.DQ
(integer) 0
# Check active workers
> SMEMBERS dramatiq:default.workers
1) "worker-1"
2) "worker-2"
# Monitor commands in real time (MONITOR adds significant overhead; use only briefly)
> MONITOR
Test with multiple concurrent agent executions to verify scaling works.
# Simple load test script
for i in {1..10}; do
curl -X POST https://api.he2.ai/api/agents/run \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_AUTH_TOKEN" \
-d "{
\"agent_id\": \"your-agent-id\",
\"prompt\": \"Test execution $i\",
\"project_id\": \"your-project-id\"
}" &
done
wait
# Monitor CloudWatch for:
# - All executions complete successfully
# - Worker tasks scale up if needed
# - No errors in logs
Success criteria:
- All 10 executions complete within 2-3 minutes
- No errors in CloudWatch Logs
- Worker tasks scale up if the queue grows (a quick check is sketched below)
- Redis queue returns to 0 after completion
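A quick way to check the last two points (a sketch; service, cluster, and Redis details as used above):
# Running worker task count for the service
aws ecs describe-services \
  --cluster helium-production-cluster \
  --services helium-worker-service \
  --query 'services[0].runningCount' \
  --region us-east-1
# Queue depth (should return to 0 once the test finishes)
redis-cli -h your-cluster.cache.amazonaws.com -p 6379 --tls -a YourPassword \
  LLEN dramatiq:default.DQ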
Step 4: Optimize Agent Execution
Scale workers based on Redis queue depth to handle traffic spikes.
# Register scalable target for workers
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/helium-production-cluster/helium-worker-service \
--min-capacity 2 \
--max-capacity 10 \
--region us-east-1
# Create custom metric for queue depth (requires CloudWatch custom metric)
# This is typically done in your application code
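# Example sketch: publish queue depth once a minute from a cron job or sidecar.
# The endpoint, password, and key name are the same placeholders used in Step 3;
# reuse whatever your workers actually use.
QUEUE_DEPTH=$(redis-cli -h your-cluster.cache.amazonaws.com -p 6379 --tls -a YourPassword \
  LLEN dramatiq:default.DQ)
aws cloudwatch put-metric-data \
  --namespace Helium/Workers \
  --metric-name RedisQueueDepth \
  --value "$QUEUE_DEPTH" \
  --region us-east-1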
# Create scaling policy based on queue depth
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/helium-production-cluster/helium-worker-service \
--policy-name helium-worker-queue-scaling \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 10.0,
"CustomizedMetricSpecification": {
"MetricName": "RedisQueueDepth",
"Namespace": "Helium/Workers",
"Statistic": "Average"
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60
}' \
--region us-east-1
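To confirm the policy is attached, describe it against the same resource ID:
aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs \
  --resource-id service/helium-production-cluster/helium-worker-service \
  --region us-east-1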
Configure appropriate timeouts to prevent stuck executions.
Recommended Timeouts:
- API Request: 30 seconds
- Agent Execution: 5 minutes (300 seconds)
- Sandbox Creation: 60 seconds
- Task Timeout: 10 minutes (600 seconds)
Step 5: Configure Structured Logging
Create saved queries for common troubleshooting scenarios.
# Query 1: Find failed agent executions
fields @timestamp, agent_id, error_message
| filter status = "failed"
| sort @timestamp desc
| limit 20
# Query 2: Average execution time by agent
fields agent_id, execution_time
| stats avg(execution_time) as avg_time by agent_id
| sort avg_time desc
# Query 3: Find slow executions (>2 minutes)
fields @timestamp, agent_id, execution_time
| filter execution_time > 120
| sort @timestamp desc
# Query 4: Error patterns
fields @timestamp, error_message
| filter @message like /error/i
| stats count(*) as error_count by error_message
| sort error_count desc
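To save a query so it appears under Logs Insights → Saved queries, aws logs put-query-definition works from the CLI; a sketch for Query 1 (the query name is illustrative):
aws logs put-query-definition \
  --name "helium-failed-agent-executions" \
  --log-group-names /ecs/helium-worker \
  --query-string 'fields @timestamp, agent_id, error_message | filter status = "failed" | sort @timestamp desc | limit 20' \
  --region us-east-1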
# Set log retention to 30 days (saves costs)
aws logs put-retention-policy \
--log-group-name /ecs/helium-backend \
--retention-in-days 30 \
--region us-east-1
aws logs put-retention-policy \
--log-group-name /ecs/helium-worker \
--retention-in-days 30 \
--region us-east-1
CloudWatch Logs charges $0.50 per GB ingested and $0.03 per GB-month stored. Setting retention to 30 days can save 50-70% on log storage costs.
Phase 5 Verification Checklist
- Daytona API key verified in Secrets Manager
- Daytona connection tested successfully
- Daytona settings configured
- CloudWatch dashboard created
- CloudWatch alarms configured
- SNS topic created for alerts
- Email subscribed to alerts
- Container Insights enabled
- Test agent execution successful
- Load test completed (10 concurrent)
- Worker auto scaling configured
- Log Insights queries saved
- Log retention configured
- No errors in CloudWatch Logs