Infra Orchestrator
Central orchestrator for infrastructure control actions, routing between Tier 1 (Lambda-based stop/start) and Tier 2 (Terrateam Cloud) operations.
Why This Exists
Infrastructure management needs different approaches for different scenarios:
- Quick stop/start: Seconds matter when you just want to pause resources
- Full teardown: Complete infrastructure destruction needs careful orchestration
- Safety controls: Production must be protected from accidental destruction
The orchestrator provides a unified interface that routes to the right tool:
| Tier | Speed | Use Case | Implementation |
|---|---|---|---|
| Tier 1 | Seconds | Stop/start ECS and RDS | Direct Lambda calls |
| Tier 2 | Minutes | Full infrastructure teardown/spinup | Terrateam Cloud API |
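The "Direct Lambda calls" entry for Tier 1 can be illustrated with a minimal sketch. It assumes the orchestrator invokes the stop/start Lambdas synchronously through a boto3-style Lambda client; the function name `invoke_tier1` and the error wrapping are hypothetical and are not taken from `handler.py`.

```python
import json


def invoke_tier1(lambda_client, function_arn: str, event: dict) -> dict:
    """Synchronously invoke a Tier 1 Lambda and return its decoded JSON result.

    `lambda_client` is assumed to behave like boto3.client("lambda").
    """
    response = lambda_client.invoke(
        FunctionName=function_arn,
        InvocationType="RequestResponse",  # wait for the function's result
        Payload=json.dumps(event).encode("utf-8"),
    )
    raw = response["Payload"].read()
    # boto3 includes "FunctionError" in the response when the function raised.
    if response.get("FunctionError"):
        detail = raw.decode("utf-8", "replace")
        raise RuntimeError(f"LAMBDA_INVOCATION_FAILED: {detail}")
    return json.loads(raw) if raw else {}
```

Passing the client in as a parameter keeps the sketch testable without AWS credentials; in the Lambda itself the client would presumably be created once at module load.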
Architecture
                     ┌─────────────────────────────────────┐
                     │            Event Sources            │
                     │  ┌──────────┐   ┌────────────────┐  │
                     │  │  Slack   │   │  EventBridge   │  │
                     │  │   Bot    │   │   Scheduler    │  │
                     │  └────┬─────┘   └───────┬────────┘  │
                     └───────┼─────────────────┼───────────┘
                             │                 │
                             v                 v
                     ┌─────────────────────────────────────┐
                     │         Orchestrator Lambda         │
                     │   ┌─────────────────────────────┐   │
                     │   │  1. Validate Event          │   │
                     │   │  2. Check Safety Controls   │   │
                     │   │  3. Route to Tier 1 or 2    │   │
                     │   │  4. Send Slack Notification │   │
                     │   │  5. Create Audit Log        │   │
                     │   └─────────────────────────────┘   │
                     └─────────┬───────────────┬───────────┘
                               │               │
               ┌───────────────┘               └───────────────┐
               v                                               v
    ┌─────────────────────┐                         ┌─────────────────────┐
    │ Tier 1: Lambda      │                         │ Tier 2: Terrateam   │
    │                     │                         │         Cloud API   │
    │ - ECS: desired=0/N  │                         │ - Full teardown     │
    │ - RDS: stop/start   │                         │ - Full spinup       │
    └─────────────────────┘                         └─────────────────────┘
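The Tier 1 box in the diagram (ECS `desired=0/N`, RDS stop/start) might look roughly like the sketch below inside the stop and start Lambdas. This is an assumption about their internals, not their actual code: the function names, the `desired_counts` mapping, and the `*_started` result keys are hypothetical. The clients are expected to behave like `boto3.client("ecs")` and `boto3.client("rds")`.

```python
def stop_resources(ecs_client, rds_client, cluster: str,
                   services: list, db_instances: list) -> dict:
    """Scale each ECS service to zero tasks and stop each RDS instance."""
    for service in services:
        ecs_client.update_service(cluster=cluster, service=service, desiredCount=0)
    for db_id in db_instances:
        rds_client.stop_db_instance(DBInstanceIdentifier=db_id)
    return {
        "ecs_services_stopped": len(services),
        "rds_instances_stopped": len(db_instances),
    }


def start_resources(ecs_client, rds_client, cluster: str,
                    desired_counts: dict, db_instances: list) -> dict:
    """Restore each ECS service to its desired count N and start RDS instances.

    `desired_counts` maps service name -> task count (hypothetical input; the
    real Lambda may read N from tags or configuration instead).
    """
    for service, count in desired_counts.items():
        ecs_client.update_service(cluster=cluster, service=service, desiredCount=count)
    for db_id in db_instances:
        rds_client.start_db_instance(DBInstanceIdentifier=db_id)
    return {
        "ecs_services_started": len(desired_counts),  # hypothetical key
        "rds_instances_started": len(db_instances),   # hypothetical key
    }
```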
Actions
| Action | Tier | Description |
|---|---|---|
| stop | 1 | Stop ECS services and RDS instances |
| start | 1 | Start ECS services and RDS instances |
| teardown | 2 | Destroy infrastructure via Terrateam |
| spinup | 2 | Create infrastructure via Terrateam |
| status | - | Query current infrastructure state |
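The action-to-tier mapping in the table above can be expressed as a small lookup. The sketch below is illustrative only: `ACTION_TIERS`, `route_action`, and the returned backend labels are hypothetical names, and how `status` is served is an assumption.

```python
from typing import Optional

# Mirrors the Actions table; None marks an action that has no tier
# (assumed here to be answered by the orchestrator itself).
ACTION_TIERS: dict = {
    "stop": 1,
    "start": 1,
    "teardown": 2,
    "spinup": 2,
    "status": None,
}


def route_action(action: str) -> str:
    """Return a label for the backend that would handle `action`."""
    if action not in ACTION_TIERS:
        raise ValueError(f"INVALID_ACTION: {action!r}")
    tier: Optional[int] = ACTION_TIERS[action]
    if tier == 1:
        return "tier1_lambda"
    if tier == 2:
        return "tier2_terrateam"
    return "status_query"
```

For example, `route_action("teardown")` yields `"tier2_terrateam"`, while an unknown action raises a `ValueError` carrying the `INVALID_ACTION` code.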
Event Payload
{
  "action": "stop|start|teardown|spinup|status",
  "environment": "dev|staging",
  "level": "services|full",
  "triggered_by": "slack|schedule|manual",
  "confirmed": true
}
| Field | Type | Description | Required |
|---|---|---|---|
| action | string | Action to perform | Yes |
| environment | string | Target environment | Yes |
| level | string | Teardown level (default: services) | No |
| triggered_by | string | Trigger source (default: manual) | No |
| confirmed | boolean | Confirmation for full teardown | Required for teardown with level=full |
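The required fields and defaults in the table can be applied in one normalization step. The following is a sketch under assumptions: `normalize_event` is a hypothetical helper, and defaulting `confirmed` to `False` is inferred rather than documented.

```python
REQUIRED_FIELDS = ("action", "environment")

# Defaults for "level" and "triggered_by" come from the table above;
# the default for "confirmed" is an assumption.
DEFAULTS = {"level": "services", "triggered_by": "manual", "confirmed": False}


def normalize_event(event: dict) -> dict:
    """Check required fields and fill in defaults without mutating the input."""
    missing = [field for field in REQUIRED_FIELDS if field not in event]
    if missing:
        raise ValueError("Missing required field(s): " + ", ".join(missing))
    normalized = {**DEFAULTS, **event}
    if not isinstance(normalized["confirmed"], bool):
        raise ValueError("'confirmed' must be a boolean")
    return normalized
```

A minimal event such as `{"action": "stop", "environment": "dev"}` comes back with `level`, `triggered_by`, and `confirmed` filled in, so later steps can read every field unconditionally.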
Safety Controls
Environment Protection
- Production is always protected: No teardown or stop actions allowed on prod
- Valid environments: Only dev and staging are valid targets
Confirmation Requirements
- Full teardown requires confirmation: Must set confirmed: true in the event
- Slack modal confirmation: Users must type the environment name to confirm
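The environment and confirmation rules above could be combined into a single check that returns one of the documented error codes. This is a sketch, not the handler's actual logic: `check_safety` is a hypothetical name, and the order of the checks (protection before validity) is an assumption.

```python
from typing import Optional

PROTECTED_ENVIRONMENTS = {"prod"}
VALID_ENVIRONMENTS = {"dev", "staging"}
BLOCKED_ON_PROTECTED = {"stop", "teardown"}


def check_safety(event: dict) -> Optional[str]:
    """Return an error code if the event must be blocked, otherwise None."""
    action = event.get("action")
    environment = event.get("environment")
    # Assumed ordering: report protection first so a prod teardown yields
    # PROTECTED_ENVIRONMENT rather than INVALID_ENVIRONMENT.
    if environment in PROTECTED_ENVIRONMENTS and action in BLOCKED_ON_PROTECTED:
        return "PROTECTED_ENVIRONMENT"
    if environment not in VALID_ENVIRONMENTS:
        return "INVALID_ENVIRONMENT"
    is_full_teardown = action == "teardown" and event.get("level", "services") == "full"
    if is_full_teardown and event.get("confirmed") is not True:
        return "CONFIRMATION_REQUIRED"
    return None
```

Comparing with `is not True` rejects truthy non-boolean values such as the string `"yes"`, which keeps the confirmation explicit.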
Tag-Based Protection
Resources tagged with TeardownPolicy: never are skipped:
resource "aws_ecs_service" "critical" {
  # ...
  tags = {
    TeardownPolicy = "never"
  }
}
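On the Lambda side, honoring this tag amounts to filtering resources before acting on them. The sketch below assumes each resource has already been normalized into a dict with a `name` and a plain `tags` mapping; that shape and the function name are hypothetical.

```python
def split_by_teardown_policy(resources: list) -> tuple:
    """Separate resources into (eligible, skipped) based on TeardownPolicy.

    Each resource is assumed to look like {"name": str, "tags": {key: value}}.
    """
    eligible, skipped = [], []
    for resource in resources:
        tags = resource.get("tags") or {}
        if tags.get("TeardownPolicy") == "never":
            skipped.append(resource)
        else:
            eligible.append(resource)
    return eligible, skipped
```

Returning the skipped resources as well as the eligible ones makes it straightforward to report them in the audit log or the Slack notification.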
Audit Logging
All actions are logged to CloudWatch with:
- Timestamp (UTC)
- Action, environment, level
- Trigger source
- Status (started, completed, failed)
- Details and error information
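One way to emit such records is as single-line JSON through the standard `logging` module, whose output Lambda forwards to CloudWatch Logs. The sketch below is an assumption about the record shape: the field names, the logger name, and both helper functions are hypothetical.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger("infra_orchestrator.audit")  # hypothetical logger name

VALID_STATUSES = {"started", "completed", "failed"}


def build_audit_record(action: str, environment: str, level: str,
                       triggered_by: str, status: str,
                       details: Optional[dict] = None,
                       error: Optional[str] = None) -> dict:
    """Assemble one audit record with a UTC ISO 8601 timestamp."""
    if status not in VALID_STATUSES:
        raise ValueError(f"Unknown status: {status!r}")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "environment": environment,
        "level": level,
        "triggered_by": triggered_by,
        "status": status,
        "details": details or {},
        "error": error,
    }


def log_audit(record: dict) -> None:
    """Write the record as one JSON line so it can be queried as structured data."""
    logger.info(json.dumps(record, sort_keys=True))
```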
Response Format
Success
{
  "statusCode": 200,
  "body": {
    "action": "stop",
    "environment": "dev",
    "level": "services",
    "status": "completed",
    "details": {
      "ecs_services_stopped": 2,
      "rds_instances_stopped": 1
    }
  }
}
Error
{
  "statusCode": 400,
  "body": {
    "error": {
      "code": "PROTECTED_ENVIRONMENT",
      "message": "Action 'teardown' is blocked on protected environment 'prod'"
    }
  }
}
Error Codes
| Code | Description |
|---|---|
| INVALID_ACTION | Action not in allowed list |
| INVALID_ENVIRONMENT | Environment not in allowed list |
| PROTECTED_ENVIRONMENT | Action blocked on production |
| CONFIRMATION_REQUIRED | Full teardown requires confirmed: true |
| LAMBDA_INVOCATION_FAILED | Tier 1 Lambda invocation failed |
| TERRATEAM_ERROR | Terrateam API call failed |
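The success and error shapes shown under Response Format, together with the codes above, can be produced by two small builders. This is a sketch with hypothetical helper names; returning status 400 for every error code is an assumption based on the single documented example.

```python
def success_response(action: str, environment: str, level: str, details: dict) -> dict:
    """Build a response in the documented success shape."""
    return {
        "statusCode": 200,
        "body": {
            "action": action,
            "environment": environment,
            "level": level,
            "status": "completed",
            "details": details,
        },
    }


def error_response(code: str, message: str, status_code: int = 400) -> dict:
    """Build a response in the documented error shape."""
    return {
        "statusCode": status_code,
        "body": {"error": {"code": code, "message": message}},
    }
```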
Environment Variables
| Variable | Description | Required |
|---|---|---|
| STOP_LAMBDA_ARN | ARN of the stop_resources Lambda | Yes |
| START_LAMBDA_ARN | ARN of the start_resources Lambda | Yes |
| TERRATEAM_SECRET_NAME | Secrets Manager secret for Terrateam API token | Yes |
| SLACK_WEBHOOK_SECRET_NAME | Secrets Manager secret for Slack webhook URL | No |
| GITHUB_REPO | GitHub repository for Terrateam runs | Yes |
| ENVIRONMENT | Current environment | Yes |
| LOG_LEVEL | Logging level | No (default: INFO) |
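Reading these variables once at startup and failing fast on missing required values might look like the sketch below. `OrchestratorConfig` and `load_config` are hypothetical names; only the variable names, their required/optional status, and the `INFO` default come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED_VARS = (
    "STOP_LAMBDA_ARN",
    "START_LAMBDA_ARN",
    "TERRATEAM_SECRET_NAME",
    "GITHUB_REPO",
    "ENVIRONMENT",
)


@dataclass(frozen=True)
class OrchestratorConfig:
    stop_lambda_arn: str
    start_lambda_arn: str
    terrateam_secret_name: str
    github_repo: str
    environment: str
    slack_webhook_secret_name: Optional[str] = None
    log_level: str = "INFO"


def load_config(env: Optional[Mapping[str, str]] = None) -> OrchestratorConfig:
    """Build the config from `env` (defaults to os.environ), listing all missing vars."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return OrchestratorConfig(
        stop_lambda_arn=env["STOP_LAMBDA_ARN"],
        start_lambda_arn=env["START_LAMBDA_ARN"],
        terrateam_secret_name=env["TERRATEAM_SECRET_NAME"],
        github_repo=env["GITHUB_REPO"],
        environment=env["ENVIRONMENT"],
        slack_webhook_secret_name=env.get("SLACK_WEBHOOK_SECRET_NAME") or None,
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Accepting a mapping argument lets unit tests pass a plain dict instead of patching `os.environ`.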
Deployment
module "infra_orchestrator" {
  source = "../../modules/infra-orchestrator"
  name = "docustack-${var.environment}"
  environment = var.environment
  stop_lambda_arn = module.nightly_scheduler.stop_lambda_arn
  start_lambda_arn = module.nightly_scheduler.start_lambda_arn
  terrateam_secret_name = aws_secretsmanager_secret.terrateam.name
  slack_webhook_secret_name = aws_secretsmanager_secret.slack_webhook.name
  github_repo = "your-org/docustack"
}
Development Workflow
Local Testing
cd docustack-mono/services/lambdas/infra-orchestrator
# Install dependencies
pip install boto3 requests
# Set environment variables
export STOP_LAMBDA_ARN="arn:aws:lambda:us-east-1:123456789012:function:stop-resources"
export START_LAMBDA_ARN="arn:aws:lambda:us-east-1:123456789012:function:start-resources"
export TERRATEAM_SECRET_NAME="docustack/terrateam-api-token"
export GITHUB_REPO="your-org/docustack"
export ENVIRONMENT="dev"
export LOG_LEVEL="DEBUG"
# Test status action
python -c "
from handler import lambda_handler
result = lambda_handler({
    'action': 'status',
    'environment': 'dev',
    'triggered_by': 'manual'
}, None)
print(result)
"
Unit Tests
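# Run tests with pytest (install it first: pip install pytest pytest-cov)
pytest tests/ -v
# Run with coverage (the --cov options come from the pytest-cov plugin)
pytest tests/ --cov=. --cov-report=html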
Testing in AWS
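# Test status
# AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON
# payload; omit that flag on AWS CLI v1.
aws lambda invoke \
  --function-name docustack-dev-infra-orchestrator \
  --cli-binary-format raw-in-base64-out \
  --payload '{"action":"status","environment":"dev"}' \
  /tmp/response.json
cat /tmp/response.json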
Files
| File | Description |
|---|---|
| handler.py | Main Lambda handler and orchestration logic |
| terrateam_client.py | Terrateam Cloud API client |
| slack_notifier.py | Slack notification module |
Troubleshooting
Tier 1 actions failing
- Check CloudWatch logs for the orchestrator
- Verify STOP_LAMBDA_ARN and START_LAMBDA_ARN are correct
- Check IAM permissions for Lambda invoke
- Review nightly scheduler Lambda logs
Tier 2 actions failing
- Verify Terrateam API token in Secrets Manager
- Check GITHUB_REPO is correct
- Review Terrateam Cloud dashboard for run status
- Check GitHub Actions workflow logs
Production protection triggered
This is expected behavior. Production cannot be stopped or torn down via the orchestrator. Use manual Terraform operations with appropriate approvals.
Code Location
docustack-mono/services/lambdas/infra-orchestrator/
├── handler.py # Main orchestrator logic
├── terrateam_client.py # Terrateam API client
├── slack_notifier.py # Slack notifications
└── README.md