Infra Orchestrator

Central orchestrator for infrastructure control actions, routing between Tier 1 (Lambda-based stop/start) and Tier 2 (Terrateam Cloud) operations.

Why This Exists

Infrastructure management needs different approaches for different scenarios:

  • Quick stop/start: Seconds matter when you just want to pause resources
  • Full teardown: Complete infrastructure destruction needs careful orchestration
  • Safety controls: Production must be protected from accidental destruction

The orchestrator provides a unified interface that routes to the right tool:

| Tier | Speed | Use Case | Implementation |
|------|-------|----------|----------------|
| Tier 1 | Seconds | Stop/start ECS and RDS | Direct Lambda calls |
| Tier 2 | Minutes | Full infrastructure teardown/spinup | Terrateam Cloud API |

Architecture

        ┌─────────────────────────────────────┐
        │            Event Sources            │
        │  ┌──────────┐   ┌────────────────┐  │
        │  │  Slack   │   │  EventBridge   │  │
        │  │   Bot    │   │   Scheduler    │  │
        │  └────┬─────┘   └───────┬────────┘  │
        └───────┼─────────────────┼───────────┘
                │                 │
                v                 v
        ┌─────────────────────────────────────┐
        │         Orchestrator Lambda         │
        │   ┌─────────────────────────────┐   │
        │   │ 1. Validate Event           │   │
        │   │ 2. Check Safety Controls    │   │
        │   │ 3. Route to Tier 1 or 2     │   │
        │   │ 4. Send Slack Notification  │   │
        │   │ 5. Create Audit Log         │   │
        │   └─────────────────────────────┘   │
        └─────────┬───────────────┬───────────┘
                  │               │
           ┌──────┘               └──────┐
           v                             v
┌─────────────────────┐       ┌─────────────────────┐
│ Tier 1: Lambda      │       │ Tier 2: Terrateam   │
│                     │       │ Cloud API           │
│ - ECS: desired=0/N  │       │ - Full teardown     │
│ - RDS: stop/start   │       │ - Full spinup       │
└─────────────────────┘       └─────────────────────┘
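
The Tier 1 box in the diagram (ECS desired count set to 0, RDS stopped) can be sketched roughly as follows. This is an illustration only, not the actual stop_resources Lambda: the function name `stop_tier1`, its parameters, and the returned keys are assumptions (the keys mirror the documented success response). The AWS clients are passed in so the logic can be exercised without credentials; in a real run they would be `boto3.client("ecs")` and `boto3.client("rds")`.

```python
from typing import Any, Dict, Iterable


def stop_tier1(ecs: Any, rds: Any, cluster: str,
               services: Iterable[str],
               db_instances: Iterable[str]) -> Dict[str, int]:
    """Scale ECS services to zero and stop RDS instances (hypothetical helper)."""
    stopped_services = 0
    for service in services:
        # A service is paused by setting its desired task count to 0.
        ecs.update_service(cluster=cluster, service=service, desiredCount=0)
        stopped_services += 1

    stopped_instances = 0
    for instance_id in db_instances:
        rds.stop_db_instance(DBInstanceIdentifier=instance_id)
        stopped_instances += 1

    return {
        "ecs_services_stopped": stopped_services,
        "rds_instances_stopped": stopped_instances,
    }
```

A start operation would presumably mirror this, restoring each service's previous desired count (the "N" in the diagram) and calling `start_db_instance`; how the previous count is remembered is not described in this document.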

Actions

| Action | Tier | Description |
|--------|------|-------------|
| stop | 1 | Stop ECS services and RDS instances |
| start | 1 | Start ECS services and RDS instances |
| teardown | 2 | Destroy infrastructure via Terrateam |
| spinup | 2 | Create infrastructure via Terrateam |
| status | - | Query current infrastructure state |
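
The action-to-tier routing in the table above can be expressed as a simple lookup. The sketch below is a minimal illustration; `ACTION_TIERS` and `resolve_tier` are hypothetical names, not taken from handler.py.

```python
from typing import Optional

# Action -> tier mapping copied from the Actions table.
# None means the action is not routed to either tier.
ACTION_TIERS = {
    "stop": 1,
    "start": 1,
    "teardown": 2,
    "spinup": 2,
    "status": None,
}


def resolve_tier(action: str) -> Optional[int]:
    """Return the tier that handles an action; raise for unknown actions."""
    if action not in ACTION_TIERS:
        raise ValueError(f"INVALID_ACTION: {action!r}")
    return ACTION_TIERS[action]
```

Keeping the mapping in one dictionary means the allowed-action list and the routing decision cannot drift apart.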

Event Payload

{
  "action": "stop|start|teardown|spinup|status",
  "environment": "dev|staging",
  "level": "services|full",
  "triggered_by": "slack|schedule|manual",
  "confirmed": true
}

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| action | string | Action to perform | Yes |
| environment | string | Target environment | Yes |
| level | string | Teardown level (default: services) | No |
| triggered_by | string | Trigger source (default: manual) | No |
| confirmed | boolean | Confirmation for full teardown | Required for teardown with level=full |
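
To illustrate how the required fields and defaults in the table might be applied, here is a minimal sketch. `normalize_event` is a hypothetical helper, and treating a missing `confirmed` as `False` is an assumption; the documented defaults are only those for `level` and `triggered_by`.

```python
from typing import Any, Dict

REQUIRED_FIELDS = ("action", "environment")

# "level" and "triggered_by" defaults come from the field table;
# the "confirmed" default is an assumption for this sketch.
DEFAULTS = {"level": "services", "triggered_by": "manual", "confirmed": False}


def normalize_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Check required fields and fill in defaults for optional ones."""
    missing = [field for field in REQUIRED_FIELDS if field not in event]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    # Values supplied in the event override the defaults.
    return {**DEFAULTS, **event}
```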

Safety Controls

Environment Protection

  • Production is always protected: No teardown or stop actions allowed on prod
  • Valid environments: Only dev and staging are valid targets

Confirmation Requirements

  • Full teardown requires confirmation: Must set confirmed: true in the event
  • Slack modal confirmation: Users must type the environment name to confirm
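
The environment and confirmation rules above could be checked along these lines. This is a sketch under assumptions: `check_safety` is a hypothetical function, and the order of checks (protected environment first, then the allowed list) is chosen so that a teardown on prod yields PROTECTED_ENVIRONMENT, as in the documented error example. The actual ordering in handler.py is not stated here.

```python
from typing import Any, Dict, Optional

ALLOWED_ENVIRONMENTS = {"dev", "staging"}
PROTECTED_ENVIRONMENTS = {"prod"}
BLOCKED_ON_PROTECTED = {"stop", "teardown"}


def check_safety(event: Dict[str, Any]) -> Optional[str]:
    """Return an error code if the event must be rejected, otherwise None."""
    action = event["action"]
    environment = event["environment"]

    # Assumed order: report protection before the generic allowed-list check.
    if environment in PROTECTED_ENVIRONMENTS and action in BLOCKED_ON_PROTECTED:
        return "PROTECTED_ENVIRONMENT"
    if environment not in ALLOWED_ENVIRONMENTS:
        return "INVALID_ENVIRONMENT"

    # A full teardown needs an explicit confirmed: true.
    is_full_teardown = action == "teardown" and event.get("level", "services") == "full"
    if is_full_teardown and event.get("confirmed") is not True:
        return "CONFIRMATION_REQUIRED"
    return None
```

Using `is not True` rather than a truthiness test means a string such as "yes" does not count as confirmation.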

Tag-Based Protection

Resources tagged with TeardownPolicy: never are skipped:

resource "aws_ecs_service" "critical" {
  # ...
  tags = {
    TeardownPolicy = "never"
  }
}
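
On the orchestrator side, skipping tagged resources amounts to a filter over each resource's tags. The sketch below assumes a simplified resource shape (a dict with a `tags` mapping); the real AWS APIs return tags in service-specific formats, so `split_by_teardown_policy` is illustrative only.

```python
from typing import Any, Dict, List, Tuple

Resource = Dict[str, Any]  # hypothetical shape: {"name": str, "tags": {key: value}}


def split_by_teardown_policy(
    resources: List[Resource],
) -> Tuple[List[Resource], List[Resource]]:
    """Split resources into (actionable, skipped) using the TeardownPolicy tag."""
    actionable: List[Resource] = []
    skipped: List[Resource] = []
    for resource in resources:
        tags = resource.get("tags", {})
        if tags.get("TeardownPolicy") == "never":
            skipped.append(resource)
        else:
            actionable.append(resource)
    return actionable, skipped
```

Returning the skipped list as well makes it easy to report protected resources in the notification or audit record.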

Audit Logging

All actions are logged to CloudWatch with:

  • Timestamp (UTC)
  • Action, environment, level
  • Trigger source
  • Status (started, completed, failed)
  • Details and error information
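
As a rough illustration of an audit entry carrying the fields listed above, the sketch below emits one JSON line per event through the standard logging module, whose output Lambda forwards to CloudWatch Logs. The key names, the logger name, and the helper functions are assumptions, not the actual log schema.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict, Optional

logger = logging.getLogger("infra_orchestrator.audit")  # hypothetical logger name

VALID_STATUSES = ("started", "completed", "failed")


def build_audit_record(event: Dict[str, Any], status: str,
                       details: Optional[Dict[str, Any]] = None,
                       error: Optional[str] = None) -> Dict[str, Any]:
    """Assemble one audit entry; key names are illustrative."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC, ISO 8601
        "action": event["action"],
        "environment": event["environment"],
        "level": event.get("level", "services"),
        "triggered_by": event.get("triggered_by", "manual"),
        "status": status,
        "details": details or {},
        "error": error,
    }


def audit(event: Dict[str, Any], status: str, **kwargs: Any) -> Dict[str, Any]:
    """Log the record as a single JSON line and return it."""
    record = build_audit_record(event, status, **kwargs)
    logger.info(json.dumps(record))
    return record
```

Logging one JSON object per line keeps the entries queryable with CloudWatch Logs Insights.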

Response Format

Success

{
  "statusCode": 200,
  "body": {
    "action": "stop",
    "environment": "dev",
    "level": "services",
    "status": "completed",
    "details": {
      "ecs_services_stopped": 2,
      "rds_instances_stopped": 1
    }
  }
}

Error

{
  "statusCode": 400,
  "body": {
    "error": {
      "code": "PROTECTED_ENVIRONMENT",
      "message": "Action 'teardown' is blocked on protected environment 'prod'"
    }
  }
}

Error Codes

| Code | Description |
|------|-------------|
| INVALID_ACTION | Action not in allowed list |
| INVALID_ENVIRONMENT | Environment not in allowed list |
| PROTECTED_ENVIRONMENT | Action blocked on production |
| CONFIRMATION_REQUIRED | Full teardown requires confirmed: true |
| LAMBDA_INVOCATION_FAILED | Tier 1 Lambda invocation failed |
| TERRATEAM_ERROR | Terrateam API call failed |
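
A small helper can keep error responses consistent with the documented error format. This is a sketch: `error_response` is a hypothetical name, and the default status code of 400 is an assumption for codes other than PROTECTED_ENVIRONMENT, which is the only one with a documented example.

```python
from typing import Any, Dict

# Codes and descriptions copied from the Error Codes table.
ERROR_CODES = {
    "INVALID_ACTION": "Action not in allowed list",
    "INVALID_ENVIRONMENT": "Environment not in allowed list",
    "PROTECTED_ENVIRONMENT": "Action blocked on production",
    "CONFIRMATION_REQUIRED": "Full teardown requires confirmed: true",
    "LAMBDA_INVOCATION_FAILED": "Tier 1 Lambda invocation failed",
    "TERRATEAM_ERROR": "Terrateam API call failed",
}


def error_response(code: str, message: str, status_code: int = 400) -> Dict[str, Any]:
    """Build an error response in the documented shape."""
    if code not in ERROR_CODES:
        raise ValueError(f"unknown error code: {code!r}")
    return {
        "statusCode": status_code,
        "body": {"error": {"code": code, "message": message}},
    }
```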

Environment Variables

| Variable | Description | Required |
|----------|-------------|----------|
| STOP_LAMBDA_ARN | ARN of the stop_resources Lambda | Yes |
| START_LAMBDA_ARN | ARN of the start_resources Lambda | Yes |
| TERRATEAM_SECRET_NAME | Secrets Manager secret for Terrateam API token | Yes |
| SLACK_WEBHOOK_SECRET_NAME | Secrets Manager secret for Slack webhook URL | No |
| GITHUB_REPO | GitHub repository for Terrateam runs | Yes |
| ENVIRONMENT | Current environment | Yes |
| LOG_LEVEL | Logging level | No (default: INFO) |
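
One way to read these variables at cold start and fail fast on missing required ones is sketched below. `load_config` is a hypothetical helper, and representing an unset SLACK_WEBHOOK_SECRET_NAME as an empty string is an assumption of this sketch.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = (
    "STOP_LAMBDA_ARN",
    "START_LAMBDA_ARN",
    "TERRATEAM_SECRET_NAME",
    "GITHUB_REPO",
    "ENVIRONMENT",
)
# LOG_LEVEL default comes from the table; the empty Slack default is assumed.
OPTIONAL_DEFAULTS = {"SLACK_WEBHOOK_SECRET_NAME": "", "LOG_LEVEL": "INFO"}


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect settings from the environment, raising if a required one is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    config = {name: env[name] for name in REQUIRED_VARS}
    for name, default in OPTIONAL_DEFAULTS.items():
        config[name] = env.get(name, default)
    return config
```

Accepting the mapping as a parameter lets tests pass a plain dict instead of mutating `os.environ`.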

Deployment

module "infra_orchestrator" {
  source = "../../modules/infra-orchestrator"

  name        = "docustack-${var.environment}"
  environment = var.environment

  stop_lambda_arn  = module.nightly_scheduler.stop_lambda_arn
  start_lambda_arn = module.nightly_scheduler.start_lambda_arn

  terrateam_secret_name     = aws_secretsmanager_secret.terrateam.name
  slack_webhook_secret_name = aws_secretsmanager_secret.slack_webhook.name
  github_repo               = "your-org/docustack"
}

Development Workflow

Local Testing

cd docustack-mono/services/lambdas/infra-orchestrator

# Install dependencies
pip install boto3 requests

# Set environment variables
export STOP_LAMBDA_ARN="arn:aws:lambda:us-east-1:123456789012:function:stop-resources"
export START_LAMBDA_ARN="arn:aws:lambda:us-east-1:123456789012:function:start-resources"
export TERRATEAM_SECRET_NAME="docustack/terrateam-api-token"
export GITHUB_REPO="your-org/docustack"
export ENVIRONMENT="dev"
export LOG_LEVEL="DEBUG"

# Test status action
python -c "
from handler import lambda_handler
result = lambda_handler({
    'action': 'status',
    'environment': 'dev',
    'triggered_by': 'manual'
}, None)
print(result)
"

Unit Tests

# Run tests with pytest
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=. --cov-report=html

Testing in AWS

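# Test status
# Note: AWS CLI v2 treats --payload as base64 by default; add
# --cli-binary-format raw-in-base64-out to pass raw JSON as shown here.
aws lambda invoke \
  --function-name docustack-dev-infra-orchestrator \
  --payload '{"action":"status","environment":"dev"}' \
  /tmp/response.json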

cat /tmp/response.json

Files

| File | Description |
|------|-------------|
| handler.py | Main Lambda handler and orchestration logic |
| terrateam_client.py | Terrateam Cloud API client |
| slack_notifier.py | Slack notification module |

Troubleshooting

Tier 1 actions failing

  1. Check CloudWatch logs for the orchestrator
  2. Verify STOP_LAMBDA_ARN and START_LAMBDA_ARN are correct
  3. Check IAM permissions for Lambda invoke
  4. Review nightly scheduler Lambda logs

Tier 2 actions failing

  1. Verify Terrateam API token in Secrets Manager
  2. Check GITHUB_REPO is correct
  3. Review Terrateam Cloud dashboard for run status
  4. Check GitHub Actions workflow logs

Production protection triggered

This is expected behavior. Production cannot be stopped or torn down via the orchestrator. Use manual Terraform operations with appropriate approvals.

Code Location

docustack-mono/services/lambdas/infra-orchestrator/
├── handler.py            # Main orchestrator logic
├── terrateam_client.py   # Terrateam API client
├── slack_notifier.py     # Slack notifications
└── README.md