
Infrastructure Teardown System

Automated infrastructure teardown and spinup system with ChatOps control for cost optimization in non-production environments.

Overview

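The Infrastructure Teardown System takes a hybrid approach to managing infrastructure costs: two execution tiers (fast stop/start and full Terraform teardown) and two ways to trigger them (Slack commands and a schedule):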

| Tier | Method | Speed | Use Case |
| --- | --- | --- | --- |
| Tier 1 | Lambda stop/start | ~30 seconds | Nightly savings, quick stops |
| Tier 2 | Terrateam Cloud | ~10 minutes | Weekend teardown, maximum savings |
| ChatOps | Slack commands | Instant | Manual control |
| Scheduled | EventBridge | Automated | Nightly teardown |

Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                             Control Plane                              │
│                                                                        │
│   ┌─────────────────┐          ┌─────────────────┐                     │
│   │  Slack Bot      │          │  EventBridge    │                     │
│   │  (ECS Fargate)  │          │  Scheduler      │                     │
│   │                 │          │                 │                     │
│   │  /infra status  │          │ 7 AM UTC: stop  │                     │
│   │  /infra stop    │          │ 10 PM UTC: start│                     │
│   │  /infra start   │          │                 │                     │
│   │  /infra teardown│          │                 │                     │
│   │  /infra spinup  │          │                 │                     │
│   └────────┬────────┘          └────────┬────────┘                     │
│            │                            │                              │
│            └──────────────┬─────────────┘                              │
│                           │                                            │
│                           ▼                                            │
│                ┌─────────────────────┐                                 │
│                │    Orchestrator     │                                 │
│                │       Lambda        │                                 │
│                │                     │                                 │
│                │  - Validate input   │                                 │
│                │  - Safety checks    │                                 │
│                │  - Route to tier    │                                 │
│                │  - Audit logging    │                                 │
│                │  - Notifications    │                                 │
│                └──────────┬──────────┘                                 │
│                           │                                            │
└───────────────────────────┼────────────────────────────────────────────┘
                            │
            ┌───────────────┴───────────────┐
            │                               │
            ▼                               ▼
 ┌─────────────────────┐         ┌─────────────────────┐
 │       Tier 1        │         │       Tier 2        │
 │  Lambda Stop/Start  │         │  Terrateam Cloud    │
 │                     │         │                     │
 │  Fast operations:   │         │  Full operations:   │
 │  - ECS desired=0/N  │         │  - terraform apply  │
 │  - RDS stop/start   │         │  - TEARDOWN_LEVEL   │
 │                     │         │    variable         │
 │  ~30 seconds        │         │  ~5-10 minutes      │
 └─────────────────────┘         └─────────────────────┘
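
The "route to tier" step in the diagram can be pictured as a small lookup from action to tier. The sketch below is illustrative only: the name `route_action` and the tier labels `"tier1"`/`"tier2"` are assumptions, and only the action names and their tier assignments come from this page.

```python
# Hypothetical sketch of the orchestrator's "route to tier" step.
# Action names match the Slack commands; all other names are illustrative.
TIER_BY_ACTION = {
    "stop": "tier1",      # Lambda stop/start
    "start": "tier1",
    "teardown": "tier2",  # Terrateam Cloud (terraform apply)
    "spinup": "tier2",
}


def route_action(action: str) -> str:
    """Return the tier that should handle an action, or raise on bad input."""
    try:
        return TIER_BY_ACTION[action]
    except KeyError:
        raise ValueError(f"unknown action: {action!r}") from None
```

Rejecting unknown actions up front keeps input validation ahead of any safety check or AWS call.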

Teardown Levels

The system supports three teardown levels controlled by the TEARDOWN_LEVEL Terraform variable:

Level: none (Default)

All resources are created and running. Normal operational state.

Level: services

Stops running services while preserving infrastructure:

| Resource Type | Action | Preserved |
| --- | --- | --- |
| ECS Services | Desired count = 0 | Yes |
| RDS Instances | Stopped | Yes (data intact) |
| NAT Gateways | Running | Yes |
| Load Balancers | Running | Yes |
| VPC | Running | Yes |

Use case: Nightly cost savings when a quick restore is needed

Level: full

Destroys non-critical infrastructure:

| Resource Type | Action | Preserved |
| --- | --- | --- |
| ECS Services | Destroyed | No |
| ECS Cluster | Destroyed | No |
| RDS Instances | Snapshot + Destroy | Snapshot only |
| NAT Gateways | Destroyed | No |
| Load Balancers | Destroyed | No |
| VPC | Preserved | Yes |
| S3 Buckets | Preserved | Yes |
| Secrets Manager | Preserved | Yes |
| IAM Roles | Preserved | Yes |

Use case: Weekend or holiday savings when a longer restore time is acceptable

Tier 1 vs Tier 2 Operations

Tier 1: Lambda Stop/Start

Fast operations using AWS SDK directly with tag-based discovery:

| Operation | Time | Cost Impact |
| --- | --- | --- |
| Stop ECS | ~10 s | Immediate |
| Start ECS | ~30 s | Tasks launch |
| Stop RDS | ~5 min | Immediate |
| Start RDS | ~5 min | Instance starts |
| Stop EC2 | ~30 s | Immediate |
| Start EC2 | ~60 s | Instance starts |

Discovery Mode: Lambda auto-discovers ECS services, RDS instances, and EC2 instances. Resources tagged with NightlyTeardown=skip are excluded.
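
The skip-tag rule can be sketched as a pure filter over discovered resources. This is a minimal illustration, not the Lambda's actual code: the function names and the `{"id": ..., "tags": [...]}` resource shape are assumptions. It expects tags normalized to the `Key`/`Value` form that the EC2 and RDS APIs return; the ECS API returns lowercase `key`/`value`, so a real implementation would need to normalize those first.

```python
from typing import Iterable

SKIP_TAG_KEY = "NightlyTeardown"
SKIP_TAG_VALUE = "skip"


def is_skipped(tags: Iterable[dict]) -> bool:
    """True if a tag list ([{"Key": ..., "Value": ...}]) opts the resource out."""
    return any(
        tag.get("Key") == SKIP_TAG_KEY and tag.get("Value") == SKIP_TAG_VALUE
        for tag in tags
    )


def select_targets(resources: list[dict]) -> list[str]:
    """Return ids of discovered resources that are not tagged to be skipped.

    Each resource is a hypothetical dict: {"id": str, "tags": [tag, ...]}.
    """
    return [r["id"] for r in resources if not is_skipped(r.get("tags", []))]
```

For example, given an `api` service with no tags and a `worker` service tagged `NightlyTeardown=skip`, `select_targets` returns only `["api"]`.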

Best for:

  • Nightly schedules
  • Quick manual stops
  • Preserving all infrastructure
  • Dynamic environments (no manual resource lists needed)

Tier 2: Terrateam Cloud

Full Terraform operations via Terrateam Cloud API:

| Operation | Time | Cost Impact |
| --- | --- | --- |
| Teardown (services) | ~5 min | Moderate |
| Teardown (full) | ~10 min | Maximum |
| Spinup | ~10 min | Full restore |

Best for:

  • Weekend teardowns
  • Maximum cost savings
  • Infrastructure changes

Slack ChatOps Commands

Command Reference

/infra                              # Show help
/infra status [env]                 # Show infrastructure status
/infra stop [env]                   # Tier 1: Stop ECS/RDS
/infra start [env]                  # Tier 1: Start ECS/RDS
/infra teardown [env] --level=X     # Tier 2: Teardown infrastructure
/infra spinup [env]                 # Tier 2: Restore infrastructure
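
To make the command grammar concrete, here is a minimal parsing sketch for the text that follows `/infra`. The `InfraCommand` dataclass and `parse_infra_command` function are hypothetical names, not part of the bot's documented code, and the sketch ignores the whitelist subcommands.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class InfraCommand:
    action: str
    env: Optional[str] = None
    level: Optional[str] = None


def parse_infra_command(text: str) -> InfraCommand:
    """Parse the text after '/infra' into an action, environment, and level."""
    tokens = shlex.split(text)
    if not tokens:
        return InfraCommand(action="help")  # a bare /infra shows help

    action, env, level = tokens[0], None, None
    for token in tokens[1:]:
        if token.startswith("--level="):
            level = token.split("=", 1)[1]
        elif not token.startswith("--") and env is None:
            env = token  # first positional argument is the environment
    return InfraCommand(action=action, env=env, level=level)
```

With this sketch, `teardown dev --level=full` parses to action `teardown`, environment `dev`, and level `full`.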

Examples

# Check dev environment status
/infra status dev

# Stop staging for the night
/infra stop staging

# Start dev in the morning
/infra start dev

# Full teardown for weekend (requires confirmation)
/infra teardown dev --level=full

# Restore after weekend
/infra spinup dev

Response Examples

Status Response:

Infrastructure Status: DEV
─────────────────────────
ECS Services:
🟢 api - RUNNING (desired: 2, running: 2)
🟢 worker - RUNNING (desired: 1, running: 1)

RDS Instances:
🟢 docustack-dev-db - available

🕐 2024-12-12 10:30:00 UTC

Action Response:

✅ Infrastructure Stop Completed

Environment: `dev`
Level: `services`
Status: `completed`

Details:
- ECS services stopped: 2
- RDS instances stopped: 1

🕐 2024-12-12 22:00:00 UTC
🏷️ Action ID: `a1b2c3d4`

IP Whitelist Commands

The Slack bot also provides IP whitelist management for controlling access to protected resources.

Commands

/infra whitelist add <ip> [--ttl=<duration>] [--description='<text>']
/infra whitelist remove <ip>
/infra whitelist list [env]
/infra whitelist refresh [env]

TTL Options

  • Minutes: 1m to 1440m (e.g., 5m, 30m for testing)
  • Days: 1d to 365d (e.g., 7d, 30d); the default is 30d
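
The TTL ranges above can be validated with a short parser. This is a sketch under stated assumptions: `parse_ttl` is a hypothetical name, and treating `30d` as the default follows the list above rather than any confirmed implementation detail.

```python
import re
from datetime import timedelta

# Allowed ranges per unit, taken from the TTL options listed above.
_LIMITS = {"m": (1, 1440), "d": (1, 365)}


def parse_ttl(value: str = "30d") -> timedelta:
    """Convert a TTL such as '5m' or '7d' into a timedelta, enforcing the ranges."""
    match = re.fullmatch(r"(\d+)([md])", value)
    if match is None:
        raise ValueError(f"invalid TTL format: {value!r}")

    amount, unit = int(match.group(1)), match.group(2)
    low, high = _LIMITS[unit]
    if not low <= amount <= high:
        raise ValueError(f"TTL out of range ({low}{unit} to {high}{unit}): {value!r}")

    return timedelta(minutes=amount) if unit == "m" else timedelta(days=amount)
```

Out-of-range or malformed values such as `0m`, `366d`, or `2h` raise `ValueError` instead of being silently clamped.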

Examples

# Add IP for quick testing (5 minutes)
/infra whitelist add 1.2.3.4 --ttl=5m --description='Quick test'

# Add IP for a week
/infra whitelist add 1.2.3.4 --ttl=7d --description='Home office'

# List current whitelist
/infra whitelist list dev

# Remove IP
/infra whitelist remove 1.2.3.4

Automatic Features

  • Auto-sync: Security groups update immediately when IPs are added/removed
  • Auto-expiration: IPs automatically removed when TTL expires
  • Notifications: Slack channel notified when IPs expire
  • Managed Ports: 80 (HTTP), 443 (HTTPS), 7233 (gRPC for Temporal workers)

Safety Controls

Protected Environments

Production is always protected:

PROTECTED_ENVIRONMENTS = ["prod", "production"]

  • No teardown or stop actions are allowed
  • Blocked at the orchestrator level
  • Blocked again at the Slack bot level
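
A guard of this kind might look like the following sketch. Only `PROTECTED_ENVIRONMENTS` comes from this page; the `DESTRUCTIVE_ACTIONS` set and the `is_allowed` function are assumptions made for illustration.

```python
PROTECTED_ENVIRONMENTS = ["prod", "production"]

# Assumption: these are the actions the bullets above refer to.
DESTRUCTIVE_ACTIONS = {"stop", "teardown"}


def is_allowed(action: str, environment: str) -> bool:
    """Reject destructive actions against protected environments."""
    if action not in DESTRUCTIVE_ACTIONS:
        return True
    # Normalize so that "Prod" or " production " cannot slip past the check.
    return environment.strip().lower() not in PROTECTED_ENVIRONMENTS
```

Running the same check in both the Slack bot and the orchestrator means a bug or bypass in one layer does not expose production.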

Protected Resources

Resources tagged with TeardownPolicy: never are never modified:

resource "aws_ecs_service" "critical" {
  tags = {
    TeardownPolicy = "never"
  }
}

Confirmation Requirements

| Action | Level | Confirmation Required |
| --- | --- | --- |
| stop | - | No |
| start | - | No |
| teardown | services | No |
| teardown | full | Yes (type environment name) |
| spinup | - | No |
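
The table reduces to a single rule: only a full teardown needs confirmation, and the confirmation is the environment name typed back. A minimal sketch, with hypothetical function names:

```python
from typing import Optional


def needs_confirmation(action: str, level: Optional[str] = None) -> bool:
    """Per the table above, only a full teardown requires confirmation."""
    return action == "teardown" and level == "full"


def is_confirmed(
    action: str,
    environment: str,
    level: Optional[str] = None,
    typed: Optional[str] = None,
) -> bool:
    """True if the action may proceed given what the user typed (if anything)."""
    if not needs_confirmation(action, level):
        return True
    # Assumption: the typed text must match the environment name exactly.
    return typed is not None and typed.strip() == environment
```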

RDS Protection

  • deletion_protection = true is always enabled
  • A final snapshot is created before teardown
  • The snapshot is retained for recovery

Audit Logging

All actions are logged to CloudWatch:

{
  "timestamp": "2024-12-12T10:30:00Z",
  "action": "teardown",
  "environment": "dev",
  "level": "full",
  "triggered_by": "slack",
  "status": "completed",
  "user_id": "U123ABC"
}
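
A record in this shape could be produced as sketched below. The function name `build_audit_record` is hypothetical, and the sketch assumes that printing one JSON object per line from a Lambda is how the entry reaches CloudWatch Logs; the field names are taken from the example above.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_audit_record(
    action: str,
    environment: str,
    level: str,
    triggered_by: str,
    status: str,
    user_id: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one audit entry as a single-line JSON string."""
    # `now` should be timezone-aware; it is injectable to make testing easy.
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "action": action,
        "environment": environment,
        "level": level,
        "triggered_by": triggered_by,
        "status": status,
        "user_id": user_id,
    }
    return json.dumps(record)
```

Keeping each entry on a single line makes the fields straightforward to query in CloudWatch Logs Insights.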

Scheduled Operations

Nightly Stop/Start

The nightly scheduler automatically stops non-production resources to save costs.

Schedule (US Central Time)

| Action | Time (CT) | Time (UTC) | Description |
| --- | --- | --- | --- |
| Stop | 2:00 AM | 7:00 AM | Stop ECS services, RDS instances |
| Start | 5:00 PM | 10:00 PM | Start resources before work hours |
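
The UTC column corresponds to Central Daylight Time (UTC-5). If the schedule is defined with a fixed UTC expression, the local times shift by one hour during Central Standard Time (UTC-6). The sketch below shows the arithmetic with fixed offsets; the helper name `to_utc_hhmm` is illustrative.

```python
from datetime import datetime, timedelta, timezone

CDT = timezone(timedelta(hours=-5), "CDT")  # Central Daylight Time
CST = timezone(timedelta(hours=-6), "CST")  # Central Standard Time


def to_utc_hhmm(hour: int, minute: int, tz: timezone) -> str:
    """Convert a local wall-clock time in a fixed-offset zone to UTC (HH:MM)."""
    local = datetime(2024, 1, 1, hour, minute, tzinfo=tz)  # the date is arbitrary
    return local.astimezone(timezone.utc).strftime("%H:%M")


print(to_utc_hhmm(2, 0, CDT))   # 07:00, matches the Stop row
print(to_utc_hhmm(17, 0, CDT))  # 22:00, matches the Start row
print(to_utc_hhmm(2, 0, CST))   # 08:00, one hour later in winter
```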

Working Window

12AM  2AM   4AM   6AM   8AM   10AM  12PM  2PM   4PM   6PM   8PM   10PM  12AM
│     │     │     │     │     │     │     │     │     │     │     │     │
├─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┤
      ┌────────────────────────────────────────────┐
      │ RESOURCES STOPPED (2 AM - 5 PM CT)         │
      │ 15 hours savings                           │
      └────────────────────────────────────────────┘
      ▲ STOP (2 AM CT)             START (5 PM CT) ▲

Weekend Teardown

For maximum cost savings, full teardown can be triggered for weekends:

# Friday evening - full teardown
/infra teardown dev --level=full

# Monday morning - restore
/infra spinup dev

Cost Savings Estimates

Tier 1: Nightly Stop/Start

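Assuming 12 hours stopped per day (7 PM - 7 AM). Note that the nightly schedule above stops resources for 15 hours (2 AM - 5 PM CT), so savings under that schedule would be about 25% higher: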

| Resource | Hourly Cost | Daily Savings | Monthly Savings |
| --- | --- | --- | --- |
| ECS Fargate (2 tasks) | $0.10 | $1.20 | $36 |
| RDS db.t3.medium | $0.05 | $0.60 | $18 |
| Total | | | ~$54/month |

Tier 2: Weekend Full Teardown

Assuming 60 hours stopped per weekend:

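| Resource | Hourly Cost | Weekend Savings | Monthly Savings |
| --- | --- | --- | --- |
| ECS Fargate | $0.10 | $6.00 | $24 |
| RDS | $0.05 | $3.00 | $12 |
| NAT Gateway | $0.045 | $2.70 | $11 |
| ALB | $0.025 | $1.50 | $6 |
| Total | | | ~$53/month |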

Combined Savings

| Strategy | Monthly Savings |
| --- | --- |
| Nightly Tier 1 only | ~$54 |
| Weekend Tier 2 only | ~$53 |
| Both combined | ~$100+ |
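
The estimates follow from hourly cost × hours stopped × periods per month. The sketch below reproduces the arithmetic, assuming 30 nights and 4 weekends per month, which is what the tables imply; the helper name `monthly_savings` is illustrative.

```python
def monthly_savings(hourly_cost: float, hours_stopped: float, periods_per_month: int) -> float:
    """Savings in dollars per month for one resource, rounded to cents."""
    return round(hourly_cost * hours_stopped * periods_per_month, 2)


# Tier 1: 12 hours per night, 30 nights (ECS Fargate + RDS).
nightly = monthly_savings(0.10, 12, 30) + monthly_savings(0.05, 12, 30)

# Tier 2: 60 hours per weekend, 4 weekends (ECS, RDS, NAT Gateway, ALB).
weekend = sum(monthly_savings(cost, 60, 4) for cost in (0.10, 0.05, 0.045, 0.025))

print(f"Nightly: ${nightly:.2f}")  # Nightly: $54.00
print(f"Weekend: ${weekend:.2f}")  # Weekend: $52.80
```

The combined figure is somewhat less than the simple sum of the two, because the weekend window overlaps with nights that the nightly schedule already covers.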

Troubleshooting

Bot Not Responding

  1. Check ECS service is running:
    aws ecs describe-services --cluster docustack-dev --services slack-bot
  2. Verify Socket Mode is enabled in Slack App settings
  3. Check CloudWatch logs:
    aws logs tail /ecs/slack-bot --follow

Teardown Fails

  1. Check Terrateam API token is valid
  2. Verify GitHub repository access
  3. Check Terraform state is not locked

Resources Not Stopping

  1. Verify the resource is not tagged TeardownPolicy: never or NightlyTeardown=skip
  2. Check IAM permissions for Lambda
  3. Review CloudWatch logs for errors

Logs

| Component | Log Location |
| --- | --- |
| Orchestrator | /aws/lambda/infra-orchestrator |
| Slack Bot | /ecs/slack-bot |
| Nightly Scheduler | /aws/lambda/stop-resources, /aws/lambda/start-resources |

Recovery Procedures

Restore from Services Teardown

/infra start dev

Restore from Full Teardown

/infra spinup dev
# Wait ~10 minutes for Terraform apply

Manual RDS Recovery

If RDS needs manual recovery from snapshot:

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier docustack-dev-db \
  --db-snapshot-identifier docustack-dev-db-final-snapshot

Daily Usage Workflow

# Start work session
/infra start dev

# End work session
/infra stop dev

# Weekend teardown (maximum savings)
/infra teardown dev --level=full

# Monday restore
/infra spinup dev