Infrastructure Teardown System
Automated infrastructure teardown and spinup system with ChatOps control for cost optimization in non-production environments.
Overview
The Infrastructure Teardown System takes a hybrid approach to infrastructure costs, pairing two teardown tiers with manual (ChatOps) and scheduled triggers:
| Tier | Method | Speed | Use Case |
|---|---|---|---|
| Tier 1 | Lambda stop/start | ~30 seconds | Nightly savings, quick stops |
| Tier 2 | Terrateam Cloud | ~5-10 minutes | Weekend teardown, maximum savings |
| ChatOps | Slack commands | Instant | Manual control |
| Scheduled | EventBridge | Automated | Nightly stop/start |
Architecture
┌────────────────────────────────────────────────────────────────────────┐
│ Control Plane │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Slack Bot │ │ EventBridge │ │
│ │ (ECS Fargate) │ │ Scheduler │ │
│ │ │ │ │ │
│ │ /infra status │ │ 7 AM UTC: stop │ │
│ │ /infra stop │ │ 10 PM UTC: start│ │
│ │ /infra start │ │ │ │
│ │ /infra teardown│ │ │ │
│ │ /infra spinup │ │ │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ └──────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Orchestrator │ │
│ │ Lambda │ │
│ │ │ │
│ │ - Validate input │ │
│ │ - Safety checks │ │
│ │ - Route to tier │ │
│ │ - Audit logging │ │
│ │ - Notifications │ │
│ └──────────┬──────────┘ │
│ │ │
└───────────────────────┼─────────────────────────────────────────────────┘
│
┌───────────────┴───────────────┐
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Tier 1 │ │ Tier 2 │
│ Lambda Stop/Start │ │ Terrateam Cloud │
│ │ │ │
│ Fast operations: │ │ Full operations: │
│ - ECS desired=0/N │ │ - terraform apply │
│ - RDS stop/start │ │ - TEARDOWN_LEVEL │
│ │ │ variable │
│ ~30 seconds │ │ ~5-10 minutes │
└─────────────────────┘ └─────────────────────┘
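The "route to tier" step in the diagram can be pictured as a simple lookup from action to tier. The sketch below is illustrative only: `route_action`, `TIER_1_ACTIONS`, and `TIER_2_ACTIONS` are hypothetical names, not the deployed Lambda's code, and the read-only `status` command is assumed to be handled elsewhere.

```python
# Hypothetical sketch of the orchestrator's "route to tier" step.
TIER_1_ACTIONS = {"stop", "start"}        # fast SDK calls: ECS desired count, RDS stop/start
TIER_2_ACTIONS = {"teardown", "spinup"}   # full Terraform runs via Terrateam Cloud


def route_action(action: str) -> str:
    """Return "tier1" or "tier2" for a known action; raise ValueError otherwise."""
    normalized = action.strip().lower()
    if normalized in TIER_1_ACTIONS:
        return "tier1"
    if normalized in TIER_2_ACTIONS:
        return "tier2"
    raise ValueError(f"unknown action: {action!r}")
```

Keeping the mapping in two small sets makes it easy to see at a glance which commands trigger a Terraform run and which only touch running services.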
Teardown Levels
The system supports three teardown levels controlled by the TEARDOWN_LEVEL Terraform variable:
Level: none (Default)
All resources are created and running. Normal operational state.
Level: services
Stops running services while preserving infrastructure:
| Resource Type | Action | Preserved |
|---|---|---|
| ECS Services | Desired count = 0 | Yes |
| RDS Instances | Stopped | Yes (data intact) |
| NAT Gateways | Running | Yes |
| Load Balancers | Running | Yes |
| VPC | Running | Yes |
Use case: Nightly cost savings, quick restore needed
Level: full
Destroys non-critical infrastructure:
| Resource Type | Action | Preserved |
|---|---|---|
| ECS Services | Destroyed | No |
| ECS Cluster | Destroyed | No |
| RDS Instances | Snapshot + Destroy | Snapshot only |
| NAT Gateways | Destroyed | No |
| Load Balancers | Destroyed | No |
| VPC | Preserved | Yes |
| S3 Buckets | Preserved | Yes |
| Secrets Manager | Preserved | Yes |
| IAM Roles | Preserved | Yes |
Use case: Weekend/holiday savings, longer restore time acceptable
Tier 1 vs Tier 2 Operations
Tier 1: Lambda Stop/Start
Fast operations using AWS SDK directly with tag-based discovery:
| Operation | Time | Cost Impact |
|---|---|---|
| Stop ECS | ~10s | Immediate |
| Start ECS | ~30s | Tasks launch |
| Stop RDS | ~5min | Immediate |
| Start RDS | ~5min | Instance starts |
| Stop EC2 | ~30s | Immediate |
| Start EC2 | ~60s | Instance starts |
Discovery Mode: Lambda auto-discovers ECS services, RDS instances, and EC2 instances. Resources tagged with NightlyTeardown=skip are excluded.
Best for:
- Nightly schedules
- Quick manual stops
- Preserving all infrastructure
- Dynamic environments (no manual resource lists needed)
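The exclusion rule for discovery can be sketched as a filter over tagged resources. This is a minimal illustration, assuming each discovered resource has already been normalized into a dict with an EC2-style `Tags` list of `{"Key": ..., "Value": ...}` pairs; the helper names and the `Id` field are hypothetical.

```python
from typing import Iterable

SKIP_TAG_KEY = "NightlyTeardown"
SKIP_TAG_VALUE = "skip"


def tags_to_dict(tags: Iterable[dict]) -> dict:
    """Convert an AWS-style [{"Key": k, "Value": v}, ...] list to a plain dict."""
    return {tag["Key"]: tag["Value"] for tag in tags}


def eligible_resources(resources: Iterable[dict]) -> list:
    """Keep only resources that are not tagged NightlyTeardown=skip."""
    kept = []
    for resource in resources:
        tags = tags_to_dict(resource.get("Tags", []))
        if tags.get(SKIP_TAG_KEY) == SKIP_TAG_VALUE:
            continue  # explicitly opted out of nightly stop/start
        kept.append(resource)
    return kept
```

Untagged resources are kept, which matches the opt-out behavior described above: a resource is stopped unless it carries the skip tag.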
Tier 2: Terrateam Cloud
Full Terraform operations via Terrateam Cloud API:
| Operation | Time | Cost Impact |
|---|---|---|
| Teardown (services) | ~5min | Moderate |
| Teardown (full) | ~10min | Maximum |
| Spinup | ~10min | Full restore |
Best for:
- Weekend teardowns
- Maximum cost savings
- Infrastructure changes
Slack ChatOps Commands
Command Reference
/infra # Show help
/infra status [env] # Show infrastructure status
/infra stop [env] # Tier 1: Stop ECS/RDS
/infra start [env] # Tier 1: Start ECS/RDS
/infra teardown [env] --level=X # Tier 2: Teardown infrastructure
/infra spinup [env] # Tier 2: Restore infrastructure
Examples
# Check dev environment status
/infra status dev
# Stop staging for the night
/infra stop staging
# Start dev in the morning
/infra start dev
# Full teardown for weekend (requires confirmation)
/infra teardown dev --level=full
# Restore after weekend
/infra spinup dev
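For readers curious how the text after `/infra` might be turned into a structured request, here is a minimal parsing sketch. `InfraCommand` and `parse_infra_command` are hypothetical names, the `whitelist` subcommands are out of scope, and the real bot may parse its input differently.

```python
import shlex
from dataclasses import dataclass
from typing import Optional

KNOWN_ACTIONS = {"status", "stop", "start", "teardown", "spinup"}


@dataclass
class InfraCommand:
    action: str                 # "help" when no arguments are given
    environment: Optional[str]  # None when [env] is omitted
    level: Optional[str]        # value of --level=X, if present


def parse_infra_command(text: str) -> InfraCommand:
    """Parse the text that follows "/infra" in a slash command."""
    tokens = shlex.split(text)
    if not tokens:
        return InfraCommand("help", None, None)
    action = tokens[0].lower()
    if action not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    environment = None
    level = None
    for token in tokens[1:]:
        if token.startswith("--level="):
            level = token.split("=", 1)[1]
        elif token.startswith("--"):
            raise ValueError(f"unknown option: {token!r}")
        elif environment is None:
            environment = token
        else:
            raise ValueError(f"unexpected argument: {token!r}")
    return InfraCommand(action, environment, level)
```

For example, `parse_infra_command("teardown dev --level=full")` yields an `InfraCommand` with action `teardown`, environment `dev`, and level `full`.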
Response Examples
Status Response:
Infrastructure Status: DEV
─────────────────────────
ECS Services:
🟢 api - RUNNING (desired: 2, running: 2)
🟢 worker - RUNNING (desired: 1, running: 1)
RDS Instances:
🟢 docustack-dev-db - available
🕐 2024-12-12 10:30:00 UTC
Action Response:
✅ Infrastructure Stop Completed
Environment: `dev`
Level: `services`
Status: `completed`
Details:
- ECS services stopped: 2
- RDS instances stopped: 1
🕐 2024-12-12 22:00:00 UTC
🏷️ Action ID: `a1b2c3d4`
IP Whitelist Commands
The Slack bot also provides IP whitelist management for controlling access to protected resources.
Commands
/infra whitelist add <ip> [--ttl=<duration>] [--description='<text>']
/infra whitelist remove <ip>
/infra whitelist list [env]
/infra whitelist refresh [env]
TTL Options
- Minutes: `1m` to `1440m` (e.g., `5m`, `30m` for testing)
- Days: `1d` to `365d` (e.g., `7d`, `30d` - default)
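The TTL ranges above can be validated with a small parser. This is a sketch under stated assumptions: `parse_ttl` is a hypothetical helper, and the bot's actual validation may differ.

```python
import re
from datetime import timedelta

_TTL_PATTERN = re.compile(r"(\d+)([md])")
_LIMITS = {"m": (1, 1440), "d": (1, 365)}  # ranges taken from the TTL options above


def parse_ttl(value: str) -> timedelta:
    """Convert a TTL such as "5m" or "7d" to a timedelta, enforcing the documented ranges."""
    match = _TTL_PATTERN.fullmatch(value.strip().lower())
    if match is None:
        raise ValueError(f"invalid TTL format: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    low, high = _LIMITS[unit]
    if not low <= amount <= high:
        raise ValueError(f"TTL out of range ({low}{unit} to {high}{unit}): {value!r}")
    return timedelta(minutes=amount) if unit == "m" else timedelta(days=amount)
```

Rejecting out-of-range values up front keeps a typo such as `3650d` from creating a whitelist entry that effectively never expires.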
Examples
# Add IP for quick testing (5 minutes)
/infra whitelist add 1.2.3.4 --ttl=5m --description='Quick test'
# Add IP for a week
/infra whitelist add 1.2.3.4 --ttl=7d --description='Home office'
# List current whitelist
/infra whitelist list dev
# Remove IP
/infra whitelist remove 1.2.3.4
Automatic Features
- Auto-sync: Security groups update immediately when IPs are added/removed
- Auto-expiration: IPs automatically removed when TTL expires
- Notifications: Slack channel notified when IPs expire
- Managed Ports: 80 (HTTP), 443 (HTTPS), 7233 (gRPC for Temporal workers)
Safety Controls
Protected Environments
Production is always protected:
PROTECTED_ENVIRONMENTS = ["prod", "production"]
- No teardown or stop actions allowed
- Blocked at orchestrator level
- Additional block at Slack bot level
Protected Resources
Resources tagged with TeardownPolicy: never are never modified:
resource "aws_ecs_service" "critical" {
tags = {
TeardownPolicy = "never"
}
}
Confirmation Requirements
| Action | Level | Confirmation Required |
|---|---|---|
| stop | - | No |
| start | - | No |
| teardown | services | No |
| teardown | full | Yes (type environment name) |
| spinup | - | No |
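The protected-environment rule and the confirmation table can be combined into a single check. The sketch below is one possible shape, not the orchestrator's actual code: `check_safety` and `DESTRUCTIVE_ACTIONS` are hypothetical names, and only `PROTECTED_ENVIRONMENTS` comes from the text above.

```python
from typing import Optional

PROTECTED_ENVIRONMENTS = ["prod", "production"]
DESTRUCTIVE_ACTIONS = {"stop", "teardown"}  # assumption: the actions blocked for production


def check_safety(
    action: str,
    environment: str,
    level: Optional[str] = None,
    confirmation: Optional[str] = None,
) -> None:
    """Raise PermissionError if the request breaks a safety rule; return None otherwise."""
    if environment.lower() in PROTECTED_ENVIRONMENTS and action in DESTRUCTIVE_ACTIONS:
        raise PermissionError(f"{action} is not allowed in {environment}")
    # Full teardown requires the caller to type the environment name.
    if action == "teardown" and level == "full" and confirmation != environment:
        raise PermissionError("full teardown requires typing the environment name to confirm")
```

Running the same check in both the Slack bot and the orchestrator would give the two layers of blocking described under Protected Environments.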
RDS Protection
- `deletion_protection = true` always enabled
- Final snapshot created before teardown
- Snapshot retained for recovery
Audit Logging
All actions are logged to CloudWatch:
{
"timestamp": "2024-12-12T10:30:00Z",
"action": "teardown",
"environment": "dev",
"level": "full",
"triggered_by": "slack",
"status": "completed",
"user_id": "U123ABC"
}
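A record with this shape could be produced by a small helper such as the one below. This is a sketch, not the orchestrator's implementation: `build_audit_record` is a hypothetical name, and it assumes the JSON line is written to standard output, which Lambda forwards to CloudWatch Logs.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_audit_record(
    action: str,
    environment: str,
    level: Optional[str],
    triggered_by: str,
    status: str,
    user_id: Optional[str],
    now: Optional[datetime] = None,
) -> str:
    """Return one JSON line with the same fields as the example record above."""
    # `now` must be timezone-aware; it defaults to the current UTC time.
    moment = now if now is not None else datetime.now(timezone.utc)
    record = {
        "timestamp": moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "action": action,
        "environment": environment,
        "level": level,
        "triggered_by": triggered_by,
        "status": status,
        "user_id": user_id,
    }
    return json.dumps(record)
```

Emitting one JSON object per line keeps the entries easy to filter by field in CloudWatch Logs Insights.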
Scheduled Operations
Nightly Stop/Start
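The nightly scheduler automatically stops non-production resources to save costs. The UTC times below correspond to Central Daylight Time (UTC-5); if the triggers are defined in UTC, they fire one hour earlier in local time during Central Standard Time (UTC-6).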
Schedule (US Central Time)
| Action | Time (CT) | Time (UTC) | Description |
|---|---|---|---|
| Stop | 2:00 AM | 7:00 AM | Stop ECS services, RDS instances |
| Start | 5:00 PM | 10:00 PM | Start resources before work hours |
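The CT-to-UTC conversion in the table, and the length of the stopped window, can be checked with a few lines of Python. This is a sketch under stated assumptions: the fixed UTC-5 offset models daylight time only, the helper names are hypothetical, and the daily cron expression shown is one plausible form rather than the deployed rule.

```python
from datetime import datetime, timedelta, timezone

CDT = timezone(timedelta(hours=-5))  # assumption: fixed Central Daylight Time offset


def local_to_utc_hour(hour: int, tz: timezone = CDT) -> int:
    """Convert a local hour (0-23) to the matching UTC hour."""
    local = datetime(2024, 7, 1, hour, 0, tzinfo=tz)  # the date itself is arbitrary
    return local.astimezone(timezone.utc).hour


def stopped_hours(stop_hour: int, start_hour: int) -> int:
    """Hours between the stop and the next start, wrapping past midnight."""
    return (start_hour - stop_hour) % 24


def eventbridge_cron(utc_hour: int) -> str:
    """Build a daily EventBridge cron expression for the given UTC hour."""
    return f"cron(0 {utc_hour} * * ? *)"
```

With these helpers, `local_to_utc_hour(2)` and `local_to_utc_hour(17)` give the UTC hours in the table, and `stopped_hours(2, 17)` gives the 15-hour stopped window.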
Working Window
12AM 2AM 4AM 6AM 8AM 10AM 12PM 2PM 4PM 6PM 8PM 10PM 12AM
│ │ │ │ │ │ │ │ │ │ │ │ │
├─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ RESOURCES STOPPED │ │
│ │ (2 AM - 5 PM CT) │ │
│ │ 15 hours savings │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ▲ STOP START ▲ │
│ │ 2 AM CT 5 PM CT│ │
Weekend Teardown
For maximum cost savings, full teardown can be triggered for weekends:
# Friday evening - full teardown
/infra teardown dev --level=full
# Monday morning - restore
/infra spinup dev
Cost Savings Estimates
Tier 1: Nightly Stop/Start
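Assuming 12 hours stopped per day (for example, 7 PM - 7 AM). The nightly schedule above keeps resources stopped for 15 hours, so these figures are conservative: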
| Resource | Hourly Cost | Daily Savings | Monthly Savings |
|---|---|---|---|
| ECS Fargate (2 tasks) | $0.10 | $1.20 | $36 |
| RDS db.t3.medium | $0.05 | $0.60 | $18 |
| Total | | | ~$54/month |
Tier 2: Weekend Full Teardown
Assuming 60 hours stopped per weekend:
| Resource | Hourly Cost | Weekend Savings | Monthly Savings |
|---|---|---|---|
| ECS Fargate | $0.10 | $6.00 | $24 |
| RDS | $0.05 | $3.00 | $12 |
| NAT Gateway | $0.045 | $2.70 | $11 |
| ALB | $0.025 | $1.50 | $6 |
| Total | | | ~$53/month |
Combined Savings
| Strategy | Monthly Savings |
|---|---|
| Nightly Tier 1 only | ~$54 |
| Weekend Tier 2 only | ~$53 |
| Both combined | ~$100+ |
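The figures in these tables follow from hourly cost x hours stopped x periods per month. The sketch below reproduces the arithmetic, assuming 30 nights and 4 weekends per month (the multipliers implied by the tables); the hourly costs are the illustrative values from the tables, not current AWS pricing.

```python
def monthly_savings(hourly_cost: float, hours_stopped: float, periods_per_month: int) -> float:
    """Savings per month for one resource, rounded to cents."""
    return round(hourly_cost * hours_stopped * periods_per_month, 2)


# Hourly costs copied from the tables above.
NIGHTLY = {"ECS Fargate (2 tasks)": 0.10, "RDS db.t3.medium": 0.05}
WEEKEND = {"ECS Fargate": 0.10, "RDS": 0.05, "NAT Gateway": 0.045, "ALB": 0.025}

# 12 hours stopped per night, 30 nights per month.
nightly_total = sum(monthly_savings(cost, 12, 30) for cost in NIGHTLY.values())
# 60 hours stopped per weekend, 4 weekends per month.
weekend_total = sum(monthly_savings(cost, 60, 4) for cost in WEEKEND.values())

print(f"Nightly Tier 1: ~${nightly_total:.2f}/month")
print(f"Weekend Tier 2: ~${weekend_total:.2f}/month")
```

The two totals come to roughly $54 and $53 per month. Adding them slightly overstates the combined saving, because weekend nights are counted in both strategies.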
Troubleshooting
Bot Not Responding
- Check ECS service is running: `aws ecs describe-services --cluster docustack-dev --services slack-bot`
- Verify Socket Mode is enabled in Slack App settings
- Check CloudWatch logs: `aws logs tail /ecs/slack-bot --follow`
Teardown Fails
- Check Terrateam API token is valid
- Verify GitHub repository access
- Check Terraform state is not locked
Resources Not Stopping
- Verify resource tags don't have `TeardownPolicy: never`
- Check IAM permissions for Lambda
- Review CloudWatch logs for errors
Logs
| Component | Log Location |
|---|---|
| Orchestrator | /aws/lambda/infra-orchestrator |
| Slack Bot | /ecs/slack-bot |
| Nightly Scheduler | /aws/lambda/stop-resources, /aws/lambda/start-resources |
Recovery Procedures
Restore from Services Teardown
/infra start dev
Restore from Full Teardown
/infra spinup dev
# Wait ~10 minutes for Terraform apply
Manual RDS Recovery
If RDS needs manual recovery from snapshot:
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier docustack-dev-db \
--db-snapshot-identifier docustack-dev-db-final-snapshot
Daily Usage Workflow
# Start work session
/infra start dev
# End work session
/infra stop dev
# Weekend teardown (maximum savings)
/infra teardown dev --level=full
# Monday restore
/infra spinup dev