Infrastructure Teardown System

Automated infrastructure teardown and spinup system with ChatOps control for cost optimization in non-production environments.

Overview

The Infrastructure Teardown System combines two execution tiers with two ways of triggering them:

| Tier | Method | Speed | Use Case |
| --- | --- | --- | --- |
| Tier 1 | Lambda stop/start | ~30 seconds | Nightly savings, quick stops |
| Tier 2 | Terrateam Cloud | ~10 minutes | Weekend teardown, maximum savings |
| ChatOps | Slack commands | Instant | Manual control |
| Scheduled | EventBridge | Automated | Nightly teardown |

Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                         Control Plane                                   │
│                                                                         │
│   ┌─────────────────┐        ┌─────────────────┐                       │
│   │  Slack Bot      │        │  EventBridge    │                       │
│   │  (ECS Fargate)  │        │  Scheduler      │                       │
│   │                 │        │                 │                       │
│   │  /infra status  │        │  7 AM UTC: stop │                       │
│   │  /infra stop    │        │ 10 PM UTC: start│                       │
│   │  /infra start   │        │                 │                       │
│   │  /infra teardown│        │                 │                       │
│   │  /infra spinup  │        │                 │                       │
│   └────────┬────────┘        └────────┬────────┘                       │
│            │                          │                                 │
│            └──────────┬───────────────┘                                 │
│                       │                                                 │
│                       ▼                                                 │
│            ┌─────────────────────┐                                      │
│            │   Orchestrator      │                                      │
│            │   Lambda            │                                      │
│            │                     │                                      │
│            │  - Validate input   │                                      │
│            │  - Safety checks    │                                      │
│            │  - Route to tier    │                                      │
│            │  - Audit logging    │                                      │
│            │  - Notifications    │                                      │
│            └──────────┬──────────┘                                      │
│                       │                                                 │
└───────────────────────┼─────────────────────────────────────────────────┘
                        │
        ┌───────────────┴───────────────┐
        │                               │
        ▼                               ▼
┌─────────────────────┐       ┌─────────────────────┐
│     Tier 1          │       │     Tier 2          │
│  Lambda Stop/Start  │       │  Terrateam Cloud    │
│                     │       │                     │
│  Fast operations:   │       │  Full operations:   │
│  - ECS desired=0/N  │       │  - terraform apply  │
│  - RDS stop/start   │       │  - TEARDOWN_LEVEL   │
│                     │       │    variable         │
│  ~30 seconds        │       │  ~5-10 minutes      │
└─────────────────────┘       └─────────────────────┘
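
The "Route to tier" step in the diagram can be sketched roughly as follows. This is a minimal illustration rather than the orchestrator's actual code: the function name `route_action`, the returned labels `"tier1"`/`"tier2"`, and the rule that `teardown` needs an explicit level are assumptions; only the action names and teardown levels come from this page.

```python
from typing import Optional

# Action names and levels are taken from this page; everything else
# (function name, return labels, error handling) is hypothetical.
TIER1_ACTIONS = {"stop", "start"}        # Lambda stop/start
TIER2_ACTIONS = {"teardown", "spinup"}   # Terrateam Cloud
TEARDOWN_LEVELS = {"services", "full"}


def route_action(action: str, level: Optional[str] = None) -> str:
    """Return "tier1" or "tier2" for a validated action, else raise ValueError."""
    if action in TIER1_ACTIONS:
        return "tier1"
    if action == "teardown":
        # Assumption: a teardown must name one of the non-default levels.
        if level not in TEARDOWN_LEVELS:
            raise ValueError("teardown requires level 'services' or 'full'")
        return "tier2"
    if action in TIER2_ACTIONS:
        return "tier2"
    raise ValueError(f"unknown action: {action!r}")
```

For example, `route_action("stop")` selects Tier 1, while `route_action("teardown", "full")` selects Tier 2.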

Teardown Levels

The system supports three teardown levels controlled by the TEARDOWN_LEVEL Terraform variable:

Level: `none` (Default)

All resources are created and running. Normal operational state.

Level: `services`

Stops running services while preserving infrastructure:

| Resource Type | Action | Preserved |
| --- | --- | --- |
| ECS Services | Desired count = 0 | Yes |
| RDS Instances | Stopped | Yes (data intact) |
| NAT Gateways | Running | Yes |
| Load Balancers | Running | Yes |
| VPC | Running | Yes |

Use case: Nightly cost savings, quick restore needed

Level: `full`

Destroys non-critical infrastructure:

| Resource Type | Action | Preserved |
| --- | --- | --- |
| ECS Services | Destroyed | No |
| ECS Cluster | Destroyed | No |
| RDS Instances | Snapshot + Destroy | Snapshot only |
| NAT Gateways | Destroyed | No |
| Load Balancers | Destroyed | No |
| VPC | Preserved | Yes |
| S3 Buckets | Preserved | Yes |
| Secrets Manager | Preserved | Yes |
| IAM Roles | Preserved | Yes |

Use case: Weekend/holiday savings, longer restore time acceptable
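
To compare levels programmatically, the two tables above can be transcribed into a lookup. The sketch below is only an illustration: the names `LEVEL_ACTIONS` and `destroyed_resources` are hypothetical and do not describe how the Terraform code consumes TEARDOWN_LEVEL.

```python
# Resource handling per teardown level, transcribed from the tables above.
# The data structure and helper are hypothetical illustrations.
LEVEL_ACTIONS = {
    "none": {},
    "services": {
        "ECS Services": "Desired count = 0",
        "RDS Instances": "Stopped",
        "NAT Gateways": "Running",
        "Load Balancers": "Running",
        "VPC": "Running",
    },
    "full": {
        "ECS Services": "Destroyed",
        "ECS Cluster": "Destroyed",
        "RDS Instances": "Snapshot + Destroy",
        "NAT Gateways": "Destroyed",
        "Load Balancers": "Destroyed",
        "VPC": "Preserved",
        "S3 Buckets": "Preserved",
        "Secrets Manager": "Preserved",
        "IAM Roles": "Preserved",
    },
}


def destroyed_resources(level: str) -> list:
    """Return the sorted resource types that the given level destroys."""
    if level not in LEVEL_ACTIONS:
        raise ValueError(f"unknown teardown level: {level!r}")
    actions = LEVEL_ACTIONS[level]
    return sorted(name for name, action in actions.items() if "Destroy" in action)
```

Calling `destroyed_resources("services")` returns an empty list, which matches the table: that level stops services but destroys nothing.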

Tier 1 vs Tier 2 Operations

Tier 1: Lambda Stop/Start

Fast operations using AWS SDK directly with tag-based discovery:

| Operation | Time | Cost Impact |
| --- | --- | --- |
| Stop ECS | ~10s | Immediate |
| Start ECS | ~30s | Tasks launch |
| Stop RDS | ~5min | Immediate |
| Start RDS | ~5min | Instance starts |
| Stop EC2 | ~30s | Immediate |
| Start EC2 | ~60s | Instance starts |

Discovery Mode: Lambda auto-discovers ECS services, RDS instances, and EC2 instances. Resources tagged with NightlyTeardown=skip are excluded.

Best for:

Nightly schedules
Quick manual stops
Preserving all infrastructure
Dynamic environments (no manual resource lists needed)
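
The exclusion step of tag-based discovery might look like the sketch below. It is a hedged illustration, not the Lambda's real code: the resource shape (a dict with a `Tags` list) and the function names are assumptions. The two exclusion tags are the ones named on this page (NightlyTeardown=skip and TeardownPolicy: never).

```python
# Tags that exclude a resource from stop/start, as named on this page.
EXCLUDE_TAGS = {"NightlyTeardown": "skip", "TeardownPolicy": "never"}


def tags_to_dict(tags) -> dict:
    """Flatten an AWS-style tag list into a plain dict.

    EC2 and RDS return tags as {"Key": ..., "Value": ...}, while ECS uses
    lowercase {"key": ..., "value": ...}, so both spellings are accepted.
    """
    result = {}
    for tag in tags or []:
        key = tag.get("Key", tag.get("key"))
        value = tag.get("Value", tag.get("value"))
        result[key] = value
    return result


def filter_discovered(resources: list) -> list:
    """Keep only resources that carry none of the exclusion tags."""
    kept = []
    for resource in resources:
        tags = tags_to_dict(resource.get("Tags"))  # "Tags" key is hypothetical
        excluded = any(tags.get(k) == v for k, v in EXCLUDE_TAGS.items())
        if not excluded:
            kept.append(resource)
    return kept
```

Untagged resources pass through the filter, which is consistent with the "no manual resource lists needed" behaviour described above.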

Tier 2: Terrateam Cloud

Full Terraform operations via Terrateam Cloud API:

| Operation | Time | Cost Impact |
| --- | --- | --- |
| Teardown (services) | ~5min | Moderate |
| Teardown (full) | ~10min | Maximum |
| Spinup | ~10min | Full restore |

Best for:

Weekend teardowns
Maximum cost savings
Infrastructure changes

Slack ChatOps Commands

Command Reference

/infra                              # Show help
/infra status [env]                 # Show infrastructure status
/infra stop [env]                   # Tier 1: Stop ECS/RDS
/infra start [env]                  # Tier 1: Start ECS/RDS
/infra teardown [env] --level=X     # Tier 2: Teardown infrastructure (X = services or full)
/infra spinup [env]                 # Tier 2: Restore infrastructure

Examples

# Check dev environment status
/infra status dev

# Stop staging for the night
/infra stop staging

# Start dev in the morning
/infra start dev

# Full teardown for weekend (requires confirmation)
/infra teardown dev --level=full

# Restore after weekend
/infra spinup dev
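
A parser for the command text could be sketched as below. Slack delivers the text after the slash command as a single string, so the function takes that string. The names `InfraCommand` and `parse_infra_command`, and the choice to reject unknown options, are assumptions made for illustration; the whitelist subcommands are not handled here.

```python
import shlex
from typing import NamedTuple, Optional


class InfraCommand(NamedTuple):
    action: str                  # "help", "status", "stop", ...
    environment: Optional[str]   # e.g. "dev", or None if omitted
    level: Optional[str]         # value of --level=..., if given


def parse_infra_command(text: str) -> InfraCommand:
    """Parse the text that follows "/infra", e.g. "teardown dev --level=full"."""
    tokens = shlex.split(text)
    if not tokens:
        # A bare "/infra" shows help.
        return InfraCommand("help", None, None)
    action, rest = tokens[0], tokens[1:]
    level = None
    positional = []
    for token in rest:
        if token.startswith("--level="):
            level = token.split("=", 1)[1]
        elif token.startswith("--"):
            raise ValueError(f"unknown option: {token}")
        else:
            positional.append(token)
    environment = positional[0] if positional else None
    return InfraCommand(action, environment, level)
```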

Response Examples

Status Response:

Infrastructure Status: DEV
─────────────────────────
ECS Services:
  🟢 api - RUNNING (desired: 2, running: 2)
  🟢 worker - RUNNING (desired: 1, running: 1)

RDS Instances:
  🟢 docustack-dev-db - available

🕐 2024-12-12 10:30:00 UTC

Action Response:

✅ Infrastructure Stop Completed

Environment: `dev`
Level: `services`
Status: `completed`

Details:
- ECS services stopped: 2
- RDS instances stopped: 1

🕐 2024-12-12 22:00:00 UTC
🏷️ Action ID: `a1b2c3d4`

IP Whitelist Commands

The Slack bot also provides IP whitelist management for controlling access to protected resources.

Commands

/infra whitelist add <ip> [--ttl=<duration>] [--description='<text>']
/infra whitelist remove <ip>
/infra whitelist list [env]
/infra whitelist refresh [env]

TTL Options

Minutes: 1m to 1440m (e.g., 5m or 30m for testing)
Days: 1d to 365d (e.g., 7d; the default is 30d)
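
The TTL rules above can be expressed as a small validator. This is a sketch under assumptions: the function name `parse_ttl` is hypothetical, and the 30d default is read from the list above rather than confirmed against the bot's code.

```python
import re
from datetime import timedelta

DEFAULT_TTL = "30d"  # assumed default, per the TTL options listed above


def parse_ttl(value: str = DEFAULT_TTL) -> timedelta:
    """Convert a TTL such as "5m" or "7d" into a timedelta, enforcing the ranges."""
    match = re.fullmatch(r"(\d+)([md])", value.strip())
    if not match:
        raise ValueError(f"invalid TTL format: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "m":
        if not 1 <= amount <= 1440:
            raise ValueError("minutes must be between 1m and 1440m")
        return timedelta(minutes=amount)
    if not 1 <= amount <= 365:
        raise ValueError("days must be between 1d and 365d")
    return timedelta(days=amount)
```

Adding the result to the current UTC time gives the expiry timestamp that an expiration job could compare against.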

Examples

# Add IP for quick testing (5 minutes)
/infra whitelist add 1.2.3.4 --ttl=5m --description='Quick test'

# Add IP for a week
/infra whitelist add 1.2.3.4 --ttl=7d --description='Home office'

# List current whitelist
/infra whitelist list dev

# Remove IP
/infra whitelist remove 1.2.3.4

Automatic Features

Auto-sync: Security groups update immediately when IPs are added/removed
Auto-expiration: IPs automatically removed when TTL expires
Notifications: Slack channel notified when IPs expire
Managed Ports: 80 (HTTP), 443 (HTTPS), 7233 (gRPC for Temporal workers)
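
As an illustration of the auto-sync step, the sketch below builds one ingress rule per managed port for a whitelisted IP. The output follows the `IpPermissions` structure accepted by the EC2 `authorize_security_group_ingress` and `revoke_security_group_ingress` calls. Whether the bot builds its rules this way is an assumption, and the function name is hypothetical.

```python
import ipaddress

MANAGED_PORTS = (80, 443, 7233)  # HTTP, HTTPS, gRPC for Temporal workers


def build_ip_permissions(ip: str, description: str = "") -> list:
    """Build one TCP ingress rule per managed port for a single IPv4 address."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    ip_range = {"CidrIp": f"{address}/32"}
    if description:
        ip_range["Description"] = description
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [dict(ip_range)],
        }
        for port in MANAGED_PORTS
    ]
```

The same list can be passed to the revoke call when the IP is removed or its TTL expires, so adding and removing stay symmetric.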

Safety Controls

Protected Environments

Production is always protected:

PROTECTED_ENVIRONMENTS = ["prod", "production"]

No teardown or stop actions allowed
Blocked at orchestrator level
Additional block at Slack bot level
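
A guard of this kind could be shared by the Slack bot and the orchestrator so that both layers apply the same rule. The sketch below is an assumption about how such a check might be written: `ensure_allowed` and `ProtectedEnvironmentError` are hypothetical names, and the case-insensitive comparison is an added precaution rather than documented behaviour.

```python
PROTECTED_ENVIRONMENTS = ["prod", "production"]
BLOCKED_ACTIONS = {"stop", "teardown"}  # the actions this page says are not allowed


class ProtectedEnvironmentError(Exception):
    """Raised when a blocked action targets a protected environment."""


def ensure_allowed(action: str, environment: str) -> None:
    """Raise if the action must not run against the given environment."""
    env = environment.strip().lower()  # assumption: names compare case-insensitively
    if env in PROTECTED_ENVIRONMENTS and action in BLOCKED_ACTIONS:
        raise ProtectedEnvironmentError(
            f"'{action}' is not allowed in protected environment '{env}'"
        )
```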

Protected Resources

Resources tagged with TeardownPolicy: never are never modified:

resource "aws_ecs_service" "critical" {
  tags = {
    TeardownPolicy = "never"
  }
}

Confirmation Requirements

| Action | Level | Confirmation Required |
| --- | --- | --- |
| stop | - | No |
| start | - | No |
| teardown | services | No |
| teardown | full | Yes (type environment name) |
| spinup | - | No |
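
The confirmation table reduces to a single rule, sketched below. The function names and the exact-match comparison of the typed environment name are assumptions made for illustration.

```python
from typing import Optional


def requires_confirmation(action: str, level: Optional[str] = None) -> bool:
    """Only a full teardown needs confirmation, per the table above."""
    return action == "teardown" and level == "full"


def is_confirmed(action: str, environment: str,
                 level: Optional[str] = None,
                 typed: Optional[str] = None) -> bool:
    """Return True if the action may proceed.

    For a full teardown the user must type the environment name; the
    exact-match rule (after trimming whitespace) is an assumption.
    """
    if not requires_confirmation(action, level):
        return True
    return typed is not None and typed.strip() == environment
```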

RDS Protection

`deletion_protection = true` is always enabled
Final snapshot created before teardown
Snapshot retained for recovery

Audit Logging

All actions are logged to CloudWatch:

{
  "timestamp": "2024-12-12T10:30:00Z",
  "action": "teardown",
  "environment": "dev",
  "level": "full",
  "triggered_by": "slack",
  "status": "completed",
  "user_id": "U123ABC"
}
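
A record with this shape could be produced as in the sketch below. The field names come from the example above; the function name `build_audit_record` is hypothetical. In a Lambda function, printing the returned line to stdout is enough for it to appear in CloudWatch Logs.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_audit_record(action: str, environment: str, level: str,
                       triggered_by: str, status: str, user_id: str,
                       now: Optional[datetime] = None) -> str:
    """Serialize one audit entry as a single JSON line.

    `now` should be a timezone-aware datetime; it defaults to the current UTC time.
    """
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        "timestamp": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "action": action,
        "environment": environment,
        "level": level,
        "triggered_by": triggered_by,
        "status": status,
        "user_id": user_id,
    }
    return json.dumps(record)
```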

Scheduled Operations

Nightly Stop/Start

The nightly scheduler automatically stops non-production resources to save costs.

Schedule (US Central Time)

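| Action | Time (CT) | Time (UTC) | Description |
| --- | --- | --- | --- |
| Stop | 2:00 AM | 7:00 AM | Stop ECS services, RDS instances |
| Start | 5:00 PM | 10:00 PM | Start resources before work hours |

The CT column assumes daylight time (UTC-5). If the triggers are fixed in UTC, they fire one hour earlier in CT during standard time (UTC-6).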
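
The UTC times above translate directly into EventBridge cron expressions, which use six fields (minutes, hours, day-of-month, month, day-of-week, year). The helper names in this sketch are hypothetical, and the actual rule definitions are not shown on this page.

```python
STOP_HOUR_UTC = 7    # 7:00 AM UTC, from the schedule table
START_HOUR_UTC = 22  # 10:00 PM UTC, from the schedule table


def eventbridge_cron(hour_utc: int, minute: int = 0) -> str:
    """Return a daily EventBridge cron expression for the given UTC time."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # "?" in day-of-week is required when day-of-month is "*".
    return f"cron({minute} {hour_utc} * * ? *)"


def stopped_hours(stop_hour: int, start_hour: int) -> int:
    """Hours per day that resources stay stopped, wrapping past midnight."""
    return (start_hour - stop_hour) % 24
```

With the values from the table, `stopped_hours(STOP_HOUR_UTC, START_HOUR_UTC)` gives 15, which matches a stop at 2 AM CT and a start at 5 PM CT.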

Working Window

12AM  2AM   4AM   6AM   8AM   10AM  12PM  2PM   4PM   6PM   8PM   10PM  12AM
 │     │     │     │     │     │     │     │     │     │     │     │     │
 ├─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┤
 │                                                                       │
 │  ┌─────────────────────────────────────────────────────────────────┐ │
 │  │                    RESOURCES STOPPED                             │ │
 │  │                    (2 AM - 5 PM CT)                              │ │
 │  │                    15 hours savings                              │ │
 │  └─────────────────────────────────────────────────────────────────┘ │
 │                                                                       │
 │         ▲ STOP                                          START ▲      │
 │         │ 2 AM CT                                       5 PM CT│      │

Weekend Teardown

For maximum cost savings, full teardown can be triggered for weekends:

# Friday evening - full teardown
/infra teardown dev --level=full

# Monday morning - restore
/infra spinup dev

Cost Savings Estimates

Tier 1: Nightly Stop/Start

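Assuming 12 hours stopped per day (for example, 7 PM - 7 AM) over a 30-day month. The nightly schedule above stops resources for 15 hours a day, so actual savings would be about 25% higher: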

| Resource | Hourly Cost | Daily Savings | Monthly Savings |
| --- | --- | --- | --- |
| ECS Fargate (2 tasks) | $0.10 | $1.20 | $36 |
| RDS db.t3.medium | $0.05 | $0.60 | $18 |
| Total | | | ~$54/month |

Tier 2: Weekend Full Teardown

Assuming 60 hours stopped per weekend and four weekends per month:

| Resource | Hourly Cost | Weekend Savings | Monthly Savings |
| --- | --- | --- | --- |
| ECS Fargate | $0.10 | $6.00 | $24 |
| RDS | $0.05 | $3.00 | $12 |
| NAT Gateway | $0.045 | $2.70 | $11 |
| ALB | $0.025 | $1.50 | $6 |
| Total | | | ~$53/month |

Combined Savings

| Strategy | Monthly Savings |
| --- | --- |
| Nightly Tier 1 only | ~$54 |
| Weekend Tier 2 only | ~$53 |
| Both combined | ~$100+ |
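
The estimates in this section follow from hourly cost × hours stopped × periods per month. The sketch below reproduces the arithmetic so it can be rerun with other rates. The function name is hypothetical, the hourly costs are the illustrative figures from the tables, and the 30 nights and 4 weekends per month are the multipliers implied by the monthly columns.

```python
def monthly_savings(hourly_costs: dict, hours_per_period: float,
                    periods_per_month: int) -> float:
    """Sum the hourly costs and scale by hours stopped and periods per month."""
    total_hourly = sum(hourly_costs.values())
    return round(total_hourly * hours_per_period * periods_per_month, 2)


# Tier 1: 12 hours stopped per night, 30 nights per month.
nightly = monthly_savings(
    {"ECS Fargate (2 tasks)": 0.10, "RDS db.t3.medium": 0.05}, 12, 30
)

# Tier 2: 60 hours stopped per weekend, 4 weekends per month.
weekend = monthly_savings(
    {"ECS Fargate": 0.10, "RDS": 0.05, "NAT Gateway": 0.045, "ALB": 0.025}, 60, 4
)
```

Here `nightly` comes to 54.0 and `weekend` to 52.8. The weekend table shows ~$53 because each row is rounded before summing.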

Troubleshooting

Bot Not Responding

Check that the ECS service is running:

aws ecs describe-services --cluster docustack-dev --services slack-bot

Verify Socket Mode is enabled in Slack App settings
Check CloudWatch logs:
```
aws logs tail /ecs/slack-bot --follow
```

Teardown Fails

Check Terrateam API token is valid
Verify GitHub repository access
Check Terraform state is not locked

Resources Not Stopping

Verify the resource is not tagged TeardownPolicy: never or NightlyTeardown=skip
Check IAM permissions for Lambda
Review CloudWatch logs for errors

Logs

| Component | Log Location |
| --- | --- |
| Orchestrator | `/aws/lambda/infra-orchestrator` |
| Slack Bot | `/ecs/slack-bot` |
| Nightly Scheduler | `/aws/lambda/stop-resources`, `/aws/lambda/start-resources` |

Recovery Procedures

Restore from Services Teardown

/infra start dev

Restore from Full Teardown

/infra spinup dev
# Wait ~10 minutes for Terraform apply

Manual RDS Recovery

If RDS needs manual recovery from snapshot:

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier docustack-dev-db \
  --db-snapshot-identifier docustack-dev-db-final-snapshot

Daily Usage Workflow

# Start work session
/infra start dev

# End work session
/infra stop dev

# Weekend teardown (maximum savings)
/infra teardown dev --level=full

# Monday restore
/infra spinup dev
