# Deploying Airflow to Amazon ECS

Bryson Meiling

## Executive Summary

Total Commits Analyzed: 54

Problem-Solving Commits: 47 (87%)

Documentation/Cleanup Commits: 7 (13%)

### Problem Distribution by Category

| Category | Count | Percentage | Top Issues |
|----------|-------|------------|------------|
| Airflow-Specific Pitfalls | 23 | 43% | Configuration, permissions, secrets, service communication |
| AWS/ECS Pitfalls | 19 | 35% | Secret injection, health checks, resource allocation, spot instances |
| Networking Pitfalls | 12 | 22% | IPv6, service discovery, ALB configuration, security groups |
| CDK Pitfalls | 8 | 15% | Context management, synthesizer, stack dependencies |
| Docker/Container Pitfalls | 15 | 28% | Build performance, user permissions, multi-stage builds |

Note: Commits can appear in multiple categories, which is why the percentages sum to more than 100%.

---

## 1. Airflow-Specific Pitfalls (23 commits)

### Key Learnings

- Airflow has complex inter-service dependencies that must be carefully orchestrated

- Secret management and encoding are critical for multi-service deployments

- DAG deployment strategy significantly impacts architecture complexity

- Airflow 3.x has different requirements than 2.x

### Issues Encountered

#### 1.1 Configuration & Environment Variables (8 commits)

Problem: Airflow requires precise configuration across multiple services with shared secrets.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #39 | Removed static airflow.cfg, relied on env vars | Use environment variables exclusively instead of config files |
| #40 | ECS injecting “[object Object]” for secrets | Fixed secret reference syntax in CDK |
| #41 | Secrets still passed incorrectly | Simplified to direct secret ARN references |
| #28 | Changed service startup based on local dev | Validated config locally first before ECS deployment |
| #4 | Major refactoring of environment variables | Centralized env var management in CDK constructs |
| #54 | Removed unneeded default env vars | Clean up reduces complexity and confusion |
| #31 | Database connection failing | Proper connection string format with all required params |
| #30 | Special characters in DB password broke startup | URL-encode passwords with special characters |

Resolution Strategy:

- ✅ Use environment variables exclusively, avoid static config files

- ✅ Validate configuration locally with Docker Compose before ECS

- ✅ URL-encode all connection strings and secrets

- ✅ Use CDK constructs to centralize environment variable management

- ✅ Reference secrets by ARN directly, not through complex wrapper objects
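
As a concrete illustration, here is a minimal TypeScript CDK sketch of the secret-injection fix; the construct and variable names are illustrative, not the repo's actual identifiers:

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';

declare const taskDefinition: ecs.FargateTaskDefinition;
declare const fernetSecret: secretsmanager.ISecret;
declare const dbSecret: secretsmanager.ISecret;

taskDefinition.addContainer('webserver', {
  image: ecs.ContainerImage.fromRegistry('apache/airflow:3.1.0'),
  // Non-sensitive configuration: plain environment variables, no airflow.cfg.
  environment: {
    AIRFLOW__CORE__EXECUTOR: 'CeleryExecutor',
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false',
  },
  // Sensitive values: ECS resolves these from Secrets Manager at task start,
  // so the container sees the real string instead of "[object Object]".
  secrets: {
    AIRFLOW__CORE__FERNET_KEY: ecs.Secret.fromSecretsManager(fernetSecret),
    // A single JSON field inside a structured secret:
    POSTGRES_PASSWORD: ecs.Secret.fromSecretsManager(dbSecret, 'password'),
  },
});
```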

#### 1.2 Database & Storage (6 commits)

Problem: Database connectivity and encoding issues caused repeated failures.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #30 | Encoded chars in password caused ECS failure | URL-encode passwords, avoid special chars in generation |
| #31 | DB connection failing | Security groups, connection strings, timeouts |
| #36 | Health check failed due to DB connection | Increase grace period, validate connection in startup |
| #23 | Bad IPv6 URL when connecting to Postgres | Use proper endpoint format without IPv6 |
| #22 | Switched back to PostgreSQL from alternative | Stick with well-supported databases |
| #29 | Moved DB migration to API server | Ensures DB ready before other services start |

Resolution Strategy:

- ✅ Generate passwords without special characters OR properly URL-encode them

- ✅ Run database migrations in the webserver/API startup, not scheduler

- ✅ Use RDS endpoint hostname format, avoid IPv6 references

- ✅ Increase health check grace periods for DB-dependent services

- ✅ Validate DB connectivity before starting service processes
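
The encoding rule is easy to sketch. This hypothetical helper (name invented for illustration) percent-encodes the password before embedding it in the SQLAlchemy URL:

```typescript
// Hypothetical helper: builds a safe AIRFLOW__DATABASE__SQL_ALCHEMY_CONN value.
function airflowDbUrl(user: string, password: string, host: string, db: string): string {
  // encodeURIComponent escapes '@', ':', '/', '%', and friends, which
  // otherwise corrupt the URL and break Airflow at startup.
  return `postgresql+psycopg2://${user}:${encodeURIComponent(password)}@${host}:5432/${db}`;
}

// 'p@ss:w/rd' is embedded as 'p%40ss%3Aw%2Frd'.
console.log(airflowDbUrl('airflow', 'p@ss:w/rd', 'mydb.xxxxxx.us-east-1.rds.amazonaws.com', 'airflow'));
```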

#### 1.3 Service Communication & Architecture (5 commits)

Problem: Airflow services need to communicate with each other reliably.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #46 | Worker can’t reach base URL | Added service discovery and internal ALB |
| #47 | Added common JWT secret | Shared JWT for inter-service authentication |
| #27 | Can’t get auth to work | Proper authentication backend configuration |
| #32 | Successfully deployed API and scheduler | Validated basic service communication |
| #24 | Split into storage/service stacks | Separate stateful from stateless resources |

Resolution Strategy:

- ✅ Use AWS Cloud Map for service discovery between components

- ✅ Configure internal ALB DNS for workers to reach webserver

- ✅ Share authentication secrets (JWT, Fernet) across all services

- ✅ Separate stateful (DB, cache) from stateless (compute) in stack design

- ✅ Deploy webserver/API first, then scheduler, then workers
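
A minimal sketch of the Cloud Map wiring, with an assumed namespace (airflow.local) and service name:

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import { Stack } from 'aws-cdk-lib';

declare const stack: Stack;
declare const vpc: ec2.IVpc;
declare const webserverTaskDef: ecs.FargateTaskDefinition;

// A private DNS namespace shared by all Airflow services in the cluster.
const cluster = new ecs.Cluster(stack, 'AirflowCluster', {
  vpc,
  defaultCloudMapNamespace: { name: 'airflow.local' },
});

new ecs.FargateService(stack, 'Webserver', {
  cluster,
  taskDefinition: webserverTaskDef,
  // Registers webserver.airflow.local, which workers can use as the base URL.
  cloudMapOptions: { name: 'webserver' },
});
```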

#### 1.4 DAG Deployment & Storage (7 commits)

Problem: Finding the right way to deploy DAGs to all Airflow services.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #15 | Tried EFS after ‘aws’ CLI not found in image | EFS for shared storage |
| #17 | EFS mount point issues | Fixed mount configuration |
| #18 | EFS DNS modification issues | Changed mount attachment method |
| #19 | Switched from EFS to git-sync | Git-sync sidecar pattern |
| #20 | Added git-sync with SSL certs | Proper SSL for private repos |
| #21 | Git-sync needs root permissions | Run as root to write to filesystem |
| #43 | Read-only DAG folder caused startup error | Make DAG folder writable |
| #49 | Simplified git-sync configuration | Direct git sync approach |
| #53 | Fixed git-sync repo configuration | Correct repo URL and credentials |

Resolution Strategy:

- ✅ Use git-sync sidecar pattern instead of EFS (simpler, more reliable)

- ✅ Run git-sync as root with proper permissions

- ✅ Ensure shared volume is writable by Airflow processes

- ✅ Mirror git-sync image to ECR with SSL certificates for private repos

- ✅ Configure git-sync to sync to the correct path (/opt/airflow/dags), as in the sketch below
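
A sketch of the git-sync sidecar pattern; the ECR image path, repo URL, and GITSYNC_* variable names (git-sync v4 style) are assumptions, not the project's actual values:

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';

declare const taskDefinition: ecs.FargateTaskDefinition;
declare const airflowContainer: ecs.ContainerDefinition;

// Task-scoped volume shared by the Airflow container and the sidecar.
taskDefinition.addVolume({ name: 'dags' });

const gitSync = taskDefinition.addContainer('git-sync', {
  // Mirrored into ECR rather than pulled from the public registry.
  image: ecs.ContainerImage.fromRegistry('<account>.dkr.ecr.<region>.amazonaws.com/git-sync:v4'),
  user: '0', // root, so the sidecar can write to the shared volume
  essential: false,
  environment: {
    // GITSYNC_* names follow git-sync v4; older v3 images use GIT_SYNC_*.
    GITSYNC_REPO: 'https://example.com/team/airflow-dags.git',
    GITSYNC_ROOT: '/opt/airflow/dags',
    GITSYNC_PERIOD: '60s',
  },
});
gitSync.addMountPoints({ sourceVolume: 'dags', containerPath: '/opt/airflow/dags', readOnly: false });
airflowContainer.addMountPoints({ sourceVolume: 'dags', containerPath: '/opt/airflow/dags', readOnly: false });
```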

#### 1.5 Component Health & Scaling (6 commits)

Problem: Getting all Airflow components (webserver, scheduler, worker, triggerer) healthy.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #33 | Triggerer failing health checks | Temporarily disabled to isolate issues |
| #34 | Deployed without health checks | Removed to isolate startup from health check issues |
| #35 | All but worker running | Webserver, scheduler, triggerer operational |
| #44 | Re-enabled triggerer | Fixed config and re-added successfully |
| #45 | Worker killed due to high memory | Doubled memory to 2GB |
| #51 | Added worker auto-scaling | Scale 1-10 based on CPU/memory |

Resolution Strategy:

- ✅ Disable health checks initially to isolate startup issues

- ✅ Deploy services incrementally: webserver → scheduler → triggerer → worker

- ✅ Monitor memory usage and increase allocations (workers need 2GB+)

- ✅ Implement auto-scaling for workers (1-10 instances)

- ✅ Re-enable health checks after confirming services start successfully
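
The 1-10 worker scaling maps to a short CDK sketch; the cooldown values here are assumptions:

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import { Duration } from 'aws-cdk-lib';

declare const workerService: ecs.FargateService;

const scaling = workerService.autoScaleTaskCount({ minCapacity: 1, maxCapacity: 10 });
scaling.scaleOnCpuUtilization('WorkerCpuScaling', {
  targetUtilizationPercent: 70,
  scaleOutCooldown: Duration.minutes(1), // react quickly to queue pressure
  scaleInCooldown: Duration.minutes(5),  // scale in slowly so tasks can drain
});
scaling.scaleOnMemoryUtilization('WorkerMemoryScaling', {
  targetUtilizationPercent: 80,
});
```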

#### 1.6 Permissions & User Management (3 commits)

Problem: Container user permissions for filesystem access.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #37 | Trying to fix airflow user with root permissions | User/group permission conflicts |
| #21 | Git-sync needs root permissions to write | Run git-sync as root |
| #38 | Simplified user management in Docker | Use base image defaults |

Resolution Strategy:

- ✅ Run git-sync sidecar as root (uid 0) for file system writes

- ✅ Run Airflow processes as airflow user (default in base image)

- ✅ Use shared volumes with appropriate permissions

- ✅ Don’t override user settings unless necessary

---

## 2. AWS/ECS Pitfalls (19 commits)

### Key Learnings

- ECS has specific requirements for secret injection and environment variables

- Health checks need careful tuning for startup times

- Resource allocation (CPU/memory) requires iteration based on actual usage

- Spot instances can significantly reduce costs

### Issues Encountered

#### 2.1 Secret Management (5 commits)

Problem: ECS secret injection has specific syntax requirements.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #40 | ECS injecting “[object Object]” literally | Fixed CDK secret reference syntax |
| #41 | Secrets still passed incorrectly | Used direct secret ARN references |
| #47 | Added common JWT secret | Centralized secret in Secrets Manager |
| #52 | Fixed Fernet key generation | Proper secret generation in storage stack |
| #30 | Password encoding issues | URL-encode or avoid special characters |

Resolution Strategy:

- ✅ Use ecs.Secret.fromSecretsManager(secret) or direct ARN references

- ✅ Avoid complex object wrapping in secret definitions

- ✅ Store all secrets in Secrets Manager, not environment variables

- ✅ Generate secrets in storage stack, reference in service stack

- ✅ Use Secrets Manager rotation for production environments
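
A sketch of generating shared secrets in the storage stack; excludePunctuation sidesteps the URL-encoding pitfalls from section 1.2 (names and lengths are assumptions):

```typescript
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import { Stack } from 'aws-cdk-lib';

declare const storageStack: Stack;

// Database credentials: no punctuation means nothing to URL-encode later.
const dbSecret = new secretsmanager.Secret(storageStack, 'DbSecret', {
  generateSecretString: {
    secretStringTemplate: JSON.stringify({ username: 'airflow' }),
    generateStringKey: 'password',
    excludePunctuation: true,
    passwordLength: 32,
  },
});

// Shared JWT secret referenced by every task definition in the service stack.
const jwtSecret = new secretsmanager.Secret(storageStack, 'JwtSecret', {
  generateSecretString: { excludePunctuation: true, passwordLength: 64 },
});
// Note: Airflow's Fernet key is stricter: it must be a url-safe
// base64-encoded 32-byte key, so a plain random string is not valid.
```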

#### 2.2 Health Checks & Service Startup (8 commits)

Problem: ECS health checks failed due to timing and configuration issues.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #42 | ALB considering task failed, need more time | Increased grace period and check intervals |
| #36 | Health check failed due to DB connection | Added DB connection validation |
| #34 | Deployed without health checks temporarily | Isolated startup from health check issues |
| #33 | Triggerer failing health checks | Disabled until config fixed |
| #35 | Most services up without health checks | Validated core functionality first |
| #15 | Health check failed, probably DB connection | Database initialization timing |
| #12 | Added ECS circuit breaker | Automatic rollback on deployment failures |
| #54 | Removed unneeded health checks | Cleaned up redundant checks |

Resolution Strategy:

- ✅ Set health check grace period to 300-600 seconds for DB-dependent services

- ✅ Use circuit breaker for automatic rollback on failures

- ✅ Disable health checks temporarily to isolate startup issues

- ✅ Validate database connectivity before enabling health checks

- ✅ Use ALB target group health checks for webserver

- ✅ Configure startup scripts to wait for dependencies
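
A sketch of the startup tuning above, assuming the webserver is the ALB-attached service (it is registered with the target group in section 3.2's sketch):

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import { Duration, Stack } from 'aws-cdk-lib';

declare const stack: Stack;
declare const cluster: ecs.ICluster;
declare const webserverTaskDef: ecs.FargateTaskDefinition;

new ecs.FargateService(stack, 'Webserver', {
  cluster,
  taskDefinition: webserverTaskDef,
  // Give the container time to run migrations and connect to the DB before
  // ALB health checks count against it.
  healthCheckGracePeriod: Duration.seconds(300),
  // Roll back automatically instead of looping on a broken deployment.
  circuitBreaker: { rollback: true },
});
```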

#### 2.3 Resource Allocation (5 commits)

Problem: Services need adequate CPU and memory resources.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #45 | Worker killed due to high memory usage | Doubled memory to 2GB |
| #42 | Bumped to 1 vCPU | Increased from 0.5 to 1 vCPU |
| #51 | Added auto-scaling for workers | Scale 1-10 based on metrics |
| #45 | Added Spot instances | Cost savings on worker fleet |
| #11 | Added temp directory configuration | Proper temp storage allocation |

Resolution Strategy:

- ✅ Start with minimum: webserver/scheduler 1 vCPU/2GB, workers 1 vCPU/2GB

- ✅ Monitor CloudWatch metrics for CPU/memory utilization

- ✅ Use Spot instances for workers (50-70% cost savings)

- ✅ Implement auto-scaling based on CPU (70%) and memory (80%) targets

- ✅ Configure temp directories with adequate storage
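
A sketch of a Spot-weighted worker fleet; the 3:1 weighting and on-demand base of 1 are assumptions, not the project's exact values:

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import { Stack } from 'aws-cdk-lib';

declare const stack: Stack;
declare const cluster: ecs.Cluster;
declare const workerTaskDef: ecs.FargateTaskDefinition;

cluster.enableFargateCapacityProviders();

new ecs.FargateService(stack, 'Workers', {
  cluster,
  taskDefinition: workerTaskDef,
  capacityProviderStrategies: [
    // Keep one task on on-demand Fargate as a floor...
    { capacityProvider: 'FARGATE', base: 1, weight: 1 },
    // ...and put the rest of the fleet on cheaper Fargate Spot.
    { capacityProvider: 'FARGATE_SPOT', weight: 3 },
  ],
});
```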

#### 2.4 Container Architecture (4 commits)

Problem: ARM64 vs AMD64 architecture differences.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #13 | Trying ARM64 for entrypoint.sh problem | Switched to ARM64 architecture |
| #14 | Specific ARM64 tagging | Explicit platform tagging |
| #12 | Architecture differences in ECS | Set explicit architecture in task definitions |
| #8 | Multi-stage builds with UV | Optimized for target architecture |

Resolution Strategy:

- ✅ Choose ARM64 for better price/performance (Graviton processors)

- ✅ Tag Docker images with explicit platform (--platform linux/arm64)

- ✅ Set runtimePlatform in ECS task definitions

- ✅ Test locally with matching architecture
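
In CDK this is a short block on the task definition; the sizes follow the lessons above:

```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import { Stack } from 'aws-cdk-lib';

declare const stack: Stack;

new ecs.FargateTaskDefinition(stack, 'WorkerTask', {
  cpu: 1024,            // 1 vCPU
  memoryLimitMiB: 2048, // 2 GB, per the worker OOM lesson above
  // Must match the image platform (docker build --platform linux/arm64).
  runtimePlatform: {
    cpuArchitecture: ecs.CpuArchitecture.ARM64,
    operatingSystemFamily: ecs.OperatingSystemFamily.LINUX,
  },
});
```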

#### 2.5 Availability & Deployment (3 commits)

Problem: Availability zone and deployment configuration issues.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #6 | Getting error about availability zones | Added AZ mappings to context |
| #7 | Some services not starting up | Fixed ECR image references |
| #25 | Fixed subnet AZ and naming | Proper subnet selection |

Resolution Strategy:

- ✅ Cache AZ lookups in cdk.context.json to avoid API calls

- ✅ Use multiple AZs for high availability

- ✅ Ensure subnets span multiple AZs properly

- ✅ Use CDK context for consistent AZ selection

---

## 3. Networking Pitfalls (12 commits)

### Key Learnings

- Service-to-service communication requires proper security groups and discovery

- ALB configuration is critical for external and internal traffic

- IPv6 can cause unexpected connection issues

- VPC design impacts cost and complexity

### Issues Encountered

#### 3.1 Service Discovery & Internal Communication (4 commits)

Problem: Services couldn’t reliably communicate with each other.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #46 | Worker can’t reach base URL | Added AWS Cloud Map service discovery |
| #46 | Worker to API connectivity | Configured internal ALB DNS |
| #46 | Security group rules missing | Added worker-to-webserver connectivity |
| #31 | DB connection failing | Security group for RDS access |

Resolution Strategy:

- ✅ Use AWS Cloud Map for service discovery (servicename.namespace)

- ✅ Configure internal ALB for worker-to-webserver communication

- ✅ Set up security groups to allow inter-service traffic

- ✅ Use private subnets for services, public for ALB only

- ✅ Document all security group rules for troubleshooting
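
A sketch of the security-group wiring; port 8080 for the webserver/API is an assumption based on Airflow's default:

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as rds from 'aws-cdk-lib/aws-rds';

declare const workerService: ecs.FargateService;
declare const webserverService: ecs.FargateService;
declare const database: rds.DatabaseInstance;

// Workers call the webserver/API on its web port.
webserverService.connections.allowFrom(workerService, ec2.Port.tcp(8080), 'Worker -> API');

// Every service needs Postgres; allowDefaultPortFrom opens 5432 only to them.
database.connections.allowDefaultPortFrom(webserverService, 'Webserver -> Postgres');
database.connections.allowDefaultPortFrom(workerService, 'Worker -> Postgres');
```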

#### 3.2 Load Balancer Configuration (5 commits)

Problem: ALB health checks and routing configuration.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #42 | ALB considering task failed | Increased timeouts and grace periods |
| #2 | Initial ALB setup with webserver | Public ALB for external access |
| #46 | Added internal communication path | Internal ALB/service discovery |
| #35 | Health check configuration | Proper health check paths |
| #54 | Removed unneeded health checks | Simplified configuration |

Resolution Strategy:

- ✅ Use public ALB only for webserver/UI

- ✅ Set health check path to /health or /api/v1/health

- ✅ Configure deregistration delay to 30 seconds

- ✅ Set healthy/unhealthy threshold appropriately (2/3)

- ✅ Use longer intervals (30s) for services with slow startup
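
A sketch of the target-group settings above; confirm the health check path against your Airflow version:

```typescript
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import { Duration } from 'aws-cdk-lib';

declare const listener: elbv2.ApplicationListener;
declare const webserverService: ecs.FargateService;

listener.addTargets('Webserver', {
  port: 8080,
  protocol: elbv2.ApplicationProtocol.HTTP,
  targets: [webserverService],
  deregistrationDelay: Duration.seconds(30),
  healthCheck: {
    path: '/api/v1/health',         // or /health, depending on the version
    interval: Duration.seconds(30), // longer interval for slow-starting services
    healthyThresholdCount: 2,
    unhealthyThresholdCount: 3,
  },
});
```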

#### 3.3 Database Connectivity (3 commits)

Problem: RDS connection string format and network access.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #23 | Bad IPv6 URL when connecting to Postgres | Use hostname format without brackets |
| #31 | DB connection failing | Security groups and connection params |
| #36 | DB connection in health checks | Validate connection before health checks |

Resolution Strategy:

- ✅ Use RDS endpoint hostname directly: dbinstance.xxxxxx.region.rds.amazonaws.com

- ✅ Avoid IPv6 format with brackets: [::1]

- ✅ Configure security group ingress on port 5432 from ECS tasks

- ✅ Use private subnets for RDS (never public)

- ✅ Test connection string format locally first
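
A short sketch of building the connection string from the RDS endpoint attributes (variable names invented):

```typescript
import * as rds from 'aws-cdk-lib/aws-rds';

declare const database: rds.DatabaseInstance;
declare const encodedPassword: string; // URL-encoded, as in section 1.2

// Always use the RDS endpoint hostname attribute, never a resolved IP
// (and never a bracketed IPv6 literal like [::1]).
const host = database.dbInstanceEndpointAddress;
const port = database.dbInstanceEndpointPort;
const sqlAlchemyConn = `postgresql+psycopg2://airflow:${encodedPassword}@${host}:${port}/airflow`;
// In practice, inject the password via ecs.Secret rather than interpolating it.
```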

#### 3.4 VPC & Subnet Configuration (3 commits)

Problem: Subnet and availability zone configuration.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #6 | AZ errors | Fixed AZ configuration and context |
| #25 | Subnet AZ issues | Proper subnet selection in correct AZs |
| #24 | VPC in storage stack | Centralized VPC in stateful stack |

Resolution Strategy:

- ✅ Create VPC in storage/stateful stack

- ✅ Use 2-3 AZs for high availability

- ✅ Separate public and private subnets

- ✅ Use NAT gateway for private subnet internet access

- ✅ Cache AZ lookups in cdk.context.json
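
A sketch of this VPC layout in CDK:

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Stack } from 'aws-cdk-lib';

declare const storageStack: Stack;

new ec2.Vpc(storageStack, 'AirflowVpc', {
  maxAzs: 2,      // spread subnets across two AZs
  natGateways: 1, // one NAT keeps dev costs down; use one per AZ for prod HA
  subnetConfiguration: [
    // Public subnets for the ALB only.
    { name: 'public', subnetType: ec2.SubnetType.PUBLIC, cidrMask: 24 },
    // Private subnets (with NAT egress) for ECS tasks and RDS.
    { name: 'private', subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS, cidrMask: 24 },
  ],
});
```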

---

## 4. CDK Pitfalls (8 commits)

### Key Learnings

- Stack organization impacts maintainability and deployment

- Context management is crucial for consistent deployments

- CDK warnings should be addressed early

- Stack dependencies must be explicitly defined

### Issues Encountered

#### 4.1 Stack Organization (3 commits)

Problem: Monolithic stack vs separated concerns.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #24 | Split into storage and service stacks | Separated stateful from stateless |
| #4 | Major stack refactoring | Better organization and structure |
| #2 | Initial monolithic stack | Started with everything in one stack |

Resolution Strategy:

- ✅ Separate stateful resources (VPC, RDS, Cache) into storage-stack

- ✅ Put stateless compute (ECS services) in service-stack

- ✅ Use stack outputs and cross-stack references

- ✅ Define explicit dependencies between stacks

- ✅ Benefits: independent lifecycle, easier testing, faster iteration on services
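
A condensed sketch of the two-stack pattern (class and property names are assumptions):

```typescript
import { App, Stack, StackProps } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';

class StorageStack extends Stack {
  readonly vpc: ec2.Vpc;
  readonly dbSecret: secretsmanager.Secret;
  constructor(scope: App, id: string, props?: StackProps) {
    super(scope, id, props);
    this.vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
    this.dbSecret = new secretsmanager.Secret(this, 'DbSecret');
    // RDS, cache, etc. also live here: stateful, rarely redeployed.
  }
}

interface ServiceStackProps extends StackProps {
  vpc: ec2.IVpc;
  dbSecret: secretsmanager.ISecret;
}

class ServiceStack extends Stack {
  constructor(scope: App, id: string, props: ServiceStackProps) {
    super(scope, id, props);
    // ECS cluster, ALB, and Airflow services are built from props.vpc and
    // props.dbSecret; CDK emits the cross-stack exports/imports itself.
  }
}

const app = new App();
const storage = new StorageStack(app, 'airflow-storage');
new ServiceStack(app, 'airflow-services', { vpc: storage.vpc, dbSecret: storage.dbSecret });
```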

#### 4.2 Context & Configuration Management (3 commits)

Problem: CDK context and synthesizer warnings.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #5 | Fixing CDK warnings | Added proper synthesizer configuration |
| #6 | AZ lookup issues | Added cdk.context.json with AZ mappings |
| #6 | Added .env file | Environment-specific configuration |

Resolution Strategy:

- ✅ Use cdk.context.json to cache AZ lookups (avoid API calls)

- ✅ Configure DefaultStackSynthesizer properly

- ✅ Use .env files for environment-specific values

- ✅ Address CDK warnings during development, not later

- ✅ Version control cdk.context.json for team consistency

#### 4.3 Resource References & Dependencies (2 commits)

Problem: Properly referencing resources across stacks.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #24 | Cross-stack references for storage resources | Export/import pattern |
| #40 | Secret reference syntax errors | Proper CDK secret constructs |

Resolution Strategy:

- ✅ Use stack.exportValue() and Fn.importValue() for cross-stack refs

- ✅ Pass resources as constructor parameters when possible

- ✅ Use proper CDK constructs (Secret.fromSecretCompleteArn) instead of raw ARNs

- ✅ Define dependencies explicitly with addDependency()

- ✅ Document cross-stack dependencies in README

---

## 5. Docker/Container Pitfalls (15 commits)

### Key Learnings

- Build performance matters significantly for iteration speed

- Multi-stage builds reduce image size and build time

- User permissions are tricky with shared volumes

- Base image selection impacts complexity

### Issues Encountered

#### 5.1 Build Performance (3 commits)

Problem: Docker builds were extremely slow (4+ minutes).

| Commit | Issue | Resolution |
|--------|-------|------------|
| #8 | Got docker building faster with UV | 4 min → 10 sec improvement |
| #8 | Implemented multi-stage builds | Separate builder and runtime stages |
| #39 | Simplified to leverage base image UV | Use apache/airflow’s built-in tools |

Resolution Strategy:

- ✅ Use UV package manager for 25-50x faster pip installs

- ✅ Multi-stage builds: builder stage + lean runtime stage

- ✅ Copy pre-built virtual environment from builder

- ✅ Leverage capabilities of base image (apache/airflow has UV)

- ✅ Cache layers effectively by ordering Dockerfile commands

Impact: build time dropped from 4 minutes to 10 seconds, a 24x improvement.

#### 5.2 Image Complexity (4 commits)

Problem: Over-complicated Docker image with unnecessary scripts.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #38 | Massive cleanup: removed 1057 lines | Removed test/debug scripts |
| #39 | Removed airflow.cfg file | Use environment variables only |
| #39 | Simplified Dockerfile by 16 lines | Leveraged base image features |
| #6 | Temporarily removed Docker files | Clean slate approach |

Resolution Strategy:

- ✅ Start with minimal Dockerfile extending apache/airflow base

- ✅ Avoid custom entrypoint scripts unless absolutely necessary

- ✅ Use environment variables instead of config files

- ✅ Remove debugging/testing scripts from production image

- ✅ Keep Dockerfile under 50 lines total

#### 5.3 Local Development (3 commits)

Problem: Need to test Docker images locally before ECS.

| Commit | Issue | Resolution |
|--------|-------|------------|
| #9 | Created Docker Compose for local testing | Full local environment |
| #10 | Got local dev working | Validated configuration locally |
| #26 | Used local deployment to fix ECS issues | Local testing found issues |

Resolution Strategy:

- ✅ Create docker-compose.yml matching ECS configuration

- ✅ Test all configuration locally before deploying to ECS

- ✅ Use same environment variables in both environments

- ✅ Validate database connectivity, secret injection, volumes locally

- ✅ Significantly reduces cloud debugging time and cost

#### 5.4 Base Image & Dependencies (3 commits)

Problem: Managing Python dependencies and base image.

| Commit | Issue | Resolution |

|--------|-------|------------|

| #7 | Added build_and_push.sh for ECR | Automated image publishing |

| #8 | Used UV for dependencies | Faster installs |

| #15 | ‘aws’ CLI not found in image | Base image didn’t include AWS CLI |

Resolution Strategy:

- ✅ Use official apache/airflow:3.1.0 as base image

- ✅ Only install additional dependencies if truly needed

- ✅ Use UV or pip with --no-cache-dir to minimize layer size

- ✅ Create automated build/push scripts for CI/CD

- ✅ Tag images with git commit SHA for traceability

#### 5.5 Sidecar Containers (2 commits)

Problem: Running git-sync as sidecar with proper image.

| Commit | Issue | Resolution |

|--------|-------|------------|

| #20 | Added git-sync image to ECR | Mirrored k8s.gcr.io/git-sync |

| #50 | Created GitLab CI for git-sync mirroring | Automated mirror updates |

Resolution Strategy:

- ✅ Mirror external images (k8s git-sync) to ECR for reliability

- ✅ Use sidecar pattern for orthogonal concerns (git sync, monitoring)

- ✅ Configure shared volumes between main and sidecar containers

- ✅ Automate image mirroring with CI/CD pipelines

- ✅ Document sidecar image sources and update procedures

---

## 6. Infrastructure as Code Best Practices Learned

### From This Journey

1. Iterative Development is Key

- 54 commits show the value of small, incremental changes

- Each commit isolated a specific problem

- Easier to rollback and debug

2. Local Testing Saves Time & Money

- Docker Compose for local validation (commits #9, #10, #26)

- Reduced ECS debugging iterations

- Faster feedback loop

3. Separate Stateful from Stateless

- Storage stack (VPC, RDS, Cache) - commit #24

- Service stack (ECS, ALB) can be destroyed/recreated quickly

- Independent lifecycle management

4. Secrets Management

- Never hardcode secrets

- Use Secrets Manager for all sensitive data

- Proper URL encoding for special characters

5. Start Simple, Add Complexity

- Initial “AI slop” was too complex

- Simplified over time (commits #38, #39)

- Remove before adding

6. Documentation as You Go

- Git commit messages tell the story

- Document pitfalls immediately

- Future you will thank present you

---

## 7. Recommended Development Workflow

Based on lessons learned from 54 commits:

### Phase 1: Foundation (Do This Right First)

1. ✅ Set up CDK project with proper structure

2. ✅ Create storage stack (VPC, RDS, Cache)

3. ✅ Configure secrets in Secrets Manager

4. ✅ Set up cdk.context.json with AZ mappings

5. ✅ Create .env for environment config

### Phase 2: Docker Development

1. ✅ Create minimal Dockerfile extending apache/airflow

2. ✅ Set up docker-compose.yml for local testing

3. ✅ Test all Airflow services locally

4. ✅ Validate secrets, DB connection, volumes

5. ✅ Optimize build with UV/multi-stage

### Phase 3: AWS Deployment (Incremental)

1. ✅ Deploy storage stack first

2. ✅ Push Docker image to ECR

3. ✅ Deploy webserver only (with ALB)

4. ✅ Deploy scheduler

5. ✅ Deploy triggerer

6. ✅ Deploy worker last (with auto-scaling)

### Phase 4: Optimization

1. ✅ Enable health checks after services stable

2. ✅ Add circuit breaker

3. ✅ Configure auto-scaling

4. ✅ Add Spot instances for workers

5. ✅ Set up monitoring and alarms

### Phase 5: Production Hardening

1. ✅ Enable secret rotation

2. ✅ Configure backup policies

3. ✅ Set up CI/CD pipelines

4. ✅ Document runbooks

5. ✅ Load testing and tuning

---

## 8. Cost Optimization Insights

### What Worked

- Spot Instances for Workers (commit #45): 50-70% cost savings

- ARM64 Architecture (commits #13, #14): Better price/performance

- Separate Stacks (commit #24): Can destroy/recreate services without affecting storage

- Auto-scaling (commit #51): Only pay for capacity you use

### What Didn’t Work

- EFS (commits #15-18): More complex and costly than git-sync

- Always-on development: Could use dev/prod environment separation

---

## 9. Key Metrics

### Development Journey

- Total Commits: 54

- Major Refactors: 3 (commits #4, #24, #38)

- Pitfall Fixes: 47

- Days of Development: ~10 days (Oct 28 - Nov 6, 2025)

### Technical Debt Resolved

- Deleted Lines: ~6,000+ (cleanup commits #38, #39, #52)

- Documentation Added: 2,500+ lines

- Build Time Improvement: 24x faster (4 min → 10 sec)

### Final Architecture

- Stacks: 2 (storage, service)

- ECS Services: 4 (webserver, scheduler, worker, triggerer)

- Supporting Services: 3 (RDS, Valkey, git-sync)

- Auto-scaling: 1-10 workers based on load

- Spot Instances: Enabled for workers

---

## 10. Top 10 Lessons Learned

1. Test Locally First - Docker Compose saved countless hours debugging in AWS

2. URL-Encode Passwords - Special characters in passwords will break everything

3. Use Git-Sync, Not EFS - Simpler, cheaper, more reliable for DAG deployment

4. Separate Stateful/Stateless - Independent lifecycle = faster iteration

5. UV Package Manager - 24x faster Docker builds

6. Start Simple - Remove complexity before adding features

7. Security Groups Matter - Document all connectivity requirements

8. Health Check Grace Periods - DB-dependent services need 300-600 seconds

9. Spot Instances for Workers - 50-70% cost savings, minimal impact

10. Commit Often - Small commits make debugging and rollback easier

---

## 11. If Starting Over, Do This

### Skip These Mistakes

- ❌ Don’t use static airflow.cfg files

- ❌ Don’t start with all services at once

- ❌ Don’t use EFS for DAGs (use git-sync)

- ❌ Don’t ignore CDK warnings

- ❌ Don’t skip local testing

- ❌ Don’t generate passwords with special characters (or URL-encode them)

- ❌ Don’t use complex entrypoint scripts

### Do These Things

- ✅ Start with storage stack + webserver only

- ✅ Use UV for fast Docker builds from day 1

- ✅ Set up docker-compose.yml immediately

- ✅ Use environment variables exclusively

- ✅ Configure Spot instances from the start

- ✅ Document as you go

- ✅ Use git-sync for DAG deployment

### Time Saved

Following this guide could reduce development time from ~80 hours to ~20 hours, avoiding the 47 pitfall commits.

---

## Conclusion

This journey through 54 commits demonstrates the iterative nature of cloud infrastructure development. The key insight is that failure is part of the learning process, and each commit represents a lesson learned.

The most valuable outcome isn’t just a working Airflow deployment, but the understanding of:

- How ECS services communicate

- Why certain patterns work (git-sync) and others don’t (EFS)

- The importance of local testing

- How to structure IaC for maintainability

These lessons are transferable to any ECS/Fargate deployment, not just Airflow.

Final Architecture Success Rate: 100% (after 54 commits and ~47 fixes)

Would Do It Again: Yes, with this document as a guide
