Strategies
Executive Summary
Total Commits Analyzed: 54
Problem-Solving Commits: 47 (87%)
Documentation/Cleanup Commits: 7 (13%)
Problem Distribution by Category
| Category | Count | Percentage | Top Issues |
|---|---|---|---|
| Airflow-Specific Pitfalls | 23 | 43% | Configuration, permissions, secrets, service communication |
| AWS/ECS Pitfalls | 19 | 35% | Secret injection, health checks, resource allocation, spot instances |
| Docker/Container Pitfalls | 15 | 28% | Build performance, user permissions, multi-stage builds |
| Networking Pitfalls | 12 | 22% | IPv6, service discovery, ALB configuration, security groups |
| CDK Pitfalls | 8 | 15% | Context management, synthesizer, stack dependencies |

Note: commits can appear in multiple categories, so percentages sum to more than 100%.
1. Airflow-Specific Pitfalls (23 commits)
Key Learnings
- Airflow has complex inter-service dependencies that must be carefully orchestrated
- Secret management and encoding are critical for multi-service deployments
- DAG deployment strategy significantly impacts architecture complexity
- Airflow 3.x has different requirements than 2.x
Issues Encountered
1.1 Configuration & Environment Variables (8 commits)
Problem: Airflow requires precise configuration across multiple services with shared secrets.
| Commit | Issue | Resolution |
|---|---|---|
| #39 | Removed static airflow.cfg, relied on env vars | Use environment variables exclusively instead of config files |
| #40 | ECS injecting “[object Object]” for secrets | Fixed secret reference syntax in CDK |
| #41 | Secrets still passed incorrectly | Simplified to direct secret ARN references |
| #28 | Changed service startup based on local dev | Validated config locally first before ECS deployment |
| #4 | Major refactoring of environment variables | Centralized env var management in CDK constructs |
| #54 | Removed unneeded default env vars | Cleanup reduces complexity and confusion |
| #31 | Database connection failing | Proper connection string format with all required params |
| #30 | Special characters in DB password broke startup | URL-encode passwords with special characters |
Resolution Strategy:
- Use environment variables exclusively; avoid static config files
- Validate configuration locally with Docker Compose before ECS
- URL-encode all connection strings and secrets
- Use CDK constructs to centralize environment variable management
- Reference secrets by ARN directly, not through complex wrapper objects
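Dropping the static config file works because Airflow reads any option from an environment variable named `AIRFLOW__{SECTION}__{KEY}` (section and key upper-cased, double underscores as separators), which takes precedence over `airflow.cfg`. A small illustrative helper (the option values shown are examples, not this deployment's actual settings) makes the mapping explicit when centralizing variables in CDK:

```python
def airflow_env_var(section: str, key: str) -> str:
    """Map an airflow.cfg [section] option to Airflow's env-var name.

    Airflow reads AIRFLOW__{SECTION}__{KEY} (double underscores) in
    preference to airflow.cfg, which is what allows dropping the static
    config file entirely.
    """
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# A few options typically set via the environment in a deployment like this:
env = {
    airflow_env_var("core", "executor"): "CeleryExecutor",
    airflow_env_var("database", "sql_alchemy_conn"): "postgresql://...",
    airflow_env_var("core", "dags_folder"): "/opt/airflow/dags",
}
print(sorted(env))
```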
1.2 Database & Storage (6 commits)
Problem: Database connectivity and encoding issues caused repeated failures.
| Commit | Issue | Resolution |
|---|---|---|
| #30 | Encoded chars in password caused ECS failure | URL-encode passwords, avoid special chars in generation |
| #31 | DB connection failing | Security groups, connection strings, timeouts |
| #36 | Health check failed due to DB connection | Increase grace period, validate connection in startup |
| #23 | Bad IPv6 URL when connecting to Postgres | Use proper endpoint format without IPv6 |
| #22 | Switched back to PostgreSQL from alternative | Stick with well-supported databases |
| #29 | Moved DB migration to API server | Ensures DB ready before other services start |

Resolution Strategy:
- Generate passwords without special characters OR properly URL-encode them
- Run database migrations in the webserver/API startup, not the scheduler
- Use the RDS endpoint hostname format, avoiding IPv6 references
- Increase health check grace periods for DB-dependent services
- Validate DB connectivity before starting service processes
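The password-encoding fix (#30) comes down to percent-encoding every reserved character before the password lands in the SQLAlchemy connection string. A minimal sketch using only the standard library (user, host, and database names here are placeholders):

```python
from urllib.parse import quote_plus

def pg_conn_string(user: str, password: str, host: str, db: str, port: int = 5432) -> str:
    """Build a SQLAlchemy/libpq URL with the password safely percent-encoded.

    quote_plus escapes @, :, /, # and friends, which would otherwise be
    parsed as URL structure and break service startup.
    """
    return f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{db}"

# A password containing URL-reserved characters survives intact:
conn = pg_conn_string("airflow", "p@ss:w/rd#1", "db.example.internal", "airflow")
print(conn)
# postgresql://airflow:p%40ss%3Aw%2Frd%231@db.example.internal:5432/airflow
```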
1.3 Service Communication & Architecture (5 commits)
Problem: Airflow services need to communicate with each other reliably.
| Commit | Issue | Resolution |
|---|---|---|
| #46 | Worker can’t reach base URL | Added service discovery and internal ALB |
| #47 | Added common JWT secret | Shared JWT for inter-service authentication |
| #27 | Can’t get auth to work | Proper authentication backend configuration |
| #32 | Successfully deployed API and scheduler | Validated basic service communication |
| #24 | Split into storage/service stacks | Separate stateful from stateless resources |
Resolution Strategy:
- Use AWS Cloud Map for service discovery between components
- Configure internal ALB DNS for workers to reach the webserver
- Share authentication secrets (JWT, Fernet) across all services
- Separate stateful (DB, cache) from stateless (compute) resources in stack design
- Deploy the webserver/API first, then the scheduler, then workers
1.4 DAG Deployment & Storage (7 commits)
Problem: Finding the right way to deploy DAGs to all Airflow services.
| Commit | Issue | Resolution |
|---|---|---|
| #15 | Tried EFS after ‘aws’ CLI not found in image | EFS for shared storage |
| #17 | EFS mount point issues | Fixed mount configuration |
| #18 | EFS DNS modification issues | Changed mount attachment method |
| #19 | Switched from EFS to git-sync | Git-sync sidecar pattern |
| #20 | Added git-sync with SSL certs | Proper SSL for private repos |
| #21 | Git-sync needs root permissions | Run as root to write to filesystem |
| #43 | Read-only DAG folder caused startup error | Make DAG folder writable |
| #49 | Simplified git-sync configuration | Direct git sync approach |
| #53 | Fixed git-sync repo configuration | Correct repo URL and credentials |
Resolution Strategy:
- Use the git-sync sidecar pattern instead of EFS (simpler, more reliable)
- Run git-sync as root with proper permissions
- Ensure the shared volume is writable by Airflow processes
- Mirror the git-sync image to ECR with SSL certificates for private repos
- Configure git-sync to sync to the correct path (/opt/airflow/dags)
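In container terms, the sidecar pattern looks roughly like the sketch below. This is illustrative, not the exact task definition from these commits: the env-var names follow git-sync v3's `GIT_SYNC_*` convention (v4 renamed them to `GITSYNC_*`), and the image and repo URLs are placeholders.

```yaml
# git-sync sidecar sharing an ephemeral volume with the Airflow containers
containers:
  - name: git-sync
    image: <account>.dkr.ecr.<region>.amazonaws.com/git-sync:v3  # mirrored to ECR
    user: "0"                      # root, so it can write to the shared volume
    environment:
      GIT_SYNC_REPO: "https://gitlab.example.com/data/airflow-dags.git"
      GIT_SYNC_BRANCH: "main"
      GIT_SYNC_ROOT: "/opt/airflow/dags"   # destination the Airflow services read
      GIT_SYNC_WAIT: "60"                  # seconds between pulls
    mountPoints:
      - sourceVolume: dags
        containerPath: /opt/airflow/dags
```

The Airflow containers mount the same `dags` volume read-side, which is why the volume (not the DAG folder baked into the image) must be writable.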
1.5 Component Health & Scaling (6 commits)
Problem: Getting all Airflow components (webserver, scheduler, worker, triggerer) healthy.
| Commit | Issue | Resolution |
|---|---|---|
| #33 | Triggerer failing health checks | Temporarily disabled to isolate issues |
| #34 | Deployed without health checks | Removed to isolate startup from health check issues |
| #35 | All but worker running | Webserver, scheduler, triggerer operational |
| #44 | Re-enabled triggerer | Fixed config and re-added successfully |
| #45 | Worker killed due to high memory | Doubled memory to 2GB |
| #51 | Added worker auto-scaling | Scale 1-10 based on CPU/memory |
Resolution Strategy:
- Disable health checks initially to isolate startup issues
- Deploy services incrementally: webserver → scheduler → triggerer → worker
- Monitor memory usage and increase allocations (workers need 2GB+)
- Implement auto-scaling for workers (1-10 instances)
- Re-enable health checks after confirming services start successfully
1.6 Permissions & User Management (3 commits)
Problem: Container user permissions for filesystem access.
| Commit | Issue | Resolution |
|---|---|---|
| #37 | Trying to fix airflow user with root permissions | User/group permission conflicts |
| #21 | Git-sync needs root permissions to write | Run git-sync as root |
| #38 | Simplified user management in Docker | Use base image defaults |
Resolution Strategy:
- Run the git-sync sidecar as root (uid 0) for filesystem writes
- Run Airflow processes as the airflow user (the default in the base image)
- Use shared volumes with appropriate permissions
- Don't override user settings unless necessary
2. AWS/ECS Pitfalls (19 commits)
Key Learnings
- ECS has specific requirements for secret injection and environment variables
- Health checks need careful tuning for startup times
- Resource allocation (CPU/memory) requires iteration based on actual usage
- Spot instances can significantly reduce costs
Issues Encountered
2.1 Secret Management (5 commits)
Problem: ECS secret injection has specific syntax requirements.
| Commit | Issue | Resolution |
|---|---|---|
| #40 | ECS injecting “[object Object]” literally | Fixed CDK secret reference syntax |
| #41 | Secrets still passed incorrectly | Used direct secret ARN references |
| #47 | Added common JWT secret | Centralized secret in Secrets Manager |
| #52 | Fixed Fernet key generation | Proper secret generation in storage stack |
| #30 | Password encoding issues | URL-encode or avoid special characters |
Resolution Strategy:
- Use `ecs.Secret.fromSecretsManager(secret)` or direct ARN references
- Avoid complex object wrapping in secret definitions
- Store all secrets in Secrets Manager, not environment variables
- Generate secrets in the storage stack, reference them in the service stack
- Use Secrets Manager rotation for production environments
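For reference on the Fernet key fix (#52): a Fernet key (the symmetric key Airflow uses to encrypt connections and variables) is just 32 random bytes, URL-safe base64-encoded. A stdlib sketch of what the generation step has to produce:

```python
import base64
import os

def generate_fernet_key() -> str:
    """Generate a key in the format Airflow's Fernet encryption expects:
    32 random bytes, URL-safe base64-encoded (44 characters with padding).
    Equivalent to cryptography.fernet.Fernet.generate_key().decode().
    """
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")

key = generate_fernet_key()
print(len(key))  # 44
```

In practice the storage stack generates this once and stores it in Secrets Manager so every service decrypts with the same key; a mismatched or malformed key makes connections unreadable across services.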
2.2 Health Checks & Service Startup (8 commits)
Problem: ECS health checks failed due to timing and configuration issues.
| Commit | Issue | Resolution |
|---|---|---|
| #42 | ALB considering task failed, need more time | Increased grace period and check intervals |
| #36 | Health check failed due to DB connection | Added DB connection validation |
| #34 | Deployed without health checks temporarily | Isolated startup from health check issues |
| #33 | Triggerer failing health checks | Disabled until config fixed |
| #35 | Most services up without health checks | Validated core functionality first |
| #15 | Health check failed, probably DB connection | Database initialization timing |
| #12 | Added ECS circuit breaker | Automatic rollback on deployment failures |
| #54 | Removed unneeded health checks | Cleaned up redundant checks |
Resolution Strategy:
- Set the health check grace period to 300-600 seconds for DB-dependent services
- Use the circuit breaker for automatic rollback on failures
- Disable health checks temporarily to isolate startup issues
- Validate database connectivity before enabling health checks
- Use ALB target group health checks for the webserver
- Configure startup scripts to wait for dependencies
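The "wait for dependencies" step can be as small as a TCP probe loop in the container entrypoint. A sketch (host, port, and timings are illustrative):

```python
import socket
import time

def wait_for_tcp(host: str, port: int, timeout: float = 300.0, interval: float = 5.0) -> bool:
    """Block until host:port accepts TCP connections or the timeout expires.

    Run this before starting the scheduler/worker process so a task that is
    merely waiting for the database isn't killed by its health check.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In an entrypoint this would gate startup, e.g. exit non-zero if `wait_for_tcp(db_host, 5432)` returns False, letting ECS retry the task instead of running a service with no database.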
2.3 Resource Allocation (5 commits)
Problem: Services need adequate CPU and memory resources.
| Commit | Issue | Resolution |
|---|---|---|
| #45 | Worker killed due to high memory usage | Doubled memory to 2GB |
| #42 | Bumped to 1 vCPU | Increased from 0.5 to 1 vCPU |
| #51 | Added auto-scaling for workers | Scale 1-10 based on metrics |
| #45 | Added Spot instances | Cost savings on worker fleet |
| #11 | Added temp directory configuration | Proper temp storage allocation |
Resolution Strategy:
- Start with a minimum of 1 vCPU / 2GB for webserver, scheduler, and workers
- Monitor CloudWatch metrics for CPU/memory utilization
- Use Spot instances for workers (50-70% cost savings)
- Implement auto-scaling based on CPU (70%) and memory (80%) targets
- Configure temp directories with adequate storage
2.4 Container Architecture (4 commits)
Problem: ARM64 vs AMD64 architecture differences.
| Commit | Issue | Resolution |
|---|---|---|
| #13 | Trying ARM64 for entrypoint.sh problem | Switched to ARM64 architecture |
| #14 | Specific ARM64 tagging | Explicit platform tagging |
| #12 | Architecture differences in ECS | Set explicit architecture in task definitions |
| #8 | Multi-stage builds with UV | Optimized for target architecture |
Resolution Strategy:
- Choose ARM64 for better price/performance (Graviton processors)
- Tag Docker images with an explicit platform (`--platform linux/arm64`)
- Set `runtimePlatform` in ECS task definitions
- Test locally with a matching architecture
2.5 Availability & Deployment (3 commits)
Problem: Availability zone and deployment configuration issues.
| Commit | Issue | Resolution |
|---|---|---|
| #6 | Getting error about availability zones | Added AZ mappings to context |
| #7 | Some services not starting up | Fixed ECR image references |
| #25 | Fixed subnet AZ and naming | Proper subnet selection |
Resolution Strategy:
- Cache AZ lookups in cdk.context.json to avoid API calls
- Use multiple AZs for high availability
- Ensure subnets span multiple AZs properly
- Use CDK context for consistent AZ selection
3. Networking Pitfalls (12 commits)
Key Learnings
- Service-to-service communication requires proper security groups and discovery
- ALB configuration is critical for external and internal traffic
- IPv6 can cause unexpected connection issues
- VPC design impacts cost and complexity
Issues Encountered
3.1 Service Discovery & Internal Communication (4 commits)
Problem: Services couldn’t reliably communicate with each other.
| Commit | Issue | Resolution |
|---|---|---|
| #46 | Worker can’t reach base URL | Added AWS Cloud Map service discovery |
| #46 | Worker to API connectivity | Configured internal ALB DNS |
| #46 | Security group rules missing | Added worker-to-webserver connectivity |
| #31 | DB connection failing | Security group for RDS access |
Resolution Strategy:
- Use AWS Cloud Map for service discovery (`servicename.namespace`)
- Configure an internal ALB for worker-to-webserver communication
- Set up security groups to allow inter-service traffic
- Use private subnets for services, public subnets for the ALB only
- Document all security group rules for troubleshooting
3.2 Load Balancer Configuration (5 commits)
Problem: ALB health checks and routing configuration.
| Commit | Issue | Resolution |
|---|---|---|
| #42 | ALB considering task failed | Increased timeouts and grace periods |
| #2 | Initial ALB setup with webserver | Public ALB for external access |
| #46 | Added internal communication path | Internal ALB/service discovery |
| #35 | Health check configuration | Proper health check paths |
| #54 | Removed unneeded health checks | Simplified configuration |
Resolution Strategy:
- Use the public ALB only for the webserver/UI
- Set the health check path to `/health` or `/api/v1/health`
- Configure a deregistration delay of 30 seconds
- Set healthy/unhealthy thresholds appropriately (2/3)
- Use longer intervals (30s) for services with slow startup
3.3 Database Connectivity (3 commits)
Problem: RDS connection string format and network access.
| Commit | Issue | Resolution |
|---|---|---|
| #23 | Bad IPv6 URL when connecting to Postgres | Use hostname format without brackets |
| #31 | DB connection failing | Security groups and connection params |
| #36 | DB connection in health checks | Validate connection before health checks |
Resolution Strategy:
- Use the RDS endpoint hostname directly: `dbinstance.xxxxxx.region.rds.amazonaws.com`
- Avoid the IPv6 bracket format, e.g. `[::1]`
- Configure security group ingress on port 5432 from ECS tasks
- Use private subnets for RDS (never public)
- Test the connection string format locally first
3.4 VPC & Subnet Configuration (3 commits)
Problem: Subnet and availability zone configuration.
| Commit | Issue | Resolution |
|---|---|---|
| #6 | AZ errors | Fixed AZ configuration and context |
| #25 | Subnet AZ issues | Proper subnet selection in correct AZs |
| #24 | VPC in storage stack | Centralized VPC in stateful stack |
Resolution Strategy:
- Create the VPC in the storage/stateful stack
- Use 2-3 AZs for high availability
- Separate public and private subnets
- Use a NAT gateway for private subnet internet access
- Cache AZ lookups in cdk.context.json
4. CDK Pitfalls (8 commits)
Key Learnings
- Stack organization impacts maintainability and deployment
- Context management is crucial for consistent deployments
- CDK warnings should be addressed early
- Stack dependencies must be explicitly defined
Issues Encountered
4.1 Stack Organization (3 commits)
Problem: Monolithic stack vs separated concerns.
| Commit | Issue | Resolution |
|---|---|---|
| #24 | Split into storage and service stacks | Separated stateful from stateless |
| #4 | Major stack refactoring | Better organization and structure |
| #2 | Initial monolithic stack | Started with everything in one stack |
Resolution Strategy:
- Separate stateful resources (VPC, RDS, cache) into a storage stack
- Put stateless compute (ECS services) in a service stack
- Use stack outputs and cross-stack references
- Define explicit dependencies between stacks
- Benefits: independent lifecycles, easier testing, faster iteration on services
4.2 Context & Configuration Management (3 commits)
Problem: CDK context and synthesizer warnings.
| Commit | Issue | Resolution |
|---|---|---|
| #5 | Fixing CDK warnings | Added proper synthesizer configuration |
| #6 | AZ lookup issues | Added cdk.context.json with AZ mappings |
| #6 | Added .env file | Environment-specific configuration |
Resolution Strategy:
- Use cdk.context.json to cache AZ lookups (avoids API calls)
- Configure the DefaultStackSynthesizer properly
- Use .env files for environment-specific values
- Address CDK warnings during development, not later
- Version-control cdk.context.json for team consistency
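The cached AZ lookup lives in `cdk.context.json` under a provider key like the one below (account, region, and AZ names are placeholders; CDK writes this file itself on the first synth, after which no API call is needed):

```json
{
  "availability-zones:account=123456789012:region=eu-west-1": [
    "eu-west-1a",
    "eu-west-1b",
    "eu-west-1c"
  ]
}
```

Committing this file is what makes AZ selection deterministic across machines and CI.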
4.3 Resource References & Dependencies (2 commits)
Problem: Properly referencing resources across stacks.
| Commit | Issue | Resolution |
|---|---|---|
| #24 | Cross-stack references for storage resources | Export/import pattern |
| #40 | Secret reference syntax errors | Proper CDK secret constructs |
Resolution Strategy:
- Use `stack.exportValue()` and `Fn.importValue()` for cross-stack refs
- Pass resources as constructor parameters when possible
- Use proper CDK constructs (`Secret.fromSecretArn`) instead of raw ARNs
- Define dependencies explicitly with `addDependency()`
- Document cross-stack dependencies in the README
5. Docker/Container Pitfalls (15 commits)
Key Learnings
- Build performance matters significantly for iteration speed
- Multi-stage builds reduce image size and build time
- User permissions are tricky with shared volumes
- Base image selection impacts complexity
Issues Encountered
5.1 Build Performance (3 commits)
Problem: Docker builds were extremely slow (4+ minutes).
| Commit | Issue | Resolution |
|---|---|---|
| #8 | Got docker building faster with UV | 4 min → 10 sec improvement |
| #8 | Implemented multi-stage builds | Separate builder and runtime stages |
| #39 | Simplified to leverage base image UV | Use apache/airflow’s built-in tools |
Resolution Strategy:
- Use the UV package manager for 25-50x faster pip installs
- Multi-stage builds: builder stage plus lean runtime stage
- Copy the pre-built virtual environment from the builder
- Leverage capabilities of the base image (apache/airflow ships with UV)
- Cache layers effectively by ordering Dockerfile commands

Impact: build time went from 4 minutes → 10 seconds (24x improvement)
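The end state after #39 is close to this sketch. It is illustrative only: the tag and requirements path are placeholders, and depending on the base image's Python layout you may need a virtualenv or `uv`'s system-install mode rather than the bare invocation shown.

```dockerfile
# Minimal image leaning on the base image's bundled uv (commit #39's approach):
# no custom builder stage, no entrypoint scripts, config via env vars only.
FROM apache/airflow:3.1.0
COPY requirements.txt /opt/airflow/requirements.txt
# uv installs dependency sets far faster than plain pip, which is where
# the 4 min -> 10 sec build improvement came from
RUN uv pip install --no-cache -r /opt/airflow/requirements.txt
```

Keeping `COPY requirements.txt` before the install means the expensive layer is only rebuilt when dependencies actually change.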
5.2 Image Complexity (4 commits)
Problem: Over-complicated Docker image with unnecessary scripts.
| Commit | Issue | Resolution |
|---|---|---|
| #38 | Massive cleanup: removed 1057 lines | Removed test/debug scripts |
| #39 | Removed airflow.cfg file | Use environment variables only |
| #39 | Simplified Dockerfile by 16 lines | Leveraged base image features |
| #6 | Temporarily removed Docker files | Clean slate approach |
Resolution Strategy:
- Start with a minimal Dockerfile extending the apache/airflow base
- Avoid custom entrypoint scripts unless absolutely necessary
- Use environment variables instead of config files
- Remove debugging/testing scripts from the production image
- Keep the Dockerfile under 50 lines total
5.3 Local Development (3 commits)
Problem: Need to test Docker images locally before ECS.
| Commit | Issue | Resolution |
|---|---|---|
| #9 | Created Docker Compose for local testing | Full local environment |
| #10 | Got local dev working | Validated configuration locally |
| #26 | Used local deployment to fix ECS issues | Local testing found issues |
Resolution Strategy:
- Create a docker-compose.yml matching the ECS configuration
- Test all configuration locally before deploying to ECS
- Use the same environment variables in both environments
- Validate database connectivity, secret injection, and volumes locally
- This significantly reduces cloud debugging time and cost
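A trimmed sketch of such a compose file (service names, image tag, and credentials are placeholders, and most services are omitted): the point is that the image and the `AIRFLOW__*` variables are the same ones the ECS task definitions use, so misconfiguration surfaces locally first.

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  airflow-webserver:
    image: my-airflow:local          # same Dockerfile that is pushed to ECR
    environment: &airflow-env        # same AIRFLOW__* vars as the ECS tasks
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
    ports:
      - "8080:8080"
    depends_on:
      - postgres

  airflow-scheduler:
    image: my-airflow:local
    environment: *airflow-env        # YAML anchor keeps the env in one place
    depends_on:
      - postgres
```

The YAML anchor (`&airflow-env` / `*airflow-env`) mirrors the "centralize env var management" practice from the CDK side.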
5.4 Base Image & Dependencies (3 commits)
Problem: Managing Python dependencies and base image.
| Commit | Issue | Resolution |
|---|---|---|
| #7 | Added build_and_push.sh for ECR | Automated image publishing |
| #8 | Used UV for dependencies | Faster installs |
| #15 | ’aws’ CLI not found in image | Base image didn’t include AWS CLI |
Resolution Strategy:
- Use the official apache/airflow:3.1.0 image as the base
- Only install additional dependencies if truly needed
- Use UV, or pip with `--no-cache-dir`, to minimize layer size
- Create automated build/push scripts for CI/CD
- Tag images with the git commit SHA for traceability
5.5 Sidecar Containers (2 commits)
Problem: Running git-sync as sidecar with proper image.
| Commit | Issue | Resolution |
|---|---|---|
| #20 | Added git-sync image to ECR | Mirrored k8s.gcr.io/git-sync |
| #50 | Created GitLab CI for git-sync mirroring | Automated mirror updates |
Resolution Strategy:
- Mirror external images (k8s git-sync) to ECR for reliability
- Use the sidecar pattern for orthogonal concerns (git sync, monitoring)
- Configure shared volumes between main and sidecar containers
- Automate image mirroring with CI/CD pipelines
- Document sidecar image sources and update procedures
6. Infrastructure as Code Best Practices Learned
From This Journey
1. Iterative Development is Key
   - 54 commits show the value of small, incremental changes
   - Each commit isolated a specific problem
   - Easier to roll back and debug
2. Local Testing Saves Time & Money
   - Docker Compose for local validation (commits #9, #10, #26)
   - Reduced ECS debugging iterations
   - Faster feedback loop
3. Separate Stateful from Stateless
   - Storage stack (VPC, RDS, cache) introduced in commit #24
   - Service stack (ECS, ALB) can be destroyed/recreated quickly
   - Independent lifecycle management
4. Secrets Management
   - Never hardcode secrets
   - Use Secrets Manager for all sensitive data
   - Proper URL encoding for special characters
5. Start Simple, Add Complexity
   - Initial "AI slop" was too complex
   - Simplified over time (commits #38, #39)
   - Remove before adding
6. Documentation as You Go
   - Git commit messages tell the story
   - Document pitfalls immediately
   - Future you will thank present you
7. Recommended Development Workflow
Based on lessons learned from 54 commits:
Phase 1: Foundation (Do This Right First)
1. Set up the CDK project with proper structure
2. Create the storage stack (VPC, RDS, Cache)
3. Configure secrets in Secrets Manager
4. Set up cdk.context.json with AZ mappings
5. Create .env for environment config
Phase 2: Docker Development
1. Create a minimal Dockerfile extending apache/airflow
2. Set up docker-compose.yml for local testing
3. Test all Airflow services locally
4. Validate secrets, DB connection, volumes
5. Optimize the build with UV and multi-stage builds
Phase 3: AWS Deployment (Incremental)
1. Deploy the storage stack first
2. Push the Docker image to ECR
3. Deploy the webserver only (with ALB)
4. Deploy the scheduler
5. Deploy the triggerer
6. Deploy the worker last (with auto-scaling)
Phase 4: Optimization
1. Enable health checks after services are stable
2. Add the circuit breaker
3. Configure auto-scaling
4. Add Spot instances for workers
5. Set up monitoring and alarms
Phase 5: Production Hardening
1. Enable secret rotation
2. Configure backup policies
3. Set up CI/CD pipelines
4. Document runbooks
5. Load testing and tuning
8. Cost Optimization Insights
What Worked
- Spot Instances for Workers (commit #45): 50-70% cost savings
- ARM64 Architecture (commits #13, #14): better price/performance
- Separate Stacks (commit #24): can destroy/recreate services without affecting storage
- Auto-scaling (commit #51): only pay for the capacity you use
What Didn’t Work
- EFS (commits #15-18): more complex and costly than git-sync
- Always-on development: could use dev/prod environment separation
9. Key Metrics
Development Journey
- Total Commits: 54
- Major Refactors: 3 (commits #4, #24, #38)
- Pitfall Fixes: 47
- Days of Development: ~10 (Oct 28 - Nov 6, 2025)
Technical Debt Resolved
- Deleted Lines: ~6,000+ (cleanup commits #38, #39, #52)
- Documentation Added: 2,500+ lines
- Build Time Improvement: 24x faster (4 min → 10 sec)
Final Architecture
- Stacks: 2 (storage, service)
- ECS Services: 4 (webserver, scheduler, worker, triggerer)
- Supporting Services: 3 (RDS, Valkey, git-sync)
- Auto-scaling: 1-10 workers based on load
- Spot Instances: enabled for workers
10. Top 10 Lessons Learned
1. Test Locally First: Docker Compose saved countless hours of debugging in AWS
2. URL-Encode Passwords: special characters in passwords will break everything
3. Use Git-Sync, Not EFS: simpler, cheaper, more reliable for DAG deployment
4. Separate Stateful/Stateless: independent lifecycles mean faster iteration
5. UV Package Manager: 24x faster Docker builds
6. Start Simple: remove complexity before adding features
7. Security Groups Matter: document all connectivity requirements
8. Health Check Grace Periods: DB-dependent services need 300-600 seconds
9. Spot Instances for Workers: 50-70% cost savings, minimal impact
10. Commit Often: small commits make debugging and rollback easier
11. If Starting Over, Do This
Skip These Mistakes
- Don't use static airflow.cfg files
- Don't start with all services at once
- Don't use EFS for DAGs (use git-sync)
- Don't ignore CDK warnings
- Don't skip local testing
- Don't generate passwords with special characters (or URL-encode them)
- Don't use complex entrypoint scripts
Do These Things
- Start with the storage stack and webserver only
- Use UV for fast Docker builds from day 1
- Set up docker-compose.yml immediately
- Use environment variables exclusively
- Configure Spot instances from the start
- Document as you go
- Use git-sync for DAG deployment
Time Saved
Following this guide could reduce development time from ~80 hours to ~20 hours, avoiding the 47 pitfall commits.
Conclusion
This journey through 54 commits demonstrates the iterative nature of cloud infrastructure development. The key insight is that failure is part of the learning process, and each commit represents a lesson learned.
The most valuable outcome isn’t just a working Airflow deployment, but the understanding of:
- How ECS services communicate
- Why certain patterns work (git-sync) and others don't (EFS)
- The importance of local testing
- How to structure IaC for maintainability
These lessons are transferable to any ECS/Fargate deployment, not just Airflow.
Final Architecture Success Rate: 100% (after 54 commits and ~47 fixes)
Would Do It Again: Yes, with this document as a guide