Deploying Airflow to Amazon ECS



Executive Summary

Total Commits Analyzed: 54

Problem-Solving Commits: 47 (87%)

Documentation/Cleanup Commits: 7 (13%)

Problem Distribution by Category

| Category | Count | Percentage | Top Issues |
| --- | --- | --- | --- |
| Airflow-Specific Pitfalls | 23 | 43% | Configuration, permissions, secrets, service communication |
| AWS/ECS Pitfalls | 19 | 35% | Secret injection, health checks, resource allocation, spot instances |
| Networking Pitfalls | 12 | 22% | IPv6, service discovery, ALB configuration, security groups |
| CDK Pitfalls | 8 | 15% | Context management, synthesizer, stack dependencies |
| Docker/Container Pitfalls | 15 | 28% | Build performance, user permissions, multi-stage builds |
Note: Commits can appear in multiple categories

1. Airflow-Specific Pitfalls (23 commits)

Key Learnings

  • Airflow has complex inter-service dependencies that must be carefully orchestrated

  • Secret management and encoding are critical for multi-service deployments

  • DAG deployment strategy significantly impacts architecture complexity

  • Airflow 3.x has different requirements than 2.x

Issues Encountered

1.1 Configuration & Environment Variables (8 commits)

Problem: Airflow requires precise configuration across multiple services with shared secrets.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #39 | Removed static airflow.cfg, relied on env vars | Use environment variables exclusively instead of config files |
| #40 | ECS injecting “[object Object]” for secrets | Fixed secret reference syntax in CDK |
| #41 | Secrets still passed incorrectly | Simplified to direct secret ARN references |
| #28 | Changed service startup based on local dev | Validated config locally before ECS deployment |
| #4 | Major refactoring of environment variables | Centralized env var management in CDK constructs |
| #54 | Removed unneeded default env vars | Cleanup reduces complexity and confusion |
| #31 | Database connection failing | Proper connection string format with all required params |
| #30 | Special characters in DB password broke startup | URL-encode passwords with special characters |
Resolution Strategy:
  • ✅ Use environment variables exclusively, avoid static config files

  • ✅ Validate configuration locally with Docker Compose before ECS

  • ✅ URL-encode all connection strings and secrets

  • ✅ Use CDK constructs to centralize environment variable management

  • ✅ Reference secrets by ARN directly, not through complex wrapper objects
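
The URL-encoding point deserves a concrete example. A minimal sketch, assuming a SQLAlchemy-style Postgres URL of the kind Airflow reads from AIRFLOW__DATABASE__SQL_ALCHEMY_CONN (the host and credentials below are placeholders):

```python
from urllib.parse import quote_plus

def build_pg_conn_string(user: str, password: str, host: str,
                         port: int, db: str) -> str:
    """Build a SQLAlchemy-style Postgres URL, URL-encoding the password so
    characters like '@', ':' or '/' cannot break URL parsing."""
    return f"postgresql+psycopg2://{user}:{quote_plus(password)}@{host}:{port}/{db}"

# A password with special characters survives intact:
url = build_pg_conn_string("airflow", "p@ss:w/rd", "db.example.internal", 5432, "airflow")
# → postgresql+psycopg2://airflow:p%40ss%3Aw%2Frd@db.example.internal:5432/airflow
```

Generating the connection string in one place (e.g. a CDK construct) and encoding it there is what prevents the commit #30/#31 class of failures from recurring per service.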

1.2 Database & Storage (6 commits)

Problem: Database connectivity and encoding issues caused repeated failures.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #30 | Encoded chars in password caused ECS failure | URL-encode passwords, avoid special chars in generation |
| #31 | DB connection failing | Security groups, connection strings, timeouts |
| #36 | Health check failed due to DB connection | Increase grace period, validate connection in startup |
| #23 | Bad IPv6 URL when connecting to Postgres | Use proper endpoint format without IPv6 |
| #22 | Switched back to PostgreSQL from alternative | Stick with well-supported databases |
| #29 | Moved DB migration to API server | Ensures DB ready before other services start |

Resolution Strategy:

  • ✅ Generate passwords without special characters OR properly URL-encode them

  • ✅ Run database migrations in the webserver/API startup, not scheduler

  • ✅ Use RDS endpoint hostname format, avoid IPv6 references

  • ✅ Increase health check grace periods for DB-dependent services

  • ✅ Validate DB connectivity before starting service processes

1.3 Service Communication & Architecture (5 commits)

Problem: Airflow services need to communicate with each other reliably.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #46 | Worker can’t reach base URL | Added service discovery and internal ALB |
| #47 | Added common JWT secret | Shared JWT for inter-service authentication |
| #27 | Can’t get auth to work | Proper authentication backend configuration |
| #32 | Successfully deployed API and scheduler | Validated basic service communication |
| #24 | Split into storage/service stacks | Separate stateful from stateless resources |
Resolution Strategy:
  • ✅ Use AWS Cloud Map for service discovery between components

  • ✅ Configure internal ALB DNS for workers to reach webserver

  • ✅ Share authentication secrets (JWT, Fernet) across all services

  • ✅ Separate stateful (DB, cache) from stateless (compute) in stack design

  • ✅ Deploy webserver/API first, then scheduler, then workers

1.4 DAG Deployment & Storage (7 commits)

Problem: Finding the right way to deploy DAGs to all Airflow services.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #15 | Tried EFS after ‘aws’ CLI not found in image | EFS for shared storage |
| #17 | EFS mount point issues | Fixed mount configuration |
| #18 | EFS DNS modification issues | Changed mount attachment method |
| #19 | Switched from EFS to git-sync | Git-sync sidecar pattern |
| #20 | Added git-sync with SSL certs | Proper SSL for private repos |
| #21 | Git-sync needs root permissions | Run as root to write to filesystem |
| #43 | Read-only DAG folder caused startup error | Make DAG folder writable |
| #49 | Simplified git-sync configuration | Direct git sync approach |
| #53 | Fixed git-sync repo configuration | Correct repo URL and credentials |
Resolution Strategy:
  • ✅ Use git-sync sidecar pattern instead of EFS (simpler, more reliable)

  • ✅ Run git-sync as root with proper permissions

  • ✅ Ensure shared volume is writable by Airflow processes

  • ✅ Mirror git-sync image to ECR with SSL certificates for private repos

  • ✅ Configure git-sync to sync to correct path (/opt/airflow/dags)
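
For reference, a hedged sketch of the environment a git-sync sidecar needs. Variable names follow git-sync v3 (v4 renamed them to GITSYNC_REPO, GITSYNC_REF, and so on, so check the tag you mirror to ECR); the repository URL is a placeholder:

```python
# Hypothetical git-sync v3 sidecar environment -- values are illustrative.
GIT_SYNC_ENV = {
    "GIT_SYNC_REPO": "https://gitlab.example.com/data/airflow-dags.git",  # assumed URL
    "GIT_SYNC_BRANCH": "main",
    "GIT_SYNC_ROOT": "/opt/airflow/dags",  # mount point of the shared volume
    "GIT_SYNC_DEST": "repo",               # checkout lands in ROOT/DEST
    "GIT_SYNC_WAIT": "60",                 # seconds between pulls
}

def dags_folder(env: dict) -> str:
    """The AIRFLOW__CORE__DAGS_FOLDER the Airflow containers should point at:
    git-sync checks the repo out into GIT_SYNC_ROOT/GIT_SYNC_DEST."""
    return f'{env["GIT_SYNC_ROOT"]}/{env["GIT_SYNC_DEST"]}'
```

The main pitfall this encodes: the Airflow services must read from the synced subdirectory, not the volume root, or they see an empty (or read-only) DAG folder.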

1.5 Component Health & Scaling (6 commits)

Problem: Getting all Airflow components (webserver, scheduler, worker, triggerer) healthy.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #33 | Triggerer failing health checks | Temporarily disabled to isolate issues |
| #34 | Deployed without health checks | Removed to isolate startup from health-check issues |
| #35 | All but worker running | Webserver, scheduler, triggerer operational |
| #44 | Re-enabled triggerer | Fixed config and re-added successfully |
| #45 | Worker killed due to high memory | Doubled memory to 2GB |
| #51 | Added worker auto-scaling | Scale 1-10 based on CPU/memory |
Resolution Strategy:
  • ✅ Disable health checks initially to isolate startup issues

  • ✅ Deploy services incrementally: webserver → scheduler → triggerer → worker

  • ✅ Monitor memory usage and increase allocations (workers need 2GB+)

  • ✅ Implement auto-scaling for workers (1-10 instances)

  • ✅ Re-enable health checks after confirming services start successfully

1.6 Permissions & User Management (3 commits)

Problem: Container user permissions for filesystem access.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #37 | Trying to fix airflow user with root permissions | User/group permission conflicts |
| #21 | Git-sync needs root permissions to write | Run git-sync as root |
| #38 | Simplified user management in Docker | Use base image defaults |
Resolution Strategy:
  • ✅ Run git-sync sidecar as root (uid 0) for file system writes

  • ✅ Run Airflow processes as airflow user (default in base image)

  • ✅ Use shared volumes with appropriate permissions

  • ✅ Don’t override user settings unless necessary


2. AWS/ECS Pitfalls (19 commits)

Key Learnings

  • ECS has specific requirements for secret injection and environment variables

  • Health checks need careful tuning for startup times

  • Resource allocation (CPU/memory) requires iteration based on actual usage

  • Spot instances can significantly reduce costs

Issues Encountered

2.1 Secret Management (5 commits)

Problem: ECS secret injection has specific syntax requirements.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #40 | ECS injecting “[object Object]” literally | Fixed CDK secret reference syntax |
| #41 | Secrets still passed incorrectly | Used direct secret ARN references |
| #47 | Added common JWT secret | Centralized secret in Secrets Manager |
| #52 | Fixed Fernet key generation | Proper secret generation in storage stack |
| #30 | Password encoding issues | URL-encode or avoid special characters |
Resolution Strategy:
  • ✅ Use ecs.Secret.fromSecretsManager(secret) or direct ARN references

  • ✅ Avoid complex object wrapping in secret definitions

  • ✅ Store all secrets in Secrets Manager, not environment variables

  • ✅ Generate secrets in storage stack, reference in service stack

  • ✅ Use Secrets Manager rotation for production environments
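
Commit #52's Fernet-key fix comes down to the key's format: 32 random bytes, url-safe base64-encoded (44 characters). A stdlib-only sketch of generating one, under the assumption that you then store it in Secrets Manager and inject it as AIRFLOW__CORE__FERNET_KEY:

```python
import base64
import os

def generate_fernet_key() -> str:
    """32 random bytes, url-safe base64-encoded -- the same format the
    cryptography library's Fernet.generate_key() produces."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode()
```

Every Airflow service must receive the same key; generating it once in the storage stack and referencing it everywhere else is what keeps the services mutually decryptable.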

2.2 Health Checks & Service Startup (8 commits)

Problem: ECS health checks failed due to timing and configuration issues.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #42 | ALB considering task failed, need more time | Increased grace period and check intervals |
| #36 | Health check failed due to DB connection | Added DB connection validation |
| #34 | Deployed without health checks temporarily | Isolated startup from health-check issues |
| #33 | Triggerer failing health checks | Disabled until config fixed |
| #35 | Most services up without health checks | Validated core functionality first |
| #15 | Health check failed, probably DB connection | Database initialization timing |
| #12 | Added ECS circuit breaker | Automatic rollback on deployment failures |
| #54 | Removed unneeded health checks | Cleaned up redundant checks |
Resolution Strategy:
  • ✅ Set health check grace period to 300-600 seconds for DB-dependent services

  • ✅ Use circuit breaker for automatic rollback on failures

  • ✅ Disable health checks temporarily to isolate startup issues

  • ✅ Validate database connectivity before enabling health checks

  • ✅ Use ALB target group health checks for webserver

  • ✅ Configure startup scripts to wait for dependencies
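
The "wait for dependencies" step can be a small TCP probe run in the container entrypoint before the Airflow process starts, so the ECS grace period isn't burned on a database that is still initializing. A minimal sketch (host and port are placeholders for your RDS endpoint):

```python
import socket
import time

def wait_for_tcp(host: str, port: int, timeout_s: float = 300.0,
                 interval_s: float = 2.0) -> bool:
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection does DNS resolution + connect in one call
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

A TCP probe only proves the port is open, not that credentials work, so it complements (rather than replaces) the DB connection validation above.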

2.3 Resource Allocation (5 commits)

Problem: Services need adequate CPU and memory resources.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #45 | Worker killed due to high memory usage | Doubled memory to 2GB |
| #42 | Bumped to 1 vCPU | Increased from 0.5 to 1 vCPU |
| #51 | Added auto-scaling for workers | Scale 1-10 based on metrics |
| #45 | Added Spot instances | Cost savings on worker fleet |
| #11 | Added temp directory configuration | Proper temp storage allocation |
Resolution Strategy:
  • ✅ Start with minimum: webserver/scheduler 1 vCPU/2GB, workers 1 vCPU/2GB

  • ✅ Monitor CloudWatch metrics for CPU/memory utilization

  • ✅ Use Spot instances for workers (50-70% cost savings)

  • ✅ Implement auto-scaling based on CPU (70%) and memory (80%) targets

  • ✅ Configure temp directories with adequate storage

2.4 Container Architecture (4 commits)

Problem: ARM64 vs AMD64 architecture differences.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #13 | Trying ARM64 for entrypoint.sh problem | Switched to ARM64 architecture |
| #14 | Specific ARM64 tagging | Explicit platform tagging |
| #12 | Architecture differences in ECS | Set explicit architecture in task definitions |
| #8 | Multi-stage builds with UV | Optimized for target architecture |
Resolution Strategy:
  • ✅ Choose ARM64 for better price/performance (Graviton processors)

  • ✅ Tag Docker images with an explicit platform (--platform linux/arm64)

  • ✅ Set runtimePlatform in ECS task definitions

  • ✅ Test locally with matching architecture

2.5 Availability & Deployment (3 commits)

Problem: Availability zone and deployment configuration issues.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #6 | Getting error about availability zones | Added AZ mappings to context |
| #7 | Some services not starting up | Fixed ECR image references |
| #25 | Fixed subnet AZ and naming | Proper subnet selection |
Resolution Strategy:
  • ✅ Cache AZ lookups in cdk.context.json to avoid API calls

  • ✅ Use multiple AZs for high availability

  • ✅ Ensure subnets span multiple AZs properly

  • ✅ Use CDK context for consistent AZ selection


3. Networking Pitfalls (12 commits)

Key Learnings

  • Service-to-service communication requires proper security groups and discovery

  • ALB configuration is critical for external and internal traffic

  • IPv6 can cause unexpected connection issues

  • VPC design impacts cost and complexity

Issues Encountered

3.1 Service Discovery & Internal Communication (4 commits)

Problem: Services couldn’t reliably communicate with each other.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #46 | Worker can’t reach base URL | Added AWS Cloud Map service discovery |
| #46 | Worker to API connectivity | Configured internal ALB DNS |
| #46 | Security group rules missing | Added worker-to-webserver connectivity |
| #31 | DB connection failing | Security group for RDS access |
Resolution Strategy:
  • ✅ Use AWS Cloud Map for service discovery (servicename.namespace)

  • ✅ Configure internal ALB for worker-to-webserver communication

  • ✅ Set up security groups to allow inter-service traffic

  • ✅ Use private subnets for services, public for ALB only

  • ✅ Document all security group rules for troubleshooting

3.2 Load Balancer Configuration (5 commits)

Problem: ALB health checks and routing configuration.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #42 | ALB considering task failed | Increased timeouts and grace periods |
| #2 | Initial ALB setup with webserver | Public ALB for external access |
| #46 | Added internal communication path | Internal ALB/service discovery |
| #35 | Health check configuration | Proper health check paths |
| #54 | Removed unneeded health checks | Simplified configuration |
Resolution Strategy:
  • ✅ Use public ALB only for webserver/UI

  • ✅ Set health check path to /health or /api/v1/health

  • ✅ Configure deregistration delay to 30 seconds

  • ✅ Set healthy/unhealthy threshold appropriately (2/3)

  • ✅ Use longer intervals (30s) for services with slow startup

3.3 Database Connectivity (3 commits)

Problem: RDS connection string format and network access.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #23 | Bad IPv6 URL when connecting to Postgres | Use hostname format without brackets |
| #31 | DB connection failing | Security groups and connection params |
| #36 | DB connection in health checks | Validate connection before health checks |
Resolution Strategy:
  • ✅ Use RDS endpoint hostname directly: dbinstance.xxxxxx.region.rds.amazonaws.com

  • ✅ Avoid IPv6 format with brackets: [::1]

  • ✅ Configure security group ingress on port 5432 from ECS tasks

  • ✅ Use private subnets for RDS (never public)

  • ✅ Test connection string format locally first

3.4 VPC & Subnet Configuration (3 commits)

Problem: Subnet and availability zone configuration.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #6 | AZ errors | Fixed AZ configuration and context |
| #25 | Subnet AZ issues | Proper subnet selection in correct AZs |
| #24 | VPC in storage stack | Centralized VPC in stateful stack |
Resolution Strategy:
  • ✅ Create VPC in storage/stateful stack

  • ✅ Use 2-3 AZs for high availability

  • ✅ Separate public and private subnets

  • ✅ Use NAT gateway for private subnet internet access

  • ✅ Cache AZ lookups in cdk.context.json


4. CDK Pitfalls (8 commits)

Key Learnings

  • Stack organization impacts maintainability and deployment

  • Context management is crucial for consistent deployments

  • CDK warnings should be addressed early

  • Stack dependencies must be explicitly defined

Issues Encountered

4.1 Stack Organization (3 commits)

Problem: Monolithic stack vs separated concerns.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #24 | Split into storage and service stacks | Separated stateful from stateless |
| #4 | Major stack refactoring | Better organization and structure |
| #2 | Initial monolithic stack | Started with everything in one stack |
Resolution Strategy:
  • ✅ Separate stateful resources (VPC, RDS, Cache) into storage-stack

  • ✅ Put stateless compute (ECS services) in service-stack

  • ✅ Use stack outputs and cross-stack references

  • ✅ Define explicit dependencies between stacks

  • ✅ Benefits: independent lifecycle, easier testing, faster iteration on services

4.2 Context & Configuration Management (3 commits)

Problem: CDK context and synthesizer warnings.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #5 | Fixing CDK warnings | Added proper synthesizer configuration |
| #6 | AZ lookup issues | Added cdk.context.json with AZ mappings |
| #6 | Added .env file | Environment-specific configuration |
Resolution Strategy:
  • ✅ Use cdk.context.json to cache AZ lookups (avoid API calls)

  • ✅ Configure DefaultStackSynthesizer properly

  • ✅ Use .env files for environment-specific values

  • ✅ Address CDK warnings during development, not later

  • ✅ Version control cdk.context.json for team consistency
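
For illustration, a cached AZ lookup in cdk.context.json looks like the fragment below. The account ID is a placeholder; CDK writes these keys itself on first synth, and committing the file is what makes later synths deterministic and offline:

```json
{
  "availability-zones:account=123456789012:region=us-east-1": [
    "us-east-1a",
    "us-east-1b",
    "us-east-1c"
  ]
}
```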

4.3 Resource References & Dependencies (2 commits)

Problem: Properly referencing resources across stacks.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #24 | Cross-stack references for storage resources | Export/import pattern |
| #40 | Secret reference syntax errors | Proper CDK secret constructs |
Resolution Strategy:
  • ✅ Use stack.exportValue() and Fn.importValue() for cross-stack refs

  • ✅ Pass resources as constructor parameters when possible

  • ✅ Use proper CDK constructs (Secret.fromSecretArn) instead of raw ARNs

  • ✅ Define dependencies explicitly with addDependency()

  • ✅ Document cross-stack dependencies in README


5. Docker/Container Pitfalls (15 commits)

Key Learnings

  • Build performance matters significantly for iteration speed

  • Multi-stage builds reduce image size and build time

  • User permissions are tricky with shared volumes

  • Base image selection impacts complexity

Issues Encountered

5.1 Build Performance (3 commits)

Problem: Docker builds were extremely slow (4+ minutes).

| Commit | Issue | Resolution |
| --- | --- | --- |
| #8 | Got Docker building faster with UV | 4 min → 10 sec improvement |
| #8 | Implemented multi-stage builds | Separate builder and runtime stages |
| #39 | Simplified to leverage base image UV | Use apache/airflow’s built-in tools |
Resolution Strategy:
  • ✅ Use UV package manager for 25-50x faster pip installs

  • ✅ Multi-stage builds: builder stage + lean runtime stage

  • ✅ Copy pre-built virtual environment from builder

  • ✅ Leverage capabilities of base image (apache/airflow has UV)

  • ✅ Cache layers effectively by ordering Dockerfile commands

Impact: Build time: 4 minutes → 10 seconds (24x improvement)
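
The multi-stage UV pattern from commits #8 and #39 can be sketched roughly as follows. The Airflow tag, requirements path, and install target are assumptions; recent apache/airflow images ship uv, and the exact install flags depend on the base image's Python layout:

```dockerfile
# Builder stage: resolve and install Python deps with uv (much faster than pip).
FROM apache/airflow:3.1.0 AS builder
COPY requirements.txt /tmp/requirements.txt
# Depending on the base image, uv may need a venv or a system/user install mode.
RUN uv pip install --no-cache -r /tmp/requirements.txt

# Runtime stage: copy only the installed packages, leaving build caches behind.
FROM apache/airflow:3.1.0
COPY --from=builder /home/airflow/.local /home/airflow/.local
```

Keeping requirements.txt as its own COPY layer means dependency installs are cached until the file actually changes, which is where most of the iteration-speed win comes from.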

5.2 Image Complexity (4 commits)

Problem: Over-complicated Docker image with unnecessary scripts.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #38 | Massive cleanup: removed 1,057 lines | Removed test/debug scripts |
| #39 | Removed airflow.cfg file | Use environment variables only |
| #39 | Simplified Dockerfile by 16 lines | Leveraged base image features |
| #6 | Temporarily removed Docker files | Clean slate approach |
Resolution Strategy:
  • ✅ Start with minimal Dockerfile extending apache/airflow base

  • ✅ Avoid custom entrypoint scripts unless absolutely necessary

  • ✅ Use environment variables instead of config files

  • ✅ Remove debugging/testing scripts from production image

  • ✅ Keep Dockerfile under 50 lines total

5.3 Local Development (3 commits)

Problem: Need to test Docker images locally before ECS.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #9 | Created Docker Compose for local testing | Full local environment |
| #10 | Got local dev working | Validated configuration locally |
| #26 | Used local deployment to fix ECS issues | Local testing found issues |
Resolution Strategy:
  • ✅ Create docker-compose.yml matching ECS configuration

  • ✅ Test all configuration locally before deploying to ECS

  • ✅ Use same environment variables in both environments

  • ✅ Validate database connectivity, secret injection, volumes locally

  • ✅ Significantly reduces cloud debugging time and cost
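
A compose file for this purpose can stay small. The sketch below assumes Airflow 3-style subcommands, a local image tag, and a shared .env file, all placeholders for your setup; the point is to run the exact image and environment variables the ECS tasks will use:

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow     # local only
      POSTGRES_DB: airflow
  airflow-api:
    image: my-airflow:local          # same Dockerfile you push to ECR
    env_file: .env                   # same variables the task definition injects
    command: api-server
    ports:
      - "8080:8080"
    depends_on:
      - postgres
  airflow-scheduler:
    image: my-airflow:local
    env_file: .env
    command: scheduler
    depends_on:
      - postgres
```

If a service starts here but not on ECS, the difference is almost always networking, secrets injection, or resource limits, which narrows the cloud-side debugging considerably.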

5.4 Base Image & Dependencies (3 commits)

Problem: Managing Python dependencies and base image.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #7 | Added build_and_push.sh for ECR | Automated image publishing |
| #8 | Used UV for dependencies | Faster installs |
| #15 | ‘aws’ CLI not found in image | Base image didn’t include AWS CLI |
Resolution Strategy:
  • ✅ Use official apache/airflow:3.1.0 as base image

  • ✅ Only install additional dependencies if truly needed

  • ✅ Use UV or pip with --no-cache-dir to minimize layer size

  • ✅ Create automated build/push scripts for CI/CD

  • ✅ Tag images with git commit SHA for traceability

5.5 Sidecar Containers (2 commits)

Problem: Running git-sync as sidecar with proper image.

| Commit | Issue | Resolution |
| --- | --- | --- |
| #20 | Added git-sync image to ECR | Mirrored k8s.gcr.io/git-sync |
| #50 | Created GitLab CI for git-sync mirroring | Automated mirror updates |
Resolution Strategy:
  • ✅ Mirror external images (k8s git-sync) to ECR for reliability

  • ✅ Use sidecar pattern for orthogonal concerns (git sync, monitoring)

  • ✅ Configure shared volumes between main and sidecar containers

  • ✅ Automate image mirroring with CI/CD pipelines

  • ✅ Document sidecar image sources and update procedures


6. Infrastructure as Code Best Practices Learned

From This Journey

  1. Iterative Development is Key
  • 54 commits show the value of small, incremental changes
  • Each commit isolated a specific problem
  • Easier to roll back and debug

  2. Local Testing Saves Time & Money
  • Docker Compose for local validation (commits #9, #10, #26)
  • Reduced ECS debugging iterations
  • Faster feedback loop

  3. Separate Stateful from Stateless
  • Storage stack (VPC, RDS, Cache) - commit #24
  • Service stack (ECS, ALB) can be destroyed/recreated quickly
  • Independent lifecycle management

  4. Secrets Management
  • Never hardcode secrets
  • Use Secrets Manager for all sensitive data
  • Proper URL encoding for special characters

  5. Start Simple, Add Complexity
  • Initial “AI slop” was too complex
  • Simplified over time (commits #38, #39)
  • Remove before adding

  6. Documentation as You Go
  • Git commit messages tell the story
  • Document pitfalls immediately
  • Future you will thank present you

7. Recommended Deployment Plan

Based on lessons learned from 54 commits:

Phase 1: Foundation (Do This Right First)

  1. ✅ Set up CDK project with proper structure
  2. ✅ Create storage stack (VPC, RDS, Cache)
  3. ✅ Configure secrets in Secrets Manager
  4. ✅ Set up cdk.context.json with AZ mappings
  5. ✅ Create .env for environment config

Phase 2: Docker Development

  1. ✅ Create minimal Dockerfile extending apache/airflow
  2. ✅ Set up docker-compose.yml for local testing
  3. ✅ Test all Airflow services locally
  4. ✅ Validate secrets, DB connection, volumes
  5. ✅ Optimize build with UV/multi-stage

Phase 3: AWS Deployment (Incremental)

  1. ✅ Deploy storage stack first
  2. ✅ Push Docker image to ECR
  3. ✅ Deploy webserver only (with ALB)
  4. ✅ Deploy scheduler
  5. ✅ Deploy triggerer
  6. ✅ Deploy worker last (with auto-scaling)

Phase 4: Optimization

  1. ✅ Enable health checks after services stable
  2. ✅ Add circuit breaker
  3. ✅ Configure auto-scaling
  4. ✅ Add Spot instances for workers
  5. ✅ Set up monitoring and alarms

Phase 5: Production Hardening

  1. ✅ Enable secret rotation
  2. ✅ Configure backup policies
  3. ✅ Set up CI/CD pipelines
  4. ✅ Document runbooks
  5. ✅ Load testing and tuning


8. Cost Optimization Insights

What Worked

  • Spot Instances for Workers (commit #45): 50-70% cost savings

  • ARM64 Architecture (commits #13, #14): Better price/performance

  • Separate Stacks (commit #24): Can destroy/recreate services without affecting storage

  • Auto-scaling (commit #51): Only pay for capacity you use

What Didn’t Work

  • EFS (commits #15-18): More complex and costly than git-sync

  • Always-on development: Could use dev/prod environment separation


9. Key Metrics

Development Journey

  • Total Commits: 54

  • Major Refactors: 3 (commits #4, #24, #38)

  • Pitfall Fixes: 47

  • Days of Development: ~10 days (Oct 28 - Nov 6, 2025)

Technical Debt Resolved

  • Deleted Lines: ~6,000+ (cleanup commits #38, #39, #52)

  • Documentation Added: 2,500+ lines

  • Build Time Improvement: 24x faster (4 min → 10 sec)

Final Architecture

  • Stacks: 2 (storage, service)

  • ECS Services: 4 (webserver, scheduler, worker, triggerer)

  • Supporting Services: 3 (RDS, Valkey, git-sync)

  • Auto-scaling: 1-10 workers based on load

  • Spot Instances: Enabled for workers


10. Top 10 Lessons Learned

  1. Test Locally First - Docker Compose saved countless hours debugging in AWS
  2. URL-Encode Passwords - Special characters in passwords will break everything
  3. Use Git-Sync, Not EFS - Simpler, cheaper, more reliable for DAG deployment
  4. Separate Stateful/Stateless - Independent lifecycle = faster iteration
  5. UV Package Manager - 24x faster Docker builds
  6. Start Simple - Remove complexity before adding features
  7. Security Groups Matter - Document all connectivity requirements
  8. Health Check Grace Periods - DB-dependent services need 300-600 seconds
  9. Spot Instances for Workers - 50-70% cost savings, minimal impact
  10. Commit Often - Small commits make debugging and rollback easier


11. If Starting Over, Do This

Skip These Mistakes

  • ❌ Don’t use static airflow.cfg files

  • ❌ Don’t start with all services at once

  • ❌ Don’t use EFS for DAGs (use git-sync)

  • ❌ Don’t ignore CDK warnings

  • ❌ Don’t skip local testing

  • ❌ Don’t generate passwords with special characters (or URL-encode them)

  • ❌ Don’t use complex entrypoint scripts

Do These Things

  • ✅ Start with storage stack + webserver only

  • ✅ Use UV for fast Docker builds from day 1

  • ✅ Set up docker-compose.yml immediately

  • ✅ Use environment variables exclusively

  • ✅ Configure Spot instances from the start

  • ✅ Document as you go

  • ✅ Use git-sync for DAG deployment

Time Saved

Following this guide could reduce development time from ~80 hours to ~20 hours, avoiding the 47 pitfall commits.


Conclusion

This journey through 54 commits demonstrates the iterative nature of cloud infrastructure development. The key insight is that failure is part of the learning process, and each commit represents a lesson learned.

The most valuable outcome isn’t just a working Airflow deployment, but the understanding of:

  • How ECS services communicate

  • Why certain patterns work (git-sync) and others don’t (EFS)

  • The importance of local testing

  • How to structure IaC for maintainability

These lessons are transferable to any ECS/Fargate deployment, not just Airflow.

Final Architecture Success Rate: 100% (after 54 commits and ~47 fixes)

Would Do It Again: Yes, with this document as a guide
