Running Airflow on Kubernetes offers incredible flexibility — every task gets its own isolated pod with precise resource allocation, and you can scale to zero workers when idle. This guide covers a production-ready Kubernetes setup using the official Helm chart.
Architecture Overview
There are two main executor options for Kubernetes deployments:
| | KubernetesExecutor | CeleryExecutor on K8s |
|---|---|---|
| Worker scaling | Per-task pods (scale to zero) | Long-running worker pods |
| Task isolation | Full pod isolation | Shared worker process |
| Cold start | ~10–30s pod startup latency | Near-instant (worker already running) |
| Resource efficiency | Excellent for bursty workloads | Better for high-throughput, low-latency |
| Complexity | Lower (no Celery/Redis) | Higher (needs Redis, Flower) |
For most teams, KubernetesExecutor is the right choice unless you have strict sub-second task latency requirements.
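To make the cold-start tradeoff concrete, here is a back-of-the-envelope sketch (illustrative numbers only, using the ~10-30 s pod-startup figure from the table above; the helper function is hypothetical):

```python
def startup_overhead_fraction(task_runtime_s: float, pod_startup_s: float = 20.0) -> float:
    """Fraction of total wall-clock time spent waiting for the worker pod to start."""
    return pod_startup_s / (task_runtime_s + pod_startup_s)

# A 10-minute batch task barely notices a 20 s cold start...
print(f"{startup_overhead_fraction(600):.0%}")  # 3%
# ...but for a 5-second task, pod startup dominates.
print(f"{startup_overhead_fraction(5):.0%}")    # 80%
```

If your tasks run for minutes, KubernetesExecutor's startup latency is noise; if they run for seconds, the CeleryExecutor's warm workers start to pay for their extra moving parts.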
Prerequisites
# Required tools
kubectl >= 1.24
helm >= 3.10
# A working Kubernetes cluster (GKE, EKS, AKS, or local kind/minikube)

Installing with the Official Helm Chart
# Add the Apache Airflow Helm chart repo
helm repo add apache-airflow https://airflow.apache.org
helm repo update
# Create a dedicated namespace
kubectl create namespace airflow
# Install Airflow with KubernetesExecutor
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow \
  --version 1.15.0 \
  --set executor=KubernetesExecutor \
  --set dags.gitSync.enabled=true \
  --wait

But you'll want a values.yaml for any real deployment. Let's build one.
Production values.yaml
# ── Executor ────────────────────────────────────────────────────────────────
executor: "KubernetesExecutor"

# ── Image ───────────────────────────────────────────────────────────────────
images:
  airflow:
    repository: apache/airflow
    tag: "2.10.4-python3.11"
    pullPolicy: IfNotPresent

# ── DAG Deployment via GitSync ──────────────────────────────────────────────
dags:
  gitSync:
    enabled: true
    repo: https://github.com/your-org/airflow-dags.git
    branch: main
    rev: HEAD
    depth: 1
    maxFailures: 3
    subPath: "dags"
    # For private repos, use an SSH key secret:
    # sshKeySecret: airflow-gitsync-ssh-key
    period: 60s  # Sync every 60 seconds
    wait: 10     # Seconds to wait after first sync

# ── Web Server ──────────────────────────────────────────────────────────────
webserver:
  replicas: 2  # HA webserver
  resources:
    limits:
      cpu: "1"
      memory: 2Gi
    requests:
      cpu: 500m
      memory: 1Gi
  service:
    type: ClusterIP  # Use an Ingress controller in production
  livenessProbe:
    initialDelaySeconds: 30
    periodSeconds: 15
  readinessProbe:
    initialDelaySeconds: 20
    periodSeconds: 10

# ── Scheduler ───────────────────────────────────────────────────────────────
scheduler:
  replicas: 2  # HA scheduler (requires PostgreSQL)
  resources:
    limits:
      cpu: "2"
      memory: 4Gi
    requests:
      cpu: "1"
      memory: 2Gi

# ── Triggerer (for Deferrable Operators) ────────────────────────────────────
triggerer:
  enabled: true
  replicas: 1
  resources:
    limits:
      cpu: "1"
      memory: 2Gi
    requests:
      cpu: 500m
      memory: 1Gi

# ── PostgreSQL (production: use an external managed DB) ─────────────────────
postgresql:
  enabled: false  # Disable bundled PostgreSQL

# Use an external database secret
data:
  metadataSecretName: airflow-metadata-secret
  resultBackendSecretName: airflow-result-backend-secret

# ── Kubernetes Executor Pod Template ────────────────────────────────────────
workers:
  # Default pod template for all KubernetesExecutor workers
  podTemplateFile: pod_template.yaml

# ── Persistent Logs ─────────────────────────────────────────────────────────
logs:
  persistence:
    enabled: true
    size: 20Gi
    storageClassName: "standard"  # Your storage class

# ── RBAC & Security ─────────────────────────────────────────────────────────
rbac:
  create: true
  createSCCRoleBinding: false
serviceAccount:
  create: true
  name: "airflow"
  annotations:
    # GKE Workload Identity — annotate with your GCP SA
    iam.gke.io/gcp-service-account: airflow@your-project.iam.gserviceaccount.com

# ── Airflow Configuration ────────────────────────────────────────────────────
config:
  core:
    load_examples: "False"
    parallelism: "64"
    max_active_runs_per_dag: "16"
    max_active_tasks_per_dag: "32"
  scheduler:
    min_file_process_interval: "30"
    dag_dir_list_interval: "30"
    max_dagruns_to_create_per_loop: "10"
  kubernetes_executor:
    enable_tcp_keepalive: "True"
    verify_ssl: "True"
    delete_worker_pods: "True"  # Clean up pods after completion
    delete_worker_pods_on_failure: "False"  # Keep failed pods for debugging
    worker_pods_creation_batch_size: "8"
  logging:
    remote_logging: "True"
    remote_log_conn_id: "aws_s3_logs"
    remote_base_log_folder: "s3://your-bucket/airflow-logs"
  webserver:
    expose_config: "False"
    enable_proxy_fix: "True"

Pod Template for KubernetesExecutor
The pod template controls how each task pod is created:
apiVersion: v1
kind: Pod
metadata:
  name: placeholder
  labels:
    tier: airflow
    component: worker
spec:
  serviceAccountName: airflow
  # Ensure pods land on nodes with the right labels
  nodeSelector:
    workload-type: airflow-workers
  tolerations:
    - key: "airflow-worker"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: base
      image: apache/airflow:2.10.4-python3.11
      imagePullPolicy: IfNotPresent
      # Default resources — DAGs can override these
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "500m"
          memory: 2Gi
      env:
        - name: AIRFLOW__CORE__EXECUTOR
          value: KubernetesExecutor
      volumeMounts:
        - name: dags
          mountPath: /opt/airflow/dags
        - name: logs
          mountPath: /opt/airflow/logs
  volumes:
    - name: dags
      emptyDir: {}
    - name: logs
      emptyDir: {}
  restartPolicy: Never  # Important: pods must not restart on failure
  terminationGracePeriodSeconds: 600

Per-Task Pod Customization
One of the KubernetesExecutor's superpowers is letting individual tasks request different resources or images via executor_config:
from datetime import datetime

from airflow.decorators import dag, task
from kubernetes.client import models as k8s


@dag(dag_id="pod_customization", schedule="@daily", start_date=datetime(2024, 1, 1))
def pod_customization():
    # ML training task — needs GPU and lots of RAM
    @task(
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",
                            resources=k8s.V1ResourceRequirements(
                                limits={
                                    "cpu": "8",
                                    "memory": "32Gi",
                                    "nvidia.com/gpu": "1",
                                },
                                requests={
                                    "cpu": "4",
                                    "memory": "16Gi",
                                    "nvidia.com/gpu": "1",
                                },
                            ),
                        )
                    ],
                    node_selector={"cloud.google.com/gke-accelerator": "nvidia-l4"},
                    tolerations=[
                        k8s.V1Toleration(key="nvidia.com/gpu", operator="Exists")
                    ],
                )
            )
        }
    )
    def train_model() -> str:
        import subprocess

        subprocess.run(["python", "train.py", "--epochs", "10"], check=True)
        return "s3://bucket/models/latest"

    # Lightweight task — minimal resources
    @task(
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",
                            resources=k8s.V1ResourceRequirements(
                                limits={"cpu": "500m", "memory": "512Mi"},
                                requests={"cpu": "100m", "memory": "256Mi"},
                            ),
                        )
                    ]
                )
            )
        }
    )
    def notify(model_path: str) -> None:
        print(f"Model saved to {model_path}")

    model = train_model()
    notify(model)


pod_customization()

Secrets Management
Never hardcode secrets. Use Kubernetes secrets or a dedicated secrets backend:
# Create the metadata DB secret
kubectl create secret generic airflow-metadata-secret \
  --namespace airflow \
  --from-literal=connection="postgresql+psycopg2://airflow:password@postgres/airflow"

# Create secrets for DAG connections
kubectl create secret generic airflow-connections \
  --namespace airflow \
  --from-literal=AIRFLOW_CONN_SNOWFLAKE="snowflake://user:pass@account/db?warehouse=WH" \
  --from-literal=AIRFLOW_CONN_AWS_DEFAULT="aws:///?region_name=us-east-1"

Reference them in values.yaml:
extraEnvFrom:
  - secretRef:
      name: airflow-connections

For production, use a proper secrets backend (HashiCorp Vault or AWS Secrets Manager):
# values.yaml
config:
  secrets:
    backend: "airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend"
    backend_kwargs: '{"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}'

Deploying Changes
# Apply updated values.yaml
helm upgrade airflow apache-airflow/airflow \
  --namespace airflow \
  --version 1.15.0 \
  --values k8s/values.yaml \
  --wait
# Check rollout status
kubectl rollout status deployment/airflow-webserver -n airflow
kubectl rollout status deployment/airflow-scheduler -n airflow
# View running pods
kubectl get pods -n airflow
# Tail scheduler logs
kubectl logs -n airflow -l component=scheduler --follow

Troubleshooting
# See why a task pod failed
kubectl get pods -n airflow -l dag_id=my_dag,task_id=failing_task
# Get logs from a failed worker pod
kubectl logs -n airflow <pod-name> --previous
# Describe pod for scheduling errors (OOMKilled, etc.)
kubectl describe pod -n airflow <pod-name>
# Check resource quotas aren't blocking pod creation
kubectl describe resourcequota -n airflow

Common issues:
- OOMKilled — increase memory limits in pod template
- ImagePullBackOff — check image name/tag and registry access
- Pending — likely node selector or resource request issue; check kubectl describe node
- Tasks stuck in queued — check scheduler logs; often a pod template parsing error
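When failed pods get cleaned up before you can inspect them, the container status history in the pod list still identifies OOM kills. Here is a hedged sketch that scans `kubectl get pods -o json` output for OOMKilled containers (field paths follow the Kubernetes Pod API; the helper function and sample data are illustrative, not from a real cluster):

```python
def find_oomkilled(pod_list: dict) -> list[str]:
    """Return names of pods whose containers last terminated with reason OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        for status in pod.get("status", {}).get("containerStatuses", []):
            terminated = status.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append(pod["metadata"]["name"])
    return hits

# Fabricated sample shaped like `kubectl get pods -o json` output:
sample = {
    "items": [
        {
            "metadata": {"name": "my-dag-heavy-task-abc123"},
            "status": {
                "containerStatuses": [
                    {"lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
                ]
            },
        }
    ]
}
print(find_oomkilled(sample))  # ['my-dag-heavy-task-abc123']
```

To scan a live namespace, feed it `json.loads` of the output of `kubectl get pods -n airflow -o json`.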
The KubernetesExecutor is a game-changer for production Airflow. Each task gets clean isolation, precise resources, and auto-cleanup. The operational overhead is front-loaded into setup, but the day-to-day reliability is excellent.
Next steps: