
Running Apache Airflow on Kubernetes with the KubernetesExecutor

A complete production guide to deploying Apache Airflow on Kubernetes. Covers the official Helm chart, KubernetesExecutor vs CeleryExecutor, resource quotas, pod templates, and GitSync DAG deployment.

Prashant Singh

Senior Data Engineer

6 min read

Running Airflow on Kubernetes offers incredible flexibility — every task gets its own isolated pod with precise resource allocation, and you can scale to zero workers when idle. This guide covers a production-ready Kubernetes setup using the official Helm chart.

Architecture Overview

There are two main executor options for Kubernetes deployments:

                     KubernetesExecutor               CeleryExecutor on K8s
Worker scaling       Per-task pods (scale to zero)    Long-running worker pods
Task isolation       Full pod isolation               Shared worker process
Cold start           ~10–30s pod startup latency      Near-instant (worker already running)
Resource efficiency  Excellent for bursty workloads   Better for high-throughput, low-latency
Complexity           Lower (no Celery/Redis)          Higher (needs Redis, Flower)

For most teams, KubernetesExecutor is the right choice unless you have strict sub-second task latency requirements.

Prerequisites

# Required tools
kubectl >= 1.24
helm >= 3.10
# A working Kubernetes cluster (GKE, EKS, AKS, or local kind/minikube)

Installing with the Official Helm Chart

# Add the Apache Airflow Helm chart repo
helm repo add apache-airflow https://airflow.apache.org
helm repo update
 
# Create a dedicated namespace
kubectl create namespace airflow
 
# Install Airflow with KubernetesExecutor
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow \
  --version 1.15.0 \
  --set executor=KubernetesExecutor \
  --set dags.gitSync.enabled=true \
  --wait

But you'll want a values.yaml for any real deployment. Let's build one.

Production values.yaml

k8s/values.yaml
# ── Executor ────────────────────────────────────────────────────────────────
executor: "KubernetesExecutor"
 
# ── Image ───────────────────────────────────────────────────────────────────
images:
  airflow:
    repository: apache/airflow
    tag: "2.10.4-python3.11"
    pullPolicy: IfNotPresent
 
# ── DAG Deployment via GitSync ──────────────────────────────────────────────
dags:
  gitSync:
    enabled: true
    repo: https://github.com/your-org/airflow-dags.git
    branch: main
    rev: HEAD
    depth: 1
    maxFailures: 3
    subPath: "dags"
    # For private repos, use an SSH key secret:
    # sshKeySecret: airflow-gitsync-ssh-key
    period: 60s        # Sync every 60 seconds
    wait: 10           # Sync interval for git-sync v3 (deprecated; v4 uses period)
 
# ── Web Server ──────────────────────────────────────────────────────────────
webserver:
  replicas: 2          # HA webserver
  resources:
    limits:
      cpu: "1"
      memory: 2Gi
    requests:
      cpu: 500m
      memory: 1Gi
  service:
    type: ClusterIP    # Use an Ingress controller in production
  livenessProbe:
    initialDelaySeconds: 30
    periodSeconds: 15
  readinessProbe:
    initialDelaySeconds: 20
    periodSeconds: 10
 
# ── Scheduler ───────────────────────────────────────────────────────────────
scheduler:
  replicas: 2          # HA scheduler (needs PostgreSQL or MySQL 8+ for row-level locking)
  resources:
    limits:
      cpu: "2"
      memory: 4Gi
    requests:
      cpu: "1"
      memory: 2Gi
 
# ── Triggerer (for Deferrable Operators) ────────────────────────────────────
triggerer:
  enabled: true
  replicas: 1
  resources:
    limits:
      cpu: "1"
      memory: 2Gi
    requests:
      cpu: 500m
      memory: 1Gi
 
# ── PostgreSQL (production: use an external managed DB) ─────────────────────
postgresql:
  enabled: false       # Disable bundled PostgreSQL
 
# Use an external database secret
data:
  metadataSecretName: airflow-metadata-secret
  resultBackendSecretName: airflow-result-backend-secret
 
# ── Kubernetes Executor Pod Template ────────────────────────────────────────
workers:
  # Default pod template for all KubernetesExecutor workers
  podTemplateFile: pod_template.yaml
 
# ── Persistent Logs ─────────────────────────────────────────────────────────
logs:
  persistence:
    enabled: true
    size: 20Gi
    storageClassName: "standard"  # Your storage class
 
# ── RBAC & Security ─────────────────────────────────────────────────────────
rbac:
  create: true
  createSCCRoleBinding: false
 
serviceAccount:
  create: true
  name: "airflow"
  annotations:
    # GKE Workload Identity — annotate with your GCP SA
    iam.gke.io/gcp-service-account: airflow@your-project.iam.gserviceaccount.com
 
# ── Airflow Configuration ────────────────────────────────────────────────────
config:
  core:
    load_examples: "False"
    parallelism: "64"
    max_active_runs_per_dag: "16"
    max_active_tasks_per_dag: "32"
  scheduler:
    min_file_process_interval: "30"
    dag_dir_list_interval: "30"
    max_dagruns_to_create_per_loop: "10"
  kubernetes_executor:
    enable_tcp_keepalive: "True"
    verify_ssl: "True"
    delete_worker_pods: "True"    # Clean up pods after completion
    delete_worker_pods_on_failure: "False"  # Keep failed pods for debugging
    worker_pods_creation_batch_size: "8"
  logging:
    remote_logging: "True"
    remote_log_conn_id: "aws_s3_logs"
    remote_base_log_folder: "s3://your-bucket/airflow-logs"
  webserver:
    expose_config: "False"
    enable_proxy_fix: "True"

Pod Template for KubernetesExecutor

The pod template controls how each task pod is created:

k8s/pod_template.yaml
apiVersion: v1
kind: Pod
metadata:
  name: placeholder
  labels:
    tier: airflow
    component: worker
spec:
  serviceAccountName: airflow
 
  # Ensure pods land on nodes with the right labels
  nodeSelector:
    workload-type: airflow-workers
 
  tolerations:
    - key: "airflow-worker"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
 
  containers:
    - name: base
      image: apache/airflow:2.10.4-python3.11
      imagePullPolicy: IfNotPresent
 
      # Default resources — DAGs can override these
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "500m"
          memory: 2Gi
 
      env:
        - name: AIRFLOW__CORE__EXECUTOR
          value: KubernetesExecutor
 
      volumeMounts:
        - name: dags
          mountPath: /opt/airflow/dags
        - name: logs
          mountPath: /opt/airflow/logs
 
  volumes:
    - name: dags
      emptyDir: {}
    - name: logs
      emptyDir: {}
 
  restartPolicy: Never   # Important: pods must not restart on failure
  terminationGracePeriodSeconds: 600

Per-Task Pod Customization

One of the KubernetesExecutor's superpowers is giving individual tasks different resources or images:

dags/custom_pod_tasks.py
from datetime import datetime
from airflow.decorators import dag, task
from kubernetes.client import models as k8s
 
 
@dag(dag_id="pod_customization", schedule="@daily", start_date=datetime(2024, 1, 1))
def pod_customization():
 
    # ML training task — needs GPU and lots of RAM
    @task(
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",
                            resources=k8s.V1ResourceRequirements(
                                limits={
                                    "cpu": "8",
                                    "memory": "32Gi",
                                    "nvidia.com/gpu": "1",
                                },
                                requests={
                                    "cpu": "4",
                                    "memory": "16Gi",
                                    "nvidia.com/gpu": "1",
                                },
                            ),
                        )
                    ],
                    node_selector={"cloud.google.com/gke-accelerator": "nvidia-l4"},
                    tolerations=[
                        k8s.V1Toleration(key="nvidia.com/gpu", operator="Exists")
                    ],
                )
            )
        }
    )
    def train_model() -> str:
        import subprocess
        subprocess.run(["python", "train.py", "--epochs", "10"], check=True)
        return "s3://bucket/models/latest"
 
    # Lightweight task — minimal resources
    @task(
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",
                            resources=k8s.V1ResourceRequirements(
                                limits={"cpu": "500m", "memory": "512Mi"},
                                requests={"cpu": "100m", "memory": "256Mi"},
                            ),
                        )
                    ]
                )
            )
        }
    )
    def notify(model_path: str) -> None:
        print(f"Model saved to {model_path}")
 
    model = train_model()
    notify(model)
 
 
pod_customization()

Secrets Management

Never hardcode secrets. Use Kubernetes secrets or a dedicated secrets backend:

# Create the metadata DB secret
kubectl create secret generic airflow-metadata-secret \
  --namespace airflow \
  --from-literal=connection="postgresql+psycopg2://airflow:password@postgres/airflow"
 
# Create secrets for DAG connections
kubectl create secret generic airflow-connections \
  --namespace airflow \
  --from-literal=AIRFLOW_CONN_SNOWFLAKE="snowflake://user:pass@account/db?warehouse=WH" \
  --from-literal=AIRFLOW_CONN_AWS_DEFAULT="aws:///?region_name=us-east-1"

Reference them in values.yaml:

extraEnvFrom:
  - secretRef:
      name: airflow-connections
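The `AIRFLOW_CONN_*` values injected above follow Airflow's connection URI convention: the scheme is the connection type, followed by credentials, host, schema, and query-string extras. A stdlib-only sketch of how one of these URIs decomposes — purely illustrative, not Airflow's actual parser (that lives in `airflow.models.Connection`):

```python
from urllib.parse import urlsplit, parse_qs

# Decompose an Airflow-style connection URI (illustrative only)
uri = "snowflake://user:pass@account/db?warehouse=WH"

parts = urlsplit(uri)
conn = {
    "conn_type": parts.scheme,           # "snowflake"
    "login": parts.username,             # "user"
    "password": parts.password,          # "pass"
    "host": parts.hostname,              # "account"
    "schema": parts.path.lstrip("/"),    # "db"
    "extras": {k: v[0] for k, v in parse_qs(parts.query).items()},
}

print(conn["conn_type"], conn["host"], conn["extras"]["warehouse"])
# → snowflake account WH
```

This is why special characters in passwords must be URL-encoded when you store connections as environment variables.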

For production, use a proper secrets backend (HashiCorp Vault or AWS Secrets Manager):

# values.yaml
config:
  secrets:
    backend: "airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend"
    backend_kwargs: '{"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}'

Deploying Changes

# Apply updated values.yaml
helm upgrade airflow apache-airflow/airflow \
  --namespace airflow \
  --version 1.15.0 \
  --values k8s/values.yaml \
  --wait
 
# Check rollout status
kubectl rollout status deployment/airflow-webserver -n airflow
kubectl rollout status deployment/airflow-scheduler -n airflow
 
# View running pods
kubectl get pods -n airflow
 
# Tail scheduler logs
kubectl logs -n airflow -l component=scheduler --follow

Troubleshooting

# See why a task pod failed
kubectl get pods -n airflow -l dag_id=my_dag,task_id=failing_task
 
# Get logs from a failed worker pod (kept because delete_worker_pods_on_failure=False)
kubectl logs -n airflow <pod-name>
 
# Describe pod for scheduling errors (OOMKilled, etc.)
kubectl describe pod -n airflow <pod-name>
 
# Check resource quotas aren't blocking pod creation
kubectl describe resourcequota -n airflow
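If a quota is the blocker, it helps to know what one looks like. A sketch of a namespace ResourceQuota that caps aggregate worker usage — the name and numbers here are assumptions, so size them to your cluster:

```yaml
# k8s/resourcequota.yaml — illustrative caps for the airflow namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: airflow-quota
  namespace: airflow
spec:
  hard:
    requests.cpu: "64"       # Sum of CPU requests across all pods in the namespace
    requests.memory: 256Gi
    limits.cpu: "128"
    limits.memory: 512Gi
    pods: "200"              # Also bounds how many task pods can run at once
```

When the quota is exhausted, new task pods fail to create or sit Pending, so keep it aligned with core.parallelism and worker_pods_creation_batch_size in values.yaml.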

Common issues:

  • OOMKilled — increase memory limits in pod template
  • ImagePullBackOff — check image name/tag and registry access
  • Pending — likely node selector or resource request issue; check kubectl describe node
  • Tasks stuck in queued — check scheduler logs; often a pod template parsing error

The KubernetesExecutor is a game-changer for production Airflow. Each task gets clean isolation, precise resources, and auto-cleanup. The operational overhead is front-loaded into setup, but the day-to-day reliability is excellent.



Senior Data Engineer with expertise in Apache Airflow, data orchestration, and building scalable data pipelines. Passionate about sharing knowledge and best practices with the data engineering community.