Resources

Your quick-reference library for Apache Airflow and the data engineering ecosystem. Glossary, best practices, and tool comparisons — all in one place.

Reference

Airflow Glossary

15 essential terms every Airflow user should know.

DAG

Core

Directed Acyclic Graph. The core unit of work in Airflow — a Python file that defines a workflow of tasks with their dependencies. DAGs must be acyclic (no loops).

Related: Task, Operator, DAG Run

Operator

Core

A template for a single unit of work. Operators determine what actually executes when a Task runs. Common operators include PythonOperator, BashOperator, and provider-specific operators like BigQueryInsertJobOperator.

Related: Task, Hook, Provider

Task

Core

An instance of an Operator within a DAG. Tasks are the actual nodes in the DAG graph. Each task has a unique task_id within its DAG.

Related: DAG, Operator, Task Instance

Task Instance

Core

A specific run of a Task for a given DAG Run. A Task Instance has a state (queued, running, success, failed, etc.) and represents the execution of one task at a specific point in time.

Related: DAG Run, Task, State

DAG Run

Core

An instantiation of a DAG for a specific logical date. A DAG Run contains all Task Instances for that execution. Can be triggered manually, by schedule, or via the REST API.

Related: DAG, Task Instance, Logical Date

XCom

Communication

Short for "cross-communications." A mechanism for Tasks to share small amounts of data with each other. XComs are stored in the Airflow metadata database and should be used only for small values (references and identifiers, not large dataframes).

Related: Task Instance, Metadata Database
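Because XComs are persisted in the metadata database, the usual pattern for large data is to push a reference (a path or URI) rather than the payload itself. A minimal sketch — the bucket and paths here are hypothetical:

```python
# XComs live in the metadata DB, so pass a *reference* to large data,
# not the data itself. (Bucket and path names are hypothetical.)

def extract(ds: str) -> str:
    path = f"s3://my-bucket/staging/{ds}/extract.parquet"
    # ... write the full dataframe to `path` with your storage client ...
    return path  # a short string is safe to return via XCom

def transform(path: str) -> str:
    # Downstream tasks receive only the reference and read the data
    # from storage themselves.
    out = path.replace("extract", "transformed")
    # ... read from `path`, transform, write to `out` ...
    return out
```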

Executor

Infrastructure

The component that determines how tasks are executed. Options include SequentialExecutor (single process, dev only), LocalExecutor (multi-process), CeleryExecutor (distributed), and KubernetesExecutor (k8s pods).

Related: Worker, Scheduler, KubernetesExecutor

Hook

Connectivity

An interface to an external service or database. Hooks abstract the connection logic and are used by Operators internally. For example, PostgresHook manages PostgreSQL connections.

Related: Operator, Connection, Provider

Connection

Connectivity

A stored set of credentials and endpoint information used to connect to external systems. Connections are referenced by conn_id and can be stored in the metadata DB, environment variables, or a secrets backend.

Related: Hook, Variable, Secrets Backend

Sensor

Core

A special type of Operator that waits for a condition to be true before completing. Sensors poll for a condition at a defined interval (poke_interval) until a timeout is reached.

Related: Operator, Deferrable Operator
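The poke loop can be sketched in plain Python. This is a simplified model of "poke" mode, not Airflow's actual implementation (which also offers "reschedule" mode to free the worker slot between pokes):

```python
import time

def poke_until(condition, poke_interval: float, timeout: float) -> bool:
    """Re-check `condition` every poke_interval seconds until it is
    true or the timeout elapses. Returns False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poke_interval)
    return False
```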

Deferrable Operator

Performance

An operator that can suspend itself and free up a worker slot while waiting for an external event. Uses Triggerer process instead of blocking a worker. Available since Airflow 2.2.

Related: Sensor, Triggerer, Worker

TaskFlow API

Core

A decorator-based API introduced in Airflow 2.0 that simplifies writing DAGs using @dag and @task decorators. Makes XCom passing implicit and reduces boilerplate significantly.

Related: DAG, Task, XCom, Dynamic Task Mapping

Dynamic Task Mapping

Core

A feature allowing the number of task instances to be determined at runtime based on the output of an upstream task. Uses .expand() or .partial().expand(). Available since Airflow 2.3.

Related: TaskFlow API, Task, XCom

Pool

Concurrency

A resource limit applied to a group of tasks. Pools control how many tasks can run concurrently for a given resource (e.g., limit DB connections). Tasks acquire slots from a Pool before running.

Related: Task, Concurrency, Slot
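Conceptually, a Pool behaves like a counting semaphore over slots. A rough analogy in plain Python — not how Airflow implements it (Airflow tracks slot usage in the metadata database):

```python
import threading

# A "pool" with 3 slots: at most 3 tasks may hold a slot at once,
# e.g. to cap concurrent database connections.
db_pool = threading.BoundedSemaphore(value=3)

def run_in_pool(fn, *args):
    """Acquire a slot, run the work, release the slot on exit."""
    with db_pool:
        return fn(*args)
```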

Provider

Ecosystem

A package that extends Airflow with additional Operators, Hooks, Sensors, and Connections for a specific service. For example, apache-airflow-providers-google adds 100+ Google Cloud operators.

Related: Operator, Hook, Connection

Guidelines

Best Practices

Battle-tested patterns from production Airflow deployments at scale.

DAG Design

Keep DAGs idempotent

Every DAG run should produce the same result when re-run for the same logical date. Use UPSERT instead of INSERT, and write to date-partitioned paths.
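One way to keep a run idempotent is to derive the output location from the logical date, so a re-run overwrites the same partition instead of appending duplicates. A minimal sketch — the bucket name is hypothetical:

```python
from datetime import datetime

def partitioned_path(base: str, logical_date: datetime) -> str:
    # Same logical date -> same path, so a re-run overwrites its own
    # partition rather than duplicating data. (Bucket is hypothetical.)
    return f"{base}/dt={logical_date:%Y-%m-%d}/data.parquet"

path = partitioned_path("s3://my-bucket/orders", datetime(2024, 1, 1))
```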

Avoid top-level code in DAG files

Never run database queries, file I/O, or heavy computation at the top level of a DAG file. That code executes on every parse cycle (every ~30 seconds by default) and will cripple scheduler parse performance.

Use atomic operations

Write data to a staging location first, then atomically move it to the final destination. This ensures consumers never see partial data.
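For local filesystems, the staging-then-rename pattern can be written with stdlib primitives — a sketch assuming the staging file lives on the same filesystem as the destination:

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write to a staging file in the destination's directory, then
    atomically rename it into place; readers never see partial data."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, staging = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(staging, path)  # atomic within one filesystem
    except BaseException:
        os.unlink(staging)
        raise
```

Object stores offer the same guarantee differently: a multipart upload only becomes visible once completed, or you write to a staging key and copy.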

Limit task granularity

Don't create thousands of tiny tasks. Batch work at the task level and use parallelism within tasks. Each task has scheduler and metadata DB overhead.

Retries & Error Handling

Set retries on every task

Configure retries (e.g. retries=3) and a retry_delay at the DAG or task level so transient failures (network blips, API rate limits) are retried automatically.

Use exponential backoff

Set retry_exponential_backoff=True to avoid hammering a struggling downstream system. Combine with retry_delay=timedelta(minutes=5).
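The resulting delay schedule roughly doubles per attempt, capped at max_retry_delay. A simplified sketch of the schedule — Airflow's real implementation also adds random jitter:

```python
from datetime import timedelta

def backoff_delays(retry_delay: timedelta, retries: int,
                   max_retry_delay: timedelta) -> list[timedelta]:
    """Simplified exponential backoff: double the wait each attempt,
    never exceeding max_retry_delay. (Airflow also adds jitter.)"""
    return [min(retry_delay * (2 ** attempt), max_retry_delay)
            for attempt in range(retries)]

delays = backoff_delays(timedelta(minutes=5), 4, timedelta(hours=1))
# waits of roughly 5, 10, 20, and 40 minutes
```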

Set email_on_failure selectively

Don't send emails for every task failure. Use callbacks (on_failure_callback) to route alerts to Slack/PagerDuty for critical paths only.
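An on_failure_callback is just a function that receives the task context dict. A hedged sketch posting to a Slack incoming webhook — the webhook URL is a placeholder and format_failure_message is a hypothetical helper:

```python
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"  # placeholder

def format_failure_message(dag_id: str, task_id: str, log_url: str) -> str:
    """Hypothetical helper: build the alert text."""
    return f":red_circle: Task failed: {dag_id}.{task_id}\n<{log_url}|View logs>"

def alert_slack(context):
    """Airflow invokes on_failure_callback with the task context dict;
    context["task_instance"] carries dag_id, task_id, and log_url."""
    ti = context["task_instance"]
    payload = {"text": format_failure_message(ti.dag_id, ti.task_id, ti.log_url)}
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget; add error handling as needed
```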

Performance

Index your metadata database

Ensure task_instance, dag_run, and xcom tables are indexed properly. Use PostgreSQL in production — SQLite is for development only.

Limit DAG file count and complexity

More DAG files = slower parse time. Aim for <500 DAGs and profile with airflow dags report if your scheduler lags.

Clean up old DAG runs and XComs

Schedule airflow db clean (Airflow 2.3+) or a maintenance DAG to purge old task instances, DAG runs, and XComs, and set max_active_runs to bound concurrent runs. Uncleaned metadata tables cause serious scheduler slowdowns.

Comparisons

Airflow vs. The Alternatives

Honest comparisons to help you pick the right orchestrator for your use case.

🌊

Airflow vs. Prefect

Great for Python-first teams, simpler setup

Prefect strengths

  • Better local development experience
  • Fully dynamic, runtime-defined workflows
  • Modern Python-native API
  • Excellent UI for small teams

Prefect limitations

  • Smaller community and ecosystem
  • Fewer integrations (providers) than Airflow
  • Cloud-managed option can be expensive
  • Less mature for enterprise use cases

Choose Prefect when…

You want minimal infrastructure overhead and a modern Pythonic API for small-to-medium teams.

Choose Airflow when…

You need the vast provider ecosystem, battle-tested enterprise features, or are already invested in Airflow.

💎

Airflow vs. Dagster

Best for data-asset oriented teams

Dagster strengths

  • Asset-centric model is a genuine paradigm shift
  • Excellent type system and data lineage
  • Great testing story out of the box
  • Strong software engineering principles baked in

Dagster limitations

  • Steeper learning curve
  • Smaller community than Airflow
  • Can feel over-engineered for simple pipelines
  • Fewer cloud provider integrations

Choose Dagster when…

Your team thinks in terms of data assets and wants first-class lineage, versioning, and observability.

Choose Airflow when…

You have existing Airflow DAGs, a large ops team, or need the breadth of Airflow's provider packages.

🧙

Airflow vs. Mage

Ideal for fast iteration and notebook-style development

Mage strengths

  • Interactive block-based editor
  • Great for prototyping and exploration
  • Built-in data validation
  • Lower barrier to entry

Mage limitations

  • Less battle-tested at scale
  • Smaller provider ecosystem
  • Less suitable for complex dependency graphs
  • Immature enterprise features

Choose Mage when…

You want a notebook-like experience and fast iteration cycles for ELT pipelines.

Choose Airflow when…

You have complex dependency graphs, large-scale production workloads, or a team that needs enterprise governance.

Quick Reference

Cheat Sheets

Copy-paste ready reference cards for common Airflow tasks.

Cron Scheduling Quick Reference
# Every minute
* * * * *

# Every hour at :30
30 * * * *

# Daily at 02:00 UTC
0 2 * * *

# Weekly, Monday at 06:00
0 6 * * 1

# Monthly, 1st at midnight
0 0 1 * *

# Airflow presets
@hourly   →  0 * * * *
@daily    →  0 0 * * *
@weekly   →  0 0 * * 0
@monthly  →  0 0 1 * *
Default Task Arguments Template
from datetime import datetime, timedelta

def alert_slack(context):
    """Placeholder callback -- route failures to Slack/PagerDuty."""
    ...

default_args = {
    "owner": "data-team",
    "depends_on_past": False,
    "start_date": datetime(2024, 1, 1),
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(hours=1),
    "execution_timeout": timedelta(hours=2),
    "on_failure_callback": alert_slack,
}
TaskFlow API Pattern
from airflow.decorators import dag, task
from datetime import datetime

@dag(
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["example"],
)
def my_pipeline():

    @task()
    def extract() -> list[dict]:
        return [{"id": 1, "val": "foo"}]

    @task()
    def transform(records: list[dict]) -> list[dict]:
        return [{"id": r["id"], "val": r["val"].upper()}
                for r in records]

    @task()
    def load(records: list[dict]) -> None:
        print(f"Loading {len(records)} records")

    load(transform(extract()))

my_pipeline()
Dynamic Task Mapping
from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule="@daily", start_date=datetime(2024, 1, 1))
def dynamic_mapping_demo():

    @task()
    def get_tables() -> list[str]:
        return ["orders", "customers", "products"]

    @task()
    def process_table(table: str) -> str:
        print(f"Processing {table}")
        return f"{table}_processed"

    # Creates one task instance per table at runtime
    tables = get_tables()
    process_table.expand(table=tables)

dynamic_mapping_demo()