Resources
Your quick-reference library for Apache Airflow and the data engineering ecosystem. Glossary, best practices, and tool comparisons — all in one place.
Airflow Glossary
15 essential terms every Airflow user should know.
DAG
[Core] Directed Acyclic Graph. The core unit of work in Airflow: a workflow of tasks and their dependencies, defined in a Python file. DAGs must be acyclic (no loops).
Operator
[Core] A template for a single unit of work. Operators determine what actually executes when a Task runs. Common operators include PythonOperator, BashOperator, and provider-specific operators like BigQueryInsertJobOperator.
Task
[Core] An instance of an Operator within a DAG. Tasks are the actual nodes in the DAG graph. Each task has a unique task_id within its DAG.
Task Instance
[Core] A specific run of a Task for a given DAG Run. A Task Instance has a state (queued, running, success, failed, etc.) and represents the execution of one task at a specific point in time.
DAG Run
[Core] An instantiation of a DAG for a specific logical date. A DAG Run contains all Task Instances for that execution. It can be triggered manually, by schedule, or via the REST API.
XCom
[Communication] Cross-communication. A mechanism for Tasks to share small amounts of data with each other. XComs are stored in the Airflow metadata database and should be used for small values only (not large dataframes).
Executor
[Infrastructure] The component that determines how tasks are executed. Options include SequentialExecutor (single process, dev only), LocalExecutor (multi-process), CeleryExecutor (distributed), and KubernetesExecutor (k8s pods).
Hook
[Connectivity] An interface to an external service or database. Hooks abstract the connection logic and are used by Operators internally. For example, PostgresHook manages PostgreSQL connections.
Connection
[Connectivity] A stored set of credentials and endpoint information used to connect to external systems. Connections are referenced by conn_id and can be stored in the metadata DB, environment variables, or a secrets backend.
Sensor
[Core] A special type of Operator that waits for a condition to be true before completing. Sensors poll for the condition at a defined interval (poke_interval) until they succeed or a timeout is reached.
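The poke loop a Sensor runs can be sketched in plain Python. This is an illustration of the semantics only, not Airflow's implementation; the `condition`, `poke_interval`, and `timeout` names mirror BaseSensorOperator's parameters:

```python
import time

def poke_until(condition, poke_interval: float, timeout: float) -> bool:
    """Sketch of a Sensor's poke loop (illustrative, not Airflow's code).

    Calls `condition` every `poke_interval` seconds until it returns True,
    or until `timeout` seconds have elapsed (the sensor would then fail).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():          # one "poke"
            return True          # sensor succeeds
        time.sleep(poke_interval)
    return False                 # timeout reached: sensor fails

# Example: a condition that becomes true on the third poke
state = {"ready": False, "pokes": 0}

def check() -> bool:
    state["pokes"] += 1
    if state["pokes"] >= 3:
        state["ready"] = True
    return state["ready"]

print(poke_until(check, poke_interval=0.01, timeout=1.0))  # True, after 3 pokes
```

Deferrable sensors (next entry) avoid exactly this loop's cost: instead of a worker sleeping between pokes, the wait is handed to the Triggerer.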
Deferrable Operator
[Performance] An operator that can suspend itself and free up a worker slot while waiting for an external event. It hands the wait off to the Triggerer process instead of blocking a worker. Available since Airflow 2.2.
TaskFlow API
[Core] A decorator-based API introduced in Airflow 2.0 that simplifies writing DAGs using @dag and @task decorators. It makes XCom passing implicit and significantly reduces boilerplate.
Dynamic Task Mapping
[Core] A feature that lets the number of task instances be determined at runtime from the output of an upstream task, via .expand() or .partial().expand(). Available since Airflow 2.3.
Pool
[Concurrency] A resource limit applied to a group of tasks. Pools control how many tasks can run concurrently against a given resource (e.g., to cap DB connections). Tasks acquire slots from a Pool before running.
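Conceptually, a Pool behaves like a counting semaphore over slots. A minimal sketch of that idea (illustrative only; Airflow tracks slots in its metadata DB, and in a real DAG you would simply pass pool="db_pool" to the task):

```python
import threading
import time

db_pool = threading.BoundedSemaphore(3)  # like a Pool with 3 slots
running = 0
peak = 0
lock = threading.Lock()

def run_task(name: str) -> None:
    """Acquire a pool slot, do some work, release the slot."""
    global running, peak
    with db_pool:                     # blocks until a slot is free
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)              # stand-in for the task's real work
        with lock:
            running -= 1

threads = [threading.Thread(target=run_task, args=(f"t{i}",)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peak)  # concurrency never exceeds the 3 available slots
```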
Provider
[Ecosystem] A package that extends Airflow with additional Operators, Hooks, Sensors, and Connections for a specific service. For example, apache-airflow-providers-google adds 100+ Google Cloud operators.
Best Practices
Battle-tested patterns from production Airflow deployments at scale.
DAG Design
Keep DAGs idempotent
Every DAG run should produce the same result when re-run for the same logical date. Use UPSERT instead of INSERT, and write to date-partitioned paths.
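The UPSERT half of this pattern can be demonstrated with SQLite's INSERT ... ON CONFLICT, which ships in the standard library. This is an illustration only; in production you would use your warehouse's MERGE/UPSERT dialect, and the daily_metrics table here is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_metrics (ds TEXT PRIMARY KEY, clicks INTEGER)")

def load_metrics(ds: str, clicks: int) -> None:
    # UPSERT keyed on the logical date: re-running for the same ds
    # overwrites the existing row instead of duplicating it
    conn.execute(
        "INSERT INTO daily_metrics (ds, clicks) VALUES (?, ?) "
        "ON CONFLICT(ds) DO UPDATE SET clicks = excluded.clicks",
        (ds, clicks),
    )

load_metrics("2024-01-01", 100)
load_metrics("2024-01-01", 100)   # simulated re-run: same result, still one row
rows = conn.execute("SELECT * FROM daily_metrics").fetchall()
print(rows)  # [('2024-01-01', 100)]
```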
Avoid top-level code in DAG files
Never run database queries, file I/O, or heavy computation at DAG parse time. This runs on every scheduler heartbeat and will cripple parse performance.
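The difference between parse-time and execution-time work can be sketched like this (the `expensive_db_query` stub is hypothetical, standing in for any real query, API call, or file read):

```python
calls = {"n": 0}

def expensive_db_query() -> list[int]:
    # Stand-in for a real DB query / API call / file read
    calls["n"] += 1
    return [1, 2, 3]

# Anti-pattern: top-level call, executed every time the scheduler
# parses the DAG file (many times per minute):
# ROWS = expensive_db_query()   # DON'T

# Better: defer the work into the task callable, so it runs only
# when a task instance actually executes
def extract() -> list[int]:
    return expensive_db_query()

# Merely importing/parsing this module triggers no I/O:
assert calls["n"] == 0
extract()           # the query runs only at execution time
assert calls["n"] == 1
```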
Use atomic operations
Write data to a staging location first, then atomically move it to the final destination. This ensures consumers never see partial data.
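For local files, the staging-then-rename pattern can be sketched with the standard library (os.replace is atomic on POSIX when source and destination are on the same filesystem; the paths here are illustrative):

```python
import os
import tempfile

def atomic_write(path: str, data: str) -> None:
    """Write to a staging file, then atomically rename into place."""
    # 1) stage in the same directory, so the final rename stays on one filesystem
    dir_name = os.path.dirname(path) or "."
    fd, staging = tempfile.mkstemp(dir=dir_name, suffix=".staging")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        # 2) atomic swap: readers see either the old file or the new one,
        #    never a half-written file
        os.replace(staging, path)
    except BaseException:
        os.unlink(staging)
        raise

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "output.csv")
    atomic_write(target, "id,val\n1,foo\n")
    with open(target) as f:
        content = f.read()
print(content)
```

The same idea applies to object stores: write to a staging prefix, then move or commit to the final prefix in one step.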
Limit task granularity
Don't create thousands of tiny tasks. Batch work at the task level and use parallelism within tasks. Each task has scheduler and metadata DB overhead.
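A common fix is to batch at the task boundary: map over chunks instead of individual items. A sketch of the chunking helper (plain Python; in an Airflow DAG you would feed the batches into .expand()):

```python
def chunk(items: list, size: int) -> list[list]:
    """Split a flat list into batches so each mapped task handles `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

ids = list(range(10))
batches = chunk(ids, 4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
# => 3 mapped task instances instead of 10 tiny ones,
#    cutting scheduler and metadata-DB overhead per item
```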
Retries & Error Handling
Set retries on every task
Configure retries (e.g., retries=3) and a retry_delay at the DAG or task level, so transient failures (network blips, API rate limits) are retried automatically.
Use exponential backoff
Set retry_exponential_backoff=True to avoid hammering a struggling downstream system. Combine with retry_delay=timedelta(minutes=5).
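The resulting delay schedule is roughly retry_delay doubled per attempt, capped at max_retry_delay. This sketch shows only that capped-doubling shape; it is not Airflow's exact formula (Airflow also randomizes the delay to avoid thundering herds):

```python
from datetime import timedelta

def backoff_delay(retry_delay: timedelta, try_number: int,
                  max_retry_delay: timedelta) -> timedelta:
    # Illustrative capped exponential backoff: delay * 2^(try - 1),
    # never exceeding the configured maximum
    return min(retry_delay * (2 ** (try_number - 1)), max_retry_delay)

delays = [backoff_delay(timedelta(minutes=5), n, timedelta(hours=1))
          for n in range(1, 6)]
print([int(d.total_seconds() // 60) for d in delays])  # [5, 10, 20, 40, 60]
```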
Set email_on_failure selectively
Don't send emails for every task failure. Use callbacks (on_failure_callback) to route alerts to Slack/PagerDuty for critical paths only.
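A failure callback is just a function that receives the task context dict. A minimal sketch (the actual Slack/PagerDuty delivery is left as a comment since it depends on your setup; task_instance and run_id are standard Airflow context keys):

```python
from types import SimpleNamespace

def alert_slack(context: dict) -> str:
    """on_failure_callback: Airflow calls this with the task context on failure."""
    ti = context["task_instance"]
    message = f"Task {ti.task_id} in DAG {ti.dag_id} failed for run {context['run_id']}"
    # In production, post `message` to Slack or PagerDuty here
    # (e.g. via an incoming webhook) instead of just printing it.
    print(message)
    return message

# Quick local check with a stand-in context (a real one comes from Airflow):
fake_context = {
    "task_instance": SimpleNamespace(task_id="load", dag_id="etl"),
    "run_id": "manual__2024-01-01",
}
alert_slack(fake_context)

# Wire it up on critical tasks only, e.g.:
# @task(on_failure_callback=alert_slack)
```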
Performance
Index your metadata database
Ensure task_instance, dag_run, and xcom tables are indexed properly. Use PostgreSQL in production — SQLite is for development only.
Limit DAG file count and complexity
More DAG files = slower parse time. Aim for <500 DAGs and profile with airflow dags report if your scheduler lags.
Clean up old DAG runs and XComs
Use airflow db clean (available since Airflow 2.3) to purge old task instance, DAG run, and XCom records, set max_active_runs, and establish a database retention policy. Uncleaned metadata tables cause serious scheduler slowdowns.
Airflow vs. The Alternatives
Honest comparisons to help you pick the right orchestrator for your use case.
Airflow vs. Prefect
Great for Python-first teams, simpler setup
Prefect strengths
- Better local development experience
- Fully dynamic workflows, without Airflow's static DAG-definition constraints
- Modern Python-native API
- Excellent UI for small teams
Prefect limitations
- Smaller community and ecosystem
- Fewer integrations (providers) than Airflow
- Cloud-managed option can be expensive
- Less mature for enterprise use cases
Choose Prefect when…
You want minimal infrastructure overhead and a modern Pythonic API for small-to-medium teams.
Choose Airflow when…
You need the vast provider ecosystem or battle-tested enterprise features, or you are already invested in Airflow.
Airflow vs. Dagster
Best for data-asset oriented teams
Dagster strengths
- The asset-centric model is a genuine shift in how you think about pipelines
- Excellent type system and data lineage
- Great testing story out of the box
- Strong software engineering principles baked in
Dagster limitations
- Steeper learning curve
- Smaller community than Airflow
- Can feel over-engineered for simple pipelines
- Fewer cloud provider integrations
Choose Dagster when…
Your team thinks in terms of data assets and wants first-class lineage, versioning, and observability.
Choose Airflow when…
You have existing Airflow DAGs, a large ops team, or need the breadth of Airflow's provider packages.
Airflow vs. Mage
Ideal for fast iteration and notebook-style development
Mage strengths
- Interactive block-based editor
- Great for prototyping and exploration
- Built-in data validation
- Lower barrier to entry
Mage limitations
- Less battle-tested at scale
- Smaller provider ecosystem
- Less suitable for complex dependency graphs
- Immature enterprise features
Choose Mage when…
You want a notebook-like experience and fast iteration cycles for ELT pipelines.
Choose Airflow when…
You run complex dependency graphs or large-scale production workloads, or your team needs enterprise governance.
Cheat Sheets
Copy-paste ready reference cards for common Airflow tasks.
# Every minute
* * * * *
# Every hour at :30
30 * * * *
# Daily at 02:00 UTC
0 2 * * *
# Weekly, Monday at 06:00
0 6 * * 1
# Monthly, 1st at midnight
0 0 1 * *
# Airflow presets
@hourly → 0 * * * *
@daily → 0 0 * * *
@weekly → 0 0 * * 0
@monthly → 0 0 1 * *

default_args template

from datetime import datetime, timedelta
default_args = {
    "owner": "data-team",
    "depends_on_past": False,
    "start_date": datetime(2024, 1, 1),
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(hours=1),
    "execution_timeout": timedelta(hours=2),
    "on_failure_callback": alert_slack,  # your own callback function
}

TaskFlow DAG example

from airflow.decorators import dag, task
from datetime import datetime
@dag(
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["example"],
)
def my_pipeline():
    @task()
    def extract() -> list[dict]:
        return [{"id": 1, "val": "foo"}]

    @task()
    def transform(records: list[dict]) -> list[dict]:
        return [{"id": r["id"], "val": r["val"].upper()} for r in records]

    @task()
    def load(records: list[dict]) -> None:
        print(f"Loading {len(records)} records")

    load(transform(extract()))

my_pipeline()

Dynamic Task Mapping example

from airflow.decorators import dag, task
from datetime import datetime
@dag(schedule="@daily", start_date=datetime(2024, 1, 1))
def dynamic_mapping_demo():
    @task()
    def get_tables() -> list[str]:
        return ["orders", "customers", "products"]

    @task()
    def process_table(table: str) -> str:
        print(f"Processing {table}")
        return f"{table}_processed"

    # Creates one task instance per table at runtime
    tables = get_tables()
    process_table.expand(table=tables)

dynamic_mapping_demo()