
Scheduling, intervals & catchup in Airflow

Learn how to schedule DAGs, understand data intervals, and control catchup and backfill behavior in Airflow.

What we're doing

You know how to write and run a DAG. Now learn how to make it run automatically — on a schedule, at the right time, with full control over what happens when it misses a run. Get to know how scheduling works, what data intervals are, and how catchup and backfill behave. Then you'll write a scheduled DAG and control it from the UI.

Step 1: How scheduling works

Every DAG has a schedule which tells Airflow when to trigger a run. You define it with the schedule parameter in the DAG block.

Airflow supports two ways to define a schedule:

Preset shortcuts

schedule="@hourly"    # every hour
schedule="@daily"     # every day at midnight
schedule="@weekly"    # every Sunday at midnight
schedule="@monthly"   # first day of every month
schedule=None         # never runs automatically, manual trigger only
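
Each preset is shorthand for a cron expression. The lookup table below is a small reference sketch in plain Python; the name PRESET_TO_CRON is illustrative and is not part of Airflow's API.

```python
# Illustrative lookup table (not an Airflow API): each preset and the
# standard cron expression it is shorthand for.
PRESET_TO_CRON = {
    "@hourly": "0 * * * *",   # minute 0 of every hour
    "@daily": "0 0 * * *",    # midnight every day
    "@weekly": "0 0 * * 0",   # midnight on Sunday (day of week 0)
    "@monthly": "0 0 1 * *",  # midnight on the first day of the month
}

for preset, cron in PRESET_TO_CRON.items():
    print(f"{preset:<9} -> {cron}")
```

Reading a preset as its cron equivalent makes it easier to tell when a custom cron expression is actually needed.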

Cron expressions

A cron expression gives you full control over the schedule. It has five fields, written in this order:

  • minute (0-59)
  • hour (0-23)
  • day of month (1-31)
  • month (1-12)
  • day of week (0-6, Sunday=0)

Some examples:

schedule="0 9 * * *"      # every day at 9am
schedule="0 9 * * 1"      # every Monday at 9am
schedule="0 */6 * * *"    # every 6 hours
schedule="30 8 1 * *"     # first day of every month at 8:30am

* means "every". 0 9 * * * reads as: at minute 0, hour 9, every day, every month, every day of the week.
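
To make the field order concrete, here is a minimal sketch that splits a five-field expression into named parts. The helper describe_cron and the constant CRON_FIELDS are hypothetical names for illustration; Airflow does its own cron parsing internally, and this sketch does not validate value ranges.

```python
# Field names in the order they appear in a five-field cron expression.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict:
    """Split a five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


# "Every Monday at 9am" from the examples above.
print(describe_cron("0 9 * * 1"))
```

Labelling the fields this way is a quick check that a value such as 9 landed in the hour position and not the minute position.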

Step 2: Data intervals

A data interval is the window of time that the DAG run is processing.

For a daily DAG scheduled at midnight:

  • The run that fires at 2024-01-02 00:00 processes data from 2024-01-01 00:00 to 2024-01-02 00:00
  • data_interval_start = 2024-01-01 00:00
  • data_interval_end = 2024-01-02 00:00

This is important. Airflow runs at the end of the interval, not the beginning. A daily DAG with start_date=2024-01-01 doesn't fire on January 1st — it fires on January 2nd, processing January 1st's data.
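
The end-of-interval rule can be sketched with plain datetime arithmetic. The function daily_run_for is a hypothetical helper, not an Airflow API, and it assumes a simple daily schedule at midnight.

```python
from datetime import datetime, timedelta


def daily_run_for(interval_start: datetime) -> dict:
    """Return the interval bounds and earliest trigger time for one daily run."""
    interval_end = interval_start + timedelta(days=1)
    return {
        "data_interval_start": interval_start,
        "data_interval_end": interval_end,
        # The run is triggered only once the interval has fully elapsed.
        "triggered_at": interval_end,
    }


run = daily_run_for(datetime(2024, 1, 1))
print(run["data_interval_start"], "->", run["data_interval_end"])
print("triggered at", run["triggered_at"])
```

The trigger time equals data_interval_end, which is why the first run of a DAG with start_date=2024-01-01 appears on January 2nd.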

You can access the interval in your tasks:

def extract(**context):
    start = context["data_interval_start"]
    end = context["data_interval_end"]
    print(f"Processing data from {start} to {end}")

Step 3: Catchup and backfill

Catchup

When you create a DAG with a start_date in the past and catchup=True, Airflow will automatically run all the missed intervals between the start date and now.

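from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="my_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,   # Airflow will run every missed interval since start_date
) as dag:
    ...  # define your tasks here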

Most of the time you want catchup=False, especially when you're developing or when past data doesn't matter.

catchup=False    # only run going forward, ignore missed runs

Backfill

Backfill is the manual version of catchup. You trigger it from the command line for a specific date range:

airflow dags backfill \
    --start-date 2024-01-01 \
    --end-date 2024-01-31 \
    my_dag

Be careful with catchup=True: with a start_date from a year ago, Airflow will queue hundreds of runs immediately.
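
To estimate how many runs catchup=True would create, count the complete intervals between start_date and the current time. This is a rough sketch: count_catchup_runs is a hypothetical helper, it assumes a fixed-length interval, and it ignores Airflow settings that limit how many runs are active at once.

```python
from datetime import datetime, timedelta


def count_catchup_runs(start_date: datetime, now: datetime,
                       interval: timedelta = timedelta(days=1)) -> int:
    """Count the complete intervals between start_date and now."""
    if now <= start_date:
        return 0
    # Only fully elapsed intervals produce a run, so use floor division.
    return (now - start_date) // interval


# A daily DAG whose start_date is exactly one year in the past.
print(count_catchup_runs(datetime(2024, 1, 1), datetime(2025, 1, 1)))  # 366 (2024 is a leap year)
```

An hourly schedule over the same period would be 24 times larger, so it is worth running this kind of estimate before enabling catchup.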

Step 4: Create the DAG file

Click VS Code in the environment panel. Right click on the dags folder and create a new file called scheduled_dag.py.

Step 5: Write the scheduled DAG

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def process_data(**context):
    start = context["data_interval_start"]
    end = context["data_interval_end"]
    print(f"Processing data from {start} to {end}")

with DAG(
    dag_id="scheduled_dag",
    start_date=datetime(2024, 1, 1),
    schedule="0 9 * * *",
    catchup=False
) as dag:

    process = PythonOperator(
        task_id="process",
        python_callable=process_data
    )

  • schedule="0 9 * * *": runs every day at 9am
  • catchup=False: only runs going forward
  • **context: Airflow passes a context dictionary to every task function. It contains information about the current run, including the data interval
  • context["data_interval_start"] and context["data_interval_end"]: the start and end of the data window this run is processing

Save with Ctrl+S.
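
Before switching to the UI, you can sanity-check the callable on its own by passing it a hand-built context. This is a sketch under assumptions: fake_context is a made-up stand-in with only the two keys the function reads, and it uses plain datetime objects, whereas the context Airflow passes at run time contains many more entries and may use a different datetime type.

```python
from datetime import datetime, timezone


def process_data(**context):
    start = context["data_interval_start"]
    end = context["data_interval_end"]
    print(f"Processing data from {start} to {end}")


# Hand-built stand-in for the context Airflow would pass at run time.
fake_context = {
    "data_interval_start": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
    "data_interval_end": datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc),
}

process_data(**fake_context)
```

Because the function only reads from the keyword arguments, it can be exercised like this without a running scheduler.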

Step 6: Find it in the UI and inspect the schedule

Open the Airflow UI from the environment panel. Go to the DAGs page and find scheduled_dag. You'll see the schedule showing 0 9 * * * and the next scheduled run time.

Click into the DAG and open the Details tab. Here you can see:

  • The schedule interval
  • The next run time
  • The last run time
  • The data interval of the last run

After hibernation

If the VM hibernates, reconnect and run in the VS Code terminal:

cd ~/airflow
docker compose up -d

What's next

Now try this out in a live environment: boot a fresh Airflow environment and experiment with the DAG above.
