LINUX TUTORIAL Understanding load average (and when it's actually bad)
What the three load numbers mean, why load is not CPU percent, and how we tell when a high number is a real problem. We generate CPU load and disk load and watch them move differently.
What we're doing
We read the load average, then make it move on purpose: first with CPU work, then with disk work, watching how each one changes the numbers. This VM has 2 cores and ~3.9 GB RAM.
Watch the video first, then run these as we read. The reading commands are read-only, so they need no sudo; stress-ng runs as our normal user too.
The one idea: load counts tasks, it is not CPU percent
The load average is the number of tasks running or waiting, as a moving average over roughly the last 1, 5, and 15 minutes (the three numbers, in that order). "Waiting" covers two kinds of task:
- runnable tasks: ready to run, waiting for a free core.
- D state tasks (uninterruptible sleep): blocked, usually waiting for disk input/output (I/O). These use no CPU but still count.
We read it against the core count: on N cores, a load of N from CPU work alone means every core is busy, and anything above N means tasks are queuing.
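To make "read it against the core count" concrete, here is a minimal sketch that divides a load figure by the number of cores. The helper name per_core_load is our own invention, not a standard command, and the sample values are the ones used in this tutorial.

```shell
# per_core_load LOAD CORES: print LOAD divided by CORES, to two decimals.
# (per_core_load is a hypothetical helper, defined here for illustration.)
per_core_load() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

per_core_load 2.00 2   # prints 1.00: as many tasks as cores
per_core_load 2.90 2   # prints 1.45: more tasks than cores, so some are queuing

# On a live Linux system, feed it the real numbers:
# per_core_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A result under 1.00 means spare capacity per core; over 1.00 means tasks are waiting for something.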
Step 1: read the load
uptime # time, uptime, users, and the 3 load numbers
cat /proc/loadavg # the kernel's raw source of those numbers
nproc # core count, to read load against
14:02:10 up 1:12, 1 user, load average: 0.08, 0.12, 0.09
0.08 0.12 0.09 1/142 3501
2
uptime's last three values are the 1-, 5-, and 15-minute load. /proc/loadavg is where they come from. The 1/142 is running/total tasks; the last value is the newest PID (process ID). nproc = 2 cores, so load 2 = both cores full.
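If the five fields of /proc/loadavg are hard to keep apart, a small sketch like this labels them. The helper name parse_loadavg is hypothetical; it reads one line in the /proc/loadavg layout from standard input, and here we feed it the sample line from Step 1.

```shell
# parse_loadavg: read one /proc/loadavg-style line on stdin and label each field.
# (parse_loadavg is a hypothetical helper, defined here for illustration.)
parse_loadavg() {
  read -r one five fifteen tasks last_pid
  printf '1-min load:    %s\n' "$one"
  printf '5-min load:    %s\n' "$five"
  printf '15-min load:   %s\n' "$fifteen"
  printf 'running tasks: %s\n' "${tasks%/*}"   # part before the slash
  printf 'total tasks:   %s\n' "${tasks#*/}"   # part after the slash
  printf 'newest PID:    %s\n' "$last_pid"
}

# Sample line from Step 1; on a live system use: parse_loadavg < /proc/loadavg
echo "0.08 0.12 0.09 1/142 3501" | parse_loadavg
```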
Step 2: make CPU load, watch it climb
stress-ng --cpu 2 --timeout 120s & # 2 workers, each fills one core; stops after 120s; & = background
# wait about a minute (load is an average), then:
uptime
top -bn1 | head -3 # -b batch, -n1 one update; head -3 = load + CPU summary
14:05:40 up 1:16, 1 user, load average: 1.78, 0.86, 0.39
...
%Cpu(s): 99.3 us, 0.5 sy, 0.0 ni, 0.2 id, 0.0 wa, ...
The 1-minute number (first) climbs toward 2.00, which means both cores are busy. The %Cpu(s) line shows us (user CPU time) at ~99%, so this load is real CPU work. This is CPU-bound.
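Why wait about a minute? The 1-minute figure is a decaying average, so it approaches the true task count gradually instead of jumping. The sketch below is a rough model, assuming the average is updated about every 5 seconds and old values fade exponentially over the window. The real kernel uses fixed-point arithmetic, so actual numbers differ slightly. simulate_load is a hypothetical helper name.

```shell
# simulate_load TASKS SECONDS [WINDOW] [START]
# Approximate the load average after SECONDS with TASKS constantly runnable.
# WINDOW is the averaging window in seconds (default 60 = the 1-minute number).
# START is the load at time zero (default 0). Assumes one update every 5 s.
simulate_load() {
  awk -v n="$1" -v secs="$2" -v win="${3:-60}" -v load="${4:-0}" 'BEGIN {
    decay = exp(-5 / win)                 # weight kept from the previous value
    for (t = 5; t <= secs; t += 5)
      load = load * decay + n * (1 - decay)
    printf "%.2f\n", load
  }'
}

simulate_load 2 60    # prints 1.26: after one minute, still well short of 2
simulate_load 2 120   # prints 1.73: after two minutes, closer
simulate_load 2 300   # prints 1.99: only now almost at 2.00
```

This is why the 1-minute number is still short of 2.00 even after both cores have been full for a while.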
Step 3: make disk load, watch load rise without CPU
stress-ng --hdd 2 --hdd-bytes 256m --timeout 120s & # 2 workers writing/deleting files; capped at 256 MB each
# wait about a minute, then:
uptime
top -bn1 | head -3
ps -eo pid,stat,comm | awk 'NR==1 || $2 ~ /^D/' # header row, plus tasks whose state starts with D (blocked on disk I/O)
14:09:20 up 1:20, 1 user, load average: 2.90, 1.40, 0.70
%Cpu(s): 2.1 us, 6.0 sy, 0.0 ni, 12.6 id, 79.0 wa, ...
PID STAT COMMAND
4102 D stress-ng-hdd
Load is high, but us is tiny and wa (CPU time spent idle while waiting for input/output) is ~79%. The CPU is doing very little work; the disk is the limit. The D-state tasks are counted in the load but use no CPU. This is I/O-bound, and it shows that load is not CPU percent.
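The us-versus-wa comparison can be sketched as a small filter over the %Cpu(s) line. The helper name classify_cpu_line and the 50% cutoff are our own assumptions for illustration, not a standard rule. The two sample lines are the ones from Step 2 and Step 3.

```shell
# classify_cpu_line: read a top "%Cpu(s):" line on stdin and say which kind
# of load it suggests. The 50% cutoff is an arbitrary illustration value.
# (classify_cpu_line is a hypothetical helper.)
classify_cpu_line() {
  awk -v cutoff=50 '{
    us = 0; wa = 0
    sub(/^%Cpu\(s\):/, "")     # drop the label (it can touch the first number)
    gsub(/,/, " ")             # commas to spaces so each value is its own field
    for (i = 2; i <= NF; i++) {
      if ($i == "us") us = $(i - 1) + 0
      if ($i == "wa") wa = $(i - 1) + 0
    }
    if (wa >= cutoff)      print "I/O-bound"
    else if (us >= cutoff) print "CPU-bound"
    else                   print "mixed or idle"
  }'
}

echo '%Cpu(s): 99.3 us, 0.5 sy, 0.0 ni, 0.2 id, 0.0 wa' | classify_cpu_line    # prints CPU-bound
echo '%Cpu(s): 2.1 us, 6.0 sy, 0.0 ni, 12.6 id, 79.0 wa' | classify_cpu_line   # prints I/O-bound

# On a live system: top -bn1 | grep '^%Cpu' | classify_cpu_line
```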
Reading "is this load bad?"
- Against cores: load / nproc. Under 1 per core = spare; ~1 = full; over 1 = tasks waiting.
- Which kind: high us = CPU-bound (need less work or more cores); high wa = I/O-bound (disk is the limit, more CPU won't help).
- Trend: 1-min above 15-min = rising; below = falling.
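The trend check is a plain comparison of the first and third load numbers. A minimal sketch, using a hypothetical helper name load_trend and the sample values from this tutorial:

```shell
# load_trend ONE_MIN FIFTEEN_MIN: compare the 1-minute and 15-minute numbers.
# (load_trend is a hypothetical helper, defined here for illustration.)
load_trend() {
  awk -v one="$1" -v fifteen="$2" 'BEGIN {
    if (one + 0 > fifteen + 0)      print "rising"
    else if (one + 0 < fifteen + 0) print "falling"
    else                            print "steady"
  }'
}

load_trend 2.90 0.70   # prints rising  (Step 3: load still building)
load_trend 0.08 0.09   # prints falling (Step 1: quiet system)

# On a live system:
# read -r one five fifteen rest < /proc/loadavg && load_trend "$one" "$fifteen"
```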
Cheat sheet
uptime # the 3 load numbers (1, 5, 15 min)
cat /proc/loadavg # raw source + running/total tasks
nproc # cores, to read load against
top -bn1 | head -3 # load + %Cpu(s): us = CPU work, wa = waiting on disk
# generate load to study it
stress-ng --cpu 2 --timeout 120s & # CPU-bound
stress-ng --hdd 2 --hdd-bytes 256m --timeout 120s & # I/O-bound
ps -eo pid,stat,comm | awk 'NR==1 || $2 ~ /^D/' # header + tasks blocked on disk (D state)
The one thing to remember: load average counts tasks running or waiting (including disk-blocked ones), averaged over 1/5/15 minutes. It is not CPU percent. Read it against the core count, and use the %Cpu(s) line to tell CPU-bound (us) from disk-bound (wa).
Next tutorial: stopping a process properly, the difference between asking it to quit and forcing it, with signals like SIGTERM and SIGKILL.