SPARK

RDDs, DataFrames, and how Spark actually executes your code

Every Spark operation you write turns into a plan, gets optimized, and runs across the cluster. Learn the difference between RDDs and DataFrames, why one is preferred, and how Spark actually executes what you write.

What we're doing

You'll learn about Spark's two main data abstractions — RDDs and DataFrames, then watch in the UI the way Spark actually executes code through jobs, stages, and tasks.

Step 1: RDDs — Spark's original abstraction

RDD stands for Resilient Distributed Dataset. When Spark was created, RDDs were the primary way to work with data.

Key properties of RDDs:

Resilient — if a worker crashes, Spark can rebuild lost data from the lineage of operations that created it
Distributed — the data lives across many machines, not one
Immutable — once created, an RDD doesn't change; any transformation creates a new one

Step 2: DataFrames — the modern default

DataFrames are the higher-level abstraction that replaced RDDs for most use cases. They look and feel like SQL tables - rows and columns with a defined schema.

The critical differences from RDDs:

Structured — each column has a name and a type
Optimized — Spark can understand the shape of the data and generate faster execution plans
SQL-friendly — you can query them with SQL directly, or use the DataFrame API

The API is more expressive and the performance is much better.

Step 3: Why DataFrames beat RDDs almost every time

Reason 1 — the Catalyst optimizer.

Because Spark knows the structure of a DataFrame (columns, types, operations), it can optimize your code before running it. It analyzes what you wrote, rewrites it into a faster plan, and executes that instead. You get performance improvements for free just by writing DataFrame code.

RDDs give Spark almost no visibility. It runs your Python functions blindly, one after another, with no ability to optimize.

Reason 2 — columnar operations.

DataFrames store data column-wise under the hood using a format called Tungsten. That means when you filter by one column or aggregate one column, Spark only reads that column from memory. Way faster than RDDs which read entire rows even if you only need one field.

When to still use RDDs:

You need fine-grained control over how data is partitioned
You're doing something Spark's structured APIs can't express — like custom low-level algorithms
You're maintaining legacy Spark code

Step 4: Transformations vs actions — Spark's lazy execution

When you write DataFrame operations, they don't actually run until you ask for a result. This is called lazy evaluation.

There are two kinds of operations in Spark:

Transformations — describe what you want to do. filter, select, groupBy, join. These don't execute anything. They just build up a plan
Actions — trigger the plan to actually run. show, count, collect, write. These force Spark to execute everything you asked for

If you filter, then filter again, then select two columns, then count — Spark can combine those into a single pass over the data.

Step 5: Jobs, stages, and tasks

There are three levels of granularity:

Job — one action triggers one job. Call df.show(), that's one job. Call it again, that's a new job
Stage — a job is split into stages based on where data needs to move across the cluster. Operations that keep data on the same worker are one stage. When data has to be reshuffled between workers (like a groupBy or join), that starts a new stage
Task — within a stage, work is split into tasks, one per partition of the data. If your data has 4 partitions, each stage has 4 tasks that run in parallel

Understanding this matters because when a job is slow, you look at which stage is the bottleneck. When a stage is slow, you look at how many tasks it spawned and whether they're balanced.

Step 6: Watch it happen in the UI

Let's see this live. In the terminal:

pyspark --master spark://default:7077

Wait for the prompt. Now build up a chain of transformations:

df = spark.range(1, 1000000)
filtered = df.filter(df.id > 500)
grouped = filtered.groupBy((df.id % 10).alias("bucket")).count()

Now trigger it with an action:

grouped.show()

In the master UI, you'll at least see the application marked as running with cores being used.

Try more actions:

grouped.count()
grouped.collect()

Spark doesn't cache results between actions unless you tell it to.

Step 7: See lazy evaluation and caching

To prove nothing runs until an action, try this:

import time

df = spark.range(1, 10000000)
start = time.time()
transformed = df.filter(df.id > 500).select((df.id * 2).alias("doubled"))
print(f"Transformation took {time.time() - start:.4f} seconds")

Now trigger the action:

start = time.time()
transformed.count()
print(f"Action took {time.time() - start:.4f} seconds")

This takes noticeably longer because the actual work happens now.

To make repeated actions on the same DataFrame faster, cache it:

transformed.cache()
transformed.count()  # first time — computes and caches
transformed.count()  # second time — uses cached data, much faster

Exit with exit().

What's next

Now go and try this out in a live environment — boot a fresh cluster and play with the manifests above.

Start Spark

Spec 2 CPU / 8 GiB ·Disk 25 GiB

Required 1 VM · 2 CPU · 8 GB · 25 GiB disk

Available 1 VM · 1 CPU · 2 GB · 10 GiB disk