SPARK RDDs, DataFrames, and how Spark actually executes your code
Every Spark operation you write turns into a plan, gets optimized, and runs across the cluster. Learn the difference between RDDs and DataFrames, why one is preferred, and how Spark actually executes what you write.
What we're doing
You'll learn about Spark's two main data abstractions — RDDs and DataFrames, then watch in the UI the way Spark actually executes code through jobs, stages, and tasks.
Step 1: RDDs — Spark's original abstraction
RDD stands for Resilient Distributed Dataset. When Spark was created, RDDs were the primary way to work with data.
Key properties of RDDs:
- Resilient — if a worker crashes, Spark can rebuild lost data from the lineage of operations that created it
- Distributed — the data lives across many machines, not one
- Immutable — once created, an RDD doesn't change; any transformation creates a new one
Step 2: DataFrames — the modern default
DataFrames are the higher-level abstraction that replaced RDDs for most use cases. They look and feel like SQL tables - rows and columns with a defined schema.
The critical differences from RDDs:
- Structured — each column has a name and a type
- Optimized — Spark can understand the shape of the data and generate faster execution plans
- SQL-friendly — you can query them with SQL directly, or use the DataFrame API
The API is more expressive and the performance is much better.
Step 3: Why DataFrames beat RDDs almost every time
Reason 1 — the Catalyst optimizer.
Because Spark knows the structure of a DataFrame (columns, types, operations), it can optimize your code before running it. It analyzes what you wrote, rewrites it into a faster plan, and executes that instead. You get performance improvements for free just by writing DataFrame code.
RDDs give Spark almost no visibility. It runs your Python functions blindly, one after another, with no ability to optimize.
Reason 2 — columnar operations.
DataFrames store data column-wise under the hood using a format called Tungsten. That means when you filter by one column or aggregate one column, Spark only reads that column from memory. Way faster than RDDs which read entire rows even if you only need one field.
When to still use RDDs:
- You need fine-grained control over how data is partitioned
- You're doing something Spark's structured APIs can't express — like custom low-level algorithms
- You're maintaining legacy Spark code
Step 4: Transformations vs actions — Spark's lazy execution
When you write DataFrame operations, they don't actually run until you ask for a result. This is called lazy evaluation.
There are two kinds of operations in Spark:
- Transformations — describe what you want to do.
filter,select,groupBy,join. These don't execute anything. They just build up a plan - Actions — trigger the plan to actually run.
show,count,collect,write. These force Spark to execute everything you asked for
If you filter, then filter again, then select two columns, then count — Spark can combine those into a single pass over the data.
Step 5: Jobs, stages, and tasks
There are three levels of granularity:
- Job — one action triggers one job. Call
df.show(), that's one job. Call it again, that's a new job - Stage — a job is split into stages based on where data needs to move across the cluster. Operations that keep data on the same worker are one stage. When data has to be reshuffled between workers (like a groupBy or join), that starts a new stage
- Task — within a stage, work is split into tasks, one per partition of the data. If your data has 4 partitions, each stage has 4 tasks that run in parallel
Understanding this matters because when a job is slow, you look at which stage is the bottleneck. When a stage is slow, you look at how many tasks it spawned and whether they're balanced.
Step 6: Watch it happen in the UI
Let's see this live. In the terminal:
pyspark --master spark://default:7077
Wait for the prompt. Now build up a chain of transformations:
df = spark.range(1, 1000000)
filtered = df.filter(df.id > 500)
grouped = filtered.groupBy((df.id % 10).alias("bucket")).count()
Now trigger it with an action:
grouped.show()
In the master UI, you'll at least see the application marked as running with cores being used.
Try more actions:
grouped.count()
grouped.collect()
Spark doesn't cache results between actions unless you tell it to.
Step 7: See lazy evaluation and caching
To prove nothing runs until an action, try this:
import time
df = spark.range(1, 10000000)
start = time.time()
transformed = df.filter(df.id > 500).select((df.id * 2).alias("doubled"))
print(f"Transformation took {time.time() - start:.4f} seconds")
Now trigger the action:
start = time.time()
transformed.count()
print(f"Action took {time.time() - start:.4f} seconds")
This takes noticeably longer because the actual work happens now.
To make repeated actions on the same DataFrame faster, cache it:
transformed.cache()
transformed.count() # first time — computes and caches
transformed.count() # second time — uses cached data, much faster
Exit with exit().
What's next
Start Spark