What Spark is and why distributed compute matters

Apache Spark is the engine behind many modern big-data pipelines. Learn what it is, why distributed processing exists in the first place, and get hands-on with a real Spark cluster on the VM.

What we're doing

You'll start a small Spark cluster on the VM, inspect it in the Master UI, run your first Spark job, and open an interactive session.

Step 1: What Spark is

Apache Spark is a distributed processing engine. You give it data and code. It splits the data across multiple machines, runs your code in parallel on each piece, then combines the results.
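The split, run in parallel, combine pattern can be sketched in plain Python. This is only an analogy, not Spark code: threads stand in for machines, and the names split, sum_of_squares, and run_distributed are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_parts):
    """Cut a list into num_parts roughly equal slices."""
    size, extra = divmod(len(data), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        # The first `extra` slices get one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def sum_of_squares(part):
    """The 'user code' that runs independently on each slice."""
    return sum(x * x for x in part)


def run_distributed(data, num_workers=4):
    """Split the data, process each slice in parallel, combine the results."""
    parts = split(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(sum_of_squares, parts))
    return sum(partials)


total = run_distributed(list(range(1, 101)))
print(total)  # 338350, the sum of squares from 1 to 100
```

The combining step works here because a sum of partial sums equals the full sum. Spark applies the same idea, with the slices living on different machines.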

The huge advantage: you write code that looks almost identical to single-machine pandas or SQL. Spark handles the messy parts — splitting the data, sending tasks to machines, retrying failures, combining results.

A few facts that make Spark the dominant tool in this space:

Fast — keeps working data in memory across the cluster, which is often much faster than disk-based alternatives such as Hadoop MapReduce
Multi-language — write Spark in Python (PySpark), Scala, Java, R, or SQL
Versatile — works with batch data, streaming data, machine learning, and graph processing
Mature — used in production at companies such as Netflix, Uber, Airbnb, and Apple, and at many major banks

Step 2: The driver and executor model

To understand how Spark splits work, you need two concepts: the driver and the executors.

The driver is the brain. It's the process that runs your code. When you write df.filter(...), the driver reads that line. But it doesn't do the filtering itself — it creates a plan, a list of small tasks that, together, do the filtering on all your data.

The executors are the muscle. They're processes running on the worker machines in the cluster. Their job is to receive small tasks from the driver, run them on their slice of data, and report results back.
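A toy model of this division of labor, in plain Python, might look like the following. It is a sketch for intuition only: ToyDriver, Task, and executor_run are hypothetical names, threads stand in for executor processes, and real Spark planning is far more involved.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """One unit of work: apply a predicate to a single partition."""
    partition_id: int
    rows: List[dict]
    predicate: Callable[[dict], bool]


def executor_run(task):
    """Executor side: filter only the rows in this task's partition."""
    return [row for row in task.rows if task.predicate(row)]


class ToyDriver:
    def __init__(self, partitions):
        self.partitions = partitions

    def plan_filter(self, predicate):
        """Driver side: build one task per partition without filtering anything."""
        return [Task(i, rows, predicate) for i, rows in enumerate(self.partitions)]

    def run(self, tasks, num_executors=2):
        """Hand the tasks to the executors and gather their results in order."""
        with ThreadPoolExecutor(max_workers=num_executors) as pool:
            per_partition = list(pool.map(executor_run, tasks))
        return [row for kept in per_partition for row in kept]


partitions = [
    [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}],
    [{"name": "Charlie", "age": 35}],
]
driver = ToyDriver(partitions)
tasks = driver.plan_filter(lambda row: row["age"] > 28)
result = driver.run(tasks)
print(len(tasks), [row["name"] for row in result])  # 2 ['Bob', 'Charlie']
```

The point to notice is that plan_filter touches no rows. It only describes the work, and the filtering happens later inside executor_run.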

Step 3: Start the cluster

A Spark cluster needs one master and at least one worker. Start the master first:

start-master.sh

Now start two workers, both connecting to the master at localhost:7077:

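start-worker.sh spark://localhost:7077
start-worker.sh spark://localhost:7077

If the second call reports that a worker is already running (the stock Spark scripts start one worker instance per host by default), start both workers with a single call instead:

SPARK_WORKER_INSTANCES=2 start-worker.sh spark://localhost:7077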

Confirm that everything is running with jps, which lists the Java processes on the machine:

jps

You should see processes named Master and Worker (twice).

Step 4: Look at the cluster UI

Open the Spark Master UI link in the environment panel. You'll see:

URL: spark://...:7077 — the address other processes use to submit jobs
Alive Workers: 2 — both workers are connected
Cores in use: 4 Total, 0 Used — total compute capacity, currently idle
Memory in use — total memory across the cluster
Workers table — your two workers as ALIVE

Below that you'll see Running Applications and Completed Applications — empty for now because no jobs have been submitted.
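The same numbers can also be read programmatically. The sketch below assumes the standalone Master serves a JSON view of its state at http://localhost:8080/json/ and that the payload uses field names such as aliveworkers, cores, coresused, activeapps, and completedapps. Both the port and the field names are assumptions to verify against your Spark version. The sample_state dictionary is made-up data used only to exercise the helper.

```python
import json
from urllib.request import urlopen


def fetch_master_state(url="http://localhost:8080/json/", timeout=5):
    """Fetch the Master's state as a dict (URL and port are assumptions)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize(state):
    """Pick out the values the Master UI shows at the top of the page."""
    return {
        "alive_workers": state.get("aliveworkers", 0),
        "cores_total": state.get("cores", 0),
        "cores_used": state.get("coresused", 0),
        "running_apps": len(state.get("activeapps", [])),
        "completed_apps": len(state.get("completedapps", [])),
    }


# Hypothetical payload mirroring the idle two-worker cluster described above.
sample_state = {
    "aliveworkers": 2,
    "cores": 4,
    "coresused": 0,
    "activeapps": [],
    "completedapps": [],
}
summary = summarize(sample_state)
print(summary)
```

On the VM, summarize(fetch_master_state()) would produce the same kind of summary from the live cluster, provided the endpoint is reachable at that address.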

Step 5: Submit your first Spark job

Spark ships with example programs, including a Pi estimator:

spark-submit --master spark://localhost:7077 /opt/spark/examples/src/main/python/pi.py 10

spark-submit — Spark's tool for submitting jobs
--master spark://localhost:7077 — point at our master
/opt/spark/examples/src/main/python/pi.py — the example Python script
10 — argument to the script (the number of partitions the sampling work is split into; more partitions means more samples)

While the job runs, refresh the Master UI. You'll see:

Running Applications: 1 — your job is listed
Cores in use: 4 Used — the cluster is working

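When the job finishes, it moves to Completed Applications and cores go back to 0 Used. In the terminal, the script's output includes a line starting with "Pi is roughly".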
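To see what the Pi estimator computes, here is a single-machine sketch of the Monte Carlo idea: throw random points at a square and count how many land inside the inscribed circle. It is not the shipped pi.py. The function names, the per-partition seeding, and the samples_per_partition value are assumptions made for this illustration.

```python
import random


def count_hits(num_samples, seed):
    """Count random points in [-1, 1] x [-1, 1] that fall inside the unit circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            hits += 1
    return hits


def estimate_pi(partitions, samples_per_partition):
    """Each 'partition' is an independent batch; Spark would run these in parallel."""
    total_samples = partitions * samples_per_partition
    hits = sum(count_hits(samples_per_partition, seed=p) for p in range(partitions))
    # Circle area / square area = pi / 4, so pi is about 4 * hits / samples.
    return 4.0 * hits / total_samples


pi_estimate = estimate_pi(partitions=10, samples_per_partition=20_000)
print(f"Pi is roughly {pi_estimate:.4f}")
```

Because each batch is independent, the work divides cleanly across executors: every executor counts hits for its own partitions, and only the small per-partition counts need to be added together at the end.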

Step 6: Open a Spark session interactively

The previous step submitted a finished script. The other way to use Spark is interactively: open a session and run commands as you go.

pyspark --master spark://localhost:7077

You'll get a Python prompt with a Spark session already set up as spark. Try this:

data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
df.filter(df.age > 28).show()

You just created a Spark DataFrame, displayed it, and ran a filter that keeps only the rows where age is greater than 28 (Bob and Charlie).

Exit with exit().

After hibernation

If the VM hibernates, reconnect and restart the cluster:

start-master.sh
start-worker.sh spark://localhost:7077
start-worker.sh spark://localhost:7077

What's next

Now go and try this out in a live environment: boot a fresh cluster and experiment with the commands above.
