Objective
Spark evaluates transformations lazily: nothing runs until an action is triggered, and by default each action recomputes the DataFrame from its source. If you plan to reuse a DataFrame multiple times, this is wasteful. The cache() method tells Spark to keep the computed DataFrame in memory after the first computation, so subsequent actions can read from memory instead of re-reading and re-parsing the source data. Caching is one of the most practical Spark optimizations, especially when iterating over the same dataset in machine learning pipelines or exploratory analysis.
Task
A dataset with 50,000 order records is available at /home/interview/orders.csv. A starter script has been created for you at /home/interview/cache_performance.py with a SparkSession and the CSV already loaded into a DataFrame called df. Cache the DataFrame, run count() twice while measuring each runtime, and print the results in the exact format:
first run = <time>s
second run = <time>s
faster after caching = true
Example
first run = 1.24s
second run = 0.05s
faster after caching = true
Solution
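from pyspark.sql import SparkSession
import time
# Start (or reuse) a Spark session and silence everything below ERROR level
spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
# Load the orders CSV; inferSchema=True makes Spark scan the file to detect column types
df = spark.read.csv("/home/interview/orders.csv", header=True, inferSchema=True)
# Mark the DataFrame for caching (lazy: nothing is stored until the first action)
df.cache()
# Cold run: computes the DataFrame and populates the cache
start = time.time()
df.count()
first_run = time.time() - start
# Warm run: served from the cached data
start = time.time()
df.count()
second_run = time.time() - start
# Report both timings and whether the second run was faster
is_faster = second_run < first_run
print(f"first run = {first_run:.2f}s")
print(f"second run = {second_run:.2f}s")
print(f"faster after caching = {str(is_faster).lower()}")
# Release Spark resources
spark.stop()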
Explanation
Step 1: Caching the DataFrame
df.cache()
cache() marks the DataFrame to be stored in memory after its first evaluation. It's important to understand that cache() is lazy: calling it doesn't actually load anything into memory. The data is only cached when the first action (like count()) triggers computation.
Under the hood, DataFrame.cache() is shorthand for persist() with the default storage level, MEMORY_AND_DISK. (RDD.cache() is different: it defaults to MEMORY_ONLY.) If the data doesn't fit in memory, Spark spills the overflow to disk rather than failing.
Step 2: First Run (Cold)
start = time.time()
df.count()
first_run = time.time() - start
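The first count() does all the heavy lifting: it reads the CSV from disk, parses every row into typed columns, and stores the result in memory. (Schema inference itself already happened when spark.read.csv was called, because inferSchema=True requires a pass over the file.) This is the "cold" run because nothing is cached yet.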
Step 3: Second Run (Warm)
start = time.time()
df.count()
second_run = time.time() - start
The second count() reads the cached data directly from memory, skipping the disk read and CSV parsing entirely. This is why it is typically much faster than the first run.
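The two timing steps repeat the same pattern, so they can be factored into a small helper. This is a sketch rather than part of the exercise: time_action and format_report are hypothetical names, and time.perf_counter() is swapped in for time.time() because it is a monotonic clock intended for measuring intervals. A plain Python callable stands in for df.count so the block runs without a Spark session.

```python
import time

def time_action(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start

def format_report(first_run, second_run):
    """Build the three output lines in the format the task requires."""
    is_faster = second_run < first_run
    return "\n".join([
        f"first run = {first_run:.2f}s",
        f"second run = {second_run:.2f}s",
        f"faster after caching = {str(is_faster).lower()}",
    ])

# Stand-in for df.count: counts 50,000 items in plain Python.
rows, elapsed = time_action(lambda: sum(1 for _ in range(50_000)))
print(rows, f"{elapsed:.4f}s")

# Formatting check using the timings from the Example section.
print(format_report(1.24, 0.05))
```

With the starter script's df, the same helper would be called as time_action(df.count) twice, once for the cold run and once for the warm run.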
Step 4: Cache vs Persist
cache() stores data in memory with disk spillover (the safest default). If you need more control, persist() lets you choose a storage level:
MEMORY_ONLY: Fastest, but recomputes partitions that don't fit in memory
MEMORY_AND_DISK: Default for cache(), spills to disk if needed
DISK_ONLY: Saves memory but slower reads
MEMORY_AND_DISK_SER: Serialized format, uses less memory but more CPU (a Scala/Java storage level; recent PySpark versions may not expose it as a constant)
When you're done with a cached DataFrame, call df.unpersist() to free the memory.