Objective
When Spark reads a file, it splits the data into partitions (chunks that can be processed in parallel across the cluster). The number of partitions Spark creates depends mainly on the file size, the value of spark.sql.files.maxPartitionBytes (default 128MB), and the parallelism available to the job. Getting a feel for how Spark partitions your data is one of the first steps to understanding parallel execution.
Task
A file with 5,000 order records is available at /home/interview/orders.csv. A starter script has been created for you at /home/interview/read_partitions.py with a SparkSession and the CSV already loaded into a DataFrame called df. Complete the script to find the number of partitions and print it in the exact format: number of partitions = <N>
Example
number of partitions = 1
Note: The actual value depends on file size and Spark's default configuration.
Solution
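from pyspark.sql import SparkSession

# Start (or reuse) a Spark session and quiet the logs so only our output shows.
spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# Load the orders CSV; the first row is the header and column types are inferred.
df = spark.read.csv("/home/interview/orders.csv", header=True, inferSchema=True)

# The partition count is exposed through the DataFrame's underlying RDD.
num_partitions = df.rdd.getNumPartitions()
print(f"number of partitions = {num_partitions}")
spark.stop()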
Explanation
Step 1: Getting the Underlying RDD
df.rdd.getNumPartitions()
Every Spark DataFrame is backed by an RDD (Resilient Distributed Dataset) under the hood. The RDD is the lower-level representation that actually holds the partitioned data. Calling .rdd on a DataFrame gives you access to it, and getNumPartitions() tells you how many chunks Spark split the data into.
Step 2: Why Does Spark Partition Data?
When Spark reads a file, it doesn't load the entire thing into one place. It splits the data into partitions so that multiple tasks can process different chunks in parallel. The number of partitions is determined by:
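- File size vs. spark.sql.files.maxPartitionBytes (default 128MB): Spark aims to keep each partition under this limit. A 256MB file would get at least 2 partitions and a 512MB file at least 4. The limit is an upper bound, so with many cores available Spark may pick a smaller split size and produce more partitions.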
- For small files (well under 128MB, like our ~250KB CSV), Spark typically creates just 1 partition since there's no benefit in splitting such a small amount of data.
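The sizing rule above can be turned into a rough back-of-the-envelope calculator. The sketch below is a simplified model for a single splittable file, not Spark's actual code. The function names target_split_bytes and estimate_partitions are hypothetical, the 4MB figure is the default of spark.sql.files.openCostInBytes, and default_parallelism stands in for the number of cores Spark believes it has, which is an assumption you would adjust for your own cluster.

```python
import math

MB = 1024 * 1024


def target_split_bytes(file_size, max_partition_bytes=128 * MB,
                       open_cost_bytes=4 * MB, default_parallelism=1):
    """Approximate the split size Spark targets when scanning one file.

    Simplified model: the split size never exceeds max_partition_bytes,
    never drops below open_cost_bytes, and otherwise tries to spread the
    data evenly over the available cores.
    """
    total_bytes = file_size + open_cost_bytes
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))


def estimate_partitions(file_size, **kwargs):
    """Estimate how many partitions a single splittable file is read into."""
    split = target_split_bytes(file_size, **kwargs)
    return math.ceil(file_size / split)


if __name__ == "__main__":
    print(estimate_partitions(250 * 1024))                       # ~250KB CSV -> 1
    print(estimate_partitions(256 * MB))                         # one core -> 2
    print(estimate_partitions(512 * MB))                         # one core -> 4
    print(estimate_partitions(256 * MB, default_parallelism=8))  # eight cores -> 8
```

With a single core the estimate matches the "file size divided by 128MB" intuition. With eight cores the same 256MB file is cut into smaller pieces so that every core gets work, while the ~250KB CSV still lands in 1 partition because it is far below the 4MB floor.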
Step 3: What This Means in Practice
The partition count sets the upper limit on parallelism: each partition is processed by one task, so data with 4 partitions lets Spark run at most 4 tasks simultaneously (assuming enough cores). For our small 5,000-row CSV, 1 partition is perfectly fine. But for larger datasets, too few partitions means underutilized cores, and too many means excessive scheduling overhead.
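To make the trade-off concrete, here is a small arithmetic sketch in plain Python. It assumes one task per partition and one task per core at a time; the helper names task_waves and idle_cores are hypothetical and are not part of any Spark API.

```python
import math


def task_waves(num_partitions, num_cores):
    """Rounds of tasks needed if each core runs one task at a time."""
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return math.ceil(num_partitions / num_cores)


def idle_cores(num_partitions, num_cores):
    """Cores left without work when there are fewer partitions than cores."""
    return max(0, num_cores - num_partitions)


if __name__ == "__main__":
    # 1 partition on 8 cores: a single wave, but 7 cores sit idle.
    print(task_waves(1, 8), idle_cores(1, 8))
    # 4 partitions on 4 cores: a single wave with every core busy.
    print(task_waves(4, 4), idle_cores(4, 4))
    # 1000 partitions on 8 cores: 125 waves, each adding scheduling overhead.
    print(task_waves(1000, 8), idle_cores(1000, 8))
```

The first case is the "too few partitions" situation and the last is the "too many" one: no core is idle, but the scheduler has to launch 1000 tasks.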