Objective
When Spark reads a small file, it often creates just 1 partition, which means a single task processes the entire dataset. In production, you'll frequently want more partitions so Spark can process data in parallel across multiple cores. The repartition() method lets you control this by redistributing data across a specified number of partitions. Each partition is processed by exactly one task per stage, and those tasks run in parallel up to the number of available cores.
Task
An orders dataset with 5,000 records is available at /home/interview/orders.csv. A starter script has been created for you at /home/interview/repartition_tasks.py with a SparkSession and the CSV already loaded into a DataFrame called df. Repartition the DataFrame to 8 partitions and print the resulting task count in the exact format: task count = 8
Example
task count = 8
Solution
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
df = spark.read.csv("/home/interview/orders.csv", header=True, inferSchema=True)
df_repartitioned = df.repartition(8)
print(f"task count = {df_repartitioned.rdd.getNumPartitions()}")
spark.stop()
Explanation
Step 1: Reading the CSV
df = spark.read.csv("/home/interview/orders.csv", header=True, inferSchema=True)
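For a small CSV file like this one, Spark typically creates 1 partition, because the file is far smaller than the split size Spark derives from spark.sql.files.maxPartitionBytes (default 128MB). That setting is an upper bound on the split size, not a guarantee: a larger file that is still under 128MB can be split into several partitions, depending on the number of available cores.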
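To make the read-side partition count more concrete, here is a rough, pure-Python model of the split-size heuristic that Spark's file sources appear to use for a single splittable file. It is a sketch, not Spark's implementation: the function name estimate_read_partitions is hypothetical, the 300 KB file size is an assumption about a 5,000-row CSV, and the 4 MB figure is the default of spark.sql.files.openCostInBytes. Exact behavior can vary by Spark version and file format.

```python
import math

MB = 1024 * 1024


def estimate_read_partitions(file_bytes, default_parallelism,
                             max_partition_bytes=128 * MB,
                             open_cost_bytes=4 * MB):
    """Rough estimate of read partitions for one splittable file (hypothetical helper)."""
    # Each file is padded with an "open cost" before dividing by the core count.
    bytes_per_core = (file_bytes + open_cost_bytes) // default_parallelism
    # The split size is capped by maxPartitionBytes and floored by the open cost.
    max_split_bytes = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))
    return max(1, math.ceil(file_bytes / max_split_bytes))


# Assumed size for a small orders CSV: about 300 KB.
small_file = estimate_read_partitions(300 * 1024, default_parallelism=8)
# A 64 MB file is also under 128 MB, but the model splits it on 8 cores.
medium_file = estimate_read_partitions(64 * MB, default_parallelism=8)

print(f"small file partitions  = {small_file}")
print(f"medium file partitions = {medium_file}")
```

Under these assumptions the small file lands in a single partition, which is why the exercise needs an explicit repartition(8) to get more tasks.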
Step 2: Repartitioning
df_repartitioned = df.repartition(8)
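repartition(8) performs a full shuffle of the data across 8 new partitions. This is a wide transformation. When called with only a partition count, as here, Spark distributes rows round-robin so that they are spread evenly across the target partitions; when partitioning columns are also passed, it uses hash partitioning on those columns instead. Unlike coalesce(), which merges existing partitions without a full shuffle and cannot increase their number, repartition() can increase or decrease the number of partitions and always triggers a shuffle.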
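The difference between the two distribution strategies can be illustrated without Spark. The sketch below is a simplified pure-Python simulation, not Spark's code: the rows and the customer_id key are made up for illustration, Python's built-in hash() stands in for Spark's own hash function, and real round-robin assignment starts from a different position in each input partition.

```python
from collections import Counter

NUM_PARTITIONS = 8

# Hypothetical order rows: (order_id, customer_id), with only 5 distinct customers.
rows = [(order_id, order_id % 5) for order_id in range(5000)]

# Round-robin: deal rows out like cards, ignoring their contents.
round_robin = Counter(index % NUM_PARTITIONS for index, _ in enumerate(rows))

# Hash-based: the target partition depends only on the key,
# so all rows with the same key end up in the same partition.
hashed = Counter(hash(customer_id) % NUM_PARTITIONS for _, customer_id in rows)

print("round-robin partition sizes:", sorted(round_robin.values()))
print("hash partition sizes:       ", sorted(hashed.values()))
print("partitions used by hashing: ", len(hashed), "of", NUM_PARTITIONS)
```

Round-robin fills all 8 partitions evenly, while hashing a key with only 5 distinct values can leave some partitions empty. That trade-off is why a plain repartition(8) suits this task, where the goal is simply even parallelism.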
Step 3: Getting the Task Count
df_repartitioned.rdd.getNumPartitions()
Each partition is processed by exactly one task in Spark's execution model. With 8 partitions, the next stage that operates on this DataFrame will launch 8 tasks, one per partition, and they run in parallel up to the number of available cores. getNumPartitions() reports the partition count of the underlying RDD; it does not collect any rows to the driver.
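How many of those tasks actually run at the same time depends on the cores available to the application. As a small arithmetic sketch (the helper name task_waves and the core counts are illustrative assumptions, not values from the exercise environment), the number of scheduling "waves" for a stage is the partition count divided by the core count, rounded up:

```python
import math


def task_waves(num_partitions, total_cores):
    """Number of rounds needed to run one task per partition (hypothetical helper)."""
    return math.ceil(num_partitions / total_cores)


# (partitions, cores) scenarios, chosen for illustration only.
for partitions, cores in [(1, 8), (8, 8), (8, 4)]:
    busy = min(partitions, cores)
    print(f"{partitions} partition(s) on {cores} cores: "
          f"{task_waves(partitions, cores)} wave(s), {busy} core(s) busy at once")
```

With 1 partition, 7 of 8 cores sit idle; with 8 partitions on 8 cores, all tasks run in a single wave; with 8 partitions on 4 cores, the stage needs two waves.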