Objective
In this problem, we are given two DataFrames: AnimalData and RegionData. The goal is to analyze the characteristics of various animal species across different regional climates.
Task
Write a PySpark function that joins the two DataFrames and groups the data by Species and Climate. Provide the average age, the average weight, and the total count of the animals for each species in each climate.
Note the AvgWeight column in the expected output. When calculating the average weight, you must truncate the decimal to turn it into a whole number (e.g., an average of 152.7 becomes 152).
Save your resulting DataFrame as result_df. Ensure the output matches the exact schema order requested and order the final DataFrame alphabetically by Species (ascending), and then by Climate (ascending).
File Path
- Animal Dataset:
/home/interview/animal_data.csv
- Region Dataset:
/home/interview/region_data.csv
- Starter script:
/home/interview/zoology.py
Schema
animal_data.csv
| Column Name |
Type |
| ID |
String |
| Species |
String |
| Age |
Integer |
| Weight |
Float |
| Region |
String |
region_data.csv
| Column Name |
Type |
| Region |
String |
| Climate |
String |
Expected Output Schema
| Column Name |
Type |
| Species |
String |
| Climate |
String |
| AvgAge |
Float |
| AvgWeight |
Integer |
| TotalAnimals |
Integer |
Example
Given this sample input:
AnimalData
| ID | Species | Age | Weight | Region |
|----+---------+-----+--------+---------------|
| A1 | Lion | 10 | 200.5 | Africa |
| A2 | Tiger | 5 | 150.3 | Asia |
| A3 | Bear | 7 | 180.2 | North America |
| A4 | Lion | 12 | 205.7 | Africa |
| A5 | Tiger | 6 | 155.1 | Asia |
RegionData
| Region | Climate |
|---------------+-----------|
| Africa | Hot |
| Asia | Temperate |
| North America | Cold |
The expected output would be:
| Species | Climate | AvgAge | AvgWeight | TotalAnimals |
|---------+-----------+--------+-----------+--------------|
| Bear | Cold | 7.0 | 180 | 1 |
| Lion | Hot | 11.0 | 203 | 2 |
| Tiger | Temperate | 5.5 | 152 | 2 |
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
animal_data = spark.read.csv("/home/interview/animal_data.csv", header=True, inferSchema=True)
region_data = spark.read.csv("/home/interview/region_data.csv", header=True, inferSchema=True)
# Step 1: Join the DataFrames on the shared 'Region' column
joined_df = animal_data.join(region_data, on="Region", how="inner")
# Step 2: Group by Species and Climate, then perform the aggregations
result_df = joined_df.groupBy("Species", "Climate").agg(
F.avg("Age").cast("float").alias("AvgAge"),
# Casting to integer automatically truncates the decimal values (e.g. 152.7 -> 152)
F.avg("Weight").cast("integer").alias("AvgWeight"),
F.count("ID").cast("integer").alias("TotalAnimals")
)
# Step 3: Format the final schema and order deterministically
result_df = result_df.select(
"Species", "Climate", "AvgAge", "AvgWeight", "TotalAnimals"
).orderBy("Species", "Climate")
# --- Do not edit below this line ---
result_df.coalesce(1).write.csv("/home/interview/output", header=True, mode="overwrite")
spark.stop()
Explanation
Step 1: Merging the Datasets
joined_df = animal_data.join(region_data, on="Region", how="inner")
To summarize data by both the animal's species and the region's climate, we need to bring the Climate column into the AnimalData DataFrame. We do this by executing an inner join on the shared Region column.
Step 2: Grouping by Multiple Columns
joined_df.groupBy("Species", "Climate")
By passing both "Species" and "Climate" to .groupBy(), PySpark creates specific sub-buckets for every unique combination of those two traits, allowing us to aggregate the stats accurately.
Step 3: Calculating and Truncating Aggregations
.agg(
F.avg("Age").cast("float").alias("AvgAge"),
F.avg("Weight").cast("integer").alias("AvgWeight"),
F.count("ID").cast("integer").alias("TotalAnimals")
)
Inside the .agg() method, we calculate our statistics. Notice the behavior of the AvgWeight column. In the example, the Tigers have an average weight of 152.7, but the expected output requires 152. Rather than using a rounding function (which would round 152.7 up to 153), we simply .cast("integer"). Casting a float to an integer in PySpark immediately truncates the decimal off, giving us the exact desired output!
Step 4: Output Formatting
result_df = result_df.select(
"Species", "Climate", "AvgAge", "AvgWeight", "TotalAnimals"
).orderBy("Species", "Climate")
Finally, we chain a .select() block to ensure the final output columns are arranged in the precise order specified by the prompt's Expected Output Schema, and we sort the results deterministically by "Species" and "Climate".
betically by "Species".