VMware: Zoological Climate Analysis — Data Engineering Interview Q&A (2026)

Zoological Climate Analysis

Beginner Mode

Start your terminal to use beginner mode.

Objective

In this problem, we are given two DataFrames: AnimalData and RegionData. The goal is to analyze the characteristics of various animal species across different regional climates.

Task

Write a PySpark function that joins the two DataFrames and groups the data by Species and Climate. Provide the average age, the average weight, and the total count of the animals for each species in each climate.

Note the AvgWeight column in the expected output. When calculating the average weight, you must truncate the decimal to turn it into a whole number (e.g., an average of 152.7 becomes 152).

Save your resulting DataFrame as result_df. Ensure the output matches the exact schema order requested and order the final DataFrame alphabetically by Species (ascending), and then by Climate (ascending).

File Path

Animal Dataset: /home/interview/animal_data.csv
Region Dataset: /home/interview/region_data.csv
Starter script: /home/interview/zoology.py

Schema

animal_data.csv

Column Name	Type
ID	String
Species	String
Age	Integer
Weight	Float
Region	String

region_data.csv

Column Name	Type
Region	String
Climate	String

Expected Output Schema

Column Name	Type
Species	String
Climate	String
AvgAge	Float
AvgWeight	Integer
TotalAnimals	Integer

Example

Given this sample input:

AnimalData

| ID | Species | Age | Weight | Region |
|----+---------+-----+--------+---------------|
| A1 | Lion | 10 | 200.5 | Africa |
| A2 | Tiger | 5 | 150.3 | Asia |
| A3 | Bear | 7 | 180.2 | North America |
| A4 | Lion | 12 | 205.7 | Africa |
| A5 | Tiger | 6 | 155.1 | Asia |

RegionData

| Region | Climate |
|---------------+-----------|
| Africa | Hot |
| Asia | Temperate |
| North America | Cold |

The expected output would be:

| Species | Climate | AvgAge | AvgWeight | TotalAnimals |
|---------+-----------+--------+-----------+--------------|
| Bear | Cold | 7.0 | 180 | 1 |
| Lion | Hot | 11.0 | 203 | 2 |
| Tiger | Temperate | 5.5 | 152 | 2 |

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

animal_data = spark.read.csv("/home/interview/animal_data.csv", header=True, inferSchema=True)
region_data = spark.read.csv("/home/interview/region_data.csv", header=True, inferSchema=True)

# Step 1: Join the DataFrames on the shared 'Region' column
joined_df = animal_data.join(region_data, on="Region", how="inner")

# Step 2: Group by Species and Climate, then perform the aggregations
result_df = joined_df.groupBy("Species", "Climate").agg(
    F.avg("Age").cast("float").alias("AvgAge"),
    
    # Casting to integer automatically truncates the decimal values (e.g. 152.7 -> 152)
    F.avg("Weight").cast("integer").alias("AvgWeight"),
    
    F.count("ID").cast("integer").alias("TotalAnimals")
)

# Step 3: Format the final schema and order deterministically
result_df = result_df.select(
    "Species", "Climate", "AvgAge", "AvgWeight", "TotalAnimals"
).orderBy("Species", "Climate")

# --- Do not edit below this line ---
result_df.coalesce(1).write.csv("/home/interview/output", header=True, mode="overwrite")
spark.stop()

Explanation

Step 1: Merging the Datasets

joined_df = animal_data.join(region_data, on="Region", how="inner")

To summarize data by both the animal's species and the region's climate, we need to bring the Climate column into the AnimalData DataFrame. We do this by executing an inner join on the shared Region column.

Step 2: Grouping by Multiple Columns

joined_df.groupBy("Species", "Climate")

By passing both "Species" and "Climate" to .groupBy(), PySpark creates specific sub-buckets for every unique combination of those two traits, allowing us to aggregate the stats accurately.

Step 3: Calculating and Truncating Aggregations

.agg(
    F.avg("Age").cast("float").alias("AvgAge"),
    F.avg("Weight").cast("integer").alias("AvgWeight"),
    F.count("ID").cast("integer").alias("TotalAnimals")
)

Inside the .agg() method, we calculate our statistics. Notice the behavior of the AvgWeight column. In the example, the Tigers have an average weight of 152.7, but the expected output requires 152. Rather than using a rounding function (which would round 152.7 up to 153), we simply .cast("integer"). Casting a float to an integer in PySpark immediately truncates the decimal off, giving us the exact desired output!

Step 4: Output Formatting

result_df = result_df.select(
    "Species", "Climate", "AvgAge", "AvgWeight", "TotalAnimals"
).orderBy("Species", "Climate")

Finally, we chain a .select() block to ensure the final output columns are arranged in the precise order specified by the prompt's Expected Output Schema, and we sort the results deterministically by "Species" and "Climate".
betically by "Species".

Terminal requires a larger screen

Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.

Linux Terminal Environment

Write and execute your solution in the terminal below.

Track

	Question	Difficulty	Company	Access

Need more practice in this area? Explore more questions →