Objective
As a herpetologist studying reptiles and amphibians, you have two DataFrames at your disposal: observations (containing sighting logs) and species (containing the reference catalog of animals).
Task
Write a PySpark function that joins the observations and species DataFrames on the species_id column.
After joining, return the 3 rows with the highest count of individuals observed, sorted by count in descending order. Save the resulting DataFrame as result_df. Ensure its columns appear in the exact order given in the Expected Output Schema.
File Path
- Observations Dataset: /home/interview/observations.csv
- Species Dataset: /home/interview/species.csv
- Starter script: /home/interview/herpetology.py
Schema
observations.csv
| Column Name | Data Type | Description |
|---|---|---|
| obs_id | Integer | The unique identifier of the observation |
| species_id | Integer | The unique identifier of the species observed |
| location_id | Integer | The unique identifier of the location where the observation was made |
| count | Integer | The number of individuals observed |
species.csv
| Column Name | Data Type | Description |
|---|---|---|
| species_id | Integer | The unique identifier of the species |
| species_name | String | The common name of the species |
Expected Output Schema
| Column Name | Data Type | Description |
|---|---|---|
| obs_id | Integer | The unique identifier of the observation |
| species_id | Integer | The unique identifier of the species |
| species_name | String | The common name of the species |
| location_id | Integer | The unique identifier of the location where the observation was made |
| count | Integer | The number of individuals observed |
Example
Given this sample input:
observations
| obs_id | species_id | location_id | count |
|---|---|---|---|
| 1 | 100 | 1 | 55 |
| 2 | 101 | 2 | 35 |
| 3 | 100 | 1 | 45 |
species
| species_id | species_name |
|---|---|
| 100 | Python |
| 101 | Gecko |
| 102 | Frog |
The expected output would be:
| obs_id | species_id | species_name | location_id | count |
|---|---|---|---|---|
| 1 | 100 | Python | 1 | 55 |
| 3 | 100 | Python | 1 | 45 |
| 2 | 101 | Gecko | 2 | 35 |