Amusement Park Anomaly Detection

Objective

You are a Data Analyst at an amusement park operator. You've been given two DataFrames: rides (containing metadata about the park's attractions) and visitors (containing logs of visitor ride histories and ratings).

Task

Write a PySpark function that identifies the ride with the most anomalous average visitor rating. The most anomalous ride is the one whose average rating lies furthest, whether above or below, from the global average rating across all rides.
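
As a worked illustration of this definition, the plain-Python sketch below computes per-ride averages and their distance from the global average on a handful of made-up ratings. The ride IDs and ratings are hypothetical. It also assumes the global average is the mean of all individual ratings; the task statement could instead be read as the mean of the per-ride averages, which can give a different answer when rides have unequal numbers of ratings.

```python
from statistics import mean

# Hypothetical ratings per ride (not taken from the real datasets).
ratings = {
    "R1": [4, 5, 4],
    "R2": [1, 2],
    "R3": [3, 4],
}

# Average rating of each ride.
ride_avg = {ride_id: mean(values) for ride_id, values in ratings.items()}

# Assumption: the global average is the mean of every individual rating.
global_avg = mean(r for values in ratings.values() for r in values)

# Absolute distance of each ride's average from the global average.
deviation = {ride_id: abs(avg - global_avg) for ride_id, avg in ride_avg.items()}

# The ride with the largest distance is the most anomalous one.
most_anomalous = max(deviation, key=deviation.get)
```

In this toy data R2 (average 1.5) sits furthest from the global average of about 3.29, so it would be the anomalous ride.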

Save the resulting DataFrame as result_df. The output must match the Expected Output Schema exactly: cast average_rating to float and is_anomalous to boolean. Order the final output by ride_id in ascending order.

File Paths

  • Rides Dataset: /home/interview/rides.csv
  • Visitors Dataset: /home/interview/visitors.csv
  • Starter script: /home/interview/park_outlier.py

Schema

rides.csv

Column Name   Data Type
ride_id       string
ride_name     string
type          string
capacity      integer

visitors.csv

Column Name   Data Type
visitor_id    string
ride_id       string
timestamp     timestamp
rating        integer

Expected Output Schema

Column Name      Data Type
ride_id          string
ride_name        string
average_rating   float
is_anomalous     boolean
