Question 1

Average Order Value

Accepted Answer

This SQL question is called Average Order Value. We have one table that is called Orders. This table records customers' purchases and has four columns: orders ID, date of transaction, reference to the customer by ID, and total amount spent. Our job here is to calculate for each customer the average amount they spent across all their orders. In the end, the average value should be rounded to two decimal places. The output should include customer ID and average order value, sorted by ID in ascending order. We start with FROM clause. Then we proceed with GROUP BY clause that separates individual orders into groups for each customer. In order to choose specific columns from a particular table, we use SELECT clause. We wrap the total amount column with Average inside of Round function as first argument, and two number as two decimal places. Finally, we order by customer ID.

Question 2

Join Employees and Departments

Accepted Answer

We are given two tables, departments and employees. These two tables are connected through department ID column. Some employees might not be assigned to any department, so their department ID is null. But we still want them in our results. Our main goal here is to return all employees that earn more than 50,000 with their department name sorted by hire date from most recent to oldest. In order to combine these two tables, we will use the concept of left join. A left join keeps all rows from the left table even if there is no matching row in the right table. The missing values just become null. We implement left join in between, and using on, we indicate which common columns were used to connect these tables. We will use where clause to filter the rows. We only want those employees whose salary is greater than 50,000. Finally, we sort everything out by hire date in descending order so that the most recent date pops up first and oldest date last.

Question 3

Filter Orders by Date Range

Accepted Answer

We need to write a query to filter orders by date range. We are given one table that is called Orders. It has four columns: customer's name, ID of an order that has been made, date of transaction, and total amount spent. We are required to return only the orders that were placed between January 1st, 2023, and June 30th, 2023. Both dates are inclusive, which means that orders on exactly those dates should be included, too. In SQL, we always start with FROM clause because this tells which table we want to work with. Using SELECT clause, we can choose the columns for the output. WHERE clause, which is the filter step. It goes through every single row and checks if the condition is true. For order date column, we use BETWEEN and operators. This combination checks if a value falls within a range, including both the start and end values. Finally, we sort everything by order date in ascending order so that the rows with the earliest date come first.

Question 4

Find Customers Without Orders

Accepted Answer

We need to find customers without orders. We have two tables, customers and orders. These tables are connected through customer ID column. Our job here is to find customers who have never placed a single order, but at the same time they exist in the customers table. We are more interested in LEFT JOIN since we want all rows from the left table to stay. A LEFT JOIN keeps all rows from the left table, even if there is no matching row in the right table. The missing values just become null. Then we add the LEFT JOIN operator in between, and ON keyword that indicates which common columns were used to combine the tables. In our case, it's customer ID. We proceed with WHERE clause that filters the results. We build the condition to check if order ID is null. We pick this specific column because if a customer has at least one order, their order ID will be a real number after the LEFT JOIN. Finally, we sort everything by customer name in ascending order or alphabetically.

Question 5

Use COALESCE for Null Handling

Accepted Answer

## How to Replace NULL with 0 in SQL and Retrieve All Orders from the `orders` Table Writing clean and efficient SQL queries is an essential skill for database management and data analysis. If you're asked to retrieve all orders from an `orders` table, ensuring that any `NULL` values in the `discount` column are replaced with `0`, you need to follow specific steps to structure your query correctly. Below is a comprehensive guide on achieving this task. ### Steps to Write the SQL Query 1. **Identify the Columns**: The `orders` table contains the columns `order_id`, `customer_name`, `discount`, and `total_amount`. 2. **Handle NULL Values**: Ensure that the `discount` column does not contain any `NULL` values by using the `COALESCE` function, which allows you to replace `NULL` values with `0`. 3. **Select All Required Columns**: Ensure that the query retrieves all the columns in the specified order - `order_id`, `customer_name`, `discount`, and `total_amount`. 4. **Order the Results**: Use the `ORDER BY` clause to sort the results by `order_id` in ascending order. ### Sample SQL Query ```sql SELECT order_id, customer_name, COALESCE(discount, 0) AS discount, total_amount FROM orders OR

Question 6

Merge Multiple Address Fields

Accepted Answer

We have a customers table with columns like city, customer ID, first name, last name, postal code, state, and street address. Our job is to combine them into one single column called full address. Some columns could have null values, meaning some customers might be missing a city, state, or postal code. In this case, they should be excluded from the full address. One more possible case is that all components are missing. In this case, we should return an empty string. For this problem, we will need to use COALESCE function. COALESCE function takes a list of values and returns the first one that is not null. For this reason, we need to know the case statement. The idea is basically like an if else statement in any programming language. We will use the concatenation sign in order to connect them one by one. The last thing we have to do is sorting everything out in ascending order, which can be done using order by.

Question 7

String Concatenation in SELECT

Accepted Answer

Our job is to retrieve the full names of all employees by combining the first and last names. We have to make sure that the spaces in between are added, and the result is sorted in ascending order. In SQL, a string is basically just text, anything like a name, a number, or email that are stored in columns with types like varchar or text. Concatenation is simply joining two or more strings together into one output. If you want the space between your strings, you have to treat that space as its own separate string inside quotes. In order to return this specific full name column, we use select clause and implement concat function inside of it. We name our new column as full name. The last thing we are required to do is to sort everything in ascending order based on full name. For this reason, we use order by clause and the name of the column.

Question 8

Find Nth Highest Revenue

Accepted Answer

We will write a query to find N's highest revenue. We have sales table with columns like ID, product, and revenue. Our job is simply to find the third highest distinct revenue value. We need to keep in mind that product with same revenue count as one value, not two. Our query will consist of three steps. First, we get rid of duplicates, then what's left, we rank from highest to lowest, and then pick the one on third place. A window function performs a calculation across multiple rows, but unlike group by, it does not collapse everything into one result. The word just activates the window function. We sort the revenue column out in descending order so that on the first place, we get the highest value. We filter the column where rank equals to three, and we find the value on the third position.

Question 9

Self-Join to Identify Missing Supervisors

Accepted Answer

We have only one table that is called Employees and three columns that indicate employee ID, the name of that employee, and supervisor ID. One employee can be a supervisor for his or her colleagues. Our job is simply to find employees whose supervisor does not exist in the table. Employees with a null value in supervisor ID column are excluded. The final list should be sorted out in ascending order by employee ID. In order to reference the same table twice, we give it two different names, or aliases. e1 will represent each employee, and e2 will search for supervisor. We will use left join, which will keep all rows from the left table, regardless of whether a match was found in the right table or not. Using ON condition, we simply ask the query to take the supervisor ID from e1 table and look for it in employee ID column of e2 table. The first statement is employee ID from e2 table is null, which will simply catch all employees whose supervisor was not found in the table after the left join. We also check if supervisor ID is not a null value.

Question 10

Year-over-Year Revenue Growth

Accepted Answer

We need to calculate the yearly revenue and the percentage growth year over year for a given set of financial transactions. We have financials table that contains two columns, transaction date, that is date type, and amount, that is numeric. We group all transactions by year and sum them up to get the total revenue per year. A CTE is a temporary result set in SQL that you can reference within a single query. It only exists while that query is running. In our select, we use extract in order to get only year from our transaction date column. We also need to get the total revenue of our amount column using sum. A left join keeps all rows from the left table, even if there is no matching row in the right table. The missing values just become null. We take our CTE yearly revenue and give it an alias current. Using left join, we take the exact same CTE again, but this time we call it prev for previous. We need to subtract current and previous year, divide by previous year, and multiply by 100 to get the percentage. Inside of select, we use round, which is a built-in SQL function that rounds a decimal number to a specified number.

Question 11

Above Average Price Products

Accepted Answer

We have a products table with columns like category, ID, name, price, rating, and stock quantity. Our job is to return the products that are priced above the average price. We only work with those products that are currently in stock. It means that not only we don't return them, but we also don't use them while calculating the average amount. While sorting, we should put all the products with null ratings at the bottom. Using select with a star, we can choose all columns. Using from clause, we retrieve data from products table. We need to write a subquery. Using where clause, we create a small condition. Using the average function, it will calculate the average price. The main condition checks whether selected product's price is higher than the average. We have to sort everything out in descending order using order by. At the end of our order by clause, we will use nulls last.

Question 12

Calculate Cumulative Sales

Accepted Answer

We are given a sales data table with the name of the products, date of transaction, and the amount of daily sales. Each row simply represents one day of sales per product. Our mission is to add a new column called cumulative sales, which is a running total that keeps adding up the daily sales for each product as we move forward through the dates. We do the calculation within each product separately. For each row, we need to add up all the previous rows for the same product, which is not something that regular sum with group by can handle, since group by would just put everything into one total. A window function performs a calculation across multiple rows, but unlike group by, it never collapses them. We start with sum within daily sales column, and then we use over keyword that simply activates the window function. We will use partition by function within product name column that will simply divide the data into separate groups.

Question 13

Find Overlapping Date Ranges

Accepted Answer

We have one table that is called assignments. Each row represents one employee assigned to one project with start date and end date. We need to find those employees who are assigned to multiple projects at the same time, or in other words, who face an overlap. The final result must contain all the columns from the input and sort it out by project ID primarily, and then by employee ID in ascending order. Two date ranges overlap when they share at least one day in common. We need to compare each assignment to another one for the same employee, which basically means that we reference the same table twice. This means that we'll use self join here. We take assignments table and give it alias a1. Then we join it with a2 copy that will search for overlapping assignments. The first condition is that assignment a1 must start before or on the same day that assignment a2 ends, because if a1 starts after a2 already ended, no overlap is possible. The role of distinct function here is removing those duplicates and making sure that each assignment appears only once in the final result.

Question 14

Set Operation: INTERSECT

Accepted Answer

This SQL question focuses on set operation called intersect. We are given two tables that share customer ID as a primary key. Our main goal is to return list of active customers based on two criterias. On a monthly basis, a new customer should spend more than 1,000, be loyal, and have at least three years of membership and a premium tier status. Final output must include only the customer ID and name columns sorted in ascending order by ID. A CTE is a temporary result set in SQL that you can reference within a single query. Using with clause, we create a CTE called monthly spenders among new customers that spend more than 1,000. We can name the CTE as premium tier. Intersect takes both tables and returns only the values that appear in both of them. We select customer ID and name columns from monthly spenders, which was the first CTE. Then we select information from the second CTE, and between these two we add intersect operator.

Question 15

Subquery for Best Order per Customer

Accepted Answer

We are given two tables. The first table contains the names of the customers. The second table stores order IDs and total amount of each order. Both tables share customer ID primary key which connects them together. In the output, we should get customer's name, ID of the best order, and its total amount. All the results should be sorted by a customer's name in alphabetical order. Since each customer can have multiple orders, and among these orders might be the ones with the same highest total amount, in this case, we should return the one with the smallest order ID. We are more interested in inner join because it returns only the rows where there is a match in both tables. Inside of the where clause, we will have correlated subquery. It is also a query inside another query, but it runs once for every single row in the outer query. Since we need to return the highest valued order for each customer, we sort by total amount in descending order. We also sort order ID in ascending order because when two orders pop up with the same amount, the one with smaller order ID will come first. And in the very end, we apply limit one.

Question 16

Ranking with Dense_Rank

Accepted Answer

The main focus is on ranking values with dense rank window function. We are working with sales table that tracks individual sales made by different representatives. The goal here is to first add up all sales per person to get their total value, and then rank everyone based on that total from highest to lowest. When we use dense ranking, two people with equal amount of sales will share the same rank. A CTE is a temporary result set in SQL that you can reference within a single query. Dense rank is one of the window functions that assigns a rank number to each row based on a specified order. Over is a keyword that simply activates a window function and lets SQL know that we try to implement it. Finally, order by clause sorts by sales rank first, so rank one comes before rank two, and then salesperson name is sorted alphabetically.

Question 17

Median Salary by Job Title

Accepted Answer

We'll write a query to calculate median salary for each job title. When all salaries are sorted in ascending order, the median value will be the one that is in the middle of them. If there is an even number of salaries, we take the average of the two middle values. A CTE is a temporary result set in SQL that you can reference within a single query. A window function performs a calculation across multiple rows while keeping every single row in the result. Inside of the window function, we are using partition by job title. Partition by divides the data into separate groups, and count will run independently inside each of these groups. We use row number that simply assigns a sequential number to each row. These formulas calculate the median position so that we know which row number represents the middle one. If the number of rows is an odd number, both of the formulas will give the same result. Using round function and double column numeric, we will convert the results to numeric type and then clean everything up to two decimal places.

Question 18

String Splitting and Aggregation

Accepted Answer

We have a product_tags table with two columns, id and tags. The tags column stores multiple tags in one single string separated by commas. Our job is to split those tags apart and count how many times each tag appears across all products. One important thing to keep in mind is that tags are case sensitive. A CTE is a temporary result set in SQL that you can reference within a single query. Our temporary table will be called split_data, and then we select id and tag from our product_tags table. First, we are using string_to_array. You basically pass a string and the separator to the function. After turning tags column into array, we will use unnest function, which will basically put each piece of our array into one separate row. We will use the count with a star function that will basically go through each row and return the amount. Using group by clause, we will group all rows with the same tag together.

Question 19

Salary Comparison with CTE Aggregation

Accepted Answer

We have two tables, departments and employees. Our job is to find employees who earn more than the average salary of their own department. We need the average salary per department before we can compare anything. For this reason, we will use CTE here. A CTE is a temporary result set in SQL that you can reference within a single query. It doesn't get saved anywhere. It only exists while that query is running. We want to find the average salary per department. Inside of it, we select department and average salary grouped by department. For our task, we are more interested in inner join because it returns only the rows where there is a match in both tables. We connect them on department ID from employees table and ID from departments table because we remember that they were the same common primary keys. The average salary should be rounded to two decimal places in the output. Using where clause, we write a condition that salary of an employee should be higher than the average salary that we calculated in our CTE before. The last thing we have to do is sorting everything out primarily by department's name in ascending order and then by salary in descending order.

Question 20

String Pattern Extraction in Descriptions

Accepted Answer

This question mainly focuses on string extraction. We have a products table with four columns: description, name, price, and product ID. The description column contains free text. Some products have email address inside of it and some don't. Our mission is to find those email addresses and extract them from the column. We only return three columns, product ID, name, and email that we will find from description. We will need to use substring. Substring is a string manipulation function. It simply returns a copy of a specific portion of the string. The pattern of an email address is some word, then @ symbol, word again, dot, and then word in the end. Regular expression is kind of a language that is used to describe a sequence of characters. Instead of equal, we use tilde, which is regular expression operator, which means just find this kind of pattern in description column. We sort everything in ascending order by product ID.

Question 21

Nested Subquery for Latest Record

Accepted Answer

We only have one table here, events. There are five columns: event_date and name, id, status, and user_id. Each row in the event_name column represents something a user did, and then status column shows if it was successful. For each user, we want to know what their most recent event was. Final result should contain all the columns except for ID column, and everything should be sorted in ascending order by user ID. For each user and event row, we check and find the maximum date for that specific user. If the current row's date matches the maximum date, we keep it. We take the events table and give it an alias e1. Now we have to use this table again, but we can't name it with the same alias anymore. That's why inside of the subquery, we give another nickname, another alias to our table, which will be e2. We implement max function to find the maximum date, and then inside of the where clause, we compare the current date that we are checking right now and the maximum date that we already found. Finally, we sort everything out in ascending order by user ID.

Question 22

Window Function for Moving Average

Accepted Answer

We are given a sales table with two columns, amount and sale date. Each row indicates the amount of sales per day within one week. Our main goal is to calculate a seven-day moving average for each date. Moving average means that we calculate that number for each day using the current amount plus the amount of previous days. The challenge here is that for each row, we need to look at other rows, which is not something that where or group by clause can do. A CTE is a temporary result set in SQL that you can reference within a single query. Window function calculates across multiple rows, but it keeps every single row in the result. From sales table, we select sale date and amount columns, and also moving average as a new column. This statement, range between interval six days preceding and current row, simply helps us to define the frame. Using round function for this column, we write double colon that will help us to convert raw data to numeric that we need, and two here will round everything to two decimal places.

Question 23

Re-enrollment Rate Calculator

Accepted Answer

We are given one table, Enrollments. This table has three columns that contain information like course ID, student ID, and term ID. Our job here is to calculate what percentage of students are enrolled in two or more consecutive terms. Consecutive means that in term ID column, the sequence goes like 101, 102, 103 without any gaps. A CTE is a temporary result set in SQL that you can reference within a single query. We will call our CTE as consecutive terms, and from enrollments table, we select student and term ID columns. We implement a window function with row number. Partition by clause divides a result set into smaller groups and allows a function to perform calculations on each subset separately. Row number assigns a sequential number to each row starting from one. In order to prevent all the duplicates, count with distinct will only return unique student IDs and completely ignore any null values. Cast in SQL is used to convert a value of one data type into another. We are required to round the results to two decimal places, and the only way to do so is by using a round function.

Question 24

String Pattern Matching Using LIKE

Accepted Answer

We'll match string patterns using like. We have two tables, departments and employees. Our job is to filter employees based on three conditions. First one is that their name must start with letter A. Second is that the email must contain the substring @tech, and third is the position level must contain the word senior. A join in SQL connects two tables together based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. Like is an operator in SQL used to search for a specific pattern inside of a string. Like works together with wild cards, and the most significant one is percent sign. We write name column from employees table, like A percent sign, which means A is the first letter, and then any symbol can follow this letter. Second condition is that email from employees table must contain @tech keyword, and since this value is in the middle of the string, we put percent signs from both sides. And finally we sort everything in ascending order by name.

Question 25

Customer Order Aggregation

Accepted Answer

We are given two tables. First is customers table that contains customer's name, email, and ID. Also, second table that contains ID of the order, its amount, and ID of the customer. Our main goal is to find customers who have placed more than two orders and calculate the total number of orders and the amount they spent on them. The final result must be sorted out in descending order by total spend. A join in SQL connects two tables together based on a common column. For our task, we are more interested in inner join because it returns only the rows where there is a match in both tables. Second step is grouping by ID, name, and email from customers table because this will collapse all rows belonging to the same customer into one single group. Since we already grouped the columns, we will have to use having clause that works with groups, not rows. To get this number, we have to use ID column that we grouped before and count function that will simply find the number of orders for each group.

Question 26

SQL JOIN with Pandas Data Processing and CSV Export

Accepted Answer

We need to do the data processing and CSV export using Pandas and SQLite. We are given a SQLite database that is called Sales. It contains three tables: customers, orders, and items. SQLite is a lightweight database that stores everything in a single file. We need to connect to the database, then run an SQL query to join all three tables that we had, load the results into Pandas, calculate revenue metrics per customer, and export everything to CSV file. JOIN connects two tables based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. We will implement the read_sql_query function. We build the total amount column by multiplying quantity with unit price. When we group by customer ID, we will put all rows belonging to the same customer together. In order to calculate the revenue percentage, we will need to divide each customer's total by the overall revenue.

Question 27

Insert New Records into SQLite Database from CSV

Accepted Answer

We'll insert new records into database from CSV file using SQLite and Pandas. We have one CSV file that is called New Customers. It contains customer records that need to be imported in a SQLite database. The problem is that some of these customers might already exist in the database file. SQLite is a lightweight database that stores everything in a single file. Our job here is to read the CSV file, check which records already exist, and make sure we only include the new ones. We will use the connect function where we define the path to the file. Then within this object, we implement cursor. We read the given CSV file into a DataFrame that will be called df. Using cursor execute will run the SQL query. In the query itself, we select ID column from customers table. Fetchall function will retrieve all the results as a list of tuples. The tilde symbol is used here to flip true to false and false to true. Cursor execute will run the insert statement. Instead of putting the values directly in the SQL string, we will use the question mark and pass the actual values as a separate tuple. We do it for security reasons to protect the data from SQL injection. We will use the commit function w

Question 28

Aggregate SQL Query Results with Pandas and Export to Excel

Accepted Answer

We'll need to aggregate SQL query results and export them to Excel using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given one SQLite database that is called Orders. It contains two tables, the first one with customers' information and second one with all the transactions. SQLite is a lightweight database that stores everything in a single file. Our job here is to join both of the given tables, then calculate the total order value per customer, and save the results to an Excel file. We import two libraries, SQLite 3 and Pandas. Then we open the database using the connect function. We are more interested in inner join because it returns only the rows where there is a match in both tables. Then we execute what we have written above using the read_sql_query function and load the results directly into a Pandas data frame. We use group by so that all rows belonging to the same customer are put together. Finally, we save the resulting data frame to Excel file.

Question 29

Merge Employee and Department Records

Accepted Answer

We need to merge employee and department records. We have two tables, departments and employees. We only consider departments that have more than 10 employees. Second requirement is to find employees whose salary is above their department average. Third requirement is to add high earners column to the output, which will basically count the number of employees whose salary is more than 75,000. A CTE is a temporary result set in SQL that you can reference within a single query. A join in SQL connects two tables together based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. The difference is that where clause comes before group by, while having runs after. In order to calculate the high earners column, we will use count with case when. Results should be sorted out primarily by department name and then by salary in descending order.

Question 30

Sequence Products by Price

Accepted Answer

We are given products table with the name of the product, its ID, and the price. Our main goal is to create and calculate neighbor product column. For each product, we take the price of previous product and the price of next product, and then we multiply them together. A CTE is a temporary result set in SQL that you can reference within a single query. It doesn't get saved anywhere. It only exists while that query is running. Window function performs a calculation across multiple rows and keeps every single row in the result. The reason why we can't use group by here is that group by clause collapses all rows into one single row. We use lag, which reaches back to the previous row and grabs its value. The only difference is that we'll use lead instead of lag. Lead will reach forward to the next row and grab its value. COALESCE function takes a list of values and returns the first one that is not null. Since null values can't be used in the calculations, COALESCE function will replace them with zeros.

Question 31

Combine Data from Multiple Sources into Unified Report

Accepted Answer

We need to combine data from multiple sources into a unified report using Pandas. We are given one customers CSV file and one orders SQLite database. SQLite is a lightweight database that stores everything in a single file. We get the data from API, CSV file, and the database. We send an HTTP GET request to a URL, and the API responds with data in JSON format. Our job is to fetch data from all three sources, combine them into one unified DataFrame, and save the final report as a CSV file. We import three libraries: Requests, Pandas, and SQLite 3 to connect the database. We open the database file using connect function. Read SQL query will execute our SQL request and load the result directly into the DataFrame. Merge in Pandas connects two DataFrames based on a common column or index. It works the same way as JOIN in SQL. On indicates which common column was used to connect the DataFrames. The method is left join, which means that we keep all orders even if customer ID doesn't match. We can calculate the total amount when multiplying quantity by price. The final unified report is saved as CSV file.

Question 32

Export SQLite Database to Parquet Format with Metadata

Accepted Answer

We'll need to export SQLite database to Parquet format with metadata using Pandas. We have one SQLite database that is called ecommerce. It contains three tables, customers, products, and orders. The orders table has foreign keys that connects it to the first two tables. SQLite is a lightweight database that stores everything in a single file. Our job here is to export every of three tables to their own Parquet file with Snappy compression and create a manifest JSON file describing the export. Parquet stores data column by column instead of doing that row by row. A manifest is a metadata file that describes the contents of an export. We import five things: SQLite 3 library to connect to the database; Pandas, that will be aliased as pd; JSON to create the manifest; OS library for creating directories, building file paths, and getting file sizes; and the DateTime module for getting the current timestamp for the manifest. We read the entire table into the data frame and save the data frame as a Parquet file. Then we set compression to Snappy and index to false.

Question 33

Top Categories by Average Price

Accepted Answer

We are given two tables, inventory and products. These two tables share common column and are connected through product_id column from inventory table and id column from products table. Our main goal is to find the top three product categories with the highest average price. We only consider active products, meaning that they are in stock. For low stock items are qualified those products that are less than 10 available. The final result should be ordered primarily by average price in descending order, and then by product count column as a tiebreaker. A CTE is a temporary result set in SQL that you can reference within a single query. We use left join to make sure every product appears, even if it somehow has no inventory entry. For low stock items column, we need to count only products with stock less than 10. We group by category so that all rows that share the same category would collapse into one group. Rank here assigns a number to each row based on a specified order. We put the condition inside the where clause that price rank should be less or equal to three.

Question 34

CSV and Partitions

Accepted Answer

Spark is a big data processing framework. It is designed to process massive amounts of data across multiple computers at the same time. And instead of tables, it uses data frames. Our job here is simply to read a CSV file, then to find out how many partitions are created, and print that number. When Spark reads this file, it doesn't process it as one giant block. Instead, it splits it into smaller chunks called partitions. Each partition gets sent to a different executor. The default maximum partition size is 128 megabytes. RDD stands for Resilient Distributed Dataset. The RDD splits it into multiple partitions, and each partition holds a subset of the full data. RDD will convert the data frame into a format that can help us to access the underlying partition information. And getNumPartitions is a built-in method that returns the number of partitions as a number.

Question 35

Repartition

Accepted Answer

Spark is a big data framework that is designed to process massive amounts of data across multiple computers at the same time. Instead of tables like in SQL, Spark uses data frames. We have only one file that is called orders.csv with 5,000 records. Our main goal here is to repartition the data frame to eight partitions and print the task count in the format that it equals to eight. When Spark reads this file, it doesn't process it as one giant block. Instead, it splits it into smaller chunks called partitions. Each partition gets sent to a different executor. An executor is a process that runs on its chunk of data independently. Repartition is a built-in Spark method that lets us manually control how many partitions our data frame has. We pass the number that we want, and Spark will redistribute all the data across exactly that many partitions. repartition.rdd will convert our data frame to an RDD format, which is a data structure of Spark that holds the partitioned data. getNumPartitions is a built-in function that counts and returns the number of partitions.

Question 36

Broadcast Join

Accepted Answer

Spark is a big data framework that processes massive amounts of data across multiple computers at the same time. Instead of tables like in SQL, Spark uses data frames. We are given two files, orders.csv with 5,000 records, and customers.csv with 50 records. We need to join these two files together using a broadcast join, then count orders and print the number of distinct cities. We will use a regular inner join because we only want orders that have a matching customer. The only thing is how we perform that join. Instead of shuffling both data frames across the network, Spark takes the small data frame and sends a full copy of it to every worker. So each worker now has its own partition of the large data frame and small data frame. It means that it can perform the join right here without any movement. We don't want to shuffle this large data frame across the network. That's why we take the smallest one, because it is easier and cheaper to copy. We will also import the broadcast function from the library. header that is set to true uses the first row as column names, and inferSchema automatically detects data types for each column. Then we count orders per city with the help of group

Question 37

Correcting Social Media Posts

Accepted Answer

Spark is a big data framework that is designed to process massive amounts of data across multiple computers at the same time. Instead of tables, Spark uses data frames. We have one file, posts.csv. Its data frame contains seven columns: text of the post, ID, date, amount of likes, comments, and shares, and platform where it was published. We need to go through every post and replace the word Python with PySpark in the text column. withColumn is a data frame method that modifies or replaces a specific column. It takes two arguments. The first one is the column, it is the text column, and second one is the value that we want to add. For the second argument, we use regexReplace function. This function takes three arguments. The first one is the column to search. Second is the word to find. It is Python. And third is word to replace it with, which is PySpark.

Question 38

Daily Category Sales Aggregation

Accepted Answer

Master daily sales aggregation in PySpark. Learn how to join transaction tables with product catalogs and use multi-column GroupBy operations to calculate total quantities sold per category per day.

Question 39

Most Common Order Status

Accepted Answer

Spark is a big data framework that processes massive amounts of data across multiple machines. Instead of tables, Spark uses data frames. We are given only one file, orders.csv, with 5,000 records. Each order has a status, like completed, canceled, ongoing, and so on. Our job here is to find which status appears most frequently. There are two types of transformations, narrow and wide. In narrow transformation, each row on the left executor goes directly to the row on the right executor. It means that no data moves between executors, and each partition stays on the same machine and gets processed independently. But when it comes to wide transformations, rows from one executor can end up on a completely different executor. Here, data is able to move between machines, and this is called a shuffle. The only problem is that wide transformation requires more network traffic and more time. When we use group by, we do the wide transformation because it makes the rows with the same status move onto the same executor. Count finds the number of rows in each group, and ordering by count, ascending set to false, which means that everything is sorted in descending order.

Question 40

Calculating Overtime Pay

Accepted Answer

We will be calculating overtime pay. Spark is a big data framework that processes large amounts of data across multiple machines. We are given two files: employees.csv and payroll.csv. First data frame contains the names of employees, their IDs, ages, and job positions. The second one stores hourly rate, amount of hours worked, and reference to an employee by ID. Our job is to calculate total pay for each employee, which consists of two rules. If an employee worked less or equal to 40 hours, then the total pay is the product of hours worked and hourly rate. But if an employee worked more than 40 hours, then all the extra hours need to be multiplied not by the regular rate, but by the one that is 1.5 times bigger. We join the two data frames on employee ID using inner join. To the result data frame, we add a new column that is called Pay using the withColumn method. When is Spark's version of if/else statement. Else statement in Spark is replaced with otherwise.

Question 41

Cache and Performance

Accepted Answer

Spark is a big data framework that processes massive amounts of data across multiple machines. Instead of tables, it uses data frames. We have one file, orders.csv, with 50,000 records. Our job here is to cache the data frame, run count twice, measure how long each run takes, and print the results. Spark uses lazy evaluation, which means that it doesn't execute anything when we write a transformation. A transformation is simply an operation that modifies or processes our data. Lazy evaluation doesn't run anything until we specifically ask for a result. If we don't do the caching, then every single iteration reads from the input and produces an output independently. But when we do the caching, then the input is read only once and stored in the distributed memory. The first count reads the CSV file and caches the result, while second one reads from that memory and makes everything much faster.

Question 42

Filter Popular Videos

Accepted Answer

We will need to filter popular videos. Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We are given one CSV file that is called Videos. The data frame contains six columns: title of the video, its ID, genre, release year, duration, and number of views. We need to filter everything and keep only videos that have more than one million views and were released in 2019 or later. The result should be saved as result_df. We read the CSV file to a given path and store it in a variable called df for data frame. When header is set to true, it uses first row as column names. At the same time, inferSchema automatically detects data types of those columns. For the condition, we use filter method within data frame and store it in result_df. The condition consists of two statements. First one checks if number of views is greater than one million, and second one ensures that the release year is after 2019. In between, we use AND operator that requires both of the statements to be true.

Question 43

Anonymize User PII

Accepted Answer

PII stands for Personally Identifiable Information, data like emails and phone numbers that can identify a real person. We are given only one file that is called users.csv, and this data frame contains three columns: email, phone number of the user, along with referenced_by ID. We are required to do two things. First, we extract the domain from the email address, which means that we keep everything after the @ symbol. Then we need to hide the first digits of the phone number, and only the last four should be visible. Regex stands for regular expression. It is a pattern that is used to search, extract, or replace specific text inside of a string. Regexp_extract is a Spark function that takes a specific part of a string using a regex pattern. We will use the regexp_replace function that simply replaces part of the string. WithColumn will take the results and create a new column.

Question 44

Call Center Daily Stats

Accepted Answer

Join call records with customer data and compute daily aggregates using countDistinct and sum.

Question 45

Venture Capital Sector Analysis

Accepted Answer

Spark is a framework that processes large amounts of data across multiple machines. At the same time, instead of tables, it uses data frames. We are given two files, companies and investments. Our job here is to find the total investment amount for each industry sector and sort from highest to lowest. We will first combine both of the data frames, then we will sum investments per industries, and in the end, we will sort everything in descending order. Inner join returns only rows where a match exists in both tables or data frames. We will use inner join because for every investment row, we want the company's name and industry to be right next to it. We group by industry column so that all rows that share the same field are put together. Sum function simply adds up all amount values and finds the total investment per industry. The third step is that we order everything by total investment column in descending order.

Question 46

Window Functions without Partitions

Accepted Answer

Master global sorting and sequential numbering in PySpark. Learn how to join DataFrames and use the row_number() window function across an entire unpartitioned dataset.

Question 47

Calculating PE Portfolio Values

Accepted Answer

Master financial data aggregation in PySpark. Learn how to join relational tables, multiply columns to calculate holding values, and group by multiple dimensions to compute daily private equity portfolio totals.

Question 48

Top Products by Revenue

Accepted Answer

Spark is a framework that processes large amounts of data and covers multiple machines at the same time. We are given two files, products and sales.csv. We are required to aggregate total revenue per product, then rank products within each category using dense rank function, and keep only the top three. The output should include the name of the product, category to which it belongs, revenue from the sale, and rank number. A window function performs a calculation across a set of related rows and keeps every single row in the result. This is the key difference from group by, which is an aggregate function that collapses multiple rows into one outcome per group. A window function instead just adds a new column with the calculation. We'll use group by product ID that will put all sales rows for the same product together. Sum will simply add up all revenue values for each group within revenue column. Partition by category will create invisible walls between each category. We sort everything out in descending order by revenue so that the highest amount will get rank one. Dense rank is a type of window function that assigns rank numbers. The thing about dense rank is that it never creates

Question 49

Product Summary

Accepted Answer

Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We are given three files, and their data frames are called products, sales, and inventory. We need to calculate total quantity sold and total revenue per product from sales. Then we find the total stock number per product across all warehouses. And we need to join everything to products data frame and make sure every product appears, even if it has no sales or inventory records. Any null values should be replaced with zero. We group by product ID so that all sales rows that belong to the same product are put together. Inner join returns only rows where a match exist in both tables or data frames. Left outer join returns everything from the left table, plus matching rows from the right. In our specific case, we'll use left join. It will keep all rows from the left table, which is products, regardless whether there is a match in sales. Coalesce function takes list of values and returns the first one that is not null. Lit function creates a literal value of zero.

Question 50

Parsing Comma-Separated Values

Accepted Answer

Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We have only one file that is called background_checks.csv. Four of these columns store multiple values in one string separated by comma. We are required to count how many items are there in each of those strings and create four new columns to store these results. Everything should be sorted out in ascending order by ID. We read the CSV file into data frame using given path. When header is set to true, it uses first row of the data frame as column names, and inferSchema automatically detects data types. We use split function that takes a string and cuts it into an array using a separator. We wrap everything inside of size function that will count how many elements are in the array. We sort everything out in ascending order by check ID using order by.

Question 51

Mountain Climber Logs

Accepted Answer

Spark is a framework that processes large amounts of data across multiple machines at the same time, and instead of tables, it uses data frames. We are given two files, mountain_info and mountain_climbers.csv. Our job here is to find the most recent climber for each mountain. We need to keep in mind that mountains with no climbing records should be excluded from the final result. A window function performs a calculation across a set of related rows and keeps every single row in the result. This is the key difference from group by, which is an aggregate function that collapses multiple rows into one outcome per group. We need to read both of the files into data frames. We set header to true, which means that the first row will include the column names, and inferSchema will automatically detect data types of those columns. Inner join returns only rows where a match exists in both tables or data frames. We want name of the mountain from info_df to be equal to the name of the mountain from climbers_df. Partition by mountain name creates sort of invisible walls between each mountain. We also sort everything in descending order by climb date so that most recent date comes first. Row numb

Question 52

Global & Domain SEO Leaders

Accepted Answer

Master advanced PySpark Window functions by calculating both partition-level and global maximums. Learn how to use empty window partitions and conditional F.when() logic to identify top-performing SEO pages.

Question 53

Tracking Customer Purchase History

Accepted Answer

Spark is a framework that processes large amounts of data across multiple machines at the same time. Instead of tables, it uses data frames. We have only one file that is called transactions.csv. Its data frame contains four columns, product ID, reference to the customer by ID, quantity, and date of transaction. For each transaction, we need to find what product that particular customer bought most recently before the current one. Then we create a new column that combines the date and that previous product in one string. A window function performs a calculation across a set of related rows and keeps every single row in the result. This is the key difference from group by, which is an aggregate function that collapses multiple rows into one outcome per group. Partition by customer ID will create invisible walls between customers. Lag is a type of window function that will look backwards and reach back to the previous row to grab its product ID value. Concat function joins multiple strings together.

Question 54

Filter Popular Videos on a Streaming Platform

Accepted Answer

We will filter popular videos on a streaming platform. Snowflake is a cloud-based data warehouse platform. Here, your query will look just like an SQL. We are given videos table, which contains video ID, title, genre, release year, duration, and view count columns. Our job is to retrieve those videos that have more than one million views and that were released in year 2019 or later. FROM clause, this is one of the things that differ regular SQL from Snowflake. It uses a dynamic reference with the ref function. Instead of naming the table, we are asking the system to find correct version of the table for us. Inside of the double curly braces, we put the ref function with the name of our table. We select all columns using select with a star. Our main goal was to filter the videos, and to do so, we should build sort of a condition inside of the WHERE clause. The first statement is that a number inside of view count column should be higher than one million. The second is that release year should be higher or equal to 2019. Since both of the statements are mandatory, we will use a logical operator.

Question 55

Replace Keywords in Social Media Post Text

Accepted Answer

Snowflake is a cloud-based data warehouse platform. The syntax, the functions, the logic, everything is almost like in SQL. Our table, social media, contains information about different posts, its ID, text inside of it, date, amount of likes, comments, shares, and the name of the platform. Our job is simple. We go through every post and replace the word Python with PySpark. The only thing that is going to change is that inside of the text column, the word Python is replaced by PySpark. In Snowflake, it uses a dynamic reference with the ref function. Inside of the double curly braces, we put the ref function with the name of our table. Since our goal was to modify the text column, we need to use replace function here. Replace function takes three arguments: the string itself, the old value, and the new value. Replace basically scans through the string, finds the needed word Python, and swaps it with a new value.

Question 56

Filter Movies with Missing Box Office Data

Accepted Answer

Practice filtering for NULL values in Snowflake SQL with this movie analytics interview question. You query a movies table to find rows where box office collection data is missing using IS NULL. Covers NULL handling, IS NULL, WHERE clause, and data quality checks in Snowflake. An easy-level question common in analytics interviews at companies like DoorDash.

Question 57

Daily Category Sales

Accepted Answer

SELECT
s.sale_date,
p.category,
SUM(s.quantity_sold) AS total_quantity
FROM {{ ref("sales") }} s
INNER JOIN {{ ref("products") }} p
ON s.product_id = p.product_id
GROUP BY s.sale_date, p.category
ORDER BY s.sale_date ASC, p.category ASC

Question 58

Filter and Uppercase Artifacts

Accepted Answer

SELECT ID, Item, Period, UPPER(Material) AS Material, Quantity FROM {{ ref("artifacts") }} WHERE Quantity > 100

Question 59

Merge Customer Records from Two Sources

Accepted Answer

We will merge customer records from two sources. Snowflake is a cloud-based data warehouse that uses SQL. We are given two tables, customers from West and customers from East. Both of the tables contain the same columns and store information about the customers, such as full name, age, ID, and subscription plan type. Our job is simply to combine these two tables into one single result, even if they contain duplicate rows. FROM clause in Snowflake is a little bit different from SQL. It uses a dynamic reference with the ref function. Inside of the double curly braces, we put the ref function with the name of our table. In SQL and Snowflake, UNION operator is a set operation used to combine the result sets of two or more SELECT statements into a single unified output. If we use UNION separately, it would remove all duplicate rows. So using UNION ALL simply means that we want to keep all rows, even if they are repeated.

Question 60

Filter Funded Startups

Accepted Answer

We need to write a query to filter funded startups. This is a Snowflake question. Snowflake is a cloud-based data warehouse that simply uses SQL. We are given two tables, investors and startups. Our job is to find investors whose average funding across all their startups is strictly greater than their personal funding limit. Those cases when limit equals to average funding needs to be excluded. The output should include three columns: investor name, its ID, and average funding number. We are more interested in inner join because it returns only the rows where there is a match in both tables. Within funding column from startups table, we implement average and round functions. Since we need to find the average number for each of the investing companies separately, we need to use a GROUP BY clause. We will put a condition inside of HAVING clause that works with groups and filters them. We can't use WHERE clause for the condition because there is a strict rule that WHERE clause can only come before GROUP BY clause, not after.

Question 61

Combine Customer Orders and Products

Accepted Answer

We need to combine customer orders and products. Snowflake is mainly using the same concepts as in SQL. The only difference is that Snowflake is cloud-based. We are given three tables: customers, orders, and products. Our main goal here is to join all three tables together and return six unique columns. In Snowflake, we have to put the name of the table inside of the ref function, and everything has to be wrapped up in double curly braces. The reason behind is that Snowflake uses data build tool, which requires the ref function. Join connects two tables based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. We give aliases O to orders and C to customers table. Using on, we indicate which common columns were used to connect the tables. We use double pipe operator, or in other words, a concatenation sign, to get the full name.

Question 62

Anonymize User PII Data for a Social Media Platform

Accepted Answer

We need to write a query to anonymize user PII data for a social media platform. This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all of its concepts. PII stands for Personally Identifiable Information. For each user, we need to extract only the domain part from the email address and anonymize the phone number by hiding the first six digits with asterisks and keeping only the last four. In Snowflake, we put the table inside of the ref function, and this whole thing is wrapped up in double curly braces. Split part function takes three arguments and splits a string into parts based on a delimiter and returns a specific part. Our delimiter will be @ symbol. We will use the right function. Right takes two arguments and extracts a specified number of characters from the right side of a string. Since phone is not a string, we need to use double colon varchar in order to convert number to a string. Using the concatenation sign that combines two strings, we connect them together.

Question 63

Product Sales and Inventory Data

Accepted Answer

This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given three tables, products, sales, and inventory. All three tables are connected through product ID, since this is a common column. First, we need to calculate total quantity and total revenue per product from sales. Then we need to calculate total stock per product from inventory. And in the end, we need to join everything with products table, making sure every single product appears in the result, even if it has no sales or inventory records. If we try to join everything directly without aggregating first, we would get duplicate rows and completely wrong totals. A CTE, or in other words, a common table expression, is a temporary result set that we define at the top of our query. A left join keeps all rows from the left table, even if there is no matching row in the right table. The missing values just become null. Coalesce function takes a list of values and returns the first one that is not null.

Question 64

Products and Duplicates

Accepted Answer

Practice a medium difficulty Snowflake SQL interview question tagged JPMorgan that tests deduplication, CTEs, and INNER JOIN. In a manufacturing scenario, you must remove duplicate rows from two tables using SELECT DISTINCT or ROW_NUMBER before joining them on a shared key. This question evaluates your ability to handle data quality issues before performing joins, a skill commonly tested in financial services and data engineering interviews.

Data Analysis Interview Questions

SQL (33)

How to Replace NULL with 0 in SQL and Retrieve All Orders from the `orders` Table

Steps to Write the SQL Query

Sample SQL Query

Breaking Down the Query

Best Practices

Spark (20)

Snowflake (22)

Python (19)

More questions (2)

Practice on real environments