Data Analysis Interview Questions
Are you getting ready for a data analysis interview? Data analysis is about examining, cleaning, transforming and modeling data to uncover insights. These real Data Analysis interview questions cover the SQL, Python and statistics problems you'll actually be asked.
- 96+Questions
- 58Companies
- 82With video
- 38/43/15 Easy / Med / Hard
Real questions asked at Google, DoorDash, Amazon, Meta, Databricks, VMware and 52+ more.
Click any question to read the full answer, then open it to solve hands-on in a real Data Analysis environment that boots in your browser. Built by working engineers, no AI.
SQL (33)
Easy Average Order Value
This SQL question is called Average Order Value. We have one table that is called Orders. This table records customers' purchases and has four columns: orders ID, date of transaction, reference to the customer by ID, and total amount spent. Our job here is to calculate for each customer the average amount they spent across all their orders. In the end, the average value should be rounded to two decimal places. The output should include customer ID and average order value, sorted by ID in ascending order. We start with FROM clause. Then we proceed with GROUP BY clause that separates individual orders into groups for each customer. In order to choose specific columns from a particular table, we use SELECT clause. We wrap the total amount column with Average inside of Round function as first argument, and two number as two decimal places. Finally, we order by customer ID.
Open in a real environment →Easy Join Employees and Departments
We are given two tables, departments and employees. These two tables are connected through department ID column. Some employees might not be assigned to any department, so their department ID is null. But we still want them in our results. Our main goal here is to return all employees that earn more than 50,000 with their department name sorted by hire date from most recent to oldest. In order to combine these two tables, we will use the concept of left join. A left join keeps all rows from the left table even if there is no matching row in the right table. The missing values just become null. We implement left join in between, and using on, we indicate which common columns were used to connect these tables. We will use where clause to filter the rows. We only want those employees whose salary is greater than 50,000. Finally, we sort everything out by hire date in descending order so that the most recent date pops up first and oldest date last.
Open in a real environment →Easy Filter Orders by Date Range
We need to write a query to filter orders by date range. We are given one table that is called Orders. It has four columns: customer's name, ID of an order that has been made, date of transaction, and total amount spent. We are required to return only the orders that were placed between January 1st, 2023, and June 30th, 2023. Both dates are inclusive, which means that orders on exactly those dates should be included, too. In SQL, we always start with FROM clause because this tells which table we want to work with. Using SELECT clause, we can choose the columns for the output. WHERE clause, which is the filter step. It goes through every single row and checks if the condition is true. For order date column, we use BETWEEN and operators. This combination checks if a value falls within a range, including both the start and end values. Finally, we sort everything by order date in ascending order so that the rows with the earliest date come first.
Open in a real environment →Easy Find Customers Without Orders
We need to find customers without orders. We have two tables, customers and orders. These tables are connected through customer ID column. Our job here is to find customers who have never placed a single order, but at the same time they exist in the customers table. We are more interested in LEFT JOIN since we want all rows from the left table to stay. A LEFT JOIN keeps all rows from the left table, even if there is no matching row in the right table. The missing values just become null. Then we add the LEFT JOIN operator in between, and ON keyword that indicates which common columns were used to combine the tables. In our case, it's customer ID. We proceed with WHERE clause that filters the results. We build the condition to check if order ID is null. We pick this specific column because if a customer has at least one order, their order ID will be a real number after the LEFT JOIN. Finally, we sort everything by customer name in ascending order or alphabetically.
Open in a real environment →Easy Use COALESCE for Null Handling
How to Replace NULL with 0 in SQL and Retrieve All Orders from the
ordersTableWriting clean and efficient SQL queries is an essential skill for database management and data analysis. If you're asked to retrieve all orders from an
orderstable, ensuring that anyNULLvalues in thediscountcolumn are replaced with0, you need to follow specific steps to structure your query correctly. Below is a comprehensive guide on achieving this task.Steps to Write the SQL Query
- Identify the Columns: The
orderstable contains the columnsorder_id,customer_name,discount, andtotal_amount. - Handle NULL Values: Ensure that the
discountcolumn does not contain anyNULLvalues by using theCOALESCEfunction, which allows you to replaceNULLvalues with0. - Select All Required Columns: Ensure that the query retrieves all the columns in the specified order -
order_id,customer_name,discount, andtotal_amount. - Order the Results: Use the
ORDER BYclause to sort the results byorder_idin ascending order.
Sample SQL Query
SELECT order_id, customer_name, COALESCE(discount, 0) AS discount, total_amount FROM orders ORDER BY order_id ASC;Breaking Down the Query
SELECT statement: This part retrieves the columns you need.
SELECT order_id, customer_name,COALESCE Function: Use
COALESCE(discount, 0)to replaceNULLvalues in thediscountcolumn with0.COALESCE(discount, 0) AS discount,FROM clause: Specifies the table from which to fetch the data.
FROM ordersORDER BY clause: Ensures the result set is ordered by
order_idin ascending sequence, making it easier to read and analyze.ORDER BY order_id ASC;
Best Practices
- Readability: Write clear and readable queries. Using aliases (like
COALESCE(discount, 0) AS discount) makes it easier to interpret results. - Performance: Ensure your database has indexes on columns commonly used in
ORDER BYclauses, likeorder_id, to optimize query performance.
Leveraging these structured steps will help you efficiently write the required SQL query to retrieve and process the orders data from the
Open in a real environment →orderstable, replacingNULLdiscounts with0, and ordering byorder_id. This method ensures clean data handling and a structured output, which is a critical aspect of database queries and analysis.- Identify the Columns: The
Easy Merge Multiple Address Fields
We have a customers table with columns like city, customer ID, first name, last name, postal code, state, and street address. Our job is to combine them into one single column called full address. Some columns could have null values, meaning some customers might be missing a city, state, or postal code. In this case, they should be excluded from the full address. One more possible case is that all components are missing. In this case, we should return an empty string. For this problem, we will need to use COALESCE function. COALESCE function takes a list of values and returns the first one that is not null. For this reason, we need to know the case statement. The idea is basically like an if else statement in any programming language. We will use the concatenation sign in order to connect them one by one. The last thing we have to do is sorting everything out in ascending order, which can be done using order by.
Open in a real environment →Easy String Concatenation in SELECT
Our job is to retrieve the full names of all employees by combining the first and last names. We have to make sure that the spaces in between are added, and the result is sorted in ascending order. In SQL, a string is basically just text, anything like a name, a number, or email that are stored in columns with types like varchar or text. Concatenation is simply joining two or more strings together into one output. If you want the space between your strings, you have to treat that space as its own separate string inside quotes. In order to return this specific full name column, we use select clause and implement concat function inside of it. We name our new column as full name. The last thing we are required to do is to sort everything in ascending order based on full name. For this reason, we use order by clause and the name of the column.
Open in a real environment →Easy Find Nth Highest Revenue
We will write a query to find N's highest revenue. We have sales table with columns like ID, product, and revenue. Our job is simply to find the third highest distinct revenue value. We need to keep in mind that product with same revenue count as one value, not two. Our query will consist of three steps. First, we get rid of duplicates, then what's left, we rank from highest to lowest, and then pick the one on third place. A window function performs a calculation across multiple rows, but unlike group by, it does not collapse everything into one result. The word just activates the window function. We sort the revenue column out in descending order so that on the first place, we get the highest value. We filter the column where rank equals to three, and we find the value on the third position.
Open in a real environment →Easy Self-Join to Identify Missing Supervisors
We have only one table that is called Employees and three columns that indicate employee ID, the name of that employee, and supervisor ID. One employee can be a supervisor for his or her colleagues. Our job is simply to find employees whose supervisor does not exist in the table. Employees with a null value in supervisor ID column are excluded. The final list should be sorted out in ascending order by employee ID. In order to reference the same table twice, we give it two different names, or aliases. e1 will represent each employee, and e2 will search for supervisor. We will use left join, which will keep all rows from the left table, regardless of whether a match was found in the right table or not. Using ON condition, we simply ask the query to take the supervisor ID from e1 table and look for it in employee ID column of e2 table. The first statement is employee ID from e2 table is null, which will simply catch all employees whose supervisor was not found in the table after the left join. We also check if supervisor ID is not a null value.
Open in a real environment →Easy Year-over-Year Revenue Growth
We need to calculate the yearly revenue and the percentage growth year over year for a given set of financial transactions. We have financials table that contains two columns, transaction date, that is date type, and amount, that is numeric. We group all transactions by year and sum them up to get the total revenue per year. A CTE is a temporary result set in SQL that you can reference within a single query. It only exists while that query is running. In our select, we use extract in order to get only year from our transaction date column. We also need to get the total revenue of our amount column using sum. A left join keeps all rows from the left table, even if there is no matching row in the right table. The missing values just become null. We take our CTE yearly revenue and give it an alias current. Using left join, we take the exact same CTE again, but this time we call it prev for previous. We need to subtract current and previous year, divide by previous year, and multiply by 100 to get the percentage. Inside of select, we use round, which is a built-in SQL function that rounds a decimal number to a specified number.
Open in a real environment →Medium Above Average Price Products
We have a products table with columns like category, ID, name, price, rating, and stock quantity. Our job is to return the products that are priced above the average price. We only work with those products that are currently in stock. It means that not only we don't return them, but we also don't use them while calculating the average amount. While sorting, we should put all the products with null ratings at the bottom. Using select with a star, we can choose all columns. Using from clause, we retrieve data from products table. We need to write a subquery. Using where clause, we create a small condition. Using the average function, it will calculate the average price. The main condition checks whether selected product's price is higher than the average. We have to sort everything out in descending order using order by. At the end of our order by clause, we will use nulls last.
Open in a real environment →Medium Calculate Cumulative Sales
We are given a sales data table with the name of the products, date of transaction, and the amount of daily sales. Each row simply represents one day of sales per product. Our mission is to add a new column called cumulative sales, which is a running total that keeps adding up the daily sales for each product as we move forward through the dates. We do the calculation within each product separately. For each row, we need to add up all the previous rows for the same product, which is not something that regular sum with group by can handle, since group by would just put everything into one total. A window function performs a calculation across multiple rows, but unlike group by, it never collapses them. We start with sum within daily sales column, and then we use over keyword that simply activates the window function. We will use partition by function within product name column that will simply divide the data into separate groups.
Open in a real environment →Medium Find Overlapping Date Ranges
We have one table that is called assignments. Each row represents one employee assigned to one project with start date and end date. We need to find those employees who are assigned to multiple projects at the same time, or in other words, who face an overlap. The final result must contain all the columns from the input and sort it out by project ID primarily, and then by employee ID in ascending order. Two date ranges overlap when they share at least one day in common. We need to compare each assignment to another one for the same employee, which basically means that we reference the same table twice. This means that we'll use self join here. We take assignments table and give it alias a1. Then we join it with a2 copy that will search for overlapping assignments. The first condition is that assignment a1 must start before or on the same day that assignment a2 ends, because if a1 starts after a2 already ended, no overlap is possible. The role of distinct function here is removing those duplicates and making sure that each assignment appears only once in the final result.
Open in a real environment →Medium Set Operation: INTERSECT
This SQL question focuses on set operation called intersect. We are given two tables that share customer ID as a primary key. Our main goal is to return list of active customers based on two criterias. On a monthly basis, a new customer should spend more than 1,000, be loyal, and have at least three years of membership and a premium tier status. Final output must include only the customer ID and name columns sorted in ascending order by ID. A CTE is a temporary result set in SQL that you can reference within a single query. Using with clause, we create a CTE called monthly spenders among new customers that spend more than 1,000. We can name the CTE as premium tier. Intersect takes both tables and returns only the values that appear in both of them. We select customer ID and name columns from monthly spenders, which was the first CTE. Then we select information from the second CTE, and between these two we add intersect operator.
Open in a real environment →Medium Subquery for Best Order per Customer
We are given two tables. The first table contains the names of the customers. The second table stores order IDs and total amount of each order. Both tables share customer ID primary key which connects them together. In the output, we should get customer's name, ID of the best order, and its total amount. All the results should be sorted by a customer's name in alphabetical order. Since each customer can have multiple orders, and among these orders might be the ones with the same highest total amount, in this case, we should return the one with the smallest order ID. We are more interested in inner join because it returns only the rows where there is a match in both tables. Inside of the where clause, we will have correlated subquery. It is also a query inside another query, but it runs once for every single row in the outer query. Since we need to return the highest valued order for each customer, we sort by total amount in descending order. We also sort order ID in ascending order because when two orders pop up with the same amount, the one with smaller order ID will come first. And in the very end, we apply limit one.
Open in a real environment →Medium Ranking with Dense_Rank
The main focus is on ranking values with dense rank window function. We are working with sales table that tracks individual sales made by different representatives. The goal here is to first add up all sales per person to get their total value, and then rank everyone based on that total from highest to lowest. When we use dense ranking, two people with equal amount of sales will share the same rank. A CTE is a temporary result set in SQL that you can reference within a single query. Dense rank is one of the window functions that assigns a rank number to each row based on a specified order. Over is a keyword that simply activates a window function and lets SQL know that we try to implement it. Finally, order by clause sorts by sales rank first, so rank one comes before rank two, and then salesperson name is sorted alphabetically.
Open in a real environment →Medium Median Salary by Job Title
We'll write a query to calculate median salary for each job title. When all salaries are sorted in ascending order, the median value will be the one that is in the middle of them. If there is an even number of salaries, we take the average of the two middle values. A CTE is a temporary result set in SQL that you can reference within a single query. A window function performs a calculation across multiple rows while keeping every single row in the result. Inside of the window function, we are using partition by job title. Partition by divides the data into separate groups, and count will run independently inside each of these groups. We use row number that simply assigns a sequential number to each row. These formulas calculate the median position so that we know which row number represents the middle one. If the number of rows is an odd number, both of the formulas will give the same result. Using round function and double column numeric, we will convert the results to numeric type and then clean everything up to two decimal places.
Open in a real environment →Medium String Splitting and Aggregation
We have a product_tags table with two columns, id and tags. The tags column stores multiple tags in one single string separated by commas. Our job is to split those tags apart and count how many times each tag appears across all products. One important thing to keep in mind is that tags are case sensitive. A CTE is a temporary result set in SQL that you can reference within a single query. Our temporary table will be called split_data, and then we select id and tag from our product_tags table. First, we are using string_to_array. You basically pass a string and the separator to the function. After turning tags column into array, we will use unnest function, which will basically put each piece of our array into one separate row. We will use the count with a star function that will basically go through each row and return the amount. Using group by clause, we will group all rows with the same tag together.
Open in a real environment →Medium Salary Comparison with CTE Aggregation
We have two tables, departments and employees. Our job is to find employees who earn more than the average salary of their own department. We need the average salary per department before we can compare anything. For this reason, we will use CTE here. A CTE is a temporary result set in SQL that you can reference within a single query. It doesn't get saved anywhere. It only exists while that query is running. We want to find the average salary per department. Inside of it, we select department and average salary grouped by department. For our task, we are more interested in inner join because it returns only the rows where there is a match in both tables. We connect them on department ID from employees table and ID from departments table because we remember that they were the same common primary keys. The average salary should be rounded to two decimal places in the output. Using where clause, we write a condition that salary of an employee should be higher than the average salary that we calculated in our CTE before. The last thing we have to do is sorting everything out primarily by department's name in ascending order and then by salary in descending order.
Open in a real environment →Medium String Pattern Extraction in Descriptions
This question mainly focuses on string extraction. We have a products table with four columns: description, name, price, and product ID. The description column contains free text. Some products have email address inside of it and some don't. Our mission is to find those email addresses and extract them from the column. We only return three columns, product ID, name, and email that we will find from description. We will need to use substring. Substring is a string manipulation function. It simply returns a copy of a specific portion of the string. The pattern of an email address is some word, then @ symbol, word again, dot, and then word in the end. Regular expression is kind of a language that is used to describe a sequence of characters. Instead of equal, we use tilde, which is regular expression operator, which means just find this kind of pattern in description column. We sort everything in ascending order by product ID.
Open in a real environment →Medium Nested Subquery for Latest Record
We only have one table here, events. There are five columns: event_date and name, id, status, and user_id. Each row in the event_name column represents something a user did, and then status column shows if it was successful. For each user, we want to know what their most recent event was. Final result should contain all the columns except for ID column, and everything should be sorted in ascending order by user ID. For each user and event row, we check and find the maximum date for that specific user. If the current row's date matches the maximum date, we keep it. We take the events table and give it an alias e1. Now we have to use this table again, but we can't name it with the same alias anymore. That's why inside of the subquery, we give another nickname, another alias to our table, which will be e2. We implement max function to find the maximum date, and then inside of the where clause, we compare the current date that we are checking right now and the maximum date that we already found. Finally, we sort everything out in ascending order by user ID.
Open in a real environment →Medium Window Function for Moving Average
We are given a sales table with two columns, amount and sale date. Each row indicates the amount of sales per day within one week. Our main goal is to calculate a seven-day moving average for each date. Moving average means that we calculate that number for each day using the current amount plus the amount of previous days. The challenge here is that for each row, we need to look at other rows, which is not something that where or group by clause can do. A CTE is a temporary result set in SQL that you can reference within a single query. Window function calculates across multiple rows, but it keeps every single row in the result. From sales table, we select sale date and amount columns, and also moving average as a new column. This statement, range between interval six days preceding and current row, simply helps us to define the frame. Using round function for this column, we write double colon that will help us to convert raw data to numeric that we need, and two here will round everything to two decimal places.
Open in a real environment →Medium Re-enrollment Rate Calculator
We are given one table, Enrollments. This table has three columns that contain information like course ID, student ID, and term ID. Our job here is to calculate what percentage of students are enrolled in two or more consecutive terms. Consecutive means that in term ID column, the sequence goes like 101, 102, 103 without any gaps. A CTE is a temporary result set in SQL that you can reference within a single query. We will call our CTE as consecutive terms, and from enrollments table, we select student and term ID columns. We implement a window function with row number. Partition by clause divides a result set into smaller groups and allows a function to perform calculations on each subset separately. Row number assigns a sequential number to each row starting from one. In order to prevent all the duplicates, count with distinct will only return unique student IDs and completely ignore any null values. Cast in SQL is used to convert a value of one data type into another. We are required to round the results to two decimal places, and the only way to do so is by using a round function.
Open in a real environment →Medium String Pattern Matching Using LIKE
We'll match string patterns using like. We have two tables, departments and employees. Our job is to filter employees based on three conditions. First one is that their name must start with letter A. Second is that the email must contain the substring @tech, and third is the position level must contain the word senior. A join in SQL connects two tables together based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. Like is an operator in SQL used to search for a specific pattern inside of a string. Like works together with wild cards, and the most significant one is percent sign. We write name column from employees table, like A percent sign, which means A is the first letter, and then any symbol can follow this letter. Second condition is that email from employees table must contain @tech keyword, and since this value is in the middle of the string, we put percent signs from both sides. And finally we sort everything in ascending order by name.
Open in a real environment →Medium Customer Order Aggregation
We are given two tables. First is customers table that contains customer's name, email, and ID. Also, second table that contains ID of the order, its amount, and ID of the customer. Our main goal is to find customers who have placed more than two orders and calculate the total number of orders and the amount they spent on them. The final result must be sorted out in descending order by total spend. A join in SQL connects two tables together based on a common column. For our task, we are more interested in inner join because it returns only the rows where there is a match in both tables. Second step is grouping by ID, name, and email from customers table because this will collapse all rows belonging to the same customer into one single group. Since we already grouped the columns, we will have to use having clause that works with groups, not rows. To get this number, we have to use ID column that we grouped before and count function that will simply find the number of orders for each group.
Open in a real environment →Medium SQL JOIN with Pandas Data Processing and CSV Export
We need to do the data processing and CSV export using Pandas and SQLite. We are given a SQLite database that is called Sales. It contains three tables: customers, orders, and items. SQLite is a lightweight database that stores everything in a single file. We need to connect to the database, then run an SQL query to join all three tables that we had, load the results into Pandas, calculate revenue metrics per customer, and export everything to CSV file. JOIN connects two tables based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. We will implement the read_sql_query function. We build the total amount column by multiplying quantity with unit price. When we group by customer ID, we will put all rows belonging to the same customer together. In order to calculate the revenue percentage, we will need to divide each customer's total by the overall revenue.
Open in a real environment →Medium Insert New Records into SQLite Database from CSV
We'll insert new records into database from CSV file using SQLite and Pandas. We have one CSV file that is called New Customers. It contains customer records that need to be imported in a SQLite database. The problem is that some of these customers might already exist in the database file. SQLite is a lightweight database that stores everything in a single file. Our job here is to read the CSV file, check which records already exist, and make sure we only include the new ones. We will use the connect function where we define the path to the file. Then within this object, we implement cursor. We read the given CSV file into a DataFrame that will be called df. Using cursor execute will run the SQL query. In the query itself, we select ID column from customers table. Fetchall function will retrieve all the results as a list of tuples. The tilde symbol is used here to flip true to false and false to true. Cursor execute will run the insert statement. Instead of putting the values directly in the SQL string, we will use the question mark and pass the actual values as a separate tuple. We do it for security reasons to protect the data from SQL injection. We will use the commit function within our database to finalize and save all changes. Finally, we close the database.
Open in a real environment →Medium Aggregate SQL Query Results with Pandas and Export to Excel
We'll need to aggregate SQL query results and export them to Excel using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given one SQLite database that is called Orders. It contains two tables, the first one with customers' information and second one with all the transactions. SQLite is a lightweight database that stores everything in a single file. Our job here is to join both of the given tables, then calculate the total order value per customer, and save the results to an Excel file. We import two libraries, SQLite 3 and Pandas. Then we open the database using the connect function. We are more interested in inner join because it returns only the rows where there is a match in both tables. Then we execute what we have written above using the read_sql_query function and load the results directly into a Pandas data frame. We use group by so that all rows belonging to the same customer are put together. Finally, we save the resulting data frame to Excel file.
Open in a real environment →Hard Merge Employee and Department Records
We need to merge employee and department records. We have two tables, departments and employees. We only consider departments that have more than 10 employees. Second requirement is to find employees whose salary is above their department average. Third requirement is to add high earners column to the output, which will basically count the number of employees whose salary is more than 75,000. A CTE is a temporary result set in SQL that you can reference within a single query. A join in SQL connects two tables together based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. The difference is that where clause comes before group by, while having runs after. In order to calculate the high earners column, we will use count with case when. Results should be sorted out primarily by department name and then by salary in descending order.
Open in a real environment →Hard Sequence Products by Price
We are given products table with the name of the product, its ID, and the price. Our main goal is to create and calculate neighbor product column. For each product, we take the price of previous product and the price of next product, and then we multiply them together. A CTE is a temporary result set in SQL that you can reference within a single query. It doesn't get saved anywhere. It only exists while that query is running. Window function performs a calculation across multiple rows and keeps every single row in the result. The reason why we can't use group by here is that group by clause collapses all rows into one single row. We use lag, which reaches back to the previous row and grabs its value. The only difference is that we'll use lead instead of lag. Lead will reach forward to the next row and grab its value. COALESCE function takes a list of values and returns the first one that is not null. Since null values can't be used in the calculations, COALESCE function will replace them with zeros.
Open in a real environment →Hard Combine Data from Multiple Sources into Unified Report
We need to combine data from multiple sources into a unified report using Pandas. We are given one customers CSV file and one orders SQLite database. SQLite is a lightweight database that stores everything in a single file. We get the data from API, CSV file, and the database. We send an HTTP GET request to a URL, and the API responds with data in JSON format. Our job is to fetch data from all three sources, combine them into one unified DataFrame, and save the final report as a CSV file. We import three libraries: Requests, Pandas, and SQLite 3 to connect the database. We open the database file using connect function. Read SQL query will execute our SQL request and load the result directly into the DataFrame. Merge in Pandas connects two DataFrames based on a common column or index. It works the same way as JOIN in SQL. On indicates which common column was used to connect the DataFrames. The method is left join, which means that we keep all orders even if customer ID doesn't match. We can calculate the total amount when multiplying quantity by price. The final unified report is saved as CSV file.
Open in a real environment →Hard Export SQLite Database to Parquet Format with Metadata
We'll need to export SQLite database to Parquet format with metadata using Pandas. We have one SQLite database that is called ecommerce. It contains three tables, customers, products, and orders. The orders table has foreign keys that connects it to the first two tables. SQLite is a lightweight database that stores everything in a single file. Our job here is to export every of three tables to their own Parquet file with Snappy compression and create a manifest JSON file describing the export. Parquet stores data column by column instead of doing that row by row. A manifest is a metadata file that describes the contents of an export. We import five things: SQLite 3 library to connect to the database; Pandas, that will be aliased as pd; JSON to create the manifest; OS library for creating directories, building file paths, and getting file sizes; and the DateTime module for getting the current timestamp for the manifest. We read the entire table into the data frame and save the data frame as a Parquet file. Then we set compression to Snappy and index to false.
Open in a real environment →Hard Top Categories by Average Price
We are given two tables, inventory and products. These two tables share common column and are connected through product_id column from inventory table and id column from products table. Our main goal is to find the top three product categories with the highest average price. We only consider active products, meaning that they are in stock. For low stock items are qualified those products that are less than 10 available. The final result should be ordered primarily by average price in descending order, and then by product count column as a tiebreaker. A CTE is a temporary result set in SQL that you can reference within a single query. We use left join to make sure every product appears, even if it somehow has no inventory entry. For low stock items column, we need to count only products with stock less than 10. We group by category so that all rows that share the same category would collapse into one group. Rank here assigns a number to each row based on a specified order. We put the condition inside the where clause that price rank should be less or equal to three.
Open in a real environment →
Spark (20)
Easy CSV and Partitions
Spark is a big data processing framework. It is designed to process massive amounts of data across multiple computers at the same time. And instead of tables, it uses data frames. Our job here is simply to read a CSV file, then to find out how many partitions are created, and print that number. When Spark reads this file, it doesn't process it as one giant block. Instead, it splits it into smaller chunks called partitions. Each partition gets sent to a different executor. The default maximum partition size is 128 megabytes. RDD stands for Resilient Distributed Dataset. The RDD splits it into multiple partitions, and each partition holds a subset of the full data. RDD will convert the data frame into a format that can help us to access the underlying partition information. And getNumPartitions is a built-in method that returns the number of partitions as a number.
Open in a real environment →Easy Repartition
Spark is a big data framework that is designed to process massive amounts of data across multiple computers at the same time. Instead of tables like in SQL, Spark uses data frames. We have only one file that is called orders.csv with 5,000 records. Our main goal here is to repartition the data frame to eight partitions and print the task count in the format that it equals to eight. When Spark reads this file, it doesn't process it as one giant block. Instead, it splits it into smaller chunks called partitions. Each partition gets sent to a different executor. An executor is a process that runs on its chunk of data independently. Repartition is a built-in Spark method that lets us manually control how many partitions our data frame has. We pass the number that we want, and Spark will redistribute all the data across exactly that many partitions. repartition.rdd will convert our data frame to an RDD format, which is a data structure of Spark that holds the partitioned data. getNumPartitions is a built-in function that counts and returns the number of partitions.
Open in a real environment →Easy Broadcast Join
Spark is a big data framework that processes massive amounts of data across multiple computers at the same time. Instead of tables like in SQL, Spark uses data frames. We are given two files, orders.csv with 5,000 records, and customers.csv with 50 records. We need to join these two files together using a broadcast join, then count orders and print the number of distinct cities. We will use a regular inner join because we only want orders that have a matching customer. The only thing is how we perform that join. Instead of shuffling both data frames across the network, Spark takes the small data frame and sends a full copy of it to every worker. So each worker now has its own partition of the large data frame and small data frame. It means that it can perform the join right here without any movement. We don't want to shuffle this large data frame across the network. That's why we take the smallest one, because it is easier and cheaper to copy. We will also import the broadcast function from the library. header that is set to true uses the first row as column names, and inferSchema automatically detects data types for each column. Then we count orders per city with the help of group by.
Open in a real environment →Easy Correcting Social Media Posts
Spark is a big data framework that is designed to process massive amounts of data across multiple computers at the same time. Instead of tables, Spark uses data frames. We have one file, posts.csv. Its data frame contains seven columns: text of the post, ID, date, amount of likes, comments, and shares, and platform where it was published. We need to go through every post and replace the word Python with PySpark in the text column. withColumn is a data frame method that modifies or replaces a specific column. It takes two arguments. The first one is the column, it is the text column, and second one is the value that we want to add. For the second argument, we use regexReplace function. This function takes three arguments. The first one is the column to search. Second is the word to find. It is Python. And third is word to replace it with, which is PySpark.
Open in a real environment →Easy Daily Category Sales Aggregation
Master daily sales aggregation in PySpark. Learn how to join transaction tables with product catalogs and use multi-column GroupBy operations to calculate total quantities sold per category per day.
Open in a real environment →Easy Most Common Order Status
Spark is a big data framework that processes massive amounts of data across multiple machines. Instead of tables, Spark uses data frames. We are given only one file, orders.csv, with 5,000 records. Each order has a status, like completed, canceled, ongoing, and so on. Our job here is to find which status appears most frequently. There are two types of transformations, narrow and wide. In narrow transformation, each row on the left executor goes directly to the row on the right executor. It means that no data moves between executors, and each partition stays on the same machine and gets processed independently. But when it comes to wide transformations, rows from one executor can end up on a completely different executor. Here, data is able to move between machines, and this is called a shuffle. The only problem is that wide transformation requires more network traffic and more time. When we use group by, we do the wide transformation because it makes the rows with the same status move onto the same executor. Count finds the number of rows in each group, and ordering by count, ascending set to false, which means that everything is sorted in descending order.
Open in a real environment →Easy Calculating Overtime Pay
We will be calculating overtime pay. Spark is a big data framework that processes large amounts of data across multiple machines. We are given two files: employees.csv and payroll.csv. First data frame contains the names of employees, their IDs, ages, and job positions. The second one stores hourly rate, amount of hours worked, and reference to an employee by ID. Our job is to calculate total pay for each employee, which consists of two rules. If an employee worked less or equal to 40 hours, then the total pay is the product of hours worked and hourly rate. But if an employee worked more than 40 hours, then all the extra hours need to be multiplied not by the regular rate, but by the one that is 1.5 times bigger. We join the two data frames on employee ID using inner join. To the result data frame, we add a new column that is called Pay using the withColumn method. When is Spark's version of if/else statement. Else statement in Spark is replaced with otherwise.
Open in a real environment →Medium Cache and Performance
Spark is a big data framework that processes massive amounts of data across multiple machines. Instead of tables, it uses data frames. We have one file, orders.csv, with 50,000 records. Our job here is to cache the data frame, run count twice, measure how long each run takes, and print the results. Spark uses lazy evaluation, which means that it doesn't execute anything when we write a transformation. A transformation is simply an operation that modifies or processes our data. Lazy evaluation doesn't run anything until we specifically ask for a result. If we don't do the caching, then every single iteration reads from the input and produces an output independently. But when we do the caching, then the input is read only once and stored in the distributed memory. The first count reads the CSV file and caches the result, while second one reads from that memory and makes everything much faster.
Open in a real environment →Medium Filter Popular Videos
We will need to filter popular videos. Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We are given one CSV file that is called Videos. The data frame contains six columns: title of the video, its ID, genre, release year, duration, and number of views. We need to filter everything and keep only videos that have more than one million views and were released in 2019 or later. The result should be saved as result_df. We read the CSV file to a given path and store it in a variable called df for data frame. When header is set to true, it uses first row as column names. At the same time, inferSchema automatically detects data types of those columns. For the condition, we use filter method within data frame and store it in result_df. The condition consists of two statements. First one checks if number of views is greater than one million, and second one ensures that the release year is after 2019. In between, we use AND operator that requires both of the statements to be true.
Open in a real environment →Medium Anonymize User PII
PII stands for Personally Identifiable Information, data like emails and phone numbers that can identify a real person. We are given only one file that is called users.csv, and this data frame contains three columns: email, phone number of the user, along with referenced_by ID. We are required to do two things. First, we extract the domain from the email address, which means that we keep everything after the @ symbol. Then we need to hide the first digits of the phone number, and only the last four should be visible. Regex stands for regular expression. It is a pattern that is used to search, extract, or replace specific text inside of a string. Regexp_extract is a Spark function that takes a specific part of a string using a regex pattern. We will use the regexp_replace function that simply replaces part of the string. WithColumn will take the results and create a new column.
Open in a real environment →Medium Call Center Daily Stats
Join call records with customer data and compute daily aggregates using countDistinct and sum.
Open in a real environment →Medium Venture Capital Sector Analysis
Spark is a framework that processes large amounts of data across multiple machines. At the same time, instead of tables, it uses data frames. We are given two files, companies and investments. Our job here is to find the total investment amount for each industry sector and sort from highest to lowest. We will first combine both of the data frames, then we will sum investments per industries, and in the end, we will sort everything in descending order. Inner join returns only rows where a match exists in both tables or data frames. We will use inner join because for every investment row, we want the company's name and industry to be right next to it. We group by industry column so that all rows that share the same field are put together. Sum function simply adds up all amount values and finds the total investment per industry. The third step is that we order everything by total investment column in descending order.
Open in a real environment →Medium Window Functions without Partitions
Master global sorting and sequential numbering in PySpark. Learn how to join DataFrames and use the row_number() window function across an entire unpartitioned dataset.
Open in a real environment →Medium Calculating PE Portfolio Values
Master financial data aggregation in PySpark. Learn how to join relational tables, multiply columns to calculate holding values, and group by multiple dimensions to compute daily private equity portfolio totals.
Open in a real environment →Medium Top Products by Revenue
Spark is a framework that processes large amounts of data and covers multiple machines at the same time. We are given two files, products and sales.csv. We are required to aggregate total revenue per product, then rank products within each category using dense rank function, and keep only the top three. The output should include the name of the product, category to which it belongs, revenue from the sale, and rank number. A window function performs a calculation across a set of related rows and keeps every single row in the result. This is the key difference from group by, which is an aggregate function that collapses multiple rows into one outcome per group. A window function instead just adds a new column with the calculation. We'll use group by product ID that will put all sales rows for the same product together. Sum will simply add up all revenue values for each group within revenue column. Partition by category will create invisible walls between each category. We sort everything out in descending order by revenue so that the highest amount will get rank one. Dense rank is a type of window function that assigns rank numbers. The thing about dense rank is that it never creates any gaps. If two products have the same amount of revenue, then both of them get rank one. We need to build sort of a condition using filter that will keep only products with rank three or less.
Open in a real environment →Medium Product Summary
Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We are given three files, and their data frames are called products, sales, and inventory. We need to calculate total quantity sold and total revenue per product from sales. Then we find the total stock number per product across all warehouses. And we need to join everything to products data frame and make sure every product appears, even if it has no sales or inventory records. Any null values should be replaced with zero. We group by product ID so that all sales rows that belong to the same product are put together. Inner join returns only rows where a match exist in both tables or data frames. Left outer join returns everything from the left table, plus matching rows from the right. In our specific case, we'll use left join. It will keep all rows from the left table, which is products, regardless whether there is a match in sales. Coalesce function takes list of values and returns the first one that is not null. Lit function creates a literal value of zero.
Open in a real environment →Medium Parsing Comma-Separated Values
Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We have only one file that is called background_checks.csv. Four of these columns store multiple values in one string separated by comma. We are required to count how many items are there in each of those strings and create four new columns to store these results. Everything should be sorted out in ascending order by ID. We read the CSV file into data frame using given path. When header is set to true, it uses first row of the data frame as column names, and inferSchema automatically detects data types. We use split function that takes a string and cuts it into an array using a separator. We wrap everything inside of size function that will count how many elements are in the array. We sort everything out in ascending order by check ID using order by.
Open in a real environment →Hard Mountain Climber Logs
Spark is a framework that processes large amounts of data across multiple machines at the same time, and instead of tables, it uses data frames. We are given two files, mountain_info and mountain_climbers.csv. Our job here is to find the most recent climber for each mountain. We need to keep in mind that mountains with no climbing records should be excluded from the final result. A window function performs a calculation across a set of related rows and keeps every single row in the result. This is the key difference from group by, which is an aggregate function that collapses multiple rows into one outcome per group. We need to read both of the files into data frames. We set header to true, which means that the first row will include the column names, and inferSchema will automatically detect data types of those columns. Inner join returns only rows where a match exists in both tables or data frames. We want name of the mountain from info_df to be equal to the name of the mountain from climbers_df. Partition by mountain name creates sort of invisible walls between each mountain. We also sort everything in descending order by climb date so that most recent date comes first. Row number is type of window function that assigns a sequential number to each row. Filter method will help us to build a condition to keep only rows where RN is equal to one.
Open in a real environment →Hard Global & Domain SEO Leaders
Master advanced PySpark Window functions by calculating both partition-level and global maximums. Learn how to use empty window partitions and conditional F.when() logic to identify top-performing SEO pages.
Open in a real environment →Hard Tracking Customer Purchase History
Spark is a framework that processes large amounts of data across multiple machines at the same time. Instead of tables, it uses data frames. We have only one file that is called transactions.csv. Its data frame contains four columns, product ID, reference to the customer by ID, quantity, and date of transaction. For each transaction, we need to find what product that particular customer bought most recently before the current one. Then we create a new column that combines the date and that previous product in one string. A window function performs a calculation across a set of related rows and keeps every single row in the result. This is the key difference from group by, which is an aggregate function that collapses multiple rows into one outcome per group. Partition by customer ID will create invisible walls between customers. Lag is a type of window function that will look backwards and reach back to the previous row to grab its product ID value. Concat function joins multiple strings together.
Open in a real environment →
Snowflake (22)
Easy Filter Popular Videos on a Streaming Platform
We will filter popular videos on a streaming platform. Snowflake is a cloud-based data warehouse platform. Here, your query will look just like an SQL. We are given videos table, which contains video ID, title, genre, release year, duration, and view count columns. Our job is to retrieve those videos that have more than one million views and that were released in year 2019 or later. FROM clause, this is one of the things that differ regular SQL from Snowflake. It uses a dynamic reference with the ref function. Instead of naming the table, we are asking the system to find correct version of the table for us. Inside of the double curly braces, we put the ref function with the name of our table. We select all columns using select with a star. Our main goal was to filter the videos, and to do so, we should build sort of a condition inside of the WHERE clause. The first statement is that a number inside of view count column should be higher than one million. The second is that release year should be higher or equal to 2019. Since both of the statements are mandatory, we will use a logical operator.
Open in a real environment →Easy Replace Keywords in Social Media Post Text
Snowflake is a cloud-based data warehouse platform. The syntax, the functions, the logic, everything is almost like in SQL. Our table, social media, contains information about different posts, its ID, text inside of it, date, amount of likes, comments, shares, and the name of the platform. Our job is simple. We go through every post and replace the word Python with PySpark. The only thing that is going to change is that inside of the text column, the word Python is replaced by PySpark. In Snowflake, it uses a dynamic reference with the ref function. Inside of the double curly braces, we put the ref function with the name of our table. Since our goal was to modify the text column, we need to use replace function here. Replace function takes three arguments: the string itself, the old value, and the new value. Replace basically scans through the string, finds the needed word Python, and swaps it with a new value.
Open in a real environment →Easy Filter Movies with Missing Box Office Data
Practice filtering for NULL values in Snowflake SQL with this movie analytics interview question. You query a movies table to find rows where box office collection data is missing using IS NULL. Covers NULL handling, IS NULL, WHERE clause, and data quality checks in Snowflake. An easy-level question common in analytics interviews at companies like DoorDash.
Open in a real environment →Easy Daily Category Sales
SELECT
Open in a real environment →
s.sale_date,
p.category,
SUM(s.quantity_sold) AS total_quantity
FROM {{ ref("sales") }} s
INNER JOIN {{ ref("products") }} p
ON s.product_id = p.product_id
GROUP BY s.sale_date, p.category
ORDER BY s.sale_date ASC, p.category ASCEasy Filter and Uppercase Artifacts
SELECT
Open in a real environment →
ID,
Item,
Period,
UPPER(Material) AS Material,
Quantity
FROM {{ ref("artifacts") }}
WHERE Quantity > 100Easy Merge Customer Records from Two Sources
We will merge customer records from two sources. Snowflake is a cloud-based data warehouse that uses SQL. We are given two tables, customers from West and customers from East. Both of the tables contain the same columns and store information about the customers, such as full name, age, ID, and subscription plan type. Our job is simply to combine these two tables into one single result, even if they contain duplicate rows. FROM clause in Snowflake is a little bit different from SQL. It uses a dynamic reference with the ref function. Inside of the double curly braces, we put the ref function with the name of our table. In SQL and Snowflake, UNION operator is a set operation used to combine the result sets of two or more SELECT statements into a single unified output. If we use UNION separately, it would remove all duplicate rows. So using UNION ALL simply means that we want to keep all rows, even if they are repeated.
Open in a real environment →Easy Filter Funded Startups
We need to write a query to filter funded startups. This is a Snowflake question. Snowflake is a cloud-based data warehouse that simply uses SQL. We are given two tables, investors and startups. Our job is to find investors whose average funding across all their startups is strictly greater than their personal funding limit. Those cases when limit equals to average funding needs to be excluded. The output should include three columns: investor name, its ID, and average funding number. We are more interested in inner join because it returns only the rows where there is a match in both tables. Within funding column from startups table, we implement average and round functions. Since we need to find the average number for each of the investing companies separately, we need to use a GROUP BY clause. We will put a condition inside of HAVING clause that works with groups and filters them. We can't use WHERE clause for the condition because there is a strict rule that WHERE clause can only come before GROUP BY clause, not after.
Open in a real environment →Medium Combine Customer Orders and Products
We need to combine customer orders and products. Snowflake is mainly using the same concepts as in SQL. The only difference is that Snowflake is cloud-based. We are given three tables: customers, orders, and products. Our main goal here is to join all three tables together and return six unique columns. In Snowflake, we have to put the name of the table inside of the ref function, and everything has to be wrapped up in double curly braces. The reason behind is that Snowflake uses data build tool, which requires the ref function. Join connects two tables based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. We give aliases O to orders and C to customers table. Using on, we indicate which common columns were used to connect the tables. We use double pipe operator, or in other words, a concatenation sign, to get the full name.
Open in a real environment →Medium Anonymize User PII Data for a Social Media Platform
We need to write a query to anonymize user PII data for a social media platform. This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all of its concepts. PII stands for Personally Identifiable Information. For each user, we need to extract only the domain part from the email address and anonymize the phone number by hiding the first six digits with asterisks and keeping only the last four. In Snowflake, we put the table inside of the ref function, and this whole thing is wrapped up in double curly braces. Split part function takes three arguments and splits a string into parts based on a delimiter and returns a specific part. Our delimiter will be @ symbol. We will use the right function. Right takes two arguments and extracts a specified number of characters from the right side of a string. Since phone is not a string, we need to use double colon varchar in order to convert number to a string. Using the concatenation sign that combines two strings, we connect them together.
Open in a real environment →Medium Product Sales and Inventory Data
This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given three tables, products, sales, and inventory. All three tables are connected through product ID, since this is a common column. First, we need to calculate total quantity and total revenue per product from sales. Then we need to calculate total stock per product from inventory. And in the end, we need to join everything with products table, making sure every single product appears in the result, even if it has no sales or inventory records. If we try to join everything directly without aggregating first, we would get duplicate rows and completely wrong totals. A CTE, or in other words, a common table expression, is a temporary result set that we define at the top of our query. A left join keeps all rows from the left table, even if there is no matching row in the right table. The missing values just become null. Coalesce function takes a list of values and returns the first one that is not null.
Open in a real environment →Medium Products and Duplicates
Practice a medium difficulty Snowflake SQL interview question tagged JPMorgan that tests deduplication, CTEs, and INNER JOIN. In a manufacturing scenario, you must remove duplicate rows from two tables using SELECT DISTINCT or ROW_NUMBER before joining them on a shared key. This question evaluates your ability to handle data quality issues before performing joins, a skill commonly tested in financial services and data engineering interviews.
Open in a real environment →Medium Mortgage Rate Calculator
WITH aggregated AS (
Open in a real environment →
SELECT
md.mortgage_type,
ROUND(SUM(md.interest_rate) / COUNT(um.user_id), 2) AS rate_of_mortgage
FROM {{ ref("mortgage_details") }} md
INNER JOIN {{ ref("user_mortgages") }} um
ON md.mortgage_id = um.mortgage_id
GROUP BY md.mortgage_type
)
SELECT * FROM aggregatedMedium Weekend Order Detection
This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given two tables, orders and products. Our job here is to parse the order date column from string into a real date format, and then drop any invalid dates. After that, we need to detect whether the order was placed on a weekend or not. If it's from Monday to Friday, then the output is false. If it's Saturday or Sunday, then the output is true. We will use a CTE, in other words, a common table expression, which is a temporary result set that we define at the top of our query. In Snowflake, we use data build, which is a framework that manages and organizes tables. We put it inside of ref function, and this whole thing is wrapped in double curly braces. We are more interested in inner join because it returns only the rows where there is a match in both tables. Within parse date column, we implement try to date function. This function converts a string into a real date type. If the string is invalid, it simply returns a null value. Day of week is a Snowflake function that takes a date and returns a number that represents the day of the week. Sunday is assigned with zero, Saturday is assigned with six. We check if the day of the week result is either zero or six, meaning Sunday or Saturday.
Open in a real environment →Medium Flooring Company Data
This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its concepts. We are given three tables: customers, orders, and products. Full name in the customers table stores first and last name together in one column, separated by space. We need to split them into two columns: first name and last name. Here we also have to split product info column into two separate ones. In Snowflake, we use data build tool, which is a framework that manages and organizes tables. We use the ref function wrapped in double curly braces, and then ref function finds the correct table in the Snowflake environment automatically. Join connects two tables based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. To split full name and product info columns, we will use split part function. This function takes three arguments, splits the string, and returns a specific part. Second argument is delimiter. It's basically the character where the cut happens.
Open in a real environment →Medium Assign Row Numbers to Authors per Paper
We will assign row numbers to authors for each paper. This is a Snowflake question, which is a cloud-based data warehouse that uses all main concepts from SQL. We have two tables, research_papers and authors. Our job here is to join these two tables together and assign a sequential row number to each author per paper. The numbering must reset for every new paper so that the first author of each paper always gets row number one. In Snowflake, we put the table inside of ref function, and this whole thing is wrapped up in double curly braces. We are more interested in inner join because it returns only the rows where there is a match in both tables. Between these two tables, we implement inner join operator, and using on keyword, we indicate which columns are used to combine the tables. Row number assigns a sequential number to each row, starting from one. Over keyword simply activates the window function. Partition by divides all the rows into separate groups based on a column. In our case, it's paper_id. Row number will reset back to one at the start of each new group. And then we sort everything out in ascending order by author_id.
Open in a real environment →Medium Amusement Park Rating Anomalies
We are given two tables, rides and visitors. Our job here is to find rides whose average rating is unusually high or unusually low compared to all other rides. We measure this using standard deviation. Standard deviation is a statistical measure that checks how spread out a set of values is from their average. Our query will consist of three parts. First, we will calculate the average rating per ride. Second, we will calculate global mean and standard deviation across all ride averages. And third, we will use those stats to flag anomalies. A CTE, or common table expression, is a temporary result set that we define at the top of our query. For our task, we are more interested in inner join because it returns only the rows where there is a match in both tables. A window function, unlike group by that collapses all rows into one, performs a calculation across multiple rows and keeps every single row in the result. Stddev is a built-in SQL and Snowflake function that calculates standard deviation. We wrap it inside of ABS, which stands for absolute value. Then we compare if that distance is greater than global standard deviation. If the distance is greater than one standard deviation, the ride is considered too far from the norm and gets true.
Open in a real environment →Medium Usage and Accuracy per Model Type
This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We have two tables, models and usage logs. Both of the tables are connected through a model ID, which is a common column. The requirements are calculating the total users by summing all the usage log entries, and at the same time, to find the average accuracy for each model. We will store the results in sort of a temporary table that is called CTE. CTE is short for Common Table Expression. It's a temporary result set that we define at the top of our query inside of WITH clause. From usage logs table, we first select model ID, and then we build a new column called total users and implement sum function within users column. Grouping by model ID would collapse all rows sharing the same ID into one single group. For this last column, we will implement window function with average. Partition by model type will divide all rows into separate groups based on their model type. We implement the round function and wrap the window function inside of it.
Open in a real environment →Medium Find the Last Climber per Mountain
We need to find the last climber per mountain. This is a Snowflake question, which is a cloud-based data warehouse that uses SQL. We have two tables, mountains and climbers. Both of the tables are connected through mountain name. Our job here is to find the most recent climber for each mountain. We only want mountains that have at least one climbing record. We need to rank all climbers per mountain by date from most recent to oldest, and then pick the climber with rank one. In Snowflake, we use database tool, which organizes and manages the tables. We put the table inside of ref function wrapped in double curly braces. We are more interested in inner join because it returns only the rows where there is a match in both tables. Partition by mountain name would divide rows into separate groups, and each group would indicate one mountain. Then row number would reset to one at the start of each new mountain group. We sort everything out in descending order by climb date so that the climber with the most recent date gets row number one.
Open in a real environment →Hard Rank Top Products by Revenue per Category
We need to rank top products by revenue for each category. This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given two tables: products and sales. First, we need to calculate how much revenue each product generated. Then we rank them within their category from highest to lowest, and in the end, we keep only the top three per category. A CTE, or in other words, common table expression, is a temporary result set that we define at the top of our query. In Snowflake, we use database tool, which organizes and manages tables. We put the name of the table inside of ref function, and this whole thing is wrapped in double curly braces. We need to combine them using JOIN. We are more interested in inner join because it returns only the rows where there is a match in both tables. Dense rank is a function that assigns rank numbers to rows. The reason why we use dense rank instead of regular rank is that dense rank never creates any gaps. Partition by category creates sort of invisible walls between each category. We put a condition inside the WHERE clause that rank has to be equal or lower than three.
Open in a real environment →Hard Highest SEO Score Pages per Domain
Practice a hard Snowflake SQL interview question tagged Cisco that tests window functions, aggregation, and conditional logic. Working with an SEO analytics dataset, you must find the highest scoring page per domain and identify the overall best page across all domains using CASE expressions and window functions. This question evaluates your ability to combine per-group and global rankings in a single query.
Open in a real environment →Hard Math Expressions
Practice a Snowflake SQL interview question tagged IBM that tests regular expressions and pattern matching with RLIKE or REGEXP. In a data validation scenario, you filter user-submitted text to keep only valid arithmetic expressions containing digits and operators. This hard difficulty question covers regex patterns in Snowflake, RLIKE for row filtering, and input validation logic commonly asked in data quality and engineering interviews.
Open in a real environment →Hard Track Product Purchases
This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given only one table that is called Transactions. Each row represents one purchase made by a customer. First, for each transaction, we need to find what product the customer bought most recently before the current one. Second, we need to create a new column, combine the date and that previous product into one string, and use the word none when there is no previous product. A CTE, or Common Table Expression, is a temporary result set that we define at the top of our query. Lag specifically within product ID row looks backwards, reaches to the previous row, and grabs its value. Partition by creates sort of a group for each customer. We wrap previous product in COALESCE function. COALESCE takes a list of values and returns the first one that is not null.
Open in a real environment →
Python (19)
Easy CSV Row Filter and Count
We have one customers.csv file. Each row represents one customer and has a status field that can be active or something else. We are required to read the file, then count how many customers have the active status, and write that number to a text file. First of all, we import the CSV module that is designed for reading and writing CSV files. We define a new counter variable that we will call active_count, and we set it to zero. Using Python's built-in open function, we will enter the customers CSV file. It takes two arguments, the file path and r that stands for read mode. Then we create a DictReader from the open file that will read the first row as column header, and then every row as Python dictionary. Using for loop, we go through every row in the CSV file and check whether the value in the status column is equal to active or not.
Open in a real environment →Easy Analyze Sales Dataset Dimensions and Calculate Total Revenue
We will analyze sales dataset dimensions and calculate total revenue. Pandas is a library that was specifically designed for data analysis and manipulation. We are given one CSV file that is called Sales Data. Our job is, first of all, to analyze its size, classify it as small, medium, or large based on total amount of sales, and then calculate total revenue. Everything should be saved as JSON report. We need to read it into a data frame. Data frame is simply a table in memory with rows and columns. For rows and columns, we will implement the len function that simply returns the number of items in any list or collection. In order to find the total number of sales, we need to multiply rows by columns. Small one is considered the one with less than 10,000 sales, then medium in range between 10,000 and 99,999, and large, everything that exceeds 100,000 sales. In order to find the revenue, we need to multiply the quantity by price. We will use the sum function to add up all revenue amounts. Everything is wrapped inside of round function.
Open in a real environment →Easy Sort Avro Employee Records by Salary
We'll need to sort Avro employee records by salary. Avro stores data in a compressed binary format, which makes it much more efficient for large data sets. The schema travels with the data within the same file and never gets separated. An Avro file is made up of a header that is followed by multiple blocks. We are given one file that is called employees.avro. Our goal here is to read the given file, sort the records by salary in descending order, and save the result as JSON file. We import two things, Reader function from FastAvro library that will help us to read the file, and then JSON. To read the file, we will use open function that takes two arguments. First is file path, and second is mode. In our case, it is rb, that stands for read binary mode. We create the Avro reader from the open file using the Reader function. We need to sort that list by salary using sorted function. As a second argument, we implement a small lambda function that for each X employee returns X salary. In third argument, we set reverse to True, which will sort everything out in descending order.
Open in a real environment →Easy Count User Events from JSON Activity Logs
We will count user events from JSON activity logs. We are given one JSON file that is called activity_logs. JSON stands for JavaScript Object Notation, and it is a text-based format that stores structured data. Our job here is to count how many events each user performed and save the results as a new JSON file. First thing that has to be done is importing the JSON module, which is a built-in Python library for reading and writing JSON files. We will use the open function. It takes two arguments. First is the path to the file, and second is the mode we need. Load will read the entire JSON file and convert it into a Python object. A dictionary in Python is a data structure that stores information as key value pairs. We need to use for loop to go through every log entry. Dump function converts a Python object back to a JSON file. To run the script, we type python3 and path to the file.
Open in a real environment →Easy Split Delimited Column into Separate Columns with Pandas
We will split the delimited column into separate columns using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We have one CSV file that is called Data, which contains two columns: ID and Full Info. The Full Info column stores multiple pieces of information in one single string, separated by commas. We need to split each string and turn all these pieces into independent columns. We import Pandas library and alias it as pd. We need to read it into DataFrame, which is basically a table in memory with columns and rows. Split in Pandas takes one combined column and expands it into multiple separate columns. It does the splitting whenever it sees the delimiter. Str in Pandas is a special accessor that lets us apply the string operation to every value. We implement the split function. First argument is comma, our delimiter. After setting expand to true, each split piece will get into separate column. Drop function removes a column or row from the DataFrame, depending on how we set the axis. Finally, we save the result to a new CSV file called Split Data.
Open in a real environment →Easy Compare SQLite Database and CSV File Records
We will compare database and CSV file records using SQLite. We have two data sources. The first one is a CSV file. The second source is a SQLite database. SQLite is a lightweight database that stores everything in a single file. We can run it directly inside of our Python program. We are required to find the customer IDs that are out of sync and save the comparison report as a JSON file. We import three things, SQLite 3 library, that will help us to connect and make queries with our database. Then we import CSV to read those files row by row, and JSON to save the comparison report. A set in Python is basically a data structure that stores unique values. Using connect function, we will open the database file. We create a cursor, which is a tool that executes SQL statements. Fetchall function will retrieve all the results as a list of tuples. We will use the concept of set subtraction, and in order to find the extra IDs in the SQL file, we will have to subtract the CSV IDs from the database. Dump function will convert the dictionary to a JSON format.
Open in a real environment →Easy Analyze DataFrame Memory Usage
We'll analyze DataFrame memory usage using Pandas. We are given one CSV file that is called Sales Data. Our job here is to calculate how much memory each column from the initial file uses and save that report to a new CSV file that will be called Memory Report. The first thing that we do is importing Pandas and alias it as pd. Then we need to read the initial CSV file into a Pandas DataFrame that will be called df. We will implement the memory_usage method that will calculate how many bytes of RAM each column is using. We will set deep to true, and in the calculations, the actual content will be considered. Our current result is in bytes, and we will need to convert it to megabytes. We will need to divide it, first of all by 1,024, and by that, we get the result in kilobytes. Then we divide by 1,024 again to convert it to megabytes. The last thing that we have to do is saving our report to CSV file.
Open in a real environment →Easy Calculate Descriptive Statistics for Numeric Columns in Pandas
We need to calculate descriptive statistics for numeric columns in Pandas. We are given one CSV file that is called Sales Data. It contains e-commerce transaction information with both numeric and non-numeric columns. Descriptive statistics are numbers that summarize a data set. Mean is an average value of all numbers, median is the middle value when sorted from low to high, STD is mathematical standard deviation, min for smallest value, max for largest value, and then the percentages. Our main goal is to calculate these eight descriptive statistics for all numeric columns and save the result as a CSV report with all values rounded to two decimal places. In order to work with the CSV file, it should be read into a data frame. Select_dtype is a Pandas method that filters columns by their data type. Include number means that we only want to keep columns that contains integers and floats. We save the report as a CSV file using to_csv function, and we set index to false to prevent Pandas from adding an extra row number column.
Open in a real environment →Easy Extract Schema Information from Parquet File Using PyArrow
We'll extract schema information from Parquet file using PyArrow. PyArrow is a Python library that works with columnar data formats, especially with Parquet files. It gives more access to files' metadata, things like schema, row groups, file statistics, and so on. Parquet stores data column by column instead of doing that row by row. It will look at the data stored in each group, analyze the minimum and maximum values for each column, and skip the entire row groups that cannot possibly contain matching data. Here we are required to extract the metadata about the file, column names, data types, amount of rows, size of the file, and type of compression. The result should be saved as JSON report. We will import pyarrow.parquet, that will be aliased as pq. Then we use schema_arrow function to get the full picture of what columns exist and what data type each column has. Num_rows function gives us the total amount of rows in the file. Compression function will return the name of codec, like Snappy or Gzip.
Open in a real environment →Easy Select Specific Columns from Parquet File
We will select specific columns from Parquet file using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. Parquet stores data column by column instead of doing that row by row. It is good when we want to read only some specific columns and skip everything else. It will look at the data stored in each group, analyze the minimum and maximum values for each column, and skip the entire row groups that cannot possibly contain matching data. We are required to return only five specific ones: ID, first and last name, email, and total amount of purchases. After selecting all five columns, everything needs to be saved as a new Parquet file called selected data. The first thing that we do is that we import the Pandas library and alias it as pd. Then we need to read the Parquet file into a data frame. In Pandas, when we want to select multiple columns, we pass a list of column names inside of double square brackets. Since Pandas automatically assigns row numbers to every data frame, we set index to false, which means that we do not include the data frame row numbers in the output.
Open in a real environment →Easy Merge Customer and Purchase Data Using Pandas
We will merge customer and purchase data using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given two CSV files that are called customers and purchases. We are required to combine both of the initial files using left join and save the result to a new combined data CSV file. The first thing to do is that we import Pandas library and alias it as pd. In order to work with the files, we first have to read them into data frames. Merging Pandas connects two data frames based on a common column or index. It works the same way as JOIN in SQL. Left join returns all rows from the left data frame and the matched rows from the right data frame. If there is no match, then all those fields will contain null values. We use the merge method to connect customers to purchases. On indicates a column that was common for both of the data frames. We set index to false, which prevents Pandas from adding an extra row number column to the output.
Open in a real environment →Easy Convert Excel Files with Multiple Sheets to Individual CSV Files
Open this question to see the full solution and solve it on a real environment.
Open in a real environment →Medium Time-Series Rolling Window Analysis for Multi-Stock Price Data
Here we will be working with Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We have one CSV file that is called Stock Data. It contains daily price information within multiple companies. Each row has a date, ticker, and closing price. Our job here is to calculate rolling statistics, mean, sum, and standard deviation for the period of 7, 14, and 30 days separately for each ticker. In the output, we should have three original columns, plus nine new columns that we'll build. Rolling method in Pandas provides window calculation and allows us to perform aggregate operations. In order to start doing any operations on the file, we need to read it into DataFrame. Using for loop, we go through three window sizes, 7, 14, and 30 days. Then we group by ticker so that all the calculations run separately for each of them. We implement the rolling method and set that window equals to window. Mean will calculate the average of all values in each window. We need to save that DataFrame to a new CSV file using to_csv function.
Open in a real environment →Medium Flatten Nested JSON to CSV with Dot-Notation Columns
Convert nested JSON objects to flat CSV format using dot-notation for nested fields and comma-separated strings for arrays with pandas in Python.
Open in a real environment →Medium Decompose Time-Series Data into Trend, Seasonal, and Residual Components
We will need to decompose time series data into trend, seasonal, and residual components. We will be working mainly with Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given only one CSV file that is called Temperature Data. It contains daily statistics over multiple years. First one is trend, which is a long-term direction of the data that indicates whether temperature generally increases over the years or not. Seasonal is responsible for repeating pattern that happens on a regular cycle. Residual is everything that is left after removing the trend and seasonal components, which is mostly some unpredictable day-to-day variations. Each component should be saved into its own CSV file. We were required to use Statsmodels, which is a Python library for statistical analysis. We import this specific function, which is called seasonal decompose, that will perform our decomposition. Then we specify the model by setting it to additive. We define one full cycle as 365 days because this is an annual pattern for temperature.
Open in a real environment →Medium Flatten Nested Struct Columns in Parquet and Export to CSV
We'll need to flatten nested struct columns in Parquet and export them to CSV using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given one Parquet file that contains customer data, but some columns are nested, which means that instead of a simple value like a string or number, they contain a whole object with multiple fields. Parquet stores data column by column instead of doing that row by row. Our job here is to extract the fields from the nested columns and turn them into independent ones. In order to start working with Parquet file, we need to read it into data frame that is called DF. JSONNormalize function in Pandas will take each dictionary from address column and expand it into separate columns. Next step is renaming the columns by adding address_ as a prefix. Using drop function, we will remove the initial nested columns. Now we will combine the results using concat function. We will save the result to a CSV file. For this reason, we will use to_csv function.
Open in a real environment →Hard Aggregate Time-Series Data into Fixed Time Windows
We'll need to aggregate time series data into fixed time windows using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We have one CSV file that is called sensor data. It contains temperature and humidity readings from a sensor collected at regular intervals. Our job is to group all readings into fixed 15-minute windows and calculate mean, minimum, and maximum values for both temperature and humidity within each group. Time window aggregation solves this by dividing time into equal fixed buckets. All records that fall within the same interval get grouped together, and we calculate statistics for that group. We read the CSV file into DataFrame called df. We convert it to a proper Pandas date-time type. We set the timestamp column as the DataFrame index. We start doing the aggregation with resample method. It divides time series data into fixed time windows. Aggregate function performs multiple operations at once on each group. We save the final aggregated DataFrame using to_csv.
Open in a real environment →Hard Interpolate Missing Values in Irregular Time-Series Sensor Data
We will interpolate missing values in irregular time series sensor data using Pandas. We are given one CSV file that is called sensor_data. It contains temperature and humidity readings from IoT collected over time. Some of the readings are missing because of network issues or sensor failures, so those cells are empty. Our job is to fill in those missing values using interpolation and save the clean data. Interpolation is a way of estimating missing values based on the records around them. In our case, we will use the time-based interpolation that also accounts the irregular time intervals. Before the interpolation itself, we need to convert the timestamp column from a plain string into a proper date-time type. Then we set the timestamp column as the index of our DataFrame. It is very important specifically for time-based interpolation because it requires the date-time to be indexed. We reference to the temperature column and implement the interpolate method. We set method to time to use the time-based interpolation. We were required to round the interpolated values to one decimal place. Finally, we save the result as a CSV file using to_csv function.
Open in a real environment →Hard Remove Seasonal Effects from Time-Series Sales Data
We'll remove seasonal effects from time series sales data using Pandas and Statsmodels. We are given one CSV file, sales data, that contains transactions over multiple years. The sales indicate the current seasonal pattern. We are required to remove those seasonal effects so that we can see if the company is actually growing or it is just a result of seasonal factors. Deseasonalization simply removes that repeating pattern so that we can see the real trend. We import two things, Pandas library, and seasonal decompose function from Statsmodels. Seasonal decompose is the function that splits our time and date into trend, seasonal, and residual components. We specify the decomposition model by setting it to additive, which simply means that the components will add up together. Since our data is monthly and seasons repeat every year, the period will be set to 12 months. In order to calculate the deseasonalized value, we need to sum the trend values with residual values.
Open in a real environment →
More questions (2)
Easy Managing High I/O Processes
Users are complaining about slow file access and we have high disc utilization. We need to reduce IO activity of top offenders using IO priorities and we need to settle the IO priority to idle. While doing so, we need to keep critical jobs, databases, message queues, applications at high priority. First we need to identify processes that have high IO activity. We'll use IO top command, and we at hyphen N 10 at the end of this command. This means that this command will run 10 times. To check current IO priority of job one, we need to use ionice command. First we need to look into process ID of this job. We have no priority set for this process, and we need to set this to idle. And idle will be priority number three. The command for this will be ionice three and the process ID.
Open in a real environment →Easy Docker Multi-Architecture Image
When we build our container from this file, it builds it in architecture of our host system. Our task is to change the setup to build it in multiple architectures. We'll use Docker build with an instance named Multi Arc and we will use buildx create. We'll list current builders by typing Docker buildx ls. We have only default that builds it in our current platform's underlying os. So add new builder. Docker buildx create and then name multi arc. Then we use driver network host and then use to have it as default builder. Now attempt to build for our multi architecture setup. We'll run Docker buildx build. Verify this by typing Docker images.
Open in a real environment →
Practice on real environments
Browse and filter the full Data Analysis catalog, or see questions asked at specific companies.