Data Engineering Interview Questions
Are you preparing for a data engineering interview? Data engineering is about building and maintaining the pipelines and infrastructure that move data reliably. These real Data Engineering interview questions cover SQL, Spark, Kafka, warehousing and modeling.
- 189+Questions
- 71Companies
- 98With video
- 61/112/16 Easy / Med / Hard
Real questions asked at Google, Samsung, Meta, DoorDash, Amazon, Apple and 65+ more.
Click any question to read the full answer, then open it to solve hands-on in a real Data Engineering environment that boots in your browser. Built by working engineers, no AI.
SQL (33)
Easy Average Order Value
This SQL question is called Average Order Value. We have one table that is called Orders. This table records customers' purchases and has four columns: orders ID, date of transaction, reference to the customer by ID, and total amount spent. Our job here is to calculate for each customer the average amount they spent across all their orders. In the end, the average value should be rounded to two decimal places. The output should include customer ID and average order value, sorted by ID in ascending order. We start with FROM clause. Then we proceed with GROUP BY clause that separates individual orders into groups for each customer. In order to choose specific columns from a particular table, we use SELECT clause. We wrap the total amount column with Average inside of Round function as first argument, and two number as two decimal places. Finally, we order by customer ID.
Open in a real environment →Easy Join Employees and Departments
We are given two tables, departments and employees. These two tables are connected through department ID column. Some employees might not be assigned to any department, so their department ID is null. But we still want them in our results. Our main goal here is to return all employees that earn more than 50,000 with their department name sorted by hire date from most recent to oldest. In order to combine these two tables, we will use the concept of left join. A left join keeps all rows from the left table even if there is no matching row in the right table. The missing values just become null. We implement left join in between, and using on, we indicate which common columns were used to connect these tables. We will use where clause to filter the rows. We only want those employees whose salary is greater than 50,000. Finally, we sort everything out by hire date in descending order so that the most recent date pops up first and oldest date last.
Open in a real environment →Easy Filter Orders by Date Range
We need to write a query to filter orders by date range. We are given one table that is called Orders. It has four columns: customer's name, ID of an order that has been made, date of transaction, and total amount spent. We are required to return only the orders that were placed between January 1st, 2023, and June 30th, 2023. Both dates are inclusive, which means that orders on exactly those dates should be included, too. In SQL, we always start with FROM clause because this tells which table we want to work with. Using SELECT clause, we can choose the columns for the output. WHERE clause, which is the filter step. It goes through every single row and checks if the condition is true. For order date column, we use BETWEEN and operators. This combination checks if a value falls within a range, including both the start and end values. Finally, we sort everything by order date in ascending order so that the rows with the earliest date come first.
Open in a real environment →Easy Find Customers Without Orders
We need to find customers without orders. We have two tables, customers and orders. These tables are connected through customer ID column. Our job here is to find customers who have never placed a single order, but at the same time they exist in the customers table. We are more interested in LEFT JOIN since we want all rows from the left table to stay. A LEFT JOIN keeps all rows from the left table, even if there is no matching row in the right table. The missing values just become null. Then we add the LEFT JOIN operator in between, and ON keyword that indicates which common columns were used to combine the tables. In our case, it's customer ID. We proceed with WHERE clause that filters the results. We build the condition to check if order ID is null. We pick this specific column because if a customer has at least one order, their order ID will be a real number after the LEFT JOIN. Finally, we sort everything by customer name in ascending order or alphabetically.
Open in a real environment →Easy Use COALESCE for Null Handling
How to Replace NULL with 0 in SQL and Retrieve All Orders from the
ordersTableWriting clean and efficient SQL queries is an essential skill for database management and data analysis. If you're asked to retrieve all orders from an
orderstable, ensuring that anyNULLvalues in thediscountcolumn are replaced with0, you need to follow specific steps to structure your query correctly. Below is a comprehensive guide on achieving this task.Steps to Write the SQL Query
- Identify the Columns: The
orderstable contains the columnsorder_id,customer_name,discount, andtotal_amount. - Handle NULL Values: Ensure that the
discountcolumn does not contain anyNULLvalues by using theCOALESCEfunction, which allows you to replaceNULLvalues with0. - Select All Required Columns: Ensure that the query retrieves all the columns in the specified order -
order_id,customer_name,discount, andtotal_amount. - Order the Results: Use the
ORDER BYclause to sort the results byorder_idin ascending order.
Sample SQL Query
SELECT order_id, customer_name, COALESCE(discount, 0) AS discount, total_amount FROM orders ORDER BY order_id ASC;Breaking Down the Query
SELECT statement: This part retrieves the columns you need.
SELECT order_id, customer_name,COALESCE Function: Use
COALESCE(discount, 0)to replaceNULLvalues in thediscountcolumn with0.COALESCE(discount, 0) AS discount,FROM clause: Specifies the table from which to fetch the data.
FROM ordersORDER BY clause: Ensures the result set is ordered by
order_idin ascending sequence, making it easier to read and analyze.ORDER BY order_id ASC;
Best Practices
- Readability: Write clear and readable queries. Using aliases (like
COALESCE(discount, 0) AS discount) makes it easier to interpret results. - Performance: Ensure your database has indexes on columns commonly used in
ORDER BYclauses, likeorder_id, to optimize query performance.
Leveraging these structured steps will help you efficiently write the required SQL query to retrieve and process the orders data from the
Open in a real environment →orderstable, replacingNULLdiscounts with0, and ordering byorder_id. This method ensures clean data handling and a structured output, which is a critical aspect of database queries and analysis.- Identify the Columns: The
Easy Merge Multiple Address Fields
We have a customers table with columns like city, customer ID, first name, last name, postal code, state, and street address. Our job is to combine them into one single column called full address. Some columns could have null values, meaning some customers might be missing a city, state, or postal code. In this case, they should be excluded from the full address. One more possible case is that all components are missing. In this case, we should return an empty string. For this problem, we will need to use COALESCE function. COALESCE function takes a list of values and returns the first one that is not null. For this reason, we need to know the case statement. The idea is basically like an if else statement in any programming language. We will use the concatenation sign in order to connect them one by one. The last thing we have to do is sorting everything out in ascending order, which can be done using order by.
Open in a real environment →Easy String Concatenation in SELECT
Our job is to retrieve the full names of all employees by combining the first and last names. We have to make sure that the spaces in between are added, and the result is sorted in ascending order. In SQL, a string is basically just text, anything like a name, a number, or email that are stored in columns with types like varchar or text. Concatenation is simply joining two or more strings together into one output. If you want the space between your strings, you have to treat that space as its own separate string inside quotes. In order to return this specific full name column, we use select clause and implement concat function inside of it. We name our new column as full name. The last thing we are required to do is to sort everything in ascending order based on full name. For this reason, we use order by clause and the name of the column.
Open in a real environment →Easy Find Nth Highest Revenue
We will write a query to find N's highest revenue. We have sales table with columns like ID, product, and revenue. Our job is simply to find the third highest distinct revenue value. We need to keep in mind that product with same revenue count as one value, not two. Our query will consist of three steps. First, we get rid of duplicates, then what's left, we rank from highest to lowest, and then pick the one on third place. A window function performs a calculation across multiple rows, but unlike group by, it does not collapse everything into one result. The word just activates the window function. We sort the revenue column out in descending order so that on the first place, we get the highest value. We filter the column where rank equals to three, and we find the value on the third position.
Open in a real environment →Easy Self-Join to Identify Missing Supervisors
We have only one table that is called Employees and three columns that indicate employee ID, the name of that employee, and supervisor ID. One employee can be a supervisor for his or her colleagues. Our job is simply to find employees whose supervisor does not exist in the table. Employees with a null value in supervisor ID column are excluded. The final list should be sorted out in ascending order by employee ID. In order to reference the same table twice, we give it two different names, or aliases. e1 will represent each employee, and e2 will search for supervisor. We will use left join, which will keep all rows from the left table, regardless of whether a match was found in the right table or not. Using ON condition, we simply ask the query to take the supervisor ID from e1 table and look for it in employee ID column of e2 table. The first statement is employee ID from e2 table is null, which will simply catch all employees whose supervisor was not found in the table after the left join. We also check if supervisor ID is not a null value.
Open in a real environment →Easy Year-over-Year Revenue Growth
We need to calculate the yearly revenue and the percentage growth year over year for a given set of financial transactions. We have financials table that contains two columns, transaction date, that is date type, and amount, that is numeric. We group all transactions by year and sum them up to get the total revenue per year. A CTE is a temporary result set in SQL that you can reference within a single query. It only exists while that query is running. In our select, we use extract in order to get only year from our transaction date column. We also need to get the total revenue of our amount column using sum. A left join keeps all rows from the left table, even if there is no matching row in the right table. The missing values just become null. We take our CTE yearly revenue and give it an alias current. Using left join, we take the exact same CTE again, but this time we call it prev for previous. We need to subtract current and previous year, divide by previous year, and multiply by 100 to get the percentage. Inside of select, we use round, which is a built-in SQL function that rounds a decimal number to a specified number.
Open in a real environment →Medium Above Average Price Products
We have a products table with columns like category, ID, name, price, rating, and stock quantity. Our job is to return the products that are priced above the average price. We only work with those products that are currently in stock. It means that not only we don't return them, but we also don't use them while calculating the average amount. While sorting, we should put all the products with null ratings at the bottom. Using select with a star, we can choose all columns. Using from clause, we retrieve data from products table. We need to write a subquery. Using where clause, we create a small condition. Using the average function, it will calculate the average price. The main condition checks whether selected product's price is higher than the average. We have to sort everything out in descending order using order by. At the end of our order by clause, we will use nulls last.
Open in a real environment →Medium Calculate Cumulative Sales
We are given a sales data table with the name of the products, date of transaction, and the amount of daily sales. Each row simply represents one day of sales per product. Our mission is to add a new column called cumulative sales, which is a running total that keeps adding up the daily sales for each product as we move forward through the dates. We do the calculation within each product separately. For each row, we need to add up all the previous rows for the same product, which is not something that regular sum with group by can handle, since group by would just put everything into one total. A window function performs a calculation across multiple rows, but unlike group by, it never collapses them. We start with sum within daily sales column, and then we use over keyword that simply activates the window function. We will use partition by function within product name column that will simply divide the data into separate groups.
Open in a real environment →Medium Find Overlapping Date Ranges
We have one table that is called assignments. Each row represents one employee assigned to one project with start date and end date. We need to find those employees who are assigned to multiple projects at the same time, or in other words, who face an overlap. The final result must contain all the columns from the input and sort it out by project ID primarily, and then by employee ID in ascending order. Two date ranges overlap when they share at least one day in common. We need to compare each assignment to another one for the same employee, which basically means that we reference the same table twice. This means that we'll use self join here. We take assignments table and give it alias a1. Then we join it with a2 copy that will search for overlapping assignments. The first condition is that assignment a1 must start before or on the same day that assignment a2 ends, because if a1 starts after a2 already ended, no overlap is possible. The role of distinct function here is removing those duplicates and making sure that each assignment appears only once in the final result.
Open in a real environment →Medium Set Operation: INTERSECT
This SQL question focuses on set operation called intersect. We are given two tables that share customer ID as a primary key. Our main goal is to return list of active customers based on two criterias. On a monthly basis, a new customer should spend more than 1,000, be loyal, and have at least three years of membership and a premium tier status. Final output must include only the customer ID and name columns sorted in ascending order by ID. A CTE is a temporary result set in SQL that you can reference within a single query. Using with clause, we create a CTE called monthly spenders among new customers that spend more than 1,000. We can name the CTE as premium tier. Intersect takes both tables and returns only the values that appear in both of them. We select customer ID and name columns from monthly spenders, which was the first CTE. Then we select information from the second CTE, and between these two we add intersect operator.
Open in a real environment →Medium Subquery for Best Order per Customer
We are given two tables. The first table contains the names of the customers. The second table stores order IDs and total amount of each order. Both tables share customer ID primary key which connects them together. In the output, we should get customer's name, ID of the best order, and its total amount. All the results should be sorted by a customer's name in alphabetical order. Since each customer can have multiple orders, and among these orders might be the ones with the same highest total amount, in this case, we should return the one with the smallest order ID. We are more interested in inner join because it returns only the rows where there is a match in both tables. Inside of the where clause, we will have correlated subquery. It is also a query inside another query, but it runs once for every single row in the outer query. Since we need to return the highest valued order for each customer, we sort by total amount in descending order. We also sort order ID in ascending order because when two orders pop up with the same amount, the one with smaller order ID will come first. And in the very end, we apply limit one.
Open in a real environment →Medium Ranking with Dense_Rank
The main focus is on ranking values with dense rank window function. We are working with sales table that tracks individual sales made by different representatives. The goal here is to first add up all sales per person to get their total value, and then rank everyone based on that total from highest to lowest. When we use dense ranking, two people with equal amount of sales will share the same rank. A CTE is a temporary result set in SQL that you can reference within a single query. Dense rank is one of the window functions that assigns a rank number to each row based on a specified order. Over is a keyword that simply activates a window function and lets SQL know that we try to implement it. Finally, order by clause sorts by sales rank first, so rank one comes before rank two, and then salesperson name is sorted alphabetically.
Open in a real environment →Medium Median Salary by Job Title
We'll write a query to calculate median salary for each job title. When all salaries are sorted in ascending order, the median value will be the one that is in the middle of them. If there is an even number of salaries, we take the average of the two middle values. A CTE is a temporary result set in SQL that you can reference within a single query. A window function performs a calculation across multiple rows while keeping every single row in the result. Inside of the window function, we are using partition by job title. Partition by divides the data into separate groups, and count will run independently inside each of these groups. We use row number that simply assigns a sequential number to each row. These formulas calculate the median position so that we know which row number represents the middle one. If the number of rows is an odd number, both of the formulas will give the same result. Using round function and double column numeric, we will convert the results to numeric type and then clean everything up to two decimal places.
Open in a real environment →Medium String Splitting and Aggregation
We have a product_tags table with two columns, id and tags. The tags column stores multiple tags in one single string separated by commas. Our job is to split those tags apart and count how many times each tag appears across all products. One important thing to keep in mind is that tags are case sensitive. A CTE is a temporary result set in SQL that you can reference within a single query. Our temporary table will be called split_data, and then we select id and tag from our product_tags table. First, we are using string_to_array. You basically pass a string and the separator to the function. After turning tags column into array, we will use unnest function, which will basically put each piece of our array into one separate row. We will use the count with a star function that will basically go through each row and return the amount. Using group by clause, we will group all rows with the same tag together.
Open in a real environment →Medium Salary Comparison with CTE Aggregation
We have two tables, departments and employees. Our job is to find employees who earn more than the average salary of their own department. We need the average salary per department before we can compare anything. For this reason, we will use CTE here. A CTE is a temporary result set in SQL that you can reference within a single query. It doesn't get saved anywhere. It only exists while that query is running. We want to find the average salary per department. Inside of it, we select department and average salary grouped by department. For our task, we are more interested in inner join because it returns only the rows where there is a match in both tables. We connect them on department ID from employees table and ID from departments table because we remember that they were the same common primary keys. The average salary should be rounded to two decimal places in the output. Using where clause, we write a condition that salary of an employee should be higher than the average salary that we calculated in our CTE before. The last thing we have to do is sorting everything out primarily by department's name in ascending order and then by salary in descending order.
Open in a real environment →Medium String Pattern Extraction in Descriptions
This question mainly focuses on string extraction. We have a products table with four columns: description, name, price, and product ID. The description column contains free text. Some products have email address inside of it and some don't. Our mission is to find those email addresses and extract them from the column. We only return three columns, product ID, name, and email that we will find from description. We will need to use substring. Substring is a string manipulation function. It simply returns a copy of a specific portion of the string. The pattern of an email address is some word, then @ symbol, word again, dot, and then word in the end. Regular expression is kind of a language that is used to describe a sequence of characters. Instead of equal, we use tilde, which is regular expression operator, which means just find this kind of pattern in description column. We sort everything in ascending order by product ID.
Open in a real environment →Medium Nested Subquery for Latest Record
We only have one table here, events. There are five columns: event_date and name, id, status, and user_id. Each row in the event_name column represents something a user did, and then status column shows if it was successful. For each user, we want to know what their most recent event was. Final result should contain all the columns except for ID column, and everything should be sorted in ascending order by user ID. For each user and event row, we check and find the maximum date for that specific user. If the current row's date matches the maximum date, we keep it. We take the events table and give it an alias e1. Now we have to use this table again, but we can't name it with the same alias anymore. That's why inside of the subquery, we give another nickname, another alias to our table, which will be e2. We implement max function to find the maximum date, and then inside of the where clause, we compare the current date that we are checking right now and the maximum date that we already found. Finally, we sort everything out in ascending order by user ID.
Open in a real environment →Medium Window Function for Moving Average
We are given a sales table with two columns, amount and sale date. Each row indicates the amount of sales per day within one week. Our main goal is to calculate a seven-day moving average for each date. Moving average means that we calculate that number for each day using the current amount plus the amount of previous days. The challenge here is that for each row, we need to look at other rows, which is not something that where or group by clause can do. A CTE is a temporary result set in SQL that you can reference within a single query. Window function calculates across multiple rows, but it keeps every single row in the result. From sales table, we select sale date and amount columns, and also moving average as a new column. This statement, range between interval six days preceding and current row, simply helps us to define the frame. Using round function for this column, we write double colon that will help us to convert raw data to numeric that we need, and two here will round everything to two decimal places.
Open in a real environment →Medium Re-enrollment Rate Calculator
We are given one table, Enrollments. This table has three columns that contain information like course ID, student ID, and term ID. Our job here is to calculate what percentage of students are enrolled in two or more consecutive terms. Consecutive means that in term ID column, the sequence goes like 101, 102, 103 without any gaps. A CTE is a temporary result set in SQL that you can reference within a single query. We will call our CTE as consecutive terms, and from enrollments table, we select student and term ID columns. We implement a window function with row number. Partition by clause divides a result set into smaller groups and allows a function to perform calculations on each subset separately. Row number assigns a sequential number to each row starting from one. In order to prevent all the duplicates, count with distinct will only return unique student IDs and completely ignore any null values. Cast in SQL is used to convert a value of one data type into another. We are required to round the results to two decimal places, and the only way to do so is by using a round function.
Open in a real environment →Medium String Pattern Matching Using LIKE
We'll match string patterns using like. We have two tables, departments and employees. Our job is to filter employees based on three conditions. First one is that their name must start with letter A. Second is that the email must contain the substring @tech, and third is the position level must contain the word senior. A join in SQL connects two tables together based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. Like is an operator in SQL used to search for a specific pattern inside of a string. Like works together with wild cards, and the most significant one is percent sign. We write name column from employees table, like A percent sign, which means A is the first letter, and then any symbol can follow this letter. Second condition is that email from employees table must contain @tech keyword, and since this value is in the middle of the string, we put percent signs from both sides. And finally we sort everything in ascending order by name.
Open in a real environment →Medium Customer Order Aggregation
We are given two tables. First is customers table that contains customer's name, email, and ID. Also, second table that contains ID of the order, its amount, and ID of the customer. Our main goal is to find customers who have placed more than two orders and calculate the total number of orders and the amount they spent on them. The final result must be sorted out in descending order by total spend. A join in SQL connects two tables together based on a common column. For our task, we are more interested in inner join because it returns only the rows where there is a match in both tables. Second step is grouping by ID, name, and email from customers table because this will collapse all rows belonging to the same customer into one single group. Since we already grouped the columns, we will have to use having clause that works with groups, not rows. To get this number, we have to use ID column that we grouped before and count function that will simply find the number of orders for each group.
Open in a real environment →Medium SQL JOIN with Pandas Data Processing and CSV Export
We need to do the data processing and CSV export using Pandas and SQLite. We are given a SQLite database that is called Sales. It contains three tables: customers, orders, and items. SQLite is a lightweight database that stores everything in a single file. We need to connect to the database, then run an SQL query to join all three tables that we had, load the results into Pandas, calculate revenue metrics per customer, and export everything to CSV file. JOIN connects two tables based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. We will implement the read_sql_query function. We build the total amount column by multiplying quantity with unit price. When we group by customer ID, we will put all rows belonging to the same customer together. In order to calculate the revenue percentage, we will need to divide each customer's total by the overall revenue.
Open in a real environment →Medium Insert New Records into SQLite Database from CSV
We'll insert new records into database from CSV file using SQLite and Pandas. We have one CSV file that is called New Customers. It contains customer records that need to be imported in a SQLite database. The problem is that some of these customers might already exist in the database file. SQLite is a lightweight database that stores everything in a single file. Our job here is to read the CSV file, check which records already exist, and make sure we only include the new ones. We will use the connect function where we define the path to the file. Then within this object, we implement cursor. We read the given CSV file into a DataFrame that will be called df. Using cursor execute will run the SQL query. In the query itself, we select ID column from customers table. Fetchall function will retrieve all the results as a list of tuples. The tilde symbol is used here to flip true to false and false to true. Cursor execute will run the insert statement. Instead of putting the values directly in the SQL string, we will use the question mark and pass the actual values as a separate tuple. We do it for security reasons to protect the data from SQL injection. We will use the commit function within our database to finalize and save all changes. Finally, we close the database.
Open in a real environment →Medium Aggregate SQL Query Results with Pandas and Export to Excel
We'll need to aggregate SQL query results and export them to Excel using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given one SQLite database that is called Orders. It contains two tables, the first one with customers' information and second one with all the transactions. SQLite is a lightweight database that stores everything in a single file. Our job here is to join both of the given tables, then calculate the total order value per customer, and save the results to an Excel file. We import two libraries, SQLite 3 and Pandas. Then we open the database using the connect function. We are more interested in inner join because it returns only the rows where there is a match in both tables. Then we execute what we have written above using the read_sql_query function and load the results directly into a Pandas data frame. We use group by so that all rows belonging to the same customer are put together. Finally, we save the resulting data frame to Excel file.
Open in a real environment →Hard Merge Employee and Department Records
We need to merge employee and department records. We have two tables, departments and employees. We only consider departments that have more than 10 employees. Second requirement is to find employees whose salary is above their department average. Third requirement is to add high earners column to the output, which will basically count the number of employees whose salary is more than 75,000. A CTE is a temporary result set in SQL that you can reference within a single query. A join in SQL connects two tables together based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. The difference is that where clause comes before group by, while having runs after. In order to calculate the high earners column, we will use count with case when. Results should be sorted out primarily by department name and then by salary in descending order.
Open in a real environment →Hard Sequence Products by Price
We are given products table with the name of the product, its ID, and the price. Our main goal is to create and calculate neighbor product column. For each product, we take the price of previous product and the price of next product, and then we multiply them together. A CTE is a temporary result set in SQL that you can reference within a single query. It doesn't get saved anywhere. It only exists while that query is running. Window function performs a calculation across multiple rows and keeps every single row in the result. The reason why we can't use group by here is that group by clause collapses all rows into one single row. We use lag, which reaches back to the previous row and grabs its value. The only difference is that we'll use lead instead of lag. Lead will reach forward to the next row and grab its value. COALESCE function takes a list of values and returns the first one that is not null. Since null values can't be used in the calculations, COALESCE function will replace them with zeros.
Open in a real environment →Hard Top Categories by Average Price
We are given two tables, inventory and products. These two tables share common column and are connected through product_id column from inventory table and id column from products table. Our main goal is to find the top three product categories with the highest average price. We only consider active products, meaning that they are in stock. For low stock items are qualified those products that are less than 10 available. The final result should be ordered primarily by average price in descending order, and then by product count column as a tiebreaker. A CTE is a temporary result set in SQL that you can reference within a single query. We use left join to make sure every product appears, even if it somehow has no inventory entry. For low stock items column, we need to count only products with stock less than 10. We group by category so that all rows that share the same category would collapse into one group. Rank here assigns a number to each row based on a specified order. We put the condition inside the where clause that price rank should be less or equal to three.
Open in a real environment →Hard Export SQLite Database to Parquet Format with Metadata
We'll need to export SQLite database to Parquet format with metadata using Pandas. We have one SQLite database that is called ecommerce. It contains three tables, customers, products, and orders. The orders table has foreign keys that connects it to the first two tables. SQLite is a lightweight database that stores everything in a single file. Our job here is to export every of three tables to their own Parquet file with Snappy compression and create a manifest JSON file describing the export. Parquet stores data column by column instead of doing that row by row. A manifest is a metadata file that describes the contents of an export. We import five things: SQLite 3 library to connect to the database; Pandas, that will be aliased as pd; JSON to create the manifest; OS library for creating directories, building file paths, and getting file sizes; and the DateTime module for getting the current timestamp for the manifest. We read the entire table into the data frame and save the data frame as a Parquet file. Then we set compression to Snappy and index to false.
Open in a real environment →Hard Combine Data from Multiple Sources into Unified Report
We need to combine data from multiple sources into a unified report using Pandas. We are given one customers CSV file and one orders SQLite database. SQLite is a lightweight database that stores everything in a single file. We get the data from API, CSV file, and the database. We send an HTTP GET request to a URL, and the API responds with data in JSON format. Our job is to fetch data from all three sources, combine them into one unified DataFrame, and save the final report as a CSV file. We import three libraries: Requests, Pandas, and SQLite 3 to connect the database. We open the database file using connect function. Read SQL query will execute our SQL request and load the result directly into the DataFrame. Merge in Pandas connects two DataFrames based on a common column or index. It works the same way as JOIN in SQL. On indicates which common column was used to connect the DataFrames. The method is left join, which means that we keep all orders even if customer ID doesn't match. We can calculate the total amount when multiplying quantity by price. The final unified report is saved as CSV file.
Open in a real environment →
Git (15)
Easy Create Branch from Detached HEAD State
We have a Git repository that has detached head. Head is basically our latest commit and detached head means that the head that we are currently located at is not pointing to any Git branch, any points directly to commit or tag. It happens when we check out directly tag or we check out directly at comit hash, or we can check out some remote branch. How we can restore from detached head is we either need to return to some existing branch or we have to create a new branch from that detached head. Run git status. We can see that GIT message that head is detached. We'll create new branch by typing Git Checkout the name of the branch, or you can run more modern command Git Switch. Run checkout minus B. And then name of the branch.
Open in a real environment →Easy Rebase Feature Branch
We have a Git repository. Under this folder we have a feature payment branch and it's behind our main branch by 32 commits. We need to rebase a feature branch onto the latest main. We need to bring all the latest changes from main to this feature branch. Switch to feature payment branch. We need to rebase from main. Type git rebase main. We have conflicts. Some of the commits cannot be rebased. We have two changes for the same line. Remove the obsolete one and keep what we need. And now we can git add and git rebase continue. We've successfully rebased and updated feature payment.
Open in a real environment →Easy Apply Specific Stash from Multiple Stashes
Git stash lets us to save our work in progress, go to some other task, and then return back, restore our stashed work, and continue from that moment on. We can save multiple work in progress jobs, so we can have multiple stashes. We have to navigate to this repository, identify the third stash in the list and restore that. So we need to apply it without removing it from the stack. Run git stash list. This will show us list of the stashed work. Index starts from zero. In order to restore this, git stash apply stash and then two. This will still retain stash number two in the list. In order to remove stash from the list, we need to run command git stash drop and then the name of the stash. Another command to restore stash is git stash pop. Git stash pop applies the latest stash, meaning stash number zero, and then drops it from the list.
Open in a real environment →Easy Remove Last Commit and Discard Changes
We've committed some local changes that contain incorrect changes and we need to completely erase that from history. Our task is to navigate to this repository, remove last commit entirely. The most important thing here is to discard all associated files changes. First run get log to see all the current commits. We will go a few steps back, and we also need to discard all associated file changes. This means that this is not soft reset, which will reset that, but keep the local changes. So this has to be a hard reset. We have bad commit, wrong changes. It's the last commit and we will reset it. One commit back, run Git, reset hard at one.
Open in a real environment →Easy Checkout Single File from Another Branch
We have two branches, main and feature settings, and we have a file config json that is located in the feature branch. We need to copy this file into the main branch. We don't need to cherry pick the commit. Cherry pick and copying one single file is different. Cherry picking adds specific commit from another branch to the branch that we are in currently. When we need to copy the file, we need to check out, but we need to check out specific file. GI Checkout, and we type the name of the branch and then Hyen and name of the file that we would like to check out. Now if we run Git status, we'll see this file as modified in our directory.
Open in a real environment →Easy Cherry-Pick Specific Commit
We have a git repository and we have a feature branch called feature. This feature branch contains a fix that fixes the bug on main, but we don't want to move all the commits from feature branch to the main. We just want to pick one commit and move only that. That's called cherry picking. We need to navigate to this repository, identify the commit that fixes the main branch and move it to the main from feature. See the git log to identify the commit that fixes the bug on main, so this commit has message fix critical bug. Next, we need to type git cherry pick and hash of this commit. This commit now was added to our main branch.
Open in a real environment →Medium Restore File to Previous Version
We have a Git repository where we have a file config gs. This file has been modified in the last two commits, but those two commits introduced a bug. We need to restore config gs to the version that it had two commits ago while not affecting any other file. Run Git log to see last five commits. Next, we'll preview config gs that was two commits ago, for this we'll type git show head meaning our current last commit and then the sign, and then two. Next, we'll type git checkout head tilde sign two and then two hyphens config gs, which will restore config gs that was two commits ago. Last thing, we need to commit our changes with restore config gs message.
Open in a real environment →Medium Create an Annotated Tag
We have Git Repository located on this directory which is completed new version of our application. We need to create an annotated tag, and once it's created, we need to push that to remote repository with this name. First, move to this directory, view current commit. This is our latest commit at head of our branch. A lightweight flag just references some commit. It doesn't have its own SHA hash meaning that cannot reference the tag. Annotated tag, however, has its own SHA, so you can reference that. Verify that tag was created by typing Git Tag. And now push this.
Open in a real environment →Medium Add Git Submodule
Integrate external repositories as submodules to manage dependencies without code duplication. Add submodules with git submodule add, configure .gitmodules file, initialize submodule directories, and commit configuration. Essential for managing shared libraries, vendor dependencies, monorepo structures, and maintaining decoupled version control across interdependent projects.
Open in a real environment →Medium Update Submodule to Latest Commit
We have a repository interview repo that contains a submodule vendor details and submodule lets us nesting repositories. When submodule is used, it uses the specific commit. We need to update a submodule to the latest commit on its default branch. Run git submodule to see our submodule status. Next, run git fetch and then log to see commits in our submodule. We need to pull latest changes. Go to the parent and check status again. Those changes are not committed yet. We run git add and then git commit. Submodule is updated to the latest version.
Open in a real environment →Medium Stash Work, Fix Bug, Restore and Update
Imagine a common scenario when you've been working on some git repository on the feature ui and suddenly you need to do something to fix authentication issue on the main branch. For this, you have to create new hotfix branch, commit changes into that branch and then merge that with main branch. While you've been working on feature ui, you cannot simply change the branch because you have uncommitted changes. Before moving to main branch, we need to stash those changes, meaning to put them aside. Next, move to our main branch and to fix our authentication issue, we first need to create hotfix branch. Move to the main branch and merge our hotfix with main. And finally we delete the hotfix branch since we don't need it anymore. We need to move back to our work that we paused meaning to feature ui, and then rebase it from main. Finally, we need to move back the work that we stashed aside. For that, we need to type git stash pop.
Open in a real environment →Medium Remove File from Entire Git History
We've committed file secrets env, which contains some sensitive credentials and we have to remove this from entire git history. First we'll need to move to the repository directory. Check logs that contains secret env file. For this we'll type git log all to see all logs, one line to see them in one line. In order to filter it by the file name we type hyphen and then name of the file. To see exactly what was changed in those commits we can add hyphen p flag. We need to delete this from our current git history and for this we'll run command called git filter branch. We'll run force flag to change this in entire git history. And we'll use the filter flag that lets us run certain command. And the command is basic Linux syntax rm to remove file. Prune empty flag is used in certain cases that removing this will make the commit useless. Finally, we need to delete the history file, meaning we need to run rm rf to delete everything in the git refs original. In case if we'd like to change also this on the remote origin, we need to run git push force all.
Open in a real environment →Medium Merge Repositories Preserving Both Histories
We have two separate git repositories, repo A and repo B, five and four commits respectively. They've been developed independently, so have different histories, and we need to create one monorepo, combine both of them and have full commit history. Important thing is we need to use subtree. Subtree is git subtree command used when we have some shared libraries or other shared resources when we do not want to merge everything into one monorepo, but rather have some repository and reference other repositories in a directory. We'll use git log one line. Create a directory and then initialize git inside this directory. We'll do git init. We'll do our first commit. We'll do empty commit, so we'll need to add a empty flag. Now we will need to integrate our directories as a subtree in our repository. Use git subtree add, and then we use prefix project A. Finally, verify our monorepo. Check it with Git logs.
Open in a real environment →Medium Fix Repository with Unrelated Histories
The repository interview is in broken state. The local and remote branches diverged with no common ancestor, meaning they don't share the same history. When we use Git Push to push things to the remote main, it fails with non fast forward error. And the same happens when we try to pull from Main. Our task is to fix this repository, merge and linearize the unrelated histories using Rebase and create new single commit sequence. When we use Git merge, branch that was merged into the main branch retains the same commit hashes. When we use Rebase, those commit hashes get rewritten. Next we'll pull main and we'll Rebase not merge. The flag that we'll use in this case is allow unrelated histories. We need to resolve this issue. Those three are the main types of the conflicts: modify modify, modify delete, add add. Once this is done, we have to type Git rebase continue.
Open in a real environment →Medium Recover Lost Commits from Detached HEAD
We had Git repository located under this directory and we've been in detached head state. When we switch the main branch, those three commits are now unreachable and we'd like to restore those three commits. When we do git log, we don't see those commits, so we need to find a solution to restore those commits and create a branch called recovered work where those commits will be listed. To see all the logs that we've done to the head of this branch, meaning the ones that were lost from the detached head or while we did git reset and so on, we can type command called git reflog. Git reflog shows us logs exactly for the head. In this git reflog we can see that we have much more than we have in git log. Since our task is to restore this with the branch recovered work, we have to create branch recovered work from some Git commit. We'll use this git commit's hash.
Open in a real environment →
Spark (20)
Easy CSV and Partitions
Spark is a big data processing framework. It is designed to process massive amounts of data across multiple computers at the same time. And instead of tables, it uses data frames. Our job here is simply to read a CSV file, then to find out how many partitions are created, and print that number. When Spark reads this file, it doesn't process it as one giant block. Instead, it splits it into smaller chunks called partitions. Each partition gets sent to a different executor. The default maximum partition size is 128 megabytes. RDD stands for Resilient Distributed Dataset. The RDD splits it into multiple partitions, and each partition holds a subset of the full data. RDD will convert the data frame into a format that can help us to access the underlying partition information. And getNumPartitions is a built-in method that returns the number of partitions as a number.
Open in a real environment →Easy Repartition
Spark is a big data framework that is designed to process massive amounts of data across multiple computers at the same time. Instead of tables like in SQL, Spark uses data frames. We have only one file that is called orders.csv with 5,000 records. Our main goal here is to repartition the data frame to eight partitions and print the task count in the format that it equals to eight. When Spark reads this file, it doesn't process it as one giant block. Instead, it splits it into smaller chunks called partitions. Each partition gets sent to a different executor. An executor is a process that runs on its chunk of data independently. Repartition is a built-in Spark method that lets us manually control how many partitions our data frame has. We pass the number that we want, and Spark will redistribute all the data across exactly that many partitions. repartition.rdd will convert our data frame to an RDD format, which is a data structure of Spark that holds the partitioned data. getNumPartitions is a built-in function that counts and returns the number of partitions.
Open in a real environment →Easy Broadcast Join
Spark is a big data framework that processes massive amounts of data across multiple computers at the same time. Instead of tables like in SQL, Spark uses data frames. We are given two files, orders.csv with 5,000 records, and customers.csv with 50 records. We need to join these two files together using a broadcast join, then count orders and print the number of distinct cities. We will use a regular inner join because we only want orders that have a matching customer. The only thing is how we perform that join. Instead of shuffling both data frames across the network, Spark takes the small data frame and sends a full copy of it to every worker. So each worker now has its own partition of the large data frame and small data frame. It means that it can perform the join right here without any movement. We don't want to shuffle this large data frame across the network. That's why we take the smallest one, because it is easier and cheaper to copy. We will also import the broadcast function from the library. header that is set to true uses the first row as column names, and inferSchema automatically detects data types for each column. Then we count orders per city with the help of group by.
Open in a real environment →Easy Correcting Social Media Posts
Spark is a big data framework that is designed to process massive amounts of data across multiple computers at the same time. Instead of tables, Spark uses data frames. We have one file, posts.csv. Its data frame contains seven columns: text of the post, ID, date, amount of likes, comments, and shares, and platform where it was published. We need to go through every post and replace the word Python with PySpark in the text column. withColumn is a data frame method that modifies or replaces a specific column. It takes two arguments. The first one is the column, it is the text column, and second one is the value that we want to add. For the second argument, we use regexReplace function. This function takes three arguments. The first one is the column to search. Second is the word to find. It is Python. And third is word to replace it with, which is PySpark.
Open in a real environment →Easy Daily Category Sales Aggregation
Master daily sales aggregation in PySpark. Learn how to join transaction tables with product catalogs and use multi-column GroupBy operations to calculate total quantities sold per category per day.
Open in a real environment →Easy Most Common Order Status
Spark is a big data framework that processes massive amounts of data across multiple machines. Instead of tables, Spark uses data frames. We are given only one file, orders.csv, with 5,000 records. Each order has a status, like completed, canceled, ongoing, and so on. Our job here is to find which status appears most frequently. There are two types of transformations, narrow and wide. In narrow transformation, each row on the left executor goes directly to the row on the right executor. It means that no data moves between executors, and each partition stays on the same machine and gets processed independently. But when it comes to wide transformations, rows from one executor can end up on a completely different executor. Here, data is able to move between machines, and this is called a shuffle. The only problem is that wide transformation requires more network traffic and more time. When we use group by, we do the wide transformation because it makes the rows with the same status move onto the same executor. Count finds the number of rows in each group, and ordering by count, ascending set to false, which means that everything is sorted in descending order.
Open in a real environment →Easy Calculating Overtime Pay
We will be calculating overtime pay. Spark is a big data framework that processes large amounts of data across multiple machines. We are given two files: employees.csv and payroll.csv. First data frame contains the names of employees, their IDs, ages, and job positions. The second one stores hourly rate, amount of hours worked, and reference to an employee by ID. Our job is to calculate total pay for each employee, which consists of two rules. If an employee worked less or equal to 40 hours, then the total pay is the product of hours worked and hourly rate. But if an employee worked more than 40 hours, then all the extra hours need to be multiplied not by the regular rate, but by the one that is 1.5 times bigger. We join the two data frames on employee ID using inner join. To the result data frame, we add a new column that is called Pay using the withColumn method. When is Spark's version of if/else statement. Else statement in Spark is replaced with otherwise.
Open in a real environment →Medium Cache and Performance
Spark is a big data framework that processes massive amounts of data across multiple machines. Instead of tables, it uses data frames. We have one file, orders.csv, with 50,000 records. Our job here is to cache the data frame, run count twice, measure how long each run takes, and print the results. Spark uses lazy evaluation, which means that it doesn't execute anything when we write a transformation. A transformation is simply an operation that modifies or processes our data. Lazy evaluation doesn't run anything until we specifically ask for a result. If we don't do the caching, then every single iteration reads from the input and produces an output independently. But when we do the caching, then the input is read only once and stored in the distributed memory. The first count reads the CSV file and caches the result, while second one reads from that memory and makes everything much faster.
Open in a real environment →Medium Filter Popular Videos
We will need to filter popular videos. Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We are given one CSV file that is called Videos. The data frame contains six columns: title of the video, its ID, genre, release year, duration, and number of views. We need to filter everything and keep only videos that have more than one million views and were released in 2019 or later. The result should be saved as result_df. We read the CSV file to a given path and store it in a variable called df for data frame. When header is set to true, it uses first row as column names. At the same time, inferSchema automatically detects data types of those columns. For the condition, we use filter method within data frame and store it in result_df. The condition consists of two statements. First one checks if number of views is greater than one million, and second one ensures that the release year is after 2019. In between, we use AND operator that requires both of the statements to be true.
Open in a real environment →Medium Anonymize User PII
PII stands for Personally Identifiable Information, data like emails and phone numbers that can identify a real person. We are given only one file that is called users.csv, and this data frame contains three columns: email, phone number of the user, along with referenced_by ID. We are required to do two things. First, we extract the domain from the email address, which means that we keep everything after the @ symbol. Then we need to hide the first digits of the phone number, and only the last four should be visible. Regex stands for regular expression. It is a pattern that is used to search, extract, or replace specific text inside of a string. Regexp_extract is a Spark function that takes a specific part of a string using a regex pattern. We will use the regexp_replace function that simply replaces part of the string. WithColumn will take the results and create a new column.
Open in a real environment →Medium Call Center Daily Stats
Join call records with customer data and compute daily aggregates using countDistinct and sum.
Open in a real environment →Medium Venture Capital Sector Analysis
Spark is a framework that processes large amounts of data across multiple machines. At the same time, instead of tables, it uses data frames. We are given two files, companies and investments. Our job here is to find the total investment amount for each industry sector and sort from highest to lowest. We will first combine both of the data frames, then we will sum investments per industries, and in the end, we will sort everything in descending order. Inner join returns only rows where a match exists in both tables or data frames. We will use inner join because for every investment row, we want the company's name and industry to be right next to it. We group by industry column so that all rows that share the same field are put together. Sum function simply adds up all amount values and finds the total investment per industry. The third step is that we order everything by total investment column in descending order.
Open in a real environment →Medium Window Functions without Partitions
Master global sorting and sequential numbering in PySpark. Learn how to join DataFrames and use the row_number() window function across an entire unpartitioned dataset.
Open in a real environment →Medium Calculating PE Portfolio Values
Master financial data aggregation in PySpark. Learn how to join relational tables, multiply columns to calculate holding values, and group by multiple dimensions to compute daily private equity portfolio totals.
Open in a real environment →Medium Top Products by Revenue
Spark is a framework that processes large amounts of data and covers multiple machines at the same time. We are given two files, products and sales.csv. We are required to aggregate total revenue per product, then rank products within each category using dense rank function, and keep only the top three. The output should include the name of the product, category to which it belongs, revenue from the sale, and rank number. A window function performs a calculation across a set of related rows and keeps every single row in the result. This is the key difference from group by, which is an aggregate function that collapses multiple rows into one outcome per group. A window function instead just adds a new column with the calculation. We'll use group by product ID that will put all sales rows for the same product together. Sum will simply add up all revenue values for each group within revenue column. Partition by category will create invisible walls between each category. We sort everything out in descending order by revenue so that the highest amount will get rank one. Dense rank is a type of window function that assigns rank numbers. The thing about dense rank is that it never creates any gaps. If two products have the same amount of revenue, then both of them get rank one. We need to build sort of a condition using filter that will keep only products with rank three or less.
Open in a real environment →Medium Product Summary
Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We are given three files, and their data frames are called products, sales, and inventory. We need to calculate total quantity sold and total revenue per product from sales. Then we find the total stock number per product across all warehouses. And we need to join everything to products data frame and make sure every product appears, even if it has no sales or inventory records. Any null values should be replaced with zero. We group by product ID so that all sales rows that belong to the same product are put together. Inner join returns only rows where a match exist in both tables or data frames. Left outer join returns everything from the left table, plus matching rows from the right. In our specific case, we'll use left join. It will keep all rows from the left table, which is products, regardless whether there is a match in sales. Coalesce function takes list of values and returns the first one that is not null. Lit function creates a literal value of zero.
Open in a real environment →Medium Parsing Comma-Separated Values
Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We have only one file that is called background_checks.csv. Four of these columns store multiple values in one string separated by comma. We are required to count how many items are there in each of those strings and create four new columns to store these results. Everything should be sorted out in ascending order by ID. We read the CSV file into data frame using given path. When header is set to true, it uses first row of the data frame as column names, and inferSchema automatically detects data types. We use split function that takes a string and cuts it into an array using a separator. We wrap everything inside of size function that will count how many elements are in the array. We sort everything out in ascending order by check ID using order by.
Open in a real environment →Hard Mountain Climber Logs
Spark is a framework that processes large amounts of data across multiple machines at the same time, and instead of tables, it uses data frames. We are given two files, mountain_info and mountain_climbers.csv. Our job here is to find the most recent climber for each mountain. We need to keep in mind that mountains with no climbing records should be excluded from the final result. A window function performs a calculation across a set of related rows and keeps every single row in the result. This is the key difference from group by, which is an aggregate function that collapses multiple rows into one outcome per group. We need to read both of the files into data frames. We set header to true, which means that the first row will include the column names, and inferSchema will automatically detect data types of those columns. Inner join returns only rows where a match exists in both tables or data frames. We want name of the mountain from info_df to be equal to the name of the mountain from climbers_df. Partition by mountain name creates sort of invisible walls between each mountain. We also sort everything in descending order by climb date so that most recent date comes first. Row number is type of window function that assigns a sequential number to each row. Filter method will help us to build a condition to keep only rows where RN is equal to one.
Open in a real environment →Hard Global & Domain SEO Leaders
Master advanced PySpark Window functions by calculating both partition-level and global maximums. Learn how to use empty window partitions and conditional F.when() logic to identify top-performing SEO pages.
Open in a real environment →Hard Tracking Customer Purchase History
Spark is a framework that processes large amounts of data across multiple machines at the same time. Instead of tables, it uses data frames. We have only one file that is called transactions.csv. Its data frame contains four columns, product ID, reference to the customer by ID, quantity, and date of transaction. For each transaction, we need to find what product that particular customer bought most recently before the current one. Then we create a new column that combines the date and that previous product in one string. A window function performs a calculation across a set of related rows and keeps every single row in the result. This is the key difference from group by, which is an aggregate function that collapses multiple rows into one outcome per group. Partition by customer ID will create invisible walls between customers. Lag is a type of window function that will look backwards and reach back to the previous row to grab its product ID value. Concat function joins multiple strings together.
Open in a real environment →
Snowflake (22)
Easy Filter Popular Videos on a Streaming Platform
We will filter popular videos on a streaming platform. Snowflake is a cloud-based data warehouse platform. Here, your query will look just like an SQL. We are given videos table, which contains video ID, title, genre, release year, duration, and view count columns. Our job is to retrieve those videos that have more than one million views and that were released in year 2019 or later. FROM clause, this is one of the things that differ regular SQL from Snowflake. It uses a dynamic reference with the ref function. Instead of naming the table, we are asking the system to find correct version of the table for us. Inside of the double curly braces, we put the ref function with the name of our table. We select all columns using select with a star. Our main goal was to filter the videos, and to do so, we should build sort of a condition inside of the WHERE clause. The first statement is that a number inside of view count column should be higher than one million. The second is that release year should be higher or equal to 2019. Since both of the statements are mandatory, we will use a logical operator.
Open in a real environment →Easy Replace Keywords in Social Media Post Text
Snowflake is a cloud-based data warehouse platform. The syntax, the functions, the logic, everything is almost like in SQL. Our table, social media, contains information about different posts, its ID, text inside of it, date, amount of likes, comments, shares, and the name of the platform. Our job is simple. We go through every post and replace the word Python with PySpark. The only thing that is going to change is that inside of the text column, the word Python is replaced by PySpark. In Snowflake, it uses a dynamic reference with the ref function. Inside of the double curly braces, we put the ref function with the name of our table. Since our goal was to modify the text column, we need to use replace function here. Replace function takes three arguments: the string itself, the old value, and the new value. Replace basically scans through the string, finds the needed word Python, and swaps it with a new value.
Open in a real environment →Easy Filter Movies with Missing Box Office Data
Practice filtering for NULL values in Snowflake SQL with this movie analytics interview question. You query a movies table to find rows where box office collection data is missing using IS NULL. Covers NULL handling, IS NULL, WHERE clause, and data quality checks in Snowflake. An easy-level question common in analytics interviews at companies like DoorDash.
Open in a real environment →Easy Daily Category Sales
SELECT
Open in a real environment →
s.sale_date,
p.category,
SUM(s.quantity_sold) AS total_quantity
FROM {{ ref("sales") }} s
INNER JOIN {{ ref("products") }} p
ON s.product_id = p.product_id
GROUP BY s.sale_date, p.category
ORDER BY s.sale_date ASC, p.category ASCEasy Filter and Uppercase Artifacts
SELECT
Open in a real environment →
ID,
Item,
Period,
UPPER(Material) AS Material,
Quantity
FROM {{ ref("artifacts") }}
WHERE Quantity > 100Easy Merge Customer Records from Two Sources
We will merge customer records from two sources. Snowflake is a cloud-based data warehouse that uses SQL. We are given two tables, customers from West and customers from East. Both of the tables contain the same columns and store information about the customers, such as full name, age, ID, and subscription plan type. Our job is simply to combine these two tables into one single result, even if they contain duplicate rows. FROM clause in Snowflake is a little bit different from SQL. It uses a dynamic reference with the ref function. Inside of the double curly braces, we put the ref function with the name of our table. In SQL and Snowflake, UNION operator is a set operation used to combine the result sets of two or more SELECT statements into a single unified output. If we use UNION separately, it would remove all duplicate rows. So using UNION ALL simply means that we want to keep all rows, even if they are repeated.
Open in a real environment →Easy Filter Funded Startups
We need to write a query to filter funded startups. This is a Snowflake question. Snowflake is a cloud-based data warehouse that simply uses SQL. We are given two tables, investors and startups. Our job is to find investors whose average funding across all their startups is strictly greater than their personal funding limit. Those cases when limit equals to average funding needs to be excluded. The output should include three columns: investor name, its ID, and average funding number. We are more interested in inner join because it returns only the rows where there is a match in both tables. Within funding column from startups table, we implement average and round functions. Since we need to find the average number for each of the investing companies separately, we need to use a GROUP BY clause. We will put a condition inside of HAVING clause that works with groups and filters them. We can't use WHERE clause for the condition because there is a strict rule that WHERE clause can only come before GROUP BY clause, not after.
Open in a real environment →Medium Combine Customer Orders and Products
We need to combine customer orders and products. Snowflake is mainly using the same concepts as in SQL. The only difference is that Snowflake is cloud-based. We are given three tables: customers, orders, and products. Our main goal here is to join all three tables together and return six unique columns. In Snowflake, we have to put the name of the table inside of the ref function, and everything has to be wrapped up in double curly braces. The reason behind is that Snowflake uses data build tool, which requires the ref function. Join connects two tables based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. We give aliases O to orders and C to customers table. Using on, we indicate which common columns were used to connect the tables. We use double pipe operator, or in other words, a concatenation sign, to get the full name.
Open in a real environment →Medium Anonymize User PII Data for a Social Media Platform
We need to write a query to anonymize user PII data for a social media platform. This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all of its concepts. PII stands for Personally Identifiable Information. For each user, we need to extract only the domain part from the email address and anonymize the phone number by hiding the first six digits with asterisks and keeping only the last four. In Snowflake, we put the table inside of the ref function, and this whole thing is wrapped up in double curly braces. Split part function takes three arguments and splits a string into parts based on a delimiter and returns a specific part. Our delimiter will be @ symbol. We will use the right function. Right takes two arguments and extracts a specified number of characters from the right side of a string. Since phone is not a string, we need to use double colon varchar in order to convert number to a string. Using the concatenation sign that combines two strings, we connect them together.
Open in a real environment →Medium Product Sales and Inventory Data
This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given three tables, products, sales, and inventory. All three tables are connected through product ID, since this is a common column. First, we need to calculate total quantity and total revenue per product from sales. Then we need to calculate total stock per product from inventory. And in the end, we need to join everything with products table, making sure every single product appears in the result, even if it has no sales or inventory records. If we try to join everything directly without aggregating first, we would get duplicate rows and completely wrong totals. A CTE, or in other words, a common table expression, is a temporary result set that we define at the top of our query. A left join keeps all rows from the left table, even if there is no matching row in the right table. The missing values just become null. Coalesce function takes a list of values and returns the first one that is not null.
Open in a real environment →Medium Products and Duplicates
Practice a medium difficulty Snowflake SQL interview question tagged JPMorgan that tests deduplication, CTEs, and INNER JOIN. In a manufacturing scenario, you must remove duplicate rows from two tables using SELECT DISTINCT or ROW_NUMBER before joining them on a shared key. This question evaluates your ability to handle data quality issues before performing joins, a skill commonly tested in financial services and data engineering interviews.
Open in a real environment →Medium Mortgage Rate Calculator
WITH aggregated AS (
Open in a real environment →
SELECT
md.mortgage_type,
ROUND(SUM(md.interest_rate) / COUNT(um.user_id), 2) AS rate_of_mortgage
FROM {{ ref("mortgage_details") }} md
INNER JOIN {{ ref("user_mortgages") }} um
ON md.mortgage_id = um.mortgage_id
GROUP BY md.mortgage_type
)
SELECT * FROM aggregatedMedium Weekend Order Detection
This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given two tables, orders and products. Our job here is to parse the order date column from string into a real date format, and then drop any invalid dates. After that, we need to detect whether the order was placed on a weekend or not. If it's from Monday to Friday, then the output is false. If it's Saturday or Sunday, then the output is true. We will use a CTE, in other words, a common table expression, which is a temporary result set that we define at the top of our query. In Snowflake, we use data build, which is a framework that manages and organizes tables. We put it inside of ref function, and this whole thing is wrapped in double curly braces. We are more interested in inner join because it returns only the rows where there is a match in both tables. Within parse date column, we implement try to date function. This function converts a string into a real date type. If the string is invalid, it simply returns a null value. Day of week is a Snowflake function that takes a date and returns a number that represents the day of the week. Sunday is assigned with zero, Saturday is assigned with six. We check if the day of the week result is either zero or six, meaning Sunday or Saturday.
Open in a real environment →Medium Flooring Company Data
This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its concepts. We are given three tables: customers, orders, and products. Full name in the customers table stores first and last name together in one column, separated by space. We need to split them into two columns: first name and last name. Here we also have to split product info column into two separate ones. In Snowflake, we use data build tool, which is a framework that manages and organizes tables. We use the ref function wrapped in double curly braces, and then ref function finds the correct table in the Snowflake environment automatically. Join connects two tables based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. To split full name and product info columns, we will use split part function. This function takes three arguments, splits the string, and returns a specific part. Second argument is delimiter. It's basically the character where the cut happens.
Open in a real environment →Medium Assign Row Numbers to Authors per Paper
We will assign row numbers to authors for each paper. This is a Snowflake question, which is a cloud-based data warehouse that uses all main concepts from SQL. We have two tables, research_papers and authors. Our job here is to join these two tables together and assign a sequential row number to each author per paper. The numbering must reset for every new paper so that the first author of each paper always gets row number one. In Snowflake, we put the table inside of ref function, and this whole thing is wrapped up in double curly braces. We are more interested in inner join because it returns only the rows where there is a match in both tables. Between these two tables, we implement inner join operator, and using on keyword, we indicate which columns are used to combine the tables. Row number assigns a sequential number to each row, starting from one. Over keyword simply activates the window function. Partition by divides all the rows into separate groups based on a column. In our case, it's paper_id. Row number will reset back to one at the start of each new group. And then we sort everything out in ascending order by author_id.
Open in a real environment →Medium Amusement Park Rating Anomalies
We are given two tables, rides and visitors. Our job here is to find rides whose average rating is unusually high or unusually low compared to all other rides. We measure this using standard deviation. Standard deviation is a statistical measure that checks how spread out a set of values is from their average. Our query will consist of three parts. First, we will calculate the average rating per ride. Second, we will calculate global mean and standard deviation across all ride averages. And third, we will use those stats to flag anomalies. A CTE, or common table expression, is a temporary result set that we define at the top of our query. For our task, we are more interested in inner join because it returns only the rows where there is a match in both tables. A window function, unlike group by that collapses all rows into one, performs a calculation across multiple rows and keeps every single row in the result. Stddev is a built-in SQL and Snowflake function that calculates standard deviation. We wrap it inside of ABS, which stands for absolute value. Then we compare if that distance is greater than global standard deviation. If the distance is greater than one standard deviation, the ride is considered too far from the norm and gets true.
Open in a real environment →Medium Usage and Accuracy per Model Type
This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We have two tables, models and usage logs. Both of the tables are connected through a model ID, which is a common column. The requirements are calculating the total users by summing all the usage log entries, and at the same time, to find the average accuracy for each model. We will store the results in sort of a temporary table that is called CTE. CTE is short for Common Table Expression. It's a temporary result set that we define at the top of our query inside of WITH clause. From usage logs table, we first select model ID, and then we build a new column called total users and implement sum function within users column. Grouping by model ID would collapse all rows sharing the same ID into one single group. For this last column, we will implement window function with average. Partition by model type will divide all rows into separate groups based on their model type. We implement the round function and wrap the window function inside of it.
Open in a real environment →Medium Find the Last Climber per Mountain
We need to find the last climber per mountain. This is a Snowflake question, which is a cloud-based data warehouse that uses SQL. We have two tables, mountains and climbers. Both of the tables are connected through mountain name. Our job here is to find the most recent climber for each mountain. We only want mountains that have at least one climbing record. We need to rank all climbers per mountain by date from most recent to oldest, and then pick the climber with rank one. In Snowflake, we use database tool, which organizes and manages the tables. We put the table inside of ref function wrapped in double curly braces. We are more interested in inner join because it returns only the rows where there is a match in both tables. Partition by mountain name would divide rows into separate groups, and each group would indicate one mountain. Then row number would reset to one at the start of each new mountain group. We sort everything out in descending order by climb date so that the climber with the most recent date gets row number one.
Open in a real environment →Hard Rank Top Products by Revenue per Category
We need to rank top products by revenue for each category. This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given two tables: products and sales. First, we need to calculate how much revenue each product generated. Then we rank them within their category from highest to lowest, and in the end, we keep only the top three per category. A CTE, or in other words, common table expression, is a temporary result set that we define at the top of our query. In Snowflake, we use database tool, which organizes and manages tables. We put the name of the table inside of ref function, and this whole thing is wrapped in double curly braces. We need to combine them using JOIN. We are more interested in inner join because it returns only the rows where there is a match in both tables. Dense rank is a function that assigns rank numbers to rows. The reason why we use dense rank instead of regular rank is that dense rank never creates any gaps. Partition by category creates sort of invisible walls between each category. We put a condition inside the WHERE clause that rank has to be equal or lower than three.
Open in a real environment →Hard Highest SEO Score Pages per Domain
Practice a hard Snowflake SQL interview question tagged Cisco that tests window functions, aggregation, and conditional logic. Working with an SEO analytics dataset, you must find the highest scoring page per domain and identify the overall best page across all domains using CASE expressions and window functions. This question evaluates your ability to combine per-group and global rankings in a single query.
Open in a real environment →Hard Math Expressions
Practice a Snowflake SQL interview question tagged IBM that tests regular expressions and pattern matching with RLIKE or REGEXP. In a data validation scenario, you filter user-submitted text to keep only valid arithmetic expressions containing digits and operators. This hard difficulty question covers regex patterns in Snowflake, RLIKE for row filtering, and input validation logic commonly asked in data quality and engineering interviews.
Open in a real environment →Hard Track Product Purchases
This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given only one table that is called Transactions. Each row represents one purchase made by a customer. First, for each transaction, we need to find what product the customer bought most recently before the current one. Second, we need to create a new column, combine the date and that previous product into one string, and use the word none when there is no previous product. A CTE, or Common Table Expression, is a temporary result set that we define at the top of our query. Lag specifically within product ID row looks backwards, reaches to the previous row, and grabs its value. Partition by creates sort of a group for each customer. We wrap previous product in COALESCE function. COALESCE takes a list of values and returns the first one that is not null.
Open in a real environment →
Python (19)
Easy CSV Row Filter and Count
We have one customers.csv file. Each row represents one customer and has a status field that can be active or something else. We are required to read the file, then count how many customers have the active status, and write that number to a text file. First of all, we import the CSV module that is designed for reading and writing CSV files. We define a new counter variable that we will call active_count, and we set it to zero. Using Python's built-in open function, we will enter the customers CSV file. It takes two arguments, the file path and r that stands for read mode. Then we create a DictReader from the open file that will read the first row as column header, and then every row as Python dictionary. Using for loop, we go through every row in the CSV file and check whether the value in the status column is equal to active or not.
Open in a real environment →Easy Analyze Sales Dataset Dimensions and Calculate Total Revenue
We will analyze sales dataset dimensions and calculate total revenue. Pandas is a library that was specifically designed for data analysis and manipulation. We are given one CSV file that is called Sales Data. Our job is, first of all, to analyze its size, classify it as small, medium, or large based on total amount of sales, and then calculate total revenue. Everything should be saved as JSON report. We need to read it into a data frame. Data frame is simply a table in memory with rows and columns. For rows and columns, we will implement the len function that simply returns the number of items in any list or collection. In order to find the total number of sales, we need to multiply rows by columns. Small one is considered the one with less than 10,000 sales, then medium in range between 10,000 and 99,999, and large, everything that exceeds 100,000 sales. In order to find the revenue, we need to multiply the quantity by price. We will use the sum function to add up all revenue amounts. Everything is wrapped inside of round function.
Open in a real environment →Easy Sort Avro Employee Records by Salary
We'll need to sort Avro employee records by salary. Avro stores data in a compressed binary format, which makes it much more efficient for large data sets. The schema travels with the data within the same file and never gets separated. An Avro file is made up of a header that is followed by multiple blocks. We are given one file that is called employees.avro. Our goal here is to read the given file, sort the records by salary in descending order, and save the result as JSON file. We import two things, Reader function from FastAvro library that will help us to read the file, and then JSON. To read the file, we will use open function that takes two arguments. First is file path, and second is mode. In our case, it is rb, that stands for read binary mode. We create the Avro reader from the open file using the Reader function. We need to sort that list by salary using sorted function. As a second argument, we implement a small lambda function that for each X employee returns X salary. In third argument, we set reverse to True, which will sort everything out in descending order.
Open in a real environment →Easy Count User Events from JSON Activity Logs
We will count user events from JSON activity logs. We are given one JSON file that is called activity_logs. JSON stands for JavaScript Object Notation, and it is a text-based format that stores structured data. Our job here is to count how many events each user performed and save the results as a new JSON file. First thing that has to be done is importing the JSON module, which is a built-in Python library for reading and writing JSON files. We will use the open function. It takes two arguments. First is the path to the file, and second is the mode we need. Load will read the entire JSON file and convert it into a Python object. A dictionary in Python is a data structure that stores information as key value pairs. We need to use for loop to go through every log entry. Dump function converts a Python object back to a JSON file. To run the script, we type python3 and path to the file.
Open in a real environment →Easy Split Delimited Column into Separate Columns with Pandas
We will split the delimited column into separate columns using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We have one CSV file that is called Data, which contains two columns: ID and Full Info. The Full Info column stores multiple pieces of information in one single string, separated by commas. We need to split each string and turn all these pieces into independent columns. We import Pandas library and alias it as pd. We need to read it into DataFrame, which is basically a table in memory with columns and rows. Split in Pandas takes one combined column and expands it into multiple separate columns. It does the splitting whenever it sees the delimiter. Str in Pandas is a special accessor that lets us apply the string operation to every value. We implement the split function. First argument is comma, our delimiter. After setting expand to true, each split piece will get into separate column. Drop function removes a column or row from the DataFrame, depending on how we set the axis. Finally, we save the result to a new CSV file called Split Data.
Open in a real environment →Easy Compare SQLite Database and CSV File Records
We will compare database and CSV file records using SQLite. We have two data sources. The first one is a CSV file. The second source is a SQLite database. SQLite is a lightweight database that stores everything in a single file. We can run it directly inside of our Python program. We are required to find the customer IDs that are out of sync and save the comparison report as a JSON file. We import three things, SQLite 3 library, that will help us to connect and make queries with our database. Then we import CSV to read those files row by row, and JSON to save the comparison report. A set in Python is basically a data structure that stores unique values. Using connect function, we will open the database file. We create a cursor, which is a tool that executes SQL statements. Fetchall function will retrieve all the results as a list of tuples. We will use the concept of set subtraction, and in order to find the extra IDs in the SQL file, we will have to subtract the CSV IDs from the database. Dump function will convert the dictionary to a JSON format.
Open in a real environment →Easy Analyze DataFrame Memory Usage
We'll analyze DataFrame memory usage using Pandas. We are given one CSV file that is called Sales Data. Our job here is to calculate how much memory each column from the initial file uses and save that report to a new CSV file that will be called Memory Report. The first thing that we do is importing Pandas and alias it as pd. Then we need to read the initial CSV file into a Pandas DataFrame that will be called df. We will implement the memory_usage method that will calculate how many bytes of RAM each column is using. We will set deep to true, and in the calculations, the actual content will be considered. Our current result is in bytes, and we will need to convert it to megabytes. We will need to divide it, first of all by 1,024, and by that, we get the result in kilobytes. Then we divide by 1,024 again to convert it to megabytes. The last thing that we have to do is saving our report to CSV file.
Open in a real environment →Easy Calculate Descriptive Statistics for Numeric Columns in Pandas
We need to calculate descriptive statistics for numeric columns in Pandas. We are given one CSV file that is called Sales Data. It contains e-commerce transaction information with both numeric and non-numeric columns. Descriptive statistics are numbers that summarize a data set. Mean is an average value of all numbers, median is the middle value when sorted from low to high, STD is mathematical standard deviation, min for smallest value, max for largest value, and then the percentages. Our main goal is to calculate these eight descriptive statistics for all numeric columns and save the result as a CSV report with all values rounded to two decimal places. In order to work with the CSV file, it should be read into a data frame. Select_dtype is a Pandas method that filters columns by their data type. Include number means that we only want to keep columns that contains integers and floats. We save the report as a CSV file using to_csv function, and we set index to false to prevent Pandas from adding an extra row number column.
Open in a real environment →Easy Parse JSON Log Files and Extract Fields to CSV
We need to parse JSON log files and extract fields to CSV. A log file records events that happen in application over some time. Our log file stores each entry in JSON, which is a text-based format that stores structured data. Our job here is to read each line, then extract five specific fields that are timestamp, level, user and request ID, and message. We save the results in a CSV file. We need to import two built-in Python libraries. First of all, JSON that is used for reading JSON formatted text, and then CSV for writing CSV files. We need to open and read the log file. For this reason, we will use the open function that takes two arguments. First is the path to the file, and second is the mode that we want to use. Using for loop, we go through each line, and in return, we get one line as a string. When we use load as function, it will convert the JSON string into a Python object. Here we will implement the get method. Get looks up for a key in the dictionary and returns its value. But if the key doesn't exist, it returns a default value that we define as a second argument. For request ID, it is basically nested inside of another object, so we will implement the get method twice here. DictWriter will convert the dictionary to CSV file. Write header method puts the first row as column names.
Open in a real environment →Easy Extract Schema Information from Parquet File Using PyArrow
We'll extract schema information from Parquet file using PyArrow. PyArrow is a Python library that works with columnar data formats, especially with Parquet files. It gives more access to files' metadata, things like schema, row groups, file statistics, and so on. Parquet stores data column by column instead of doing that row by row. It will look at the data stored in each group, analyze the minimum and maximum values for each column, and skip the entire row groups that cannot possibly contain matching data. Here we are required to extract the metadata about the file, column names, data types, amount of rows, size of the file, and type of compression. The result should be saved as JSON report. We will import pyarrow.parquet, that will be aliased as pq. Then we use schema_arrow function to get the full picture of what columns exist and what data type each column has. Num_rows function gives us the total amount of rows in the file. Compression function will return the name of codec, like Snappy or Gzip.
Open in a real environment →Easy Select Specific Columns from Parquet File
We will select specific columns from Parquet file using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. Parquet stores data column by column instead of doing that row by row. It is good when we want to read only some specific columns and skip everything else. It will look at the data stored in each group, analyze the minimum and maximum values for each column, and skip the entire row groups that cannot possibly contain matching data. We are required to return only five specific ones: ID, first and last name, email, and total amount of purchases. After selecting all five columns, everything needs to be saved as a new Parquet file called selected data. The first thing that we do is that we import the Pandas library and alias it as pd. Then we need to read the Parquet file into a data frame. In Pandas, when we want to select multiple columns, we pass a list of column names inside of double square brackets. Since Pandas automatically assigns row numbers to every data frame, we set index to false, which means that we do not include the data frame row numbers in the output.
Open in a real environment →Easy Merge Customer and Purchase Data Using Pandas
We will merge customer and purchase data using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given two CSV files that are called customers and purchases. We are required to combine both of the initial files using left join and save the result to a new combined data CSV file. The first thing to do is that we import Pandas library and alias it as pd. In order to work with the files, we first have to read them into data frames. Merging Pandas connects two data frames based on a common column or index. It works the same way as JOIN in SQL. Left join returns all rows from the left data frame and the matched rows from the right data frame. If there is no match, then all those fields will contain null values. We use the merge method to connect customers to purchases. On indicates a column that was common for both of the data frames. We set index to false, which prevents Pandas from adding an extra row number column to the output.
Open in a real environment →Easy Convert Excel Files with Multiple Sheets to Individual CSV Files
We will convert Excel files with multiple sheets to individual CSV files using Pandas. We have a directory that contains multiple Excel files with many sheets inside of it. We are required to convert every single sheet from Excel file into its own separate CSV file. The naming convention for each output is filename_sheetname.csv. We import Pandas library that we alias as PD, then OS library that interacts with the operating system. We use it to list all files in a directory and build a complete file path. And lastly, the Pathlib library that works with file paths. We define two path variables: excel_dir, the folder where our Excel files are stored, and output_dir, where we save the converted CSV files. We open the Excel file using the read_excel function, where as a first argument, we define the full path, and we set sheet name to None so that Pandas will read all sheets at once. We save the current sheet as a CSV file using to_csv function, and then we set index to false so that Pandas won't add an extra row number column to the output.
Open in a real environment →Medium Time-Series Rolling Window Analysis for Multi-Stock Price Data
Here we will be working with Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We have one CSV file that is called Stock Data. It contains daily price information within multiple companies. Each row has a date, ticker, and closing price. Our job here is to calculate rolling statistics, mean, sum, and standard deviation for the period of 7, 14, and 30 days separately for each ticker. In the output, we should have three original columns, plus nine new columns that we'll build. Rolling method in Pandas provides window calculation and allows us to perform aggregate operations. In order to start doing any operations on the file, we need to read it into DataFrame. Using for loop, we go through three window sizes, 7, 14, and 30 days. Then we group by ticker so that all the calculations run separately for each of them. We implement the rolling method and set that window equals to window. Mean will calculate the average of all values in each window. We need to save that DataFrame to a new CSV file using to_csv function.
Open in a real environment →Medium Decompose Time-Series Data into Trend, Seasonal, and Residual Components
We will need to decompose time series data into trend, seasonal, and residual components. We will be working mainly with Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given only one CSV file that is called Temperature Data. It contains daily statistics over multiple years. First one is trend, which is a long-term direction of the data that indicates whether temperature generally increases over the years or not. Seasonal is responsible for repeating pattern that happens on a regular cycle. Residual is everything that is left after removing the trend and seasonal components, which is mostly some unpredictable day-to-day variations. Each component should be saved into its own CSV file. We were required to use Statsmodels, which is a Python library for statistical analysis. We import this specific function, which is called seasonal decompose, that will perform our decomposition. Then we specify the model by setting it to additive. We define one full cycle as 365 days because this is an annual pattern for temperature.
Open in a real environment →Medium Flatten Nested Struct Columns in Parquet and Export to CSV
We'll need to flatten nested struct columns in Parquet and export them to CSV using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given one Parquet file that contains customer data, but some columns are nested, which means that instead of a simple value like a string or number, they contain a whole object with multiple fields. Parquet stores data column by column instead of doing that row by row. Our job here is to extract the fields from the nested columns and turn them into independent ones. In order to start working with Parquet file, we need to read it into data frame that is called DF. JSONNormalize function in Pandas will take each dictionary from address column and expand it into separate columns. Next step is renaming the columns by adding address_ as a prefix. Using drop function, we will remove the initial nested columns. Now we will combine the results using concat function. We will save the result to a CSV file. For this reason, we will use to_csv function.
Open in a real environment →Hard Aggregate Time-Series Data into Fixed Time Windows
We'll need to aggregate time series data into fixed time windows using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We have one CSV file that is called sensor data. It contains temperature and humidity readings from a sensor collected at regular intervals. Our job is to group all readings into fixed 15-minute windows and calculate mean, minimum, and maximum values for both temperature and humidity within each group. Time window aggregation solves this by dividing time into equal fixed buckets. All records that fall within the same interval get grouped together, and we calculate statistics for that group. We read the CSV file into DataFrame called df. We convert it to a proper Pandas date-time type. We set the timestamp column as the DataFrame index. We start doing the aggregation with resample method. It divides time series data into fixed time windows. Aggregate function performs multiple operations at once on each group. We save the final aggregated DataFrame using to_csv.
Open in a real environment →Hard Interpolate Missing Values in Irregular Time-Series Sensor Data
We will interpolate missing values in irregular time series sensor data using Pandas. We are given one CSV file that is called sensor_data. It contains temperature and humidity readings from IoT collected over time. Some of the readings are missing because of network issues or sensor failures, so those cells are empty. Our job is to fill in those missing values using interpolation and save the clean data. Interpolation is a way of estimating missing values based on the records around them. In our case, we will use the time-based interpolation that also accounts the irregular time intervals. Before the interpolation itself, we need to convert the timestamp column from a plain string into a proper date-time type. Then we set the timestamp column as the index of our DataFrame. It is very important specifically for time-based interpolation because it requires the date-time to be indexed. We reference to the temperature column and implement the interpolate method. We set method to time to use the time-based interpolation. We were required to round the interpolated values to one decimal place. Finally, we save the result as a CSV file using to_csv function.
Open in a real environment →Hard Remove Seasonal Effects from Time-Series Sales Data
We'll remove seasonal effects from time series sales data using Pandas and Statsmodels. We are given one CSV file, sales data, that contains transactions over multiple years. The sales indicate the current seasonal pattern. We are required to remove those seasonal effects so that we can see if the company is actually growing or it is just a result of seasonal factors. Deseasonalization simply removes that repeating pattern so that we can see the real trend. We import two things, Pandas library, and seasonal decompose function from Statsmodels. Seasonal decompose is the function that splits our time and date into trend, seasonal, and residual components. We specify the decomposition model by setting it to additive, which simply means that the components will add up together. Since our data is monthly and seasons repeat every year, the period will be set to 12 months. In order to calculate the deseasonalized value, we need to sum the trend values with residual values.
Open in a real environment →
More questions (11)
Easy Recursive Keyword Finder
We have multiple applications that write their logs under var slash log. An issue with this approach is that since multiple logs file are constantly being written to this folder, checking them one by one would be slow and inefficient. We want to use this file to consolidate all the lines that contain errors and write them here. We can use grep and we'll use recursive flag to get error lines in var slash log folder, and it will include only files that are in dot log extension. What we left to do is send this into the consolidated error logs file.
Open in a real environment →Easy Docker Multi-Architecture Image
When we build our container from this file, it builds it in architecture of our host system. Our task is to change the setup to build it in multiple architectures. We'll use Docker build with an instance named Multi Arc and we will use buildx create. We'll list current builders by typing Docker buildx ls. We have only default that builds it in our current platform's underlying os. So add new builder. Docker buildx create and then name multi arc. Then we use driver network host and then use to have it as default builder. Now attempt to build for our multi architecture setup. We'll run Docker buildx build. Verify this by typing Docker images.
Open in a real environment →Medium Debug SSH Lockout
We have developer account dev that has been locked out of the server and security logs indicate that there were too many failed SSH identification attempts. We need to check logs and count exactly how many attempts this user had today. And once we have this number, we need to update the configuration to increase this allowed login attempts above this number. We'll need to check log for login attempts. We'll need to use sudo to view this as admin. We can use grep command to filter out the lines that we need. We can also count this with word count hyphen L, which will print us number of lines. We need to change SSHD daemon configuration file, which is located under etc ssh and then sshd config. Find something that says Max. Finally, restart sshd daemon using systemctl restart.
Open in a real environment →Medium Min Stack
class MinStack:
def init(self):
self.stack = []
self.min_stack = []
Open in a real environment →def push(self, val: int) -> None: self.stack.append(val) if self.min_stack: self.min_stack.append(min(val, self.min_stack[-1])) else: self.min_stack.append(val) def pop(self) -> None: self.stack.pop() self.min_stack.pop() def top(self) -> int: return self.stack[-1] def getMin(self) -> int: return self.min_stack[-1]Medium LRU Cache
class Node:
def init(self, key=0, val=0):
self.key = key
self.val = val
self.prev = None
self.next = Noneclass LRUCache:
def init(self, capacity: int):
self.cap = capacity
self.cache = {}
Open in a real environment →self.left = Node() self.right = Node() self.left.next = self.right self.right.prev = self.left def remove(self, node): prev_node = node.prev next_node = node.next prev_node.next = next_node next_node.prev = prev_node def insert(self, node): prev_mru = self.right.prev prev_mru.next = node self.right.prev = node node.prev = prev_mru node.next = self.right def get(self, key: int) -> int: if key in self.cache: self.remove(self.cache[key]) self.insert(self.cache[key]) return self.cache[key].val return -1 def put(self, key: int, value: int) -> None: if key in self.cache: self.remove(self.cache[key]) self.cache[key] = Node(key, value) self.insert(self.cache[key]) if len(self.cache) > self.cap: lru = self.left.next self.remove(lru) del self.cache[lru.key]Medium Implement Trie (Prefix Tree)
class TrieNode:
def init(self):
self.children = {}
self.is_end_of_word = Falseclass Trie:
def init(self):
self.root = TrieNode()
Open in a real environment →def insert(self, word: str) -> None: curr = self.root for char in word: if char not in curr.children: curr.children[char] = TrieNode() curr = curr.children[char] curr.is_end_of_word = True def search(self, word: str) -> bool: curr = self.root for char in word: if char not in curr.children: return False curr = curr.children[char] return curr.is_end_of_word def startsWith(self, prefix: str) -> bool: curr = self.root for char in prefix: if char not in curr.children: return False curr = curr.children[char] return TrueMedium Design Add and Search Words Data Structure
class TrieNode:
def init(self):
self.children = {}
self.is_end = Falseclass WordDictionary:
def init(self):
self.root = TrieNode()
Open in a real environment →def addWord(self, word: str) -> None: curr = self.root for char in word: if char not in curr.children: curr.children[char] = TrieNode() curr = curr.children[char] curr.is_end = True def search(self, word: str) -> bool: def dfs(j, root): curr = root for i in range(j, len(word)): char = word[i] if char == '.': for child in curr.children.values(): if dfs(i + 1, child): return True return False else: if char not in curr.children: return False curr = curr.children[char] return curr.is_end return dfs(0, self.root)Medium Design Twitter
class Twitter:
def init(self):
self.count = 0
self.tweetMap = {}
self.followMap = {}
Open in a real environment →def postTweet(self, userId: int, tweetId: int) -> None: if userId not in self.tweetMap: self.tweetMap[userId] = [] self.tweetMap[userId].append([self.count, tweetId]) self.count -= 1 def getNewsFeed(self, userId: int) -> list[int]: res = [] minHeap = [] if userId not in self.followMap: self.followMap[userId] = set() self.followMap[userId].add(userId) for followeeId in self.followMap[userId]: if followeeId in self.tweetMap and len(self.tweetMap[followeeId]) > 0: index = len(self.tweetMap[followeeId]) - 1 count, tweetId = self.tweetMap[followeeId][index] heapq.heappush(minHeap, [count, tweetId, followeeId, index - 1]) while minHeap and len(res) < 10: count, tweetId, followeeId, index = heapq.heappop(minHeap) res.append(tweetId) if index >= 0: next_count, next_tweetId = self.tweetMap[followeeId][index] heapq.heappush(minHeap, [next_count, next_tweetId, followeeId, index - 1]) self.followMap[userId].remove(userId) return res def follow(self, followerId: int, followeeId: int) -> None: if followerId not in self.followMap: self.followMap[followerId] = set() self.followMap[followerId].add(followeeId) def unfollow(self, followerId: int, followeeId: int) -> None: if followerId in self.followMap and followeeId in self.followMap[followerId]: self.followMap[followerId].remove(followeeId)Medium Subarray Sum Equals K
def subarray_sum(nums: list[int], k: int) -> int:
count = 0
prefix_sum = 0
prefix_map = {0: 1}
Open in a real environment →for num in nums: prefix_sum += num diff = prefix_sum - k if diff in prefix_map: count += prefix_map[diff] prefix_map[prefix_sum] = prefix_map.get(prefix_sum, 0) + 1 return countMedium Continuous Subarray Sum
def check_subarray_sum(nums: list[int], k: int) -> bool:
remainder_map = {0: -1}
prefix_sum = 0
Open in a real environment →for i in range(len(nums)): prefix_sum += nums[i] remainder = prefix_sum % k if remainder in remainder_map: if i - remainder_map[remainder] >= 2: return True else: remainder_map[remainder] = i return FalseHard Sliding Window Median
def median_sliding_window(nums: list[int], k: int) -> list[float]:
small = []
large = []
lazy = {}
Open in a real environment →for i in range(k): heapq.heappush(small, -nums[i]) for i in range(k // 2): heapq.heappush(large, -heapq.heappop(small)) def get_median(): if k % 2 == 1: return float(-small[0]) return (-small[0] + large[0]) / 2.0 res = [get_median()] for i in range(k, len(nums)): out_num = nums[i - k] in_num = nums[i] lazy[out_num] = lazy.get(out_num, 0) + 1 balance = 0 if out_num <= -small[0]: balance -= 1 else: balance += 1 if small and in_num <= -small[0]: balance += 1 heapq.heappush(small, -in_num) else: balance -= 1 heapq.heappush(large, in_num) if balance < 0: heapq.heappush(small, -heapq.heappop(large)) elif balance > 0: heapq.heappush(large, -heapq.heappop(small)) while small and lazy.get(-small[0], 0) > 0: lazy[-small[0]] -= 1 heapq.heappop(small) while large and lazy.get(large[0], 0) > 0: lazy[large[0]] -= 1 heapq.heappop(large) res.append(get_median()) return res
Bonus topics to round out your prep
DSA (69)
Easy Contains Duplicate
def contains_duplicate(nums: list[int]) -> bool:
Open in a real environment →
seen = set()
for n in nums:
if n in seen:
return True
seen.add(n)
return FalseEasy Valid Anagram
def is_anagram(s: str, t: str) -> bool:
if len(s) != len(t):
return False
Open in a real environment →countS, countT = {}, {} for i in range(len(s)): countS[s[i]] = countS.get(s[i], 0) + 1 countT[t[i]] = countT.get(t[i], 0) + 1 return countS == countTEasy Two Sum
def two_sum(nums: list[int], target: int) -> list[int]:
prev_map = {} # value -> index
Open in a real environment →for i, n in enumerate(nums): diff = target - n if diff in prev_map: return [prev_map[diff], i] prev_map[n] = i return []Easy Valid Palindrome
def is_palindrome(s: str) -> bool:
l, r = 0, len(s) - 1
Open in a real environment →while l < r: if not s[l].isalnum(): l += 1 continue if not s[r].isalnum(): r -= 1 continue if s[l].lower() != s[r].lower(): return False l += 1 r -= 1 return TrueEasy Valid Parentheses
def is_valid(s: str) -> bool:
stack = []
close_to_open = {")": "(", "]": "[", "}": "{"}
Open in a real environment →for c in s: if c in close_to_open: if stack and stack[-1] == close_to_open[c]: stack.pop() else: return False else: stack.append(c) return True if not stack else FalseEasy Binary Search
def search(nums: list[int], target: int) -> int:
l, r = 0, len(nums) - 1
Open in a real environment →while l <= r: m = (l + r) // 2 if nums[m] > target: r = m - 1 elif nums[m] < target: l = m + 1 else: return m return -1Easy Merge Two Sorted Lists
Definition for singly-linked list.
class ListNode:
def init(self, val=0, next=None):
self.val = val
self.next = next
def merge_two_lists(list1: Optional[ListNode], list2: Optional[ListNode]) -> Optional[ListNode]:
dummy = ListNode()
tail = dummy
Open in a real environment →while list1 and list2: if list1.val < list2.val: tail.next = list1 list1 = list1.next else: tail.next = list2 list2 = list2.next tail = tail.next if list1: tail.next = list1 elif list2: tail.next = list2 return dummy.nextEasy Invert Binary Tree
Definition for a binary tree node.
class TreeNode:
def init(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
def invert_tree(root: Optional[TreeNode]) -> Optional[TreeNode]:
if not root:
return None
Open in a real environment →# Swap root.left, root.right = root.right, root.left # Recurse invert_tree(root.left) invert_tree(root.right) return rootEasy Maximum Depth of Binary Tree
Definition for a binary tree node.
class TreeNode:
def init(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
def max_depth(root: Optional[TreeNode]) -> int:
if not root:
return 0
Open in a real environment →return 1 + max(max_depth(root.left), max_depth(root.right))Easy Diameter of Binary Tree
Definition for a binary tree node.
class TreeNode:
def init(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
def diameter_of_binary_tree(root: Optional[TreeNode]) -> int:
diameter = 0
Open in a real environment →def height(node): nonlocal diameter if not node: return 0 left_h = height(node.left) right_h = height(node.right) diameter = max(diameter, left_h + right_h) return 1 + max(left_h, right_h) height(root) return diameterEasy Balanced Binary Tree
Definition for a binary tree node.
class TreeNode:
def init(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
def is_balanced(root: Optional[TreeNode]) -> bool:
def dfs(node):
if not node:
return 0
Open in a real environment →left = dfs(node.left) if left == -1: return -1 right = dfs(node.right) if right == -1: return -1 if abs(left - right) > 1: return -1 return 1 + max(left, right) return dfs(root) != -1Easy Same Tree
Definition for a binary tree node.
class TreeNode:
def init(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
def is_same_tree(p: Optional[TreeNode], q: Optional[TreeNode]) -> bool:
if not p and not q:
return True
Open in a real environment →if not p or not q or p.val != q.val: return False return is_same_tree(p.left, q.left) and is_same_tree(p.right, q.right)Easy Subtree of Another Tree
Definition for a binary tree node.
class TreeNode:
def init(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
def is_subtree(root: Optional[TreeNode], subRoot: Optional[TreeNode]) -> bool:
if not subRoot:
return True
if not root:
return Falseif is_same_tree(root, subRoot): return True return is_subtree(root.left, subRoot) or is_subtree(root.right, subRoot)def is_same_tree(p: Optional[TreeNode], q: Optional[TreeNode]) -> bool:
Open in a real environment →
if not p and not q:
return True
if not p or not q or p.val != q.val:
return False
return is_same_tree(p.left, q.left) and is_same_tree(p.right, q.right)Easy Linked List Cycle
Definition for singly-linked list.
class ListNode:
def init(self, x):
self.val = x
self.next = None
def has_cycle(head: Optional[ListNode]) -> bool:
slow, fast = head, head
Open in a real environment →while fast and fast.next: slow = slow.next fast = fast.next.next if slow == fast: return True return FalseEasy Reverse Linked List
Definition for singly-linked list.
class ListNode:
def init(self, val=0, next=None):
self.val = val
self.next = next
def reverse_list(head: Optional[ListNode]) -> Optional[ListNode]:
prev = None
curr = head
Open in a real environment →while curr: nxt = curr.next curr.next = prev prev = curr curr = nxt return prevEasy Moving Average from Data Stream
from collections import deque
class MovingAverage:
def init(self, size):
self.size = size
self.queue = deque()
self.window_sum = 0
Open in a real environment →def next(self, val): self.queue.append(val) self.window_sum += val if len(self.queue) > self.size: self.window_sum -= self.queue.popleft() return self.window_sum / len(self.queue)Medium Group Anagrams
def group_anagrams(strs: list[str]) -> list[list[str]]:
Open in a real environment →
anagram_map = {}
for s in strs:
count = [0] * 26
for c in s:
count[ord(c) - ord('a')] += 1
signature = tuple(count)
if signature not in anagram_map:
anagram_map[signature] = []
anagram_map[signature].append(s)
return list(anagram_map.values())Medium Top K Frequent Elements
def top_k_frequent(nums: list[int], k: int) -> list[int]:
count = {}
freq = [[] for i in range(len(nums) + 1)]
Open in a real environment →for n in nums: count[n] = 1 + count.get(n, 0) for n, c in count.items(): freq[c].append(n) res = [] for i in range(len(freq) - 1, 0, -1): for n in freq[i]: res.append(n) if len(res) == k: return res return resMedium Product of Array Except Self
def product_except_self(nums: list[int]) -> list[int]:
res = [1] * len(nums)
Open in a real environment →# Calculate prefix products prefix = 1 for i in range(len(nums)): res[i] = prefix prefix *= nums[i] # Calculate suffix products and combine postfix = 1 for i in range(len(nums) - 1, -1, -1): res[i] *= postfix postfix *= nums[i] return resMedium Longest Consecutive Sequence
def longest_consecutive(nums: list[int]) -> int:
num_set = set(nums)
longest = 0
Open in a real environment →for n in num_set: # Only check for the start of a sequence if (n - 1) not in num_set: length = 1 while (n + length) in num_set: length += 1 longest = max(length, longest) return longestMedium Two Sum II - Input Array Is Sorted
def two_sum(numbers: list[int], target: int) -> list[int]:
l, r = 0, len(numbers) - 1
Open in a real environment →while l < r: cur_sum = numbers[l] + numbers[r] if cur_sum > target: r -= 1 elif cur_sum < target: l += 1 else: return [l + 1, r + 1] return []Medium Three Sum
def three_sum(nums: list[int]) -> list[list[int]]:
res = []
nums.sort()
Open in a real environment →for i, a in enumerate(nums): if i > 0 and a == nums[i - 1]: continue l, r = i + 1, len(nums) - 1 while l < r: three_sum_val = a + nums[l] + nums[r] if three_sum_val > 0: r -= 1 elif three_sum_val < 0: l += 1 else: res.append([a, nums[l], nums[r]]) l += 1 while l < r and nums[l] == nums[l - 1]: l += 1 return resMedium Container With Most Water
def max_area(height: list[int]) -> int:
l, r = 0, len(height) - 1
res = 0
Open in a real environment →while l < r: area = (r - l) * min(height[l], height[r]) res = max(res, area) if height[l] < height[r]: l += 1 else: r -= 1 return resMedium Longest Substring Without Repeating Characters
def length_of_longest_substring(s: str) -> int:
char_set = set()
l = 0
res = 0
Open in a real environment →for r in range(len(s)): while s[r] in char_set: char_set.remove(s[l]) l += 1 char_set.add(s[r]) res = max(res, r - l + 1) return resMedium Longest Repeating Character Replacement
def character_replacement(s: str, k: int) -> int:
count = {}
res = 0
l = 0
max_f = 0
Open in a real environment →for r in range(len(s)): count[s[r]] = 1 + count.get(s[r], 0) max_f = max(max_f, count[s[r]]) if (r - l + 1) - max_f > k: count[s[l]] -= 1 l += 1 res = max(res, r - l + 1) return resMedium Permutation in String
def check_inclusion(s1: str, s2: str) -> bool:
if len(s1) > len(s2):
return False
Open in a real environment →s1_count = [0] * 26 window_count = [0] * 26 for i in range(len(s1)): s1_count[ord(s1[i]) - ord('a')] += 1 window_count[ord(s2[i]) - ord('a')] += 1 if s1_count == window_count: return True l = 0 for r in range(len(s1), len(s2)): window_count[ord(s2[r]) - ord('a')] += 1 window_count[ord(s2[l]) - ord('a')] -= 1 l += 1 if s1_count == window_count: return True return FalseMedium Daily Temperatures
def daily_temperatures(temperatures: list[int]) -> list[int]:
res = [0] * len(temperatures)
stack = [] # Stores indices
Open in a real environment →for i, t in enumerate(temperatures): while stack and t > temperatures[stack[-1]]: prev_index = stack.pop() res[prev_index] = i - prev_index stack.append(i) return resMedium Car Fleet
def car_fleet(target: int, position: list[int], speed: list[int]) -> int:
pair = [[p, s] for p, s in zip(position, speed)]
stack = []
Open in a real environment →# Sort by position (reverse) to process closest to target first for p, s in sorted(pair)[::-1]: stack.append((target - p) / s) if len(stack) >= 2 and stack[-1] <= stack[-2]: stack.pop() return len(stack)Medium Search a 2D Matrix
def search_matrix(matrix: list[list[int]], target: int) -> bool:
rows, cols = len(matrix), len(matrix[0])
l, r = 0, rows * cols - 1
Open in a real environment →while l <= r: m = (l + r) // 2 row, col = m // cols, m % cols val = matrix[row][col] if val > target: r = m - 1 elif val < target: l = m + 1 else: return True return FalseMedium Koko Eating Bananas
def min_eating_speed(piles: list[int], h: int) -> int:
l, r = 1, max(piles)
res = r
Open in a real environment →while l <= r: k = (l + r) // 2 hours = 0 for p in piles: hours += (p + k - 1) // k if hours <= h: res = k r = k - 1 else: l = k + 1 return resMedium Find Minimum in Rotated Sorted Array
def find_min(nums: list[int]) -> int:
l, r = 0, len(nums) - 1
Open in a real environment →while l < r: m = (l + r) // 2 if nums[m] > nums[r]: l = m + 1 else: r = m return nums[l]Medium Search in Rotated Sorted Array
def search(nums: list[int], target: int) -> int:
l, r = 0, len(nums) - 1
Open in a real environment →while l <= r: m = (l + r) // 2 if nums[m] == target: return m if nums[l] <= nums[m]: if nums[l] <= target < nums[m]: r = m - 1 else: l = m + 1 else: if nums[m] < target <= nums[r]: l = m + 1 else: r = m - 1 return -1Medium Remove Nth Node From End of List
Definition for singly-linked list.
class ListNode:
def init(self, val=0, next=None):
self.val = val
self.next = next
def remove_nth_from_end(head: Optional[ListNode], n: int) -> Optional[ListNode]:
dummy = ListNode(0, head)
left = dummy
right = head
Open in a real environment →while n > 0 and right: right = right.next n -= 1 while right: left = left.next right = right.next left.next = left.next.next return dummy.nextMedium Add Two Numbers
Definition for singly-linked list.
class ListNode:
def init(self, val=0, next=None):
self.val = val
self.next = next
def add_two_numbers(l1: Optional[ListNode], l2: Optional[ListNode]) -> Optional[ListNode]:
dummy = ListNode(0)
curr = dummy
carry = 0
Open in a real environment →while l1 or l2 or carry: val1 = l1.val if l1 else 0 val2 = l2.val if l2 else 0 total = val1 + val2 + carry carry = total // 10 curr.next = ListNode(total % 10) curr = curr.next if l1: l1 = l1.next if l2: l2 = l2.next return dummy.nextMedium Lowest Common Ancestor of a BST
Definition for a binary tree node.
class TreeNode:
def init(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
def lowest_common_ancestor(root: 'TreeNode', p: 'TreeNode', q: 'TreeNode') -> 'TreeNode':
curr = root
Open in a real environment →while curr: if p.val > curr.val and q.val > curr.val: curr = curr.right elif p.val < curr.val and q.val < curr.val: curr = curr.left else: return currMedium Binary Tree Level Order Traversal
Definition for a binary tree node.
class TreeNode:
def init(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
from collections import deque
def level_order(root: Optional[TreeNode]) -> list[list[int]]:
if not root:
return []
Open in a real environment →res = [] queue = deque([root]) while queue: level_size = len(queue) current_level = [] for _ in range(level_size): node = queue.popleft() current_level.append(node.val) if node.left: queue.append(node.left) if node.right: queue.append(node.right) res.append(current_level) return resMedium Validate Binary Search Tree
Definition for a binary tree node.
class TreeNode:
def init(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
def is_valid_bst(root: Optional[TreeNode]) -> bool:
def validate(node, low=-float('inf'), high=float('inf')):
if not node:
return True
Open in a real environment →if node.val <= low or node.val >= high: return False return (validate(node.left, low, node.val) and validate(node.right, node.val, high)) return validate(root)Medium Kth Smallest Element in a BST
Definition for a binary tree node.
class TreeNode:
def init(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
def kth_smallest(root: Optional[TreeNode], k: int) -> int:
stack = []
curr = root
Open in a real environment →while curr or stack: while curr: stack.append(curr) curr = curr.left curr = stack.pop() k -= 1 if k == 0: return curr.val curr = curr.right return -1Medium K Closest Points to Origin
def k_closest(points: list[list[int]], k: int) -> list[list[int]]:
max_heap = []
Open in a real environment →for x, y in points: dist = x**2 + y**2 heapq.heappush(max_heap, (-dist, [x, y])) if len(max_heap) > k: heapq.heappop(max_heap) return [point for dist, point in max_heap]Medium Kth Largest Element in an Array
def find_kth_largest(nums: list[int], k: int) -> int:
min_heap = []
Open in a real environment →for num in nums: heapq.heappush(min_heap, num) if len(min_heap) > k: heapq.heappop(min_heap) return min_heap[0]Medium Task Scheduler
def least_interval(tasks: list[str], n: int) -> int:
counts = Counter(tasks)
Open in a real environment →max_freq = max(counts.values()) max_freq_count = 0 for count in counts.values(): if count == max_freq: max_freq_count += 1 required_time = (max_freq - 1) * (n + 1) + max_freq_count return max(len(tasks), required_time)Medium Combination Sum
def combination_sum(candidates: list[int], target: int) -> list[list[int]]:
res = []
Open in a real environment →def dfs(i, current_comb, total): if total == target: res.append(current_comb.copy()) return if i >= len(candidates) or total > target: return current_comb.append(candidates[i]) dfs(i, current_comb, total + candidates[i]) current_comb.pop() dfs(i + 1, current_comb, total) dfs(0, [], 0) return resMedium Permutations
def permute(nums: list[int]) -> list[list[int]]:
res = []
Open in a real environment →def dfs(current_perm): if len(current_perm) == len(nums): res.append(current_perm.copy()) return for num in nums: if num in current_perm: continue current_perm.append(num) dfs(current_perm) current_perm.pop() dfs([]) return resMedium Number of Islands
def numIslands(grid: list[list[str]]) -> int:
if not grid:
return 0
Open in a real environment →ROWS, COLS = len(grid), len(grid[0]) islands = 0 def bfs(r, c): q = deque() q.append((r, c)) grid[r][c] = "0" while q: row, col = q.popleft() directions = [[1, 0], [-1, 0], [0, 1], [0, -1]] for dr, dc in directions: nr, nc = row + dr, col + dc if (0 <= nr < ROWS and 0 <= nc < COLS and grid[nr][nc] == "1"): q.append((nr, nc)) grid[nr][nc] = "0" for r in range(ROWS): for c in range(COLS): if grid[r][c] == "1": islands += 1 bfs(r, c) return islandsMedium Course Schedule II
def find_order(num_courses: int, prerequisites: list[list[int]]) -> list[int]:
adj = {i: [] for i in range(num_courses)}
indegree = {i: 0 for i in range(num_courses)}
Open in a real environment →for crs, pre in prerequisites: adj[pre].append(crs) indegree[crs] += 1 q = deque() for i in range(num_courses): if indegree[i] == 0: q.append(i) res = [] while q: curr = q.popleft() res.append(curr) for neighbor in adj[curr]: indegree[neighbor] -= 1 if indegree[neighbor] == 0: q.append(neighbor) if len(res) == num_courses: return res return []Medium Graph Valid Tree
def valid_tree(n: int, edges: list[list[int]]) -> bool:
if len(edges) != n - 1:
return False
Open in a real environment →parent = [i for i in range(n)] def find(node): p = parent[node] while p != parent[p]: parent[p] = parent[parent[p]] p = parent[p] return p def union(n1, n2): p1, p2 = find(n1), find(n2) if p1 == p2: return False parent[p1] = p2 return True for u, v in edges: if not union(u, v): return False return TrueMedium Network Delay Time
def network_delay_time(times: list[list[int]], n: int, k: int) -> int:
adj = defaultdict(list)
for u, v, w in times:
adj[u].append((v, w))
Open in a real environment →min_heap = [(0, k)] visited = set() total_time = 0 while min_heap: time, node = heapq.heappop(min_heap) if node in visited: continue visited.add(node) total_time = time for neighbor, weight in adj[node]: if neighbor not in visited: heapq.heappush(min_heap, (time + weight, neighbor)) return total_time if len(visited) == n else -1Medium Jump Game
def can_jump(nums: list[int]) -> bool:
max_reachable = 0
n = len(nums)
Open in a real environment →for i in range(n): if i > max_reachable: return False if i + nums[i] > max_reachable: max_reachable = i + nums[i] if max_reachable >= n - 1: return True return TrueMedium Jump Game II
def jump(nums: list[int]) -> int:
jumps = 0
current_end = 0
farthest = 0
Open in a real environment →for i in range(len(nums) - 1): farthest = max(farthest, i + nums[i]) if i == current_end: jumps += 1 current_end = farthest if current_end >= len(nums) - 1: break return jumpsMedium Gas Station
def can_complete_circuit(gas: list[int], cost: list[int]) -> int:
if sum(gas) < sum(cost):
return -1
Open in a real environment →current_tank = 0 start_index = 0 for i in range(len(gas)): current_tank += gas[i] - cost[i] if current_tank < 0: start_index = i + 1 current_tank = 0 return start_indexMedium Partition Labels
def partition_labels(s: str) -> list[int]:
last_occurrence = {}
Open in a real environment →for i, char in enumerate(s): last_occurrence[char] = i res = [] size = 0 end = 0 for i, char in enumerate(s): size += 1 end = max(end, last_occurrence[char]) if i == end: res.append(size) size = 0 return resMedium Number of Connected Components in an Undirected Graph
def count_components(n: int, edges: list[list[int]]) -> int:
parent = [i for i in range(n)]
rank = [1] * n
Open in a real environment →def find(node): p = parent[node] while p != parent[p]: parent[p] = parent[parent[p]] p = parent[p] return p def union(n1, n2): p1, p2 = find(n1), find(n2) if p1 == p2: return 0 if rank[p1] > rank[p2]: parent[p2] = p1 rank[p1] += rank[p2] else: parent[p1] = p2 rank[p2] += rank[p1] return 1 components = n for u, v in edges: components -= union(u, v) return componentsMedium Course Schedule
def can_finish(num_courses: int, prerequisites: list[list[int]]) -> bool:
adj = {i: [] for i in range(num_courses)}
indegree = {i: 0 for i in range(num_courses)}
Open in a real environment →for crs, pre in prerequisites: adj[pre].append(crs) indegree[crs] += 1 q = deque() for i in range(num_courses): if indegree[i] == 0: q.append(i) completed_courses = 0 while q: curr = q.popleft() completed_courses += 1 for neighbor in adj[curr]: indegree[neighbor] -= 1 if indegree[neighbor] == 0: q.append(neighbor) return completed_courses == num_coursesMedium Walls and Gates
def walls_and_gates(rooms: list[list[int]]) -> list[list[int]]:
if not rooms or not rooms[0]:
return rooms
Open in a real environment →rows, cols = len(rooms), len(rooms[0]) q = deque() INF = 2147483647 for r in range(rows): for c in range(cols): if rooms[r][c] == 0: q.append((r, c)) directions = [[1, 0], [-1, 0], [0, 1], [0, -1]] while q: r, c = q.popleft() for dr, dc in directions: nr, nc = r + dr, c + dc if 0 <= nr < rows and 0 <= nc < cols and rooms[nr][nc] == INF: rooms[nr][nc] = rooms[r][c] + 1 q.append((nr, nc)) return roomsMedium Surrounded Regions
def solve(board: list[list[str]]) -> list[list[str]]:
if not board or not board[0]:
return board
Open in a real environment →rows, cols = len(board), len(board[0]) def dfs(r, c): if r < 0 or c < 0 or r >= rows or c >= cols or board[r][c] != "O": return board[r][c] = "T" dfs(r + 1, c) dfs(r - 1, c) dfs(r, c + 1) dfs(r, c - 1) for r in range(rows): dfs(r, 0) dfs(r, cols - 1) for c in range(cols): dfs(0, c) dfs(rows - 1, c) for r in range(rows): for c in range(cols): if board[r][c] == "O": board[r][c] = "X" elif board[r][c] == "T": board[r][c] = "O" return boardMedium Pacific Atlantic Water Flow
def pacific_atlantic(heights: list[list[int]]) -> list[list[int]]:
if not heights or not heights[0]:
return []
Open in a real environment →rows, cols = len(heights), len(heights[0]) pacific = set() atlantic = set() def dfs(r, c, visited, prev_height): if (r < 0 or c < 0 or r >= rows or c >= cols or (r, c) in visited or heights[r][c] < prev_height): return visited.add((r, c)) dfs(r + 1, c, visited, heights[r][c]) dfs(r - 1, c, visited, heights[r][c]) dfs(r, c + 1, visited, heights[r][c]) dfs(r, c - 1, visited, heights[r][c]) for c in range(cols): dfs(0, c, pacific, heights[0][c]) dfs(rows - 1, c, atlantic, heights[rows - 1][c]) for r in range(rows): dfs(r, 0, pacific, heights[r][0]) dfs(r, cols - 1, atlantic, heights[r][cols - 1]) res = [] for r in range(rows): for c in range(cols): if (r, c) in pacific and (r, c) in atlantic: res.append([r, c]) return resMedium Max Area of Island
def max_area_of_island(grid: list[list[int]]) -> int:
ROWS, COLS = len(grid), len(grid[0])
max_area = 0
Open in a real environment →def dfs(r, c): if r < 0 or r >= ROWS or c < 0 or c >= COLS or grid[r][c] == 0: return 0 grid[r][c] = 0 return (1 + dfs(r + 1, c) + dfs(r - 1, c) + dfs(r, c + 1) + dfs(r, c - 1)) for r in range(ROWS): for c in range(COLS): if grid[r][c] == 1: max_area = max(max_area, dfs(r, c)) return max_areaMedium Clone Graph
class GraphNode:
def init(self, val = 0, neighbors = None):
self.val = val
self.neighbors = neighbors if neighbors is not None else []
def clone_graph(node: 'GraphNode') -> 'GraphNode':
old_to_new = {}
Open in a real environment →def dfs(node): if not node: return None if node in old_to_new: return old_to_new[node] copy = GraphNode(node.val) old_to_new[node] = copy for nei in node.neighbors: copy.neighbors.append(dfs(nei)) return copy return dfs(node)Medium Subsets
def subsets(nums: list[int]) -> list[list[int]]:
res = []
Open in a real environment →def dfs(i, current_subset): res.append(current_subset.copy()) for j in range(i, len(nums)): current_subset.append(nums[j]) dfs(j + 1, current_subset) current_subset.pop() dfs(0, []) return resMedium Binary Tree Right Side View
Definition for a binary tree node.
class TreeNode:
def init(self, val=0, left=None, right=None):
self.val = val
self.left = left
self.right = right
def right_side_view(root: Optional[TreeNode]) -> list[int]:
if not root:
return []
Open in a real environment →res = [] queue = deque([root]) while queue: level_size = len(queue) rightmost_val = None for _ in range(level_size): node = queue.popleft() rightmost_val = node.val if node.left: queue.append(node.left) if node.right: queue.append(node.right) res.append(rightmost_val) return resMedium Copy List with Random Pointer
"""
Definition for a Node.
class Node:
def init(self, x: int, next: 'Node' = None, random: 'Node' = None):
self.val = int(x)
self.next = next
self.random = random
"""def copy_random_list(head: 'Optional[Node]') -> 'Optional[Node]':
if not head:
return None
Open in a real environment →# 1. Interweave curr = head while curr: new_node = Node(curr.val, curr.next) curr.next = new_node curr = new_node.next # 2. Assign random pointers curr = head while curr: if curr.random: curr.next.random = curr.random.next curr = curr.next.next # 3. Separate curr = head new_head = head.next while curr: copy = curr.next curr.next = copy.next if copy.next: copy.next = copy.next.next curr = curr.next return new_headMedium Reorder List
Definition for singly-linked list.
class ListNode:
def init(self, val=0, next=None):
self.val = val
self.next = next
def reorder_list(head: Optional[ListNode]) -> Optional[ListNode]:
if not head: return None
Open in a real environment →# 1. Find middle slow, fast = head, head.next while fast and fast.next: slow = slow.next fast = fast.next.next # 2. Reverse second half second = slow.next prev = slow.next = None while second: tmp = second.next second.next = prev prev = second second = tmp # 3. Merge first, second = head, prev while second: tmp1, tmp2 = first.next, second.next first.next = second second.next = tmp1 first, second = tmp1, tmp2 return headMedium Evaluate Reverse Polish Notation
def eval_rpn(tokens: list[str]) -> int:
stack = []
Open in a real environment →for c in tokens: if c == "+": stack.append(stack.pop() + stack.pop()) elif c == "-": b, a = stack.pop(), stack.pop() stack.append(a - b) elif c == "*": stack.append(stack.pop() * stack.pop()) elif c == "/": b, a = stack.pop(), stack.pop() stack.append(int(a / b)) else: stack.append(int(c)) return stack[0]Medium Accounts Merge
class UnionFind:
def init(self):
self.parent = {}def find(self, x): if x not in self.parent: self.parent[x] = x if self.parent[x] != x: self.parent[x] = self.find(self.parent[x]) return self.parent[x] def union(self, x, y): rootX = self.find(x) rootY = self.find(y) if rootX != rootY: self.parent[rootY] = rootXdef accounts_merge(accounts: list[list[str]]) -> list[list[str]]:
uf = UnionFind()
email_to_name = {}
Open in a real environment →for acc in accounts: name = acc[0] first_email = acc[1] for i in range(1, len(acc)): email = acc[i] email_to_name[email] = name uf.union(first_email, email) merged_emails = {} for email in email_to_name: root = uf.find(email) if root not in merged_emails: merged_emails[root] = [] merged_emails[root].append(email) res = [] for root, emails in merged_emails.items(): res.append([email_to_name[root]] + sorted(emails)) return resMedium Top K Frequent Elements in Stream
import heapq
from collections import defaultdictclass TopKFrequent:
def init(self, k):
self.k = k
self.freq = defaultdict(int)
Open in a real environment →def add(self, num): self.freq[num] += 1 def topK(self): items = [(-count, num) for num, count in self.freq.items()] top = heapq.nsmallest(self.k, items) return [num for _, num in top]Medium Log Aggregator
from collections import defaultdict
import bisectclass LogAggregator:
def init(self):
self.logs = defaultdict(list)
Open in a real environment →def record(self, timestamp, key): self.logs[key].append(timestamp) def count(self, key, start, end): if key not in self.logs: return 0 timestamps = self.logs[key] left = bisect.bisect_left(timestamps, start) right = bisect.bisect_right(timestamps, end) return right - leftMedium Event Stream Deduplicator
class Deduplicator:
def init(self, ttl):
self.ttl = ttl
self.seen = {}
Open in a real environment →def process(self, timestamp, eventId): expired = [k for k, t in self.seen.items() if timestamp - t > self.ttl] for k in expired: del self.seen[k] if eventId in self.seen: self.seen[eventId] = timestamp return False self.seen[eventId] = timestamp return TrueMedium Skew-Aware Key Partitioner
class SkewHandler:
def init(self, numBuckets, hotThreshold, splitFactor):
self.n = numBuckets
self.threshold = hotThreshold
self.split = splitFactor
self.counts = {}
self.robin = {}
self.load = [0] * numBuckets
Open in a real environment →def _hash(self, key): h = 0 for c in key: h = h * 31 + ord(c) return h % self.n def assign(self, key): self.counts[key] = self.counts.get(key, 0) + 1 base = self._hash(key) if self.counts[key] > self.threshold: idx = self.robin.get(key, 0) bucket = (base + idx) % self.n self.robin[key] = (idx + 1) % self.split else: bucket = base self.load[bucket] += 1 return bucket def getLoad(self): return list(self.load)Medium Hash Join Simulator
from collections import defaultdict
class HashJoin:
def init(self):
self.table = defaultdict(list)
Open in a real environment →def build(self, rows, keyIndex): self.table.clear() for row in rows: self.table[row[keyIndex]].append(row) def probe(self, rows, keyIndex): result = [] for row in rows: for build_row in self.table.get(row[keyIndex], []): result.append(build_row + row) result.sort() return result
Practice on real environments
Browse and filter the full Data Engineering catalog, or see questions asked at specific companies.