Interview Questions

Data Engineering Interview Questions

Are you preparing for a data engineering interview? Data engineering is about building and maintaining the pipelines and infrastructure that move data reliably. These real Data Engineering interview questions cover SQL, Spark, Kafka, warehousing and modeling.

  • 189+Questions
  • 71Companies
  • 98With video
  • 61/112/16 Easy / Med / Hard

Real questions asked at Google, Samsung, Meta, DoorDash, Amazon, Apple and 65+ more.

Click any question to read the full answer, then open it to solve hands-on in a real Data Engineering environment that boots in your browser. Built by working engineers, no AI.

SQL (33)

  • Easy Average Order Value Video

    This SQL question is called Average Order Value. We have one table that is called Orders. This table records customers' purchases and has four columns: orders ID, date of transaction, reference to the customer by ID, and total amount spent. Our job here is to calculate for each customer the average amount they spent across all their orders. In the end, the average value should be rounded to two decimal places. The output should include customer ID and average order value, sorted by ID in ascending order. We start with FROM clause. Then we proceed with GROUP BY clause that separates individual orders into groups for each customer. In order to choose specific columns from a particular table, we use SELECT clause. We wrap the total amount column with Average inside of Round function as first argument, and two number as two decimal places. Finally, we order by customer ID.

    Open in a real environment →
  • Easy Join Employees and Departments Video

    We are given two tables, departments and employees. These two tables are connected through department ID column. Some employees might not be assigned to any department, so their department ID is null. But we still want them in our results. Our main goal here is to return all employees that earn more than 50,000 with their department name sorted by hire date from most recent to oldest. In order to combine these two tables, we will use the concept of left join. A left join keeps all rows from the left table even if there is no matching row in the right table. The missing values just become null. We implement left join in between, and using on, we indicate which common columns were used to connect these tables. We will use where clause to filter the rows. We only want those employees whose salary is greater than 50,000. Finally, we sort everything out by hire date in descending order so that the most recent date pops up first and oldest date last.

    Open in a real environment →
  • Easy Filter Orders by Date Range Video

    We need to write a query to filter orders by date range. We are given one table that is called Orders. It has four columns: customer's name, ID of an order that has been made, date of transaction, and total amount spent. We are required to return only the orders that were placed between January 1st, 2023, and June 30th, 2023. Both dates are inclusive, which means that orders on exactly those dates should be included, too. In SQL, we always start with FROM clause because this tells which table we want to work with. Using SELECT clause, we can choose the columns for the output. WHERE clause, which is the filter step. It goes through every single row and checks if the condition is true. For order date column, we use BETWEEN and operators. This combination checks if a value falls within a range, including both the start and end values. Finally, we sort everything by order date in ascending order so that the rows with the earliest date come first.

    Open in a real environment →
  • Easy Find Customers Without Orders Video

    We need to find customers without orders. We have two tables, customers and orders. These tables are connected through customer ID column. Our job here is to find customers who have never placed a single order, but at the same time they exist in the customers table. We are more interested in LEFT JOIN since we want all rows from the left table to stay. A LEFT JOIN keeps all rows from the left table, even if there is no matching row in the right table. The missing values just become null. Then we add the LEFT JOIN operator in between, and ON keyword that indicates which common columns were used to combine the tables. In our case, it's customer ID. We proceed with WHERE clause that filters the results. We build the condition to check if order ID is null. We pick this specific column because if a customer has at least one order, their order ID will be a real number after the LEFT JOIN. Finally, we sort everything by customer name in ascending order or alphabetically.

    Open in a real environment →
  • Easy Use COALESCE for Null Handling

    How to Replace NULL with 0 in SQL and Retrieve All Orders from the orders Table

    Writing clean and efficient SQL queries is an essential skill for database management and data analysis. If you're asked to retrieve all orders from an orders table, ensuring that any NULL values in the discount column are replaced with 0, you need to follow specific steps to structure your query correctly. Below is a comprehensive guide on achieving this task.

    Steps to Write the SQL Query

    1. Identify the Columns: The orders table contains the columns order_id, customer_name, discount, and total_amount.
    2. Handle NULL Values: Ensure that the discount column does not contain any NULL values by using the COALESCE function, which allows you to replace NULL values with 0.
    3. Select All Required Columns: Ensure that the query retrieves all the columns in the specified order - order_id, customer_name, discount, and total_amount.
    4. Order the Results: Use the ORDER BY clause to sort the results by order_id in ascending order.

    Sample SQL Query

    SELECT 
        order_id,
        customer_name,
        COALESCE(discount, 0) AS discount,
        total_amount
    FROM 
        orders
    ORDER BY 
        order_id ASC;
    

    Breaking Down the Query

    • SELECT statement: This part retrieves the columns you need.

      SELECT 
          order_id,
          customer_name,
      
    • COALESCE Function: Use COALESCE(discount, 0) to replace NULL values in the discount column with 0.

          COALESCE(discount, 0) AS discount,
      
    • FROM clause: Specifies the table from which to fetch the data.

      FROM 
          orders
      
    • ORDER BY clause: Ensures the result set is ordered by order_id in ascending sequence, making it easier to read and analyze.

      ORDER BY 
          order_id ASC;
      

    Best Practices

    • Readability: Write clear and readable queries. Using aliases (like COALESCE(discount, 0) AS discount) makes it easier to interpret results.
    • Performance: Ensure your database has indexes on columns commonly used in ORDER BY clauses, like order_id, to optimize query performance.

    Leveraging these structured steps will help you efficiently write the required SQL query to retrieve and process the orders data from the orders table, replacing NULL discounts with 0, and ordering by order_id. This method ensures clean data handling and a structured output, which is a critical aspect of database queries and analysis.

    Open in a real environment →
  • Easy Merge Multiple Address Fields Video

    We have a customers table with columns like city, customer ID, first name, last name, postal code, state, and street address. Our job is to combine them into one single column called full address. Some columns could have null values, meaning some customers might be missing a city, state, or postal code. In this case, they should be excluded from the full address. One more possible case is that all components are missing. In this case, we should return an empty string. For this problem, we will need to use COALESCE function. COALESCE function takes a list of values and returns the first one that is not null. For this reason, we need to know the case statement. The idea is basically like an if else statement in any programming language. We will use the concatenation sign in order to connect them one by one. The last thing we have to do is sorting everything out in ascending order, which can be done using order by.

    Open in a real environment →
  • Easy String Concatenation in SELECT Video

    Our job is to retrieve the full names of all employees by combining the first and last names. We have to make sure that the spaces in between are added, and the result is sorted in ascending order. In SQL, a string is basically just text, anything like a name, a number, or email that are stored in columns with types like varchar or text. Concatenation is simply joining two or more strings together into one output. If you want the space between your strings, you have to treat that space as its own separate string inside quotes. In order to return this specific full name column, we use select clause and implement concat function inside of it. We name our new column as full name. The last thing we are required to do is to sort everything in ascending order based on full name. For this reason, we use order by clause and the name of the column.

    Open in a real environment →
  • Easy Find Nth Highest Revenue Video

    We will write a query to find N's highest revenue. We have sales table with columns like ID, product, and revenue. Our job is simply to find the third highest distinct revenue value. We need to keep in mind that product with same revenue count as one value, not two. Our query will consist of three steps. First, we get rid of duplicates, then what's left, we rank from highest to lowest, and then pick the one on third place. A window function performs a calculation across multiple rows, but unlike group by, it does not collapse everything into one result. The word just activates the window function. We sort the revenue column out in descending order so that on the first place, we get the highest value. We filter the column where rank equals to three, and we find the value on the third position.

    Open in a real environment →
  • Easy Self-Join to Identify Missing Supervisors Video

    We have only one table that is called Employees and three columns that indicate employee ID, the name of that employee, and supervisor ID. One employee can be a supervisor for his or her colleagues. Our job is simply to find employees whose supervisor does not exist in the table. Employees with a null value in supervisor ID column are excluded. The final list should be sorted out in ascending order by employee ID. In order to reference the same table twice, we give it two different names, or aliases. e1 will represent each employee, and e2 will search for supervisor. We will use left join, which will keep all rows from the left table, regardless of whether a match was found in the right table or not. Using ON condition, we simply ask the query to take the supervisor ID from e1 table and look for it in employee ID column of e2 table. The first statement is employee ID from e2 table is null, which will simply catch all employees whose supervisor was not found in the table after the left join. We also check if supervisor ID is not a null value.

    Open in a real environment →
  • Easy Year-over-Year Revenue Growth Video

    We need to calculate the yearly revenue and the percentage growth year over year for a given set of financial transactions. We have financials table that contains two columns, transaction date, that is date type, and amount, that is numeric. We group all transactions by year and sum them up to get the total revenue per year. A CTE is a temporary result set in SQL that you can reference within a single query. It only exists while that query is running. In our select, we use extract in order to get only year from our transaction date column. We also need to get the total revenue of our amount column using sum. A left join keeps all rows from the left table, even if there is no matching row in the right table. The missing values just become null. We take our CTE yearly revenue and give it an alias current. Using left join, we take the exact same CTE again, but this time we call it prev for previous. We need to subtract current and previous year, divide by previous year, and multiply by 100 to get the percentage. Inside of select, we use round, which is a built-in SQL function that rounds a decimal number to a specified number.

    Open in a real environment →
  • Medium Above Average Price Products Video

    We have a products table with columns like category, ID, name, price, rating, and stock quantity. Our job is to return the products that are priced above the average price. We only work with those products that are currently in stock. It means that not only we don't return them, but we also don't use them while calculating the average amount. While sorting, we should put all the products with null ratings at the bottom. Using select with a star, we can choose all columns. Using from clause, we retrieve data from products table. We need to write a subquery. Using where clause, we create a small condition. Using the average function, it will calculate the average price. The main condition checks whether selected product's price is higher than the average. We have to sort everything out in descending order using order by. At the end of our order by clause, we will use nulls last.

    Open in a real environment →
  • Medium Calculate Cumulative Sales Video

    We are given a sales data table with the name of the products, date of transaction, and the amount of daily sales. Each row simply represents one day of sales per product. Our mission is to add a new column called cumulative sales, which is a running total that keeps adding up the daily sales for each product as we move forward through the dates. We do the calculation within each product separately. For each row, we need to add up all the previous rows for the same product, which is not something that regular sum with group by can handle, since group by would just put everything into one total. A window function performs a calculation across multiple rows, but unlike group by, it never collapses them. We start with sum within daily sales column, and then we use over keyword that simply activates the window function. We will use partition by function within product name column that will simply divide the data into separate groups.

    Open in a real environment →
  • Medium Find Overlapping Date Ranges Video

    We have one table that is called assignments. Each row represents one employee assigned to one project with start date and end date. We need to find those employees who are assigned to multiple projects at the same time, or in other words, who face an overlap. The final result must contain all the columns from the input and sort it out by project ID primarily, and then by employee ID in ascending order. Two date ranges overlap when they share at least one day in common. We need to compare each assignment to another one for the same employee, which basically means that we reference the same table twice. This means that we'll use self join here. We take assignments table and give it alias a1. Then we join it with a2 copy that will search for overlapping assignments. The first condition is that assignment a1 must start before or on the same day that assignment a2 ends, because if a1 starts after a2 already ended, no overlap is possible. The role of distinct function here is removing those duplicates and making sure that each assignment appears only once in the final result.

    Open in a real environment →
  • Medium Set Operation: INTERSECT Video

    This SQL question focuses on set operation called intersect. We are given two tables that share customer ID as a primary key. Our main goal is to return list of active customers based on two criterias. On a monthly basis, a new customer should spend more than 1,000, be loyal, and have at least three years of membership and a premium tier status. Final output must include only the customer ID and name columns sorted in ascending order by ID. A CTE is a temporary result set in SQL that you can reference within a single query. Using with clause, we create a CTE called monthly spenders among new customers that spend more than 1,000. We can name the CTE as premium tier. Intersect takes both tables and returns only the values that appear in both of them. We select customer ID and name columns from monthly spenders, which was the first CTE. Then we select information from the second CTE, and between these two we add intersect operator.

    Open in a real environment →
  • Medium Subquery for Best Order per Customer Video

    We are given two tables. The first table contains the names of the customers. The second table stores order IDs and total amount of each order. Both tables share customer ID primary key which connects them together. In the output, we should get customer's name, ID of the best order, and its total amount. All the results should be sorted by a customer's name in alphabetical order. Since each customer can have multiple orders, and among these orders might be the ones with the same highest total amount, in this case, we should return the one with the smallest order ID. We are more interested in inner join because it returns only the rows where there is a match in both tables. Inside of the where clause, we will have correlated subquery. It is also a query inside another query, but it runs once for every single row in the outer query. Since we need to return the highest valued order for each customer, we sort by total amount in descending order. We also sort order ID in ascending order because when two orders pop up with the same amount, the one with smaller order ID will come first. And in the very end, we apply limit one.

    Open in a real environment →
  • Medium Ranking with Dense_Rank Video

    The main focus is on ranking values with dense rank window function. We are working with sales table that tracks individual sales made by different representatives. The goal here is to first add up all sales per person to get their total value, and then rank everyone based on that total from highest to lowest. When we use dense ranking, two people with equal amount of sales will share the same rank. A CTE is a temporary result set in SQL that you can reference within a single query. Dense rank is one of the window functions that assigns a rank number to each row based on a specified order. Over is a keyword that simply activates a window function and lets SQL know that we try to implement it. Finally, order by clause sorts by sales rank first, so rank one comes before rank two, and then salesperson name is sorted alphabetically.

    Open in a real environment →
  • Medium Median Salary by Job Title Video

    We'll write a query to calculate median salary for each job title. When all salaries are sorted in ascending order, the median value will be the one that is in the middle of them. If there is an even number of salaries, we take the average of the two middle values. A CTE is a temporary result set in SQL that you can reference within a single query. A window function performs a calculation across multiple rows while keeping every single row in the result. Inside of the window function, we are using partition by job title. Partition by divides the data into separate groups, and count will run independently inside each of these groups. We use row number that simply assigns a sequential number to each row. These formulas calculate the median position so that we know which row number represents the middle one. If the number of rows is an odd number, both of the formulas will give the same result. Using round function and double column numeric, we will convert the results to numeric type and then clean everything up to two decimal places.

    Open in a real environment →
  • Medium String Splitting and Aggregation Video

    We have a product_tags table with two columns, id and tags. The tags column stores multiple tags in one single string separated by commas. Our job is to split those tags apart and count how many times each tag appears across all products. One important thing to keep in mind is that tags are case sensitive. A CTE is a temporary result set in SQL that you can reference within a single query. Our temporary table will be called split_data, and then we select id and tag from our product_tags table. First, we are using string_to_array. You basically pass a string and the separator to the function. After turning tags column into array, we will use unnest function, which will basically put each piece of our array into one separate row. We will use the count with a star function that will basically go through each row and return the amount. Using group by clause, we will group all rows with the same tag together.

    Open in a real environment →
  • Medium Salary Comparison with CTE Aggregation Video

    We have two tables, departments and employees. Our job is to find employees who earn more than the average salary of their own department. We need the average salary per department before we can compare anything. For this reason, we will use CTE here. A CTE is a temporary result set in SQL that you can reference within a single query. It doesn't get saved anywhere. It only exists while that query is running. We want to find the average salary per department. Inside of it, we select department and average salary grouped by department. For our task, we are more interested in inner join because it returns only the rows where there is a match in both tables. We connect them on department ID from employees table and ID from departments table because we remember that they were the same common primary keys. The average salary should be rounded to two decimal places in the output. Using where clause, we write a condition that salary of an employee should be higher than the average salary that we calculated in our CTE before. The last thing we have to do is sorting everything out primarily by department's name in ascending order and then by salary in descending order.

    Open in a real environment →
  • Medium String Pattern Extraction in Descriptions Video

    This question mainly focuses on string extraction. We have a products table with four columns: description, name, price, and product ID. The description column contains free text. Some products have email address inside of it and some don't. Our mission is to find those email addresses and extract them from the column. We only return three columns, product ID, name, and email that we will find from description. We will need to use substring. Substring is a string manipulation function. It simply returns a copy of a specific portion of the string. The pattern of an email address is some word, then @ symbol, word again, dot, and then word in the end. Regular expression is kind of a language that is used to describe a sequence of characters. Instead of equal, we use tilde, which is regular expression operator, which means just find this kind of pattern in description column. We sort everything in ascending order by product ID.

    Open in a real environment →
  • Medium Nested Subquery for Latest Record Video

    We only have one table here, events. There are five columns: event_date and name, id, status, and user_id. Each row in the event_name column represents something a user did, and then status column shows if it was successful. For each user, we want to know what their most recent event was. Final result should contain all the columns except for ID column, and everything should be sorted in ascending order by user ID. For each user and event row, we check and find the maximum date for that specific user. If the current row's date matches the maximum date, we keep it. We take the events table and give it an alias e1. Now we have to use this table again, but we can't name it with the same alias anymore. That's why inside of the subquery, we give another nickname, another alias to our table, which will be e2. We implement max function to find the maximum date, and then inside of the where clause, we compare the current date that we are checking right now and the maximum date that we already found. Finally, we sort everything out in ascending order by user ID.

    Open in a real environment →
  • Medium Window Function for Moving Average Video

    We are given a sales table with two columns, amount and sale date. Each row indicates the amount of sales per day within one week. Our main goal is to calculate a seven-day moving average for each date. Moving average means that we calculate that number for each day using the current amount plus the amount of previous days. The challenge here is that for each row, we need to look at other rows, which is not something that where or group by clause can do. A CTE is a temporary result set in SQL that you can reference within a single query. Window function calculates across multiple rows, but it keeps every single row in the result. From sales table, we select sale date and amount columns, and also moving average as a new column. This statement, range between interval six days preceding and current row, simply helps us to define the frame. Using round function for this column, we write double colon that will help us to convert raw data to numeric that we need, and two here will round everything to two decimal places.

    Open in a real environment →
  • Medium Re-enrollment Rate Calculator Video

    We are given one table, Enrollments. This table has three columns that contain information like course ID, student ID, and term ID. Our job here is to calculate what percentage of students are enrolled in two or more consecutive terms. Consecutive means that in term ID column, the sequence goes like 101, 102, 103 without any gaps. A CTE is a temporary result set in SQL that you can reference within a single query. We will call our CTE as consecutive terms, and from enrollments table, we select student and term ID columns. We implement a window function with row number. Partition by clause divides a result set into smaller groups and allows a function to perform calculations on each subset separately. Row number assigns a sequential number to each row starting from one. In order to prevent all the duplicates, count with distinct will only return unique student IDs and completely ignore any null values. Cast in SQL is used to convert a value of one data type into another. We are required to round the results to two decimal places, and the only way to do so is by using a round function.

    Open in a real environment →
  • Medium String Pattern Matching Using LIKE Video

    We'll match string patterns using like. We have two tables, departments and employees. Our job is to filter employees based on three conditions. First one is that their name must start with letter A. Second is that the email must contain the substring @tech, and third is the position level must contain the word senior. A join in SQL connects two tables together based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. Like is an operator in SQL used to search for a specific pattern inside of a string. Like works together with wild cards, and the most significant one is percent sign. We write name column from employees table, like A percent sign, which means A is the first letter, and then any symbol can follow this letter. Second condition is that email from employees table must contain @tech keyword, and since this value is in the middle of the string, we put percent signs from both sides. And finally we sort everything in ascending order by name.

    Open in a real environment →
  • Medium Customer Order Aggregation Video

    We are given two tables. First is customers table that contains customer's name, email, and ID. Also, second table that contains ID of the order, its amount, and ID of the customer. Our main goal is to find customers who have placed more than two orders and calculate the total number of orders and the amount they spent on them. The final result must be sorted out in descending order by total spend. A join in SQL connects two tables together based on a common column. For our task, we are more interested in inner join because it returns only the rows where there is a match in both tables. Second step is grouping by ID, name, and email from customers table because this will collapse all rows belonging to the same customer into one single group. Since we already grouped the columns, we will have to use having clause that works with groups, not rows. To get this number, we have to use ID column that we grouped before and count function that will simply find the number of orders for each group.

    Open in a real environment →
  • Medium SQL JOIN with Pandas Data Processing and CSV Export Video

    We need to do the data processing and CSV export using Pandas and SQLite. We are given a SQLite database that is called Sales. It contains three tables: customers, orders, and items. SQLite is a lightweight database that stores everything in a single file. We need to connect to the database, then run an SQL query to join all three tables that we had, load the results into Pandas, calculate revenue metrics per customer, and export everything to CSV file. JOIN connects two tables based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. We will implement the read_sql_query function. We build the total amount column by multiplying quantity with unit price. When we group by customer ID, we will put all rows belonging to the same customer together. In order to calculate the revenue percentage, we will need to divide each customer's total by the overall revenue.

    Open in a real environment →
  • Medium Insert New Records into SQLite Database from CSV Video

    We'll insert new records into database from CSV file using SQLite and Pandas. We have one CSV file that is called New Customers. It contains customer records that need to be imported in a SQLite database. The problem is that some of these customers might already exist in the database file. SQLite is a lightweight database that stores everything in a single file. Our job here is to read the CSV file, check which records already exist, and make sure we only include the new ones. We will use the connect function where we define the path to the file. Then within this object, we implement cursor. We read the given CSV file into a DataFrame that will be called df. Using cursor execute will run the SQL query. In the query itself, we select ID column from customers table. Fetchall function will retrieve all the results as a list of tuples. The tilde symbol is used here to flip true to false and false to true. Cursor execute will run the insert statement. Instead of putting the values directly in the SQL string, we will use the question mark and pass the actual values as a separate tuple. We do it for security reasons to protect the data from SQL injection. We will use the commit function within our database to finalize and save all changes. Finally, we close the database.

    Open in a real environment →
  • Medium Aggregate SQL Query Results with Pandas and Export to Excel Video

    We'll need to aggregate SQL query results and export them to Excel using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given one SQLite database that is called Orders. It contains two tables, the first one with customers' information and second one with all the transactions. SQLite is a lightweight database that stores everything in a single file. Our job here is to join both of the given tables, then calculate the total order value per customer, and save the results to an Excel file. We import two libraries, SQLite 3 and Pandas. Then we open the database using the connect function. We are more interested in inner join because it returns only the rows where there is a match in both tables. Then we execute what we have written above using the read_sql_query function and load the results directly into a Pandas data frame. We use group by so that all rows belonging to the same customer are put together. Finally, we save the resulting data frame to Excel file.

    Open in a real environment →
  • Hard Merge Employee and Department Records Video

    We need to merge employee and department records. We have two tables, departments and employees. We only consider departments that have more than 10 employees. Second requirement is to find employees whose salary is above their department average. Third requirement is to add high earners column to the output, which will basically count the number of employees whose salary is more than 75,000. A CTE is a temporary result set in SQL that you can reference within a single query. A join in SQL connects two tables together based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. The difference is that where clause comes before group by, while having runs after. In order to calculate the high earners column, we will use count with case when. Results should be sorted out primarily by department name and then by salary in descending order.

    Open in a real environment →
  • Hard Sequence Products by Price Video

    We are given products table with the name of the product, its ID, and the price. Our main goal is to create and calculate neighbor product column. For each product, we take the price of previous product and the price of next product, and then we multiply them together. A CTE is a temporary result set in SQL that you can reference within a single query. It doesn't get saved anywhere. It only exists while that query is running. Window function performs a calculation across multiple rows and keeps every single row in the result. The reason why we can't use group by here is that group by clause collapses all rows into one single row. We use lag, which reaches back to the previous row and grabs its value. The only difference is that we'll use lead instead of lag. Lead will reach forward to the next row and grab its value. COALESCE function takes a list of values and returns the first one that is not null. Since null values can't be used in the calculations, COALESCE function will replace them with zeros.

    Open in a real environment →
  • Hard Top Categories by Average Price Video

    We are given two tables, inventory and products. These two tables share common column and are connected through product_id column from inventory table and id column from products table. Our main goal is to find the top three product categories with the highest average price. We only consider active products, meaning that they are in stock. For low stock items are qualified those products that are less than 10 available. The final result should be ordered primarily by average price in descending order, and then by product count column as a tiebreaker. A CTE is a temporary result set in SQL that you can reference within a single query. We use left join to make sure every product appears, even if it somehow has no inventory entry. For low stock items column, we need to count only products with stock less than 10. We group by category so that all rows that share the same category would collapse into one group. Rank here assigns a number to each row based on a specified order. We put the condition inside the where clause that price rank should be less or equal to three.

    Open in a real environment →
  • Hard Export SQLite Database to Parquet Format with Metadata Video

    We'll need to export SQLite database to Parquet format with metadata using Pandas. We have one SQLite database that is called ecommerce. It contains three tables, customers, products, and orders. The orders table has foreign keys that connects it to the first two tables. SQLite is a lightweight database that stores everything in a single file. Our job here is to export every of three tables to their own Parquet file with Snappy compression and create a manifest JSON file describing the export. Parquet stores data column by column instead of doing that row by row. A manifest is a metadata file that describes the contents of an export. We import five things: SQLite 3 library to connect to the database; Pandas, that will be aliased as pd; JSON to create the manifest; OS library for creating directories, building file paths, and getting file sizes; and the DateTime module for getting the current timestamp for the manifest. We read the entire table into the data frame and save the data frame as a Parquet file. Then we set compression to Snappy and index to false.

    Open in a real environment →
  • Hard Combine Data from Multiple Sources into Unified Report Video

    We need to combine data from multiple sources into a unified report using Pandas. We are given one customers CSV file and one orders SQLite database. SQLite is a lightweight database that stores everything in a single file. We get the data from API, CSV file, and the database. We send an HTTP GET request to a URL, and the API responds with data in JSON format. Our job is to fetch data from all three sources, combine them into one unified DataFrame, and save the final report as a CSV file. We import three libraries: Requests, Pandas, and SQLite 3 to connect the database. We open the database file using connect function. Read SQL query will execute our SQL request and load the result directly into the DataFrame. Merge in Pandas connects two DataFrames based on a common column or index. It works the same way as JOIN in SQL. On indicates which common column was used to connect the DataFrames. The method is left join, which means that we keep all orders even if customer ID doesn't match. We can calculate the total amount when multiplying quantity by price. The final unified report is saved as CSV file.

    Open in a real environment →

Git (15)

  • Easy Create Branch from Detached HEAD State Video

    We have a Git repository that has detached head. Head is basically our latest commit and detached head means that the head that we are currently located at is not pointing to any Git branch, any points directly to commit or tag. It happens when we check out directly tag or we check out directly at comit hash, or we can check out some remote branch. How we can restore from detached head is we either need to return to some existing branch or we have to create a new branch from that detached head. Run git status. We can see that GIT message that head is detached. We'll create new branch by typing Git Checkout the name of the branch, or you can run more modern command Git Switch. Run checkout minus B. And then name of the branch.

    Open in a real environment →
  • Easy Rebase Feature Branch Video

    We have a Git repository. Under this folder we have a feature payment branch and it's behind our main branch by 32 commits. We need to rebase a feature branch onto the latest main. We need to bring all the latest changes from main to this feature branch. Switch to feature payment branch. We need to rebase from main. Type git rebase main. We have conflicts. Some of the commits cannot be rebased. We have two changes for the same line. Remove the obsolete one and keep what we need. And now we can git add and git rebase continue. We've successfully rebased and updated feature payment.

    Open in a real environment →
  • Easy Apply Specific Stash from Multiple Stashes Video

    Git stash lets us to save our work in progress, go to some other task, and then return back, restore our stashed work, and continue from that moment on. We can save multiple work in progress jobs, so we can have multiple stashes. We have to navigate to this repository, identify the third stash in the list and restore that. So we need to apply it without removing it from the stack. Run git stash list. This will show us list of the stashed work. Index starts from zero. In order to restore this, git stash apply stash and then two. This will still retain stash number two in the list. In order to remove stash from the list, we need to run command git stash drop and then the name of the stash. Another command to restore stash is git stash pop. Git stash pop applies the latest stash, meaning stash number zero, and then drops it from the list.

    Open in a real environment →
  • Easy Remove Last Commit and Discard Changes Video

    We've committed some local changes that contain incorrect changes and we need to completely erase that from history. Our task is to navigate to this repository, remove last commit entirely. The most important thing here is to discard all associated files changes. First run get log to see all the current commits. We will go a few steps back, and we also need to discard all associated file changes. This means that this is not soft reset, which will reset that, but keep the local changes. So this has to be a hard reset. We have bad commit, wrong changes. It's the last commit and we will reset it. One commit back, run Git, reset hard at one.

    Open in a real environment →
  • Easy Checkout Single File from Another Branch Video

    We have two branches, main and feature settings, and we have a file config json that is located in the feature branch. We need to copy this file into the main branch. We don't need to cherry pick the commit. Cherry pick and copying one single file is different. Cherry picking adds specific commit from another branch to the branch that we are in currently. When we need to copy the file, we need to check out, but we need to check out specific file. GI Checkout, and we type the name of the branch and then Hyen and name of the file that we would like to check out. Now if we run Git status, we'll see this file as modified in our directory.

    Open in a real environment →
  • Easy Cherry-Pick Specific Commit Video

    We have a git repository and we have a feature branch called feature. This feature branch contains a fix that fixes the bug on main, but we don't want to move all the commits from feature branch to the main. We just want to pick one commit and move only that. That's called cherry picking. We need to navigate to this repository, identify the commit that fixes the main branch and move it to the main from feature. See the git log to identify the commit that fixes the bug on main, so this commit has message fix critical bug. Next, we need to type git cherry pick and hash of this commit. This commit now was added to our main branch.

    Open in a real environment →
  • Medium Restore File to Previous Version Video

    We have a Git repository where we have a file config gs. This file has been modified in the last two commits, but those two commits introduced a bug. We need to restore config gs to the version that it had two commits ago while not affecting any other file. Run Git log to see last five commits. Next, we'll preview config gs that was two commits ago, for this we'll type git show head meaning our current last commit and then the sign, and then two. Next, we'll type git checkout head tilde sign two and then two hyphens config gs, which will restore config gs that was two commits ago. Last thing, we need to commit our changes with restore config gs message.

    Open in a real environment →
  • Medium Create an Annotated Tag Video

    We have Git Repository located on this directory which is completed new version of our application. We need to create an annotated tag, and once it's created, we need to push that to remote repository with this name. First, move to this directory, view current commit. This is our latest commit at head of our branch. A lightweight flag just references some commit. It doesn't have its own SHA hash meaning that cannot reference the tag. Annotated tag, however, has its own SHA, so you can reference that. Verify that tag was created by typing Git Tag. And now push this.

    Open in a real environment →
  • Medium Add Git Submodule

    Integrate external repositories as submodules to manage dependencies without code duplication. Add submodules with git submodule add, configure .gitmodules file, initialize submodule directories, and commit configuration. Essential for managing shared libraries, vendor dependencies, monorepo structures, and maintaining decoupled version control across interdependent projects.

    Open in a real environment →
  • Medium Update Submodule to Latest Commit Video

    We have a repository interview repo that contains a submodule vendor details and submodule lets us nesting repositories. When submodule is used, it uses the specific commit. We need to update a submodule to the latest commit on its default branch. Run git submodule to see our submodule status. Next, run git fetch and then log to see commits in our submodule. We need to pull latest changes. Go to the parent and check status again. Those changes are not committed yet. We run git add and then git commit. Submodule is updated to the latest version.

    Open in a real environment →
  • Medium Stash Work, Fix Bug, Restore and Update Video

    Imagine a common scenario when you've been working on some git repository on the feature ui and suddenly you need to do something to fix authentication issue on the main branch. For this, you have to create new hotfix branch, commit changes into that branch and then merge that with main branch. While you've been working on feature ui, you cannot simply change the branch because you have uncommitted changes. Before moving to main branch, we need to stash those changes, meaning to put them aside. Next, move to our main branch and to fix our authentication issue, we first need to create hotfix branch. Move to the main branch and merge our hotfix with main. And finally we delete the hotfix branch since we don't need it anymore. We need to move back to our work that we paused meaning to feature ui, and then rebase it from main. Finally, we need to move back the work that we stashed aside. For that, we need to type git stash pop.

    Open in a real environment →
  • Medium Remove File from Entire Git History Video

    We've committed file secrets env, which contains some sensitive credentials and we have to remove this from entire git history. First we'll need to move to the repository directory. Check logs that contains secret env file. For this we'll type git log all to see all logs, one line to see them in one line. In order to filter it by the file name we type hyphen and then name of the file. To see exactly what was changed in those commits we can add hyphen p flag. We need to delete this from our current git history and for this we'll run command called git filter branch. We'll run force flag to change this in entire git history. And we'll use the filter flag that lets us run certain command. And the command is basic Linux syntax rm to remove file. Prune empty flag is used in certain cases that removing this will make the commit useless. Finally, we need to delete the history file, meaning we need to run rm rf to delete everything in the git refs original. In case if we'd like to change also this on the remote origin, we need to run git push force all.

    Open in a real environment →
  • Medium Merge Repositories Preserving Both Histories Video

    We have two separate git repositories, repo A and repo B, five and four commits respectively. They've been developed independently, so have different histories, and we need to create one monorepo, combine both of them and have full commit history. Important thing is we need to use subtree. Subtree is git subtree command used when we have some shared libraries or other shared resources when we do not want to merge everything into one monorepo, but rather have some repository and reference other repositories in a directory. We'll use git log one line. Create a directory and then initialize git inside this directory. We'll do git init. We'll do our first commit. We'll do empty commit, so we'll need to add a empty flag. Now we will need to integrate our directories as a subtree in our repository. Use git subtree add, and then we use prefix project A. Finally, verify our monorepo. Check it with Git logs.

    Open in a real environment →
  • Medium Fix Repository with Unrelated Histories Video

    The repository interview is in broken state. The local and remote branches diverged with no common ancestor, meaning they don't share the same history. When we use Git Push to push things to the remote main, it fails with non fast forward error. And the same happens when we try to pull from Main. Our task is to fix this repository, merge and linearize the unrelated histories using Rebase and create new single commit sequence. When we use Git merge, branch that was merged into the main branch retains the same commit hashes. When we use Rebase, those commit hashes get rewritten. Next we'll pull main and we'll Rebase not merge. The flag that we'll use in this case is allow unrelated histories. We need to resolve this issue. Those three are the main types of the conflicts: modify modify, modify delete, add add. Once this is done, we have to type Git rebase continue.

    Open in a real environment →
  • Medium Recover Lost Commits from Detached HEAD Video

    We had Git repository located under this directory and we've been in detached head state. When we switch the main branch, those three commits are now unreachable and we'd like to restore those three commits. When we do git log, we don't see those commits, so we need to find a solution to restore those commits and create a branch called recovered work where those commits will be listed. To see all the logs that we've done to the head of this branch, meaning the ones that were lost from the detached head or while we did git reset and so on, we can type command called git reflog. Git reflog shows us logs exactly for the head. In this git reflog we can see that we have much more than we have in git log. Since our task is to restore this with the branch recovered work, we have to create branch recovered work from some Git commit. We'll use this git commit's hash.

    Open in a real environment →

Spark (20)

  • Easy CSV and Partitions Video

    Spark is a big data processing framework. It is designed to process massive amounts of data across multiple computers at the same time. And instead of tables, it uses data frames. Our job here is simply to read a CSV file, then to find out how many partitions are created, and print that number. When Spark reads this file, it doesn't process it as one giant block. Instead, it splits it into smaller chunks called partitions. Each partition gets sent to a different executor. The default maximum partition size is 128 megabytes. RDD stands for Resilient Distributed Dataset. The RDD splits it into multiple partitions, and each partition holds a subset of the full data. RDD will convert the data frame into a format that can help us to access the underlying partition information. And getNumPartitions is a built-in method that returns the number of partitions as a number.

    Open in a real environment →
  • Easy Repartition Video

    Spark is a big data framework that is designed to process massive amounts of data across multiple computers at the same time. Instead of tables like in SQL, Spark uses data frames. We have only one file that is called orders.csv with 5,000 records. Our main goal here is to repartition the data frame to eight partitions and print the task count in the format that it equals to eight. When Spark reads this file, it doesn't process it as one giant block. Instead, it splits it into smaller chunks called partitions. Each partition gets sent to a different executor. An executor is a process that runs on its chunk of data independently. Repartition is a built-in Spark method that lets us manually control how many partitions our data frame has. We pass the number that we want, and Spark will redistribute all the data across exactly that many partitions. repartition.rdd will convert our data frame to an RDD format, which is a data structure of Spark that holds the partitioned data. getNumPartitions is a built-in function that counts and returns the number of partitions.

    Open in a real environment →
  • Easy Broadcast Join Video

    Spark is a big data framework that processes massive amounts of data across multiple computers at the same time. Instead of tables like in SQL, Spark uses data frames. We are given two files, orders.csv with 5,000 records, and customers.csv with 50 records. We need to join these two files together using a broadcast join, then count orders and print the number of distinct cities. We will use a regular inner join because we only want orders that have a matching customer. The only thing is how we perform that join. Instead of shuffling both data frames across the network, Spark takes the small data frame and sends a full copy of it to every worker. So each worker now has its own partition of the large data frame and small data frame. It means that it can perform the join right here without any movement. We don't want to shuffle this large data frame across the network. That's why we take the smallest one, because it is easier and cheaper to copy. We will also import the broadcast function from the library. header that is set to true uses the first row as column names, and inferSchema automatically detects data types for each column. Then we count orders per city with the help of group by.

    Open in a real environment →
  • Easy Correcting Social Media Posts Video

    Spark is a big data framework that is designed to process massive amounts of data across multiple computers at the same time. Instead of tables, Spark uses data frames. We have one file, posts.csv. Its data frame contains seven columns: text of the post, ID, date, amount of likes, comments, and shares, and platform where it was published. We need to go through every post and replace the word Python with PySpark in the text column. withColumn is a data frame method that modifies or replaces a specific column. It takes two arguments. The first one is the column, it is the text column, and second one is the value that we want to add. For the second argument, we use regexReplace function. This function takes three arguments. The first one is the column to search. Second is the word to find. It is Python. And third is word to replace it with, which is PySpark.

    Open in a real environment →
  • Easy Daily Category Sales Aggregation

    Master daily sales aggregation in PySpark. Learn how to join transaction tables with product catalogs and use multi-column GroupBy operations to calculate total quantities sold per category per day.

    Open in a real environment →
  • Easy Most Common Order Status Video

    Spark is a big data framework that processes massive amounts of data across multiple machines. Instead of tables, Spark uses data frames. We are given only one file, orders.csv, with 5,000 records. Each order has a status, like completed, canceled, ongoing, and so on. Our job here is to find which status appears most frequently. There are two types of transformations, narrow and wide. In narrow transformation, each row on the left executor goes directly to the row on the right executor. It means that no data moves between executors, and each partition stays on the same machine and gets processed independently. But when it comes to wide transformations, rows from one executor can end up on a completely different executor. Here, data is able to move between machines, and this is called a shuffle. The only problem is that wide transformation requires more network traffic and more time. When we use group by, we do the wide transformation because it makes the rows with the same status move onto the same executor. Count finds the number of rows in each group, and ordering by count, ascending set to false, which means that everything is sorted in descending order.

    Open in a real environment →
  • Easy Calculating Overtime Pay Video

    We will be calculating overtime pay. Spark is a big data framework that processes large amounts of data across multiple machines. We are given two files: employees.csv and payroll.csv. First data frame contains the names of employees, their IDs, ages, and job positions. The second one stores hourly rate, amount of hours worked, and reference to an employee by ID. Our job is to calculate total pay for each employee, which consists of two rules. If an employee worked less or equal to 40 hours, then the total pay is the product of hours worked and hourly rate. But if an employee worked more than 40 hours, then all the extra hours need to be multiplied not by the regular rate, but by the one that is 1.5 times bigger. We join the two data frames on employee ID using inner join. To the result data frame, we add a new column that is called Pay using the withColumn method. When is Spark's version of if/else statement. Else statement in Spark is replaced with otherwise.

    Open in a real environment →
  • Medium Cache and Performance Video

    Spark is a big data framework that processes massive amounts of data across multiple machines. Instead of tables, it uses data frames. We have one file, orders.csv, with 50,000 records. Our job here is to cache the data frame, run count twice, measure how long each run takes, and print the results. Spark uses lazy evaluation, which means that it doesn't execute anything when we write a transformation. A transformation is simply an operation that modifies or processes our data. Lazy evaluation doesn't run anything until we specifically ask for a result. If we don't do the caching, then every single iteration reads from the input and produces an output independently. But when we do the caching, then the input is read only once and stored in the distributed memory. The first count reads the CSV file and caches the result, while second one reads from that memory and makes everything much faster.

    Open in a real environment →
  • Medium Filter Popular Videos Video

    We will need to filter popular videos. Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We are given one CSV file that is called Videos. The data frame contains six columns: title of the video, its ID, genre, release year, duration, and number of views. We need to filter everything and keep only videos that have more than one million views and were released in 2019 or later. The result should be saved as result_df. We read the CSV file to a given path and store it in a variable called df for data frame. When header is set to true, it uses first row as column names. At the same time, inferSchema automatically detects data types of those columns. For the condition, we use filter method within data frame and store it in result_df. The condition consists of two statements. First one checks if number of views is greater than one million, and second one ensures that the release year is after 2019. In between, we use AND operator that requires both of the statements to be true.

    Open in a real environment →
  • Medium Anonymize User PII Video

    PII stands for Personally Identifiable Information, data like emails and phone numbers that can identify a real person. We are given only one file that is called users.csv, and this data frame contains three columns: email, phone number of the user, along with referenced_by ID. We are required to do two things. First, we extract the domain from the email address, which means that we keep everything after the @ symbol. Then we need to hide the first digits of the phone number, and only the last four should be visible. Regex stands for regular expression. It is a pattern that is used to search, extract, or replace specific text inside of a string. Regexp_extract is a Spark function that takes a specific part of a string using a regex pattern. We will use the regexp_replace function that simply replaces part of the string. WithColumn will take the results and create a new column.

    Open in a real environment →
  • Medium Call Center Daily Stats

    Join call records with customer data and compute daily aggregates using countDistinct and sum.

    Open in a real environment →
  • Medium Venture Capital Sector Analysis Video

    Spark is a framework that processes large amounts of data across multiple machines. At the same time, instead of tables, it uses data frames. We are given two files, companies and investments. Our job here is to find the total investment amount for each industry sector and sort from highest to lowest. We will first combine both of the data frames, then we will sum investments per industries, and in the end, we will sort everything in descending order. Inner join returns only rows where a match exists in both tables or data frames. We will use inner join because for every investment row, we want the company's name and industry to be right next to it. We group by industry column so that all rows that share the same field are put together. Sum function simply adds up all amount values and finds the total investment per industry. The third step is that we order everything by total investment column in descending order.

    Open in a real environment →
  • Medium Window Functions without Partitions

    Master global sorting and sequential numbering in PySpark. Learn how to join DataFrames and use the row_number() window function across an entire unpartitioned dataset.

    Open in a real environment →
  • Medium Calculating PE Portfolio Values

    Master financial data aggregation in PySpark. Learn how to join relational tables, multiply columns to calculate holding values, and group by multiple dimensions to compute daily private equity portfolio totals.

    Open in a real environment →
  • Medium Top Products by Revenue Video

    Spark is a framework that processes large amounts of data and covers multiple machines at the same time. We are given two files, products and sales.csv. We are required to aggregate total revenue per product, then rank products within each category using dense rank function, and keep only the top three. The output should include the name of the product, category to which it belongs, revenue from the sale, and rank number. A window function performs a calculation across a set of related rows and keeps every single row in the result. This is the key difference from group by, which is an aggregate function that collapses multiple rows into one outcome per group. A window function instead just adds a new column with the calculation. We'll use group by product ID that will put all sales rows for the same product together. Sum will simply add up all revenue values for each group within revenue column. Partition by category will create invisible walls between each category. We sort everything out in descending order by revenue so that the highest amount will get rank one. Dense rank is a type of window function that assigns rank numbers. The thing about dense rank is that it never creates any gaps. If two products have the same amount of revenue, then both of them get rank one. We need to build sort of a condition using filter that will keep only products with rank three or less.

    Open in a real environment →
  • Medium Product Summary Video

    Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We are given three files, and their data frames are called products, sales, and inventory. We need to calculate total quantity sold and total revenue per product from sales. Then we find the total stock number per product across all warehouses. And we need to join everything to products data frame and make sure every product appears, even if it has no sales or inventory records. Any null values should be replaced with zero. We group by product ID so that all sales rows that belong to the same product are put together. Inner join returns only rows where a match exist in both tables or data frames. Left outer join returns everything from the left table, plus matching rows from the right. In our specific case, we'll use left join. It will keep all rows from the left table, which is products, regardless whether there is a match in sales. Coalesce function takes list of values and returns the first one that is not null. Lit function creates a literal value of zero.

    Open in a real environment →
  • Medium Parsing Comma-Separated Values Video

    Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We have only one file that is called background_checks.csv. Four of these columns store multiple values in one string separated by comma. We are required to count how many items are there in each of those strings and create four new columns to store these results. Everything should be sorted out in ascending order by ID. We read the CSV file into data frame using given path. When header is set to true, it uses first row of the data frame as column names, and inferSchema automatically detects data types. We use split function that takes a string and cuts it into an array using a separator. We wrap everything inside of size function that will count how many elements are in the array. We sort everything out in ascending order by check ID using order by.

    Open in a real environment →
  • Hard Mountain Climber Logs Video

    Spark is a framework that processes large amounts of data across multiple machines at the same time, and instead of tables, it uses data frames. We are given two files, mountain_info and mountain_climbers.csv. Our job here is to find the most recent climber for each mountain. We need to keep in mind that mountains with no climbing records should be excluded from the final result. A window function performs a calculation across a set of related rows and keeps every single row in the result. This is the key difference from group by, which is an aggregate function that collapses multiple rows into one outcome per group. We need to read both of the files into data frames. We set header to true, which means that the first row will include the column names, and inferSchema will automatically detect data types of those columns. Inner join returns only rows where a match exists in both tables or data frames. We want name of the mountain from info_df to be equal to the name of the mountain from climbers_df. Partition by mountain name creates sort of invisible walls between each mountain. We also sort everything in descending order by climb date so that most recent date comes first. Row number is type of window function that assigns a sequential number to each row. Filter method will help us to build a condition to keep only rows where RN is equal to one.

    Open in a real environment →
  • Hard Global & Domain SEO Leaders

    Master advanced PySpark Window functions by calculating both partition-level and global maximums. Learn how to use empty window partitions and conditional F.when() logic to identify top-performing SEO pages.

    Open in a real environment →
  • Hard Tracking Customer Purchase History Video

    Spark is a framework that processes large amounts of data across multiple machines at the same time. Instead of tables, it uses data frames. We have only one file that is called transactions.csv. Its data frame contains four columns, product ID, reference to the customer by ID, quantity, and date of transaction. For each transaction, we need to find what product that particular customer bought most recently before the current one. Then we create a new column that combines the date and that previous product in one string. A window function performs a calculation across a set of related rows and keeps every single row in the result. This is the key difference from group by, which is an aggregate function that collapses multiple rows into one outcome per group. Partition by customer ID will create invisible walls between customers. Lag is a type of window function that will look backwards and reach back to the previous row to grab its product ID value. Concat function joins multiple strings together.

    Open in a real environment →

Snowflake (22)

  • Easy Filter Popular Videos on a Streaming Platform Video

    We will filter popular videos on a streaming platform. Snowflake is a cloud-based data warehouse platform. Here, your query will look just like an SQL. We are given videos table, which contains video ID, title, genre, release year, duration, and view count columns. Our job is to retrieve those videos that have more than one million views and that were released in year 2019 or later. FROM clause, this is one of the things that differ regular SQL from Snowflake. It uses a dynamic reference with the ref function. Instead of naming the table, we are asking the system to find correct version of the table for us. Inside of the double curly braces, we put the ref function with the name of our table. We select all columns using select with a star. Our main goal was to filter the videos, and to do so, we should build sort of a condition inside of the WHERE clause. The first statement is that a number inside of view count column should be higher than one million. The second is that release year should be higher or equal to 2019. Since both of the statements are mandatory, we will use a logical operator.

    Open in a real environment →
  • Easy Replace Keywords in Social Media Post Text Video

    Snowflake is a cloud-based data warehouse platform. The syntax, the functions, the logic, everything is almost like in SQL. Our table, social media, contains information about different posts, its ID, text inside of it, date, amount of likes, comments, shares, and the name of the platform. Our job is simple. We go through every post and replace the word Python with PySpark. The only thing that is going to change is that inside of the text column, the word Python is replaced by PySpark. In Snowflake, it uses a dynamic reference with the ref function. Inside of the double curly braces, we put the ref function with the name of our table. Since our goal was to modify the text column, we need to use replace function here. Replace function takes three arguments: the string itself, the old value, and the new value. Replace basically scans through the string, finds the needed word Python, and swaps it with a new value.

    Open in a real environment →
  • Easy Filter Movies with Missing Box Office Data

    Practice filtering for NULL values in Snowflake SQL with this movie analytics interview question. You query a movies table to find rows where box office collection data is missing using IS NULL. Covers NULL handling, IS NULL, WHERE clause, and data quality checks in Snowflake. An easy-level question common in analytics interviews at companies like DoorDash.

    Open in a real environment →
  • Easy Daily Category Sales

    SELECT
    s.sale_date,
    p.category,
    SUM(s.quantity_sold) AS total_quantity
    FROM {{ ref("sales") }} s
    INNER JOIN {{ ref("products") }} p
    ON s.product_id = p.product_id
    GROUP BY s.sale_date, p.category
    ORDER BY s.sale_date ASC, p.category ASC

    Open in a real environment →
  • Easy Filter and Uppercase Artifacts

    SELECT
    ID,
    Item,
    Period,
    UPPER(Material) AS Material,
    Quantity
    FROM {{ ref("artifacts") }}
    WHERE Quantity > 100

    Open in a real environment →
  • Easy Merge Customer Records from Two Sources Video

    We will merge customer records from two sources. Snowflake is a cloud-based data warehouse that uses SQL. We are given two tables, customers from West and customers from East. Both of the tables contain the same columns and store information about the customers, such as full name, age, ID, and subscription plan type. Our job is simply to combine these two tables into one single result, even if they contain duplicate rows. FROM clause in Snowflake is a little bit different from SQL. It uses a dynamic reference with the ref function. Inside of the double curly braces, we put the ref function with the name of our table. In SQL and Snowflake, UNION operator is a set operation used to combine the result sets of two or more SELECT statements into a single unified output. If we use UNION separately, it would remove all duplicate rows. So using UNION ALL simply means that we want to keep all rows, even if they are repeated.

    Open in a real environment →
  • Easy Filter Funded Startups Video

    We need to write a query to filter funded startups. This is a Snowflake question. Snowflake is a cloud-based data warehouse that simply uses SQL. We are given two tables, investors and startups. Our job is to find investors whose average funding across all their startups is strictly greater than their personal funding limit. Those cases when limit equals to average funding needs to be excluded. The output should include three columns: investor name, its ID, and average funding number. We are more interested in inner join because it returns only the rows where there is a match in both tables. Within funding column from startups table, we implement average and round functions. Since we need to find the average number for each of the investing companies separately, we need to use a GROUP BY clause. We will put a condition inside of HAVING clause that works with groups and filters them. We can't use WHERE clause for the condition because there is a strict rule that WHERE clause can only come before GROUP BY clause, not after.

    Open in a real environment →
  • Medium Combine Customer Orders and Products Video

    We need to combine customer orders and products. Snowflake is mainly using the same concepts as in SQL. The only difference is that Snowflake is cloud-based. We are given three tables: customers, orders, and products. Our main goal here is to join all three tables together and return six unique columns. In Snowflake, we have to put the name of the table inside of the ref function, and everything has to be wrapped up in double curly braces. The reason behind is that Snowflake uses data build tool, which requires the ref function. Join connects two tables based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. We give aliases O to orders and C to customers table. Using on, we indicate which common columns were used to connect the tables. We use double pipe operator, or in other words, a concatenation sign, to get the full name.

    Open in a real environment →
  • Medium Anonymize User PII Data for a Social Media Platform Video

    We need to write a query to anonymize user PII data for a social media platform. This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all of its concepts. PII stands for Personally Identifiable Information. For each user, we need to extract only the domain part from the email address and anonymize the phone number by hiding the first six digits with asterisks and keeping only the last four. In Snowflake, we put the table inside of the ref function, and this whole thing is wrapped up in double curly braces. Split part function takes three arguments and splits a string into parts based on a delimiter and returns a specific part. Our delimiter will be @ symbol. We will use the right function. Right takes two arguments and extracts a specified number of characters from the right side of a string. Since phone is not a string, we need to use double colon varchar in order to convert number to a string. Using the concatenation sign that combines two strings, we connect them together.

    Open in a real environment →
  • Medium Product Sales and Inventory Data Video

    This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given three tables, products, sales, and inventory. All three tables are connected through product ID, since this is a common column. First, we need to calculate total quantity and total revenue per product from sales. Then we need to calculate total stock per product from inventory. And in the end, we need to join everything with products table, making sure every single product appears in the result, even if it has no sales or inventory records. If we try to join everything directly without aggregating first, we would get duplicate rows and completely wrong totals. A CTE, or in other words, a common table expression, is a temporary result set that we define at the top of our query. A left join keeps all rows from the left table, even if there is no matching row in the right table. The missing values just become null. Coalesce function takes a list of values and returns the first one that is not null.

    Open in a real environment →
  • Medium Products and Duplicates

    Practice a medium difficulty Snowflake SQL interview question tagged JPMorgan that tests deduplication, CTEs, and INNER JOIN. In a manufacturing scenario, you must remove duplicate rows from two tables using SELECT DISTINCT or ROW_NUMBER before joining them on a shared key. This question evaluates your ability to handle data quality issues before performing joins, a skill commonly tested in financial services and data engineering interviews.

    Open in a real environment →
  • Medium Mortgage Rate Calculator

    WITH aggregated AS (
    SELECT
    md.mortgage_type,
    ROUND(SUM(md.interest_rate) / COUNT(um.user_id), 2) AS rate_of_mortgage
    FROM {{ ref("mortgage_details") }} md
    INNER JOIN {{ ref("user_mortgages") }} um
    ON md.mortgage_id = um.mortgage_id
    GROUP BY md.mortgage_type
    )
    SELECT * FROM aggregated

    Open in a real environment →
  • Medium Weekend Order Detection Video

    This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given two tables, orders and products. Our job here is to parse the order date column from string into a real date format, and then drop any invalid dates. After that, we need to detect whether the order was placed on a weekend or not. If it's from Monday to Friday, then the output is false. If it's Saturday or Sunday, then the output is true. We will use a CTE, in other words, a common table expression, which is a temporary result set that we define at the top of our query. In Snowflake, we use data build, which is a framework that manages and organizes tables. We put it inside of ref function, and this whole thing is wrapped in double curly braces. We are more interested in inner join because it returns only the rows where there is a match in both tables. Within parse date column, we implement try to date function. This function converts a string into a real date type. If the string is invalid, it simply returns a null value. Day of week is a Snowflake function that takes a date and returns a number that represents the day of the week. Sunday is assigned with zero, Saturday is assigned with six. We check if the day of the week result is either zero or six, meaning Sunday or Saturday.

    Open in a real environment →
  • Medium Flooring Company Data Video

    This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its concepts. We are given three tables: customers, orders, and products. Full name in the customers table stores first and last name together in one column, separated by space. We need to split them into two columns: first name and last name. Here we also have to split product info column into two separate ones. In Snowflake, we use data build tool, which is a framework that manages and organizes tables. We use the ref function wrapped in double curly braces, and then ref function finds the correct table in the Snowflake environment automatically. Join connects two tables based on a common column. We are more interested in inner join because it returns only the rows where there is a match in both tables. To split full name and product info columns, we will use split part function. This function takes three arguments, splits the string, and returns a specific part. Second argument is delimiter. It's basically the character where the cut happens.

    Open in a real environment →
  • Medium Assign Row Numbers to Authors per Paper Video

    We will assign row numbers to authors for each paper. This is a Snowflake question, which is a cloud-based data warehouse that uses all main concepts from SQL. We have two tables, research_papers and authors. Our job here is to join these two tables together and assign a sequential row number to each author per paper. The numbering must reset for every new paper so that the first author of each paper always gets row number one. In Snowflake, we put the table inside of ref function, and this whole thing is wrapped up in double curly braces. We are more interested in inner join because it returns only the rows where there is a match in both tables. Between these two tables, we implement inner join operator, and using on keyword, we indicate which columns are used to combine the tables. Row number assigns a sequential number to each row, starting from one. Over keyword simply activates the window function. Partition by divides all the rows into separate groups based on a column. In our case, it's paper_id. Row number will reset back to one at the start of each new group. And then we sort everything out in ascending order by author_id.

    Open in a real environment →
  • Medium Amusement Park Rating Anomalies Video

    We are given two tables, rides and visitors. Our job here is to find rides whose average rating is unusually high or unusually low compared to all other rides. We measure this using standard deviation. Standard deviation is a statistical measure that checks how spread out a set of values is from their average. Our query will consist of three parts. First, we will calculate the average rating per ride. Second, we will calculate global mean and standard deviation across all ride averages. And third, we will use those stats to flag anomalies. A CTE, or common table expression, is a temporary result set that we define at the top of our query. For our task, we are more interested in inner join because it returns only the rows where there is a match in both tables. A window function, unlike group by that collapses all rows into one, performs a calculation across multiple rows and keeps every single row in the result. Stddev is a built-in SQL and Snowflake function that calculates standard deviation. We wrap it inside of ABS, which stands for absolute value. Then we compare if that distance is greater than global standard deviation. If the distance is greater than one standard deviation, the ride is considered too far from the norm and gets true.

    Open in a real environment →
  • Medium Usage and Accuracy per Model Type Video

    This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We have two tables, models and usage logs. Both of the tables are connected through a model ID, which is a common column. The requirements are calculating the total users by summing all the usage log entries, and at the same time, to find the average accuracy for each model. We will store the results in sort of a temporary table that is called CTE. CTE is short for Common Table Expression. It's a temporary result set that we define at the top of our query inside of WITH clause. From usage logs table, we first select model ID, and then we build a new column called total users and implement sum function within users column. Grouping by model ID would collapse all rows sharing the same ID into one single group. For this last column, we will implement window function with average. Partition by model type will divide all rows into separate groups based on their model type. We implement the round function and wrap the window function inside of it.

    Open in a real environment →
  • Medium Find the Last Climber per Mountain Video

    We need to find the last climber per mountain. This is a Snowflake question, which is a cloud-based data warehouse that uses SQL. We have two tables, mountains and climbers. Both of the tables are connected through mountain name. Our job here is to find the most recent climber for each mountain. We only want mountains that have at least one climbing record. We need to rank all climbers per mountain by date from most recent to oldest, and then pick the climber with rank one. In Snowflake, we use database tool, which organizes and manages the tables. We put the table inside of ref function wrapped in double curly braces. We are more interested in inner join because it returns only the rows where there is a match in both tables. Partition by mountain name would divide rows into separate groups, and each group would indicate one mountain. Then row number would reset to one at the start of each new mountain group. We sort everything out in descending order by climb date so that the climber with the most recent date gets row number one.

    Open in a real environment →
  • Hard Rank Top Products by Revenue per Category Video

    We need to rank top products by revenue for each category. This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given two tables: products and sales. First, we need to calculate how much revenue each product generated. Then we rank them within their category from highest to lowest, and in the end, we keep only the top three per category. A CTE, or in other words, common table expression, is a temporary result set that we define at the top of our query. In Snowflake, we use database tool, which organizes and manages tables. We put the name of the table inside of ref function, and this whole thing is wrapped in double curly braces. We need to combine them using JOIN. We are more interested in inner join because it returns only the rows where there is a match in both tables. Dense rank is a function that assigns rank numbers to rows. The reason why we use dense rank instead of regular rank is that dense rank never creates any gaps. Partition by category creates sort of invisible walls between each category. We put a condition inside the WHERE clause that rank has to be equal or lower than three.

    Open in a real environment →
  • Hard Highest SEO Score Pages per Domain

    Practice a hard Snowflake SQL interview question tagged Cisco that tests window functions, aggregation, and conditional logic. Working with an SEO analytics dataset, you must find the highest scoring page per domain and identify the overall best page across all domains using CASE expressions and window functions. This question evaluates your ability to combine per-group and global rankings in a single query.

    Open in a real environment →
  • Hard Math Expressions

    Practice a Snowflake SQL interview question tagged IBM that tests regular expressions and pattern matching with RLIKE or REGEXP. In a data validation scenario, you filter user-submitted text to keep only valid arithmetic expressions containing digits and operators. This hard difficulty question covers regex patterns in Snowflake, RLIKE for row filtering, and input validation logic commonly asked in data quality and engineering interviews.

    Open in a real environment →
  • Hard Track Product Purchases Video

    This is a Snowflake question, which is a cloud-based data warehouse that uses SQL and all its main concepts. We are given only one table that is called Transactions. Each row represents one purchase made by a customer. First, for each transaction, we need to find what product the customer bought most recently before the current one. Second, we need to create a new column, combine the date and that previous product into one string, and use the word none when there is no previous product. A CTE, or Common Table Expression, is a temporary result set that we define at the top of our query. Lag specifically within product ID row looks backwards, reaches to the previous row, and grabs its value. Partition by creates sort of a group for each customer. We wrap previous product in COALESCE function. COALESCE takes a list of values and returns the first one that is not null.

    Open in a real environment →

Python (19)

  • Easy CSV Row Filter and Count Video

    We have one customers.csv file. Each row represents one customer and has a status field that can be active or something else. We are required to read the file, then count how many customers have the active status, and write that number to a text file. First of all, we import the CSV module that is designed for reading and writing CSV files. We define a new counter variable that we will call active_count, and we set it to zero. Using Python's built-in open function, we will enter the customers CSV file. It takes two arguments, the file path and r that stands for read mode. Then we create a DictReader from the open file that will read the first row as column header, and then every row as Python dictionary. Using for loop, we go through every row in the CSV file and check whether the value in the status column is equal to active or not.

    Open in a real environment →
  • Easy Analyze Sales Dataset Dimensions and Calculate Total Revenue Video

    We will analyze sales dataset dimensions and calculate total revenue. Pandas is a library that was specifically designed for data analysis and manipulation. We are given one CSV file that is called Sales Data. Our job is, first of all, to analyze its size, classify it as small, medium, or large based on total amount of sales, and then calculate total revenue. Everything should be saved as JSON report. We need to read it into a data frame. Data frame is simply a table in memory with rows and columns. For rows and columns, we will implement the len function that simply returns the number of items in any list or collection. In order to find the total number of sales, we need to multiply rows by columns. Small one is considered the one with less than 10,000 sales, then medium in range between 10,000 and 99,999, and large, everything that exceeds 100,000 sales. In order to find the revenue, we need to multiply the quantity by price. We will use the sum function to add up all revenue amounts. Everything is wrapped inside of round function.

    Open in a real environment →
  • Easy Sort Avro Employee Records by Salary Video

    We'll need to sort Avro employee records by salary. Avro stores data in a compressed binary format, which makes it much more efficient for large data sets. The schema travels with the data within the same file and never gets separated. An Avro file is made up of a header that is followed by multiple blocks. We are given one file that is called employees.avro. Our goal here is to read the given file, sort the records by salary in descending order, and save the result as JSON file. We import two things, Reader function from FastAvro library that will help us to read the file, and then JSON. To read the file, we will use open function that takes two arguments. First is file path, and second is mode. In our case, it is rb, that stands for read binary mode. We create the Avro reader from the open file using the Reader function. We need to sort that list by salary using sorted function. As a second argument, we implement a small lambda function that for each X employee returns X salary. In third argument, we set reverse to True, which will sort everything out in descending order.

    Open in a real environment →
  • Easy Count User Events from JSON Activity Logs Video

    We will count user events from JSON activity logs. We are given one JSON file that is called activity_logs. JSON stands for JavaScript Object Notation, and it is a text-based format that stores structured data. Our job here is to count how many events each user performed and save the results as a new JSON file. First thing that has to be done is importing the JSON module, which is a built-in Python library for reading and writing JSON files. We will use the open function. It takes two arguments. First is the path to the file, and second is the mode we need. Load will read the entire JSON file and convert it into a Python object. A dictionary in Python is a data structure that stores information as key value pairs. We need to use for loop to go through every log entry. Dump function converts a Python object back to a JSON file. To run the script, we type python3 and path to the file.

    Open in a real environment →
  • Easy Split Delimited Column into Separate Columns with Pandas Video

    We will split the delimited column into separate columns using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We have one CSV file that is called Data, which contains two columns: ID and Full Info. The Full Info column stores multiple pieces of information in one single string, separated by commas. We need to split each string and turn all these pieces into independent columns. We import Pandas library and alias it as pd. We need to read it into DataFrame, which is basically a table in memory with columns and rows. Split in Pandas takes one combined column and expands it into multiple separate columns. It does the splitting whenever it sees the delimiter. Str in Pandas is a special accessor that lets us apply the string operation to every value. We implement the split function. First argument is comma, our delimiter. After setting expand to true, each split piece will get into separate column. Drop function removes a column or row from the DataFrame, depending on how we set the axis. Finally, we save the result to a new CSV file called Split Data.

    Open in a real environment →
  • Easy Compare SQLite Database and CSV File Records Video

    We will compare database and CSV file records using SQLite. We have two data sources. The first one is a CSV file. The second source is a SQLite database. SQLite is a lightweight database that stores everything in a single file. We can run it directly inside of our Python program. We are required to find the customer IDs that are out of sync and save the comparison report as a JSON file. We import three things, SQLite 3 library, that will help us to connect and make queries with our database. Then we import CSV to read those files row by row, and JSON to save the comparison report. A set in Python is basically a data structure that stores unique values. Using connect function, we will open the database file. We create a cursor, which is a tool that executes SQL statements. Fetchall function will retrieve all the results as a list of tuples. We will use the concept of set subtraction, and in order to find the extra IDs in the SQL file, we will have to subtract the CSV IDs from the database. Dump function will convert the dictionary to a JSON format.

    Open in a real environment →
  • Easy Analyze DataFrame Memory Usage Video

    We'll analyze DataFrame memory usage using Pandas. We are given one CSV file that is called Sales Data. Our job here is to calculate how much memory each column from the initial file uses and save that report to a new CSV file that will be called Memory Report. The first thing that we do is importing Pandas and alias it as pd. Then we need to read the initial CSV file into a Pandas DataFrame that will be called df. We will implement the memory_usage method that will calculate how many bytes of RAM each column is using. We will set deep to true, and in the calculations, the actual content will be considered. Our current result is in bytes, and we will need to convert it to megabytes. We will need to divide it, first of all by 1,024, and by that, we get the result in kilobytes. Then we divide by 1,024 again to convert it to megabytes. The last thing that we have to do is saving our report to CSV file.

    Open in a real environment →
  • Easy Calculate Descriptive Statistics for Numeric Columns in Pandas Video

    We need to calculate descriptive statistics for numeric columns in Pandas. We are given one CSV file that is called Sales Data. It contains e-commerce transaction information with both numeric and non-numeric columns. Descriptive statistics are numbers that summarize a data set. Mean is an average value of all numbers, median is the middle value when sorted from low to high, STD is mathematical standard deviation, min for smallest value, max for largest value, and then the percentages. Our main goal is to calculate these eight descriptive statistics for all numeric columns and save the result as a CSV report with all values rounded to two decimal places. In order to work with the CSV file, it should be read into a data frame. Select_dtype is a Pandas method that filters columns by their data type. Include number means that we only want to keep columns that contains integers and floats. We save the report as a CSV file using to_csv function, and we set index to false to prevent Pandas from adding an extra row number column.

    Open in a real environment →
  • Easy Parse JSON Log Files and Extract Fields to CSV Video

    We need to parse JSON log files and extract fields to CSV. A log file records events that happen in application over some time. Our log file stores each entry in JSON, which is a text-based format that stores structured data. Our job here is to read each line, then extract five specific fields that are timestamp, level, user and request ID, and message. We save the results in a CSV file. We need to import two built-in Python libraries. First of all, JSON that is used for reading JSON formatted text, and then CSV for writing CSV files. We need to open and read the log file. For this reason, we will use the open function that takes two arguments. First is the path to the file, and second is the mode that we want to use. Using for loop, we go through each line, and in return, we get one line as a string. When we use load as function, it will convert the JSON string into a Python object. Here we will implement the get method. Get looks up for a key in the dictionary and returns its value. But if the key doesn't exist, it returns a default value that we define as a second argument. For request ID, it is basically nested inside of another object, so we will implement the get method twice here. DictWriter will convert the dictionary to CSV file. Write header method puts the first row as column names.

    Open in a real environment →
  • Easy Extract Schema Information from Parquet File Using PyArrow Video

    We'll extract schema information from Parquet file using PyArrow. PyArrow is a Python library that works with columnar data formats, especially with Parquet files. It gives more access to files' metadata, things like schema, row groups, file statistics, and so on. Parquet stores data column by column instead of doing that row by row. It will look at the data stored in each group, analyze the minimum and maximum values for each column, and skip the entire row groups that cannot possibly contain matching data. Here we are required to extract the metadata about the file, column names, data types, amount of rows, size of the file, and type of compression. The result should be saved as JSON report. We will import pyarrow.parquet, that will be aliased as pq. Then we use schema_arrow function to get the full picture of what columns exist and what data type each column has. Num_rows function gives us the total amount of rows in the file. Compression function will return the name of codec, like Snappy or Gzip.

    Open in a real environment →
  • Easy Select Specific Columns from Parquet File Video

    We will select specific columns from Parquet file using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. Parquet stores data column by column instead of doing that row by row. It is good when we want to read only some specific columns and skip everything else. It will look at the data stored in each group, analyze the minimum and maximum values for each column, and skip the entire row groups that cannot possibly contain matching data. We are required to return only five specific ones: ID, first and last name, email, and total amount of purchases. After selecting all five columns, everything needs to be saved as a new Parquet file called selected data. The first thing that we do is that we import the Pandas library and alias it as pd. Then we need to read the Parquet file into a data frame. In Pandas, when we want to select multiple columns, we pass a list of column names inside of double square brackets. Since Pandas automatically assigns row numbers to every data frame, we set index to false, which means that we do not include the data frame row numbers in the output.

    Open in a real environment →
  • Easy Merge Customer and Purchase Data Using Pandas Video

    We will merge customer and purchase data using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given two CSV files that are called customers and purchases. We are required to combine both of the initial files using left join and save the result to a new combined data CSV file. The first thing to do is that we import Pandas library and alias it as pd. In order to work with the files, we first have to read them into data frames. Merging Pandas connects two data frames based on a common column or index. It works the same way as JOIN in SQL. Left join returns all rows from the left data frame and the matched rows from the right data frame. If there is no match, then all those fields will contain null values. We use the merge method to connect customers to purchases. On indicates a column that was common for both of the data frames. We set index to false, which prevents Pandas from adding an extra row number column to the output.

    Open in a real environment →
  • Easy Convert Excel Files with Multiple Sheets to Individual CSV Files Video

    We will convert Excel files with multiple sheets to individual CSV files using Pandas. We have a directory that contains multiple Excel files with many sheets inside of it. We are required to convert every single sheet from Excel file into its own separate CSV file. The naming convention for each output is filename_sheetname.csv. We import Pandas library that we alias as PD, then OS library that interacts with the operating system. We use it to list all files in a directory and build a complete file path. And lastly, the Pathlib library that works with file paths. We define two path variables: excel_dir, the folder where our Excel files are stored, and output_dir, where we save the converted CSV files. We open the Excel file using the read_excel function, where as a first argument, we define the full path, and we set sheet name to None so that Pandas will read all sheets at once. We save the current sheet as a CSV file using to_csv function, and then we set index to false so that Pandas won't add an extra row number column to the output.

    Open in a real environment →
  • Medium Time-Series Rolling Window Analysis for Multi-Stock Price Data Video

    Here we will be working with Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We have one CSV file that is called Stock Data. It contains daily price information within multiple companies. Each row has a date, ticker, and closing price. Our job here is to calculate rolling statistics, mean, sum, and standard deviation for the period of 7, 14, and 30 days separately for each ticker. In the output, we should have three original columns, plus nine new columns that we'll build. Rolling method in Pandas provides window calculation and allows us to perform aggregate operations. In order to start doing any operations on the file, we need to read it into DataFrame. Using for loop, we go through three window sizes, 7, 14, and 30 days. Then we group by ticker so that all the calculations run separately for each of them. We implement the rolling method and set that window equals to window. Mean will calculate the average of all values in each window. We need to save that DataFrame to a new CSV file using to_csv function.

    Open in a real environment →
  • Medium Decompose Time-Series Data into Trend, Seasonal, and Residual Components Video

    We will need to decompose time series data into trend, seasonal, and residual components. We will be working mainly with Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given only one CSV file that is called Temperature Data. It contains daily statistics over multiple years. First one is trend, which is a long-term direction of the data that indicates whether temperature generally increases over the years or not. Seasonal is responsible for repeating pattern that happens on a regular cycle. Residual is everything that is left after removing the trend and seasonal components, which is mostly some unpredictable day-to-day variations. Each component should be saved into its own CSV file. We were required to use Statsmodels, which is a Python library for statistical analysis. We import this specific function, which is called seasonal decompose, that will perform our decomposition. Then we specify the model by setting it to additive. We define one full cycle as 365 days because this is an annual pattern for temperature.

    Open in a real environment →
  • Medium Flatten Nested Struct Columns in Parquet and Export to CSV Video

    We'll need to flatten nested struct columns in Parquet and export them to CSV using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We are given one Parquet file that contains customer data, but some columns are nested, which means that instead of a simple value like a string or number, they contain a whole object with multiple fields. Parquet stores data column by column instead of doing that row by row. Our job here is to extract the fields from the nested columns and turn them into independent ones. In order to start working with Parquet file, we need to read it into data frame that is called DF. JSONNormalize function in Pandas will take each dictionary from address column and expand it into separate columns. Next step is renaming the columns by adding address_ as a prefix. Using drop function, we will remove the initial nested columns. Now we will combine the results using concat function. We will save the result to a CSV file. For this reason, we will use to_csv function.

    Open in a real environment →
  • Hard Aggregate Time-Series Data into Fixed Time Windows Video

    We'll need to aggregate time series data into fixed time windows using Pandas. Pandas is a library that was specifically designed for data analysis and manipulation. We have one CSV file that is called sensor data. It contains temperature and humidity readings from a sensor collected at regular intervals. Our job is to group all readings into fixed 15-minute windows and calculate mean, minimum, and maximum values for both temperature and humidity within each group. Time window aggregation solves this by dividing time into equal fixed buckets. All records that fall within the same interval get grouped together, and we calculate statistics for that group. We read the CSV file into DataFrame called df. We convert it to a proper Pandas date-time type. We set the timestamp column as the DataFrame index. We start doing the aggregation with resample method. It divides time series data into fixed time windows. Aggregate function performs multiple operations at once on each group. We save the final aggregated DataFrame using to_csv.

    Open in a real environment →
  • Hard Interpolate Missing Values in Irregular Time-Series Sensor Data Video

    We will interpolate missing values in irregular time series sensor data using Pandas. We are given one CSV file that is called sensor_data. It contains temperature and humidity readings from IoT collected over time. Some of the readings are missing because of network issues or sensor failures, so those cells are empty. Our job is to fill in those missing values using interpolation and save the clean data. Interpolation is a way of estimating missing values based on the records around them. In our case, we will use the time-based interpolation that also accounts the irregular time intervals. Before the interpolation itself, we need to convert the timestamp column from a plain string into a proper date-time type. Then we set the timestamp column as the index of our DataFrame. It is very important specifically for time-based interpolation because it requires the date-time to be indexed. We reference to the temperature column and implement the interpolate method. We set method to time to use the time-based interpolation. We were required to round the interpolated values to one decimal place. Finally, we save the result as a CSV file using to_csv function.

    Open in a real environment →
  • Hard Remove Seasonal Effects from Time-Series Sales Data Video

    We'll remove seasonal effects from time series sales data using Pandas and Statsmodels. We are given one CSV file, sales data, that contains transactions over multiple years. The sales indicate the current seasonal pattern. We are required to remove those seasonal effects so that we can see if the company is actually growing or it is just a result of seasonal factors. Deseasonalization simply removes that repeating pattern so that we can see the real trend. We import two things, Pandas library, and seasonal decompose function from Statsmodels. Seasonal decompose is the function that splits our time and date into trend, seasonal, and residual components. We specify the decomposition model by setting it to additive, which simply means that the components will add up together. Since our data is monthly and seasons repeat every year, the period will be set to 12 months. In order to calculate the deseasonalized value, we need to sum the trend values with residual values.

    Open in a real environment →

More questions (11)

  • Easy Recursive Keyword Finder Video

    We have multiple applications that write their logs under var slash log. An issue with this approach is that since multiple logs file are constantly being written to this folder, checking them one by one would be slow and inefficient. We want to use this file to consolidate all the lines that contain errors and write them here. We can use grep and we'll use recursive flag to get error lines in var slash log folder, and it will include only files that are in dot log extension. What we left to do is send this into the consolidated error logs file.

    Open in a real environment →
  • Easy Docker Multi-Architecture Image Video

    When we build our container from this file, it builds it in architecture of our host system. Our task is to change the setup to build it in multiple architectures. We'll use Docker build with an instance named Multi Arc and we will use buildx create. We'll list current builders by typing Docker buildx ls. We have only default that builds it in our current platform's underlying os. So add new builder. Docker buildx create and then name multi arc. Then we use driver network host and then use to have it as default builder. Now attempt to build for our multi architecture setup. We'll run Docker buildx build. Verify this by typing Docker images.

    Open in a real environment →
  • Medium Debug SSH Lockout Video

    We have developer account dev that has been locked out of the server and security logs indicate that there were too many failed SSH identification attempts. We need to check logs and count exactly how many attempts this user had today. And once we have this number, we need to update the configuration to increase this allowed login attempts above this number. We'll need to check log for login attempts. We'll need to use sudo to view this as admin. We can use grep command to filter out the lines that we need. We can also count this with word count hyphen L, which will print us number of lines. We need to change SSHD daemon configuration file, which is located under etc ssh and then sshd config. Find something that says Max. Finally, restart sshd daemon using systemctl restart.

    Open in a real environment →
  • Medium Min Stack

    class MinStack:
    def init(self):
    self.stack = []
    self.min_stack = []

    def push(self, val: int) -> None:
        self.stack.append(val)
        if self.min_stack:
            self.min_stack.append(min(val, self.min_stack[-1]))
        else:
            self.min_stack.append(val)
    
    def pop(self) -> None:
        self.stack.pop()
        self.min_stack.pop()
    
    def top(self) -> int:
        return self.stack[-1]
    
    def getMin(self) -> int:
        return self.min_stack[-1]
    
    Open in a real environment →
  • Medium LRU Cache

    class Node:
    def init(self, key=0, val=0):
    self.key = key
    self.val = val
    self.prev = None
    self.next = None

    class LRUCache:
    def init(self, capacity: int):
    self.cap = capacity
    self.cache = {}

        self.left = Node()
        self.right = Node()
        self.left.next = self.right
        self.right.prev = self.left
    
    def remove(self, node):
        prev_node = node.prev
        next_node = node.next
        prev_node.next = next_node
        next_node.prev = prev_node
    
    def insert(self, node):
        prev_mru = self.right.prev
        prev_mru.next = node
        self.right.prev = node
        node.prev = prev_mru
        node.next = self.right
    
    def get(self, key: int) -> int:
        if key in self.cache:
            self.remove(self.cache[key])
            self.insert(self.cache[key])
            return self.cache[key].val
        return -1
    
    def put(self, key: int, value: int) -> None:
        if key in self.cache:
            self.remove(self.cache[key])
            
        self.cache[key] = Node(key, value)
        self.insert(self.cache[key])
        
        if len(self.cache) > self.cap:
            lru = self.left.next
            self.remove(lru)
            del self.cache[lru.key]
    
    Open in a real environment →
  • Medium Implement Trie (Prefix Tree)

    class TrieNode:
    def init(self):
    self.children = {}
    self.is_end_of_word = False

    class Trie:
    def init(self):
    self.root = TrieNode()

    def insert(self, word: str) -> None:
        curr = self.root
        for char in word:
            if char not in curr.children:
                curr.children[char] = TrieNode()
            curr = curr.children[char]
        curr.is_end_of_word = True
    
    def search(self, word: str) -> bool:
        curr = self.root
        for char in word:
            if char not in curr.children:
                return False
            curr = curr.children[char]
        return curr.is_end_of_word
    
    def startsWith(self, prefix: str) -> bool:
        curr = self.root
        for char in prefix:
            if char not in curr.children:
                return False
            curr = curr.children[char]
        return True
    
    Open in a real environment →
  • Medium Design Add and Search Words Data Structure

    class TrieNode:
    def init(self):
    self.children = {}
    self.is_end = False

    class WordDictionary:
    def init(self):
    self.root = TrieNode()

    def addWord(self, word: str) -> None:
        curr = self.root
        for char in word:
            if char not in curr.children:
                curr.children[char] = TrieNode()
            curr = curr.children[char]
        curr.is_end = True
    
    def search(self, word: str) -> bool:
        def dfs(j, root):
            curr = root
            for i in range(j, len(word)):
                char = word[i]
                if char == '.':
                    for child in curr.children.values():
                        if dfs(i + 1, child):
                            return True
                    return False
                else:
                    if char not in curr.children:
                        return False
                    curr = curr.children[char]
            return curr.is_end
            
        return dfs(0, self.root)
    
    Open in a real environment →
  • Medium Design Twitter

    class Twitter:
    def init(self):
    self.count = 0
    self.tweetMap = {}
    self.followMap = {}

    def postTweet(self, userId: int, tweetId: int) -> None:
        if userId not in self.tweetMap:
            self.tweetMap[userId] = []
        self.tweetMap[userId].append([self.count, tweetId])
        self.count -= 1
    
    def getNewsFeed(self, userId: int) -> list[int]:
        res = []
        minHeap = []
        
        if userId not in self.followMap:
            self.followMap[userId] = set()
            
        self.followMap[userId].add(userId)
        
        for followeeId in self.followMap[userId]:
            if followeeId in self.tweetMap and len(self.tweetMap[followeeId]) > 0:
                index = len(self.tweetMap[followeeId]) - 1
                count, tweetId = self.tweetMap[followeeId][index]
                heapq.heappush(minHeap, [count, tweetId, followeeId, index - 1])
                
        while minHeap and len(res) < 10:
            count, tweetId, followeeId, index = heapq.heappop(minHeap)
            res.append(tweetId)
            
            if index >= 0:
                next_count, next_tweetId = self.tweetMap[followeeId][index]
                heapq.heappush(minHeap, [next_count, next_tweetId, followeeId, index - 1])
                
        self.followMap[userId].remove(userId)
        return res
    
    def follow(self, followerId: int, followeeId: int) -> None:
        if followerId not in self.followMap:
            self.followMap[followerId] = set()
        self.followMap[followerId].add(followeeId)
    
    def unfollow(self, followerId: int, followeeId: int) -> None:
        if followerId in self.followMap and followeeId in self.followMap[followerId]:
            self.followMap[followerId].remove(followeeId)
    
    Open in a real environment →
  • Medium Subarray Sum Equals K

    def subarray_sum(nums: list[int], k: int) -> int:
    count = 0
    prefix_sum = 0
    prefix_map = {0: 1}

    for num in nums:
        prefix_sum += num
        diff = prefix_sum - k
        
        if diff in prefix_map:
            count += prefix_map[diff]
            
        prefix_map[prefix_sum] = prefix_map.get(prefix_sum, 0) + 1
        
    return count
    
    Open in a real environment →
  • Medium Continuous Subarray Sum

    def check_subarray_sum(nums: list[int], k: int) -> bool:
    remainder_map = {0: -1}
    prefix_sum = 0

    for i in range(len(nums)):
        prefix_sum += nums[i]
        remainder = prefix_sum % k
        
        if remainder in remainder_map:
            if i - remainder_map[remainder] >= 2:
                return True
        else:
            remainder_map[remainder] = i
            
    return False
    
    Open in a real environment →
  • Hard Sliding Window Median

    def median_sliding_window(nums: list[int], k: int) -> list[float]:
    small = []
    large = []
    lazy = {}

    for i in range(k):
        heapq.heappush(small, -nums[i])
        
    for i in range(k // 2):
        heapq.heappush(large, -heapq.heappop(small))
        
    def get_median():
        if k % 2 == 1:
            return float(-small[0])
        return (-small[0] + large[0]) / 2.0
        
    res = [get_median()]
    
    for i in range(k, len(nums)):
        out_num = nums[i - k]
        in_num = nums[i]
        
        lazy[out_num] = lazy.get(out_num, 0) + 1
        
        balance = 0
        
        if out_num <= -small[0]:
            balance -= 1
        else:
            balance += 1
            
        if small and in_num <= -small[0]:
            balance += 1
            heapq.heappush(small, -in_num)
        else:
            balance -= 1
            heapq.heappush(large, in_num)
            
        if balance < 0:
            heapq.heappush(small, -heapq.heappop(large))
        elif balance > 0:
            heapq.heappush(large, -heapq.heappop(small))
            
        while small and lazy.get(-small[0], 0) > 0:
            lazy[-small[0]] -= 1
            heapq.heappop(small)
            
        while large and lazy.get(large[0], 0) > 0:
            lazy[large[0]] -= 1
            heapq.heappop(large)
            
        res.append(get_median())
        
    return res
    
    Open in a real environment →
Optional

Bonus topics to round out your prep

DSA (69)

  • Easy Contains Duplicate

    def contains_duplicate(nums: list[int]) -> bool:
    seen = set()
    for n in nums:
    if n in seen:
    return True
    seen.add(n)
    return False

    Open in a real environment →
  • Easy Valid Anagram

    def is_anagram(s: str, t: str) -> bool:
    if len(s) != len(t):
    return False

    countS, countT = {}, {}
    
    for i in range(len(s)):
        countS[s[i]] = countS.get(s[i], 0) + 1
        countT[t[i]] = countT.get(t[i], 0) + 1
        
    return countS == countT
    
    Open in a real environment →
  • Easy Two Sum

    def two_sum(nums: list[int], target: int) -> list[int]:
    prev_map = {} # value -> index

    for i, n in enumerate(nums):
        diff = target - n
        if diff in prev_map:
            return [prev_map[diff], i]
        prev_map[n] = i
        
    return []
    
    Open in a real environment →
  • Easy Valid Palindrome

    def is_palindrome(s: str) -> bool:
    l, r = 0, len(s) - 1

    while l < r:
        if not s[l].isalnum():
            l += 1
            continue
        
        if not s[r].isalnum():
            r -= 1
            continue
        
        if s[l].lower() != s[r].lower():
            return False
        
        l += 1
        r -= 1
    
    return True
    
    Open in a real environment →
  • Easy Valid Parentheses

    def is_valid(s: str) -> bool:
    stack = []
    close_to_open = {")": "(", "]": "[", "}": "{"}

    for c in s:
        if c in close_to_open:
            if stack and stack[-1] == close_to_open[c]:
                stack.pop()
            else:
                return False
        else:
            stack.append(c)
            
    return True if not stack else False
    
    Open in a real environment →
  • Easy Binary Search

    def search(nums: list[int], target: int) -> int:
    l, r = 0, len(nums) - 1

    while l <= r:
        m = (l + r) // 2
        if nums[m] > target:
            r = m - 1
        elif nums[m] < target:
            l = m + 1
        else:
            return m
    return -1
    
    Open in a real environment →
  • Easy Merge Two Sorted Lists

    Definition for singly-linked list.

    class ListNode:

    def init(self, val=0, next=None):

    self.val = val

    self.next = next

    def merge_two_lists(list1: Optional[ListNode], list2: Optional[ListNode]) -> Optional[ListNode]:
    dummy = ListNode()
    tail = dummy

    while list1 and list2:
        if list1.val < list2.val:
            tail.next = list1
            list1 = list1.next
        else:
            tail.next = list2
            list2 = list2.next
        tail = tail.next
        
    if list1:
        tail.next = list1
    elif list2:
        tail.next = list2
        
    return dummy.next
    
    Open in a real environment →
  • Easy Invert Binary Tree

    Definition for a binary tree node.

    class TreeNode:

    def init(self, val=0, left=None, right=None):

    self.val = val

    self.left = left

    self.right = right

    def invert_tree(root: Optional[TreeNode]) -> Optional[TreeNode]:
    if not root:
    return None

    # Swap
    root.left, root.right = root.right, root.left
    
    # Recurse
    invert_tree(root.left)
    invert_tree(root.right)
    
    return root
    
    Open in a real environment →
  • Easy Maximum Depth of Binary Tree

    Definition for a binary tree node.

    class TreeNode:

    def init(self, val=0, left=None, right=None):

    self.val = val

    self.left = left

    self.right = right

    def max_depth(root: Optional[TreeNode]) -> int:
    if not root:
    return 0

    return 1 + max(max_depth(root.left), max_depth(root.right))
    
    Open in a real environment →
  • Easy Diameter of Binary Tree

    Definition for a binary tree node.

    class TreeNode:

    def init(self, val=0, left=None, right=None):

    self.val = val

    self.left = left

    self.right = right

    def diameter_of_binary_tree(root: Optional[TreeNode]) -> int:
    diameter = 0

    def height(node):
        nonlocal diameter
        if not node:
            return 0
        
        left_h = height(node.left)
        right_h = height(node.right)
        
        diameter = max(diameter, left_h + right_h)
        
        return 1 + max(left_h, right_h)
        
    height(root)
    return diameter
    
    Open in a real environment →
  • Easy Balanced Binary Tree

    Definition for a binary tree node.

    class TreeNode:

    def init(self, val=0, left=None, right=None):

    self.val = val

    self.left = left

    self.right = right

    def is_balanced(root: Optional[TreeNode]) -> bool:
    def dfs(node):
    if not node:
    return 0

        left = dfs(node.left)
        if left == -1: return -1
        
        right = dfs(node.right)
        if right == -1: return -1
        
        if abs(left - right) > 1:
            return -1
            
        return 1 + max(left, right)
        
    return dfs(root) != -1
    
    Open in a real environment →
  • Easy Same Tree

    Definition for a binary tree node.

    class TreeNode:

    def init(self, val=0, left=None, right=None):

    self.val = val

    self.left = left

    self.right = right

    def is_same_tree(p: Optional[TreeNode], q: Optional[TreeNode]) -> bool:
    if not p and not q:
    return True

    if not p or not q or p.val != q.val:
        return False
        
    return is_same_tree(p.left, q.left) and is_same_tree(p.right, q.right)
    
    Open in a real environment →
  • Easy Subtree of Another Tree

    Definition for a binary tree node.

    class TreeNode:

    def init(self, val=0, left=None, right=None):

    self.val = val

    self.left = left

    self.right = right

    def is_subtree(root: Optional[TreeNode], subRoot: Optional[TreeNode]) -> bool:
    if not subRoot:
    return True
    if not root:
    return False

    if is_same_tree(root, subRoot):
        return True
        
    return is_subtree(root.left, subRoot) or is_subtree(root.right, subRoot)
    

    def is_same_tree(p: Optional[TreeNode], q: Optional[TreeNode]) -> bool:
    if not p and not q:
    return True
    if not p or not q or p.val != q.val:
    return False
    return is_same_tree(p.left, q.left) and is_same_tree(p.right, q.right)

    Open in a real environment →
  • Easy Linked List Cycle

    Definition for singly-linked list.

    class ListNode:

    def init(self, x):

    self.val = x

    self.next = None

    def has_cycle(head: Optional[ListNode]) -> bool:
    slow, fast = head, head

    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next
        if slow == fast:
            return True
            
    return False
    
    Open in a real environment →
  • Easy Reverse Linked List

    Definition for singly-linked list.

    class ListNode:

    def init(self, val=0, next=None):

    self.val = val

    self.next = next

    def reverse_list(head: Optional[ListNode]) -> Optional[ListNode]:
    prev = None
    curr = head

    while curr:
        nxt = curr.next
        curr.next = prev
        prev = curr
        curr = nxt
        
    return prev
    
    Open in a real environment →
  • Easy Moving Average from Data Stream

    from collections import deque

    class MovingAverage:
    def init(self, size):
    self.size = size
    self.queue = deque()
    self.window_sum = 0

    def next(self, val):
        self.queue.append(val)
        self.window_sum += val
        if len(self.queue) > self.size:
            self.window_sum -= self.queue.popleft()
        return self.window_sum / len(self.queue)
    
    Open in a real environment →
  • Medium Group Anagrams

    def group_anagrams(strs: list[str]) -> list[list[str]]:
    anagram_map = {}
    for s in strs:
    count = [0] * 26
    for c in s:
    count[ord(c) - ord('a')] += 1
    signature = tuple(count)
    if signature not in anagram_map:
    anagram_map[signature] = []
    anagram_map[signature].append(s)
    return list(anagram_map.values())

    Open in a real environment →
  • Medium Top K Frequent Elements

    def top_k_frequent(nums: list[int], k: int) -> list[int]:
    count = {}
    freq = [[] for i in range(len(nums) + 1)]

    for n in nums:
        count[n] = 1 + count.get(n, 0)
        
    for n, c in count.items():
        freq[c].append(n)
        
    res = []
    for i in range(len(freq) - 1, 0, -1):
        for n in freq[i]:
            res.append(n)
            if len(res) == k:
                return res
                
    return res
    
    Open in a real environment →
  • Medium Product of Array Except Self

    def product_except_self(nums: list[int]) -> list[int]:
    res = [1] * len(nums)

    # Calculate prefix products
    prefix = 1
    for i in range(len(nums)):
        res[i] = prefix
        prefix *= nums[i]
    
    # Calculate suffix products and combine
    postfix = 1
    for i in range(len(nums) - 1, -1, -1):
        res[i] *= postfix
        postfix *= nums[i]
    
    return res
    
    Open in a real environment →
  • Medium Longest Consecutive Sequence

    def longest_consecutive(nums: list[int]) -> int:
    num_set = set(nums)
    longest = 0

    for n in num_set:
        # Only check for the start of a sequence
        if (n - 1) not in num_set:
            length = 1
            
            while (n + length) in num_set:
                length += 1
                
            longest = max(length, longest)
            
    return longest
    
    Open in a real environment →
  • Medium Two Sum II - Input Array Is Sorted

    def two_sum(numbers: list[int], target: int) -> list[int]:
    l, r = 0, len(numbers) - 1

    while l < r:
        cur_sum = numbers[l] + numbers[r]
    
        if cur_sum > target:
            r -= 1
        elif cur_sum < target:
            l += 1
        else:
            return [l + 1, r + 1]
            
    return []
    
    Open in a real environment →
  • Medium Three Sum

    def three_sum(nums: list[int]) -> list[list[int]]:
    res = []
    nums.sort()

    for i, a in enumerate(nums):
        if i > 0 and a == nums[i - 1]:
            continue
    
        l, r = i + 1, len(nums) - 1
        
        while l < r:
            three_sum_val = a + nums[l] + nums[r]
            
            if three_sum_val > 0:
                r -= 1
            elif three_sum_val < 0:
                l += 1
            else:
                res.append([a, nums[l], nums[r]])
                l += 1
                
                while l < r and nums[l] == nums[l - 1]:
                    l += 1
                    
    return res
    
    Open in a real environment →
  • Medium Container With Most Water

    def max_area(height: list[int]) -> int:
    l, r = 0, len(height) - 1
    res = 0

    while l < r:
        area = (r - l) * min(height[l], height[r])
        res = max(res, area)
        
        if height[l] < height[r]:
            l += 1
        else:
            r -= 1
            
    return res
    
    Open in a real environment →
  • Medium Longest Substring Without Repeating Characters

    def length_of_longest_substring(s: str) -> int:
    char_set = set()
    l = 0
    res = 0

    for r in range(len(s)):
        while s[r] in char_set:
            char_set.remove(s[l])
            l += 1
            
        char_set.add(s[r])
        res = max(res, r - l + 1)
        
    return res
    
    Open in a real environment →
  • Medium Longest Repeating Character Replacement

    def character_replacement(s: str, k: int) -> int:
    count = {}
    res = 0
    l = 0
    max_f = 0

    for r in range(len(s)):
        count[s[r]] = 1 + count.get(s[r], 0)
        max_f = max(max_f, count[s[r]])
        
        if (r - l + 1) - max_f > k:
            count[s[l]] -= 1
            l += 1
            
        res = max(res, r - l + 1)
        
    return res
    
    Open in a real environment →
  • Medium Permutation in String

    def check_inclusion(s1: str, s2: str) -> bool:
    if len(s1) > len(s2):
    return False

    s1_count = [0] * 26
    window_count = [0] * 26
    
    for i in range(len(s1)):
        s1_count[ord(s1[i]) - ord('a')] += 1
        window_count[ord(s2[i]) - ord('a')] += 1
        
    if s1_count == window_count:
        return True
        
    l = 0
    for r in range(len(s1), len(s2)):
        window_count[ord(s2[r]) - ord('a')] += 1
        window_count[ord(s2[l]) - ord('a')] -= 1
        l += 1
        
        if s1_count == window_count:
            return True
            
    return False
    
    Open in a real environment →
  • Medium Daily Temperatures

    def daily_temperatures(temperatures: list[int]) -> list[int]:
    res = [0] * len(temperatures)
    stack = [] # Stores indices

    for i, t in enumerate(temperatures):
        while stack and t > temperatures[stack[-1]]:
            prev_index = stack.pop()
            res[prev_index] = i - prev_index
        stack.append(i)
        
    return res
    
    Open in a real environment →
  • Medium Car Fleet

    def car_fleet(target: int, position: list[int], speed: list[int]) -> int:
    pair = [[p, s] for p, s in zip(position, speed)]
    stack = []

    # Sort by position (reverse) to process closest to target first
    for p, s in sorted(pair)[::-1]:
        stack.append((target - p) / s)
        if len(stack) >= 2 and stack[-1] <= stack[-2]:
            stack.pop()
            
    return len(stack)
    
    Open in a real environment →
  • Medium Search a 2D Matrix

    def search_matrix(matrix: list[list[int]], target: int) -> bool:
    rows, cols = len(matrix), len(matrix[0])
    l, r = 0, rows * cols - 1

    while l <= r:
        m = (l + r) // 2
        row, col = m // cols, m % cols
        val = matrix[row][col]
        
        if val > target:
            r = m - 1
        elif val < target:
            l = m + 1
        else:
            return True
            
    return False
    
    Open in a real environment →
  • Medium Koko Eating Bananas

    def min_eating_speed(piles: list[int], h: int) -> int:
    l, r = 1, max(piles)
    res = r

    while l <= r:
        k = (l + r) // 2
        hours = 0
        for p in piles:
            hours += (p + k - 1) // k
            
        if hours <= h:
            res = k
            r = k - 1
        else:
            l = k + 1
            
    return res
    
    Open in a real environment →
  • Medium Find Minimum in Rotated Sorted Array

    def find_min(nums: list[int]) -> int:
    l, r = 0, len(nums) - 1

    while l < r:
        m = (l + r) // 2
        if nums[m] > nums[r]:
            l = m + 1
        else:
            r = m
            
    return nums[l]
    
    Open in a real environment →
  • Medium Search in Rotated Sorted Array

    def search(nums: list[int], target: int) -> int:
    l, r = 0, len(nums) - 1

    while l <= r:
        m = (l + r) // 2
        if nums[m] == target:
            return m
            
        if nums[l] <= nums[m]:
            if nums[l] <= target < nums[m]:
                r = m - 1
            else:
                l = m + 1
        else:
            if nums[m] < target <= nums[r]:
                l = m + 1
            else:
                r = m - 1
                
    return -1
    
    Open in a real environment →
  • Medium Remove Nth Node From End of List

    Definition for singly-linked list.

    class ListNode:

    def init(self, val=0, next=None):

    self.val = val

    self.next = next

    def remove_nth_from_end(head: Optional[ListNode], n: int) -> Optional[ListNode]:
    dummy = ListNode(0, head)
    left = dummy
    right = head

    while n > 0 and right:
        right = right.next
        n -= 1
        
    while right:
        left = left.next
        right = right.next
        
    left.next = left.next.next
    
    return dummy.next
    
    Open in a real environment →
  • Medium Add Two Numbers

    Definition for singly-linked list.

    class ListNode:

    def init(self, val=0, next=None):

    self.val = val

    self.next = next

    def add_two_numbers(l1: Optional[ListNode], l2: Optional[ListNode]) -> Optional[ListNode]:
    dummy = ListNode(0)
    curr = dummy
    carry = 0

    while l1 or l2 or carry:
        val1 = l1.val if l1 else 0
        val2 = l2.val if l2 else 0
        
        total = val1 + val2 + carry
        carry = total // 10
        curr.next = ListNode(total % 10)
        
        curr = curr.next
        if l1: l1 = l1.next
        if l2: l2 = l2.next
        
    return dummy.next
    
    Open in a real environment →
  • Medium Lowest Common Ancestor of a BST

    Definition for a binary tree node.

    class TreeNode:

    def init(self, val=0, left=None, right=None):

    self.val = val

    self.left = left

    self.right = right

    def lowest_common_ancestor(root: 'TreeNode', p: 'TreeNode', q: 'TreeNode') -> 'TreeNode':
    curr = root

    while curr:
        if p.val > curr.val and q.val > curr.val:
            curr = curr.right
        elif p.val < curr.val and q.val < curr.val:
            curr = curr.left
        else:
            return curr
    
    Open in a real environment →
  • Medium Binary Tree Level Order Traversal

    Definition for a binary tree node.

    class TreeNode:

    def init(self, val=0, left=None, right=None):

    self.val = val

    self.left = left

    self.right = right

    from collections import deque

    def level_order(root: Optional[TreeNode]) -> list[list[int]]:
    if not root:
    return []

    res = []
    queue = deque([root])
    
    while queue:
        level_size = len(queue)
        current_level = []
        
        for _ in range(level_size):
            node = queue.popleft()
            current_level.append(node.val)
            
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
                
        res.append(current_level)
        
    return res
    
    Open in a real environment →
  • Medium Validate Binary Search Tree

    Definition for a binary tree node.

    class TreeNode:

    def init(self, val=0, left=None, right=None):

    self.val = val

    self.left = left

    self.right = right

    def is_valid_bst(root: Optional[TreeNode]) -> bool:
    def validate(node, low=-float('inf'), high=float('inf')):
    if not node:
    return True

        if node.val <= low or node.val >= high:
            return False
            
        return (validate(node.left, low, node.val) and 
                validate(node.right, node.val, high))
                
    return validate(root)
    
    Open in a real environment →
  • Medium Kth Smallest Element in a BST

    Definition for a binary tree node.

    class TreeNode:

    def init(self, val=0, left=None, right=None):

    self.val = val

    self.left = left

    self.right = right

    def kth_smallest(root: Optional[TreeNode], k: int) -> int:
    stack = []
    curr = root

    while curr or stack:
        while curr:
            stack.append(curr)
            curr = curr.left
            
        curr = stack.pop()
        k -= 1
        
        if k == 0:
            return curr.val
            
        curr = curr.right
        
    return -1
    
    Open in a real environment →
  • Medium K Closest Points to Origin

    def k_closest(points: list[list[int]], k: int) -> list[list[int]]:
    max_heap = []

    for x, y in points:
        dist = x**2 + y**2
        heapq.heappush(max_heap, (-dist, [x, y]))
        
        if len(max_heap) > k:
            heapq.heappop(max_heap)
            
    return [point for dist, point in max_heap]
    
    Open in a real environment →
  • Medium Kth Largest Element in an Array

    def find_kth_largest(nums: list[int], k: int) -> int:
    min_heap = []

    for num in nums:
        heapq.heappush(min_heap, num)
        
        if len(min_heap) > k:
            heapq.heappop(min_heap)
            
    return min_heap[0]
    
    Open in a real environment →
  • Medium Task Scheduler

    def least_interval(tasks: list[str], n: int) -> int:
    counts = Counter(tasks)

    max_freq = max(counts.values())
    
    max_freq_count = 0
    for count in counts.values():
        if count == max_freq:
            max_freq_count += 1
            
    required_time = (max_freq - 1) * (n + 1) + max_freq_count
    
    return max(len(tasks), required_time)
    
    Open in a real environment →
  • Medium Combination Sum

    def combination_sum(candidates: list[int], target: int) -> list[list[int]]:
    res = []

    def dfs(i, current_comb, total):
        if total == target:
            res.append(current_comb.copy())
            return
        if i >= len(candidates) or total > target:
            return
            
        current_comb.append(candidates[i])
        dfs(i, current_comb, total + candidates[i])
        
        current_comb.pop()
        dfs(i + 1, current_comb, total)
        
    dfs(0, [], 0)
    return res
    
    Open in a real environment →
  • Medium Permutations

    def permute(nums: list[int]) -> list[list[int]]:
    res = []

    def dfs(current_perm):
        if len(current_perm) == len(nums):
            res.append(current_perm.copy())
            return
            
        for num in nums:
            if num in current_perm:
                continue
                
            current_perm.append(num)
            dfs(current_perm)
            current_perm.pop()
            
    dfs([])
    return res
    
    Open in a real environment →
  • Medium Number of Islands

    def numIslands(grid: list[list[str]]) -> int:
    if not grid:
    return 0

    ROWS, COLS = len(grid), len(grid[0])
    islands = 0
    
    def bfs(r, c):
        q = deque()
        q.append((r, c))
        grid[r][c] = "0"
        
        while q:
            row, col = q.popleft()
            directions = [[1, 0], [-1, 0], [0, 1], [0, -1]]
            
            for dr, dc in directions:
                nr, nc = row + dr, col + dc
                if (0 <= nr < ROWS and 0 <= nc < COLS and grid[nr][nc] == "1"):
                    q.append((nr, nc))
                    grid[nr][nc] = "0"
                    
    for r in range(ROWS):
        for c in range(COLS):
            if grid[r][c] == "1":
                islands += 1
                bfs(r, c)
                
    return islands
    
    Open in a real environment →
  • Medium Course Schedule II

    def find_order(num_courses: int, prerequisites: list[list[int]]) -> list[int]:
    adj = {i: [] for i in range(num_courses)}
    indegree = {i: 0 for i in range(num_courses)}

    for crs, pre in prerequisites:
        adj[pre].append(crs)
        indegree[crs] += 1
        
    q = deque()
    for i in range(num_courses):
        if indegree[i] == 0:
            q.append(i)
            
    res = []
    while q:
        curr = q.popleft()
        res.append(curr)
        
        for neighbor in adj[curr]:
            indegree[neighbor] -= 1
            if indegree[neighbor] == 0:
                q.append(neighbor)
                
    if len(res) == num_courses:
        return res
    return []
    
    Open in a real environment →
  • Medium Graph Valid Tree

    def valid_tree(n: int, edges: list[list[int]]) -> bool:
    if len(edges) != n - 1:
    return False

    parent = [i for i in range(n)]
    
    def find(node):
        p = parent[node]
        while p != parent[p]:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p
        
    def union(n1, n2):
        p1, p2 = find(n1), find(n2)
        
        if p1 == p2:
            return False
            
        parent[p1] = p2
        return True
        
    for u, v in edges:
        if not union(u, v):
            return False
            
    return True
    
    Open in a real environment →
  • Medium Network Delay Time

    def network_delay_time(times: list[list[int]], n: int, k: int) -> int:
    adj = defaultdict(list)
    for u, v, w in times:
    adj[u].append((v, w))

    min_heap = [(0, k)]
    visited = set()
    total_time = 0
    
    while min_heap:
        time, node = heapq.heappop(min_heap)
        
        if node in visited:
            continue
            
        visited.add(node)
        total_time = time
        
        for neighbor, weight in adj[node]:
            if neighbor not in visited:
                heapq.heappush(min_heap, (time + weight, neighbor))
                
    return total_time if len(visited) == n else -1
    
    Open in a real environment →
  • Medium Jump Game

    def can_jump(nums: list[int]) -> bool:
    max_reachable = 0
    n = len(nums)

    for i in range(n):
        if i > max_reachable:
            return False
            
        if i + nums[i] > max_reachable:
            max_reachable = i + nums[i]
            
        if max_reachable >= n - 1:
            return True
            
    return True
    
    Open in a real environment →
  • Medium Jump Game II

    def jump(nums: list[int]) -> int:
    jumps = 0
    current_end = 0
    farthest = 0

    for i in range(len(nums) - 1):
        farthest = max(farthest, i + nums[i])
        
        if i == current_end:
            jumps += 1
            current_end = farthest
            
            if current_end >= len(nums) - 1:
                break
                
    return jumps
    
    Open in a real environment →
  • Medium Gas Station

    def can_complete_circuit(gas: list[int], cost: list[int]) -> int:
    if sum(gas) < sum(cost):
    return -1

    current_tank = 0
    start_index = 0
    
    for i in range(len(gas)):
        current_tank += gas[i] - cost[i]
        
        if current_tank < 0:
            start_index = i + 1
            current_tank = 0
            
    return start_index
    
    Open in a real environment →
  • Medium Partition Labels

    def partition_labels(s: str) -> list[int]:
    last_occurrence = {}

    for i, char in enumerate(s):
        last_occurrence[char] = i
        
    res = []
    size = 0
    end = 0
    
    for i, char in enumerate(s):
        size += 1
        end = max(end, last_occurrence[char])
        
        if i == end:
            res.append(size)
            size = 0
            
    return res
    
    Open in a real environment →
  • Medium Number of Connected Components in an Undirected Graph

    def count_components(n: int, edges: list[list[int]]) -> int:
    parent = [i for i in range(n)]
    rank = [1] * n

    def find(node):
        p = parent[node]
        while p != parent[p]:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p
        
    def union(n1, n2):
        p1, p2 = find(n1), find(n2)
        
        if p1 == p2:
            return 0
            
        if rank[p1] > rank[p2]:
            parent[p2] = p1
            rank[p1] += rank[p2]
        else:
            parent[p1] = p2
            rank[p2] += rank[p1]
            
        return 1
        
    components = n
    for u, v in edges:
        components -= union(u, v)
        
    return components
    
    Open in a real environment →
  • Medium Course Schedule

    def can_finish(num_courses: int, prerequisites: list[list[int]]) -> bool:
    adj = {i: [] for i in range(num_courses)}
    indegree = {i: 0 for i in range(num_courses)}

    for crs, pre in prerequisites:
        adj[pre].append(crs)
        indegree[crs] += 1
        
    q = deque()
    for i in range(num_courses):
        if indegree[i] == 0:
            q.append(i)
            
    completed_courses = 0
    
    while q:
        curr = q.popleft()
        completed_courses += 1
        
        for neighbor in adj[curr]:
            indegree[neighbor] -= 1
            if indegree[neighbor] == 0:
                q.append(neighbor)
                
    return completed_courses == num_courses
    
    Open in a real environment →
  • Medium Walls and Gates

    def walls_and_gates(rooms: list[list[int]]) -> list[list[int]]:
    if not rooms or not rooms[0]:
    return rooms

    rows, cols = len(rooms), len(rooms[0])
    q = deque()
    INF = 2147483647
    
    for r in range(rows):
        for c in range(cols):
            if rooms[r][c] == 0:
                q.append((r, c))
                
    directions = [[1, 0], [-1, 0], [0, 1], [0, -1]]
    
    while q:
        r, c = q.popleft()
        
        for dr, dc in directions:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and rooms[nr][nc] == INF:
                rooms[nr][nc] = rooms[r][c] + 1
                q.append((nr, nc))
                
    return rooms
    
    Open in a real environment →
  • Medium Surrounded Regions

    def solve(board: list[list[str]]) -> list[list[str]]:
    if not board or not board[0]:
    return board

    rows, cols = len(board), len(board[0])
    
    def dfs(r, c):
        if r < 0 or c < 0 or r >= rows or c >= cols or board[r][c] != "O":
            return
            
        board[r][c] = "T"
        
        dfs(r + 1, c)
        dfs(r - 1, c)
        dfs(r, c + 1)
        dfs(r, c - 1)
        
    for r in range(rows):
        dfs(r, 0)
        dfs(r, cols - 1)
        
    for c in range(cols):
        dfs(0, c)
        dfs(rows - 1, c)
        
    for r in range(rows):
        for c in range(cols):
            if board[r][c] == "O":
                board[r][c] = "X"
            elif board[r][c] == "T":
                board[r][c] = "O"
                
    return board
    
    Open in a real environment →
  • Medium Pacific Atlantic Water Flow

    def pacific_atlantic(heights: list[list[int]]) -> list[list[int]]:
    if not heights or not heights[0]:
    return []

    rows, cols = len(heights), len(heights[0])
    pacific = set()
    atlantic = set()
    
    def dfs(r, c, visited, prev_height):
        if (r < 0 or c < 0 or r >= rows or c >= cols or 
            (r, c) in visited or heights[r][c] < prev_height):
            return
            
        visited.add((r, c))
        
        dfs(r + 1, c, visited, heights[r][c])
        dfs(r - 1, c, visited, heights[r][c])
        dfs(r, c + 1, visited, heights[r][c])
        dfs(r, c - 1, visited, heights[r][c])
            
    for c in range(cols):
        dfs(0, c, pacific, heights[0][c])
        dfs(rows - 1, c, atlantic, heights[rows - 1][c])
        
    for r in range(rows):
        dfs(r, 0, pacific, heights[r][0])
        dfs(r, cols - 1, atlantic, heights[r][cols - 1])
        
    res = []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in pacific and (r, c) in atlantic:
                res.append([r, c])
                
    return res
    
    Open in a real environment →
  • Medium Max Area of Island

    def max_area_of_island(grid: list[list[int]]) -> int:
    ROWS, COLS = len(grid), len(grid[0])
    max_area = 0

    def dfs(r, c):
        if r < 0 or r >= ROWS or c < 0 or c >= COLS or grid[r][c] == 0:
            return 0
            
        grid[r][c] = 0
        
        return (1 + 
                dfs(r + 1, c) + 
                dfs(r - 1, c) + 
                dfs(r, c + 1) + 
                dfs(r, c - 1))
                
    for r in range(ROWS):
        for c in range(COLS):
            if grid[r][c] == 1:
                max_area = max(max_area, dfs(r, c))
                
    return max_area
    
    Open in a real environment →
  • Medium Clone Graph

    class GraphNode:

    def init(self, val = 0, neighbors = None):

    self.val = val

    self.neighbors = neighbors if neighbors is not None else []

    def clone_graph(node: 'GraphNode') -> 'GraphNode':
    old_to_new = {}

    def dfs(node):
        if not node:
            return None
        if node in old_to_new:
            return old_to_new[node]
            
        copy = GraphNode(node.val)
        old_to_new[node] = copy
        
        for nei in node.neighbors:
            copy.neighbors.append(dfs(nei))
            
        return copy
        
    return dfs(node)
    
    Open in a real environment →
  • Medium Subsets

    def subsets(nums: list[int]) -> list[list[int]]:
    res = []

    def dfs(i, current_subset):
        res.append(current_subset.copy())
        
        for j in range(i, len(nums)):
            current_subset.append(nums[j])
            dfs(j + 1, current_subset)
            current_subset.pop()
            
    dfs(0, [])
    return res
    
    Open in a real environment →
  • Medium Binary Tree Right Side View

    Definition for a binary tree node.

    class TreeNode:

    def init(self, val=0, left=None, right=None):

    self.val = val

    self.left = left

    self.right = right

    def right_side_view(root: Optional[TreeNode]) -> list[int]:
    if not root:
    return []

    res = []
    queue = deque([root])
    
    while queue:
        level_size = len(queue)
        rightmost_val = None
        
        for _ in range(level_size):
            node = queue.popleft()
            rightmost_val = node.val
            
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
                
        res.append(rightmost_val)
        
    return res
    
    Open in a real environment →
  • Medium Copy List with Random Pointer

    """

    Definition for a Node.

    class Node:
    def init(self, x: int, next: 'Node' = None, random: 'Node' = None):
    self.val = int(x)
    self.next = next
    self.random = random
    """

    def copy_random_list(head: 'Optional[Node]') -> 'Optional[Node]':
    if not head:
    return None

    # 1. Interweave
    curr = head
    while curr:
        new_node = Node(curr.val, curr.next)
        curr.next = new_node
        curr = new_node.next
        
    # 2. Assign random pointers
    curr = head
    while curr:
        if curr.random:
            curr.next.random = curr.random.next
        curr = curr.next.next
        
    # 3. Separate
    curr = head
    new_head = head.next
    while curr:
        copy = curr.next
        curr.next = copy.next
        if copy.next:
            copy.next = copy.next.next
        curr = curr.next
        
    return new_head
    
    Open in a real environment →
  • Medium Reorder List

    Definition for singly-linked list.

    class ListNode:

    def init(self, val=0, next=None):

    self.val = val

    self.next = next

    def reorder_list(head: Optional[ListNode]) -> Optional[ListNode]:
    if not head: return None

    # 1. Find middle
    slow, fast = head, head.next
    while fast and fast.next:
        slow = slow.next
        fast = fast.next.next
        
    # 2. Reverse second half
    second = slow.next
    prev = slow.next = None
    while second:
        tmp = second.next
        second.next = prev
        prev = second
        second = tmp
        
    # 3. Merge
    first, second = head, prev
    while second:
        tmp1, tmp2 = first.next, second.next
        first.next = second
        second.next = tmp1
        first, second = tmp1, tmp2
    return head
    
    Open in a real environment →
  • Medium Evaluate Reverse Polish Notation

    def eval_rpn(tokens: list[str]) -> int:
    stack = []

    for c in tokens:
        if c == "+":
            stack.append(stack.pop() + stack.pop())
        elif c == "-":
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
        elif c == "*":
            stack.append(stack.pop() * stack.pop())
        elif c == "/":
            b, a = stack.pop(), stack.pop()
            stack.append(int(a / b))
        else:
            stack.append(int(c))
            
    return stack[0]
    
    Open in a real environment →
  • Medium Accounts Merge

    class UnionFind:
    def init(self):
    self.parent = {}

    def find(self, x):
        if x not in self.parent:
            self.parent[x] = x
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]
        
    def union(self, x, y):
        rootX = self.find(x)
        rootY = self.find(y)
        if rootX != rootY:
            self.parent[rootY] = rootX
    

    def accounts_merge(accounts: list[list[str]]) -> list[list[str]]:
    uf = UnionFind()
    email_to_name = {}

    for acc in accounts:
        name = acc[0]
        first_email = acc[1]
        for i in range(1, len(acc)):
            email = acc[i]
            email_to_name[email] = name
            uf.union(first_email, email)
            
    merged_emails = {}
    for email in email_to_name:
        root = uf.find(email)
        if root not in merged_emails:
            merged_emails[root] = []
        merged_emails[root].append(email)
        
    res = []
    for root, emails in merged_emails.items():
        res.append([email_to_name[root]] + sorted(emails))
        
    return res
    
    Open in a real environment →
  • Medium Top K Frequent Elements in Stream

    import heapq
    from collections import defaultdict

    class TopKFrequent:
    def init(self, k):
    self.k = k
    self.freq = defaultdict(int)

    def add(self, num):
        self.freq[num] += 1
    
    def topK(self):
        items = [(-count, num) for num, count in self.freq.items()]
        top = heapq.nsmallest(self.k, items)
        return [num for _, num in top]
    
    Open in a real environment →
  • Medium Log Aggregator

    from collections import defaultdict
    import bisect

    class LogAggregator:
    def init(self):
    self.logs = defaultdict(list)

    def record(self, timestamp, key):
        self.logs[key].append(timestamp)
    
    def count(self, key, start, end):
        if key not in self.logs:
            return 0
        timestamps = self.logs[key]
        left = bisect.bisect_left(timestamps, start)
        right = bisect.bisect_right(timestamps, end)
        return right - left
    
    Open in a real environment →
  • Medium Event Stream Deduplicator

    class Deduplicator:
    def init(self, ttl):
    self.ttl = ttl
    self.seen = {}

    def process(self, timestamp, eventId):
        expired = [k for k, t in self.seen.items() if timestamp - t > self.ttl]
        for k in expired:
            del self.seen[k]
    
        if eventId in self.seen:
            self.seen[eventId] = timestamp
            return False
        self.seen[eventId] = timestamp
        return True
    
    Open in a real environment →
  • Medium Skew-Aware Key Partitioner

    class SkewHandler:
    def init(self, numBuckets, hotThreshold, splitFactor):
    self.n = numBuckets
    self.threshold = hotThreshold
    self.split = splitFactor
    self.counts = {}
    self.robin = {}
    self.load = [0] * numBuckets

    def _hash(self, key):
        h = 0
        for c in key:
            h = h * 31 + ord(c)
        return h % self.n
    
    def assign(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        base = self._hash(key)
    
        if self.counts[key] > self.threshold:
            idx = self.robin.get(key, 0)
            bucket = (base + idx) % self.n
            self.robin[key] = (idx + 1) % self.split
        else:
            bucket = base
    
        self.load[bucket] += 1
        return bucket
    
    def getLoad(self):
        return list(self.load)
    
    Open in a real environment →
  • Medium Hash Join Simulator

    from collections import defaultdict

    class HashJoin:
    def init(self):
    self.table = defaultdict(list)

    def build(self, rows, keyIndex):
        self.table.clear()
        for row in rows:
            self.table[row[keyIndex]].append(row)
    
    def probe(self, rows, keyIndex):
        result = []
        for row in rows:
            for build_row in self.table.get(row[keyIndex], []):
                result.append(build_row + row)
        result.sort()
        return result
    
    Open in a real environment →

Practice on real environments

Browse and filter the full Data Engineering catalog, or see questions asked at specific companies.