Question 1

Average Order Value

Accepted Answer

This SQL question is called Average Order Value. We have one table called Orders. It records customers' purchases and has four columns: order ID, date of transaction, a reference to the customer by ID, and total amount spent. Our job is to calculate, for each customer, the average amount they spent across all their orders, rounded to two decimal places. The output should include the customer ID and the average order value, sorted by customer ID in ascending order. We start with the FROM clause. Then we proceed with the GROUP BY clause, which separates individual orders into one group per customer. To choose the output columns, we use the SELECT clause. We wrap the total amount column in AVG, pass that result to ROUND as the first argument, and pass 2 as the second argument to get two decimal places. Finally, we order by customer ID.
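
The walkthrough above can be sketched as a small runnable example. It runs the query through Python's built-in sqlite3 module against an in-memory database; the column names (order_id, order_date, customer_id, total_amount) and the sample rows are assumptions made for illustration, not part of the original task.

```python
import sqlite3

# Hypothetical schema and sample data; the real column names may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        order_date   TEXT,
        customer_id  INTEGER,
        total_amount REAL
    );
    INSERT INTO orders VALUES
        (1, '2023-01-05', 1, 10.0),
        (2, '2023-01-09', 1, 20.5),
        (3, '2023-02-01', 2, 7.25);
""")

# AVG is computed per customer group, then ROUND trims it to two decimals.
avg_rows = conn.execute("""
    SELECT customer_id,
           ROUND(AVG(total_amount), 2) AS average_order_value
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id ASC
""").fetchall()

print(avg_rows)  # [(1, 15.25), (2, 7.25)]
```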

Question 2

Join Employees and Departments

Accepted Answer

We are given two tables, departments and employees, connected through the department ID column. Some employees might not be assigned to any department, so their department ID is null, but we still want them in our results. Our main goal is to return all employees who earn more than 50,000, together with their department name, sorted by hire date from most recent to oldest. To combine the two tables, we use a left join with employees as the left table. A left join keeps all rows from the left table even if there is no matching row in the right table; the missing values just become null. We put LEFT JOIN between the two table names and use ON to indicate which common columns connect them. We use the WHERE clause to filter the rows: we only want employees whose salary is greater than 50,000. Finally, we sort by hire date in descending order so that the most recent date comes first and the oldest last.

Question 3

Filter Orders by Date Range

Accepted Answer

We need to write a query to filter orders by date range. We are given one table called Orders. It has four columns: the customer's name, the order ID, the date of the transaction, and the total amount spent. We are required to return only the orders placed between January 1st, 2023, and June 30th, 2023. Both dates are inclusive, which means that orders on exactly those dates should be included too. We start with the FROM clause because it tells SQL which table we want to work with. Using the SELECT clause, we choose the columns for the output. Next comes the WHERE clause, which is the filter step: it goes through every row and checks whether the condition is true. For the order date column, we use BETWEEN ... AND. This combination checks whether a value falls within a range, including both the start and end values. Finally, we sort by order date in ascending order so that the rows with the earliest date come first.
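
As a rough illustration of the inclusive range check, here is a sketch using Python's built-in sqlite3 module. The column names and rows are assumed for the example, and dates are stored as ISO 'YYYY-MM-DD' text, which SQLite compares correctly as strings.

```python
import sqlite3

# Hypothetical schema and sample data, chosen to hit both boundary dates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id      INTEGER PRIMARY KEY,
        customer_name TEXT,
        order_date    TEXT,   -- ISO dates, so text order matches date order
        total_amount  REAL
    );
    INSERT INTO orders VALUES
        (1, 'Ann', '2022-12-31', 50.0),
        (2, 'Bob', '2023-01-01', 20.0),
        (3, 'Cy',  '2023-06-30', 35.0),
        (4, 'Dee', '2023-07-01', 15.0),
        (5, 'Eve', '2023-03-15', 80.0);
""")

# BETWEEN includes both endpoints, so orders 2 and 3 are kept.
range_rows = conn.execute("""
    SELECT order_id, customer_name, order_date, total_amount
    FROM orders
    WHERE order_date BETWEEN '2023-01-01' AND '2023-06-30'
    ORDER BY order_date ASC
""").fetchall()

in_range_ids = [row[0] for row in range_rows]
print(in_range_ids)  # [2, 5, 3]
```

If the column held timestamps with a time of day rather than plain dates, the upper bound would need more care, since a value later on June 30th would sort after '2023-06-30'.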

Question 4

Find Customers Without Orders

Accepted Answer

We need to find customers without orders. We have two tables, customers and orders, connected through the customer ID column. Our job is to find customers who exist in the customers table but have never placed a single order. We use a LEFT JOIN since we want all rows from the left table, customers, to stay. A LEFT JOIN keeps all rows from the left table, even if there is no matching row in the right table; the missing values just become null. We put LEFT JOIN between the two table names and add the ON keyword to indicate which common column combines the tables. In our case, it's customer ID. We proceed with the WHERE clause, which filters the results. We build a condition that checks whether order ID is null. We pick this specific column because if a customer has at least one order, their order ID will hold a real value after the LEFT JOIN, so a null there means the customer has no orders. Finally, we sort by customer name in ascending, that is alphabetical, order.
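
The LEFT JOIN plus IS NULL pattern described above can be sketched as follows, using Python's built-in sqlite3 module. Table contents and column names are assumptions for illustration.

```python
import sqlite3

# Hypothetical schema: only customer 1 has placed orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Zoe'), (2, 'Adam'), (3, 'Mia');
    INSERT INTO orders VALUES (10, 1), (11, 1);
""")

# Customers with no match get NULL in every orders column after the
# LEFT JOIN, so filtering on o.order_id IS NULL keeps only those customers.
no_order_rows = conn.execute("""
    SELECT c.customer_id, c.customer_name
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE o.order_id IS NULL
    ORDER BY c.customer_name ASC
""").fetchall()

print(no_order_rows)  # [(2, 'Adam'), (3, 'Mia')]
```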

Question 5

Use COALESCE for Null Handling

Accepted Answer

## How to Replace NULL with 0 in SQL and Retrieve All Orders from the `orders` Table

Writing clean and efficient SQL queries is an essential skill for database management and data analysis. If you're asked to retrieve all orders from an `orders` table while ensuring that any `NULL` values in the `discount` column are replaced with `0`, you need to structure your query in a few specific steps. Below is a guide to this task.

### Steps to Write the SQL Query

1. **Identify the Columns**: The `orders` table contains the columns `order_id`, `customer_name`, `discount`, and `total_amount`.
2. **Handle NULL Values**: Ensure that the `discount` column does not return any `NULL` values by using the `COALESCE` function, which lets you replace `NULL` values with `0`.
3. **Select All Required Columns**: Ensure that the query retrieves all the columns in the specified order: `order_id`, `customer_name`, `discount`, and `total_amount`.
4. **Order the Results**: Use the `ORDER BY` clause to sort the results by `order_id` in ascending order.

### Sample SQL Query

```sql
-- Replace NULL discounts with 0 and sort by order_id (step 4 above)
SELECT order_id, customer_name, COALESCE(discount, 0) AS discount, total_amount
FROM orders
ORDER BY order_id ASC;
```

Question 6

Merge Multiple Address Fields

Accepted Answer

We have a customers table with columns such as city, customer ID, first name, last name, postal code, state, and street address. Our job is to combine the address parts into a single column called full address. Some columns could have null values, meaning some customers might be missing a city, state, or postal code. Any missing component should be left out of the full address. It is also possible that all components are missing, in which case we should return an empty string. For this problem, we need the COALESCE function, which takes a list of values and returns the first one that is not null. We also need the CASE statement, which works like an if-else statement in any programming language. We use the concatenation operator to connect the components one by one. The last thing we have to do is sort the results in ascending order, which can be done using ORDER BY.

Question 7

String Concatenation in SELECT

Accepted Answer

Our job is to retrieve the full names of all employees by combining the first and last names. We have to make sure that a space is added in between and that the result is sorted in ascending order. In SQL, a string is just text, such as a name, a number, or an email, stored in a column with a type like varchar or text. Concatenation is simply joining two or more strings together into one output. If you want a space between your strings, you have to treat that space as its own separate string inside quotes. To return the full name column, we use the SELECT clause and call the CONCAT function inside it. We name the new column full name. The last thing we are required to do is sort in ascending order by full name. For this, we use the ORDER BY clause with the name of the column.

Question 8

Find Nth Highest Revenue

Accepted Answer

We will write a query to find the Nth highest revenue. We have a sales table with columns such as ID, product, and revenue. Our job is to find the third highest distinct revenue value. We need to keep in mind that products with the same revenue count as one value, not two. Our query consists of three steps. First, we get rid of duplicates; then we rank what's left from highest to lowest; and finally we pick the value in third place. A window function performs a calculation across multiple rows, but unlike GROUP BY, it does not collapse everything into one result. The OVER keyword just activates the window function. We sort the revenue column in descending order so that the highest value takes first place. We then filter for the row where the rank equals three, which gives us the value in the third position.
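
The three steps (deduplicate, rank, pick) might look like the sketch below, run through Python's built-in sqlite3 module. The CTE names, column names, and sample rows are assumptions, and DENSE_RANK is used here as one possible ranking function; window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical data: revenue 500 appears twice and must count once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, revenue INTEGER);
    INSERT INTO sales VALUES
        (1, 'A', 500), (2, 'B', 300), (3, 'C', 500), (4, 'D', 200), (5, 'E', 100);
""")

third_highest = conn.execute("""
    WITH distinct_revenue AS (          -- step 1: remove duplicates
        SELECT DISTINCT revenue FROM sales
    ),
    ranked AS (                         -- step 2: rank highest to lowest
        SELECT revenue,
               DENSE_RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
        FROM distinct_revenue
    )
    SELECT revenue                      -- step 3: pick third place
    FROM ranked
    WHERE revenue_rank = 3
""").fetchone()

print(third_highest)  # (200,)
```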

Question 9

Self-Join to Identify Missing Supervisors

Accepted Answer

We have only one table, called Employees, with three columns: employee ID, the name of that employee, and supervisor ID. One employee can be a supervisor for his or her colleagues. Our job is to find employees whose supervisor does not exist in the table. Employees with a null value in the supervisor ID column are excluded. The final list should be sorted in ascending order by employee ID. To reference the same table twice, we give it two different names, or aliases: e1 represents each employee, and e2 is where we search for the supervisor. We use a left join, which keeps all rows from the left table regardless of whether a match was found in the right table. With the ON condition, we ask the query to take the supervisor ID from e1 and look for it in the employee ID column of e2. The first condition is that the employee ID from e2 is null, which catches all employees whose supervisor was not found after the left join. We also check that the supervisor ID in e1 is not null.
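
A minimal sketch of this self-join, using Python's built-in sqlite3 module. The aliases e1 and e2 follow the explanation above; the column names and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        employee_id   INTEGER PRIMARY KEY,
        name          TEXT,
        supervisor_id INTEGER
    );
    INSERT INTO employees VALUES
        (1, 'Ada', NULL),  -- no supervisor at all: excluded
        (2, 'Ben', 1),     -- supervisor 1 exists
        (3, 'Cat', 99),    -- supervisor 99 is not in the table
        (4, 'Dan', 42);    -- supervisor 42 is not in the table
""")

# e1 is the employee, e2 is the looked-up supervisor row (NULL if missing).
missing_supervisor_rows = conn.execute("""
    SELECT e1.employee_id, e1.name, e1.supervisor_id
    FROM employees AS e1
    LEFT JOIN employees AS e2 ON e1.supervisor_id = e2.employee_id
    WHERE e2.employee_id IS NULL
      AND e1.supervisor_id IS NOT NULL
    ORDER BY e1.employee_id ASC
""").fetchall()

print(missing_supervisor_rows)  # [(3, 'Cat', 99), (4, 'Dan', 42)]
```

Without the second condition, Ada would also be returned, because a null supervisor ID never matches any row in e2.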

Question 10

Year-over-Year Revenue Growth

Accepted Answer

We need to calculate the yearly revenue and the year-over-year percentage growth for a given set of financial transactions. We have a financials table that contains two columns: transaction date, of date type, and amount, which is numeric. We group all transactions by year and sum them up to get the total revenue per year. We do this in a CTE. A CTE is a temporary result set in SQL that you can reference within a single query; it only exists while that query is running. In its SELECT, we use EXTRACT to get only the year from the transaction date column, and SUM to get the total revenue from the amount column. A left join keeps all rows from the left table, even if there is no matching row in the right table; the missing values just become null. We take our CTE, yearly revenue, and give it the alias current. Using a left join, we take the exact same CTE again, but this time we call it prev, for previous. To get the percentage, we subtract the previous year's revenue from the current year's, divide by the previous year's revenue, and multiply by 100. Inside the SELECT, we use ROUND, a built-in SQL function that rounds a decimal number to a specified number of decimal places.
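
The CTE and self left join can be sketched as below with Python's built-in sqlite3 module. Two substitutions are made so the example runs on SQLite, and both are assumptions rather than the original query: strftime('%Y', ...) stands in for EXTRACT, which SQLite does not provide, and the alias cur is used in place of current. The join condition on the previous year and the sample rows are also assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE financials (transaction_date TEXT, amount NUMERIC);
    INSERT INTO financials VALUES
        ('2021-03-01', 100), ('2021-09-10', 100),
        ('2022-02-14', 250),
        ('2023-05-05', 300);
""")

growth_rows = conn.execute("""
    WITH yearly_revenue AS (
        -- strftime replaces EXTRACT(YEAR FROM ...) for SQLite
        SELECT CAST(strftime('%Y', transaction_date) AS INTEGER) AS revenue_year,
               SUM(amount) AS total_revenue
        FROM financials
        GROUP BY revenue_year
    )
    SELECT cur.revenue_year,
           cur.total_revenue,
           ROUND((cur.total_revenue - prev.total_revenue) * 100.0
                 / prev.total_revenue, 2) AS yoy_growth_pct
    FROM yearly_revenue AS cur
    LEFT JOIN yearly_revenue AS prev
           ON prev.revenue_year = cur.revenue_year - 1
    ORDER BY cur.revenue_year
""").fetchall()

# The first year has no previous year, so its growth is NULL (None in Python).
print(growth_rows)  # [(2021, 200, None), (2022, 250, 25.0), (2023, 300, 20.0)]
```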

Question 11

Above Average Price Products

Accepted Answer

We have a products table with columns such as category, ID, name, price, rating, and stock quantity. Our job is to return the products that are priced above the average price. We only work with products that are currently in stock. This means that out-of-stock products are not returned and are also left out when calculating the average. When sorting, we should put all products with null ratings at the bottom. Using SELECT with a star, we choose all columns. Using the FROM clause, we retrieve data from the products table. We need to write a subquery. Using the WHERE clause, we create a small condition. Inside the subquery, the AVG function calculates the average price. The main condition checks whether the selected product's price is higher than that average. We sort in descending order using ORDER BY. At the end of our ORDER BY clause, we use NULLS LAST.

Question 12

Calculate Cumulative Sales

Accepted Answer

We are given a sales data table with the product name, the date of the transaction, and the amount of daily sales. Each row represents one day of sales for one product. Our mission is to add a new column called cumulative sales, a running total that keeps adding up the daily sales for each product as we move forward through the dates. We do the calculation within each product separately. For each row, we need to add up the current row and all previous rows for the same product, which is not something a regular SUM with GROUP BY can handle, since GROUP BY would collapse everything into one total. A window function performs a calculation across multiple rows, but unlike GROUP BY, it never collapses them. We start with SUM on the daily sales column, and then we use the OVER keyword, which activates the window function. We use the PARTITION BY clause on the product name column to divide the data into separate groups.
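
A sketch of the running total, run through Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). The column names and rows are assumptions. The ORDER BY inside OVER is not spelled out in the explanation above, but some ordering by date is what turns the per-product sum into a running total, so it is included here as an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_data (product_name TEXT, sale_date TEXT, daily_sales INTEGER);
    INSERT INTO sales_data VALUES
        ('pen', '2024-01-01', 5), ('pen', '2024-01-02', 3), ('pen', '2024-01-03', 2),
        ('ink', '2024-01-01', 10), ('ink', '2024-01-02', 4);
""")

# PARTITION BY restarts the total for each product;
# ORDER BY sale_date makes the sum accumulate day by day.
cumulative_rows = conn.execute("""
    SELECT product_name, sale_date, daily_sales,
           SUM(daily_sales) OVER (
               PARTITION BY product_name
               ORDER BY sale_date
           ) AS cumulative_sales
    FROM sales_data
    ORDER BY product_name, sale_date
""").fetchall()

for row in cumulative_rows:
    print(row)
```

With this data, the ink totals run 10, 14 and the pen totals run 5, 8, 10, and every input row is still present in the output.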

Question 13

Find Overlapping Date Ranges

Accepted Answer

We have one table called assignments. Each row represents one employee assigned to one project with a start date and an end date. We need to find the employees who are assigned to multiple projects at the same time, or in other words, whose assignments overlap. The final result must contain all the columns from the input, sorted primarily by project ID and then by employee ID in ascending order. Two date ranges overlap when they share at least one day. We need to compare each assignment to every other one for the same employee, which means that we reference the same table twice, so we use a self join. We take the assignments table and give it the alias a1. Then we join it with a second copy, a2, which searches for overlapping assignments. The first condition is that assignment a1 must start before or on the same day that assignment a2 ends, because if a1 starts after a2 has already ended, no overlap is possible. The role of the DISTINCT keyword here is to remove duplicates and make sure that each assignment appears only once in the final result.
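
The overlap test can be sketched as follows with Python's built-in sqlite3 module. The first date condition comes from the explanation above; the second (a1 must end on or after the day a2 starts) is its mirror image and is added here as an assumption, as is the project_id inequality that stops a row from matching itself. Column names and rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assignments (
        employee_id INTEGER, project_id INTEGER, start_date TEXT, end_date TEXT
    );
    INSERT INTO assignments VALUES
        (1, 100, '2024-01-01', '2024-03-31'),
        (1, 200, '2024-03-15', '2024-06-30'),  -- overlaps project 100
        (1, 300, '2024-07-01', '2024-08-31'),  -- no overlap
        (2, 100, '2024-01-01', '2024-02-28'),
        (2, 200, '2024-03-01', '2024-04-30');  -- starts after 100 ends
""")

overlap_rows = conn.execute("""
    SELECT DISTINCT a1.employee_id, a1.project_id, a1.start_date, a1.end_date
    FROM assignments AS a1
    JOIN assignments AS a2
      ON a1.employee_id = a2.employee_id
     AND a1.project_id <> a2.project_id      -- do not compare a row to itself
     AND a1.start_date <= a2.end_date        -- a1 starts before a2 ends
     AND a1.end_date   >= a2.start_date      -- a1 ends after a2 starts
    ORDER BY a1.project_id, a1.employee_id
""").fetchall()

for row in overlap_rows:
    print(row)
```

DISTINCT matters when one assignment overlaps several others: the join would otherwise return that assignment once per overlapping partner.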

Question 14

Set Operation: INTERSECT

Accepted Answer

This SQL question focuses on the set operation called INTERSECT. We are given two tables that share customer ID as a key. Our main goal is to return a list of active customers based on two criteria. The first is that, on a monthly basis, a new customer should spend more than 1,000. The second is loyalty: at least three years of membership and premium tier status. The final output must include only the customer ID and name columns, sorted in ascending order by ID. A CTE is a temporary result set in SQL that you can reference within a single query. Using the WITH clause, we create a CTE called monthly spenders for new customers that spend more than 1,000. We can name the second CTE premium tier. INTERSECT takes both result sets and returns only the rows that appear in both of them. We select the customer ID and name columns from monthly spenders, the first CTE. Then we select the same columns from the second CTE, and between these two SELECT statements we add the INTERSECT operator.
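
Here is a rough sketch of two CTEs combined with INTERSECT, using Python's built-in sqlite3 module. The original does not name the two tables or their columns, so spending, memberships, monthly_spend, years_member, and tier are all hypothetical, as are the sample rows.

```python
import sqlite3

# Entirely hypothetical schema: the source only says both tables share customer ID.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE spending (customer_id INTEGER PRIMARY KEY, name TEXT, monthly_spend REAL);
    CREATE TABLE memberships (
        customer_id INTEGER PRIMARY KEY, name TEXT, years_member INTEGER, tier TEXT
    );
    INSERT INTO spending VALUES (1, 'Ann', 1500), (2, 'Bob', 900), (3, 'Cy', 2000);
    INSERT INTO memberships VALUES
        (1, 'Ann', 4, 'premium'), (2, 'Bob', 5, 'premium'), (3, 'Cy', 1, 'premium');
""")

# Each CTE applies one criterion; INTERSECT keeps rows present in both.
active_rows = conn.execute("""
    WITH monthly_spenders AS (
        SELECT customer_id, name FROM spending WHERE monthly_spend > 1000
    ),
    premium_tier AS (
        SELECT customer_id, name FROM memberships
        WHERE years_member >= 3 AND tier = 'premium'
    )
    SELECT customer_id, name FROM monthly_spenders
    INTERSECT
    SELECT customer_id, name FROM premium_tier
    ORDER BY customer_id ASC
""").fetchall()

print(active_rows)  # [(1, 'Ann')]
```

Both SELECT statements must return the same columns in the same order, because INTERSECT compares whole rows.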

Question 15

Subquery for Best Order per Customer

Accepted Answer

We are given two tables. The first table contains the names of the customers. The second table stores order IDs and the total amount of each order. The tables are connected through the customer ID column. In the output, we should get the customer's name, the ID of their best order, and its total amount. All results should be sorted by customer name in alphabetical order. Each customer can have multiple orders, and several of them might share the same highest total amount; in that case, we should return the one with the smallest order ID. We use an inner join because it returns only the rows where there is a match in both tables. Inside the WHERE clause, we have a correlated subquery. It is a query inside another query, but it runs once for every row in the outer query. Since we need the highest-valued order for each customer, we sort by total amount in descending order. We also sort by order ID in ascending order, so that when two orders have the same amount, the one with the smaller order ID comes first. At the very end, we apply LIMIT 1.
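
The correlated subquery with the tiebreak might be sketched like this, using Python's built-in sqlite3 module. Column names and sample rows are assumptions; customer 1 has two orders tied at the highest amount so the tiebreak is exercised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY, customer_id INTEGER, total_amount REAL
    );
    INSERT INTO customers VALUES (1, 'Zed'), (2, 'Amy');
    INSERT INTO orders VALUES
        (10, 1, 50.0), (11, 1, 80.0), (12, 1, 80.0),  -- 11 and 12 tie at 80
        (13, 2, 20.0);
""")

# The subquery refers to c.customer_id from the outer query, so it is
# evaluated per outer row and returns that customer's single best order.
best_order_rows = conn.execute("""
    SELECT c.customer_name, o.order_id, o.total_amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE o.order_id = (
        SELECT o2.order_id
        FROM orders AS o2
        WHERE o2.customer_id = c.customer_id
        ORDER BY o2.total_amount DESC, o2.order_id ASC
        LIMIT 1
    )
    ORDER BY c.customer_name ASC
""").fetchall()

print(best_order_rows)  # [('Amy', 13, 20.0), ('Zed', 11, 80.0)]
```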

Question 16

Ranking with Dense_Rank

Accepted Answer

The main focus is on ranking values with the DENSE_RANK window function. We are working with a sales table that tracks individual sales made by different representatives. The goal is to first add up all sales per person to get their total, and then rank everyone by that total from highest to lowest. With dense ranking, two people with equal sales totals share the same rank, and the next rank follows without a gap. A CTE is a temporary result set in SQL that you can reference within a single query. DENSE_RANK is one of the window functions that assigns a rank number to each row based on a specified order. OVER is the keyword that activates a window function and tells SQL that we want to use one. Finally, the ORDER BY clause sorts by sales rank first, so rank one comes before rank two, and then sorts salesperson names alphabetically.

Question 17

Median Salary by Job Title

Accepted Answer

We'll write a query to calculate the median salary for each job title. When all salaries are sorted in ascending order, the median is the value in the middle. If there is an even number of salaries, we take the average of the two middle values. A CTE is a temporary result set in SQL that you can reference within a single query. A window function performs a calculation across multiple rows while keeping every single row in the result. Inside the window function, we use PARTITION BY job title. PARTITION BY divides the data into separate groups, and COUNT runs independently inside each of these groups. We use ROW_NUMBER, which assigns a sequential number to each row. Two formulas then calculate the median positions, so that we know which row numbers represent the middle. If the number of rows is odd, both formulas give the same result. Using the ROUND function and the double colon cast to numeric (::numeric), we convert the result to the numeric type and then round it to two decimal places.

Question 18

String Splitting and Aggregation

Accepted Answer

We have a product_tags table with two columns, id and tags. The tags column stores multiple tags in a single string, separated by commas. Our job is to split those tags apart and count how many times each tag appears across all products. One important thing to keep in mind is that tags are case sensitive. A CTE is a temporary result set in SQL that you can reference within a single query. Our CTE is called split_data, and in it we select id and tag from our product_tags table. First, we use string_to_array: you pass it a string and the separator. After turning the tags column into an array, we use the unnest function, which puts each element of the array into its own row. We use COUNT with a star, which counts the rows in each group. Using the GROUP BY clause, we group all rows with the same tag together.

Question 19

Salary Comparison with CTE Aggregation

Accepted Answer

We have two tables, departments and employees. Our job is to find employees who earn more than the average salary of their own department. We need the average salary per department before we can compare anything, so we use a CTE here. A CTE is a temporary result set in SQL that you can reference within a single query. It doesn't get saved anywhere; it only exists while that query is running. Inside the CTE, we select the department and the average salary, grouped by department. For our task, we use an inner join because it returns only the rows where there is a match in both tables. We connect the tables on department ID from the employees table and ID from the departments table, because these are the matching key columns. The average salary should be rounded to two decimal places in the output. Using the WHERE clause, we write a condition that an employee's salary must be higher than the average salary we calculated in the CTE. The last thing we have to do is sort primarily by department name in ascending order and then by salary in descending order.
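
A sketch of the CTE-based comparison, using Python's built-in sqlite3 module. The CTE name dept_avg, the column names, and the sample rows are assumptions; the comparison uses the unrounded average and only the displayed value is rounded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, name TEXT, salary REAL, department_id INTEGER
    );
    INSERT INTO departments VALUES (1, 'Eng'), (2, 'Ops');
    INSERT INTO employees VALUES
        (1, 'Ann', 100.0, 1), (2, 'Bob', 60.0, 1),
        (3, 'Cy', 50.0, 2), (4, 'Dee', 70.0, 2), (5, 'Eli', 90.0, 2);
""")

above_avg_rows = conn.execute("""
    WITH dept_avg AS (                  -- one average per department
        SELECT department_id, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department_id
    )
    SELECT d.name AS department, e.name AS employee, e.salary,
           ROUND(a.avg_salary, 2) AS avg_salary
    FROM employees AS e
    INNER JOIN departments AS d ON e.department_id = d.id
    INNER JOIN dept_avg AS a ON a.department_id = e.department_id
    WHERE e.salary > a.avg_salary
    ORDER BY d.name ASC, e.salary DESC
""").fetchall()

# Eng averages 80.0 and Ops averages 70.0; Dee at exactly 70.0 is not above it.
print(above_avg_rows)  # [('Eng', 'Ann', 100.0, 80.0), ('Ops', 'Eli', 90.0, 70.0)]
```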

Question 20

String Pattern Extraction in Descriptions

Accepted Answer

This question mainly focuses on string extraction. We have a products table with four columns: description, name, price, and product ID. The description column contains free text. Some products have an email address inside it and some don't. Our mission is to find those email addresses and extract them from the column. We return only three columns: product ID, name, and the email that we find in the description. We need to use SUBSTRING, a string manipulation function that returns a copy of a specific portion of a string. The pattern of an email address is some word, then the @ symbol, another word, a dot, and then a final word. A regular expression is a kind of language used to describe a sequence of characters. Instead of the equals sign, we use the tilde (~), the regular expression match operator, which means: find this kind of pattern in the description column. We sort in ascending order by product ID.

Question 21

Nested Subquery for Latest Record

Accepted Answer

We only have one table here, events. There are five columns: event_date, event_name, id, status, and user_id. Each row's event_name records something a user did, and the status column shows whether it was successful. For each user, we want to know what their most recent event was. The final result should contain all columns except id, sorted in ascending order by user ID. For each row, we find the maximum date for that row's user. If the current row's date matches the maximum date, we keep it. We take the events table and give it the alias e1. We have to use this table again, but we can't reuse the same alias, so inside the subquery we give the table another alias, e2. We use the MAX function to find the maximum date, and then in the WHERE clause we compare the date of the row we are currently checking with the maximum date we found. Finally, we sort in ascending order by user ID.
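
The e1/e2 pattern above can be sketched as follows with Python's built-in sqlite3 module. The column names follow the description; the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY, user_id INTEGER,
        event_name TEXT, event_date TEXT, status TEXT
    );
    INSERT INTO events VALUES
        (1, 1, 'login',    '2024-01-01', 'success'),
        (2, 1, 'purchase', '2024-01-05', 'failed'),
        (3, 2, 'login',    '2024-02-01', 'success');
""")

# For each outer row (e1), the subquery finds the latest date among
# that same user's rows (e2); the row is kept only if its date matches.
latest_rows = conn.execute("""
    SELECT e1.user_id, e1.event_name, e1.event_date, e1.status
    FROM events AS e1
    WHERE e1.event_date = (
        SELECT MAX(e2.event_date)
        FROM events AS e2
        WHERE e2.user_id = e1.user_id
    )
    ORDER BY e1.user_id ASC
""").fetchall()

for row in latest_rows:
    print(row)
```

If a user has two events on the same latest date, this pattern returns both rows for that user.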

Question 22

Window Function for Moving Average

Accepted Answer

We are given a sales table with two columns, amount and sale date. Each row gives the amount of sales for one day within one week. Our main goal is to calculate a seven-day moving average for each date. A moving average means that, for each day, we average the current day's amount together with the amounts of the previous days. The challenge is that for each row we need to look at other rows, which is not something a WHERE or GROUP BY clause can do. A CTE is a temporary result set in SQL that you can reference within a single query. A window function calculates across multiple rows, but it keeps every row in the result. From the sales table, we select the sale date and amount columns, plus the moving average as a new column. The frame clause, RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW, defines which rows are averaged: the current day and the six days before it. Inside the ROUND function, we use a double colon cast (::numeric) to convert the raw value to the numeric type we need, and the 2 rounds the result to two decimal places.

Question 23

Re-enrollment Rate Calculator

Accepted Answer

We are given one table, Enrollments. It has three columns: course ID, student ID, and term ID. Our job is to calculate what percentage of students are enrolled in two or more consecutive terms. Consecutive means that the term IDs follow each other, like 101, 102, 103, without any gaps. A CTE is a temporary result set in SQL that you can reference within a single query. We call our CTE consecutive terms, and from the enrollments table we select the student ID and term ID columns. We use a window function with ROW_NUMBER. The PARTITION BY clause divides a result set into smaller groups and allows a function to perform its calculation on each group separately. ROW_NUMBER assigns a sequential number to each row, starting from one. To avoid counting duplicates, COUNT with DISTINCT returns only the number of unique student IDs and ignores null values. CAST in SQL converts a value of one data type into another. We are required to round the result to two decimal places, which we do with the ROUND function.

Question 24

String Pattern Matching Using LIKE

Accepted Answer

We'll match string patterns using LIKE. We have two tables, departments and employees. Our job is to filter employees based on three conditions: the name must start with the letter A, the email must contain the substring @tech, and the position level must contain the word senior. A join in SQL connects two tables based on a common column. We use an inner join because it returns only the rows where there is a match in both tables. LIKE is an SQL operator used to search for a specific pattern inside a string. LIKE works together with wildcards, and the most important one is the percent sign, which matches any sequence of characters. We write the name column from the employees table, LIKE 'A%', which means A is the first letter and any characters can follow it. The second condition is that the email from the employees table must contain @tech, and since this value is in the middle of the string, we put percent signs on both sides. Finally, we sort in ascending order by name.

Question 25

Customer Order Aggregation

Accepted Answer

We are given two tables. The first is the customers table, which contains each customer's name, email, and ID. The second table contains the ID of each order, its amount, and the ID of the customer. Our main goal is to find customers who have placed more than two orders and to calculate their total number of orders and the amount they spent on them. The final result must be sorted in descending order by total spend. A join in SQL connects two tables based on a common column. For our task, we use an inner join because it returns only the rows where there is a match in both tables. The second step is grouping by ID, name, and email from the customers table, because this collapses all rows belonging to the same customer into one group. Since we have already grouped the rows, we have to use the HAVING clause, which works with groups, not rows. To get the number of orders, we apply the COUNT function to the ID column, which finds the number of orders in each group.
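
A minimal sketch of the grouping and HAVING filter, using Python's built-in sqlite3 module. Column names, output aliases, and sample rows are assumptions. Bob has the single largest order but only two orders, so HAVING removes him.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES
        (1, 'Ann', 'ann@example.com'), (2, 'Bob', 'bob@example.com');
    INSERT INTO orders VALUES
        (1, 1, 10.0), (2, 1, 20.0), (3, 1, 30.0),
        (4, 2, 500.0), (5, 2, 1.0);
""")

# WHERE cannot see COUNT(...), because it runs before grouping;
# HAVING filters the groups after they are formed.
frequent_rows = conn.execute("""
    SELECT c.id, c.name, c.email,
           COUNT(o.id)   AS total_orders,
           SUM(o.amount) AS total_spent
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name, c.email
    HAVING COUNT(o.id) > 2
    ORDER BY total_spent DESC
""").fetchall()

print(frequent_rows)  # [(1, 'Ann', 'ann@example.com', 3, 60.0)]
```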

Question 26

SQL JOIN with Pandas Data Processing and CSV Export

Accepted Answer

We need to do data processing and CSV export using Pandas and SQLite. We are given a SQLite database called Sales. It contains three tables: customers, orders, and items. SQLite is a lightweight database that stores everything in a single file. We need to connect to the database, run an SQL query that joins all three tables, load the results into Pandas, calculate revenue metrics per customer, and export everything to a CSV file. A JOIN connects two tables based on a common column. We use an inner join because it returns only the rows where there is a match in both tables. We run the query with the read_sql_query function. We build the total amount column by multiplying quantity by unit price. When we group by customer ID, all rows belonging to the same customer are put together. To calculate the revenue percentage, we divide each customer's total by the overall revenue.

Question 27

Insert New Records into SQLite Database from CSV

Accepted Answer

We'll insert new records into a database from a CSV file using SQLite and Pandas. We have one CSV file called New Customers. It contains customer records that need to be imported into a SQLite database. The problem is that some of these customers might already exist in the database. SQLite is a lightweight database that stores everything in a single file. Our job is to read the CSV file, check which records already exist, and make sure we only insert the new ones. We use the connect function, where we define the path to the database file. Then, from this connection object, we create a cursor. We read the given CSV file into a DataFrame called df. Calling cursor execute runs the SQL query, in which we select the ID column from the customers table. The fetchall function retrieves all the results as a list of tuples. The tilde symbol (~) is used here to flip true to false and false to true. Another cursor execute call runs the INSERT statement. Instead of putting the values directly in the SQL string, we use question mark placeholders and pass the actual values as a separate tuple. We do this for security reasons, to protect the data from SQL injection. We will use the commit function w

Question 28

Aggregate SQL Query Results with Pandas and Export to Excel

Accepted Answer

We'll need to aggregate SQL query results and export them to Excel using Pandas. Pandas is a library designed for data analysis and manipulation. We are given one SQLite database called Orders. It contains two tables, the first with customers' information and the second with all the transactions. SQLite is a lightweight database that stores everything in a single file. Our job is to join the two tables, calculate the total order value per customer, and save the results to an Excel file. We import two libraries, sqlite3 and pandas. Then we open the database using the connect function. We use an inner join because it returns only the rows where there is a match in both tables. Then we execute the query using the read_sql_query function and load the results directly into a Pandas DataFrame. We use group by so that all rows belonging to the same customer are put together. Finally, we save the resulting DataFrame to an Excel file.

Question 29

Merge Employee and Department Records

Accepted Answer

We need to merge employee and department records. We have two tables, departments and employees. The first requirement is to consider only departments that have more than 10 employees. The second requirement is to find employees whose salary is above their department average. The third requirement is to add a high earners column to the output, which counts the number of employees whose salary is more than 75,000. A CTE is a temporary result set in SQL that you can reference within a single query. A join in SQL connects two tables based on a common column. We use an inner join because it returns only the rows where there is a match in both tables. The difference between WHERE and HAVING is that WHERE runs before GROUP BY, while HAVING runs after it. To calculate the high earners column, we use COUNT with CASE WHEN. Results should be sorted primarily by department name and then by salary in descending order.

Question 30

Sequence Products by Price

Accepted Answer

We are given a products table with the name of the product, its ID, and the price. Our main goal is to calculate a neighbor product column. For each product, we take the price of the previous product and the price of the next product, and multiply them together. A CTE is a temporary result set in SQL that you can reference within a single query. It doesn't get saved anywhere; it only exists while that query is running. A window function performs a calculation across multiple rows and keeps every single row in the result. The reason we can't use GROUP BY here is that the GROUP BY clause collapses the rows of each group into a single row. We use LAG, which reaches back to the previous row and grabs its value. For the next product, the only difference is that we use LEAD instead of LAG: LEAD reaches forward to the next row and grabs its value. The COALESCE function takes a list of values and returns the first one that is not null. Since null values can't be used in the calculation, COALESCE replaces them with zeros.
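
The LAG, LEAD, and COALESCE combination can be sketched as below, using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). The ordering by id, the CTE name, the column names, and the sample prices are assumptions; the original does not say which column defines "previous" and "next".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price INTEGER);
    INSERT INTO products VALUES (1, 'A', 2), (2, 'B', 3), (3, 'C', 5), (4, 'D', 7);
""")

neighbor_rows = conn.execute("""
    WITH neighbors AS (
        SELECT id, name, price,
               LAG(price)  OVER (ORDER BY id) AS prev_price,  -- NULL on first row
               LEAD(price) OVER (ORDER BY id) AS next_price   -- NULL on last row
        FROM products
    )
    SELECT id, name, price,
           COALESCE(prev_price, 0) * COALESCE(next_price, 0) AS neighbor_product
    FROM neighbors
    ORDER BY id
""").fetchall()

for row in neighbor_rows:
    print(row)
```

With these prices, the first and last products get 0 because one neighbor is missing, while B gets 2 * 5 = 10 and C gets 3 * 7 = 21.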

Question 31

Top Categories by Average Price

Accepted Answer

We are given two tables, inventory and products. These two tables are connected through the product_id column from the inventory table and the id column from the products table. Our main goal is to find the top three product categories with the highest average price. We only consider active products, meaning that they are in stock. Products with fewer than 10 units available qualify as low stock items. The final result should be ordered primarily by average price in descending order, and then by the product count column as a tiebreaker. A CTE is a temporary result set in SQL that you can reference within a single query. We use left join to make sure every product appears, even if it somehow has no inventory entry. For the low stock items column, we need to count only products with stock less than 10. We group by category so that all rows that share the same category collapse into one group. Rank here assigns a number to each row based on a specified order. In the where clause, we put the condition that price rank should be less than or equal to three.
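
One possible shape of this query is sketched below with Python's sqlite3 module (SQLite 3.25 or newer for RANK). Several details are assumptions and not from the original task: the column names category, price, and quantity, reading "active" as quantity greater than zero, and breaking ties by product count in descending order.

```python
import sqlite3

# Assumed schema: products(id, name, category, price), inventory(product_id, quantity).
TOP_CATEGORIES_QUERY = """
WITH category_stats AS (
    SELECT p.category,
           ROUND(AVG(p.price), 2) AS avg_price,
           COUNT(*) AS product_count,
           COUNT(CASE WHEN i.quantity < 10 THEN 1 END) AS low_stock_items
    FROM products AS p
    LEFT JOIN inventory AS i ON i.product_id = p.id
    WHERE i.quantity > 0                 -- assumption: "active" means in stock
    GROUP BY p.category
),
ranked AS (
    SELECT *, RANK() OVER (ORDER BY avg_price DESC) AS price_rank
    FROM category_stats
)
SELECT category, avg_price, product_count, low_stock_items
FROM ranked
WHERE price_rank <= 3
ORDER BY avg_price DESC, product_count DESC
"""


def top_categories(products, inventory):
    """Run the query against in-memory copies of the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, category TEXT, price REAL)"
    )
    conn.execute("CREATE TABLE inventory (product_id INTEGER, quantity INTEGER)")
    conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", products)
    conn.executemany("INSERT INTO inventory VALUES (?, ?)", inventory)
    result = conn.execute(TOP_CATEGORIES_QUERY).fetchall()
    conn.close()
    return result
```

Under this reading of "active", the quantity filter also drops products that have no inventory row, so the left join only makes a difference if that filter is relaxed.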

Question 32

Export SQLite Database to Parquet Format with Metadata

Accepted Answer

We need to export a SQLite database to Parquet format with metadata using Pandas. We have one SQLite database that is called ecommerce. It contains three tables: customers, products, and orders. The orders table has foreign keys that connect it to the first two tables. SQLite is a lightweight database that stores everything in a single file. Our job here is to export each of the three tables to its own Parquet file with Snappy compression and create a manifest JSON file describing the export. Parquet stores data column by column instead of row by row. A manifest is a metadata file that describes the contents of an export. We import five things: the sqlite3 library to connect to the database; pandas, aliased as pd; json to create the manifest; the os library for creating directories, building file paths, and getting file sizes; and the datetime module for getting the current timestamp for the manifest. For each table, we read the entire table into a data frame and save the data frame as a Parquet file, with compression set to Snappy and index set to false.
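
To make the idea of a manifest more concrete, here is a small sketch of the manifest-building step only. The Parquet write itself is left out so that the sketch runs on the standard library alone. The manifest field names (exported_at, compression, tables, file, columns, row_count) are hypothetical, since the walkthrough does not list the required fields.

```python
import json
import sqlite3
from datetime import datetime, timezone


def build_manifest(conn, tables, compression="snappy"):
    """Describe each exported table; field names here are illustrative assumptions."""
    manifest = {
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "compression": compression,
        "tables": [],
    }
    for table in tables:
        # Table names come from a fixed list in this sketch, not from user input.
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        columns = [col[0] for col in cursor.description]
        row_count = len(cursor.fetchall())
        manifest["tables"].append({
            "name": table,
            "file": f"{table}.parquet",   # assumed naming: one file per table
            "columns": columns,
            "row_count": row_count,
        })
    return manifest


def manifest_to_json(manifest):
    """Serialize the manifest the way it would be written to a manifest JSON file."""
    return json.dumps(manifest, indent=2)
```

In the full solution each table would also be written with the pandas to_parquet method before its manifest entry is added, and the file size could then be read with os.path.getsize.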

Question 33

Combine Data from Multiple Sources into Unified Report

Accepted Answer

We need to combine data from multiple sources into a unified report using Pandas. We are given one customers CSV file and one orders SQLite database. SQLite is a lightweight database that stores everything in a single file. We get the data from three sources: an API, the CSV file, and the database. For the API, we send an HTTP GET request to a URL, and the API responds with data in JSON format. Our job is to fetch data from all three sources, combine them into one unified DataFrame, and save the final report as a CSV file. We import three libraries: requests, pandas, and sqlite3 to connect to the database. We open the database file using the connect function. read_sql_query executes our SQL query and loads the result directly into a DataFrame. Merge in Pandas connects two DataFrames based on a common column or index. It works the same way as JOIN in SQL. The on parameter indicates which common column is used to connect the DataFrames. The how parameter is set to left, which means that we keep all orders even if the customer ID doesn't match. We calculate the total amount by multiplying quantity by price. The final unified report is saved as a CSV file.

Question 34

Create Branch from Detached HEAD State

Accepted Answer

We have a Git repository that has a detached HEAD. HEAD is basically a pointer to the commit we currently have checked out. Detached HEAD means that HEAD is not pointing to any Git branch and instead points directly to a commit or tag. It happens when we check out a tag directly, check out a commit hash directly, or check out a remote branch. To recover from a detached HEAD, we either need to return to some existing branch or create a new branch from that detached HEAD. Run git status. We can see the Git message that HEAD is detached. We create a new branch by running git checkout -b followed by the name of the branch, or the more modern command git switch -c followed by the name of the branch.

Question 35

Rebase Feature Branch

Accepted Answer

We have a Git repository. In this folder we have a feature payment branch, and it's behind our main branch by 32 commits. We need to rebase the feature branch onto the latest main, which brings all the latest changes from main into this feature branch. Switch to the feature payment branch. We need to rebase onto main, so type git rebase main. We have conflicts, so some of the commits cannot be rebased automatically. We have two changes for the same line. Remove the obsolete one and keep what we need. Now we can run git add and then git rebase --continue. We've successfully rebased and updated feature payment.

Question 36

Apply Specific Stash from Multiple Stashes

Accepted Answer

Git stash lets us save our work in progress, go to some other task, and then come back, restore our stashed work, and continue from that moment on. We can save multiple work-in-progress jobs, so we can have multiple stashes. We have to navigate to this repository, identify the third stash in the list, and restore it. We need to apply it without removing it from the stack. Run git stash list. This shows us the list of stashed work. The index starts from zero, so the third stash has index two. To restore it, run git stash apply stash@{2}. This still retains stash number two in the list. To remove a stash from the list, we run git stash drop followed by the name of the stash. Another command to restore a stash is git stash pop. Git stash pop applies the latest stash, meaning stash number zero, and then drops it from the list.

Question 37

Remove Last Commit and Discard Changes

Accepted Answer

We've committed some local changes that are incorrect, and we need to completely erase that commit from history. Our task is to navigate to this repository and remove the last commit entirely. The most important thing here is to discard all associated file changes. First run git log to see all the current commits. We will go one commit back, and we also need to discard all associated file changes. This means that this is not a soft reset, which would undo the commit but keep the local changes. So this has to be a hard reset. We have a bad commit with wrong changes. It's the last commit, and we will reset it. To go one commit back, run git reset --hard HEAD~1.

Question 38

Checkout Single File from Another Branch

Accepted Answer

We have two branches, main and feature settings, and we have a file config.json that is located in the feature branch. We need to copy this file into the main branch. We don't need to cherry-pick the commit. Cherry-picking and copying one single file are different things. Cherry-picking applies a specific commit from another branch to the branch that we are currently in. To copy the file, we use checkout, but we check out only the specific file. Type git checkout, then the name of the branch, then two hyphens, and then the name of the file that we would like to check out. Now if we run git status, we'll see this file listed as a change to be committed in our directory.

Question 39

Cherry-Pick Specific Commit

Accepted Answer

We have a Git repository with a feature branch called feature. This feature branch contains a commit that fixes a bug on main, but we don't want to move all the commits from the feature branch to main. We just want to pick one commit and move only that. That's called cherry-picking. We need to navigate to this repository, identify the commit that fixes the main branch, and bring it from feature to main. Check the git log to identify the commit that fixes the bug on main. This commit has the message fix critical bug. Next, we type git cherry-pick followed by the hash of this commit. This commit has now been added to our main branch.

Question 40

Restore File to Previous Version

Accepted Answer

We have a Git repository where we have a file config.js. This file has been modified in the last two commits, but those two commits introduced a bug. We need to restore config.js to the version that it had two commits ago while not affecting any other file. Run git log to see the last five commits. Next, we preview config.js as it was two commits ago. For this we type git show HEAD~2, where HEAD means our current last commit and ~2 means two commits back. Next, we type git checkout HEAD~2, then two hyphens, and then config.js, which restores config.js to its state two commits ago. Last, we commit our changes with the message restore config.js.

Question 41

Create an Annotated Tag

Accepted Answer

We have a Git repository in this directory that contains a completed new version of our application. We need to create an annotated tag, and once it's created, we need to push it to the remote repository with this name. First, move to this directory and view the current commit. This is our latest commit at the head of our branch. A lightweight tag just references some commit. It doesn't have its own SHA hash, meaning that we cannot reference the tag itself as an object. An annotated tag, however, has its own SHA, so you can reference it. We create the annotated tag with git tag -a, the tag name, and -m with a message. Verify that the tag was created by typing git tag. And now push the tag to the remote.

Question 42

Add Git Submodule

Accepted Answer

Integrate external repositories as submodules to manage dependencies without code duplication. Add submodules with git submodule add, configure .gitmodules file, initialize submodule directories, and commit configuration. Essential for managing shared libraries, vendor dependencies, monorepo structures, and maintaining decoupled version control across interdependent projects.

Question 43

Update Submodule to Latest Commit

Accepted Answer

We have a repository, interview repo, that contains a submodule, vendor details. A submodule lets us nest one repository inside another. When a submodule is used, the parent repository pins it to a specific commit. We need to update the submodule to the latest commit on its default branch. Run git submodule status to see our submodule status. Next, go into the submodule, run git fetch and then git log to see its commits. We need to pull the latest changes. Go back to the parent and check the status again. Those changes are not committed yet. We run git add and then git commit. The submodule is now updated to the latest version.

Question 44

Stash Work, Fix Bug, Restore and Update

Accepted Answer

Imagine a common scenario: you've been working in a Git repository on the feature ui branch, and suddenly you need to fix an authentication issue on the main branch. For this, you have to create a new hotfix branch, commit the changes into that branch, and then merge it into the main branch. While you're working on feature ui, you cannot simply change the branch because you have uncommitted changes. Before moving to the main branch, we need to stash those changes, meaning to put them aside. Next, we move to our main branch, and to fix our authentication issue, we first create the hotfix branch and commit the fix there. Then we move back to the main branch and merge our hotfix into main. Finally, we delete the hotfix branch since we don't need it anymore. We then move back to the work that we paused, meaning feature ui, and rebase it onto main. Finally, we need to bring back the work that we stashed. For that, we type git stash pop.

Question 45

Remove File from Entire Git History

Accepted Answer

We've committed a file, secrets.env, which contains some sensitive credentials, and we have to remove it from the entire Git history. First we need to move to the repository directory. Then we check the logs for commits that contain the secrets.env file. For this we type git log with --all to see all logs and --oneline to see each of them on one line. To filter by the file name, we type two hyphens and then the name of the file. To see exactly what was changed in those commits, we can add the -p flag. We need to delete the file from our Git history, and for this we run the command git filter-branch. We pass the --force flag so that filter-branch runs even if a backup from an earlier run already exists. We also use a filter flag that lets us run a command on each commit. The command is the basic Linux rm to remove the file. The --prune-empty flag covers cases where removing the file leaves a commit empty, so that commit is dropped. Finally, we need to delete the backup of the old history, meaning we run rm -rf on .git/refs/original. If we'd also like to change this on the remote origin, we need to run git push --force --all.

Question 46

Merge Repositories Preserving Both Histories

Accepted Answer

We have two separate Git repositories, repo A and repo B, with five and four commits respectively. They've been developed independently, so they have different histories, and we need to create one monorepo that combines both of them and keeps the full commit history. The important thing is that we need to use subtree. git subtree is a command typically used when we have shared libraries or other shared resources and, instead of merging everything into one monorepo, we keep our own repository and reference other repositories inside a directory. We use git log --oneline to check the commits. Create a directory and then initialize Git inside it with git init. Then we do our first commit. It is an empty commit, so we need to add the --allow-empty flag. Now we need to integrate our repositories as subtrees in our new repository. Use git subtree add with --prefix set to project A. Finally, verify the monorepo by checking git log.

Question 47

Fix Repository with Unrelated Histories

Accepted Answer

The repository interview is in a broken state. The local and remote branches diverged with no common ancestor, meaning they don't share the same history. When we use git push to push to the remote main, it fails with a non-fast-forward error. Pulling from main fails as well. Our task is to fix this repository: merge and linearize the unrelated histories using rebase and create a new single commit sequence. When we use git merge, the branch that was merged into the main branch retains the same commit hashes. When we use rebase, those commit hashes get rewritten. Next we pull main, and we rebase rather than merge. The flag that we use in this case is --allow-unrelated-histories. The rebase stops on conflicts, and we need to resolve them. There are three main types of conflicts: modify/modify, modify/delete, and add/add. Once this is done, we type git rebase --continue.

Question 48

Recover Lost Commits from Detached HEAD

Accepted Answer

We have a Git repository in this directory, and we were in a detached HEAD state, where we made three commits. When we switched to the main branch, those three commits became unreachable, and we'd like to restore them. When we run git log, we don't see those commits, so we need to find a way to restore them and create a branch called recovered work where those commits will be listed. To see everything that HEAD has pointed to, including commits that were lost from a detached HEAD or after a git reset and so on, we can use the command git reflog. Git reflog shows us the log of movements of HEAD itself. In the git reflog output we can see many more entries than in git log. Since our task is to restore the commits on the branch recovered work, we have to create the branch recovered work from the right commit. We find the last of the lost commits in the reflog and use its hash when creating the branch.

Question 49

CSV and Partitions

Accepted Answer

Spark is a big data processing framework. It is designed to process massive amounts of data across multiple computers at the same time. Instead of tables, it uses data frames. Our job here is simply to read a CSV file, find out how many partitions are created, and print that number. When Spark reads this file, it doesn't process it as one giant block. Instead, it splits it into smaller chunks called partitions. Each partition can be sent to a different executor. The default maximum partition size is 128 megabytes. RDD stands for Resilient Distributed Dataset. An RDD splits the data into multiple partitions, and each partition holds a subset of the full data. The rdd attribute converts the data frame into a format that gives us access to the underlying partition information. getNumPartitions is a built-in method that returns the number of partitions as a number.

Question 50

Repartition

Accepted Answer

Spark is a big data framework that is designed to process massive amounts of data across multiple computers at the same time. Instead of tables like in SQL, Spark uses data frames. We have only one file, orders.csv, with 5,000 records. Our main goal here is to repartition the data frame to eight partitions and print the task count, which should equal eight. When Spark reads this file, it doesn't process it as one giant block. Instead, it splits it into smaller chunks called partitions. Each partition can be sent to a different executor. An executor is a process that works on its chunk of data independently. Repartition is a built-in Spark method that lets us manually control how many partitions our data frame has. We pass the number that we want, and Spark redistributes all the data across exactly that many partitions. Calling rdd on the repartitioned data frame converts it to an RDD, which is a Spark data structure that holds the partitioned data. getNumPartitions is a built-in method that counts and returns the number of partitions.

Question 51

Broadcast Join

Accepted Answer

Spark is a big data framework that processes massive amounts of data across multiple computers at the same time. Instead of tables like in SQL, Spark uses data frames. We are given two files, orders.csv with 5,000 records, and customers.csv with 50 records. We need to join these two files together using a broadcast join, then count orders and print the number of distinct cities. We use a regular inner join because we only want orders that have a matching customer. The only difference is how that join is performed. Instead of shuffling both data frames across the network, Spark takes the small data frame and sends a full copy of it to every worker. So each worker now has its own partition of the large data frame and a full copy of the small data frame. It means that it can perform the join locally without any data movement. We don't want to shuffle the large data frame across the network. That's why we broadcast the smaller one, because it is easier and cheaper to copy. We also import the broadcast function from the library. Setting header to true uses the first row as column names, and inferSchema automatically detects data types for each column. Then we count orders per city with the help of group by.

Question 52

Correcting Social Media Posts

Accepted Answer

Spark is a big data framework that is designed to process massive amounts of data across multiple computers at the same time. Instead of tables, Spark uses data frames. We have one file, posts.csv. Its data frame contains seven columns: the text of the post, ID, date, number of likes, comments, and shares, and the platform where it was published. We need to go through every post and replace the word Python with PySpark in the text column. withColumn is a data frame method that adds a new column or replaces an existing one. It takes two arguments. The first one is the name of the column, here the text column, and the second one is the value that we want to put in it. For the second argument, we use the regexp_replace function. This function takes three arguments. The first one is the column to search. The second is the pattern to find, which is Python. The third is the text to replace it with, which is PySpark.

Question 53

Daily Category Sales Aggregation

Accepted Answer

Master daily sales aggregation in PySpark. Learn how to join transaction tables with product catalogs and use multi-column GroupBy operations to calculate total quantities sold per category per day.

Question 54

Most Common Order Status

Accepted Answer

Spark is a big data framework that processes massive amounts of data across multiple machines. Instead of tables, Spark uses data frames. We are given only one file, orders.csv, with 5,000 records. Each order has a status, like completed, canceled, ongoing, and so on. Our job here is to find which status appears most frequently. There are two types of transformations, narrow and wide. In a narrow transformation, each row is processed on the executor where it already lives. It means that no data moves between executors, and each partition stays on the same machine and gets processed independently. But in a wide transformation, rows from one executor can end up on a completely different executor. Here, data moves between machines, and this is called a shuffle. The downside is that a wide transformation requires more network traffic and more time. When we use group by, we perform a wide transformation because rows with the same status have to move onto the same executor. Count finds the number of rows in each group, and we order by count with ascending set to false, which means that everything is sorted in descending order. The first row is then the most common status.

Question 55

Calculating Overtime Pay

Accepted Answer

We will be calculating overtime pay. Spark is a big data framework that processes large amounts of data across multiple machines. We are given two files: employees.csv and payroll.csv. The first data frame contains the names of employees, their IDs, ages, and job positions. The second one stores the hourly rate, the number of hours worked, and a reference to an employee by ID. Our job is to calculate total pay for each employee, which follows two rules. If an employee worked 40 hours or fewer, then the total pay is the product of hours worked and hourly rate. But if an employee worked more than 40 hours, then the first 40 hours are paid at the regular rate and all the extra hours are paid at 1.5 times the regular rate. We join the two data frames on employee ID using an inner join. To the resulting data frame, we add a new column called Pay using the withColumn method. when is Spark's version of an if/else statement. The else branch in Spark is expressed with otherwise.
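
The two pay rules can be written out in plain Python to check the arithmetic before expressing them with when and otherwise in Spark. This is only an illustration of the rule, not the Spark solution itself, and the parameter names are assumptions.

```python
def total_pay(hours_worked, hourly_rate, regular_hours=40, overtime_multiplier=1.5):
    """Compute total pay with overtime for hours beyond `regular_hours`."""
    if hours_worked <= regular_hours:
        # Rule 1: no overtime, pay is simply hours times rate.
        return hours_worked * hourly_rate
    # Rule 2: regular hours at the normal rate, extra hours at 1.5 times the rate.
    overtime_hours = hours_worked - regular_hours
    regular_part = regular_hours * hourly_rate
    overtime_part = overtime_hours * hourly_rate * overtime_multiplier
    return regular_part + overtime_part
```

For example, 45 hours at a rate of 20 gives 40 * 20 for the regular hours plus 5 * 20 * 1.5 for the overtime, which is 950 in total. In Spark, the if branch corresponds to when and the fall-through branch corresponds to otherwise.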

Question 56

Cache and Performance

Accepted Answer

Spark is a big data framework that processes massive amounts of data across multiple machines. Instead of tables, it uses data frames. We have one file, orders.csv, with 50,000 records. Our job here is to cache the data frame, run count twice, measure how long each run takes, and print the results. Spark uses lazy evaluation, which means that it doesn't execute anything when we write a transformation. A transformation is simply an operation that modifies or processes our data. With lazy evaluation, nothing runs until we specifically ask for a result with an action, such as count. If we don't cache, then every action reads from the input and recomputes the output independently. But when we cache, the input is read only once and stored in distributed memory. The first count reads the CSV file and caches the result, while the second one reads from that memory, which makes it much faster.

Question 57

Filter Popular Videos

Accepted Answer

We need to filter popular videos. Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We are given one CSV file that is called Videos. The data frame contains six columns: the title of the video, its ID, genre, release year, duration, and number of views. We need to keep only videos that have more than one million views and were released in 2019 or later. The result should be saved as result_df. We read the CSV file from a given path and store it in a variable called df, for data frame. When header is set to true, Spark uses the first row as column names. At the same time, inferSchema automatically detects the data types of those columns. For the condition, we use the filter method of the data frame and store the result in result_df. The condition consists of two statements. The first one checks that the number of views is greater than one million, and the second one ensures that the release year is 2019 or later. In between, we use the AND operator, which requires both statements to be true.

Question 58

Anonymize User PII

Accepted Answer

PII stands for Personally Identifiable Information: data like emails and phone numbers that can identify a real person. We are given only one file, users.csv, and this data frame contains three columns: the user's email and phone number, along with a referenced_by ID. We are required to do two things. First, we extract the domain from the email address, which means that we keep everything after the @ symbol. Then we need to hide the first digits of the phone number so that only the last four are visible. Regex stands for regular expression. It is a pattern that is used to search, extract, or replace specific text inside a string. regexp_extract is a Spark function that extracts a specific part of a string using a regex pattern, which we use for the domain. For the phone number, we use the regexp_replace function, which replaces the part of the string that matches the pattern. withColumn takes the results and creates a new column.
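
The two regex ideas can be tried out with Python's built-in re module before moving them into Spark. This is a plain-Python analog, not the Spark code: the patterns, the function names, and the choice of * as the mask character are assumptions for illustration.

```python
import re


def extract_domain(email):
    """Keep everything after the @ symbol; return an empty string if there is none."""
    match = re.search(r"@(.+)$", email)
    return match.group(1) if match else ""


def mask_phone(phone, visible=4, mask_char="*"):
    """Mask every digit except the last `visible` digits, keeping separators as they are."""
    # A digit is masked when at least `visible` more digits follow it,
    # possibly with non-digit separators in between.
    pattern = r"\d(?=(?:\D*\d){%d})" % visible
    return re.sub(pattern, mask_char, phone)
```

With these patterns, the email ana@example.com yields example.com, and the phone number 555-123-4567 becomes ***-***-4567. The same patterns could then be passed to regexp_extract and regexp_replace, keeping in mind that Spark uses Java regex syntax, so they should be rechecked there.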

Question 59

Call Center Daily Stats

Accepted Answer

Join call records with customer data and compute daily aggregates using countDistinct and sum.

Question 60

Venture Capital Sector Analysis

Accepted Answer

Spark is a framework that processes large amounts of data across multiple machines. Instead of tables, it uses data frames. We are given two files, companies and investments. Our job here is to find the total investment amount for each industry sector and sort from highest to lowest. We first combine the two data frames, then we sum investments per industry, and in the end, we sort everything in descending order. An inner join returns only rows where a match exists in both tables or data frames. We use an inner join because for every investment row, we want the company's name and industry to be right next to it. We group by the industry column so that all rows that share the same industry are put together. The sum function adds up all amount values and finds the total investment per industry. The third step is to order everything by the total investment column in descending order.

Question 61

Window Functions without Partitions

Accepted Answer

Master global sorting and sequential numbering in PySpark. Learn how to join DataFrames and use the row_number() window function across an entire unpartitioned dataset.

Question 62

Calculating PE Portfolio Values

Accepted Answer

Master financial data aggregation in PySpark. Learn how to join relational tables, multiply columns to calculate holding values, and group by multiple dimensions to compute daily private equity portfolio totals.

Question 63

Data Engineering Interview Questions

# Definition for singly-linked list.
class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

# Definition for a binary tree node.
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

# Definition for singly-linked list.
class ListNode:
    def __init__(self, x):
        self.val = x
        self.next = None

How to Replace NULL with 0 in SQL and Retrieve All Orders from the `orders` Table