Handling Duplicate Columns

Objective

A bank maintains its transactions and customer data in separate databases. However, for certain exhaustive data analysis tasks and risk models, the bank needs to perform a cross join operation between the transactions and customers datasets.

Task

Write a PySpark function that performs a cross join operation on these two DataFrames.

Because both DataFrames contain a cust_id column, a direct cross join will result in duplicate column names. To resolve this and satisfy the Expected Output Schema (which only asks for one cust_id), drop the cust_id column from the customers DataFrame before performing the cross join.

Save your resulting DataFrame as result_df. Ensure the columns appear in the exact order given in the Expected Output Schema, and sort the final output by trans_id (ascending), then by first_name (ascending).

File Paths

  • Transactions Dataset: /home/interview/transactions.csv
  • Customers Dataset: /home/interview/customers.csv
  • Starter script: /home/interview/cross_join.py

Schema

transactions.csv

Column Name    Data Type
trans_id       Integer
trans_amt      Float
date           String
cust_id        Integer

customers.csv

Column Name    Data Type
cust_id        Integer
first_name     String
last_name      String
age            Integer

Expected Output Schema

Column Name    Data Type
trans_id       Integer
trans_amt      Float
date           String
cust_id        Integer
first_name     String
last_name      String
age            Integer

Example

Given this sample input:

transactions

trans_id    trans_amt    date          cust_id
101         50.0         2023-01-01    1
102         150.0        2023-01-02    2

customers

cust_id    first_name    last_name    age
1          John          Doe          28
2          Jane          Smith        34

The expected output would be:

trans_id    trans_amt    date          cust_id    first_name    last_name    age
101         50.0         2023-01-01    1          Jane          Smith        34
101         50.0         2023-01-01    1          John          Doe          28
102         150.0        2023-01-02    2          Jane          Smith        34
102         150.0        2023-01-02    2          John          Doe          28

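(Note: Every transaction is paired with every customer, creating a Cartesian product, so 2 transactions and 2 customers yield 4 rows. The cust_id shown is the one from transactions, since the customers cust_id is dropped before the join.)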
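To see the Cartesian product without starting Spark, the sample data can be reproduced in plain Python. This is only an illustration of the pairing and sorting logic, not the PySpark solution; the variable names are hypothetical and tuples stand in for DataFrame rows.

```python
from itertools import product

# Sample rows from the example: (trans_id, trans_amt, date, cust_id)
transactions = [
    (101, 50.0, "2023-01-01", 1),
    (102, 150.0, "2023-01-02", 2),
]
# Sample rows from the example: (cust_id, first_name, last_name, age)
customers = [
    (1, "John", "Doe", 28),
    (2, "Jane", "Smith", 34),
]

# Pair every transaction with every customer, dropping the customer's cust_id (index 0).
rows = [t + c[1:] for t, c in product(transactions, customers)]

# Sort by trans_id (index 0), then first_name (index 4), both ascending.
rows.sort(key=lambda r: (r[0], r[4]))

for row in rows:
    print(row)
```

The row count is the product of the two input sizes, which is why cross joins on large tables can become expensive quickly.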
