Objective
A bank maintains its transactions and customer data in separate databases. However, for certain exhaustive data analysis tasks and risk models, the bank needs to perform a cross join operation between the transactions and customers datasets.
Task
Write a PySpark function that performs a cross join operation on these two DataFrames.
Because both DataFrames contain a cust_id column, a direct cross join will result in duplicate column names. To resolve this and satisfy the Expected Output Schema (which only asks for one cust_id), drop the cust_id column from the customers DataFrame before performing the cross join.
Save your resulting DataFrame as result_df. Ensure the output matches the exact schema order requested, and order the final output by trans_id (ascending), then by first_name (ascending).
File Path
- Transactions Dataset: /home/interview/transactions.csv
- Customers Dataset: /home/interview/customers.csv
- Starter script: /home/interview/cross_join.py
Schema
transactions.csv
| Column Name | Data Type |
|---|---|
| trans_id | Integer |
| trans_amt | Float |
| date | String |
| cust_id | Integer |
customers.csv
| Column Name | Data Type |
|---|---|
| cust_id | Integer |
| first_name | String |
| last_name | String |
| age | Integer |
Expected Output Schema
| Column Name | Data Type |
|---|---|
| trans_id | Integer |
| trans_amt | Float |
| date | String |
| cust_id | Integer |
| first_name | String |
| last_name | String |
| age | Integer |
Example
Given this sample input:
transactions
| trans_id | trans_amt | date | cust_id |
|---|---|---|---|
| 101 | 50.0 | 2023-01-01 | 1 |
| 102 | 150.0 | 2023-01-02 | 2 |
customers
| cust_id | first_name | last_name | age |
|---|---|---|---|
| 1 | John | Doe | 28 |
| 2 | Jane | Smith | 34 |
The expected output would be:
| trans_id | trans_amt | date | cust_id | first_name | last_name | age |
|---|---|---|---|---|---|---|
| 101 | 50.0 | 2023-01-01 | 1 | Jane | Smith | 34 |
| 101 | 50.0 | 2023-01-01 | 1 | John | Doe | 28 |
| 102 | 150.0 | 2023-01-02 | 2 | Jane | Smith | 34 |
| 102 | 150.0 | 2023-01-02 | 2 | John | Doe | 28 |
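(Note: Every transaction is paired with every customer, creating a Cartesian product, so 2 transactions and 2 customers yield 2 × 2 = 4 rows. The cust_id shown is the one from the transactions dataset.)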
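The pairing described in the note can be checked without Spark. The plain-Python sketch below is only an illustration of the Cartesian product on the sample rows; the variable names are hypothetical, and it is not a substitute for the PySpark solution the task asks for.

```python
from itertools import product

# Sample rows from the Example section, as plain tuples.
transactions = [
    (101, 50.0, "2023-01-01", 1),
    (102, 150.0, "2023-01-02", 2),
]
customers = [
    (1, "John", "Doe", 28),
    (2, "Jane", "Smith", 34),
]

# Drop the customer's cust_id (index 0), then pair every transaction
# with every customer.
pairs = [t + c[1:] for t, c in product(transactions, customers)]

# Sort by trans_id (index 0), then first_name (index 4), both ascending.
pairs.sort(key=lambda row: (row[0], row[4]))

print(len(pairs))  # 4
```

The row count is always the product of the two input sizes, which is why cross joins grow quickly on large datasets.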