Objective
You are working with a social media platform's user database. It contains personally identifiable information (PII) that must be partially anonymized before it is shared with the analytics team.
Task
Extract the domain name from each user's email address (everything after the @ symbol) and anonymize their phone number by replacing the first 6 digits with asterisks (******), keeping only the last 4 digits visible. Note that the phone number column is stored as an integer in the source data, so you will need to handle that appropriately. Your final output should contain three columns: user_id, email_domain, and anon_phone. Save your result as result_df.
File Path
- Dataset: /home/interview/users.csv
- Starter script: /home/interview/anonymize_users.py
Schema
| Column | Type |
| --- | --- |
| user_id | integer |
| email | string |
| phone | integer |
Example
Input: users
Output:
| user_id | email_domain | anon_phone |
| --- | --- | --- |
| 1 | example.com | ******4567 |
| 2 | company.org | ******3210 |
The email_domain column contains only the part after the @ symbol.
The anon_phone column replaces the first 6 digits of the phone number with asterisks, leaving the last 4 digits visible.
Solution
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
# inferSchema=True reads the phone column as a numeric type
df = spark.read.csv("/home/interview/users.csv", header=True, inferSchema=True)
# Capture everything after the @ as the email domain
email_domain = F.regexp_extract(F.col("email"), r"@(.+)", 1)
df = df.withColumn("email_domain", email_domain)
# Cast phone to a string, then mask its first six digits
anon_phone = F.regexp_replace(F.col("phone").cast("string"), r"^\d{6}", "******")
df = df.withColumn("anon_phone", anon_phone)
# Keep only the three required output columns
result_df = df.select("user_id", "email_domain", "anon_phone")
# --- Do not edit below this line ---
result_df.coalesce(1).write.csv("/home/interview/output", header=True, mode="overwrite")
spark.stop()
Explanation
Step 1: Extracting the Email Domain
email_domain = F.regexp_extract(F.col("email"), r"@(.+)", 1)
df = df.withColumn("email_domain", email_domain)
regexp_extract takes a column, a regex pattern, and a group index. The pattern @(.+) matches the @ symbol followed by one or more characters. The parentheses create a capture group, and group 1 is everything after the @. If the pattern does not match (for example, the value contains no @), regexp_extract returns an empty string. Calling .withColumn("email_domain", ...) then adds the new column to the DataFrame.
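The pattern can be sanity-checked outside Spark. The following is a minimal sketch in plain Python, assuming that Python's re module and Spark's regex engine agree on a pattern this simple; extract_domain is a hypothetical helper and not part of the starter script.

```python
import re

# Same pattern as the Spark solution; assumed to behave identically in Python's re.
DOMAIN_PATTERN = re.compile(r"@(.+)")


def extract_domain(email: str) -> str:
    """Hypothetical helper mirroring regexp_extract(email, r"@(.+)", 1)."""
    match = DOMAIN_PATTERN.search(email)
    # Like regexp_extract, fall back to an empty string when nothing matches.
    return match.group(1) if match else ""


print(extract_domain("alice@example.com"))  # example.com
print(extract_domain("not-an-email"))       # prints an empty line
print(extract_domain("a@b@example.com"))    # b@example.com
```

The last call shows that the capture starts at the first @, so a value containing more than one @ keeps the later ones in the extracted domain.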
Step 2: Anonymizing the Phone Number
anon_phone = F.regexp_replace(F.col("phone").cast("string"), r"^\d{6}", "******")
df = df.withColumn("anon_phone", anon_phone)
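This is the tricky part. The phone column holds integers in the CSV, so inferSchema=True reads it as a numeric type. Cast it to a string with .cast("string") before applying regex operations; regexp_replace then swaps the first 6 digits (^\d{6}) with six asterisks. Without the explicit cast you would be relying on Spark's implicit type coercion, whose behavior depends on the Spark version and configuration, so the cast keeps the result predictable. The pattern also assumes 10-digit numbers: a shorter value would show fewer than four digits or go unmasked entirely, and any leading zeros are already lost once the value is parsed as an integer.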
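The cast-then-replace logic, and how it behaves on numbers that are not 10 digits long, can be illustrated without a Spark session. This is a sketch in plain Python under the assumption that the simple pattern behaves the same in both regex engines; mask_phone and mask_keep_last4 are hypothetical helpers, and the second is an alternative technique rather than the approach used in the solution.

```python
import re


def mask_phone(phone: int) -> str:
    """Hypothetical helper: cast to string, then mask the first six digits."""
    return re.sub(r"^\d{6}", "******", str(phone))


def mask_keep_last4(phone: int) -> str:
    """Alternative (not the solution's method): six asterisks plus the last four digits."""
    return "******" + str(phone)[-4:]


print(mask_phone(5551234567))    # ******4567
print(mask_phone(1234567))       # ******7
print(mask_phone(12345))         # 12345
print(mask_keep_last4(1234567))  # ******4567
```

For well-formed 10-digit input the two helpers agree. They differ only on shorter values, where the regex version masks by position from the left and the slicing version always keeps the last four digits.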
Step 3: Selecting the Final Columns
result_df = df.select("user_id", "email_domain", "anon_phone")
select() keeps only the three columns the output requires and drops the original email and phone columns, so no raw PII reaches result_df. Selecting columns explicitly is a common way to control the output schema in PySpark.
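To see the three steps together, the following plain-Python sketch applies the same two patterns to a couple of rows and keeps only the required fields. The input rows and the anonymize helper are hypothetical, made up for illustration; only the output shape follows the example above.

```python
import re

# Hypothetical input rows with the same three columns as users.csv.
rows = [
    {"user_id": 1, "email": "alice@example.com", "phone": 5551234567},
    {"user_id": 2, "email": "bob@company.org", "phone": 5559873210},
]


def anonymize(row: dict) -> dict:
    """Build an output row containing only the three required columns."""
    match = re.search(r"@(.+)", row["email"])
    return {
        "user_id": row["user_id"],
        "email_domain": match.group(1) if match else "",
        "anon_phone": re.sub(r"^\d{6}", "******", str(row["phone"])),
    }


result = [anonymize(row) for row in rows]
for out in result:
    print(out)
```

Because each output row is built from scratch, the original email and phone values never appear in result, which mirrors what select() achieves on the DataFrame.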