Palantir: Cache and Performance — Data Engineering Interview Q&A (2026)

52. Cache and Performance

Palantir | Medium | Spark DataFrame Caching

Objective

Spark evaluates lazily: transformations only build an execution plan, and nothing runs until an action is triggered. Without caching, each action re-executes that plan from the source, so a DataFrame that is reused several times is also recomputed several times. The cache() method marks the DataFrame for storage so that, once the first action has computed it, subsequent actions read the stored result instead of re-reading and re-parsing the source data. Caching is one of the most practical Spark optimizations, especially when iterating over the same dataset in machine learning pipelines or exploratory analysis.

Task

A dataset with 50,000 order records is available at /home/interview/orders.csv. A starter script has been created for you at /home/interview/cache_performance.py with a SparkSession and the CSV already loaded into a DataFrame called df. Cache the DataFrame, run count() twice while measuring each runtime, and print the results in the exact format:

first run = <time>s
second run = <time>s
faster after caching = true

Example

first run = 1.24s
second run = 0.05s
faster after caching = true

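Solution

from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# Starter code: load the orders CSV into a DataFrame.
df = spark.read.csv("/home/interview/orders.csv", header=True, inferSchema=True)

# Mark the DataFrame for caching (lazy: nothing is stored yet).
df.cache()

# First run: computes the DataFrame and populates the cache.
start = time.time()
df.count()
first_run = time.time() - start

# Second run: served from the cache.
start = time.time()
df.count()
second_run = time.time() - start

is_faster = second_run < first_run

# Print in the required format; Python's True/False becomes true/false.
print(f"first run = {first_run:.2f}s")
print(f"second run = {second_run:.2f}s")
print(f"faster after caching = {str(is_faster).lower()}")

spark.stop()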

Explanation

Step 1: Caching the DataFrame

df.cache()

cache() marks the DataFrame to be stored in memory after its first evaluation. It's important to understand that cache() is lazy: calling it doesn't actually load anything into memory. The data is only cached when the first action (like count()) triggers computation.

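Under the hood, DataFrame.cache() is shorthand for persist() with the default storage level, which keeps data in memory and spills to disk (MEMORY_AND_DISK). If the data doesn't fit in memory, Spark spills the overflow to disk rather than failing. Note that RDD.cache() behaves differently: its default is MEMORY_ONLY.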

Step 2: First Run (Cold)

start = time.time()
df.count()
first_run = time.time() - start

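The first count() does all the heavy lifting: it reads the CSV from disk, parses every row, and stores the result in memory. This is the "cold" run because nothing is cached yet. Note that inferSchema=True costs an extra pass over the file, but that pass runs when spark.read.csv() is called, before the timer starts, so it is not part of the measured time.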

Step 3: Second Run (Warm)

start = time.time()
df.count()
second_run = time.time() - start

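The second count() reads the cached partitions directly from memory, skipping the disk read and CSV parsing entirely. This is why it's significantly faster: 0.05s versus 1.24s in the example output above, although the exact gap depends on the machine and the data.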
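The two timing blocks repeat the same pattern, so they can be factored into a small helper. This is a sketch only: timed is a hypothetical name, not part of the starter script, and it uses time.perf_counter(), which is designed for measuring short intervals, instead of the time.time() calls shown in the steps.

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload so the sketch runs without Spark.
total, seconds = timed(lambda: sum(range(1000)))
print(f"result = {total}, elapsed = {seconds:.6f}s")
```

With the DataFrame from the task, the same helper would be called as `_, first_run = timed(df.count)` followed by `_, second_run = timed(df.count)`.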

Step 4: Cache vs Persist

cache() stores data in memory with disk spillover, which is the safest default. If you need more control, persist() lets you choose a storage level:

MEMORY_ONLY: Fastest, but partitions that don't fit in memory are not stored and are recomputed when needed
MEMORY_AND_DISK: Default for DataFrame cache(), spills to disk if needed
DISK_ONLY: Saves memory but slower reads
MEMORY_AND_DISK_SER: Serialized format, uses less memory but more CPU (a Scala/Java API level; recent PySpark releases may not expose it)

When you're done with a cached DataFrame, call df.unpersist() to free the memory.
