177. Extract Schema Information from Parquet File Using PyArrow
Scenario
A Parquet file contains structured data and you need to extract and document its schema information for analysis purposes.
Task
Write a Python script at /home/interview/extract_schema.py using pyarrow that reads /home/interview/data.parquet, extracts the schema information (column names, data types, compression codec, row count, and file size), and saves the output as JSON to /home/interview/schema_info.json.
Note: The pyarrow module is already installed.
Example
Expected output format in /home/interview/schema_info.json:
{
  "file": "/home/interview/data.parquet",
  "row_count": 390,
  "file_size_bytes": 45632,
  "file_size_kb": 44.56,
  "compression_codec": "SNAPPY",
  "columns": [
    {
      "name": "id",
      "type": "int64"
    },
    {
      "name": "name",
      "type": "string"
    },
    ...
  ]
}
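The two size fields in the sample appear to be related by a division by 1024 with rounding to two decimals (45632 / 1024 = 44.5625). The task does not state this rule, so the following is a small sketch under that assumption; size_fields is a hypothetical helper name.

```python
import json


def size_fields(size_bytes):
    """Return both size fields, assuming 1 KB = 1024 bytes."""
    return {
        "file_size_bytes": size_bytes,
        "file_size_kb": round(size_bytes / 1024, 2),
    }


fields = size_fields(45632)
print(json.dumps(fields, indent=2))
```

If the grader expects 1000-byte kilobytes instead, only the divisor would need to change.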