Given a huge dataset of numbers distributed across multiple computers, describe an efficient algorithm to find the average of all the numbers. Provide a brief explanation and a sample implementation using a distributed computing framework.
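One possible approach: have each machine reduce its local shard to a partial (sum, count) pair, then merge the pairs. Only two numbers per machine cross the network. The stdlib sketch below simulates the shards as in-memory lists; in a real framework such as Spark, `merge` would be the function passed to `rdd.reduce`.

```python
def partial_sum_count(shard):
    # Map phase: each machine reduces its shard to a (sum, count) pair,
    # so only two numbers per machine ever cross the network.
    return (sum(shard), len(shard))

def merge(a, b):
    # Reduce phase: combine partial results pairwise; this is associative
    # and commutative, so it can be applied in any order (e.g. as a tree).
    return (a[0] + b[0], a[1] + b[1])

# Toy stand-ins for the shards held by each machine.
shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
partials = [partial_sum_count(s) for s in shards]

total, count = partials[0]
for p in partials[1:]:
    total, count = merge((total, count), p)
print(total / count)  # 5.0
```

Note that summing per shard first also avoids shipping raw data to a single node, which is what makes the approach scale.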
Given a dataset of 2 billion URLs and their sizes, describe an efficient algorithm to find the 95th percentile of all the sizes. Provide a brief explanation and a sample implementation using a distributed computing framework.
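One efficient pattern: each worker builds a histogram of the sizes in its shard (sizes are bounded integers, so histograms stay small), the histograms are merged, and the percentile is read off the cumulative counts. This avoids a global sort of 2 billion values. A stdlib sketch, with toy shards standing in for the distributed partitions (in Spark one could instead use `DataFrame.approxQuantile`):

```python
from collections import Counter

def local_histogram(sizes):
    # Map phase: each worker compresses its shard into a size -> count
    # histogram; only the histogram is shipped, not the raw sizes.
    return Counter(sizes)

def percentile_from_histograms(histograms, p):
    # Reduce phase: merge the small histograms, then walk the cumulative
    # distribution until p of the total mass is covered.
    merged = Counter()
    for h in histograms:
        merged.update(h)
    total = sum(merged.values())
    threshold = p * total
    cumulative = 0
    for size in sorted(merged):
        cumulative += merged[size]
        if cumulative >= threshold:
            return size

# Toy URL sizes split across three shards.
shards = [[10, 20, 30, 40], [50, 60, 70, 80], [90, 100]]
hists = [local_histogram(s) for s in shards]
print(percentile_from_histograms(hists, 0.95))  # 100
```

For continuous or very wide-ranging sizes, the same idea works with bucketed ranges instead of exact values, trading a little precision for bounded histogram size.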
Given a dataset of 60 million records and an O(n) computation that takes 2 weeks to complete on a single computer, how would you accelerate the processing so it completes within 24 hours? Describe your approach and the technologies you would use.
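Since 2 weeks is 336 hours, finishing within 24 hours needs at least a 14x speedup, so roughly 14-20 workers once coordination overhead is included, assuming the per-record work is independent. The sketch below shows the pattern with Python's `multiprocessing` on one machine; at 60 million records the same split-apply-combine structure would run on a cluster (e.g. as a Spark job). `process_chunk` is a stand-in for the real computation.

```python
import multiprocessing as mp

def process_chunk(records):
    # Stand-in for the expensive O(n) per-record computation.
    return [r * 2 for r in records]

def parallel_process(records, workers):
    # Split the input into one chunk per worker, process chunks in
    # parallel, then concatenate the partial results in order.
    chunk = (len(records) + workers - 1) // workers
    chunks = [records[i:i + chunk] for i in range(0, len(records), chunk)]
    with mp.Pool(workers) as pool:
        results = pool.map(process_chunk, chunks)
    return [x for part in results for x in part]

data = list(range(100))
result = parallel_process(data, workers=4)
print(result[:5])  # [0, 2, 4, 6, 8]
```

If the computation is not embarrassingly parallel (e.g. it carries state between records), the split has to follow the data's dependency structure instead, which is usually the harder part of the answer.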
You have a large dataset stored in a distributed file system like HDFS, and you need to perform complex transformations and aggregations. Explain how you would use Apache Spark to process this dataset. Provide an example of a Spark job that calculates the average value of a specific column.
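A real answer would use the PySpark DataFrame API, roughly `spark.read.parquet("hdfs://...").agg(avg("price"))`. The stdlib sketch below mimics what that aggregation does under the hood: a per-partition accumulation followed by a combine step, with a hypothetical column name `price` and toy in-memory partitions in place of HDFS blocks.

```python
# Toy rows; in Spark these would be DataFrame partitions read from HDFS.
partitions = [
    [{"price": 10.0}, {"price": 20.0}],
    [{"price": 30.0}],
]

def seq_op(acc, row):
    # Per-partition accumulation: running (sum, count), like Spark's
    # partial aggregation before the shuffle.
    return (acc[0] + row["price"], acc[1] + 1)

def comb_op(a, b):
    # Merge partial results from different partitions (the combine side
    # of df.agg(avg("price"))).
    return (a[0] + b[0], a[1] + b[1])

partials = []
for part in partitions:
    acc = (0.0, 0)
    for row in part:
        acc = seq_op(acc, row)
    partials.append(acc)

total, count = (0.0, 0)
for p in partials:
    total, count = comb_op((total, count), p)
print(total / count)  # 20.0
```

The two-function shape (`seq_op`, `comb_op`) is exactly the signature Spark's `aggregate` expects, which is why averages scale: each executor ships back two numbers, not its rows.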
Describe the types of technologies and architecture you would need to build a scalable data engineering solution for a platform like YouTube. Focus on data ingestion, storage, processing, and analytics.
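To make the four stages concrete, here is a toy end-to-end pipeline in stdlib Python: ingest view events, store them partitioned by date (mirroring an HDFS/object-store layout), batch-aggregate them, and serve an analytics query. All names (`video_id`, the event shape) are illustrative; in production, ingestion would go through something like Kafka and the batch job would be Spark.

```python
from collections import defaultdict

# Storage layer: partition key (date) -> list of events, standing in for
# date-partitioned files in HDFS or an object store.
storage = defaultdict(list)

def ingest(event):
    # Ingestion: in production this would be a message queue (e.g. Kafka)
    # feeding a stream processor; here we append straight to storage.
    storage[event["date"]].append(event)

def batch_aggregate():
    # Processing: a daily batch job (a Spark job in practice) computing
    # total views per video across all partitions.
    views = defaultdict(int)
    for partition in storage.values():
        for event in partition:
            views[event["video_id"]] += 1
    return views

# Analytics consumers would query the aggregated table, not raw events.
ingest({"date": "2024-01-01", "video_id": "a"})
ingest({"date": "2024-01-01", "video_id": "a"})
ingest({"date": "2024-01-02", "video_id": "b"})
print(dict(batch_aggregate()))  # {'a': 2, 'b': 1}
```

The key architectural point the toy preserves is the separation of concerns: ingestion only appends, storage is partitioned for pruning, and analytics reads pre-aggregated results rather than raw events.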
What is Big Data and what is Hadoop used for?
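Hadoop's core programming model is MapReduce, and its canonical illustration is word count: mappers emit (word, 1) pairs, a shuffle groups pairs by key, and reducers sum the counts. A stdlib sketch of that pattern (the shuffle is implicit in `Counter`'s grouping):

```python
from collections import Counter
from itertools import chain

def map_phase(line):
    # Mapper: emit (word, 1) pairs, as a Hadoop mapper would for each
    # input split.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Reducer: sum counts per key; in Hadoop the shuffle delivers all
    # pairs for a given key to the same reducer first.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big compute", "data lakes"]
pairs = chain.from_iterable(map_phase(l) for l in lines)
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'compute': 1, 'lakes': 1}
```

In real Hadoop, HDFS splits the input across machines and YARN schedules the map and reduce tasks near the data, which is what lets the same two-function program scale to terabytes.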