You are working with a large-scale data processing project that uses Hadoop for storage and PySpark for data processing. Explain how you would set up a PySpark job to read data from HDFS, perform a transformation to filter out records with missing values, and then write the cleaned data back to HDFS. Provide a sample PySpark code snippet to demonstrate this process.
Explain how Hadoop and Apache Spark handle big data processing. What are the main differences between the two frameworks? Provide examples of use cases where each would be more suitable.
Describe the basic concepts of Apache Pig and Apache Oozie in the Hadoop ecosystem. What are their primary functions and how do they fit into the big data processing workflow?
Explain how to implement MapReduce for the following common problems: word count, inverted index, and mean calculation. Provide a brief description and sample code for each problem.
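As a possible starting point for the word count part, here is a minimal sketch of the map, shuffle, and reduce phases in plain Python. It runs in a single process and only imitates what a Hadoop job would do across nodes; the function names (`map_word_count`, `shuffle`, `reduce_word_count`, `word_count`) are illustrative and not part of any Hadoop API.

```python
from collections import defaultdict


def map_word_count(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_word_count(word, counts):
    """Reduce phase: sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_word_count(line))
    grouped = shuffle(pairs)
    return dict(reduce_word_count(w, c) for w, c in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big ideas", "data pipelines"]))
```

An inverted index follows the same shape: the mapper would emit `(word, document_id)` instead of `(word, 1)`, and the reducer would collect the document ids into a set instead of summing.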
Create an Apache Pig script to solve a Longest Common Subsequence (LCS) problem using data stored in HDFS. Additionally, describe how you would integrate this data processing with a machine learning model to predict the likelihood of sequences being similar in future data streams.
Using the MapReduce framework, explain how you would find the second-degree friends for each user in a social network. Provide a brief description and sample code for the solution.
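One way to approach this, sketched here in single-process Python rather than on a real cluster: each mapper receives a user and that user's friend list, flags the direct friendships, and emits every ordered pair of that user's friends as second-degree candidates (they share this user as a mutual friend). The reducer then removes direct friends and the user themselves. The sketch assumes friendships are symmetric and that the input is an adjacency list; all function names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def map_friends(user, friends):
    """Map phase for one adjacency-list record."""
    # Flag direct friendships so the reducer can exclude them later.
    for friend in friends:
        yield user, ("direct", friend)
    # Any two friends of `user` are at most two hops apart via `user`.
    for a, b in permutations(friends, 2):
        yield a, ("candidate", b)


def reduce_friends(user, values):
    """Reduce phase: keep candidates that are not direct friends or the user."""
    direct = {v for tag, v in values if tag == "direct"}
    candidates = {v for tag, v in values if tag == "candidate"}
    return candidates - direct - {user}


def second_degree_friends(adjacency):
    """Simulate map, shuffle, and reduce over a dict of user -> friend list."""
    groups = defaultdict(list)
    for user, friends in adjacency.items():
        for key, value in map_friends(user, friends):
            groups[key].append(value)
    return {user: reduce_friends(user, groups[user]) for user in adjacency}


if __name__ == "__main__":
    chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
    print(second_degree_friends(chain))
```

Note that the candidate pairs grow quadratically with the length of a friend list, so users with very many friends would dominate the shuffle in a real job.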
Given a dataset of 2 billion URLs and their sizes, describe an efficient algorithm to find the 95th percentile of all the sizes. Provide a brief explanation and a sample implementation using a distributed computing framework.
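A possible approach, sketched in plain Python: each worker builds a histogram of the sizes in its partition, the histograms are merged, and the percentile is read off the cumulative counts. This avoids a global sort of 2 billion records. The sketch assumes sizes are integers with far fewer distinct values than records, and it uses the nearest-rank definition of a percentile; both are assumptions, as are the function names.

```python
from collections import Counter


def partition_histogram(sizes):
    """Per-worker step: count how often each size occurs in one partition."""
    return Counter(sizes)


def percentile_from_histograms(histograms, percent=95):
    """Merge per-partition histograms and return the nearest-rank percentile."""
    merged = Counter()
    for histogram in histograms:
        merged.update(histogram)

    total = sum(merged.values())
    if total == 0:
        raise ValueError("cannot take a percentile of an empty dataset")

    # Nearest rank: smallest value with at least percent% of records at or
    # below it. Integer arithmetic computes ceil(percent * total / 100).
    rank = (percent * total + 99) // 100

    seen = 0
    for size in sorted(merged):
        seen += merged[size]
        if seen >= rank:
            return size


if __name__ == "__main__":
    partitions = [range(1, 51), range(51, 101)]
    histograms = [partition_histogram(p) for p in partitions]
    print(percentile_from_histograms(histograms))
```

If the number of distinct sizes is too large to hold in one merged histogram, fixed-width buckets could replace exact values, at the cost of an approximate answer.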
Given a huge dataset of numbers distributed across multiple computers, describe an efficient algorithm to find the average of all the numbers. Provide a brief explanation and a sample implementation using a distributed computing framework.
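A common way to answer this is to have each machine compute a partial `(sum, count)` pair and then merge the pairs, since averaging the per-machine averages would be wrong whenever partitions differ in size. The sketch below imitates that in single-process Python; the partitions stand in for machines and the function names are illustrative.

```python
from functools import reduce


def partial_stats(numbers):
    """Per-machine step: return the (sum, count) of one partition."""
    total, count = 0.0, 0
    for x in numbers:
        total += x
        count += 1
    return total, count


def merge_stats(a, b):
    """Combine two (sum, count) pairs; associative, so merge order is free."""
    return a[0] + b[0], a[1] + b[1]


def distributed_mean(partitions):
    """Merge all partial results and divide once at the end."""
    total, count = reduce(merge_stats, map(partial_stats, partitions), (0.0, 0))
    if count == 0:
        raise ValueError("cannot average an empty dataset")
    return total / count


if __name__ == "__main__":
    # Averaging the two partition means (2 and 10) would give 6, not 4.
    print(distributed_mean([[1, 2, 3], [10]]))
```

Because `merge_stats` is associative and commutative, the same pair of functions maps naturally onto a combiner plus reducer in MapReduce or an aggregate operation in Spark.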
What is Big Data and what is Hadoop used for?