You are working with a large-scale data processing project that uses **Hadoop** for storage and **PySpark** for data processing. Explain how you would set up a PySpark job to read data from **HDFS**, perform a transformation to filter out records with missing values, and then write the cleaned data back to **HDFS**. Provide a sample PySpark code snippet to demonstrate this process.

Please sign-in to view the solution

Explain how **Hadoop** and **Apache Spark** handle big data processing. What are the main differences between the two frameworks? Provide examples of use cases where each would be more suitable.

Please sign-in to view the solution

Describe the basic concepts of **Apache Pig** and **Apache Oozie** in the Hadoop ecosystem. What are their primary functions and how do they fit into the big data processing workflow?

Please sign-in to view the solution

Explain how to implement **MapReduce** for the following common problems: **word count**, **inverted index**, and **mean calculation**. Provide a brief description and a sample code for each problem.

Please sign-in to view the solution

Create an **Apache Pig script** to solve a **Longest Common Subsequence (LCS)** problem using data stored in **Hadoop HDFS**. Additionally, describe how you would integrate this data processing with a **machine learning** model to predict the likelihood of sequences being similar in future data streams.

Please sign-in to view the solution

Using the **MapReduce framework**, explain how you would find the second-degree friends for each user in a social network. Provide a brief description and a sample code for the solution.

Please sign-in to view the solution

Given a dataset of 2 billion URLs and their sizes, describe an efficient algorithm to find the **95th percentile** of all the sizes. Provide a brief explanation and a sample implementation using a distributed computing framework.

Please sign-in to view the solution

Given a huge dataset of numbers distributed across multiple computers, describe an efficient algorithm to find the average of all the numbers. Provide a brief explanation and a sample implementation using a distributed computing framework.

Please sign-in to view the solution

What is Big Data and what is Hadoop used for?

Please sign-in to view the solution