Hadoop

Data Processing with Hadoop and PySpark (Data Engineer @ Yahoo, difficulty: medium)

You are working with a large-scale data processing project that uses Hadoop for storage and PySpark for data processing. Explain how you would set up a PySpark job to read data from HDFS, perform a transformation to filter out records with missing values, and then write the cleaned data back to HDFS. Provide a sample PySpark code snippet to demonstrate this process.
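
One possible shape for such a job is sketched below. It assumes the input is a CSV file with a header row and that "missing" means a null in any column; the paths, the app name, and the function name `clean_missing_values` are hypothetical placeholders, not part of the question.

```python
def clean_missing_values(spark, input_path, output_path):
    """Read CSV from HDFS, drop rows with any null, write Parquet back to HDFS."""
    # header/inferSchema are assumptions about the input file layout.
    df = spark.read.csv(input_path, header=True, inferSchema=True)

    # Keep only rows in which every column has a value.
    cleaned = df.na.drop(how="any")

    # Overwrite the target directory so the job can be re-run safely.
    cleaned.write.mode("overwrite").parquet(output_path)
    return cleaned


def main():
    # Imported here so the helper above can be exercised without a cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("clean-missing-values").getOrCreate()
    try:
        # Hypothetical HDFS locations; replace with real paths.
        clean_missing_values(
            spark,
            "hdfs:///data/raw/events.csv",
            "hdfs:///data/clean/events",
        )
    finally:
        spark.stop()
```

Calling `main()` from a script submitted with `spark-submit` would run the job. To treat only some columns as required, `df.na.drop(subset=[...])` restricts the null check to those columns.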

Big Data Concepts in Hadoop and Spark (Data Engineer @ Yahoo, difficulty: medium)

Explain how Hadoop and Apache Spark handle big data processing. What are the main differences between the two frameworks? Provide examples of use cases where each would be more suitable.

Basics of Pig and Oozie in Hadoop Ecosystem (Data Engineer @ Yahoo, difficulty: medium)

Describe the basic concepts of Apache Pig and Apache Oozie in the Hadoop ecosystem. What are their primary functions and how do they fit into the big data processing workflow?

Implementing MapReduce for Common Problems (Data Engineer @ Yahoo, difficulty: hard)

Explain how to implement MapReduce for the following common problems: word count, inverted index, and mean calculation. Provide a brief description and sample code for each problem.
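
The three problems share one pattern: the mapper emits key/value pairs, the framework groups values by key, and the reducer folds each group. The sketch below simulates that flow in a single Python process rather than on a Hadoop cluster; `run_mapreduce` and the mapper/reducer names are illustrative helpers, not a Hadoop API.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map -> shuffle -> reduce in one process (a stand-in for the framework)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}


# Word count: emit (word, 1) per occurrence, then sum the ones.
def wc_map(line):
    for word in line.lower().split():
        yield word, 1


def wc_reduce(word, counts):
    return sum(counts)


# Inverted index: input records are (doc_id, text); emit (word, doc_id).
def index_map(doc):
    doc_id, text = doc
    for word in set(text.lower().split()):
        yield word, doc_id


def index_reduce(word, doc_ids):
    return sorted(set(doc_ids))


# Mean: input records are (key, number); emit (key, (number, 1)) so that
# partial sums and counts can be combined before the final division.
def mean_map(record):
    key, value = record
    yield key, (value, 1)


def mean_reduce(key, pairs):
    total = sum(value for value, _ in pairs)
    count = sum(n for _, n in pairs)
    return total / count
```

For example, `run_mapreduce(["a b a", "b c"], wc_map, wc_reduce)` counts words across two lines. Emitting `(value, 1)` pairs for the mean is a deliberate choice: sums and counts can be merged in a combiner, whereas averages of averages cannot.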

Apache Pig Script for LCS Problem with Hadoop and Machine Learning Integration (Data Engineer @ Yahoo, difficulty: hard)

Create an Apache Pig script to solve a Longest Common Subsequence (LCS) problem using data stored in Hadoop HDFS. Additionally, describe how you would integrate this data processing with a machine learning model to predict the likelihood of sequences being similar in future data streams.
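
Pig Latin has no built-in LCS operator, so the per-pair computation would most likely live in a user-defined function that the script calls for each pair of sequences. The sketch below shows only that core dynamic-programming step in plain Python; it is not a Pig script, and `lcs_similarity` is a hypothetical normalised score that could serve as one input feature for a downstream similarity model.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (O(len(a) * len(b)))."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """Hypothetical feature: LCS length divided by the longer sequence's length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0
```

Keeping only two rows of the table bounds memory by the length of the second sequence, which matters when the function is applied to many pairs.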

Finding Second-Degree Friends Using MapReduce Framework (Data Engineer @ Yahoo, difficulty: hard)

Using the MapReduce framework, explain how you would find the second-degree friends for each user in a social network. Provide a brief description and sample code for the solution.
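
One common approach: for each user's friend list, the mapper marks every pair of friends as reachable through that user, and also emits the direct friendships so the reducer can exclude them. The sketch below runs that logic in a single process and assumes the input is a symmetric adjacency list; all function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def map_friends(user, friends):
    """Emit direct friendships and friend-of-friend candidates."""
    for friend in friends:
        yield user, ("direct", friend)
    # Any two friends of `user` are at most two hops apart via `user`.
    for a, b in combinations(friends, 2):
        yield a, ("via", b)
        yield b, ("via", a)


def reduce_friends(user, tagged):
    """Keep candidates that are neither the user nor already direct friends."""
    direct = {other for tag, other in tagged if tag == "direct"}
    candidates = {other for tag, other in tagged if tag == "via"}
    return sorted(candidates - direct - {user})


def second_degree_friends(graph):
    """Simulate map, shuffle, and reduce over {user: [friends]}."""
    groups = defaultdict(list)
    for user, friends in graph.items():
        for key, value in map_friends(user, friends):
            groups[key].append(value)
    return {user: reduce_friends(user, groups[user]) for user in graph}
```

The mapper emits a number of pairs quadratic in each friend-list length, so on a real cluster very high-degree users may need special handling.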

Finding the 95th Percentile of URL Sizes (Data Engineer @ Yahoo, difficulty: hard)

Given a dataset of 2 billion URLs and their sizes, describe an efficient algorithm to find the 95th percentile of all the sizes. Provide a brief explanation and a sample implementation using a distributed computing framework.
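
One way to avoid sorting all 2 billion values is to have each partition build a histogram of sizes, merge the histograms, and walk the cumulative counts. The sketch below simulates this in one process. It assumes sizes are integers with a manageable number of distinct values and uses the nearest-rank definition of a percentile; the function names are illustrative.

```python
import math
from collections import Counter


def partition_histogram(sizes):
    """Map side: count how often each size occurs in one partition."""
    return Counter(sizes)


def merge_histograms(histograms):
    """Reduce side: add the per-partition counts together."""
    total = Counter()
    for histogram in histograms:
        total.update(histogram)
    return total


def percentile_from_histogram(histogram, percentile):
    """Nearest-rank percentile: smallest size with at least p% of values <= it."""
    n = sum(histogram.values())
    if n == 0:
        raise ValueError("no values")
    rank = max(1, math.ceil(percentile * n / 100))
    seen = 0
    for size in sorted(histogram):
        seen += histogram[size]
        if seen >= rank:
            return size
```

If the sizes take too many distinct values, they can be grouped into fixed-width buckets at the cost of exactness. In Spark, `DataFrame.approxQuantile` offers an approximate alternative with a configurable relative error.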

Calculating Average of Large Dataset Across Multiple Computers (Data Engineer @ Yahoo, difficulty: medium)

Given a huge dataset of numbers distributed across multiple computers, describe an efficient algorithm to find the average of all the numbers. Provide a brief explanation and a sample implementation using a distributed computing framework.
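
A standard approach is for each machine to compute a local `(sum, count)` pair, so that only two numbers per machine cross the network, and for a coordinator to divide the combined sum by the combined count. The sketch below simulates the machines as plain lists; `partial_aggregate` and `combine` are illustrative names.

```python
def partial_aggregate(numbers):
    """Runs on each machine: return (sum, count) for the local data."""
    total = 0.0
    count = 0
    for x in numbers:
        total += x
        count += 1
    return total, count


def combine(partials):
    """Runs on the coordinator: merge (sum, count) pairs into a global mean."""
    total = 0.0
    count = 0
    for part_total, part_count in partials:
        total += part_total
        count += part_count
    if count == 0:
        raise ValueError("no values to average")
    return total / count
```

Averaging the per-machine averages would be incorrect when machines hold different amounts of data, which is why the count travels with the sum.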

Understanding Big Data and Hadoop (Data Analyst @ Google, difficulty: medium)

What is Big Data and what is Hadoop used for?
