# Data Engineering @ Yahoo
## Secure Data Engineering System Design with AWS and Machine Learning

DevOps Engineer @ Yahoo | Difficulty: Medium

You are tasked with designing a secure database system for Yahoo's Paranoids team, focusing on a Linux environment hosted on AWS. The system should ensure data integrity and protection against cyber threats. Additionally, propose a machine learning approach to detect and mitigate potential security breaches. Describe your approach, the technologies you would use, and the steps involved in the implementation.
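
For the detection half of this question, one simple starting point is a statistical baseline rather than a full machine learning model: learn the normal rate of some security signal, then flag values that deviate sharply from it. The sketch below assumes the signal is an hourly count of failed logins; the function names, sample data, and the threshold of 3 standard deviations are all illustrative choices, not part of the original question.

```python
from statistics import mean, pstdev


def fit_baseline(counts):
    """Learn the mean and population standard deviation of normal traffic."""
    return mean(counts), pstdev(counts)


def is_anomalous(value, mu, sigma, threshold=3.0):
    """Flag a value whose z-score against the baseline exceeds the threshold."""
    if sigma == 0:
        # A perfectly flat baseline: anything different is suspicious.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical hourly failed-login counts observed during normal operation.
baseline_counts = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5, 7]
mu, sigma = fit_baseline(baseline_counts)

spike_flagged = is_anomalous(90, mu, sigma)    # a burst of failed logins
normal_flagged = is_anomalous(6, mu, sigma)    # an ordinary hour
```

A production design would likely replace this with a trained model over richer features, but a baseline like this is useful for validating the data pipeline and alerting path first.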

## Using HAVING Clause and JOINS in SQL

Data Engineer @ Yahoo | Difficulty: Medium

You are given two tables in a relational database: Orders and Customers. The Orders table includes columns for order_id, customer_id, order_date, and order_amount. The Customers table includes columns for customer_id, customer_name, and customer_region. Write an SQL query to find the total order amount for each customer in the "West" region, but only include customers who have made orders totaling more than $5000. Your query should use both JOIN and HAVING clauses.
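
One possible query is sketched below, wrapped in Python's `sqlite3` module so it can be run against an in-memory database. The table and column names come from the question; the sample rows, the `total_amount` alias, and the `run_demo` helper are made up for illustration. The `WHERE` clause filters rows before grouping, while `HAVING` filters the aggregated groups.

```python
import sqlite3

QUERY = """
SELECT c.customer_id,
       c.customer_name,
       SUM(o.order_amount) AS total_amount
FROM Customers AS c
JOIN Orders AS o ON o.customer_id = c.customer_id
WHERE c.customer_region = 'West'
GROUP BY c.customer_id, c.customer_name
HAVING SUM(o.order_amount) > 5000
ORDER BY total_amount DESC;
"""


def run_demo():
    """Build a tiny in-memory database with made-up rows and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customers (
            customer_id INTEGER PRIMARY KEY,
            customer_name TEXT,
            customer_region TEXT
        );
        CREATE TABLE Orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            order_date TEXT,
            order_amount REAL
        );
        INSERT INTO Customers VALUES
            (1, 'Ana', 'West'), (2, 'Ben', 'West'), (3, 'Cy', 'East');
        INSERT INTO Orders VALUES
            (10, 1, '2024-01-05', 3000),
            (11, 1, '2024-02-10', 2500),
            (12, 2, '2024-01-20', 4000),
            (13, 3, '2024-03-01', 9000);
    """)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


rows = run_demo()
```

With this sample data only Ana qualifies: Ben is in the West region but totals 4000, and Cy exceeds the threshold but is in the East region.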

## Data Processing with Hadoop and PySpark

Data Engineer @ Yahoo | Difficulty: Medium

You are working with a large-scale data processing project that uses Hadoop for storage and PySpark for data processing. Explain how you would set up a PySpark job to read data from HDFS, perform a transformation to filter out records with missing values, and then write the cleaned data back to HDFS. Provide a sample PySpark code snippet to demonstrate this process.
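
A minimal sketch of such a job is shown below. It assumes the data is stored as Parquet and that "missing values" means a null in any column; the function name and the HDFS paths in the usage comment are hypothetical. The function takes an existing `SparkSession` so that session setup stays separate from the transformation.

```python
def clean_missing_values(spark, input_path, output_path):
    """Read Parquet from HDFS, drop rows with a null in any column, write back.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    df = spark.read.parquet(input_path)

    # how="any" drops a row if at least one of its columns is null.
    cleaned = df.dropna(how="any")

    # Overwrite keeps reruns idempotent; this is an assumed policy.
    cleaned.write.mode("overwrite").parquet(output_path)
    return cleaned


# Example invocation (requires a Spark installation; paths are hypothetical):
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("clean-missing-values").getOrCreate()
#   clean_missing_values(spark, "hdfs:///data/raw/events",
#                        "hdfs:///data/clean/events")
#   spark.stop()
```

If the source files were CSV or JSON instead, `spark.read.csv` or `spark.read.json` would replace the Parquet reader, and `dropna(subset=[...])` would restrict the check to specific columns.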

## Cybersecurity in Database Design with AWS, Linux, and Machine Learning

DevOps Engineer @ Yahoo | Difficulty: Hard

Design a secure database system for a cybersecurity team, focusing on a Linux environment hosted on AWS. The system must ensure data integrity and protection against cyber threats. Additionally, propose a machine learning solution to detect and mitigate potential security breaches. Detail your approach, the technologies you would use, and the steps involved in the implementation.

## Understanding Overfitting, Activation Functions, and Batch Normalization

Machine Learning Engineer @ Yahoo | Difficulty: Medium

Explain the concept of overfitting in machine learning. What are activation functions and what role do they play in neural networks? Additionally, describe batch normalization and its purpose in training deep learning models.
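
To make two of these ideas concrete, the sketch below implements a pair of common activation functions and the training-time batch normalization forward pass for a single feature, in plain Python. The default values for `gamma`, `beta`, and `eps` are conventional choices assumed here; running statistics for inference and the backward pass are not shown.

```python
import math


def relu(x):
    """Rectified linear unit: passes positives through, zeroes out negatives."""
    return max(0.0, x)


def sigmoid(x):
    """Squashes any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift.

    gamma and beta are the learnable scale and shift parameters;
    eps guards against division by zero when the variance is tiny.
    """
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

After normalization the batch has approximately zero mean and unit variance, which keeps the inputs to the next layer in a stable range as earlier weights change.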

## Difference Between Reduce and GroupBy Functions in Spark

Data Engineer @ Yahoo | Difficulty: Medium

In the context of Apache Spark, explain the difference between the reduce function and the groupBy function. Provide examples of when you would use each.
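
The core distinction can be shown without a cluster. In Spark, `reduce` is an action that collapses a whole dataset into a single value, while `groupBy` is a transformation that partitions records by key and leaves the groups for later processing. The plain-Python analogy below mirrors that difference; it is not Spark code, and the `sales` data and `group_by` helper are invented for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (region, amount) records.
sales = [("west", 120), ("east", 80), ("west", 200), ("east", 50)]

# reduce: combine every element pairwise until one value remains.
total = reduce(lambda a, b: a + b, (amount for _, amount in sales))


def group_by(records, key_fn):
    """Partition records into lists keyed by key_fn(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return dict(groups)


# groupBy: keep every record, organized by key.
by_region = group_by(sales, lambda record: record[0])
```

Use a reduce when a single aggregate over everything is needed (a grand total, a maximum). Use grouping when per-key results are needed, keeping in mind that in Spark grouping shuffles data across the cluster.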

## Big Data Concepts in Hadoop and Spark

Data Engineer @ Yahoo | Difficulty: Medium

Explain how Hadoop and Apache Spark handle big data processing. What are the main differences between the two frameworks? Provide examples of use cases where each would be more suitable.

## Basics of Pig and Oozie in Hadoop Ecosystem

Data Engineer @ Yahoo | Difficulty: Medium

Describe the basic concepts of Apache Pig and Apache Oozie in the Hadoop ecosystem. What are their primary functions and how do they fit into the big data processing workflow?

## Implementing MapReduce for Common Problems

Data Engineer @ Yahoo | Difficulty: Hard

Explain how to implement MapReduce for the following common problems: word count, inverted index, and mean calculation. Provide a brief description and a sample code for each problem.
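
All three problems follow the same shape: a mapper emits key-value pairs, the framework groups them by key, and a reducer folds each group. The sketch below simulates that flow in a single Python process rather than on Hadoop; the `map_reduce` helper and the sample inputs are assumptions for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted pairs by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # stands in for the shuffle phase
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical (doc_id, text) input.
docs = [("d1", "spark and hadoop"), ("d2", "hadoop and pig")]

# Word count: emit (word, 1), then sum the ones.
word_count = map_reduce(
    docs,
    lambda rec: [(word, 1) for word in rec[1].split()],
    lambda key, values: sum(values),
)

# Inverted index: emit (word, doc_id), then collect the distinct doc ids.
inverted_index = map_reduce(
    docs,
    lambda rec: [(word, rec[0]) for word in rec[1].split()],
    lambda key, values: sorted(set(values)),
)

# Mean per key: emit (key, (value, 1)) so partial sums and counts can be merged.
readings = [("a", 2.0), ("a", 4.0), ("b", 10.0)]


def mean_reducer(key, pairs):
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total / count


means = map_reduce(readings, lambda rec: [(rec[0], (rec[1], 1))], mean_reducer)
```

Emitting `(sum, count)` pairs for the mean matters on a real cluster: averages of averages are wrong when groups differ in size, whereas sums and counts can be safely merged by a combiner.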

## Calculating Median from a Stream of Data

Data Engineer @ Yahoo | Difficulty: Hard

Describe an efficient algorithm to calculate the median from a stream of data. How would you implement this in a real-time data processing system?
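
A well-known approach keeps two heaps: a max-heap for the smaller half of the values and a min-heap for the larger half. Each insert costs O(log n) and the median is read in O(1). The sketch below is one way to write it with Python's `heapq`; the class name is an arbitrary choice, and it assumes the full stream fits in memory, which a real-time system with unbounded input would need to relax (for example with windowing or an approximate sketch).

```python
import heapq


class RunningMedian:
    """Maintain the median of all values seen so far using two heaps."""

    def __init__(self):
        self._low = []   # max-heap of the smaller half (values stored negated)
        self._high = []  # min-heap of the larger half

    def add(self, x):
        # Push onto the low half, then move its largest element to the high
        # half so that every value in _low is <= every value in _high.
        heapq.heappush(self._low, -x)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Rebalance so _low holds the same number of items or exactly one more.
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if not self._low:
            raise ValueError("median of an empty stream is undefined")
        if len(self._low) > len(self._high):
            return float(-self._low[0])
        return (-self._low[0] + self._high[0]) / 2.0


rm = RunningMedian()
medians = []
for value in [5, 15, 1, 3]:
    rm.add(value)
    medians.append(rm.median())
```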

## Apache Pig Script for LCS Problem with Hadoop and Machine Learning Integration

Data Engineer @ Yahoo | Difficulty: Hard

Create an Apache Pig script to solve a Longest Common Subsequence (LCS) problem using data stored in Hadoop HDFS. Additionally, describe how you would integrate this data processing with a machine learning model to predict the likelihood of sequences being similar in future data streams.
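
LCS is a dynamic programming problem that does not map naturally onto Pig's relational operators, so one plausible route is to load the sequence pairs in Pig and delegate the per-pair computation to a user-defined function. The sketch below shows only that per-pair logic in Python; the function names are hypothetical, and registering the code as a Pig UDF and loading data from HDFS are not shown. The normalized score is one assumed way to turn the result into a feature for a similarity model.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using two DP rows."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip one character from either sequence, whichever is better.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length scaled to [0, 1] by the longer sequence's length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0
```

Keeping only two rows reduces memory from O(len(a) * len(b)) to O(len(b)), which matters when the function is called once per record over a large dataset.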

## Finding Second-Degree Friends Using MapReduce Framework

Data Engineer @ Yahoo | Difficulty: Hard

Using the MapReduce framework, explain how you would find the second-degree friends for each user in a social network. Provide a brief description and a sample code for the solution.
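
One way to frame this: any two people who appear in the same friend list share that list's owner as a mutual friend, so they are candidates for a second-degree connection. The mapper emits those candidate pairs along with the direct friendships, and the reducer subtracts the direct friends. The sketch below simulates this in one Python process; it assumes friendships are symmetric and stored as adjacency lists, and the sample graph is made up.

```python
from collections import defaultdict
from itertools import permutations


def mapper(user, friends):
    """Emit direct friendships and candidate pairs that share `user` as a friend."""
    for friend in friends:
        yield user, ("direct", friend)
    for a, b in permutations(friends, 2):
        yield a, ("candidate", b)


def reducer(user, values):
    """Keep candidates that are neither direct friends nor the user themself."""
    direct = {v for tag, v in values if tag == "direct"}
    candidates = {v for tag, v in values if tag == "candidate"}
    return sorted(candidates - direct - {user})


def second_degree_friends(graph):
    groups = defaultdict(list)
    for user, friends in graph.items():
        for key, value in mapper(user, friends):
            groups[key].append(value)  # stands in for the shuffle phase
    return {user: reducer(user, groups[user]) for user in graph}


# Hypothetical chain: alice - bob - carol - dave
graph = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol"],
}
result = second_degree_friends(graph)
```

The mapper emits a number of pairs quadratic in each friend list's length, so users with very large friend lists would likely need special handling on a real cluster.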

## Finding the 95th Percentile of URL Sizes

Data Engineer @ Yahoo | Difficulty: Hard

Given a dataset of 2 billion URLs and their sizes, describe an efficient algorithm to find the 95th percentile of all the sizes. Provide a brief explanation and a sample implementation using a distributed computing framework.
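
Sorting 2 billion records is avoidable if sizes are integers: each partition can build a histogram of size to count, the histograms are merged, and a single pass over the sorted distinct sizes finds where the cumulative count crosses the target rank. The sketch below simulates this in plain Python. It assumes the number of distinct sizes is small enough for the merged histogram to fit in memory (otherwise sizes could be bucketed for an approximate answer) and uses the nearest-rank definition of a percentile; the function names are illustrative.

```python
from collections import Counter


def partition_histogram(sizes):
    """Per-partition step: count how often each size occurs."""
    return Counter(sizes)


def percentile_from_histograms(histograms, percent=95):
    """Merge histograms and return the nearest-rank percentile."""
    merged = Counter()
    for histogram in histograms:
        merged.update(histogram)

    total = sum(merged.values())
    if total == 0:
        raise ValueError("percentile of an empty dataset is undefined")

    # Integer ceiling of percent * total / 100 avoids floating-point drift.
    rank = max(1, (percent * total + 99) // 100)

    seen = 0
    for size in sorted(merged):
        seen += merged[size]
        if seen >= rank:
            return size


# Two hypothetical partitions holding sizes 1..50 and 51..100.
partitions = [list(range(1, 51)), list(range(51, 101))]
p95 = percentile_from_histograms(partition_histogram(p) for p in partitions)
```

Histograms are mergeable, so the per-partition step parallelizes cleanly and only the compact counts need to cross the network.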

## Calculating Average of Large Dataset Across Multiple Computers

Data Engineer @ Yahoo | Difficulty: Medium

Given a huge dataset of numbers distributed across multiple computers, describe an efficient algorithm to find the average of all the numbers. Provide a brief explanation and a sample implementation using a distributed computing framework.
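
The standard trick is to have each machine send back a `(sum, count)` pair rather than its local average, since averaging the local averages gives the wrong answer when shards differ in size. The sketch below simulates the machines as lists in one Python process; the function names and shard contents are made up for illustration.

```python
def local_summary(numbers):
    """Runs on each machine: reduce the local shard to (sum, count)."""
    return sum(numbers), len(numbers)


def combine(summaries):
    """Runs on the coordinator: merge the partial sums and counts."""
    summaries = list(summaries)
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    if count == 0:
        raise ValueError("average of an empty dataset is undefined")
    return total / count


# Hypothetical shards of unequal size.
shards = [[1, 2, 3], [10], [4, 5]]
average = combine(local_summary(shard) for shard in shards)
```

Here the true average is 25 / 6, while the mean of the three local means would be 5.5. For very large integer inputs in a fixed-width language, the partial sums would also need a type wide enough to avoid overflow.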
