Site Reliability Engineer, System Eng. @ Google
Networking
Layer 2 and Layer 3 Networking
What are the primary differences between the Data Link Layer (Layer 2) and the Network Layer (Layer 3) in the OSI model, particularly in terms of addressing and device types used at each layer?
Layer 2, or the Data Link Layer, focuses on transferring data between devices on the same network using MAC addresses, and involves switches and bridges. Layer 3, the Network Layer, facilitates data transfer across different networks using IP addresses, involving routers to navigate data across the internet or between networks.
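As a quick illustration, the iproute2 suite exposes each layer's view (a sketch; actual output depends on the host):

```sh
ip link show    # Layer 2: interfaces and their MAC addresses
ip neigh show   # the ARP table, mapping IPs (Layer 3) to MACs (Layer 2)
ip route show   # Layer 3: routing decisions made on IP prefixes
```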
IP forwarding
How can we enable IP forwarding in the Linux kernel, and why is it disabled by default?
It is disabled by default to prevent unintended network bridging, which could lead to security vulnerabilities or routing issues. Enabling IP forwarding allows the system to forward packets between interfaces, acting as a router.
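A minimal sketch of enabling it (requires root; the sysctl key names are standard Linux):

```sh
# Check the current setting (0 = disabled, 1 = enabled)
cat /proc/sys/net/ipv4/ip_forward

# Enable until the next reboot
sudo sysctl -w net.ipv4.ip_forward=1

# Enable persistently across reboots
echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-forwarding.conf
sudo sysctl --system
```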
Container networking
For a containerization solution that requires isolated yet connected network environments for each container, which Virtual Network Interface (VNI) type would best suit this need? Provide a command example to create this VNI type in a Linux environment.
- Create a bridge network to provide isolated yet connected network environments for each container.
To connect containers or virtual network interfaces to the bridge, you would then use commands to create virtual Ethernet (veth) pairs and attach one end to the bridge. For example, to create a veth pair and attach it to the bridge for a container:
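A hedged sketch using iproute2 (the interface names br0, veth0/veth1 and the namespace placeholder are illustrative, not required names):

```sh
# Create a bridge and bring it up
sudo ip link add name br0 type bridge
sudo ip link set br0 up

# Create a veth pair: one end for the container, one for the bridge
sudo ip link add veth0 type veth peer name veth1

# Attach the host end to the bridge
sudo ip link set veth0 master br0
sudo ip link set veth0 up

# Move the other end into the container's network namespace
sudo ip link set veth1 netns <container-netns-name-or-pid>
```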
Route tables
How can you configure a Linux system to route traffic destined for a specific subnet through a different gateway, while keeping this routing rule isolated from the system’s main routing table? Provide the commands necessary to create a custom routing table and add a route to this table that directs traffic for the subnet 192.168.2.0/24 to pass through the gateway at 192.168.1.2 via the eth0 network interface.
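One way to do this with iproute2 (the table name `custom` and its number 100 are arbitrary choices):

```sh
# Register the custom table name (optional, but lets us refer to it by name)
echo '100 custom' | sudo tee -a /etc/iproute2/rt_tables

# Add the route to the custom table only, leaving the main table untouched
sudo ip route add 192.168.2.0/24 via 192.168.1.2 dev eth0 table custom

# Add a policy rule so traffic to that subnet consults the custom table
sudo ip rule add to 192.168.2.0/24 lookup custom

# Verify
ip route show table custom
ip rule show
```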
Address Resolution Protocol (ARP)
You need to verify the MAC address associated with an IP address on your local network. Which command would you use to check the ARP cache and possibly refresh it if the address is not found?
To check the ARP cache for a specific IP address:
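For example (192.168.1.10 and eth0 are placeholders):

```sh
# Show the ARP/neighbour cache entry for one IP
ip neigh show 192.168.1.10     # modern iproute2
arp -n 192.168.1.10            # legacy net-tools equivalent

# If the entry is missing or stale, ping the host once to trigger a fresh ARP request
ping -c 1 192.168.1.10
ip neigh show 192.168.1.10

# Flush a stale entry explicitly (requires root)
sudo ip neigh flush to 192.168.1.10
```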
Practical Coding / Scripting
Google’s official SRE, SE preparation guide states that the coding exercise will assess simple algorithm/data structure implementation. We are looking for a solution that shows you understand your language well, with a clean and working implementation that’s efficient. On top of this, you should be familiar with practical Linux scripting in bash.
Log Analysis with awk
You have a server access log file access.log that follows this format:
Write an awk command to:
- Count the number of successful (HTTP status code 200) GET requests for each unique resource (e.g., /index.html, /about.html)
- Display the count and the resource path.
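A sketch of one possible answer. Since the exact log format is not shown above, this assumes the common Apache/Nginx combined format, where field $6 is the quoted HTTP method, $7 the resource, and $9 the status code; the sample log lines are hypothetical:

```sh
# Hypothetical sample data in the common combined log format
cat > access.log <<'EOF'
127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
127.0.0.1 - - [10/Oct/2023:13:55:40 +0000] "GET /about.html HTTP/1.1" 200 1044
127.0.0.1 - - [10/Oct/2023:13:55:41 +0000] "GET /index.html HTTP/1.1" 404 209
127.0.0.1 - - [10/Oct/2023:13:55:45 +0000] "POST /login HTTP/1.1" 200 512
127.0.0.1 - - [10/Oct/2023:13:55:50 +0000] "GET /index.html HTTP/1.1" 200 2326
EOF

# Count successful GET requests per resource and print "count resource"
awk '$6 == "\"GET" && $9 == 200 { count[$7]++ }
     END { for (resource in count) print count[resource], resource }' access.log
```

Note that awk's associative-array iteration order is unspecified, so pipe the output through `sort` if you need a stable ordering.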
System monitoring Bash script
Write a bash script that monitors system health and sends an alert if any of the following conditions are met:
- The CPU usage exceeds 80% for more than 5 minutes.
- The available disk space on the root partition is less than 10%.
- If any condition is met, the script should output an appropriate message to the standard error (stderr) indicating the issue.
You can use the following template to create your script:
Below is an example of a bash script that monitors CPU usage and disk space, sending an alert if the conditions are met.
Your solution may vary and be written differently; if you are not familiar with bash scripting, we recommend learning the basics before attempting this question.
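A hedged sketch of such a script. The thresholds, the sampling interval, and the `top`/`df` parsing are assumptions about a typical procps/coreutils environment, not the only valid answer:

```sh
#!/usr/bin/env bash

CPU_THRESHOLD=80    # percent
DISK_MIN_FREE=10    # percent free on /
CHECK_INTERVAL=60   # seconds between CPU samples
CPU_SUSTAIN=5       # consecutive samples (~5 minutes) above threshold

# Current CPU usage as an integer percentage (100 - idle, parsed from procps top)
cpu_usage() {
  top -bn1 | awk '/Cpu\(s\)/ { print int(100 - $8) }'
}

# Free space on the root partition as an integer percentage
root_free_pct() {
  df -P / | awk 'NR == 2 { print 100 - int($5) }'
}

alert() { echo "ALERT: $*" >&2; }

monitor() {
  local high_count=0
  while true; do
    if [ "$(cpu_usage)" -gt "$CPU_THRESHOLD" ]; then
      high_count=$((high_count + 1))
    else
      high_count=0
    fi
    if [ "$high_count" -ge "$CPU_SUSTAIN" ]; then
      alert "CPU usage above ${CPU_THRESHOLD}% for more than 5 minutes"
      high_count=0
    fi
    if [ "$(root_free_pct)" -lt "$DISK_MIN_FREE" ]; then
      alert "Root partition has less than ${DISK_MIN_FREE}% free space"
    fi
    sleep "$CHECK_INTERVAL"
  done
}

# Uncomment to run continuously:
# monitor
```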
Backup and Archive to tarball
Create a script that backs up a specified directory (including all subdirectories) to a tarball, appending the current date to the filename. The script should also delete backups older than 30 days.
You can use the following template to create your backup script:
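A sketch of one possible solution, written as a function so it is easy to test; the argument names and the `.tar.gz` naming convention are choices, not requirements:

```sh
#!/usr/bin/env bash
set -euo pipefail

# backup_dir_to_tarball <source-dir> <backup-dir>
# Archives <source-dir> into <backup-dir>/<name>-YYYY-MM-DD.tar.gz and
# prunes archives older than 30 days. Prints the new archive path.
backup_dir_to_tarball() {
  local src="$1" dest="$2"
  local date archive
  date=$(date +%F)   # e.g. 2024-05-01
  archive="${dest}/$(basename "$src")-${date}.tar.gz"

  mkdir -p "$dest"
  # -C switches into the parent directory so the tarball stores relative paths
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")"

  # Delete backups older than 30 days
  find "$dest" -name '*.tar.gz' -mtime +30 -delete

  echo "$archive"
}
```

Usage: `backup_dir_to_tarball /etc /var/backups` would produce something like `/var/backups/etc-2024-05-01.tar.gz`.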
Process Text Files
You have a directory full of text files. Write a script to find and display all files that contain a specific keyword, along with the count of how many times that keyword appears in each file.
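A sketch of one possible answer; the `*.txt` glob and the `file: count` output format are assumptions:

```sh
#!/usr/bin/env bash

# find_keyword <keyword> <directory>
# Prints "path: count" for every .txt file in <directory> containing <keyword>.
find_keyword() {
  local keyword="$1" dir="$2"
  local f count
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue                      # glob matched nothing
    count=$(grep -o -- "$keyword" "$f" | wc -l)  # -o prints each match on its own line
    if [ "$count" -gt 0 ]; then
      echo "$f: $count"
    fi
  done
}
```

Note that `grep -c` counts matching *lines*, so `grep -o | wc -l` is used instead to count every occurrence, including repeats on one line.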
DSA Easy Problems
For Site Reliability Engineer and Systems Engineer (SRE, SE) roles at Google, it’s less common to encounter hard Data Structures and Algorithms (DSA) problems during the interview process. While medium-complexity problems may occasionally arise, the focus is predominantly on ensuring candidates are comfortable and proficient in solving easy-level DSA problems.
Below are some examples of easy-level DSA problems that you might encounter during a Google SRE, SE interview. The list is not exhaustive, but it gives you an idea of the types of problems you might face. Once you are comfortable with these, you can find more easy or medium DSA problems on coding platforms like LeetCode.
Two Sum Problem: Given an array of integers, return indices of the two numbers such that they add up to a specific target.
Valid Palindrome: Given a string, determine if it is a palindrome, considering only alphanumeric characters and ignoring cases.
Valid Parentheses: Given a string containing just the characters '(', ')', '{', '}', '[' and ']', determine if the input string is valid.
Roman to Integer: Given a roman numeral, convert it to an integer.
Longest Common Prefix: Write a function to find the longest common prefix string amongst an array of strings.
Reverse Integer: Given a 32-bit signed integer, reverse digits of an integer.
Merge Two Sorted Lists: Merge two sorted linked lists and return it as a new sorted list.
Remove Duplicates from Sorted Array: Given a sorted array nums, remove the duplicates in-place such that each element appears only once and returns the new length.
You can find the solutions to these problems on LeetCode or other coding platforms.
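To give a flavour of the expected level, here is Two Sum sketched in bash (associative arrays require bash 4+; in an interview you would normally use your main programming language, and the space-separated input format here is an arbitrary choice):

```sh
#!/usr/bin/env bash

# two_sum "<space-separated numbers>" <target>
# Prints the two indices whose values add up to <target>, e.g. "0 1".
# Uses a hash map of value -> index for a single O(n) pass.
two_sum() {
  local -a nums=($1)
  local target=$2
  declare -A seen
  local i need
  for i in "${!nums[@]}"; do
    need=$((target - nums[i]))
    if [[ -n "${seen[$need]+x}" ]]; then
      echo "${seen[$need]} $i"
      return 0
    fi
    seen[${nums[i]}]=$i
  done
  return 1   # no pair found
}
```

Usage: `two_sum "2 7 11 15" 9` prints `0 1`, since nums[0] + nums[1] = 9.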
Non-Abstract Systems Design
Video Streaming Service
Design a global video streaming service similar to Netflix or YouTube, focusing on scalability, reliability, and low latency. Consider the following:
- How would you architect the system to support millions of concurrent users globally?
- What strategies would you employ for content delivery and network efficiency?
- Discuss how you would handle metadata storage, search functionality, and user personalization at scale.
Since we need to “support millions of concurrent users globally” and ensure “low latency,” we need a distributed system architecture that can handle high traffic and deliver content efficiently. Here are some key components and strategies for designing a video streaming service:
Let’s first decide which technologies to use for each of these components, and then move on to the architecture.
- Technology: Since we are designing a global video streaming service, we need to choose technologies that can scale horizontally and handle high traffic. Kubernetes can be used for container orchestration, allowing us to deploy and manage microservices efficiently.
- Scalability: To support millions of concurrent users, we can use a microservices architecture that scales horizontally. Docker/OCI containers can be used to package and deploy microservices, and Kubernetes can help manage these containers at scale. We will use the Horizontal Pod Autoscaler to automatically scale the number of pods in a deployment based on observed CPU or memory usage, and the cluster (node) autoscaler to adjust the size of a node pool based on the demands of the workloads running on it. To determine the size of the node pool, we can use a load-testing tool like Apache JMeter to simulate heavy load and monitor performance and resource utilization; before doing that, we need to set up our monitoring and alerting system. Let’s assume our JMeter test with 100,000 concurrent users shows CPU and memory utilization at 80% and 70% respectively on a 3-node cluster with 8 vCPUs and 16 GB of memory each. Scaling linearly, 1 million concurrent users would need roughly 10x that capacity: about 240 vCPUs and 480 GB of memory. IOPS and network bandwidth can be estimated using the formulas in the Performance section below.
- Availability & Fault Tolerance: To ensure high availability, we can deploy our services across multiple regions and availability zones. Modern cloud providers provide a global network with multiple regions and zones, allowing us to distribute our services for redundancy. Depending on our setup and network topology, we can use a solution like HAProxy or a cloud provider’s global load balancer to distribute traffic across regions.
- Performance:
This is one of the most important aspects of our system. You should always evaluate this part on a case-by-case basis, since for some services, such as a video streaming service, IOPS and network bandwidth are critical.
Let’s assume the following performance metrics for our system:
- Average bitrate: 5 Mbps (megabits per second)
- Average concurrent users: 1 million
- Average IO size per operation: 2.5 megabits (0.3125 MB)
- Total bandwidth required: (5 Mbps ÷ 8) × 1,000,000 users = 625,000 MB/s
- Average IOPS required: 625,000 MB/s ÷ 0.3125 MB per operation = 2,000,000 IOPS
- Security: We will use OAuth for user authentication and authorization. For simplicity, we can use Google OAuth to secure our services. We will also encrypt data in transit using HTTPS and data at rest using AES-256 encryption. For HTTPS, we can use Let’s Encrypt, which provides free TLS certificates, and cert-manager/certbot to automate certificate issuance and renewal. Another consideration is rate limiting, to prevent abuse and DDoS attacks and to limit concurrent video streams from a single IP address.
- Data Management: For storing video files, we can use object storage like MinIO, Google Cloud Storage, or Amazon S3 for scalable and durable storage. We will use a Content Delivery Network (CDN) to cache and deliver video content closer to users, reducing latency and improving performance. When a video is requested, the CDN retrieves the relevant chunks from the origin server (e.g., an object storage service) and caches them at edge locations closer to the user; subsequent requests for those chunks are served directly from the cache, improving the viewing experience. As a database, we will use a NoSQL database like MongoDB or Cassandra for storing metadata and user information, and Elasticsearch for search functionality. These databases are designed to handle large volumes of data and are well suited for storing metadata about videos, user profiles, viewing preferences, and session information. We will use cross-regional replication with read replicas in 3 different regions to ensure data availability and durability.
- Monitoring & Analytics: For monitoring, we will need an engine for collecting, processing, and visualizing metrics from our services, plus long-term storage for the collected metrics. We will also need a log collection and search engine with long-term storage capability. We will collect traces via a trace agent and ship them to a tracing backend for analysis, and we need a visualization tool to create dashboards and alerts based on the collected metrics. We don’t list specific tools here, as the choice depends on the company’s preference and existing infrastructure. We will monitor the following metrics:
- CPU and Memory utilization
- Network Bandwidth and IOPS
- Latency and Throughput
- Error rates and availability
- Number of concurrent users
- Number of requests per second
- etc.
- Disaster Recovery & Backup: We will need to schedule regular backups of both our Kubernetes cluster configuration and our data. We will use Velero for backup and restore of the Kubernetes cluster, and Google Cloud Storage for storing the backups. We can skip separate object storage backups, as that storage is already replicated across multiple regions and 99.999999999% (11 nines) durable. Since human error is the most common cause of data loss, we will enable versioning on our buckets to protect against accidental deletion.
- Continuous Integration & Deployment: We will have two pipelines: one for Infrastructure as Code and one for application code. We will also use a Kubernetes manifest templating tool for deployment. If we are running in the cloud, we will use Google’s Workload Identity (similar to other cloud providers’ managed identities) to access our GCP resources securely.
Real-time Messaging System
Design a scalable real-time messaging system like WhatsApp or Telegram that can support high-volume, low-latency messaging across the globe. Address the following points:
- Describe the system architecture needed to ensure message delivery with minimal delay.
- How would you design the data model to store conversations, ensure data consistency, and manage user presence status?
- Explain the trade-offs between consistency, availability, and partition tolerance (CAP theorem) in your design.
- Client-Server: Develop a server implementation for handling message routing, storage, and delivery.
- Message Queue: Deploy a message queue to decouple message sending and receiving, enhancing system scalability and reliability, and handle delivery failures with dead-letter queues.
- Microservices: Adopt a microservices architecture for different components such as authentication, message processing, notification delivery, and user presence management. This allows for independent scaling of each component based on demand.
- Load Balancing: Use load balancers to distribute client requests evenly across servers, preventing any single server from becoming a bottleneck.
- Caching: Implement caching layers, using in-memory data stores like Redis or Memcached, to store frequently accessed data such as user sessions and recent messages, reducing database load and improving response times.
- CDN: Utilize a Content Delivery Network (CDN) to distribute media content (images, videos) to users with lower latency.
- Database: Use a document-based database like MongoDB to store conversations. Each document can represent a conversation with fields for participant IDs, messages, timestamps, etc.
Implement CDN Service
You are tasked by a Cloud Provider to create a CDN Product similar to Cloudflare. Design the architecture for the CDN service, focusing on content caching, load balancing, and global content delivery.
Not yet written, but you can contribute by submitting a PR.
Distributed File Storage System
Design a distributed file storage system similar to Google Drive or Dropbox, capable of storing and retrieving large amounts of data across a distributed network. Consider the following aspects:
- Outline the system architecture, focusing on data distribution, redundancy, and fault tolerance.
- How would you ensure fast and reliable access to files for users worldwide?
- Discuss the security measures and encryption strategies to protect user data.
Not yet written, but you can contribute by submitting a PR.
Operating Systems
Understanding Threads and Processes
Explain the difference between a process and a thread in a Linux operating system. How do threads differ from processes in terms of resource allocation and execution?
Write a command to list all threads of a specific process using the process ID (PID).
Processes are independent execution units with their own state information, a unique process ID, and their own memory space. Threads are lightweight and share the same memory space within a process but execute independently. While processes have more overhead due to operating in separate memory spaces, threads can communicate and share data more efficiently but require synchronization to prevent concurrency issues.
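To list all threads of a process (1234 is a placeholder PID), any of the following works on a typical Linux system:

```sh
ps -T -p 1234         # SPID column shows each thread's ID
top -H -p 1234        # live view, one row per thread
ls /proc/1234/task    # one directory per thread ID
```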
Deadlock and Avoidance
Describe a scenario where a deadlock could occur in a system.
Write a command to detect and trace a deadlock in a Linux system.
A deadlock occurs when two or more processes each hold resources while waiting for resources held by the others, creating a cycle of dependencies. Avoiding deadlock can involve strategies like ensuring resources are always requested in a fixed order, making a circular wait impossible.
To detect and trace a deadlock in a Linux system, you can use the following command:
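One approach (requires root; the sysrq write asks the kernel to log all blocked tasks):

```sh
# Dump the stacks of all blocked (D-state) tasks into the kernel log
echo w | sudo tee /proc/sysrq-trigger
sudo dmesg | tail -n 50     # inspect the blocked-task traces

# Show current file locks, which can reveal lock contention
cat /proc/locks
```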
Alternatively, you can use the ps command: run ps -el to list all running processes and threads, and look for processes in an uninterruptible sleep state (D state). While not all processes in D state are deadlocked, a process stuck in this state for a long time may be a symptom of a deadlock.
Use strace -fp [PID] to trace system calls and signals of the process. This can help identify which resources (e.g., locks, files) are causing the deadlock.
Context Switching and Scheduling
Explain the concept of context switching in the Linux operating system. Discuss how context switching impacts system performance and how the scheduler plays a role in managing the execution of processes and threads. Additionally, describe the role of modern concurrency constructs like mutexes and semaphores in minimizing the cost of context switches.
Also describe the steps you would take to diagnose a high context switching rate on a server.
Context switching is the process of saving the state of a currently executing process or thread so another can run. It’s managed by the OS scheduler, which determines which process or thread to execute next based on priority, fairness, or other criteria. Although necessary for multitasking, it introduces overhead and can affect performance. Modern concurrency constructs like mutexes and semaphores help manage access to shared resources, minimizing costly context switches and enhancing efficiency.
Use tools like vmstat, sar, and pidstat to observe context switching rates and identify if they are unusually high for the given workload. Utilize top or htop to identify processes with high CPU usage and potentially high context switching rates. Investigate whether the high context switching rate is due to CPU-bound processes, I/O wait, or excessive system calls.
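The diagnostic steps above map onto commands like these (pidstat and sar come from the sysstat package; <PID> is a placeholder):

```sh
vmstat 1 5                     # "cs" column: system-wide context switches per second
pidstat -w 1 5                 # per-process voluntary (cswch/s) and involuntary (nvcswch/s) switches
sar -w 1 5                     # system-wide context-switch rate over time
grep ctxt /proc/<PID>/status   # cumulative switch counts for one process
```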
System Calls in Unix/Linux for Containerization
Describe how the clone(), unshare(), and cgroups (control groups) system calls and mechanisms contribute to the underlying functionality of containerization.
- clone() is used to create new processes or threads, similar to fork(), but with more control over what is shared between the parent and child processes. In containerization, it allows for the creation of processes in isolated namespaces, providing the basis for container processes.
- unshare() detaches parts of the process execution context, enabling a running process to disassociate from certain namespaces. This is crucial for containers, as it allows for the isolation of filesystems, network interfaces, and other system resources, without creating a new process.
- cgroups (control groups) is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, etc.) of a collection of processes. This mechanism is fundamental in ensuring that each container has access to a defined set of resources, preventing any single container from exhausting system resources and affecting other containers.
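These mechanisms can be explored from the shell (requires root; the unshare(1) utility wraps the unshare(2)/clone(2) syscalls, and the cgroup path and memory limit here are illustrative, assuming cgroup v2):

```sh
# Start a shell in new UTS and mount namespaces
sudo unshare --uts --mount bash
hostname container-demo       # visible only inside the new UTS namespace

# Each process's namespaces are visible under /proc
ls -l /proc/self/ns

# cgroup v2: create a group and cap its memory at 100 MiB
sudo mkdir /sys/fs/cgroup/demo
echo 104857600 | sudo tee /sys/fs/cgroup/demo/memory.max
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs   # move this shell into it
```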
Libraries and Linking
In Unix/Linux systems, what is the difference between static and dynamic linking? Describe a scenario where you would prefer one over the other.
Static linking incorporates library code directly into the executable, resulting in a larger file size but eliminating the need for the library at runtime. Dynamic linking references libraries external to the executable, requiring the library to be present on the system during execution, which reduces the executable size and allows multiple programs to share the same library version. You might prefer static linking for isolated systems where dependencies cannot be guaranteed, and dynamic linking for systems where memory and storage efficiency is critical, allowing for easier updates and maintenance of shared libraries.
Example of static and dynamic linking (this example is for demonstration only and is not expected to be written in an interview):
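A sketch of the difference using gcc (hello.c is a hypothetical one-line C program; static linking additionally requires the static libc package, e.g. glibc-static, to be installed):

```sh
# Dynamic linking (the default): shared libraries are resolved at load time
gcc hello.c -o hello_dynamic
ldd hello_dynamic                   # lists the shared libraries the binary needs

# Static linking: library code is copied into the executable itself
gcc hello.c -o hello_static -static
file hello_static                   # reports "statically linked"
ls -lh hello_dynamic hello_static   # the static binary is noticeably larger
```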
Memory Management
Explain how a Unix/Linux operating system handles memory overcommitment.
What mechanism does it use to decide which processes to terminate when the system runs out of physical memory and swap space (an OOM condition)?
Unix/Linux operating systems handle memory overcommitment by allowing processes to allocate more memory than is physically available, relying on the assumption that not all allocated memory will be used simultaneously. When the system runs out of physical memory and swap space, it triggers the Out-Of-Memory (OOM) killer, which selects and terminates processes to free up memory. The decision on which process to terminate is based on an algorithm that evaluates factors like process age, memory usage, and importance (with preference to keep essential system processes running). This mechanism aims to preserve system stability under low memory conditions.
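The relevant kernel knobs can be inspected directly (<PID> is a placeholder; writing oom_score_adj requires appropriate privileges):

```sh
# Kernel overcommit policy: 0 = heuristic (default), 1 = always, 2 = never
cat /proc/sys/vm/overcommit_memory

# The OOM killer's per-process score: higher means killed first
cat /proc/<PID>/oom_score

# Bias a critical process away from (or toward) the OOM killer
echo -500 | sudo tee /proc/<PID>/oom_score_adj

# Check the kernel log for past OOM-killer activity
sudo dmesg | grep -i 'out of memory'
```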