Question 1

Managing High I/O Processes

Accepted Answer

Users are complaining about slow file access and we have high disk utilization. We need to reduce the I/O activity of the top offenders using I/O priorities, and we need to set their I/O priority to idle. While doing so, we need to keep critical jobs, databases, message queues and applications at high priority. First we need to identify the processes that have high I/O activity. We'll use the iotop command, and we add -n 10 at the end of this command. This means that the command will run for 10 iterations and then exit. To check the current I/O priority of job one, we need to use the ionice command. First we need to look up the process ID of this job. We have no priority set for this process, and we need to set it to idle. Idle is scheduling class number three. The command for this will be ionice -c 3 -p and the process ID.

Question 2

Tracing Log File Writes

Accepted Answer

We have a /var/log/messages file that has been growing unusually fast, filling up space within hours. Something is writing a lot of logs into this file and we have to identify which process is writing those logs. Using the ps command, we add information about this process to the output file given in the task. And then we also need the last 50 lines of logs added to that file. We start by monitoring the log file live using the tail command. We type tail -f /var/log/messages and we will immediately see the stream of logs. We immediately see that most of the logs are written by myapp. We already have the process IDs in this log file; alternatively, we could have used something like pgrep to find the process ID. We will type tail -n 50 and then the name of the file, /var/log/messages. We will grab the process name. We type the double arrow, >>, so this will append to the output file rather than replace its contents.
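The steps above can be sketched against a small sample log. This is only an illustration: the log lines, the temporary directory and the report.txt name are assumptions, and the ps step is left out because the sample process ID does not exist on a real system.

```shell
# Work on a throwaway sample instead of the real /var/log/messages (assumption).
workdir=$(mktemp -d)
log="$workdir/messages"
report="$workdir/report.txt"   # hypothetical name for the output file

# Build a sample syslog-style file: 60 lines from myapp, 5 from cron.
i=1
while [ "$i" -le 60 ]; do
  echo "Jan  1 00:00:00 host myapp[1234]: message $i" >> "$log"
  i=$((i + 1))
done
i=1
while [ "$i" -le 5 ]; do
  echo "Jan  1 00:00:01 host cron[77]: job $i" >> "$log"
  i=$((i + 1))
done

# Field 5 is the "name[pid]:" tag; count lines per tag, busiest first.
top_writer=$(awk '{print $5}' "$log" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

# Append (>>) the writer and the last 50 lines without overwriting the report.
echo "Top writer: $top_writer" >> "$report"
tail -n 50 "$log" >> "$report"
cat "$report"
```

On a live system the awk step would read /var/log/messages directly, and the field number may differ if the syslog format is not the classic one.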

Question 3

Port Conflict Resolution

Accepted Answer

An application, server.sh under the home interview directory, fails to start and we need to find the cause of the failure and resolve it so the server can start successfully. The error we get is that port 8080 is already in use. Something is occupying this port and doesn't let our server start. To find what causes that, we can use lsof -i and then the number of the port. We see that a Python server is occupying the port and it has a process ID of 15. We can kill this process with sudo kill and then the process ID. We verify whether anything is still running and we see that now nothing is listening on the port. We start the server again and this time we get no error.

Question 4

Diagnose Nginx CPU Bottleneck

Accepted Answer

We have a web server, an Nginx web server, that is running multiple worker processes, and some of those worker processes are consuming excess CPU resources. This question is asking us just to record that process ID in the solution txt file. In order to solve this question, we will use ps aux and sort everything by CPU usage. We have a list of running processes. We grep everything by the word nginx. This process at the bottom is using 97% of CPU. The other nginx workers are actually idling at zero. And this nginx worker has a process ID of 25. We'll type echo 25 and we will send that into the solution txt file.

Question 5

Handling Large Log Archives

Accepted Answer

We had an incident where a massive amount of logs was written to the app access log under /var/log. This file will be very large, in gigabytes, and it'll be very difficult to use any editor or analytics tool to analyze it. The question asks us to split this file into smaller, manageable chunks. First, we create the log parts folder under /tmp. We'll use wc, word count, with the -l flag for lines on the app access log, and it contains 375 lines. We are going to use the split command to split this file into more manageable chunks. We'll use the -l flag to control the chunk size, because by default the split command splits into chunks of 1000 lines. Since the question is asking us to split it into chunks of a hundred lines, we'll use -l 100 and then we add the file name, the app access log, and the destination where we'd like to save our chunks.
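A minimal sketch of the split step, assuming GNU split and using a generated 375-line file in a temporary directory as a stand-in for the real access log. The file names and the access_part_ prefix are made up for the example.

```shell
workdir=$(mktemp -d)
src="$workdir/access.log"      # stand-in for the real app access log
parts="$workdir/log_parts"     # stand-in for the log parts folder
mkdir -p "$parts"

# Generate 375 sample lines, the line count mentioned above.
seq 1 375 | sed 's/^/GET \/index.html request /' > "$src"
wc -l "$src"

# -l 100 caps each chunk at 100 lines; the last argument is the output prefix.
split -l 100 "$src" "$parts/access_part_"

# 375 lines give four chunks: three of 100 lines and one of 75.
ls "$parts"
chunk_count=$(ls "$parts" | wc -l | tr -d ' ')
last_chunk_lines=$(wc -l < "$parts/access_part_ad" | tr -d ' ')
echo "$chunk_count chunks, last one has $last_chunk_lines lines"
```

split names the pieces with the suffixes aa, ab, ac and so on, so the chunks sort in their original order.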

Question 6

Validating DNS Consistency

Accepted Answer

We need to resolve DNS entries for example.local, and then store those entries in the local name resolution file, /etc/hosts. For this we can use different commands. I will use dig and, alternatively, nslookup. This command will output the IPv4 address for example.local. And we do it again for IPv6. Alternatively, we can get both of those entries by using nslookup. We append those entries to the local name resolution file, /etc/hosts. And we verify that they were added. They are both at the bottom of the list.

Question 7

Network Packet Loss Diagnosis

Accepted Answer

We have random timeouts that users report and we need to investigate why this occurs. We've been tasked to check the gateway IP, the DNS server and google.com. We'd like to check the local network, the ability to connect to an external IP, and DNS resolution. We need to identify the gateway IP, and for that we'll use ip route and search for the line where we see default. This means that any traffic that does not match other rules will go to this default gateway's IP address. Next, we ping our gateway IP address, and we've been asked to ping it five times. So it's the -c flag and then the number five. We repeat the same thing for Google DNS. An average round trip time is reported. Most likely the issue is coming from DNS resolution, because we've got 60% packet loss.

Question 8

Network Port Service Cleanup

Accepted Answer

We have several unauthorized applications that are listening on ports between 8000 and 9000. Our task is to identify those processes and terminate them. We'll need to scan this port range and find processes that are bound to TCP and UDP ports, and then get their process IDs and kill them. We'll use the ss command, which is more suitable for this; ss stands for socket statistics. We'll do ss with t, which stands for TCP, u for UDP, then l for the ones that are listening, and p to get process IDs. And then we'll have to add n as well to get numerical ports. Sometimes we need to use a force kill, and for that we need to use sudo kill -9 and then the process ID.

Question 9

Temporary Route Configuration

Accepted Answer

We have a remote subnet that isn't covered by a default route, so we cannot reach it. We need to check the current routing table to confirm that we do not have any route for this subnet. Once we confirm that, we need to add a static route to this subnet via this gateway using the interface veth0. We need to confirm that we do not have any rules for this subnet. We use the ip route command. And indeed, in the list, we do not have anything for this subnet.

Question 10

Network Socket Usage Analysis

Accepted Answer

We are experiencing network latency issues, and our suspicion is that we have too many TCP connections. We'll use either netstat or the ss command. In both cases, we need to provide additional flags. We'll use t for TCP and u for UDP. We will add the a flag to show everything, both established and listening connections. And then we'll add n to show numerical ports. And finally, we need to add p to show processes. Alternatively, we could have used the ss command, which shows us the same type of information.

Question 11

Analyzing Log Partition Usage

Accepted Answer

Log rotation has stopped working and we suspect /var/log might be mounted on a different file system with limited space or incorrect mount options. We need to find out information about the mount and write that debug information into this file. We need to give the file system type, size, usage, mount point and device name. We start with the df command, which will give us information about the file system. We get a human-readable format by typing df -h. ext4 is the most common on Linux. It's a stable, fast and predictable file system. We also sometimes have XFS, which is a more enterprise-grade file system used for large systems. We type findmnt and then we can provide a target, or we can just type findmnt and it'll give us all the mounted file systems on our system.

Question 12

Using Unmounted Partitions

Accepted Answer

We have a server with unmounted partitions, which means that we have a block device but it's not used, so we can mount it for additional storage. Our task is asking us to identify those unmounted partitions, the ones that are safe to use. We need to avoid system-critical partitions, and before we mount the safe ones we also need to create a file system on them. We'll list block devices with lsblk. We try another command, lsblk -f, to see the file system type for those devices. ISO (iso9660) is a read-only file system type, so we cannot use that one. A file system is how a drive is formatted, for example NTFS, ext4 or XFS. It lets us use this device as a folder, as a file system. The question is asking us to create an ext4 file system. This is the most common file system for Linux. We'll do that by typing mkfs. It stands for make file system. In Linux, all the block devices live under /dev. Next we need to create a folder where we'll mount our file system. Finally, we mount our file system. We verify that the file system was mounted correctly by typing the df command.

Question 13

Debug SSH Lockout

Accepted Answer

We have a developer account, dev, that has been locked out of the server, and security logs indicate that there were too many failed SSH authentication attempts. We need to check the logs and count exactly how many attempts this user had today. And once we have this number, we need to update the configuration to increase the allowed login attempts above this number. We'll need to check the log for login attempts. We'll need to use sudo to view it as admin. We can use the grep command to filter out the lines that we need. We can also count them with wc -l, word count, which will print the number of lines. We need to change the sshd daemon configuration file, which is located under /etc/ssh and is named sshd_config. We find the setting that starts with Max (in a stock sshd_config the relevant one would be MaxAuthTries). Finally, we restart the sshd daemon using systemctl restart.
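The counting step can be sketched on sample data. The log lines below are made up in the usual sshd "Failed password for USER" shape, and MaxAuthTries is an assumption about which Max setting the walkthrough edits. Filtering by today's date is left out to keep the sketch short.

```shell
workdir=$(mktemp -d)
authlog="$workdir/auth.log"   # stand-in for the real auth log (its path varies by distro)

# Sample sshd entries: 7 failures for dev, 2 for another user (made-up data).
i=1
while [ "$i" -le 7 ]; do
  echo "Jan  1 10:00:0$i host sshd[100$i]: Failed password for dev from 10.0.0.5 port 5000$i ssh2" >> "$authlog"
  i=$((i + 1))
done
echo "Jan  1 10:01:00 host sshd[2001]: Failed password for ops from 10.0.0.6 port 40000 ssh2" >> "$authlog"
echo "Jan  1 10:02:00 host sshd[2002]: Failed password for ops from 10.0.0.6 port 40001 ssh2" >> "$authlog"

# grep keeps only the failed attempts for dev; wc -l counts the matching lines.
failed=$(grep "Failed password for dev " "$authlog" | wc -l | tr -d ' ')
echo "dev failed attempts: $failed"

# The new limit has to be above the observed count.
new_limit=$((failed + 1))
echo "Suggested MaxAuthTries: $new_limit"
```

On the real server the grep would run with sudo against the system auth log, and the new value would then be written into sshd_config before restarting sshd.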

Question 14

Monitoring Process Ownership

Accepted Answer

The server is used by multiple teams with their own credentials, meaning each team has a username. We need to identify which user, meaning which team, is running the largest number of processes, meaning by count, regardless of CPU or memory. We need to take the list of running processes by typing ps aux. We use awk to isolate only the first column: we can type print $1, where $1 means the first column. We can use sort to sort this. This will do an alphabetical sort, and then we can use uniq -c. uniq gives us unique values and -c will give us the number of occurrences. The dev team has the highest number of running processes. We need to write the name of the team into the solution txt file.
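The pipeline described above can be sketched with canned input so the result is predictable. The users and commands in the sample are made up; on a live system the input would come from ps aux instead of the sample text.

```shell
# Sample "ps aux"-style output (made-up users and commands).
sample=$(cat <<'EOF'
USER       PID %CPU %MEM COMMAND
dev        101  0.0  0.1 node
dev        102  0.0  0.1 node
dev        103  0.1  0.2 python
ops        201  0.0  0.1 nginx
ops        202  0.0  0.1 nginx
qa         301  0.0  0.1 pytest
EOF
)

# Skip the header, keep column 1 (USER), then count occurrences per user.
# uniq -c only merges adjacent duplicates, which is why sort comes first.
counts=$(printf '%s\n' "$sample" | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"

# The first line is the user with the most processes.
top_user=$(printf '%s\n' "$counts" | head -n 1 | awk '{print $2}')
echo "$top_user"
```

The second sort -rn is an addition to the spoken steps; it puts the highest count on the first line so it can be picked with head.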

Question 15

Detect Memory Leak by Monitoring RSS

Accepted Answer

One of the long-running Node services has been slowing down. The CPU usage and I/O are normal. When we have something that is leaking memory, we take a snapshot of the memory usage, and then after some time we can take another snapshot and compare, relative to other services, how much the memory consumption is increasing. First we list all the Node.js processes. They usually run under the name node in the system. We will use pgrep since we know the process name. We use ps and then the process number, and we will output RSS, which is resident set size. It's how much RAM this process is using, in kilobytes. Process number 1585 is the one that is leaking. We need to kill that process.

Question 16

Fix Inode Exhaustion Issue

Accepted Answer

We have a server that cannot create new files. When we execute commands like touch to create an empty file, we get an error saying no space left on device. But when we use df -h or just the df command, we see that there's a lot of space left on the disk, yet we still cannot create new files. The file system has exhausted its available inodes, which means that there are too many small files, each consuming an inode for its metadata. The task asks us to save the inode usage, find which directory contains excessive files, and save the problematic directory to this file. First we need to check inode usage. For that we use df -i, and we can see that inode usage is at a hundred percent, used up by /var/spool. We use the du command for disk usage with the inodes flag. With max depth we set how many folder levels this command will dig into. We can use find with the name of this directory, and we write -type f. We will count how many files we have. The final task will be to clean up this directory. We do sudo rm -rf.

Question 17

Fix HTTPS Certificate Error

Accepted Answer

The connection fails because the certificate doesn't contain a Subject Alternative Name. Modern TLS clients ignore the Common Name and require the SAN to contain the DNS name or IP address for verification. We'll need to generate a new certificate that contains a valid IP address or DNS name that we will connect to. We can run openssl as a client to capture information about the certificate that is being served from this web server. We can just run openssl s_client and then grep the subject field, and we'll see that the subject is wrong. We generate a new certificate with openssl req, using -x509, -newkey and then the algorithm, -nodes, the key output, and the output is the server certificate. The subject is going to be localhost, the subject alternative name is going to be an IP address, and this is the validity time of the certificate. We can use the sed command. The sed command is useful for finding and replacing certain lines in a file.

Question 18

Real-Time Log Timestamping

Accepted Answer

We have a scenario where we troubleshoot some service. It produces untagged log output when it runs manually, and we need to create a file in the local bin directory, named timestamp, and make it executable so it can be used with any pipeline. We need to pipe some output into this file and it has to produce a timestamped result. We try to use the date command and format the output in the required format. We start with a shebang to execute it via bash, and then a while loop. We type IFS= so that lines are not split. And we add read -r to not interpret backslashes. We need this if we want to read lines safely. Then we start our loop, and then our command, echo with the line variable. This is the variable that we get via the pipe. And then the date command. We use chmod +x to make it executable. And now if we echo something and pipe it into this file, we'll see the timestamp.
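The script described above could look roughly like this. It is a sketch under assumptions: the script is written to a temporary directory rather than the real bin location, and the date format is a guess, since the required format is not spelled out in the walkthrough.

```shell
workdir=$(mktemp -d)
script="$workdir/timestamp"   # stand-in for the real install location

# Write the filter script; the quoted 'EOF' keeps $line and $(date) unexpanded here.
cat > "$script" <<'EOF'
#!/usr/bin/env bash
# Prefix every line read from stdin with the current date and time.
# IFS= keeps leading/trailing whitespace; -r keeps backslashes literal.
while IFS= read -r line; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') $line"
done
EOF

# Make it executable so it can sit at the end of any pipeline.
chmod +x "$script"

# Usage: pipe any output through the script.
result=$(echo "service started" | "$script")
echo "$result"
```

Calling date once per line is simple but spawns a process for every line, which is fine for troubleshooting and slow for very high-volume output.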

Question 19

Update Cloud Configs

Accepted Answer

We need to change the cloud configuration in this directory and add certain lines to the files. Our task is to locate all .conf files under this directory. In those files we'll need to change the line with the multi-AZ setting from false to true, and append certain parameters to the availability zone line. To solve this question we'll use the find command. The find command lets us search a directory for certain files or folders. Since we need only files, we'll type -type f. Since we need only .conf files, we'll type -name and then give the wildcard *.conf. We then need to run a command on each match, and we can do that with the -exec flag, which is a very powerful feature of the find command that lets us execute commands on every file that was found. We'll use -exec sed; sed stands for stream editor. One of the use cases of sed is to find and replace lines within files. We need to replace this in place, so we need to add the -i flag. This will replace the line in the file itself.
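A sketch of the find plus sed combination on a temporary directory. The multi_az=false key and the file names are hypothetical, since the real setting names are not given here, and sed -i without an argument assumes GNU sed.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/configs/sub"

# Two sample .conf files plus one file that must stay untouched (made-up content).
printf 'multi_az=false\nregion=us-east-1\n' > "$workdir/configs/db.conf"
printf 'multi_az=false\n' > "$workdir/configs/sub/cache.conf"
printf 'multi_az=false\n' > "$workdir/configs/notes.txt"

# -type f: files only; -name '*.conf': match by wildcard;
# -exec ... {} +: run sed on the matches; -i edits each file in place (GNU sed).
find "$workdir/configs" -type f -name '*.conf' \
  -exec sed -i 's/^multi_az=false$/multi_az=true/' {} +

grep -r 'multi_az' "$workdir/configs"
```

Quoting '*.conf' matters: without the quotes the shell would expand the wildcard before find ever sees it.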

Question 20

Upload-Safe File Partitioning

Accepted Answer

We have a question about the files under the /tmp app directory that are above one megabyte, and we'll need to split those files into chunks that are below one megabyte. We'll run the find command to see the files above one megabyte. With the split command we can pass the original file and a prefix for the part file names. split is a built-in command, and find lets us use -exec to run the split command after finding the file name. We'll create a loop. IFS= read -r lets us read the input. split will split each file that is beyond one megabyte. We'll need to add -print0 to find. We need this because it sends the file names to the bash loop separated by a null byte instead of new lines. With new lines or spaces as separators, a file name that contains them would make the loop think that we're sending two or three files. With -print0, find adds a specific delimiter that read can process properly before the name is handed to split.
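The loop described above can be sketched as follows, assuming bash and GNU find and split. The sample files, the script name and the 1000K chunk size (chosen so each piece stays under 1 MiB) are assumptions for the illustration.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/app"

# Sample data: one 2.5 MiB file with a space in its name, one small file.
head -c 2621440 /dev/zero > "$workdir/app/big upload.bin"
head -c 1024 /dev/zero > "$workdir/app/small.bin"

# read -d '' is a bash feature, so the loop lives in its own bash script.
cat > "$workdir/split_large.sh" <<'EOF'
#!/usr/bin/env bash
# Usage: split_large.sh DIR
# -print0 ends each name with a NUL byte; read -d '' splits on that byte,
# so names containing spaces or newlines arrive intact.
find "$1" -type f -size +1M -print0 |
  while IFS= read -r -d '' file; do
    # Pieces are named <file>.part_aa, <file>.part_ab, ...
    split -b 1000K "$file" "$file.part_"
  done
EOF

bash "$workdir/split_large.sh" "$workdir/app"
ls -l "$workdir/app"
part_count=$(find "$workdir/app" -name 'big upload.bin.part_*' | wc -l | tr -d ' ')
echo "$part_count parts"
```

The pieces can be joined again with cat in suffix order, which is what makes this safe for upload workflows.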

Question 21

Fix Port Exhaustion for High-Speed Scraper

Accepted Answer

We have a web scraper service that runs under systemd and we get an error indicating that connections cannot be established even though everything is fine with our server: the services are accessible and the network is fine. The connection error means that we cannot allocate a port, and this could happen because of port exhaustion. Port exhaustion happens because we have a limited range of ports on our machine and every time we make an outbound connection, we need one of those ports. After the connection closes, that port usually stays blocked in the TIME_WAIT state for 60 seconds and cannot be reused. When we make more connections within 60 seconds than we have ports in the allocated pool, we will get this kind of error. To confirm this, we can run ss with the flags for TCP, all and numeric, filter on state time-wait, and count the lines with wc -l. In order to fix this problem, we'll use the kernel TIME_WAIT reuse flag. We'll use sysctl net.ipv4.ip_local_port_range. This gives us the range of ports that we currently have available on our machine. The reuse status is zero. In order to fix this, we need to set reuse to one.

Question 22

Validating Network Routes

Accepted Answer

Our server has multiple network interfaces, and traffic for the 10.1.0.0/16 subnet is being routed incorrectly. That means packets might be leaving through the wrong interface, which can break connectivity or slow down communication. Our goal is to verify the current routing table and make sure that the subnet is using the correct gateway and the correct network interface. We start by running route -n to display the routing table. This lets us see which interfaces different networks are using, and we can search through this table for the destination IP address by using the grep command. The subnet is using the wrong network interface and an incorrect gateway. First, we'll need to delete this routing table entry, and then we'll have to recreate a correct one. We'll use the route del command and add the netmask. Then we recreate the route with the proper gateway and the proper network interface.

Question 23

Inspecting HTTP Traffic Flow

Accepted Answer

We suspect that the web service isn't receiving HTTP requests and we need to confirm network traffic on port 80. We're tasked to capture network packets that are incoming and outgoing on port 80. We'll use tcpdump for this. The main command to solve this question will be tcpdump -i any, meaning interface any. We'll use all network interfaces that are on this host machine, and -c 10, meaning we'll capture only the first 10 packets. And then with -w we'll write this output into this file. And the last part is our filter. We'll use port 80. Alternative filters would be host and then an IP address, or we can use a port range and other filters that tcpdump supports. Next we need to read the pcap file and save the result under some file name. Finally, we need to cat this file, and then we need to run this script.

Question 24

Forward Traffic Between Ports

Accepted Answer

We have a scenario where we need to forward everything from port 8081 to port 8080. We have an app that runs on port 8080, and we cannot change anything in the app configuration, so without doing that we have to redirect traffic from one port to another. This is not a job for ip route, because IPs are on the L3 level, where ports do not exist yet. For that, we'll need to use the nat table. We make sure we have the application running on port 8080. We check this with ss -tlnp, or we can use lsof -i. To add entries into nat, we'll use iptables. First, we'll use PREROUTING, and the protocol will be TCP. The destination port is 8081, and it's going to be redirected to port 8080. PREROUTING will be for external traffic coming in via our network interface, and OUTPUT will be for the traffic that we originate locally. We can run the sudo iptables-save command so that our configuration changes can persist after a reboot of the system.

Question 25

Rapid Disk Growth on /var

Accepted Answer

Usage on the /var partition is 92% and increasing rapidly. We need to identify the largest files consuming this space, the process IDs that are using those files, and the log rotation status, because log rotation lets us avoid files growing indefinitely. First we need to use the find command to search the /var directory. We'll type -type f, which will show us files only. We need the size and then the file name. find lets us add -exec so we can execute a command, and we'll run du in human-readable format, and we will redirect errors to /dev/null because we don't want to see permission errors. We use sort, with reverse order because we need the largest files first, and we'll also use -h because our size values are in human-readable format. We need to get the top 10 results, so head with 10. The next task is to find the process IDs. We can use xargs lsof, which will show us the process IDs and files. Finally we need to find the logrotate status. For the logrotate status we need to search the directory /etc/logrotate.d. We first need to isolate only the log files. We'll use grep to get our log files, and now we need to search for each name in the logrotate configuration.
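The first step, listing the largest files, can be sketched on a temporary tree. The file names and sizes are invented, GNU sort is assumed for the -h flag, and the lsof and logrotate steps are not covered because they depend on the live system.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/var/log" "$workdir/var/cache"

# Sample files of different sizes (made-up names, random content).
head -c 3145728 /dev/urandom > "$workdir/var/log/app.log"      # 3 MiB
head -c 1048576 /dev/urandom > "$workdir/var/log/syslog"       # 1 MiB
head -c 2048    /dev/urandom > "$workdir/var/cache/index.db"   # 2 KiB

# find: files only; du -h: human-readable size per file; 2>/dev/null hides
# permission errors; sort -rh: largest first, ranking 3.0M above 4.0K correctly.
largest=$(find "$workdir/var" -type f -exec du -h {} + 2>/dev/null \
  | sort -rh | head -n 10)
printf '%s\n' "$largest"

# du separates size and path with a tab, so field 2 of line 1 is the largest file.
top_file=$(printf '%s\n' "$largest" | head -n 1 | cut -f 2)
echo "$top_file"
```

A plain numeric sort would rank 900K above 3.0M, which is why the -h flag on sort has to match the -h flag on du.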

Question 26

Manage Service Failure Recovery

Accepted Answer

We have a service that is failing. It runs periodically and exits with some error code. Our task here is to create a systemd service that will run regularly. It should try to restart the service once it fails. The trigger should be on failure, and it should attempt three restarts within 60 seconds. And the delay between attempts should be five seconds. We need to create a systemd service, and I'll create it under /etc/systemd/system with the name of the service. StartLimitBurst equals three, so three attempts, and StartLimitIntervalSec, the attempt window in seconds, is 60. Restart triggers on failure. For how many seconds we should wait before the next attempt, RestartSec equals five. We've been asked to configure the service to start on boot. For this we need to have an Install section with the WantedBy key equal to multi-user.target. We enable our new systemd service with sudo systemctl enable and then the name of the service.
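The unit file described above might look like the following sketch, written to a temporary file so it can be inspected without touching the system. The unit name and the ExecStart path are hypothetical, and placing the StartLimit settings in the [Unit] section assumes a reasonably recent systemd.

```shell
workdir=$(mktemp -d)
unit="$workdir/flaky-job.service"   # hypothetical name; real units go in /etc/systemd/system

cat > "$unit" <<'EOF'
[Unit]
Description=Job that is restarted on failure (example)
# Allow at most 3 start attempts within a 60 second window.
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
# Hypothetical command; replace with the real service binary.
ExecStart=/usr/local/bin/flaky-job
# Restart only when the process exits with an error, after a 5 second delay.
Restart=on-failure
RestartSec=5

[Install]
# Start on boot as part of the normal multi-user target.
WantedBy=multi-user.target
EOF

cat "$unit"
# On a real host, after copying the file into /etc/systemd/system:
#   sudo systemctl daemon-reload
#   sudo systemctl enable flaky-job.service
```

With a 5 second delay, three failed attempts fit well inside the 60 second window, after which systemd stops retrying until the limit resets.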

Question 27

Nginx Rate Limit Calculation

Accepted Answer

We have a scenario with an Nginx web server and a list of IP addresses in the access log. We need to use this file to calculate the average load on our server and set a rate limit in our Nginx configuration. The first thing we need to do is to find the top IP addresses in this file. The IP is stored in the first column, as the first element of each line, so we can use awk to extract it. By typing awk with print $1, we'll print the first element. We extend this command with additional steps, meaning we sort the output, take only unique values and count them, then sort again and take the top three values. Next, we'll set the rate limit to the sum divided by three, multiplied by 0.8, and compute it with echo. So I'll use this formula: the sum divided by three and multiplied by 0.8. And the answer is 300. We need to add 300 to our configuration file.
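The calculation can be sketched on a tiny made-up access log. The addresses and request counts below are sample data and do not reproduce the 300 from the walkthrough; the formula is the same, written with integer math.

```shell
workdir=$(mktemp -d)
log="$workdir/access.log"   # stand-in for the real access log

# Sample log: the client IP is the first field (made-up addresses and counts).
for entry in "10.0.0.1 6" "10.0.0.2 5" "10.0.0.3 4" "10.0.0.4 1"; do
  ip=${entry% *}
  n=${entry#* }
  i=1
  while [ "$i" -le "$n" ]; do
    echo "$ip - - [01/Jan/2024:00:00:00 +0000] \"GET / HTTP/1.1\" 200 512" >> "$log"
    i=$((i + 1))
  done
done

# Column 1 -> count per IP -> highest counts first -> keep the top three.
top3=$(awk '{print $1}' "$log" | sort | uniq -c | sort -rn | head -n 3)
printf '%s\n' "$top3"

# Sum the three counts, then take 80% of their average using integer math:
# (sum / 3) * 0.8 is the same as sum * 8 / 30.
sum=$(printf '%s\n' "$top3" | awk '{s += $1} END {print s}')
rate_limit=$((sum * 8 / 30))
echo "rate limit: $rate_limit"
```

Shell arithmetic truncates toward zero, so a tool such as bc or awk would be needed if fractional limits mattered.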

Question 28

Automated Archive and Retention

Accepted Answer

We have files in the /etc folder that are at risk of being lost due to accidental changes or deletion. We need to write a script that accepts the target backup path as a command line argument, creates a compressed archive of the /etc directory with a naming format of etc backup, then year, month, day, and the extension .tar.gz, and automatically removes backups older than seven days. Then we need to create a cron job that will run the file every day at 2:00 AM. Once the file is created, we give it execute permission. We create a conditional statement where we say that if argument number one, meaning the expected target location, is not provided, it'll throw an error saying backup directory path is required. Then comes the main command to archive the directory. We then implement the requirement to delete files that are older than seven days. We'll use the find command for this, with -mtime +7 and -delete to delete files that are older than seven days. crontab -e will edit the crontab. We'll type 0 2 * * *, which will run it every day at 2:00 AM.
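A sketch of such a script follows, with a short demo run on sample data. The script name, the exact archive name separators and the SRC_DIR override are assumptions added so the example can run without reading the real /etc; touch -d assumes GNU touch.

```shell
workdir=$(mktemp -d)
script="$workdir/backup.sh"   # hypothetical script name

cat > "$script" <<'EOF'
#!/usr/bin/env bash
# Usage: backup.sh BACKUP_DIR
# SRC_DIR is a hypothetical override used for the demo; it defaults to /etc.
if [ -z "$1" ]; then
  echo "Error: backup directory path is required" >&2
  exit 1
fi
backup_dir=$1
src_dir=${SRC_DIR:-/etc}
mkdir -p "$backup_dir"

# Archive name: etc_backup_YYYYMMDD.tar.gz (separators are an assumption).
archive="$backup_dir/etc_backup_$(date +%Y%m%d).tar.gz"
tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"

# Remove backup archives last modified more than seven days ago.
find "$backup_dir" -name 'etc_backup_*.tar.gz' -type f -mtime +7 -delete
EOF
chmod +x "$script"

# Demo run: a sample source directory and one stale archive that should be removed.
mkdir -p "$workdir/etc_sample" "$workdir/backups"
echo "key=value" > "$workdir/etc_sample/app.conf"
touch -d '10 days ago' "$workdir/backups/etc_backup_20000101.tar.gz"
SRC_DIR="$workdir/etc_sample" "$script" "$workdir/backups"
ls "$workdir/backups"

# Example cron entry for 2:00 AM every day (added with crontab -e):
#   0 2 * * * /path/to/backup.sh /path/to/backups
```

Restricting the find to the etc_backup_ pattern keeps the cleanup from deleting unrelated old files that happen to live in the backup directory.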

Question 29

Trace Process Service Ownership

Accepted Answer

In production systems, alerts often come with a process ID, saying that certain process IDs are having troubles, such as going beyond memory limits and so on. Often we need a script that will let us debug problems that are hinted to us via a process ID and get from there to the service name. We need to create a trace service .sh bash script that will give us the status, the service name and its last 20 log lines, all just by providing the process ID. We will try systemctl status 2031, which will show us the status of this service and its name, the data processing service. Since we know the name of the service, we can run systemctl status with the name of the service. We'll take the first argument as the variable PID. We need to extract the service name. We can use awk and print the second column, which is the name of the service. We need to print the status and the logs. We make this executable.

Question 30

Create AWS IAM Admin User with Group and Policy

Accepted Answer

We need to set up a new admin account for a regular user. We've been asked to create an IAM user named DevOps admin with console password access, and then we need to add the user to the admin group, which should have the administrator access policy attached. In AWS, we can attach policies, meaning access rights, to a user in two ways. First, we can attach the policy directly to the user, but this won't be very efficient. Instead, in AWS, we can use what's called groups. We can create an admin group and then attach administrator access directly to this group. Once this is done, we need to tag this user with a key equal to role and a value equal to DevOps. We use tags to easily identify users. First we need to go to IAM. This is where we'll create users. We need to provide console access, which means that this user will be able to access AWS not only via the CLI but also via the UI. The policy name was Administrator Access. We could see that the action is a wildcard, meaning everything, the effect is allow, and the resource is also a wildcard.

Question 31

Create IAM Role for EC2 with Full IAM Access

Accepted Answer

Our team needs an EC2 instance to manage IAM resources programmatically, and to follow security best practices we need to use an IAM role instead of embedding credentials. Our task is to create the IAM role IAM full access EC2, allow the EC2 service to assume this role, and attach the IAM full access policy to it. In AWS, we could access services by using an access key and a secret key, which is like having a username and password. Imagine if our AWS EC2 instance has been compromised: our access key and secret key could be reused for other purposes. The best practice in AWS is to use something called an IAM role. When an EC2 instance assumes a role, it sends an API request to the AWS Security Token Service, which checks whether this instance has access to this role, and if it does, the Security Token Service gives temporary credentials to our EC2 instance. These are short-lived credentials and they're never stored directly on our EC2 instance.

Question 32

Create a Hello World Lambda Function

Accepted Answer

We need to create a simple Lambda function for our greeting microservice, and our task is to create a Lambda function named Hello function, which will take the name from the event and return a greeting. We can use the pre-created IAM role, lambda execution role, and at the end we need to invoke the function with the name world and verify that we get Hello World as the return value. We can use Python or Node.js. We need the Lambda execution role to allow the Lambda function to access our AWS services. The handler name will be Lambda Handler. We can use a different name, but Lambda Handler is the default name. We'll have a variable name set to the name key of the event, because the event is a dictionary, or a map, with keys and values. And finally, we need to return the result in the expected format. First we go to Lambda and create the function. As a runtime we'll use Python, keep the handler name as default, and in permissions we can use an existing role and choose the role lambda execution role. We won't need JSON because we need a simple string return. This variable will grab the event name, and we will return an f-string: Hello and then the name.

Question 33

Launch an EC2 Web Server Instance

Accepted Answer

We need to create an EC2 instance. The instance needs to be reachable over HTTP and serve a Hello from web one page. We need to create a security group that will allow TCP port 80, meaning HTTP, over IPv4 to be able to access this EC2 instance. We go to security groups and create our security group. The security group name is web sg. We add inbound rules because we need HTTP traffic to be allowed. We add the rule HTTP, anywhere IPv4. We click on instances and click on launch instance. The instance type will be t2.micro. For firewall, meaning security groups, we'll go to existing security groups. Security groups manage who can access our instance and which protocols they can use to do that. In AWS, there's something called user data. User data provides commands that will run when you launch your instance. We'll create the directory /var/www/html with mkdir; -p will make sure that all those directories are created. And then we will echo hello from web one into the page file.

Question 34

Audit and Enforce Least-Privilege IAM Permissions

Accepted Answer

This question is about the user app deployer having way too much access, basically having root access through the administrator access policy on our AWS account. This violates the principle of least privilege, meaning a user should only have access to the resources that they need and nothing more. This user should have access to S3 to get objects, put objects and list buckets, and to CloudWatch Logs to create log groups, create log streams and put log events. Our task is to inspect the current policies attached to app deployer, remove the overly broad administrator access policy, create a new policy with the proper rights, and attach it to the user app deployer. In AWS, administrator access is a built-in policy that has maximum access on the AWS account. The second item is action. Action means what type of action we can perform on AWS. Resource means which AWS resource this action applies to. A wildcard means that we can put an S3 object into any S3 bucket. We first need to inspect the current policies attached to the app deployer user. We've been asked for get object, put object and list buckets in S3. The same goes for logs: create log group, create log stream, and put log events.

Question 35

Create Route 53 Hosted Zone and DNS Records

Accepted Answer

We need to configure DNS in Route 53 for the domain example.com. We are tasked with creating a hosted zone, example.com, and creating an A record for this subdomain pointing to this IP address. We're also given a requirement that DNS responses need to be cached by resolvers for one minute. First your browser checks your stub resolver. The stub resolver is the part of your operating system where the operating system keeps records of domains and their corresponding IP addresses. If you haven't set up any specific DNS server entries in your operating system, the fallback will be the DNS resolvers at your ISP, your internet service provider. TTL means how long resolvers will keep the example.com entry in their cache. We have two types of hosted zones: public hosted zones and private hosted zones. A public hosted zone will be resolvable on the internet. A private hosted zone will be available only within our VPC. This IP address is not routable on the internet, meaning we have to create a private hosted zone. The record type will be an A record, which routes traffic to an IPv4 address. The TTL is 60 seconds and the subdomain is app.

Question 36

Create Route 53 Health Checks

Accepted Answer

We have two services, a web server and an API server, and we need to create health checks for those services. If we create a health check in Route 53, that health check will periodically request the health of that server and store it in its database. Before the DNS resolver responds to us with an IP address, it checks the database of the Route 53 health check to see if this endpoint is healthy or not. If it's healthy, it'll point us to our primary endpoint. If not, we'll be pointed to the failover endpoint, a secondary endpoint that we will configure. By using a health check, we can be confident that our users will always end up at a working website instead of receiving an error. Additionally, health checks can be used to publish some data about that endpoint, and that data can be pushed to CloudWatch. We go to Route 53, open the menu here and select health checks. This is an endpoint health check. You need to click Advanced Configuration to fill in the details that are provided in the table of the question.

Question 37

Design Egress Only VPC with NAT

Accepted Answer

We have to create infrastructure for ECS and EC2 instances, and that infrastructure has to span at least two availability zones. These workloads require outbound internet access, and inbound access is not allowed. Additionally, our application should send data to S3 in a cost-effective way. A VPC is a private network into which you deploy subnets and compute resources. The subnets live in availability zones, and an availability zone is effectively a data center. We split resources across availability zones for redundancy. The CIDR block defines how big our network range will be: it sets how many IP addresses are available within our virtual private cloud. Since we need outbound access but no ingress traffic, we can use a NAT gateway. We need to allocate an Elastic IP address for this NAT gateway because it needs a public IP address. We need to add a route: a default route that targets our NAT gateway. Our security group has to allow HTTP and HTTPS outbound and no inbound traffic. Since S3 and our virtual private cloud are both hosted by the same provider, AWS, we can use AWS's own private network to send traffic to S3, and it's free. We could do that using VP
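To make the CIDR sizing above concrete, here is a small sketch using Python's ipaddress module. The 10.0.0.0/16 range and the /24 subnet size are assumptions chosen for illustration; AWS reserves five addresses in every subnet.

```python
import ipaddress
from itertools import islice

vpc = ipaddress.ip_network("10.0.0.0/16")  # assumed VPC range
print(vpc.num_addresses)  # total addresses in the VPC range

# Carve out four /24 subnets, e.g. one public and one private per zone.
subnets = list(islice(vpc.subnets(new_prefix=24), 4))
for subnet in subnets:
    print(subnet)

# AWS reserves the first four addresses and the last one in each subnet.
usable = subnets[0].num_addresses - 5
print(usable)
```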

Question 38

Build a Serverless API with Lambda, API Gateway, and DynamoDB

Accepted Answer

We need to build an internal serverless API for an order management service. Orders will go into a DynamoDB table, and all access to that table must go through a Lambda function. Lambda is going to be our handler, and we will not access DynamoDB directly. The Lambda execution role should have only the exact permissions it needs and nothing more. This means it's going to be a least-privilege role: the app needs to read a specific order, never scan the entire table, and write orders into DynamoDB. An IAM role gives our Lambda function temporary credentials to perform actions on our AWS services. In order to call our Lambda function, we need something to invoke it. API Gateway gives us a straightforward way to call a Lambda function. We'll create two methods, GET and POST, in our API Gateway and point them to our Lambda function. The first thing we need to do is create our DynamoDB table. The name will be orders and the partition key will be order id. The question says that we are not allowed to attach wildcard policies, so we'll create an inline policy. The resource will be our DynamoDB table. The actions will be PutItem and GetItem in DynamoDB.
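A least-privilege inline policy of the kind described above might look like the following sketch. The region and account ID are placeholders, and the exact table name casing is an assumption.

```python
import json

def orders_table_policy(region, account_id, table):
    """Allow only single-item reads and writes on one DynamoDB table."""
    table_arn = f"arn:aws:dynamodb:{region}:{account_id}:table/{table}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],  # no Scan
            "Resource": table_arn,  # one table, no wildcard
        }],
    }

# Region and account ID below are placeholders (assumptions).
policy = orders_table_policy("us-east-1", "123456789012", "orders")
print(json.dumps(policy, indent=2))
```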

Question 39

Deploy an Internal Web App with VPC, EC2, ALB, and Route 53

Accepted Answer

This is a realistic three tier style deploy kept fully private. You stand up a VPC with public and private subnets, run EC2 instances in the private subnets behind an Application Load Balancer with health checks, lock traffic down with security groups, and expose the app through an internal Route 53 record instead of the public internet. The ALB sits where it can reach the instances, the security groups only let the ALB talk to the app port, and the Route 53 record points at the load balancer. Where people slip is security group chaining and putting the instances somewhere the ALB cannot actually reach. VPC layout, ALB health checks, security groups, and internal DNS together make a strong AWS architecture interview question because they force you to connect networking, compute, and routing into one working system.

Question 40

Dynamic Volume Expansion

Accepted Answer

We need persistent storage that can be resized without downtime. That means we'll create a storage class that can be resized on the spot, so we have to enable the allowVolumeExpansion feature. We will need to create a storage class, a persistent volume claim, and a pod. The initial size is one gigabyte, and the size we'll expand to is five. First we added the storage class: we changed the name and switched allowVolumeExpansion from false to true. Next, we added the persistent volume claim: we changed the name, the namespace, and the storage class name to the expandable storage class, and the initial storage is one gigabyte. We'll apply everything one by one: apply the storage class, then apply the persistent volume claim. Then we need to expand the volume claim from one to five. Our storage has been expanded to five gigabytes.
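The two manifests and the resize step can be sketched as plain data; kubectl accepts JSON as well as YAML. All names here (expandable-sc, data-pvc, the default namespace) and the provisioner are hypothetical, since the transcript does not spell them out.

```python
import copy
import json

storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "expandable-sc"},           # hypothetical name
    "provisioner": "example.com/hypothetical-csi",   # assumption
    "allowVolumeExpansion": True,                    # the key setting
}

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data-pvc", "namespace": "default"},  # hypothetical
    "spec": {
        "storageClassName": "expandable-sc",
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "1Gi"}},
    },
}

def expand(claim, new_size):
    """Return a copy of the claim requesting a larger size."""
    resized = copy.deepcopy(claim)
    resized["spec"]["resources"]["requests"]["storage"] = new_size
    return resized

expanded = expand(pvc, "5Gi")
print(json.dumps(expanded, indent=2))
```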

Question 41

Create Namespace

Accepted Answer

Create Kubernetes Namespace: Set Up an Isolated Environment for Application Workloads. Establish an isolated Kubernetes environment by creating a dedicated namespace for experimentation and application workloads. This improves cluster organization, enables cleaner multi-tenant setups, and lays the foundation for applying scoped RBAC, resource quotas, and policies per environment.

Question 42

Pod with Readiness Probe

Accepted Answer

Kubernetes HTTP Readiness Probe: NGINX Port 80 Path / Health Check Mastery. Deploy production-ready Kubernetes pods with HTTP readiness probes using web-ready pod and nginx:latest image. Configure HTTP GET probe on port 80 path / to ensure only healthy web servers receive traffic, preventing traffic to unready pods. Master readinessProbe httpGet, health check configuration, traffic routing control, and zero-downtime deployments. Perfect for web application HA, microservices readiness, NGINX Kubernetes deployment, and production traffic management.
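A minimal sketch of the pod described above, written as a JSON-compatible manifest. The pod name, image, port, and path come from the description; the container name and the probe timings are assumptions.

```python
import json

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-ready"},
    "spec": {
        "containers": [{
            "name": "nginx",  # assumed container name
            "image": "nginx:latest",
            "ports": [{"containerPort": 80}],
            "readinessProbe": {
                "httpGet": {"path": "/", "port": 80},
                "initialDelaySeconds": 5,  # assumed timing
                "periodSeconds": 10,       # assumed timing
            },
        }]
    },
}

print(json.dumps(pod, indent=2))
```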

Question 43

Pod Viewer Access

Accepted Answer

Kubernetes RBAC Pod Reader: Namespace demo read-only Access Control. Grant monitoring applications precise read-only Pod access in demo namespace using RBAC Role pod-reader, ServiceAccount reader-sa, and RoleBinding pod-reader-binding. Authorize only get, list, watch verbs on pods resource for principle of least privilege. Perfect for monitoring tools, observability platforms, security auditing, compliance requirements, and namespace-scoped access control.
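The three objects named above can be sketched as JSON-compatible manifests. The names and verbs follow the description; nothing else is assumed beyond the standard RBAC API group.

```python
import json

NAMESPACE = "demo"

service_account = {
    "apiVersion": "v1",
    "kind": "ServiceAccount",
    "metadata": {"name": "reader-sa", "namespace": NAMESPACE},
}

role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "pod-reader", "namespace": NAMESPACE},
    "rules": [{
        "apiGroups": [""],  # core API group, where pods live
        "resources": ["pods"],
        "verbs": ["get", "list", "watch"],  # read-only
    }],
}

role_binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "pod-reader-binding", "namespace": NAMESPACE},
    "subjects": [{
        "kind": "ServiceAccount",
        "name": "reader-sa",
        "namespace": NAMESPACE,
    }],
    "roleRef": {
        "apiGroup": "rbac.authorization.k8s.io",
        "kind": "Role",
        "name": "pod-reader",
    },
}

for manifest in (service_account, role, role_binding):
    print(json.dumps(manifest, indent=2))
```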

Question 44

Crashing Misconfigured Pod

Accepted Answer

We have a Kubernetes cluster with a deployment, web app, in the namespace prod, and it's stuck in the CrashLoopBackOff state. Our task is to fix the pod so it reaches the Running state. The config directory already exists and is used by the application. We cannot mount the config map over the entire directory because that would hide the existing contents and cause the app to fail. The deployment needs to be fixed so that the configuration is injected without replacing the existing directory. Check the logs. The log says that the config file could not be found. We're missing the config map. We're missing subPath because we're trying to mount a file instead of a directory. If we don't use subPath, this file will be mounted as a folder. We fix this by adding subPath. We run kubectl edit on the web app deployment in the prod namespace, find the section that says volumeMounts, and add subPath: config.yaml. Our pod is now working: it's in the Running state and the old one is being terminated.
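The effect of subPath can be illustrated with a simplified model of what the container sees in its config directory. The file names, volume name, and mount path below are hypothetical; only the subPath value config.yaml comes from the walkthrough.

```python
def files_in_config_dir(existing, configmap_keys, sub_path=None):
    """Model the directory contents after mounting a ConfigMap volume."""
    if sub_path is None:
        # Mounting over the directory hides everything that was there.
        return sorted(configmap_keys)
    # With subPath, only the single file is placed into the directory.
    return sorted(set(existing) | {sub_path})

existing = ["defaults.yaml", "plugins.d"]  # hypothetical image contents
keys = ["config.yaml"]

print(files_in_config_dir(existing, keys))
print(files_in_config_dir(existing, keys, sub_path="config.yaml"))

# The matching volumeMounts entry (volume name and path are assumptions).
volume_mount = {
    "name": "config",
    "mountPath": "/app/config/config.yaml",
    "subPath": "config.yaml",
}
```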

Question 45

Image Pull BackOff and Secrets

Accepted Answer

We have a deployment, backend, in the dev namespace, and it's failing to start. Pods are stuck in ImagePullBackOff. Our task is to fix the deployment so it successfully pulls the image and enters the Running state. Describe the pod to see details. When we look at the events at the bottom, we can see that we get 401 Unauthorized. The version is also not the same as the one in the image information, so we need to fix this. We can run kubectl edit with the namespace dev, deploy, and the deployment name backend, which opens the Vim editor. The second thing we need to do is create a secret, using the kubectl create secret command. The secret type will be docker-registry because it's a secret for the registry. Next, edit our deployment again. Find the pod template's spec and add the imagePullSecrets configuration. And now our pod is running.
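To show what a docker-registry secret contains, the sketch below builds the same structure by hand. The registry host, credentials, and secret name are all hypothetical.

```python
import base64
import json

def docker_config(server, username, password):
    """Build the Docker config JSON that a registry secret stores."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"auths": {server: {
        "username": username, "password": password, "auth": auth,
    }}}

# Hypothetical registry and credentials.
config = docker_config("registry.example.com", "ci-user", "s3cret")

secret = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {"name": "regcred", "namespace": "dev"},  # name is assumed
    "type": "kubernetes.io/dockerconfigjson",
    "data": {
        ".dockerconfigjson":
            base64.b64encode(json.dumps(config).encode()).decode(),
    },
}

# The deployment's pod template then references the secret by name.
image_pull_secrets = [{"name": secret["metadata"]["name"]}]
print(json.dumps(secret, indent=2))
```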

Question 46

CronJob Schedule Misconfiguration

Accepted Answer

We have a cron job named cleanup in the ops namespace that is failing to trigger, and we suspect an incorrect schedule that relies on the default time zone. It also retains too many completed jobs. The task is to fix the cleanup cron job so that validation confirms it triggers exactly once per minute. The time zone should be UTC. First we'll get the YAML manifest of the cron job named cleanup. To edit this cron job, we'll use the kubectl edit command. We need to update the schedule. The second thing is to add the time zone. We've been told that only the most recent successful run should be retained. successfulJobsHistoryLimit is set to three, so we need to update it to one.
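The three changes can be expressed as one merge patch. This is a sketch of an alternative to interactive editing, not necessarily what the walkthrough did; Etc/UTC is one valid way to name the UTC zone.

```python
import json

cronjob_patch = {
    "spec": {
        "schedule": "* * * * *",          # every minute
        "timeZone": "Etc/UTC",            # explicit zone instead of the default
        "successfulJobsHistoryLimit": 1,  # keep only the latest success
    }
}

def is_every_minute(schedule):
    """True when all five cron fields are wildcards."""
    return schedule.split() == ["*"] * 5

print(json.dumps(cronjob_patch))
print(is_every_minute(cronjob_patch["spec"]["schedule"]))
```

The printed JSON could be passed to kubectl patch cronjob cleanup -n ops --type merge -p.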

Question 47

Traffic Splitting with Native Kubernetes

Accepted Answer

Traffic splitting with native Kubernetes means that we need to split traffic between two deployments, V one and V two, using only native Kubernetes primitives. We will not use a reverse proxy like Traefik or NGINX; we'll use Kubernetes services instead. This is called a canary deployment: we have a stable version of our application, V one, and we would like to introduce version V two, but we don't want to channel all the traffic to V two because we are not sure whether it's stable, so we'd rather send only one third of the traffic to it. We'll create a service, my app, with a selector that matches both V one and V two pods, because both deployments carry the same label. That service will route randomly to V one and V two pods because they match the same selector. We can monitor the logs of our V two deployment and see whether it's stable. When we check the endpoints for the my app service in the canary namespace, we have three endpoints: two of them for V one and one for V two.
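The split follows directly from the endpoint counts. As a small sketch, assuming the service spreads requests evenly across its endpoints:

```python
from fractions import Fraction

def traffic_share(replicas):
    """Expected share of requests per version, given endpoint counts."""
    total = sum(replicas.values())
    return {version: Fraction(count, total)
            for version, count in replicas.items()}

share = traffic_share({"v1": 2, "v2": 1})
print(share["v1"], share["v2"])
```

Changing the ratio means changing replica counts, for example nine V one pods and one V two pod for a ten percent canary.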

Question 48

ConfigMap Reload with Sidecar

Accepted Answer

We have a config map, app config, that we would like to mount into this pod, both into the main container and into the sidecar. The reason is that we would like to run a script in our sidecar every five seconds to detect changes in the config map. Once a change is detected, it updates the settings file. First we will analyze our watcher YAML. This is a partial configuration of the sidecar, config watcher. Here's the sidecar container. It already contains the command, which runs the script every five seconds. We need to add the configuration of the pod itself. We add volume mounts to both the config watcher and the main container. The volume is config volume. This references the config map that will be mounted. One more thing we need to do is create the config map itself. Here's the name of the config map and the key, settings.conf. The value of settings.conf will be debug false.

Question 49

Implement StatefulSet with Stable DNS

Accepted Answer

We need to deploy a StatefulSet. We've been told that each pod must be addressable by a predictable and unchanging host name, so the nodes can find each other for data replication, and standard load-balancing services are not suitable for this. When we create a ClusterIP service, we get a virtual IP address that load balances between our workloads. In contrast, a headless service returns the IPs of all the pods behind it. First, we need to create a headless service. Second, we have to create a StatefulSet for that service. The main thing here is that clusterIP is set to None, which makes the service headless. The serviceName in the StatefulSet must match the headless service name. Finally, we can check our resolution by using this command and we can resolve this pod.
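The stable names that a StatefulSet behind a headless service produces follow a fixed pattern, sketched below. The StatefulSet, service, and namespace names are hypothetical, and cluster.local is the common default cluster domain.

```python
def pod_dns_names(statefulset, service, namespace, replicas,
                  cluster_domain="cluster.local"):
    """List the per-pod DNS names for a StatefulSet and its headless service."""
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]

# Hypothetical names for illustration.
names = pod_dns_names("db", "db-headless", "default", replicas=3)
for name in names:
    print(name)
```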

Question 50

StorageClass and PVC Expansion

Accepted Answer

This question is about storage classes and persistent volume claim expansion, meaning that if we have a persistent volume claim with one gigabyte of storage, we can expand it to two gigabytes and further. First, edit our storage class YAML. The main thing is that allowVolumeExpansion is changed from false to true. Next, we'll edit our persistent volume claim. We'll use Fast SC, the storage class we've just created, and the initial storage is one gigabyte. We've been asked to create a pod. We use kubectl run, which would create and run the pod, with the name of the pod, the image, the namespace, and --dry-run=client. The -o yaml flag outputs everything in YAML format. We can edit this file and add the missing things like volume mounts and so on. The last thing to do in this question is to expand our volume. Type kubectl edit on the persistent volume claim and change the storage from one to two.

Question 51

OOMKilled Pod Analysis & Fix

Accepted Answer

A common issue in containerized environments is running out of memory, at which point our container gets killed. In this example we have a pod named OOM demo. It has been terminated with the reason OOMKilled. We'll do kubectl get pod oom demo in the namespace apps and search the output for last state terminated, where the reason is OOMKilled. We can try another way by typing kubectl describe pods in the namespace apps. Usually this means that we don't have enough memory. We see that the memory limit is 20 megabytes. The question asks us to increase this limit from 20 to a hundred so that the application can run comfortably within these limits. We'll get the pod oom demo in the namespace apps with output YAML and save it as oom fix YAML. Find the line that says 20Mi and change it to 100Mi. Apply won't work in this case because we cannot update resource limits on a running pod, so we'll use the kubectl replace command instead. We'll use kubectl replace with the name of the file and the force flag to replace it forcefully.

Question 52

Secure Internal Service Communication

Accepted Answer

We have an application that requires TLS certificates for communication between internal services. We'll need to set up cert-manager to issue a valid TLS certificate, using a self-signed ClusterIssuer to bootstrap a CA issuer. We have service A and service B. Service A sends traffic to service B, and service B is a web server. Service B has to present a valid certificate. We need to add isCA: true. Otherwise, our intermediate certificate will not be able to issue leaf certificates, leaf meaning the final certificates. Next we'll create our CA issuer, which will be used to issue our leaf certificates. This issuer will use the certificate that we just created with isCA: true. The important part is the DNS names, because the certificate has to include those DNS names in its subject alternative name. Otherwise it will not be a valid certificate.
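The chain described above (self-signed ClusterIssuer, CA certificate, CA issuer, leaf certificate) can be sketched as four cert-manager manifests. Every name, the namespace, and the DNS names are hypothetical; only isCA and the overall structure come from the walkthrough.

```python
import json

NS = "default"  # hypothetical namespace

selfsigned_issuer = {
    "apiVersion": "cert-manager.io/v1", "kind": "ClusterIssuer",
    "metadata": {"name": "selfsigned"},
    "spec": {"selfSigned": {}},
}

ca_certificate = {
    "apiVersion": "cert-manager.io/v1", "kind": "Certificate",
    "metadata": {"name": "internal-ca", "namespace": NS},
    "spec": {
        "isCA": True,  # lets this certificate sign leaf certificates
        "commonName": "internal-ca",
        "secretName": "internal-ca-secret",
        "issuerRef": {"name": "selfsigned", "kind": "ClusterIssuer",
                      "group": "cert-manager.io"},
    },
}

ca_issuer = {
    "apiVersion": "cert-manager.io/v1", "kind": "Issuer",
    "metadata": {"name": "internal-ca-issuer", "namespace": NS},
    "spec": {"ca": {"secretName": "internal-ca-secret"}},
}

leaf_certificate = {
    "apiVersion": "cert-manager.io/v1", "kind": "Certificate",
    "metadata": {"name": "service-b-tls", "namespace": NS},
    "spec": {
        "secretName": "service-b-tls",
        # These names end up in the subject alternative name.
        "dnsNames": ["service-b", f"service-b.{NS}.svc.cluster.local"],
        "issuerRef": {"name": "internal-ca-issuer", "kind": "Issuer",
                      "group": "cert-manager.io"},
    },
}

for m in (selfsigned_issuer, ca_certificate, ca_issuer, leaf_certificate):
    print(json.dumps(m, indent=2))
```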

Question 53

Custom Resource Definition Setup

Accepted Answer

You have a controller, a certain operator, that manages resources in Kubernetes, and you'd like to have a custom resource called widget. In order to apply this custom resource, you need to create a custom resource definition. Otherwise, you cannot apply something that is beyond the known resources of Kubernetes. Our task is to create a custom resource definition and define the group as my company io. The kind is going to be Widget and the scope will be Namespaced. Namespaced scope means that we apply this widget inside a specific namespace. The version will be named v1, and then we'll need to create a custom resource of this type named simple widget in the extensions namespace. The API version will be apiextensions.k8s.io/v1 and the kind CustomResourceDefinition. In the spec, we will type the group name, the version, the scope, and the names: plural widgets, singular widget, kind Widget, and short name wd.
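A sketch of the definition and one instance, as JSON-compatible manifests. The group spelling mycompany.io and the resource name simple-widget are assumptions about how the spoken names are written; the permissive schema is a placeholder.

```python
import json

GROUP = "mycompany.io"  # assumed spelling of the spoken group name

crd = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": f"widgets.{GROUP}"},  # must be <plural>.<group>
    "spec": {
        "group": GROUP,
        "scope": "Namespaced",
        "names": {"plural": "widgets", "singular": "widget",
                  "kind": "Widget", "shortNames": ["wd"]},
        "versions": [{
            "name": "v1", "served": True, "storage": True,
            "schema": {"openAPIV3Schema": {
                "type": "object",
                # Placeholder schema that accepts any fields.
                "x-kubernetes-preserve-unknown-fields": True,
            }},
        }],
    },
}

widget = {
    "apiVersion": f"{GROUP}/v1",
    "kind": "Widget",
    "metadata": {"name": "simple-widget", "namespace": "extensions"},
    "spec": {},
}

print(json.dumps(crd, indent=2))
print(json.dumps(widget, indent=2))
```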

Question 54

CRD Schema Validation

Accepted Answer

This question is about custom resource definition schema validation. The custom resource definition is called widgets. In Kubernetes we can create custom resources with specific APIs. This resource has been created in the extensions namespace. If we do kubectl apply -f with the bad widget file, the widget is created without any validation. The spec is empty, but it has been applied nevertheless, which according to our question crashes our controller. We need to apply strict validation that requires spec.size. It has to be an integer and its minimum value has to be one. We'll do this by editing the custom resource definition schema with kubectl edit crd and the name of the resource. We need to find the schema section. We keep the same openAPIV3Schema field with type object. In properties, instead of an empty spec, we'll have a spec of type object with size required. Reapply our bad widget YAML. We get a validation error saying that the widget bad widget is invalid: Required value.
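The rule being added can be written down as a schema fragment, with a tiny hand-rolled check that mirrors it. The checker is only an illustration of the three conditions; the API server performs the real validation, and its error wording differs.

```python
# Schema fragment for the CRD version (placed under openAPIV3Schema).
WIDGET_SCHEMA = {
    "type": "object",
    "properties": {
        "spec": {
            "type": "object",
            "required": ["size"],
            "properties": {"size": {"type": "integer", "minimum": 1}},
        }
    },
}

def validate_widget(obj):
    """Return a list of problems with spec.size (empty list means valid)."""
    spec = obj.get("spec") or {}
    if "size" not in spec:
        return ["spec.size: Required value"]
    size = spec["size"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(size, bool) or not isinstance(size, int):
        return ["spec.size: must be an integer"]
    if size < WIDGET_SCHEMA["properties"]["spec"]["properties"]["size"]["minimum"]:
        return ["spec.size: must be at least 1"]
    return []

print(validate_widget({"spec": {}}))
print(validate_widget({"spec": {"size": 3}}))
```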

Question 55

Docker Image Tagging with Commit SHA

Accepted Answer

We have a repo with a Dockerfile, but no CI/CD pipeline set up. We are given a starter workflow file with just the basic structure, and we need to complete it. The goal is that every time someone pushes to main, a Docker image named app gets built and tagged with a short commit SHA. Every commit in Git has a unique hash. The short version is just the first seven characters. We get it by running git rev-parse --short HEAD. Each step runs in its own shell, so a regular variable set in one step doesn't exist in the next. GitHub Actions provides a mechanism called GITHUB_ENV that lets us share values across steps. actions/checkout pulls the code from the repo onto the runner so that the steps after it can access the Dockerfile and build the image. The next step grabs the first seven characters of the current commit hash and puts them in a variable called TAG. The final step builds a Docker image and tags it as app: followed by the value of env.TAG.
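The tagging logic itself is small enough to sketch. The commit hash below is a made-up example; the second function shows the KEY=value line format that a step appends to the file named by GITHUB_ENV.

```python
def image_tag(full_sha, name="app", length=7):
    """Tag an image with the first `length` characters of a commit hash."""
    return f"{name}:{full_sha[:length]}"

def github_env_line(key, value):
    """Format one line for the file that GITHUB_ENV points to."""
    return f"{key}={value}\n"

# Made-up 40-character commit hash for illustration.
sha = "9fceb02d0ae598e95dc970b74767f19372d61af8"

print(image_tag(sha))
print(github_env_line("TAG", sha[:7]), end="")
```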

Question 56

Matrix Build Strategy

Accepted Answer

We have a Node.js application with a test suite, and we need to make sure it works across three different Node.js versions: 18, 20, and 22. A matrix strategy lets us run the same job multiple times with different configurations. With a matrix, we define these versions once as an array, and GitHub Actions automatically creates a separate job for each value. One job definition becomes three jobs, and they all run in parallel. With container images, we tell the job to run inside a Docker container that already has Node.js installed. node:18-slim is a lightweight Docker image with Node.js ready to go. Artifacts are files that a workflow produces and stores after it finishes, things like test results, build outputs, or logs. By adding an upload artifact step, we save those files to storage. Each job gets its own matrix.node-version value. actions/checkout pulls the repo code into the container so that the subsequent steps can access the test script. Without unique names, the three parallel jobs would overwrite each other's artifacts.
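How one job definition fans out into several can be sketched with a Cartesian product over the matrix values. The key name node-version and the artifact naming scheme are assumptions for illustration.

```python
from itertools import product

def expand_matrix(matrix):
    """Turn a matrix definition into one configuration per job."""
    keys = list(matrix)
    return [dict(zip(keys, combo))
            for combo in product(*(matrix[key] for key in keys))]

jobs = expand_matrix({"node-version": [18, 20, 22]})

# Include the matrix value in the artifact name so uploads don't collide.
artifact_names = [f"test-results-node-{job['node-version']}" for job in jobs]

print(jobs)
print(artifact_names)
```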

Question 57

Multi-Job Workflow with Artifact Handoff

Accepted Answer

We need a two-stage pipeline. The first job runs tests and produces results. The second job takes those results and creates a summary report. The challenge is that each job runs on its own runner, so files from one job aren't automatically available in the next. We need to pass files between jobs using artifacts. Each job in a workflow runs on a separate runner. When the first job finishes, its runner is destroyed, along with all its files. We use artifacts as shared storage. Job A saves its output as an artifact, and then job B retrieves that same artifact. By default, jobs run in parallel. We need to create a dependency so that job B starts only after job A has completed successfully. This is where the needs keyword plays a major role. actions/checkout pulls the repo code onto the runner. We use ./.github/actions/download-artifact, which is the local action for retrieving artifacts. grep -c PASS test-results.txt counts how many lines contain PASS. The -c flag tells grep to print the number of matching lines instead of the lines themselves.
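What grep -c computes can be mirrored in a few lines, which makes the "lines, not occurrences" behaviour easy to see. The sample results text is invented.

```python
def count_matching_lines(text, needle):
    """Count lines containing `needle`, like grep -c with a fixed string."""
    return sum(1 for line in text.splitlines() if needle in line)

# Invented test output for illustration.
results = (
    "test_login PASS\n"
    "test_logout FAIL\n"
    "test_signup PASS PASS\n"  # two matches on one line still count once
)

print(count_matching_lines(results, "PASS"))
```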

Question 58

Path-Based Workflow Execution

Accepted Answer

We have a repo with infrastructure code in an infra/ directory and documentation in docs/. We need a workflow that runs only when infrastructure files change. By default, a push trigger fires on every push, regardless of which files changed. The paths filter narrows that down. Patterns use glob syntax. infra/** matches any file inside the infra directory, including files in subdirectories. The workflow runs only when at least one changed file matches a pattern. Without it, every push runs every workflow. With path filtering, the workflow runs only when relevant files change. This keeps the pipelines fast and focused and avoids wasting compute resources on checks that don't apply to the changes being made. actions/checkout pulls the repo code so that we can access the validation script.
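The "at least one changed file matches" rule can be approximated with Python's fnmatch. This is only a rough stand-in for GitHub's own glob matcher, whose rules differ in some details; the file paths are hypothetical.

```python
from fnmatch import fnmatchcase

def workflow_should_run(changed_files, patterns):
    """True if any changed file matches any pattern."""
    return any(fnmatchcase(path, pattern)
               for path in changed_files
               for pattern in patterns)

PATTERNS = ["infra/**"]

print(workflow_should_run(["docs/readme.md"], PATTERNS))
print(workflow_should_run(["docs/readme.md", "infra/modules/vpc/main.tf"],
                          PATTERNS))
```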

Question 59

Automated Rollback on Deployment Failure with Values File Restoration

Accepted Answer

We have a deployment workflow that sometimes fails. When it fails, the values.yaml config file has already been updated with the bad image tag, and the team has to revert it manually. The task: if the deployment fails, automatically restore values.yaml to the version from the previous commit and commit the fix. A rollback is reverting to a known good state after something breaks. If we say needs: deploy and if: failure(), the rollback job runs only when the deploy job has failed. By default, actions/checkout fetches only the latest commit, a shallow clone with fetch-depth 1. So we set fetch-depth to 0 to fetch the full Git history, which lets us use git checkout HEAD~1 -- values.yaml to restore the file from one commit back. So needs plus if: failure() creates a backup-plan pattern. The double hyphen separates the revision from the file path. Then we run git add values.yaml, which stages the restored file.

Question 60

Macvlan Network Configuration Fix

Accepted Answer

We have two running Docker containers, Mac one and Mac two. They're failing to communicate with the host network. The reason is that neither of these containers was created in the host network. Our task is to fix this so that both containers can reach the host network. Macvlan is a device that has its own virtual network, the same as a bridge. The difference between macvlan and bridge is that macvlan gives each container connected to it a unique MAC address on its network interface. We need to create an additional network interface and connect it to the macvlan. For this we'll use sudo ip link add, with type macvlan and mode bridge. We need an IP range to be assigned to this network interface. Finally, we enable our network interface and retest connectivity. We have 0% packet loss and everything is now working.

Question 61

Container CPU Limit Configuration

Accepted Answer

We have a container, CPU test, running from the image my app CPU, and it performs CPU-intensive computations. This container regularly spikes to 100 to 150 percent CPU usage, and our task is to limit that. We need to set 500m, meaning half a CPU, to achieve that. Type docker stats with the container name CPU test, and it will show us the container's CPU utilization in real time. Inspect the container with docker inspect and grep for CPU-related keywords: everything is either null or zero, meaning that no CPU-related limits are set on this container. We will run docker run in detached mode. We will use --cpus 0.5, which equals 500m, meaning we set a limit of half a CPU on this container. The container now uses below 60%, around 50%, of the CPU.
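The unit conversion behind that flag can be sketched as follows. Millicores (500m) are the Kubernetes-style notation, while docker inspect reports a CPU limit as NanoCpus, in billionths of a CPU.

```python
def millicores_to_cpus(quantity):
    """Convert a quantity such as '500m' or '2' into a number of CPUs."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

def nano_cpus(cpus):
    """Value shown as HostConfig.NanoCpus for a given --cpus setting."""
    return int(cpus * 1_000_000_000)

cpus = millicores_to_cpus("500m")
print(cpus)
print(nano_cpus(cpus))
```

A NanoCpus value of zero in docker inspect means that no limit is set, which matches the unconstrained container seen earlier.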

Question 62

Log Rotation Size Limit Configuration

Accepted Answer

We have a container image, my app log app, that continuously writes logs to stdout, and by default those logs are also saved to a log file. We need to run a container from this image with the name my app container, and we need to configure Docker so that it rotates those logs. We've been given an explicit limit of 10 megabytes maximum per file. We also have to retain at most three log files during rotation, which means that if we go over three files, the earliest file gets deleted. So we always retain a maximum of three files. We type docker images to see whether we have my app log app in our local image store. Then we implement those two requirements. It's a one-line command: we type docker run, and the container starts. Verify this by running docker ps.
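The retention behaviour can be simulated to see why disk usage stays bounded. This is a simplified model, writing one megabyte at a time; with Docker's default json-file driver the corresponding options are --log-opt max-size=10m and --log-opt max-file=3.

```python
from collections import deque

def simulate_rotation(total_mb, max_size_mb=10, max_file=3):
    """Return the sizes (in MB) of the log files kept after writing total_mb."""
    files = deque([0], maxlen=max_file)  # oldest file drops off the left
    for _ in range(total_mb):
        if files[-1] >= max_size_mb:
            files.append(0)  # rotate: start a new file
        files[-1] += 1       # write one more megabyte
    return list(files)

kept = simulate_rotation(45)
print(kept, sum(kept))
```

However much is written, the total never exceeds max-size times max-file, here 30 megabytes.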

Question 63

Docker Multi-Architecture Image

Accepted Answer

When we build our container image from this file, it is built for the architecture of our host system. Our task is to change the setup to build it for multiple architectures. We'll use Docker Buildx with a builder instance named multi-arch, created with buildx create. We list the current builders by typing docker buildx ls. We have only the default builder, which builds for our current platform. So we add a new builder: docker buildx create with the name multi-arch. We pass --driver-opt network=host and then --use to make it the default builder. Now we attempt to build for our multi-architecture setup by running docker buildx build. Verify this by typing docker images.

Question 64

Docker Binary Architecture

Accepted Answer

We have a container image built from a Dockerfile that fails during execution. Our task is to identify the issue and fix it so that the container runs successfully with exit code zero. The error reads exec, then the binary name, then exec format error. Usually this error means that the binary was built for the wrong architecture. To verify this, run the file command on the binary. It is an executable and its architecture is ARM64. Next we need to identify the host architecture, the one we're running Docker on. Type uname -m, which shows our host architecture: x86_64. There's a mismatch. x86 is an instruction set and AMD64 is the 64-bit version of that instruction set. When we are asked about x86 on 64-bit processors, we need to use amd64.
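The naming mismatch between uname -m and Docker's platform names is a common source of confusion, so here is a small sketch that normalizes both and compares them. The mapping covers only the two architectures discussed above.

```python
# Map names printed by `uname -m` to the names Docker and Go use.
UNAME_TO_DOCKER = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}

def docker_arch(machine):
    """Normalize an architecture name; unknown names pass through."""
    name = machine.lower()
    return UNAME_TO_DOCKER.get(name, name)

def can_run_natively(binary_arch, host_machine):
    """True when the binary's architecture matches the host's."""
    return docker_arch(binary_arch) == docker_arch(host_machine)

print(docker_arch("x86_64"))
print(can_run_natively("arm64", "x86_64"))
```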
