Best SRE Interview Questions to Ask and Answer

1

What is Consistent Hashing?

Reference answer

Consistent hashing is a technique for distributing keys (such as cache entries or database rows) across a changing set of nodes so that adding or removing a node remaps only a small fraction of the keys. Both nodes and keys are hashed onto the same circular hash space (the ring), and each key is assigned to the first node found moving clockwise from the key's position. With naive modulo hashing (hash(key) mod N), changing the number of nodes N moves almost every key. With consistent hashing, only about K/N of the K keys move on average. Virtual nodes (several ring positions per physical node) are commonly used to even out the distribution. The technique is widely used in distributed caches, sharded data stores, and load balancers.
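The ring lookup can be illustrated with a short sketch. This is a minimal example, not a production implementation: the class name ConsistentHashRing, its methods, and the choice of CRC32 as the hash function are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical illustration class; names are not taken from any library.
class ConsistentHashRing {
    // Sorted map from ring position to the node that owns that position.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // CRC32 is used only because it ships with the JDK (an assumption for
    // this sketch); real rings usually pick a better-distributed hash.
    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Place several virtual positions per node to smooth the distribution.
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.putIfAbsent(hash(node + "#" + i), node);
        }
    }

    // Remove only the positions that still belong to this node.
    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i), node);
        }
    }

    // Walk clockwise: first position at or after hash(key), wrapping around.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }
}
```

The property worth pointing out in an interview is that after removeNode, only keys that were owned by the removed node change owner; every other key keeps its previous node.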

2

What is observability?

Reference answer

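The concept of observability refers to the ability to understand the internal state of a software system based on its external outputs, such as metrics, logs, and traces. It involves using data and insights from monitoring to understand the system's health and performance. Common frameworks for choosing what to measure include the USE method (utilization, saturation, errors) for resources and the RED method (rate, errors, duration) for request-driven services.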

3

What is the swap area, regarding memory?

Reference answer

The swap area is disk space used as an extension of physical memory when RAM is under pressure. The kernel moves inactive memory pages to swap to free RAM for active processes. Swap can be a dedicated partition or a swap file. Excessive swapping degrades performance because disk I/O is far slower than RAM access.

4

What's the difference between TCP and UDP?

Reference answer

TCP (Transmission Control Protocol) is connection-oriented, reliable, and ensures ordered delivery with error checking and flow control. It is used for applications like web browsing and email. UDP (User Datagram Protocol) is connectionless, lightweight, and provides no guarantees on delivery, order, or error recovery. It is used for real-time applications like streaming and gaming.

5

Where does caching take place in servers? And what is cache invalidation?

Reference answer

Caching is the practice of storing frequently read, infrequently changing data in fast storage (usually memory) so that later requests can be served without recomputing or refetching it. It is used to speed up responses and to reduce network traffic and backend load. Caching can take place at different levels within a server:
- In front-end web servers, when a page is requested, the page's content is cached in memory.
- In back-end web servers, when a page is requested, the cache is checked first. If the cached content is still valid, it is served right away and no request to the backing store is needed. If the underlying data has changed since it was cached, the entry must be updated before it can be served.
Cache invalidation is the process of removing or refreshing cached entries that no longer match the source data, for example by expiring them after a time-to-live (TTL) or purging them when the underlying data is written. Caching helps any application that reads persistent data repeatedly or handles a high number of requests per second (RPS): by cutting the work done per request, it lets the server complete more requests per second without spending as much time loading and parsing data.
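To make the two invalidation styles concrete, here is a minimal sketch of an in-memory cache with time-based expiry and explicit invalidation. The class TtlCache and its method names are hypothetical, and the current time is passed in as a parameter purely to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical in-memory cache; class and method names are illustrative.
class TtlCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;

        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final long ttlMillis;

    TtlCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns the cached value while it is fresh; otherwise calls the
    // loader (standing in for the backing store) and caches the result.
    synchronized V get(K key, long nowMillis, Function<K, V> loader) {
        Entry<V> entry = entries.get(key);
        if (entry != null && nowMillis < entry.expiresAtMillis) {
            return entry.value; // cache hit
        }
        V loaded = loader.apply(key); // miss or expired entry
        entries.put(key, new Entry<>(loaded, nowMillis + ttlMillis));
        return loaded;
    }

    // Explicit invalidation: call this when the source data is written.
    synchronized void invalidate(K key) {
        entries.remove(key);
    }
}
```

TTL expiry bounds how stale an entry can become without any coordination, while explicit invalidation removes staleness immediately at the cost of wiring the cache into every write path.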

6

How do you handle database migrations in production with minimal downtime?

Reference answer

SREs use strategies like: making schema changes backward-compatible (add columns, not delete), using online migration tools (e.g., pt-online-schema-change for MySQL), running migrations in phases with canary deployment, and ensuring application code works with both old and new schema. Rollbacks are also planned. This minimizes downtime by allowing the application to continue serving during the migration process.

7

What is a Service Level Indicator?

Reference answer

A service level indicator (SLI) is a specific metric that measures one aspect of the level of service delivered to users, such as availability, latency, or error rate. SLIs are the measurements that SLOs set targets for, and SLOs in turn underpin the SLAs that define overall service reliability commitments. SLIs help businesses identify ongoing network and application issues, which leads to more efficient recoveries.
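A request-based availability SLI is commonly computed as the ratio of good events to total events. The sketch below shows that arithmetic; the class AvailabilitySli and its method names are assumptions for illustration, and what counts as a "good" event is left to the caller.

```java
// Hypothetical helper; names are illustrative, not from any library.
class AvailabilitySli {
    // SLI as a percentage: good events divided by total events.
    static double percent(long goodEvents, long totalEvents) {
        if (totalEvents <= 0) {
            throw new IllegalArgumentException("totalEvents must be positive");
        }
        return 100.0 * goodEvents / totalEvents;
    }

    // True when the measured SLI meets or exceeds the SLO target.
    static boolean meets(double sliPercent, double sloPercent) {
        return sliPercent >= sloPercent;
    }
}
```

For example, 9,990 successful requests out of 10,000 gives an SLI of 99.9 percent, which meets a 99.5 percent SLO but misses a 99.95 percent one.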

8

What is Transmission Control Protocol, or TCP, and can you list some of the TCP connection states?

Reference answer

TCP is a reliable, connection-oriented transport protocol that ensures data delivery in order. Common TCP connection states include LISTEN, SYN-SENT, SYN-RECEIVED, ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, TIME-WAIT, and CLOSED.

9

Describe how you integrate automated testing and reliability checks into CI/CD pipelines.

Reference answer

I integrate automated testing by adding unit tests, integration tests, and end-to-end tests at different pipeline stages. Reliability checks include: load testing for performance regression, chaos experiments (e.g., injecting failures) in a staging environment, and validation of SLIs (e.g., latency, error rates) against SLOs. I also incorporate static analysis and security scanning. The pipeline gates deployment based on passing these checks. If reliability metrics degrade, the pipeline fails, and the team is notified to address issues before production release.
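A pipeline gate of the kind described can be sketched as a function that compares measured latency and error rate against thresholds. This is only an illustration: the class ReleaseGate, its parameters, and the use of a nearest-rank p99 are assumptions, and a real pipeline would pull these numbers from its monitoring system.

```java
import java.util.Arrays;

// Hypothetical gate logic; thresholds and names are illustrative.
class ReleaseGate {
    // Nearest-rank percentile over a copy of the samples.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    // Returns true only when both the p99 latency and the error rate
    // are within their limits. Missing data fails closed.
    static boolean passes(double[] latenciesMs, long errors, long requests,
                          double maxP99Ms, double maxErrorRate) {
        if (latenciesMs.length == 0 || requests <= 0) {
            return false;
        }
        double p99 = percentile(latenciesMs, 99.0);
        double errorRate = (double) errors / requests;
        return p99 <= maxP99Ms && errorRate <= maxErrorRate;
    }
}
```

Failing closed when there is no data is a deliberate choice in this sketch: a load test that produced no samples should block the release rather than wave it through.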

10

What monitoring and automation tools are you familiar with?

Reference answer

I have experience using Prometheus and Grafana for monitoring application performance and system metrics. During my internship at Canva, I set up alerts in Grafana to notify the team of any unusual spikes in latency. Additionally, I'm familiar with Ansible for automating server configurations, which helped streamline deployments and reduce errors. I'm eager to learn more about other tools like Kubernetes for container orchestration.

11

Have you set up a disaster recovery plan? Describe the process.

Reference answer

Yes, setting up a disaster recovery plan is an essential aspect of site reliability engineering. In my previous role, I was tasked with creating such a plan for our major systems. First, we identified critical systems whose disruption would have the most significant impact on our business operations. For each of these systems, we mapped out the possible disaster scenarios, such as data center failure, network outage, or cyber-attacks. Then we evaluated each system's current state, including the existing backup processes, system resilience, availability, and the ability to function on backup systems. We identified the weaknesses and started addressing them. Next, we determined the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO) for each system, two critical metrics in disaster recovery. We then designed strategies for each disaster scenario considering the RPO and RTO. The strategies included mirroring data between data centers, establishing redundant servers, regular backup of data, and configuring auto-scaling and load balancing. Lastly, we frequently tested these strategies through drills, actual failover testing, and recovery drills. We learned from each test and refined our strategies. Setting up a disaster recovery plan is a dynamic and ongoing process. It requires regular monitoring, updating, training of the response team, and testing to ensure its effectiveness. The ultimate goal is to minimize downtime and prevent data loss in the event of a catastrophic failure.

12

What is the role of automation in SRE?

Reference answer

Automation is central to SRE for reducing toil, eliminating human error, and enabling consistent, scalable operations. Examples include automated deployment pipelines, self-healing systems (e.g., auto-restarting failed services), automated incident response (runbooks triggered by alerts), and infrastructure-as-code for provisioning. Automation allows SREs to manage complex systems efficiently and focus on higher-level engineering challenges.

13

What is the difference between active-active and active-passive failover?

Reference answer

Active-active failover involves multiple systems actively serving traffic, providing higher availability and load balancing. Active-passive failover has a primary active system with a standby passive system that takes over only if the primary fails, providing a backup.

14

What is a runbook?

Reference answer

A runbook is a detailed guide that outlines the steps required to perform specific operational tasks or handle incidents. It serves as a reference for engineers during troubleshooting.

15

How do you handle logging and log management?

Reference answer

Logging involves capturing and storing logs from various services, while log management includes aggregating, analyzing, and maintaining logs for troubleshooting and monitoring purposes.

16

Describe techniques to optimize costs while preserving reliability in cloud-native environments.

Reference answer

Techniques include: using reserved instances or savings plans for predictable workloads, implementing autoscaling to match demand, right-sizing instances based on actual usage (e.g., using Spot instances for fault-tolerant workloads), leveraging serverless for bursty or low-traffic services, optimizing storage (e.g., lifecycle policies for data), and reducing network costs by minimizing cross-region traffic. Reliability is preserved by maintaining redundancy, using health checks, and ensuring cost-saving measures (e.g., Spot interruptions) are handled gracefully.

17

How do you ensure data integrity during a system failure?

Reference answer

Ensuring data integrity involves using mechanisms like write-ahead logging, ACID transactions (where appropriate), checksums, and replication. During failures, SREs rely on idempotency, consistent backups, and automatic failover to replicated databases. Regular validation checks and recovery drills help verify that data remains consistent. For distributed systems, using consensus algorithms (e.g., Paxos, Raft) can help maintain data integrity across nodes.
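Of the mechanisms listed, checksums are the easiest to show in a few lines. The sketch below computes a SHA-256 digest with the JDK's MessageDigest and compares it with a stored value; the class name ChecksumVerifier and its methods are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; names are illustrative.
class ChecksumVerifier {
    // Hex-encoded SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on standard Java platforms.
            throw new IllegalStateException(e);
        }
    }

    // True when the data still matches the checksum recorded at write time.
    static boolean matches(byte[] data, String expectedHex) {
        return sha256Hex(data).equals(expectedHex);
    }
}
```

In practice the checksum is stored alongside the data (or in a manifest) when it is written, then re-verified after a restore, a replication catch-up, or a failover.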

18

How do you perform root cause analysis for a recurring incident?

Reference answer

SREs use techniques like the '5 Whys', fishbone diagrams, and analysis of incident timelines from monitoring logs and traces. They identify the immediate trigger, contributing factors, and systemic issues (e.g., lack of testing, monitoring gaps). The output is a prioritized list of action items (code fixes, process changes, improved alerts) to prevent recurrence, tracked in a postmortem document.

19

How would you ensure high availability and disaster recovery in a microservices architecture?

Reference answer

I would create a high-availability setup using load balancers and implement redundancy at each layer of the microservices architecture. By distributing traffic among multiple instances of a service, we can limit the impact of a single instance failure. For disaster recovery, I would implement a data replication strategy across different regions and ensure regular backups. I would also establish a well-documented failover procedure to minimize downtime during a disaster.

20

What are service level indicators (SLIs)?

Reference answer

The main metrics that demonstrate whether a service is on track are called service level indicators. Without them, it is challenging to determine whether the company is accomplishing its goals. SLIs can be broadly classified into three categories: availability, response time, and quality of service.

21

How does SRE differ from DevOps?

Reference answer

While both SRE and DevOps aim to bridge the gap between development and operations, SRE focuses more on reliability through engineering practices. SRE has specific goals like meeting Service Level Objectives (SLOs), automating operations, and managing risk through error budgets. DevOps is a broader cultural shift that emphasizes collaboration and continuous delivery. SRE is often considered an implementation of DevOps principles with a focus on reliability.

22

What is mean time to recovery and why is it important?

Reference answer

Mean Time to Recovery (MTTR) is the average time taken to recover from a failure or incident, from the moment it is detected until the service is restored. It is important because it measures the efficiency of incident response and resolution processes. A lower MTTR indicates faster recovery, minimizing downtime and impact on users, which is a key reliability metric for SREs.
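The calculation itself is a simple average over incidents. Here is a minimal sketch, assuming each incident is recorded with a detection time and a restoration time; the class Mttr and its nested Incident type are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical helper; the Incident shape is an assumption.
class Mttr {
    static final class Incident {
        final Instant detectedAt;
        final Instant restoredAt;

        Incident(Instant detectedAt, Instant restoredAt) {
            this.detectedAt = detectedAt;
            this.restoredAt = restoredAt;
        }
    }

    // Average of (restoredAt - detectedAt), truncated to whole seconds.
    static Duration meanTimeToRecovery(List<Incident> incidents) {
        if (incidents.isEmpty()) {
            throw new IllegalArgumentException("no incidents to average");
        }
        long totalSeconds = 0;
        for (Incident incident : incidents) {
            totalSeconds += Duration.between(incident.detectedAt, incident.restoredAt).getSeconds();
        }
        return Duration.ofSeconds(totalSeconds / incidents.size());
    }
}
```

Two incidents that took 30 and 90 minutes to resolve give an MTTR of 60 minutes. Because a mean hides outliers, teams often track the median or a high percentile alongside it.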

23

How did you handle recurring downtimes due to inefficient resource usage?

Reference answer

During a project last year, we had recurring downtimes due to inefficient resource usage that strained our servers during peak times. I spearheaded a comprehensive analysis of our application logs and server metrics to identify the components causing the inefficiencies. We found that a few database queries were underoptimized and causing high CPU usage. Working with the development team, we optimized the problematic database queries and also introduced a caching layer to reduce the load on the database. I also suggested splitting some of our monolithic services into scalable microservices to distribute the system load evenly. In addition, I recommended and implemented better alerting systems to proactively warn us about potential overload situations. These measures significantly reduced the frequency and duration of downtimes. We also improved our incident response time thanks to the new and more efficient alert system.

24

What are the most commonly used signals with the Linux kill command? What does each do? What is the default? When is each appropriate?

Reference answer

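- kill -15 sends SIGTERM, which asks the process to stop gracefully; the process can catch it and clean up first. It is the default signal and the right first choice.
- kill -1 sends SIGHUP. Its default action is to terminate the process, but many daemons catch it and reload their configuration, so it is commonly used to trigger a reload without a restart.
- kill -9 sends SIGKILL, which terminates the process immediately. It cannot be caught or ignored, so the process gets no chance to clean up; use it only when SIGTERM has failed.
You can follow this up nicely with a discussion of important system calls.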

25

How do you balance the need for rapid deployment with maintaining system reliability?

Reference answer

To maintain reliability while enabling rapid deployment, I implement SLOs that define acceptable performance and uptime levels. I use CI/CD pipelines to automate testing and integrate monitoring tools like Prometheus to catch issues early. At my last job, this approach allowed us to deploy updates weekly without sacrificing system reliability, resulting in a 30% decrease in downtime incidents.

26

What's the difference between TCP and UDP, and when would you use each?

Reference answer

Look for understanding of reliability vs. speed tradeoffs and the ability to explain concepts clearly rather than reciting memorized definitions.

27

How do you troubleshoot high CPU usage?

Reference answer

- Use `top` or `htop` to identify resource-heavy processes.
- Profile with `strace` or `perf`.
Example: A Java app with high CPU might need garbage collection tuning.

28

What is your process for conducting a post-mortem review after a significant incident?

Reference answer

After a significant incident, conducting a post-mortem review is integral to understanding what happened and how we can prevent similar occurrences in the future. The first step in this process is data collection. I gather all relevant information, including but not limited to, system logs, incident timelines, actions taken during the incident, and any communication that occurred. This step is followed by an analysis of the incident. I look at what triggered the issue, how we detected it, how long it took us to respond, and how effective our response was. We also investigate any cascading effects that might have occurred and preventive measures that were either lacking or failed. Once the analysis is complete, we organize a meeting with all relevant team members to go through the updated incident report and discuss our findings. During this meeting, we focus on identifying actionable improvements we can make to our systems and processes to avoid a similar incident in the future. We also address any communication or procedural issues that might have negatively impacted the incident management process. Importantly, the atmosphere during this meeting and the overall process is blame-free. The focus is solely on learning from the situation and improving our service. Finally, the outcome of this meeting, along with proposed changes and improvements, is documented and shared with stakeholders. We then track the implementation of these changes to ensure improvements are being made effectively.

29

GitLab CI vs. Jenkins

Reference answer

- GitLab CI: Integrated with GitLab.
- Jenkins: Standalone, plugin-driven automation server.

30

What is virtualization?

Reference answer

Virtualization is the process of running multiple virtual machines on a single physical system. Companies that want to pool their computing resources and keep them running round the clock without investing in extra hardware frequently employ it.

31

How have you implemented automation to improve efficiency in your previous role?

Reference answer

In my previous role, I recognized that a significant amount of time was being dedicated to repetitive manual tasks, such as deploying updates, system monitoring, database backups, and writing incident reports. I saw this as an opportunity to implement automation, saving the team time and reducing the chances of human error. I introduced DevOps tools like Jenkins and Ansible into our workflow. Jenkins was used to implement Continuous Integration/Continuous Delivery (CI/CD), which automated our code deployment processes, while Ansible allowed us to automate various server configuration tasks. To automate system monitoring, I set up automated alerts using Grafana and Prometheus. This helped us to get real-time notifications about any system performance fluctuations which might need our attention. For database backups and incident reports, I wrote custom scripts using Python. These scripts automated regular database backups and the generation of basic incident reports whenever a service disruption occurred, allowing us to focus on troubleshooting rather than spending time on documenting the issues. The end result was a considerable reduction in repetitive manual work, increasing our team's efficiency and productivity.

32

Given a root of the binary tree, a node X in the tree is called good if there are no nodes with values larger than X along the route from root to X. Write a program in which the number of good nodes in the binary tree should be returned. Example - Consider the tree given below. The Nodes marked with yellow color are good nodes. Because no such nodes in between have a value greater than the current node up to the root. [7,8,9,7]

Reference answer

To solve this problem, traverse every node recursively, passing down the largest value seen so far on the path from the root. At each node, compare the node's value with that maximum. If the node's value is greater than or equal to it, increment the count, update the maximum to the node's value, and pass it on to both children. The code for this approach is:

    // TreeNode is assumed to be the usual node class with val, left, and right fields.
    class Solution {
        // Running count of good nodes found so far.
        int ans;

        private void solution(TreeNode root, int val) {
            // The node is good when no value on the path from the root exceeds it.
            if (root.val >= val) {
                ans++;
                val = root.val;
            }
            // Recurse into whichever children exist.
            if (root.left != null) solution(root.left, val);
            if (root.right != null) solution(root.right, val);
        }

        public int goodNodes(TreeNode root) {
            // Reset the count so repeated calls start from zero.
            ans = 0;
            solution(root, root.val);
            return ans;
        }
    }

The time complexity of this approach is O(n) because every node is visited exactly once. The space complexity is O(h) for the recursion call stack, where h is the height of the tree, which is O(n) in the worst case of a skewed tree.

33

What is your experience with container orchestration tools like Kubernetes?

Reference answer

I have experience deploying and managing containerized applications on Kubernetes. This includes configuring deployments, services, and ingress, setting up autoscaling, monitoring cluster health, and troubleshooting issues with pods, nodes, and networking within the cluster.

34

What is Dynamic Host Configuration Protocol (DHCP), and what is it used for?

Reference answer

DHCP is a network protocol that automatically assigns IP addresses, subnet masks, gateways, and other network configuration parameters to devices on a network. It is used to simplify network management and avoid manual configuration of each device.

35

How do you ensure security while deploying infrastructure as code?

Reference answer

- Use tools like HashiCorp Vault for secret management.
- Implement role-based access control (RBAC) in deployment tools.
- Automate security scanning during the CI/CD pipeline.

36

Can you discuss how containerization contributes to site reliability?

Reference answer

Containerization greatly contributes to site reliability by encapsulating an application with its dependencies into a self-contained unit that can run anywhere. This ensures consistency across different environments - development, testing, staging, and production - thus reducing the "it works on my machine" type of problems. Furthermore, thanks to their lightweight nature, containers can be started and stopped quickly, which is crucial for scaling applications in response to changing demand, thereby improving site reliability.

37

Can you explain SLO?

Reference answer

Many people are aware of the Service Level Agreement (SLA), but fewer are aware of the Service Level Objective (SLO). An SLA is the promise we make to a customer about uptime and service availability. SLAs are often legally binding, with penalties for missing the target. An SLO is the target a team sets in order to meet the SLA, agreed beforehand so that service performance can be measured objectively and disputes avoided. SLOs provide a quantitative way to define the level of service a customer can expect, in terms such as availability, throughput, frequency, response time, or quality. SREs are often responsible for developing SLOs and collaborating with multiple teams to ensure they are realistic and sustainable. Candidates should therefore define the SLO, share an example of one, and explain how it helps teams and customers.
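One concrete example a candidate could give is a request-based SLO and the error budget it implies: the number of requests that may fail in a window before the objective is missed. The sketch below shows that arithmetic under stated assumptions; the class RequestErrorBudget and its method names are hypothetical, and the allowed count is rounded to the nearest whole request.

```java
// Hypothetical helper; names are illustrative.
class RequestErrorBudget {
    // Requests that may fail in the window without breaching the SLO,
    // rounded to the nearest whole request.
    static long allowedFailures(double sloPercent, long totalRequests) {
        return Math.round((100.0 - sloPercent) / 100.0 * totalRequests);
    }

    // Fraction of the budget already spent (1.0 means fully consumed).
    static double consumedFraction(double sloPercent, long totalRequests, long failedRequests) {
        double budget = (100.0 - sloPercent) / 100.0 * totalRequests;
        if (budget <= 0) {
            throw new IllegalArgumentException("SLO leaves no error budget");
        }
        return failedRequests / budget;
    }
}
```

With a 99.9 percent SLO over 1,000,000 requests, 1,000 requests may fail. If 250 have failed so far, a quarter of the budget is spent, which gives the team a shared, numeric basis for deciding whether to keep shipping or to slow down.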

38

How do you ensure data integrity in distributed systems?

Reference answer

Data integrity in distributed systems is ensured through techniques like transaction management, data replication, consistency checks, and using consensus algorithms (e.g., Raft, Paxos) to maintain consistency across nodes.

39

What is the purpose of load balancing?

Reference answer

The purpose of load balancing is to efficiently distribute incoming network traffic across a group of backend servers. This prevents any single server from becoming a bottleneck, improves application availability, and enhances overall system performance and reliability.
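Round-robin is one of the simplest distribution strategies and is easy to sketch. The class RoundRobinBalancer below is a hypothetical illustration; it ignores health checks and weighting, which real load balancers add on top.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical balancer; backend names are plain strings for illustration.
class RoundRobinBalancer {
    private final List<String> backends;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinBalancer(List<String> backends) {
        if (backends.isEmpty()) {
            throw new IllegalArgumentException("at least one backend is required");
        }
        this.backends = new ArrayList<>(backends); // defensive copy
    }

    // Returns backends in rotation; floorMod keeps the index valid even
    // after the counter overflows to a negative value.
    String pick() {
        int index = Math.floorMod(next.getAndIncrement(), backends.size());
        return backends.get(index);
    }
}
```

The atomic counter lets multiple request threads call pick() concurrently without locking.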

40

Why do you want to work in Site Reliability Engineering?

Reference answer

I am drawn to a career in the SRE sector due to its dynamic and challenging nature. It combines my passion for software development and operations, which provides the unique opportunity to bridge the gap between these two crucial aspects of technology. The SRE role is well-aligned with my goal of ensuring the reliability, scalability, and efficiency of systems that contribute to a seamless user experience. Furthermore, I am eager to contribute to the growth of a company and am confident that my proactive approach and problem-solving abilities will make me a valuable member of the team.

41

How would you reduce latency and improve performance for a globally distributed application?

Reference answer

- CDN: Use a Content Delivery Network (CDN) like Cloudflare or AWS CloudFront to cache static content closer to end-users.
- Edge computing: Move compute operations closer to users via edge services like AWS Lambda@Edge or Cloudflare Workers.
- Database replication: Implement geo-replicated databases to reduce query time by having data stored closer to users.
- Global load balancing: Use geo-based DNS routing or Anycast IP routing to direct users to the nearest regional data center.
- Caching: Introduce caching layers (e.g., Redis, Memcached) to reduce repeated database calls and application load.
These methods help reduce latency by bringing content and compute resources closer to the user.

42

What is chaos engineering and when would you use it?

Reference answer

Chaos engineering is the practice of intentionally injecting failures into a system to test its resilience and identify weaknesses before they cause real-world incidents. It is used to validate that systems can tolerate unexpected disruptions, such as server crashes, network latency, or resource exhaustion, and to improve confidence in system reliability, especially in distributed or complex architectures.

43

How should organizations define SLOs?

Reference answer

Organizations should come up with a simple SLO first and then iterate on it over time. They should consider questions such as where is their SLO document, how do they know that SLO matches customer expectations, what is their SLO review process, how do they consider their SLO and their system design process, and how do they measure SLO compliance.

44

Describe a time when you had to troubleshoot a production issue. What steps did you take?

Reference answer

During a high-traffic event, our web application experienced a sudden spike in latency. I quickly identified a database bottleneck, optimized the slow queries, and implemented caching, which resolved the issue and improved performance by 50%.

45

What is the difference between reliability and availability?

Reference answer

Availability measures whether a system is up and accessible (e.g., uptime percentage). Reliability measures whether a system consistently performs its intended functions correctly over time. A system can be available (up) but not reliable (e.g., serving errors). SREs focus on both, using SLIs and SLOs to track correctness and performance, not just uptime.

46

What would be the downtime for a server with 99.9 percent uptime?

Reference answer

A server with 99.9 percent uptime may be down 0.1 percent of the time: 1 minute and 26.4 seconds per day, slightly more than 10 minutes (about 10 minutes and 5 seconds) per week, or about 8.76 hours per year. That's adequate for a generic business server.
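The arithmetic generalizes to any uptime target and window. A minimal sketch, with the hypothetical class name DowntimeBudget and results rounded to the nearest millisecond:

```java
import java.time.Duration;

// Hypothetical helper; names are illustrative.
class DowntimeBudget {
    // Allowed downtime = (1 - uptime fraction) * window length,
    // rounded to the nearest millisecond.
    static Duration allowedDowntime(double uptimePercent, Duration window) {
        double downFraction = 1.0 - uptimePercent / 100.0;
        return Duration.ofMillis(Math.round(downFraction * window.toMillis()));
    }
}
```

For 99.9 percent, a one-day window yields 86.4 seconds and a seven-day window yields 604.8 seconds. Each extra nine divides the allowance by ten, so 99.99 percent leaves only about 8.6 seconds per day.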

47

How would you implement blue-green or canary deployments in a Kubernetes environment?

Reference answer

To implement blue-green or canary deployments in Kubernetes:
- Blue-Green Deployment: I would deploy the new version of the application in a separate environment (green), while the old version (blue) continues running. Once the new version is tested and stable, we switch traffic from blue to green by updating the selector of the Kubernetes Service that fronts the application.
- Canary Deployment: For a canary release, I would gradually roll out the new version to a small subset of users. Kubernetes deployment strategies, like rolling updates, or a service mesh such as Istio for traffic management, can control the rollout process. We monitor metrics to ensure the new release is stable before scaling it up to the entire user base.

48

How does SRE differ from DevOps?

Reference answer

While both SRE and DevOps focus on improving collaboration and efficiency between development and operations, SRE is more focused on applying engineering practices to operations, often with a stronger emphasis on reliability and performance.

49

Explain the concept of Service Level Objectives (SLOs) and how they are used in SRE.

Reference answer

Service Level Objectives (SLOs) are specific, measurable targets for system performance and availability that help set clear expectations between service providers and users. In SRE, SLOs guide prioritization and decision-making, ensuring that reliability and performance goals are consistently met.

50

What does your metrics & monitoring setup look like? How do you debug issues with the system?

Reference answer

This may be a controversial one, but if the title is "SRE" I ask why the title is "SRE" and not something else (same for "DevOps"). I'm looking to see if they're being thoughtful about what the term means and how they are defining "resilience" for their systems.

51

What is Infrastructure as Code (IaC), and how have you implemented it?

Reference answer

IaC is the practice of managing and provisioning infrastructure through code, rather than through manual processes. It enables consistent and repeatable deployment of servers and services with the help of tools such as Terraform, CloudFormation, or Azure Resource Manager templates. Application examples might include: Automating the creation of cloud environments, Scaling resources based on demand, Ensuring compliance with security policies.

52

Write a program that returns the leftmost value in the final row of a binary tree given the root. Example - In the below image, we can see that the leftmost node in the last row of the tree is 7. So we need to return that.

Reference answer

We can solve this problem recursively by traversing to the last row and returning the leftmost node value. Because we do not know in advance which sub-tree contains the final row, we track the height of each node and keep the value of the first leaf found at the greatest height. The code for this approach is:

    // TreeNode is assumed to be the usual node class with val, left, and right fields.
    class Solution {
        int maxHeight, ans;

        private void solution(TreeNode root, int height) {
            // A leaf deeper than any seen so far starts a new last row.
            // Left subtrees are visited first, so the first leaf found at
            // the maximum depth is the leftmost one.
            if (root.left == null && root.right == null) {
                if (height > maxHeight) {
                    maxHeight = height;
                    ans = root.val;
                }
                return;
            }
            // Recurse into whichever children exist.
            if (root.left != null) solution(root.left, height + 1);
            if (root.right != null) solution(root.right, height + 1);
        }

        public int findBottomLeftValue(TreeNode root) {
            maxHeight = -1;
            // Start the helper at the root with height 0.
            solution(root, 0);
            return ans;
        }
    }

The time complexity of this approach is O(n) because each node is visited only once. The space complexity is O(h) for the recursion call stack, where h is the height of the tree, which is O(n) in the worst case.

53

How do you collaborate with software development teams to build reliable software?

Reference answer

In my experience, close collaboration with software development teams plays a vital role in building reliable software. In one of my previous roles, I helped facilitate the adoption of a DevOps culture in the organization, which enhanced collaboration between the operations and development teams. We set up processes for reviewing each other's work and giving feedback, which led to better code quality and efficiency. As an SRE, I collaborated with development teams on establishing strong testing and deployment strategies. Incorporating a strong suite of tests, including unit, integration, and end-to-end tests, alongside a robust CI/CD pipeline, meant catching and rectifying many issues before they reached production. I've also worked with development teams to implement the principles of chaos engineering, gradually introducing faults into the system to test the resilience of our applications. This provided invaluable insights into potential weak points and allowed us to create better disaster recovery plans. Lastly, I've trained the development team on the principles of SRE and the importance of building with reliability and scalability in mind. Because everyone understood the intricacies of the production environment, they were more capable of writing code that performs well within that context.

54

Explain how you would approach debugging a memory leak in a distributed application.

Reference answer

To debug a memory leak, I would first identify the affected service by monitoring memory usage trends and correlating with recent changes. I would use profiling tools (e.g., heap dumps, memory analyzers) to capture memory snapshots and compare them over time. I would look for objects that are not being garbage collected, such as growing caches, unclosed connections, or listener leaks. In a distributed system, I would trace request flows to isolate the component. After identifying the root cause, I would fix the code and add monitoring for memory metrics to prevent recurrence.

55

What is an Inode?

Reference answer

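An inode is the data structure in a UNIX file system that stores the metadata about a file. Items in the inode include the mode (file type and permissions), owner (UID, GID), size, timestamps (access, modification, and change time), the hard-link count, and pointers to the file's data blocks. The file name is not stored in the inode; it lives in the directory entry that points to the inode.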
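To make the inode fields concrete, here is a minimal Python sketch that reads a file's metadata with `os.stat`. The helper name `describe_inode` is illustrative only and is not part of the reference answer.

```python
import os
import stat
import tempfile


def describe_inode(path):
    """Return selected inode metadata for path as a dict."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,                 # inode number
        "mode": stat.filemode(st.st_mode),  # type and permissions, e.g. -rw-------
        "uid": st.st_uid,                   # owner user ID
        "gid": st.st_gid,                   # owner group ID
        "size": st.st_size,                 # size in bytes
        "links": st.st_nlink,               # number of hard links
        "atime": st.st_atime,               # last access
        "mtime": st.st_mtime,               # last content modification
        "ctime": st.st_ctime,               # last metadata change (on UNIX)
    }


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"hello")
        f.flush()
        print(describe_inode(f.name))
```

On the command line, `stat <file>` and `ls -i` show the same information.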

56

Given a collection of ads (data structure given), what is the mean and median of the array?

Reference answer

Mean: Sum all ad values and divide by the number of ads. Median: Sort the array. If the number of elements is odd, median is the middle element; if even, median is the average of the two middle elements. For large datasets, use a streaming algorithm like a min-heap and max-heap for median tracking.
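The steps above can be sketched in Python as follows. The question's ad data structure is not shown, so this sketch assumes each ad has already been reduced to a single numeric value; the class name `StreamingMedian` is hypothetical.

```python
import heapq


def mean(values):
    """Arithmetic mean: sum divided by count."""
    return sum(values) / len(values)


def median(values):
    """Median by sorting: middle element, or average of the two middle elements."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


class StreamingMedian:
    """Running median: max-heap for the lower half, min-heap for the upper half."""

    def __init__(self):
        self._low = []   # max-heap, stored as negated values
        self._high = []  # min-heap

    def add(self, value):
        # Push to the lower half, then move its largest element to the upper half.
        heapq.heappush(self._low, -value)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Keep the lower half at least as large as the upper half.
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        # Assumes at least one value has been added.
        if len(self._low) > len(self._high):
            return -self._low[0]
        return (-self._low[0] + self._high[0]) / 2
```

The sorting version costs O(n log n) per query, while the two-heap version costs O(log n) per inserted value and O(1) per median lookup, which is why it suits large or continuously arriving datasets.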

57

How do you monitor and manage resource utilization in a containerized environment?

Reference answer

Managing resource utilization in a containerized environment is critical for optimizing performance and cost. Here's my approach:

- Resource Requests and Limits: I set appropriate CPU and memory requests and limits for each container, ensuring that no container consumes too many resources and causes resource contention.
- Horizontal Scaling: I use Kubernetes horizontal pod autoscalers to scale the number of pods based on resource utilization, ensuring optimal resource allocation.
- Monitoring Tools: I use Prometheus, Grafana, and Datadog for monitoring resource utilization and setting up alerts for when resources are over- or underutilized.
- Cost Optimization: I analyze container usage and adjust resource allocations based on historical performance metrics, ensuring we don't over-provision and waste resources.
- Logs and Metrics: I use tools like Fluentd or the ELK stack to aggregate logs and monitor performance metrics for each container to identify bottlenecks.

This proactive management approach helps ensure containers run efficiently while minimizing resource wastage.

58

What is the primary goal of SRE?

Reference answer

SRE ensures scalable, reliable systems by applying software engineering principles to operations. Key goals include automating repetitive tasks (toil reduction), defining SLIs/SLOs, and balancing innovation with reliability using error budgets.

59

What is the '/proc' file system?

Reference answer

The '/proc' file system is a virtual (pseudo) file system that the Linux kernel generates in memory; its files do not exist on disk. It exposes information about running processes (one directory per process ID, for example /proc/<pid>/status) and about the system's present condition, such as memory consumption (/proc/meminfo) and CPU details (/proc/cpuinfo). Some entries under /proc/sys are writable and can be used to tune kernel parameters at runtime.
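As an illustration of reading system state from /proc, the sketch below parses text in the format of /proc/meminfo. The function name `parse_meminfo` and the sample values are made up for the example; on a Linux host the same function can be fed the contents of the real file.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


# Illustrative sample, not real measurements.
SAMPLE = """MemTotal:       16384000 kB
MemFree:         2048000 kB
MemAvailable:    8192000 kB
"""

if __name__ == "__main__":
    # On Linux: mem = parse_meminfo(open("/proc/meminfo").read())
    mem = parse_meminfo(SAMPLE)
    used_pct = 100 * (1 - mem["MemAvailable"] / mem["MemTotal"])
    print(f"memory in use: {used_pct:.1f}%")
```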

60

How is the Service Risk (S.R.) approach similar to DevOps?

Reference answer

The Service Risk (S.R.) approach is similar to DevOps in its practices and fundamentals, but it comes from a different perspective. The goal of both is to build scalable and more reliable software.

61

How do you troubleshoot a performance bottleneck in a web application?

Reference answer

SREs start by defining the problem (e.g., high latency, low throughput) and using monitoring tools to identify the bottleneck (CPU, memory, I/O, network, or database). They use profiling, tracing, and load testing to isolate the cause. Common fixes include optimizing database queries, adding caching, scaling horizontally, or refactoring code. The process is iterative, with validation through performance testing after changes.

62

What metrics do you prioritize when evaluating system health?

Reference answer

I use the RED method: Rate, Errors, Duration. For Rate, I track requests per second because traffic patterns often precede issues. Errors are critical—I care about error count and error rate. Duration is latency—both p50 and p99, because p99 tells you about your worst users' experience. We also track saturation: CPU, memory, disk I/O, and connection pool utilization. These are early warning signs that we're about to have problems. For specific services, I add business metrics. For our payment service, I care about transaction success rate. For our search service, I care about results accuracy. The mistake I see people make is treating all metrics equally. We have hundreds of metrics, but I set up dashboards focused on the maybe 12 that actually tell me if the service is healthy. If those are green, we're good. If anything is red, I investigate. I also spend time understanding the baseline for each metric. A p99 latency of 2 seconds might be normal if we're doing complex queries, but it's a disaster if we should be responding in milliseconds.
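A minimal sketch of how the RED numbers could be derived from raw request data. It assumes latencies are collected per request in milliseconds and uses the nearest-rank percentile definition; monitoring systems often estimate percentiles from histograms instead, so treat the function names and inputs here as illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(latencies_ms, error_count, window_seconds):
    """Rate, Errors, Duration for one observation window."""
    total = len(latencies_ms)
    return {
        "rate_rps": total / window_seconds,
        "error_rate": error_count / total if total else 0.0,
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
    }
```

With 98 requests at 10 ms, one at 500 ms and one at 2000 ms, the p50 stays at 10 ms while the p99 jumps to 500 ms, which is why the tail percentile is tracked alongside the median.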

63

What are SNAT and DNAT?

Reference answer

Source Network Address Translation (SNAT)

- SNAT rewrites the source IP address of outgoing packets, typically mapping an internal (private) address to an external (public) address. It usually happens at the edge of the network, where a router or firewall connects to the public Internet.
- With SNAT enabled, devices on a private network can initiate connections to hosts on the Internet using the external address of the router or gateway that serves them.
- The NAT device tracks each connection so that replies are translated back and delivered to the originating internal device. Unsolicited inbound connections are not forwarded by SNAT alone.

Destination Network Address Translation (DNAT)

- DNAT rewrites the destination IP address (and optionally the port) of incoming packets. It is typically used to expose a server on a private network through a public address, as in port forwarding.
- DNAT can be used for many purposes, including load balancing, publishing internal services, and security, since internal addresses stay hidden from clients.
- For load balancing, traffic sent to one public IP address is translated to one of several private server addresses. This allows failover and redundancy behind a single public endpoint.

64

Describe a deployment strategy you've used to minimize downtime.

Reference answer

In my role at a tech startup, I implemented a blue-green deployment strategy to minimize downtime during major releases: the new version was deployed to an idle environment alongside the live one, and traffic was switched over only once it had been verified. I prepared a detailed rollback plan in case of issues. During deployment, I used Datadog to monitor system health and performance metrics closely. This approach allowed us to quickly revert to the previous version when we detected a problem, keeping our service reliable and the user experience unharmed.

65

What is 'MTTR' and 'MTBF'?

Reference answer

- MTTR (Mean Time to Recover): Average time to resolve incidents.
- MTBF (Mean Time Between Failures): Average time between system failures.
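A small worked sketch of both metrics, assuming incidents are recorded as (start, end) timestamp pairs. Definitions vary between teams; here MTBF is taken as total uptime in the observation window divided by the number of failures, which is one common convention rather than the only one.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to recover: average of (end - start) over all incidents."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents, window_start, window_end):
    """Mean time between failures: uptime in the window divided by the number of failures."""
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(incidents)


if __name__ == "__main__":
    # Hypothetical incident log for a 30-day window.
    incidents = [
        (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
        (datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 3, 30)),
    ]
    print("MTTR:", mttr(incidents))
    print("MTBF:", mtbf(incidents, datetime(2024, 1, 1), datetime(2024, 1, 31)))
```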

66

Explain how you use logging and tracing to debug production issues.

Reference answer

I use structured logging to get context from various services and correlate events using request IDs. Distributed tracing tools visualize the path of a request across multiple services, helping identify where latency or errors are introduced within the system architecture.

67

How would you handle disagreement between product and engineering on the SLO target?

Reference answer

Look for diplomatic negotiation skills and the ability to use data-driven arguments to align stakeholders on realistic reliability targets.

68

What tools do you use for tracing in distributed systems, and why are they important?

Reference answer

Tools like Jaeger, Zipkin, or OpenTelemetry are used for distributed tracing. Tracing is important because it allows you to track the flow of requests across multiple services, helping to identify performance bottlenecks and failure points in complex architectures.

69

Can you explain the concept of “shift left” in DevOps, and how it applies to site reliability engineering?

Reference answer

The concept of "shift left" in DevOps refers to the practice of moving tasks earlier in the development cycle, aiming for early detection and resolution of issues. In the context of site reliability engineering, we apply the "shift left" principle by involving SREs right from the design and development stages of a project. This way, we can build reliability into the system from the outset and catch potential issues before they become system-wide problems.

70

What is a service-level indicator (SLI)?

Reference answer

A service-level indicator (SLI) is anything that can be accurately measured and used to help you think through, define, and assess whether you are meeting SLOs and SLAs. SLIs are frequently expressed as the proportion of good events to all events.
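As a minimal sketch, a ratio-style SLI can be computed as good events divided by total events and then compared with an SLO target. The function names and the choice to treat an empty window as fully compliant are assumptions made for the example.

```python
def availability_sli(good_events, total_events):
    """SLI as the fraction of good events among all valid events."""
    if total_events == 0:
        return 1.0  # assumption: a window with no traffic counts as compliant
    return good_events / total_events


def meets_slo(sli, slo_target):
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target
```

For example, 9,990 successful requests out of 10,000 give an SLI of 0.999, which exactly meets a 99.9% SLO.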

71

What is cloud computing?

Reference answer

- Cloud computing is the delivery of IT services, such as servers, storage, and software as a service (SaaS), through network-connected cloud infrastructure. The term can refer to both private clouds, which are operated for a single organization and shared among its internal users, and public clouds, which are owned by third parties (e.g., Amazon Web Services) that rent out computing power and storage capacity to companies or individuals on a subscription or pay-as-you-go basis. Cloud computing has the potential to transform IT infrastructure and delivery models across industries but faces challenges in terms of security and regulation.
- The “cloud” in “cloud computing” is a metaphor for the Internet and the networked computers and software reached through it. Cloud computing allows organizations to offload workloads from their data centers and focus more resources on applications and business processes. In addition, it enables them to create hybrid environments that combine elements of on-premises data centers with those hosted in cloud environments. This can be especially helpful for companies that need to scale quickly and want to reduce costs.
- Cloud computing also has the potential to revolutionize IT operations by allowing organizations to deliver IT services through a flexible, scalable model that reduces costs while improving service quality. For example, it can allow organizations to integrate legacy systems with newer ones (such as mobile applications), reduce complexity and risk by automating routine tasks, and streamline the management of remote assets. Cloud computing can also help organizations save money by paying for capacity as it is used instead of purchasing or leasing IT equipment up front.

72

What is 'chaos engineering' and how is it used in SRE?

Reference answer

Chaos engineering is the practice of intentionally injecting failures (e.g., killing servers, introducing latency) into a system to test its resilience and uncover weaknesses. SREs use it to validate that systems can handle unexpected failures, improve monitoring and incident response, and build confidence in system reliability. It is done in controlled experiments with a hypothesis and rollback plan to avoid real user impact.

73

Briefly describe a major reliability improvement you implemented and the outcome.

Reference answer

I implemented a proactive autoscaling policy for a critical API service that previously experienced latency spikes during traffic surges. By analyzing historical traffic patterns and using custom metrics (e.g., request queue depth), I configured autoscaling to trigger preemptively. The outcome was a 40% reduction in p99 latency during peak hours and zero downtime events during high-traffic periods, improving the service's SLO compliance from 99.5% to 99.9%.

74

Explain how you would create a runbook for a critical service and what it should include.

Reference answer

A runbook for a critical service should include: a description of the service and its dependencies, contact information for on-call teams, monitoring dashboards and alert definitions, step-by-step procedures for common incidents (e.g., high latency, service down), escalation paths, rollback procedures for deployments, and checklists for health checks. It should be kept up to date and tested regularly. I would collaborate with the team to document known issues and solutions, ensuring clarity and actionability.

75

How to find files modified in the last 7 days?

Reference answer

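```bash
find /path -type f -mtime -7
```

Here `-type f` restricts the search to regular files, and `-mtime -7` matches files whose contents were modified less than 7 days (7 × 24 hours) ago.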

76

What is a synthetic transaction, and how is it used in monitoring?

Reference answer

A synthetic transaction is a scripted sequence of interactions with a service that mimics real user behavior. It is used in monitoring to proactively check the availability and performance of services by simulating user actions.

77

How do you prioritize on-call tasks during a high-severity incident?

Reference answer

During a high-severity incident, I first confirm the incident and assess its impact on users and SLAs. I prioritize tasks by stabilizing the service (e.g., rolling back a deployment, scaling resources) over root cause analysis. I communicate status to stakeholders and escalate if needed. I use a structured triage approach: contain the issue to prevent further damage, then work on resolution. Post-incident, I document actions and initiate a blameless postmortem.

78

How do engineers share their work with product teammates in the QA phase? How many environments do you have?

Reference answer

These questions are incredibly important to me. They can surface red flags to discuss with your interviewer, show how receptive they are to your opinions, and give you an idea of the things you might be working on for them.

79

Explain the difference between TCP and UDP, including their use cases and trade-offs.

Reference answer

Expect answers to cover that: TCP (Transmission Control Protocol) is a connection-oriented protocol that ensures reliable and ordered delivery of a stream of bytes. It's beneficial for applications where data integrity is critical. UDP (User Datagram Protocol) is a connectionless protocol that offers faster transmissions but without guarantees on delivery or order. It's suitable for applications where speed is more critical than reliability, like streaming or gaming. Candidates might discuss trade-offs, noting how TCP's error correction mechanisms can introduce latency but ensure reliability, whereas UDP's lightweight nature can enhance performance but at the risk of data loss or out-of-order arrival.

80

What is the tech stack?

Reference answer

I'm looking at the question from an operations perspective. Are they using a hodgepodge of languages or is the development flow opinionated? How many different technologies does the team have to support?

81

What is Continuous Integration/Continuous Deployment (CI/CD) and how have you used it?

Reference answer

Continuous Integration/Continuous Deployment (CI/CD) is a modern development practice that involves automating the processes of integrating code changes and deploying the application to production. The goal is to catch and address issues faster, improve code quality, and reduce the time it takes to get changes live. I've implemented and utilized CI/CD pipelines in several of my past roles. In one instance, we used Jenkins as our CI/CD tool. For Continuous Integration, every time a developer pushed code to our repository, Jenkins would trigger a process that built the code, ran unit tests, and performed code quality checks. If any of these steps failed, the team would be instantly notified, enabling quick fixes. For Continuous Deployment, once the code passed all CI stages, it'd be automatically deployed to a staging environment where integration and system tests would run. If all tests passed in the staging environment, the code would then be automatically deployed to production. This ensured that we had a smooth, automated path from code commit to production deployment, leading to more efficient and reliable release processes.

82

How can you end all active processes in Linux?

Reference answer

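The Linux `kill` command sends a signal to a process. It can only act on processes that are currently running and that the user has permission to signal. To end all of them at once, pass -1 as the PID: `kill -TERM -1` asks every process the caller may signal to terminate, and `kill -KILL -1` (or `kill -9 -1`) forces them to stop. Run as root, this ends nearly every process on the system except init, so it should be used with great care.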

83

How do you manage changes in production systems?

Reference answer

Changes in production systems are managed through version control, automated testing, staged rollouts, monitoring, and having rollback plans in place to quickly revert changes if issues arise.

84

Describe your experience with containerization and orchestration tools like Docker and Kubernetes.

Reference answer

Skilled applicants will be proficient in containerization technologies like Docker and orchestration tools like Kubernetes, Docker Swarm, or Amazon ECS. Look for examples where candidates have successfully used these tools to improve deployment speed, reliability, and scalability. Candidates might also talk about container registries, continuous integration and continuous deployment (CI/CD), and managing containerized workloads at scale.

85

Describe a comprehensive approach to system performance monitoring. What tools would you use?

Reference answer

A comprehensive approach to system performance monitoring features a variety of tools, such as:

- System-level monitors like top, htop, vmstat
- Application performance monitoring (APM) tools
- Logging tools

Some more advanced solutions are:

- Prometheus for metric collection and alerting
- Grafana for dashboards
- ELK Stack (Elasticsearch, Logstash, Kibana) for log aggregation and visualization

Skilled candidates might also explain that monitoring should not just be reactive, i.e. fixing issues as they arise, but also proactive, identifying potential issues before they impact users.

86

What are cgroups (control groups), and how have you used them?

Reference answer

Skilled candidates will explain that cgroups (control groups) allow for the allocation, prioritization, and monitoring of system resources like CPU time, system memory, network bandwidth, or combinations of these resources among user-defined groups of tasks. They may describe past situations where they've used cgroups, for example to:

- Limit resource hogging by certain processes
- Ensure critical services have enough resources
- Manage containerized applications efficiently

87

Using Bash and cron, write a script that runs automatically on the last Friday of every month.

Reference answer

Use cron with a conditional check. Schedule the script to run every Friday (e.g., `0 0 * * 5`). Inside the script, check whether the current date is the last Friday of the month: since cron already guarantees the day of the week, it is enough to verify that adding 7 days lands in the next month. If so, execute the desired task. Example (GNU date): `if [ "$(date +%m)" != "$(date -d '+7 days' +%m)" ]; then ...; fi`
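The "add 7 days and see whether the month changes" check can be verified against known dates. The sketch below expresses the same logic in Python so it can be tested in isolation; the function name `is_last_friday` is hypothetical.

```python
from datetime import date, timedelta


def is_last_friday(day):
    """True if day is a Friday and the following Friday falls in a different month."""
    is_friday = day.weekday() == 4          # Monday is 0, so Friday is 4
    next_week = day + timedelta(days=7)
    return is_friday and next_week.month != day.month


if __name__ == "__main__":
    print(is_last_friday(date.today()))
```

The boundary case is a last Friday that falls exactly seven days before the first of the next month (for example 25 January 2019), which a strict "day is greater than" comparison would miss.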

88

Differentiate between process and thread.

Reference answer

| Process | Thread |
|---|---|
| A program under execution is known as a process. | A thread is a unit of execution within a process. |
| It takes more time to terminate. | It takes less time to terminate. |
| It takes more time to create and to context-switch. | It takes less time to create and to context-switch. |
| Communication between processes is less efficient, because it requires inter-process communication mechanisms. | Communication between threads is more efficient, because they share the memory of their process. |
| If one process is blocked, it does not affect the execution of other processes. | If one thread is blocked, it can affect the execution of other threads in the same process. |

89

What's your experience with disaster recovery and testing?

Reference answer

Disaster recovery planning is one of those things that feels abstract until you actually need it. We have a documented DR plan for each critical service—what to do if a region goes down, if the database is corrupted, if we get hacked. But the real test is game days. We run one or two per year where we actually simulate failures and practice our response. Last year, we simulated losing an entire region, and it exposed some gaps: our DNS failover wasn't automatic, and we had 20 minutes of downtime before we switched. We implemented automatic failover for DNS and reduced that to under 2 minutes. We also tested our backup restore process and found it took 6 hours—way too long for a critical service. We rearchitected our backup strategy and got it down to 30 minutes. The most important part of DR testing is that it's blameless. We don't use it to blame people who missed steps; we use it to improve our systems and documentation. It's also exposed that we need better communication protocols with external teams when a real disaster happens.

90

How do you manage configuration drift across multiple environments?

Reference answer

- Use Infrastructure as Code (IaC) tools like Terraform or Ansible to ensure consistent configurations.
- Implement version control (e.g., Git) for infrastructure and environment configurations.
- Regularly run configuration audits and apply changes automatically via CI/CD pipelines.
- Monitor configuration changes using tools like Chef Automate or Puppet.

91

Describe a situation where you implemented automation to improve reliability.

Reference answer

I identified a recurring issue with manual server configuration that led to frequent downtime. By implementing an automated configuration management tool, we reduced downtime by 70% and improved overall system reliability.

92

What are some of the common Linux kill commands?

Reference answer

Common Linux kill commands include 'kill' (sends a signal to a process by PID), 'killall' (kills processes by name), 'pkill' (kills processes based on pattern matching), and 'xkill' (graphically terminate windows). The default signal is SIGTERM, but other signals like SIGKILL can be specified.
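The usual escalation pattern (SIGTERM first, SIGKILL only if the process does not exit) can be sketched with Python's `subprocess` module. This assumes a POSIX system; the helper name `stop_process` and the 5-second grace period are choices made for the example.

```python
import signal
import subprocess
import sys


def stop_process(proc, grace_seconds=5.0):
    """Send SIGTERM, wait for a clean exit, and fall back to SIGKILL if needed."""
    proc.send_signal(signal.SIGTERM)         # same effect as `kill <pid>`
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()                           # same effect as `kill -9 <pid>` on POSIX
        return proc.wait()


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    code = stop_process(child)
    print("exit status:", code)  # on POSIX, a negative value is the signal number
```

Starting with SIGTERM gives the process a chance to flush buffers and release resources; SIGKILL cannot be caught or handled, so it is kept as the last resort.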

93

What is an Error Budget?

Reference answer

An error budget is the maximum acceptable downtime or failure rate for a service, calculated directly from the SLO. It allows teams to balance feature development against reliability work; exceeding it shifts focus to reliability.
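The arithmetic behind an error budget is short enough to show directly. This sketch assumes an availability SLO below 100% measured over a rolling window of days; the function names and the 30-day default are illustrative.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative once it is exceeded)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget
```

A 99.9% SLO over 30 days allows (1 - 0.999) × 43,200 minutes, which is about 43.2 minutes of downtime. After 21.6 minutes of downtime, half the budget remains; once the remaining fraction goes negative, the focus shifts to reliability work.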

94

How do you prioritize incidents in a production environment?

Reference answer

I prioritize incidents based on their impact on users and business operations, ensuring that critical issues are addressed first. By using predefined criteria and SLAs, I can categorize and manage incidents effectively, keeping stakeholders informed throughout the process.

95

Explain CDN.

Reference answer

A CDN (Content Delivery Network) is a geographically distributed network of servers that caches content and serves it to clients from a location close to them. These edge servers are typically located in data centres around the world, and they improve performance by reducing latency and by offloading traffic from the origin server. CDNs are most commonly used for static content, such as images, videos, CSS, and JavaScript, but they can also be used to accelerate dynamic content. Because content is served from multiple locations, a CDN also adds redundancy: if one edge location is unavailable, requests can be served from another. A CDN can be used in many different ways, including:

- Caching static content close to users.
- Accelerating the delivery of dynamic content.
- Serving content from multiple locations and data centers.
- Providing redundancy and absorbing traffic spikes in front of origin infrastructure.

CDNs are an important part of the Internet infrastructure because they allow content to be stored and distributed more efficiently, so that users in different regions get comparable access to the same content.

96

Name some other data structures.

Reference answer

Queue, stack, heap, hash table, binary tree, etc. Depending on your needs, this could be followed up with a question about data algorithms.

97

Describe the three-way handshake.

Reference answer

- Client sends SYN.
- Server responds with SYN-ACK.
- Client sends ACK.

98

How do you handle secrets management in a secure way?

Reference answer

SREs use dedicated secrets management tools (e.g., HashiCorp Vault, AWS Secrets Manager, Kubernetes Secrets) to store and rotate secrets like passwords, API keys, and certificates. Secrets are never hard-coded or committed to version control. Access is controlled via policies, and secrets are injected into applications at runtime (e.g., via environment variables or sidecar mounts). Regular rotation and audit logs are essential for security.

99

What's the difference between SLI, SLO, and SLA?

Reference answer

- SLI (Service Level Indicator): A measurable metric (e.g., request latency, error rate).
- SLO (Service Level Objective): The target value for an SLI (e.g., 99.95% uptime).
- SLA (Service Level Agreement): A contractual commitment with penalties if SLOs are violated.

100

What is the main concern of companies regarding software operations?

Reference answer

1. Availability and reliability problems after launch.
2. The operational cost of running software, which is a significant concern for many companies.
3. Lack of harmony and friction between development and operations teams.
4. Organizational silos between development and operations.

101

List a few automation tasks that you have completed.

Reference answer

Automation tasks I have completed include:

- Automating server backups using Bash scripts with passwordless SSH for secure transfers.
- Creating CI/CD pipelines with Jenkins to build, test, and deploy applications automatically.
- Using Terraform to provision and manage cloud infrastructure as code.
- Writing Ansible playbooks for configuration management and application deployment across multiple servers.
- Automating Docker image builds and tagging as part of the development workflow.

102

What is 2FA?

Reference answer

Two-factor authentication (2FA) is the use of any two independent methods from the available authentication methods, for example something you know (a password) combined with something you have (a phone or hardware token). It is used to verify that a user is who they claim to be before granting access to secure systems, and so to increase security. Two-factor authentication is commonly applied to laptops and other mobile devices because of the inherent security risks of mobile computers. With two-factor authentication, it becomes more difficult for unauthorized users to use a mobile device to access secure data or systems.

103

What is the importance of having both DevOps engineers and SREs?

Reference answer

Having both DevOps engineers and SREs is important because the roles complement each other: DevOps engineers help implement DevOps principles, while SREs ensure the system's reliability and stability.

104

Explain the concept of 'defense in depth' in security.

Reference answer

'Defense in depth' is a layered security approach where multiple security measures are implemented to protect data and systems. If one layer fails, others still provide protection.
