Common SRE Interview Questions and Answers

1

A critical service is experiencing high latency. Walk me through your troubleshooting process.

Reference answer

Strong candidates start with impact assessment before diving into root cause analysis, demonstrating structured thinking under pressure.

2

What is the importance of redundancy in SRE?

Reference answer

Redundancy ensures that there are multiple instances of critical components, reducing the risk of a single point of failure and improving overall system reliability.

3

What is DNS?

Reference answer

The Domain Name System (DNS) is a service that translates hostnames into IP addresses. It is a distributed database implemented as a hierarchy of name servers, and it is also an application-layer protocol for message exchange between clients and servers. The Internet as we use it depends on DNS to function.

4

How have you fostered a culture of reliability within your teams?

Reference answer

At AWS, I implemented a 'blameless post-mortem' policy after incidents, encouraging teams to analyze failures without fear of repercussions. I also established a monthly reliability training session that included cross-team participation. Over time, we saw a 30% reduction in recurring incidents, illustrating how a culture of transparency and learning fosters better reliability.

5

How do you integrate the customer experience into your SRE strategy?

Reference answer

I define SLIs and SLOs based on customer-facing metrics like page load time and error rates, monitor user journeys, gather feedback from customer support, and prioritize reliability improvements that directly impact user satisfaction. I also use error budgets to ensure reliability investments are aligned with customer needs.

6

How do you manage configuration drift?

Reference answer

Configuration drift is managed through automation, regular audits, and using IaC to ensure consistent configurations across environments.

7

How do you ensure that your automation scripts are maintainable and scalable?

Reference answer

Look for answers that mention the importance of readability, modularity, and reusability of code when discussing automation scripts and how to maintain them. For this, site reliability engineers might:
- Use version control for scripts
- Document the code and its purpose
- Apply consistent naming conventions
- Break down scripts into smaller, manageable functions or modules

To ensure scalability, they would need to create scripts that can handle variable loads and environments dynamically. Candidates might also explain how they've used parameters, environment variables, or configuration files to adapt scripts to different scenarios and needs. Insights into testing strategies, such as unit tests or integration tests for automation scripts, are a plus.

8

Scenario: A new application release has caused increased latency across multiple services. What steps would you take to diagnose and resolve the issue?

Reference answer

- Check the release logs for configuration or code changes that may have caused the issue.
- Analyze latency metrics using APM tools (e.g., Datadog, New Relic) to find where the bottlenecks occur.
- Check dependency services (e.g., databases, external APIs) for potential slowdowns.
- Roll back the deployment if the problem persists and investigate further in a non-production environment.
- Review resource usage to ensure adequate CPU, memory, and network resources.

9

Explain three-tier architecture along with its real-world uses.

Reference answer

- A three-tier architecture separates an application into three layers: a presentation tier (the user interface), an application tier (the business logic), and a data tier (storage and retrieval). It is used in a wide range of business applications, including CRM, e-commerce, and enterprise resource planning (ERP).
- It is often chosen when an application handles many different types of data, such as customer data and product data. Because data access is confined to the data tier, the data is easier to manage and maintain without touching the interface or the business logic.
- It can also help with monitoring IT systems. Each tier has its own distinct purpose, so it is easier to keep track of what is happening within each tier and to detect problems that might otherwise have gone unnoticed.
- In addition, it gives better visibility into how the tiers work together. For example, if you need to troubleshoot an issue with your company's website, you can examine each tier separately to narrow down where the fault lies.

10

How would you design a highly available service?

Reference answer

Start with redundancy (multiple nodes/regions), load balancers, auto-scaling, health checks, monitoring, and data replication. Use design patterns like failover, circuit breakers, and graceful degradation.

11

Describe a time you improved system reliability through changes in monitoring or automation.

Reference answer

At a technology startup, we faced frequent outages due to a lack of automated monitoring. I spearheaded the implementation of a comprehensive monitoring solution using Prometheus and Grafana. This change reduced our downtime by 70% within three months and improved our incident response time significantly. I learned the importance of cross-team communication in driving successful change.

12

Tell me about a recent/interesting project you worked on

Reference answer

I led the migration of a critical microservice from a monolithic deployment to Kubernetes, reducing deployment time by 80% and improving scalability. I designed the containerization strategy, implemented CI/CD pipelines, and set up monitoring with Prometheus and Grafana. The project involved coordinating with multiple teams and troubleshooting issues like network policies and resource limits, resulting in a 99.9% uptime post-migration.

13

Explain the concept of infrastructure as code (IaC).

Reference answer

IaC is the practice of managing and provisioning infrastructure using machine-readable configuration files, ensuring consistency, and enabling automation.

14

What is the time complexity to complete the operation 'x in l' if l is a list?

Reference answer

For a list, the 'x in l' operation has O(n) time complexity, where n is the number of elements. It performs a linear search through the list until it finds the element or reaches the end. For a set or dictionary, the same operation is O(1) on average.
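As a rough illustration of why the list case is linear, the sketch below spells out the scan that `x in l` effectively performs and counts the elements it visits. The helper name `linear_contains` is hypothetical and exists only for this example.

```python
def linear_contains(items, target):
    """Scan items left to right, as list membership does, counting visited elements."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            return True, steps
    return False, steps


numbers = list(range(1000))

# Best case: the target is the first element, so one element is visited.
first_hit = linear_contains(numbers, 0)

# Worst case: the target is last or absent, so all n elements are visited.
last_hit = linear_contains(numbers, 999)
miss = linear_contains(numbers, -1)

# A set hashes the target and probes for it directly instead of scanning.
found_in_set = 999 in set(numbers)
```

The step count grows with the position of the target, which is the O(n) behaviour described above; the set lookup does not depend on where the element was inserted.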

15

What is DHCP, and why is it used?

Reference answer

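DHCP (Dynamic Host Configuration Protocol) dynamically assigns IP addresses and related network settings, such as the subnet mask, default gateway, and DNS servers, to devices. It is used to reduce manual configuration errors.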

16

Can you define the term 'Inode'?

Reference answer

An inode is a data structure in Unix-like file systems that stores metadata about a file, such as its size, permissions, ownership, and pointers to the data blocks on disk, but not the filename. Each inode has a number that is unique within its file system, and hard links to the same file share that inode.
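To make the separation between filenames and inodes concrete, here is a small sketch. It assumes a Unix-like file system that supports hard links; the function name `inode_demo` and the file names are made up for the example.

```python
import os
import tempfile


def inode_demo():
    """Create a file and a hard link to it.

    Returns (inode of the file, inode of the link, link count of the file).
    """
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "original.txt")
        alias = os.path.join(workdir, "alias.txt")
        with open(original, "w") as handle:
            handle.write("hello\n")
        os.link(original, alias)  # a second directory entry for the same inode
        original_stat = os.stat(original)
        alias_stat = os.stat(alias)
        return original_stat.st_ino, alias_stat.st_ino, original_stat.st_nlink
```

Both names should report the same `st_ino`, and the link count should be 2, because the two names are just directory entries pointing at one inode.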

17

What is a shadow deployment?

Reference answer

A shadow deployment involves deploying a new version of a service alongside the current version and mirroring the live traffic to it without affecting the production traffic. This helps in validating the new version under real-world conditions without impacting users.

18

How to scrape metrics for a new application? Can you explain with an example?

Reference answer

To scrape metrics for a new application, you typically expose an HTTP endpoint (e.g., /metrics) that returns metrics in a format the monitoring system can parse (e.g., Prometheus format). For example, if you have a Python Flask application, you can use the prometheus_flask_exporter library to expose metrics. Then configure your Prometheus server to scrape that endpoint by adding a job to the prometheus.yml configuration file with the target application's address and port. The metrics are then collected and stored in Prometheus for querying and alerting.
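A minimal sketch of the exposing side is shown below. It deliberately uses only the Python standard library rather than `prometheus_flask_exporter`, so the metric names, the `COUNTERS` dictionary, the `render_metrics` helper, and port 8000 are all assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real application would update these as it works.
COUNTERS = {"app_requests_total": 0, "app_errors_total": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name in sorted(counters):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {counters[name]}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given port (8000 is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

With something like this running, the Prometheus side would be a job under `scrape_configs` in prometheus.yml whose target is the application's host and port. In practice an official client library is usually the better choice, since it also handles labels, histograms, and thread safety.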

19

How do you forecast future needs for a system?

Reference answer

Forecasting future needs for a system primarily relies on historical data analysis and understanding the business trajectory. One method I utilize is trend analysis. By monitoring usage patterns, load on the server, storage requirements, and resource utilization over time, I can spot trends and extrapolate them into the future. Tools like Prometheus and Grafana have been significantly helpful for resource trend analysis. Also, close collaboration with the product and business teams is essential. Understanding the product roadmap, upcoming features, and expected growth in user base or transaction volume can significantly impact system requirements. For example, if the business is planning to expand into new markets, we need to prepare for increased traffic and potentially more distributed traffic. For scaling infrastructure, I often utilize predictive auto-scaling features available on cloud platforms. These services can automatically adjust capacity based on learned patterns and predictions. These combined strategies provide a good estimate of future requirements and allow us to plan for system adjustments proactively, rather than reactively.

20

What is cloud computing?

Reference answer

Common answers are "using someone else's computer" or running services on equipment in someone else's data center. Follow up with a question about why companies use any of the various cloud platforms (save money, offload maintenance, etc.).

21

What are some key metrics for measuring the performance of a microservices architecture?

Reference answer

Key metrics include latency, throughput, error rates, request rates, and resource utilization (CPU, memory). These metrics help in understanding the performance and health of individual services and the overall system.

22

How do you approach patch management and system updates in production?

Reference answer

- Automation: Use configuration management tools like Chef, Puppet, or Ansible to automate patching across environments.
- Testing: Apply patches first in staging environments and validate before rolling out to production.
- Rolling updates: Perform rolling updates to minimize downtime and ensure that services remain available during patches.
- Monitor system health post-patch to ensure no degradation in performance.

23

What is TCP?

Reference answer

Transmission Control Protocol (TCP) is one of the main protocols of the Internet protocol suite. It sits at the transport layer, between the application and network layers, and provides reliable, ordered delivery. It is a connection-oriented protocol that supports the exchange of messages between devices over a network.

24

What is the difference between consistency, availability, and partition tolerance in the CAP theorem?

Reference answer

- Consistency: Every read receives the most recent write (or an error).
- Availability: Every request receives a non-error response, even if it does not reflect the most recent write.
- Partition Tolerance: The system continues to operate even if there is a network partition (communication failure between nodes).

A distributed system can provide only two of the three guarantees (Consistency, Availability, Partition Tolerance) at the same time. Because network partitions cannot be ruled out in practice, the real choice during a partition is between consistency and availability, so SREs must design systems to balance these properties based on business needs.

25

What are the key principles of designing for failure?

Reference answer

Designing for failure assumes components will fail and focuses on building resilience. Key principles include: redundancy (multiple instances), graceful degradation (partial functionality if some parts fail), statelessness where possible, asynchronous communication with retries and backoff, circuit breakers to prevent cascading failures, and comprehensive monitoring to detect failures quickly. This approach ensures the system remains available and reliable despite individual component failures.
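One of the principles above, retries with backoff, can be sketched as follows. This is an illustrative helper under stated assumptions, not a production implementation: the name `retry_with_backoff`, its default values, and the jitter scheme are choices made for this example.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, jitter=random.random):
    """Call operation until it succeeds, backing off exponentially between failures.

    The delay before retry k (starting at 0) is base_delay * 2**k, capped at
    max_delay and scaled by a random factor in [0.5, 1.0) so that many clients
    do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * (0.5 + jitter() / 2))
```

The `sleep` and `jitter` parameters are injectable so the behaviour can be tested without real waiting. A real caller would normally also restrict which exceptions count as retryable.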

26

What is an inode?

Reference answer

An inode is a data structure in Unix/Linux that contains metadata about a file. Some of the items contained in an inode are:
- mode
- owner (UID, GID)
- size
- atime, ctime, mtime
- ACLs
- a list of the blocks where the data is stored

The filename is not stored in the file's own inode. It is stored in the parent directory's entries, which map names to inode numbers.

27

What is a 'circuit breaker' pattern and when would you use it?

Reference answer

The circuit breaker pattern prevents a service from repeatedly making failing requests to an external dependency. It monitors for failures and, after a threshold, 'opens' the circuit to stop further calls, allowing time for recovery. It is used to avoid cascading failures, resource exhaustion, and prolonged latency in distributed systems. Once the dependency recovers, the circuit closes again, often with a half-open state for testing.
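The closed, open, and half-open states described above can be sketched in a few lines. The class below is a simplified illustration; the names `CircuitBreaker` and `CircuitOpenError`, the thresholds, and the single-trial half-open behaviour are assumptions made for this example rather than any particular library's API.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("dependency calls are suspended")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call in half-open, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of waiting on a dependency that is known to be unhealthy. This sketch is not thread-safe, which a real implementation would need to address.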

28

What is the role of blameless post-mortems?

Reference answer

To identify systemic issues (not individual blame) and implement preventive measures. Example: If a deployment fails due to missing tests, the fix might involve improving CI/CD pipelines.

29

How would you deploy an application to AWS?

Reference answer

When a company wants to use AWS to deploy its workloads, it needs to set up a landing zone (https://docs.aws.amazon.com/prescriptive-guidance/latest/migration-aws-environment/understanding-landing-zones.html). Here, you will set up VPCs, including NAT gateways, Internet gateways, Security Groups, etc. If you want to deploy an application on a VM, that is, an EC2 instance, then you will need to provision it. Depending on the kind of application, it will either have internet access or not, so you will have to choose the right VPC and, within the VPC, the right subnet (public or private) and the Security Group. Also, if an application needs autoscaling, you may use an Auto Scaling group, attach it to a Load Balancer, and point a DNS record in your hosted zone (for example, in Route 53) at the LB's DNS name. Once the instance is up, you must install application dependencies (here I am talking about a monolithic application) and deploy the application…

30

Which monitoring systems have you worked with and how have you used them?

Reference answer

I've worked with several monitoring systems in my career, including Nagios, Prometheus, and Grafana. These tools have allowed me to monitor a wide range of metrics. Nagios, which I used earlier in my career, was primarily for monitoring system health. It kept an eye on key metrics like CPU usage, disk usage, memory usage, and network bandwidth. It was an excellent tool for generating alerts when any of these metrics crossed a predefined threshold. More recently, I've used Prometheus and Grafana. Prometheus collects metrics from monitored targets by scraping their HTTP metrics endpoints. We used it to collect a wide variety of metrics, including system metrics similar to those from Nagios, application performance metrics, request counts, and error counts. Grafana was used to visualize the metrics collected by Prometheus. We built different Grafana dashboards for different requirements, including system-level monitoring, application performance monitoring, and business-level monitoring. Grafana's alerting features enabled us to set up customizable alerts based on these metrics, which in turn helped us proactively identify potential problems and act on them promptly.

31

What are containers on a server?

Reference answer

Containers are self-contained software packages that bundle an application with its dependencies so that it runs consistently across environments. They virtualize at the operating-system level, sharing the host kernel, and can run in various settings, including private data centers and public clouds. Docker is a common containerization tool.

32

What is Google Slash Resources?

Reference answer

Google Slash Resources offers access to books published by Google or O'Reilly, as well as a course called Site Reliability Engineering: Measuring and Managing Reliability. It provides individuals with the resources and knowledge they need to prepare for certification.

33

What is DNS?

Reference answer

DNS stands for Domain Name System. It is a mechanism that converts hostnames to IP addresses so that, when you type a website address into your browser, the right server can be located. The DNS associates each domain name with one or more records, such as A or AAAA records holding IP addresses, and resolvers look these records up on the client's behalf.

34

What is ICMP?

Reference answer

Internet Control Message Protocol (used by `ping` for connectivity checks).

35

What is ARP?

Reference answer

ARP stands for Address Resolution Protocol. It enables communication between devices on a local network by letting a device discover the MAC address that corresponds to a known IP address on the same network. ARP does not assign IP addresses (that is the job of DHCP); it dynamically maps IP addresses to MAC addresses so that frames can be delivered to the right device.

36

What is an error budget?

Reference answer

An error budget is the amount of unreliability, such as downtime or failed requests, that a service can accumulate without upsetting its users; equivalently, it is the margin of error permitted by the service level objective (SLO). For example, a 99.9% availability SLO leaves an error budget of 0.1%. It encourages teams to minimize actual incidents and maximize innovation by taking risks within acceptable limits. An error budget policy is used to track whether the company is meeting its reliability commitments for the system or service, and it prevents the team from pursuing too much innovation at the expense of the system or service's reliability.
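The arithmetic behind an error budget is simple enough to sketch. The example below assumes an availability SLO expressed as a fraction over a rolling window measured in days; the function names are hypothetical.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of full unavailability allowed by an availability SLO over a window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, window_days, downtime_minutes):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget
```

For instance, a 99.9% target over 30 days allows about 43.2 minutes of downtime, so 21.6 minutes of downtime would leave roughly half of the budget.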

37

What are Vertical and Horizontal Scaling? Which is more preferable? And list some advantages and disadvantages of Horizontal Scaling.

Reference answer

- Vertical scaling increases the capacity of a single system by giving it more resources, typically more CPU, memory, or storage on one server. It is often used to increase capacity, performance, and throughput. It is also called scaling up, because the size of the individual system grows.
- Horizontal scaling increases capacity by adding more instances. This can be done by adding more virtual machines or containers per host, or by adding more hosts altogether. It is also called scaling out, because the number of systems grows.

Horizontal scaling is generally preferable, because capacity can keep growing as load increases over time instead of stopping at the limit of a single machine.

Advantages of horizontal scaling (scale-out):
- It requires less upfront investment.
- It reduces operational overhead.
- It allows for easier scaling as demand increases.

Disadvantages:
- It requires careful planning and coordination between all parties involved, which can be a challenge in large multi-tenant environments where different tenants have different needs and requirements. It can also increase complexity and security risk if not done carefully.
- It can lead to scalability problems if one component causes issues for multiple other components, so it is important to monitor each component closely throughout the process.

38

How do you measure success as a Site Reliability Engineer?

Reference answer

Success as a Site Reliability Engineer can be measured by a combination of tangible metrics and less tangible improvements within a team or organization. On the metrics side, quantifiable items like uptime, system performance, incident response times are critical. If the system has high uptime, fast and consistent performance, and if incidents are rare and quickly resolved when they do occur, these are indicators of effective SRE work. On the other hand, success can also be gauged through process improvements and cultural changes. For example, implementing productive processes for post-mortems, where incidents are dissected and learned from in a blameless manner, improving communication between engineering teams, promoting a culture of reliability and performance across the organization, etc. In essence, if a Site Reliability Engineer can maintain a smooth, reliable, and efficient system while helping to foster a culture of proactive and thoughtful consideration for reliability, scalability, and performance features, they can be considered successful in their role.

39

How do you establish SLOs and SLIs, and are you open to making adjustments to these when warranted?

Reference answer

I establish SLOs and SLIs by collaborating with stakeholders to identify critical user journeys, defining measurable indicators like latency or error rate, and setting realistic targets based on historical data and business needs. I am open to adjustments when data shows the SLO is too strict or too lenient, or when user expectations or system behavior changes.

40

What's the difference between SRE and DevOps?

Reference answer

SRE is an implementation of DevOps with a stronger engineering focus. While DevOps is a cultural philosophy about collaboration between dev and ops, SRE applies software engineering principles to solve ops problems, using SLIs/SLOs, automation, and error budgets.

41

What is black-box monitoring?

Reference answer

Black box monitoring is a type of application monitoring that focuses on an application's external behavior without needing access to its source code.

42

What is a 'snowflake server' and why is it problematic?

Reference answer

A 'snowflake server' is a server that has unique, manually configured settings or state, making it difficult to replicate or replace. It is problematic because it increases toil, complicates scaling and recovery, and introduces inconsistency. SREs avoid snowflakes by using infrastructure-as-code, immutable infrastructure (where servers are replaced, not modified), and configuration management tools to ensure all servers are identical and disposable.

43

Name three types of databases and an example of each. Name some you have used.

Reference answer

They must name relational databases as one of the types, like MySQL, Postgres, Oracle and so on. After that, we are looking for what sorts of other databases they may know of or have familiarity working with. The candidate should be able to describe the difference between each type they name. Here are some examples:
- Key/value stores: BerkeleyDB, Cassandra, etcd, Memcached and MemcacheDB, Redis, Riak
- Document stores: CouchDB, MongoDB
- Wide column stores: BigTable, HBase
- Graph stores: FlockDB, Neo4j, OrientDB

44

What is the Principle of Least Privilege?

Reference answer

Grant users and roles only the minimum permissions they need to do their job.

45

Tell me about a time you managed multiple critical incidents simultaneously.

Reference answer

In my previous role at Singtel, we encountered three critical outages simultaneously. I quickly assessed the impact on customer experience for each incident and prioritized the one affecting our core services. I communicated with management and the affected teams, ensuring everyone was aligned. By using our incident management tool, I dispatched resources effectively, reducing resolution time by 40% across all incidents.

46

Tell me about the differences between process and thread in the context of site reliability engineering.

Reference answer

A process is an independent program execution unit with its own memory space, while a thread is a lightweight unit within a process that shares the same memory space. In SRE, threads are more efficient for concurrent tasks within a single service, but require careful synchronization. Processes provide better isolation, which is critical for reliability and fault tolerance.

47

How do you handle dynamic scaling of a stateless vs. stateful service in Kubernetes?

Reference answer

- Stateless services: For stateless applications, horizontal scaling is straightforward using Horizontal Pod Autoscaler (HPA) based on CPU, memory, or custom metrics. Pods can be added or removed without affecting the system's state.
- Stateful services: For stateful applications (e.g., databases, message brokers), scaling requires careful coordination of storage and state. Use StatefulSets in Kubernetes to manage stable network identities and persistent volumes for each pod. Scaling stateful services involves replication and coordination to maintain data consistency.
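For the stateless case, the core of the HPA decision is a ratio: the desired replica count is the current count scaled by current metric over target metric, rounded up. The sketch below reproduces only that rule. It leaves out the tolerance band and stabilization behaviour that the real controller applies, and the function name and default bounds are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count from the HPA scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 pods averaging 90% CPU against a 60% target, the rule asks for 6 pods; with 3 pods at 50% against a 100% target, it asks for 2.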

48

How is SRE related to DevOps?

Reference answer

SRE might be considered an implementation of DevOps. Like DevOps, SRE is about team culture and relationships. Connecting the dots between the dev and ops teams is a goal shared by SRE and DevOps.

49

How do you prioritize tasks and incidents in SRE?

Reference answer

Tasks and incidents are prioritized based on their impact on service reliability, user experience, and business objectives. Critical incidents affecting SLOs are usually prioritized higher.

50

Write a script to count 'ERROR' lines in a log file.

Reference answer

`grep "ERROR" app.log | wc -l`
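If the interviewer prefers a script over a one-liner, the same count can be sketched in Python. The function name `count_error_lines` is hypothetical. Like `grep`, it counts lines that contain the string, not total occurrences.

```python
def count_error_lines(path, needle="ERROR"):
    """Count lines containing needle, reading the file one line at a time."""
    # errors="replace" keeps the scan going if the log contains invalid UTF-8 bytes.
    with open(path, "r", encoding="utf-8", errors="replace") as log:
        return sum(1 for line in log if needle in line)
```

Reading line by line keeps memory use flat even for very large log files.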

51

What's the difference between scaling up and scaling out?

Reference answer

- Scaling Up (Vertical Scaling): Increasing the capacity of an existing server.
- Scaling Out (Horizontal Scaling): Adding more servers to distribute the load.

52

Describe to me how you balance the interests of different stakeholders in the organization.

Reference answer

I balance interests by facilitating transparent discussions, using data and SLOs to align priorities, understanding each stakeholder's goals (e.g., feature velocity vs. stability), and proposing trade-offs that maximize overall value. Regular communication and shared metrics help ensure fair decision-making.

53

What is the importance of monitoring and alerting for managing cloud services effectively?

Reference answer

Monitoring and alerting are crucial for investigating and understanding the situation, predicting the cost of running your service, and detecting regressions.

54

How would you implement a sorted hash table in C?

Reference answer

A sorted hash table can be implemented by combining a hash table for key-value storage with a sorted data structure like a balanced binary search tree (e.g., AVL or Red-Black tree) or a sorted linked list. The hash table provides O(1) average lookups, while the sorted structure maintains order. On insertion, the key-value pair is stored in the hash table and also inserted into the sorted structure. For iteration, you traverse the sorted structure. Trade-offs include memory overhead and slower insertions due to sorting.
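The structure described above is sketched below in Python rather than C, to keep it short, and it swaps the balanced tree for a sorted array maintained with binary search. The class name `SortedHashTable` and its methods are hypothetical. A C version would pair its own hash table with a tree or sorted list in the same way.

```python
import bisect


class SortedHashTable:
    """Dict for O(1) average lookups plus a sorted key list for ordered iteration."""

    def __init__(self):
        self._values = {}
        self._sorted_keys = []

    def put(self, key, value):
        if key not in self._values:
            # Binary search finds the slot in O(log n); the list insert itself is O(n).
            bisect.insort(self._sorted_keys, key)
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)

    def delete(self, key):
        del self._values[key]  # raises KeyError if the key is absent
        index = bisect.bisect_left(self._sorted_keys, key)
        self._sorted_keys.pop(index)

    def items(self):
        """Yield (key, value) pairs in ascending key order."""
        for key in self._sorted_keys:
            yield key, self._values[key]
```

The trade-off mentioned above shows up directly: lookups stay O(1) on average, while each new key pays an extra cost to keep the ordering. A balanced tree would reduce that insertion cost to O(log n).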

55

What is autoscaling, and how does it benefit reliability?

Reference answer

Autoscaling automatically adjusts the number of running instances based on current demand. This ensures that resources are available to handle increased loads, improving reliability and performance during peak times while reducing costs during low demand periods.

56

What is Service Level Indicator (SLI)?

Reference answer

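An SLI (service level indicator) is a quantitative measure of some aspect of the service, such as availability, latency, or error rate. It tells you at any moment in time how well your service is doing and whether it is performing acceptably.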
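As a hedged illustration, a common way to express an SLI is the ratio of good events to total events. The sketch below computes a latency SLI from a list of request latencies; the function names and the 300 ms threshold used in the note are assumptions for the example.

```python
def latency_sli(latencies_ms, threshold_ms):
    """Fraction of requests served within threshold_ms (1.0 if there were no requests)."""
    if not latencies_ms:
        return 1.0
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, slo_target):
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target
```

For example, with latencies of 120, 80, 310, and 95 ms and a 300 ms threshold, three of four requests are good, so the SLI is 0.75, well short of a 0.99 target.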

57

How would you determine what else might be affected if this service is compromised?

Reference answer

This tests whether candidates recognize that the hardest part of real incidents is quickly stitching together identity, network reachability, and workload context to understand blast radius.

58

What activities can reduce toil?

Reference answer

Activities that can reduce toil include creating external automation, creating internal automation, and enhancing the service so that it does not require maintenance intervention.

59

How do you handle capacity planning and scaling for high-traffic applications?

Reference answer

- Use load balancing: Distribute requests across multiple servers, optimizing resource utilization.
- Cache frequently accessed data: Improve response time and scalability by storing frequently accessed data.
- Automate testing: Stress test your system to identify bottlenecks in performance. Use continuous integration tools (like GitHub Actions) to automatically test code changes.
- Monitor systems: Utilize APM tools to provide real-time insight into your system's performance and health to detect issues before they affect customers.
- Design your system to scale: Scale your system up or down to meet the needs of traffic to maintain performance.

60

What are the key areas of focus for DevOps?

Reference answer

The key areas of focus for DevOps are reducing organizational silos, accepting failure as normal and planning for it, implementing gradual changes, reducing human error through automation, and measuring success in all areas. DevOps aims to break down silos between development, architecture, and operations.

61

What's your process for conducting a blameless postmortem?

Reference answer

A blameless postmortem is essential for identifying the root cause of incidents without assigning blame. Here's my process:
- Gather Data: I start by collecting logs, metrics, and any available data from monitoring systems to understand the timeline of the incident.
- Timeline Reconstruction: We work to reconstruct the event, from the first signs of failure to resolution, ensuring no details are overlooked.
- Root Cause Analysis: Using methods like 5 Whys or Fishbone Diagrams, I facilitate a collaborative discussion to identify the root cause without blaming individuals.
- Actionable Insights: We focus on creating actionable insights to prevent similar incidents in the future, which include improving processes, automation, or monitoring.
- Share Learnings: Postmortems are shared with relevant teams, encouraging continuous improvement and knowledge sharing.

This approach ensures that the team learns from incidents and takes steps to prevent recurrence.

63

Explain the concept of a 'blast radius' in incident management.

Reference answer

The blast radius is the extent of damage or impact an incident or change can cause. SREs design systems and processes to minimize blast radius, e.g., using microservices to isolate failures, implementing circuit breakers, and using gradual rollouts. In incident response, reducing the blast radius is a key priority, achieved through containment strategies like isolating affected components or redirecting traffic.

64

How do you implement blue-green deployments in Kubernetes?

Reference answer

Blue-green deployments in Kubernetes can be implemented by creating two separate environments (blue and green) using deployments and services. Traffic is routed to the blue environment while the green environment is updated. Once validated, traffic is switched to the green environment, and the blue environment is kept as a fallback.

65

Describe the functions of an ideal DevOps team.

Reference answer

The functions of an ideal DevOps team can't be precisely defined, but broadly the team bridges the development and operations departments and contributes to continuous delivery. The ideal DevOps team combines software development and IT operations to improve productivity, speed, and dependability across the software delivery lifecycle. Its responsibilities include continuous integration, automated testing, deployment automation, monitoring, and cultivating an environment of communication and cooperation between the development and operations teams.

66

What is RAID?

Reference answer

- RAID stands for “Redundant Array of Independent Disks”. It describes a storage system that combines more than one disk to provide redundancy in case one disk fails, improved performance, or both. RAID is commonly used in networks and server farms.
- RAID systems are routinely used in data centres. In a mirrored configuration (RAID 1), for example, a second disk holds a copy of the first, so if the first disk fails, the user can continue working from the second. This extra protection guards against data loss from a single drive failure.
- RAID can be implemented with a single controller managing multiple drives or with multiple controllers connected to each other, each housing a single drive. The resulting configuration can be optimized for throughput or for redundancy.
- RAID storage is available from many vendors and can be found in medium-sized and large-scale enterprise environments, where it is essential for ensuring the availability of critical data.

67

Define an SLO for an internal API consumed by three downstream services.

Reference answer

Interviewers look at whether you consider the downstream consumers' SLOs when setting the upstream target. Setting the SLO in isolation, without discussing dependency chains or cascading failure risk, is a common place where candidates lose points.

68

Describe a time when you improved the reliability of a legacy system. What steps did you take?

Reference answer

Improving legacy system reliability is challenging but achievable with a structured approach. Here's how I did it:
- Assessment & Planning:
  - Conducted a root cause analysis to identify recurring issues and bottlenecks. The system had high latency, frequent downtime, and lacked automation.
- System Modernization:
  - Containerized the legacy application using Docker, which improved scalability and simplified deployment.
  - Replaced monolithic components with microservices where applicable, improving fault isolation and enabling independent scaling.
- Automation & CI/CD:
  - Introduced automated testing and CI/CD pipelines to reduce human errors and accelerate deployment cycles.
- Performance Tuning:
  - Identified bottlenecks in database queries and network traffic. Improved caching and database indexing, reducing latency by 30%.
- Monitoring & Alerts:
  - Implemented a robust monitoring solution using Prometheus and Grafana to get real-time performance metrics, improving incident response times.

The combination of these actions resulted in a significant improvement in reliability and a reduction in downtime, which helped increase user satisfaction and operational efficiency.

69

Can you explain the SRE golden signals and why they are important?

Reference answer

The SRE golden signals are key metrics indicative of a system's health and performance. They include latency (the time it takes to respond to a request), traffic (the amount of demand on your system), errors (the rate of failed requests), and saturation (how close your system is to being overloaded). Monitoring these signals is crucial as they provide a comprehensive view of system performance, enabling quick detection of issues and proactive optimization of system resources.
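To make the four signals concrete, here is a minimal sketch that computes them from one observation window. The `Request` record, the `golden_signals` function, and the use of in-flight requests divided by capacity as the saturation measure are illustrative assumptions, not a standard API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the request failed


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarise one observation window as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, using integer arithmetic for the rank.
    rank = (95 * len(latencies) + 99) // 100
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": latencies[rank - 1],        # latency
        "traffic_rps": len(requests) / window_s,      # traffic
        "error_rate": errors / len(requests),         # errors
        "saturation": in_flight / capacity,           # saturation (assumed proxy)
    }


# Hypothetical window: 20 requests over 10 seconds, one of them failed.
sample = [Request(latency_ms=10 * (i + 1), ok=(i != 0)) for i in range(20)]
signals = golden_signals(sample, window_s=10, in_flight=40, capacity=50)
```

A percentile is used for latency rather than the mean because a few slow requests can hide behind a healthy average.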

70

How do error budgets influence your decision-making?

Reference answer

Effective answers show understanding of how error budgets create shared accountability between SRE and development teams when making tradeoffs between velocity and stability.

71

What are the key responsibilities of an SRE?

Reference answer

Key responsibilities include monitoring system health, incident management, automating manual tasks (toil reduction), capacity planning, disaster recovery, and conducting blameless postmortems to improve system reliability.

72

Kubernetes job interview questions

Reference answer

A collection of questions to practice with for Kubernetes job interviews.

73

How to scale a Deployment?

Reference answer

Run 'kubectl scale deployment/myapp --replicas=5' in a shell to set the Deployment named myapp to five replicas.

74

What's your philosophy on technical debt and how do you balance it with new work?

Reference answer

Technical debt is real, and ignoring it usually costs more than paying it down. I think about it in layers. First, there's critical debt—systems that are unreliable or pose security risks. That has to be addressed. Second, there's efficiency debt—systems that work but are inefficient and slow down development. Third, there's knowledge debt—systems no longer understood by anyone. I prioritize in that order. In my current role, we had a deployment tool that nobody understood anymore and it was causing frequent deployment failures. We rebuilt it, and deployment success rate went from 92% to 99%. That was worth the time. The mistake I see is treating all technical debt equally or ignoring it entirely. I also try to be opportunistic—if we're working on a system anyway, we address debt in that area. And I always budget for debt reduction. If 100% of your time goes to new features, your systems will slowly degrade. We aim for 20-30% of capacity going to infrastructure improvements and debt reduction. I also make it visible to leadership. When deployment takes 45 minutes and we could get it down to 10 minutes by spending two weeks, I show the cost of the delay and make the business case.

75

What is a “/proc” file system?

Reference answer

The “/proc” file system is a virtual (pseudo) file system that the Linux kernel generates in memory rather than storing on disk. It is normally mounted at /proc during boot and exposes information about the current state of the system, such as memory usage (/proc/meminfo) and CPU details (/proc/cpuinfo), along with one numbered directory per running process. Three example entries under /proc:
- /proc/1: The directory for the process with PID 1 (the init process). Every running process has a directory like this, named after its PID.
- /proc/1/cmdline: A file containing the command line arguments passed to that process.
- /proc/1/maps: A file containing the virtual memory map of that process. It can be used to determine which address ranges the process has mapped and what backs them.
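Because /proc entries are plain text, they can be read like ordinary files. The sketch below is illustrative only: `parse_meminfo` and `read_cmdline` are hypothetical helper names, and `read_cmdline` works only on a Linux host where /proc is mounted.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts, so the unit is deliberately not interpreted here.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value"
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def read_cmdline(pid="self"):
    """Return the argument list of a process (Linux only).

    /proc/<pid>/cmdline separates arguments with NUL bytes.
    """
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return [arg.decode() for arg in f.read().split(b"\0") if arg]


# Sample text in the same shape as /proc/meminfo, so this runs anywhere.
sample = "MemTotal:       16384 kB\nMemFree:         1024 kB\nHugePages_Total:       0\n"
meminfo = parse_meminfo(sample)
```

On a Linux machine, `parse_meminfo(open("/proc/meminfo").read())` would apply the same parser to live data.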

76

What tools do you use for monitoring and alerting? Why?

Reference answer

I use Prometheus for monitoring because of its powerful querying capabilities and Grafana for visualization due to its user-friendly interface. These tools have helped us proactively identify and resolve issues, significantly improving system reliability.

77

What is SLO – please explain?

Reference answer

An SLO, or Service Level Objective, is a key element of a service-level agreement (SLA) between a service provider and a customer. SLOs are agreed upon as a way to measure the provider's performance and to avoid disputes between the two parties. An SLO is a specific, measurable characteristic of the SLA, such as availability, throughput, frequency, response time, or quality. Together, these SLOs define the expected service between the provider and the customer, and they vary with the service's urgency, resources, and budget. SLOs provide a quantitative means to define the level of service a customer can expect from a provider.

78

Describe a situation where you had to work closely with another team with conflicting priorities or different technical approaches. How did you navigate it?

Reference answer

S – Situation
We were in the middle of a critical migration project, moving a substantial part of our legacy monolithic application to a new microservices architecture. My role as an SRE was to ensure the new services were highly observable and reliable from day one. One of the core new microservices being developed was the user authentication service, a critical component for every user interaction. The feature team developing this service had a strong preference for a specific proprietary Application Performance Monitoring (APM) tool, largely due to their familiarity and previous positive experiences with it. However, our SRE team was actively standardizing on an open-source observability stack (Prometheus for metrics, Grafana for dashboards, Loki for logs, and Tempo for traces) across the organization to reduce costs, foster a unified operational view, and simplify on-call rotations by having a consistent set of tools. This created a clear conflict in technical approaches and priorities.

T – Task
My task was to bridge this gap: ensure the new authentication service launched with robust, standardized observability that aligned with the SRE team's long-term strategy, while simultaneously respecting the feature team's technical expertise and preferences, and without causing delays to their aggressive development schedule for such a crucial service. The challenge was to integrate their needs into our overarching strategy without alienating them or creating a fragmented observability landscape that would hinder future incident response and operational efficiency.

A – Action
I initiated a proactive, open dialogue by scheduling a dedicated meeting with the feature team lead and their principal engineer. Crucially, I started not by dictating our preferred tools, but by actively listening and seeking to understand why they favored their chosen proprietary APM solution. They explained its specific strengths in distributed tracing and their team's deep proficiency, which allowed for very rapid debugging during development. I acknowledged their valuable experience and the clear benefits they saw in their tool. Then, I articulated our SRE team's perspective, emphasizing the broader organizational advantages of a unified open-source observability platform: easier cross-service correlation during complex incidents, reduced cognitive load for on-call engineers who wouldn't need to learn multiple toolsets, significant cost savings at scale, and a consistent query language for all types of telemetry data (metrics, logs, traces). I proposed a pragmatic, phased approach: initially, they could instrument their authentication service with both their preferred proprietary APM tool and our open-source agents (e.g., Prometheus client libraries for metrics, OpenTelemetry for traces and logs). This "dual-instrumentation" strategy would allow them to continue using the tool they were comfortable with for their immediate development and debugging needs, while simultaneously allowing our SRE team to begin ingesting the same telemetry into our centralized platform. I also offered to dedicate SRE resources to assist their team with the integration of our open-source stack, including creating custom, service-specific Grafana dashboards tailored to their service's unique metrics and operational needs. We also agreed to hold bi-weekly sync-ups to review the data coming from both systems, compare insights, and discuss the path towards eventual consolidation.

R – Result
This collaborative and empathetic approach proved highly effective. The feature team appreciated that their expertise was respected and that we offered tangible support rather than a mandate. Over the next few weeks, as they began to see the ease of correlating authentication metrics with other upstream and downstream services in Grafana, and the efficiency of querying all service logs in Loki during testing, they naturally started leaning more towards our standard stack. They experienced firsthand how a unified view simplified their understanding of the broader system context. Within two months, they voluntarily decided to de-prioritize further integration with the proprietary APM tool and fully embraced our open-source observability solution, citing improved collaboration with SRE and better overall operational visibility as key drivers. This allowed us to successfully launch the critical authentication service with a fully standardized observability footprint, significantly strengthening inter-team relationships and validating our strategy for organizational-wide adoption of our SRE tooling. The personal lesson for me was profound: rather than forcing a solution, demonstrating the practical benefits and investing in collaboration, understanding, and support is far more effective in driving long-term adoption and fostering a cohesive engineering culture.

79

What coding best practices do you follow to ensure clean, maintainable code?

Reference answer

Skilled candidates will be deeply familiar with the importance of clean code. Look for specific best practices they mention. For example, they might explain that they: Write modular code, Use clear and meaningful variable names, Implement consistent coding styles, Conduct thorough testing with unit tests and integration tests. They might also talk about the importance of code reviews, giving and receiving feedback, and maintaining clear documentation to ensure the codebase is transparent for others. Mentioning specific tools like linters or formatters, and principles such as DRY (Don't Repeat Yourself) or SOLID, indicates a strong understanding of coding best practices.

80

What tools do you use regularly as an SRE?

Reference answer

- Monitoring: Prometheus, Grafana, CloudWatch
- Logs: ELK stack, Fluentd, Loki
- Infra-as-code: Terraform, Helm
- CI/CD: Jenkins, ArgoCD
- Others: Kubernetes, Docker, Bash, Git, tcpdump, strace

81

Describe your experience with Infrastructure as Code (IaC). What challenges have you faced?

Reference answer

Infrastructure as Code (IaC) is a key component of modern DevOps practices. Here's my experience and the challenges I've faced:
- Tools Used: I have experience with tools like Terraform, CloudFormation, and Ansible to automate the provisioning and management of infrastructure in a consistent and repeatable manner.
- Version Control: I store IaC configurations in Git repositories, which allows me to track changes and roll back if necessary.
- Challenges:
  - State Management: Managing state in tools like Terraform can be tricky, especially when working with multiple teams or environments.
  - Testing Infrastructure: It's hard to test IaC without deploying it. I've overcome this by using mock environments or deploying in isolated, non-production environments.
  - Collaboration: Ensuring that teams collaborate effectively on IaC changes can be challenging. I've addressed this with thorough code reviews and clear documentation.

Overall, IaC has enabled us to scale efficiently and reduce human error in infrastructure management.

82

How do you balance reliability and feature velocity when working with development teams?

Reference answer

Balancing reliability with the speed of feature delivery is a crucial part of an SRE's job. Here's my approach:
- Use Error Budgets: The key is error budgets. We define an acceptable level of risk (usually in terms of availability or latency) and allow new features to be released as long as the error budget isn't exhausted.
- Continuous Integration and Automated Testing: By incorporating CI/CD pipelines and automated testing, we ensure that features don't break the production environment and that we can release quickly while maintaining stability.
- Focus on Small Releases: Encourage smaller, incremental releases to avoid big, risky changes. This allows for better control over quality and easier rollback in case of failures.
- Frequent Monitoring and Feedback: Continuously monitor service performance (using tools like Grafana, Prometheus) and maintain a close feedback loop with development teams, so issues can be caught early.
- Collaborate on Priorities: Act as a bridge between the dev team's goals and the SRE's reliability focus. Communicate the importance of reliability early and often, helping prioritize technical debt and reliability improvements alongside new features.

Balancing both ensures that the product evolves rapidly without sacrificing the user experience due to reliability failures.

83

How would you troubleshoot a pod stuck in CrashLoopBackOff?

Reference answer

Check pod logs using 'kubectl logs POD_NAME' (where POD_NAME is the name of the failing pod), inspect the logs of the previous crashed container with '--previous', describe the pod with 'kubectl describe pod POD_NAME' for events and status, verify resource limits, check for configuration errors in environment variables or volumes, and test the container image locally if needed.

84

How do you define and track SLOs?

Reference answer

SLOs need to come from understanding what matters to your users and your business. We start by defining SLIs—the actual measurements—like request latency and error rate. For our user-facing API, we decided on a 99.9% availability SLO, which translates to about 43 minutes of acceptable downtime per month. We track this with a 30-day rolling window using Prometheus. The key part is the error budget: if we have 0.1% error budget and we've already burned through 0.08% handling an incident, the team knows we need to be more conservative with deployments. This forces an interesting conversation—do we deploy that new feature or do we focus on stability? In practice, it means we've had to say 'no' to shipping features until we improved reliability, which actually led to fixing some serious underlying issues we'd been ignoring.
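The arithmetic behind those numbers can be checked with a short sketch. The function names are illustrative, and the calculation assumes a purely time-based availability SLO over a 30-day window.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, bad_fraction):
    """Fraction of the error budget still unspent (negative if overspent).

    bad_fraction is the share of the window (or of requests) that was bad,
    e.g. 0.0008 for 0.08%.
    """
    budget = 1 - slo
    return 1 - bad_fraction / budget


allowed = error_budget_minutes(0.999)       # about 43.2 minutes per 30 days
remaining = budget_remaining(0.999, 0.0008)  # about 0.2, i.e. 20% of budget left
```

With 0.08% of a 0.1% budget already spent, only about a fifth of the budget is left, which is the point at which the answer above describes becoming more conservative with deployments.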

85

What is a Linux signal, and what are some common ones you work with?

Reference answer

A Linux signal is an asynchronous notification sent to a process to indicate an event, such as an error or external request. Common signals include SIGTERM (request termination), SIGKILL (force termination), SIGINT (interrupt from keyboard), SIGHUP (hang up or reload configuration), and SIGSEGV (segmentation fault).

86

Explain the role of service level indicators in capacity planning and autoscaling policy design.

Reference answer

SLIs, such as latency, error rate, and resource utilization, are essential for capacity planning and autoscaling. In capacity planning, SLIs help identify when a service is approaching saturation, guiding decisions on scaling resources. For autoscaling, SLIs like CPU usage or request latency are used as metrics to trigger scaling actions (e.g., adding instances when latency exceeds a threshold). This ensures that capacity adjusts dynamically to maintain SLOs while optimizing resource usage.
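As one possible shape for such a policy, the sketch below borrows the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × observed / target), with a small tolerance band). The function name, the replica bounds, and the latency figures are assumptions for illustration.

```python
import math


def desired_replicas(current, observed, target, min_r=1, max_r=10, tolerance=0.1):
    """Scale replica count in proportion to how far the metric is from target.

    observed and target are values of the same SLI-derived metric,
    for example average CPU utilisation or p95 latency in milliseconds.
    """
    ratio = observed / target
    # Inside the tolerance band, do nothing to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    # Clamp to the configured bounds.
    return max(min_r, min(max_r, wanted))


scale_up = desired_replicas(4, observed=300, target=200)     # latency 50% over target
hold = desired_replicas(4, observed=210, target=200)         # within tolerance
scale_down = desired_replicas(4, observed=50, target=200)    # well under target
capped = desired_replicas(8, observed=400, target=200)       # would be 16, capped at 10
```

The upper bound is where capacity planning and autoscaling meet: the policy can only add instances up to what has been provisioned or budgeted.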

87

How do you ensure that your systems are resilient to failures?

Reference answer

To ensure system resilience, I implement redundancy and failover mechanisms, conduct regular stress testing, and use monitoring tools to detect issues promptly. This proactive approach helps identify and mitigate potential failures before they impact users.

88

How does your current deployment pipeline look? What are the biggest issues?

Reference answer

This question determines your ability to analyze your deployment pipeline and make intelligent decisions for changing it. You can showcase how in your experience, you, alongside your team, brought significant improvements to resilience without drastically affecting employee productivity to highlight your problem-solving skills.

89

Add two numbers given as strings and return the resulting number as a string without leading zeros

Reference answer

Simulate addition from right to left, starting from the least significant digit. Keep a carry variable. For each digit, sum the digits from both strings plus carry, compute the result digit as sum % 10, and update carry as Math.floor(sum / 10). After processing all digits, if carry remains, prepend it. Remove leading zeros from the result string.
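The same steps in Python, as a minimal sketch (the answer above uses JavaScript's Math.floor; integer division with // plays that role here). It assumes both inputs contain only the digits 0-9.

```python
def add_strings(a, b):
    """Add two non-negative integers given as decimal strings."""
    i, j = len(a) - 1, len(b) - 1
    carry = 0
    digits = []
    # Walk both strings from the least significant digit.
    while i >= 0 or j >= 0 or carry:
        total = carry
        if i >= 0:
            total += ord(a[i]) - ord("0")
            i -= 1
        if j >= 0:
            total += ord(b[j]) - ord("0")
            j -= 1
        digits.append(str(total % 10))  # result digit
        carry = total // 10             # carry into the next column
    # Digits were collected least significant first, so reverse them.
    result = "".join(reversed(digits)).lstrip("0")
    return result or "0"  # keep a single zero when the sum is zero
```

For example, `add_strings("123", "989")` returns `"1112"`, and inputs padded with zeros such as `"0005"` and `"0007"` give `"12"`.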

90

How to secure SSH?

Reference answer

Disable root login, use SSH keys, and enable 2FA.

91

Can you explain how you would implement automation in a site reliability engineering context?

Reference answer

Automation is a key aspect of site reliability engineering. I would use tools such as Ansible, Terraform, and Jenkins, and scripting languages like Python or Shell to automate repetitive tasks. These could include server provisioning and configuration, deployment of applications, and incident response. Automation reduces the risk of human error, saves time, and allows us to focus on more complex tasks that require a human touch, thus improving overall site reliability.

92

What does S3 operations focus on?

Reference answer

S3 operations evaluate both sides of the problem, optimize resources, and ensure seamless operations during significant events or product launches. Through capacity planning and forecasting, S3 enterprises may prepare their systems for future issues and preserve service dependability.

93

What is the role of change management in managing cloud services?

Reference answer

Change management is crucial for managing the risk of outages caused by changes to live systems. Organizations can avoid global changes, implement progressive rollouts, and detect issues quickly with good monitoring to ensure safe and quick rollbacks.

94

How would you design a monitoring dashboard for a microservices-based application? What key metrics would you include?

Reference answer

For a microservices-based application, the monitoring dashboard should provide insights into both the health of individual services and the system as a whole. Key metrics include:
- Service availability (uptime, error rates)
- Latency (response time for each service)
- Throughput (requests per second)
- Error budget consumption (helps with release management)
- Resource utilization (CPU, memory usage)
- Service dependencies (to see inter-service interactions)
- Custom application metrics (specific to business logic, like transactions processed)

The dashboard should offer real-time metrics and historical trends, and the ability to drill down into specific service failures.

95

What is “chaos engineering” and how does it benefit reliability?

Reference answer

Chaos engineering involves intentionally introducing failures into a system to test its resilience. This practice ensures that systems can handle unexpected events and recover gracefully.

96

Design a thumbnail service

Reference answer

Design a service that generates thumbnails from images. Accept uploads, store original images in object storage (e.g., S3), and queue thumbnail generation tasks. Workers process images using libraries like ImageMagick, resize to specified dimensions, and store results. Use a CDN for fast delivery. Cache thumbnails with expiration. Handle high concurrency with async processing and autoscaling workers.
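One small piece of the worker logic can be sketched without any imaging library: choosing the output dimensions and a cache-friendly storage key. Preserving the aspect ratio, never upscaling, and the `thumbs/...` key layout are assumptions made for this example; the actual pixel resizing would be delegated to a library such as ImageMagick.

```python
def fit_within(width, height, max_w, max_h):
    """Return (w, h) scaled to fit inside max_w x max_h, keeping aspect ratio.

    Images that already fit are returned unchanged (no upscaling).
    """
    if width <= max_w and height <= max_h:
        return width, height
    # Compare aspect ratios with integer cross-multiplication.
    if width * max_h >= height * max_w:
        # Width is the limiting side.
        return max_w, max(1, height * max_w // width)
    # Height is the limiting side.
    return max(1, width * max_h // height), max_h


def thumbnail_key(image_id, size):
    """Hypothetical object-storage key; one object per image and size."""
    w, h = size
    return f"thumbs/{image_id}/{w}x{h}.jpg"


landscape = fit_within(4000, 3000, 200, 200)
portrait = fit_within(3000, 4000, 200, 200)
key = thumbnail_key("abc123", landscape)
```

Putting the dimensions in the key means each size is generated once and can then be served from the CDN or cache until it expires.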

97

How would you design a high-availability architecture for a database?

Reference answer

- Implement database replication (e.g., MySQL replication, PostgreSQL streaming replication) across multiple availability zones or regions.
- Use automatic failover with tools like Patroni or AWS RDS Multi-AZ.
- Employ load balancers to distribute read requests to read replicas while write requests go to the primary database.
- Regularly perform database backups and test disaster recovery plans.
- Use sharding to distribute large datasets across multiple servers to ensure scalability.

98

Make a class that, given a keyboard configuration and the size of the keyboard, returns the key for given coordinates (creation includes a list of vectors with the keys).

Reference answer

Define a class with a constructor that takes a keyboard configuration (e.g., a list of key positions and labels) and the keyboard size. Store the mapping from coordinates to keys in a dictionary or 2D array. The getKey(x, y) method returns the key at those coordinates, handling bounds checking. The configuration could include key shapes or rectangular regions.
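A minimal sketch of such a class follows. It assumes each key is a rectangle given as (label, x, y, width, height) and that coordinates outside the keyboard are an error; the class and method names are illustrative.

```python
class Keyboard:
    """Maps coordinates to keys laid out as rectangles on a keyboard."""

    def __init__(self, width, height, keys):
        self.width = width
        self.height = height
        self.keys = []
        for label, x, y, w, h in keys:
            # Reject keys that do not fit on the keyboard.
            if x < 0 or y < 0 or x + w > width or y + h > height:
                raise ValueError(f"key {label!r} lies outside the keyboard")
            self.keys.append((label, x, y, w, h))

    def get_key(self, x, y):
        """Return the label at (x, y), or None if the point is in a gap."""
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise ValueError("coordinates outside the keyboard")
        for label, kx, ky, kw, kh in self.keys:
            if kx <= x < kx + kw and ky <= y < ky + kh:
                return label
        return None


# Hypothetical 4x2 layout: two unit keys, a gap, and a wide space bar.
kb = Keyboard(4, 2, [("A", 0, 0, 1, 1), ("B", 1, 0, 1, 1), ("SPACE", 0, 1, 4, 1)])
```

The linear scan is fine for a keyboard-sized key count; for a strict grid of equal keys, a 2D array indexed by row and column would give constant-time lookup, as the answer above suggests.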

99

How do you handle disaster recovery in SRE?

Reference answer

Disaster recovery involves creating and maintaining a plan that includes data backups, redundancy, failover mechanisms, and regular testing to ensure business continuity.

100

How would you lead a blameless postmortem process that drives measurable reliability improvements?

Reference answer

I would start by scheduling the postmortem as soon as possible after an incident, ensuring psychological safety by emphasizing that the focus is on system failures, not individual mistakes. I would gather data from monitoring, logs, and timelines to create a factual timeline. The postmortem should identify root causes, contributing factors, and action items (e.g., automated mitigations, runbook updates). Action items are tracked with owners and deadlines. I measure improvement by tracking reduction in similar incidents, MTTR, and error budget usage over time.

101

What is your favorite Google product?

Reference answer

My favorite Google product is Google Search. It demonstrates the company's core strength in organizing information and delivering relevant results in milliseconds. The underlying infrastructure, including indexing, ranking algorithms, and distributed systems, is a marvel of engineering. It directly impacts how billions of people access knowledge daily, which aligns with my interest in large-scale systems.

102

Describe a time you responded to a critical incident as a junior engineer.

Reference answer

At my previous internship with Atlassian, we experienced an unexpected outage on our service affecting several users. I quickly gathered logs and used monitoring tools to identify that a recent deployment introduced a bug. I collaborated with the development team to roll back the change, which restored service within 30 minutes. Afterward, we conducted a post-mortem to implement better testing for future deployments.

103

What is Multithreading in Operating System?

Reference answer

A thread is a path of execution followed while a program runs. Many programs run as a single thread. Suppose, for example, that a program cannot read keystrokes while it is drawing: the two tasks cannot be executed by the program at the same time. This problem can be solved through multitasking, so that two or more tasks can be executed simultaneously. Multitasking is of two types: process-based and thread-based. Process-based multitasking is managed entirely by the OS, whereas multitasking through multithreading can be controlled by the programmer to some extent. Understanding multithreading requires two terms: a process and a thread. A process is a program being executed. A process can be further divided into independent units known as threads. A thread is like a small, lightweight process within a process; put another way, a process contains one or more threads that share its resources.

104

How do you implement zero downtime deployments?

Reference answer

Zero downtime deployments can be achieved through techniques like blue-green deployments, canary releases, rolling updates, and using feature toggles.

105

What is DHCP?

Reference answer

The Dynamic Host Configuration Protocol (DHCP) is a network management protocol used on Internet Protocol (IP) networks, whereby a DHCP server dynamically assigns an IP address and other network configuration parameters to each device on the network, so they can communicate with other IP networks.

106

You're on-call for the Shakespeare search service and receive an alert, Shakespeare-BlackboxProbe_SearchFailure: your black-box monitoring hasn't been able to find search results for “the forms of things unknown” for the past five minutes. What do you do?

Reference answer

1. Acknowledge the alert and assess severity.
2. Check monitoring dashboards for latency, error rates, and resource usage.
3. Verify if the service is reachable via curl or browser.
4. Look at recent deployments or config changes.
5. Check logs for errors in search backend, database, or API.
6. Test the search query manually.
7. If needed, roll back recent changes or restart services.
8. Communicate status and escalate if unresolved.

107

What is vertical scaling?

Reference answer

Vertical scaling is the process of increasing a system's capacity by adding more resources to an existing machine. It is frequently used to improve throughput, performance, and capacity. On a single physical server, it typically means adding hardware such as CPU, memory, or storage, rather than adding more servers. Another name for this procedure is scaling up.

108

How would you handle a situation where the error budget is consistently being consumed?

Reference answer

- Pause new feature rollouts: Temporarily stop deploying new features to focus on improving system reliability.
- Analyze root causes: Use incident postmortems and monitor system logs and metrics to understand where the error budget is being consumed.
- Focus on stability: Implement fixes such as improved retries, redundancy, and error handling in areas causing frequent outages or slowdowns.
- Improve automation: Automate processes that are leading to human error or unnecessary toil.
- Revisit SLOs: Review whether the current SLOs accurately reflect the business requirements, or whether they are too strict or too loose, and adjust accordingly.

109

What are SLAs, and how do they differ from SLOs and SLIs?

Reference answer

SLA (Service Level Agreement) is a contract that defines the level of service expected from a service provider, including uptime, performance, and response times. SLO (Service Level Objective) is a specific target within an SLA that a service must meet, like 99.9% uptime. SLI (Service Level Indicator) is a metric used to measure the performance of a service against an SLO, such as response time or error rate.

110

Describe a time you identified and resolved a recurring reliability issue.

Reference answer

At my previous role at Vodafone, we experienced frequent outages due to a misconfigured load balancer. I led a root cause analysis and discovered that our configuration management was inconsistent. I implemented a standardized configuration process and automated our deployment pipeline, reducing outages by 75% and increasing system reliability significantly. This experience taught me the importance of thorough configuration management.

111

What is a circuit breaker pattern, and how does it improve reliability in microservices?

Reference answer

The circuit breaker pattern is a fault-tolerance mechanism that stops requests from reaching a service when it's detected to be failing.
- Closed State: The circuit allows requests as normal.
- Open State: Requests are blocked, and the system immediately returns an error, preventing cascading failures.
- Half-Open State: Allows a limited number of requests to check if the service has recovered.

This pattern improves reliability by preventing downstream failures from overwhelming upstream services and helps avoid performance degradation.
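The three states can be sketched in a few lines. This is a simplified, single-threaded illustration rather than a production implementation: the class name, the consecutive-failure threshold, and the timeout-based move to half-open are assumptions, and the clock is injectable so the behaviour can be tested without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed, allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures while closed, opens the circuit.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile` function, and treat `CircuitOpenError` as a signal to serve a fallback instead of waiting on a failing dependency.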

112

Differentiate between SNAT and DNAT.

Reference answer

| SNAT | DNAT |
| --- | --- |
| SNAT changes the source IP address of outgoing packets, so a single public IP address can be shared by several internal devices. | DNAT changes the destination IP address of incoming packets to route traffic to particular internal servers. |
| For packets exiting a network, it is often used to translate a private address or port into a public address or port. | Incoming packets whose destination is a public address or port are often redirected to a private IP address or port within the network. |
| It allows multiple hosts on the inside to reach any host on the outside. | It allows multiple hosts on the outside to reach a single host on the inside. |

113

Why do you want to leave your current job?

Reference answer

I am seeking new challenges where I can apply my skills to larger-scale systems and more complex reliability problems. While my current role has been valuable, I want to work in an environment that prioritizes automation, innovation, and has a stronger impact on billions of users. I believe Google offers the technical depth and growth opportunities I am looking for.

114

How to parse JSON in Python?

Reference answer

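Use Python's standard-library json module: 'import json', then 'data = json.loads(json_string)' to turn a JSON string into Python objects such as dicts and lists.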
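A slightly fuller sketch of the same call, with error handling added. The `load_config` helper and the sample payload are made up for illustration; `json.loads` and `json.JSONDecodeError` are part of the standard library.

```python
import json


def load_config(text):
    """Parse a JSON object from text, raising ValueError with context on failure."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ) from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


# Hypothetical payload.
config = load_config('{"service": "api", "replicas": 3, "tags": ["prod", "eu"]}')
```

To read from a file instead of a string, `json.load(file_object)` takes an open file and returns the same kind of result.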

115

How do you handle capacity planning for a distributed system?

Reference answer

Capacity planning involves forecasting future resource needs (CPU, memory, storage, network) based on usage trends, growth projections, and seasonal patterns. SREs use monitoring data, load testing, and modeling to predict demand. They then plan for scaling (vertical or horizontal) and implement auto-scaling policies. It also includes considering redundancy and disaster recovery requirements to ensure reliability during peak load.
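As a rough illustration of the forecasting step, the sketch below fits a straight line to past usage and converts the projection into an instance count. The straight-line trend is a deliberate simplification (it ignores the seasonal patterns mentioned above), and the function names, per-instance capacity, and headroom figure are assumptions.

```python
import math


def linear_forecast(history, periods_ahead):
    """Least-squares straight-line projection of evenly spaced usage samples."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Project from the last observed sample (x = n - 1) forward.
    return intercept + slope * (n - 1 + periods_ahead)


def instances_needed(demand, per_instance, headroom=0.25):
    """Instances required to serve demand with spare headroom for peaks or failures."""
    return math.ceil(demand * (1 + headroom) / per_instance)


# Hypothetical monthly peak requests per second.
history = [100, 110, 120, 130]
projected = linear_forecast(history, periods_ahead=3)
fleet = instances_needed(projected, per_instance=50)
```

The headroom term is where redundancy requirements enter: it reserves capacity so the service still meets demand when an instance or zone is lost.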

116

What are containers in servers?

Reference answer

A container is an isolated environment for running an application, and it is often compared with a virtual machine because both provide an environment for running applications. However, containers differ from virtual machines in several ways. First, containers are much more lightweight: they share the host operating system's kernel instead of booting a full guest operating system, so they take up far less disk space and use fewer CPU and memory resources. Second, a container image packages the application together with its dependencies, so the application does not need to be preinstalled on the server and can be deployed quickly and easily. Third, containers can run on a wide range of machines, from desktop computers to high-end servers, provided the host has a compatible kernel, CPU architecture, and container runtime. Finally, a container is usually built to run one specific application or process rather than to serve as a general-purpose desktop environment. For deploying modern enterprise applications, container technology offers clear advantages over traditional virtualization in deployment speed and resource efficiency, although sharing the host kernel generally gives weaker isolation than a virtual machine. Several vendors and projects provide container tooling (e.g., Docker), and the Open Container Initiative (OCI) publishes image and runtime specifications that most of them follow. Operational practices still vary, which presents challenges when deploying containerized applications across multiple organizations or even within an organization's own data centers.

117

Describe your experience with containerization and orchestration.

Reference answer

We use Docker for containerization and Kubernetes for orchestration. I'm comfortable writing Dockerfiles, managing image registries, and setting up CI/CD pipelines that build and push images. In Kubernetes, I've worked with Deployments, StatefulSets, and DaemonSets. We use Helm for templating configurations across environments. On the troubleshooting side, I can diagnose issues with pod scheduling, resource constraints, and networking. We had an incident where pods kept getting evicted, and I traced it to memory requests being set too low, which meant we were over-subscribing nodes. I updated the resource requests across our services, and the evictions stopped. I've also implemented resource quotas per namespace to prevent one team's runaway deployment from taking down another team's services. The biggest challenge we've faced is managing persistent state in Kubernetes, and we eventually moved stateful services like databases to managed services rather than fighting Kubernetes.

118

Write a function in Python that checks if a given string is a palindrome.

Reference answer

A palindrome is a string that reads the same forward and backward. Here's a Python function to check if a given string is a palindrome:

```python
def is_palindrome(s):
    # Compare the string with its reverse (slice with step -1).
    return s == s[::-1]
```

119

Sysadmin Test Questions

Reference answer

A collection of questions for testing system administration knowledge.

120

Explain `inode` in Linux.

Reference answer

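An inode is the on-disk data structure that stores a file's metadata (permissions, ownership, timestamps, size, and the location of its data blocks). It does not store the file name, which lives in the directory entry that points to the inode. Use `df -i` to check inode usage.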
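
To see inode metadata from code, a small Python sketch can read it with `os.stat`. It assumes a POSIX filesystem that supports hard links; the file names and the `same_inode` helper are illustrative only.

```python
import os
import tempfile


def same_inode(path_a, path_b):
    """Return True if both paths refer to the same inode on the same device."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "data.txt")
    hard_link = os.path.join(tmp, "alias.txt")
    with open(original, "w") as f:
        f.write("hello")
    os.link(original, hard_link)  # a second name pointing at the same inode

    info = os.stat(original)
    linked = same_inode(original, hard_link)
    link_count = info.st_nlink  # number of directory entries for this inode
    # Inode number, permission bits, link count, and whether both names share it.
    print(info.st_ino, oct(info.st_mode & 0o777), link_count, linked)
```

Because both names resolve to one inode, deleting either name leaves the data in place until the link count drops to zero.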

121

What are the 4 Golden Signals of SRE?

Reference answer

- Latency - The amount of time your services take to fulfill a request.
- Traffic - The number of requests your service receives.
- Errors - The number of unsuccessful requests, both overall and at specific endpoints.
- Saturation - The utilization of resources in comparison to their capacity.

122

Design a 'snakes' game

Reference answer

Design a real-time multiplayer snake game with a game server managing state. The server receives player inputs, updates snake positions, checks collisions, and broadcasts state to clients. Use a grid-based map with food spawning. Handle latency with client-side prediction and server reconciliation. Scale with sharding by game instance and use a load balancer. Store high scores in a database.

123

How would you design an alerting strategy to avoid alert fatigue while maintaining visibility?

Reference answer

To design an alerting strategy that avoids alert fatigue, I would prioritize actionable alerts based on SLOs and error budgets. Alerts should be tiered: critical alerts for immediate response (e.g., SLO burn rate exceeding thresholds), warning alerts for non-urgent issues (e.g., elevated error rates but within budget), and informational notifications for trends. I would reduce noise by tuning alert thresholds, implementing deduplication and grouping, and ensuring alerts have clear runbooks for response.
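
The tiering described above can be sketched as a burn-rate calculation. The function names and the thresholds (`page_at`, `warn_at`) are illustrative assumptions, not values taken from this answer; real thresholds depend on the SLO window and how quickly you are willing to spend the budget.

```python
def burn_rate(errors, requests, slo_target):
    """Ratio of the observed error rate to the error rate the SLO allows."""
    if requests == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (errors / requests) / allowed_error_rate


def classify(rate, page_at=14.4, warn_at=1.0):
    """Map a burn rate to an alert tier (thresholds are assumptions)."""
    if rate >= page_at:
        return "critical"  # page someone: the budget is burning fast
    if rate >= warn_at:
        return "warning"   # ticket: burning faster than sustainable
    return "info"          # within budget: dashboard only


# 20 errors in 1,000 requests against a 99.9% SLO burns budget about 20x too fast.
print(classify(burn_rate(20, 1000, 0.999)))
```

A burn rate of 1.0 means the service would spend exactly its whole budget over the SLO window, which is why it is a natural boundary between informational and warning tiers.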

124

What is a hybrid cloud?

Reference answer

A hybrid cloud combines on-premises infrastructure with public cloud services, allowing data and applications to be shared between them.

125

How do you approach reducing manual tasks in day-to-day operations?

Reference answer

I start by identifying and measuring toil through tracking manual tasks like deployments, alert responses, or routine checks. I prioritize automation for high-frequency or error-prone tasks. For example, I would automate deployment pipelines, create self-healing scripts (e.g., auto-restarting failed services), and use infrastructure as code to manage configurations. I also build runbooks and chatbots to streamline incident response. The goal is to reduce manual effort over time, freeing up the team for higher-value work.

126

How does HTTPS work?

Reference answer

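HTTPS encrypts HTTP traffic via TLS:

- The server sends its certificate.
- The client verifies it against its trusted certificate authorities.
- The two sides run a key exchange to agree on symmetric session keys, which encrypt the rest of the traffic.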
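
From the client side, the handshake steps above map onto Python's standard `ssl` module roughly as follows. This is a sketch: the `negotiated_tls` helper is a hypothetical name, and calling it requires network access to a real HTTPS host.

```python
import socket
import ssl

# The default context loads the system's trusted CAs and enables verification.
context = ssl.create_default_context()
cert_required = context.verify_mode == ssl.CERT_REQUIRED  # server cert must validate
hostname_checked = context.check_hostname                 # cert must match the host name


def negotiated_tls(host, port=443, timeout=5.0):
    """Connect, complete the TLS handshake, and report what was negotiated."""
    with socket.create_connection((host, port), timeout=timeout) as raw:
        # wrap_socket performs the handshake: certificate check plus key exchange.
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version(), tls.cipher()[0]
```

If the certificate cannot be verified, `wrap_socket` raises `ssl.SSLCertVerificationError` before any application data is sent.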

127

How do you use metrics and monitoring data to improve system reliability?

Reference answer

Metrics and monitoring data are analyzed to identify trends, detect anomalies, and measure the impact of changes. This information helps in making data-driven decisions to improve system reliability.

128

What is circuit breaking in distributed systems?

Reference answer

Circuit breaking is a design pattern in distributed systems where a proxy or client detects excessive failures when calling a service. It 'opens' the circuit, preventing further calls to the failing service for a duration, thus preventing cascading failures and often allowing for a fallback response.

129

How do you manage service dependencies in a microservices architecture to ensure reliability?

Reference answer

- Circuit breakers: Implement circuit breakers (e.g., via Hystrix or Istio) to prevent cascading failures when dependent services are down or slow.
- Retries with backoff: Use retries with exponential backoff to handle transient failures while avoiding overwhelming the service.
- Bulkheads: Apply the bulkhead pattern to isolate different microservices, preventing failures in one service from affecting others.
- Timeouts: Set timeouts for service calls to prevent requests from hanging indefinitely when a service is slow.
- Service mesh: Use a service mesh (e.g., Istio or Linkerd) to manage and observe inter-service communication, retries, and timeouts centrally.

These patterns ensure that individual service failures don't propagate throughout the system and degrade overall reliability.
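
As one concrete example of these patterns, retries with exponential backoff might look like the sketch below. The function name, the default delays, and the choice of "full jitter" (a random delay between zero and the current cap) are assumptions for illustration.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, retry_on=(ConnectionError, TimeoutError)):
    """Call func, retrying transient errors with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

Only errors listed in `retry_on` are retried, so non-transient failures such as validation errors fail immediately. Retried operations should be idempotent.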

130

Describe your ideal on-call rotation.

Reference answer

The answer interviewers respond to is a specific one: a rotation structure with a handoff process, escalation paths, a defined response time SLO for pages, and an opinion about compensation for on-call hours. Candidates who've actually run or participated in designing an on-call rotation have that answer ready. Candidates who've only been a participant in someone else's rotation tend to describe what they experienced rather than what they'd design, and interviewers pick up on that distinction faster than most people expect.

131

How would you automate a repetitive manual task you've encountered?

Reference answer

Strong answers demonstrate a systematic approach to identifying and eliminating toil through practical automation.

132

What is Test-Driven Development (TDD) and how have you used it?

Reference answer

Test-Driven Development (TDD) has been a key part of the agile development process in several of my previous roles. The principle behind TDD is that you write the tests for the function or feature before you write the code. It's a strategy that I found particularly powerful for ensuring reliability of code and preventing bugs from getting into production. In one of my previous roles, we enforced TDD rigidly. Each new feature or function had a corresponding set of tests written before the actual implementation was done. These tests served as both the developer's guide for what the code needed to do, and as verification that the implementation was correct once it was done. More importantly, these tests added to our growing test suite that would be run in our Continuous Integration pipeline every time a change was pushed. If the change broke something elsewhere in the system, we would discover it early thanks to these tests, which significantly improved the stability of our system. Thus, TDD, in my experience, not only helps produce better code, it also speeds up the development process overall, as fewer bugs means less time spent debugging and more time spent building new functionality.

133

How do you measure and enforce error budgets across multiple services?

Reference answer

Error budgets are calculated as 1 minus the SLO target (e.g., 99.9% uptime allows 0.1% error budget). For multiple services, each service has its own SLO and error budget, tracked via monitoring systems. Enforcement involves: alerting when error budget consumption exceeds a threshold (e.g., 50% consumed in a window), freezing feature releases when the budget is exhausted until reliability is restored, and prioritizing reliability work over features when the budget is low. This ensures teams balance innovation with reliability.
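
The arithmetic above can be sketched per service as follows. The service names, request counts, and the `alert_at` threshold are made-up illustrative values, and the sketch assumes a request-based SLO (good requests over total requests).

```python
def budget_status(slo_target, total_requests, failed_requests, alert_at=0.5):
    """Return (fraction of error budget consumed, suggested action)."""
    allowed_failures = (1.0 - slo_target) * total_requests  # the error budget
    if allowed_failures <= 0:
        consumed = float("inf") if failed_requests else 0.0
    else:
        consumed = failed_requests / allowed_failures
    if consumed >= 1.0:
        action = "freeze releases"  # budget exhausted
    elif consumed >= alert_at:
        action = "alert"            # budget burning down
    else:
        action = "ok"
    return consumed, action


# Hypothetical services: (SLO target, total requests, failed requests) in the window.
services = {
    "checkout": (0.999, 1_000_000, 1_200),
    "search": (0.99, 500_000, 3_000),
    "profile": (0.995, 200_000, 100),
}
report = {name: budget_status(*args) for name, args in services.items()}
for name, (consumed, action) in report.items():
    print(f"{name}: {consumed:.0%} of budget used -> {action}")
```

Keeping one budget per service lets a team freeze releases for the service that overspent without blocking the others.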

134

What is the importance of implementing a SRE culture?

Reference answer

An SRE culture gives engineers ownership of code from idea through to operations. Establishing a culture of blamelessness and agreeing on the business's priorities is vital. Post-incident feedback lets teams collaborate and resolve difficulties.

135

Explain the difference between IaaS, PaaS and SaaS.

Reference answer

- IaaS (Infrastructure as a Service) - provides virtualized computing resources over the internet, giving users control over the operating systems, storage, and deployed applications. Examples include AWS, Azure, and Google Cloud.
- PaaS (Platform as a Service) - offers a platform for developers to build, run, and manage applications without managing the underlying infrastructure. Examples include Heroku, Fly.io, and Render.
- SaaS (Software as a Service) - delivers fully functional software applications over the internet, accessible via web browsers, with the provider handling all underlying infrastructure and maintenance. Examples include PagerTree, Netflix, and Google Sheets.

136

What is a service mesh?

Reference answer

A service mesh is a dedicated infrastructure layer that manages service-to-service communication, providing features like load balancing, service discovery, and security.

137

What is observability?

Reference answer

Observability emphasizes gathering and analyzing information from various sources to understand a system's behavior as a whole. The core analysis loop, a continuous cycle of data gathering, analysis, and action, lets teams monitor, debug, and optimize their systems efficiently. To maximize observability, identify the data flowing through an environment and focus on the types relevant to your goals. Then distill, curate, and transform that data into actionable insights, which also provide valuable clues about DevOps maturity.

138

How do you decide which operational tasks to automate?

Reference answer

The decision whether to automate a task should be based on factors such as the task's frequency and complexity, the potential for errors, and the time investment required for automation. Experienced candidates would also consider the impact on the team and whether automating the task would reduce toil or improve efficiency. They might also discuss evaluating the return on investment (ROI) of automation and ensuring that automated processes are documented and maintainable.

139

How do you align SRE practices with DevOps principles?

Reference answer

Site Reliability Engineering (SRE) and DevOps share common goals - improving deployment frequency, lowering failure rates of new releases, hastening incident recovery times, and providing a seamless, high-quality user experience. SRE complements DevOps by providing a set of practices and methods to achieve these goals. For instance, SRE's emphasis on automating manual tasks aligns with DevOps' principle of automation. Furthermore, SRE's use of error budgets fosters a culture of shared responsibility for system reliability, which is a cornerstone of DevOps.

140

If a filesystem is full, and you see a large file that is taking up a lot of space, how do you make space on the filesystem?

Reference answer

There are several options. We want at least one or something just as good. Perhaps follow up with a question about when/why their answer might be suitable and when a different option would be better.

- If no process has the file open, you can delete the file.
- If a process has the file open, it is better not to delete the file, because the space is not freed until the process closes it. Instead, you can cp /dev/null over the file, which reduces its size to 0.
- A filesystem has a reserve; you can reduce the size of this reserve to create more space using tunefs (tune2fs on ext filesystems).
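
The second option (emptying a file that a process still has open) can be sketched in Python. The `truncate_in_place` helper and the log file name are hypothetical; the demo assumes the writer opened the file in append mode, as most loggers do.

```python
import os
import tempfile


def truncate_in_place(path):
    """Empty a file without unlinking it; return the number of bytes released."""
    size = os.path.getsize(path)
    os.truncate(path, 0)  # same effect as copying /dev/null over the file
    return size


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "app.log")
    writer = open(path, "a")        # stands in for a long-running process
    writer.write("x" * 4096)
    writer.flush()

    freed = truncate_in_place(path)  # the writer's handle stays valid

    writer.write("new line\n")       # append mode writes at the new end of file
    writer.flush()
    size_after = os.path.getsize(path)
    writer.close()

print(freed, size_after)
```

If the writer had not used append mode, its next write would land at the old offset and produce a sparse file whose apparent size is still large, even though the freed blocks are not reallocated.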

141

Explain the principle of "least privilege" and how you enforce it in cloud environments.

Reference answer

The principle of least privilege means granting users, services, or systems only the minimum level of access required to perform their tasks. Here's how I enforce it in cloud environments:

- Role-Based Access Control (RBAC): I implement RBAC to define user roles and assign each role only the permissions it needs.
- Identity and Access Management (IAM): I use IAM policies in AWS, Azure, or GCP to ensure that users and services only have the permissions necessary for their roles.
- Audit Logs and Monitoring: I regularly review access logs to verify that permissions are appropriate.
- Temporary Access: For emergency or high-privilege actions, I provide temporary elevated permissions (using AWS STS or Azure AD Privileged Identity Management) and revoke them as soon as the task is complete.

This practice helps reduce the attack surface and minimizes the impact of a potential breach.

142

What is multithreading?

Reference answer

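Multithreading is a programming technique in which a single process runs several threads of execution concurrently. The threads share the process's memory, and the operating system schedules them across the available processor cores, so on a multi-core machine several threads can run at the same time, while on a single core they take turns. Dividing the work across threads lets multiple tasks make progress at once.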
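
A short Python sketch shows threads sharing one process's memory. The `handle` function and the job values are illustrative. Note that in standard CPython builds the global interpreter lock means threads mostly help with I/O-bound work rather than CPU-bound work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter = 0                      # shared state, visible to every thread
counter_lock = threading.Lock()  # protects the shared counter


def handle(job_id):
    """Pretend to process a job and record that it was handled."""
    global counter
    with counter_lock:           # without the lock, updates could be lost
        counter += 1
    return job_id * job_id


# Four worker threads pull jobs from a shared queue inside the pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, range(8)))  # results keep input order

print(results, counter)
```

The lock is the important design choice: because threads share memory, any read-modify-write on shared data needs synchronization.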

143

Can you explain the difference between a blue-green and canary deployment?

Reference answer

Blue-green deployment involves running two identical environments (blue and green). One environment handles live traffic, while the other is used for testing new releases before directing traffic to it, making it easy to revert if necessary. In contrast, canary deployment introduces the new version gradually to a small group of users before a full release, enabling step-by-step validation and reducing the impact of any potential issues.

144

How would you troubleshoot network packets reaching some parts of the network and not others?

Reference answer

Start with ping and traceroute to check connectivity and path. Use tcpdump or Wireshark to capture packets and inspect headers. Check for firewall rules (iptables) and routing tables. Verify ARP tables and DNS resolution. Test with netcat or curl to isolate application-level issues. Check interface status and errors with ifconfig or ip link. Consider MTU mismatches or VLAN configurations.

145

Scenario: Your microservices-based system has intermittent failures when communicating between services. How would you address this?

Reference answer

- Circuit Breaker Pattern: Implement the circuit breaker pattern to stop overloading failing services and give them time to recover.
- Retries with exponential backoff: Add retry logic with exponential backoff to reduce the impact of temporary failures.
- Service mesh: Use a service mesh like Istio to manage and secure service-to-service communication, including retries, timeouts, and circuit breaking.
- Network monitoring: Monitor network health for packet loss, latency, or misconfigurations that might cause communication failures.
- Distributed tracing: Implement distributed tracing (e.g., Jaeger, Zipkin) to identify which service calls are failing and why.

146

SRE Interview Questions

Reference answer

A collection of questions to practice with for Site Reliability Engineer interviews.

147

How would you describe cloud computing to someone who doesn't have a technical background?

Reference answer

Cloud computing is like renting computing resources—servers, storage, databases, and software—over the internet instead of owning and maintaining physical hardware yourself. You pay only for what you use, and you can scale up or down as needed, similar to how you pay for electricity or water.

148

Describe a time you proactively identified a potential issue before it impacted users.

Reference answer

At a previous role with a cloud service provider, I noticed a pattern of increased latency in our database queries during peak hours. By utilizing Prometheus for performance monitoring, I identified inefficient query patterns. I collaborated with the development team to optimize those queries, which reduced latency by 30% and improved overall user satisfaction metrics.

149

How can you use OOPs in designing a Server?

Reference answer

Object-oriented programming (OOP) is a programming paradigm that encourages creating objects to represent real-world entities; these objects are then used to perform tasks. This is useful when designing a server because it lets you break the work into manageable components, which helps keep the server maintainable. OOP also encourages reusable code, which saves time and money. When designing a server using OOP, it's important to follow some basic design principles.

- The first of these is the Single Responsibility Principle (SRP). It states that each class should have one and only one reason to change. For example, if you're creating an Order Repository, it should be responsible for only one thing: storing and retrieving orders. This helps ensure that your code is easy to read and maintain.
- The second principle is the Open/Closed Principle (OCP), which states that a class should be open for extension but closed for modification. For example, you should be able to support a new storage backend for the Order Repository by adding a new implementation, without editing the existing repository code.

150

How do you handle log aggregation and analysis in a distributed system?

Reference answer

Use centralized logging systems like the ELK Stack (Elasticsearch, Logstash, Kibana) or Fluentd to collect, store, and analyze logs from multiple services. This simplifies debugging and performance analysis.

151

Describe the difference between a stateful and a stateless service.

Reference answer

A stateless service does not store client session data between requests; each request is independent and can be handled by any instance. A stateful service maintains state (e.g., session data, database records) and often requires the same instance to handle subsequent requests (sticky sessions). Stateless services are easier to scale and are more resilient to failures, which is why SREs prefer stateless designs where possible.

152

What is the difference between traditional methods and new methods of measuring the error budget?

Reference answer

1. Traditional methods are complicated, while new methods are not.
2. New methods are complicated, while traditional methods are not.
3. New approaches assess availability by dividing good interactions by total interactions with a product or service, whereas traditional methods divide good time by total time.
4. None of the above.

153

How would you design a highly available web application?

Reference answer

Look for candidates who ask about SLO requirements before designing and discuss failure modes proactively rather than focusing only on the happy path.

154

What is Shell in Linux?

Reference answer

A shell is a special user program that provides an interface for the user to use operating system services. The shell accepts human-readable commands from users and converts them into something the kernel can understand. It is a command language interpreter that executes commands read from input devices such as keyboards or from files. The shell starts when the user logs in or opens a terminal. Shells are broadly classified into two categories:

- Command Line Shell
- Graphical Shell

155

How do you perform capacity planning?

Reference answer

Capacity planning involves analyzing current usage patterns, forecasting future needs, and ensuring that infrastructure can handle anticipated growth.

156

What is the difference between TCP and UDP?

Reference answer

Transmission Control Protocol (TCP) is a reliable, connection-based protocol. It is more reliable than UDP, but data transfers are slower. User Datagram Protocol (UDP) is a less reliable, connectionless protocol that works faster than TCP. You can think of TCP as a "handshake" communication technology and UDP as a "broadcast/shout into the ether" communication technology.

157

How do you architect systems to tolerate network partitions and partial failures?

Reference answer

Systems are architected using principles like microservices, circuit breakers, retries with exponential backoff, and bulkheads. For network partitions, I use asynchronous communication (e.g., message queues) to decouple services, implement idempotency for retries, and design for eventual consistency. Partial failures are handled by using graceful degradation (e.g., caching or fallback responses), health checks for service discovery, and replication for data resilience. I also test with chaos engineering to validate tolerance.

158

Tell me about your experience working in a cross-functional team or during a critical incident.

Reference answer

Situation: During a complete database failure at 2 AM on a Tuesday, I was working with database engineers, backend developers, and the infrastructure team. Task: I was coordinating between teams: making sure everyone understood what was being tried, communicating with leadership, and documenting decisions for our post-mortem. Action: I opened a Slack war room and established a 'single source of truth' channel where decisions were logged. I asked clarifying questions to make sure the database team and backend team understood each other's constraints. When someone proposed an aggressive recovery method, I asked about rollback risk. We chose a more conservative approach. Result: We recovered in 90 minutes with no further data loss. More importantly, the team told me afterward that having clear communication made a stressful situation manageable. It reinforced for me how much incident management is about coordination, not just technical skill.

159

How do you handle flaky alerts or alert fatigue?

Reference answer

Review alert thresholds regularly, group similar alerts, suppress low-priority noise, and add alert deduplication or alert routing logic. The goal is actionable alerts only.

160

Explain the difference between SLIs, SLOs, and SLAs. Then tell me which one you'd change first if reliability was declining.

Reference answer

This question tests practical understanding versus textbook knowledge. The SLI is almost always the answer, because declining reliability often means you're measuring the wrong thing. Candidates lose points when they give definitions only and don't address the 'which one first' part with a real scenario.

161

What is caching?

Reference answer

Caching is the act of storing data, typically data that changes infrequently, in fast storage such as memory so that it can be reused later without being fetched or recomputed. It is frequently applied to boost performance and lessen network load.
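
As a minimal in-memory example, Python's standard `functools.lru_cache` can cache the result of a slow lookup. The `load_region_config` function and its return values are hypothetical stand-ins for a real backend call.

```python
from functools import lru_cache

backend_calls = {"count": 0}  # tracks how often the slow path actually runs


@lru_cache(maxsize=128)
def load_region_config(region):
    """Stand-in for a slow lookup, such as a database or network call."""
    backend_calls["count"] += 1
    return (region, 3)  # (region name, replica count); immutable so it is safe to share


first = load_region_config("eu-west")
second = load_region_config("eu-west")  # served from the cache, no backend call
stats = load_region_config.cache_info()
print(backend_calls["count"], stats.hits, stats.misses)

# When the underlying data changes, the cache must be invalidated explicitly.
load_region_config.cache_clear()
```

The explicit `cache_clear()` call illustrates the main cost of caching: stale entries must be invalidated or expired when the source data changes.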

162

Describe a situation where you had to advocate for reliability improvements to stakeholders. How did you make your case?

Reference answer

When advocating for reliability improvements, I:

- Quantify the Impact: I start by quantifying the cost of downtime or reliability issues in terms of lost revenue, customer trust, and operational inefficiency.
- Use Metrics: I use metrics like Mean Time to Recovery (MTTR), service-level objectives (SLOs), and error budgets to show how reliability improvements will enhance system performance.
- Propose Actionable Solutions: I suggest practical solutions, such as automated testing, canary deployments, and better monitoring, and explain how each would improve the system's reliability.
- Showcase ROI: I present the ROI of investing in reliability, demonstrating how it will reduce incident costs, improve customer satisfaction, and increase uptime.

This data-driven approach helps stakeholders understand the tangible benefits of investing in reliability improvements.

163

Scenario: You are facing frequent production outages due to sudden traffic spikes. How would you solve this?

Reference answer

- Implement auto-scaling to dynamically add or remove resources based on demand, ensuring the system can handle traffic spikes without manual intervention.
- Use CDNs to cache static content and reduce load on backend servers.
- Optimize database queries and use read replicas to distribute the load.
- Add rate limiting and throttling to control traffic and prevent the system from being overwhelmed.
- Ensure load balancers are properly configured to distribute traffic evenly across servers.
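
Rate limiting is often implemented with a token bucket, which allows short bursts while capping the sustained rate. The sketch below is a single-process illustration; the class name, parameters, and injectable `clock` are assumptions, and a real deployment would usually enforce limits at the load balancer or in a shared store.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second (sustained rate)
        self.capacity = capacity      # maximum tokens held (burst size)
        self.clock = clock            # injectable so tests can fake time
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True if the request may proceed, consuming `cost` tokens."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should reject, e.g. with HTTP 429
```

With `rate=100` and `capacity=200`, a client can burst up to 200 requests at once but is held to about 100 requests per second afterwards.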

164

How will you secure your Docker containers?

Reference answer

To secure your Docker containers, follow these guidelines:

- Choose third-party containers carefully.
- Enable Docker Content Trust.
- Set resource limits for your containers.
- Consider a third-party security tool.
- Use Docker Bench for Security.

165

Describe a rollback strategy for a failed deployment in a continuous delivery pipeline.

Reference answer

A rollback strategy involves having a fully automated rollback process in the CI/CD pipeline. This includes: maintaining the previous stable version of the application and infrastructure (e.g., via blue-green deployments or canary releases), automated health checks after deployment (e.g., monitoring error rates and latency), and automatic rollback triggers if thresholds are breached. The rollback should restore the last known good version with minimal downtime, and the team should be alerted immediately. Post-rollback, the failure root cause is investigated before re-deploying.
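
The automatic rollback trigger can be sketched as a small decision function. Everything here is hypothetical: the threshold values, the metric names, and the `deploy`, `read_metrics`, and `notify` callables stand in for whatever your pipeline and monitoring system provide.

```python
def evaluate_release(metrics, max_error_rate=0.02, max_p95_ms=500.0):
    """Return the names of the health thresholds the new release breaches."""
    breaches = []
    if metrics["error_rate"] > max_error_rate:
        breaches.append("error_rate")
    if metrics["p95_ms"] > max_p95_ms:
        breaches.append("p95_ms")
    return breaches


def deploy_with_rollback(new_version, last_good, deploy, read_metrics, notify):
    """Deploy, check health, and restore the last known good version on a breach."""
    deploy(new_version)
    breaches = evaluate_release(read_metrics())
    if breaches:
        deploy(last_good)  # automatic rollback to the last known good version
        notify(f"rolled back {new_version} to {last_good}: {', '.join(breaches)}")
        return last_good
    return new_version
```

In practice the health check would be sampled several times over a soak period rather than once, so that a single noisy reading does not trigger a rollback.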

166

Explain the concept of Service Level Objectives (SLOs) and Service Level Indicators (SLIs).

Reference answer

SLIs are metrics that measure a specific aspect of reliability, such as latency, error rate, or uptime. SLOs are target thresholds for these SLIs (e.g., 99.9% uptime). SREs use SLOs to define acceptable reliability levels and make data-driven decisions about balancing feature velocity with stability. If performance falls below the SLO, engineering efforts are prioritized to restore reliability.

167

Explain the term SLO.

Reference answer

A Service Level Objective (SLO) is a target for service quality, measured by a service level indicator and usually expressed as a percentage. Comparing actual performance against the SLO shows how close the service is to what was expected. An SLO is typically driven by customer expectations, but it can also be set by management as a way to monitor performance. SLOs are important because they help organizations understand when they are underperforming and set targets for improvement. By setting targets, managers have something to strive toward and can motivate employees to work harder. When you're setting up an SLO, remember that it's not just about what your customers are getting right now; it's also about what they could be getting in the future. So think about both short-term and long-term goals when defining your SLO. The main objective of an SLO is to ensure that customers receive quality service, as measured by the:

- Completeness of order fulfilment.
- Quality of product.
- Timeliness of delivery.
- Accuracy and completeness of the information provided to customers.
- Communication and support provided by employees.

168

What's the difference between continuous integration (CI) and continuous deployment (CD)?

Reference answer

- CI: Automatically tests and integrates code changes into a shared repository.
- CD: Automates the release of code into production after it passes tests.

169

How do you approach resource capacity planning to ensure system performance?

Reference answer

In my role at Deutsche Telekom, I utilized tools like Prometheus and Grafana for monitoring. I analyzed historical usage patterns and collaborated with development to understand upcoming features. By forecasting resource needs, I adjusted our Kubernetes clusters, which resulted in a 20% cost saving while improving application response times by 15%.

170

How would you manage incident response and postmortems in a production environment?

Reference answer

- Incident Response: Acknowledge, diagnose, resolve, and document the incident. Communication and coordination are key during incidents.
- Postmortems: Conduct blameless postmortems to identify root causes and implement preventative measures.

171

How would you troubleshoot high CPU usage on a Linux server?

Reference answer

Use tools like top, htop, pidstat, and strace. Check for recent deployments, memory leaks, or runaway processes. Confirm if it's a system-wide issue or tied to a specific service.

172

How do you prioritize which alerts to respond to first during an incident?

Reference answer

The first step to prioritizing alerts is to understand the severity level of the incident, who it affects, and what kind of impact it will have on your customers or systems. After determining the severity level, you can prioritize alerts starting from a SEV-1 level (highest, greatest impact) down to a SEV-5 (lowest, smallest impact).

173

Describe a time you transitioned from a monolithic architecture to microservices.

Reference answer

In one of my previous roles, we had a monolithic application that was becoming increasingly difficult to manage and scale. The application had grown over years with different teams adding various features, resulting in a complex codebase and a high number of interdependencies. This was leading to slower deployment cycles and an increase in the number of issues causing system downtime. Recognizing that the monolithic architecture was holding us back, I proposed transitioning to a microservices architecture. I presented the benefits like improved scalability, faster deployment cycles and isolation of issues to management. I also discussed potential challenges such as managing inter-service communication and data consistency. After getting approval, I worked closely with the development team to carve out independent services from the monolith one by one, ensuring each new service was fully functional and tested before moving onto the next. Over time, we managed to successfully move most of the application functionality to microservices. As a result, our deployment cycle shortened significantly as teams could work on their respective services independently, system reliability improved due to fault isolation, and overall system performance improved due to the ability to individually scale services based on their specific needs. It was a significant improvement to our system's design and demonstrated how even major architectural changes can pay off.

174

Describe a time when a critical system failed despite your best efforts. What was the outcome, and what did you personally learn from the experience?

Reference answer

Situation: We had recently deployed a new, highly anticipated version of our core inventory management service. This service is absolutely critical for tracking product stock across all our warehouses and online storefronts, directly impacting order fulfillment and preventing overselling. Despite extensive unit, integration, and even some load testing in staging environments, a few days after the deployment, during an unexpected surge in order volume (coincidental with a flash sale that exceeded our typical Black Friday traffic), the service began to exhibit severe degradation. It started failing to process real-time inventory updates, leading to widespread inconsistencies in our stock counts and, consequently, significant overselling issues that frustrated customers and caused logistical nightmares.

Task: My immediate task was to stabilize the inventory service, prevent further data corruption, and restore accurate, real-time inventory tracking to stop the bleeding and mitigate customer impact. Once the immediate crisis was averted, my larger task was to lead the blameless post-mortem analysis. This involved deep investigation into why our extensive pre-deployment testing failed to catch this critical vulnerability, understanding the exact failure mode, and then driving the implementation of systemic changes to prevent similar incidents in the future.

Action: During the incident, I immediately jumped into action, working with the on-call team. Through rapid analysis of logs and metrics, I quickly identified that the core issue stemmed from a newly introduced caching layer in the latest release. Specifically, it was its aggressive cache refresh logic under extremely high concurrency, combined with its interaction with a legacy database connection pool. The cache was attempting to refresh stale data far too frequently, overwhelming the database with an unsustainable number of concurrent connections, which caused the database to reject connections and ultimately led to a cascading failure across the inventory service. My first and most critical action was to initiate an immediate rollback to the previous stable version of the inventory service. This critical decision quickly stabilized the system, allowing us to regain accurate inventory counts and stop the overselling.

Once the system was stable, I facilitated a blameless post-mortem session involving the development, QA, and SRE teams. We systematically deep-dived into the specific failure mode, meticulously recreating it in a dedicated test environment. We discovered that while our load tests had simulated high request volumes, they hadn't accurately mimicked the bursty, highly concurrent nature of real-world flash sale traffic combined with the specific cache invalidation patterns. Furthermore, the test environment had a more generous database connection limit than our production environment, masking the underlying resource contention issue.

Result: The immediate outcome was the successful restoration of the inventory service and the cessation of overselling, albeit with some customer goodwill lost due to initial fulfillment issues. The deeper, more significant outcome was a clear, shared understanding of the limitations of our pre-production load testing methodologies, particularly concerning bursty traffic patterns and realistic resource constraints on shared infrastructure.

Personally, I gleaned several critical lessons. First, the profound importance of realistic load testing. It's not just about raw request volume, but accurately simulating actual user behavior, including rapid cache invalidations and the specific concurrency patterns of downstream services. We subsequently integrated "chaos engineering" principles into our staging environments, deliberately introducing resource constraints like limited database connections, network latencies, and forced cache invalidations to proactively uncover such hidden interdependencies and failure modes. Second, the incident reinforced the critical value of a robust and quickly executable rollback strategy; having this as a primary incident response tool was paramount in minimizing the incident's duration and impact. Third, it underscored the power of a blameless post-mortem culture. By focusing on systemic and process improvements rather than individual blame, the entire team openly shared insights, fostering psychological safety and enabling us to collectively devise far more effective and sustainable solutions. This experience led to a significant overhaul of our load testing frameworks and the introduction of a mandatory "Production Readiness Review" checklist for all critical service deployments, requiring explicit sign-off on various resilience aspects, including database connection pooling strategies and caching mechanisms under extreme, simulated load.

175

How would you use Infrastructure as Code (IaC) to manage and provision computing resources?

Reference answer

I would use tools like Terraform or AWS CloudFormation to define and provision data center infrastructure using a high-level configuration syntax. This approach ensures that the infrastructure setup is repeatable and consistent, and can be version controlled and validated. For instance, using Terraform, I could codify the setup of a virtual private cloud (VPC), including subnets, security groups, and instance types, and then use this code to spin up identical environments in different regions or accounts.

176

What are some best practices for incident documentation?

Reference answer

Best practices for incident documentation include detailed logging of incident timelines, steps taken to mitigate the issue, root cause analysis, impact assessment, and lessons learned. Documentation should be clear, concise, and accessible to all relevant team members.

177

Write a program that returns true if all asteroids can be destroyed and false otherwise. You are given an integer mass that represents a planet's initial mass, and an integer array called asteroids, where asteroids[i] represents the mass of the ith asteroid. You may make the planet collide with the asteroids in whatever order you like. If the planet's mass is greater than or equal to the asteroid's mass, the asteroid is destroyed and the planet gains the asteroid's mass. Otherwise, the planet is destroyed.

Reference answer

One of many possible solutions is to sort the asteroid array and collide with the asteroids in ascending order of mass. That way the planet always meets the lightest remaining asteroid first and gains as much mass as possible before facing the heavier ones. If the planet's mass is ever less than the next asteroid's mass, the planet is destroyed and we return false. So the solution can be:

    import java.util.Arrays;

    class Solution {
        public boolean asteroidsDestroyed(int mass, int[] asteroids) {
            // Sort so the planet meets the asteroids from lightest to heaviest
            Arrays.sort(asteroids);
            // Track the running mass as a long so the total cannot overflow an int
            long current = mass;
            for (int asteroid : asteroids) {
                // The planet is lighter than this asteroid, so it is destroyed
                if (current < asteroid) return false;
                // The planet absorbs the asteroid's mass
                current += asteroid;
            }
            // Every asteroid was destroyed
            return true;
        }
    }

Sorting takes O(n log n) time and the single pass over the sorted array takes O(n), so the time complexity of the solution is O(n log n).

178

What is the purpose of a Service Level Agreement (SLA)?

Reference answer

An SLA is a formal agreement between a service provider and a customer that defines the expected level of service, including uptime, performance, and response times.

179

Deployment vs. StatefulSet

Reference answer

- Deployment: Manages stateless apps (e.g., web servers).
- StatefulSet: Manages stateful apps (e.g., databases).

180

Design an SLO for a payment processing service that handles 50,000 transactions per hour.

Reference answer

Start with the SLI: what counts as a successful transaction? Is it HTTP 200? Or does it need to include end-to-end processing confirmation from the payment gateway? Those are different measurements and the SLO math changes depending on which one you pick. Then the SLO target: 99.95% over a 28-day rolling window gives you roughly 20 minutes of error budget per window (0.05% of 40,320 minutes). At 50,000 transactions per hour, that's about 16,800 failed transactions (0.05% of 33.6 million) before you've burned the budget. Is the business comfortable with that number? That conversation, between the SRE team and the product org, is the actual skill being tested.
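The error budget arithmetic can be checked with a short sketch. The class and method names below are hypothetical, and the request-based figure assumes a steady 50,000 transactions per hour across the whole window, which real traffic will not follow exactly.

```java
/** Hypothetical helper for error budget arithmetic; not taken from any library. */
final class ErrorBudget {
    /** Minutes of full outage permitted by an availability SLO over the window. */
    static double allowedDowntimeMinutes(double sloPercent, int windowDays) {
        double windowMinutes = windowDays * 24.0 * 60.0;
        return windowMinutes * (100.0 - sloPercent) / 100.0;
    }

    /** Failed requests permitted by a request-based SLO, assuming a steady request rate. */
    static long allowedFailures(double sloPercent, int windowDays, long requestsPerHour) {
        double totalRequests = requestsPerHour * 24.0 * windowDays;
        // Round to the nearest whole request to absorb floating-point noise
        return Math.round(totalRequests * (100.0 - sloPercent) / 100.0);
    }

    public static void main(String[] args) {
        System.out.println(allowedDowntimeMinutes(99.95, 28) + " minutes of budget");
        System.out.println(allowedFailures(99.95, 28, 50_000) + " failed transactions");
    }
}
```

For a 99.95% target over 28 days this works out to about 20.16 minutes of full outage, or 16,800 failed transactions at the stated rate.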

181

How do you ensure high availability in your systems?

Reference answer

High availability is ensured through redundancy, failover mechanisms, load balancing, and designing systems to avoid single points of failure.

182

Implement a data structure that supports the operations flipBits(ith bit, # of bits) and getBit(ith bit), billions of possible bits, <= O(n log n) preferably.

Reference answer

Use a segment tree or binary indexed tree (Fenwick tree) to handle range updates and point queries. For flipBits, update the range [i, i+#bits-1] with a flip operation (XOR). For getBit, query the value at position i. Both operations can be O(log n) using lazy propagation in a segment tree, where n is the number of bits.
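One way to make this concrete for billions of bits is a sparse Fenwick tree over the XOR "difference" of the bit array: flipping a range toggles two boundary markers, and a bit's value is the parity of the markers at or before it. The sketch below is one possible variant, not the only valid answer; the class name and the choice of a `HashSet` for sparse storage are assumptions, and it assumes all bits start at 0.

```java
import java.util.HashSet;
import java.util.Set;

/** Sparse Fenwick (binary indexed) tree with XOR, storing only nodes whose parity is 1. */
final class SparseBitFlipper {
    private final long size;                             // number of addressable bits
    private final Set<Long> oddNodes = new HashSet<>();  // Fenwick nodes currently holding 1

    SparseBitFlipper(long size) {
        if (size <= 0) throw new IllegalArgumentException("size must be positive");
        this.size = size;
    }

    /** Flips every bit in [start, start + count). Costs O(log size) set operations. */
    void flipBits(long start, long count) {
        if (start < 0 || count <= 0 || start > size - count) {
            throw new IllegalArgumentException("range out of bounds");
        }
        toggleBoundary(start);
        // The closing boundary is only needed if the range ends before the last bit
        if (start + count < size) toggleBoundary(start + count);
    }

    /** Returns the bit at index: the XOR of all boundary markers at positions <= index. */
    boolean getBit(long index) {
        if (index < 0 || index >= size) throw new IllegalArgumentException("index out of bounds");
        boolean bit = false;
        for (long node = index + 1; node > 0; node -= node & -node) {
            bit ^= oddNodes.contains(node);
        }
        return bit;
    }

    /** Point update of the difference array at a 0-based position. */
    private void toggleBoundary(long position) {
        for (long node = position + 1; node <= size; node += node & -node) {
            // XOR with 1: insert the node if absent, remove it if present
            if (!oddNodes.add(node)) oddNodes.remove(node);
        }
    }
}
```

Memory grows with the number of flips rather than the number of bits, which is what makes a multi-billion-bit index space practical here.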

183

What is DevOps?

Reference answer

DevOps (from Dev for Development and Ops for Operations) is a software development practice built on collaboration between software engineers and IT operations staff. It is as much a movement as a process: it seeks to bring the two groups together so that they work more closely on building, releasing, and running software. This collaboration helps to improve overall productivity, while also providing better quality assurance and faster time to market. DevOps is a relatively new concept, but it has quickly become one of the most important aspects of modern software development. In recent years, a number of enterprises have adopted DevOps practices as part of their software development lifecycle (SDLC), which has helped them increase the overall speed and quality of their products.

184

What is the role of a Site Reliability Engineer (SRE) and how does it differ from a traditional system administrator?

Reference answer

An SRE bridges software engineering and IT operations, focusing on automating operations, improving system reliability, and scaling infrastructure. Unlike a traditional sysadmin who manually handles tasks, an SRE uses coding and automation to solve operational problems, sets service level objectives (SLOs), and manages incident response with a focus on reducing manual toil and increasing system resilience.

185

Why do you think that you will become a Site Reliability Engineer?

Reference answer

I have a practical understanding and working knowledge of DevOps, with a deep understanding of:
- The inter-relationship of SRE with DevOps and other popular frameworks
- The underlying principles behind SRE
- Service Level Objectives (SLOs) and their user focus
- Service Level Indicators (SLIs) and the modern monitoring landscape
- Error budgets and the associated error budget policies
- Toil and its effect on an organization's productivity
- Some practical steps that can help to eliminate toil
- Observability as something to indicate the health of a service
- SRE tools, automation techniques, and the importance of security
- Anti-fragility, our approach to failure and failure testing
- The organizational impact that introducing SRE brings

Hence, I feel Site Reliability Engineer is the perfect job role for me.

186

How do you troubleshoot high CPU usage on a Linux server?

Reference answer

Use commands like 'top' or 'htop' to identify processes consuming CPU, check 'ps aux' for detailed process information, use 'strace' to trace system calls, analyze logs for anomalies, and inspect application configuration or code for infinite loops or inefficient operations.

187

What is “distributed tracing,” and how would you implement it in a microservices architecture?

Reference answer

Distributed tracing allows you to track requests across multiple microservices, providing visibility into how requests flow through the system. To implement:
- Instrumentation: Use tracing libraries like OpenTelemetry, Jaeger, or Zipkin to instrument services.
- Propagate trace context: Ensure trace IDs are passed between services in headers (e.g., `X-B3-TraceId`).
- Aggregation tools: Use a central platform like Jaeger or AWS X-Ray to collect and visualize traces, helping to pinpoint bottlenecks or failures.
- Tagging and logging: Add key metadata (e.g., service name, request IDs) to each trace span for detailed analysis.
- Monitor latency and errors: Track SLIs like service latency, request counts, and error rates at each hop in the system.

Distributed tracing is critical for identifying performance bottlenecks and understanding dependencies in a microservices environment.
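The context propagation step can be illustrated with a minimal sketch. It reuses the trace ID from the incoming request, or starts a new trace if none is present, and records the caller's span as the parent of the span for the outgoing call. The header names follow the B3 convention used by Zipkin; the class, method, and map-based header handling are hypothetical simplifications, and in practice an instrumentation library such as OpenTelemetry would normally do this for you.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical, simplified B3 header propagation; assumes exact-case header keys. */
final class B3Propagation {
    static final String TRACE_ID = "X-B3-TraceId";
    static final String SPAN_ID = "X-B3-SpanId";
    static final String PARENT_SPAN_ID = "X-B3-ParentSpanId";

    /** Builds the tracing headers for an outgoing call from the incoming request's headers. */
    static Map<String, String> outgoingHeaders(Map<String, String> incoming) {
        Map<String, String> outgoing = new HashMap<>();

        // Keep the same trace ID for the whole request path, or start a new 128-bit one
        String traceId = incoming.get(TRACE_ID);
        if (traceId == null) {
            traceId = randomHex64() + randomHex64();
        }
        outgoing.put(TRACE_ID, traceId);

        // The span that called us becomes the parent of the span for the outgoing call
        String callerSpan = incoming.get(SPAN_ID);
        if (callerSpan != null) {
            outgoing.put(PARENT_SPAN_ID, callerSpan);
        }
        outgoing.put(SPAN_ID, randomHex64());
        return outgoing;
    }

    /** Returns 64 random bits as 16 lowercase hex characters. */
    private static String randomHex64() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }
}
```

Each service would attach the returned headers to its downstream HTTP or RPC calls so the tracing backend can stitch the spans into one trace.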

188

How do you stay current with new tools and technologies?

Reference answer

I spend time reading—I follow several SRE and infrastructure blogs, and I read one technical book every quarter or so. The SRE Book from Google is required reading in this field. But honestly, the best learning comes from actually breaking things and fixing them. We use a lab environment where we experiment with new tools before bringing them to production. We just evaluated three different service mesh tools because our microservices architecture was getting complicated. I spent a week setting up Istio and Linkerd in our lab, ran some load tests, and reported back to the team. We ended up not adopting either one—we realized we didn't have the operational maturity for a service mesh yet—but I learned a ton. I also attend a few conferences per year. I'm selective—I go to talks on topics I actually need to learn, not just for the networking. And honestly, I learn a lot from my team. When someone solves a problem I haven't encountered, I ask them to walk me through it.

189

Given a matrix (not necessarily a square), a starting position on the matrix, and a set of valid moves, count the total number of paths from the starting position

Reference answer

Use Depth-First Search (DFS) with memoization or dynamic programming. Define a recursive function that explores all valid moves from the current position, ensuring you stay within bounds. Count paths by summing the results of recursive calls. Memoize results for each position to avoid recomputation. Base case: if the position is a target or boundary condition, return 1 for a valid path.
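A minimal sketch of the memoized DFS follows. The question leaves the end condition open, so this version assumes a path ends at a given target cell and that the moves can never return to a cell already visited (for example, only right and down); with moves that allow cycles the number of paths is unbounded and this recursion would not terminate. The class name and parameters are hypothetical.

```java
import java.util.Arrays;

/** Counts paths on a rows x cols grid; assumes the move set cannot form cycles. */
final class PathCounter {
    /** Each move is a {rowDelta, colDelta} pair. */
    static long countPaths(int rows, int cols, int startRow, int startCol,
                           int targetRow, int targetCol, int[][] moves) {
        long[][] memo = new long[rows][cols];
        for (long[] memoRow : memo) {
            Arrays.fill(memoRow, -1L); // -1 marks "not computed yet"
        }
        return dfs(startRow, startCol, targetRow, targetCol, moves, memo);
    }

    private static long dfs(int row, int col, int targetRow, int targetCol,
                            int[][] moves, long[][] memo) {
        // Stepping off the grid is not a valid path
        if (row < 0 || col < 0 || row >= memo.length || col >= memo[0].length) return 0;
        // Reaching the target completes exactly one path
        if (row == targetRow && col == targetCol) return 1;
        // Reuse the count if this cell has already been solved
        if (memo[row][col] != -1L) return memo[row][col];

        long total = 0;
        for (int[] move : moves) {
            total += dfs(row + move[0], col + move[1], targetRow, targetCol, moves, memo);
        }
        memo[row][col] = total;
        return total;
    }
}
```

With memoization each cell is solved once, so the running time is O(rows × cols × number of moves).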

190

How do you load a Linux kernel?

Reference answer

The Linux kernel is loaded by the bootloader (e.g., GRUB) from the boot partition. The bootloader reads the kernel image (vmlinuz) and initial RAM disk (initrd or initramfs) into memory, then passes control to the kernel. The kernel initializes hardware, mounts the root filesystem, and starts the init process.

191

Tell me about a time you had to learn something new quickly on the job.

Reference answer

Situation: My company decided to migrate from on-premises infrastructure to Kubernetes, and I had no Kubernetes experience. Task: We had six weeks before the migration, and I needed to be proficient enough to troubleshoot issues and make architecture decisions. Action: I took an online course, read the official Kubernetes documentation, and set up a test cluster. I also paired with a senior engineer who knew Kubernetes to review my decisions and help me understand the operational model. I focused on the 20% of concepts that applied to our use case rather than trying to learn everything. Result: By migration day, I could handle basic troubleshooting and we caught several architectural issues in our planning. Six months in, I'm confident enough to mentor new team members on Kubernetes basics. The key was being intentional about learning—focusing on what mattered to our specific situation.

192

What is a zombie process?

Reference answer

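A zombie is a process that has terminated but still has an entry in the process table because its parent has not yet read its exit status. It is cleared when the parent reaps it by calling wait() or waitpid(); if the parent itself exits, the zombie is adopted by init (PID 1), which reaps it.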

193

What is virtual memory?

Reference answer

Virtual memory is a storage allocation scheme in which secondary memory can be addressed as though it were part of the main memory. The addresses a program may use to reference memory are distinguished from the addresses the memory system uses to identify physical storage sites, and program-generated addresses are translated automatically to the corresponding machine addresses.

194

What is the role of a Site Reliability Engineer and how does it differ from traditional operations?

Reference answer

The role of a Site Reliability Engineer (SRE) is to bridge software engineering and operations to ensure resilient, scalable systems and measurable reliability. It differs from traditional operations by applying software engineering practices to operational problems, automating manual tasks, and focusing on reliability through metrics like SLOs and error budgets, rather than solely on maintaining system uptime.

195

How do you approach the challenge of maintaining consistency in a distributed system?

Reference answer

In distributed systems, ensuring consistency can be difficult due to network partitions and latency. Approaches to maintain consistency include:
- Strong Consistency: Use consensus algorithms like Paxos or Raft to ensure data is consistently written across all nodes.
- Eventual Consistency: Use systems like Cassandra or DynamoDB, where consistency is achieved over time, and ensure the system can handle eventual consistency where it's acceptable.
- CAP Theorem: Understand the trade-offs between consistency, availability, and partition tolerance and design systems accordingly based on business needs.
- Quorum Reads/Writes: Implement quorum-based reads/writes to strike a balance between performance and consistency.
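The quorum trade-off can be made concrete with the usual overlap rule: with N replicas, a write quorum of W, and a read quorum of R, every read set intersects every write set when R + W > N. The sketch below only checks that arithmetic; the class and method names are hypothetical, and real systems add further caveats (sloppy quorums, concurrent writes) that it does not model.

```java
/** Hypothetical helper for reasoning about quorum sizes; not tied to any database. */
final class QuorumConfig {
    /** True when any read quorum must overlap any write quorum (R + W > N). */
    static boolean readOverlapsWrite(int replicas, int writeQuorum, int readQuorum) {
        validate(replicas, writeQuorum, readQuorum);
        return readQuorum + writeQuorum > replicas;
    }

    /** Replica failures tolerated while both reads and writes can still reach quorum. */
    static int toleratedFailures(int replicas, int writeQuorum, int readQuorum) {
        validate(replicas, writeQuorum, readQuorum);
        return replicas - Math.max(writeQuorum, readQuorum);
    }

    private static void validate(int replicas, int writeQuorum, int readQuorum) {
        if (replicas < 1 || writeQuorum < 1 || readQuorum < 1
                || writeQuorum > replicas || readQuorum > replicas) {
            throw new IllegalArgumentException("quorum sizes must be between 1 and replicas");
        }
    }
}
```

For example, N=3 with W=2 and R=2 overlaps and tolerates one replica failure, while W=1 and R=1 is faster but allows a read to miss the latest write.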

196

How to terminate a running process?

Reference answer

Use the kill command with a signal, e.g., `kill -9 <PID>` sends SIGKILL to forcefully terminate the process. `kill -15 <PID>` sends SIGTERM for graceful termination. Alternatively, use `pkill` or `killall` with process names. For interactive termination, use Ctrl+C (SIGINT) in the terminal.

197

What is the default signal that is generated when sending a kill command to a process in Linux?

Reference answer

The default signal for the kill command without an explicit signal is SIGTERM (signal 15). This requests the process to terminate gracefully, allowing it to clean up resources. To force termination, use SIGKILL (signal 9).

198

What is Multithreading? What are the benefits of this?

Reference answer

Multithreading is a programming technique that lets a single process run multiple threads of execution concurrently. The threads share the process's memory and resources, and the operating system schedules them onto the available processors: on a multi-core machine they can run truly in parallel, while on a single processor they are interleaved through time slicing. By splitting up the workload across threads, it is possible to make progress on several tasks at once. This can be helpful for processing large amounts of data, or when some tasks spend much of their time waiting on I/O. Multithreading has many benefits. It can increase performance and reduce the execution time of long-running computations. It can also improve the responsiveness of applications and reduce latency, because a slow operation no longer blocks the rest of the program. The main cost is complexity: access to shared data must be synchronized to avoid race conditions and deadlocks. Multithreading is also useful in environments such as IoT devices, where sensor readings, network traffic, and other processing have to be handled concurrently on the same device.
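As a small illustration of splitting a workload across threads, the sketch below sums an array in chunks using a fixed thread pool from `java.util.concurrent`. The class name and chunking scheme are hypothetical choices for this example; each task works on its own slice, so no locking is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical example: sums an array by giving each worker thread its own slice. */
final class ParallelSum {
    static long sum(int[] values, int threads) {
        if (threads < 1) throw new IllegalArgumentException("threads must be at least 1");
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (values.length + threads - 1) / threads; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int start = 0; start < values.length; start += chunk) {
                final int from = start;
                final int to = Math.min(values.length, start + chunk);
                // Each task reads only its own slice, so no shared state is mutated
                parts.add(pool.submit(() -> {
                    long partial = 0;
                    for (int i = from; i < to; i++) partial += values[i];
                    return partial;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that slice is finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while summing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `ParallelSum.sum(data, 4)` gives the same result as a sequential loop; whether it is faster depends on the array size and the number of available cores.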

199

How to reduce Docker image size?

Reference answer

Use multi-stage builds and Alpine Linux base images.

200

What is the role of a 'runbook' in incident response?

Reference answer

A runbook is a documented procedure for handling specific incidents or operations (e.g., restarting a service, scaling a cluster). It provides step-by-step instructions, expected outcomes, and escalation paths. Runbooks enable consistent, fast, and accurate responses, especially by on-call engineers or automated systems. They reduce toil and help prevent mistakes during high-pressure situations.

Common SRE Interview Questions and Answers | SPOTO
