Mock Interview Questions for Site Reliability Engineers

1

What do you understand by the terms SLIs, SLOs, and SLAs?

Reference answer

SLI (Service Level Indicator) is a quantitative measure of a specific aspect of service performance, such as latency, error rate, or uptime. SLO (Service Level Objective) is a target value or range for an SLI, representing the desired level of service reliability (e.g., 99.9% uptime). SLA (Service Level Agreement) is a formal contract between a service provider and a customer that specifies the agreed-upon SLOs and consequences for not meeting them, such as penalties or credits. SLIs are the actual measurements, SLOs are the internal goals, and SLAs are the external commitments.

2

What is a postmortem and why is it important in SRE?

Reference answer

A postmortem is a written analysis of an incident that documents what happened, why it happened, and what actions will prevent recurrence. It is important because it fosters a blameless culture focused on learning and system improvement. By analyzing failures, SRE teams can implement changes to infrastructure, processes, or code to increase future reliability.

3

Explain readiness vs. liveness probes.

Reference answer

Readiness probes indicate whether a container is ready to serve traffic; if it fails, the pod is removed from service endpoints. Liveness probes indicate whether a container is running; if it fails, the container is restarted. Readiness ensures traffic is only sent to healthy pods, while liveness recovers stuck or deadlocked containers.

4

How do you ensure security in SRE?

Reference answer

Security is ensured through regular vulnerability assessments, implementing best practices like least privilege access, encryption, and monitoring for suspicious activities.

5

What is the difference between SNAT and DNAT?

Reference answer

Source Network Address Translation (source NAT or SNAT) is a technique that allows traffic from a private network to go out to the internet. Destination Network Address Translation (DNAT) is a technique for transparently changing the destination IP address of an en-route packet and performing the inverse function for any replies. Any router situated between two endpoints can perform this transformation of the packet.

Differences:

- On either side of a NAT device there is an outside world and an inside world. When the inside world communicates with the outside world, SNAT happens. When the outside world communicates with the inside world, DNAT happens.
- When many internal private IP addresses are translated to one public IP address, it is called static SNAT. When many internal private IP addresses are translated to many public IP addresses, it is called dynamic SNAT.
- Source NAT changes the source address in the IP packet header. Destination NAT changes the destination address in the IP packet header.
- SNAT allows multiple hosts on the “inside” to reach any host on the “outside”. DNAT allows multiple hosts on the “outside” to reach any host on the “inside”.

6

What is the value of learning about the error budget in SRE?

Reference answer

1. It is not crucial.
2. It is used to reach a compromise with the product team, leaving room for changes, mistakes, or potential outages.
3. It is used to estimate the level of availability that needs to be achieved, rather than aiming for 100%.
4. It is not related to SRE.

7

What is a service-level indicator (SLI)?

Reference answer

An SLI is a quantitative metric that measures the performance or reliability of a service. Examples include the percentage of successful requests, average request latency, or system uptime. SLOs are built upon one or more SLIs.

8

How do you define and measure Service Level Objectives (SLOs) and Service Level Indicators (SLIs)?

Reference answer

SLOs are typically written and set by product managers to meet or exceed promises made in the company's SLA. SLOs are typically written to give teams an error budget and room for experimentation. SLIs are the actual measured performance of the service being provided indicating whether the performance is meeting SLOs and SLAs.

9

What is an Error Budget in SRE?

Reference answer

“An error budget represents the allowable level of failure for a system within a given time frame. It is calculated as 1 - SLO. For example, if an SLO guarantees 99.95% uptime, the error budget is 0.05%, which equates to 21.6 minutes of…”
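
The arithmetic behind the quoted figure can be sketched as follows. This is a minimal illustration, not part of the quoted answer: the function name `error_budget_minutes` is hypothetical, and the 30-day window is an assumption (the quote does not state its window, but 30 days is consistent with the 21.6-minute figure).

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime allowed by an availability SLO, in minutes.

    window_days defaults to 30, which is an assumption made for this sketch.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    budget_fraction = 1 - slo_percent / 100   # error budget = 1 - SLO
    window_minutes = window_days * 24 * 60    # total minutes in the window
    return budget_fraction * window_minutes


if __name__ == "__main__":
    # A 99.95% SLO over 30 days leaves roughly 21.6 minutes of budget.
    print(round(error_budget_minutes(99.95), 1))
```

Changing `window_days` shows how the same SLO yields a different budget for a different measurement window.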

10

How will you secure your Docker containers?

Reference answer

Follow these practices to secure your Docker containers:

- Choose third-party containers with caution.
- Turn on Docker Content Trust.
- Limit the resources available to your containers.
- Consider using a third-party security product.
- Use Docker Bench for Security.

Beyond these questions, experienced candidates may also be asked questions based on their personal understanding of the system, such as:

- How can you strengthen the bond between the operations and IT teams?
- What is the distinction between site reliability engineering and DevOps?
- What actions would you take to develop a monitoring strategy for a service that does not have one?
- How can information technology infrastructure be scaled?
- What type of experience do you have building deployment automation code?
- Why would you want to be an SRE rather than an SDE? What piques your interest in this role?

11

Walk me through how you'd run an incident for a service that's returning elevated error rates but hasn't triggered any customer-facing alerts yet.

Reference answer

Five things need to show up in your answer: how you detected the problem before customers flagged it, how you'd classify severity when there's no customer-facing impact yet, who you'd loop in and through what channel, the decision between rolling back immediately versus investigating further while the service is partially degraded, and what the post-incident review process looks like afterward. That's five distinct elements and most candidates only hit three of them. Candidates who cover three get through. Candidates who cover two don't.

12

How do you handle noisy neighbors in a multi-tenant environment?

Reference answer

Noisy neighbors are managed through resource isolation techniques such as setting resource limits (CPU, memory), using cgroups, implementing quality of service (QoS) policies, and monitoring resource usage to detect and mitigate the impact on other tenants.

13

What is a 'canary deployment' and how does it help with reliability?

Reference answer

A canary deployment is a release strategy where a new version is rolled out to a small subset of users or servers before full deployment. This helps detect issues (e.g., bugs, performance regressions) early with minimal impact. If the canary shows no problems, the release is gradually expanded. SREs use canary deployments to reduce blast radius and ensure changes are safe before reaching all production traffic.
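
One common way to pick the "small subset of users" is to hash a stable identifier into a bucket, so the same user always sees the same version. The sketch below assumes this hashing approach; it is one option among several, and the function name `in_canary` and the 100-bucket scheme are hypothetical choices for illustration.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Return True if this user should be routed to the canary release.

    The user ID is hashed into one of 100 buckets; users whose bucket is
    below canary_percent get the new version. The assignment is stable, so
    raising the percentage only ever adds users to the canary.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(in_canary(u, 5) for u in users) / len(users)
    print(f"share of users on the canary at 5%: {share:.3f}")
```

Expanding the release then amounts to raising `canary_percent` in steps while watching error rate and latency for the canary group.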

14

How would you set up a high-availability (HA) system for a web application?

Reference answer

- Use load balancers to distribute traffic.
- Run multiple instances across availability zones.
- Replicate the database for failover.
- Use auto-scaling to handle traffic spikes.

15

What is a playbook, and how is it used in SRE?

Reference answer

A playbook is a comprehensive set of procedures and protocols for handling specific operational tasks and incidents. It provides detailed steps for troubleshooting, incident resolution, and routine maintenance, ensuring consistency and efficiency.

16

What is an error budget, and what is it used for?

Reference answer

Error budget defines the maximum amount of time a technical system can fail without contractual consequences. Error budget encourages the teams to minimize real incidents and maximize innovation by taking risks within acceptable limits.

17

Write a regular expression to validate an email address.

Reference answer

To validate an email address, you can use a regular expression that checks for the correct format. Here's a simple example: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
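
As a quick illustration, the pattern above can be exercised in Python as shown below. The helper name `looks_like_email` is hypothetical. Note that this is a format check only: a simple pattern like this will accept some invalid addresses and reject some valid ones.

```python
import re

# The same pattern as in the answer above, compiled once for reuse.
EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def looks_like_email(value: str) -> bool:
    """Return True if value matches the simple email pattern."""
    return EMAIL_RE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["alice@example.com", "bob@localhost", "no-at-sign.example.com"]:
        print(candidate, looks_like_email(candidate))
```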

18

How would you improve a system with high on-call alert fatigue?

Reference answer

Alert fatigue usually means you're alerting on symptoms that aren't actually user-impacting, or you're not setting appropriate thresholds. My approach is to audit the alerts. For each alert that's firing frequently, I ask: if this fires right now, would I wake up? If the answer is no, it shouldn't page the on-call engineer. It should go to a dashboard that on-call reviews during business hours. We had an alert for 'latency above 500ms' that was firing constantly. But when we looked at actual user impact, we weren't losing requests until latency hit 2 seconds. We also implemented alert suppression rules—during deployments, certain alerts get suppressed because we expect things to be in flux. We set up alert grouping so that if the same root cause triggers 50 alerts, on-call gets one notification instead of 50 pages. We also fixed some fundamental issues—our database was getting slow during backup windows, which triggered dozens of alerts. We moved to incremental backups and the problem went away. I also implemented an SLA for on-call: we shouldn't be paging more than once per shift on average. When we hit more than that, it's an organizational priority to fix it. Within six months, we cut false alerts by 80%.

19

What is the “Four Golden Signals” concept in SRE?

Reference answer

The Four Golden Signals are metrics used to measure the health of a system:

- Latency: Time taken to serve a request.
- Traffic: The demand placed on your system (e.g., requests per second).
- Errors: The rate of failed requests.
- Saturation: How close the system is to its full capacity.

20

What is Site Reliability Engineering (SRE)?

Reference answer

SRE is a discipline that applies software engineering principles to infrastructure and operations problems. The goal is to create scalable and highly reliable software systems.

21

How do you ensure the reliability of CI/CD pipelines?

Reference answer

- Automated Testing: Ensure unit, integration, and system tests are part of the pipeline.
- Parallelization: Speed up builds by running tests in parallel.
- Staging Environments: Deploy to a staging environment before production.
- Monitoring: Track build and deployment status in your CI/CD tools (e.g., Jenkins, CircleCI) to ensure builds and deployments are successful.
- Rollback mechanisms: Have easy and fast rollback mechanisms if deployments fail.

22

How does your team monitor their system and track "success"?

Reference answer

This is an excellent technical question to determine how you've set up monitoring and alerting tools and how you've helped define the "healthy" state of a system in the past. If you want to join an SRE team, you'll need to understand how you can leverage both internal and external outputs to determine overall system health. Then, you should be able to translate that information into insights and action for IT and engineering teams.

23

What are the pros and cons of horizontal vs. vertical scaling in cloud infrastructure?

Reference answer

- Horizontal Scaling: This involves adding more instances of services or servers to handle increased load.
  - Pros: High availability, fault tolerance, easier to scale out as demand grows.
  - Cons: More complex to manage, potential network latency between instances.
- Vertical Scaling: This involves adding more resources (CPU, RAM) to a single server.
  - Pros: Simple to implement and manage, no need to handle inter-instance communication.
  - Cons: Limited by the hardware of the machine, single point of failure, less flexible.

For cloud-native applications, horizontal scaling is generally preferred because it provides better redundancy and scalability.

24

How do you handle on-call rotations and what strategies do you use to manage burnout?

Reference answer

I handle on-call rotations by creating a fair and balanced schedule, ensuring that no one is overburdened. To manage burnout, I emphasize the importance of clear communication, regular breaks, and mental health support.

25

What is an APM tool?

Reference answer

Application Performance Monitoring (e.g., New Relic) tracks latency, errors, and throughput.

26

What techniques do you use for capacity planning?

Reference answer

Capacity planning involves analyzing historical usage data, forecasting future growth based on business projections, and modeling system load under peak conditions. This helps ensure infrastructure scales correctly to meet demand without over- or under-provisioning.

27

Tell me about an incident where you disagreed with the incident commander's decision during a live outage.

Reference answer

The right answer describes a specific moment where you pushed back on a mitigation decision during a live incident, did it constructively enough that the IC didn't lose coordination authority, and then brought the structural concern to the post-mortem where the team could actually discuss it without the pressure of a running outage. That sequence matters.

28

Why are monitoring and alerting important in SRE operations?

Reference answer

Monitoring and alerting are crucial in SRE operations because they measure performance against targets, raise high-priority alerts, and trigger responses when necessary. They help keep SLIs and SLAs aligned with goals and ensure that alerts match the severity of an incident or outage.

29

What is a canary deployment?

Reference answer

Roll out changes to a small user subset before full deployment to minimize risk.

30

Explain the relationship between SLI, SLO, and SLA. Give a practical example.

Reference answer

Service Level Indicators (SLI), Service Level Objectives (SLO), and Service Level Agreements (SLA) are key concepts in SRE.

- SLI is a quantitative measure of service performance, like latency or availability.
- SLO defines a target value or range for the SLI, such as "99.9% uptime over the last month."
- SLA is a formal agreement between the provider and customer specifying what happens if the SLO is not met.

Practical Example: If an SLI is "API response time," the SLO might be "90% of requests should respond within 200ms." The SLA could specify that if the SLO isn't met, the provider owes a refund or credits.
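
The practical example can be checked against measured data with a few lines of code. The sketch below is illustrative only: the function names and the sample latencies are hypothetical, and the 200 ms threshold and 90% target are taken from the example above.

```python
def requests_within(latencies_ms, threshold_ms=200):
    """SLI: count how many requests finished within the latency threshold."""
    return sum(1 for latency in latencies_ms if latency <= threshold_ms)


def latency_slo_met(latencies_ms, threshold_ms=200, target_percent=90):
    """SLO: at least target_percent of requests finish within threshold_ms."""
    latencies = list(latencies_ms)
    if not latencies:
        raise ValueError("no requests were measured")
    within = requests_within(latencies, threshold_ms)
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return within * 100 >= target_percent * len(latencies)


if __name__ == "__main__":
    sample = [120, 95, 180, 210, 150, 130, 199, 160, 140, 450]  # made-up data
    print(latency_slo_met(sample))
```

An SLA would then attach a consequence, such as service credits, to the periods in which this check fails.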

31

Can you explain what Site Reliability Engineering (SRE) is?

Reference answer

Site Reliability Engineering (SRE) is a DevOps discipline that applies software engineering principles to infrastructure and operations to create scalable and highly reliable software systems.

32

Describe a time you led the response to a major outage.

Reference answer

At my previous job with SAP, we faced a major outage due to a database overload during peak hours. I quickly assembled a cross-functional team to investigate, and we discovered a misconfigured query causing the spike. We implemented a temporary rollback and then optimized the query. Post-incident, I led a retrospective that resulted in enhanced monitoring and improved query performance, reducing similar incidents by 30%.

33

How do you approach documenting your work?

Reference answer

Proper documentation is a critical aspect of software development and system management, and I utilize a mix of methods to document my work. For coding, I'm a huge proponent of code being self-documenting as far as possible. I use meaningful variable and function names, and keep functions and classes compact and focused on doing one thing. When necessary, I add comments to explain complex logic or algorithms that can't be expressed clearly through just code. For code or software documentation, I use tools like Doxygen or JavaDoc. They create comprehensive documentation based on specially-formatted comments in source code, describing the functionality of classes, methods, and variables. As for documenting system configurations, I prefer to have configuration files stored in a version control system like Git. This provides an implicit documentation of changes made over time, who made them, and why. For complex system-level changes, I write separate documentation which provides an overview of the system, important configurations, and step-by-step procedures for performing common tasks. The aim is always to ensure that anyone with sufficient access can understand and manage the system without needing to figure things out from scratch. I also make use of README files in our Git repositories, and on more significant projects, we have employed wiki-style tools like Confluence to document architectures, workflows and decisions at a more macro level. GitHub's wiki feature is also handy for this.

34

Tell me about a SEV-1 incident you handled

Reference answer

Candidates should discuss both technical debugging and communication coordination with stakeholders, since incident response requires explaining complex situations while simultaneously troubleshooting.

35

Describe your experience with version control systems like Git. What advanced features are you familiar with?

Reference answer

Expect candidates to describe their proficiency with Git or similar systems through specific examples, such as: Branching and merging, Handling merge conflicts, Collaborating with team members. Knowledge of advanced features like rebase, cherry-pick, and tagging is a plus. Their answers should also demonstrate an understanding of best practices for integrating version control into CI/CD pipelines.

36

What is a playbook in the context of SRE?

Reference answer

A playbook is a collection of standardized procedures and protocols that guide engineers in handling various operational tasks and incidents.

37

What strategies do you use to reduce downtime during deployments?

Reference answer

Strategies include blue-green deployments, canary releases, feature toggles, and automated rollback mechanisms.

38

What is the 'Two Generals Problem' and how does it relate to distributed systems?

Reference answer

The Two Generals Problem illustrates the impossibility of reliably achieving consensus over an unreliable communication channel. In distributed systems, it relates to the challenge of coordinating actions across nodes where messages can be lost. Because the problem has no perfect solution, practical systems work around it with protocols like TCP (which provides reliable delivery through acknowledgments and retransmission, but not full consensus) or consensus algorithms like Paxos and Raft, which reach agreement as long as a majority of nodes can communicate. These form the basis for distributed transactions and replication.

39

What does `lsof` do?

Reference answer

Lists open files and the processes that have them open. Example: `lsof /var/log/syslog` shows which processes have the log file open.

40

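What can happen if a service consistently outperforms its SLOs?

Reference answer

If a service consistently outperforms its SLOs, users come to rely on the level of service actually delivered rather than the level promised, which sets unrealistic expectations. It is important to guard against this. Google, for example, has scheduled planned downtime for a service that was exceeding its target so that users would not become over-dependent on it. This keeps SLOs and SLAs aligned with real user needs and expectations.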

41

Design a multi-region failover strategy for a stateful service. What trade offs do you consider?

Reference answer

A multi-region failover strategy for a stateful service involves active-passive or active-active replication across regions. For active-passive, a primary region handles traffic, and a standby region replicates data synchronously or asynchronously. Trade-offs include: data consistency vs. latency (synchronous replication ensures consistency but adds latency; asynchronous may cause data loss), cost of maintaining redundant infrastructure, complexity of failover automation (e.g., DNS-based routing or load balancers), and recovery time objective (RTO) vs. recovery point objective (RPO). I would prioritize based on business requirements for consistency and availability.

42

What is the purpose of a load balancer?

Reference answer

A load balancer distributes incoming traffic across multiple servers to ensure no single server becomes a bottleneck, improving availability and reliability.
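
To make the idea concrete, here is a minimal sketch of round-robin distribution, which is only one of several balancing strategies (others include least-connections and weighted schemes). The class name and backend names are hypothetical, and a real load balancer would also perform health checks.

```python
import itertools


class RoundRobinBalancer:
    """Hand out backends in a fixed rotation so each gets an equal share."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backends)

    def next_backend(self):
        """Return the backend that should receive the next request."""
        return next(self._cycle)


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
    for request_number in range(5):
        print(request_number, balancer.next_backend())
```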

43

What is SRE and how does it differ from traditional operations?

Reference answer

Site Reliability Engineering (SRE) is a discipline that incorporates software engineering principles to solve infrastructure and operations challenges. Unlike traditional operations, SRE emphasizes automation, reliability, and scalability, proactively managing incidents and continuously improving systems.

44

Write a SQL query to find the top 5 users with the highest number of logins in a database.

Reference answer

To find the top 5 users with the highest number of logins, you can use the following SQL query: SELECT user_id, COUNT(*) as login_count FROM logins GROUP BY user_id ORDER BY login_count DESC LIMIT 5; This query groups logins by user, counts them, and orders the results to show the top 5 users.
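
The query can be tried against an in-memory SQLite database, as in the sketch below. The `logins` table layout and the sample rows are assumptions made for this demo; only the query itself comes from the answer above.

```python
import sqlite3

TOP_LOGINS_SQL = """
SELECT user_id, COUNT(*) AS login_count
FROM logins
GROUP BY user_id
ORDER BY login_count DESC
LIMIT 5;
"""


def demo_connection():
    """Build an in-memory database with a hypothetical logins table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logins (user_id INTEGER, logged_in_at TEXT)")
    # User 1 logs in three times, user 2 twice, users 3 to 7 once each.
    rows = [(1,)] * 3 + [(2,)] * 2 + [(user,) for user in range(3, 8)]
    conn.executemany("INSERT INTO logins (user_id) VALUES (?)", rows)
    return conn


def top_users(conn):
    """Return up to five (user_id, login_count) rows, busiest first."""
    return conn.execute(TOP_LOGINS_SQL).fetchall()


if __name__ == "__main__":
    for user_id, login_count in top_users(demo_connection()):
        print(user_id, login_count)
```

Users with equal counts may come back in any order; adding a second sort key such as `user_id` to the ORDER BY clause would make the result deterministic.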

45

What is the difference between SLI and SLA?

Reference answer

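An SLI is a service level indicator: a measurement of how well the service is doing, such as its error rate or latency. An SLA is a service level agreement: a contractual commitment to customers about the level of service, usually expressed in terms of SLIs aggregated over a period, with consequences if it is missed. The error budget is derived from the SLO rather than the SLA; the SLA is the business-facing commitment.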

46

What is database sharding?

Reference answer

Database sharding involves distributing a large database across multiple machines. Since a single machine or database server can only handle a limited amount of data, sharding splits the data into smaller logical chunks called shards and stores them across multiple database servers to overcome this limitation.
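
A minimal sketch of how a record might be assigned to a shard is shown below, assuming simple hash-based sharding. Range-based and directory-based schemes are also common, and the function name `shard_for` is hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index in the range [0, num_shards).

    A stable hash is used (not Python's built-in hash(), which is salted
    per process) so every application server picks the same shard.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for user in ["alice", "bob", "carol"]:
        print(user, "-> shard", shard_for(user, 4))
```

One trade-off of plain modulo hashing is that changing `num_shards` remaps most keys, which is why systems that reshard often use consistent hashing instead.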

47

When analyzing a software development pipeline, how do you identify ways to improve its efficiency?

Reference answer

I measure key metrics like build time, deployment frequency, lead time, and failure rate. I identify bottlenecks by profiling each stage, reviewing logs, and gathering team feedback. Common improvements include parallelizing tests, caching dependencies, optimizing container builds, and automating manual approvals.

48

What are some common causes of high latency in a distributed system?

Reference answer

Common causes include network latency or congestion between services, resource contention (CPU/memory) on overloaded servers, inefficient database queries, blocking I/O operations, serialization/deserialization overhead, or slow dependencies between microservices.

49

Define hard link and soft link, with examples.

Reference answer

A soft link (symbolic link) is a pointer to the path of the original file. It can cross file system boundaries, can link to directories, and has a different inode number and different permissions from the original file. Example: $ ln -s original.file softlink.file

A hard link is an additional name for the same file data. It cannot cross file system boundaries, cannot link directories, and has the same inode number and permissions as the original. Example: $ ln original.file hardlink.file
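
The inode behavior described above can be observed directly. The sketch below assumes a POSIX system and uses hypothetical file names inside a temporary directory; `os.link` and `os.symlink` play the roles of `ln` and `ln -s`.

```python
import os
import tempfile


def link_demo():
    """Create a hard link and a soft link, then compare their behavior."""
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "original.file")
        hard = os.path.join(workdir, "hardlink.file")
        soft = os.path.join(workdir, "softlink.file")
        with open(original, "w") as handle:
            handle.write("hello")

        os.link(original, hard)     # hard link: a second name for the same inode
        os.symlink(original, soft)  # soft link: a small file holding a path

        same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
        # lstat inspects the link itself instead of following it.
        symlink_has_own_inode = os.lstat(soft).st_ino != os.stat(original).st_ino

        os.remove(original)
        hard_link_survives = os.path.exists(hard)      # data is still reachable
        soft_link_dangles = not os.path.exists(soft)   # target path is gone
        return same_inode, symlink_has_own_inode, hard_link_survives, soft_link_dangles


if __name__ == "__main__":
    print(link_demo())
```

Deleting the original name leaves the hard link fully usable, while the soft link becomes dangling because it only stored the path.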

50

What are data structures?

Reference answer

Data structures are the formats computers use to organize and store data. They are employed to manage memory, structure databases, and organize data. Good data structures make data simple to organize and retrieve, and make good use of resources.

51

What are the Linux kill commands? List them with their functions.

Reference answer

The Linux kill commands are:

- kill: sends a signal (SIGTERM by default) to a process identified by its PID.
- killall: kills all processes with a particular name.
- pkill: works much like killall, except that it also matches partial names.
- xkill: lets the user kill a graphical program by clicking on its window.

52

What is the difference between a process and a thread?

Reference answer

| Process | Thread |
|---|---|
| A process is a program in execution. | A thread is a segment of a process. |
| A process takes more time to terminate. | A thread takes less time to terminate. |
| A process takes more time to create. | A thread takes less time to create. |
| A process takes more time for context switching. | A thread takes less time for context switching. |
| Processes are less efficient in terms of communication. | Threads are more efficient in terms of communication. |

53

Explain a metric you would use to measure reliability for an API service.

Reference answer

A key metric is the service's error rate, specifically the percentage of requests resulting in errors (e.g., HTTP 5xx) over a defined window. This is an SLI that directly impacts reliability. Combined with latency (e.g., p99 response time), it provides a comprehensive view. I would set an SLO for error rate (e.g., 99.9% of requests succeed) and track it over time to ensure the service meets reliability targets.
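
As a rough sketch of the calculation, the snippet below derives the success ratio from a list of HTTP status codes and compares it with the 99.9% target mentioned above. The function names and the sample data are hypothetical; a production setup would normally compute this in the monitoring system over a rolling window.

```python
def success_ratio(status_codes):
    """SLI: fraction of requests that did not fail with an HTTP 5xx."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests were measured")
    failures = sum(1 for code in codes if 500 <= code <= 599)
    return 1 - failures / len(codes)


def meets_error_rate_slo(status_codes, slo=0.999):
    """SLO check: is the measured success ratio at or above the target?"""
    return success_ratio(status_codes) >= slo


if __name__ == "__main__":
    window = [200] * 9_990 + [404] * 5 + [503] * 5  # made-up sample window
    print(f"success ratio: {success_ratio(window):.4f}")
    print("SLO met:", meets_error_rate_slo(window))
```

Client errors such as 404 are counted as successes here because they usually reflect a bad request rather than an unreliable service; whether that is right depends on the service.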

54

Design a system for copying a file to remote servers

Reference answer

Use a client-server model. The client breaks the file into chunks, computes checksums, and sends chunks over TCP or UDP with error correction. The server reassembles chunks and verifies integrity. For large-scale distribution, use a tool like rsync or a peer-to-peer protocol like BitTorrent. Consider encryption, compression, and resumability. Use a coordination service to track progress.
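
The chunking and integrity-check steps can be sketched in memory as below. This leaves out the network transfer entirely, and the function names, the tuple layout, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def split_into_chunks(data: bytes, chunk_size: int):
    """Client side: return (index, payload, sha256 hex digest) per chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for offset in range(0, len(data), chunk_size):
        payload = data[offset:offset + chunk_size]
        checksum = hashlib.sha256(payload).hexdigest()
        chunks.append((offset // chunk_size, payload, checksum))
    return chunks


def reassemble(chunks):
    """Server side: verify every chunk, then rebuild the file in index order."""
    output = bytearray()
    for index, payload, checksum in sorted(chunks, key=lambda chunk: chunk[0]):
        if hashlib.sha256(payload).hexdigest() != checksum:
            raise ValueError(f"chunk {index} failed its checksum")
        output.extend(payload)
    return bytes(output)


if __name__ == "__main__":
    original = bytes(range(256)) * 10
    received = list(reversed(split_into_chunks(original, 100)))  # out of order
    print(reassemble(received) == original)
```

Because each chunk carries its own index and checksum, a failed or missing chunk can be re-requested on its own, which is what makes resumable transfers possible.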

55

What is a zombie process?

Reference answer

A zombie process is a terminated process that still has an entry in the process table because its parent has not yet called wait() or waitpid() to read its exit status. Zombies consume minimal resources (only a process table entry) but can accumulate and exhaust system process table limits if not reaped.

56

What are the states that the process could be in?

Reference answer

A process is a computer program being executed by the CPU. During its execution cycle, a process goes through various stages, known as process states. The process states are:

- New: A new process is a program that is about to be loaded into main memory by the operating system.
- Ready: Once a process is created, it enters the ready state and waits for the CPU to be assigned. The operating system selects new processes from secondary memory and places them in main memory. Ready-state processes sit in main memory, ready for execution. Many processes may be in the ready state at once; they wait in a queue for their turn to execute.
- Running: The OS selects one of the processes from the ready state according to its scheduling algorithm. With a single CPU, only one process can be running at any given time; with n processors, n processes can run at the same time.
- Block/Wait: Depending on the scheduling method or the inherent behavior of the process, a process can move from the running state to the blocked (wait) state. When a process waits for a specific resource to become available or for user input, the operating system moves it to the blocked state and assigns the CPU to other processes.
- Terminated: The terminated state is reached when a process completes its execution. The operating system removes the process's context (Process Control Block) and terminates the process.
- Suspend Block/Wait: Rather than removing a process from the ready queue, it is preferable to move a blocked process that is waiting for resources out of main memory. Because it is already waiting for a resource to become available, it is better for it to wait in secondary memory and make way for a higher-priority process. These processes resume and complete their execution once main memory becomes available and their wait is over.
- Suspend Ready: A process in the ready state that is moved from main memory to secondary memory owing to a shortage of resources (mostly primary memory) is said to be in the suspend ready state. If main memory is full and a higher-priority process arrives for execution, the OS must free up space in main memory by moving a lower-priority process to secondary memory. Suspend-ready processes are kept in secondary memory until main memory becomes available.

57

Differences between TCP and UDP

Reference answer

| Basis | Transmission Control Protocol (TCP) | User Datagram Protocol (UDP) |
|---|---|---|
| Type of Service | TCP is a connection-oriented protocol. Connection orientation means that the communicating devices should establish a connection before transmitting data and should close the connection after transmitting the data. | UDP is a datagram-oriented protocol. There is no overhead for opening a connection, maintaining a connection, or terminating a connection. UDP is efficient for broadcast and multicast types of network transmission. |
| Reliability | TCP is reliable as it guarantees the delivery of data to the destination router. | The delivery of data to the destination cannot be guaranteed in UDP. |
| Error checking mechanism | TCP provides extensive error-checking mechanisms, because it provides flow control and acknowledgment of data. | UDP has only the basic error-checking mechanism using checksums. |
| Acknowledgment | An acknowledgment segment is present. | No acknowledgment segment. |

58

What is horizontal scaling?

Reference answer

Horizontal scaling increases a system's capacity by adding more logical resources. This can be done by adding more virtual machines or containers to each host, or by adding more hosts. It is also known as scaling out.

59

How do you ensure backups are up-to-date and readily available?

Reference answer

Ensuring backups are up-to-date and readily available begins with automating the process. I usually set up automated scripts to perform regular backups, be it daily, weekly or as required for the specific application. By doing this, we can have a reliable recovery point even in the event of a catastrophic failure. I also set up backup verification processes. This involves periodically checking that backups are not only happening as scheduled but also that the data is consistent and can be correctly restored when needed. It's a good practice to conduct routine "fire drills" where we actually restore data from a backup to a test environment just to ensure we can do it quickly and correctly in case of a real need. In addition, I ensure the backups are securely stored in two separate locations, usually one in the same region and one in a different region, providing geographic redundancy. This way, in case of a regional disaster, we still have a reliable backup available. Also, it's important to protect backups with the same security measures as the original data to ensure their integrity and confidentiality.

61

What's the difference between SRE and DevOps?

Reference answer

A site reliability engineering role focuses on managing the core infrastructure systems that run in the production environment. DevOps, on the other hand, is a broader set of practices that brings automation and simplification to development teams and their workflows. Ultimately, the goal of both is to reduce the gap between development and operations.

62

Which activities reduce toil?

Reference answer

Activities that can reduce toil are:

- Creating external automation
- Creating internal automation
- Enhancing the service to not require maintenance intervention

63

Why do companies need SREs?

Reference answer

Users have high expectations of the services they rely on, and companies need SREs to help them meet those higher reliability expectations.

64

What is a data structure?

Reference answer

A data structure is a way of organizing and storing data in a computer so that it can be accessed and manipulated efficiently. There is a wide range of data structures that serve various purposes, and the choice of a specific data structure depends on the needs of the algorithms or operations being performed. Arrays, linked lists, stacks, trees, heaps, and hash tables are common types of data structures.

65

What is the difference between logging, monitoring, and tracing, and how do they contribute to observability?

Reference answer

Expect candidates to explain that: Logging is the recording of discrete events that happen in the system. Monitoring is the continuous collection and analysis of metrics to assess system health. Tracing is tracking the execution path of requests to diagnose problems or performance bottlenecks. The three practices enhance observability by collecting data on system performance and behavior, helping identify issues and inform the team's decisions.

66

What are hardlinks and softlinks?

Reference answer

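Hard links and soft (symbolic) links are the two kinds of file system links used to make a file reachable from more than one path. A hard link is an additional directory entry that points to the same inode as the original file, so both names refer to the same data and report the same size, and the data remains until the last hard link is removed. A soft link is a separate small file that stores the path of its target; it can cross file systems and point to directories, but it becomes dangling if the target is deleted or moved.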
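
The practical difference between the two link types shows up when the original name is deleted. The sketch below assumes a POSIX-like file system (creating symlinks on Windows may need extra privileges); the file names are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.txt")
    hard = os.path.join(tmp, "hard.txt")
    soft = os.path.join(tmp, "soft.txt")

    with open(original, "w") as handle:
        handle.write("hello")

    os.link(original, hard)      # hard link: a second name for the same inode
    os.symlink(original, soft)   # soft link: a small file storing the target path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    link_count = os.stat(original).st_nlink   # two names now point at the inode
    soft_is_link = os.path.islink(soft)

    os.remove(original)          # delete the original name

    with open(hard) as handle:
        hard_still_readable = handle.read() == "hello"
    soft_dangling = not os.path.exists(soft)  # exists() follows the link
```

After the original name is removed, the hard link still reaches the data because the inode survives, while the soft link points at a path that no longer exists.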

67

How do you define and implement a companywide observability strategy that serves engineers and SREs?

Reference answer

A companywide observability strategy defines unified standards for metrics, logs, and traces across all services. I would start by selecting a centralized observability platform (e.g., Datadog, Prometheus/Grafana). Implementation includes: instrumenting services with standardized libraries for telemetry, defining common SLIs and SLOs, creating dashboards for different roles (engineers for debugging, SREs for reliability), and setting up alerting with proper escalation paths. I would also provide training and documentation to ensure adoption. The strategy should evolve based on feedback and incident learnings.

68

Describe your experience with security in an SRE context.

Reference answer

Security is everyone's job, but SREs play a particular role because we control access and deployments. We implement least privilege access—developers don't have production SSH access. We use role-based access control and audit every production access. For patch management, we automate security patches through Ansible to ensure they get applied consistently and quickly. We've had zero-day situations where we've had a few hours to patch thousands of servers. Automation makes that possible. We also do regular security audits of our infrastructure—checking for misconfigured security groups, exposed databases, things like that. We had an incident where a developer accidentally left a temporary RDS instance with public access enabled. Our auditing tool caught it. I also make sure disaster recovery processes include security considerations. If we're restoring from backup, we need to ensure we're not restoring credentials or sensitive data to the wrong place. And we have an incident response plan specifically for security incidents—different from operational incidents because you need different communication protocols and evidence preservation.

69

How would you ensure security in a site reliability engineering context?

Reference answer

Ensuring security in SRE involves various practices. First, I would incorporate secure design principles right from the system design phase. This could include segregating the network, minimizing attack surface, and implementing least privilege principles. I would also ensure secure coding practices are followed to avoid common security vulnerabilities. I would utilize tools for vulnerability scanning and employ methodologies like threat modeling to identify potential security threats and mitigate them. Additionally, I would implement security monitoring and incident response procedures to respond swiftly to any security incidents.

70

What is a 'playbook'?

Reference answer

Step-by-step guide for common incidents (e.g., database failure).

71

What is SRE?

Reference answer

SRE stands for Site Reliability Engineering; its practitioners are called Site Reliability Engineers. A Site Reliability Engineer is a software engineer who specializes in building and maintaining reliable systems that can handle unexpected changes in their environment. They typically work on large web applications, but they also work with other types of software systems.
- They are responsible for making sure that the system can handle the failures and variations that occur in practice. For example, if one server goes down, the system should continue running without problems. They also need to make sure that the site is secure against attackers.
- Many sites are built from a combination of technologies, such as web apps, databases, and other systems. A Site Reliability Engineer needs to be familiar with all of these components to make sure that they work properly together.
- DevOps engineers do work that sounds similar to that of site reliability engineers, but there are differences between the two roles, which are covered in the follow-up questions.
Responsibilities of a Site Reliability Engineer
- Site reliability engineers collaborate with other engineers, product owners, and customers to develop goals and metrics, which helps them ensure system availability. Once everyone has agreed on a system's uptime and availability targets, it is easier to determine the best moment to act.
- Site reliability engineers use error budgets to assess risk, balance availability against change, and drive feature development. When reliability expectations are reasonable, a team has the freedom to make system upgrades and changes.
- SRE is committed to reducing toil. As a consequence, tasks that require a human operator to work manually are automated.
- A site reliability engineer should be well-versed in the systems and their interconnections.
- The objective of site reliability engineers is to detect problems early in order to decrease the cost of failure.

72

Explain how you would define SLOs for a new service

Reference answer

This question reveals whether candidates understand the foundational process of setting reliability targets based on user expectations and business criticality rather than arbitrary numbers.

73

What is an Incident Command System in SRE?

Reference answer

An Incident Command System (ICS) is a standardized framework for managing incidents, assigning specific roles like Incident Commander, Communications Lead, and Subject Matter Experts to ensure efficient, coordinated, and clear communication during outages.

74

What is GCP BigQuery?

Reference answer

Serverless data warehouse for SQL queries on large datasets.

75

An m x n rectangular island borders both the Pacific and Atlantic oceans. The Pacific Ocean touches the left and top edges of the island, while the Atlantic Ocean touches the right and bottom edges. The island is divided into square cells by a grid. You are given an m x n integer matrix heights, where heights[r][c] is the height above sea level of the cell at (r, c). The island receives a lot of rain, and rainwater can flow from a cell to a neighboring cell directly north, south, east, or west if the neighboring cell's height is less than or equal to the current cell's height. Water can flow into an ocean from any cell adjacent to it. Write an algorithm that returns the (row, column) indices of every cell from which water can flow to both the Pacific and Atlantic oceans. (Asked in a LinkedIn interview last month.) Example - In the example grid, the highlighted cells are those from which water can flow to both oceans, so the expected result is [[0,4],[1,3],[1,4],[2,2],[3,0],[3,1],[4,0]]

Reference answer

This is a graph problem. To solve it, we track separately which cells can reach the Pacific and which can reach the Atlantic. The steps are:
- Create two boolean matrices, one for cells that can reach the Pacific and one for cells that can reach the Atlantic, and first mark the border cells that touch each ocean.
- Run a breadth-first search from those border cells, moving only to neighbors of equal or greater height (that is, following the water's path in reverse).
- Finally, scan both matrices and add every cell that is marked in both to the result list.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Queue;

    class Solution {
        // Offsets for the north, south, west, and east neighbors.
        private static final int[][] DIRS = {{-1, 0}, {1, 0}, {0, -1}, {0, 1}};

        public List<List<Integer>> pacificAtlantic(int[][] heights) {
            int m = heights.length, n = heights[0].length;
            // Cells from which water can reach the Pacific Ocean.
            boolean[][] reachPacific = new boolean[m][n];
            // Cells from which water can reach the Atlantic Ocean.
            boolean[][] reachAtlantic = new boolean[m][n];
            // Queues for the breadth-first traversal of the matrix.
            Queue<int[]> queuePacific = new LinkedList<>();
            Queue<int[]> queueAtlantic = new LinkedList<>();
            // Mark the border rows and columns that touch each ocean.
            for (int i = 0; i < m; i++) {
                reachPacific[i][0] = true;
                queuePacific.add(new int[]{i, 0});
                reachAtlantic[i][n - 1] = true;
                queueAtlantic.add(new int[]{i, n - 1});
            }
            for (int j = 0; j < n; j++) {
                reachPacific[0][j] = true;
                queuePacific.add(new int[]{0, j});
                reachAtlantic[m - 1][j] = true;
                queueAtlantic.add(new int[]{m - 1, j});
            }
            // BFS from each ocean's border to mark every cell that drains into it.
            bfs(heights, queuePacific, reachPacific);
            bfs(heights, queueAtlantic, reachAtlantic);
            // Collect the cells that can reach both oceans.
            List<List<Integer>> ans = new ArrayList<>();
            for (int i = 0; i < m; i++)
                for (int j = 0; j < n; j++)
                    if (reachPacific[i][j] && reachAtlantic[i][j])
                        ans.add(Arrays.asList(i, j));
            return ans;
        }

        // Moves to neighbors of equal or greater height, marking each new cell once.
        private void bfs(int[][] heights, Queue<int[]> queue, boolean[][] reach) {
            int m = heights.length, n = heights[0].length;
            while (!queue.isEmpty()) {
                int[] cell = queue.poll();
                for (int[] d : DIRS) {
                    int r = cell[0] + d[0], c = cell[1] + d[1];
                    if (r >= 0 && r < m && c >= 0 && c < n && !reach[r][c]
                            && heights[r][c] >= heights[cell[0]][cell[1]]) {
                        reach[r][c] = true;
                        queue.add(new int[]{r, c});
                    }
                }
            }
        }
    }

The time complexity of this algorithm is O(m*n): each cell is marked at most once per ocean, and each marked cell checks only its four neighbors, so the total work is proportional to the number of cells. The two boolean matrices and the queues also take O(m*n) space.

76

Can you discuss the difference between snat and dnat?

Reference answer

SNAT (Source Network Address Translation) changes the source IP address of outgoing packets, typically used for internal hosts to access external networks. DNAT (Destination Network Address Translation) changes the destination IP address of incoming packets, commonly used for port forwarding or load balancing.

77

Describe the concept of throttling in SRE.

Reference answer

Throttling limits the number of requests a service can handle to prevent overload and ensure fair resource allocation. It helps maintain system stability during high traffic periods by controlling the rate of incoming requests.
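
One common way to implement throttling is a token bucket. The sketch below is illustrative only: the `TokenBucket` class and its parameters are hypothetical, and the caller passes the current time explicitly so the behavior is deterministic (a real service would typically read a monotonic clock).

```python
class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill tokens for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1     # spend one token for this request
            return True
        return False             # bucket empty: reject or queue the request


bucket = TokenBucket(rate=1.0, capacity=2.0)
burst = [bucket.allow(now=0.0) for _ in range(3)]   # third request exceeds the burst
later = bucket.allow(now=1.0)                        # one token refilled after 1 second
```

The capacity bounds how large a burst is tolerated, while the rate bounds sustained throughput; rejected requests would usually receive a "too many requests" style response.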

78

Explain the process of root cause analysis after a major incident. What tools or methods do you use?

Reference answer

Root Cause Analysis (RCA) involves identifying the underlying cause of an incident. Here's the process I follow:
- Gather Data: Collect logs, metrics, and traces from monitoring systems (e.g., Azure Monitor, Prometheus).
- Reconstruct Timeline: Use tools like Jaeger or Grafana to map out the timeline of the incident, identifying when and where the issue began.
- Identify Symptoms: Look for patterns or commonalities among affected services, users, or resources.
- Collaborate: Engage with relevant teams (Dev, Ops) to understand any potential changes that could have contributed.
- Identify Root Cause: Once the data is analyzed, we isolate the underlying cause, whether it's a configuration error, network issue, or service overload.
- Preventive Actions: Document findings, implement fixes, and improve monitoring to prevent similar incidents.

79

How is the error budget shared among teams involved in the process?

Reference answer

The error budget is shared among all teams involved in the process, ensuring that everyone is part of the decision-making process.

80

SRE vs DevOps: What's the difference?

Reference answer

SRE focuses on engineering solutions for system reliability and performance, while DevOps emphasizes collaborative practices to enhance and streamline the software development and delivery process.

81

Describe how you handle a high-severity production incident.

Reference answer

During a high-severity incident, I follow established procedures: first, acknowledge the incident and assess its impact; second, identify and isolate the failing component; third, implement mitigations to restore service; fourth, communicate updates clearly throughout; fifth, once service is stable, conduct a root cause analysis; and finally, hold a blameless postmortem.

82

What are TCP connection statuses?

Reference answer

TCP connection statuses (states) describe where a connection between a client's and a server's TCP endpoints is in its lifecycle. A connection is set up by the three-way handshake: one side starts with a SYN packet, the other side replies with a SYN-ACK, and the initiator completes the setup with an ACK. The states defined by the protocol are LISTEN, SYN-SENT, SYN-RECEIVED, ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT, and CLOSED.

83

What is your approach to capacity planning?

Reference answer

In my previous roles, I've used a combination of historical data analysis, current trends and future business projections for capacity planning. Historical data, drawn from system metrics, helps in understanding how our systems have been utilized over time. For instance, we may identify cyclical changes in demand related to business cycles or features. The next step is to factor in the current trends. This includes aspects like user growth and behaviour, release of new features which might increase resource usage, or updates that improve efficiency and decrease resource usage. Finally, I bring in the future projections given by the business and product teams. They provide an idea of upcoming features, projected growth, and special events, all of which could mean changes in system usage. This comprehensive review helps to estimate the resources needed in the future with a suitable buffer for unexpected spikes. We then plan how to scale up our existing infrastructure to meet the expected demand. This approach helps us prevent outages due to capacity issues, avoid overprovisioning, and plan for budget effectively.
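
The estimate described above can be reduced to a small calculation. This is a deliberately simple sketch that assumes a straight-line growth trend; the function name, the sample history, the per-instance throughput, and the 30% headroom are all hypothetical values chosen for illustration.

```python
import math


def instances_needed(monthly_peak_rps, months_ahead, rps_per_instance, headroom=0.3):
    """Project peak load with a straight-line trend and size the fleet with headroom."""
    # Average month-over-month change across the observed history.
    slope = (monthly_peak_rps[-1] - monthly_peak_rps[0]) / (len(monthly_peak_rps) - 1)
    projected_peak = monthly_peak_rps[-1] + slope * months_ahead
    # Add a buffer for unexpected spikes, then round up to whole instances.
    return math.ceil(projected_peak * (1 + headroom) / rps_per_instance)


history = [1000, 1100, 1200, 1300]   # hypothetical monthly peak requests per second
fleet = instances_needed(history, months_ahead=6, rps_per_instance=200)
no_buffer = instances_needed(history, months_ahead=6, rps_per_instance=200, headroom=0.0)
```

In practice the trend would be adjusted for seasonality and for the feature launches and events that product teams forecast, but the structure (projection, buffer, rounding up) stays the same.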

84

What is the significance of automation in SRE?

Reference answer

Automation helps in reducing manual tasks, minimizing human errors, increasing efficiency, and ensuring consistent performance across the infrastructure.

85

What happens when you type a URL in a browser?

Reference answer

The browser parses the URL, checks its cache for a DNS record, performs a DNS lookup to resolve the domain to an IP address, establishes a TCP connection (often with TLS handshake for HTTPS), sends an HTTP request to the server, receives an HTTP response, and renders the page content.

86

Explain the difference between horizontal and vertical scaling.

Reference answer

Horizontal scaling involves adding more machines to a system to handle increased load, while vertical scaling increases the capacity of a single machine by adding more resources. Horizontal scaling is often more cost-effective and provides better fault tolerance, whereas vertical scaling can be simpler but has physical limitations.

87

How do you implement disaster recovery (DR) in a distributed system?

Reference answer

Implement multi-region replication, frequent backups, and automated failover to another region. Regularly test the DR plan to ensure it can be executed smoothly in an actual disaster.

88

What is the purpose of sharing the error budget among all teams involved in the process?

Reference answer

To ensure that everyone is part of the decision-making process.

89

What is the role of an SRE in incident response?

Reference answer

An SRE's role in incident response includes detecting and diagnosing issues, coordinating the response, mitigating impact, and conducting post-incident analysis to prevent future occurrences.

90

What is a 'sidecar' pattern in microservices?

Reference answer

The sidecar pattern deploys a helper container (the sidecar) alongside the main application container, sharing the same lifecycle and network. The sidecar handles cross-cutting concerns like logging, monitoring, service discovery, or traffic management (e.g., via Envoy proxy). This pattern allows SREs to add operational functionality without modifying the application code, improving maintainability and consistency.

91

Explain 'Chaos Engineering.'

Reference answer

Proactively testing system resilience by simulating failures (e.g., shutting down nodes with Chaos Monkey).

92

Explain the relationship between SLAs, SLOs, and SLIs.

Reference answer

SLIs (Service Level Indicators) are the actual measured metrics of a service's performance, such as latency or error rate. SLOs (Service Level Objectives) are the target values or ranges for those SLIs, defining what is considered acceptable performance. SLAs (Service Level Agreements) are formal contracts with customers that specify consequences if SLOs are not met, often including penalties. SLOs are derived from SLIs, and SLAs are based on meeting or exceeding SLOs.

93

What strategies do you use for disaster recovery?

Reference answer

Disaster recovery strategies include implementing regular, verified backups, replicating data across geographically diverse regions, establishing automated failover processes to secondary sites, maintaining clear and tested recovery runbooks, and conducting periodic disaster recovery drills.

94

Tell me about a time you had to debug a complex system issue.

Reference answer

We had an issue where a specific customer's API requests were consistently timing out, but only during certain times of day. Other customers weren't affected. That was weird—it suggested something about their specific request patterns. I started by looking at traces for that customer's requests. I noticed that their requests were hitting a specific downstream service that was taking 5 seconds instead of the normal 50 milliseconds. That downstream service's metrics looked fine—CPU, memory, latency for other callers were all normal. Then I noticed the pattern: it was happening during their evening peak time when they were hitting us with lots of requests. I looked at the connection pool for that downstream service and saw it was getting exhausted during their traffic spikes. Their requests were queuing up waiting for a connection. We increased the connection pool size for that downstream dependency, and the timeout went away. But the real lesson was that the underlying issue was that downstream service wasn't scaled for their traffic. We implemented autoscaling based on connection pool utilization, which fixed it permanently.

95

Which system call's return value is not zero, although it ended successfully?

Reference answer

The fork() system call returns a non-zero value (the child PID) to the parent on success, while it returns 0 to the child. Other system calls like open() return a non-negative file descriptor on success. Generally, most system calls return 0 on success, but fork() is a notable exception.

96

How would you deal with an unreliable monitoring system?

Reference answer

An unreliable monitoring system is a critical incident itself. I would prioritize investigating its root cause, stabilizing it immediately, potentially adding redundancy, and implementing checks to validate its data integrity and the correctness of alerts it generates.

97

What are some of the steps you can take to reduce toil in a process?

Reference answer

Steps to reduce toil include automating repetitive tasks, implementing self-service tools, improving monitoring and alerting to reduce manual intervention, standardizing processes, eliminating unnecessary work, and using infrastructure as code to manage configurations and deployments.

98

Can you explain how Service Level Objectives, or SLOs, are used in the work of a site reliability engineer?

Reference answer

SLOs are target values or ranges for a service level indicator (SLI) that define the desired level of reliability. SREs use SLOs to set measurable reliability goals, guide decision-making on whether to release new features or focus on stability, and trigger actions when the SLO is at risk. They are central to the error budget policy and help align engineering efforts with business priorities.
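
How an SLO turns into an error budget decision can be shown with a short calculation. The function name and the request counts below are hypothetical, and the sketch assumes a request-based availability SLI with a target strictly below 100%.

```python
def error_budget_status(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare observed failures with the failures the SLO allows in the window."""
    # The error budget is the fraction of requests allowed to fail: 1 - SLO.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,      # 1.0 means the budget is fully spent
        "release_ok": consumed < 1.0,     # simple policy: slow down once budget is gone
    }


healthy = error_budget_status(0.999, 1_000_000, 400)
exhausted = error_budget_status(0.999, 1_000_000, 1500)
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 400 failures consume about 40% of the budget while 1,500 overspend it.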

99

What is chaos engineering, and how would you implement it in a production environment?

Reference answer

Chaos engineering is the practice of intentionally introducing failures into a system to test its resilience and identify weaknesses. Steps to implement chaos engineering:
- Define a steady state: Identify what “normal” looks like, including SLIs and system baselines.
- Start small: Begin with small, controlled experiments in staging environments (e.g., random pod failures in Kubernetes).
- Use chaos tools: Implement tools like Chaos Monkey or Gremlin to automate failure injections (e.g., network latency, resource exhaustion, or process kills).
- Monitor the effects: Use monitoring systems to track system behavior during chaos experiments.
- Gradually increase scope: After validating in staging, run controlled experiments in production to test for real-world resilience.

100

What is Sharding in DBMS?

Reference answer

Sharding is a way of spreading data across multiple machines. The word "shard" means "a small part of a whole", so sharding means dividing something large into smaller parts. In a DBMS, sharding is a type of horizontal database partitioning in which a large database is split into smaller pieces (shards) that are stored on different nodes. Because each shard holds only part of the data, it is smaller, usually faster to query, and easier to manage.
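
A minimal sketch of hash-based shard routing follows. It is one of several possible strategies (range-based and directory-based sharding are others); the function name, key format, and shard count are hypothetical.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash."""
    # A cryptographic hash is used here because Python's built-in hash() for
    # strings is randomized per process, which would break stable routing.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


SHARDS = 4
placement = {shard: [] for shard in range(SHARDS)}
for user_id in ("user-1", "user-2", "user-3", "user-42"):
    placement[shard_for(user_id, SHARDS)].append(user_id)
```

A plain modulo scheme like this moves most keys when the shard count changes, which is why production systems often prefer consistent hashing or a lookup directory.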

101

Explain the concept of 'eventual consistency' and where it is appropriate.

Reference answer

Eventual consistency is a consistency model where updates to a distributed system will eventually propagate to all nodes, but read operations may return stale data temporarily. It is appropriate for systems like DNS, content delivery networks, or social media feeds where high availability and partition tolerance are prioritized over strong consistency. SREs design applications to tolerate eventual consistency when needed (e.g., via idempotency or conflict resolution).
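
Conflict resolution, mentioned above, can be as simple as a last-write-wins rule. The sketch below is one illustrative strategy, not the only one; the `Versioned` type and its fields are hypothetical, and the node identifier is used only to break timestamp ties deterministically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int   # logical or wall-clock time of the write
    node_id: str     # tie-breaker when timestamps are equal


def merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the highest (timestamp, node_id) pair."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))


replica_1 = Versioned("blue", timestamp=10, node_id="a")
replica_2 = Versioned("green", timestamp=12, node_id="b")
converged = merge(replica_1, replica_2)
```

Because the merge gives the same answer regardless of argument order and is idempotent, replicas that exchange state in any order end up with the same value, at the cost of silently discarding the older write.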

102

What is DNS?

Reference answer

This is a BIG question and it will be interesting how the candidate answers. Ultimately, you aren't looking necessarily for comprehensive knowledge, but rather whether they can name the main points of interest and do so with clear definitions. The domain name system (DNS) is a decentralized naming system for resources connected to the internet or a private network. These resources are assigned internet protocol (IP) addresses, which are defined strings of unique identifying numbers that follow a precise format. However, humans cannot feasibly remember IP addresses, so DNS allows the assigning of a human-readable name, such as google.com, to use in place of the IP address. They may also talk about IPv4 versus IPv6, DNS records and the fields involved and how to create one, nameservers and decentralization and the existence of a set of canonical root nameservers, queries, caching, primary versus secondary DNS settings, reverse DNS lookups, DNS zones, and security concerns. All of these are important, but you are really looking at whether the candidate understands the big picture and how they communicate it to you.

103

Steps to resolve an outage?

Reference answer

Detect → Acknowledge → Diagnose → Fix → Post-mortem → Prevent recurrence.

104

How does your current deployment pipeline look? What are the biggest issues?

Reference answer

At first, this seems like a simple question — but beware: it's a loaded one. The interviewer wants to determine your ability to analyze your deployment pipeline and make intelligent decisions for changing it. SRE teams are crucial for: - Identifying monitoring deficiencies and deployment bottlenecks. - Surfacing reliability concerns to the applicable parties. Being able to determine where your team can make the biggest improvements to resilience without drastically affecting employee productivity or process will show that you're able to problem-solve at a high level.

105

Tell me about a time you identified a manual, repetitive task and successfully automated it. What was the impact?

Reference answer

S – Situation: For several months, our on-call Site Reliability Engineers were spending an average of 1-2 hours daily on a highly repetitive, manual task: generating and emailing compliance reports for a specific regulatory requirement. This process involved logging into multiple systems (our primary operational database, our centralized logging platform, and our monitoring system APIs) to extract specific metrics and log data. This data then needed to be collated into a precise CSV format, reviewed for accuracy, and finally emailed to a specific distribution list of compliance officers. Not only was it time-consuming, but the manual nature introduced a high risk of human error, and missing a deadline could result in significant compliance fines for the company. It was a significant source of operational toil that pulled engineers away from more impactful work.

T – Task: My task was clear: eliminate this manual effort entirely, thereby freeing up valuable on-call engineer time, improving the accuracy and consistency of the compliance reports, and ensuring timely delivery without fail. The goal was to transform this error-prone, labor-intensive process into a reliable, automated workflow that required minimal human intervention and provided continuous assurance of compliance. This meant designing a solution that could reliably access disparate data sources, perform data transformations, and handle secure distribution, all while being robust to potential failures in any part of the chain.

A – Action: I began by conducting a thorough analysis of the existing manual process, meticulously documenting every step, data source, and transformation logic. I identified that all the required data could be accessed programmatically using existing APIs and database connectors. I then developed a robust Python script designed to automate the entire workflow. The script leveraged our existing Python SDKs for database access (using SQLAlchemy), our logging platform's API client, and our monitoring system's REST API. It would connect to these sources, retrieve the necessary data for the specified time period, perform the required aggregations, filtering, and formatting operations, and then generate the compliance report in the specified CSV format. For distribution, I integrated an SMTP library within the script to securely send the generated report to the predefined compliance distribution list. To ensure the automation itself was reliable, I containerized the Python script using Docker, making it portable and ensuring consistent execution environments. This Docker image was then deployed onto our Kubernetes cluster as a cron job, scheduled to run every morning well before the compliance deadline. Crucially, I built comprehensive error handling and logging into the script. If any data source was unreachable, an API call failed, or the email could not be sent, the script would log the error details and trigger an alert to the SRE team, ensuring immediate visibility into any issues with the automation itself. Before fully replacing the manual process, I ran the automated script in parallel with the manual report generation for two weeks, cross-referencing every output to meticulously verify accuracy and build confidence in the automated system.

R – Result: The automation was an unqualified success. It completely eliminated approximately 10 hours of manual work per week for the on-call team, allowing them to redirect their focus towards proactive system improvements, complex incident resolution, and strategic projects that genuinely advanced our reliability goals. The accuracy of the compliance reports drastically improved due to the removal of human transcription and collation errors, ensuring consistent and correct data. Reports were now consistently delivered on time, every time, eradicating any risk of compliance fines due to late submissions. Furthermore, the modular design of the script meant that it could be easily adapted and extended for future reporting requirements, establishing a reusable pattern for similar automation tasks across the organization. This initiative not only significantly reduced operational toil but also showcased the tangible benefits of automation in enhancing operational efficiency, improving compliance posture, and empowering our engineers to contribute to higher-value activities. It solidified our team's reputation as champions of efficiency and reliability.

106

Explain Horizontal Pod Autoscaler (HPA).

Reference answer

Scales pods based on CPU/memory usage or custom metrics.
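
The scaling decision can be approximated with the formula given in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below is a simplification of the real controller: the function name and the min/max bounds are hypothetical, and the 10% tolerance is assumed to mirror the controller's default.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured replica bounds.
    return max(min_replicas, min(max_replicas, wanted))


scale_up = desired_replicas(4, current_metric=90, target_metric=60)
scale_down = desired_replicas(4, current_metric=30, target_metric=60, min_replicas=3)
steady = desired_replicas(4, current_metric=62, target_metric=60)
capped = desired_replicas(8, current_metric=120, target_metric=60, max_replicas=10)
```

The real autoscaler also accounts for pod readiness, missing metrics, and stabilization windows, none of which are modeled here.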

107

What are the benefits and challenges of microservices architecture in terms of reliability?

Reference answer

Benefits:
- Fault Isolation: Issues in one service don't bring down the entire system.
- Scalability: Individual services can scale independently based on demand.
Challenges:
- Increased Complexity: More services mean more operational overhead.
- Inter-service Communication: Latency and failures in communication between services.
- Monitoring: Requires comprehensive monitoring of each service and its interactions.

108

Explain the concept of 'distributed tracing' and its use in SRE.

Reference answer

Distributed tracing tracks requests as they flow through multiple services in a distributed system, using unique trace IDs and spans. It helps SREs identify bottlenecks, debug errors, and understand system dependencies. Tools like Jaeger, Zipkin, or OpenTelemetry are used to visualize request paths and latencies. This is critical for diagnosing performance issues in microservices architectures.

109

How do you ensure that your Terraform configuration matches what's actually running in production, and what do you do when it doesn't?

Reference answer

Drift is the word. The answer should cover automated drift detection, alerting on unexpected changes, and the decision process for whether to reconcile Terraform to match production or revert production to match Terraform. That decision depends entirely on context, and saying 'I'd always reconcile to Terraform' is a tell that you haven't been in the situation where the drift was intentional and undocumented by someone who no longer works there.
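
Automated drift detection is often built around `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The wrapper below is a hedged sketch: the function names are hypothetical, and it assumes the `terraform` CLI is installed and the working directory is already initialized. Note that a non-empty plan only indicates drift if the configuration itself has not changed since the last apply.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a verdict."""
    if code == 0:
        return "in-sync"   # plan succeeded with no changes
    if code == 2:
        return "drift"     # plan succeeded and changes are pending
    return "error"         # 1 (or anything else): the plan itself failed


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and classify the outcome (needs the terraform CLI)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir, capture_output=True, text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

A scheduled job could call `check_drift` for each workspace and alert on "drift", leaving the reconcile-or-revert decision to a human who can judge whether the change was intentional.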

110

What is a Service Level Agreement (SLA) and how does it relate to SLOs?

Reference answer

An SLA is a contractual commitment made to customers, often including penalties for non-compliance. SLOs are internal targets that are typically stricter than SLAs to provide a safety margin. SREs set SLOs with a buffer (e.g., 99.95% for an SLA of 99.9%) to avoid breaching the SLA and to manage customer expectations effectively.

111

Given a data structure of rows (source, ratio, destination), find the value of conversion for a given source to a given destination. Example (EUR, 1.23, GBP)

Reference answer

Treat the conversion rates as a graph where currencies are nodes and ratios are edges. Use a graph traversal algorithm (e.g., BFS or DFS) to find a path from source to destination, multiplying ratios along the path. If multiple paths exist, handle potential arbitrage or use shortest path for consistent conversion. Return the product of ratios.
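The approach above can be sketched in Python as follows. This version uses BFS, assumes every ratio is positive and can be inverted for the reverse direction, and returns the first path found rather than checking for arbitrage. The function name and the GBP to USD rate in the usage example are made up for illustration.

```python
from collections import defaultdict, deque


def convert(rates, source, destination):
    """Return the conversion factor from source to destination, or None."""
    graph = defaultdict(list)
    for src, ratio, dst in rates:
        graph[src].append((dst, ratio))
        graph[dst].append((src, 1.0 / ratio))  # assumes rates are invertible

    queue = deque([(source, 1.0)])
    seen = {source}
    while queue:
        currency, factor = queue.popleft()
        if currency == destination:
            return factor
        for neighbour, ratio in graph[currency]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, factor * ratio))
    return None  # no chain of conversions connects the two currencies


rates = [("EUR", 1.23, "GBP"), ("GBP", 2.0, "USD")]  # second row is hypothetical
print(convert(rates, "EUR", "USD"))
```

BFS returns the path with the fewest hops, which limits accumulated rounding error; a follow-up discussion could cover detecting inconsistent cycles.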

112

Can you describe the Best SRE Tools for each Stage of DevOps?

Reference answer

The appropriate SRE tools for each stage of DevOps are: - Plan: Jira, Pivotal Tracker, and other task management tools - Create: Source-control tools like GitHub - Verify: CI/CD tools like Jenkins or CircleCI - Package: Container orchestration services like Kubernetes or Mesosphere - Configure: Tools like Terraform and Ansible

113

How do you handle configuration management?

Reference answer

I use configuration management tools like Ansible or Terraform to define infrastructure and application configurations declaratively. This ensures consistency across environments, enables version control for configurations, and facilitates automated deployments and rollbacks.

114

Can you explain data structures and also describe the physical data structure and logical data structure?

Reference answer

A data structure is a way of organizing and storing data in a computer so that it can be accessed and modified efficiently. Data structures are used to structure databases, manage memory, and organize data, allowing easy retrieval and efficient use of resources. - Physical data structures: arrays and linked lists. They are called physical because they determine how data is actually laid out in memory. An array is a collection of contiguous data elements of the same type. A linked list is also a collection of data elements, but its elements need not be contiguous in memory; it consists of nodes that each store data plus a pointer to the next node. - Logical data structures: structures built on top of the two physical ones, such as stacks, queues, trees, and graphs. They define the logic and rules for how data is accessed, and are implemented in memory using arrays or linked lists.

115

How do you maintain team morale and productivity during challenging on-call situations?

Reference answer

In my previous role at a cloud service provider, I implemented a rotation system that ensured fair distribution of on-call duties. We also held regular debrief sessions after incidents to share insights and recognize individual contributions. Additionally, I introduced a 'no-work' policy for the day after a heavy on-call shift, allowing my team to recharge. This approach resulted in a noticeable improvement in team morale and engagement.

116

Explain the concept of “self-healing” systems and how you can implement them.

Reference answer

Self-healing systems automatically detect failures and recover without manual intervention. Implementation strategies: - Health checks and monitoring to detect failures. - Auto-scaling to add or remove instances based on demand. - Automated failover to switch to backup systems during failures. - Error recovery mechanisms that restart failed processes or roll back bad deployments.

117

How do you handle incomplete or ambiguous requirements?

Reference answer

When I encounter incomplete or ambiguous requirements, my first step is to initiate a detailed discussion with the relevant stakeholders. The goal is to clarify expectations, articulate the needs better, and make sure everyone is on the same page. For technical requirements, I often ask for use-cases or scenarios that help me understand what the stakeholder is trying to achieve. At times, I might present prototypes or sketches to illustrate the proposed implementation and that, in turn, prompts more detailed feedback. Also, it's beneficial to keep an open mind during these dialogues as sometimes the solution the stakeholder initially proposed may not be the best way to address their actual need. For example, in my previous role, a product manager once requested a feature that, on the surface, seemed straightforward. But it wasn't clear how this feature would affect existing systems and workflows. Rather than making assumptions or taking the request at face value, I initiated several meetings with the product manager to understand their vision, presented some mock-ups, and proposed alternate solutions that would achieve their goal with lesser system impact. In conclusion, clear communication, initiative to probe deeper, and presenting your understanding or solutions as visual feedback are key in dealing with incomplete or ambiguous requirements.

118

How do you ensure configuration consistency across multiple environments (dev, staging, prod)?

Reference answer

To ensure configuration consistency, I use: - IaC (Infrastructure as Code): By defining all infrastructure configurations in code (e.g., using Terraform or CloudFormation), I ensure that the same configurations are applied across all environments. - Version Control: Store configuration files in a version-controlled system (e.g., Git). - Automated Testing: Set up tests to ensure that configurations are deployed consistently across environments. - Environment-Specific Variables: Use tools like Vault to manage environment-specific variables securely. This approach ensures that the dev, staging, and production environments remain consistent, minimizing the risk of discrepancies.

119

Simple: What happens when you type in 'www.cnn.com' in your browser?

Reference answer

This is a simple version of the question asking to explain the sequence of events that occur when a URL is entered into a browser.

120

Describe a recent automation script you implemented.

Reference answer

Recently, I implemented a script to automate the rollover of log files on our systems. We gathered a considerable amount of log data daily, so disk space filled up quickly, which could cause system issues if not addressed. Manual cleanup was not sustainable given the volume of logs and the continuous nature of the task. I wrote the script in Python and paired it with a cron job that triggers it at a fixed time each day. The script backs up the day's log files in a compressed format, moves the backups into a designated backup directory, and then purges the originals, keeping only the last three days' worth of logs on the system. This automated process not only freed up considerable disk space and improved system performance, but also ensured that we retained log data for longer, which helps with future debugging and post-incident analysis. It was a significant win in terms of disk usage, system efficiency, and availability of historical log data.

121

What is the main goal of SREs?

Reference answer

The main goal of SREs is to implement and automate DevOps practices to reduce the number of problems and make the system more reliable and able to grow.

122

How do you go about setting SLOs and SLIs and how do you make adjustments when necessary?

Reference answer

Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are foundational metrics for SREs. SLOs are the goals for a particular application; SLIs are the actual measurement of performance against those goals. Lachhman notes that the SRE function is often at the heart of defining and refining SLOs and SLIs; oftentimes, developers don't necessarily know the norm or baseline for the applications they build and maintain, particularly if SRE is a relatively new dimension of the broader team. Hiring managers should dig into how the candidate identifies and defines SLOs and SLIs; if you're the candidate, you should be prepared to speak about how you approach these metrics. Moreover, make sure you can discuss a thoughtful process for reevaluating and optimizing those measurements over time. 'Like any metric, they need to evolve,' Lachhman says. 'Negotiating changes to SLO/SLI measurements is par for the course.'

123

What are the three terms in the error budget?

Reference answer

The three terms in the error budget are Service Level Indicator (SLI), Service Level Objective (SLO), and Service Level Agreement (SLA).

124

What is the difference between NFS (Network File System) and SAN (Storage Area Network)?

Reference answer

Look for answers explaining that: NFS (Network File System) is a protocol allowing remote access to files over a network, presenting storage at the file level. SAN (Storage Area Network) is a specialized, high-speed network that gives access to consolidated, block-level storage. NFS is often used for sharing files across a network of devices, making it suitable for situations where ease of access and file sharing are a priority, while SAN is typically used in environments requiring high performance, such as databases, where direct access to the disk block is necessary.

125

How to check open ports on a Linux server?

Reference answer

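Use `ss -tuln` (or the older `netstat -tuln`) to list listening TCP and UDP sockets with numeric addresses and ports. Example: `ss -tuln | grep 443` shows whether anything is listening on a port matching 443 (typically HTTPS).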

126

Explain in detail the working of ARP.

Reference answer

Applications address each other by IP address (a logical address), but frames on a local network are actually delivered using physical (MAC) addresses. The goal of ARP (Address Resolution Protocol) is therefore to find the MAC address that corresponds to a given IP address on the local network. - When the source wants to send to a destination IP, it first looks in its ARP cache (also called the ARP table). If the destination's MAC address is there, the source uses it directly. - If it is not, the source builds an ARP Request containing its own MAC and IP addresses and the destination's IP address. The destination MAC field is left blank, because that is the value the source is trying to learn. - The source broadcasts the ARP Request on the local network, so every device on the LAN receives it. Each device compares its own IP address with the requested one; devices that do not match silently drop the request. - The device whose IP address matches sends an ARP Reply containing its MAC address. It also records the requester's IP and MAC addresses in its own ARP cache, since it will need them to communicate. - For the reply, the roles are reversed: the original source is now the target of the ARP Reply. - The ARP Reply is sent unicast rather than broadcast, because the replying device already knows the requester's MAC address from the request. - When the source receives the ARP Reply, it stores the destination's MAC address in its ARP cache and can then send frames directly to the destination.
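The request and reply flow can be illustrated with a small in-memory simulation. This is a teaching sketch, not real networking code: the `Host` class, its method names, and the addresses are all invented for the example, and the "broadcast" is just a loop over a list of hosts.

```python
class Host:
    def __init__(self, ip, mac):
        self.ip = ip
        self.mac = mac
        self.arp_cache = {}  # IP address -> MAC address

    def handle_request(self, sender_ip, sender_mac, target_ip):
        """Return our MAC if the request is for us; otherwise drop it."""
        if target_ip != self.ip:
            return None
        self.arp_cache[sender_ip] = sender_mac  # learn the requester
        return self.mac

    def resolve(self, target_ip, lan):
        """Return the MAC for target_ip, broadcasting only on a cache miss."""
        if target_ip in self.arp_cache:
            return self.arp_cache[target_ip]
        for host in lan:  # simulated broadcast to the other hosts on the LAN
            if host is self:
                continue
            mac = host.handle_request(self.ip, self.mac, target_ip)
            if mac is not None:  # simulated unicast reply from the owner
                self.arp_cache[target_ip] = mac
                return mac
        return None  # nobody on the LAN owns that IP


a = Host("192.168.1.10", "aa:aa:aa:aa:aa:aa")
b = Host("192.168.1.20", "bb:bb:bb:bb:bb:bb")
print(a.resolve("192.168.1.20", [a, b]))
```

After the call, both hosts have learned each other's address, so a second `resolve` is answered from the cache without another broadcast.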

127

How would you reduce latency in a distributed system?

Reference answer

- Use CDNs to cache data closer to users. - Optimize databases with indexing and caching (e.g., Memcached, Redis). - Reduce network hops by optimizing routing and reducing dependencies.

128

Explain the difference between DevOps and SRE.

Reference answer

Implementing new features: DevOps teams are responsible for delivering new feature requests to the product, whereas SREs ensure those changes don't increase the overall failure rate in production. Process flow: DevOps teams work from the perspective of the development environment, moving changes from development to production. SREs work from the perspective of production, so they can make recommendations to the development team that keep failure rates down despite the new changes. Incident handling: DevOps teams act on incident feedback to mitigate the issue, whereas SREs conduct post-incident reviews to identify the root cause and document the findings as feedback for the core development team.

129

How do you decide if the team should work on new features or paying down technical debt?

Reference answer

SREs play a growing role in negotiating the tension between building new features and reducing technical debt: Most organizations can't do both simultaneously week in, week out. While this question might be rooted in technical decisions, it speaks to the 'socio-technical' nature of SRE. This is one of Merker's favorite questions, and he deliberately leaves it open-ended – he wants to hear the candidate dig in for more data and context. 'If they have hard-and-fast rules, I am less impressed by their answer,' Merker says. 'What I'm looking for is curiosity about the customer and the business, an understanding of a variety of roles in the company, and a desire to get data (when possible) to back up different points of view.' For SRE candidates, this topic is a chance to show how you approach seemingly insurmountable conflicts. Everyone thinks their goal or issue is the most important; how do you actually set priorities that people can (mostly) agree on and work on? When is technical debt acceptable (or inevitable)? How do you pay it down? 'A big part of SRE is mediating between these different interests and finding practical and actionable answers to somewhat impossible questions,' Merker says. 'There is no exact right answer; it's the process of discovery to find what truly matters that makes me want to say STRONG HIRE!'

130

What is a rollback window?

Reference answer

A rollback window is a predetermined time frame during which a new deployment can be rolled back to the previous version if issues are detected. It ensures quick recovery from deployment failures and minimizes the impact on users.

131

Define INodes. Also, state the reason why it is important.

Reference answer

An inode (index node) is the data structure a Linux filesystem uses to describe a file, directory, device node, or other filesystem object. It stores the object's metadata, such as type, permissions, owner and group IDs, size, timestamps, and link count, together with pointers to the data blocks that hold the contents. The file name is not stored in the inode; directory entries map names to inode numbers. When a file is deleted, its directory entry is removed and the inode's link count drops; once the count reaches zero and no process still has the file open, the inode and its data blocks are freed. Inodes matter for several reasons: - Capacity: many filesystems (ext4, for example) allocate a fixed number of inodes when the filesystem is created, so a volume can run out of inodes and refuse to create new files even though free space remains. `df -i` shows inode usage. - Performance: the inode tells the filesystem how large a file is and where its blocks are, so lookups go from name to inode to data. Renaming a file within a filesystem only changes directory entries, while moving it to another filesystem means creating a new inode there and copying the data. - Security: ownership and permission bits live in the inode, and ACLs (access control lists) are attached to it, so access checks are made against the inode rather than the file name. - Troubleshooting: because the inode tracks metadata for every file and directory, tools such as `ls -i` and `stat` can show which names share an inode (hard links) and when a file was last modified.

132

Define Hardlink and Softlink.

Reference answer

- Hard links and soft (symbolic) links are two ways of giving a file more than one name in the filesystem. - A hard link is an additional directory entry that points to the same inode as the original name, so both names refer to the same data. A soft link is a small separate file that stores the path of its target. - Because every hard link refers to the same inode, each one reports the same size as the original file. A soft link's size is just the length of the path it stores, and it can be created even when the target does not exist (a dangling link). - Hard links cannot span filesystems and normally cannot point to directories; soft links can do both. - To create either kind of link, you generally need write permission on the directory where the new link will be created; some Linux systems additionally restrict hard-linking files you do not own. - Deleting one hard link leaves the data reachable through the remaining names; the data is freed only when the last link is removed. Deleting the target of a soft link leaves the soft link in place but broken.

133

Tell me about a time you responded to a major production incident.

Reference answer

Last year, we had a database connection pool exhaustion during a traffic spike on Black Friday. Our service started returning 503 errors. I was on-call, and my first move was to page the on-call database engineer and open a war room Slack channel to communicate with stakeholders. While they investigated the database side, I started looking at our metrics. I could see CPU and memory were normal, but connection count was maxed out. I implemented a temporary fix by lowering the idle timeout on database connections to force recycling, which bought us 20 minutes while we worked on the root cause. The database team discovered that a recent code change had removed connection pooling in one of our services. We reverted that change and gradually brought traffic back. What impressed me most was how the team handled the post-mortem: no blame, just data. We implemented automated alerts for connection pool saturation and improved our deployment process to catch connection pool changes during code review.

134

Define Service Level Indicators

Reference answer

A Service Level Indicator (SLI) is a quantitative measure of the service level a provider delivers to a customer. SLIs form the basis of SLOs, which in turn are a critical element of SLAs. Common SLIs include latency, throughput, availability, and error rate; others include durability, end-to-end latency, and correctness. Because SLIs can be measured precisely, they are what you use to determine whether you are meeting your SLOs and SLAs.

135

What is the difference between SRE teams and Scrum teams?

Reference answer

Site reliability engineering (SRE) teams handle both interrupt-driven operational work and planned work, which may include some software development. Scrum is designed for software development teams working on one or a few products.

136

Why do we use the concept of Private IPs and Public IPs?

Reference answer

A private IP address is used to communicate within the same network; with it, data can be sent and received only inside that network. Private addresses are typically assigned to devices by the local router, and each device on the network gets a unique one. Because they are not routable on the public internet, devices with only private addresses are not directly reachable from outside, which reduces their exposure; they also conserve the limited public IPv4 address space. A public IP address is used to communicate outside the network and is assigned by the ISP (Internet Service Provider). Public IP addresses come in two types: - Dynamic IP address: an address that changes over time. When a smartphone or computer connects to the internet, the ISP assigns it an address from a pool, and that address may differ on the next connection. - Static IP address: an address that does not change over time. These are permanent internet addresses, mostly used by servers that must stay reachable at a known address, such as DNS (Domain Name System) servers.
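For IPv4, the private ranges are defined in RFC 1918: 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. A quick way to check an address is Python's standard `ipaddress` module, as in the sketch below. The helper name is an illustrative choice, and note that `is_private` also returns true for some other reserved ranges such as loopback and link-local.

```python
import ipaddress


def is_private(addr):
    """Return True if addr falls in a private or otherwise reserved range."""
    return ipaddress.ip_address(addr).is_private


for addr in ["10.0.0.5", "172.16.4.1", "192.168.1.20", "8.8.8.8"]:
    print(addr, "private" if is_private(addr) else "public")
```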

137

Describe a problem you had to troubleshoot; how did you find it and fix it?

Reference answer

The hiring manager is looking at the candidate's thought process and how methodically they track down the source of a problem. They also want to see whether you can think outside the box when resolving issues.

138

What kind of metrics would you monitor to understand the health of a service?

Reference answer

To understand a service's health, I would monitor metrics like request rate, error rate, response time, and resource usage (such as CPU, memory, and disk I/O). Request rate and error rate provide insight into the traffic and reliability of the service. Response time helps identify latency issues. Resource usage metrics help identify bottlenecks or capacity issues in the service. By correlating these metrics, we can gain a comprehensive understanding of the service's health and make informed decisions for performance optimization.

139

How would you set up monitoring and alerting for a microservices architecture with 30 services?

Reference answer

The answer that passes explains what you're measuring and why. The four golden signals from Google's SRE practices: latency, traffic, errors, saturation. Not as a list. As a diagnostic framework. 'I'd instrument latency at p50, p95, and p99 because p50 tells you the common case and p99 tells you about the tail that generates support tickets. I'd alert on p99 crossing the SLO threshold, not on p50, because p50 alerts generate noise that trains people to ignore pages.' That reasoning. The tooling is secondary.
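The percentile reasoning can be made concrete with a small sketch. It uses the nearest-rank method on an in-memory list of latencies; the function names and the 500 ms threshold are illustrative assumptions, and a production system would compute percentiles from histograms in the metrics backend instead.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def should_page(latencies_ms, slo_ms):
    """Page only when the tail (p99) breaches the hypothetical SLO threshold."""
    return percentile(latencies_ms, 99) > slo_ms


latencies = [100] * 98 + [900, 1200]  # mostly fast, with a slow tail
print(percentile(latencies, 50), percentile(latencies, 99))
print(should_page(latencies, slo_ms=500))
```

Here p50 looks perfectly healthy while p99 shows the tail that users actually complain about, which is the point the answer is making.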

140

How do you differentiate between process and thread?

Reference answer

- A process is a program in execution: an instance of a program running with its own address space and resources. - A thread is a unit of execution within a process. - Processes are heavyweight; threads are lightweight. - A process takes more time to terminate; a thread takes less. - Process creation takes more time; thread creation takes less. - Context switching between processes takes more time than between threads. - Processes are isolated from one another; threads of the same process share memory. - Processes do not share data by default; threads share data with each other.

141

What's your strategy for managing secrets and sensitive data in CI/CD pipelines?

Reference answer

Managing secrets and sensitive data is crucial for maintaining the security of CI/CD pipelines. Here's my strategy: - Environment Variables: Sensitive data like API keys and database credentials are stored in environment variables instead of being hardcoded in the code. - Secret Management Tools: I use HashiCorp Vault, AWS Secrets Manager, and Azure Key Vault for securely storing and managing secrets, ensuring they are only accessible to authorized services. - Access Control: Implementing least privilege access ensures that only authorized users and services can access sensitive data. - Encryption: All secrets are encrypted both in transit and at rest using robust encryption algorithms. - Automated Rotation: Implement automated rotation of secrets, keys, and passwords to minimize the risk of exposure over time. This strategy ensures the integrity and security of sensitive data while maintaining operational efficiency in CI/CD pipelines.

142

What are hard links and soft links?

Reference answer

Hard link: A hard link is an additional name (directory entry) for an existing file that points to the same inode and data as the original name. It is not a copy: changes made through one name are visible through all the others, and the data remains accessible through the hard link even if the original name is moved or deleted. Soft link: A soft (symbolic) link is a small pointer file that maps a filename to a pathname. Like a shortcut in Windows, it holds no file contents of its own and simply refers to another file. Removing a soft link does not affect the original file, but removing the original file leaves the soft link dangling.
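The difference is easy to observe on a POSIX system. The sketch below creates both kinds of link in a temporary directory and then deletes the original name; the function name and file names are arbitrary, and it assumes a filesystem that supports hard links and symlinks.

```python
import os
import tempfile


def demo_links():
    """Create a hard link and a soft link, delete the original, report results."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        hard = os.path.join(tmp, "hard.txt")
        soft = os.path.join(tmp, "soft.txt")

        with open(original, "w") as f:
            f.write("hello")

        os.link(original, hard)     # second name for the same inode
        os.symlink(original, soft)  # separate file that stores the target path

        same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
        link_count = os.stat(original).st_nlink

        os.remove(original)         # remove only the original name
        with open(hard) as f:
            hard_contents = f.read()  # data is still reachable via the hard link
        soft_is_dangling = os.path.islink(soft) and not os.path.exists(soft)

        return same_inode, link_count, hard_contents, soft_is_dangling


print(demo_links())
```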

143

What is a circuit breaker pattern, and why is it used?

Reference answer

The circuit breaker pattern is a design pattern used to detect failures and prevent cascading failures in distributed systems. It temporarily blocks requests to a service when failures are detected, allowing the service to recover before resuming normal operations.
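A minimal sketch of the pattern is shown below, assuming a simple consecutive-failure threshold and a fixed reset timeout. The class and parameter names are illustrative; production libraries typically add rolling error-rate windows, per-endpoint state, and thread safety.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock      # injectable so tests can control time
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0       # success closes the circuit again
        self.opened_at = None
        return result
```

Callers wrap each downstream request as `breaker.call(fetch, url)`, where `fetch` stands for whatever function makes the request, and treat `CircuitOpenError` as a signal to serve a fallback rather than wait on a service that is already struggling.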

144

What is the role of monitoring in SRE?

Reference answer

Monitoring in SRE involves tracking system performance, identifying issues, and ensuring that the systems meet the defined SLOs. It helps in proactive incident detection and resolution.

145

What is Site Reliability Engineering (SRE)?

Reference answer

SRE is a discipline that utilizes software engineering principles to manage operations problems. It aims to create highly reliable, scalable systems through automation, measurement, and focusing on metrics like SLOs.

146

Tell me about some of the process improvements you have implemented in the past.

Reference answer

I implemented automated deployment pipelines to reduce manual errors, introduced chaos engineering experiments to uncover weaknesses, streamlined incident response with runbooks and on-call rotations, and standardized monitoring dashboards to improve visibility across teams.

147

Explain the concept of load shedding.

Reference answer

Load shedding involves intentionally dropping or refusing to process some requests when a system is overloaded to protect its core functionality and prevent complete failure. This can be done by prioritizing critical requests and shedding less critical load.
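One simple way to implement this is an admission check that reserves part of the capacity for critical requests. The sketch below is a single-threaded illustration with hypothetical names and limits; a real service would need locking or an atomic counter and would usually return HTTP 503 or 429 to shed requests.

```python
class LoadShedder:
    def __init__(self, capacity, critical_reserve):
        self.capacity = capacity                  # max concurrent requests
        self.critical_reserve = critical_reserve  # slots kept for critical work
        self.in_flight = 0

    def try_admit(self, critical=False):
        """Admit the request if there is room for its priority class."""
        limit = self.capacity if critical else self.capacity - self.critical_reserve
        if self.in_flight >= limit:
            return False  # shed this request
        self.in_flight += 1
        return True

    def release(self):
        """Call when an admitted request finishes."""
        self.in_flight = max(0, self.in_flight - 1)


shedder = LoadShedder(capacity=3, critical_reserve=1)
print([shedder.try_admit() for _ in range(3)])  # third non-critical is shed
print(shedder.try_admit(critical=True))         # critical still gets in
```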

148

What does the term 'SRE' stand for?

Reference answer

The term 'SRE' stands for 'Site Reliability Engineer.' A software engineer with a focus on creating and maintaining dependable systems that can withstand unforeseen environmental changes is known as a site reliability engineer.

149

What steps would you take to reduce alert fatigue in a large-scale environment?

Reference answer

Alert fatigue can occur when there are too many alerts, leading to important issues being ignored. To reduce this, I would: - Implement alert prioritization using severity levels and thresholds. Only send high-priority alerts that require immediate action. - Use noise reduction techniques like grouping similar alerts, suppressing low-impact alerts, and setting rate limits on alerts. - Leverage intelligent alerting with anomaly detection, so the system can automatically determine whether an alert is critical or not. - Incorporate alert acknowledgment and escalation policies to ensure that alerts are handled by the right team.

150

How do you prioritize tasks during an incident?

Reference answer

During an incident, the absolute priority is service restoration and mitigating the immediate impact on users. This involves quick assessment and applying known fixes or workarounds. Communication is also high priority. Root cause analysis comes after the system is stable.

151

What is scaling and forecasting in SRE operations?

Reference answer

Scaling and forecasting in SRE operations balance efficiency against the performance of the environment. Over-provisioning wastes resources and money, and over-forecasting demand leads to the same excess spending. Running with margins that are too tight, on the other hand, can cause degradation and slowness, which hurts users and customers.

152

Describe the incident response lifecycle and the SRE role in it.

Reference answer

The incident response lifecycle includes detection, triage, containment, resolution, and postmortem. The SRE role involves detecting incidents through monitoring and alerts, triaging to assess severity and impact, containing the issue to prevent further damage, resolving the root cause, and leading blameless postmortems to document findings and implement preventive measures to reduce future incidents.

153

What is “Error Budget” and how does it relate to SRE?

Reference answer

An error budget is the amount of unreliability a service's SLO permits over a given window: 100% minus the SLO target, expressed as allowable downtime or failed requests. If the error budget is exhausted, new feature releases may be paused to prioritize reliability improvements.
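The arithmetic is worth being able to do on a whiteboard. As a sketch, assuming a time-based availability SLO below 100% and a 30-day window (both the window length and the function names are illustrative choices):

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime in minutes for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return (budget - downtime_minutes) / budget


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(99.9), 1))
# After 21.6 minutes of downtime, about half the budget is left.
print(round(budget_remaining(99.9, 21.6), 2))
```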

154

Scenario: A critical system component is experiencing high CPU utilization, degrading performance. How do you resolve this?

Reference answer

- Analyze CPU usage: Use tools like top, htop, or Kubernetes metrics to determine which processes or pods are consuming excessive CPU. - Horizontal scaling: If possible, horizontally scale the component by increasing the number of instances or pods. - Code optimization: Profile the application using tools like Flamegraphs or profilers to identify inefficient code paths, loops, or algorithms causing high CPU usage. - Caching: Implement or optimize in-memory caching (e.g., Redis) to reduce redundant processing or expensive computations. - Optimize resource limits: Ensure that CPU resource requests/limits are configured correctly in Kubernetes to avoid bottlenecks due to CPU starvation. Tuning CPU usage requires a mix of horizontal scaling, code optimization, and fine-tuning resource requests.

155

How can using the budget be part of the Service Level Agreement (SLA)?

Reference answer

A Service Level Agreement (SLA) might include financial provisions for allocating resources, such as staff and equipment, to particular activities or projects. Under the SLA, those resources can then be deployed to finish the work on schedule and to the agreed-upon criteria.

156

What happens when you type a URL into a browser?

Reference answer

This classic question tests breadth of knowledge across DNS, TCP/IP, TLS, HTTP, and application layers.

157

Check if a string is a palindrome (Python).

Reference answer

Python: `def is_palindrome(s): return s == s[::-1]`
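Interviewers often follow up by asking about mixed case, spaces, or punctuation. One possible variant, assuming the intent is to compare only letters and digits case-insensitively (the function name is an illustrative choice):

```python
def is_palindrome_normalized(text):
    """Palindrome check that ignores case, spaces, and punctuation."""
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]


print(is_palindrome_normalized("A man, a plan, a canal: Panama"))
print(is_palindrome_normalized("hello"))
```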

158

How do you set effective alert thresholds to balance information and noise?

Reference answer

The best candidates will know how to set up alert thresholds that balance information and noise. Expect them to talk about analyzing the normal operating ranges of systems and services and looking into historical performance data. Candidates should also mention the practice of simultaneously using static thresholds for fixed values, and dynamic thresholds, which adjust based on trends or patterns. For example, they might set static thresholds for critical system resources, such as 90% disk space usage, to prevent service disruption. As for dynamic thresholds, they could use them for metrics like CPU usage, where normal ranges might vary depending on the time of day or workload.
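The two kinds of threshold can be contrasted in a few lines. The sketch below treats a dynamic threshold as "mean plus k standard deviations of recent history"; that rule, the value of k, and the function names are assumptions for illustration, and real systems often use seasonality-aware baselines instead.

```python
import statistics


def static_breach(value, limit):
    """Fixed threshold, e.g. disk usage at or above 90%."""
    return value >= limit


def dynamic_breach(value, history, k=3.0):
    """True if value exceeds mean + k * stdev of recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate a normal range
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return value > mean + k * stdev


cpu_history = [40, 42, 38, 41, 39]       # recent CPU usage samples, in percent
print(static_breach(92, 90))             # disk at 92% against a 90% limit
print(dynamic_breach(45, cpu_history))   # unusual for this workload
print(dynamic_breach(44, cpu_history))   # still within the normal range
```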

159

What is LILO?

Reference answer

LILO (Linux Loader) is a bootloader for Linux: it loads the Linux kernel into memory and starts the operating system. It is also known as a boot manager because it lets a computer dual boot. It can act as the master boot program or as a secondary boot program, and it handles tasks such as locating the kernel, identifying other supporting programs, loading them into memory, and launching the kernel. Linux needs a bootloader to start, and LILO was once the default on many distributions, but it has largely been superseded by GRUB.

160

What is Transmission Control Protocol (TCP)?

Reference answer

TCP (Transmission Control Protocol) is one of the main protocols of the Internet protocol suite. It operates at the transport layer, between the application and network layers, and provides reliable, ordered delivery. It is a connection-oriented protocol that lets devices exchange messages over a network. TCP works with the Internet Protocol (IP), which defines how data packets are addressed and sent between computers.

161

What is 'SRE' and how does it relate to DevOps?

Reference answer

SRE (Site Reliability Engineering) is a discipline that applies software engineering principles to operations, with a focus on reliability, automation, and scalability. DevOps is a broader culture and practice that emphasizes collaboration between development and operations. SRE can be seen as a specific implementation of DevOps principles, providing concrete practices like SLOs, error budgets, and toil reduction to achieve reliability goals.

162

List common Linux signals.

Reference answer

- `SIGHUP` (1): Reload configurations.
- `SIGINT` (2): Interrupt process (Ctrl+C).
- `SIGKILL` (9): Force termination.
- `SIGTERM` (15): Graceful shutdown.
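
The numbers above can be checked from Python's standard `signal` module on Linux. The sketch below also shows one common way a service reacts to `SIGTERM`; the names `state`, `request_shutdown`, and `install_handlers` are hypothetical, and the handler is only registered if `install_handlers()` is called from the main thread.

```python
import signal

COMMON_SIGNALS = ("SIGHUP", "SIGINT", "SIGKILL", "SIGTERM")

# Look up the numeric value of each signal (values shown are for Linux).
signal_numbers = {name: int(getattr(signal, name)) for name in COMMON_SIGNALS}
print(signal_numbers)

# Hypothetical shutdown flag that a main loop could poll.
state = {"shutdown_requested": False}


def request_shutdown(signum, frame):
    """Handler: record the request so the main loop can exit cleanly."""
    state["shutdown_requested"] = True


def install_handlers():
    """Register the handler for SIGTERM. SIGKILL cannot be caught,
    blocked, or ignored, so no handler can be installed for it."""
    signal.signal(signal.SIGTERM, request_shutdown)
```

A long-running process would call `install_handlers()` at startup and check `state["shutdown_requested"]` between units of work.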

163

How does TCP differ from UDP, and when would you use each?

Reference answer

TCP is connection-based and reliable (used for HTTP, SSH). UDP is faster but doesn't guarantee delivery (used in DNS, video streaming). Choose based on latency vs reliability tradeoffs.

164

What are crucial aspects of SRE operations?

Reference answer

Monitoring and alerting, reducing the need for human attention through automation, capacity planning and forecasting, scaling, and ensuring resources are available during big events or product launches.

165

Explain how error budgeting influences release velocity and engineering priorities.

Reference answer

Error budgeting directly influences release velocity: if the error budget is not exhausted, teams can release new features more freely, as there is room for acceptable risk. When the budget is low or exhausted, the velocity slows down because priority shifts to reliability improvements and incident response. This creates a feedback loop where teams must balance feature development with maintaining SLOs, ensuring that reliability is not sacrificed for speed. It also encourages proactive investment in automation and resiliency.
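
The feedback loop can be sketched as a simple release gate. This is an illustration only: the function names, the 25% low-water mark, and the three policy labels are assumptions, and real teams define their own error budget policy.

```python
def remaining_error_budget(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


def release_policy(remaining, low_water_mark=0.25):
    """Map the remaining budget to a release stance (assumed thresholds)."""
    if remaining <= 0:
        return "freeze"   # budget exhausted: reliability work only
    if remaining < low_water_mark:
        return "slow"     # budget low: fewer, smaller, safer releases
    return "normal"       # budget healthy: ship features freely


# Hypothetical month: 1,000,000 requests against a 99.9% SLO.
for failed in (200, 900, 1200):
    remaining = remaining_error_budget(0.999, 1_000_000, failed)
    print(failed, "failures ->", release_policy(remaining))
```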

166

How do you implement security standards in an SRE role?

Reference answer

In a Site Reliability Engineering role, implementing security standards involves ensuring the infrastructure is set up and maintained securely, applications are developed and deployed securely, and that data is handled in a secure way. For the infrastructure, I follow the principle of least privilege, meaning individuals or services only have the permissions necessary to perform their tasks, limiting the potential damage in case of a breach. I apply regular security updates and patches, keep systems properly hardened and segmented, and ensure secure configurations. When it comes to applications, I work closely with the dev team to ensure secure coding practices are followed, and that all code is regularly reviewed and tested for security issues. I implement security mechanisms such as encryption for data in transit and at rest, two-factor authentication, and robust logging and monitoring to detect and respond to threats promptly. In one of my past roles, I also led the implementation of a comprehensive IAM (Identity and Access Management) strategy where we streamlined, monitored, and audited all account and access-related matters, significantly enhancing our system's security posture. Through ongoing security training and staying updated on the latest security trends, I continually work toward maintaining a strong security culture in the team.

167

What steps would you take to reduce system downtime?

Reference answer

- Improve monitoring and alerting.
- Automate routine tasks to reduce human error.
- Use blue/green deployments or canary releases to safely roll out changes.
- Design systems with high availability (HA) using load balancers, redundancy, and failover mechanisms.

168

What are the key principles of SRE?

Reference answer

- Embracing and managing risk: Utilizing error budget to implement and test new features.
- Maintaining Service Level Objectives: Tracking and comparing SLIs to your SLOs to ensure you meet your SLA.
- Eliminating toil: Reducing repetitive mundane tasks that can be automated, allowing for better use of time.
- Monitoring: Keeping track of systems and performance to address issues before they become real problems.
- Automation: Implementing automation to reduce toil.
- Release engineering: The technical aspects of compiling, assembling, and delivering source code.
- Simplicity: It's easier to understand the effect of small simple changes over large batch changes.

169

Describe your experience with orchestration and containerization technologies.

Reference answer

Absolutely. Throughout my career, I've gained significant experience with both orchestration and containerization technologies. I've used Docker extensively for containerizing applications. With Docker, I've isolated application dependencies within containers, which made the applications more portable, scalable, and easier to manage. As for orchestration, I have solid experience with Kubernetes. I've used Kubernetes in production environments for automating the deployment, scaling, and management of containerized applications. Kubernetes helped us ensure that our applications were always running the desired number of instances, across numerous deployment environments. It also handled the networking aspects, allowing communication between different services within the cluster. In one of my past roles, I managed a project that involved moving our monolithic application to a microservices architecture. We used Docker for containerizing each microservice, and Kubernetes as the orchestration platform, allowing us to scale each microservice independently based on demand and efficiently manage the complexity of running dozens of inter-related services. The move significantly improved our system's reliability and resource usage efficiency.

170

What is the significance of a distributed cache?

Reference answer

A distributed cache improves system performance and scalability by storing frequently accessed data in memory across multiple nodes. This reduces database load, decreases latency, and speeds up data retrieval.

171

SRE vs DevOps: What's the Difference Between Them?

Reference answer

DevOps and Site Reliability Engineering (SRE) both aim to improve applications and services while they are in use, and both are important in modern IT organizations. However, there are clear differences between them:

| DevOps | SRE |
|---|---|
| DevOps involves the development of software that can be updated and modified while it is running. | Site reliability engineering focuses on keeping an application or service up and running. |
| DevOps teams often use automation tools to improve their workflow. | Site reliability engineers work with both automation tools and humans to ensure the service continues to operate smoothly. |
| DevOps deals with when and how software is built. | Site reliability engineering focuses on what happens once it's built. |

172

What's your framework for deciding what to automate versus what to leave manual?

Reference answer

The textbook answer is 'automate anything you do more than three times.' The experienced answer is more nuanced. Some tasks are done frequently but are so variable that automation costs more to maintain than the manual effort saves. Some tasks are done rarely but carry enough blast radius that building automation with proper guardrails and a dry-run mode is worth the investment even if the script only runs twice a year, because the one time a human fat-fingers the manual version at 2 AM is the time it takes down the database.

173

How was the relationship between your operations and engineering team?

Reference answer

An SRE is involved in multiple aspects of the engineering organization and the business, which gives them a unique perspective on areas for improvement. They need to maintain smooth relationships within and across departments and identify bottlenecks in productivity. With this question, the hiring manager is trying to determine how you would work collaboratively with different teams and resolve issues between cross-functional teams.

174

What scripting languages are you comfortable with for automating SRE tasks?

Reference answer

I'm comfortable with Python and Bash for automation. I use them for tasks such as automating deployments, parsing logs for analysis, setting up monitoring configurations, and scripting routine maintenance operations.

175

Your service is growing 15% month over month. When do you need to scale, and how do you decide between vertical and horizontal scaling?

Reference answer

The math matters. The organizational question matters more: who owns the capacity forecast, how far ahead do you plan, and what happens when the forecast is wrong in the expensive direction? Budget awareness is an SRE skill that most prep guides skip entirely.
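
For the math itself, compound growth gives the lead time directly. The sketch below assumes hypothetical numbers (40% utilization today and a scale trigger at 80%); only the 15% monthly growth rate comes from the question.

```python
import math


def months_until_capacity(current_load, capacity, monthly_growth=0.15):
    """Months until load reaches capacity under compound monthly growth."""
    if current_load >= capacity:
        return 0.0
    return math.log(capacity / current_load) / math.log(1.0 + monthly_growth)


# Hypothetical numbers: 40% utilized today, scaling triggered at 80%.
months = months_until_capacity(40.0, 80.0)
print(f"{months:.1f} months until the 80% trigger")
```

At 15% month over month, load doubles in roughly five months, so procurement or migration lead times longer than that need to be started now.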

176

What is subnetting? What is Network Id? Why do we use classless addressing?

Reference answer

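Dividing a large block of addresses into several contiguous sub-blocks and assigning these sub-blocks to different smaller networks is called subnetting. It is a practice that is widely used when classless addressing is done. A subnet or subnetwork is a network inside a network. Subnets make networks more efficient. Through subnetting, network traffic can travel a shorter distance without passing through unnecessary routers to reach its destination. The network ID is the portion of an IP address that identifies the network, as selected by the subnet mask. Classless addressing (CIDR) is used because it lets address blocks be sized to actual need instead of the fixed sizes of the old class A, B, and C ranges.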
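
Subnetting and the network ID can be explored with Python's standard `ipaddress` module. The address block and host below are arbitrary private-range examples chosen for illustration.

```python
import ipaddress

# Split one /24 block into four contiguous /26 sub-blocks.
block = ipaddress.ip_network("192.168.0.0/24")
subnets = list(block.subnets(new_prefix=26))

for subnet in subnets:
    usable_hosts = subnet.num_addresses - 2  # minus network and broadcast addresses
    print(subnet, "network ID:", subnet.network_address, "usable hosts:", usable_hosts)

# Find which subnet a given host address belongs to.
host = ipaddress.ip_address("192.168.0.130")
owner = next(s for s in subnets if host in s)
print(host, "belongs to", owner)
```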

177

Explain blue-green deployment.

Reference answer

Blue-green deployment is a strategy where two identical environments (blue and green) are maintained. The new version is deployed to the green environment while the blue environment continues to serve users. Traffic is then switched to the green environment.

178

Walk me through how you'd troubleshoot a memory leak in a production service.

Reference answer

First, I'd pull memory metrics over time to confirm it's actually growing. Sometimes what looks like a leak is just seasonal traffic patterns. Assuming it's real, I'd check garbage collection behavior—if the old generation is growing, that suggests memory that's not being reclaimed. I'd enable memory profiling for the service, which gives me a breakdown of which objects are consuming memory. Usually, it's a cache that's not bounded, event listeners not being cleaned up, or something holding references to data that should be garbage collected. Once I identify the cause, we'd implement a fix—maybe add an eviction policy to the cache or fix the listener cleanup. We'd deploy it to a single instance first, monitor it, then roll it out. To prevent this, we'd add monitoring for memory growth rate as a metric we track—if memory is growing 10% per hour, that's worth investigating before it brings down the service.
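
The growth-rate check mentioned at the end could look something like the following sketch. The 10% per hour threshold comes from the answer above; the function names and the sample data are hypothetical, and a real setup would compute this in the monitoring system rather than in a script.

```python
def hourly_growth_rate(samples):
    """Average compound growth per hour between the first and last sample.

    `samples` is a time-ordered list of (hours_since_start, memory_mb) pairs.
    """
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    elapsed_hours = t1 - t0
    if elapsed_hours <= 0 or m0 <= 0:
        raise ValueError("need two samples with increasing time and positive memory")
    return (m1 / m0) ** (1.0 / elapsed_hours) - 1.0


def should_investigate(samples, threshold=0.10):
    """True when memory grows at least `threshold` (10%) per hour."""
    return hourly_growth_rate(samples) >= threshold


leaking = [(0, 1000.0), (1, 1150.0), (2, 1300.0)]  # roughly 14% per hour
stable = [(0, 1000.0), (1, 1010.0), (2, 1005.0)]   # essentially flat
print(should_investigate(leaking))
print(should_investigate(stable))
```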

179

What is Service Level Objective (SLO)?

Reference answer

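An SLO sets a target for an SLI aggregated over a time window (for example, 99.9% of requests succeeding over 30 days) and defines what you are willing to do when that target is at risk.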

180

Tell me about a time you had to resolve a conflict in a team/group.

Reference answer

During a project, two team members disagreed on the database schema design. I facilitated a meeting where each presented their approach with trade-offs. I encouraged focusing on project goals rather than personal preferences. We agreed on a hybrid solution that combined strengths of both designs. I ensured everyone felt heard and documented the decision. The project stayed on schedule, and the team collaborated better afterward.

181

What is the 'PagerDuty' or on-call rotation, and how do you design it effectively?

Reference answer

On-call rotation schedules engineers to handle alerts and incidents outside normal hours. Effective design includes: balanced distribution (fair load), clear escalation policies, secondary backups, and limiting shifts to prevent burnout. SREs also ensure on-call engineers have proper runbooks and tools, and that alerts are actionable and not noisy. Rotation schedules should be reviewed regularly based on incident volume and feedback.

182

Write a Python script to parse logs and extract error counts.

Reference answer

    import re
    from collections import Counter

    error_counts = Counter()
    with open('app.log', 'r') as f:
        for line in f:
            if 'ERROR' in line:
                # Extract the error type: the first word after "ERROR"
                match = re.search(r'ERROR\s+(\w+)', line)
                if match:
                    error_counts[match.group(1)] += 1

    for error, count in error_counts.items():
        print(f'{error}: {count}')

183

What is a Redundant Array of Independent Disks (RAID)?

Reference answer

A Redundant Array of Independent Disks (RAID) is a storage system that combines more than one hard disk to provide extra redundancy in the event that one disk fails. RAID is frequently used in networks and server farms.

184

Prometheus vs. Grafana

Reference answer

- Prometheus: Collects and stores metrics.
- Grafana: Visualizes metrics via dashboards.

185

How do you stay updated with the latest trends and technologies in SRE?

Reference answer

I stay updated with the latest trends and technologies in SRE by following industry blogs and subscribing to newsletters from reputable sources. Additionally, I participate in online communities and attend webinars, conferences, and workshops to learn from experts and network with peers.

186

Describe the Sharding process. How does sharding improve performance?

Reference answer

Sharding is a method of dividing a database into multiple pieces (shards). Each shard stores a subset of the data and can serve queries for that subset independently. Sharding makes it possible to distribute the workload across many more servers, which can reduce the time it takes to process queries and improve performance. Sharding is also useful when you need to store a very large number of objects: each object lives in exactly one shard, chosen by a shard key. Sharding can improve performance in two main ways:

- By splitting the workload into smaller jobs, it becomes possible to spread the load across many machines.
- By storing objects in separate shards, it becomes possible to read only the shard that holds the data being requested.
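
One common way to decide which piece holds a record is to hash a key. The sketch below is an illustration under stated assumptions: four in-memory dictionaries stand in for database servers, and `shard_for`, `put`, and `get` are hypothetical helper names, not any particular database's API.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for illustration

# Each dict stands in for a separate database server.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def put(key, value):
    shards[shard_for(key)][key] = value


def get(key):
    # Only the one shard that owns the key is consulted.
    return shards[shard_for(key)].get(key)


for user_id in range(1000):
    put(f"user:{user_id}", {"id": user_id})

print([len(s) for s in shards])  # records spread across the shards
print(get("user:42"))
```

Plain modulo hashing moves most keys when the shard count changes, which is why production systems often use consistent hashing or range-based schemes instead.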

187

What is the Linux Shell?

Reference answer

The Linux shell is a command-line interface (CLI) that enables user interaction with the system. It offers a text-based interface for running system commands, managing files, and issuing other instructions.

188

Explain the difference between vertical scaling and horizontal scaling.

Reference answer

Vertical scaling means increasing the resources (like CPU, RAM, storage) of an existing server. Horizontal scaling means adding more servers or instances to a system to distribute the load, which is generally more flexible and resilient for large systems.

189

What is incident management in the context of SRE?

Reference answer

Incident management involves detecting, responding to, and resolving incidents to minimize the impact on services and ensure quick recovery and restoration.

190

Why do we use a DHCP server?

Reference answer

- To assign IP addresses and other networking parameters (such as the subnet mask, default gateway, and DNS servers) to devices automatically.
- To reduce the need for a network administrator or a user to manually assign IP addresses to all network devices.

191

Scenario: A critical application is experiencing intermittent slow response times. How would you troubleshoot?

Reference answer

- Check logs for patterns during slow response times.
- Monitor metrics such as CPU, memory, disk I/O, and network throughput.
- Profile the application to identify slow queries or bottlenecks in code execution.
- Investigate external dependencies (e.g., third-party APIs or databases).
- Correlate slow response times with specific events or user actions.

192

How do you implement a robust backup and disaster recovery strategy?

Reference answer

A robust strategy includes: regular automated backups (snapshots, full/incremental), off-site or cloud storage for redundancy, encryption of backups, and periodic recovery drills to verify data integrity and restore processes. SREs define Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) based on business needs. The strategy must cover critical services, databases, and configuration data, with clear runbooks for failover and recovery.

193

Scenario: You are experiencing high CPU usage on a critical production server. How would you address this?

Reference answer

- Identify the culprit process using monitoring tools or `top`.
- Scale up or out by adding more resources.
- Investigate potential memory leaks or inefficient queries and optimize code.
- Implement auto-scaling to prevent future occurrences.

194

Which programming languages are you most adept at working with as a site reliability engineer?

Reference answer

I'm most adept at working with Python, as it's been the primary language I've used in my roles as a site reliability engineer. I've used it extensively for scripting and automation tasks given its simplicity and powerful libraries. Apart from Python, I'm comfortable with Go due to its excellent support for concurrent programming which proves to be very useful when working with distributed systems. Besides these, I have a solid foundational understanding of Java and Bash scripting, and I've had some experience using them in specific projects.

195

Explain DNS and its importance.

Reference answer

DNS or Domain Name System translates domain names into IP addresses so browsers can load webpages. DNS servers allow the average user to type words into their browser and find the pages they are looking for without having a phonebook of IP addresses.

196

Design a distributed job scheduler that processes 10 million tasks per day with a 99.9% completion SLO.

Reference answer

A software engineer designs for throughput. An SRE designs for what happens when a worker node dies mid-task, when the queue backs up past capacity, when a dependency goes intermittent, and when two of those things happen simultaneously. The answer needs to address failure modes explicitly. Not as an afterthought. As the primary design constraint.
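
A quick back-of-the-envelope calculation helps frame those failure modes. The task volume and SLO come from the question; the 5% per-attempt failure rate and the assumption that attempts fail independently are hypothetical simplifications.

```python
import math

TASKS_PER_DAY = 10_000_000
COMPLETION_SLO = 0.999
SECONDS_PER_DAY = 24 * 60 * 60

average_rate = TASKS_PER_DAY / SECONDS_PER_DAY  # tasks per second, on average
allowed_incomplete = round(TASKS_PER_DAY * (1 - COMPLETION_SLO))


def attempts_needed(attempt_failure_rate, slo=COMPLETION_SLO):
    """Smallest number of attempts so that P(every attempt fails) <= 1 - slo,
    assuming attempts fail independently (a simplifying assumption)."""
    if not 0 < attempt_failure_rate < 1:
        raise ValueError("failure rate must be between 0 and 1")
    return math.ceil(math.log(1 - slo) / math.log(attempt_failure_rate))


print(f"average load: {average_rate:.1f} tasks/second")
print(f"tasks allowed to miss per day: {allowed_incomplete}")
print(f"attempts needed at a 5% per-attempt failure rate: {attempts_needed(0.05)}")
```

Correlated failures, such as a dependency outage, break the independence assumption, which is why the design still needs dead-letter handling and backpressure on top of retries.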

197

What is the Google SRE interview process like, step by step?

Reference answer

The Google SRE interview process begins with a recruiter outreach, typically via LinkedIn, followed by submitting a resume. The process includes multiple stages: initial phone screens, technical interviews focusing on systems design, coding, and troubleshooting, and on-site interviews. Key areas assessed include distributed systems, automation, incident response, and cultural fit.

198

Explain the concept of an 'error budget.'

Reference answer

An error budget quantifies the maximum acceptable downtime for a service. For example, a 99.9% uptime SLO allows 8.76 hours/year downtime. Teams use this budget to prioritize feature releases or reliability improvements.
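
The 8.76 hours figure follows from multiplying the allowed failure fraction by the hours in a year. A minimal sketch, assuming a 365-day year and a hypothetical helper name:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def allowed_downtime_hours(slo, period_hours=HOURS_PER_YEAR):
    """Downtime permitted by an availability SLO over the given period."""
    return (1.0 - slo) * period_hours


for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} uptime -> {allowed_downtime_hours(slo):.2f} hours/year")
```

Each additional nine cuts the budget by a factor of ten, which is why the choice of SLO has such a large effect on how much risk a team can take.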

199

What is the 'suspend ready state' of a process?

Reference answer

The term 'suspend ready state' refers to a process that is ready to run but has been moved from main memory to secondary memory due to a lack of resources (primarily main memory). If main memory is full and a higher-priority process arrives for execution, the OS must move a lower-priority process to secondary memory to make room. Processes in the suspend ready state are held in secondary storage until main memory becomes available.

200

What are the 'Four Golden Signals' of monitoring?

Reference answer

- Latency: Time to serve requests.
- Traffic: Request volume (e.g., queries per second).
- Errors: Rate of failed requests.
- Saturation: Resource utilization (e.g., CPU, memory).
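
As a rough illustration, the four signals can be derived from one window of request records. The record shape, the function name, the nearest-rank 95th percentile, and the use of a caller-supplied CPU fraction for saturation are all assumptions made for this sketch.

```python
import math


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize one window of traffic.

    `requests` is a list of (latency_ms, succeeded) tuples; `cpu_utilization`
    is a 0..1 fraction supplied by the caller as a stand-in for saturation.
    """
    total = len(requests)
    if total == 0:
        return {"latency_p95_ms": None, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": cpu_utilization}
    latencies = sorted(latency for latency, _ in requests)
    p95 = latencies[math.ceil(0.95 * total) - 1]  # nearest-rank percentile
    failures = sum(1 for _, succeeded in requests if not succeeded)
    return {
        "latency_p95_ms": p95,
        "traffic_rps": total / window_seconds,
        "error_rate": failures / total,
        "saturation": cpu_utilization,
    }


# Hypothetical window: ten requests in five seconds, one of them failed.
sample = [(10.0 * i, i != 7) for i in range(1, 11)]
signals = golden_signals(sample, window_seconds=5, cpu_utilization=0.72)
print(signals)
```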
