1

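Reference answer

To calculate the Fibonacci sequence up to a given number, you can use a simple iterative approach. Here's a Python function that prints every Fibonacci number below n:

    def fibonacci(n):
        # Start from the first two Fibonacci numbers.
        a, b = 0, 1
        while a < n:
            print(a, end=' ')
            # Advance the pair: the next number is the sum of the previous two.
            a, b = b, a + b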

2

Reference answer

My approach to monitoring system performance is proactive and data-driven. I use tools such as Prometheus and Grafana for real-time monitoring and visualization of system metrics. I focus on key performance indicators like CPU usage, load averages, memory usage, and network IO stats. Based on the insights derived from these metrics, I devise strategies to enhance system performance. For instance, if I observe a consistent memory bottleneck, I might suggest scaling up the server or optimizing the application to use less memory.

3

Reference answer

    awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -5
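For comparison, the same count can be sketched in Python. This is only an illustration: it assumes, as the awk command does, that the first whitespace-separated field of each log line is the value to count, and the function name `top_clients` and the sample lines are hypothetical.

```python
from collections import Counter


def top_clients(lines, limit=5):
    """Return the `limit` most frequent first fields, with their counts."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(limit)


# Hypothetical sample lines standing in for access.log.
sample = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.1 GET /contact.html",
]
for client, hits in top_clients(sample):
    print(hits, client)
```

To run it against a real file, pass the open file object instead of the sample list, for example `top_clients(open("access.log"))`.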

4

Reference answer

The Transmission Control Protocol (TCP) is one of the main protocols of the Internet protocol suite. It originated in the initial network implementation, in which it complemented the Internet Protocol (IP), so the suite is commonly referred to as TCP/IP. A few of the TCP connection states are:

1) LISTEN – The server is listening on a port (for example, port 80 for HTTP) for incoming connections.
2) SYN-SENT – The client has sent a SYN request and is waiting for a response.
3) SYN-RECEIVED – The server has received a SYN, replied with a SYN-ACK, and is waiting for the final ACK.
4) ESTABLISHED – The three-way TCP handshake has completed.

5

Reference answer

Addressing a performance issue in a distributed system involves pinpointing where the performance bottleneck is and then identifying the underlying problem. Effective monitoring and observability tools are crucial here - they can provide key insights into aspects like network latency, CPU usage, memory usage, and disk I/O across each part of the distributed system. Once a potential source of the problem is identified, I would dive deeper into it. For example, if a particular service is using too much CPU, I would look into whether it's due to a sudden surge in requests, inefficient code, or a need for more resources. After identifying the root cause, the solution could range from scaling the resources, to optimizing the code or algorithm for efficiency, to re-architecting the system if required. A common approach for handling performance issues in distributed systems is also to load-balance requests and apply caching mechanisms where appropriate. Post-resolution, it's also important to document the incident and maintain a record of what was done to solve the issue. This record is valuable for tackling similar issues in the future and for identifying patterns that could help optimize the distributed system's design.

6

Reference answer

The Linux kill command sends a signal to one or more processes, identified by process ID (PID). By default it sends SIGTERM, which asks the process to terminate and gives it a chance to clean up; kill -9 sends SIGKILL, which ends the process immediately and cannot be caught or ignored. By using the kill command, you can close down a malfunctioning application, stop a misbehaving service, or terminate misbehaving jobs in batch scripts. It only acts on processes that are currently running and that you have permission to signal. The kill command is not the tool for rebooting or powering off a server; commands such as shutdown or reboot are used for that.
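The signal-sending behavior can be sketched from Python with `os.kill`, which wraps the same system call as the kill command. This is a minimal sketch assuming a POSIX system; the helper name `terminate_child` and the sleeping child process are hypothetical and exist only for the illustration.

```python
import os
import signal
import subprocess
import sys


def terminate_child():
    """Start a sleeping child process, send it SIGTERM, and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # Equivalent to running `kill <pid>` in a shell: SIGTERM is the default signal.
    os.kill(child.pid, signal.SIGTERM)
    # On POSIX, a negative return code -N means the child was ended by signal N.
    return child.wait(timeout=10)
```

Calling `terminate_child()` returns the negated signal number, which shows that the child was ended by the signal instead of exiting on its own.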

7

Reference answer

The class and method signatures are already given, so we only need to implement the logic. We can store the URLs in an array used as a stack and keep two indices: curr, the page currently shown, and top, the last page that can still be reached by going forward. Visiting a page writes it just after curr and discards any forward history by resetting top. So, the solution is:

    class BrowserHistory {
        // Array used as a stack of visited URLs (room for the homepage plus 5000 visits).
        String[] stack;
        // top: index of the last reachable page; curr: index of the current page.
        int top, curr;

        public BrowserHistory(String homepage) {
            stack = new String[5001];
            stack[top] = homepage;
        }

        public void visit(String url) {
            // Store the page after the current one and drop any forward history.
            stack[++curr] = url;
            top = curr;
        }

        public String back(int steps) {
            // Move the pointer backward, stopping at the first page.
            while (curr > 0 && steps > 0) {
                curr--;
                steps--;
            }
            return stack[curr];
        }

        public String forward(int steps) {
            // Move the pointer forward, stopping at the newest reachable page.
            while (curr < top && steps > 0) {
                curr++;
                steps--;
            }
            return stack[curr];
        }
    }

visit runs in O(1). back and forward run in O(steps), because the pointer moves at most steps positions through the stack.

8

Reference answer

Fault injection involves deliberately introducing errors or faults into a system to test its resilience and ability to recover. This can be done using tools like Chaos Monkey, Gremlin, or by simulating network failures, server crashes, or high latency conditions.

9

Reference answer

Implement SLOs by defining acceptable levels of reliability, then identify key metrics (SLIs) that reflect those levels. Monitor and refine these metrics based on real-world data.

10

Reference answer

Estimating the error budget is the first step in assessing service risk. The budget has to be estimated in advance so that work can be planned around it. The traditional time-based approach, dividing the time the service was good by the total service time, is difficult to apply: when one of a service's servers is down, it is not easy to determine whether the service is entirely down or only partly down.

11

Reference answer

| DevOps | Site Reliability Engineering (SRE) |
|---|---|
| Software development and operations | System reliability |
| Holistic, cultural, and mindset-driven | Technical and software-first |
| A wider range of organizations | More specialized, typically large tech companies |
| Break down silos, automate tasks, and improve communication between development and operations | Ensure the reliability, scalability, and performance of IT systems |
| Continuous integration and delivery (CI/CD), infrastructure as code, and monitoring and observability | Error budgeting, service level objectives (SLOs), and incident management |

12

Reference answer

This may not lead anywhere, but I'm looking for a discussion about what their data auditing procedures look like, and how easy it is to answer security questions about their data quickly.

13

Reference answer

A linked list is a data structure in which each element is a separate node, and the nodes are connected (linked) using pointers. The list starts with a head, which is a reference to the first node in the list. Each node holds a data element and a reference to the next node. The final node, the tail, holds its data element and a reference to null, indicating the end of the list.
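The structure described above can be sketched as a singly linked list in Python. The class and method names (`Node`, `LinkedList`, `push_front`) are illustrative choices, and Python's `None` plays the role of the null reference.

```python
class Node:
    """One element of the list: a data value plus a reference to the next node."""

    def __init__(self, data):
        self.data = data
        self.next = None  # None marks the end of the list


class LinkedList:
    def __init__(self):
        self.head = None  # reference to the first node; None when the list is empty

    def push_front(self, data):
        """Insert a new node before the current head."""
        node = Node(data)
        node.next = self.head
        self.head = node

    def __iter__(self):
        """Walk the list from the head until the null reference is reached."""
        node = self.head
        while node is not None:
            yield node.data
            node = node.next


items = LinkedList()
for value in (3, 2, 1):
    items.push_front(value)
print(list(items))  # [1, 2, 3]
```

Inserting at the head takes constant time because only the head reference changes; reaching the k-th element requires following k links.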

14

Reference answer

System reliability is measured using metrics like uptime, response time, and error rates. Continuous improvement involves analyzing incidents, implementing fixes, and refining monitoring and automation.

15

Reference answer

One of the most challenging problems I had to solve involved a persistent memory leak in a critical service of our system. The service would run fine for a few days but would eventually run out of memory and crash, causing disruptions. Initial efforts to isolate the issue using regular debugging methods were not successful because the issue took days to manifest and was not easily reproducible in a non-production environment. To tackle this, I first ensured we had good monitoring and alerting set up for memory usage on this service, to give us immediate feedback on our efforts. We also arranged for temporary measures to restart the service automatically when memory usage approached dangerous levels, to minimize disruptions to our users. Next, I wrote custom scripts to regularly capture and store detailed memory usage data of the service in operation. After we had collected a few weeks worth of data, I started analysing the data patterns in depth. Upon combining this analysis with code review of the service, we managed to narrow it down to a specific area of the code where objects were being created but not released after use. After identifying the issue, we updated the code to ensure proper memory management and monitored the service closely. With the fix, the service ran smoothly and memory usage remained stable over time. It was a challenging and prolonged problem to solve but it was rewarding in the end, and it significantly improved the stability of our system.

16

Reference answer

In my previous role, I experienced a critical site downtime situation due to an unexpected surge in traffic. The first move I made was to acknowledge the issue and gather all available data about the disruption from our monitoring systems. I then quickly assembled our response team, which included fellow site reliability engineers, network specialists, and the relevant app developers, to look into the issue and pinpoint the root cause. Once we found that the traffic surge was overwhelming our database capacity, we temporarily mitigated the situation by redirecting some of the traffic to a backup site. Simultaneously, we worked on expanding server capacity and tweaking the load balancing configurations to handle the increased load. Once the changes were complete and tested, we gradually shifted the traffic back to the main site and monitored closely to ensure stability. We then did a detailed incident review, and consequently improved our capacity planning and automated scaling processes to prevent such scenarios in the future.

17

Reference answer

Chaos engineering is the practice of intentionally introducing failures into a system to test its resilience and identify weaknesses. It is crucial in SRE as it helps proactively prevent outages and ensures high availability by improving system reliability.

18

Reference answer

Bonus points if they start by talking about a bare metal server. Virtualization installs a control layer on top of a set of bare metal servers to create a pool of resources from the combination of the physical resources of those servers. It then allows you to create "virtual machines" that have a varied combination of memory, storage, and processor resources according to need, each machine with its own operating system. Virtual machines can be created and destroyed quickly and easily. Containers are similar, except they do not contain the base layer operating system. Instead the control layer provides the operating system access while also keeping the containers and their processes isolated from one another. Containers include software such as a microservice along with all of the software dependencies required to run that software. This provides isolation and flexibility. Kubernetes adds an orchestration layer to containers, making the management of them, especially large systems, easier.

19

Reference answer

I have experience with Prometheus and Grafana for time-series monitoring and visualization, Datadog for unified monitoring, and the ELK stack for log aggregation and analysis. I've configured alerts in these systems based on critical metrics.

20

Reference answer

For capacity planning, I would:

- Estimate traffic volume based on historical data, expected growth, and business forecasts.
- Understand service dependencies, including microservices, databases, and third-party APIs, to assess their scalability.
- Analyze past performance using metrics like CPU, memory usage, and I/O bandwidth to estimate resource needs.
- Run load tests using tools like Apache JMeter or Locust to determine how the service performs under heavy traffic.
- Calculate redundancy and failover requirements to ensure high availability.

Having accurate data on traffic patterns, resource utilization, and failure rates is essential for creating a reliable capacity plan.

21

Reference answer

DHCP stands for Dynamic Host Configuration Protocol. It is a network protocol that lets a network assign IP addresses to hosts dynamically. Devices such as PCs and routers are given IP addresses through DHCP. A device needs an IP address before it can connect to the network or the internet, so when a new system is added, DHCP provides it with an IP address so that it can access the network.

22

Reference answer

There's a difference between monitoring ('is the system up?') and observability ('why is it behaving this way?'). We use the RED method for application metrics: Rate, Errors, Duration. Prometheus scrapes metrics from our applications every 30 seconds. For infrastructure, we track CPU, memory, disk, and network. But the real power is in observability. We use structured logging with JSON payloads so we can actually query logs meaningfully, and we have distributed tracing with Jaeger to follow requests through multiple services. What changed our game was moving away from alerting on every metric to alerting on symptoms of user-impacting problems. Instead of alerting on 'CPU above 80%,' we alert on 'latency above 1 second' or 'error rate above 0.5%.' We still ended up with too many false positives, so we added rules to reduce alert fatigue: we don't page the on-call engineer unless it's truly urgent. That reduced false alerts by 60% and made on-call actually bearable.

23

Reference answer

- Automate security patching: Use tools like Ansible or Puppet to automatically apply security patches to servers and containers.
- Secrets management: Store credentials and secrets in tools like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault and avoid hardcoding secrets.
- Network segmentation and firewalls: Use network policies in Kubernetes or security groups in cloud environments to limit access to critical resources.
- Monitoring and logging: Implement real-time monitoring for security breaches using tools like AWS CloudTrail or SIEM (Security Information and Event Management) tools.
- Identity and access management (IAM): Apply the principle of least privilege for users and services.

24

Reference answer

The CAP theorem states that a distributed system can guarantee at most two of three properties: Consistency (all nodes see the same data), Availability (every request gets a response), and Partition Tolerance (system works despite network partitions). SREs must choose trade-offs based on use cases, e.g., favoring availability (AP) for user-facing apps or consistency (CP) for financial systems. This guides architecture and recovery strategies.

25

Reference answer

Object-oriented programming (OOP) is a programming paradigm that promotes building objects that represent real-world entities and are then used to carry out tasks. It can be helpful in the design of a server because it lets you divide the work into manageable pieces, which helps you keep the server's code under control. OOP also encourages reusable code, which saves time and money.

26

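Reference answer

Start from the Fisher-Yates shuffle and add a derangement constraint. In the standard Fisher-Yates shuffle, for each position i from n-1 down to 1, you swap arr[i] with a random element from arr[0] to arr[i], so an element can end up staying in place. To ensure no element stays in place, check for fixed points after shuffling and either perform additional swaps to break them or reshuffle until none remain. Alternatively, use Sattolo's algorithm, which swaps arr[i] with a random element from arr[0] to arr[i-1] and produces a cyclic permutation with no fixed points. Note that Sattolo's algorithm only generates single-cycle permutations, which are a subset of all derangements.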
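Sattolo's algorithm is short enough to sketch directly. The version below is a minimal illustration, not a tuned implementation; the function name `sattolo_shuffle` and the optional `rng` parameter (added so a seeded generator can be passed in) are assumptions made for the example.

```python
import random


def sattolo_shuffle(items, rng=random):
    """Shuffle `items` in place into a single cycle, so no position keeps its element."""
    for i in range(len(items) - 1, 0, -1):
        # Pick j strictly below i. Standard Fisher-Yates would also allow j == i,
        # which is exactly what lets an element stay where it is.
        j = rng.randrange(i)
        items[i], items[j] = items[j], items[i]
    return items


deck = list(range(6))
print(sattolo_shuffle(deck.copy(), random.Random(42)))
```

The only difference from Fisher-Yates is the exclusive upper bound on `j`. For a list with fewer than two elements the loop does nothing, since no derangement exists in that case.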

27

Reference answer

In one case, the SLO was defined as 99.9% uptime for a critical service. The error budget was calculated, and it was observed that, due to a rise in errors, the budget was nearing its limit. Thus, releasing features would have put the reliability target at risk. We, together with the product and engineering teams, made a call that no new releases would be made until the service was stabilized; this work then involved bug fixing and performance improvements so that the release would fit into the error budget.
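To make the error budget in this scenario concrete, here is a small worked calculation. The 99.9% target comes from the answer above; the 30-day window and the function names are assumptions chosen for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1 - downtime_minutes / budget


# A 99.9% SLO over an assumed 30-day window allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))  # 43.2
```

A release freeze like the one described would typically be considered when `budget_remaining` approaches zero; the exact threshold is a policy decision, not something the arithmetic decides.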

28

Reference answer

| SNAT | DNAT |
|---|---|
| It is generally used to change a private address or port into a public address or port for packets leaving the network. | It is generally used to redirect incoming packets with a destination of a public address or port to a private IP address or port inside the network. |
| It translates the source IP address within a connection to the BIG-IP system IP address that one defines. | It translates IP addresses of internal servers that are protected by the device to public IP addresses. |
| It is used to change the source address of the packet. | It is used to change the destination address of the packet. |
| It also changes the source port in TCP/UDP headers. | It also changes the destination port in TCP/UDP headers. |
| It generally allows multiple hosts on the inside to reach any host on the outside. | It generally allows multiple hosts on the outside to reach a single host on the inside. |

29

Reference answer

Absolutely, in one of my previous roles, we were building a new feature that was significant from both a business and user perspective. Naturally, there was a considerable push from stakeholders to roll it out quickly. However, as the SRE, I knew that a quick release without proper testing and gradual deployment could jeopardize system reliability. I proposed a phased approach for the feature release. First, we focused on comprehensive testing, covering all possible use cases and stress testing for scalability. We utilized automated testing and also engaged in rigorous manual testing, particularly for user-experience-centric components. Once we were confident with the testing results, we moved towards a phased release. Instead of rolling out the feature to all our users at once, we initially launched it to a selected group of users. We monitored system behavior closely, gathering feedback, and making necessary adjustments. Only when we were fully confident that the feature would not affect the overall system's reliability did we roll it out to all users. In this case, the balance was struck between speed and reliability by introducing well-planned phases, in-depth testing, and gradual deployment. It allowed us to deliver value rapidly, but without compromising on system stability.

30

Reference answer

I've used a variety of database management systems in my projects depending on the specific use cases and requirements. In one project, we had a significant amount of structured data with complex relationships. We needed to perform complex queries, so we used a relational database management system, specifically PostgreSQL. I worked on designing and optimizing the schema, wrote stored procedures, and created views for this project. In another project, we collected a huge amount of semi-structured event data. It wasn't suitable for a traditional SQL database, so I implemented a NoSQL database, MongoDB, for this purpose. I worked on data modeling and tuned performance for read-heavy workloads. For another application where we needed to store and retrieve user session data quickly, I used a key-value store, Redis. It's incredibly fast for this kind of workload, where you're storing and retrieving simple data by keys. Different database management systems each have their strengths and are suited to different types of data and workloads. Being familiar with various types allows for better system design by leveraging the strengths of each as necessary.

31

Reference answer

Dependencies in microservices are managed using service discovery, API gateways, and dependency management tools. Monitoring and logging dependencies, versioning APIs, and implementing retries and circuit breakers also help manage dependencies effectively.
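Of the techniques listed, the circuit breaker is the least self-explanatory, so a minimal sketch may help. It is an assumption-laden illustration, not a production design: the class name, the thresholds, and the injectable `clock` parameter are all hypothetical, and real libraries add features such as per-error-type handling and metrics.

```python
import time


class CircuitBreaker:
    """Skip calls to a failing dependency for `reset_after` seconds
    once `max_failures` consecutive calls have failed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call skipped")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

While the circuit is open, callers fail fast instead of piling more load onto a dependency that is already struggling, which is the property that makes the pattern useful between microservices.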

32

Reference answer

In one of my previous roles, I was part of a team managing an e-commerce platform. With the user base growing rapidly, the infrastructure costs were escalating due to the processing power needed for some computationally intensive tasks. We identified a process that was reading from the database, performing some transformations, and writing back to the database. The issue was that this process was running for every user action, even when there was no update, leading to an unnecessary load. To address this, we implemented a caching system and stored the results of the process. So, the next time the same user action occurred, instead of initiating the whole process again, the system would first check the cache for results. If the results were already there, the system would retrieve them from the cache, significantly reducing the number of reads and writes to the database. By introducing caching, we maintained the functionality and improved performance, all while reducing the strain on our database servers. This ultimately led to a smaller resource footprint and a noticeable reduction in our infrastructure costs.

33

Reference answer

Service Level Indicators (SLIs) are the key measurements that show whether a service is on track. Without them, it's difficult to know if the organization is meeting its objectives. There are three main types of SLIs: availability, response time, and quality of service.

- Availability measures how often the service can be provided without downtime.
- Response time measures how quickly the service is delivered.
- Quality of service measures how well the service meets defined standards of quality.

In addition to these three main types, there are also usage and capacity limits, which measure how much of a given resource can be used at any given time. These are useful for determining whether there is enough capacity in the system to handle additional demand.

34

Reference answer

White-box monitoring is a method of monitoring the internal metrics of applications that run on a server when you can access its source code.

35

Reference answer

Vertical scaling (scaling up) adds more resources (CPU, RAM) to a single machine. Horizontal scaling (scaling out) adds more machines or instances to a pool. Horizontal scaling is preferred in distributed systems for better fault tolerance and elasticity, as it allows spreading load across many nodes. Vertical scaling is simpler but has limits and creates a single point of failure.

36

Reference answer

Chaos Engineering is a methodical approach to discovering failures before they lead to outages. By proactively testing how a system responds to stress, you can pinpoint and resolve failures before they become problems that affect your customers and systems. Chaos Monkey is a popular tool used in Chaos Engineering.

37

Reference answer

During a project, the development team noticed that our service's uptime was below the agreed SLO. I worked with them to identify the root causes, such as poor error handling and insufficient retries on external API calls.

Approach:

- We reviewed and improved the error handling in the codebase.
- We introduced retries with exponential backoff for external API requests.
- We added better monitoring and logging to detect failures early.
- We collaboratively improved the CI/CD pipeline to automate testing and catch reliability issues before production releases.
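Retries with exponential backoff, one of the steps above, can be sketched as follows. This is an illustration under stated assumptions: the function name and default values are hypothetical, the random jitter is an addition not mentioned in the answer, and catching every `Exception` is a simplification (real code would retry only errors known to be transient).

```python
import random
import time


def call_with_retries(func, attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call func(); after each failure wait up to base_delay * 2**n seconds, then retry."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles on every attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying at the same instant.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use it stays at `time.sleep`.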

38

Reference answer

- A service-level agreement (SLA) is a commitment we make to a client about uptime. SLAs are frequently legally binding, with consequences for failing to meet the agreed availability. As a result, SLAs are typically set at values that are easier to satisfy than SLOs.
- A service-level indicator (SLI) is anything that can be precisely measured to help you think about, define, and determine whether you are satisfying SLOs and SLAs. SLIs are commonly expressed as the ratio of the number of good events to the total number of events. A simple example would be the number of successful HTTP requests divided by the total number of HTTP requests. SLIs are typically stated as a percentage, with 0 indicating that everything is broken and 100 indicating that everything is operating flawlessly.
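The ratio behind an SLI is simple enough to show as code. In this sketch the function names, the request counts, and the 99.9% target are illustrative assumptions, as is the choice to report 100% when there was no traffic at all.

```python
def availability_sli(good_events, total_events):
    """Percentage of events that were good."""
    if total_events == 0:
        return 100.0  # assumption: with no requests, nothing failed
    return 100.0 * good_events / total_events


def meets_slo(sli_percent, slo_percent):
    """True when the measured SLI is at or above the objective."""
    return sli_percent >= slo_percent


# Hypothetical counts: 99,950 successful HTTP requests out of 100,000.
sli = availability_sli(good_events=99_950, total_events=100_000)
print(sli, meets_slo(sli, 99.9))  # 99.95 True
```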

39

Reference answer

The classes and methods are already defined, so we only need to implement the logic. We can use a HashMap keyed by user ID, with each user represented as a node, so that a user can be looked up in constant time. Similarly, each tweet is a node that records the tweet ID and the ID of the user it belongs to. So the solution can be:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;

    class Twitter {
        // One user and the users he or she follows.
        private class User {
            int userID;
            HashMap<Integer, Boolean> followings;

            User(int id) {
                userID = id;
                followings = new HashMap<>();
            }
        }

        // One tweet and the user it belongs to.
        private class Tweet {
            int tweetID, userID;

            Tweet(int userID, int tweetID) {
                this.tweetID = tweetID;
                this.userID = userID;
            }
        }

        // Every tweet, oldest first.
        List<Tweet> tweets;
        // Map from user ID to user, for constant-time lookup.
        HashMap<Integer, User> map;

        public Twitter() {
            map = new HashMap<>();
            tweets = new ArrayList<>();
        }

        public void postTweet(int userId, int tweetId) {
            // Create the user if it does not exist yet.
            if (!map.containsKey(userId)) map.put(userId, new User(userId));
            tweets.add(new Tweet(userId, tweetId));
        }

        public List<Integer> getNewsFeed(int userId) {
            List<Integer> feeds = new ArrayList<>();
            // May be null if this user has never posted or followed anyone.
            User user = map.get(userId);
            // Walk from the newest tweet backward, collecting at most 10 tweets
            // written by the user or by someone the user follows.
            for (int n = tweets.size() - 1; n >= 0 && feeds.size() < 10; n--) {
                Tweet tweet = tweets.get(n);
                boolean follows = user != null && user.followings.containsKey(tweet.userID);
                if (tweet.userID == userId || follows) feeds.add(tweet.tweetID);
            }
            return feeds;
        }

        public void follow(int followerId, int followeeId) {
            // Create either user if missing, then record the relationship.
            if (!map.containsKey(followerId)) map.put(followerId, new User(followerId));
            if (!map.containsKey(followeeId)) map.put(followeeId, new User(followeeId));
            map.get(followerId).followings.put(followeeId, true);
        }

        public void unfollow(int followerId, int followeeId) {
            // Create either user if missing, then remove the relationship if present.
            if (!map.containsKey(followerId)) map.put(followerId, new User(followerId));
            if (!map.containsKey(followeeId)) map.put(followeeId, new User(followeeId));
            map.get(followerId).followings.remove(followeeId);
        }
    }

postTweet, follow, and unfollow run in constant time on average. getNewsFeed returns at most 10 tweets, but in the worst case it scans the whole tweet list to find them, so its running time is linear in the total number of tweets.

40

Reference answer

In one of my past projects, we were developing a new feature that was expected to significantly increase the demand on our systems. Instead of purchasing and setting up additional physical servers, we utilized cloud computing services of AWS. We arranged scalable compute power using a combination of EC2 and Lambda functions, used S3 for robust and scalable storage, and RDS for managing our databases. This allowed us to quickly and cost-effectively handle the increased load, while also shedding the headaches of server maintenance and hardware failure risks. Additionally, the built-in AWS services like CloudWatch greatly enhanced our monitoring capabilities.

41

Reference answer

Automating repetitive tasks (e.g., deployments, backups) using tools like Kubernetes or Ansible. For example, replacing manual server scaling with auto-scaling groups.

42

Reference answer

This is a hard question. LRU Cache (Least Recently Used Cache) is a data structure that maintains a fixed capacity and evicts the least recently used item when a new item is added and the cache is full. Common implementations use a combination of a hash map (for O(1) lookups) and a doubly linked list (for O(1) insertions and deletions). The hash map stores keys pointing to nodes in the linked list. On access (get), the node is moved to the head of the list. On insertion (put), if the cache is full, the node at the tail (least recently used) is evicted, and the new node is added to the head.
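A compact way to sketch this in Python is with `collections.OrderedDict`, which stands in for the hash map plus doubly linked list described above instead of hand-written nodes. Note the orientation is flipped compared to the description: here the most recently used entry sits at the end of the ordering and eviction happens at the front. The class name and the `default` parameter are illustrative choices.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest entry last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # "a" is now the most recently used key
cache.put("c", 3)       # evicts "b"
print(cache.get("b"))   # None
```

In an interview setting, a candidate may still be asked to write the doubly linked list by hand; this sketch only shows the access and eviction rules.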

43

Reference answer

A rollback strategy involves reverting to a previous stable version of a service in case of issues with the current deployment. This can be implemented using version control, maintaining previous versions of deployments, and automating rollback processes in CI/CD pipelines.

44

Reference answer

Infrastructure as Code (IaC) is a practice where the infrastructure management process is automated and treated just like any other code. Rather than manually configuring and managing infrastructure, we define the desired state of the system using machine-readable definition files or scripts, which are used by automation tools to set up and maintain the infrastructure. In one of my past jobs, we used Terraform for implementing IaC in our AWS environment. With Terraform scripts, we could not only set up our compute, networking, and storage resources but also handle their versioning and maintain them efficiently. Every change in the infrastructure was reviewed and applied using these scripts, keeping the whole process consistent and repeatable. Implementing IaC offered us multiple benefits. Notably, it allowed us to keep our infrastructure setup in version control alongside our application code, which greatly eased tracking changes and rolling back if there were errors. It also streamlined the process of setting up identical development, testing, and production environments, and brought in a high level of efficiency and consistency to our operations.

45

Reference answer

To optimize the costs of cloud resources, SREs would need to:

- Analyze current and projected costs with tools provided by cloud platforms.
- Use autoscaling to adjust resources based on demand.
- Select the right types and sizes of resources (e.g., compute instances) for the task at hand.
- Use spot instances or reserved instances where appropriate.
- Set up budget alerts to monitor and control expenses.

Skilled applicants will also mention that different deployment architectures, such as serverless deployments or containers, also impact costs.

46

Reference answer

Microservices architecture involves breaking down a monolithic application into smaller, independently deployable services, each responsible for a specific functionality.

47

Reference answer

Error budget policy enforcement and cross-functional communication under pressure. Jumping to 'freeze deployments' without explaining who you notify, how the decision is documented, and what the exception process looks like is where candidates lose points.

48

Reference answer

Multithreading is the ability of a CPU to execute multiple threads concurrently, each thread running a part of a program. A good answer would outline the benefits of multithreading, such as improved application performance and responsiveness, and its challenges, like the complexity of thread synchronization and potential for deadlocks. Expect skilled applicants to give you examples of using multithreading in past projects and be familiar with synchronization mechanisms, such as mutexes or semaphores.
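As a small illustration of the synchronization point, the sketch below protects a shared counter with a mutex (`threading.Lock`). The class and function names, the thread count, and the iteration count are hypothetical; without the lock, the read-modify-write in `increment` could interleave across threads and lose updates.

```python
import threading


class SafeCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The lock makes the read-modify-write below atomic across threads.
        with self._lock:
            self.value += 1


def run_workers(counter, workers=4, increments=10_000):
    """Start `workers` threads that each increment the counter `increments` times."""

    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait for every worker to finish
    return counter.value


print(run_workers(SafeCounter()))  # 40000
```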

49

Reference answer

Virtualization is the process of using one physical system to run multiple virtual machines. It is commonly used by companies that want to consolidate computing resources and keep them running 24/7 without having to buy more hardware. It can also be used for testing purposes, such as software development or system performance testing. Setups range from simple ones, where multiple virtual machines run on the same physical server, to complex ones that use multiple servers and virtual networks. The end goal is always the same: reducing overhead costs and improving overall IT infrastructure efficiency. Virtualization can also be used to create hybrid environments where physical servers are augmented by cloud-based services.

There are many different virtualization technologies available today, including:

- VMware - This is one of the most popular virtualization technologies available today. It runs on a wide range of platforms and is easy to install and manage. It can also be cost-effective because it leverages hardware and software infrastructure that is already in place.
- Windows Server - Windows Server is a common choice for virtualizing Microsoft applications because it has built-in support for Hyper-V, making it easy to deploy and manage. There are also several third-party solutions available to further augment administrator capabilities.
- Hyper-V - This is another option that's popular with organizations looking to virtualize their servers. While it's not as widely used as VMware, it's worth exploring if you're looking for a low-cost way to virtualize. It's one of the newer options, so it might not be as widely accepted as the others.

50

Resposta de referência

SRE decreases organizational silos by incorporating software engineers on both sides, including coders and release support. This helps diagnose product faults and resolve outages.

51

Resposta de referência

Cloud computing delivers IT services, including servers, storage, and software as a service (SaaS), over network-connected cloud infrastructure. The phrase can describe both private clouds, which are controlled by a single company and shared by internal users, and public clouds, which are owned by outside companies that provide computing capacity for rent, such as Amazon Web Services.

52

Resposta de referência

Example: I was working on a system where page load times were slow. After profiling, I found bottlenecks in database queries and excessive API calls. - Solution: I optimized slow queries using indexes, cached repetitive API results using Redis, and compressed static assets to reduce load times.

53

Resposta de referência

Detect via alerting systems, mitigate the impact quickly, communicate clearly, document the incident, and perform a blameless postmortem. Automation plays a big role in reducing MTTR.

54

Resposta de referência

DevOps is a cultural and collaborative movement that aims to unify software development and operations, emphasizing automation, continuous delivery, and shared responsibility. Site Reliability Engineering (SRE) is a specific implementation of DevOps principles, applying software engineering practices to operations problems, with a focus on reliability, scalability, and using error budgets and service level objectives to balance reliability and feature velocity.

55

Resposta de referência

I've worked as a site reliability engineer for around five years, primarily in the e-commerce sector. My role involved ensuring the reliability and scalability of high-traffic web applications. I've gained extensive experience in designing, building, and maintaining the infrastructures of these applications, primarily using cloud platforms like AWS and Azure. A vital part of my work also included crafting effective alerting systems to minimize downtime, and automating repetitive tasks to improve system efficiency. Additionally, I've had the responsibility of orchestrating collaborative responses to incidents, performing postmortems, and implementing problem-solving strategies to prevent recurrence.

56

Resposta de referência

Data backup involves regularly saving copies of data, while recovery involves restoring data from backups in case of loss or corruption. Regular testing of backup and recovery processes is essential.

57

Resposta de referência

VMs virtualize the entire hardware stack including the OS for each instance. Containers, however, share the host OS kernel and package applications with dependencies into lightweight, isolated environments, offering faster startup and portability.

58

Resposta de referência

An error budget is the amount of acceptable unreliability (e.g., 0.1% downtime per month if SLO is 99.9%). It is calculated as 100% minus the SLO target. The error budget is used to balance reliability with innovation: if the budget is not exhausted, teams can deploy new features faster; if it is exhausted, development is paused to focus on reliability improvements. This provides a data-driven way to manage risk.
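The arithmetic above can be sketched in a few lines of Python. The function names and the 30-day window are assumptions chosen for illustration; teams pick their own measurement window.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given SLO over the window."""
    # Error budget = 100% minus the SLO target, applied to the window length.
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of budget left; a negative value means the budget is exhausted."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes
```

For a 99.9% SLO over 30 days this gives roughly 43.2 minutes of allowed downtime; once `budget_remaining` goes negative, the policy described above would shift effort toward reliability work.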

59

Resposta de referência

- Uptime/availability.
- Mean Time to Recovery (MTTR).
- Mean Time Between Failures (MTBF).
- Latency and response time.
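As a rough illustration of how these metrics relate, steady-state availability is commonly estimated as MTBF / (MTBF + MTTR). The helper below is a sketch under that assumption; the function name is hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate availability as the fraction of time the service is up."""
    # Uptime between failures divided by the full failure-plus-repair cycle.
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

For example, a service that fails on average every 999 hours and takes 1 hour to recover is available about 99.9% of the time, which shows why shortening MTTR improves availability as much as lengthening MTBF does.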

60

Resposta de referência

A CDN (Content Delivery Network) is a network of servers that stores and serves content to clients. These servers are usually distributed across many data centers so that content is served from a location close to the user, which lowers latency and keeps the content available and promptly delivered when needed. CDNs are used mainly to cache static content such as images, videos, and JavaScript files, although they can also speed up delivery of dynamically generated content such as HTML pages.

61

Resposta de referência

Scalable object storage for unstructured data (images, logs).

62

Resposta de referência

This is an open-ended question, asked early in the interview to test your knowledge of the programming languages and technical systems you'll need to use to do your job. Share the list of tools, programming languages, and architectures you are familiar with, and give instances of how you used them successfully.

63

Resposta de referência

In a previous role, we had an operational failure where a backend service suddenly started crashing frequently, causing disruptions to our main application. The crashes would happen within seconds after the service started up, making it difficult to catch what was going wrong with regular debugging methods. To mitigate the immediate problem, we quickly spun up additional instances of the service and implemented a checkpoint system to save progress regularly, so that even if a crash happened, we could recover with minimal data loss. This helped minimize disruptions to end-users while we examined the issue in detail. On examining the service logs, we found it was running out of memory very quickly. This was puzzling since it was not seeing an increase in load and had been running fine with the same memory allocation for months. On deeper investigation, we found that there was a change pushed recently into a library that this service was using. It was an optimization change but had a memory leak, which was why the memory footprint of the service was growing rapidly until it ran out of memory. We quickly rolled back the change, and the service stopped crashing. The operational failure taught us the value of monitoring all changes, not just within our own code but also in the libraries and services we rely on. We also learned the importance of having good failure mitigation strategies in place until we can resolve the root cause of a problem.

64

Resposta de referência

Knowing the tools is the floor. Having opinions about when to use which is the ceiling.

65

Resposta de referência

To design a system for rate limiting an API, I would implement a token bucket algorithm to control the rate of requests. Additionally, I would use monitoring and logging to dynamically adjust the rate limits based on real-time usage patterns.
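A minimal single-process sketch of the token bucket idea is shown below. The class name, the injectable `clock` parameter, and the `allow` method are assumptions made for illustration; a production limiter would typically keep this state in a shared store and add locking.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start with a full bucket
        self.clock = clock          # injectable so tests can control time
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits in the budget."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity controls how large a burst is tolerated, while the rate sets the sustained request rate; adjusting limits dynamically amounts to changing these two values.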

66

Resposta de referência

A container orchestration system, like Kubernetes, automates the deployment, scaling, and management of containerized applications.

67

Resposta de referência

Use a publish-subscribe architecture. Clients connect to chat servers via WebSocket. Each chat room is a topic. Messages are published to a message queue (e.g., Kafka) and distributed to subscribers. Use a stateful server for presence and typing indicators. Scale with horizontal sharding by user ID. Store message history in a distributed database. Handle failures with redundancy and retries.

68

Resposta de referência

Security patches are managed through automated patch management tools, regular patch cycles, vulnerability assessments, and ensuring minimal disruption by scheduling updates during maintenance windows or using rolling updates.

69

Resposta de referência

Depending on what they say, we'll be talking about this for a while and will probably create a lot of other questions.

70

Resposta de referência

I prefer using Prometheus and Grafana for monitoring due to their flexibility and powerful visualization capabilities. I set up monitoring for our microservices architecture, defining KPIs such as response times and error rates. I established alert thresholds based on SLOs and conducted regular reviews to adjust those thresholds and reduce alert fatigue. This approach helped us improve system performance and response times by 30%.

71

Resposta de referência

Serverless compute service for event-driven code (e.g., processing S3 uploads).

72

Resposta de referência

I recall a production incident where an API service experienced a sharp increase in latency. I first triaged the alert to confirm the impact and severity. Then, I accessed monitoring dashboards to check resource usage, logs, and recent deployments. I identified a memory leak caused by a recent code change. I rolled back the deployment to restore stability, then implemented a fix by adding proper memory management. Post-resolution, I led a blameless postmortem to update runbooks and add monitoring for memory metrics.

73

Resposta de referência

I ensure code quality through practices like code reviews with peers, adhering to style guides, writing comprehensive unit and integration tests, designing modular components, and refactoring code to improve readability and performance over time.

74

Resposta de referência

Horizontal Pod Autoscaling (HPA) in Kubernetes automatically adjusts the number of pods in a deployment, replica set, or stateful set based on observed CPU utilization (or other metrics like memory or custom metrics).
- The HPA controller checks the metrics at regular intervals.
- If resource usage exceeds or drops below the defined threshold, the HPA scales the number of pods up or down accordingly.
- For example, if CPU utilization exceeds 80%, the HPA may add more pods to handle the increased load.
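The scaling decision can be sketched with the formula given in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1. The Python function below is a simplified illustration; its name, defaults, and clamping bounds are assumptions, and the real controller also handles missing metrics, pod readiness, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Simplified HPA calculation for a single metric."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 4 pods averaging 90% CPU against a 60% target, the ratio is 1.5 and the sketch asks for 6 pods; at 30% CPU it asks for 2.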

75

Resposta de referência

Efficiency and performance are important aspects of capacity planning: running your service with more capacity than necessary wastes resources, while running at 110% utilization degrades latency and causes user dissatisfaction.

76

Resposta de referência

Capacity planning and forecasting in SRE operations involve planning for both organic growth, which comes from new users or website growth over time, and inorganic growth, which comes from launching new products or sites. It is important to plan for both scenarios.

77

Resposta de referência

A Service Level Objective (SLO), typically expressed as a percentage, is a target for how good the service quality should be. Comparing it against measured performance shows how well the service's actual behavior matches expectations.

78

Resposta de referência

The error budget is a crucial aspect of product development because it defines how much availability actually needs to be achieved, rather than assuming 100%. Setting it involves negotiating with every area, from development through delivery of the product, to agree on how much unavailability is acceptable. The budget then gives the product room to make changes and leaves space for mistakes or potential outages.

79

Resposta de referência

It is important to agree on SLA expectations across all areas before launching a product, to avoid conflicts between business development and operations.

80

Resposta de referência

Container orchestration (e.g., Kubernetes) automates deployment, scaling, and management of containerized applications. It provides self-healing, load balancing, and easy rollbacks, thus improving reliability.

81

Resposta de referência

This is a quick yet obvious question. Of course, the interviewer wants to know if you're familiar with the languages and technical systems you'll need to use in order to do your job.

82

Resposta de referência

- Use strongly consistent replication built on consensus protocols (e.g., Paxos, Raft) for mission-critical systems.
- Monitor replication lag using database metrics.
- Set up geo-replication with automatic failover mechanisms.
- Test failover scenarios to ensure minimal downtime.

83

Resposta de referência

To monitor CPU usage and send an alert if it exceeds a certain threshold, you can use a simple Bash script. Here's an example: while true; do cpu=$(top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}'); if (( $(echo "$cpu > 80" | bc -l) )); then echo "CPU usage is above 80%"; fi; sleep 60; done

84

Resposta de referência

A signal is a software interrupt delivered to a process to notify it of an event. The kernel generates signals for events like segmentation faults or user interrupts. Processes can handle signals by: ignoring (SIG_IGN), using a default action (e.g., terminate), or installing a custom handler via signal() or sigaction(). The kernel saves context, executes the handler, and restores the process.
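To make the custom-handler case concrete, here is a small Python sketch that installs a handler, delivers a signal to the current process, and restores the previous disposition. It assumes a Unix-like system (for `SIGUSR1`), Python 3.8+ (for `signal.raise_signal`), and that it runs in the main thread; the function name is hypothetical.

```python
import signal


def deliver_and_handle(signum: int = signal.SIGUSR1) -> list:
    """Install a custom handler, raise the signal, and return what was seen."""
    received = []

    def handler(sig, frame):
        # Runs in the main thread once the interpreter processes the signal.
        received.append(sig)

    previous = signal.signal(signum, handler)
    try:
        # Send the signal to this process.
        signal.raise_signal(signum)
    finally:
        # Restore the earlier disposition (fall back to the default action).
        signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return received
```

Replacing `handler` with `signal.SIG_IGN` would ignore the signal instead, and leaving the default in place for `SIGUSR1` would terminate the process.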

85

Resposta de referência

I'd start by understanding the SLOs for the platform, because monitoring flows from those. For an e-commerce platform, uptime and checkout latency are critical. I'd instrument RED metrics for each service—Prometheus is a good choice here. We'd ship metrics from every service into a central Prometheus, plus use distributed tracing for understanding cross-service latency. For alerting, I'd avoid alerting on infrastructure metrics alone. Instead, I'd alert on user-impacting issues: checkout latency above 1 second, error rate above 0.5%, or availability below SLO. I'd set up alert grouping by root cause so that if a single issue triggers 50 alerts, on-call gets one. For the on-call dashboard, I'd focus on the 12 metrics that actually tell you if the system is healthy. Everything else lives in detailed dashboards for root cause analysis, not on-call visibility.

86

Resposta de referência

They distribute incoming traffic across multiple servers. Can operate at Layer 4 (TCP) or Layer 7 (HTTP). They help with scaling, fault tolerance, and zero-downtime deployments.

87

Resposta de referência

- Service Discovery is the process of automatically detecting services within a system, enabling dynamic communication between services in a distributed environment. It allows services to register and locate one another without the need for manual configuration.
- Load Balancing is the process of distributing incoming traffic across multiple servers to ensure optimal resource utilization, reduce response time, and prevent any single server from being overloaded.
How They Work Together:
- In distributed systems, service discovery helps in dynamically identifying which servers or services are available. Once the service is discovered, load balancing distributes the incoming requests across these services to ensure high availability and fault tolerance. These two concepts complement each other by ensuring that the system is both efficient and resilient.

88

Resposta de referência

A computer has a limited amount of physical memory, and workloads sometimes need more than is available, so the kernel moves some memory pages to disk. Swap space is an area on disk that acts as a substitute for physical memory. It is used as virtual memory and holds parts of process memory images.

89

Resposta de referência

Google introduced Site Reliability Engineering (SRE), in which software developers design IT operations. Asking software engineers to design operations bridges the development and operations teams, minimizing organizational silos and making small changes simpler to adopt and deploy.

90

Resposta de referência

The benefits of using an error budget include aligning incentives between development and operations teams, giving both a shared basis for evaluating trade-offs and managing the risk of change, and keeping reliability targets realistic.

91

Resposta de referência

A distributed tracing system tracks requests as they flow through different services in a microservices architecture, helping in pinpointing latency issues and understanding system behavior.

92

Resposta de referência

A runbook is a step-by-step guide for handling incidents or repetitive tasks. It reduces toil, improves on-call response, and helps new team members act confidently during outages.

93

Resposta de referência

- DevOps focuses on cultural collaboration between dev and ops teams.
- SRE applies engineering rigor to operations (e.g., SLOs, error budgets).

94

Resposta de referência

Toil is a term used to describe manual, repetitive, and tedious tasks that engineers perform in production environments. Toil reduction is the process of reducing the amount of time spent on tasks that are considered toil. This can be achieved through process automation.

95

Resposta de referência

Sharding is a technique for splitting a database into several parts, called shards. Each shard stores a portion of the data, usually selected by a shard key, so a query can be routed to the shard that holds the relevant rows.
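One common way to decide which part holds a given record is hash-based routing on a key. The sketch below assumes a fixed number of shards and a string key; the function name and key format are illustrative, and real systems often use consistent hashing or range-based schemes so that resharding moves less data.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards) using a stable hash."""
    # A cryptographic hash is used only for its stable, well-spread output;
    # Python's built-in hash() is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Every lookup for the same key lands on the same shard, so reads and writes for one user touch a single database instead of all of them.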

96

Resposta de referência

Chaos engineering is the practice of intentionally injecting failures into a system in production to test its resilience and uncover weaknesses before they cause outages. While I haven't personally run chaos experiments, I understand its value and know tools like Chaos Monkey.

97

Resposta de referência

One of the most important parts of Site Reliability Engineering (SRE) is change management, which is concerned with keeping IT systems up and running as much as possible while keeping interruptions to a minimum.

98

Resposta de referência

A service level indicator (SLI) is a quantifiable measure of a specific aspect of a service's performance or reliability, such as latency, error rate, throughput, or availability. SLIs are used to assess whether the service meets its defined SLOs.

99

Resposta de referência

During a high-traffic period, the system crashed due to overloaded database connections. I first stabilized the system by increasing connection limits and rerouting traffic. Then, I implemented connection pooling and optimized slow queries, preventing future incidents.

100

Resposta de referência

A hard link is a direct pointer to an inode, meaning it shares the same data blocks as the original file. A soft link (symbolic link) is a reference to a file path, which can point to files across file systems. Example: 'ln original.txt hardlink.txt' creates a hard link; 'ln -s original.txt softlink.txt' creates a soft link.
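The difference can be observed directly from Python on a Unix-like system. The sketch below creates both kinds of link in a temporary directory, then deletes the original file; the function name and returned keys are assumptions for illustration.

```python
import os
import tempfile


def link_demo() -> dict:
    """Create a hard link and a symlink, delete the original, report results."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        hard = os.path.join(d, "hardlink.txt")
        soft = os.path.join(d, "softlink.txt")
        with open(original, "w") as f:
            f.write("hello")

        os.link(original, hard)      # equivalent to: ln original.txt hardlink.txt
        os.symlink(original, soft)   # equivalent to: ln -s original.txt softlink.txt

        same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
        link_count = os.stat(original).st_nlink

        os.remove(original)
        with open(hard) as f:
            hard_content = f.read()  # data survives: the inode is still referenced
        soft_dangling = os.path.islink(soft) and not os.path.exists(soft)

        return {
            "same_inode": same_inode,
            "link_count": link_count,
            "hard_content": hard_content,
            "soft_dangling": soft_dangling,
        }
```

The hard link keeps the data reachable after the original name is removed, while the symbolic link is left pointing at a path that no longer exists.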

101

Resposta de referência

A service mesh (e.g., Istio, Linkerd) is an infrastructure layer that manages communication between microservices. It provides the following features:
- Traffic management: Handles routing, load balancing, and retries.
- Security: Offers mutual TLS (mTLS) for secure communication between services.
- Observability: Provides metrics, logs, and distributed tracing for monitoring.
- Resilience: Supports circuit breakers, rate-limiting, and failovers.
It helps by abstracting the complexity of inter-service communication, allowing developers to focus on business logic while the mesh handles service-to-service interactions.

102

Resposta de referência

- API versioning: Implement API versioning through URL paths (e.g., `/v1/resource`) or headers to ensure backward compatibility for clients.
- Feature flags: Use feature flags to gradually roll out changes and allow easy rollback without downtime.
- Contract testing: Use tools like Pact to implement consumer-driven contract testing between services, ensuring that changes don't break dependencies.
- Deprecation strategies: Communicate API deprecations clearly with clients and provide sufficient time for them to upgrade.
- Canary releases: Use canary releases to deploy new versions of microservices to a small subset of users before a full rollout.
Backward compatibility ensures that older versions of services continue to function without disruption during upgrades.

103

Resposta de referência

My first step to troubleshoot a service outage is to acknowledge the issue and gather as much information as possible. I'd look into our monitoring and logging system to understand what triggered the incident. Next, I'd engage the right team members to dive deeper into the issue, as often, expertise from different domains may be required to identify the root cause.

104

Resposta de referência

DNS stands for Domain Name System. It maps hostnames to IP addresses so that your browser can find the correct server when you type in a website address. DNS associates each domain name with one or more IP addresses, which are stored as records on authoritative name servers. When you type a URL (e.g., www.google.com) into your browser, the computer sends a request to a DNS resolver for the IP address associated with that domain name. The resolver answers from its cache or queries the authoritative name servers, and then returns the IP address to the browser. DNS is necessary because people remember human-readable names like google.com, while machines route traffic using numeric IP addresses like 192.0.2.1. Without DNS, you would need to know the IP address of every site you wanted to reach, which would be very difficult.

105

Resposta de referência

- SNAT changes the source IP (e.g., private to public IP).
- DNAT changes the destination IP (e.g., routing traffic to a backend server).

106

Resposta de referência

Reducing human attention in SRE operations means paging or phoning the team only for critical issues and routing less urgent issues to a ticket system. Humans should be involved only when essential and should not do work that can be automated in code.

107

Resposta de referência

Compliance is ensured by implementing security controls, maintaining audit logs, conducting regular security assessments, and following best practices for data protection and privacy. Compliance tools and frameworks help automate and enforce these requirements.

108

Resposta de referência

I monitor system performance using tools like Prometheus and Grafana to track key metrics such as latency, error rates, throughput, and resource utilization. I configure alerts based on predefined thresholds to proactively detect issues.

109

Resposta de referência

A load balancer distributes incoming traffic across multiple servers or instances. It improves reliability by preventing any single server from becoming overloaded, providing redundancy (if one server fails, traffic is redirected to others), and enabling smooth scaling. Load balancers can also perform health checks and remove unhealthy servers from the pool, ensuring only healthy instances handle requests.
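A toy version of this behaviour, round-robin selection that skips backends marked unhealthy, might look like the Python sketch below. The class and method names are hypothetical, and real load balancers run the health checks themselves rather than being told the result.

```python
class RoundRobinBalancer:
    """Round-robin over backends, skipping any currently marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0

    def mark_unhealthy(self, backend) -> None:
        self.healthy.discard(backend)

    def mark_healthy(self, backend) -> None:
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none is available."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends available")
```

When a backend fails its health check, traffic simply flows to the remaining ones, which is the redundancy property described above.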

110

Resposta de referência

Analyzing historical data to predict resource needs (e.g., adding nodes to a Kubernetes cluster during peak traffic). Tools like Prometheus forecast usage trends.

111

Resposta de referência

An inode is a data structure in Unix that contains metadata about a file. Some of the items contained in an inode are:
1) mode
2) owner (UID, GID)
3) size
4) atime, ctime, mtime

112

Resposta de referência

Distributes VMs across fault domains to ensure redundancy during hardware failures.

113

Resposta de referência

Effective capacity management requires a deep understanding of the current system usage, historical trends, and future growth predictions. I use monitoring tools to gain insight into resource usage and identify bottlenecks. Based on these trends, I forecast future capacity needs. This is complemented by horizontal scaling strategies and the use of auto-scaling groups in the cloud, allowing the system to seamlessly handle unexpected increases in demand.

114

Resposta de referência

A consensus algorithm ensures multiple nodes in a distributed system agree on a single state or value, even in the presence of failures. Examples include Raft and Paxos. Raft is used by etcd and Consul, and ZooKeeper uses the Paxos-like Zab protocol, for leader election and distributed coordination. These algorithms are critical for maintaining consistency in distributed databases, configuration stores, and service discovery systems.

115

Resposta de referência

Whether you can hold the line without being adversarial. Political judgment, not just technical correctness. Answering with 'I'd say no' or 'I'd escalate' shows neither the negotiation skill the role requires.

116

Resposta de referência

Strong candidates recognize that over-delivering on reliability may indicate overly conservative targets that slow down feature development unnecessarily.

117

Resposta de referência

A soft limit is a threshold that triggers a warning or action (e.g., scaling) but does not immediately enforce a cap. A hard limit is a strict boundary that cannot be exceeded (e.g., a CPU limit in a container). SREs use soft limits for proactive management and hard limits for enforcing boundaries and preventing resource exhaustion, ensuring system stability.

118

Resposta de referência

- Image: Template with app code and dependencies.
- Container: Running instance of an image.

119

Resposta de referência

The common Linux signals are mentioned below:
- SIGHUP
- SIGINT
- SIGQUIT
- SIGFPE
- SIGKILL
- SIGALRM
- SIGTERM

120

Resposta de referência

1. Asking a software engineer to design operations teams
2. A practice developed at Google in 2003 to reduce organizational silos
3. The operational cost of software is a significant concern for many companies
4. Measuring everything is crucial to determine success in all areas

121

Resposta de referência

| fork() | exec() |
|---|---|
| It is a system call provided by the operating system and callable from C | It is a family of system calls provided by the operating system |
| It is used to create a new process | It runs an executable file |
| Its return value is an integer | It does not create a new process |
| It does not take any parameters | The process identifier does not change |
| It can return three kinds of integer values: negative on failure, zero in the child, and the child's PID in the parent | The machine code, data, heap, and stack of the process are replaced by the new program |
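The two calls are usually combined: fork creates the child, and the child then calls exec to become a different program. The Python sketch below, which assumes a Unix-like system, shows that the PID returned by fork in the parent matches the PID the new program reports after exec. The function name and the pipe used to pass the PID back are illustrative choices.

```python
import os
import sys


def fork_then_exec() -> tuple:
    """Fork a child, exec a new Python program in it, and report the PIDs."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: fork() returned 0. Point stdout at the pipe, then replace
        # this process image with a new program via exec.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)
            os.execv(sys.executable,
                     [sys.executable, "-c", "import os; print(os.getpid())"])
        finally:
            # Only reached if exec failed; never fall through into parent code.
            os._exit(127)
    # Parent: fork() returned the child's PID.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        pid_after_exec = int(reader.read().strip())
    _, status = os.waitpid(pid, 0)
    return pid, pid_after_exec, os.WEXITSTATUS(status)
```

The first two values in the returned tuple are equal, which reflects that exec replaces the program running in the process without creating a new process.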

122

Resposta de referência

- Monitoring memory usage trends over time using tools like Prometheus or Datadog.
- Heap dumps and analysis tools (e.g., jmap, GDB) to identify problematic allocations.
- Use profilers to monitor application memory (e.g., JProfiler for Java).
- Implement proper garbage collection or memory management techniques in code, if necessary.

123

Resposta de referência

Common data structures include hash tables, trees, queues, stacks, graphs, and arrays. SREs often use these in scripting, automation, configuration management, monitoring systems, and analyzing log data to optimize performance and reliability.

124

Resposta de referência

Incident management involves detecting, responding to, and resolving outages or degradations. The process typically includes: alerting via monitoring systems, declaring an incident, assembling a response team, triaging the issue, applying fixes or rollbacks, communicating status to stakeholders, and conducting a postmortem to identify root causes and preventive actions. Automation and runbooks are crucial for reducing response time and human error.

125

Resposta de referência

To create a simple REST API in Node.js that returns a list of users, I would use Express to set up the server and define a route that handles GET requests. Here's a basic example: const express = require('express'); const app = express(); const users = [{ id: 1, name: 'John Doe' }, { id: 2, name: 'Jane Doe' }]; app.get('/users', (req, res) => { res.json(users); }); app.listen(3000, () => { console.log('Server is running on port 3000'); });

126

Resposta de referência

- Define and monitor SLIs/SLOs.
- Automate toil (e.g., CI/CD pipelines).
- Conduct blameless post-mortems.
- Optimize cloud resource usage.

127

Resposta de referência

Designing for high availability involves eliminating single points of failure through redundancy, using failover mechanisms, replicating data across multiple locations, distributing services across nodes or regions, and implementing automated health checks with self-healing capabilities.

128

Resposta de referência

I would start by analyzing current traffic patterns, resource usage, and historical growth trends. I would model future demand based on 3x growth, considering seasonality. I would then identify bottlenecks (e.g., database, compute, network) and design scaling strategies, such as horizontal scaling for stateless components, database sharding or read replicas, and caching. I would use autoscaling policies to handle spikes and overprovision slightly for safety. Regular load testing would validate the plan, and I would monitor utilization to adjust proactively.
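A back-of-the-envelope version of this sizing step can be sketched as follows. The function name, the per-instance throughput figure, and the 25% headroom default are all assumptions for illustration; real values would come from load testing.

```python
import math


def required_instances(peak_rps: float, growth_factor: float,
                       per_instance_rps: float, headroom: float = 0.25) -> int:
    """Instances needed to serve projected peak traffic with spare headroom."""
    projected_peak = peak_rps * growth_factor
    # Overprovision slightly so spikes and instance failures are absorbed.
    with_headroom = projected_peak * (1 + headroom)
    return math.ceil(with_headroom / per_instance_rps)
```

For a current peak of 1,000 requests per second, 3x growth, and instances that each handle 200 requests per second, the sketch suggests 19 instances rather than the bare minimum of 15.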

129

Resposta de referência

I approach incident postmortems with a blameless mindset to encourage open communication and learning. Key elements include a detailed incident timeline, root cause analysis, and actionable recommendations to prevent future occurrences.

130

Resposta de referência

- Latency monitoring: Use APM tools (e.g., Datadog, New Relic) to pinpoint high-latency regions.
- Check CDN performance: Ensure the CDN (Content Delivery Network) is properly distributing content, especially to the affected regions.
- DNS and routing: Verify DNS configurations and check for potential misconfigurations with geolocation-based routing.
- Network issues: Investigate network latency using tools like traceroute or ping to see if there are issues between users and your infrastructure.
- Geo-replication: Deploy regional data centers or use cloud providers' global regions to reduce latency for distant users.
- Edge computing: Shift some workload to the edge using services like AWS Lambda@Edge or Cloudflare Workers for faster processing closer to users.

131

Resposta de referência

Skilled candidates will talk about strategies such as using virtual environments, containerization, or specific tools (like npm for Node.js or pip for Python) to manage packages. They should emphasize the importance of testing updates in a development or staging environment before applying them to production to avoid unexpected downtime.

132

Resposta de referência

Through SLIs (Service Level Indicators) like latency, uptime, and error rate, and SLOs (Service Level Objectives) which are targets for those indicators. Error budgets are used to balance shipping features vs stability.

133

Resposta de referência

Monitoring provides real-time insight into system health and performance, allowing SREs to detect issues before they impact customers. Tools commonly used include:
- Prometheus for metrics
- Grafana for dashboards
- Nagios/Zabbix for alerting
- Elasticsearch, Logstash, and Kibana (ELK) for logs
- Datadog for full-stack monitoring

134

Resposta de referência

Auto-instrumentation gives you the request path. Custom spans at service boundaries, database calls, and external API calls give you the diagnostic detail you actually need when something is slow and you can't tell where. Most teams add custom spans reactively, after a post-mortem where the trace data existed and told them nothing useful. Knowing that pattern and building the spans proactively, before the first post-mortem forces you to, is the kind of foresight that interviewers at mature SRE organizations are specifically screening for because it's so rare.

135

Resposta de referência

I follow industry blogs and forums, attend conferences and webinars, participate in open-source communities, read books and research papers, and experiment with new tools in lab environments. I also share learnings with my team to foster continuous improvement.

136

Resposta de referência

The wrong answer starts with adjusting thresholds. The right answer starts with classifying which alerts led to action in the last 30 days and which didn't. Delete the ones that never led to action. Adjust the ones that led to action but too late. Add the ones that are missing based on recent incidents where no alert fired. That triage order matters.

137

Resposta de referência

The real question underneath it: 'Have you actually run one where the person who caused the outage was in the room, and how did you keep it blameless when everyone knew who made the change?' That's a different skill than reading the Google SRE book chapter on post-mortems. Candidates who reference the book by name without adding operational specifics tend to get flagged as having studied the theory without living it.

138

Resposta de referência

This is based on the CAP theorem. Balancing consistency, availability, and partition tolerance depends on use case. For critical data (e.g., financial transactions), I prioritize consistency and partition tolerance (CP), using quorum-based replication and synchronous writes, which may reduce availability during partitions. For high-traffic services (e.g., social media), I prioritize availability and partition tolerance (AP), using eventual consistency and asynchronous replication. I analyze business requirements for each service to choose the appropriate trade-off, often using hybrid approaches like read replicas or tunable consistency.
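Tunable consistency in quorum-replicated stores is often reasoned about with the rule R + W > N: if the read quorum and write quorum together exceed the replica count, every read overlaps at least one replica that saw the latest acknowledged write. The helper below is a sketch of that check; the function name is hypothetical.

```python
def quorums_overlap(n_replicas: int, write_quorum: int, read_quorum: int) -> bool:
    """True if any read quorum must intersect any write quorum (R + W > N)."""
    return read_quorum + write_quorum > n_replicas
```

With three replicas, W=2 and R=2 gives overlapping quorums at the cost of waiting for two nodes on each operation, while W=1 and R=1 favors availability and latency but allows stale reads.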

139

Reference answer

- Circuit Breaker: Implement circuit breakers to prevent cascading failures when a service is failing.
- Retries with backoff: Implement retry mechanisms with exponential backoff to handle transient failures.
- Fallbacks: Provide fallback options when services fail (e.g., serve cached data or default responses).
- Monitoring and Alerts: Monitor dependencies for latency and error rates using APM tools or Prometheus, and set up alerts for failure conditions.
- Service Mesh: Use a service mesh like Istio to handle inter-service communication and automatically reroute traffic when dependencies fail.
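The first three items can be sketched together in a few lines. This is an illustrative sketch, not a production library: the class and function names, thresholds, and delays are all assumptions, and jitter on the backoff is omitted for brevity.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None                   # success closes the circuit
        return result


def retry_with_backoff(func, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry func, doubling the delay after each failed attempt."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise                               # do not hammer an open circuit
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


def fetch_with_fallback(breaker, func, fallback, sleep=time.sleep):
    """Try the dependency through the breaker; serve the fallback on failure."""
    try:
        return retry_with_backoff(lambda: breaker.call(func), sleep=sleep)
    except Exception:
        return fallback
```

Passing `clock` and `sleep` in as parameters is a design choice that keeps the sketch testable without real waiting.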

140

Reference answer

You are looking for their thinking process, their organization, and how methodical they are in finding problem sources. You are also looking for how creative they can be in solving them.

141

Reference answer

I would implement a zero-downtime deployment strategy using techniques like blue/green deployments or canary releases. In a blue/green deployment, two identical production environments are set up. The new version is deployed to the inactive ("green") environment, and once it's ready, the traffic is switched from the active ("blue") environment to the green one. Canary releases involve deploying a new version to a small subset of users before rolling it out to the rest. Both these techniques allow for testing in production-like environments and quick rollback if necessary, ensuring zero downtime during deployments.
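One common way to pick the "small subset of users" for a canary is a stable hash of the user ID, so the same user always lands on the same version. A minimal sketch, assuming string user IDs and a percentage knob (the function name `routes_to_canary` is hypothetical):

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Return True if this user should be served by the canary version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # stable bucket 0..99
    return bucket < canary_percent
```

Because the bucket is derived only from the user ID, raising `canary_percent` from 5 to 50 keeps every user who was already on the canary there, and setting it to 0 acts as an instant rollback.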

142

Reference answer

- Use CDNs (Content Delivery Networks) to distribute traffic.
- Rate-limiting to throttle excessive requests.
- Auto-scaling infrastructure to absorb spikes.
- Deploy Web Application Firewalls (WAFs) to block malicious traffic.
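Rate limiting is often implemented as a token bucket. The sketch below is one possible single-process version; the class name and parameters are illustrative assumptions, and a real deployment would usually enforce this at the load balancer or API gateway instead.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should reject or delay the request
```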

143

Reference answer

- Proactive Monitoring: Involves collecting metrics and logs to predict potential failures and address issues before they become critical. Implemented using tools like Prometheus, Datadog, and Grafana with predictive alerts based on trends (e.g., resource saturation, memory leaks).
- Reactive Monitoring: Responds to issues as they happen, using alerts triggered by failures, high error rates, or performance degradation. Implemented through alerting systems integrated with monitoring tools and on-call rotations for handling incidents as they occur.

Proactive monitoring helps prevent outages, while reactive monitoring ensures that incidents are quickly detected and resolved.

144

Reference answer

Auto-scaling automatically adjusts the number of servers or containers based on load. You can implement it with:
- AWS Auto Scaling for EC2 instances.
- Kubernetes Horizontal Pod Autoscaler (HPA) for containerized applications.
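The Kubernetes documentation describes the HPA's core calculation as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance (0.1 by default). The sketch below reproduces only that arithmetic; the function name and the `min_replicas`/`max_replicas` defaults are assumptions, and the real controller also handles missing metrics, pod readiness, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Scale proportionally to how far the metric is from its target."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas              # close enough: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas averaging 90% CPU against a 60% target give a ratio of 1.5, so the sketch asks for 6 replicas.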

145

Reference answer

Blameless postmortems are incident reviews focused on understanding the systemic factors that contributed to a failure, not on individual mistakes. The goal is to learn from the incident and implement preventative measures to improve future reliability, fostering a culture of trust and learning.

146

Reference answer

The DevOps lifecycle is the set of phases through which development and operations teams work together to deliver software faster. It covers planning, coding, building, testing, releasing, deploying, operating, and monitoring, and it is organized around continuous practices such as continuous development, continuous integration, continuous testing, continuous monitoring, and continuous feedback.

The 7 Cs of DevOps:
- Continuous Development
- Continuous Integration
- Continuous Testing
- Continuous Deployment/Continuous Delivery
- Continuous Monitoring
- Continuous Feedback
- Continuous Operations

147

Reference answer

Preparation involves understanding Site Reliability Engineering principles, including system design for reliability, coding in multiple languages (e.g., Python, Go), and practicing troubleshooting scenarios. Candidates should review Google's SRE books, learn about load balancing, monitoring, and incident management, and work on real-world systems problems. Behavioral preparation is also crucial to demonstrate leadership and teamwork.

148

Reference answer

- Canary Deployments: Roll out new features to a small subset of users first to test and monitor performance before full deployment.
- Blue-Green Deployment: Run two environments: one live (blue) and one staging (green). After validating the new version in green, switch traffic to it.
- Feature Flags: Enable or disable specific features without redeploying the entire application.
- Automated Testing: Ensure that integration, unit, and end-to-end tests pass before deployment.

149

Reference answer

A site reliability engineer addresses issues such as incident response, monitoring and alerting, capacity planning, performance tuning, automating operational tasks to reduce toil, managing service level objectives, conducting postmortems, and ensuring system reliability and availability.

150

Reference answer

On-call rotations are managed by scheduling engineers to be available for incident response, ensuring proper documentation, and providing necessary training to handle incidents effectively.

151

Reference answer

Look for answers that outline the following differences and use cases: IaaS (Infrastructure as a Service) provides virtualized computing resources online. It's best used for custom, scalable computing environments. AWS EC2 is an example. PaaS (Platform as a Service) offers a platform where customers can develop, run, and manage applications without building and maintaining the infrastructure. Examples include Heroku and Google App Engine. SaaS (Software as a Service) is a software distribution model in which service providers host applications and make them available to customers over the internet. Examples include Salesforce, Docusign, Zelt, and even TestGorilla.

152

Reference answer

An inode is a data structure that stores metadata about a file, including its size, permissions, timestamps, owner, and pointers to the data blocks on disk. Each file has a unique inode number. Inodes do not store the filename; that is stored in directory entries. They are essential for filesystem operations like reading, writing, and permissions checking.
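A quick way to see that the filename lives in the directory entry rather than in the inode is to create a hard link and compare what `os.stat` reports for both names. This is a small sketch assuming a POSIX-style filesystem that supports hard links; the file names are arbitrary.

```python
import os
import tempfile


def inode_demo():
    """Create a file and a hard link to it, then compare their inode metadata."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "report.txt")
        alias = os.path.join(tmp, "report-link.txt")
        with open(original, "w") as fh:
            fh.write("hello")
        os.link(original, alias)          # second directory entry, same inode
        a, b = os.stat(original), os.stat(alias)
        return {
            "same_inode": a.st_ino == b.st_ino,
            "link_count": a.st_nlink,     # directory entries pointing at the inode
            "size": a.st_size,
        }
```

Both names resolve to the same inode number, and the link count stored in the inode goes up to 2, which is also why deleting one name does not delete the data.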

153

Reference answer

In the context of Site Reliability Engineering, Accelerated Problem Resolution (APR) is crucial for quickly addressing and resolving issues that affect system performance and reliability. Here are five main points about APR in Site Reliability Engineering:
- **Monitoring and Alerting**: Continuous monitoring is fundamental in APR. It involves actively observing system metrics to detect anomalies or performance degradation. When an anomaly is detected, alerts are generated to notify the Site Reliability Engineers.
- **Rapid Diagnosis**: Speed is crucial in problem resolution to minimize downtime. SREs perform a quick initial assessment to understand the nature and severity of the issue. They gather data, logs, and other diagnostic information to pinpoint the root cause.
- **Issue Resolution and Mitigation**: Once the root cause is identified, the SREs focus on resolving the issue. Depending on the nature of the problem, this can involve applying hotfixes, rerouting network traffic, or scaling resources. In addition to resolution, mitigation strategies might be used to reduce the impact of the issue on the system and users.
- **Post-mortem Analysis and Documentation**: After resolving the issue, a thorough post-mortem analysis is conducted to understand the cause, how it was addressed, and the impact it had. This information is documented for future reference, learning, and improving response strategies.
- **Continuous Improvement**: Insights from post-mortem analysis are used to improve the system and the incident response process. This includes implementing preventive measures, enhancing monitoring tools, improving alerting mechanisms, and refining protocols for quicker and more efficient resolution of future incidents.

154

Reference answer

Security in SRE involves applying the principle of least privilege for access control, using secure methods for secrets management, performing regular vulnerability scanning, keeping systems patched, and integrating security monitoring into our alerting pipeline.

155

Reference answer

Graceful degradation is a strategy where a system continues to operate with reduced functionality in the event of partial failures. This ensures that critical services remain available, even if some features are temporarily disabled or limited.

156

Reference answer

Every architecture is different, so you are looking for them to mention networking problems, resource allocation, unusual service interactions, and so on.

157

Reference answer

It's true: The 'e' in SRE stands for engineering, and SREs have technical skills. But this role requires more people skills and change agent capabilities than some other IT roles. 'While the SRE position is an engineering role, it is atypical to what one thinks of an engineering role,' says Oehrlich of the DevOps Institute. 'While in some organizations existing monitoring practices, on-call procedures, and other standard processes are already well-established, an SRE should think and challenge existing ways of working. This calls for creativity and tenacity.' Lots of roles might pay lip service to creativity and tenacity as desired traits in the job description. In SRE, though, they're actually critical characteristics, especially when dealing with egos, cultural resistance to change, and other challenges. 'As hiring manager, I would ask for examples where the individual has shown such qualities, how they go about it, and what has been achieved,' Oehrlich says.

158

Reference answer

Load balancing is a process used in computing to distribute network or application traffic across a number of servers or resources. This distribution improves the responsiveness and availability of applications, websites or databases by ensuring no single server bears too much demand. One of its main benefits is to ensure application reliability by redistributing traffic during peak times or when a server fails. This ensures users get served without experiencing lag or service unavailability. Load balancing can also provide redundancy by automatically rerouting traffic to a backup server if the primary server fails, ensuring high availability and disaster recovery. In addition, load balancing optimizes resource use as it allows you to use your servers more efficiently and increases the overall capacity of your application. For example, in a previous role, I implemented a load balancer in front of our cluster of web servers. This significantly improved the application's performance during peak times and ensured a smooth user experience, even if one of the servers ran into issues.

159

Reference answer

A service registry is a dynamic database of service instances and their locations, used for service discovery in a microservices architecture. It helps services find and communicate with each other by maintaining an updated list of available services.
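To make the "updated list of available services" concrete, here is a toy in-memory registry in which instances stay listed only while they keep sending heartbeats. It is a sketch under assumptions: the class, method names, and TTL are hypothetical, and real registries such as Consul or etcd add replication, health checks, and watch APIs.

```python
import time


class ServiceRegistry:
    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl                  # seconds an instance stays live without a heartbeat
        self.clock = clock
        self._instances = {}            # service name -> {address: last heartbeat time}

    def register(self, service, address):
        self._instances.setdefault(service, {})[address] = self.clock()

    def heartbeat(self, service, address):
        self.register(service, address)  # a heartbeat just refreshes the timestamp

    def deregister(self, service, address):
        self._instances.get(service, {}).pop(address, None)

    def lookup(self, service):
        """Return live addresses, dropping any whose heartbeat is older than the TTL."""
        now = self.clock()
        instances = self._instances.get(service, {})
        expired = [a for a, seen in instances.items() if now - seen > self.ttl]
        for address in expired:
            del instances[address]
        return sorted(instances)
```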

160

Reference answer

Proactive RCA: The main question in proactive RCA is "What could go wrong?" RCA can also be used proactively to mitigate failure or risk, and it is most valuable when applied to events that have not occurred yet. Proactive RCA is performed before any failure or defect occurs.

Reactive RCA: The main question in reactive RCA is "What went wrong?" The failure or defect must already have occurred before its root cause can be investigated; the analysis can only be performed once a problem has caused the system to malfunction.

Advantages of reactive RCA:
- Helps prioritize tasks according to severity and then resolve them.
- Improves teamwork and shared knowledge.

Disadvantages of reactive RCA:
- Repairing equipment after a failure can be more costly than preventing the failure in the first place.
- Failed equipment can cause greater damage to the system and interrupt production activities.

161

Reference answer

Talk about any script or tool you wrote — maybe a log parser, a restart script, or a dashboard that replaced manual checks. Highlight the impact: time saved, fewer errors, etc.

162

Reference answer

For automating repetitive tasks, I follow these steps:
- Identify Repetitive Tasks: These can include infrastructure provisioning, monitoring configuration, and incident response.
- Use Infrastructure as Code (IaC): Tools like Terraform and Ansible are great for automating infrastructure provisioning.
- Set Up CI/CD Pipelines: Automate deployments and testing using Jenkins, GitLab CI, or ArgoCD.
- Leverage Automation Tools: Tools like RunDeck or SaltStack are useful for automating operational workflows and incident response.
- Monitor and Maintain: Use monitoring and alerting systems like Prometheus and Grafana to ensure automation is working as expected.

Automation tools reduce human error and free up resources for more strategic tasks.

163

Reference answer

A runbook is a set of standardized procedures for troubleshooting and resolving specific system issues. It ensures that any team member can resolve incidents efficiently, improving response time during outages.

164

Reference answer

SLO stands for Service Level Objective: an agreed-upon target within an SLA for a specific metric, such as uptime or response time. SLOs are set for each activity, function, and process to give customers the best chance of success. They can also include business metrics such as conversion rates, uptime, and availability.

165

Reference answer

Monitoring refers to the process of collecting and displaying predefined metrics (e.g., CPU usage, latency). Observability is a broader concept that includes monitoring but focuses on the ability to understand and diagnose systems from external outputs (logs, metrics, traces). Observability allows SREs to troubleshoot and debug without predefining every potential issue.

166

Reference answer

The toughest challenge I have faced was a critical production outage caused by a corrupted filesystem on a key server. The system was unresponsive, and data recovery was at risk. I overcame it by first remaining calm and methodically diagnosing the issue using fsck after unmounting the drive to repair the filesystem. I then restored services from backups and implemented automated monitoring and regular filesystem checks to prevent recurrence. This experience reinforced the importance of staying calm under pressure and having robust backup and recovery procedures.

167

Reference answer

- Start by checking network latency and packet loss using tools like ping or traceroute.
- Use netstat or tcpdump to analyze network traffic and identify potential bottlenecks.
- Check firewall rules and security groups for misconfigurations.
- Review load balancer settings and DNS configurations.
- Monitor bandwidth usage and QoS (Quality of Service) settings.

168

Reference answer

1. To reduce organizational silos between development and operations
2. To ensure that smaller changes are easier to implement and deploy
3. To reduce risks and make it easier to roll back when problems arise
4. To treat operations and software engineering problems as separate areas

169

Reference answer

In one of my previous roles, we leveraged machine learning to optimize system performance in the context of our e-commerce platform. One of the challenges we frequently encountered was correctly predicting the demand for computing resources for different services based on the time of day, day of the week, and other events like sales or launches. To address this, we utilized a machine learning model that used historical data as input to predict future demand. We first instrumented our systems to gather data about request count, server load, error rate, and response times. This data, combined with contextual information about the time of day, day of the week, and any special events, was fed into our ML model. The model was trained to predict the load on our servers and we used the output to handle autoscaling of our cloud resources. Implementing this machine learning model significantly improved our autoscaling logic. It helped us proactively adjust our resources in advance of anticipated load spikes and reduced resource waste during periods of low demand, optimizing system performance and cost-efficiency.

170

Reference answer

The difference between the two is that: A process is an instance of a running program with its own dedicated memory space. A thread is the smallest unit of processing that can be scheduled by an operating system. Threads operate within a process and share its memory space.

171

Reference answer

“Site Reliability Engineering (SRE) is a discipline that incorporates software engineering principles into infrastructure and operations tasks to create scalable and reliable systems. The goal of SRE is to improve service reliability through automation, monitoring, and proactive solutions while maintaining performance and ensuring availability.”

172

Reference answer

1. To reduce organizational silos between development and operations
2. To focus on building scale and more reliable software
3. To treat operations and software engineering problems as separate areas
4. To measure the ability dividing the good interactions by the total interactions we have to a service or product

173

Reference answer

Distributed tracing tool for microservices (e.g., tracking request flows).

174

Reference answer

- Use tools like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault to store secrets securely.
- Ensure least privilege access and encrypt sensitive data at rest and in transit.
- Rotate credentials regularly and audit access to secrets.
- Avoid hardcoding sensitive information in code or configurations.

175

Reference answer

Prioritization is guided by error budgets. If the error budget is not exhausted, I allocate a portion of time to reliability improvements and feature work, based on business impact. Urgent incidents take immediate precedence to preserve SLOs. Reliability improvements are prioritized based on their potential to reduce toil or prevent future incidents. Feature requests are evaluated for alignment with reliability goals and capacity. I use a structured framework like ICE (Impact, Confidence, Ease) or RICE to balance these competing demands.
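The error budget arithmetic behind this is simple enough to sketch. The functions below assume an availability SLO measured as downtime over a rolling window; the names and the two-way policy string are illustrative assumptions, not a standard.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed unavailability in the window for a given SLO."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


def work_policy(remaining_fraction):
    """Assumed policy: an exhausted budget pauses feature work."""
    if remaining_fraction > 0:
        return "feature work and reliability improvements"
    return "reliability work only"
```

A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime, so a single 21.6-minute incident spends about half the budget.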

176

Reference answer

- Use spot instances: Deploy non-critical workloads on spot instances or preemptible VMs for cost savings, with autoscalers that manage sudden instance termination.
- Right-sizing nodes: Use Cluster Autoscaler and ensure your node types are appropriately sized based on workload requirements.
- Optimize resource requests: Ensure each service has accurate CPU and memory requests/limits to avoid over-provisioning resources.
- Idle resources: Identify and scale down idle or underutilized resources with the help of tools like Kubernetes Metrics Server or KubeCost.
- Serverless functions: Use serverless compute where applicable (e.g., Knative or AWS Fargate) to avoid the overhead of running always-on infrastructure.

Balancing cost optimization with high availability requires continuous monitoring and fine-tuning resource allocations based on actual usage.

177

Reference answer

Configuration management ensures that systems are configured consistently and correctly. It involves maintaining and versioning configuration files, automating configuration changes, and using tools like Ansible, Puppet, or Chef to manage configurations across environments.

178

Reference answer

When rolling out a new feature, the first step is rigorous testing in isolated and controlled environments. We run a whole suite of tests such as unit tests, integration tests, and system tests to verify the functionality and catch any bugs or performance issues. Beyond functional correctness, it's important to test the load and stress handling capabilities of the new feature. Load testing and stress testing help identify performance bottlenecks and ensure that the feature can handle real-world traffic patterns and volumes. A good practice is to use a canary deployment or a similar gradual rollout strategy. The new feature can be released to a small percentage of users initially. This allows us to observe the impact under real-world conditions, while limiting potential negative effects. Monitoring the effects of the new feature is also crucial. I typically adjust our monitoring systems to capture key metrics for the new feature, allowing us to quickly identify and react to any unexpected behavior. If anything seems off, we can quickly roll back the feature, fix the issue, and then resume the rollout once we're confident that the issue has been addressed.

179

Reference answer

To back up a directory to a remote server, you can use a Bash script with rsync for efficient file transfer. Here's a simple example: rsync -avz /local/directory user@remote:/remote/directory

180

Reference answer

To implement a blue-green deployment strategy, I would maintain two identical environments: one active (blue) and one idle (green). After deploying and testing the new version in the green environment, I would switch traffic from blue to green, ensuring a seamless transition with minimal downtime.

181

Reference answer

The Linux shell is an integral part of a Linux system. Linux is a free, open-source operating system whose kernel was created by Linus Torvalds, and it is the most popular OS on servers and embedded devices. A shell is a command-line interface (CLI): a text-based interface for running commands, performing file management tasks, and issuing other system commands.

A shell can run in two modes:
- Interactive shell: Reads commands typed by a user at a terminal, for example the shell started when a user logs in.
- Non-interactive shell: Runs commands from a script or another program without user input.

Non-interactive shells are commonly used for administrative tasks such as managing user accounts, applications, or services through scripts.

On a typical Linux system, the following shells are widely used:
- KSH (Korn Shell)
- BASH (Bourne Again Shell)
- TCSH
- CSH (C Shell)
- Bourne Shell
- ZSH

182

Reference answer

Load testing involves simulating high traffic conditions to evaluate system performance and identify bottlenecks. It helps ensure that the system can handle expected and peak loads, providing insights into scalability and reliability.

183

Reference answer

- Vertical Scaling: Increase the capacity of existing resources (e.g., bigger servers).
- Horizontal Scaling: Add more instances (e.g., more servers or containers).
- Optimize the application by load balancing, caching (e.g., Redis), and database sharding.

184

Reference answer

- Blue-green deployment: Implement blue-green deployment for smooth cutover to the new system while keeping the old system intact until the migration is verified.
- Data replication: Use real-time replication between old and new databases (e.g., AWS DMS) to keep data in sync during the migration.
- Incremental migration: Migrate services or data in small, controlled increments instead of a "big bang" approach.
- Canary testing: Deploy the new system to a small percentage of users first to validate functionality and performance.
- Downtime windows: Plan migration during off-peak hours to minimize user impact and communicate downtime windows in advance.
- Rollback plan: Prepare a detailed rollback plan to quickly revert to the previous state in case of failure.

Minimizing downtime during a migration requires careful planning, testing, and the ability to roll back quickly if issues arise.

185

Reference answer

Horizontal scaling involves adding more instances to distribute the load, while vertical scaling involves adding more resources (CPU, memory) to existing instances. Horizontal scaling provides better fault tolerance and load distribution, while vertical scaling can be limited by hardware constraints.

186

Reference answer

- Implement access controls: Ensure only those you trust have access to sensitive systems and information.
- Conduct regular security checks: Identify vulnerabilities and risks often and early.
- Monitor and log activity: Proactively monitor systems and logs for suspicious activities.
- Implement backups and disaster recovery: Having backups helps you recover systems quickly and effectively.

187

Reference answer

- Immediate mitigation: Roll back the release if necessary, or implement a hotfix.
- Communicate with stakeholders: Inform the relevant teams and users of the outage and expected resolution times.
- Incident documentation: Record detailed steps about what went wrong and how it was resolved.
- Postmortem analysis: Conduct a blameless postmortem to understand the root cause (e.g., a bug, configuration error, or infrastructure issue).
- Automated testing and CI/CD improvements: Strengthen automated testing, add canary releases or blue-green deployments, and improve staging environment testing to prevent future issues.

188

Reference answer

Common indicators to assess system health include latency (response times), error rates (e.g., HTTP 5xx errors), traffic (request volume), saturation (resource utilization like CPU, memory, disk I/O, and network bandwidth), and availability (uptime or success rate). The first four are the Four Golden Signals of monitoring; availability is commonly tracked alongside them.
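As a rough illustration, the signals can be derived from a window of request records plus one utilization reading. Everything here is an assumption for the sketch: the record fields (`latency_ms`, `status`), the choice of p99 for latency, and CPU utilization as the saturation proxy.

```python
def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize latency, traffic, errors, and saturation for one time window."""
    total = len(requests)
    if total == 0:
        return {"latency_p99_ms": None, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": cpu_utilization}

    latencies = sorted(r["latency_ms"] for r in requests)
    # Nearest-rank p99: index of the ceil(0.99 * total)-th smallest value.
    p99_index = (99 * total + 99) // 100 - 1
    errors = sum(1 for r in requests if 500 <= r["status"] <= 599)

    return {
        "latency_p99_ms": latencies[p99_index],
        "traffic_rps": total / window_seconds,
        "error_rate": errors / total,
        "saturation": cpu_utilization,
    }
```

In practice these numbers come from a metrics system rather than raw records, but the definitions stay the same.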

189

Reference answer

A Service Level Agreement (SLA) is a contract that outlines the level of service a customer can expect from a service provider. In the context of site reliability engineering, it defines key performance metrics like uptime, response time, and problem resolution times. This is important because it sets clear expectations between the service provider and the customer, mitigating any possible disputes about service quality. One key component of an SLA that site reliability engineers pay the most attention to is uptime, often represented as a percentage like 99.95%. Our job is to develop and maintain systems to at least meet, if not exceed, this target. Having well-defined SLAs directs our strategies for redundancy, failovers, and maintenance schedules. It also plays a significant role in how we plan for growth and capacity, making sure we can meet these commitments even during peak usage periods. In my previous role, I have actively used SLAs as a benchmark to guide my decisions - whether it's designing new features, performing system upgrades, or responding to incidents - the SLA has always acted as a key measure of our services' reliability and quality.

190

Reference answer

The TCP three-way handshake is the process to establish a reliable connection between a client and server: SYN, SYN-ACK, ACK. It ensures both sides are ready to communicate and synchronizes sequence numbers for reliable data transfer. SREs understand this for troubleshooting network issues, optimizing connection timeouts, and configuring load balancers or firewalls that handle TCP connections.

191

Reference answer

Leverage Infrastructure as Code (IaC) tools like Ansible, Puppet, or Terraform to automate and version control configuration across servers, ensuring consistency and repeatability.

192

Reference answer

SREs detect memory leaks through monitoring memory usage over time (e.g., gradual increase), heap dumps, and profiling tools (e.g., Valgrind, gperftools, or application-specific profilers). To handle them, they can restart the process temporarily, then analyze the root cause (e.g., unreferenced objects, poor cache management). Long-term fixes involve code changes and better resource management, with alerts set to detect abnormal memory growth.
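The "gradual increase" check can be approximated by fitting a straight line to memory samples and alerting on the slope. A minimal sketch, assuming `(hours, megabytes)` samples and an arbitrary threshold; both the function names and the 5 MB/hour default are hypothetical.

```python
def growth_rate(samples):
    """Least-squares slope of memory (MB) over time (hours)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        return 0.0                 # all samples share one timestamp
    return covariance / variance


def looks_like_leak(samples, threshold_mb_per_hour=5.0):
    """Flag sustained growth; needs a few samples to mean anything."""
    if len(samples) < 3:
        return False
    return growth_rate(samples) > threshold_mb_per_hour
```

A slope alone cannot tell a leak from a cache that is still warming up, so this is a trigger for profiling, not a diagnosis.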

193

Reference answer

It is essential to balance development velocity and reliability in SRE to align with business goals.

194

Reference answer

A service-level objective (SLO) defines the target availability (uptime) we want for a system or service. We define reliability as meeting our SLOs.

Follow up: What is an SLA? An SLI?

A service-level agreement (SLA) is the uptime promise that we make to a customer. SLAs are often legally defined, with penalties for missing the target availability. For this reason, SLAs are generally set at figures that are easier to meet than SLOs.

A service-level indicator (SLI) is something you can measure with precision to help you think about, define, and determine whether you are meeting SLOs and SLAs. SLIs are generally reported as the ratio of good events to total events. A simple example would be successful HTTP requests / total HTTP requests. SLIs are frequently reported as a percentage, with 0% meaning everything is broken and 100% meaning everything is working perfectly.
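The relationship between the three terms is easy to show with numbers. In this sketch the SLI is the good-events ratio, and it is compared against an internal SLO and a looser SLA; the function names and the example targets are assumptions for illustration.

```python
def sli(good_events, total_events):
    """Good events divided by total events; None when there was no traffic."""
    if total_events == 0:
        return None
    return good_events / total_events


def check_targets(good_events, total_events, slo=0.999, sla=0.995):
    """Report whether the measured SLI meets the SLO and the looser SLA."""
    value = sli(good_events, total_events)
    if value is None:
        return {"sli": None, "meets_slo": True, "meets_sla": True}
    return {"sli": value, "meets_slo": value >= slo, "meets_sla": value >= sla}
```

With 9,970 successful requests out of 10,000, the SLI is 99.7%: the SLO of 99.9% is missed while the SLA of 99.5% still holds, which is exactly the safety margin the answer describes.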

195

Reference answer

Observability is the ability to understand a system's internal state based on its external outputs, such as logs, metrics, and traces. To improve observability, I would implement structured logging, distributed tracing, comprehensive metric collection, effective alerting, dashboards for key SLIs, and ensure teams have access to actionable data for debugging and incident response.

196

Reference answer

The three pillars are logs, metrics, and traces. I depend on metrics the most because they provide real-time, aggregated data on system health and performance, enabling quick identification of anomalies and trends, though all three are essential for full observability.

197

Reference answer

Organizations can implement metrics and monitoring, capacity planning, change management, emergency response, and cultural changes to manage their SRE efforts effectively.

198

Reference answer

A service-level agreement (SLA) is a guarantee of uptime that we give to a client. SLAs are sometimes legally binding, and there may be penalties if the promised availability is not met. As a result, SLAs are usually set at values that are easier to meet than SLOs.

199

Reference answer

Observability is basically a conversation about measurement and instrumentation in an organization.
- Understand what types of data flow from an environment, and which of those data types are relevant and useful to your observability goals.
- Get a clear picture of what a team cares about, and figure out how your strategy makes sense of data by distilling, curating, and transforming it into actionable insights into the performance of your systems.
- Observability offers potentially useful clues about an organization's DevOps maturity level.

200

Reference answer

While both promote collaboration between dev and ops, SRE is a specific approach that applies software engineering to ops, emphasizing reliability via SLOs/SLIs and error budgets. DevOps is broader, focusing on culture and faster delivery.
