SRE London Matters

Can you manage and mentor great engineers and still somehow find the time to jump in at the coal face with your team? Well, actually, yes. An overused term, perhaps, but how would it feel to use your years of expertise to genuinely take on the corporate behemoths of the financial world? Join Microsoft, they said; get a safe career at Oracle, they said. Instead, you gravitated away from the safe corporate world towards roguish, rugged, chaotic start-ups. For the most part, a site reliability engineer juggles multiple tasks and projects at any one time, so for most SREs, the tools they use reflect their ever-evolving responsibilities.

A typical SRE is busy automating, cleaning up code, upgrading servers, and continually monitoring dashboards for performance. As technology ecosystems become increasingly complex, organizations need a broader range of professionals to focus on tasks like product development, troubleshooting, and customer service, and SRE and DevOps have emerged as two of the most critical approaches to success. Capacity planning is a good illustration: in intent-based capacity planning, a given set of production dependencies can be shared, possibly with different stipulations around intent.

Demand for one service trickles down to result in demand for one or more other services. Understanding the chain of dependencies helps formulate the general scope of the bin packing problem, but we still need more information about expected resource usage. How many compute resources does service Foo need to serve N user queries?

For every N queries of service Foo, how many Mbps of data do we expect for service Bar? Performance metrics are the glue between dependencies. They convert from one or more higher-level resource types to one or more lower-level resource types. Deriving appropriate performance metrics for a service can involve load testing and resource usage monitoring. Inevitably, resource constraints result in trade-offs and hard decisions: of the many requirements that all services have, which requirements should be sacrificed in the face of insufficient capacity?
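To make the "glue" role of performance metrics concrete, here is a minimal Python sketch. The service names Foo and Bar follow the text, but the per-unit ratios and the function name are hypothetical placeholders for values that would normally come from load testing and resource usage monitoring.

```python
# Convert a higher-level resource type (user queries to service Foo) into
# lower-level resource types (CPU for Foo, network bandwidth toward service Bar).
# The ratios below are invented for illustration.

CPU_CORES_PER_1K_FOO_QPS = 2.5   # cores needed per 1,000 queries/sec of Foo
BAR_MBPS_PER_1K_FOO_QPS = 40.0   # Mbps of Bar traffic per 1,000 queries/sec of Foo


def foo_resource_needs(foo_qps: float) -> dict:
    """Translate expected Foo demand into lower-level resource requirements."""
    scale = foo_qps / 1000.0
    return {
        "foo_cpu_cores": scale * CPU_CORES_PER_1K_FOO_QPS,
        "bar_bandwidth_mbps": scale * BAR_MBPS_PER_1K_FOO_QPS,
    }


# For 50,000 queries/sec of Foo, estimate downstream resource demand.
print(foo_resource_needs(50_000))
# -> {'foo_cpu_cores': 125.0, 'bar_bandwidth_mbps': 2000.0}
```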

Intent-driven planning forces these decisions to be made transparently, openly, and consistently. Without it, the same resource constraints entail the same trade-offs, but all too often the prioritization is ad hoc and opaque to service owners. Intent-based planning allows prioritization to be as granular or as coarse as needed. Auxon is a perfect case study to demonstrate how software development can be fostered within SRE.

Auxon is actively used to plan the use of many millions of dollars' worth of machine resources at Google, and it has become a critical component of capacity planning for several major divisions within the company. Service owners express their intent as requirements for how they would like the service to be provisioned. These requirements, the intent, are ultimately represented internally as a giant mixed-integer or linear program. Auxon solves the program and uses the resulting bin packing solution to formulate an allocation plan for resources.

Performance Data describes how a service scales: for every unit of demand X in cluster Y, how many units of dependency Z are used?

This scaling data may be derived in a number of ways depending on the maturity of the service in question.

Some services are load tested, while others infer their scaling based upon past performance. Per-Service Demand Forecast Data describes the usage trend for forecasted demand signals.

Some services derive their future usage from demand forecasts, such as a forecast of queries per second broken down by continent. Not all services have a demand forecast: some services derive their demand entirely from the services that depend upon them. Resource Supply provides data about the availability of base-level, fundamental resources: for example, the number of machines expected to be available for use at a particular point in the future. In linear program terminology, the resource supply acts as an upper bound that limits how services can grow and where services can be placed.

Ultimately, we want to make the best use of this resource supply as the intent-based description of the combined group of services allows. Resource Pricing provides data about how much base-level, fundamental resources cost.

In linear program terminology, the prices inform the overall calculated costs, which act as the objective that we want to minimize.

Intent Config is the key to how intent-based information is fed to Auxon. It defines what constitutes a service, and how services relate to one another. The config ultimately acts as a configuration layer that allows all the other components to be wired together. Auxon Configuration Language Engine acts based upon the information it receives from the Intent Config. This component formulates a machine-readable request: a protocol buffer that can be understood by the Auxon Solver.
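The text does not show Auxon's actual configuration language, so the following is only an illustrative sketch, expressed as a plain Python data structure, of the flavor of intent a service owner might declare: what the service is, what it depends on, how its demand is forecast, and how strongly each requirement should be weighted. Every field name here is invented.

```python
# Hypothetical intent declaration for a service; field names are placeholders
# meant to convey intent-based input, not Auxon's real schema.
SERVICE_FOO_INTENT = {
    "service": "foo",
    "depends_on": [
        {"service": "bar", "metric": "mbps_per_1k_qps", "ratio": 40.0},
    ],
    "demand_forecast": {
        "signal": "queries_per_second",
        "breakdown": "continent",
    },
    "requirements": [
        # Owner intent: survive the loss of one cluster; prefer cheap locations.
        {"kind": "n_plus_one_redundancy", "priority": "must_have"},
        {"kind": "prefer_low_cost_locations", "priority": "nice_to_have"},
    ],
}
```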

The Configuration Language Engine applies light sanity checking to the configuration, and is designed to act as the gateway between the human-configurable intent definition and the machine-parseable optimization request. The Auxon Solver is the brain of the tool. It formulates the giant mixed-integer or linear program based upon the optimization request received from the Configuration Language Engine. In addition to mixed-integer linear programming toolkits, the Auxon Solver also contains components that handle tasks such as scheduling, managing a pool of workers, and descending decision trees.
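As a rough illustration of the kind of optimization involved, here is a toy linear program using SciPy rather than Auxon's actual code: resource prices form the objective to minimize, resource supply forms the upper bounds, and demand is a constraint that must be satisfied. The clusters, supplies, and prices are invented.

```python
# Place one service's demand across two clusters at minimum cost, subject to
# per-cluster machine supply. Auxon's real formulation is far larger (many
# services, many resource types, mixed-integer constraints).
from scipy.optimize import linprog

# Decision variables: x = [machines_in_cluster_a, machines_in_cluster_b]
prices = [1.0, 1.3]      # relative cost per machine in each cluster (objective)
supply = [60, 80]        # machines available in each cluster (upper bounds)
total_demand = 100       # machines the service needs overall

result = linprog(
    c=prices,                              # minimize total cost
    A_eq=[[1, 1]], b_eq=[total_demand],    # demand must be fully satisfied
    bounds=[(0, supply[0]), (0, supply[1])],
)

print(result.x)    # [60., 40.]: fill the cheaper cluster first
print(result.fun)  # 112.0: total cost of this allocation plan
```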

The Allocation Plan is the output of the Auxon Solver. It prescribes which resources should be allocated to which services in what locations. Having performed manual capacity planning in spreadsheets themselves, the team members were well positioned to understand the inefficiencies, the opportunities for improvement through automation, and the features such a tool might require. Through these ongoing interactions, the team was able to stay grounded in the production world: they acted as both the consumer and developer of their own product.

When the product failed, the team was directly impacted. Launch and iterate. Any sufficiently complex software engineering effort is bound to encounter uncertainty as to how a component should be designed or how a problem should be tackled. Auxon met with such uncertainty early in its development because the linear programming world was uncharted territory for the team members. The limitations of linear programming, which seemed to be a central part of how the product would likely function, were not well understood.

In the case of the Stupid Solver (an early, deliberately simple solver), the entire solver interface was abstracted away within Auxon such that the solver internals could be swapped out at a later date. Eventually, as we built confidence in a unified linear programming model, it was a simple operation to switch out the Stupid Solver for something, well, smarter.
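The following Python sketch shows the shape of that abstraction, with invented class and method names rather than Auxon's actual internals: callers depend only on a small solver interface, so a naive placeholder can later be swapped for a real linear-programming backend without touching the rest of the system.

```python
from abc import ABC, abstractmethod


class Solver(ABC):
    @abstractmethod
    def solve(self, request: dict) -> dict:
        """Turn an optimization request into an allocation plan."""


class StupidSolver(Solver):
    """Placeholder: greedily fill the cheapest location first."""

    def solve(self, request: dict) -> dict:
        plan, remaining = {}, request["demand"]
        for location in sorted(request["supply"], key=request["prices"].get):
            take = min(remaining, request["supply"][location])
            plan[location], remaining = take, remaining - take
        return plan


class LinearProgramSolver(Solver):
    """Later drop-in replacement backed by a real LP/MIP toolkit."""

    def solve(self, request: dict) -> dict:
        raise NotImplementedError("wire up an LP library here")


def plan_capacity(solver: Solver, request: dict) -> dict:
    # The rest of the system sees only the Solver interface.
    return solver.solve(request)


print(plan_capacity(StupidSolver(), {
    "demand": 100,
    "supply": {"cluster_a": 60, "cluster_b": 80},
    "prices": {"cluster_a": 1.0, "cluster_b": 1.3},
}))
# -> {'cluster_a': 60, 'cluster_b': 40}
```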

Building software with fuzzy requirements can be a frustrating challenge, but some degree of uncertainty need not be a showstopper. Use this fuzziness as an incentive to ensure that the software is designed to be both general and modular.

For example, we did not initially know how the Allocation Plan would be consumed downstream: at the time, the world of automation systems was in a great deal of flux, as a huge variety of approaches were in use.

Rather than try to design unique solutions to allow Auxon to work with each individual tool, we instead shaped the Allocation Plan to be universally useful such that these automation systems could work on their own integration points. We also leveraged modular designs to deal with fuzzy requirements when building a model of machine performance within Auxon.

Playbooks contain high-level instructions on how to respond to automated alerts. They explain the severity and impact of the alert, and include debugging suggestions and possible actions to take to mitigate impact and fully resolve the alert. In SRE, whenever an alert is created, a corresponding playbook entry is usually created. These guides reduce stress, the mean time to repair (MTTR), and the risk of human error.

First, the Nooglers shadowed the more experienced, local SREs and then joined the rotation. At three months, they became the primary on-call, with the Kirkland SREs as backup; that way, they could easily escalate to the Kirkland SREs if needed. Good documentation and the various strategies discussed earlier all helped the team form a solid foundation and ramp up quickly.

While the approach used by the Mountain View SREs made sense when a cohort of SREs were becoming a team, they needed a more lightweight approach when only one person joined the team at a given time. In anticipation of future turnover, the SREs created service architecture diagrams and formalized the basic training checklist into a series of exercises that could be completed semi-independently with minimal involvement from a mentor.

These exercises included describing the storage layer, performing capacity increases, and reviewing how HTTP requests are routed.

Before migrating to the cloud, Evernote ran only in on-premises datacenters built to support our monolithic application. Our network and servers were designed with a specific architecture and data flow in mind. This, combined with a host of other constraints, meant we lacked the flexibility needed to support a horizontal architecture.

However, we still had one major hurdle to surmount: migrating all our production and supporting infrastructure to GCP. Fast-forward 70 days: through a Herculean effort and many remarkable feats (moving thousands of servers and all of our user data, for example), the migration was complete. The move to the cloud unleashed the potential for our infrastructure to grow rapidly, but our on-call policies and processes were not yet set up to handle such growth.

Once the migration wrapped up, we set out to remedy the problem. In our previous physical datacenter, we built redundancy into nearly every component. This meant that while component failure was common given our size, generally no individual component was capable of negatively impacting users.

The infrastructure was very stable because we controlled it: any small bump was inevitably due to a failure somewhere in the system. Our alerting policies were structured with that in mind: a few dropped packets, resulting in a JDBC (Java Database Connectivity) connection exception, invariably meant that a VM (virtual machine) host was on the verge of failing, or that the control plane on one of our switches was on the fritz.

In a world of live migrations and network latency, we needed to take a much more holistic approach to monitoring. Reframing paging events in terms of first principles, and writing these principles down as our explicit SLOs (service level objectives), helped give the team clarity regarding what was important to alert on and allowed us to trim the fat from our monitoring infrastructure.
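A minimal sketch of what alerting on an explicit SLO (rather than on low-level causes) can look like follows; the availability target and the burn-rate threshold are assumptions for illustration, not the team's actual values.

```python
# Page only when the user-facing error rate threatens the SLO, rather than on
# low-level infrastructure noise. Target and threshold are illustrative.

SLO_TARGET = 0.999             # 99.9% of API requests should succeed
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail


def should_page(total_requests: int, failed_requests: int,
                burn_rate_threshold: float = 10.0) -> bool:
    """Page when the error rate burns budget much faster than sustainable."""
    if total_requests == 0:
        return False
    error_rate = failed_requests / total_requests
    burn_rate = error_rate / ERROR_BUDGET
    return burn_rate >= burn_rate_threshold


# Example: 2% of requests failing burns budget roughly 20x faster than allowed.
print(should_page(total_requests=100_000, failed_requests=2_000))  # True
```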

Our focus on higher-level indicators such as API responsiveness, rather than lower-level infrastructure signals such as InnoDB row lock waits in MySQL, meant we could spend more time on the real pain our users experience during an outage. For our team, this meant less time spent chasing transient problems, which translated into more sleep, greater effectiveness, and, ultimately, higher job satisfaction. Our primary on-call rotation is staffed by a small but scrappy team of engineers who are responsible for our production infrastructure and a handful of other business systems (for example, staging and build pipeline infrastructure).

One of the ways we achieve this is by keeping our signal-to-noise ratio high through simple but effective alerting SLAs (service level agreements). We classify any event generated by our metrics or monitoring infrastructure into three categories: P1 events, which page the on-call; P2 events, which need action but do not page; and P3 events, which are informational only. Any P1 or P2 event has an incident ticket attached to it. The ticket is used for obvious tasks like event triage and tracking remediation actions, as well as for recording SLO impact, the number of occurrences, and postmortem doc links, where applicable.

When an event pages (category P1), the on-call is tasked with assessing the impact to users. Incidents are triaged into severities from 1 to 3.

For severity 1 (Sev 1) incidents, we maintain a finite set of criteria to make the escalation decision as straightforward as possible for the responder. Once the event is escalated, we assemble an incident team and begin our incident management process. The incident manager is paged, a scribe and a communications lead are elected, and our communication channels open. After the incident is resolved, we conduct a postmortem and share the results far and wide within the company.

For events rated Sev 2 or Sev 3, the on-call responder handles the incident lifecycle, including an abbreviated postmortem for incident review.
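As a rough sketch of that flow, the decision logic might look like the following; the impact criteria are made up for illustration, since the actual Sev 1 checklist is not given in the text.

```python
# P1 events page the on-call, who assesses user impact and assigns a severity;
# Sev 1 triggers the full incident management process, while Sev 2/3 stay with
# the on-call responder. The thresholds below are invented placeholders.

def assign_severity(users_affected_pct: float, core_feature_down: bool) -> int:
    """Map assessed impact to a severity from 1 (worst) to 3."""
    if core_feature_down or users_affected_pct >= 10:
        return 1
    if users_affected_pct >= 1:
        return 2
    return 3


def handle_paging_event(users_affected_pct: float, core_feature_down: bool) -> str:
    severity = assign_severity(users_affected_pct, core_feature_down)
    if severity == 1:
        # Escalate: page the incident manager, elect a scribe and comms lead.
        return "escalate to incident team"
    # Sev 2/3: the on-call owns the lifecycle, including an abbreviated postmortem.
    return "on-call handles incident lifecycle"


print(handle_paging_event(users_affected_pct=15, core_feature_down=False))
# -> escalate to incident team
```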

One of the benefits of keeping our process lightweight is that we can explicitly free the on-call from any expectations of project work. This empowers and encourages the on-call to take immediate follow-up action, and also to identify any major gaps in tooling or process after completing the post-incident review.

In this way, we achieve a constant cycle of improvement and flexibility during every on-call shift, keeping pace with the rapid rate of change in our environment. The goal is to make every on-call shift better than the last. With the introduction of SLOs, we wanted to track performance over time and share that information with stakeholders within the company, so we established a regular review meeting for exactly that purpose. We have also used this forum to review our on-call burden as a barometer of team health, and to discuss remediation actions when we exceed our pager budget.

This forum has the dual purpose of spreading the importance of SLOs within the company and keeping the technical organization accountable for maintaining the health and wellness of our service and team. It can be difficult to pinpoint root causes that are hidden behind layers of cloud abstraction, so having a Googler at our side to take the guesswork out of black-box event triaging was helpful. More importantly, this exercise further reduced our MTTR, which is ultimately what our users care about.

Specifically, this translates into projects such as improving our microservices platform and establishing production readiness criteria for our product development teams. Thus, we perpetuate the cycle of improving on-call for everyone. But what about the specific considerations of being on-call? The following sections discuss these implementation details in more depth.

Now everyone knows that your on-call engineers are unhappy.

What next? Pager load is the number of paging incidents that an on-call engineer receives over a typical shift length (such as per day or per week). An incident may involve more than one page. The required paging response time also shapes what a shift demands: with a tight response time, the SRE needs to be within arm's reach of a charged and authenticated laptop with network access at all times, cannot travel, and must coordinate heavily with the secondary on-call; with a more relaxed response time, the SRE can leave home for a quick errand or a short commute, and the secondary does not need to provide coverage during this time.

The hypothetical Connection SRE Team, responsible for frontend load balancing and terminating end-user connections, found itself in a position of high pager load.

They had an established pager budget of two paging incidents per shift, but for the past year they had regularly been receiving five paging incidents per shift. Analysis revealed that fully one-third of shifts were exceeding their pager budget. Some engineers left the team to join less operationally burdened teams. High-quality incident follow-up was rare, since on-call engineers only had time to mitigate immediate problems. Alerting thresholds were set to align with their SLO, and paging alerts were symptom-based in nature, meaning they fired only when customers were impacted.
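A back-of-the-envelope version of that analysis might look like the following sketch; the per-shift counts are fabricated, and only the budget of two paging incidents per shift comes from the example.

```python
# Given paging-incident counts per shift, compute the average pager load and
# the fraction of shifts that exceeded the pager budget.

PAGER_BUDGET_PER_SHIFT = 2

# Hypothetical paging incidents per shift over recent weeks.
incidents_per_shift = [5, 1, 6, 2, 4, 0, 7, 3, 2, 5, 1, 6]


def pager_load_report(counts: list[int], budget: int) -> dict:
    over_budget = sum(1 for c in counts if c > budget)
    return {
        "average_incidents_per_shift": sum(counts) / len(counts),
        "fraction_of_shifts_over_budget": over_budget / len(counts),
    }


print(pager_load_report(incidents_per_shift, PAGER_BUDGET_PER_SHIFT))
# -> roughly 3.5 incidents per shift, with about 58% of shifts over budget
```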

When senior management were approached with all of this information, they agreed that the team was in operational overload and reviewed the project plan to bring the team back to a healthy state. The team's web of intergroup relationships was complex and had quietly grown difficult to manage.

Despite the team following best practices in structuring their monitoring, many of the pages that they faced were outside their direct control.

For example, a black-box probe may have failed due to congestion in the network, causing packet loss. The only action the team could take to mitigate congestion in the backbone was to escalate to the team directly responsible for that network.

On top of their operational burden, the team needed to deliver new features to the frontend systems, which would be used by all Google services.

To make matters worse, their infrastructure was being migrated from an aging legacy framework and cluster management system to a better-supported replacement. The team clearly needed to combat this excessive pager load using a variety of techniques. The technical program manager and the people manager of the team approached senior management with a project proposal, which senior management reviewed and approved.

The team turned their full attention to reducing their pager load, and learned some valuable lessons along the way. The first step in tackling high pager load is to determine what is causing it. Pager load is influenced by three main factors: bugs in production, alerting, and human processes. Each of these factors has several inputs, some of which we discuss in more detail in this section.

No system is perfect. There will always be bugs in production: in your own code, the software and libraries that you build upon, or the interfaces between them. The bugs may not be causing paging alerts right now, but they are definitely present. Ideally, the SRE team and its partner developer teams should detect new bugs before they even make it into production.

In reality, automated testing misses many bugs, which are then launched to production. Software testing is a large topic that is well covered elsewhere. However, software testing techniques are particularly useful in reducing the number of bugs that reach production and the amount of time they remain there. One especially valuable technique is the ability to roll back a bad release quickly: this kind of rollback strategy requires predictable and frequent releases so that the cost of rolling back any one release is small. Minimizing the number of bugs in production not only reduces pager load, it also makes identifying and classifying new bugs easier.

Therefore, it is critical to remove production bugs from your systems as quickly as possible. Prioritize fixing existing bugs above delivering new features; if this requires cross-team collaboration, see SRE Engagement Model. Architectural or procedural gaps, such as missing automated health checking, self-healing, or load shedding, may need significant engineering work to resolve.

Chapter 3 of Site Reliability Engineering describes how error budgets are a useful way to manage the rate at which new bugs are released to production. The Connection team from our example adopted a strict policy requiring every outage to have a tracking bug.
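For readers less familiar with the mechanics, a short calculation makes the error-budget idea concrete; the 99.9% availability target below is an illustrative assumption, not a recommendation.

```python
# An availability target implies a budget of allowed unavailability, which in
# turn gates how much release risk a service can absorb.

SLO_TARGET = 0.999
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

error_budget_fraction = 1 - SLO_TARGET
error_budget_minutes = error_budget_fraction * MINUTES_PER_30_DAYS

print(f"{error_budget_minutes:.1f} minutes of downtime allowed per 30 days")
# -> 43.2 minutes of downtime allowed per 30 days
```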
