My notes on the SRE book
Foreword
I read this book and wrote these notes as part of an SRE book club that a colleague of mine helped set up. I have also included my own notes occasionally; those notes start with Ilmo or Ilmo's note. Hope you find these interesting!
Chapter 1: Introduction
- At their core, development teams want to launch new features and see them adopted by users.
- At their core, ops teams want to make sure the service doesn't break while they are holding the pager.
- Because most outages are caused by some kind of change (new config, new feature launch, new type of user traffic), the two teams' goals are fundamentally in tension.
- SRE is what happens when you ask a software engineer to design an operations team.
- Google places a 50% cap on the aggregate "ops" work for all SREs
- Includes tickets, on-call, manual tasks, etc.
- The cap ensures the team has enough time in their schedule to keep services stable and operable
- Cap is an upper bound
- Over time, in theory, SREs should end up with very little operational load and almost entirely engage in development tasks.
- In practice, scale and new features keep SREs on their toes.
- Remaining time can be spent on project/product work.
- Time spent on ops work is tracked
- Ilmo's note: For example, through a calendar setup, where everyone marks their hours as "meetings" and automation picks up how many hours have been used.
- Excess work is redirected until operation load drops below 50% once again.
- SREs should receive, on average, at most two events per 8-12-hour on-call shift
- If SREs consistently receive fewer than one event per shift, keeping them "on point" is a waste of time.
- SRE competes for the same candidates as the product development hiring pipeline
- Google sets the hiring bar so high in terms of coding and systems engineering skills that the pool is "necessarily" small.
- The SRE discipline is new (Ilmo's note: at the time of writing).
- Once an SRE team is in place, their potentially unorthodox approaches to service management require strong management support.
- Eg: decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by their management.
- DevOps or SRE?
- DevOps emerged in the industry in late 2008. In 2016, it was still in a "state of flux".
- "Its core principles—involvement of the IT function in each phase of a system's design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks—are consistent with many of SRE's principles and practices."
- "One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions."
- Tenets of SRE
- For their services, an SRE team is responsible for:
- Availability
- Latency
- Performance
- Efficiency
- Change management
- Monitoring
- Emergency response
- Capacity planning
- Postmortems should be written for all significant incidents, regardless of whether or not they triggered a page.
- Investigation should establish
- What happened (in detail)
- Find all root causes of the event
- Ilmo's note: there are no root causes, there are only contributing causes.
- Assign actions to correct the problem or improve how it is addressed next time
- Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.
- SLAs
- Motivation: 100% is the wrong reliability target for basically everything (pacemakers and anti-lock brakes excluded, notably).
- No user can tell the difference between 100% and 99.999% availability.
- What is the right reliability target for a system? Questions to consider:
- What level of availability will the users be happy with, given how they use the product?
- What alternatives are available to users who are dissatisfied with the product's availability?
- What happens to usersâ usage of the product at different availability levels?
- Once a target is established, error budget can be formed.
- Eg: A service that is 99.99% available is 0.01% unavailable. The 0.01% is the serviceâs error budget.
- We can spend the budget on anything we want, as long as we don't overspend it.
- Use of an error budget resolves the structural conflict of incentives between development and SRE.
- SREs no longer have to aim for "zero outages"; SREs and product developers can instead spend the error budget getting maximum feature velocity.
- An outage is no longer a bad thing. It's an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.
- Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR)
- Ilmo's note: Sometimes you also use MTTD, which means Mean Time To Detect.
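A tiny sketch (my own, not from the book) of the standard relation between these quantities; it shows why cutting MTTR, eg. with playbooks, buys availability just like making failures rarer:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time to failure and mean time to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A service that fails every 30 days and takes 1 hour to repair:
print(f"{availability(30 * 24, 1):.5%}")       # ~99.86%
# Cutting MTTR to 20 minutes (e.g. thanks to playbooks) improves availability:
print(f"{availability(30 * 24, 1 / 3):.5%}")   # ~99.95%
```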
- Google observed a 3x improvement in MTTR compared to improvisation when playbooks were introduced.
- Google found out that roughly 70% of outages are due to changes in a live system.
- Best practices thus remain:
- Implementing progressive rollouts
- Quickly and accurately being able to detect problems
- Rolling back changes safely when problems arise
- When doing capacity planning, it's important to remember these:
- Having an accurate organic demand forecast, which extends beyond the lead time required for acquiring capacity.
- An accurate incorporation of inorganic demand into the demand forecast.
- Ilmo's note: Wonder what this means? Self-inflicted traffic?
- Regular load testing.
- As capacity is critical to availability, it's natural to think SREs must also be in charge of capacity planning, which means they must also be in charge of provisioning.
- Resource use is a function of demand (load), capacity, and software efficiency.
- A slowdown in a service equates to a loss of capacity.
- SREs provision to meet a capacity target at a specific response speed. This is why observing service performance continuously matters!
Chapter 2: The production environment at Google, from the viewpoint of an SRE
- This chapter describes the infrastructure pieces that are used to run Google (at time of writing)
Chapter 3: Embracing risk
- Costliness has two dimensions
- Cost of redundant machine/compute resources
- The cost associated with redundant equipment that, for example, allows us to take systems offline for routine or unforeseen maintenance, or provides space for us to store parity code blocks that provide a minimum data durability guarantee.
- The opportunity cost
- Cost borne by an organization when it allocates engineering resources to build systems or features that diminish risk instead of features that are directly visible to or usable by end users. These engineers no longer work on new features and products for end users.
- SREs see risk as a continuum; they give equal importance to figuring out how to engineer greater reliability and to identifying appropriate levels of tolerance for the services they run.
- We want to exceed a given availability target (eg. 99.9%) but not by much.
- A target is both a minimum and a maximum.
- Measuring service risk
- Time-based availability
- availability = uptime / (uptime + downtime)
- Eg: system with an availability target of 99.99% can be down for up to 52.56 minutes in a year and stay within its availability target.
- Aggregate availability
- availability = successful requests / total requests
- Eg: system that serves 2.5M reqs/day with a daily availability target of 99.99% can serve up to 250 errors and still hit its target for that given day.
- These availability measures can be used to track a service's performance against its targets over a 1- or 7-day (Ilmo's note: or 14/30-day) window; both calculations are sketched below.
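A minimal sketch (mine) of both calculations, reusing the numbers from the examples above:

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def allowed_downtime_minutes(target: float) -> float:
    """Time-based: how long a service may be down per year while meeting its target."""
    return (1 - target) * MINUTES_PER_YEAR

def allowed_failed_requests(target: float, total_requests: int) -> int:
    """Aggregate: how many requests may fail while still meeting the target."""
    return round((1 - target) * total_requests)

print(allowed_downtime_minutes(0.9999))             # ~52.56 minutes per year
print(allowed_failed_requests(0.9999, 2_500_000))   # 250 errors allowed for that day
```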
- To identify risk tolerance of a service, SREs must work with the product owners to turn a set of business goals into explicit objectives to which we can engineer.
- While consumer services often have clear product owners, infra services don't have similar product ownership.
- Many factors to consider when assessing risk tolerance of a service:
- What level of availability is required?
- Do different types of failures have different effects on the service?
- How can we use the service cost to help locate a service on the risk continuum?
- What other service metrics are important to take into account?
- Things to consider when figuring out a target level of availability
- What level of service will the users expect?
- Does this service tie directly to revenue? (Either ours or customer revenue.)
- Is this a paid or free service?
- If there are competitors in the marketplace, what level of service do they provide?
- Is this service targeted at consumers or at enterprise?
- Example: For a typical Google Apps for Work service, there would be an external quarterly availability target of 99.9%, backed by a stronger internal availability target and a contract that stipulates penalties if we fail to deliver to the external target.
- Things to consider when determining cost
- What would be the increase in revenue for the proposed availability increase?
- Does the additional revenue offset the cost of reaching that level of reliability?
- Example:
- Proposed improvement in availability target: 99.9% → 99.99%
- Proposed increase in availability: 0.09%
- Service revenue: 1M USD
- Value of improved availability: 1M USD * 0.0009 = 900 USD
- Here, if the cost of improving availability by one nine is less than 900 USD, it is worth the investment; otherwise, it is not. A small sketch of this calculation follows below.
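A minimal sketch (mine) of the cost/benefit arithmetic above; the linear revenue-vs-availability relationship is the example's simplifying assumption:

```python
def value_of_availability_increase(revenue: float, current: float, proposed: float) -> float:
    """Extra revenue attributable to the availability improvement, assuming revenue
    scales linearly with availability (the example's simplifying assumption)."""
    return revenue * (proposed - current)

value = value_of_availability_increase(1_000_000, 0.999, 0.9999)
print(f"{value:.2f} USD")  # 900.00 USD: spend less than this to add the extra nine
```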
- Other service metrics
- Latency can be used to measure whether a service is unable to provide a response quickly enough.
- Eg: AdSense needs to serve responses quickly enough when inserting contextual ads.
- Measuring risk tells us the cost of what it takes to keep a service at a particular level.
- Helps us understand the lowest-cost solution that meets everyone's needs.
- Motivation for error budgets
- Typical factors that cause tension in software engineering
- Software fault tolerance
- How hardened do we make the software against unexpected events? What's enough?
- Testing
- Too much testing can cost us the entire market. Too little testing can result in incidents or bugs.
- Push frequency
- Every push is a risk. (Ilmo's note: Every push is a chance.) How much should we work on reducing that risk?
- Canary duration and size
- How long do we wait when canarying and how big is the canary?
- There is usually an informal balance between these factors.
- The more data-based this balance is made, the better.
- Forming an error budget
- Error budgets remove the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
- Product management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.
- The actual uptime is measured by a neutral third party: the monitoring system.
- The difference between these two numbers is the "budget" of how much "unreliability" is remaining for the quarter.
- As long as the measured uptime is above the SLO (in other words, as long as there is error budget remaining), new releases can be made.
- Eg: a service with an SLO of 99.999% of all queries succeeding per quarter has an error budget of 0.001% for that quarter. If a problem causes us to fail 0.0002% of the expected queries for the quarter, the problem spends 20% of the service's quarterly error budget. (A small sketch of this calculation follows after this list.)
- Benefits
- Error budgets can block deployments temporarily to pressure the team to focus on improving reliability.
- Could also introduce slow-down of releases or rolling deployments back when error budget is close to being used up.
- Error budgets can also guide epic planning, to allow more risk-taking.
- SREs must have authority to actually stop launches if the error budget runs out.
- Sometimes an SLO has to be loosened (thus increasing the error budget) to increase innovation.
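A small sketch (my own) of the error-budget arithmetic from the 99.999% example above; the 1-billion-queries figure is made up purely to make the percentages concrete:

```python
def error_budget_spent(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the quarterly error budget consumed so far."""
    budget = (1 - slo) * total_requests   # requests we are allowed to fail this quarter
    return failed_requests / budget

# SLO of 99.999% over an expected 1 billion queries this quarter;
# an incident fails 0.0002% of them (2,000 queries):
spent = error_budget_spent(0.99999, 1_000_000_000, 2_000)
print(f"{spent:.0%} of the quarterly error budget spent")  # 20%
if spent >= 1.0:
    print("Budget exhausted: stop feature launches until reliability improves.")
```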
Chapter 4: Service Level Objectives
- SLI (Service Level Indicator): a quantitative measure of some aspect of the level of service that is provided
- Most services consider request latency as a key SLI
- Availability is an important SLI type
- Fraction of the time that a service is usable
- Often defined in terms of the fraction of well-formed requests that succeed (sometimes called yield)
- 100% is not possible
- 99-99.999% is possible
- often referred to as "n nines" availability
- special case being the "three and a half nines", or 99.95%
- Other common SLI types
- Error rate (expressed as a fraction of all requests received)
- System throughput (eg. requests/s)
- Durability (for data storage systems)
- Measurements often aggregated, ie. collected over a measurement window and turned into a rate, average, or percentile.
- Ideally measures the service level directly
- Sometimes only proxy metrics are available
- SLO (Service Level Objective): target value or range of values for a service level that is measured by an SLI.
- Natural structure for SLO is thus SLI <= target, or lower bound <= SLI <= upper bound.
- Eg: our average search request latency should be less than 100 milliseconds (Ilmo's note: during a time window of 1/14/30/n days, where n factors into error budget renewal)
- Without an explicit SLO, users often develop their own beliefs about desired performance, which can be unrelated to the beliefs held by the people designing and operating the service
- Such a dynamic can lead to over-reliance on the service (believing the service is more available than it actually is) or to under-reliance (when prospective users believe a system is flakier and less reliable than it actually is)
- SLA (Service Level Agreement): an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
- Consequences are most easily recognized when they are financial (rebate or penalty), but can also take other forms.
- An easy way to tell the difference between an SLO and an SLA is to ask: "What happens if the SLOs aren't met?"
- If no explicit consequences, we are talking about an SLO.
- A real SLA violation can also result in a court case for breach of contract.
- SREs don't typically construct these because SLAs are business and product decisions.
- SREs do help avoid triggering the consequences of missed SLOs.
- They can also help define SLIs.
- Typically, a handful of indicators are enough to evaluate and reason about a system's health.
- For user-facing serving systems: availability, latency, and throughput
- For storage systems: latency, availability, and durability
- Questions to ask: How long does it take to read or write data? Can we access the data on demand? Is the data still there when we need it?
- For big data systems & data processing pipelines: throughput, end-to-end latency
- Some pipelines can have targets for latency on individual processing stages
- All systems should care about correctness: was the right answer returned, the right data retrieved, the right analysis done?
- Some systems should be instrumented with client-side collection
- Not measuring behavior at the client can miss a range of problems that affect users but don't affect server-side metrics.
- Most metrics are better thought of as distributions rather than averages.
- Using percentiles for indicators allows you to consider the shape of the distribution and its differing attributes
- 99th (or 99.9th) percentile shows you plausible worst-case value
- 50th percentile (median) emphasizes typical case
- The higher the variance in response times, the more the typical user experience is affected by long-tail behavior, an effect exacerbated at high load by queuing effects
- Some user studies have shown that people prefer a slightly slower system to one with high variance in response times
- Because of this, some SRE teams focus only on high percentile values. (See the sketch below for how averages can hide the tail.)
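A small illustration (mine, with synthetic latencies) of why the mean and the high percentiles tell different stories:

```python
import random

random.seed(42)
# Mostly fast requests with a heavy tail of slow ones (synthetic data).
latencies_ms = [random.uniform(20, 60) for _ in range(980)] + \
               [random.uniform(800, 2000) for _ in range(20)]

def percentile(values, p):
    """Nearest-rank percentile; good enough for illustration."""
    ordered = sorted(values)
    index = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[index]

print(f"mean = {sum(latencies_ms) / len(latencies_ms):.1f} ms")  # looks tolerable, hides the tail
print(f"p50  = {percentile(latencies_ms, 50):.1f} ms")           # typical user experience
print(f"p99  = {percentile(latencies_ms, 99):.1f} ms")           # plausible worst case
```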
- Standardizing indicators
- Aggregation intervals: "Averaged over 1 minute"
- Aggregation regions: "All the tasks in a cluster"
- How frequently measurements are made: "Every 10 seconds"
- Which requests are included: "HTTP GETs from black-box monitoring jobs"
- How the data is acquired: "Through our monitoring, measured at the server"
- Data-access latency: "Time to last byte"
- It's useful to build reusable SLI templates for each common metric (a small template sketch follows below)
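A sketch of what such a reusable SLI template could look like; the field names and the example service are my own invention, not anything Google prescribes:

```python
from dataclasses import dataclass

@dataclass
class LatencySLITemplate:
    """Reusable definition of a latency SLI; instantiate once per service."""
    aggregation_interval: str = "1 minute"
    aggregation_region: str = "all tasks in the cluster"
    measurement_frequency: str = "every 10 seconds"
    included_requests: str = "HTTP GETs from black-box monitoring jobs"
    data_source: str = "our monitoring, measured at the server"
    latency_definition: str = "time to last byte"

# Per-service instances only override what differs from the template defaults.
shakespeare_frontend_sli = LatencySLITemplate(included_requests="HTTP GETs from real users")
print(shakespeare_frontend_sli)
```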
- When carving out SLOs, find out what your users care about, not what you can measure.
- Often what users care about the most can't be measured pragmatically, so you end up approximating that need.
- To end up with useful SLOs, work your way from desired objectives backward to specific indicators. This way, you avoid measuring unnecessary metrics.
- SLOs should specify how they're measured and the conditions under which they're valid.
- Eg: 99% of GET RPC calls, averaged over 1 minute, will complete in less than 100 ms (measured across all the backend servers)
- Eg: 99% of GET RPC calls will complete in less than 100 ms
- Sometimes you need multi-targets, eg:
- 90% of GET RPC calls will complete in less than 1 ms
- 99% of GET RPC calls will complete in less than 10 ms
- 99.9% of GET RPC calls will complete in less than 100 ms
- If a service has heterogeneous workloads, having separate objectives for each class of workload makes sense (see the sketch after these examples):
- Eg: 95% of throughput clients' Set RPC calls will complete in < 1 s
- Eg: 99% of latency clients' Set RPC calls with payloads < 1 kB will complete in < 10 ms
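A minimal sketch (mine) of checking multi-target latency SLOs like the ones above against a batch of observed latencies:

```python
# Each target: (required fraction of requests, max latency in ms), as in the GET RPC example.
SLO_TARGETS = [(0.90, 1), (0.99, 10), (0.999, 100)]

def slo_compliance(latencies_ms, targets):
    """Return, per target, the fraction of requests under the threshold and whether it passes."""
    total = len(latencies_ms)
    results = []
    for required_fraction, threshold_ms in targets:
        fraction_ok = sum(1 for l in latencies_ms if l <= threshold_ms) / total
        results.append((threshold_ms, fraction_ok, fraction_ok >= required_fraction))
    return results

sample = [0.4, 0.7, 2.1, 0.9, 5.0, 0.6, 80.0, 0.8, 0.5, 0.3]   # made-up latencies
for threshold, fraction, ok in slo_compliance(sample, SLO_TARGETS):
    print(f"<= {threshold} ms: {fraction:.1%} ({'meets' if ok else 'misses'} target)")
```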
- It's better to allow an error budget and track that on a daily or weekly basis.
- Upper management might be interested in monthly or quarterly assessment.
- After all, an error budget is an SLO for meeting other SLOs.
- Tips for choosing targets for an SLO
- Don't pick a target based on current performance
- Without reflection you may end up supporting a system that requires heroic efforts to meet its targets
- Keep it simple
- Complicated aggregations in SLIs can obscure changes to system performance and are also hard to reason about
- Avoid absolutes (Ilmo's note: YAGNI)
- You don't need to design a system that can scale its load infinitely without any latency increase and that is always available.
- Have as few SLOs as possible
- Choose just enough SLOs to provide good coverage of your system's attributes
- Defend the SLOs you pick.
- Perfection can wait
- SLO definitions and targets can always be refined over time
- It's better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it's unattainable.
- Poorly thought-out SLOs can result in wasted work if a team uses heroic efforts to meet an overly aggressive SLO
- A too-lax SLO can result in a bad product.
- SLIs and SLOs can help prevent bigger incidents
- Eg: If we observe that request latency SLI is increasing and will miss the SLO in a few hours unless something is done, we can take preventative action.
- Set realistic expectations
- Keep a safety margin
- Use internal SLOs (eg: 99.5%) and external SLOs (eg: 99%)
- Donât overachieve
- Users can become accustomed to a service that performs too well
- You can avoid overdependence by deliberately taking the system offline occasionally, throttling some requests, or designing the system so that it isn't faster under light loads.
- SLAs should be conservative in what they advertise to users
- The broader the constituency, the harder it is to change or delete SLAs that prove to be unwise or difficult to work with.
Chapter 5: Eliminating toil
- Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
- Following descriptions may describe toil
- Manual work, for example manually running a script that automates some task (the hands-on time is still toil)
- Repetitive work that could have been run by a machine instead
- If human judgment is essential for the task, it might not be toil.
- Tactical work (work that is interrupt-driven and reactive)
- Handling pager alerts is toil
- If the service remains in the same state after you have finished a task, it was probably toil.
- Work that scales linearly with service growth
- Eg: service size, traffic volume, user count
- Less toil is better
- Feature development typically focuses on improving reliability, performance, or utilization, often reducing toil as a second-order effect.
- Calculating toil
- Typical SRE has one week of primary on-call, one week of secondary on-call in each cycle
- Eg: In a 6-person rotation, at least 2 of every 6 weeks are dedicated to on-call shifts and interrupt handling => lower bound on potential toil is 2/6 = 33%
- Eg: In an 8-person rotation, the lower bound is 2/8 = 25%
- Top source of toil is interrupts
- Non-urgent service-related messages and emails
- Next: on-call response
- Releasing
- When individual SREs report excessive toil, often an indication for managers to spread toil load more evenly across the team
- Possibly also to encourage those SREs to find satisfying engineering projects
- Typical SRE activities fall into categories
- Software engineering
- Systems engineering
- Configuring production systems, modifying configurations, setting up monitoring, setting up/configuring load balancing, server configuration
- Toil
- Overhead
- Admin work not tied directly to running a service.
- Every SRE needs to spend at least 50% of their time on engineering work (when averaged over a few quarters or a year, toil tends to be spiky)
- Is toil always bad?
- Some people are unaffected by it, especially if it comes in small amounts.
- Some people may gravitate towards toil.
- Toil becomes toxic when experienced in large quantities.
- You should complain loudly if you are in this situation.
- Doing too much toil can lead to:
- Career stagnation if you spend too little time on projects
- Low morale; burnout, boredom, and/or discontent.
- From the organizationâs perspective
- May create confusion (especially if it's communicated that SREs participate in engineering work)
- Slows progress
- Sets precedent for others to load you with more toil (sometimes shifting responsibility from Devs to SRE)
- Promotes attrition (your teammates might want you to work on engineering work even if you don't want to)
- Causes breach of faith if new hires were promised project work
Chapter 6: Monitoring distributed systems
Terminology
- Monitoring
- Collecting, processing, aggregating, displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times and server lifetimes.
- White-box monitoring
- Monitoring based on metrics exposed by the internals of the system, including logs, interfaces (like JVM profiling interface), or an HTTP handler that emits internal statistics
- Black-box monitoring
- Testing externally visible behavior as a user would see it.
- Dashboard
- Summary view of a service's core metrics.
- Alert
- A notification intended to be read by a human and that is pushed to a system such as a bug or ticket queue, an email alias, or a pager. Respectively, these alerts are classified as tickets, email alerts, and pages.
- Root cause
- A defect in a software or human system that, if repaired, instills confidence that this event won't happen again in the same way. A given incident might have multiple root causes: for example, perhaps it was caused by a combination of insufficient process automation, software that crashed on bogus input, and insufficient testing of the script used to generate the configuration.
- Node and machine
- A single instance of a running kernel in either a physical server, virtual machine, or container.
- Push
- Any change to a serviceâs running software or its configuration.
Why monitor?
- Analyzing long-term trends
- Comparing over time or experiment groups
- Alerting
- Building dashboards
- Should answer basic questions about your service, and normally include some form of the four golden signals.
- Conducting ad hoc retrospective analysis
You should never trigger an alert simply because "something seems a bit weird"
- Exception: security auditing on very narrowly scoped components of a system.
When pages occur too frequently, employees second-guess, skim, or even ignore incoming alerts, sometimes even ignoring a "real" page that's masked by noise.
Setting reasonable expectations for monitoring
- Avoid "magic" systems that try to learn thresholds or automatically detect causality.
- Rules that detect unexpected changes in end-user request rates are a counter-example.
Complex dependency hierarchies have limited success
- Eg: "If I know the database is slow, alert for a slow database; otherwise, alert for the website being generally slow."
- Only works for very stable parts of a system
- Eg: "If a datacenter is drained, then don't alert me on its latency"
- Common datacenter alerting rule :+1::skin-tone-2:
To keep noise low and signal high, elements of your monitoring system that direct to a pager need to be very simple and robust.
- Rules that generate alerts for humans should be simple to understand and represent a clear failure.
Symptoms vs causes
- Your monitoring system should address two questions
- What's broken?
- Eg: "Private content is world-readable"
- Why is it broken?
- Eg: "A new software push caused ACLs to be forgotten and allowed all requests"
- What vs why is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
If web servers seem slow on DB-heavy requests, you need to know both how fast the web server perceives the database to be, and how fast the DB believes itself to be.
Four golden signals
- Latency
- Time it takes to service a request
- Important to distinguish between the latency of successful requests and the latency of failed requests
- Eg: an HTTP 500 error triggered due to loss of connection to the DB might be served very quickly, but an HTTP 500 indicates a failed request, so factoring 500s into overall latency might result in misleading calculations.
- Increases in latency are often a leading indicator of saturation
- Measuring 99th percentile response time over a small window can give a very early signal of saturation
- Traffic
- Measure of how much demand is being placed on your system, measured in a high-level system-specific metric. Usually req/s (perhaps broken out by the nature of the requests, eg. static vs dynamic content).
- Errors
- Rate of requests that fail, either explicitly (500), implicitly (200 OK, but coupled with the wrong content), or by policy ("if you commit to 1 s response times, any request over 1 s is an error")
- Sometimes you have to measure partial failure modes (can be drastically different to measure these)
- Saturation
- How "full" your service is.
- Measure of your system fraction, emphasizing the resources that are most constrained
- Eg: memory, IO
- Note that many systems degrade before they achieve 100% utilization, so having a utilization target is essential.
- Saturation here is also concerned with predictions of impending saturation
- Eg: "the database will fill its hard drive in 4 hours"
If you measure all of the four golden signals and page a human when one signal is problematic (or reaching problematic levels in the case of saturation), your service will be at least decently covered by monitoring.
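A toy sketch (my own) of a paging decision driven by the four golden signals; every threshold here is invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    p99_latency_ms: float       # latency of successful requests
    requests_per_second: float  # traffic, mostly kept for context/capacity
    error_rate: float           # fraction of failed requests
    saturation: float           # fraction of the most constrained resource in use

def should_page(s: GoldenSignals) -> bool:
    """Page when any signal is problematic; saturation pages before it reaches 100%."""
    return (
        s.p99_latency_ms > 500      # invented thresholds, tune per service
        or s.error_rate > 0.01
        or s.saturation > 0.85
    )

print(should_page(GoldenSignals(120, 800, 0.002, 0.60)))  # False: all signals healthy
print(should_page(GoldenSignals(120, 800, 0.002, 0.92)))  # True: nearly saturated
```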
Worrying about tail
- Simplest way to differentiate between a slow average and a very slow "tail" of requests => collect request counts bucketed by latencies (suitable for rendering a histogram) rather than actual latencies (see the sketch after the example below)
- Eg: how many requests did I serve that took between 0 ms and 10 ms, between 10 ms and 30 ms, between 30 ms and 100 ms, between 100 ms and 300 ms, etc.
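A small sketch (mine) of bucketing latencies into a histogram instead of keeping raw values; the bucket boundaries mirror the example above:

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the latency buckets, in ms (the last bucket is open-ended).
BUCKET_BOUNDS_MS = [10, 30, 100, 300, 1000]

def bucket_label(latency_ms: float) -> str:
    i = bisect_left(BUCKET_BOUNDS_MS, latency_ms)
    if i == len(BUCKET_BOUNDS_MS):
        return f"> {BUCKET_BOUNDS_MS[-1]} ms"
    lower = 0 if i == 0 else BUCKET_BOUNDS_MS[i - 1]
    return f"{lower}-{BUCKET_BOUNDS_MS[i]} ms"

histogram = Counter()
for latency in [4, 12, 25, 95, 110, 280, 7, 950, 3100]:   # observed latencies (made up)
    histogram[bucket_label(latency)] += 1

for label, count in histogram.items():
    print(f"{label}: {count}")
```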
Choosing an appropriate resolution for measurements
- Observing CPU load over the time span of a minute won't reveal even quite long-lived spikes that drive high tail latencies
- Collecting per-second measurements of CPU load might yield interesting data, but may be expensive to collect, store, and analyze
- For a service with a 99.9% SLA (9 hours of aggregate downtime per year), probing for a 200 OK more than once or twice a minute is probably unnecessarily frequent
- Checking hard drive fullness for a service targeting 99.9% availability more than once every 1-2 minutes is unnecessary
If your monitoring goal is high resolution but doesn't require extremely low latency, you can reduce costs by performing internal sampling on the server, then configuring an external system to collect and aggregate that distribution over time or across services, eg (sketched below):
- Record current CPU utilization each second
- Using buckets of 5% granularity, increment the appropriate CPU utilization bucket each second
- Aggregate those values every minute
- This allows us to observe brief CPU hotspots without incurring very high cost due to collection and retention.
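A rough sketch (my own) of that sampling-and-aggregation scheme; read_cpu_utilization() is a stand-in for whatever the host actually exposes:

```python
import random
import time
from collections import Counter

def read_cpu_utilization() -> float:
    """Stand-in for a real probe (e.g. derived from /proc/stat); returns 0.0-1.0."""
    return random.random()

def sample_one_minute(seconds: int = 60) -> Counter:
    """Record CPU utilization each second into 5%-wide buckets; ship the per-minute
    histogram to an external collector instead of 60 raw samples."""
    buckets = Counter()
    for _ in range(seconds):
        percent = read_cpu_utilization() * 100
        bucket = min(int(percent // 5), 19)   # 20 buckets: 0-5%, 5-10%, ..., 95-100%
        buckets[bucket] += 1
        time.sleep(1)                         # sample once per second
    return buckets

if __name__ == "__main__":
    histogram = sample_one_minute(seconds=5)  # shortened run for the example
    for bucket, count in sorted(histogram.items()):
        print(f"{bucket * 5}-{bucket * 5 + 5}%: {count} s")
```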
You can end up with a complex monitoring system, eg:
- Alerts on different latency thresholds, at different percentiles, on all kinds of different metrics
- Extra code to detect and expose possible causes
- Associated dashboards for each of these possible causes
Better to design a simple system, using guidelines:
- Rules that catch real incidents most often should be simple, predictable, reliable (as much as possible).
- Data collection, aggregation, alerting configuration that is rarely exercised should be up for removal.
- Signals that are collected but not exposed in any prebaked dashboard nor used by any alert are candidates for removal.
In Google's experience, basic collection and aggregation of metrics, paired with alerting and dashboards, has worked well as a relatively standalone system.
When creating rules for monitoring and alerting, following rules help:
- Does this rule detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?
- Will I ever be able to ignore this alert, knowing it's benign? When and why will I be able to ignore this alert, and how can I avoid this scenario?
- Does this alert definitely indicate that users are being negatively affected? Are there detectable cases in which users aren't being negatively impacted, such as drained traffic or test deployments, that should be filtered out?
- Can I take action in response to this alert? Is that action urgent, or could it wait until morning? Could the action be safely automated? Will that action be a long-term fix, or just a short-term workaround?
- Are other people getting paged for this issue, therefore rendering at least one of the pages unnecessary?
On pages, some rules:
- Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
- Every page should be actionable.
- Every page response should require intelligence. If a page merely merits a robotic response, it shouldn't be a page.
- Pages with rote, algorithmic responses should be a red flag.
- Pages should be about a novel problem or an event that hasn't been seen before.
It's better to spend much more effort on catching symptoms than causes
- When it comes to catching causes, only worry about very definite, very imminent causes.
Case Bigtable, highlights:
Email alerts were triggered as the SLO approached and paging alerts were triggered when the SLO was exceeded.
To remedy the situation as described in the chapter, team used a three-pronged approach
- Made great efforts to improve the performance of Bigtable.
- Temporarily dialled back the SLO target, using 75th percentile request latency.
- Disabled email alerts.
Often sheer force of effort can help a rickety system achieve high availability
- Better to have a short-term decrease in availability (even if painful), as itâs a strategic trade for the long-run stability of the system.
Periodic reviews are done on page frequency (incidents per shift, typically a couple of pages per shift)
- Reviewed together with management in quarterly reports
Chapter 7: The Evolution of Automation at Google
- Doing automation thoughtlessly can create as many problems as it solves
- Software-based automation is superior to manual operation most of the time (highlight my own)
- Consistency
- Very few will have the rigour to act with equal consistency every time an action is performed
- Platform
- Can be extended to more systems(???) or spun out for profit (????)
- Centralizes mistakes
- Faster repairs
- May lead to reduced MTTR, given that automation is run regularly and successfully enough
- Faster action
- Machines react more quickly
- Eg. failovers, traffic switching
- Ilmo: one counterexample is rollbacks (harder to come up with an automatable path)
- Time saving
- Re: writing automation. Decoupling the operator from the operation is very powerful.
- Joseph Bironas, a Google SRE: "If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings."
- Value for Google SRE
- Strong bias towards automation because of planet-spanning scale and the need to reduce manual ops.
- Google vendors its upstream dependencies aggressively to achieve better automation.
- Controlling entire stack desirable
- Ilmo: they seem to be self-aware of the fact that not everyone is a Google by saying "not everyone has the ability or the inclination to develop automation at a particular time".
- Use cases for automation
- Automation is "meta-software": software to act on software
- Google SRE's Use Cases for Automation
- Affinity for running infra as opposed to managing quality of data that passes over infra
- If, eg, 50% of the data dissipates, coarse-grained alerts fire
- A Hierarchy of Automation Classes
- Better have a system that needs no glue logic at all
- Eg. turnup automation (???)
- Infrequently run automation is fragile
- Evolution of automation
- No automation
- Externally maintained system-specific automation (failover script on an SRE's local machine)
- Externally maintained generic automation (more generic, documented for everyone)
- Internally maintained system-specific automation (same but versioned to a repo of the system)
- Systems that don't need any automation (Ilmo: this is an oxymoron)
- Internal case study: ALL
- They defined their error budget via their SLA and decided to automate failover, and go even further than that.
- Failovers were automated so an outage no longer paged a human.
- Total cost of operational maintenance of the Ads Database dropped by ~95%. Up to 60% of hardware utilization was freed.
- Internal case study: Cluster turnups
- Early automation was an initial win, but free-form scripts became a cholesterol of technical debt
- Prodtest (Production Test)
- The Python unit test framework was extended to allow for unit testing of real-world services.
- Unit tests have dependencies, allowing a chain of tests, and a failure in one test would quickly abort.
- A given team's Prodtest was given the cluster name, and it could validate the team's services within that cluster (???)
- Later additions allowed them to generate a graph of the unit tests and their states.
- This allowed an engineer to see quickly if the service was correctly configured in all clusters, and if not, why.
- Resolving inconsistencies idempotently
- Going from "Network works, and machines are listed in the database" → "Serving 1% of websearch and ads traffic" with suitable scripts
- Was not perfect; process of fix verification was flaky because of latency between the test, a fix, and a second test.
- Inclination to specialize
- Automation varies in 3 respects
- Competence (accuracy)
- Latency (when executing steps)
- Relevance (how well the real world is covered)
- Relieving teams from ops responsibility
- No incentive to reduce tech debt for a given service
- Product manager whose schedule not affected by low quality automation will always prioritize new features
- Reliability is the fundamental feature
- "Analogous discussions about the impact of automation in the noncomputer domain—for example, in airplane flight or industrial applications—often point out the downside of highly effective automation: human operators are progressively more relieved of useful direct contact with the system as the automation covers more and more daily activities over time. Inevitably, then, a situation arises in which the automation fails, and the humans are now unable to successfully operate the system. The fluidity of their reactions has been lost due to lack of practice, and their mental models of what the system should be doing no longer reflect the reality of what it is doing. This situation arises more when the system is nonautonomous—i.e., where automation replaces manual actions, and the manual actions are presumed to be always performable and available just as they were before. Sadly, over time, this ultimately becomes false: those manual actions are not always performable because the functionality to permit them no longer exists."
- Recommendations
- You don't have to be Google-scale to do automation at all
- Ilmo: Google-scale usually means that automation is done to the extreme, and this is typically off-putting to some, for good reason. Getting a "let's not automate anything" response is not something I've often, if ever, heard.
Chapter 8: Release Engineering
- High velocity
- Some teams perform hourly builds and select which version to deploy to prod based on a pool of builds (typically, when CI is green)
- Other teams have adopted a "push on green" release model
- Hermetic builds
- Insensitive to libraries and other software installed on the build machine
- Enforcement of policies and procedures
- Ilmo: In my current org, we do this partially already with OPA/Rego and Kyverno.
- CD
- Packaging
- Packages are named (eg. Search/shakespeare/frontend), versioned with a unique hash, and signed
- It's not just for Googlers
- Same problems everywhere
- How is versioning handled for packages?
- CI or CD? Periodic builds?
- Release how often?
- What config management policies should one use?
- Release metrics (Ilmo: This is a must for every team.)
- Start release engineering at the beginning
- Teams should budget for release engineering resources at the beginning of product dev cycle
- Cheaper now than later
- SREs, devs, and release engineers work together.
- Ilmo: POs can be involved too.
Chapter 9: Simplicity
- "At the end of the day, our job is to keep agility and stability in balance in the system."
- System stability versus agility
- Exploratory coding
- Code that has a shelf life, written to try to understand the system
- I won't give up my code
- Commented out code is an anti-pattern
- Forever-gated code is an antipattern
- Flag toggles should be rehearsed actively
- Part III - Practices
- No monitoring: youâre blind.
- SREs don't go on-call for the sake of it: it's part of achieving the larger mission and remaining in touch with our distributed systems, by learning how they work and fail. If paging could be dropped, it would be dropped.
- Managing incidents effectively should reduce their impact and limit outage-induced anxiety.
- Building blameless postmortem culture is the first step in understanding what went wrong/right.
Chapter 10: Practical Alerting
- Monitoring a very large system is challenging
- With a large distr. system, there is a great number of components to analyze
- There also becomes a need to maintain a low maintenance burden
- On time-series monitoring
- Conceptually, this is a 1-dimensional matrix of numbers, progressing through time
- Adding permutations of labels, this matrix becomes multidimensional
- Book mentions Borgmon, a Google insider product
- A programmable calculator with some syntactic sugar that enables it to generate alerts.
- It uses a common data exposition format (Ilmo's remark: not entirely unlike the Prometheus syntax)
- Chapter then goes on to explain different ways of measuring with time-series data
- Counter
- Any monotonically non-decreasing variable with which we can measure increasing values, eg. # of km driven
- Usage of counters is encouraged because they don't lose meaning when events occur between sampling.
- Gauge
- Any value; doesn't have to be monotonically shifting, eg. amount of fuel remaining, current speed (a small sketch of both types follows below)
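A bare-bones sketch (mine; not Borgmon or Prometheus code) of the difference between the two types:

```python
class Counter:
    """Monotonically non-decreasing; rates are derived from it at query time."""
    def __init__(self):
        self.value = 0
    def inc(self, amount: int = 1):
        if amount < 0:
            raise ValueError("counters never go down")
        self.value += amount

class Gauge:
    """A value that can move in either direction; read it as-is."""
    def __init__(self):
        self.value = 0.0
    def set(self, value: float):
        self.value = value

http_requests_total = Counter()   # e.g. incremented on every request served
fuel_remaining_litres = Gauge()   # e.g. set from the latest sensor reading

http_requests_total.inc()
fuel_remaining_litres.set(42.5)
print(http_requests_total.value, fuel_remaining_litres.value)
```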
- On labels
- There are multiple uses for labels
- They define breakdowns of data itself
- They define the source of data, eg. service name or container name
- They indicate locality or aggregation of data within the service as a whole, eg. zone, shard, etc.
- There are multiple uses for labels
- Alertmanager is mentioned
- Can be configured to a) inhibit certain alerts when others are active, b) deduplicate alerts from multiple instances with same labelsets, and c) fan-in or fan-out alerts based on their labelsets when multiple alerts with similar labelsets fire.
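A toy sketch (my own, not the actual Alertmanager logic) of deduplicating alerts that share the same labelset:

```python
from collections import defaultdict

incoming_alerts = [
    {"alertname": "HighLatency", "service": "shakespeare", "zone": "eu-west"},
    {"alertname": "HighLatency", "service": "shakespeare", "zone": "eu-west"},  # duplicate
    {"alertname": "HighErrorRate", "service": "shakespeare", "zone": "eu-west"},
]

def deduplicate(alerts):
    """Group alerts by their full labelset so identical ones are notified only once."""
    grouped = defaultdict(int)
    for alert in alerts:
        labelset = tuple(sorted(alert.items()))
        grouped[labelset] += 1
    return grouped

for labelset, count in deduplicate(incoming_alerts).items():
    print(dict(labelset), f"(seen {count} times, notify once)")
```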
- Chapter underlines the importance of sending page-worthy alerts to an on-call rotation while keeping non-page-worthy alerts in a separate processing queue (or as informational data)
- Ilmo's remark: This is highlighted in a later chapter as well. Seems like it's an Important Detail™.
- White-box monitoring vs black-box monitoring
- Borgmon (Prometheus?) is white-box monitoring
- Black-box monitoring is about looking at the system from the outside with no details about the innards
- A good complement to white-box monitoring
- Ilmo's remark: We use Checkly for this type of monitoring.
Chapter 11: Being On-Call
- On SRE work
- Capping the amount of time that SREs spend on operational work is highlighted throughout this chapter
- Chapter goes on to say that at Google engineers spend at most 50% of their time on operational work
- At minimum, 50% of engineering work should "further scale the impact of the team through automation, in addition to improving the service"
- On-call, the main topic in this chapter
- On-call is about being available to step in to handle a problem, reacting within a specific amount of time (minutes or hours depending on the SLA)
- When a page is received during a shift, it needs to be acknowledged, triaged, and potentially escalated.
- Chapter again highlights that non-paging events are less urgent than paging events, but also mentions that on-call engineers should vet non-paging events during business hours (Ilmo's remark: sounds reasonable.)
- Chapter introduces the concepts of primary and secondary roles, which solve the following problems:
- Fall-through for primary
- Handle non-paging events
- Ilmo's remark: steps in to help with solving the incident if primary escalates.
- Some metrics
- Quantity of on-call: % of time spent on-call
- Quality of on-call: # of incidents during on-call
- SRE managers have a duty to keep these two balanced
- Ilmo's remark: … but how?
- Chapter goes on to talk about the 24/7 rotation and how to achieve that using the percentage capping as a baseline.
- Ilmo's remark: example assumed week-long shifts, but I would assume that we use sprint-long shifts. Otherwise, it's hard to estimate which normal engineering stories will get done in any given sprint.
- Chapter also introduces concept of multi-site on-call, using multiple teams basically.
- The rationale is sound: night shifts degrade people physically. It's easier to "follow the sun" by having people take the lead in eastern timezones etc.
- Ilmo's remark: this is easy when you have 10k employees. Not so easy when you have just one person in Taiwan, for example :stuck_out_tongue:
- This is worth it if there's already enough on-call work.
- Caveat: comes with noticeable overhead wrt communication and baton-passing etc.
- In this chapter, there's a brief mention of postmortems; it says Google's typical time for post-incident activities (root-cause analysis, remediation, and follow-up) is about 6 hours
- Book derives from this number the max # of incidents that should occur per shift.
- On compensation models
- They need to exist, be adequate
- Can give time off, or
- Straight up cash, or
- Capped at some proportion of overall salary
- Incentivizes involvement in on-call
- Promotes a balanced on-call distribution and limits drawbacks of excessive on-call work, eg. burnout or inadequate time for work
- Ilmo's remark: but again, this might fit better for larger orgs than small engineering orgs
- Chapter highlights the importance of trying to reduce stress and highlights two modes of thinking when facing challenges:
- Intuitive, automatic, rapid action
- Rational, focused, deliberate cognitive functions
- The latter leads to better results and to well-planned incident handling
- Chapter highlights the most important on-call resources:
- Clear escalation paths
- Well-defined incident management procedures
- Blameless postmortem culture
- This is a must.
- On overloadedness
- You can loan out an SRE to an overloaded team
- You can and should(!) measure symptoms of an overloaded team, eg. # of daily tickets < 5, paging events per shift < 2 etc.
- All paging events should be actionable.
- Silencing noisy non-actionable alerts can help with this.
- If there is more than one alert for one incident, the team should strive to tweak the alerts to approach a 1:1 alert/incident ratio.
- Ilmo's remark: This essentially means we'd have to tune or merge our alerts so they don't correspond 1:1 with SLOs.
- If there are too many pages occurring, it's always possible to give the pager to the developers owning those services and work with them to reduce alerts. This effectively means development of features halts until the quality of alerts is back up to the SRE team's standards.
- Operational underload
- SRE teams should be sized to allow every engineer to be on-call once or twice a quarter, to expose developers to production systems.
- Ilmo's remark: or how about empowering developers to always take care of their production systems? In smaller organisations that are using cloud as their infrastructure, it should already be possible.
- Wheel of misfortune may help with honing SRE capabilities.
Chapter 12: Effective Troubleshooting
- Troubleshooting is application of hypothetico-deductive method
- Ie. iterate on hypotheses until one holds
- Troubleshooting is learnable
- Ideally, problem report tells us the top-level problem
- We then start drilling down into system telemetry, logs, etc. to narrow down the culprits, exclude various parts of the system (maybe by using bisection as a tactic; see the sketch below), and identify contributing factors
- Ilmo's remark: book keeps using the root cause term, but we are already better than that :wink:
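A small illustration (mine) of bisection as a tactic: binary-searching an ordered list of releases (or config changes) for the first bad one, assuming some is_healthy check exists and at least one release is broken:

```python
def first_bad_release(releases, is_healthy):
    """Binary search for the first release where is_healthy() starts failing.
    Assumes releases are ordered and that once broken, every later one is broken too."""
    low, high = 0, len(releases) - 1
    while low < high:
        mid = (low + high) // 2
        if is_healthy(releases[mid]):
            low = mid + 1      # breakage happened after mid
        else:
            high = mid         # mid is broken; the culprit is mid or earlier
    return releases[low]

releases = ["v1.0", "v1.1", "v1.2", "v1.3", "v1.4", "v1.5"]
broken_from = "v1.3"                      # pretend v1.3 introduced the fault
is_healthy = lambda r: r < broken_from    # stand-in for a real health check
print(first_bad_release(releases, is_healthy))  # v1.3
```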
- Couple ways to test hypotheses
- Compare the observed state against theories to find (un)confirming evidence
- Treat the system: change the system in a controlled way and observe
- Some common troubleshooting pitfalls
- Looking at irrelevant symptoms; wild goose chases
- Misunderstanding the system dynamics (inputs, behavior, etc.)
- Coming up with wildly improbable theories
- Hunting down spurious correlations, eg. coincidences or events correlated due to shared causes
- One should always prefer simple explanations
- Ilmo's middle remark: Using the four golden signals works as a way to build that simple explanation.
- An effective report contains
- Expected behavior
- Reproduction steps
- Consistent form
- Exists in a system where searching is possible
- Book gives a great tip about ignoring the instinct to start investigating the "root cause" as quickly as possible, and tells you to make the software work first.
- Stopping the bleeding is super important.
- The book does, however, advise you to preserve earlier evidence of the incident for later investigation.
- Chapter then goes on to talk about tracing, trace IDs, and observability
- Main points
- Structured logs are important in building tools to gain retrospective analysis powers
- Itâs important to pass around Trace IDs using a common standard
- Ilmo's remark: if you push this everywhere and always pass it to the frontend in the request response, incident analysis gets a whole lot easier for certain types of production bugs/incidents. (A tiny propagation sketch follows after this list.)
- Designing systems with well-understood and observable interfaces between components makes for an easier troubleshooting session
- Ilmo's remark: Don't make it hard to guess what's happening inside a system. Develop using observability-driven engineering.
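A tiny sketch (mine; the header name and helper are invented for illustration) of propagating a trace ID from an incoming request into logs and back out in the response:

```python
import logging
import uuid

TRACE_HEADER = "X-Trace-Id"   # hypothetical header name; real setups often use W3C traceparent

def handle_request(headers: dict) -> dict:
    # Reuse the caller's trace ID if present, otherwise start a new trace.
    trace_id = headers.get(TRACE_HEADER, uuid.uuid4().hex)

    # Every log line for this request carries the trace ID.
    logging.info("serving request", extra={"trace_id": trace_id})

    # Pass the ID to downstream calls and echo it back to the frontend,
    # so a user-reported incident can be tied to the exact backend trace.
    return {TRACE_HEADER: trace_id, "Content-Type": "application/json"}

logging.basicConfig(level=logging.INFO)
print(handle_request({}))
```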
Chapter 13: Emergency response
- At the beginning, chapter underlines the importance of not panicking when systems start breaking.
- Typically these are true for an incident
- Youâre not alone
- Sky is not falling
- Nobody is dying, maybe
- Should you start to feel overwhelmed, you can always pull in more people.
- Sometimes everyone has to be paged. (Ilmo's note: these are sometimes called Major Incidents.)
- It's important to follow an incident response process
- Ilmo's note: In my current org, this is what we already have with our Incident protocol and I find it to be Good™. There are also heavier standards & processes defined in eg. ITIL… we probably don't want that level of process just yet :wink:
- Test-induced emergency
- These are planned, proactive ways to break production
- Failures are controlled using Science and are aborted when things go wrong
- Book then goes on to cover a real-life example of things going awry
- They learned that
- their review hadn't been good enough (despite many pairs of eyes having looked at it), but that's because nobody really understood how two systems interacted with each other when it came to a particular interaction.
- incident response process had not been followed, process would have ensured wider awareness of the incident
- rollback procedures had not been rehearsed on test env, turns out the procedures were broken!
- => rolling back is now tested before a large-scale test
- Change-induced emergency
- This is just a regular incident where the incident stems from our own deployment/configuration changes.
- Process-induced emergency
- This is Google's fancy term for an incident that is caused by a process (typically automated, but not necessarily) that then wreaks havoc.
- Eg: automation that accidentally wipes out all hard drives from each machine
- "All problems have solutions"
- One of the greatest Google lessons: "…a solution exists, even if it may not be obvious, especially to the person whose pager is screaming."
- If you can't think of a solution, dig deeper or cast a wider net. Involve more people, ask for help, do whatever you have to do, quickly.
- The highest priority is to resolve the issue at hand quickly.
- Involve the person whose actions triggered the incident. That person knows a lot.
- Ilmo's note: IME, change-induced emergencies are typically fixed faster when we involve the person who introduced the fault.
- Ilmo's note: We might greatly benefit from a pager setup where we also paged the last person to deploy something or make a config change.
- Keep a history of outages
- Ask hard questions.
- Look for specific actions that might prevent an outage next time, not just tactically but strategically.
- Publish & organize postmortems to a place where everyone can learn about them.
- Ilmo's note: Postmortems are a good candidate to read right away when you've joined a new team.
- Hold yourself and others accountable to following specific actions detailed in the postmortems.
- Once you have a good track record for learning from past outages, see what you can do to prevent future ones.
- Ilmo's note: It almost says it, but misses the opportunity to talk about premortems, which is what we use to perform our threat modelling.
- Until a system has failed, you don't know how that system, the upstream systems, or users will react. It's not wise to assume how it will work.
- Ilmo's note: This is similar to how in the Theory of Constraints it is said that "every system, regardless of how well it works, has at least one constraint (a bottleneck) that limits performance." And typically, it's good to know where those bottlenecks (may) exist. And you don't always want to address them before they are a true problem, because that just shifts the bottleneck somewhere else.
Chapter 14: Managing Incidents
- Recursive separation of responsibilities
- Several distinct roles should be delegated to particular individuals
- Incident commander
- Structures incident response task force, assigns responsibilities to need and priority. Otherwise, holds all positions until delegated.
- Most important task: keep a living incident document (Ilmo's note: in our case, this refers to our Slack war room).
- Ops lead
- Works together with incident command
- Only one modifying the system during an incident
- Communications lead
- Public face of the response task force
- Hands out periodic updates to the incident response team and stakeholders, may also touch the incident document.
- Planning lead
- Deals with longer issues: filing bugs, ordering dinner, arranging handoffs, tracking how the system has diverged from the norm so it can be reverted once the incident is resolved.
- A single war room is recommended
- At the end of the day, incident command should be handed over.
- When this is done, it should be done very loudly and explicitly in the war room and so that everyone acknowledges this.
- When is it ok to declare an incident?
- Better early than sorry
- Good to have clear conditions for declaring an incident
- Some example guidelines:
- Do you need to involve a second team in fixing the problem?
- Is the outage visible to customers?
- Is the issue unsolved even after an hour's concentrated analysis?
- Incident management proficiency quickly degrades when it's not in constant use.
- Some best practices are also mentioned for incident management
- Prioritize. Stop the bleeding, restore service, preserve evidence for postmortem.
- Prepare. Develop and document your incident management procedures in advance, in consultation with incident participants.
- Ilmo's note: I like the introduced term, incident participant. We should probably use that one as well.
- Trust. Full autonomy within the assigned role to all incident participants.
- Introspect. Pay attention to your emotional state while responding to an incident. If you feel panicky or overwhelmed, get more support.
- Consider alternatives. Periodically consider your options and re-evaluate whether it still makes sense to continue with the current approach or try another one.
- Practise. Use the process routinely.
- Change it around. When were you incident commander last time? (Ilmo's note: or an ops lead?) Take on a different role next time. Encourage every team member to acquire familiarity with each role.
Chapter 15: Postmortem culture: Learning from Failure
- "The cost of failure is education."
- Postmortem definition given: "…a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring."
- Primary goal to write one is that the incident is documented, underlying contributing causes well understood, and that preventive actions are taken to reduce likelihood and/or impact of recurrence.
- Postmortems are expected after any significant undesirable event. Writing one is not a punishment (Ilmoâs note: and should not feel one) but a learning opportunity instead.
- Before an incident, try to define a set of criteria to know when a postmortem is necessary.
- Ilmo's note: In fact, for mini-incidents I have had the occasional habit of writing small postmortems if we had discussed the mini-incident over a Slack thread. It would of course be nicer if you could gather everything into a single database. I think with suitable postmortem tooling & ChatOps it might just be possible.
- Any stakeholder may request a postmortem for an event.
- On postmortem blamelessness
- Must not indict any individual or team for bad or inappropriate behavior.
- Assumes that everyone involved in an incident had good intentions and did the right thing with the information they had (Ilmo's note: at the time).
- When done well, should lead to investigating systematic reasons why an individual or team had incomplete or incorrect information and lead to creation of effective prevention plans.
- Done badly, may lead to finger pointing and shaming.
- Postmortem workflow includes collaboration and knowledge-sharing at every stage.
- For postmortem documentation, you should look for the following key features:
- Real-time collaboration
- An open commenting/annotation system
- Email notifications
- (Ilmo's notes: I would add that the postmortem tooling should assist us in creating the postmortem as much as possible, forcing roles upon us during war room creation, collecting evidence from the war room automatically, and allowing for evidence gathering & inspection later down the road.)
- Postmortems should be reviewed and presented.
- In practise, drafts are shared internally and assessed by senior engineers.
- Google's review criteria at the time included:
- Was key incident data collected for posterity?
- Are the impact assessments complete?
- Is the action plan appropriate and are the resulting bug fixes at an appropriate priority?
- Did we share the outcome with relevant stakeholders?
- Once the initial review is done, postmortem is shared more broadly, typically the engineering team/org.
- An unreviewed postmortem might as well never have existed.
- Introducing a postmortem culture
- Postmortem of the month
- A monthly newsletter with posts shared for the entire org
- Ilmo's note: good idea, also could be an internal blog
- Google+ postmortem group
- A forum for postmortem-related discussion on internal & external postmortems, best practises, etc.
- Postmortem reading clubs
- Regular sessions where interesting or impactful postmortems are read for open dialogue with participants & non-participants and new people.
- Wheel of Misfortune
- Re-enact a previous postmortem. Original incident commander may attend to make it as real as possible.
- Ease postmortems into the workflow by having a trial period with successful and complete postmortems that prove their value
- Make sure that writing effective postmortems is a rewarded and celebrated practise
- Encourage senior leadershipâs acknowledgment and participation
- Book goes on to say that even Larry Page talks about the high value of postmortems :slightly_smiling_face:
- It's good to ask for feedback on postmortem effectiveness
- Asking questions such as…
- Is the culture supporting your work?
- Does writing a postmortem entail too much toil?
- What best practises does your team recommend for other teams?
- What kinds of tools would you like to see developed?
- …will give better visibility into possible problems and therefore a better chance of increasing effectiveness.
Chapter 16: Tracking outages
- Postmortems tend to provide useful insights for improving a single service or set of services
- However: misses opportunities that would have a small effect in individual cases OR opportunities that have a poor cost/benefit ratio, but that would have large horizontal impact
- Ilmo's note: I think this largely depends on the team composition (plus incentives) and on what their strategy is for ruminating on the problem space. Some teams focus on problems laterally, some teams like to solve the immediately observable problems and move on.
- Chapter describes "The Escalator", which is their in-house PagerDuty-like solution (Ilmo's note: probably before PagerDuty existed)
- Chapter also describes "The Outalator", which is a "time-interleaved list of notifications for multiple queues at once, instead of requiring a user to switch between queues manually"
- Allows annotating incidents
- Ilmo's note: Nice, but probably requires that the tooling also supports integration with other incident tooling or that the incident management is incorporated within The Outalator
- Annotations can be marked as important
- Ilmo's note: Again, nice. This also seems like something you'd like to be able to see visualized as time-series data.
- Silently receives and saves a copy of any email replies
- Ilmo's note: Seems like something very specific to Google's flow
- Multiple escalating notifications (or alerts) can be combined into single entity (incident) in the Outalator
- On aggregation, chapter notes that a single event may and often will trigger multiple alerts.
- Ability to group multiple alerts together into a single incident is critical.
Chapter 17: Testing for Reliability
- "If you haven't tried it, assume it's broken." – Unknown
- Confidence can be measured by past reliability and future reliability. For predictions about future reliability to hold, one of the following must be true:
- Site remains completely unchanged over time with no software releases or changes in the server fleet
- You can confidently describe ALL changes to the site
- Passing a test or a series of tests doesnât necessarily prove reliability.
- Tests that are failing generally prove the absence of reliability.
- On MTTR (Mean Time To Recovery)
- Zero MTTR occurs when a system-level test is applied to a subsystem, and that test detects the exact same problem that monitoring would detect.
- Repairing zero MTTR bugs by blocking a push is both quick and convenient.
- On regression tests
- Tests have a cost
- Time-wise
- Computationally
- Ilmo's note: Some tests also have a cost from a maintainability perspective due to flakiness.
- Example: Bringing up a complete server with required dependencies (or even with mocks) to run tests can take significantly more time (from minutes to hours) and possibly require dedicated resources.
- Configuration tests
- Examines how a particular binary is actually configured and reports discrepancies against its configuration definition.
- Canary tests
- A subset of servers is upgraded to a new version or configuration and then left in an incubation period. If no problems arise, then rest of the servers can be upgraded.
- Not really a test, it's structured user acceptance.
- When using an exponential rollout strategy, it's not necessary to attempt to achieve fairness among fractions of user traffic.
- On CI/CD
- Optimal when engineers are notified when the build pipeline fails.
- Otherwise, a broken pipeline may block (emergency) releases.
- Ilmo's note: Unblocking pipelines should always be the first priority.
- Google uses Bazel as a build tool
- Creates dependency graphs for software projects
- When a change is made to a file, Bazel only rebuilds the part of the software that depends on that file
- SRE tools need to be tested as well
- Eg. tools that retrieve and propagate DB perf metrics, predict usage metrics to plan for capacity risks
- Essentially tools that 1) have side effects remaining within the tested mainstream API, and 2) are isolated from user-facing production by an existing "validation and release barrier"
- Disaster tools
- Can be made to work âofflineâ using checkpoint states
- Typically disaster tools are expected to work with instant consistency as opposed to eventual consistency
- Statistical techniques such as fuzzing, chaos testing, and Jepsen arenât necessarily repeatable tests
- You can aim towards better repeatability using random number generator seeds.
- When defining a release cadence based on reliability, it often makes sense to segment the reliability budget by functionality or by team (latter is more convenient).
- In order to remain reliable and to avoid scaling the number of SREs supporting a service linearly, the production environment has to run mostly unattended.
- A config file that changes more than once per user-facing application release can be a major risk if these changes are not treated the same way as application releases.
- If testing and monitoring coverage of that config file is not considerably better than that of the user application, that file will dominate site reliability in a negative way.
- Ilmo's note: This is a key observation especially reflecting on our (now buried) feature flag proposal where the aim was to increase the amount of dynamic behavior.
- You might be able to mitigate this by, for example, having enough test coverage to support regular routine editing.
- Ilmoâs note: This might not be so easy with a feature flag system, but probably we could think of introducing something like this.
- The contents of the config file are (for testing purposes) potentially hostile content to the interpreter reading the configuration
- Ilmo's note: Could be a potential threat vector, though typically you would have to combine that with negligent or hostile human behavior for it to actualize.
- Key element of delivering site reliability is finding each anticipated form of misbehavior and making sure that some test reports that misbehavior.
- Sometimes a fake backend may be provided and maintained by a team for release testing
- Ilmo's notes: Similar to an environment created based on a merge request (that is isolated from the test environment).
Chapter 18: Software engineering in SRE
- Growth rate of SRE-supported services exceeds the growth rate of the SRE organization
- One SRE guiding principle is that team size should not scale directly with service growth
- Sometimes within SRE teams there may be "fully fledged software development projects" to keep SRE coding skills sharp
- Desirability of team diversity is very important for SRE
- A variety of backgrounds and problem-solving approaches can help prevent blind spots
- Google always strives to staff its SRE teams with a mix of engineers with traditional software development experience and engineers with systems engineering experience
- At Google, many teams have moved to an Intent-Based Capacity Planning approach.
- Intent is the rationale for how a service owner wants to run their service, eg. "I want 50 cores in clusters X, Y, and Z for service Foo"
- Ilmo's note: This is essentially what we do with IaC, kubernetes, terraform, etc. nowadays
- Case Auxon: Google's (early) implementation of an intent-based capacity planning and resource allocation solution
- Essentially a linear program that uses "the resultant bin packing solution to formulate an allocation plan for resources"
- Could support cases where "frontend server must be no more than 50 ms away from the backend servers"
- Ilmo's note: It's possible for us to do this as well, though not directly; it would be interesting to know if support in Auxon was first-class.
- Was first imagined by an SRE after managing capacity planning in spreadsheets, then transformed into a product with a regular product backlog & development, SLA, etc. with full team ownership
- Learnings
- Key was not to focus on perfection and purity of solution; better to launch and iterate.
- Many uncertainties at the beginning
- Didnât wait for the perfect design, kept an overall vision while iterating, keeping software flexible enough to allow cost-efficient rework due to sudden process or strategy changes
- Raising awareness for adoption with a single email or presentation did not suffice, it needed:
- consistent and coherent approach
- user advocacy
- sponsorship from senior engineers and management (who understand the usefulness of the product)
- Demonstrating steady, incremental progress via small releases raised confidence in the teamâs ability to deliver useful software
- Donât be afraid to provide white glove customer support for early adopters to help them through the onboarding process
- By working one-on-one with early users, you can address personal fears and demonstrate that rather than owning the toil of performing tedious tasks manually, the team instead owns the configuration, processes, and ultimate results of their technical work
- By avoiding over-customization for one or two big users, they achieved broader adoption across the org and lowered barrier to entry for new services
- Consciously avoided 100% adoption rate across the org
- Great results having a âseed teamâ that combines generalists who are able to get up to speed quickly on a new topic together with engineers with wide and deep experience
- A good candidate project is one that reduces toil, improves existing piece of infrastructure, or streamlines a complex process
- It was important to fit into the overall set of objectives of the org
- Cross-org socialization and review prevent disjoint or overlapping efforts
- A product that can easily be established as furthering a department-wide objective is easier to staff and support
- Likewise, an all-or-nothing approach that prevents iterative development is a major red flag in an SRE product. Overly generic SRE products are red flags too.
- Majority of software products developed within SRE begin as side projects whose utility leads them to grow and become formalized, product may branch off into following possible directions:
- Remain grassroots effort developed during available free (or spare) time
- Becomes established as a formal project through structured processes
- Gains executive sponsorship from within SRE leadership to expand into a fully staffed software development effort
- Note: In any of these scenarios, it's essential that the original SRE(s) involved continue working as SREs instead of becoming full-time developers embedded in the SRE org
- This gives them an invaluable perspective, as they will be dogfooding their own product.
- SREs are used to working closely with their teammates, quickly analyzing, and reacting to problems
- Think about the objectives you want to achieve when developing SRE software, some guidelines:
- Create and communicate a clear message
- Important to communicate the strategy, the plan, and the benefits (Ilmo's note: the rationale as well)
- SREs (Ilmo's note: and devs inheriting any SRE tooling) are skeptical by nature
- Make a compelling case for how it will help them
- It's important to get past the first hurdle: getting SREs to accept the strategy
- Evaluate your organization's capabilities
- Youâre creating a product team, effectively
- Launch and iterate
- Establish credibility; deliver some product of value in a reasonable amount of time
- Don't lower your standards
- Resist the urge; hold yourself to the same standards as a product team
Chapter 19: Load Balancing on the Frontend
- DNS is typically the first layer of load balancing
While conceptually simple and trivial to implement, many dragons be here.
- Very little control over client behavior; records selected randomly and each will attract ~equal amount of traffic
- Usually client cannot determine closest address
No solution is trivial as the DNS server lies somewhere between users and nameservers
DNS server acts as caching layer and has 3 important implications on traffic management
Recursive resolution of IP addresses
- Difficult to find optimal IP address to return to nameserver for a given user's request, especially when requests may potentially come from 1M+ users across regions (or from a single office).
- Typically these responses are cached with TTL. Implication is that estimating impact of a given reply is difficult.
- Solved by (1) analyzing traffic and updating list of DNS resolvers with approximate size of the user base behind a given resolver, and (2) estimating geographical distribution of the users behind each tracked resolver.
Nondeterministic reply paths
Additional caching complications
Despite all this, DNS is still the simplest and most effective way to load balance requests (before the user's request has even begun).
- Not sufficient on its own.
- A better approach may be to combine DNS with virtual IP addresses
- Ilmo's notes: On AWS, this relates to VPC & ENI things
- Most important part of this approach is a Network Load Balancer
- Ilmo's note: hey we know this! Nothing that can go wrong here :sunglasses: (This is a reference to an incident we had where we were using TCP-based health checks instead of HTTP-based health checks...)
- When receiving user requests, load balancer should always prefer redirecting to the least loaded backend
- Chapter details ways on how to redirect efficiently.
- Consistent hashing
- A mapping algorithm that remains stable even when backends are added to or removed from a load balancer, minimizing disruption to existing connections when the pool of backends changes.
- Doesn't require keeping the state of every connection in memory, and won't force all connections to reset when a single machine goes down.
- Can use simple connection tracking in the load balancer, but fall back to consistent hashing when the system is under pressure (a toy sketch of the idea follows below).
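Not from the book, just a toy Python sketch of the consistent-hashing idea (the `HashRing` name and the vnode count are my own choices): keys map to the first backend clockwise on a sorted ring of virtual nodes, so removing one backend only remaps the keys that pointed at its arcs.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: a key maps to the first backend clockwise on the ring."""

    def __init__(self, backends, vnodes=100):
        self._ring = []  # sorted list of (hash, backend), one entry per virtual node
        for b in backends:
            self.add(b, vnodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, backend, vnodes=100):
        for i in range(vnodes):
            self._ring.append((self._hash(f"{backend}#{i}"), backend))
        self._ring.sort()

    def remove(self, backend):
        self._ring = [(h, b) for h, b in self._ring if b != backend]

    def pick(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["be-1", "be-2", "be-3"])
print(ring.pick("client-42"))  # most keys keep their backend even if "be-2" is removed
```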
- On how to forward packets to a backend
- NAT
- Requires keeping an entry for every connection in a tracking table, which precludes a completely stateless fallback mechanism
- Modifying data link layer information (OSI layer 2) using Direct Server Response
- Serious compromise: all LBs and backends must reach each other at the data link layer
- Google stopped using this
- Packet encapsulation
- Google started using this
- Introduces overhead to packets, which can cause the packet size to exceed the available Maximum Transmission Unit (MTU) size and require fragmentation.
Chapter 20: Load balancing in the datacenter
Lame duck state
- One in which the backend task is listening on its port and can serve, but is explicitly asking clients to stop sending requests.
- When a task enters lame duck state, it broadcasts that fact to all its active clients.
Main advantage of allowing task to exist in a quasi-operational lame duck state is that it simplifies clean shutdown
- Avoids serving errors to unlucky requests that happen to be active on backend tasks that are shutting down
Shutdown process
- the job scheduler sends a SIGTERM to the backend task
- the backend task enters lame duck state and asks its clients to send new requests to other backend tasks
- ongoing requests are served; no new requests are accepted
- finally, all requests should be processed (Ilmo's note: assuming there are no persistent connections)
- after some time, the backend task exits cleanly or the job scheduler kills it
- Ilmo's note: For example, this is exactly how the ingress-nginx shutdown process works.
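A minimal sketch of what such a shutdown can look like inside an application (my own toy example, not from the book): on SIGTERM, flip a "lame duck" flag so new requests are refused, let in-flight work drain, then exit before the scheduler's kill.

```python
import signal
import sys
import threading
import time

lame_duck = threading.Event()
in_flight = 0
lock = threading.Lock()

def handle_sigterm(signum, frame):
    # Enter lame duck state: refuse new work, keep serving in-flight requests.
    lame_duck.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def handle_request(seconds_of_work):
    global in_flight
    if lame_duck.is_set():
        return "503: draining, please retry another backend"
    with lock:
        in_flight += 1
    try:
        time.sleep(seconds_of_work)  # pretend to do the work
        return "200 OK"
    finally:
        with lock:
            in_flight -= 1

def main():
    while True:
        if lame_duck.is_set() and in_flight == 0:
            sys.exit(0)  # drained cleanly before the scheduler resorts to SIGKILL
        time.sleep(0.1)
```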
If latency of outgoing requests from a backend task grows (because of eg. competition for network resources with an antagonistic neighbour), the number of active requests will also grow, which can trigger garbage collection.
When a task gets restarted, it often requires significantly more resources for a few minutes.
Traffic sinkholing is when a client starts to send very large amounts of traffic to an unhealthy task because (for example) the backend task is sending "I'm unhealthy" errors with very low latency, contributing to the client increasing the rate of requests as a consequence.
- Can be addressed by tuning load balancer policy (eg. count recent errors as active requests)
For load balancing, there is lots of discussion in the chapter comparing routing policies (simple round robin, least-loaded round robin, weighted round robin)
- In the end, Weighted Round Robin is recommended (a toy sketch follows below).
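In Weighted Round Robin each client scores backends using backend-provided information (such as CPU usage and request rates) and spreads requests in proportion to those scores. A toy sketch of the proportional-spreading part only, not Google's implementation, and it uses a randomized weighted pick rather than a strict rotation:

```python
import random

# Hypothetical capability scores derived from backend-reported load;
# a higher weight means the backend receives proportionally more requests.
weights = {"be-1": 5, "be-2": 3, "be-3": 1}

def pick_backend():
    backends, w = zip(*weights.items())
    return random.choices(backends, weights=w, k=1)[0]

counts = {b: 0 for b in weights}
for _ in range(9000):
    counts[pick_backend()] += 1
print(counts)  # roughly 5000 / 3000 / 1000
```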
Chapter 21: Handling overload
Gracefully handling overload conditions is fundamental to running a reliable service
One option is to serve degraded responses (eg. less accurate or more condensed response data, search only a small % of a candidate set, rely on local fallback data)
Redirect when possible, serve degraded results when necessary, handle resource errors transparently when all else fails.
Different queries can have vastly different resource requirements
- Because of this, "queries per second" often makes for a poor metric
A better solution is to measure capacity directly in available resources
- Cost of a request: normalized measure of how much CPU time a request has consumed
When global overload occurs, service should deliver error responses to misbehaving customers, while other customers should remain unaffected.
When a request is out of quota, the service should reject requests quickly.
A service may become overloaded even with caching when the majority of CPU is spent rejecting requests.
- Client-side throttling addresses this problem.
Adaptive throttling tracks two key pieces of information over a particular period of time (eg 2 minutes)
- requests: # of requests attempted to the backend
- accepts: # of requests accepted by the backend
- Normally these two are equal, but may differ when traffic is being rejected.
- Requests are allowed until requests is K times as large as accepts; once this threshold is reached, the client starts rejecting new requests locally (see the sketch below).
Adaptive throttling works well in practise, leading to stable rates of requests overall.
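The book gives the local rejection probability as roughly max(0, (requests − K × accepts) / (requests + 1)). A toy sketch of that idea; the windowing is simplified (in reality requests/accepts are tracked over a sliding window such as two minutes), and the backend call is a stand-in:

```python
import random

class AdaptiveThrottle:
    """Client-side throttling: start rejecting locally when requests greatly exceed K * accepts."""

    def __init__(self, k=2.0):
        self.k = k
        self.requests = 0  # requests attempted by this client (within the window)
        self.accepts = 0   # requests accepted by the backend (within the window)

    def allow(self):
        # Local rejection probability, following the SRE book's formula.
        p_reject = max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))
        return random.random() >= p_reject

    def record(self, accepted):
        self.requests += 1
        if accepted:
            self.accepts += 1

def send_to_backend():
    """Stand-in for the real RPC; returns True if the backend accepted the request."""
    return True

throttle = AdaptiveThrottle(k=2.0)
if throttle.allow():
    throttle.record(accepted=send_to_backend())
# else: fail locally without adding load to an already overloaded backend
```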
Sometimes it can be useful to categorize a request's criticality
Google uses a four-value range to determine their request criticality:
- CRITICAL_PLUS: Reserved for the most critical requests, those that will result in serious user-visible impact if they fail.
- CRITICAL: Default value for requests sent from production jobs. Will result in user-visible impact, but impact may be less severe than in CRITICAL_PLUS. Services are expected to provision enough capacity for all CRITICAL_PLUS + CRITICAL traffic.
- SHEDDABLE_PLUS: Partial unavailability is expected. Default for batch jobs.
- SHEDDABLE: Frequent partial unavailability and occasional full unavailability expected.
Could in theory have more categories, but this has been sufficient for Google.
As a result, criticality is a first-class notion within Google's systems.
Some examples:
- When a customer runs out of global quota, a backend service will only reject requests of a given criticality if itâs already rejecting all requests of all lower criticalities.
- When a service is overloaded, it will reject requests of lower criticalities sooner.
- Adaptive throttling also keeps separate stats for each criticality.
- Criticality is propagated throughout the system (same level of criticality is used for upstream calls).
Google was using many different criticality "standards" until one was chosen and harmonized throughout the org. Now, criticality can be chosen consistently and reliably.
Ilmo's notes: I found this to be a particularly nice tactic that we probably won't stand to gain from immediately. :slightly_smiling_face: Our biggest wins may come from splitting the management applications and putting them on separate nodes etc. Even so, I find this interesting!
Overload protection at Google is based on a notion of utilization (eg. the current CPU rate divided by the total CPUs reserved for the task; sometimes the executor load average; sometimes combined target utilization thresholds)
- As utilization threshold is reached, requests start to get rejected based on their criticality.
- Executor load average: number of active threads in the process
- This can be used to spot fan-outs and to reject requests when this number grows too large (a toy sketch combining utilization and criticality follows below).
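A toy illustration (mine, not Google's implementation) of combining a utilization signal with criticality: as utilization climbs past per-criticality thresholds, lower-criticality requests get shed first. The threshold values are made up.

```python
from enum import IntEnum

class Criticality(IntEnum):
    SHEDDABLE = 0
    SHEDDABLE_PLUS = 1
    CRITICAL = 2
    CRITICAL_PLUS = 3

# Hypothetical utilization thresholds: above each value, that class is shed.
SHED_THRESHOLDS = {
    Criticality.SHEDDABLE: 0.70,
    Criticality.SHEDDABLE_PLUS: 0.80,
    Criticality.CRITICAL: 0.90,
    Criticality.CRITICAL_PLUS: 0.97,
}

def should_reject(utilization, criticality):
    """Reject a request if current utilization exceeds its criticality class's threshold."""
    return utilization > SHED_THRESHOLDS[criticality]

print(should_reject(0.85, Criticality.SHEDDABLE))       # True: shed batch work first
print(should_reject(0.85, Criticality.CRITICAL_PLUS))   # False: keep serving critical traffic
```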
In case of overload errors, two possible cases exist:
- Large subset of services in the datacenter (Ilmo's note: think EKS cluster) are overloaded
- Requests should not be retried, errors should bubble up all the way up to the caller
- Small subset of services in the datacenter are overloaded
- Preferred response could be to retry the request immediately
Request retries
- From perspective of load balancing policies, retries of requests are indistinguishable from new requests.
- Retries can be organic load balancing
- Google first implemented a per-request retry budget (max 3 times)
- Ilmo's note: IMO max 1 retry should be good enough for typical cases. If you were truly unlucky and received a failure on your first try, chances are that the next retry will succeed. However, if the service is TRULY overloaded, then 3 retries will just contribute to the congestion even more.
- Google also implemented a per-client retry budget where each client keeps track of the ratio of requests that correspond to retries. A request will only be retried as long as this ratio is below a given %, eg. 10%.
- Ilmo's note: Personally, I think this could be okay as long as the operation is idempotent and important enough to warrant that we absolutely need to serve something every time the call is made. But it comes at a complexity cost to maintenance, eg. understanding how & under what conditions the request passes through a particular chain.
- Alternatively, one can also implement some kind of state somewhere within the service for time-series frequency analysis of retries, returning an "overloaded; don't try" error response if a histogram reveals a significant amount of retries.
- Ilmo's note: One should really carefully consider whether to adopt something like this as part of the system, because it makes it very hard to reason about the system in times of incidents. If you truly, absolutely need this kind of a feature, then you will also want to consider how you will observe this (and not forget about its existence during an incident).
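A small sketch of the per-client retry budget described above (my own toy version): the client tracks the ratio of retries to requests and only retries while that ratio stays under a cap such as 10%. The `do_rpc` callable is a placeholder for the real call.

```python
class RetryBudget:
    """Allow a retry only while retries stay under `ratio` of all requests seen."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        return self.retries < self.ratio * max(self.requests, 1)

    def record_retry(self):
        self.retries += 1

budget = RetryBudget(ratio=0.1)

def call_with_retry(do_rpc, max_attempts=3):
    # do_rpc: hypothetical callable returning (ok, response)
    for attempt in range(max_attempts):
        budget.record_request()
        ok, resp = do_rpc()
        if ok:
            return resp
        if attempt + 1 == max_attempts or not budget.can_retry():
            raise RuntimeError("giving up: request failed and/or retry budget exhausted")
        budget.record_retry()
```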
Handling burst load
- Expose load to the cross-datacenter load balancing algorithm, eg. base your load balancing on the utilization of the cluster
- Use a separate proxy backend service for batch jobs to shield batch-originating fan-outs/request bursts from affecting user-facing services
Final words
- It's a common mistake to assume that an overloaded service should turn down and stop accepting all traffic.
- Instead, continue to accept as much traffic as possible, but only accept that load as capacity frees up.
Chapter 22: Addressing cascading failures
- "If at first you don't succeed, back off exponentially."
- Dan Sandler, Google Software Engineer
- "Why do people always forget that you need to add a little jitter?"
- Ade Oshineye, Google Developer Advocate
- Cascading failure
- Failure that grows over time as a result of positive feedback
- Can occur when portion of an overall system fails, increasing probability that other portions of the system fail
- Most common cause: overload
- Running out of a resource
- Can result in higher latency, elevated error rates, or substitution of lower-quality results
- Desired effects: something eventually needs to give as the load increases beyond what a server can handle
- Can render server less efficient or cause it to crash
- Can lead the service or the entire cluster into a cascading failure
- Different types of resources can be exhausted
- CPU
- Memory
- Threads
- File descriptors
- Dependencies among resources
- During CPU exhaustion, typically all requests become slower
- Can result in following secondary effects:
- Increased number of in-flight requests
- Requests take longer to handle, so more requests are being handled concurrently => this affects almost all resources, including memory, the number of active threads (in a non-async server), the number of file descriptors, and backend resources
- Excessively long queue lengths
- Saturation of queues
- Latency increases
- Queue uses more memory
- Thread starvation
- Health checks may fail
- CPU or request starvation
- Missed RPC deadlines
- Reduced CPU caching benefits
- Decreased usage of local caches and decreased CPU efficiency
- Memory exhaustion can cause the following effects:
- Dying tasks
- Task might get evicted
- Increased rate of GC in Java => increased CPU usage
- Vicious cycle can occur: less CPU available => slower requests => increased RAM usage => more GC => even less CPU
- a.k.a. GC death spiral
- Reduction in cache hit rates
- If application-level caches are being used, can result in increase of requests to upstream services => can cause overload to happen
- Thread starvation
- Can directly cause errors or lead to health check failures
- If the server adds more threads as needed (without an upper bound), thread overhead can use too much RAM.
- Can also cause a secondary effect of running out of process IDs.
- File descriptor exhaustion
- Inability to initialize network connections => health check failure
- Dependencies among resources being exhausted
- Many of these above scenarios feed from one another
- A service experiencing overload may suffer from secondary symptoms
- These may look like primary symptoms during an incident response
- It's unlikely that a causal chain is ever fully diagnosed
- Service unavailability
- Servers crashing may lead to a snowballing effect where load increases on the remaining servers causing them to crash as well
- It's hard to recover from this if the high rate of requests contributes to the problem
- Servers can appear unhealthy to the load balancing layer
- Effect very similar to crashing
- Load balancing policies that avoid servers that have served errors can exacerbate problems further
- Again, by snowballing the load to the remaining servers
- Strategies for avoiding server overload
- Load test serverâs capacity limits and test the failure mode for overload
- Most important exercise you should conduct to prevent server overload
- Serve degraded results
- Serve lower-quality, cheaper-to-compute results
- Instrument the server to reject requests when overloaded
- Instrument higher-level systems to reject requests, rather than overload servers
- Rate limiting can be implemented at:
- Reverse proxies
- Based on IP address
- At the load balancers
- Drop requests when the service enters global overload
- Can be indiscriminate: eg. "drop all traffic above X requests per second"
- Or selective: eg. "drop requests that aren't from users who have recently interacted with the service"
- At individual tasks
- Ilmo's note: eg. clojure endpoints
- Perform capacity planning
- Should be coupled with performance testing
- Capacity planning reduces (but does not eliminate) the probability of triggering a cascading failure
- Horizontal scaling also needs proper capacity planning
- On queue management
- Most thread-per-request servers use a queue in front of a thread pool to handle requests
- If the request rate and latency of a given service are constant, there's no reason to queue requests
- Constant number of threads should be occupied
- Queued requests consume memory and increase latency
- A system with fairly steady traffic over time fares better with a small queue length relative to the thread pool size (eg. <=50%)
- the server starts to reject requests early when the incoming rate is unsustainable
- Systems with drastically bursty loads may do better with a queue size based on the current number of threads in use, the processing time for each request, and the size and frequency of bursts
- Load shedding
- Drop some proportion of load by dropping traffic as the server approaches overload conditions
- Goal is to keep server from running out of RAM, failing health checks, serving with extremely high latency, or any other overload symptoms
- One way to solve: per-task throttling based on CPU, memory, or queue length
- Another way: return 503 to any incoming requests when there are more than a given number of client requests in flight
- Or: Change the queueing method from FIFO to LIFO, or use controlled delay algorithm (or similar approaches)
- Can reduce load
- Based on the fact that users will more likely refresh the browser than wait, issuing another request
- Other (sophisticated) ways: identify clients, pick requests that are more important, prioritize.
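A toy version of the "return 503 beyond N in-flight requests" idea mentioned above (mine; the threshold and names are made up, and a real implementation would live in the server framework's request path):

```python
import threading

MAX_IN_FLIGHT = 200     # hypothetical capacity limit, determined by load testing
_in_flight = 0
_lock = threading.Lock()

def handle(request, do_work):
    """Shed load early instead of letting the queue and latency grow without bound."""
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            return 503, "overloaded, please back off"  # signal clients to back off, not retry hot
        _in_flight += 1
    try:
        return 200, do_work(request)
    finally:
        with _lock:
            _in_flight -= 1
```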
- Graceful degradation
- Reduce amount of work, eg. search subset of data stored in an in-memory cache rather than hit the database
- When evaluating Load Shedding (LS) or Graceful Degradation (GD) options for a service, consider:
- Which metrics to use to determine when to initiate LS/GD
- What actions should be taken when the server is in degraded mode?
- At what layer should LS/GD be implemented?
- Is high-level choke-point sufficient?
- Further points
- GD should not trigger very often, keep it simple
- Remember: a code path you don't use is often a broken code path; GD paths will be exercised less in steady state.
- Can mitigate by regularly exercising a small subset of servers near overload.
- Ilmo's note: maybe automatic load testing.
- Monitor and alert when too many servers enter these modes.
- Complex LS/GD can cause problems
- Design a way to turn off LS/GD
- Ilmo's note: Maybe by using Ops Toggles.
- On retries
- They can stabilize a system, but…
- …they can keep an overloaded service in an overloaded state.
- …they can amplify the effects of a server overload.
- If this is happening, one has to dramatically reduce or eliminate the incoming load until the retries stop and the backend stabilizes.
- When issuing automatic retries
- Most of the backend protection strategies for preventing server overload apply.
- Testing a system can highlight problems, GD can reduce the effect of retries on the backend
- Always use a randomized exponential backoff when scheduling retries.
- Limit retries per request, don't retry indefinitely.
- Consider having a server-wide retry budget.
- If the budget is exceeded, fail the request.
- Decide if you really need to perform retries at a given level. Think about the whole system.
- Avoid amplifying retries by issuing retries at multiple levels. This can lead to catastrophic overload.
- Use clear response codes and consider how different failure modes should be handled.
- Separate retriable and nonretriable error conditions.
- Don't retry permanent errors or malformed requests.
- Return a specific status when overloaded so that clients know how to back off and do not retry.
- Graphs of retry rates may not tell the whole story => can be misinterpreted to be a symptom instead of a compounding cause
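Tying together the chapter's opening quotes and the retry guidance above, a minimal sketch of randomized ("full jitter") exponential backoff with a per-request retry cap. This is a common pattern, not the book's code; `do_rpc` and `TransientError` are placeholders for the real call and its retriable error type.

```python
import random
import time

class TransientError(Exception):
    """Retriable failure (e.g. 'overloaded, back off'); permanent errors must not be retried."""

def call_with_backoff(do_rpc, max_attempts=3, base=0.1, cap=5.0):
    """Retry with exponential backoff plus jitter; give up after max_attempts."""
    for attempt in range(max_attempts):
        try:
            return do_rpc()                      # hypothetical callable performing the request
        except TransientError:
            if attempt + 1 == max_attempts:
                raise                            # out of attempts: fail the request upward
            backoff = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))  # full jitter spreads retries out in time
```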
- Remote Procedure Call (or RPC) deadline
- How long a request can wait before client gives up
- Useful to have one
- Otherwise, may cause problems that keep on consuming server resources
- High deadlines can result in resource consumption in higher levels of the stack when lower levels of the stack are having problems
- Short deadlines can cause some more expensive requests to fail consistently
- Balancing is hard
- Ilmo's note: It may be useful to measure the request latency distribution at the receiving end PLUS consider server resources such as the upper bound for processable requests PLUS what is the (expected) rate of incoming requests.
- If deadline is being checked over multiple stages (eg. there are a few callbacks), deadline should be checked before continuing work.
- Servers should propagate deadlines (set at the top, all downstream services share the same deadline)
- Consider setting an upper bound here for outgoing deadlines as well.
- It's easy to break ordinary functionality such as requests with large payloads, file uploads, requests that are awaiting computation, etc.
- Propagating cancellations reduces unneeded or doomed work
- Some systems "hedge" requests: they send requests to many places to see which one answers fastest, sending cancellations to the rest of the servers.
- Sometimes a small dip in latency leads to a larger error rate
- Guidelines
- Try to look at distribution of latencies in addition to the averages.
- Return with an error early, donât wait for the full deadline.
- Using a fail-fast option for an API will help, if one exists.
- Having deadlines several orders of magnitude longer than the mean request latency is usually bad.
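A sketch of deadline propagation (my own illustration, not the book's code): the entry point sets an absolute deadline, every hop computes the remaining budget, checks it before doing work, and passes it downstream. The `rpc()` helper is hypothetical.

```python
import time

class DeadlineExceeded(Exception):
    pass

def remaining(deadline):
    return deadline - time.monotonic()

def handle_frontend(request, timeout_s=1.0):
    deadline = time.monotonic() + timeout_s      # set once at the top of the stack
    return call_backend(request, deadline)

def call_backend(request, deadline):
    budget = remaining(deadline)
    if budget <= 0:
        raise DeadlineExceeded("don't start doomed work")  # fail fast instead of waiting it out
    # Propagate the remaining budget (optionally capped) to the downstream call.
    return rpc("backend.Process", request, timeout=min(budget, 0.8))  # rpc() is a hypothetical helper
```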
- Processes are slower to respond to requests right after starting, causes:
- Required initialization
- Runtime performance characteristics, eg. JIT, hotspot optimization, deferred class loading (Ilmo's note: clojure compilation)
- Cold cache
- New cluster, maintaining a service, restarts can induce this
- External caches will help, externally managed caches even more
- If caching has a significant effect on a service, you may want to:
- Overprovision the service.
- Think about which caches are latency caches, which ones capacity caches.
- Be vigilant about not adding a cache as a hard dependency by accident.
- When adding load to a cluster, slowly increase the load.
- Before opening up the valves, make sure the service carries a nominal load and that the caches keep warm.
- Always go downward in the stack
- Avoid intra-layer communication
- Communications are susceptible to a distributed deadlock
- Triggering conditions for cascading failures
- Process death
- Eg. Query of Death, cluster issues, assertion failures, etc.
- Process updates
- Eg. new service version deployment
- New rollouts
- Eg. config, infrastructure changes
- Should have some kind of change logging to see what has changed
- Organic growth
- Eg. Usage consumption grew higher than estimated capacity.
- Planned changes, drains, or turndowns
- Eg. outages in a cluster, maintenance, drainage
- Depending on slack CPU as a safety net is dangerous
- When performing load tests, make sure that you remain within your committed resource limits.
- Testing for cascading failures
- You should test under heavy load.
- Test until it breaks
- At this point, the service should start serving errors or degraded results, but should NOT significantly reduce the rate at which it successfully handles requests.
- Consider testing both gradual and impulse load patterns.
- Load test each component separately
- Should track state between multiple interactions and check correctness at high load.
- Be careful about testing in production.
- Immediate steps to address cascading failures
- Increase resources
- adding more resources may not be sufficient to recover
- Stop health check failures
- Note that process health checks and service health checking are two different things
- Ilmo's note: probe & liveness health checks in kubernetes
- Restart servers
- Especially if there's a GC death spiral,
- OR if in-flight requests have no deadlines but are consuming resources leading to blocking threads,
- OR servers are deadlocked
- Drop traffic
- Be aggressive; let 1% through only.
- Last resort trick.
- Enter degraded modes
- Eliminate batch loads
- Eliminate bad traffic
- Be careful when evaluating changes to ensure that one outage is not being traded for another!
Chapter 23: Managing critical state: Distributed consensus for reliability
- Distributed consensus problem deals with reaching agreement among a group of processes connected by an unreliable communications network.
- One of the most fundamental concepts in distributed computing.
- Groups of processes should reliably agree on following questions
- Which process is the leader of a group of processes?
- What is the set of processes in a group?
- Has a message been successfully committed to a distributed queue?
- Does a process hold a lease or not?
- What is a value in a datastore for a given key?
- Whenever you see leader election, critical shared state, distributed locking: recommended to use distributed consensus systems that have been formally proven and tested thoroughly
- CAP theorem holds that a distributed system cannot simultaneously have all three of the following properties:
- Consistent views of the data at each node
- Availability of the data at each node
- Tolerance to network partitions
- Most systems that support BASE semantics rely on multimaster replication
- BASE: Basically Available, Soft state, Eventual consistency
- Eventual consistency can lead to surprising results especially with clock drift (inevitable in distributed systems) or network partitioning
- It is difficult to design systems that work well with datastores that support only BASE semantics
- Case study 1: the split-brain problem
- Assume you have a pair of file servers with one leader and one follower per server. Servers monitor each other via heartbeats.
- If one file server cannot contact its partner, it issues a kill command to its partner node to shut the node down and takes "mastership" of its files.
- A slow or packet-dropping network may introduce a faulty state in which both nodes believe they are active for the same resource, or in which both are down because each issued a kill command to the other. => corruption or unavailability of data
- Gist: heartbeats can't be used to solve the leader election problem.
- Case study 2: failover requires human intervention
- Assume a highly sharded database system has a primary for each shard which replicates synchronously to a secondary in another datacenter.
- An external system checks the health of the primaries; if a primary is no longer healthy, it promotes the secondary to primary.
- If the primary can't determine the health of its secondary, it makes itself unavailable and escalates to a human in order to avoid the split-brain scenario.
- This approach doesnât risk data loss, but
- negatively impacts availability of data.
- increases operational load on the engineers.
- Problems with human escalation
- Human intervention scales poorly
- If the network is so badly affected that a distributed consensus system cannot elect a primary, a human is likely not better positioned to elect one either.
- Case study 3: faulty group-membership algorithms
- Assume a system has a component that performs indexing and searching services.
- When starting, nodes use a gossip protocol to discover each other and join the cluster.
- Cluster elects a leader which performs coordination.
- In case of a network partition that splits the cluster, each side (incorrectly) elects a master and accepts writes/deletions, leading to split-brain scenario and data corruption.
- Many distributed systems problems turn out to be different versions of distributed consensus, including:
- master election
- group membership
- all kinds of distributed locking and leasing
- reliable distributed queueing and messaging
- maintenance of any kind of critical shared state that must be viewed consistently across a group of processes
- ^ any of the above problems should be solved only using (verified & tested) distributed consensus algorithms
- How distributed consensus works
- Consensus problem has multiple variants.
- We are interested in asynchronous distributed consensus
- applies to environments with potentially unbounded delays in message passing
- (synchronous consensus applies to real-time systems, where messages will always be passed with specific timing guarantees)
- Distributed algorithms may be crash-fail
- = crashed nodes never return to the system
- or crash-recover (Ilmo's note: … when they do?)
- Algorithms may deal with Byzantine or non-Byzantine failures
- Byzantine := when a process passes incorrect messages due to a bug or malicious activity
- Comparatively costly to handle, less often encountered
- Ilmo's note: non-Byzantine isn't explained here.
- Solving the asynchronous distributed consensus problem in bounded time is impossible. (ie. no related algorithm here can guarantee progress in the presence of an unreliable network)
- This is mitigated by having sufficient healthy replicas and network connectivity.
- (The book also goes on to mention backoff jitter to avoid the dueling proposers problem.)
- Original solution to the distributed consensus problem was Lamport's Paxos protocol
- Others that exist are: Raft, Zab, Mencius
- Paxos also has many variations, with performance optimizations
- Paxos overview
- Paxos operates as a sequence of proposals which may or may not be accepted by a majority of the processes in the system
- If a proposal isn't accepted, it fails.
- Each proposal has a sequence number (imposing a strict ordering on all of the operations)
- Solves any problems relating to ordering of messages in the system.
- First, the proposer sends a sequence number to the acceptors
- Each acceptor will agree to accept the proposal only if it hasn't already seen a proposal with a higher sequence number.
- Proposer can retry with a higher sequence number (if necessary).
- Proposers must use unique sequence numbers.
- Eg. using their hostname as part of the sequence number
- If a proposer receives agreement from a majority of the acceptors, it can commit the proposal by sending a commit with a value.
- => two different values can't be committed for the same proposal, because any two majorities will overlap in at least one node
- Acceptors must write/persist a journal when accepting a proposal (to account for restarts)
- There is also Fast Paxos, focusing on wide area networks
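An educational single-decree Paxos acceptor sketch (heavily simplified: no networking, no journaling to persistent storage, which a real acceptor must have). It only illustrates the two rules described above: promise only to higher-numbered proposals, and accept a value only if no higher-numbered promise has been made.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised so far
        self.accepted_n = -1        # proposal number of the accepted value (if any)
        self.accepted_value = None

    def prepare(self, n):
        """Phase 1: promise not to accept proposals numbered lower than n."""
        if n > self.promised_n:
            self.promised_n = n
            # Return any previously accepted value so the proposer must re-propose it.
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def accept(self, n, value):
        """Phase 2: accept the value unless a higher-numbered promise has been made."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False

# A proposer that collects `prepare` promises from a majority of acceptors and then
# `accept` acknowledgements from a majority has achieved consensus on `value`.
```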
- Distributed consensus algorithms allow a set of nodes to agree on a value, once.
- Examples of systems built on consensus algorithms: Zookeeper, Consul, etcd.
- Replicated State Machine (RSM)
- a system that executes the same set of operations, in the same order, on several processes
- ordering is global
- fundamental building block for distributed systems
- several whitepapers say that any deterministic program can be implemented as a highly available replicated service by being implemented as an RSM
- need transaction logs for recovery purposes
- Timestamps are highly problematic in distributed systems due to the fact that it's impossible to guarantee that clocks are synchronized across multiple machines
- Replicated services that use a single leader are very common
- allows for ensuring mutual exclusion at a coarse level
- A barrier
- a primitive that blocks a group of processes from proceeding until some condition is met
- effectively splits a distributed computation into logical phases
- can be implemented by a single coordinator process, however adds a single point of failure (bad)
- can also be implemented as an RSM
- Zookeeper supports barriers
- Lock
- Can be used to prevent multiple workers from processing the same input file
- Should be used with timeouts to prevent deadlocks
- Supported in RSM
- There be dragons here, know what you are doing
- Queueing-based systems
- can tolerate failure & loss of worker nodes
- system must ensure claimed tasks are successfully processed
- solution: lease system
- :-1: : loss of the queue prevents the entire system from operating
- Implementing the queue as an RSM can minimize the risk & make the system âfar more robustâ
- Ilmo's note: Would be interesting to PoC this sometime.
- Atomic broadcast
- Messages are received by all participants reliably and in same order
- An "incredibly powerful" concept
- Queueing-as-work-distribution pattern
- Queue used as a load balancing device => point-to-point messaging
- Typically a pub-sub queue is also implemented, where the same messages of a channel or topic can be consumed by many clients
- can be used to implement a coherent distributed cache
- Queueing & messaging systems often need excellent throughput
- They don't need extremely low latency
- High latencies can lead to problems
- Conventional wisdom: consensus algorithms are too slow and costly to use for many systems requiring high throughput and low latency
- not true; in fact, they have proven extremely effective in practise at Google
- There is no one "best" distributed consensus and state machine replication algorithm for performance
- Workload may vary in terms of:
- throughput: number of proposals being made per (second/minute/etc.) at peak load
- type of requests: proportion of operations that change state
- consistency semantics required for read operations (Ilmo's notes: wtf?)
- request sizes
- Deployment strategies vary:
- deployment scale: local or wide area?
- what kinds of quorum are used, where are the majority of the processes?
- does the system use sharding, pipelining, batching?
- Multi-Paxos
- Uses a strong leader process
- Optimal for message throughput
- Requires only 1 round trip from proposer to a quorum of acceptors to reach consensus
- Important to implement backoff jitter and timeouts to avoid the dueling proposers problem here as well
- For replicated datastores, important to do one of the following:
- Perform a read-only consensus operation
- Read data from a replica that is guaranteed to be most up-to-date
- Stable leader process can provide this guarantee.
- Use quorum leases: allow strongly consistent local reads and, as a compromise, lose some write perf.
- Reduces latency, increase throughput for read operations
- Useful especially for read-heavy workloads in which some data is being read from a single geographic area.
- Two major physical constraints on performance when committing state changes
- network round-trip time
- lead time for writing data to persistent storage
- Stable leaders
- Allows for read optimizations, but has problems:
- All operations that change state must be sent via the leader
- Outgoing network bandwidth is a system bottleneck
- Throughput is dependent on machine performance
- Typically single stable leader pattern is used in most consensus systems where performance is a concern
- Batching
- Increases system throughput
- Inefficiencies introduced by idle replicas can be solved with pipelining
- Allows for multiple proposals to be in-flight at once
- Can be combined with a RSM
- Assume a write to disk takes 10ms => rate of consensus operations could be <=100 per second
- Consensus systems operate using majority quorums
- For non-Byzantine failures, minimum no of replicas that can be deployed is 3. If less than that, no failures can be tolerated.
- If quorum cannot be formed, then the system may be in an unrecoverable state
- Adding a replica in a majority quorum can decrease system availability.
- Unavailability of the consensus system is usually unacceptable
- Treat the system logs (or replicated log) as critical for production incidents.
- Some specific aspects warrant special attention:
- number of members running in each consensus group, status of each process
- persistently lagging replicas
- whether or not a leader exists
- number of leader changes
- too rapid increase signals flappiness, decrease can be a serious bug
- consensus transaction number
- is the system making any progress?
- number of proposals seen/agreed upon
- throughput and latency
- Following aspects may also be interesting for production monitoring
- Latency distributions for proposals acceptance
- Distributions of network latencies
- Amount of time acceptors spend on durable logging
- Overall bytes accepted per second
- Chapter concludes by saying, "if you remember nothing else from this chapter", one should keep in mind the sorts of problems that distributed consensus can be used to solve and the types of problems that can arise from using ad hoc methods such as heartbeats.
Chapter 24: Distributed Periodic Scheduling with Cron
- Cron: a unix tool designed for launching arbitrary periodic jobs, user-defined times OR intervals.
- Chapter describes a distributed cron system
- Can be deployed on a small number of machines
- Can launch cron jobs across "an entire datacenter" (while dealing with a central scheduling system like Borg)
- A few notable reliability aspects of a (non-distributed) cron service
- A simple cron serviceâs failure domain is one machine.
- The only state that needs to survive across crond restarts is the crontab configuration.
- As cron launches are fire-and-forget, we don't have to track whether the launches succeed!
- Exception: anacron attempts to launch jobs (eg. maintenance jobs) that would have been launched while the system was down.
- Cron jobs may come in all shapes and sizes
- idempotent
- eg: GC, cleanups
- side-effectful
- eg: sending out email newsletters
- (non-)time-pressured
- eg: GC may skip one launch
- Skipping a cron job is generally better than risking a double run
- Cron jobs should be monitored!
- Hosting a cron service on a single machine can spell out reliability catastrophe.
- To address these reliability concerns, we should decouple processes from machines.
- Ilmo's note: This means that there could be a separate service that decides which machine will run the cron job.
- For a distributed cron service, there are two options to track the state of cron jobs.
- Storing data externally in generally available distributed storage.
- Ilmo's note: postgresql, GFS, EFS, for example
- :+1: Storage support for very large files/blobs may be better
- :-1: Small writes on a distributed filesystem are very expensive, comes with high latency cost
- Using a system that stores a small volume of state as part of the cron service (Ilmo's note: or should it say as part of the machine where the cron job is run?).
- :+1: No extra dependencies
- Should be considered especially when the cron job failing would have wide impact
- Paxos recommended for a distributed cron service.
- This gives us strong consistency guarantees.
- Leader replica actively launches cron jobs (only)
- Completion of a cron job launch synced to other replicas
- Leader election needs to happen within a 1-minute threshold to avoid missing or delaying a cron job launch
- Every cron job launch has two sync points
- When does the launch happen?
- When does the launch finish?
- "These two points allow us to delimit the launch." :thinking_face: :thinking_face:
- To reduce missed launches or double launches, we should meet (one of) following things when a leader replica dies:
- all operations have to be idempotent
- How to mitigate, book example: ahead-of-time known job names
- observability to see if the requests stemming from a cron job launch all succeeded
- Chapter also describes how to handle a continuous log of state changes, couple things to keep in mind:
- Log will need to be compacted to prevent it from growing indefinitely
- Snapshotting (Ilmo's note: think event snapshots) can be an effective tactic here.
- Log itself needs to be stored somewhere
- External distributed storage
- not desirable due to high frequency of small writes
- Small local volume
- faster, but may result in data loss (Google deemed this acceptable for their own distributed cron service)
- Paxos also helps here, if one machine loses the logs, we can still retain data from the replicas
- faster, but may result in data loss (Google deemed this acceptable for their own distributed cron service)
- The thundering herd problem
- A cron service can cause substantial load spikes due to many concurrent cron jobs spawning HTTP calls.
- Google came up with their own crontab format where they allow for use of a question mark to make the system decide the value randomly, eg 0 0 * * ?
- Ilmo's note: basically this is adding more jitter to the system, once again.
- Ilmo's note: Colleague chimed in to say that you can do this inside the application using randomized sleep!
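A sketch of the in-application version the colleague mentions: derive a stable per-job offset (or just sleep a random amount) before firing, so jobs scheduled for the same minute don't all hit shared services at once. Purely illustrative; the names and the window size are mine.

```python
import hashlib
import random
import time

def jitter_seconds(job_name, window_s=300, deterministic=True):
    """Spread job start times over a window to avoid a thundering herd."""
    if deterministic:
        # Stable per-job offset: the same job always fires at the same point in the window.
        digest = hashlib.sha256(job_name.encode()).digest()
        return int.from_bytes(digest[:4], "big") % window_s
    return random.uniform(0, window_s)

def run_cron_job(job_name, job_fn):
    time.sleep(jitter_seconds(job_name))
    job_fn()
```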
Chapter 25 - Data processing pipelines
- Data pipeline
- Write a program that reads data in
- Transform it in some desired way
- Output new data
- Sometimes scheduled
- Data pipelines have historically been known as coroutines, DTSS communication files, UNIX pipes, and later also ETL pipelines.
- Simple, one-phase pipelines
- Performs periodic or continuous transformation on big data
- Multiphase pipelines
- Organized into a chained series
- This is for ease of understanding rather than operational efficiency.
- Depth of a pipeline
- Number of programs chained together
- Periodic pipelines are
- generally stable when there are sufficient workers for the volume of data AND execution demand is "within computation capacity".
- useful and practical.
- fragile!
- initially reliable performance-wise.
- at scale, there can be problems, eg.
- jobs that exceed their run deadline
- resource exhaustion
- hanging processing chunks (which incur corresponding operational load)
- "Embarrassingly parallel" algorithms
- Cuts workload into small-enough chunks to fit onto individual machines (see the sketch after this list).
- "Because the customer is the point of indivisibility, end-to-end runtime is thus capped to the runtime of the largest customer."
- "Hanging chunk" problem
- Caused by uneven resource assignment due to differences between machines in a cluster, or by overallocation to a job.
- Typical user code waits for the total computation to complete, which delays completion time.
- Responding to this problem (after detection) can make things worse.
- Eg. killing a job will lead to processes starting all over again.
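- A minimal sketch (illustrative only) of cutting work into fixed-size chunks and processing them in a worker pool, so one oversized customer doesn't dictate end-to-end runtime; `records` and `process_chunk` are hypothetical:
```python
from concurrent.futures import ProcessPoolExecutor

def run_pipeline(records, process_chunk, chunk_size: int = 1000, workers: int = 8):
    # Fixed-size chunks instead of one chunk per customer keeps the largest
    # unit of work small and roughly uniform across workers.
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_chunk, chunks))
```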
- Execution cost is inversely proportional to requested startup delay, and directly proportional to resources consumed.
- Excessive use of a batch scheduler places jobs at risk of preemptions when cluster load is high because other users are starved of batch resources.
- As the scheduled execution frequency increases, minimum time between executions can quickly reach the minimum average delay point.
- This places a lower bound on the latency that a periodic pipeline can expect to attain.
- Distinction between batch scheduling resources vs production priority resources should be made.
- The "thundering herd" problem (again)
- Given a large enough periodic pipeline, thousands of workers can immediately start work.
- If there are too many workers, or if workers are misconfigured or otherwise experiencing problems in sync, then the underlying shared cluster services and networking infrastructure may be overwhelmed.
- If retry logic is not implemented, correctness problems can result when work is dropped upon failure.
- Naive retry logic can compound the problem.
- Adding more workers to a pipeline when a job fails (within a time period) may compound the problem.
- Buggy pipelines at scale (10k workers) are always hard on the infra!
- Moiré load pattern
- Two or more pipelines run simultaneously and their execution sequences occasionally overlap.
- Can cause them to simultaneously consume a common shared resource.
- Can occur in continuous pipelines as well
- Less common when load arrives more evenly
- Can best be observed through pipeline usage of shared resource.
- Workflow as Model-View-Controller Pattern
- Distributed system equivalent of MVC pattern
- Model "or" Task Master
- Uses system prevalence pattern to hold all job states in memory while synchronously journaling mutations to persistent disk.
- Can have task groups
- This is where the work can happen that corresponds to a pipeline stage.
- View
- Workers that continually update the system state transactionally with the master according to their perspective as a subcomponent of the pipeline.
- Workers should be completely stateless and ephemeral.
- Controller
- Optional
- Supports a number of auxiliary system activities that affect the pipeline, eg. runtime scaling of the pipeline, snapshotting, workcycle state control, rolling back pipeline state.
- Ilmo's note: There was a LOT of explanation here regarding this pattern. If you're interested in structuring code in this way architecturally, you should read this chapter for yourself. A very reduced sketch follows below.
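- A very reduced sketch of the Task Master idea only (not the full pattern described in the chapter), assuming a local journal file and hypothetical task payloads:
```python
import json

class TaskMaster:
    def __init__(self, journal_path: str):
        self.tasks = {}                      # task_id -> {"state": ..., ...}
        self.journal = open(journal_path, "a")

    def _journal(self, entry: dict):
        # Synchronously journal every mutation so state can be rebuilt on restart.
        self.journal.write(json.dumps(entry) + "\n")
        self.journal.flush()

    def add_task(self, task_id: str, payload):
        self.tasks[task_id] = {"state": "pending", "payload": payload}
        self._journal({"op": "add", "task": task_id})

    def lease(self):
        # Stateless workers ("views") lease pending work and report back.
        for task_id, task in self.tasks.items():
            if task["state"] == "pending":
                task["state"] = "leased"
                self._journal({"op": "lease", "task": task_id})
                return task_id, task["payload"]
        return None

    def complete(self, task_id: str, result):
        self.tasks[task_id].update(state="done", result=result)
        self._journal({"op": "complete", "task": task_id})
```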
- Big data pipelines need to continue processing despite failures of all types.
Chapter 26: Data Integrity
- Data integrity definition
- Data integrity is whatever users think it is. (Assuming users come first)
- Measure of the accessibility and accuracy of the datastores needed to provide users with an adequate level of service.
- Services in the cloud must remain accessible to users; user access to data is especially important, so this access should remain in perfect shape.
- Every service has independent uptime and data integrity requirements, explicit or implicit.
- Secret to superior data integrity is proactive detection and rapid repair and recovery.
- On choosing a strategy for superior data integrity
- All strategies trade uptime against data integrity with respect to affected users.
- Most cloud computing apps seek to optimize for some combo of
- uptime
- latency
- scale
- service's volume of users and mixture of workloads the service can handle before latency suffers or the service crashes
- velocity
- how fast a service can innovate to provide users with superior value at reasonable cost
- privacy
- (definition in this chapter is simplified): data must be destroyed within a reasonable time after users delete it
- On backups
- Traditionally, companies protect data against loss by investing in backup strategies.
- Real focus of such backup efforts should be data recovery, which distinguishes real backups from archives.
- (Archives safekeep data for long periods of time to meet auditing, discovery, and compliance needs.)
- When formulating a backup strategy
- Consider how quickly you need to be able to recover from a problem (Ilmo's note: not mentioned, but this is RTO or Recovery Time Objective) and how much recent data you can afford to lose (Ilmo's note: this is RPO or Recovery Point Objective)
- Making backups is a classically neglected, delegated, and deferred task of system administration.
- Teams shouldn't be made to just "practise" their backups, but instead:
- Teams should define SLOs for data availability (using a variety of failure modes)
- Example: for deletion anomalies, alert when the global data deletion rate across all users crosses an extreme threshold (10x the observed p95); see the sketch below
- Teams should practise and demonstrate their ability to meet these SLOs
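- A minimal sketch of the deletion-anomaly check described above; the 10x-of-p95 threshold comes from the example, everything else (the shape of the historical data, the daily granularity) is assumed:
```python
import statistics

def deletion_anomaly(historical_daily_deletions, deletions_today) -> bool:
    # Needs at least a couple of days of history; quantiles(n=20) gives 19
    # cut points, index 18 is approximately the 95th percentile.
    p95 = statistics.quantiles(historical_daily_deletions, n=20)[18]
    return deletions_today > 10 * p95
```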
- Some cloud environment requirements
- If the environment uses a mixture of transactional and nontransactional backup and restore solutions, recovered data won't necessarily be correct.
- If services must evolve without going down for maintenance, different versions of business logic may act on data in parallel.
- Incompatible versions of different services may interact momentarily.
- APIs also have to be mindful of the following
- Data locality and caching
- Local and global data distribution
- Strong and/or eventual consistency
- Data durability, backup, and recovery
- Failure mode factors
- Scope
- Narrow/directed ~ widespread
- Example: delete or corrupt data specific to a small subset of users
- Example: all files are lost from S3
- Rate
- Big bang event ~ creeping
- Example: single SQL query deletes all data
- Example: distributed application logic contributes to a creeping null value in Snowflake
- You need different strategies for different failure mode combinations
- Example: guarded logic against creeping data loss won't necessarily help you with a datacenter fire.
- Study: most common user-visible data loss scenarios involved data deletion or loss of referential integrity (caused by bugs)
- Most challenging variants involved low-grade corruption/deletion discovered weeks/months after
- Point-in-time recovery
- Very useful
- Replication and redundancy are not recoverability.
- Read replicas and backups may contain the same corrupted data.
- Can be mitigated via nonserving copies, maybe exporting to another file format. Not safe from "lower layer" data problems.
- Media isolation such as tapes may protect from media flaws (eg. bug in disk device driver)
- Note that tape restorations may need programmatic solutions.
- Defending against many failure modes through layers:
- 1st layer: defend against user and developer errors
- Example: Soft/Lazy deletion (see the sketch after this list)
- Example: Revision history
- Gets destroyed after a reasonable delay, eg: 15/30/45/60 days (if possible, not any longer)
- Architecture should prevent developers from circumventing soft deletion
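- A minimal sketch of soft deletion with a delayed purge; the `deleted_at` field and the 30-day delay are illustrative assumptions, not the book's implementation:
```python
from datetime import datetime, timedelta, timezone

PURGE_DELAY = timedelta(days=30)

def soft_delete(record: dict):
    # Hidden from users immediately, but not yet destroyed.
    record["deleted_at"] = datetime.now(timezone.utc)

def purge_if_due(record: dict) -> bool:
    deleted_at = record.get("deleted_at")
    if deleted_at and datetime.now(timezone.utc) - deleted_at > PURGE_DELAY:
        return True   # only now is it safe to destroy the data for real
    return False
```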
- 2nd layer: backups and related recovery methods
- As a preface: backups don't matter, what matters is recovery.
- Questions to ask
- Which backup and recovery methods to use
- How frequently you establish restore points by taking full or incremental backups of your data
- Can be costly in terms of money and computation
- Computation cost can be mitigated by taking full backups during off-peak hours and incremental backups during busier times
- Where you store backups
- How long you retain backups
- Further
- Are your backups valid and complete? Or empty?
- Does the recovery process complete in a reasonable amount of time?
- Do you have sufficient monitoring available for state of recovery?
- Are you able to perform the recovery during any time (or during a period that you deem it necessary)?
- Rehearse your backups…
- …using automation
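- A minimal sketch of an automated restore rehearsal; `restore_backup`, `count_rows`, and the `created_at` attribute (a Unix timestamp) are hypothetical hooks into your own backup tooling:
```python
import time

def verify_latest_backup(restore_backup, count_rows, max_age_seconds=86400) -> bool:
    backup = restore_backup()                    # restore into a scratch instance
    if time.time() - backup.created_at > max_age_seconds:
        return False                             # backup too old: RPO at risk
    return count_rows(backup) > 0                # guards against "empty" backups
```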
- 3rd layer: Regular data validation
- Bad data doesn't sit idly by, it propagates.
- The faster you know about data being fanned out to another service's RDS instance or to data analytics, the better.
- Recommended to validate "invariants" whose violation would cause devastation to users (vs super strict data validation, which will be abandoned by developers); see the sketch after this list.
- Troubleshooting failed validations can take significant effort.
- Ability to drill down into validation "audit logs" is essential (using playbooks, basic observability tooling, and a data validation dashboard)
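- A minimal sketch of an out-of-band invariant validator (checking referential integrity between hypothetical `orders` and `users` collections), run on a schedule rather than in the serving path:
```python
def validate_referential_integrity(orders, users) -> list:
    # Invariant: every order must reference an existing user.
    user_ids = {u["id"] for u in users}
    violations = [o["id"] for o in orders if o["user_id"] not in user_ids]
    # Emit violations to a dashboard / audit log rather than failing silently.
    return violations
```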
- Overarching layer: Replication
- Replication of every storage instance not necessarily feasible
- Here it's better to choose a continuously battle-tested, popular scheme
- General principles of SRE as applied to data integrity
- Have a beginnerâs mind
- Trust but verify, apply defense in depth.
- Check correctness of the most critical elements of your data using "out-of-band" data validators even if API semantics suggest that you don't need to do so.
- Doesn't mean putting a new hire in charge of an important data pipeline.
- Hope is not a strategy
- Proving that data recovery works via automation.
- Ending words
- Recognize that not just anything can go wrong, but everything will go wrong.
Chapter 27: Reliable product launches at scale (notes)
- Google has a special team within SRE called Launch Coordination Engineers (or LCEs)
- They facilitate a smooth launch process in a couple of ways
- They audit products and services for internal reliability compliance & best practises, provide some action to improve reliability
- They liaise between multiple teams involved in a launch
- They drive technical aspects of a launch, make sure momentum is kept
- They act as gatekeepers, but also sign off on launches that are deemed "safe"
- They educate developers on best practises and how to integrate with the organization's (eg. Google's) services
- Held to the same standards as SREs
- Expected to have strong communication and leadership skills
- Can expect to mediate between "disparate" parties, resolving potential conflicts, guiding/coaching/educating, while working towards a common goal (= the launch)
- A launch is any new code that introduces an externally visible change to an application
- According to this definition, up to 70 launches per week were measured at times.
- An LCE team offers the following advantages
- Breadth of experience
- Working across multiple teams, LCEs have the advantage of learning about most products inside the org.
- LCEs make for great vehicles of knowledge transfer
- Cross-functional perspective
- LCEs can have a holistic view of the launch, enabling them to coordinate among different "disparate" teams
- Can be important for complicated launches, eg. multiple teams on multiple timezones
- Objectivity
- Ideally, LCEs are "nonpartisan advisors" between different stakeholder groups (SREs, product developers, product managers, marketing, etc.)
- Despite striving for objectivity, LCEs are incentivized to prioritize reliability over other concerns.
- In another company, incentive structure could be different
- Ilmo's note: for example, another company could favor being first-to-market over reliability.
- Launch process can be characterized with some criteria:
- Lightweight: easy on devs
- Robust: catches obvious errors
- Thorough: addresses important details consistently and reproducibly
- Scalable: accommodates both a large number of simple launches and fewer complex launches
- Adaptable (Ilmo's note: not really sure what it means in this context)
- Tactics that can achieve the above criteria:
- Simplicity: getting the basics right, don't plan for every eventuality
- High-touch approach: experienced engineers customize the process to suit each launch
- Fast common paths: identify classes of launches that always follow a common pattern (eg. launching in a new country/region), provide a simplified launch process for this
- LCEs have a launch checklist for "launch qualification"
- Example of a checklist entry:
- Question: Are you storing persistent data?
- Action item: Make sure you implement backups. Here are instructions for implementing backups.
- Each checklist entry has to abide by the following criteria
- Each question is there to prevent a past mistake or is otherwise "proven" in battle
- Every instruction has to be concrete, practical, and reasonable to accomplish by a dev.
- New checklist entry should focus around a broad theme such as reliability, failure modes, or processes.
- Growth of the list had to be curbed by employing a rigorous review process (eg. top leadership reviews)
- List is reviewed once/twice a year to identify obsolete items
- Infra/Tool/Software standardization (such as Kubernetes or having a unified logging system) can help simplify launch checklists
- Checklist themes
- Architecture and dependencies
- Are you using shared infra correctly?
- What is the request flow like?
- Are there different types of requests with different latency requirements?
- Integration with internal ecosystem
- Do you have a DNS name for your service?
- Is your service running through the API gateway correctly?
- Ilmo's note: Are you integrating with monitoring & logging correctly?
- Capacity planning
- Is this launch tied to a press release, advertisement, blog post or other form of promotion?
- How much traffic and rate of growth do you expect during and after the launch?
- Have you obtained all the compute resources needed to support your traffic?
- Failure modes
- Do you have any Single Points of Failure in the design?
- How do you mitigate unavailability of your dependencies?
- Client behavior
- Does the client have any auto-save, auto-complete, or heartbeat functionality?
- Processes and Automation
- Are there any manual processes required to keep the service running?
- Development process
- Ilmo's note: Review the development & review process.
- Ilmo's note: Review the delivery process.
- External dependencies
- What 3rd party code, data, services, or events does the service or the launch depend upon?
- Do any partners depend on your service? If so, do they need to be notified of your launch?
- What happens if you or the vendor can't meet a hard launch deadline?
- Rollout planning
- Have a plan that involves all the stakeholdersâ opinions on how to launch.
- Good to have "contingency measures"
- This doesn't need to be complicated. Communicating the failure is often enough.
- Ilmo's note: sometimes you may need to plan how to roll things back and also when/how to communicate about it.
- Selected techniques for reliable launches
- Gradual and staged rollouts
- Eg: Canary testing, rate-limited signups + user-driven invite system
- Almost all updates at Google done gradually using appropriate verification steps
- Feature flag frameworks
- The following requirements should be met (see the flag sketch after this list):
- Roll out many changes in parallel, each to a few servers, users, entities, or datacenters
- Gradually increase to a larger but limited group of users usually between 1 and 10 percent
- Direct traffic through different servers depending on users, sessions, objects, and/or locations
- Automatically handle failure of the new code paths by design, without affecting users
- Independently revert each such change immediately in the event of serious bugs or side effects
- Measure the extent to which each change improves the user experience
- Feature flag frameworks can be divided into two classes:
- They drive user interface improvements.
- They support arbitrary server-side and business logic changes.
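- A minimal sketch of a percentage-based flag check, which covers the "gradually increase to 1-10 percent of users" requirement; flag storage, config push, and metrics are out of scope and assumed to exist elsewhere:
```python
import hashlib

def flag_enabled(flag_name: str, user_id: str, rollout_percent: float) -> bool:
    # Hashing the flag name plus user ID gives a stable bucket, so the same
    # users stay in the rollout as it grows from 1% to 10% and beyond.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000        # stable bucket in [0, 10000)
    return bucket < rollout_percent * 100       # e.g. 1.0 -> 1% of users
```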
- Dealing with abusive client behavior
- Important tool: Ability to control client from the server-side.
- Eg: client is forced to download config file from the server.
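- A minimal sketch of that control mechanism: the client periodically pulls a config from the server that can throttle or disable expensive behavior; the URL, fields, and polling interval are illustrative only:
```python
import json
import time
import urllib.request

def fetch_client_config(url="https://example.com/client-config.json") -> dict:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

def sync_loop(apply_config, poll_seconds=300):
    while True:
        try:
            apply_config(fetch_client_config())   # e.g. lower sync frequency
        except Exception:
            pass                                  # keep the last known-good config
        time.sleep(poll_seconds)
```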
- Overload behavior and load tests
- Ilmo's notes: Try to bring the service to its knees and see how the service AND the services around it respond
- Launch reviews became common practise days-to-weeks before launch of many new products
- These are also known as Production Reviews.
- Ilmo's note: kinda reminds me of the security reviews / threat modelling sessions we do for each epic.
- Problems LCE couldnât solve
- Launching a small new service was difficult.
- Services that grew to a larger scale faced a unique set of problems LCEs could not solve.
- Final words
- LCE team was Googleâs solution to the problem of achieving safety without impeding change.
Chapter 28: Accelerating SREs to On-Call and Beyond
- You've hired your next SRE(s), now what?
- Now you have to train them on the job!
- Successful SRE teams are built on trust
- Trust in your fellow on-callers …
- … to know how the system works.
- … to be able to diagnose atypical system behaviors.
- … to be comfortable with reaching out for help.
- … in being able to react under pressure to "save the day".
- Additionally, you need to ask the following questions
- How can my existing on-callers assess the readiness of the newbie for on-call?
- How can we harness the enthusiasm and curiosity in our new hires to make sure that existing SREs benefit from it?
- What activities can I commit our team to that benefit everyone's education, but that everyone will like?
- They should also exhibit the following characteristics
- Strong reverse engineering skills
- Ability to think statistically, rather than procedurally
- When standard operating procedures break down, ability to improvise fully.
- There is no style of education that works best to train new SREs.
- You need to come up with your own course content that embodies the above characteristics, in addition to other attributes specific to your SRE team.
- SRE education practises
- Recommended patterns vs. anti-patterns
- Design concrete, sequential learning experiences to follow vs. Menial work, eg. alerts or ticket triage; "trial by fire"
- Trial by fire approach presumes that many or most aspects of a team can be taught strictly by doing, rather than by reasoning.
- Encourage reverse engineering, statistical thinking, working from first principles vs. Training strictly through the "manual", checklists, or playbooks
- Celebrating the analysis of failure through writing and sharing postmortems vs. Treating outages as secrets
- Creating contained but realistic breakages to fix vs. Encountering a typical problem for the first time during on-call
- Roleplaying disasters vs. Creating experts who can fix any issue
- Enabling students to shadow their on-call rotation early vs. Pushing students into being primary on-call before they achieve a holistic understanding of their service
- Pairing students with expert SREs to revise targeted sections of the on-call training plan vs. Treating on-call training plans as static and untouchable except by subject matter experts
- Giving partial ownership by having students implement nontrivial work on the service vs. Awarding all nontrivial work to senior SREs
- Even a minor sense of ownership will help here.
- Training activities and practises should be appropriately paced.
- Any type of training is better than random tickets and interrupts.
- Make a conscious effort to combine the right mix of theory and application.
- Start giving hands-on experience ASAP.
- Consider a starting point for learning your stack, eg:
- How a request enters a system
- How is the frontend served?
- How is load balancing set up? Is there any caching?
- What comprises the infrastructure?
- What are typical debugging techniques, escalation procedures, and ways to recover (from mild to disaster recovery)?
- Chapter recommends writing an on-call checklist containing the following information:
- "Frontended by"
- Webapp(s) listed
- "Backends called"
- "SRE experts"
- Ilmo's note: could be the owning team here as well.
- "Developer contacts"
- Ilmo's note: in a more DevOps~y organization, these could be the same people as the SRE experts.
- "Know before moving on"
- Which clusters is the service deployed in?
- How to roll the service (and its DB) back?
- Any critical paths and rationale for said critical paths?
- Note that this isn't a playbook and doesn't aim to cover everything.
- Instead, it focuses strictly on listing the expert contacts, highlights the most useful documentation that exists for the service, establishes basic knowledge to internalize, and asks probing questions to verify you have the knowledge to move onward.
- Getting a sense of how much information the trainee is retaining: a good idea
- Getting feedback about this doesn't have to be formal.
- Tiered access model
- First tier: read-only access to service (in production?)
- Later tier: write access to service in production
- Also known as "powerups" on the route to on-call
- Good practise starter project patterns
- Make a trivial user-visible feature change, walking it all the way to production.
- Add monitoring to your service for any blind spot that you've encountered.
- Automate a pain point to eliminate toil.
- Five practises for aspiring on-callers
- Read and share postmortems
- When writing a postmortem, keep in mind that its most appreciative audience might be an engineer who hasn't yet been hired.
- Collect your best postmortems and make them prominently available for your newbies.
- "Postmortem reading clubs"
- Presentations on outages
- Use as basis for wheel of misfortune rehearsals
- Disaster roleplaying
- Wheel of misfortune
- In a 30-to-60-minute session, the primary and secondary on-call attempt to root-cause the issue. The GM observes and can intervene by explaining further details to keep the scenario moving.
- Break real things, fix real things
- Have one instance you can divert from live traffic and temporarily loan to a learning exercise.
- Try to break the stack from a known good configuration, observe how the systems break upstream & downstream.
- Documentation as apprenticeship
- An on-call learning checklist
- Must be internalized before moving on to shadow an on-caller.
- Helps establish boundaries of the system their team supports.
- Student gains an understanding what systems are most important and why.
- Can be used as an adaptable body of work to keep everyoneâs knowledge fresh (both newbie and senior knowledge).
- Shadow on-call early and often
- After going through the fundamentals, consider copying the alerts of a senior SRE to the newbie as well (during business hours first).
- Tip: co-author the postmortem with the newbie rather than writing it yourself.
- Do reverse shadowing: the senior watches the newbie act as primary and provides active support, help, validation, hints, etc.
- Some teams do final exams before giving new SREs "full access".
- In the end, on-call is a rite of passage and it should be celebrated by the team.
Chapter 29: Dealing with interrupts
- Operational load
- Pages
- Includes Expected Response Time (SLO), measured in minutes
- Managed by a dedicated primary on-call engineer
- Can be escalated to another team member if a problem isn't understood
- Tickets
- Could have an SLO, measured in hours/days/weeks
- Can be managed by primary, secondary, or a dedicated ticket person not on-call
- Should not be assigned randomly to team members.
- Processing tickets is a full-time role.
- Ilmo's note: This seems to relate to customer support requests or otherwise automated production incident tickets and doesn't have to do with tickets inside our Jira system necessarily (which may also contain stories with feature development).
- Ongoing operational activities
- Eg: Flag rollouts, answering support questions
- Managing person may vary
- How to manage interruptions? Some useful metrics.
- Interrupt SLO or expected response time.
- Number of interrupts that are usually backlogged.
- Severity of interrupts.
- Frequency of interrupts.
- Number of people available to handle a certain kind of interrupt (at Google, some people require a certain amount of ticket work before going on-call).
- For most stressed-out on-call engineers, stress is caused either by pager volume, or because they're treating on-call as an interrupt.
- These engineers exist in a state of constant interruption (or interruptability) which is extremely stressful.
- Each engineer has an individual preference over how much project work vs on-call work they are comfortable with (and they might not know it themselves).
- Assign a cost to context switches.
- A 20-minute interruption while working on a project entails two context switches; realistically, this interruption results in a loss of a couple hours of truly productive work.
- Polarize time: a person should know whether they are working on just project work or just interrupts.
- For any given class of interrupt, if the volume of interrupts is too high for one person, add another person.
- On-call
- Primary should focus only on on-call work.
- During quiet times, tickets or other non-critical interrupt-based work should be done instead.
- An on-caller doesn't participate in or progress project work; account that person out during planning.
- If there's important project work to be done, don't let that person be on-call.
- A secondary may do project work.
- Based on agreement within the team, they could also support the primary in case of high pager volume.
- Don't spread ticket load across the entire team; this causes context switches for everyone.
- There should be a handoff for tickets (same way as for on-call work).
- It's good to regularly look at tickets and try to examine classes of interrupts to see if a common/root cause can be identified.
Chapter 30: Embedding an SRE to Recover from Operational Overload
- One way to relieve burden: temporarily transfer an SRE into the overloaded team
- SRE focuses on improving the team's practises instead of simply helping the team empty the ticket queue
- Usually transferring one SRE will suffice
- Guidance for embedded SREs
- Phase 1: Learn the service and get context
- Articulate why processes and habits contribute to / detract from the service's scalability
- Remind the team that more tickets should not equal more SREs (unless complexity rises)
- If teams are sliding into ops mode, shadow an on-call session to understand more.
- If a tiny service is suffering from ops mode: focus on the ways in which the team's current approach prevents them from improving service reliability.
- If a service that is just getting started is suffering from ops mode: focus on ways to prepare the team for explosive growth.
- Overwhelming perceived pressure may contribute to sliding into ops mode.
- Identify the team's largest existing problems, then emergencies that may happen. Potential sources of emergencies:
- Subsystems that arenât designed to be self-managing
- Knowledge gaps
- Services developed by SRE that are quietly increasing in importance
- Strong dependence on "the next big thing" (eg: "The new architecture will change EVERYTHING, better not do anything for now.")
- Common alerts that aren't diagnosed by either the dev team or SREs
- Any service that gets complaints from clients and lacks a formal SLO (or SLA)
- Any service where capacity planning happens often
- Postmortems for services where the only action items are rolling back the specific change causing the outage
- Service which nobody wants to own (or which devs own one-sidedly)
- One should at least understand the consequences of the service breaking and the urgency if it breaks.
- Phase 2: Sharing context
- Write a good postmortem for the team
- Take ownership of the next postmortem using a blameless tone
- Sort fires according to type
- Categories: (1) fires that should not exist, (2) fires that cause stress
- Sort the fires into toil and not-toil buckets
- Phase 3: Driving change
- Start with the basics
- Write SLOs (if they don't exist)
- Get help clearing kindling
- Resist the urge to fix these yourself.
- Instead:
- Find useful work accomplishable by anyone in the team.
- Explain usefulness, say, to address a postmortem action item.
- Review the code yourself
- Repeat!
- Explain your reasoning
- Build a mental model for the team by explaining your thinking where possible.
- Ask leading questions
- Ask questions in a way that encourages people to think about the basic principles.
- Eg: "I see that the TaskFailures alert fires frequently, but the on-call engineers usually don't do anything to respond to the alert. How does this impact the SLO?"
- Final phase: after-action report
- A postvitam report where you may explain the critical decisions at each step that led to success.