a blog about dev & hobbies

My notes on the SRE book

Foreword

I read this book and wrote these notes as part of an SRE book club that a colleague of mine helped set up. I have occasionally included my own comments as well; they start with "Ilmo" or "Ilmo's note". I hope you find them interesting!

Chapter 1: Introduction

Chapter 2: The production environment at Google, from the viewpoint of an SRE

Chapter 3: Embracing risk

Chapter 4: Service Level Objectives

Chapter 5: Eliminating toil

Chapter 6: Monitoring distributed systems

Chapter 7: The Evolution of Automation at Google

Chapter 8: Release Engineering

Chapter 9: Simplicity

Chapter 10: Practical Alerting

Chapter 11: Being On-Call

Chapter 12: Effective Troubleshooting

Chapter 13: Emergency response

Chapter 14: Managing Incidents

Chapter 15: Postmortem culture: Learning from Failure

Chapter 16: Tracking outages

Chapter 17: Testing for Reliability

Chapter 18: Software engineering in SRE

Chapter 19: Load Balancing on the Frontend

Chapter 20: Load balancing in the datacenter

Chapter 21: Handling overload

Chapter 22: Addressing cascading failures

Chapter 23: Managing critical state: Distributed consensus for reliability

Chapter 24: Distributed Periodic Scheduling with Cron

Chapter 25: Data processing pipelines

Chapter 26: Data Integrity

Chapter 27: Reliable product launches at scale (notes)

Chapter 28: Accelerating SREs to On-Call and Beyond

Chapter 29: Dealing with interrupts

Chapter 30: Embedding an SRE to Recover from Operational Overload

#sre