SRE: the good, the bad, and the ouch PREVAIL 2021 - An IBM Academy of Technology Conference © 2021 IBM Corporation

Speaker: Holly Cummins, Innovation Leader, IBM SPEED; IBM Garage alum; @holly_cummins

PREVAIL Technical Conference 2021 what is SRE? @holly_cummins

SRE: what ops would be like if it was done by software engineers

why SRE?

reliability is very important

old ops: manual, repetitive, siloed, not aligned to business goals, unable to handle complexity of cloud native

eliminate repetitive tasks

eliminate toil

aligned incentives

failure is a symptom, not a cause

devops?

SRE and DevOps: automate everything

SRE and DevOps: holistic + collaborative

what could possibly go wrong?

true story (IBM Garage): the cunning rebrand, “we’re SRE now”

SRE: what ops would be like if it was done by software engineers [figures: ops, software engineer]

SRE: what ops would be like if it was done by ops? [figure: ops]

true story: the cunning rebrand, “we are just as good: we have scripts”

what triggers the scripts?

how much contact do SRE have with dev? [figures: SRE, dev]

are there any SRE NFRs in the dev backlog?

true story: the cunning rebrand, “we do SRE … in silos”

I am not designed for this.

two war rooms

team mainframe: “we’re responsible for stability of the mainframe … as long as it’s used correctly”

team mobile: “we’re responsible for stability of the front end”

the ambassador

true story: dots aren’t connected, “we have a ticket per team, not per incident”

“we want to do SRE but we don’t have enough permissions on our systems”

“the DBAs don’t trust us”

“it takes us 15 minutes just to get permission to run a standard set of SQL diagnostic statements”

silos cost

true story: the gap between intent and reality, “we do post-mortems after every incident … maybe”

measure the number of incidents; measure the number of post-mortems; see if they match
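One way to "see if they match" is to compare the two lists directly. The sketch below is only illustrative: the incident IDs and the idea that each post-mortem records the incident it covers are assumptions, not something the slides specify.

```python
def incidents_missing_postmortems(incident_ids, postmortem_incident_ids):
    """Return the incident IDs that have no matching post-mortem, sorted."""
    return sorted(set(incident_ids) - set(postmortem_incident_ids))


def postmortem_coverage(incident_ids, postmortem_incident_ids):
    """Return the fraction of incidents that have at least one post-mortem."""
    incidents = set(incident_ids)
    if not incidents:
        return 1.0  # no incidents, so nothing is missing a review
    covered = incidents & set(postmortem_incident_ids)
    return len(covered) / len(incidents)


# Hypothetical data: four incidents, two post-mortems.
incidents = ["INC-1", "INC-2", "INC-3", "INC-4"]
postmortems = ["INC-1", "INC-3"]

print(incidents_missing_postmortems(incidents, postmortems))  # ['INC-2', 'INC-4']
print(postmortem_coverage(incidents, postmortems))  # 0.5
```

A coverage figure below 1.0 is the "maybe" in "we do post-mortems after every incident … maybe".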

advanced metrics: how many people were in the post-mortem? does it include more than the people directly involved? did we invite more than our own team?

true story: “no one says anything in our blameless post-mortems”

‘blameless’ post-mortem

if involvement in an incident is punished, people will avoid engaging with systems

“great idea, go build that!” if ideas are punished with extra work, people will try not to have ideas

true story: the perverse incentive, “we have success metrics”

metrics are good

SREs are data-driven

but …

as senior leaders, be careful what you incentivise

be careful what behaviours you discourage

true story: the perverse incentive, “we count how many incidents we have; if the number goes down, it means we are working better”

outstanding quality!

delivery excellence!

fewer people working → fewer incidents

new release → more incidents

what should you measure?

make work visible

true story: the email timesink, “we never seem to complete the work we planned”

1 sprint, theory: 50% unplanned work (tickets), 50% story points

1 sprint, reality: 50% tickets, 40% ??, 10% story points

“can you just …” “how do I do this?” “where is this documented?”

this wasn’t a team failure; it was a data quality issue; it was a process issue

track work; use the data to eliminate toil
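A minimal sketch of what "track work" could look like in practice: log time per category, then see how much of the sprint is unaccounted for. The category names, hours, and the `sprint_breakdown` helper are all hypothetical; the numbers are chosen to reproduce the 50% / 10% / 40% split from the sprint story.

```python
from collections import Counter


def sprint_breakdown(entries, capacity_hours):
    """Return each category's share of sprint capacity, plus an 'untracked' share.

    entries: iterable of (category, hours) pairs from whatever tracker is in use.
    """
    if capacity_hours <= 0:
        raise ValueError("capacity_hours must be positive")
    hours = Counter()
    for category, spent in entries:
        hours[category] += spent
    shares = {category: spent / capacity_hours for category, spent in hours.items()}
    # Time that was never logged shows up as the gap between capacity and tracked work.
    shares["untracked"] = max(0.0, 1.0 - sum(shares.values()))
    return shares


# Illustrative log for an 80-hour sprint.
log = [("tickets", 30), ("tickets", 10), ("story points", 8)]
for category, share in sprint_breakdown(log, capacity_hours=80).items():
    print(f"{category}: {share:.0%}")
```

The "untracked" line is the interesting one: it is the work ("can you just …", "how do I do this?") that never makes it into a ticket, and so never gets considered as toil to eliminate.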

measure blockers

mean time to failure? mean time to detect problems?

what is failure in a complex system? if a system goes down but user experience is fine, does that count?

measure “what have I learned”; measure “have I made sure it won’t happen again”

true client story: value on the shelf, “we can’t actually release this.”

what’s stopping more frequent deploys?

“it costs too much to release”: you can fix that

“we can’t ship until we have more confidence in the quality”: you can fix that

deferred wiring

feature flags
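As a rough illustration of the idea, a feature flag lets unfinished or unproven code ship dark and be switched on later without another deploy. Everything named here (`FeatureFlags`, `checkout_total`, the `new_discount` flag) is made up for the example; a real setup would more likely read flags from configuration or a flag service.

```python
class FeatureFlags:
    """Tiny in-memory flag store, for illustration only."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name):
        # Unknown flags default to off, so new code stays dark until enabled.
        return self._flags.get(name, False)

    def enable(self, name):
        self._flags[name] = True


def checkout_total(prices, flags):
    """Sum the prices; apply a 10% discount only when the flag is on."""
    total = sum(prices)
    if flags.is_enabled("new_discount"):  # hypothetical flag name
        total = round(total * 0.9, 2)
    return total


flags = FeatureFlags()
print(checkout_total([10.0, 20.0], flags))  # 30.0 (new code deployed but off)

flags.enable("new_discount")
print(checkout_total([10.0, 20.0], flags))  # 27.0 (released without a redeploy)
```

The point is the separation: deploying the code and releasing the behaviour become two different decisions.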

true client story: the monolithic microservices, “we can’t release this microservice… we deploy all our microservices at the same time… because otherwise nothing works.”

let’s talk about microservices

true client story: the peril of microservices, “every time we change code, something breaks”

just because a system runs across 6 containers doesn’t mean it’s decoupled

mars climate orbiter

Courtesy NASA/JPL-Caltech

metric units, imperial units: distributing did not help
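One common defence against this kind of mismatch is to make the unit part of the type, so a bare number can never cross a boundary between components. The sketch below is an assumption-laden illustration, not a description of the actual flight software: the `Impulse` type and its constructors are invented for the example, and impulse is used because the mismatch in question was between pound-force seconds and newton-seconds.

```python
from dataclasses import dataclass

# Newton-seconds per pound-force second (1 lbf = 4.4482216152605 N).
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse stored internally in newton-seconds, whatever the caller's unit."""

    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value):
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value):
        # Conversion happens once, at the boundary, and is visible in the call.
        return cls(value * LBF_S_TO_N_S)


def total_impulse(impulses):
    """Add impulses safely: every value is already in newton-seconds."""
    return Impulse(sum(i.newton_seconds for i in impulses))


# Two components reporting in different units can still be combined correctly.
readings = [
    Impulse.from_newton_seconds(10.0),
    Impulse.from_pound_force_seconds(1.0),
]
print(total_impulse(readings).newton_seconds)
```

Splitting a system into separate components does not remove this kind of coupling; the shared assumption about units still has to be made explicit somewhere.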

testing

Cluster + Ariane 5: $370 million loss https://en.wikipedia.org/wiki/Cluster_(spacecraft)

they tested it … but stubbed out one component. that component was the one that broke.

the ariane failed in 36 seconds; you can’t a/b test a $370 million rocket

testing will always be incomplete; aim for recoverability

resilience; recoverability

observability

they often couldn’t see the orbiter

feedback is good engineering

when SRE is right it is great

bank

remember this bank?

team mainframe: “we’re responsible for stability of the mainframe … as long as it’s used correctly”

team mobile: “we’re responsible for stability of the front end”

the ambassador

one team: mobile front-end, web front-end, back-end, another department …

one team, range of techniques: CI/CD pipelines, canary deploys; CI/CD pipelines, big-bang deploys onto AIX

by the way …

big bang deploys: 50% failure rate; canary deploys: 10% failure rate

industrial

remember the suspicious DBAs?

two root problems: • automation • trust and transparency

trigger automation via slack

because it was transparent, DBAs were happy and automated more things

what happens when things go wrong?

leadership need to provide a safety net.

celebrate success, celebrate failure

celebrate success, celebrate learning

transformation endurance

remember the why

better, safer, faster, happier

PREVAIL 2021 - An IBM Academy of Technology Conference. The information in this presentation is representative of the presenter and their views and opinions are not necessarily those of IBM or of the IBM Academy of Technology.