Resilience for Retail: A Tale Not About Ice Cream But Somehow Also About Ice Cream

Quintessence Anx DevOps Advocate @ PagerDuty @QuintessenceAnx

Don’t panic

Elevated response period

How to determine your Elevated Response Period

What support is needed

Build or Buy :: Make or Buy

⛔

v0 Architecture

Random Outage Graph

What, When, Where

Let’s Talk a Little About Resiliency Itself

A resilient system is a system that is able to withstand adversity.

Something is resilient if it is able to withstand adversity.

What can this look like?

Organizational Resilience can look like having the appropriate response structure(s) in place for IT systems, services, and users in the event of latency or an outage.

(IT) System Resilience can look like an application not going down, and/or autoscaling, in response to increased traffic.
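
One way to picture the autoscaling half of this is a proportional scaling rule, similar in spirit to the formula documented for the Kubernetes Horizontal Pod Autoscaler. The sketch below is illustrative only: the function name, the requests-per-replica metric, and the replica bounds are assumptions for the example, not something stated in the talk.

```python
import math


def desired_replicas(current_replicas: int,
                     observed_rps_per_replica: float,
                     target_rps_per_replica: float,
                     min_replicas: int = 2,
                     max_replicas: int = 20) -> int:
    """Proportional scaling: size the fleet so each replica sits near its target load.

    All parameters are hypothetical; tune the metric and bounds to the real system.
    """
    if current_replicas < 1 or target_rps_per_replica <= 0:
        raise ValueError("need at least one replica and a positive target")
    # Scale the replica count by the ratio of observed load to target load.
    raw = math.ceil(current_replicas * observed_rps_per_replica / target_rps_per_replica)
    # Clamp so a traffic spike cannot scale past what is actually provisioned,
    # and a quiet period cannot scale below a safe floor.
    return max(min_replicas, min(max_replicas, raw))
```

The upper bound matters as much as the formula: it keeps the scaling policy from assuming capacity that does not exist.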

Why is this important?

Response and Design

Resilient Response

Resilient Response Checklist
• Define elevated response
• Maximize experienced responders
  • Both primary and secondary
• Do not design around resources you do not have
• Minimize responder burnout
• Clear handoff procedures
• Clear ownership
• Dedicated, clear, responder roles
• Practiced response process
• Validate responder access to tools and data
• Updated documentation

Define elevated response

Maximize Experienced Responders

Do not design around resources you do not have

Responder Burnout

Clear Handoff Procedures

Clear Ownership

Dedicated, clear, responder roles

Practiced Response Process

Validate access to tools and data

Updated documentation

Resilient Response Checklist
• Define elevated response
• Maximize experienced responders
  • Both primary and secondary
• Do not design around resources you do not have
• Minimize responder burnout
• Clear handoff procedures
• Clear ownership
• Dedicated, clear, responder roles
• Practiced response process
• Validate responder access to tools and data
• Updated documentation

Resilient Design Checklist
• Build, test, secure with scalability in mind
• Build, test, secure with humans in mind
• Build, test, secure with redundancy and/or failover in mind
• Build, test, secure with operator control in mind
• Build, test, secure with observability in mind
• Automate as much as is feasible
• Keep documentation updated in pace with releases
• Do not design around resources you do not have
• Clear ownership
  • Who owns the service, writes the code, etc.

Build, test, secure

Build, test, secure: scalability

Build, test, secure: humans

Build, test, secure: redundancy / failover

Build, test, secure: operator control

Build, test, secure: observability

Automation

Updated documentation

Do not design around resources you do not have

Clear ownership

Resilient Design Checklist
• Build and test with scalability in mind
• Build and test with humans in mind
• Build and test with redundancy and/or failover in mind
• Build and test with security in mind
• Build and test with operator control in mind
• Build and test with observability in mind
• Automate as much as is feasible
• Keep documentation updated in pace with releases
• Do not design around resources you do not have
• Clear ownership
  • Who owns the service, writes the code, etc.

Practice with Ice Cream

Understand the Business

Resilient Response: Questions to Ask
• What cannot go wrong?
• What is at risk of going wrong?
• What responses are needed in each situation?
• Who is doing what step(s) in the response process(es)?
• Are we in an Elevated Response Period?
  • And are separate considerations for that period defined?
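
The question "Are we in an Elevated Response Period?" becomes easier to answer consistently if the periods are written down as data rather than remembered. Here is a minimal sketch, assuming each period is an inclusive date range. The class name, function name, and the example dates are hypothetical placeholders, not taken from the talk.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ElevatedPeriod:
    name: str    # hypothetical label for the period
    start: date  # first day of the period (inclusive)
    end: date    # last day of the period (inclusive)


def active_period(today: date, periods: list[ElevatedPeriod]) -> Optional[ElevatedPeriod]:
    """Return the first defined period that contains `today`, or None if there is none."""
    for period in periods:
        if period.start <= today <= period.end:
            return period
    return None


# Placeholder dates purely for illustration; real periods come from the business.
PERIODS = [ElevatedPeriod("summer promotion", date(2024, 7, 1), date(2024, 7, 14))]
```

A check like `active_period(date.today(), PERIODS)` could then decide which on-call schedule or escalation policy applies, which is where the separate considerations for that period would be attached.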

Resilient System: Questions to Ask
• How do we prevent “what cannot go wrong”?
• How do we mitigate risk for “what else can go wrong”?
• How do we support our response process(es)?
• How do we support our responders?
• How does an elevated response period impact our system?

Resiliency is not limited to IT systems and personnel

Resources & References: noti.st/quintessence

Questions? Quintessence Anx, DevOps Advocate, noti.st/quintessence