WEBINAR Best Practices for Monitoring: How to Combat Alert Fatigue

Hello!
● Developer Advocate for Logz.io
● a.k.a. “Recovering SRE”
● Worked in the IT community for 10 yrs, including ~5 years in “Cloud Engineering” / Infrastructure … i.e. SREing before it was cool (j.k. it was always cool)
● Mentors underrepresented minorities in tech in the Buffalo NY / Niagara Falls region
Quintessence Anx
Developer Advocate
quinn@logz.io
@QuintessenceAnx

Agenda
● Establish healthy thought patterns regarding:
○ Why to monitor
○ How to create a workflow around monitoring / alerts
○ How to set up and maintain boundaries for monitoring and its accompanying noise

When we try to know everything…

Too much noise can…
● …bury important / high severity alerts in a sea of low priority notices
● …causing engineering teams to start muting alarms or whole alarm sources
● …which in turn means the people who need to be notified won’t be.

Turning the dial back too far, however…

Let’s find a happy medium
All alerts are fictional.

What is the cost of noise?

Your brain on alerts

Time cost ~25 Minutes

Quality cost

Cost of multitasking

So how do we reduce the noise?

Be aware, not overwhelmed
● Determine the sources of noise
● Categorize the types of noise
● Channel the noise into a productive workflow
● Create a routine to clear the clutter

Sources of noise
● Logging / alert system
● Knowledge base
● Ticketing system
● Chat integrations
● Repetition
○ …and you

Wait, I need to be aware of myself? (Absolutely.)

How often do you…
● …check your email?
● …check your social media?
● …check your text messages?
● …check your Apple / Google messages?
● …the list goes on.

Communication & Boundaries
● Plan for set times to focus on your work and mute non-critical alerts
● This includes messages from friends & family
● When setting boundaries, make sure your friends, family, and co-workers know what you consider to be relevant emergencies
● Set reasonable expectations for yourself and others

What about external sources of noise?

Categorizing your noise
● False positives
● False negatives
● Fragility
● Frequency (just fix it)
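The “frequency” category is the easiest one to spot mechanically. As a minimal sketch (the function name, the alert names, and the threshold of 3 are illustrative assumptions, not something from the talk), counting how often each alert fired in a review period surfaces the “just fix it” candidates:

```python
from collections import Counter
from typing import Iterable, List


def frequent_alerts(fired: Iterable[str], threshold: int = 3) -> List[str]:
    """Return the names of alerts that fired at least `threshold` times.

    `fired` is a plain list of alert names, one entry per firing.
    The threshold is an arbitrary example value; tune it to your own noise.
    """
    counts = Counter(fired)
    return sorted(name for name, n in counts.items() if n >= threshold)
```

An alert that keeps showing up in this list points at an underlying problem to fix, not at a notification to mute.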

Save time by creating your noise flow
● What needs to be known
● Who needs to know it
● How soon should they know
● How should they be notified
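One way to make the four questions above concrete is to write the answers down as a routing table. The sketch below is only an illustration: the severity names, recipients, time limits, and channels are all hypothetical and would come from your own team’s agreements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    who: str                # who needs to know it
    notify_within_min: int  # how soon they should know, in minutes
    channel: str            # how they should be notified


# Hypothetical routing table keyed by "what needs to be known" (here: severity).
ROUTES = {
    "critical": Route(who="on-call engineer", notify_within_min=5, channel="page"),
    "warning": Route(who="owning team", notify_within_min=60, channel="chat"),
    "info": Route(who="owning team", notify_within_min=24 * 60, channel="ticket"),
}


def route_alert(severity: str) -> Route:
    """Look up the route for a severity.

    Unknown severities fall back to the low-urgency "info" route
    (an assumed policy) so that nobody is paged by an unclassified alert.
    """
    return ROUTES.get(severity, ROUTES["info"])
```

For example, `route_alert("critical").channel` gives `"page"`, while an unrecognized severity ends up as a ticket for the owning team.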

Re-evaluate redundancy
Know when to add a little complexity to stop a vacuum.

Resilient noise builds trust
● How reliable are your tools and services?
● How much notification duplication is needed?
● Do you have the ability to switch alert endpoints in the event of a service outage?
● Do you regularly evaluate the reliability of your services (external and internal)?
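Switching alert endpoints during an outage can be sketched as an ordered list of senders that is tried until one succeeds. This is a simplified illustration: `notify_with_fallback` and the sender callables are hypothetical stand-ins, not the API of any particular alerting tool.

```python
from typing import Callable, Iterable, Optional, Tuple

# A sender takes the message text and raises an exception if delivery fails.
Sender = Callable[[str], None]


def notify_with_fallback(
    message: str, endpoints: Iterable[Tuple[str, Sender]]
) -> Optional[str]:
    """Try each (name, sender) pair in order.

    Returns the name of the first endpoint that accepted the message,
    or None if every endpoint failed.
    """
    for name, send in endpoints:
        try:
            send(message)
            return name
        except Exception:
            # This endpoint is unavailable; fall through to the next one.
            continue
    return None
```

A `None` result means no endpoint could be reached, which is itself worth alerting on through some independent path.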

Keep alerts relevant with “sprint cleaning”
For every alert triggered, ask:
● Was the notification needed?
● How was the incident resolved?
● Can the solution be automated?
● Is the solution permanent?
● How urgently was a solution needed?
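The yes/no questions above can be recorded per alert and turned into a suggested follow-up. The mapping from answers to actions below is an assumption made for illustration, not a rule from the talk; the free-form questions (how it was resolved, how urgent it was) would live in the review notes.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    name: str
    notification_needed: bool   # Was the notification needed?
    fix_can_be_automated: bool  # Can the solution be automated?
    fix_is_permanent: bool      # Is the solution permanent?


def suggest_action(review: AlertReview) -> str:
    """Suggest one follow-up per reviewed alert (illustrative policy only)."""
    if not review.notification_needed:
        return "remove or downgrade the alert"
    if review.fix_can_be_automated:
        return "automate the fix"
    if not review.fix_is_permanent:
        return "schedule a permanent fix"
    return "keep as is"
```

Running this over every alert that fired during a sprint gives a short cleanup list to bring to the retrospective.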

Summing it up

Next steps
● Logz.io blog - Building Monitors You Can Trust: https://logz.io/blog/building-monitors/
● TechBeacon - How to use monitoring for innovation and resilience, not firefighting: https://techbeacon.com/app-dev-testing/how-use-monitoring-innovation-resilience-not-firefighting
Quintessence Anx
Developer Advocate
quinn@logz.io
@QuintessenceAnx

Questions

Thanks!