Talks

The first Service Level Objective Conference for Site Reliability Engineers

Agile & DevOps Walk into a Bar

Beyond Theory

Melissa Boggs

@Sauce Labs

Ryan Lockard

@Cognizant

Tune in to hear an Agility Exec and a DevOps Exec talk about the intersection of agile, DevOps, and metrics over a virtual "beer". In this 10-minute convo, we chat about the definitions of DevOps and agile and how metrics can play a part in showing leadership and teams where they can improve. Are your metrics acting as a window or a mirror?

Applying SLOs to Infrastructure and Compliance as Code

SLOcializing

Matt Ray

@Chef

Audits, compliance, and security are top of mind for most enterprises, while configuration management is not something most executives consider. Management teams are focused on reaching their business targets, but operations is the engine that helps the organization achieve its goals. Developers and operators need to align their goals with the business, and Service Level Objectives (SLOs) help focus these efforts and raise visibility. Configuration management _is_ important, but it needs to be part of an SLO for delivering reliable infrastructure quickly and efficiently. Security and passing audits are important too, and we need to understand our exposure to risk by attaining high levels of compliance. This session will provide examples of making those goals visible through SLOs, with examples drawn from the open source Chef and InSpec projects.

A Year of SLO Bootcamps

New To SLOs

Kit Merker

@Nobl9

In this talk, I'll share what I've learned in the last year leading a hands-on SLO bootcamp for a variety of cross functional teams. You'll learn a proven strategy for helping teams get over the hump of a first SLO and how to drive a scalable organizational and cultural change to the SLO-based way of thinking. With COVID, I had to adapt my SLO Bootcamp to being online only, and this forced me to focus on just the essentials, increase interactivity, and ensure the course was of value to all the participants. I'll go over resources you can use to run your own SLO Bootcamp too!

Benchmarking SLOs Using Chaos Engineering

Technical and Deep Dive

Uma Mukkara

@ChaosNative

SLOs are the visible results that SREs need to maintain in any operation. Recently, SLOs are increasingly being applied to pre-production CI/CD pipelines. If the pre-production setup is close to production, the resilience of such a setup can be tested by introducing chaos into the pipeline and measuring the SLOs. In this talk, we discuss techniques to introduce chaos testing as a trigger to CD and as a post-CD action in production or pre-production. The audience will see an example chaos stage in action in a cloud-native CI/CD pipeline, how Prometheus-based SLIs are used to measure SLOs over a given period of time, and how this benchmarking informs decisions to trigger continuous deployments. The takeaway for SREs is using chaos testing as a tool to measure SLO-based resilience, and how this can be automated using declarative config and GitOps.
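
The gating logic described here can be sketched in a few lines. The metric names, query window, and 99.5% availability target below are hypothetical examples, not details from the talk:

```python
# Sketch: gate a deployment on a Prometheus-measured SLI after a chaos stage.
# Metric names and the 99.5% SLO are illustrative assumptions.

def availability_query(window: str = "10m") -> str:
    """Build a PromQL expression for the success-ratio SLI over the chaos window."""
    return (
        f'sum(rate(http_requests_total{{code!~"5.."}}[{window}])) '
        f'/ sum(rate(http_requests_total[{window}]))'
    )

def gate_deployment(measured_sli: float, slo: float = 0.995) -> bool:
    """Promote only if the SLI measured during chaos still meets the SLO."""
    return measured_sli >= slo

# Example: 99.7% of requests succeeded while chaos was injected -> promote.
print(gate_deployment(0.997))  # True
print(gate_deployment(0.990))  # False
```

In a GitOps setup, the boolean result of such a check would decide whether the CD tool promotes the candidate to the next stage.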

Creating Great Dev Culture through Error Budgets

New To SLOs

Sal Kimmich

@Reliably

In the most basic definition, error budgets are simply the amount of error that a service can accumulate over a specified period of time before users grumble about the experience. While many organizations introducing error budgets treat them as just another metric for system quality control, there is huge utility in incorporating error budgets as a fundamental part of your developer culture around trust and timely innovation: with the critical autonomy provided to engineers in this working paradigm, the development team can spend their error budget however they feel is right, whether in prevention or cure of system instabilities. In this talk, we will cover common combinations of SLIs that lead to error budget best practices, as well as protocols that can be enacted when error budgets slip: the who, what, when, and why of pre-incident reporting.
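
The arithmetic behind that definition is simple; a minimal sketch, assuming an illustrative 99.9% availability target over a 30-day window:

```python
# Turning an SLO target into an error budget for a 30-day window.
SLO = 0.999                # 99.9% availability target (example value)
PERIOD_MIN = 30 * 24 * 60  # 30 days expressed in minutes

budget_fraction = 1 - SLO              # the fraction of time allowed to fail
budget_minutes = budget_fraction * PERIOD_MIN

# 0.1% of 43,200 minutes = 43.2 minutes of unreliability to "spend".
print(f"Error budget: {budget_minutes:.1f} minutes per 30 days")
```

The team decides how to spend those minutes: risky deploys, chaos experiments, or deliberate downtime for maintenance.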

Defining a Maturity Model for SLOs

SLOcializing

Yury Niño Roa

@ADL Digital Labs

Service Level Objectives, or SLOs, are a quantitative contract that describes the expected service behavior. They are often used by organizations to prioritize the reliability, availability, coverage, and other service-level indicators of their software systems. Based on what I have learned defining and implementing SLOs, I have discovered that they are valuable when they are used to build feedback loops along two axes: adoption and automation. SLOs are a process, not a project, which imposes a need for a framework that helps organizations adopt a culture based on SLOs. In this talk, I present a framework for determining the level of adoption and automation of SLOs. Based on questions about the amount of convincing required across engineering, operations, product, leadership, legal, and quality assurance, we determine the level of adoption. On the other side, considering aspects such as established and documented measurements, the level of user-centric metrics, observability strategies, and reporting toolsets, we determine the level of automation.

Defining SLOs: A Practical Guide

Technical and Deep Dive

Matthias Loibl

@Polar Signals

Frederic Branczyk

@Polar Signals

SLOs often seem simple in theory but tend to get difficult when actually implementing them, as reality is often not by the textbook. SLOs are an invaluable tool for both engineers and management to consistently communicate reliability with data. Defining bad SLOs can also be harmful, so it’s important to keep various caveats in mind. SLOs are not only about data; it is equally important to clarify and evangelize expectations of SLOs within an organization. Frederic and Matthias have many years of experience defining SLOs for many services and components. Together they will demonstrate real-life examples of choosing, measuring, alerting on, and reporting SLOs based on Prometheus metrics. Join this talk to learn how to implement SLOs successfully using data you most likely already have.

Don't be a victim of your own success

Beyond Theory

Mick Roper

@Reliably

There is a downside to creating a service level that is too high, especially if you are able to achieve it! Systems that have exceedingly high uptime often cause a disproportionate impact when they inevitably fail, since the users of those systems are unprepared for the incoming disruption to their workflow. In this talk I discuss how to create an appropriate SLO that attempts to find a balance between service excellence and management of expectations. The talk covers product management and systems architecture, looking at how the design of a system can be used to maintain an SLO, and how 'disruptive engineering' (chaos engineering, fire drills, continuous deployment, release strategy, etc.) can be used to test and utilise an SLO. From a product management perspective, I cover the conversations that need to happen between product ownership and engineering to explain the need for 'downtime'. This is often an area of the product owner/engineering relationship that is fraught with difficulty, since product ownership wants the best possible service from a user perspective, while engineers take a more risk-averse attitude towards service provisioning.

Error Economics: How to avoid breaking the budget

Technical and Deep Dive

Simon Aronsson

@k6

It’s scary to release to production, especially if you don’t know if your system is performing within your quality SLOs. Using error budgets and testing at scale as quality gates in your release cycle, you’ll be able to gain much-needed confidence about the risk-level associated with your release. Using open-source tools, we’ll set up a test, generate the necessary load to run it at scale and make sure we stay on budget. After attending this talk, attendees will: - Have an understanding of what error budgets are and how they are measured. - Know how to use them as indicators of service quality. - Know how to create their first high-concurrency test using a load generator and how to set it up with acceptance thresholds based on their error budget.
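
The quality-gate idea can be sketched as follows. The request volume and the 99.9% SLO below are illustrative assumptions; a real setup would express this as acceptance thresholds in the load generator itself:

```python
# Sketch: using the error budget as an acceptance threshold for a load test.

def load_test_passes(total_requests: int, failed_requests: int,
                     slo: float = 0.999) -> bool:
    """Fail the release gate if the test burned more than the budgeted error rate."""
    allowed_failures = total_requests * (1 - slo)
    return failed_requests <= allowed_failures

# 1M requests at a 99.9% SLO allow up to ~1000 failures.
print(load_test_passes(1_000_000, 800))   # True: within budget, release
print(load_test_passes(1_000_000, 1500))  # False: over budget, block release
```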

Evaluate Application Resilience with Chaos Engineering and SLOs

Technical and Deep Dive

Jürgen Etzlstorfer

@Dynatrace

SLOs are not only a great way to efficiently measure the availability and quality of production environments but should also be used to ensure the resilience of applications before production as part of chaos engineering. While many organizations start with ad-hoc chaos experiments in production to validate the impact on SLOs it is more efficient to bake these tests and checks into the continuous delivery process. In this session, we give you practical guidance on “chaos stages” as part of your continuous delivery to validate the compliance with your production SLOs prior to entering production. As a showcase we are demoing a chaos enriched delivery orchestration with the CNCF projects LitmusChaos (for chaos experiments) and Keptn (for orchestration of automated load testing and SLO validation).

From Availability to User Happiness: An Introduction to SLOs That Matter

New To SLOs

Michael Ericksen

@Intelligent Medical Objects

This talk tells the story of an engineering team that finds themselves in a quasi-incident for a web application that runs inside of Electronic Health Record (EHR) systems like Epic and Cerner. The engineering dashboard for the application showed uptime at 100%. Users, however, paused their implementation timelines because of poor application performance. As an organization, we were measuring the wrong thing. In this talk, I will tell the story of how an engineering team pivoted from measuring availability to key application behaviors for their end users to dramatically improve user satisfaction.

Fundamentals for improving customer experience

Beyond Theory

Meghan Jordan

@Datadog

Service level objectives (SLOs) help you understand the health of your systems and how your end users experience them. You're not likely to achieve desired results if you're not basing decisions on useful data, which means that poorly defined SLIs (using the wrong metrics) and SLOs (defining the wrong targets) could cause worse outcomes for your users. In this talk we’ll cover how SLOs help you make more informed decisions. You’ll learn how to get started with SLOs and choose the right service level indicators to meet your customers’ expectations.

GitLab’s journey to SLO Monitoring

Technical and Deep Dive

Andrew Newdigate

@GitLab Inc.

This talk covers GitLab's adoption of SLO monitoring, from our previous causal alerting strategy, which had outgrown its purpose as the complexity and traffic volumes grew, to our early attempts, building and maintaining configuration, and the problems that brought about, to our current, declarative approach. The talk will cover the challenges of getting buy-in from engineering, operations and product stakeholders, the benefits of having a common language of availability across the organisation and our future plans. This is a deep-dive, practical talk; all the code and configuration for GitLab.com's monitoring infrastructure is open-source, and the talk will include links to these resources. The talk is based on a talk I did at ScaleConf 2020, which received good feedback.

Infrastructure Comes out of the Wall, No One Cares How

New To SLOs

Richard Hartmann

@Grafana Labs

You care about your service and how it works internally, your users do not. Your water, electricity, and Internet come out the wall, and if they stop doing that, you call someone to complain. That's how you should think about your services, and we'll explore this thought more.

Introduction to SLO Alerting and Monitoring

Technical and Deep Dive

Niall Murphy

@SRE Book

Super simple rehearsal of the "SLO alerting" chapter from the book, with a worked example.

Just Say No (to Dashboards): You Don’t Need More Information, You Need the Right Information

New To SLOs

Zac Nickens

@Boxboat

Abby Bangser

@Duffel

This talk will illustrate the differences between signal and noise in monitoring efforts. Engineers shouldn't sit watching dashboards; they should be improving existing features and developing new ones, and dashboard/metrics fatigue prevents engineers from living their best lives. The talk will trace a journey from too many dashboards to identifying the signals that are most meaningful for a team, and adopting an SLO approach to reduce signal fatigue.

Left Shift your SLOs

SLOcializing

Michael Friedrich

@GitLab

Everyone talks about security shifting left in your CI/CD pipeline. Tools and cultural changes enable teams to scale and avoid deployment problems. But SLOs are left out: what if a software change triggers a regression and your production SLOs fail? As a developer, you want to detect these problems as early as possible. This talk dives deep into CI/CD pipelines and discusses ideas for calculating and matching SLOs in the development lifecycle, early in your Pull or Merge Request review.

Lessons from Failure: How to Fail and Still Succeed

Beyond Theory

Dan Wilson

@Control Plane Corporation

I worked at Concur on infrastructure, operations, and engineering as it grew from a few users to millions. Over the years, I was witness to many failures across the stack and caused a handful of issues myself. In this talk, I'll walk through some of the most brutal and customer-impacting failures that I saw or caused, and highlight the core principles I learned after surviving these stressful situations.

Management & Governance in the Cloud

Technical and Deep Dive

MaSonya Scott

@AWS

MaSonya Scott, Principal Specialist at AWS, will teach you to: - Articulate the importance of cloud operations to achieve & scale adoption of AWS - Understand the AWS Management & Governance value proposition to infrastructure operations for customers - Provision and manage AWS resources in a standardized & secure fashion - Evaluate cloud operations leveraging ITSM tools integrated with AWS Management & Governance services

No More Theater: Building SLO Culture Without the Bullsh*t

SLOcializing

Zac Nickens

@Boxboat

Using SLO culture to break down silos, empower engineers, and drive user (and engineer) happiness. Using real-life examples from unnamed orgs, I will highlight the pitfalls and traps of "theater" and "fiefdoms" and how SLO culture can be used to break down barriers to high performance and high happiness.

Production Load Testing as a Guardrail for SLOs

Technical and Deep Dive

Hassy Veldstra

@Artillery.io

Production load testing (yes you read that right!) can be an excellent technique for building an extra buffer of safety around your SLOs. We will cover: - Using existing SLOs to prioritize the areas of the system to test - Using existing SLOs to run production load tests safely - Putting SLOs on the load tests themselves This talk draws on the author's experience of implementing production load testing for building a margin of safety around SLOs at a large international publisher.

Production Readiness Review: Providing a Solid Base for SLOs

New To SLOs

Milan Plžík

@Grafana Labs

It's hard to propose a good SLO for a new service with little mileage. Even for a service that has been running for years, it's hard to gain confidence that the SLO won't be impacted if the service scales 10x. We'll have a look at the Production Readiness Review process, which seeks to identify and remove common pitfalls and already-learned mistakes through a focused review, strengthening confidence in the defined SLO. The process was originally developed at Google (https://sre.google/sre-book/evolving-sre-engagement-model/); at Grafana Labs, we've tailored the process to our needs, which is what this talk will discuss.

Service Level Overkill - SLO In a World of SOA

New To SLOs

Mick Roper

@Reliably

Service levels are excellent for understanding the limits you put on your own services, but in a world of web services your own ability to create a useful SLO is impacted by everything you depend upon. In this chat I discuss how to understand SLOs from other teams, how to try to mitigate SLO impact and how to deal with it when it happens. I also talk about what a low SLO means, and why it shouldn't be assumed that you need 9 9's of availability to offer a useful service!

Should SLOs Be Request-Based or Time-Based? And Why Neither Really Works…

Technical and Deep Dive

Björn Rabenstein

@Grafana Labs

Once you've gotten somewhat familiar with SLOs, you probably realized that time-based SLOs aren't really fair to most users. It doesn't help you if your ISP gives you perfect connectivity while you are asleep but always goes down during that important weekly video conference. Or in other words: a time-based SLO means free uptime whenever your service isn't used. Clearly a request-based SLO is much better: it measures what matters, and now an outage during peak time will consume your error budget much more quickly. If this talk were on the “New To SLOs” track, we would stop here. But since this is on the “Deep Dive” track, we need to go deeper. Let's explore a few common scenarios to see how a request-based SLO sometimes exaggerates and sometimes masks problems with your service, and what we can do about it.
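
The contrast can be made concrete with a toy calculation; the traffic numbers below are invented for illustration:

```python
# The same 1-hour outage, scored two ways. The outage falls at peak traffic.

DAY_MIN = 24 * 60   # minutes in the measurement window
outage_min = 60     # length of the incident

# Time-based: every minute counts the same, asleep or awake.
time_based_availability = 1 - outage_min / DAY_MIN

# Request-based: the outage hit 20% of the day's requests because it fell at peak.
total_requests = 1_000_000
failed_requests = 200_000
request_based_availability = 1 - failed_requests / total_requests

print(f"time-based:    {time_based_availability:.4f}")    # ~0.9583
print(f"request-based: {request_based_availability:.4f}")  # 0.8000
```

The time-based number looks almost fine; the request-based number reflects what users actually experienced.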

SLIs, SLOs, and Error Budgets at Scale

Technical and Deep Dive

Fred Moyer

@Zendesk

How can one democratize the implementation of SLIs, SLOs, and Error Budgets to put them in the hands of a thousand engineers at once? At Zendesk we developed simple algorithms and practical approaches for implementing SLIs, SLOs, and Error Budgets at scale using a number of observability tools. This talk will show the approaches developed and how we were able to manage observability instrumentation across dozens of teams quickly in a complex ecosystem (CDN, UI, middleware, backend, queues, dbs, etc.). This talk is for engineers and operations folks who are putting SLIs, SLOs, and Error Budgets into practice. Attendees will come away with concrete examples of how to communicate and implement Error Budgets across multiple teams and diverse service architectures.

SLO Basics - a conversation about reliability

New To SLOs

Keri Melich

@Nobl9


SLO — From Nothing to… Production

Beyond Theory

Ioannis Georgoulas

@Paddle.com

The focus of this talk will be on how I educated myself about SLOs and how I applied this to my organization. I will present my biggest learnings, such as how adopting an SLO mindset is definitely a marathon. I will present my SLO journey, and more specifically: what I read and did to learn more about SLOs, how I got buy-in from the appropriate stakeholders, how important internal advocacy of SLOs is, and how we built an SLO "framework". On the SLO framework, I will cover what tools we use to build our SLIs, where we store the SLO docs, how we implement burn rate alerting, and how all these fit together in a scalable and extendable way. The last part will be learnings from our SLOs and ways of working with the product teams to define their SLOs.
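
As a sketch of the burn rate alerting mentioned here: burn rate is how many times faster than "exactly on budget" you are consuming the error budget. The window pairing and the 14.4x threshold are common published defaults (from the Google SRE Workbook), used as assumptions rather than details from the talk:

```python
# Multiwindow burn-rate alerting sketch.

def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return error_rate / (1 - slo)

def page(short_window_rate: float, long_window_rate: float,
         slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window burn fast (reduces flapping)."""
    return (burn_rate(short_window_rate, slo) >= threshold
            and burn_rate(long_window_rate, slo) >= threshold)

# A 2% error rate at a 99.9% SLO is a ~20x burn -> page.
print(page(0.02, 0.02))   # True
print(page(0.02, 0.001))  # False: the long window is still healthy
```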

SLO Math

New To SLOs

Steve McGhee

@Google

It's the architecture, not the products or infrastructure, that matters. How to think about your dependencies and how their SLOs affect your own. ⛓ Chained services: SLO = SLO ^ depth. ⛷ Parallel isolated services: SLO = min(SLOs). 🤹‍♂️ Redundant parallel services: much better, roughly the SLO of the load balancer "above".
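
Assuming independent failures, the composition rules above can be sketched as:

```python
# SLO composition sketch, assuming independent failures.

def chained(*slos: float) -> float:
    """Serial dependencies: availabilities multiply, so reliability drops with depth."""
    result = 1.0
    for s in slos:
        result *= s
    return result

def redundant(*slos: float) -> float:
    """Redundant parallel replicas: the service fails only if all replicas fail at once."""
    p_all_fail = 1.0
    for s in slos:
        p_all_fail *= (1 - s)
    return 1 - p_all_fail

# Three chained 99.9% services yield only ~99.7%...
print(f"{chained(0.999, 0.999, 0.999):.4f}")
# ...while two redundant 99.9% replicas yield ~99.9999%,
# so the effective SLO is bounded by the load balancer "above" them.
print(f"{redundant(0.999, 0.999):.6f}")
```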

SLOs As One Course in the Full Reliability Tasting Menu

Technical and Deep Dive

Jacob Scott

@Stripe

SLOs can help us understand our reliability, but they aren’t magic beans. In this talk I’ll explain what they aren’t good for (spoiler: catastrophes). Embracing the fact that SLOs are an incomplete approach to reliability lets us use them in composition with other approaches to better wrangle with the end-to-end reliability of our (complex, socio-technical) systems. I’ll also discuss how techniques from modern safety science (“resilience engineering”) can pair well with SLOs. You’ll leave this talk curious about how these techniques can help you address the concrete reliability challenges you face in your systems today.

SLOs at Facebook

New To SLOs

Posten A

@Facebook

Scaling SLOs at Facebook to planetary scale using SLICK - a purpose built centralised SLO store integrated into key observability systems.

SLOs Beyond Production Reporting - Automate Delivery & Operations Resilience

SLOcializing

Andreas Grabner

@Dynatrace

SLOs are a great vehicle to assess the quality-of-service situation in production. SLO initiatives typically start by standardizing reporting based on SRE practices and assessing the risk of impacting business-relevant SLAs when deploying changes into production. But don’t stop there. SLOs are much more powerful when applied beyond the obvious production reporting use case. In this session we advocate for SLO-driven engineering, which takes SLOs and enforces them as part of continuous delivery, where SLOs act as fast developer feedback and quality gates, leading to higher quality code making it to production. It also expands into automating runbooks and building self-healing platforms, where continuous SLO validation leads to better closed-loop auto-remediation, resulting in more stable production environments despite an increased frequency of change. Join us and see the potential of taking SLOs out of production and spreading them across delivery and operations automation.

SLOs for climate: How to Continuously Reduce the Climate Impact of Tech Services

New To SLOs

Benoit Petit

@Hubblo

Site Reliability Engineering’s goal is to ensure that the software systems and services created in an organization evolve easily and, especially, are extremely reliable. There are several definitions of reliability, one being: “reliability is the ability of a system to fulfill a mission under defined conditions, for a given period of time”. This definition allows us to redefine the conditions that dictate whether the system actually fulfilled its mission over the given period of time. As the tech industry has to lower its greenhouse gas emissions by 45% in the next 10 years to meet the Paris Agreement objectives, it seems essential to me that a tech service or system is considered reliable not only if it satisfies the client in the short term, but also if it doesn’t contribute to jeopardizing the client’s future. That means, obviously, that it has to respect smartly defined objectives regarding the GHG emissions related to its very existence and usage. In this talk we'll see what we can do right now to use these methods, not only to create business value, but for our future too.

SLOs for Production Grade Kubernetes

Technical and Deep Dive

Bhargav Bhikkaji

@Tailwinds

We all know that cloud native platforms, and especially Kubernetes, are hard to operate. Wouldn't it be great to have a list of SLIs/SLOs to understand whether our Kubernetes platform is fine or not? As a cloud native consultant, I have helped many organizations kick-start and manage their Kubernetes journey, and I would like to share experiences on the important SLOs they monitor for their production-grade Kubernetes.

SLOs For Quality Gates In Your Delivery Pipeline

Technical and Deep Dive

Andreas Grabner

@Dynatrace

SREs use SLOs to ensure production is stable and changes from development are not impacting SLAs. Error budgets are a great way to decide whether we can still deploy or not. But every deployment carries the risk of impacting critical SLOs, eating up the error budget faster than planned and eventually leading to a slowdown of innovation. In this session we demonstrate how to use the concept of SLOs as part of continuous delivery to validate the impact of code or configuration changes before it’s time to deploy to production. This gives developers faster feedback on the potential impact of their code changes, increases the quality of code that makes it to the gates of production, and therefore results in less impact when the actual production deployment happens. We will demo this approach using the open source project Keptn’s SLO-based Quality Gate capability.

SLOs for VPs: What They Give You, What They Cost

New To SLOs

Niall Murphy

@SRE Book

Targeting "VP-style" audience, explain that SLOs kinda look like KPIs, but they're used to make resourcing decisions rather than provide pure visibility. Worked example. 10m.

SLOs & Observability - better together

Beyond Theory

Liz Fong

@Honeycomb

We built support for SLOs (Service Level Objectives) against our event store so we could monitor our own complex distributed system. In the process of doing so, we learned that there were a number of important aspects that we didn’t expect from carefully reading the SRE workbook. This talk is the story of the missing pieces, unexpected pitfalls, and how we solved those problems. We’d like to share what we learned and how we iterated on our SLO adventure. As an SLO advocate and a design researcher, we collected user feedback through iterative deployments to learn what challenges users were running into. This conversation will discuss how we iterated on our design based on user feedback; how we deployed, what we learned, and re-deployed; and how we collected information from our users and from the alerts our system fired. We will discuss how we brought the theory of SLOs to practice, and what we learned that we hadn’t expected in the process. We’ll discuss implementing the SLO feature and burn alerts, and our experiences from working with the SRE team who started using the alerts. Our hope is that when you buy or build your SLO tools, you’ll know what to look for and how to get started; that implementors will be able to start on more solid ground; and that we will be able to advance the state of SLO support for all teams that wish to implement them. The major design points will be broken into a discussion of what we actually built, a number of unexpected technical features, and ways we had to educate users beyond the standard SLO guidelines. The talk is largely conceptual: no live code will be shown, although some innocent servers may well die in the process of being visualized.

Standardizing SLOs as Code

Technical and Deep Dive

Andreas Grabner

@Dynatrace

Netflix’s Kayenta, Pivotal’s Indicator, and T-Systems’ Performance Signature were early implementations of SLOs as code. As more SLO solutions emerge on the market, it’s time to think about open standards for SLO definitions as code, agree on how SLOs should be validated, and standardize where the data (SLIs) comes from, so users can easily switch from one observability platform to another without having to rethink or redefine SLOs. We need to follow in the footsteps of the observability community, which standardized on OpenTelemetry, making things easier for users and opening up new opportunities for tool vendors! In this session we want to start this discussion by introducing the open source efforts of the Keptn project around SLOs. Keptn uses a declarative approach to SLIs and SLOs, with a clear separation between data providers, SLO definition, and SLO validation.
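
As a sketch of what a declarative SLO-as-code definition can look like, loosely modelled on Keptn's SLO file format; the SLI names, thresholds, and the toy evaluator below are illustrative assumptions, not the project's actual schema or code:

```python
# A declarative SLO definition as data: it names SLIs and criteria but says
# nothing about where the data comes from, so data providers stay swappable.
slo_definition = {
    "spec_version": "1.0",
    "objectives": [
        {"sli": "response_time_p95", "pass": [{"criteria": ["<600"]}]},
        {"sli": "error_rate", "pass": [{"criteria": ["<0.01"]}]},
    ],
    "total_score": {"pass": "90%"},
}

def evaluate(definition: dict, measured: dict) -> bool:
    """Toy validator: check each measured SLI against its '<threshold' criteria."""
    for obj in definition["objectives"]:
        value = measured[obj["sli"]]
        for gate in obj["pass"]:
            for criterion in gate["criteria"]:
                if criterion.startswith("<") and value >= float(criterion[1:]):
                    return False
    return True

print(evaluate(slo_definition, {"response_time_p95": 420, "error_rate": 0.002}))  # True
print(evaluate(slo_definition, {"response_time_p95": 750, "error_rate": 0.002}))  # False
```

The point of the separation is that the same definition could be validated against SLIs fetched from any observability backend.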

Supporting tools/templates to guide your SLO journey

SLOcializing

Michael March

@Isos Technology

Your org has chosen to implement SLOs, awesome! Beyond the core tooling (monitoring, SLO measuring, etc.), this talk will quickly demonstrate concrete examples of tools and processes you can use to support your organization's implementation journey, soup to nuts.

Survival Guide: What I Learned From Putting 200 Developers On Call

Beyond Theory

Alina Anderson

@Outreach

We want to live in a world where the development team who writes the code, also owns that code’s success...or failure, in production. Nothing incentivizes a team to ship better quality software than getting paged at 2am, but how do we do this? In this talk, you’ll learn some tips and tricks for easing less than enthusiastic development teams into on-call rotations, how SRE facilitates the transition to production code ownership and why SLOs are critical to your success.

The Game of SLOs - A Three Part Reliability Musical

SLOcializing

Bart Enkelaar

@bol.com

Ever since the great success of important society-shaping documentaries like Cats, Wicked and Hamilton, it has been clear that music is the way to truly get a broad audience to accept new information. As SREs, evangelisation is often a core part of what we do, since it often revolves around convincing people to take a new approach to innovation. In this three-part musical, we'll describe the journey through SRE in a manner which is both recognisable and informative and as such should be directly applicable to change hearts and minds on reliability all across the world. Get out your ukulele and sing along!

The New Stack: What happens when an SLO goes wrong?

Beyond Theory

Alex Hidalgo

@Nobl9

Kristina Bennett

@Google

Niall Murphy

@SRE Book

Join TNS Founder and Publisher Alex Williams for a panel discussion exploring what happens when an SLO goes wrong. Panelists include: Kristina Bennett, editing contributor to "Building Secure and Reliable Systems" and "Implementing Service Level Objectives"; Niall Murphy, co-author of "Site Reliability Engineering" and "The Site Reliability Workbook"; and Alex Hidalgo, author of "Implementing Service Level Objectives".

The Psychology of Chaos Engineering

Beyond Theory

Julie Gunderson

@PagerDuty

Chaos Engineering, failure injection, and similar practices have verified benefits for the resilience of systems and infrastructure. But can they provide similar resilience to teams and people? What are the effects and impacts on the humans involved in the systems? This talk will delve into both positive and negative outcomes for all the groups of people involved, including users, engineers, product, and business owners.

The State of the Histogram

Technical and Deep Dive

Heinrich Hartmann

@Zalando

In this talk we are going to survey different available technologies to capture (latency) distributions and store them in time-series databases. This includes (a) the theoretical underpinnings (b) accuracy and performance and (c) operational aspects (d) adoption. Disclaimer: The author worked on openhistogram.io in the past.

Top 5 Real-life SLOs and Decision Tree to Define Your SLOs

Beyond Theory

Wolfgang Heider

@Dynatrace

The Google SRE theory already tells us what many confirm through their own SRE journey: it is a hard task to determine the most valuable SLOs for your system. Monitoring tools like Dynatrace provide over 2000 metrics with many filter options, and even more data is available through the integration of data sources like OpenTelemetry, SNMP, or any business data source. For SLOs, one needs to choose and focus on the important data. We looked at our customers adopting SLO monitoring in Dynatrace and present a hit list of the SLO types reported to us as important. We show what the setup of such SLOs looks like for both major categories of SLOs: real-user-traffic, request-count-based SLOs and synthetic availability monitoring SLOs. We propose a decision tree for getting from an idea to defined SLO configurations.

Unboxing Blackbox Monitoring for SLO

New To SLOs

Navya Dwarakanath

@Catchpoint Systems

You have read this in every SLO book and heard it in several talks: measure SLOs from the perspective of the end user. Measuring from the user’s perspective is not easy or straightforward, but it is the very basis of how effective your SLOs are. Learn why the user’s perspective is paramount, what makes blackbox monitoring effective, the blind spots it helps you cover, and how you can use it to define your SLOs.

Using Binomial proportion confidence intervals to reduce false positives in low QPS services

Technical and Deep Dive

Dylan Zehr

@Google

A description of how to use binomial intervals (specifically Wilson score intervals) to modify SLO metrics to reduce false positives in services with periods of low QPS. The description will cover some basic background on the statistical methods, some example graphs, and possibly an example of how to configure this using a common platform.
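
A minimal sketch of the Wilson lower bound applied to a success-ratio SLI; the traffic numbers are illustrative. With few requests, the raw ratio swings wildly, and the interval's lower bound is a more honest availability estimate that fires fewer false alarms:

```python
import math

def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the 95% Wilson score interval for the success proportion."""
    if total == 0:
        return 0.0
    p = successes / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom

# 1 failure in 10 requests: the raw SLI is 90%, but we can't rule out ~60%.
print(f"{wilson_lower_bound(9, 10):.3f}")
# 100 failures in 1000 requests: same raw SLI, much tighter lower bound (~88%).
print(f"{wilson_lower_bound(900, 1000):.3f}")
```

Alerting on the lower bound rather than the raw ratio means a single failed request during a quiet period no longer looks like an SLO breach.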

Using Observability to Set Good SLOs

Beyond Theory

Daniel “Spoons” Spoonhower

@Lightstep

While setting SLOs for externally visible services can be relatively straightforward, doing so for *internal* services can be more challenging. Teams can use current performance metrics to take a first stab at what internal service SLOs should be. While this lets them set realistic targets, it often means that they set objectives that are too high. In contrast, using distributed traces to understand how requests – and SLOs – flow through the application can help set SLOs that are looser (but not too loose). And not only does it help teams set better SLOs, it also helps them better understand which other SLOs their services depend on (and which depend on them). In this talk, I'll walk through a couple of examples to show how.

Weaknesses of the SLO Model

New To SLOs

Niall Murphy

@SRE Book

Kit will hate this, but it's probably worth spending 5 minutes on problems with the SLO model, and potential approaches to fixing them.