Beyond Theory Talks
The first Service Level Objective Conference for Site Reliability Engineers
Agile & DevOps Walk into a Bar
Beyond Theory | Melissa Boggs, Ryan Lockard
Tune in to hear an Agility Exec and a DevOps Exec talk about the intersection of agile, DevOps, and metrics over a virtual "beer". In this 10-minute conversation, we chat about the definitions of DevOps and agile and how metrics can show leadership and teams where they can improve. Are your metrics acting as a window or a mirror?
Don't be a victim of your own success
Beyond Theory | Mick Roper
There is a downside to setting a service level that is too high, especially if you are able to achieve it! Systems with exceedingly high uptime often cause a disproportionate impact when they inevitably fail, since the users of those systems are unprepared for the disruption to their workflow. In this talk I discuss how to create an appropriate SLO that balances service excellence against management of expectations. The talk covers product management and systems architecture, looking at how the design of a system can be used to maintain an SLO, and how 'disruptive engineering' (chaos engineering, fire drills, continuous deployment, release strategy, etc.) can be used to test and utilise an SLO. From a product management perspective, I cover the conversations that product ownership and engineering need to have in order to explain the need for 'downtime'. This is often an area of the product owner/engineering relationship that is fraught with difficulty, since product ownership wants the best possible service from a user perspective, while engineers take a more risk-averse attitude towards service provisioning.
Fundamentals for improving customer experience
Beyond Theory | Meghan Jordan
Service level objectives (SLOs) help you understand the health of your systems and how your end users experience them. You’re not likely to achieve the results you want if you’re not basing decisions on useful data, which means that poorly defined SLIs (using the wrong metrics) and SLOs (defining the wrong targets) could cause worse outcomes for your users. In this talk we’ll cover how SLOs help you make more informed decisions. You’ll learn how to get started with SLOs and choose the right service level indicators to meet your customers’ expectations.
Lessons from Failure: How to Fail and Still Succeed
Beyond Theory | Dan Wilson
I worked at Concur on infrastructure, operations and engineering as it grew from a few users to millions. Over the years, I witnessed many failures across the stack and caused a handful of issues myself. In this talk, I'll walk through some of the most brutal and customer-impacting failures that I saw or caused and highlight the core principles I learned from surviving these stressful situations.
SLO — From Nothing to… Production
Beyond Theory | Ioannis Georgoulas
The focus of this talk is how I educated myself about SLOs and how I applied this to my organization. I will present my biggest learnings, such as the fact that adopting an SLO mindset is definitely a marathon. I will present my SLO journey, and more specifically: what I read and did to learn more about SLOs, how I got buy-in from the appropriate stakeholders, why internal advocacy of SLOs is so important, and how we built an SLO "framework". On the SLO framework, I will cover what tools we use to build our SLIs, where we store the SLO docs, how we implement burn rate alerting, and how all of these fit together in a scalable and extendable way. The last part covers learnings from our SLOs and ways of working with the Product teams to define their SLOs.
SLOs & Observability - better together
Beyond Theory | Liz Fong
We built support for SLOs (Service Level Objectives) against our event store so we could monitor our own complex distributed system. In the process of doing so, we learned that there were a number of important aspects that we didn’t expect from carefully reading the SRE workbook. This talk is the story of the missing pieces, unexpected pitfalls, and how we solved those problems. We’d like to share what we learned and how we iterated on our SLO adventure. As an SLO advocate and a design researcher, we collected user feedback through iterative deployments to learn what challenges users were running into. We will discuss how we iterated on our design based on user feedback; how we deployed, learned, and redeployed; and how we collected information from our users and from the alerts our system fired. We will also discuss how we brought the theory of SLOs into practice, and what we learned in the process that we hadn’t expected. We’ll cover implementing the SLO feature and burn alerts, and our experiences working with the SRE team who started using the alerts. Our hope is that when you buy or build your SLO tools, you’ll know what to look for and how to get started; that implementors will be able to start from more solid ground; and that we will be able to advance the state of SLO support for all teams that wish to implement them. The major design points will be broken into a discussion of what we actually built, a number of unexpected technical features, and ways that we had to educate users beyond the standard SLO guidelines. The talk is largely conceptual: no live code will be shown, although some innocent servers may well die in the process of being visualized.
Survival Guide: What I Learned From Putting 200 Developers On Call
Beyond Theory | Alina Anderson
We want to live in a world where the development team that writes the code also owns that code’s success, or failure, in production. Nothing incentivizes a team to ship better-quality software than getting paged at 2am, but how do we do this? In this talk, you’ll learn some tips and tricks for easing less-than-enthusiastic development teams into on-call rotations, how SRE facilitates the transition to production code ownership, and why SLOs are critical to your success.
The New Stack: What happens when an SLO goes wrong?
Beyond Theory | Alex Hidalgo, Kristina Bennett, Niall Murphy
Join TNS Founder and Publisher Alex Williams for a panel discussion exploring what happens when an SLO goes wrong. Panelists include: Kristina Bennett, editing contributor of "Building Secure and Reliable Systems" and "Implementing Service Level Objectives"; Niall Murphy, co-author of "Site Reliability Engineering" and "The Site Reliability Workbook"; and Alex Hidalgo, author of "Implementing Service Level Objectives".
The Psychology of Chaos Engineering
Beyond Theory | Julie Gunderson
Chaos Engineering, failure injection, and similar practices have verified benefits for the resilience of systems and infrastructure. But can they provide similar resilience to teams and people? What are the effects and impacts on the humans involved in the systems? This talk will delve into both positive and negative outcomes for all the groups of people involved, including users, engineers, product, and business owners.
Top 5 Real-life SLOs and Decision Tree to Define Your SLOs
Beyond Theory | Wolfgang Heider
The Google SRE theory already tells us what many confirm from their own SRE journey: it is a hard task to determine the most valuable SLOs for your system. Monitoring tools like Dynatrace provide over 2000 metrics with many filter options, and even more data is available through the integration of data sources like OpenTelemetry, SNMP, or any business data sources. For SLOs, one needs to choose which data is important enough to focus on. We looked at our customers adopting SLO monitoring in Dynatrace and present a hit list of the SLO types that were reported to us as important. We show what the setup of such SLOs looks like for both major categories of SLOs: request-count-based SLOs on real-user traffic and synthetic availability monitoring SLOs. We propose a decision tree for getting from an idea to defined SLO configurations.
Using Observability to Set Good SLOs
Beyond Theory | Daniel “Spoons” Spoonhower
While setting SLOs for externally visible services can be relatively straightforward, doing so for *internal* services can be more challenging. Teams can use current performance metrics to take a first stab at what internal service SLOs should be. While this lets them set realistic targets, it often means that they set objectives that are too high. In contrast, using distributed traces to understand how requests, and SLOs, flow through the application can help set SLOs that are looser (but not too loose). And not only does it help teams set better SLOs, it also helps them better understand which other SLOs their services depend on (and which depend on them). In this talk, I'll walk through a couple of examples to show how.