SLOconf 2021
The world wants to share and learn about SLOs and who are we to stop them?
Learn about the success of SLOconf 2021, as we’re bringing back the virtual conference to our community in 2022!
Previous speakers
Abby Bangser
Site Reliability Engineer
Duffel
Outside of work, Abby is active in the community: she co-leads TechVoices, which mentors new and diverse speakers; hosts the London chapter of the #CoffeeOps meetup, which provides a more interactive space for DevOps professionals to discuss relevant topics; and co-hosts the London Essentials, which brings together mentors and new joiners to the software testing industry.
Just Say No (to Dashboards): You Don't Need More Information, You Need the Right Information
This talk illustrates the difference between signal and noise in monitoring efforts. Engineers shouldn't sit watching dashboards; they should be improving existing features and developing new ones. Dashboard and metrics fatigue prevents engineers from living their best life. The talk traces a journey from too many dashboards to identifying the signals that are most meaningful for a team, and adopting an SLO approach to reduce signal fatigue.
Alex Hidalgo
Principal Reliability Advocate
Nobl9
Alina Anderson
Senior TPM of Site Reliability Engineering
Outreach
Survival Guide: What I Learned From Putting 200 Developers On Call
We want to live in a world where the development team that writes the code also owns that code's success... or failure, in production. Nothing incentivizes a team to ship better quality software than getting paged at 2am, but how do we do this? In this talk, you'll learn some tips and tricks for easing less-than-enthusiastic development teams into on-call rotations, how SRE facilitates the transition to production code ownership, and why SLOs are critical to your success.
Andreas Grabner
DevOps Activist at Dynatrace & DevRel for CNCF Keptn
Dynatrace
SLOs For Quality Gates In Your Delivery Pipeline
SREs use SLOs to ensure production is stable and changes from development are not impacting SLAs. Error budgets are a great way to decide whether we can still deploy or not. But every deployment risks impacting critical SLOs, eating up the error budget faster than planned and eventually slowing down innovation. In this session we demonstrate how to use SLOs as part of continuous delivery to validate the impact of code or configuration changes before it's time to deploy to production. This gives developers faster feedback on the potential impact of their code changes, increases the quality of code that makes it to the gates of production, and therefore results in less impact when the actual production deployment happens. We will demo this approach using the SLO-based Quality Gate capability of the open source project Keptn.
Andrew Newdigate
Distinguished Engineer
GitLab Inc.
After living in London in the UK for 17 years, he recently relocated back to his hometown of Cape Town in South Africa.
GitLab's journey to SLO Monitoring
This talk covers GitLab's adoption of SLO monitoring, from our previous causal alerting strategy, which had outgrown its purpose as the complexity and traffic volumes grew, to our early attempts, building and maintaining configuration, and the problems that brought about, to our current, declarative approach. The talk will cover the challenges of getting buy-in from engineering, operations and product stakeholders, the benefits of having a common language of availability across the organisation and our future plans. This is a deep-dive, practical talk; all the code and configuration for GitLab.com's monitoring infrastructure is open-source, and the talk will include links to these resources. The talk is based on a talk I did at ScaleConf 2020, which received good feedback.
Bart Enkelaar
Lead Site Reliability Engineer
bol.com
The Game of SLOs - A Three Part Reliability Musical
Ever since the great success of important society-shaping documentaries like Cats, Wicked and Hamilton, it has been clear that music is the way to truly get a broad audience to accept new information. As SREs, evangelisation is often a core part of what we do, since it often revolves around convincing people to take a new approach to innovation. In this three-part musical, we'll describe the journey through SRE in a manner which is both recognisable and informative and as such should be directly applicable to change hearts and minds on reliability all across the world. Get out your ukulele and sing along!
Benoit Petit
Founder
Hubblo
SLOs for climate: How to Continuously Reduce the Climate Impact of Tech Services
Site Reliability Engineering's goal is to ensure that the software systems and services created in an organization can evolve easily and, above all, are extremely reliable. There are several definitions of reliability, one being: 'reliability is the ability for a system to fulfill a mission in some defined conditions, for a given period of time'. This definition allows us to redefine the conditions that dictate whether the system actually fulfilled its mission over the given period of time. As the tech industry has to lower its greenhouse gas (GHG) emissions by 45% in the next 10 years to match Paris Agreement objectives, it seems essential to me that a tech service or system is considered reliable not only if it satisfies the client in the short term, but also if it doesn't jeopardize the client's future. That means, obviously, that it has to respect smartly defined objectives regarding the GHG emissions related to its very existence and usage. In this talk we'll see what we can do right now to use those methods, not only to create business value, but for our future too.
Bhargav Bhikkaji
Founder & CEO
Tailwinds
SLOs for Production Grade Kubernetes.
We all know that cloud native platforms, and especially Kubernetes, are hard to operate. Wouldn't it be great to look at a list of SLIs/SLOs to understand whether our Kubernetes platform is fine or not? As a cloud native consultant who has worked with many organizations and helped customers kick-start and manage their Kubernetes journey, I would like to share experiences on the important SLOs they monitor for their production grade Kubernetes.
Björn Rabenstenie
Grafana Labs
Should SLOs Be Request-Based or Time-Based? And Why Neither Really Works...
Once you have gotten somewhat familiar with SLOs, you probably realize that time-based SLOs aren't really fair for most users. It doesn't help you if your ISP gives you perfect connectivity while you are asleep but always goes down during that important weekly video conference. Or in other words: a time-based SLO means free uptime whenever your service isn't used. Clearly a request-based SLO is much better: it measures what matters, and now an outage during peak time will consume your error budget much more quickly. If this talk were on the "New To SLOs" track, we would stop here. But since this is on the "Deep Dive" track, we need to go deeper. Let's explore a few common scenarios to see how a request-based SLO sometimes exaggerates and sometimes masks problems with your service, and what we can do about it.
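The contrast between the two SLO styles can be made concrete with a small calculation. The sketch below is only an illustration and is not taken from the talk: the hourly (total, good) buckets, the 99% per-window threshold, and both function names are assumptions chosen for the example.

```python
from typing import List, Tuple

# Each window is (total_requests, good_requests); hourly buckets are an assumption.
Window = Tuple[int, int]


def time_based_sli(windows: List[Window], threshold: float = 0.99) -> float:
    """Fraction of windows that count as 'good' (idle windows count as good)."""
    good_windows = sum(
        1 for total, good in windows if total == 0 or good / total >= threshold
    )
    return good_windows / len(windows)


def request_based_sli(windows: List[Window]) -> float:
    """Fraction of all requests that succeeded, regardless of when they arrived."""
    total = sum(t for t, _ in windows)
    good = sum(g for _, g in windows)
    return good / total if total else 1.0


# 23 quiet hours (10 requests each, all good) plus one peak hour with a partial outage.
day = [(10, 10)] * 23 + [(1000, 500)]

time_sli = time_based_sli(day)        # 23 of 24 windows were good
request_sli = request_based_sli(day)  # 730 of 1230 requests were good
print(f"time-based: {time_sli:.3f}, request-based: {request_sli:.3f}")
```

In this made-up day, the peak-hour outage costs one window out of 24 under the time-based view, while the request-based view records that roughly 41% of all requests failed.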
Dan Wilson
Co-founder, CTO
Control Plane Corporation
Lessons from Failure: How to Fail and Still Succeed
I worked at Concur on infrastructure, operations and engineering as it grew from a few users to millions. Over the years, I witnessed many failures across the stack and caused a handful of issues myself. In this talk, I'll walk through some of the most brutal and customer-impacting failures that I saw or caused and highlight the core principles I learned from surviving these stressful situations.
Daniel “Spoons” Spoonhower
Co-founder, Chief Architect
Lightstep
Using Observability to Set Good SLOs
While setting SLOs for externally visible services can be relatively straightforward, doing so for *internal* services can be more challenging. Teams can use current performance metrics to take a first stab at what internal services' SLOs should be. While this lets them set realistic targets, it often means that they set objectives that are too high. In contrast, using distributed traces to understand how requests (and SLOs) flow through the application can help set SLOs that are looser (but not too loose). And not only does it help teams set better SLOs, it also helps them better understand which other SLOs their services depend on (and which depend on them). In this talk, I'll walk through a couple of examples to show how.
Dylan Zehr
Site Reliability Engineering Manager
Google SRE
Using Binomial proportion confidence intervals to reduce false positives in low QPS services
A description of how to use binomial proportion confidence intervals (specifically Wilson score intervals) to modify SLO metrics and reduce false positives in services with periods of low QPS. The talk covers some basic background on the statistical methods, some example graphs, and possibly an example of how to configure this on a common platform.
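To make the idea concrete, here is a minimal sketch of a Wilson score interval applied to a success ratio. It is not the speaker's implementation: the function names, the default z = 1.96 (roughly 95% confidence), and the alert rule "fire only when the upper bound is below the target" are all assumptions made for illustration.

```python
import math
from typing import Tuple


def wilson_interval(good: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the success proportion good/total.

    z = 1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if total == 0:
        return 0.0, 1.0  # no data: the true proportion could be anything
    p = good / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def confident_breach(good: int, total: int, target: float) -> bool:
    """Hypothetical alert rule: fire only if even the upper bound misses the target."""
    _, upper = wilson_interval(good, total)
    return upper < target


# Both samples have a raw success ratio of 98% against a 99% target.
low_traffic = confident_breach(49, 50, 0.99)
high_traffic = confident_breach(4900, 5000, 0.99)
print(low_traffic, high_traffic)
```

With only 50 requests the interval is wide enough that the 99% target still lies inside it, so this rule stays quiet; with 5,000 requests at the same ratio the interval is narrow and the breach is reported.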
Fred Moyer
Senior Staff SRE
Zendesk
SLIs, SLOs, and Error Budgets at Scale
How can one democratize the implementation of SLIs, SLOs, and Error Budgets to put them in the hands of a thousand engineers at once? At Zendesk we developed simple algorithms and practical approaches for implementing SLIs, SLOs, and Error Budgets at scale using a number of observability tools. This talk will show the approaches developed and how we were able to manage observability instrumentation across dozens of teams quickly in a complex ecosystem (CDN, UI, middleware, backend, queues, dbs, etc.). This talk is for engineers and operations folks who are putting SLIs, SLOs, and Error Budgets into practice. Attendees will come away with concrete examples of how to communicate and implement Error Budgets across multiple teams and diverse service architectures.
Frederic Branczyk
CEO and Founder
Polar Signals
Defining SLOs: A Practical Guide
SLOs often seem simple in theory, but tend to get difficult when actually implementing them, as reality often doesn't follow the textbook. SLOs are an invaluable tool for both engineers and management to consistently communicate reliability with data. Defining bad SLOs can also be harmful, so it's important to keep various caveats in mind. SLOs are not only about data; it is equally important to clarify and evangelize expectations of SLOs within an organization. Frederic and Matthias have many years of experience defining SLOs for many services and components. Together they will demonstrate real-life examples of choosing, measuring, alerting and reporting SLOs based on Prometheus metrics. Join this talk to learn how to implement SLOs successfully using data you most likely already have.
Hassy Veldstra
Open source developer, SRE, founder of an open source company. On a mission to help dev teams keep their production systems fast & reliable and pagers silent.
Artillery.io
Production Load Testing as a Guardrail for SLOs
Production load testing (yes you read that right!) can be an excellent technique for building an extra buffer of safety around your SLOs.
We will cover:
- Using existing SLOs to prioritize the areas of the system to test
- Using existing SLOs to run production load tests safely
- Putting SLOs on the load tests themselves
This talk draws on the author's experience of implementing production load testing for building a margin of safety around SLOs at a large international publisher.
Heinrich Hartmann
Principal Engineer
Zalando
The State of the Histogram
In this talk we survey the available technologies for capturing (latency) distributions and storing them in time-series databases. This includes (a) theoretical underpinnings, (b) accuracy and performance, (c) operational aspects, and (d) adoption. Disclaimer: The author worked on openhistogram.io in the past.
Ioannis Georgoulas
Senior Site Reliability Engineering Manager
Paddle.com
SLO From Nothing to Production
This talk focuses on how I educated myself about SLOs and how I applied this to my organization. I will present my biggest learnings, such as that building an SLO mindset is definitely a marathon. I will present my SLO journey, and more specifically: what I read and did to learn more about SLOs, how I got buy-in from the appropriate stakeholders, why internal advocacy of SLOs is super important, and how we built an SLO "framework". On the SLO framework, I will cover what tools we use to build our SLIs, where we store the SLO docs, how we implement burn rate alerting, and how all of these fit together in a scalable and extensible way. The last part will cover learnings from our SLOs and ways of working with the Product teams in order to define their SLOs.
Jacob Scott
Reliability Engineer
Stripe
SLOs As One Course in the Full Reliability Tasting Menu
SLOs can help us understand our reliability, but they aren't magic beans. In this talk I'll explain what they aren't good for (spoiler: catastrophes). Embracing the fact that SLOs are an incomplete approach to reliability lets us use them in composition with other approaches to better wrangle with the end-to-end reliability of our (complex, socio-technical) systems. I'll also discuss how techniques from modern safety science ('resilience engineering') can pair well with SLOs. You'll leave this talk curious about how these techniques can help you address the concrete reliability challenges you face in your systems today.
Julie Gunderson
Senior Reliability Advocate
Gremlin Inc.
In her off time Julie can be found either traipsing through the mountains in Idaho, or making circuit boards into wearable art.
The Psychology of Chaos Engineering
Chaos Engineering, failure injection, and similar practices have verified benefits to the resilience of systems and infrastructure. But can they provide similar resilience to teams and people? What are the effects and impacts on the humans involved in the systems? This talk will delve into both positive and negative outcomes for all the groups of people involved, including users, engineers, product, and business owners.
Jürgen Etzlstorfer
Technology Strategist
Dynatrace
Evaluate Application Resilience with Chaos Engineering and SLOs
SLOs are not only a great way to efficiently measure the availability and quality of production environments but should also be used to ensure the resilience of applications before production as part of chaos engineering. While many organizations start with ad-hoc chaos experiments in production to validate the impact on SLOs, it is more efficient to bake these tests and checks into the continuous delivery process. In this session, we give you practical guidance on "chaos stages" as part of your continuous delivery to validate compliance with your production SLOs prior to entering production. As a showcase, we demo a chaos-enriched delivery orchestration with the CNCF projects LitmusChaos (for chaos experiments) and Keptn (for orchestration of automated load testing and SLO validation).
Keri Melich
Senior Site Reliability Engineer
Nobl9
SLO Basics - a conversation about reliability
Kit Merker
COO
Nobl9
A Year of SLO Bootcamps
In this talk, I'll share what I've learned in the last year leading a hands-on SLO bootcamp for a variety of cross functional teams. You'll learn a proven strategy for helping teams get over the hump of a first SLO and how to drive a scalable organizational and cultural change to the SLO-based way of thinking. With COVID, I had to adapt my SLO Bootcamp to being online only, and this forced me to focus on just the essentials, increase interactivity, and ensure the course was of value to all the participants. I'll go over resources you can use to run your own SLO Bootcamp too!
Kristina Bennett
Site Reliability Engineer, Customer Reliability
Google SRE
Kristof Renders
Autonomous Cloud Enablement Practice Manager
Dynatrace
Top 5 Real-life SLOs and Decision Tree to Define Your SLOs
Google's SRE theory already tells us what many confirm from their own SRE journey: it is a hard task to determine the most valuable SLOs for your system. Monitoring tools like Dynatrace provide over 2000 metrics with many filter options, and even more data is available through the integration of data sources like OpenTelemetry, SNMP, or any business data source. For SLOs, one needs to focus on the important data. We looked at our customers adopting SLO monitoring in Dynatrace and present a hit list of the SLO types reported to us as important. We show what the setup of such SLOs looks like for both major categories of SLOs: real-user traffic request-count-based SLOs and synthetic availability monitoring SLOs. We propose a decision tree for getting from an idea to defined SLO configurations.
Liz Fong-Jones
Developer advocate, Labor And Ethics Organizer, & Site Reliability Engineer
Honeycomb
She lives in Vancouver, BC with her wife Elly, partners, and a Samoyed/Golden Retriever mix, and in Sydney, NSW. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.
SLOs & Observability - better together
We built support for SLOs (Service Level Objectives) against our event store so we could monitor our own complex distributed system. In the process of doing so, we learned that there were a number of important aspects that we didn't expect from carefully reading the SRE workbook. This talk is the story of the missing pieces, unexpected pitfalls, and how we solved those problems. We'd like to share what we learned and how we iterated on our SLO adventure. As an SLO advocate and a design researcher, we collected user feedback through iterative deployments to learn what challenges users were running into. This conversation will discuss how we iterated on our design based on user feedback; how we deployed, learned, and re-deployed; and how we collected information from our users and from the alerts our system fired. In this talk, we will discuss how we brought the theory of SLOs to practice, and what we learned that we hadn't expected in the process. We'll discuss implementing the SLO feature and burn alerts, and our experiences working with the SRE team who started using the alerts. Our hope is that when you buy or build your SLO tools, you'll know what to look for and how to get started; that implementors will be able to start on more solid ground; and that we will be able to advance the state of SLO support for all teams that wish to implement them. The major design points will be broken into a discussion of what we actually built; a number of unexpected technical features; and ways that we had to educate users beyond the standard SLO guidelines. The talk is largely conceptual: no live code will be shown, although some innocent servers may well die in the process of being visualized.
MaSonya Scott
Principal Specialist
AWS
Matt Ray
Regional Manager, Customer Architect - APJ
Chef
Applying SLOs to Infrastructure and Compliance as Code
Audits, compliance, and security are top of mind for most enterprises, while configuration management is not something most executives consider. Management teams are focused on reaching their business targets, but operations is the engine that helps the organization achieve its goals. Developers and operators need to align their goals with the business, and Service Level Objectives (SLOs) help focus these efforts and raise visibility. Configuration management _is_ important, but it needs to be part of an SLO for delivering reliable infrastructure quickly and efficiently. Security and passing audits are important; we need to understand our exposure to risk by attaining high levels of compliance. This session will provide examples of making those goals visible through SLOs, with examples provided from the open source Chef and InSpec projects.
Matthias Loibl
Senior Software Engineer
Polar Signals
Defining SLOs: A Practical Guide
SLOs often seem simple in theory, but tend to get difficult when actually implementing them, as reality often doesn't follow the textbook. SLOs are an invaluable tool for both engineers and management to consistently communicate reliability with data. Defining bad SLOs can also be harmful, so it's important to keep various caveats in mind. SLOs are not only about data; it is equally important to clarify and evangelize expectations of SLOs within an organization. Frederic and Matthias have many years of experience defining SLOs for many services and components. Together they will demonstrate real-life examples of choosing, measuring, alerting and reporting SLOs based on Prometheus metrics. Join this talk to learn how to implement SLOs successfully using data you most likely already have.
Meghan Jordan
Senior Product Manager
Datadog
Fundamentals for improving customer experience
Service level objectives (SLOs) help you understand the health of your systems and how your end users experience them. You're not likely to achieve desired results if you're not basing decisions on useful data, and this means that poorly defined SLIs (using the wrong metrics) and SLOs (defining the wrong targets) could cause worse outcomes for your users. In this talk we'll cover how SLOs help you make more informed decisions. You'll learn how to get started with SLOs and choose the right service level indicators to meet your customers' expectations.
Melissa Boggs
VP of Business Agility
Sauce Labs
Agile & DevOps Walk into a Bar
Tune in to hear an Agility Exec and a DevOps Exec talk about the intersection of agile, DevOps, and metrics over a virtual "beer". In this 10 minute convo, we chat about the definitions of DevOps and agile and how metrics can play a part in showing leadership and teams where they can improve. Are your metrics acting as a window or a mirror?
Michael Ericksen
Site Reliability Engineer
Intelligent Medical Objects
From Availability to User Happiness: An Introduction to SLOs That Matter
This talk tells the story of an engineering team that finds themselves in a quasi-incident for a web application that runs inside of Electronic Health Record (EHR) systems like Epic and Cerner. The engineering dashboard for the application showed uptime at 100%. Users, however, paused their implementation timelines because of poor application performance. As an organization, we were measuring the wrong thing. In this talk, I will tell the story of how an engineering team pivoted from measuring availability to key application behaviors for their end users to dramatically improve user satisfaction.
Michael Friedrich
Senior Developer Evangelist
GitLab Inc.
Left Shift your SLOs
Everyone talks about security shifting left in your CI/CD pipeline. Tools and cultural changes enable teams to scale and avoid deployment problems. SLOs are left out: what if a software change triggers a regression and your production SLOs fail? As a developer, you want to detect these problems as early as possible. This talk dives deep into CI/CD pipelines and discusses ideas to calculate and match SLOs in the development lifecycle, as early as the review of your pull or merge request.
Michael March
Head of Innovation
Isos Technology
Supporting tools/templates to guide your SLO journey
Your org has chosen to implement SLOs, awesome! Beyond the core tooling (monitoring, SLO measuring, etc.), this talk will quickly demonstrate concrete examples of tools and processes that can support your organization's implementation journey, soup to nuts.
Mick Roper
Software Engineer
Reliably
Service Level Overkill - SLO In a World of SOA
Service levels are excellent for understanding the limits you put on your own services, but in a world of web services your own ability to create a useful SLO is impacted by everything you depend upon. In this chat I discuss how to understand SLOs from other teams, how to try to mitigate SLO impact and how to deal with it when it happens. I also talk about what a low SLO means, and why it shouldn't be assumed that you need 9 9's of availability to offer a useful service!
Milan Plžík
Site Reliability Engineer
Grafana Labs
Production Readiness Review: Providing a Solid Base for SLOs
It's hard to propose a good SLO for a new service with little mileage. Even for a service that has been running for years, it's hard to gain confidence that the SLO won't be impacted if the service scales 10x. We'll have a look at the Production Readiness Review process, which seeks to identify and remove common pitfalls and already-learned mistakes through a focused review, strengthening confidence in the defined SLO. The process was originally developed at Google (https://sre.google/sre-book/evolving-sre-engagement-model/); at Grafana Labs, we've tailored the process to our needs, which is what this talk will discuss.
Navya Dwarakanath
Senior Solutions Engineer
Catchpoint Systems
Unboxing Blackbox Monitoring for SLO
You have read this in every SLO book and heard it in several talks: measure SLOs from the perspective of the end user. Measuring from the user's perspective is not easy or straightforward, but it is fundamental to how effective your SLOs are. Learn why the user's perspective is paramount, what makes Blackbox monitoring effective, the blind spots it helps you cover, and how you can use it to define your SLOs.
Niall Murphy
Author
SRE Book
Introduction to SLO Alerting and Monitoring
Super simple rehearsal of the "SLO alerting" chapter from the book, with worked example.
Posten A
Production Engineer
SLOs at Facebook
Scaling SLOs at Facebook to planetary scale using SLICK - a purpose built centralised SLO store integrated into key observability systems.
Richard Hartmann
Community Director
Grafana Labs
He also designed and built a datacenter from scratch, is a Prometheus maintainer, and founded OpenMetrics.
Infrastructure Comes out of the Wall, No One Cares How
You care about your service and how it works internally; your users do not. Your water, electricity, and Internet come out of the wall, and if they stop doing that, you call someone to complain. That's how you should think about your services, and we'll explore this thought more.
Ryan Lockard
SVP, @ Contino | Chief Technology Officer Cloud+
Agile & DevOps Walk into a Bar
Tune in to hear an Agility Exec and a DevOps Exec talk about the intersection of agile, DevOps, and metrics over a virtual "beer". In this 10-minute convo, we chat about the definitions of DevOps and agile and how metrics can play a part in showing leadership and teams where they can improve. Are your metrics acting as a window or a mirror?
Sal Kimmich
Product Strategist, Developer Advocate
Reliably
Creating Great Dev Culture through Error Budgets
In the most basic definition, an error budget is simply the amount of error that a service can accumulate over a specified period of time before users grumble about the experience. While many organizations introducing error budgets treat them as just another metric for system quality control, there is huge utility in making error budgets a fundamental part of your developer culture around trust and timely innovation. With the critical autonomy this working paradigm gives engineers, the development team can spend their error budget however they feel is right: on either prevention or cure of system instabilities. In this talk, we will cover common combinations of SLIs that lead to error budget best practices, as well as protocols that can be enacted when error budgets slip: the who, what, when, and why of pre-incident reporting.
Simon Aronsson
Head of Developer Relations
K6
In my spare time, you’ll usually find me either out and about on my longboard or alpine skis, caring for the chilies in my hydroponic window garden, building software or hardware or playing with my Commodore 64.
Error Economics: How to avoid breaking the budget
It's scary to release to production, especially if you don't know if your system is performing within your quality SLOs. Using error budgets and testing at scale as quality gates in your release cycle, you'll be able to gain much-needed confidence about the risk level associated with your release.
Using open-source tools, we'll set up a test, generate the necessary load to run it at scale and make sure we stay on budget.
After attending this talk, attendees will:
- Have an understanding of what error budgets are and how they are measured.
- Know how to use them as indicators of service quality.
- Know how to create their first high-concurrency test using a load generator and how to set it up with acceptance thresholds based on their error budget.
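As a rough illustration of an acceptance threshold derived from an error budget, the sketch below checks whether the failures seen in one load-test run fit inside the budget implied by an SLO target. It is not taken from the talk: the `LoadTestResult` type, the function names, and the `max_budget_fraction` parameter are hypothetical, and it assumes a simple request-based availability SLO.

```python
from dataclasses import dataclass


@dataclass
class LoadTestResult:
    """Hypothetical summary of a single load-test run."""
    total_requests: int
    failed_requests: int


def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Error budget expressed as the number of requests that may fail."""
    return (1.0 - slo_target) * total_requests


def budget_consumed(result: LoadTestResult, slo_target: float) -> float:
    """Fraction of the error budget used by this run (1.0 means fully spent)."""
    budget = allowed_failures(slo_target, result.total_requests)
    if budget == 0:
        # A 100% target (or an empty run) leaves no budget at all.
        return 0.0 if result.failed_requests == 0 else float("inf")
    return result.failed_requests / budget


def passes_quality_gate(
    result: LoadTestResult,
    slo_target: float,
    max_budget_fraction: float = 1.0,
) -> bool:
    """Assumed gate: pass while the run uses at most the given budget share."""
    return budget_consumed(result, slo_target) <= max_budget_fraction


if __name__ == "__main__":
    run = LoadTestResult(total_requests=100_000, failed_requests=40)
    print(budget_consumed(run, slo_target=0.999))
    print(passes_quality_gate(run, slo_target=0.999))
```

A stricter gate, for example `max_budget_fraction=0.5`, would hold back a release that burns more than half of the budget in a single test run.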
Steve McGhee
Reliability Advocacy Engineer
Google SRE
SLO Math
It's the architecture, not the products or infrastructure, that matters. How to think about your dependencies and how their SLOs affect your own.
- Chained services SLO = SLOs ^ depth
- Parallel isolated services SLO = min(SLOs)
- Redundant parallel services = much better ~= SLO of the LB "above"
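The three rules of thumb above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the speaker's own material: failures are treated as independent, the chained case multiplies per-hop SLOs (which reduces to SLO ^ depth when every hop has the same SLO), and the redundant case uses the standard "all replicas must fail" probability with the load balancer in series. All function names are hypothetical.

```python
import math
from typing import Sequence


def chained_slo(slos: Sequence[float]) -> float:
    """Serial dependencies: every hop must succeed, so availabilities multiply."""
    return math.prod(slos)


def parallel_isolated_slo(slos: Sequence[float]) -> float:
    """Isolated parallel services: the weakest one sets the bound."""
    return min(slos)


def redundant_slo(replica_slos: Sequence[float], lb_slo: float) -> float:
    """Redundant replicas behind a load balancer (independence assumed).

    The group fails only if every replica fails at once, and the load
    balancer sits in series in front of the group.
    """
    all_replicas_down = math.prod(1.0 - s for s in replica_slos)
    return lb_slo * (1.0 - all_replicas_down)


if __name__ == "__main__":
    print(chained_slo([0.999, 0.999, 0.999]))        # three hops deep
    print(parallel_isolated_slo([0.999, 0.99]))      # bounded by the weakest
    print(redundant_slo([0.99, 0.99], lb_slo=0.9999))
```

With two replicas at 0.99 each, the chance of both failing together is small enough that the result sits just under the load balancer's own SLO, which matches the "~= SLO of the LB" shorthand.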
Uma Mukkara
Co-Founder & CEO
ChaosNative
Benchmarking SLOs Using Chaos Engineering
SLOs are the visible results that SREs need to maintain in any operation. Recently, SLOs are increasingly being applied in pre-production CI/CD pipelines. If the pre-production setup is close to production, its resilience can be tested by introducing chaos in the pipeline and measuring the SLOs. In this talk, we discuss techniques for introducing chaos testing as a trigger to CD and as a post-CD action in production or pre-production. The audience will see an example chaos stage in action in a cloud-native CI/CD pipeline, how Prometheus-based SLIs are used to measure SLOs over a given period of time, and how this benchmarking is used to decide whether to trigger continuous deployments. The takeaway for SREs is using chaos testing as a tool to measure SLO-based resilience, and how this can be automated using declarative config and GitOps.
Vidya Subramanian
Founder
MyTechLadder
Founder MyTechLadder - a career progression and career mobility service, my give back project.
Founder 1MinuteDances - a nonprofit dedicated to helping women reconnect with the performing art.
Startup advisor - Founder Devopsly, LLC. Advising startups in the DevSecOps space.
Startup investor - Angel and early stage investor
Wolfgang Heider
Senior Technical Product Manager
Dynatrace
Top 5 Real-life SLOs and Decision Tree to Define Your SLOs
Google's SRE theory already tells us what many confirm on their own SRE journey: it is hard to determine the most valuable SLOs for your system. Monitoring tools like Dynatrace provide over 2000 metrics with many filter options, and even more data is available through the integration of data sources like OpenTelemetry, SNMP, or any business data source. For SLOs, one needs to focus on the important data. We looked at our customers adopting SLO monitoring in Dynatrace and present a hit list of the SLO types reported to us as important. We show what the setup of such SLOs looks like for both major categories of SLOs: real-user traffic request-count-based SLOs and synthetic availability monitoring SLOs. We propose a decision tree for getting from an idea to defined SLO configurations.
Yury Niño Roa
SRE Technical Program Manager
ADL Digital Labs
Software Engineer with 7+ years of experience designing, implementing and managing the development of software applications using agile methodologies such as scrum and kanban. 2+ years of hands-on experience supporting, automating and optimizing mission-critical deployments. Experience with on-premise and cloud architectures and foundations, in both coding and deploying systems. 1+ year as Technical Program Manager of a Site Reliability Engineering Team, designing and architecting software to improve availability, scalability, latency and efficiency.
Professor of Software Engineering and Researcher with an interest in solving performance, resilience and reliability issues using chaos engineering, and in studying human factors, system safety and the lack of monitoring and observability.
Defining a Maturity Model for SLOs
Service Level Objectives, or SLOs, are a quantitative contract that describes the expected service behavior. Organizations often use them to prioritize the reliability, availability, coverage, and other service-level indicators of their software systems. Based on what I have learned defining and implementing SLOs, I have found that they are valuable when they are used to build feedback loops along two axes: adoption and automation. SLOs are a process, not a project, which creates the need for a framework that helps organizations adopt a culture based on SLOs. In this talk, I present a framework for determining the level of adoption and automation of SLOs. Based on questions about how much convincing engineering, operations, product, leadership, legal, and quality assurance require, we determine the level of adoption. On the other side, considering aspects such as established and documented measurements, the level of user-centric metrics, observability strategies, and reporting toolsets, we determine the level of automation.
Zac Nickens
Site Reliability Engineer
Outsystems
No More Theater: Building SLO Culture Without the Bullsh*t
Using SLO culture to break down silos, empower engineers, and drive user (and engineer) happiness. Using real-life examples from unnamed orgs, I will highlight the pitfalls and traps of "theater" and "fiefdoms" and how SLO culture can be used to break down barriers to high performance and high happiness.