Technical and Deep Dive Talks
The first Service Level Objective Conference for Site Reliability Engineers
Benchmarking SLOs Using Chaos Engineering
Technical and Deep Dive | Uma Mukkara
SLOs are the visible results that SREs need to maintain in any operation. Recently, SLOs are increasingly being applied in pre-production CI/CD pipelines. If the pre-production setup is close to production, its resilience can be tested by introducing chaos in the pipeline and measuring the SLOs. In this talk, we discuss techniques for introducing chaos testing as a trigger to CD and as a post-CD action in production or pre-production. The audience will see an example chaos stage in action in a cloud-native CI/CD pipeline, how Prometheus-based SLIs are used to measure SLOs over a given period of time, and how this benchmarking informs decisions to trigger continuous deployments. The takeaway for SREs is how to use chaos testing as a tool to measure SLO-based resilience and how this can be automated using declarative config and GitOps.
Defining SLOs: A Practical Guide
Technical and Deep Dive | Matthias Loibl, Frederic Branczyk
SLOs often seem simple in theory but tend to get difficult when actually implementing them, as reality is often not by the textbook. SLOs are an invaluable tool for both engineers and management to consistently communicate reliability with data. Defining bad SLOs can also be harmful, so it’s important to keep various caveats in mind. SLOs are not only about data; it is equally important to clarify and evangelize expectations of SLOs within an organization. Frederic and Matthias have many years of experience defining SLOs for many services and components. Together they will demonstrate real-life examples of choosing, measuring, alerting on, and reporting SLOs based on Prometheus metrics. Join this talk to learn how to implement SLOs successfully using data you most likely already have.
Error Economics: How to avoid breaking the budget
Technical and Deep Dive | Simon Aronsson
It’s scary to release to production, especially if you don’t know if your system is performing within your quality SLOs. Using error budgets and testing at scale as quality gates in your release cycle, you’ll be able to gain much-needed confidence about the risk level associated with your release. Using open-source tools, we’ll set up a test, generate the necessary load to run it at scale and make sure we stay on budget. After attending this talk, attendees will:
- Have an understanding of what error budgets are and how they are measured.
- Know how to use them as indicators of service quality.
- Know how to create their first high-concurrency test using a load generator and how to set it up with acceptance thresholds based on their error budget.
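To make the idea of an acceptance threshold derived from an error budget more concrete, here is a minimal Python sketch. The function names, the 99.9% target, and the request counts are illustrative assumptions; they are not taken from the talk or from any particular load-testing tool.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Requests allowed to fail in the window without breaching the SLO."""
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = (1.0 - slo_target) * total_requests
    if budget == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget


def release_allowed(slo_target: float, total_requests: int,
                    failed_requests: int, min_remaining: float = 0.0) -> bool:
    """Quality gate: pass only while at least min_remaining of the budget is left."""
    return budget_remaining(slo_target, total_requests, failed_requests) >= min_remaining


if __name__ == "__main__":
    # Hypothetical load-test result: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
    print(release_allowed(0.999, 1_000_000, 250, min_remaining=0.5))
```

A load test could feed its own request and failure counts into `release_allowed` and fail the pipeline stage when it returns `False`.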
Evaluate Application Resilience with Chaos Engineering and SLOs
Technical and Deep Dive | Jürgen Etzlstorfer
SLOs are not only a great way to efficiently measure the availability and quality of production environments but should also be used to ensure the resilience of applications before production as part of chaos engineering. While many organizations start with ad-hoc chaos experiments in production to validate the impact on SLOs, it is more efficient to bake these tests and checks into the continuous delivery process. In this session, we give you practical guidance on “chaos stages” as part of your continuous delivery to validate compliance with your production SLOs prior to entering production. As a showcase, we demo a chaos-enriched delivery orchestration with the CNCF projects LitmusChaos (for chaos experiments) and Keptn (for orchestration of automated load testing and SLO validation).
GitLab’s journey to SLO Monitoring
Technical and Deep Dive | Andrew Newdigate
This talk covers GitLab's adoption of SLO monitoring, from our previous causal alerting strategy, which had outgrown its purpose as the complexity and traffic volumes grew, to our early attempts, building and maintaining configuration, and the problems that brought about, to our current, declarative approach. The talk will cover the challenges of getting buy-in from engineering, operations and product stakeholders, the benefits of having a common language of availability across the organisation and our future plans. This is a deep-dive, practical talk; all the code and configuration for GitLab.com's monitoring infrastructure is open-source, and the talk will include links to these resources. The talk is based on a talk I did at ScaleConf 2020, which received good feedback.
Introduction to SLO Alerting and Monitoring
Technical and Deep Dive | Niall Murphy
A super simple rehearsal of the "SLO alerting" chapter from the book, with a worked example.
Management & Governance in the Cloud
Technical and Deep Dive | MaSonya Scott
MaSonya Scott, Principal Specialist at AWS, will teach you to:
- Articulate the importance of cloud operations to achieve & scale adoption of AWS
- Understand the AWS Management & Governance value proposition to infrastructure operations for customers
- Provision and manage AWS resources in a standardized & secure fashion
- Evaluate cloud operations leveraging ITSM tools integrated with AWS Management & Governance services
Production Load Testing as a Guardrail for SLOs
Technical and Deep Dive | Hassy Veldstra
Production load testing (yes, you read that right!) can be an excellent technique for building an extra buffer of safety around your SLOs. We will cover:
- Using existing SLOs to prioritize the areas of the system to test
- Using existing SLOs to run production load tests safely
- Putting SLOs on the load tests themselves
This talk draws on the author's experience of implementing production load testing for building a margin of safety around SLOs at a large international publisher.
Should SLOs Be Request-Based or Time-Based? And Why Neither Really Works…
Technical and Deep Dive | Björn Rabenstein
Once you have gotten somewhat familiar with SLOs, you probably realized that time-based SLOs aren't really fair for most users. It doesn't help you if your ISP gives you perfect connectivity while you are asleep but always goes down during that important weekly video conference. Or in other words: A time-based SLO means free uptime whenever your service isn't used. Clearly a request-based SLO is much better: It measures what matters, and now an outage during peak time will consume your error budget much more quickly. If this talk were on the “New To SLOs” track, we would stop here. But since this is on the “Deep Dive” track, we need to go deeper. Let's explore a few common scenarios to see how a request-based SLO sometimes exaggerates and sometimes masks problems with your service and what we can do about it.
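The difference between the two SLO styles can be illustrated with a small Python sketch. The data layout (one `(total, failed)` pair per hour), the function names, and the traffic numbers are assumptions made up for this illustration; they do not come from the talk.

```python
from typing import List, Tuple

# One entry per time slice: (total_requests, failed_requests).
Window = Tuple[int, int]


def time_based_availability(windows: List[Window], max_error_ratio: float = 0.0) -> float:
    """Fraction of time slices whose error ratio stayed within max_error_ratio."""
    good = 0
    for total, failed in windows:
        # A slice without traffic counts as good: the "free uptime" effect.
        ratio = failed / total if total else 0.0
        if ratio <= max_error_ratio:
            good += 1
    return good / len(windows)


def request_based_availability(windows: List[Window]) -> float:
    """Fraction of all requests that succeeded, regardless of when they arrived."""
    total = sum(t for t, _ in windows)
    failed = sum(f for _, f in windows)
    return 1.0 if total == 0 else (total - failed) / total


if __name__ == "__main__":
    # Hypothetical day: 23 quiet hours (10 requests each, no failures)
    # and one peak hour in which half of 10,000 requests fail.
    day = [(10, 0)] * 23 + [(10_000, 5_000)]
    print(f"time-based:    {time_based_availability(day):.4f}")
    print(f"request-based: {request_based_availability(day):.4f}")
```

In this made-up day the time-based figure stays high because only one hour in 24 was bad, while the request-based figure drops sharply because most requests arrived during that hour.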
SLIs, SLOs, and Error Budgets at Scale
Technical and Deep Dive | Fred Moyer
How can one democratize the implementation of SLIs, SLOs, and Error Budgets to put them in the hands of a thousand engineers at once? At Zendesk we developed simple algorithms and practical approaches for implementing SLIs, SLOs, and Error Budgets at scale using a number of observability tools. This talk will show the approaches developed and how we were able to manage observability instrumentation across dozens of teams quickly in a complex ecosystem (CDN, UI, middleware, backend, queues, dbs, etc.). This talk is for engineers and operations folks who are putting SLIs, SLOs, and Error Budgets into practice. Attendees will come away with concrete examples of how to communicate and implement Error Budgets across multiple teams and diverse service architectures.
SLOs As One Course in the Full Reliability Tasting Menu
Technical and Deep Dive | Jacob Scott
SLOs can help us understand our reliability, but they aren’t magic beans. In this talk I’ll explain what they aren’t good for (spoiler: catastrophes). Embracing the fact that SLOs are an incomplete approach to reliability lets us use them in composition with other approaches to better wrangle with the end-to-end reliability of our (complex, socio-technical) systems. I’ll also discuss how techniques from modern safety science (“resilience engineering”) can pair well with SLOs. You’ll leave this talk curious about how these techniques can help you address the concrete reliability challenges you face in your systems today.
SLOs for Production Grade Kubernetes
Technical and Deep Dive | Bhargav Bhikkaji
We all know that cloud-native platforms, and especially Kubernetes, are hard to operate. Wouldn't it be great to have a list of SLIs/SLOs that tells us whether our Kubernetes platform is healthy or not? As a cloud-native consultant who has helped many organizations kick-start and manage their Kubernetes journey, I would like to share experiences of the important SLOs they monitor for their production-grade Kubernetes.
SLOs For Quality Gates In Your Delivery Pipeline
Technical and Deep Dive | Andreas Grabner
SREs use SLOs to ensure production is stable and changes from development are not impacting SLAs. Error Budgets are a great way to decide whether we can still deploy or not. But every deployment risks impacting critical SLOs, which eats up the error budget faster than planned and eventually slows down innovation. In this session we demonstrate how to use the concept of SLOs as part of continuous delivery to validate the impact of code or configuration changes before it's time to deploy to production. This gives developers faster feedback on the potential impact of their code changes, increases the quality of the code that makes it to the gates of production, and therefore results in less impact when the actual production deployment happens. We will demo this approach using the SLO-based Quality Gate capability of the open-source project Keptn.
Standardizing SLOs as Code
Technical and Deep Dive | Andreas Grabner
Netflix’s Kayenta, Pivotal’s Indicator, and T-Systems Performance Signature were early implementations of SLOs. As more SLO solutions emerge on the market, it’s time to think about open standards for SLO definitions as code, agree on how SLOs should be validated, and standardize where the data (SLIs) comes from, so that users can easily switch from one observability platform to another without having to rethink or redefine SLOs. We need to follow in the footsteps of observability, which standardized on OpenTelemetry, making things easier for users and opening up new opportunities for tool vendors! In this session, we want to start this discussion by introducing the open-source efforts of the Keptn project around SLOs. Keptn uses a declarative approach to SLIs and SLOs, with a clear separation between data providers, SLO definition, and SLO validation.
The State of the Histogram
Technical and Deep Dive | Heinrich Hartmann
In this talk we are going to survey the different technologies available to capture (latency) distributions and store them in time-series databases. This includes (a) theoretical underpinnings, (b) accuracy and performance, (c) operational aspects, and (d) adoption. Disclaimer: The author worked on openhistogram.io in the past.
Using Binomial proportion confidence intervals to reduce false positives in low QPS services
Technical and Deep Dive | Dylan Zehr
Description of how to use Binomial intervals (specifically Wilson score intervals) to modify SLO metrics to reduce false positives in services with periods of low QPS. The description would cover some basic background of the statistical methods, some example graphs, and possibly an example of how to configure this using a common platform.
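As a rough illustration of the statistical idea, the following Python sketch computes a Wilson score interval for a success ratio and only flags an SLO violation when even the upper bound of the interval falls below the target. The alerting rule, the function names, the 95% target, and the sample counts are assumptions for this sketch and are not necessarily the method presented in the talk.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95% confidence)."""
    if total == 0:
        # No observations: the true ratio could be anywhere in [0, 1].
        return (0.0, 1.0)
    p = successes / total
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return (max(0.0, center - half_width), min(1.0, center + half_width))


def should_alert(successes: int, total: int, slo_target: float, z: float = 1.96) -> bool:
    """Alert only if even the optimistic (upper) bound misses the target."""
    _, upper = wilson_interval(successes, total, z)
    return upper < slo_target


if __name__ == "__main__":
    # Same observed success ratio (90%) at low and high traffic, 95% target.
    print(should_alert(9, 10, 0.95))      # low QPS: too few samples to be sure
    print(should_alert(900, 1000, 0.95))  # high QPS: the shortfall is significant
```

With only ten requests, a single failure leaves a wide interval that still contains the target, so the naive comparison `0.9 < 0.95` would have produced an alert that this rule suppresses.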