SLOconf 2023

The world wants to share and learn about SLOs and who are we to stop them?

Learn about the success of SLOconf 2023, as we’re bringing back the virtual conference to our community in 2024!


Speakers

Adrian Hoban

Principal Engineer

Intel

Adrian is a principal engineer leading cloud native resource orchestration for the Network and Edge Group in Intel. He has invested 20+ years becoming an expert on Cloud Native Orchestration, Service Orchestration and Management, Automation and Cloud Native Observability, with many of those years focused on applying and enhancing cloud technologies for some of the most demanding, high-performance, deterministic networking applications. Now he leads strategy, requirements, and architecture definition for cloud native resource orchestration of distributed networking applications that span from the cloud to the edge.

In the past, Adrian created and influenced the ecosystem to adopt Enhanced Platform Awareness, which is a suite of platform-enabled capabilities at different layers of orchestration stacks. He was one of the contributors to the Management and Orchestration standards definition for Network Functions Virtualisation, bringing platform-aware virtualisation technology to Communications Service Providers for high-performance, interoperable NFV solutions. Adrian was a co-founder and first Technical Steering Committee lead of the Open Source Management and Orchestration (OSM) community.

Adrian is also a keen sports fan, loves outdoor sports and in particular Gaelic games, rugby and mountain biking.

2023

Intent Driven Orchestration with SLOs!

With a Serverless mindset in place, there should no longer be a need to define resource requests or any other kinds of information. Just develop your app or function and run it.

But how can we achieve performance and make sure we run the system most efficiently? This is where we can now let users define what they truly care about: their SLOs. Based on these performance targets, our Intent Driven Orchestration Planner will do the rest. No need to define resource requests and limits on, e.g., a Kubernetes cluster anymore. The planner will set up the systems in such a way that your apps/functions behave as expected, without the need to know anything about the underlying infrastructure – less knowledge needed, fewer errors made!

This is a shift in how we do orchestration and SLO management that could be of interest to this community: away from a monitoring and alerting way of looking at SLOs, towards a way of using SLOs to manage the system. Furthermore, this truly allows for ease of use for the user; rather than defining numbers and values based on domain and contextualized knowledge, we let them define what they truly care about: their SLOs!

Adriana Villela

Sr. Developer Advocate

Lightstep

Adriana is a Sr. Developer Advocate at Lightstep from Toronto, Canada, with over 20 years of experience in tech. She focuses on helping companies achieve reliability greatness through Observability, DevOps, and SRE practices. Before Lightstep, she was a Sr. Manager at Tucows. During this time, she defined technical direction in the organization, running both a Platform Engineering team, and an Observability Practices team. Adriana has also worked at various large-scale enterprises, including Bank of Montreal (BMO), Ceridian, and Accenture. At BMO, she was responsible for defining and driving the bank's enterprise-wide DevOps practice, which impacted business and technology teams across multiple geographic locations across the globe.

Adriana has a widely-read technical blog on Medium (https://adri-v.medium.com), which is known for its casual and approachable tone to complex technical topics, and its high level of technical detail. She is also an OpenTelemetry contributor, HashiCorp Ambassador (https://www.credly.com/badges/551d47a7-67cb-41bb-baeb-8c90f114f03a/public_url), and co-host of the On-Call Me Maybe Podcast (https://oncallmemaybe.com).

2023

Translating failures into SLOs

Downtime is hard and we can definitely be proactive about failure by following practices like Chaos Engineering and SLOs. But how do you translate failure to SLO? What learnings should you leverage from the incidents you’ve been through? How can you turn a bad thing (something like an outage or downtime) into a good thing (information you have to prevent or mitigate future outages)?
 
Join Ana Margarita and Adriana as they walk back from failure to reliability by leveraging SLOs, using examples from three outages from 2022 and 2023 that affected many of us.

Alayshia Knighten

Manager of Onboarding Eng

Honeycomb

Alayshia Knighten is an Engineering Manager of Product Training at Honeycomb with many years of experience in the DevOps realm. Alayshia specializes in enhancing technical and team-related experiences while educating customers on their journey with and beyond observability. In her words, “Getting shit done while identifying how to accelerate at the person beyond the tooling is the real meat and potatoes.” She enjoys solving the “so, how do we solve that?” problems and meeting people from all walks of life. Her tiny hometown and Southern background inspire Alayshia. In her spare time, she enjoys hiking, grilling, painting, and making random bird calls with her father.

2023

SLI Negotiation Tactics for Engineers

Service level indicators are quantitative measures of a service, which, in turn, are measured against SLOs. This is not the talk you think it is.

As engineers, we have our own SLIs, which are Survival Level Indicators, that measure and define if we are okay or not okay at a job. What happens when the rockstar engineer, who performs essential tasks A and B, hasn’t taken vacation in 9 months? Over time, not meeting SLIs can take its toll on engineers. How do we avoid burnout, turnover, and wider destruction in our teams?

In this session, I will review different strategies to identify human burnout versus company personal objectives. Engineers share the same importance as customers and we should provide technical love to them as well.

In this discussion, I will be talking about how we can improve ourselves and survive in the high-risk climate in tech. The talk provides engineers and managers with the courage to take care of themselves, the team, and others around them. Sometimes it is hard to identify when we or our friends are okay or not okay. In the course of this discussion, we will review how to identify the signs that say “Houston, we have a problem”, ways to better the problems we face, and overall strategies on strengthening who we are.

Aleksandra Dziamska

Engineering Manager

Nobl9

Aleksandra works as an Engineering Manager at Nobl9.

Her over 10-year journey in software development started with being a software engineer and moved towards team leadership and management. Throughout her career, she strived to focus on what she feels is most important (in IT as in life): people. Translating it to Engineering Manager dialect: lead the engineering team to deliver best value to the end users, effectively combining Product and Engineering priorities. She explores the way SLOs can help here.

2023

Product and Engineering Collaboration With SLOs

This talk presents a real-life scenario of using SLOs to manage product requirements and be smarter about allocating the engineering team’s focus. It will discuss an example of how SLOs helped monitor a potential problem instead of assigning an engineering team to jump into solving a complex issue, and how they helped focus on a specific area of a wider problem. In a broader sense, the talk will discuss the collaboration between the Product Manager and the Engineering Manager and how SLOs can lead to more productive conversations.

Alex Hidalgo

Principal Reliability Advocate

Nobl9

Alex Hidalgo is the Principal Reliability Advocate at Nobl9 and author of “Implementing Service Level Objectives.” During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex’s previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.

2023

Error Budgets for Conference Planning

Planning an event, no matter the size, can be stressful and complicated. Planning a hybrid conference, for example, includes having to ensure you end up with the right number of speakers, registrants, sponsors, and local venues. In this talk Alex will break down what it was like organizing SLOconf 2023, and how he used his experience setting reasonable targets to help guide him and everyone else involved along the way.

Alex Kudryashov

Lead software engineer

New Relic

These days I am leading a team that is developing Service Level Management in New Relic. I love solving challenges at the intersection of product and engineering, so I am creating tools for developers like myself.

2023

Adoption of SLs in New Relic: an iterative approach

In this talk, we will share our experience in promoting Service Level practice across a large organization with over 900 engineers. Learn about the challenges we faced, the strategies we used to encourage adoption, and the valuable lessons we learned along the way.

This talk is for you whether you're looking to implement SLs in your own organization or are simply interested in how to drive adoption of new engineering practices in general.

Alexandra McCoy

SRE Engineer & VMware Enthusiast

VMware

Alexandra is an SRE Engineer at VMware. She is passionate about Cloud Native, Open Source, and Reliability Engineering communities. Although VMware is home, she was introduced to the Cloud while at IBM - Public Sector and then transitioned into IBM Cloud. She later gained additional hybrid cloud experience at Diamanti, focusing on E2E product support for their Kubernetes-based appliance. She is excited about the industry's direction and hopes to contribute in a way that helps improve the cloud space.

2023

Reliability Enablement: Achieving Reliability with SLOs

At VMware, we've created a team of Cloud Reliability Engineers who support Tanzu SaaS engineering teams through technical reliability enablement projects. These projects help to measure the reliability of VMware SaaS services. Throughout the Reliability Enablement journey of implementing SLIs & SLOs, we found an increasing need to consistently manage these measurements in a collaborative and version-controlled way. This talk will provide a discussion of how we, the Tanzu SaaS Reliability Enablement Team, internally designed a technical process to measure, monitor, and alert on SaaS service performance as code, utilizing Tanzu Mission Control, Aria Operations for Applications and other Cloud Native & Open Sourced solutions.
 
Focusing on SLIs and SLOs helped us realize that in order to remain reliable we needed a reliable manner to consistently create and manage our SLOs. The TSRE team has focused on the SRE aspect of this by creating a suite of tools that will allow customers to easily create and manage SLI/SLO dashboards via Terraform. This is an industry-standard and efficient means to harness the full potential of Aria Operations for Apps. In addition to the Terraform tools, the TSRE team has also created a framework that allows users to create or use existing probes to feed these SLI/SLO dashboards. Our goal is to improve the ability of developers to produce quality products by making customer personas and their journey central to the practices and processes we implement.

Ana Margarita Medina

Staff Developer Advocate

Lightstep

Ana Margarita is a Staff Developer Advocate at Lightstep and focuses on helping companies be more reliable by leveraging Observability and Incident Response practices. Before Lightstep, she was a Senior Chaos Engineer at Gremlin and helped companies avoid outages by running proactive chaos engineering experiments. She has also worked at various-sized companies including Google, Uber, SFEFCU, and Miami-based startups. Ana is an internationally recognized speaker and has presented at AWS re:Invent, KubeCon, DockerCon, DevOpsDays, AllDayDevOps, Write/Speak/Code, and many others.

Catch her tweeting at @Ana_M_Medina about traveling, diversity in tech, and mental health.

2023

Translating failures into SLOs

Downtime is hard and we can definitely be proactive about failure by following practices like Chaos Engineering and SLOs. But how do you translate failure to SLO? What learnings should you leverage from the incidents you’ve been through? How can you turn a bad thing (something like an outage or downtime) into a good thing (information you have to prevent or mitigate future outages)?
 
Join Ana Margarita and Adriana as they walk back from failure to reliability by leveraging SLOs, using examples from three outages from 2022 and 2023 that affected many of us.

Anais Dotis-Georgiou

Lead Developer Advocate

InfluxData

2023

Harnessing the Power of time series databases and OpenTelemetry to Uphold SLO

In this talk, we explore the synergy between time series databases and OpenTelemetry standards, which together help organizations maintain their Service Level Objectives (SLOs) with precision and ease. Additionally, we will highlight the benefits of columnar-based data storage for high-performance queries and storage of logs, traces, and metrics.

Andrew Clay Shafer

Principal

Ergonautic

Andrew Clay Shafer evangelized DevOps tools and practices when DevOps was not a word before falling in love with SLOs in theory and practice. Living at the intersection of Open Source and Cloud Computing across two decades, they gained experience in every role in software delivery from support and QA to product and development. Andrew now focuses on engineering operable resilient socio-technical systems and communities as a founder of Ergonautic.

2023

Systems of Work: Socio-Technical SLOs

Service Level Objectives are often perceived as by SRE, for SRE, which limits the impact we can have on improving our systems because enforcing SLOs often collides with other priorities. Can we help others in the organization understand the value of improving the system? Can we apply SLOs to qualities of the system which other people already care about? This talk is a speed-run introduction to SLOs as a commitment to improve, using metrics on the work people do to build a coalition that will care about system reliability.

Andrew Howden

SRE Engineering Manager

Zalando

Andrew is a failed sports science student who wandered into software engineering by virtue of luck and the necessity to find a job in a hurry. Through the grace and patience of his colleagues, he has spent nearly the past decade learning how to be a software engineer, systems engineer, site reliability engineer and student of human factors. Most recently, he has been learning how to become an engineering manager and trying to pass on what knowledge he has gained so far to the next generation of software adventurers.

2023

Driving engineering priorities with service level objectives on critical business operations

I will talk through the details of how SLOs at Zalando have evolved from the initial implementation ("SLOs for everything!") to the challenge of ensuring SLOs have the organizational power to drive changes in engineering priorities, to the current design of "critical business operations" and SLOs on those operations.

I'll discuss how to address the "fast burn" SLO problem by leveraging distributed tracing to identify regression in the customer experience automatically. When those regressions are identified, we automatically identify and page the team best empowered to address them.

I'll discuss how to address the "slow burn" SLO problem through periodic operational review meetings, in which the SLOs are evaluated, and violations to the SLO (or slow burn issues) are allocated to an owner to investigate and address.

Lastly, I'll talk about challenges with the existing approach, including the difficulty of modelling event systems as a reliable flow, difficulty in rolling out more SLOs for non-customer-facing aspects of the organization and returning to service-specific SLOs.

Andrew Newdigate

Distinguished Engineer

GitLab Inc.

Andrew is a seasoned engineer with over two decades of experience in software development and reliability engineering. As a Distinguished Engineer at GitLab, he is responsible for the reliability and availability of GitLab's SaaS properties: GitLab.com and GitLab Dedicated. He is a strong advocate for using SLOs, error budgets, and observability data to drive change and manage technical debt. Previously, Andrew co-founded the developer community site Gitter in 2012, where he served as CTO until its acquisition by GitLab in 2017.

2023

Tamland: How GitLab.com uses long-term monitoring data for capacity forecasting

For any large scale production system, the ability to effectively forecast potential capacity issues is crucial for the smooth functioning of the environment. With a reliable prediction, teams can proactively plan ahead, implement necessary scaling changes in a controlled manner and avoid unexpected availability issues that can cause stress and harm to the system.

Before implementing Tamland, the capacity planning process at GitLab.com was ad-hoc, and relied heavily on manual processes and intuition. Unfortunately, this approach often resulted in oversights, with issues going unnoticed until it was too late, sometimes only surfacing when site availability was impacted.

This talk delves into how GitLab leveraged the power of statistical analysis to greatly improve its capacity planning process. The session will be a practical demonstration of how we analyse long-term metrics data using Meta’s Prophet library to build sophisticated forecast models.

Tamland, the capacity planning tool built by GitLab, is an open-source project and attendees will have access to the source code if they're interested in exploring the implementation in greater detail. This session is for anyone interested in learning about forecasting libraries such as Prophet, Greykite, or NeuralProphet, and how they can be integrated into an observability system to provide greater insight into the health of a system.
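To make the idea of capacity forecasting concrete, here is a minimal sketch in Python. It is not Tamland's implementation and it does not use Prophet: it swaps in a plain least-squares linear trend as a simple stand-in, and the function names, the daily sampling, and the saturation threshold are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def fit_linear_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * day + intercept over days 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(values: Sequence[float], threshold: float) -> Optional[float]:
    """Days after the last sample until the trend line reaches `threshold`.

    Returns None when the trend is flat or falling, and 0.0 when the
    trend line has already reached the threshold.
    """
    slope, intercept = fit_linear_trend(values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (len(values) - 1))


if __name__ == "__main__":
    # Hypothetical daily disk utilisation in percent.
    utilisation = [50.0, 52.0, 54.0, 56.0, 58.0]
    print(days_until_threshold(utilisation, threshold=80.0))
```

A real forecasting library would also model seasonality and uncertainty intervals, which a straight line cannot capture; the sketch only shows the "when do we hit saturation?" question that such a forecast answers.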

Ashley Chen

Software Engineer

Datadog

Ashley is a software engineer on the SLO team at Datadog. When she’s not working, she enjoys mentoring future engineers at Emergent Works and exploring the transit history of New York City.

2023

How I learned to stop worrying and love burn rates

Life can feel bleak when you’re paged at 3 in the morning for an error rate monitor that has once again exceeded its threshold. You wonder about the purpose of your microservices and the impact this alert will have on your sleep schedule. We talk about building reliable services and web platforms, but what can we do to build reliable monitoring for our reliable services?

Part of building the infrastructure for SLOs at Datadog includes putting SLOs into practice. As an engineering team, we have seen the direct impact of utilizing burn rate alerts over traditional threshold alerts. Our story starts with understanding the purpose of our alerts. Though these monitors have well defined runbooks and technical implications, they do not fully capture the impact of these errors on our users. In this talk, I will discuss the process we took to replace some of our threshold alerts with burn rate alerts and how we were able to quantify the urgency of service degradation by alerting at different burn rates. This transition has driven the balance of reliability and development work for the team, which has led to more reliable services and better nights of sleep.
 
The talk will cover our engineering team’s process of evaluating threshold alerts vs burn rate alerts. We looked at how noisy they were, how unclear the customer impact was, and how often they caused alert fatigue.

We will then tell the story of our implementation of burn rate alerts, deciding which ones to use and comparing them to threshold alerts. We discovered that they were more reliable and triggered less often. One example is that we've seen our burn rate alerts trigger when they see dependencies failing whereas that didn't happen with threshold alerts. Burn rate alerts ended up reducing our alert fatigue and late night pages due to being more reliable and building trust in our systems on the team.

We learned that paging at high burn rates captures when human intervention is needed to resolve customer impact. In contrast, low burn rates help us anticipate short-term impact. More discussion can come around by looking at these alerts in reviews and retros. We can change the way that we maintain the reliability process of our team and also in return actually see the number of pages decrease and see the service become more reliable.
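For readers new to the term, a burn rate is the observed error rate divided by the error budget the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below illustrates that arithmetic and the idea of alerting at different burn rates. It is not Datadog's implementation; the function names and the two thresholds are hypothetical values chosen for illustration.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A burn rate of 1.0 spends the budget exactly over the SLO window;
    higher values exhaust it sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # allowed fraction of bad events
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget


# Hypothetical thresholds for illustration; tune them to your own SLO window.
PAGE_BURN_RATE = 14.4   # fast burn: page a human
TICKET_BURN_RATE = 1.0  # slow burn: review during working hours


def classify(rate: float) -> str:
    """Map a burn rate to an alert severity."""
    if rate >= PAGE_BURN_RATE:
        return "page"
    if rate >= TICKET_BURN_RATE:
        return "ticket"
    return "ok"


if __name__ == "__main__":
    # 20 failed requests out of 1000 against a 99.9% SLO.
    rate = burn_rate(bad_events=20, total_events=1000, slo_target=0.999)
    print(round(rate, 1), classify(rate))
```

Unlike a fixed error-rate threshold, the same 2% error rate maps to different urgencies depending on the SLO target, which is what ties the alert to user impact.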

Christian Long

Senior Software Engineer & Skeeball Champion

3M

Christian has been rolling skeeball competitively in Brooklyn, NY for 11 years and nationally for 7 years. He started out as a straightforward 40-roller, as is both the conventional recommended starting approach and a widely regarded standard for high level competition. Eventually he started dabbling in hybrid rolling and came to develop a fine-tuned highly tactical strategy that minimizes risk and has virtually no ceiling, enabling him to compete with the best rollers in the world.

2023

SLOs & the Game of Skeeball

Using SLOs to craft a potent, advanced, high-performing strategy in competitive skeeball.

Skeeball is mostly thought of as a kids' game, but it has become a competitive endeavor, complete with its own vast array of surprisingly complex strategies. Classic simple strategies involve either the conservative "middle rolling" (going only for the 40 pocket) or the aggressive "hundo rolling" (going only for the hundred pocket). A more nuanced strategy has evolved that is known as "hybrid rolling", which seeks to incorporate hundos while mitigating risk by switching to the 40 cup at opportune times. This talk will explain how SLOs can be applied to craft this strategic approach in order to create optimal winning conditions against a wide variety of opponent styles and skill levels.

Dan Venkitachalam

Software Reliability Engineer

Atlassian

Dan is a veteran software engineer and technical manager with over 20 years of experience. He currently works on the Tome team at Atlassian, helping internal organisations to define and achieve their operational goals with SLOs.

2023

Terraforming SLOs (SLO automation at Atlassian)

How Atlassian uses Terraform and Configuration as Code to maintain SLOs across software teams.

Tome is our internal platform for managing, reporting and alerting on SLOs. A design goal was to enable SLOs to be defined with Configuration as Code. This has become the primary way that SLOs are maintained within Atlassian. Working with SLOs this way has many benefits:

  • Enforces consistency in how we organise, define and validate SLOs
  • Changes are tracked and attributed through a version control system
  • Updates can be deployed as part of existing continuous integration and delivery pipelines

We have written a custom Terraform plugin for provisioning SLOs, which interfaces with Tome's backend API. The plugin:

  • Optimizes deployments by tracking deployment state and applying diffs only
  • Performs custom validation on configurations

In this talk I'll run through our implementation of the plugin and associated API, and how it compares to other SLO configuration systems.
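The two behaviours described above, validating configurations and applying only the differences against tracked state, can be sketched in a few lines of Python. This is an illustration only, not Tome's API or the Terraform plugin itself; the field names (`name`, `owner`, `target`) and both function names are hypothetical.

```python
from typing import Any, Dict, List

SLO = Dict[str, Any]


def validate_slo(slo: SLO) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field in ("name", "owner", "target"):
        if field not in slo:
            errors.append(f"missing field: {field}")
    target = slo.get("target")
    if target is not None and not (
        isinstance(target, (int, float)) and 0 < target < 100
    ):
        errors.append("target must be a percentage between 0 and 100")
    return errors


def plan_changes(desired: List[SLO], deployed: List[SLO]) -> Dict[str, List[str]]:
    """Compare desired config with deployed state and list only the differences."""
    want = {slo["name"]: slo for slo in desired}
    have = {slo["name"]: slo for slo in deployed}
    return {
        "create": sorted(want.keys() - have.keys()),
        "update": sorted(
            name for name in want.keys() & have.keys() if want[name] != have[name]
        ),
        "delete": sorted(have.keys() - want.keys()),
    }


if __name__ == "__main__":
    desired = [
        {"name": "api-availability", "owner": "team-a", "target": 99.9},
        {"name": "api-latency", "owner": "team-a", "target": 99.0},
    ]
    deployed = [
        {"name": "api-availability", "owner": "team-a", "target": 99.5},
        {"name": "legacy-checkout", "owner": "team-b", "target": 95.0},
    ]
    for slo in desired:
        assert validate_slo(slo) == []
    print(plan_changes(desired, deployed))
```

Because the definitions are plain data, the same checks can run in a continuous integration pipeline before anything is deployed, which is where the version-control and consistency benefits listed earlier come from.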

Daniel Golant

Senior Software Engineer

Daniel Golant is a software developer based in New York City with an interest in how observability can increase engineering leverage. When not speaking you can find him watching tech talks on the treadmill or playing dominoes on the Lower East Side.

2023

Seeing Like A State: SLOs From The C-Suite

A brief, humorous foray into what it might look like if executives had to set SLOs for a massive and widely used service like Facebook (Blue App) from the *top* of the hierarchy.
 
We will cover both how *I* would recommend doing it, from requirements gathering, to the resources a C*O has available compared with what the average EM might, to the possible secondary effects of setting overly tight objectives. We will cover thorny questions like tradeoffs involving potentially degrading the experience of "less valuable" users to benefit "more valuable" users. What impact would this have on growth? What factors would we consider when making such a call?
 
We will close with a few thoughts on whether this sort of top-down orchestration is desirable or feasible, and with a humorous review of how this might *actually* play out if implemented.

David Bartok

Software Engineer

Meta

David is a Software Engineer at Meta. He is currently working in the Monitoring space, primarily focused on SLICK, the company’s SLO tracking platform. Previously, he was part of the GraphQL team at Meta, where he optimized cache performance and efficiency. Before joining Meta, David worked at Bloomberg as a full-stack developer in the Mobile Market Data team.

2023

SLICK: SLO Reviews at Meta

SLICK is our reliability tracking platform at Meta, pioneering an SLO-focused culture across the company. While we have been very successful in onboarding teams to SLICK, we started to notice that a significant number of teams only got limited value out of their SLOs after the initial onboarding.

In order for SLOs to be useful, the whole team needs to adopt them, use them regularly and retrospect on them frequently. As an initial attempt to help socialize SLOs, we built various integrations into SLICK. This includes periodic reports in our internal work groups to increase the visibility of SLOs, and collaborative data annotations to enable retrospecting on the root causes of SLO violations.

Bringing all of the above together, we will present our brand new SLO review tooling. This provides a structured workflow to have meaningful discussions about SLOs and identify follow-ups, enabling teams to get the most value out of SLOs.

Deepak Kumar

Senior Cloud Infrastructure and Devops

Zenduty

I'm a Senior Cloud Infrastructure and Devops Engineer at Zenduty - an incident management and response orchestration platform, trying my best to make sure that every service and application at our org is secure, reliable and accessible 24/7. I have experience working with and am passionate about cloud services, orchestration engines, enterprise networking, observability platforms and figuring out how they work best together. Looking to talk about my experiences and how we manage mission critical operations at an organisation that has no room to fail.

2023

Kubernetes Monitoring - Choosing Optimal Metrics for SLO Alerting

A well performing monitoring system needs to answer two simple questions: “What’s broken, and why?”. Monitoring allows you to watch and understand your system’s state using a predefined set of metrics and logs. Observability, on the other hand, is about bringing visibility into the system - essentially turning the lights on, to see and understand the state of each component of your system, and to discover the answer to the ‘why’ part of the problem.
 
Building an efficient and battle-tested monitoring platform usually takes quite a while. You need to learn over a period of time how your system performs on various fields before you can accurately know which metrics to monitor for prompt alerting that helps predict unavoidable incidents, meet your SLOs and in turn prevent downtime.

We have analysed the incident data of over 150 highly active organisations deploying Prometheus Alertmanager to monitor their Kubernetes infrastructure, and have discovered some unusually common yet fatal mistakes made when choosing SLO metrics, as well as some clever configurations that drastically reduce noise.
 
This talk aims to give you a run-through of best practices and ‘what not to do’ when choosing Prometheus metrics for clean and noiseless alerting.

Derek Osborn

Incident, Problem and Service Level Manager

Flexera

As the Senior Incident, Problem and Service Level Manager for Flexera, I have been building and expanding the function since joining the company nearly 4 years ago.

2023

Flexera's SLO Journey - from DIY to NOBL9

I'll cover Flexera's journey from our internally developed SLO solution to partnering with NOBL9, including how we engaged teams to help develop SLOs. I'll also cover how SLOs are now part of our engineering goals for 2023.

Devin Cunningham

Software Engineer

Procore

2023

SLOs as code

By managing Service Level Objectives as code we can co-locate SLO definitions and ownership with the product code and team. This supports horizontal scaling of SLO ownership, while establishing a single source of truth, adding transparency, integrating with the code management process, and creating an audit trail for SLOs.

The talk will be based on this article published on Procore's Engineering blog: https://engineering.procore.com/implementing-slos-as-code-a-case-study-2/

Emily Gorcenski

Lead Data Scientist

Thoughtworks

Emily has over ten years of experience in scientific computing and engineering research and development. She has a background in mathematical analysis, with a focus on probability theory and numerical analysis. She is currently working in Python development, though she has a background that includes C#/.Net, Unity3D, SQL, and MATLAB. In addition, she has experience in statistics and experimental design, and has served as Principal Investigator in clinical research projects.

2023

A "moving SLO" for machine learning

For microservices, we have fairly concrete SLOs and ways to measure them, such as latency and availability. However, products with embedded machine learning artifacts add another dimension. This occurs because data drifts, and we need to be able to detect that data drift. There are existing analogues in control theory, and these analogues should inspire us to create an improved vision of how to better design SLOs for AI-integrated systems that adapt better to the world. I'm addressing this topic as a former control systems engineer and computational mathematician, and while I won't have time to dive deeply into the mathematics, I will get very technical very quickly. Nevertheless, there will be plenty of metaphors and examples to root this in an understandable frame, even for those without advanced mathematics backgrounds.

Eric Moore

Ex-chemist SRE

Formerly a computational chemist, Eric tries to bring some of those skills into SRE-land.

2023

Confident Rare SLOs

SLO-based measurements on rare events can be quite noisy, since each event moves the needle more. In this talk we'll cover the relevant statistics, and go over some ways to apply them when setting SLOs, interpreting SLO measurements, and setting alerting thresholds.
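
To illustrate why rare events are noisy, here is a minimal sketch that puts a confidence interval around a measured success ratio. It assumes the Wilson score interval as the statistic, which is one common choice and not necessarily the method the talk uses. The function name `wilson_interval` and the sample counts are hypothetical.

```python
import math


def wilson_interval(good, total, z=1.96):
    """Wilson score interval for the true success ratio behind good/total events.

    z=1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if total <= 0:
        return (0.0, 1.0)  # no events observed: the ratio is unconstrained
    p = good / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Same observed 95% success ratio, very different certainty.
few = wilson_interval(19, 20)
many = wilson_interval(1900, 2000)
print("20 events:   %.3f to %.3f" % few)
print("2000 events: %.3f to %.3f" % many)
```

With only 20 events the interval is wide, so a single failure can move the measured SLI across a threshold without the underlying reliability having changed.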

Frances Zhao-Perez

Senior Director of Product Management

Salesforce

Frances is the Senior Director of Product Management at Salesforce, leading the Monitoring Cloud and Service Ownership platform investment. Prior to joining Salesforce, Frances was the VP of Product Management at New Relic, responsible for driving the APM business, and the head of product management at AWS Marketplace, driving key initiatives. She also spent 16 years as a Senior Director of Product Management at Oracle running the middleware business.
2023

Measuring What Matters: SLOs Help to Pursue Customer Happiness

It’s all about the customer. In this session, I will use real-world scenarios to discuss the importance of SLOs to help us set actionable business goals, measure them, and stay on target.
 
We will deep-dive into the difference between monitoring and observability, how SLOs have changed the observability business, and why the SLO is the MVP of observability. We will also discuss how to leverage the error budget to prioritize our investment, identify positive-impact opportunities that go beyond the baselines, and balance our roadmap towards the business goals.

Fred Moyer

Engineering Geek

Fred is an Observability SRE in his day job and doesn't get to ride his bike 10,000 kilometers a year like he used to, so now he relies on science to help him stay in shape.

2023

The Body's Error Budget; SLOs for healthy eating

This talk might sound a bit unusual, but it's a chronicle of a real-life challenge that I've faced a few times in my life and have been focused on for the past few months. My desire to see my kids well into adulthood has driven my inner nerd to put my SLO and technical chops to work and try to live a healthier life.
 
I'll talk about how I've used a food tracker to meet my own personal health goals, yet still be able to enjoy some of my favorite indulgences on a regular basis. I'll share my nutrition SLOs, and talk about why they are that way given some biochemistry that I've learned about how food has changed in society over the last several dozen (and perhaps a couple hundred) years.

Greg Arnette

Co-founder & CPO

CloudTruth

Greg is co-founder & CPO of CloudTruth. Prior, Greg was the founder & CTO of three cloud / SaaS companies in the data protection market.

2023

The Hidden (Config) Tax Affecting Your Uptime SLO

The presenters interviewed over 1000 DevOps leaders to understand the role "config sprawl" plays in meeting uptime SLOs.
 
The startling (but not too surprising) conclusion is that most teams struggle managing secrets and configs at scale for infrastructure and applications.
 
The presenters will introduce a new concept for managing config called "The 7 Factor Config" principles, which describes a way of managing secrets and config that, when followed, enables companies to deploy reliably, scale quickly and reduce unplanned downtime and security incidents.

Gwen Berry

Site Reliability Engineer

IAG

Junior Site Reliability Engineer, working in an SRE enablement team at IAG.

2023

Reliability Benchmarking: A Pre-cursor to SLO Adoption

When we first attempted to implement SLOs we took a "theory first" approach. We developed workshops and ran sessions to uncover the key users, services, indicators and objectives for a platform or application. But we failed. We didn't identify meaningful SLOs, track them, or define error budgets. We also failed to garner interest or investment from the team.
 
Taking a step back, we tried a different approach. We got access to their observability data alongside other sources of information (e.g. incidents) to build a picture of where the team was currently at in terms of reliability and operational maturity.
 
This new approach was much more successful in getting that initial spark of excitement. By providing actionable insight up front, we were able to start the SLO and SRE conversation off the right way.
 
In this talk I will share our experience and process for benchmarking reliability, and how this could be leveraged to begin SLO adoption in a complex organisation.

Hazel Weakly

Infrastructure Team Lead

2023

Motivating SLOs Mathematically

Have you ever wondered if there's something behind the experiential knowledge that we hold as best practices? I've noticed that things that "feel" right can often be connected together, and the connection between SLOs and observability feels right, like there's something deeper underneath.

So, I've been digging into relationships between SLOs, observability, known knowns, unknown unknowns, cardinality, entropy, and more. There are still a lot of details to work out, but what I present in this talk is a rough overview of where I'm at so far when it comes to motivating SLOs from a more interconnected perspective.

Hezheng Yin

Co-founder & CTO / Creator

Apache DevLake

Merico

Hezheng is a perceptive and persistent pioneer in applying technology to make the world a better place. At Merico, he leads the engineering and research team to build innovative algorithms to help developers quantify the impact of their work. Before this, his research focused on empowering the next generation of education technology with artificial intelligence and machine learning. Hezheng got his bachelor's degree from Tsinghua University and was pursuing his Ph.D. in computer science at UC Berkeley.

2023

Creating and Tracking SLOs that Empower Developer Happiness and Productivity

In this session, startup CTO and creator of Apache DevLake, Hezheng Yin will introduce a data-driven approach to improving developer happiness and productivity. The speaker will make the case for establishing SLOs that support this approach and introduce the SPACE framework for developer productivity. You will see a fast and practical implementation of this framework using Apache DevLake, an open-source solution. This session is designed to provide attendees with actionable solutions and knowledge to establish SLOs targeting critical, yet previously ambiguous, concepts such as culture, collaboration, and flow.

Imaya Kumar Jagannathan

Principal Solution Architect

AWS

Imaya Kumar Jagannathan is a Principal Solution Architect at Amazon Web Services. He specializes in Observability and talks to hundreds of customers each year, helping them design, architect and optimize their Observability environments. He mentors and leads AWS Solution Architects around the world and has created several projects such as the One Observability Workshop, AWS Observability Best Practices Guide and the AWS Observability Accelerator to help AWS customers with their Observability needs. He speaks at various conferences such as re:Invent, KubeCon, AWS Summit etc. His past experience includes working at Microsoft in Consultant and Program Manager roles for several years.
2023

Why are SLOs important? - SLOs in the world of efficiency

In this session, I will talk about FOMO-driven vs SLO-driven monitoring and how SLO-driven monitoring helps improve efficiency in terms of cost, time and resources. I will lay out how the journey differs between these two approaches and help you understand why being intentional about what to monitor has a big influence on the overall health of observability in your company.
 

Ioannis Georgoulas

Director of SRE

Paddle.com

Ioannis is the Director of SRE at Paddle.com. He is an SLO evangelist and practitioner with an obsession with measuring anything that matters to the users and the business.

2023

How you SLO your SLOs?

In this talk, I will cover some metrics and signals that you can use to understand if your SLO framework and culture are performing and at what level.
 
These metrics will be used to measure (and SLO) your SLOs' performance and impact on your (internal) users, business and overall reliability culture.

Jason Greenwell

SRE Leader

Ford - Model e

Jason is an SLO and developer experience advocate who has held a number of technical leadership positions at Ford and Ford Credit over the past 20 years. He is currently heading up SRE for Model-e's Cloud Platform, driving SLO adoption and SRE culture through the org.

2023

Promise Theory and SLOs

A discussion of the importance of explicitly stating performance expectations, and of tracking performance against those expectations, through the lens of Promise Theory as it applies to SLOs. Tracking and understanding these promises is critical to reducing the complexity of a highly distributed microservices ecosystem.

Jeff Martens

CEO & Co-Founder

Metrist

Jeff has built observability products and developer tools for more than 12 years. The first company he founded, CPUsage, was a pioneer in the serverless computing space before AWS Lambda existed. Later he joined New Relic pre-IPO to focus on new products. There he served on the team creating the company’s high-performance event database, before leading Real User Monitoring and growing the product into the company’s 2nd largest revenue generator. Jeff then joined PagerDuty pre-IPO where he worked on designing, building, and launching a suite of business analytics products. Jeff is an alumnus of the University of Oregon and works between Portland, Oregon and the San Francisco Bay Area.

2023

Managing SLOs & SLAs when your app is built on other apps

The average digital business uses 137 cloud products, with 40-50 typically powering the company's product. On top of that, as much as 70% of customer-facing downtime can be tied back to a cloud dependency outage.
 
If we want to operate resilient systems, it is not enough to simply rely on systems with impressive SLAs. In short, an app with 4 cloud dependencies offering 99.9% uptime cannot itself offer 99.9%.
 
In this short talk, we'll cover some examples of how cloud dependency uptime, and our dependencies' dependencies, can impact our own reliability, and what we can do to understand the risk and manage it better.
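
The arithmetic behind the "4 dependencies at 99.9%" claim can be sketched as follows. This is a simplified model that assumes every dependency is a hard (serial) dependency and that failures are independent. Real systems with fallbacks or correlated outages will behave differently. The function names are hypothetical.

```python
def composite_availability(own, dependencies):
    """Best-case availability of a service that needs all dependencies up.

    Assumes independent failures and no fallbacks, so availabilities multiply.
    """
    result = own
    for availability in dependencies:
        result *= availability
    return result


def downtime_minutes(availability, window_days=30):
    """Expected downtime in minutes over a window at the given availability."""
    return (1 - availability) * window_days * 24 * 60


# Even if our own code never fails (own=1.0), four 99.9% dependencies
# cap what we can offer below 99.9%.
ceiling = composite_availability(1.0, [0.999] * 4)
print("availability ceiling: %.4f%%" % (ceiling * 100))
print("downtime per 30 days: %.1f min vs %.1f min at 99.9%%"
      % (downtime_minutes(ceiling), downtime_minutes(0.999)))
```

Under these assumptions the ceiling is about 99.6%, which is why a 99.9% promise cannot rest on four 99.9% dependencies alone.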

Jessica Kerr

Engineering Manager of Developer Relations

Honeycomb

Jessica Kerr is a developer of 20 years, conference speaker of 10, and ringleader of a household containing two teenagers and their cats. She works and speaks in TypeScript, Java, Clojure, Scala, Ruby, Elm etc etc. Her real love is systems thinking in symmathesy (a learning system made of learning parts). She works at Honeycomb.io because our software should be a good teammate and teach us what is going on. If you're into sociotechnical systems, find her blog and newsletter at jessitron.com.

2023

Evolving SLOs at Honeycomb

SLOs are part of our product, so we've cared about them for a long time. We think really hard about how we use them (especially Fred Hebert, Staff SRE, who is co-author and possibly co-presenter).
 
Our practices have changed, as we regularly re-evaluate each SLO, trading off alert fatigue against customer experience.
 
We also know something about how our customers use SLOs, so we know that other companies could benefit from the kind of thought Honeycomb's SRE team puts into this.

Justin Hoang

Software Engineer

Procore

Software Engineer, miniature painter, and avid note taker.

2023

SLOs as code

By managing Service Level Objectives as code we can co-locate SLO definitions and ownership with the product code and team. This supports horizontal scaling of SLO ownership, while establishing a single source of truth, adding transparency, integrating with the code management process, and creating an audit trail for SLOs.

The talk will be based on this article published on Procore's Engineering blog: https://engineering.procore.com/implementing-slos-as-code-a-case-study-2/

Kayla Annunziata

SRE Platform Development Sr Mgr

Capital One

Kayla is an Enterprise SRE Platform Development Sr Mgr at Capital One, where she drives adoption of SRE best practices across all Capital One applications.

Before joining FinTech, she was a Software Engineering Manager at Lockheed Martin supporting the space industry, developing reliable flight software products for NASA’s Orion spacecraft. Its Artemis 1 mission broke the record for the farthest distance from Earth traveled by an Earth-returning human-rated spacecraft by ~20,000 miles.

2023

Interpreting Error Budget Signals

As an Enterprise SRE Team at Capital One we are on a journey to consistently measure reliability through practices such as adopting Service Level Objectives (SLOs). In partnership with our internal beta teams we identified clear Error Budget Signals that demonstrate an application’s health through periods of degradation and recovery.
 
Identifying an Error Budget Signal:
Some key indicators we leverage to analyze our reliability and reinforce in our Error Budget Policy are primarily related to our Error Budget Remaining (EBR) value and the trend of that EBR over time. Based on the slope of our EBR over a rolling window (usually 30 days), the key signals we seek to interpret are negative slope (degradation of our EBR), positive slope (EBR recoveries) and an EBR value of 0 (or negative).
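
The signals described above can be sketched in a few lines. This is an illustrative sketch only: it assumes a common definition of error budget remaining (1 minus bad events divided by allowed bad events) and a least-squares slope over daily EBR samples. Neither is confirmed as Capital One's method, and the function names, the "steady" label and the tolerance are hypothetical.

```python
def error_budget_remaining(good, total, target):
    """Fraction of error budget left: 1.0 is untouched, 0 or below is exhausted."""
    if total == 0:
        return 1.0  # no traffic, so no budget has been spent
    allowed_bad = (1 - target) * total
    return 1 - (total - good) / allowed_bad


def slope(values):
    """Least-squares slope of equally spaced samples (EBR change per sample)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def classify_signal(ebr_series, flat_tolerance=1e-3):
    """Map an EBR series to exhausted, degrading, recovering or steady."""
    if not ebr_series:
        raise ValueError("need at least one EBR sample")
    if ebr_series[-1] <= 0:
        return "exhausted"
    trend = slope(ebr_series)
    if trend < -flat_tolerance:
        return "degrading"
    if trend > flat_tolerance:
        return "recovering"
    return "steady"


print(classify_signal([0.9, 0.8, 0.7]))   # falling EBR
print(classify_signal([0.2, 0.3, 0.5]))   # rising EBR
print(classify_signal([0.3, 0.1, -0.2]))  # budget spent
```

A real policy would likely also look at how steep the slope is, to separate fast burns from slow ones.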
 
Interpreting an Error Budget Signal:
Data without a clear way to interpret it is every engineer’s nightmare (or problem solving dream?). We seek to aid our teams in interpreting their Error Budget Signals and correlate potential drivers to the EBR trends. By having clear EBR signals indicated we can begin to overlay events that could be contributing factors and merit investigation. Is it a change related burn? Did it occur at the same time as a chaos engineering event? Have we lapsed a full time window where a prior incident is no longer negatively affecting our rolling EBR?
 
Actioning from an Error Budget Signal:
As an SRE team, we advise teams how to react to fast degradations (incident alert!) and slow burning ones, or even EBR recoveries (Party On Wayne!). This establishes an error budget policy to balance the reliability of our applications while continuing to release new features.
 
We’ll share with you some real examples of what we’ve seen in action and how we have advised our application SRE teams to investigate and actionably improve the reliability of their services as a result of these signals. In sharing this we hope that other teams can help foster buy-in to SLOs as an SRE practice area and support where they work today by incorporating an Error Budget Policy to benefit their applications’ reliability.

Keri Melich

Site Reliability Engineer

Nobl9

Keri is an SRE working to help scale and secure the Nobl9 platform. Before that, she worked in DevOps building secure and scalable solutions for internal users. She is also passionate about building a safe and diverse workplace and spends her free time hiking, woodworking, and 3D printing.

2023

Zero to SLO Hero: Part 1

"Zero to SLO Hero" is a two-part talk that will guide you through the journey of implementing Service Level Objectives (SLOs) in your organization. Part 1 of the talk will cover the basics of SLOs, including what they are, why they are important, and how they can help you measure and improve the reliability of your services. In Part 2, we'll dive deeper into the practicalities of implementing SLOs in your organization with examples in Prometheus. We'll also cover some common pitfalls to avoid when implementing SLOs, and how to overcome them. By the end of this two-part talk, you'll have a solid understanding of what SLOs are, why they matter, and how to implement them in your organization. You'll be well on your way to becoming an SLO hero and improving the reliability of your services!

2023

Zero to SLO Hero: Part 2

Kyle Forster

Founder

RunWhen

Kyle is the founder of RunWhen, a new company building a platform for "Social Reliability Engineering". Prior to RunWhen, Kyle was a Sr Director of Product Management in Google Cloud's AppMod business unit (Kubernetes).
2023

SLOs with Teeth: Partnering with Product Management

Many SLO initiatives are started by SREs for SREs. Sometimes these get traction, sometimes they do not.
 
In this talk, we'll outline a different approach to get SLO traction from personal experience now replicated by a number of our customers. The results have consistently led to broad adoption of SLOs and rapid increases in executive support.
 
The first step: partnering with product managers to define 2-3 error budgets that they see as critical inputs to their forecasts.
 
The second, third and fourth step will be relayed in the talk.

Lukasz Dobek

Software Engineer

Nobl9

Łukasz Dobek is a Software Engineer that works with cloud-native technologies on a daily basis. He strives to be language-agnostic and to treat programming languages as tools. Most of the time, you can find him building software with Go, JavaScript, or Python. He has experience in DevOps which certainly helps him develop and implement practical, effective, and easy-to-maintain solutions. Working at scale is another thing he can share his knowledge about, be it Kubernetes or Serverless architecture.

Currently, he’s developing the Service Level Objectives platform at Nobl9, helping to make a cultural shift to the Site Reliability Engineering mindset.
2023

Non-Conway’s Game of SLOs

In this talk, I want to point out the importance of SLO evolution over time. I will be comparing Conway's Game of Life, which is a zero-player game, to the classic SLO approach, where users, after creating an initial configuration, redesign and rework it. Assumptions and service requirements can change, and SLOs should reflect those changes.

Marcus Merell

VP of Technology Strategy

Sauce Labs

2023

Functional Testing & SLOs - Together at Last!

SLOs govern org-wide expectations for how your software runs in production, but how do you know it's actually working at a functional level? Configuration, user analytics, and data flows are all highly engineered code, but they generally aren't treated as such. So how do you incorporate testing in the modern world of SRE?

Join Marcus for a quick story about how testing can raise early warnings about SLOs that might be slipping--and how to preserve your error budget for only the highest-risk concerns.

Matthias Loibl

Senior Software Engineer

Polar Signals

Matthias Loibl is a Senior Software Engineer who works on cloud-native observability at Polar Signals, previously at Red Hat and Kubermatic, and is a maintainer of many projects like Prometheus, Thanos, Prometheus Operator, and Parca. He enjoys working on Distributed Systems with Go and gRPC.
2023

Second Day Operations for SLOs

Once you have implemented SLOs for your organization how do you move forward?
At Polar Signals, we have quarterly SLO reviews. We first do a retrospective, discussing where we did well and where we could have improved. For the upcoming quarter, we discuss the OKRs and derive SLOs from them. Sometimes SLOs are easy to derive from the OKRs, and sometimes the OKRs need to be rephrased to make sense as SLOs.

Matthias will walk you through an example of an SLO that we implemented quarters ago and how it changed over time. The example will showcase the SLO tracked in the open-source Pyrra project which makes SLOs with Prometheus manageable, accessible, and easy to use for everyone.

Max Knee

Staff Software Engineer

The New York Times

Software Engineer working in the developer productivity space, ensuring teams deliver reliably and efficiently.
2023

Use SLOs to manage your day

I'm pretty bad at time management, so I was looking into ways to improve that part of my life.

I turned to a lo-fi way to use SLOs to manage my day by turning my tasks and other things I need to do during the day into SLOs.

It's increased my productivity, and I am interested in sharing it with others!

2023

Engaging with your Customers in your SLO Journey

You've already sold your organization on SLOs, now it's time to sell them to your customers. But instead of pitching, this is a collaborative exercise to ensure that they understand your system and you understand their needs.

Measuring and having SLIs on latency could help, but what if your customers care about correctness?
In this talk, we'll discuss how to proactively meet your customers' needs better: because your SLOs will mirror their expectations, customers will less often need to ask whether there's an issue.

With this approach, you can reduce alert fatigue and improve the customer experience by making them happier and increasing trust in your system.

Michael Knox

Platform SRE Team Lead

ANZx

Wide range of SRE, Platform engineering, and Development roles over 25 years across Banking, FinTech, Transport and Communications, for organisations including ANZ, Boeing and NEC.
2023

Type 1 Diabetic management and SLOs

In this talk, I'm looking at parallels, lessons and inspiration from Type 1 Diabetics tracking and responding to their Blood Glucose Levels, and SLOs in an IT system context. Type 1 Diabetics are always on-call, with potentially life-threatening ramifications for events that they respond to on a daily basis.
 
A friend of mine, and my wife, are both Type 1 Diabetics; there are parallels and lessons that we can draw from their experience of being on-call 24x7; with Monitoring, burn rate predictions, Incidents, Alert Fatigue, Problem Management activities all playing a core part of their lives.

Natalia Sikora

Product Manager

Nobl9

Natalia is a Product Manager at Nobl9. She enjoys collaborating with cross-functional teams to solve complex problems for customers. Before joining Nobl9 in the noble pursuit of reliable software, she spent 10 years developing, publishing, and managing various products for one of the world’s largest educational companies. Outside of work, you can find her hiking in the mountains, working on another art piece at a printing workshop, or playing video games.

2023

Product and Engineering Collaboration With SLOs

This talk presents a real-life scenario of using SLOs to manage product requirements and be smarter about allocating the engineering team’s focus. It will discuss an example of how SLOs helped monitor a potential problem instead of assigning an engineering team to jump into solving a complex issue, and how they helped focus on a specific area of a wider problem. In a broader sense, the talk will discuss the collaboration between the Product Manager and the Engineering Manager and how SLOs can lead to more productive conversations.

Neil Pagaduan

Manager of Technology & Engineering

Cox Edge

Neil Pagaduan is the Manager of Technology and Engineering at Cox Edge. He has been with Cox Communications for over 20 years, where he spearheaded the development of their cutting-edge cloud services. Today, Neil is focused on determining how Cox Edge can most efficiently bring the value of Edge services to our customers.
2023

Applying Service Level Objectives (SLO) to Edge Networks

In today's digital world, it's important to ensure that our services are reliable, performant, and meet the needs of our users. In this video, we'll dive into the world of SLO and explore how we can apply it to edge networks.

Neil Pagaduan, Manager of Technology and Engineering at Cox Edge, will begin by discussing the importance of SLOs in edge networks, and how they can help to deliver a better user experience. Then, we'll explore the steps involved in achieving SLOs, from defining objectives to measuring performance. Finally, Neil will discuss how to apply SLOs to an edge network, including some key considerations to keep in mind.

Ricardo Castro

Principal Site Reliability Engineer

FanDuel

Blip

Principal Site Reliability Engineer at FanDuel/Blip.pt. MSc in Computer Science from the University of Porto. CK{AD, A, S} by Cloud Native Computing Foundation (CNCF) | Linux Foundation. {Terraform, Consul, Vault} Associate by HashiCorp. Working daily to build high-performance, reliable and scalable systems. DevOps Porto meetup co-organizer and DevOpsDays Portugal co-organizer. A strong believer in culture and teamwork. Passionate about open source, martial arts amateur, and metal lover.
2023

Overcoming SRE Anti-Pattern Roadblocks: Lack of a Reliability Framework

SREs, as the name implies, care about service reliability. But, often, they struggle to find a way to define, measure and assess their services' reliability. In practice, they lack a Reliability Framework.

How can SLOs help? They provide an opinionated way to do just that: define, measure and assess service reliability from the user's perspective. They provide a common language to talk about reliability and prioritize work. They help fix the anti-pattern of trying to ensure service reliability without clearly defining what it means.

Roman Khavronenko

Software Engineer

VictoriaMetrics

Roman is a software engineer with experience in distributed systems, databases, monitoring, and high-performance microservices. Roman's passion is open source and he's proud to have contributed to Prometheus, Grafana, and ClickHouse. Currently, Roman is working on the open source time series database and monitoring solution VictoriaMetrics.
2023

Retroactive evaluation of SLO objectives in VictoriaMetrics

Recording rules are a clever concept introduced by Prometheus for storing the results of query expressions in the form of a new time series. This concept is used for SLO calculations. But due to the nature of recording rules, they have no retroactive effect. And since an SLO usually covers a time window of no less than 30d, recording rules produce incomplete results until the whole time window has been captured.

The talk will cover how this can be fixed in the VictoriaMetrics monitoring solution via retroactive rule evaluation, using rules generated by the https://github.com/slok/sloth framework as an example.
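
To make the gap concrete, here is a minimal Python sketch. It is not taken from the talk and does not use any VictoriaMetrics, Prometheus, or sloth API. The daily counts, the `record` helper, and the day numbers are all hypothetical. The sketch only simulates a recording rule that starts producing samples on the day it is created, and compares a 30-day availability computed from that series with one computed after the rule has been evaluated over the full history.

```python
# Hypothetical raw data: (good, total) request counts per day for 60 days.
raw = [(1000, 1000)] * 60
raw[30] = (900, 1000)  # an incident on day 30

def record(raw_series, start_day):
    """Simulate a recording rule: results exist only from start_day onward."""
    return [point if day >= start_day else None
            for day, point in enumerate(raw_series)]

def window_availability(series, end_day, window=30):
    """Availability over the last `window` days, using only recorded samples."""
    points = [p for p in series[max(0, end_day - window + 1):end_day + 1]
              if p is not None]
    good = sum(g for g, _ in points)
    total = sum(t for _, t in points)
    return good / total

# Rule created on day 45, SLO evaluated on day 50.
live = window_availability(record(raw, start_day=45), end_day=50)
# Same rule evaluated retroactively over the whole history.
backfilled = window_availability(record(raw, start_day=0), end_day=50)

print(f"without retroactive evaluation: {live:.4%}")
print(f"with retroactive evaluation:    {backfilled:.4%}")
```

Without retroactive evaluation, the window only sees the six days recorded since the rule was created, so the incident on day 30 is invisible even though it falls inside the 30-day window.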

Sal Furino

CRE

Sal Furino is a Customer Reliability Engineer. During his career he's worked as a TPM, SRE, Developer, Sys Admin, and IT support. While not working he enjoys cooking, gaming, traveling, skiing, and golfing. Sal lives in Queens with his partner and has a BS in Applied Mathematics from Marist College.

2023

Two Paths in the Woods

While the direction and intent of SRE have been established and are becoming better understood, the details on how to achieve "SRE" are still an exercise left for the reader.

Steve from Google and Sal from Nobl9 will present two independently developed methods for teaching SRE topics to customers, which we have discovered are actually quite similar. Huzzah!

Steve will present the "reliability map" and Sal will show Nobl9's SLODLC. Both are essentially sets of documentation presented in a way that allows customers to take a *guided* path through the dark and scary woods that is today's SRE.

The "reliability map" provides a detailed account of the development, infrastructure, operational, observability, and people/culture activities organizations can expect to be practicing to achieve different eras of reliability (99%, 99.9%, 99.99%, 99.999%). The "reliability map" takes it one step further and provides descriptions and references to additional content for all activities.

SLODLC provides a framework of documentation and templates to allow customers to iterate on improving their SLOs. This is especially necessary when customers have high reliability targets that rely upon automation to execute predetermined playbooks (rollback, exponential backoff, autoscale, etc.). If SLOs are the indicator of the performance of a user journey and decide when action needs to be taken, then customers need a clear and usable way to explain:

  • What they are attempting to measure (golden signals? something else?)
  • Why was it decided to measure X in such a way?
  • How is X impactful for the targeted user journey?
  • When was the last time the SLO objective, metric, time window, etc. was changed?
  • When the error budget is in danger of being breached, what actions should be taken?

In short, both projects are approaching the same topic of helping customers improve their reliability from different points of view, and they synergize well together.
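
As a rough illustration of why each additional nine calls for different practices, the arithmetic behind the targets listed above can be sketched in Python. This is not part of the reliability map or SLODLC; the 30-day window and the function name are assumptions chosen for the example.

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day window = 43,200 minutes

def allowed_downtime_minutes(target_percent, window_minutes=WINDOW_MINUTES):
    """Error budget expressed as minutes of full downtime in the window."""
    return window_minutes * (1 - target_percent / 100)

budgets = {t: allowed_downtime_minutes(t) for t in (99, 99.9, 99.99, 99.999)}

for target, minutes in budgets.items():
    print(f"{target}% -> {minutes:.2f} minutes per 30 days")
```

Each extra nine shrinks the budget tenfold: roughly 7 hours at 99%, 43 minutes at 99.9%, 4 minutes at 99.99%, and under half a minute at 99.999%, which is why the highest targets leave no room for manual response.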

Sally Wahba

Principal Engineer

Splunk

Sally is a Principal Software Engineer at Splunk where she works on data ingestion for observability. Before Splunk, she spent around a decade working on operating systems for data storage systems at NetApp. Sally obtained her PhD in Computer Science from Clemson University. She presented her work and research both nationally and internationally.

When not working you will find her doing computer science outreach activities, reviewing for technical conferences, mentoring, and learning Spanish.
2023

Thinking about SLO from On-Prem to Cloud - A Developer's Perspective

My background has mostly been in developing operating systems for data storage companies. In this environment, almost everything is controlled internally. For example, if the SLO of the operating system is five 9s, then the error budget is usually all consumed by software bugs owned internally by the company. As I switched to developing SaaS products in the cloud, this has drastically changed. Below are examples of lessons learned during different phases of working on a product, from development, to production, to support and operation. My goal is sharing these lessons so other developers can learn from my experience.

In my previous role, two main things I relied on while developing operating system products were the suite of testing as well as the release cadence. Shipping a new operating system every 6 months was considered fast. This gave developers a lot of time for their code to soak and be tested internally before being released to customers. Additionally, with a slower release cadence there was a lot of effort invested in creating different layers of testing. After all, a bug fix would take months to reach customers. Even if we released a patch quickly, which in this context means a few weeks, customers would still need to update their operating systems to apply that patch, and who knows how long a customer will wait before applying that patch. After moving to building SaaS products in the cloud, the release cadence became much faster. This required shifting my thought process. Instead of relying on soak time and various levels of testing, activities such as code review and in-build unit tests now take a front row seat. Metrics like code coverage from unit tests mattered more, while metrics like how long it's been since QA found a bug mattered less.

Another difference between my old role and new role is how developers access production and production metrics. In the old role, gathering production metrics was no easy feat. Harder access to production metrics implied that changing SLOs internally was harder, took longer, and a lot of developers didn't know how/why SLOs were changing. For a SaaS product running in the cloud, developers have access to overall system performance metrics at the click of a button, while maintaining compliance. This makes it easier for developers to know why/how SLOs are affected by SLAs and also gives them faster reaction times.

From the operation perspective, the old role and the new role were quite different. In the old role developers didn't go on-call. There was a customer support organization that would handle any customer issues first. Developers were brought in occasionally if customer support needed a bug fixed. Developers didn't need to wake up in the middle of the night because there's an outage. In the new role, developers are on-call, which means developers can and do occasionally get paged in the middle of the night. This shift caught me off-guard as I had to think much harder about how my service impacts the quality of life of my colleagues and myself. It made me think about how to change and measure SLOs to eventually avoid waking someone up in the middle of the night.

Another lesson that caught me off guard is that not all SLAs can be trusted when working in the cloud. Yes, I knew this lesson theoretically, but learning it in practice is a different story. One example was an outage that was caused by a cloud provider breaching their SLAs for a managed service that we relied on for our product. This resulted in our product breaching its SLAs. When something like this happens, you think hard about how to update your SLOs to prepare for these issues, catch such issues, and react to them. I found that using SLOs that are tighter than SLAs was helpful in that regard.

In conclusion, even with years of professional experience, moving from developing On-Prem products to cloud SaaS offerings changed how I think about SLOs, and it's truly not one-size-fits-all.

Sandeep Chatra Raveesh

Observability Lead

eBay

2023

Scaling SLI/SLO - Pushing Your Observability Platform To Its Limits

At eBay we use a wrapped version of Prometheus as our centralized time series database. Data residing in Prometheus is used for mission critical functions like detecting issues on the site through anomaly detection, SLOs or simple threshold based alerts. To be able to enforce SLOs across the 1000s of micro services that are deployed inside of eBay, we need to be able to aggregate commonly instrumented metrics at an application level.

Why are these aggregations necessary? Host level metrics, when used for computation of things like burn rates over a 30 day window, can become very expensive. Using a single query across all micro services to generate such aggregations has several scale limitations. Manually onboarding each micro service into such aggregations can also be tedious.

The talk discusses our commonly instrumented metrics, our metrics store architecture, scale challenges seen with SLI/burn rate computation and how we came up with concepts like templated rule groups that automatically generate Prometheus rule groups to be able to do aggregations for every application that is emitting metrics.
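
The idea of templated rule groups can be sketched roughly as follows. This is an illustration under stated assumptions, not eBay's implementation: the metric name `http_requests_total`, its `app` and `code` labels, the recorded series name, and the helper functions are all hypothetical. Only the general shape of a Prometheus rule group (`name`, `rules`, `record`, `expr`, `labels`) follows the Prometheus rule file format.

```python
def burn_rate(error_ratio, slo_target):
    """Burn rate = observed error ratio divided by the error budget (1 - target)."""
    return error_ratio / (1 - slo_target)

def render_rule_group(app, window="1h"):
    """Render one application-level recording rule group from a template.

    The metric and label names are hypothetical placeholders.
    """
    errors = f'sum(rate(http_requests_total{{app="{app}",code=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{app="{app}"}}[{window}]))'
    return {
        "name": f"sli-{app}",
        "rules": [{
            "record": f"app:request_error_ratio:rate{window}",
            "expr": f"{errors} / {total}",
            "labels": {"app": app},
        }],
    }

# One group per application, generated instead of onboarded by hand.
apps = ["checkout", "search"]
groups = [render_rule_group(app) for app in apps]

for group in groups:
    print(group["name"], "->", group["rules"][0]["expr"])

# An error ratio of 0.2% against a 99.9% target burns budget at about 2x.
print(round(burn_rate(0.002, 0.999), 3))
```

Pre-aggregating per application this way means a 30-day burn rate query reads one series per application rather than one per host.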

Sasha Rosenbaum

Principal

Ergonautic

Sasha is Principal at a new venture, Ergonautic.

With a degree in Computer Science, an MBA, and two decades of experience across development, operations, product management, and technical sales, Sasha Rosenbaum brings a unique perspective to optimizing the organizational flow of work, bridging gaps with empathy and insight.

2023

SLO Prompt Engineering: Aligning Humans for Better Outcomes

From experience, we know that the most difficult part of implementing SLOs is incorporating them into the organizational culture, so that reliability becomes a key consideration in decision-making. This talk proposes using a proven behavioral model to align Product, Development, and Operations teams and make it easier for your organization to embrace SLOs.
 

Sergey Sidorov

Software Engineer on SLO Monitoring (SLICK)

Meta

Software Engineer with a track record of building and shipping complex software, with a primary focus on infrastructure and advanced backend systems. My experience includes working on high-throughput messaging infrastructure, trade execution engines, and large-scale monitoring and observability systems.

2023

SLICK: SLO Reviews at Meta

SLICK is our reliability tracking platform at Meta, pioneering an SLO-focused culture across the company. While we have been very successful in onboarding teams to SLICK, we started to notice that a significant number of teams got only limited value out of their SLOs after the initial onboarding.

In order for SLOs to be useful, the whole team needs to adopt them, use them regularly and retrospect on them frequently. As an initial attempt to help socialize SLOs, we built various integrations into SLICK. This includes periodic reports in our internal work groups to increase the visibility of SLOs, and collaborative data annotations to enable retrospecting on the root causes of SLO violations.

Bringing all of the above together, we will present our brand new SLO review tooling. This provides a structured workflow to have meaningful discussions about SLOs and identify follow-ups, enabling teams to get the most value out of SLOs.

Shubham Srivastava

Head of Developer Relations

Zenduty

Leading Developer Relations at Zenduty, an advanced incident management and response orchestration platform. I take pride in making mistakes, learning from them, and advocating for best practices for orgs setting up their DevOps, SRE and Production Engineering teams.

A zealous and eternally curious professional, fascinated by stories from DevOps, Incident Management and Product Design; hoping to solve real-world problems with the skills and technology I'm actively amused by. An orator, writer, and hopeful comedian trying my very best to do something I'm proud of every day.
2023

Kubernetes Monitoring - Choosing Optimal Metrics for SLO Alerting

A well performing monitoring system needs to answer two simple questions: “What’s broken, and why?”. Monitoring allows you to watch and understand your system’s state using a predefined set of metrics and logs. Observability, on the other hand, is about bringing visibility into the system - essentially turning the lights on, to see and understand the state of each component of your system, and to discover the answer to the ‘why’ part of the problem.
 
Building an efficient and battle-tested monitoring platform usually takes quite a while. You need to learn over a period of time how your system performs on various fronts before you can accurately know which metrics to monitor for prompt alerting that helps predict unavoidable incidents, meet your SLOs, and in turn prevent downtime.

We have analysed the incident data of over 150 highly active organisations deploying Prometheus Alertmanager to monitor their Kubernetes infrastructure, and have discovered some unusually common yet fatal mistakes made when choosing SLO metrics, as well as some clever configurations that drastically reduce noise.
 
This talk aims to give you a run-through of best practices and ‘what not to do’ when choosing Prometheus metrics for clean and noiseless alerting.

Stephan Lips

Software Engineer and SLO Advocate

Procore

2023

Black Box SLIs

Adopting an SLO culture involves identifying the metrics that matter without drowning in noise and alert fatigue. The black box concept lets us aggregate granular metrics into SLIs that focus on the user experience as an indicator of system reliability.

The talk will be based off this article published on Procore's Engineering blog: https://careers.procore.com/blogs/engineering-at-procore/black-box-slis.

2023

SLOs as code

By managing Service Level Objectives as code we can co-locate SLO definitions and ownership with the product code and team. This supports horizontal scaling of SLO ownership, while establishing a single source of truth, adding transparency, integrating with the code management process, and creating an audit trail for SLOs.

The talk will be based off this article published on Procore's Engineering blog: https://engineering.procore.com/implementing-slos-as-code-a-case-study-2/ 

Stephen Townshend

Developer Advocate (SRE)

SquaredUp

Stephen has a background in SRE and performance engineering. He has worked in the industry for 15 years as both an external consultant and an internal engineer.

Our industry is full of buzzwords and exaggerations; it can be hard to know what is real and what is not. Stephen strives to take these complex technical concepts and to simplify and present them in a way everyone can understand and apply (and to call out when something is too good to be true).

Stephen lives in Auckland, New Zealand and currently works as a Developer Advocate for SquaredUp, as well as promoting and improving observability and SRE practices internally in the organisation.
2023

Reliability Benchmarking: A Pre-cursor to SLO Adoption

When we first attempted to implement SLOs we took a "theory first" approach. We developed workshops and ran sessions to uncover the key users, services, indicators and objectives for a platform or application. But we failed. We didn't identify meaningful SLOs, track them, or define error budgets. We also failed to garner interest or investment from the team.
 
Taking a step back, we tried a different approach. We got access to their observability data alongside other sources of information (e.g. incidents) to build a picture of where the team was currently at in terms of reliability and operational maturity.
 
This new approach was much more successful in getting that initial spark of excitement. By providing actionable insight up front, we were able to start the SLO and SRE conversation off the right way.
 
In this talk I will share our experience and process for benchmarking reliability, and how this could be leveraged to begin SLO adoption in a complex organisation.

Stephen Weber

Staff Site Reliability Engineer

Procore

Stephen Weber is a Site Reliability Engineer at Procore Technologies, helping build the platform that builds the world. He's worked as a consulting SRE within his orgs for the past 4 years. Stephen lives and works remotely from Oregon, and has accidentally trained his huskies to know when standup should be over.
2023

Arguments in Favor: why SLOs?

In my experience, many teams encounter SLOs as something they've been told to "do" and the flip side is it's been something I've been asked to help them with. Naturally this is not ideal - as engineers we prefer things to be self-evident.

I have a number of strategies to use when communicating the process and the value of creating and using SLOs. I'll give away the thing I say the most right here: "SLOs are for decision-making". They're not magic or even a single thing. They're a tool that helps us do our jobs.

The audience will come away either better able to articulate the pragmatic usefulness of an SLO mindset, or with one or two real motivations for developing and using SLOs for their own systems.

Steve McGhee

Reliability Advocacy Engineer

Google SRE

Steve was an SRE at Google for about 10 years, then left to help a company move to the Cloud. He's back at Google, helping more companies do that.
2023

Two Paths in the Woods

While the direction and intent of SRE have been established and are becoming better understood, the details on how to achieve "SRE" are still an exercise left for the reader.

Steve from Google and Sal from Nobl9 will present two independently developed methods for teaching SRE topics to customers, which we have discovered are actually quite similar. Huzzah!

Steve will present the "reliability map" and Sal will show Nobl9's SLODLC. Both are essentially sets of documentation presented in a way that allows customers to take a *guided* path through the dark and scary woods that is today's SRE.

The "reliability map" provides a detailed account of the development, infrastructure, operational, observability, and people/culture activities organizations can expect to be practicing to achieve different eras of reliability (99%, 99.9%, 99.99%, 99.999%). The "reliability map" takes it one step further and provides descriptions and references to additional content for all activities.

SLODLC provides a framework of documentation and templates to allow customers to iterate on improving their SLOs. This is especially necessary when customers have high reliability targets that rely upon automation to execute predetermined playbooks (rollback, exponential backoff, autoscale, etc.). If SLOs are the indicator of the performance of a user journey and decide when action needs to be taken, then customers need a clear and usable way to explain:

  • What they are attempting to measure (golden signals? something else?)
  • Why was it decided to measure X in such a way?
  • How is X impactful for the targeted user journey?
  • When was the last time the SLO objective, metric, time window, etc. was changed?
  • When the error budget is in danger of being breached, what actions should be taken?

In short, both projects are approaching the same topic of helping customers improve their reliability from different points of view, and they synergize well together.

Steve Upton

Principal QA Consultant

Thoughtworks

Steve is a Quality Analyst who works to build empowered teams, capable of delivering and taking ownership of quality. He has worked on a wide variety of products, from mainframes to microservices and has a particular interest in complex socio-technical systems and how we work with them.
 
He is passionate about complexity theory, building quality into culture and testing as part of continuous delivery in modern, distributed architectures. Outside of work, Steve enjoys travel and mountains.
2023

Data Product Thinking with SLOs

The talk tells the story of how conversations around SLOs can be a great trigger to start a shift to a Product Thinking mindset, with practical examples. We'll also take a light dip into constraint mapping from an SLO perspective.

Steve Xuereb

Staff Site Reliability Engineer

GitLab Inc.

Slight Reliability Engineer, I solve more problems than I create.
2023

So many SLOs so many alerts

This is the talk version of https://about.gitlab.com/blog/2022/07/19/reducing-pager-fatigue-and-improving-on-call-life/ where we’ll describe the following:
 
Problem:
At GitLab, we use SLIs to monitor each service, and a service can have multiple SLIs. When we start burning through the error budget too fast, we page the SRE on-call. When there was a service-wide degradation, we ended up paging the on-call multiple times within minutes, which is not ideal and adds stress for the already stressed on-call engineer. The worst case was a service degradation with multiple upstream dependencies, like a database, that resulted in 50+ pages. We'll go over two solutions we’ve implemented to cut our pager load by more than half, using built-in tools from Alertmanager that users can just configure themselves. We’ll also show other possible solutions that we could have used, and why we opted for the Alertmanager option.
 
Solution One:
[Alertmanager grouping](https://prometheus.io/docs/alerting/latest/alertmanager/#grouping) enabled us to group all alerts for 1 service into 1 page. We grouped the alerts using a specific set of labels that our alerts have; luckily, all alerts had a service label that we could group by. We’ll walk through an example of how this works and go into more detail about how Alertmanager grouping works.
 
Solution Two:
Now that we had alert grouping by service rolled out, the next step was to introduce service dependencies, so that when a downstream service was alerting we wouldn’t alert about the upstream service if it was also burning through the error budget too fast. To achieve this we used another feature in Alertmanager called [inhibition](https://prometheus.io/docs/alerting/latest/alertmanager/#inhibition). We’ll walk through how we implemented a new DSL for this in our metrics catalog (a jsonnet library that is the single source of truth for our SLIs), the guard rails we’ve implemented in the DSL, and a real-life example of this for GitLab.com.
 
Results:
We’ll show a real-life example where a degradation on the database resulted in 1 page, where before it would have been more than 15 pages. Finally, we’ll show how we got fewer pages, making the on-call happier and our alerting data cleaner and easier to understand.
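
The combined effect of the two solutions can be sketched with a small plain-Python simulation. This is neither Alertmanager configuration nor GitLab's jsonnet DSL; the alert names, services, SLIs, and the dependency map below are hypothetical and exist only to show how grouping and inhibition reduce the number of pages.

```python
from collections import defaultdict

# Hypothetical alerts firing during a database degradation.
ALERTS = [
    {"alertname": "ErrorBudgetBurn", "service": "database", "sli": "transactions"},
    {"alertname": "ErrorBudgetBurn", "service": "database", "sli": "replication"},
    {"alertname": "ErrorBudgetBurn", "service": "api", "sli": "requests"},
    {"alertname": "ErrorBudgetBurn", "service": "api", "sli": "latency"},
    {"alertname": "ErrorBudgetBurn", "service": "web", "sli": "requests"},
    {"alertname": "ErrorBudgetBurn", "service": "web", "sli": "latency"},
    {"alertname": "ErrorBudgetBurn", "service": "web", "sli": "sessions"},
]

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {"api": {"database"}, "web": {"database"}}

def group_by_labels(alerts, labels=("service",)):
    """Solution one: one page per distinct combination of grouping labels."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert.get(label, "") for label in labels)].append(alert)
    return dict(groups)

def inhibit(alerts, depends_on):
    """Solution two: mute a service's alerts while one of its dependencies fires."""
    firing = {alert["service"] for alert in alerts}
    return [a for a in alerts
            if not (depends_on.get(a["service"], set()) & firing)]

pages_ungrouped = len(ALERTS)
pages_grouped = len(group_by_labels(ALERTS))
pages_grouped_inhibited = len(group_by_labels(inhibit(ALERTS, DEPENDS_ON)))

print(pages_ungrouped, pages_grouped, pages_grouped_inhibited)
```

In this toy scenario, seven individual pages collapse to three with grouping by service, and to a single page for the database once dependent services are inhibited.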

Surya Bhagvat

Director, SRE

Harness

Surya Bhagvat leads the SRE team at Harness. Surya started his journey to the cloud by leading teams at eBay and Symantec, building on the OpenStack cloud, and then moving over to AWS and GCP. Surya enjoys being in the SRE space because there is always something new to learn every day.
2023

The Business of Properly Setting SLOs

Join Surya Bhagvat, Director of Site Reliability Engineering at Harness, to discuss how his team used business objectives to create SLOs that positively impacted customer experience and his engineering team. Surya will share lessons learned in choosing and implementing the SLIs and SLOs aligned with customer expectations and Harness’ desired business outcomes.

Thijs Metsch

Researcher

Intel Labs

Thijs is a Research Engineer building cool stuff at Intel Labs. His key interests include system performance and distributed systems. In past career moves, he worked on HPC, Grids, and Cloud/Edge for companies such as IBM, Sun Microsystems, and the German Aerospace Center. He helped make shipbuilding easier, ran massive parallel workloads, managed tons of compute in hybrid environments, and created one of the first standards for the Cloud more than a decade ago. He is now focused on making orchestration easier with tools such as Kubernetes, using, e.g., AI/ML techniques.
2023

Intent Driven Orchestration with SLOs!

With a Serverless mindset in place, there should no longer be a need to define resource requests or any other kinds of information. Just develop your app or function and run it.

But how can we achieve performance and make sure we run the system most efficiently? This is where we can now let users define what they truly care about: their SLOs. Based on these performance targets, our Intent Driven Orchestration Planner will do the rest. There is no need to define resource requests and limits on, e.g., a Kubernetes cluster anymore. The planner will set up the systems in such a way that your app/functions behave as expected, without the need to know anything about the underlying infrastructure – less knowledge needed, fewer errors made!

This is a shift in how we do orchestration and SLO management that could be of interest to this community: away from a monitoring & alerting way of looking at SLOs, towards a way of using SLOs to manage the system. Furthermore, this truly allows for ease of use for the user; rather than defining numbers and values based on domain & contextualized knowledge, we let them define what they truly care about: their SLOs!

Toby Burress

SRE

Dropbox

Toby is an SRE at Dropbox. In his free time he argues about cancelled TV shows on the internet.
2023

What We Mean By "Mean"

There's been a lot of (really good!) discussion in the last several years about how to think about, monitor, and alert on long-tailed distributed quantities, such as latency. However, I'm worried that in our desire to describe the long tail we may be too eager to abandon tools that are still useful.

In this talk I'll (re)introduce everyone's favorite summary statistic, the average. I'll talk about the difference between a sample average and a random variable's expectation, and how the two are uniquely linked by the law of large numbers. I'll also talk about how the central limit theorem allows us to treat sample averages as draws from a Gaussian distribution, irrespective of the distribution the samples come from, and then I'll talk about the exceptions. We'll finish up by looking at how the properties of expected values can give us insight into the behavior of systems, even at the tail.

Through all of this we'll be looking at examples drawn from real-world latency data, and comparing the insights gleaned from this versus other common summary statistics.
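
The two results mentioned above can be tried out with a small simulation. The sketch below is not from the talk and uses synthetic data rather than real-world latency: it assumes, purely for illustration, that latency follows a lognormal distribution with hypothetical parameters, and checks that sample averages approach the expectation (law of large numbers) and that averages of batches spread like a Gaussian with standard deviation σ/√n (central limit theorem).

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

MU, SIGMA = 0.0, 1.0            # hypothetical lognormal "latency" parameters
BATCHES, BATCH_SIZE = 1000, 400

# Expectation and standard deviation of a lognormal random variable.
expected = math.exp(MU + SIGMA ** 2 / 2)
true_std = math.sqrt((math.exp(SIGMA ** 2) - 1) * math.exp(2 * MU + SIGMA ** 2))

batches = [[random.lognormvariate(MU, SIGMA) for _ in range(BATCH_SIZE)]
           for _ in range(BATCHES)]
all_samples = [x for batch in batches for x in batch]

# Law of large numbers: the average of many samples approaches the expectation.
overall_mean = statistics.fmean(all_samples)
overall_median = statistics.median(all_samples)

# Central limit theorem: batch averages cluster around the expectation
# with a spread of roughly true_std / sqrt(BATCH_SIZE).
batch_means = [statistics.fmean(batch) for batch in batches]
standard_error = true_std / math.sqrt(BATCH_SIZE)
observed_spread = statistics.stdev(batch_means)
within = sum(abs(m - expected) <= 1.96 * standard_error
             for m in batch_means) / BATCHES

print(f"expectation {expected:.3f}, sample mean {overall_mean:.3f}, "
      f"median {overall_median:.3f}")
print(f"predicted spread {standard_error:.3f}, observed {observed_spread:.3f}")
print(f"share of batch means within 1.96 standard errors: {within:.1%}")
```

The mean sits well above the median here because the long tail pulls it up, which is one way the average carries information about the tail that a median does not.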

Troy Koss

Director, Enterprise SRE

Capital One

With what seems to be a natural attraction towards reliability, Troy has constantly found himself involved in making things... well... more reliable. After working in software development, he stumbled into operations and saw a clear opportunity to use software to orchestrate such efforts. Currently he works in Capital One’s stability organization leading enterprise Site Reliability Engineering (SRE). Here he plays a critical part in both evolving the enterprise strategy and leading a team of engineers focused on partnering with and influencing business, architecture, and technology partners in delivering on the strategy. His interest in reliability extends into the culture he seeks to foster for his teams with the goal of providing a dependable haven where engineers can be autonomous and empowered to drive critical decisions. In the same spirit of helping others develop, he spends time counseling young STEM talent as an advisor for Women’s Association of Venture & Equity (WAVE). Outside of his professional career, he enjoys horticulture, fitness, traveling to new locations, and spending time with his pup and family.
2023

Is Your Resilience Reliable?

Resiliency is a critical piece of building reliable systems. It allows us to feel safe knowing failure is inevitable. After all, as noted in the OG SRE book, 100% is a terrible target for basically everything.

We spend a lot of resources to add in layers of resiliency from redundant multi-region compute stacks to backups on our backups. How do we know this resilience achieved our ultimate outcome of reliability for our customers? We'll discuss the ways to observe your SLOs and error budgets during resiliency events.

The various events we'll look at include regional failover, chaos experiments (such as latency injection), database recovery, and more! After failing a region, do we know if our customers experienced a disturbance? When you're running a resiliency test or game day, how do you measure success?

Observing error budgets before, during, and after an event paints a picture of our customers' experience and can ultimately be part of the success criteria. It is critical that we know how our architecture and system changes unfold. What if new resiliency introduces latency that negatively impacts your customers? For example, if the added complexity makes your release engineering more convoluted, we may see a longer error budget burn while we remediate. On the flip side, pairing added resiliency with SLO observation can also lead to tighter objectives as resiliency matures, raising the bar for performance.
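
As a rough illustration of watching an error budget before, during, and after a resiliency event, the Python sketch below computes what fraction of a period's error budget each phase consumed. It is not taken from the talk: the 99.9% target, the phase names, and all request counts are made-up assumptions chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed during one phase of a resiliency event."""
    name: str
    total: int  # all requests served in this phase
    bad: int    # requests that violated the SLI (errors, too slow, ...)


def budget_consumed(windows, slo_target, period_total):
    """Return, per phase, the fraction of the period's error budget it used.

    slo_target   -- e.g. 0.999 for a 99.9% SLO (hypothetical value)
    period_total -- requests expected over the whole SLO period
    """
    # The error budget, expressed as a number of "allowed" bad requests.
    allowed_bad = (1 - slo_target) * period_total
    return {w.name: w.bad / allowed_bad for w in windows}


# Hypothetical numbers for a regional failover exercise.
phases = [
    Window("before", total=500_000, bad=200),
    Window("during", total=480_000, bad=1_500),
    Window("after", total=510_000, bad=250),
]

report = budget_consumed(phases, slo_target=0.999, period_total=10_000_000)
for name, fraction in report.items():
    print(f"{name}: {fraction:.1%} of the error budget")
```

A game day's success criteria could then be phrased in these terms, for example "the failover phase may consume no more than some agreed share of the budget", with the threshold itself being a team decision rather than anything this sketch prescribes.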

Vijay Samuel

Observability Architect

eBay

Vijay Samuel works with eBay's observability platform as its architect. During his time at eBay, Vijay has transformed eBay's observability platform into a cloud native offering that is primarily built on top of open source technologies. He loves to code in Go and play video games.
2023

Scaling SLI/SLO - Pushing Your Observability Platform To Its Limits

At eBay we use a wrapped version of Prometheus as our centralized time series database. Data residing in Prometheus is used for mission-critical functions like detecting issues on the site through anomaly detection, SLOs, or simple threshold-based alerts. To enforce SLOs across the thousands of microservices deployed inside eBay, we need to aggregate commonly instrumented metrics at the application level.

Why are these aggregations necessary? Host-level metrics can become very expensive when used to compute things like burn rates over a 30-day window. Using a single query across all microservices to generate such aggregations has several scale limitations. Manually onboarding each microservice into such aggregations can also be tedious.

The talk discusses our commonly instrumented metrics, our metrics store architecture, scale challenges seen with SLI/burn rate computation and how we came up with concepts like templated rule groups that automatically generate Prometheus rule groups to be able to do aggregations for every application that is emitting metrics.
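
To make the idea of templated rule groups more concrete, here is a minimal Python sketch that expands one template into a Prometheus rule group per application. It is only an illustration of the general technique, not eBay's implementation: the metric name `http_requests_total`, the `app` and `code` labels, the recorded series names, and the application names are all hypothetical.

```python
from string import Template

# Hypothetical template. Metric and label names are illustrative assumptions,
# not eBay's actual instrumentation.
RULE_GROUP_TEMPLATE = Template("""\
  - name: sli_aggregation_$app
    interval: 1m
    rules:
      - record: app:http_requests:rate5m
        expr: sum by (app) (rate(http_requests_total{app="$app"}[5m]))
      - record: app:http_request_errors:rate5m
        expr: sum by (app) (rate(http_requests_total{app="$app", code=~"5.."}[5m]))
""")


def render_rule_file(apps):
    """Render the text of one rule file with a rule group per application."""
    groups = "".join(
        RULE_GROUP_TEMPLATE.substitute(app=app) for app in sorted(apps)
    )
    return "groups:\n" + groups


# Hypothetical application names; in practice these might be discovered from
# whichever applications are currently emitting metrics.
rule_file = render_rule_file(["search", "checkout"])
print(rule_file)
```

Each generated group pre-aggregates host-level series into application-level recording rules, so a long-window burn rate query can read a small number of recorded series instead of scanning every host's raw data.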

2023 Talks

Translating failures into SLOs
SLI Negotiation Tactics for Engineers
Adoption of SLs in New Relic: an iterati...
Reliability Enablement: Achieving Reliab...
Driving engineering priorities with serv...
Tamland: How GitLab.com uses long-term m...
How I learned to stop worrying and love ...
SLOs & the Game of Skeeball
Terraforming SLOs (SLO automation at Atl...
Seeing Like A State: SLOs From The C-Sui...

Local Events

For SLOconf 2023 we facilitated a number of in-person SLOconf events that ran concurrently with the primary virtual event. These “SLOconf Local” events spanned the globe.

New York

May 15th, 6pm to 8pm ET
Cockroach Labs

Sydney

May 16th, 4pm to 6pm AEST
Google Sydney

Tokyo

May 16th, 6pm to 8pm JST
Google Japan

Zürich

May 16th, 18:00 to 20:00 CEST
Google Switzerland

Chennai

May 17th, 10:00am to 12:30pm IST
Sarabhai Conference Hall

Sunnyvale

May 17th, 4pm to 6pm
Google Sunnyvale

Dublin

May 17th, 17:00 to 20:00 IST, UTC+1
Google Ireland

Poznań

May 17th, 17:00 to 21:00 CEST
Concordia Design

Madrid

May 18th, 18:00 to 21:30 CEST
Klarna

London

May 18th, 18:30 to 21:30 BST
Derbyshire House

Our Sponsors

Media Sponsors

DevOps.com

DevOps.com hosts a variety of articles, videos, podcasts and custom content, all designed to educate, inform and engage.

TFiR

TFiR is a video-focussed story-telling platform covering Open Source, Cloud Native Computing, Security, Edge, 5G & AI/ML.

The New Stack

For developers and engineers building and managing new stacks around the world that are built on open source technologies and distributed infrastructures.

VMblog

VMblog.com is dedicated to spreading the word about modern Data Center technologies like Virtualization, Cloud Computing, Containers, Hyperconvergence, IoT, Software-Defined "X", etc.
