2023 SLOconf Speakers

May 15-18, 2023

 

This Year's Speakers

Adriana Villela

Sr. Developer Advocate

Lightstep

Adriana is a Sr. Developer Advocate at Lightstep from Toronto, Canada, with over 20 years of experience in tech. She focuses on helping companies achieve reliability greatness through Observability, DevOps, and SRE practices. Before Lightstep, she was a Sr. Manager at Tucows, where she defined technical direction in the organization, running both a Platform Engineering team and an Observability Practices team. Adriana has also worked at various large-scale enterprises, including Bank of Montreal (BMO), Ceridian, and Accenture. At BMO, she was responsible for defining and driving the bank's enterprise-wide DevOps practice, which impacted business and technology teams in multiple locations around the globe.

Adriana has a widely-read technical blog on Medium (https://adri-v.medium.com), known for its casual, approachable take on complex technical topics and its high level of technical detail. She is also an OpenTelemetry contributor, HashiCorp Ambassador (https://www.credly.com/badges/551d47a7-67cb-41bb-baeb-8c90f114f03a/public_url), and co-host of the On-Call Me Maybe Podcast (https://oncallmemaybe.com).

2023

Translating failures into SLOs

Downtime is hard and we can definitely be proactive about failure by following practices like Chaos Engineering and SLOs. But how do you translate failure to SLO? What learnings should you leverage from the incidents you’ve been through? How can you turn a bad thing (something like an outage or downtime) into a good thing (information you have to prevent or mitigate future outages)?
 
Join Ana Margarita and Adriana as they walk back from failure to reliability by leveraging SLOs, using examples from three outages from 2022 and 2023 that affected many of us.

Alayshia Knighten

Manager of Onboarding Eng

Honeycomb

Alayshia Knighten is an Engineering Manager of Product Training at Honeycomb with many years of experience in the DevOps realm. Alayshia specializes in enhancing technical and team-related experiences while educating customers on their journey with and beyond observability. In her words, “Getting shit done while identifying how to accelerate at the person beyond the tooling is the real meat and potatoes.” She enjoys solving the “so, how do we solve that?” problems and meeting people from all walks of life. Her tiny hometown and Southern background inspire Alayshia. In her spare time, she enjoys hiking, grilling, painting, and making random bird calls with her father.

2023

SLI Negotiation Tactics for Engineers

Service level indicators are quantitative measures of a service, which, in turn, are measured against SLOs. This is not the talk you think it is.

As engineers, we have our own SLIs, which are Survival Level Indicators, that measure and define whether we are okay or not okay at a job. What happens when the rockstar engineer, who performs essential tasks A and B, hasn't taken a vacation in 9 months? Over time, not meeting SLIs can take its toll on engineers. How do we avoid burnout, turnover, and wider destruction in our teams?

In this session, I will review different strategies to identify human burnout versus company objectives. Engineers share the same importance as customers, and we should provide technical love to them as well.

In this discussion, I will be talking about how we can improve ourselves and survive in the high-risk climate in tech. The talk provides engineers and managers with the courage to take care of themselves, their teams, and others around them. Sometimes it is hard to identify when we or our friends are okay or not okay. In the course of this discussion, we will review how to identify the signs that say "Houston, we have a problem," ways to address the problems we face, and overall strategies for strengthening who we are.

Aleksandra Dziamska

Engineering Manager

Nobl9

Aleksandra works as an Engineering Manager at Nobl9.
Her 10-plus-year journey in software development started as a software engineer and moved towards team leadership and management. Throughout her career, she has strived to focus on what she feels is most important (in IT as in life): people. Translated into Engineering Manager dialect: lead the engineering team to deliver the best value to end users, effectively combining Product and Engineering priorities. She explores how SLOs can help here.
2023

EM & PM collaboration based on SLOs

Using SLOs in managing product priorities and the most efficient use of the engineering team's time.
 
I will present a real-life scenario of using SLOs to manage product requirements and be smarter about allocating the engineering team's focus. The talk will discuss an example of how SLOs helped monitor a potential problem instead of assigning an engineering team to jump into solving a complex issue, and how they also helped focus on a specific area of a wider problem. In a broader sense, the talk will discuss the collaboration between the Product Manager and the Engineering Manager and how SLOs can lead to more productive conversations.

Alex Hidalgo

Principal Reliability Advocate

Nobl9

Alex Hidalgo is the Principal Reliability Advocate at Nobl9 and author of “Implementing Service Level Objectives.” During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex’s previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.
2023

Error Budgets for Conference Planning

Planning an event, no matter the size, can be stressful and complicated. Planning a hybrid conference, for example, includes having to ensure you end up with the right number of speakers, registrants, sponsors, and local venues. In this talk Alex will break down what it was like while organizing SLOconf 2023, and how he used his experience setting reasonable targets to help guide him and everyone else involved along the way.

Alex Kudryashov

Lead software engineer

New Relic

These days I am leading a team that is developing Service Level Management in New Relic. I love solving challenges at the intersection of product and engineering, so I am creating tools for developers like myself.

2023

SLI adoption in mid size company: wins and flops

It's been a year since we started to introduce SLIs as a broad engineering practice at New Relic. We will review what helped with the adoption and what did not work as expected. The session is intended for engineering leaders challenged with the adoption of SLIs in their companies.

We have around 100 engineering teams that could use SLIs, but not all of them are doing it even after a year of active adoption. We will go through the various approaches to adoption (top-down, bottom-up), what value teams found attractive enough to start adopting SLIs, when you should be aligning with product managers, and more.

Alexandra McCoy

SRE Engineer & VMware Enthusiast

VMware

Alexandra is an SRE Engineer at VMware. She is passionate about the Cloud Native, Open Source, and Reliability Engineering communities. Although VMware is home, she was introduced to the cloud while at IBM Public Sector and then transitioned into IBM Cloud. She later gained additional hybrid cloud experience at Diamanti, focusing on E2E product support for their Kubernetes-based appliance. She is excited about the industry's direction and hopes to contribute in a way that only helps improve the cloud space.

2023

Reliability Enablement: Achieving Reliability with SLOs

At VMware, we've created a team of Cloud Reliability Engineers who support Tanzu SaaS engineering teams through technical reliability enablement projects. These projects help to measure the reliability of VMware SaaS services. Throughout the Reliability Enablement journey of implementing SLIs & SLOs, we found an increasing need to consistently manage these measurements in a collaborative and version-controlled way. This talk will discuss how we, the Tanzu SaaS Reliability Enablement Team, internally designed a technical process to measure, monitor, and alert on SaaS service performance as code, utilizing Tanzu Mission Control, Aria Operations for Applications, and other Cloud Native and Open Source solutions.

Focusing on SLIs and SLOs helped us realize that in order to remain reliable, we needed a reliable way to consistently create and manage our SLOs. The TSRE team has focused on the SRE aspect of this by creating a suite of tools that allows customers to easily create and manage SLI/SLO dashboards via Terraform. This is an industry-standard and efficient means to harness the full potential of Aria Operations for Apps. In addition to the Terraform tools, the TSRE team has also created a framework that allows users to create or use existing probes to feed these SLI/SLO dashboards. Our goal is to improve the ability of developers to produce quality products by making customer personas and their journeys central to the practices and processes we implement.

Ana Margarita Medina

Staff Developer Advocate

Lightstep

Ana Margarita is a Staff Developer Advocate at Lightstep and focuses on helping companies be more reliable by leveraging Observability and Incident Response practices. Before Lightstep, she was a Senior Chaos Engineer at Gremlin and helped companies avoid outages by running proactive chaos engineering experiments. She has also worked at companies of various sizes, including Google, Uber, SFEFCU, and Miami-based startups. Ana is an internationally recognized speaker and has presented at AWS re:Invent, KubeCon, DockerCon, DevOpsDays, AllDayDevOps, Write/Speak/Code, and many others.

Catch her tweeting at @Ana_M_Medina about traveling, diversity in tech, and mental health.

2023

OKRs with BLOs & SLOs via User Journeys

We hear it in commercials, in job interviews, and in the applications we use. “Users matter!” or “Customer experience is built into our culture and values!” But how are we proactively following what our organizations are preaching?
 
Agenda layout:
  • What are all these buzzwords?
    • What is OKR
    • What is KPI
    • What is BLO
    • What is User Journey
    • What is SLO
  • Why our systems need them
    • How to define them from the top down
    • How to define them from the bottom up
  • How to keep SLOs and User Journeys healthy
  • Error Budget
Ana Margarita will guide us in defining some concepts in the industry that help business and engineering teams come together to focus on the user experience and reliability of their systems. She will show how to define SLOs by leveraging user journeys, and we will also cover how to define these when companies are setting OKRs for the year or quarter.

 

2023

Translating failures into SLOs

Downtime is hard and we can definitely be proactive about failure by following practices like Chaos Engineering and SLOs. But how do you translate failure to SLO? What learnings should you leverage from the incidents you’ve been through? How can you turn a bad thing (something like an outage or downtime) into a good thing (information you have to prevent or mitigate future outages)?
 
Join Ana Margarita and Adriana as they walk back from failure to reliability by leveraging SLOs, using examples from three outages from 2022 and 2023 that affected many of us.

Andrew Howden

SRE Engineering Manager

Zalando

Andrew is a failed sports science student who wandered into software engineering by virtue of luck and the necessity to find a job in a hurry. Through the grace and patience of his colleagues, he has spent nearly a decade learning how to be a software engineer, systems engineer, site reliability engineer, and student of human factors. Most recently, he has been learning how to become an engineering manager and trying to pass on what knowledge he has gained so far to the next generation of software adventurers.
2023

Driving engineering priorities with service level objectives on critical business operations

I will talk through the details of how SLOs at Zalando have evolved from the initial implementation ("SLOs for everything!") to the challenge of ensuring SLOs have the organizational power to drive changes in engineering priorities, to the current design of "critical business operations" and SLOs on those operations.

I'll discuss how to address the "fast burn" SLO problem by leveraging distributed tracing to identify regressions in the customer experience automatically. When those regressions are identified, we automatically identify and page the team best empowered to address them.

I'll discuss how to address the "slow burn" SLO problem through periodic operational review meetings, in which the SLOs are evaluated, and violations to the SLO (or slow burn issues) are allocated to an owner to investigate and address.

Lastly, I'll talk about challenges with the existing approach, including the difficulty of modelling event systems as a reliable flow, difficulty in rolling out more SLOs for non-customer-facing aspects of the organization and returning to service-specific SLOs.

Andrew Newdigate

Distinguished Engineer

GitLab Inc.

Andrew is a seasoned engineer with over two decades of experience in software development and reliability engineering. As a Distinguished Engineer at GitLab, he is responsible for the reliability and availability of GitLab's SaaS properties: GitLab.com and GitLab Dedicated. He is a strong advocate for using SLOs, error budgets, and observability data to drive change and manage technical debt. Previously, Andrew co-founded the developer community site Gitter in 2012, where he served as CTO until its acquisition by GitLab in 2017.

2023

Tamland: How GitLab.com uses long-term monitoring data for capacity forecasting

For any large scale production system, the ability to effectively forecast potential capacity issues is crucial for the smooth functioning of the environment. With a reliable prediction, teams can proactively plan ahead, implement necessary scaling changes in a controlled manner and avoid unexpected availability issues that can cause stress and harm to the system.

Before implementing Tamland, the capacity planning process at GitLab.com was ad-hoc, and relied heavily on manual processes and intuition. Unfortunately, this approach often resulted in oversights, with issues going unnoticed until it was too late, sometimes only surfacing when site availability was impacted.

This talk delves into how GitLab leveraged the power of statistical analysis to greatly improve its capacity planning process. The session will be a practical demonstration of how we analyse long-term metrics data using Meta's Prophet library to build sophisticated forecast models.

Tamland, the capacity planning tool built by GitLab, is an open-source project, and attendees will have access to the source code if they're interested in exploring the implementation in greater detail. This session is for anyone interested in learning about forecasting libraries such as Prophet, Greykite, or NeuralProphet, and how they can be integrated into an observability system to provide greater insight into the health of a system.
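
For readers unfamiliar with Prophet, a minimal sketch of the kind of forecast the talk describes might look like the following. The daily disk-usage series, capacity threshold, and model settings are illustrative assumptions, not Tamland's actual implementation; the real thing is available in GitLab's open-source repository.

  # Minimal capacity-forecasting sketch using Meta's Prophet library.
  # The usage data and 100 GiB threshold are made up for illustration.
  import pandas as pd
  from prophet import Prophet

  # Prophet expects a dataframe with columns 'ds' (timestamp) and 'y' (value).
  history = pd.DataFrame({
      "ds": pd.date_range("2023-01-01", periods=90, freq="D"),
      "y": [55 + 0.3 * day for day in range(90)],  # slowly growing usage, in GiB
  })

  model = Prophet()  # defaults; seasonality and changepoints can be tuned
  model.fit(history)

  # Forecast the next 90 days and flag when the upper bound crosses capacity.
  future = model.make_future_dataframe(periods=90)
  forecast = model.predict(future)

  CAPACITY_GIB = 100
  breach = forecast[forecast["yhat_upper"] >= CAPACITY_GIB]
  if breach.empty:
      print("No saturation projected within the forecast horizon.")
  else:
      print("Projected capacity breach around:", breach["ds"].iloc[0].date())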

Andrew Snyder

Senior DevOps Engineer

Contino

Cognizant

Andrew has begun his third decade of providing exceptional technology solutions in the DevOps space, having worked full-time in engineering management leadership roles at global Fortune 100 companies including Standard & Poor's / The McGraw-Hill Companies, Bank of America / Merrill Lynch, Time Warner, and others.

2023

Taking Your Error Budgets to the Next Level

This talk will cover how error budgets should help inform future work, up to and including setting OKRs and using them to change the KPIs by which an organization's performance is judged. The goal is to make Error Budgets more agile and applicable to the client's application reliability.
 
In order to take your Error Budgets to their next level of utility, we suggest at this SLOconf that you follow these guidelines:
  1. Review Error Budgeting That Is Currently In-Place
  2. Observations Made on the On-Going Maintenance of EB’s
  3. Modifications to SRE KPIs in Alignment with Ongoing EB Compliance
Scrubbed real-life examples will be used to direct attendees' continuous improvement of their Error Budgets.

Ashley Chen

Software Engineer

Datadog

Ashley is a software engineer on the SLO team at Datadog. When she’s not working, she enjoys mentoring future engineers at Emergent Works and exploring the transit history of New York City.

2023

How I learned to stop worrying and love burn rates

Life can feel bleak when you’re paged at 3 in the morning for an error rate monitor that has once again exceeded its threshold. You wonder about the purpose of your microservices and the impact this alert will have on your sleep schedule. We talk about building reliable services and web platforms, but what can we do to build reliable monitoring for our reliable services?

Part of building the infrastructure for SLOs at Datadog includes putting SLOs into practice. As an engineering team, we have seen the direct impact of utilizing burn rate alerts over traditional threshold alerts. Our story starts with understanding the purpose of our alerts. Though these monitors have well defined runbooks and technical implications, they do not fully capture the impact of these errors on our users. In this talk, I will discuss the process we took to replace some of our threshold alerts with burn rate alerts and how we were able to quantify the urgency of service degradation by alerting at different burn rates. This transition has driven the balance of reliability and development work for the team, which has led to more reliable services and better nights of sleep.
 
The talk will cover our engineering team's process of evaluating threshold alerts vs burn rate alerts. We were looking at how noisy they were, how unclear the customer impact was, and how often they caused alert fatigue.

We will then tell the story of our implementation of burn rate alerts, deciding which ones to use and comparing them to threshold alerts. We discovered that they were more reliable and triggered less often. One example is that we've seen our burn rate alerts trigger when dependencies fail, whereas that didn't happen with threshold alerts. Burn rate alerts ended up reducing our alert fatigue and late-night pages because they were more reliable and built trust in our systems on the team.

We learned that paging at high burn rates captures when human intervention is needed to resolve customer impact. In contrast, low burn rates help us anticipate short-term impact. More discussion can happen by looking at these alerts in reviews and retros. We can change the way we maintain our team's reliability process and, in return, actually see the number of pages decrease and the service become more reliable.
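
For readers who have not worked with burn rates before, here is a small, hedged sketch of the underlying arithmetic. The window pairs and thresholds below are illustrative values loosely based on common multi-window guidance, not Datadog's actual alert configuration.

  # Burn rate = observed error rate divided by the error rate the SLO budgets for.
  # Multi-window checks (short AND long window burning fast) reduce false pages.
  SLO_TARGET = 0.999                # 99.9% of requests succeed
  BUDGET_FRACTION = 1 - SLO_TARGET  # 0.1% of requests may fail

  def burn_rate(errors: int, total: int) -> float:
      """How fast the error budget is being consumed (1.0 = exactly on budget)."""
      if total == 0:
          return 0.0
      return (errors / total) / BUDGET_FRACTION

  def should_alert(short: tuple[int, int], long: tuple[int, int], threshold: float) -> bool:
      """Alert only if both windows exceed the threshold, filtering brief blips."""
      return burn_rate(*short) >= threshold and burn_rate(*long) >= threshold

  # Hypothetical (errors, total) counts: a 5m/1h pair for urgent pages,
  # and a 6h/3d pair for lower-urgency tickets.
  page = should_alert(short=(200, 10_000), long=(1_600, 100_000), threshold=14.4)
  ticket = should_alert(short=(40, 50_000), long=(500, 600_000), threshold=1.0)
  print(f"page={page}, ticket={ticket}")  # page=True, ticket=False with these counts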

Bram Vogelaar

Software Engineer

Seaplane

Bram Vogelaar spent the first part of his career as a Molecular Biologist before moving on to supporting his peers by building tools and platforms for them with a lot of Open Source technologies. He now works as a software engineer at seaplane.io, building a global platform for building and scaling your apps.

2023

a Pint size introduction to SLO

Athletes, firemen, and doctors train every day to be the best at their chosen profession. As engineers, we spend much of our time getting stuff to production and making sure our infrastructure doesn't burn down outright. We spend very little time, however, learning to understand and respond to outages. Does our platform degrade in a graceful way? What does a high CPU load really mean? What can we learn from level 1 outages to be able to run our platforms more reliably?

Plenty of people are jumping on the new hype, Observability, and lots of them are replacing their "legacy" monitoring stack. Not all of them achieve the goals they set. But observability is not a tool; it is a property of a system: moving from many small black boxes to a more data-driven view of your system.

Furthermore, we'll discuss the need for, and the options for, monitoring not only our platforms and their inevitable outages, but also the (potential) length and impact of those outages. We'll look at using Service Level Objectives as a way to prepare teams to tweak their testing and monitoring setups and runbooks to quickly observe, react to, and resolve problems.

Bryan Oliver

Principal Architect and K8s Sig Network Member

Thoughtworks

Bryan is an experienced engineer and leader who designs and builds complex distributed systems. He has spent his career developing mobile and back-end systems whilst building autonomous teams. More recently he has been focused on delivery and cloud native at Thoughtworks. In his free time he plays ice hockey, goes trail running, and tries to break into the champion rank in Rocket League. https://olivercodes.com

2023

SLO Driven Deployments - Point of Change Compliance meets OpenSLO

Point of Change compliance is a deployment concept in which we leverage Kubernetes admission controllers to block or allow deployments at the boundary of each environment. We can combine this concept with OpenSLO to create a powerful zero-trust architecture for error budgets. Imagine if we enforced error budgets at the boundary of the environment! Teams will begin to take them more seriously as a result.
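
As a rough illustration of the pattern (my sketch under assumed names, not the speaker's implementation), a validating admission webhook can deny a Deployment when the owning service's error budget is exhausted. The get_error_budget_remaining lookup is a hypothetical stand-in for whatever OpenSLO-backed service supplies the number.

  # Minimal validating admission webhook that gates deployments on error budget.
  # Flask app; in a real cluster it must be served over TLS and registered via a
  # ValidatingWebhookConfiguration. The SLO lookup below is hypothetical.
  from flask import Flask, request, jsonify

  app = Flask(__name__)

  def get_error_budget_remaining(service: str) -> float:
      """Hypothetical: fraction of error budget left, fetched from an SLO platform."""
      return 0.12  # placeholder value

  @app.route("/validate", methods=["POST"])
  def validate():
      review = request.get_json()
      req = review["request"]
      labels = req["object"]["metadata"].get("labels", {})
      service = labels.get("app", "unknown")

      allowed = get_error_budget_remaining(service) > 0
      response = {"uid": req["uid"], "allowed": allowed}
      if not allowed:
          response["status"] = {"message": f"{service} has exhausted its error budget"}

      # AdmissionReview response envelope expected by the Kubernetes API server.
      return jsonify({"apiVersion": "admission.k8s.io/v1",
                      "kind": "AdmissionReview",
                      "response": response})

  if __name__ == "__main__":
      app.run(port=8443)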

Christian "Serpico" Long

7-time Brooklyn skeeball champion and nationally-ranked roller

Christian has been rolling skeeball competitively in Brooklyn, NY for 11 years and nationally for 7 years. He started out as a straightforward 40-roller, as is both the conventional recommended starting approach and a widely regarded standard for high level competition. Eventually he started dabbling in hybrid rolling and came to develop a fine-tuned highly tactical strategy that minimizes risk and has virtually no ceiling, enabling him to compete with the best rollers in the world.

2023

SLOs and Skeeball

Using SLOs to craft a potent, advanced, high-performing strategy in competitive skeeball.

Skeeball is mostly thought of as a kids game, but it has become a competitive endeavor, complete with its own vast array of surprisingly complex strategies. Classic simple strategies involve either the conservative "middle rolling" (going only for the 40 pocket) or the aggressive "hundo rolling" (going only for the hundred pocket). A more nuanced strategy has evolved that is known as "hybrid rolling", which seeks to incorporate hundos while mitigating risk by switching to the 40 cup at opportune times. This talk will explain how SLOs can be applied to craft this strategic approach in order to create optimal winning conditions against a wide variety of opponent styles and skill levels.

Dan Venkitachalam

Software Reliability Engineer

Atlassian

Dan is a veteran software engineer and technical manager with over 20 years of experience. He currently works on the Tome team at Atlassian, helping internal organisations to define and achieve their operational goals with SLOs.

2023

Terraforming SLOs at Atlassian

How Atlassian uses Terraform and Configuration as Code to maintain SLOs across software teams.

Tome is our internal platform for managing, reporting and alerting on SLOs. A design goal was to enable SLOs to be defined with Configuration as Code. This has become the primary way that SLOs are maintained within Atlassian. Working with SLOs this way has many benefits:

  • Enforces consistency in how we organise, define and validate SLOs
  • Changes are tracked and attributed through a version control system
  • Updates can be deployed as part of existing continuous integration and delivery pipelines

We have written a custom Terraform plugin for provisioning SLOs, which interfaces with Tome's backend API. The plugin:

  • Optimizes deployments by tracking deployment state and applying diffs only
  • Performs custom validation on configurations
In this talk I'll run through our implementation of the plugin and associated API, and how it compares to other SLO configuration systems.

Daniel Golant

Senior Software Engineer

Daniel Golant is a software developer based in New York City with an interest in how observability can increase engineering leverage. When not speaking you can find him watching tech talks on the treadmill or playing dominoes on the Lower East Side.

2023

Seeing Like A State: SLOs From The C-Suite

A brief, humorous foray into what it might look like if executives had to set SLOs for a massive and widely used service like Facebook (Blue App) from the *top* of the hierarchy.
 
We will cover how *I* would recommend doing it, from requirements gathering, to the resources a C*O has available beyond what the average EM might have, to the possible secondary effects of setting overly tight objectives. We will cover thorny questions like tradeoffs involving potentially degrading the experience of "less valuable" users to benefit "more valuable" users. What impact would this have on growth? What factors would we consider when making such a call?
 
We will close with a few thoughts on whether this sort of top-down orchestration is desirable or feasible, and with a humorous review of how this might *actually* play out if implemented.

Deepak Kumar

Senior Cloud Infrastructure and Devops

Zenduty

I'm a Senior Cloud Infrastructure and DevOps Engineer at Zenduty, an incident management and response orchestration platform, trying my best to make sure that every service and application at our org is secure, reliable, and accessible 24/7. I have experience working with, and am passionate about, cloud services, orchestration engines, enterprise networking, and observability platforms, and figuring out how they work best together. I'm looking forward to talking about my experiences and how we manage mission-critical operations at an organisation that has no room to fail.

2023

Kubernetes Monitoring - Choosing Optimal Metrics for SLO Alerting

A well performing monitoring system needs to answer two simple questions: “What’s broken, and why?”. Monitoring allows you to watch and understand your system’s state using a predefined set of metrics and logs. Observability, on the other hand, is about bringing visibility into the system - essentially turning the lights on, to see and understand the state of each component of your system, and to discover the answer to the ‘why’ part of the problem.
 
Building an efficient and battle-tested monitoring platform usually takes quite a while. You need to learn over a period of time how your system performs in various areas before you can accurately know which metrics to monitor for prompt alerting that helps predict otherwise unavoidable incidents, meet your SLOs, and in turn prevent downtime.
 
We have analysed the incident data of over 150 highly active organisations deploying Prometheus Alertmanager to monitor their Kubernetes infrastructure, and have discovered some unusually common yet fatal mistakes made when choosing SLO metrics, as well as some clever configurations that drastically reduce noise.
 
This talk aims to give you a run-through of best practices and ‘what not to do’ when choosing Prometheus metrics for clean and noiseless alerting.

Derek Osborn

Incident, Problem and Service Level Manager

Flexera

As the Senior Incident, Problem and Service Level Manager for Flexera, I have been building and expanding the function since joining Flexera nearly 4 years ago.

2023

Flexera's SLO Journey - from DIY to NOBL9

I'll cover Flexera's journey from our internally developed SLO solution to partnering with NOBL9, including how we engaged teams to help develop SLOs. I'll also cover how SLOs are now part of our engineering goals for 2023.

Derek Remund

Practice Lead, Reliability Engineering

Google Cloud

Derek Remund has held roles in game development, distributed systems engineering, data architecture, and SRE. He studied Computer Science at UIUC after a stint in the Army infantry. Derek currently leads the Reliability Engineering practice in Google Cloud Professional Services, helping Google Cloud customers build and implement their own SRE approaches.

2023

Law and Order: SLO - These Are Our Stories

It’s nigh-indisputable that Service Level Objectives are table stakes for practicing SRE within an organization. But once you’ve made the decision to pursue a modern approach to service level measurement, or perhaps to expand your existing SLO footprint, where do you go from there?
 
We run a team at Google that is dedicated to helping our customers develop SRE practices, from technical tools implementation, to organizational culture change, to building new approaches to day-to-day operations. We’ve had the opportunity to help dozens of firms take their first, second, and fiftieth steps into reshaping the ways they build and run services. For each one of those companies we’ve inevitably had to wade deep into SLO development.
 
After working with organizations big and small, public and private, startups, enterprises, and everything in between, we’ve pulled together a few key themes from our clients. We’d like to share with you common failure modes, key success indicators, what’s worked, and what very much hasn’t. Along the way you’ll hear plenty of true stories of the trials and tribulations of SLO development in the real world.
 
We’re not here to preach theory; we’re here to discuss practice, lay bare some failures, and hopefully give you some ideas that will help you take your service level measurement up a notch (and bring the rest of your organization along with you).

Dylan Keyer

SRE Ops

Twilio

I used to answer 911 calls. Now I like solving problems where technical concerns most impact a business's bottom line.

2023

SLOs at Twilio

With engineering teams using a myriad of tools, how do you centralize them?
 
We'll talk at a high level about Twilio's observability landscape and how previously-siloed solutions are now converging towards a single discoverable pane of glass.

Emily Gorcenski

Lead Data Scientist

Thoughtworks

Emily has over ten years of experience in scientific computing and engineering research and development. She has a background in mathematical analysis, with a focus on probability theory and numerical analysis. She is currently working in Python development, though she has a background that includes C#/.Net, Unity3D, SQL, and MATLAB. In addition, she has experience in statistics and experimental design, and has served as Principal Investigator in clinical research projects.

2023

A "moving SLO" for machine learning

For microservices, we have fairly concrete SLOs and ways to measure them, such as latency and availability. However, products with embedded machine learning artifacts add another dimension. This occurs because data drifts, and we need to be able to detect that data drift. There are existing analogues in control theory, and these analogues should inspire us to create an improved vision of how to better design SLOs for AI-integrated systems that adapt better to the world. I'm addressing this topic as a former control systems engineer and computational mathematician, and while I won't have time to dive deeply into the mathematics, I will get very technical very quickly. Nevertheless, there will be plenty of metaphors and examples to root this in an understandable frame, even for those without advanced mathematics backgrounds.
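
To give one concrete (and simplified) reading of the idea, a drift statistic on live feature data can itself be treated as an SLI with an objective attached. The sketch below uses a two-sample Kolmogorov-Smirnov test and an assumed threshold; it is my illustration, not the speaker's method.

  # Sketch: data drift as an SLI. The KS statistic is one of many possible drift
  # measures; the 0.05 objective and the synthetic data are assumptions.
  import numpy as np
  from scipy.stats import ks_2samp

  rng = np.random.default_rng(42)
  reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time feature
  live = rng.normal(loc=0.3, scale=1.0, size=5_000)       # recent production data

  result = ks_2samp(reference, live)
  DRIFT_OBJECTIVE = 0.05  # "moving SLO": keep the drift statistic below this

  print(f"KS statistic={result.statistic:.3f}, p-value={result.pvalue:.3g}")
  if result.statistic > DRIFT_OBJECTIVE:
      print("Drift SLI breaches its objective: investigate or retrain the model.")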

Eric Moore

Ex-chemist SRE

Formerly a computational chemist, Eric tries to bring some of those skills into SRE-land.

2023

Confident rare SLO measurement

SLO-based measurements of rare events can be quite noisy, since each event moves the needle more. In this talk we'll cover the relevant statistics and go over some ways to apply them when setting SLOs, interpreting SLO measurements, and setting alerting thresholds.
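
To illustrate why rare events are noisy (my sketch, not the speaker's material), a Wilson score interval shows how wide the uncertainty around a measured success rate is when only a couple hundred events fall in the SLO window:

  # Sketch: confidence interval around an SLI measured from few events.
  import math

  def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
      """95% Wilson score interval for a binomial proportion."""
      if total == 0:
          return (0.0, 1.0)
      p = successes / total
      denom = 1 + z**2 / total
      center = (p + z**2 / (2 * total)) / denom
      half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
      return (center - half, center + half)

  good, total = 199, 200  # one failure in a window with only 200 events
  low, high = wilson_interval(good, total)
  print(f"point estimate={good / total:.2%}, 95% CI=({low:.2%}, {high:.2%})")
  # The point estimate (99.5%) meets a 99% SLO, but the interval dips well below
  # it, so a single extra failure can swing the verdict.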

Frances Zhao-Perez

Senior Director of Product Management

Salesforce

Frances is the Senior Director of Product Management at Salesforce, leading the Monitoring Cloud and Service Ownership platform investment. Prior to joining Salesforce, Frances was the VP of Product Management at New Relic, responsible for driving the APM business; head of product management at AWS Marketplace, driving key initiatives; and spent 16 years as a Senior Director of Product Management at Oracle, running the middleware business.
2023

Measuring What Matters: SLOs Help to Pursue Customer Happiness

It’s all about the customer. In this session, I will use real-world scenarios to discuss the importance of SLOs to help us set actionable business goals, measure them, and stay on target.
 
We will deep-dive into the difference between monitoring and observability, how SLOs have changed the observability business, and why SLOs are the MVP of observability. We will also discuss how to leverage the error budget to help us prioritize our investments and the positive-impact opportunities that go beyond the baselines, and to balance our roadmap towards business goals.

Fred Moyer

Engineering Geek

Fred is an Observability SRE in his day job and doesn't get to ride his bike 10,000 kilometers a year like he used to, so now he relies on science to help him stay in shape.

2023

The body's Error Budget; SLOs for healthy eating

This talk might sound a bit unusual, but it's a chronicle of a real-life challenge that I've faced a few times in my life and have been focused on for the past few months. My desire to see my kids well into adulthood has driven my inner nerd to put my SLO and technical chops to work and try to live a healthier life.
 
I'll talk about how I've used a food tracker to meet my own personal health goals, yet still be able to enjoy some of my favorite indulgences on a regular basis. I'll share my nutrition SLOs, and talk about why they are that way given some biochemistry that I've learned about how food has changed in society over the last several dozen (and perhaps a couple hundred) years.

Garrett Plasky

Staff Reliability Engineer

Google Cloud

Garrett Plasky is a Strategic Cloud Engineer at Google Cloud focusing on reliability and SRE. He brings a wealth of practical SRE knowledge to the table, having led the teams responsible for running Evernote’s core services and infrastructure supporting their 200M+ global user base. Under his leadership, he transformed Evernote’s traditional Operations organization into an early adopter of the canonical SRE model, embracing key principles such as SLOs, error budgets, and toil management across the entire engineering organization. Garrett has written about some of these experiences as a contributor to the Google-published SRE Workbook and in his current role helps companies both big and small successfully adopt SRE practices.

2023

Law and Order: SLO - These Are Our Stories

It’s nigh-indisputable that Service Level Objectives are table stakes for practicing SRE within an organization. But once you’ve made the decision to pursue a modern approach to service level measurement, or perhaps to expand your existing SLO footprint, where do you go from there?
 
We run a team at Google that is dedicated to helping our customers develop SRE practices, from technical tools implementation, to organizational culture change, to building new approaches to day-to-day operations. We’ve had the opportunity to help dozens of firms take their first, second, and fiftieth steps into reshaping the ways they build and run services. For each one of those companies we’ve inevitably had to wade deep into SLO development.
 
After working with organizations big and small, public and private, startups, enterprises, and everything in between, we’ve pulled together a few key themes from our clients. We’d like to share with you common failure modes, key success indicators, what’s worked, and what very much hasn’t. Along the way you’ll hear plenty of true stories of the trials and tribulations of SLO development in the real world.
 
We’re not here to preach theory; we’re here to discuss practice, lay bare some failures, and hopefully give you some ideas that will help you take your service level measurement up a notch (and bring the rest of your organization along with you).

George Hantzaras

Director, Cloud Engineering

Citrix

George is a Director of Cloud Platform Engineering at Citrix. He has been organising the Athens Cloud Computing Meetup Group since 2016, as well as the Athens HashiCorp User Group. His recent talks include topics in cloud computing, observability, SRE, and Agile practices, and he has participated in conferences like Voxxed Days, HashiConf Global, DeveloperWeek, and more.

2023

Scaling SLOs with open source tools

Defining Service Level Objectives and Service Level Indicators is a really important aspect of implementing SRE. Through service metrics (SLOs, SLIs, Error Budgets), SRE can help us measure our system’s performance and improve customer experience. They not only enable your teams to monitor and plan around reliability, but can also be early predictors of customer satisfaction, NPS, churn rates, and more.
 
With the rise of cloud native technologies, it has become more and more relevant to automate our observability, extending it to an SLO-as-code model. In this session we’ll see how SLOs have evolved and can be used in a Cloud Native world. We’ll then explore how technologies like Kubernetes and Prometheus can help us scale SLOs, while promoting best practices and standards using Observability as code. Finally, we’ll see how to put all these together with Jenkins and Rancher, to operationalize error budgets.
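
As a tiny example of the plumbing this implies (assumed metric names, URL, and objective, not anything from the talk itself), an availability SLI can be pulled from Prometheus with one HTTP API call and compared against an objective in code:

  # Sketch: compute a 30-day availability SLI from Prometheus and report the
  # remaining error budget. Metric names, job label, URL, and the 99.5%
  # objective are assumptions.
  import requests

  PROMETHEUS = "http://prometheus.example.internal:9090"
  QUERY = ('sum(rate(http_requests_total{job="checkout",code!~"5.."}[30d]))'
           ' / sum(rate(http_requests_total{job="checkout"}[30d]))')
  OBJECTIVE = 0.995

  resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
  resp.raise_for_status()
  result = resp.json()["data"]["result"]

  if result:
      sli = float(result[0]["value"][1])  # instant vector value: [timestamp, value]
      budget_remaining = 1 - (1 - sli) / (1 - OBJECTIVE)
      print(f"SLI={sli:.4%}, objective={OBJECTIVE:.2%}, budget remaining={budget_remaining:.1%}")
  else:
      print("No data returned for the SLI query.")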

Greg Arnette

Co-founder & CPO

CloudTruth

Greg is co-founder & CPO of CloudTruth. Prior, Greg was the founder & CTO of three cloud / SaaS companies in the data protection market.

2023

The Hidden (Config) Tax Affecting Your Uptime SLO

The presenters interviewed over 1000 DevOps leaders to understand the role "config sprawl" plays in meeting uptime SLOs.
 
The startling (but not too surprising) conclusion is that most teams struggle to manage secrets and configs at scale for infrastructure and applications.

The presenters will introduce a new concept for managing config called "The 7 Factor Config" principles, which describe a way of managing secrets and config that, when followed, enables companies to deploy reliably, scale quickly, and reduce unplanned downtime and security incidents.

Gwen Berry

Site Reliability Engineer

IAG

Junior Site Reliability Engineer, working in an SRE enablement team at IAG.

2023

Reliability Benchmarking: A Pre-cursor to SLO Adoption

When we first attempted to implement SLOs we took a "theory first" approach. We developed workshops and ran sessions to uncover the key users, services, indicators and objectives for a platform or application. But we failed. We didn't identify meaningful SLOs, track them, or define error budgets. We also failed to garner interest or investment from the team.
 
Taking a step back, we tried a different approach. We got access to the team's observability data alongside other sources of information (e.g. incidents) to build a picture of where the team currently stood in terms of reliability and operational maturity.
 
This new approach was much more successful in getting that initial spark of excitement. By providing actionable insight up front, we were able to start the SLO and SRE conversation off the right way.
 
In this talk I will share our experience and process for benchmarking reliability, and how this could be leveraged to begin SLO adoption in a complex organisation.

Hezheng Yin

Co-founder & CTO / Creator

Apache DevLake

Merico

Hezheng is a perceptive and persistent pioneer in applying technology to make the world a better place. At Merico, he leads the engineering and research team to build innovative algorithms to help developers quantify the impact of their work. Before this, his research focused on empowering the next generation of education technology with artificial intelligence and machine learning. Hezheng got his bachelor's degree from Tsinghua University and was pursuing his Ph.D. in computer science at UC Berkeley.

2023

Creating and Tracking SLOs that Empower Developer Happiness and Productivity

In this session, startup CTO and creator of Apache DevLake, Hezheng Yin will introduce a data-driven approach to improving developer happiness and productivity. The speaker will make the case for establishing SLOs that support this approach and introduce the SPACE framework for developer productivity. You will see a fast and practical implementation of this framework using Apache DevLake, an open-source solution. This session is designed to provide attendees with actionable solutions and knowledge to establish SLOs targeting critical, yet previously ambiguous, concepts such as culture, collaboration, and flow.

Ioannis Georgoulas

Director of SRE

Paddle.com

Ioannis is the Director of SRE at Paddle.com. He is an SLO evangelist and practitioner with an obsession to measure anything that matters to the users and the business.

2023

How you SLO your SLOs?

In this talk, I will cover some metrics and signals that you can use to understand if your SLO framework and culture are performing and at what level.
 
These metrics will be used to measure (and SLO) your SLOs' performance and their impact on your (internal) users, business, and overall reliability culture.

Jason Greenwell

SRE Leader

Ford Motor Company

Jason is an SLO and developer experience advocate who has held a number of technical leadership positions at Ford and Ford Credit over the past 20 years. He is currently heading up SRE for Model-e's Cloud Platform, driving SLO adoption and SRE culture through the org.

2023

SLOs and Promise Theory

A discussion of the importance of explicitly stating performance expectations, and of performance against those expectations, through the lens of Promise Theory as it applies to SLOs. Tracking and understanding these promises is critical to reducing the complexity of a highly distributed microservices ecosystem.

Jeff Martens

CEO & Co-Founder

Metrist

Jeff has built observability products and developer tools for more than 12 years. The first company he founded, CPUsage, was a pioneer in the serverless computing space before AWS Lambda existed. Later he joined New Relic pre-IPO to focus on new products. There he served on the team creating the company’s high-performance event database, before leading Real User Monitoring and growing the product into the company’s 2nd largest revenue generator. Jeff then joined PagerDuty pre-IPO where he worked on designing, building, and launching a suite of business analytics products. Jeff is an alumnus of the University of Oregon and works between Portland, Oregon and the San Francisco Bay Area.

2023

Managing SLOs & SLAs when your app is built on other apps

The average digital business uses 137 cloud products, with 40-50 typically powering the company's product. On top of that, as much as 70% of customer-facing downtime can be tied back to a cloud dependency outage.
 
If we want to operate resilient systems, it is not enough to simply rely on systems with impressive SLAs. In short, an app with 4 cloud dependencies offering 99.9% uptime cannot itself offer 99.9%.
 
In this short talk, we'll cover some examples of how cloud dependency uptime, and our dependency's dependencies, can impact our own reliability, and what we can do to understand the risk and manage it better.
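
The arithmetic behind that claim is worth seeing once. If an app is only up when all of its hard dependencies are up, and failures are independent with no redundancy, availabilities multiply (the numbers below are illustrative):

  # Sketch: composite availability of an app with four 99.9% dependencies.
  from math import prod

  dependencies = [0.999, 0.999, 0.999, 0.999]  # four hard dependencies
  own_availability = 0.999                     # the app itself, in isolation

  composite = own_availability * prod(dependencies)
  downtime_minutes = (1 - composite) * 30 * 24 * 60  # per 30-day month

  print(f"composite availability = {composite:.4%}")       # about 99.50%
  print(f"expected downtime = {downtime_minutes:.0f} minutes/month")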

Jessica Kerr

Engineering Manager of Developer Relations

Honeycomb

Jessica Kerr is a developer of 20 years, conference speaker of 10, and ringleader of a household containing two teenagers and their cats. She works and speaks in TypeScript, Java, Clojure, Scala, Ruby, Elm etc etc. Her real love is systems thinking in symmathesy (a learning system made of learning parts). She works at Honeycomb.io because our software should be a good teammate and teach us what is going on. If you're into sociotechnical systems, find her blog and newsletter at jessitron.com.

2023

Evolving Our Use of SLOs at Honeycomb

SLOs are part of our product, so we've cared about them for a long time. We think really hard about how we use them (especially Fred Hebert, Staff SRE, who is co-author and possibly co-presenter).
 
Our practices have changed, as we regularly re-evaluate each SLO, trading off alert fatigue against customer experience.
 
We also know something about how our customers use SLOs, so we know that other companies could benefit from the kind of thought Honeycomb's SRE team puts into this.

Jim Deville

Principal Software Engineer

Procore

Jim has been working on various stages of the software stack for over 15 years. Most recently, he has been the technical leader of the Observability team at Procore.

2023

Good and SLO

Companies adopting SLOs inevitably have a lot of discussions about how to integrate SLO and Error Budget-based monitoring practices with older dashboard and alert methods, and with newer efforts to improve overall observability. Two concerns that often come up relate to identifying good SLOs, and how to connect the SLO to the underlying telemetry that can assist in determining the cause of a burning SLO.
 
At the core, an SLO should be tied to Service Level Indicators (SLIs) that relate to user happiness and core user journeys. By relating to user happiness and user journeys, we link the health and reliability of our systems to one of the most fundamental business requirements: trust. This also means that when we respond to a burning SLO, we are responding to events that erode user happiness and, by extension, trust. If users are unable to use a given system due to unreliable system behavior, unreasonable downtime, or sluggish performance, they will look elsewhere for the features that the system provides.
 
In this talk we will discuss considerations for how to build good SLO practices.

Joe Blubaugh

Principal Engineer

Grafana Labs

Joe has worked building and operating large distributed systems for over 10 years, from Google to Twitter to Grafana. Along the way he's picked up some tricks and gotten some scars. He loves automation and standard practices in automation, and anything that removes developer toil.

2023

How we keep engineers sane with SLOs

I'll talk about Grafana's experience using SLOs to help us describe our critical operations for our systems and monitor them. We had to do several things to make this successful:

  • Promote consistent metric and objective definitions across teams.
  • Create tooling to make it easy to define SLOs.
  • Create tooling for an as-code workflow with SLOs.
  • Create and nurture a culture that sees the value of SLOs for both the team and the company.

I'll be talking about the history at Grafana and also presenting results of interviews with some Grafana engineers about what SLOs have done for their teams and where they see room for improvement in implementations.

Kayla Annunziata

SRE Platform Development Sr Mgr

Capital One

Kayla is an Enterprise SRE Platform Development Sr Mgr at Capital One, driving adoption of SRE best practices across all Capital One applications.

Before joining FinTech, she was a Software Engineering Manager at Lockheed Martin supporting the space industry, developing reliable flight software products for NASA's Orion spacecraft for the Artemis 1 mission, which successfully broke the record for the farthest distance from Earth traveled by an Earth-returning, human-rated spacecraft by roughly 20,000 miles.

2023

Error Budget Signals - Identifying, Interpreting and Actioning

As an Enterprise SRE Team at Capital One we are on a journey to consistently measure reliability through practices such as adopting Service Level Objectives (SLOs). In partnership with our internal beta teams we identified clear Error Budget Signals that demonstrate an application’s health through periods of degradation and recovery.
 
Identifying an Error Budget Signal:
Some key indicators we leverage to analyze our reliability and reinforce in our Error Budget Policy are primarily related to our Error Budget Remaining (EBR) value and the trend of that EBR over time. Based on the slope of our EBR over a rolling window (usually 30 days), the key signals we seek to interpret are negative slope (degradation of our EBR), positive slope (EBR recoveries) and an EBR value of 0 (or negative).
 
Interpreting an Error Budget Signal:
Data without a clear way to interpret it is every engineer’s nightmare (or problem solving dream?). We seek to aid our teams in interpreting their Error Budget Signals and correlate potential drivers to the EBR trends. By having clear EBR signals indicated we can begin to overlay events that could be contributing factors and merit investigation. Is it a change related burn? Did it occur at the same time as a chaos engineering event? Have we lapsed a full time window where a prior incident is no longer negatively affecting our rolling EBR?
 
Actioning from an Error Budget Signal:
As an SRE team, we advise teams on how to react to fast degradations (incident alert!) and slow-burning ones, or even EBR recoveries (Party On Wayne!). This establishes an error budget policy that balances the reliability of our applications with continuing to release new features.
 
We’ll share some real examples of what we’ve seen in action and how we have advised our application SRE teams to investigate and take action to improve the reliability of their services as a result of these signals. In sharing this, we hope other teams can foster buy-in to SLOs as an SRE practice area where they work today by incorporating an Error Budget Policy to benefit their applications’ reliability.

Kyle Forster
Kyle Forster

Founder

RunWhen

Kyle Forster

Founder

RunWhen

SLOs with Teeth: Partnering with Product Management
Learn more ›
close
Kyle Forster

Kyle Forster

Founder

RunWhen

LinkedIn
Kyle is the founder of RunWhen, a new company building a platform for "Social Reliability Engineering". Prior to RunWhen, Kyle was a Sr Director of Product Management in Google Cloud's AppMod business unit (Kubernetes).
2023

SLOs with Teeth: Partnering with Product Management

Many SLO initiatives are started by SREs for SREs. Sometimes these get traction, sometimes they do not.
 
In this talk, we'll outline a different approach to getting SLO traction, drawn from personal experience and now replicated by a number of our customers. The results have consistently led to broad adoption of SLOs and rapid increases in executive support.
 
The first step: partnering with product managers to define 2-3 error budgets that they see as critical inputs to their forecasts.
 
The second, third, and fourth steps will be relayed in the talk.

Lukasz Dobek
Lukasz Dobek

Software Engineer

Nobl9

Lukasz Dobek

Software Engineer

Nobl9

Non-Conway’s Game of SLOs
Learn more ›
close
Lukasz Dobek

Lukasz Dobek

Software Engineer

Nobl9

LinkedIn
Łukasz Dobek is a Software Engineer who works with cloud-native technologies on a daily basis. He strives to be language-agnostic and to treat programming languages as tools. Most of the time, you can find him building software with Go, JavaScript, or Python. His experience in DevOps certainly helps him develop and implement practical, effective, and easy-to-maintain solutions. Working at scale is another topic he can share his knowledge about, be it Kubernetes or Serverless architecture.

Currently, he’s developing a Service Level Objectives platform at Nobl9, helping to make the cultural shift to a Site Reliability Engineering mindset.
2023

Non-Conway’s Game of SLOs

In this talk, I want to point out the importance of SLO evolution over time. I will be comparing Conway's Game of Life, which is a zero-player game, to the classic SLO approach, where users, after creating an initial configuration, redesign and rework it. Assumptions and service requirements can change, and SLOs should reflect those changes.

Matthias Loibl
Matthias Loibl

Senior Software Engineer

Polar Signals

Matthias Loibl

Senior Software Engineer

Polar Signals

Second Day Operations for SLOs
Learn more ›
close
Matthias Loibl

Matthias Loibl

Senior Software Engineer

Polar Signals

Twitter LinkedIn
Matthias Loibl is a Senior Software Engineer who works on cloud-native observability at Polar Signals, previously at Red Hat and Kubermatic, and is a maintainer of many projects like Prometheus, Thanos, Prometheus Operator, and Parca. He enjoys working on Distributed Systems with Go and gRPC.
2023

Second Day Operations for SLOs

Once you have implemented SLOs for your organization, how do you move forward?
At Polar Signals, we have quarterly SLO reviews. We first do a retrospective and discuss where we did great and where we could have improved. For the upcoming quarter, we discuss the OKRs, and from those we derive SLOs. Sometimes OKRs are easy to derive from, and sometimes they need to be rephrased to make sense as SLOs.

Matthias will walk you through an example of an SLO that we implemented quarters ago and how it changed over time. The example will showcase the SLO tracked in the open-source Pyrra project, which makes SLOs with Prometheus manageable, accessible, and easy to use for everyone.

Max Knee
Max Knee

Staff Software Engineer

The New York Times

Max Knee

Staff Software Engineer

The New York Times

Use SLOs to manage your day
Learn more ›
close
Max Knee

Max Knee

Staff Software Engineer

The New York Times

Twitter LinkedIn
Software Engineer working in the developer productivity space, ensuring teams deliver reliably and efficiently.
2023

Use SLOs to manage your day

I'm pretty bad at time management, so I was looking into ways to improve that part of my life.

I landed on a lo-fi way to use SLOs to manage my day: turning my tasks and other things I need to do during the day into SLOs.

It's increased my productivity and I'm interested in sharing it with others!
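
As a playful, purely illustrative sketch (not Max's actual system; the tasks, targets, and numbers are invented), daily tasks can be framed as tiny SLOs with their own error budgets:

# Hypothetical sketch: framing daily tasks as tiny SLOs.
# Task names, targets, and counts are invented for illustration.
from dataclasses import dataclass


@dataclass
class DailySLO:
    name: str
    target: float      # fraction of days the task should get done
    done_days: int     # days the task actually got done
    total_days: int    # days in the current window

    def attainment(self) -> float:
        return self.done_days / self.total_days

    def budget_remaining(self) -> float:
        """Share of the allowed 'misses' still unspent (can go negative)."""
        allowed_misses = (1 - self.target) * self.total_days
        actual_misses = self.total_days - self.done_days
        return 1 - actual_misses / allowed_misses


slos = [
    DailySLO("inbox to zero", target=0.8, done_days=26, total_days=30),
    DailySLO("deep-work block", target=0.9, done_days=26, total_days=30),
]
for s in slos:
    print(f"{s.name}: {s.attainment():.0%} attained, "
          f"{s.budget_remaining():.0%} of error budget left")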

Michael Knox
Michael Knox

Platform SRE Team Lead

ANZx

Michael Knox

Platform SRE Team Lead

ANZx

SLOs are all around us and we don't know it
Learn more ›
close
Michael Knox

Michael Knox

Platform SRE Team Lead

ANZx

LinkedIn
Wide range of SRE, Platform engineering, and Development roles over 25 years across Banking, FinTech, Transport and Communications, for organisations including ANZ, Boeing and NEC.
2023

SLOs are all around us and we don't know it

In this talk, I'm looking at parallels, lessons, and inspiration from Type 1 Diabetics tracking and responding to their blood glucose levels, and SLOs in an IT system context. Type 1 Diabetics are always on-call, with potentially life-threatening ramifications for the events they respond to on a daily basis.
 
A friend of mine and my wife are both Type 1 Diabetics; there are parallels and lessons that we can draw from their experience of being on-call 24x7, with monitoring, burn rate predictions, incidents, alert fatigue, and problem management activities all playing a core part of their lives.

Mike Fiedler
Mike Fiedler

Wrangler of the Unusual

Mike Fiedler

Wrangler of the Unusual

Call off the Jam: Lessons in Setting Reasonable Objectives f...
Learn more ›
close
Mike Fiedler

Mike Fiedler

Wrangler of the Unusual

Twitter LinkedIn
With over three decades of experience as a professional engineer, Mike has amassed a wealth of knowledge and expertise in his field. Throughout his career, he has sought to learn from every colleague he's worked with, and in turn, has taught many others. He has held senior leadership roles at companies such as LeafLink, Warby Parker, and Capital One (Paribus), and has also worked at notable companies like Datadog and MongoDB.

Mike has been a speaker at conferences since 2012, and has been recognized for his contributions to the tech community with the Awesome Community Chef Award in 2016 and recognition as an AWS Container Hero since 2018. As a true technologist, he devotes his free time to working on open source tools, learning new technologies, and volunteering as a roller derby referee. With a holistic view of systems and software and a passion for problem-solving, Mike excels in helping others navigate the complexities of the tech world.
2023

Call off the Jam: Lessons in Setting Reasonable Objectives from a Roller Derby Referee

I'll speak about my past experience of how I got involved, about lowering the barrier to entry for newcomers while still keeping a high bar for quality, and about balancing the challenges there.
 
Taking skills from the work environment into volunteer sports officiating, and vice versa, demonstrates how people's lives can both influence and be influenced by objective setting.

Natalia Sikora-Zimna
Natalia Sikora-Zimna

Product Manager

Nobl9

Natalia Sikora-Zimna

Product Manager

Nobl9

EM & PM collaboration based on SLOs
Learn more ›
close
Natalia Sikora-Zimna

Natalia Sikora-Zimna

Product Manager

Nobl9

2023

EM & PM collaboration based on SLOs

Using SLOs to manage product priorities and make the most efficient use of the engineering team's time.
 
I will present a real-life scenario of using SLOs to manage product requirements and be smarter about allocating the engineering team's focus. The talk will discuss an example of how SLOs helped us monitor a potential problem instead of assigning an engineering team to jump straight into solving a complex issue, and how they helped us focus on a specific area of a wider problem. In a broader sense, the talk will discuss collaboration between the Product Manager and the Engineering Manager and how SLOs can lead to more productive conversations.

Pankaj Gupta
Pankaj Gupta

Senior Software Engineer

Sumo Logic

Pankaj Gupta

Senior Software Engineer

Sumo Logic

SLOs created from Monitors
Learn more ›
close
Pankaj Gupta

Pankaj Gupta

Senior Software Engineer

Sumo Logic

Has worked as a software engineer at companies like Samsung, Amazon, and Sumo Logic for the past few years, across multiple domains including the Linux kernel, Android, advertisements, last-mile delivery, and observability.
2023

SLOs created from Monitors

Critical monitors are good candidates for creating SLOs. For example, a monitor that tracks the latency of a customer-critical API (triggering an alarm when latency > 500ms) is a good candidate for an SLO, since teams will want SLOs for such customer-critical services. Because monitors already carry the thresholds beyond which the team believes customers are impacted (that's why they set up alerts in the first place), creating an SLO on top of such a monitor saves the user a lot of work (and clicks; at Sumo Logic a user needs to fill in only 4 fields to create an SLO), as these thresholds automatically propagate to the created SLO. At Sumo Logic we are in the process of launching something similar, where our users will be able to create SLOs from their monitors. The related SLO is penalised for the duration for which the underlying monitor condition is violated; in a way, the good windows/bad windows of these SLOs are determined by the trigger states (or trigger time series) of the monitor. These SLOs are modeled as window-based SLOs in Sumo Logic. Given that monitors are already widely adopted in the industry and people are comfortable with them, we at Sumo Logic believe it will be easier for our customers to grasp the concept coming from monitors, and that they will adopt reliability management and SLO concepts faster.
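
As a minimal sketch of the window-based idea described above (not Sumo Logic's implementation; the data and window size are invented), a monitor's trigger time series can be mapped directly onto good/bad SLO windows:

# Hypothetical sketch: derive a window-based SLI from a monitor's
# trigger states. A window is "bad" if the monitor was triggered at any
# point inside it. Data and window size are invented for illustration.

# One sample per minute: True = monitor triggered (e.g. latency > 500ms).
trigger_series = [False] * 50 + [True] * 10 + [False] * 60  # 120 minutes

WINDOW_MINUTES = 5


def windowed_sli(triggers: list[bool], window: int) -> float:
    """Fraction of windows in which the monitor never fired."""
    windows = [triggers[i:i + window] for i in range(0, len(triggers), window)]
    good = sum(1 for w in windows if not any(w))
    return good / len(windows)


sli = windowed_sli(trigger_series, WINDOW_MINUTES)
print(f"good windows: {sli:.1%}")  # 22 of 24 windows are good -> 91.7%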

Ramesh Nampelly
Ramesh Nampelly

Senior Director of Cloud Infrastructure and Platfo...

Palo Alto Networks

Ramesh Nampelly

Senior Director of Cloud Infrastructure and Platform Engineering

Palo Alto Networks

Improve SLOs through auto remediations and external context ...
Learn more ›
close
Ramesh Nampelly

Ramesh Nampelly

Senior Director of Cloud Infrastructure and Platform Engineering

Palo Alto Networks

Twitter LinkedIn
Ramesh is currently at Palo Alto Networks leading cloud infrastructure and platform engineering; his team is responsible for building internal engineering platforms that help developers and SREs improve their productivity. Prior to PAN, Ramesh was head of engineering effectiveness at Cohesity.
2023

Improve SLOs through auto remediations and external context correlation

I will be focusing on how we’ve built an observability platform that includes incident analytics feeding auto remediations, as well as secrets management, which in turn brought down MTTR and improved other SLOs for our production services.

I will be covering the high-level architecture of our internal developer platform and observability platform and how they interact, and I will talk through each component involved, such as Backstage.io, the Grafana stack (Grafana, Mimir, Loki, and Tempo), Vector.dev, and StackStorm.

I will conclude by showing the north-star metrics we track internally and how they helped increase platform adoption.

Ricardo Castro
Ricardo Castro

Principal Site Reliability Engineer

FanDuel

Blip

Ricardo Castro

Principal Site Reliability Engineer

FanDuel

Blip

Overcoming SRE Anti-Pattern Roadblocks: Lack of a Reliabilit...
Learn more ›
close
Ricardo Castro

Ricardo Castro

Principal Site Reliability Engineer

FanDuel

Blip

Twitter LinkedIn
Principal Site Reliability Engineer at FanDuel/Blip.pt. MSc in Computer Science from the University of Porto. CK{AD, A, S} from the Cloud Native Computing Foundation (CNCF) | Linux Foundation. {Terraform, Consul, Vault} Associate from HashiCorp. Working daily to build high-performance, reliable, and scalable systems. DevOps Porto meetup co-organizer and DevOpsDays Portugal co-organizer. A strong believer in culture and teamwork. Passionate about open source, a martial arts amateur, and a metal lover.
2023

Overcoming SRE Anti-Pattern Roadblocks: Lack of a Reliability Framework

SREs, as the name implies, care about service reliability. But they often struggle to find a way to define, measure, and assess their services' reliability. In practice, they lack a Reliability Framework.

How can SLOs help? They provide an opinionated way to do just that: define, measure, and assess service reliability from the user's perspective. They provide a common language to talk about reliability and prioritize work. They help fix the anti-pattern of trying to ensure service reliability without clearly defining what it means.

Roman Khavronenko
Roman Khavronenko

Software Engineer

VictoriaMetrics

Roman Khavronenko

Software Engineer

VictoriaMetrics

Retroactive evaluation of SLO objectives in VictoriaMetrics
Learn more ›
close
Roman Khavronenko

Roman Khavronenko

Software Engineer

VictoriaMetrics

Twitter LinkedIn
Roman is a software engineer with experience in distributed systems, databases, monitoring, and high-performance microservices. Roman's passion is open source, and he's proud to have contributed to Prometheus, Grafana, and ClickHouse. Currently, Roman is working on the open source time series database and monitoring solution VictoriaMetrics.
2023

Retroactive evaluation of SLO objectives in VictoriaMetrics

Recording rules are a clever concept introduced by Prometheus for storing the results of query expressions as new time series. This concept is used for SLO calculations. But due to the nature of recording rules, they have no retroactive effect. And since an SLO objective usually covers a time window of no less than 30d, recording rules produce incomplete results until the whole time window has been captured.

The talk will cover how this can be fixed in the VictoriaMetrics monitoring solution via retroactive rule evaluation, using the example of rules generated via the https://github.com/slok/sloth framework.
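
To make the gap concrete, here is a minimal sketch (not the VictoriaMetrics feature itself) of what retroactive evaluation amounts to: re-running a recording rule's expression over past timestamps through the standard Prometheus-compatible range-query API, so the derived series covers the whole SLO window. The URL, metric names, and step are assumptions.

# Hypothetical sketch: backfill a recording rule's results by evaluating
# its expression over the past 30 days via the Prometheus-compatible
# /api/v1/query_range endpoint. URL, metric, and step are assumptions.
import time

import requests

PROM_URL = "http://localhost:8428"  # VictoriaMetrics' Prometheus-compatible API
RULE_EXPR = (
    'sum(rate(http_requests_total{code=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)

end = int(time.time())
start = end - 30 * 24 * 3600       # 30 days back
resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": RULE_EXPR, "start": start, "end": end, "step": "5m"},
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    # Each entry is [timestamp, value]; a backfill tool would write these
    # points back as the recording rule's output series.
    print(series["metric"], len(series["values"]), "historical points")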

Sal Furino
Sal Furino

CRE

Nobl9

Sal Furino

CRE

Nobl9

Two Paths in the Woods
Learn more ›
close
Sal Furino

Sal Furino

CRE

Nobl9

Twitter LinkedIn

Sal Furino is a Customer Reliability Engineer at Nobl9. During his career he's worked as a TPM, SRE, Developer, Sys Admin, and in IT support. While not working he enjoys cooking, gaming, traveling, skiing, and golfing. Sal lives in Queens with his partner and has a BS in Applied Mathematics from Marist College.

2023

Two Paths in the Woods

While the direction and intent of SRE have been established and are becoming better understood, the details of how to achieve "SRE" are still an exercise left to the reader.

Steve from Google and Sal from Nobl9 will present two independently developed methods for teaching SRE topics to customers, which we have discovered are actually quite similar. Huzzah!

Steve will present the "reliability map" and Sal will show Nobl9's SLODLC. Both are essentially sets of documentation presented in a way that allows customers to take a *guided* path through the dark and scary woods that is today's SRE.

The "reliability map" provides a detailed accounts of what development, infrastructure, operational, observability, and people/culture activities organizations can expect to be practicing to achieve different eras of reliability (99%, 99.9%, 99.99%, 99.999%). The "reliability map" takes it one step further and provides descriptions and references to additional content for all activities.

SLODLC provides a framework of documentation and templates to allow customers to iterate on improving their SLOs. This is especially necessary when customers have high reliability targets which rely upon automation to execute predetermined playbooks (rollback, exponential backoff, autoscale, etc.). If SLOs indicate the performance of a user journey and decide when action needs to be taken, then customers need a clear and usable way to explain:

  • What they are attempting to measure (golden signals?, something else?)
  • Why was it decided to measure X in such a way?
  • How is X impactful for the targeted user journey?
  • When was the last time the SLO objective, metric, time window, etc was changed?
  • When the error budget is in danger of being breached, what actions should be taken?
In short, both projects approach the same topic of helping customers improve their reliability from different points of view, and they synergize well together.

Sally Wahba
Sally Wahba

Principal Engineer

Splunk

Sally Wahba

Principal Engineer

Splunk

Thinking about SLO from On-Prem to Cloud - A Developer's Per...
Learn more ›
close
Sally Wahba

Sally Wahba

Principal Engineer

Splunk

Sally is a Principal Software Engineer at Splunk where she works on data ingestion for observability. Before Splunk, she spent around a decade working on operating systems for data storage systems at NetApp. Sally obtained her PhD in Computer Science from Clemson University. She presented her work and research both nationally and internationally.

When not working you will find her doing computer science outreach activities, reviewing for technical conferences, mentoring, and learning Spanish.
2023

Thinking about SLO from On-Prem to Cloud - A Developer's Perspective

My background has mostly been in developing operating systems for data storage companies. In this environment, almost everything is controlled internally. For example, if the SLO of the operating system is five 9s, then the error budget is usually all consumed by software bugs owned internally by the company. As I switched to developing SaaS products in the cloud, this drastically changed. Below are examples of lessons learned during different phases of working on a product, from development to production to support and operations. My goal is to share these lessons so other developers can learn from my experience.

In my previous role, the two main things I relied on while developing operating system products were the testing suite and the release cadence. Shipping a new operating system every 6 months was considered fast. This gave developers a lot of time for their code to soak and be tested internally before being released to customers. Additionally, with a slower release cadence there was a lot of effort invested in creating different layers of testing. After all, a bug fix would take months to reach customers. Even if we released a patch quickly, which in this context means a few weeks, customers would still need to update their operating systems to apply that patch, and who knows how long a customer would wait before applying it. After moving to building SaaS products in the cloud, the release cadence became much faster. This required shifting my thought process. Instead of relying on soak time and various levels of testing, activities such as code review and in-build unit tests now take a front-row seat. Metrics like code coverage from unit tests mattered more, while metrics like how long it had been since QA found a bug mattered less.

Another difference between my old role and my new role is how developers access production and production metrics. In the old role, gathering production metrics was no easy feat. Harder access to production metrics meant that changing SLOs internally was harder, took longer, and left a lot of developers not knowing how or why SLOs were changing. For a SaaS product running in the cloud, developers have access to overall system performance metrics at the click of a button, while maintaining compliance. This makes it easier for developers to know why and how SLOs are affected by SLAs, and also gives them faster reaction times.

From the operations perspective, the old role and the new role were quite different. In the old role, developers didn't go on-call. There was a customer support organization that would handle any customer issues first. Developers were brought in occasionally if customer support needed a bug fixed. Developers didn't need to wake up in the middle of the night because of an outage. In the new role, developers are on-call, which means they can and do occasionally get paged in the middle of the night. This shift caught me off guard, as I had to think much harder about how my service impacts the quality of life of my colleagues and myself. It made me think about how to change and measure SLOs to eventually avoid waking someone up in the middle of the night.

Another lesson that caught me off guard is that not all SLAs can be trusted when working in the cloud. Yes, I knew this lesson theoretically, but learning it in practice is a different story. One example was an outage caused by a cloud provider breaching their SLAs for a managed service that we relied on for our product. This resulted in our product breaching its SLAs. When something like this happens, you think hard about how to update your SLOs to prepare for these issues, catch them, and react to them. I found that using SLOs that are tighter than SLAs was helpful in that regard.
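
A small back-of-the-envelope sketch (numbers invented, not from the talk) shows why an internal SLO tighter than the external SLA helps: the availability you can promise is bounded by your own errors combined with your dependencies' SLAs, so leaving a margin gives you room to detect and react before the customer-facing SLA is at risk.

# Hypothetical sketch: budgeting an internal SLO below an external SLA
# when a managed dependency has its own SLA. All numbers are invented.

external_sla = 0.999        # what we promise customers (99.9%)
dependency_sla = 0.9995     # what the cloud provider promises us
own_target = 0.9999         # reliability target for our own code

# Serial composition: the product is only up when both parts are up.
expected_availability = own_target * dependency_sla
print(f"expected availability: {expected_availability:.4%}")   # ~99.94%

# Run the internal SLO tighter than the external SLA so a dependency
# wobble burns *our* internal budget first, before the SLA is breached.
internal_slo = 0.9995
margin = (1 - external_sla) - (1 - internal_slo)
print(f"internal SLO: {internal_slo:.2%}, "
      f"safety margin: {margin * 365 * 24:.1f} hours of downtime per year")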

In conclusion, even with years of professional experience, moving from developing On-Prem products to cloud SaaS offerings changed how I think about SLOs and it's truly not a one-size-fits-all.

Shubham Srivastava
Shubham Srivastava

Head of Developer Relations

Zenduty

Shubham Srivastava

Head of Developer Relations

Zenduty

Kubernetes Monitoring - Choosing Optimal Metrics for SLO Ale...
Learn more ›
close
Shubham Srivastava

Shubham Srivastava

Head of Developer Relations

Zenduty

Twitter LinkedIn
Leading Developer Relations at Zenduty - an advanced incident management and response orchestration platform.
Takes pride in making mistakes, learning from them, and advocating for best practices for orgs setting up their DevOps, SRE, and Production Engineering teams.

A zealous and eternally curious professional, fascinated by stories from DevOps, Incident Management, and Product Design; hoping to solve real-world problems with the skills and technology I'm actively amused by. An orator, writer, and hopeful comedian trying his very best to do something I'm proud of every day.
2023

Kubernetes Monitoring - Choosing Optimal Metrics for SLO Alerting

A well-performing monitoring system needs to answer two simple questions: “What’s broken, and why?” Monitoring allows you to watch and understand your system’s state using a predefined set of metrics and logs. Observability, on the other hand, is about bringing visibility into the system - essentially turning the lights on to see and understand the state of each component of your system, and to discover the answer to the ‘why’ part of the problem.
 
Building an efficient and battle-tested monitoring platform usually takes quite a while. You need to learn, over a period of time, how your system performs on various fronts before you can accurately know which metrics to monitor for prompt alerting that helps predict unavoidable incidents, meet your SLOs, and in turn prevent downtime.
 
We have analysed the incident data of over 150 highly active organisations deploying Prometheus Alertmanager to monitor their Kubernetes infrastructure, and have discovered some unusually common yet fatal mistakes made when choosing SLO metrics, as well as some clever configurations that drastically reduce noise.
 
This talk aims to give you a run-through of best practices and ‘what not to do’ when choosing Prometheus metrics for clean and noiseless alerting.

Stephan Lips
Stephan Lips

Software Engineer and SLO Advocate

Procore

Stephan Lips

Software Engineer and SLO Advocate

Procore

Black Box SLIs
SLOs as code
Learn more ›
close
Stephan Lips

Stephan Lips

Software Engineer and SLO Advocate

Procore

LinkedIn
2023

Black Box SLIs

Adopting an SLO culture involves identifying the metrics that matter without drowning in noise and alert fatigue. The black box concept lets us aggregate granular metrics into SLIs that focus on the user experience as an indicator of system reliability.

The talk will be based off this article published on Procore's Engineering blog: https://careers.procore.com/blogs/engineering-at-procore/black-box-slis.
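
As a rough illustration of the black-box idea (not Procore's implementation; the endpoints and numbers are invented), many granular per-endpoint metrics can be rolled up into a single user-facing SLI, weighted by how much traffic each endpoint actually sees:

# Hypothetical sketch: aggregate granular per-endpoint success counts
# into one user-experience SLI, weighted by request volume.
# Endpoint names and numbers are invented for illustration.

endpoints = {
    # name: (total requests, successful requests)
    "GET /projects": (1_200_000, 1_198_900),
    "POST /documents": (300_000, 299_100),
    "GET /search": (500_000, 497_500),
}

total = sum(t for t, _ in endpoints.values())
good = sum(g for _, g in endpoints.values())

# The black-box SLI: what fraction of all user requests succeeded,
# regardless of which internal component served them.
sli = good / total
print(f"user-facing availability SLI: {sli:.4%}")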

2023

SLOs as code

By managing Service Level Objectives as code we can co-locate SLO definitions and ownership with the product code and team. This supports horizontal scaling of SLO ownership, while establishing a single source of truth, adding transparency, integrating with the code management process, and creating an audit trail for SLOs.

The talk will be based off this article published on Procore's Engineering blog: https://engineering.procore.com/implementing-slos-as-code-a-case-study-2/ 

Stephen Townshend
Stephen Townshend

Developer Advocate (SRE)

SquaredUp

Stephen Townshend

Developer Advocate (SRE)

SquaredUp

Reliability Benchmarking: A Pre-cursor to SLO Adoption
Learn more ›
close
Stephen Townshend

Stephen Townshend

Developer Advocate (SRE)

SquaredUp

Twitter LinkedIn
Stephen has a background in SRE and performance engineering. He has worked in the industry for 15 years as both an external consultant and an internal engineer.

Our industry is full of buzzwords and exaggerations; it can be hard to know what is real and what is not. Stephen strives to take complex technical concepts and simplify and present them in a way everyone can understand and apply (and to call out when something is too good to be true).

Stephen lives in Auckland, New Zealand and currently works as a Developer Advocate for SquaredUp, as well as promoting and improving observability and SRE practices internally in the organisation.
2023

Reliability Benchmarking: A Pre-cursor to SLO Adoption

When we first attempted to implement SLOs we took a "theory first" approach. We developed workshops and ran sessions to uncover the key users, services, indicators and objectives for a platform or application. But we failed. We didn't identify meaningful SLOs, track them, or define error budgets. We also failed to garner interest or investment from the team.
 
Taking a step back, we tried a different approach. We got access to their observability data alongside other sources of information (e.g. incidents) to build a picture of where the team was currently at in terms of reliability and operational maturity.
 
This new approach was much more successful in getting that initial spark of excitement. By providing actionable insight up front, we were able to start the SLO and SRE conversation off the right way.
 
In this talk I will share our experience and process for benchmarking reliability, and how this could be leveraged to begin SLO adoption in a complex organisation.

Stephen Weber
Stephen Weber

Staff Site Reliability Engineer

Procore

Stephen Weber

Staff Site Reliability Engineer

Procore

Arguments in Favor: why SLOs?
Learn more ›
close
Stephen Weber

Stephen Weber

Staff Site Reliability Engineer

Procore

Twitter LinkedIn
Stephen Weber is a Site Reliability Engineer at Procore Technologies, helping build the platform that builds the world. He's worked as a consulting SRE within his orgs for the past 4 years. Stephen lives and works remotely from Oregon, and has accidentally trained his huskies to know when standup should be over.
2023

Arguments in Favor: why SLOs?

In my experience, many teams encounter SLOs as something they've been told to "do"; the flip side is that it's been something I've been asked to help them with. Naturally this is not ideal; as engineers we prefer things to be self-evident.

I have a number of strategies to use when communicating the process and the value of creating and using SLOs. I'll give away the thing I say the most right here: "SLOs are for decision-making". They're not magic or even a single thing. They're a tool that helps us do our jobs.

The audience will come away either better able to articulate the pragmatic usefulness of an SLO mindset, or with one or two real motivations to consider developing and using SLOs for their own systems.

Steve McGhee
Steve McGhee

Reliability Advocacy Engineer

Google SRE

Steve McGhee

Reliability Advocacy Engineer

Google SRE

Two Paths in the Woods
Learn more ›
close
Steve McGhee

Steve McGhee

Reliability Advocacy Engineer

Google SRE

Twitter LinkedIn
Steve was an SRE at Google for about 10 years, then left to help a company move to the Cloud. He's back at Google, helping more companies do that.
2023

Two Paths in the Woods

While the direction and intent of SRE have been established and are becoming better understood, the details of how to achieve "SRE" are still an exercise left to the reader.

Steve from Google and Sal from Nobl9 will present two independently developed methods for teaching SRE topics to customers, which we have discovered are actually quite similar. Huzzah!

Steve will present the "reliability map" and Sal will show Nobl9's SLODLC. Both are essentially sets of documentation presented in a way that allows customers to take a *guided* path through the dark and scary woods that is today's SRE.

The "reliability map" provides a detailed accounts of what development, infrastructure, operational, observability, and people/culture activities organizations can expect to be practicing to achieve different eras of reliability (99%, 99.9%, 99.99%, 99.999%). The "reliability map" takes it one step further and provides descriptions and references to additional content for all activities.

SLODLC provides a framework of documentation and templates to allow customers to iterate on improving their SLOs. This is especially necessary when customers have high reliability targets which rely upon automation to execute predetermined playbooks (rollback, exponential backoff, autoscale, etc.). If SLOs indicate the performance of a user journey and decide when action needs to be taken, then customers need a clear and usable way to explain:

  • What they are attempting to measure (golden signals?, something else?)
  • Why was it decided to measure X in such a way?
  • How is X impactful for the targeted user journey?
  • When was the last time the SLO objective, metric, time window, etc was changed?
  • When the error budget is in danger of being breached, what actions should be taken?
In short, both projects approach the same topic of helping customers improve their reliability from different points of view, and they synergize well together.

Steve Upton
Steve Upton

Principal QA Consultant

Thoughtworks

Steve Upton

Principal QA Consultant

Thoughtworks

Data Product Thinking with SLOs
Learn more ›
close
Steve Upton

Steve Upton

Principal QA Consultant

Thoughtworks

Twitter LinkedIn
Steve is a Quality Analyst who works to build empowered teams, capable of delivering and taking ownership of quality. He has worked on a wide variety of products, from mainframes to microservices and has a particular interest in complex socio-technical systems and how we work with them.
 
He is passionate about complexity theory, building quality into culture and testing as part of continuous delivery in modern, distributed architectures. Outside of work, Steve enjoys travel and mountains.
2023

Data Product Thinking with SLOs

The talk tells the story of how conversations around SLOs can be a great trigger to start a shift to a Product Thinking mindset, with practical examples. We'll also take a light dip into constraint mapping from an SLO perspective.

Steve Xuereb
Steve Xuereb

Staff Site Reliability Engineer

GitLab Inc.

Steve Xuereb

Staff Site Reliability Engineer

GitLab Inc.

So many SLOs so many alerts
Learn more ›
close
Steve Xuereb

Steve Xuereb

Staff Site Reliability Engineer

GitLab Inc.

Twitter LinkedIn
Slight Reliability Engineer, I solve more problems than I create.
2023

So many SLOs so many alerts

This is the talk version of https://about.gitlab.com/blog/2022/07/19/reducing-pager-fatigue-and-improving-on-call-life/ where we’ll describe the following:
 
Problem:
At GitLab, we use SLIs to monitor each service, and a service can have multiple SLIs. When we start burning through the error budget too fast, we page the SRE on-call. If there was a service-wide degradation, we ended up paging the on-call multiple times within minutes, which is not ideal and adds stress for an already stressed on-call engineer. The worst-case scenario was a service degradation with multiple upstream dependencies, like a database, that resulted in 50+ pages. We'll go over two solutions we've implemented to cut our pager load by more than half, using built-in Alertmanager features that users can simply configure themselves. We'll also show other possible solutions that we could have used, and why we opted for the Alertmanager option.
 
Solution One:
Alertmanager grouping (https://prometheus.io/docs/alerting/latest/alertmanager/#grouping) enabled us to group all alerts for one service into one page. We grouped the alerts using a specific set of labels that our alerts have; luckily, all alerts had a service label that we could group by. We'll walk through an example of how this works and go into more detail about how Alertmanager grouping works.
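
For reference, a grouping configuration along these lines can be written by hand or generated; this is a sketch only, with a receiver name and timings that are assumptions rather than GitLab's actual configuration:

# Hypothetical sketch of an Alertmanager route that groups alerts by the
# "service" label, so one degraded service produces one notification.
# Receiver name and timings are assumptions, not GitLab's actual config.
import yaml  # PyYAML

route = {
    "receiver": "slack-sre",
    "group_by": ["service"],     # collapse all alerts for a service
    "group_wait": "30s",         # wait for more alerts before first page
    "group_interval": "5m",      # batch further alerts for the same group
}

print(yaml.safe_dump({"route": route}, sort_keys=False))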
 
Solution Two:
Now that we had alert grouping by service rolled out, the next step was to introduce service dependencies, so that when a downstream service was alerting we wouldn't also alert about the upstream service if it was burning through its error budget too fast. To achieve this we used another Alertmanager feature called inhibition (https://prometheus.io/docs/alerting/latest/alertmanager/#inhibition). We'll walk through how we implemented a new DSL for this in our metrics catalog, a jsonnet library that is the single source of truth for our SLIs, the guard rails we've implemented in the DSL, and a real-life example of this for GitLab.com.
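
A minimal sketch of what such an inhibition rule can look like (label names and values are invented; GitLab's actual rules are generated from their jsonnet metrics catalog):

# Hypothetical sketch of an Alertmanager inhibition rule: while the
# database service is firing, suppress pages for a service that depends
# on it. Label names/values are invented, not GitLab's generated rules.
import yaml  # PyYAML

inhibit_rules = [
    {
        # Source: the dependency that is already paging.
        "source_matchers": ['service = "database"'],
        # Target: dependent-service alerts to silence meanwhile.
        "target_matchers": ['service = "api"'],
        # Only inhibit when both alerts share the same environment.
        "equal": ["environment"],
    }
]

print(yaml.safe_dump({"inhibit_rules": inhibit_rules}, sort_keys=False))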
 
Results:
We’ll show a real-life example where a degradation of the database resulted in 1 page, where before it would have been more than 15 pages. Finally, we’ll show how we got fewer pages, making the on-call happier and our alerting data cleaner and easier to understand.

Thijs Metsch
Thijs Metsch

Researcher

Intel Labs

Thijs Metsch

Researcher

Intel Labs

Intent Driven Orchestration with SLOs!
Learn more ›
close
Thijs Metsch

Thijs Metsch

Researcher

Intel Labs

Thijs is a Research Engineer building cool stuff at Intel Labs. His key interests include system performance and distributed systems. In past career moves, he did work on HPC, Grids, and Cloud/Edge for companies such as IBM, Sun Microsystems, and the German Aerospace Center. He helped make shipbuilding easier, ran massive parallel workloads, managed tons of compute in hybrid environments, and created one of the first standards for the Cloud more than a decade ago. Now focused on making orchestration easier with tools such as Kubernetes using e.g. AI/ML techniques.
2023

Intent Driven Orchestration with SLOs!

With a Serverless mindset in place, there should no longer be a need to define resource requests or any other information of that kind. Just develop your app or function and run it.

But how can we achieve performance and make sure we run the system most efficiently? This is where we can let users define what they truly care about: their SLOs. Based on these performance targets, our Intent Driven Orchestration Planner will do the rest. There is no need to define resource requests and limits on, e.g., a Kubernetes cluster anymore. The planner will set up the system in such a way that your apps and functions behave as expected, without the need to know anything about the underlying infrastructure – less knowledge needed, fewer errors made!

This is a shift in how we do orchestration and SLO management that could be of interest to this community: away from a monitoring & alerting way of looking at SLOs, towards using SLOs to manage the system. Furthermore, this truly allows ease of use for the user; rather than defining numbers and values based on domain and contextualized knowledge, we let them define what they truly care about: their SLOs!

Toby Burress
Toby Burress

SRE

Dropbox

Toby Burress

SRE

Dropbox

What We Mean By "Mean"
Learn more ›
close
Toby Burress

Toby Burress

SRE

Dropbox

Toby is an SRE at Dropbox. In his free time he argues about cancelled TV shows on the internet.
2023

What We Mean By "Mean"

There's been a lot of (really good!) discussion in the last several years about how to think about, monitor, and alert on long-tailed distributed quantities, such as latency. However, I'm worried that in our desire to describe the long tail we may be too eager to abandon tools that are still useful.

In this talk I'll (re)introduce everyone's favorite summary statistic, the average. I'll talk about the difference between a sample average and a random variable's expectation, and how the two are uniquely linked by the law of large numbers. I'll also talk about how the central limit theorem allows us to treat sample averages as draws from a Gaussian distribution, irrespective of the distribution the samples come from, and then I'll talk about the exceptions. We'll finish up by looking at how the properties of expected values can give us insight into the behavior of systems, even at the tail.

Through all of this we'll be looking at examples drawn from real-world latency data, and comparing the insights gleaned from this versus other common summary statistics.
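
As a quick illustration of the kind of thing the talk covers (synthetic data, not the real-world latency examples from the talk), sample means of heavy-tailed latency data still settle into a narrow, roughly Gaussian band, even though individual requests vary wildly:

# Synthetic illustration: sample means of heavy-tailed (lognormal)
# "latency" data concentrate as the law of large numbers and the CLT
# predict, even though the underlying distribution is far from Gaussian.
import numpy as np

rng = np.random.default_rng(seed=42)

# 1000 batches of 500 simulated request latencies (milliseconds).
latencies = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 500))

batch_means = latencies.mean(axis=1)
print(f"p99 of individual requests: {np.percentile(latencies, 99):8.1f} ms")
print(f"overall mean latency:       {latencies.mean():8.1f} ms")
print(f"spread (std) of batch means:{batch_means.std():8.1f} ms")
# The batch means vary by only a few ms around the expectation, which is
# what lets us reason about aggregate behavior even with a long tail.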

Troy Koss
Troy Koss

Director, Enterprise SRE

Capital One

Troy Koss

Director, Enterprise SRE

Capital One

Resiliency is only good if it's reliable
Learn more ›
close
Troy Koss

Troy Koss

Director, Enterprise SRE

Capital One

LinkedIn
With what seems to be a natural attraction towards reliability, Troy has constantly found himself involved in making things... well... more reliable. After working in software development, he stumbled into operations and saw a clear opportunity to use software to orchestrate such efforts. Currently he works in Capital One’s stability organization leading enterprise Site Reliability Engineering (SRE). Here he plays a critical part in both evolving the enterprise strategy and leading a team of engineers focused on partnering with and influencing business, architecture, and technology partners to deliver on the strategy. His interest in reliability extends into the culture he seeks to foster for his teams, with the goal of providing a dependable haven where engineers can be autonomous and empowered to drive critical decisions. In the same spirit of helping others develop, he spends time counseling young STEM talent as an advisor for the Women’s Association of Venture & Equity (WAVE). Outside of his professional career, he enjoys horticulture, fitness, traveling to new locations, and spending time with his pup and family.
2023

Resiliency is only good if it's reliable

Resiliency is a critical piece of building reliable systems. It allows us to feel safe knowing failure is inevitable. After all, as noted in the OG SRE book, 100% is a terrible target for basically everything.

We spend a lot of resources adding layers of resiliency, from redundant multi-region compute stacks to backups of our backups. How do we know this resilience achieved our ultimate outcome of reliability for our customers? We'll discuss ways to observe your SLOs and error budgets during resiliency events.

The various events we'll look at include regional failover, chaos experiments (such as latency injection), database recovery, and more! After failing a region, do we know whether our customers experienced a disturbance? When you're running a resiliency test or game day, how do you measure success?

Observing error budgets before, during, and after an event paints a picture of our customers' experience and can ultimately be part of the success criteria. It is critical that we know how our architecture and system changes unfold. What if new resiliency introduces latency that negatively impacts your customers? For example, if there's complexity introduced that makes your release engineering more convoluted, we may see a longer error budget burn while we remediate. On the flip side, the partnership of adding resiliency and observing your SLOs can also lead to improving the objectives with newly matured levels of resiliency, raising the bar for performance.

Vijay Samuel
Vijay Samuel

Observability Architect

eBay

Vijay Samuel

Observability Architect

eBay

Scaling SLI/SLO - Pushing Your Observability Platform To Its...
Learn more ›
close
Vijay Samuel

Vijay Samuel

Observability Architect

eBay

Twitter LinkedIn
Vijay Samuel works with eBay's observability platform as its architect. During his time at eBay Vijay has transformed eBay's observability platform into a cloud native offering that is primarily built on top of open source technologies. He loves to code in Go and play video games.
2023

Scaling SLI/SLO - Pushing Your Observability Platform To Its Limits

At eBay we use a wrapped version of Prometheus as our centralized time series database. Data residing in Prometheus is used for mission-critical functions like detecting issues on the site through anomaly detection, SLOs, or simple threshold-based alerts. To be able to enforce SLOs across the thousands of microservices deployed inside of eBay, we need to be able to aggregate commonly instrumented metrics at an application level.

Why are these aggregations necessary? Host-level metrics, when used for computations like burn rates over a 30-day window, can become very expensive. Using a single query across all microservices to generate such aggregations has several scale limitations. Manually onboarding each microservice into such aggregations can also be tedious.

The talk discusses our commonly instrumented metrics, our metrics store architecture, the scale challenges seen with SLI/burn rate computation, and how we came up with concepts like templated rule groups that automatically generate Prometheus rule groups, so that aggregations exist for every application that emits metrics.
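
A minimal sketch of the templated-rule-group idea (not eBay's system; metric and label names are invented): stamp out one Prometheus rule group per application from a single template, so per-app aggregates exist without hand-written rules or one giant cross-service query.

# Hypothetical sketch: generate one Prometheus recording-rule group per
# application from a template. Metric/label names are invented and do
# not reflect eBay's actual instrumentation.
import yaml  # PyYAML


def rule_group_for(app: str) -> dict:
    return {
        "name": f"sli:{app}",
        "rules": [
            {   # Per-app error-rate aggregate, used later for burn rates.
                "record": "app:http_errors:rate5m",
                "expr": f'sum(rate(http_requests_total{{app="{app}",code=~"5.."}}[5m]))',
                "labels": {"app": app},
            },
            {   # Per-app total-rate aggregate.
                "record": "app:http_requests:rate5m",
                "expr": f'sum(rate(http_requests_total{{app="{app}"}}[5m]))',
                "labels": {"app": app},
            },
        ],
    }


apps = ["checkout", "search", "listings"]   # discovered automatically in practice
print(yaml.safe_dump({"groups": [rule_group_for(a) for a in apps]}, sort_keys=False))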

Wayne Major
Wayne Major

Cloud Reliability Engineer - OutSystems

Outsystems

Wayne Major

Cloud Reliability Engineer - OutSystems

Outsystems

So a M&O, DevOps, and Data Science Engineer got locked in a ...
Learn more ›
close
Wayne Major

Wayne Major

Cloud Reliability Engineer - OutSystems

Outsystems

A native of South Carolina, Wayne has been working in the IT industry for 15+ years in a number of different roles.
2023

So a M&O, DevOps, and Data Science Engineer got locked in a room together

This sounds like the start of a terrible bar joke, and you're absolutely right, it is. Soooooooooooooo a Data Science Engineer, an M&O Engineer, and a DevOps Engineer met via a Zoom chat from two completely different regions and time zones.

Our organization, OutSystems, is a low-code development platform company that wanted to monitor the reliability of our customers' environments. Due to the nature of these environments, we were looking to design SLOs that fit unknown and constantly changing architectures. To further complicate matters, there were also competing technology stacks: the traditional legacy platform and the next generation. We needed an automated solution that could dynamically fit into their build pipelines to measure the reliability of our customers' environments at scale.

This is a talk on how three engineers worked through buy-in, cultural, and technical challenges to come together and use our specialties to create fully automated SLO creation factories, built on the OpenSLO framework, for our products, allowing us to scale at a moment's notice.

Weyert de Boer
Weyert de Boer

Head of App Store Engineering

Tapico

Weyert de Boer

Head of App Store Engineering

Tapico

Generating SLOs rules based on OpenSLO specifications
Learn more ›
close
Weyert de Boer

Weyert de Boer

Head of App Store Engineering

Tapico

Twitter
Weyert is a converted interaction designer who is passionate about building amazing and user-friendly digital products that are a joy to use! His developer story in short: from Flash to Web.

Weyert also contributes to various communities and is part of the OpenSLO team to help define the SLOs in a declarative way.

In his spare time, Weyert enjoys reading about ancient history, hobby paleoanthropology, helping developers out in various communities, and trying to get better at oil painting.
2023

Generating SLOs rules based on OpenSLO specifications

In this talk I would like to show a utility for generating service level objective rules for Prometheus/Alertmanager based on the OpenSLO v1 specification.

The talk shows a tool that has been developed by Tapico for generating the configuration files for Prometheus and Alertmanager based on OpenSLO specifications.
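
To make the idea concrete, here is a minimal sketch (not Tapico's tool) of reading an OpenSLO-style document and emitting a Prometheus recording rule for its error ratio; the document shape shown here is simplified and should be checked against the OpenSLO v1 specification, and the metric queries are invented.

# Hypothetical sketch: read a simplified OpenSLO-style SLO document and
# emit a Prometheus recording rule for its error ratio. The document
# shape is simplified; consult the OpenSLO v1 spec for the real schema.
import yaml  # PyYAML

openslo_doc = """
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-availability
spec:
  objectives:
    - target: 0.999
  indicator:
    metadata:
      name: checkout-error-ratio
    spec:
      ratioMetric:
        good:
          metricSource:
            spec:
              query: sum(rate(http_requests_total{job="checkout",code!~"5.."}[5m]))
        total:
          metricSource:
            spec:
              query: sum(rate(http_requests_total{job="checkout"}[5m]))
"""

slo = yaml.safe_load(openslo_doc)
ratio = slo["spec"]["indicator"]["spec"]["ratioMetric"]
good = ratio["good"]["metricSource"]["spec"]["query"]
total = ratio["total"]["metricSource"]["spec"]["query"]

rule = {
    "groups": [{
        "name": f"openslo:{slo['metadata']['name']}",
        "rules": [{
            "record": "slo:sli_error:ratio_rate5m",
            "expr": f"1 - (({good}) / ({total}))",
            "labels": {"slo": slo["metadata"]["name"]},
        }],
    }]
}
print(yaml.safe_dump(rule, sort_keys=False))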

Zachary Nickens
Zachary Nickens

Global Reliability Engineering

Outsystems

Zachary Nickens

Global Reliability Engineering

Outsystems

Project Constellations
Organizational Transformations through SLOs
Learn more ›
close
Zachary Nickens

Zachary Nickens

Global Reliability Engineering

Outsystems

Twitter LinkedIn
Zac is a Reliability Engineer who has served the energy, defense, and private sectors, primarily focused on reliability and observability for scientific, distributed, and geospatial computing.
2023

Project Constellations

Every organization typically identifies a “North Star” to guide it as it navigates its business, technologies, and customer experiences. Navigating by only one celestial body, however, is inefficient and prone to environmental constraints. Navigating by multiple known celestial bodies, each imbued with context, avoids those environmental and operational constraints.

Identifying a contextual North Star for a technology organization provides benefit, but identifying a constellation of service level objectives provides a greater set of benefits. SLOs used as the primary navigational devices offer transformative advantages in navigating business, technology, and customer experience challenges.

2023

Organizational Transformations through SLOs

Leading organizational change always presents challenges. Leveraging Service Level Objectives, we are transforming multiple aspects of our business and operational decision making processes. SLOs are providing innovative technical, operational, and business accelerations across our SRE Transformation, SDLC, and operation of both existing and emerging products.