2023 SLOconf Speakers
May 15-18, 2023
This Year's Speakers

Adriana Villela
Sr. Developer Advocate
Lightstep

Adriana Villela
Adriana is a Sr. Developer Advocate at Lightstep from Toronto, Canada, with over 20 years of experience in tech. She focuses on helping companies achieve reliability greatness through Observability, DevOps, and SRE practices. Before Lightstep, she was a Sr. Manager at Tucows, where she defined technical direction for the organization, running both a Platform Engineering team and an Observability Practices team. Adriana has also worked at various large-scale enterprises, including Bank of Montreal (BMO), Ceridian, and Accenture. At BMO, she was responsible for defining and driving the bank's enterprise-wide DevOps practice, which impacted business and technology teams in multiple locations across the globe.
Adriana has a widely read technical blog on Medium (https://adri-v.medium.com), known for its casual, approachable take on complex technical topics and its high level of technical detail. She is also an OpenTelemetry contributor, a HashiCorp Ambassador (https://www.credly.com/badges/551d47a7-67cb-41bb-baeb-8c90f114f03a/public_url), and co-host of the On-Call Me Maybe Podcast (https://oncallmemaybe.com).
Translating failures into SLOs

Alayshia Knighten
Manager of Onboarding Eng
Honeycomb
SLI Negotiation Tactics for Engineers

Alayshia Knighten
Alayshia Knighten is an Engineering Manager of Product Training at Honeycomb with many years of experience in the DevOps realm. Alayshia specializes in enhancing technical and team-related experiences while educating customers on their journey with and beyond observability. In her words, “Getting shit done while identifying how to accelerate at the person beyond the tooling is the real meat and potatoes.” She enjoys solving the “so, how do we solve that?” problems and meeting people from all walks of life. Her tiny hometown and Southern background inspire Alayshia. In her spare time, she enjoys hiking, grilling, painting, and making random bird calls with her father.
SLI Negotiation Tactics for Engineers

Aleksandra Dziamska
Engineering Manager
Nobl9

Aleksandra Dziamska
Engineering Manager
Nobl9
Her 10-plus-year journey in software development started as a software engineer and moved toward team leadership and management. Throughout her career, she has strived to focus on what she feels is most important (in IT as in life): people. Translated into Engineering Manager dialect: lead the engineering team to deliver the best value to end users, effectively combining Product and Engineering priorities. She explores how SLOs can help here.
EM & PM collaboration based on SLOs

Alex Hidalgo
Principal Reliability Advocate
Nobl9
Error Budgets for Conference Planning

Alex Hidalgo
Error Budgets for Conference Planning

Alex Kudryashov
Lead software engineer
New Relic
SLI adoption in a mid-size company: wins and flops

Alex Kudryashov
Lead software engineer
New Relic
These days I am leading a team that is developing Service Level Management at New Relic. I love solving challenges at the intersection of product and engineering, so I am creating tools for developers like myself.
SLI adoption in a mid-size company: wins and flops

Alexandra McCoy
SRE Engineer & VMware Enthusiast
VMware
Reliability Enablement: Achieving Reliability with SLOs

Alexandra McCoy
Alexandra is an SRE Engineer at VMware. She is passionate about the Cloud Native, Open Source, and Reliability Engineering communities. Although VMware is home, she was introduced to the Cloud while at IBM Public Sector and then transitioned into IBM Cloud. She later gained additional hybrid cloud experience at Diamanti, focusing on E2E product support for their Kubernetes-based appliance. She is excited about the industry's direction and hopes to contribute in a way that only helps improve the cloud space.
Reliability Enablement: Achieving Reliability with SLOs

Ana Margarita Medina
Staff Developer Advocate
Lightstep
OKRs with BLOs & SLOs via User Journeys
Translating failures into SLOs

Ana Margarita Medina
Ana Margarita is a Staff Developer Advocate at Lightstep and focuses on helping companies be more reliable by leveraging Observability and Incident Response practices. Before Lightstep, she was a Senior Chaos Engineer at Gremlin and helped companies avoid outages by running proactive chaos engineering experiments. She has also worked at companies of various sizes, including Google, Uber, SFEFCU, and Miami-based startups. Ana is an internationally recognized speaker and has presented at AWS re:Invent, KubeCon, DockerCon, DevOpsDays, AllDayDevOps, Write/Speak/Code, and many others.
Catch her tweeting at @Ana_M_Medina about traveling, diversity in tech, and mental health.
OKRs with BLOs & SLOs via User Journeys
- What are all these buzzwords?
- What is OKR
- What is KPI
- What is BLO
- What is User Journey
- What is SLO
- Why our systems need them
- How to define them from the top down
- How to define them from the bottom up
- How to keep SLOs and User Journeys healthy
- Error Budget
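The error budget item above comes down to simple arithmetic; as a minimal sketch (the helper name is mine, not from the talk), a target and a window fully determine the allowed unavailability:

```python
# Hypothetical helper, not from the talk: an error budget is the
# unavailability an SLO target permits over its time window.

def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for `slo_target` over `window_days`."""
    return (1 - slo_target) * window_days * 24 * 60

# A 99.9% SLO over a 30-day window permits about 43.2 minutes of downtime.
print(error_budget_minutes(0.999, 30))
```

Keeping SLOs healthy then means tracking how much of that budget a user journey has already consumed.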
Translating failures into SLOs

Andrew Howden
SRE Engineering Manager
Zalando
Driving engineering priorities with service level objectives on critical business operations

Andrew Howden
Driving engineering priorities with service level objectives on critical business operations
I will talk through the details of how SLOs at Zalando have evolved from the initial implementation ("SLOs for everything!") to the challenge of ensuring SLOs have the organizational power to drive changes in engineering priorities, to the current design of "critical business operations" and SLOs on those operations.
I'll discuss how we address the "fast burn" SLO problem by leveraging distributed tracing to identify regressions in the customer experience automatically. When a regression is identified, we automatically identify and page the team best empowered to address it.
I'll discuss how to address the "slow burn" SLO problem through periodic operational review meetings, in which the SLOs are evaluated, and violations to the SLO (or slow burn issues) are allocated to an owner to investigate and address.
Lastly, I'll talk about challenges with the existing approach, including the difficulty of modelling event systems as a reliable flow, difficulty in rolling out more SLOs for non-customer-facing aspects of the organization and returning to service-specific SLOs.

Andrew Newdigate
Distinguished Engineer
GitLab Inc.
Tamland: How GitLab.com uses long-term monitoring data for capacity forecasting

Andrew Newdigate
Andrew is a seasoned engineer with over two decades of experience in software development and reliability engineering. As a Distinguished Engineer at GitLab, he is responsible for the reliability and availability of GitLab's SAAS properties: GitLab.com and GitLab Dedicated. He is a strong advocate for using SLOs, error budgets, and observability data to drive change and manage technical debt. Previously, Andrew co-founded the developer community site Gitter in 2012, where he served as CTO until its acquisition by GitLab in 2017.
Tamland: How GitLab.com uses long-term monitoring data for capacity forecasting
For any large scale production system, the ability to effectively forecast potential capacity issues is crucial for the smooth functioning of the environment. With a reliable prediction, teams can proactively plan ahead, implement necessary scaling changes in a controlled manner and avoid unexpected availability issues that can cause stress and harm to the system.
Before implementing Tamland, the capacity planning process at GitLab.com was ad-hoc, and relied heavily on manual processes and intuition. Unfortunately, this approach often resulted in oversights, with issues going unnoticed until it was too late, sometimes only surfacing when site availability was impacted.
This talk delves into how GitLab leveraged the power of statistical analysis to greatly improve its capacity planning process. The session will be a practical demonstration of how we analyse long-term metrics data using Meta's Prophet library to build sophisticated forecast models.
Tamland, the capacity planning tool built by GitLab, is an open-source project, and attendees will have access to the source code if they're interested in exploring the implementation in greater detail. This session is for anyone interested in learning how forecasting libraries such as Prophet, Greykite, or NeuralProphet can be integrated into an observability system to provide greater insight into the health of a system.

Andrew Snyder
Senior DevOps Engineer
Contino
Cognizant
Taking Your Error Budgets to the Next Level

Andrew Snyder
Andrew has begun his third decade of providing exceptional technology solutions in the DevOps space, having worked full-time in engineering management leadership roles at global Fortune 100 companies including Standard & Poor's / The McGraw-Hill Companies, Bank of America / Merrill Lynch, Time Warner, and others.
Taking Your Error Budgets to the Next Level
- Review Error Budgeting That Is Currently in Place
- Observations Made on the Ongoing Maintenance of EBs
- Modifications to SRE KPIs in Alignment with Ongoing EB Compliance

Ashley Chen
Software Engineer
Datadog
How I learned to stop worrying and love burn rates

Ashley Chen
Software Engineer
Datadog
Ashley is a software engineer on the SLO team at Datadog. When she’s not working, she enjoys mentoring future engineers at Emergent Works and exploring the transit history of New York City.
How I learned to stop worrying and love burn rates
Part of building the infrastructure for SLOs at Datadog includes putting SLOs into practice. As an engineering team, we have seen the direct impact of utilizing burn rate alerts over traditional threshold alerts. Our story starts with understanding the purpose of our alerts. Though these monitors have well defined runbooks and technical implications, they do not fully capture the impact of these errors on our users. In this talk, I will discuss the process we took to replace some of our threshold alerts with burn rate alerts and how we were able to quantify the urgency of service degradation by alerting at different burn rates. This transition has driven the balance of reliability and development work for the team, which has led to more reliable services and better nights of sleep.
We will then tell the story of how we implemented burn rate alerts, decided which ones to use, and compared them to threshold alerts. We found that they were more reliable and triggered less often. For example, our burn rate alerts triggered when dependencies were failing, whereas our threshold alerts did not. By being more reliable and building the team's trust in our systems, burn rate alerts ended up reducing our alert fatigue and late-night pages.
We learned that paging at high burn rates captures when human intervention is needed to resolve customer impact. In contrast, low burn rates help us anticipate short-term impact. Looking at these alerts in reviews and retros opens up further discussion: we can change the way we maintain our team's reliability process and, in return, actually see the number of pages decrease and the service become more reliable.
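As a rough sketch of the mechanics behind this story (the thresholds and names are illustrative, not Datadog's implementation): a burn rate divides the observed error rate by the rate the SLO budgets for, and checking a short and a long window together separates fast burns from slow ones:

```python
# Illustrative sketch, not Datadog's implementation.

def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is burning."""
    return error_rate / (1.0 - slo_target)

def alert(short_err: float, long_err: float, slo_target: float) -> str:
    short = burn_rate(short_err, slo_target)
    long_ = burn_rate(long_err, slo_target)
    if short >= 14.4 and long_ >= 14.4:  # fast burn: page a human
        return "page"
    if short >= 3.0 and long_ >= 3.0:    # sustained slower burn: a ticket
        return "ticket"
    return "ok"

# A 2% error rate against a 99.9% SLO burns the budget 20x too fast.
print(alert(0.02, 0.02, 0.999))
```

Requiring both windows to exceed the threshold is what keeps these alerts quieter than raw threshold monitors: a brief spike trips the short window but not the long one.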

Bram Vogelaar
Software Engineer
Seaplane

Bram Vogelaar
Bram Vogelaar spent the first part of his career as a Molecular Biologist. He then moved on to supporting his peers by building tools and platforms for them with a lot of Open Source technologies. He now works as a software engineer at seaplane.io, building a global platform for building and scaling your apps.
a Pint size introduction to SLO
We'll discuss the need for, and the options of, monitoring not only our platforms and their inevitable outages, but also the outages' (potential) length and impact. We'll look at using Service Level Objectives as a way to prepare teams to tweak their testing and monitoring setup and runbooks to quickly observe, react to, and resolve problems.

Bryan Oliver
Principal Architect and K8s Sig Network Member
Thoughtworks
SLO Driven Deployments - Point of Change Compliance meets OpenSLO

Bryan Oliver
Bryan is an experienced engineer and leader who designs and builds complex distributed systems. He has spent his career developing mobile and back-end systems whilst building autonomous teams. More recently he has been focused on delivery and cloud native at Thoughtworks. In his free time he plays ice hockey, goes trail running, and tries to break into the Champion rank in Rocket League. https://olivercodes.com
SLO Driven Deployments - Point of Change Compliance meets OpenSLO
Point of Change compliance is a deployment concept in which we leverage Kubernetes admission controllers to block or allow deployments at the boundary of each environment. We can combine this concept with OpenSLO to create a powerful zero-trust architecture for error budgets. Imagine if we enforced error budgets at the boundary of the environment! Teams will begin to take them more seriously as a result.
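A minimal sketch of the decision such an admission controller could apply (the function and field names are my assumptions, not from the talk): deny a deployment once the service has overspent its error budget:

```python
# Hedged sketch: the verdict logic a Kubernetes validating admission
# webhook could apply at the environment boundary. The returned dict
# loosely mimics an AdmissionReview response but is simplified.

def admit(slo_target: float, observed_availability: float) -> dict:
    """Allow a deployment only while some error budget remains."""
    budget = 1.0 - slo_target            # total error budget
    spent = 1.0 - observed_availability  # budget consumed so far
    remaining = 1.0 - spent / budget     # fraction of budget left
    return {
        "allowed": remaining > 0.0,
        "message": f"{remaining:.0%} of error budget remaining",
    }

print(admit(0.999, 0.9995))  # healthy budget: deployment allowed
print(admit(0.999, 0.9975))  # budget overspent: deployment blocked
```

In a real cluster this logic would sit behind a ValidatingWebhookConfiguration, with the SLO target and observed availability read from an OpenSLO definition and a metrics backend rather than passed in directly.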

Christian "Serpico" Long
7-time Brooklyn skeeball champion and nationally-ranked roller
SLOs and Skeeball

Christian "Serpico" Long
Christian has been rolling skeeball competitively in Brooklyn, NY for 11 years and nationally for 7 years. He started out as a straightforward 40-roller, as is both the conventional recommended starting approach and a widely regarded standard for high level competition. Eventually he started dabbling in hybrid rolling and came to develop a fine-tuned highly tactical strategy that minimizes risk and has virtually no ceiling, enabling him to compete with the best rollers in the world.
SLOs and Skeeball

Dan Venkitachalam
Software Reliability Engineer
Atlassian
Terraforming SLOs at Atlassian

Dan Venkitachalam
Software Reliability Engineer
Atlassian
Dan is a veteran software engineer and technical manager with over 20 years of experience. He currently works on the Tome team at Atlassian, helping internal organisations to define and achieve their operational goals with SLOs.
Terraforming SLOs at Atlassian
Tome is our internal platform for managing, reporting and alerting on SLOs. A design goal was to enable SLOs to be defined with Configuration as Code. This has become the primary way that SLOs are maintained within Atlassian. Working with SLOs this way has many benefits:
- Enforces consistency in how we organise, define and validate SLOs
- Changes are tracked and attributed through a version control system
- Updates can be deployed as part of existing continuous integration and delivery pipelines
We have written a custom Terraform provider for provisioning SLOs, which interfaces with Tome's backend API. It:
- Optimizes deployments by tracking deployment state and applying only diffs
- Performs custom validation on configurations

Daniel Golant
Senior Software Engineer

Seeing Like A State: SLOs From The C-Suite

Deepak Kumar
Senior Cloud Infrastructure and Devops
Zenduty
Kubernetes Monitoring - Choosing Optimal Metrics for SLO Alerting

Deepak Kumar
I'm a Senior Cloud Infrastructure and DevOps Engineer at Zenduty, an incident management and response orchestration platform, trying my best to make sure that every service and application at our org is secure, reliable, and accessible 24/7. I have experience working with, and am passionate about, cloud services, orchestration engines, enterprise networking, and observability platforms, and figuring out how they work best together. I'm looking to talk about my experiences and how we manage mission-critical operations at an organisation that has no room to fail.
Kubernetes Monitoring - Choosing Optimal Metrics for SLO Alerting

Derek Osborn
Incident, Problem and Service Level Manager
Flexera
Flexera's SLO Journey - from DIY to NOBL9

Flexera's SLO Journey - from DIY to NOBL9
I'll cover Flexera's journey from our internally developed SLO solution to partnering with Nobl9, including how we engaged teams to help develop SLOs. I'll also cover how SLOs are now part of our engineering goals for 2023.

Derek Remund
Practice Lead, Reliability Engineering
Google Cloud
Law and Order: SLO - These Are Our Stories

Derek Remund
Practice Lead, Reliability Engineering
Google Cloud
Derek Remund has held roles in game development, distributed systems engineering, data architecture, and SRE. He studied Computer Science at UIUC after a stint in the Army infantry. Derek currently leads the Reliability Engineering practice in Google Cloud Professional Services, helping Google Cloud customers build and implement their own SRE approaches.
Law and Order: SLO - These Are Our Stories

Dylan Keyer
SRE Ops
Twilio

Dylan Keyer
SRE Ops
Twilio
I used to answer 911 calls. Now I like solving problems where technical concerns most impact a business's bottom line.
SLOs at Twilio

Emily Gorcenski
Lead Data Scientist
Thoughtworks

Emily Gorcenski
Emily has over ten years of experience in scientific computing and engineering research and development. She has a background in mathematical analysis, with a focus on probability theory and numerical analysis. She is currently working in Python development, though she has a background that includes C#/.Net, Unity3D, SQL, and MATLAB. In addition, she has experience in statistics and experimental design, and has served as Principal Investigator in clinical research projects.
A "moving SLO" for machine learning

Eric Moore
Ex-chemist SRE

Confident rare SLO measurement

Frances Zhao-Perez
Senior Director of Product Management
Salesforce
Measuring What Matters: SLOs Help to Pursue Customer Happiness

Frances Zhao-Perez
Senior Director of Product Management
Salesforce
Measuring What Matters: SLOs Help to Pursue Customer Happiness

Fred Moyer
Engineering Geek

The body's Error Budget; SLOs for healthy eating

Garrett Plasky
Staff Reliability Engineer
Google Cloud
Law and Order: SLO - These Are Our Stories

Garrett Plasky
Staff Reliability Engineer
Google Cloud
Garrett Plasky is a Strategic Cloud Engineer at Google Cloud focusing on reliability and SRE. He brings a wealth of practical SRE knowledge to the table, having led the teams responsible for running Evernote’s core services and infrastructure supporting their 200M+ global user base. Under his leadership, he transformed Evernote’s traditional Operations organization into an early adopter of the canonical SRE model, embracing key principles such as SLOs, error budgets, and toil management across the entire engineering organization. Garrett has written about some of these experiences as a contributor to the Google-published SRE Workbook and in his current role helps companies both big and small successfully adopt SRE practices.
Law and Order: SLO - These Are Our Stories

George Hantzaras
Director, Cloud Engineering
Citrix
Scaling SLOs with open source tools

George Hantzaras
George is a Director of Cloud Platform Engineering at Citrix. He has been organising the Athens Cloud Computing Meetup since 2016, as well as the Athens HashiCorp User Group. His recent talks cover cloud computing, observability, SRE, and Agile practices, and he has spoken at conferences including Voxxed Days, HashiConf Global, DeveloperWeek, and more.
Scaling SLOs with open source tools

Greg Arnette
Co-founder & CPO
CloudTruth
The Hidden (Config) Tax Affecting Your Uptime SLO

The Hidden (Config) Tax Affecting Your Uptime SLO

Gwen Berry
Site Reliability Engineer
IAG
Reliability Benchmarking: A Pre-cursor to SLO Adoption

Gwen Berry
Site Reliability Engineer
IAG
Junior Site Reliability Engineer, working in an SRE enablement team at IAG.
Reliability Benchmarking: A Pre-cursor to SLO Adoption

Hezheng Yin
Co-founder & CTO / Creator
Apache DevLake
Merico
Creating and Tracking SLOs that Empower Developer Happiness and Productivity

Hezheng Yin
Hezheng is a perceptive and persistent pioneer in applying technology to make the world a better place. At Merico, he leads the engineering and research team to build innovative algorithms to help developers quantify the impact of their work. Before this, his research focused on empowering the next generation of education technology with artificial intelligence and machine learning. Hezheng got his bachelor's degree from Tsinghua University and was pursuing his Ph.D. in computer science at UC Berkeley.
Creating and Tracking SLOs that Empower Developer Happiness and Productivity

Ioannis Georgoulas
Director of SRE
Paddle.com

How you SLO your SLOs?

Jason Greenwell
SRE Leader
Ford Motor Company

Jason Greenwell
Jason is an SLO and developer experience advocate who has held a number of technical leadership positions at Ford and Ford Credit over the past 20 years. He is currently heading up SRE for Model-e's Cloud Platform, driving SLO adoption and SRE culture through the org.
SLOs and Promise Theory

Jeff Martens
CEO & Co-Founder
Metrist
Managing SLOs & SLAs when your app is built on other apps

Jeff Martens
Jeff has built observability products and developer tools for more than 12 years. The first company he founded, CPUsage, was a pioneer in the serverless computing space before AWS Lambda existed. Later he joined New Relic pre-IPO to focus on new products. There he served on the team creating the company’s high-performance event database, before leading Real User Monitoring and growing the product into the company’s 2nd largest revenue generator. Jeff then joined PagerDuty pre-IPO where he worked on designing, building, and launching a suite of business analytics products. Jeff is an alumnus of the University of Oregon and works between Portland, Oregon and the San Francisco Bay Area.
Managing SLOs & SLAs when your app is built on other apps

Jessica Kerr
Engineering Manager of Developer Relations
Honeycomb
Evolving Our Use of SLOs at Honeycomb

Jessica Kerr
Jessica Kerr is a developer of 20 years, conference speaker of 10, and ringleader of a household containing two teenagers and their cats. She works and speaks in TypeScript, Java, Clojure, Scala, Ruby, Elm etc etc. Her real love is systems thinking in symmathesy (a learning system made of learning parts). She works at Honeycomb.io because our software should be a good teammate and teach us what is going on. If you're into sociotechnical systems, find her blog and newsletter at jessitron.com.
Evolving Our Use of SLOs at Honeycomb

Jim Deville
Principal Software Engineer
Procore

Good and SLO

Joe Blubaugh
Principal Engineer
Grafana Labs

Joe Blubaugh
Joe has worked building and operating large distributed systems for over 10 years, from Google to Twitter to Grafana. Along the way he's picked up some tricks and gotten some scars. He loves automation and standard practices in automation, and anything that removes developer toil.
How we keep engineers sane with SLOs
I'll talk about Grafana's experience using SLOs to describe and monitor the critical operations of our systems. We had to do several things to make this successful:
- Promote consistent metric and objective definitions across teams.
- Create tooling to make it easy to define SLOs.
- Create tooling for an as-code workflow with SLOs.
- Create and nurture a culture that sees the value of SLOs for both the team and the company.
I'll be talking about the history at Grafana and also presenting results of interviews with some Grafana engineers about what SLOs have done for their teams and where they see room for improvement in implementations.

Kayla Annunziata
SRE Platform Development Sr Mgr
Capital One
Error Budget Signals - Identifying, Interpreting and Actioning

Kayla Annunziata
SRE Platform Development Sr Mgr
Capital One
Kayla is an Enterprise SRE Platform Development Sr Mgr at Capital One, driving adoption of SRE best practices across all Capital One applications.
Before joining FinTech, she was a Software Engineering Manager at Lockheed Martin supporting the space industry, developing reliable flight software for NASA's Orion spacecraft on the Artemis 1 mission, which broke the record for the farthest distance from Earth traveled by an Earth-returning, human-rated spacecraft by roughly 20,000 miles.
Error Budget Signals - Identifying, Interpreting and Actioning

Kyle Forster
Founder
RunWhen

SLOs with Teeth: Partnering with Product Management

Lukasz Dobek
Software Engineer
Nobl9

Lukasz Dobek
Currently, he’s developing a Service Level Objectives platform at Nobl9, helping to make a cultural shift to the Site Reliability Engineering mindset.
Non-Conway’s Game of SLOs

Matthias Loibl
Senior Software Engineer
Polar Signals

Matthias Loibl
Second Day Operations for SLOs
Once you have implemented SLOs for your organization, how do you move forward?
At Polar Signals, we have quarterly SLO reviews. We first do a retrospective, discussing where we did well and where we could have improved. For the upcoming quarter, we discuss the OKRs and derive SLOs from them. Sometimes OKRs are easy to derive from, and sometimes they need to be rephrased to make sense as SLOs.
Matthias will walk you through an example of an SLO that we implemented quarters ago and how it changed over time. The example will showcase the SLO tracked in the open-source Pyrra project which makes SLOs with Prometheus manageable, accessible, and easy to use for everyone.

Max Knee
Staff Software Engineer
The New York Times

Use SLOs to manage your day
I'm pretty bad at time management, so I was looking into ways to improve that part of my life.
I turned to a lo-fi way of using SLOs to manage my day: turning my tasks and other things I need to do during the day into SLOs.
It's increased my productivity, and I'm interested in sharing it with others!

Michael Knox
Platform SRE Team Lead
ANZx

SLOs are all around us and we don't know it

Mike Fiedler
Wrangler of the Unusual
Call off the Jam: Lessons in Setting Reasonable Objectives from a Roller Derby Referee

Mike Fiedler
Mike has been a speaker at conferences since 2012 and has been recognized for his contributions to the tech community with awards such as the Awesome Community Chef Award in 2016, and he has been an AWS Container Hero since 2018. As a true technologist, he devotes his free time to working on open source tools, learning new technologies, and volunteering as a roller derby referee. With a holistic view of systems and software and a passion for problem-solving, Mike excels in helping others navigate the complexities of the tech world.
Call off the Jam: Lessons in Setting Reasonable Objectives from a Roller Derby Referee

Natalia Sikora-Zimna
Product Manager
Nobl9

EM & PM collaboration based on SLOs

Pankaj Gupta
Senior Software Engineer
Sumo Logic

SLOs created from Monitors

Ramesh Nampelly
Senior Director of Cloud Infrastructure and Platform Engineering
Palo Alto Networks
Improve SLOs through auto remediations and external context correlation

Ramesh Nampelly
Improve SLOs through auto remediations and external context correlation
I will be focusing on how we’ve built an observability platform that includes incident analytics, auto-remediation, and secrets management, which in turn brought down MTTR and improved other SLOs for our production services.
I will cover the high-level architecture of our internal developer platform and observability platform and how they interact, and will talk through each component involved, such as Backstage.io, the Grafana stack (Grafana, Mimir, Loki, and Tempo), Vector.dev, and StackStorm.
I will conclude by showing the north-star metrics we track internally and how they helped increase platform adoption.

Ricardo Castro
Principal Site Reliability Engineer
FanDuel
Blip
Overcoming SRE Anti-Pattern Roadblocks: Lack of a Reliability Framework

Ricardo Castro
Overcoming SRE Anti-Pattern Roadblocks: Lack of a Reliability Framework
SREs, as the name implies, care about service reliability. But often they struggle to find a way to define, measure, and assess their services' reliability. In practice, they lack a Reliability Framework.
How can SLOs help? They provide an opinionated way to do just that: define, measure, and assess service reliability from the user's perspective. They provide a common language to talk about reliability and prioritize work. They help fix the anti-pattern of trying to ensure service reliability without clearly defining what it means.

Roman Khavronenko
Software Engineer
VictoriaMetrics
Retroactive evaluation of SLO objectives in VictoriaMetrics

Roman Khavronenko
Retroactive evaluation of SLO objectives in VictoriaMetrics
Recording rules are a clever concept introduced by Prometheus for storing the results of query expressions in the form of new time series. This concept is used for SLO calculations. But due to their nature, recording rules have no retroactive effect. And since an SLO objective usually covers a time window of no less than 30 days, recording rules produce incomplete results until the whole time window has been captured.
The talk will cover how this can be fixed in the VictoriaMetrics monitoring solution via retroactive rule evaluation, using the example of rules generated via the https://github.com/slok/sloth framework.
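A toy illustration of the problem (the numbers are invented, and this is not VictoriaMetrics code): if the recording rule was only created partway through the SLO window, computing the objective from the recorded series misses earlier bad days that retroactive evaluation over the raw data would include:

```python
# Toy numbers: a recording rule created on day 10 has no samples for
# days 0-9, so a 30-day objective computed from it silently ignores
# the outage on day 0 until the rule is backfilled retroactively.

raw = [0.90] + [0.999] * 29   # daily availability; an outage on day 0
rule_start = 10               # the recording rule only exists from day 10

partial = sum(raw[rule_start:]) / len(raw[rule_start:])  # what the rule sees
full = sum(raw) / len(raw)                               # retroactive result

print(round(partial, 4), round(full, 4))
```

The partial view looks healthier than reality, which is exactly why a freshly created SLO dashboard can't be trusted until the full window has elapsed or has been backfilled.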

Sal Furino
CRE
Nobl9

Sal Furino
Sal Furino is a Customer Reliability Engineer at Nobl9. During his career he's worked as a TPM, SRE, Developer, Sys Admin, and IT support. While not working he enjoys cooking, gaming, traveling, skiing, and golfing. Sal lives in Queens with his partner and has a BS in Applied Mathematics from Marist College.
Two Paths in the Woods
While the direction and intent of SRE have been established and are becoming better understood, the details of how to achieve "SRE" are still an exercise left to the reader.
Steve from Google and Sal from Nobl9 will present two independently developed methods for teaching SRE topics to customers, which we have discovered are actually quite similar. Huzzah!
Steve will present the "reliability map" and Sal will show Nobl9's SLODLC. Both are essentially sets of documentation presented in a way that allows customers to take a *guided* path through the dark and scary woods that is today's SRE.
The "reliability map" provides a detailed account of what development, infrastructure, operational, observability, and people/culture activities organizations can expect to be practicing to achieve different levels of reliability (99%, 99.9%, 99.99%, 99.999%). The "reliability map" takes it one step further and provides descriptions of, and references to additional content for, all activities.
SLODLC provides a framework of documentation and templates that allows customers to iterate on improving their SLOs. This is especially necessary when customers have high reliability targets that rely upon automation to execute predetermined playbooks (rollback, exponential backoff, autoscale, etc.). If SLOs indicate the performance of a user journey and decide when action needs to be taken, then customers need a clear and usable way to explain:
- What they are attempting to measure (golden signals?, something else?)
- Why was it decided to measure X in such a way?
- How is X impactful for the targeted user journey?
- When was the last time the SLO objective, metric, time window, etc. was changed?
- When the error budget is in danger of being breached, what actions should be taken?
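One way to picture the questions above is as a single structured, reviewable record per SLO. The sketch below is a hypothetical illustration, not the actual SLODLC template; every field name and value is invented.

```python
# Hypothetical sketch of capturing the SLODLC-style questions as one record.
from dataclasses import dataclass, field

@dataclass
class SLORecord:
    what: str              # what is being measured (e.g. a golden signal)
    why: str               # why it was decided to measure it this way
    user_journey: str      # how it impacts the targeted user journey
    objective: float       # target ratio, e.g. 0.999
    window_days: int       # rolling time window
    last_changed: str      # when objective/metric/window last changed
    budget_actions: list = field(default_factory=list)  # playbooks when budget is at risk

checkout_slo = SLORecord(
    what="availability of the checkout API (ratio of successful responses)",
    why="availability is the signal users notice first at checkout",
    user_journey="customer completes a purchase",
    objective=0.999,
    window_days=30,
    last_changed="2023-04-01",
    budget_actions=["rollback latest release", "freeze deploys", "page on-call"],
)
```

A record like this answers each question in one place, which is the kind of clarity the list above asks for.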

Sally Wahba
Principal Engineer
Splunk

Sally Wahba
When not working, you will find her doing computer science outreach activities, reviewing for technical conferences, mentoring, and learning Spanish.
Thinking about SLO from On-Prem to Cloud - A Developer's Perspective
My background has mostly been in developing operating systems for data storage companies. In this environment, almost everything is controlled internally. For example, if the SLO of the operating system is five 9s, then the error budget is usually consumed entirely by software bugs owned internally by the company. As I switched to developing SaaS products in the cloud, this changed drastically. Below are examples of lessons learned during different phases of working on a product, from development, to production, to support and operation. My goal is to share these lessons so other developers can learn from my experience.
In my previous role, the two main things I relied on while developing operating system products were the testing suite and the release cadence. Shipping a new operating system every 6 months was considered fast. This gave developers a lot of time for their code to soak and be tested internally before being released to customers. Additionally, with a slower release cadence there was a lot of effort invested in creating different layers of testing. After all, a bug fix would take months to reach customers. Even if we released a patch quickly, which in this context means a few weeks, customers would still need to update their operating systems to apply that patch, and who knows how long a customer would wait before applying it. After moving to building SaaS products in the cloud, the release cadence became much faster. This required shifting my thought process. Instead of relying on soak time and various levels of testing, activities such as code review and unit tests run as part of the build now take a front row seat. Metrics like code coverage from unit tests mattered more, while metrics like how long it had been since QA found a bug mattered less.
Another difference between my old role and new role is how developers access production and production metrics. In the old role, gathering production metrics was no easy feat.
Harder access to production metrics meant that changing SLOs internally was harder, took longer, and many developers didn't know how or why SLOs were changing. For a SaaS product running in the cloud, developers have access to overall system performance metrics at the click of a button, while maintaining compliance. This makes it easier for developers to know why and how SLOs are affected by SLAs, and also gives them faster reaction times.
From the operation perspective, the old role and the new role were quite different. In the old role developers didn't go on-call. There was a customer support organization that would handle any customer issues first. Developers were brought in occasionally if customer support needed a bug fixed. Developers didn't need to wake up in the middle of the night because there's an outage. In the new role, developers are on-call, which means developers can and do occasionally get paged in the middle of the night. This shift caught me off-guard as I had to think much harder about how my service impacts the quality of life of my colleagues and myself. It made me think about how to change and measure SLOs to eventually avoid waking someone up in the middle of the night.
Another lesson that caught me off guard is not all SLAs can be trusted when working in the cloud. Yes, I knew this lesson theoretically, but learning it in practice is a different story. One example was an outage that was caused by a cloud provider breaching their SLAs for a managed service that we relied on for our product. This resulted in our product breaching its SLAs. When something like this happens, you think hard about how to update your SLOs to prepare for these issues, catch such issues, and react to them. I found that using SLOs that are tighter than SLAs was helpful in that regard.
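The "internal SLO tighter than the external SLA" idea can be sketched in a few lines; the targets below are illustrative, not the ones from the outage described.

```python
# Minimal sketch: alert on the tighter internal objective so there is time
# to react before the external agreement is breached. Numbers are invented.
SLA_TARGET = 0.995   # external, contractual availability
SLO_TARGET = 0.999   # internal objective, deliberately tighter

def status(measured_availability):
    if measured_availability < SLA_TARGET:
        return "SLA breached"
    if measured_availability < SLO_TARGET:
        return "SLO breached, SLA intact: react now"
    return "healthy"

print(status(0.9995))  # healthy
print(status(0.997))   # SLO breached, SLA intact: react now
print(status(0.990))   # SLA breached
```

Because the internal target trips first, the team gets a window to catch and react to issues, including a dependency's SLA breach, before its own SLA is violated.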
In conclusion, even with years of professional experience, moving from developing On-Prem products to cloud SaaS offerings changed how I think about SLOs and it's truly not a one-size-fits-all.

Shubham Srivastava
Head of Developer Relations
Zenduty

Shubham Srivastava
I take pride in making mistakes, learning from them, and advocating for best practices for orgs setting up their DevOps, SRE, and Production Engineering teams.
A zealous and eternally curious professional, fascinated by stories from DevOps, incident management, and product design; hoping to solve real-world problems with the skills and technology I'm actively amused by. An orator, writer, and hopeful comedian trying his very best to do something I'm proud of every day.
Kubernetes Monitoring - Choosing Optimal Metrics for SLO Alerting

Stephan Lips
Software Engineer and SLO Advocate
Procore
Black Box SLIs
Adopting an SLO culture involves identifying the metrics that matter without drowning in noise and alert fatigue. The black box concept lets us aggregate granular metrics into SLIs that focus on the user experience as an indicator of system reliability.
The talk will be based on this article published on Procore's Engineering blog: https://careers.procore.com/blogs/engineering-at-procore/black-box-slis.
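As a rough illustration of the black-box idea (the endpoint names and counts below are invented, not Procore's), granular per-endpoint counters collapse into a single user-facing SLI:

```python
# Illustrative sketch: treat the service as a black box and compute one
# user-facing SLI across all user-visible requests, instead of alerting
# on each granular metric separately.
per_endpoint = {
    "GET /projects":  {"good": 99_900, "total": 100_000},
    "POST /drawings": {"good": 49_800, "total": 50_000},
    "GET /rfis":      {"good": 24_990, "total": 25_000},
}

good = sum(m["good"] for m in per_endpoint.values())
total = sum(m["total"] for m in per_endpoint.values())
sli = good / total           # one number that reflects the user experience
print(round(sli, 4))         # 0.9982
```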
SLOs as code
By managing Service Level Objectives as code we can co-locate SLO definitions and ownership with the product code and team. This supports horizontal scaling of SLO ownership, while establishing a single source of truth, adding transparency, integrating with the code management process, and creating an audit trail for SLOs.
The talk will be based on this article published on Procore's Engineering blog: https://engineering.procore.com/implementing-slos-as-code-a-case-study-2/
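One common shape of "SLOs as code" is a definition file living next to the product code, validated by CI. The sketch below is hedged: the field names and checks are invented for illustration and are not Procore's actual schema.

```python
# Hypothetical sketch of a CI check over an SLO spec that lives in the
# service's own repository (single source of truth, auditable via git).
def validate_slo(spec):
    errors = []
    if not 0 < spec.get("objective", 0) < 1:
        errors.append("objective must be a ratio strictly between 0 and 1")
    if spec.get("window_days", 0) < 1:
        errors.append("window_days must be a positive number of days")
    if not spec.get("owner"):
        errors.append("every SLO needs an owning team for the audit trail")
    return errors

slo_spec = {  # would normally be parsed from a file next to the product code
    "service": "document-uploads",
    "objective": 0.995,
    "window_days": 28,
    "owner": "team-uploads",
}
print(validate_slo(slo_spec))  # [] -- the spec passes the check
```

Because the spec is reviewed like any other code change, ownership scales horizontally with the teams that own the services.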

Stephen Townshend
Developer Advocate (SRE)
SquaredUp

Stephen Townshend
Our industry is full of buzzwords and exaggerations; it can be hard to know what is real and what is not. Stephen strives to take these complex technical concepts and simplify and present them in a way everyone can understand and apply (and to call out when something is too good to be true).
Stephen lives in Auckland, New Zealand and currently works as a Developer Advocate for SquaredUp, as well as promoting and improving observability and SRE practices internally in the organisation.
Reliability Benchmarking: A Pre-cursor to SLO Adoption

Stephen Weber
Staff Site Reliability Engineer
Procore

Stephen Weber
Arguments in Favor: why SLOs?
In my experience, many teams encounter SLOs as something they've been told to "do", and on the flip side, it's something I've been asked to help them with. Naturally this is not ideal: as engineers we prefer things to be self-evident.
I have a number of strategies to use when communicating the process and the value of creating and using SLOs. I'll give away the thing I say the most right here: "SLOs are for decision-making". They're not magic or even a single thing. They're a tool that helps us do our jobs.
The audience will come away either better able to articulate the pragmatic usefulness of an SLO mindset, or with one or two real motivations for developing and using SLOs for their own systems.

Steve McGhee
Reliability Advocacy Engineer
Google SRE

Two Paths in the Woods
While the direction and intent of SRE have been established and are becoming better understood, the details of how to achieve "SRE" are still an exercise left for the reader.
Steve from Google and Sal from Nobl9 will present two independently developed methods for teaching SRE topics to customers, which we have discovered are actually quite similar. Huzzah!
Steve will present the "reliability map" and Sal will show Nobl9's SLODLC. Both are essentially sets of documentation presented in a way that allows customers to take a *guided* path through the dark and scary woods that is today's SRE.
The "reliability map" provides detailed accounts of what development, infrastructure, operational, observability, and people/culture activities organizations can expect to be practicing to achieve different eras of reliability (99%, 99.9%, 99.99%, 99.999%). The "reliability map" takes it one step further and provides descriptions and references to additional content for all activities.
SLODLC provides a framework of documentation and templates that allows customers to iterate on improving their SLOs. This is especially necessary when customers have high reliability targets which rely upon automation to execute predetermined playbooks (rollback, exponential backoff, autoscale, etc.). If SLOs indicate the performance of a user journey and decide when action needs to be taken, then customers need a clear and usable way to explain:
- What they are attempting to measure (golden signals?, something else?)
- Why was it decided to measure X in such a way?
- How is X impactful for the targeted user journey?
- When was the last time the SLO objective, metric, time window, etc. was changed?
- When the error budget is in danger of being breached, what actions should be taken?

Steve Upton
Principal QA Consultant
Thoughtworks

Steve Upton
Data Product Thinking with SLOs
The talk tells the story of how conversations around SLOs can be a great trigger to start a shift to a Product Thinking mindset, with practical examples. We'll also take a light dip into constraint mapping from an SLO perspective.

Steve Xuereb
Staff Site Reliability Engineer
GitLab Inc.

So many SLOs, so many alerts

Thijs Metsch
Researcher
Intel Labs

Intent Driven Orchestration with SLOs!
With a Serverless mindset in place, there should no longer be a need to define resource requests or any other kind of information. Just develop your app or function and run it.
But how can we achieve performance and make sure we run the system most efficiently? This is where we can let users define what they truly care about: their SLOs. Based on these performance targets, our Intent Driven Orchestration Planner will do the rest. There is no need to define resource requests and limits on e.g. a Kubernetes cluster anymore. The planner will set up the system in such a way that your apps/functions behave as expected, without the need to know anything about the underlying infrastructure – less knowledge needed, fewer errors made!
This is a shift in how we do orchestration and SLO management that could be of interest to this community: away from a monitoring-and-alerting way of looking at SLOs, towards a way of using SLOs to manage the system. Furthermore, this truly allows for ease of use: rather than defining numbers and values based on domain and contextualized knowledge, we let users define what they truly care about: their SLOs!
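The inversion can be caricatured in a few lines: the user states only an intent (an SLO), and a planner chooses the resources. This is a toy sketch, not the actual Intent Driven Orchestration Planner; the capacity model and every number are invented.

```python
# Toy sketch of intent-driven orchestration: the user declares an SLO
# (a latency target and expected load) and a planner, not the user,
# picks resource assignments. Capacity model is invented for illustration.
def plan(p99_latency_target_ms, expected_rps):
    # Hypothetical model: each replica sustains 100 rps; tighter latency
    # targets get one extra replica of headroom.
    replicas = -(-expected_rps // 100)   # ceiling division
    if p99_latency_target_ms < 100:
        replicas += 1
    return {"replicas": replicas}

# The user never writes resource requests or limits, only the intent.
intent = {"p99_latency_ms": 50, "expected_rps": 250}
print(plan(intent["p99_latency_ms"], intent["expected_rps"]))  # {'replicas': 4}
```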

Toby Burress
SRE
Dropbox

What We Mean By "Mean"
There's been a lot of (really good!) discussion in the last several years about how to think about, monitor, and alert on long-tail-distributed quantities such as latency. However, I'm worried that in our desire to describe the long tail we may be too eager to abandon tools that are still useful.
In this talk I'll (re)introduce everyone's favorite summary statistic, the average. I'll talk about the difference between a sample average and a random variable's expectation, and how the two are uniquely linked by the law of large numbers. I'll also talk about how the central limit theorem allows us to treat sample averages as draws from a Gaussian distribution, irrespective of the distribution the samples come from, and then I'll talk about the exceptions. We'll finish up by looking at how the properties of expected values can give us insight into the behavior of systems, even at the tail.
Through all of this we'll be looking at examples drawn from real-world latency data, and comparing the insights gleaned from this versus other common summary statistics.
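The core law-of-large-numbers/CLT point can be demonstrated with synthetic data; the snippet below uses a lognormal stand-in for latency rather than the real-world dataset from the talk, and the parameters are invented.

```python
# Demonstration: sample averages of a long-tailed quantity concentrate
# around the expectation, even though individual samples do not.
import math
import random
import statistics

random.seed(42)

def latency_ms():
    return random.lognormvariate(3.0, 1.0)   # long-tailed synthetic "latency"

# Draw many sample averages, each over n observations.
n = 500
averages = [statistics.fmean(latency_ms() for _ in range(n)) for _ in range(200)]

# Lognormal expectation: e^(mu + sigma^2 / 2)
true_mean = math.e ** (3.0 + 0.5)
print(round(true_mean, 1))                   # 33.1
print(round(statistics.fmean(averages), 1))  # close to the expectation
print(round(statistics.stdev(averages), 1))  # narrow spread: the CLT at work
```

Individual draws from this distribution routinely exceed 100 ms, yet the averages cluster tightly near 33 ms, which is exactly the property the talk argues we shouldn't throw away.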

Troy Koss
Director, Enterprise SRE
Capital One

Troy Koss
Resiliency is only good if it's reliable
Resiliency is a critical piece of building reliable systems. It allows us to feel safe knowing failure is inevitable. After all, as noted in the OG SRE book, 100% is a terrible target for basically everything.
We spend a lot of resources adding layers of resiliency, from redundant multi-region compute stacks to backups of our backups. But how do we know this resiliency achieved our ultimate outcome: reliability for our customers? We'll discuss ways to observe your SLOs and error budgets during resiliency events.
The events we'll look at include regional failover, chaos experiments (such as latency injection), database recovery, and more! After failing a region, do we know whether our customers experienced a disturbance? When you're running a resiliency test or game day, how do you measure success?
Observing error budgets before, during, and after an event paints a picture of our customers' experience and can ultimately be part of the success criteria. It is critical that we know how our architecture and system changes unfold. What if new resiliency introduces latency that negatively impacts your customers? For example, if there's complexity introduced that makes your release engineering more convoluted, we may see a longer error budget burn while we remediate. On the flip side, the partnership of adding resiliency and observing your SLOs can also lead to improving the objectives with newly matured levels of resiliency, raising the bar for performance.
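A before/during/after view of an event can be reduced to simple error-budget arithmetic. The numbers below are invented for illustration, not measurements from an actual failover.

```python
# Illustrative sketch: how much of a 30-day error budget a resiliency
# event (e.g. a regional failover) consumed. All numbers are invented.
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60
budget = (1 - SLO) * WINDOW_MINUTES          # allowed "bad minutes" in the window

phases = {"before": 2.0, "during failover": 9.0, "after": 3.0}  # bad minutes
consumed = sum(phases.values())

print(round(budget, 1))                      # 43.2 allowed bad minutes
print(round(consumed / budget, 3))           # 0.324 of the budget spent on the event
```

If the game day's success criterion were "spend less than half the budget", this event would pass; the same arithmetic flags a failover design that quietly eats most of the month's budget.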

Vijay Samuel
Observability Architect
eBay

Vijay Samuel
Scaling SLI/SLO - Pushing Your Observability Platform To Its Limits
At eBay we use a wrapped version of Prometheus as our centralized time series database. Data residing in Prometheus is used for mission-critical functions like detecting issues on the site through anomaly detection, SLOs, or simple threshold-based alerts. To enforce SLOs across the thousands of microservices deployed inside eBay, we need to be able to aggregate commonly instrumented metrics at the application level.
Why are these aggregations necessary? Host-level metrics, when used to compute things like burn rates over a 30-day window, can become very expensive. Using a single query across all microservices to generate such aggregations has several scale limitations. Manually onboarding each microservice into such aggregations can also be tedious.
The talk discusses our commonly instrumented metrics, our metrics store architecture, the scale challenges seen with SLI/burn-rate computation, and how we came up with concepts like templated rule groups that automatically generate Prometheus rule groups to perform aggregations for every application emitting metrics.
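The templated-rule-group idea can be sketched as a template stamped out once per application, so SLI and burn-rate queries read cheap pre-aggregated series instead of raw host-level ones. This is a hedged illustration, not eBay's implementation; the metric names, labels, and group names are hypothetical.

```python
# Hedged sketch of a templated rule group: render one Prometheus
# recording-rule group per application. Metric/label names are invented.
TEMPLATE = """\
- name: slo-aggregations-{app}
  rules:
  - record: app:http_requests:rate5m
    expr: sum(rate(http_requests_total{{application="{app}"}}[5m]))
  - record: app:http_request_errors:rate5m
    expr: sum(rate(http_requests_total{{application="{app}",code=~"5.."}}[5m]))
"""

def render_rule_groups(applications):
    # One group per app: no manual onboarding, no single giant query.
    return "".join(TEMPLATE.format(app=app) for app in applications)

print(render_rule_groups(["checkout", "search"]))
```

Burn-rate queries then run over the small `app:*` series rather than every host-level sample, which is where the cost savings come from.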

Wayne Major
Cloud Reliability Engineer - OutSystems
OutSystems

Wayne Major
So a M&O, DevOps, and Data Science Engineer got locked in a room together
This sounds like the start of a terrible bar joke, and you're absolutely right, it is. Soooooooooooooo a Data Science Engineer, an M&O Engineer, and a DevOps Engineer met via a Zoom chat from completely different regions and time zones.
Our organization, OutSystems, is a low-code development platform that wanted to monitor the reliability of our customers' environments. Due to the nature of these environments, we were looking to design SLOs that fit unknown and constantly changing architectures. To further complicate matters, there were also competing technology stacks: the traditional legacy platform and the next generation. We needed an automated solution that could dynamically fit into their build pipelines to measure the reliability of our customers' environments at scale.
This is a talk on how three engineers worked through buy-in, cultural, and technical challenges to come together and use our specialties to create fully automated SLO creation factories utilizing the OpenSLO framework for our products, allowing us to scale at a moment's notice.
Weyert de Boer
Head of App Store Engineering
Tapico
Weyert de Boer
Weyert also contributes to various communities and is part of the OpenSLO team, helping define SLOs in a declarative way.
In his spare time, Weyert enjoys reading about ancient history, is a hobby paleoanthropologist, helps developers out in various communities, and is trying to get better at oil painting.
Generating SLO rules based on OpenSLO specifications
In this talk I would like to show a utility, developed by Tapico, for generating service level objective rules and configuration files for Prometheus and Alertmanager based on the OpenSLO v1 specification.

Zachary Nickens
Global Reliability Engineering
OutSystems

Project Constellations
Every organization typically identifies a "North Star" to guide its business as it navigates its business, technologies, and customer experiences. Navigation via only one celestial body, however, is inefficient and prone to environmental constraints. Navigation using multiple celestial bodies, each known and imbued with context, avoids those environmental and operational constraints.
Identifying a contextual North Star for a technology organization provides benefit, but identifying a constellation of service level objectives provides a set of far greater benefits. SLOs used as the primary navigational devices offer transformative advantages in navigating business, technology, and customer experience challenges.
Organizational Transformations through SLOs
Leading organizational change always presents challenges. Leveraging Service Level Objectives, we are transforming multiple aspects of our business and operational decision-making processes. SLOs are providing innovative technical, operational, and business accelerations across our SRE transformation, our SDLC, and the operation of both existing and emerging products.