2023 SLOconf Speakers
May 15-18, 2023

Adrian Hoban
Principal Engineer
Intel

Adrian is a principal engineer leading cloud native resource orchestration for the Network and Edge Group at Intel. He has invested 20+ years becoming an expert on Cloud Native Orchestration, Service Orchestration and Management, Automation and Cloud Native Observability, with many of those years focused on applying and enhancing cloud technologies for some of the most demanding, high-performance, deterministic networking applications. Now he leads strategy, requirements, and architecture definition for cloud native resource orchestration of distributed networking applications that span from the cloud to the edge.
In the past, Adrian created and influenced the ecosystem to adopt Enhanced Platform Awareness, a suite of platform-enabled capabilities at different layers of orchestration stacks. He was one of the contributors to the Management and Orchestration standards definition for Network Functions Virtualisation, bringing platform-aware virtualisation technology to Communications Service Providers for high-performance, interoperable NFV solutions. Adrian was a co-founder and first Technical Steering Committee lead of the Open Source Management and Orchestration (OSM) community.
Adrian is also a keen sports fan, loves outdoor sports and in particular Gaelic games, rugby and mountain biking.
Intent Driven Orchestration with SLOs!
With a Serverless mindset in place, there should no longer be a need to define resource requests or any other kinds of information. Just develop your app or function and run it.
But how can we achieve performance and make sure we run the system most efficiently? This is where we can now let users define what they truly care about: their SLOs. Based on these performance targets, our Intent Driven Orchestration Planner will do the rest. No need to define resource requests and limits on, e.g., a Kubernetes cluster anymore. The planner will set up the system in such a way that your apps/functions behave as expected, without the need to know anything about the underlying infrastructure – less knowledge needed, fewer errors made!
This is a shift in how we do orchestration and SLO management that could be of interest to this community: away from a monitoring & alerting way of looking at SLOs, towards a way of using SLOs to manage the system. Furthermore, this truly allows for ease of use for the user; rather than defining numbers and values based on domain & contextualized knowledge, we let them define what they truly care about: their SLOs!
Adriana Villela
Sr. Developer Advocate
Lightstep

Adriana is a Sr. Developer Advocate at Lightstep from Toronto, Canada, with over 20 years of experience in tech. She focuses on helping companies achieve reliability greatness through Observability, DevOps, and SRE practices. Before Lightstep, she was a Sr. Manager at Tucows. During this time, she defined technical direction in the organization, running both a Platform Engineering team and an Observability Practices team. Adriana has also worked at various large-scale enterprises, including Bank of Montreal (BMO), Ceridian, and Accenture. At BMO, she was responsible for defining and driving the bank's enterprise-wide DevOps practice, which impacted business and technology teams in multiple geographic locations across the globe.
Adriana has a widely-read technical blog on Medium (https://adri-v.medium.com), which is known for its casual and approachable tone to complex technical topics, and its high level of technical detail. She is also an OpenTelemetry contributor, HashiCorp Ambassador (https://www.credly.com/badges/551d47a7-67cb-41bb-baeb-8c90f114f03a/public_url), and co-host of the On-Call Me Maybe Podcast (https://oncallmemaybe.com).
Translating failures into SLOs
Alayshia Knighten
Manager of Onboarding Eng
Honeycomb
Alayshia Knighten is an Engineering Manager of Product Training at Honeycomb with many years of experience in the DevOps realm. Alayshia specializes in enhancing technical and team-related experiences while educating customers on their journey with and beyond observability. In her words, “Getting shit done while identifying how to accelerate at the person beyond the tooling is the real meat and potatoes.” She enjoys solving the “so, how do we solve that?” problems and meeting people from all walks of life. Her tiny hometown and Southern background inspire Alayshia. In her spare time, she enjoys hiking, grilling, painting, and making random bird calls with her father.
SLI Negotiation Tactics for Engineers
Aleksandra Dziamska
Engineering Manager
Nobl9
Aleksandra works as an Engineering Manager at Nobl9.
Her software development journey of over 10 years started with being a software engineer and moved towards team leadership and management. Throughout her career, she has strived to focus on what she feels is most important (in IT as in life): people. Translated into Engineering Manager dialect: lead the engineering team to deliver the best value to end users, effectively combining Product and Engineering priorities. She explores how SLOs can help here.
Product and Engineering Collaboration With SLOs
This talk presents a real-life scenario of using SLOs to manage product requirements and be smarter about allocating the engineering team’s focus. It will discuss an example of how SLOs helped monitor a potential problem instead of assigning an engineering team to jump into solving a complex issue, and how they helped focus on a specific area of a wider problem. In a broader sense, the talk will discuss the collaboration between the Product Manager and the Engineering Manager and how SLOs can lead to more productive conversations.
Alex Hidalgo
Principal Reliability Advocate
Nobl9
Error Budgets for Conference Planning


Alex Kudryashov
Lead software engineer
New Relic
These days I am leading a team that is developing Service Level Management in New Relic. I love solving challenges at the intersection of product and engineering, so I am creating tools for developers like myself.
Adoption of SLs in New Relic: an iterative approach
In this talk, we will share our experience in promoting the Service Level practice across a large organization with over 900 engineers. Learn about the challenges we faced, the strategies we used to encourage adoption, and the valuable lessons we learned along the way.
Whether you're looking to implement SLs in your own organization or are simply interested in how to drive adoption of new engineering practices in general, this talk is for you.
Alexandra McCoy
SRE Engineer & VMware Enthusiast
VMware
Alexandra is an SRE Engineer at VMware. She is passionate about the Cloud Native, Open Source, and Reliability Engineering communities. Although VMware is home, she was introduced to the Cloud while at IBM - Public Sector and then transitioned into IBM Cloud. She later gained additional hybrid cloud experience at Diamanti, focusing on E2E product support for their Kubernetes-based appliance. She is excited about the industry's direction and hopes to contribute in a way that only helps improve the cloud space.
Reliability Enablement: Achieving Reliability with SLOs
Ana Margarita Medina
Staff Developer Advocate
Lightstep

Ana Margarita is a Staff Developer Advocate at Lightstep and focuses on helping companies be more reliable by leveraging Observability and Incident Response practices. Before Lightstep, she was a Senior Chaos Engineer at Gremlin and helped companies avoid outages by running proactive chaos engineering experiments. She has also worked at various-sized companies including Google, Uber, SFEFCU, and Miami-based startups. Ana is an internationally recognized speaker and has presented at AWS re:Invent, KubeCon, DockerCon, DevOpsDays, AllDayDevOps, Write/Speak/Code, and many others.
Catch her tweeting at @Ana_M_Medina about traveling, diversity in tech, and mental health.
Translating failures into SLOs
Anais Dotis-Georgiou
Lead Developer Advocate
InfluxData
Harnessing the Power of time series databases and OpenTelemetry to Uphold SLO
In this talk, we explore the synergy between time series databases and OpenTelemetry standards, which together help organizations maintain their Service Level Objectives (SLOs) with precision and ease. Additionally, we will highlight the benefits of columnar data storage for high-performance queries and for storing logs, traces, and metrics.
Andrew Clay Shafer
Principal
Ergonautic

Andrew Clay Shafer evangelized DevOps tools and practices before DevOps was a word, then fell in love with SLOs in theory and practice. Living at the intersection of Open Source and Cloud Computing across two decades, they gained experience in every role in software delivery, from support and QA to product and development. Andrew now focuses on engineering operable, resilient socio-technical systems and communities as a founder of Ergonautic.
Systems of Work: Socio-Technical SLOs
Service Level Objectives are often perceived as "by SRE, for SRE," which limits the impact we can have on improving our systems because enforcing SLOs often collides with other priorities. Can we help others in the organization understand the value of improving the system? Can we apply SLOs to qualities of the system which other people already care about? This is a speed-run introduction to SLOs as a commitment to improve, using metrics on the work people do to build a coalition that will care about system reliability.
Andrew Howden
SRE Engineering Manager
Zalando
Driving engineering priorities with service level objectives on critical business operations
I will talk through the details of how SLOs at Zalando have evolved from the initial implementation ("SLOs for everything!") to the challenge of ensuring SLOs have the organizational power to drive changes in engineering priorities, to the current design of "critical business operations" and SLOs on those operations.
I'll discuss how to address the "fast burn" SLO problem by leveraging distributed tracing to identify regressions in the customer experience automatically. When those regressions are identified, we automatically identify and page the team best empowered to address them.
I'll discuss how to address the "slow burn" SLO problem through periodic operational review meetings, in which the SLOs are evaluated and violations of the SLO (or slow burn issues) are allocated to an owner to investigate and address.
Lastly, I'll talk about challenges with the existing approach, including the difficulty of modelling event systems as a reliable flow, difficulty in rolling out more SLOs for non-customer-facing aspects of the organization and returning to service-specific SLOs.
Andrew Newdigate
Distinguished Engineer
GitLab Inc.
Andrew is a seasoned engineer with over two decades of experience in software development and reliability engineering. As a Distinguished Engineer at GitLab, he is responsible for the reliability and availability of GitLab's SaaS properties: GitLab.com and GitLab Dedicated. He is a strong advocate for using SLOs, error budgets, and observability data to drive change and manage technical debt. Previously, Andrew co-founded the developer community site Gitter in 2012, where he served as CTO until its acquisition by GitLab in 2017.
Tamland: How GitLab.com uses long-term monitoring data for capacity forecasting
For any large scale production system, the ability to effectively forecast potential capacity issues is crucial for the smooth functioning of the environment. With a reliable prediction, teams can proactively plan ahead, implement necessary scaling changes in a controlled manner and avoid unexpected availability issues that can cause stress and harm to the system.
Before implementing Tamland, the capacity planning process at GitLab.com was ad-hoc, and relied heavily on manual processes and intuition. Unfortunately, this approach often resulted in oversights, with issues going unnoticed until it was too late, sometimes only surfacing when site availability was impacted.
This talk delves into how GitLab leveraged the power of statistical analysis to greatly improve its capacity planning process. The session will be a practical demonstration of how we analyse long-term metrics data using Meta’s Prophet library to build sophisticated forecast models.
Tamland, the capacity planning tool built by GitLab, is an open-source project, and attendees will have access to the source code if they're interested in exploring the implementation in greater detail. This session is for anyone interested in learning about forecasting libraries such as Prophet, Greykite, or NeuralProphet, and how they can be integrated into an observability system to provide greater insight into the health of a system.
Ashley Chen
Software Engineer
Datadog
Ashley is a software engineer on the SLO team at Datadog. When she’s not working, she enjoys mentoring future engineers at Emergent Works and exploring the transit history of New York City.
How I learned to stop worrying and love burn rates
Part of building the infrastructure for SLOs at Datadog includes putting SLOs into practice. As an engineering team, we have seen the direct impact of utilizing burn rate alerts over traditional threshold alerts. Our story starts with understanding the purpose of our alerts. Though these monitors have well defined runbooks and technical implications, they do not fully capture the impact of these errors on our users. In this talk, I will discuss the process we took to replace some of our threshold alerts with burn rate alerts and how we were able to quantify the urgency of service degradation by alerting at different burn rates. This transition has driven the balance of reliability and development work for the team, which has led to more reliable services and better nights of sleep.
We will then tell the story of our implementation of burn rate alerts, deciding which ones to use and comparing them to threshold alerts. We discovered that they were more reliable and triggered less often. For example, we've seen our burn rate alerts trigger when dependencies were failing, whereas that didn't happen with threshold alerts. Burn rate alerts ended up reducing our alert fatigue and late-night pages because they are more reliable, and they built the team's trust in our systems.
We learned that paging at high burn rates captures when human intervention is needed to resolve customer impact. In contrast, low burn rates help us anticipate short-term impact. More discussion can come around by looking at these alerts in reviews and retros. We can change the way that we maintain the reliability process of our team and also in return actually see the number of pages decrease and see the service become more reliable.
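As a rough illustration of the burn rate idea in this abstract (a sketch under stated assumptions, not Datadog's implementation), the burn rate can be read as the observed error rate divided by the error rate the SLO allows. The function names and the default thresholds below are hypothetical; the thresholds are example values chosen for illustration, not numbers taken from the talk.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A burn rate of 1.0 means the budget would be used up exactly at the end
    of the SLO window; higher values mean it would run out sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_budget = 1.0 - slo_target  # allowed fraction of bad events
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget


def classify(rate: float, page_at: float = 14.4, ticket_at: float = 1.0) -> str:
    """Map a burn rate to an action. Thresholds are assumed example values."""
    if rate >= page_at:
        return "page"    # high burn rate: human intervention needed now
    if rate >= ticket_at:
        return "ticket"  # low burn rate: review before the budget runs out
    return "ok"
```

With a 99.9% target, 20 bad events out of 1,000 burn the budget about 20 times faster than allowed, so `classify` returns "page" under these assumed thresholds, while 5 bad events out of 1,000 fall into the "ticket" range.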
Christian Long
Senior Software Engineer & Skeeball Champion
3M
Christian has been rolling skeeball competitively in Brooklyn, NY for 11 years and nationally for 7 years. He started out as a straightforward 40-roller, as is both the conventional recommended starting approach and a widely regarded standard for high level competition. Eventually he started dabbling in hybrid rolling and came to develop a fine-tuned highly tactical strategy that minimizes risk and has virtually no ceiling, enabling him to compete with the best rollers in the world.
SLOs & the Game of Skeeball
Dan Venkitachalam
Software Reliability Engineer
Atlassian
Dan is a veteran software engineer and technical manager with over 20 years of experience. He currently works on the Tome team at Atlassian, helping internal organisations to define and achieve their operational goals with SLOs.
Terraforming SLOs (SLO automation at Atlassian)
Tome is our internal platform for managing, reporting and alerting on SLOs. A design goal was to enable SLOs to be defined with Configuration as Code. This has become the primary way that SLOs are maintained within Atlassian. Working with SLOs this way has many benefits:
- Enforces consistency in how we organise, define and validate SLOs
- Changes are tracked and attributed through a version control system
- Updates can be deployed as part of existing continuous integration and delivery pipelines
We have written a custom Terraform plugin for provisioning SLOs, which interfaces with Tome's backend API. The plugin:
- Optimizes deployments by tracking deployment state and applying diffs only
- Performs custom validation on configurations
Daniel Golant
Senior Software Engineer

Seeing Like A State: SLOs From The C-Suite
David Bartok
Software Engineer
Meta

David is a Software Engineer at Meta. He is currently working in the Monitoring space, primarily focused on SLICK, the company’s SLO tracking platform. Previously, he was part of the GraphQL team at Meta, where he optimized cache performance and efficiency. Before joining Meta, David worked at Bloomberg as a full-stack developer in the Mobile Market Data team.
SLICK: SLO Reviews at Meta
SLICK is our reliability tracking platform at Meta, pioneering an SLO-focused culture across the company. While we have been very successful in onboarding teams to SLICK, we started to notice that a significant number of teams got only limited value out of their SLOs after the initial onboarding.
In order for SLOs to be useful, the whole team needs to adopt them, use them regularly and retrospect on them frequently. As an initial attempt to help socialize SLOs, we built various integrations into SLICK. This includes periodic reports in our internal work groups to increase the visibility of SLOs, and collaborative data annotations to enable retrospecting on the root causes of SLO violations.
Bringing all of the above together, we will present our brand new SLO review tooling. This provides a structured workflow to have meaningful discussions about SLOs and identify follow-ups, enabling teams to get the most value out of SLOs.
Deepak Kumar
Senior Cloud Infrastructure and DevOps
Zenduty
I'm a Senior Cloud Infrastructure and DevOps Engineer at Zenduty, an incident management and response orchestration platform, trying my best to make sure that every service and application at our org is secure, reliable and accessible 24/7. I have experience working with, and am passionate about, cloud services, orchestration engines, enterprise networking, observability platforms and figuring out how they work best together. I'm looking to talk about my experiences and how we manage mission-critical operations at an organisation that has no room to fail.
Kubernetes Monitoring - Choosing Optimal Metrics for SLO Alerting
Derek Osborn
Incident, Problem and Service Level Manager
Flexera
Flexera's SLO Journey - from DIY to NOBL9

I'll cover Flexera's journey from our internally developed SLO solution to partnering with NOBL9, and also include how we engaged teams to help develop SLOs. I'll also cover how SLOs are now part of our engineering goals for 2023.
Devin Cunningham
Software Engineer
Procore

SLOs as code
By managing Service Level Objectives as code we can co-locate SLO definitions and ownership with the product code and team. This supports horizontal scaling of SLO ownership, while establishing a single source of truth, adding transparency, integrating with the code management process, and creating an audit trail for SLOs.
The talk will be based on this article published on Procore's Engineering blog: https://engineering.procore.com/implementing-slos-as-code-a-case-study-2/
Emily Gorcenski
Lead Data Scientist
Thoughtworks

Emily has over ten years of experience in scientific computing and engineering research and development. She has a background in mathematical analysis, with a focus on probability theory and numerical analysis. She is currently working in Python development, though she has a background that includes C#/.NET, Unity3D, SQL, and MATLAB. In addition, she has experience in statistics and experimental design, and has served as Principal Investigator in clinical research projects.
A "moving SLO" for machine learning
Eric Moore
Ex-chemist SRE

Confident Rare SLOs
Frances Zhao-Perez
Senior Director of Product Management
Salesforce
Measuring What Matters: SLOs Help to Pursue Customer Happiness
Fred Moyer
Engineering Geek

The Body's Error Budget; SLOs for healthy eating
Greg Arnette
Co-founder & CPO
CloudTruth
The Hidden (Config) Tax Affecting Your Uptime SLO

Gwen Berry
Site Reliability Engineer
IAG
Gwen is a Junior Site Reliability Engineer working in an SRE enablement team at IAG.
Reliability Benchmarking: A Pre-cursor to SLO Adoption
Hazel Weakly
Infrastructure Team Lead
Motivating SLOs Mathematically
Have you ever wondered if there's something behind the experiential knowledge that we hold as best practices? I've noticed that things that "feel" right can often be connected together, and the connection between SLOs and observability feels right, like there's something deeper underneath.
So, I've been digging into relationships between SLOs, observability, known knowns, unknown unknowns, cardinality, entropy, and more. There are still a lot of details to work out, but what I present in this talk is a rough overview of where I'm at so far when it comes to motivating SLOs from a more interconnected perspective.
Hezheng Yin
Co-founder & CTO / Creator
Apache DevLake
Merico
Hezheng is a perceptive and persistent pioneer in applying technology to make the world a better place. At Merico, he leads the engineering and research team to build innovative algorithms to help developers quantify the impact of their work. Before this, his research focused on empowering the next generation of education technology with artificial intelligence and machine learning. Hezheng got his bachelor's degree from Tsinghua University and was pursuing his Ph.D. in computer science at UC Berkeley.
Creating and Tracking SLOs that Empower Developer Happiness and Productivity
Imaya Kumar Jagannathan
Principal Solution Architect
AWS
Why are SLOs important? - SLOs in the world of efficiency
Ioannis Georgoulas
Director of SRE
Paddle.com

How you SLO your SLOs?
Jason Greenwell
SRE Leader
Ford - Model e

Jason is an SLO and developer experience advocate who has held a number of technical leadership positions at Ford and Ford Credit over the past 20 years. He is currently heading up SRE for Model e's Cloud Platform, driving SLO adoption and SRE culture through the org.
Promise Theory and SLOs
Jeff Martens
CEO & Co-Founder
Metrist
Jeff has built observability products and developer tools for more than 12 years. The first company he founded, CPUsage, was a pioneer in the serverless computing space before AWS Lambda existed. Later he joined New Relic pre-IPO to focus on new products. There he served on the team creating the company’s high-performance event database, before leading Real User Monitoring and growing the product into the company’s 2nd largest revenue generator. Jeff then joined PagerDuty pre-IPO where he worked on designing, building, and launching a suite of business analytics products. Jeff is an alumnus of the University of Oregon and works between Portland, Oregon and the San Francisco Bay Area.
Managing SLOs & SLAs when your app is built on other apps
Jessica Kerr
Engineering Manager of Developer Relations
Honeycomb
Jessica Kerr is a developer of 20 years, conference speaker of 10, and ringleader of a household containing two teenagers and their cats. She works and speaks in TypeScript, Java, Clojure, Scala, Ruby, Elm etc etc. Her real love is systems thinking in symmathesy (a learning system made of learning parts). She works at Honeycomb.io because our software should be a good teammate and teach us what is going on. If you're into sociotechnical systems, find her blog and newsletter at jessitron.com.
Evolving SLOs at Honeycomb
Justin Hoang
Software Engineer
Procore
SLOs as code
By managing Service Level Objectives as code we can co-locate SLO definitions and ownership with the product code and team. This supports horizontal scaling of SLO ownership, while establishing a single source of truth, adding transparency, integrating with the code management process, and creating an audit trail for SLOs.
The talk will be based on this article published on Procore's Engineering blog: https://engineering.procore.com/implementing-slos-as-code-a-case-study-2/
Kayla Annunziata
SRE Platform Development Sr Mgr
Capital One
Kayla is an Enterprise SRE Platform Development Sr Mgr at Capital One, where she drives adoption of SRE best practices across all Capital One applications.
Before joining FinTech, she was a Software Engineering Manager at Lockheed Martin supporting the space industry, developing reliable flight software products for NASA’s Orion Spacecraft Artemis 1 mission, which successfully broke the record for the farthest distance from Earth traveled by an Earth-returning human-rated spacecraft by ~20,000 miles.
Interpreting Error Budget Signals
Keri Melich
Site Reliability Engineer
Nobl9
Keri is an SRE working to help scale and secure the Nobl9 platform. Before that, she worked in DevOps, building secure and scalable solutions for internal users. She is also passionate about building a safe and diverse workplace and spends her free time hiking, woodworking, and 3D printing.
Zero to SLO Hero: Part 1
"Zero to SLO Hero" is a two-part talk that will guide you through the journey of implementing Service Level Objectives (SLOs) in your organization. Part 1 of the talk will cover the basics of SLOs, including what they are, why they are important, and how they can help you measure and improve the reliability of your services. In Part 2, we'll dive deeper into the practicalities of implementing SLOs in your organization with examples in Prometheus. We'll also cover some common pitfalls to avoid when implementing SLOs, and how to overcome them. By the end of this two-part talk, you'll have a solid understanding of what SLOs are, why they matter, and how to implement them in your organization. You'll be well on your way to becoming an SLO hero and improving the reliability of your services!
Zero to SLO Hero: Part 2
Kyle Forster
Founder
RunWhen

SLOs with Teeth: Partnering with Product Management
Lukasz Dobek
Software Engineer
Nobl9

Lukasz is currently developing the Service Level Objectives platform at Nobl9, helping to make a cultural shift to the Site Reliability Engineering mindset.
Non-Conway’s Game of SLOs
Marcus Merell
VP of Technology Strategy
Sauce Labs
Functional Testing & SLOs - Together at Last!
SLOs govern org-wide expectations for how your software runs in production, but how do you know it's actually working at a functional level? Configuration, user analytics, and data flows are all highly engineered code, but they generally aren't treated as such. So how do you incorporate testing in the modern world of SRE?
Join Marcus for a quick story about how testing can raise early warnings about SLOs that might be slipping, and how to preserve your error budget for only the highest-risk concerns.
Matthias Loibl
Senior Software Engineer
Polar Signals

Second Day Operations for SLOs
Once you have implemented SLOs for your organization, how do you move forward?
At Polar Signals, we have quarterly SLO reviews. We start with a retrospective, discussing where we did great and where we could have improved. For the upcoming quarter, we discuss the OKRs and from those, we derive SLOs. Sometimes SLOs are easy to derive from the OKRs, and sometimes the OKRs need to be rephrased to make sense as SLOs.
Matthias will walk you through an example of an SLO that we implemented several quarters ago and how it changed over time. The example will showcase the SLO tracked in the open-source Pyrra project, which makes SLOs with Prometheus manageable, accessible, and easy to use for everyone.
Max Knee
Staff Software Engineer
The New York Times
Use SLOs to manage your day
I'm pretty bad at time management, so I was looking into ways to improve that part of my life.
I turned to a lo-fi way to use SLOs to manage my day by turning my tasks and other things I need to do during the day into SLOs.
It has increased my productivity, and I'm interested in sharing it with others!
Engaging with your Customers in your SLO Journey
You've already sold your organization on SLOs; now it's time to sell them to your customers. But instead of pitching, this is a collaborative exercise to ensure that they understand your system and you understand their needs.
Measuring and having SLIs on latency could help, but what if your customers care about correctness?
In this talk, we'll discuss how to proactively meet your customers' needs: because your SLOs mirror their expectations, customers will less often need to ask whether there's an issue.
With this approach, you can reduce alert fatigue and improve the customer experience by making them happier and increasing trust in your system.
Michael Knox
Platform SRE Team Lead
ANZx

Type 1 Diabetic management and SLOs
Natalia Sikora
Product Manager
Nobl9

Natalia is a Product Manager at Nobl9. She enjoys collaborating with cross-functional teams to solve complex problems for customers. Before joining Nobl9 in the noble pursuit of reliable software, she spent 10 years developing, publishing, and managing various products for one of the world’s largest educational companies. Outside of work, you can find her hiking in the mountains, working on another art piece at a printing workshop, or playing video games.
Product and Engineering Collaboration With SLOs
This talk presents a real-life scenario of using SLOs to manage product requirements and be smarter about allocating the engineering team’s focus. It will discuss an example of how SLOs helped monitor a potential problem instead of assigning an engineering team to jump into solving a complex issue. The SLOs also helped focus on a specific area of a wider problem. In a broader sense, the talk will discuss the collaboration between the Product Manager and the Engineering Manager and how SLOs can lead to more productive conversations.
Neil Pagaduan
Manager of Technology & Engineering
Cox Edge
Applying Service Level Objectives (SLO) to Edge Networks
In today's digital world, it's important to ensure that our services are reliable, performant, and meet the needs of our users. In this video, we'll dive into the world of SLOs and explore how we can apply them to edge networks.
Neil Pagaduan, Manager of Technology and Engineering at Cox Edge, will begin by discussing the importance of SLOs in edge networks, and how they can help to deliver a better user experience. Then, we'll explore the steps involved in achieving an SLO, from defining objectives to measuring performance. Finally, Neil will discuss how to apply SLOs to an edge network, including some key considerations to keep in mind.
Ricardo Castro
Principal Site Reliability Engineer
FanDuel
Blip
Overcoming SRE Anti-Pattern Roadblocks: Lack of a Reliability Framework
SREs, as the name implies, care about service reliability. But, often, they struggle with having a way to define, measure and assess their services' reliability. In practice, they lack a Reliability Framework.
How can SLOs help? They provide an opinionated way to do just that: define, measure and assess service reliability from the user's perspective. They provide a common language to talk about reliability and prioritize work. They help fix the anti-pattern of trying to ensure service reliability without clearly defining what it means.
Roman Khavronenko
Software Engineer
VictoriaMetrics
Retroactive evaluation of SLO objectives in VictoriaMetrics
Recording rules are a clever concept introduced by Prometheus for storing the results of query expressions as new time series. This concept is used for SLO calculations. But by their nature, recording rules have no retroactive effect. And since an SLO objective usually covers a time window of no less than 30d, recording rules produce incomplete results until the whole time window has been captured.
The talk will cover how this can be fixed in the VictoriaMetrics monitoring solution via retroactive rule evaluation, using rules generated by the https://github.com/slok/sloth framework as an example.
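To illustrate the gap described above, here is a minimal Python sketch (it is not VictoriaMetrics or Prometheus code) that computes how much of a trailing 30d window a recorded series actually covers. The function name `window_coverage`, the example dates, and the `backfilled_from` parameter are assumptions made for illustration; the backfill date simply stands in for any form of retroactive rule evaluation.

```python
from datetime import datetime, timedelta
from typing import Optional

WINDOW = timedelta(days=30)  # SLO time window mentioned in the abstract


def window_coverage(rule_created: datetime, now: datetime,
                    backfilled_from: Optional[datetime] = None) -> float:
    """Return the fraction of the trailing WINDOW that has recorded data."""
    # Without retroactive evaluation, data exists only from rule creation on.
    if backfilled_from is None:
        data_start = rule_created
    else:
        data_start = min(rule_created, backfilled_from)
    window_start = now - WINDOW
    covered = now - max(data_start, window_start)
    covered = max(covered, timedelta(0))  # rule created after "now": no data
    return covered / WINDOW


# Hypothetical dates: the rule is 10 days old when the SLO is evaluated.
created = datetime(2023, 1, 1)
today = datetime(2023, 1, 11)
partial = window_coverage(created, today)
full = window_coverage(created, today, backfilled_from=datetime(2022, 12, 1))
print(f"without backfill: {partial:.0%}, with backfill: {full:.0%}")
```

In this toy model, a freshly created rule only reaches full coverage once it has been running for the whole window, which is the incompleteness the abstract refers to.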
Sal Furino
CRE

Sal Furino is a Customer Reliability Engineer. During his career he's worked as a TPM, SRE, Developer, Sys Admin, and IT support. While not working he enjoys cooking, gaming, traveling, skiing, and golfing. Sal lives in Queens with his partner and has a BS in Applied Mathematics from Marist College.
Two Paths in the Woods
While the direction and intent of SRE have been established and are becoming better understood, the details on how to achieve "SRE" are still an exercise left for the reader.
Steve from Google and Sal from Nobl9 will present two independently developed methods for teaching SRE topics to customers, which we have discovered are actually quite similar. Huzzah!
Steve will present the "reliability map" and Sal will show Nobl9's SLODLC. Both are essentially sets of documentation presented in a way that allows customers to take a *guided* path through the dark and scary woods that is today's SRE.
The "reliability map" provides a detailed account of what development, infrastructure, operational, observability, and people/culture activities organizations can expect to be practicing to achieve different eras of reliability (99%, 99.9%, 99.99%, 99.999%). The "reliability map" takes it one step further and provides descriptions and references to additional content for all activities.
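To make the reliability targets above concrete, the following Python sketch converts each one into the downtime it permits. It assumes the targets are read as time-based availability over a 30-day window; that interpretation, the window length, and the function name `allowed_downtime` are assumptions for illustration and do not come from the "reliability map" itself.

```python
from datetime import timedelta

# Availability targets named in the text, in percent.
TARGETS = [99.0, 99.9, 99.99, 99.999]


def allowed_downtime(target_percent: float,
                     window: timedelta = timedelta(days=30)) -> timedelta:
    """Downtime permitted by an availability target over the given window."""
    return window * (1 - target_percent / 100)


for target in TARGETS:
    print(f"{target}% -> {allowed_downtime(target)} per 30 days")
```

Each additional nine divides the permitted downtime by ten, which is one way to see why the practices expected at each level differ so much.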
SLODLC provides a framework of documentation and templates to allow customers to iterate on improving their SLOs. This is especially necessary when customers have high reliability targets that rely upon automation to execute predetermined playbooks (rollback, exponential backoff, autoscale, etc.). If SLOs are the indicator of the performance of a user journey and decide when actions need to be taken, then customers need a clear and usable way to explain:
- What they are attempting to measure (golden signals? something else?)
- Why was it decided to measure X in such a way?
- How is X impactful for the targeted user journey?
- When was the last time the SLO objective, metric, time window, etc. was changed?
- When the error budget is in danger of being breached, what actions should be taken?
Sally Wahba
Principal Engineer
Splunk
When not working you will find her doing computer science outreach activities, reviewing for technical conferences, mentoring, and learning Spanish.
Thinking about SLO from On-Prem to Cloud - A Developer's Perspective
My background has mostly been in developing operating systems for data storage companies. In this environment, almost everything is controlled internally. For example, if the SLO of the operating system is five 9s, then the error budget is usually all consumed by software bugs owned internally by the company. As I switched to developing SaaS products in the cloud, this has drastically changed. Below are examples of lessons learned during different phases of working on a product, from development, to production, to support and operation. My goal is sharing these lessons so other developers can learn from my experience.
In my previous role, two main things I relied on while developing operating system products were the suite of testing as well as the release cadence. Shipping a new operating system every 6 months was considered fast. This gave developers a lot of time for their code to soak and be tested internally before being released to customers. Additionally, with a slower release cadence there was a lot of effort invested in creating different layers of testing. After all, a bug fix would take months to reach customers. Even if we released a patch quickly, which in this context means a few weeks, customers would still need to update their operating systems to apply that patch, and who knows how long a customer will wait before applying that patch. After moving to building SaaS products in the cloud, the release cadence became much faster. This required shifting my thought process. Instead of relying on soak time and various levels of testing, activities such as code review and in-build unit tests now take a front row seat. Metrics like code coverage from unit tests mattered more, while metrics like how long it's been since QA found a bug mattered less.
Another difference between my old role and new role is how developers access production and production metrics. In the old role, gathering production metrics was no easy feat. Harder access to production metrics meant that changing SLOs internally was harder and took longer, and many developers didn't know how or why SLOs were changing. For a SaaS product running in the cloud, developers have access to overall system performance metrics at the click of a button, while maintaining compliance. This makes it easier for developers to know why and how SLOs are affected by SLAs, and also gives them faster reaction times.
From the operation perspective, the old role and the new role were quite different. In the old role developers didn't go on-call. There was a customer support organization that would handle any customer issues first. Developers were brought in occasionally if customer support needed a bug fixed. Developers didn't need to wake up in the middle of the night because there's an outage. In the new role, developers are on-call, which means developers can and do occasionally get paged in the middle of the night. This shift caught me off-guard as I had to think much harder about how my service impacts the quality of life of my colleagues and myself. It made me think about how to change and measure SLOs to eventually avoid waking someone up in the middle of the night.
Another lesson that caught me off guard is that not all SLAs can be trusted when working in the cloud. Yes, I knew this lesson theoretically, but learning it in practice is a different story. One example was an outage that was caused by a cloud provider breaching their SLAs for a managed service that we relied on for our product. This resulted in our product breaching its SLAs. When something like this happens, you think hard about how to update your SLOs to prepare for these issues, catch such issues, and react to them. I found that using SLOs that are tighter than SLAs was helpful in that regard.
In conclusion, even with years of professional experience, moving from developing On-Prem products to cloud SaaS offerings changed how I think about SLOs, and it's truly not one-size-fits-all.
Sandeep Chatra Raveesh
Observability Lead
eBay
Scaling SLI/SLO - Pushing Your Observability Platform To Its Limits
At eBay we use a wrapped version of Prometheus as our centralized time series database. Data residing in Prometheus is used for mission-critical functions like detecting issues on the site through anomaly detection, SLOs or simple threshold-based alerts. To be able to enforce SLOs across the thousands of microservices that are deployed inside of eBay, we need to be able to aggregate commonly instrumented metrics at an application level.
Why are these aggregations necessary? Host-level metrics, when used for computation of things like burn rates over a 30-day window, can become very expensive. Using a single query across all microservices to generate such aggregations has several scale limitations. Manually onboarding each microservice into such aggregations can also be tedious.
The talk discusses our commonly instrumented metrics, our metrics store architecture, scale challenges seen with SLI/burn rate computation, and how we came up with concepts like templated rule groups that automatically generate Prometheus rule groups to be able to do aggregations for every application that is emitting metrics.
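As a rough illustration of the templated rule group idea described above, the Python sketch below renders one Prometheus-style rule group per application from a shared template. It is not eBay's implementation: the metric name `http_requests_total`, the `application` and `code` labels, the recorded series names, and the helper functions are all assumptions. The output follows the general shape of a Prometheus rule file (`groups`, `name`, `rules`, `record`, `expr`, `labels`), and serialising it to YAML is left out.

```python
from string import Template
from typing import Dict, Iterable, List

# Hypothetical templates; metric and label names are assumptions.
RULE_TEMPLATES = [
    {
        "record": "application:http_requests:rate5m",
        "expr": Template('sum(rate(http_requests_total{application="$app"}[5m]))'),
    },
    {
        "record": "application:http_errors:rate5m",
        "expr": Template(
            'sum(rate(http_requests_total{application="$app",code=~"5.."}[5m]))'
        ),
    },
]


def render_rule_group(app: str) -> Dict:
    """Render the shared templates into one rule group for a single application."""
    rules: List[Dict] = [
        {
            "record": tmpl["record"],
            "expr": tmpl["expr"].substitute(app=app),
            "labels": {"application": app},
        }
        for tmpl in RULE_TEMPLATES
    ]
    return {"name": f"slo-aggregations-{app}", "rules": rules}


def render_rule_file(apps: Iterable[str]) -> Dict:
    """Build the contents of a rule file covering every application given."""
    return {"groups": [render_rule_group(app) for app in sorted(apps)]}


rule_file = render_rule_file(["search", "checkout"])
for group in rule_file["groups"]:
    print(group["name"], [rule["record"] for rule in group["rules"]])
```

Generating the groups from a list of applications means onboarding a new service only requires adding it to that list, rather than writing its aggregation rules by hand.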
Sasha Rosenbaum
Principal
Ergonautic
Sasha is Principal at a new venture, Ergonautic.
With a degree in Computer Science, an MBA, and two decades of experience across development, operations, product management, and technical sales, Sasha Rosenbaum brings a unique perspective to optimizing the organizational flow of work, bridging gaps with empathy and insight.
SLO Prompt Engineering: Aligning Humans for Better Outcomes
Sergey Sidorov
Software Engineer on SLO Monitoring (SLICK)
Meta
Software Engineer with a track record of building and shipping complex software, with a primary focus on infrastructure and advanced backend systems. My experience includes working on high-throughput messaging infrastructure, trade execution engines, and large-scale monitoring and observability systems.
SLICK: SLO Reviews at Meta
SLICK is our reliability tracking platform at Meta, pioneering an SLO-focused culture across the company. While we have been very successful in onboarding teams to SLICK, we started to notice that a significant number of teams got only limited value out of their SLOs after the initial onboarding.
In order for SLOs to be useful, the whole team needs to adopt them, use them regularly and retrospect on them frequently. As an initial attempt to help socialize SLOs, we built various integrations into SLICK. This includes periodic reports in our internal work groups to increase the visibility of SLOs, and collaborative data annotations to enable retrospecting on the root causes of SLO violations.
Bringing all of the above together, we will present our brand new SLO review tooling. This provides a structured workflow to have meaningful discussions about SLOs and identify follow-ups, enabling teams to get the most value out of SLOs.
Shubham Srivastava
Head of Developer Relations
Zenduty
I take pride in making mistakes, learning from them, and advocating for best practices for orgs setting up their DevOps, SRE, and Production Engineering teams.
A zealous and eternally curious professional, fascinated by stories from DevOps, Incident Management, and Product Design, hoping to solve real-world problems with the skills and technology I'm actively amused by. An orator, writer, and hopeful comedian trying my very best to do something I'm proud of every day.
Kubernetes Monitoring - Choosing Optimal Metrics for SLO Alerting
Stephan Lips
Software Engineer and SLO Advocate
Procore
Black Box SLIs
Adopting an SLO culture involves identifying the metrics that matter without drowning in noise and alert fatigue. The black box concept lets us aggregate granular metrics into SLIs that focus on the user experience as an indicator of system reliability.
The talk will be based on this article published on Procore's Engineering blog: https://careers.procore.com/blogs/engineering-at-procore/black-box-slis.
SLOs as code
By managing Service Level Objectives as code we can co-locate SLO definitions and ownership with the product code and team. This supports horizontal scaling of SLO ownership, while establishing a single source of truth, adding transparency, integrating with the code management process, and creating an audit trail for SLOs.
The talk will be based on this article published on Procore's Engineering blog: https://engineering.procore.com/implementing-slos-as-code-a-case-study-2/
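As a minimal sketch of what an SLO kept alongside product code might look like, the Python example below defines an SLO as a typed, validated object. The schema (`name`, `owner`, `sli`, `objective`, `window_days`), the class name `SLODefinition`, and the example values are hypothetical; they are not taken from Procore's article or from any SLO specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SLODefinition:
    """Hypothetical SLO schema that lives in the owning team's repository."""
    name: str
    owner: str          # team accountable for the SLO
    sli: str            # human-readable description of the indicator
    objective: float    # target ratio, e.g. 0.999
    window_days: int    # rolling time window in days

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the definition is valid."""
        problems = []
        if not self.owner:
            problems.append("owner must be set")
        if not 0.0 < self.objective < 1.0:
            problems.append("objective must be between 0 and 1 (exclusive)")
        if self.window_days <= 0:
            problems.append("window_days must be positive")
        return problems


# Example definition; a CI step could fail the build if validate() reports problems.
CHECKOUT_AVAILABILITY = SLODefinition(
    name="checkout-availability",
    owner="team-payments",
    sli="successful checkout requests / all checkout requests",
    objective=0.999,
    window_days=30,
)
issues = CHECKOUT_AVAILABILITY.validate()
print("valid" if not issues else issues)
```

Because the definition is ordinary code, changes to it go through the same review and version history as the product, which is where the audit trail mentioned above would come from.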
Stephen Townshend
Developer Advocate (SRE)
SquaredUp
Our industry is full of buzzwords and exaggerations; it can be hard to know what is real and what is not. Stephen strives to take complex technical concepts and simplify and present them in a way everyone can understand and apply (and to call out when something is too good to be true).
Stephen lives in Auckland, New Zealand and currently works as a Developer Advocate for SquaredUp, as well as promoting and improving observability and SRE practices internally in the organisation.
Reliability Benchmarking: A Pre-cursor to SLO Adoption
Stephen Weber
Staff Site Reliability Engineer
Procore

Arguments in Favor: why SLOs?
In my experience, many teams encounter SLOs as something they've been told to "do" and the flip side is it's been something I've been asked to help them with. Naturally this is not ideal - as engineers we prefer things to be self-evident.
I have a number of strategies to use when communicating the process and the value of creating and using SLOs. I'll give away the thing I say the most right here: "SLOs are for decision-making". They're not magic or even a single thing. They're a tool that helps us do our jobs.
The audience will come away either better able to articulate the pragmatic usefulness of an SLO mindset, or with one or two real motivations for why they should consider developing and using SLOs for their own systems.
Steve McGhee
Reliability Advocacy Engineer
Google SRE

Two Paths in the Woods
While the direction and intent of SRE have been established and are becoming better understood, the details on how to achieve "SRE" are still an exercise left for the reader.
Steve from Google and Sal from Nobl9 will present two independently developed methods for teaching SRE topics to customers, which we have discovered are actually quite similar. Huzzah!
Steve will present the "reliability map" and Sal will show Nobl9's SLODLC. Both are essentially sets of documentation presented in a way that allows customers to take a *guided* path through the dark and scary woods that is today's SRE.
The "reliability map" provides a detailed account of what development, infrastructure, operational, observability, and people/culture activities organizations can expect to be practicing to achieve different eras of reliability (99%, 99.9%, 99.99%, 99.999%). The "reliability map" takes it one step further and provides descriptions and references to additional content for all activities.
SLODLC provides a framework of documentation and templates to allow customers to iterate on improving their SLOs. This is especially necessary when customers have high reliability targets that rely upon automation to execute predetermined playbooks (rollback, exponential backoff, autoscale, etc.). If SLOs are the indicator of the performance of a user journey and decide when actions need to be taken, then customers need a clear and usable way to explain:
- What they are attempting to measure (golden signals? something else?)
- Why was it decided to measure X in such a way?
- How is X impactful for the targeted user journey?
- When was the last time the SLO objective, metric, time window, etc. was changed?
- When the error budget is in danger of being breached, what actions should be taken?
Steve Upton
Principal QA Consultant
Thoughtworks

Data Product Thinking with SLOs
The talk tells the story of how conversations around SLOs can be a great trigger to start a shift to a Product Thinking mindset, with practical examples. We'll also take a light dip into constraint mapping from an SLO perspective.
Steve Xuereb
Staff Site Reliability Engineer
GitLab Inc.

So many SLOs so many alerts
Surya Bhagvat
Director, SRE
Harness

The Business of Properly Setting SLOs
Join Surya Bhagvat, Director of Site Reliability Engineering at Harness, to discuss how his team used business objectives to create SLOs that positively impacted customer experience and his engineering team. Surya will share lessons learned in choosing and implementing the SLIs and SLOs aligned with customer expectations and Harness’ desired business outcomes.
Thijs Metsch
Researcher
Intel Labs

Intent Driven Orchestration with SLOs!
With a Serverless mindset in place, there should no longer be a need to define resource requests or any other kinds of information. Just develop your app or function and run it.
But how can we achieve performance and make sure we run the system most efficiently? This is where we can now let users define what they truly care about: their SLOs. Based on these performance targets, our Intent Driven Orchestration Planner will do the rest. No need to define resource requests and limits on, e.g., a Kubernetes cluster anymore. The planner will set up the systems in such a way that your apps/functions behave as expected, without the need to know anything about the underlying infrastructure: less knowledge needed, fewer errors made!
This is a shift in how we do orchestration and SLO management that could be of interest to this community: away from a monitoring and alerting way of looking at SLOs, towards a way of using SLOs to manage the system. Furthermore, this truly allows for ease of use for the user; rather than defining numbers and values based on domain and contextualized knowledge, we let them define what they truly care about: their SLOs!
Toby Burress
SRE
Dropbox

What We Mean By "Mean"
There's been a lot of (really good!) discussion in the last several years about how to think about, monitor, and alert on long-tailed distributed quantities, such as latency. However, I'm worried that in our desire to describe the long tail we may be too eager to abandon tools that are still useful.
In this talk I'll (re)introduce everyone's favorite summary statistic, the average. I'll talk about the difference between a sample average and a random variable's expectation, and how the two are uniquely linked by the law of large numbers. I'll also talk about how the central limit theorem allows us to treat sample averages as draws from a Gaussian distribution, irrespective of the distribution the samples come from, and then I'll talk about the exceptions. We'll finish up by looking at how the properties of expected values can give us insight into the behavior of systems, even at the tail.
Through all of this we'll be looking at examples drawn from real-world latency data, and comparing the insights gleaned from this versus other common summary statistics.
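The two results mentioned above can be tried out with a short simulation. The Python sketch below uses synthetic lognormal "latencies" rather than real-world data; the distribution parameters, sample sizes, and random seed are assumptions chosen for illustration. It checks that sample averages cluster around the expectation (law of large numbers) and that their spread is close to the population standard deviation divided by the square root of the sample size, as the central limit theorem suggests for a distribution with finite variance.

```python
import math
import random
import statistics

# Assumed lognormal parameters for synthetic long-tailed "latencies".
MU, SIGMA = 0.0, 0.5
TRUE_MEAN = math.exp(MU + SIGMA**2 / 2)
TRUE_STD = math.sqrt((math.exp(SIGMA**2) - 1) * math.exp(2 * MU + SIGMA**2))

N = 400        # observations per sample average
TRIALS = 500   # number of sample averages drawn
STD_ERR = TRUE_STD / math.sqrt(N)  # spread of the averages predicted by the CLT


def sample_mean(rng: random.Random, n: int) -> float:
    """Average of n draws from the long-tailed distribution."""
    return statistics.fmean(rng.lognormvariate(MU, SIGMA) for _ in range(n))


rng = random.Random(42)  # fixed seed so the run is repeatable
means = [sample_mean(rng, N) for _ in range(TRIALS)]

# Share of sample averages within two predicted standard errors of the expectation.
within_two = sum(abs(m - TRUE_MEAN) <= 2 * STD_ERR for m in means) / TRIALS

print(f"expectation: {TRUE_MEAN:.4f}, mean of averages: {statistics.fmean(means):.4f}")
print(f"predicted spread: {STD_ERR:.4f}, observed spread: {statistics.stdev(means):.4f}")
print(f"share within 2 standard errors: {within_two:.1%}")
```

The exceptions the talk alludes to would show up with heavier tails: if the variance of the underlying distribution is not finite, the predicted spread above is no longer meaningful.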
Troy Koss
Director, Enterprise SRE
Capital One

Is Your Resilience Reliable?
Resiliency is a critical piece of building reliable systems. It allows us to feel safe knowing failure is inevitable. After all, as noted in the OG SRE book, 100% is a terrible target for basically everything.
We spend a lot of resources to add in layers of resiliency from redundant multi-region compute stacks to backups on our backups. How do we know this resilience achieved our ultimate outcome of reliability for our customers? We'll discuss the ways to observe your SLOs and error budgets during resiliency events.
The various events we'll look at include regional failover, chaos experiments (such as latency injection), database recovery, and more! After failing a region, do we know if your customers experienced a disturbance? When you're running a resiliency test or game day, how do you measure success?
Observing error budgets before, during, and after an event paints a picture of our customers' experience and can ultimately be part of the success criteria. It is critical that we know how our architecture and system changes unfold. What if new resiliency introduces latency that negatively impacts your customer? For example, if there's complexity introduced that makes your release engineering more convoluted, we may see a longer error budget burn while we remediate. On the flip side, the partnership of adding resiliency and observing your SLOs can also lead to improving the objectives with newly matured levels of resiliency, raising the bar for performance.
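One way to compare error budgets before, during, and after an event is to express the bad requests in each phase as a share of the window's total budget. The Python sketch below does this for a request-based SLO; the phase names, request counts, objective, and expected window traffic are invented numbers, and `Phase` and `budget_share_by_phase` are hypothetical helpers rather than part of any tool named in the talk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Phase:
    name: str
    total_requests: int
    bad_requests: int

    @property
    def error_rate(self) -> float:
        """Fraction of requests in this phase that failed."""
        return self.bad_requests / self.total_requests


def budget_share_by_phase(phases: List[Phase], objective: float,
                          window_requests: int) -> Dict[str, float]:
    """Fraction of the window's error budget consumed in each phase."""
    budget = (1 - objective) * window_requests  # bad requests the SLO tolerates
    return {p.name: p.bad_requests / budget for p in phases}


# Invented numbers for a regional failover exercise.
phases = [
    Phase("before failover", 1_000_000, 200),
    Phase("during failover", 50_000, 1_500),
    Phase("after failover", 1_000_000, 100),
]
shares = budget_share_by_phase(phases, objective=0.999, window_requests=10_000_000)
for phase in phases:
    print(f"{phase.name}: error rate {phase.error_rate:.2%}, "
          f"{shares[phase.name]:.1%} of the error budget")
```

A success criterion for a game day could then be phrased as a cap on the budget share consumed during the event, alongside a check that the share after the event returns to its earlier level.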
Vijay Samuel
Observability Architect
eBay
Scaling SLI/SLO - Pushing Your Observability Platform To Its Limits
At eBay we use a wrapped version of Prometheus as our centralized time series database. Data residing in Prometheus is used for mission-critical functions like detecting issues on the site through anomaly detection, SLOs or simple threshold-based alerts. To be able to enforce SLOs across the thousands of microservices that are deployed inside of eBay, we need to be able to aggregate commonly instrumented metrics at an application level.
Why are these aggregations necessary? Host-level metrics, when used for computation of things like burn rates over a 30-day window, can become very expensive. Using a single query across all microservices to generate such aggregations has several scale limitations. Manually onboarding each microservice into such aggregations can also be tedious.
The talk discusses our commonly instrumented metrics, our metrics store architecture, scale challenges seen with SLI/burn rate computation, and how we came up with concepts like templated rule groups that automatically generate Prometheus rule groups to be able to do aggregations for every application that is emitting metrics.