In this episode of Alexa’s Input (AI), I sit down with Sal Furino to explore the hidden engineering work that keeps modern systems reliable.
We break down what Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets actually mean in practice, why reliability is as much a cultural problem as a technical one, and how teams can better measure real user experience instead of just infrastructure health.
Sal also explains reliability engineering and the challenges of reliability at scale, like:
- Why latency and correctness become harder to measure with GenAI
- The difference between a bad incident and a fundamentally bad system
- How observability and telemetry shape modern engineering organizations
- Why most teams focus too much on infrastructure metrics and not enough on user happiness
- Why “the best systems are the ones nobody notices”
If you work in AI infrastructure, distributed systems, platform engineering, observability, or SRE, this episode is a must-listen!
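For anyone new to the terms above, here is a minimal sketch of the arithmetic behind an error budget. It is not taken from the episode: the function names, the 30-day window, and the request counts are all hypothetical, chosen only to illustrate how an SLO target turns into a budget.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO.

    slo_target is a fraction such as 0.999 ("three nines").
    The 30-day default window is an assumption for illustration.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLI.

    The SLI here is good_events / total_events. A negative result means
    the budget is exhausted and the SLO has been missed for the window.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Each extra "nine" shrinks the allowed downtime by a factor of ten.
    for target in (0.99, 0.999, 0.9999, 0.99999):
        print(f"{target:.3%} -> {error_budget_minutes(target):.2f} min per 30 days")

    # Hypothetical traffic: 1,000,000 requests, 500 of them bad.
    print(f"budget left: {budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

With these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, while five nines leaves under half a minute, which is part of why each additional nine is so much harder to reach.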
SRECon Talk Dashboards & Dragons: Reliability Magic for AI Platforms by Alexa Griffith and Sal Furino: https://youtu.be/aWMB_7ksbkc?si=S49nPyAl_hCUIH7y
General Podcast Links
Watch: https://www.youtube.com/@alexa_griffith
Read: https://alexasinput.substack.com/
Listen: https://creators.spotify.com/pod/profile/alexagriffith/
More: https://linktr.ee/alexagriffith
Learn more about the host at:
Website: https://alexagriffith.com/
LinkedIn: https://www.linkedin.com/in/alexa-griffith/
Find out more about the guest at:
LinkedIn: https://www.linkedin.com/in/salvatore-furino/
Rootly Interview: https://rootly.com/humans-of-reliability/salvatore-furino
Reliability at Scale Talk: https://youtu.be/J-VrU5JHPlk?si=8aV8acy57NWX30KA
Bloomberg Careers: https://bloomberg.avature.net/careers/SearchJobs
Chapters
00:00 - Introduction: Reliability in a world reshaped by generative AI
02:22 - The importance of seamless, background system design
04:41 - Becoming a Customer Reliability Engineer at Bloomberg
05:17 - Clarifying the CRE role and its customer focus
08:02 - The importance of observability and high-scale performance in finance
09:00 - Balancing technical and cultural aspects of reliability
10:19 - Coaching teams to be proactive using error budgets and SLIs
12:21 - The sociotechnical system: People, processes, and tools
13:06 - Mediation of differing opinions on reliability practices
15:06 - The nuanced approach to alerting and incident response
17:08 - The significance of tiered SLOs and the concept of error budgets
21:08 - Using signals like latency, correctness, availability, and saturation in system measurement
22:53 - The impact of service level "nines" on system design and resilience
28:00 - Handling non-determinism and trust in AI responses
33:01 - Error budgets and their role in managing deployments
34:10 - The challenge of achieving five nines and data durability considerations
40:03 - Adapting SLOs for GenAI systems: core principles remain intact
42:23 - Measuring non-deterministic AI responses and quality proxies
44:41 - The ongoing importance of reliability even in AI/ML contexts
47:25 - Reacting to error budget exhaustion and proactive mitigation
50:42 - The significance of involving cross-functional teams during outages
55:36 - Advocating reliability investment to leadership
56:24 - The customer perspective: reliability as a fundamental feature
58:42 - Connecting with Sal Furino: where to follow his work and learn more about Bloomberg's engineering culture
59:20 - Final advice: Focus on user happiness to avoid common pitfalls in adopting SLOs