Abstract
Real-world experience and things that go wrong are two of life’s best teachers. This talk will explore key elements of scalable large-system design and Site Reliability Engineering (SRE) principles* through anti-patterns encountered in real life. Find out what lessons can be gleaned from watching the dynamics in a crowded cafe or dealing with a security issue during a hotel stay. Learn about fundamental SRE principles and practices, including:
- Avoiding cascading failures
- Not feeding the machines with human toil
- Writing blameless postmortems
- Engineering solutions to eliminate classes of errors rather than implementing point fixes
These principles will be framed through the lens of the suboptimal, demonstrating the impact of SRE anti-patterns on user trust.
* SRE is often thought of as a specific implementation of the DevOps interface.