Facebook’s motto is “Move fast and break things”. Outside of the industry, it is common to misread this motto as a simple conjunction, the creed of people who like to watch the world burn.
Basically: “1. Move fast; 2. Break things”.
But no. That’s actually not what it means. Implied in the motto is an understanding that there is necessarily a tradeoff between velocity and risk/reliability. Facebook’s motto is taking a position on that tradeoff: “Move fast, therefore be ok with breaking things” or (even more accurately) “Prioritise moving fast over not breaking things”. Somehow not quite as catchy…
One problem with the Facebook formulation is that it doesn’t specify quite how broken it is OK for things to be. It is not, and never was, OK at Facebook for the service to be constantly and egregiously broken. Indeed, Facebook has always been a reasonably well-built and stable product.
So how do we decide how broken is too broken?
The Error Budget
For that, we turn to a concept from Site Reliability Engineering.
The excellent “SRE Book” from Google introduces the concept of error budgets for software services. The error budget of a service is simply the quantity of errors that everyone has agreed is acceptable. The error budget is intrinsically linked to a service’s SLO (service level objective), but framing it in terms of a budget speaks directly to the tension between moving fast and breaking things:
“Product development performance is largely evaluated on product velocity, which creates an incentive to push new code as quickly as possible. Meanwhile, SRE performance is (unsurprisingly) evaluated based upon reliability of a service, which implies an incentive to push back against a high rate of change.”
In other words, there is a direct tradeoff between how fast a software product changes, and how reliable that product is from a technical operations perspective. Error budgets are designed to allow that tradeoff to be formalised and negotiated:
“The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.”
Basically, the error budget is a formal allowance for how often things are allowed to go wrong. If you exceed that budget, you’ve screwed up (and you need to slow down).
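To make the arithmetic concrete, here is a rough sketch of how a budget falls out of an SLO. It assumes an availability-style SLO expressed as a fraction (say 0.999); the function names and the example numbers are mine, purely for illustration, and are not taken from the SRE Book.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Failed requests allowed by an availability SLO (hypothetical helper)."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    # Whatever the SLO does not promise is the budget.
    return round((1.0 - slo) * total_requests)


def allowed_downtime_minutes(slo: float, days: int = 90) -> float:
    """The same budget expressed as minutes of downtime over a window.

    The 90-day default is an assumption standing in for "a single quarter".
    """
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * days * 24 * 60


# Illustrative numbers only: a 99.9% SLO over one million requests.
print(error_budget(0.999, 1_000_000))
print(allowed_downtime_minutes(0.999))
```

With those made-up inputs, a 99.9% SLO leaves room for about a thousand failed requests in a million, or a little over two hours of downtime in a 90-day quarter.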
Note that it is the intention of an error budget that it be fully (or nearly fully) spent. If a service is coming in month after month well below its error budget, the implication is that the product velocity could be higher, to the benefit of users and the business. In other words, you should be moving faster.
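A minimal sketch of that "spend it" rule, again with names of my own invention. The 50% underspend threshold is an arbitrary assumption; pick whatever "well below" means for your team.

```python
def budget_status(budget: int, errors: int, underspend_threshold: float = 0.5) -> str:
    """Classify how much of an error budget has been spent (hypothetical helper)."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    spent = errors / budget  # fraction of the budget consumed so far
    if spent > 1.0:
        return "over budget: slow down"
    if spent < underspend_threshold:
        return "underspent: you could be moving faster"
    return "on track"


# Illustrative numbers only, against a budget of 1000 errors.
for errors in (1200, 900, 200):
    print(errors, budget_status(1000, errors))
```

The point of the middle branch is that landing near the budget is the goal, not a near miss.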
The Fuckup Budget
“This is how much you are allowed to break things.”
I love the concept of error budgeting, and feel that it applies well beyond operational metrics.
On the surface, it feels good for everything to be going smoothly. But is it actually a good thing? In many cases, it means that you have moved more slowly than you could have.
For any organisation, at any stage, there is value in being intentional about how much breakage is ok in service of making progress: in other words, a “Fuckup Budget”.
How many customers is it ok to let down? What defect rate is tolerable? How many bugs is too many? What percentage of bad hires can you efficiently deal with?
If you are like most people, there is a temptation to answer “None. Zero. None. Zero.” But the pursuit of perfection leads to extreme caution and ultimately to stasis. So stop, and think about how many fuckups are actually OK in the context of what you’re trying to achieve. Set that number and write it down. And then try to reach that number! Remember, if you’re not spending your Fuckup Budget then you could be moving faster.
Happy moving fast! And don’t sweat the breaks.