In his blog post on “Crisis Driven Development”, Software Engineer Zach Drillings talks about how to deal with somewhat disastrous system failures, and, ideally, emerge on the other side with bigger, better, stronger software and processes. While I agree that moving forward with new infrastructure instead of rolling back or patching things up is definitely a good outcome of a fire, it does come at a high cost:
- It’s seriously nerve-wracking. While some might thrive in the high-pressure environment of a production fire, we really don’t want to subject our engineers to the stress of late nights, or risk the integrity of our products, just to drive development forward at a rapid pace. It’s pretty obvious that the benefits of relying on such a method are far outweighed by the risks of emergency development.
- There’s a reason why we go through prioritization exercises when we plan sprints — we want to get the highest value features and fixes out to our customers at the right time. Bringing “all hands on deck” to deal with fires is disruptive and requires teams to drop their planned work, which then in turn leads to delays in getting these other important features out the door.
- Did I mention it’s nerve-wracking? One line from my colleague Rohit Parulkar that has always stuck with me is “burnout is real”. A stable work environment depends on stable and happy engineers, so you really don’t want to get into burnout territory.
Despite all of these downsides, we have found that crisis-driven development (CDD) also has some really valuable effects. This got us thinking that we might be able to apply some of these learnings during our software development and rollout process: CDD without the actual crisis. In this part of the blog post, I will focus on highlighting “the good parts” of crisis-driven development, and how we can apply them without the stress of red alerts, war rooms and late nights.
When dealing with a crisis, it is clear both to the team itself and to other teams that depend on it that the fire is the highest priority right now, and that (almost) all other requests will have to wait. We can confidently tell others that we won’t start working on anything else until the problem is fixed. Having worked on an infrastructure team with multiple downstream consumers who all have their own values and priorities, I know how hard it is to say “no” to, well, most of them. However, by cultivating and communicating a culture of radical prioritization as a team and frequently asking yourself “what is the highest value thing we can work on right now”, not just on a sprint-by-sprint basis but for an entire quarter, you get a pretty good litmus test for what’s really important. Besides, if your team has so many different responsibilities that you’re pulled in multiple directions, it might be time to split ownership of systems to get some of that focus back.
Well-defined “definition of done”
Occasionally, you find yourself working on a fix or a new feature and it’s just not completely obvious what the “definition of done” really is. Tested locally and merged to master? Passed the integration tests? Ran in the staging environment and “looked okay”? Went into production and customers report that it works the way they expect it to? Of course, being a good product engineering team, this should be totally clear from the start when we write well-defined tickets including test plans and acceptance criteria, but let’s be honest, sometimes you simply haven’t thought through everything end-to-end before starting work on something. Especially with infrastructure (i.e., not customer-facing) features like complex data processing pipelines, we tend to be a little more exploratory and less goal-driven. CDD takes away that guesswork and makes it very clear when the work is done: when production systems work the way they did before the crisis and customers are happy. You can make use of this by focusing on your end users’ expectations. Don’t ask “how should this code work” but “what is the expected output or functionality of this code”, and use the answer to guide your acceptance criteria.
This follows very closely from the previous point on “definition of done”: Your production system is broken. Hence, you treat everything you do as a production feature, with all the users, use cases and load that comes with it. This is also occasionally difficult to consider when working on infrastructure projects: A/B testing (running two different implementations of a feature in production) is not really a concept for infrastructure, and it’s often hard to do end-to-end tests on these kinds of systems. With CDD, the one use case that you’re addressing is your production system.
This is one of my favorite aspects of CDD — huddling. I say “let’s huddle” a lot, which — unlike formal planning meetings — means “let’s get everyone in the room together, put our heads down and get shit done”. By working closely with engineers on your team as well as your major stakeholders (product managers, customer support…), preferably in physical proximity, you short-circuit feedback loops and eliminate wait times for code reviews since the people you’re working with are sitting right next to you — and are also focused on getting this thing out the door. This is one of the benefits of CDD that’s easy to apply during peaceful times, and usually ends up being an incredibly productive, focused and even fun afternoon, day, or week, that also significantly contributes to team bonding.
… But also: hierarchy
As a tech startup, we love to iterate on “minimum viable products”, but crises don’t actually leave a lot of room for this. Rather than being able to “experiment” with your end users, you have an urgent problem and need an adequate solution immediately, period. One of the interesting side effects of this focus is that CDD often requires pulling in engineers who have the relevant experience with and authority over a specific part of our code base, or a specific technology, in order to put the fire out as quickly as possible. While we try to maintain a high “bus factor” (the slightly macabre measure of “how many engineers can get hit by a bus without losing the ability to maintain this product or feature”) for our code base, CDD can help identify those experts so that we can include them in the decision-making process for regular, non-crisis development cycles.
Just recently, interestingly enough, one of our infrastructure teams applied some of the CDD principles to a large-scale infrastructure migration. They set strict timelines for when they would turn off the legacy service, which, while somewhat scary, gave downstream teams a clear definition of done and caused them to prioritize the work needed to migrate their system dependencies accordingly. Applying my favorite principle of huddling, the infrastructure team leading the migration was stationed in a dedicated meeting room during migration week, where engineers could stop by and get immediate support for any problems that came up, thus avoiding lengthy feedback cycles that could have slowed down the process. And while we certainly did a lot more planning and pre-work than is usual in a crisis, this approach led to a successful final push, culminating in the smashing of yet another piñata, something our engineering team is strangely obsessed with.