A liar will not be believed, even when he speaks the truth. : Aesop

Once you have a project that is a few years old with a large test suite an ugly pattern emerges.

Some tests that used to always work, start “sometimes” working. This starts slowly, “oh that test, yeah it sometimes fails, kick the build off again”. If left unmitigated it can very quickly snowball and paralyze an entire test suite.

Most developers know about this problem and call these tests “non deterministic tests”, “flaky tests”,“random tests”, “erratic tests”, “brittle tests”, “flickering tests” or even “heisentests”.

Naming is hard, it seems that this toxic pattern does not have a well established unique and standard name. Over the years at Discourse we have called this many things, for the purpose of this article I will call them flaky tests, it seems to be the most commonly adopted name.

Much has been written about why flaky tests are a problem.

Martin Fowler back in 2011 wrote:

Non-deterministic tests have two problems, firstly they are useless, secondly they are a virulent infection that can completely ruin your entire test suite.

To this I would like to add that flaky tests are an incredible cost to businesses. They are very expensive to repair often requiring hours or even days to debug and they jam the continuous deployment pipeline making shipping features slower.

I would like to disagree a bit with Martin. Sometimes I find flaky tests are useful at finding underlying flaws in our application. In some cases when fixing a flaky test, the fix is in the app, not in the test.

In this article I would like to talk about patterns we observed at Discourse and mitigation strategies we have adopted.

Patterns that have emerged at Discourse

A few months back we introduced a game.

We created a topic on our development Discourse instance. Each time the test suite failed due to a flaky test we would assign the topic to the developer who originally wrote the test. Once fixed the developer who sorted it out would post a quick post morterm.

This helped us learn about approaches we can take to fix flaky tests and raised visibility of the problem. It was a very important first step.

Following that I started cataloging the flaky tests we found with the fixes at: https://review.discourse.org/tags/heisentest

Recently, we built a system that continuously re-runs our test suite on an instance at digital ocean and flags any flaky tests (which we temporarily disable).

Quite a few interesting patterns leading to flaky tests have emerged which are worth sharing.

Hard coded ids

Sometimes to save doing work in tests we like pretending.

user.avatar_id = 1
user.save!

# then amend the avatar
user.upload_custom_avatar!

# this is a mistake, upload #1 never existed, so for all we know
# the legitimate brand new avatar we created has id of 1. 
assert(user.avatar_id != 1)  

This is more or less this example here.

Postgres often uses sequences to decide on the id new records will get. They start at one and keep increasing.

Most test frameworks like to rollback a database transaction after test runs, however the rollback does not roll back sequences.

ActiveRecord::.transaction do
   puts User.create!.id
   # 1
   raise ActiveRecord::Rollback
puts 

puts User.create!.id
# 2

This has caused us a fair amount of flaky tests.

In an ideal world the “starting state” should be pristine and 100% predictable. However this feature of Postgres and many other DBs means we need to account for slightly different starting conditions.

This is the reason you will almost never see a test like this when the DB is involved:

t = Topic.create!
assert(t.id == 1)

Another great, simple example is here.

Random data

Occasionally flaky tests can highlight legitimate application flaws. An example of such a test is here.

data = SecureRandom.hex
explode if data[0] == "0"

Of course nobody would ever write such code. However, in some rare cases the bug itself may be deep in the application code, in an odd conditional.

If the test suite is generating random data it may expose such flaws.

Making bad assumptions about DB ordering

create table test(a int)
insert test values(1)
insert test values(2)

I have seen many times over the years cases where developers (including myself) incorrectly assumed that if you select the first row from the example above you are guaranteed to get 1.

select a from test limit 1

The output of the SQL above can be 1 or it can be 2 depending on a bunch of factors. If one would like guaranteed ordering then use:

select a from test order by a limit 1

This problem assumption can sometimes cause flaky tests, in some cases the tests themselves can be “good” but the underlying code works by fluke most of the time.

An example of this is here another one is here.

A wonderful way of illustrating this is:

[8] pry(main)> User.order('id desc').find_by(name: 'sam').id
  User Load (7.6ms)  SELECT  "users".* FROM "users" WHERE "users"."name" = 'sam' ORDER BY id desc LIMIT 1
=> 25527
[9] pry(main)> User.order('id').find_by(name: 'sam').id
  User Load (1.0ms)  SELECT  "users".* FROM "users" WHERE "users"."name" = 'sam' ORDER BY id LIMIT 1
=> 2498
[10] pry(main)> User.find_by(name: 'sam').id
  User Load (0.6ms)  SELECT  "users".* FROM "users" WHERE "users"."name" = 'sam' LIMIT 1
=> 9931

Even if the clustered index primary key is on id you are not guaranteed to retrieve stuff in id order unless you explicitly order.

Incorrect assumptions about time

My test suite is not flaky, excepts from 11AM UTC till 1PM UTC.

A very interesting thing used to happen with some very specific tests we had.

If I ever checked in code around 9:50am, the test suite would sometimes fail. The problem was that 10am in Sydney is 12am in UTC time (daylight savings depending). That is exactly the time that the clock shifted in some reports causing some data to be in the “today” bucket and other data in the “yesterday” bucket.

This meant that if we chucked data into the database and asked the reports to “bucket” it the test would return incorrect numbers at very specific times during the day. This is incredibly frustrating and not particularly fair on Australia that have to bear the brunt.

An example is here (though the same code went through multiple iterations previously to battle this).

The general solution we have for the majority of these issues is simply to play pretend with time. Test pretends it is 1PM UTC in 2018, then does something, winds clock forward a bit and so on. We use our freeze time helper in Ruby and Sinon.JS in JavaScript. Many other solutions exist including timecop, the fascinating libfaketime and many more.

Other examples I have seen are cases where sleep is involved:

sleep 0.001
assert(elapsed < 1) 

It may seem obvious that that I slept for 1 millisecond, clearly less than 1 second passed. But this obvious assumption can be incorrect sometimes. Machines can be under extreme load causing CPU scheduling holdups.

Another time related issue we have experienced is insufficient timeouts, this has plagued our JS test suite. Many integration tests we have rely on sequences of events; click button, then check for element on screen. As a safeguard we like introducing some sort of timeout so the JS test suite does not hang forever waiting for an element to get rendered in case of bugs. Getting the actual timeout duration right is tricky. On a super taxed AWS instance that Travis CI provides much longer timeouts are needed. This issue sometimes is intertwined with other factors, a resource leak may cause JS tests to slowly require longer and longer time.

Leaky global state

For tests to work consistently they often rely on pristine initial state.

If a test amends global variables and does not reset back to the original state it can cause flakiness.

An example of such a spec is here.

class Frog
   cattr_accessor :total_jumps
   attr_accessor :jumps

   def jump
     Frog.total_jumps = (Frog.total_jumps || 0) + 1
     self.jumps = (self.jumps || 0) + 1
   end
end

# works fine as long as this is the first test
def test_global_tracking
   assert(Frog.total_jumps.nil?)
end

def test_jumpy
   frog = Frog.new
   frog.jump
   assert(frog.jumps == 1)
end 

Run test_jumpy first and then test_global_tracking fails. Other way around works.

We tend to hit these types of failures due to distributed caching we use and various other global registries that the tests interact with. It is a balancing act cause on one hand we want our application to be fast so we cache a lot of state and on the other hand we don’t want an unstable test suite or a test suite unable to catch regressions.

To mitigate we always run our test suite in random order (which makes it easy to pick up order dependent tests). We have lots of common clean up code to avoid the situations developers hit most frequently. There is a balancing act, our clean up routines can not become so extensive that they cause major slowdown to our test suite.

Bad assumptions about the environment

It is quite unlikely you would have a test like this in your test suite.

def test_disk_space
   assert(free_space_on('/') > 1.gigabyte)
end

That said, hidden more deeply in your code you could have routines that behaves slightly differently depending on specific machine state.

A specific example we had is here.

We had a test that was checking the internal implementation of our process for downloading images from a remote source. However, we had a safeguard in place that ensured this only happened if there was ample free space on the machine. Not allowing for this in the test meant that if you ran our test suite on a machine strained for disk space tests would start failing.

We have various safeguards in our code that could depend on environment and need to make sure we account for them when writing tests.

Concurrency

Discourse contains a few subsystems that depend on threading. The MessageBus that powers live updates on the site, cache synchronization and more uses a background thread to listen on a Redis channel. Our short lived “defer” queue powers extremely short lived non-critical tasks that can run between requests and hijacked controller actions that tend to wait long times on IO (a single unicorn worker can sometimes serve 10s or even 100s of web requests in our setup). Our background scheduler handles recurring jobs.

An example would be here.

Overall, this category is often extremely difficult to debug. In some cases we simply disable components in test mode to ensure consistency, the defer queue runs inline. We also evict threaded component out of our big monolith. I find it significantly simpler to work through and repair a concurrent test suite for a gem that takes 5 seconds to run vs repairing a sub-section in a giant monolith that has a significantly longer run time.

Other tricks I have used is simulating an event loop, pulsing it in tests simulating multiple threads in a single thread. Joining threads that do work and waiting for them to terminate and lots of puts debugging.

Resource leaks

Our JavaScript test suite integration tests have been amongst the most difficult tests to stabilise. They cover large amounts of code in the application and require Chrome web driver to run. If you forget to properly clean up a few event handlers, over thousands of tests this can lead to leaks that make fast tests gradually become very slow or even break inconsistently.

To work through these issues we look at using v8 heap dumps after tests, monitoring memory usage of chrome after the test suite runs.

It is important to note that often these kind of problems can lead to a confusing state where tests consistently work on production CI yet consistently fail on resource strained Travis CI environment.

Mitigation patterns

Over the years we have learned quite a few strategies you can adopt to help grapple with this problem. Some involve coding, others involve discussion. Arguably the most important first step is admitting you have a problem, and as a team, deciding how to confront it.

Start an honest discussion with your team

How should you deal with flaky tests? You could keep running them until they pass. You could delete them. You could quarantine and fix them. You could ignore this is happening.

At Discourse we opted to quarantine and fix. Though to be completely honest, at some points we ignored and we considered just deleting.

I am not sure there is a perfect solution here.

:wastebasket: “Deleting and forgetting” can save money at the expense of losing a bit of test coverage and potential app bug fixes. If your test suite gets incredibly erratic, this kind of approach could get you back to happy state. As developers we are often quick to judge and say “delete and forget” is a terrible approach, it sure is drastic and some would judge this to be lazy and dangerous. However, if budgets are super tight this may be the only option you have. I think there is a very strong argument to say a test suite of 100 tests that passes 100% of the time when you rerun it against the same code base is better than a test suite of 200 tests where passing depends on a coin toss.

:recycle: “Run until it passes” is another approach. It is an attempt to have the cake and eat it at the same time. You get to keep your build “green” without needing to fix flaky tests. Again, it can be considered somewhat “lazy”. The downside is that this approach may leave broken application code in place and make the test suite slower due to repeat test runs. Also, in some cases, “run until it passes” may fail on CI consistently and work on local consistently. How many retries do you go for? 2? 10?

:man_shrugging:t4: “Do nothing” which sounds shocking to many, is actually surprisingly common. It is super hard to let go of tests you spent time carefully writing. Loss aversion is natural and means for many the idea of losing a test may just be too much to cope with. Many just say “the build is a flake, it sometimes fails” and kick it off again. I have done this in the past. Fixing flaky tests can be very very hard. In some cases where there is enormous amounts of environment at play and huge amounts of surface area, like large scale full application integration tests hunting for the culprit is like searching for a needle in a haystack.

:biohazard: “Quarantine and fix” is my favourite general approach. You “skip” the test and have the test suite keep reminding you that a test was skipped. You lose coverage temporarily until you get around to fixing the test.

There is no, one size fits all. Even at Discourse we sometimes live between the worlds of “Do nothing” and “Quarantine and fix”.

That said, having an internal discussion about what you plan to do with flaky tests is critical. It is possible you are doing something now you don’t even want to be doing, it could be behaviour that evolved.

Talking about the problem gives you a fighting chance.

If the build is not green nothing gets deployed

At Discourse we adopted continuous deployment many years ago. This is our final shield. Without this shield our test suite could have gotten so infected it would likely be useless now.

Always run tests in random order

From the very early days of Discourse we opted to run our tests in random order, this exposes order dependent flaky tests. By logging the random seed used to randomise the tests you can always reproduce a failed test suite that is order dependent.

Sadly rspec bisect has been of limited value

One assumption that is easy to make when presented with flaky tests, is that they are all order dependent. Order dependent flaky tests are pretty straightforward to reproduce. You do a binary search reducing the amount of tests you run but maintain order until you find a minimal reproduction. Say test #1200 fails with seed 7, after a bit of automated magic you can figure out that the sequence #22,#100,#1200 leads to this failure. In theory this works great but there are 2 big pitfalls to watch out for.

  1. You may have not unrooted all your flaky tests, if the binary search triggers a different non-order dependent test failure, the whole process can fail with very confusing results.

  2. From our experience with our code base the majority of our flaky tests are not order dependent. So this is usually an expensive wild goose chase.

Continuously hunt for flaky tests

Recently Roman Rizzi introduced a new system to hunt for flaky tests at Discourse. We run our test suite in a tight loop, over and over again on a cloud server. Each time tests fail we flag them and at the end of a week of continuous running we mark flaky specs as “skipped” pending repair.

This mechanism increased test suite stability. Some flaky specs may only show up 1 is 1000 runs. At snail pace, when running tests once per commit, it can take a very long time to find these rare flakes.

Quarantine flaky tests

This brings us to one of the most critical tools at your disposal. “Skipping” a flaky spec is a completely reasonable approach. There are though a few questions you should explore:

  • Is the environment flaky and not the test? Maybe you have a memory leak and the test that failed just hit a threshold?

  • Can you decide with confidence using some automated decision metric that a test is indeed flaky

There is a bit of “art” here and much depends on your team and your comfort zone. My advice here though would be to be more aggressive about quarantine. There are quite a few tests over the years I wish we quarantined earlier, which cause repeat failures.

Run flaky tests in a tight loop randomizing order to debug

One big issue with flaky tests is that quite often they are very hard to reproduce. To accelerate a repro I tend to try running a flaky test in a loop.

100.times do
   it "should not be a flake" do
      yet_it_is_flaky
   end
end

This simple technique can help immensely finding all sorts of flaky tests. Sometimes it makes sense to have multiple tests in this tight loop, sometimes it makes sense to drop the database and Redis and start from scratch prior to running the tight loop.

Invest in a fast test suite

For years at Discourse we have invested in speeding up to our test suite. There is a balancing act though, on one hand the best tests you have are integration tests that cover large amounts of application code. You do not want the quest for speed to compromise the quality of your test suite. That said there is often large amount of pointless repeat work that can be eliminated.

A fast test suite means

  • It is faster for you to find flaky tests
  • It is faster for you to debug flaky tests
  • Developers are more likely to run the full test suite while building pieces triggering flaky tests

At the moment Discourse has 11,000 or so Ruby tests it takes them 5m40s to run single threaded on my PC and 1m15s or so to run tests concurrently.

Getting to this speed involves a regular amount of “speed maintenance”. Some very interesting recent things we have done:

  • Daniel Waterworth introduced test-prof into our test suite and refined a large amount of tests to use: the let_it_be helper it provides (which we call fab! cause it is awesome and it fabricates). Prefabrication can provide many of the speed benefits you get from fixtures without inheriting the many of the limitations fixtures prescript.

  • David Taylor introduced the parallel tests gem which we use to run our test suite concurrently saving me 4 minutes or so each time I run the full test suite. Built-in parallel testing is coming to Rails 6 thanks to work by Eileen M. Uchitelle and the Rails core team.

On top of this the entire team have committed numerous improvements to the test suite with the purpose of speeding it up. It remains a priority.

Add purpose built diagnostic code to debug flaky tests you can not reproduce

A final trick I tend to use when debugging flaky tests is adding debug code.

An example is here.

Sometimes, I have no luck reproducing locally no matter how hard I try. Diagnostic code means that if the flaky test gets triggered again I may have a fighting chance figuring out what state caused it.

def test_something
   make_happy(user)
   if !user.happy
      STDERR.puts "#{user.inspect}"
   end
    assert(user.happy)
end

Let’s keep the conversation going!

Do you have any interesting flaky test stories? What is your team’s approach for dealing with the problem? I would love to hear more so please join the discussion on this blog post.

Extra reading