Image Credit: Tim Green, CC-BY-2.0 https://www.flickr.com/photos/atoach/6014917153

This is the story of a debugging odyssey that could have been avoided by a more careful reading of documentation. I’m telling it not only to draw attention to the documentation, but also because it’s the story of a fun hunt. And there are good lessons to be learned for any of us.

A preview

If you’re in a hurry to get out of here, then know this. If you want to use pytest with Django — and you use database replication — you’ll want to follow Django’s instructions. Carefully.

If you choose to read on, here’s what I hope you’ll learn from the work:

  • Even for an experienced developer, print() is a valuable tool in the debugging tool chest.
  • Logging can be useful, both in Python and in external systems like databases.
  • Where a problem surfaces is often not where it was caused.
  • Python’s inspect module is super helpful.
  • Always read the documentation.

Begin the journey

Our story starts with me, testing a new module I was working on. I was using pytest with Django for the first time. I installed the pytest-django plugin, wrote a few tests, and ran them. They worked. Great!

Then I wrote another test and re-ran the suite.

============================ ERRORS ============================
__ ERROR at teardown of test_num_friends_calculation[False-4] __
Traceback (most recent call last):
File "...django/db/backends/utils.py", line 83, in _execute
return self.cursor.execute(sql)
psycopg2.OperationalError: database "test_cmb_local" is being accessed by other users
DETAIL: There is 1 other session using the database.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "...pytest_django/fixtures.py", line 104, in teardown_database
verbosity=pytest.config.option.verbose,
File "...django/test/utils.py", line 299, in teardown_databases
connection.creation.destroy_test_db(old_name, verbosity, keepdb)
File "...django/db/backends/base/creation.py", line 259, in destroy_test_db
self._destroy_test_db(test_database_name, verbosity)
File "...django/db/backends/base/creation.py", line 276, in _destroy_test_db
% self.connection.ops.quote_name(test_database_name))
File "...raven/contrib/django/client.py", line 123, in execute
return real_execute(self, sql, params)
File "...django/db/backends/utils.py", line 68, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "...django/db/backends/utils.py", line 77, in _execute_with_wrappers
return executor(sql, params, many, context)
File "...django/db/backends/utils.py", line 85, in _execute
return self.cursor.execute(sql, params)
File "...django/db/utils.py", line 89, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "...django/db/backends/utils.py", line 83, in _execute
return self.cursor.execute(sql)
django.db.utils.OperationalError: database "test_cmb_local" is being accessed by other users
DETAIL: There is 1 other session using the database.

I re-ran the tests several times. The error seemed to pop up at random. Once it had happened the first time, it seemed to be more common. If I waited longer between runs, it seemed to be less common, but no clear patterns emerged. Great. An intermittent test failure. I hate those.

Unfortunately, the traceback above did not provide any clues about what might have caused the problem. I was clear that a database session was still open, but what could be keeping it open? I searched for the error message online and found this issue and this one too. They shared the same exception, but in both cases the problem appeared to be caused by open connections on parallel processes or threads. My situation was different, no threads or multi-processing involved.

I needed to find out what was keeping a session open. If I could figure that out, then I could work on trying to discover why it was happening.

The power of logging

We use postgresql for our database, and a number of the posts I read suggested that we might just terminate all the open connections at the end of the test run. This open issue in the pytest-django repository even suggests setting up a fixture to automatically do so. While this was a way out of the box, it didn’t feel right. But a thought occurred to me. All the posts I read used the pg_stat_activity table to find open connections. Perhaps I could use the same table and print out the queries and connections that were still open when the database was being torn down.

I hacked the following bit into the _destroy_test_db method referenced in the above error traceback:

def _destroy_test_db(self, test_database_name, verbosity):
with self.connection._nodb_connection.cursor() as cursor:
cursor.execute( # <
"select * from pg_stat_activity " # <
"where datname='test_cmb_local'") # <
print(cursor.fetchall()) # <
# ...

Lo and behold! When I ran pytest -s (to allow printing of stdout) I saw the following at the end of every failed test run:

[(…, “ROLLBACK”)]

ROLLBACK is a query that Django uses to isolate the effects of each individual test from all the others. Each test begins with a new transaction, and when the test is complete, the transaction is rolled back. This restores the state of the database to what it was before the test started. But, if this is a normal, expected query, how could it be hanging? Is there something wrong with the interaction of pytest and Django’s transaction management?

I needed more information. Happily, postgresql provides some amazingly powerful logging. I decided to look there next. I added the following to our postgresql.conf file and restarted the database:

logging_collector = on
log_directory = 'pg_log'
log_connections = on
log_disconnections = on
log_duration = on
log_line_prefix = '%p %c %l %s :: \n\t'

The logging_collector = on statement enables postgresql’s logging collector. Without turning this on, no query logs will be produced. By enabling log_connections, log_disconnections and log_duration I was able to see not only the queries that were running, but also when each connection to the database was opened and closed, and how long each query lasted. Most importantly, the log_line_prefix statement configures a prefix for every log statement. This prefix allowed me to add vital information that would otherwise be missing from the logs. The documentation in postgresql.conf summarizes the options:

# special values:
# %a = application name
# %u = user name
# %d = database name
# %r = remote host and port
# %h = remote host
# %p = process ID
# %t = timestamp without milliseconds
# %m = timestamp with milliseconds
# %n = timestamp with milliseconds (as a Unix epoch)
# %i = command tag
# %e = SQL state
# %c = session ID
# %l = session line number
# %s = session start timestamp
# %v = virtual transaction ID
# %x = transaction ID (0 if none)
# %q = stop here in non-session
# processes
# %% = '%'
# e.g. '<%u%%%d> '

I chose to see the process ID, session ID, session line number, and session start timestamp for each event logged. I also added a newline and a tab at the end of the prefix to make it a bit easier to see the prefixes, and visually differentiate them from the logged statements. With these changes I could use the tail command to follow along with the postgres log while my tests ran:

tail -f </path/to/postgresql/data>/pg_log/<logname>.log

And behold, a successful test run had output like this:

70 5ad121a5.46 1 2018-04-13 21:31:17 UTC ::
LOG: connection received: host=*** port=***
70 5ad121a5.46 2 2018-04-13 21:31:17 UTC ::
LOG: connection authorized: user=*** database=***
70 5ad121a5.46 3 2018-04-13 21:31:17 UTC ::
LOG: statement: CREATE DATABASE "test_cmb_local"
70 5ad121a5.46 4 2018-04-13 21:31:17 UTC ::
LOG: duration: 267.281 ms
70 5ad121a5.46 5 2018-04-13 21:31:17 UTC ::
LOG: disconnection: session time: 0:00:00.433 user=*** database=*** host=*** port=***
71 5ad121a6.47 1 2018-04-13 21:31:18 UTC ::
LOG: connection received: host=*** port=***
...
71 5ad121a6.47 2621 2018-04-13 21:31:18 UTC ::
LOG: statement: ROLLBACK
71 5ad121a6.47 2622 2018-04-13 21:31:18 UTC ::
LOG: duration: 0.080 ms
71 5ad121a6.47 2623 2018-04-13 21:31:18 UTC ::
LOG: disconnection: session time: 0:00:02.570 user=*** database=*** host=*** port=***
72 5ad121a8.48 1 2018-04-13 21:31:20 UTC ::
LOG: connection received: host=*** port=***
72 5ad121a8.48 2 2018-04-13 21:31:20 UTC ::
LOG: connection authorized: user=*** database=***
72 5ad121a8.48 3 2018-04-13 21:31:20 UTC ::
LOG: statement: DROP DATABASE "test_cmb_local"
72 5ad121a8.48 4 2018-04-13 21:31:20 UTC ::
LOG: duration: 117.056 ms
72 5ad121a8.48 5 2018-04-13 21:31:20 UTC ::
LOG: disconnection: session time: 0:00:00.284 user=*** database=*** host=*** port=***

Since the connection and session ids are sequential, we can see that when the tests run well, and teardown works, there are three connections to the database (70, 71, and 72) with three sessions (5ad121a5.46, 5ad121a6.47, and 5ad121a8.48).

What happens with a failing test run? It starts normally enough:

43 5ad1204a.2b 1 2018-04-13 21:25:30 UTC ::
LOG: connection received: host=*** port=***
43 5ad1204a.2b 2 2018-04-13 21:25:30 UTC ::
LOG: connection authorized: user=*** database=***
43 5ad1204a.2b 3 2018-04-13 21:25:30 UTC ::
LOG: statement: CREATE DATABASE "test_cmb_local"
43 5ad1204a.2b 4 2018-04-13 21:25:30 UTC ::
LOG: duration: 258.960 ms
43 5ad1204a.2b 5 2018-04-13 21:25:30 UTC ::
LOG: disconnection: session time: 0:00:00.312 user=*** database=*** host=*** port=***
44 5ad1204a.2c 1 2018-04-13 21:25:30 UTC ::
LOG: connection received: host=*** port=***

But as it comes to an end, something odd happens:

45 5ad1204b.2d 1605 2018-04-13 21:25:31 UTC ::
LOG: statement: ROLLBACK
45 5ad1204b.2d 1606 2018-04-13 21:25:31 UTC ::
LOG: duration: 0.159 ms
44 5ad1204a.2c 1027 2018-04-13 21:25:30 UTC ::
LOG: disconnection: session time: 0:00:02.695 user=*** database=*** host=*** port=***
46 5ad1204d.2e 1 2018-04-13 21:25:33 UTC ::
LOG: connection received: host=*** port=***
46 5ad1204d.2e 2 2018-04-13 21:25:33 UTC ::
LOG: connection authorized: user=*** database=***
46 5ad1204d.2e 3 2018-04-13 21:25:33 UTC ::
LOG: statement: DROP DATABASE "test_cmb_local"
46 5ad1204d.2e 4 2018-04-13 21:25:33 UTC ::
ERROR: database "test_cmb_local" is being accessed by other users
46 5ad1204d.2e 5 2018-04-13 21:25:33 UTC ::
DETAIL: There is 1 other session using the database.
46 5ad1204d.2e 6 2018-04-13 21:25:33 UTC ::
STATEMENT: DROP DATABASE "test_cmb_local"
46 5ad1204d.2e 7 2018-04-13 21:25:33 UTC ::
LOG: disconnection: session time: 0:00:05.213 user=*** database=*** host=*** port=***
45 5ad1204b.2d 1607 2018-04-13 21:25:31 UTC ::
LOG: disconnection: session time: 0:00:07.190 user=*** database=*** host=*** port=***

Notice that when the ROLLBACK command is executed, it is using process ID 45, but when the following disconnect happens, it happens on process ID 44! Then a new connection opens with process ID 46, makes and attempt to drop the database, and fails. This is because the connection responsible for theROLLBACK command is still open. Somehow an extra connection to the database is being opened somewhere between when the database is created and when the tests themselves run. Looking deeper in the log, I found this:

44 5ad1204a.2c 1025 2018-04-13 21:25:30 UTC ::
LOG: statement: SELECT "django_migrations"."app", "django_migrations"."name" FROM "django_migrations"
44 5ad1204a.2c 1026 2018-04-13 21:25:30 UTC ::
LOG: duration: 0.071 ms
45 5ad1204b.2d 1 2018-04-13 21:25:31 UTC ::
LOG: connection received: host=*** port=***

So it looks like just after a query related to finding all migrations, a new connection is opened. What’s that about?

Dive into the code

Thanks to postgresql logging I was able to discover that a new connection was being opened in the time after the database was created and migrations were run but before the tests themselves were executed. My next step was to dig into the code, looking for places where this might happen.

First, a word about pytest fixtures.

In pytest, “fixtures” are functions that are called before your tests. They are used to create the required state for a test. When writing tests for pytest you indicate that a test requires a given fixture by adding the name of the fixture to the list of positional arguments:

@pytest.fixture(scope='function')
def create_a_foo():
"""Ensure there is a foo for tests to use"""
new_foo = Foo()
def test_foo_exists(create_a_foo):
assert find_a_foo()

In pytest, you can also combine fixture functionality. One approach is to add multiple fixture names to a test. But you can also make one fixture “dependent” on others. Again, by adding the names of any existing fixtures you want to depend on as positional arguments to the new fixture.

The pytest-django plugin provides a fixture called db that enables database access for a given test. It looks like this:

@pytest.fixture(scope='function')
def db(request, django_db_setup, django_db_blocker):
if 'transactional_db' in request.funcargnames \
or 'live_server' in request.funcargnames:
getfixturevalue(request, 'transactional_db')
else:
_django_db_fixture_helper(False, request, django_db_blocker)

Given what we’ve learned about pytest fixtures, we can read that the db fixture is dependent on the request fixture (provided by pytest itself) and the django_db_setup and django_db_blocker fixtures.

I had evidence that there was a problem somewhere in the process of getting a database ready for tests. These fixtures provided by pytest-django are responsible for doing that work. So it seemed that I should look at them for more clues.

In the django_db_setup fixture, I found this block:

with django_db_blocker.unblock():
db_cfg = setup_databases(
verbosity=pytest.config.option.verbose,
interactive=False,
**setup_databases_args
)

The setup_databases function invoked here is imported from Django. So the pytest-django fixtures are really just thin wrappers around the functionality that Django itself uses to set up databases for tests. (This is a great approach to writing integration code, because it means that if Django changes how it does database set up, the wrapper will automatically get those updates. No editing will be needed.)

Digging a bit further, I found that the setup_databases function calls thecreate_test_db method, found on the BaseDatabaseCreation class. In that method, I found these statements:

self._create_test_db(verbosity, autoclobber, keepdb)

self.connection.close()

Okay, so it appears that when setting up, Django first creates a new database and then closes its connection to that database. This action cleanly accounts for the work done in the first process ID I saw in the postgresql logs above. Later in the same method I found this code:

call_command(
'migrate',
verbosity=max(verbosity - 1, 0),
interactive=False,
database=self.connection.alias,
run_syncdb=True,
)

Alright! That could account for the new connection established and the queries I was seeing related to Django migrations. If there is a place that the extra connection is being made, it must be pretty close to here. And what do you know, a few lines later, I found this:

self.connection.ensure_connection()

At this point I was overjoyed. The ensure_connection method of the connection class is responsible for making sure that there is an open connection to the database, ready to be used:

def ensure_connection(self):
if self.connection is None:
with self.wrap_database_errors:
self.connect()

At this point, my theory is that somehow self.connection was getting set to None inadvertently and a new connection was being opened when none was needed. I needed to verify this theory though. I added a print statement to the ensure_connection method:

def ensure_connection(self):
if self.connection is None:
print("establishing a new connection") # <
with self.wrap_database_errors:
self.connect()

I re-ran my tests, expecting that on those runs where they failed I would see this output, and on those runs where they succeeded I would not. But my theory was incorrect. I saw the output in both cases. Unfortunately — because stdout was filled with lots of extra output from over-zealous use of log.info— it was hard to be sure if it happened more frequently when the tests failed. What I really wanted to know is where — and under what circumstances — was a new database connection being made.

Inspect the stack

The inspect module from the Python standard library is great in situations like this. It allows you to peer into all sorts of corners, including looking at the stack frames that lead to the call of a particular function.

If you’re not familiar with the term “stack frames”, you have seen them before. When you look at the traceback produced by an exception in your Python code, each line corresponds to a single frame in the “call stack” of your program. Every time a function is called, a new frame is added to the stack, and when the function returns, the corresponding frame is removed from the stack.

I wanted to find out in what contexts the ensure_connection method was being called, so I added this to the code:

def ensure_connection(self):
if self.connection is None:
import inspect # <
curframe = inspect.currentframe() # <
print(inspect.getouterframes(curframe)[1][3]) # <
with self.wrap_database_errors:
self.connect()

The call to inspect.currentframe() returns a frame object representing the current function call that is now underway (ensure_connection). Passing that frame object to inspect.getouterframes produces a list of tuples, called “frame records”. Each frame record in the list contains information about a frame in the call stack. The first frame record in the list is the current frame (returned back to you) and the last frame record is the first call put into the stack. Each frame record contains a number of elements: the frame object, the filename of the function, the line number currently being executed, the function name, a list of lines of context for the current line, and the index position of this line in that context.

I wanted to learn what function was calling ensure_connction. To do so, I printed the fourth element (index 3) in the second frame record (index 1) from the list.

For test runs that succeeded, I saw this:

_nodb_connection
_cursor
_nodb_connection

But when test runs failed, I could see there was an extra call:

_nodb_connection
_cursor
get_autocommit
_nodb_connection

Once I found this, I felt I was getting close to a solution. Was there some condition under which get_autocommit would call ensure_connection? The code told me no:

def get_autocommit(self):
self.ensure_connection()
return self.autocommit

So, if calling get_autocommit always ends up calling ensure_connection, then could there be such a condition further up the call stack? Something must send us down the path that leads to ensure_connection. What I needed to do was print all the frames in the current stack whenever ensure_connection was called:

def ensure_connection(self):
if self.connection is None:
import inspect # <
curframe = inspect.currentframe() # <
for frame in inspect.getouterframes(curframe): # <
filename = frame[1] # <
funcname = frame[3] # <
print("{}:{}".format(filename, funcname)) # <
with self.wrap_database_errors:
self.connect()

And out popped the stack that was apparently causing my trouble.

'...django/db/backends/base/base.py':'ensure_connection'
'...django/db/backends/base/base.py':'get_autocommit'
'...django/db/transaction.py':'__enter__'
'...django/test/testcases.py':'_enter_atomics'
'...django/test/testcases.py':'_fixture_setup'
'...django/test/testcases.py':'_pre_setup'
'...pytest_django/fixtures.py':'_django_db_fixture_helper'
'...pytest_django/fixtures.py':'db'
...

Looking one frame further up the stack from get_autocommit, I found my mystery condition. In Atomic.__enter__ there is code that checks to see if we are already involved in an “atomic” transaction block:

def __enter__(self):
connection = get_connection(self.using)

if not connection.in_atomic_block:
# Reset state when entering an outermost atomic block.
connection.commit_on_exit = True
connection.needs_rollback = False
if not connection.get_autocommit():
# ... BAM ^^^

if connection.in_atomic_block:
# no calls to get_autocommit here.

There is was. A nice, clean answer. When the tests failed at teardown, the path led into the first condition, when they succeeded, into the second. This must mean that the connection was supposed to be in an atomic block, but it was not.

At this point, my debugging sense began to itch. I have a rule of thumb when debugging a problem:

If you’re debugging transaction management you didn’t write yourself, you’re in the wrong place.

I find that when I’m using frameworks and libraries used by thousands of other developers and my bug hunt leads into code this fundamental, then the problem is almost invariably caused by something I am doing. I decided it was time to figure out what was going on with that connection.

In the next frame above this, we find the code setting up atomic transactions for our tests:

@classmethod
def _enter_atomics(cls):
atomics = {}
for db_name in cls._databases_names():
atomics[db_name] = transaction.atomic(using=db_name)
atomics[db_name].__enter__()

I decided to verify which database I was using, and printed out db_name. It always came up default. Herm.

Then I remembered that the problem always came in-between the step of creating the database and running migrations, and the point where tests were running. This code for atomic transactions is used to isolate individual tests, the second step. Perhaps there was a difference between the database used for these connections and the one used for the first step.

I remembered that back when I was looking at the setup_databases code I had seen reference to db_name as well. What was that code?

def setup_databases(
verbosity, interactive, keepdb=False,
debug_sql=False, parallel=0, **kwargs
):
"""Create the test databases."""
test_databases, mirrored_aliases = \
get_unique_databases_and_mirrors()

old_names = []

for signature, (db_name, aliases) in test_databases.items():
first_alias = None
for alias in aliases:
# ...
if first_alias is None:
first_alias = alias
connection.creation.create_test_db(
verbosity=verbosity,
autoclobber=not interactive,
keepdb=keepdb,
serialize=connection.settings_dict.get(
'TEST', {}).get('SERIALIZE', True),
)
# ...
            # Other connections are mirrors of the first one
else:
connections[alias].creation.set_as_test_mirror(
connections[first_alias].settings_dict)

This code is responsible for creating connections to all the databases that will be used. It uses a dict mapping a unique signature to aliases that can be used to refer to that signature. The first of these aliases is used to create the test database, and then all the other aliases are designated to be “mirrors” of that first one.

It was at this point that I remembered that our database setup has a second alias for the database. We use a primary database for writes, maintain replicas of that database, and often use the replicas for read-only operations. I wondered if that could be the cause of our problem and so I decided to print alias each time through this loop. In successful runs it looked like this:

default
replica

But when the tests failed, then every time the names printed like this:

replica
default

And finally I understood what was happening.

The call above to get_unique_databases_and_mirrors iterates through the databases in settings.DATABASES, builds a signature from the connection settings for the database, and then returns a dict with that signature as a key, and a tuple with the name of the database and a tuple of all the aliases for that database.

Here’s what we have in settings.DATABASES for our local development and test settings:

DATABASES = {
'default': {
'ENGINE': 'django.db.backends.postgresql_psycopg2',
# other connection paramters
},
'replica': {
'ENGINE': 'django.db.backends.postgresql_psycopg2',
# exactly the same connection parameters
},
}

These two configurations have exactly the same connection parameters so they result in the same signature. They are “aliases” for each-other.

In Python (before version 3.6) the dict type is unordered. At CMB we are using Python 3.4 for a while longer.

When Django was creating and migrating the test database, the order of the aliases was changing. Sometimes the replica alias was first, and other times the default alias was first. And the first alias is always used.

But when the tests ran and the _pre_setup method of Django’s TestCase was called, the transaction management work was being done on the default database alias. This is because settings.DEFAULT_DB_ALIAS is used to choose a connection.

By default, Django maintains a connection to each of the aliases independently. This meant that when replica was the first alias found, it opened a connection to that alias. That connection remained open while tests were run using a connection to the default alias.

As it turns out, there is documentation that helps to solve this problem. The pytest-django plugin documentation says the following:

Currently pytest-django does not specifically support Django’s multi-database support.

And when you read Django’s documentation on testing with primary and replica databases the solution is immediately apparent:

If you’re testing a multiple database configuration with primary/replica (referred to as master/slave by some databases) replication, this strategy of creating test databases poses a problem. When the test databases are created, there won’t be any replication, and as a result, data created on the primary won’t be seen on the replica.
To compensate for this, Django allows you to define that a database is a test mirror.

Fix the problem

The solution to our problem ended up being quite simple. All I had to do was tell Django that the replica alias of our database was a mirror.

DATABASES = {
'default': {
'ENGINE': 'django.db.backends.postgresql_psycopg2',
# other connection parameters
},
'slave': {
'ENGINE': 'django.db.backends.postgresql_psycopg2',
# exactly the same connection parameters
'TEST': {
'MIRROR': 'default'
}
},
}

This explicitly instructs Django to treat the default alias as the primary, and to treat replica as a mirror when testing. Once this change was made, the tests completed without problems 100% of the time.

Problem Solved!

Learn a lesson

While a journey like this can be fun purely for the tale of the hunt, it’s always better if there are lessons you can take home. Here’s what I learned from this hunt.

Use print() when appropriate.

Breakpoints are terrific. But when you are in code that runs dozens or hundreds of times and you are trying to find small differences, print() is your best friend. Printing information from inside the ensure_connection method allowed me to discover the source of my problem.

Use the power of logging.

Logging can give you access to information that is not otherwise available, especially when debugging problems with external systems like a database. I would never have found the answer to this problem without being able to see the extra process IDs produced by connections to my database.

Use inspect to see into the past.

Sometimes once you find a place where something unexpected happens, you still have to find out how you got there. The inspect module provides some very useful tools for looking into the path that led you to this place. Although I could see the problem from ensure_connection, I still needed to look up the call stack to discover the cause.

Look elsewhere.

My problem appeared when the test database was being torn down. I’ve found a number of references suggesting that forcing all the existing connections closed at that time is a good solution. But the real problem lay elsewhere. Fixing the symptom of a problem does not solve the problem, it just hides it. Be prepared to look elsewhere to find the cause.

Always read the documentation.

This story ended successfully (and was a lot of fun to work through). But it is clear that had I read the available documentation more carefully I could have avoided the entire thing. I think it’s possible that the documentation could be more clear about this case, and I’ve made a pull request to improve it. But I have to remember that just jumping in without reading first has consequences.