Property based testing, hypothesis and finding a bug
I really have fallen in love with test driven development and as much as is reasonable stick to it. This mainly revolves around writing Python unit tests, which makes programming a lot easier (for me). In Namibia I was lucky enough to meet and chat with David MacRiver, the main developer of the hypothesis package. Before that the Axelrod project had already incorporated some of what hypothesis is for: property based testing. This blog post will briefly describe what that is, how it took some hours of sleep away from me (in a good way) and why it’s particularly great for research software.
Property based testing
I’m slowly slowly learning that hypothesis is ridiculously clever but at a very basic level what it allows you to do is take 1 test and instead of testing a single example, test a whole range of parameters.
Here’s a simple example, based on a short demo I gave at the collaborations workshop 2016 in Edinburgh. You can find the slides I gave for that at this link: vknight.org/Talks/2016-03-22-Division-by-11-and-property-based-testing-with-hypothesis/.
Let us write a function that tests divisibility by 11 (assuming that we don’t
know about % 11
. To do this we’ll use the following property:
A number is divisible by 11 if and only if the alternating (in sign) sum of the number’s digits is 0.
For example: 121 is divisible by 11 because (1-2+1=0) (side note: if you’re about to yell at me in the comments please read the rest of this post).
Let’s save below in a file called main.py
:
def divisible_by_11(number):
"""Uses above criterion to check if number is divisible by 11"""
string_number = str(number)
alternating_sum = sum([(-1) ** i * int(d) for i, d
in enumerate(string_number)])
return alternating_sum == 0
Now if we write out some examples we get expected behaviour:
>>> import main
>>> for k in range(10):
... print(main.divisible_by_11(11 * k))
True
True
True
True
True
True
True
True
True
True
But let’s write an actual test suite. Here’s a very basic unittest that
we’ll save in test_main.py
:
import unittest
import main
class TestDivisible(unittest.TestCase):
def test_divisible_by_11(self):
for k in range(10):
self.assertTrue(main.divisible_by_11(11 * k))
self.assertFalse(main.divisible_by_11(11 * k + 1))
# Some more examples
self.assertTrue(main.divisible_by_11(121))
self.assertTrue(main.divisible_by_11(12122))
self.assertFalse(main.divisible_by_11(123))
self.assertFalse(main.divisible_by_11(12123))
The above tests the first 10 numbers divisible by 11 (and that that number + 1 is not) and also some specific tests (121 and 12123).
We then run this by using the following:
$ python -m unittest test_main
.
----------------------------------------------------------------------
Ran 1 test in 0.000s
OK
At this point we could be very happy and proud of ourselves: we have tested well written software that can be shipped and used by researchers world wide to test the divisibility of a number by 11!!!
This is how mathematics breaks.
Let’s write a hypothesis test.
We’ll write the following in test_property_main.py
:
import unittest
import main
from hypothesis import given # This is how we will define inputs
from hypothesis.strategies import integers # This is the type of input we will use
class TestDivisible(unittest.TestCase):
@given(k=integers(min_value=1)) # This is the main decorator
def test_divisible_by_11(self, k):
self.assertTrue(main.divisible_by_11(11 * k))
The above is actually a simpler test than before: we’re only testing that a number that we know is divisible by 11 is in fact getting the expected output from our function.
Let’s run it:
$ python -m unittest test_propert_main
Falsifying example: test_divisible_by_11(self=<test_property_main.TestDivisible testMethod=test_divisible_by_11>, k=19)
F
======================================================================
FAIL: test_divisible_by_11 (test_property_main.TestDivisible)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_property_main.py", line 10, in test_divisible_by_11
def test_divisible_by_11(self, k):
File "/usr/local/lib/python2.7/site-packages/hypothesis/core.py", line 502, in wrapped_test
print_example=True, is_final=True
File "/usr/local/lib/python2.7/site-packages/hypothesis/executors.py", line 57, in default_new_style_executor
return function(data)
File "/usr/local/lib/python2.7/site-packages/hypothesis/core.py", line 103, in run
return test(*args, **kwargs)
File "test_property_main.py", line 11, in test_divisible_by_11
self.assertTrue(main.divisible_by_11(11 * k))
AssertionError: False is not true
----------------------------------------------------------------------
Ran 1 test in 0.058s
FAILED (failures=1)
We get an error! An right at the top we get the Falsifying example
so we see
that our function fails for k=19
. For, k=19
the number being tested is
\(19\times 11=209\). That number is obviously divisible by 11 (by
construction) but it’s alternating sum is in fact 11 which is not 0.
At this point, as described in this previous blog post about divisibility by 11 I can let slip the correct property for divisibility by 11:
A number is divisible by 11 if and only if the alternating (in sign) sum of the number’s digits is divisible by 11.
(There’s a simple algebraic proof of that at the older blog post.)
Now let’s modify our main.py
(note that I’m using a slightly better
modification than the lazy one I gave in the demo at the collaboration
workshop):
def divisible_by_11(number):
"""Uses above criterion to check if number is divisible by 11"""
string_number = str(number)
# Using abs as the order of the alternating sum doesn't matter.
alternating_sum = abs(sum([(-1) ** i * int(d) for i, d
in enumerate(string_number)]))
# Recursively calling the function
return (alternating_sum in [0, 11]) or divisible_by_11(alternating_sum)
Running the tests:
$ python -m unittest test_propert_main
.
----------------------------------------------------------------------
Ran 1 test in 0.043s
OK
As we hoped!
I think the above is a good example of why property based testing and tools like hypothesis are useful for research software.
It helped identify an error in my mathematical thought process, an error that I
in fact thought I had tested properly (using the test_main.py
tests). If it
wasn’t for hypothesis I would have shipped code that redefined what it meant to
be divisible by 11.
If you only came here to get a basic working example of hypothesis you can stop now :)
Obviously the above is a simple example but hypothesis recently found a bug in Axelrod (the python package for reproducing Iterated Prisoner’s Dilemma tournaments).
Axelrod creates tournaments of players picked from one of the (at time of writing) 123 strategies. Here is one of the tests that randomly selects strategies to create tournaments and check that they run:
@given(s=lists(sampled_from(axelrod.strategies),
min_size=2, # Errors are returned if less than 2 strategies
max_size=5, unique=True),
turns=integers(min_value=2, max_value=50),
repetitions=integers(min_value=2, max_value=4),
rm=random_module())
@settings(max_examples=50, timeout=0)
@example(s=test_strategies, turns=test_turns, repetitions=test_repetitions,
rm=random.seed(0))
def test_property_serial_play(self, s, turns, repetitions, rm):
"""Test serial play using hypothesis"""
# Test that we get an instance of ResultSet
players = [strat() for strat in s]
tournament = axelrod.Tournament(
name=self.test_name,
players=players,
game=self.game,
turns=turns,
repetitions=repetitions)
results = tournament.play()
self.assertIsInstance(results, axelrod.ResultSet)
self.assertEqual(len(results.cooperation), len(players))
self.assertEqual(results.nplayers, len(players))
self.assertEqual(results.players, players)
Without going in to precise details, at one point hypothesis found that two particular strategies could not play together. Here is the github issue that documented this. These were:
Backstabber
MindReader
These two strategies did in fact already play in a tournament together. The
main tournament that includes all cheating strategies, but the side effect of
that is that another cheating strategy played BackStabber
first and overwrite
it’s strategy (meaning it could then play nice with MindReader
). Hypothesis
however created a tournament were they were directly opposed and then our bug
occured.
Anyway, it’s now fixed and the test includes a @example
statement to ensure
that the specific bug is in fact fixed: you can see this in the source
code.
In practice, the recommendation is to use a combination of property based tests and traditional example based tests (as there are some specific things that need to be checked) but I think all test suites can be improved by using tools like hypothesis, in particular research software. With research software a robust testing framework helps not only test the code but also the ideas.
Helpful links:
- @DRMacIver: David is the writer of Hypothesis.
- Github repo.
- Documentation.
Finally, if you’re of the irc persuasion, I really recommend dropping in at
#hypothesis
on freenode. Here’s a direct link to the channel on irccloud:
www.irccloud.com/invite?channel=%23hypothesis&hostname=irc.freenode.net&port=6697&ssl=1.
Everyone there has always been very helpful.