I think this is how to crawl the history of a git repository
Dec 5, 2015
This blog post is a direct application of Cunningham’s
Law, which states that “the
best way to get the right answer on the Internet is not to ask a question, it’s
to post the wrong answer”. With the other core developers of the
Axelrod library we’re writing a
paper and I wanted to see the evolution of a particular property of the library
through the 2000+ commits (mainly to include a nice graph in the paper). This
post will detail how I’ve cycled through all the commits and recorded the
particular property I’m interested in. EDIT: thanks to Mario for the comments:
see the edits in bold for what I didn’t quite get right.
The Axelrod library is a
collaborative project that allows anyone to submit strategies for the iterated
prisoner’s dilemma via pull request (read more about this here:
axelrod.readthedocs.org/en/latest/).
When the library was first put on GitHub it had 6 strategies; it currently has
This figure can be obtained by simply running:
The goal of this post is to obtain the plot below:
EDIT: here is the correct plot:
Here is how I’ve managed that:
1. Write a script that imports the library and writes the required data to a
   file.
2. Write another script that goes through the commits and runs the previous
   script.
So first of all here’s the script that gets the number of strategies:
The (very loose) error handling is because any given commit might or might not
be able to run at all (for a number of reasons). The command line arguments are
so that my second script can pass info about the commits (date and hash).
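A minimal sketch of what such a counting script could look like (an illustration, not the exact script used for this post). It assumes the library exposes its strategies as a list called `axelrod.strategies`, which may not hold for every commit, and that the script is called with the commit hash and date as its two arguments; the function names are made up for the example.

```python
import importlib
import sys


def count_strategies(module_name="axelrod"):
    """Return the number of strategies, or None if this commit cannot be used."""
    try:
        module = importlib.import_module(module_name)
        # Assumption: the strategies live in a list attribute named `strategies`.
        return len(module.strategies)
    except Exception:
        # Deliberately loose: an old commit can fail to import for many reasons.
        return None


def format_record(commit_hash, date, count):
    """Build one comma-separated line: hash, date, number of strategies."""
    return "{},{},{}".format(commit_hash, date, "" if count is None else count)


# Assumed calling convention: python count.py <commit hash> <commit date>
if __name__ == "__main__" and len(sys.argv) == 3:
    print(format_record(sys.argv[1], sys.argv[2], count_strategies()))
```

A commit that fails to import produces a line with an empty count rather than stopping the whole crawl.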
Here is the script that walks the GitHub repository:
Now, I am not actually sure if I need the 10 seconds of sleep in there but it
seems to make things a little more reliable (this is where I’m hoping some
knowledgeable kind soul will point out something isn’t quite right).
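For reference, a commit walker along these lines can be sketched as follows. This is an illustration under assumptions, not the script used for the post: it assumes `git` is on the PATH, that `count_script` is an absolute path to a counting script that takes a hash and a date, and the `pause` parameter stands in for the sleep discussed above. All function names are hypothetical.

```python
import subprocess
import sys
import time


def parse_log(text):
    """Turn `git log --format='%H %aI'` output into (hash, date) pairs."""
    commits = []
    for line in text.splitlines():
        if line.strip():
            commit_hash, date = line.split(" ", 1)
            commits.append((commit_hash, date))
    return commits


def list_commits(repo):
    """List every commit reachable from the current HEAD, oldest first."""
    out = subprocess.check_output(
        ["git", "log", "--reverse", "--format=%H %aI"], cwd=repo, text=True
    )
    return parse_log(out)


def walk(repo, count_script, pause=0.0):
    """Check out each commit in turn and run the counting script against it."""
    for commit_hash, date in list_commits(repo):
        subprocess.check_call(["git", "checkout", "--quiet", commit_hash], cwd=repo)
        # call() rather than check_call(): some commits are expected to fail.
        subprocess.call([sys.executable, count_script, commit_hash, date], cwd=repo)
        time.sleep(pause)
```

The list of commits is read once, before the first checkout, so moving HEAD around does not change what gets visited. The walk leaves the repository on a detached HEAD, so the original branch needs checking out again afterwards, and the counting script is best kept outside the repository so that a checkout cannot remove it.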
Here is an animated gif of the library as the script checks through the commits
(I used a sleep of 0.1 seconds here, and cut it off at the beginning):
(You can find a video version of the above at the record.it site.)
The data set from above looks like this:
That’s all great, and the plot above can then be drawn straightforwardly.
The thing is: I’m not convinced it’s worked as I had hoped.
Indeed, commit
c7dc2d22ff2e300098cd9b29cd03080e01d64879
took place on the 18th of June and added 3 strategies but it’s not in the data
set (or indeed in the plot).
Also, for some reason the data set gets these lines at some point (here be
gremlins…) ?????:
What’s more confusing is that it’s not completely wrong: the plot does look
‘ok’ overall (the number of strategies is correct at the beginning and the end,
and various commits are right there). So does anyone know why the above doesn’t
work properly?
I’m really hoping this xkcd comic kicks in and someone
tells me what’s wrong with what I’ve done:
EDIT: Big thanks to Mario Wenzel below in the comments for figuring out
everything that wasn’t quite right.
Here’s the script to count the strategies (writing to file instead of piping
and also with correct error catching to deal with changes within the
library):
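The file-writing part of that idea can be sketched with the standard `csv` module. This is only an illustration: the column order (hash, date, count) and the function names are assumptions made for the example, not the layout of the actual data set.

```python
import csv


def append_record(path, commit_hash, date, count):
    """Append one row (hash, date, count) to the CSV file at `path`."""
    with open(path, "a", newline="") as handle:
        csv.writer(handle).writerow([commit_hash, date, count])


def read_records(path):
    """Read the rows back as (hash, date, count) tuples, with count as an int."""
    with open(path, newline="") as handle:
        return [(h, d, int(c)) for h, d, c in csv.reader(handle)]
```

Appending from inside the counting script means the walker no longer has to capture and redirect output, and the rows can be read straight back in for plotting.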
Here is the modified script to roll through the commits (basically the same
as before, but it calls the other script with the -B flag, so that Python
writes no compiled .pyc files that a later commit could import by mistake, and
it no longer needs to sleep):
It looks like you should delete all .pyc files from the repository in
question and run the second script with the -B flag.
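Those two steps can be sketched as below. This is a hedged illustration rather than the code used for the post, and the function names are hypothetical. Note that `-B` only stops Python from writing new .pyc files; it does not remove ones that already exist, which is why the clean-up comes first.

```python
import sys
from pathlib import Path


def remove_pyc_files(repo):
    """Delete every .pyc file under `repo` and return how many were removed."""
    removed = 0
    # Collect the paths first so nothing is deleted while the tree is scanned.
    for path in list(Path(repo).rglob("*.pyc")):
        path.unlink()
        removed += 1
    return removed


def counting_command(count_script, commit_hash, date):
    """Build the interpreter call with -B so no new .pyc files are written."""
    return [sys.executable, "-B", count_script, commit_hash, date]
```

The list returned by `counting_command` can be handed directly to `subprocess.call`.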