Because distributed version control systems keep the whole history in local storage, access to commit metadata, and to the commits themselves, is usually a lot faster than with conventional (centralized) version control systems. Additionally, Git provides clever tooling that exposes information about the commits, which can be nicely visualized by means of diffstat and the like, or exported for further analysis with other tools.
I'll describe the process of getting the data into GoodData, a web-based data analysis tool with collaboration features somewhat resembling those of social networking web sites. With your Git commit data there, you can create reports similar to LWN's Who wrote... articles for Linux kernel releases, or perhaps run an analysis similar to what Ohloh does for your project.
Perl has a rather stable and active developer community and its source code is publicly available, so its commit history will serve our purpose perfectly. Let's obtain the code from the Git repository first.
$ git clone git://perl5.git.perl.org/perl.git
$ cd perl
For illustration purposes we'll use just the tail of the development history: everything since the 5.10.0 release, which was tagged in December 2007.
$ git log --date=short --format="format:%h:%aN:%aE:%cd:%ct" --shortstat perl-5.10.0..HEAD |
awk 'BEGIN {print "Commit SHA1 Hash:" \
"Author Name:" "Author e-mail:" "Date:" "Timestamp:" \
"Files touched:" "Lines added:" "Lines deleted"}
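# Stat line: scan for the counts, since newer Git omits the insertions or deletions part when it is zero.
/^ / {n=0; a=0; d=0
  for (i=2; i<=NF; i++) {if ($i ~ /^insertion/) a=$(i-1); if ($i ~ /^deletion/) d=$(i-1)}
  print ":"$1":"a":"d; next}
# Commit line: pad the previous commit with zeros if it had no stat line (a merge, for example).
/^[^ ]/ {if (n) {print ":0:0:0"}; n=1; printf "%s", $0}
END {if (n) {print ":0:0:0"}}' >history.csv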
There is most likely a better way of obtaining the line changes than piping shortstat output through awk, but this is the fastest one I could come up with. It would probably have been a better idea to use the --numstat flag and preserve the per-file change information.
Note that we stick a header line in there. That's not strictly required -- we could add it in the UI instead.
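For the curious, here is a rough sketch of what the --numstat variant could look like. It is not what was used for the upload described here. The function name numstat_to_csv and the output file name are made up for illustration, and the sketch assumes that neither author names nor file paths contain a colon. Commits without any per-file stats (merges, for example) simply produce no rows.

```shell
# Hypothetical helper: turn `git log --numstat` output (read from stdin)
# into one colon-separated row per changed file.
numstat_to_csv() {
    awk -F'\t' '
    BEGIN {
        print "Commit SHA1 Hash:Author Name:Author e-mail:Date:Timestamp:" \
              "Lines added:Lines deleted:File"
    }
    # numstat line: added<TAB>deleted<TAB>path ("-" marks a binary file)
    /^[0-9-]+\t[0-9-]+\t/ {
        added = ($1 == "-") ? 0 : $1
        deleted = ($2 == "-") ? 0 : $2
        print commit ":" added ":" deleted ":" $3
        next
    }
    # any other non-empty line is a commit header from --format
    NF { commit = $0 }
    '
}

# Usage, inside the perl checkout:
#   git log --date=short --format="format:%h:%aN:%aE:%cd:%ct" --numstat \
#       perl-5.10.0..HEAD | numstat_to_csv >history-files.csv
```

Binary files are counted as zero lines added and deleted here, which is a simplification; you might prefer to drop them instead.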
(Video available in Theora, H.264 and on YouTube)
During the upload of history.csv, we mapped the dates to a date dimension.
Now let's create some reports.
The first report visualizes changes in the number of lines added and removed over time. Given that the Perl code is currently frozen for the 5.12.0 release, the peaks probably aren't merges of big features or big refactorings; more likely they are just scripted bulk changes.
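If you want to cross-check the peaks without the web UI, the same aggregation can be approximated locally. This is only a sketch: lines_per_month is a hypothetical helper name, and it assumes the input has the eight colon-separated columns from the header of history.csv. The fields are counted from the end of the line, so a stray colon in an author name doesn't shift the numbers.

```shell
# Hypothetical helper: sum added and deleted lines per month.
# Reads the colon-separated history.csv format from stdin.
lines_per_month() {
    awk -F: '
    NR == 1 { next }                       # skip the header line
    {
        month = substr($(NF-4), 1, 7)      # "YYYY-MM" from the Date column
        added[month] += $(NF-1)            # Lines added
        deleted[month] += $NF              # Lines deleted
    }
    END {
        for (m in added) print m ":" added[m] ":" deleted[m]
    }' | sort
}

# Usage:
#   lines_per_month <history.csv
```

Each output row is month:added:deleted, sorted by month.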
(Video available in Theora, H.264 and on YouTube)
I found this report a bit more interesting. It shows committer activity as the number of commits made. It's a rather simple way to see which developers are most active -- which, of course, are the culprits when the code breaks :)
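The numbers behind such a report are easy to approximate on the command line as well. Again a sketch only: commits_per_author is a made-up name, it counts by the Author Name column of history.csv (the file has no separate committer column), and it assumes author names contain no colon.

```shell
# Hypothetical helper: count commits per author name.
# Reads the colon-separated history.csv format from stdin.
commits_per_author() {
    awk -F: '
    NR == 1 { next }                       # skip the header line
    { count[$2]++ }                        # column 2 is Author Name
    END {
        for (a in count) print count[a] ":" a
    }' | sort -t: -k1,1nr -k2
}

# Usage:
#   commits_per_author <history.csv | head
```

The output is count:name, busiest authors first.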
(Video available in Theora, H.264 and on YouTube)
I hope you liked it, though if you're a bit familiar with BI, which I'm not, you could certainly create nicer examples :) Have fun!
Source code for the entries and the scripts that format this site is available on GitHub. The text of journal entries is licensed under the CC-BY-SA license.
Mail questions, comments and pizza to lkundrak@v3.sk