This website contains my PhD thesis, together with all the tools, data sources and databases that I used to complete it. These materials should make it possible to reproduce all the results shown in the thesis. See below for more details.
A statistical examination of the properties and evolution of libre software
- Download a copy of the thesis (PDF, 7.9 MB)
The text of the thesis was frozen on September 29th, 2008. The above link always points to the latest version, which may have changed since the frozen version. For a list of changes made after that frozen version, have a look at the ChangeLog file.
I have used two main data sources for this thesis:
Although all the data is available in the databases described below, I include here lists (in plain text, each field separated by a tab) of the case studies used in this thesis, so that the results can be reproduced using data sources other than those included here:
- List of all ports retrieved (second column of table 3.1) (12010 ports)
- List of all ports that only contain C source code (again second column of 3.1, but only ports with source code written in C) (6587 ports)
- List of selected ports that exclusively contain C source code, excluding repeated files and other automated files (last column of table 3.1) (6556 ports)
- List of SF.net projects (properties shown in table 3.4) (3821 projects)
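The lists above are plain text with tab-separated fields, so they can be read with a few lines of Python. The following is only a minimal sketch, written for a current Python 3 interpreter: the function name `read_case_studies` and the sample rows are hypothetical, and nothing is assumed here about the actual columns of each list.

```python
import csv
import io


def read_case_studies(handle):
    """Return one list of fields per non-empty line of a tab-separated file."""
    # QUOTE_NONE keeps quote characters in the data as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


# Hypothetical sample standing in for one of the downloaded lists.
sample = io.StringIO("port-a\t1200\nport-b\t340\n")
rows = read_case_studies(sample)
print(len(rows), "case studies read")
```

To read a downloaded list instead of the sample, open the file in text mode and pass the file object to `read_case_studies`.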
To use the data and scripts included on this page, you will need to install the following tools:
- MySQL (version >= 4)
- Python (version >= 2.4)
- GNU R (version >= 2.6)
- RMySQL module for GNU R
You need a Unix environment to install these tools. If you are a newcomer, I recommend installing Ubuntu. For the thesis itself, I used FreeBSD, Ubuntu and Debian GNU/Linux.
The versions listed are the ones that I used, but the data and scripts will probably work with older versions too.
A detailed howto for reproducing the results of this thesis will be published on this website in the coming months. In the meantime, you can try to reuse the scripts, GNU R images and databases that I used in my thesis.
The scripts below generate the databases that I used for my thesis. You do not need these scripts to reuse my databases, which can be used directly. I include the scripts only in case you want to reuse them to create your own databases.
- Ports set
These scripts download the source code of all the ports of a FreeBSD system. They then measure the size of every file (LOC and SLOC) and identify its programming language. Next, they compute some complexity metrics for all the files written in C. This set of scripts generates the metrics database.
Download: ports.tar.bz2 (tar.bz2, 6.8 KB) | ports.tar.gz (tar.gz, 6.7 KB) | Sources directory
- Correlations set
Although most of the linear correlations are calculated using the GNU R images described below, the correlations at the category level are calculated using this set of scripts. The data obtained with these scripts were, in the end, not included in the thesis. I make the scripts available in case you want to see how to calculate linear correlations using Python and GNU R, reading directly from a MySQL database. This set of scripts uses the metrics database and generates some additional tables.
Download: correlations.tar.bz2 (tar.bz2, 3.2 KB) | correlations.tar.gz (tar.gz, 3.0 KB) | Sources directory
- SF.net set
These scripts obtain the latest version of the sources of all the projects in SF.net that have a CVS repository. They then measure each sources directory and report the total SLOC of every project. The rest of the analysis of the SF.net database is done using the GNU R images shown below. This set of scripts generates the SF.net database.
Download: sf.net.tar.bz2 (tar.bz2, 2.0 KB) | sf.net.tar.gz (tar.gz, 1.8 KB) | Sources directory
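To illustrate the kind of calculation that the correlations set performs, the following is a minimal pure-Python sketch of Pearson's linear correlation coefficient. It is not the code distributed in the tarballs above: the function name `pearson_r` and the sample values are hypothetical, and it works on in-memory lists rather than reading from the MySQL metrics database.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross products and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values, for illustration only (not taken from the thesis data).
sloc = [120, 340, 560, 910]
complexity = [4, 9, 15, 22]
print(pearson_r(sloc, complexity))
```

A coefficient close to 1 or -1 indicates a strong linear relationship between the two metrics, while a value near 0 indicates none.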
GNU R images
All the results, figures and tables shown in this thesis were obtained using GNU R, a statistical package available for several systems (Windows, GNU/Linux, *BSD, etc.).
The data needed to obtain all the results were extracted from the databases shown below. If you want to reproduce the results, you can use the image files included here directly.
To load the RData files, use the load() function of GNU R. To load the command history, use the loadhistory() function. With the command history, you can browse all the commands that I used to obtain the results.
The RData and Rhistory files are distributed in the same tarball for each one of the two datasets:
- Metrics dataset
This dataset contains the data stored in the metrics database, which was obtained using the ports set of scripts shown above. You can use the data contained in these files directly to obtain the results, without executing the scripts or creating any database.
Download: metrics.tar.bz2 (tar.bz2, 69 MB) | metrics.tar.gz (tar.gz, 69 MB) | Directory
- SF.net dataset
This dataset contains the data stored in the SF.net database, which was obtained using the SF.net set of scripts shown above, together with the CVSAnaly SF and FLOSSMole datasets. You can use the data contained in these files directly to obtain the results, without executing the scripts or creating any database.
Download: sf.net.tar.bz2 (tar.bz2, 489 KB) | sf.net.tar.gz (tar.gz, 480 KB) | Directory
I have used two databases for the thesis:
- Metrics database
This database contains size metrics for all the files in all the ports of FreeBSD. It also identifies the programming language of each file. If the file is written in C, complexity metrics are also available.
Download: metrics.sql.bz2 (MySQL, 172 MB) | metrics.sql.gz (MySQL, 217 MB)
- SourceForge.net CVS commits and modification requests database
This database contains information about the daily number of changes and modification requests for all the projects in the CVSAnaly SourceForge dataset. The information in this database is aggregated. If you want access to the raw data, please use the CVSAnaly SF dataset and the FLOSSMole SF dataset.
Download: sf.sql.bz2 (MySQL, 11 MB) | sf.sql.gz (MySQL, 14 MB)
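The dumps above are compressed, so they have to be decompressed before MySQL can read them. As a minimal sketch, assuming a current Python 3 interpreter, the helper below (with the hypothetical name `decompress_dump`) streams a .bz2 file to a plain SQL file without loading it all into memory. The demonstration uses a tiny made-up dump in a temporary directory, not the real database files.

```python
import bz2
import os
import shutil
import tempfile


def decompress_dump(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream a bz2-compressed dump to an uncompressed file, chunk by chunk."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Demonstration with a tiny made-up dump (not one of the real databases).
with tempfile.TemporaryDirectory() as workdir:
    compressed = os.path.join(workdir, "example.sql.bz2")
    plain = os.path.join(workdir, "example.sql")
    with bz2.open(compressed, "wb") as handle:
        handle.write(b"CREATE TABLE example (id INT);\n")
    decompress_dump(compressed, plain)
    with open(plain, "rb") as handle:
        restored = handle.read()
    print(restored.decode("ascii"))
```

The resulting .sql file can then be fed to the mysql command-line client on its standard input, against a database that you have created beforehand.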
About the author
Israel Herraiz <herraiz _at_ gsyc.es>
Feel free to send any comment or suggestion about this thesis to the above e-mail address.
See my profile page at Libresoft.es for more details.