Opportunities and Challenges Applying Functional Data Analysis to the Study of Open Source Software Evolution

Title: Opportunities and Challenges Applying Functional Data Analysis to the Study of Open Source Software Evolution
Publication Type: Journal Article
Year of Publication: 2006
Authors: Stewart, Katherine J., Darcy, David P., and Daniel, Sherae L.
Secondary Title: Statistical Science
Publisher: Institute of Mathematical Statistics
ISSN Number: 0883-4237
Keywords: complexity, evolution, fda, java, lines of code, loc, release history, scm, size, sourceforge

This paper explores the application of functional data analysis (FDA) as a means to study the dynamics of software evolution in the open source context. Several challenges in analyzing the data from software projects are discussed, an approach to overcoming those challenges is described, and preliminary results from the analysis of a sample of open source software (OSS) projects are provided. The results demonstrate the utility of FDA for uncovering and categorizing multiple distinct patterns of evolution in the complexity of OSS projects. These results are promising in that they demonstrate some patterns in which the complexity of software decreased as the software grew in size, a particularly novel result. The paper reports preliminary explorations of factors that may be associated with decreasing complexity patterns in these projects. The paper concludes by describing several next steps for this research project as well as some questions for which more sophisticated analytical techniques may be needed.


"As part of a larger project, data were collected on 105 OSS projects hosted online at Sourceforge (sf.net)." "...we limited our data collection to projects that use only the Java programming language and were listed in the Internet and System Networking domains." "... only including these projects that use an OSI approved license..." "had to have posted at least one file on the Sourceforge site as of the time of our initial project selection Fall 2002" "Data were collected on the published release history of each project that met the screening criteria. Each release of each project was analyzed to calculate CplXLCoh. The size of each release was measured using a calculation of the number of lines of code (LOC)"
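
The excerpts above do not say how LOC was calculated. As an illustration only, the following is a minimal sketch of one common convention for Java source: count every line that holds at least one code character, skipping blank lines and comments. The function names count_loc and release_size are hypothetical and are not taken from the paper, and the counting rule is an assumption rather than the authors' method.

```python
def count_loc(java_source: str) -> int:
    """Count lines that contain code, ignoring blank lines and comments.

    Assumed convention (not from the paper): a line counts if any
    non-whitespace character on it lies outside // and /* */ comments.
    Limitation: a "/*" inside a string literal is treated as a comment start.
    """
    loc = 0
    in_block = False  # True while inside a /* ... */ comment
    for line in java_source.splitlines():
        has_code = False
        i = 0
        while i < len(line):
            if in_block:
                end = line.find("*/", i)
                if end == -1:
                    break  # the comment continues on the next line
                in_block = False
                i = end + 2
            elif line.startswith("//", i):
                break  # the rest of the line is a comment
            elif line.startswith("/*", i):
                in_block = True
                i += 2
            else:
                if not line[i].isspace():
                    has_code = True
                i += 1
        if has_code:
            loc += 1
    return loc


def release_size(source_files: list[str]) -> int:
    """Size of one release: total LOC over the text of its source files."""
    return sum(count_loc(text) for text in source_files)
```

Applying release_size to each release in a project's published history would give a size series of the kind that the excerpts pair with the per-release complexity measure.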