Abstract
Recent studies have reported that Support Vector Regression (SVR) has potential as a technique for software development effort estimation. However, its prediction accuracy is heavily influenced by the parameter settings required when employing it. No general guidelines are available for selecting these parameters, whose choice also depends on the characteristics of the dataset being used. This motivated the work described in (Corazza et al. 2010), extended herein. In order to automatically select suitable SVR parameters we proposed an approach based on the meta-heuristic Tabu Search (TS). We designed TS to search for the parameters of both the support vector algorithm and the employed kernel function, namely RBF. We empirically assessed the effectiveness of the approach using different types of datasets (single- and cross-company datasets, Web and non-Web projects) from the PROMISE repository and from the Tukutuku database. A total of 21 datasets were employed to perform a 10-fold or a leave-one-out cross-validation, depending on the size of the dataset. Several benchmarks were taken into account to assess both the effectiveness of TS in setting SVR parameters and the prediction accuracy of the proposed approach with respect to widely used effort estimation techniques. The use of TS allowed us to automatically obtain suitable parameter choices required to run SVR. Moreover, the combination of TS and SVR significantly outperformed all the other techniques. The proposed approach represents a suitable technique for software development effort estimation.





Notes
The same combination of effort estimation measures is used as the objective function in the present paper, so it will be detailed in Section 2.3.
We cannot report the 10 folds used for the Tukutuku datasets since the information included in the Tukutuku database is not publicly available, for confidentiality reasons.
References
Albrecht AJ, Gaffney JE (1983) Software function, source lines of code, and development effort prediction: a software science validation. IEEE Trans Softw Eng 9(6):639–648
Bailey JW, Basili VR (1981) A meta model for software development resource expenditure. Procs. International Conference on Software Engineering, pp 107–116
Braga PL, Oliveira AL, Meira SR (2007) Software effort estimation using machine learning techniques with robust confidence intervals. Procs. IEEE International Conference on Hybrid Intelligent Systems, pp 352–357
Briand L, Emam KE, Surmann D, Wiekzorek I, Maxwell K (1999) An assessment and comparison of common software cost estimation modeling techniques. Procs. International Conference on Software Engineering
Briand L, Langley T, Wiekzorek I (2000) A replicated assessment and comparison of common software cost modeling techniques. Procs. International Conference on Software Engineering, pp 377–386
Briand L, Wieczorek I (2002) Software resource estimation. Encyclopedia of Software Engineering, pp 1160–1196
Burgess CJ, Lefley M (2001) Can genetic programming improve software effort estimation? A comparative evaluation. Inf Softw Technol 43(14):863–873
Chang C-C, Lin C-J (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Cherkassky V, Ma Y (2004) Practical selection of SVM parameters and noise estimation for SVM Regression. Neural Netw 17(1):113–126
Chiu N-H, Huang S-J (2007) The adjusted analogy-based software effort estimation based on similarity distances. J Syst Software 80(4):628–640
Conte SD, Dunsmore HE, Shen VY (1986) Software engineering metrics and models. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA
Conover WJ (1998) Practical nonparametric statistics, 3rd edn. Wiley, New York
Cook RD (1977) Detection of influential observations in linear regression. Technometrics 19:15–18
Corazza A, Di Martino S, Ferrucci F, Gravino C, Mendes E (2009) Applying support vector regression for web effort estimation using a cross-company dataset. Procs. Empirical Software Engineering and Measurement, pp 191–202
Corazza A, Di Martino S, Ferrucci F, Gravino C, Mendes E (2011) Investigating the use of Support Vector Regression for Web Effort Estimation. Empir Softw Eng 16(2):211–243
Corazza A, Di Martino S, Ferrucci F, Gravino C, Sarro F, Mendes E (2010) How effective is Tabu search to configure support vector regression for effort estimation? Procs. International Conference on Predictive Models in Software Engineering, 4
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Costagliola G, Di Martino S, Ferrucci F, Gravino C, Tortora G, Vitiello G (2006) Effort estimation modeling techniques: a case study for web applications. Procs. International Conference on Web Engineering, pp 9–16
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, New York, NY, USA
Desharnais JM (1989) Analyse statistique de la productivité des projets de développement en informatique à partir de la technique des points de fonction. Unpublished Masters thesis, University of Montreal
Di Martino S, Ferrucci F, Gravino C, Mendes E (2007) Comparing size measures for predicting web application development effort: a case study. Procs. Empirical Software Engineering and Measurement, pp 324–333
Ferrucci F, Gravino C, Oliveto R, Sarro F (2009) Using Tabu search to estimate software development effort. Procs. International Conferences on Software Process and Product Measurement. LNCS 5891. Springer-Verlag, Berlin-Heidelberg, pp 307–320
Ferrucci F, Gravino C, Mendes E, Oliveto R, Sarro F (2010) Investigating Tabu search for web effort estimation. Procs. EUROMICRO Conference on Software Engineering and Advanced Applications, pp 350–357
Glover F, Laguna M (1997) Tabu search. Kluwer Academic Publishers, Boston
Hsu C-W, Chang C-C, Lin C-J (2010) A practical guide to support vector classification, available at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explorations 11(1), ACM New York, NY, USA, pp 10–18
Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat 36:1171–1220
Jeffery R, Ruhe M, Wieczorek I (2000) A comparative study of two software development cost modeling techniques using multi-organizational and company-specific data. Inf Softw Technol 42:1009–1016
Kemerer CF (1987) An empirical validation of software cost estimation models. Commun ACM 30(5):416–429
Keerthi S (2002) Efficient tuning of SVM hyper-parameters using radius/margin bound and iterative algorithms. IEEE Trans Neural Netw 13(5):1225–1229
Keerthi S, Lin C-J (2003) Asymptotic behaviors of support vector machines with Gaussian Kernel. Neural Comput 15:1667–1689
Kitchenham BA, Mendes E, Travassos GH (2007) Cross versus within-company cost estimation studies: a systematic review. IEEE Trans Softw Eng 33(5):316–329
Kitchenham B, Pickard LM, MacDonell SG, Shepperd MJ (2001) What accuracy statistics really measure. IEE Proceedings Software 148(3):81–85
Kitchenham BA, Mendes E (2004) A comparison of cross-company and single-company effort estimation models for web applications. Procs. Evaluation & Assessment in Software Engineering, pp 47–55
Kitchenham BA, Mendes E (2009) Why comparative effort prediction studies may be invalid. Procs. International Conference on Predictor Models in Software Engineering
Kitchenham BA (1998) A procedure for analyzing unbalanced datasets. IEEE Trans Softw Eng 24(4):278–301
Kitchenham BA, Pickard L, Pfleeger S (1995) Case studies for method and tool evaluation. IEEE Softw 12(4):52–62
Kocaguneli E, Gay G, Menzies T, Yang Y, Keung JW (2010) When to use data from other projects for effort estimation. Procs. IEEE/ACM international conference on Automated Software Engineering, pp 321–324
Kwok JT, Tsang IW (2003) Linear dependency between ε and the input noise in ε-support vector regression. IEEE Trans Neural Netw 14(3):544–553
Lefley M, Shepperd MJ (2003) Using genetic programming to improve software effort estimation based on general datasets. Procs. GECCO, LNCS 2724, Springer-Verlag, Berlin, Heidelberg, pp 2477–2487
Li YF, Xie M, Goh TN (2009) A study of project selection and feature weighting for analogy based software cost estimation. J Syst Software 82(2):241–252
Mair C, Shepperd M (2005) The consistency of empirical comparisons of regression and analogy-based software project cost estimation. Procs ISESE, pp 509–518
Mattera D, Haykin S (1999) Support vector machines for dynamic reconstruction of a chaotic system. In: Scholkopf B, Burges J, Smola A (eds) Advances in kernel methods: support vector machine. MIT, Cambridge
Maxwell K (2002) Applied statistics for software managers. Software Quality Institute Series, Prentice Hall, Upper Saddle River, NJ, USA
Maxwell K, Wassenhove LS, Dutta S (1999) Performance evaluation of general and company specific models in software development effort estimation. Manag Sci 45(6):787–803
Mendes E (2008) The use of bayesian networks for web effort estimation: further investigation. Procs. International Conference on Web Engineering, pp 203–216
Mendes E, Pollino C, Mosley N (2009) Building an expert-based web effort estimation model using Bayesian Networks. Procs. EASE Conference, pp 1–10
Mendes E (2009) Web cost estimation and productivity benchmarking. ISSSE, LNCS 5413, Publisher: Springer-Verlag, Berlin Heidelberg, pp 194–222
Mendes E, Mosley N, Counsell S (2005a) Investigating web size metrics for early web cost estimation. J Syst Software 77(2):157–172
Mendes E, Di Martino S, Ferrucci F, Gravino C (2008) Cross-company vs. single-company web effort models using the Tukutuku database: an extended study. J Syst Software 81(5):673–690
Mendes E, Mosley N, Counsell S (2003a) Investigating early web size measures for web cost estimation. Procs. Evaluation and Assessment in Software Engineering, pp 1–22
Mendes E, Kitchenham BA (2004) Further Comparison of cross-company and within-company effort estimation models for web applications. Procs. IEEE International Software Metrics Symposium, pp 348–357
Mendes E, Counsell S, Mosley N, Triggs C, Watson I (2003b) Comparative study of cost estimation models for web hypermedia applications. Empir Softw Eng 8(23):163–196
Mendes E, Mosley N, Counsell S (2005b) The need for web engineering: an introduction, web engineering. In: Mendes E, Mosley N (eds). Springer-Verlag, pp 1–28
Miyazaki Y, Terakado M, Ozaki K, Nozaki H (1994) Robust regression for developing software estimation models. J Syst Softw 27(1):3–16
Moser R, Pedrycz W, Succi G (2007) Incremental effort prediction models in agile development using radial basis functions. Procs. International Conference on Software Engineering and Knowledge Engineering, pp 519–522
Oliveira ALI (2006) Estimation of software project effort with support vector regression. Neurocomputing 69(13–15):1749–1753
PROMISE (2011) Repository of empirical software engineering data. http://promisedata.org/repository
Shepperd MJ, Kadoda G (2001) Using simulation to evaluate prediction techniques. Procs. IEEE International Software Metrics Symposium, pp 349–358
Shepperd M, Schofield C (1997) Estimating software project effort using analogies. IEEE Trans Softw Eng 23(11):736–743
Shepperd M, Schofield C, Kitchenham BA (1996) Effort estimation using analogy. Procs. International Conference on Software Engineering, pp 170–178
Shin M, Goel AL (2000) Empirical data modeling in software engineering using radial basis functions. IEEE Trans Softw Eng 26(6):567–576
Scholkopf B, Smola A (2002) Learning with Kernels. MIT Press
Schölkopf B, Sung K, Burges C, Girosi F, Niyogi P, Poggio T, Vapnik V (1997) Comparing support vector machines with Gaussian Kernels to radial basis function classifiers. IEEE Trans Signal Process 45(11):2758–2765
Shukla KK (2000) Neuro-genetic prediction of software development effort. Inf Softw Technol 42(10):701–713
Smola AJ, Schölkopf B (2004) A tutorial on support vector regression. Stat Comput 14(3):199–222
Vapnik V, Chervonenkis A (1964) A note on one class of perceptrons. Automation and Remote Control 25
Vapnik V, Chervonenkis AY (1974) Theory of pattern recognition (in Russian). Nauka, Moscow
Vapnik V (1995) The nature of statistical learning theory. Springer-Verlag
Wieczorek I, Ruhe M (2002) How valuable is company-specific data compared to multi-company data for software cost estimation? Procs. International Software Metrics Symposium, pp 237–246
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions, and all the companies that volunteered data to the Tukutuku database and to the PROMISE repository. The research was also carried out exploiting the computer systems funded by the University of Salerno’s Finanziamento Medie e Grandi Attrezzature (2005).
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Additional information
Editor: Tim Menzies and Gunes Koru
Appendix
1.1 A. Datasets Descriptions
In this appendix we provide further information on the employed datasets from the PROMISE repository and the Tukutuku database. In particular, summary statistics for the employed variables are shown in Tables 6, 7, and 8, and each dataset is detailed in the following.
1.1.1 Albrecht
The Albrecht dataset contains data on 24 applications developed by the IBM DP Services organization with different programming languages (i.e., COBOL, PL/I, or DMS). We employed as independent variables the four types of external input/output elements (i.e., Input, Output, Inquiry, File) used to compute Function Points (Albrecht and Gaffney 1983), and as dependent variable the Effort, quantified in person-hours and representing the time employed to design, develop, and test each application. We excluded from the analysis the number of SLOC.
1.1.2 China
The China dataset contains data on 499 projects developed in China by various software companies in multiple business domains. We employed as independent variables the external input/output elements used to calculate Function Points (i.e., Input, Output, Inquiry, File, Interface) and Effort as dependent variable (PROMISE 2011).
1.1.3 Desharnais
Desharnais (Desharnais 1989) has been widely used to evaluate estimation methods, e.g., (Burgess and Lefley 2001; Ferrucci et al. 2009; Shepperd and Schofield 1997; Shepperd et al. 1996). It contains data about 81 projects, but we excluded four projects that have some missing values, as done in other studies (e.g., Shepperd and Schofield 1997; Shepperd et al. 1996).
As independent variables we employed: TeamExp (i.e., the team experience measured in years), ManagerExp (i.e., the manager experience measured in years), Entities (i.e., the number of entities in the system data model), Transactions (i.e., the number of basic logical transactions in the system), AdjustedFPs (i.e., the adjusted Function Points), and Envergure (i.e., a complex measure derived from other factors defining the environment). We considered as dependent variable the total effort, while we excluded the length of the code. The categorical variable YearEnd was also excluded from the analysis, as done in other works (e.g., Kocaguneli et al. 2010; Shepperd and Kadoda 2001), since this is not information that could influence the effort prediction of new applications. The other categorical variable, namely Languages, was used (as done in Kocaguneli et al. 2010; Shepperd and Schofield 1997) to split the original dataset into three different datasets: Desharnais1 (having 44 observations), Desharnais2 (having 23 observations), and Desharnais3 (having 10 observations), corresponding to Languages 1, 2, and 3, respectively.
1.1.4 Finnish
Finnish contains data on 38 projects from different Finnish companies (Shepperd et al. 1996). In particular, the dataset consists of a dependent variable, the Effort expressed in person-hours, and five independent variables. We decided not to consider the PROD variable because it represents the productivity expressed in terms of Effort and size (FP).
1.1.5 Kemerer
The Kemerer dataset (Kemerer 1987) contains 15 large business applications, 12 of which were written entirely in Cobol. In particular, for each application the number of both adjusted and raw function points is reported (only AdjFP has been exploited in our study). The Effort is the total number of actual hours expended by staff members (i.e., not including secretarial labor) on the project through implementation, divided by 152. We excluded from our analysis the KSLOC variable, which counts the thousands of delivered source instructions, the variable Duration, which represents the project duration in calendar months, and two categorical variables, Software and Hardware, that indicate the software (i.e., Bliss, Cobol, Natural) and the hardware (e.g., IBM 308X, IBM 43XX, DEC Vax) employed in each project, respectively. Note that, differently from the Desharnais dataset, these categorical variables could not be used to create subsets, since the resulting sets would have been too small.
1.1.6 Maxwell
The Maxwell dataset (Maxwell 2002) contains data on 62 projects in terms of 17 features: Function Points and 16 ordinal variables, i.e., number of different development languages used (Nlan), customer participation (T01), development environment adequacy (T02), staff availability (T03), standards used (T04), methods used (T05), tools used (T06), software’s logical complexity (T07), requirements volatility (T08), quality requirements (T09), efficiency requirements (T10), installation requirements (T11), staff analysis skills (T12), staff application knowledge (T13), staff tool skills (T14), and staff team skills (T15). As done for the Desharnais dataset, we used the categorical variables to split the original dataset. In particular, using the three variables App, Source, and TelonUse (the first indicates the application type, the second indicates in-house or outsourced development, and the last indicates whether the Telon CASE tool was employed) we obtained 9 datasets; however, only those datasets having a number of observations greater than the number of features were used in our experimentation. In particular, we employed the set of 29 observations having App equal to 2, the set of 18 observations having App equal to 3, the set of 54 observations having Source equal to 2, and the set of 47 observations having TelonUse equal to 1. In the following we refer to these datasets as MaxwellA2, MaxwellA3, MaxwellS2, and MaxwellT1, respectively.
1.1.7 Miyazaki
The Miyazaki dataset comprises data collected from 48 systems in 20 Japanese companies by the Fujitsu Large Systems Users Group (Miyazaki et al. 1994). We considered the independent variables SCRN (i.e., the number of different input or output screen formats) and FORM (i.e., the number of different forms), as done in (Miyazaki et al. 1994). The dependent variable is the Effort, defined as the number of person-hours needed from system design to system test, including indirect effort such as project management.
1.1.8 Telecom
Telecom includes information on two independent variables, i.e., Changes and Files, and the dependent variable Effort (Shepperd and Schofield 1997). Changes represents the number of changes made as recorded by the configuration management system and Files is the number of files changed by the particular enhancement project.
1.1.9 Tukutuku
The Tukutuku database (Mendes et al. 2005a) contains Web hypermedia systems and Web applications. The former are characterized by the authoring of information using nodes (chunks of information), links (relations between nodes), anchors, access structures (for navigation) and its delivery over the Web. Conversely, the latter represent software applications that depend on the Web or use the Web’s infrastructure for execution and are characterized by functionality affecting the state of the underlying business logic. Web applications usually include tools suited to handle persistent data, such as local file system, (remote) databases, or Web Services.
The Tukutuku database has data on 195 projects, where:
-
projects came mostly from 10 different countries, mainly New Zealand (47%), Italy (17%), Spain (16%), Brazil (10%), United States (4%), England (2%), and Canada (2%);
-
project types are new developments (65.6%) or enhancement projects (34.4%);
-
about dynamic technologies, PHP is used in 42.6% of the projects, ASP (VBScript or .Net) in 13.8%, Perl in 11.8%, J2EE in 9.2%, while 9.2% of the projects used other solutions;
-
the remaining projects used only HTML and/or Javascript;
-
each Web project in the database is characterized by process and product variables.
The features characterizing the web projects have the following meaning:
-
nlang: Number of programming languages adopted in the project.
-
DevTeam: Number of Developers involved in the project.
-
TeamExp: Mean number of years of experience for the team members.
-
TotWP: Total number of Web pages (new and reused).
-
NewWP: Total number of new Web pages.
-
TotImg: Total number of images (new and reused).
-
NewImg: Total number of new images.
-
Fots: Number of features/functions reused without any adaptation.
-
HFotsA: Number of reused high-effort features/functions adapted.
-
Hnew: Number of new high-effort features/functions.
-
totHigh: Total number of high-effort features/functions.
-
FotsA: Number of reused low-effort features adapted.
-
New: Number of new low-effort features/functions.
-
totNHigh: Total number of low-effort features/functions.
-
TotEff: Effort in person-hours (dependent variable).
The Tukutuku database contains also the following categorical variables:
-
TypeProj: Type of project (new or enhancement).
-
DocProc: If project followed defined and documented process.
-
ProImpr: If project team was involved in a process improvement programme.
-
Metrics: If project team was part of a software metrics programme.
1.2 B. Manual Stepwise Regression
We applied MSWR using the technique proposed by Kitchenham (1998). Basically, the idea is to use this technique to select the important independent variables according to the R2 values and the significance of the model obtained employing those variables, and then to use linear regression to obtain the final model.
In our study we employed the variables shown in Tables 6, 7, and 8 during cross validation and we selected the variables for the training set of each split by using the MSWR procedure. In particular, at the first step we identified the numerical variable that had a statistically significant effect on the variable denoting the effort and gave the highest R2. This was obtained by applying simple regression analysis using each numerical variable in turn. Then, we constructed the single variable regression equation with effort as the dependent variable using the most highly (and significantly) correlated input variable and calculated the residuals. In the subsequent step we correlated the residuals with all the other input variables. We continued in this way until there were no more input variables available for inclusion in the model or none of the remaining variables were significantly correlated with the current residuals (Kitchenham 1998). At the end of the procedure, the obtained variables were used to build the estimation model for the considered training set, which was then used to obtain the estimates for the observations in the corresponding validation set.
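The residual-correlation loop described above can be sketched as follows. This is a minimal illustration rather than the tool actually used in the study; the function name `manual_stepwise`, the rough t ≈ 2 significance cut-off (approximating the 0.05 level for moderate sample sizes), and the toy data are our own simplifications.

```python
import numpy as np

def manual_stepwise(X, y):
    """Select predictors one at a time by their correlation with the
    current residuals, in the spirit of Kitchenham's MSWR procedure.
    X: (n, k) matrix of candidate variables; y: (n,) effort values.
    Returns the indices of the selected columns, in selection order."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    residuals = y.astype(float).copy()
    while remaining:
        # Pearson correlation of each remaining variable with the residuals
        corrs = [abs(np.corrcoef(X[:, j], residuals)[0, 1]) for j in remaining]
        best = remaining[int(np.argmax(corrs))]
        r = max(corrs)
        # crude significance check via the t statistic of the correlation
        t = r * np.sqrt((n - 2) / max(1e-12, 1 - r * r))
        if t < 2.0:  # roughly p > 0.05 for moderate n: stop
            break
        # regress the residuals on the chosen variable; keep new residuals
        A = np.column_stack([np.ones(n), X[:, best]])
        coef, *_ = np.linalg.lstsq(A, residuals, rcond=None)
        residuals = residuals - A @ coef
        selected.append(best)
        remaining.remove(best)
    return selected
```

The selected columns would then feed an ordinary linear regression to build the final estimation model for the training set.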
It is worth mentioning that whenever variables were highly skewed they were transformed before being used in the MSWR procedure. This was done to comply with the assumptions underlying stepwise regression (Maxwell 2002) (i.e., residuals should be independent and normally distributed; relationship between dependent and independent variables should be linear). The transformation employed was to take the natural log(Ln), which makes larger values smaller and brings the data values closer to each other (Kitchenham and Mendes 2009). A new variable containing the transformed values was created for each original variable that needed to be transformed. In addition, whenever a variable needed to be transformed but had zero values, the Ln transformation was applied to the variable’s value after adding 1.
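The transformation step amounts to a one-liner; the helper name `ln_transform` is illustrative:

```python
import numpy as np

def ln_transform(values):
    """Natural-log transform applied before MSWR to highly skewed
    variables; the values are shifted by 1 when zeros are present so
    that the logarithm is defined everywhere."""
    v = np.asarray(values, dtype=float)
    return np.log(v + 1) if (v == 0).any() else np.log(v)
```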
To verify the stability of each effort estimation model built using MSWR, the following steps were employed (Kitchenham and Mendes 2004; Kitchenham and Mendes 2009):
-
Use of a residual plot showing residuals vs. fitted values to investigate if the residuals are randomly and normally distributed.
-
Calculate Cook’s distance values (Cook 1977) for all projects to identify influential data points. Any projects with distances higher than 3 × (4/n), where n represents the total number of projects, are immediately removed from the data analysis (Kitchenham and Mendes 2004). Those with distances higher than 4/n but smaller than 3 × (4/n) are removed to test the model stability by observing the effect of their removal on the model. If the model coefficients remain stable and the adjusted R2 (goodness of fit) improves, the highly influential projects are retained in the data analysis.
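For illustration, Cook’s distances for an OLS fit can be computed from the hat matrix as sketched below. The function name and the closing comment summarising the 4/n and 3 × (4/n) screening rules are ours; an actual analysis would typically rely on a statistics package.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an OLS fit.
    X: (n, k) predictor matrix (an intercept column is added here)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    H = A @ np.linalg.pinv(A.T @ A) @ A.T          # hat matrix
    resid = y - H @ y
    p = A.shape[1]                                 # number of coefficients
    mse = resid @ resid / (n - p)
    lev = np.diag(H)                               # leverage of each point
    return (resid ** 2 / (p * mse)) * lev / (1 - lev) ** 2

# screening rule from the appendix: drop projects with D > 3 * (4/n)
# outright; re-fit without those in (4/n, 3 * (4/n)] to check stability
```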
1.3 C. Case-Based Reasoning
To apply CBR we have to choose the similarity function, the number of analogies (i.e., how many similar projects to consider for estimation), and the analogy adaptation strategy for generating the estimate. Relevant project features may also be selected.
In our case study, we applied CBR by employing the tool ANGEL (Shepperd and Schofield 1997), which implements the Euclidean distance, the measure reported in the literature to give the best results (Mendes et al. 2003b). As for the number of analogies, we used 1, 2, and 3 analogies, as suggested in other similar works (Briand et al. 2000; Mendes and Kitchenham 2004). Moreover, to generate the estimate from the selected similar projects, we employed as adaptation strategy the mean of the k analogies. Regarding feature selection, we considered the independent variables statistically correlated with the effort (at level 0.05), obtained by carrying out a Pearson correlation test (Mendes 2008) on the training set of each split. We did not use the feature subset selection of ANGEL since it might be inefficient, as reported in (Briand et al. 1999; Shepperd and Schofield 1997). In addition, all the project attributes considered by the similarity function had equal influence upon the selection of the most similar project(s). We also decided to apply CBR employing all the variables of Table 1 as the set of features, as done for the application of SVR + TS, considering all relevant factors for designers and developers. In the paper we distinguish between the two different applications of CBR, using CBRfss to denote the use of the method with feature selection.
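The core of this CBR setup, Euclidean similarity with mean-of-k adaptation, can be sketched in a few lines. This is an illustrative re-implementation, not ANGEL itself, and it assumes the features have already been normalised so that each attribute has equal influence:

```python
import numpy as np

def cbr_estimate(train_X, train_eff, new_X, k=2):
    """Estimate effort for a new project as the mean effort of its
    k most similar (Euclidean-closest) training projects."""
    dists = np.sqrt(((train_X - new_X) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k analogies
    return train_eff[nearest].mean()         # mean-of-k adaptation
```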
1.4 D. Executions of SVR + TS on training sets
Table 9 reports, for each dataset, some summary statistics of the objective values achieved in the 10 executions of SVR + TS on the training sets. As we can see, the standard deviation of the results is very low, indicating little variability in the results achieved across all the employed datasets.
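For readers unfamiliar with the meta-heuristic, a generic Tabu Search skeleton of the kind used to tune the SVR parameters (C, ε, and the RBF γ) can be sketched as follows. The move operator, the toy log-scale objective, and the hypothetical optimum are illustrative only; they do not reproduce the paper’s actual objective function or neighbourhood design:

```python
import numpy as np

def tabu_search(objective, start, neighbours, n_iter=50, tabu_len=5):
    """Minimise `objective` over parameter tuples, moving to the best
    admissible neighbour at each step while a short-term tabu list
    forbids revisiting recent solutions."""
    current, best = start, start
    best_cost = objective(start)
    tabu = [start]
    for _ in range(n_iter):
        cands = [s for s in neighbours(current) if s not in tabu]
        if not cands:
            break
        current = min(cands, key=objective)   # best admissible move
        tabu.append(current)
        tabu = tabu[-tabu_len:]               # short-term memory
        cost = objective(current)
        if cost < best_cost:
            best, best_cost = current, cost
    return best, best_cost

def neighbours(s):
    # hypothetical move operator: scale one parameter up or down by 2
    return [tuple(v * f if i == j else v for j, v in enumerate(s))
            for i in range(len(s)) for f in (0.5, 2.0)]

# toy objective: log-scale distance to a hypothetical optimum (C, eps, gamma)
target = np.log(np.array([100.0, 0.1, 0.5]))
objective = lambda s: float(np.abs(np.log(np.array(s)) - target).sum())

best, cost = tabu_search(objective, (1.0, 1.0, 1.0), neighbours)
```

In the study itself, the objective evaluated on each candidate parameter setting is the combination of effort estimation accuracy measures mentioned in the notes, computed by running SVR on the training data.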
1.5 E. Folds of China dataset
Table 10
Cite this article
Corazza, A., Di Martino, S., Ferrucci, F. et al. Using tabu search to configure support vector regression for effort estimation. Empir Software Eng 18, 506–546 (2013). https://doi.org/10.1007/s10664-011-9187-3