{ "data_id": "53", "name": "kc1", "exact_name": "kc1", "version": 1, "version_label": null, "description": "**Author**: \n**Source**: Unknown - Date unknown \n**Please cite**: \n\n%-*- text -*-\n%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\nThis is a PROMISE Software Engineering Repository data set made publicly\navailable in order to encourage repeatable, verifiable, refutable, and\/or\nimprovable predictive models of software engineering.\n\nIf you publish material based on PROMISE data sets then, please\nfollow the acknowledgment guidelines posted on the PROMISE repository\nweb page http:\/\/promise.site.uottawa.ca\/SERepository .\n%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n1. Title\/Topic: KC1\/software defect prediction\n2. Sources:\n\n-- Creators: NASA, then the NASA Metrics Data Program,\n-- http:\/\/mdp.ivv.nasa.gov. Contacts: Mike Chapman,\nGalaxy Global Corporation (Robert.Chapman@ivv.nasa.gov)\n+1-304-367-8341; Pat Callis, NASA, NASA project manager\nfor MDP (Patrick.E.Callis@ivv.nasa.gov) +1-304-367-8309\n\n-- Donor: Tim Menzies (tim@barmag.net)\n\n-- Date: December 2 2004\n3. Past usage:\n\n1. How Good is Your Blind Spot Sampling Policy?; 2003; Tim Menzies\nand Justin S. Di Stefano; 2004 IEEE Conference on High Assurance\nSoftware Engineering (http:\/\/menzies.us\/pdf\/03blind.pdf).\n\n-- Results:\n\n-- Very simple learners (ROCKY) perform as well in this domain\nas more sophisticated methods (e.g. J48, model trees) for\npredicting defects\n\n-- Many learners have very low false alarm rates.\n\n-- Probability of detection (PD) rises with effort and rarely\nrises above it.\n\n-- High PDs are associated with high PFs (probability of\nfalse alarm)\n\n-- PD, PF, effort can change significantly while accuracy\nremains essentially stable\n\n-- With two notable exceptions, detectors learned from one\ndata set (e.g. KC2) have nearly the same properties when\napplied to another (e.g. 
PC2, KC2). Exceptions:\n-- LinesOfCode measures generate wider inter-data-set variances;\n-- Precision's inter-data-set variances vary wildly\n\n2. \"Assessing Predictors of Software Defects\", T. Menzies and\nJ. DiStefano and A. Orrego and R. Chapman, 2004,\nProceedings, workshop on Predictive Software Models, Chicago,\nAvailable from http:\/\/menzies.us\/pdf\/04psm.pdf.\n-- Results:\n\n-- From KC2, Naive Bayes generated PDs of 45% with PF of 10%\n\n-- Naive Bayes out-performs J48 for defect detection\n\n-- When learning on more and more data, little improvement is\nseen after processing 300 examples.\n\n-- PDs are much higher from data collected below the sub-sub-\nsystem level.\n\n-- Accuracy is a surprisingly uninformative measure of success\nfor a defect detector. Two detectors with the same accuracy\ncan have widely varying PDs and PFs.\n4. Relevant information:\n\n\n-- KC1 is a \"C++\" system implementing storage management for\nreceiving and processing ground data\n\n-- Data comes from McCabe and Halstead feature extractors of\nsource code. These features were defined in the 70s in an attempt\nto objectively characterize code features that are associated with\nsoftware quality. The nature of association is under dispute.\nNotes on McCabe and Halstead follow.\n\n-- The McCabe and Halstead measures are \"module\"-based where a\n\"module\" is the smallest unit of functionality. 
In C or Smalltalk,\n\"modules\" would be called \"function\" or \"method\" respectively.\n\n-- Defect detectors can be assessed according to the following measures:\n\n                                        module actually has defects\n                                       +-------------+------------+\n                                       |     no      |     yes    |\n                                 +-----+-------------+------------+\nclassifier predicts no defects   |  no |      a      |      b     |\n                                 +-----+-------------+------------+\nclassifier predicts some defects | yes |      c      |      d     |\n                                 +-----+-------------+------------+\n\naccuracy = acc = (a+d)\/(a+b+c+d)\nprobability of detection = pd = recall = d\/(b+d)\nprobability of false alarm = pf = c\/(a+c)\nprecision = prec = d\/(c+d)\neffort = amount of code selected by detector\n= (c.LOC + d.LOC)\/(Total LOC)\n\nIdeally, detectors have high PDs, low PFs, and low\neffort. This ideal state rarely happens:\n\n-- PD and effort are linked. The more modules that trigger\nthe detector, the higher the PD. However, effort also\nincreases\n\n-- High PD or low PF comes at the cost of high PF or low PD\n(respectively). This linkage can be seen in a standard\nreceiver operator curve (ROC). Suppose, for example, LOC > x\nis used as the detector (i.e. we assume large modules have\nmore errors). LOC > x represents a family of detectors. At\nx=0, EVERY module is predicted to have errors. This detector\nhas a high PD but also a high false alarm rate. At x=max LOC, NO\nmodule is predicted to have errors. This detector has a low\nfalse alarm rate but won't detect anything at all. At 0 but does not reach it.\n\n-- The line pf=pd on the above graph represents the \"no information\"\nline. If pf=pd then the detector is pretty useless. The better\nthe detector, the more it rises above PF=PD towards the \"sweet spot\".\n\nNOTES ON MCCABE\/HALSTEAD\n========================\nMcCabe argued that code with complicated pathways is more\nerror-prone. His metrics therefore reflect the pathways within a\ncode module.\n@Article{mccabe76,\ntitle \t= \"A Complexity Measure\",\nauthor \t= \"T.J. 
McCabe\",\npages \t= \"308--320\",\njournal = \"IEEE Transactions on Software Engineering\",\nyear \t= \"1976\",\nvolume \t= \"2\",\nmonth \t= \"December\",\nnumber \t= \"4\"}\n\nHalstead argued that code that is hard to read is more likely to be\nfault prone. Halstead estimates reading complexity by counting the\nnumber of concepts in a module; e.g. number of unique operators.\n@Book{halstead77,\nAuthor \t = \"M.H. Halstead\",\nTitle \t = \"Elements of Software Science\",\nPublisher = \"Elsevier\",\nYear \t = 1977}\n\nWe study these static code measures since they are useful, easy to\nuse, and widely used:\n\n-- USEFUL: E.g. this data set can generate highly accurate\npredictors for defects\n\n-- EASY TO USE: Static code measures (e.g. lines of code, the\nMcCabe\/Halstead measures) can be automatically and cheaply\ncollected.\n\n-- WIDELY USED: Many researchers use static measures to guide\nsoftware quality predictions (see the reference list in the above\n\"blind spot\" paper). Verification and validation (V&V) textbooks\nadvise using static code complexity measures to decide which\nmodules are worthy of manual inspections. Further, we know of\nseveral large government software contractors that won't review\nsoftware modules _unless_ tools like McCabe predict that they are\nfault prone. Hence, defect detectors have a major economic impact\nwhen they may force programmers to rewrite code.\n\nNevertheless, the merits of these metrics have been widely\ncriticized. Static code measures are hardly a complete\ncharacterization of the internals of a function. Fenton offers an\ninsightful example where the same functionality is achieved using\ndifferent programming language constructs resulting in different\nstatic measurements for that module. Fenton uses this example to\nargue the uselessness of static code measures.\n@book{fenton97,\nauthor = \"N.E. Fenton and S.L. 
Pfleeger\",\ntitle = {Software metrics: a Rigorous \\& Practical Approach},\npublisher = {International Thompson Press},\nyear = {1997}}\n\nAn alternative interpretation of Fenton's example is that static\nmeasures can never be a definite and certain indicator of the\npresence of a fault. Rather, defect detectors based on static\nmeasures are best viewed as probabilistic statements that the\nfrequency of faults tends to increase in code modules that trigger\nthe detector. By definition, such probabilistic statements\nare not categorical claims, and they carry a non-zero false alarm\nrate. The research challenge for data miners is to ensure that\nthese false alarms do not cripple their learned theories.\n\nThe McCabe metrics are a collection of four software metrics:\nessential complexity, cyclomatic complexity, design complexity and\nLOC, Lines of Code.\n\n-- Cyclomatic Complexity, or \"v(G)\", measures the number of\n\"linearly independent paths\" through a program's \"flowgraph\".\nA set of paths is said to be linearly independent if no path in\nthe set is a linear combination of any other paths in the set. A\nflowgraph is a directed graph where each node corresponds to a\nprogram statement, and each arc indicates the flow of control from\none statement to another. \"v(G)\" is calculated by \"v(G) = e - n + 2\"\nwhere \"G\" is a program's flowgraph, \"e\" is the number of arcs in\nthe flowgraph, and \"n\" is the number of nodes in the\nflowgraph. The standard McCabe rule (\"v(G)\" > 10) is used to\nidentify fault-prone modules.\n\n-- Essential Complexity, or \"ev(G)\", is the extent to which a\nflowgraph can be \"reduced\" by decomposing all the subflowgraphs\nof \"G\" that are \"D-structured primes\". Such \"D-structured\nprimes\" are also sometimes referred to as \"proper one-entry\none-exit subflowgraphs\" (for a more thorough discussion of\nD-primes, see the Fenton text referenced above). 
\"ev(G)\" is\ncalculated using \"ev(G)= v(G) - m\" where \"m\" is the number of\nsubflowgraphs of \"G\" that are D-structured primes.\n\n-- Design Complexity, or \"iv(G)\", is the cyclomatic complexity of a\nmodule's reduced flowgraph. The flowgraph, \"G\", of a module is\nreduced to eliminate any complexity which does not influence the\ninterrelationship between design modules. According to McCabe,\nthis complexity measurement reflects the module's calling patterns\nto its immediate subordinate modules.\n\n-- Lines of code is measured according to McCabe's line counting\nconventions.\n\nThe Halstead measures fall into three groups: the base measures, the\nderived measures, and lines of code measures.\n\n-- Base measures:\n-- mu1 = number of unique operators\n-- mu2 = number of unique operands\n-- N1 = total occurrences of operators\n-- N2 = total occurrences of operands\n-- length = N = N1 + N2\n-- vocabulary = mu = mu1 + mu2\n-- Constants set for each function:\n-- mu1' = 2 = potential operator count (just the function\nname and the \"return\" operator)\n-- mu2' = potential operand count. (the number\nof arguments to the module)\n\nFor example, the expression \"return max(w+x,x+y)\" has \"N1=4\"\noperators (return, max, +, +), \"N2=4\" operands (w,x,x,y),\n\"mu1=3\" unique operators (return, max, +), and \"mu2=3\" unique\noperands (w,x,y).\n\n-- Derived measures:\n-- P = volume = V = N * log2(mu) (the number of mental\ncomparisons needed to write\na program of length N)\n-- V* = volume on minimal implementation\n= (2 + mu2')*log2(2 + mu2')\n-- L = program length = V*\/N\n-- D = difficulty = 1\/L\n-- L' = 1\/D\n-- I = intelligence = L'*V'\n-- E = effort to write program = V\/L\n-- T = time to write program = E\/18 seconds\n5. Number of instances: 2109\n6. Number of attributes: 22 (5 different lines of code measures,\n3 McCabe metrics, 4 base Halstead measures, 8 derived\nHalstead measures, a branch-count, and 1 goal field)\n7. Attribute Information:\n\n1. 
loc : numeric % McCabe's line count of code\n2. v(g) : numeric % McCabe \"cyclomatic complexity\"\n3. ev(g) : numeric % McCabe \"essential complexity\"\n4. iv(g) : numeric % McCabe \"design complexity\"\n5. n : numeric % Halstead total operators + operands\n6. v : numeric % Halstead \"volume\"\n7. l : numeric % Halstead \"program length\"\n8. d : numeric % Halstead \"difficulty\"\n9. i : numeric % Halstead \"intelligence\"\n10. e : numeric % Halstead \"effort\"\n11. b : numeric % Halstead\n12. t : numeric % Halstead's time estimator\n13. lOCode : numeric % Halstead's line count\n14. lOComment : numeric % Halstead's count of lines of comments\n15. lOBlank : numeric % Halstead's count of blank lines\n16. lOCodeAndComment: numeric\n17. uniq_Op : numeric % unique operators\n18. uniq_Opnd : numeric % unique operands\n19. total_Op : numeric % total operators\n20. total_Opnd : numeric % total operands\n21. branchCount : numeric % of the flow graph\n22. problems : {false,true}% module has\/has not one or more\n% reported defects\n8. Missing attributes: none\n9. 
Class Distribution: the class value (problems) is discrete\nyes: 326 = 15.45%\nno: 1783 = 84.54%\n%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%", "format": "ARFF", "uploader": "Joaquin Vanschoren", "uploader_id": 2, "visibility": "public", "creator": null, "contributor": null, "date": "2014-10-06 23:57:43", "update_comment": null, "last_update": "2014-10-06 23:57:43", "licence": "Public", "status": "active", "error_message": null, "url": "https:\/\/www.openml.org\/data\/download\/53950\/kc1.arff", "default_target_attribute": "defects", "row_id_attribute": null, "ignore_attribute": null, "runs": 0, "suggest": { "input": [ "kc1", "%-*- text -*- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% This is a PROMISE Software Engineering Repository data set made publicly available in order to encourage repeatable, verifiable, refutable, and\/or improvable predictive models of software engineering. If you publish material based on PROMISE data sets then, please follow the acknowledgment guidelines posted on the PROMISE repository web page http:\/\/promise.site.uottawa.ca\/SERepository . 
%%%%%%%%%%%%%%%%%%%%%% " ], "weight": 5 }, "qualities": { "NumberOfInstances": 2109, "NumberOfFeatures": 22, "NumberOfClasses": 2, "NumberOfMissingValues": 0, "NumberOfInstancesWithMissingValues": 0, "NumberOfNumericFeatures": 21, "NumberOfSymbolicFeatures": 1, "AutoCorrelation": 0.9990512333965844, "CfsSubsetEval_DecisionStumpAUC": 0.7072599430889553, "CfsSubsetEval_DecisionStumpErrRate": 0.15647226173541964, "CfsSubsetEval_DecisionStumpKappa": 0.2604340674054145, "CfsSubsetEval_NaiveBayesAUC": 0.7072599430889553, "CfsSubsetEval_NaiveBayesErrRate": 0.15647226173541964, "CfsSubsetEval_NaiveBayesKappa": 0.2604340674054145, "CfsSubsetEval_kNN1NAUC": 0.7072599430889553, "CfsSubsetEval_kNN1NErrRate": 0.15647226173541964, "CfsSubsetEval_kNN1NKappa": 0.2604340674054145, "ClassEntropy": 0.6211733422333798, "DecisionStumpAUC": 0.7123747114018215, "DecisionStumpErrRate": 0.15457562825983878, "DecisionStumpKappa": 0, "Dimensionality": 0.010431484115694643, "EquivalentNumberOfAtts": null, "J48.00001.AUC": 0.7030784952637211, "J48.00001.ErrRate": 0.15267899478425795, "J48.00001.Kappa": 0.22302666956511774, "J48.0001.AUC": 0.7030784952637211, "J48.0001.ErrRate": 0.15267899478425795, "J48.0001.Kappa": 0.22302666956511774, "J48.001.AUC": 0.7030784952637211, "J48.001.ErrRate": 0.15267899478425795, "J48.001.Kappa": 0.22302666956511774, "MajorityClassPercentage": 84.54243717401611, "MajorityClassSize": 1783, "MaxAttributeEntropy": null, "MaxKurtosisOfNumericAtts": 103.07904520134113, "MaxMeansOfNumericAtts": 5242.386239924155, "MaxMutualInformation": null, "MaxNominalAttDistinctValues": 2, "MaxSkewnessOfNumericAtts": 8.789034149895826, "MaxStdDevOfNumericAtts": 17444.98121137856, "MeanAttributeEntropy": null, "MeanKurtosisOfNumericAtts": 31.853391931073038, "MeanMeansOfNumericAtts": 285.0969166158651, "MeanMutualInformation": null, "MeanNoiseToSignalRatio": null, "MeanNominalAttDistinctValues": 2, "MeanSkewnessOfNumericAtts": 4.203178136828403, "MeanStdDevOfNumericAtts": 
915.455014376506, "MinAttributeEntropy": null, "MinKurtosisOfNumericAtts": 1.2413950208681674, "MinMeansOfNumericAtts": 0.08673779042200101, "MinMutualInformation": null, "MinNominalAttDistinctValues": 2, "MinSkewnessOfNumericAtts": 1.1407658504141103, "MinStdDevOfNumericAtts": 0.17550652636316466, "MinorityClassPercentage": 15.457562825983878, "MinorityClassSize": 326, "NaiveBayesAUC": 0.7895503283577173, "NaiveBayesErrRate": 0.17449027975343764, "NaiveBayesKappa": 0.3008403119099438, "NumberOfBinaryFeatures": 1, "PercentageOfBinaryFeatures": 4.545454545454546, "PercentageOfInstancesWithMissingValues": 0, "PercentageOfMissingValues": 0, "PercentageOfNumericFeatures": 95.45454545454545, "PercentageOfSymbolicFeatures": 4.545454545454546, "Quartile1AttributeEntropy": null, "Quartile1KurtosisOfNumericAtts": 12.284905644252186, "Quartile1MeansOfNumericAtts": 1.7170222854433477, "Quartile1MutualInformation": null, "Quartile1SkewnessOfNumericAtts": 2.869811695669978, "Quartile1StdDevOfNumericAtts": 3.2305652323985976, "Quartile2AttributeEntropy": null, "Quartile2KurtosisOfNumericAtts": 22.422776708097334, "Quartile2MeansOfNumericAtts": 7.631673779042244, "Quartile2MutualInformation": null, "Quartile2SkewnessOfNumericAtts": 3.7371013785570546, "Quartile2StdDevOfNumericAtts": 7.863645549059788, "Quartile3AttributeEntropy": null, "Quartile3KurtosisOfNumericAtts": 36.53033082841476, "Quartile3MeansOfNumericAtts": 26.141894262683927, "Quartile3MutualInformation": null, "Quartile3SkewnessOfNumericAtts": 4.706421340727173, "Quartile3StdDevOfNumericAtts": 41.92522717121333, "REPTreeDepth1AUC": 0.7649357084117553, "REPTreeDepth1ErrRate": 0.14414414414414414, "REPTreeDepth1Kappa": 0.18493789806892894, "REPTreeDepth2AUC": 0.7649357084117553, "REPTreeDepth2ErrRate": 0.14414414414414414, "REPTreeDepth2Kappa": 0.18493789806892894, "REPTreeDepth3AUC": 0.7649357084117553, "REPTreeDepth3ErrRate": 0.14414414414414414, "REPTreeDepth3Kappa": 0.18493789806892894, "RandomTreeDepth1AUC": 
0.6404083900780722, "RandomTreeDepth1ErrRate": 0.1763869132290185, "RandomTreeDepth1Kappa": 0.3113191316041728, "RandomTreeDepth2AUC": 0.6404083900780722, "RandomTreeDepth2ErrRate": 0.1763869132290185, "RandomTreeDepth2Kappa": 0.3113191316041728, "RandomTreeDepth3AUC": 0.6404083900780722, "RandomTreeDepth3ErrRate": 0.1763869132290185, "RandomTreeDepth3Kappa": 0.3113191316041728, "StdvNominalAttDistinctValues": 0, "kNN1NAUC": 0.7470447488121043, "kNN1NErrRate": 0.1512565196775723, "kNN1NKappa": 0.3504547881404616 }, "tags": [ { "tag": "study_14", "uploader": "1" }, { "tag": "study_1", "uploader": "0" }, { "tag": "study_395", "uploader": "0" }, { "tag": "study_429", "uploader": "0" }, { "tag": "study_380", "uploader": "0" }, { "tag": "study_286", "uploader": "0" }, { "tag": "study_769", "uploader": "0" }, { "tag": "study_114", "uploader": "0" }, { "tag": "study_170", "uploader": "0" }, { "tag": "study_722", "uploader": "0" }, { "tag": "study_757", "uploader": "0" }, { "tag": "study_617", "uploader": "0" }, { "tag": "study_661", "uploader": "0" } ], "features": [ { "name": "defects", "index": "21", "type": "nominal", "distinct": "2", "missing": "0", "target": "1", "distr": [ [ "false", "true" ], [ [ "1783", "0" ], [ "0", "326" ] ] ] }, { "name": "t", "index": "11", "type": "numeric", "distinct": "947", "missing": "0", "min": "0", "max": "18045", "mean": "291", "stdev": "969" }, { "name": "branchCount", "index": "20", "type": "numeric", "distinct": "44", "missing": "0", "min": "1", "max": "89", "mean": "5", "stdev": "8" }, { "name": "total_Opnd", "index": "19", "type": "numeric", "distinct": "153", "missing": "0", "min": "0", "max": "428", "mean": "19", "stdev": "32" }, { "name": "total_Op", "index": "18", "type": "numeric", "distinct": "207", "missing": "0", "min": "0", "max": "678", "mean": "31", "stdev": "52" }, { "name": "uniq_Opnd", "index": "17", "type": "numeric", "distinct": "73", "missing": "0", "min": "0", "max": "120", "mean": "10", "stdev": "12" }, { 
"name": "uniq_Op", "index": "16", "type": "numeric", "distinct": "34", "missing": "0", "min": "0", "max": "37", "mean": "8", "stdev": "6" }, { "name": "locCodeAndComment", "index": "15", "type": "numeric", "distinct": "12", "missing": "0", "min": "0", "max": "12", "mean": "0", "stdev": "1" }, { "name": "lOBlank", "index": "14", "type": "numeric", "distinct": "31", "missing": "0", "min": "0", "max": "58", "mean": "2", "stdev": "4" }, { "name": "lOComment", "index": "13", "type": "numeric", "distinct": "28", "missing": "0", "min": "0", "max": "44", "mean": "1", "stdev": "3" }, { "name": "lOCode", "index": "12", "type": "numeric", "distinct": "121", "missing": "0", "min": "0", "max": "262", "mean": "15", "stdev": "24" }, { "name": "loc", "index": "0", "type": "numeric", "distinct": "139", "missing": "0", "min": "1", "max": "288", "mean": "20", "stdev": "30" }, { "name": "b", "index": "10", "type": "numeric", "distinct": "92", "missing": "0", "min": "0", "max": "3", "mean": "0", "stdev": "0" }, { "name": "e", "index": "9", "type": "numeric", "distinct": "961", "missing": "0", "min": "0", "max": "324804", "mean": "5242", "stdev": "17445" }, { "name": "i", "index": "8", "type": "numeric", "distinct": "893", "missing": "0", "min": "0", "max": "193", "mean": "21", "stdev": "22" }, { "name": "d", "index": "7", "type": "numeric", "distinct": "548", "missing": "0", "min": "0", "max": "54", "mean": "7", "stdev": "8" }, { "name": "l", "index": "6", "type": "numeric", "distinct": "52", "missing": "0", "min": "0", "max": "2", "mean": "0", "stdev": "0" }, { "name": "v", "index": "5", "type": "numeric", "distinct": "729", "missing": "0", "min": "0", "max": "7919", "mean": "259", "stdev": "516" }, { "name": "n", "index": "4", "type": "numeric", "distinct": "278", "missing": "0", "min": "0", "max": "1106", "mean": "50", "stdev": "84" }, { "name": "iv(g)", "index": "3", "type": "numeric", "distinct": "26", "missing": "0", "min": "1", "max": "45", "mean": "3", "stdev": "3" }, { "name": 
"ev(g)", "index": "2", "type": "numeric", "distinct": "21", "missing": "0", "min": "1", "max": "26", "mean": "2", "stdev": "2" }, { "name": "v(g)", "index": "1", "type": "numeric", "distinct": "31", "missing": "0", "min": "1", "max": "45", "mean": "3", "stdev": "4" } ], "nr_of_issues": 0, "nr_of_downvotes": 0, "nr_of_likes": 0, "nr_of_downloads": 0, "total_downloads": 0, "reach": 0, "reuse": 11, "impact_of_reuse": 0, "reach_of_reuse": 0, "impact": 11 }