{ "data_id": "23", "name": "spambase", "exact_name": "spambase", "version": 1, "version_label": "1", "description": "**Author**: \n**Source**: Unknown - \n**Please cite**: \n\n1. Title: SPAM E-mail Database\n \n 2. Sources:\n (a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt\n Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304\n (b) Donor: George Forman (gforman at nospam hpl.hp.com) 650-857-7835\n (c) Generated: June-July 1999\n \n 3. Past Usage:\n (a) Hewlett-Packard Internal-only Technical Report. External forthcoming.\n (b) Determine whether a given email is spam or not.\n (c) ~7% misclassification error.\n False positives (marking good mail as spam) are very undesirable.\n If we insist on zero false positives in the training\/testing set,\n 20-25% of the spam passed through the filter.\n \n 4. Relevant Information:\n The \"spam\" concept is diverse: advertisements for products\/web\n sites, make money fast schemes, chain letters, pornography...\n \tOur collection of spam e-mails came from our postmaster and \n \tindividuals who had filed spam. Our collection of non-spam \n \te-mails came from filed work and personal e-mails, and hence\n \tthe word 'george' and the area code '650' are indicators of \n \tnon-spam. These are useful when constructing a personalized \n \tspam filter. One would either have to blind such non-spam \n \tindicators or get a very wide collection of non-spam to \n \tgenerate a general purpose spam filter.\n \n For background on spam:\n Cranor, Lorrie F., LaMacchia, Brian A. Spam! \n Communications of the ACM, 41(8):74-83, 1998.\n \n 5. Number of Instances: 4601 (1813 Spam = 39.4%)\n \n 6. Number of Attributes: 58 (57 continuous, 1 nominal class label)\n \n 7. Attribute Information:\n The last column of 'spambase.data' denotes whether the e-mail was \n considered spam (1) or not (0), i.e. unsolicited commercial e-mail. 
\n Most of the attributes indicate whether a particular word or\n character was frequently occurring in the e-mail. The run-length\n attributes (55-57) measure the length of sequences of consecutive \n capital letters. For the statistical measures of each attribute, \n see the end of this file. Here are the definitions of the attributes:\n \n 48 continuous real [0,100] attributes of type word_freq_WORD \n = percentage of words in the e-mail that match WORD,\n i.e. 100 * (number of times the WORD appears in the e-mail) \/ \n total number of words in e-mail. A \"word\" in this case is any \n string of alphanumeric characters bounded by non-alphanumeric \n characters or end-of-string.\n \n 6 continuous real [0,100] attributes of type char_freq_CHAR\n = percentage of characters in the e-mail that match CHAR,\n i.e. 100 * (number of CHAR occurrences) \/ total characters in e-mail\n \n 1 continuous real [1,...] attribute of type capital_run_length_average\n = average length of uninterrupted sequences of capital letters\n \n 1 continuous integer [1,...] attribute of type capital_run_length_longest\n = length of longest uninterrupted sequence of capital letters\n \n 1 continuous integer [1,...] attribute of type capital_run_length_total\n = sum of length of uninterrupted sequences of capital letters\n = total number of capital letters in the e-mail\n \n 1 nominal {0,1} class attribute of type spam\n = denotes whether the e-mail was considered spam (1) or not (0), \n i.e. unsolicited commercial e-mail. \n \n \n 8. Missing Attribute Values: None\n \n 9. 
Class Distribution:\n \tSpam\t 1813 (39.4%)\n \tNon-Spam 2788 (60.6%)\n \n \n Attribute Statistics:\n Min: Max: Average: Std.Dev: Coeff.Var_%: \n 1 0 4.54 0.10455 0.30536 292 \n 2 0 14.28 0.21301 1.2906 606 \n 3 0 5.1 0.28066 0.50414 180 \n 4 0 42.81 0.065425 1.3952 2130 \n 5 0 10 0.31222 0.67251 215 \n 6 0 5.88 0.095901 0.27382 286 \n 7 0 7.27 0.11421 0.39144 343 \n 8 0 11.11 0.10529 0.40107 381 \n 9 0 5.26 0.090067 0.27862 309 \n 10 0 18.18 0.23941 0.64476 269 \n 11 0 2.61 0.059824 0.20154 337 \n 12 0 9.67 0.5417 0.8617 159 \n 13 0 5.55 0.09393 0.30104 320 \n 14 0 10 0.058626 0.33518 572 \n 15 0 4.41 0.049205 0.25884 526 \n 16 0 20 0.24885 0.82579 332 \n 17 0 7.14 0.14259 0.44406 311 \n 18 0 9.09 0.18474 0.53112 287 \n 19 0 18.75 1.6621 1.7755 107 \n 20 0 18.18 0.085577 0.50977 596 \n 21 0 11.11 0.80976 1.2008 148 \n 22 0 17.1 0.1212 1.0258 846 \n 23 0 5.45 0.10165 0.35029 345 \n 24 0 12.5 0.094269 0.44264 470 \n 25 0 20.83 0.5495 1.6713 304 \n 26 0 16.66 0.26538 0.88696 334 \n 27 0 33.33 0.7673 3.3673 439 \n 28 0 9.09 0.12484 0.53858 431 \n 29 0 14.28 0.098915 0.59333 600 \n 30 0 5.88 0.10285 0.45668 444 \n 31 0 12.5 0.064753 0.40339 623 \n 32 0 4.76 0.047048 0.32856 698 \n 33 0 18.18 0.097229 0.55591 572 \n 34 0 4.76 0.047835 0.32945 689 \n 35 0 20 0.10541 0.53226 505 \n 36 0 7.69 0.097477 0.40262 413 \n 37 0 6.89 0.13695 0.42345 309 \n 38 0 8.33 0.013201 0.22065 1670 \n 39 0 11.11 0.078629 0.43467 553 \n 40 0 4.76 0.064834 0.34992 540 \n 41 0 7.14 0.043667 0.3612 827 \n 42 0 14.28 0.13234 0.76682 579 \n 43 0 3.57 0.046099 0.22381 486 \n 44 0 20 0.079196 0.62198 785 \n 45 0 21.42 0.30122 1.0117 336 \n 46 0 22.05 0.17982 0.91112 507 \n 47 0 2.17 0.0054445 0.076274 1400 \n 48 0 10 0.031869 0.28573 897 \n 49 0 4.385 0.038575 0.24347 631 \n 50 0 9.752 0.13903 0.27036 194 \n 51 0 4.081 0.016976 0.10939 644 \n 52 0 32.478 0.26907 0.81567 303 \n 53 0 6.003 0.075811 0.24588 324 \n 54 0 19.829 0.044238 0.42934 971 \n 55 1 1102.5 5.1915 31.729 611 \n 56 1 9989 52.173 
194.89 374 \n 57 1 15841 283.29 606.35 214 \n 58 0 1 0.39404 0.4887 124 \n \n \n This file: 'spambase.DOCUMENTATION' at the UCI Machine Learning Repository\n http:\/\/www.ics.uci.edu\/~mlearn\/MLRepository.html\n\n Information about the dataset\n CLASSTYPE: nominal\n CLASSINDEX: last", "format": "ARFF", "uploader": "Jan van Rijn", "uploader_id": 1, "visibility": "public", "creator": null, "contributor": null, "date": "2014-04-06 23:22:41", "update_comment": null, "last_update": "2014-04-06 23:22:41", "licence": "Public", "status": "active", "error_message": null, "url": "https:\/\/www.openml.org\/data\/download\/44\/dataset_44_spambase.arff", "default_target_attribute": "class", "row_id_attribute": null, "ignore_attribute": null, "runs": 0, "suggest": { "input": [ "spambase", "1. Title: SPAM E-mail Database 2. Sources: (a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304 (b) Donor: George Forman (gforman at nospam hpl.hp.com) 650-857-7835 (c) Generated: June-July 1999 3. Past Usage: (a) Hewlett-Packard Internal-only Technical Report. External forthcoming. (b) Determine whether a given email is spam or not. (c) ~7% misclassification error. 
False positives (marking good mail as spam) are ver " ], "weight": 5 }, "qualities": { "NumberOfInstances": 4601, "NumberOfFeatures": 58, "NumberOfClasses": 2, "NumberOfMissingValues": 0, "NumberOfInstancesWithMissingValues": 0, "NumberOfNumericFeatures": 57, "NumberOfSymbolicFeatures": 1, "AutoCorrelation": 0.9997826086956522, "CfsSubsetEval_DecisionStumpAUC": 0.9397314627894664, "CfsSubsetEval_DecisionStumpErrRate": 0.08563355792219082, "CfsSubsetEval_DecisionStumpKappa": 0.8208876445659258, "CfsSubsetEval_NaiveBayesAUC": 0.9397314627894664, "CfsSubsetEval_NaiveBayesErrRate": 0.08563355792219082, "CfsSubsetEval_NaiveBayesKappa": 0.8208876445659258, "CfsSubsetEval_kNN1NAUC": 0.9397314627894664, "CfsSubsetEval_kNN1NErrRate": 0.08563355792219082, "CfsSubsetEval_kNN1NKappa": 0.8208876445659258, "ClassEntropy": 0.9673602371807668, "DecisionStumpAUC": 0.7941574124705914, "DecisionStumpErrRate": 0.20930232558139536, "DecisionStumpKappa": 0.549772420190581, "Dimensionality": 0.012605955227124538, "EquivalentNumberOfAtts": null, "J48.00001.AUC": 0.924541669007748, "J48.00001.ErrRate": 0.08411214953271028, "J48.00001.Kappa": 0.82347465212921, "J48.0001.AUC": 0.924541669007748, "J48.0001.ErrRate": 0.08411214953271028, "J48.0001.Kappa": 0.82347465212921, "J48.001.AUC": 0.924541669007748, "J48.001.ErrRate": 0.08411214953271028, "J48.001.Kappa": 0.82347465212921, "MajorityClassPercentage": 60.59552271245382, "MajorityClassSize": 2788, "MaxAttributeEntropy": null, "MaxKurtosisOfNumericAtts": 1480.6420502862907, "MaxMeansOfNumericAtts": 283.28928493805716, "MaxMutualInformation": null, "MaxNominalAttDistinctValues": 2, "MaxSkewnessOfNumericAtts": 31.062064279039635, "MaxStdDevOfNumericAtts": 606.3478507248471, "MeanAttributeEntropy": null, "MeanKurtosisOfNumericAtts": 241.1700186517731, "MeanMeansOfNumericAtts": 6.150770191072139, "MeanMutualInformation": null, "MeanNoiseToSignalRatio": null, "MeanNominalAttDistinctValues": 2, "MeanSkewnessOfNumericAtts": 11.186639096029253, 
"MeanStdDevOfNumericAtts": 15.193997694546747, "MinAttributeEntropy": null, "MinKurtosisOfNumericAtts": 5.257394367988116, "MinMeansOfNumericAtts": 0.005444468593783957, "MinMutualInformation": null, "MinNominalAttDistinctValues": 2, "MinSkewnessOfNumericAtts": 1.5916742687064245, "MinStdDevOfNumericAtts": 0.07627427063724908, "MinorityClassPercentage": 39.404477287546186, "MinorityClassSize": 1813, "NaiveBayesAUC": 0.93523126498161, "NaiveBayesErrRate": 0.20234731580091284, "NaiveBayesKappa": 0.605321391923295, "NumberOfBinaryFeatures": 1, "PercentageOfBinaryFeatures": 1.7241379310344827, "PercentageOfInstancesWithMissingValues": 0, "PercentageOfMissingValues": 0, "PercentageOfNumericFeatures": 98.27586206896551, "PercentageOfSymbolicFeatures": 1.7241379310344827, "Quartile1AttributeEntropy": null, "Quartile1KurtosisOfNumericAtts": 50.655931063002996, "Quartile1MeansOfNumericAtts": 0.06479352314714121, "Quartile1MutualInformation": null, "Quartile1SkewnessOfNumericAtts": 5.8507230150661425, "Quartile1StdDevOfNumericAtts": 0.31695822185668954, "Quartile2AttributeEntropy": null, "Quartile2KurtosisOfNumericAtts": 127.37652934849572, "Quartile2MeansOfNumericAtts": 0.10285155400999912, "Quartile2MutualInformation": null, "Quartile2SkewnessOfNumericAtts": 9.724847529978312, "Quartile2StdDevOfNumericAtts": 0.4440553289821315, "Quartile3AttributeEntropy": null, "Quartile3KurtosisOfNumericAtts": 299.0723734257733, "Quartile3MeansOfNumericAtts": 0.24413062377743908, "Quartile3MutualInformation": null, "Quartile3SkewnessOfNumericAtts": 13.646188094980591, "Quartile3StdDevOfNumericAtts": 0.8437450862048406, "REPTreeDepth1AUC": 0.9386679853220129, "REPTreeDepth1ErrRate": 0.10345577048467725, "REPTreeDepth1Kappa": 0.7807805902062573, "REPTreeDepth2AUC": 0.9386679853220129, "REPTreeDepth2ErrRate": 0.10345577048467725, "REPTreeDepth2Kappa": 0.7807805902062573, "REPTreeDepth3AUC": 0.9386679853220129, "REPTreeDepth3ErrRate": 0.10345577048467725, "REPTreeDepth3Kappa": 
0.7807805902062573, "RandomTreeDepth1AUC": 0.8911774993451567, "RandomTreeDepth1ErrRate": 0.10410780265159748, "RandomTreeDepth1Kappa": 0.781973609246172, "RandomTreeDepth2AUC": 0.8911774993451567, "RandomTreeDepth2ErrRate": 0.10410780265159748, "RandomTreeDepth2Kappa": 0.781973609246172, "RandomTreeDepth3AUC": 0.8911774993451567, "RandomTreeDepth3ErrRate": 0.10410780265159748, "RandomTreeDepth3Kappa": 0.781973609246172, "StdvNominalAttDistinctValues": 0, "kNN1NAUC": 0.8937334657000572, "kNN1NErrRate": 0.10736796348619865, "kNN1NKappa": 0.775167746729542 }, "tags": [ { "tag": "study_14", "uploader": "1" }, { "tag": "study_1", "uploader": "0" }, { "tag": "study_105", "uploader": "0" }, { "tag": "study_2", "uploader": "0" }, { "tag": "study_66", "uploader": "0" } ], "features": [ { "name": "class", "index": "57", "type": "nominal", "distinct": "2", "missing": "0", "target": "1", "distr": [ [ "0", "1" ], [ [ "2788", "0" ], [ "0", "1813" ] ] ] }, { "name": "word_freq_telnet", "index": "30", "type": "numeric", "distinct": "128", "missing": "0", "min": "0", "max": "13", "mean": "0", "stdev": "0" }, { "name": "word_freq_labs", "index": "29", "type": "numeric", "distinct": "179", "missing": "0", "min": "0", "max": "6", "mean": "0", "stdev": "0" }, { "name": "word_freq_857", "index": "31", "type": "numeric", "distinct": "106", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_data", "index": "32", "type": "numeric", "distinct": "184", "missing": "0", "min": "0", "max": "18", "mean": "0", "stdev": "1" }, { "name": "word_freq_415", "index": "33", "type": "numeric", "distinct": "110", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_85", "index": "34", "type": "numeric", "distinct": "177", "missing": "0", "min": "0", "max": "20", "mean": "0", "stdev": "1" }, { "name": "word_freq_technology", "index": "35", "type": "numeric", "distinct": "159", "missing": "0", "min": "0", "max": "8", "mean": "0", 
"stdev": "0" }, { "name": "word_freq_1999", "index": "36", "type": "numeric", "distinct": "188", "missing": "0", "min": "0", "max": "7", "mean": "0", "stdev": "0" }, { "name": "word_freq_parts", "index": "37", "type": "numeric", "distinct": "53", "missing": "0", "min": "0", "max": "8", "mean": "0", "stdev": "0" }, { "name": "word_freq_pm", "index": "38", "type": "numeric", "distinct": "163", "missing": "0", "min": "0", "max": "11", "mean": "0", "stdev": "0" }, { "name": "word_freq_direct", "index": "39", "type": "numeric", "distinct": "125", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_cs", "index": "40", "type": "numeric", "distinct": "108", "missing": "0", "min": "0", "max": "7", "mean": "0", "stdev": "0" }, { "name": "word_freq_meeting", "index": "41", "type": "numeric", "distinct": "186", "missing": "0", "min": "0", "max": "14", "mean": "0", "stdev": "1" }, { "name": "word_freq_original", "index": "42", "type": "numeric", "distinct": "136", "missing": "0", "min": "0", "max": "4", "mean": "0", "stdev": "0" }, { "name": "word_freq_project", "index": "43", "type": "numeric", "distinct": "160", "missing": "0", "min": "0", "max": "20", "mean": "0", "stdev": "1" }, { "name": "word_freq_re", "index": "44", "type": "numeric", "distinct": "230", "missing": "0", "min": "0", "max": "21", "mean": "0", "stdev": "1" }, { "name": "word_freq_edu", "index": "45", "type": "numeric", "distinct": "227", "missing": "0", "min": "0", "max": "22", "mean": "0", "stdev": "1" }, { "name": "word_freq_table", "index": "46", "type": "numeric", "distinct": "38", "missing": "0", "min": "0", "max": "2", "mean": "0", "stdev": "0" }, { "name": "word_freq_conference", "index": "47", "type": "numeric", "distinct": "106", "missing": "0", "min": "0", "max": "10", "mean": "0", "stdev": "0" }, { "name": "char_freq_%3B", "index": "48", "type": "numeric", "distinct": "313", "missing": "0", "min": "0", "max": "4", "mean": "0", "stdev": "0" }, { "name": 
"char_freq_%28", "index": "49", "type": "numeric", "distinct": "641", "missing": "0", "min": "0", "max": "10", "mean": "0", "stdev": "0" }, { "name": "char_freq_%5B", "index": "50", "type": "numeric", "distinct": "225", "missing": "0", "min": "0", "max": "4", "mean": "0", "stdev": "0" }, { "name": "char_freq_%21", "index": "51", "type": "numeric", "distinct": "964", "missing": "0", "min": "0", "max": "32", "mean": "0", "stdev": "1" }, { "name": "char_freq_%24", "index": "52", "type": "numeric", "distinct": "504", "missing": "0", "min": "0", "max": "6", "mean": "0", "stdev": "0" }, { "name": "char_freq_%23", "index": "53", "type": "numeric", "distinct": "316", "missing": "0", "min": "0", "max": "20", "mean": "0", "stdev": "0" }, { "name": "capital_run_length_average", "index": "54", "type": "numeric", "distinct": "2161", "missing": "0", "min": "1", "max": "1103", "mean": "5", "stdev": "32" }, { "name": "capital_run_length_longest", "index": "55", "type": "numeric", "distinct": "271", "missing": "0", "min": "1", "max": "9989", "mean": "52", "stdev": "195" }, { "name": "capital_run_length_total", "index": "56", "type": "numeric", "distinct": "919", "missing": "0", "min": "1", "max": "15841", "mean": "283", "stdev": "606" }, { "name": "word_freq_free", "index": "15", "type": "numeric", "distinct": "253", "missing": "0", "min": "0", "max": "20", "mean": "0", "stdev": "1" }, { "name": "word_freq_address", "index": "1", "type": "numeric", "distinct": "171", "missing": "0", "min": "0", "max": "14", "mean": "0", "stdev": "1" }, { "name": "word_freq_all", "index": "2", "type": "numeric", "distinct": "214", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "1" }, { "name": "word_freq_3d", "index": "3", "type": "numeric", "distinct": "43", "missing": "0", "min": "0", "max": "43", "mean": "0", "stdev": "1" }, { "name": "word_freq_our", "index": "4", "type": "numeric", "distinct": "255", "missing": "0", "min": "0", "max": "10", "mean": "0", "stdev": "1" }, { "name": 
"word_freq_over", "index": "5", "type": "numeric", "distinct": "141", "missing": "0", "min": "0", "max": "6", "mean": "0", "stdev": "0" }, { "name": "word_freq_remove", "index": "6", "type": "numeric", "distinct": "173", "missing": "0", "min": "0", "max": "7", "mean": "0", "stdev": "0" }, { "name": "word_freq_internet", "index": "7", "type": "numeric", "distinct": "170", "missing": "0", "min": "0", "max": "11", "mean": "0", "stdev": "0" }, { "name": "word_freq_order", "index": "8", "type": "numeric", "distinct": "144", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_mail", "index": "9", "type": "numeric", "distinct": "245", "missing": "0", "min": "0", "max": "18", "mean": "0", "stdev": "1" }, { "name": "word_freq_receive", "index": "10", "type": "numeric", "distinct": "113", "missing": "0", "min": "0", "max": "3", "mean": "0", "stdev": "0" }, { "name": "word_freq_will", "index": "11", "type": "numeric", "distinct": "316", "missing": "0", "min": "0", "max": "10", "mean": "1", "stdev": "1" }, { "name": "word_freq_people", "index": "12", "type": "numeric", "distinct": "158", "missing": "0", "min": "0", "max": "6", "mean": "0", "stdev": "0" }, { "name": "word_freq_report", "index": "13", "type": "numeric", "distinct": "133", "missing": "0", "min": "0", "max": "10", "mean": "0", "stdev": "0" }, { "name": "word_freq_addresses", "index": "14", "type": "numeric", "distinct": "118", "missing": "0", "min": "0", "max": "4", "mean": "0", "stdev": "0" }, { "name": "word_freq_make", "index": "0", "type": "numeric", "distinct": "142", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_business", "index": "16", "type": "numeric", "distinct": "197", "missing": "0", "min": "0", "max": "7", "mean": "0", "stdev": "0" }, { "name": "word_freq_email", "index": "17", "type": "numeric", "distinct": "229", "missing": "0", "min": "0", "max": "9", "mean": "0", "stdev": "1" }, { "name": "word_freq_you", "index": 
"18", "type": "numeric", "distinct": "575", "missing": "0", "min": "0", "max": "19", "mean": "2", "stdev": "2" }, { "name": "word_freq_credit", "index": "19", "type": "numeric", "distinct": "148", "missing": "0", "min": "0", "max": "18", "mean": "0", "stdev": "1" }, { "name": "word_freq_your", "index": "20", "type": "numeric", "distinct": "401", "missing": "0", "min": "0", "max": "11", "mean": "1", "stdev": "1" }, { "name": "word_freq_font", "index": "21", "type": "numeric", "distinct": "99", "missing": "0", "min": "0", "max": "17", "mean": "0", "stdev": "1" }, { "name": "word_freq_000", "index": "22", "type": "numeric", "distinct": "164", "missing": "0", "min": "0", "max": "5", "mean": "0", "stdev": "0" }, { "name": "word_freq_money", "index": "23", "type": "numeric", "distinct": "143", "missing": "0", "min": "0", "max": "13", "mean": "0", "stdev": "0" }, { "name": "word_freq_hp", "index": "24", "type": "numeric", "distinct": "395", "missing": "0", "min": "0", "max": "21", "mean": "1", "stdev": "2" }, { "name": "word_freq_hpl", "index": "25", "type": "numeric", "distinct": "281", "missing": "0", "min": "0", "max": "17", "mean": "0", "stdev": "1" }, { "name": "word_freq_george", "index": "26", "type": "numeric", "distinct": "240", "missing": "0", "min": "0", "max": "33", "mean": "1", "stdev": "3" }, { "name": "word_freq_650", "index": "27", "type": "numeric", "distinct": "200", "missing": "0", "min": "0", "max": "9", "mean": "0", "stdev": "1" }, { "name": "word_freq_lab", "index": "28", "type": "numeric", "distinct": "156", "missing": "0", "min": "0", "max": "14", "mean": "0", "stdev": "1" } ], "nr_of_issues": 0, "nr_of_downvotes": 0, "nr_of_likes": 0, "nr_of_downloads": 0, "total_downloads": 0, "reach": 0, "reuse": 0, "impact_of_reuse": 0, "reach_of_reuse": 0, "impact": 0 }