OpenML

spambase

active ARFF Publicly available Visibility: public Uploaded 06-04-2014 by Jan van Rijn

Author:
Source: Unknown
Please cite:

1. Title: SPAM E-mail Database

2. Sources:
   (a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt
       Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
   (b) Donor: George Forman (gforman at nospam hpl.hp.com) 650-857-7835
   (c) Generated: June-July 1999

3. Past Usage:
   (a) Hewlett-Packard Internal-only Technical Report. External forthcoming.
   (b) Determine whether a given email is spam or not.
   (c) ~7% misclassification error. False positives (marking good mail as spam) are very undesirable. If we insist on zero false positives in the training/testing set, 20-25% of the spam passed through the filter.

4. Relevant Information:
   The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...
   Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

   For background on spam:
   Cranor, Lorrie F., LaMacchia, Brian A. Spam! Communications of the ACM, 41(8):74-83, 1998.

5. Number of Instances: 4601 (1813 Spam = 39.4%)

6. Number of Attributes: 58 (57 continuous, 1 nominal class label)

7. Attribute Information:
   The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file.
Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD
= percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR
= percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

1 continuous real [1,...] attribute of type capital_run_length_average
= average length of uninterrupted sequences of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_longest
= length of longest uninterrupted sequence of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_total
= sum of length of uninterrupted sequences of capital letters
= total number of capital letters in the e-mail

1 nominal {0,1} class attribute of type spam
= denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.

8. Missing Attribute Values: None

9. Class Distribution:
   Spam      1813 (39.4%)
   Non-Spam  2788 (60.6%)

Attribute Statistics:

Attribute	Min	Max	Average	Std.Dev	Coeff.Var_%
1	0	4.54	0.10455	0.30536	292
2	0	14.28	0.21301	1.2906	606
3	0	5.1	0.28066	0.50414	180
4	0	42.81	0.065425	1.3952	2130
5	0	10	0.31222	0.67251	215
6	0	5.88	0.095901	0.27382	286
7	0	7.27	0.11421	0.39144	343
8	0	11.11	0.10529	0.40107	381
9	0	5.26	0.090067	0.27862	309
10	0	18.18	0.23941	0.64476	269
11	0	2.61	0.059824	0.20154	337
12	0	9.67	0.5417	0.8617	159
13	0	5.55	0.09393	0.30104	320
14	0	10	0.058626	0.33518	572
15	0	4.41	0.049205	0.25884	526
16	0	20	0.24885	0.82579	332
17	0	7.14	0.14259	0.44406	311
18	0	9.09	0.18474	0.53112	287
19	0	18.75	1.6621	1.7755	107
20	0	18.18	0.085577	0.50977	596
21	0	11.11	0.80976	1.2008	148
22	0	17.1	0.1212	1.0258	846
23	0	5.45	0.10165	0.35029	345
24	0	12.5	0.094269	0.44264	470
25	0	20.83	0.5495	1.6713	304
26	0	16.66	0.26538	0.88696	334
27	0	33.33	0.7673	3.3673	439
28	0	9.09	0.12484	0.53858	431
29	0	14.28	0.098915	0.59333	600
30	0	5.88	0.10285	0.45668	444
31	0	12.5	0.064753	0.40339	623
32	0	4.76	0.047048	0.32856	698
33	0	18.18	0.097229	0.55591	572
34	0	4.76	0.047835	0.32945	689
35	0	20	0.10541	0.53226	505
36	0	7.69	0.097477	0.40262	413
37	0	6.89	0.13695	0.42345	309
38	0	8.33	0.013201	0.22065	1670
39	0	11.11	0.078629	0.43467	553
40	0	4.76	0.064834	0.34992	540
41	0	7.14	0.043667	0.3612	827
42	0	14.28	0.13234	0.76682	579
43	0	3.57	0.046099	0.22381	486
44	0	20	0.079196	0.62198	785
45	0	21.42	0.30122	1.0117	336
46	0	22.05	0.17982	0.91112	507
47	0	2.17	0.0054445	0.076274	1400
48	0	10	0.031869	0.28573	897
49	0	4.385	0.038575	0.24347	631
50	0	9.752	0.13903	0.27036	194
51	0	4.081	0.016976	0.10939	644
52	0	32.478	0.26907	0.81567	303
53	0	6.003	0.075811	0.24588	324
54	0	19.829	0.044238	0.42934	971
55	1	1102.5	5.1915	31.729	611
56	1	9989	52.173	194.89	374
57	1	15841	283.29	606.35	214
58	0	1	0.39404	0.4887	124

This file: 'spambase.DOCUMENTATION' at the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
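The attribute definitions in the documentation above (word_freq_WORD, char_freq_CHAR and the three capital_run_length_* attributes) can be sketched in a few lines of Python. This is only an illustration of the stated formulas, not the code used to build the dataset. The function names, the case-insensitive word match, and the zero result for text without capital letters are assumptions of this sketch.

```python
import re


def word_freq(text: str, word: str) -> float:
    """100 * (occurrences of `word`) / (total words).

    A word is a maximal run of alphanumeric characters, as in the
    documentation. Case-insensitive matching is an assumption here.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * hits / len(words)


def char_freq(text: str, char: str) -> float:
    """100 * (occurrences of `char`) / (total characters)."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_run_lengths(text: str) -> tuple[float, int, int]:
    """Return (average, longest, total) length of uninterrupted capital runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        # Assumption: the documentation does not say what happens when
        # an e-mail contains no capital letters; this sketch returns zeros.
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)


sample = "FREE money!!! Call GEORGE now, free FREE"
print(word_freq(sample, "free"))    # 3 of 7 words match
print(char_freq(sample, "!"))       # 3 of 40 characters match
print(capital_run_lengths(sample))  # runs: FREE, C, GEORGE, FREE
```

The total returned by capital_run_lengths equals the number of capital letters in the text, which matches the second equality given for capital_run_length_total.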
Information about the dataset CLASSTYPE: nominal CLASSINDEX: last

58 features

class (target)	nominal	2 unique values 0 missing
word_freq_telnet	numeric	128 unique values 0 missing
word_freq_labs	numeric	179 unique values 0 missing
word_freq_857	numeric	106 unique values 0 missing
word_freq_data	numeric	184 unique values 0 missing
word_freq_415	numeric	110 unique values 0 missing
word_freq_85	numeric	177 unique values 0 missing
word_freq_technology	numeric	159 unique values 0 missing
word_freq_1999	numeric	188 unique values 0 missing
word_freq_parts	numeric	53 unique values 0 missing
word_freq_pm	numeric	163 unique values 0 missing
word_freq_direct	numeric	125 unique values 0 missing
word_freq_cs	numeric	108 unique values 0 missing
word_freq_meeting	numeric	186 unique values 0 missing
word_freq_original	numeric	136 unique values 0 missing
word_freq_project	numeric	160 unique values 0 missing
word_freq_re	numeric	230 unique values 0 missing
word_freq_edu	numeric	227 unique values 0 missing
word_freq_table	numeric	38 unique values 0 missing
word_freq_conference	numeric	106 unique values 0 missing
char_freq_%3B	numeric	313 unique values 0 missing
char_freq_%28	numeric	641 unique values 0 missing
char_freq_%5B	numeric	225 unique values 0 missing
char_freq_%21	numeric	964 unique values 0 missing
char_freq_%24	numeric	504 unique values 0 missing
char_freq_%23	numeric	316 unique values 0 missing
capital_run_length_average	numeric	2161 unique values 0 missing
capital_run_length_longest	numeric	271 unique values 0 missing
capital_run_length_total	numeric	919 unique values 0 missing
word_freq_free	numeric	253 unique values 0 missing
word_freq_address	numeric	171 unique values 0 missing
word_freq_all	numeric	214 unique values 0 missing
word_freq_3d	numeric	43 unique values 0 missing
word_freq_our	numeric	255 unique values 0 missing
word_freq_over	numeric	141 unique values 0 missing
word_freq_remove	numeric	173 unique values 0 missing
word_freq_internet	numeric	170 unique values 0 missing
word_freq_order	numeric	144 unique values 0 missing
word_freq_mail	numeric	245 unique values 0 missing
word_freq_receive	numeric	113 unique values 0 missing
word_freq_will	numeric	316 unique values 0 missing
word_freq_people	numeric	158 unique values 0 missing
word_freq_report	numeric	133 unique values 0 missing
word_freq_addresses	numeric	118 unique values 0 missing
word_freq_make	numeric	142 unique values 0 missing
word_freq_business	numeric	197 unique values 0 missing
word_freq_email	numeric	229 unique values 0 missing
word_freq_you	numeric	575 unique values 0 missing
word_freq_credit	numeric	148 unique values 0 missing
word_freq_your	numeric	401 unique values 0 missing
word_freq_font	numeric	99 unique values 0 missing
word_freq_000	numeric	164 unique values 0 missing
word_freq_money	numeric	143 unique values 0 missing
word_freq_hp	numeric	395 unique values 0 missing
word_freq_hpl	numeric	281 unique values 0 missing
word_freq_george	numeric	240 unique values 0 missing
word_freq_650	numeric	200 unique values 0 missing
word_freq_lab	numeric	156 unique values 0 missing


107 properties

NumberOfInstances

4601

Number of instances (rows) of the dataset.

NumberOfFeatures

58

Number of attributes (columns) of the dataset.

NumberOfClasses

2

Number of distinct values of the target attribute (if it is nominal).

NumberOfMissingValues

0

Number of missing values in the dataset.

NumberOfInstancesWithMissingValues

0

Number of instances with at least one value missing.

NumberOfNumericFeatures

57

Number of numeric attributes.

NumberOfSymbolicFeatures

1

Number of nominal attributes.

AutoCorrelation

Average class difference between consecutive instances.

CfsSubsetEval_DecisionStumpAUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W
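For readers unfamiliar with the AUC values reported by the landmarkers, the sketch below computes the area under the ROC curve as the fraction of (positive, negative) pairs that a classifier ranks correctly, counting ties as half. It is a generic, quadratic-time illustration and not the Weka implementation. The function name and the toy labels and scores are made up for the example.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = spam, 0 = non-spam.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # 0.75
```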

CfsSubsetEval_DecisionStumpErrRate

0.09

Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_DecisionStumpKappa

0.82

Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W
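The Kappa values listed for the landmarkers measure agreement between predictions and true labels, corrected for the agreement expected by chance. A minimal sketch of Cohen's kappa computed from a confusion matrix follows. The function name and the example matrix are hypothetical and are not taken from any run on this dataset.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / (total * total)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical confusion matrix for a two-class problem.
print(cohen_kappa([[40, 10], [5, 45]]))
```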

CfsSubsetEval_NaiveBayesAUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_NaiveBayesErrRate

0.09

Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_NaiveBayesKappa

0.82

Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_kNN1NAUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_kNN1NErrRate

0.09

Error rate achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_kNN1NKappa

0.82

Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

ClassEntropy

0.97

Entropy of the target attribute values.
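The ClassEntropy value can be checked against the class distribution given in the documentation (1813 spam, 2788 non-spam). The sketch below assumes the entropy is measured in bits (base-2 logarithm), which is consistent with the listed 0.97. The function name is illustrative.

```python
import math


def class_entropy(counts):
    """Shannon entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


print(round(class_entropy([1813, 2788]), 2))  # 0.97
```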

DecisionStumpAUC

0.79

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump

DecisionStumpErrRate

0.21

Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump

DecisionStumpKappa

0.55

Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump

Dimensionality

0.01

Number of attributes divided by the number of instances.

EquivalentNumberOfAtts

Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation.

J48.00001.AUC

0.92

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .00001

J48.00001.ErrRate

0.08

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .00001

J48.00001.Kappa

0.82

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .00001

J48.0001.AUC

0.92

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .0001

J48.0001.ErrRate

0.08

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .0001

J48.0001.Kappa

0.82

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .0001

J48.001.AUC

0.92

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .001

J48.001.ErrRate

0.08

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .001

J48.001.Kappa

0.82

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .001

MajorityClassPercentage

60.6

Percentage of instances belonging to the most frequent class.

MajorityClassSize

2788

Number of instances belonging to the most frequent class.

MaxAttributeEntropy

Maximum entropy among attributes.

MaxKurtosisOfNumericAtts

1480.64

Maximum kurtosis among attributes of the numeric type.

MaxMeansOfNumericAtts

283.29

Maximum of means among attributes of the numeric type.

MaxMutualInformation

Maximum mutual information between the nominal attributes and the target attribute.

MaxNominalAttDistinctValues

The maximum number of distinct values among attributes of the nominal type.

MaxSkewnessOfNumericAtts

31.06

Maximum skewness among attributes of the numeric type.

MaxStdDevOfNumericAtts

606.35

Maximum standard deviation of attributes of the numeric type.

MeanAttributeEntropy

Average entropy of the attributes.

MeanKurtosisOfNumericAtts

241.17

Mean kurtosis among attributes of the numeric type.

MeanMeansOfNumericAtts

6.15

Mean of means among attributes of the numeric type.

MeanMutualInformation

Average mutual information between the nominal attributes and the target attribute.

MeanNoiseToSignalRatio

An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation.
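The two derived properties EquivalentNumberOfAtts and MeanNoiseToSignalRatio are simple ratios of other properties, as their descriptions state. The sketch below writes out both formulas. This page lists no values for MeanAttributeEntropy or MeanMutualInformation, so the inputs in the example call are placeholders and the function names are illustrative.

```python
def equivalent_number_of_atts(class_entropy, mean_mutual_information):
    """ClassEntropy divided by MeanMutualInformation."""
    if mean_mutual_information == 0:
        raise ValueError("undefined when mean mutual information is zero")
    return class_entropy / mean_mutual_information


def mean_noise_to_signal_ratio(mean_attribute_entropy, mean_mutual_information):
    """(MeanAttributeEntropy - MeanMutualInformation) / MeanMutualInformation."""
    if mean_mutual_information == 0:
        raise ValueError("undefined when mean mutual information is zero")
    return (mean_attribute_entropy - mean_mutual_information) / mean_mutual_information


# Placeholder inputs, not values from this dataset.
print(equivalent_number_of_atts(0.97, 0.1))
print(mean_noise_to_signal_ratio(1.5, 0.1))
```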

MeanNominalAttDistinctValues

Average number of distinct values among the attributes of the nominal type.

MeanSkewnessOfNumericAtts

11.19

Mean skewness among attributes of the numeric type.

MeanStdDevOfNumericAtts

15.19

Mean standard deviation of attributes of the numeric type.

MinAttributeEntropy

Minimal entropy among attributes.

MinKurtosisOfNumericAtts

5.26

Minimum kurtosis among attributes of the numeric type.

MinMeansOfNumericAtts

0.01

Minimum of means among attributes of the numeric type.

MinMutualInformation

Minimal mutual information between the nominal attributes and the target attribute.

MinNominalAttDistinctValues

The minimal number of distinct values among attributes of the nominal type.

MinSkewnessOfNumericAtts

1.59

Minimum skewness among attributes of the numeric type.

MinStdDevOfNumericAtts

0.08

Minimum standard deviation of attributes of the numeric type.

MinorityClassPercentage

39.4

Percentage of instances belonging to the least frequent class.

MinorityClassSize

1813

Number of instances belonging to the least frequent class.

NaiveBayesAUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes

NaiveBayesErrRate

0.2

Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes

NaiveBayesKappa

0.61

Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes

NumberOfBinaryFeatures

1

Number of binary attributes.

PercentageOfBinaryFeatures

1.72

Percentage of binary attributes.

PercentageOfInstancesWithMissingValues

0

Percentage of instances having missing values.

PercentageOfMissingValues

0

Percentage of missing values.

PercentageOfNumericFeatures

98.28

Percentage of numeric attributes.

PercentageOfSymbolicFeatures

1.72

Percentage of nominal attributes.
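The ratio-style properties listed so far (Dimensionality, MajorityClassPercentage, MinorityClassPercentage and the feature-type percentages) follow directly from the counts in the documentation. The sketch below recomputes them. It assumes values are rounded to two decimals and that the single binary attribute is the class label; the variable and function names are illustrative.

```python
# Counts taken from the dataset documentation.
n_instances = 4601
n_features = 58
n_numeric = 57
n_nominal = 1
n_binary = 1          # assumption: only the class label is binary
majority_size = 2788  # non-spam
minority_size = 1813  # spam


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


properties = {
    "Dimensionality": round(n_features / n_instances, 2),
    "MajorityClassPercentage": pct(majority_size, n_instances),
    "MinorityClassPercentage": pct(minority_size, n_instances),
    "PercentageOfBinaryFeatures": pct(n_binary, n_features),
    "PercentageOfNumericFeatures": pct(n_numeric, n_features),
    "PercentageOfSymbolicFeatures": pct(n_nominal, n_features),
}

for name, value in properties.items():
    print(name, value)
```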

Quartile1AttributeEntropy

First quartile of entropy among attributes.

Quartile1KurtosisOfNumericAtts

50.66

First quartile of kurtosis among attributes of the numeric type.

Quartile1MeansOfNumericAtts

0.06

First quartile of means among attributes of the numeric type.

Quartile1MutualInformation

First quartile of mutual information between the nominal attributes and the target attribute.

Quartile1SkewnessOfNumericAtts

5.85

First quartile of skewness among attributes of the numeric type.

Quartile1StdDevOfNumericAtts

0.32

First quartile of standard deviation of attributes of the numeric type.

Quartile2AttributeEntropy

Second quartile (Median) of entropy among attributes.

Quartile2KurtosisOfNumericAtts

127.38

Second quartile (Median) of kurtosis among attributes of the numeric type.

Quartile2MeansOfNumericAtts

0.1

Second quartile (Median) of means among attributes of the numeric type.

Quartile2MutualInformation

Second quartile (Median) of mutual information between the nominal attributes and the target attribute.

Quartile2SkewnessOfNumericAtts

9.72

Second quartile (Median) of skewness among attributes of the numeric type.

Quartile2StdDevOfNumericAtts

0.44

Second quartile (Median) of standard deviation of attributes of the numeric type.

Quartile3AttributeEntropy

Third quartile of entropy among attributes.

Quartile3KurtosisOfNumericAtts

299.07

Third quartile of kurtosis among attributes of the numeric type.

Quartile3MeansOfNumericAtts

0.24

Third quartile of means among attributes of the numeric type.

Quartile3MutualInformation

Third quartile of mutual information between the nominal attributes and the target attribute.

Quartile3SkewnessOfNumericAtts

13.65

Third quartile of skewness among attributes of the numeric type.

Quartile3StdDevOfNumericAtts

0.84

Third quartile of standard deviation of attributes of the numeric type.

REPTreeDepth1AUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 1

REPTreeDepth1ErrRate

0.1

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 1

REPTreeDepth1Kappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 1

REPTreeDepth2AUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 2

REPTreeDepth2ErrRate

0.1

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 2

REPTreeDepth2Kappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 2

REPTreeDepth3AUC

0.94

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 3

REPTreeDepth3ErrRate

0.1

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 3

REPTreeDepth3Kappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 3

RandomTreeDepth1AUC

0.89

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

RandomTreeDepth1ErrRate

0.1

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

RandomTreeDepth1Kappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

RandomTreeDepth2AUC

0.89

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

RandomTreeDepth2ErrRate

0.1

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

RandomTreeDepth2Kappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

RandomTreeDepth3AUC

0.89

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

RandomTreeDepth3ErrRate

0.1

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

RandomTreeDepth3Kappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

StdvNominalAttDistinctValues

Standard deviation of the number of distinct values among attributes of the nominal type.

kNN1NAUC

0.89

Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk

kNN1NErrRate

0.11

Error rate achieved by the landmarker weka.classifiers.lazy.IBk

kNN1NKappa

0.78

Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk


11 tasks

Supervised Classification on spambase

0 runs - estimation_procedure: 10-fold Crossvalidation - target_feature: class

Supervised Classification on spambase

0 runs - estimation_procedure: 5 times 2-fold Crossvalidation - target_feature: class

Supervised Classification on spambase

0 runs - estimation_procedure: 10 times 10-fold Crossvalidation - target_feature: class

Supervised Classification on spambase

0 runs - estimation_procedure: Leave one out - target_feature: class

Supervised Classification on spambase

0 runs - estimation_procedure: 33% Holdout set - target_feature: class

Supervised Classification on spambase

0 runs - estimation_procedure: Test on Training Data - target_feature: class

Supervised Classification on spambase

0 runs - estimation_procedure: 20% Holdout (Ordered) - target_feature: class

Supervised Classification on spambase

0 runs - estimation_procedure: 10% Holdout set - target_feature: class

Learning Curve on spambase

0 runs - estimation_procedure: 10 times 10-fold Learning Curve - target_feature: class

Learning Curve on spambase

0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class

Supervised Data Stream Classification on spambase

0 runs - estimation_procedure: Interleaved Test then Train - target_feature: class
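Several of the tasks above use k-fold cross-validation as their estimation procedure. As a rough illustration of what a 10-fold split of the 4601 instances looks like, the sketch below generates train/test index lists. It is a plain, unstratified shuffle-and-partition and makes no attempt to reproduce the exact splits an OpenML task defines. The function name and the seed are arbitrary.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for fold_no, (train, test) in enumerate(k_fold_indices(4601, k=10), start=1):
    print(fold_no, len(train), len(test))
```

Each instance appears in exactly one test fold, so every instance is used for testing once and for training in the other nine folds.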
