OpenML

JavaScript is required to properly view the contents of this page!

Explore
- Data
- Task
- Flow
- Run
- Study
- Task type
- Measure
- People
Help
Blog
Contact
Please cite us

wisconsin

active ARFF Publicly available Visibility: public Uploaded 23-04-2014 by Jan van Rijn
0 likes downloaded by 0 people , 0 total downloads 0 issues 0 downvotes

Issue	#Downvotes for this reason	By

Loading wiki

Help us complete this description Edit

Author: Source: Unknown - Please cite: 1. Title: Wisconsin Prognostic Breast Cancer (WPBC) 2. Source Information a) Creators: Dr. William H. Wolberg, General Surgery Dept., University of Wisconsin, Clinical Sciences Center, Madison, WI 53792 wolberg@eagle.surgery.wisc.edu W. Nick Street, Computer Sciences Dept., University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 street@cs.wisc.edu 608-262-6619 Olvi L. Mangasarian, Computer Sciences Dept., University of Wisconsin, 1210 West Dayton St., Madison, WI 53706 olvi@cs.wisc.edu b) Donor: Nick Street c) Date: December 1995 3. Past Usage: Various versions of this data have been used in the following publications: (i) W. N. Street, O. L. Mangasarian, and W.H. Wolberg. An inductive learning approach to prognostic prediction. In A. Prieditis and S. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 522--530, San Francisco, 1995. Morgan Kaufmann. (ii) O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995. (iii) W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516. (iv) W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995. (v) W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear ``grade'' and breast cancer prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17, pages 257-264, 1995. See also: http://www.cs.wisc.edu/~olvi/uwmp/mpml.html http://www.cs.wisc.edu/~olvi/uwmp/cancer.html Results: Two possible learning problems: 1) Predicting field 2, outcome: R = recurrent, N = nonrecurrent - Dataset should first be filtered to reflect a particular endpoint; e.g., recurrences before 24 months = positive, nonrecurrence beyond 24 months = negative. - 86.3% accuracy estimated accuracy on 2-year recurrence using previous version of this data. Learning method: MSM-T (see below) in the 4-dimensional space of Mean Texture, Worst Area, Worst Concavity, Worst Fractal Dimension. 2) Predicting Time To Recur (field 3 in recurrent records) - Estimated mean error 13.9 months using Recurrence Surface Approximation. (See references (i) and (ii) above) 4. Relevant information Each record represents follow-up data for one breast cancer case. These are consecutive patients seen by Dr. Wolberg since 1984, and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis. The first 30 features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at http://www.cs.wisc.edu/~street/images/ The separation described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes. The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34]. The Recurrence Surface Approximation (RSA) method is a linear programming model which predicts Time To Recur using both recurrent and nonrecurrent cases. See references (i) and (ii) above for details of the RSA method. This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WPBC/ 5. Number of instances: 198 6. Number of attributes: 34 (ID, outcome, 32 real-valued input features) 7. Attribute information 1) ID number 2) Outcome (R = recur, N = nonrecur) 3) Time (recurrence time if field 2 = R, disease-free time if field 2 = N) 4-33) Ten real-valued features are computed for each cell nucleus: a) radius (mean of distances from center to points on the perimeter) b) texture (standard deviation of gray-scale values) c) perimeter d) area e) smoothness (local variation in radius lengths) f) compactness (perimeter^2 / area - 1.0) g) concavity (severity of concave portions of the contour) h) concave points (number of concave portions of the contour) i) symmetry j) fractal dimension ("coastline approximation" - 1) Several of the papers listed above contain detailed descriptions of how these features are computed. The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 4 is Mean Radius, field 14 is Radius SE, field 24 is Worst Radius. Values for features 4-33 are recoded with four significant digits. 34) Tumor size - diameter of the excised tumor in centimeters 35) Lymph node status - number of positive axillary lymph nodes observed at time of surgery 8. Missing attribute values: Lymph node status is missing in 4 cases. 9. Class distribution: 151 nonrecur, 47 recur ----------------------------------------------------------------------------------------------------------- Luis Torgo's version: (reconstructed) - removed the four instances with unknown values of the last attribute - exchanged the attribute position of attributes n.3 (Time) and n.35 (Lymph node). - removed the attribute outcome as it is the class attribute if the problem is treated as a classification one -----------------------------------------------------------------------------------------------------------

33 features

time (target)	numeric	94 unique values 0 missing
compactness_se	numeric	188 unique values 0 missing
compactness_mean	numeric	189 unique values 0 missing
compactness_worst	numeric	183 unique values 0 missing
concavity_mean	numeric	185 unique values 0 missing
concavity_se	numeric	191 unique values 0 missing
concavity_worst	numeric	179 unique values 0 missing
concave_points_mean	numeric	183 unique values 0 missing
concave_points_se	numeric	180 unique values 0 missing
concave_points_worst	numeric	187 unique values 0 missing
symmetry_mean	numeric	169 unique values 0 missing
symmetry_se	numeric	187 unique values 0 missing
symmetry_worst	numeric	193 unique values 0 missing
fractal_dimension_mean	numeric	181 unique values 0 missing
fractal_dimension_se	numeric	188 unique values 0 missing
fractal_dimension_worst	numeric	185 unique values 0 missing
tumor_size	numeric	39 unique values 0 missing
lymph_node_status	numeric	22 unique values 0 missing
smoothness_worst	numeric	193 unique values 0 missing
smoothness_se	numeric	192 unique values 0 missing
smoothness_mean	numeric	189 unique values 0 missing
area_worst	numeric	187 unique values 0 missing
area_se	numeric	192 unique values 0 missing
area_mean	numeric	190 unique values 0 missing
perimeter_worst	numeric	172 unique values 0 missing
perimeter_se	numeric	185 unique values 0 missing
perimeter_mean	numeric	192 unique values 0 missing
texture_worst	numeric	188 unique values 0 missing
texture_se	numeric	175 unique values 0 missing
texture_mean	numeric	188 unique values 0 missing
radius_worst	numeric	177 unique values 0 missing
radius_se	numeric	190 unique values 0 missing
radius_mean	numeric	174 unique values 0 missing

Show all 33 features

107 properties

NumberOfInstances

194

Number of instances (rows) of the dataset.

NumberOfFeatures

Number of attributes (columns) of the dataset.

NumberOfClasses

Number of distinct values of the target attribute (if it is nominal).

NumberOfMissingValues

Number of missing values in the dataset.

NumberOfInstancesWithMissingValues

Number of instances with at least one value missing.

NumberOfNumericFeatures

Number of numeric attributes.

NumberOfSymbolicFeatures

Number of nominal attributes.

AutoCorrelation

-29.74

Average class difference between consecutive instances.

CfsSubsetEval_DecisionStumpAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_DecisionStumpErrRate

Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_DecisionStumpKappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_NaiveBayesAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_NaiveBayesErrRate

Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_NaiveBayesKappa

Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_kNN1NAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_kNN1NErrRate

Error rate achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

CfsSubsetEval_kNN1NKappa

Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

ClassEntropy

Entropy of the target attribute values.

DecisionStumpAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump

DecisionStumpErrRate

Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump

DecisionStumpKappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump

Dimensionality

0.17

Number of attributes divided by the number of instances.

EquivalentNumberOfAtts

Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation.

J48.00001.AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .00001

J48.00001.ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .00001

J48.00001.Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .00001

J48.0001.AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .0001

J48.0001.ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .0001

J48.0001.Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .0001

J48.001.AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .001

J48.001.ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .001

J48.001.Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .001

MajorityClassPercentage

Percentage of instances belonging to the most frequent class.

MajorityClassSize

Number of instances belonging to the most frequent class.

MaxAttributeEntropy

Maximum entropy among attributes.

MaxKurtosisOfNumericAtts

26.3

Maximum kurtosis among attributes of the numeric type.

MaxMeansOfNumericAtts

1401.76

Maximum of means among attributes of the numeric type.

MaxMutualInformation

Maximum mutual information between the nominal attributes and the target attribute.

MaxNominalAttDistinctValues

The maximum number of distinct values among attributes of the nominal type.

MaxSkewnessOfNumericAtts

3.9

Maximum skewness among attributes of the numeric type.

MaxStdDevOfNumericAtts

587.04

Maximum standard deviation of attributes of the numeric type.

MeanAttributeEntropy

Average entropy of the attributes.

MeanKurtosisOfNumericAtts

2.72

Mean kurtosis among attributes of the numeric type.

MeanMeansOfNumericAtts

86.32

Mean of means among attributes of the numeric type.

MeanMutualInformation

Average mutual information between the nominal attributes and the target attribute.

MeanNoiseToSignalRatio

An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation.

MeanNominalAttDistinctValues

Average number of distinct values among the attributes of the nominal type.

MeanSkewnessOfNumericAtts

1.12

Mean skewness among attributes of the numeric type.

MeanStdDevOfNumericAtts

33.4

Mean standard deviation of attributes of the numeric type.

MinAttributeEntropy

Minimal entropy among attributes.

MinKurtosisOfNumericAtts

-0.81

Minimum kurtosis among attributes of the numeric type.

MinMeansOfNumericAtts

Minimum of means among attributes of the numeric type.

MinMutualInformation

Minimal mutual information between the nominal attributes and the target attribute.

MinNominalAttDistinctValues

The minimal number of distinct values among attributes of the nominal type.

MinSkewnessOfNumericAtts

-0.13

Minimum skewness among attributes of the numeric type.

MinStdDevOfNumericAtts

Minimum standard deviation of attributes of the numeric type.

MinorityClassPercentage

Percentage of instances belonging to the least frequent class.

MinorityClassSize

Number of instances belonging to the least frequent class.

NaiveBayesAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes

NaiveBayesErrRate

Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes

NaiveBayesKappa

Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes

NumberOfBinaryFeatures

Number of binary attributes.

PercentageOfBinaryFeatures

Percentage of binary attributes.

PercentageOfInstancesWithMissingValues

Percentage of instances having missing values.

PercentageOfMissingValues

Percentage of missing values.

PercentageOfNumericFeatures

100

Percentage of numeric attributes.

PercentageOfSymbolicFeatures

Percentage of nominal attributes.

Quartile1AttributeEntropy

First quartile of entropy among attributes.

Quartile1KurtosisOfNumericAtts

0.44

First quartile of kurtosis among attributes of the numeric type.

Quartile1MeansOfNumericAtts

0.09

First quartile of means among attributes of the numeric type.

Quartile1MutualInformation

First quartile of mutual information between the nominal attributes and the target attribute.

Quartile1SkewnessOfNumericAtts

0.58

First quartile of skewness among attributes of the numeric type.

Quartile1StdDevOfNumericAtts

0.02

First quartile of standard deviation of attributes of the numeric type.

Quartile2AttributeEntropy

Second quartile (Median) of entropy among attributes.

Quartile2KurtosisOfNumericAtts

1.71

Second quartile (Median) of kurtosis among attributes of the numeric type.

Quartile2MeansOfNumericAtts

0.36

Second quartile (Median) of means among attributes of the numeric type.

Quartile2MutualInformation

Second quartile (Median) of mutual information between the nominal attributes and the target attribute.

Quartile2SkewnessOfNumericAtts

0.99

Second quartile (Median) of skewness among attributes of the numeric type.

Quartile2StdDevOfNumericAtts

0.17

Second quartile (Median) of standard deviation of attributes of the numeric type.

Quartile3AttributeEntropy

Third quartile of entropy among attributes.

Quartile3KurtosisOfNumericAtts

3.58

Third quartile of kurtosis among attributes of the numeric type.

Quartile3MeansOfNumericAtts

21.65

Third quartile of means among attributes of the numeric type.

Quartile3MutualInformation

Third quartile of mutual information between the nominal attributes and the target attribute.

Quartile3SkewnessOfNumericAtts

1.59

Third quartile of skewness among attributes of the numeric type.

Quartile3StdDevOfNumericAtts

4.91

Third quartile of standard deviation of attributes of the numeric type.

REPTreeDepth1AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 1

REPTreeDepth1ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 1

REPTreeDepth1Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 1

REPTreeDepth2AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 2

REPTreeDepth2ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 2

REPTreeDepth2Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 2

REPTreeDepth3AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 3

REPTreeDepth3ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 3

REPTreeDepth3Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 3

RandomTreeDepth1AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

RandomTreeDepth1ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

RandomTreeDepth1Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

RandomTreeDepth2AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

RandomTreeDepth2ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

RandomTreeDepth2Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

RandomTreeDepth3AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

RandomTreeDepth3ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

RandomTreeDepth3Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

StdvNominalAttDistinctValues

Standard deviation of the number of distinct values among attributes of the nominal type.

kNN1NAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk

kNN1NErrRate

Error rate achieved by the landmarker weka.classifiers.lazy.IBk

kNN1NKappa

Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk

Show all 107 properties

7 tasks

Supervised Regression on wisconsin

0 runs - estimation_procedure: 10-fold Crossvalidation - target_feature: time

Supervised Regression on wisconsin

0 runs - estimation_procedure: 5 times 2-fold Crossvalidation - target_feature: time

Supervised Regression on wisconsin

0 runs - estimation_procedure: 10 times 10-fold Crossvalidation - target_feature: time

Supervised Regression on wisconsin

0 runs - estimation_procedure: 10% Holdout set - target_feature: time

Supervised Regression on wisconsin

0 runs - estimation_procedure: 33% Holdout set - target_feature: time

Supervised Regression on wisconsin

0 runs - estimation_procedure: Leave one out - target_feature: time

Supervised Regression on wisconsin

0 runs - estimation_procedure: Test on Training Data - target_feature: time

Define a new task

Sign in

wisconsin

33 features

107 properties

7 tasks