
StupidFilter Data Overview

This is a binary classification problem with eight attributes. The goal is to identify potentially stupid forum comments using only basic textual properties. The attributes are as follows:

  1. percentage of characters which are lower case
  2. percentage of characters which are upper case
  3. percentage of characters which are punctuation marks
  4. incidence of repeat emphasis
  5. percentage of words beginning with a capital letter
  6. percentage of words containing a capital letter but beginning with a lower-case letter
  7. average word length score
  8. incidence of misspellings
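As a rough illustration of what the simpler attributes measure, here is a minimal sketch in Python. It is not the original site's reference code: the function name `basic_text_features`, the dictionary keys, and the choice to ignore whitespace when counting characters are all assumptions. Attributes 4, 7, and 8 are left out because this page does not define how they are scored.

```python
import string


def basic_text_features(comment: str) -> dict:
    """Hypothetical versions of attributes 1, 2, 3, 5 and 6, as fractions in [0, 1]."""
    # Assumption: whitespace is not counted as a character.
    chars = [c for c in comment if not c.isspace()]
    words = comment.split()
    n_chars = max(len(chars), 1)  # avoid division by zero on empty input
    n_words = max(len(words), 1)
    return {
        # 1. fraction of characters that are lower case
        "lower": sum(c.islower() for c in chars) / n_chars,
        # 2. fraction of characters that are upper case
        "upper": sum(c.isupper() for c in chars) / n_chars,
        # 3. fraction of characters that are ASCII punctuation marks
        "punct": sum(c in string.punctuation for c in chars) / n_chars,
        # 5. fraction of words beginning with a capital letter
        "cap_initial": sum(w[0].isupper() for w in words) / n_words,
        # 6. fraction of words with a capital letter after a lower-case first letter
        "cap_inner": sum(
            w[0].islower() and any(c.isupper() for c in w[1:]) for w in words
        ) / n_words,
    }
```

The downloadable files already contain all eight attributes, so nothing like this is needed to work with the data; it only shows the kind of property each column encodes.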

For more detailed information on the data, the original site has both a summary and reference source code available.

Download the formatted data here:

Attach:SF_training.txt Attach:SF_testing.txt

The training and testing files are in comma-separated value (CSV) format, without gaps. The first 8 columns are the attributes, pre-normalized to [0,1]. The 9th column is the class, either 0 or 1.

MATLAB, SciPy, and most other analysis packages have built-in support for CSV/text data, so importing the data should rarely require custom code.
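For anyone who prefers to see the layout spelled out, the following sketch reads one of the files with only the Python standard library. The function name `load_sf` is hypothetical, and the code assumes the file has no header row, as the description above suggests. With NumPy installed, `numpy.loadtxt(path, delimiter=",")` should give the same numbers as a single array.

```python
import csv


def load_sf(path):
    """Read one StupidFilter CSV file into (attributes, labels)."""
    X, y = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:  # skip blank lines, if any
                continue
            X.append([float(v) for v in row[:8]])  # columns 1-8: attributes
            y.append(int(float(row[8])))           # column 9: class, 0 or 1
    return X, y


# Example use, assuming the attachment was saved in the working directory:
# X_train, y_train = load_sf("SF_training.txt")
```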

CN550 Spring 2009 Results

# | Name | System | Parameters-Settings | % Correct | C-index | Runtime | Details/Link
1 | Best | All predictions correct | | 100 | 1.0 | 0 |
2 | Worst | All predictions wrong | | 0 | 0 | 0 |
3 | All IN | All predictions IN | | 50 | 0.5 | 0 |
4 | All OUT | All predictions OUT | | 50 | 0.5 | 0 |
5 | Chance | Random IN/OUT | | 50 | 0.5 | 0 |
N | your name | model type | key parameters/settings | | | | further details or link to a details page
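The two score columns can be computed along the following lines. This is a sketch under assumptions: it takes "C-index" to mean the concordance index (which for a two-class problem equals the area under the ROC curve), it treats class 1 as IN, and the function names `percent_correct` and `c_index` are made up for illustration. The C-index needs a continuous score per test case; a system that outputs the same score for every case gets 0.5.

```python
def percent_correct(actual, predicted):
    """Percentage of cases where the predicted class equals the actual class."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)


def c_index(actual, scores):
    """Chance that a random IN case outscores a random OUT case (ties count 1/2)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average 1-based rank over the tie
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_in = sum(1 for a in actual if a == 1)  # assumption: class 1 is IN
    n_out = len(actual) - n_in
    rank_sum_in = sum(r for r, a in zip(ranks, actual) if a == 1)
    # Mann-Whitney U statistic for the IN class, scaled to [0, 1].
    return (rank_sum_in - n_in * (n_in + 1) / 2.0) / (n_in * n_out)
```

The rank-based form avoids comparing every IN case with every OUT case, which would be about 25 million pairs on the 10,000-case test set.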

Format for CIS confusion matrices

# | CIS confusion matrix | # actual IN | # actual OUT | total
1 | # predicted IN | TP: true IN | FP: false IN | TP + FP
2 | # predicted OUT | FN: false OUT | TN: true OUT | FN + TN
3 | total | 4,997 | 5,003 | 10,000
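The four cells can be counted from a list of actual classes and a list of predicted classes, for example as below. The function name `confusion_matrix` and the `in_label` parameter are hypothetical, and treating class 1 as IN is an assumption, since the page does not say which class value is IN.

```python
def confusion_matrix(actual, predicted, in_label=1):
    """Count TP, FP, FN and TN, treating `in_label` as the IN class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == in_label:
            if a == in_label:
                tp += 1  # predicted IN, actually IN
            else:
                fp += 1  # predicted IN, actually OUT
        else:
            if a == in_label:
                fn += 1  # predicted OUT, actually IN
            else:
                tn += 1  # predicted OUT, actually OUT
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
```

Percent correct then follows as 100 * (TP + TN) / (TP + FP + FN + TN).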

1 - Best

# | CIS 1 - Best | # actual IN | # actual OUT | total
1 | # predicted IN | 4,997 | 0 | 4,997
2 | # predicted OUT | 0 | 5,003 | 5,003
3 | total | 4,997 | 5,003 | 10,000

2 - Worst

# | CIS 2 - Worst | # actual IN | # actual OUT | total
1 | # predicted IN | 0 | 5,003 | 5,003
2 | # predicted OUT | 4,997 | 0 | 4,997
3 | total | 4,997 | 5,003 | 10,000

3 - All IN

# | CIS 3 - All IN | # actual IN | # actual OUT | total
1 | # predicted IN | 4,997 | 5,003 | 10,000
2 | # predicted OUT | 0 | 0 | 0
3 | total | 4,997 | 5,003 | 10,000

4 - All OUT

# | CIS 4 - All OUT | # actual IN | # actual OUT | total
1 | # predicted IN | 0 | 0 | 0
2 | # predicted OUT | 4,997 | 5,003 | 10,000
3 | total | 4,997 | 5,003 | 10,000

5 - Chance

# | CIS 5 - Chance | # actual IN | # actual OUT | total
1 | # predicted IN | ~2,498 | ~2,502 | ~5,000
2 | # predicted OUT | ~2,499 | ~2,501 | ~5,000
3 | total | 4,997 | 5,003 | 10,000
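The approximate counts in the chance matrix are what one would expect if each test case were labelled IN with probability 0.5, independently of its true class. A small sketch of that arithmetic, with the hypothetical function name `expected_chance_matrix`:

```python
def expected_chance_matrix(n_in, n_out, p_in=0.5):
    """Expected cell counts when each case is predicted IN with probability p_in."""
    return {
        "TP": n_in * p_in,         # actual IN, predicted IN
        "FP": n_out * p_in,        # actual OUT, predicted IN
        "FN": n_in * (1 - p_in),   # actual IN, predicted OUT
        "TN": n_out * (1 - p_in),  # actual OUT, predicted OUT
    }


cells = expected_chance_matrix(4997, 5003)
```

For the test set totals this gives 2,498.5 for each actual-IN cell and 2,501.5 for each actual-OUT cell, so expected accuracy is (2,498.5 + 2,501.5) / 10,000 = 50%. Any single random run will scatter around these values.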
Page last modified on February 16, 2009, at 08:15 PM