Data Sets for Final Project

Greyhound Races Data Set

A set of racing results data from the Birmingham Greyhound Track in Alabama from late 2005 and early 2006. Each feature vector consists of information about ONE DOG in ONE RACE, and includes information about that dog's previous THREE RACES. The class for each feature vector is an integer from 1 to 8 (specifically, an ordinal rank), indicating the position the dog finished in the race.

  • 14320 instances. (For consistency between classifiers, use the first 10320 instances as the training/validation set and the last 4000 instances as the test set.)
  • 66 features.
  • 8 classes.

The Data Set is available in 2 formats:

  • An ARFF file (a mix of numeric and nominal data): GH_data.arff
  • A Matlab-readable text file (all numeric data; the last column, column 73, holds the class; there are 72 "features" in this set because one nominal feature has been binarized): GH_data.m
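
The split described above can be sketched in a few lines of Python. This is only an illustration, assuming the numeric file has already been read into a list of rows with the class in the last column; the function names are made up for this example and are not part of any provided script.

```python
N_TRAIN = 10320  # first instances: training/validation set
N_TEST = 4000    # last instances: test set


def split_greyhound(rows):
    """Split the rows into (train, test) using the fixed sizes from the page."""
    if len(rows) != N_TRAIN + N_TEST:
        raise ValueError("expected %d rows, got %d" % (N_TRAIN + N_TEST, len(rows)))
    return rows[:N_TRAIN], rows[-N_TEST:]


def features_and_class(rows):
    """Separate each row into its feature columns and its class (last column)."""
    features = [row[:-1] for row in rows]
    classes = [int(row[-1]) for row in rows]
    return features, classes
```

Keeping the split positional (rather than random) is what makes results comparable between classifiers.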

Return Function to MAXIMIZE:
This function calculates the total dollars you would have won by betting $1 (to win) on each dog that your classifier predicts will win.
RETURN = (# of times your classifier was CORRECT when it predicted class 1) + (sum of the ODDS feature, given in column 12 of the data set, over all correct predictions of class 1) - (# of times your classifier was INCORRECT when it predicted class 1)
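
As a rough sketch, the return function can be written in Python as follows. It assumes three parallel lists (predicted class, actual class, and the ODDS value for each instance); the function name and argument layout are illustrative, not taken from any provided script.

```python
def betting_return(predicted, actual, odds):
    """Return from betting $1 to win on every dog predicted as class 1.

    predicted, actual: class labels (1-8), one per instance.
    odds: the ODDS feature for the same instances.
    """
    correct = 0
    incorrect = 0
    odds_won = 0.0
    for p, a, o in zip(predicted, actual, odds):
        if p != 1:
            continue  # no bet is placed unless the classifier predicts a win
        if a == 1:
            correct += 1
            odds_won += o
        else:
            incorrect += 1
    return correct + odds_won - incorrect
```

For example, with predictions [1, 1, 2, 1], actual classes [1, 3, 1, 1], and odds [2.5, 4.0, 1.2, 0.5], two bets are correct (odds 2.5 and 0.5) and one is incorrect, so the return is 2 + 3.0 - 1 = 4.0. Note that a missed winner (the third dog) costs nothing, since no bet was placed.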

National Election Survey (NES) Voters Data

A set of data drawn from interviews of individual voters from 1952 to 2004 (27 different interview years).

TRAIN on years 1952 to 2002
TEST on 2004

  • 29820 Instances consisting of:
    • 5745 Republicans
    • 24075 Democrats
  • 25 features (across 19 categories)
  • 2 Classes: Democrat or Republican

The Data Set is available in 2 formats:

  • An ARFF file (includes significant missing data): NES_710_fix.arff
  • A Matlab-readable text file (all numeric data; the last column, column 30, holds the default class; missing data has been replaced with a value of -1): ELECTION_710_fix.m
  • A description of the features and classes can be found here: VOTER_DATA_description_fix.doc
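
A minimal Python sketch of the train/test split and the missing-value convention is shown below. It assumes the numeric rows include the interview year in some column; `year_col` is a hypothetical index, so check the feature description for the real position before using anything like this.

```python
MISSING = -1       # missing-data marker used in the numeric file
FIRST_TRAIN_YEAR = 1952
LAST_TRAIN_YEAR = 2002
TEST_YEAR = 2004


def split_by_year(rows, year_col):
    """Train on 1952-2002 and test on 2004. year_col is a hypothetical index."""
    train = [r for r in rows if FIRST_TRAIN_YEAR <= r[year_col] <= LAST_TRAIN_YEAR]
    test = [r for r in rows if r[year_col] == TEST_YEAR]
    return train, test


def mask_missing(row):
    """Replace the -1 marker with None so it is not mistaken for a real value."""
    return [None if value == MISSING else value for value in row]
```

Whether to mask, impute, or keep the -1 values depends on the classifier; the point is only that -1 is a marker, not a measurement.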

Financial Data from Yahoo!

  • Dataset that matches Versace et al (2004)
    • This alternative set is closer to that proposed in Max's paper. Namely, it has %-based features, rather than absolute dollar amounts, and it uses the Dow Industrial Average fund as the target fund, rather than the NASDAQ index.
    • As with Chuan-Heng's, some features may need normalization, but since nearly all of them are already percentages, they're pretty close to the [0,1] interval already.
    • As of March 20:
      • Excludes days with missing features
      • Defines the class as 0 if the next day's close is <= today's close and 1 otherwise.
      • Contains a test set based on January 2006.
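
The class definition above amounts to the following, sketched in Python under the assumption that the closing prices are available as a list in date order (the function name is illustrative):

```python
def next_day_labels(closes):
    """Label each day 0 if the next day's close is <= that day's close, else 1.

    The final day has no next-day close, so the result is one element
    shorter than the input.
    """
    return [0 if nxt <= today else 1 for today, nxt in zip(closes, closes[1:])]
```

For closes of [10.0, 10.5, 10.5, 9.0, 9.5] this gives [1, 0, 0, 1]; an unchanged close counts as class 0.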

  • Data set v1.0
    • Original fetch.m issue in mathworks.
    • Feature spec:
      • It is just the raw data from Yahoo!, aligned with reference days. Each row looks like:
        • Pre1.Feature1 Pre1.Feature2 ... Pre2.Feature1 Pre2.Feature2 ... CurrentDay.TargetClass
    • Usage:
      • use run.m to generate the dataset.
      • if you want to generate your own dataset, you can:
        • modify date in RETRIEVE_FINANCIAL_DATA.m
        • modify reference days in run.m
        • modify target class in run.m (Date/Open/High/Low/Close/Volume/Adj. Close)
        • rewrite Merge.m if you want to do some advanced feature selection.
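
To make the v1.0 row layout concrete, here is a small Python sketch of how such rows could be assembled. It is not the logic of run.m or Merge.m; it assumes that "Pre1" means the day immediately before the current day, "Pre2" the day before that, and so on, and that the target is one raw column of the current day.

```python
def lagged_rows(days, n_ref, target_idx):
    """Build rows of the form Pre1.* Pre2.* ... CurrentDay.Target.

    days: per-day feature lists, oldest first.
    n_ref: number of reference (previous) days to include.
    target_idx: column of the current day used as the target.
    """
    rows = []
    for t in range(n_ref, len(days)):
        row = []
        for k in range(1, n_ref + 1):  # k = 1 is the previous day (Pre1)
            row.extend(days[t - k])
        row.append(days[t][target_idx])
        rows.append(row)
    return rows
```

With days [[1, 10], [2, 20], [3, 30], [4, 40]], two reference days, and target column 1, the rows are [2, 20, 1, 10, 30] and [3, 30, 2, 20, 40]. The first n_ref days produce no row because they lack a full history.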

  • Data set v2.0
    • Change from v1.0:
      • One integrated .arff file, aligned by date; missing data is filled in with the previous date's values.
      • MACD added as a feature.
    • Feature spec:
      • Pre1.Comp1.Feature1 Pre1.Comp1.Feature2 ... Pre1.Comp2.Feature1 Pre1.Comp2.Feature2 ... CurrentDay.TargetClass (default target class: Nasdaq CLOSE).
    • Usage:
      • use run.m to generate the dataset.
      • if you want to generate your own dataset, you can:
        • modify N_RETRIEVE_DAYS in run.m
        • modify TARGET_COMP in run.m
        • modify TARGET_FEATURE in run.m
        • modify date in RETRIEVE_FINANCIAL_DATA.m
        • modify lines 8-12 of Merge.m to do advanced feature selection
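
The two v2.0 changes (filling missing data from the previous date, and the MACD feature) can be sketched in Python as below. This is not the code in the scripts. The MACD here uses the conventional 12- and 26-period exponential moving averages, which is an assumption, since the page does not say which parameters were used.

```python
def forward_fill(values):
    """Replace each missing value (None) with the most recent earlier value."""
    filled, last = [], None
    for value in values:
        if value is None:
            value = last  # stays None if nothing has been seen yet
        filled.append(value)
        last = value
    return filled


def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    if not values:
        return []
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for value in values[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out


def macd(closes, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA (12/26 are assumed defaults)."""
    return [f - s for f, s in zip(ema(closes, fast), ema(closes, slow))]
```

A steadily rising series gives a positive MACD (the fast average tracks the price more closely), and a flat series gives a MACD of zero up to rounding.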

  • Script to fetch data from Yahoo!
  • Technical Analysis Indicators
  • Script Repository

Other Interesting Data Sets

CN 550 Benchmarks

Plankton data set

  • Researchers at the University of South Florida, under an ONR grant, have been studying ML approaches to identifying plankton species from images taken by a fancy new underwater camera. Their work is quite recent, just published in the Journal of Machine Learning Research. They're in the process of making their preprocessed data public, but in the meantime they've shared their data (both preprocessed and raw) with us. They've already published the performance of MLP/backprop and various flavors of SVM on the problem, so we have something to reference and compare to.
  • Preprocessed data in arff, C4.5, and sparse formats.
  • Raw image data, divided into validation and test sets.

Microarray Gene Expression Data

Below is the URL for the original data, which is in a tab-separated format with a header line. I will put up the .ARFF file, reference papers, and a concise description of what the numbers represent as soon as I get to my computer, which is having its annual hard-drive problems.


ARFF to Matlab

  • Here is a little Matlab script for converting an ARFF-formatted dataset to a Matlab matrix. It is by no means self-contained. If you choose to use it, you should modify it to suit your needs.
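
For anyone not working in Matlab, the same idea can be sketched in Python. This is not a port of the script above, just a deliberately minimal reader: it assumes dense (not sparse) ARFF, unquoted attribute names and values, and only numeric or nominal attributes. Nominal values are mapped to their index in the declared value list, and "?" is mapped to -1 to mirror the missing-data convention used in the numeric files on this page.

```python
def arff_to_matrix(text, missing=-1.0):
    """Convert simple ARFF text to a list of numeric rows."""
    nominal = []  # per attribute: list of nominal labels, or None if numeric
    rows = []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                spec = line.split(None, 2)[2].strip()
                if spec.startswith("{"):
                    nominal.append([v.strip() for v in spec.strip("{}").split(",")])
                else:
                    nominal.append(None)
            elif lower.startswith("@data"):
                in_data = True
            continue
        row = []
        for value, labels in zip(line.split(","), nominal):
            value = value.strip()
            if value == "?":
                row.append(missing)
            elif labels is None:
                row.append(float(value))
            else:
                row.append(float(labels.index(value)))
        rows.append(row)
    return rows
```

Date and string attributes, quoted values containing commas, and the sparse format are all outside what this sketch handles.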

Matlab to ARFF

  • A Matlab script to generate an ARFF text file from a Matlab matrix.
  • Here's an alternative that does pretty much the same thing but via a weka.core.Instances object.
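
The reverse direction is simple enough to sketch in Python as well. Again, this is not the Matlab script or the weka.core.Instances version. It assumes an all-numeric matrix with an integer class in the last column, and the attribute names f1, f2, ... are placeholders invented for the example.

```python
def matrix_to_arff(rows, relation, class_labels):
    """Render numeric rows as ARFF text with a nominal class attribute.

    rows: lists of numbers, class label (an integer) in the last column.
    class_labels: the allowed class values, e.g. range(1, 9).
    """
    n_features = len(rows[0]) - 1
    lines = ["@relation %s" % relation, ""]
    for i in range(1, n_features + 1):
        lines.append("@attribute f%d numeric" % i)  # placeholder names
    lines.append("@attribute class {%s}" % ",".join(str(c) for c in class_labels))
    lines += ["", "@data"]
    for row in rows:
        values = [repr(float(v)) for v in row[:-1]] + [str(int(row[-1]))]
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"
```

Declaring the class as nominal rather than numeric matters in Weka, since most classifiers there will not accept a numeric class attribute.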

Other scripts