Research
Title: Dataset Selection for Aggregate Model Implementation in Predictive Data Mining
Date of graduation: 2 September, 2010
In her thesis, Patricia Lutu studied the design of aggregate classification
models and the related selection of training datasets from the large volumes
of data commonly encountered in data mining. The objectives of the study were to establish
aggregate model design methods, feature selection methods and training instance
selection methods that result in classification models with a high level of
predictive performance. New methods were designed for feature ranking, and a
new algorithm was designed for feature subset search to identify the best
predictive features for classification modeling. Two methods of aggregate
modeling were studied: One-Vs-All (OVA) and positive-Vs-negative (pVn)
modeling. OVA is an existing method that has been used for small datasets,
whereas pVn is a new method of aggregate modeling proposed by Patricia Lutu.
The sparse confusion matrix property was defined and used as the basis for a
new construct, called a confusion graph, which in turn served as the basis
for the design of base models for aggregation. Two new algorithms were
designed to process confusion graphs. A new algorithm was designed to resolve
tied predictions that arise in aggregate modeling. Experimental work was
conducted to demonstrate the performance of the proposed feature and training
instance selection methods, aggregate modeling methods, and new algorithms.
Theoretical models were developed to specify the relationships between the
factors that affect the quality of selected features and aggregate model
performance. Guidelines were developed for feature selection, training instance
selection and aggregate modeling from large datasets.
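As a rough illustration of the One-Vs-All idea mentioned above, the following Python sketch trains one binary base model per class and aggregates their scores into a single prediction. It is not the thesis's implementation: the nearest-centroid scorer stands in for whatever base classifier is actually used, and the function names (centroid, train_ova, predict_ova) and the toy data are hypothetical. Ties are simply broken by taking the first maximum; the tie-resolution algorithm designed in the thesis is not reproduced here.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def centroid(points):
    """Return the component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def train_ova(X, y):
    """Train one binary base model per class: that class versus all the others.

    Assumption: each base model is just a pair of centroids (positive, negative).
    """
    models = {}
    for label in sorted(set(y)):
        pos = [x for x, t in zip(X, y) if t == label]
        neg = [x for x, t in zip(X, y) if t != label]
        models[label] = (centroid(pos), centroid(neg))
    return models


def predict_ova(models, x):
    """Score x with every base model and return the highest-scoring class."""
    scores = {
        label: dist(x, neg_c) - dist(x, pos_c)  # larger means closer to the class
        for label, (pos_c, neg_c) in models.items()
    }
    return max(scores, key=scores.get)  # first maximum wins on ties


# Hypothetical toy data: three well-separated classes in two dimensions.
X = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.0, 5.0), (0.1, 5.2)]
y = ["a", "a", "b", "b", "c", "c"]
models = train_ova(X, y)
print(predict_ova(models, (0.1, 0.0)))
```

Each base model only has to separate one class from the rest, and the aggregate prediction is whichever class its own base model scores highest.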
The thesis document is available on the University of Pretoria UPeTD website.