In Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
Will defend his dissertation
Today, an unprecedented amount of data is available from diverse sources, and the field of knowledge discovery has consequently attracted considerable attention in recent years from researchers and industry alike. However, many real-world datasets are imbalanced. Learning from imbalanced data still poses a major challenge and has been recognized as an area in need of significant research. The difficulty centers on the performance of learning algorithms in the presence of underrepresented data and severely skewed class distributions: models trained on imbalanced datasets strongly favor the majority class and largely ignore the minority class. The approaches introduced to date offer solutions at both the algorithmic and the data level, but both families have been criticized for their lack of generalization, their tendency to discard important information, and their susceptibility to over-fitting.
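As a toy illustration of this skew (hypothetical data, not drawn from the dissertation): with a 9:1 class ratio, a degenerate model that always predicts the majority class appears highly accurate while never detecting a single minority instance.

```python
# Hypothetical binary dataset with a 9:1 class imbalance. The trivial
# "always predict the majority class" model reaches 90% accuracy while
# its recall on the minority class is zero -- accuracy alone hides the
# failure on the underrepresented class.
labels = [0] * 90 + [1] * 10   # 90 majority instances, 10 minority instances
predictions = [0] * 100        # degenerate majority-class model

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
minority_recall = (
    sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    / sum(y == 1 for y in labels)
)

print(accuracy)         # 0.9
print(minority_recall)  # 0.0
```

This is why imbalanced-learning work is typically evaluated with class-sensitive measures rather than raw accuracy.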
The goal of this thesis is to develop algorithms that balance imbalanced datasets so that each classifier can reach optimal predictions. The specific objectives are to: (i) develop sampling methods for imbalanced data, (ii) develop a framework capable of determining which sampling method to use for a given imbalanced dataset, (iii) evaluate the performance of these methods on a variety of inherently imbalanced datasets, and (iv) develop a new risk prediction framework for cardiovascular events using machine learning methods.
We propose a method for filtering over-sampled data using non-cooperative game theory. It addresses the imbalanced data issue by formulating the problem as a non-cooperative game in which all data points are players and the goal is to uniformly and consistently label all of the synthetic data created by any over-sampling technique. The proposed algorithm requires no prior assumptions and selects representative synthetic instances while generating very little noisy data. In addition, we propose a technique for addressing the imbalanced data problem using semi-supervised learning: from a supervised problem we create a semi-supervised one, and then use a semi-supervised learning method to identify the most relevant instances and establish a well-defined training set. The method integrates under-sampling and semi-supervised learning (US-SSL) to tackle the imbalance problem. On average, across three different classifiers, the proposed algorithm significantly outperforms all other sampling algorithms in 67% of cases and ranks second best in the remaining 33% of cases. Finally, we propose a novel framework based on the US-SSL algorithm that selects the appropriate semi-supervised algorithm to balance and refine a given dataset and thereby establish a well-defined training set. In summary, our framework helps answer two important questions: "Given an imbalanced dataset, which under-sampling method should be used?" and "Which method will perform the best?" We present extensive experimental results over a large collection of datasets, using three different classifiers, to demonstrate the advantages of our methods.
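The coupling of under-sampling with semi-supervised learning can be sketched in a much-simplified form: under-sample the majority class down to the minority-class size, and hand the discarded majority instances to a semi-supervised step as an unlabeled pool from which only the most relevant instances are re-admitted. The helper below is a hypothetical simplification for illustration only, not the dissertation's US-SSL algorithm; the name `undersample_to_balance` and its interface are assumptions.

```python
import random

def undersample_to_balance(instances, seed=0):
    """Randomly under-sample the majority class to the minority-class size.

    `instances` is a list of (features, label) pairs with binary labels.
    Returns (balanced, unlabeled): a class-balanced training set, plus the
    discarded majority instances stripped of their labels so a subsequent
    semi-supervised step can decide which of them to re-admit.
    """
    by_label = {}
    for x, y in instances:
        by_label.setdefault(y, []).append((x, y))
    minority = min(by_label, key=lambda y: len(by_label[y]))
    majority = max(by_label, key=lambda y: len(by_label[y]))

    pool = list(by_label[majority])
    random.Random(seed).shuffle(pool)          # reproducible random choice
    k = len(by_label[minority])
    kept, discarded = pool[:k], pool[k:]

    balanced = by_label[minority] + kept
    unlabeled = [x for x, _ in discarded]      # labels withheld for SSL step
    return balanced, unlabeled
```

For example, eight majority and three minority instances yield a six-instance balanced set and a five-instance unlabeled pool; a self-training or similar semi-supervised learner would then score that pool and keep only confidently labeled instances.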
Date: Wednesday, August 27, 2014
Time: 9:00 AM
Place: HBS 317
Faculty, students, and the general public are invited.
Advisor: Prof. Ioannis Kakadiaris
Committee: Profs. Christoph Eick, Shishir Shah, Panagiotis Tsiamyrtzis and Ricardo Vilalta