In Partial Fulfillment of the Requirements for the Degree of
Master of Science
Will defend his thesis
Data mining is used to extract interesting and useful knowledge from large pools of data. Due to computational complexity of algorithms and the problems of handling massive data sets, many data mining applications often require days to perform their analysis. This thesis presents the design and evaluation of a high performance computing framework for CLEVER, a prototype-based clustering algorithm which has been successfully used for a wide range of application scenarios. The CLEVER algorithm supports plug-in fitness functions and employs randomized hill climbing to find clusters, maximizing a given fitness function. This Thesis investigates various parallel computing techniques including OpenMP and CUDA. In particular, three parallel versions of CLEVER have been developed: OpenMP C++ CLEVER, OpenMP C CLEVER, and CUDA C CLEVER. The first version is written by C++ and OpenMP relying on object-oriented design. The second and third version are written in C using procedural design and only use very simple data structures. We evaluated the different parallel versions on a benchmark consisting of datasets with 3,000, 360,000 and 2,000,000 objects. Our results indicate that OpenMP C++ CLEVER and CUDA C CLEVER achieve 30x and 37x speedup on AMD Opteron with 48 cores and on Nvidia Tesla M2050, respectively. Moreover, we analyzed the relationship between cluster quality and runtime used; the experimental results indicate that using a high number of CPU cores enables us to find significantly better solutions than the serial versions of CLEVER.