Under the guidance of Dr. Miroslav Kubat, these projects move from the basics of examining data to quantifiable methods for calculating similarities in multi-dimensional settings. Some of Dr. Kubat's research papers can be found here.

Machine Learning

Nearest Neighbors and Tomek Links

Our assignment was to create a randomized set of points on a 2-dimensional plane, assign each point a class, and then experiment with discovering noise within those class assignments using the k-nearest-neighbor method and by identifying what are referred to as 'Tomek links'. Approximately 50% of the area of the 2D plane represented the positive class, and the rest the negative class. With differing amounts of noise (wrongly classified points) injected into the testing set, we ran the two algorithms mentioned above to weed out the improperly classified examples.
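For illustration, here is a minimal sketch of the Tomek-link test on synthetic 2-D data. It is written in Java for consistency with the Weka examples further down (our original assignment was implemented in C++), and the positive region, the 10% noise rate, and the rule of flagging both endpoints of a link are assumptions made for the sketch, not the exact settings from our report:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class TomekLinkDemo {

        // A labeled point on the 2-D plane.
        static final class Point {
            final double x, y;
            final boolean positive;
            Point(double x, double y, boolean positive) {
                this.x = x; this.y = y; this.positive = positive;
            }
        }

        // Squared Euclidean distance (ordering is all we need).
        static double dist2(Point a, Point b) {
            double dx = a.x - b.x, dy = a.y - b.y;
            return dx * dx + dy * dy;
        }

        // Index of the nearest neighbor of pts.get(i).
        static int nearest(List<Point> pts, int i) {
            int best = -1;
            double bestD = Double.POSITIVE_INFINITY;
            for (int j = 0; j < pts.size(); j++) {
                if (j == i) continue;
                double d = dist2(pts.get(i), pts.get(j));
                if (d < bestD) { bestD = d; best = j; }
            }
            return best;
        }

        // A pair forms a Tomek link when the two points carry opposite labels
        // and each is the other's nearest neighbor. Both endpoints are flagged
        // here as suspected noise.
        static boolean[] flagTomekLinks(List<Point> pts) {
            boolean[] suspect = new boolean[pts.size()];
            for (int i = 0; i < pts.size(); i++) {
                int j = nearest(pts, i);
                if (pts.get(i).positive != pts.get(j).positive
                        && nearest(pts, j) == i) {
                    suspect[i] = true;
                    suspect[j] = true;
                }
            }
            return suspect;
        }

        public static void main(String[] args) {
            Random rng = new Random(42);
            List<Point> pts = new ArrayList<>();
            for (int n = 0; n < 200; n++) {
                double x = rng.nextDouble(), y = rng.nextDouble();
                // A central box covering roughly half the unit square is positive.
                boolean label = x > 0.15 && x < 0.85 && y > 0.15 && y < 0.85;
                if (rng.nextDouble() < 0.10) label = !label; // inject 10% class noise
                pts.add(new Point(x, y, label));
            }
            boolean[] suspect = flagTomekLinks(pts);
            int flagged = 0;
            for (boolean s : suspect) if (s) flagged++;
            System.out.println("Flagged " + flagged + " of " + pts.size()
                    + " points as suspected noise.");
        }
    }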

Below is a series of screenshots portraying the algorithm's ability to remove points it perceives to be marked incorrectly. A brief synopsis of the images: the grey box in the middle of the window represents the region designated for positive examples. Blue dots represent positively classified examples, and red dots negative examples. In the images labeled as having data removed, the hollow white circles represent data points where the algorithm detected noise within the data set. The purer each region is in its respective color, the more successful our algorithm was.

The report we drafted from these experiments goes into more detail about the processes used to obtain these results; feel free to read through it if you like.


Unprocessed Data Sets

Data Sets Processed by our Algorithm

The testing environment was programmed in C++ using Dr. Stephen Murrell's graphics library, with Visual Studio 2008 Professional as our IDE. The accuracy reports (the bottom three links above) were generated using MATLAB.

Decision Tree Analysis using Weka

This project involved experimenting with Weka, an open-source Java application featuring many prominent Machine Learning algorithms and a GUI for running customized tests and trials on data sets. After becoming familiar with the environment, the objective was to use the J48 decision-tree-inducing algorithm (Weka's implementation of Ross Quinlan's C4.5) and apply two pruning methods, testing the tree's classification accuracy after each tactic.
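Although we ran our trials through the GUI, the same tree induction can be scripted against Weka's Java API. A minimal sketch (the ARFF file name is a placeholder):

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Demo {
        public static void main(String[] args) throws Exception {
            // Load a data set in Weka's ARFF format; the file name is a placeholder.
            Instances data = DataSource.read("dataset.arff");
            data.setClassIndex(data.numAttributes() - 1); // last attribute is the class

            J48 tree = new J48();        // Weka's implementation of Quinlan's C4.5
            tree.buildClassifier(data);  // induce the decision tree
            System.out.println(tree);    // print the induced tree in text form
        }
    }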

The two pruning methods used were online pruning and post pruning. Post pruning operates on a decision tree after it has been induced: statistically irrelevant nodes are removed and replaced with classifications or smaller subtrees. Online pruning occurs while the decision tree is being formed, removing branches when there appears to be insufficient information gain. If you are interested in pruning topics, feel free to read more in our report linked below; a sketch comparing the two approaches follows.
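In Weka's J48, post pruning is the default behavior, and something in the spirit of online pruning can be approximated by raising the minimum number of instances a node must hold, which stops weak branches from being grown in the first place; that mapping to our report's terminology is an approximation, not Weka's own naming. A sketch comparing the settings by ten-fold cross-validation (again with a placeholder data file):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PruningComparison {

        // Build the tree on the full set (to inspect its size), then estimate
        // classification accuracy with 10-fold cross-validation.
        static void evaluate(String label, J48 tree, Instances data) throws Exception {
            tree.buildClassifier(data);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.printf("%-16s tree size %3.0f, %5.2f%% correct%n",
                    label, tree.measureTreeSize(), eval.pctCorrect());
        }

        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("dataset.arff"); // placeholder file name
            data.setClassIndex(data.numAttributes() - 1);

            J48 unpruned = new J48();
            unpruned.setUnpruned(true);      // baseline: no pruning at all

            J48 online = new J48();
            online.setUnpruned(true);
            online.setMinNumObj(10);         // refuse to split small nodes while growing

            J48 post = new J48();
            post.setConfidenceFactor(0.25f); // C4.5-style pruning of the finished tree

            evaluate("unpruned", unpruned, data);
            evaluate("online-style", online, data);
            evaluate("post-pruned", post, data);
        }
    }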


Online Pruning

Post Pruning
