Terminology:
Regression: Regression is a statistical approach for determining the relation between two attributes.
Regression indetermination coefficient: The regression indetermination coefficient measures the quality of a regression. The higher the indetermination coefficient, the lower the quality of the regression.
Clustering: Clustering is the process of partitioning data points into meaningful subclasses called clusters.
Clustering algorithms: The main goal of clustering algorithms is to categorize data into clusters such that objects are grouped in the same cluster when they are similar according to specific metrics. For big data, these algorithms are usually judged by how well they handle volume, variety, and velocity.
Introduction:
The aim of this report is to describe in detail a new clustering algorithm that partitions a dataset into clusters under the condition that the regression indetermination coefficient of each cluster is minimal. I also discuss the proposal and implementation of this algorithm. The algorithm is developed using a genetic programming approach.
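To make the clustering condition concrete, here is a minimal sketch of how a candidate partition could be scored. It assumes the indetermination coefficient is the complement of the coefficient of determination (1 - R^2) of a least-squares line, and it assumes the per-cluster values are simply summed. The report does not spell out either choice, and the function names and data are hypothetical.

```python
import numpy as np

def indetermination(x, y):
    """Return 1 - R^2 for a least-squares line fitted to (x, y).

    Assumes the cluster holds at least two points with distinct x values.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))       # unexplained variation
    ss_tot = float(np.sum((y - y.mean()) ** 2))  # total variation
    return ss_res / ss_tot if ss_tot > 0 else 0.0

def partition_cost(x, y, labels):
    """Sum the indetermination coefficients over all clusters (assumed aggregate)."""
    x, y, labels = np.asarray(x), np.asarray(y), np.asarray(labels)
    return sum(indetermination(x[labels == k], y[labels == k])
               for k in np.unique(labels))

# Hypothetical data: two groups that follow different linear trends.
x = np.array([0, 1, 2, 3, 0, 1, 2, 3], dtype=float)
y = np.array([0, 1, 2, 3, 3, 2, 1, 0], dtype=float)

good = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # one trend per cluster
bad = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # trends mixed within each cluster

print("cost of good partition:", partition_cost(x, y, good))
print("cost of bad partition:", partition_cost(x, y, bad))
```

A search procedure would then look for the labelling with the lowest cost. The partition that keeps each linear trend in its own cluster scores far lower than the one that mixes them.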
We use genetic algorithms because they provide near-optimal (sub-optimal) solutions. Most clustering algorithms try to minimize intra-cluster distances and maximize inter-cluster distances; however, a few applications, such as cellular network planning, instead need to minimize the indetermination coefficient. Regression is one of the most widely used techniques for data analysis.
Its applications range from the analysis of experimental data to modern data mining. Below I discuss, with an example, how regression helps in dividing data into clusters.
Idea of considering regression as a clustering method, with an example:
Suppose a population of fish is studied during ichthyological research. We assume that all the fish belong to one species because a few attributes, such as size, shape, and color, cannot be described. To study the effect of water temperature on the average swimming speed of the fish, we acquire a set of data from several measurements.
Now we plot the data (Figure 1), with the speed of the fish on the Y axis and the temperature on the X axis, and build a function that describes it. One approach is to build a linear regression function for the data. Having obtained the function, we look for the relation between the average swimming speed and the temperature. From the linear regression function we predict that “as the temperature increases, the average speed of the fish decreases”, although this prediction has a high error rate because the residuals are large.
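The fit described above can be sketched as follows. The measurements are invented for illustration and are not the data behind Figure 1, and the indetermination coefficient is assumed to be 1 - R^2.

```python
import numpy as np

# Hypothetical measurements (not from the report): water temperature in
# degrees Celsius and average swimming speed in arbitrary units.
temperature = np.array([10, 12, 14, 16, 18, 20, 22, 24], dtype=float)
speed = np.array([9.0, 4.0, 8.5, 3.5, 7.0, 2.5, 6.5, 1.5])

# Least-squares line: speed = slope * temperature + intercept
slope, intercept = np.polyfit(temperature, speed, 1)
predicted = slope * temperature + intercept
residuals = speed - predicted

# Assumed definition: share of the variation in speed that the line
# does not explain (1 - R^2).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((speed - speed.mean()) ** 2)
indetermination = ss_res / ss_tot

print(f"speed = {slope:.3f} * temperature + {intercept:.3f}")
print(f"indetermination coefficient = {indetermination:.3f}")
```

With data of this kind the slope is negative, matching the prediction that speed falls as temperature rises. The large residuals still leave most of the variation unexplained, which is the high error the text refers to.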