Principal Components Analysis

Introduction

In data mining you often encounter situations where there are a large number of variables in the database. In such situations it is very likely that subsets of variables are highly correlated with each other. The accuracy and reliability of a classification or prediction model will suffer if you include highly correlated variables or variables that are unrelated to the outcome of interest. Superfluous variables can increase the data-collection and data-processing costs of deploying a model on a large database. The dimensionality of a model is the number of independent or input variables used by the model. One of the key steps in data mining is finding ways to reduce dimensionality without sacrificing accuracy.

Principal component analysis (PCA) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The objective of principal component analysis is to reduce the dimensionality (number of variables) of the dataset while retaining most of the original variability in the data. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible while being uncorrelated with the components before it.

This procedure performs principal component analysis on the selected dataset. Principal component analysis is concerned with explaining the variance-covariance structure of a high-dimensional random vector through a few linear combinations of the original component variables. Consider a p-dimensional random vector X = (X1, X2, ..., Xp). The k principal components (k ≤ p) of X are k univariate random variables Y1, Y2, ..., Yk, which are defined by the following formulae:

Y1 = l1'X = l11 X1 + l12 X2 + ... + l1p Xp

Y2 = l2'X = l21 X1 + l22 X2 + ... + l2p Xp

...

Yk = lk'X = lk1 X1 + lk2 X2 + ... + lkp Xp

where the coefficient vectors l1, l2, ..., lk are chosen such that they satisfy the following conditions:

First Principal Component = Linear combination l1'X that maximizes Var(l1'X) subject to || l1 || = 1

Second Principal Component = Linear combination l2'X that maximizes Var(l2'X) subject to || l2 || = 1 and Cov(l1'X, l2'X) = 0

j th Principal Component = Linear combination lj'X that maximizes Var(lj'X) subject to || lj || = 1 and Cov(li'X, lj'X) = 0 for all i < j

This says that the principal components are those linear combinations of the original variables which maximize the variance of the linear combination and which have zero covariance (and hence zero correlation) with the previous principal components.
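The conditions above can be checked numerically. A standard result, not stated in this section, is that they are satisfied by the eigenvectors of the covariance matrix of X, taken in order of decreasing eigenvalue, and that the variance of each component equals the corresponding eigenvalue. The following is a minimal sketch under that assumption, using NumPy on synthetic data. The function name principal_components and the example variables are illustrative only and do not describe how this procedure is implemented.

```python
import numpy as np


def principal_components(X):
    """Illustrative helper: PCA of an (n, p) data matrix via the covariance matrix.

    Returns the component variances (eigenvalues, largest first), the
    coefficient vectors as columns of `loadings`, and the component scores.
    """
    Xc = X - X.mean(axis=0)                 # center each variable
    cov = np.cov(Xc, rowvar=False)          # p x p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to descending variance
    eigvals = eigvals[order]
    loadings = eigvecs[:, order]            # column j holds the j-th coefficient vector
    scores = Xc @ loadings                  # column j holds the j-th component
    return eigvals, loadings, scores


# Synthetic data (an assumption for illustration): x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

eigvals, loadings, scores = principal_components(X)
score_cov = np.cov(scores, rowvar=False)

print("component variances:", np.round(eigvals, 3))
print("norms of coefficient vectors:", np.round(np.linalg.norm(loadings, axis=0), 6))
print("covariance of components:")
print(np.round(score_cov, 6))
```

The printed covariance matrix of the components should be diagonal up to rounding error, with the component variances on the diagonal, which is the zero-covariance condition; the norms should all equal 1.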

It can be proved that there are exactly p such linear combinations. Typically, however, the first few of them explain most of the variance in the original data. So instead of working with all the original variables X1, X2, ..., Xp, you would typically first perform PCA and then use only the first two or three principal components, for example Y1 and Y2, in subsequent analysis.
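One common way to decide how many components to keep is to look at the share of total variance each component explains and to retain the smallest number whose cumulative share reaches a chosen threshold. The sketch below illustrates this with NumPy. The 90% threshold, the helper names, and the synthetic five-variable dataset (built from two underlying factors) are assumptions made for the example, not settings of this procedure.

```python
import numpy as np


def explained_variance_ratio(X):
    """Share of total variance explained by each component, largest first."""
    cov = np.cov(X, rowvar=False)            # np.cov centers the data itself
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh is ascending; reverse it
    return eigvals / eigvals.sum()


def components_needed(ratios, threshold=0.9):
    """Smallest number of leading components whose cumulative share >= threshold."""
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratios))               # guard against rounding at threshold = 1


# Synthetic data (an assumption for illustration): five observed variables
# driven by two independent factors plus a small amount of noise.
rng = np.random.default_rng(1)
n = 1000
factors = rng.normal(size=(n, 2))
weights = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0, 1.0]])
X = factors @ weights + 0.05 * rng.normal(size=(n, 5))

ratios = explained_variance_ratio(X)
k = components_needed(ratios, threshold=0.9)

print("explained variance ratios:", np.round(ratios, 3))
print("components needed for 90% of the variance:", k)
```

Because the five variables here carry only two independent sources of variation, two components capture nearly all of the variance, so the remaining three can be dropped with little loss.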
