Clustering with selected Principal Components

In the Visualizing Principal Components post, I looked at the Principal Components of the companies in the Dow Jones Industrial Average index over 2012. Today, I want to show how we can use Principal Components to create Clusters (i.e. form groups of similar companies based on their distance from each other).

Let’s start by loading the historical prices for the companies in the Dow Jones Industrial Average index that we saved in the Visualizing Principal Components post.

###############################################################################
# Load Systematic Investor Toolbox (SIT)
# http://systematicinvestor.wordpress.com/systematic-investor-toolbox/
###############################################################################
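# setInternet2 exists only in Windows builds of R, so call it only when available
if (exists('setInternet2')) setInternet2(TRUE)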
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
    source(con)
close(con)

	#*****************************************************************
	# Load historical data
	#****************************************************************** 
	load.packages('quantmod')	
	
	# load data saved in the bt.pca.test() function
	load(file='bt.pca.test.Rdata')

	#*****************************************************************
	# Principal component analysis (PCA), for interesting discussion
	# http://machine-master.blogspot.ca/2012/08/pca-or-polluting-your-clever-analysis.html
	#****************************************************************** 
	prices = data$prices	
	ret = prices / mlag(prices) - 1
	
	p = princomp(na.omit(ret))
	
	loadings = p$loadings[]
	
	x = loadings[,1]
	y = loadings[,2]
	z = loadings[,3]	

To create Clusters, I will use hierarchical cluster analysis via the hclust function in the stats package. The first argument of hclust is a distance (dissimilarity) matrix. To compute the distance matrix, let’s take the first 2 principal components and compute the Euclidean distance between each pair of companies:

	#*****************************************************************
	# Create clusters
	#****************************************************************** 		
	# create and plot clusters based on the first and second principal components
	# note: R 3.1.0 renamed the 'ward' method to 'ward.D'
	hc = hclust(dist(cbind(x,y)), method = 'ward.D')
	plot(hc, axes=FALSE, xlab='', ylab='', sub='', main='Comp 1/2')
	rect.hclust(hc, k=3, border='red')

[Figure plot1.png: dendrogram of clusters based on principal components 1 and 2]
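
The dendrogram shows the grouping visually. To get the group membership as data, the cutree function in the stats package can cut the same tree into k groups. Below is a minimal sketch on made-up two-dimensional loadings; the toy matrix and the company names A through F are assumptions for illustration, not the Dow Jones data:

```r
# toy "loadings" for six hypothetical companies: three tight pairs
toy = rbind(
	A = c(0.10, 0.10), B = c(0.12, 0.11),
	C = c(0.50, 0.52), D = c(0.51, 0.50),
	E = c(0.90, 0.10), F = c(0.92, 0.12)
)
colnames(toy) = c('x', 'y')

# Euclidean distance between each pair of rows
d = dist(toy)

# hierarchical clustering; 'ward.D' is the current name of the old 'ward' method
hc.toy = hclust(d, method = 'ward.D')

# cut the tree into 3 groups, the same k that rect.hclust highlights
groups = cutree(hc.toy, k = 3)

# list the members of each group
print(split(names(groups), groups))
```

With the objects from this post, the equivalent call would be cutree(hc, k=3), which returns one group number per company.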

Similarly we can use the first three principal components:

	# create and plot clusters based on the first, second, and third principal components
	# note: R 3.1.0 renamed the 'ward' method to 'ward.D'
	hc = hclust(dist(cbind(x,y,z)), method = 'ward.D')
	plot(hc, axes=FALSE, xlab='', ylab='', sub='', main='Comp 1/2/3')
	rect.hclust(hc, k=3, border='red')

[Figure plot2.png: dendrogram of clusters based on principal components 1, 2, and 3]
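
When deciding how many components to feed into the distance calculation, it can help to check how much of the total variance each component explains. The following is a rough sketch on simulated returns; the simulated matrix sim.ret is an assumption standing in for ret, and its one-factor structure is invented for the example:

```r
set.seed(1)

# simulate 250 days of returns for 5 hypothetical companies driven by one common factor
common = rnorm(250, sd = 0.01)
sim.ret = sapply(1:5, function(i) common + rnorm(250, sd = 0.005))
colnames(sim.ret) = paste0('S', 1:5)

p.sim = princomp(sim.ret)

# share of total variance explained by each component, and the running total
var.share = p.sim$sdev^2 / sum(p.sim$sdev^2)
print(round(cbind(share = var.share, cumulative = cumsum(var.share)), 3))
```

The same two lines applied to p from this post would show how much information is kept when only the first two or three components are used.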

Another option is to use the Correlation matrix as a proxy for a distance matrix, treating 1 - correlation as the dissimilarity between two companies:

	# create and plot clusters based on the correlation among companies
	# note: R 3.1.0 renamed the 'ward' method to 'ward.D'
	hc = hclust(as.dist(1-cor(na.omit(ret))), method = 'ward.D')
	plot(hc, axes=FALSE, xlab='', ylab='', sub='', main='Correlation')
	rect.hclust(hc, k=3, border='red')

[Figure plot3.png: dendrogram of clusters based on the correlation among companies]
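
The quantity 1 - cor is 0 for perfectly correlated series and 2 for perfectly anti-correlated ones, so it behaves like a dissimilarity. Here is a small sketch on simulated returns; the two "sectors" and the series names are made up for illustration:

```r
set.seed(2)

# two hypothetical sectors: series within a sector share a common driver
f1 = rnorm(250)
f2 = rnorm(250)
sim.ret = cbind(
	A1 = f1 + rnorm(250, sd = 0.3), A2 = f1 + rnorm(250, sd = 0.3),
	B1 = f2 + rnorm(250, sd = 0.3), B2 = f2 + rnorm(250, sd = 0.3)
)

# dissimilarity: small within a sector, close to 1 across sectors
d.cor = as.dist(1 - cor(sim.ret))
print(round(d.cor, 2))

# cluster on the correlation-based dissimilarity and cut into 2 groups
hc.cor = hclust(d.cor, method = 'ward.D')
groups = cutree(hc.cor, k = 2)
print(groups)
```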

Please note that Clusters will be quite different, depending on the distance matrix you use.
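
One way to see how much two clusterings differ is to cross-tabulate their memberships. The sketch below does this on simulated data; everything in it is hypothetical, and with the objects from this post one would compare the tree built from cbind(x,y) with the tree built from the correlation matrix:

```r
set.seed(3)

# simulate returns for 8 hypothetical companies
sim.ret = matrix(rnorm(250 * 8, sd = 0.01), ncol = 8,
	dimnames = list(NULL, paste0('S', 1:8)))

# clustering 1: Euclidean distance on the first two PCA loadings
load.sim = princomp(sim.ret)$loadings[, 1:2]
hc.pca = hclust(dist(load.sim), method = 'ward.D')

# clustering 2: correlation-based dissimilarity
hc.cor = hclust(as.dist(1 - cor(sim.ret)), method = 'ward.D')

# rows: PCA-based groups, columns: correlation-based groups
# a table with a single non-zero entry in every row and column means the two agree
cmp = table(pca = cutree(hc.pca, k = 3), cor = cutree(hc.cor, k = 3))
print(cmp)
```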

To view the complete source code for this example, please have a look at the bt.clustering.test() function in bt.test.r at github.

  1. Hugo
    December 29, 2012 at 4:51 am

    Dear systematicinvestor,

    Just found your blog and I absolutely love it! I am somewhat new to R, but have a long history with both Excel and finance; you keep amazing me with both your and R’s powers when dealing with financial data.

    Thank you and keep doing what you are doing!

    Hugo

  2. Mike
    December 29, 2012 at 10:22 am

    Nice post!

    Here is a good tutorial on clustering with different R packages

    http://research.stowers-institute.org/efg/R/Visualization/cor-cluster/index.htm

    provided by Earl F. Glynn.

    Take a look at the pvclust package regarding the stability issue. This package uses bootstrap resampling and calculates p-values.

    Happy New Year!

    Mike

  3. Didier Ruedin
    December 29, 2012 at 10:34 pm

    You might also be interested in FactoMineR (http://factominer.free.fr/), a package not mentioned in the tutorial linked. It offers a GUI (as part of Rcmdr) for those less comfortable with the command line, and apparently you can also use it within Excel (using RExcel).

  4. nx
    January 9, 2013 at 8:29 pm

    Hi,

    Excellent blog. The question now is how to use these clusters further.

    I was wondering about trying to find the most representative element of each cluster and using the list of those to do a further portfolio opt. I.e.

    Sp500 (500 names) –> cluster (20 or so clusters) –> for each cluster take most representative name (20 names) –> portfolio opt these 20 names

    Any ideas on what procedure would best find each cluster’s most representative stock?
