## Clustering with selected Principal Components

In the Visualizing Principal Components post, I looked at the Principal Components of the companies in the Dow Jones Industrial Average index over 2012. Today, I want to show how we can use Principal Components to create Clusters (i.e. form groups of similar companies based on their distance from each other)

Let’s start by loading the historical prices for the the companies in the Dow Jones Industrial Average index that we saved in the Visualizing Principal Components post.

############################################################################### # Load Systematic Investor Toolbox (SIT) # http://systematicinvestor.wordpress.com/systematic-investor-toolbox/ ############################################################################### setInternet2(TRUE) con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb')) source(con) close(con) #***************************************************************** # Load historical data #****************************************************************** load.packages('quantmod') # load data saved in the bt.pca.test() function load(file='bt.pca.test.Rdata') #***************************************************************** # Principal component analysis (PCA), for interesting discussion # http://machine-master.blogspot.ca/2012/08/pca-or-polluting-your-clever-analysis.html #****************************************************************** prices = data$prices ret = prices / mlag(prices) - 1 p = princomp(na.omit(ret)) loadings = p$loadings[] x = loadings[,1] y = loadings[,2] z = loadings[,3]

To create Clusters, I will use the hierarchical cluster analysis, hclust function, in stats package. The first argument in the hclust function is the distance (dissimilarity) matrix. To compute distance matrix, let’s take the first 2 principal components and compute the Euclidean distance between each company:

#***************************************************************** # Create clusters #****************************************************************** # create and plot clusters based on the first and second principal components hc = hclust(dist(cbind(x,y)), method = 'ward') plot(hc, axes=F,xlab='', ylab='',sub ='', main='Comp 1/2') rect.hclust(hc, k=3, border='red')

Similarly we can use the first three principal components:

# create and plot clusters based on the first, second, and third principal components hc = hclust(dist(cbind(x,y,z)), method = 'ward') plot(hc, axes=F,xlab='', ylab='',sub ='', main='Comp 1/2/3') rect.hclust(hc, k=3, border='red')

Another option is to use the Correlation matrix as a proxy for a distance matrix:

# create and plot clusters based on the correlation among companies hc = hclust(as.dist(1-cor(na.omit(ret))), method = 'ward') plot(hc, axes=F,xlab='', ylab='',sub ='', main='Correlation') rect.hclust(hc, k=3, border='red')

Please note that Clusters will be quite different, depending on the distance matrix you use.

To view the complete source code for this example, please have a look at the bt.clustering.test() function in bt.test.r at github.

## Visualizing Principal Components

Principal Component Analysis (PCA) is a procedure that converts observations into linearly uncorrelated variables called principal components (Wikipedia). The PCA is a useful descriptive tool to examine your data. Today I will show how to find and visualize Principal Components.

Let’s look at the components of the Dow Jones Industrial Average index over 2012. First, I will download the historical prices and sector infromation for all components of the Dow Jones Industrial Average index.

############################################################################### # Load Systematic Investor Toolbox (SIT) # http://systematicinvestor.wordpress.com/systematic-investor-toolbox/ ############################################################################### setInternet2(TRUE) con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb')) source(con) close(con) #***************************************************************** # Find Sectors for each company in DOW 30 #****************************************************************** tickers = spl('XLY,XLP,XLE,XLF,XLV,XLI,XLB,XLK,XLU') tickers.desc = spl('ConsumerCyclicals,ConsumerStaples,Energy,Financials,HealthCare,Industrials,Materials,Technology,Utilities') sector.map = c() for(i in 1:len(tickers)) { sector.map = rbind(sector.map, cbind(sector.spdr.components(tickers[i]), tickers.desc[i]) ) } colnames(sector.map) = spl('ticker,sector') #***************************************************************** # Load historical data #****************************************************************** load.packages('quantmod') tickers = dow.jones.components() sectors = factor(sector.map[ match(tickers, sector.map[,'ticker']), 'sector']) names(sectors) = tickers data <- new.env() getSymbols(tickers, src = 'yahoo', from = '2000-01-01', env = data, auto.assign = T) for(i in ls(data)) data[[i]] = adjustOHLC(data[[i]], use.Adjusted=T) bt.prep(data, align='keep.all', dates='2012') # re-order sectors, because bt.prep can change the order of tickers sectors = sectors[data$symbolnames] # save data for later examples save(data, tickers, sectors, file='bt.pca.test.Rdata')

Next, let’s run the Principal Component Analysis (PCA) on the companies returns during 2012 and plot percentage of variance explained for each principal component.

#***************************************************************** # Principal component analysis (PCA), for interesting discussion # http://machine-master.blogspot.ca/2012/08/pca-or-polluting-your-clever-analysis.html #****************************************************************** prices = data$prices ret = prices / mlag(prices) - 1 p = princomp(na.omit(ret)) loadings = p$loadings[] p.variance.explained = p$sdev^2 / sum(p$sdev^2) # plot percentage of variance explained for each principal component barplot(100*p.variance.explained, las=2, xlab='', ylab='% Variance Explained')

The first principal component, usually it is market returns, explains around 45% of variance during 2012.

Next let’s plot all companies loadings on the first and second principal components and highlight points according to the sector they belong.

#***************************************************************** # 2-D Plot #****************************************************************** x = loadings[,1] y = loadings[,2] z = loadings[,3] cols = as.double(sectors) # plot all companies loadings on the first and second principal components and highlight points according to the sector they belong plot(x, y, type='p', pch=20, col=cols, xlab='Comp.1', ylab='Comp.2') text(x, y, data$symbolnames, col=cols, cex=.8, pos=4) legend('topright', cex=.8, legend = levels(sectors), fill = 1:nlevels(sectors), merge = F, bty = 'n')

Please notice that the companies in the same sector tend to group together on the plot.

Next, let’s go one step further and create a 3D plot using first, second, and third principal components

#***************************************************************** # 3-D Plot, for good examples of 3D plots # http://statmethods.wordpress.com/2012/01/30/getting-fancy-with-3-d-scatterplots/ #****************************************************************** load.packages('scatterplot3d') # plot all companies loadings on the first, second, and third principal components and highlight points according to the sector they belong s3d = scatterplot3d(x, y, z, xlab='Comp.1', ylab='Comp.2', zlab='Comp.3', color=cols, pch = 20) s3d.coords = s3d$xyz.convert(x, y, z) text(s3d.coords$x, s3d.coords$y, labels=data$symbolnames, col=cols, cex=.8, pos=4) legend('topleft', cex=.8, legend = levels(sectors), fill = 1:nlevels(sectors), merge = F, bty = 'n')

The 3D chart does add a bit of extra info, but I find the 2D chart easier to look at.

In the next post, I will demonstrate clustering based on the selected Principal components and after that I want to explore the interesting discussion presented in the using PCA for spread trading post.

Happy Holidays

To view the complete source code for this example, please have a look at the bt.pca.test() function in bt.test.r at github.

## XLLoop examples

Today I want to follow up with the XLLoop framework post. Please read the XLLoop framework post first to setup the XLLoop before trying the examples below.

My first example is based on the TFX Package – to retrieve real-time FX quotes. To try this example, please first install the TFX Package. Please note that you would need R (>= 2.15.0) to run TFX.

# TFX : R (>= 2.15.0) install.packages(c('XML','bitops','TFX'), repos = 'http://cran.r-project.org', dependencies = 'Imports')

Next, let’s create a function to retrieve real-time FX quotes.

# http://rpubs.com/gsee/TFX # https://gist.github.com/4122626 getFX <- function() { require(TFX) qtf <- QueryTrueFX() qtf$TimeStamp <- as.character(qtf$TimeStamp) names(qtf)[6] <- "TimeStamp (GMT)" qtf[, c(6, 1:3, 5:4)] }

Finally, you will need to add getFX() function to the rstart.r script. For this, I created a user.r file to hold all user functions that you want to be available in Excel and added source(“user.R”) to rstart.r script to load our functions. I have updated the xlloop-basic.zip archive with this new functionality.

At this point you might download xlloop-basic.zip archive and follow the steps in the XLLoop framework post to setup the XLLoop. Next, play with xlloop_TFX.xls

If all works well, you will see following Excel spreadsheet that will update every 3 seconds. You can stop/start updating by clicking Stop/Start buttons. You can also change the speed of updates by entering a different number in cell J5.

Another example that I want to share today will retrieve real-time stock quotes from BATS Exchange. Please first install the RJSONIO Package:

install.packages('RJSONIO', repos = 'http://cran.r-project.org', dependencies = 'Imports')

Next, let’s create a function to retrieve real-time stock quotes.

#=FS("bats.quote","IBM") # http://www.batstrading.com/book/IBM/ bats.quote <- function(ticker) { require(RJSONIO) url = paste('http://www.batstrading.com/json/bzx/book/', ticker, sep = '') x = fromJSON(url) # create output out = matrix('', 15,6) out[1,2] = x$data$timestamp out[1,3] = x$data$symbol out[1,4] = x$data$company out[2,2] = 'Orders' out[2,5] = 'Volume' out[3,2] = x$data$orders out[3,5] = x$data$volume out[4,2] = 'Top of the Book' out[4,5] = paste('Last', length(x$data$trades), 'trades') out[5,] = c('','Shares','Price','Time','Shares','Price') out[6,1] = 'Asks' if(length(x$data$asks) > 0) out[6:10,2:3] = t(sapply(x$data$asks, unlist)) out[11,1] = 'Bids' if(length(x$data$bids) > 0) out[11:15,2:3] = t(sapply(x$data$bids, unlist)) if(length(x$data$trades) > 0) out[6:15,4:6] = t(sapply(x$data$trades, unlist)) out }

I have added bats.quote() to a user.r file in the xlloop-basic.zip archive for this example. Next, play with xlloop_BATS.xls

If all works well, you will see following Excel spreadsheet that will update every 3 seconds. You can stop/start updating by clicking Stop/Start buttons. You can also change the speed of updates by entering a different number in cell J5. You can also change the company by entering a different ticker in cell C2.

Please let me know what problems you run into while experimenting with these examples.

I have noticed that FX quotes on my system do not update as often as I would expect.

## XLLoop framework

Today I want to highlight the XLLoop framework : Excel User-Define Functions in in any language.

The XLLoop consists of two main components:

- An Excel addin implementation (XLL written in c++).
- A server and framework written in R (or/and in many other languages).

The XLLoop allows you to connect Excel and R in very simple way with almost zero installation.

**To get started**, please download and unzip the xlloop-basic.zip archive that contains all files you need to make my sample application work.

**Next**, start Excel and add XLLoop addin.

I.e. in Office 2007/2010 Click File->Options->Add-Ins, press Alt+G, Click Browse and locate xlloop-0.3.2.xll

in Office 2003, Click Tools->Add-Ins, Click Browse and locate xlloop-0.3.2.xll

**Next**, edit runr.bat that was extracted from the zip archive. Enter correct path to your R installation.

**Finally**, execute runr.bat, you will see a command window popup, next go to Excel and type following formula

=FS(“ProductTest”, 32, 1886.5)

If all works well, you would see 60368

This might seem like a bit of black magic, so let me explain what is going on:

The runr.bat batch file starts a new R session and executes rstart.r script. In the rstart.r script we load/define any libraries / functions that we want to access in Excel. Next we load code for the XLLoop server and start the server. Here is the code in the rstart.r script

# define function ProductTest <- function(x, y) x*y # start xlloop server source('xlloop.R') XLLoopServer()

Next to access R functionality in Excel, we use FS function, the first parameter is the R function that we want to execute, following by function parameters. For example the =FS("ProductTest", 32, 1886.5) formula calls ProductTest R fucntion with x = 32 and y = 1886.5

*If you get “Cannot connect to the server” error message, please make sure that runr.bat batch is running (i.e. there is a command window) and try recalculating Excel (i.e. press F9)*

Once, you are done working with Excel, you can just close the command window created by runr.bat batch file.

I have included a few examples in the xlloop.xls for you to explore.

I will show a few more examples of the XLLoop framework in the next post.

Please let me know what problems you run into while experimenting with XLLoop.

A side note. There are many options for connecting Excel and R. For example I have previously showed examples of RExcel to execute R functions and display their output in Excel.

## TFX Package

Today I want to highlight the TFX Package created by Garrett See. TFX is an R Interface to the TrueFX(tm) Web API for free streaming real-time and historical tick-by-tick market data for dealable interbank foreign exchange rates with millisecond detail.

Garrett provided a great tutorial, examples, and shiny application of TFX at http://rpubs.com/gsee/TFX

Please note that you would need R (>= 2.15.0) to run TFX. It is very easy to get started. First install required packages:

# TFX : R (>= 2.15.0) install.packages(c('XML','bitops','TFX','shiny'), repos = 'http://cran.r-project.org', dependencies = 'Imports')

Next, let run a shiny example of TFX provided by Garrett:

# http://cran.r-project.org/web/packages/TFX/index.html # http://rpubs.com/gsee/TFX # https://gist.github.com/4122626 library('shiny') runGist('4122626')

Finally, to access the FX quotes in R session:

library(TFX) QueryTrueFX()