## Tracking Number of Historical Clusters

In the prior post, Optimal number of clusters, we looked at methods of selecting number of clusters. Today, I want to continue with clustering theme and show historical Number of Clusters time series using these methods.

In particular, I will look at the following methods of selecting optimal number of clusters:

- Minimum number of clusters that explain at least 90% of variance
- Elbow method
- Hierarchical clustering tree cut at 1/3 height

Let’s first load historical prices for the 10 major asset classes

############################################################################### # Load Systematic Investor Toolbox (SIT) # https://systematicinvestor.wordpress.com/systematic-investor-toolbox/ ############################################################################### setInternet2(TRUE) con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb')) source(con) close(con) #***************************************************************** # Load historical data for ETFs #****************************************************************** load.packages('quantmod') tickers = spl('GLD,UUP,SPY,QQQ,IWM,EEM,EFA,IYR,USO,TLT') data <- new.env() getSymbols(tickers, src = 'yahoo', from = '1900-01-01', env = data, auto.assign = T) for(i in ls(data)) data[[i]] = adjustOHLC(data[[i]], use.Adjusted=T) bt.prep(data, align='remove.na')

Next, I created 3 helper functions to automate cluster selection. In particular, I used following methods of selecting optimal number of clusters:

- Minimum number of clusters that explain at least 90% of variance – cluster.group.kmeans.90 function
- Elbow method – cluster.group.kmeans.elbow function
- Hierarchical clustering tree cut at 1/3 height – cluster.group.hclust function

To view the complete source code for these functions please have a look at the startegy.r at github.

Let’s use these functions on our data set every week with 250 days look-back to compute correlations.

#***************************************************************** # Use following 3 methods to determine number of clusters # * Minimum number of clusters that explain at least 90% of variance # cluster.group.kmeans.90 # * Elbow method # cluster.group.kmeans.elbow # * Hierarchical clustering tree cut at 1/3 height # cluster.group.hclust #****************************************************************** # helper function to compute portfolio allocation additional stats portfolio.allocation.custom.stats.clusters <- function(x,ia) { return(list( ncluster.90 = max(cluster.group.kmeans.90(ia)), ncluster.elbow = max(cluster.group.kmeans.elbow(ia)), ncluster.hclust = max(cluster.group.hclust(ia)) )) } #***************************************************************** # Compute # Clusters #****************************************************************** periodicity = 'weeks' lookback.len = 250 obj = portfolio.allocation.helper(data$prices, periodicity = periodicity, lookback.len = lookback.len, min.risk.fns = list(EW=equal.weight.portfolio), custom.stats.fn = portfolio.allocation.custom.stats.clusters )

Finally, the historical number of cluster time series plots for each method:

#***************************************************************** # Create Reports #****************************************************************** temp = list(ncluster.90 = 'Kmeans 90% variance', ncluster.elbow = 'Kmeans Elbow', ncluster.hclust = 'Hierarchical clustering at 1/3 height') for(i in 1:len(temp)) { hist.cluster = obj[[ names(temp)[i] ]] title = temp[[ i ]] plota(hist.cluster, type='l', col='gray', main=title) plota.lines(SMA(hist.cluster,10), type='l', col='red',lwd=5) plota.legend('Number of Clusters,10 period moving average', 'gray,red', x = 'bottomleft') }

All methods selected clusters a little bit differently, as expected. The “Minimum number of clusters that explain at least 90% of variance” method seems to produce the most stable results. I would suggest looking at the larger universe (for example DOW30) and longer period of time (for example 1995-present) to evaluate these methods.

Takeaways: As I mentioned in the Optimal number of clusters post, there are many different methods to create clusters, and I have barely scratched the surface. There is also another dimension that I have not explored yet, the distance matrix. Currently, I’m using a correlation matrix as a distance measure to create clusters. I was pointed out by Matt Considine that there is an R interface to the Maximal Information-based Nonparametric Exploration (MINE) metric that can be used as a better measure of correlation.

To view the complete source code for this example, please have a look at the bt.cluster.optimal.number.historical.test() function in bt.test.r at github.

## Weekend Reading – S&P 500 Visual History

Michael Johnston at the ETF Database shared a very interesting post with me over the holidays. The S&P 500 Visual History – is an interactive post that shows the top 10 components in the S&P 500 each year, going back to 1980.

On a different note, Judson Bishop contributed a plota.recession() function to add recession bars to the existing plot. The Recession dates are from National Bureau of Economic Research. Following is a simple example of plota.recession() function.

############################################################################### # Load Systematic Investor Toolbox (SIT) # https://systematicinvestor.wordpress.com/systematic-investor-toolbox/ ############################################################################### setInternet2(TRUE) con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb')) source(con) close(con) #***************************************************************** # Load historical data for ETFs #****************************************************************** load.packages('quantmod') SPY = getSymbols('SPY', auto.assign = F) #***************************************************************** # Create Clusters #****************************************************************** plota(SPY, type='l') plota.recession() plota.legend('SPY','black',SPY)

## Optimal number of clusters

In the last post, Examples of Current Major Market Clusters, we looked at clustering Major Markets into 4 groups based on their correlations in 2012. Today, I want to continue with clustering theme and discuss methods of selecting number of clusters.

I will look at the following methods of selecting optimal number of clusters:

- Minimum number of clusters that explain at least 90% of variance
- Minimum number of clusters such that correlation among all components in each cluster is at least 40%
- Elbow method

Let’s first load historical for the 10 major asset classes

############################################################################### # Load Systematic Investor Toolbox (SIT) # https://systematicinvestor.wordpress.com/systematic-investor-toolbox/ ############################################################################### setInternet2(TRUE) con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb')) source(con) close(con) #***************************************************************** # Load historical data for ETFs #****************************************************************** load.packages('quantmod') tickers = spl('GLD,UUP,SPY,QQQ,IWM,EEM,EFA,IYR,USO,TLT') data <- new.env() getSymbols(tickers, src = 'yahoo', from = '1900-01-01', env = data, auto.assign = T) for(i in ls(data)) data[[i]] = adjustOHLC(data[[i]], use.Adjusted=T) bt.prep(data, align='remove.na') #***************************************************************** # Create Clusters #****************************************************************** # compute returns ret = data$prices / mlag(data$prices) - 1 ret = na.omit(ret) # setup period and method to compute correlations dates = '2012::2012' method = 'pearson' # kendall, spearman correlation = cor(ret[dates], method = method) dissimilarity = 1 - (correlation) distance = as.dist(dissimilarity) # get first 2 pricipal componenets xy = cmdscale(distance)

Next, let’s iterate from 2 clusters to 2/3N clusters (where N is the number of securities) and in each case compute the percentage of variance explained by clusters and minimum correlation among all components in each cluster.

#***************************************************************** # Determine number of clusters #****************************************************************** n = ncol(data$prices) n1 = ceiling(n*2/3) # percentage of variance explained by clusters p.exp = rep(0,n1) # minimum correlation among all components in each cluster min.cor = matrix(1,n1,n1) for (i in 2:n1) { fit = kmeans(xy, centers=i, iter.max=100, nstart=100) p.exp[i] = 1- fit$tot.withinss / fit$totss for (j in 1:i) { index = fit$cluster == j min.cor[i,j] = min(correlation[index,index]) } } # minimum number of clusters that explain at least 90% of variance min(which(p.exp > 0.9)) # minimum number of clusters such that correlation among all components in each cluster is at least 40% # will not always work min(which(apply(min.cor[-1,],1,min,na.rm=T) > 0.4)) + 1 # number of clusters based on elbow method find.maximum.distance.point(p.exp[-1]) + 1

In the first method, we set number of clusters equal to the minimum number of clusters that explain at least 90% of variance. This is not a bad choice; the downside is why 90%, why not 75%?

min(which(p.exp > 0.9)) [1] 4

In the second method, we set number of clusters equal to the minimum number of clusters such that correlation among all components in each cluster is at least 40%. Also not a bad choice; the downside is why 40%, why not 50%?

min(which(apply(min.cor[-1,],1,min,na.rm=T) > 0.4)) + 1 [1] 5

In the third method, we use Elbow method to set number of clusters. The idea behind the Elbow method is to look at the marginal gain of adding each additional cluster. And set the number of clusters equal to the largest K that has positive marginal gain. Geometrically thinking, if you graph the percentage of variance explained by the clusters against the number of clusters, the point on the curve that is farthest from the 45 degree line corresponds the optimal number of clusters.

find.maximum.distance.point(p.exp[-1]) + 1 [1] 4

All methods have produced similar number of clusters. Let’s visually see the difference between 4 and 5 clusters

#***************************************************************** # Create Plot #****************************************************************** load.packages('cluster') fit = kmeans(xy, 4, iter.max=100, nstart=100) clusplot(xy, fit$cluster, color=TRUE, shade=TRUE, labels=3, lines=0, plotchar=F, main = paste('Major Market Clusters over', dates, ', 4 Clusters'), sub='') fit = kmeans(xy, 5, iter.max=100, nstart=100) clusplot(xy, fit$cluster, color=TRUE, shade=TRUE, labels=3, lines=0, plotchar=F, main = paste('Major Market Clusters over', dates, ', 5 Clusters'), sub='')

It is a judgment call, what do you think is better 4 or 5 clusters in this case?

Takeaways: I have barely scratched the surface in terms of methods that you can use to select number of clusters. I highly recommend following resources, if you want to get more background and/or discussion about clustering:

- R and Data Mining: Examples and Case Studies by Y. Zhao, Chapter 6, Clustering
- Counting Clusters
- Clustergram: visualization and diagnostics for cluster analysis
- Equity Clusters
- Ways to define optimal number of clusters
- Choosing the number of cluster

In the next post I will show the transition of clusters as we move in time. After that I will discuss the Cluster Risk Parity portfolio allocation method (Varadi, Kapler, 2012). For more details about Cluster Risk Parity, please read David’s blog at http://cssanalytics.wordpress.com/.

To view the complete source code for this example, please have a look at the bt.cluster.optimal.number.test() function in bt.test.r at github.

## Examples of Current Major Market Clusters

I want to follow up and provide a bit more details to the excellent “A Visual of Current Major Market Clusters” post by David Varadi.

Let’s first load historical for the 10 major asset classes:

- Gold ( GLD )
- US Dollar ( UUP )
- S&P500 ( SPY )
- Nasdaq100 ( QQQ )
- Small Cap ( IWM )
- Emerging Markets ( EEM )
- International Equity ( EFA )
- Real Estate ( IYR )
- Oil ( USO )
- Treasurys ( TLT )

############################################################################### # Load Systematic Investor Toolbox (SIT) # https://systematicinvestor.wordpress.com/systematic-investor-toolbox/ ############################################################################### setInternet2(TRUE) con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb')) source(con) close(con) #***************************************************************** # Load historical data for ETFs #****************************************************************** load.packages('quantmod') tickers = spl('GLD,UUP,SPY,QQQ,IWM,EEM,EFA,IYR,USO,TLT') data <- new.env() getSymbols(tickers, src = 'yahoo', from = '1900-01-01', env = data, auto.assign = T) for(i in ls(data)) data[[i]] = adjustOHLC(data[[i]], use.Adjusted=T) bt.prep(data, align='remove.na')

Next let’s use the historical returns over the past year to compute correlations between all asset classes and group assets into 4 clusters:

#***************************************************************** # Create Clusters #****************************************************************** # compute returns ret = data$prices / mlag(data$prices) - 1 ret = na.omit(ret) # setup period and method to compute correlations dates = '2012::2012' method = 'pearson' # kendall, spearman correlation = cor(ret[dates], method = method) dissimilarity = 1 - (correlation) distance = as.dist(dissimilarity) # find 4 clusters xy = cmdscale(distance) fit = kmeans(xy, 4, iter.max=100, nstart=100) #***************************************************************** # Create Plot #****************************************************************** load.packages('cluster') clusplot(xy, fit$cluster, color=TRUE, shade=TRUE, labels=3, lines=0, plotchar=F, main = paste('Major Market Clusters over', dates), sub='')

There are 4 clusters: TLT, GLD, UUP, and Equities / Oil / Real Estate. You can see assigned clusters by executing

fit$cluster

This works quite well, but we have a number of things to explore:

- how to select number of clusters
- what correlation measure to use i.e. pearson, kendall, spearman
- what look back to use i.e. 1 month / 6 months / 1 year
- what frequency of data to use i.e daily / weekly / monthly

In the next post I will provide some ideas how to select number of clusters.

To view the complete source code for this example, please have a look at the bt.cluster.visual.test() function in bt.test.r at github.

## More Principal Components Fun

Today, I want to continue with the Principal Components theme and show how the Principal Component Analysis can be used to build portfolios that are not correlated to the market. Most of the content for this post is based on the excellent article, “Using PCA for spread trading” by Jev Kuznetsov.

Let’s start by loading the components of the Dow Jones Industrial Average index over last 5 years.

############################################################################### # Load Systematic Investor Toolbox (SIT) # https://systematicinvestor.wordpress.com/systematic-investor-toolbox/ ############################################################################### setInternet2(TRUE) con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb')) source(con) close(con) #***************************************************************** # Load historical data #****************************************************************** load.packages('quantmod') tickers = dow.jones.components() data <- new.env() getSymbols(tickers, src = 'yahoo', from = '2009-01-01', env = data, auto.assign = T) bt.prep(data, align='remove.na')

Next let’s compute the Principal Components based on the first year of price history.

#***************************************************************** # Principal component analysis (PCA), for interesting discussion # http://machine-master.blogspot.ca/2012/08/pca-or-polluting-your-clever-analysis.html #****************************************************************** prices = last(data$prices, 1000) n = len(tickers) ret = prices / mlag(prices) - 1 p = princomp(na.omit(ret[1:250,])) loadings = p$loadings[] # look at the first 4 principal components components = loadings[,1:4] # normalize all selected components to have total weight = 1 components = components / repRow(colSums(abs(components)), len(tickers)) # note that first component is market, and all components are orthogonal i.e. not correlated to market market = ret[1:250,] %*% rep(1/n,n) temp = cbind(market, ret[1:250,] %*% components) colnames(temp)[1] = 'Market' round(cor(temp, use='complete.obs',method='pearson'),2) # the variance of each component is decreasing round(100*sd(temp,na.rm=T),2)

Correlation: Market Comp.1 Comp.2 Comp.3 Comp.4 Market 1.0 1 0.2 0.1 0 Comp.1 1.0 1 0.0 0.0 0 Comp.2 0.2 0 1.0 0.0 0 Comp.3 0.1 0 0.0 1.0 0 Comp.4 0.0 0 0.0 0.0 1 Standard Deviation: Market Comp.1 Comp.2 Comp.3 Comp.4 1.8 2.8 1.2 1.0 0.8

Please note that the first principal component is highly correlated to the market and all principal components have very low correlation to each other and very low correlation to the market. Also by construction the volatility of principal components is decreasing. An interesting observation that you might want to check yourself: principal components are quite persistent in time (i.e. if you compute both correlations and volatilities using the future prices, for example, 4 years of prices, the principal components maintain their correlation and volatility profiles)

Next, let’s check if any of the principal components are mean-reverting. I will use Augmented Dickey-Fuller test to check if principal components are mean-reverting. (small p-value => stationary i.e. mean-reverting)

#***************************************************************** # Find stationary components, Augmented Dickey-Fuller test #****************************************************************** library(tseries) equity = bt.apply.matrix(1 + ifna(-ret %*% components,0), cumprod) equity = make.xts(equity, index(prices)) # test for stationarity ( mean-reversion ) adf.test(as.numeric(equity[,1]))$p.value adf.test(as.numeric(equity[,2]))$p.value adf.test(as.numeric(equity[,3]))$p.value adf.test(as.numeric(equity[,4]))$p.value

The Augmented Dickey-Fuller test indicates that the 4th principal component is stationary. Let’s have a closer look at its price history and its composition:

#***************************************************************** # Plot securities and components #***************************************************************** layout(1:2) # add Bollinger Bands i.comp = 4 bbands1 = BBands(repCol(equity[,i.comp],3), n=200, sd=1) bbands2 = BBands(repCol(equity[,i.comp],3), n=200, sd=2) temp = cbind(equity[,i.comp], bbands1[,'up'], bbands1[,'dn'], bbands1[,'mavg'], bbands2[,'up'], bbands2[,'dn']) colnames(temp) = spl('Comp. 4,1SD Up,1SD Down,200 SMA,2SD Up,2SD Down') plota.matplot(temp, main=paste(i.comp, 'Principal component')) barplot.with.labels(sort(components[,i.comp]), 'weights')

The price history along with 200 day moving average and 1 and 2 Bollinger Bands are shown on the top pane. The portfolio weights of the 4th principal component are shown on the bottom pane.

So now you have a mean-reverting portfolio that is also uncorrelated to the market. There are many ways you can use this infromation. Please share your ideas and suggestions.

To view the complete source code for this example, please have a look at the bt.pca.trading.test() function in bt.test.r at github.