Archive
Quality of Historical Stock Prices from Yahoo Finance
I recently looked at a strategy that invests in the components of the S&P/TSX 60 index, and discovered some abnormal jumps/drops in the historical data that I could not explain. To help spot and remove these points, I created a helper function, data.clean(), in data.r at github. Following is an example of how you can use the data.clean() function:
###############################################################################
# Load Systematic Investor Toolbox (SIT)
# https://systematicinvestor.wordpress.com/systematic-investor-toolbox/
###############################################################################
setInternet2(TRUE)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
source(con)
close(con)

###############################################################################
# S&P/TSX 60 Index as of Mar 31 2014
# http://ca.spindices.com/indices/equity/sp-tsx-60-index
###############################################################################
load.packages('quantmod')

tickers = spl('AEM,AGU,ARX,BMO,BNS,ABX,BCE,BB,BBD.B,BAM.A,CCO,CM,CNR,CNQ,COS,CP,CTC.A,CCT,CVE,GIB.A,CPG,ELD,ENB,ECA,ERF,FM,FTS,WN,GIL,G,HSE,IMO,K,L,MG,MFC,MRU,NA,PWT,POT,POW,RCI.B,RY,SAP,SJR.B,SC,SLW,SNC,SLF,SU,TLM,TCK.B,T,TRI,THI,TD,TA,TRP,VRX,YRI')
tickers = gsub('\\.', '-', tickers)
tickers.suffix = '.TO'

data <- new.env()
for(ticker in tickers)
    data[[ticker]] = getSymbols(paste0(ticker, tickers.suffix), src = 'yahoo', from = '1980-01-01', auto.assign = F)

###############################################################################
# Plot Abnormal Series
###############################################################################
layout(matrix(1:4,2))
plota(data$ARX$Adjusted['2000'], type='p', pch='|', main='ARX Adjusted Price in 2000')
plota(data$COS$Adjusted['2000'], type='p', pch='|', main='COS Adjusted Price in 2000')
plota(data$ERF$Adjusted['2000'], type='p', pch='|', main='ERF Adjusted Price in 2000')
plota(data$YRI$Adjusted['1999'], type='p', pch='|', main='YRI Adjusted Price in 1999')

###############################################################################
# Clean data
###############################################################################
data.clean(data, min.ratio = 2)
> data.clean(data, min.ratio = 2)
Removing BNS TRP have less than 756 observations
Abnormal price found for ARX 23-Jun-2000 Ratio : 124.7
Abnormal price found for ARX 26-Sep-2000 Inverse Ratio : 99.4
Abnormal price found for COS 23-Jun-2000 Ratio : 124.1
Abnormal price found for COS 26-Sep-2000 Inverse Ratio : 101.1
Abnormal price found for ERF 14-Jun-2000 Ratio : 7.9
Abnormal price found for YRI 18-Feb-1998 Ratio : 2.1
Abnormal price found for YRI 25-May-1999 Ratio : 3
It is surprising that Bank of Nova Scotia (BNS.TO) has only one year's worth of historical data. I also could not find an explanation for the jumps in ARX, COS, and ERF during 2000.
Next, I ran the same analysis for the stocks in the S&P 100 index:
###############################################################################
# S&P 100 as of Mar 31 2014
# http://ca.spindices.com/indices/equity/sp-100
###############################################################################
tickers = spl('MMM,ABT,ABBV,ACN,ALL,MO,AMZN,AXP,AIG,AMGN,APC,APA,AAPL,T,BAC,BAX,BRK.B,BIIB,BA,BMY,COF,CAT,CVX,CSCO,C,KO,CL,CMCSA,COP,COST,CVS,DVN,DOW,DD,EBAY,EMC,EMR,EXC,XOM,FB,FDX,F,FCX,GD,GE,GM,GILD,GS,GOOG,HAL,HPQ,HD,HON,INTC,IBM,JNJ,JPM,LLY,LMT,LOW,MA,MCD,MDT,MRK,MET,MSFT,MDLZ,MON,MS,NOV,NKE,NSC,OXY,ORCL,PEP,PFE,PM,PG,QCOM,RTN,SLB,SPG,SO,SBUX,TGT,TXN,BK,TWX,FOXA,UNP,UPS,UTX,UNH,USB,VZ,V,WMT,WAG,DIS,WFC')
tickers.suffix = ''

data <- new.env()
for(ticker in tickers)
    data[[ticker]] = getSymbols(paste0(ticker, tickers.suffix), src = 'yahoo', from = '1980-01-01', auto.assign = F)

###############################################################################
# Plot Abnormal Series
###############################################################################
layout(matrix(1:4,2))
plota(data$AAPL$Adjusted['2000'], type='p', pch='|', main='AAPL Adjusted Price in 2000')
plota(data$AIG$Adjusted['2008'], type='p', pch='|', main='AIG Adjusted Price in 2008')
plota(data$FDX$Adjusted['1982'], type='p', pch='|', main='FDX Adjusted Price in 1982')

###############################################################################
# Clean data
###############################################################################
data.clean(data, min.ratio = 2)
> data.clean(data, min.ratio = 2)
Removing ABBV FB have less than 756 observations
Abnormal price found for AAPL 29-Sep-2000 Inverse Ratio : 2.1
Abnormal price found for AIG 15-Sep-2008 Inverse Ratio : 2.6
Abnormal price found for FDX 13-May-1982 Ratio : 8
Abnormal price found for FDX 06-Aug-1982 Ratio : 7.8
Abnormal price found for FDX 14-May-1982 Inverse Ratio : 8
Abnormal price found for FDX 09-Aug-1982 Inverse Ratio : 8
At first I thought the September 29th, 2000 drop in AAPL was a data error; however, I found the following news item: "Apple bruises tech sector" (September 29, 2000, 4:33 p.m. ET): Computer maker's warning weighs on hardware, chip stocks; Nasdaq tumbles.
So working with data requires a bit of data manipulation and a bit of detective work. Please always have a look at the data before running any back-tests or drawing any conclusions.
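For a quick sense of the kind of check data.clean() performs, below is a minimal sketch of a ratio-based filter, assuming SIT and quantmod are loaded; this is an illustration only, not the actual data.clean() implementation:

# flag days where the adjusted price moves up or down by more than
# min.ratio versus the prior day; illustration only, not the actual
# data.clean() logic
flag.abnormal <- function(prices, min.ratio = 2) {
    ratio = coredata(prices) / coredata(mlag(prices))
    prices[ which(ratio > min.ratio | ratio < 1/min.ratio) ]
}

# for example, applied to one of the series flagged above
flag.abnormal(data$FDX$Adjusted, min.ratio = 2)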
Capturing Intraday data: Backup plan
In the Capturing Intraday data post, I outlined the steps to set up your own process to capture Intraday data. But what do you do if you missed some data points, for example because your internet connection was down or a power outage re-started your server? To fill in the gaps in the Intraday data, you can get up to 10 days of historical Intraday data from Google Finance.
I created a wrapper function for Google Finance, getSymbol.intraday.google(), in data.r at github to download historical Intraday quotes. For example,
###############################################################################
# Load Systematic Investor Toolbox (SIT)
# https://systematicinvestor.wordpress.com/systematic-investor-toolbox/
###############################################################################
setInternet2(TRUE)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
source(con)
close(con)

#*****************************************************************
# Load Intraday data
#******************************************************************
data = getSymbol.intraday.google('GOOG', 'NASDAQ', 60, '5d')
last(data, 10)

plota(data, type='candle', main='Google Intraday prices')
> last(data, 10)
                       Open    High     Low    Close Volume
2014-03-28 15:52:00 1119.10 1119.61 1119.10 1119.610   4431
2014-03-28 15:53:00 1119.30 1119.30 1118.75 1118.805   3954
2014-03-28 15:54:00 1119.31 1119.45 1119.18 1119.340   5702
2014-03-28 15:55:00 1119.17 1119.40 1119.00 1119.340   8907
2014-03-28 15:56:00 1119.19 1119.35 1119.18 1119.190  11882
2014-03-28 15:57:00 1119.30 1119.30 1119.02 1119.270   6298
2014-03-28 15:58:00 1119.25 1119.35 1119.15 1119.265  10542
2014-03-28 15:59:00 1119.38 1119.49 1118.82 1119.250  29496
2014-03-28 16:00:00 1120.15 1120.15 1119.15 1119.380  71518
2014-03-28 16:01:00 1120.15 1120.15 1120.15 1120.150      0
So if your Intraday capture process fails, you can rely on Google Finance data to fill in the gaps.
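For example, here is a minimal sketch of patching missing minutes, assuming captured is your own 1 minute xts series with the same OHLCV columns as the Google download; captured is a hypothetical name used for illustration:

# download the last 5 days of 1 minute bars as a backup source
backup = getSymbol.intraday.google('GOOG', 'NASDAQ', 60, '5d')

# keep your own bars and add only the minutes you are missing;
# rbind for xts series keeps the result in time order
missing = backup[ !(index(backup) %in% index(captured)) ]
captured = rbind(captured, missing)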
Probabilistic Momentum with Intraday data
I want to follow up the Intraday data post by testing the Probabilistic Momentum strategy on Intraday data. I will use Intraday data for SPY and GLD from the Bonnot Gang to test the strategy.
###############################################################################
# Load Systematic Investor Toolbox (SIT)
# https://systematicinvestor.wordpress.com/systematic-investor-toolbox/
###############################################################################
setInternet2(TRUE)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
source(con)
close(con)

#*****************************************************************
# Load historical data
#******************************************************************
load.packages('quantmod')

# data from http://thebonnotgang.com/tbg/historical-data/
# please save SPY and GLD 1 min data at the given path
spath = 'c:/Desktop/'
data = bt.load.thebonnotgang.data('SPY,GLD', spath)

data1 <- new.env()
    data1$FI = data$GLD
    data1$EQ = data$SPY
data = data1
bt.prep(data, align='keep.all', fill.gaps = T)

lookback.len = 120
confidence.level = 60/100

prices = data$prices
ret = prices / mlag(prices) - 1

models = list()

#*****************************************************************
# Simple Momentum
#******************************************************************
momentum = prices / mlag(prices, lookback.len)

data$weight[] = NA
data$weight$EQ[] = momentum$EQ > momentum$FI
data$weight$FI[] = momentum$EQ <= momentum$FI
models$Simple = bt.run.share(data, clean.signal=T)

#*****************************************************************
# Probabilistic Momentum + Confidence Level
# http://cssanalytics.wordpress.com/2014/01/28/are-simple-momentum-strategies-too-dumb-introducing-probabilistic-momentum/
# http://cssanalytics.wordpress.com/2014/02/12/probabilistic-momentum-spreadsheet/
#******************************************************************
ir = sqrt(lookback.len) * runMean(ret$EQ - ret$FI, lookback.len) / runSD(ret$EQ - ret$FI, lookback.len)
momentum.p = pt(ir, lookback.len - 1)

data$weight[] = NA
data$weight$EQ[] = iif(cross.up(momentum.p, confidence.level), 1, iif(cross.dn(momentum.p, (1 - confidence.level)), 0, NA))
data$weight$FI[] = iif(cross.dn(momentum.p, (1 - confidence.level)), 1, iif(cross.up(momentum.p, confidence.level), 0, NA))
models$Probabilistic = bt.run.share(data, clean.signal=T)

data$weight[] = NA
data$weight$EQ[] = iif(cross.up(momentum.p, confidence.level), 1, iif(cross.up(momentum.p, (1 - confidence.level)), 0, NA))
data$weight$FI[] = iif(cross.dn(momentum.p, (1 - confidence.level)), 1, iif(cross.up(momentum.p, confidence.level), 0, NA))
models$Probabilistic.Leverage = bt.run.share(data, clean.signal=T)

#*****************************************************************
# Create Report
#******************************************************************
strategy.performance.snapshoot(models, T)
Next, let’s examine the hourly performance of the strategy.
#*****************************************************************
# Hourly Performance
#******************************************************************
strategy.name = 'Probabilistic.Leverage'
ret = models[[strategy.name]]$ret
ret.number = 100 * as.double(ret)

dates = index(ret)
factor = format(dates, '%H')

layout(1:2)
par(mar=c(4,4,1,1))
boxplot(tapply(ret.number, factor, function(x) x), outline=T, main=paste(strategy.name, 'Distribution of Returns'), las=1)
barplot(tapply(ret.number, factor, function(x) sum(x)), main=paste(strategy.name, 'P&L by Hour'), las=1)
There are lots of abnormal returns in the 9:30-10:00am box due to big overnight returns, i.e. the return from today's open to the prior day's close. If we exclude this first observation each day, the distribution for each hour is more consistent.
#*****************************************************************
# Hourly Performance: Remove first return of the day (i.e. overnight)
#******************************************************************
day.stat = bt.intraday.day(dates)
ret.number[day.stat$day.start] = 0

layout(1:2)
par(mar=c(4,4,1,1))
boxplot(tapply(ret.number, factor, function(x) x), outline=T, main=paste(strategy.name, 'Distribution of Returns'), las=1)
barplot(tapply(ret.number, factor, function(x) sum(x)), main=paste(strategy.name, 'P&L by Hour'), las=1)
The strategy performs best in the morning and tapers off in the afternoon and overnight.
These hourly seasonality plots are just a different way to analyze the performance of a strategy based on Intraday data.
To view the complete source code for this example, please have a look at the bt.strategy.intraday.thebonnotgang.test() function in bt.test.r at github.
Capturing Intraday data
I want to follow up the Intraday data post with an example of how you can capture Intraday data without too much effort by recording 1 minute snapshots of the market.
I will take market snapshots from Yahoo Finance using the following function, which downloads delayed market quotes with date and time stamps:
###############################################################################
# getSymbols interface to Yahoo today's delayed quotes
# based on getQuote.yahoo from quantmod package
###############################################################################
getQuote.yahoo.today <- function(Symbols) {
    require('data.table')
    what = yahooQF(names = spl('Symbol,Last Trade Time,Last Trade Date,Open,Days High,Days Low,Last Trade (Price Only),Volume'))
    names = spl('Symbol,Time,Date,Open,High,Low,Close,Volume')
    all.symbols = lapply(seq(1, len(Symbols), 100), function(x) na.omit(Symbols[x:(x + 99)]))
    out = c()

    for(i in 1:len(all.symbols)) {
        # download
        url = paste('http://download.finance.yahoo.com/d/quotes.csv?s=',
            join(trim(all.symbols[[i]]), ','),
            '&f=', what[[1]], sep = '')
        txt = join(readLines(url), '\n')
        data = fread(paste0(txt, '\n'), stringsAsFactors=F, sep=',')
        setnames(data, names)
        setkey(data, 'Symbol')
        out = rbind(out, data)
    }
    out
}
Next, we can run the getQuote.yahoo.today function every minute from 9:30 to 16:00 and record market snapshots. Please note that you will have to make some judgment calls in terms of how you want to deal with highs and lows.
Symbols = spl('IBM,AAPL')
prev = c()

while(T) {
    out = getQuote.yahoo.today(Symbols)

    if (is.null(prev))
        for(i in 1:nrow(out)) {
            cat(names(out), '\n', sep=',', file=paste0(out$Symbol[i], '.csv'), append=F)
            cat(unlist(out[i]), '\n', sep=',', file=paste0(out$Symbol[i], '.csv'), append=T)
        }
    else
        for(i in 1:nrow(out)) {
            s0 = prev[Symbol == out$Symbol[i]]
            s1 = out[i]
            s1$Volume = s1$Volume - s0$Volume
            s1$Open = s0$Close
            s1$High = iif(s1$High > s0$High, s1$High, max(s1$Close, s1$Open))
            s1$Low  = iif(s1$Low  < s0$Low , s1$Low , min(s1$Close, s1$Open))
            cat(unlist(s1), '\n', sep=',', file=paste0(out$Symbol[i], '.csv'), append=T)
        }

    # copy
    prev = out

    # sleep 1 minute
    Sys.sleep(60)
}
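The loop above runs indefinitely; one possible way to keep it inside the 9:30 to 16:00 session is sketched below, assuming your system clock is set to exchange (Eastern) time:

# a sketch: only capture between market open and close, assuming the
# system clock is in exchange time
market.open  = as.POSIXct(format(Sys.Date(), '%Y-%m-%d 09:30:00'))
market.close = as.POSIXct(format(Sys.Date(), '%Y-%m-%d 16:00:00'))

while(Sys.time() < market.close) {
    if (Sys.time() >= market.open) {
        out = getQuote.yahoo.today(Symbols)
        # ... same record keeping logic as in the loop above ...
    }
    # sleep 1 minute
    Sys.sleep(60)
}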
For example, here are the quotes I was able to save for AAPL:
Symbol  Time    Date       Open    High    Low      Close   Volume
AAPL    2:57pm  3/10/2014  528.360 533.330 528.3391 531.340 5048146
AAPL    2:58pm  3/10/2014  531.340 531.570 531.3400 531.570    7650
AAPL    2:59pm  3/10/2014  531.570 531.570 531.5170 531.517    2223
AAPL    3:00pm  3/10/2014  531.517 531.517 531.4500 531.450    5283
AAPL    3:01pm  3/10/2014  531.450 531.450 531.2900 531.290    4413
AAPL    3:02pm  3/10/2014  531.290 531.490 531.2900 531.490    2440
Unfortunately, there is no way to go back in history unless you buy historical Intraday data. But if you want to start recording market moves yourself, the code above should get you started.
Intraday data
In the Intraday Backtest post, I showed an example of loading and working with Forex Intraday data from FXHISTORICALDATA.COM. Recently, I came across another interesting source of Intraday data at the Bonnot Gang site. Please note that you will have to register to get access to the Intraday data; the registration is free.
Today, I want to examine the quality of the Intraday data from the Bonnot Gang and show how it can be integrated into a backtest using the Systematic Investor Toolbox. For the example below, please first download and save 1 minute Intraday historical data for SPY and GLD. Next, let's load and plot the time series for SPY.
###############################################################################
# Load Systematic Investor Toolbox (SIT)
# https://systematicinvestor.wordpress.com/systematic-investor-toolbox/
###############################################################################
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
source(con)
close(con)

#*****************************************************************
# Load historical data
#******************************************************************
load.packages('quantmod')

# data from http://thebonnotgang.com/tbg/historical-data/
spath = 'c:/Desktop/'
# http://stackoverflow.com/questions/14440661/dec-argument-in-data-tablefread
Sys.localeconv()["decimal_point"]
Sys.setlocale("LC_NUMERIC", "French_France.1252")

data <- new.env()
data$SPY = read.xts(paste0(spath, 'SPY_1m.csv'),
    sep = ';', date.column = 3, format='%Y-%m-%d %H:%M:%S', index.class = c("POSIXlt", "POSIXt"))
data$GLD = read.xts(paste0(spath, 'GLD_1m.csv'),
    sep = ';', date.column = 3, format='%Y-%m-%d %H:%M:%S', index.class = c("POSIXlt", "POSIXt"))

#*****************************************************************
# Create plot for Nov 1, 2012 and 2013
#******************************************************************
layout(c(1,1,2))
plota(data$SPY['2012:11:01'], type='candle', main='SPY on Nov 1st, 2012', plotX = F)
plota(plota.scale.volume(data$SPY['2012:11:01']), type = 'volume')

layout(c(1,1,2))
plota(data$SPY['2013:11:01'], type='candle', main='SPY on Nov 1st, 2013', plotX = F)
plota(plota.scale.volume(data$SPY['2013:11:01']), type = 'volume')
It jumps out right away that the data provider changed the time scale: in 2012, data was recorded from 9:30 to 16:00, while in 2013, data was recorded from 13:30 to 20:00.
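A quick way to confirm this shift, as a sketch, is to tabulate the range of recording times by year:

# earliest and latest time stamp per year; the change in session
# hours should show up in the min/max times
dates = index(data$SPY)
tapply(format(dates, '%H:%M'), format(dates, '%Y'), range)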
Next, let’s check if there are any big Intraday gaps in the series.
#*****************************************************************
# Data check for Gaps in the series Intraday
#******************************************************************
i = 'GLD'
dates = index(data[[i]])
factor = format(dates, '%Y%m%d')
gap = tapply(dates, factor, function(x) max(diff(x)))
gap[names(gap[gap > 4*60])]
data[[i]]['2013:02:19']

i = 'SPY'
dates = index(data[[i]])
factor = format(dates, '%Y%m%d')
gap = tapply(dates, factor, function(x) max(diff(x)))
gap[names(gap[gap > 4*60])]
data[[i]]['2013:02:19']
Please see below the dates for GLD with gaps of over 4 minutes:
20120801 12
20121226 22
20130219 48
20130417  6
20130531  6
20130705  8
20131105  4
20131112  4
20140124 14
20140210 22
20140303  6
A detailed look at February 19th, 2013 shows a 48 minute gap between 14:54 and 15:42:
> data[[i]]['2013:02:19 14:50::2013:02:19 15:45']
                        open    high      low   close volume
2013-02-19 14:50:54 155.3110 155.315 155.3001 155.315   8900
2013-02-19 14:51:56 155.3100 155.310 155.3100 155.310 119900
2013-02-19 14:52:52 155.3100 155.330 155.3000 155.305 354600
2013-02-19 14:53:55 155.2990 155.300 155.2800 155.280      0
2013-02-19 14:54:54 155.2900 155.290 155.2659 155.279  10500
2013-02-19 15:42:57 155.3400 155.360 155.3400 155.350 587900
2013-02-19 15:43:57 155.3501 155.355 155.3300 155.332   8300
2013-02-19 15:44:59 155.3395 155.340 155.3200 155.340  10700
2013-02-19 15:45:55 155.3300 155.340 155.3300 155.340   5100
So there was definitely something going on with data acquisition at that time.
Next, let’s compare the Intraday data with daily data:
#*****************************************************************
# Data check : compare with daily
#******************************************************************
data.daily <- new.env()
quantmod::getSymbols(spl('SPY,GLD'), src = 'yahoo', from = '1970-01-01', env = data.daily, auto.assign = T)

layout(1)
plota(data$GLD, type='l', col='blue', main='GLD')
plota.lines(data.daily$GLD, type='l', col='red')
plota.legend('Intraday,Daily', 'blue,red')

plota(data$SPY, type='l', col='blue', main='SPY')
plota.lines(data.daily$SPY, type='l', col='red')
plota.legend('Intraday,Daily', 'blue,red')
The Intraday data matches Daily data very well.
Please note that the raw Intraday data comes with a seconds time stamp. For back-testing purposes, we will want to round each time stamp up to the next minute, so that we can merge the Intraday data series without introducing multiple entries for the same minute. For example:
#*****************************************************************
# Round to the next minute
#******************************************************************
GLD.sample = data$GLD['2012:07:10::2012:07:10 09:35']
SPY.sample = data$SPY['2012:07:10::2012:07:10 09:35']

merge( Cl(GLD.sample), Cl(SPY.sample) )

# round to the next minute
index(GLD.sample) = as.POSIXct(format(index(GLD.sample) + 60, '%Y-%m-%d %H:%M'), format = '%Y-%m-%d %H:%M')
index(SPY.sample) = as.POSIXct(format(index(SPY.sample) + 60, '%Y-%m-%d %H:%M'), format = '%Y-%m-%d %H:%M')

merge( Cl(GLD.sample), Cl(SPY.sample) )
> merge( Cl(GLD.sample), Cl(SPY.sample) )
                       close close.1
2012-07-10 09:30:59 155.0900 136.030
2012-07-10 09:31:59 155.1200 136.139
2012-07-10 09:32:58 155.1100      NA
2012-07-10 09:32:59       NA 136.180
2012-07-10 09:33:56 155.1400      NA
2012-07-10 09:33:59       NA 136.100
2012-07-10 09:34:59 155.0999 136.110
2012-07-10 09:35:59 155.0200 136.180

> merge( Cl(GLD.sample), Cl(SPY.sample) )
                       close close.1
2012-07-10 09:31:00 155.0900 136.030
2012-07-10 09:32:00 155.1200 136.139
2012-07-10 09:33:00 155.1100 136.180
2012-07-10 09:34:00 155.1400 136.100
2012-07-10 09:35:00 155.0999 136.110
2012-07-10 09:36:00 155.0200 136.180
I got the impression that this Intraday data is not really authentic, but was instead collected by taking Intraday snapshots of the quotes and later processing them into one minute bars. But I might be wrong.
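One quick way to sanity check this impression, as a sketch, is to look at the seconds component of the time stamps; genuine one minute bars would normally be stamped exactly on the minute:

# distribution of the seconds component of the GLD time stamps;
# a wide spread of values suggests bars built from periodic snapshots
table(format(index(data$GLD), '%S'))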
Next, let’s clean the Intraday data by removing any day with time gaps of over 4 minutes, and round all times up to the next minute:
#*****************************************************************
# Clean data
#******************************************************************
# remove dates with gaps over 4 min
for(i in ls(data)) {
    dates = index(data[[i]])
    factor = format(dates, '%Y%m%d')
    gap = tapply(dates, factor, function(x) max(diff(x)))
    data[[i]] = data[[i]][ is.na(match(factor, names(gap[gap > 4*60]))) ]
}

common = unique(format(index(data[[ls(data)[1]]]), '%Y%m%d'))
for(i in ls(data)) {
    dates = index(data[[i]])
    factor = format(dates, '%Y%m%d')
    common = intersect(common, unique(factor))
}

# remove days that are not present in both time series
for(i in ls(data)) {
    dates = index(data[[i]])
    factor = format(dates, '%Y%m%d')
    data[[i]] = data[[i]][!is.na(match(factor, common)),]
}

#*****************************************************************
# Round to the next minute
#******************************************************************
for(i in ls(data))
    index(data[[i]]) = as.POSIXct(format(index(data[[i]]) + 60, '%Y-%m-%d %H:%M'), tz = Sys.getenv('TZ'), format = '%Y-%m-%d %H:%M')
Once the Intraday data is ready, we can test a simple equal weight strategy:
#*****************************************************************
# Load historical data
#******************************************************************
bt.prep(data, align='keep.all', fill.gaps = T)

prices = data$prices
dates = data$dates
nperiods = nrow(prices)

models = list()

#*****************************************************************
# Benchmarks
#******************************************************************
data$weight[] = NA
data$weight$SPY = 1
models$SPY = bt.run.share(data, clean.signal=F)

data$weight[] = NA
data$weight$GLD = 1
models$GLD = bt.run.share(data, clean.signal=F)

data$weight[] = NA
data$weight$SPY = 0.5
data$weight$GLD = 0.5
models$EW = bt.run.share(data, clean.signal=F)

#*****************************************************************
# Create Report
#******************************************************************
strategy.performance.snapshoot(models, T)
In this post, I tried to outline the basic steps you need to take when you plan to work with a new data source. Next, I plan to follow up with more examples of testing Intraday strategies.
To view the complete source code for this example, please have a look at the bt.intraday.thebonnotgang.test() function in bt.test.r at github.