Thursday, February 25, 2010

Recorded Future: Statistical Analytics With R

There are so many opportunities that can be imagined with the information we are assembling at Recorded Future. Every time we present the concept to someone for the first time, they think of new possibilities. New relationships to find. New signals to assess. And I'm no different.
However, I also think about all of this information at a different level. Namely, when we imagine a relationship to look for, and we devise an algorithm to find that relationship, how do we ensure that the answer we get is based on the underlying relationship and not some artifact in the data? Artifact? Well, let's explore a little in the context of a simple question: are we seeing a spike or drop in news about a specific company?
You could simply look at a graph of the number of articles about that company since January 2009. In fact, let's do that. I've written a little program in R that gets data for a specific time range and a specific entity using the Recorded Future web API. (An API is an Application Programming Interface that allows software I write, in this case in R, to interact with another software environment, in this case the Recorded Future web site.) R is a free software environment for statistical computing and graphics (R-Project). It contains many libraries for specialized types of analyses, including libraries specifically for financial data, which should allow for really interesting integration of Recorded Future data with cutting-edge financial algorithms.

The API that I'm using is a preliminary version of the API planned for the Recorded Future content. It is likely to change without notice, so I've removed a few details from the URL below. The specific example here illustrates how little code is required to get access to our content. Anyone interested in getting early access to the API should feel free to contact me at bill.ladd at recorded future.

library(RCurl)
getrf<-function(entity,startTime="2001-01-01",stopTime="2100-01-01") {
entity<-gsub(" ","+",entity,fixed=T)
opts = curlOptions(header = FALSE, userpwd = "u/p")
url<-paste("http://tempurl/ws/entity/events?e=",entity,"&p0=",startTime,"&p1=",stopTime,sep="")
wp<-getURL(url, .opts = opts)
out <- read.csv(tc <- textConnection(wp),as.is=T); close(tc)
out
}

This example function in R can extract records for all news events in our database for a specific entity (e.g., Apple) for a given range of dates. I've run this for the top 80 entities in the Recorded Future database and stored the data in an R session. The Recorded Future API returns tables of information like the one below, extracted from a query for news articles discussing Apple.


Start.time  Stop.time   Published  Type        Rank       Source                              Title
3/1/2009    3/1/2009    3/24/2009  Occurrence  0.1210739  ContrarianProfits                   The Importance of Stop Losses
3/24/2009   3/24/2009   3/24/2009  Occurrence  0.1210739  PortfolioMarketMovers               Cash: The Winners and Losers
4/1/2009    4/1/2009    3/23/2009  Occurrence  0.1210739  seeking-alpha-hardware              Apple Is Still Apple Even without Steve Jobs
3/24/2009   3/24/2009   3/24/2009  Occurrence  0.1210739  seeking-alpha-editors-picks         Companies' Cash Holdings: The Winners and Losers
3/24/2009   3/24/2009   3/24/2009  Occurrence  0.1210739  seeking-alpha-usmarket              Trading Update - Recovery Edition II
1/1/2007    1/1/2007    3/24/2009  Occurrence  0.1210739  slashdot                            Court Says USPTO Can Change Patent Rules
1/1/2012    12/31/2012  3/24/2009  Occurrence  0.1210739  seeking-alpha-consumer-electronics  Exuberant Forecasts for Apple, RIM and Palm - RBC

In this table, you can see both the start and stop time (in the relative past, present and future) for the event recorded from the article as well as the publication date, source and title of the article. Other information is available in our databases, but this will help us get started.

Now I'm ready to do some exploring. For example, I've created a plot in R to look at Apple and the number of entries from news articles for each day of 2009.
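For readers who want to try something similar, here is a minimal sketch of the counting step behind such a plot. It assumes the events arrive as a data frame with a Published column in month/day/year form, as in the table above, and that entries are counted by publication date. The sample data and the daily_counts helper are made up for illustration and are not part of the Recorded Future API.

```r
# Hypothetical sample shaped like the API table above; only the
# Published column matters for this step.
events <- data.frame(
  Published = c("3/24/2009", "3/24/2009", "3/23/2009", "3/24/2009"),
  Source = c("ContrarianProfits", "PortfolioMarketMovers",
             "seeking-alpha-hardware", "slashdot"),
  stringsAsFactors = FALSE
)

# Count how many entries were published on each day.
daily_counts <- function(events) {
  published <- as.Date(events$Published, format = "%m/%d/%Y")
  counts <- table(published)
  data.frame(date = as.Date(names(counts)),
             n = as.vector(counts))
}

daily <- daily_counts(events)
```

With real data, something like plot(daily$date, daily$n, type = "h") would then draw one vertical bar per day.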



There are two main trends that we can easily see in this data. First, the volume of news about Apple is generally increasing over time. This is because we are continually adding news sources to our database. Second, we see weekly rises and drops in the volume. This is due to a significant reduction in news coverage on weekends. Both of these make immediate sense once we think about them, and both are examples of artifacts in the data that we want to account for when we are asking questions.

Let's return to our question about finding a spike or a drop in news coverage. Given that the volume of news is changing dramatically, on both short and long time scales, let's get the daily volume of all news captured by Recorded Future. There is an API to get news volume that I can access with the following R code:

getrfVol<-function(industry,startTime="2001-01-01",stopTime="2100-01-01") {
opts = curlOptions(header = FALSE, userpwd = "u/p")
url<-paste("http://tempurl/ws/entity/company_volume?d0=",startTime,"&d1=",stopTime,"&i=",industry,sep="")
wp<-getURL(url, .opts = opts)
out <- read.csv(tc <- textConnection(wp),as.is=T); close(tc)
out
}

This allows me to get the daily volume of all news since March 2009. Let's take a look at what we get.



This plot contains the daily number of news entries captured by our system. You can see the weekly pattern, as well as a rough step change in May when a number of new sources started entering the system.
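One quick way to confirm the weekly pattern is to average the daily counts by day of the week. The sketch below is only an illustration: the volume numbers are invented, and the column names date and n are assumptions rather than the actual fields returned by the volume API.

```r
# Hypothetical daily totals for two weeks starting Monday 2009-06-01.
volume <- data.frame(
  date = seq(as.Date("2009-06-01"), by = "day", length.out = 14),
  n = c(5200, 5400, 5100, 5300, 5000, 1200, 1100,
        5600, 5500, 5700, 5400, 5200, 1300, 1000)
)

# Average count for each day of the week.
mean_by_weekday <- function(daily) {
  # as.POSIXlt()$wday numbers days 0 (Sunday) through 6 (Saturday).
  wday <- as.POSIXlt(daily$date)$wday
  tapply(daily$n, wday, mean)
}

by_day <- mean_by_weekday(volume)
```

If the weekend averages (entries "0" and "6") sit far below the weekday ones, the dips in the plot are a calendar effect rather than a change in how much news there is about any one company.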

It's clear from these two plots that there is, unsurprisingly, a relationship between the number of events for a company on a specific day and the number of events in our database for that day. If we are interested in spikes in news for a company, we need to make sure we are accounting for changes in the overall volume as well. One way to normalize the information for a given company relative to the overall news volume is to look at what fraction of the overall news from a given day is about company X. These ratios are relatively small (~0.05 or smaller), and taking logs of these ratios gives us data we can more easily visualize. Consider the following plot generated in R.
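The normalization itself takes only a few lines. The sketch below is a minimal illustration, assuming two data frames of daily counts with columns date and n (hypothetical names) and using the natural log, since the base is not stated here. The numbers are invented.

```r
# Hypothetical daily counts: `company` for one company, `total` for all news.
company <- data.frame(date = as.Date(c("2009-06-01", "2009-06-02")),
                      n = c(50, 20))
total <- data.frame(date = as.Date(c("2009-06-01", "2009-06-02", "2009-06-03")),
                    n = c(1000, 2000, 1500))

log_news_ratio <- function(company, total) {
  # Keep only the days present in both tables.
  merged <- merge(company, total, by = "date",
                  suffixes = c(".company", ".total"))
  # Share of the day's news that is about the company, and its log.
  merged$ratio <- merged$n.company / merged$n.total
  merged$log_ratio <- log(merged$ratio)
  merged
}

apple_ratio <- log_news_ratio(company, total)
```

Days on which the company has no entries need separate handling: they either drop out of the merge or, if present with a count of zero, produce a log ratio of -Inf.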



We are looking at the log of the ratio of Apple news events to all news events since the beginning of 2009. To simplify the interpretation, I've removed weekends and days with fewer than 1000 news entries. This type of chart is called a control chart, and the dotted lines mark the limits beyond which values fall outside the typical range.
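To make the filtering and the control limits concrete, here is a rough sketch in base R. It is not necessarily how the chart in this post was produced. It uses one common construction, the individuals chart, where the limits sit three sigma from the mean and sigma is estimated from the average moving range. The function names, the column names date and total, and all of the numbers are assumptions made for illustration.

```r
# Drop weekends and low-volume days (the 1000-entry threshold is from the text).
keep_busy_weekdays <- function(daily, min_total = 1000) {
  wday <- as.POSIXlt(daily$date)$wday  # 0 = Sunday, 6 = Saturday
  daily[!(wday %in% c(0, 6)) & daily$total >= min_total, ]
}

# Individuals-chart limits: center line plus or minus three sigma, with sigma
# estimated from the average moving range (1.128 is the d2 constant for n = 2).
control_limits <- function(x) {
  center <- mean(x)
  sigma <- mean(abs(diff(x))) / 1.128
  list(center = center, lcl = center - 3 * sigma, ucl = center + 3 * sigma)
}

# Hypothetical week starting Monday 2009-11-02, with one low-volume Tuesday.
week <- data.frame(
  date = seq(as.Date("2009-11-02"), by = "day", length.out = 7),
  total = c(5000, 900, 5200, 5100, 5300, 1500, 1400)
)
kept <- keep_busy_weekdays(week)

# Hypothetical log ratios with one unusually high day.
x <- c(-3.0, -3.1, -3.0, -3.1, -3.0, -3.1, -3.0, -3.1,
       -3.0, -3.1, -1.0, -3.1, -3.0, -3.1, -3.0, -3.1)
lim <- control_limits(x)

# Positions of the points that fall outside the control limits.
outside <- which(x > lim$ucl | x < lim$lcl)
```

This sketch only flags single points beyond the limits. Detecting a "run" of consecutive values on one side of the center line would need an additional rule.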
We can see a run of days in November where Apple was consistently quieter than usual (the yellow dots are an unlikely "run" of values above or below the average). We also see dates in September and January 2010 that appear to be spikes in Apple news; that is, they are above the "upper control limit" of the chart.
What happened on those days? iPad? Are these the news spikes we are looking for, or are they artifacts due to some data phenomenon we haven't yet identified? If they are real news spikes, do they predict or lag other values like trading volume, volatility or price? Are there other normalization strategies that help us understand the data better, such as normalization by industry or control limits that vary over time? Can we predict such peaks? These are the questions we at Recorded Future are working on, and over time you'll see some of the answers on this blog.
