Monday, September 20, 2010

White Paper - News Analytics for Quantitative Trading Strategies

Recorded Future is building news analytics for large-scale analysis of online media flow, spanning blogs and Twitter, mainstream news, and government filings. The white paper discussing our temporal analytic approach can be found here.

Although our content has broad application across many domains, we have had significant initial interest from the area of algorithmic trading across asset classes. This document will focus on some news analytic approaches relevant to this area.

News Analytics


In order to define investment strategies, quantitative investors take a variety of data streams and build models based on principles such as pairs trading and mean reversion. They assess these models with backtesting and other historical simulation methods, and implement them in trading strategies.

Recorded Future news analytic content fits directly into that approach as an additional set of news analytic data streams that may either be modeled on their own or in conjunction with other data streams.

In some cases, the Recorded Future data streams may be explored for statistically significant relationships with market outcomes of interest and when these are found, optimized and included in trading strategies. Other approaches may simply evaluate a variety of trading strategies based on the Recorded Future data.

The point of any analysis in support of investment is to motivate a change in positions. In the end, any signals of interest, whether continuous, discrete, or composite, will be applied in a trading strategy.

Before diving too deeply into modeling issues, it is important to consider the classes of signals available in Recorded Future content.

These can be broken into discrete and continuous data types:


Measures and metrics: Continuous Data Types

Continuous streams, such as momentum, sentiment, hedging, entity volume, and document volume, are measured or calculated quantities that vary over time for specific events and entities.

Momentum is a measure of the “buzz” around a specific entity (person, company, place) or event type ("merger," "person travel," etc). It is based on short-, intermediate-, and long-term levels and change in content, as well as source credibility and a number of other factors. Think of it as a “Google PageRank” for media flow content.

Sentiment measures include metrics of the positivity and negativity of the language used in the context of an entity or event, while hedging is a measure of the certainty in the language describing an entity or event. The simplest measurements, by contrast, just count the number of entity instances or event instances specific to a company of interest.

These factors are essentially time series of specific metrics. They can be refined (to subsets) and aggregated (as averages over supersets) to specific groupings of interest as desired.

For example, company record volume, sentiment and momentum can be grouped by industry, market cap, etc. These measures can also be broken down further; one could examine sentiment or company record volume from specific media source types, media topic, geography, etc.
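The grouping described above can be sketched in a few lines of Python. This is a minimal illustration only: the record layout, tickers, industries, and sentiment values are invented and do not reflect the Recorded Future data format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily company records: (date, company, industry, sentiment).
# Field names and values are illustrative only.
records = [
    ("2010-09-01", "AAA", "Pharma", 0.20),
    ("2010-09-01", "BBB", "Pharma", 0.40),
    ("2010-09-01", "CCC", "Energy", -0.10),
    ("2010-09-02", "AAA", "Pharma", 0.10),
    ("2010-09-02", "CCC", "Energy", 0.30),
]


def aggregate_by_group(rows):
    """Average a company-level metric over (date, industry) supersets."""
    buckets = defaultdict(list)
    for date, _company, industry, value in rows:
        buckets[(date, industry)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


industry_sentiment = aggregate_by_group(records)
for key in sorted(industry_sentiment):
    print(key, round(industry_sentiment[key], 3))
```

The same pattern applies to other groupings (market cap bucket, source type, geography) by changing the key used for the buckets.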

Additionally, these time series can be evaluated in different frameworks. Content can be interpreted according to the time it is published or according to the time it is made available in our system. Typically this difference is small, though it can occasionally be large, for example when adding a new historical source. This choice might depend on what type of backtesting one is interested in performing.
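To make the distinction concrete, here is a small sketch of a point-in-time filter for backtesting. The field names `published` and `available` and the sample records are assumptions made for illustration, not the actual schema.

```python
from datetime import datetime

# Hypothetical records carrying both timestamps; field names are illustrative.
# Record 2 mimics a historical source that was added to the system much later.
records = [
    {"id": 1, "published": datetime(2010, 3, 1, 9, 0),
     "available": datetime(2010, 3, 1, 9, 5)},
    {"id": 2, "published": datetime(2009, 6, 15, 12, 0),
     "available": datetime(2010, 8, 1, 0, 0)},
]


def visible_as_of(rows, as_of, clock="available"):
    """Return ids of records a backtest may use at `as_of` under the chosen clock."""
    return [row["id"] for row in rows if row[clock] <= as_of]


as_of = datetime(2010, 4, 1)
by_availability = visible_as_of(records, as_of, clock="available")
by_publication = visible_as_of(records, as_of, clock="published")
print(by_availability, by_publication)
```

Filtering on availability time avoids using content that could not have been seen on the simulated date, while filtering on publication time answers what the media flow looked like in hindsight.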

One might also want to focus on event time. As new events are added into our system, we determine when these events are stated to occur, whether it's in the past, present or future. These event times are particularly useful in finding and analyzing predicted future events.

Event and Temporal Data: Discrete Data Types

The core records in the Recorded Future database are event and entity instances. Entities are typically companies, people or geographic locations while there are currently ~150 event types including "Quotation," "Acquisition," "Earnings Call," to name a few.

Consider an event such as a quotation from Ben Bernanke about the federal funds rate. The Recorded Future database will contain a record of specific event instances for this over time. Each of these instances is an atomic event, derived from a single observed event and can be used in further modeling. It is also possible to generate discrete events from continuous measures, for example a specific company having a momentum change of X over the course of a week.
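As a rough sketch of turning a continuous measure into discrete events, the function below flags days on which a series has risen by at least a chosen threshold over a fixed window. The momentum values, the five-day window, and the 0.25 threshold are made up for the example.

```python
def threshold_events(series, window, threshold):
    """Return indices t where series[t] - series[t - window] >= threshold."""
    return [
        t for t in range(window, len(series))
        if series[t] - series[t - window] >= threshold
    ]


# Hypothetical daily momentum values for one company (illustrative numbers).
momentum = [0.10, 0.12, 0.11, 0.13, 0.15, 0.40, 0.42, 0.41]

# Flag days where momentum rose by at least 0.25 over five trading days.
event_days = threshold_events(momentum, window=5, threshold=0.25)
print(event_days)
```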

Atomic events can be grouped together to form composite events. For example, three or more press releases and two or more insider trading events happening in the same week for a given company is a composite event. We can create a single event from a set of rules applied to atomic events. The rules for defining a composite event may be arbitrarily complex and may include partial time ordering as well as the occurrence of specific intra-relationships between atomic events (e.g. the press release and the insider trading events all correspond to the same company).
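The press release and insider trading rule above can be sketched as follows. The tuple layout, event type labels, and company names are hypothetical, and "same week" is interpreted here as the same ISO calendar week, which is one possible reading.

```python
from collections import Counter
from datetime import date

# Hypothetical atomic events: (company, event_type, event_date).
atomic_events = [
    ("ACME", "press_release", date(2010, 9, 6)),
    ("ACME", "press_release", date(2010, 9, 7)),
    ("ACME", "press_release", date(2010, 9, 9)),
    ("ACME", "insider_trade", date(2010, 9, 8)),
    ("ACME", "insider_trade", date(2010, 9, 10)),
    ("BETA", "press_release", date(2010, 9, 7)),
    ("BETA", "insider_trade", date(2010, 9, 8)),
]


def composite_events(events, min_press=3, min_insider=2):
    """Return sorted (company, iso_year, iso_week) keys that satisfy the rule."""
    counts = Counter()
    for company, event_type, day in events:
        iso_year, iso_week, _weekday = day.isocalendar()
        counts[(company, iso_year, iso_week, event_type)] += 1
    keys = {(c, y, w) for (c, y, w, _t) in counts}
    return sorted(
        key for key in keys
        if counts[key + ("press_release",)] >= min_press
        and counts[key + ("insider_trade",)] >= min_insider
    )


triggered = composite_events(atomic_events)
print(triggered)
```

Grouping by company before counting is what enforces the intra-relationship that all contributing atomic events refer to the same company.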

These composite events are closely related to complex events, and their detection and analysis are related to complex event processing. As defined here, the composite event is the collection of aggregated atomic events, and the complex event is a higher level event inferred from the existence of the composite event, perhaps significant changes occurring at a company that meets these criteria.

Signal Analysis

Modeling Market Metrics with Continuous Recorded Future Variables

Analysis of continuous data may be performed using a variety of regression approaches examining the explanatory power of the continuous data against outcomes of interest such as returns, trading volume, or volatility. Other predictors may be added to see if the Recorded Future continuous data provides explanatory power after compensating for other variables such as S&P performance (other common predictors include...).

In one such analysis posted on our blog, we looked at whether differences in momentum for a company were predictive of changes in market volume following the momentum change. In a regression controlling for both the previous day's volume and the average volume over the last 20 days, we found a statistically significant relationship between the previous day's momentum (weighted by the trailing average volume) and the current day's dollar volume. The specific model fit was:

$$DV_t = a \cdot DV_{t-1} + b \cdot \mathrm{SMA}(DV, t-1, t-20) + c \cdot \left(MO_{t-1} \cdot \mathrm{SMA}(DV, t-1, t-20)\right) + e_t$$

Where $DV_x$ is Dollar Volume at time $x$, SMA provides a simple moving average function on a range of time periods, $MO$ is the Recorded Future momentum measure, and $e_t$ is the error term at time $t$. We performed the analysis in the statistical computing environment R and the fitted model was:

Call:
lm(formula = Dollarvol.1 ~ 0 + lDollarvol.1 + smaDvol.Dollarvol.1 + smaxlMo, data = seriesdf)

Residuals:
       Min         1Q     Median         3Q        Max
-5.039e+09 -2.215e+07 -2.284e+06  1.813e+07  1.597e+10

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
lDollarvol.1        0.513193   0.003237  158.54  < 2e-16 ***
smaDvol.Dollarvol.1 0.471645   0.003817  123.56  < 2e-16 ***
smaxlMo             0.077162   0.015683    4.92 8.67e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 170900000 on 72109 degrees of freedom
Multiple R-squared: 0.8539, Adjusted R-squared: 0.8539
F-statistic: 1.405e+05 on 3 and 72109 DF, p-value: < 2.2e-16


The positive coefficient for the smaxlMo term (the previous day's momentum multiplied by the trailing average dollar volume) implies increasing trading volume with increasing momentum. More details on this momentum analysis can be seen in our blog.
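For readers working outside R, the sketch below shows one way a no-intercept regression of this shape could be set up in plain Python. It is not the code behind the analysis above: the helper names are invented, the data are synthetic (simulated with arbitrary coefficients and noise), and the solver uses the normal equations, which is adequate for a three-predictor illustration but not a substitute for a statistics package.

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_no_intercept(rows, y):
    """Least-squares coefficients for y ~ 0 + predictors (normal equations)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def design_rows(dv, mo, window=20):
    """Build (DV[t-1], SMA, MO[t-1] * SMA) predictors and DV[t] targets."""
    rows, targets = [], []
    for t in range(window, len(dv)):
        sma = sum(dv[t - window:t]) / window  # mean of DV[t-window] .. DV[t-1]
        rows.append([dv[t - 1], sma, mo[t - 1] * sma])
        targets.append(dv[t])
    return rows, targets


# Synthetic series only: coefficients and noise level are arbitrary choices.
rng = random.Random(0)
assumed = [0.5, 0.45, 0.08]
mo = [rng.random() for _ in range(400)]
dv = [rng.uniform(50.0, 150.0) for _ in range(20)]
for t in range(20, 400):
    sma = sum(dv[t - 20:t]) / 20
    dv.append(assumed[0] * dv[t - 1] + assumed[1] * sma
              + assumed[2] * mo[t - 1] * sma + rng.gauss(0.0, 5.0))

rows, targets = design_rows(dv, mo)
coef = fit_no_intercept(rows, targets)
print([round(c, 3) for c in coef])
```

This sketch returns point estimates only; standard errors and p-values such as those in the R summary would require additional computation or a dedicated library.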

This example is just one possibility of how to use Recorded Future's continuous metrics as a predictor for market data. Many other news-based approaches are possible.

Perhaps more significantly, these metrics can be incorporated into existing models to add additional explanatory power. If “news” is contributing noise to an existing model, incorporation of news analytic data may improve the model performance. Quantitative investors might consider the strategies that they are using today and assess the potential utility of adding news analytic metrics to existing models.

Modeling Market Metrics with Discrete Recorded Future Variables

The evaluation of a discrete signal for trading may involve deriving a set of potential trades (or non-trades if evaluating trading hiatus strategies) from the signal and evaluating the returns obtained by making those trades. Did the direction of the trade and rise/fall of the asset price agree more often than expected? What is the average return on trades made using the signal? What are the Sharpe/Sortino ratios of trades based on the trading signal? How do the returns on the trading signal compare with the market? These approaches are appropriate for both atomic and composite discrete events.
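A minimal sketch of these evaluation questions is shown below. The trade directions and returns are invented, and the ratios are computed per trade with a zero risk-free rate and a zero target return, which are simplifying assumptions rather than a recommended convention.

```python
from math import sqrt
from statistics import mean, stdev


def evaluate_signal(directions, asset_returns):
    """Summarize trades; direction is +1 (long) or -1 (short) for each trade."""
    trade_returns = [d * r for d, r in zip(directions, asset_returns)]
    hits = sum(1 for tr in trade_returns if tr > 0)
    # Downside deviation: root mean square of the negative trade returns.
    downside_dev = sqrt(mean(min(tr, 0.0) ** 2 for tr in trade_returns))
    avg = mean(trade_returns)
    return {
        "hit_rate": hits / len(trade_returns),
        "mean_return": avg,
        "sharpe": avg / stdev(trade_returns),
        "sortino": avg / downside_dev if downside_dev else float("inf"),
    }


# Hypothetical signal directions and subsequent asset returns.
directions = [1, -1, 1, 1, -1]
asset_returns = [0.02, -0.01, -0.03, 0.01, 0.02]

stats = evaluate_signal(directions, asset_returns)
print(stats)
```

Comparing against the market would additionally require benchmark returns over the same holding periods, which are omitted here.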

In another example from our blog, we looked at “future” events for S&P 500 companies, where future events are limited to events “occurring” after publication, lasting one day or less and occurring on a trading day. We then looked to see if market volume on these days for these companies was higher than average.

We found a statistically significant relationship where volume on these “future” days was on average higher than on other days. We also looked at this for individual companies using a Wilcoxon test and observed that for an unexpectedly large number of companies, future days had increased volume.


Histogram of P-Values for relationship between Future Events and Trading Volume. A disproportionate number of the relationships show statistical significance.

If there were no relationship between future events and volume, we’d expect this histogram to be relatively flat, with roughly 5% of the tests having a p-value below 0.05. In contrast, we see about 35% of our companies having significant differences between predicted event volume and non-predicted volume. This type of prediction of volumes may be useful if an investor is interested in the change in liquidity of a given stock over time.

In the preceding example, we looked at market metrics on days associated with discrete events from the Recorded Future database and compared them to market metrics on other days to see if the discrete events are associated with differences in these metrics. In the example from the previous section, we looked at whether there was a relationship between the continuous momentum metric and trading volume. It is also possible to combine these discrete and continuous variables into arbitrarily complex metrics.

For example, in a third blog post we looked at whether there was a relationship between company mentions in a specific financial news blog (FT Alphaville) and future market returns. Specifically, for discrete days where a company was mentioned in that blog, we calculated a metric based on sentiment and momentum for that company and looked for a relationship between that metric and returns. We found a statistically significant relationship and, interestingly enough, did not find a similar relationship across media mentions as a whole.

Thus far we have considered atomic discrete events. Composite events may be arbitrarily complex and it may also be useful to think of scoring them for potential relevance. Consider the world of mergers and acquisitions, where we might want to monitor 15-20 different classes of atomic events and trigger the composite event when a “critical mass” of the various events has occurred. This “critical mass” could be a score built by applying scoring criteria to the underlying events. For instance, the more sources that report on a potential merger, the higher the score of the “merger” event might be.
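One way such a "critical mass" score might look is sketched below. The event classes, weights, source names, and the threshold of 6.0 are all hypothetical; in practice the scoring criteria would need to be chosen and calibrated by the analyst.

```python
# Hypothetical weights per atomic event class; real weights would be calibrated.
WEIGHTS = {"rumor": 1.0, "executive_meeting": 2.0, "regulatory_filing": 3.0}


def merger_score(observations, weights=WEIGHTS):
    """Sum, over event classes, weight * number of distinct reporting sources."""
    sources_by_class = {}
    for event_class, source in observations:
        sources_by_class.setdefault(event_class, set()).add(source)
    return sum(
        weights.get(event_class, 0.0) * len(sources)
        for event_class, sources in sources_by_class.items()
    )


def is_triggered(observations, critical_mass=6.0):
    """Fire the composite "merger" event once the score reaches the threshold."""
    return merger_score(observations) >= critical_mass


# (event_class, source) pairs for one potential merger; repeated reports from
# the same source are counted once.
observations = [
    ("rumor", "blog_a"), ("rumor", "blog_b"), ("rumor", "blog_a"),
    ("executive_meeting", "newswire"),
    ("regulatory_filing", "filings_feed"),
]

print(merger_score(observations), is_triggered(observations))
```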

Signals like this may be assessed by a human for potential relevance rather than automatically triggering a trade. A composite event detection paradigm can provide value by tracking numerous lower level events that in themselves might not be informative, but when combined with other similar event streams might lead to a coherent signal.

Trading Strategies

Statistically significant relationships are important, but in order to actually generate profits from a specific signal, an explicit trading strategy must be specified. Based on a given signal, there will be a large number of strategies available by varying hold times and portfolio strategy, as well as by defining which transaction decisions are tied to which levels of the signal. Additionally, other signals from other sources might be integrated, both in selecting trades and in weighting portfolios. Clearly, the potential trading value of any signal will depend greatly on the trading strategy that uses it. Financial modeling expertise will be required to select the optimum trading strategy for any signal of interest.

We explored one news analytic approach for this in a blog posting. In that case we analyzed a trading strategy based on a change in sentiment in specific sources about a company.

According to the selected strategy, if positive sentiment was increasing over time we took or held a long position, while a decrease in positive sentiment led to taking or holding a short position. The market performance of a paper portfolio based on these trading signals is displayed below:

This particular strategy did a good job responding to the market crisis in late 2008 but doesn’t fare well in less turbulent times. Perhaps this signal could be used in other trading strategies to improve performance.
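The long/short rule described above can be sketched as follows. This is an illustration of the rule's mechanics, not the backtest from the blog post: the sentiment and return values are invented, and transaction costs, position sizing, and the choice of sources are ignored.

```python
def sentiment_positions(positive_sentiment):
    """+1 (long) when sentiment rose, -1 (short) when it fell, else hold."""
    positions, current = [], 0
    for prev, cur in zip(positive_sentiment, positive_sentiment[1:]):
        if cur > prev:
            current = 1
        elif cur < prev:
            current = -1
        positions.append(current)
    return positions


def paper_portfolio(positions, next_day_returns, start=1.0):
    """Compound a paper portfolio; positions[i] earns next_day_returns[i]."""
    value, path = start, [start]
    for pos, ret in zip(positions, next_day_returns):
        value *= 1.0 + pos * ret
        path.append(value)
    return path


# Hypothetical daily positive-sentiment readings and the asset returns earned
# on the day after each position is set.
sentiment = [0.30, 0.35, 0.35, 0.20, 0.25]
next_day_returns = [0.01, -0.02, -0.03, 0.02]

positions = sentiment_positions(sentiment)
equity = paper_portfolio(positions, next_day_returns)
print(positions, round(equity[-1], 4))
```

Applying each position to the following day's return keeps the signal strictly ahead of the return it is credited with.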

Trading-Independent Analysis

One may want to look for statistically significant relationships between two types of events, or events and continuous readouts that are not related to trading. In general, we have a collection of point processes and continuous data streams. Exploring whether point processes are predictive of continuous processes can be done similarly to the trading strategies discussed earlier: examine the set of changes in the continuous variable following an event and determine whether the behavior is typical.

For example, consider a set of momentum changes per day for a company when the event has not occurred. This collection of changes will have a mean and a variance. We also consider the momentum changes from the much smaller set of days following specific event occurrences.

We can use parametric (e.g. t-test) or non-parametric (e.g. Wilcoxon test) approaches to assess whether the two sets of data are consistent with having the same distribution. These approaches can establish a statistically significant relationship between the two signals, although they do not reach the standard needed to determine causation.
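As an illustration of the non-parametric route, the sketch below implements a two-sided Wilcoxon rank-sum test using the normal approximation, with only the Python standard library. The momentum changes are invented, and the approximation omits tie and continuity corrections, so a statistics package would be preferable for real analysis.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_test(sample_a, sample_b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    No tie or continuity correction is applied, so the p-value is rough for
    small or heavily tied samples.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_a, z, p_value


# Hypothetical daily momentum changes: days following an event vs. other days.
after_event = [0.05, 0.07, 0.04, 0.06, 0.08]
baseline = [-0.02, 0.01, 0.00, -0.01, 0.02, 0.01,
            -0.03, 0.00, 0.02, -0.01, 0.01, -0.02]

u_stat, z_score, p_value = rank_sum_test(after_event, baseline)
print(u_stat, round(z_score, 3), round(p_value, 4))
```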

Examining relationships between point processes can be performed in a number of ways. One simple approach is to look at the rate of occurrence in one of the event types in time periods before or after the other event type. Compare the observed rates in these time periods to the overall rates to assess the significance of the relationships.
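The rate comparison can be sketched as below, with days represented as integer indices. The trigger days, target days, and three-day window are invented, and the sketch stops at computing the two rates; judging significance would require a further step such as a binomial or Poisson test.

```python
def rate_after(trigger_days, target_days, window, total_days):
    """Compare the target event rate in post-trigger windows to the overall rate.

    Days are integer indices in [0, total_days). Returns a tuple of
    (rate inside the windows, overall rate), both in events per day.
    """
    targets = set(target_days)
    covered = set()
    for day in trigger_days:
        covered.update(
            d for d in range(day + 1, day + 1 + window) if d < total_days
        )
    hits = len(covered & targets)
    window_rate = hits / len(covered) if covered else 0.0
    overall_rate = len(targets) / total_days
    return window_rate, overall_rate


# Hypothetical occurrence days for two event types over a 100-day period.
trigger_days = [10, 40, 70]
target_days = [11, 12, 41, 55, 72, 90]

window_rate, overall_rate = rate_after(
    trigger_days, target_days, window=3, total_days=100
)
print(round(window_rate, 3), overall_rate)
```

Looking at windows before the trigger instead of after it only requires changing the range of days added to the covered set.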

10 financial modelling experiments for you to run with Recorded Future’s News Analytic Data
Here are some suggested analyses you can run with Recorded Future.
  1. Can I build a profitable trading strategy using sentiment and momentum based metrics for different event types?
  2. Can I detect times where sentiment/momentum for a company diverge from those for an industry?
  3. Are certain events predictive of abnormal returns?
  4. Can I define a set of Future occurring events that are predictive of market metrics like abnormal returns, volatility, or volume?
  5. Can I incorporate Recorded Future news analytic content (events, or company metrics) in my existing models to improve predictive power?
  6. Can I predict the times when my existing models fail?
  7. Can I find collections of related events that are predictive of market metrics?
  8. Can I assess the credibility of a source by looking at past predictions?
  9. Can I detect quiet periods for companies?
  10. Are there differences between blog and mainstream sentiment, and can I build a trading signal from this?

Getting Started with the News Analytic API

Users access the Recorded Future content via a web services based Application Programming Interface (API). Using an industry standard JSON format, many different languages and environments can be used to access the service including Python, Java, R, and Matlab.

We maintain documentation of our API as well as examples and have put together a tutorial showing how to use them. These samples are hosted on our new Google Code site, which is our central repository for such examples.

API users can download these examples and start accessing Recorded Future content immediately. Access to this documentation does not require an API license and anyone interested in a deeper and more technical investigation of the API and content can review the materials at the Google Code site.

Recorded Future Web Analytic Interface
Recorded Future also provides a user interface for interacting with our content. Quantitative traders may use this site to begin exploring the type of data we organize to look for potential signals and patterns that they can use to make trades on an ongoing basis.

The web user interface site can be used to support a hypothesis generation phase. Once hypotheses have been formulated they can be backtested via the API and if deemed valuable can be implemented as part of a trading strategy using future data obtained through the API.



Any pattern researched can be systematically monitored through the use of so-called Futures, where a pattern is monitored and users are notified via email if the pattern is matched. For example: notify me as soon as there’s a product problem among pharmaceutical companies within a week of a product launch.

Conclusion

Recorded Future’s news analytic data contains discrete entities and events occurring in the past, present, and future, as well as a (growing) number of derived continuous metrics generated from these events and entities. A web service API is available for investors to extract data sets of interest into their analytic environment of choice, and historical data is available for building relevant models.

Once an investor has determined a useful model, real-time API queries can be performed to extract the latest data to be applied in the model. This suite of data and tools is currently in use by finance professionals and is available to others interested in adding news analytic strategies to their quantitative modeling approaches.

3 comments:

  1. Interesting article. Shocked I am the first to comment on it.

    Seems this model could also extend to include analysis to model demand for new product and/or give insights to new product adoptions.

  2. Thanks for great post!
    Market Optimization liked your informative post a lot. It is valuable for us. Keep posting...

  3. That's a very interesting article. Quantitative strategies are definitely needed for such factors.

