Monday, November 29, 2010

Clean Tech vs. Mobile Media Buzz - Introducing Recorded Future’s Custom Linguistic Scoring Feature for API Users

At Recorded Future, we employ cutting-edge computational linguistic technology to organize structured and unstructured content from the web into instances of content that reference particular entities and events. Once we have this information organized across dimensions of entities, events, and several time dimensions, we calculate several metrics based on this restructured content. These metrics include measures of linguistic properties of the content itself (positive and negative sentiment) as well as measures based on a global view of the data (momentum).

Lately, we've helped Recorded Future API users apply their own custom linguistic scoring approaches, based on a collection of interesting words and phrases, to our web content. In effect, this allows users to score Recorded Future's database of web content according to their own beliefs about what is interesting in that content.

Let's take a look at a concrete example of how this might be used in practice. Let's say I pose the following question: Does chatter online around particular technologies predict stock returns for companies focused on building these technologies? Using Recorded Future's new custom linguistic scoring feature, I'll show an approach we might use to answer this question.

Let's start with our experimental setup. For the purposes of this post, I'll look specifically at the discussion of clean technology vs. discussion of mobile technology. In order to use a custom linguistic scoring metric, I first need to determine an appropriate basket of words or phrases to score a particular text fragment for its discussion of each of these topics. I developed one basket of words for clean tech, and another for mobile.

I could have picked a few words out of thin air ("apps", "CDMA", "android", "iphone") to describe these technologies, but in this case I took a slightly more sophisticated approach. I pulled a corpus of company descriptions from CrunchBase, where the companies have been classified by industry. I calculated the frequencies of words used across all descriptions in the corpus, then calculated the frequencies of words in just the descriptions of mobile companies. I then ranked words by the difference in frequency between the "mobile only" corpus and the general corpus and selected the 100 highest-ranked words. I followed a similar approach with the clean tech corpus.
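For readers who want to try something similar, here is a minimal sketch of that ranking step in Python. It is not the code used for this post: the tokenizer, the function names, and the tiny sample descriptions are all assumptions made for illustration, and "frequency" is taken to mean a word's share of all word tokens in a corpus.

```python
import re
from collections import Counter

def relative_frequencies(descriptions):
    """Return each word's share of all word tokens in the given descriptions."""
    counts = Counter()
    for text in descriptions:
        # Assumed tokenizer: lowercase runs of letters and digits.
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

def build_basket(industry_descriptions, all_descriptions, size=100):
    """Rank words by (industry frequency - overall frequency), keep the top `size`."""
    industry = relative_frequencies(industry_descriptions)
    overall = relative_frequencies(all_descriptions)
    ranked = sorted(
        industry,
        key=lambda word: industry[word] - overall.get(word, 0.0),
        reverse=True,
    )
    return ranked[:size]

# Made-up descriptions for illustration only (not CrunchBase data).
mobile = ["Mobile apps for the iphone", "SMS and mobile messaging for phones"]
other = ["Solar energy systems for homes", "Online video advertising for the web"]
basket = build_basket(mobile, mobile + other, size=3)
```

Because the ranking uses a difference in frequencies, common words such as "for" and "the" that are equally frequent in both corpora score near zero and drop out of the basket.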

The idea here is that words that appear with abnormally high frequency in the descriptions of companies in a specific industry are likely to be characteristic of that industry. To give you an idea of what this exercise yielded, let's take a look at the highest scoring words in the two industries.

Clean Tech    Mobile
renewable     application
electric      text
gas           location
carbon        sms
green         devices
clean         applications
water         service
systems       phones
technology    iphone
solar         wireless
power         phone
energy        mobile

Now, we have everything we need to score some text for its level of reference to words related to these industries. So, now what data do we want to score? And how can we take this scored content and turn it into something meaningful?

Since we're exploring a link between this language and stock prices, I took all references to S&P 500 companies (entity occurrences) in our database over the period January 1, 2009 to October 17, 2010 and applied the scoring metrics to this text. I then averaged the scores of all text fragments for each day in this date range to come up with an average score per day for each metric. The result is an aggregate level of chatter around each topic in text that references S&P 500 companies over the time period. Below, you can see a chart comparing these two metrics:
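Before getting to the chart, here is a rough sketch of the scoring and daily averaging just described. The scoring rule (the share of a fragment's words that appear in the basket) is an assumption for illustration, not necessarily how the Recorded Future API scores text, and the basket and fragments below are invented.

```python
import re
from collections import defaultdict
from datetime import date

def score_fragment(text, basket):
    """Share of tokens in `text` that appear in `basket` (an assumed scoring rule)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in basket for token in tokens) / len(tokens)

def daily_average_scores(fragments, basket):
    """Average the fragment scores for each day.

    `fragments` is an iterable of (day, text) pairs.
    """
    by_day = defaultdict(list)
    for day, text in fragments:
        by_day[day].append(score_fragment(text, basket))
    return {day: sum(scores) / len(scores) for day, scores in sorted(by_day.items())}

# Hypothetical basket and fragments, for illustration only.
mobile_basket = {"mobile", "iphone", "sms", "wireless"}
fragments = [
    (date(2010, 10, 15), "Apple ships new iphone"),
    (date(2010, 10, 15), "Quarterly earnings beat estimates"),
    (date(2010, 10, 16), "Wireless carriers expand mobile coverage"),
]
daily = daily_average_scores(fragments, mobile_basket)
```

Running the same function once per basket gives one daily series per metric, which is the kind of data plotted in a chart like the one that follows.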


As you can see, the level of chatter around clean tech has remained flat over the period, while mobile is on the rise. Now, how can we test whether there's a relationship between these metrics and market activity?

One way to do this might be to look at venture capital deal flows by industry. Another way might be to go to the public markets and compare these two metrics with a couple of industry-specific ETFs. In this case, we look at the Claymore/MAC Global Solar Energy Index ETF (TAN) for comparison with our clean tech metric, and the PowerShares Dynamic Telecommunications & Wireless Portfolio ETF (PTE) for comparison with our mobile metric:



We see a positive correspondence between the returns of these two ETFs and the changes in their respective metrics over the entire period. We also see certain sub-periods where the returns of the ETFs are strongly negatively correlated with their metrics.
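One simple way to put numbers on this kind of comparison is a Pearson correlation between daily changes in a metric and daily ETF returns, computed both over the whole period and over rolling sub-periods. The sketch below assumes that approach; the post itself does not say how the correspondence was measured, and the two short series are made-up values, not real metric or price data.

```python
import math

def pct_changes(series):
    """Period-over-period fractional changes of a list of nonzero numbers."""
    return [(b - a) / a for a, b in zip(series, series[1:])]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either series is constant
    return cov / math.sqrt(var_x * var_y)

def rolling_pearson(xs, ys, window):
    """Correlation over each consecutive window of `window` observations."""
    return [
        pearson(xs[i:i + window], ys[i:i + window])
        for i in range(len(xs) - window + 1)
    ]

# Made-up daily metric values and ETF closing prices, for illustration only.
metric = [0.10, 0.12, 0.11, 0.15, 0.14, 0.18]
etf_close = [20.0, 20.5, 20.2, 21.0, 21.3, 21.1]

metric_changes = pct_changes(metric)
etf_returns = pct_changes(etf_close)
overall = pearson(metric_changes, etf_returns)
by_window = rolling_pearson(metric_changes, etf_returns, window=3)
```

Comparing `overall` with the values in `by_window` is how a positive relationship over the full period can coexist with sub-periods where the sign flips.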

Even if these relationships are not strong, they suggest a few next steps in exploratory analysis of this data - perhaps we look at the relationship between days when our mobile metric is unusually high and volatility in mobile stocks? Additionally, it looks like the level of discussion around clean tech peaked in late 2009/early 2010, just as the Solar ETF was about to take a dive. Is this more than coincidence? We'd have to dig into the data to find out.

Of course, I didn't have to pick clean tech and mobile as metrics here. I didn't even have to pick an industry: all I need is a basket of words for classifying language. It could be "tightening" talk by central bankers, or deceitful language used by CEOs. I also didn't have to pick mentions of S&P 500 companies as the source text to score. I could have chosen entity occurrences in blogs written by my favorite economic bloggers, or around takeover rumors in the healthcare sector.

The possibilities are endless. If you're interested in this idea, please contact us to obtain an API token.
