Thursday, May 13, 2010

Clustering based on news flow

In this post, I want to take a look at some comparisons across multiple companies. I’m starting with the 30 financial companies from the S&P 500 with the largest number of entity instances in our system. An entity in our system is pretty simple: a person, place, or company, for example. So the financial services company Goldman Sachs would be an entity. Entities get created as we need them, and some entities, such as Goldman Sachs, have additional attributes assigned to them, like ticker, membership in the S&P 500, etc. A specific entity is in our system just once. When we identify references to entities in the content we process, we create an entity instance for each reference. So blog posts that discuss Goldman Sachs will generate entity instances for Goldman Sachs. Below is a table of the top 30 referenced companies in our system together with the number of entity instances we gathered between May of 2009 and May of 2010. I’ve also included their market cap as of May 12, 2010. I am using an R-JSON interface to access aggregation functionality available from our web service to obtain the entity instance counts.



Company                          # Entity Instances   Market Cap ($B)
Goldman Sachs Group              91738                79.3
Citigroup Inc.                   44675                119.6
JPMorgan Chase & Co.             35547                165.5
Bank of America Corp.            35419                171.2
American International Group     25774                5.6
Morgan Stanley                   23041                38.9
Moody's Corp                     14813                5.3
Wells Fargo                      14244                175.2
SLM Corporation                  10133                5.8
Prudential Financial             8216                 29.8
NASDAQ OMX Group                 5846                 4.2
American Express                 4564                 53.0
CME Group Inc.                   4187                 21.8
MetLife Inc.                     3923                 36.0
Bank of New York Mellon Corp.    3489                 37.5
Simon Property Group Inc         3216                 26.4
E-Trade                          2717                 3.5
BB&T Corporation                 2706                 24.6
NYSE Euronext                    2419                 8.2
State Street Corp.               2368                 21.46
Regions Financial Corp.          2246                 10.5
PNC Financial Services           1973                 36
SunTrust Banks                   1863                 15.9
AFLAC Inc.                       1844                 23.5
Fifth Third Bancorp              1792                 11.9
Capital One Financial            1646                 21.1
Ameriprise Financial Inc.        1503                 11.9
Allstate Corp.                   1470                 17.8
Northern Trust Corp.             1460                 13.4
U.S. Bancorp                     1423                 51.




It’s interesting to note that while there is some correlation between market cap and number of entity instances, it’s not as strong as I might have expected. Some of this is obviously due to the coverage of various aspects of the financial crisis. It might take a little more investigation to understand why U.S. Bancorp and Northern Trust have similar degrees of coverage with a nearly four-fold difference in market cap.
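
For readers who want to put a number on that relationship, here is a minimal sketch in Python (not the R tooling used for this post) that computes a Pearson correlation between entity instance counts and market cap. It uses only the ten most-referenced companies from the table, so the figure it prints describes that subset and not all 30 companies. The `pearson` helper is my own illustration.

```python
import math

# (company, entity instances, market cap in $B) for the ten most-referenced
# companies in the table above. This is a subset, chosen only for brevity.
rows = [
    ("Goldman Sachs Group", 91738, 79.3),
    ("Citigroup Inc.", 44675, 119.6),
    ("JPMorgan Chase & Co.", 35547, 165.5),
    ("Bank of America Corp.", 35419, 171.2),
    ("American International Group", 25774, 5.6),
    ("Morgan Stanley", 23041, 38.9),
    ("Moody's Corp", 14813, 5.3),
    ("Wells Fargo", 14244, 175.2),
    ("SLM Corporation", 10133, 5.8),
    ("Prudential Financial", 8216, 29.8),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


counts = [row[1] for row in rows]
caps = [row[2] for row in rows]
correlation = pearson(counts, caps)
print(f"Pearson r (top 10 only) = {correlation:.2f}")
```

A rank-based measure such as Spearman correlation might be a better fit here, since the counts are heavily skewed toward Goldman Sachs.
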
In order to compare companies, I’m going to look at weekly entity instance counts for each of these companies between May of 2009 and May of 2010. For example, a portion of the Goldman and Citigroup coverage is included in the table below (ranging from the 19th week of 2009 to the 17th week of 2010).
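
As an illustration of what a weekly count means, the sketch below buckets dated entity instances by company and week. It is written in Python, not against our web service, and it makes two assumptions: the records are hypothetical, and weeks are numbered using the ISO 8601 convention, which may not match the week numbering used for the figures in this post.

```python
from collections import Counter
from datetime import date

# Hypothetical entity-instance records: (company, date the reference was seen).
instances = [
    ("Goldman Sachs Group", date(2009, 5, 4)),
    ("Goldman Sachs Group", date(2009, 5, 6)),
    ("Citigroup Inc.", date(2009, 5, 7)),
    ("Goldman Sachs Group", date(2009, 5, 12)),
    ("Citigroup Inc.", date(2010, 4, 27)),
]


def weekly_counts(records):
    """Count entity instances per (company, ISO year, ISO week)."""
    counts = Counter()
    for company, day in records:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(company, iso[0], iso[1])] += 1
    return counts


counts = weekly_counts(instances)
for (company, year, week), n in sorted(counts.items()):
    print(f"{company}: {year} week {week}: {n}")
```

Keying on the ISO year as well as the week number keeps week 19 of 2009 separate from week 19 of 2010, which matters for a window that spans two calendar years.
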



You can easily see that the number of entity instances from early 2009 is smaller than what we see in 2010. This is due to the increase in the number of sources we are now harvesting from. If we were trying to compare Goldman coverage in 2010 to Goldman coverage in 2009, we would need to normalize for this difference in harvesting sources. However, since I am interested in comparing companies, I need to use a different normalization approach. For my purposes today, I want to standardize the counts for each company. For each company, I’m going to subtract the mean and divide by the standard deviation of its weekly levels. This generates a standardized measure of entity instances for each company over time. I do this because I want to compare companies based on the pattern of news flow for each company rather than the volume.
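
The standardization step can be sketched in a few lines of Python. The weekly values below are hypothetical, and I am assuming the sample standard deviation (dividing by n - 1); the population version would work the same way with a slightly different scale.

```python
from statistics import mean, stdev


def standardize(weekly_levels):
    """Subtract the mean and divide by the sample standard deviation.

    Raises ZeroDivisionError for a constant series, whose deviation is zero.
    """
    m = mean(weekly_levels)
    s = stdev(weekly_levels)
    return [(x - m) / s for x in weekly_levels]


# Hypothetical weekly counts: same pattern, ten-fold difference in volume.
high_volume = [100, 200, 400, 200, 100]
low_volume = [10, 20, 40, 20, 10]

z_high = standardize(high_volume)
z_low = standardize(low_volume)
print([round(z, 2) for z in z_high])
print([round(z, 2) for z in z_low])
```

The two series produce identical standardized values, which is the point: once standardized, only the shape of the news flow remains and the volume drops out.
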

I’m going to use a hierarchical clustering approach for this comparison. (Strictly speaking, I’m using the agnes algorithm from the cluster package in R with Euclidean distances and average linkage.) The results of the clustering are seen below.
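
To show what agglomerative clustering with Euclidean distances and average linkage does, here is a naive Python sketch. It is not agnes and not the code used for this post: it is a small, slow illustration that repeatedly merges the two clusters whose members are closest on average. The series names and values are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(series):
    """Naive agglomerative clustering with average linkage.

    series maps a name to its list of (standardized) weekly values.
    Returns the merges in order as (members_a, members_b, height), where
    height is the mean pairwise distance between the two merged clusters.
    """
    clusters = [(name,) for name in series]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            total = sum(euclidean(series[p], series[q]) for p in a for q in b)
            d = total / (len(a) * len(b))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical two-week series for four made-up companies.
toy = {"A": [0, 0], "B": [0, 1], "C": [5, 5], "D": [5, 7]}
merges = average_linkage(toy)
for a, b, height in merges:
    print(f"merge {a} + {b} at height {height:.2f}")
```

The merge heights are what a dendrogram draws on its vertical axis: pairs that merge early, at a low height, have the most similar patterns.
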


Companies are organized here by the similarity of the pattern of their news flow over the year’s worth of data I’ve included. The degree of similarity between two branches is related to the height at which they join, measured from the bottom of the graph. The closer the connection between branches is to the bottom, the more similar the news patterns of the companies are. Thus AIG and Prudential have more similarity than CME and MetLife. From here it would be interesting to look deeper into the data that drives both the expected and unexpected groupings.

This is just one approach to making this type of comparison. I might have used a different time grouping, different clustering approaches, or even different comparison frameworks (e.g. principal component analysis). I just wanted to take a first look at an approach and share it with interested observers.
