How to Data Mine Google Reader Feeds for Trends
Although I was never the A+ math student in school, I am a big fan of drawing insights from statistics - if the method is simple. If it involves math that ends in "ometry" then it's way over my head.
In addition, I am a huge believer in studying tendencies. Humans are all creatures of habit. Identify a person's or a group's patterns and you can figure out, directionally, where they are headed. This makes it easier to spot and capitalize on trends no matter what your interests are. (Believe it or not, these lessons come from reading about NFL coaches who actively study player/team patterns "on film.")
Google Reader - my favorite RSS application - recently added a powerful search functionality that has made me infinitely better at studying people and their social patterns. Using Google Reader you can now search an individual feed, tag or a folder and get back a total item count, all sorted by date for as long as you have been a subscriber to that feed. In my case, some of my feeds go back to October 2005 when the Reader first launched. That's a ton of data to mine for trends.
Now that my reader shows a huge cache of posts, I am subscribing to tons more feeds, stuffing them into a folder solely for the purpose of data mining them. The site also has a limited set of advanced search operators. One hopes they will add more. It's worth noting that I don't actually read these high-volume feeds. Rather, I mark them all as read so they get logged in my feed database and can be searched for insights.
Let's take a look at this in action at a very simplistic level. One of my favorite blogs is Lost Remote. I have been subscribed to their feed ever since I started using Google Reader. So I have two years' worth of posting data to mine.
Let's take a look at some searches for the major TV nets and the results they returned.
Let's take another simplistic example. Is Robert Scoble showing more blog love to Facebook and Twitter than to his newborn son, Milan? Hmmm, the data shows it. (Just kidding, Robert!) This is just a superficial analysis of his blog, but in reality I could also add Robert's Twitter stream and do the same run, as long as it all lives in a Google Reader folder.
There's much more data here than what I have in the chart. When you actually look at the search results, patterns emerge. The vast majority of Robert's Facebook mentions came after they opened up their development platform in May. He only mentioned the site 14 times in 2006. Now imagine I ran this same search across all of the big tech bloggers, the Digg home page and Techmeme feeds - all at once. What would I learn? Data breeds insights. And insights make you smarter at whatever you want to accomplish.
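For anyone who wants to try the same kind of count outside of Google Reader, here is a minimal sketch in Python. It assumes you already have a feed saved as RSS 2.0 text; the sample feed, its items and the function name `mentions_by_month` are all made up for illustration, and this is not how Google Reader itself computes its results.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from email.utils import parsedate_to_datetime

# Hypothetical sample feed, invented for this example.
# Real data would come from a saved or fetched RSS 2.0 file.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example blog</title>
<item><title>Facebook opens its platform</title>
<description>Thoughts on the Facebook developer platform.</description>
<pubDate>Thu, 24 May 2007 10:00:00 GMT</pubDate></item>
<item><title>Weekend notes</title>
<description>Nothing about social networks today.</description>
<pubDate>Sat, 02 Jun 2007 09:00:00 GMT</pubDate></item>
<item><title>Trying a new site</title>
<description>I signed up for Facebook this week.</description>
<pubDate>Mon, 11 Dec 2006 08:30:00 GMT</pubDate></item>
</channel></rss>"""


def mentions_by_month(rss_text, term):
    """Count items whose title or description mentions `term`, keyed by 'YYYY-MM'."""
    # Whole-word, case-insensitive match on the search term.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for item in ET.fromstring(rss_text).iter("item"):
        text = " ".join(
            item.findtext(tag, default="") for tag in ("title", "description")
        )
        if pattern.search(text):
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            published = parsedate_to_datetime(item.findtext("pubDate"))
            counts[published.strftime("%Y-%m")] += 1
    # Sort by month so the trend reads chronologically.
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    print(mentions_by_month(SAMPLE_RSS, "Facebook"))
```

Each item is counted once no matter how many times the term appears in it, which roughly mirrors counting search hits rather than raw word occurrences. Running the same function over several feeds and adding up the results would approximate a folder-wide search.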

A lot of the very basic stuff - e.g. searches within a feed - you can glean from using Google Blog Search, Blogpulse and Technorati. However, do not underestimate Google Reader. If you subscribe to feeds just for the sake of data mining and organize them the right way, you will be able to read tea leaves better than you can using a search engine. This will make you smarter at whatever subject you want to follow. It works best on full text feeds, but try it on mainstream newsfeeds too. You can learn a lot about what words make it into headlines and how often.
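The headline idea above can be sketched in a few lines of Python. This is only an illustration: the headlines and the stopword list are invented for the example, and `headline_word_counts` is a hypothetical helper name, not part of any tool mentioned in this post.

```python
import re
from collections import Counter

# Hypothetical headlines, made up for this example.
HEADLINES = [
    "NBC expands online video",
    "ABC and NBC stream shows online",
    "CBS tests online video player",
]

# A tiny, assumed stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def headline_word_counts(headlines, stopwords=STOPWORDS):
    """Tally how often each non-stopword appears across the headlines."""
    counts = Counter()
    for headline in headlines:
        # Lowercase and keep alphabetic runs (with apostrophes) as words.
        words = re.findall(r"[a-z']+", headline.lower())
        counts.update(word for word in words if word not in stopwords)
    return counts


if __name__ == "__main__":
    # Print the three most frequent headline words.
    for word, count in headline_word_counts(HEADLINES).most_common(3):
        print(word, count)
```

Feeding in the titles from a real newsfeed, one list per month, would show which words rise and fall in the headlines over time.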







This is certainly a tool to help with social media measurement but it's not the end all.
Measuring engagement/interest by instances only isn't sufficient. There are many other attributes that need to be measured.
See this white paper I co-wrote with Matt Toll of Factiva (Dow Jones) to balance out the attributes.
http://www.web-strategist.com/blog/2007/08/20/social-media-white-paper-tracking-the-influence-factiva-of-dow-jones/
Just to be clear, this is certainly a step in the right direction.
I expect to see some dynamic graphing tools start to build reports like you did above.
Posted by: Jeremiah Owyang | Friday, September 21, 2007 at 05:10 PM
Thanks for this, Steve. I came here from your Twitter message. The URL was truncated, though, and did not work.
Posted by: vaspers aka steven e. streight | Friday, September 21, 2007 at 05:11 PM
Jeremiah, I actually think it's a tool for all data, not just social media. Whatever lives in feeds. Also, I see this more as an insights tool rather than a way to really substantiate success or measurement.
Posted by: Steve Rubel | Friday, September 21, 2007 at 05:24 PM
Great post, Steve.
It's fantastic that we can access a bit more of the resident data, but it's too bad we have to jump through hoops like pretending to read it so that the items show up in the feed db.
There is so much data living inside non-accessible server log files that sites don't share. I hope more people will share more in the future.
For example, Feedburner tells us how many subscriptions there are to feeds, so why doesn't Google show us how many unique individuals are actually reading each post? They have the data somewhere. Or how many "subscriptions" does Scoble have in his GReader's shared items? How many people read them? If we had the ability to "reshare" Scoble's shared items, then we might also be able to learn about how memes, ideas or content are being distributed, and who is participating at what levels.
As Jeremiah points out in his whitepaper:
"the ones with the greatest impact are those that result in specific niche conversations, interaction and participation in any given community."
(Disclaimer: we're working on a project that will try to increase the amount of data available while improving its transparency.) But this is an enormous topic that requires the community's guidance and participation.
Thanks for highlighting how feed data can lead to interesting insights. Cheers!
Posted by: Israel LHeureux | Friday, September 21, 2007 at 06:41 PM
I certainly hope Google is indexing all of this stuff jointly across the server farms. Just imagine what would happen if everyone stashed as many feeds as Rubel - and the storage and processing power that would consume.
Imagine the market for shared archives! I'll bet Steve wishes he had more historical archives of the RSS feeds of many of the blogs he de-listed before. If there was only a way to share those through Google Shared or a "sneakernet transfer" through Google Gears...
(Steve. Will I get credit for the above?)
Posted by: Ike | Friday, September 21, 2007 at 08:14 PM
Excellent post! Obviously your binding of Stats has reinforced your data bank of subject matter!
Posted by: marshal sandler | Saturday, September 22, 2007 at 09:57 AM
I use MyYahoo. I don't see anything comparable to this on MyYahoo. They seem to be more concerned about how and where to display ads than about rolling out new features. In the near future I want to shift from MyYahoo to Google Reader unless Yahoo comes up with a comparable analysis tool. The only problem is I don't know any easy way of doing this. I have over 30 feeds in MyYahoo. If someone knows an easy way, like an OPML export in MyYahoo, please let me know.
Posted by: Free Pres | Saturday, September 22, 2007 at 12:19 PM
Hey Steve,
That's a fantastic post. It was interesting to learn that you are interested in studying people's habits and patterns. Can you suggest any book that talks about human tendencies and their influence, especially on business? Anyway, keep up the good work.
Peace.
Posted by: Josh | Saturday, September 29, 2007 at 12:17 AM