I have been poking around Adam Cooper's text mining weak signals R code and, being too lazy to collect data in CSV format, wondered if I could come up with something similar that used RSS feeds. I discovered it was really easy to read and start to mine RSS feeds in R, but there didn't seem to be much help available on the web, so I thought I'd share my findings.

My test case was the new CETIS publications site. Phil has blogged about how the underlying technology behind the site is WordPress, which means it has an easy-to-find feed. I wrote a very small script to test things out that looks something like this:

      library(XML)   # provides xmlParse, xmlRoot, xpathApply, xmlSApply

      # doc is the parsed feed, e.g. doc <- xmlParse(feed_url)
      src <- xpathApply(xmlRoot(doc), "//category")
      tags <- NULL

      for (i in 1:length(src)) {
             tags <- rbind(tags, data.frame(tag = xmlSApply(src[[i]], xmlValue)))
      }

This simply grabs the feed and puts all the category tags into a data frame. I then removed the tags that referred to the type of publication and plotted the rest as a pie chart. I'm pretty sure this isn't the prettiest way to do this, but it was very quick and worked!

         cats <- subset(tags, tag != "Briefing Paper" & tag != "White Paper" & tag != "Other Publication" & tag != "Journal Paper" & tag != "Report")
         cats$tag <- factor(cats$tag)   # re-factor to drop the unused levels
         pie(table(cats$tag))           # the pie chart mentioned above

Which gave me a visual breakdown of all the categories used on our publications site and how often each is used:

[Pie chart of publication types]

I was surprised at how much of a five-minute job it was. It struck me that, because the feed includes the publication date, it would be easy to do a Google/Hans Rosling-style chart with it. My next step would be to grab multiple feeds and use some of Adam's techniques on the descriptions/content of the feeds.
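As a rough illustration of the date idea, the RFC 822 date strings in an RSS feed's `pubDate` elements can be reduced to years and tabulated. This is a sketch only: the sample dates below stand in for values the script above doesn't yet collect.

```r
# Sketch: pubdates stands in for what something like
# xpathSApply(xmlRoot(doc), "//item/pubDate", xmlValue) would return
pubdates <- c("Tue, 20 Mar 2012 15:28:00 +0000",
              "Mon, 07 Nov 2011 09:00:00 +0000",
              "Fri, 13 Jan 2012 12:30:00 +0000")

# pull the first four-digit run (the year) from each RFC 822 date string
years <- sub("^.*?([0-9]{4}).*$", "\\1", pubdates, perl = TRUE)

table(years)   # counts of publications per year
```

From there it would just be a matter of cross-tabulating year against tag to get something chartable over time.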


I had been interested in how to grab RSS and pump it into R, and 'interesting things we can do with the CETIS publications RSS feed' had been a bit of an afterthought. Martin brought up the idea of using the feed to drive a word cloud (see comments). I stole the code from the comment and changed my code slightly so that I was grabbing the publication descriptions rather than the tags used. This is what it came up with (click to enlarge):
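For anyone wanting to make the same switch, the change amounts to pulling each item's `<description>` instead of the `<category>` elements. The commented line below assumes the parsed `doc` from the earlier script; the base-R regex version after it runs on an inline sample purely for illustration.

```r
# With the XML package and the parsed feed from earlier, it would be:
# descs <- xpathSApply(xmlRoot(doc), "//item/description", xmlValue)

# The same idea on a tiny inline sample, using base-R regex for illustration:
rss <- '<item><description>text mining weak signals</description></item>
<item><description>RSS feeds in R</description></item>'

m <- regmatches(rss, gregexpr("<description>(.*?)</description>", rss,
                              perl = TRUE))[[1]]
descs <- gsub("</?description>", "", m)   # strip the surrounding tags
descs
```

The resulting character vector is what gets fed into the word cloud snippet in place of the tags.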



Martin Hawksey · March 20, 2012 at 3:28 pm

and again you demonstrate that you know far more about R than I do 😉 Here’s a snippet I’ve ‘borrowed’ from elsewhere to generate a wordcloud (in this scenario it’s probably not the best graph type, which is probably why you haven’t gone down this route)

# this bit from http://onertipaday.blogspot.com/2011/07/word-cloud-in-r.html
# note if you are pulling in multiple columns you may need to change which one
# in the dataset is selected e.g. dataset[,2] etc
library(tm)
library(wordcloud)
library(RColorBrewer)
ap.corpus <- Corpus(DataframeSource(data.frame(as.character(cats[,1]))))
#ap.corpus <- tm_map(ap.corpus, removePunctuation)
#ap.corpus <- tm_map(ap.corpus, tolower)
#ap.corpus <- tm_map(ap.corpus, function(x) removeWords(x, stopwords("english")))
# additional stopwords can be used as shown below
ap.corpus <- tm_map(ap.corpus, function(x) removeWords(x, c("and")))
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m), decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v), freq=ap.v)
pal2 <- brewer.pal(8, "Dark2")
png("cetis-pub.png", width=1280, height=800)
wordcloud(ap.d$word, ap.d$freq, scale=c(8,.2), min.freq=3,
max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)
dev.off()  # close the png device so the file is written


dms2ect · March 20, 2012 at 3:44 pm

Thanks Martin,

That's useful (*copy and pastes*)!

I hadn't really thought about a route, except to cut my R code down into chunks in separate scripts so I can mix and match them. It was getting a chunk that could pull stuff in via RSS that I was after, so I could do extremely quick (and extremely dirty) text mining on other sources. The usefulness of the data we were getting out of the publications site was an afterthought.

It does mean that it shouldn't take more than two seconds to add your code as a separate script and have a word cloud generated from the RSS descriptions… I'll have a go..

Sheila MacNeill · March 21, 2012 at 11:20 am

Hi David

Great stuff!

