Today I took part in an introduction to R workshop at The University of Manchester. R is a software environment for statistics, and while it does all sorts of interesting things that are beyond my ability, one thing I can grasp and enjoy is exploring the packages available for R. These packages extend R's capabilities and let you do all sorts of cool things in a couple of lines of code.
The target I set myself was to use JISC CETIS Project Directory data to find a way of visualising the standards used in JISC-funded projects and programmes over time. I found a Google Visualisation package, and using it I was surprised at how easy it was to generate an output; the hardest part was manipulating the data (and thinking about how to structure it). Although my output from the day is incomplete, I thought I'd write up my experience while it is fresh in my mind.
First I needed a dataset of projects, start dates, standards and programmes. I got the results in CSV format by using the sparqlproxy web service that I use in this tutorial, and stole and edited a query from Martin.
SPARQL:
PREFIX rdfs:
PREFIX jisc:
PREFIX doap:
PREFIX prod:
SELECT DISTINCT ?projectID ?Project ?Programme ?Strand ?Standards ?Comments ?StartDate ?EndDate
WHERE {
?projectID a doap:Project .
?projectID prod:programme ?Programme .
?projectID jisc:start-date ?StartDate .
?projectID jisc:end-date ?EndDate .
OPTIONAL { ?projectID prod:strand ?Strand } .
# FILTER regex(?Strand, "^open education", "i") .
?projectID jisc:short-name ?Project .
?techRelation doap:Project ?projectID .
?techRelation prod:technology ?TechnologyID .
FILTER regex(str(?TechnologyID), "^http://prod.cetis.ac.uk/standard/") .
?TechnologyID rdfs:label ?Standards .
OPTIONAL { ?techRelation prod:comment ?Comments } .
}
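The CSV that sparqlproxy returns can then be pulled into R with `read.csv`. A minimal sketch, using a couple of made-up inline rows in place of the real results file (in practice the argument would be the sparqlproxy results URL or a downloaded CSV):

```r
# Sketch: load the sparqlproxy CSV output into a data frame.
# The inline sample below is illustrative, not real PROD data.
csv_text <- "Project,Standards,StartDate
ProjectA,XCRI,2008-01-01
ProjectB,IMS CP,2008-03-01"

prod_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)
nrow(prod_csv)
```

The column names simply mirror the SELECT variables in the query above, so whatever the query returns ends up as one data-frame column per variable.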
From this I created a pivot table of all standards and how often each appeared in projects and programmes for each year (using the project start date). After importing this into R, it took two lines to grab the Google Visualisation package and plot this as a Google Visualisation chart.
library(googleVis)
M <- gvisMotionChart(data = prod_csv, idvar = "Standards", timevar = "Year", chartid = "Standards")
This gives you the 'Hans Rosling'-style motion chart. I can't get it to embed in my WordPress blog, but you can click the diagram to view the interactive version. The higher up a standard is, the more projects it appears in; the further across it goes, the more programmes it spans.
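For reference, the pivot step (counting how often each standard appears per year) could also be done in R itself rather than in a spreadsheet first. A rough sketch, assuming a data frame with `Standards` and `StartDate` columns like the query output, with toy data standing in for the real PROD results:

```r
# Toy data standing in for the PROD query results
prod_csv <- data.frame(
  Standards = c("XCRI", "XCRI", "IMS CP", "XCRI"),
  StartDate = c("2008-05-01", "2008-09-01", "2007-02-01", "2007-06-01"),
  stringsAsFactors = FALSE
)

# Derive the year from the start date, then count standard/year pairs
prod_csv$Year <- substr(prod_csv$StartDate, 1, 4)
counts <- as.data.frame(table(prod_csv$Standards, prod_csv$Year))
names(counts) <- c("Standards", "Year", "Projects")
counts
```

`table()` over two columns gives the cross-tabulation (including zero cells), and `as.data.frame()` flattens it back into the long Standards/Year/count shape that `gvisMotionChart` expects.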
Some things it made me think about:
- Data from PROD is inconsistent: standards can be spelt differently, and some programmes/projects might have had more time spent on inputting related standards than others.
- How useful is it?
- Do we need all that data?
This was extremely easy to do, but is it worth doing? It has value for me because it's made me think about the way JISC CETIS staff use PROD and the sort of data we input. Would it be of value to anybody else? It was interesting, though, to see the high number of projects across three programmes that involved XCRI in 2008.
There are a lot of standards represented in the visualisation. Do we need them all? Can we concentrate on subsets of this data?
8 Comments
Sheila MacNeill · February 2, 2012 at 9:44 am
Hi David
This is great – and you raise some very interesting issues about our use and collection of data. I've found that there are lots of standards, and actually more different types of technologies, in use across programmes. However, there are very low instances of lots of them, which makes visualisation and making sense of things quite challenging. So, as we discussed yesterday, maybe what we need to do is a further refinement of the most popular ones, and then have maybe just a simple list of the other instances as an appendix. Lots to think about!
I'm going to try to do this for a specific programme and see what that looks like.
Martin Hawksey · February 2, 2012 at 10:18 am
Hi David,
You make this look stunningly simple. I'm just discovering the tribulations of R myself and, like you, still find data manipulation a bit of a head scratcher. For example, I would never have known that as.data.frame(table(dataset$acolumnname)) would look at acolumnname in a dataset and create a frequency table!
Anyway, great work, and I look forward to learning more about R from both of you.
Martin
david · February 2, 2012 at 10:34 am
Hi Martin,
I have been finding data manipulation hard (and sometimes doing it in a Google spreadsheet first). There are quite a few things I wouldn't have worked out on my own. I find R's huge community a real help for things like this.
Martin Hawksey · February 2, 2012 at 11:30 am
PS I'm quickly discovering with R that 'there is a package for that' – in this case you might be able to digest the SPARQL query straight into R: http://cran.r-project.org/web/packages/SPARQL/SPARQL.pdf
dms2ect · February 2, 2012 at 11:32 am
Fab! Helps me skip a few steps.
OER Visualisation Project: Exploring automated reporting using linked data and R/Sweave/R2HTML [day 36] – MASHe · February 7, 2012 at 9:32 pm
[…] Already CETIS’s David Sherlock has used R to produce a Google Visualisation of Standards used in JISC programmes and projects over time and CETIS’s Adam Cooper has used R for Text Mining Weak Signals, so there is some in-house skills […]
OER Visualisation Project: Exploring automated reporting using linked data and R/Sweave/R2HTML [day 36] – MASHe · February 7, 2012 at 9:32 pm
[…] design of the software environment makes it easy to add functionality through existing packages (as I pointed out to David there is a SPARQL package for R which means he could theoretically consume linked data directly from PROD); andR has a number of […]
OER Visualisation Project: Fin [day 40.5] – MASHe · February 12, 2012 at 12:50 am
[…] CETIS staff are already using some of documented processes to generate there own visualisations (David’s post | Sheila’s post). Visualisations that were produced include: OER Phase 1 and 2 maps [day 20], […]