The Google Scholar site doesn’t have an API, which is a shame and has forced me to park one of my current projects on the sideline for now. I still spent half a day working out the best way to get information out of it, so I thought I would write up what I found in case it is useful to anyone else. The project was about grabbing the citations of papers and, if anybody is interested, it is parked because not all papers have a Cluster ID, which I naively assumed they would. It doesn’t seem worth going back to find a workaround: I’ve been down this route before, scraping things from websites only to find that the scraper breaks after a UI tweak.
For those who still want to have a go at poking Google Scholar, I found that a Python script called scholar.py, written by Christian Kreibich, worked very well; it can be accessed from GitHub here. I also found that this forked repository by Korbinian Riedhammer adds the option to grab citations based on the Cluster ID.
It is easy to get started. On Mac OS X I went with:
- sudo easy_install BeautifulSoup
- Download the script
- Try something like: scholar.py -c 1 --author "D Sherlock" --phrase "tools for online Habits"
- Or scholar.py --cites --cluster-id 13746912682491308133 for citations
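
If you would rather drive the script from your own Python code than from the shell, the two commands above could be wrapped along these lines. This is only a rough sketch: the helper names build_scholar_command and run_scholar are made up for illustration, the flags are just the ones used in the commands above (with --cites assumed to exist only in the forked version), and it assumes scholar.py sits in the current directory and runs under the same interpreter as your own code.

```python
import subprocess
import sys


def build_scholar_command(script="scholar.py", count=None, author=None,
                          phrase=None, cluster_id=None, cites=False):
    """Return the argument list for one scholar.py invocation.

    Hypothetical helper: it only knows the flags used in this post.
    """
    # Assumption: the script runs under the current interpreter.
    cmd = [sys.executable, script]
    if count is not None:
        cmd += ["-c", str(count)]
    if author:
        cmd += ["--author", author]
    if phrase:
        cmd += ["--phrase", phrase]
    if cites:
        # Assumption: --cites is only available in the forked version.
        cmd.append("--cites")
    if cluster_id is not None:
        cmd += ["--cluster-id", str(cluster_id)]
    return cmd


def run_scholar(cmd):
    """Run the command and return whatever the script prints to stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Build (but do not run) the two example commands from the list above.
    search = build_scholar_command(count=1, author="D Sherlock",
                                   phrase="tools for online Habits")
    citations = build_scholar_command(cites=True,
                                      cluster_id=13746912682491308133)
    print(" ".join(search))
    print(" ".join(citations))
```

Passing the arguments as a list means the quoted author and phrase survive intact without any shell escaping. Calling run_scholar on either command hits Google Scholar for real, so the same warning applies: it may stop working after a UI tweak.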