Getting information out of Google Scholar

The Google Scholar site doesn’t have an API, which is a shame and has led me to park one of my current projects for now. I still spent half a day working out the best way to get information out of it, so I thought I would write up what I found in case it is useful to anyone else. The project was grabbing citations of papers and, if anybody is interested, it is parked because not all papers have a Cluster ID, which I naively assumed they would. It doesn’t seem worth going back to find a workaround: I’ve been down this route before, scraping things from websites only to find that the scraper breaks after a UI tweak.

For those who still want a go at poking Google Scholar, I found that a Python script called scholar.py, written by Christian Kreibich, worked very well; it can be accessed from GitHub here. I also found this forked repository by Korbinian Riedhammer, which adds the option to grab citations based on the Cluster ID.

It is easy to get started. On Mac OS X I went with:

  1. sudo easy_install BeautifulSoup
  2. Download the script
  3. Try something like: scholar.py -c 1 --author "D Sherlock" --phrase "tools for online Habits"
  4. Or, for citations: scholar.py --cites --cluster-id 13746912682491308133
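If you would rather drive the script from Python than type the commands by hand, something like the following might do. It is only a rough sketch: build_scholar_args and run_scholar are names I made up for illustration, the flags are just the ones used in the steps above (--cites and --cluster-id come from the fork), and I am assuming scholar.py sits in the current directory and that a "python" interpreter able to run it is on your path.

```python
import subprocess


def build_scholar_args(count=None, author=None, phrase=None,
                       cites=False, cluster_id=None):
    """Build the option list for scholar.py (hypothetical helper)."""
    args = []
    if count is not None:
        args += ["-c", str(count)]
    if author:
        args += ["--author", author]
    if phrase:
        args += ["--phrase", phrase]
    if cites:
        # --cites and --cluster-id are the options added by the fork
        args.append("--cites")
    if cluster_id is not None:
        args += ["--cluster-id", str(cluster_id)]
    return args


def run_scholar(args, script="scholar.py", python="python"):
    """Run the script and return its printed output.

    The script path and interpreter name are assumptions; adjust both
    to match where you saved scholar.py and which Python it needs.
    """
    result = subprocess.run([python, script] + args,
                            capture_output=True, text=True)
    return result.stdout


# Only build and show the option lists here; nothing is sent to Google Scholar.
print(build_scholar_args(count=1, author="D Sherlock",
                         phrase="tools for online Habits"))
print(build_scholar_args(cites=True, cluster_id=13746912682491308133))
```

Passing the options as a list means the quoted author and phrase values do not need any shell escaping. Calling run_scholar(build_scholar_args(...)) would actually hit the site, so expect it to break whenever the page layout changes.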

 
