Problems with quick Wikipedia heatmaps of birth locations

This doesn’t tell you how many people of that profession there are. It tells you how many people of that profession are in Wikipedia with structured data on both the person’s birthplace and profession. It might tell us as much about how Wikipedia is being used as about how many people from that profession were actually born there. Still, I thought it was interesting!

If you are interested in the process, I wrote some notes while I was playing. I was trying to do something slightly different, but you can follow the method. My script is at the bottom if you want to reproduce the maps. At the moment they use a shapefile to put pins in the map.

There are some problems, which I’ll try to point out as I go along…

Birthplaces of Footballers who have played in the Premier League

The first problem is that loads of people are missing: according to the map, no person from France has ever played in the league, for example. This is either because not everyone who has played in the Premier League is listed under the subject of Premier League players, or because birth information is not available. I suspect the former, as articles on Premier League players seem pretty fleshed out. Also, some players were apparently born in the Pacific Ocean. This is either because I picked up the birthplace incorrectly or because the lat/long data got messed up.
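One way to chase down the Pacific Ocean points is to parse the coordinate strings explicitly and flag anything that cannot be a real location. The following is a minimal sketch in base R, assuming the points arrive as "lat long" strings separated by whitespace, as they do in the script at the bottom. The helper name `parse_points` and the example values are made up for illustration.

```r
# Hypothetical helper (not part of the original script): split "lat long"
# strings into numeric columns and flag rows that cannot be real coordinates.
parse_points <- function(gps) {
  parts <- strsplit(trimws(as.character(gps)), "[[:space:]]+")
  # Take the first two tokens; a missing token becomes NA
  lat  <- suppressWarnings(as.numeric(vapply(parts, function(p) p[1], character(1))))
  long <- suppressWarnings(as.numeric(vapply(parts, function(p) p[2], character(1))))
  # Valid latitude is within [-90, 90], valid longitude within [-180, 180]
  ok <- !is.na(lat) & !is.na(long) & abs(lat) <= 90 & abs(long) <= 180
  data.frame(lat = lat, long = long, ok = ok)
}

# Made-up example values, for illustration only
example <- c("53.48 -2.24", "35.68 139.69", "139.69 35.68", "not a point")
checked <- parse_points(example)
print(checked)
```

A range check like this only catches swapped values when the number in the latitude column is larger than 90, so it narrows the problem down rather than proving the remaining points are right.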

[Map: birthplaces of Premier League footballers]

Birthplace of all people in Wikipedia in category ‘wrestler’

In keeping with the spirit of some of the previous data work I’ve done, I plotted the birthplaces of pro wrestlers in Wikipedia. The same rules apply as for football: the wrestler must have a page in Wikipedia with a city as a place of birth, and I must be able to look up the lat/long for that birthplace. Again there are some people in the Pacific Ocean, which makes me think some of my lat/long data is out or something.

So does this tell us that people from the east of the U.S. and Japan are more likely to become pro wrestlers? Or that people love creating fleshed-out articles for wrestlers from these areas, because people from these areas write more articles and love their local heroes?

[Map: birthplaces of wrestlers]

Birthplace of all people in Wikipedia in category ‘Ping Pong Player’

In an attempt to see if I could plot birthplaces of people who were not from the East Coast or Japan, I went for another sport, one which I thought might be big in Korea and Eastern Europe. Apparently it was:

[Map: birthplaces of ping pong players]

So before I went any further, I think I discovered that the data in Wikipedia is too inconsistent to really get the big picture. It does seem to kind of tell you where the hot spots are, but only the really well looked-after pages seem to have the data we need. Still, I could reuse the process with other data. Not all is lost.
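For a rough, map-free look at where the hot spots are, the birthplaces can be counted per grid cell. This is only a sketch in base R, assuming numeric lat/long columns. The function name `count_cells`, the 5-degree cell size and the example coordinates are all my own assumptions, not something from the original process.

```r
# Hypothetical sketch: count birthplaces per grid cell (5 degrees by default)
# and list the busiest cells first.
count_cells <- function(lat, long, size = 5) {
  # Label each point by the south-west corner of the cell it falls in
  cell <- paste(floor(lat / size) * size, floor(long / size) * size)
  counts <- sort(table(cell), decreasing = TRUE)
  data.frame(cell = names(counts), n = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Made-up coordinates, for illustration only
pts <- data.frame(lat  = c(35.7, 35.0, 37.5, 51.5),
                  long = c(139.7, 135.8, 127.0, -0.1))
top <- count_cells(pts$lat, pts$long)
print(top)
```

Like the maps, the counts only reflect pages that carry usable birthplace data, so a busy cell may say as much about well-maintained articles as about where people were born.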

My R script, in case it’s useful to anybody:

setwd("< directory with all your files>")
library(rgdal)         # for readOGR(...)
library(SPARQL)        # for SPARQL(...)
library(ggplot2)
library(RColorBrewer)  # for brewer.pal(...)

# World borders shapefile, flattened to a data frame that ggplot2 can draw
worldMap <- readOGR(dsn="shape",layer="TM_WORLD_BORDERS_SIMPL-0.3")
map.df   <- fortify(worldMap)


endpoint <- "http://dbpedia.org/sparql"
options <- NULL

query = "

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX category: <http://dbpedia.org/resource/Category:>
PREFIX grs: <http://www.georss.org/georss/>


SELECT ?person ?name ?born ?gps WHERE {
?person rdf:type dbpedia-owl:TableTennisPlayer .
?person rdfs:label ?name .
?person dbpedia-owl:birthPlace ?born .
?born rdf:type dbpedia-owl:City .
?born grs:point ?gps .

}
"

qd <- SPARQL(endpoint,query)
df <- qd$results

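# Each point is a "lat long" string: split it and convert both parts to numbers
coords <- do.call('rbind', strsplit(as.character(df$gps),' ',fixed=TRUE))
read <- data.frame(lat=as.numeric(coords[,1]), long=as.numeric(coords[,2]))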

# Attempt 1: did not work for me as originally written.
# stat_density2d() needs numeric lat/long columns.
p <- ggplot(read, aes(x=long, y=lat)) + 
  stat_density2d(aes(fill = ..level..), geom="polygon")+
  geom_point(colour="red")+
  geom_path(data=map.df,aes(x=long, y=lat,group=group), colour="grey50")+
  scale_fill_gradientn(colours=rev(brewer.pal(7,"Spectral")))

p