Power Tools for Aspiring Data Journalists: Funnel Plots in R

Picking up on Paul Bradshaw’s post A quick exercise for aspiring data journalists which hints at how you can use Google Spreadsheets to grab – and explore – a mortality dataset highlighted by Ben Goldacre in DIY statistical analysis: experience the thrill of touching real data, I thought I’d describe a quick way of analysing the data using R, a very powerful statistical programming environment that should probably be part of your toolkit if you ever want to get round to doing some serious stats, and have a go at reproducing the analysis using a bit of judicious websearching and some cut-and-paste action…

R is an open-source, cross-platform environment that allows you to do programming like things with stats, as well as producing a wide range of graphical statistics (stats visualisations) as if by magic. (Which is to say, it can be terrifying to try to get your head round… but once you’ve grasped a few key concepts, it becomes a really powerful tool… At least, that’s what I’m hoping as I struggle to learn how to use it myself!)

I’ve been using R-Studio to work with R, a) because it’s free and works cross-platform, b) it can be run as a service and accessed via the web (though I haven’t tried that yet; the hosted option still hasn’t appeared yet, either…), and c) it offers a structured environment for managing R projects.

So, to get started. Paul describes a dataset posted as an HTML table by Ben Goldacre that is used to generate the dots on this graph:

The lines come from a probabilistic model that helps us see the likely spread of death rates given a particular population size.

If we want to do stats on the data, then we could, as Paul suggests, pull the data into a spreadsheet and then work from there… Or, we could pull it directly into R, at which point all manner of voodoo stats capabilities become available to us.

As with the =importHTML formula in Google spreadsheets, R has a way of scraping data from an HTML table anywhere on the public web:

#First, we need to load in the XML library that contains the scraper function
library(XML)
#Scrape the table
cancerdata=data.frame( readHTMLTable( 'http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis', which=1, header=c('Area','Rate','Population','Number')))

The format is simple: readHTMLTable(url,which=TABLENUMBER) (TABLENUMBER is used to extract the N’th table in the page.) The header part labels the columns (the data pulled in from the HTML table itself contains all sorts of clutter).

We can inspect the data we’ve imported as follows:

#Look at the whole table
cancerdata
#Look at the column headers
names(cancerdata)
#Look at the first 10 rows
head(cancerdata)
#Look at the last 10 rows
tail(cancerdata)
#What sort of datatype is in the Number column?
class(cancerdata$Number)

The last line – class(cancerdata$Number) – identifies the data as type ‘factor’. In order to do stats and plot graphs, we need the Number, Rate and Population columns to contain actual numbers… (Factors organise data according to categories; when the table is loaded in, the data is loaded in as strings of characters; rather than seeing each number as a number, it’s identified as a category.)

#Convert the numerical columns to a numeric datatype
cancerdata$Rate=as.numeric(levels(cancerdata$Rate)[as.integer(cancerdata$Rate)])
cancerdata$Population=as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)])
cancerdata$Number=as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)])

#Just check it worked…
class(cancerdata$Number)
head(cancerdata)

We can now plot the data:

#Plot the Number of deaths by the Population
plot(Number ~ Population,data=cancerdata)

If we want to, we can add a title:
#Add a title to the plot
plot(Number ~ Population,data=cancerdata, main='Bowel Cancer Occurrence by Population')

We can also tweak the axis labels:

plot(Number ~ Population,data=cancerdata, main='Bowel Cancer Occurrence by Population',ylab='Number of deaths')

The plot command is great for generating quick charts. If we want a bit more control over the charts we produce, the ggplot2 library is the way to go. (ggpplot2 isn't part of the standard R bundle, so you'll need to install the package yourself if you haven't already installed it. In RStudio, find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its dependencies...):

require(ggplot2)
ggplot(cancerdata)+geom_point(aes(x=Population,y=Number))+opts(title='Bowel Cancer Data')+ylab('Number of Deaths')

Doing a bit of searching for the "funnel plot" chart type used to display the ata in Goldacre's article, I came across a post on Cross Validated, the Stack Overflow/Statck Exchange site dedicated to statistics related Q&A: How to draw funnel plot using ggplot2 in R?

The meta-analysis answer seemed to produce the similar chart type, so I had a go at cribbing the code... This is a dangerous thing to do, and I can't guarantee that the analysis is the same type of analysis as the one Goldacre refers to... but what I'm trying to do is show (quickly) that R provides a very powerful stats analysis environment and could probably do the sort of analysis you want in the hands of someone who knows how to drive it, and also knows what stats methods can be appropriately applied for any given data set...

Anyway - here's something resembling the Goldacre plot, using the cribbed code which has confidence limits at the 95% and 99.9% levels. Note that I needed to do a couple of things:

1) work out what values to use where! I did this by looking at the ggplot code to see what was plotted. p was on the y-axis and should be used to present the death rate. The data provides this as a rate per 100,000, so we need to divide by 100, 000 to make it a rate in the range 0..1. The x-axis is the population.

#TH: funnel plot code from:
#TH: http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
#TH: Use our cancerdata
number=cancerdata$Population
#TH: The rate is given as a 'per 100,000' value, so normalise it
p=cancerdata$Rate/100000

p.se <- sqrt((p*(1-p)) / (number))
df <- data.frame(p, number, p.se)

## common effect (fixed effect model)
p.fem <- weighted.mean(p, 1/p.se^2)

## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
#TH: I'm going to alter the spacing of the samples used to generate the curves
number.seq <- seq(1000, max(number), 1000)
number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)

## draw plot
#TH: note that we need to tweak the limits of the y-axis
fp <- ggplot(aes(x = number, y = p), data = df) +
geom_point(shape = 1) +
geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +
geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +
geom_line(aes(x = number.seq, y = number.ll999, linetype = 2), data = dfCI) +
geom_line(aes(x = number.seq, y = number.ul999, linetype = 2), data = dfCI) +
geom_hline(aes(yintercept = p.fem), data = dfCI) +
scale_y_continuous(limits = c(0,0.0004)) +
xlab("number") + ylab("p") + theme_bw()

fp

As I said above, it can be quite dangerous just pinching other folks' stats code if you aren't a statistician and don't really know whether you have actually replicated someone else's analysis or done something completely different... (this is a situation I often find myself in!); which is why I think we need to encourage folk who release statistical reports to not only release their data, but also show their working, including the code they used to generate any summary tables or charts that appear in those reports.

In addition, it's worth noting that cribbing other folk's code and analyses and applying it to your own data may lead to a nonsense result because some stats analyses only work if the data has the right sort of distribution...So be aware of that, always post your own working somewhere, and if someone then points out that it's nonsense, you'll hopefully be able to learn from it...

Given those caveats, what I hope to have done is raise awareness of what R can be used to do (including pulling data into a stats computing environment via an HTML table screenscrape) and also produced some sort of recipe we could take to a statistician to say: is this the sort of thing Ben Goldacre was talking about? And if not, why not?

[If I've made any huge - or even minor - blunders in the above, please let me know... There's always a risk in cutting and pasting things that look like they produce the sort of thing you're interested in, but may actually be doing something completely different!]


Active Lobbying Through Meetings with UK Government Ministers

In a move that seemed to upset collectors of UK ministerial meeting data, @whoslobbying, on grounds of wasted effort, the Guardian datastore published a spreadsheet last night containing data relating to ministerial meetings between May 2010 and March 2011.

(The first release of the spreadsheet actually omitted the column containing who the meeting was with, but that seems to be fixed now… There are, however, still plenty of character encoding issues (apostrophes, accented characters, some sort of em-dash, etc) that might cripple some plug and play tools.)

Looking over the data, we can use it as the basis for a network diagram with actors (Ministers and lobbiests) with edges representing meetings between Minsiters and lobbiests. There is one slight complication in that where there is a meeting between a Minister and several lobbiests, we ideally need to separate out the separate lobbiests into their own nodes.

UK gov meetings spreadsheet

This probably provides an ideal opportunity to have a play with the Stanford Data Wrangler and try forcing these separate lobbiests onto separate rows, but I didn’t allow myself much time for the tinkering (and the requisite learning!), so I resorted to Python script to read in the data file and split out the different lobbiests. (I also did an iterative step, cleaning the downloaded CSV file in a text editor by replacing nasty characters that caused the script to choke.) You can find the script here (note that it makes use of the networkx network analysis library, which you’ll need to install if you want to run the script.)

The script generates a directed graph with links from Ministers to lobbiests and dumps it to a GraphML file (available here) that can be loaded directly into Gephi. Here’s a view – using Gephi – of the hearth of the network. If we filter the graph to show nodes that met with at least five different Ministers…

Gephi - k-core filter

we can get a view into the heart of the UK lobbying netwrok:

Active Lobbiests

I sized the lobbiest nodes according to eigenvector centrality, which gives an indication of well connected they are in the network.

One of the nice things about Gephi is that it allows for interactive exploration of a graph, For example, I can hover over a lobbiest node – Barclays in this case – to see which Ministers were met:

Bankers connect...

Alternatively, we can see who of the well connected met with the Minister for Welfare Reform:

Welfare meetings...

Looking over the data, we also see how some Ministers are inconsistently referenced within the original dataset:

Multiple mentions

Note that the layout algorithm is such that the different representations of the same name are likely to meet similar lobbiests, which will end up placing the node in a similar location under the force directed layout I used. Which is to say – we may be able to use visual tools to help us identify fractured representations of the same individual. (Note that multiple meetings between the same parties can be visualised using the thickness of the edges, which are weighted according to the number of times the edge is described in the GraphML file…)

Unifying the different representations of the same indivudal is something that Google Refine could help us tidy up with its various clustering tools, although it would be nice if the Datastore folk addressed this at source (or at least, as part of an ongoing data quality enhancement process…;-)

I guess we could also trying reconciling company names against universal company identifiers, for example by using Google Refine’s reconciliation service and the Open Corporates database? Hmmm, which makes me wonder: do MySociety, or Public Whip, offer an MP or Ministerial position reconciliation service that works with Google Refine?

A couple of things I haven’t done: represented the department (which could be done via a node attribute, maybe, at least for the Ministers); represented actual meetings, and what I guess we might term co-lobbying behaviour, where several organisations are in the same meeting.


Next Page →