A Bit of NewsJam MoJo – SocialGeo Twitter Map

At the Mozilla/Knight/Guardian/BBC NewsJam #mojo event on Saturday (review by Martin Belam; see also Martin’s related review of the news:rewired event the day before), I was part of a small team that looked at how we might start validating folk tweeting about a particular news event. Here’s a brief write up of our design…

Try it here: SocioGeo map

When exploring twitter based activity around an event, Guardian journalist Paul Lewis raised the question “how does a journalist know which sources are to be trusted?” (a source verification problem), identifying this as one area where tool innovation may be able to help the journalist assessing which twitter sources around an event may be worth following or contacting directly.

The SocioGeo map begins to address this concern, and represents an initial attempt at mapping the social and geographical spread of tweets around an event in near real time. In its first incarnation, SocioGeoMap is intended to support visual analysis of the social and spatial distribution of people tweeting about an event in order to identify the extent to which people tweeting about an event are co-located with it/and or each other (initially, based on a sampling of geocoded tweets, although this might extend to reconciliation of identities from Twitter into location based checkin services such as Foursquare, or derived location services such as uploaded geocoded photos to Flickr), and the extent to which they know each other (initially, this is limited to whether or not they are following each other on Twitter, but could be extended to other social networks).

In his presentation at the #mojo London event, Guardian interactive designer Alastair Dant suggested a fruitful approach for hacks/hackers communities might be to identify publication “archetypes” such as maps and timelines, as well as “standard content types” such as map+timeline combinations. To these, we might add the “social network” archetype, and geo-social maps (locating nodes in space and drawing social connections between them), socio-temporal maps (showing how social connections ebb and flow over time, or how messages are passed between actors) or geo-socio-temporal maps (where we plot how messages move across spatially and socially distributed nodes over time.

If the simple geo-social map depiction demonstrated above does turn out to be useful, informative or instructive, the next phase might be to start using mathematical analyses of the geographical concentration of people tweeting about an event, as well as social network analysis metrics to start assigning certainty factors to individuals relating to the degree of confidence we might have that they were eyewitness to an event, embedded within it/central to it or a secondary/amplifying source only, and so on. A wider social network analysis (eg of the social networks of people associated with an event) might also provide information related to the authority/trustedness/reputations of the source in other contexts. These certainty factors might then be able to rank tweets associated with an event, or identify sources who might be worth contacting directly, or ignoring altogether. (That is, the analyses might be able to contribute to automatic filter configuration).

SocioGeoMap is based on several observations:

Taking the example of football match, we might imagine several different co-existing states:

In the case of a bomb going off in a busy public space, we might imagine:

SocioGeoMap helps visualise the extent to which twitter activity around an event is either distributed or localised in both social/social network and geographical spaces.

In its current form, SocioGeoMap is built from a couple of pre-existing services:

In its envisioned next generation form, SocioGeoMap will display people tweeting about a particular topic by location (i.e. on a map) and also draw connections between them to demonstrate the extent to which they are socially connected (on Twitter, at least).

SocioGeoMap as currently presented is based on particular, user submitted search queries that may also have a geographical scope. An extention of SocioGeoMap might be to create SocioGeoMap alerting dashboards around particular event types, using techniques similar to the techniques employed in many sentiment analysis tools, specifically the filtering of items through word lists containing terms that are meaningful in terms of sentiment. The twist in news terms is to identify meaningful terms that potentially relate to newsworthy exclamations (“Just heard a loud explosion”, “goal!!!!”, “feel an earthquake?” and so on), and rather than associating positive or negative sentiment around a term brand, trying to discover tweets associated with sentiments of shock or concern in a particular geographical location.

SocioGeoMap may also be used in associsation with other services that support the pre-qualification or pre-verification of individuals, or certainty measure estimates on their expertise or likelihood of being in a particular place at a particular time. So for example, in the first case we might imagine doing some prequalification work around people likely to attend a planned event, such as a demonstration, based on their public declarations (“Off to #bigDemo tomorrow”), or identify their remote support/interest in it (“gutted not to be going to #bigDemo tomorrow”). Another example might include looking for geolocated evidence that an individual is a frequenter of a particular space, for example through a geo-coded analysis of their personal twitter stream and potentially also at one remove, such as through a geocoded analysis of their friends’ profiles and tweetstream, and as a result derive a certainty measure about the association of an individual with a particular location; that is, we could start to assign certainty measure to the likelihood of their being an eyewitness to an event in a particular locale based on previous geo-excursions.


By: Tony Hirst (@psychemedia), Alex Gamela (@alexgamela), Misae Richwood (@minxymoggy)
Mozilla/Knight/Guardian/BBC News Jam, Kings Tower, London, May 28th, 2011 #mojo

Implementation notes:

The demo was built out of a couple of pre-existing tools/components: a geo-based Twitter search constructed using Yahoo Pipes (Discovering Co-location Communities – Twitter Maps of Tweets Near Wherever…); and a map of social network connections between folk recently using a particular search term or hashtag (Using Protovis to Visualise Connections Between People Tweeting a Particular Term). It is possible to grab a KML URL from the geotwitter pipe and feed it into a Google map that can be embedded in a page using an iframe. The social connections graph can also be embedded in an iframe. The SocialGeoMap page is a page that contains two iframes, one that loads the map, and a second that loads the social network graph. The same data pulled from the Yahoo geo-search pipe feeds both visualisations.

In many cases, several tweets may have the exact same geo-coordinates, which means they are overlaid on the map and difficult to see. to get around this, a certain amount of jitter is added to each latitude and longitude; because Yahoo Pipes doesn’t have a native random number generator, I use a tweet ID to generate a jitter offset using the following pipe:

This is called just before the output of the geotwitter search pipe:

Whilst this does mean that no points are plotted with their exact original co-ordinates, it does mean that we can separate out most of the markers corresponding to tweets with the same latitude and longitude and thus see them independently on the map at their approximate location.

A next step in development might to move away from using Yahoo pipes, (which incur a cacheing overhead) and use server side service. A quickstart solution to this might be to generate a Python equivalent of the current pipework using Greg Gaughan’s pipe2py compiler, that generates a Python code equivalent of a Yahoo pipe.


Google Correlate: What Search Terms Does Your Time Series Data Correlate With?

Just a few days over three years ago, I blogged about a site I’d put together to try to crowdsource observations about correlated searchtrends: TrendSpotting.

One thing that particularly interested me then, as it still does now, was the way that certain search trends they reveal rhythmic behaviour over the course of weeks, months or years.

At the start of this year, I revisited the topic with a post on Identifying Periodic Google Trends, Part 1: Autocorrelation (followd by Improving Autocorrelation Calculations on Google Trends Data).

Anyway today it seems that Google has cracked the scaling issues with discovering correlations between search trends (using North American search trend data), as well as opening up a service that will identify what search trends correlate most closely with your own uploaded time series data: Correlate (announcement: Mining patterns in search data with Google Correlate)

For the quick overview, check out the Google Correlate Comic.

So what’s on offer? First, enter a search term and see what it’s correlated with:

As well as the line chart, correlations can also be plotted as a scatterplot:

You can also run “spatial correlations”, though at the moment this appears to be limited to US states. (I *think* this works by looking for search terms that are popular in the requested areas and not popular in the other listed areas. To generalise this, I guess you need three things: the total list of areas that work for the spatial correlation query; the areas you want the search volume for the “to be discovered correlated phrase” to be high; the areas you want to the search volume for the “to be discovered correlated phrase” to be low?)

At this point it’s maybe worth remembering that correlation does not imply causation…

A couple of other interesting things to note: firstly, you can offset the data (so shift it a few weeks forwards or backwards in time, as you might do if you were looking for lead/lag behaviour); secondly, you can export/download the data.

You can also upload your own data to see what terms correlate with it:

(I wonder if they’ll start offering time series analysis features on uploaded, as well as other trend data, too? For example, frequency analysis or trend analysis? This is presumably going on in the background (though I haven’t read the white paper [PDF] yet…)

As if that’s not enough, you can also draw a curve/trendline and then see what correlates with it (so this a weak alternative to uploading your own data, right? Just draw something that looks like it… (h/t to Mike Ellis for first point this out to me).

I’m not convinced that search trends map literally onto the well known “hype cycle” curve, but I thought I’d try out a hype cycle reminiscent curve where the hype was a couple of years ago, and we’re now maybe seeing start to reach mainstream maturity, with maybe the first inklings of a plateau…

Hmmm… the pr0n industry is often identified as a predictor of certain sorts of technology adoption… maybe the 5ex searchers are too?! (Note that correlated hand-drawn charts are linkable).

So – that’s Google Correlate; nifty, eh?

PS Here’s another reason why I blog… my blog history helps me work out how far i the future I live;-) So currently between about three years in the future.. how about you?!;-)

PPS I can imagine Google’s ThinkInsights (insight marketing) loving the thought that folk are going to check out their time series data against Google Trends so the Goog can weave that into it’s offerings… A few additional thoughts leading on from that: 1) when will correlations start to appear in Google AdWords support tools to help you pick adwords based on your typical web traffic patterns or even sales patterns? 2) how far are we off seeing a Google Insights box to complement the Google Search Appliances, that will let you run correlations – as well as Google Prediction type services – onsite without feeling as if you have to upload your data to Google’s servers, and instead, becoming part of Google’s out-kit-in-your-racks offering; 3) when is Google going to start buying up companies like Prism and will it then maybe go after the likes of Experian and Dunnhumby to become a company that organises information about the world of people, as well as just the world’s information…?!)

PPPS Seems like as well as “traditional” link sharing offerings, you can share the link via your Google Reader account…

Interesting…


Whose Investor Relations Sites Do Thomson Reuters Host? A Form of URL Hacking…

A quick little infoskills demo using Google search…

I very rarely do more then skim the headline of posts in my feed from Techcrunch, but today I actually opened up a post about Amazon buying another imprint (Amazon Expands To Mysteries And Thrillers With Fifth Publishing Imprint, Thomas & Mercer). As with a lot of TechCrunch posts, it’s pretty much just a rebranding of a Press Release, though to their credit Techcrunch linked to the “original” source: www.businesswire.com/news/home/20110518005494/en/Amazon-Launches-Publishing-Imprint-Thomas-Mercer

…insofar as a wire service copy of a press release is an original source… This got me wondering whether the press release had also appeared on a more direct Amazon press release page…?

Googling for press release amazon on google.co.uk turned up a way in to Amazon’s UK media relations site: www.amazon.co.uk/gp/press/pr/20080710 (try hacking around the URL to find an amazon.com equivalent…) as well a page on the central US site: phx.corporate-ir.net/phoenix.zhtml?ID=1565581&c=176060&p=irol-newsArticle, a third party (?) hosted site with Amazon branding: phx.corporate-ir.net/phoenix.zhtml?p=irol-mediaHome&c=176060. (One of the oft-taught infoskills tips is not to necessarily trust a site where the domain in the URL doesn’t appear to fit…) Note: I also got redirected to the main page of the Amazon-looking-but-not-on-an-Amazon URL page through trying amazon.com/pr.

Seeing who hosts this site – that is, just trying phx.corporate-ir.net – we get a redirect to www.ccbn.com which immediately then hops to thomsonreuters.com/products_services/financial/financial_products/corporate_services/investor_relations/

So Thomson Reuters, maybe… (If we were doing a proper job, we’d be looking up internet domain registrations, but I’m a trusting sort.. If you’re interested, the service you need to use is called whois and then look for the registrant. For example, www.networksolutions.com/whois-search/corporate-ir.net;-) But who else do they run this service for?

If you Google inurl:phx.corporate-ir.net/phoenix.zhtml you can spot results from various sources. Looking through the URLs, many of the pages are hosted on (hosted) investor relations sites. The p= argument specifies the page, the c= argument looks like it might identify a company.

We can try blindly hacking company numbers to see what companies turn up, or we can be a little more structured, for example by finding the investor relations homepage value (p=irol-irhome looks a good bet) and Googling on inurl:http://phx.corporate-ir.net/phoenix.zhtml? inurl:p=irol-irhome site:phx.corporate-ir.net.

The results pages are then full of sites of investor relations hosted by Thomson Reuters for third parties.

(You might think that the use of site: is redundant given the first inurl: limit. I hit upon using it like that as a result of clicking on the “More results form this site” option from one of my initial search attempts… It seems to override the collapsing of multiple results on a single domain…)

If you want to tunnel down to press release/media relations sites, there seem to be a variety of landing pages (p=irol-news, p=irol-press), so instead we can fudge a bit and search for “press” in the page title: inurl:http://phx.corporate-ir.net/phoenix.zhtml? intitle:press site:phx.corporate-ir.net

If you live and work the web, being able to read URL, and use that knowledge to help you search URL, is a handy skill to have…


Data Driven Journalism – Survey

The notion of data driven journalism appears to have some sort of traction at the moment, not least as a recognised use context of some very powerful data handling tools, as Simon “Guardian Datastore” Rogers appearance at Google I/O suggests:


(Simon’s slot starts about 34:30 in, but there’s a good tutorial intro to Fusion Tables from the start…)

As I start to doodle ideas for an open online course on something along the lines of “visually, data” to run October-December, data journalism is going to provide one of the major scenarios for working through ideas. So I guess it’s in my interest to promote this European Journalism Centre: Survey on Data Journalism to try to find out what might actually be useful to journalists…;-)

[T]he survey Data-Driven Journalism – Your opinion aims to gather the opinion of journalists on the emerging practice of data-driven journalism and their training needs in this new field. The survey should take no more than 10 minutes to complete. The results will be publicly released and one of the entries will win a EUR 100 Amazon gift voucher

I think the EJC are looking to run a series of data-driven journalism training activities/workshops too, so it’s worth keeping an eye on the EJC site if #datajourn is your thing…

PS related: the first issue of Google’s “Think Quarterly” magazine was all about data: Think Data

PPS Data in journalism often gets conflated with data visualisation, but that’s only a part of it… Where the visulisation is the thing, then here’s a few things to think about…


Ben Fry interviewed at Where 2.0 2011


Plotting Tabular (CSV) Data and Algebraic Expressions On the Same Graph Using Gnuplot

A couple of the reasons why I’ve been making so much use of Formula 1 data for visualisations lately are that: a) the data is authentic, representing someone’s legitimate data but presented in a way that is outside my control (so I have to wrangle with scraping, representation and modeling issues); b) if I get anything wrong whilst I’m playing, it doesn’t matter… (Whereas if I plotted some university ranking tables and oopsed to show that all the students at X were unhappy, its research record and teaching quality were lousy, and it was going to be charging 9k fees on courses with a 70% likelihood of closing most of them, when in fact it was doing really well, I might get into trouble…;-)

I’ve also been using the data as a foil for finding tools and applications that I can use to create data visualisations that other people might be interested in trying out too. There is a work-related rationale here too: in October, I hope to run a “MOOC”, (all you need to know…) on visual analysis as a live development/production exercise for a new short course that will hopefully be released next year, and part of that will involve the use of various third party tools and applications for the hands-on activities.

One of the issues I’ve recently faced is how to plot a chart that combines tabulated data imported from a CSV file with a line graph plotted from an equation. My ideal tool would be a graphical environment that lets me import data and plot it, and then overlay a plot generated from a formula or equation of my own. Being able to apply a function to the tabulated data (for example, remove a value y = sin(x) from the tabular data would be ideal, but not essential.

In this post, I’ll describe one tool – Gnuplot – that meets at least the first requirement, and show how it can be used to plot some time series data from a CSV file overlaid with a decreasing linear function. (Which is to say, how to plot F1 laptime data against the fuel weight time penalty, (the amount of time that the weight of the fuel in the car slows the car down by… For more on this, see F1 2011 Turkey Race – Fuel Corrected Laptimes.)

I’ve been using Gnuplot off and on for a couple of decades(?!), though I’ve really fallen out of practice with it over the last ten years… (not doing research!;-)

The easiest way of using the tool is to launch it in the directory where your data files are stored. So for example, if the data file resides in the directory \User\tony\f1\data, I would launch my terminal, enter cd \User\tony\f1\data or the equivalent to move to that directory, and then start gnuplot there using the command gnuplot):

gnuplot> set term x11
Terminal type set to 'x11'
gnuplot> set xrange [1:58]
gnuplot> set xlabel "Lap"
gnuplot> set ylabel "Fuel Weight Time Penalty"
gnuplot> set datafile separator ","

For some reason, my version of Gnuplot (on a Mac), wouldn’t display any graphs till i set the output to use x11… The set xrange [1:58] command sets the range of the axis (there are 58 laps in a race, hence those settings.) The xlabel and ylabel settings are hopefully self-explanatory (they define axis labels). The set datafile separator "," command prepares Gnuplot to load in a file formatted as tabular data, one row per line, with commas separating the columns (I assume if you pass in something like this, “this, that”, the other, the “this,that” string is detected as a single column/cell value, and not as two columns with cell values “this and that”? I forget…)

The data file I have is not as clean as it might be. (If you want to play along, the data file is here). It’s arranged as follows:

Driver,Lap,Lap Time,Fuel Adjusted Laptime,Fuel and fastest lap adjusted laptime
25,1,106.951,102.334,9.73
25,2,99.264,94.728,2.124
...
25,55,94.979,94.574,1.97
25,56,95.083,94.759,2.155
20,1,103.531,98.914,6.959
20,2,97.370,92.834,0.879
...

That is, there is one row per driver lap. Each driver’s data is on a consecutive line, in increasing lap number, so driver 25 is on lines 1 to 56, driver 20′s data starts on line 26 and so on…

To plot from a data file, we use the command plot 'turlapTimeFuel.csv' (that is, plot ‘filename). To pull data from columns 3 and 5, we use the subcommand using 3 (x would count incrementally, so we get a plot of column 3 against increasing row number) from the command:
gnuplot> plot 'turlapTimeFuel.csv' using 3

To plot from just a range of numbers (e.g. rows 0 to 57 (the header row is ignored), against row number, we can use a subcommand if the form using y:
gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 3

To specify the x value (e.g. to plot column 3 as y against column 2 as x), we use a subcommand of the form using x:y:
gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 2:3

(The first column is column 1; I think the first row is row 0…)

So for example, we can plot Laptime against driver using plot ‘turlapTimeFuel.csv’ using 1:3, or Fuel Adjusted Laptime against driver using plot ‘turlapTimeFuel.csv’ using 1:4. The command plot ‘turlapTimeFuel.csv’ using 2:3 gives us a plot of each driver’s laptime against lap number.

But how do we plot the data for just a single driver? We saw how to plot against a consecutive range of row values (e.g. every::12:23 for rows 12 to 23), but
plotting the laptime data for each driver this way is really painful (we have to know which range of row numbers the data we want to plot are on). Instead we can filter out the data according to driver number (the column 1 values):
gnuplot> plot 'turlapTimeFuel.csv' using ($1==4 ? $2:1/0):3 with lines

How do we read this? The command is actually of the form using x:y, but we do a bit of work to choose a valid value of x. ($1==4 ? 2:1/0):3 says “if the value of column 1 ($1) equals 4, then (?) select column 2, otherwise/else (the first “:”), forget it (1/0 is one divided by zero, a number intensely disliked by gnuplot that says in this context, do nothing more with this row…). If the value of column 1 does equal 4, then we create a valid statement using 2:3, otherwise we ignore the row and the data in columns 2 and 3. The whole statement thus just plots the data for driver 4.

Rather than plot points, the with lines command will join consecutive points using a line, to produce a line chart:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines

We can add an informative label using the title subcommand:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)"

We can also plot two drivers’ times on the same chart using different lines:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)",'turlapTimeFuel.csv' using ($1==2 ? $2:1/0):3 with lines title "WEB "

We can also plot functions. In the following case, I plot the time penalty applied to a car for each lap on the basis of how much more fuel it is carrying at the start of the race compared to the end:
gnuplot> plot 90+(58-x)*2.7*0.03

We can now overlay the drivers’ times and the fuel penalty on the same chart:
gnuplot> set yrange [85:120]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)",'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):4 with lines title "VET Su | Su(11) Su(25) Hn(40) Hn(47)",90+(58-x)*2.7*0.03

It’s also possible to do sums on the data. If you read $4 as “use the value in column 4 of the current row”, you can start to guess at creating things like the following, which in the first part plots cells in the laptime column 3 modified by the fuel penalty. (I also plot the pre-calculated fuel adjusted laptime data from column 5 as a comparison. The only difference in values is the offset…
gnuplot> set yrange [-1:20]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines,'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):5 with lines


Doh! I should have changed the y-axis label…:-(

That is, plot ‘turlapTimeFuel.csv’ using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines says: for rows applying to driver 1 rows where the value in column 1, ($1), equals (==) the driver number (1), $1==1), use the value in column 2 ($2) as the x-value and for the y-value use the value in column 3 ($3) minus 95+(58-$2)*2.7*0.03). Note that the (58-$2) fragment subtracts the lap number (as contained in value in the column 2 ($2) cell) from the lap count to work out how many more laps worth of fuel the car is carrying in the current lap than at the end of the race.

So – that’s a quick tour of gnuplot, showing how it can be used to plot CSV data and an algebraic expression on the same graph, how to filter the data plotted from the CSV file using particular values in a specified column, and how to perform a mathematical operation on the data pulled in from the CSV file before plotting it (and without changing it in the original file).

Just in passing, if you need an online formula plotter, Wolfram Alpha is rather handy… it can do calculus for you too…

PS In a forthcoming post, I’ll describe another tool for creating similar sorts of plot – GeoGebra. If you know of other free, cross-platform, ideally open source, hosted or desktop applications similar to Gnuplot or GeoGebra, please post a link in the comments:-)

PPS I quite like this not-so-frequently-asked questions cribsheet for Gnuplot

PPPS for another worked through example, see F1DataJunkie: F1 2011 Turkey Race – Race Lap History


Merging Datasets with Common Columns in Google Refine

It’s an often encountered situation, but one that can be a pain to address – merging data from two sources around a common column. Here’s a way of doing it in Google Refine…

Here are a couple of example datasets to import into separate Google Refine projects if you want to play along, both courtesy of the Guardian data blog (pulled through the Google Spreadsheets to Yahoo pipes proxy mentioned here):

- University fees data (CSV via pipes proxy)

- University HESA stats, 2010 (CSV via pipes proxy)

We can now merge data from the two projects by creating a new column from values an existing column within one project that are used to index into a similar column in the other project. Looking at the two datasets, both HESA Code and institution/University look like candidates for merging the data. Which should we go with? I’d go with the unique identifier (i.e. HESA code in the case) every time…

First, create a new column:

Now do the merge, using the cell.cross GREL (Google Refine Expression Language) command. Trivially, and pinching wholesale from the documentation example, we might use the following command to bring in Average Teaching Score data from the second project into the first:

cell.cross("Merge Test B", "HESA code").cells["Average Teaching Score"].value[0]

Note that there is a null entry and an error entry. It’s possible to add a bit of logic to tidy things up a little:

if (value!='null',cell.cross("Merge Test B", "HESA code").cells["Average Teaching Score"].value[0],'')

Here’s the result:

Coping with not quite matching key columns

Another situation that often arises is that you have two columns that almost but don’t quite match. For example, this dataset has a different name representation that the above datasets (Merge Test C):

There are several text processing tools that we can use to try to help us match columns that differ in well-structured ways:

In the above case, where am I creating a new column based on the contents of the Institution column in Merge Test C, I’m using a couple of string processing tricks… The GREL expression may look complicated, but if you build it up in a stepwise fashion it makes more sense.

For example, the command replace(value,"this", "that") will replace occurrences of “this” in the string defined by value with “that”. If we replace “this” with an empty string (” (two single quotes next to each other) or “” (two double quotes next to each other)), we delete it from value: replace(value,"this", "")

The result of this operation can be embedded in another replace statement: replace(replace(value,"this", "that"),"that","the other"). In this case, the first replace will replace occurrences of “this” with “that”; the result of this operation is passed to the second (outer) replace function, which replaces “that” with “the other”). Try building up the expression in realtime, and see what happens. First use:
toLowercase(value)
(what happens?); then:
replace(toLowercase(value),'the','')
and then:
replace(replace(toLowercase(value),'the',''),'of','')

The fingerprint() function then separates out the individual words that are left, orders them, and returns the result (more detail). Can you see how this might be used to transform a column that originally contains “The University of Aberdeen” to “aberdeen university”, which might be a key in another project dataset?

When trying to reconcile data across two different datasets, you may find you need to try to minimise the distance between almost common key columns by creating new columns in each dataset using the above sorts of technique.

Be careful not to create false positive matches though; and also be mindful that not everything will necessarily match up (you may get empty cells when using cell.cross; (to mitigate this, filter rows using a crossed column to find ones where there was no match and see if you can correct them by hand). Even if you don’t completely successful cross data from one project to another, you might manage to automate the crossing of most of the rows, minimising the amount of hand crafted copying you might have to do to tidy up the real odds and ends…

So for example, here’s what I ended up using to create a “Pure key” column in Merge Test C:
fingerprint(replace(replace(replace(toLowercase(value),'the',''),'of',''),'university',''))

And in Merge Test A I create a “Complementary Key” column from the University column using fingerprint(value)

From the Complementary Key column in Merge Test A we call out to Merge Test C: cell.cross("Merge Test C", "Pure key").cells["UCAS ID"].value[0]

Obviously, this approach is far from ideal (and there may be more “correct” and/or efficient ways of doing this!) and the process described above is admittedly rather clunky, but it does start to reveal some of what’s involved in trying to bring data across to one Google Refine project from another using columns that don’t quite match in the original dataset, although they do (nominally) refer to the same thing, and does provide a useful introductory exercise to some of the really quite powerful text processing commands in Google Refine …


First Play With R and R-Studio – F1 Lap Time Box Plots

Last summer, at the European Centre for Journalism round table on data driven journalism, I remember saying something along the lines of “your eyes can often do the stats for you”, the implication being that our perceptual apparatus is good at pattern detection, and can often see things in the data that most of us would miss using the very limited range of statistical tools that we are either aware of, or are comfortable using.

I don’t know how good a statistician you need to be to distinguish between Anscombe’s quartet, but the differences are obvious to the eye:

Anscombe's quartet /via Wikipedia

Another shamistician (h/t @daveyp) heuristic (or maybe it’s a crapistician rule of thumb?!) might go something along the lines of: “if you use the right visualisations, you don’t necessarily need to do any statistics yourself”. In this case, the implication is that if you choose a viualisation technique that embodies or implements a statistical process in some way, the maths is done for you, and you get to see what the statistical tool has uncovered.

Now I know that as someone working in education, I’m probably supposed to uphold the “should learn it properly” principle… But needing to know statistics in order to benefit from the use of statistical tools seems to me to be a massive barrier to entry in the use of this technology (statistics is a technology…) You just need to know how to use the technology appropriately, or at least, not use it “dangerously”…

So to this end (“democratising access to technology”), I thought it was about time I started to play with R, the statistical programming language (and rival to SPSS?) that appears to have a certain amount of traction at the moment given the number of books about to come out around it… R is a command line language, but the recently released R-Studio seems to offer an easier way in, so I thought I’d go with that…

Flicking through A First Course in Statistical Programming with R, a book I bought a few weeks ago in the hope that the osmotic reading effect would give me some idea as to what it’s possible to do with R, I found a command line example showing how to create a simple box plot (box and whiskers plot) that I could understand enough to feel confident I could change…

Having an F1 data set/CSV file to hand (laptimes and fuel adjusted laptimes) from the China 2001 grand prix, I thought I’d see how easy it was to just dive in… And it was 2 minutes easy… (If you want to play along, here’s the data file).

Here’s the command I used:
boxplot(Lap.Time ~ Driver, data=lapTimeFuel)

Remembering a comment in a Making up the Numbers blogpost (Driver Consistency – Bahrain 2010) about the effect on laptime distributions from removing opening, in and out lap times, a quick Google turned up a way of quickly stripping out slow times. (This isn’t as clean as removing the actual opening, in and out lap times – it also removes mistake laps, for example, but I’m just exploring, right? Right?!;-)

lapTime2 <- subset(lapTimeFuel, Lap.Time < 110.1)

I could then plot the distribution in the reduced lapTime2 dataset by changing the original boxplot command to use (data=lapTime2). (Note that as with many interactive editors, using your keyboard’s up arrow displays previously entered commands in the current command line; so you can re-enter a previously entered command by hitting the up arrow a few times, then entering return. You can also edit the current command line, using the left and right arrow keys to move the cursor, and the delete key to delete text.)

Prior programming experience suggests this should also work…

boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < 110))

Something else I tried was to look at the distribution of fuel weight adjusted laptimes (where the time penalty from the weight of the fuel in the car is removed):

boxplot(Fuel.Adjusted.Laptime ~ Driver, data=lapTimeFuel)

Looking at the release notes for the latest version of R-Studio suggests that you can build interactive controls into your plots (a bit like Mathematica supports?). The example provided shows how to change the x-range on a plot:
manipulate(
plot(cars, xlim=c(0,x.max)),
x.max=slider(15,25))

Hmm… can we set the filter value dynamically I wonder?

manipulate(
boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval)),
maxval=slider(100,140))

Seems like it…?:-) We can also combine interactive controls:

manipulate(boxplot(Lap.Time ~ Driver, data=subset(lapTimeFuel, Lap.Time < maxval),outline=outline),maxval=slider(100,140),outline = checkbox(FALSE, "Show outliers"))

Okay – that’s enough for now… I reckon that with a handful of commands on a crib sheet, you can probably get quite a lot of chart plot visualisations done, as well as statistical visualisations, in the R-Studio environment; it also seems easy enough to build in interactive controls that let you play with the data in a visually interactive way…

The trick comes from choosing visual statistics approaches to analyse your data that don’t break any of the assumptions about the data that the particular statistical approach relies on in order for it to be applied in any sensible or meaningful way.

[This blog post is written, in part, as a way for me to try to come up with something to say at the OU Statistics Group's one day conference on Visualisation and Presentation in Statistics. One idea I wanted to explore was: visualisations are powerful; visualisation techniques may incorporate statistical methods or let you "see" statistical patterns; most people know very little statistics; that shouldnlt stop them being able to use statistics as a technology; so what are we going to do about it? Feedback welcome... Err....?!]


Fragments: Glueing Different Data Sources Together With Google Refine

I’m working on a new pattern using Google Refine as the hub for a data fusion experiment pulling together data from different sources. I’m not sure how it’ll play out in the end, but here are some fragments….

Grab Data into Google Refine as CSV from a URL (Proxied Google Spreadsheet Query via Yahoo Pipes)

Firstly, getting data into Google Refine… I had hoped to be able to pull a subset of data from a Google Spreadsheet into Google Refine by importing CSV data obtained from the spreadsheet via a query generated using my Google Spreadsheet/Guardian datastore explorer (see Using Google Spreadsheets as a Database with the Google Visualisation API Query Language for more on this) but it seems that Refine would rather pull the whole of the spreadsheet in (or at least, the whole of the first sheet (I think?!)).

Instead, I had to tweak create a proxy to run the query via a Yahoo Pipe (Google Spreadsheet as a database proxy pipe), which runs the spreadsheet query, gets the data back as CSV, and then relays it forward as JSON:

Here’s the interface to the pipe – it requires the Google spreadsheet public key id, the sheet id, and the query… The data I’m using is a spreadsheet maintained by the Guardian datastore containing UK university fees data (spreadsheet.

You can get the JSON version of the data out directly, or a proxied version of the CSV, as CSV via the More options menu…

Using the Yahoo Pipes CSV output URL, I can now get a subset of data from a Google Spreadsheet into Google Refine…

Here’s the result – a subset of data as defined by the query:

We can now augment this data with data from another source using Google Refine’s ability to import/fetch data from a URL. In particular, I’m going to use the Yahoo Pipe described above to grab data from a different spreadsheet and pass it back to Google Refine as a JSON data feed. (Google spreadsheets will publish data as JSON, but the format is a bit clunky…)

To test out my query, I’m going to create a test query in my datastore explorer using the Guardian datastore HESA returns (2010) spreadsheet URL (http://spreadsheets1.google.com/spreadsheet/ccc?hl&key=tpxpwtyiYZwCMowl3gNaIKQ#gid=0) which also has a column containing HESA numbers. (Ultimately, I’m going to generate a URL that treats the Guardian datastore spreadsheet as a database that lets me get data back from the row with a particular HESA code column value. By using the HESA number column in Google Refine to provide the key, I can generate a URL for each institution that grabs its HESA data from the Datastore HESA spreadsheet.)

Hit “Preview Table Headings”, then scroll down to try out a query:

Having tested my query, I can now try the parameters out in the Yahoo pipe. (For example, my query is select D,E,H where D=21 and the key is tpxpwtyiYZwCMowl3gNaIKQ; this grabs data from columns D, E and H where the value of D (HESA Code) is 21). Grab the JSON output URL from the pipe, and use this as a template for the URL template in Google Refine. Here’s the JSON output URL I obtained:

http://pipes.yahoo.com/pipes/pipe.run?_id=4562a5ec2631ce242ebd25a0756d6381
&_render=json&key=tpxpwtyiYZwCMowl3gNaIKQ
&q=select+D%2CE%2CH+where+D%3D21

Remember, the HESA code I experiment with was 21, so this is what we want to replace in the URL with the value from the HESA code column in Google Refine…

Here’s how we create the URLs built around/keyed by an appropriate HESA code…

Google Refine does its thing and fetches the data…

Now we process the JSON response to generate some meaningful data columns (for more on how to do this, see Tech Tips: Making Sense of JSON Strings – Follow the Structure).

First say we want to create a new column based on the imported JSON data:

Then parse the JSON to extract the data field required in the new column.

For example, from the HESA data we might extract the Expenditure per student /10:

value.parseJson().value.items[0]["Expenditure per student / 10"]

or the Average Teaching Score (value.parseJson().value.items[0]["Average Teaching Score"]):

And here’s the result:

So to recap:

- we use a Yahoo Pipe to query a Google spreadsheet and get a subset of data from it;
- we take the CSV output from the pipe and use it to create a new Google Refine database;
- we note that the data table in Google Refine has a HESA code column; we also note that the Guardian datastore HESA spreadsheet has a HESA code column;
- we realise we can treat the HESA spreadsheet as a database, and further that we can create a query (prototyped in the datastore explorer) as a URL keyed by HESA code;
- we create a new column based on HESA codes from a generated URL that pulls JSON data from a Yahoo pipe that is querying a Google spreadsheet;
- we parse the JSON to give us a couple of new columns.

And there we have it – a clunky, but workable, route for merging data from two different Google spreadsheets using Google Refine.


‘Dead’ Osama Bin Laden photos – why have so many news sites published them?

Daily Mail leads with fake dead Bin Laden photo

Both the Daily Mail and the Daily Mirror today – among with several others in the US (including the New York Post, which credits the image to AP) and other countries – published an image purporting to be that of the dead Osama Bin Laden.

It clearly wasn’t.

Any journalist with a drop of cynicism would have questioned the source of the images – even if they did appear on Pakistan television.

It certainly passed the ‘Too good to be true’ test.

Instead, it was users of Reddit and Twitter who first highlighted the dodgy provenance of the image, and the image it was probably based on. Knight News and MSNBC’s Photo blog‘s followed soon after.

It took me all of 10 seconds to verify that it is a fake – by using TinEye to find other instances of the image, I found this example from last April.

But instead of owning up that their image was a fake, both The Daily Mail and Mirror appear to have simply removed the image from their site, leaving that image to circulate amongst their users. Ego, pure and simple.

PS: More on verifying images and other hoax material here.