Data Wrangling for UK Internet Usage
Andrew Bolster
Senior R&D Manager (Data Science) at Black Duck Software and Treasurer @ Bsides Belfast and NI OpenGovernment Network
This post is a little different from my usual fare;
Basically, there was a tweet from MATRIX NI that caught my eye; the latest Office of National Statistics report on Internet Use in the UK.
@MATRIX_NI @ONS And the tables show Northern Ireland being around 7% behind the Avg and 2% behind the next-worst-region...
— Andrew Bolster (@Bolster) May 22, 2015
Basically, NI “lost”. So I thought it was a good opportunity to play around with the data a little bit instead of my usual stuff.
As such this sent me on two main thrusts;
- Demonstrating a little sample of my normal workflow with data wrangling using Python, Pandas, Plot.ly and a few other tools. This is not clean code and it’s not pretty.
- NI Sucks at the internet and I believe this statistic is the more realistic, reliable, and impactful metric to target economic and cultural growth/stability/recovery/happiness/whatever.
Unfortuately the data massage, both in terms of dealing with the Excel data and then massaging the outputs into something the blog could safely handle took longer than I’d expected so the follow up will have to wait for another time.
Until then, well, here’s some data to play with.
“Open Data” sucks
When I’m working on something like this, I usually end up spending about 80% of my time just getting the data into a format that can be used reasonably; Excel documents are NOT accessible open data standards and they should all go die in a fire… But this is what we’ve got.
Lets kick things off with a few ‘preambles’.
In [1]:
Pandas has the relatively intelligent read_excel
method that… well… does
what it says on the tin.
Since every gov department appears to use Excel as a formatting and layout tool rather than a data management platform, there are lots of pointlessly empty rows and columns that we can drop, so we end up with the top of a data structure like this.
As you can see, Pandas has “guessed” that the top row is a set of column headers… this is incorrect…
In [3]:
Table 4B: Recent and lapsed internet users and internet non-users, by geographical location, UK | Unnamed: 2 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 10 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 18 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Persons aged 16 years and over | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | % |
1 | NaN | Used in the last 3 months | NaN | NaN | NaN | NaN | NaN | Used over 3 months ago | NaN | NaN | NaN | NaN | NaN | Never used | NaN | NaN | NaN | NaN | NaN |
2 | NaN | 2013 Q1 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2013 Q1 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2013 Q1 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 |
4 | UK | 83.3 | 85 | 85.6 | 85.8 | 86 | 86.2 | 2.5 | 2.2 | 2.3 | 2.2 | 2.2 | 2.2 | 14 | 12.6 | 12 | 11.8 | 11.6 | 11.4 |
6 | North East | 79.4 | 81.3 | 81.2 | 81.9 | 80.8 | 81.6 | 2.6 | 2.4 | 2.9 | 2.5 | 2.5 | 3.6 | 16.8 | 14.3 | 14.1 | 12.9 | 14.1 | 13 |
Remove pointless rows (Note that the index on the right hand side isn’t automatically updated)
In [4]:
Table 4B: Recent and lapsed internet users and internet non-users, by geographical location, UK | Unnamed: 2 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 10 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 18 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | NaN | Used in the last 3 months | NaN | NaN | NaN | NaN | NaN | Used over 3 months ago | NaN | NaN | NaN | NaN | NaN | Never used | NaN | NaN | NaN | NaN | NaN |
2 | NaN | 2013 Q1 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2013 Q1 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2013 Q1 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 |
4 | UK | 83.3 | 85 | 85.6 | 85.8 | 86 | 86.2 | 2.5 | 2.2 | 2.3 | 2.2 | 2.2 | 2.2 | 14 | 12.6 | 12 | 11.8 | 11.6 | 11.4 |
6 | North East | 79.4 | 81.3 | 81.2 | 81.9 | 80.8 | 81.6 | 2.6 | 2.4 | 2.9 | 2.5 | 2.5 | 3.6 | 16.8 | 14.3 | 14.1 | 12.9 | 14.1 | 13 |
7 | Tees Valley and Durham | 78.5 | 81 | 80.3 | 81.4 | 79.6 | 83.5 | 2.3 | 2.1 | 2.8 | 3.1 | 2 | 2.3 | 16.5 | 13 | 14.2 | 12 | 15.2 | 12.9 |
Fill in the ‘headings’ for the “Used in the last” section to wipe out those NaN’s
In [5]:
Table 4B: Recent and lapsed internet users and internet non-users, by geographical location, UK | Unnamed: 2 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 10 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Unnamed: 15 | Unnamed: 16 | Unnamed: 18 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | NaN | Used in the last 3 months | Used in the last 3 months | Used in the last 3 months | Used in the last 3 months | Used in the last 3 months | Used in the last 3 months | Used over 3 months ago | Used over 3 months ago | Used over 3 months ago | Used over 3 months ago | Used over 3 months ago | Used over 3 months ago | Never used | Never used | Never used | Never used | Never used | Never used |
2 | NaN | 2013 Q1 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2013 Q1 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 | 2013 Q1 | 2014 Q1 | 2014 Q2 | 2014 Q3 | 2014 Q4 | 2015 Q1 |
4 | UK | 83.3 | 85 | 85.6 | 85.8 | 86 | 86.2 | 2.5 | 2.2 | 2.3 | 2.2 | 2.2 | 2.2 | 14 | 12.6 | 12 | 11.8 | 11.6 | 11.4 |
6 | North East | 79.4 | 81.3 | 81.2 | 81.9 | 80.8 | 81.6 | 2.6 | 2.4 | 2.9 | 2.5 | 2.5 | 3.6 | 16.8 | 14.3 | 14.1 | 12.9 | 14.1 | 13 |
7 | Tees Valley and Durham | 78.5 | 81 | 80.3 | 81.4 | 79.6 | 83.5 | 2.3 | 2.1 | 2.8 | 3.1 | 2 | 2.3 | 16.5 | 13 | 14.2 | 12 | 15.2 | 12.9 |
This ones a bit complicated so we’ll split it up; first off transpose the frame (rows become columns, etc), and then set the new column headers to be the row currently containing the region names. Then drop that row since we don’t need it any more, and finally update the columns to give the first two columns useful names.
In [6]:
Table 4B: Recent and lapsed internet users and internet non-users, by geographical location, UK | nan | nan | UK | North East | Tees Valley and Durham | Northumberland and Tyne and Wear | North West | Cumbria | Cheshire | Greater Manchester | ... | Devon | Wales | West Wales and the Valleys | East Wales | Scotland | North Eastern Scotland | Eastern Scotland | South Western Scotland | Highlands and Islands | Northern Ireland |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Table 4B: Recent and lapsed internet users and internet non-users, by geographical location, UK | NaN | NaN | UK | North East | Tees Valley and Durham | Northumberland and Tyne and Wear | North West | Cumbria | Cheshire | Greater Manchester | ... | Devon | Wales | West Wales and the Valleys | East Wales | Scotland | North Eastern Scotland | Eastern Scotland | South Western Scotland | Highlands and Islands | Northern Ireland |
Unnamed: 2 | Used in the last 3 months | 2013 Q1 | 83.3 | 79.4 | 78.5 | 80.1 | 81.9 | 79 | 84.1 | 83.5 | ... | 83.8 | 79.1 | 76.8 | 83.1 | 82.6 | 83.8 | 84.9 | 80.6 | 80.5 | 77.3 |
Unnamed: 4 | Used in the last 3 months | 2014 Q1 | 85 | 81.3 | 81 | 81.5 | 83.7 | 84.9 | 85.4 | 85 | ... | 83.9 | 81.7 | 79.4 | 85.4 | 85.1 | 89.2 | 87.1 | 82.2 | 86 | 77.4 |
Unnamed: 5 | Used in the last 3 months | 2014 Q2 | 85.6 | 81.2 | 80.3 | 82 | 84.1 | 82.5 | 85.5 | 86 | ... | 83.1 | 81.8 | 80.4 | 84 | 85.1 | 89.3 | 86.7 | 82.3 | 87.9 | 79 |
Unnamed: 6 | Used in the last 3 months | 2014 Q3 | 85.8 | 81.9 | 81.4 | 82.3 | 84.5 | 80 | 86.9 | 86.1 | ... | 85.2 | 82.8 | 81.7 | 84.6 | 85.1 | 87.7 | 87.6 | 83.3 | 79.1 | 78.3 |
5 rows × 51 columns
In [7]:
Internet Use | Date | UK | North East | Tees Valley and Durham | Northumberland and Tyne and Wear | North West | Cumbria | Cheshire | Greater Manchester | ... | Devon | Wales | West Wales and the Valleys | East Wales | Scotland | North Eastern Scotland | Eastern Scotland | South Western Scotland | Highlands and Islands | Northern Ireland | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unnamed: 2 | Used in the last 3 months | 2013 Q1 | 83.3 | 79.4 | 78.5 | 80.1 | 81.9 | 79 | 84.1 | 83.5 | ... | 83.8 | 79.1 | 76.8 | 83.1 | 82.6 | 83.8 | 84.9 | 80.6 | 80.5 | 77.3 |
Unnamed: 4 | Used in the last 3 months | 2014 Q1 | 85 | 81.3 | 81 | 81.5 | 83.7 | 84.9 | 85.4 | 85 | ... | 83.9 | 81.7 | 79.4 | 85.4 | 85.1 | 89.2 | 87.1 | 82.2 | 86 | 77.4 |
Unnamed: 5 | Used in the last 3 months | 2014 Q2 | 85.6 | 81.2 | 80.3 | 82 | 84.1 | 82.5 | 85.5 | 86 | ... | 83.1 | 81.8 | 80.4 | 84 | 85.1 | 89.3 | 86.7 | 82.3 | 87.9 | 79 |
Unnamed: 6 | Used in the last 3 months | 2014 Q3 | 85.8 | 81.9 | 81.4 | 82.3 | 84.5 | 80 | 86.9 | 86.1 | ... | 85.2 | 82.8 | 81.7 | 84.6 | 85.1 | 87.7 | 87.6 | 83.3 | 79.1 | 78.3 |
Unnamed: 7 | Used in the last 3 months | 2014 Q4 | 86 | 80.8 | 79.6 | 81.7 | 85.1 | 80.4 | 86.5 | 86.4 | ... | 82.6 | 82.1 | 81 | 84 | 84.8 | 87.9 | 86.2 | 83.3 | 82.7 | 78.1 |
5 rows × 51 columns
In [8]:
Finally, set the Date to be the primary index (i.e. the row identifier), convert all the values from boring “objects” to “floats”, give the column range a name (for when we’re manipulating the data later), and for play, lets just look at the average internet use statistics between 2013 and 2015
In [9]:
Internet Use | Never used | Used in the last 3 months | Used over 3 months ago |
---|---|---|---|
Region | |||
UK | 12.233333 | 85.316667 | 2.266667 |
North East | 14.200000 | 81.033333 | 2.750000 |
Tees Valley and Durham | 13.966667 | 80.716667 | 2.433333 |
Northumberland and Tyne and Wear | 14.383333 | 81.266667 | 2.983333 |
North West | 13.250000 | 83.950000 | 2.666667 |
Cumbria | 15.433333 | 82.216667 | 2.200000 |
Cheshire | 11.550000 | 86.050000 | 2.316667 |
Greater Manchester | 12.050000 | 85.316667 | 2.483333 |
Lancashire | 13.066667 | 83.783333 | 3.050000 |
Merseyside | 16.383333 | 80.483333 | 3.016667 |
Yorkshire and the Humber | 13.416667 | 84.033333 | 2.383333 |
East Yorkshire and Northern Lincolnshire | 13.916667 | 83.416667 | 2.616667 |
North Yorkshire | 11.216667 | 86.483333 | 1.900000 |
South Yorkshire | 15.250000 | 82.116667 | 2.416667 |
West Yorkshire | 12.916667 | 84.533333 | 2.400000 |
East Midlands | 12.783333 | 84.366667 | 2.733333 |
Derbyshire and Nottinghamshire | 12.916667 | 83.833333 | 3.083333 |
Leicestershire, Rutland and Northamptonshire | 11.633333 | 85.750000 | 2.516667 |
Lincolnshire | 14.933333 | 82.666667 | 2.166667 |
West Midlands | 14.450000 | 82.916667 | 2.483333 |
Herefordshire, Worcestershire and Warwickshire | 12.083333 | 85.050000 | 2.600000 |
Shropshire and Staffordshire | 13.350000 | 84.316667 | 2.266667 |
West Midlands | 16.266667 | 81.116667 | 2.550000 |
East of England | 10.900000 | 86.900000 | 2.100000 |
East Anglia | 12.133333 | 85.400000 | 2.383333 |
Bedfordshire and Hertfordshire | 8.916667 | 89.216667 | 1.850000 |
Essex | 11.216667 | 86.716667 | 1.950000 |
London | 9.116667 | 88.950000 | 1.816667 |
Inner London | 8.000000 | 90.083333 | 1.650000 |
Outer London | 9.850000 | 88.150000 | 1.933333 |
South East | 9.450000 | 88.600000 | 1.916667 |
Berkshire, Buckinghamshire and Oxfordshire | 7.700000 | 90.566667 | 1.700000 |
Surrey, East and West Sussex | 9.100000 | 88.916667 | 1.900000 |
Hampshire and Isle of Wight | 10.516667 | 87.683333 | 1.716667 |
Kent | 11.183333 | 86.466667 | 2.333333 |
South West | 11.500000 | 86.100000 | 2.333333 |
Gloucestershire, Wiltshire and Bristol/Bath area | 10.016667 | 88.050000 | 1.883333 |
Dorset and Somerset | 11.333333 | 85.900000 | 2.616667 |
Cornwall and Isles of Scilly | 13.966667 | 82.566667 | 3.466667 |
Devon | 13.566667 | 84.116667 | 2.283333 |
Wales | 15.333333 | 81.666667 | 2.816667 |
West Wales and the Valleys | 16.800000 | 79.983333 | 3.000000 |
East Wales | 12.850000 | 84.450000 | 2.500000 |
Scotland | 13.033333 | 84.683333 | 2.183333 |
North Eastern Scotland | 10.183333 | 88.050000 | 1.733333 |
Eastern Scotland | 11.350000 | 86.533333 | 2.016667 |
South Western Scotland | 14.933333 | 82.550000 | 2.400000 |
Highlands and Islands | 14.500000 | 83.050000 | 2.416667 |
Northern Ireland | 19.666667 | 78.283333 | 1.816667 |
Now that everythings tidy, we can start to play.
First thing to have a look at is the regional breakdown, so we select the Regional records and make a fresh dataframe just with those values. We can also forget about the “Used in the last 3 months” records because they’re not particularly interesting and make the graphs look weird…
As you can see, there are two “West Midlands”; this is a pain, as one is the ‘Region’ and the other is to NUTS level 2 Region, obviously…. So we drop the local authority region. This was not as easy as I expected to do programmatically, and if someone has a cleverer way to do this, lemme know.
In [10]:
Now we can plot the regional internet uses using Plot.ly, meaning that these graphs both look good in my IPython Notebook where I’m editing this, and hopefully on the blog when this goes up….
In [11]:
Now that doesn’t look too good; Northern Ireland has the highest proportion of “Never Used”, at just below 20%.
“But!” you might say, “What about all the great programs and processes put in place recently? We’re rapidly recovering from a troubles, and we’re on the up and up!”
Alright then, lets look at the standard deviation of these results (i.e. the volativility)
In [12]:
Nope, we still suck. So we’ve got the highest amount of “Digital Illiteracy” and the lowest change in Digital Literacy of any region. Lets take a look at that; std dev is a terrible measure for time series data and doesn’t give a reliable sense of ‘trend’
In [13]:
Remember this is the number of people who said that they never use the internet…
Ok that’s a nice interactive graph that you can poke around with, but it’s very noisy…. Lets see if we can clean that up a bit with a Hodrick-Prescott filter (we could have just done a linear regression but that’s not really that interesting.
In [25]:
Note that because of the big gap at the beginning of the data (i.e. there’s no information between Q1 2013 and Q1 2014), the regression has a kink in it. Looking at the results, and the “knee” in the data, it’s highly likely that this is just a mistake in the accounting and that it’s actually 2013 Q4 data.
Either way, Northern Ireland is not only bottom of the Digital Literacy Table, but it’s also accomplished the least progress in this area of any region across the UK.
Why worry about Digital Illiteracy?
I’ll save that particular rant for another time…