This post is a little different from my usual fare.

There was a tweet from MATRIX NI that caught my eye: the latest Office for National Statistics report on internet use in the UK.

Basically, NI “lost”. So I thought it was a good opportunity to play around with the data a little bit instead of my usual stuff.

As such, this sent me off on two main thrusts:

  1. Demonstrating a little sample of my normal workflow with data wrangling using Python, Pandas, Plot.ly and a few other tools. This is not clean code and it’s not pretty.
  2. Arguing that NI sucks at the internet, and that this statistic is a more realistic, reliable, and impactful metric to target for economic and cultural growth/stability/recovery/happiness/whatever.

Unfortunately the data massage, both in terms of dealing with the Excel data and then massaging the outputs into something the blog could safely handle, took longer than I’d expected, so the follow-up will have to wait for another time.

Until then, well, here’s some data to play with.

“Open Data” sucks

When I’m working on something like this, I usually end up spending about 80% of my time just getting the data into a format that can be used reasonably; Excel documents are NOT an accessible open data standard and they should all go die in a fire… But this is what we’ve got.

Let’s kick things off with a few ‘preambles’.

In [1]:

import pandas as pd
import plotly.plotly as py
import statsmodels as sm
import statsmodels.tsa.filters.hp_filter  # make sure sm.tsa.filters.hp_filter is actually loaded for the HP filter later

from sklearn.linear_model import LinearRegression
import scipy, scipy.stats

import cufflinks as cf # Awesome Pandas/Plot.ly integration module https://github.com/santosjorge/cufflinks

py.sign_in('bolster', 'XXXXXXX')


Pandas has the relatively intelligent read_excel method that… well… does what it says on the tin.

Since every gov department appears to use Excel as a formatting and layout tool rather than a data management platform, there are lots of pointlessly empty rows and columns that we can drop, so we end up with the top of a data structure like this.

As you can see, Pandas has “guessed” that the top row is a set of column headers… this is incorrect…

In [3]:

df = pd.read_excel("/home/bolster/Downloads/rftiatables152015q1_tcm77-405031.xls", sheetname="4b")
df = df.dropna(axis=1, how="all")  # drop columns that are entirely empty
df = df.dropna(axis=0, how="all")  # drop rows that are entirely empty
df.head()
| | Table 4B: Recent and lapsed internet users and internet non-users, by geographical location, UK | Unnamed: 2 | Unnamed: 4 | … | Unnamed: 24 |
| --- | --- | --- | --- | --- | --- |
| 0 | Persons aged 16 years and over | NaN | NaN | … | % |
| 1 | NaN | Used in the last 3 months | NaN | … | NaN |
| 2 | NaN | 2013 Q1 | 2014 Q1 | … | 2015 Q1 |
| 4 | UK | 83.3 | 85 | … | 11.4 |
| 6 | North East | 79.4 | 81.3 | … | 13 |
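(As an aside: if you’d rather Pandas didn’t promote that first row to column headers at all, read_excel takes a header argument; something like the line below keeps every row as data. We don’t bother here, since working around the guessed headers is just as easy — and the raw name is purely illustrative.)

raw = pd.read_excel("/home/bolster/Downloads/rftiatables152015q1_tcm77-405031.xls", sheetname="4b", header=None)  # header=None: don't treat any row as column names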

Remove pointless rows (note that the index isn’t automatically renumbered).

In [4]:

df.drop(df.index[[0, -3, -2, -1]], inplace=True)  # drop the "Persons aged 16..." row and the trailing footnote rows
df.head()
| | Table 4B: Recent and lapsed internet users and internet non-users, by geographical location, UK | Unnamed: 2 | Unnamed: 4 | … | Unnamed: 24 |
| --- | --- | --- | --- | --- | --- |
| 1 | NaN | Used in the last 3 months | NaN | … | NaN |
| 2 | NaN | 2013 Q1 | 2014 Q1 | … | 2015 Q1 |
| 4 | UK | 83.3 | 85 | … | 11.4 |
| 6 | North East | 79.4 | 81.3 | … | 13 |
| 7 | Tees Valley and Durham | 78.5 | 81 | … | 12.9 |
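(If you did want the row labels renumbered after the drop, a plain reset_index would do it; we don’t bother here because nothing downstream cares about the gaps.)

df = df.reset_index(drop=True)  # optional: throw away the old labels and renumber from 0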

Fill in the ‘headings’ for the “Used in the last…” section to wipe out those NaNs.

In [5]:

df.iloc[0].fillna(method="ffill", inplace=True)  # forward-fill the merged section headings across their columns
df.head()
| | Table 4B: Recent and lapsed internet users and internet non-users, by geographical location, UK | Unnamed: 2 | Unnamed: 4 | … | Unnamed: 24 |
| --- | --- | --- | --- | --- | --- |
| 1 | NaN | Used in the last 3 months | Used in the last 3 months | … | Never used |
| 2 | NaN | 2013 Q1 | 2014 Q1 | … | 2015 Q1 |
| 4 | UK | 83.3 | 85 | … | 11.4 |
| 6 | North East | 79.4 | 81.3 | … | 13 |
| 7 | Tees Valley and Durham | 78.5 | 81 | … | 12.9 |

This one’s a bit complicated so we’ll split it up: first off, transpose the frame (rows become columns and vice versa), then set the new column headers to be the row currently containing the region names. Then drop that row since we don’t need it any more, and finally rename the first two columns to something useful.

In [6]:

df = df.T
df.columns = df.iloc[0]
df.head()
| | NaN | NaN | UK | North East | … | Northern Ireland |
| --- | --- | --- | --- | --- | --- | --- |
| Table 4B: Recent and lapsed internet users and internet non-users, by geographical location, UK | NaN | NaN | UK | North East | … | Northern Ireland |
| Unnamed: 2 | Used in the last 3 months | 2013 Q1 | 83.3 | 79.4 | … | 77.3 |
| Unnamed: 4 | Used in the last 3 months | 2014 Q1 | 85 | 81.3 | … | 77.4 |
| Unnamed: 5 | Used in the last 3 months | 2014 Q2 | 85.6 | 81.2 | … | 79 |
| Unnamed: 6 | Used in the last 3 months | 2014 Q3 | 85.8 | 81.9 | … | 78.3 |

5 rows × 51 columns

In [7]:

df.drop(df.index[[0]], inplace=True)
df.columns = ["Internet Use","Date"] + [s.strip() for s in df.columns[2:].tolist()] # Some idiots put spaces at the end of their cells
df.head()
| | Internet Use | Date | UK | North East | … | Northern Ireland |
| --- | --- | --- | --- | --- | --- | --- |
| Unnamed: 2 | Used in the last 3 months | 2013 Q1 | 83.3 | 79.4 | … | 77.3 |
| Unnamed: 4 | Used in the last 3 months | 2014 Q1 | 85 | 81.3 | … | 77.4 |
| Unnamed: 5 | Used in the last 3 months | 2014 Q2 | 85.6 | 81.2 | … | 79 |
| Unnamed: 6 | Used in the last 3 months | 2014 Q3 | 85.8 | 81.9 | … | 78.3 |
| Unnamed: 7 | Used in the last 3 months | 2014 Q4 | 86 | 80.8 | … | 78.1 |

5 rows × 51 columns

In [8]:

# Map ONS "YYYY Qn" labels onto "YYYY Month" strings that pd.to_datetime can parse
q_map = {"Q1":"March","Q2":"June","Q3":"September","Q4":"December"}
def _yr_q_parse(yq): return yq[0], q_map[yq[1]]
def _yr_q_join(s): return " ".join(_yr_q_parse(s.split(" ")))
df['Date'] = pd.to_datetime(df['Date'].apply(_yr_q_join))
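Just to sanity-check what that’s doing: the helper turns a label like “2013 Q1” into “2013 March”, which pd.to_datetime is happy to parse as a timestamp in March 2013 (the exact day it picks doesn’t matter for our purposes).

print(_yr_q_join("2013 Q1"))                  # "2013 March"
print(pd.to_datetime(_yr_q_join("2013 Q1")))  # a Timestamp in March 2013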

Finally, set Date and Internet Use as the index (i.e. the row identifiers), convert all the values from boring “objects” to “floats”, give the column range a name (for when we’re manipulating the data later), and, just for fun, let’s look at the average internet use statistics between 2013 and 2015.

In [9]:

df.set_index(['Date','Internet Use'],inplace=True)
df.sortlevel(inplace=True)
df=df.astype(float)
df.columns.name="Region"
df.groupby(level='Internet Use').mean().T
| Region | Never used | Used in the last 3 months | Used over 3 months ago |
| --- | --- | --- | --- |
| UK | 12.233333 | 85.316667 | 2.266667 |
| North East | 14.200000 | 81.033333 | 2.750000 |
| Tees Valley and Durham | 13.966667 | 80.716667 | 2.433333 |
| Northumberland and Tyne and Wear | 14.383333 | 81.266667 | 2.983333 |
| North West | 13.250000 | 83.950000 | 2.666667 |
| Cumbria | 15.433333 | 82.216667 | 2.200000 |
| Cheshire | 11.550000 | 86.050000 | 2.316667 |
| Greater Manchester | 12.050000 | 85.316667 | 2.483333 |
| Lancashire | 13.066667 | 83.783333 | 3.050000 |
| Merseyside | 16.383333 | 80.483333 | 3.016667 |
| Yorkshire and the Humber | 13.416667 | 84.033333 | 2.383333 |
| East Yorkshire and Northern Lincolnshire | 13.916667 | 83.416667 | 2.616667 |
| North Yorkshire | 11.216667 | 86.483333 | 1.900000 |
| South Yorkshire | 15.250000 | 82.116667 | 2.416667 |
| West Yorkshire | 12.916667 | 84.533333 | 2.400000 |
| East Midlands | 12.783333 | 84.366667 | 2.733333 |
| Derbyshire and Nottinghamshire | 12.916667 | 83.833333 | 3.083333 |
| Leicestershire, Rutland and Northamptonshire | 11.633333 | 85.750000 | 2.516667 |
| Lincolnshire | 14.933333 | 82.666667 | 2.166667 |
| West Midlands | 14.450000 | 82.916667 | 2.483333 |
| Herefordshire, Worcestershire and Warwickshire | 12.083333 | 85.050000 | 2.600000 |
| Shropshire and Staffordshire | 13.350000 | 84.316667 | 2.266667 |
| West Midlands | 16.266667 | 81.116667 | 2.550000 |
| East of England | 10.900000 | 86.900000 | 2.100000 |
| East Anglia | 12.133333 | 85.400000 | 2.383333 |
| Bedfordshire and Hertfordshire | 8.916667 | 89.216667 | 1.850000 |
| Essex | 11.216667 | 86.716667 | 1.950000 |
| London | 9.116667 | 88.950000 | 1.816667 |
| Inner London | 8.000000 | 90.083333 | 1.650000 |
| Outer London | 9.850000 | 88.150000 | 1.933333 |
| South East | 9.450000 | 88.600000 | 1.916667 |
| Berkshire, Buckinghamshire and Oxfordshire | 7.700000 | 90.566667 | 1.700000 |
| Surrey, East and West Sussex | 9.100000 | 88.916667 | 1.900000 |
| Hampshire and Isle of Wight | 10.516667 | 87.683333 | 1.716667 |
| Kent | 11.183333 | 86.466667 | 2.333333 |
| South West | 11.500000 | 86.100000 | 2.333333 |
| Gloucestershire, Wiltshire and Bristol/Bath area | 10.016667 | 88.050000 | 1.883333 |
| Dorset and Somerset | 11.333333 | 85.900000 | 2.616667 |
| Cornwall and Isles of Scilly | 13.966667 | 82.566667 | 3.466667 |
| Devon | 13.566667 | 84.116667 | 2.283333 |
| Wales | 15.333333 | 81.666667 | 2.816667 |
| West Wales and the Valleys | 16.800000 | 79.983333 | 3.000000 |
| East Wales | 12.850000 | 84.450000 | 2.500000 |
| Scotland | 13.033333 | 84.683333 | 2.183333 |
| North Eastern Scotland | 10.183333 | 88.050000 | 1.733333 |
| Eastern Scotland | 11.350000 | 86.533333 | 2.016667 |
| South Western Scotland | 14.933333 | 82.550000 | 2.400000 |
| Highlands and Islands | 14.500000 | 83.050000 | 2.416667 |
| Northern Ireland | 19.666667 | 78.283333 | 1.816667 |

Now that everything’s tidy, we can start to play.

The first thing to have a look at is the regional breakdown, so we select the regional columns and make a fresh dataframe with just those values. We can also forget about the “Used in the last 3 months” records because they’re not particularly interesting and make the graphs look weird…

As you can see, there are two “West Midlands”; this is a pain, as one is the ‘Region’ and the other is the [NUTS level 2](http://www.ons.gov.uk/ons/guide-method/geography/beginner-s-guide/eurostat/index.html) region, obviously… So we drop the local authority region. This was not as easy as I expected to do programmatically, and if someone has a cleverer way to do this, lemme know (one possible alternative is sketched after the code below).

In [10]:

def last_index_of(li, val):
    return (len(li) - 1) - li[::-1].index(val)

def non_dup_indexes_of_columns(df, dupkey):
    col_ids = list(range(len(df.columns.tolist())))  # list() so .remove() also works on Python 3
    col_ids.remove(last_index_of(df.columns.tolist(), dupkey))
    return col_ids


regions = ["UK",
           "North East","North West",
           "Yorkshire and the Humber",
           "East Midlands", "West Midlands",
           "East of England", "London",
           "South East", "South West",
           "Wales", "Scotland",
           "Northern Ireland"
]

df_regional_use = df[regions]
df_regional_use = df_regional_use.iloc[:, non_dup_indexes_of_columns(df_regional_use, "West Midlands")]  # positional selection; dammit West Midlands...
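For what it’s worth, a possibly-neater way to do the same thing (the alternative hinted at above) is to keep only the first occurrence of each column label using Index.duplicated; a sketch, assuming a reasonably recent Pandas and the same df_regional_use with its doubled “West Midlands”:

# keep="first" marks the second and later occurrences of a label as duplicates,
# so this keeps the first "West Midlands" (the region-level one here) and drops the repeat
df_regional_use = df_regional_use.loc[:, ~df_regional_use.columns.duplicated(keep="first")]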

Now we can plot the regional internet use using Plot.ly, meaning that these graphs look good both in the IPython Notebook where I’m editing this and, hopefully, on the blog when this goes up…

In [11]:

df_regional_use_means = df_regional_use.groupby(level='Internet Use').mean().T

cols = ['Never used','Used over 3 months ago']

df_regional_use_means[cols].sort(columns=cols).iplot(kind="bar",
    filename='Internet Region Use Means Bar',world_readable=True, theme="pearl",
    title="Average Digital Illiteracy by Region (2013-2015)", xTitle="Region", yTitle="%")

Now that doesn’t look too good; Northern Ireland has the highest proportion of “Never Used”, at just below 20%.

“But!” you might say, “What about all the great programs and processes put in place recently? We’re rapidly recovering from the Troubles, and we’re on the up and up!”

Alright then, let’s look at the standard deviation of these results (i.e. the volatility).

In [12]:

df_regional_use_stddevs = df_regional_use.groupby(level='Internet Use').std().T

df_regional_use_stddevs[cols].sort(columns=cols).iplot(kind="bar",
    filename='Internet Region Use Std Bar',world_readable=True, theme="pearl",
    title="Standard Deviation in UK Digital Illiteracy since 2013", xTitle="Period", yTitle="Std Dev (Variability)")

Nope, we still suck. So we’ve got the highest level of “Digital Illiteracy” and the lowest change in Digital Literacy of any region. Let’s take a closer look at that, since standard deviation is a terrible measure for time series data and doesn’t give a reliable sense of ‘trend’.

In [13]:

df_regional_never = df_regional_use.xs("Never used", level="Internet Use")
df_regional_never_pct_change = ((df_regional_never/df_regional_never.iloc[0])-1)

#fig,ax = my_chart_factory()

#df_regional_never_pct_change.plot(ax=ax, legend=False, color=tableau20)
df_regional_never_pct_change.iplot(filename='Internet Region Never Pct Change',world_readable=True, theme="pearl",
                                   title="Reduction in UK Digital Illiteracy since 2013",
                                   xTitle="Period", yTitle="%")

Remember this is the number of people who said that they never use the internet…

OK, that’s a nice interactive graph that you can poke around with, but it’s very noisy… Let’s see if we can clean that up a bit with a Hodrick-Prescott filter (we could have just done a linear regression, but that’s not really that interesting).

In [25]:

cycle, trend = sm.tsa.filters.hp_filter.hpfilter(df_regional_never_pct_change)  # the default lamb=1600 is the usual smoothing choice for quarterly data

trend.iplot(filename='Internet Region Never Pct Change Reg',world_readable=True, theme="pearl",
                                   title="Reduction in UK Digital Illiteracy since 2013",
                                   xTitle="Period", yTitle="%")

Note that because of the big gap at the beginning of the data (i.e. there’s no information between Q1 2013 and Q1 2014), the fitted trend has a kink in it. Looking at the results, and the “knee” in the data, it’s highly likely that this is just a mistake in the accounting and that it’s actually 2013 Q4 data.

Either way, Northern Ireland is not only bottom of the Digital Literacy table, but it has also made the least progress in this area of any region in the UK.
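For completeness, since a plain linear regression got dismissed as boring earlier: a minimal sketch of what that would look like for the Northern Ireland series, using the LinearRegression already imported at the top (the variable names and reshaping here are illustrative, not from the original notebook):

import numpy as np

ni = df_regional_never_pct_change["Northern Ireland"].astype(float)
X = np.arange(len(ni)).reshape(-1, 1)       # 0, 1, 2, ... as a crude "time" axis
reg = LinearRegression().fit(X, ni.values)
print(reg.coef_[0])  # average change per step in the index (not per quarter, given the 2013-2014 gap)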

Why worry about Digital Illiteracy?

I’ll save that particular rant for another time…