Wrangling NI House Price Index Data

Andrew Bolster

Senior R&D Manager (Data Science) at Black Duck Software and Treasurer @ Bsides Belfast and NI OpenGovernment Network

Data Wrangling NI House Price Index Data

This is a ‘messy’ ‘blog post’ that’s just a braindump of a notebook to step through NI House Price Index datasets I was playing around with.

It’s mostly code, so if you were here from some ‘insight’, feck aff.

There is no analysis here, this is just data wrangling.

TLDR As always, Government Open Data has over the years gone from ‘non-existent’ to ‘garbeled’ to ‘inconsistent’ and I feel is now in the stage of ‘consistently inconsistent’, which is progress in my eyes.

Preamble Code, move on.

from bs4 import BeautifulSoup
import pandas as pd
import requests

# Pull the latest pages of https://www.finance-ni.gov.uk/publications/ni-house-price-index-statistical-reports and extract links

base_url= 'https://www.finance-ni.gov.uk/publications/ni-house-price-index-statistical-reports'
base_content = requests.get(base_url).content
base_soup = BeautifulSoup(base_content)

for a in base_soup.find_all('a'):
    if a.attrs.get('href','').endswith('xlsx'):
        source_name, source_url = a.contents[1],a.attrs['href']

source_df = pd.read_excel(source_url, sheet_name = None) # Load all worksheets in

source_df.keys()

dict_keys(['Cover Sheet', 'Contents', 'Table 1', 'Table 2', 'Table 2a', 'Table 2b', 'Table 2c', 'Table 2d', 'Table 3', 'Table 3a', 'Table 3b', 'Table 3c', 'Table 4', 'Fig 5', 'Table 5', 'Table 5a', 'Fig 6', 'Table 6', 'Table 7', 'Table 8', 'Table 9', 'Table 9a', 'Table 9b', 'Table 9c', 'Table 9d', 'Table 10a', 'Table 10b', 'Table 10c', 'Table 10d', 'Table 10e', 'Table 10f', 'Table 10g', 'Table 10h', 'Table 10i', 'Table 10j', 'Table 10k'])

source_df['Contents']

	Table of Contents	Unnamed: 1	Unnamed: 2
0	Worksheet Name	Frequency	House Price Index - Quarter 4 2021
1	Table 1	Quarterly	Table 1: NI HPI Trends Q1 2005 - Q4 2021
2	Figure 1	Quarterly	Figure 1: Graph of NI HPI Q1 2005 - Q4 2021
3	Figure 1a	Quarterly	Figure 1a: Graph of Percentage Quarterly Chang...
4	Figure 1b	Quarterly	Figure 1b: Graph of Percentage Annual Change Q...
...	...	...	...
58	Table 10h	Quarterly	Table 10h: Number of Verified Residential Prop...
59	Table 10i	Quarterly	Table 10i: Number of Verified Residential Prop...
60	Table 10j	Quarterly	Table 10j: Number of Verified Residential Prop...
61	Table 10k	Quarterly	Table 10k: Number of Verified Residential Prop...
62	Figure 11	Quarterly	Figure 11: Number of Verified Residential Prop...

63 rows × 3 columns

Fix the Contents sheet to correctly reflect the Worksheet names

And fix the table headers and sheet-titles while we’re at it.

new_header = source_df['Contents'].iloc[0]
source_df['Contents'] = source_df['Contents'][1:]
source_df['Contents'].columns = new_header

source_df['Contents'].columns = [*new_header[:-1],'Title']

[t for t in source_df['Contents']['Title'].values if t.startswith('Table')]

['Table 1: NI HPI Trends Q1 2005 - Q4 2021',
 'Table 2: NI HPI & Standardised Price Statistics by Property Type Q4 2021',
 'Table 2a: NI Detached Property Price Index Q1 2005 - Q4 2021',
 'Table 2b: NI Semi-Detached Property Price Index Q1 2005 - Q4 2021',
 'Table 2c: NI Terrace Property Price Index Q1 2005 - Q4 2021',
 'Table 2d: NI Apartment Price Index Q1 2005 - Q4 2021',
 'Table 3: NI HPI & Standardised Price Statistics by New/Existing Resold Dwelling Type Q4 2021',
 'Table 3a: NI New Dwelling Price Index Q1 2005 - Q4 2021',
 'Table 3b: NI Existing Resold Dwellings Price Index Q1 2005 - Q4 2021',
 'Table 3c: Number of Verified Residential Property Sales by New/Existing Resold Dwellings Q1 2005 - Q2 2021',
 'Table 4: Number of Verified Residential Property Sales Q1 2005 - Q4 2021',
 'Table 5: HPI & Standardised Price for each Local Government District in NI',
 'Table 5a: Number of Verified Residential Property Sales by Local Government District Q1 2005 - Q4 2021',
 'Table 6: NI HPI & Standardised Price by Urban and Rural areas of Northern Ireland',
 'Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times',
 'Table 8: Number of Verified Residential Property Sales for Urban and Rural Areas of NI (Q1 2005 - Q4 2021) and Rural Areas of NI by drive times (Q1 2015 - Q4 2021)',
 'Table 9: NI Average Sale Prices All Properties Q1 2005 - Q4 2021',
 'Table 9a: NI Average Sale Prices Detached Properties Q1 2005 - Q4 2021',
 'Table 9b: NI Average Sale Prices Semi-Detached Properties Q1 2005 - Q4 2021',
 'Table 9c: NI Average Sale Prices Terrace Properties Q1 2005 - Q4 2021',
 'Table 9d: NI Average Sale Prices Apartments Q1 2005 - Q4 2021',
 'Table 10a: Number of Verified Residential Property Sales by Type in Antrim and Newtownabbey Council Q1 2005 - Q4 2021',
 'Table 10b: Number of Verified Residential Property Sales by Type in Ards and North Down Council Q1 2005 - Q4 2021',
 'Table 10c: Number of Verified Residential Property Sales by Type in Armagh City, Banbridge and Craigavon Council Q1 2005 - Q4 2021',
 'Table 10d: Number of Verified Residential Property Sales by Type in Belfast Council Q1 2005 - Q4 2021',
 'Table 10e: Number of Verified Residential Property Sales by Type in Causeway Coast and Glens Council Q1 2005 - Q4 2021',
 'Table 10f: Number of Verified Residential Property Sales by Type in Derry City and Strabane Council Q1 2005 - Q4 2021',
 'Table 10g: Number of Verified Residential Property Sales by Type in Fermanagh and Omagh Council Q1 2005 - Q4 2021',
 'Table 10h: Number of Verified Residential Property Sales by Type in Lisburn and Castlereagh Council Q1 2005 - Q4 2021',
 'Table 10i: Number of Verified Residential Property Sales by Type in Mid and East Antrim Council Q1 2005 - Q4 2021',
 'Table 10j: Number of Verified Residential Property Sales by Type in Mid Ulster Council Q1 2005 - Q4 2021',
 'Table 10k: Number of Verified Residential Property Sales by Type in Newry, Mourne and Down Council Q1 2005 - Q4 2021']

# Replace 'Figure' with 'Fig' in 'Worksheet Name'
with pd.option_context('mode.chained_assignment',None):
    source_df['Contents']['Worksheet Name'] = source_df['Contents']['Worksheet Name'].str.replace('Figure','Fig')

Tidy up Data

General Methodology

Ignore figure data (pretty much completly….)

Tables have more or less the same structure; a header on row 3(1), a year and quarter ‘index’ (on time series; otherwise categorical index, see Table 2, Table 3).

Some TS tables also have totals subsections so these should be a) validated and b) ignored.

Any columns with no header in row 3(1) should be ignored (usually text notes)

Operate Sequentially (i.e. Table 1, Table 2, Table 2a; don’t skip, even if it’s tempting)

Use keys from ‘Contents’ to describe data, but may be suffixed by the date which could change between data sets!

There’s also some really columns that look like checksums, so if there is an ‘NI’ column, or a data column that all valid values are ‘100’, delete it.

Table 1: NI HPI Trends Q1 2005 - Q4 2021

TODO: Regexy way to get rid of the ‘\QX-YYYY -\QX YYYY’ tail

source_df['Table 1']

	Table 1: NI House Price Index, Standardised Price and Quarterly and Annual Change	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7
0	Back to contents	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Year	Quarter	NI House Price Index	NI House Standardised Price	Quarterly Change	Annual Change	NI	NaN
2	2005	Q1	100.883607	111920.268199	NaN	NaN	100	0.0
3	NaN	Q2	104.564663	116004.031639	0.036488	NaN	100	0.0
4	NaN	Q3	111.219	123386.352673	0.063638	NaN	100	0.0
...	...	...	...	...	...	...	...	...
89	NaN	NaN	NaN	NaN	NaN	NaN	100	0.0
90	NaN	NaN	NaN	NaN	NaN	NaN	100	0.0
91	NaN	NaN	NaN	NaN	NaN	NaN	100	0.0
92	NaN	NaN	NaN	NaN	NaN	NaN	100	0.0
93	NaN	NaN	NaN	NaN	NaN	NaN	100	0.0

94 rows × 8 columns

def basic_cleanup(df:pd.DataFrame, offset=1)->pd.DataFrame:
    df = df.copy()
    # Re-header from row 1 (which was row 3 in excel)
    new_header = df.iloc[offset]
    df = df.iloc[offset+1:]
    df.columns = new_header

    # remove 'NaN' trailing columns
    df = df[df.columns[pd.notna(df.columns)]]

    # 'NI' is a usually hidden column that appears to be a checksum;
    #if it's all there and all 100, remove it, otherwise, complain.
    # (Note, need to change this 'if' logic to just 'if there's a
    # column with all 100's, but cross that bridge later)
    if 'NI' in df:
        assert df['NI'].all() and df['NI'].mean() == 100, "Not all values in df['NI'] == 100"
        df = df.drop('NI', axis=1)

    # Strip rows below the first all-nan row, if there is one
    # (Otherwise this truncates the tables as there is no
    # idxmax in the table of all 'false's)
    if any(df.isna().all(axis=1)):
        idx_first_bad_row = df.isna().all(axis=1).idxmax()
        df = df.loc[:idx_first_bad_row-1]

    # By Inspection, other tables use 'Sale Year' and 'Sale Quarter'
    if set(df.keys()).issuperset({'Sale Year','Sale Quarter'}):
        df = df.rename(columns = {
            'Sale Year':'Year',
            'Sale Quarter': 'Quarter'
        })

    # For 'Year','Quarter' indexed pages, there is an implied Year
    # in Q2/4, so fill it downwards
    if set(df.keys()).issuperset({'Year','Quarter'}):
        df['Year'] = df['Year'].astype(float).fillna(method='ffill').astype(int)

        # In Pandas we can represent Y/Q combinations as proper datetimes
        #https://stackoverflow.com/questions/53898482/clean-way-to-convert-quarterly-periods-to-datetime-in-pandas
        df.insert(loc=0,
                  column='Period',
                  value=pd.PeriodIndex(df.apply(lambda r:f'{r.Year}-{r.Quarter}', axis=1), freq='Q')
        )

    # reset index, try to fix dtypes, etc, (this should be the last
    # operation before returning!
    df = df.reset_index(drop=True).infer_objects()  

    return df

df = basic_cleanup(source_df['Table 1'])
df

1	Period	Year	Quarter	NI House Price Index	NI House Standardised Price	Quarterly Change	Annual Change
0	2005Q1	2005	Q1	100.883607	111920.268199	NaN	NaN
1	2005Q2	2005	Q2	104.564663	116004.031639	0.036488	NaN
2	2005Q3	2005	Q3	111.219000	123386.352673	0.063638	NaN
3	2005Q4	2005	Q4	115.083964	127674.143865	0.034751	NaN
4	2006Q1	2006	Q1	118.354129	131302.064422	0.028415	0.173175
...	...	...	...	...	...	...	...
63	2020Q4	2020	Q4	132.931827	147474.561707	0.026103	0.052326
64	2021Q1	2021	Q1	134.382831	149084.306040	0.010915	0.059421
65	2021Q2	2021	Q2	139.105050	154323.134643	0.035140	0.095724
66	2021Q3	2021	Q3	143.346066	159028.118093	0.030488	0.106491
67	2021Q4	2021	Q4	143.456594	159150.737832	0.000771	0.079174

68 rows × 7 columns

dest_df = {
    'Table 1': basic_cleanup(source_df['Table 1'])
}

len([k for k in source_df.keys() if k.startswith('Table')])

One down, 31 to go…

Table 2: NI HPI & Standardised Price Statistics by Property Type Q4 2021’

df = basic_cleanup(source_df['Table 2'])
df

1	Property Type	Index\n(Quarter 4 2021)	Percentage Change on Previous Quarter	Percentage Change over 12 months	Standardised Price\n(Quarter 4 2021)
0	Detached	143.488806	0.008491	0.093110	241131.373512
1	Semi-Detached	140.680694	0.004211	0.076953	153718.543755
2	Terrace	149.564169	-0.009577	0.078758	112831.710806
3	Apartment	133.376791	-0.014732	0.032761	116554.228620
4	All	143.456594	0.000771	0.079174	159150.737832

Those ‘\n (Quarter 4 2021)’ entries are unnecessary, so for this table, lets clear them

df.columns = [c.split('\n')[0] for c in df.columns]
df

	Property Type	Index	Percentage Change on Previous Quarter	Percentage Change over 12 months	Standardised Price
0	Detached	143.488806	0.008491	0.093110	241131.373512
1	Semi-Detached	140.680694	0.004211	0.076953	153718.543755
2	Terrace	149.564169	-0.009577	0.078758	112831.710806
3	Apartment	133.376791	-0.014732	0.032761	116554.228620
4	All	143.456594	0.000771	0.079174	159150.737832

dest_df['Table 2'] = df

Table 2a: NI Detached Property Price Index Q1 2005 - Q4 2021

df = basic_cleanup(source_df['Table 2a'])
df

1	Period	Year	Quarter	NI Detached Property Price Index	NI Detached Property Standardised Price	Quarterly Change	Annual Change
0	2005Q1	2005	Q1	95.465560	160428.832662	NaN	NaN
1	2005Q2	2005	Q2	100.974498	169686.542965	0.057706	NaN
2	2005Q3	2005	Q3	107.526236	180696.666810	0.064885	NaN
3	2005Q4	2005	Q4	110.279730	185323.883533	0.025608	NaN
4	2006Q1	2006	Q1	112.270506	188669.361197	0.018052	0.176032
...	...	...	...	...	...	...	...
63	2020Q4	2020	Q4	131.266614	220592.113069	0.026393	0.055357
64	2021Q1	2021	Q1	133.814014	224872.989982	0.019406	0.071429
65	2021Q2	2021	Q2	139.682380	234734.715703	0.043855	0.129844
66	2021Q3	2021	Q3	142.280745	239101.239764	0.018602	0.112515
67	2021Q4	2021	Q4	143.488806	241131.373512	0.008491	0.093110

68 rows × 7 columns

Table 2x: NI XXX Property Price Index Q1 2005 - Q4 2021

This table structure is consistent against the rest of the Table 2x cohort; mapping to the Property Types listed in Table 2.

For the time being, we can ignore these, but this will probably become a pain later on…

dest_df['Table 2']['Property Type']

       Detached
  Semi-Detached
        Terrace
      Apartment
            All
Name: Property Type, dtype: object

import re

table2s = re.compile('Table 2[a-z]')
assert table2s.match('Table 2') is None, 'Table 2 is matching itself!'
assert table2s.match('Table 20') is None, 'Table 2 is greedy!'
assert table2s.match('Table 2z') is not None, 'Table 2 is matching incorrectly!'

table2s = re.compile('Table 2[a-z]')
for table in source_df:
    if table2s.match(table):
        dest_df[table] = basic_cleanup(source_df[table])

len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])

(6, 26)

6 down, 26 to go.

Table 3: NI HPI & Standardised Price Statistics by New/Existing Resold Dwelling Type Q4 2021

These appear to be a similar structure of the Table 2’s… hopefully

df = basic_cleanup(source_df['Table 3'])
df.columns = [c.split('\n')[0] for c in df.columns] # Stolen from Table 2 Treatment
df

	Property Type	Index	Percentage Change on Previous Quarter	Percentage Change over 12 months	Standardised Price
0	New	141.769973	0.024877	0.072609	185966.524090
1	Existing Resold	143.518977	-0.004918	0.080771	152275.828046
2	All	143.456594	0.000771	0.079174	159150.737832

dest_df['Table 3'] = df

df = basic_cleanup(source_df['Table 3a'])
df

1	Period	Year	Quarter	NI New Dwellings Price Index	NI New Dwellings Standardised Price	Quarterly Change	Annual Change
0	2005Q1	2005	Q1	95.804706	125671.662611	NaN	NaN
1	2005Q2	2005	Q2	101.229223	132787.263460	0.056621	NaN
2	2005Q3	2005	Q3	106.243580	139364.837967	0.049535	NaN
3	2005Q4	2005	Q4	110.118105	144447.239874	0.036468	NaN
4	2006Q1	2006	Q1	113.624410	149046.629634	0.031841	0.186000
...	...	...	...	...	...	...	...
63	2020Q4	2020	Q4	132.173052	173377.779440	0.004103	0.036125
64	2021Q1	2021	Q1	133.772562	175475.933612	0.012102	0.027916
65	2021Q2	2021	Q2	136.969311	179669.264190	0.023897	0.046474
66	2021Q3	2021	Q3	138.328776	181452.540106	0.009925	0.050867
67	2021Q4	2021	Q4	141.769973	185966.524090	0.024877	0.072609

68 rows × 7 columns

table3s = re.compile('Table 3[a-z]')
for table in source_df:
    if table3s.match(table):
        dest_df[table] = basic_cleanup(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])

(10, 22)

Table 4: Number of Verified Residential Property Sales Q1 2005 - Q4 2021

Table 4 is not looking great

df = source_df['Table 4']
df

	Table 4: Number of Verified Residential Property Sales	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7
0	Verified Sales = Sales matched to a property i...	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Please note figures for the 2 most recent quar...	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Back to contents	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Sale Year	Sale Quarter	Detached	Semi-Detached	Terrace	Apartment	Total	NaN
4	2005\n	Quarter 1	809	894	1035	198	2936	NaN
...	...	...	...	...	...	...	...	...
84	2021	Quarter 1	2509	2477	1962	561	7509	NaN
85	NaN	Quarter 2	2668	2613	2056	604	7941	NaN
86	NaN	Quarter 3	2519	2797	2220	633	8169	Please note this figure is provisional and wil...
87	NaN	Quarter 4	1478	2100	2057	515	6150	and new dwellings sold in this quarter being a...
88	NaN	2021 Total	9174	9987	8295	2313	29769	NaN

89 rows × 8 columns

Of note; new offset for the header row at index 3 instead of index 1, due to lots of fluff at the start that is probably not going to be consistent between reports so that will almost certainly mess up my day in a few months.

Also, Quarter dates have now been shifted into ‘Quarter 1’ instead of ‘Q1’, which … meh 🤷‍♂️. More Egrigiously, it looks like ‘\n’ has leaked into some Sales Year values. Funtimes.

Finally, and possibly most annoying, the introduction of partial total lines is going to throw things off, and this isn’t a validation study, to stuff-em

In an effort not to over-complicate basic_cleanup, we can try and clean these table specific issues first;

df.iloc[:,1]=df.iloc[:,1].str.replace('Quarter ([1-4])',r'Q\1', regex=True)
df

	Table 4: Number of Verified Residential Property Sales	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7
0	Verified Sales = Sales matched to a property i...	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Please note figures for the 2 most recent quar...	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Back to contents	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Sale Year	Sale Quarter	Detached	Semi-Detached	Terrace	Apartment	Total	NaN
4	2005\n	Q1	809	894	1035	198	2936	NaN
...	...	...	...	...	...	...	...	...
84	2021	Q1	2509	2477	1962	561	7509	NaN
85	NaN	Q2	2668	2613	2056	604	7941	NaN
86	NaN	Q3	2519	2797	2220	633	8169	Please note this figure is provisional and wil...
87	NaN	Q4	1478	2100	2057	515	6150	and new dwellings sold in this quarter being a...
88	NaN	2021 Total	9174	9987	8295	2313	29769	NaN

89 rows × 8 columns

df=df[~df.iloc[:,1].str.contains('Total').fillna(False)]

# Lose the year new-lines (needs astype because non str lines are
# correctly inferred to be ints, so .str methods nan-out
with pd.option_context('mode.chained_assignment',None):
    df.iloc[:,0]=df.iloc[:,0].astype(str).str.replace('\n','')

df

	Table 4: Number of Verified Residential Property Sales	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7
0	Verified Sales = Sales matched to a property i...	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Please note figures for the 2 most recent quar...	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Back to contents	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Sale Year	Sale Quarter	Detached	Semi-Detached	Terrace	Apartment	Total	NaN
4	2005	Q1	809	894	1035	198	2936	NaN
...	...	...	...	...	...	...	...	...
82	nan	Q4	2808	2944	2170	555	8477	NaN
84	2021	Q1	2509	2477	1962	561	7509	NaN
85	nan	Q2	2668	2613	2056	604	7941	NaN
86	nan	Q3	2519	2797	2220	633	8169	Please note this figure is provisional and wil...
87	nan	Q4	1478	2100	2057	515	6150	and new dwellings sold in this quarter being a...

72 rows × 8 columns

basic_cleanup(df, offset=3)

3	Period	Year	Quarter	Detached	Semi-Detached	Terrace	Apartment	Total
0	2005Q1	2005	Q1	809	894	1035	198	2936
1	2005Q2	2005	Q2	2208	2474	2808	483	7973
2	2005Q3	2005	Q3	2297	2655	2952	539	8443
3	2005Q4	2005	Q4	2498	3003	3492	631	9624
4	2006Q1	2006	Q1	2185	2650	3158	594	8587
...	...	...	...	...	...	...	...	...
63	2020Q4	2020	Q4	2808	2944	2170	555	8477
64	2021Q1	2021	Q1	2509	2477	1962	561	7509
65	2021Q2	2021	Q2	2668	2613	2056	604	7941
66	2021Q3	2021	Q3	2519	2797	2220	633	8169
67	2021Q4	2021	Q4	1478	2100	2057	515	6150

68 rows × 8 columns

Thats awkward enough to get it’s own function…

def cleanup_table_4(df):
    """
    Table 4: Number of Verified Residential Property Sales
    * Regex 'Quarter X' to 'QX' in future 'Sales Quarter' column
    * Drop Year Total rows
    * Clear any Newlines from the future 'Sales Year' column
    * call `basic_cleanup` with offset=3
    """
    df.iloc[:,1]=df.iloc[:,1].str.replace('Quarter ([1-4])',r'Q\1', regex=True)
    df=df[~df.iloc[:,1].str.contains('Total').fillna(False)]
    # Lose the year new-lines (needs astype because non str lines are
    # correctly inferred to be ints, so .str methods nan-out
    with pd.option_context('mode.chained_assignment',None):
        df.iloc[:,0]=df.iloc[:,0].astype(str).str.replace('\n','')
    return basic_cleanup(df, offset=3)

cleanup_table_4(source_df['Table 4'].copy())

3	Period	Year	Quarter	Detached	Semi-Detached	Terrace	Apartment	Total
0	2005Q1	2005	Q1	809	894	1035	198	2936
1	2005Q2	2005	Q2	2208	2474	2808	483	7973
2	2005Q3	2005	Q3	2297	2655	2952	539	8443
3	2005Q4	2005	Q4	2498	3003	3492	631	9624
4	2006Q1	2006	Q1	2185	2650	3158	594	8587
...	...	...	...	...	...	...	...	...
63	2020Q4	2020	Q4	2808	2944	2170	555	8477
64	2021Q1	2021	Q1	2509	2477	1962	561	7509
65	2021Q2	2021	Q2	2668	2613	2056	604	7941
66	2021Q3	2021	Q3	2519	2797	2220	633	8169
67	2021Q4	2021	Q4	1478	2100	2057	515	6150

68 rows × 8 columns

dest_df['Table 4'] = cleanup_table_4(source_df['Table 4'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])

(11, 21)

Table 5: HPI & Standardised Price for each Local Government District in NI

This nearly works but structurally requires a multi-index column to make sense….

df = basic_cleanup(source_df['Table 5'])
df

1	Period	Year	Quarter	Antrim and Newtownabbey HPI	Antrim and Newtownabbey Standardised Price	Ards and North Down HPI	Ards and North Down Standardised Price	Armagh City, Banbridge and Craigavon HPI	Armagh City, Banbridge and Craigavon Standardised Price	Belfast HPI	...	Fermanagh and Omagh HPI	Fermanagh and Omagh Standardised Price	Lisburn and Castlereagh HPI	Lisburn and Castlereagh Standardised Price	Mid and East Antrim HPI	Mid and East Antrim Standardised Price	Mid Ulster Standardised HPI	Mid Ulster Standardised Price	Newry, Mourne and Down HPI	Newry, Mourne and Down Standardised Price
0	2005Q1	2005	Q1	99.903277	114851.528270	97.150602	130398.569667	102.245597	100785.145986	99.839849	...	109.429237	104874.980231	95.958322	128828.327513	102.246427	105865.408901	102.714778	114882.211239	100.810773	113420.880186
1	2005Q2	2005	Q2	99.723509	114644.862732	100.794472	137133.037807	106.325843	104807.109982	100.589870	...	117.239850	112360.529330	100.164437	134475.225477	104.443325	108140.065924	110.386311	123462.501283	111.965743	125971.191415
2	2005Q3	2005	Q3	107.940849	124091.740608	102.167971	137133.037807	110.006212	108434.910333	109.614861	...	125.900145	120660.397585	106.757895	143327.237126	112.748278	116738.970434	117.595723	131525.929577	117.235685	131900.333698
3	2005Q4	2005	Q4	111.934696	128683.175719	106.396379	142808.538807	116.073031	114415.072260	110.728237	...	130.781315	125338.422216	111.307116	149434.769200	114.584090	118639.759900	121.851999	136286.397473	123.628047	139092.296651
4	2006Q1	2006	Q1	113.494351	130476.197845	109.206160	146579.915492	121.831058	120090.852733	112.326582	...	135.555749	129914.152078	110.539212	148403.823796	115.878975	119980.477260	129.544702	144890.365875	127.449613	143391.890242
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
63	2020Q4	2020	Q4	131.045293	150653.238745	123.824862	166201.593253	130.857866	128988.723586	133.525177	...	142.381814	136456.127817	129.797418	174258.823716	130.813697	135443.809729	124.617456	139379.446212	138.815696	156179.721555
64	2021Q1	2021	Q1	133.481101	153453.510344	128.398202	172340.072904	130.382658	128520.303209	135.257679	...	141.464114	135576.621629	131.166305	176096.615474	131.268719	135914.936888	127.657615	142779.737045	138.481347	155803.549899
65	2021Q2	2021	Q2	137.827568	158450.326506	130.543464	175219.510303	137.439265	135476.115278	138.558752	...	146.059468	139980.724158	134.815374	180995.654429	138.663140	143571.081234	129.231669	144540.250870	145.000829	163138.533592
66	2021Q3	2021	Q3	141.167257	162289.724156	134.116570	180015.444071	143.522521	141472.480114	142.250634	...	149.743153	143511.101233	136.178634	182825.891020	140.051373	145008.450168	135.759680	151841.560426	152.177814	171213.265699
67	2021Q4	2021	Q4	141.900751	163132.969278	134.883745	181045.170599	140.957739	138944.332704	140.072076	...	156.033991	149540.124905	137.683726	184846.544332	141.311289	146312.960523	136.847809	153058.587031	152.044892	171063.717288

68 rows × 25 columns

# Two inner-columns per LGD
lgds = df.columns[3:].str.replace(' HPI','').str.replace(' Standardised Price','').unique()
lgds

Index(['Antrim and Newtownabbey', 'Ards and North Down',
       'Armagh City, Banbridge and Craigavon', 'Belfast',
       'Causeway Coast and Glens', 'Derry City and Strabane',
       'Fermanagh and Omagh', 'Lisburn and Castlereagh', 'Mid and East Antrim',
       'Mid Ulster Standardised', 'Mid Ulster', 'Newry, Mourne and Down'],
      dtype='object', name=1)

For some reason; Mid-ulster has a ‘Standardised HPI’ which throws off the above trick, so we gotta make it ugly…

lgds = df.columns[3:].str.replace(' Standardised HPI',' HPI')\
    .str.replace(' HPI','')\
    .str.replace(' Standardised Price','').unique()
lgds

Index(['Antrim and Newtownabbey', 'Ards and North Down',
       'Armagh City, Banbridge and Craigavon', 'Belfast',
       'Causeway Coast and Glens', 'Derry City and Strabane',
       'Fermanagh and Omagh', 'Lisburn and Castlereagh', 'Mid and East Antrim',
       'Mid Ulster', 'Newry, Mourne and Down'],
      dtype='object', name=1)

df.columns = [*df.columns[:3], *pd.MultiIndex.from_product([lgds,['Index','Price']], names=['LGD','Metric'])]
df

	Period	Year	Quarter	(Antrim and Newtownabbey, Index)	(Antrim and Newtownabbey, Price)	(Ards and North Down, Index)	(Ards and North Down, Price)	(Armagh City, Banbridge and Craigavon, Index)	(Armagh City, Banbridge and Craigavon, Price)	(Belfast, Index)	...	(Fermanagh and Omagh, Index)	(Fermanagh and Omagh, Price)	(Lisburn and Castlereagh, Index)	(Lisburn and Castlereagh, Price)	(Mid and East Antrim, Index)	(Mid and East Antrim, Price)	(Mid Ulster, Index)	(Mid Ulster, Price)	(Newry, Mourne and Down, Index)	(Newry, Mourne and Down, Price)
0	2005Q1	2005	Q1	99.903277	114851.528270	97.150602	130398.569667	102.245597	100785.145986	99.839849	...	109.429237	104874.980231	95.958322	128828.327513	102.246427	105865.408901	102.714778	114882.211239	100.810773	113420.880186
1	2005Q2	2005	Q2	99.723509	114644.862732	100.794472	137133.037807	106.325843	104807.109982	100.589870	...	117.239850	112360.529330	100.164437	134475.225477	104.443325	108140.065924	110.386311	123462.501283	111.965743	125971.191415
2	2005Q3	2005	Q3	107.940849	124091.740608	102.167971	137133.037807	110.006212	108434.910333	109.614861	...	125.900145	120660.397585	106.757895	143327.237126	112.748278	116738.970434	117.595723	131525.929577	117.235685	131900.333698
3	2005Q4	2005	Q4	111.934696	128683.175719	106.396379	142808.538807	116.073031	114415.072260	110.728237	...	130.781315	125338.422216	111.307116	149434.769200	114.584090	118639.759900	121.851999	136286.397473	123.628047	139092.296651
4	2006Q1	2006	Q1	113.494351	130476.197845	109.206160	146579.915492	121.831058	120090.852733	112.326582	...	135.555749	129914.152078	110.539212	148403.823796	115.878975	119980.477260	129.544702	144890.365875	127.449613	143391.890242
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
63	2020Q4	2020	Q4	131.045293	150653.238745	123.824862	166201.593253	130.857866	128988.723586	133.525177	...	142.381814	136456.127817	129.797418	174258.823716	130.813697	135443.809729	124.617456	139379.446212	138.815696	156179.721555
64	2021Q1	2021	Q1	133.481101	153453.510344	128.398202	172340.072904	130.382658	128520.303209	135.257679	...	141.464114	135576.621629	131.166305	176096.615474	131.268719	135914.936888	127.657615	142779.737045	138.481347	155803.549899
65	2021Q2	2021	Q2	137.827568	158450.326506	130.543464	175219.510303	137.439265	135476.115278	138.558752	...	146.059468	139980.724158	134.815374	180995.654429	138.663140	143571.081234	129.231669	144540.250870	145.000829	163138.533592
66	2021Q3	2021	Q3	141.167257	162289.724156	134.116570	180015.444071	143.522521	141472.480114	142.250634	...	149.743153	143511.101233	136.178634	182825.891020	140.051373	145008.450168	135.759680	151841.560426	152.177814	171213.265699
67	2021Q4	2021	Q4	141.900751	163132.969278	134.883745	181045.170599	140.957739	138944.332704	140.072076	...	156.033991	149540.124905	137.683726	184846.544332	141.311289	146312.960523	136.847809	153058.587031	152.044892	171063.717288

68 rows × 25 columns

We could turn this into a proper multiindex but it would mean pushing the Period/Year/Quarter columns into keys which would be inconsistent behaviour with the rest of the ‘cleaned’ dataset, so that can be a downstream problem; at least we’ve got the relevant metrics consistent!

def cleanup_table_5(df):
    """
    Table 5: Standardised House Price & Index for each Local Government District Northern Ireland
    *
    """
    # Basic Cleanup first
    df = basic_cleanup(df)
    # Build multi-index of LGD / Metric [Index,Price]
    # Two inner-columns per LGD
    lgds = df.columns[3:].str.replace(' Standardised HPI',' HPI')\
        .str.replace(' HPI','')\
        .str.replace(' Standardised Price','')\
        .unique()
    df.columns = [*df.columns[:3], *pd.MultiIndex.from_product([lgds,['Index','Price']], names=['LGD','Metric'])]
    return df

cleanup_table_5(source_df['Table 5'])

	Period	Year	Quarter	(Antrim and Newtownabbey, Index)	(Antrim and Newtownabbey, Price)	(Ards and North Down, Index)	(Ards and North Down, Price)	(Armagh City, Banbridge and Craigavon, Index)	(Armagh City, Banbridge and Craigavon, Price)	(Belfast, Index)	...	(Fermanagh and Omagh, Index)	(Fermanagh and Omagh, Price)	(Lisburn and Castlereagh, Index)	(Lisburn and Castlereagh, Price)	(Mid and East Antrim, Index)	(Mid and East Antrim, Price)	(Mid Ulster, Index)	(Mid Ulster, Price)	(Newry, Mourne and Down, Index)	(Newry, Mourne and Down, Price)
0	2005Q1	2005	Q1	99.903277	114851.528270	97.150602	130398.569667	102.245597	100785.145986	99.839849	...	109.429237	104874.980231	95.958322	128828.327513	102.246427	105865.408901	102.714778	114882.211239	100.810773	113420.880186
1	2005Q2	2005	Q2	99.723509	114644.862732	100.794472	137133.037807	106.325843	104807.109982	100.589870	...	117.239850	112360.529330	100.164437	134475.225477	104.443325	108140.065924	110.386311	123462.501283	111.965743	125971.191415
2	2005Q3	2005	Q3	107.940849	124091.740608	102.167971	137133.037807	110.006212	108434.910333	109.614861	...	125.900145	120660.397585	106.757895	143327.237126	112.748278	116738.970434	117.595723	131525.929577	117.235685	131900.333698
3	2005Q4	2005	Q4	111.934696	128683.175719	106.396379	142808.538807	116.073031	114415.072260	110.728237	...	130.781315	125338.422216	111.307116	149434.769200	114.584090	118639.759900	121.851999	136286.397473	123.628047	139092.296651
4	2006Q1	2006	Q1	113.494351	130476.197845	109.206160	146579.915492	121.831058	120090.852733	112.326582	...	135.555749	129914.152078	110.539212	148403.823796	115.878975	119980.477260	129.544702	144890.365875	127.449613	143391.890242
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
63	2020Q4	2020	Q4	131.045293	150653.238745	123.824862	166201.593253	130.857866	128988.723586	133.525177	...	142.381814	136456.127817	129.797418	174258.823716	130.813697	135443.809729	124.617456	139379.446212	138.815696	156179.721555
64	2021Q1	2021	Q1	133.481101	153453.510344	128.398202	172340.072904	130.382658	128520.303209	135.257679	...	141.464114	135576.621629	131.166305	176096.615474	131.268719	135914.936888	127.657615	142779.737045	138.481347	155803.549899
65	2021Q2	2021	Q2	137.827568	158450.326506	130.543464	175219.510303	137.439265	135476.115278	138.558752	...	146.059468	139980.724158	134.815374	180995.654429	138.663140	143571.081234	129.231669	144540.250870	145.000829	163138.533592
66	2021Q3	2021	Q3	141.167257	162289.724156	134.116570	180015.444071	143.522521	141472.480114	142.250634	...	149.743153	143511.101233	136.178634	182825.891020	140.051373	145008.450168	135.759680	151841.560426	152.177814	171213.265699
67	2021Q4	2021	Q4	141.900751	163132.969278	134.883745	181045.170599	140.957739	138944.332704	140.072076	...	156.033991	149540.124905	137.683726	184846.544332	141.311289	146312.960523	136.847809	153058.587031	152.044892	171063.717288

68 rows × 25 columns

dest_df['Table 5']=cleanup_table_5(source_df['Table 5'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])

(12, 20)

Table 5a: Number of Verified Residential Property Sales by Local Government District

This one has a new problem; the Sale Year/Quarter is now squished together. This will do a few terrible things to our basic_cleanup so this needs to be done ahead of cleanup. Also has annual total lines.

df = source_df['Table 5a'].copy()
df

	Table 5a: Number of Verified Residential Property Sales by Local Government District	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9	Unnamed: 10	Unnamed: 11
0	Please note figures for the 2 most recent quar...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Back to contents	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Sale Year/Quarter	Antrim and Newtownabbey	Ards and North Down	Armagh City, Banbridge and Craigavon	Belfast	Causeway Coast and Glens	Derry City and Strabane	Fermanagh and Omagh	Lisburn and Castlereagh	Mid and East Antrim	Mid Ulster	Newry, Mourne and Down
3	Q1 2005	236	320	333	623	236	226	138	219	188	176	241
4	Q2 2005	735	857	961	1549	712	637	316	655	618	428	505
...	...	...	...	...	...	...	...	...	...	...	...	...
85	Q3 2021	739	989	931	1584	625	485	325	869	671	377	574
86	Q4 2021	532	702	730	1272	417	405	250	572	474	359	437
87	2021 Total	2647	3685	3333	5934	2327	1803	1181	3053	2346	1460	2000
88	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
89	Please note figures for the 2 most recent quar...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

90 rows × 12 columns

dates = df.iloc[:,0].str.extract('(Q[1-4]) ([0-9]{4})').rename(columns={0:'Quarter',1:'Year'})
for c in ['Quarter','Year']:# insert the dates in order, so they come out in reverse in the insert
    df.insert(1,c,dates[c])
    df.iloc[2,1]=c # Need to have the right colname for when `basic_cleanup` is called.

df.iloc[2,1]=c
df

	Table 5a: Number of Verified Residential Property Sales by Local Government District	Year	Quarter	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9	Unnamed: 10	Unnamed: 11
0	Please note figures for the 2 most recent quar...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Back to contents	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Sale Year/Quarter	Year	Quarter	Antrim and Newtownabbey	Ards and North Down	Armagh City, Banbridge and Craigavon	Belfast	Causeway Coast and Glens	Derry City and Strabane	Fermanagh and Omagh	Lisburn and Castlereagh	Mid and East Antrim	Mid Ulster	Newry, Mourne and Down
3	Q1 2005	2005	Q1	236	320	333	623	236	226	138	219	188	176	241
4	Q2 2005	2005	Q2	735	857	961	1549	712	637	316	655	618	428	505
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
85	Q3 2021	2021	Q3	739	989	931	1584	625	485	325	869	671	377	574
86	Q4 2021	2021	Q4	532	702	730	1272	417	405	250	572	474	359	437
87	2021 Total	NaN	NaN	2647	3685	3333	5934	2327	1803	1181	3053	2346	1460	2000
88	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
89	Please note figures for the 2 most recent quar...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

90 rows × 14 columns

df=df[~df.iloc[:,0].str.contains('Total').fillna(False)]

df.iloc[1,2]=c

basic_cleanup(df,offset=2)

2	Period	Sale Year/Quarter	Year	Quarter	Antrim and Newtownabbey	Ards and North Down	Armagh City, Banbridge and Craigavon	Belfast	Causeway Coast and Glens	Derry City and Strabane	Fermanagh and Omagh	Lisburn and Castlereagh	Mid and East Antrim	Mid Ulster	Newry, Mourne and Down
0	2005Q1	Q1 2005	2005	Q1	236	320	333	623	236	226	138	219	188	176	241
1	2005Q2	Q2 2005	2005	Q2	735	857	961	1549	712	637	316	655	618	428	505
2	2005Q3	Q3 2005	2005	Q3	757	960	968	1722	714	632	365	654	686	403	582
3	2005Q4	Q4 2005	2005	Q4	893	995	1199	1943	834	746	385	670	759	489	711
4	2006Q1	Q1 2006	2006	Q1	761	933	1038	1686	763	708	348	600	668	515	567
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
63	2020Q4	Q4 2020	2020	Q4	756	1052	974	1565	728	496	336	830	685	419	636
64	2021Q1	Q1 2021	2021	Q1	652	976	849	1497	610	466	290	762	572	349	486
65	2021Q2	Q2 2021	2021	Q2	724	1018	823	1581	675	447	316	850	629	375	503
66	2021Q3	Q3 2021	2021	Q3	739	989	931	1584	625	485	325	869	671	377	574
67	2021Q4	Q4 2021	2021	Q4	532	702	730	1272	417	405	250	572	474	359	437

68 rows × 15 columns

def cleanup_table_5a(df):
    """
    Table 5a: Number of Verified Residential Property Sales by Local Government District
    * Parse the 'Sale Year/Quarter' to two separate cols
    * Insert future-headers for Quarter and Year cols
    * Remove rows with 'total' in the first column
    * Disregard the 'Sale Year/Quarter' column
    * perform `basic_cleanup` with offset=2
    """
    # Safety first
    df=df.copy()

    # Extract 'Quarter' and 'Year' columns from the future 'Sale Year/Quarter' column
    dates = df.iloc[:,0].str.extract('(Q[1-4]) ([0-9]{4})').rename(columns={0:'Quarter',1:'Year'})
    for c in ['Quarter','Year']:# insert the dates in order, so they come out in reverse in the insert
        df.insert(1,c,dates[c])
        df.iloc[2,1]=c # Need to have the right colname for when `basic_cleanup` is called.

    # Remove 'total' rows from the future 'Sale Year/Quarter' column
    df=df[~df.iloc[:,0].str.contains('Total').fillna(False)]

    # Remove the 'Sale Year/Quarter' column all together
    df = df.iloc[:,1:]

    # Standard cleanup
    df = basic_cleanup(df, offset=2)

    return df

cleanup_table_5a(source_df['Table 5a'])

2	Period	Year	Quarter	Antrim and Newtownabbey	Ards and North Down	Armagh City, Banbridge and Craigavon	Belfast	Causeway Coast and Glens	Derry City and Strabane	Fermanagh and Omagh	Lisburn and Castlereagh	Mid and East Antrim	Mid Ulster	Newry, Mourne and Down
0	2005Q1	2005	Q1	236	320	333	623	236	226	138	219	188	176	241
1	2005Q2	2005	Q2	735	857	961	1549	712	637	316	655	618	428	505
2	2005Q3	2005	Q3	757	960	968	1722	714	632	365	654	686	403	582
3	2005Q4	2005	Q4	893	995	1199	1943	834	746	385	670	759	489	711
4	2006Q1	2006	Q1	761	933	1038	1686	763	708	348	600	668	515	567
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
63	2020Q4	2020	Q4	756	1052	974	1565	728	496	336	830	685	419	636
64	2021Q1	2021	Q1	652	976	849	1497	610	466	290	762	572	349	486
65	2021Q2	2021	Q2	724	1018	823	1581	675	447	316	850	629	375	503
66	2021Q3	2021	Q3	739	989	931	1584	625	485	325	869	671	377	574
67	2021Q4	2021	Q4	532	702	730	1272	417	405	250	572	474	359	437

68 rows × 14 columns

dest_df['Table 5a']=cleanup_table_5a(source_df['Table 5a'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])

(13, 19)

Table 6: Standardised House Price & Index for all Urban and Rural areas in NI

Wee buns, thankfully. Still mixing the ‘HPI’ vs ‘Index’, but that’s a downstream problem

df = basic_cleanup(source_df['Table 6'])
df

1	Period	Year	Quarter	Urban Areas HPI	Urban Areas Standardised Price	Rural Areas HPI	Rural Areas Standardised Price
0	2005Q1	2005	Q1	101.309947	107723.320891	100.109860	124292.601178
1	2005Q2	2005	Q2	104.402908	111012.079786	105.467951	138865.721275
2	2005Q3	2005	Q3	111.163485	118200.631818	111.847591	138865.721275
3	2005Q4	2005	Q4	114.871996	122143.908606	116.175119	144238.615701
4	2006Q1	2006	Q1	118.187559	125669.361667	119.329374	148154.818847
...	...	...	...	...	...	...	...
63	2020Q4	2020	Q4	132.610763	141005.619094	133.854953	166189.226014
64	2021Q1	2021	Q1	134.077654	142565.370205	135.267264	167942.698911
65	2021Q2	2021	Q2	138.575881	147348.355880	140.501443	174441.256673
66	2021Q3	2021	Q3	142.840470	151882.912133	144.695321	179648.216283
67	2021Q4	2021	Q4	142.375033	151388.010443	146.115278	181411.180623

68 rows × 7 columns

dest_df['Table 6']=basic_cleanup(source_df['Table 6'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])

(14, 18)

Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times

Nearly-wee-buns; but this one doesn’t have Year or Quarter headers, and the extra \n (Ref: Q1 2015) added, which will complicate downstream analysis if that changes over time…

df = source_df['Table 7'].copy()
df.head()

	Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9	Unnamed: 10
0	Back to contents	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	Drive time within 20mins of town of 10,000 or ...	Drive time within 20mins of town of 10,000 or ...	Drive time outside 20mins of town of 10,000 or...	Drive time outside 20mins of town of 10,000 or...	Drive time within 1hr of Belfast Index	Drive time within 1hr of Belfast Price\n(Ref:...	Drive time outside 1hr of Belfast Index	Drive time outside 1hr of Belfast Price\n(Ref...	NaN
2	2015	Q1	100	124898.676844	100	122528.427865	100	128955.274996	100	111866.40498	100.0
3	NaN	Q2	103.166882	128854.070701	103.003978	126209.155363	103.025069	132856.260679	103.349406	115613.265107	100.0
4	NaN	Q3	105.851629	132207.28391	105.619893	129414.394046	105.031061	135443.093443	107.811831	120605.219276	100.0

df.iloc[1,0] = 'Year'
df.iloc[1,1] = 'Quarter'
df.head()

	Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9	Unnamed: 10
0	Back to contents	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Year	Quarter	Drive time within 20mins of town of 10,000 or ...	Drive time within 20mins of town of 10,000 or ...	Drive time outside 20mins of town of 10,000 or...	Drive time outside 20mins of town of 10,000 or...	Drive time within 1hr of Belfast Index	Drive time within 1hr of Belfast Price\n(Ref:...	Drive time outside 1hr of Belfast Index	Drive time outside 1hr of Belfast Price\n(Ref...	NaN
2	2015	Q1	100	124898.676844	100	122528.427865	100	128955.274996	100	111866.40498	100.0
3	NaN	Q2	103.166882	128854.070701	103.003978	126209.155363	103.025069	132856.260679	103.349406	115613.265107	100.0
4	NaN	Q3	105.851629	132207.28391	105.619893	129414.394046	105.031061	135443.093443	107.811831	120605.219276	100.0

basic_cleanup(df).head()

1	Period	Year	Quarter	Drive time within 20mins of town of 10,000 or more Index	Drive time within 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)	Drive time outside 20mins of town of 10,000 or more Index	Drive time outside 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)	Drive time within 1hr of Belfast Index	Drive time within 1hr of Belfast Price\n(Ref: Q1 2015)	Drive time outside 1hr of Belfast Index	Drive time outside 1hr of Belfast Price\n(Ref: Q1 2015)
0	2015Q1	2015	Q1	100.000000	124898.676844	100.000000	122528.427865	100.000000	128955.274996	100.000000	111866.404980
1	2015Q2	2015	Q2	103.166882	128854.070701	103.003978	126209.155363	103.025069	132856.260679	103.349406	115613.265107
2	2015Q3	2015	Q3	105.851629	132207.283910	105.619893	129414.394046	105.031061	135443.093443	107.811831	120605.219276
3	2015Q4	2015	Q4	107.430656	134179.467306	106.924715	131013.172436	106.240145	137002.270924	110.075053	123137.004353
4	2016Q1	2016	Q1	108.909364	136026.354775	108.368772	132782.552750	107.604887	138762.178070	111.828874	125098.941485

def cleanup_table_7(df):
    """
    Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times
    * Insert Year/Quarter future-headers
    * Clean normally
    # TODO THIS MIGHT BE VALID FOR MULTIINDEXING ON DRIVETIME/[Index/Price]
    """
    df = df.copy()
    df.iloc[1,0] = 'Year'
    df.iloc[1,1] = 'Quarter'
    df = basic_cleanup(df)
    return df

cleanup_table_7(source_df['Table 7'])

1	Period	Year	Quarter	Drive time within 20mins of town of 10,000 or more Index	Drive time within 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)	Drive time outside 20mins of town of 10,000 or more Index	Drive time outside 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)	Drive time within 1hr of Belfast Index	Drive time within 1hr of Belfast Price\n(Ref: Q1 2015)	Drive time outside 1hr of Belfast Index	Drive time outside 1hr of Belfast Price\n(Ref: Q1 2015)
0	2015Q1	2015	Q1	100.000000	124898.676844	100.000000	122528.427865	100.000000	128955.274996	100.000000	111866.404980
1	2015Q2	2015	Q2	103.166882	128854.070701	103.003978	126209.155363	103.025069	132856.260679	103.349406	115613.265107
2	2015Q3	2015	Q3	105.851629	132207.283910	105.619893	129414.394046	105.031061	135443.093443	107.811831	120605.219276
3	2015Q4	2015	Q4	107.430656	134179.467306	106.924715	131013.172436	106.240145	137002.270924	110.075053	123137.004353
4	2016Q1	2016	Q1	108.909364	136026.354775	108.368772	132782.552750	107.604887	138762.178070	111.828874	125098.941485
5	2016Q2	2016	Q2	111.263396	138966.509219	109.739250	134461.778232	110.208116	142119.179594	111.991819	125281.221326
6	2016Q3	2016	Q3	113.419541	141659.506269	112.426034	137753.851946	112.202571	144691.133778	115.398242	129091.864904
7	2016Q4	2016	Q4	113.928074	142294.657346	113.219995	138726.680412	112.508660	145085.851672	116.952798	130830.890712
8	2017Q1	2017	Q1	114.262386	142712.207695	113.549623	139130.567598	112.823330	145491.635911	117.341538	131265.759778
9	2017Q2	2017	Q2	115.566592	144341.144812	115.829688	141924.295411	114.288862	147381.515712	119.397027	133565.161466
10	2017Q3	2017	Q3	116.716428	145777.273752	117.061832	143434.022704	115.192251	148546.484147	121.273582	135664.395891
11	2017Q4	2017	Q4	117.925340	147287.189812	118.541541	145247.086931	116.101165	149718.576638	123.577530	138241.740123
12	2018Q1	2018	Q1	118.482802	147983.452250	120.184585	147260.282195	117.972698	152132.017714	122.075418	136561.381608
13	2018Q2	2018	Q2	119.443631	149183.514842	120.710551	147904.740501	117.686726	151763.241043	125.603705	140508.349303
14	2018Q3	2018	Q3	121.408923	151638.138779	122.222994	149757.912837	119.614730	154249.503782	127.210974	142306.342946
15	2018Q4	2018	Q4	123.531419	154289.108214	125.254013	153471.772944	121.462721	156632.586327	130.953652	146493.143086
16	2019Q1	2019	Q1	122.499375	153000.098716	123.207618	150964.357154	120.087144	154858.706540	129.712835	145105.085124
17	2019Q2	2019	Q2	124.397722	155371.109292	125.151589	153346.274193	122.486275	157952.513361	130.480634	145963.994647
18	2019Q3	2019	Q3	126.533407	158038.551430	128.647747	157630.061642	124.978137	161165.900455	133.258945	149071.990904
19	2019Q4	2019	Q4	127.126748	158779.626458	127.784267	156572.053236	124.547746	160610.887802	134.709059	150694.181735
20	2020Q1	2020	Q1	127.090324	158734.133127	128.619521	157595.476721	124.744271	160864.317472	135.068006	151095.722593
21	2020Q2	2020	Q2	127.200617	158871.887068	127.231209	155894.400318	125.021931	161222.374474	132.935126	148709.746207
22	2020Q3	2020	Q3	129.627870	161903.493901	131.083667	160614.756005	127.873046	164899.037745	135.952621	152085.309777
23	2020Q4	2020	Q4	132.853240	165931.938294	135.873737	166483.954356	130.725554	168577.497418	142.032974	158887.182045
24	2021Q1	2021	Q1	134.414458	167881.879606	136.978231	167837.272586	132.479831	170839.730568	142.532053	159445.483290
25	2021Q2	2021	Q2	139.417605	174130.744141	142.727536	174881.805508	137.737680	177620.004609	147.652942	165174.037638
26	2021Q3	2021	Q3	143.303934	178984.717757	147.615067	180870.420630	141.482937	182449.710341	153.161532	171336.300173
27	2021Q4	2021	Q4	144.731984	180768.332630	149.013971	182584.475980	142.365385	183587.673223	156.204293	174740.127539

dest_df['Table 7'] = cleanup_table_7(source_df['Table 7'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])

(15, 17)

Table 8: Number of Verified Residential Property Sales of properties in urban and rural areas and properties in rural areas by drive times witihn towns of 10,000 or more and within 1 hour of Belfast

We’re now getting into the swing of this!

This one has two similar problems we’ve already seen; Munged Quarters/Years (this time with no header on that column…), and annual Total rows.

Vee must deeel with it

cleanup_table_5a(source_df['Table 8']).head()

2	Period	Year	Quarter	Urban	Rural	Drive time within 20mins of town of 10,000 or more	Drive time outside 20mins of town of 10,000 or more	Drive time within 1hr of Belfast	Drive time outside 1hr of Belfast
0	2015Q1	2015	Q1	3294	1322	898	424	976	346
1	2015Q2	2015	Q2	3789	1500	1034	466	1142	358
2	2015Q3	2015	Q3	4199	1640	1145	495	1250	390
3	2015Q4	2015	Q4	4396	1780	1223	557	1342	438
4	2016Q1	2016	Q1	4424	1731	1171	560	1263	468

cleanup_table_8 = cleanup_table_5a

dest_df['Table 8'] = cleanup_table_8(source_df['Table 8'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])

(16, 16)

Table 9: NI Average Sales Prices Q1 2005 - Q4 2021

Wee buns

basic_cleanup(source_df['Table 9'])

1	Period	Year	Quarter	Simple Mean	Simple Median	Standardised Price (HPI)
0	2005Q1	2005	Q1	115912.942222	100000	111920.268199
1	2005Q2	2005	Q2	120481.290591	105000	116004.031639
2	2005Q3	2005	Q3	128866.225917	115000	123386.352673
3	2005Q4	2005	Q4	129649.092074	117000	127674.143865
4	2006Q1	2006	Q1	132972.115070	120000	131302.064422
...	...	...	...	...	...	...
63	2020Q4	2020	Q4	171803.199843	150000	147474.561707
64	2021Q1	2021	Q1	176218.214924	150000	149084.306040
65	2021Q2	2021	Q2	184144.458946	154950	154323.134643
66	2021Q3	2021	Q3	173490.230508	155000	159028.118093
67	2021Q4	2021	Q4	159965.154863	141000	159150.737832

68 rows × 6 columns

dest_df['Table 9'] = basic_cleanup(source_df['Table 9'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])

(17, 15)

Table 9x: NI Average Sale Prices XXXXX Property Q1 2005 - Q4 2021

These are very similar to Tables 2x; i.e. they’re broken down by property type.

Annoyingly, they don’t follow the same structure as Tables 2x or Table 9 because they don’t include the Year/Quarter headers.

If that reminds you of anything, it’s because Table 7 was the same…

cleanup_table_7(source_df['Table 9a'])

1	Period	Year	Quarter	Simple Mean	Simple Median	Standardised Price (HPI)
0	2005Q1	2005	Q1	166314.816092	149972.5	160428.832662
1	2005Q2	2005	Q2	173370.669076	155000.0	169686.542965
2	2005Q3	2005	Q3	185397.896739	165000.0	180696.666810
3	2005Q4	2005	Q4	186545.119355	165000.0	185323.883533
4	2006Q1	2006	Q1	191328.398119	173000.0	188669.361197
...	...	...	...	...	...	...
63	2020Q4	2020	Q4	243712.512641	220000.0	220592.113069
64	2021Q1	2021	Q1	254182.439174	225000.0	224872.989982
65	2021Q2	2021	Q2	268755.621299	235000.0	234734.715703
66	2021Q3	2021	Q3	245860.399289	225000.0	239101.239764
67	2021Q4	2021	Q4	244468.040738	219000.0	241131.373512

68 rows × 6 columns

cleanup_table_9x = cleanup_table_7

table9s = re.compile('Table 9[a-z]')
for table in source_df:
    if table9s.match(table):
        dest_df[table] = cleanup_table_9x(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])

(21, 11)

Table 10x: Number of Verified Residential Property Sales by Type in XXXXX

Surprisingly, we’re in the home straight; the remaining tables are all of the same structure, with familiar awkwardness.,,

Annual-Total Rows
Munged Year/Quarter Column
That column having a silly (but contextual) name
a different offset

Fortunately, we already have something like that from dealing with Table 5a!

source_df['Table 10a']

	Table 10a: Number of Verified Residential Property Sales by Type in Antrim and Newtownabbey Council	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5
0	Please note figures for the 2 most recent quar...	NaN	NaN	NaN	NaN	NaN
1	Back to contents	NaN	NaN	NaN	NaN	NaN
2	ANTRIM AND NEWTOWNABBEY	Apartments	Detached	Semi-Detached	Terrace	Total
3	Q1 2005	10	61	78	87	236
4	Q2 2005	46	213	216	260	735
...	...	...	...	...	...	...
83	Q1 2021	52	222	212	166	652
84	Q2 2021	56	217	275	176	724
85	Q3 2021	47	222	268	202	739
86	Q4 2021	50	117	176	189	532
87	2021 Total	205	778	931	733	2647

88 rows × 6 columns

cleanup_table_5a(source_df['Table 10a'])

2	Period	Year	Quarter	Apartments	Detached	Semi-Detached	Terrace	Total
0	2005Q1	2005	Q1	10	61	78	87	236
1	2005Q2	2005	Q2	46	213	216	260	735
2	2005Q3	2005	Q3	46	214	238	259	757
3	2005Q4	2005	Q4	65	227	270	331	893
4	2006Q1	2006	Q1	48	186	231	296	761
...	...	...	...	...	...	...	...	...
63	2020Q4	2020	Q4	53	248	268	187	756
64	2021Q1	2021	Q1	52	222	212	166	652
65	2021Q2	2021	Q2	56	217	275	176	724
66	2021Q3	2021	Q3	47	222	268	202	739
67	2021Q4	2021	Q4	50	117	176	189	532

68 rows × 8 columns

cleanup_table_10x = cleanup_table_5a

table10s = re.compile('Table 10[a-z]')
for table in source_df:
    if table10s.match(table):
        dest_df[table] = cleanup_table_10x(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])

(32, 0)

And We’re Done!

So, we can see that while government open data is a pain, at least it’s a … consistently inconsistent pain?

I hope this was helpful to someone else.

dest_df['Contents'] = source_df['Contents'][source_df['Contents']['Worksheet Name'].str.startswith('Table')]

with pd.ExcelWriter('NI Housing Price Index.xlsx') as writer:
    # Thankfully these are semantically sortable otherwise this would be a _massive_ pain
    for k,df in sorted(dest_df.items()):
        df.to_excel(writer, sheet_name=k)

Published: March 27 2022

category: