Wrangling NI House Price Index Data

Andrew Bolster

Senior R&D Manager (Data Science) at Synopsys Software Integrity Group and Treasurer @ Bsides Belfast and NI OpenGovernment Network

Data Wrangling NI House Price Index Data

This is a ‘messy’ ‘blog post’ that’s just a braindump of a notebook to step through NI House Price Index datasets I was playing around with.

It’s mostly code, so if you were here from some ‘insight’, feck aff.

There is no analysis here, this is just data wrangling.

TLDR As always, Government Open Data has over the years gone from ‘non-existent’ to ‘garbeled’ to ‘inconsistent’ and I feel is now in the stage of ‘consistently inconsistent’, which is progress in my eyes.

Preamble Code, move on.

from bs4 import BeautifulSoup
import pandas as pd
import requests

# Pull the latest pages of https://www.finance-ni.gov.uk/publications/ni-house-price-index-statistical-reports and extract links

base_url= 'https://www.finance-ni.gov.uk/publications/ni-house-price-index-statistical-reports'
base_content = requests.get(base_url).content
base_soup = BeautifulSoup(base_content)
for a in base_soup.find_all('a'):
    if a.attrs.get('href','').endswith('xlsx'):
        source_name, source_url = a.contents[1],a.attrs['href']
        
source_df = pd.read_excel(source_url, sheet_name = None) # Load all worksheets in 

source_df.keys()
dict_keys(['Cover Sheet', 'Contents', 'Table 1', 'Table 2', 'Table 2a', 'Table 2b', 'Table 2c', 'Table 2d', 'Table 3', 'Table 3a', 'Table 3b', 'Table 3c', 'Table 4', 'Fig 5', 'Table 5', 'Table 5a', 'Fig 6', 'Table 6', 'Table 7', 'Table 8', 'Table 9', 'Table 9a', 'Table 9b', 'Table 9c', 'Table 9d', 'Table 10a', 'Table 10b', 'Table 10c', 'Table 10d', 'Table 10e', 'Table 10f', 'Table 10g', 'Table 10h', 'Table 10i', 'Table 10j', 'Table 10k'])
source_df['Contents']
Table of Contents Unnamed: 1 Unnamed: 2
0 Worksheet Name Frequency House Price Index - Quarter 4 2021
1 Table 1 Quarterly Table 1: NI HPI Trends Q1 2005 - Q4 2021
2 Figure 1 Quarterly Figure 1: Graph of NI HPI Q1 2005 - Q4 2021
3 Figure 1a Quarterly Figure 1a: Graph of Percentage Quarterly Chang...
4 Figure 1b Quarterly Figure 1b: Graph of Percentage Annual Change Q...
... ... ... ...
58 Table 10h Quarterly Table 10h: Number of Verified Residential Prop...
59 Table 10i Quarterly Table 10i: Number of Verified Residential Prop...
60 Table 10j Quarterly Table 10j: Number of Verified Residential Prop...
61 Table 10k Quarterly Table 10k: Number of Verified Residential Prop...
62 Figure 11 Quarterly Figure 11: Number of Verified Residential Prop...

63 rows × 3 columns

Fix the Contents sheet to correctly reflect the Worksheet names

And fix the table headers and sheet-titles while we’re at it.

new_header = source_df['Contents'].iloc[0]
source_df['Contents'] = source_df['Contents'][1:]
source_df['Contents'].columns = new_header
source_df['Contents'].columns = [*new_header[:-1],'Title']
[t for t in source_df['Contents']['Title'].values if t.startswith('Table')]
['Table 1: NI HPI Trends Q1 2005 - Q4 2021',
 'Table 2: NI HPI & Standardised Price Statistics by Property Type Q4 2021',
 'Table 2a: NI Detached Property Price Index Q1 2005 - Q4 2021',
 'Table 2b: NI Semi-Detached Property Price Index Q1 2005 - Q4 2021',
 'Table 2c: NI Terrace Property Price Index Q1 2005 - Q4 2021',
 'Table 2d: NI Apartment Price Index Q1 2005 - Q4 2021',
 'Table 3: NI HPI & Standardised Price Statistics by New/Existing Resold Dwelling Type Q4 2021',
 'Table 3a: NI New Dwelling Price Index Q1 2005 - Q4 2021',
 'Table 3b: NI Existing Resold Dwellings Price Index Q1 2005 - Q4 2021',
 'Table 3c: Number of Verified Residential Property Sales by New/Existing Resold Dwellings Q1 2005 - Q2 2021',
 'Table 4: Number of Verified Residential Property Sales Q1 2005 - Q4 2021',
 'Table 5: HPI & Standardised Price for each Local Government District in NI',
 'Table 5a: Number of Verified Residential Property Sales by Local Government District Q1 2005 - Q4 2021',
 'Table 6: NI HPI & Standardised Price by Urban and Rural areas of Northern Ireland',
 'Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times',
 'Table 8: Number of Verified Residential Property Sales for Urban and Rural Areas of NI (Q1 2005 - Q4 2021) and Rural Areas of NI by drive times (Q1 2015 - Q4 2021)',
 'Table 9: NI Average Sale Prices All Properties Q1 2005 - Q4 2021',
 'Table 9a: NI Average Sale Prices Detached Properties Q1 2005 - Q4 2021',
 'Table 9b: NI Average Sale Prices Semi-Detached Properties Q1 2005 - Q4 2021',
 'Table 9c: NI Average Sale Prices Terrace Properties Q1 2005 - Q4 2021',
 'Table 9d: NI Average Sale Prices Apartments Q1 2005 - Q4 2021',
 'Table 10a: Number of Verified Residential Property Sales by Type in Antrim and Newtownabbey Council Q1 2005 - Q4 2021',
 'Table 10b: Number of Verified Residential Property Sales by Type in Ards and North Down Council Q1 2005 - Q4 2021',
 'Table 10c: Number of Verified Residential Property Sales by Type in Armagh City, Banbridge and Craigavon Council Q1 2005 - Q4 2021',
 'Table 10d: Number of Verified Residential Property Sales by Type in Belfast Council Q1 2005 - Q4 2021',
 'Table 10e: Number of Verified Residential Property Sales by Type in Causeway Coast and Glens Council Q1 2005 - Q4 2021',
 'Table 10f: Number of Verified Residential Property Sales by Type in Derry City and Strabane Council Q1 2005 - Q4 2021',
 'Table 10g: Number of Verified Residential Property Sales by Type in Fermanagh and Omagh Council Q1 2005 - Q4 2021',
 'Table 10h: Number of Verified Residential Property Sales by Type in Lisburn and Castlereagh Council Q1 2005 - Q4 2021',
 'Table 10i: Number of Verified Residential Property Sales by Type in Mid and East Antrim Council Q1 2005 - Q4 2021',
 'Table 10j: Number of Verified Residential Property Sales by Type in Mid Ulster Council Q1 2005 - Q4 2021',
 'Table 10k: Number of Verified Residential Property Sales by Type in Newry, Mourne and Down Council Q1 2005 - Q4 2021']
# Replace 'Figure' with 'Fig' in 'Worksheet Name'
with pd.option_context('mode.chained_assignment',None):
    source_df['Contents']['Worksheet Name'] = source_df['Contents']['Worksheet Name'].str.replace('Figure','Fig')

Tidy up Data

General Methodology

Ignore figure data (pretty much completly….)

Tables have more or less the same structure; a header on row 3(1), a year and quarter ‘index’ (on time series; otherwise categorical index, see Table 2, Table 3).

Some TS tables also have totals subsections so these should be a) validated and b) ignored.

Any columns with no header in row 3(1) should be ignored (usually text notes)

Operate Sequentially (i.e. Table 1, Table 2, Table 2a; don’t skip, even if it’s tempting)

Use keys from ‘Contents’ to describe data, but may be suffixed by the date which could change between data sets!

There’s also some really columns that look like checksums, so if there is an ‘NI’ column, or a data column that all valid values are ‘100’, delete it.

TODO: Regexy way to get rid of the ‘\QX-YYYY -\QX YYYY’ tail

source_df['Table 1']
Table 1: NI House Price Index, Standardised Price and Quarterly and Annual Change Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7
0 Back to contents NaN NaN NaN NaN NaN NaN NaN
1 Year Quarter NI House Price Index NI House Standardised Price Quarterly Change Annual Change NI NaN
2 2005 Q1 100.883607 111920.268199 NaN NaN 100 0.0
3 NaN Q2 104.564663 116004.031639 0.036488 NaN 100 0.0
4 NaN Q3 111.219 123386.352673 0.063638 NaN 100 0.0
... ... ... ... ... ... ... ... ...
89 NaN NaN NaN NaN NaN NaN 100 0.0
90 NaN NaN NaN NaN NaN NaN 100 0.0
91 NaN NaN NaN NaN NaN NaN 100 0.0
92 NaN NaN NaN NaN NaN NaN 100 0.0
93 NaN NaN NaN NaN NaN NaN 100 0.0

94 rows × 8 columns

def basic_cleanup(df:pd.DataFrame, offset=1)->pd.DataFrame:
    df = df.copy() 
    # Re-header from row 1 (which was row 3 in excel)
    new_header = df.iloc[offset]
    df = df.iloc[offset+1:]
    df.columns = new_header
    
    # remove 'NaN' trailing columns
    df = df[df.columns[pd.notna(df.columns)]]
    
    # 'NI' is a usually hidden column that appears to be a checksum; 
    #if it's all there and all 100, remove it, otherwise, complain.
    # (Note, need to change this 'if' logic to just 'if there's a 
    # column with all 100's, but cross that bridge later)
    if 'NI' in df:
        assert df['NI'].all() and df['NI'].mean() == 100, "Not all values in df['NI'] == 100"
        df = df.drop('NI', axis=1)
        
    # Strip rows below the first all-nan row, if there is one 
    # (Otherwise this truncates the tables as there is no
    # idxmax in the table of all 'false's)
    if any(df.isna().all(axis=1)):
        idx_first_bad_row = df.isna().all(axis=1).idxmax()
        df = df.loc[:idx_first_bad_row-1]
    
    # By Inspection, other tables use 'Sale Year' and 'Sale Quarter'
    if set(df.keys()).issuperset({'Sale Year','Sale Quarter'}):
        df = df.rename(columns = {
            'Sale Year':'Year',
            'Sale Quarter': 'Quarter'
        })
        
    # For 'Year','Quarter' indexed pages, there is an implied Year
    # in Q2/4, so fill it downwards
    if set(df.keys()).issuperset({'Year','Quarter'}):
        df['Year'] = df['Year'].astype(float).fillna(method='ffill').astype(int)
        
        # In Pandas we can represent Y/Q combinations as proper datetimes
        #https://stackoverflow.com/questions/53898482/clean-way-to-convert-quarterly-periods-to-datetime-in-pandas 
        df.insert(loc=0,
                  column='Period',
                  value=pd.PeriodIndex(df.apply(lambda r:f'{r.Year}-{r.Quarter}', axis=1), freq='Q')
        )
    
    # reset index, try to fix dtypes, etc, (this should be the last 
    # operation before returning!
    df = df.reset_index(drop=True).infer_objects()   
    
    return df

df = basic_cleanup(source_df['Table 1'])
df
1 Period Year Quarter NI House Price Index NI House Standardised Price Quarterly Change Annual Change
0 2005Q1 2005 Q1 100.883607 111920.268199 NaN NaN
1 2005Q2 2005 Q2 104.564663 116004.031639 0.036488 NaN
2 2005Q3 2005 Q3 111.219000 123386.352673 0.063638 NaN
3 2005Q4 2005 Q4 115.083964 127674.143865 0.034751 NaN
4 2006Q1 2006 Q1 118.354129 131302.064422 0.028415 0.173175
... ... ... ... ... ... ... ...
63 2020Q4 2020 Q4 132.931827 147474.561707 0.026103 0.052326
64 2021Q1 2021 Q1 134.382831 149084.306040 0.010915 0.059421
65 2021Q2 2021 Q2 139.105050 154323.134643 0.035140 0.095724
66 2021Q3 2021 Q3 143.346066 159028.118093 0.030488 0.106491
67 2021Q4 2021 Q4 143.456594 159150.737832 0.000771 0.079174

68 rows × 7 columns

dest_df = {
    'Table 1': basic_cleanup(source_df['Table 1'])
}
len([k for k in source_df.keys() if k.startswith('Table')])
32

One down, 31 to go…

Table 2: NI HPI & Standardised Price Statistics by Property Type Q4 2021’

df = basic_cleanup(source_df['Table 2'])
df
1 Property Type Index\n(Quarter 4 2021) Percentage Change on Previous Quarter Percentage Change over 12 months Standardised Price\n(Quarter 4 2021)
0 Detached 143.488806 0.008491 0.093110 241131.373512
1 Semi-Detached 140.680694 0.004211 0.076953 153718.543755
2 Terrace 149.564169 -0.009577 0.078758 112831.710806
3 Apartment 133.376791 -0.014732 0.032761 116554.228620
4 All 143.456594 0.000771 0.079174 159150.737832

Those ‘\n (Quarter 4 2021)’ entries are unnecessary, so for this table, lets clear them

df.columns = [c.split('\n')[0] for c in df.columns]
df
Property Type Index Percentage Change on Previous Quarter Percentage Change over 12 months Standardised Price
0 Detached 143.488806 0.008491 0.093110 241131.373512
1 Semi-Detached 140.680694 0.004211 0.076953 153718.543755
2 Terrace 149.564169 -0.009577 0.078758 112831.710806
3 Apartment 133.376791 -0.014732 0.032761 116554.228620
4 All 143.456594 0.000771 0.079174 159150.737832
dest_df['Table 2'] = df

Table 2a: NI Detached Property Price Index Q1 2005 - Q4 2021

df = basic_cleanup(source_df['Table 2a'])
df
1 Period Year Quarter NI Detached Property Price Index NI Detached Property Standardised Price Quarterly Change Annual Change
0 2005Q1 2005 Q1 95.465560 160428.832662 NaN NaN
1 2005Q2 2005 Q2 100.974498 169686.542965 0.057706 NaN
2 2005Q3 2005 Q3 107.526236 180696.666810 0.064885 NaN
3 2005Q4 2005 Q4 110.279730 185323.883533 0.025608 NaN
4 2006Q1 2006 Q1 112.270506 188669.361197 0.018052 0.176032
... ... ... ... ... ... ... ...
63 2020Q4 2020 Q4 131.266614 220592.113069 0.026393 0.055357
64 2021Q1 2021 Q1 133.814014 224872.989982 0.019406 0.071429
65 2021Q2 2021 Q2 139.682380 234734.715703 0.043855 0.129844
66 2021Q3 2021 Q3 142.280745 239101.239764 0.018602 0.112515
67 2021Q4 2021 Q4 143.488806 241131.373512 0.008491 0.093110

68 rows × 7 columns

Table 2x: NI XXX Property Price Index Q1 2005 - Q4 2021

This table structure is consistent against the rest of the Table 2x cohort; mapping to the Property Types listed in Table 2.

For the time being, we can ignore these, but this will probably become a pain later on…

dest_df['Table 2']['Property Type']
0         Detached
1    Semi-Detached
2          Terrace
3        Apartment
4              All
Name: Property Type, dtype: object
import re

table2s = re.compile('Table 2[a-z]')
assert table2s.match('Table 2') is None, 'Table 2 is matching itself!'
assert table2s.match('Table 20') is None, 'Table 2 is greedy!'
assert table2s.match('Table 2z') is not None, 'Table 2 is matching incorrectly!'
table2s = re.compile('Table 2[a-z]')
for table in source_df:
    if table2s.match(table):
        dest_df[table] = basic_cleanup(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(6, 26)

6 down, 26 to go.

Table 3: NI HPI & Standardised Price Statistics by New/Existing Resold Dwelling Type Q4 2021

These appear to be a similar structure of the Table 2’s… hopefully

df = basic_cleanup(source_df['Table 3'])
df.columns = [c.split('\n')[0] for c in df.columns] # Stolen from Table 2 Treatment
df
Property Type Index Percentage Change on Previous Quarter Percentage Change over 12 months Standardised Price
0 New 141.769973 0.024877 0.072609 185966.524090
1 Existing Resold 143.518977 -0.004918 0.080771 152275.828046
2 All 143.456594 0.000771 0.079174 159150.737832
dest_df['Table 3'] = df
df = basic_cleanup(source_df['Table 3a'])
df
1 Period Year Quarter NI New Dwellings Price Index NI New Dwellings Standardised Price Quarterly Change Annual Change
0 2005Q1 2005 Q1 95.804706 125671.662611 NaN NaN
1 2005Q2 2005 Q2 101.229223 132787.263460 0.056621 NaN
2 2005Q3 2005 Q3 106.243580 139364.837967 0.049535 NaN
3 2005Q4 2005 Q4 110.118105 144447.239874 0.036468 NaN
4 2006Q1 2006 Q1 113.624410 149046.629634 0.031841 0.186000
... ... ... ... ... ... ... ...
63 2020Q4 2020 Q4 132.173052 173377.779440 0.004103 0.036125
64 2021Q1 2021 Q1 133.772562 175475.933612 0.012102 0.027916
65 2021Q2 2021 Q2 136.969311 179669.264190 0.023897 0.046474
66 2021Q3 2021 Q3 138.328776 181452.540106 0.009925 0.050867
67 2021Q4 2021 Q4 141.769973 185966.524090 0.024877 0.072609

68 rows × 7 columns

table3s = re.compile('Table 3[a-z]')
for table in source_df:
    if table3s.match(table):
        dest_df[table] = basic_cleanup(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(10, 22)

Table 4: Number of Verified Residential Property Sales Q1 2005 - Q4 2021

Table 4 is not looking great

df = source_df['Table 4']
df
Table 4: Number of Verified Residential Property Sales Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7
0 Verified Sales = Sales matched to a property i... NaN NaN NaN NaN NaN NaN NaN
1 Please note figures for the 2 most recent quar... NaN NaN NaN NaN NaN NaN NaN
2 Back to contents NaN NaN NaN NaN NaN NaN NaN
3 Sale Year Sale Quarter Detached Semi-Detached Terrace Apartment Total NaN
4 2005\n Quarter 1 809 894 1035 198 2936 NaN
... ... ... ... ... ... ... ... ...
84 2021 Quarter 1 2509 2477 1962 561 7509 NaN
85 NaN Quarter 2 2668 2613 2056 604 7941 NaN
86 NaN Quarter 3 2519 2797 2220 633 8169 Please note this figure is provisional and wil...
87 NaN Quarter 4 1478 2100 2057 515 6150 and new dwellings sold in this quarter being a...
88 NaN 2021 Total 9174 9987 8295 2313 29769 NaN

89 rows × 8 columns

Of note; new offset for the header row at index 3 instead of index 1, due to lots of fluff at the start that is probably not going to be consistent between reports so that will almost certainly mess up my day in a few months.

Also, Quarter dates have now been shifted into ‘Quarter 1’ instead of ‘Q1’, which … meh 🤷‍♂️. More Egrigiously, it looks like ‘\n’ has leaked into some Sales Year values. Funtimes.

Finally, and possibly most annoying, the introduction of partial total lines is going to throw things off, and this isn’t a validation study, to stuff-em

In an effort not to over-complicate basic_cleanup, we can try and clean these table specific issues first;

df.iloc[:,1]=df.iloc[:,1].str.replace('Quarter ([1-4])',r'Q\1', regex=True)
df
Table 4: Number of Verified Residential Property Sales Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7
0 Verified Sales = Sales matched to a property i... NaN NaN NaN NaN NaN NaN NaN
1 Please note figures for the 2 most recent quar... NaN NaN NaN NaN NaN NaN NaN
2 Back to contents NaN NaN NaN NaN NaN NaN NaN
3 Sale Year Sale Quarter Detached Semi-Detached Terrace Apartment Total NaN
4 2005\n Q1 809 894 1035 198 2936 NaN
... ... ... ... ... ... ... ... ...
84 2021 Q1 2509 2477 1962 561 7509 NaN
85 NaN Q2 2668 2613 2056 604 7941 NaN
86 NaN Q3 2519 2797 2220 633 8169 Please note this figure is provisional and wil...
87 NaN Q4 1478 2100 2057 515 6150 and new dwellings sold in this quarter being a...
88 NaN 2021 Total 9174 9987 8295 2313 29769 NaN

89 rows × 8 columns

df=df[~df.iloc[:,1].str.contains('Total').fillna(False)]
# Lose the year new-lines (needs astype because non str lines are
# correctly inferred to be ints, so .str methods nan-out
with pd.option_context('mode.chained_assignment',None):
    df.iloc[:,0]=df.iloc[:,0].astype(str).str.replace('\n','') 
df
Table 4: Number of Verified Residential Property Sales Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7
0 Verified Sales = Sales matched to a property i... NaN NaN NaN NaN NaN NaN NaN
1 Please note figures for the 2 most recent quar... NaN NaN NaN NaN NaN NaN NaN
2 Back to contents NaN NaN NaN NaN NaN NaN NaN
3 Sale Year Sale Quarter Detached Semi-Detached Terrace Apartment Total NaN
4 2005 Q1 809 894 1035 198 2936 NaN
... ... ... ... ... ... ... ... ...
82 nan Q4 2808 2944 2170 555 8477 NaN
84 2021 Q1 2509 2477 1962 561 7509 NaN
85 nan Q2 2668 2613 2056 604 7941 NaN
86 nan Q3 2519 2797 2220 633 8169 Please note this figure is provisional and wil...
87 nan Q4 1478 2100 2057 515 6150 and new dwellings sold in this quarter being a...

72 rows × 8 columns

basic_cleanup(df, offset=3)
3 Period Year Quarter Detached Semi-Detached Terrace Apartment Total
0 2005Q1 2005 Q1 809 894 1035 198 2936
1 2005Q2 2005 Q2 2208 2474 2808 483 7973
2 2005Q3 2005 Q3 2297 2655 2952 539 8443
3 2005Q4 2005 Q4 2498 3003 3492 631 9624
4 2006Q1 2006 Q1 2185 2650 3158 594 8587
... ... ... ... ... ... ... ... ...
63 2020Q4 2020 Q4 2808 2944 2170 555 8477
64 2021Q1 2021 Q1 2509 2477 1962 561 7509
65 2021Q2 2021 Q2 2668 2613 2056 604 7941
66 2021Q3 2021 Q3 2519 2797 2220 633 8169
67 2021Q4 2021 Q4 1478 2100 2057 515 6150

68 rows × 8 columns

Thats awkward enough to get it’s own function…

def cleanup_table_4(df):
    """
    Table 4: Number of Verified Residential Property Sales
    * Regex 'Quarter X' to 'QX' in future 'Sales Quarter' column
    * Drop Year Total rows
    * Clear any Newlines from the future 'Sales Year' column
    * call `basic_cleanup` with offset=3
    """
    df.iloc[:,1]=df.iloc[:,1].str.replace('Quarter ([1-4])',r'Q\1', regex=True)
    df=df[~df.iloc[:,1].str.contains('Total').fillna(False)]
    # Lose the year new-lines (needs astype because non str lines are
    # correctly inferred to be ints, so .str methods nan-out
    with pd.option_context('mode.chained_assignment',None):
        df.iloc[:,0]=df.iloc[:,0].astype(str).str.replace('\n','') 
    return basic_cleanup(df, offset=3)

cleanup_table_4(source_df['Table 4'].copy())
3 Period Year Quarter Detached Semi-Detached Terrace Apartment Total
0 2005Q1 2005 Q1 809 894 1035 198 2936
1 2005Q2 2005 Q2 2208 2474 2808 483 7973
2 2005Q3 2005 Q3 2297 2655 2952 539 8443
3 2005Q4 2005 Q4 2498 3003 3492 631 9624
4 2006Q1 2006 Q1 2185 2650 3158 594 8587
... ... ... ... ... ... ... ... ...
63 2020Q4 2020 Q4 2808 2944 2170 555 8477
64 2021Q1 2021 Q1 2509 2477 1962 561 7509
65 2021Q2 2021 Q2 2668 2613 2056 604 7941
66 2021Q3 2021 Q3 2519 2797 2220 633 8169
67 2021Q4 2021 Q4 1478 2100 2057 515 6150

68 rows × 8 columns

dest_df['Table 4'] = cleanup_table_4(source_df['Table 4'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(11, 21)

Table 5: HPI & Standardised Price for each Local Government District in NI

This nearly works but structurally requires a multi-index column to make sense….

df = basic_cleanup(source_df['Table 5'])
df
1 Period Year Quarter Antrim and Newtownabbey HPI Antrim and Newtownabbey Standardised Price Ards and North Down HPI Ards and North Down Standardised Price Armagh City, Banbridge and Craigavon HPI Armagh City, Banbridge and Craigavon Standardised Price Belfast HPI ... Fermanagh and Omagh HPI Fermanagh and Omagh Standardised Price Lisburn and Castlereagh HPI Lisburn and Castlereagh Standardised Price Mid and East Antrim HPI Mid and East Antrim Standardised Price Mid Ulster Standardised HPI Mid Ulster Standardised Price Newry, Mourne and Down HPI Newry, Mourne and Down Standardised Price
0 2005Q1 2005 Q1 99.903277 114851.528270 97.150602 130398.569667 102.245597 100785.145986 99.839849 ... 109.429237 104874.980231 95.958322 128828.327513 102.246427 105865.408901 102.714778 114882.211239 100.810773 113420.880186
1 2005Q2 2005 Q2 99.723509 114644.862732 100.794472 137133.037807 106.325843 104807.109982 100.589870 ... 117.239850 112360.529330 100.164437 134475.225477 104.443325 108140.065924 110.386311 123462.501283 111.965743 125971.191415
2 2005Q3 2005 Q3 107.940849 124091.740608 102.167971 137133.037807 110.006212 108434.910333 109.614861 ... 125.900145 120660.397585 106.757895 143327.237126 112.748278 116738.970434 117.595723 131525.929577 117.235685 131900.333698
3 2005Q4 2005 Q4 111.934696 128683.175719 106.396379 142808.538807 116.073031 114415.072260 110.728237 ... 130.781315 125338.422216 111.307116 149434.769200 114.584090 118639.759900 121.851999 136286.397473 123.628047 139092.296651
4 2006Q1 2006 Q1 113.494351 130476.197845 109.206160 146579.915492 121.831058 120090.852733 112.326582 ... 135.555749 129914.152078 110.539212 148403.823796 115.878975 119980.477260 129.544702 144890.365875 127.449613 143391.890242
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
63 2020Q4 2020 Q4 131.045293 150653.238745 123.824862 166201.593253 130.857866 128988.723586 133.525177 ... 142.381814 136456.127817 129.797418 174258.823716 130.813697 135443.809729 124.617456 139379.446212 138.815696 156179.721555
64 2021Q1 2021 Q1 133.481101 153453.510344 128.398202 172340.072904 130.382658 128520.303209 135.257679 ... 141.464114 135576.621629 131.166305 176096.615474 131.268719 135914.936888 127.657615 142779.737045 138.481347 155803.549899
65 2021Q2 2021 Q2 137.827568 158450.326506 130.543464 175219.510303 137.439265 135476.115278 138.558752 ... 146.059468 139980.724158 134.815374 180995.654429 138.663140 143571.081234 129.231669 144540.250870 145.000829 163138.533592
66 2021Q3 2021 Q3 141.167257 162289.724156 134.116570 180015.444071 143.522521 141472.480114 142.250634 ... 149.743153 143511.101233 136.178634 182825.891020 140.051373 145008.450168 135.759680 151841.560426 152.177814 171213.265699
67 2021Q4 2021 Q4 141.900751 163132.969278 134.883745 181045.170599 140.957739 138944.332704 140.072076 ... 156.033991 149540.124905 137.683726 184846.544332 141.311289 146312.960523 136.847809 153058.587031 152.044892 171063.717288

68 rows × 25 columns

# Two inner-columns per LGD
lgds = df.columns[3:].str.replace(' HPI','').str.replace(' Standardised Price','').unique()
lgds
Index(['Antrim and Newtownabbey', 'Ards and North Down',
       'Armagh City, Banbridge and Craigavon', 'Belfast',
       'Causeway Coast and Glens', 'Derry City and Strabane',
       'Fermanagh and Omagh', 'Lisburn and Castlereagh', 'Mid and East Antrim',
       'Mid Ulster Standardised', 'Mid Ulster', 'Newry, Mourne and Down'],
      dtype='object', name=1)

For some reason; Mid-ulster has a ‘Standardised HPI’ which throws off the above trick, so we gotta make it ugly…

lgds = df.columns[3:].str.replace(' Standardised HPI',' HPI')\
    .str.replace(' HPI','')\
    .str.replace(' Standardised Price','').unique()
lgds
Index(['Antrim and Newtownabbey', 'Ards and North Down',
       'Armagh City, Banbridge and Craigavon', 'Belfast',
       'Causeway Coast and Glens', 'Derry City and Strabane',
       'Fermanagh and Omagh', 'Lisburn and Castlereagh', 'Mid and East Antrim',
       'Mid Ulster', 'Newry, Mourne and Down'],
      dtype='object', name=1)
df.columns = [*df.columns[:3], *pd.MultiIndex.from_product([lgds,['Index','Price']], names=['LGD','Metric'])]
df
Period Year Quarter (Antrim and Newtownabbey, Index) (Antrim and Newtownabbey, Price) (Ards and North Down, Index) (Ards and North Down, Price) (Armagh City, Banbridge and Craigavon, Index) (Armagh City, Banbridge and Craigavon, Price) (Belfast, Index) ... (Fermanagh and Omagh, Index) (Fermanagh and Omagh, Price) (Lisburn and Castlereagh, Index) (Lisburn and Castlereagh, Price) (Mid and East Antrim, Index) (Mid and East Antrim, Price) (Mid Ulster, Index) (Mid Ulster, Price) (Newry, Mourne and Down, Index) (Newry, Mourne and Down, Price)
0 2005Q1 2005 Q1 99.903277 114851.528270 97.150602 130398.569667 102.245597 100785.145986 99.839849 ... 109.429237 104874.980231 95.958322 128828.327513 102.246427 105865.408901 102.714778 114882.211239 100.810773 113420.880186
1 2005Q2 2005 Q2 99.723509 114644.862732 100.794472 137133.037807 106.325843 104807.109982 100.589870 ... 117.239850 112360.529330 100.164437 134475.225477 104.443325 108140.065924 110.386311 123462.501283 111.965743 125971.191415
2 2005Q3 2005 Q3 107.940849 124091.740608 102.167971 137133.037807 110.006212 108434.910333 109.614861 ... 125.900145 120660.397585 106.757895 143327.237126 112.748278 116738.970434 117.595723 131525.929577 117.235685 131900.333698
3 2005Q4 2005 Q4 111.934696 128683.175719 106.396379 142808.538807 116.073031 114415.072260 110.728237 ... 130.781315 125338.422216 111.307116 149434.769200 114.584090 118639.759900 121.851999 136286.397473 123.628047 139092.296651
4 2006Q1 2006 Q1 113.494351 130476.197845 109.206160 146579.915492 121.831058 120090.852733 112.326582 ... 135.555749 129914.152078 110.539212 148403.823796 115.878975 119980.477260 129.544702 144890.365875 127.449613 143391.890242
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
63 2020Q4 2020 Q4 131.045293 150653.238745 123.824862 166201.593253 130.857866 128988.723586 133.525177 ... 142.381814 136456.127817 129.797418 174258.823716 130.813697 135443.809729 124.617456 139379.446212 138.815696 156179.721555
64 2021Q1 2021 Q1 133.481101 153453.510344 128.398202 172340.072904 130.382658 128520.303209 135.257679 ... 141.464114 135576.621629 131.166305 176096.615474 131.268719 135914.936888 127.657615 142779.737045 138.481347 155803.549899
65 2021Q2 2021 Q2 137.827568 158450.326506 130.543464 175219.510303 137.439265 135476.115278 138.558752 ... 146.059468 139980.724158 134.815374 180995.654429 138.663140 143571.081234 129.231669 144540.250870 145.000829 163138.533592
66 2021Q3 2021 Q3 141.167257 162289.724156 134.116570 180015.444071 143.522521 141472.480114 142.250634 ... 149.743153 143511.101233 136.178634 182825.891020 140.051373 145008.450168 135.759680 151841.560426 152.177814 171213.265699
67 2021Q4 2021 Q4 141.900751 163132.969278 134.883745 181045.170599 140.957739 138944.332704 140.072076 ... 156.033991 149540.124905 137.683726 184846.544332 141.311289 146312.960523 136.847809 153058.587031 152.044892 171063.717288

68 rows × 25 columns

We could turn this into a proper multiindex but it would mean pushing the Period/Year/Quarter columns into keys which would be inconsistent behaviour with the rest of the ‘cleaned’ dataset, so that can be a downstream problem; at least we’ve got the relevant metrics consistent!

def cleanup_table_5(df):
    """
    Table 5: Standardised House Price & Index for each Local Government District Northern Ireland
    * 
    """
    # Basic Cleanup first
    df = basic_cleanup(df)
    # Build multi-index of LGD / Metric [Index,Price]
    # Two inner-columns per LGD
    lgds = df.columns[3:].str.replace(' Standardised HPI',' HPI')\
        .str.replace(' HPI','')\
        .str.replace(' Standardised Price','')\
        .unique()
    df.columns = [*df.columns[:3], *pd.MultiIndex.from_product([lgds,['Index','Price']], names=['LGD','Metric'])]
    return df

cleanup_table_5(source_df['Table 5'])
Period Year Quarter (Antrim and Newtownabbey, Index) (Antrim and Newtownabbey, Price) (Ards and North Down, Index) (Ards and North Down, Price) (Armagh City, Banbridge and Craigavon, Index) (Armagh City, Banbridge and Craigavon, Price) (Belfast, Index) ... (Fermanagh and Omagh, Index) (Fermanagh and Omagh, Price) (Lisburn and Castlereagh, Index) (Lisburn and Castlereagh, Price) (Mid and East Antrim, Index) (Mid and East Antrim, Price) (Mid Ulster, Index) (Mid Ulster, Price) (Newry, Mourne and Down, Index) (Newry, Mourne and Down, Price)
0 2005Q1 2005 Q1 99.903277 114851.528270 97.150602 130398.569667 102.245597 100785.145986 99.839849 ... 109.429237 104874.980231 95.958322 128828.327513 102.246427 105865.408901 102.714778 114882.211239 100.810773 113420.880186
1 2005Q2 2005 Q2 99.723509 114644.862732 100.794472 137133.037807 106.325843 104807.109982 100.589870 ... 117.239850 112360.529330 100.164437 134475.225477 104.443325 108140.065924 110.386311 123462.501283 111.965743 125971.191415
2 2005Q3 2005 Q3 107.940849 124091.740608 102.167971 137133.037807 110.006212 108434.910333 109.614861 ... 125.900145 120660.397585 106.757895 143327.237126 112.748278 116738.970434 117.595723 131525.929577 117.235685 131900.333698
3 2005Q4 2005 Q4 111.934696 128683.175719 106.396379 142808.538807 116.073031 114415.072260 110.728237 ... 130.781315 125338.422216 111.307116 149434.769200 114.584090 118639.759900 121.851999 136286.397473 123.628047 139092.296651
4 2006Q1 2006 Q1 113.494351 130476.197845 109.206160 146579.915492 121.831058 120090.852733 112.326582 ... 135.555749 129914.152078 110.539212 148403.823796 115.878975 119980.477260 129.544702 144890.365875 127.449613 143391.890242
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
63 2020Q4 2020 Q4 131.045293 150653.238745 123.824862 166201.593253 130.857866 128988.723586 133.525177 ... 142.381814 136456.127817 129.797418 174258.823716 130.813697 135443.809729 124.617456 139379.446212 138.815696 156179.721555
64 2021Q1 2021 Q1 133.481101 153453.510344 128.398202 172340.072904 130.382658 128520.303209 135.257679 ... 141.464114 135576.621629 131.166305 176096.615474 131.268719 135914.936888 127.657615 142779.737045 138.481347 155803.549899
65 2021Q2 2021 Q2 137.827568 158450.326506 130.543464 175219.510303 137.439265 135476.115278 138.558752 ... 146.059468 139980.724158 134.815374 180995.654429 138.663140 143571.081234 129.231669 144540.250870 145.000829 163138.533592
66 2021Q3 2021 Q3 141.167257 162289.724156 134.116570 180015.444071 143.522521 141472.480114 142.250634 ... 149.743153 143511.101233 136.178634 182825.891020 140.051373 145008.450168 135.759680 151841.560426 152.177814 171213.265699
67 2021Q4 2021 Q4 141.900751 163132.969278 134.883745 181045.170599 140.957739 138944.332704 140.072076 ... 156.033991 149540.124905 137.683726 184846.544332 141.311289 146312.960523 136.847809 153058.587031 152.044892 171063.717288

68 rows × 25 columns

dest_df['Table 5']=cleanup_table_5(source_df['Table 5'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(12, 20)

Table 5a: Number of Verified Residential Property Sales by Local Government District

This one has a new problem; the Sale Year/Quarter is now squished together. This will do a few terrible things to our basic_cleanup so this needs to be done ahead of cleanup. Also has annual total lines.

df = source_df['Table 5a'].copy()
df
Table 5a: Number of Verified Residential Property Sales by Local Government District Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11
0 Please note figures for the 2 most recent quar... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Back to contents NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Sale Year/Quarter Antrim and Newtownabbey Ards and North Down Armagh City, Banbridge and Craigavon Belfast Causeway Coast and Glens Derry City and Strabane Fermanagh and Omagh Lisburn and Castlereagh Mid and East Antrim Mid Ulster Newry, Mourne and Down
3 Q1 2005 236 320 333 623 236 226 138 219 188 176 241
4 Q2 2005 735 857 961 1549 712 637 316 655 618 428 505
... ... ... ... ... ... ... ... ... ... ... ... ...
85 Q3 2021 739 989 931 1584 625 485 325 869 671 377 574
86 Q4 2021 532 702 730 1272 417 405 250 572 474 359 437
87 2021 Total 2647 3685 3333 5934 2327 1803 1181 3053 2346 1460 2000
88 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
89 Please note figures for the 2 most recent quar... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

90 rows × 12 columns

dates = df.iloc[:,0].str.extract('(Q[1-4]) ([0-9]{4})').rename(columns={0:'Quarter',1:'Year'})
for c in ['Quarter','Year']:# insert the dates in order, so they come out in reverse in the insert
    df.insert(1,c,dates[c])
    df.iloc[2,1]=c # Need to have the right colname for when `basic_cleanup` is called.

df.iloc[2,1]=c
df
Table 5a: Number of Verified Residential Property Sales by Local Government District Year Quarter Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11
0 Please note figures for the 2 most recent quar... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Back to contents NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Sale Year/Quarter Year Quarter Antrim and Newtownabbey Ards and North Down Armagh City, Banbridge and Craigavon Belfast Causeway Coast and Glens Derry City and Strabane Fermanagh and Omagh Lisburn and Castlereagh Mid and East Antrim Mid Ulster Newry, Mourne and Down
3 Q1 2005 2005 Q1 236 320 333 623 236 226 138 219 188 176 241
4 Q2 2005 2005 Q2 735 857 961 1549 712 637 316 655 618 428 505
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
85 Q3 2021 2021 Q3 739 989 931 1584 625 485 325 869 671 377 574
86 Q4 2021 2021 Q4 532 702 730 1272 417 405 250 572 474 359 437
87 2021 Total NaN NaN 2647 3685 3333 5934 2327 1803 1181 3053 2346 1460 2000
88 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
89 Please note figures for the 2 most recent quar... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

90 rows × 14 columns

df=df[~df.iloc[:,0].str.contains('Total').fillna(False)]

df.iloc[1,2]=c

basic_cleanup(df,offset=2)
2 Period Sale Year/Quarter Year Quarter Antrim and Newtownabbey Ards and North Down Armagh City, Banbridge and Craigavon Belfast Causeway Coast and Glens Derry City and Strabane Fermanagh and Omagh Lisburn and Castlereagh Mid and East Antrim Mid Ulster Newry, Mourne and Down
0 2005Q1 Q1 2005 2005 Q1 236 320 333 623 236 226 138 219 188 176 241
1 2005Q2 Q2 2005 2005 Q2 735 857 961 1549 712 637 316 655 618 428 505
2 2005Q3 Q3 2005 2005 Q3 757 960 968 1722 714 632 365 654 686 403 582
3 2005Q4 Q4 2005 2005 Q4 893 995 1199 1943 834 746 385 670 759 489 711
4 2006Q1 Q1 2006 2006 Q1 761 933 1038 1686 763 708 348 600 668 515 567
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
63 2020Q4 Q4 2020 2020 Q4 756 1052 974 1565 728 496 336 830 685 419 636
64 2021Q1 Q1 2021 2021 Q1 652 976 849 1497 610 466 290 762 572 349 486
65 2021Q2 Q2 2021 2021 Q2 724 1018 823 1581 675 447 316 850 629 375 503
66 2021Q3 Q3 2021 2021 Q3 739 989 931 1584 625 485 325 869 671 377 574
67 2021Q4 Q4 2021 2021 Q4 532 702 730 1272 417 405 250 572 474 359 437

68 rows × 15 columns

def cleanup_table_5a(df):
    """
    Table 5a: Number of Verified Residential Property Sales by Local Government District
    * Parse the 'Sale Year/Quarter' to two separate cols
    * Insert future-headers for Quarter and Year cols
    * Remove rows with 'total' in the first column
    * Disregard the 'Sale Year/Quarter' column
    * perform `basic_cleanup` with offset=2
    """
    # Safety first
    df=df.copy()
    
    # Extract 'Quarter' and 'Year' columns from the future 'Sale Year/Quarter' column
    dates = df.iloc[:,0].str.extract('(Q[1-4]) ([0-9]{4})').rename(columns={0:'Quarter',1:'Year'})
    for c in ['Quarter','Year']:# insert the dates in order, so they come out in reverse in the insert
        df.insert(1,c,dates[c])
        df.iloc[2,1]=c # Need to have the right colname for when `basic_cleanup` is called.

    # Remove 'total' rows from the future 'Sale Year/Quarter' column
    df=df[~df.iloc[:,0].str.contains('Total').fillna(False)]
    
    # Remove the 'Sale Year/Quarter' column all together
    df = df.iloc[:,1:]
    
    # Standard cleanup
    df = basic_cleanup(df, offset=2)
    
    return df

cleanup_table_5a(source_df['Table 5a'])
2 Period Year Quarter Antrim and Newtownabbey Ards and North Down Armagh City, Banbridge and Craigavon Belfast Causeway Coast and Glens Derry City and Strabane Fermanagh and Omagh Lisburn and Castlereagh Mid and East Antrim Mid Ulster Newry, Mourne and Down
0 2005Q1 2005 Q1 236 320 333 623 236 226 138 219 188 176 241
1 2005Q2 2005 Q2 735 857 961 1549 712 637 316 655 618 428 505
2 2005Q3 2005 Q3 757 960 968 1722 714 632 365 654 686 403 582
3 2005Q4 2005 Q4 893 995 1199 1943 834 746 385 670 759 489 711
4 2006Q1 2006 Q1 761 933 1038 1686 763 708 348 600 668 515 567
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
63 2020Q4 2020 Q4 756 1052 974 1565 728 496 336 830 685 419 636
64 2021Q1 2021 Q1 652 976 849 1497 610 466 290 762 572 349 486
65 2021Q2 2021 Q2 724 1018 823 1581 675 447 316 850 629 375 503
66 2021Q3 2021 Q3 739 989 931 1584 625 485 325 869 671 377 574
67 2021Q4 2021 Q4 532 702 730 1272 417 405 250 572 474 359 437

68 rows × 14 columns

dest_df['Table 5a']=cleanup_table_5a(source_df['Table 5a'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(13, 19)

Table 6: Standardised House Price & Index for all Urban and Rural areas in NI

Wee buns, thankfully. Still mixing the ‘HPI’ vs ‘Index’, but that’s a downstream problem

df = basic_cleanup(source_df['Table 6'])
df
1 Period Year Quarter Urban Areas HPI Urban Areas Standardised Price Rural Areas HPI Rural Areas Standardised Price
0 2005Q1 2005 Q1 101.309947 107723.320891 100.109860 124292.601178
1 2005Q2 2005 Q2 104.402908 111012.079786 105.467951 138865.721275
2 2005Q3 2005 Q3 111.163485 118200.631818 111.847591 138865.721275
3 2005Q4 2005 Q4 114.871996 122143.908606 116.175119 144238.615701
4 2006Q1 2006 Q1 118.187559 125669.361667 119.329374 148154.818847
... ... ... ... ... ... ... ...
63 2020Q4 2020 Q4 132.610763 141005.619094 133.854953 166189.226014
64 2021Q1 2021 Q1 134.077654 142565.370205 135.267264 167942.698911
65 2021Q2 2021 Q2 138.575881 147348.355880 140.501443 174441.256673
66 2021Q3 2021 Q3 142.840470 151882.912133 144.695321 179648.216283
67 2021Q4 2021 Q4 142.375033 151388.010443 146.115278 181411.180623

68 rows × 7 columns

dest_df['Table 6']=basic_cleanup(source_df['Table 6'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(14, 18)

Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times

Nearly-wee-buns; but this one doesn’t have Year or Quarter headers, and the extra \n (Ref: Q1 2015) added, which will complicate downstream analysis if that changes over time…

df = source_df['Table 7'].copy()
df.head()
Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 Back to contents NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN Drive time within 20mins of town of 10,000 or ... Drive time within 20mins of town of 10,000 or ... Drive time outside 20mins of town of 10,000 or... Drive time outside 20mins of town of 10,000 or... Drive time within 1hr of Belfast Index Drive time within 1hr of Belfast Price\n(Ref:... Drive time outside 1hr of Belfast Index Drive time outside 1hr of Belfast Price\n(Ref... NaN
2 2015 Q1 100 124898.676844 100 122528.427865 100 128955.274996 100 111866.40498 100.0
3 NaN Q2 103.166882 128854.070701 103.003978 126209.155363 103.025069 132856.260679 103.349406 115613.265107 100.0
4 NaN Q3 105.851629 132207.28391 105.619893 129414.394046 105.031061 135443.093443 107.811831 120605.219276 100.0
df.iloc[1,0] = 'Year'
df.iloc[1,1] = 'Quarter'
df.head()
Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 Back to contents NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Year Quarter Drive time within 20mins of town of 10,000 or ... Drive time within 20mins of town of 10,000 or ... Drive time outside 20mins of town of 10,000 or... Drive time outside 20mins of town of 10,000 or... Drive time within 1hr of Belfast Index Drive time within 1hr of Belfast Price\n(Ref:... Drive time outside 1hr of Belfast Index Drive time outside 1hr of Belfast Price\n(Ref... NaN
2 2015 Q1 100 124898.676844 100 122528.427865 100 128955.274996 100 111866.40498 100.0
3 NaN Q2 103.166882 128854.070701 103.003978 126209.155363 103.025069 132856.260679 103.349406 115613.265107 100.0
4 NaN Q3 105.851629 132207.28391 105.619893 129414.394046 105.031061 135443.093443 107.811831 120605.219276 100.0
basic_cleanup(df).head()
1 Period Year Quarter Drive time within 20mins of town of 10,000 or more Index Drive time within 20mins of town of 10,000 or more Price\n(Ref: Q1 2015) Drive time outside 20mins of town of 10,000 or more Index Drive time outside 20mins of town of 10,000 or more Price\n(Ref: Q1 2015) Drive time within 1hr of Belfast Index Drive time within 1hr of Belfast Price\n(Ref: Q1 2015) Drive time outside 1hr of Belfast Index Drive time outside 1hr of Belfast Price\n(Ref: Q1 2015)
0 2015Q1 2015 Q1 100.000000 124898.676844 100.000000 122528.427865 100.000000 128955.274996 100.000000 111866.404980
1 2015Q2 2015 Q2 103.166882 128854.070701 103.003978 126209.155363 103.025069 132856.260679 103.349406 115613.265107
2 2015Q3 2015 Q3 105.851629 132207.283910 105.619893 129414.394046 105.031061 135443.093443 107.811831 120605.219276
3 2015Q4 2015 Q4 107.430656 134179.467306 106.924715 131013.172436 106.240145 137002.270924 110.075053 123137.004353
4 2016Q1 2016 Q1 108.909364 136026.354775 108.368772 132782.552750 107.604887 138762.178070 111.828874 125098.941485
def cleanup_table_7(df):
    """
    Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times
    * Insert Year/Quarter future-headers
    * Clean normally
    # TODO THIS MIGHT BE VALID FOR MULTIINDEXING ON DRIVETIME/[Index/Price]
    """
    df = df.copy()
    df.iloc[1,0] = 'Year'
    df.iloc[1,1] = 'Quarter'
    df = basic_cleanup(df)
    return df

cleanup_table_7(source_df['Table 7'])
1 Period Year Quarter Drive time within 20mins of town of 10,000 or more Index Drive time within 20mins of town of 10,000 or more Price\n(Ref: Q1 2015) Drive time outside 20mins of town of 10,000 or more Index Drive time outside 20mins of town of 10,000 or more Price\n(Ref: Q1 2015) Drive time within 1hr of Belfast Index Drive time within 1hr of Belfast Price\n(Ref: Q1 2015) Drive time outside 1hr of Belfast Index Drive time outside 1hr of Belfast Price\n(Ref: Q1 2015)
0 2015Q1 2015 Q1 100.000000 124898.676844 100.000000 122528.427865 100.000000 128955.274996 100.000000 111866.404980
1 2015Q2 2015 Q2 103.166882 128854.070701 103.003978 126209.155363 103.025069 132856.260679 103.349406 115613.265107
2 2015Q3 2015 Q3 105.851629 132207.283910 105.619893 129414.394046 105.031061 135443.093443 107.811831 120605.219276
3 2015Q4 2015 Q4 107.430656 134179.467306 106.924715 131013.172436 106.240145 137002.270924 110.075053 123137.004353
4 2016Q1 2016 Q1 108.909364 136026.354775 108.368772 132782.552750 107.604887 138762.178070 111.828874 125098.941485
5 2016Q2 2016 Q2 111.263396 138966.509219 109.739250 134461.778232 110.208116 142119.179594 111.991819 125281.221326
6 2016Q3 2016 Q3 113.419541 141659.506269 112.426034 137753.851946 112.202571 144691.133778 115.398242 129091.864904
7 2016Q4 2016 Q4 113.928074 142294.657346 113.219995 138726.680412 112.508660 145085.851672 116.952798 130830.890712
8 2017Q1 2017 Q1 114.262386 142712.207695 113.549623 139130.567598 112.823330 145491.635911 117.341538 131265.759778
9 2017Q2 2017 Q2 115.566592 144341.144812 115.829688 141924.295411 114.288862 147381.515712 119.397027 133565.161466
10 2017Q3 2017 Q3 116.716428 145777.273752 117.061832 143434.022704 115.192251 148546.484147 121.273582 135664.395891
11 2017Q4 2017 Q4 117.925340 147287.189812 118.541541 145247.086931 116.101165 149718.576638 123.577530 138241.740123
12 2018Q1 2018 Q1 118.482802 147983.452250 120.184585 147260.282195 117.972698 152132.017714 122.075418 136561.381608
13 2018Q2 2018 Q2 119.443631 149183.514842 120.710551 147904.740501 117.686726 151763.241043 125.603705 140508.349303
14 2018Q3 2018 Q3 121.408923 151638.138779 122.222994 149757.912837 119.614730 154249.503782 127.210974 142306.342946
15 2018Q4 2018 Q4 123.531419 154289.108214 125.254013 153471.772944 121.462721 156632.586327 130.953652 146493.143086
16 2019Q1 2019 Q1 122.499375 153000.098716 123.207618 150964.357154 120.087144 154858.706540 129.712835 145105.085124
17 2019Q2 2019 Q2 124.397722 155371.109292 125.151589 153346.274193 122.486275 157952.513361 130.480634 145963.994647
18 2019Q3 2019 Q3 126.533407 158038.551430 128.647747 157630.061642 124.978137 161165.900455 133.258945 149071.990904
19 2019Q4 2019 Q4 127.126748 158779.626458 127.784267 156572.053236 124.547746 160610.887802 134.709059 150694.181735
20 2020Q1 2020 Q1 127.090324 158734.133127 128.619521 157595.476721 124.744271 160864.317472 135.068006 151095.722593
21 2020Q2 2020 Q2 127.200617 158871.887068 127.231209 155894.400318 125.021931 161222.374474 132.935126 148709.746207
22 2020Q3 2020 Q3 129.627870 161903.493901 131.083667 160614.756005 127.873046 164899.037745 135.952621 152085.309777
23 2020Q4 2020 Q4 132.853240 165931.938294 135.873737 166483.954356 130.725554 168577.497418 142.032974 158887.182045
24 2021Q1 2021 Q1 134.414458 167881.879606 136.978231 167837.272586 132.479831 170839.730568 142.532053 159445.483290
25 2021Q2 2021 Q2 139.417605 174130.744141 142.727536 174881.805508 137.737680 177620.004609 147.652942 165174.037638
26 2021Q3 2021 Q3 143.303934 178984.717757 147.615067 180870.420630 141.482937 182449.710341 153.161532 171336.300173
27 2021Q4 2021 Q4 144.731984 180768.332630 149.013971 182584.475980 142.365385 183587.673223 156.204293 174740.127539
dest_df['Table 7'] = cleanup_table_7(source_df['Table 7'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(15, 17)

Table 8: Number of Verified Residential Property Sales of properties in urban and rural areas and properties in rural areas by drive times witihn towns of 10,000 or more and within 1 hour of Belfast

We’re now getting into the swing of this!

This one has two similar problems we’ve already seen; Munged Quarters/Years (this time with no header on that column…), and annual Total rows.

Vee must deeel with it

cleanup_table_5a(source_df['Table 8']).head()
2 Period Year Quarter Urban Rural Drive time within 20mins of town of 10,000 or more Drive time outside 20mins of town of 10,000 or more Drive time within 1hr of Belfast Drive time outside 1hr of Belfast
0 2015Q1 2015 Q1 3294 1322 898 424 976 346
1 2015Q2 2015 Q2 3789 1500 1034 466 1142 358
2 2015Q3 2015 Q3 4199 1640 1145 495 1250 390
3 2015Q4 2015 Q4 4396 1780 1223 557 1342 438
4 2016Q1 2016 Q1 4424 1731 1171 560 1263 468
cleanup_table_8 = cleanup_table_5a
dest_df['Table 8'] = cleanup_table_8(source_df['Table 8'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(16, 16)

Table 9: NI Average Sales Prices Q1 2005 - Q4 2021

Wee buns

basic_cleanup(source_df['Table 9'])
1 Period Year Quarter Simple Mean Simple Median Standardised Price (HPI)
0 2005Q1 2005 Q1 115912.942222 100000 111920.268199
1 2005Q2 2005 Q2 120481.290591 105000 116004.031639
2 2005Q3 2005 Q3 128866.225917 115000 123386.352673
3 2005Q4 2005 Q4 129649.092074 117000 127674.143865
4 2006Q1 2006 Q1 132972.115070 120000 131302.064422
... ... ... ... ... ... ...
63 2020Q4 2020 Q4 171803.199843 150000 147474.561707
64 2021Q1 2021 Q1 176218.214924 150000 149084.306040
65 2021Q2 2021 Q2 184144.458946 154950 154323.134643
66 2021Q3 2021 Q3 173490.230508 155000 159028.118093
67 2021Q4 2021 Q4 159965.154863 141000 159150.737832

68 rows × 6 columns

dest_df['Table 9'] = basic_cleanup(source_df['Table 9'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(17, 15)

Table 9x: NI Average Sale Prices XXXXX Property Q1 2005 - Q4 2021

These are very similar to Tables 2x; i.e. they’re broken down by property type.

Annoyingly, they don’t follow the same structure as Tables 2x or Table 9 because they don’t include the Year/Quarter headers.

If that reminds you of anything, it’s because Table 7 was the same…

cleanup_table_7(source_df['Table 9a'])
1 Period Year Quarter Simple Mean Simple Median Standardised Price (HPI)
0 2005Q1 2005 Q1 166314.816092 149972.5 160428.832662
1 2005Q2 2005 Q2 173370.669076 155000.0 169686.542965
2 2005Q3 2005 Q3 185397.896739 165000.0 180696.666810
3 2005Q4 2005 Q4 186545.119355 165000.0 185323.883533
4 2006Q1 2006 Q1 191328.398119 173000.0 188669.361197
... ... ... ... ... ... ...
63 2020Q4 2020 Q4 243712.512641 220000.0 220592.113069
64 2021Q1 2021 Q1 254182.439174 225000.0 224872.989982
65 2021Q2 2021 Q2 268755.621299 235000.0 234734.715703
66 2021Q3 2021 Q3 245860.399289 225000.0 239101.239764
67 2021Q4 2021 Q4 244468.040738 219000.0 241131.373512

68 rows × 6 columns

cleanup_table_9x = cleanup_table_7
table9s = re.compile('Table 9[a-z]')
for table in source_df:
    if table9s.match(table):
        dest_df[table] = cleanup_table_9x(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(21, 11)

Table 10x: Number of Verified Residential Property Sales by Type in XXXXX

Surprisingly, we’re in the home straight; the remaining tables are all of the same structure, with familiar awkwardness.,,

  • Annual-Total Rows
  • Munged Year/Quarter Column
  • That column having a silly (but contextual) name
  • a different offset

Fortunately, we already have something like that from dealing with Table 5a!

source_df['Table 10a']
Table 10a: Number of Verified Residential Property Sales by Type in Antrim and Newtownabbey Council Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 Please note figures for the 2 most recent quar... NaN NaN NaN NaN NaN
1 Back to contents NaN NaN NaN NaN NaN
2 ANTRIM AND NEWTOWNABBEY Apartments Detached Semi-Detached Terrace Total
3 Q1 2005 10 61 78 87 236
4 Q2 2005 46 213 216 260 735
... ... ... ... ... ... ...
83 Q1 2021 52 222 212 166 652
84 Q2 2021 56 217 275 176 724
85 Q3 2021 47 222 268 202 739
86 Q4 2021 50 117 176 189 532
87 2021 Total 205 778 931 733 2647

88 rows × 6 columns

cleanup_table_5a(source_df['Table 10a'])
2 Period Year Quarter Apartments Detached Semi-Detached Terrace Total
0 2005Q1 2005 Q1 10 61 78 87 236
1 2005Q2 2005 Q2 46 213 216 260 735
2 2005Q3 2005 Q3 46 214 238 259 757
3 2005Q4 2005 Q4 65 227 270 331 893
4 2006Q1 2006 Q1 48 186 231 296 761
... ... ... ... ... ... ... ... ...
63 2020Q4 2020 Q4 53 248 268 187 756
64 2021Q1 2021 Q1 52 222 212 166 652
65 2021Q2 2021 Q2 56 217 275 176 724
66 2021Q3 2021 Q3 47 222 268 202 739
67 2021Q4 2021 Q4 50 117 176 189 532

68 rows × 8 columns

cleanup_table_10x = cleanup_table_5a
table10s = re.compile('Table 10[a-z]')
for table in source_df:
    if table10s.match(table):
        dest_df[table] = cleanup_table_10x(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(32, 0)

And We’re Done!

So, we can see that while government open data is a pain, at least it’s a … consistently inconsistent pain?

I hope this was helpful to someone else.

dest_df['Contents'] = source_df['Contents'][source_df['Contents']['Worksheet Name'].str.startswith('Table')]
with pd.ExcelWriter('NI Housing Price Index.xlsx') as writer:
    # Thankfully these are semantically sortable otherwise this would be a _massive_ pain
    for k,df in sorted(dest_df.items()):
        df.to_excel(writer, sheet_name=k)
    
blog comments powered by Disqus