Data Wrangling NI House Price Index Data

This is a ‘messy’ ‘blog post’ that’s just a braindump of a notebook to step through NI House Price Index datasets I was playing around with.

It’s mostly code, so if you were here from some ‘insight’, feck aff.

There is no analysis here, this is just data wrangling.

TLDR As always, Government Open Data has over the years gone from ’non-existent’ to ‘garbeled’ to ‘inconsistent’ and I feel is now in the stage of ‘consistently inconsistent’, which is progress in my eyes.

Preamble Code, move on.

from bs4 import BeautifulSoup
import pandas as pd
import requests

# Pull the latest pages of https://www.finance-ni.gov.uk/publications/ni-house-price-index-statistical-reports and extract links

base_url= 'https://www.finance-ni.gov.uk/publications/ni-house-price-index-statistical-reports'
base_content = requests.get(base_url).content
base_soup = BeautifulSoup(base_content)
for a in base_soup.find_all('a'):
    if a.attrs.get('href','').endswith('xlsx'):
        source_name, source_url = a.contents[1],a.attrs['href']

source_df = pd.read_excel(source_url, sheet_name = None) # Load all worksheets in
source_df.keys()
dict_keys(['Cover Sheet', 'Contents', 'Table 1', 'Table 2', 'Table 2a', 'Table 2b', 'Table 2c', 'Table 2d', 'Table 3', 'Table 3a', 'Table 3b', 'Table 3c', 'Table 4', 'Fig 5', 'Table 5', 'Table 5a', 'Fig 6', 'Table 6', 'Table 7', 'Table 8', 'Table 9', 'Table 9a', 'Table 9b', 'Table 9c', 'Table 9d', 'Table 10a', 'Table 10b', 'Table 10c', 'Table 10d', 'Table 10e', 'Table 10f', 'Table 10g', 'Table 10h', 'Table 10i', 'Table 10j', 'Table 10k'])
source_df['Contents']

Table of ContentsUnnamed: 1Unnamed: 2
0Worksheet NameFrequencyHouse Price Index - Quarter 4 2021
1Table 1QuarterlyTable 1: NI HPI Trends Q1 2005 - Q4 2021
2Figure 1QuarterlyFigure 1: Graph of NI HPI Q1 2005 - Q4 2021
3Figure 1aQuarterlyFigure 1a: Graph of Percentage Quarterly Chang...
4Figure 1bQuarterlyFigure 1b: Graph of Percentage Annual Change Q...
............
58Table 10hQuarterlyTable 10h: Number of Verified Residential Prop...
59Table 10iQuarterlyTable 10i: Number of Verified Residential Prop...
60Table 10jQuarterlyTable 10j: Number of Verified Residential Prop...
61Table 10kQuarterlyTable 10k: Number of Verified Residential Prop...
62Figure 11QuarterlyFigure 11: Number of Verified Residential Prop...

63 rows × 3 columns

Fix the Contents sheet to correctly reflect the Worksheet names

And fix the table headers and sheet-titles while we’re at it.

new_header = source_df['Contents'].iloc[0]
source_df['Contents'] = source_df['Contents'][1:]
source_df['Contents'].columns = new_header
source_df['Contents'].columns = [*new_header[:-1],'Title']
[t for t in source_df['Contents']['Title'].values if t.startswith('Table')]
['Table 1: NI HPI Trends Q1 2005 - Q4 2021',
 'Table 2: NI HPI & Standardised Price Statistics by Property Type Q4 2021',
 'Table 2a: NI Detached Property Price Index Q1 2005 - Q4 2021',
 'Table 2b: NI Semi-Detached Property Price Index Q1 2005 - Q4 2021',
 'Table 2c: NI Terrace Property Price Index Q1 2005 - Q4 2021',
 'Table 2d: NI Apartment Price Index Q1 2005 - Q4 2021',
 'Table 3: NI HPI & Standardised Price Statistics by New/Existing Resold Dwelling Type Q4 2021',
 'Table 3a: NI New Dwelling Price Index Q1 2005 - Q4 2021',
 'Table 3b: NI Existing Resold Dwellings Price Index Q1 2005 - Q4 2021',
 'Table 3c: Number of Verified Residential Property Sales by New/Existing Resold Dwellings Q1 2005 - Q2 2021',
 'Table 4: Number of Verified Residential Property Sales Q1 2005 - Q4 2021',
 'Table 5: HPI & Standardised Price for each Local Government District in NI',
 'Table 5a: Number of Verified Residential Property Sales by Local Government District Q1 2005 - Q4 2021',
 'Table 6: NI HPI & Standardised Price by Urban and Rural areas of Northern Ireland',
 'Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times',
 'Table 8: Number of Verified Residential Property Sales for Urban and Rural Areas of NI (Q1 2005 - Q4 2021) and Rural Areas of NI by drive times (Q1 2015 - Q4 2021)',
 'Table 9: NI Average Sale Prices All Properties Q1 2005 - Q4 2021',
 'Table 9a: NI Average Sale Prices Detached Properties Q1 2005 - Q4 2021',
 'Table 9b: NI Average Sale Prices Semi-Detached Properties Q1 2005 - Q4 2021',
 'Table 9c: NI Average Sale Prices Terrace Properties Q1 2005 - Q4 2021',
 'Table 9d: NI Average Sale Prices Apartments Q1 2005 - Q4 2021',
 'Table 10a: Number of Verified Residential Property Sales by Type in Antrim and Newtownabbey Council Q1 2005 - Q4 2021',
 'Table 10b: Number of Verified Residential Property Sales by Type in Ards and North Down Council Q1 2005 - Q4 2021',
 'Table 10c: Number of Verified Residential Property Sales by Type in Armagh City, Banbridge and Craigavon Council Q1 2005 - Q4 2021',
 'Table 10d: Number of Verified Residential Property Sales by Type in Belfast Council Q1 2005 - Q4 2021',
 'Table 10e: Number of Verified Residential Property Sales by Type in Causeway Coast and Glens Council Q1 2005 - Q4 2021',
 'Table 10f: Number of Verified Residential Property Sales by Type in Derry City and Strabane Council Q1 2005 - Q4 2021',
 'Table 10g: Number of Verified Residential Property Sales by Type in Fermanagh and Omagh Council Q1 2005 - Q4 2021',
 'Table 10h: Number of Verified Residential Property Sales by Type in Lisburn and Castlereagh Council Q1 2005 - Q4 2021',
 'Table 10i: Number of Verified Residential Property Sales by Type in Mid and East Antrim Council Q1 2005 - Q4 2021',
 'Table 10j: Number of Verified Residential Property Sales by Type in Mid Ulster Council Q1 2005 - Q4 2021',
 'Table 10k: Number of Verified Residential Property Sales by Type in Newry, Mourne and Down Council Q1 2005 - Q4 2021']
# Replace 'Figure' with 'Fig' in 'Worksheet Name'
with pd.option_context('mode.chained_assignment',None):
    source_df['Contents']['Worksheet Name'] = source_df['Contents']['Worksheet Name'].str.replace('Figure','Fig')

Tidy up Data

General Methodology

Ignore figure data (pretty much completly….)

Tables have more or less the same structure; a header on row 3(1), a year and quarter ‘index’ (on time series; otherwise categorical index, see Table 2, Table 3).

Some TS tables also have totals subsections so these should be a) validated and b) ignored.

Any columns with no header in row 3(1) should be ignored (usually text notes)

Operate Sequentially (i.e. Table 1, Table 2, Table 2a; don’t skip, even if it’s tempting)

Use keys from ‘Contents’ to describe data, but may be suffixed by the date which could change between data sets!

There’s also some really columns that look like checksums, so if there is an ‘NI’ column, or a data column that all valid values are ‘100’, delete it.

TODO: Regexy way to get rid of the ‘\QX-YYYY -\QX YYYY’ tail

source_df['Table 1']

Table 1: NI House Price Index, Standardised Price and Quarterly and Annual ChangeUnnamed: 1Unnamed: 2Unnamed: 3Unnamed: 4Unnamed: 5Unnamed: 6Unnamed: 7
0Back to contentsNaNNaNNaNNaNNaNNaNNaN
1YearQuarterNI House Price IndexNI House Standardised PriceQuarterly ChangeAnnual ChangeNINaN
22005Q1100.883607111920.268199NaNNaN1000.0
3NaNQ2104.564663116004.0316390.036488NaN1000.0
4NaNQ3111.219123386.3526730.063638NaN1000.0
...........................
89NaNNaNNaNNaNNaNNaN1000.0
90NaNNaNNaNNaNNaNNaN1000.0
91NaNNaNNaNNaNNaNNaN1000.0
92NaNNaNNaNNaNNaNNaN1000.0
93NaNNaNNaNNaNNaNNaN1000.0

94 rows × 8 columns

def basic_cleanup(df:pd.DataFrame, offset=1)->pd.DataFrame:
    df = df.copy()
    # Re-header from row 1 (which was row 3 in excel)
    new_header = df.iloc[offset]
    df = df.iloc[offset+1:]
    df.columns = new_header

    # remove 'NaN' trailing columns
    df = df[df.columns[pd.notna(df.columns)]]

    # 'NI' is a usually hidden column that appears to be a checksum;
    #if it's all there and all 100, remove it, otherwise, complain.
    # (Note, need to change this 'if' logic to just 'if there's a
    # column with all 100's, but cross that bridge later)
    if 'NI' in df:
        assert df['NI'].all() and df['NI'].mean() == 100, "Not all values in df['NI'] == 100"
        df = df.drop('NI', axis=1)

    # Strip rows below the first all-nan row, if there is one
    # (Otherwise this truncates the tables as there is no
    # idxmax in the table of all 'false's)
    if any(df.isna().all(axis=1)):
        idx_first_bad_row = df.isna().all(axis=1).idxmax()
        df = df.loc[:idx_first_bad_row-1]

    # By Inspection, other tables use 'Sale Year' and 'Sale Quarter'
    if set(df.keys()).issuperset({'Sale Year','Sale Quarter'}):
        df = df.rename(columns = {
            'Sale Year':'Year',
            'Sale Quarter': 'Quarter'
        })

    # For 'Year','Quarter' indexed pages, there is an implied Year
    # in Q2/4, so fill it downwards
    if set(df.keys()).issuperset({'Year','Quarter'}):
        df['Year'] = df['Year'].astype(float).fillna(method='ffill').astype(int)

        # In Pandas we can represent Y/Q combinations as proper datetimes
        #https://stackoverflow.com/questions/53898482/clean-way-to-convert-quarterly-periods-to-datetime-in-pandas
        df.insert(loc=0,
                  column='Period',
                  value=pd.PeriodIndex(df.apply(lambda r:f'{r.Year}-{r.Quarter}', axis=1), freq='Q')
        )

    # reset index, try to fix dtypes, etc, (this should be the last
    # operation before returning!
    df = df.reset_index(drop=True).infer_objects()  

    return df

df = basic_cleanup(source_df['Table 1'])
df

1PeriodYearQuarterNI House Price IndexNI House Standardised PriceQuarterly ChangeAnnual Change
02005Q12005Q1100.883607111920.268199NaNNaN
12005Q22005Q2104.564663116004.0316390.036488NaN
22005Q32005Q3111.219000123386.3526730.063638NaN
32005Q42005Q4115.083964127674.1438650.034751NaN
42006Q12006Q1118.354129131302.0644220.0284150.173175
........................
632020Q42020Q4132.931827147474.5617070.0261030.052326
642021Q12021Q1134.382831149084.3060400.0109150.059421
652021Q22021Q2139.105050154323.1346430.0351400.095724
662021Q32021Q3143.346066159028.1180930.0304880.106491
672021Q42021Q4143.456594159150.7378320.0007710.079174

68 rows × 7 columns

dest_df = {
    'Table 1': basic_cleanup(source_df['Table 1'])
}
len([k for k in source_df.keys() if k.startswith('Table')])
32

One down, 31 to go…

Table 2: NI HPI & Standardised Price Statistics by Property Type Q4 2021’

df = basic_cleanup(source_df['Table 2'])
df

1Property TypeIndex\n(Quarter 4 2021)Percentage Change on Previous QuarterPercentage Change over 12 monthsStandardised Price\n(Quarter 4 2021)
0Detached143.4888060.0084910.093110241131.373512
1Semi-Detached140.6806940.0042110.076953153718.543755
2Terrace149.564169-0.0095770.078758112831.710806
3Apartment133.376791-0.0147320.032761116554.228620
4All143.4565940.0007710.079174159150.737832

Those ‘\n (Quarter 4 2021)’ entries are unnecessary, so for this table, lets clear them

df.columns = [c.split('\n')[0] for c in df.columns]
df

Property TypeIndexPercentage Change on Previous QuarterPercentage Change over 12 monthsStandardised Price
0Detached143.4888060.0084910.093110241131.373512
1Semi-Detached140.6806940.0042110.076953153718.543755
2Terrace149.564169-0.0095770.078758112831.710806
3Apartment133.376791-0.0147320.032761116554.228620
4All143.4565940.0007710.079174159150.737832
dest_df['Table 2'] = df

Table 2a: NI Detached Property Price Index Q1 2005 - Q4 2021

df = basic_cleanup(source_df['Table 2a'])
df

1PeriodYearQuarterNI Detached Property Price IndexNI Detached Property Standardised PriceQuarterly ChangeAnnual Change
02005Q12005Q195.465560160428.832662NaNNaN
12005Q22005Q2100.974498169686.5429650.057706NaN
22005Q32005Q3107.526236180696.6668100.064885NaN
32005Q42005Q4110.279730185323.8835330.025608NaN
42006Q12006Q1112.270506188669.3611970.0180520.176032
........................
632020Q42020Q4131.266614220592.1130690.0263930.055357
642021Q12021Q1133.814014224872.9899820.0194060.071429
652021Q22021Q2139.682380234734.7157030.0438550.129844
662021Q32021Q3142.280745239101.2397640.0186020.112515
672021Q42021Q4143.488806241131.3735120.0084910.093110

68 rows × 7 columns

Table 2x: NI XXX Property Price Index Q1 2005 - Q4 2021

This table structure is consistent against the rest of the Table 2x cohort; mapping to the Property Types listed in Table 2.

For the time being, we can ignore these, but this will probably become a pain later on…

dest_df['Table 2']['Property Type']
0         Detached
1    Semi-Detached
2          Terrace
3        Apartment
4              All
Name: Property Type, dtype: object
import re

table2s = re.compile('Table 2[a-z]')
assert table2s.match('Table 2') is None, 'Table 2 is matching itself!'
assert table2s.match('Table 20') is None, 'Table 2 is greedy!'
assert table2s.match('Table 2z') is not None, 'Table 2 is matching incorrectly!'
table2s = re.compile('Table 2[a-z]')
for table in source_df:
    if table2s.match(table):
        dest_df[table] = basic_cleanup(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(6, 26)

6 down, 26 to go.

Table 3: NI HPI & Standardised Price Statistics by New/Existing Resold Dwelling Type Q4 2021

These appear to be a similar structure of the Table 2’s… hopefully

df = basic_cleanup(source_df['Table 3'])
df.columns = [c.split('\n')[0] for c in df.columns] # Stolen from Table 2 Treatment
df

Property TypeIndexPercentage Change on Previous QuarterPercentage Change over 12 monthsStandardised Price
0New141.7699730.0248770.072609185966.524090
1Existing Resold143.518977-0.0049180.080771152275.828046
2All143.4565940.0007710.079174159150.737832
dest_df['Table 3'] = df
df = basic_cleanup(source_df['Table 3a'])
df

1PeriodYearQuarterNI New Dwellings Price IndexNI New Dwellings Standardised PriceQuarterly ChangeAnnual Change
02005Q12005Q195.804706125671.662611NaNNaN
12005Q22005Q2101.229223132787.2634600.056621NaN
22005Q32005Q3106.243580139364.8379670.049535NaN
32005Q42005Q4110.118105144447.2398740.036468NaN
42006Q12006Q1113.624410149046.6296340.0318410.186000
........................
632020Q42020Q4132.173052173377.7794400.0041030.036125
642021Q12021Q1133.772562175475.9336120.0121020.027916
652021Q22021Q2136.969311179669.2641900.0238970.046474
662021Q32021Q3138.328776181452.5401060.0099250.050867
672021Q42021Q4141.769973185966.5240900.0248770.072609

68 rows × 7 columns

table3s = re.compile('Table 3[a-z]')
for table in source_df:
    if table3s.match(table):
        dest_df[table] = basic_cleanup(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(10, 22)

Table 4: Number of Verified Residential Property Sales Q1 2005 - Q4 2021

Table 4 is not looking great

df = source_df['Table 4']
df

Table 4: Number of Verified Residential Property SalesUnnamed: 1Unnamed: 2Unnamed: 3Unnamed: 4Unnamed: 5Unnamed: 6Unnamed: 7
0Verified Sales = Sales matched to a property i...NaNNaNNaNNaNNaNNaNNaN
1Please note figures for the 2 most recent quar...NaNNaNNaNNaNNaNNaNNaN
2Back to contentsNaNNaNNaNNaNNaNNaNNaN
3Sale YearSale QuarterDetachedSemi-DetachedTerraceApartmentTotalNaN
42005\nQuarter 180989410351982936NaN
...........................
842021Quarter 12509247719625617509NaN
85NaNQuarter 22668261320566047941NaN
86NaNQuarter 32519279722206338169Please note this figure is provisional and wil...
87NaNQuarter 41478210020575156150and new dwellings sold in this quarter being a...
88NaN2021 Total917499878295231329769NaN

89 rows × 8 columns

Of note; new offset for the header row at index 3 instead of index 1, due to lots of fluff at the start that is probably not going to be consistent between reports so that will almost certainly mess up my day in a few months.

Also, Quarter dates have now been shifted into ‘Quarter 1’ instead of ‘Q1’, which … meh 🤷‍♂️. More Egrigiously, it looks like ’\n’ has leaked into some Sales Year values. Funtimes.

Finally, and possibly most annoying, the introduction of partial total lines is going to throw things off, and this isn’t a validation study, to stuff-em

In an effort not to over-complicate basic_cleanup, we can try and clean these table specific issues first;

df.iloc[:,1]=df.iloc[:,1].str.replace('Quarter ([1-4])',r'Q\1', regex=True)
df

Table 4: Number of Verified Residential Property SalesUnnamed: 1Unnamed: 2Unnamed: 3Unnamed: 4Unnamed: 5Unnamed: 6Unnamed: 7
0Verified Sales = Sales matched to a property i...NaNNaNNaNNaNNaNNaNNaN
1Please note figures for the 2 most recent quar...NaNNaNNaNNaNNaNNaNNaN
2Back to contentsNaNNaNNaNNaNNaNNaNNaN
3Sale YearSale QuarterDetachedSemi-DetachedTerraceApartmentTotalNaN
42005\nQ180989410351982936NaN
...........................
842021Q12509247719625617509NaN
85NaNQ22668261320566047941NaN
86NaNQ32519279722206338169Please note this figure is provisional and wil...
87NaNQ41478210020575156150and new dwellings sold in this quarter being a...
88NaN2021 Total917499878295231329769NaN

89 rows × 8 columns

df=df[~df.iloc[:,1].str.contains('Total').fillna(False)]
# Lose the year new-lines (needs astype because non str lines are
# correctly inferred to be ints, so .str methods nan-out
with pd.option_context('mode.chained_assignment',None):
    df.iloc[:,0]=df.iloc[:,0].astype(str).str.replace('\n','')
df

Table 4: Number of Verified Residential Property SalesUnnamed: 1Unnamed: 2Unnamed: 3Unnamed: 4Unnamed: 5Unnamed: 6Unnamed: 7
0Verified Sales = Sales matched to a property i...NaNNaNNaNNaNNaNNaNNaN
1Please note figures for the 2 most recent quar...NaNNaNNaNNaNNaNNaNNaN
2Back to contentsNaNNaNNaNNaNNaNNaNNaN
3Sale YearSale QuarterDetachedSemi-DetachedTerraceApartmentTotalNaN
42005Q180989410351982936NaN
...........................
82nanQ42808294421705558477NaN
842021Q12509247719625617509NaN
85nanQ22668261320566047941NaN
86nanQ32519279722206338169Please note this figure is provisional and wil...
87nanQ41478210020575156150and new dwellings sold in this quarter being a...

72 rows × 8 columns

basic_cleanup(df, offset=3)

3PeriodYearQuarterDetachedSemi-DetachedTerraceApartmentTotal
02005Q12005Q180989410351982936
12005Q22005Q22208247428084837973
22005Q32005Q32297265529525398443
32005Q42005Q42498300334926319624
42006Q12006Q12185265031585948587
...........................
632020Q42020Q42808294421705558477
642021Q12021Q12509247719625617509
652021Q22021Q22668261320566047941
662021Q32021Q32519279722206338169
672021Q42021Q41478210020575156150

68 rows × 8 columns

Thats awkward enough to get it’s own function…

def cleanup_table_4(df):
    """
    Table 4: Number of Verified Residential Property Sales
    * Regex 'Quarter X' to 'QX' in future 'Sales Quarter' column
    * Drop Year Total rows
    * Clear any Newlines from the future 'Sales Year' column
    * call `basic_cleanup` with offset=3
    """
    df.iloc[:,1]=df.iloc[:,1].str.replace('Quarter ([1-4])',r'Q\1', regex=True)
    df=df[~df.iloc[:,1].str.contains('Total').fillna(False)]
    # Lose the year new-lines (needs astype because non str lines are
    # correctly inferred to be ints, so .str methods nan-out
    with pd.option_context('mode.chained_assignment',None):
        df.iloc[:,0]=df.iloc[:,0].astype(str).str.replace('\n','')
    return basic_cleanup(df, offset=3)

cleanup_table_4(source_df['Table 4'].copy())

3PeriodYearQuarterDetachedSemi-DetachedTerraceApartmentTotal
02005Q12005Q180989410351982936
12005Q22005Q22208247428084837973
22005Q32005Q32297265529525398443
32005Q42005Q42498300334926319624
42006Q12006Q12185265031585948587
...........................
632020Q42020Q42808294421705558477
642021Q12021Q12509247719625617509
652021Q22021Q22668261320566047941
662021Q32021Q32519279722206338169
672021Q42021Q41478210020575156150

68 rows × 8 columns

dest_df['Table 4'] = cleanup_table_4(source_df['Table 4'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(11, 21)

Table 5: HPI & Standardised Price for each Local Government District in NI

This nearly works but structurally requires a multi-index column to make sense….

df = basic_cleanup(source_df['Table 5'])
df

1PeriodYearQuarterAntrim and Newtownabbey HPIAntrim and Newtownabbey Standardised PriceArds and North Down HPIArds and North Down Standardised PriceArmagh City, Banbridge and Craigavon HPIArmagh City, Banbridge and Craigavon Standardised PriceBelfast HPI...Fermanagh and Omagh HPIFermanagh and Omagh Standardised PriceLisburn and Castlereagh HPILisburn and Castlereagh Standardised PriceMid and East Antrim HPIMid and East Antrim Standardised PriceMid Ulster Standardised HPIMid Ulster Standardised PriceNewry, Mourne and Down HPINewry, Mourne and Down Standardised Price
02005Q12005Q199.903277114851.52827097.150602130398.569667102.245597100785.14598699.839849...109.429237104874.98023195.958322128828.327513102.246427105865.408901102.714778114882.211239100.810773113420.880186
12005Q22005Q299.723509114644.862732100.794472137133.037807106.325843104807.109982100.589870...117.239850112360.529330100.164437134475.225477104.443325108140.065924110.386311123462.501283111.965743125971.191415
22005Q32005Q3107.940849124091.740608102.167971137133.037807110.006212108434.910333109.614861...125.900145120660.397585106.757895143327.237126112.748278116738.970434117.595723131525.929577117.235685131900.333698
32005Q42005Q4111.934696128683.175719106.396379142808.538807116.073031114415.072260110.728237...130.781315125338.422216111.307116149434.769200114.584090118639.759900121.851999136286.397473123.628047139092.296651
42006Q12006Q1113.494351130476.197845109.206160146579.915492121.831058120090.852733112.326582...135.555749129914.152078110.539212148403.823796115.878975119980.477260129.544702144890.365875127.449613143391.890242
..................................................................
632020Q42020Q4131.045293150653.238745123.824862166201.593253130.857866128988.723586133.525177...142.381814136456.127817129.797418174258.823716130.813697135443.809729124.617456139379.446212138.815696156179.721555
642021Q12021Q1133.481101153453.510344128.398202172340.072904130.382658128520.303209135.257679...141.464114135576.621629131.166305176096.615474131.268719135914.936888127.657615142779.737045138.481347155803.549899
652021Q22021Q2137.827568158450.326506130.543464175219.510303137.439265135476.115278138.558752...146.059468139980.724158134.815374180995.654429138.663140143571.081234129.231669144540.250870145.000829163138.533592
662021Q32021Q3141.167257162289.724156134.116570180015.444071143.522521141472.480114142.250634...149.743153143511.101233136.178634182825.891020140.051373145008.450168135.759680151841.560426152.177814171213.265699
672021Q42021Q4141.900751163132.969278134.883745181045.170599140.957739138944.332704140.072076...156.033991149540.124905137.683726184846.544332141.311289146312.960523136.847809153058.587031152.044892171063.717288

68 rows × 25 columns

# Two inner-columns per LGD
lgds = df.columns[3:].str.replace(' HPI','').str.replace(' Standardised Price','').unique()
lgds
Index(['Antrim and Newtownabbey', 'Ards and North Down',
       'Armagh City, Banbridge and Craigavon', 'Belfast',
       'Causeway Coast and Glens', 'Derry City and Strabane',
       'Fermanagh and Omagh', 'Lisburn and Castlereagh', 'Mid and East Antrim',
       'Mid Ulster Standardised', 'Mid Ulster', 'Newry, Mourne and Down'],
      dtype='object', name=1)

For some reason; Mid-ulster has a ‘Standardised HPI’ which throws off the above trick, so we gotta make it ugly…

lgds = df.columns[3:].str.replace(' Standardised HPI',' HPI')\
    .str.replace(' HPI','')\
    .str.replace(' Standardised Price','').unique()
lgds
Index(['Antrim and Newtownabbey', 'Ards and North Down',
       'Armagh City, Banbridge and Craigavon', 'Belfast',
       'Causeway Coast and Glens', 'Derry City and Strabane',
       'Fermanagh and Omagh', 'Lisburn and Castlereagh', 'Mid and East Antrim',
       'Mid Ulster', 'Newry, Mourne and Down'],
      dtype='object', name=1)
df.columns = [*df.columns[:3], *pd.MultiIndex.from_product([lgds,['Index','Price']], names=['LGD','Metric'])]
df

PeriodYearQuarter(Antrim and Newtownabbey, Index)(Antrim and Newtownabbey, Price)(Ards and North Down, Index)(Ards and North Down, Price)(Armagh City, Banbridge and Craigavon, Index)(Armagh City, Banbridge and Craigavon, Price)(Belfast, Index)...(Fermanagh and Omagh, Index)(Fermanagh and Omagh, Price)(Lisburn and Castlereagh, Index)(Lisburn and Castlereagh, Price)(Mid and East Antrim, Index)(Mid and East Antrim, Price)(Mid Ulster, Index)(Mid Ulster, Price)(Newry, Mourne and Down, Index)(Newry, Mourne and Down, Price)
02005Q12005Q199.903277114851.52827097.150602130398.569667102.245597100785.14598699.839849...109.429237104874.98023195.958322128828.327513102.246427105865.408901102.714778114882.211239100.810773113420.880186
12005Q22005Q299.723509114644.862732100.794472137133.037807106.325843104807.109982100.589870...117.239850112360.529330100.164437134475.225477104.443325108140.065924110.386311123462.501283111.965743125971.191415
22005Q32005Q3107.940849124091.740608102.167971137133.037807110.006212108434.910333109.614861...125.900145120660.397585106.757895143327.237126112.748278116738.970434117.595723131525.929577117.235685131900.333698
32005Q42005Q4111.934696128683.175719106.396379142808.538807116.073031114415.072260110.728237...130.781315125338.422216111.307116149434.769200114.584090118639.759900121.851999136286.397473123.628047139092.296651
42006Q12006Q1113.494351130476.197845109.206160146579.915492121.831058120090.852733112.326582...135.555749129914.152078110.539212148403.823796115.878975119980.477260129.544702144890.365875127.449613143391.890242
..................................................................
632020Q42020Q4131.045293150653.238745123.824862166201.593253130.857866128988.723586133.525177...142.381814136456.127817129.797418174258.823716130.813697135443.809729124.617456139379.446212138.815696156179.721555
642021Q12021Q1133.481101153453.510344128.398202172340.072904130.382658128520.303209135.257679...141.464114135576.621629131.166305176096.615474131.268719135914.936888127.657615142779.737045138.481347155803.549899
652021Q22021Q2137.827568158450.326506130.543464175219.510303137.439265135476.115278138.558752...146.059468139980.724158134.815374180995.654429138.663140143571.081234129.231669144540.250870145.000829163138.533592
662021Q32021Q3141.167257162289.724156134.116570180015.444071143.522521141472.480114142.250634...149.743153143511.101233136.178634182825.891020140.051373145008.450168135.759680151841.560426152.177814171213.265699
672021Q42021Q4141.900751163132.969278134.883745181045.170599140.957739138944.332704140.072076...156.033991149540.124905137.683726184846.544332141.311289146312.960523136.847809153058.587031152.044892171063.717288

68 rows × 25 columns

We could turn this into a proper multiindex but it would mean pushing the Period/Year/Quarter columns into keys which would be inconsistent behaviour with the rest of the ‘cleaned’ dataset, so that can be a downstream problem; at least we’ve got the relevant metrics consistent!

def cleanup_table_5(df):
    """
    Table 5: Standardised House Price & Index for each Local Government District Northern Ireland
    *
    """
    # Basic Cleanup first
    df = basic_cleanup(df)
    # Build multi-index of LGD / Metric [Index,Price]
    # Two inner-columns per LGD
    lgds = df.columns[3:].str.replace(' Standardised HPI',' HPI')\
        .str.replace(' HPI','')\
        .str.replace(' Standardised Price','')\
        .unique()
    df.columns = [*df.columns[:3], *pd.MultiIndex.from_product([lgds,['Index','Price']], names=['LGD','Metric'])]
    return df

cleanup_table_5(source_df['Table 5'])

PeriodYearQuarter(Antrim and Newtownabbey, Index)(Antrim and Newtownabbey, Price)(Ards and North Down, Index)(Ards and North Down, Price)(Armagh City, Banbridge and Craigavon, Index)(Armagh City, Banbridge and Craigavon, Price)(Belfast, Index)...(Fermanagh and Omagh, Index)(Fermanagh and Omagh, Price)(Lisburn and Castlereagh, Index)(Lisburn and Castlereagh, Price)(Mid and East Antrim, Index)(Mid and East Antrim, Price)(Mid Ulster, Index)(Mid Ulster, Price)(Newry, Mourne and Down, Index)(Newry, Mourne and Down, Price)
02005Q12005Q199.903277114851.52827097.150602130398.569667102.245597100785.14598699.839849...109.429237104874.98023195.958322128828.327513102.246427105865.408901102.714778114882.211239100.810773113420.880186
12005Q22005Q299.723509114644.862732100.794472137133.037807106.325843104807.109982100.589870...117.239850112360.529330100.164437134475.225477104.443325108140.065924110.386311123462.501283111.965743125971.191415
22005Q32005Q3107.940849124091.740608102.167971137133.037807110.006212108434.910333109.614861...125.900145120660.397585106.757895143327.237126112.748278116738.970434117.595723131525.929577117.235685131900.333698
32005Q42005Q4111.934696128683.175719106.396379142808.538807116.073031114415.072260110.728237...130.781315125338.422216111.307116149434.769200114.584090118639.759900121.851999136286.397473123.628047139092.296651
42006Q12006Q1113.494351130476.197845109.206160146579.915492121.831058120090.852733112.326582...135.555749129914.152078110.539212148403.823796115.878975119980.477260129.544702144890.365875127.449613143391.890242
..................................................................
632020Q42020Q4131.045293150653.238745123.824862166201.593253130.857866128988.723586133.525177...142.381814136456.127817129.797418174258.823716130.813697135443.809729124.617456139379.446212138.815696156179.721555
642021Q12021Q1133.481101153453.510344128.398202172340.072904130.382658128520.303209135.257679...141.464114135576.621629131.166305176096.615474131.268719135914.936888127.657615142779.737045138.481347155803.549899
652021Q22021Q2137.827568158450.326506130.543464175219.510303137.439265135476.115278138.558752...146.059468139980.724158134.815374180995.654429138.663140143571.081234129.231669144540.250870145.000829163138.533592
662021Q32021Q3141.167257162289.724156134.116570180015.444071143.522521141472.480114142.250634...149.743153143511.101233136.178634182825.891020140.051373145008.450168135.759680151841.560426152.177814171213.265699
672021Q42021Q4141.900751163132.969278134.883745181045.170599140.957739138944.332704140.072076...156.033991149540.124905137.683726184846.544332141.311289146312.960523136.847809153058.587031152.044892171063.717288

68 rows × 25 columns

dest_df['Table 5']=cleanup_table_5(source_df['Table 5'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(12, 20)

Table 5a: Number of Verified Residential Property Sales by Local Government District

This one has a new problem; the Sale Year/Quarter is now squished together. This will do a few terrible things to our basic_cleanup so this needs to be done ahead of cleanup. Also has annual total lines.

df = source_df['Table 5a'].copy()
df

Table 5a: Number of Verified Residential Property Sales by Local Government DistrictUnnamed: 1Unnamed: 2Unnamed: 3Unnamed: 4Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Unnamed: 9Unnamed: 10Unnamed: 11
0Please note figures for the 2 most recent quar...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1Back to contentsNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2Sale Year/QuarterAntrim and NewtownabbeyArds and North DownArmagh City, Banbridge and CraigavonBelfastCauseway Coast and GlensDerry City and StrabaneFermanagh and OmaghLisburn and CastlereaghMid and East AntrimMid UlsterNewry, Mourne and Down
3Q1 2005236320333623236226138219188176241
4Q2 20057358579611549712637316655618428505
.......................................
85Q3 20217399899311584625485325869671377574
86Q4 20215327027301272417405250572474359437
872021 Total26473685333359342327180311813053234614602000
88NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
89Please note figures for the 2 most recent quar...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

90 rows × 12 columns

dates = df.iloc[:,0].str.extract('(Q[1-4]) ([0-9]{4})').rename(columns={0:'Quarter',1:'Year'})
for c in ['Quarter','Year']:# insert the dates in order, so they come out in reverse in the insert
    df.insert(1,c,dates[c])
    df.iloc[2,1]=c # Need to have the right colname for when `basic_cleanup` is called.
df.iloc[2,1]=c
df

Table 5a: Number of Verified Residential Property Sales by Local Government DistrictYearQuarterUnnamed: 1Unnamed: 2Unnamed: 3Unnamed: 4Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Unnamed: 9Unnamed: 10Unnamed: 11
0Please note figures for the 2 most recent quar...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1Back to contentsNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2Sale Year/QuarterYearQuarterAntrim and NewtownabbeyArds and North DownArmagh City, Banbridge and CraigavonBelfastCauseway Coast and GlensDerry City and StrabaneFermanagh and OmaghLisburn and CastlereaghMid and East AntrimMid UlsterNewry, Mourne and Down
3Q1 20052005Q1236320333623236226138219188176241
4Q2 20052005Q27358579611549712637316655618428505
.............................................
85Q3 20212021Q37399899311584625485325869671377574
86Q4 20212021Q45327027301272417405250572474359437
872021 TotalNaNNaN26473685333359342327180311813053234614602000
88NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
89Please note figures for the 2 most recent quar...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

90 rows × 14 columns

df=df[~df.iloc[:,0].str.contains('Total').fillna(False)]

df.iloc[1,2]=c

basic_cleanup(df,offset=2)

2PeriodSale Year/QuarterYearQuarterAntrim and NewtownabbeyArds and North DownArmagh City, Banbridge and CraigavonBelfastCauseway Coast and GlensDerry City and StrabaneFermanagh and OmaghLisburn and CastlereaghMid and East AntrimMid UlsterNewry, Mourne and Down
02005Q1Q1 20052005Q1236320333623236226138219188176241
12005Q2Q2 20052005Q27358579611549712637316655618428505
22005Q3Q3 20052005Q37579609681722714632365654686403582
32005Q4Q4 20052005Q489399511991943834746385670759489711
42006Q1Q1 20062006Q176193310381686763708348600668515567
................................................
632020Q4Q4 20202020Q475610529741565728496336830685419636
642021Q1Q1 20212021Q16529768491497610466290762572349486
652021Q2Q2 20212021Q272410188231581675447316850629375503
662021Q3Q3 20212021Q37399899311584625485325869671377574
672021Q4Q4 20212021Q45327027301272417405250572474359437

68 rows × 15 columns

def cleanup_table_5a(df):
    """
    Table 5a: Number of Verified Residential Property Sales by Local Government District
    * Parse the 'Sale Year/Quarter' to two separate cols
    * Insert future-headers for Quarter and Year cols
    * Remove rows with 'total' in the first column
    * Disregard the 'Sale Year/Quarter' column
    * perform `basic_cleanup` with offset=2
    """
    # Safety first
    df=df.copy()

    # Extract 'Quarter' and 'Year' columns from the future 'Sale Year/Quarter' column
    dates = df.iloc[:,0].str.extract('(Q[1-4]) ([0-9]{4})').rename(columns={0:'Quarter',1:'Year'})
    for c in ['Quarter','Year']:# insert the dates in order, so they come out in reverse in the insert
        df.insert(1,c,dates[c])
        df.iloc[2,1]=c # Need to have the right colname for when `basic_cleanup` is called.

    # Remove 'total' rows from the future 'Sale Year/Quarter' column
    df=df[~df.iloc[:,0].str.contains('Total').fillna(False)]

    # Remove the 'Sale Year/Quarter' column all together
    df = df.iloc[:,1:]

    # Standard cleanup
    df = basic_cleanup(df, offset=2)

    return df

cleanup_table_5a(source_df['Table 5a'])

2PeriodYearQuarterAntrim and NewtownabbeyArds and North DownArmagh City, Banbridge and CraigavonBelfastCauseway Coast and GlensDerry City and StrabaneFermanagh and OmaghLisburn and CastlereaghMid and East AntrimMid UlsterNewry, Mourne and Down
02005Q12005Q1236320333623236226138219188176241
12005Q22005Q27358579611549712637316655618428505
22005Q32005Q37579609681722714632365654686403582
32005Q42005Q489399511991943834746385670759489711
42006Q12006Q176193310381686763708348600668515567
.............................................
632020Q42020Q475610529741565728496336830685419636
642021Q12021Q16529768491497610466290762572349486
652021Q22021Q272410188231581675447316850629375503
662021Q32021Q37399899311584625485325869671377574
672021Q42021Q45327027301272417405250572474359437

68 rows × 14 columns

dest_df['Table 5a']=cleanup_table_5a(source_df['Table 5a'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(13, 19)

Table 6: Standardised House Price & Index for all Urban and Rural areas in NI

Wee buns, thankfully. Still mixing the ‘HPI’ vs ‘Index’, but that’s a downstream problem

df = basic_cleanup(source_df['Table 6'])
df

1PeriodYearQuarterUrban Areas HPIUrban Areas Standardised PriceRural Areas HPIRural Areas Standardised Price
02005Q12005Q1101.309947107723.320891100.109860124292.601178
12005Q22005Q2104.402908111012.079786105.467951138865.721275
22005Q32005Q3111.163485118200.631818111.847591138865.721275
32005Q42005Q4114.871996122143.908606116.175119144238.615701
42006Q12006Q1118.187559125669.361667119.329374148154.818847
........................
632020Q42020Q4132.610763141005.619094133.854953166189.226014
642021Q12021Q1134.077654142565.370205135.267264167942.698911
652021Q22021Q2138.575881147348.355880140.501443174441.256673
662021Q32021Q3142.840470151882.912133144.695321179648.216283
672021Q42021Q4142.375033151388.010443146.115278181411.180623

68 rows × 7 columns

dest_df['Table 6']=basic_cleanup(source_df['Table 6'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(14, 18)

Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times

Nearly-wee-buns; but this one doesn’t have Year or Quarter headers, and the extra \n (Ref: Q1 2015) added, which will complicate downstream analysis if that changes over time…

df = source_df['Table 7'].copy()
df.head()

Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive timesUnnamed: 1Unnamed: 2Unnamed: 3Unnamed: 4Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Unnamed: 9Unnamed: 10
0Back to contentsNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1NaNNaNDrive time within 20mins of town of 10,000 or ...Drive time within 20mins of town of 10,000 or ...Drive time outside 20mins of town of 10,000 or...Drive time outside 20mins of town of 10,000 or...Drive time within 1hr of Belfast IndexDrive time within 1hr of Belfast Price\n(Ref:...Drive time outside 1hr of Belfast IndexDrive time outside 1hr of Belfast Price\n(Ref...NaN
22015Q1100124898.676844100122528.427865100128955.274996100111866.40498100.0
3NaNQ2103.166882128854.070701103.003978126209.155363103.025069132856.260679103.349406115613.265107100.0
4NaNQ3105.851629132207.28391105.619893129414.394046105.031061135443.093443107.811831120605.219276100.0
df.iloc[1,0] = 'Year'
df.iloc[1,1] = 'Quarter'
df.head()

Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive timesUnnamed: 1Unnamed: 2Unnamed: 3Unnamed: 4Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Unnamed: 9Unnamed: 10
0Back to contentsNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
1YearQuarterDrive time within 20mins of town of 10,000 or ...Drive time within 20mins of town of 10,000 or ...Drive time outside 20mins of town of 10,000 or...Drive time outside 20mins of town of 10,000 or...Drive time within 1hr of Belfast IndexDrive time within 1hr of Belfast Price\n(Ref:...Drive time outside 1hr of Belfast IndexDrive time outside 1hr of Belfast Price\n(Ref...NaN
22015Q1100124898.676844100122528.427865100128955.274996100111866.40498100.0
3NaNQ2103.166882128854.070701103.003978126209.155363103.025069132856.260679103.349406115613.265107100.0
4NaNQ3105.851629132207.28391105.619893129414.394046105.031061135443.093443107.811831120605.219276100.0
basic_cleanup(df).head()

1PeriodYearQuarterDrive time within 20mins of town of 10,000 or more IndexDrive time within 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)Drive time outside 20mins of town of 10,000 or more IndexDrive time outside 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)Drive time within 1hr of Belfast IndexDrive time within 1hr of Belfast Price\n(Ref: Q1 2015)Drive time outside 1hr of Belfast IndexDrive time outside 1hr of Belfast Price\n(Ref: Q1 2015)
02015Q12015Q1100.000000124898.676844100.000000122528.427865100.000000128955.274996100.000000111866.404980
12015Q22015Q2103.166882128854.070701103.003978126209.155363103.025069132856.260679103.349406115613.265107
22015Q32015Q3105.851629132207.283910105.619893129414.394046105.031061135443.093443107.811831120605.219276
32015Q42015Q4107.430656134179.467306106.924715131013.172436106.240145137002.270924110.075053123137.004353
42016Q12016Q1108.909364136026.354775108.368772132782.552750107.604887138762.178070111.828874125098.941485
def cleanup_table_7(df):
    """
    Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times
    * Insert Year/Quarter future-headers
    * Clean normally
    # TODO THIS MIGHT BE VALID FOR MULTIINDEXING ON DRIVETIME/[Index/Price]
    """
    df = df.copy()
    df.iloc[1,0] = 'Year'
    df.iloc[1,1] = 'Quarter'
    df = basic_cleanup(df)
    return df

cleanup_table_7(source_df['Table 7'])

1PeriodYearQuarterDrive time within 20mins of town of 10,000 or more IndexDrive time within 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)Drive time outside 20mins of town of 10,000 or more IndexDrive time outside 20mins of town of 10,000 or more Price\n(Ref: Q1 2015)Drive time within 1hr of Belfast IndexDrive time within 1hr of Belfast Price\n(Ref: Q1 2015)Drive time outside 1hr of Belfast IndexDrive time outside 1hr of Belfast Price\n(Ref: Q1 2015)
02015Q12015Q1100.000000124898.676844100.000000122528.427865100.000000128955.274996100.000000111866.404980
12015Q22015Q2103.166882128854.070701103.003978126209.155363103.025069132856.260679103.349406115613.265107
22015Q32015Q3105.851629132207.283910105.619893129414.394046105.031061135443.093443107.811831120605.219276
32015Q42015Q4107.430656134179.467306106.924715131013.172436106.240145137002.270924110.075053123137.004353
42016Q12016Q1108.909364136026.354775108.368772132782.552750107.604887138762.178070111.828874125098.941485
52016Q22016Q2111.263396138966.509219109.739250134461.778232110.208116142119.179594111.991819125281.221326
62016Q32016Q3113.419541141659.506269112.426034137753.851946112.202571144691.133778115.398242129091.864904
72016Q42016Q4113.928074142294.657346113.219995138726.680412112.508660145085.851672116.952798130830.890712
82017Q12017Q1114.262386142712.207695113.549623139130.567598112.823330145491.635911117.341538131265.759778
92017Q22017Q2115.566592144341.144812115.829688141924.295411114.288862147381.515712119.397027133565.161466
102017Q32017Q3116.716428145777.273752117.061832143434.022704115.192251148546.484147121.273582135664.395891
112017Q42017Q4117.925340147287.189812118.541541145247.086931116.101165149718.576638123.577530138241.740123
122018Q12018Q1118.482802147983.452250120.184585147260.282195117.972698152132.017714122.075418136561.381608
132018Q22018Q2119.443631149183.514842120.710551147904.740501117.686726151763.241043125.603705140508.349303
142018Q32018Q3121.408923151638.138779122.222994149757.912837119.614730154249.503782127.210974142306.342946
152018Q42018Q4123.531419154289.108214125.254013153471.772944121.462721156632.586327130.953652146493.143086
162019Q12019Q1122.499375153000.098716123.207618150964.357154120.087144154858.706540129.712835145105.085124
172019Q22019Q2124.397722155371.109292125.151589153346.274193122.486275157952.513361130.480634145963.994647
182019Q32019Q3126.533407158038.551430128.647747157630.061642124.978137161165.900455133.258945149071.990904
192019Q42019Q4127.126748158779.626458127.784267156572.053236124.547746160610.887802134.709059150694.181735
202020Q12020Q1127.090324158734.133127128.619521157595.476721124.744271160864.317472135.068006151095.722593
212020Q22020Q2127.200617158871.887068127.231209155894.400318125.021931161222.374474132.935126148709.746207
222020Q32020Q3129.627870161903.493901131.083667160614.756005127.873046164899.037745135.952621152085.309777
232020Q42020Q4132.853240165931.938294135.873737166483.954356130.725554168577.497418142.032974158887.182045
242021Q12021Q1134.414458167881.879606136.978231167837.272586132.479831170839.730568142.532053159445.483290
252021Q22021Q2139.417605174130.744141142.727536174881.805508137.737680177620.004609147.652942165174.037638
262021Q32021Q3143.303934178984.717757147.615067180870.420630141.482937182449.710341153.161532171336.300173
272021Q42021Q4144.731984180768.332630149.013971182584.475980142.365385183587.673223156.204293174740.127539
dest_df['Table 7'] = cleanup_table_7(source_df['Table 7'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(15, 17)

Table 8: Number of Verified Residential Property Sales of properties in urban and rural areas and properties in rural areas by drive times witihn towns of 10,000 or more and within 1 hour of Belfast

We’re now getting into the swing of this!

This one has two similar problems we’ve already seen; Munged Quarters/Years (this time with no header on that column…), and annual Total rows.

Vee must deeel with it

cleanup_table_5a(source_df['Table 8']).head()

2PeriodYearQuarterUrbanRuralDrive time within 20mins of town of 10,000 or moreDrive time outside 20mins of town of 10,000 or moreDrive time within 1hr of BelfastDrive time outside 1hr of Belfast
02015Q12015Q132941322898424976346
12015Q22015Q23789150010344661142358
22015Q32015Q34199164011454951250390
32015Q42015Q44396178012235571342438
42016Q12016Q14424173111715601263468
cleanup_table_8 = cleanup_table_5a
dest_df['Table 8'] = cleanup_table_8(source_df['Table 8'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(16, 16)

Table 9: NI Average Sales Prices Q1 2005 - Q4 2021

Wee buns

basic_cleanup(source_df['Table 9'])

1PeriodYearQuarterSimple MeanSimple MedianStandardised Price (HPI)
02005Q12005Q1115912.942222100000111920.268199
12005Q22005Q2120481.290591105000116004.031639
22005Q32005Q3128866.225917115000123386.352673
32005Q42005Q4129649.092074117000127674.143865
42006Q12006Q1132972.115070120000131302.064422
.....................
632020Q42020Q4171803.199843150000147474.561707
642021Q12021Q1176218.214924150000149084.306040
652021Q22021Q2184144.458946154950154323.134643
662021Q32021Q3173490.230508155000159028.118093
672021Q42021Q4159965.154863141000159150.737832

68 rows × 6 columns

dest_df['Table 9'] = basic_cleanup(source_df['Table 9'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(17, 15)

Table 9x: NI Average Sale Prices XXXXX Property Q1 2005 - Q4 2021

These are very similar to Tables 2x; i.e. they’re broken down by property type.

Annoyingly, they don’t follow the same structure as Tables 2x or Table 9 because they don’t include the Year/Quarter headers.

If that reminds you of anything, it’s because Table 7 was the same…

cleanup_table_7(source_df['Table 9a'])

1PeriodYearQuarterSimple MeanSimple MedianStandardised Price (HPI)
02005Q12005Q1166314.816092149972.5160428.832662
12005Q22005Q2173370.669076155000.0169686.542965
22005Q32005Q3185397.896739165000.0180696.666810
32005Q42005Q4186545.119355165000.0185323.883533
42006Q12006Q1191328.398119173000.0188669.361197
.....................
632020Q42020Q4243712.512641220000.0220592.113069
642021Q12021Q1254182.439174225000.0224872.989982
652021Q22021Q2268755.621299235000.0234734.715703
662021Q32021Q3245860.399289225000.0239101.239764
672021Q42021Q4244468.040738219000.0241131.373512

68 rows × 6 columns

cleanup_table_9x = cleanup_table_7
table9s = re.compile('Table 9[a-z]')
for table in source_df:
    if table9s.match(table):
        dest_df[table] = cleanup_table_9x(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(21, 11)

Table 10x: Number of Verified Residential Property Sales by Type in XXXXX

Surprisingly, we’re in the home straight; the remaining tables are all of the same structure, with familiar awkwardness.,,

  • Annual-Total Rows
  • Munged Year/Quarter Column
  • That column having a silly (but contextual) name
  • a different offset

Fortunately, we already have something like that from dealing with Table 5a!

source_df['Table 10a']

Table 10a: Number of Verified Residential Property Sales by Type in Antrim and Newtownabbey CouncilUnnamed: 1Unnamed: 2Unnamed: 3Unnamed: 4Unnamed: 5
0Please note figures for the 2 most recent quar...NaNNaNNaNNaNNaN
1Back to contentsNaNNaNNaNNaNNaN
2ANTRIM AND NEWTOWNABBEYApartmentsDetachedSemi-DetachedTerraceTotal
3Q1 200510617887236
4Q2 200546213216260735
.....................
83Q1 202152222212166652
84Q2 202156217275176724
85Q3 202147222268202739
86Q4 202150117176189532
872021 Total2057789317332647

88 rows × 6 columns

cleanup_table_5a(source_df['Table 10a'])

2PeriodYearQuarterApartmentsDetachedSemi-DetachedTerraceTotal
02005Q12005Q110617887236
12005Q22005Q246213216260735
22005Q32005Q346214238259757
32005Q42005Q465227270331893
42006Q12006Q148186231296761
...........................
632020Q42020Q453248268187756
642021Q12021Q152222212166652
652021Q22021Q256217275176724
662021Q32021Q347222268202739
672021Q42021Q450117176189532

68 rows × 8 columns

cleanup_table_10x = cleanup_table_5a
table10s = re.compile('Table 10[a-z]')
for table in source_df:
    if table10s.match(table):
        dest_df[table] = cleanup_table_10x(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(32, 0)

And We’re Done!

So, we can see that while government open data is a pain, at least it’s a … consistently inconsistent pain?

I hope this was helpful to someone else.

dest_df['Contents'] = source_df['Contents'][source_df['Contents']['Worksheet Name'].str.startswith('Table')]
with pd.ExcelWriter('NI Housing Price Index.xlsx') as writer:
    # Thankfully these are semantically sortable otherwise this would be a _massive_ pain
    for k,df in sorted(dest_df.items()):
        df.to_excel(writer, sheet_name=k)