Wrangling NI House Price Index Data
Andrew Bolster
Senior R&D Manager (Data Science) at Black Duck Software and Treasurer @ Bsides Belfast and NI OpenGovernment Network
This is a ‘messy’ ‘blog post’ that’s just a braindump of a notebook to step through NI House Price Index datasets I was playing around with.
It’s mostly code, so if you came here for some ‘insight’, feck aff.
There is no analysis here, this is just data wrangling.
TLDR As always, Government Open Data has over the years gone from ‘non-existent’ to ‘garbled’ to ‘inconsistent’, and I feel it’s now in the stage of ‘consistently inconsistent’, which is progress in my eyes.
Preamble
Code, move on.
from bs4 import BeautifulSoup
import pandas as pd
import requests
# Pull the latest page of https://www.finance-ni.gov.uk/publications/ni-house-price-index-statistical-reports and extract links
base_url = 'https://www.finance-ni.gov.uk/publications/ni-house-price-index-statistical-reports'
base_content = requests.get(base_url).content
base_soup = BeautifulSoup(base_content, 'html.parser')  # explicit parser avoids the bs4 warning
for a in base_soup.find_all('a'):
    if a.attrs.get('href', '').endswith('xlsx'):
        # Overwrites on each match, so we end up with the last xlsx link on the page
        source_name, source_url = a.contents[1], a.attrs['href']
source_df = pd.read_excel(source_url, sheet_name=None)  # Load all worksheets in
source_df.keys()
dict_keys(['Cover Sheet', 'Contents', 'Table 1', 'Table 2', 'Table 2a', 'Table 2b', 'Table 2c', 'Table 2d', 'Table 3', 'Table 3a', 'Table 3b', 'Table 3c', 'Table 4', 'Fig 5', 'Table 5', 'Table 5a', 'Fig 6', 'Table 6', 'Table 7', 'Table 8', 'Table 9', 'Table 9a', 'Table 9b', 'Table 9c', 'Table 9d', 'Table 10a', 'Table 10b', 'Table 10c', 'Table 10d', 'Table 10e', 'Table 10f', 'Table 10g', 'Table 10h', 'Table 10i', 'Table 10j', 'Table 10k'])
source_df['Contents']
Table of Contents | Unnamed: 1 | Unnamed: 2 | |
---|---|---|---|
0 | Worksheet Name | Frequency | House Price Index - Quarter 4 2021 |
1 | Table 1 | Quarterly | Table 1: NI HPI Trends Q1 2005 - Q4 2021 |
2 | Figure 1 | Quarterly | Figure 1: Graph of NI HPI Q1 2005 - Q4 2021 |
3 | Figure 1a | Quarterly | Figure 1a: Graph of Percentage Quarterly Chang... |
4 | Figure 1b | Quarterly | Figure 1b: Graph of Percentage Annual Change Q... |
... | ... | ... | ... |
58 | Table 10h | Quarterly | Table 10h: Number of Verified Residential Prop... |
59 | Table 10i | Quarterly | Table 10i: Number of Verified Residential Prop... |
60 | Table 10j | Quarterly | Table 10j: Number of Verified Residential Prop... |
61 | Table 10k | Quarterly | Table 10k: Number of Verified Residential Prop... |
62 | Figure 11 | Quarterly | Figure 11: Number of Verified Residential Prop... |
63 rows × 3 columns
Fix the Contents sheet to correctly reflect the Worksheet names
And fix the table headers and sheet-titles while we’re at it.
new_header = source_df['Contents'].iloc[0]
source_df['Contents'] = source_df['Contents'][1:]
# Use the old header row for column names, but call the last column 'Title'
source_df['Contents'].columns = [*new_header[:-1], 'Title']
[t for t in source_df['Contents']['Title'].values if t.startswith('Table')]
['Table 1: NI HPI Trends Q1 2005 - Q4 2021',
'Table 2: NI HPI & Standardised Price Statistics by Property Type Q4 2021',
'Table 2a: NI Detached Property Price Index Q1 2005 - Q4 2021',
'Table 2b: NI Semi-Detached Property Price Index Q1 2005 - Q4 2021',
'Table 2c: NI Terrace Property Price Index Q1 2005 - Q4 2021',
'Table 2d: NI Apartment Price Index Q1 2005 - Q4 2021',
'Table 3: NI HPI & Standardised Price Statistics by New/Existing Resold Dwelling Type Q4 2021',
'Table 3a: NI New Dwelling Price Index Q1 2005 - Q4 2021',
'Table 3b: NI Existing Resold Dwellings Price Index Q1 2005 - Q4 2021',
'Table 3c: Number of Verified Residential Property Sales by New/Existing Resold Dwellings Q1 2005 - Q2 2021',
'Table 4: Number of Verified Residential Property Sales Q1 2005 - Q4 2021',
'Table 5: HPI & Standardised Price for each Local Government District in NI',
'Table 5a: Number of Verified Residential Property Sales by Local Government District Q1 2005 - Q4 2021',
'Table 6: NI HPI & Standardised Price by Urban and Rural areas of Northern Ireland',
'Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times',
'Table 8: Number of Verified Residential Property Sales for Urban and Rural Areas of NI (Q1 2005 - Q4 2021) and Rural Areas of NI by drive times (Q1 2015 - Q4 2021)',
'Table 9: NI Average Sale Prices All Properties Q1 2005 - Q4 2021',
'Table 9a: NI Average Sale Prices Detached Properties Q1 2005 - Q4 2021',
'Table 9b: NI Average Sale Prices Semi-Detached Properties Q1 2005 - Q4 2021',
'Table 9c: NI Average Sale Prices Terrace Properties Q1 2005 - Q4 2021',
'Table 9d: NI Average Sale Prices Apartments Q1 2005 - Q4 2021',
'Table 10a: Number of Verified Residential Property Sales by Type in Antrim and Newtownabbey Council Q1 2005 - Q4 2021',
'Table 10b: Number of Verified Residential Property Sales by Type in Ards and North Down Council Q1 2005 - Q4 2021',
'Table 10c: Number of Verified Residential Property Sales by Type in Armagh City, Banbridge and Craigavon Council Q1 2005 - Q4 2021',
'Table 10d: Number of Verified Residential Property Sales by Type in Belfast Council Q1 2005 - Q4 2021',
'Table 10e: Number of Verified Residential Property Sales by Type in Causeway Coast and Glens Council Q1 2005 - Q4 2021',
'Table 10f: Number of Verified Residential Property Sales by Type in Derry City and Strabane Council Q1 2005 - Q4 2021',
'Table 10g: Number of Verified Residential Property Sales by Type in Fermanagh and Omagh Council Q1 2005 - Q4 2021',
'Table 10h: Number of Verified Residential Property Sales by Type in Lisburn and Castlereagh Council Q1 2005 - Q4 2021',
'Table 10i: Number of Verified Residential Property Sales by Type in Mid and East Antrim Council Q1 2005 - Q4 2021',
'Table 10j: Number of Verified Residential Property Sales by Type in Mid Ulster Council Q1 2005 - Q4 2021',
'Table 10k: Number of Verified Residential Property Sales by Type in Newry, Mourne and Down Council Q1 2005 - Q4 2021']
# Replace 'Figure' with 'Fig' in 'Worksheet Name'
with pd.option_context('mode.chained_assignment', None):
    source_df['Contents']['Worksheet Name'] = source_df['Contents']['Worksheet Name'].str.replace('Figure', 'Fig')
Tidy up Data
General Methodology
Ignore figure data (pretty much completely…)
Tables have more or less the same structure; a header on excel row 3 (frame index 1), and a Year/Quarter ‘index’ on time series (otherwise a categorical index; see Table 2, Table 3).
Some time-series tables also have totals subsections, so these should be a) validated and b) ignored.
Any columns with no header in excel row 3 (frame index 1) should be ignored (usually text notes).
Operate sequentially (i.e. Table 1, Table 2, Table 2a; don’t skip, even if it’s tempting).
Use keys from ‘Contents’ to describe data, but beware they may be suffixed by a date which could change between datasets!
There are also some columns that look like checksums, so if there is an ‘NI’ column, or a data column where all valid values are ‘100’, delete it.
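That last rule (dropping checksum-ish columns) can be sketched as a small standalone helper. `drop_checksum_columns` is a hypothetical name, and this is just the idea, not the final cleanup code:

```python
import pandas as pd

def drop_checksum_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop any numeric column whose valid (non-NaN) values are all 100;
    these look like checksums rather than data."""
    checksum_cols = [
        c for c in df.columns
        if pd.api.types.is_numeric_dtype(df[c])
        and df[c].notna().any()
        and (df[c].dropna() == 100).all()
    ]
    return df.drop(columns=checksum_cols)
```

The `notna().any()` guard stops an all-NaN column from counting as "all 100" (a `.all()` over an empty series is vacuously True).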
Table 1: NI HPI Trends Q1 2005 - Q4 2021
TODO: Regexy way to get rid of the ‘QX YYYY - QX YYYY’ tail
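A hedged stab at that TODO (hypothetical helper name `strip_period_tail`; assumes the tails always look like ‘Qx YYYY - Qx YYYY’, or occasionally a lone ‘Qx YYYY’):

```python
import re

def strip_period_tail(title: str) -> str:
    """Strip a trailing 'Qx YYYY - Qx YYYY' (or lone 'Qx YYYY') range
    from a table title, so the key stays stable between releases."""
    return re.sub(r'\s*Q[1-4] \d{4}(\s*-\s*Q[1-4] \d{4})?$', '', title)
```

Titles like Table 8’s, where the range is wrapped in parentheses, would slip through this; cross that bridge if it ever matters.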
source_df['Table 1']
Table 1: NI House Price Index, Standardised Price and Quarterly and Annual Change | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | |
---|---|---|---|---|---|---|---|---|
0 | Back to contents | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Year | Quarter | NI House Price Index | NI House Standardised Price | Quarterly Change | Annual Change | NI | NaN |
2 | 2005 | Q1 | 100.883607 | 111920.268199 | NaN | NaN | 100 | 0.0 |
3 | NaN | Q2 | 104.564663 | 116004.031639 | 0.036488 | NaN | 100 | 0.0 |
4 | NaN | Q3 | 111.219 | 123386.352673 | 0.063638 | NaN | 100 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
89 | NaN | NaN | NaN | NaN | NaN | NaN | 100 | 0.0 |
90 | NaN | NaN | NaN | NaN | NaN | NaN | 100 | 0.0 |
91 | NaN | NaN | NaN | NaN | NaN | NaN | 100 | 0.0 |
92 | NaN | NaN | NaN | NaN | NaN | NaN | 100 | 0.0 |
93 | NaN | NaN | NaN | NaN | NaN | NaN | 100 | 0.0 |
94 rows × 8 columns
def basic_cleanup(df: pd.DataFrame, offset=1) -> pd.DataFrame:
    df = df.copy()
    # Re-header from frame index `offset` (which was row 3 in excel)
    new_header = df.iloc[offset]
    df = df.iloc[offset+1:]
    df.columns = new_header
    # Remove 'NaN' trailing columns
    df = df[df.columns[pd.notna(df.columns)]]
    # 'NI' is a usually hidden column that appears to be a checksum;
    # if it's all there and all 100, remove it, otherwise, complain.
    # (Note: need to change this 'if' logic to just 'if there's a
    # column with all 100s', but cross that bridge later)
    if 'NI' in df:
        assert df['NI'].all() and df['NI'].mean() == 100, "Not all values in df['NI'] == 100"
        df = df.drop('NI', axis=1)
    # Strip rows below the first all-nan row, if there is one
    # (Otherwise this truncates the table, as there is no
    # idxmax in a table of all 'False's)
    if any(df.isna().all(axis=1)):
        idx_first_bad_row = df.isna().all(axis=1).idxmax()
        df = df.loc[:idx_first_bad_row-1]
    # By inspection, other tables use 'Sale Year' and 'Sale Quarter'
    if set(df.keys()).issuperset({'Sale Year', 'Sale Quarter'}):
        df = df.rename(columns={
            'Sale Year': 'Year',
            'Sale Quarter': 'Quarter'
        })
    # For 'Year'/'Quarter' indexed pages, there is an implied Year
    # in Q2-Q4, so fill it downwards
    if set(df.keys()).issuperset({'Year', 'Quarter'}):
        df['Year'] = df['Year'].astype(float).ffill().astype(int)
        # In Pandas we can represent Y/Q combinations as proper datetimes
        # https://stackoverflow.com/questions/53898482/clean-way-to-convert-quarterly-periods-to-datetime-in-pandas
        df.insert(loc=0,
                  column='Period',
                  value=pd.PeriodIndex(df.apply(lambda r: f'{r.Year}-{r.Quarter}', axis=1), freq='Q'))
    # Reset index, try to fix dtypes, etc. (this should be the last
    # operation before returning!)
    df = df.reset_index(drop=True).infer_objects()
    return df
df = basic_cleanup(source_df['Table 1'])
df
1 | Period | Year | Quarter | NI House Price Index | NI House Standardised Price | Quarterly Change | Annual Change |
---|---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 100.883607 | 111920.268199 | NaN | NaN |
1 | 2005Q2 | 2005 | Q2 | 104.564663 | 116004.031639 | 0.036488 | NaN |
2 | 2005Q3 | 2005 | Q3 | 111.219000 | 123386.352673 | 0.063638 | NaN |
3 | 2005Q4 | 2005 | Q4 | 115.083964 | 127674.143865 | 0.034751 | NaN |
4 | 2006Q1 | 2006 | Q1 | 118.354129 | 131302.064422 | 0.028415 | 0.173175 |
... | ... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 132.931827 | 147474.561707 | 0.026103 | 0.052326 |
64 | 2021Q1 | 2021 | Q1 | 134.382831 | 149084.306040 | 0.010915 | 0.059421 |
65 | 2021Q2 | 2021 | Q2 | 139.105050 | 154323.134643 | 0.035140 | 0.095724 |
66 | 2021Q3 | 2021 | Q3 | 143.346066 | 159028.118093 | 0.030488 | 0.106491 |
67 | 2021Q4 | 2021 | Q4 | 143.456594 | 159150.737832 | 0.000771 | 0.079174 |
68 rows × 7 columns
dest_df = {
'Table 1': basic_cleanup(source_df['Table 1'])
}
len([k for k in source_df.keys() if k.startswith('Table')])
32
One down, 31 to go…
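Before grinding through table-by-table, one cheap trick is to throw `basic_cleanup` at everything and see what sticks. A sketch, using a hypothetical `survey_tables` helper so the sheet dicts and cleanup function are passed in explicitly:

```python
def survey_tables(source, done, cleanup):
    """Try `cleanup` on every 'Table ...' sheet not already in `done`;
    return (names that cleaned up fine, names that raised)."""
    ok, failed = [], []
    for name in source:
        if not name.startswith('Table') or name in done:
            continue
        try:
            cleanup(source[name])
            ok.append(name)
        except Exception:
            failed.append(name)
    return ok, failed
```

Something like `ok, failed = survey_tables(source_df, dest_df, basic_cleanup)` would flag which of the remaining 31 need special treatment (though ‘didn’t raise’ is not the same as ‘came out sensible’, so the sequential eyeballing below still stands).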
Table 2: NI HPI & Standardised Price Statistics by Property Type Q4 2021
df = basic_cleanup(source_df['Table 2'])
df
1 | Property Type | Index\n(Quarter 4 2021) | Percentage Change on Previous Quarter | Percentage Change over 12 months | Standardised Price\n(Quarter 4 2021) |
---|---|---|---|---|---|
0 | Detached | 143.488806 | 0.008491 | 0.093110 | 241131.373512 |
1 | Semi-Detached | 140.680694 | 0.004211 | 0.076953 | 153718.543755 |
2 | Terrace | 149.564169 | -0.009577 | 0.078758 | 112831.710806 |
3 | Apartment | 133.376791 | -0.014732 | 0.032761 | 116554.228620 |
4 | All | 143.456594 | 0.000771 | 0.079174 | 159150.737832 |
Those ‘\n (Quarter 4 2021)’ entries are unnecessary, so for this table, let’s clear them
df.columns = [c.split('\n')[0] for c in df.columns]
df
Property Type | Index | Percentage Change on Previous Quarter | Percentage Change over 12 months | Standardised Price | |
---|---|---|---|---|---|
0 | Detached | 143.488806 | 0.008491 | 0.093110 | 241131.373512 |
1 | Semi-Detached | 140.680694 | 0.004211 | 0.076953 | 153718.543755 |
2 | Terrace | 149.564169 | -0.009577 | 0.078758 | 112831.710806 |
3 | Apartment | 133.376791 | -0.014732 | 0.032761 | 116554.228620 |
4 | All | 143.456594 | 0.000771 | 0.079174 | 159150.737832 |
dest_df['Table 2'] = df
Table 2a: NI Detached Property Price Index Q1 2005 - Q4 2021
df = basic_cleanup(source_df['Table 2a'])
df
1 | Period | Year | Quarter | NI Detached Property Price Index | NI Detached Property Standardised Price | Quarterly Change | Annual Change |
---|---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 95.465560 | 160428.832662 | NaN | NaN |
1 | 2005Q2 | 2005 | Q2 | 100.974498 | 169686.542965 | 0.057706 | NaN |
2 | 2005Q3 | 2005 | Q3 | 107.526236 | 180696.666810 | 0.064885 | NaN |
3 | 2005Q4 | 2005 | Q4 | 110.279730 | 185323.883533 | 0.025608 | NaN |
4 | 2006Q1 | 2006 | Q1 | 112.270506 | 188669.361197 | 0.018052 | 0.176032 |
... | ... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 131.266614 | 220592.113069 | 0.026393 | 0.055357 |
64 | 2021Q1 | 2021 | Q1 | 133.814014 | 224872.989982 | 0.019406 | 0.071429 |
65 | 2021Q2 | 2021 | Q2 | 139.682380 | 234734.715703 | 0.043855 | 0.129844 |
66 | 2021Q3 | 2021 | Q3 | 142.280745 | 239101.239764 | 0.018602 | 0.112515 |
67 | 2021Q4 | 2021 | Q4 | 143.488806 | 241131.373512 | 0.008491 | 0.093110 |
68 rows × 7 columns
Table 2x: NI XXX Property Price Index Q1 2005 - Q4 2021
This table structure is consistent across the rest of the Table 2x cohort, mapping to the Property Types listed in Table 2.
For the time being, we can ignore these, but this will probably become a pain later on…
dest_df['Table 2']['Property Type']
0 Detached
1 Semi-Detached
2 Terrace
3 Apartment
4 All
Name: Property Type, dtype: object
import re
table2s = re.compile('Table 2[a-z]')
assert table2s.match('Table 2') is None, 'Table 2 is matching itself!'
assert table2s.match('Table 20') is None, 'Table 2 is greedy!'
assert table2s.match('Table 2z') is not None, 'Table 2 is matching incorrectly!'
for table in source_df:
    if table2s.match(table):
        dest_df[table] = basic_cleanup(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(6, 26)
6 down, 26 to go.
Table 3: NI HPI & Standardised Price Statistics by New/Existing Resold Dwelling Type Q4 2021
These appear to share a similar structure with the Table 2s… hopefully
df = basic_cleanup(source_df['Table 3'])
df.columns = [c.split('\n')[0] for c in df.columns] # Stolen from Table 2 Treatment
df
Property Type | Index | Percentage Change on Previous Quarter | Percentage Change over 12 months | Standardised Price | |
---|---|---|---|---|---|
0 | New | 141.769973 | 0.024877 | 0.072609 | 185966.524090 |
1 | Existing Resold | 143.518977 | -0.004918 | 0.080771 | 152275.828046 |
2 | All | 143.456594 | 0.000771 | 0.079174 | 159150.737832 |
dest_df['Table 3'] = df
df = basic_cleanup(source_df['Table 3a'])
df
1 | Period | Year | Quarter | NI New Dwellings Price Index | NI New Dwellings Standardised Price | Quarterly Change | Annual Change |
---|---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 95.804706 | 125671.662611 | NaN | NaN |
1 | 2005Q2 | 2005 | Q2 | 101.229223 | 132787.263460 | 0.056621 | NaN |
2 | 2005Q3 | 2005 | Q3 | 106.243580 | 139364.837967 | 0.049535 | NaN |
3 | 2005Q4 | 2005 | Q4 | 110.118105 | 144447.239874 | 0.036468 | NaN |
4 | 2006Q1 | 2006 | Q1 | 113.624410 | 149046.629634 | 0.031841 | 0.186000 |
... | ... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 132.173052 | 173377.779440 | 0.004103 | 0.036125 |
64 | 2021Q1 | 2021 | Q1 | 133.772562 | 175475.933612 | 0.012102 | 0.027916 |
65 | 2021Q2 | 2021 | Q2 | 136.969311 | 179669.264190 | 0.023897 | 0.046474 |
66 | 2021Q3 | 2021 | Q3 | 138.328776 | 181452.540106 | 0.009925 | 0.050867 |
67 | 2021Q4 | 2021 | Q4 | 141.769973 | 185966.524090 | 0.024877 | 0.072609 |
68 rows × 7 columns
table3s = re.compile('Table 3[a-z]')
for table in source_df:
    if table3s.match(table):
        dest_df[table] = basic_cleanup(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(10, 22)
Table 4: Number of Verified Residential Property Sales Q1 2005 - Q4 2021
Table 4 is not looking great
df = source_df['Table 4']
df
Table 4: Number of Verified Residential Property Sales | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | |
---|---|---|---|---|---|---|---|---|
0 | Verified Sales = Sales matched to a property i... | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Please note figures for the 2 most recent quar... | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Back to contents | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Sale Year | Sale Quarter | Detached | Semi-Detached | Terrace | Apartment | Total | NaN |
4 | 2005\n | Quarter 1 | 809 | 894 | 1035 | 198 | 2936 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
84 | 2021 | Quarter 1 | 2509 | 2477 | 1962 | 561 | 7509 | NaN |
85 | NaN | Quarter 2 | 2668 | 2613 | 2056 | 604 | 7941 | NaN |
86 | NaN | Quarter 3 | 2519 | 2797 | 2220 | 633 | 8169 | Please note this figure is provisional and wil... |
87 | NaN | Quarter 4 | 1478 | 2100 | 2057 | 515 | 6150 | and new dwellings sold in this quarter being a... |
88 | NaN | 2021 Total | 9174 | 9987 | 8295 | 2313 | 29769 | NaN |
89 rows × 8 columns
Of note: a new offset for the header row at index 3 instead of index 1, thanks to lots of fluff at the start that is probably not going to be consistent between reports, so that will almost certainly mess up my day in a few months.
Also, quarter labels have been shifted to ‘Quarter 1’ instead of ‘Q1’, which … meh 🤷♂️. More egregiously, it looks like ‘\n’ has leaked into some Sale Year values. Funtimes.
Finally, and possibly most annoyingly, the introduction of partial total rows is going to throw things off, and since this isn’t a validation study, stuff ’em.
In an effort not to over-complicate basic_cleanup, we can try to clean these table-specific issues first:
df.iloc[:,1]=df.iloc[:,1].str.replace('Quarter ([1-4])',r'Q\1', regex=True)
df
Table 4: Number of Verified Residential Property Sales | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | |
---|---|---|---|---|---|---|---|---|
0 | Verified Sales = Sales matched to a property i... | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Please note figures for the 2 most recent quar... | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Back to contents | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Sale Year | Sale Quarter | Detached | Semi-Detached | Terrace | Apartment | Total | NaN |
4 | 2005\n | Q1 | 809 | 894 | 1035 | 198 | 2936 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
84 | 2021 | Q1 | 2509 | 2477 | 1962 | 561 | 7509 | NaN |
85 | NaN | Q2 | 2668 | 2613 | 2056 | 604 | 7941 | NaN |
86 | NaN | Q3 | 2519 | 2797 | 2220 | 633 | 8169 | Please note this figure is provisional and wil... |
87 | NaN | Q4 | 1478 | 2100 | 2057 | 515 | 6150 | and new dwellings sold in this quarter being a... |
88 | NaN | 2021 Total | 9174 | 9987 | 8295 | 2313 | 29769 | NaN |
89 rows × 8 columns
df = df[~df.iloc[:,1].str.contains('Total').fillna(False)]
# Lose the year new-lines (needs astype because non-str lines are
# correctly inferred to be ints, so .str methods nan-out)
with pd.option_context('mode.chained_assignment', None):
    df.iloc[:,0] = df.iloc[:,0].astype(str).str.replace('\n','')
df
Table 4: Number of Verified Residential Property Sales | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | |
---|---|---|---|---|---|---|---|---|
0 | Verified Sales = Sales matched to a property i... | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Please note figures for the 2 most recent quar... | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Back to contents | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Sale Year | Sale Quarter | Detached | Semi-Detached | Terrace | Apartment | Total | NaN |
4 | 2005 | Q1 | 809 | 894 | 1035 | 198 | 2936 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
82 | nan | Q4 | 2808 | 2944 | 2170 | 555 | 8477 | NaN |
84 | 2021 | Q1 | 2509 | 2477 | 1962 | 561 | 7509 | NaN |
85 | nan | Q2 | 2668 | 2613 | 2056 | 604 | 7941 | NaN |
86 | nan | Q3 | 2519 | 2797 | 2220 | 633 | 8169 | Please note this figure is provisional and wil... |
87 | nan | Q4 | 1478 | 2100 | 2057 | 515 | 6150 | and new dwellings sold in this quarter being a... |
72 rows × 8 columns
basic_cleanup(df, offset=3)
3 | Period | Year | Quarter | Detached | Semi-Detached | Terrace | Apartment | Total |
---|---|---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 809 | 894 | 1035 | 198 | 2936 |
1 | 2005Q2 | 2005 | Q2 | 2208 | 2474 | 2808 | 483 | 7973 |
2 | 2005Q3 | 2005 | Q3 | 2297 | 2655 | 2952 | 539 | 8443 |
3 | 2005Q4 | 2005 | Q4 | 2498 | 3003 | 3492 | 631 | 9624 |
4 | 2006Q1 | 2006 | Q1 | 2185 | 2650 | 3158 | 594 | 8587 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 2808 | 2944 | 2170 | 555 | 8477 |
64 | 2021Q1 | 2021 | Q1 | 2509 | 2477 | 1962 | 561 | 7509 |
65 | 2021Q2 | 2021 | Q2 | 2668 | 2613 | 2056 | 604 | 7941 |
66 | 2021Q3 | 2021 | Q3 | 2519 | 2797 | 2220 | 633 | 8169 |
67 | 2021Q4 | 2021 | Q4 | 1478 | 2100 | 2057 | 515 | 6150 |
68 rows × 8 columns
That’s awkward enough to get its own function…
def cleanup_table_4(df):
    """
    Table 4: Number of Verified Residential Property Sales
    * Regex 'Quarter X' to 'QX' in the future 'Sale Quarter' column
    * Drop Year Total rows
    * Clear any newlines from the future 'Sale Year' column
    * Call `basic_cleanup` with offset=3
    """
    df.iloc[:,1] = df.iloc[:,1].str.replace('Quarter ([1-4])', r'Q\1', regex=True)
    df = df[~df.iloc[:,1].str.contains('Total').fillna(False)]
    # Lose the year new-lines (needs astype because non-str lines are
    # correctly inferred to be ints, so .str methods nan-out)
    with pd.option_context('mode.chained_assignment', None):
        df.iloc[:,0] = df.iloc[:,0].astype(str).str.replace('\n','')
    return basic_cleanup(df, offset=3)
cleanup_table_4(source_df['Table 4'].copy())
3 | Period | Year | Quarter | Detached | Semi-Detached | Terrace | Apartment | Total |
---|---|---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 809 | 894 | 1035 | 198 | 2936 |
1 | 2005Q2 | 2005 | Q2 | 2208 | 2474 | 2808 | 483 | 7973 |
2 | 2005Q3 | 2005 | Q3 | 2297 | 2655 | 2952 | 539 | 8443 |
3 | 2005Q4 | 2005 | Q4 | 2498 | 3003 | 3492 | 631 | 9624 |
4 | 2006Q1 | 2006 | Q1 | 2185 | 2650 | 3158 | 594 | 8587 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 2808 | 2944 | 2170 | 555 | 8477 |
64 | 2021Q1 | 2021 | Q1 | 2509 | 2477 | 1962 | 561 | 7509 |
65 | 2021Q2 | 2021 | Q2 | 2668 | 2613 | 2056 | 604 | 7941 |
66 | 2021Q3 | 2021 | Q3 | 2519 | 2797 | 2220 | 633 | 8169 |
67 | 2021Q4 | 2021 | Q4 | 1478 | 2100 | 2057 | 515 | 6150 |
68 rows × 8 columns
dest_df['Table 4'] = cleanup_table_4(source_df['Table 4'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(11, 21)
Table 5: HPI & Standardised Price for each Local Government District in NI
This nearly works, but structurally it requires multi-index columns to make sense…
df = basic_cleanup(source_df['Table 5'])
df
1 | Period | Year | Quarter | Antrim and Newtownabbey HPI | Antrim and Newtownabbey Standardised Price | Ards and North Down HPI | Ards and North Down Standardised Price | Armagh City, Banbridge and Craigavon HPI | Armagh City, Banbridge and Craigavon Standardised Price | Belfast HPI | ... | Fermanagh and Omagh HPI | Fermanagh and Omagh Standardised Price | Lisburn and Castlereagh HPI | Lisburn and Castlereagh Standardised Price | Mid and East Antrim HPI | Mid and East Antrim Standardised Price | Mid Ulster Standardised HPI | Mid Ulster Standardised Price | Newry, Mourne and Down HPI | Newry, Mourne and Down Standardised Price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 99.903277 | 114851.528270 | 97.150602 | 130398.569667 | 102.245597 | 100785.145986 | 99.839849 | ... | 109.429237 | 104874.980231 | 95.958322 | 128828.327513 | 102.246427 | 105865.408901 | 102.714778 | 114882.211239 | 100.810773 | 113420.880186 |
1 | 2005Q2 | 2005 | Q2 | 99.723509 | 114644.862732 | 100.794472 | 137133.037807 | 106.325843 | 104807.109982 | 100.589870 | ... | 117.239850 | 112360.529330 | 100.164437 | 134475.225477 | 104.443325 | 108140.065924 | 110.386311 | 123462.501283 | 111.965743 | 125971.191415 |
2 | 2005Q3 | 2005 | Q3 | 107.940849 | 124091.740608 | 102.167971 | 137133.037807 | 110.006212 | 108434.910333 | 109.614861 | ... | 125.900145 | 120660.397585 | 106.757895 | 143327.237126 | 112.748278 | 116738.970434 | 117.595723 | 131525.929577 | 117.235685 | 131900.333698 |
3 | 2005Q4 | 2005 | Q4 | 111.934696 | 128683.175719 | 106.396379 | 142808.538807 | 116.073031 | 114415.072260 | 110.728237 | ... | 130.781315 | 125338.422216 | 111.307116 | 149434.769200 | 114.584090 | 118639.759900 | 121.851999 | 136286.397473 | 123.628047 | 139092.296651 |
4 | 2006Q1 | 2006 | Q1 | 113.494351 | 130476.197845 | 109.206160 | 146579.915492 | 121.831058 | 120090.852733 | 112.326582 | ... | 135.555749 | 129914.152078 | 110.539212 | 148403.823796 | 115.878975 | 119980.477260 | 129.544702 | 144890.365875 | 127.449613 | 143391.890242 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 131.045293 | 150653.238745 | 123.824862 | 166201.593253 | 130.857866 | 128988.723586 | 133.525177 | ... | 142.381814 | 136456.127817 | 129.797418 | 174258.823716 | 130.813697 | 135443.809729 | 124.617456 | 139379.446212 | 138.815696 | 156179.721555 |
64 | 2021Q1 | 2021 | Q1 | 133.481101 | 153453.510344 | 128.398202 | 172340.072904 | 130.382658 | 128520.303209 | 135.257679 | ... | 141.464114 | 135576.621629 | 131.166305 | 176096.615474 | 131.268719 | 135914.936888 | 127.657615 | 142779.737045 | 138.481347 | 155803.549899 |
65 | 2021Q2 | 2021 | Q2 | 137.827568 | 158450.326506 | 130.543464 | 175219.510303 | 137.439265 | 135476.115278 | 138.558752 | ... | 146.059468 | 139980.724158 | 134.815374 | 180995.654429 | 138.663140 | 143571.081234 | 129.231669 | 144540.250870 | 145.000829 | 163138.533592 |
66 | 2021Q3 | 2021 | Q3 | 141.167257 | 162289.724156 | 134.116570 | 180015.444071 | 143.522521 | 141472.480114 | 142.250634 | ... | 149.743153 | 143511.101233 | 136.178634 | 182825.891020 | 140.051373 | 145008.450168 | 135.759680 | 151841.560426 | 152.177814 | 171213.265699 |
67 | 2021Q4 | 2021 | Q4 | 141.900751 | 163132.969278 | 134.883745 | 181045.170599 | 140.957739 | 138944.332704 | 140.072076 | ... | 156.033991 | 149540.124905 | 137.683726 | 184846.544332 | 141.311289 | 146312.960523 | 136.847809 | 153058.587031 | 152.044892 | 171063.717288 |
68 rows × 25 columns
# Two inner-columns per LGD
lgds = df.columns[3:].str.replace(' HPI','').str.replace(' Standardised Price','').unique()
lgds
Index(['Antrim and Newtownabbey', 'Ards and North Down',
'Armagh City, Banbridge and Craigavon', 'Belfast',
'Causeway Coast and Glens', 'Derry City and Strabane',
'Fermanagh and Omagh', 'Lisburn and Castlereagh', 'Mid and East Antrim',
'Mid Ulster Standardised', 'Mid Ulster', 'Newry, Mourne and Down'],
dtype='object', name=1)
For some reason, Mid Ulster has a ‘Standardised HPI’ column header, which throws off the above trick, so we gotta make it ugly…
lgds = df.columns[3:].str.replace(' Standardised HPI',' HPI')\
.str.replace(' HPI','')\
.str.replace(' Standardised Price','').unique()
lgds
Index(['Antrim and Newtownabbey', 'Ards and North Down',
'Armagh City, Banbridge and Craigavon', 'Belfast',
'Causeway Coast and Glens', 'Derry City and Strabane',
'Fermanagh and Omagh', 'Lisburn and Castlereagh', 'Mid and East Antrim',
'Mid Ulster', 'Newry, Mourne and Down'],
dtype='object', name=1)
df.columns = [*df.columns[:3], *pd.MultiIndex.from_product([lgds,['Index','Price']], names=['LGD','Metric'])]
df
Period | Year | Quarter | (Antrim and Newtownabbey, Index) | (Antrim and Newtownabbey, Price) | (Ards and North Down, Index) | (Ards and North Down, Price) | (Armagh City, Banbridge and Craigavon, Index) | (Armagh City, Banbridge and Craigavon, Price) | (Belfast, Index) | ... | (Fermanagh and Omagh, Index) | (Fermanagh and Omagh, Price) | (Lisburn and Castlereagh, Index) | (Lisburn and Castlereagh, Price) | (Mid and East Antrim, Index) | (Mid and East Antrim, Price) | (Mid Ulster, Index) | (Mid Ulster, Price) | (Newry, Mourne and Down, Index) | (Newry, Mourne and Down, Price) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 99.903277 | 114851.528270 | 97.150602 | 130398.569667 | 102.245597 | 100785.145986 | 99.839849 | ... | 109.429237 | 104874.980231 | 95.958322 | 128828.327513 | 102.246427 | 105865.408901 | 102.714778 | 114882.211239 | 100.810773 | 113420.880186 |
1 | 2005Q2 | 2005 | Q2 | 99.723509 | 114644.862732 | 100.794472 | 137133.037807 | 106.325843 | 104807.109982 | 100.589870 | ... | 117.239850 | 112360.529330 | 100.164437 | 134475.225477 | 104.443325 | 108140.065924 | 110.386311 | 123462.501283 | 111.965743 | 125971.191415 |
2 | 2005Q3 | 2005 | Q3 | 107.940849 | 124091.740608 | 102.167971 | 137133.037807 | 110.006212 | 108434.910333 | 109.614861 | ... | 125.900145 | 120660.397585 | 106.757895 | 143327.237126 | 112.748278 | 116738.970434 | 117.595723 | 131525.929577 | 117.235685 | 131900.333698 |
3 | 2005Q4 | 2005 | Q4 | 111.934696 | 128683.175719 | 106.396379 | 142808.538807 | 116.073031 | 114415.072260 | 110.728237 | ... | 130.781315 | 125338.422216 | 111.307116 | 149434.769200 | 114.584090 | 118639.759900 | 121.851999 | 136286.397473 | 123.628047 | 139092.296651 |
4 | 2006Q1 | 2006 | Q1 | 113.494351 | 130476.197845 | 109.206160 | 146579.915492 | 121.831058 | 120090.852733 | 112.326582 | ... | 135.555749 | 129914.152078 | 110.539212 | 148403.823796 | 115.878975 | 119980.477260 | 129.544702 | 144890.365875 | 127.449613 | 143391.890242 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 131.045293 | 150653.238745 | 123.824862 | 166201.593253 | 130.857866 | 128988.723586 | 133.525177 | ... | 142.381814 | 136456.127817 | 129.797418 | 174258.823716 | 130.813697 | 135443.809729 | 124.617456 | 139379.446212 | 138.815696 | 156179.721555 |
64 | 2021Q1 | 2021 | Q1 | 133.481101 | 153453.510344 | 128.398202 | 172340.072904 | 130.382658 | 128520.303209 | 135.257679 | ... | 141.464114 | 135576.621629 | 131.166305 | 176096.615474 | 131.268719 | 135914.936888 | 127.657615 | 142779.737045 | 138.481347 | 155803.549899 |
65 | 2021Q2 | 2021 | Q2 | 137.827568 | 158450.326506 | 130.543464 | 175219.510303 | 137.439265 | 135476.115278 | 138.558752 | ... | 146.059468 | 139980.724158 | 134.815374 | 180995.654429 | 138.663140 | 143571.081234 | 129.231669 | 144540.250870 | 145.000829 | 163138.533592 |
66 | 2021Q3 | 2021 | Q3 | 141.167257 | 162289.724156 | 134.116570 | 180015.444071 | 143.522521 | 141472.480114 | 142.250634 | ... | 149.743153 | 143511.101233 | 136.178634 | 182825.891020 | 140.051373 | 145008.450168 | 135.759680 | 151841.560426 | 152.177814 | 171213.265699 |
67 | 2021Q4 | 2021 | Q4 | 141.900751 | 163132.969278 | 134.883745 | 181045.170599 | 140.957739 | 138944.332704 | 140.072076 | ... | 156.033991 | 149540.124905 | 137.683726 | 184846.544332 | 141.311289 | 146312.960523 | 136.847809 | 153058.587031 | 152.044892 | 171063.717288 |
68 rows × 25 columns
We could turn this into a proper MultiIndex, but that would mean pushing the Period/Year/Quarter columns into the index, which would be inconsistent with the rest of the ‘cleaned’ dataset; that can be a downstream problem. At least we’ve got the relevant metrics consistent!
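If a downstream consumer does want the long/tidy version, it’s a one-off unpivot. A sketch (hypothetical `melt_lgd_table` helper; assumes the (LGD, Metric) tuple columns built above):

```python
import pandas as pd

def melt_lgd_table(df: pd.DataFrame) -> pd.DataFrame:
    """Unpivot the cleaned Table 5 layout (Period/Year/Quarter plus
    (LGD, Metric) tuple columns) into one row per Period x LGD x Metric."""
    id_cols = list(df.columns[:3])  # Period, Year, Quarter
    pieces = []
    for lgd, metric in df.columns[3:]:
        piece = df[id_cols].copy()
        piece['LGD'] = lgd
        piece['Metric'] = metric
        piece['Value'] = df[(lgd, metric)].values
        pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)
```

That keeps Period as an ordinary column (consistent with the other cleaned tables) while making the LGD and Metric dimensions groupby-able.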
def cleanup_table_5(df):
    """
    Table 5: HPI & Standardised Price for each Local Government District in NI
    * Call `basic_cleanup`
    * Collapse the 'HPI'/'Standardised HPI'/'Standardised Price' column
      suffixes into (LGD, Metric) column pairs
    """
    # Basic Cleanup first
    df = basic_cleanup(df)
    # Build multi-index of LGD / Metric [Index, Price]
    # Two inner-columns per LGD
    lgds = df.columns[3:].str.replace(' Standardised HPI', ' HPI')\
                         .str.replace(' HPI', '')\
                         .str.replace(' Standardised Price', '')\
                         .unique()
    df.columns = [*df.columns[:3], *pd.MultiIndex.from_product([lgds, ['Index', 'Price']], names=['LGD', 'Metric'])]
    return df
cleanup_table_5(source_df['Table 5'])
Period | Year | Quarter | (Antrim and Newtownabbey, Index) | (Antrim and Newtownabbey, Price) | (Ards and North Down, Index) | (Ards and North Down, Price) | (Armagh City, Banbridge and Craigavon, Index) | (Armagh City, Banbridge and Craigavon, Price) | (Belfast, Index) | ... | (Fermanagh and Omagh, Index) | (Fermanagh and Omagh, Price) | (Lisburn and Castlereagh, Index) | (Lisburn and Castlereagh, Price) | (Mid and East Antrim, Index) | (Mid and East Antrim, Price) | (Mid Ulster, Index) | (Mid Ulster, Price) | (Newry, Mourne and Down, Index) | (Newry, Mourne and Down, Price) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 99.903277 | 114851.528270 | 97.150602 | 130398.569667 | 102.245597 | 100785.145986 | 99.839849 | ... | 109.429237 | 104874.980231 | 95.958322 | 128828.327513 | 102.246427 | 105865.408901 | 102.714778 | 114882.211239 | 100.810773 | 113420.880186 |
1 | 2005Q2 | 2005 | Q2 | 99.723509 | 114644.862732 | 100.794472 | 137133.037807 | 106.325843 | 104807.109982 | 100.589870 | ... | 117.239850 | 112360.529330 | 100.164437 | 134475.225477 | 104.443325 | 108140.065924 | 110.386311 | 123462.501283 | 111.965743 | 125971.191415 |
2 | 2005Q3 | 2005 | Q3 | 107.940849 | 124091.740608 | 102.167971 | 137133.037807 | 110.006212 | 108434.910333 | 109.614861 | ... | 125.900145 | 120660.397585 | 106.757895 | 143327.237126 | 112.748278 | 116738.970434 | 117.595723 | 131525.929577 | 117.235685 | 131900.333698 |
3 | 2005Q4 | 2005 | Q4 | 111.934696 | 128683.175719 | 106.396379 | 142808.538807 | 116.073031 | 114415.072260 | 110.728237 | ... | 130.781315 | 125338.422216 | 111.307116 | 149434.769200 | 114.584090 | 118639.759900 | 121.851999 | 136286.397473 | 123.628047 | 139092.296651 |
4 | 2006Q1 | 2006 | Q1 | 113.494351 | 130476.197845 | 109.206160 | 146579.915492 | 121.831058 | 120090.852733 | 112.326582 | ... | 135.555749 | 129914.152078 | 110.539212 | 148403.823796 | 115.878975 | 119980.477260 | 129.544702 | 144890.365875 | 127.449613 | 143391.890242 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 131.045293 | 150653.238745 | 123.824862 | 166201.593253 | 130.857866 | 128988.723586 | 133.525177 | ... | 142.381814 | 136456.127817 | 129.797418 | 174258.823716 | 130.813697 | 135443.809729 | 124.617456 | 139379.446212 | 138.815696 | 156179.721555 |
64 | 2021Q1 | 2021 | Q1 | 133.481101 | 153453.510344 | 128.398202 | 172340.072904 | 130.382658 | 128520.303209 | 135.257679 | ... | 141.464114 | 135576.621629 | 131.166305 | 176096.615474 | 131.268719 | 135914.936888 | 127.657615 | 142779.737045 | 138.481347 | 155803.549899 |
65 | 2021Q2 | 2021 | Q2 | 137.827568 | 158450.326506 | 130.543464 | 175219.510303 | 137.439265 | 135476.115278 | 138.558752 | ... | 146.059468 | 139980.724158 | 134.815374 | 180995.654429 | 138.663140 | 143571.081234 | 129.231669 | 144540.250870 | 145.000829 | 163138.533592 |
66 | 2021Q3 | 2021 | Q3 | 141.167257 | 162289.724156 | 134.116570 | 180015.444071 | 143.522521 | 141472.480114 | 142.250634 | ... | 149.743153 | 143511.101233 | 136.178634 | 182825.891020 | 140.051373 | 145008.450168 | 135.759680 | 151841.560426 | 152.177814 | 171213.265699 |
67 | 2021Q4 | 2021 | Q4 | 141.900751 | 163132.969278 | 134.883745 | 181045.170599 | 140.957739 | 138944.332704 | 140.072076 | ... | 156.033991 | 149540.124905 | 137.683726 | 184846.544332 | 141.311289 | 146312.960523 | 136.847809 | 153058.587031 | 152.044892 | 171063.717288 |
68 rows × 25 columns
dest_df['Table 5']=cleanup_table_5(source_df['Table 5'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(12, 20)
Table 5a: Number of Verified Residential Property Sales by Local Government District
This one has a new problem: the Sale Year/Quarter values are squished together into a single column. That will do a few terrible things to our basic_cleanup,
so the column needs to be split ahead of cleanup.
It also has annual total rows.
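A quick sketch of why the split also takes care of the totals (the sample values are illustrative): the `(Q[1-4]) ([0-9]{4})` extract leaves annual total rows and note rows as NaN, which is exactly what makes them easy to drop later.

```python
import pandas as pd

# Sample of a squished 'Sale Year/Quarter' column, including an annual
# total row and a stray note row.
col = pd.Series(['Q1 2005', 'Q4 2021', '2021 Total', 'Back to contents'])

# Quarterly rows match both groups; everything else comes out as NaN.
dates = col.str.extract(r'(Q[1-4]) ([0-9]{4})').rename(columns={0: 'Quarter', 1: 'Year'})
```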
df = source_df['Table 5a'].copy()
df
Table 5a: Number of Verified Residential Property Sales by Local Government District | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Please note figures for the 2 most recent quar... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Back to contents | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Sale Year/Quarter | Antrim and Newtownabbey | Ards and North Down | Armagh City, Banbridge and Craigavon | Belfast | Causeway Coast and Glens | Derry City and Strabane | Fermanagh and Omagh | Lisburn and Castlereagh | Mid and East Antrim | Mid Ulster | Newry, Mourne and Down |
3 | Q1 2005 | 236 | 320 | 333 | 623 | 236 | 226 | 138 | 219 | 188 | 176 | 241 |
4 | Q2 2005 | 735 | 857 | 961 | 1549 | 712 | 637 | 316 | 655 | 618 | 428 | 505 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
85 | Q3 2021 | 739 | 989 | 931 | 1584 | 625 | 485 | 325 | 869 | 671 | 377 | 574 |
86 | Q4 2021 | 532 | 702 | 730 | 1272 | 417 | 405 | 250 | 572 | 474 | 359 | 437 |
87 | 2021 Total | 2647 | 3685 | 3333 | 5934 | 2327 | 1803 | 1181 | 3053 | 2346 | 1460 | 2000 |
88 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
89 | Please note figures for the 2 most recent quar... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
90 rows × 12 columns
dates = df.iloc[:,0].str.extract('(Q[1-4]) ([0-9]{4})').rename(columns={0:'Quarter',1:'Year'})
for c in ['Quarter','Year']: # insert the dates in order, so they come out in reverse in the insert
    df.insert(1,c,dates[c])
    df.iloc[2,1]=c # Need to have the right colname for when `basic_cleanup` is called.
df
Table 5a: Number of Verified Residential Property Sales by Local Government District | Year | Quarter | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Please note figures for the 2 most recent quar... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Back to contents | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Sale Year/Quarter | Year | Quarter | Antrim and Newtownabbey | Ards and North Down | Armagh City, Banbridge and Craigavon | Belfast | Causeway Coast and Glens | Derry City and Strabane | Fermanagh and Omagh | Lisburn and Castlereagh | Mid and East Antrim | Mid Ulster | Newry, Mourne and Down |
3 | Q1 2005 | 2005 | Q1 | 236 | 320 | 333 | 623 | 236 | 226 | 138 | 219 | 188 | 176 | 241 |
4 | Q2 2005 | 2005 | Q2 | 735 | 857 | 961 | 1549 | 712 | 637 | 316 | 655 | 618 | 428 | 505 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
85 | Q3 2021 | 2021 | Q3 | 739 | 989 | 931 | 1584 | 625 | 485 | 325 | 869 | 671 | 377 | 574 |
86 | Q4 2021 | 2021 | Q4 | 532 | 702 | 730 | 1272 | 417 | 405 | 250 | 572 | 474 | 359 | 437 |
87 | 2021 Total | NaN | NaN | 2647 | 3685 | 3333 | 5934 | 2327 | 1803 | 1181 | 3053 | 2346 | 1460 | 2000 |
88 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
89 | Please note figures for the 2 most recent quar... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
90 rows × 14 columns
df=df[~df.iloc[:,0].str.contains('Total').fillna(False)]
df.iloc[1,2]=c
basic_cleanup(df,offset=2)
2 | Period | Sale Year/Quarter | Year | Quarter | Antrim and Newtownabbey | Ards and North Down | Armagh City, Banbridge and Craigavon | Belfast | Causeway Coast and Glens | Derry City and Strabane | Fermanagh and Omagh | Lisburn and Castlereagh | Mid and East Antrim | Mid Ulster | Newry, Mourne and Down |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005Q1 | Q1 2005 | 2005 | Q1 | 236 | 320 | 333 | 623 | 236 | 226 | 138 | 219 | 188 | 176 | 241 |
1 | 2005Q2 | Q2 2005 | 2005 | Q2 | 735 | 857 | 961 | 1549 | 712 | 637 | 316 | 655 | 618 | 428 | 505 |
2 | 2005Q3 | Q3 2005 | 2005 | Q3 | 757 | 960 | 968 | 1722 | 714 | 632 | 365 | 654 | 686 | 403 | 582 |
3 | 2005Q4 | Q4 2005 | 2005 | Q4 | 893 | 995 | 1199 | 1943 | 834 | 746 | 385 | 670 | 759 | 489 | 711 |
4 | 2006Q1 | Q1 2006 | 2006 | Q1 | 761 | 933 | 1038 | 1686 | 763 | 708 | 348 | 600 | 668 | 515 | 567 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | Q4 2020 | 2020 | Q4 | 756 | 1052 | 974 | 1565 | 728 | 496 | 336 | 830 | 685 | 419 | 636 |
64 | 2021Q1 | Q1 2021 | 2021 | Q1 | 652 | 976 | 849 | 1497 | 610 | 466 | 290 | 762 | 572 | 349 | 486 |
65 | 2021Q2 | Q2 2021 | 2021 | Q2 | 724 | 1018 | 823 | 1581 | 675 | 447 | 316 | 850 | 629 | 375 | 503 |
66 | 2021Q3 | Q3 2021 | 2021 | Q3 | 739 | 989 | 931 | 1584 | 625 | 485 | 325 | 869 | 671 | 377 | 574 |
67 | 2021Q4 | Q4 2021 | 2021 | Q4 | 532 | 702 | 730 | 1272 | 417 | 405 | 250 | 572 | 474 | 359 | 437 |
68 rows × 15 columns
def cleanup_table_5a(df):
    """
    Table 5a: Number of Verified Residential Property Sales by Local Government District
    * Parse the 'Sale Year/Quarter' to two separate cols
    * Insert future-headers for Quarter and Year cols
    * Remove rows with 'total' in the first column
    * Disregard the 'Sale Year/Quarter' column
    * Perform `basic_cleanup` with offset=2
    """
    # Safety first
    df = df.copy()
    # Extract 'Quarter' and 'Year' columns from the future 'Sale Year/Quarter' column
    dates = df.iloc[:,0].str.extract('(Q[1-4]) ([0-9]{4})').rename(columns={0:'Quarter',1:'Year'})
    for c in ['Quarter','Year']: # insert the dates in order, so they come out in reverse in the insert
        df.insert(1,c,dates[c])
        df.iloc[2,1]=c # Need to have the right colname for when `basic_cleanup` is called.
    # Remove 'total' rows from the future 'Sale Year/Quarter' column
    df = df[~df.iloc[:,0].str.contains('Total').fillna(False)]
    # Remove the 'Sale Year/Quarter' column altogether
    df = df.iloc[:,1:]
    # Standard cleanup
    df = basic_cleanup(df, offset=2)
    return df
cleanup_table_5a(source_df['Table 5a'])
2 | Period | Year | Quarter | Antrim and Newtownabbey | Ards and North Down | Armagh City, Banbridge and Craigavon | Belfast | Causeway Coast and Glens | Derry City and Strabane | Fermanagh and Omagh | Lisburn and Castlereagh | Mid and East Antrim | Mid Ulster | Newry, Mourne and Down |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 236 | 320 | 333 | 623 | 236 | 226 | 138 | 219 | 188 | 176 | 241 |
1 | 2005Q2 | 2005 | Q2 | 735 | 857 | 961 | 1549 | 712 | 637 | 316 | 655 | 618 | 428 | 505 |
2 | 2005Q3 | 2005 | Q3 | 757 | 960 | 968 | 1722 | 714 | 632 | 365 | 654 | 686 | 403 | 582 |
3 | 2005Q4 | 2005 | Q4 | 893 | 995 | 1199 | 1943 | 834 | 746 | 385 | 670 | 759 | 489 | 711 |
4 | 2006Q1 | 2006 | Q1 | 761 | 933 | 1038 | 1686 | 763 | 708 | 348 | 600 | 668 | 515 | 567 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 756 | 1052 | 974 | 1565 | 728 | 496 | 336 | 830 | 685 | 419 | 636 |
64 | 2021Q1 | 2021 | Q1 | 652 | 976 | 849 | 1497 | 610 | 466 | 290 | 762 | 572 | 349 | 486 |
65 | 2021Q2 | 2021 | Q2 | 724 | 1018 | 823 | 1581 | 675 | 447 | 316 | 850 | 629 | 375 | 503 |
66 | 2021Q3 | 2021 | Q3 | 739 | 989 | 931 | 1584 | 625 | 485 | 325 | 869 | 671 | 377 | 574 |
67 | 2021Q4 | 2021 | Q4 | 532 | 702 | 730 | 1272 | 417 | 405 | 250 | 572 | 474 | 359 | 437 |
68 rows × 14 columns
dest_df['Table 5a']=cleanup_table_5a(source_df['Table 5a'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(13, 19)
Table 6: Standardised House Price & Index for all Urban and Rural areas in NI
Wee buns, thankfully. It’s still mixing ‘HPI’ and ‘Index’ in the column names, but that’s a downstream problem
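If that downstream problem ever needs solving, one hypothetical normalisation is to fold the ‘HPI’ suffix into ‘Index’; the column names below are copied from Table 6, but the rename itself is my assumption, not something the cleanup does:

```python
import pandas as pd

# Hypothetical downstream rename: make Table 6's 'HPI' columns match the
# 'Index' naming used elsewhere in the cleaned dataset.
cols = pd.Index(['Period', 'Year', 'Quarter',
                 'Urban Areas HPI', 'Urban Areas Standardised Price',
                 'Rural Areas HPI', 'Rural Areas Standardised Price'])
normalised = cols.str.replace(' HPI', ' Index', regex=False)
```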
df = basic_cleanup(source_df['Table 6'])
df
1 | Period | Year | Quarter | Urban Areas HPI | Urban Areas Standardised Price | Rural Areas HPI | Rural Areas Standardised Price |
---|---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 101.309947 | 107723.320891 | 100.109860 | 124292.601178 |
1 | 2005Q2 | 2005 | Q2 | 104.402908 | 111012.079786 | 105.467951 | 138865.721275 |
2 | 2005Q3 | 2005 | Q3 | 111.163485 | 118200.631818 | 111.847591 | 138865.721275 |
3 | 2005Q4 | 2005 | Q4 | 114.871996 | 122143.908606 | 116.175119 | 144238.615701 |
4 | 2006Q1 | 2006 | Q1 | 118.187559 | 125669.361667 | 119.329374 | 148154.818847 |
... | ... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 132.610763 | 141005.619094 | 133.854953 | 166189.226014 |
64 | 2021Q1 | 2021 | Q1 | 134.077654 | 142565.370205 | 135.267264 | 167942.698911 |
65 | 2021Q2 | 2021 | Q2 | 138.575881 | 147348.355880 | 140.501443 | 174441.256673 |
66 | 2021Q3 | 2021 | Q3 | 142.840470 | 151882.912133 | 144.695321 | 179648.216283 |
67 | 2021Q4 | 2021 | Q4 | 142.375033 | 151388.010443 | 146.115278 | 181411.180623 |
68 rows × 7 columns
dest_df['Table 6']=basic_cleanup(source_df['Table 6'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(14, 18)
Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times
Nearly-wee-buns; but this one doesn’t have Year or Quarter headers, and it has an extra ‘\n(Ref: Q1 2015)’ baked into the price headers,
which will complicate downstream analysis if that reference period changes over time…
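If those suffixes ever need stripping downstream, a minimal sketch; the regex and the sample column names are illustrative assumptions:

```python
import pandas as pd

# Strip any '(Ref: ...)' note, plus the newline/whitespace in front of it,
# from headers like Table 7's price columns.
cols = pd.Index(['Period',
                 'Drive time within 1hr of Belfast Index',
                 'Drive time within 1hr of Belfast Price\n(Ref: Q1 2015)'])
stripped = cols.str.replace(r'\s*\(Ref:.*\)', '', regex=True)
```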
df = source_df['Table 7'].copy()
df.head()
Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Back to contents | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | NaN | Drive time within 20mins of town of 10,000 or ... | Drive time within 20mins of town of 10,000 or ... | Drive time outside 20mins of town of 10,000 or... | Drive time outside 20mins of town of 10,000 or... | Drive time within 1hr of Belfast Index | Drive time within 1hr of Belfast Price\n(Ref:... | Drive time outside 1hr of Belfast Index | Drive time outside 1hr of Belfast Price\n(Ref... | NaN |
2 | 2015 | Q1 | 100 | 124898.676844 | 100 | 122528.427865 | 100 | 128955.274996 | 100 | 111866.40498 | 100.0 |
3 | NaN | Q2 | 103.166882 | 128854.070701 | 103.003978 | 126209.155363 | 103.025069 | 132856.260679 | 103.349406 | 115613.265107 | 100.0 |
4 | NaN | Q3 | 105.851629 | 132207.28391 | 105.619893 | 129414.394046 | 105.031061 | 135443.093443 | 107.811831 | 120605.219276 | 100.0 |
df.iloc[1,0] = 'Year'
df.iloc[1,1] = 'Quarter'
df.head()
Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Back to contents | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Year | Quarter | Drive time within 20mins of town of 10,000 or ... | Drive time within 20mins of town of 10,000 or ... | Drive time outside 20mins of town of 10,000 or... | Drive time outside 20mins of town of 10,000 or... | Drive time within 1hr of Belfast Index | Drive time within 1hr of Belfast Price\n(Ref:... | Drive time outside 1hr of Belfast Index | Drive time outside 1hr of Belfast Price\n(Ref... | NaN |
2 | 2015 | Q1 | 100 | 124898.676844 | 100 | 122528.427865 | 100 | 128955.274996 | 100 | 111866.40498 | 100.0 |
3 | NaN | Q2 | 103.166882 | 128854.070701 | 103.003978 | 126209.155363 | 103.025069 | 132856.260679 | 103.349406 | 115613.265107 | 100.0 |
4 | NaN | Q3 | 105.851629 | 132207.28391 | 105.619893 | 129414.394046 | 105.031061 | 135443.093443 | 107.811831 | 120605.219276 | 100.0 |
basic_cleanup(df).head()
1 | Period | Year | Quarter | Drive time within 20mins of town of 10,000 or more Index | Drive time within 20mins of town of 10,000 or more Price\n(Ref: Q1 2015) | Drive time outside 20mins of town of 10,000 or more Index | Drive time outside 20mins of town of 10,000 or more Price\n(Ref: Q1 2015) | Drive time within 1hr of Belfast Index | Drive time within 1hr of Belfast Price\n(Ref: Q1 2015) | Drive time outside 1hr of Belfast Index | Drive time outside 1hr of Belfast Price\n(Ref: Q1 2015) |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015Q1 | 2015 | Q1 | 100.000000 | 124898.676844 | 100.000000 | 122528.427865 | 100.000000 | 128955.274996 | 100.000000 | 111866.404980 |
1 | 2015Q2 | 2015 | Q2 | 103.166882 | 128854.070701 | 103.003978 | 126209.155363 | 103.025069 | 132856.260679 | 103.349406 | 115613.265107 |
2 | 2015Q3 | 2015 | Q3 | 105.851629 | 132207.283910 | 105.619893 | 129414.394046 | 105.031061 | 135443.093443 | 107.811831 | 120605.219276 |
3 | 2015Q4 | 2015 | Q4 | 107.430656 | 134179.467306 | 106.924715 | 131013.172436 | 106.240145 | 137002.270924 | 110.075053 | 123137.004353 |
4 | 2016Q1 | 2016 | Q1 | 108.909364 | 136026.354775 | 108.368772 | 132782.552750 | 107.604887 | 138762.178070 | 111.828874 | 125098.941485 |
def cleanup_table_7(df):
    """
    Table 7: Standardised House Price & Index for Rural Areas of Northern Ireland by drive times
    * Insert Year/Quarter future-headers
    * Clean normally
    # TODO THIS MIGHT BE VALID FOR MULTIINDEXING ON DRIVETIME/[Index/Price]
    """
    df = df.copy()
    df.iloc[1,0] = 'Year'
    df.iloc[1,1] = 'Quarter'
    df = basic_cleanup(df)
    return df
cleanup_table_7(source_df['Table 7'])
1 | Period | Year | Quarter | Drive time within 20mins of town of 10,000 or more Index | Drive time within 20mins of town of 10,000 or more Price\n(Ref: Q1 2015) | Drive time outside 20mins of town of 10,000 or more Index | Drive time outside 20mins of town of 10,000 or more Price\n(Ref: Q1 2015) | Drive time within 1hr of Belfast Index | Drive time within 1hr of Belfast Price\n(Ref: Q1 2015) | Drive time outside 1hr of Belfast Index | Drive time outside 1hr of Belfast Price\n(Ref: Q1 2015) |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015Q1 | 2015 | Q1 | 100.000000 | 124898.676844 | 100.000000 | 122528.427865 | 100.000000 | 128955.274996 | 100.000000 | 111866.404980 |
1 | 2015Q2 | 2015 | Q2 | 103.166882 | 128854.070701 | 103.003978 | 126209.155363 | 103.025069 | 132856.260679 | 103.349406 | 115613.265107 |
2 | 2015Q3 | 2015 | Q3 | 105.851629 | 132207.283910 | 105.619893 | 129414.394046 | 105.031061 | 135443.093443 | 107.811831 | 120605.219276 |
3 | 2015Q4 | 2015 | Q4 | 107.430656 | 134179.467306 | 106.924715 | 131013.172436 | 106.240145 | 137002.270924 | 110.075053 | 123137.004353 |
4 | 2016Q1 | 2016 | Q1 | 108.909364 | 136026.354775 | 108.368772 | 132782.552750 | 107.604887 | 138762.178070 | 111.828874 | 125098.941485 |
5 | 2016Q2 | 2016 | Q2 | 111.263396 | 138966.509219 | 109.739250 | 134461.778232 | 110.208116 | 142119.179594 | 111.991819 | 125281.221326 |
6 | 2016Q3 | 2016 | Q3 | 113.419541 | 141659.506269 | 112.426034 | 137753.851946 | 112.202571 | 144691.133778 | 115.398242 | 129091.864904 |
7 | 2016Q4 | 2016 | Q4 | 113.928074 | 142294.657346 | 113.219995 | 138726.680412 | 112.508660 | 145085.851672 | 116.952798 | 130830.890712 |
8 | 2017Q1 | 2017 | Q1 | 114.262386 | 142712.207695 | 113.549623 | 139130.567598 | 112.823330 | 145491.635911 | 117.341538 | 131265.759778 |
9 | 2017Q2 | 2017 | Q2 | 115.566592 | 144341.144812 | 115.829688 | 141924.295411 | 114.288862 | 147381.515712 | 119.397027 | 133565.161466 |
10 | 2017Q3 | 2017 | Q3 | 116.716428 | 145777.273752 | 117.061832 | 143434.022704 | 115.192251 | 148546.484147 | 121.273582 | 135664.395891 |
11 | 2017Q4 | 2017 | Q4 | 117.925340 | 147287.189812 | 118.541541 | 145247.086931 | 116.101165 | 149718.576638 | 123.577530 | 138241.740123 |
12 | 2018Q1 | 2018 | Q1 | 118.482802 | 147983.452250 | 120.184585 | 147260.282195 | 117.972698 | 152132.017714 | 122.075418 | 136561.381608 |
13 | 2018Q2 | 2018 | Q2 | 119.443631 | 149183.514842 | 120.710551 | 147904.740501 | 117.686726 | 151763.241043 | 125.603705 | 140508.349303 |
14 | 2018Q3 | 2018 | Q3 | 121.408923 | 151638.138779 | 122.222994 | 149757.912837 | 119.614730 | 154249.503782 | 127.210974 | 142306.342946 |
15 | 2018Q4 | 2018 | Q4 | 123.531419 | 154289.108214 | 125.254013 | 153471.772944 | 121.462721 | 156632.586327 | 130.953652 | 146493.143086 |
16 | 2019Q1 | 2019 | Q1 | 122.499375 | 153000.098716 | 123.207618 | 150964.357154 | 120.087144 | 154858.706540 | 129.712835 | 145105.085124 |
17 | 2019Q2 | 2019 | Q2 | 124.397722 | 155371.109292 | 125.151589 | 153346.274193 | 122.486275 | 157952.513361 | 130.480634 | 145963.994647 |
18 | 2019Q3 | 2019 | Q3 | 126.533407 | 158038.551430 | 128.647747 | 157630.061642 | 124.978137 | 161165.900455 | 133.258945 | 149071.990904 |
19 | 2019Q4 | 2019 | Q4 | 127.126748 | 158779.626458 | 127.784267 | 156572.053236 | 124.547746 | 160610.887802 | 134.709059 | 150694.181735 |
20 | 2020Q1 | 2020 | Q1 | 127.090324 | 158734.133127 | 128.619521 | 157595.476721 | 124.744271 | 160864.317472 | 135.068006 | 151095.722593 |
21 | 2020Q2 | 2020 | Q2 | 127.200617 | 158871.887068 | 127.231209 | 155894.400318 | 125.021931 | 161222.374474 | 132.935126 | 148709.746207 |
22 | 2020Q3 | 2020 | Q3 | 129.627870 | 161903.493901 | 131.083667 | 160614.756005 | 127.873046 | 164899.037745 | 135.952621 | 152085.309777 |
23 | 2020Q4 | 2020 | Q4 | 132.853240 | 165931.938294 | 135.873737 | 166483.954356 | 130.725554 | 168577.497418 | 142.032974 | 158887.182045 |
24 | 2021Q1 | 2021 | Q1 | 134.414458 | 167881.879606 | 136.978231 | 167837.272586 | 132.479831 | 170839.730568 | 142.532053 | 159445.483290 |
25 | 2021Q2 | 2021 | Q2 | 139.417605 | 174130.744141 | 142.727536 | 174881.805508 | 137.737680 | 177620.004609 | 147.652942 | 165174.037638 |
26 | 2021Q3 | 2021 | Q3 | 143.303934 | 178984.717757 | 147.615067 | 180870.420630 | 141.482937 | 182449.710341 | 153.161532 | 171336.300173 |
27 | 2021Q4 | 2021 | Q4 | 144.731984 | 180768.332630 | 149.013971 | 182584.475980 | 142.365385 | 183587.673223 | 156.204293 | 174740.127539 |
dest_df['Table 7'] = cleanup_table_7(source_df['Table 7'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(15, 17)
Table 8: Number of Verified Residential Property Sales of properties in urban and rural areas and properties in rural areas by drive times within towns of 10,000 or more and within 1 hour of Belfast
We’re now getting into the swing of this!
This one has two problems we’ve already seen: munged Quarters/Years (this time with no header on that column…) and annual Total rows.
Vee must deeel with it
cleanup_table_5a(source_df['Table 8']).head()
2 | Period | Year | Quarter | Urban | Rural | Drive time within 20mins of town of 10,000 or more | Drive time outside 20mins of town of 10,000 or more | Drive time within 1hr of Belfast | Drive time outside 1hr of Belfast |
---|---|---|---|---|---|---|---|---|---|
0 | 2015Q1 | 2015 | Q1 | 3294 | 1322 | 898 | 424 | 976 | 346 |
1 | 2015Q2 | 2015 | Q2 | 3789 | 1500 | 1034 | 466 | 1142 | 358 |
2 | 2015Q3 | 2015 | Q3 | 4199 | 1640 | 1145 | 495 | 1250 | 390 |
3 | 2015Q4 | 2015 | Q4 | 4396 | 1780 | 1223 | 557 | 1342 | 438 |
4 | 2016Q1 | 2016 | Q1 | 4424 | 1731 | 1171 | 560 | 1263 | 468 |
cleanup_table_8 = cleanup_table_5a
dest_df['Table 8'] = cleanup_table_8(source_df['Table 8'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(16, 16)
Table 9: NI Average Sales Prices Q1 2005 - Q4 2021
Wee buns
basic_cleanup(source_df['Table 9'])
1 | Period | Year | Quarter | Simple Mean | Simple Median | Standardised Price (HPI) |
---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 115912.942222 | 100000 | 111920.268199 |
1 | 2005Q2 | 2005 | Q2 | 120481.290591 | 105000 | 116004.031639 |
2 | 2005Q3 | 2005 | Q3 | 128866.225917 | 115000 | 123386.352673 |
3 | 2005Q4 | 2005 | Q4 | 129649.092074 | 117000 | 127674.143865 |
4 | 2006Q1 | 2006 | Q1 | 132972.115070 | 120000 | 131302.064422 |
... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 171803.199843 | 150000 | 147474.561707 |
64 | 2021Q1 | 2021 | Q1 | 176218.214924 | 150000 | 149084.306040 |
65 | 2021Q2 | 2021 | Q2 | 184144.458946 | 154950 | 154323.134643 |
66 | 2021Q3 | 2021 | Q3 | 173490.230508 | 155000 | 159028.118093 |
67 | 2021Q4 | 2021 | Q4 | 159965.154863 | 141000 | 159150.737832 |
68 rows × 6 columns
dest_df['Table 9'] = basic_cleanup(source_df['Table 9'])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(17, 15)
Table 9x: NI Average Sale Prices XXXXX Property Q1 2005 - Q4 2021
These are very similar to Tables 2x; i.e. they’re broken down by property type.
Annoyingly, they don’t follow the same structure as Tables 2x or Table 9 because they don’t include the Year/Quarter headers.
If that reminds you of anything, it’s because Table 7 was the same…
cleanup_table_7(source_df['Table 9a'])
1 | Period | Year | Quarter | Simple Mean | Simple Median | Standardised Price (HPI) |
---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 166314.816092 | 149972.5 | 160428.832662 |
1 | 2005Q2 | 2005 | Q2 | 173370.669076 | 155000.0 | 169686.542965 |
2 | 2005Q3 | 2005 | Q3 | 185397.896739 | 165000.0 | 180696.666810 |
3 | 2005Q4 | 2005 | Q4 | 186545.119355 | 165000.0 | 185323.883533 |
4 | 2006Q1 | 2006 | Q1 | 191328.398119 | 173000.0 | 188669.361197 |
... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 243712.512641 | 220000.0 | 220592.113069 |
64 | 2021Q1 | 2021 | Q1 | 254182.439174 | 225000.0 | 224872.989982 |
65 | 2021Q2 | 2021 | Q2 | 268755.621299 | 235000.0 | 234734.715703 |
66 | 2021Q3 | 2021 | Q3 | 245860.399289 | 225000.0 | 239101.239764 |
67 | 2021Q4 | 2021 | Q4 | 244468.040738 | 219000.0 | 241131.373512 |
68 rows × 6 columns
cleanup_table_9x = cleanup_table_7
import re

table9s = re.compile('Table 9[a-z]')
for table in source_df:
    if table9s.match(table):
        dest_df[table] = cleanup_table_9x(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(21, 11)
Table 10x: Number of Verified Residential Property Sales by Type in XXXXX
Surprisingly, we’re in the home straight; the remaining tables are all of the same structure, with familiar awkwardness:
- Annual-Total Rows
- Munged Year/Quarter Column
- That column having a silly (but contextual) name
- a different offset
Fortunately, we already have something like that from dealing with Table 5a!
source_df['Table 10a']
Table 10a: Number of Verified Residential Property Sales by Type in Antrim and Newtownabbey Council | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | |
---|---|---|---|---|---|---|
0 | Please note figures for the 2 most recent quar... | NaN | NaN | NaN | NaN | NaN |
1 | Back to contents | NaN | NaN | NaN | NaN | NaN |
2 | ANTRIM AND NEWTOWNABBEY | Apartments | Detached | Semi-Detached | Terrace | Total |
3 | Q1 2005 | 10 | 61 | 78 | 87 | 236 |
4 | Q2 2005 | 46 | 213 | 216 | 260 | 735 |
... | ... | ... | ... | ... | ... | ... |
83 | Q1 2021 | 52 | 222 | 212 | 166 | 652 |
84 | Q2 2021 | 56 | 217 | 275 | 176 | 724 |
85 | Q3 2021 | 47 | 222 | 268 | 202 | 739 |
86 | Q4 2021 | 50 | 117 | 176 | 189 | 532 |
87 | 2021 Total | 205 | 778 | 931 | 733 | 2647 |
88 rows × 6 columns
cleanup_table_5a(source_df['Table 10a'])
2 | Period | Year | Quarter | Apartments | Detached | Semi-Detached | Terrace | Total |
---|---|---|---|---|---|---|---|---|
0 | 2005Q1 | 2005 | Q1 | 10 | 61 | 78 | 87 | 236 |
1 | 2005Q2 | 2005 | Q2 | 46 | 213 | 216 | 260 | 735 |
2 | 2005Q3 | 2005 | Q3 | 46 | 214 | 238 | 259 | 757 |
3 | 2005Q4 | 2005 | Q4 | 65 | 227 | 270 | 331 | 893 |
4 | 2006Q1 | 2006 | Q1 | 48 | 186 | 231 | 296 | 761 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
63 | 2020Q4 | 2020 | Q4 | 53 | 248 | 268 | 187 | 756 |
64 | 2021Q1 | 2021 | Q1 | 52 | 222 | 212 | 166 | 652 |
65 | 2021Q2 | 2021 | Q2 | 56 | 217 | 275 | 176 | 724 |
66 | 2021Q3 | 2021 | Q3 | 47 | 222 | 268 | 202 | 739 |
67 | 2021Q4 | 2021 | Q4 | 50 | 117 | 176 | 189 | 532 |
68 rows × 8 columns
cleanup_table_10x = cleanup_table_5a
table10s = re.compile('Table 10[a-z]')
for table in source_df:
    if table10s.match(table):
        dest_df[table] = cleanup_table_10x(source_df[table])
len(dest_df), len([k for k in source_df.keys() if k.startswith('Table') and k not in dest_df])
(32, 0)
And We’re Done!
So, we can see that while government open data is a pain, at least it’s a … consistently inconsistent pain?
I hope this was helpful to someone else.
dest_df['Contents'] = source_df['Contents'][source_df['Contents']['Worksheet Name'].str.startswith('Table')]
with pd.ExcelWriter('NI Housing Price Index.xlsx') as writer:
    # Thankfully these are semantically sortable otherwise this would be a _massive_ pain
    for k, df in sorted(dest_df.items()):
        df.to_excel(writer, sheet_name=k)
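As a sanity check on that export, a round-trip sketch with a toy stand-in for `dest_df` (the real frames are much wider); it just confirms the sorted sheet names and the columns survive `to_excel`:

```python
import io

import pandas as pd

# Toy stand-in for dest_df: two small frames with the shared 'Period' column.
dest = {'Table 1': pd.DataFrame({'Period': ['2005Q1'], 'Year': [2005]}),
        'Table 5a': pd.DataFrame({'Period': ['2005Q1'], 'Belfast': [623]})}

# Write sorted sheets to an in-memory workbook, then read them all back.
buf = io.BytesIO()
with pd.ExcelWriter(buf) as writer:
    for name, frame in sorted(dest.items()):
        frame.to_excel(writer, sheet_name=name)
buf.seek(0)
back = pd.read_excel(buf, sheet_name=None, index_col=0)
```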