J Lawson

Plotting a referendum - CAP Spending

Following the shock decision by the UK to leave the EU, many ideas were put forward as to the reasons behind why so many people voted the ways that they did.

In an effort to learn more about data handling in python, using pandas, matplotlib and other fun stuff I scoured the internet for data and set about plotting graphs.

Sanitizing the CAP data

We start off by deciding which of the columns of data we really want to keep. We need postcodes to place the data geographically, while the various totals of spending give the data we care about.

The provided column names are not very clear, so we provide ‘nice’ versions as well.

Pickles the CAP data, extracting only none-identifiable data
Data available from:
Available under the `Open Government Licence for public sector information`

import pandas as pd

sheet_columns = ['PostcodePrefix_F202B', 'TownCity_F202C',
                'OtherEAGFTotal', 'DirectEAGFTotal',
                'RuralDevelopmentTotal', 'Total']
final_columns = ['Postcode', 'Town', 'Other_EAGF', 'Direct_EAGF',
                'Rural_Development', 'Total']

Now define a function to read in the excel file and combine into a single DataFrame. As many different sheets are used to hold the data, we use a simple list comprehension to read each in turn.

Pandas provide an ExcelFile object, which allows us to read each sheet separately without having to open and close the file each time.

def load_cap_data(filename):
    with pd.ExcelFile(filename) as xls:
        return pd.concat( (pd.read_excel(xls, sheet) for sheet in
            xls.sheet_names), ignore_index=True)

Now that we have a DataFrame, we use the column headings defined above to throw away all others, as well as any null values, before renaming the columns.

The next goal is to combine all the rows with the same postcodes, so that we only have the sums. There is a slight problem with the Town column, where each entry is of type str and so cannot be summed. We deal with this separately, passing a lambda function to use as the groupby aggregator which returns the most common Town appearing for each postcode.

All the numeric columns behave as expected, so the simple data.groupby('Postcode').sum() returns a DataFrame with the sums indexed by postcodes that we wanted.

def organise_data(df):
    data = df.dropna(how='any', subset=['PostcodePrefix_F202B'])[sheet_columns]
    data.columns = final_columns
    town_names = (
            data[['Postcode', 'Town']].groupby('Postcode')
            .agg(lambda x: x.value_counts().index[0])
    sums = data.groupby('Postcode').sum()
    return pd.concat([town_names, sums], axis=1)

All that’s left now is to use these functions to load the data from the file, then pickle it into 'data/CAP2015/pkl'.

def get_cap(year):
    df = load_cap_data('./raw/' + year + '_All_CAP_Search_Results_Data_P14.xls')
    return organise_data(df)

d15 = get_cap('2015')

The full code can be found in this gist.

Indexing by electoral wards

The CAP data is provided in terms of postcode prefix, so we constructed a map from postcodes to electoral wards in the [previous page][Coding postcodes].

Convert the CAP data indexed by postcode to indexed by voting district code.
import pandas as pd

cap = pd.read_pickle('./data/CAP2015.pkl')
pc = pd.read_pickle('./data/pc_prefixes.pkl')

Now pc is a Series of electoral wards indexed by postcodes, and cap is the CAP data indexed by postcodes. By resetting the index of cap, these postcodes are moved into a separate column, and each row is indexed from 1.

The string functions on cap.Postcode ensure that all postcodes in the CAP data match those in the postcode data, so cap.Postcode.str.strip().str.upper() is a Series of postcodes and the call to map replaces each postcode with the corresponding electoral ward from pc.

# Reindex and convert CAP spending in to area wards
cap = cap.reset_index()
cap_codes = cap.Postcode.str.strip().str.upper().map(pc)
cap_codes.name = 'Ward'

Now add these electoral wards to the cap DataFrame and use them to collect all the data for each ward.

cap = pd.concat([cap, cap_codes], axis=1).groupby('Ward').sum()

The full code can be found in this gist.