Capturing the Brexit vote in data


In my recent research I have worked on understanding the key correlates of Brexit. This paper is joint work with my co-authors Sascha Becker and Dennis Novy and has now been published in Economic Policy. After having gone through the peer-review process, I am very happy to share the data and the underlying code.

On the construction of a rich local authority level dataset

After the EU Referendum, I started assembling a dataset to study to what extent low turnout among potential Remain voters may help to explain the result of the referendum. In particular, I looked at the extent to which rainfall may have discouraged people from turning out to vote in the first place. After all, many polls, the financial markets and the betting markets seemed to indicate that Remain would win.

It was conceivable that turnout was particularly low among voters living in the London commuter belt, who were put off voting on a working day after an arduous journey into London that was marred by train cancellations due to the bad weather.

The result then was that there appears to have been a weak effect on turnout, but this effect was not biased in either direction with regard to voter preferences: Remain would also have lost on a sunny day (we confirm this finding in the new paper with a different rainfall data set).

In any case, this was the starting point for a significant data collection effort. In particular, we have combined data from all 380 local authorities in England, Scotland and Wales, as well as some ward-level voting data across five cities. The resulting set of covariates is quite sizable and can be roughly grouped into the following four categories:

  1. characteristics of the underlying economic structure of an area, like the unemployment rate, wages or sectoral employment shares and changes therein,
  2. demographic and human capital characteristics like age structure, education or life satisfaction,
  3. exposure to the EU measured by indicators like the level of EU transfers, trade and the growth rate of immigration from other EU countries,
  4. public service provision and fiscal consolidation that covers the share of public employment, as well as the reduction in public spending per capita among others and NHS performance metrics.

To analyse which of these covariates are strong correlates of the Leave vote share in an area, we used a simple machine-learning method that chooses the subset of covariates that best predicts the referendum Leave share. The idea is to build a robust predictive model that achieves good out-of-sample prediction accuracy.

More formally, best subset selection solves an optimization problem to obtain a vector beta that can robustly predict the Leave vote share y_c.
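In its standard textbook form, the problem can be written as below. The notation is an assumption made here for illustration: x_{cj} is covariate j in local authority c, p is the number of candidate covariates, and s is the maximum number of covariates the model may use.

```latex
\[
\min_{\beta_0,\,\beta}\ \sum_{c=1}^{n}\Big(y_c-\beta_0-\sum_{j=1}^{p}x_{cj}\beta_j\Big)^{2}
\quad\text{subject to}\quad
\sum_{j=1}^{p}\mathbf{1}\,(\beta_j\neq 0)\ \le\ s
\]
```

The constraint counts the non-zero coefficients, so for each s the solution is the least-squares fit on the best set of at most s covariates.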

While the formulation may be a bit technical, it boils down to estimating all possible ways to combine regressors. That is, every model that includes 1, 2, 3, …, p covariates is estimated. This amounts to estimating 2^p models, a number that grows so quickly with p that the exercise soon becomes infeasible. Lasso and other methods such as forward and backward stepwise selection solve approximations to the combinatorial problem that best subset selection (BSS) solves. The method is usually taught at the beginning of machine learning courses as it provides a great way to illustrate the bias-variance tradeoff.
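To make the enumeration concrete, here is a minimal base R sketch on synthetic data. The data-generating process, the variable names and the use of BIC to choose across model sizes are all assumptions for illustration, not the procedure used in the paper.

```r
# Exhaustive best subset selection on a small synthetic example (base R only).
set.seed(1)
n <- 200
p <- 5
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
# Only x1 and x3 truly matter in this made-up data-generating process
y <- 2 * X[, "x1"] - 1.5 * X[, "x3"] + rnorm(n)
dat <- data.frame(y = y, X)

# Enumerate every non-empty subset of the p regressors: 2^p - 1 models
subsets <- unlist(
  lapply(1:p, function(k) combn(colnames(X), k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each candidate model by OLS and record its fit statistics
fits <- lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "y"), data = dat)
  data.frame(size = length(vars),
             vars = paste(vars, collapse = " + "),
             r2   = summary(fit)$r.squared,
             bic  = BIC(fit))
})
results <- do.call(rbind, fits)

# Best model of each size by R^2, then compare across sizes with BIC
best_by_size <- do.call(rbind, lapply(split(results, results$size),
                                      function(d) d[which.max(d$r2), ]))
best_overall <- best_by_size[which.min(best_by_size$bic), ]
print(best_by_size)
```

With p = 5 this fits 31 models; with p = 20 there are already more than a million, which is why exhaustive search is only practical for a moderate number of covariates.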

The result can be summarized in a simple bar chart and is somewhat striking:

R2 of best models with different groups of regressors

What this suggests is that “fundamental factors” such as economic structure, demographics and human capital are the strongest correlates of the Leave vote in the UK; the direct level effects of migration and trade exposure captured in the group of regressors called “EU Exposure” seem second order.

What is also striking is that even very simple empirical models with just 10 to 20 variables do a good job of capturing the variation in the Leave vote share across the whole of the UK. The best model that includes variables measuring vote shares of different parties (especially UKIP) from the 2014 European Parliamentary election captures 95% of the overall variation.

The observation that variables capturing direct exposure to the European Union, in particular immigration, explain relatively little seems at odds with voter narratives around the EU referendum, which were dominated by the immigration topic. In fact, the 2015 British Election Study suggests that voters considered immigration to be the single most important issue of the time.

Unigram word cloud constructed out of responses to question “what is single most important issue facing country at the time” from the 2015 BES study

In the word cloud above, the keyword immigration stands out. Looking at the bigram word cloud drawn from the British Election Study, the themes of the individual responses become even more apparent.

Bigram word cloud constructed out of responses to question “what is single most important issue facing country at the time” from the 2015 BES study

There, phrases like “many people”, “small island” and “uncontrolled immigration” appear in addition to the simple immigration keyword. Other themes, such as “national debt”, “health service” and concerns about the “welfare system”, also feature quite prominently. Overall this suggests that immigration may have been a salient feature in the public debate, but this seems at odds with the fact that the variable groups pertaining to EU immigration capture very little of the variation in the intensity of the Leave vote across the UK.

In another paper with Sascha we confirm this observation in a panel setup. We show that a local area’s exposure to Eastern European immigration has, at best, only a small impact on anti-EU preferences in the UK as measured by UKIP voting in European Parliamentary Elections.

While our code for the Economic Policy paper is implemented in Stata, it is very easy to replicate this work in R. Below is an example of how you would implement BSS in R through the leaps package.

The leaps package has a nice visualization of which regressors are included in which of the best models. The below example replicates the results in our Table 1.
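The replication itself needs the referendum dataset. As a minimal sketch of the same leaps workflow, the following uses R's built-in mtcars data instead, which is an assumption made purely for illustration; the choice of nvmax = 5 is likewise arbitrary.

```r
# Best subset selection with leaps on the built-in mtcars data
# (a stand-in for the referendum dataset, which is not loaded here).
library(leaps)

# Search exhaustively over all predictors of mpg, keeping the single
# best model of each size up to 5 regressors
bss <- regsubsets(mpg ~ ., data = mtcars, nbest = 1, nvmax = 5,
                  method = "exhaustive")
bss_sum <- summary(bss)

# Logical matrix: one row per model size, TRUE where a regressor is included
print(bss_sum$which)

# Fit statistics for the best model of each size
print(data.frame(size = 1:5, r2 = bss_sum$rsq, adj_r2 = bss_sum$adjr2,
                 bic = bss_sum$bic))

# Built-in visualization: one row per model, ordered by adjusted R^2,
# with shaded cells marking the regressors that model includes
plot(bss, scale = "adjr2")
```

Swapping mtcars for the local authority data and mpg for the Leave share gives the corresponding plot for the referendum covariates.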





Plotting best subset results

If you want to replicate the word clouds from the 2015 British Election study, the following lines of code will do. You will need to download the British Election Study dataset here.





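library(quanteda)
# C is assumed to be a corpus of the free-text responses; unigram matrix first, then bigram matrix
C.dfmunigram <- dfm(C, tolower = TRUE, stem = FALSE, removeNumbers = TRUE, removePunct = TRUE, remove = stopwords("english"), ngrams = 1)
C.dfm <- dfm(C, tolower = TRUE, stem = FALSE, removeNumbers = TRUE, removePunct = TRUE, remove = stopwords("english"), ngrams = 2)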
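For readers without the survey data, the following base R sketch illustrates what unigram and bigram counts are. The three responses and the tiny stop-word list are made up, and the tokenization is deliberately simpler than what quanteda does.

```r
# Count unigrams and bigrams in a few made-up survey responses (base R only).
responses <- c("Immigration is out of control",
               "uncontrolled immigration and the national debt",
               "The national debt and the health service")
stop_words <- c("is", "of", "and", "the", "out")  # tiny illustrative stop list

# Lower-case, drop everything except letters and spaces, remove stop words
tokenize <- function(txt) {
  words <- strsplit(gsub("[^a-z ]", "", tolower(txt)), " +")[[1]]
  words[nzchar(words) & !(words %in% stop_words)]
}

# Adjacent word pairs within one response, after stop-word removal
bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1), sep = "_")
}

tokens <- lapply(responses, tokenize)
unigram_counts <- sort(table(unlist(tokens)), decreasing = TRUE)
bigram_counts  <- sort(table(unlist(lapply(tokens, bigrams))), decreasing = TRUE)

print(unigram_counts)
print(bigram_counts)
```

A word cloud then simply scales each term by its count, so the most frequent unigram or bigram is drawn largest.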




UK 2017 General Election Results Data

Economist Magazine Infographic presenting UK 2017 Election Results

As the reality of a hung parliament starts to sink in, economists, political scientists and commentators will begin their usual routine of “post mortem” analysis of the surprise result of the UK 2017 general election. Together with my co-authors Sascha Becker and Dennis Novy, I did a similar exercise studying the EU Referendum last year [see also here] and have worked on the question of whether migration contributed to an erosion of pro-EU sentiment [see also here].

For people wanting to get to work straight away, there are a few things slowing us down. The last constituency, Kensington, was not called until last night, so I don't expect the UK’s Electoral Commission to post the final tally of votes across all constituencies before next week. Nevertheless, the raw election results data can be “scraped” from some infographics. This post describes how…

The Economist’s Infographics

The Economist, among other newspapers, provides a very nice infographic. Behind that infographic lies a web service that can be queried and returns JSON-formatted results.

Each Parliamentary constituency has an identifier code that can be used to query the web service and pull the results. The URL for a request is quite simple:

This provides the results for the constituency Cambridgeshire, South East. The JSON object looks as follows

resultCB({"swing": -3.84, "mpn": "Lucy Frazer", "electorate": "86121", "lib": 11958, "id": "E14000937", "name": "Cambridgeshire, South East", "lab": 17443, "con": 33601, "status": "hold", "pa_key": "123", "oth": 0, "region": "East Of England", "win": "con", "turnout": "63002"})

This piece of JavaScript calls a function resultCB that updates one of the views of the infographic.

To convert this to an R data frame, we can use the fromJSON function from the RJSONIO or jsonlite package, after stripping the resultCB( ... ) wrapper that calls the function.
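A minimal sketch of that step, assuming the jsonlite package and using the response shown above as a literal string (the variable names are illustrative):

```r
library(jsonlite)

# Raw response as returned by the web service (copied from the example above)
raw <- 'resultCB({"swing": -3.84, "mpn": "Lucy Frazer", "electorate": "86121", "lib": 11958, "id": "E14000937", "name": "Cambridgeshire, South East", "lab": 17443, "con": 33601, "status": "hold", "pa_key": "123", "oth": 0, "region": "East Of England", "win": "con", "turnout": "63002"})'

# Strip the leading "resultCB(" and the trailing ")" to leave plain JSON
json <- sub("\\)[[:space:]]*$", "", sub("^[[:space:]]*resultCB\\(", "", raw))

# Parse into a named list, then into a one-row data frame
result <- as.data.frame(fromJSON(json), stringsAsFactors = FALSE)

# electorate and turnout arrive as strings, so convert them to numbers
result$electorate <- as.numeric(result$electorate)
result$turnout    <- as.numeric(result$turnout)
print(result)
```

The column order follows the order of the fields in the JSON object, so it may differ between parsers.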

##          id pa_key oth                       name win status swing   lib
## 1 E14000937    123   0 Cambridgeshire, South East con   hold -3.84 11958
##            region         mpn electorate turnout   lab   con
## 1 East Of England Lucy Frazer      86121   63002 17443 33601

In order to build a data.frame of all election results, all that is necessary is to loop over the set of constituency codes available. I share the results from this step in the following spreadsheet Data for UK 2017 General Election Results (Economist Infographic).
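The loop can be sketched as follows. Because the request URL is not reproduced here, fetch_raw is a hypothetical stand-in that returns canned responses (one trimmed from the example above, one entirely made up); in practice it would download the response for each constituency code.

```r
library(jsonlite)

# HYPOTHETICAL stand-in for the HTTP request: returns canned responses so the
# sketch runs offline. The second constituency is invented for illustration.
fetch_raw <- function(code) {
  canned <- c(
    E14000937 = 'resultCB({"id": "E14000937", "name": "Cambridgeshire, South East", "con": 33601, "lab": 17443, "lib": 11958, "win": "con"})',
    X00000001 = 'resultCB({"id": "X00000001", "name": "Made-up constituency", "con": 100, "lab": 200, "lib": 50, "win": "lab"})'
  )
  canned[[code]]
}

# Remove the resultCB( ... ) wrapper and parse the JSON into a one-row data frame
parse_result <- function(raw) {
  json <- sub("\\)[[:space:]]*$", "", sub("^[[:space:]]*resultCB\\(", "", raw))
  as.data.frame(fromJSON(json), stringsAsFactors = FALSE)
}

codes <- c("E14000937", "X00000001")
results <- do.call(rbind, lapply(codes, function(code) {
  # tryCatch keeps one failed constituency from aborting the whole loop
  tryCatch(parse_result(fetch_raw(code)), error = function(e) NULL)
}))
print(results)
```

Returning NULL on failure means do.call(rbind, ...) silently skips that constituency, so it is worth comparing the row count against the number of codes afterwards.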

Using R to study the Yemen Conflict with night light images


The Yemeni civil war has received very little attention despite the growing humanitarian disaster. There is a lack of reliable figures on the extent of the human suffering in Yemen. The little data that is available suggests that it is immense. According to the UN, from March 2015 to August 2016 over 10,000 people were killed in Yemen, including 3,799 civilians.

This note asks whether high frequency satellite images do capture the extent to which conflict is ongoing in Yemen and asks in particular, whether there is distinct geographic variation suggesting which areas are most affected by the ongoing conflict.

Can the effect and the spatial incidence of the conflict in Yemen be traced through satellite images?

Satellite images have been used to study urban sprawl and general economic growth and development. The extent to which satellite images can be used to study man-made disasters such as conflicts is not widely explored.

There are lots of other papers that have used various night light data sets to study urbanization, ethnic favoritism, and economic growth (see Henderson et al., 2012; Michalopoulos and Papaioannou, 2013; Hodler and Raschky, 2014).

In related work (Fetzer et al., 2016), I studied the extent to which light emissions in the early 1990s can be used to obtain a measure of the extent of power rationing in Colombia following El Niño-induced droughts. In another project, we use the DMSP night light images to study the evolution of cities over time and how democratization can change the relative distribution of cities (Fetzer and Shanghavi, 2015).

Since 2012, the VIIRS high frequency and high resolution satellite images capturing night light emissions have been available from NOAA’s Earth Observation Group. They have now been made available for analysis on Google’s Earth Engine, making them much more accessible to the wider research audience.

Let's have a look at night lights in Yemen before and after the Saudi Arabian military intervention.

Average VIIRS lights after the Saudi intervention in Yemen started.

Average VIIRS lights for the period before the Saudi intervention in Yemen.

The light scales are identical, indicating that, relative to the border with Saudi Arabia, the night light emissions from Yemen have dropped dramatically, especially around the capital city Sana’a. The circular blobs are located around the main oil and gas producing parts of Yemen, where there may be light emissions due to the flaring of natural gas.

A minimum average light emission of 0.5 was imposed.

Zooming in on Sana’a, the figures look as follows.


Average light emissions from Sana’a since the Saudi intervention in Yemen started.


Average light emissions from Sana’a for period before the Saudi intervention in Yemen.

Having a look at the data


library(data.table)
library(ggplot2)

options(stringsAsFactors = FALSE)

# A data set of 34k populated places (or historically populated places)
YE <- data.table(read.csv(file = "~/Dropbox/Research/Yemen/Yemen-Cities.csv"))

# Lights data are from VIIRS images made available on the Google Earth Engine
LIGHTS <- data.table(read.csv(file = "~/Dropbox/Research/Yemen/lightsall.csv"))

# system.index starts with YYYYMM: extract year and month, then drop the geometry column
LIGHTS[, `:=`(year, as.numeric(substr(system.index, 1, 4)))]
LIGHTS[, `:=`(month, as.numeric(substr(system.index, 5, 6)))]
LIGHTS[, `:=`(.geo, NULL)]

# First day of each month; converted to Date because data.table does not support POSIXlt columns
LIGHTS[, `:=`(date, as.Date(strptime(paste(year, month, "01", sep = "-"), "%Y-%m-%d")))]

## Joining by: rownum

Some simple plots are quite suggestive. The following plots the average light emissions around populated places over time by month. The onset of the intervention, which coincides with the fall of Sana’a, is accompanied by a dramatic drop in light emissions.

Average lights dropped by almost two thirds, suggesting a standstill in economic activity. Light emissions are still visible overall, as indicated in the graphs, suggesting that the places do not turn pitch black.

plot(LIGHTS[, mean(list), by = date], type = "l")


The Distributional Effects of the Conflict

The Houthi movement has been gaining influence over a longer time period. In particular, since 2012 the Houthis have gained influence, spreading from the North to the South. The European Council on Foreign Relations has produced maps illustrating the spatial expansion of Houthi control in Yemen.


A central question relates to the strategy of the Saudi military intervention, in particular whether the intervention is aimed at territories that came under Houthi control since 2012 or whether it is targeted at the Houthi heartland.

A simple exercise to study this is to look at the evolution of lights in the northern Houthi heartland relative to the populated places in the rest of the country that came under Houthi control since 2012.

What constitutes the Houthi heartland is subject to contention, but a conservative definition may consist of the four governorates Ammran, Sada’ah, Al Jawf and Hajjah.

LIGHTS[, `:=`(HOUTHI, as.numeric(ADM1 %in% c("15", "22", "21", "19")))]
ggplot(LIGHTS[, mean(list), by = c("HOUTHI", "date")], aes(date, V1, colour = as.factor(HOUTHI))) + 
    geom_line() + geom_point() + theme_bw() + theme(legend.position = "bottom")


The summary statistics suggest that the drop is, in absolute terms, much larger outside the Houthi heartland. However, the initial level in the Houthi heartland is much lower, suggesting that this part of the country is much less developed. Given that light emissions have a notional minimum of zero, this truncation of the data is a concern.

One way around this is to dummify the lights measure and look at whether a populated place is lit above a certain threshold.

LIGHTS[, `:=`(anylit, list > 0.25)]
ggplot(LIGHTS[, mean(anylit), by = c("HOUTHI", "date")], aes(date, V1, colour = as.factor(HOUTHI))) + 
    geom_line() + geom_point() + theme_bw() + theme(legend.position = "bottom")


Again it is hard to see whether there is any divergence in trends in this dummified measure, but the measure is naturally less prone to being affected by the truncation inherent to this type of data.

A regression with location and time fixed effects that measures whether there was a distinct change in nightlights in places in the Houthi-heartland relative to the non-Houthi heartland suggests that there is indeed a marked difference, indicating that the conflict is concentrated in the non-Houthi heartland.
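Written out, a specification of this kind takes the following form. The symbols are chosen here for illustration: i indexes populated places, t indexes months, and y is the light measure.

```latex
\[
y_{it} \;=\; \alpha_i \;+\; \gamma_t \;+\; \beta\,\big(\text{Post}_t \times \text{Houthi}_i\big) \;+\; \varepsilon_{it}
\]
```

Here \alpha_i and \gamma_t are the location and time fixed effects, and \beta measures how much more (or less) lights changed in the Houthi heartland than elsewhere after the intervention. A positive \beta means lights in the heartland fell by less than in the rest of the country.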

Defining the treatment indicators for a difference-in-differences estimation and loading the lfe package, which allows for high-dimensional fixed effects:

library(lfe)

LIGHTS[, `:=`(anylit, list > 0.25)]
LIGHTS[, `:=`(postKSAintervention, as.numeric(date > "2015-03-01"))]
# the month is converted to a factor so it can be absorbed as a fixed effect
LIGHTS[, `:=`(date, as.factor(date))]

Running the actual difference in difference regressions:

# levels
summary(felm(list ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS))
## Call:
##    felm(formula = list ~ postKSAintervention:HOUTHI | rownum + date |      0 | ADM1, data = LIGHTS) 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.347  -0.205   0.043   0.194  82.063 
## Coefficients:
##                            Estimate Cluster s.e. t value Pr(>|t|)  
## postKSAintervention:HOUTHI   0.4184       0.1900   2.202   0.0277 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.758 on 1172455 degrees of freedom
## Multiple R-squared(full model): 0.752   Adjusted R-squared: 0.7447 
## Multiple R-squared(proj model): 0.003315   Adjusted R-squared: -0.02603 
## F-statistic(full model, *iid*):  103 on 34519 and 1172455 DF, p-value: < 2.2e-16 
## F-statistic(proj model): 4.848 on 1 and 22 DF, p-value: 0.03846
# dummified measure
summary(felm(anylit ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, 
    data = LIGHTS))
## Call:
##    felm(formula = anylit ~ postKSAintervention:HOUTHI | rownum +      date | 0 | ADM1, data = LIGHTS) 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12247 -0.10416  0.00593  0.06185  1.06958 
## Coefficients:
##                            Estimate Cluster s.e. t value Pr(>|t|)    
## postKSAintervention:HOUTHI  0.08470      0.02359    3.59  0.00033 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.2223 on 1172455 degrees of freedom
## Multiple R-squared(full model): 0.5762   Adjusted R-squared: 0.5637 
## Multiple R-squared(proj model): 0.008458   Adjusted R-squared: -0.02073 
## F-statistic(full model, *iid*):46.18 on 34519 and 1172455 DF, p-value: < 2.2e-16 
## F-statistic(proj model): 12.89 on 1 and 22 DF, p-value: 0.00163
# taking logs
summary(felm(log(list) ~ postKSAintervention:HOUTHI | rownum + date | 0 | ADM1, 
    data = LIGHTS[!is.infinite(log(list))]))
## Call:
##    felm(formula = log(list) ~ postKSAintervention:HOUTHI | rownum +      date | 0 | ADM1, data = LIGHTS[!is.infinite(log(list))]) 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8918  -0.3725   0.1060   0.5223   6.5958 
## Coefficients:
##                            Estimate Cluster s.e. t value Pr(>|t|)    
## postKSAintervention:HOUTHI   0.4133       0.1234    3.35 0.000809 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.8958 on 844476 degrees of freedom
##   (327294 observations deleted due to missingness)
## Multiple R-squared(full model): 0.6534   Adjusted R-squared: 0.6393 
## Multiple R-squared(proj model): 0.01248   Adjusted R-squared: -0.02789 
## F-statistic(full model, *iid*):46.12 on 34519 and 844476 DF, p-value: < 2.2e-16 
## F-statistic(proj model): 11.22 on 1 and 22 DF, p-value: 0.002899
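To see what the interaction coefficient picks up, here is a minimal sketch on synthetic data using plain lm with dummy variables. The panel dimensions, the group sizes and the true effect of 0.4 are all made up; on data of the size used above, felm is preferable because it absorbs the fixed effects rather than estimating them and also clusters the standard errors.

```r
# Two-way fixed effects difference-in-differences on synthetic data (base R).
set.seed(42)
n_places <- 60
n_months <- 24
panel <- expand.grid(place = 1:n_places, month = 1:n_months)
panel$treated <- as.numeric(panel$place <= 20)   # made-up "heartland" group
panel$post    <- as.numeric(panel$month > 12)    # made-up intervention date

# Outcome = place effect + month effect + differential post-period change + noise
place_fe <- rnorm(n_places)
month_fe <- rnorm(n_months)
true_effect <- 0.4
panel$y <- place_fe[panel$place] + month_fe[panel$month] +
  true_effect * panel$post * panel$treated + rnorm(nrow(panel), sd = 0.5)

# Place and month dummies absorb level differences and common shocks, so only
# the interaction identifies the differential post-intervention change
fit <- lm(y ~ post:treated + factor(place) + factor(month), data = panel)
beta_hat <- coef(fit)[["post:treated"]]
print(beta_hat)
```

The estimate should land close to the assumed effect of 0.4, which is the synthetic analogue of the postKSAintervention:HOUTHI coefficient.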

An alternative way to study this is by doing a flexible non-parametric estimation to rule out diverging trends prior to the military intervention.

summary(felm(anylit ~ date:HOUTHI | rownum + date | 0 | ADM1, data = LIGHTS))
## Call:
##    felm(formula = anylit ~ date:HOUTHI | rownum + date | 0 | ADM1,      data = LIGHTS) 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12574 -0.10765  0.00313  0.06437  1.06515 
## Coefficients:
##                       Estimate Cluster s.e. t value Pr(>|t|)    
## date2014-01-01:HOUTHI       NA      0.00000      NA       NA    
## date2014-02-01:HOUTHI  0.01095      0.01320   0.830 0.406641    
## date2014-03-01:HOUTHI  0.03173      0.02764   1.148 0.250884    
## date2014-04-01:HOUTHI  0.11048      0.06028   1.833 0.066814 .  
## date2014-05-01:HOUTHI  0.09762      0.05271   1.852 0.063989 .  
## date2014-06-01:HOUTHI  0.10249      0.05861   1.749 0.080336 .  
## date2014-07-01:HOUTHI  0.07204      0.06053   1.190 0.233987    
## date2014-08-01:HOUTHI  0.06338      0.04866   1.302 0.192778    
## date2014-09-01:HOUTHI  0.03816      0.04690   0.814 0.415860    
## date2014-10-01:HOUTHI  0.04247      0.04359   0.974 0.329930    
## date2014-11-01:HOUTHI  0.05621      0.03646   1.542 0.123115    
## date2014-12-01:HOUTHI  0.02213      0.03037   0.729 0.466205    
## date2015-01-01:HOUTHI -0.02596      0.02585  -1.004 0.315415    
## date2015-02-01:HOUTHI  0.02250      0.05141   0.438 0.661649    
## date2015-03-01:HOUTHI  0.06080      0.05740   1.059 0.289437    
## date2015-04-01:HOUTHI  0.13514      0.04806   2.812 0.004925 ** 
## date2015-05-01:HOUTHI  0.15874      0.04647   3.416 0.000635 ***
## date2015-06-01:HOUTHI  0.15493      0.05151   3.008 0.002632 ** 
## date2015-07-01:HOUTHI  0.12681      0.04697   2.700 0.006944 ** 
## date2015-08-01:HOUTHI  0.12363      0.04319   2.863 0.004202 ** 
## date2015-09-01:HOUTHI  0.13972      0.05276   2.648 0.008088 ** 
## date2015-10-01:HOUTHI  0.13422      0.04697   2.857 0.004273 ** 
## date2015-11-01:HOUTHI  0.12408      0.04566   2.717 0.006578 ** 
## date2015-12-01:HOUTHI  0.12125      0.04505   2.691 0.007119 ** 
## date2016-01-01:HOUTHI  0.11971      0.03905   3.065 0.002176 ** 
## date2016-02-01:HOUTHI  0.11952      0.04151   2.879 0.003984 ** 
## date2016-03-01:HOUTHI  0.12721      0.04239   3.001 0.002693 ** 
## date2016-04-01:HOUTHI  0.12537      0.04532   2.766 0.005669 ** 
## date2016-05-01:HOUTHI  0.12989      0.05297   2.452 0.014209 *  
## date2016-06-01:HOUTHI  0.13070      0.05936   2.202 0.027675 *  
## date2016-07-01:HOUTHI  0.14831      0.06597   2.248 0.024573 *  
## date2016-08-01:HOUTHI  0.13047      0.04614   2.827 0.004693 ** 
## date2016-09-01:HOUTHI  0.14481      0.06024   2.404 0.016227 *  
## date2016-10-01:HOUTHI  0.11782      0.05255   2.242 0.024959 *  
## date2016-11-01:HOUTHI  0.12175      0.04473   2.722 0.006486 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.2219 on 1172422 degrees of freedom
## Multiple R-squared(full model): 0.5776   Adjusted R-squared: 0.5652 
## Multiple R-squared(proj model): 0.01175   Adjusted R-squared: -0.01738 
## F-statistic(full model, *iid*): 46.4 on 34552 and 1172422 DF, p-value: < 2.2e-16 
## F-statistic(proj model): 147.2 on 35 and 22 DF, p-value: < 2.2e-16

This suggests that the differential drop in lights occurred only after March 2015, the month in which Saudi Arabia’s military intervention commenced.

On average, the regressions suggest that the drop in lights was significantly more pronounced outside the Houthi heartland. This suggests that the conflict and the bombing carried out by Saudi Arabia is mostly concentrated outside the Houthi rebel heartland.

That the dramatic drop in light emissions is associated with the Saudi military intervention is quite clear. The conflict between the Houthi rebels and the government had been ongoing for several years, but only with the intervention of Saudi Arabia do marked differences between Houthi and non-Houthi heartland provinces appear.

This analysis can be further refined by studying the religious make-up of different provinces, as the divide between Shia and Sunni Muslim groups is said to be an important factor driving this conflict.

Nevertheless, this analysis suggests that high frequency satellite images such as these can be useful in assessing the extent to which areas are directly affected by conflict, which may be useful for targeting humanitarian relief.