A spatial regression of Irish residential property prices

It’s no secret that in recent years the prices of property have become a massive issue for many people. Understanding what drives increases in prices is surely a multi-faceted issue, The focus of this spatial regression analysis will be on the impact various factors have on residential property prices in Ireland, in particular the impact change in population in Irish counties has on house prices. By testing two different models (SAR Model with Maximum likelihood and Spatial Durbin Model with Maximum Likelihood) perhaps we can theoretically and conceptually expand on what these particular models have to offer what their limitations may be. This analysis is mainly to explore the use of these models in an Irish context.

In this first step, we will read in the relevant packages that we will use in this analysis

library(maptools)
library(spdep)
library(raster)
library(rgdal)
library(leaflet)
library(leaflet.extras)
library(mapview)
library(optim.functions)
library(stats)
library(spatialreg)
library(stargazer)
library(geodist)

In this first step, we will read in the relevant packages that we will use in this analysis

An overview of data used

While different years will be used to provide an initial overview in the visualization stage of the analysis, some preliminary descriptive statistics follow on the primary year used in the analysis (2016). While this analysis will largely focus on the relationship between house prices (both absolute values and percentage changes) and population change (percentage change). Here we look at the years 2011 and 2016, an interesting period in Irish property prices for sure, given many counties were only starting to recover from the financial crises of 2008 – 2014.

data2016 <- read_csv("data2016.csv")

An overview of what the variables refered to is contained below:

1. Year
2. County
3. hp
• This refers to house prices as listed in euros
4. avgincome
• this refers to the average yearly income in euros
5. fraud
• incidences of fraud reported at the county level
6. pop
• Total population
7. popdens
• Population density
8. areakm2
• Area (in square kilometers)
9. stock
• Total housing stock
10. dlhp
• House price changes from 2011 to 2016
11. dlpop
• Population change from 2011 to 2016
12. vacancy
• Housing vacancy rate
13. stockchange
• Housing stock change from 2011 to 2016
summary(data2016)
##       year         county                hp           avgincome
##  Min.   :2016   Length:26          Min.   : 83524   Min.   :19471
##  1st Qu.:2016   Class :character   1st Qu.:116431   1st Qu.:23186
##  Median :2016   Mode  :character   Median :141584   Median :24479
##  Mean   :2016                      Mean   :160722   Mean   :25100
##  3rd Qu.:2016                      3rd Qu.:176120   3rd Qu.:25556
##  Max.   :2016                      Max.   :404590   Max.   :34305
##      fraud             pop             popdens           areakm2
##  Min.   :  68.0   Min.   :  32044   Min.   :  20.15   Min.   : 826
##  1st Qu.:  77.0   1st Qu.:  76622   1st Qu.:  36.01   1st Qu.:1701
##  Median : 107.5   Median : 123851   Median :  47.63   Median :2014
##  Mean   : 200.5   Mean   : 183149   Mean   : 108.64   Mean   :2703
##  3rd Qu.: 137.0   3rd Qu.: 159463   3rd Qu.:  68.56   3rd Qu.:3276
##  Max.   :2224.0   Max.   :1347359   Max.   :1461.34   Max.   :7500
##      stock             dlhp              dlpop             vacancy
##  Min.   : 18051   Min.   :-0.30513   Min.   :-0.01207   Min.   :0.05900
##  1st Qu.: 32772   1st Qu.:-0.13989   1st Qu.: 0.01411   1st Qu.:0.09325
##  Median : 53739   Median :-0.05114   Median : 0.02989   Median :0.12350
##  Mean   : 77063   Mean   :-0.02653   Mean   : 0.02816   Mean   :0.14496
##  3rd Qu.: 73233   3rd Qu.: 0.05422   3rd Qu.: 0.04506   3rd Qu.:0.19300
##  Max.   :530753   Max.   : 0.30844   Max.   : 0.05924   Max.   :0.29000
##   stockchange
##  Min.   :-0.012325
##  1st Qu.:-0.000483
##  Median : 0.002678
##  Mean   : 0.002377
##  3rd Qu.: 0.006189
##  Max.   : 0.017311

Identifying the need for a Spatial analysis

A typical first step in deciding whether is a need for a spatial regression, is identifying whether spatial dependencies can be observed through the use of a visualization. While relatively straightforward, a visualization can provide some indication whether in fact spatial dependencies exist. In order to identify this, and given the depth of the data used, this analysis will look at the spatial dependencies of residential property prices in two time periods (2011, and 2016), the reason for the choice of these two periods is that given they were census years, the data available is much richer.

As can be seen below, we are dealing with the 26 counties of the Republic of Ireland.

Ireland <- getData("GADM", country = "Ireland", level = 1)
plot(Ireland)

Retrieving the weight matrix W from the shape file (map), we will start by constructing a neighbour list from data already contained within the Ireland map, first the contiguity W. However as can be seen below it constructs some rather strange and distant linkages which seem counter-intuitive.

contiguity <- tri2nb(coordinates(Ireland))
plot(Ireland)
plot(contiguity, coordinates(Ireland), col = 4, add = T)

So instead I use a k nearest neighbour approach with a k of 4 to give a clearer insight based on neighbour proximity and to better correct for the previous, rather messy linkages which were being constructed, and as can be seen below a k of 4 seems to better represent the linkages. A previous k of 6 was tried, however the linkages were again rather unusual and quite distant, so it was decided to opt for the use of a k of 4.

nearest.four <- knearneigh(coordinates(Ireland), k = 4) #constructing a knn based on a set k
nearest.four2 <- knn2nb(nearest.four)  # convert knn class to nb so it can be used in the plot
plot(Ireland)
plot(nearest.four2, coordinates(Ireland), col = 2, add = T)

Visualising house prices in 2011

In order to create a relatively straightforward visualization of house prices in Ireland, first we will read in the 2011 housing data (alongside 2016 data which I use later), which is a streamlined version of the above summary data, containing information on the county, the average price and the year.

houseprices2011 <- read_csv("houseprices2011.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
##   Year = col_double(),
##   County = col_character(),
##   Price = col_double()
## )
##
## -- Column specification --------------------------------------------------------
## cols(
##   Year = col_double(),
##   County = col_character(),
##   Price = col_double()
## )

Then we will create a color palette based on the relevant average house prices, with darker shades indicating higher prices, and lighter colors indicating lower average house prices.

color <- rev(heat.colors(26)) #number based on the number of regions we're trying to plot
color2 <- color[rank(houseprices2011$Price)] #this is ranking the house prices and ascribing a color to them plot(Ireland, col=color2) #Now we plot the Ireland map, with the colour palette constructed added text(coordinates(Ireland), labels = as.character(round(houseprices2011$Price)), cex = 0.6) #To improve readability, we add text as a label in the plot

However this particular visualisation is a little difficult to read and interact with, so instead we will construct an interactive plot of the data on house prices in 2011.

IRL_map <- spTransform(Ireland, CRS("+proj=longlat +datum=WGS84")) #changing the current coordinate reference system to long and latitude
pal <- colorNumeric("inferno", domain = NULL, reverse = TRUE) #come up with a color palette
m <- leaflet(data = IRL_map)
m <- addProviderTiles(m, providers$OpenTopoMap) #use a topographic map m <- addPolygons(m, fillColor = ~pal(log(houseprices2011$Price)), stroke = T, weight = 1,
fillOpacity = 0.7, color = "black",
label = ~paste0(houseprices2011$County, ": ", format( houseprices2011$Price, big.mark=",", scientific=FALSE)),
highlightOptions = highlightOptions(color = "white", weight = 4,
bringToFront = T))  #adding an interactive element to the map
m <- addLegend(m, pal = pal, values = houseprices2016$Price, opacity = 1, title = "House Prices in Euros") #here we add a legend to the map m 2011 visualisation interpretation In the visualization of house prices from 2011, we can already see considerable spatial spillover effects being observed around the major population centers of Dublin, Cork and Galway. However given the period of considerable fluctuation being experienced in the Irish housing market in this period, this visualization is to be interpreted alongside the below visualization from 2016. Visualising house prices in 2016 Again below we follow much the same process in creating the map for 2016 as we did in the 2011 map above. color <- rev(heat.colors(26)) color2 <- color[rank(houseprices2016$Price)]
m1 <- addProviderTiles(m, providers$OpenTopoMap) irl16 <- addPolygons(m1, fillColor = ~pal(log(houseprices2016$Price)), stroke = T, weight = 1,
fillOpacity = 0.7, color = "black",
label = ~paste0(houseprices2016$County, ": ", format( houseprices2016$Price, big.mark=",", scientific=FALSE)),
highlightOptions = highlightOptions(color = "white", weight = 4,
bringToFront = T))
#irl16 <- addLegend(irl16, pal = pal, values = houseprices2016$Price, opacity = 1, title = "House Prices in Euros") irl16 2016 visualisation interpretation In this second iteration, we can see that there does appear to be spatial dependencies on house prices in both versions of the maps, which is primarily centered around the major population centers, in particular this effect is clearly pronounced in the commuter belt surrounding Dublin. As mentioned previously we can see the effects of the collapse of house prices stemming from the impact of the 2008 recession in a number of counties, most notably in the border counties of the North and West, and in the middle of the country where house prices have fallen precipitously between 2011 and 2016. While not sufficient on its own, the maps provide a useful indication that spatial dependencies do play a part in house prices in Ireland, and for this reason we will do some further tests using regressions further below. Running a spatial regression With greater clarity that there does appear to be a spatial relationship with regards to house prices in Ireland, This analysis will further test a number of different hypotheses through the use of 2 spatial regression approaches, namely a SAR (Spatial Autoregressive) Model with Maximum likelihood and a Spatial Durbin Model with Maximum Likelihood to test the factors underpinning changes in house prices. Regression models Ordinary Least Squares (OLS) I start with a non-spatial OLS, to test the impact a change in population has on house prices within Ireland, the purpose here (given the above visualisations) is to test a spatially-blind model and to see how a purely non-spatial analysis performs as a benchmark for further spatial regression approaches. This regression will just test the impact changes in population have on house prices not accounting for any spatial aspects. fit.ols <- lm(hp/100000 ~ dlpop, data = data2016) summary(fit.ols) ## ## Call: ## lm(formula = hp/1e+05 ~ dlpop, data = data2016) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.19902 -0.27165 -0.04023 0.15084 1.78955 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.0017 0.1984 5.049 3.67e-05 *** ## dlpop 21.5006 5.7285 3.753 0.00098 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.5887 on 24 degrees of freedom ## Multiple R-squared: 0.3699, Adjusted R-squared: 0.3436 ## F-statistic: 14.09 on 1 and 24 DF, p-value: 0.0009805  a 1% increase in population growth is estimated to increase house prices on average by €21,500 For a more comprehensive analysis of the effect of the different variables, I have included a model expansion below. models1<-list() models1[[1]]<- lm(hp/100000 ~ dlpop, data = data2016) models1[[2]]<- lm(hp/100000 ~ dlpop + fraud, data = data2016) models1[[3]]<- lm(hp/100000 ~ dlpop + fraud + vacancy, data = data2016) models1[[4]]<- lm(hp/100000 ~ dlpop + fraud + vacancy + stockchange, data = data2016) models1[[5]]<- lm(hp/100000 ~ dlpop + fraud + vacancy + stockchange + avgincome, data = data2016) models1[[6]]<- lm(hp/100000 ~ dlpop + fraud + vacancy + stockchange + avgincome + popdens, data = data2016) stargazer(models1, type = 'html')  OLS Interpretation While this model will only be used as a benchmark, and as such a deep investigation of what we can see will not be undertaken, it is interesting to note that in a simple model a 1% increase in population leads to €21,500 and across all models (displayed above) the range is between €5,637 and €21,501 with varying degrees of significance, there are also a number of interesting variables such as fraud and avgincome looking at the incidences of fraud reported and average annual income respectively which would typically warrant a deeper investigation. Spatial Regression Models SAR Model with Maximum likelihood As there is no spatial dimension to the above OLS model we will not dwell on the model too much, but rather refer back to it later in the analysis as a benchmark to investigate the coefficients. What follows now is a spatial regression approach, notably the use of a spatial autoregressive (SAR) model. Here we build on the work of LeSage, 2014. Below we start by creating the weight matrix based on the nearest 4 neighbours as discussed earlier. We then check to ensure the weight matrix was constructed correctly W <- nb2mat(nearest.four2) #create a weight matrix from the nearest four W[1:5, 1:5] # the diagonal has to be 0 (so you're not explaining y with y), this contains information on where there are neighbour ## [,1] [,2] [,3] [,4] [,5] ## 1 0 0.00 0.00 0.00 0 ## 2 0 0.00 0.00 0.00 0 ## 3 0 0.00 0.00 0.25 0 ## 4 0 0.00 0.25 0.00 0 ## 5 0 0.25 0.00 0.00 0 Following this we specify both the Y and the X which will be used in the model. As discussed above the Y will be the data on average house prices in the 26 counties, and the X used here will refer to the change in population. Y <- as.matrix(data2016$hp/100000) # here we divide by 100000 to ease interpretability
colnames(Y) <- "House Prices" # here we add the column name
X <- cbind(1,data2016$dlpop) #change this to estimate different variables as required colnames(X) <- c("intercept","Change in population") Now we will create the likliehood function normal.lik1 <- function(theta,y,X) { # theta is coefficient vector beta <- theta[1:2] sigma <- theta[3] rho <- theta[4] n <- nrow(X) In <- diag(n) # log likelihood: logl <- log(det(In - rho*W)) - (n/2)*log(2*pi*sigma^2) - sum((1/(2*sigma^2))*(y-rho*W%*%y-X%*%beta)^2) return(-logl) } Now we will run the maximum likelihood estimator through the use of an optimizer. fit2 <- optim(c(1,1,1,0.5), normal.lik1, y = Y, X = X, method = "BFGS", control = list(maxit = 1000), hessian = TRUE) ## Warning in log(det(In - rho * W)): NaNs produced fit2 ##$par
## [1]  0.3983096 14.6799210  0.5000549  0.4904547
##
## $value ## [1] 19.64726 ## ##$counts
##       66       29
##
## $convergence ## [1] 0 ## ##$message
## NULL
##
## $hessian ## [,1] [,2] [,3] [,4] ## [1,] 1.039772e+02 2.9284161740 6.741902e-03 168.636696 ## [2,] 2.928416e+00 0.1247102812 3.125304e-04 5.336772 ## [3,] 6.741902e-03 0.0003125304 2.079691e+02 14.532245 ## [4,] 1.686367e+02 5.3367723241 1.453225e+01 308.554265 We also include standard errors in our model sqrt(diag(solve(fit2$hessian)))
## [1] 0.29479620 5.58138912 0.07068827 0.19661351

Interpretation of the output

1% higher population growth leads to 19.6472/100×100000 = €19,647 higher house prices (as a first round effect) It’s important to note here that this value refers to first order effects, given the nature of the model there will likely be both spillover and feedback effects in the model and that the value is likely to increase once a steady state is reached, however a limitation contained herein is that this is a purely static model. It is also interesting to note that the value here of €19,647 is quite similar to the value of €21,500 from the OLS model above.

Use of lagsarlm

Below we expand on the previous section by seeking to use a Maximum likelihood estimation of spatial simultaneous autoregressive given in the form of:

y=ρWy+Xβ+εy=ρWy+Xβ+ε

Which we begin to operationalise through the construction of the weight matrix from the nearest four neighbours as explained further above.

W.listw <- nb2listw(nearest.four2)

We then estimate the model using the Weight list data constructed above

fit.sar <- lagsarlm(hp/100000 ~ dlpop, W.listw, data = data2016, method="eigen", quiet = TRUE) # A full SAR model with population change as the X variable
stargazer(fit.sar, type = "html")
summary(fit.sar)
##
## Call:lagsarlm(formula = hp/1e+05 ~ dlpop, data = data2016, listw = W.listw,
##     method = "eigen", quiet = TRUE)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.817785 -0.199863 -0.043583  0.206771  1.614220
##
## Type: lag
## Coefficients: (asymptotic standard errors)
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.39825    0.30558  1.3032 0.192494
## dlpop       14.67998    5.18279  2.8324 0.004619
##
## Rho: 0.4905, LR test value: 4.8573, p-value: 0.027529
## Asymptotic standard error: 0.17924
##     z-value: 2.7366, p-value: 0.0062075
## Wald statistic: 7.4891, p-value: 0.0062075
##
## Log likelihood: -19.64726 for lag model
## ML residual variance (sigma squared): 0.25006, (sigma: 0.50006)
## Number of observations: 26
## Number of parameters estimated: 4
## AIC: 47.295, (AIC for lm: 50.152)
## LM test for residual autocorrelation
## test value: 0.083747, p-value: 0.77228

Interpretation of SAR Model

The results from the SAR model are considerably different from the benchmark we observed in the OLS model above, with a 1% population increase leading to a roughly €14,680 increase in average property prices. While this is not in and of itself surprising it does signal the considerable differences one can observe by taking account of the spatial dependencies existing within a dataset. In the interests of testing different spatial models on a dataset, we will now run a spatial durbin model with maximum likelihood and compare the results obtained from that model with the results from the SAR model. Here also we see a marked departure from the values given above in the simple OLS model.

Spatial Durbin Model with Maximum Likelihood:

Here we will now estimate a Spatial Durbin model (SDM), here we lean on the work of Ruttenauer, T. (2019) who through testing different spatial models found that in contrast to the SAR model, that SDM provides accurate estimates of direct impacts even in the case of misspecification, and for this reason we seek to evaluate its results of the effect of population changge on house prices. It is expressed in Ruttenauer, T. (2019) that SDM combines an autoregressive dependent variable and spatially lagged covariates (Wy and WX), the model used is displayed below:

Y=ρWy+Xβ+λWX+ε

We start out, as before by constructing a weight matrix, which will be created as a neighborhood object.

W.nb <- tri2nb(coordinates(Ireland))

We then move on to including the variables for house price change(Y) and population change (X), and replace the NAs in the data with zeros, finally we change the names to improve Interpretability

W <- nb2mat(W.nb)
W.listw <- nb2listw(W.nb)
Y <- as.matrix(data2016$dlhp) colnames(Y) <- "DLHP" X <- cbind(1,data2016$dlpop)
# you need to replace the NAs in X by some values: here zero:
X <- ifelse(is.na(X),0,X)
colnames(X) <- c("intercept","DLpop")

Similar to in the SAR model we also create a likelihood function, with some of the key differences being the inclusion of the lambda and the epislon values, alongside the different construction of the logl

normal.lik1 <- function(theta,y,X) {
beta <- theta[1:2]
sigma <- theta[3]
rho <- theta[4]
lambda <- theta[5]
n <- nrow(X)
I <- diag(n)
# log likelihood:
e <- y-rho*W%*%y-X%*%beta-lambda*W%*%X[,2]
logl <- log(det(I - rho*W)) - (n/2)*log(2*pi*sigma^2) - crossprod(e)/(2*sigma^2)
return(-logl)
}

Following this we use an optimiser, as above to seek to fit the model ‘by hand’

fit2 <- optim(c(0,-0.5,0.03,0.5,0.5),normal.lik1,y = Y, X = X, method = "BFGS",
control = list(maxit = 1000))
fit2
## $par ## [1] 0.3674980 1.5698860 0.1342101 2.8074548 -12.8502021 ## ##$value
## [1] -19.86226
##
## $counts ## function gradient ## 119 53 ## ##$convergence
## [1] 0
##
## \$message
## NULL

Interpretation

While this approach is used as an illustration, there are a number of important points to call out, namely that the production of warnings that;log(det(I – rho * W)): NaNs produced – likely refers to the inclusion of negative values within the dataset. Alongside this the value of -19.86226 refers to a 19% decrease in prices with a 1% increase in population, both of these factors are likely being effected by the considerable drop in prices being observed in the border, midlands and west regions which have seen a considerable drop in house prices but in many cases a slight increase in population. We will test this model further with lagsarlm below.

For comparison: The lagsarlm command for the SDM model:
fit.sdm <- lagsarlm(dlhp ~ dlpop, data = data2016, W.listw, method="eigen", quiet = TRUE, type = "mixed")
summary(fit.sdm)
##
## Call:lagsarlm(formula = dlhp ~ dlpop, data = data2016, listw = W.listw,
##     type = "mixed", method = "eigen", quiet = TRUE)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -0.1851749 -0.0479621 -0.0038828  0.0534768  0.1938151
##
## Type: mixed
## Coefficients: (asymptotic standard errors)
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.201214   0.072716 -2.7671 0.005655
## dlpop        3.542001   1.137272  3.1145 0.001843
## lag.dlpop    2.974928   2.403254  1.2379 0.215762
##
## Rho: 0.26218, LR test value: 0.80638, p-value: 0.36919
## Asymptotic standard error: 0.26388
##     z-value: 0.99358, p-value: 0.32043
## Wald statistic: 0.98719, p-value: 0.32043
##
## Log likelihood: 25.3929 for mixed model
## ML residual variance (sigma squared): 0.0081813, (sigma: 0.090451)
## Number of observations: 26
## Number of parameters estimated: 5
## AIC: -40.786, (AIC for lm: -41.979)
## LM test for residual autocorrelation
## test value: 0.0046739, p-value: 0.94549

Interpretation

While there exists differences in how the SAR model and SDM function, as expanded upon further above, we can see here through the use of lagsarlm that house price changes are likely to take place with a 1% increase in population, indeed we can see that a 1% higher population increases house price growth by 3.54% (as a first round effect), and through the use of the SAR model above we can estimate that as a first round effect that the value of this change is roughly €14,680, but the SDM provides us an insight into the scale of this change across different regions in the study. We can also see through the SAR model that the benchmark OLS is not so close to the mark in terms of taking account of the spatial dependencies referred to earlier.

Future Directions

Given the data used here was based on census data, it would be interesting to expand the analysis to focus on the new census data when it becomes available (initially scheduled for 2021, but delayed due to Covid-19). One could also use the expected population increases since 2016 (ranging from 1.1% in 2017 to a high of 1.4% in 2019) and the resultant effects on Irish property prices. However, one would also assume that given the disfunction in many parts of the Irish property market, specifically around the major population centers of Dublin, Cork and Galway, that property prices are quite detached from population growth.

References

1. LeSage, James P., What Regional Scientists Need to Know About Spatial Econometrics (January 5, 2014). Available at SSRN: https://ssrn.com/abstract=2420725 or http://dx.doi.org/10.2139/ssrn.2420725
2. Ruttenauer, T. (2019). Spatial Regression Models: A Systematic Comparison of Different Model Specifications using Monte Carlo Experiments. Sociological Methods and Research, Forthcoming.

Author

Jason Deegan

Jason Deegan is a PhD Candidate (Stipendiat) and research fellow at the University of Stavanger. His work primarily focuses on; Innovation, Regional Studies, Smart Specialisation and Policy.