{ "metadata": { "name": "pysal_spreg" }, "nbformat": 2, "worksheets": [ { "cells": [ { "cell_type": "markdown", "source": [ "# `pysal.spreg`: \"spatial regression like a Pro\"", "#### Author: Daniel Arribas-Bel (May, 2012) ``", "", "In this session, we will walk through most of the functionality that `pysal` has to offer when it comes to spatial regression. We will repeat the exercise we have previously seen in GeoDaSpace using PySAL entirely. In the process, we will discover some of the details that GeoDaSpace misses in order to have an intuitive interface and, by showing some extra tricks, will hopefully convince you of the extra power and flexibility that the command line offers at the cost of a slightly steeper learning curve. The three parts in which we will split the session are the following:", "", "* Load up the data into Python", "* Replication of the regression analysis in GeoDaSpace", "* Some extra tricks you can access if you use `pysal.spreg`" ] }, { "cell_type": "markdown", "source": [ "## Loading your data into Python", "This is the very first step you need to go through in order to perform any data analysis with Python. It is the equivalent to load up a dbf of a csv file in GeoDaSpace, onlyt that instead of clicking on a GUI, we will run commands. Before anything, we have to bring `pysal` to our sesion. To make it shorter to type, we will import it with the alias `ps`. Because ultimately, `pysal` takes NumPy arrays, we will also import numpy, shortening with the common `np`:" ] }, { "cell_type": "code", "collapsed": true, "input": [ "import pysal as ps", "import numpy as np" ], "language": "python", "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "source": [ "Let us imagine that our data are in a `csv` file under the path `../data_workshop/phx.csv`. Make sure that you have navigated to the right folder or pass the full path of the file (alternatively, you can use relative paths as well, as we are doing in this example). We will load them in memory by calling the command open in `pysal`, which mirrors the general file I/O in Python:" ] }, { "cell_type": "code", "collapsed": true, "input": [ "db = ps.open('../workshop_data/phx.csv')" ], "language": "python", "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "source": [ "This creates the object `db` which contains all our data. 
We can explore it a little bit to get a feel for what it contains:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(db) #How many observations" ], "language": "python", "outputs": [ { "output_type": "pyout", "prompt_number": 3, "text": [ "985" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "db.header #Names of the columns" ], "language": "python", "outputs": [ { "output_type": "pyout", "prompt_number": 4, "text": [ "['id',", " 'ALAND10',", " 'AWATER10',", " 'GEOID10',", " 'NAMELSAD10',", " 'pop_dens',", " 'inc',", " 'inc_error',", " 'pct_error',", " 'renter_rt',", " 'pop',", " 'white_rt',", " 'black_rt',", " 'hisp_rt',", " 'fem_nh_rt',", " 'vac_hsu_rt',", " 'hsu',", " 'l_pct_err']" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "db[0, :] #First row" ], "language": "python", "outputs": [ { "output_type": "pyout", "prompt_number": 5, "text": [ "[['g04021000803',", " 2197.12461,", " 527824,", " 4021000803,", " 'Census Tract 8.03',", " 5.52949975832,", " 4462.0,", " 766.0,", " 17.1671896011,", " 0.625565890197,", " 100.0,", " 54.9839492962,", " 7.78664910692,", " 27.0639558811,", " 0.304551814964,", " 17.8477690289,", " 381.0,", " 2.84299998074]]" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "db[0:5, 0] #First five elements of the first column" ], "language": "python", "outputs": [ { "output_type": "pyout", "prompt_number": 6, "text": [ "['g04021000803',", " 'g04021001701',", " 'g04021001403',", " 'g04021001406',", " 'g04021001303']" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "source": [ "If we want to extract a full column, `pysal` has a very handy utility for the task, which will return a list with the elements of the column. We can also get some basic statistics using NumPy:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pdens = db.by_col('pop_dens')", "pdens[0:5] #First five elements of the population density column" ], "language": "python", "outputs": [ { "output_type": "pyout", "prompt_number": 7, "text": [ "[5.52949975832, 1.10832310582, 170.575010113, 27.2587305868, 3.04072345619]" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "min(pdens) #Minimum value" ], "language": "python", "outputs": [ { "output_type": "pyout", "prompt_number": 8, "text": [ "0.00471158964005" ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "max(pdens) #Maximum value" ], "language": "python", "outputs": [ { "output_type": "pyout", "prompt_number": 9, "text": [ "904.675278538" ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "np.mean(pdens) #Average value" ], "language": "python", "outputs": [ { "output_type": "pyout", "prompt_number": 10, "text": [ "163.48580448940245" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "np.var(pdens) #Variance" ], "language": "python", "outputs": [ { "output_type": "pyout", "prompt_number": 11, "text": [ "13592.733720318472" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "source": [ "To be able to use `spreg`, we need to convert the data we will be using in our models into NumPy arrays, which are efficient data structures that allow very good performance for matrix operations. Also, these arrays need to be 2D, so we will reshape them when necessary. 
Let us use the log of the `pct_error` as the dependent variable (y) and the same list of variables as in the GeoDaSpace example for the regressors:" ] }, { "cell_type": "code", "collapsed": true, "input": [ "y = np.array(db.by_col('l_pct_err')).reshape((len(db), 1))", "x_names = ['hsu', 'pop_dens', 'white_rt', 'black_rt', 'hisp_rt', \\", " 'fem_nh_rt', 'renter_rt', 'vac_hsu_rt']", "x = np.array([db.by_col(var) for var in x_names]).T" ], "language": "python", "outputs": [], "prompt_number": 12 }, { "cell_type": "markdown", "source": [ "The last bit of data house-keeping we need to perform is to load the spatial weights. ", "Although we will use more in the last section, let us begin with the `knn` weights with six neighbors", "(note you might get a warning because it does not find IDs matching in the shapefile in the folder). ", "We also row-standardize the matrix so every row sums to one and the spatial lag can be interpreted", "as the average value of the neighbors." ] }, { "cell_type": "code", "collapsed": false, "input": [ "w = ps.open('../workshop_data/phx_knn06.gwt').read()", "w.transform = 'R'" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stderr", "text": [ "/Users/dani/code/pysal/pysal/core/IOHandlers/gwt.py:141: RuntimeWarning: DBF relating to GWT was not found, proceeding with unordered string ids.", " warn(\"DBF relating to GWT was not found, proceeding with unordered string ids.\", RuntimeWarning)" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "source": [ "##Replication of the regression analysis in GeoDaSpace", "###Non-spatial model", "We are all set to start the regression analysis. As a benchmark, we will begin with a base model using the command `OLS`. We will also pass in a weights object and set the flag so we obtain also spatial diagnostics about the residuals." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pysal.spreg import OLS", "model = OLS(y, x, w=w, name_x=x_names, spat_diag=True)", "print model.summary" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "REGRESSION", "----------", "SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION", "----------------------------------------------------", "Data set : unknown", "Weights matrix : unknown", "Dependent Variable : dep_var Number of Observations: 985", "Mean dependent var : 2.7077 Number of Variables : 9", "S.D. dependent var : 0.3505 Degrees of Freedom : 976", "", "R-squared : 0.322025", "Adjusted R-squared : 0.3165", "Sum squared residual: 81.944 F-statistic : 57.9476", "Sigma-square : 0.084 Prob(F-statistic) :2.844783e-77", "S.E. 
of regression : 0.290 Log likelihood : -173.002", "Sigma-square ML : 0.083 Akaike info criterion : 364.003", "S.E of regression ML: 0.2884 Schwarz criterion : 408.037", "", "----------------------------------------------------------------------------", " Variable Coefficient Std.Error t-Statistic Probability", "----------------------------------------------------------------------------", " CONSTANT 3.2602807 0.1246450 26.1565321 1.039377e-114", " hsu -0.0002086 0.0000134 -15.6037433 3.581137e-49", " pop_dens -0.0004243 0.0000988 -4.2944144 1.926527e-05", " white_rt -0.0034062 0.0012541 -2.7160276 0.006723808", " black_rt 0.0038674 0.0028898 1.3382790 0.1811172", " hisp_rt 0.0003542 0.0007873 0.4499107 0.6528747", " fem_nh_rt -0.0236822 0.0074328 -3.1861867 0.00148738", " renter_rt 0.0079963 0.0013002 6.1501153 1.127742e-09", " vac_hsu_rt 0.0088448 0.0012917 6.8475856 1.326764e-11", "----------------------------------------------------------------------------", "", "REGRESSION DIAGNOSTICS", "MULTICOLLINEARITY CONDITION NUMBER 45.873984", "TEST ON NORMALITY OF ERRORS", "TEST DF VALUE PROB", "Jarque-Bera 2 75.866541 0.0000000", "", "DIAGNOSTICS FOR HETEROSKEDASTICITY", "RANDOM COEFFICIENTS", "TEST DF VALUE PROB", "Breusch-Pagan test 8 31.433268 0.0001176", "Koenker-Bassett test 8 20.528622 0.0085108", "", "SPECIFICATION ROBUST TEST", "Not computed due to multicollinearity.", "", "DIAGNOSTICS FOR SPATIAL DEPENDENCE", "TEST MI/DF VALUE PROB", "Lagrange Multiplier (lag) 1 24.978573 0.0000006", "Robust LM (lag) 1 8.479697 0.0035913", "Lagrange Multiplier (error) 1 16.931274 0.0000388", "Robust LM (error) 1 0.432398 0.5108145", "Lagrange Multiplier (SARMA) 2 25.410972 0.0000030", "", "========================= END OF REPORT ==============================" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "source": [ "At first sight, it is not too bad: we are explaining over 30% of the variation in the dependent variable and the F test points to the overall joint significance of the coefficients. However, before we consider the coefficients, it is important to check that the model we have just run is correct and that OLS is a good estimation procedure for this dataset. There are several parts in this output report that should raise a red flag when we look at them. ", "", "In the first place, the `SPECIFICATION ROBUST TEST` (White test) section has not been computed due to multicollinearity. If we look at the multicollinearity condition number we see it is almost 46. This is high; in fact, it is beyond 30, which is the threshold at which `pysal.spreg` stops computing the White test. However, it is not terribly high (although the rule of thumb is 30, values below 100 may still be workable), so for the time being we will let it be.", "", "The assumption of normality of the residuals is one of the most important ones in linear regression when the size of your sample is limited. However, as `n` grows, this becomes less relevant thanks to the central limit theorem (CLT). Since we have almost 1000 observations, we assume we can rely on the CLT.", "", "Two other problems still remain, heteroskedasticity and spatial dependence, and it is on these that we will focus the most. Both heteroskedasticity diagnostics clearly point to the presence of the problem. One traditional way to tackle this issue is by estimating a robust VC matrix following White's procedure. Let us do just that before we get our hands dirty with any spatial issue. 
Since we have already seen them, let us skip the computation of the diagnostics in this run:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pysal.spreg import OLS", "model = OLS(y, x, w=w, name_x=x_names, spat_diag=False, nonspat_diag=False, robust='white')", "print model.summary" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "REGRESSION", "----------", "SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION", "----------------------------------------------------", "Data set : unknown", "Weights matrix : unknown", "Dependent Variable : dep_var Number of Observations: 985", "Mean dependent var : 2.7077 Number of Variables : 9", "S.D. dependent var : 0.3505 Degrees of Freedom : 976", "", "R-squared : 0.322025", "Adjusted R-squared : 0.3165", "", "White Standard Errors", "----------------------------------------------------------------------------", " Variable Coefficient Std.Error t-Statistic Probability", "----------------------------------------------------------------------------", " CONSTANT 3.2602807 0.1307919 24.9272298 1.5903e-106", " hsu -0.0002086 0.0000157 -13.2902064 3.621625e-37", " pop_dens -0.0004243 0.0001018 -4.1666504 3.365117e-05", " white_rt -0.0034062 0.0012197 -2.7926344 0.005330259", " black_rt 0.0038674 0.0025722 1.5035688 0.1330159", " hisp_rt 0.0003542 0.0006928 0.5113044 0.6092535", " fem_nh_rt -0.0236822 0.0097645 -2.4253461 0.01547431", " renter_rt 0.0079963 0.0015382 5.1983427 2.448857e-07", " vac_hsu_rt 0.0088448 0.0014780 5.9843749 3.047318e-09", "----------------------------------------------------------------------------", "========================= END OF REPORT ==============================" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "source": [ "We can see some changes in the standard errors of the coefficients, and one variable (`fem_nh_rt`) becomes insignificant at the 1% level." ] }, { "cell_type": "markdown", "source": [ "###Spatial diagnostics", "", "At this point, we are ready to consider the issue of spatial dependence and explicitly incorporate space into our model. In order to focus on it, we will print only the part of the regression output that concerns the spatial diagnostics. This is a somewhat hidden feature, but it can prove very useful in some contexts, for instance when you are considering several types of weights for the same model. Mind that this trick requires you to re-run the OLS every time, so it is not very efficient in terms of speed (for that, see the extra tricks section below). However, it comes in very handy for inspecting the results." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pysal.spreg.user_output import summary_spat_diag", "model = OLS(y, x, w=w, name_x=x_names, spat_diag=True)", "spd = summary_spat_diag(model, None, None)", "print spd" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "", "DIAGNOSTICS FOR SPATIAL DEPENDENCE", "TEST MI/DF VALUE PROB", "Lagrange Multiplier (lag) 1 24.978573 0.0000006", "Robust LM (lag) 1 8.479697 0.0035913", "Lagrange Multiplier (error) 1 16.931274 0.0000388", "Robust LM (error) 1 0.432398 0.5108145", "Lagrange Multiplier (SARMA) 2 25.410972 0.0000030", "" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "source": [ "The first thing we can observe is that the non-robust versions of the LM tests all point to a severe presence of spatial autocorrelation. 
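", "", "In fact, we do not even need the printed block: the same statistic and p-value pairs are exposed as attributes of the OLS object (`lm_lag`, `rlm_lag`, `lm_error`, `rlm_error` and `lm_sarma`, as listed in the `OLS` documentation reproduced towards the end of this notebook). A minimal sketch, assuming the `model` object from the previous cell is still in memory:", "", "    for name in ['lm_lag', 'rlm_lag', 'lm_error', 'rlm_error', 'lm_sarma']:", "        stat, pval = getattr(model, name)", "        print name, stat, pval", "", "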
The problem of having both tests (error and lag) rejecting the null is that they can be misleading in the presence of the other specification. In other words, if the true DGP contains, say, an error structure, the result of the LM-Lag is affected, and vice versa. For that reason, we have to turn to the robust versions of the LM tests which, as their name suggests, are robust to the presence of the other model. At this point, things become clearer: while the robust version of the LM-Lag is still significant, the error counterpart is not.", "", "###Spatial error models", "", "Before we move in the right direction, let us assume for a moment that we did not look at the robust versions and decided to try first one of the simplest spatial specifications: the error model. This is very easy to run in PySAL thanks to the method `GM_Error`, which implements the traditional Kelejian and Prucha (1998/99) papers:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pysal.spreg import GM_Error", "error_kp98 = GM_Error(y, x, w, name_x=x_names)", "print error_kp98.summary" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "REGRESSION", "----------", "SUMMARY OF OUTPUT: SPATIALLY WEIGHTED LEAST SQUARES ESTIMATION", "--------------------------------------------------------------", "Data set : unknown", "Weights matrix : unknown", "Dependent Variable : dep_var Number of Observations: 985", "Mean dependent var : 2.7077 Number of Variables : 9", "S.D. dependent var : 0.3505 Degrees of Freedom : 976", "", "Pseudo R-squared : 0.321335", "", "----------------------------------------------------------------------------", " Variable Coefficient Std.Error z-Statistic Probability", "----------------------------------------------------------------------------", " CONSTANT 3.2542359 0.1265294 25.7192144 7.127017e-146", " hsu -0.0002062 0.0000133 -15.4802880 4.713582e-54", " pop_dens -0.0003921 0.0001024 -3.8280762 0.0001291488", " white_rt -0.0032943 0.0012636 -2.6069925 0.009134137", " black_rt 0.0042299 0.0030457 1.3887930 0.1648957", " hisp_rt 0.0004473 0.0008225 0.5437729 0.5865978", " fem_nh_rt -0.0223199 0.0077180 -2.8919223 0.003828926", " renter_rt 0.0069856 0.0013690 5.1026101 3.350008e-07", " vac_hsu_rt 0.0083237 0.0013581 6.1288540 8.851432e-10", " lambda 0.2009658 ", "----------------------------------------------------------------------------", "========================= END OF REPORT ==============================" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "source": [ "Note there is a change in the output print with respect to the OLS. Instead of R squared, we now have the *pseudo* R", "squared. This is simply the squared correlation between the dependent variable (`y`) and the predicted values of our", "model (we verify this from the model attributes at the end of this note). Although it is not strictly comparable to the traditional one, it is still a good guide to *how well* our model", "performs.", "", "One inconvenience of this procedure is that it only obtains a point estimate of the spatial parameter ", "and hence does not allow inference on it. In a recent paper, Drukker et al. (2010) present an improvement ", "on the technique that allows one to obtain inference even on the spatial parameter. PySAL implements it in the ", "method `GM_Error_Hom`, where the `Hom` is named after the assumption of homoskedasticity made in the model."
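, "", "Before we run it, a quick aside on the *pseudo* R-squared discussed above: it can be recomputed by hand from the attributes of `error_kp98`. This is a minimal sketch, assuming the error model exposes its predicted values as `predy` (as the `OLS` class documented later in this notebook does):", "", "    r = np.corrcoef(y.flatten(), error_kp98.predy.flatten())[0, 1]", "    print r**2  #if so, this should be close to the 0.3213 reported above"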
] }, { "cell_type": "code", "collapsed": false, "input": [ "from pysal.spreg import GM_Error_Hom", "error_hom = GM_Error_Hom(y, x, w, name_x=x_names)", "print error_hom.summary" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "REGRESSION", "----------", "SUMMARY OF OUTPUT: SPATIALLY WEIGHTED LEAST SQUARES (HOM) ESTIMATION", "--------------------------------------------------------------------", "Data set : unknown", "Weights matrix : unknown", "Dependent Variable : dep_var Number of Observations: 985", "Mean dependent var : 2.7077 Number of Variables : 9", "S.D. dependent var : 0.3505 Degrees of Freedom : 976", "", "Pseudo R-squared : 0.321329", "", "----------------------------------------------------------------------------", " Variable Coefficient Std.Error z-Statistic Probability", "----------------------------------------------------------------------------", " CONSTANT 3.2541994 0.1267546 25.6732252 2.32761e-145", " hsu -0.0002062 0.0000133 -15.4782228 4.867341e-54", " pop_dens -0.0003919 0.0001028 -3.8131977 0.0001371803", " white_rt -0.0032937 0.0012651 -2.6034443 0.009229225", " black_rt 0.0042317 0.0030615 1.3822208 0.1669039", " hisp_rt 0.0004479 0.0008266 0.5418071 0.5879514", " fem_nh_rt -0.0223135 0.0077447 -2.8811329 0.003962486", " renter_rt 0.0069811 0.0013755 5.0753519 3.867796e-07", " vac_hsu_rt 0.0083214 0.0013643 6.0993920 1.064727e-09", " lambda 0.2176842 0.0463056 4.7010301 2.588524e-06", "----------------------------------------------------------------------------", "========================= END OF REPORT ==============================" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "source": [ "At this point, we can evaluate the significance of `lambda`. ", "Apparently in contrast with what we would expect from the LM tests on spatial autocorrelation, the coefficient appears ", "highly significant. We will provide an explanation for this below when we delve into more complicated models but, before", "that, let us present a third estimation procedure implemented in PySAL for a spatial autoregressive error process. If we", "remember from the OLS output print, there were significant signs of heteroskedasticity, and the error models we have fit", "so far assume homoskedasticity. In a very recent reference, Arraiz et al. (2010) propose a procedure to estimate error models", "that is consistent to the presence of heteroskedasticity. PySAL implements this in the `spreg` module `error_sp_het` and it is", "as simple to use as the rest we have seen:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pysal.spreg import GM_Error_Het", "error_het = GM_Error_Het(y, x, w, name_x=x_names)", "print error_het.summary" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "REGRESSION", "----------", "SUMMARY OF OUTPUT: SPATIALLY WEIGHTED LEAST SQUARES (HET) ESTIMATION", "--------------------------------------------------------------------", "Data set : unknown", "Weights matrix : unknown", "Dependent Variable : dep_var Number of Observations: 985", "Mean dependent var : 2.7077 Number of Variables : 9", "S.D. 
dependent var : 0.3505 Degrees of Freedom : 976", "", "Pseudo R-squared : 0.321329", "", "Heteroskedastic Corrected Standard Errors", "----------------------------------------------------------------------------", " Variable Coefficient Std.Error z-Statistic Probability", "----------------------------------------------------------------------------", " CONSTANT 3.2541994 0.1399596 23.2509897 1.390113e-119", " hsu -0.0002062 0.0000158 -13.0265120 8.647101e-39", " pop_dens -0.0003919 0.0001034 -3.7907494 0.0001501934", " white_rt -0.0032937 0.0012963 -2.5408348 0.01105882", " black_rt 0.0042317 0.0027850 1.5194830 0.128641", " hisp_rt 0.0004479 0.0007325 0.6113715 0.5409537", " fem_nh_rt -0.0223135 0.0104736 -2.1304427 0.03313508", " renter_rt 0.0069811 0.0016474 4.2376742 2.258471e-05", " vac_hsu_rt 0.0083214 0.0015693 5.3027086 1.14097e-07", " lambda 0.2142769 0.0469120 4.5676366 4.932542e-06", "----------------------------------------------------------------------------", "========================= END OF REPORT ==============================" ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "source": [ "This procedure only differs from `GM_Error_Hom` in the moments and VC matrix, so all the coefficients except for", "the spatial parameter remain unchanged. The standard errors however slightly change.", "", "###Spatial lag model", "", "Imagine that, after examining the LM tests, we had decided for a lag model instead of the error. ", "Running a spatial lag model is equally simple in PySAL thanks to the method GM_Lag. Note how, in this case, ", "`w` has to be referenced. This is because we might be passing in additional instruments and hence it is not ", "clear that the third place is for the weights object." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pysal.spreg import GM_Lag", "lag_model = GM_Lag(y, x, w=w, name_x=x_names, spat_diag=True)", "print lag_model.summary" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "REGRESSION", "----------", "SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES ESTIMATION", "-------------------------------------------------------------", "Data set : unknown", "Weights matrix : unknown", "Dependent Variable : dep_var Number of Observations: 985", "Mean dependent var : 2.7077 Number of Variables : 10", "S.D. 
dependent var : 0.3505 Degrees of Freedom : 975", "", "Pseudo R-squared : 0.341497", "Spatial Pseudo R-squared: 0.328653", "", "----------------------------------------------------------------------------", " Variable Coefficient Std.Error z-Statistic Probability", "----------------------------------------------------------------------------", " CONSTANT 2.4970736 0.2632379 9.4859966 2.400779e-21", " hsu -0.0002039 0.0000132 -15.4584424 6.617946e-54", " pop_dens -0.0003663 0.0000985 -3.7180632 0.0002007561", " white_rt -0.0028577 0.0012417 -2.3015501 0.02136056", " black_rt 0.0026414 0.0028596 0.9236830 0.3556514", " hisp_rt -0.0001681 0.0007887 -0.2131477 0.8312118", " fem_nh_rt -0.0202109 0.0073684 -2.7429067 0.006089798", " renter_rt 0.0068149 0.0013256 5.1410192 2.732521e-07", " vac_hsu_rt 0.0078722 0.0013015 6.0484365 1.462583e-09", " W_dep_var 0.2740141 0.0836946 3.2739746 0.001060461", "----------------------------------------------------------------------------", "Instruments: W_hsu, W_pop_dens, W_white_rt, W_black_rt, W_hisp_rt,", " W_fem_nh_rt, W_renter_rt, W_vac_hsu_rt", "", "DIAGNOSTICS FOR SPATIAL DEPENDENCE", "TEST MI/DF VALUE PROB", "Anselin-Kelejian Test 1 1.209171 0.2714964", "========================= END OF REPORT ==============================" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "source": [ "As we can see, the parameter for the lag of the dependent variable is positive and significant, similar to the error case.", "A new addition to the output print is the *spatial pseudo R squared*. In the lag model, we can obtain two different predicted ", "values: the *naive* ones and the complete ones. The former, used for the pseudo R squared, include only the first lag of the ", "dependent variable to predict the value of `y`; the latter, used in the spatial pseudo R squared, employs the reduced form of", "the model, thus incorporating all the feedback effects implicit in the model and offering a more correct estimate. In either case,", "we can see how the lag model seems to do a better job with this dataset.", "", "Note also that we asked for spatial diagnostics of the residuals. PySAL implements the Anselin-Kelejian (AK) test for residuals of ", "an IV estimation. The test is a modification of the LM-Error and is used to check for remaining spatial autocorrelation in the ", "residuals of a regresion with instruments. Consistent with the initial LM tests from above, the AK does not find any significant", "spatial correlation, pointing again to the lag model as the preferred one.", "", "The estimation procedure that `GM_Lag` employs is a particular case of the traditional Instrumental Variables (IV) approach,", "in which the endogeneity of the spatial lag of the variable is dealt with by using instruments. Kelejian and Prucha (1998-99)", "show that the optimal instruments for a lag model are the spatial lag of the exogenous variables, and `GM_Lag` follows their", "guidance. A less settled debate is how many orders of lags are optimal: just the first one (`WX`)? two (`WX` and `WWX`)?", "Very much like GeoDaSpace, PySAL uses only the first lag by default, but it allows you to modify it if you wish. 
Suppose we", "preferred to include two lags, the call would then be:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "lag_model2lags = GM_Lag(y, x, w=w, name_x=x_names, w_lags=2)", "print lag_model2lags.summary" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "REGRESSION", "----------", "SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES ESTIMATION", "-------------------------------------------------------------", "Data set : unknown", "Weights matrix : unknown", "Dependent Variable : dep_var Number of Observations: 985", "Mean dependent var : 2.7077 Number of Variables : 10", "S.D. dependent var : 0.3505 Degrees of Freedom : 975", "", "Pseudo R-squared : 0.341153", "Spatial Pseudo R-squared: 0.328877", "", "----------------------------------------------------------------------------", " Variable Coefficient Std.Error z-Statistic Probability", "----------------------------------------------------------------------------", " CONSTANT 2.3211310 0.2579456 8.9985291 2.287619e-19", " hsu -0.0002028 0.0000132 -15.3771109 2.331201e-53", " pop_dens -0.0003530 0.0000985 -3.5844272 0.0003378187", " white_rt -0.0027313 0.0012414 -2.2001304 0.02779765", " black_rt 0.0023588 0.0028592 0.8249764 0.409385", " hisp_rt -0.0002885 0.0007881 -0.3661051 0.7142867", " fem_nh_rt -0.0194107 0.0073666 -2.6349576 0.008414779", " renter_rt 0.0065425 0.0013234 4.9435842 7.669921e-07", " vac_hsu_rt 0.0076479 0.0013001 5.8823768 4.044165e-09", " W_dep_var 0.3371827 0.0815362 4.1353753 3.54375e-05", "----------------------------------------------------------------------------", "Instruments: W_hsu, W_pop_dens, W_white_rt, W_black_rt, W_hisp_rt,", " W_fem_nh_rt, W_renter_rt, W_vac_hsu_rt, W2_hsu, W2_pop_dens,", " W2_white_rt, W2_black_rt, W2_hisp_rt, W2_fem_nh_rt,", " W2_renter_rt, W2_vac_hsu_rt", "========================= END OF REPORT ==============================" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "source": [ "As mentioned before, this procedure is a particular case of IV estimation where the instruments are spatial. ", "To demonstrate how they are created, we will replicate what `GM_Lag` builds internally and will pass it to the generic two", "stages least squares method in `spreg`. This will allow us", "to discover some more functionality in PySAL and, at the same time, to check that we actually obtain the same numbers as with", "the default.", "", "We will try to match the default settings, so we will only need to compute the first lag of `X`. PySAL has a generic function", "(`lag_spatial`) that will do the heavy lifting in a very efficient way for us, and it works both with single vectors or with matrices:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pysal import lag_spatial", "from pysal.spreg import TSLS", "wy = lag_spatial(w, y) #Get the spatial lag of y", "wx = lag_spatial(w, x) #Get the lag of x to use as instruments", "iv_model = TSLS(y, x, yend=wy, q=wx, name_x=x_names)", "print iv_model.summary" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "REGRESSION", "----------", "SUMMARY OF OUTPUT: TWO STAGE LEAST SQUARES ESTIMATION", "-----------------------------------------------------", "Data set : unknown", "Dependent Variable : dep_var Number of Observations: 985", "Mean dependent var : 2.7077 Number of Variables : 10", "S.D. 
dependent var : 0.3505 Degrees of Freedom : 975", "", "Pseudo R-squared : 0.341497", "", "----------------------------------------------------------------------------", " Variable Coefficient Std.Error z-Statistic Probability", "----------------------------------------------------------------------------", " CONSTANT 2.4970736 0.2632379 9.4859966 2.400779e-21", " hsu -0.0002039 0.0000132 -15.4584424 6.617946e-54", " pop_dens -0.0003663 0.0000985 -3.7180632 0.0002007561", " white_rt -0.0028577 0.0012417 -2.3015501 0.02136056", " black_rt 0.0026414 0.0028596 0.9236830 0.3556514", " hisp_rt -0.0001681 0.0007887 -0.2131477 0.8312118", " fem_nh_rt -0.0202109 0.0073684 -2.7429067 0.006089798", " renter_rt 0.0068149 0.0013256 5.1410192 2.732521e-07", " vac_hsu_rt 0.0078722 0.0013015 6.0484365 1.462583e-09", "endogenous_1 0.2740141 0.0836946 3.2739746 0.001060461", "----------------------------------------------------------------------------", "Instruments: instrument_1, instrument_2, instrument_3, instrument_4,", " instrument_5, instrument_6, instrument_7, instrument_8", "========================= END OF REPORT ==============================" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "source": [ "As you can see for yourself, the results of `iv_model` exactly match those from `lag_model`.", "", "###Full models", "", "At this point, we have run a non-spatial model and found significant evidence of spatial autocorrelation; ", "we have then run lag and error models separately to try account for it. Although the robust LM tests pointed to a", "lag model, we found significant the `lambda` coefficient in the error model, so it would not be unrealistic to", "think that a full model with both spatial lag and error would be a good fit. In this section we will show what ", "options are available in PySAL to construct more complicated models that include spatial effects in the dependent", "variable as well as in the error term. In particular, we will use two main specifications: a SARAR model that models", "autoregressive terms in both the lag and the error; and a combination of a lag model with the Spatial Heteroskedasticity", "and Autocorrelation Consistent (SHAC) variance covariance matrix.", "", "Our first option is to fit a model that introduces autoregressive parameters for the lag and the error. PySAL offers ", "combinations of the lag model we have seen with all the error models presented before. For the sake of simplicity, and", "without loss of generality, we will stick to our last choice (`GM_Error_Het`) to blend it with the lag, since it offers", "the possibility of inference in the spatial parameter and accounts for heteroskedasticity, which we have present in the", "dataset. These models are called `Combo` in PySAL, to allude to the fact they combine a lag and an error." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pysal.spreg import GM_Combo_Het", "sarar_het = GM_Combo_Het(y, x, w=w, name_x=x_names)", "print sarar_het.summary" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "REGRESSION", "----------", "SUMMARY OF OUTPUT: SPATIALLY WEIGHTED TWO STAGE LEAST SQUARES (HET) ESTIMATION", "------------------------------------------------------------------------------", "Data set : unknown", "Weights matrix : unknown", "Dependent Variable : dep_var Number of Observations: 985", "Mean dependent var : 2.7077 Number of Variables : 10", "S.D. 
dependent var : 0.3505 Degrees of Freedom : 975", "", "Pseudo R-squared : 0.341478", "Spatial Pseudo R-squared: 0.329460", "", "Heteroskedastic Corrected Standard Errors", "----------------------------------------------------------------------------", " Variable Coefficient Std.Error z-Statistic Probability", "----------------------------------------------------------------------------", " CONSTANT 2.4533311 0.2807015 8.7399985 2.331135e-18", " hsu -0.0002023 0.0000153 -13.2302169 5.871765e-40", " pop_dens -0.0003740 0.0000979 -3.8197206 0.0001336029", " white_rt -0.0028910 0.0011420 -2.5315327 0.01135652", " black_rt 0.0023357 0.0022767 1.0259095 0.3049342", " hisp_rt -0.0002446 0.0006538 -0.3741339 0.7083047", " fem_nh_rt -0.0202743 0.0090433 -2.2419124 0.02496704", " renter_rt 0.0070361 0.0015542 4.5270082 5.982462e-06", " vac_hsu_rt 0.0079395 0.0013681 5.8031888 6.506547e-09", " W_dep_var 0.2903322 0.0888340 3.2682557 0.001082126", " lambda -0.1720565 0.1272811 -1.3517838 0.1764445", "----------------------------------------------------------------------------", "Instruments: W_hsu, W_pop_dens, W_white_rt, W_black_rt, W_hisp_rt,", " W_fem_nh_rt, W_renter_rt, W_vac_hsu_rt", "========================= END OF REPORT ==============================" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "source": [ "Note how, once the spatial lag is introduced, the spatial parameter in the error term becomes insignificant.", "We can now close the argument we opened with the LM tests and that the error model seemed to contradict:", "the best specification for this dataset appears to be a lag model; however, if a spatial error parameter is ", "introduced instead of a lag parameter, part of the spatial effects in the model are pushed to the error and", "picked up by `lambda`, which becomes significant in the absence of a better structure to model space (i.e. the lag).", "When such a preferable structure is introduced and the spatial effects are modelled properly, the error shows no", "remaining spatial autocorrelation and the parameter becomes insignificant.", "", "The problem with the straight lag model is that it does not account for heteroskedasticity, which is a problem with", "our data. To solve the issue, we can use the White correction in the same way we did with OLS:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "lag_white = GM_Lag(y, x, w=w, name_x=x_names, robust='white')", "print lag_white.summary" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "REGRESSION", "----------", "SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES ESTIMATION", "-------------------------------------------------------------", "Data set : unknown", "Weights matrix : unknown", "Dependent Variable : dep_var Number of Observations: 985", "Mean dependent var : 2.7077 Number of Variables : 10", "S.D. 
dependent var : 0.3505 Degrees of Freedom : 975", "", "Pseudo R-squared : 0.341497", "Spatial Pseudo R-squared: 0.328653", "", "White Standard Errors", "----------------------------------------------------------------------------", " Variable Coefficient Std.Error z-Statistic Probability", "----------------------------------------------------------------------------", " CONSTANT 2.4970736 0.2988176 8.3565147 6.45984e-17", " hsu -0.0002039 0.0000155 -13.1234882 2.415553e-39", " pop_dens -0.0003663 0.0001009 -3.6291256 0.0002843828", " white_rt -0.0028577 0.0012154 -2.3512323 0.01871135", " black_rt 0.0026414 0.0024471 1.0793995 0.2804097", " hisp_rt -0.0001681 0.0006862 -0.2449773 0.806474", " fem_nh_rt -0.0202109 0.0097566 -2.0715185 0.03831037", " renter_rt 0.0068149 0.0016322 4.1751404 2.978023e-05", " vac_hsu_rt 0.0078722 0.0014515 5.4232996 5.85088e-08", " W_dep_var 0.2740141 0.0941695 2.9097969 0.003616637", "----------------------------------------------------------------------------", "Instruments: W_hsu, W_pop_dens, W_white_rt, W_black_rt, W_hisp_rt,", " W_fem_nh_rt, W_renter_rt, W_vac_hsu_rt", "========================= END OF REPORT ==============================" ] } ], "prompt_number": 24 }, { "cell_type": "markdown", "source": [ "Since White is a VC matrix correction, only the standard errors will be affected. Compared to the initial lag model,", "standard errors are in general larger, which results in `fem_nh_rt` becoming insignificant at the 1% level.", "", "Now that we have our best model, we can compare the coefficients with the ones obtained previously to assess how", "the spatial effects present in the data influence our estimates and conclusions. If we look at the estimates of `lag_white`", "in comparison with either the OLS or error models, we find they tend to be smaller in the former. This means that the", "effect of space makes models that do not account for it overestimate the importance of our regressors in explaining ", "the dependent variable. Just by looking at", "the raw numbers, it might seem like the changes are not that large, but keep in mind the scale of the variables and the fact ", "that our dependent variable is expressed in logs. Once you take both into account, and depending on the purpose of the model, the", "changes may not be that small after all.", "", "As a final note for completeness, we will showcase one more model available in PySAL. As mentioned before, Arraiz et al. (2010)", "present a nonparametric procedure to estimate the VC matrix that accounts for both heteroskedasticity *and* spatial ", "autocorrelation. Since it is a correction of the VC matrix and hence only affects the standard errors, we can use it with", "either non-spatial models or with a spatial lag model. The call we need to make is not very different from that for White,", "but before we can do that we need to create a kernel weights object, for which we will use 10 neighbors and, for this example, ", "a triangular kernel function."
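, "", "The kernel weights are read below from a file that was prepared beforehand (`phx_k_triangular.kwt`). If you only had the geometries, a similar object could be built directly with PySAL's kernel weights constructors. The following is a hedged sketch rather than part of the original workflow: it assumes a shapefile at `../workshop_data/phx.shp` (a hypothetical path) and uses `ps.adaptive_kernelW_from_shapefile` with the number of nearest neighbors (`k`) and the kernel function as arguments:", "", "    #Hypothetical alternative to reading the pre-built .kwt file", "    wk = ps.adaptive_kernelW_from_shapefile('../workshop_data/phx.shp',", "                                            k=10, function='triangular')"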
] }, { "cell_type": "code", "collapsed": false, "input": [ "wk = ps.open('../workshop_data/phx_k_triangular.kwt', dataFormat='gwt').read()", "lag_hac = GM_Lag(y, x, w=w, name_x=x_names, robust='hac', gwk=wk)", "print lag_hac.summary" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "REGRESSION", "----------", "SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES ESTIMATION", "-------------------------------------------------------------", "Data set : unknown", "Weights matrix : unknown", "Dependent Variable : dep_var Number of Observations: 985", "Mean dependent var : 2.7077 Number of Variables : 10", "S.D. dependent var : 0.3505 Degrees of Freedom : 975", "", "Pseudo R-squared : 0.341497", "Spatial Pseudo R-squared: 0.328653", "", "HAC Standard Errors; Kernel Weights: unknown", "----------------------------------------------------------------------------", " Variable Coefficient Std.Error z-Statistic Probability", "----------------------------------------------------------------------------", " CONSTANT 2.4970736 0.3072108 8.1282081 4.356827e-16", " hsu -0.0002039 0.0000163 -12.5274161 5.285681e-36", " pop_dens -0.0003663 0.0001001 -3.6590947 0.0002531078", " white_rt -0.0028577 0.0011902 -2.4010425 0.01634844", " black_rt 0.0026414 0.0023127 1.1421190 0.2534045", " hisp_rt -0.0001681 0.0006744 -0.2492576 0.8031616", " fem_nh_rt -0.0202109 0.0095914 -2.1071828 0.03510174", " renter_rt 0.0068149 0.0016346 4.1692471 3.056075e-05", " vac_hsu_rt 0.0078722 0.0014090 5.5872065 2.307513e-08", " W_dep_var 0.2740141 0.0978080 2.8015507 0.005085764", "----------------------------------------------------------------------------", "Instruments: W_hsu, W_pop_dens, W_white_rt, W_black_rt, W_hisp_rt,", " W_fem_nh_rt, W_renter_rt, W_vac_hsu_rt", "========================= END OF REPORT ==============================" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "source": [ "As it can be seen, the standard errors do not differ much from those obtained in `lag_white`. This is because", "the SHAC does as good of a job as the White at correcting for heteroskedasticity and, since residual spatial ", "autocorrelation is not an issue once the spatial lag is included, there is very little room left for the ", "SHAC to improve the White.", "", "### Non-spatial endogenous variables", "In order to complete the overview of `pysal.spreg`, we have one more substantive feature to show. Each of the models", "we have just used (non-spatial, spatial error, lag and combo) allow the user to pass not only exogenous regressors,", "but also endogenous variables. This is implemented again using IV estimation, so any of the models we have just seen", "may be turned into a two-stage least squares procedure in which you can allow some of you explanatory variables to be", "endogenous and use other variables to instrument for them.", "", "As a mere example, we will consider now that the variable population density (`pop_dens`) is endogenous and that we", "want to instrument for it using the inverse of land area. This means we have to re-factor our matrix `x` and create two", "new ones (which in this case will be of dimension n by 1): one for endogenous variables (`yend`) and one for the", "instruments (`q`). We also can use this example to show a bit more how to manipulate data in a *pythonic* way. 
It takes", "a bit of house-keeping, but it is not complicated:" ] }, { "cell_type": "code", "collapsed": true, "input": [ "#Pull out the X variables without including population density", "x_names = ['hsu', 'white_rt', 'black_rt', 'hisp_rt', \\", " 'fem_nh_rt', 'renter_rt', 'vac_hsu_rt']", "x = np.array([db.by_col(var) for var in x_names]).T", "#Create inverse of land area", "area = np.array(db.by_col('ALAND10')).reshape((x.shape[0], 1))", "inv_area = 1. / area", "#Create the matrix for the instruments", "q_names = ['inv_area']", "q = inv_area", "#Create the matrix for the endogenous variable", "yend_names = ['pop_dens']", "yend = np.array([db.by_col(var) for var in yend_names]).T" ], "language": "python", "outputs": [], "prompt_number": 26 }, { "cell_type": "markdown", "source": [ "Once we have all the pieces ready, we can run the models very much like we have done before. Since the main", "purpose now is to show you how to call the functions when you have endogenous variables, we will not print", "the summary outputs to save space. Also, in order not to import every method one by one, we will call them", "directly from `pysal.spreg`, showing you yet another way to access the code (note that some of the models include", "now the name `Endog` for endogeneity)." ] }, { "cell_type": "code", "collapsed": true, "input": [ "#Non-spatial model (w only required for spatial diagnostics", "model = ps.spreg.TSLS(y, x, w=w, yend=yend, q=q, \\", " name_x=x_names, name_yend=yend_names, name_q=q_names)", "#KP98-99 Error model", "model = ps.spreg.GM_Endog_Error(y, x, w=w, yend=yend, q=q, \\", " name_x=x_names, name_yend=yend_names, name_q=q_names)", "#Drukker et al. error model (Hom)", "model = ps.spreg.GM_Endog_Error_Hom(y, x, w=w, yend=yend, q=q, \\", " name_x=x_names, name_yend=yend_names, name_q=q_names)", "#Arraiz. et al. error model (Het)", "model = ps.spreg.GM_Endog_Error_Het(y, x, w=w, yend=yend, q=q, \\", " name_x=x_names, name_yend=yend_names, name_q=q_names)", "#Lag model", "model = ps.spreg.GM_Lag(y, x, w=w, yend=yend, q=q, \\", " name_x=x_names, name_yend=yend_names, name_q=q_names)", "#Combo model", "model = ps.spreg.GM_Combo_Het(y, x, w=w, yend=yend, q=q, \\", " name_x=x_names, name_yend=yend_names, name_q=q_names)" ], "language": "python", "outputs": [], "prompt_number": 27 }, { "cell_type": "markdown", "source": [ "## GeoDaSpace Vs. pysal.spreg", "", "As we have just shown, pysal.spreg can do everything that GeoDaSpace does; in fact, all the core functionality ", "from GeoDaSpace is actually powered by the same code. At this point, you might be wondering why then make the ", "investment to learn it this way, when you can click through the dialogs in GeoDaSpace. In this section, we will ", "discuss some of the features that you can enjoy only if you use the command line and will demonstrate it using ", "the spatial diagnostics we saw before as an example.", "", "### Extra parameters and tweaks", "", "GeoDaSpace aims at being a simple, easy to use interface that puts all the advanced spatial econometrics ", "implemented in PySAL one click away. This is partly done at the cost of setting some pysal.spreg options ", "to reasonable defaults. In cases when we do need to change those defaults because the problem at hand requires", "some customization of estimation procedures, the command line is the way to go. 
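", "", "For instance, one such default is `sig2n_k` in `OLS`, which controls whether the residual variance is estimated with `n-k` or with `n` in the denominator (it is documented in the help output reproduced below). A hedged one-liner, reusing objects already defined in this notebook:", "", "    ols_n = ps.spreg.OLS(y, x, sig2n_k=False, nonspat_diag=False)  #use n instead of n-k", "", "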
In this document, we will not ", "detail one by one all these options; instead, we will review how they can be found out and accessed.", "", "PySAL takes documentation very seriously, and the spreg module is no exception. The inline help provided is ", "so good that, with some basic Python knowledge at hand, it allows to explore all the functionality from the ", "command line, right in place. The documentation has a more or less fixed structure that repeats throughout the ", "modules:", "", "* Basic description of the functionality. ", "* Parameters to be passed to the method. ", "* Attributes / output returned by the method. ", "* Examples of how to use the code. ", "* References to works cited before.", "", "Accessing this information will detail every parameter and tweak that the code allows and it will tell us ", "how to use it, so it is the best way to discover whether the customization we are after is implemented in ", "PySAL. There are several ways to access the documentation, but two are the most preferred ones:", "", "* Inline help from the Python interpreter: simply type `help(pysal.function)` and the documentation for ", "`pysal.function` will be displayed. Let us exemplify it with OLS:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "help(ps.spreg.OLS)" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Help on class OLS in module pysal.spreg.ols:", "", "class OLS(BaseOLS, pysal.spreg.user_output.DiagnosticBuilder)", " | Ordinary least squares with results and diagnostics.", " | ", " | Parameters", " | ----------", " | y : array", " | nx1 array for dependent variable", " | x : array", " | Two dimensional array with n rows and one column for each", " | independent (exogenous) variable, excluding the constant", " | w : pysal W object", " | Spatial weights object (required if running spatial", " | diagnostics)", " | robust : string", " | If 'white', then a White consistent estimator of the", " | variance-covariance matrix is given. If 'hac', then a", " | HAC consistent estimator of the variance-covariance", " | matrix is given. Default set to None. ", " | gwk : pysal W object", " | Kernel spatial weights needed for HAC estimation. Note:", " | matrix must have ones along the main diagonal.", " | sig2n_k : boolean", " | If True, then use n-k to estimate sigma^2. If False, use n.", " | nonspat_diag : boolean", " | If True, then compute non-spatial diagnostics on", " | the regression.", " | spat_diag : boolean", " | If True, then compute Lagrange multiplier tests (requires", " | w). Note: see moran for further tests.", " | moran : boolean", " | If True, compute Moran's I on the residuals. 
Note:", " | requires spat_diag=True.", " | vm : boolean", " | If True, include variance-covariance matrix in summary", " | results", " | name_y : string", " | Name of dependent variable for use in output", " | name_x : list of strings", " | Names of independent variables for use in output", " | name_w : string", " | Name of weights matrix for use in output", " | name_gwk : string", " | Name of kernel weights matrix for use in output", " | name_ds : string", " | Name of dataset for use in output", " | ", " | ", " | Attributes", " | ----------", " | summary : string", " | Summary of regression results and diagnostics (note: use in", " | conjunction with the print command)", " | betas : array", " | kx1 array of estimated coefficients", " | u : array", " | nx1 array of residuals", " | predy : array", " | nx1 array of predicted y values", " | n : integer", " | Number of observations", " | k : integer", " | Number of variables for which coefficients are estimated", " | (including the constant)", " | y : array", " | nx1 array for dependent variable", " | x : array", " | Two dimensional array with n rows and one column for each", " | independent (exogenous) variable, including the constant", " | robust : string", " | Adjustment for robust standard errors", " | mean_y : float", " | Mean of dependent variable", " | std_y : float", " | Standard deviation of dependent variable", " | vm : array", " | Variance covariance matrix (kxk)", " | r2 : float", " | R squared", " | ar2 : float", " | Adjusted R squared", " | utu : float", " | Sum of squared residuals", " | sig2 : float", " | Sigma squared used in computations", " | sig2ML : float", " | Sigma squared (maximum likelihood)", " | f_stat : tuple", " | Statistic (float), p-value (float)", " | logll : float", " | Log likelihood", " | aic : float", " | Akaike information criterion ", " | schwarz : float", " | Schwarz information criterion ", " | std_err : array", " | 1xk array of standard errors of the betas ", " | t_stat : list of tuples", " | t statistic; each tuple contains the pair (statistic,", " | p-value), where each is a float", " | mulColli : float", " | Multicollinearity condition number", " | jarque_bera : dictionary", " | 'jb': Jarque-Bera statistic (float); 'pvalue': p-value", " | (float); 'df': degrees of freedom (int) ", " | breusch_pagan : dictionary", " | 'bp': Breusch-Pagan statistic (float); 'pvalue': p-value", " | (float); 'df': degrees of freedom (int) ", " | koenker_bassett : dictionary", " | 'kb': Koenker-Bassett statistic (float); 'pvalue':", " | p-value (float); 'df': degrees of freedom (int) ", " | white : dictionary", " | 'wh': White statistic (float); 'pvalue': p-value (float);", " | 'df': degrees of freedom (int) ", " | lm_error : tuple", " | Lagrange multiplier test for spatial error model; tuple", " | contains the pair (statistic, p-value), where each is a", " | float ", " | lm_lag : tuple", " | Lagrange multiplier test for spatial lag model; tuple", " | contains the pair (statistic, p-value), where each is a", " | float ", " | rlm_error : tuple", " | Robust lagrange multiplier test for spatial error model;", " | tuple contains the pair (statistic, p-value), where each", " | is a float", " | rlm_lag : tuple", " | Robust lagrange multiplier test for spatial lag model;", " | tuple contains the pair (statistic, p-value), where each", " | is a float", " | lm_sarma : tuple", " | Lagrange multiplier test for spatial SARMA model; tuple", " | contains the pair (statistic, p-value), where each is a", " | float", " | moran_res : 
tuple", " | Moran's I for the residuals; tuple containing the triple", " | (Moran's I, standardized Moran's I, p-value)", " | name_y : string", " | Name of dependent variable for use in output", " | name_x : list of strings", " | Names of independent variables for use in output", " | name_w : string", " | Name of weights matrix for use in output", " | name_gwk : string", " | Name of kernel weights matrix for use in output", " | name_ds : string", " | Name of dataset for use in output", " | title : string", " | Name of the regression method used", " | sig2n : float", " | Sigma squared (computed with n in the denominator)", " | sig2n_k : float", " | Sigma squared (computed with n-k in the denominator)", " | xtx : float", " | X'X", " | xtxi : float", " | (X'X)^-1", " | ", " | ", " | Examples", " | --------", " | >>> import numpy as np", " | >>> import pysal", " | ", " | Open data on Columbus neighborhood crime (49 areas) using pysal.open().", " | This is the DBF associated with the Columbus shapefile. Note that", " | pysal.open() also reads data in CSV format; also, the actual OLS class", " | requires data to be passed in as numpy arrays so the user can read their", " | data in using any method. ", " | ", " | >>> db = pysal.open(pysal.examples.get_path('columbus.dbf'),'r')", " | ", " | Extract the HOVAL column (home values) from the DBF file and make it the", " | dependent variable for the regression. Note that PySAL requires this to be", " | an nx1 numpy array.", " | ", " | >>> hoval = db.by_col(\"HOVAL\")", " | >>> y = np.array(hoval)", " | >>> y.shape = (len(hoval), 1)", " | ", " | Extract CRIME (crime) and INC (income) vectors from the DBF to be used as", " | independent variables in the regression. Note that PySAL requires this to", " | be an nxj numpy array, where j is the number of independent variables (not", " | including a constant). pysal.spreg.OLS adds a vector of ones to the", " | independent variables passed in.", " | ", " | >>> X = []", " | >>> X.append(db.by_col(\"INC\"))", " | >>> X.append(db.by_col(\"CRIME\"))", " | >>> X = np.array(X).T", " | ", " | The minimum parameters needed to run an ordinary least squares regression", " | are the two numpy arrays containing the independent variable and dependent", " | variables respectively. To make the printed results more meaningful, the", " | user can pass in explicit names for the variables used; this is optional.", " | ", " | >>> ols = OLS(y, X, name_y='home value', name_x=['income','crime'], name_ds='columbus')", " | ", " | pysal.spreg.OLS computes the regression coefficients and their standard", " | errors, t-stats and p-values. It also computes a large battery of", " | diagnostics on the regression. All of these results can be independently", " | accessed as attributes of the regression object created by running", " | pysal.spreg.OLS. They can also be accessed at one time by printing the", " | summary attribute of the regression object. 
In the example below, the", " | parameter on crime is -0.4849, with a t-statistic of -2.6544 and p-value", " | of 0.01087.", " | ", " | >>> ols.betas", " | array([[ 46.42818268],", " | [ 0.62898397],", " | [ -0.48488854]])", " | >>> print ols.t_stat[2][0]", " | -2.65440864272", " | >>> print ols.t_stat[2][1]", " | 0.0108745049098", " | >>> ols.r2", " | 0.34951437785126105", " | >>> print ols.summary", " | REGRESSION", " | ----------", " | SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION", " | ----------------------------------------------------", " | Data set : columbus", " | Dependent Variable : home value Number of Observations: 49", " | Mean dependent var : 38.4362 Number of Variables : 3", " | S.D. dependent var : 18.4661 Degrees of Freedom : 46", " | ", " | R-squared : 0.349514", " | Adjusted R-squared : 0.3212", " | Sum squared residual: 10647.015 F-statistic : 12.3582", " | Sigma-square : 231.457 Prob(F-statistic) : 5.06369e-05", " | S.E. of regression : 15.214 Log likelihood : -201.368", " | Sigma-square ML : 217.286 Akaike info criterion : 408.735", " | S.E of regression ML: 14.7406 Schwarz criterion : 414.411", " | ", " | ----------------------------------------------------------------------------", " | Variable Coefficient Std.Error t-Statistic Probability", " | ----------------------------------------------------------------------------", " | CONSTANT 46.4281827 13.1917570 3.5194844 0.0009866767", " | income 0.6289840 0.5359104 1.1736736 0.2465669", " | crime -0.4848885 0.1826729 -2.6544086 0.0108745", " | ----------------------------------------------------------------------------", " | ", " | REGRESSION DIAGNOSTICS", " | MULTICOLLINEARITY CONDITION NUMBER 12.537555", " | TEST ON NORMALITY OF ERRORS", " | TEST DF VALUE PROB", " | Jarque-Bera 2 39.706155 0.0000000", " | ", " | DIAGNOSTICS FOR HETEROSKEDASTICITY", " | RANDOM COEFFICIENTS", " | TEST DF VALUE PROB", " | Breusch-Pagan test 2 5.766791 0.0559445", " | Koenker-Bassett test 2 2.270038 0.3214160", " | ", " | SPECIFICATION ROBUST TEST", " | TEST DF VALUE PROB", " | White 5 2.906067 0.7144648", " | ========================= END OF REPORT ==============================", " | ", " | If the optional parameters w and spat_diag are passed to pysal.spreg.OLS,", " | spatial diagnostics will also be computed for the regression. These", " | include Lagrange multiplier tests and Moran's I of the residuals. The w", " | parameter is a PySAL spatial weights matrix. In this example, w is built", " | directly from the shapefile columbus.shp, but w can also be read in from a", " | GAL or GWT file. In this case a rook contiguity weights matrix is built,", " | but PySAL also offers queen contiguity, distance weights and k nearest", " | neighbor weights among others. 
In the example, the", " | residuals is 0.2037 with a standardized value of 2.5918 and a p-value of", " | 0.009547.", " | ", " | >>> w = pysal.weights.rook_from_shapefile(pysal.examples.get_path(\"columbus.shp\"))", " | >>> ols = OLS(y, X, w, spat_diag=True, moran=True, name_y='home value', name_x=['income','crime'], name_ds='columbus')", " | >>> ols.betas", " | array([[ 46.42818268],", " | [ 0.62898397],", " | [ -0.48488854]])", " | >>> print ols.moran_res[0]", " | 0.20373540938", " | >>> print ols.moran_res[1]", " | 2.59180452208", " | >>> print ols.moran_res[2]", " | 0.00954740031251", " | ", " | Method resolution order:", " | OLS", " | BaseOLS", " | pysal.spreg.utils.RegressionPropsY", " | pysal.spreg.utils.RegressionPropsVM", " | pysal.spreg.user_output.DiagnosticBuilder", " | ", " | Methods defined here:", " | ", " | __init__(self, y, x, w=None, robust=None, gwk=None, sig2n_k=True, nonspat_diag=True, spat_diag=False, moran=False, vm=False, name_y=None, name_x=None, name_w=None, name_gwk=None, name_ds=None)", " | ", " | ----------------------------------------------------------------------", " | Data descriptors inherited from pysal.spreg.utils.RegressionPropsY:", " | ", " | mean_y", " | ", " | std_y", " | ", " | ----------------------------------------------------------------------", " | Data descriptors inherited from pysal.spreg.utils.RegressionPropsVM:", " | ", " | sig2n", " | ", " | sig2n_k", " | ", " | utu", " | ", " | vm" ] } ], "prompt_number": 28 }, { "cell_type": "markdown", "source": [ "As you can see, the output is extensive and details all the parameters you can tweak (e.g. sig2n_k).", "", "* Online help: pysal.org hosts not only the code but an HTML version of the documentation. Very handy if ", "we want to quickly google some function.", "", "### Programmatic access to the functionality and batch processing", "", "This may seem obvious, but it is not: the command line gives us programmatic access to the code. This means that", "every action we perform can be saved (e.g. in a script) and that we no longer have to rely on what we clicked", "or did not click to know where the output comes from. Also, programmatic access to the core of GeoDaSpace opens", "new options that the GUI does not offer: reproducibility (an important aspect of open science) is much ", "easier; we can use part of our code in bigger projects or re-use it over and over without redoing the work", "ourselves; and we can run it from a remote server (with more power than our desktop) and apply it to", "larger datasets, for example.", "", "Probably one of the most useful options the command line has to offer is the possibility of running the models we have seen inside loops, so we can use them as part of simulations or quickly and efficiently try several specifications (e.g. different weights, variable combinations, etc.; a sketch of the variable case is included at the very end of the notebook). As a very simple example of how to implement these ideas, we will walk through the following case: imagine that we want to see how different weights change the results of the spatial diagnostics we have shown before. To do that, we only need to run the OLS once, because it is aspatial, but we have to compute the LM tests once for each weights specification. 
This is a perfect case for a for loop, so let us see how we would implement it (in order to see the results, we will also set up a little printout for ourselves):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Import the LM tests method", "from pysal.spreg import LMtests", "#Specify the files for the weights we want to try as a list", "w_files = ['../workshop_data/phx_knn06.gwt', \\", "           '../workshop_data/phx_knn10.gwt', \\", "           '../workshop_data/phx_rook.gal', \\", "           '../workshop_data/phx_queen.gal']", "#Run the OLS", "model = ps.spreg.OLS(y, x, spat_diag=False, nonspat_diag=False)", "#Setup the loop over the weights files", "for w_file in w_files:", "    print 'File: ', w_file", "    w = ps.open(w_file).read()", "    lms = LMtests(model, w)", "    print '\\tLM error: %.4f\\t(%.4f)'%lms.lme", "    print '\\tLM lag: %.4f\\t(%.4f)'%lms.lml", "    print '\\tSARMA: %.4f\\t(%.4f)'%lms.sarma", "    print '\\tRobust LM error: %.4f\\t(%.4f)'%lms.rlme", "    print '\\tRobust LM lag: %.4f\\t(%.4f)'%lms.rlml" ], "language": "python", "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "File: ../workshop_data/phx_knn06.gwt", "\tLM error: 22.1933\t(0.0000)", "\tLM lag: 29.1600\t(0.0000)", "\tSARMA: 29.1979\t(0.0000)", "\tRobust LM error: 0.0379\t(0.8456)", "\tRobust LM lag: 7.0046\t(0.0081)", "File: ../workshop_data/phx_knn10.gwt", "\tLM error: 31.0525\t(0.0000)" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "", "\tLM lag: 48.2486\t(0.0000)", "\tSARMA: 49.1130\t(0.0000)", "\tRobust LM error: 0.8644\t(0.3525)", "\tRobust LM lag: 18.0604\t(0.0000)", "File: ../workshop_data/phx_rook.gal", "\tLM error: 16.2499\t(0.0001)", "\tLM lag: 21.7033\t(0.0000)", "\tSARMA: 31.7729\t(0.0000)", "\tRobust LM error: 10.0696\t(0.0015)", "\tRobust LM lag: 15.5230\t(0.0001)", "File: ../workshop_data/phx_queen.gal", "\tLM error: 22.9589\t(0.0000)" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "", "\tLM lag: 9.3013\t(0.0023)", "\tSARMA: 27.5552\t(0.0000)", "\tRobust LM error: 18.2539\t(0.0000)", "\tRobust LM lag: 4.5963\t(0.0320)" ] } ], "prompt_number": 29 }, { "cell_type": "markdown", "source": [ "Note that to perform this, we only had to run the OLS model once, a speed-up ", "that becomes very important when the dataset is very large.", "", "This is only a very simple example of how scripting and programmatic access can be a powerful ", "element to help us in our analysis. With these tools, building fairly complicated ", "analyses or systems becomes not only feasible but within reach. Feel free to tinker and hack with ", "pysal.spreg as much as you want: the sky is the limit!" ] },
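{ "cell_type": "markdown", "source": [ "To get you started, here is one more example of the same idea, this time looping over variable combinations rather than weights. This was not part of the original GeoDaSpace exercise, so take it as a sketch only: it assumes the `y` array and the `db` file object created earlier in the session are still in memory, and the columns it uses are an arbitrary choice, picked just to show the mechanics. Printing the R-squared values keeps the comparison compact, but you could just as easily collect the results in a list or write them out to a file." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "#A hypothetical sketch: re-use the looping idea, this time over variable combinations", "#It assumes `db` and `y` from the previous sections are still in memory;", "#the columns below are only an illustration, swap in whichever ones you care about", "specs = [('base', ['renter_rt', 'vac_hsu_rt']), \\", "         ('extended', ['renter_rt', 'vac_hsu_rt', 'white_rt', 'pop_dens'])]", "for name, cols in specs:", "    #Build the nxk array of explanatory variables for this specification", "    xs = np.array([db.by_col(c) for c in cols]).T", "    #Run a standard (aspatial) OLS with the chosen columns", "    m = ps.spreg.OLS(y, xs, name_x=cols)", "    print 'Specification: %s (%i variables)'%(name, len(cols))", "    print '\\tR2: %.4f\\tAdj. R2: %.4f'%(m.r2, m.ar2)" ], "language": "python", "outputs": [] } ] } ] }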