Endogenous switching or sample selection models for count data

Introduction

In two seminal papers, Heckman (1976) and Heckman (1979) considered the case where two variable are jointly determined: a binomial variable and an outcome continuous variable. For example, the continuous variable can be the wage and the binomial variable the labor force participation. In this case, the wage is observed only for the sub-sample of the individuals who works. If the unobserved determinants of labor force participation are correlated with the unobserved determinants of wage, estimating the wage equation only on the subset of individuals who work will result in an inconsistent estimator. This case is called the sample selection model.

An other example is the case where the binomial variable is a a dummy for private vs public sector. In this case the wage is observed for the whole sample (for the individuals in the public and in the private sector). But, once again, if the unobserved determinants of the chosen sector are correlated with those of the wage, estimating the wage equation only will leads to inconsistent estimators. This case is called the endogenous switching model.

Two consistent method of estimation can be used in this context:

Sample selection and endogenous switching for count data

Let \(y\) be a count response (for sake of simplicity a Poisson variable) and \(z\) a binomial variable. The value of \(z\) is given by the sign of \(\alpha ^ \top z + \nu\), where \(\nu\) is a standard normal deviate, \(z\) a vector of covariates and \(\alpha\) the associated vector of unknown parameters. The distribution of \(y_n\) is Poisson with parameter \(\lambda_n\), given by \(\ln \lambda_n = \beta ^ \top x_n + \epsilon_n\) where \(x\) is a second set of covariates (which can overlap with \(z\)), \(\beta\) is the corresponding set of unknown parameters and \(\epsilon\) is a random normal deviate with 0 mean and a standard deviation equal to \(\sigma\). \(\epsilon\) and \(\nu\) being potentially correlated, their joint distribution has to be considered:

\[ f(\epsilon, \nu ; \sigma, \rho) = \frac{1}{2\pi\sqrt{1 - \rho ^ 2}\sigma} e^{\displaystyle-\frac{1}{2} \frac{\left(\frac{\epsilon}{\sigma}\right) ^ 2 + \nu ^ 2 - 2 \rho \left(\frac{\epsilon}{\sigma}\right)\nu}{1 - \rho ^2}} = \frac{1}{\sqrt{2\pi}}\sigma e ^ {-\frac{1}{2}\left(\frac{\epsilon}{\sigma}\right) ^ 2} \frac{1}{\sqrt{2\pi}\sqrt{1 - \rho ^ 2}} e^{-\frac{1}{2}\left(\frac{(\nu - \rho \epsilon / \sigma)}{\sqrt{1 - \rho ^ 2}}\right) ^ 2} \]

The second expression classically gives the joint distribution as the product of the marginal distribution of \(\epsilon\) and the conditional distribution of \(\nu\).

For \(z=1\), the unconditional distribution of \(y\) is obtained by integrating out \(g(y_n \mid x_n, \epsilon_n, \nu_n, z_n = 1)\) with respect with the two random deviates:

\[ P(y_n \mid x_n, z_n = 1) = \int_{-\infty} ^ {+\infty}\int_{-\alpha ^ \top z_n} ^ {+ \infty} g(y_n \mid x_n, \epsilon_n, z_n = 1) f(\epsilon, \nu) d\epsilon d\nu \]

\[ P(y_n \mid x_n, z_n = 1) = \int_{-\infty} ^ {+\infty} g(y_n \mid x_n, \epsilon_n, z_n = 1) \left(\int_{-\alpha ^ \top z_n} ^ {+ \infty} \frac{1}{\sqrt{2\pi}\sqrt{1 - \rho ^ 2}} e^{-\frac{1}{2}\left(\frac{(\nu - \rho \epsilon / \sigma)}{\sqrt{1 - \rho ^ 2}}\right) ^ 2} d\nu\right) \frac{1}{\sqrt{2\pi}\sigma} e ^ {-\frac{1}{2}\left(\frac{\epsilon}{\sigma}\right) ^ 2} d\epsilon \]

By symetry of the normal distribution, the term in braquet is:

\[ \Phi\left(\frac{\alpha ^ \top z_n + \rho / \sigma \epsilon}{\sqrt{1 - \rho ^ 2}}\right) \]

which is the probability that \(z = 1\) for a given value of \(\epsilon\). The density of \(y\) given that \(z = 1\) is then:

\[\begin{equation} P(y_n \mid x_n, z_n = 1) = \int_{-\infty} ^ {+\infty} \frac{e ^ {-\mbox{exp}(\beta ^ \top x_n + \epsilon_n)}e^{y_n(\beta^\top x_n + \epsilon_n)}}{y_n!}\Phi\left(\frac{\alpha ^ \top z_n + \rho / \sigma \epsilon}{\sqrt{1 - \rho ^ 2}}\right)\frac{1}{\sqrt{2\pi}\sigma} e ^ {-\frac{1}{2}\left(\frac{\epsilon}{\sigma}\right) ^ 2} d\epsilon (\#eq:probyz1) \end{equation}\]

By symmetry, it is easily shown that \(P(y_n \mid x_n, z_n = 0)\) is similar except that \(\Phi\left( (\alpha ^ \top z_n + \rho / \sigma \epsilon) / \sqrt{1 - \rho ^ 2}\right)\) is replaced by \(1 - \Phi\left( (\alpha ^ \top z_n + \rho / \sigma \epsilon) / \sqrt{1 - \rho ^ 2}\right) = \Phi\left( -(\alpha ^ \top z_n + \rho / \sigma \epsilon) / \sqrt{1 - \rho ^ 2}\right)\), so that a general formulation of the distribution of \(y\) is, denoting \(q = 2 z -1\):

\[\begin{equation} (\#eq:proby) P(y_n \mid x_n) = \int_{-\infty} ^ {+\infty} \frac{e ^ {-\mbox{exp}(\beta ^ \top x_n + \epsilon_n)}e^{y_n(\beta^\top x_n + \epsilon_n)}}{y_n!}\Phi\left(q_n\frac{\alpha ^ \top z_n + \rho / \sigma \epsilon}{\sqrt{1 - \rho ^ 2}}\right)\frac{1}{\sqrt{2\pi}\sigma} e ^ {-\frac{1}{2}\left(\frac{\epsilon}{\sigma}\right) ^ 2} d\epsilon \end{equation}\]

There is no closed form for this integral but, using the change of variable \(\eta = \epsilon / \sqrt{2} / \sigma\), we get:

\[ P(y_n \mid x_n) = \int_{-\infty} ^ {+\infty} \frac{e ^ {-\mbox{exp}(\beta ^ \top x_n + \sqrt{2} \sigma \eta)}e^{y_n(\beta^\top x_n + \sqrt{2}\sigma\eta)}}{y_n!}\Phi\left(q_n\frac{\alpha ^ \top z_n + \sqrt{2} \rho \eta}{\sqrt{1 - \rho ^ 2}}\right)\frac{1}{\sqrt{\pi}} e ^ {-\eta ^ 2} d\eta \]

which can be approximated using Gauss-Hermite quadrature. Denting \(\eta_r\) the nodes and \(\omega_r\) the weights:

\[ P(y_n \mid x_n) \approx \sum_{r = 1} ^ R \omega_r \frac{e ^ {-\mbox{exp}(\beta ^ \top x_n + \sqrt{2} \sigma \eta_r)}e^{y_n(\beta^\top x_n + \sqrt{2}\sigma\eta_r)}}{y_n!}\Phi\left(q_n\frac{\alpha ^ \top z_n + \sqrt{2} \rho \eta_r}{\sqrt{1 - \rho ^ 2}}\right)\frac{1}{\sqrt{\pi}} e ^ {-\eta_r ^ 2} \]

For the exogenous switching model, the contribution of one observation to the likelihood is given by @ref(eq:proby).

For the sample selection model, the contribution of one observation to the likelihood is given by @ref(eq:probyz1) if \(z_n = 1\). If \(z_n = 0\), \(y\) is unobserved and the contribution of such observations to the likelihood is the probability that \(z_n = 0\), which is:

\[\begin{equation} P(z_n = 0 \mid x_n) = \int_{-\infty} ^ {+\infty} \Phi\left(q_n\frac{\alpha ^ \top z_n + \rho / \sigma \epsilon}{\sqrt{1 - \rho ^ 2}}\right)\frac{1}{\sqrt{2\pi}\sigma} e ^ {-\frac{1}{2}\left(\frac{\epsilon}{\sigma}\right) ^ 2} d\epsilon \end{equation}\]

The ML estimator is computing intensive as the integral has no closed form. One alternative is to use non-linear least squares, by first computing the expectation of \(y\). Terza (1998) showed that, in the endogenous switching case:

\[ \mbox{E}(y_n\mid x_n) = \mbox{exp}\left(\beta ^ \top x_n + \ln \frac{\Phi\left(q_n(\alpha ^\top z_n + \theta)\right)}{\Phi\left(q_n(\alpha ^\top z_n)\right)}\right) \]

For the sample selection case, we have:

\[ \mbox{E}(y_n\mid x_n, z_n = 1) = \mbox{exp}\left(\beta ^\top x_n + \ln \frac{\Phi(\alpha ^\top z_n + \theta)}{\Phi(\alpha ^ \top z_n)}\right) \]

Greene (2001) noted that, taking a first order Taylor series of \(\ln \frac{\Phi(\alpha ^\top z_n + \theta)}{\Phi(\alpha ^ \top z_n)}\) around \(\theta = 0\) gives: \(\theta \phi(\alpha ^ \top z_n) / \Phi(\alpha ^ \top z_n)\), which is the inverse mills ratio that is used in the linear case in order to correct the inconsistency due to sample selection. As \(\alpha\) can be consistently estimated by a probit model, the NLS estimator is obtained by minimizing with respect to \(\beta\) and \(\theta\) the sum of squares of the following residuals:

\[ y_n - e^{\displaystyle \beta ^ \top x_n + \ln \frac{\Phi(\hat{\alpha} ^ \top z_n + \theta)}{\Phi(\hat{\alpha} ^ \top z_n)}} \]

As it is customary for these two-steps estimator, the covariance matrix of the estimators should take into account the fact that \(\alpha\) has been estimated in the first step. Moreover, only \(\theta = \rho \sigma\) is estimated. To retrieve an estimator of \(\sigma\), Terza (1998) proposed to insert into the log-likelihood function the estimated values of \(\alpha\), \(\beta\) and \(\theta\) and then to maximize it with respect to \(\sigma\)2.

The escount function

library("micsr")

The escount function estimates the endogenous switching and the sample selection model for count data. The first one is obtained by setting the model argument to 'es' (the default) and the second one to 'ss'. The estimation method is selected using the method argument, which can be either 'twosteps' for the two-steps non-linear least squares model (the default) or 'ML' for maximum likelihood. The model is described by an extended formula (using the Formula package) of the form:

y | d ~ x + y + z + d | x + y + w

which indicates that the two responses are y (the count) and d (the binomial variable), that the covariates are x, y and z for the count equation and x, y and w for the switching/selection equation. When there are two large sets of covariates that overlap, it is possible to define the second set of covariates by updating the first one:

y | d ~ x + y + z + d | . - d - z + w

R is an integer that indicates the number of points used for the Gauss-Hermite quadrature method. Relevant only for the ML method, the hessian argument is a boolean: if TRUE, the covariance matrix of the coefficients is estimated using the numerical hessian, which is computed using the hessian function of the numDeriv package, otherwise, the outer product of the gradient is used.

escount returns an object of class escount which inherits from lm.

Trips demand

Terza (1998) analyzed the number of trips taken by 577 individuals in the United States in 1978 the day before they were interviewed. The trips data set is included in the micsr package.A major determinant of trips demand is the availability of a car in the household. Terza (1998) advocates that the unobserved determinants of the decision of having a car may be correlated with those of the trip demand equation. In this case the estimation of the Poisson model will lead to inconsistent estimators. The covariates are the share of trips for work or school (workschl), the number of individuals in the household (size), the distance to the central business district (dist) a factor (smsa) with two levels for small and large urban area, the number of full-time worker in household (fulltime), the distance from home to nearest transit node, household income divided by the median income of the census tract (realinc), a dummy if the survey period is Saturday or Sunday (weeekend) and a dummy for owning at least a car (car). Although the coefficients are identified, as in the classic Heckman model by the non-linearity of the correction term, Terza (1998) use a different set of variable for the binomial and the count parts of the model. Namely, the weekend covariate is removed and adults is added in the binomial part of the model.

We first compute the two-steps NLS estimator. The model and method arguments needn’t to be set are the default values are es (endogenous switching) and twosteps (two-steps NLS).

trips_2s <- escount(trips | car ~ workschl + size + dist + smsa + fulltime + distnod +
                        realinc + weekend + car | . - car - weekend + adults,
                    data = trips)
names(trips_2s)
##  [1] "coefficients"  "sigma"         "rho"           "vcov"         
##  [5] "residuals"     "fitted.values" "model"         "terms"        
##  [9] "value"         "npar"          "df.residual"   "xlevels"      
## [13] "na.action"     "call"          "first"         "est_method"

The result is a list with usual items, except sigma and rho which are the estimates of \(\sigma\) and \(\rho\) obtained from the estimation of \(\theta\).

The print of the summary method returns the usual table of coefficients, the value of the objective function (the sum of squares residuals) and the estimated values of \(\sigma\) and \(\rho\):

summary(trips_2s)
## Two-steps estimation
##               Estimate Std. Error z-value  Pr(>|z|)    
## (Intercept) -1.4455570  0.4139240 -3.4923 0.0004788 ***
## workschl    -0.5543755  0.1472626 -3.7645 0.0001669 ***
## size         0.1487720  0.0302495  4.9182 8.736e-07 ***
## dist        -0.0089621  0.0056417 -1.5886 0.1121584    
## smsa        -0.0088507  0.1014225 -0.0873 0.9304600    
## fulltime     0.1059545  0.0740731  1.4304 0.1526007    
## distnod      0.0043303  0.0025689  1.6856 0.0918674 .  
## realinc      0.0066861  0.0178072  0.3755 0.7073078    
## weekend     -0.1650841  0.1152872 -1.4319 0.1521617    
## car          2.7957446  0.4209103  6.6421 3.092e-11 ***
## theta       -0.5662482  0.2221207 -2.5493 0.0107945 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## deviance: NULL
## 
## Estimated value of sigma: 0.71951
## Implied value for rho   : -0.78699
trips_pois <- glm(trips ~ workschl + size + dist + smsa + fulltime + distnod +
                      realinc + weekend + car, data = trips, family = poisson)
trips_ml <- update(trips_2s, method = "ml")

The following table presents the results of a Poisson estimation and of the endogenous switching model, estimated by the two-steps non-linear least squares method and by maximum likelihood.

Estimation results for the trip demand model
Poisson  2 steps ML
(Intercept) −0.570 −1.446 −1.413
(0.125) (0.414) (0.193)
workschl −0.456 −0.554 −0.349
(0.070) (0.147) (0.131)
size 0.167 0.149 0.156
(0.012) (0.030) (0.024)
dist −0.002 −0.009 −0.003
(0.001) (0.006) (0.003)
smsa −0.031 −0.009 0.017
(0.044) (0.101) (0.085)
fulltime 0.248 0.106 0.249
(0.026) (0.074) (0.057)
distnod 0.005 0.004 0.005
(0.001) (0.003) (0.002)
realinc 0.019 0.007 0.004
(0.006) (0.018) (0.011)
weekend −0.074 −0.165 −0.100
(0.049) (0.115) (0.089)
car 1.413 2.796 2.166
(0.123) (0.421) (0.231)
theta −0.566
(0.222)
sigma 0.737
(0.038)
rho −0.767
(0.133)
Num.Obs. 577 577 577
AIC 3340.9 3121.5
BIC 3384.5 3213.0
Log.Lik. −1660.440 −1539.726
F 82.194
RMSE 4.20

The coefficient of car is equal to 1.413 for the Poisson model, which implies that the number of trips increases by \(e^{1.413} - 1 = 311\)% for individuals who belongs to households that own at least one car. The coefficient is much higher in the selection models (2.796 for the 2-step estimator and 2.160 for the ML estimator), which implies an increase of 767% for the ML estimator. The Poisson estimator is therefore downward biased, which indicates a negative correlation between the unobserved part of the trips demand equation and the propensity to own a car equation.

Physician advice and alcohol consumption

Kenkel and Terza (2001) investigate the effect of physician’s advice on alcohol consumption. The outcome variable drinks is the number of drinks in the past 2 weeks and the selection variable advice is a dummy based on the respondents’ answer to the question “Have you ever been told by a physician to drink less”. The unobserved part of the equation indicating the propensity to receive an advice form the physician can obviously be correlated with the one of the alcohol consumption equation. The data set drinks is part of the **micsr package. The covariates are monthly income in thousands of dollars (income), age (a factor with six 10 years categories of age), education in years (educ), race (a factor with levels white, black and other), the marital status (marital, a factor with levels single, married, widow, separated), the employment status (a factor empstatus with levels other, emp and unemp) and the region (region, a factor with levels west, northeast, midwest and south). For the binomial part of the model, the same covariates are used (except of course advice) and 10 supplementary covariates indicating the insurance coverage and the health status are added.

kt_pois <- glm(drinks ~ advice + income + age + educ + race + marital + empstatus +
                   region, data = drinks, family = poisson)
kt_ml <- escount(drinks | advice ~ advice + income + age + educ + race + marital +
                     empstatus + region | income + age + educ + race + marital +
                     empstatus + region + medicare + medicaid + champus + hlthins +
                     regmed + dri + limits + diabete + hearthcond + stroke,
                 data = drinks, method = "ml")
kt_2s <- update(kt_ml, method = "twosteps")

The following table presents the results of a Poisson estimation and of the endogenous switching model, estimated by the two-steps non-linear least squares method and maximum likelihood3.

Poisson and endogenous switching models for alcohol demand
Poisson  2 steps ML
(Intercept) 3.082 4.256 3.376
(0.034) (0.300) (0.048)
advice 0.477 −1.400 −0.848
(0.011) (0.338) (0.022)
income 0.003 0.001 0.009
(0.001) (0.003) (0.001)
educ −0.026 −0.061 −0.055
(0.002) (0.012) (0.003)
raceblack −0.402 −0.258 −0.543
(0.017) (0.117) (0.025)
raceother −0.527 −0.310 −0.593
(0.049) (0.290) (0.110)
theta 1.142
(0.319)
sigma 1.298
(0.011)
rho 0.463
(0.022)
Num.Obs. 2467 2467 2467
F 229.655
RMSE 22.16

The coefficient of advice in the alcohol demand equation is positive in the Poisson model, which would imply that a positive effect of physical advice on alcohol consumption. The estimation of the endogenous switching model shows that this positive coefficient is due to the positive correlation between the error terms of the two equations (the unobserved propensities to drink and to receive an advice from a physician are positively correlated).

References

Greene, William H. 2001. “Fiml Estimation of Sample Selection Models for Count Data.” In Economic Theory, Dynamics and Markets: Essays in Honor of Ryuzo Sato, edited by Takashi Negishi, Rama V. Ramachandran, and Kazuo Mino, 73–91. Boston, MA: Springer US.

Heckman, James J. 1976. “The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models.” Annals of Economic and Social Measurement, Volume 5, Number 4, 475–92.

———. 1979. “Sample Selection Bias as a Specification Error.” Econometrica 47 (1): 153–61.

Kenkel, Donald S., and Joseph V. Terza. 2001. “The Effect of Physician Advice on Alcohol Consumption: Count Regression with an Endogenous Treatment Effect.” Journal of Applied Econometrics 16 (2): 165–84.

Terza, Joseph V. 1998. “Estimating Count Data Models with Endogenous Switching: Sample Selection and Endogenous Treatment Effects.” Journal of Econometrics 84 (1): 129–54.


  1. More precisely the inverse Mills ratio.↩︎

  2. These once again requires to use Gauss-Hermite quadrature, but the problem is considerably simpler as the likelihood is maximized with respect with only one parameter.↩︎

  3. The coefficients of marital, empstatus and region are omitted to save place.↩︎