# Estimating Sampling Variation

#### Danny Kaplan and Randall Pruim, Fri Jan 4 10:33:33 2013

Sampling variation refers to the uncertainty created by the sampling process. If your data is just a sample, another data set will produce a somewhat different result. The question we seek to answer is, “How much is somewhat?”

The method used here is called bootstrapping. Bootstrapping relies on a simulation of the sampling process called resampling: sampling from the sample.

### Obtaining a model from data

To illustrate, consider estimating the power-law exponent on the relationship between species number and land area.

Species = fetchData("http://www.math.smith.edu/~bbaumer/mth247/SpeciesArea.csv")
plotPoints(Species ~ Area, data=Species)


The model we seek to fit is $\mathrm{Species} = a \mathrm{Area}^b \;.$ Judging from the shape of the graph, Species goes roughly as the square-root of area, so use $$b=0.5$$ as the initial guess for the nonlinear parameter. Finding the parameters is a straightforward application of fitModel:

SpeciesModel = fitModel( Species ~ a*Area^b,
data=Species,
start=list(b=.5) )
SpeciesModel

## function (Area, b = 0.256503447940427, a = 4.10981633631802)
## a * Area^b
## <environment: 0x103b576e0>
## attr(,"coefficients")
##      b      a
## 0.2565 4.1098

coef(SpeciesModel)  # if we just want the coeffiecients

##      b      a
## 0.2565 4.1098


It looks like a fourth-root model is appropriate. But how precise is the estimate of the coefficient $$b$$?

### Estimating parameter precision

We begin by repeating what is above, but wrapping it in curly braces. This tells R to treat the entire code block as a single expression. We'll make use of this in the next step.

{ SpeciesModel = fitModel( Species ~ a*Area^b,
data=Species,
start=list(b=.5) )
coef(SpeciesModel)
}

##      b      a
## 0.2565 4.1098


#### Apply resampling and iterate

Creating a resampling distribution is straightforward once we have gotten this far. Edit the code above as follows:

• Wrap the data frame with resample.
• Put do(500)* in front of the expression, and
• store the results so that you can access them later:
ResamplingDist = do(500)*
{ SpeciesModel = fitModel( Species ~ a*Area^b,
data=resample(Species),
start=list(b=.5) )
coef(SpeciesModel)
}


Now we have a set of 500 resampling realizations. We begin by looking at the variability in our estimated power graphically.

densityplot( ~ b, data=ResamplingDist )  # we could also have done a histogram


plotPoints( b ~ a, data=ResamplingDist)


Notice that the coefficient estimates are not independent of each other.

One common way to quantify sampling variability is with the standard error. The standard error is simply the standard deviation of the sampling distribution, which we can estimate using our resampling distribution.

sd( ~ b, data=ResamplingDist )

## [1] 0.02109


Another standard way to present the precision is with a confidence interval.
For instance, to find the 95% confidence interval, use confint.

confint(ResamplingDist)

##   name  lower  upper
## 1    b 0.2114 0.2943
## 2    a 2.4933 6.0975


A 95% confidence (when based on normal distribution theory) stretches roughly 2 standard errors in each direction from the estimate.

se <- sd( ~b, data=ResamplingDist )
coef(SpeciesModel)['b'] + c(-1,1) * 2 * se

## [1] 0.2327 0.3171


Alternatively, we could take the middle 95% of the samping distribution as our interval.

qdata(c(0.025,0.975), b, data=ResamplingDist)

##   2.5%  97.5%
## 0.1886 0.2838


### Using the model to make predictions

#### A point estimate

Suppose that we were interested in predicting the number of species in a land with area $$2 \times 10^6$$ km2. This would be an extrapolation, and extrapolations are generally unreliable, but here goes …

The procedure is essentially the same as before, but now we compute an estimated response from our model rather that estimated coefficients.

First we do this with our original data to get a point estimate, a single number that is our best guess but does not reveal anything about its precision.

{ SpeciesModel = fitModel( Species ~ a*Area^b,
data=Species,
start=list(b=.5) )
SpeciesModel( Area=2e6 )
}

## [1] 169.8


#### Interval estimates via resampling

Now we would like to get an interval estimate via resampling.

ResamplingDist2 = do(500)*
{ SpeciesModel = fitModel( Species ~ a*Area^b,
data=resample(Species),
start=list(b=.5) )
SpeciesModel(2e6)
}


Now we have a set of 500 resampling realizations.

densityplot( ~ result, data=ResamplingDist2)


To find the 95% confidence interval, use confint.

confint(ResamplingDist2)

##     name lower upper
## 1 result 135.9 198.3