A "good" estimator is one that provides an estimate with the following qualities:

Unbiasedness: An estimator is said to be unbiased for a given parameter when its expected value can be shown to equal the parameter being estimated. For example, the mean of a sample is an unbiased estimate of the mean of the population from which the sample was drawn. Unbiasedness is a good quality for an estimator because, in such a case, a weighted average of several estimates provides a better estimate than any one of them alone. Therefore, unbiasedness allows us to upgrade our estimates. For example, if your estimates of the population mean µ are, say, 10 and 11.2 from two independent samples of sizes 20 and 30 respectively, then a better estimate of µ based on both samples is [20(10) + 30(11.2)] / (20 + 30) = 10.72.
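The size-weighted average in the example can be sketched in a few lines of Python. The function name `pooled_mean` is an illustrative choice, not something from the text; with the two samples above the numerator is 200 + 336 = 536, and 536 / 50 = 10.72.

```python
def pooled_mean(means, sizes):
    """Combine several sample means, weighting each by its sample size."""
    if len(means) != len(sizes) or not means:
        raise ValueError("means and sizes must be non-empty and of equal length")
    total = sum(sizes)
    return sum(m * n for m, n in zip(means, sizes)) / total


# The two samples from the text: means 10 and 11.2, sizes 20 and 30.
combined = pooled_mean([10.0, 11.2], [20, 30])
print(round(combined, 2))  # 10.72
```

Weighting by sample size gives the larger sample more influence, which is why the combined estimate sits closer to 11.2 than to 10.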

The standard deviation of an estimate is called the standard error of that estimate. The larger the standard error, the more error in your estimate. The standard error is a commonly used index of the error entailed in estimating a population parameter from the information in a random sample of size n drawn from the entire population.
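For the most common case, the sample mean, the standard error is usually estimated as the sample standard deviation divided by the square root of n. The sketch below assumes that case; the function name and the data values are made up for illustration.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """Estimate the standard error of the sample mean as s / sqrt(n)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    # statistics.stdev uses the n - 1 denominator (sample standard deviation).
    return statistics.stdev(sample) / math.sqrt(n)


# Hypothetical measurements, used only to exercise the function.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_error_of_mean(sample))
```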

Consistency: An estimator is said to be "consistent" if increasing the sample size produces an estimate with a smaller standard error. In that sense, the quality of your estimate is "consistent" with the sample size: spending more money to obtain a larger sample produces a better estimate.
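A small simulation can make this visible. The sketch below assumes a normal population with mean 50 and standard deviation 10 (arbitrary choices for illustration) and measures how much the sample mean varies across repeated samples of increasing size.

```python
import random
import statistics


def empirical_se_of_mean(n, replications=1000, seed=0):
    """Standard deviation of the sample mean across repeated samples of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(replications):
        # Assumed population: normal with mean 50 and standard deviation 10.
        sample = [rng.gauss(50.0, 10.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


results = {n: empirical_se_of_mean(n) for n in (10, 40, 160)}
for n, se in results.items():
    print(n, round(se, 2))
```

Each fourfold increase in n should roughly halve the spread, since the standard error of the mean shrinks in proportion to 1 / sqrt(n).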

Efficiency: An efficient estimator is one that has the smallest standard error among all unbiased estimators.

The "best" estimator is the one that is closest to the population parameter being estimated.

The Concept of Distance for an Estimator

The above figure illustrates the concept of closeness by means of aiming at the centre of a dartboard: the goal is an unbiased estimator with minimum variance. Each dartboard shows several shots, each standing for the estimate obtained from one sample. On the first, all the shots are clustered tightly together, but none of them hits the centre (small variance, but biased). On the second, the shots have a large spread, but around the centre (unbiased, but large variance). The third is worse than the first two. Only the last one has a tight cluster around the centre, and therefore has good efficiency. If an estimator is unbiased, then its variability will determine its reliability. If an unbiased estimator is extremely variable, then the estimates it produces may not, on average, be as close to the population parameter as those of a biased estimator with small variance.
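The trade-off described here is commonly summarised by the mean squared error, which splits into a variance term and a squared-bias term. The notation below is an assumption, since the text does not use symbols: \hat{\theta} denotes the estimator and \theta the population parameter.

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \mathrm{Var}(\hat{\theta}) + \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2
```

An unbiased estimator has a zero second term, so its closeness is governed entirely by its variance; a slightly biased estimator with a much smaller variance can still have the smaller total.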

Estimation is the process by which sample data are used to indicate the value of an unknown quantity in a population. Results of estimation can be expressed as a single value, known as a point estimate, or as a range of values, referred to as a confidence interval. Whenever we use point estimation, we also calculate the margin of error associated with that point estimate.

In newspaper and television reports on public opinion polls, the margin of error often appears in a small font at the bottom of a table or screen. However, reporting the amount of error alone is not informative enough; what is missing is the degree of confidence in the findings. An even more important missing piece of information is the sample size n: how many people participated in the survey, 100 or 100,000? By now, you know well that the larger the sample size, the more accurate the finding, right?
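To see how strongly the sample size drives a poll's margin of error, here is a minimal sketch using the usual normal-approximation formula for a proportion, z * sqrt(p(1 - p) / n). The formula, the 95% confidence level, and the worst-case proportion p = 0.5 are assumptions brought in for illustration; the text does not state them.

```python
import math
from statistics import NormalDist


def margin_of_error(p, n, confidence=0.95):
    """Normal-approximation margin of error for a sample proportion."""
    if not 0.0 <= p <= 1.0 or n <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("need 0 <= p <= 1, n > 0 and 0 < confidence < 1")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * math.sqrt(p * (1.0 - p) / n)


# Worst case p = 0.5 for the two survey sizes mentioned in the text.
for n in (100, 100_000):
    print(n, round(margin_of_error(0.5, n), 3))
```

With 100 respondents the margin is close to ten percentage points; with 100,000 it is about a third of one point.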

The reported margin of error is the margin of "sampling error". There are many non-sampling errors that can and do affect the accuracy of polls, but here we talk only about sampling error. Because sub-groups might have a larger sampling error than the group as a whole, one must include the following statement in the report: "Other sources of error include, but are not limited to, individuals refusing to participate in the interview and inability to connect with the selected number. Every feasible effort was made to obtain a response and reduce the error, but the reader (or the viewer) should be aware that some error is inherent in all research."

Some inferential statistical techniques do not require distributional assumptions about the statistics involved. These modern non-parametric methods use large amounts of computation to explore the empirical variability of a statistic, rather than making a priori assumptions about this variability.

The bootstrapping method obtains an estimate by combining the estimates computed on each of many sub-samples of a data set. Often, M samples of T observations each are drawn at random, with replacement, from the original data set of size n, where T is less than n (T equal to n is also a common choice).
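The procedure can be sketched as follows. The parameter names `m` and `t` mirror M and T in the text; `t` defaults to n here, and can be set smaller to match the description above. The data values and the helper name are made up for illustration.

```python
import random
import statistics


def bootstrap_estimates(data, statistic, m=1000, t=None, seed=0):
    """Apply `statistic` to m resamples of size t drawn with replacement."""
    if not data:
        raise ValueError("data must not be empty")
    rng = random.Random(seed)
    t = len(data) if t is None else t
    # random.choices samples with replacement.
    return [statistic(rng.choices(data, k=t)) for _ in range(m)]


# Hypothetical data set, used only to exercise the function.
data = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 8.9, 9.3, 10.1, 12.6]
estimates = bootstrap_estimates(data, statistics.median)
print("bootstrap estimate of the median:", statistics.fmean(estimates))
print("bootstrap standard error:", statistics.stdev(estimates))
```

The spread of the M estimates serves as an empirical stand-in for the standard error, with no distributional assumption about the statistic.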

A jack-knife estimator creates a series of estimates from a single data set by computing the statistic repeatedly on the data set, leaving one data value out each time. This produces a mean estimate of the parameter and a standard deviation of the estimates of the parameter.
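A minimal leave-one-out sketch is shown below. Besides the mean of the leave-one-out estimates, it returns the usual jack-knife standard error, sqrt((n - 1) / n * sum of squared deviations), which is a scaled version of the standard deviation mentioned above; that scaling is an addition not spelled out in the text, and the function name is illustrative.

```python
import math
import statistics


def jackknife(data, statistic):
    """Return leave-one-out estimates, their mean, and the jack-knife SE."""
    data = list(data)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    # Recompute the statistic n times, omitting one observation each time.
    estimates = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    center = statistics.fmean(estimates)
    se = math.sqrt((n - 1) / n * sum((e - center) ** 2 for e in estimates))
    return estimates, center, se


# Hypothetical data, used only to exercise the function.
estimates, center, se = jackknife([2, 4, 4, 4, 5, 5, 7, 9], statistics.fmean)
print(center, se)
```

For the sample mean, this jack-knife standard error coincides with the familiar s / sqrt(n), which makes the mean a convenient sanity check.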

Monte Carlo Simulation: Monte Carlo simulation allows for the evaluation of the behaviour of a statistic when its mathematical analysis is intractable. Bootstrapping and jack-knifing allow inferences to be made from a sample when traditional parametric inference fails. These techniques are especially useful for dealing with statistical problems such as small sample sizes, statistics with no well-developed distributional theory, and violations of the conditions for parametric inference. Both are computer intensive. Bootstrapping means taking repeated samples, with replacement, from a sample and then making statements about the population. Jack-knifing involves systematically carrying out n steps, omitting one case from the sample at each step, or, more generally, n/k steps omitting k cases at a time; computations that compare "included" vs. "omitted" cases can be used, in particular, to reduce the bias of estimation. Both have applications in reducing bias in estimation.
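As a small Monte Carlo illustration, the sketch below studies a statistic by brute-force simulation rather than by formula: it draws many samples from an assumed standard normal population and compares how much the sample mean and the sample median vary. The population, the sample size of 25, and the number of replications are arbitrary choices.

```python
import random
import statistics


def simulate_spread(n=25, replications=2000, seed=0):
    """Simulated standard errors of the sample mean and the sample median."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(replications):
        # Assumed population: standard normal.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.stdev(means), statistics.stdev(medians)


se_mean, se_median = simulate_spread()
print("standard error of the mean:  ", round(se_mean, 3))
print("standard error of the median:", round(se_median, 3))
```

For normal data the median comes out more variable than the mean, so the simulation also illustrates the idea of efficiency: both estimators target the same centre, but the mean does so with the smaller standard error.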

Re-Sampling: Re-sampling, which includes the bootstrap, permutation, and other non-parametric tests, is a method for hypothesis testing, confidence limits, and other applied problems in statistics and probability. It involves no formulas or tables. Following the first publication of the general technique (and the bootstrap) in 1969 by Julian Simon and its subsequent independent development by Bradley Efron, re-sampling has become an alternative approach for testing hypotheses. There are dissenting views, however: "The bootstrap started out as a good notion in that it presented, in theory, an elegant statistical procedure that was free of distributional conditions. In practice the bootstrap technique doesn't work very well, and the attempts to modify it make it more complicated and more confusing than the parametric procedures that it was meant to replace." While re-sampling techniques may reduce the bias, they achieve this at the expense of an increase in variance. The two major concerns are:

- The loss in accuracy of the estimate as measured by variance can be very large.
- The dimension of the data affects drastically the quality of the samples and therefore the estimates.
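Of the re-sampling methods named above, the permutation test is perhaps the easiest to sketch. The version below tests for a difference in means between two groups by repeatedly shuffling the pooled data; the function name, the number of shuffles, and the sample values are illustrative assumptions, and adding one to both counts is a common convention that keeps the p-value away from zero.

```python
import random
import statistics


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled data.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)])
                   - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)


# Hypothetical measurements from two groups.
group_a = [10, 11, 12, 13, 14, 15]
group_b = [20, 21, 22, 23, 24, 25]
print(permutation_test(group_a, group_b))
```

A small p-value says that a difference as large as the observed one rarely arises when the group labels are assigned at random.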

At the planning stage of a statistical investigation, the question of sample size (n) is critical and should not be taken lightly. Taking a larger sample than is needed to achieve the desired results wastes resources, whereas very small samples often lead to results that are of no practical use for making good decisions. The main objective is to obtain both a desirable accuracy and a desirable confidence level at minimum cost.

Students sometimes ask me, "What fraction of the population do you need for good estimation?" I answer, "It's irrelevant; accuracy is determined by sample size." This answer has to be modified if the sample is a sizable fraction of the population. The confidence level of conclusions drawn from a set of data depends on the size of the data set: the larger the sample, the higher the associated confidence. However, larger samples also require more effort and resources. Thus, your goal must be to find the smallest sample size that will provide the desired confidence.
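The modification for a sample that is a sizable fraction of the population is usually made with the finite population correction, sqrt((N - n) / (N - 1)), applied to the standard error of the mean. The sketch below assumes simple random sampling without replacement; the correction itself and the parameter names are not given in the text.

```python
import math


def fpc_standard_error(sigma, n, population_size):
    """Standard error of the mean with the finite population correction."""
    if not 0 < n <= population_size:
        raise ValueError("need 0 < n <= population_size")
    if population_size == 1:
        return 0.0
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return sigma / math.sqrt(n) * fpc


# Same sample size (n = 100), very different population sizes.
for population_size in (200, 10_000, 10_000_000):
    print(population_size, round(fpc_standard_error(10.0, 100, population_size), 3))
```

Once the population is much larger than the sample, the correction is practically 1 and only n matters, which is the point of the answer quoted above.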

When the estimates needed for the sample size calculation are not available from an existing database, a pilot study is needed to obtain them with adequate precision. A pilot, or preliminary, sample must be drawn from the population, and the statistics computed from this sample are used in determining the sample size. Observations in the pilot sample may be counted as part of the final sample, so that the computed sample size minus the pilot sample size is the number of additional observations needed to satisfy the total sample size requirement.
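As a sketch of that calculation for estimating a mean, one common formula is n = (z * s / E)^2, rounded up, where s is the pilot standard deviation and E the desired margin of error. The formula, the 95% confidence level, and the numbers below are assumptions for illustration; using z with an estimated s is itself an approximation.

```python
import math
from statistics import NormalDist


def sample_size_from_pilot(pilot_sd, pilot_n, margin, confidence=0.95):
    """Return (total sample size, additional observations beyond the pilot)."""
    if pilot_sd <= 0 or margin <= 0 or pilot_n < 0 or not 0.0 < confidence < 1.0:
        raise ValueError("need pilot_sd > 0, margin > 0, pilot_n >= 0, 0 < confidence < 1")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    required = math.ceil((z * pilot_sd / margin) ** 2)
    # Pilot observations count toward the total, as described in the text.
    additional = max(0, required - pilot_n)
    return required, additional


# Hypothetical pilot: 30 observations with standard deviation 15;
# desired margin of error of 2 units at 95% confidence.
required, additional = sample_size_from_pilot(15.0, 30, 2.0)
print(required, additional)  # 217 187
```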