When we sample a subset from a population instead of surveying the entire population, we aim for the measured value in the sample to be as close as possible to the true value in the population. When we say "value," we refer either to an average (e.g., the average height of individuals in the population) or a proportion (e.g., the proportion of people who drink coffee daily). The difference between the measured value in the sample, let’s call it x′, and the value in the population, x, represents the sampling error.
Some samples may better reflect the data of the population they are drawn from, while others may not. This brings us to the concept of representativeness.
Let’s imagine that from a population of size N, we repeatedly draw different samples, say k samples, each of size n, until we exhaust all individuals in the population. For each sample, the variable we measure yields a mean: x1, x2, ..., xk. If, after drawing each sample, we recompute the average of the sample means obtained so far, we will notice that as more samples are included, the result moves closer to the true value in the population.
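The thought experiment above can be tried out on a computer. The following is only a minimal sketch: the population of 10,000 heights, its mean of 170 cm and the sample size of 100 are invented for illustration and do not come from any real survey.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: N = 10,000 heights in cm (invented for illustration).
N, n = 10_000, 100
population = [random.gauss(170, 10) for _ in range(N)]
true_mean = statistics.fmean(population)

# Shuffle, then cut the population into k = N / n non-overlapping samples,
# so that drawing all of them exhausts the population.
random.shuffle(population)
samples = [population[i:i + n] for i in range(0, N, n)]
sample_means = [statistics.fmean(s) for s in samples]

# Running average of the sample means after 1, 2, ..., k samples.
running = []
total = 0.0
for i, m in enumerate(sample_means, start=1):
    total += m
    running.append(total / i)

print(f"true mean:             {true_mean:.3f}")
print(f"after 1 sample:        {running[0]:.3f}")
print(f"after 10 samples:      {running[9]:.3f}")
print(f"after all {len(samples)} samples: {running[-1]:.3f}")
```

Because the equal-sized samples together cover every individual exactly once, the average of all k sample means coincides with the population mean, apart from floating-point rounding.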
In the case of simple random sampling, the standard deviation of the sample mean x’, also referred to as the standard error e, is smaller than the standard deviation of the variable x in the population by a factor equal to the square root of the sample size:
e = σ / √n, where σ is the standard deviation of the variable in the population and n is the sample size.
The error indicates the average deviation of the sample mean x’ from the true mean value x in the population from which the sample was drawn. It tells us the likely error we can expect when estimating the population mean x using the sample mean (x’). Since it is often the case that real data about an entire population is unknown, the population's standard deviation is also unknown. To calculate the sampling error, we rely on certain assumptions.
Returning to our imagined exercise of drawing k samples, the working assumption in sampling is that if we were to plot all the sample means obtained for the variable x, their distribution would follow a normal curve. In other words, when we extract k samples, the means of the measured variable are symmetrically distributed around the population mean (x), with higher frequencies near x and lower frequencies as we move toward the tails of the distribution (resembling a bell-shaped curve or a hat viewed from the side).
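The working assumption described above can be checked with a small simulation. This is a sketch under stated assumptions: the population is hypothetical and deliberately skewed (exponential, so not bell-shaped itself), and the sample size n = 50 and number of samples k = 2,000 are arbitrary choices.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical, deliberately skewed population (invented for illustration).
population = [random.expovariate(1 / 20) for _ in range(50_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n, k = 50, 2_000  # sample size and number of samples (illustrative choices)
means = [statistics.fmean(random.sample(population, n)) for _ in range(k)]

# Standard deviation of the sample means, compared with sigma / sqrt(n).
observed_se = statistics.stdev(means)
predicted_se = sigma / math.sqrt(n)

# Roughly half of the sample means should fall below the population mean
# if they are spread symmetrically around it.
below = sum(m < mu for m in means) / k

print(f"predicted standard error: {predicted_se:.3f}")
print(f"observed SD of the means: {observed_se:.3f}")
print(f"share of means below mu:  {below:.1%}")
```

Even though the individual values are far from normally distributed, the sample means cluster around the population mean, and their spread comes out close to the population standard deviation divided by the square root of n.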
The probability that the true population mean falls within a certain interval depends only on the length of the interval t, measured in standard deviations. In our example—where we consider the distribution of sample means obtained after drawing k samples—the standard deviation represents the standard error.
Thus, we aim to establish an interval within which the sample mean (e.g., x13, the mean obtained from the 13th sample) falls, with a sufficiently high probability that the error is smaller than the length of the interval.
Image source: https://analystnotes.com/cfa-study-notes-the-standard-normal-distribution.html
In practical experience, the lowest accepted probability is P = 95%, meaning there is at least a 95% chance that, when selecting a random sample, the mean value falls within the specified interval. Conversely, the value p (calculated as 1-P) indicates the probability of making an error.
There is a 95% chance that a value derived from a sample deviates by less than 2 standard errors (more precisely, 1.96) from the true population mean. There is a 99% chance that the deviation is less than 2.6 standard errors, and a 90% chance that it is less than 1.65 standard errors.
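The multipliers quoted above come from the standard normal distribution. As a short sketch, they can be recovered with Python's standard library; the helper name `t_value` is introduced here for illustration only.

```python
from statistics import NormalDist

def t_value(P):
    """Two-sided critical value t such that P(|Z| <= t) = P for a standard normal Z."""
    return NormalDist().inv_cdf(1 - (1 - P) / 2)

for P in (0.90, 0.95, 0.99):
    print(f"P = {P:.0%}: t = {t_value(P):.3f}")
```

The values 1.65 and 2.6 used in the text are rounded versions of approximately 1.645 and 2.576.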
Based on the principles already outlined, we can replace the unknown standard deviation of the variable in the population with the standard deviation calculated from the sample.
For example, suppose we want to estimate the average height of the population. We survey a sample of 800 people and find that the average height of the participants is 176 cm, with a standard deviation of 17 cm.
We substitute these values into the standard error formula (the standard deviation divided by the square root of the sample size):
e = 17 / √800 ≈ 0.60 cm
From the values given earlier, for P = 95%, t = 1.96. The true value in the population lies within the interval:
176 cm − 1.96 × 0.60 cm to 176 cm + 1.96 × 0.60 cm, meaning we are 95% confident that the true average height in the population is somewhere between 174.8 cm and 177.2 cm.
If we want to report the data at an even higher confidence level of 99%, we substitute into the formula again. For P = 99%, t = 2.6, and the true value in the population lies within the interval 176 cm − 2.6 × 0.60 cm to 176 cm + 2.6 × 0.60 cm, meaning we are 99% confident that the true average height in the population is somewhere between 174.4 cm and 177.6 cm.
The maximum error thus increases from 1.96 × 0.60 ≈ 1.2 cm to 2.6 × 0.60 = 1.56 cm.
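The two calculations above can be reproduced in a few lines. This is a minimal sketch using the figures from the height example; the function name `mean_interval` is hypothetical, and the t values are the rounded ones used in the text.

```python
import math

def mean_interval(mean, sd, n, t):
    """Return (standard error, maximum error, lower bound, upper bound)."""
    se = sd / math.sqrt(n)   # standard error of the mean
    margin = t * se          # maximum error at the chosen confidence level
    return se, margin, mean - margin, mean + margin

# Sample of 800 people, mean 176 cm, standard deviation 17 cm.
for label, t in (("95%", 1.96), ("99%", 2.6)):
    se, margin, low, high = mean_interval(176, 17, 800, t)
    print(f"{label}: e = {se:.2f} cm, max error = {margin:.2f} cm, "
          f"interval = {low:.1f} to {high:.1f} cm")
```

The code keeps the standard error unrounded, so the second decimal of the maximum error may differ slightly from a hand calculation that first rounds e to 0.60.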
We can also experiment with the sample size. From the formula, we can already deduce that as the sample size increases, the error e decreases.
Assuming the same average height of survey participants and the same standard deviation, but this time surveying 2000 people, the error e will be:
e = 17 / √2000 ≈ 0.38 cm
At a confidence level of 95% (t = 1.96), we can say with 95% certainty that the true value in the population (the average height) lies somewhere between 175.3 cm and 176.7 cm. The maximum error in this case is 1.96 × 0.38 ≈ 0.74 cm.
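To see how the maximum error responds to the sample size, the same formula can be evaluated for several values of n. This is a sketch only: the sample size of 3200 is added here purely to illustrate the square-root relationship and is not part of the original example.

```python
import math

def max_error(sd, n, t=1.96):
    """Maximum sampling error of a mean at the confidence level implied by t."""
    return t * sd / math.sqrt(n)

# Standard deviation of 17 cm, as in the height example.
for n in (800, 2000, 3200):
    print(f"n = {n:>4}: max error = {max_error(17, n):.3f} cm")
```

Because the error shrinks with the square root of n, quadrupling the sample from 800 to 3200 only halves the maximum error. The unrounded result for n = 2000 is about 0.745 cm; the 0.74 cm in the text comes from rounding e to 0.38 first.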
In practice, it is often necessary to weigh the situation and decide whether reducing the error (in our case, from 1.2 cm to 0.74 cm) justifies increasing the sample size by 1200 people.
The answer might be affirmative if we aim to estimate averages within sub-populations (e.g., men and women). In this scenario, the sample size will no longer be 2000 people but will depend on the number of women and men included in the sample (let’s assume an equal distribution, with 1000 women and 1000 men).
If the average height among the women interviewed is 165 cm, with a standard deviation of 15 cm, the error will be:
e = 15 / √1000 ≈ 0.47 cm
We can thus report that we are 95% confident that the average height among the population of women falls between 164.1 cm and 165.9 cm. The maximum sampling error in this case is 1.96 × 0.47 ≈ 0.92 cm.
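The comparison between the full sample and a sub-population can be laid out side by side. This is a sketch using only the summary figures quoted in the text; the dictionary layout and the names in it are hypothetical.

```python
import math

T_95 = 1.96  # multiplier for a 95% confidence level

# (sample size, mean in cm, standard deviation in cm), taken from the text.
groups = {
    "all respondents": (2000, 176, 17),
    "women": (1000, 165, 15),
}

results = {}
for name, (n, mean, sd) in groups.items():
    se = sd / math.sqrt(n)        # standard error within this group
    margin = T_95 * se            # maximum error at 95% confidence
    results[name] = (se, margin, mean - margin, mean + margin)
    print(f"{name}: e = {se:.2f} cm, max error = {margin:.2f} cm, "
          f"interval = {mean - margin:.1f} to {mean + margin:.1f} cm")
```

The error for each sub-population depends on the number of respondents in that group, not on the size of the whole sample, which is why it is larger for the 1000 women than for all 2000 respondents. Keeping e unrounded gives a maximum error of about 0.93 cm for women, against 0.92 cm when e is first rounded to 0.47.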
Bibliography:
Rotariu, T. (coord.), Bădescu, G., Culic, I., Mezei, E., Mureşan, C., Metode statistice aplicate în ştiinţele sociale, Iaşi, Polirom, 1999.