Why you should combine multi-stratified random sampling with cluster sampling when conducting fieldwork in Romania. Part II
Now that we’ve established the context, we can return to our discussion of multi-stratification. Stratification is used when you can divide the population into sub-groups that are mutually exclusive: they differ from one another and they do not overlap. This separation has to be very clear. It helps to break a large population or area into smaller, more manageable chunks. Region is a good variable for this, separating a population into smaller and clearly distinct parts. Regions are built from counties: for example, Tulcea and Constanţa are the only two counties that form the historical region of Dobrogea.
There is one more variable that separates the population into distinct groups. It is easy to infer which one if you remember that several settlements, predominantly villages, are assigned to towns or communes. The labels municipality, town and village mark a difference based on the number of inhabitants, and give the locals something to be proud of. The variable is “settlement size”, with the levels rural, small town, medium town and large town. The capital, Bucureşti, is by itself a (historical) region and an independent level, as it has around 2 million inhabitants. The next largest town is Cluj-Napoca, with slightly under 300,000 people. Given this enormous gap between Bucureşti and the second largest city, it is clear that the capital deserves its own stratum.
A small town might have 3,000 inhabitants, the same as a village. Yet assignment to an AU or stratum, if we use settlement size, is decided by public/ national authorities based on certain criteria. This is why the SIRUTA codes matter, as well as how those authorities choose to segment the territory. I will detail below the “settlement size” for various strata.
In any case, settlement size strata (or urbanization level, or rural/ urban environment) can be adapted to whatever needs your study has, provided you have a database containing all settlements of Romania and the number of inhabitants for each one. What do you think: do these strata or subgroups, so diverse in their AU assignment and number of inhabitants, appear in every region, or do some types of towns or communes exist only in certain regions?
Table 1 - Population distribution by region and settlement size – number of inhabitants*
How to read the table (Region by columns X Settlement size by rows):
Cell B2 states that close to 600,000 people live in Large urban areas in Ardeal. Cell G5 shows how many people live in rural Dobrogea. Column H contains the total number of inhabitants for each stratum, while row 6 gives the total number of inhabitants for each region.
*Note: keep in mind that the data are quite old, the source being INSSE 2015. My advice is to treat these data as an exercise in how we process data in order to generate a representative sample.
Now let us see the proportion each cell of Table 1 (population distribution by region and settlement size) represents. Table 2 shows each cell as a percentage of the total population, 20 million inhabitants.
Table 2 - Population distribution by region and settlement size - % of total
Bucureşti, column A, accounts for 9% of the total population. The rural areas are home to 46% of our country's inhabitants. The largest urban stratum is Small urban, accounting for 18% of the total population. Two regions have good coverage for this stratum, Moldova and Muntenia. Dobrogea has the fewest inhabitants living in small urban localities.
We now have plenty of information and two variables that can adequately separate Romania’s population. How can I know where my respondent is? Or, to rephrase that, suppose we had 100 field agents/ interviewers (ideally 😉 ): where do we send them, to which settlement, on which street? How many of the 13,000 settlements must we visit? We are discussing a face-to-face study, this being the most complex method. We will address later what happens with online panels or CATI. Stratification is kept for those two methods as well, but it involves fewer steps.
Table 3 – Settlement distribution by region and settlement size – number of settlements*
*Note: Treat these data as an exercise, source INSSE 2015.
We can see that once we reach stratum 4, the number of settlements in each cell increases sharply. Obviously, when designing samples, we will not visit each and every one of the settlements. We will select one sample, even though an indefinite number of samples might be generated. The solution is to build the sample from clusters: groups of inhabitants drawn from a homogeneous population, who all share the same region and settlement size. Bingo, we have to extract population clusters from each cell of the table above. You might be wondering how many people such a cluster must or can include. Before we answer that question, let us see what the distribution of questionnaires/ respondents looks like for a sample of 1000 by region and settlement size.
Table 4.1. - Respondents spread by region and settlement size, N=1000
Table 4.1 states that we have to recruit 93 respondents in Bucureşti and 126 in rural Muntenia. While we can include 93 people from Bucureşti, making sure to visit all 6 sectors, recruiting all 126 people from the same village in Muntenia is not an option. Looking at Table 3, there are over 2,000 villages (some will be communes, some villages dependent on communes). Selecting just one village out of 2,600 means covering only 0.04% of the region’s potential. The sample keeps its representativeness if we maintain a good territorial spread (covering all counties is desirable, if not strictly required) and if the methods employed ensure its randomness. As you might have guessed, we are not yet at the stage of selecting the respondent. This is just the first step: selecting the sampling points (and, implicitly, the settlements) for every stratum/ cell.
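The counts in Table 4.1 are consistent with a proportional allocation: each cell receives a share of the 1000 interviews equal to its share of the population (Bucureşti, at roughly 9%, gets 93). Below is a minimal Python sketch of that arithmetic. It assumes proportional allocation with largest-remainder rounding, neither of which is stated explicitly above. The function name and the toy populations are hypothetical and are not the INSSE figures.

```python
from math import floor


def allocate_proportionally(populations, sample_size):
    """Split sample_size across cells in proportion to population.

    Largest-remainder rounding keeps the quotas summing exactly
    to sample_size (an assumption, not the article's stated rule).
    """
    total = sum(populations.values())
    exact = {cell: sample_size * pop / total for cell, pop in populations.items()}
    quotas = {cell: floor(value) for cell, value in exact.items()}
    leftover = sample_size - sum(quotas.values())
    # Give the remaining interviews to the cells with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - quotas[c], reverse=True)
    for cell in by_remainder[:leftover]:
        quotas[cell] += 1
    return quotas


# Hypothetical (region, settlement size) cells with made-up populations.
toy_populations = {
    ("Region A", "Large urban"): 460_000,
    ("Region A", "Rural"): 330_000,
    ("Region B", "Small urban"): 210_000,
}

quotas = allocate_proportionally(toy_populations, 10)
for cell, n in quotas.items():
    print(cell, n)
```

With the real Table 1 populations as input and a sample size of 1000, the same function would produce a grid comparable to Table 4.1, though the rounding of individual cells may differ from the author's.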
This is where the cluster sampling method comes into play: it determines the number of sampling points, meaning settlements, and then the selection of respondents at each point. A given settlement might have one or several sampling points, depending on the number of settlements each stratum/ cell contains. For a better understanding, we’ll equate sampling point with address/ start point. These are the addresses where you’ll send your field agent to begin recruiting, rules in hand!
Let us take Dobrogea as an example, a region which has only one Large urban settlement (Constanţa), only one Medium urban settlement (Tulcea), and 15 Small urban settlements. Clearly we conduct interviews in Constanţa and Tulcea, those being our only options. For Constanţa we have to find 14 respondents. Do we recruit them from a single sampling point, or from several? To ensure a better sample, several are needed. With a cluster of 7 participants, 2 sampling points would be used. Were we to use a cluster of 10, we’d end up with almost one and a half clusters, which is tricky to handle, as it is better to have equally sized clusters. For the Small urban stratum, we’ve established there are 15 settlements, where we have to find 11 participants. We might use either one or two clusters, and I’d rather use 2 clusters in 2 different settlements.
What does a cluster of size x essentially mean? It means that, starting with the first address, the field agent/ interviewer applies one random selection rule to select the household and another random selection rule to select the participant within that household, until they reach a number of contacts/ selections equal to the cluster size. (A contact/ selection does not necessarily mean a completed questionnaire/ interview, but we’ll discuss such matters at a later date.)
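The Dobrogea reasoning can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the number of sampling points is taken as the nearest whole number of clusters (at least one), and settlements are drawn by simple random sampling, which the text does not prescribe. The function names and the placeholder town names are hypothetical. Only the figures 14, 11, 15 and the cluster size of 7 come from the example above.

```python
import random


def sampling_points(respondents, cluster_size):
    """Nearest whole number of clusters for a cell, never fewer than one."""
    return max(1, round(respondents / cluster_size))


def draw_settlements(settlements, n_points, seed=None):
    """Assign each sampling point to a settlement.

    Settlements are shuffled at random; if there are more points than
    settlements, the shuffled list is cycled so a settlement hosts several points.
    """
    rng = random.Random(seed)
    shuffled = rng.sample(settlements, len(settlements))
    return [shuffled[i % len(shuffled)] for i in range(n_points)]


CLUSTER_SIZE = 7

# Large urban Dobrogea: 14 respondents, a single settlement.
constanta_points = sampling_points(14, CLUSTER_SIZE)
print(draw_settlements(["Constanţa"], constanta_points))

# Small urban Dobrogea: 11 respondents, 15 settlements (placeholder names).
small_towns = [f"small_town_{i:02d}" for i in range(1, 16)]
small_points = sampling_points(11, CLUSTER_SIZE)
print(draw_settlements(small_towns, small_points, seed=1))
```

Both cells end up with 2 sampling points: Constanţa hosts both of its points, while the Small urban points fall in 2 different settlements, matching the choice made above.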
Table 4.2. - Distribution of sampling points for region and settlement size, for cluster=7 respondents, N=1000
For a sample of 1000 respondents and a cluster of size 7, we will be working with 143 sampling points. With a cluster of size 10, there would be 100 sampling points, a rather large difference. You are probably wondering which approach is best. A theorist would say that more sampling points are better, meaning a smaller cluster, 7 in our experiment, because it ensures better spread and a higher chance of covering all counties and more settlements. Someone focused on cost optimization (fewer rural settlements in the sample, for lower travel expenses) while maintaining adequate sample quality would favor a cluster of size 10. We could try a middle-of-the-road approach with a cluster of size 8, for 125 sampling points. In any case, for a sample of 1000 respondents I wouldn’t recommend a cluster smaller than 7 or larger than 10.
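The trade-off between cluster size and number of sampling points is simple arithmetic, sketched here in Python. The helper names are hypothetical. The calculation is done on the overall sample, so the per-cell counts in Table 4.2 may round slightly differently.

```python
from math import ceil


def points_for(sample_size, cluster_size):
    """Sampling points needed so that every respondent belongs to a cluster."""
    return ceil(sample_size / cluster_size)


def last_cluster_size(sample_size, cluster_size):
    """Size of the final cluster once all the others are full."""
    full_clusters = points_for(sample_size, cluster_size) - 1
    return sample_size - full_clusters * cluster_size


for k in (7, 8, 10):
    print(
        f"cluster size {k}: {points_for(1000, k)} sampling points, "
        f"last cluster holds {last_cluster_size(1000, k)}"
    )
```

This reproduces the 143, 125 and 100 sampling points quoted above, and shows that with a cluster of 7 one of the clusters ends up one respondent short.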
Conclusion
It is very important to be familiar with the country where you are conducting the survey/ study and understand the way its territory is organized.
Its area, average population density and settlement spread all provide valuable insight.
Combining stratified sampling with cluster sampling is ideal for any random/ probabilistic sample, regardless of the sample source. It helps in segmenting/ stratifying a population into smaller groups, more easily managed and contacted.