8 Steps to design a sample - Part I

In the first two articles (1, 2), we explained why sampling for a face-to-face (F2F) study in Romania follows the methodology of multi-stratified sampling combined with cluster sampling. Essentially, we answered the question, “How do I reach the respondent?”

Whenever I think about sampling, regardless of the study, I aim to answer the following questions. They are essentially the steps I follow mentally, like a logical framework:

1. Who is the eligible respondent?

The eligible respondent is the target population: the audience of interest for the study.

To define this in detail, even to the point of visualizing it, I assign values to basic socio-demographic variables: gender, age, region, and residence type or locality size (I will refer to both of these concepts from time to time). Official statistics exist for these variables. If other variables are added on top of them, the definition becomes more complex.

This first step also determines the sampling unit: an individual/consumer, a household, a company, a school or other educational institution, a student, etc.

2. Do I know how many people/entities/units meet the criteria of the target population definition out of the total population?

In other words, this is the incidence rate of the target population at the national level. It may seem surprising how important this information is: a high incidence makes the study easier, while a low one makes it challenging. This may sound confusing, so let me explain what a low or high incidence rate means:

A) Let's say we need to conduct a study among the employees of a bank to measure employee satisfaction. The bank has 3,000 employees, 2,800 of whom have email addresses and are relevant to the bank's initiative. The intention is to invite all 2,800 employees to the survey. Here, the target population is defined by the list of employees the bank deems relevant, so the incidence rate is 100%.

B) A pharmaceutical company that produces oncology medications aims to measure brand awareness and perception. The target population is all oncologists in Romania, regardless of specialization. Again, the incidence rate is 100% because all oncologists are eligible. However, if we only needed oncologists with one specialization, for example dermatological oncology, the incidence rate would be the number of oncologists in that specialization divided by the total number of oncologists. This could result in an incidence rate below 20%, possibly closer to 10%.

C) A provider of advanced technology services wants to determine what it needs to do to increase brand awareness and, subsequently, its market share among SMEs (small and medium-sized enterprises) with more than 10 employees.

The SME category definition: "It consists of enterprises employing fewer than 250 people and having an annual turnover not exceeding 50 million euros and/or an annual balance sheet total not exceeding 43 million euros."

According to INS (Romanian National Institute of Statistics), in 2020, there were slightly over 600,000 companies in Romania, 91% of which were in the 0-9 employee segment, which, based on the client's target definition, should be excluded.

[Figure: Distribution of companies by number of employees]

The SME definition also requires eliminating companies with 250 or more employees. Once the 0-9 segment is excluded, only 9% of companies remain, and removing the 250+ segment pushes the share slightly lower still, without even considering the turnover condition.

The incidence rate can dictate many decisions regarding the appropriate research methodology for the market study you're designing.
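To make the arithmetic behind examples A to C concrete, here is a minimal Python sketch. The function names and the target of 400 completed interviews are hypothetical, chosen only for illustration. The screening estimate also assumes that every eligible unit found agrees to take part, which is optimistic.

```python
import math


def incidence_rate(eligible_units: int, total_units: int) -> float:
    """Share of the total population that meets the target definition."""
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    if not 0 <= eligible_units <= total_units:
        raise ValueError("eligible_units must be between 0 and total_units")
    return eligible_units / total_units


def contacts_to_screen(target_completes: int, incidence: float) -> int:
    """Units to screen to find `target_completes` eligible ones.

    Assumption: no refusals among eligible units (an optimistic lower bound).
    """
    if not 0 < incidence <= 1:
        raise ValueError("incidence must be in (0, 1]")
    return math.ceil(target_completes / incidence)


# Example A: the bank's list of relevant employees is the target itself.
bank = incidence_rate(2_800, 2_800)

# Example C: roughly 600,000 companies, 91% of them with 0-9 employees.
# The remaining 9% is an upper bound, before removing 250+ employees
# and before applying the turnover condition.
total_companies = 600_000
upper_bound_eligible = total_companies * 9 // 100
sme = incidence_rate(upper_bound_eligible, total_companies)

print(f"Bank incidence: {bank:.0%}")
print(f"SME incidence (upper bound): {sme:.0%}")
print("Companies to screen for 400 interviews:", contacts_to_screen(400, sme))
```

The lower the incidence, the more units you have to screen for the same number of interviews, which is why this figure weighs so heavily on the choice of methodology.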

3. Do I have data about the target profile?

When the target population is defined as the national population aged 18+, I can certainly rely on data provided by INS.

Profiling data matters because it tells you how the target population is distributed by region and locality size. No matter how hard you try to ensure probabilistic sampling and to safeguard the accuracy of responses, errors are real and inevitable. They can stem from sampling (how sampling points and respondents are selected) or from the interview itself.

The refusal rate (the share of people who decline to participate in the survey) has a significant impact on sampling quality: it can distort the target profile and leave parts of the population of interest inadequately covered. Trust me, the refusal rate has increased significantly over time, which is bad news for researchers, and it is undoubtedly still evolving, post-pandemic and amid economic and geopolitical uncertainty.

The refusal rate varies greatly by gender (female vs. male), age group (young vs. older), and between Bucharest/large cities vs. rural or small urban areas.

You need this profiling data to check the sample structure against the socio-demographic variables, understand the size of the deviations, and determine in which strata they occur. You may need to weight the data to align the sample with the official structure.
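As a rough illustration of this check, the Python sketch below compares a sample's structure with the official one and derives simple post-stratification weights (population share divided by sample share). The function names are hypothetical, and the gender shares and counts are made-up numbers, not INS figures.

```python
def structure_deviation(sample_counts, population_shares):
    """Sample share minus population share for each stratum."""
    n = sum(sample_counts.values())
    return {
        stratum: sample_counts.get(stratum, 0) / n - pop_share
        for stratum, pop_share in population_shares.items()
    }


def poststratification_weights(sample_counts, population_shares):
    """Weight per stratum = population share / sample share."""
    n = sum(sample_counts.values())
    weights = {}
    for stratum, pop_share in population_shares.items():
        count = sample_counts.get(stratum, 0)
        if count == 0:
            # An empty stratum cannot be fixed by weighting.
            raise ValueError(f"no respondents in stratum {stratum!r}")
        weights[stratum] = pop_share / (count / n)
    return weights


# Illustrative numbers only: a sample of 1,000 that over-represents women.
population_shares = {"female": 0.52, "male": 0.48}
sample_counts = {"female": 600, "male": 400}

deviations = structure_deviation(sample_counts, population_shares)
weights = poststratification_weights(sample_counts, population_shares)

for stratum in population_shares:
    print(f"{stratum}: deviation {deviations[stratum]:+.1%}, "
          f"weight {weights[stratum]:.3f}")
```

Over-represented strata receive weights below 1 and under-represented strata receive weights above 1. In practice, weighting is usually done on several variables at once, which this sketch does not attempt.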

This relates to the representativeness of the sample, that is, ensuring it reflects the structure of the target population. The goal is to have a high level of confidence that if you repeatedly drew samples following your chosen methodology, the results would agree, with deviations within the maximum sampling error in nearly all cases (95% of them at a 95% confidence level).

If the answer to the question in this step is "yes," you can rest assured. If not, it's advisable to ensure early on, ideally from the proposal stage, that a source exists and is accessible. If you're very unlucky and no source exists, it's wise to budget separately to address this lack of data.
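For reference, the maximum sampling error mentioned above is commonly computed with the textbook formula for a proportion under simple random sampling, taking the worst case p = 0.5. The Python sketch below uses that formula. The function name and the `deff` (design effect) parameter are assumptions added for illustration, since stratified and clustered designs like the one discussed in this series usually need such a correction.

```python
import math


def max_sampling_error(n: int, z: float = 1.96, deff: float = 1.0) -> float:
    """Maximum margin of error for a proportion (worst case p = 0.5).

    z    -- critical value; 1.96 corresponds to a 95% confidence level
    deff -- design effect; 1.0 means simple random sampling (assumption)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(deff * 0.25 / n)


for n in (400, 1_000):
    print(f"n = {n}: +/- {max_sampling_error(n):.1%}")

# Hypothetical clustered design that doubles the variance.
print(f"n = 1000, deff = 2: +/- {max_sampling_error(1_000, deff=2.0):.1%}")
```

The error shrinks with the square root of the sample size, so quadrupling the sample only halves the maximum error.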

See the next steps here.