In experiments involving study participants, why are some studies allowed to only have a small number of test subjects?
Although large samples produce more robust results, small samples are not completely devoid of usefulness - they're just more likely to be wrong. As long as the reader keeps this in mind, there's nothing fundamentally wrong with publishing these small-sample studies.
Further, sometimes you just cannot get enough samples to make an assertion with confidence. Examples:
GW170817 was a one-of-a-kind neutron star merger event. Since the rate of neutron star mergers isn't well-known, one could easily wait another five years and not get another sample (plus you'd need hundreds of samples to make the kind of statements you're looking for).
The Berlin Patient - there is otherwise no cure for HIV/AIDS, so even if this single case of apparent cure were a fluke, it's still worth reporting: it's a signpost telling researchers where to look.
Habitability research - The question of whether aliens exist captures the public imagination. Like it or not, we have to approach the problem from what we know, and we know only one planet with life (i.e. sample size of one). You could of course just decline to work in the field until we've discovered several hundred alien civilizations, but then you could be waiting forever. Plus with no idea what to look for, observational astronomers would have a much harder time finding these civilizations.
In usability research, just 5 participants will reveal about 85% of the problems with a system. This fits qualitative research very well (so perhaps the hearing comfort studies you mentioned?). In my own experience running studies with 8-32 participants, it only took the first few participants before every subsequent one felt like a repeat.
Sure, you might not be able to get statistical significance with 5 participants, but you can learn a lot.
Figure taken from Nielsen's article Why You Only Need to Test with 5 Users.
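If the figure doesn't load for you: the curve is from Nielsen and Landauer's model, in which each additional participant uncovers a fixed fraction L of the remaining problems (they estimated L ≈ 0.31 from their data). A quick sketch of that model in Python:

```python
# Sketch of the Nielsen & Landauer model: expected fraction of usability
# problems found after testing with n users, assuming each user
# independently reveals a fraction L of the problems.
L = 0.31  # per-user discovery rate estimated by Nielsen & Landauer

def problems_found(n, L=L):
    """Expected fraction of usability problems uncovered by n users."""
    return 1 - (1 - L) ** n

for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} users -> {problems_found(n):.0%} of problems found")
# 5 users -> ~85% under this model; the returns diminish quickly afterwards.
```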
Of course this depends a lot on what you're after. If you are looking for differences or some other effect within an individual, then your sample size is the number of measurements within that individual rather than the number of individuals. In such scenarios it is legitimate to study only a single individual (or a few).
Also, if your individuals are not studied per se but act as an "instrument" (testing the answer by @AustinHenley, or setting up a sensory panel, i.e. you use N individuals to smell/taste the effect of something you do/produce), you need enough of them to cover most of the important variation in the population. But that number isn't actually that large, because typically only people who are good at the task at hand will be used: smell-blind people generally do not apply for a sensory panel, so you only need to cover the variation in receptors among people who are good at smelling. That is far fewer than, say, the number of people you'd need to look at to find out what proportion of people are good at smelling, or the number you'd need in order to predict, say, the probability that a randomly chosen unknown person will like the smell/taste of what you produce:
If you need to make conclusions about a population or about applicability of your findings to unknown individuals of a population (if we train our method on an individual, this is how well it will do), you need to representatively cover the variability in that population, and this requires large sample size.
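Just to put a number on "large": a quick back-of-the-envelope calculation for estimating something as simple as a population proportion (the margins of error below are illustrative choices, not taken from any particular study):

```python
# Back-of-the-envelope sample size for estimating a population proportion p
# to within a margin of error e at 95 % confidence (normal approximation).
# The chosen margins of error are purely illustrative.
z = 1.96  # 95 % two-sided normal quantile

def n_needed(e, p=0.5):
    """Worst-case (p = 0.5) sample size for margin of error e."""
    return int(round(z**2 * p * (1 - p) / e**2))

for e in (0.10, 0.05, 0.02):
    print(f"margin of error ±{e:.0%}: n ≈ {n_needed(e)}")
# ±10% -> ~96, ±5% -> ~384, ±2% -> ~2401: representative coverage of a
# population quickly demands far more than a handful of subjects.
```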
For the population questions, I fully agree with @Buffy's analysis, but I'll go a bit further about possible reasons.
The reason that is perhaps most scary (to me) is that it is, IMHO, far too easy to publish low-quality studies whose sample size is far too small.
Note that I've been working in a life-science field, i.e. close to where we have had reports for years about studies not being reproducible, and at least in my sub-field this has not had much of an effect on sample sizes.

One particular problem I see with small sample size studies is not scientific (as in limitations of the conclusions that can be drawn) but political. The "politics" of academia focuses very much on novelty. This basically means that once a small sample size study is published, any larger follow-up study has to present extremely good arguments to overcome the "this is known already" bias in funding.
This means that, in addition to being uncertain, small sample size studies may prevent getting certainty.

In (industrial) experimental design it is often recommended to start with a preliminary experiment using, say, 1/3 or 40 % of the available resources, then do a preliminary analysis and, if necessary, re-adjust the allocation of the remaining resources accordingly.
This takes time and effort. However, if the preliminary results are good, they can be published, and the follow-up study will then potentially face a lack-of-novelty hurdle to publication.

Master's and PhD theses are by definition the work of one student. This limits how much work can happen in one study, and it pushes towards experiments that are inconsistent in the long term: students cannot be abused as lab robots (which would help in getting together good sample sizes, but good lab technicians are more expensive...), as they need to contribute scientifically. One of the easiest ways to do that is to improve the experiments over what has been done before, leading to large numbers of small sample size studies and fragmented series of experiments.
The requirement of "own work" sometimes causes students not to speak openly about their project and not to seek advice. Every so often this leads to flawed experimental design and/or too small a sample size, which is realized only during data analysis (or not at all). By then it is often too late to do anything but try to rescue the existing data. And rescued into a paper it must be, because otherwise the student won't have the paper they need.
I see another potential conflict in mixing the evaluation of scientific work (as in grading the PhD student's work) with arriving at scientific findings:
- If a student in their well-planned and well-conducted study finds the "desired" effect, that implies both that the student did their work well and that we have a scientific advance. All is fine.
- However, if things don't work out nicely, it is much more difficult: was the failure due to the student not working well (I come from a wet-lab field)? And/or is there no effect? In other words, a student who does not find an effect has to put in much more effort to demonstrate that it is not their fault.
Now consider putting a set amount of effort either into n underpowered studies or into 1 study with a good sample size. If the one with the large sample size fails, you don't have a single paper*. If a small sample size study doesn't find an effect, it is simply not published and you move on to the next, because it is usually too much hassle to make a paper with a negative finding publishable. But keep in mind that small sample size studies are not only underpowered, they also "provide" a high "chance" of producing false positive findings (i.e. of overestimating the effect) - and that means a paper.
Our brain tends to underestimate the effect of chance. This means without doing the statistics, we're likely to intuitively underestimate chance and be overconfident in our findings.
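If you'd rather see this effect of chance than take my word for it, here is a small simulation sketch (the true effect size, group sizes and number of runs are made-up illustration values, not from any real study): with a small sample, only the runs that happen to overestimate the effect reach p < 0.05, so the "publishable" estimates are systematically inflated.

```python
# Illustrative simulation of "significant small studies overestimate the
# effect". All parameters are made-up for demonstration purposes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.3   # true standardized mean difference between the groups
n_sims = 20_000     # number of simulated studies per sample size

for n in (10, 100):  # per-group sample size of each simulated study
    effects, significant = [], []
    for _ in range(n_sims):
        a = rng.normal(true_effect, 1, n)  # treatment group
        b = rng.normal(0, 1, n)            # control group
        t, p = stats.ttest_ind(a, b)
        effects.append(a.mean() - b.mean())
        significant.append(p < 0.05)
    effects = np.array(effects)
    significant = np.array(significant)
    print(f"n={n:3d} per group: power ≈ {significant.mean():.0%}, "
          f"mean effect among significant results ≈ "
          f"{effects[significant].mean():.2f} (true effect {true_effect})")
# With n=10 per group only the lucky overestimates cross p < 0.05, so the
# "published" effect among significant results is far larger than the truth.
```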
There are studies where it is, practically speaking, impossible to obtain anything close to the required sample size**. And there's nothing scientifically wrong with case reports and small studies, as long as they are clearly indicated as such, the conclusions are drawn with the required caution, and the limitations are clearly stated.
Often, it is also practically impossible to obtain representative samples. I'm still OK with this as long as a) the limitations are openly stated, and/or b) applicability is at least checked by plugging in reasonable guesstimates for prevalence/incidence/class frequencies, so that suitability for the application in question gets at least a critical thought.
However, I see far too many studies that needlessly "rescue the world on the basis of 3 mice", or that work on, say, 20 samples of tumor tissue that is in any case cut out of patients with a disease that is neither rare, nor is the tumor volume small, nor is all the tissue needed for a correct diagnosis in order to properly treat the patient.

Finally, I consider it the worst possible waste of experimental effort (and that is even worse if test subjects/animals are involved) if a study is so small that a back-of-the-envelope calculation would have shown that even in the best case no practical conclusion could possibly be drawn.
(e.g. reserving 4 independent test patients out of your 20 patients in the development of a medical diagnostic [pos/neg for some disease]. Assuming all 4 are correctly classified, this gives you an observed accuracy of 100 %, but the 95 % confidence interval for accuracy ranges from around guessing to perfect.)
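If you want to check that interval yourself, the exact Clopper-Pearson calculation for 4 out of 4 correct is a few lines; a minimal sketch:

```python
# Exact (Clopper-Pearson) 95 % confidence interval for the example above:
# 4 independent test patients, all 4 classified correctly.
from scipy.stats import beta

k, n, alpha = 4, 4, 0.05
lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
print(f"observed accuracy {k/n:.0%}, 95% CI ≈ [{lower:.0%}, {upper:.0%}]")
# -> roughly [40 %, 100 %]: compatible with anything from close to guessing
#    (for a balanced two-class problem) up to a perfect classifier.
```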
* Real-life example:
- I've met a PhD student who was working with a sample size of ≈ 500 patients in my field. They measured them all, then found a flaw in the design of the measurements (the order of measurements was not properly randomized, so strictly speaking drift over time could not be excluded as the cause of the observations - which, however, looked like real effects from a physical/spectroscopic point of view). They realized this and re-did all the measurements, and then did not find the effect that was expected from the supervisor's long experience with similar situations and from preliminary experiments. It is not clear whether there is truly no effect after all, or whether something else went wrong with the experiments (a non-negligible risk for what they do), or both.
- Same field, other university: other students get their PhD on the basis of, say, 20 patients. Not randomized, never having thought about experimental design, not to speak of the limitations of their study.
I'd judge student 1 much better than students 2 in terms of the scientific work they did. But they will have to struggle to convince their committee of that: the committees of students 2 clearly were not aware of the limitations, nor were the referees of their papers. So students 2 happily publish multiple papers in the time in which student 1 struggles over whether their study can be published at all. And student 1 did admit to a mistake, so there is evidence of student 1 not working well (they did make a mistake, after all), whereas there is no such evidence in the studies of students 2...
** I'm also fine with the practical limitation that in many interdisciplinary fields, if you need significant work from others (in our case, e.g., reference diagnoses by pathologists reviewing every single one of our samples), those others may want to see preliminary data published in order to make sure they don't waste their effort on people who don't know what they are doing and/or are not serious about the application.