Over the last decade or so, the science community has been concerned about what has been called the “reproducibility crisis”: the apparent failure of some significant experiments to produce the same results when they’re repeated. That failure has led to many suggestions about what might be done to improve matters, but we still don’t fully understand why experiments are failing to reproduce results.
A few recent studies have attempted to pinpoint the underlying problem. The latest of these tackled reproducibility by running an identical set of behavioral experiments in several labs in Switzerland and Germany. It found that much of the variation comes down to the lab itself. But some of the variability can't be ascribed to any obvious cause and may simply arise from differences between individual mice.
Try and try again
The basic outline of the work is pretty simple: Get three labs to perform the same set of 10 standard behavioral experiments on mice. But the researchers took a number of additional steps to allow a detailed look at the underlying factors that might drive variation in experimental results. The experiments were done on two different mouse strains, both of which had been inbred for many generations, limiting genetic variability. All the mice were ordered from the same company. They were housed in identical conditions and were tested while they were the same age.
Each of the three labs ran the full set of tests twice. In one run, all the work was done by a single individual to cut down on the influence of differences in how the mice were handled. In the second, three different people did the experiments to add some variability.
Ideally, these experiments should have produced identical results. If they didn’t, the researchers could look into how results differed and figure out whether the discrepancies might be due to the labs, the people doing the experiments, the strain of mice involved, or some combination of the above.
The first thing that’s obvious from the results is that there’s no single reproducibility problem. Some of the experiments reproduced just fine, with limited variability. Others, as you might expect, saw differences between the strains. But in half of those cases, the magnitude of the strain difference varied enough that one lab might see a statistically significant difference while another wouldn’t. In one case, the strains showed opposite behaviors in the different labs.
Beyond that, results were all over the map. In some cases, the mouse strain was the biggest source of variability. In others, it was the lab. The impact of the individual researcher, which was significant in other studies, turned out to be minor in all but one or two of the tests.
But one of the strongest results was how much of the variability couldn’t be accounted for by anything tracked by the study. In nine of the 10 tests, the unaccounted-for variation was above 25 percent of the total, and it was above half in six of the 10.
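To see how a number like "over half the variation is unexplained" can come about, here is a minimal sketch of a variance decomposition. Everything in it is invented for illustration — the lab and strain effect sizes, the per-cell sample size, and the noise level are assumptions, not the study's actual data or method.

```python
import random
import statistics

random.seed(0)

# Hypothetical generative model: each behavioral score is a lab effect plus
# a strain effect plus individual-mouse noise. All numbers are invented.
lab_effects = {"lab_A": -1.0, "lab_B": 0.0, "lab_C": 1.0}
strain_effects = {"strain_1": -0.5, "strain_2": 0.5}
mouse_sd = 1.5  # unexplained mouse-to-mouse variation

data = []
for lab, lab_eff in lab_effects.items():
    for strain, strain_eff in strain_effects.items():
        for _ in range(20):  # 20 mice per lab/strain cell
            score = lab_eff + strain_eff + random.gauss(0, mouse_sd)
            data.append((lab, strain, score))

scores = [s for _, _, s in data]
total_var = statistics.pvariance(scores)

# Variance "explained" by lab and strain together: the variance of the
# per-cell means (the design is balanced, so every cell gets equal weight).
cell_means = []
for lab in lab_effects:
    for strain in strain_effects:
        cell = [s for l, st, s in data if l == lab and st == strain]
        cell_means.append(statistics.mean(cell))

explained = statistics.pvariance(cell_means)
unexplained_fraction = 1 - explained / total_var
print(f"Fraction of variance left unexplained: {unexplained_fraction:.2f}")
```

Even with every tracked factor accounted for, the individual-mouse noise term leaves a large residual fraction — the same qualitative picture the study reports.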
“Things we didn’t think to test” could be an extremely large category, but in this case, it’s hard to think of ways to perform the experiments more consistently than they were done here. So while variations could be due to a large number of factors, that may make little practical difference since we can’t control those factors anyway.
The researchers also point to earlier research suggesting that at least some of this variability may be due to differences between individual mice. Even though the mice are raised in the same conditions and their genetics are nearly identical, each one will inevitably have somewhat different experiences. Mice are also not automatons and can be expected to vary their behavior from time to time. All of that may set limits on how well we can expect behavioral experiments to replicate.
In the meantime, the researchers suggest that it might be worth leveraging the factors that do make a difference. By deliberately varying things that can shift behavioral results, such as having more than one experimenter run the studies, we add noise to the results. Any signal that rises above this noise would be more likely to replicate.
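The logic of that suggestion can be sketched in code. In this toy model (all names, effect sizes, and sample sizes are hypothetical, not drawn from the study), each experimenter's handling randomly shifts how strongly a treatment effect shows up. A robust effect still rises above the experimenter-to-experimenter spread; a fragile one is swamped by the very noise the design added on purpose.

```python
import random
import statistics

random.seed(1)

def estimate_effect(true_effect, experimenter_sd, n_experimenters=50, n_mice=30):
    """Toy model: each experimenter's handling shifts how strongly a
    treatment effect shows up. Returns the mean effect estimate across
    experimenters and its spread (the deliberately added noise floor)."""
    diffs = []
    for _ in range(n_experimenters):
        shift = random.gauss(0, experimenter_sd)  # experimenter-specific bias
        treated = [random.gauss(true_effect + shift, 1) for _ in range(n_mice)]
        control = [random.gauss(0, 1) for _ in range(n_mice)]
        diffs.append(statistics.mean(treated) - statistics.mean(control))
    return statistics.mean(diffs), statistics.stdev(diffs)

# A robust effect rises well above the experimenter-to-experimenter noise...
robust_mean, robust_spread = estimate_effect(true_effect=1.0, experimenter_sd=0.1)
# ...while a fragile effect, one that depends heavily on who runs the test,
# gets lost in the variation the heterogenized design adds on purpose.
fragile_mean, fragile_spread = estimate_effect(true_effect=0.3, experimenter_sd=1.0)

print(f"robust:  effect {robust_mean:.2f} vs spread {robust_spread:.2f}")
print(f"fragile: effect {fragile_mean:.2f} vs spread {fragile_spread:.2f}")
```

The point of deliberately heterogenizing a design is exactly this filter: findings that survive the added variation are the ones likely to replicate in someone else's lab.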