

Yes, all very scientific; but just tell me what you really think
Experimental trials generate a relatively clear-cut and independent assessment of the degree of impact an intervention has on children’s lives. But they do not govern how the results they garner are then interpreted. The last day of this year’s UK Randomised Controlled Trials in the Social Sciences conference, organized by the York Trials Unit, was dominated by this difficulty.
Leading the charge was Robert Slavin, founding Director of the Institute for Effective Education at York. Disappointed with the variable standards of stores of good practice in the education field, especially those provided by the lavishly funded What Works Clearinghouse, Slavin created the Best Evidence Encyclopedia, the BEE.
As he described it to conference delegates, the BEE offered well-designed systematic reviews of research that educators wanted to know about. Some they produced themselves; if others had done the necessary work, the BEE just pointed people in the right direction.
Any program included in a BEE review had to meet high standards. The BEE excluded any program whose evaluation did not involve a control group. The experimental and control groups had to be well matched prior to intervention, within half a standard deviation. It also demanded strong measures and an intervention lasting a minimum of 12 weeks.
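The matching rule can be made concrete with a small sketch. The article does not say how the BEE computed the gap between groups, so the pooled standard deviation used here, the function names and the pretest scores are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def baseline_gap_in_sd(treatment_scores, control_scores):
    """Pretest difference between groups, in pooled standard deviation units.

    Assumption: the gap is measured against a pooled SD; the source only
    says groups had to match "within half a standard deviation".
    """
    n_t, n_c = len(treatment_scores), len(control_scores)
    var_t = stdev(treatment_scores) ** 2
    var_c = stdev(control_scores) ** 2
    pooled_sd = sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    return (mean(treatment_scores) - mean(control_scores)) / pooled_sd


def well_matched(treatment_scores, control_scores, threshold=0.5):
    """True if the groups differ at pretest by no more than `threshold` SDs."""
    return abs(baseline_gap_in_sd(treatment_scores, control_scores)) <= threshold


# Hypothetical pretest scores, invented for this example.
close_treatment = [50, 52, 54, 56, 58]
close_control = [51, 53, 55, 57, 59]
distant_control = [60, 62, 64, 66, 68]

print(well_matched(close_treatment, close_control))
print(well_matched(close_treatment, distant_control))
```

In this made-up case the first pair of groups differs by about a third of a standard deviation and would qualify; the second pair differs by several standard deviations and would not.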
On Wednesday he was keener to discuss what had been learned from assembling the BEE. Some of the lessons were familiar. Small studies generally had insufficient power to report statistically significant effects. So they were “file-drawered”. They didn’t get published.
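The point about power can be illustrated with a rough calculation. This is a textbook normal approximation for a simple two-group comparison, not anything presented at the conference; it assumes equal group sizes and individually randomized pupils, and it ignores the clustering by class or school that real education trials have to account for.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Assumptions: normal approximation, equal group sizes, no clustering.
    The negligible chance of a significant result in the wrong direction
    is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    noncentrality = effect_size * sqrt(n_per_group / 2)
    return z.cdf(noncentrality - z_crit)


# A true effect of 0.1 SD studied with 50 versus 1,000 pupils per group.
small_study_power = approx_power(0.1, 50)
large_study_power = approx_power(0.1, 1000)
print(round(small_study_power, 2), round(large_study_power, 2))
```

Under these assumptions a study with 50 pupils per group has well under a one-in-ten chance of detecting a true effect of 0.1, while one with 1,000 per group has better than an even chance.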
Much the same was true of robust evaluations that showed that a program did not work. Such evidence was of less interest to journals, and was less palatable to funders and readers. But leaving negative results out in the cold skewed the findings of a meta-analysis or systematic review.
As worrying was the negative correlation between effect size and sample size. When Slavin looked at elementary and secondary school math programs he found 185 qualifying evaluations. Half had sample sizes of between 30 and 200. Around ten per cent had samples over 2,000; for one there was a study population of 40,000.
Nearly 30% of the variation in effect size could be explained by the size of the study population. Sample sizes of fewer than 100 generally produced effect sizes of around 0.4; sample sizes of 2,000 and above typically generated something less than 0.1.
Ideal conditions; alas, ideal results
There was nothing wrong with a small study, he said. But the results have to be interpreted with care. “Repeat the small study several times over and you will repeatedly get good effect sizes. Take the program to scale and evaluate with a large sample and the effect size will slump.”
Part of the problem was Lee Cronbach’s “super-realization bias”. Ideal conditions produced ideal results. And the solution? One response was to weight the findings included in the BEE to reflect sample size.
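A minimal sketch of what weighting by sample size does to a pooled estimate follows. The article does not describe the BEE's actual weighting scheme, so a plain sample-size-weighted mean is assumed here, and the four studies are hypothetical, chosen only to echo the pattern Slavin reported (about 0.4 for samples under 100, under 0.1 for samples above 2,000).

```python
def weighted_mean_effect(studies):
    """Sample-size-weighted mean of effect sizes.

    `studies` is a list of (sample_size, effect_size) pairs. Assumption:
    weights are proportional to sample size; formal meta-analyses more
    often weight by inverse variance.
    """
    total_n = sum(n for n, _ in studies)
    return sum(n * es for n, es in studies) / total_n


# Hypothetical studies: three small ones and one large one.
studies = [(60, 0.40), (80, 0.40), (90, 0.40), (2500, 0.08)]

unweighted = sum(es for _, es in studies) / len(studies)
weighted = weighted_mean_effect(studies)
print(round(unweighted, 2), round(weighted, 2))
```

With these invented figures the simple average is 0.32, dominated by the three small studies, while the weighted average falls to roughly 0.11, close to what the large study found.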
Next up on the subject was Mark Newman from the Evidence for Policy and Practice Information and Co-ordinating Centre, better known in England as EPPI, at the Institute of Education in London.
As Newman explained, policy makers commissioning systematic reviews from his center wanted straight answers to two questions: Did it work? and Should they do it in the UK?
Unfortunately, the “it” was frequently hard to define. Too often people wanted to know if broad strategies like foster care were effective. It wasn’t the sort of question systematic reviews were intended to answer.
Even reviews of single programs were often based on slim pickings. Newman was among those commissioned to evaluate Multidimensional Treatment Foster Care. There were lots of studies, but when it came to estimating the impact on young offenders it turned out there were just two RCTs, one on males and one on females, involving 76 subjects. Both were undertaken by the program originators at the Oregon Social Learning Center.
Any review produced results. But most policy makers were not interested. They wanted to know what the reviewer thought. The same questions were repeated: Did it work? Should they do it?
Too many were willing to give an answer. “We are expert at systematic reviews,” Newman commented, “not expert at policy.” Too few, hardly any in fact, made explicit the analytic steps that led from the forest plot to the interpretation.
The discussions indicated how far there was to go in the efforts to link evidence to policy. Slavin said, “We get caught up in discussions about the relative merits of randomized controlled trials and quasi-experimental studies and forget about sample size, program duration and measures, all of which are at least as important.”
And when all the evidence had been pulled together in a meta-analysis there was still a danger that weaknesses would be submerged and reviewers would be lured into putting a tick or cross in the policy maker’s check-box.