The fifth caveat of evaluations - Average effects can be misleading

There was a very remote region without access to good educational facilities. A small for-profit company had developed a technology tool that could teach children without the need for a teacher. It set up learning centers in a couple of villages. Some interested parents enrolled their children in these centers, and over time the students got very good at Maths, as seen both in the observations of the center staff and in their school grades.

Now, the company wanted a rigorous evaluation of its program. An external evaluation agency conducted a Randomized Controlled Trial (RCT) in the same location and reported that there was no significant impact of the program on students' test scores.

What do we make of this evidence?

One possibility is that the students who attended the center initially were early adopters: willing to invest effort and intrinsically motivated. The argument goes that such students would have learned with the help of any product, so the initial anecdotal evidence says more about the students' motivation than about the product.

This line of argument misses an important factor. The intrinsically motivated students had the same nature, and the same motivation, even before the program was launched in that area. Why weren't they learning and scoring well then?

The point being, even if students are motivated, they need access to resources to channel that motivation. The program acted as one such resource for motivated students deprived of any learning opportunities, which resulted in their higher test scores. Whether other learning products would raise scores as effectively for the same group of students is a separate debate.

The correct interpretation should be: the program helps motivated students who were earlier deprived of learning opportunities, but it doesn't seem to have an effect on others. Just because the program didn't help all types of students, as reflected in the average effect, doesn't mean we abandon it. If motivation turns out to be the hindrance, then that should be worked on; it is not a reason to stop the program.

RCTs tell us the overall impact, not the impact on a particular child. When non-motivated students outnumber motivated ones, the average dilutes the effect. Hence, it is sometimes important to segment the effects. The segmentation is obvious in this case but needn't be so in all cases.
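To make the dilution concrete, here is a minimal sketch in Python. All the numbers in it are made up for illustration: a 20% share of motivated students and a 10-point effect that exists only for them.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000
motivated = rng.random(n) < 0.2   # assume 20% of students are intrinsically motivated
treated = rng.random(n) < 0.5     # random assignment, as in an RCT

# Hypothetical data-generating process: the program raises scores by
# 10 points, but only for motivated students.
scores = 50 + 10 * (motivated & treated) + rng.normal(0, 5, n)

def diff(mask):
    """Treatment-vs-control difference in mean scores within a subgroup."""
    return scores[mask & treated].mean() - scores[mask & ~treated].mean()

print(f"Average effect, all students:     {diff(np.ones(n, bool)):5.1f}")  # ~2, diluted
print(f"Effect on motivated students:     {diff(motivated):5.1f}")         # ~10
print(f"Effect on non-motivated students: {diff(~motivated):5.1f}")        # ~0
```

Even though the program moves motivated students by about 10 points, the overall average lands around 2 points, which a noisy RCT could easily fail to distinguish from zero.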

Instead of just asking 'Does this program work?', it is better to also ask, 'For what type of students does the program work?'. The average effects reported by RCTs can be misleading.

Update: 

Michael Clemens and Justin Sandefur wrote this amazing article on the worm wars, explaining the results of the replication study of the famous deworming paper by Michael Kremer and Ted Miguel. The example discussed there also demonstrates that average effects can be misleading.

I am reproducing the relevant text and images from the CGD article below.

[Image from the CGD article: a schematic map with the treated school (a black house) at the center, surrounded by untreated schools labeled with hypothetical spillover effects]

"The school that got treated (deworming) is the black house in the center. Each circle around the black house is some other school that didn’t get treated. The number on each of those other schools is the spillover effect from treatment at the school in the center. For example, the number could be the percentage increase in school attendance at each untreated school due to spillover effects from the treated school.

Looking at the map, in this schematic example, it's obvious that there is a spillover effect from treatment. You don't need any statistics to tell you that. Schools near the treated school have big increases in attendance; schools far away don't. It's obviously very unlikely that this pattern is just coincidence.

In the schematic picture above, using the made-up numbers there, the average spillover effect inside the green circle is 1.6. Suppose that, due to statistical noise, we can only detect an effect above 1; so this short-range effect is easy to detect.

The average effect in the 3km-to-6km ring is only 0.25. That's below our detectable threshold of 1, so we can't distinguish it from zero. Furthermore, in this example, the average spillover effect at all 76 schools inside 6km is just 0.6 — a statistical goose egg.
How would you report a correction to this mistake? There are two ways you could do it, ways that would give opposite impressions of the true spillover effects.
You could simply state that when you correct the error, the average spillover effect on all 76 schools in the correct 6km radius is 0.6, which is indistinguishable from zero. That’s an accurate statement in isolation. This is essentially all that is done in the tables of the published version of the replication paper. On that basis you could conclude, as that paper does, that “there was little evidence of an indirect [spillover] effect on school attendance among children in schools close to intervention schools.” Strictly on its own terms, that is correct. That’s the average value in all the circles in that picture.

But wait a minute. Look back at our schematic picture. It’s obvious that there is a spillover effect. So something’s incomplete and unsatisfying about that portrayal. First of all, the average spillover inside the 3km green circle is 1.6, which in this example we can distinguish from zero. So it’s certainly not right to say there is “little evidence” of a spillover effect “close to” the treatment schools. 

So how could you report this correction differently, in a way that shows the obvious spillover effect? Using the same hypothetical data from the figure above, you could show this:

[Image from the CGD article: a chart of the cumulative average spillover effect as the radius around the treated school grows]

This picture shows, again for our schematic example, the average cumulative spillover effect out to various distances from the treated school: all the schools out to 1km away, all the schools out to 2km, all the schools out to 3km, and so on.

Here, there’s a big spillover effect nearby the treated school. That effect peters out as you expand the radius. In this example, it gets undetectable (falls below 1) once you consider all the schools within 5km, because the overall average starts to include so many faraway, unaffected schools."
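The arithmetic behind this is easy to reproduce. The sketch below places 76 hypothetical schools uniformly over a 6km disk around the treated school and assigns them the spillover values from the schematic (1.6 inside 3km, 0.25 in the 3km-to-6km ring). These are the article's made-up numbers, not data from the actual study.

```python
import numpy as np

rng = np.random.default_rng(1)

# 76 hypothetical untreated schools, spread uniformly over a 6km disk,
# so most of them sit far from the treated school at the center.
distances = 6.0 * np.sqrt(rng.random(76))

# Spillovers from the schematic: large nearby, small farther out.
spillover = np.where(distances <= 3.0, 1.6, 0.25)

threshold = 1.0  # smallest effect we can statistically detect
for radius in range(1, 7):
    nearby = distances <= radius
    if not nearby.any():
        continue  # no schools this close in this draw
    avg = spillover[nearby].mean()
    verdict = "detectable" if avg > threshold else "looks like zero"
    print(f"all schools within {radius}km: n={nearby.sum():2d}, "
          f"avg spillover={avg:.2f} ({verdict})")
```

Because most of the disk's area lies far from the center, faraway schools dominate the cumulative average, and the obvious nearby effect gets buried below the detection threshold.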

Berk Ozler summarized this masterfully.

Learning: As explained earlier, the average effects of RCTs alone can be misleading, and one should be mindful of this. This highlights the importance of estimating heterogeneous treatment effects, which helps uncover the underlying phenomenon.
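One standard way to estimate heterogeneous effects is to interact the treatment indicator with the subgroup indicator in a regression. Here is a minimal sketch reusing the hypothetical motivated/non-motivated setup from above; it assumes statsmodels is available.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
motivated = (rng.random(n) < 0.2).astype(float)
treated = (rng.random(n) < 0.5).astype(float)
scores = 50 + 10 * motivated * treated + rng.normal(0, 5, n)

# scores ~ const + treated + motivated + treated:motivated.
# The coefficient on the interaction is the extra effect that
# treatment has for motivated students over non-motivated ones.
X = sm.add_constant(np.column_stack([treated, motivated, treated * motivated]))
result = sm.OLS(scores, X).fit()
print(result.summary(xname=["const", "treated", "motivated", "treated_x_motivated"]))
```

The coefficient on `treated` alone comes out close to zero, while the interaction coefficient recovers the roughly 10-point effect for motivated students.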
