Caveats of evaluations in education - Ensuring a Learning India S3 E.002

[16th post in 'Ensuring a Learning India' article series]

The previous post discussed the pitfalls of the seemingly obvious and underscored the need for evaluations that can disentangle causal relationships.

Randomized Controlled Trials (RCTs), or control group studies as they are popularly known, are a specific method of evaluating interventions to explore causal effects. The fundamental question in any such debate is the comparison with the counterfactual. If someone says that giving free textbooks to students increases outcomes, they should also be able to answer: what if the textbooks hadn't been given during that time? Would the outcomes have increased regardless? Maybe something else happened during that academic year which resulted in the increased outcomes, in which case the effect cannot be attributed to the free textbooks. The intuitive way to address this challenge is to take two randomly selected, similar groups, give textbooks to one group and not to the other. The difference in outcomes between the group which receives free textbooks (treatment group) and the group which doesn't (control group) gives the true impact of textbooks. If part of the increase in outcomes is due to children becoming more competent with age, that is equally true of both groups. Hence, the difference nets out everything else that might have happened apart from the distribution of free textbooks, as the sketch below illustrates.
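Here is a minimal sketch, with entirely made-up numbers, of how such an estimate is computed in practice: simulate random assignment, then take the difference in mean endline scores between the two groups. Influences common to both groups (such as age-related improvement) cancel out in the difference.

```python
# A toy RCT: students are randomly assigned to receive free textbooks or not.
# All numbers below are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                  # hypothetical number of students

age_gain = 5.0                            # improvement every child gets over the year
textbook_effect = 2.0                     # the "true" effect we want to recover

treated = rng.integers(0, 2, size=n)      # random assignment: 1 = gets textbooks
noise = rng.normal(0, 10, size=n)
endline = 50 + age_gain + textbook_effect * treated + noise

estimate = endline[treated == 1].mean() - endline[treated == 0].mean()
print(f"Estimated impact of textbooks: {estimate:.2f}")   # close to 2, not 7
```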

RCT evaluations have become very popular in recent years and are increasingly being used to generate evidence on various hypotheses in the development sector and to test the impact of policy interventions. This evidence is increasingly shaping the policy discourse. Just like any other phenomenon which gets reduced to simplistic observations on reaching scale and a wide audience, RCTs are also being misinterpreted. This post discusses some of the caveats in interpreting evidence from RCTs. These aren't errors or issues with the methodology of RCTs but limitations which one must be cautious of while interpreting the evidence. Researchers who conduct these evaluations understand these nuances and caveats well and are cautious about interpreting along those lines. It is important for others to do the same.


I. External Validity: If an evaluation in place A shows that distributing free textbooks to children doesn't improve learning, it doesn't mean that distributing textbooks doesn't improve learning everywhere else. It may or may not. So, one must be cautious while translating the evidence to other contexts.

II. Administrator of the evaluation: The effect of a policy intervention can also depend on the implementing capacity of the organization administering it. NGOs or researchers tend to implement interventions better than an average government's administrative structures. Hence, if the same policy is implemented, even in the same context, by an organization with different implementing capacity, it might result in different effects.

The two caveats above are relatively well known to at least some readers. The following caveats need more attention.

III. Necessary vs. Sufficient: The evidence from evaluations is generally read as: distributing textbooks - no impact, providing infrastructure - no impact, and so on. This is often misinterpreted as: building infrastructure has no impact, hence don't invest in infrastructure.

The evaluation doesn't say that infrastructure shouldn't be built. It only says that infrastructure is not the critical constraint in that context. Infrastructure is a necessary condition but not a sufficient one. The lessons from such an evaluation should only be used to reduce the space given to infrastructure in the limited bandwidth of the policy discourse; by no means does it imply that infrastructure shouldn't be provided. Provide infrastructure but don't talk much about it. There are other important things to be talked about.

IV. Theme vs. Product: Whenever you are trying to evaluate an intervention, essentially you are trying to evaluate an idea or theme. The problem is that some themes can be converted into implementable policies in numerous possible ways, while others are straightforward to implement.

For example, teacher training. Teacher training is a theme, but there are various possible ways (products) to design these trainings. When you are evaluating a teacher training, you are evaluating that particular form of teacher training and not the theme of teacher training. Learnings from this can't be generalized to the entire idea of teacher training. If an evaluation of a teacher training programme doesn't show an effect, one can't claim that 'teacher trainings' aren't effective, rather that that particular form of teacher training isn't effective.

On the other hand, consider the example of distributing free textbooks. This is also a theme, but there aren't many possible forms in which it could be implemented. So, a learning from an evaluation of free textbook distribution can be used in a more straightforward manner.

V. Average effects can be misleading: Often, the results of an evaluation give the average effect. Consider, for example, an evaluation of an online learning tool. Working with an online learning tool requires self-motivation on the part of students. Some students might have used it properly and benefited, while a large number of students who didn't use it properly saw no effect. The average effect can then turn out to be close to zero, hiding the section of students who benefited, as the sketch below illustrates. Some evaluations analyze heterogeneous effects, but for those which don't, one should be cautious of such possibilities.
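A toy illustration of this, with invented numbers: suppose only a small share of students are self-motivated enough to use the tool, and only they benefit. The average effect is heavily diluted even though the subgroup effect is large.

```python
# Invented numbers: only self-motivated students actually use the online tool
# and benefit from it; everyone else gains nothing from being offered it.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
motivated = rng.random(n) < 0.1                  # assume 10% of students are self-motivated
treated = rng.integers(0, 2, size=n)             # random assignment to the online tool

gain = np.where(motivated & (treated == 1), 8.0, 0.0)   # only motivated users benefit
score = 40 + gain + rng.normal(0, 10, size=n)

avg_effect = score[treated == 1].mean() - score[treated == 0].mean()
sub_effect = (score[(treated == 1) & motivated].mean()
              - score[(treated == 0) & motivated].mean())
print(f"Average effect: {avg_effect:.2f}")                   # small: most students gained nothing
print(f"Effect among motivated students: {sub_effect:.2f}")  # close to 8
```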

VI. Treatment on the Treated vs. Intent to Treat: If one distributes free mosquito nets in order to prevent malaria, some people might end up using the nets while others don't. An intent-to-treat (ITT) analysis compares everyone who was offered the nets with everyone who wasn't, regardless of usage, while a treatment-on-the-treated (ToT) analysis estimates the effect on those who actually used the nets. The two can show quite different effects, so one should pay attention to which type of estimate an evaluation reports; the sketch below contrasts the two.
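A sketch of the distinction, with invented numbers (and assuming, for simplicity, that only households offered nets can use them): the ITT estimate compares outcomes by assignment, and a common way to recover the ToT estimate is to scale the ITT effect by the take-up rate.

```python
# Toy mosquito-net example with made-up numbers. 'assigned' is random assignment
# to receive a free net; only some assigned households actually use it.
import numpy as np

rng = np.random.default_rng(2)
n = 2000
assigned = rng.integers(0, 2, size=n)                  # offered a free net
used = (assigned == 1) & (rng.random(n) < 0.6)         # assume 60% of those offered use it

effect_of_use = -10.0                                  # true reduction for actual users
malaria_index = 30 + effect_of_use * used + rng.normal(0, 5, size=n)

itt = malaria_index[assigned == 1].mean() - malaria_index[assigned == 0].mean()
take_up = used[assigned == 1].mean()
tot = itt / take_up              # effect on those who actually used the nets (Wald estimator)
print(f"ITT: {itt:.2f}, take-up: {take_up:.2f}, ToT: {tot:.2f}")   # roughly -6, 0.6, -10
```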

Even with all these caveats and limitations, evaluations provide valuable information about complex dynamics and help structure policy debates.

***** The meaning of test scores *****

In education, we often hear about an increase in test scores (expressed in 'standard deviations') as the result of an intervention. The term 'test scores' isn't straightforward and can have different connotations depending on the way the test is designed, administered and accounted for in the analysis. This post tries to list the various possible versions of 'test scores'.
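As a point of reference, here is a minimal sketch of how an effect 'in standard deviations' is commonly computed: the difference in mean scores between treatment and control, divided by the standard deviation of scores (conventions differ on whether the control-group, baseline or pooled SD is used).

```python
# A minimal sketch of a standardized effect size. Conventions vary; this version
# divides by the control group's standard deviation.
import statistics

def effect_size_in_sd(treatment_scores, control_scores):
    diff = statistics.mean(treatment_scores) - statistics.mean(control_scores)
    return diff / statistics.pstdev(control_scores)

# e.g. effect_size_in_sd([58, 62, 55, 61], [55, 60, 50, 57]) ≈ 0.96
```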

Tests of varying difficulty and length: The scores could be on very simple questions like 'identify the letter' or on questions of varying complexity and skills. The test paper could range anywhere from 5 questions to 50 questions. These aren't standardized, and the assessment papers used differ across evaluations. An intervention measured on simple questions can show a large increase in scores, whereas the same intervention measured on complex questions might not show the same impact.

Grade-specific question paper or the same question paper for all grades? When the students in the evaluation are from different grades, how do you measure test scores? (i) Students can be given question papers of their respective grades; the scores on these different papers (across grades) could be treated as equivalent, or adjustments could be made. (ii) The same question paper can be given to students of all grades. This is another source of variation in test design.

Overall learning scores or skill-specific scores? The weightage of different topics in a question paper is a separate debate in itself. Some might also argue that the overall performance of students isn't informative, since a student strong in algebra might not be strong in geometry, and this needs to be accounted for.

Students age with time, and so do the question papers: Interventions in education typically run for an academic year or two. If we were to administer a baseline test to a 5th grade child at the beginning of the academic year, which test paper would we use - 4th grade or 5th grade? If we use a 5th grade paper, the student hasn't yet been taught those concepts, so we might end up using the 4th grade paper. After one year, the child is in 6th grade and we might end up using a 5th grade paper. In this case, the baseline uses a 4th grade question paper and the endline uses a 5th grade question paper.

The other way to do this is to give grade 4 question papers at both baseline and endline. The difference in test scores could differ depending on whether you use grade 4 papers at both baseline and endline, or grade 4 papers at baseline and grade 5 papers at endline. Another variation is to use two different papers that share some common questions.

Raw scores vs. IRT scores: If you administer a test of 40 questions, your raw data has responses (correct/incorrect) of each student on each question. How do you now calculate the score of the child? One way is to simply sum the number of correct responses. In this method you are giving the same credit for answering a very easy question and a difficult one. To adjust for this, people use Item Response Theory (IRT). This is another way of aggregating a child's test score from responses to individual questions, one which accounts for the difficulty of each question.
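To make the contrast concrete, here is a toy sketch (invented data, and a deliberately simplified model: item difficulties are assumed known and the ability score is picked by grid search, whereas real IRT scoring also estimates the item parameters from the data):

```python
# Toy comparison of a raw sum score and a simplified IRT-style (Rasch) score.
# All data are simulated; real IRT analyses estimate item difficulties as well.
import numpy as np

rng = np.random.default_rng(4)
n_students, n_items = 500, 40
difficulty = np.linspace(-2, 2, n_items)          # hypothetical item difficulties
ability = rng.normal(0, 1, n_students)            # hypothetical student abilities

# Simulate right/wrong answers: higher ability and easier items -> more correct.
p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((n_students, n_items)) < p_correct).astype(float)

# Raw score: every question carries the same weight, easy or hard.
raw_score = responses.sum(axis=1)

# Simplified IRT score: for each student, pick the ability value on a grid that
# best explains their particular pattern of right and wrong answers.
grid = np.linspace(-4, 4, 161)
p_grid = 1 / (1 + np.exp(-(grid[:, None] - difficulty[None, :])))      # (grid, items)
loglik = responses @ np.log(p_grid).T + (1 - responses) @ np.log(1 - p_grid).T
irt_score = grid[loglik.argmax(axis=1)]

# The two scores are related but not interchangeable, so analyses based on them
# can yield different standardized effect sizes.
print(np.corrcoef(raw_score, irt_score)[0, 1])
```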

The test scores in an evaluation could mean raw scores or IRT scores. Even within IRT scores, there are different ways of aggregating, causing another variation in the meaning of test scores. The impact of an intervention could differ based on the type of scores being used. Abhijeet Singh notes here:

"when using an (unweighted) raw score, private schools in rural India appear to have an impact of about 0.28 SD on English test scores, a figure nearly identical to figures reported in Muralidharan and Sundararaman (2013) who also score tests similarly; however, using a normalized IRT score, the estimated impact is more than 0.6 SD."

There are two other issues worth noting apart from the above design aspects.

Ceiling effect: When the test paper is very easy, or when the students being tested are at a much higher skill level than the test paper, baseline scores tend to be high. When the scores are already high, there isn't much room for them to increase. For example: if you test '1 + 1 = ___' on 10th grade students, there is a high probability that all students get it correct at baseline itself, and there isn't much room to improve if the endline is also measured on the same questions.

Children not serious while taking the test: The child may not be serious on the particular day of the test. The child may know everything but might not answer just because his/her mood isn't good.

Thus, when we hear the term 'test scores', it could have a wide variety of meanings depending on the context, and the results can change depending on the test design. This makes evaluations in education challenging.

These aren't unsolvable problems, and researchers often do take care, to the maximum extent possible, to ensure appropriate measurement, but much remains to be done and much more clarity is needed in this respect. Meanwhile, this information gap could be bridged if evaluation studies shared the test papers and the details of test design along with the research papers or reports.

For a discussion on the usage of the terms 'effect size' and 'standard deviation', refer to Abhijeet Singh's post "How standard is a standard deviation?"
