Tests of varying difficulty and length: The scores could come from very simple questions such as 'identify the letter' or from questions of varying complexity that test different skills. A test paper could have anywhere between 5 and 50 questions. These tests aren't standardized, and the assessment papers differ across evaluations. An intervention measured on simple questions can show a large increase in scores, whereas the same intervention measured on complex questions might not show the same impact.
Grade-specific question paper or the same question paper for all grades? When the students in the evaluation are in different grades, how do you measure the test scores? (i) Students can be given question papers of their respective grades. The scores on these different papers (across grades) could be treated as equivalent, or adjustments could be made. (ii) The same question paper can be given to students of all grades. This choice is another source of variation in the test design.
Overall learning scores or skill-specific scores? The weightage of different topics in a question paper is a separate debate in itself. Some might also argue that the overall performance of students isn't helpful, since a student who is strong in algebra might not be strong in geometry, and an overall score hides this difference.
Students age with time and so do the question papers: Interventions in education typically run for an academic year or two. If we were to administer a baseline test to a 5th grade child at the beginning of the academic year, which test paper would we use: 4th grade or 5th grade? If we use a paper of 5th grade level, the student hasn't yet been taught those concepts, so we might end up using the 4th grade paper. After one year, the child is in 6th grade and we might end up using a paper of 5th grade level. In this case, the baseline is measured on a 4th grade paper and the end line on a 5th grade paper.
The other way to do this is to give the grade 4 question paper at both baseline and end line. The measured change in test scores could differ depending on whether you use grade 4 papers at both baseline and end line, or a grade 4 paper at baseline and a grade 5 paper at end line. Another variation is to use two different papers that have some questions in common.
Raw scores vs IRT scores: If you administer a test of 40 questions, your raw data has the response (correct/incorrect) of each student on each question. How do you now calculate the score of the child? One way is to simply count the number of correct responses. With this method, you give the same credit for answering a very easy question as for answering a difficult one. To adjust for this, people use Item Response Theory (IRT). This is another way of aggregating a child's responses to individual questions into a test score, one that accounts for the difficulty of each question.
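To make the contrast concrete, here is a minimal sketch in Python. It assumes the simplest IRT model (the Rasch, or one-parameter, model) and assumes the item difficulties are already known. The function name rasch_ability and all the difficulty values are made up for illustration. In a real evaluation the difficulties are estimated from the response data, usually with a dedicated package. The sketch shows two papers, one easy and one hard, on which a child gets the same raw score of 3 out of 5.

```python
import math


def rasch_ability(responses, difficulties, iterations=50):
    """Maximum-likelihood ability estimate under a Rasch (1PL) model.

    responses: list of 0/1 values, one per question.
    difficulties: assumed-known item difficulties on the same scale as ability.
    """
    raw = sum(responses)
    if raw == 0 or raw == len(responses):
        # The likelihood has no finite maximum in these two cases.
        raise ValueError("estimate undefined for all-correct or all-incorrect")
    theta = 0.0
    for _ in range(iterations):
        # Probability of a correct answer on each item at the current ability.
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        gradient = raw - sum(probs)
        information = sum(p * (1.0 - p) for p in probs)
        theta += gradient / information  # Newton-Raphson step
    return theta


# Hypothetical difficulties: lower numbers mean easier questions.
easy_paper = [-2.0, -1.5, -1.0, -0.5, 0.0]
hard_paper = [0.0, 0.5, 1.0, 1.5, 2.0]

# The same pattern on both papers: first three correct, last two incorrect.
responses = [1, 1, 1, 0, 0]

raw_score = sum(responses)  # 3 on either paper
theta_easy = rasch_ability(responses, easy_paper)
theta_hard = rasch_ability(responses, hard_paper)

print(raw_score, theta_easy, theta_hard)
```

The raw score is 3 in both cases, but the estimated ability is higher on the hard paper than on the easy one. Note that within a single paper, the Rasch model gives the same estimate to every child with the same raw score; the difference shows up when scores from papers of different difficulty are compared.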
The test scores in an evaluation could mean raw scores or IRT scores. Even within IRT, there are different ways of aggregating, which adds yet another variation in the meaning of test scores. The estimated impact of an intervention could differ based on the type of score being used. Abhijeet Singh says here:
when using an (unweighted) raw score, private schools in rural India appear to have an impact of about 0.28 SD on English test scores, a figure nearly identical to figures reported in Muralidharan and Sundararaman (2013) who also score tests similarly; however, using a normalized IRT score, the estimated impact is more than 0.6 SD.
Ceiling effect: When the test paper is very easy, or when the students being tested are at a much higher skill level than the test paper, the baseline scores tend to be high. When the scores are already high, there isn't much room for them to increase. For example, if you test 1+1 = ___ on 10th grade students, there is a high probability that all students get it correct at baseline itself, and there isn't much to improve if the end line is also measured on the same questions.
Children not serious while taking the test: The child may not be serious on the particular day of the test. The child may know everything but not answer just because his/her mood isn't good.
Thus, when we hear the term 'test scores', it could have a wide variety of meanings depending on the context, and the results could change depending on the test design. This makes evaluations in education challenging.
For a discussion on the usage of the term 'effect size' or 'standard deviation', refer to Abhijeet Singh's post "How standard is a standard deviation?"
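The arithmetic behind the ceiling effect can be sketched in a few lines of Python. All the numbers below are hypothetical: a test out of 10 marks, an intervention that would raise every child's score by 3 marks if the paper had room for it, and two made-up groups of baseline scores. The function name observed_gain is also just for illustration.

```python
def observed_gain(baseline_scores, true_gain, max_score):
    """Average measured gain when end line scores cannot exceed max_score."""
    endline_scores = [min(score + true_gain, max_score) for score in baseline_scores]
    gains = [end - base for end, base in zip(endline_scores, baseline_scores)]
    return sum(gains) / len(gains)


# Hypothetical baseline scores out of 10.
low_baseline = [2, 3, 4, 5]      # paper is hard enough to leave room to improve
high_baseline = [8, 9, 10, 10]   # paper is too easy for these students

gain_low = observed_gain(low_baseline, true_gain=3, max_score=10)
gain_high = observed_gain(high_baseline, true_gain=3, max_score=10)

print(gain_low)   # 3.0
print(gain_high)  # 0.75
```

The underlying improvement is the same in both groups, but the paper that is too easy records only a quarter of it, because most children were already at or near full marks.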