Monday 3 July 2023
The challenges of external validity, towards an interdisciplinary discussion?
This article is available in French.
On 6 June, a seminar on external validity was held at LIEPP, organised by Anne Revillard and Douglas Besharov. The originality of this take on a well-worn subject lay in the desire to bring together, on the one hand, those who advocate an essentially counterfactual approach to evaluation and, on the other, those who work with, let's say, mixed methods. The papers presented by the former were discussed by the latter, leading to fruitful exchanges (and, of course, some mutual misunderstanding) – we obviously expect the reverse exercise in a few months' time.
The stars of the day were the counterfactualists, and this seminar was an opportunity to listen to this community discuss, and to update (or not) one’s beliefs. What, then, can we take away from the exchanges?
Before proceeding, it may be helpful to refresh the reader's memory on the notion of external validity, a concept proposed by Donald Campbell in the 1950s.
In an evaluation, or more broadly in any study aimed at establishing causal relationships, a primary concern is whether one is measuring what one intends to measure. Are there biases attributable to the design of the study, the concepts, the indicators, the data collection or processing instruments... that would affect the results? This is the idea of internal validity. But one can also ask whether the results obtained on a sample are valid outside the context of the evaluation conducted. That is the idea of external validity.
Without going into definitional debates, it can be said that the challenge is to clarify to what extent, and under what conditions, the results are generalisable to populations beyond the sample studied; and whether the results obtained (e.g. the fact that an intervention "works") can be applied in other contexts.
For evaluators, this question arises whenever a tool or method is used to generalise results. What allows us to say, for example, that the results obtained from a sample of 200 respondents accurately reflect the situation of the original population of 400 people? Similarly, questions of external validity arise when asking whether the evaluated intervention could be scaled up or applied to other individuals or contexts.
However, external validity is of particular importance to the speakers of the day, all of whom are invested in the search for solutions "that work" and that could be applied by public authorities to solve social problems. This is the idea of evidence-based policy (EBP).
To find these solutions that work, they favour one approach: randomised controlled trials (RCTs). In an RCT, the intervention is applied to a randomly assigned group, in order to avoid any selection bias and strengthen internal validity. But for the same reason (and quite a few others), this procedure is known to have very weak external validity. Indeed, an intervention may "work" on a random population and in an experimental setting, but in real life it will be applied to particular populations and in a less constrained setting, in combination or in competition with many other interventions. To overcome this problem, the bet is that by multiplying RCTs of the same intervention in different contexts, and by cross-analysing the results obtained, it will be possible to say what works, and thus to build, through a systematic review process, a catalogue of interventions that can be applied to improve public policies.
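The counterfactual logic described here can be made concrete with a minimal sketch. The code below is purely illustrative, using synthetic data (the sample size, baseline outcome and "true effect" are all invented for the example): because assignment to treatment is random, the treated and control groups are comparable on average, and the effect can be estimated as a simple difference in mean outcomes.

```python
import random
import statistics

random.seed(42)

# Hypothetical illustration of the RCT estimand: with random assignment,
# the average treatment effect (ATE) is the difference in mean outcomes
# between the treated and control groups.
n = 1000
true_effect = 2.0  # invented for the example

# Simulate a population with varying baseline outcomes.
baseline = [random.gauss(10, 3) for _ in range(n)]
# Random assignment: each unit has a 50% chance of receiving the intervention.
treated = [random.random() < 0.5 for _ in range(n)]
outcome = [b + true_effect if t else b for b, t in zip(baseline, treated)]

treated_outcomes = [y for y, t in zip(outcome, treated) if t]
control_outcomes = [y for y, t in zip(outcome, treated) if not t]

ate = statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)
print(f"Estimated ATE: {ate:.2f} (true effect: {true_effect})")
```

Note that this single number is precisely the "average effect" whose status is questioned later in this article: the estimate says nothing, on its own, about where or for whom the intervention would work.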
This solution is not, and never has been, really satisfactory, as critics of the EBP have repeatedly pointed out. Yet within the field itself, the question of external validity has long been swept under the carpet.
Before we come to the observations made by the speakers of the day, let us first say a few words about the world they inhabit – that is, for the most part, the world of education and health in the United States. And the least we can say is that we have the impression that they live on another planet.
First, EBP has reached an advanced level of institutionalisation, with a myriad of organisations commissioning and funding RCTs, carrying out systematic reviews, and translating the results in simplified form for public policy actors: the evidence clearinghouses. The role assigned to academics in this evidence industry is to provide the raw material (with increasingly standardised tools for carrying out this work, as Rebecca Maynard points out). Policymakers, meanwhile, turn into assemblers – at least this is the narrative, for example, in this recent article on Mississippi.
Strangely enough, the alternative that the speakers held up as a bogeyman is not so much a policy that would not be "based on evidence" as its "judiciarisation": that is, the fact that courts in the United States can impose not only principles but also very concrete actions on educational actors. It is therefore not only public policy actors, but also judges, that researchers must address. This is a slippery slope on which they are no longer suppliers of raw material but have to recommend a particular approach.
Finally, we learn that there are companies which sell turnkey social interventions globally, with training and support included, and which use the guarantee of RCTs as a selling point – even though the evidence of effectiveness is questionable, as Julia Littell reports. This considerably sharpens the ethical issues that researchers face – at the risk of being used to support "evidence-based sales pitches".
It is therefore in a world significantly different from ours (as continental Europeans) that the speakers evolve, and it is good to keep this in mind for the rest of the article.
Let's start with (and set aside!) the most outlandish aspects, such as this definition of EBP by Larry Orr and his co-authors:
“Adopt an intervention only if it has been shown to produce impacts that are significant at the .05 level in a rigorous, multisite trial”.
Everything there is to say about randomistas has been said before, but one might have thought that the old argument – that a policy can be based on evidence that does not necessarily come from RCTs – had been heard a little. Apparently not.
In fact, the speakers present themselves as quantitativists and contend, among other things, that “we do not know how to replicate from qualitative studies”. The debates between these researchers, who have known each other for more than 30 years, are almost exclusively technical: what is the right tool, what is the right measure to avoid bias, to get closer to the “truth” of interventions (a term they did not use) and ensure a good degree of external validity. An example of procedural tropism – but let’s face it, the same obsession with methods rather than usefulness can be found throughout the evaluation field.
For their discussants (labelled "qualitativists" by a mirror effect, although they often do not recognise themselves in this description), the exchange is strange. The arguments they make – that decisions are rarely based on knowledge, that the evidence produced is put to political uses, that external factors shape the success or failure of the evaluated intervention – are quickly swept aside. Not, mind you, that the speakers think they are wrong: rather, external validity is seen as a technical issue, and these arguments are simply not the topic of the day.
At the same time, some barriers seem to come down as the conversation unfolds. Eric Hanushek describes himself as a "fighter," an active participant in controversies over educational policies. R. Maynard calls for taking into account the users of the data produced (whom she calls "developers" and "consumers": another example of the major differences in viewpoint between the two sides of the Atlantic) and for working in multidisciplinary teams to draw on existing knowledge outside RCTs. J. Littell calls for critical reflection on the results of the studies carried out, in order to better qualify their generalisability, and suggests using the five principles explored by William Shadish in "the logic of generalisation" to do so. The political dimension of evaluation, the consideration of uses, the necessary reflexivity... Is this so different from the debates that also agitate the world of evaluation?
What is fascinating, in listening to this sample of the quantitativist epistemic community, is the constant reference (sometimes somewhat confusing to the uninitiated) to ongoing debates about external validity, and the proximity of those debates to the evaluation world's own – although the vocabulary, and the implications, are different.
The first major change from the status quo described above: the speakers acknowledge and accept that the effect of an intervention may vary from case to case, and that this variation is at least partly explainable – in their terms, treatment effect heterogeneity. The factors identified as influencing the effects of the intervention, and which are quantifiable, are called moderators. These may be, as Jeffrey Smith points out, characteristics of the target population, of the intervention, of the conditions of implementation, of the socio-economic or territorial context, and so on. Depending on the presence or absence of these moderators, it is possible, for example, to describe the estimated effects of an intervention on different subgroups of the population.
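A minimal sketch may help picture what a moderator does to the analysis. Everything here is invented for the illustration (the "urban vs rural" moderator, the effect sizes, the sample): when the effect depends on a quantifiable characteristic of the context, estimating the effect within each subgroup reveals the heterogeneity that a single average effect would hide.

```python
import random
import statistics

random.seed(0)

# Hypothetical illustration: a binary moderator (say, an urban vs rural
# setting) changes the effect of an intervention. Subgroup estimates
# expose the heterogeneity hidden behind the overall average.
n = 2000
records = []
for _ in range(n):
    urban = random.random() < 0.5      # moderator (invented)
    assigned = random.random() < 0.5   # random assignment to treatment
    effect = 3.0 if urban else 0.5     # effect depends on the moderator
    y = random.gauss(10, 2) + (effect if assigned else 0.0)
    records.append((urban, assigned, y))

def subgroup_ate(moderator_value):
    """Difference in mean outcomes within one subgroup of the moderator."""
    t = [y for u, a, y in records if u == moderator_value and a]
    c = [y for u, a, y in records if u == moderator_value and not a]
    return statistics.mean(t) - statistics.mean(c)

# The urban estimate comes out markedly larger than the rural one,
# even though a pooled analysis would report a single middling effect.
print(f"ATE in urban subgroup: {subgroup_ate(True):.2f}")
print(f"ATE in rural subgroup: {subgroup_ate(False):.2f}")
```

The practical difficulty the speakers point to follows directly: each candidate moderator splits the sample further, so documenting heterogeneity demands far more data (and far more trials) than estimating one average effect.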
Of course, this has important implications for the body of studies that needs to be conducted: can we be satisfied with RCTs done "at random" when we want to highlight specific configurations or patterns? How can we ensure that new studies can contribute to this work? Replication of RCTs should be undertaken, Hanushek argues – while recognising that current methodologies do not lend themselves to this and that incentives for replication are limited.
The second movement, closely related to the first, is the expressed need to explain the results obtained before judging their generalisability and possible application – a trend that has been apparent for a number of years, and is clearly visible here. This ability to explain is one of the five principles that Shadish puts forward. J. Littell clearly shows, in the case she examines, that despite two dozen RCTs and other studies, some variations remain enigmatic. Burt Barnow explores the use of mixed methods to strengthen RCT-based analyses. Beyond the methodological dimension, R. Maynard talks about the need to work as a team to cross perspectives, and about realist synthesis as a way to better understand the contexts and conditions in which interventions are deployed.
Finally, there is a recognition that the effects of an intervention differ depending on whether it is implemented in an experimental setting or whether it is rolled out to the greatest number. Traditionally, proponents of RCT have argued that if an intervention is effective in an experimental context, it will also be effective once deployed – provided that the implementation meets a number of standards. To highlight this, D. Besharov uses the example of the Head Start programme, which has been providing early schooling for poor children in the United States since the 1960s (Ah yes: on this other planet, poor children often do not have access to public education before the age of 5-6 years. They need a federal program for this). On the basis of the various studies carried out on this programme, which has accompanied half a century of American education policy and which has grown from a few thousand to several million children involved, he shows how the effect decreases with the size of the population concerned – but also varies according to the place of implementation, the time, and so on.
It would be a mistake to underestimate these three major developments. They mean that there is now some support for moving away from the idea of an intervention having ONE, average, effect – the very basis of counterfactual reasoning (one speaks of variance-based thinking) and of EBP. If there is no longer a single effect, there is also no longer a single solution that can be applied everywhere.
We are closer here to a pattern-based logic, in which one tries to find configurations, structures, motifs with an explanatory – and therefore not just descriptive – dimension. This is a significant step, especially since the participants apparently arrived at this evolution on their own, and seemed almost unaware that this type of reasoning is consubstantial with theory-based evaluation!
Finally, the consequences in terms of generalisation are obvious. The first is that there is no longer any “revealed truth.” In fact, the speakers’ reasoning leads them into a kind of aporia: if effects are now associated with “configurations of factors”, and if only RCTs are deemed capable of saying what is or is not a moderator, then a gigantic task awaits researchers working on these issues. In the meantime, perhaps the use of, among other things, theory-based evaluations and mixed-methods studies would help add some degree of plausibility to the claims made about the effects of interventions. But not everyone in the room seemed convinced.
The second is the need to take the context of implementation into account in order to determine what is generalisable or not. This is the problem identified by L. Orr: it is not enough to identify patterns in the cases studied; one must also be able to identify them in the case where an application is envisaged. Here too, the consequences are dizzying, since researchers can no longer simply produce knowledge: they must also work on the reception side.
Finally, it may invite a form of modesty: accepting that interventions “work” under certain conditions rather than universally. The speakers complained that the clearinghouses simplify the results of the studies carried out, and that public policy actors take observations on the effects of interventions at face value – when this has in fact been the selling point of the EBP movement for 20 years: saying “what works”, full stop.
At the end of this stimulating day, we can tell ourselves that there is something to discuss. One thinks in particular of ways of generalising in complicated or complex contexts: Yin’s analytical generalisation, Ragin and Rihoux’s modest generalisation, and Pawson’s realist synthesis come to mind, and perhaps more generally the reflections around what can be learned from an intervention in a complex setting.
Conversely, there is much for evaluators working with theory-based approaches to learn. Enormous progress has been made in recent years in structuring these approaches. Theories of change are now more comprehensive and more substantial, better able to take into account the different layers of factors at play. And yet we find ourselves envious of the rigour of the processes used to define the 'moderators' and their influence on the effects obtained, and of the procedures used to test these theories.
The fact remains that this day was, perhaps above all, a first contact, and that, despite the general goodwill of the exchanges, several "layers of distance" remain to be overcome: between Europe and the United States, between the holders of different paradigms and approaches, between the group of speakers, who know each other very well, and that of the discussants – who are only just discovering each other – and perhaps, we would add, in terms of age difference. How can we collaborate deeply enough to change practices together? Is it possible to do so without a specific framework? The last word probably belongs to E. Hanushek: we are only at the beginning if we want to be serious about external validity.