General Discussion
This dissertation aimed to explore technological and methodological innovations that enhance the impact and precision of psychological treatment research. I derived this goal from a specific assumption: that mental disorders, psychological treatment, and thus our research field itself are context-sensitive. I argued that this context-sensitivity is closely related to heterogeneity, and worked towards a more formal definition of this concept. This background is somewhat theoretical. Yet, if the assumption holds, we arrive at a promising idea: to account for the granular, multi-faceted nature of mental disorders and their treatment, we need research methods that master heterogeneity more effectively.
The six articles I presented above address this goal from different perspectives. The MARD concept in Article 1 is arguably the most ambitious project, and the most wide-ranging in scope. Its key premise is that we need new ways to centralize, systematize and, ultimately, “make use” of existing evidence. Much ink has been spilled on the evidence-practice gap in healthcare research, i.e., the observation that it takes years, often decades, before research evidence is translated into real-life policy (Hanney et al., 2015; Morris et al., 2011). The developed infrastructure alone cannot resolve this issue, but I believe it gives a convincing answer to how evidence synthesis may be reorganized in the future to better serve the various stakeholders who rely on it. In its vision, the Metapsy infrastructure also aligns well with similar projects leveraging living evidence databases to systematize research fields and make their findings more reusable (Cipriani et al., 2023).
I see a great asset in the variety of ways in which MARDs, and the Metapsy infrastructure in particular, can be extended in future development cycles. These extensions range from the centralized implementation of improved methodologies (such as the deployment of new risk of bias guidance and associated tools; see e.g. metapsy.org/rob/assistant) to the supplementation of existing databases with IPD warehouses. As mentioned in Article 1, this requires major shifts in funding priorities, and a greater recognition of research software development as important scholarly work in its own right.
Articles 2 to 6 are more granular in their approach, but they share several common themes. One of these themes is personalization or, more generally, “precision” in the way we administer treatments. Articles 2 and 3 provide an illustrative case. Article 2 can be seen as the build-up to the project: it defines a clearly circumscribed target group (patients with clinically relevant depressive symptoms) and the concrete way in which we want to intervene (“indirect” treatment using a specific digital stress management intervention), and it ascertains that this approach is effective, even though only “on average” (see the PATH statement, which also mandates such a preliminary step; Kent, Paulus, et al., 2020). Article 3 adds more nuance. It operates under the assumption that, in practice, we never treat “average” patients, but individuals; in fact, average effects as conventionally calculated in treatment research are nothing but aggregates of causal treatment effects in each selected individual.
In contrast to the average treatment effect, such ITEs are inherently unobservable, but they do exist; in fact, they are the basic building blocks that make the concept of an “average” causal effect possible in the first place (Imbens & Rubin, 2015, chap. 1.10; Harrer, Cuijpers, et al., 2023). Borrowing terminology from analytic philosophy, we could say that the ATE “supervenes” on the ITEs: an ATE in a population can only change if the makeup of ITEs in that particular population changes (Lewis, 1994; McLaughlin & Bennett, 2023).
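In the potential outcomes notation of Imbens & Rubin (2015), this relationship can be stated compactly. The following formalization is the standard one from the causal inference literature, not notation introduced in the articles themselves:

```latex
\begin{align*}
  \tau_i &= Y_i(1) - Y_i(0)
    && \text{(ITE; unobservable, as only one potential outcome is realized)} \\
  \tau_{\mathrm{ATE}} &= \mathbb{E}\left[ Y_i(1) - Y_i(0) \right]
    = \mathbb{E}\left[ \tau_i \right]
    && \text{(ATE as the expectation over all ITEs)}
\end{align*}
```

In this notation, the ATE can only change if the distribution of the individual effects τ_i changes, which is precisely the supervenience relation described above.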
ITEs are not directly identifiable, but we may get closer to them by quantifying “individualized” versions of the average treatment effect. In all approaches employed in this dissertation (and, in fact, in all adjacent methods I am aware of), this is achieved by conditioning: patients are segmented into smaller and smaller subgroups based on their covariate values, leading to more and more granular “conditional” average treatment effects. Article 3 briefly touches upon the inherent challenges of this procedure. It requires us, at least to some extent, to correctly capture the true underlying mechanism leading to differential treatment benefits.
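To make this conditioning logic concrete, consider a minimal simulated example (hypothetical Python code written purely for illustration; the covariate and effect sizes are invented and not taken from Article 3):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Simulated binary covariate (e.g., high baseline symptom severity)
# and a randomized treatment indicator
severe = rng.integers(0, 2, n)
treat = rng.integers(0, 2, n)

# Data-generating process: treatment helps severe patients more
y = 0.5 * treat + 0.8 * treat * severe + rng.normal(0, 1, n)

# Average treatment effect (ATE): mean contrast across all patients
ate = y[treat == 1].mean() - y[treat == 0].mean()
print(f"ATE: {ate:.2f}")

# Conditional average treatment effects (CATEs): the same contrast,
# computed within increasingly fine-grained covariate strata
for s in (0, 1):
    stratum = severe == s
    cate = (y[stratum & (treat == 1)].mean()
            - y[stratum & (treat == 0)].mean())
    print(f"CATE (severe={s}): {cate:.2f}")
```

With every additional covariate, the strata shrink rapidly, which is exactly why data-adaptive algorithms become necessary in this setting, and why they are fragile.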
It is clear that data-adaptive algorithms applied in such a setting are likely to display model optimism (i.e., overfitting), especially in the micronumerous settings in which they are typically applied in our field (the brief simulation below illustrates this point). Yet, this problem is well known (Kent et al., 2018; Riley et al., 2019). I want to point to a perhaps even more profound limitation: even if we were able to identify the complex mechanism at play in a particular predictive context; even if the model is easily applicable, and its required inputs can easily be obtained from patients; even then, it may still be useless in most practical contexts.
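The optimism problem is easy to demonstrate (again a hypothetical sketch, not an analysis from this dissertation): a flexible learner fit to a small sample with mostly uninformative predictors reports near-perfect apparent performance, yet transfers far less well to new data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)

# Small ("micronumerous") training sample, many noisy predictors
n_train, n_test, p = 60, 2000, 20
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))

# Only the first covariate carries real signal
y_train = X_train[:, 0] + rng.normal(0, 1, n_train)
y_test = X_test[:, 0] + rng.normal(0, 1, n_test)

model = RandomForestRegressor(random_state=1).fit(X_train, y_train)

# Apparent (training) performance vastly overstates what the model
# achieves on data it has never seen
print("Apparent R^2:", round(r2_score(y_train, model.predict(X_train)), 2))
print("Test R^2:    ", round(r2_score(y_test, model.predict(X_test)), 2))
```

Even a well-calibrated guard against such optimism, however, does not address the more profound limitation raised above, that a model may fail even under otherwise ideal conditions.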
This is related to the argument of strong context-sensitivity in psychological treatment research, which I have tried to develop in this dissertation. Under strong context sensitivity, even a model that perfectly describes the functional form in a particular research setting will fail; for the simple reason that the model itself cannot account for contextual changes in the functional form it tries to capture. I quote from Cartwright (1999, p. 104), who develops a similar argument on the limitations of directed acyclic graph (DAG)-based techniques:
“Contrary to the hopes of many […] I argue that these methods are not universally applicable, and even where they work best, they are never sufficient by themselves for causal inferences that can ground policy recommendations. For sound policy, we need to know not only what relations hold, but also what will happen to them when we undertake changes.”
This reasoning is one of the main motivations behind the “meta-analytic” predictive modeling I employed in this dissertation. A major strength of this approach is that it explicitly models the unexplained heterogeneity that arises when the same intervention is provided across different settings; and that it provides a straightforward method to assess not only the “brute” performance of a model, but also how well it transports to different contexts. After a phase of initial optimism, it is now increasingly recognized that reaching high test sample performance is not the main obstacle for prognostic models in psychiatric research. The true challenge is to devise predictive architectures that retain their value across all the multi-faceted ways in which treatments interact with patients and a myriad of other contextual factors in the real world (see, e.g., Chekroud et al., 2024). A “meta-analytic” approach will not fully resolve this issue, but I believe it does provide a much more robust way to gauge the generalizability of a model; the lack of which, as I argued in the introduction, remains an essential contributor to the “crisis” facing quantitative psychological research.
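One common way to operationalize such a transportability check is to hold out each study (or setting) in turn, train on the remaining studies, and evaluate on the held-out one, a procedure often called internal-external cross-validation. The sketch below is a minimal, hypothetical illustration, assuming a pooled multi-study dataset with `study` and `outcome` columns; it is not the exact pipeline used in the articles:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def internal_external_cv(df: pd.DataFrame, predictors: list[str]) -> pd.Series:
    """Leave each study out in turn: fit on all other studies,
    evaluate on the held-out study. Strongly varying held-out
    performance signals that the model does not transport well."""
    scores = {}
    for study in df["study"].unique():
        train = df[df["study"] != study]
        test = df[df["study"] == study]
        model = LinearRegression().fit(train[predictors], train["outcome"])
        scores[study] = r2_score(test["outcome"],
                                 model.predict(test[predictors]))
    return pd.Series(scores, name="held_out_r2")
```

It is the spread of these held-out scores, rather than any single pooled value, that speaks to a model’s behavior across contexts.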
Although this remains to be determined by future research, I conjecture that more robust modeling approaches could reveal that the capacities of quantitative prediction in our field are inherently limited. This follows almost logically if we assume strong context-sensitivity of psychological treatment: if much of the functional form we aim to capture depends on context, relationships that hold across contexts will often reduce to only very basic associations; viz., broad “brushstrokes” as they appear with regularity in everyday EBM research (e.g., “treatment X is effective on average”, “waitlists lead to higher treatment estimates”, “treatment effects depend on initial symptom severity”). In Article 3, for example, we find that even the best-performing “meta-analytic” model leaves more than 70% of the variation in outcomes unexplained (although this may differ for more tailored performance metrics, which have since been developed; see Efthimiou et al., 2023). However, as shown in the article, even such performances may be sufficient to stratify patients according to their expected benefits.
In Article 6, Figure 2 presents what we could loosely call a “sufficient statistic” (Stigler, 1973) of causal treatment effects: if the estimated treatment benefit distributions sufficiently approximate reality (with a strong “if”), they encapsulate everything there is to know about the effect of a treatment across contexts. Targeted learning approaches as employed in Article 6 may help to shift our focus from point estimates to the inherent variability of treatment effects (VTE; Levy et al., 2021). However, this will not resolve the more fundamental issue that these remain estimates: largely unverifiable, and obtained from real-world datasets prone to measurement error, systematic and non-systematic missingness, clerical errors, and many other known or unknown limitations.
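Expressed in the potential outcomes notation from above (a standard formalization, not the articles’ own), the VTE is simply the variance of the ITE distribution:

```latex
\mathrm{VTE} = \operatorname{Var}\left( Y_i(1) - Y_i(0) \right) = \operatorname{Var}(\tau_i)
```

Like the ITEs themselves, this quantity is not directly identifiable from observed data and can typically only be bounded or estimated under additional assumptions, which underscores the caveat above.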
In sum, the results of this dissertation strongly suggest that quantitative prediction can improve our understanding of treatment effects; and that its robustness can be improved by factoring in cross-contextual variability. Nevertheless, these models remain inherently local: confined to a clearly circumscribed interventional context and patient group; (at best) tightly supervised when implemented in practice; and (hopefully) discarded as soon as their practical utility wanes. This “niche” approach is far removed from visions of an all-encompassing “computational model” of psychiatric treatment; and, if our assumption of strong context-dependence holds, such a model may be unattainable altogether.
Arguably, this idea runs counter to commonly held beliefs about how “hard” sciences should operate. In essence, it denies the possibility of “one great scientific theory” into which all the intelligible phenomena of nature can be fitted: a unique, complete and deductively closed set of precise statements (Cartwright, 1999, p. 6). Yet, it is exactly this stance I am willing to put forward if we want to fully acknowledge psychiatric research as a “contextual” science. This conception, it should be noted, is shared by many modern philosophers of science (e.g., Dupré, 1993; Feyerabend, 1980; Hacking, 1983; Kitcher, 2001). I quote from Cartwright (1999, p. 1):
“We live in a dappled world, a world rich in different things, with different natures, behaving in different ways. The laws that describe this world are a patchwork, not a pyramid. They do not take after the simple, elegant and abstract structure of a system of axioms and theorems.”
Local predictive modeling architectures may help to navigate small pockets of this “dappled” world; infrastructures such as MARDs may help in others. In the introduction (p. 12ff.), I tried to investigate whether psychotherapy research has “progressed” as a science. I believe there is compelling evidence that, even in the absence of grand theoretical breakthroughs, practical progress is possible. I have, for example, mentioned that the majority of individuals with mental disorders do not receive an evidence-based treatment. Even though the effects of such treatments are only modest, an increase in treatment utilization alone could therefore strongly improve the lives of hundreds of millions of individuals with untreated mental health problems.
In the 1980s, it was still argued that the prevention of depressive episodes was impossible (Lobel & Hirschfield, 1984). Yet, evidence compiled over the last two decades shows that this is clearly not the case, and that the incidence of such episodes can be reduced by about one third using fairly conventional psychological interventions (Buntrock et al., 2024). As I have argued in the introduction (p. 8ff.), advances in technology (in contrast to “technique”) continue to facilitate the way in which psychological treatment can be disseminated, and meta-analytic prediction models as developed in this dissertation may lead to further incremental improvements.
Far removed from “rich mathematics” or “great scientific theory”, there may be many other small, local capacities that can be leveraged to enhance the efficacy of psychological treatment. It has been shown only recently, for example, that psychotherapy effects do not increase with the total number of sessions, but may strongly improve if contents are provided with greater intensity (i.e., in two sessions instead of one per week; Ciharova et al., 2024).
Such insights do not stem from theorems with great explanatory power, or from complex computational models. Instead, they are grounded in the practical tradition of psychotherapy research. Coincidentally, this perspective aligns well with a “revisionist” view of science as a whole, as embraced in contemporary philosophy (Cartwright, 1989, p. 1):
“The content of science is found not just in its laws but equally in its practices. We learn what the world is like by seeing what methods work to study it and what kinds of forecasts can predict it.”