From prediction to measurement

Laurin Plank; Armin Zlomuzica; Oscar Kjell; Gabriel Bonnin

Abstract

Language-based assessments (LBAs) offer a promising approach to measuring psychological constructs such as depression from natural language. However, current supervised LBA models risk conflating the target construct with correlated traits (e.g., socioeconomic status, gender), raising concerns about construct validity. Here we propose domain-generality — the ability of a model trained in one domain to transfer to another — as a necessary, though not sufficient, marker for distinguishing construct measurement from mere prediction. Given these concerns, we examined whether Contextualized Construct Representation (CCR) — a recently proposed approach that gauges constructs by measuring the semantic similarity of text to a psychological questionnaire — would offer a more domain-general alternative to supervised LBA. We evaluated both approaches across three datasets (N = 1,181) spanning routine outpatient anamnesis texts, social media posts, and targeted language probes. Both CCRs and supervised LBA demonstrated domain-generality alongside strong convergent, divergent, and criterion validity. Interestingly, domain-generality emerged in supervised LBA selectively when trained on language specifically targeting depressive symptomatology. In such cases, the learned construct representation closely resembled the questionnaire-defined construct representation. Together, these findings support CCR as a domain-general psychometric tool for assessing depression from natural language and advance ongoing efforts to improve the interpretability and validity of LLM-based mental health assessments.

Note

If you want to suggest changes to the prose of the manuscript, please use the following Google Docs Link

TODO / Bounties

check whether categorization of depression_diagnosis is good, i.e., whether all depression diagnosis strings are caught by our method [Gabriel]
Test stats model assumptions visually
re-write abstract
Adapt results and discussion

Notes from our meeting:

Bring up the novel tests that we propose earlier
General claim should be that some validation procedures are regularly employed, but that this collection is incomplete, not that there is no validation
Integrate results from Oscar’s 2021 Harmony paper (already inserted in bibliography file)
Correlation between LBA scores across methods (scatter plot of CCR vs supLBA and see where predictions potentially depart)
Test incremental validity empirically?
Conceptual question about whether our framework is equally applicable to all common rating scales. A discussion was held about the case of wellbeing rating scales and LBAs. Is there the same expectation as to the causal relation between indicator and construct? Wellbeing is often expressed not directly, but instead behaviors are listed which lead to greater wellbeing (eg, spending time with friends). Also: How does our framework relate to formative as opposed to reflective measurement models?

Introduction

Rating scales are the predominating tool to measure psychological constructs. Psychological rating scales have been employed for decades and can be used in various contexts – from tightly controlled laboratory experiments to naturalistic research using ecological momentary assessment (Shiffman et al., 2008). What makes rating scales useful, in large part, is their face validity, ease of administration, and standardization. Yet, some of the shortcomings of rating scales, such as their susceptibility to biases and closed-ended nature, have motivated the development of alternative psychometric instruments (O. N. Kjell et al., 2019). That such tools are desirable is further emphasized by the fact that psychological researchers now frequently encounter study contexts where administering a rating scale might be unfeasible (e.g., big social media data) or strictly impossible (e.g., deceased individuals and historical records) (Atari et al., 2023; Chen et al., 2024; Plank & Zlomuzica, 2024b).

A growing body of work has demonstrated that analyses of natural language can yield rich insight into a person’s states or traits (O. N. Kjell et al., 2024; Plank & Zlomuzica, 2024a; Tausczik & Pennebaker, 2010). Language-based assessments (LBAs) aim to assess psychological constructs from natural language (Nilsson et al., 2026; Wright et al., 2026). While early approaches relied on word dictionaries, modern LBAs leverage transformer neural networks [i.e., Large Language Models (LLMs)] which are trained on vast amounts of language data and possess a rich contextualized language understanding (Devlin et al., 2019; Lake & Murphy, 2023; Vaswani et al., 2017). LBAs have been shown to enable accurate predictions of participants’ scores on mental health rating scales assessing depression (Gu et al., 2025; J.-J. Lee et al., 2026), anxiety (Gu et al., 2025; Stade et al., 2023), post-traumatic stress (O. Kjell et al., 2026), as well as positive mental health dimensions such as harmony (O. Kjell et al., 2021) and well-being (O. N. Kjell et al., 2022). The performance of LBAs now frequently approaches measurement-theoretic ceilings established by the internal and test-retest reliability of rating scale questionnaires [a measure of ground-truth noise (O. N. Kjell et al., 2022, 2024)].

Despite their ability to generate highly accurate predictions of self-reported constructs and their potential for streamlining and scaling psychological assessment, there has been a paradoxical lack of clinical deployment of LBAs (Cohen, 2019). A major reason for this lack of translation is the fact that most LBAs remain rather crudely, if at all, evaluated on psychometric grounds (Cohen, 2019; Cohen et al., 2022). In technical fields such as computer science, assessments are appraised mainly in terms of precision or accuracy which are broadly analogous to the concepts of reliability and validity in classical test theory (Cohen, 2019). Yet even in psychological science, LBAs are frequently deployed under the implicit assumption that they provide valid construct assessments without formal proof, or are validated narrowly in terms of convergent validity – that is, convergence of language-assessed scores with self-report scores (Bilgrami et al., 2022; Dohnány et al., 2026; Eberhardt et al., 2025). The use of insufficiently validated measurement instruments raises major concerns for both research and clinical settings. In clinical settings, where LBAs are hoped to soon guide clinical decision making (Hüppi et al., 2025), improper validation undermines the trustworthiness of LBAs and risks false diagnoses and erroneous treatment choices (Plank & Zlomuzica, 2025). In research, it risks false conclusions about what construct was measured and raises doubt about research findings that are generated downstream of such measurement (Flake et al., 2017). To establish LBAs as mainstay tools and move beyond perpetual proof-of-concept studies, it is vital that they are held to the strict standards of psychological test theory (Cohen, 2019; Cohen et al., 2022; O. N. Kjell et al., 2024).
Promisingly, recent works have seen a shift towards more comprehensive psychometric evaluations of LBA that extend beyond convergent validity, probing such phenomena as divergent and criterion validity (Cohen et al., 2022; O. N. E. Kjell et al., 2019; Wright et al., 2026).

There is increasing recognition of the importance of evaluative frameworks for AI-based assessments in psychiatry more generally (Chandler et al., 2020). Yet, to our knowledge, no dedicated framework has thus far been defined for LBA. Since LBAs are treated as though they are psychological tests, we contend that their evaluation should also be grounded in psychological test theory. The present article submits just such a framework for the validation of LBA. When LBAs are cast in terms of psychometric theory, it is revealed that their reliance on language as a behavioral readout positions them as a unique case which requires validation procedures not seen in traditional rating scale validation. This necessitates an extension of classical validation procedures which is provided here in the form of a set of tests targeting domain generalization, interpretability, and ecological validity of LBAs.

The article is structured as follows: Readers are first briefly introduced to neural text embedding methods which lay the technical groundwork for LBA. Afterwards, two predominating contemporary approaches to LBA are presented and formalized in terms of psychometric theory. Finally, a novel evaluation framework, composed of a set of empirical and substantive tests of validity, is presented and illustratively applied to LBA of depression symptoms.

Semantic embedding models

Contemporary LBA relies on a class of methods originating from natural language processing research known as semantic embedding models. We here provide only a brief functional and technical introduction to these models and refer readers elsewhere for more detailed expositions (Lake & Murphy, 2023).

Semantic embedding models allow quantifying the semantics, i.e. content or meaning, of texts (Lake & Murphy, 2023). To this end, pieces of texts, such as words, sentences, or paragraphs, are transformed into high-dimensional vectors known as embeddings (Mikolov et al., 2013). Having embedded multiple pieces of texts into semantic space, one may compute the distance between them which indicates their semantic similarity (Lake & Murphy, 2023). For example, the two words “sad” and “unhappy” will reside closer in semantic space than the words “sad” and “happy”. Because embeddings are high-dimensional, distance is assessed not based on Euclidean but angular metrics, namely the cosine similarity (Lake & Murphy, 2023). Cosine similarity measures the angular alignment of two embeddings. A perfect alignment with a cosine similarity of 1 indicates two semantically identical texts. An opposite alignment with a cosine similarity of -1 indicates two maximally dissimilar texts. There are two central properties of semantic embeddings that are important to note here and which will be returned to later: For one, semantic information is distributed over the entire vector rather than locally contained in any individual dimension (Piantadosi et al., 2024). Embeddings are also relational in the sense that their isolated position is meaningless but their position relative to other embeddings carries semantic information (Piantadosi et al., 2024).
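The cosine similarity described above can be illustrated with a minimal sketch. The three-dimensional vectors below are hypothetical stand-ins chosen by hand for illustration; they are not output from any embedding model, and real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" (assumption: hand-picked values, not model output).
sad = [0.9, 0.1, 0.2]
unhappy = [0.8, 0.2, 0.3]
happy = [-0.7, 0.3, 0.1]

sim_sad_unhappy = cosine_similarity(sad, unhappy)  # close to 1: similar direction
sim_sad_happy = cosine_similarity(sad, happy)      # negative: opposing direction
```

Because the measure depends only on the angle between vectors, rescaling either vector leaves the similarity unchanged.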

There are various types of semantic embedding models, but all models follow the general notion of distributional semantics – words that co-occur or that occur in similar contexts have shared meaning (Firth, 1957; Harris, 1954; Lake & Murphy, 2023). Thus, to generate models which can represent the semantics of texts, word co-occurrence patterns across large language corpora are analyzed (Mikolov et al., 2013). Modern embedding models (and modern LBAs) rely on artificial neural networks and will be the focus in this article (Lake & Murphy, 2023). While there are implementational differences, a common feature is that they are trained to predict missing words based on their context and adapt their internal representation (i.e. the mapping from words to embeddings) according to their ability to do so (Devlin et al., 2019; Mikolov et al., 2013). While early embedding models were rather context-free and only able to provide embeddings of isolated words, more recent models can generate contextualized embeddings that consider the context within which a word was used (Devlin et al., 2019). To do this, they use a specific neural network architecture known as the transformer (Vaswani et al., 2017). The two defining features of transformers are their ability to model word order (“I am hungry” != “Am I hungry”) and token dependencies (Vaswani et al., 2017). The latter causes word vectors to adapt to their surrounding contexts such that, for example, polysemic words like “bank” are adapted to encode different meanings based on context (e.g., finance vs. rivers) (Lake & Murphy, 2023). These features allow transformers to generate semantic embeddings for sentences and paragraphs whose meaning is compositional (i.e., more than the sum of the meaning of their constituent words) (Lake & Murphy, 2023). Resulting models generate semantic embeddings that perform extraordinarily well on a wide variety of text-based tasks (Devlin et al., 2019).

Contemporary approaches to LBA

The standard approach to LBA relies on semantic embedding models and supervised learning and will hereafter be referred to as supervised LBA. Supervised LBAs learn to predict scores on self-report scales based on language samples in a data-driven fashion. To this end, language-rating pairs are collected from participants. Language samples are then embedded (typically using transformers) and input as predictors into multiple linear regression models with rating scale scores as the criterion (O. N. Kjell et al., 2022, 2024; Teitelbaum & Simchon, 2025). The ability of supervised LBA models to predict rating scales can be assessed by measuring the overlap between language-predicted scores and self-reported scores (O. N. Kjell et al., 2022; Teitelbaum & Simchon, 2025). To avoid overfitting and gather unbiased estimates of performance, performance is measured on language-rating pairs not seen during training.
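The supervised workflow described above (embed, regress, evaluate on unseen pairs) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline of the cited studies: the "embeddings" and "ratings" are simulated, the embedding has only three dimensions, and plain least squares fitted by gradient descent stands in for whatever regression variant a real analysis would use. All function and variable names are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def predict(weights, bias, embedding):
    return bias + sum(w * e for w, e in zip(weights, embedding))

def fit_linear(X, y, lr=0.05, epochs=500):
    """Least-squares linear regression fitted by full-batch gradient descent."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [predict(weights, bias, x) - t for x, t in zip(X, y)]
        grad_w = [2 / n * sum(e * x[j] for e, x in zip(errors, X)) for j in range(d)]
        grad_b = 2 / n * sum(errors)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Simulated language-rating pairs (assumption: synthetic data, not study data).
rng = random.Random(0)
true_w = [1.0, -0.5, 0.0]
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [sum(w * e for w, e in zip(true_w, x)) + rng.gauss(0, 0.3) for x in X]

# Hold out pairs that the model never sees during training.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

weights, bias = fit_linear(X_train, y_train)
held_out_r = pearson_r([predict(weights, bias, x) for x in X_test], y_test)
```

The held-out correlation, rather than the training fit, is the quantity that corresponds to the performance estimate discussed in the text.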

Recently, a theory-driven approach to LBA has been proposed that does not require training on data in a supervised learning sense. Contextualized Construct Representation (CCR) probes how closely a participant’s language aligns with the items of a rating scale (Atari et al., 2023; Teitelbaum & Simchon, 2025). For example, an individual with high depressive symptoms might produce language that more closely reflects items of a depression rating scale such as “I am so sad and unhappy that I can’t stand it” (Beck et al., 1996). To this end, items of validated rating scales are embedded into semantic space. The CCR is then defined as the average (centroid) of item embeddings and the semantic similarity of a text to the CCR indicates how much the construct is present in the text (Atari et al., 2023). Teitelbaum & Simchon (2025) note that CCRs possibly reflect not only the target construct but also the style in which rating scale items are formulated (“Questionnaire-ness”), and they propose anchored CCR to address this issue. To generate anchored CCRs, averaged embeddings of negatively formulated items (i.e., items that reflect construct absence) are subtracted from the averaged embeddings of positively formulated items (i.e., items that reflect construct presence) (Grand et al., 2022; Teitelbaum & Simchon, 2025). In theory, this “partials out” the questionnaire-ness of items and leaves only the construct-relevant language to define the vector representation (Teitelbaum & Simchon, 2025). Current evidence suggests that CCR can enable LBAs which correlate with rating scales for the constructs of perceived locus of control and moral concerns (Atari et al., 2023; Teitelbaum & Simchon, 2025).
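A minimal sketch of the two construct representations follows. It assumes a hypothetical two-dimensional embedding space in which the first dimension carries construct content and the second carries "questionnaire-ness"; real embeddings do not separate these so cleanly, and all values and names are illustrative only.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Dimension-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def ccr_vector(item_embeddings):
    """CCR: centroid of the embedded rating scale items."""
    return centroid(item_embeddings)

def anchored_ccr_vector(positive_items, negative_items):
    """Anchored CCR: centroid of construct-present items minus centroid of construct-absent items."""
    pos, neg = centroid(positive_items), centroid(negative_items)
    return [p - q for p, q in zip(pos, neg)]

# Hypothetical item embeddings: [construct content, questionnaire-ness].
positive_items = [[0.8, 0.6], [0.6, 0.8]]    # items expressing construct presence
negative_items = [[-0.8, 0.6], [-0.6, 0.8]]  # items expressing construct absence

ccr = ccr_vector(positive_items)
anchored = anchored_ccr_vector(positive_items, negative_items)

# A participant text high in construct content but written in a non-questionnaire style.
text_embedding = [0.9, 0.1]
ccr_score = cosine_similarity(text_embedding, ccr)
anchored_score = cosine_similarity(text_embedding, anchored)
```

In this toy setup the subtraction cancels the shared style dimension, so the anchored score for the informally written text is higher than the plain CCR score. Whether this holds for a given real model and scale is an empirical question.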

Measurement versus prediction

Cronbach (1949) defined psychological tests as systematic procedures for comparing the behavior of two or more people (Furr, 2021). While the behavior in question in modern psychology is most often a participant’s response on a rating scale, this classical definition allows for other types of behavior (such as language) to constitute psychological measurement. When LBAs are considered as psychological tests, their evaluation should be grounded in psychometric test theory. Validity is the most fundamental and important property of a psychological test and concerns whether the test measures what it intends to measure (Grimm & Widaman, 2012). More formally, a test may be considered valid when A) the target construct exists and B) variation in the construct is causal to variation in the test’s items (Borsboom et al., 2004). In reflective measurement models, observable variations in behavior (e.g., rating scale item responses) are caused by or reflective of variation in the latent construct (e.g., depression) (Borsboom et al., 2004).

Borsboom et al. (2004) emphasize that the causal relation between indicator and construct is most central to the validity definition. Moreover, they differentiate between substantive validation, which largely takes place at the level of test construction when indicators are selected that are believed to reflect the construct, and empirical validation, which provides circumstantial evidence of validity through convergent, divergent and criterion-oriented analyses. For rating scales, substantive validation is straightforward because items are interpretable verbal descriptions. We may reasonably assume that depression causes changes in how much a person agrees with the statement “I am so sad and unhappy that I can’t stand it” (Beck et al., 1996). For both predominating approaches to LBA, however, the path to establishing validity is considerably less straightforward, and current validation practices leave important gaps.

In supervised LBA, indicators are semantic embedding dimensions which, as discussed earlier, carry no isolated meaning (Piantadosi et al., 2024). The indicator-construct-relationship is not defined a priori but learned through supervised methods. It may thus be characterized as a stable statistical association without merit for a causal claim. The theory-driven construction stage that Borsboom et al. (2004) identify as a major source of validity is therefore largely absent. Substantive validation can be approximately probed via word cloud visualization, where words are plotted based on whether they tend to occur in texts with high or low language-assessed construct scores (O. Kjell et al., 2023). While this method is a useful way to inspect and validate a model’s internal workings, it is limited because models are expected to learn semantic patterns that extend beyond individual words. In fact, the choice of using transformers is explicitly grounded in this notion. Because models do not generate predictions from isolated words, inspecting model predictions only at that resolution will often be insufficient. Some relevant patterns that emerge only in composition (such as grammatical errors which could be indicative of educational attainment) are entirely lost in this approach. The responsibility of establishing validity is then almost entirely placed on empirical validation procedures which can show that LBAs exhibit convergent, divergent, or criterion validity (Cohen et al., 2022; O. N. Kjell et al., 2024; O. N. E. Kjell et al., 2019). The sum of results of these validation procedures is taken as evidence of the test’s total level of construct validity, which is typically conceived as a matter of degree – tests are considered more or less valid depending on the total sum of the evidence derived from a variety of validation procedures (Grimm & Widaman, 2012).
The fact that LBAs overwhelmingly tend to conform to tests of convergent, divergent, and criterion validity has led to the conclusion that they are valid measurement tools (O. N. Kjell et al., 2024; Wright et al., 2026). Yet here a deeper issue arises.

In critique of validation procedures that attempt to maximize criterion validity, Borsboom et al. (2004) note that choosing indicators based on their ability to predict a criterion will lead to lower construct validity, because highly correlated indicators (which reflect a singular coherent construct) tend to be multicollinear and therefore yield subpar predictive performance. Supervised LBAs combine informationally dense semantic representations with data-driven methods that operate on the singular objective of optimizing predictive accuracy. A well-performing model therefore leverages not only semantic patterns reflective of the target construct but exhausts all information that is predictive of it. Consider, for example, a supervised LBA trained to predict self-reported depression symptoms. There will exist many third variables that are both related to depression and simultaneously known to be reflected in language use (e.g., age, gender, income, educational attainment) (Giorgi et al., 2022; Sap et al., 2014). It is therefore entirely possible for a supervised LBA model to satisfy empirical tests of convergent, divergent, and criterion validity as well as substantive tests of word cloud visualization, while its measurement of the target construct is systematically confounded by the simultaneous, unintended measurement of other constructs. An analogy can be drawn to a depression rating scale which includes participants’ gender as an item because this improves the prediction of depression diagnoses [which have a higher prevalence in women (Nolen-Hoeksema, 2001)]. Detecting such confounding lies beyond the reach of the standard validation procedures.

CCR addresses several of these concerns but introduces others that are equally invisible to conventional validation methods. Because CCRs derive measurement from semantic similarities between participant language and theoretically grounded scale items, the indicator-construct relationship is specified a priori and may plausibly be regarded as causal. Also, much of its content validity is inherited from rating scales that have been validated over decades (Atari et al., 2023) which substantiates the claim that chosen verbal descriptions reflect the target construct as opposed to merely predicting it. Yet this apparent advantage comes with its own unexamined risks. The construct representation might be confounded by the academic linguistic register in which rating scale items are formulated (Teitelbaum & Simchon, 2025). Furthermore, the stereotypical phrasing of rating scale items might poorly represent how constructs are expressed in naturalistic settings, causing CCR to miss domain-specific expressions such as neologisms and euphemisms in online language (e.g., “sewer slide” as a euphemism for “suicide” to bypass automated content filters on social media platforms) (Steen et al., 2023). Finally, while CCR operates on interpretable and seemingly unconfounded semantic similarity computations, the embedding space itself may be biased. That biases and stereotypes are inherent to semantic embedding models is well-established (Bolukbasi et al., 2016). Training on vast uncurated corpora causes models to learn the inherent biases of human language use, which could yield, for example, systematically higher similarity to depression statements for texts disclosing female gender. Crucially, none of these issues would be detected by standard convergent, divergent, or criterion validity analyses, because such analyses only assess whether CCR scores covary with rating scales and external criteria, not whether the similarity computation itself is biased or confounded.

In sum, supervised LBA and CCR face distinct but equally consequential threats to their construct validity that are not detectable by the convergent, divergent, and criterion validity analyses which currently constitute the primary and often the sole procedure of LBA validation. This motivates the development of an extended evaluation framework which complements these standard procedures with tests specifically designed to surface such threats. The framework proposed below provides just such a set of procedures.

A psychometric evaluation framework

The following framework proposes a set of validation procedures. We note that neither any single procedure nor the entire set is to be viewed as conclusive evidence of construct validity. Substantive and empirical procedures are further understood as complementary. All procedures are summarized in Table 1.

Convergent, Divergent, and Criterion Validity

To test convergent validity, measurements derived from LBA are related to other measures of the same construct. The best fitting candidate measure to establish construct validity is dependent on the target construct but for many psychological constructs will be scores on a self-report rating scale. To establish divergent validity, correlations with the self-report measures of the target construct can be contrasted with self-report measures of a different but related construct. Finally, criterion validity can be established by examining whether LBA scores correlate in expected ways with a network of criteria that have previously been validated. For example, LBA scores of depression may be related to the clinical diagnosis of a depressive disorder or the number of visits to mental healthcare clinics (Gu et al., 2025; Grimm & Widaman, 2012).
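The convergent and divergent comparisons can be sketched with simulated scores. Everything below is an assumption made for illustration: the data are synthetic, "anxiety" is simulated as a related but distinct construct, and the significance test for the difference between the two dependent correlations is deliberately omitted because the text does not specify one.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Synthetic latent constructs (assumption: anxiety partly overlaps with depression).
rng = random.Random(1)
n = 300
depression = [rng.gauss(0, 1) for _ in range(n)]
anxiety = [0.6 * d + 0.8 * rng.gauss(0, 1) for d in depression]

# Noisy observed scores: two rating scales and one language-based assessment.
rating_target = [d + rng.gauss(0, 0.5) for d in depression]
rating_nontarget = [a + rng.gauss(0, 0.5) for a in anxiety]
lba_scores = [d + rng.gauss(0, 0.7) for d in depression]

r_convergent = pearson_r(lba_scores, rating_target)    # LBA vs. target rating scale
r_divergent = pearson_r(lba_scores, rating_nontarget)  # LBA vs. related, non-target scale
divergent_gap = r_convergent - r_divergent             # positive gap supports divergent validity
```

In applied work the gap would additionally be tested with a procedure for comparing dependent correlations, and the criterion analyses would follow the same pattern with an external criterion in place of the rating scale.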

Content Validity

Content validity concerns whether a psychological test covers the entire spectrum of the construct to be assessed (Grimm & Widaman, 2012). Content validity is appraised using substantive methods, namely expert evaluations (Grimm & Widaman, 2012). A collection of procedures with varying levels of formalization can be used to evaluate the content validity of LBAs.

First, a large portion of the degree of content validity is determined a priori at test construction (Borsboom et al., 2004). The content validity of CCR is thus to some extent inherited from the rating scale it is based on. After items have been chosen, and possibly slightly reformulated or negated, a next step is to determine whether item content also represents a coherent construct to the semantic embedding model. To formalize this, we can compare the semantic similarity of items within the scale [e.g., Beck’s Depression Inventory (Beck et al., 1996)] to the similarity of items of the target scale to items of different but related scales (anxiety or stress inventories) (Grand et al., 2022). Furthermore, we may probe whether items of the target scale are more related to items of another scale measuring the same construct [e.g., Patient Health Questionnaire (Kroenke et al., 2001)] than to items of different but related scales. While the above procedures are applicable only to CCR, there are some procedures that also apply to supervised LBA. To inspect whether the test carries information of the entire spectrum of the construct, experts may manually annotate texts based on whether they carry information about the construct or a subdomain thereof. These annotations may then be classified based on the text’s LBA scores. If statistically significant classification is consistently possible, the target domain and all investigated facets are captured by the test. Other approaches include word cloud visualization (O. Kjell et al., 2023) and manual inspection of corresponding pairs of LBA scores and texts.
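The two item-level similarity comparisons for CCR can be sketched as follows. The item embeddings are hypothetical three-dimensional toy vectors, not embeddings of actual inventory items, and the helper names are illustrative. The sketch only computes the mean similarities; a significance test for the differences is not shown.

```python
import math
from itertools import combinations, product

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_within(items):
    """Mean cosine similarity over all distinct item pairs within one scale."""
    pairs = list(combinations(items, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

def mean_between(items_a, items_b):
    """Mean cosine similarity over all item pairs across two scales."""
    pairs = list(product(items_a, items_b))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical item embeddings (assumption: toy values for illustration only).
depression_scale_1 = [[0.9, 0.2, 0.1], [0.8, 0.3, 0.0], [0.85, 0.1, 0.2]]
depression_scale_2 = [[0.7, 0.2, 0.3], [0.9, 0.1, 0.1]]
anxiety_scale = [[0.2, 0.9, 0.1], [0.3, 0.8, 0.2], [0.1, 0.85, 0.0]]

# Check 1: items cohere more within the target scale than with a related scale.
internal = mean_within(depression_scale_1)
external = mean_between(depression_scale_1, anxiety_scale)

# Check 2: target items are closer to another scale of the same construct
# than to a scale of a different but related construct.
same_construct = mean_between(depression_scale_1, depression_scale_2)
other_construct = mean_between(depression_scale_1, anxiety_scale)
```

Both checks pass in this toy example by construction; with real item embeddings, the size of the gap indicates how distinctly the embedding model separates the constructs.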

Representational Interpretability

The lack of interpretability of LLMs is often discussed as undermining their trustworthiness (Hüppi et al., 2025). A central reason is their black box nature whereby the internal representation of LLMs is not necessarily interpretable to humans. There are ample efforts in the field of computer science known as mechanistic interpretability to find methods which improve our understanding of the complex mechanisms underlying LLMs (Sharkey et al., 2025). Improving interpretability in the application of AI in psychiatry has generally proven quite difficult with some authors concluding that “While we believe in striving toward explainability, a more realistic goal is transparency and generalizability.” (Chandler et al., 2020). We find that semantic embedding models provide a promising opportunity for more detailed inquiries into how constructs are learned to be represented internally.

The proposed validation procedure involves comparing the learned internal representation of the target construct in supervised LBAs with the theory-defined construct representation as defined via CCR. If a model has learned to represent the target construct, its internal representation should be similar to the CCR. The representational similarity can be determined by computing the cosine similarity between CCR and the set of coefficients stored in the fitted regression model. Importantly, CCR serves here not as a validated gold standard but as a theory-derived expectation, i.e. what the model should have learned if it has captured the target construct as operationalized by the underlying rating scale. Analyses presented later in this article demonstrate that the use of partial least squares regression and anchored CCR might be particularly suitable for this analysis.
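Because a linear model assigns one coefficient to each embedding dimension, its coefficient vector lies in the same space as the CCR and the two can be compared directly. The sketch below assumes hypothetical three-dimensional vectors and treats the intercept as excluded from the comparison, since it has no counterpart in the embedding space; none of the values come from a fitted model.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical theory-defined representation (e.g., an anchored CCR vector).
ccr_representation = [1.4, 0.0, 0.2]

# Hypothetical coefficient vectors of two supervised models (intercepts omitted).
coefficients_aligned = [0.7, 0.05, 0.1]    # weights mostly on the construct direction
coefficients_confounded = [0.1, 0.9, 0.0]  # weights mostly on an unrelated direction

similarity_aligned = cosine_similarity(coefficients_aligned, ccr_representation)
similarity_confounded = cosine_similarity(coefficients_confounded, ccr_representation)

# Cosine similarity ignores overall coefficient magnitude, so rescaling the
# coefficients (e.g., due to a differently scaled criterion) does not change it.
rescaled = [10 * c for c in coefficients_aligned]
similarity_rescaled = cosine_similarity(rescaled, ccr_representation)
```

A high value for the first model and a low value for the second would suggest, under these assumptions, that only the first has converged on something resembling the questionnaire-defined construct.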

Domain Generalization

Domain generalization concerns the ability of models to solve tasks in evaluation domains that differ from the training domain and is a classical problem in natural language processing research (Hupkes et al., 2023; Pan & Yang, 2009). Domain generalization indicates whether models have truly learned to solve a task or instead have learned simple heuristics which serve as effective solutions only in the training domain (Hupkes et al., 2023). The rationale for probing the domain generalization of supervised LBA is similar; if a supervised model has converged on a representation of the target construct itself, its performance should be largely invariant to the domain in which construct-relevant language is produced, because the construct (not the domain) is the common cause of the relevant linguistic variation. If, by contrast, a model rests in part on domain-specific correlates – for example, an LBA model trained on social media texts that picks up features associated with, but not reflective of, depressive symptoms – performance should deteriorate when the model is deployed in domains where these correlates are absent or differently distributed. Importantly, if confounding correlates are present in all tested domains and are also similarly distributed, non-valid models would also generalize well. Domain generalization is therefore a necessary but insufficient criterion of validity and should hence be only one of many validation procedures.

There will often not be any prior knowledge as to whether the construct is expressed in the investigated domain. A practical solution to this issue is to first establish a benchmark by correlating CCR-derived scores with rating scale scores. Successful domain generalization can then be asserted if text-rating correlations fall around this benchmark. The importance of deploying trained LBAs across novel domains is increasingly being recognized (Nilsson et al., 2026), although empirical evidence for or against their ability to generalize is markedly lacking. Indeed, some have voiced doubts about models’ ability to generalize to datasets that diverge too strongly from the training domain (Nilsson et al., 2026; Teitelbaum & Simchon, 2025). One way to test generalization is to test LBA models trained in one research site on data accrued in another. The re-use of LBA models is facilitated by the LBAM package in R which stores fitted LBA models and enables simple deployment on novel data (Nilsson et al., 2026).
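The benchmark comparison can be sketched with simulated scores from an evaluation domain B. The data are synthetic, the function name `generalizes` is hypothetical, and the `tolerance` of 0.15 is an arbitrary illustrative threshold: the text does not define how close to the benchmark a correlation must fall.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def generalizes(slba_scores, ccr_scores, ratings, tolerance=0.15):
    """Compare r(sLBA, Rating) in domain B against the CCR benchmark r(CCR, Rating).

    `tolerance` is a hypothetical threshold chosen for illustration only.
    """
    r_slba = pearson_r(slba_scores, ratings)
    r_benchmark = pearson_r(ccr_scores, ratings)
    return abs(r_slba - r_benchmark) <= tolerance, r_slba, r_benchmark

# Synthetic domain B data (assumption: simulated, not study data).
rng = random.Random(2)
n = 500
rating_b = [rng.gauss(0, 1) for _ in range(n)]
ccr_scores_b = [r + rng.gauss(0, 1) for r in rating_b]

# Two models trained in domain A and applied to domain B texts:
# one tracks the construct, the other relied mostly on domain-A-specific correlates.
slba_general_b = [r + rng.gauss(0, 1) for r in rating_b]
slba_specific_b = [0.2 * r + rng.gauss(0, 1) for r in rating_b]

general_ok, r_general, r_benchmark = generalizes(slba_general_b, ccr_scores_b, rating_b)
specific_ok, r_specific, _ = generalizes(slba_specific_b, ccr_scores_b, rating_b)
```

As noted in the text, passing this comparison is necessary but not sufficient: a model that relies on correlates distributed similarly in both domains would pass as well.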

Ecological Validity

The development of LBAs often takes place in settings where participants are guided to produce language that will more strongly reflect their score on the target construct. During clinical interviews, participants are guided to talk about their mental health symptoms (First, 2014). Experimental methods sometimes provide even stronger guidance as to the content that participants should produce. Gu et al. (2025) designed a language elicitation method where participants were asked to describe their depression and worry symptoms which yielded strong text-rating correlations. While such settings are important for the development and validation of LBAs, they are relatively artificial and possibly do not reflect language use in naturalistic environments. Thus, it is vital to prove that LBAs measure a behavior which naturally occurs. An arguably highly naturalistic setting can be found in social media language which is produced spontaneously and collected retrospectively (Plank & Zlomuzica, 2025), thus circumventing behavioral modulation due to study participation [e.g., Hawthorne effect (Adair, 1984)].

Validation Procedure	Core Question	Statistical Operationalization
Convergent Validity	LBAs converge with other measures of the target construct (e.g., rating scales)	$r$(LBA, Rating) > 0, $p$ < .05
Divergent Validity	LBAs converge less with measures of different but related constructs	$r$(LBA_target, Rating_non-target) < $r$(LBA_target, Rating_target), $p$ < .05
Criterion Validity	LBAs are related to external criteria known to be associated with the target construct	Concurrent, Predictive, Postdictive: $r$(LBA, Criterion) > 0, $p$ < .05; Incremental: $R^2$(LBA + Rating, Criterion) > $R^2$(LBA, Criterion), $p$ < .05
Content Validity	LBAs cover the entire spectrum of the target construct	CCR: CCR is based on validated rating scale; $\cos$(internal) > $\cos$(internal, external), $p$ < .05; $\cos$(Construct A item set 1, Construct A item set 2) > $\cos$(Construct A item set 1, Construct B item set 1). sLBA + CCR: inspection of text-LBA pairs; prediction of expert annotations; word cloud visualization
Ecological Validity	The target construct is naturally expressed in language	$r$(LBA, Rating) > 0, $p$ < .05, in spontaneously produced language (e.g., social media)
Domain Generalization	LBAs generalize across different settings	$r$(sLBA_Domain A, Rating_Domain B) ≈ $r$(CCR, Rating_Domain B)
Representational Interpretability	Overlap of learned and theory-defined construct representations	$\cos$(sLBA_{Representation}, CCR_{Representation})

Note. $r$ = correlation, $\cos$ = cosine similarity, CCR = Contextualized Construct Representation, LBA = language-based assessment (either CCR or supervised LBA), sLBA = supervised LBA, Rating = scores on self-report rating scale.
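
The domain-generalization row of the table can be illustrated with a minimal sketch. It assumes that supervised LBA scores, CCR scores, and rating scale scores are available for the same texts in the target domain. The function names (`pearson_r`, `generalization_gap`) and the simulated data are hypothetical and not part of the analysis pipeline, and the sketch deliberately leaves open how close to the benchmark a correlation has to fall.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def generalization_gap(slba_scores, ccr_scores, ratings):
    """Compare the sLBA text-rating correlation with the CCR benchmark."""
    r_slba = pearson_r(slba_scores, ratings)
    r_ccr = pearson_r(ccr_scores, ratings)
    return {"r_slba": r_slba, "r_ccr": r_ccr, "gap": r_slba - r_ccr}


# Simulated target-domain data (illustrative only, not real scores)
random.seed(0)
ratings = [random.gauss(0, 1) for _ in range(200)]
ccr_scores = [0.5 * x + random.gauss(0, 1) for x in ratings]
slba_scores = [0.5 * x + random.gauss(0, 1) for x in ratings]

result = generalization_gap(slba_scores, ccr_scores, ratings)
print(result)
```

A gap near zero would be in line with the operationalization $r$(sLBA_Domain A, Rating_Domain B) ≈ $r$(CCR, Rating_Domain B); a clearly negative gap would suggest that the supervised model falls short of the benchmark in the new domain.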

Methods

This manuscript and all accompanying analyses are generated from a Quarto notebook (Allaire et al., 2022)¹. The notebook and all data required to replicate the analyses are provided in a GitHub repository². Readers are encouraged to inspect the code behind the analyses and to reproduce them under different analytical choices. Pre-processing and NLP analyses were conducted mostly in Python (Van Rossum & Drake Jr, 1995), and statistical analyses were conducted in R (Team et al., 2016).

In [1]:

library(svglite)
library(yaml)

library(reticulate)
library(tidyr)
library(trackdown)
#### ---- download processed data from: SCIEBO LINK
#### ---- place data files into a folder of your choice
#### ---- define your data path in the config/paths.yaml file
PUBLIC_DATA_PATH <- yaml.load_file("config/paths.yaml")$data_dir
SECURE_DATA_PATH <- yaml.load_file("config/paths.yaml")$secure_data_dir
#### ---- create virtual env for project
# pandas pinned to 2.x: reticulate py_to_r() does not yet support pandas 3.0
if (!virtualenv_exists("psyvec_env")) {
  virtualenv_create(envname = "psyvec_env")
  virtualenv_install("psyvec_env", packages = c(
    "pandas==2.2.3", "numpy", "scikit-learn",
    "torch", "transformers", "sentence-transformers",
    "flair", "redditcleaner", "pyreadr",
    "textblob-de", "textblob==0.17.1"
  ))
}
use_virtualenv("psyvec_env", required = TRUE)

Ethics

All procedures presented in this study are in accordance with the Declaration of Helsinki. Next to re-analyses of existing data-sets, we present a novel data-set of mental health anamnesis texts. All procedures concerning the collection and analysis of this data were approved by the university’s local ethics committee (approvals #318 and #431).

Datasets

META-FBZ

META-FBZ comprises data (N = 601) that were routinely gathered at a university outpatient mental health center in Germany between 2017 and 2024. The data-set includes socio-demographic variables (age, sex, education, vocational qualification, and current work ability), mental health diagnoses according to the DSM-5 (ascertained by a clinical psychologist through a semi-structured clinical interview), psychometric questionnaires [Beck-Depression-Inventory-II (Beck et al., 1996; Kühner et al., 2007), Depression Anxiety Stress Scale 21-item version (Lovibond & Lovibond, 1995; Nilges & Essau, 2015)], and open-ended patient narratives designed to assess key aspects of their mental health concerns, functional impairments, and expectations for treatment. Responses to the following seven questions were analyzed in this study:

Problem development: “Briefly describe how the problems for which you are seeking treatment have developed over time.”
Extra stressors: “What causes you stress in addition to your everyday problems (e.g., finances, housing situation)?”
Pre-onset changes: “Did something special change in your life before the onset of your symptoms? (e.g., death of an important person, divorce or separation, change in work situation or income, addition to the family)”
Event connection: “Do you see a connection between the event(s) and the development of your problems?”
Physical symptoms: “Are there any physical side effects when your problems occur?”
Problem description: “Finally, please describe in your own words the problems for which you would like treatment.”
Impacted life areas: “In which areas of your life do these problems limit you (e.g., job, relationship)?”

Open-ended responses that were available only in paper-pencil format were voice recorded and transcribed using a local instantiation of the speech recognition model whisper-large-v2³. To anonymize texts, personal information such as names of persons, places, or organizations was substituted with placeholders using named-entity-recognition⁴ implemented in the flair library. We included only responses with at least five words and appended all text responses to yield a single text per patient. Sociodemographic and clinical characteristics of the sample are reported in Table 1.

In [2]:

import os
import pandas as pd
import pyreadr
from flair.data import Sentence
from flair.models import SequenceTagger
from textblob_de import TextBlobDE
# load data (r.SECURE_DATA_PATH is defined in the R setup chunk and accessed via reticulate)
result = pyreadr.read_r(os.path.join(r.SECURE_DATA_PATH, "merged_data.rds"))
df = result[None]
tagger = SequenceTagger.load("flair/ner-german-large")
def anonymize_text(text):
    if pd.isna(text):
        return text  # Return as is if no text
    
    sentence = Sentence(text)
    tagger.predict(sentence)
    # Get entities with spans and labels
    entities = sentence.get_spans('ner')
    # Sort entities by their start character position (left to right)
    entities = sorted(entities, key=lambda entity: entity.start_position)
    anonymized_text = text
    offset = 0  # Track index shift due to replacement length differences
    for entity in entities:
        start = entity.start_position + offset
        end = entity.end_position + offset
        label = entity.get_label('ner').value
        if label == 'PER':
            replacement = '[PERSON]'
        elif label == 'LOC':
            replacement = '[ORT]'
        elif label == "ORG":
            replacement = "[ORGANISATION]"
        else:
            continue
        # Replace entity text with replacement token
        anonymized_text = anonymized_text[:start] + replacement + anonymized_text[end:]
        # Update offset to reflect changed text length
        offset += len(replacement) - (end - start)
    return anonymized_text
text_cols = ["txt1_problem_development",
 "txt2_extra_stressors",
 "txt3_pre_onset_changes",
 "txt4_event_connection",
 "txt5_physical_symptoms",
 "txt6_problem_causes",
 "txt7_expected_improvements",
 "txt8_environment_response",
 "txt9_no_change_requested",
 "txt10_problem_description",
 "txt11_impacted_life_areas",
 "txt12_therapy_goals"] 
# Anonymize every text column, store the result in a new column, and drop the original
for textcol in text_cols:
    df.loc[:, textcol+"_anonymized"] = df.loc[:, textcol].apply(anonymize_text)
    df = df.drop(textcol, axis=1)
select_cols = ["txt1_problem_development_anonymized",
 "txt2_extra_stressors_anonymized",
 "txt3_pre_onset_changes_anonymized",
 "txt4_event_connection_anonymized",
 "txt5_physical_symptoms_anonymized",
 "txt6_problem_causes_anonymized",
 "txt10_problem_description_anonymized",
 "txt11_impacted_life_areas_anonymized"]
# exclude text responses with less than 5 words
for textcol in select_cols:
    # remove excessive whitespace
    df[textcol] = df[textcol].str.replace(r"\s+", " ", regex=True).str.strip()
    # count words, exclude
    df[textcol+"n_words"] = df[textcol].apply(
        lambda x: len(TextBlobDE(x).words) if pd.notna(x) else 0
    )
    df.loc[df[textcol+"n_words"]<5, textcol] = pd.NA
    
# create singular text from selected responses
df["fulltext"] = df[select_cols].apply(
    lambda row: " ".join(x for x in row if pd.notna(x) and x != ""),
    axis=1
)
df = df.loc[df["fulltext"]!="", :]
# only texts with accompanying BDI-II values
df = df.loc[df["bdi_sum_DU-Prä"].notna(), :].reset_index(drop=True)
# count words in fulltext
df["n_words"] = df["fulltext"].apply(
        lambda x: len(TextBlobDE(x).words) if pd.notna(x) else 0
    )
# determine primary depression diagnosis (1 = yes, 0 = no)
depressive_diagnoses = [
    "Major Depression",                              # catches all Major Depression variants
    "persistierende depressive Störung",             # catches Dysthymie
    "depress. Störungen mit Major-Depression",       # catches ähnlicher Episode variant
    "NNB Depressive Störung"
]
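# indices of the DSM-5 diagnosis columns (dsmv_diagnosis_1, dsmv_diagnosis_2, ...)
diag_numbers = (
    df.columns.str.extract(r"^dsmv_diagnosis_([0-9]+)$")[0]
    .dropna().astype(int).tolist()
)
def has_depression_diagnosis(row):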
    for i in diag_numbers:
        type_col = f"dsmv_diagnosis_{i}_type"
        diag_col = f"dsmv_diagnosis_{i}"
        
        if pd.notna(row.get(type_col)) and pd.notna(row.get(diag_col)):
            if row[type_col] == "primary diagnosis":
                if any(dep in row[diag_col] for dep in depressive_diagnoses):
                    return 1
    return 0
df["depression_diagnosis"] = df.apply(has_depression_diagnosis, axis=1)
df.to_csv(r"C:\Users\BACNeuro\Desktop\laurin\META-DATA\Meta_texts\meta.csv", index=False)

In [3]:

library(dplyr)

library(stringr)
meta_full <- readRDS(file.path(SECURE_DATA_PATH, "merged_data.rds"))
# Same 8 select columns used in Python preprocessing (on original, non-anonymized text)
select_cols_orig <- c(
  "txt1_problem_development", "txt2_extra_stressors",
  "txt3_pre_onset_changes",   "txt4_event_connection",
  "txt5_physical_symptoms",   "txt6_problem_causes",
  "txt10_problem_description","txt11_impacted_life_areas"
)
count_words_simple <- function(x) {
  if (is.na(x)) return(0L)
  length(strsplit(trimws(x), "\\s+")[[1]])
}
# Set texts with < 5 words to NA
for (col in select_cols_orig) {
  meta_full[[col]] <- str_trim(str_replace_all(meta_full[[col]], "\\s+", " "))
  n_w <- vapply(meta_full[[col]], count_words_simple, integer(1))
  meta_full[[col]][n_w < 5L] <- NA_character_
}
# Keep rows that have at least one valid text response AND a BDI-II score
fulltext_tmp <- apply(meta_full[select_cols_orig], 1, function(row) {
  paste(row[!is.na(row) & row != ""], collapse = " ")
})
meta_full <- meta_full[fulltext_tmp != "" & !is.na(meta_full[["bdi_sum_DU-Prä"]]), ]
rownames(meta_full) <- NULL
# Derive depression_diagnosis from raw DSM-V diagnosis columns
depressive_diagnoses <- c(
  "Major Depression",
  "persistierende depressive Störung",
  "depress. Störungen mit Major-Depression",
  "NNB Depressive Störung"
)
diag_numbers <- grep("^dsmv_diagnosis_[0-9]+$", names(meta_full), value = TRUE) |>
  str_extract("[0-9]+") |>
  as.integer()
meta_full$depression_diagnosis <- factor(
  vapply(seq_len(nrow(meta_full)), function(i) {
    for (d in diag_numbers) {
      type_val <- meta_full[[paste0("dsmv_diagnosis_", d, "_type")]][i]
      diag_val <- meta_full[[paste0("dsmv_diagnosis_", d)]][i]
      if (!is.na(type_val) && !is.na(diag_val) && type_val == "primary diagnosis") {
        if (any(vapply(depressive_diagnoses,
                       \(dep) grepl(dep, diag_val, fixed = TRUE), logical(1))))
          return(1L)
      }
    }
    0L
  }, integer(1)),
  levels = c(0L, 1L), labels = c("No", "Yes")
)

In [4]:

library(gtsummary)

library(flextable)

vars_meta <- c(
  "patient_age_therapy_start", "patient_sex",
  "in_relationship",           "marital_status",
  "general_education",         "vocational_qualification",
  "work_ability_status",       "previous_psychotherapy",
  "CGI_severity",              "depression_diagnosis"
)
meta_full |>
  select(all_of(vars_meta)) |>
  tbl_summary(
    label = list(
      patient_age_therapy_start ~ "Age at therapy start",
      patient_sex               ~ "Sex",
      in_relationship           ~ "In relationship",
      marital_status            ~ "Marital status",
      general_education         ~ "General education",
      vocational_qualification  ~ "Vocational qualification",
      work_ability_status       ~ "Work ability status",
      previous_psychotherapy    ~ "Previous psychotherapy",
      CGI_severity              ~ "CGI severity",
      depression_diagnosis      ~ "Primary depression diagnosis"
    ),
    statistic = list(all_continuous() ~ "{mean} ({sd})"),
    digits    = list(all_continuous() ~ 2),
    missing = "ifany"
  ) |>
  modify_caption("Descriptive statistics of META-FBZ patient demographics and clinical characteristics (N = {N}).") |>
  as_flex_table()

In [5]:

Table 1

Characteristic	N = 601¹
Age at therapy start	40.25 (14.22)
Sex
male	227 (38%)
female	374 (62%)
In relationship	205 (57%)
Unknown	239
Marital status
single	194 (54%)
married	98 (27%)
divorced	49 (14%)
separated	12 (3.3%)
widowed	5 (1.4%)
other	4 (1.1%)
Unknown	239
General education
student	6 (1.7%)
no school-leaving certificate	6 (1.7%)
lower secondary school certificate	59 (16%)
intermediate secondary school certificate	103 (28%)
higher education entrance qualification	184 (51%)
other	4 (1.1%)
Unknown	239
Vocational qualification
Currently in vocational training or studying	43 (12%)
No vocational qualification	44 (12%)
Apprenticeship / vocational training	197 (54%)
University or university of applied sciences degree	56 (15%)
Other	22 (6.1%)
Unknown	239
Work ability status
Able to work	202 (56%)
Unable to work (on sick leave)	110 (30%)
Disability pension	11 (3.0%)
Old-age pension	10 (2.8%)
Other	29 (8.0%)
Unknown	239
Previous psychotherapy
no prior treatment	184 (37%)
outpatient psychotherapy	83 (17%)
inpatient psychotherapy	121 (24%)
both	101 (20%)
exact specification not available	14 (2.8%)
Unknown	98
CGI severity
Not assessable	1 (0.2%)
Normal, not at all ill	0 (0%)
Borderline mentally ill	6 (1.2%)
Mildly ill	24 (4.8%)
Moderately ill	151 (30%)
Markedly ill	246 (49%)
Severely ill	74 (15%)
Among the most extremely ill patients	1 (0.2%)
Unknown	98
Primary depression diagnosis	292 (49%)
¹ Mean (SD); n (%)

dep_wor_data

dep_wor_data is a publicly available data-set⁵ containing data from 500 participants recruited from Prolific (Palan & Schitter, 2018). It includes sociodemographic variables (age, gender), psychometric questionnaire scores measuring depression [PHQ-9 (Kroenke et al., 2001)] and generalized anxiety [GAD-7 (Spitzer et al., 2006)], information on the number of mental health-related sick leaves and healthcare visits in the past year, and open-ended text responses. Open-ended text instructions asked participants to describe their depression or worry symptoms in their own words. Different formats were used, asking participants to describe their symptoms using either words/phrases or paragraphs. For this study, we used the paragraph descriptions [for more information, see Gu et al. (2025)] and included only texts containing at least five words. For some analyses (see Section 3.2.5.1), depression scores had to be on a common scale. Consequently, PHQ-9 scores were transformed into BDI-II scores using established methods (Wahl et al., 2014).

In [6]:

import os
import pandas as pd
import pyreadr
from textblob import TextBlob
df = pyreadr.read_r(r"C:\Users\BACNeuro\Desktop\laurin\META-DATA\Meta_texts\dep_wor_data.rda")
df = df["dep_wor_data"]
#### ---- convert PHQ-9 scores to BDI-II ---- ####
PHQ9crosswalk = pd.read_excel(os.path.join(os.getcwd(), "ressources", "Wahl2014_PHQ9_Crosswalk.xlsx"))
BDIIIcrosswalk = pd.read_excel(os.path.join(os.getcwd(), "ressources", "Wahl2014_BDI-II_Crosswalk.xlsx"))
# map each PHQ-9 sum score to its latent depression value (Theta)
PHQ2Theta_dict = dict(
    zip(PHQ9crosswalk["PHQ-9"],
        PHQ9crosswalk["Theta"])
)
df["Theta"] = df["PHQ9tot"].map(PHQ2Theta_dict)
def Theta2BDI(latent_value, crosswalk_df):
    """Find BDI-II value closest to Theta value."""
    
    idx = (
        crosswalk_df["Theta"]
        .sub(latent_value)
        .abs()
        .idxmin()
    )
    
    return crosswalk_df.loc[idx, "BDI-II"]

# map each Theta value to the BDI-II sum score with the closest Theta
df["mapped_BDI_sum"] = df["Theta"].apply(
    lambda x: Theta2BDI(x, BDIIIcrosswalk)
)
# count words; keep only texts with at least five words
df["n_words"] = df["Deptext"].apply(lambda x: len(TextBlob(x).words))
df = df.loc[df["n_words"]>=5, :].reset_index(drop=True)
df.to_csv(r"C:\Users\BACNeuro\Desktop\laurin\META-DATA\Meta_texts\topics_deptexts.csv", index=False)

eRisk

eRisk is an annual competition aiming to foster the development of early mental disorder detection through computational methods. Two datasets from past eRisk competitions, which are available to qualified researchers⁶, were used in this study. Task 3 of the 2021 competition involved predicting self-reported depression symptoms (BDI-II) from the social media posts of 80 Reddit users (Parapar et al., 2021). In the present study, post titles and bodies were appended to yield single texts, and only texts with at least five words were included. Task 1 of the 2025 competition contains 11,042 sentences extracted from Reddit posts that were manually scored as being relevant (= 1) or irrelevant (= 0) to BDI-II items (Crestani et al., 2022; Parapar et al., 2025). Sentences were defined as relevant if they addressed a symptom, irrespective of whether the description was positive (i.e., the user describes having the symptom) or negative (i.e., the user describes not having the symptom). Rating was conducted by two computer scientists and one psychologist, and final ratings were defined by consensus (Parapar et al., 2025). In both eRisk datasets, texts were cleaned of Reddit-specific markdown using the redditcleaner package.

In [7]:

import os
import pandas as pd
import redditcleaner
# Load relevance labels
erlab = pd.read_csv(
    r"C:\Users\BACNeuro\Desktop\laurin\META-DATA\Meta_texts\t1-depression-symptom-ranking\qrels_consensus_merged.csv",
    index_col=False
)
# Keep only the columns needed for the lookup
cols_needed = ["query", "doc_id", "relevant"]
erlab = erlab[cols_needed]
# Group by doc_id for fast lookup
erlab_grouped = erlab.groupby("doc_id")
eriskdir = r"C:\Users\BACNeuro\Desktop\laurin\META-DATA\Meta_texts\t1-depression-symptom-ranking\erisk25-t1-dataset"
files = os.listdir(eriskdir)
rows = []
for fileind, file in enumerate(files):
    with open(os.path.join(eriskdir, file), "r", encoding="utf-8") as f:
        content = f.read()
    docs = content.split("<DOC>")
    for doc in docs[1:]:
        if "<DOCNO>" not in doc:
            continue
        docno = doc.split("<DOCNO>")[1].split("</DOCNO>")[0].strip()
        if "<TEXT>" in doc:
            text = doc.split("<TEXT>")[1].split("</TEXT>")[0].strip()
        else:
            continue
        if docno in erlab_grouped.groups:
            matches = erlab_grouped.get_group(docno)
            for _, match_row in matches.iterrows():
                row = [
                    match_row["query"],
                    match_row["doc_id"],
                    match_row["relevant"],
                    text
                ]
                rows.append(row)
    if fileind % 1000 == 0:
        print(f"Processed {fileind} / {len(files)} files | rows collected: {len(rows)}")

bdi_rel = pd.DataFrame(rows, columns=["query", "doc_id", "relevant", "text"])
bdi_rel["text"] = bdi_rel["text"].map(redditcleaner.clean)
bdi_rel = bdi_rel.loc[bdi_rel["text"].notna(), :].reset_index(drop=True)
output_path = r"C:\Users\BACNeuro\Desktop\laurin\META-DATA\Meta_texts\bdi_rel.csv"
bdi_rel.to_csv(output_path, index=False)
# clean up temporary variables ('r' is not deleted: it is reticulate's handle to the R session)
del cols_needed, content, doc, docno, docs, eriskdir, erlab, erlab_grouped, f, file, fileind, files, match_row, matches, row, rows, text
print(f"Done. Saved to {output_path}")

In [8]:

import os
import pandas as pd
from lxml import etree
from textblob import TextBlob
import redditcleaner
path = r"C:\Users\BACNeuro\Desktop\laurin\META-DATA\Meta_texts\T3\ground-truth_eRisk2021_T3.txt"
cols = ["sub_id"] + [f"bdi_{i}" for i in range(1, 22)]
erisk21_bdi = pd.read_csv(path, sep=r"\s+", header=None, names=cols)
erisk21_bdi.loc[erisk21_bdi["sub_id"]=="aerisk2021-T3_Subject9", "sub_id"] = "erisk2021-T3_Subject9"
def sum_bdi_items(row):
    bdi_vals = row[row.index.str.contains("bdi", case=False, na=False)]
    
    numbers = bdi_vals.astype(str).str.extract(r"(-?\d+\.?\d*)")[0]
    
    return pd.to_numeric(numbers, errors="coerce").sum()
erisk21_bdi["bdi_sum"] = erisk21_bdi.apply(sum_bdi_items, axis=1)
folder = r"C:\Users\BACNeuro\Desktop\laurin\META-DATA\Meta_texts\T3\eRisk2021_T3_Collection"
parser = etree.XMLParser(recover=True)
rows = []
for file in os.listdir(folder):
    if file.endswith(".xml"):
        path = os.path.join(folder, file)
        tree = etree.parse(path, parser)
        root = tree.getroot()
        sub_id = root.findtext(".//ID")
        # all writings
        writings = root.findall(".//WRITING")
        for w in writings:
            rows.append({
                "sub_id": sub_id,
                "file": file,
                "title": (w.findtext("TITLE") or "").strip(),
                "date": w.findtext("DATE"),
                "info": w.findtext("INFO"),
                "text": (w.findtext("TEXT") or "").strip(),
            })
df = pd.DataFrame(rows)
df = pd.merge(df, erisk21_bdi, how="left", on="sub_id")
df["title"] = df["title"].fillna("")
df["text"] = df["text"].fillna("")
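# append title and body, separated by a space so that words do not run together
df["text"] = (df["title"] + " " + df["text"]).str.strip()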
df = df.loc[df["text"]!="", :].reset_index(drop=True)
# clean post of reddit markdown
df["text"] = df["text"].map(redditcleaner.clean)
# exclude texts with less than 5 words
df["n_words"] = df["text"].apply(lambda x: len(TextBlob(x).words))
df = df.loc[df["n_words"]>=5, :].reset_index(drop=True)
# convert all letter ratings to ints
df.loc[:, df.columns.str.contains("bdi")] = df.loc[:, df.columns.str.contains("bdi")].replace(r'[^0-9]', '', regex=True)
df.to_csv(r"C:\Users\BACNeuro\Desktop\laurin\META-DATA\Meta_texts\erisk2021_t3.csv", index=False)

Contextualized Construct Representation

The process of creating a CCR for the depression construct is illustrated in Figure 1. First, the items of the Beck-Depression-Inventory-II [BDI-II (Beck et al., 1996)], one of the most commonly used depression self-report questionnaires, are transformed into vectors using a semantic embedding model. The CCR is then defined as the average (or centroid) of all 21 individual item vectors, $\mu_{pos}$ (Atari et al., 2023). To measure CCR loadings (i.e., how strongly the depression construct is present in a given text), we determine the similarity between the text’s vector and the CCR (Atari et al., 2023). Since semantic embedding models operate in high-dimensional space, similarity between two vectors is assessed using cosine similarity, which ranges from -1 (low) to 1 (high) and quantifies whether the two vectors have a similar orientation (Lake & Murphy, 2023).
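
The definitions in this paragraph can be written compactly. The symbols $v_i$ (embedding of BDI-II item $i$) and $t$ (embedding of a text) are introduced here for illustration only:

$$
\mu_{pos} = \frac{1}{21} \sum_{i=1}^{21} v_i, \qquad \cos(t, \mu_{pos}) = \frac{t \cdot \mu_{pos}}{\lVert t \rVert \, \lVert \mu_{pos} \rVert}
$$

The CCR loading of a text is thus the cosine of the angle between the text vector and the item centroid, which makes it insensitive to the length of either vector.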

Following suggestions of Teitelbaum & Simchon (2025), the use of anchored CCR (aCCR) was explored. While standard CCRs are defined as the centroid of positively formulated items ($\mu_{pos}$), aCCRs are instead defined as the difference between the centroids of positively and negatively formulated items ($\mu_{pos} - \mu_{neg}$). As noted by Teitelbaum & Simchon (2025), anchoring CCRs could address the fact that questionnaire items reflect not only the target construct but also a specific (academic) writing style [cf. questionnaire-ness (Teitelbaum & Simchon, 2025)]. We used formulations in the English (Beck et al., 1996) and German (Kühner et al., 2007) versions of the BDI-II to define CCRs and aCCRs. The BDI-II is useful for this purpose as it provides descriptions for both the absence and presence of symptoms. For two symptoms, namely sleep and appetite, there are two opposite descriptions indicating symptom presence (increase and decrease in appetite/sleep duration). We reformulated these to “I have experienced changes in my [appetite/sleeping pattern]”, respectively.
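
The difference between CCR and aCCR can be illustrated with a toy sketch in two dimensions. The vectors below are made up for illustration and are not real embeddings: the first dimension is assumed to carry a shared questionnaire style, the second the presence (positive) or absence (negative) of symptoms. The helper names `centroid` and `cosine` are likewise hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equally long vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy item vectors: [questionnaire style, symptom presence]
positive_items = [[1.0, 1.0], [1.0, 0.8]]    # symptom-present formulations
negative_items = [[1.0, -1.0], [1.0, -0.8]]  # symptom-absent formulations

mu_pos = centroid(positive_items)
mu_neg = centroid(negative_items)

ccr = mu_pos                                    # standard CCR
accr = [p - q for p, q in zip(mu_pos, mu_neg)]  # anchored CCR

# Two toy texts: one with symptom content only, one with style only
text_symptom = [0.0, 1.0]
text_style = [1.0, 0.0]

for name, vec in [("symptom text", text_symptom), ("style text", text_style)]:
    print(name, round(cosine(vec, ccr), 3), round(cosine(vec, accr), 3))
```

In this constructed example the style component is identical in both centroids and therefore cancels in the aCCR, so the style-only text loads on the CCR but not on the aCCR. Whether real embeddings separate style and content this cleanly is an empirical question that the toy example does not address.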

We used an open-source bilingual (English/German) sentence embedding model with 560M parameters⁷ (S. Lee et al., 2024). The model is optimized for German information retrieval and achieves state-of-the-art performance on MTEB benchmarks. It was trained via fine-tuning a base model on 30M pairs of high-quality German texts. The base model was another multilingual sentence embedding model with 24 layers and 1024 embedding dimensions that was trained on diverse text data including Wikipedia, News, Reddit, academic articles, etc.⁸ (Wang et al., 2024).

In [9]:

import pandas as pd
from sentence_transformers import SentenceTransformer
import os
import numpy as np
import torch
# load SBERT model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
sentmod = SentenceTransformer("mixedbread-ai/deepset-mxbai-embed-de-large-v1", device=device)
# load data, encode, store
## dep_wor_data
deptexts = pd.read_csv(os.path.join(r.PUBLIC_DATA_PATH, "topics_deptexts_29042026.csv"),
                        index_col=False)
deptexts_sent_embeds = sentmod.encode(deptexts["Deptext"].to_list(), show_progress_bar=True)
np.save(os.path.join(r.PUBLIC_DATA_PATH, "deptexts_sent_embeds.npy"), deptexts_sent_embeds)
del deptexts, deptexts_sent_embeds
## META-FBZ
meta = pd.read_csv(os.path.join(r.PUBLIC_DATA_PATH, "meta_29042026.csv"),
                   index_col=False)
meta_sent_embeds = sentmod.encode(meta["fulltext"].to_list(), show_progress_bar=True)
np.save(os.path.join(r.PUBLIC_DATA_PATH, "meta_sent_embeds.npy"), meta_sent_embeds)
del meta, meta_sent_embeds
## eRisk-2021
erisk21 = pd.read_csv(os.path.join(r.PUBLIC_DATA_PATH, "erisk2021_t3_29042026.csv"),
                   index_col=False)
erisk21_sent_embeds = sentmod.encode(erisk21["text"].to_list(), show_progress_bar=True)
np.save(os.path.join(r.PUBLIC_DATA_PATH, "erisk21_sent_embeds.npy"), erisk21_sent_embeds)
del erisk21, erisk21_sent_embeds
## eRisk-2025
erisk25 = pd.read_csv(os.path.join(r.PUBLIC_DATA_PATH, "bdi_rel_29042026.csv"),
                   index_col=False)
erisk25_sent_embeds = sentmod.encode(erisk25["text"].to_list(), show_progress_bar=True)
np.save(os.path.join(r.PUBLIC_DATA_PATH, "erisk25_sent_embeds.npy"), erisk25_sent_embeds)
del erisk25, erisk25_sent_embeds

In [10]:

import numpy as np
import os
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import torch
import pandas as pd
bdi_items = {
    "german": {
      "positive": [
        "Ich bin so traurig oder unglücklich, dass ich es nicht aushalte.",
        "Ich glaube, dass meine Zukunft hoffnungslos ist und nur noch schlechter wird.",
        "Ich habe das Gefühl, als Mensch ein völliger Versager zu sein.",
        "Dinge, die mir früher Freude gemacht haben, kann ich überhaupt nicht mehr genießen.",
        "Ich habe ständig Schuldgefühle.",
        "Ich habe das Gefühl, bestraft zu sein.",
        "Ich lehne mich völlig ab.",
        "Ich gebe mir die Schuld für alles Schlimme, was passiert ist.",
        "Ich würde mich umbringen, wenn ich die Gelegenheit dazu hätte.",
        "Ich möchte gern weinen, aber ich kann nicht.",
        "Ich bin so unruhig, dass ich mich ständig bewegen oder etwas tun muss.",
        "Es fällt mir schwer, mich überhaupt für irgendetwas zu interessieren.",
        "Ich habe Mühe, überhaupt Entscheidungen zu treffen.",
        "Ich fühle mich völlig wertlos.",
        "Ich habe keine Energie mehr, um überhaupt noch etwas zu tun.",
        "Meine Schlafgewohnheiten haben sich verändert.",
        "Ich fühle mich dauernd gereizt.",
        "Mein Appetit hat sich verändert.",
        "Ich kann mich überhaupt nicht mehr konzentrieren.",
        "Ich bin so müde und erschöpft, dass ich fast nichts mehr tun kann.",
        "Ich habe das Interesse an Sexualität völlig verloren."
      ],
      "negative": [
        "Ich bin nicht traurig.",
        "Ich sehe nicht mutlos in die Zukunft.",
        "Ich fühle mich nicht als Versager.",
        "Ich kann die Dinge genauso gut genießen wie früher.",
        "Ich habe keine besonderen Schuldgefühle.",
        "Ich habe nicht das Gefühl, für etwas bestraft zu sein.",
        "Ich halte von mir genauso viel wie immer.",
        "Ich kritisiere oder tadle mich nicht mehr als sonst.",
        "Ich denke nicht daran, mir etwas anzutun.",
        "Ich weine nicht öfter als früher.",
        "Ich bin nicht unruhiger als sonst.",
        "Ich habe das Interesse an anderen Menschen oder an Tätigkeiten nicht verloren.",
        "Ich bin so entschlussfreudig wie immer.",
        "Ich fühle mich nicht wertlos.",
        "Ich habe so viel Energie wie immer.",
        "Meine Schlafgewohnheiten haben sich nicht verändert.",
        "Ich bin nicht reizbarer als sonst.",
        "Mein Appetit hat sich nicht verändert.",
        "Ich kann mich so gut konzentrieren wie immer.",
        "Ich fühle mich nicht müder oder erschöpfter als sonst.",
        "Mein Interesse an Sexualität hat sich in letzter Zeit nicht verändert."
      ]
    },
    "english": {
      "positive": [
        "I am so sad and unhappy that I can't stand it.",
        "I feel my future is hopeless and will only get worse.",
        "I feel I am a total failure as a person.",
        "I can't get any pleasure from the things I used to enjoy.",
        "I feel guilty all of the time.",
        "I feel I am being punished.",
        "I dislike myself.",
        "I blame myself for everything bad that happens.",
        "I would kill myself if I had the chance.",
        "I feel like crying, but I can't.",
        "I am so restless or agitated that I have to keep moving or doing something.",
        "It's hard to get interested in anything.",
        "I have trouble making any decisions.",
        "I feel utterly worthless.",
        "I don't have enough energy to do anything.",
        "I have experienced changes in my sleeping pattern.",
        "I am irritable all the time.",
        "I have experienced changes in my appetite.",
        "I find I can't concentrate on anything.",
        "I am too tired or fatigued to do most of the things I used to do.",
        "I have lost interest in sex completely."
      ],
      "negative": [
        "I do not feel sad.",
        "I am not discouraged about my future.",
        "I do not feel like a failure.",
        "I get as much pleasure as I ever did from the things I enjoy.",
        "I don't feel particularly guilty.",
        "I don't feel I am being punished.",
        "I feel the same about myself as ever.",
        "I don't criticize or blame myself more than usual.",
        "I don't have any thoughts of killing myself.",
        "I don't cry any more than I used to.",
        "I am no more restless or wound up than usual.",
        "I have not lost interest in other people or activities.",
        "I make decisions about as well as ever.",
        "I don't feel I am worthless.",
        "I have as much energy as ever.",
        "I have not experienced any change in my sleeping pattern.",
        "I am not more irritable than usual.",
        "I have not experienced any change in my appetite.",
        "I can concentrate as well as ever.",
        "I am no more tired or fatigued than usual.",
        "I have not noticed any recent change in my interest in sex."
      ]
    }
}
# load SBERT model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
sentmod = SentenceTransformer("mixedbread-ai/deepset-mxbai-embed-de-large-v1", device=device)
# define CCR vectors
## compute centroids
positive_centroid_english = sentmod.encode(bdi_items["english"]["positive"]).mean(axis=0)
negative_centroid_english = sentmod.encode(bdi_items["english"]["negative"]).mean(axis=0)
positive_centroid_german = sentmod.encode(bdi_items["german"]["positive"]).mean(axis=0)
negative_centroid_german = sentmod.encode(bdi_items["german"]["negative"]).mean(axis=0)
## compute CCRs
CCR_en = positive_centroid_english
CCR_de = positive_centroid_german
aCCR_en = np.subtract(positive_centroid_english, negative_centroid_english)
aCCR_de = np.subtract(positive_centroid_german, negative_centroid_german)
# export CCR vectors as np arrays for later use
np.save(os.path.join(r.PUBLIC_DATA_PATH, "CCR_en.npy"), CCR_en)
np.save(os.path.join(r.PUBLIC_DATA_PATH, "CCR_de.npy"), CCR_de)
np.save(os.path.join(r.PUBLIC_DATA_PATH, "aCCR_en.npy"), aCCR_en)
np.save(os.path.join(r.PUBLIC_DATA_PATH, "aCCR_de.npy"), aCCR_de)
# load datasets, load embeddings, compute similarity, subset relevant data (anonymization)
## dep_wor_data
deptexts = pd.read_csv(os.path.join(r.PUBLIC_DATA_PATH, "topics_deptexts_29042026.csv"),
                        index_col=False)
deptexts_sent_embeds = np.load(os.path.join(r.PUBLIC_DATA_PATH, "deptexts_sent_embeds.npy"))
deptexts["CCR"] = cosine_similarity(deptexts_sent_embeds, CCR_en.reshape(1, -1))
deptexts["aCCR"] = cosine_similarity(deptexts_sent_embeds, aCCR_en.reshape(1, -1))
deptexts = deptexts.loc[:, ["CCR", "aCCR", "PHQ9tot", "GAD7tot", "SickDaysYear", "SickDaysMonth", "Age", "Gender", "HealthCareVisits", "n_words", "MentalSickDays", "mapped_BDI_sum", "Deptext"]]
deptexts.to_csv(os.path.join(r.PUBLIC_DATA_PATH, "deptexts_18052026.csv"), index=False)
del deptexts, deptexts_sent_embeds
## META-FBZ
meta = pd.read_csv(os.path.join(r.PUBLIC_DATA_PATH, "meta_29042026.csv"),
                   index_col=False)
meta_sent_embeds = np.load(os.path.join(r.PUBLIC_DATA_PATH, "meta_sent_embeds.npy"))
meta["CCR"] = cosine_similarity(meta_sent_embeds, CCR_de.reshape(1, -1))
meta["CCR_en"] = cosine_similarity(meta_sent_embeds, CCR_en.reshape(1, -1))
meta["aCCR"] = cosine_similarity(meta_sent_embeds, aCCR_de.reshape(1, -1))
meta["aCCR_en"] = cosine_similarity(meta_sent_embeds, aCCR_en.reshape(1, -1))
meta = meta.loc[:, ["CCR", "aCCR", "n_words", "dass_str_score_DU-Prä", "dass_anx_score_DU-Prä", "dass_dep_score_DU-Prä", "depression_diagnosis", "work_ability_status", "vocational_qualification", "patient_sex", "patient_age_therapy_start", "bdi_sum_DU-Prä"]]
meta.to_csv(os.path.join(r.PUBLIC_DATA_PATH, "meta_18052026.csv"), index=False)
del meta, meta_sent_embeds
## eRisk-2021
erisk21 = pd.read_csv(os.path.join(r.PUBLIC_DATA_PATH, "erisk2021_t3_29042026.csv"),
                   index_col=False)
erisk21_sent_embeds = np.load(os.path.join(r.PUBLIC_DATA_PATH, "erisk21_sent_embeds.npy"))
erisk21["CCR"] = cosine_similarity(erisk21_sent_embeds, CCR_en.reshape(1, -1))
erisk21["aCCR"] = cosine_similarity(erisk21_sent_embeds, aCCR_en.reshape(1, -1))
erisk21 = erisk21.loc[:, ["sub_id", "CCR", "aCCR", "bdi_sum", "n_words"]] 
erisk21.to_csv(os.path.join(r.PUBLIC_DATA_PATH, "erisk21_18052026.csv"), index=False)
del erisk21, erisk21_sent_embeds
## eRisk-2025
erisk25 = pd.read_csv(os.path.join(r.PUBLIC_DATA_PATH, "bdi_rel_29042026.csv"),
                   index_col=False)
erisk25_sent_embeds = np.load(os.path.join(r.PUBLIC_DATA_PATH, "erisk25_sent_embeds.npy"))
erisk25["positive_centroid_similarity"] = cosine_similarity(erisk25_sent_embeds, positive_centroid_english.reshape(1, -1))
erisk25["negative_centroid_similarity"] = cosine_similarity(erisk25_sent_embeds, negative_centroid_english.reshape(1, -1))
erisk25 = erisk25.loc[:, ["query", "relevant", "positive_centroid_similarity", "negative_centroid_similarity"]]
erisk25.to_csv(os.path.join(r.PUBLIC_DATA_PATH, "erisk25_18052026.csv"), index=False)
del erisk25, erisk25_sent_embeds

In [11]:

library(dplyr)
library(stringr)
library(vroom)
library(ggplot2)
# Tunable parameters
max_words  <- 30
wrap_width <- 30
# Helper to truncate to max_words
truncate_words <- function(text, n = max_words) {
  words <- str_split(text, "\\s+")[[1]]
  if (length(words) > n) {
    paste(c(words[seq_len(n)], "..."), collapse = " ")
  } else {
    text
  }
}
deptexts_examples <- vroom(file.path(PUBLIC_DATA_PATH, "deptexts_18052026.csv"))
deptexts_examples$CCR_scaled         <- scale(deptexts_examples$CCR)
deptexts_examples$CCR_scaled_rounded <- round(deptexts_examples$CCR_scaled, 1)
ccr_pos3_sd <- deptexts_examples[deptexts_examples$CCR_scaled_rounded == 3.4,  ][["Deptext"]][1]
ccr_pos2_sd <- deptexts_examples[deptexts_examples$CCR_scaled_rounded == 2,    ][["Deptext"]][1]
ccr_pos1_sd <- deptexts_examples[deptexts_examples$CCR_scaled_rounded == 1,    ][["Deptext"]][3]
ccr_neg1_sd <- deptexts_examples[deptexts_examples$CCR_scaled_rounded == -1,   ][["Deptext"]][1]
ccr_neg2_sd <- deptexts_examples[deptexts_examples$CCR_scaled_rounded == -2,   ][["Deptext"]][1]
ccr_neg3_sd <- deptexts_examples[deptexts_examples$CCR_scaled_rounded == -2.7, ][["Deptext"]][1]
counts <- deptexts_examples %>%
  count(CCR_scaled_rounded)
ymax <- max(counts$n)
labels_df <- tibble(
  x = c(-2.7, -2, -1, 1, 2, 3.4),
  text = c(ccr_neg3_sd, ccr_neg2_sd, ccr_neg1_sd, ccr_pos1_sd, ccr_pos2_sd, ccr_pos3_sd)
) %>%
  mutate(
    text  = sapply(text, truncate_words),
    label = str_wrap(text, width = wrap_width)
  ) %>%
  left_join(counts, by = c("x" = "CCR_scaled_rounded")) %>%
  mutate(
    y_label = c(
      ymax * 0.3,
      ymax * 1,
      ymax * 1.8,
      ymax * 1.85,
      ymax * 1,
      ymax * 0.4
    )
  )
Fig1d <- ggplot(deptexts_examples, aes(x = CCR_scaled_rounded)) +
  geom_histogram(
    bins = 20,
    fill = "#298c8c",
    color = "#298c8c"
  ) +
  geom_segment(
    data = labels_df,
    aes(x = x, xend = x, y = 0, yend = y_label),
    color = "grey40",
    linetype = "dashed"
  ) +
  geom_label(
    data = labels_df,
    aes(x = x, y = y_label, label = label),
    size = 3,
    label.size = 0.2,
    fill = "white",
    hjust = 0.5,
    vjust = 0
  ) +
  scale_x_continuous(
    breaks = c(-2.7, -2, -1, 0, 1, 2, 3.4),
    labels = c("-2.7", "-2", "-1", "0", "1", "2", "3.4")
  ) +
  scale_y_continuous(
    expand = expansion(mult = c(0.02, 0.05))
  ) +
  coord_cartesian(
    ylim = c(0, ymax * 2.2),
    clip = "off"
  ) +
  labs(
    x = "CCR loading (z-standardized)",
    y = NULL
  ) +
  theme_minimal() +
  theme(
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.line.x = element_line(color = "black"),
    axis.ticks.x = element_line(color = "black"),
    plot.margin = margin(40, 40, 40, 40)
  )
ggsave("figure_graphics/Fig1d.svg", Fig1d, width = 7, height = 5, units = "in")
# Figure 1 panels a-c were manually created and exported in PowerPoint

In [12]:

library(ggplot2)
library(magick)

library(cowplot)
# load SVG files
img1 <- image_read("figure_graphics/fig1.svg")
# convert to ggplot-compatible grobs
ggdraw() + draw_image(img1)

Figure 1: Creation of a Contextualized Construct Representation (CCR).

Note. a, Items of the BDI-II are transformed into semantic vectors. b, The CCR is derived by computing the average (centroid) of the positively formulated BDI-II items (blue vector). The anchored CCR (aCCR; red vector) is instead defined as the vector connecting the centroid of the negatively formulated items with the centroid of the positively formulated items (grey vector). Mathematically, this vector is derived by subtracting the former from the latter. c, The CCR/aCCR loading indicates how strongly the depression construct is expressed in a given text. It is computed as the cosine similarity between the CCR/aCCR vector and the text vector. Perfect alignment of the vectors (cosine similarity = 1) indicates high loading, while opposite alignment (cosine similarity = -1) indicates low loading. d, Example texts from the dep_wor_data dataset at different CCR loadings (z-standardized).
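
The construction described in the note can be illustrated with a minimal Python sketch. The three-dimensional vectors below are made-up stand-ins for sentence embeddings (the embeddings used in the analyses have 1,024 dimensions), and the helper `cosine` as well as all toy values are illustrative assumptions rather than part of the analysis pipeline.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "item embeddings" (hypothetical values; 3 dimensions instead of 1,024)
positive_items = np.array([[0.9, 0.1, 0.3],
                           [0.8, 0.2, 0.4]])
negative_items = np.array([[-0.2, 0.9, 0.3],
                           [-0.1, 0.8, 0.4]])

# Centroids of the positively and negatively formulated items
positive_centroid = positive_items.mean(axis=0)
negative_centroid = negative_items.mean(axis=0)

# CCR: centroid of the positive items
ccr = positive_centroid
# aCCR: vector pointing from the negative centroid to the positive centroid
accr = positive_centroid - negative_centroid

# Toy "text embeddings" (hypothetical values)
text_symptomatic = np.array([0.7, 0.2, 0.3])
text_asymptomatic = np.array([0.0, 0.9, 0.3])

# Loading = cosine similarity between the construct vector and the text vector
loadings = {
    "CCR": {
        "symptomatic": cosine(text_symptomatic, ccr),
        "asymptomatic": cosine(text_asymptomatic, ccr),
    },
    "aCCR": {
        "symptomatic": cosine(text_symptomatic, accr),
        "asymptomatic": cosine(text_asymptomatic, accr),
    },
}
print(loadings)
```

In this toy setup the third dimension is identical in both centroids and therefore cancels in the aCCR, which is one way to picture how anchoring might remove content shared by positively and negatively formulated items.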

Statistical Analysis Plan

Statistical significance was set at p < .05. Assumptions of the statistical models were checked visually via residual histograms, QQ-plots, and fitted-vs-residual plots.

In [13]:

library(vroom)
library(tidyr)
library(dplyr)
library(effsize)
library(ggplot2)
library(cowplot)
library(ggpubr)

library(lsa)

library(reticulate)
library(reshape2)


library(cocor)
library(purrr)

library(magick)
library(knitr)

library(pROC)

library(tibble)

library(stringr)
deptexts = vroom(file.path(PUBLIC_DATA_PATH, "deptexts_18052026.csv"))

Rows: 500 Columns: 13

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Deptext
dbl (12): CCR, aCCR, PHQ9tot, GAD7tot, SickDaysYear, SickDaysMonth, Age, Gen...

meta = vroom(file.path(PUBLIC_DATA_PATH, "meta_18052026.csv"))

Rows: 601 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): dsmv_diagnosis_1, work_ability_status, vocational_qualification, pa...
dbl (9): CCR, aCCR, n_words, dass_str_score_DU-Prä, dass_anx_score_DU-Prä, d...

erisk21 = vroom(file.path(PUBLIC_DATA_PATH, "erisk21_18052026.csv"))

Rows: 27594 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): sub_id
dbl (4): CCR, aCCR, bdi_sum, n_words

erisk25 = vroom(file.path(PUBLIC_DATA_PATH, "erisk25_18052026.csv"))

Rows: 11039 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): query, positive_centroid_similarity, negative_centroid_similarity
lgl (1): relevant

# weighted-mean CCR scores in eRisk21
erisk21_w <- erisk21 %>%
  group_by(sub_id) %>%
  summarise(
    CCR = weighted.mean(CCR, w = n_words, na.rm = TRUE),
    aCCR = weighted.mean(aCCR, w = n_words, na.rm = TRUE),
    bdi_sum = mean(bdi_sum),
    .groups = "drop"
  )
# ── Color palette ─────────────────────────────────────────────────────────────
ccr_col  <- "#298c8c"
accr_col <- "#a00000"
rr_col   <- "#5c6bc0"
pls_col  <- "#ef8c00"
grey_col <- "#b8b8b8"
# ── p-value formatting ────────────────────────────────────────────────────────
format_p <- function(p) {
  ifelse(
    is.na(p), NA,
    ifelse(
      p < .001, "p < .001",
      paste0("p = ", sub("^0", "", sprintf("%.3f", p)))
    )
  )
}
# ── Significance stars ────────────────────────────────────────────────────────
add_stars <- function(df) {
  mutate(df, p_label = case_when(
    p < .001 ~ "***",
    p < .01  ~ "**",
    p < .05  ~ "*",
    TRUE     ~ "ns"
  ))
}
# ── Annotation helpers ────────────────────────────────────────────────────────
ann <- function(label) {
  annotate("text", x = -Inf, y = Inf, label = label,
           hjust = -0.1, vjust = 1.1, size = 3)
}
cor_ann <- function(data, x, y) {
  res <- cor.test(data[[x]], data[[y]])
  paste0("r = ", round(res$estimate, 2), "\n", format_p(res$p.value))
}
ttest_ann <- function(data, var, group_var, g1, g2) {
  x1 <- data[data[[group_var]] == g1, var, drop = TRUE]
  x2 <- data[data[[group_var]] == g2, var, drop = TRUE]
  paste0("d = ", round(cohen.d(x1, x2)$estimate, 2), "\n", format_p(t.test(x1, x2)$p.value))
}
# ── Figure layout helpers ─────────────────────────────────────────────────────
row_label <- function(txt) {
  ggdraw() + draw_label(txt, fontface = "bold", angle = 90, size = 11)
}
col_header <- function(txt, size = 10) {
  ggdraw() + draw_label(txt, fontface = "bold", size = size)
}
labeled_row <- function(lbl, ..., rel_widths = c(0.05, 1)) {
  panels <- list(...)
  row <- if (length(panels) == 1) panels[[1]] else plot_grid(plotlist = panels, nrow = 1)
  plot_grid(lbl, row, ncol = 2, rel_widths = rel_widths)
}
# ── Method variable lists (shared across validity analyses) ───────────────────
deptexts_methods <- list(
  list(var = "CCR",              label = "CCR"),
  list(var = "aCCR",             label = "aCCR"),
  list(var = "PHQ9tot_RR_pred",  label = "RR"),
  list(var = "PHQ9tot_PLS_pred", label = "PLS")
)
meta_methods <- list(
  list(var = "CCR",              label = "CCR"),
  list(var = "aCCR",             label = "aCCR"),
  list(var = "bdi_sum_RR_pred",  label = "RR"),
  list(var = "bdi_sum_PLS_pred", label = "PLS")
)
# ── numpy access via reticulate ───────────────────────────────────────────────
np <- import("numpy")

Supervised LBA

The performance of CCRs was compared to that of supervised LBA. To this end, sentence embedding vectors were used to train regression models that predict self-report scale scores. Two regression models were used. Ridge Regression (RR), the standard model for supervised LBA (Gu et al., 2025; O. Kjell et al., 2026; Teitelbaum & Simchon, 2025), counteracts over-fitting by adding a penalty on the squared magnitude of the regression coefficients to the least-squares objective, which shrinks the coefficients toward zero. The degree of penalization is tuned via the $\alpha$ parameter, with higher $\alpha$ values corresponding to stronger penalization. Partial Least Squares regression (PLS) was proposed by Teitelbaum & Simchon (2025) as another method for supervised LBA (cf. Correlational Anchored Vectors). PLS extracts from the predictors a set of orthogonal factors that have the highest predictive power and is thus useful in cases with a large number of correlated predictors (Abdi, 2003). By extracting only the single factor with the highest predictive power, the model could learn a more abstract construct representation, which might improve generalization performance on comparable text samples (Teitelbaum & Simchon, 2025).
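
The two model families can be pictured with a small NumPy sketch on synthetic data. It is only an illustration under stated assumptions: the data are randomly generated, the helper `ridge_coefficients` is a hypothetical name, and the one-component PLS fit is written out by hand (weight vector proportional to X'y) rather than taken from the scikit-learn implementation used in the analyses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings: 50 "texts", 5 correlated predictors
n, p = 50, 5
latent = rng.normal(size=(n, 1))
X = latent + 0.5 * rng.normal(size=(n, p))
y = 2.0 * latent[:, 0] + rng.normal(scale=0.5, size=n)

# Center predictors and outcome so that no intercept term is needed
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(Xc, yc, alpha):
    """Closed-form ridge solution: (X'X + alpha * I)^(-1) X'y."""
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)

# Larger alpha -> stronger shrinkage -> smaller coefficient norm
alphas = [0.0, 10.0, 1e3, 1e6]
coef_norms = [float(np.linalg.norm(ridge_coefficients(Xc, yc, a))) for a in alphas]
print(dict(zip(alphas, coef_norms)))

# One-component PLS: the weight vector is proportional to X'y
w = Xc.T @ yc
w = w / np.linalg.norm(w)      # unit-length direction in predictor space
t = Xc @ w                     # scores of each text on the single factor
q = (t @ yc) / (t @ t)         # regress the outcome on the factor scores
pls_pred = y.mean() + q * t    # predictions from the single factor
```

The vector `w` plays the role of a learned construct direction: each text is scored by its projection onto one direction in embedding space, which is what makes a single-factor PLS model comparable in form to a CCR vector.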

To evaluate the performance of supervised LBA, 5-fold cross-validation was conducted. For RR, cross-validation was nested to allow a hyper-parameter search for the optimal $\alpha$ over 100 logarithmically spaced values from 10¹ to 10⁶. Embeddings were z-standardized within each fold, with the scaling parameters estimated on the training portion only. Predicted questionnaire scores on the training dataset were taken from the out-of-fold predictions of the outer folds. The final models, refitted on the full training dataset, were used in transfer tests to predict questionnaire scores in the other datasets. High text-rating correlations on datasets not seen during supervised learning indicate good generalization performance.
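
The evaluation logic can be sketched end to end on synthetic data. The snippet below is a simplified illustration, not the analysis code: the "source" and "target" domains are randomly generated, the $\alpha$ grid is coarser than the 100-step grid described above (to keep the run time short), and `make_search` is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic source and target domains that share the same true weights (assumption)
true_w = rng.normal(size=10)
X_source = rng.normal(size=(120, 10))
y_source = X_source @ true_w + rng.normal(scale=0.5, size=120)
X_target = rng.normal(loc=0.3, size=(80, 10))
y_target = X_target @ true_w + rng.normal(scale=0.5, size=80)

# The scaler sits inside the pipeline, so it is re-fitted on each training fold only
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
alphas = np.logspace(1, 6, 20)  # coarser grid than in the manuscript, for speed
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)

def make_search():
    """Inner loop: choose alpha by 5-fold cross-validation."""
    return GridSearchCV(pipe, param_grid={"ridge__alpha": alphas},
                        cv=inner_cv, scoring="r2")

# Outer loop: one out-of-fold prediction per source observation
oof_pred = cross_val_predict(make_search(), X_source, y_source, cv=outer_cv)
r_within = np.corrcoef(oof_pred, y_source)[0, 1]

# Transfer test: refit on all source data, then predict the unseen target domain
final_model = make_search().fit(X_source, y_source)
transfer_pred = final_model.predict(X_target)
r_transfer = np.corrcoef(transfer_pred, y_target)[0, 1]

print(f"within-domain r = {r_within:.2f}, transfer r = {r_transfer:.2f}")
```

Because the target domain never enters model fitting or scaling, `r_transfer` reflects generalization to unseen data, whereas `r_within` reflects cross-validated performance inside the training domain.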

In [14]:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict, cross_validate
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score
from sklearn.linear_model import Ridge
from sklearn.base import clone
import pandas as pd
import numpy as np
import os
# load data
deptexts = r.deptexts
meta = r.meta
erisk21 = r.erisk21
# load embeddings
deptexts_word_embeds = np.load(os.path.join(r.PUBLIC_DATA_PATH, "deptexts_sent_embeds.npy"))
meta_word_embeds = np.load(os.path.join(r.PUBLIC_DATA_PATH, "meta_sent_embeds.npy"))
erisk21_word_embeds = np.load(os.path.join(r.PUBLIC_DATA_PATH, "erisk21_sent_embeds.npy"))
# aggregate erisk21 data
erisk21_word_embeds_df = pd.DataFrame(erisk21_word_embeds)
erisk21_word_embeds_df.columns = ["d_"+str(d) for d in range(1, 1025)]
erisk21_word_embeds_df = pd.concat([erisk21, erisk21_word_embeds_df], axis=1)
# Word-count-weighted mean embedding per subject. The embedding, weight, and label
# columns are selected explicitly so that the grouping column is not passed to apply().
embed_cols = [f"d_{i}" for i in range(1, 1025)]
erisk21_av_word_embeds = (
    erisk21_word_embeds_df
    .groupby("sub_id")[embed_cols + ["n_words", "bdi_sum"]]
    .apply(
        lambda x: pd.Series({
            **{col: np.average(x[col], weights=x["n_words"]) for col in embed_cols},
            "bdi_sum": x["bdi_sum"].mean()
        })
    )
    .reset_index()
)

erisk21_word_embeds = erisk21_av_word_embeds.loc[:, erisk21_av_word_embeds.columns.str.contains("d_")].to_numpy()
# define shared regression pipelines and the alpha search grid
RR_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge())
])
PLS_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("PLS", PLSRegression(n_components=1))
])
alphas = np.logspace(1, 6, 100)

In [15]:

import numpy as np
from sklearn.model_selection import KFold, GridSearchCV, cross_validate, cross_val_predict
# ── Fold structure ────────────────────────────────────────────────────────────
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
# ── GridSearchCV objects ──────────────────────────────────────────────────────
deptexts_RR_model = GridSearchCV(
    RR_pipe, param_grid={"ridge__alpha": alphas}, cv=inner_cv, scoring="r2"
)
deptextsbdi_RR_model = GridSearchCV(
    RR_pipe, param_grid={"ridge__alpha": alphas}, cv=inner_cv, scoring="r2"
)
meta_RR_model = GridSearchCV(
    RR_pipe, param_grid={"ridge__alpha": alphas}, cv=inner_cv, scoring="r2"
)
erisk21_RR_model = GridSearchCV(
    RR_pipe, param_grid={"ridge__alpha": alphas}, cv=inner_cv, scoring="r2"
)
# ── Nested cross-validation ───────────────────────────────────────────────────
# Ridge
deptexts_RR_cv_results = cross_validate(
    deptexts_RR_model, X=deptexts_word_embeds,
    y=deptexts["PHQ9tot"].to_numpy(), cv=outer_cv, scoring="r2", return_estimator=True
)
deptextsbdi_RR_cv_results = cross_validate(
    deptextsbdi_RR_model, X=deptexts_word_embeds,
    y=deptexts["mapped_BDI_sum"].to_numpy(), cv=outer_cv, scoring="r2", return_estimator=True
)
meta_RR_cv_results = cross_validate(
    meta_RR_model, X=meta_word_embeds,
    y=meta["bdi_sum_DU-Prä"].to_numpy(), cv=outer_cv, scoring="r2", return_estimator=True
)
erisk21_RR_cv_results = cross_validate(
    erisk21_RR_model, X=erisk21_word_embeds,
    y=erisk21_av_word_embeds["bdi_sum"].to_numpy(), cv=outer_cv, scoring="r2", return_estimator=True
)
# PLS
deptexts_PLS_cv_results = cross_validate(
    PLS_pipe, X=deptexts_word_embeds,
    y=deptexts["PHQ9tot"].to_numpy(), cv=outer_cv, scoring="r2", return_estimator=True
)
deptextsbdi_PLS_cv_results = cross_validate(
    PLS_pipe, X=deptexts_word_embeds,
    y=deptexts["mapped_BDI_sum"].to_numpy(), cv=outer_cv, scoring="r2", return_estimator=True
)
meta_PLS_cv_results = cross_validate(
    PLS_pipe, X=meta_word_embeds,
    y=meta["bdi_sum_DU-Prä"].to_numpy(), cv=outer_cv, scoring="r2", return_estimator=True
)
erisk21_PLS_cv_results = cross_validate(
    PLS_pipe, X=erisk21_word_embeds,
    y=erisk21_av_word_embeds["bdi_sum"].to_numpy(), cv=outer_cv, scoring="r2", return_estimator=True
)
# ════════════════════════════════════════════════════════════════════════════════
# Final models fitted on the full dataset
# ════════════════════════════════════════════════════════════════════════════════
# Each GridSearchCV is re-fitted on 100 % of the data.  After .fit() the object
# exposes .best_estimator_ (the pipeline with the best alpha, already refit) and
# .best_params_ so you can inspect or log the chosen regularisation strength.
# Ridge – final models
deptexts_RR_final_model = GridSearchCV(
    RR_pipe, param_grid={"ridge__alpha": alphas}, cv=inner_cv, scoring="r2"
)
deptexts_RR_final_model.fit(deptexts_word_embeds, deptexts["PHQ9tot"].to_numpy())

deptextsbdi_RR_final_model = GridSearchCV(
    RR_pipe, param_grid={"ridge__alpha": alphas}, cv=inner_cv, scoring="r2"
)
deptextsbdi_RR_final_model.fit(deptexts_word_embeds, deptexts["mapped_BDI_sum"].to_numpy())

meta_RR_final_model = GridSearchCV(
    RR_pipe, param_grid={"ridge__alpha": alphas}, cv=inner_cv, scoring="r2"
)
meta_RR_final_model.fit(meta_word_embeds, meta["bdi_sum_DU-Prä"].to_numpy())

erisk21_RR_final_model = GridSearchCV(
    RR_pipe, param_grid={"ridge__alpha": alphas}, cv=inner_cv, scoring="r2"
)
erisk21_RR_final_model.fit(erisk21_word_embeds, erisk21_av_word_embeds["bdi_sum"].to_numpy())

# PLS – final models (no alpha tuning, so a single fit suffices)
from sklearn.base import clone
deptexts_PLS_final_model = clone(PLS_pipe)
deptexts_PLS_final_model.fit(deptexts_word_embeds, deptexts["PHQ9tot"].to_numpy())

deptextsbdi_PLS_final_model = clone(PLS_pipe)
deptextsbdi_PLS_final_model.fit(deptexts_word_embeds, deptexts["mapped_BDI_sum"].to_numpy())

meta_PLS_final_model = clone(PLS_pipe)
meta_PLS_final_model.fit(meta_word_embeds, meta["bdi_sum_DU-Prä"].to_numpy())

erisk21_PLS_final_model = clone(PLS_pipe)
erisk21_PLS_final_model.fit(erisk21_word_embeds, erisk21_av_word_embeds["bdi_sum"].to_numpy())

# persist to disk with joblib
import joblib
joblib.dump(deptextsbdi_RR_final_model.best_estimator_, os.path.join(r.PUBLIC_DATA_PATH, "deptextsbdi_RR.pkl"))

joblib.dump(deptextsbdi_PLS_final_model, os.path.join(r.PUBLIC_DATA_PATH, "deptextsbdi_PLS.pkl"))

joblib.dump(meta_RR_final_model.best_estimator_, os.path.join(r.PUBLIC_DATA_PATH, "meta_RR.pkl"))

joblib.dump(meta_PLS_final_model, os.path.join(r.PUBLIC_DATA_PATH, "meta_PLS.pkl"))

joblib.dump(erisk21_RR_final_model.best_estimator_, os.path.join(r.PUBLIC_DATA_PATH, "erisk21_RR.pkl"))

joblib.dump(erisk21_PLS_final_model, os.path.join(r.PUBLIC_DATA_PATH, "erisk21_PLS.pkl"))

# ════════════════════════════════════════════════════════════════════════════════
# CV-predicted scores stored back into the dataframes
# ════════════════════════════════════════════════════════════════════════════════
# cross_val_predict returns one out-of-fold prediction per observation. Because
# outer_cv uses a fixed random_state, the splits match those of cross_validate,
# and each prediction comes from a model that never saw that observation.
#
# For Ridge the estimator passed to cross_val_predict is a *fresh* GridSearchCV
# (identical setup to above) — each outer fold still runs inner-CV tuning before
# predicting on the held-out fold.
# Ridge – CV predictions
deptexts_RR_oof_preds = cross_val_predict(
    GridSearchCV(RR_pipe, param_grid={"ridge__alpha": alphas}, cv=inner_cv, scoring="r2"),
    X=deptexts_word_embeds, y=deptexts["PHQ9tot"].to_numpy(), cv=outer_cv
)
deptexts["PHQ9tot_RR_pred"] = deptexts_RR_oof_preds
deptextsbdi_RR_oof_preds = cross_val_predict(
    GridSearchCV(RR_pipe, param_grid={"ridge__alpha": alphas}, cv=inner_cv, scoring="r2"),
    X=deptexts_word_embeds, y=deptexts["mapped_BDI_sum"].to_numpy(), cv=outer_cv
)
deptexts["mapped_BDI_sum_RR_pred"] = deptextsbdi_RR_oof_preds
meta_RR_oof_preds = cross_val_predict(
    GridSearchCV(RR_pipe, param_grid={"ridge__alpha": alphas}, cv=inner_cv, scoring="r2"),
    X=meta_word_embeds, y=meta["bdi_sum_DU-Prä"].to_numpy(), cv=outer_cv
)
meta["bdi_sum_RR_pred"] = meta_RR_oof_preds
erisk21_RR_oof_preds = cross_val_predict(
    GridSearchCV(RR_pipe, param_grid={"ridge__alpha": alphas}, cv=inner_cv, scoring="r2"),
    X=erisk21_word_embeds, y=erisk21_av_word_embeds["bdi_sum"].to_numpy(), cv=outer_cv
)
erisk21_av_word_embeds["bdi_sum_RR_pred"] = erisk21_RR_oof_preds
# PLS – CV predictions
deptexts_PLS_oof_preds = cross_val_predict(
    PLS_pipe, X=deptexts_word_embeds,
    y=deptexts["PHQ9tot"].to_numpy(), cv=outer_cv
)
deptexts["PHQ9tot_PLS_pred"] = deptexts_PLS_oof_preds
deptextsbdi_PLS_oof_preds = cross_val_predict(
    PLS_pipe, X=deptexts_word_embeds,
    y=deptexts["mapped_BDI_sum"].to_numpy(), cv=outer_cv
)
deptexts["mapped_BDI_sum_PLS_pred"] = deptextsbdi_PLS_oof_preds
meta_PLS_oof_preds = cross_val_predict(
    PLS_pipe, X=meta_word_embeds,
    y=meta["bdi_sum_DU-Prä"].to_numpy(), cv=outer_cv
)
meta["bdi_sum_PLS_pred"] = meta_PLS_oof_preds
erisk21_PLS_oof_preds = cross_val_predict(
    PLS_pipe, X=erisk21_word_embeds,
    y=erisk21_av_word_embeds["bdi_sum"].to_numpy(), cv=outer_cv
)
erisk21_av_word_embeds["bdi_sum_PLS_pred"] = erisk21_PLS_oof_preds

In [16]:

import os
import joblib
from scipy.stats import pearsonr
# Load transfer models from disk
deptextsbdi_RR_transfer_model  = joblib.load(os.path.join(r.PUBLIC_DATA_PATH, "deptextsbdi_RR.pkl"))
deptextsbdi_PLS_transfer_model = joblib.load(os.path.join(r.PUBLIC_DATA_PATH, "deptextsbdi_PLS.pkl"))
meta_RR_transfer_model         = joblib.load(os.path.join(r.PUBLIC_DATA_PATH, "meta_RR.pkl"))
meta_PLS_transfer_model        = joblib.load(os.path.join(r.PUBLIC_DATA_PATH, "meta_PLS.pkl"))
erisk21_RR_transfer_model      = joblib.load(os.path.join(r.PUBLIC_DATA_PATH, "erisk21_RR.pkl"))
erisk21_PLS_transfer_model     = joblib.load(os.path.join(r.PUBLIC_DATA_PATH, "erisk21_PLS.pkl"))
# ── Transfer predictions ──────────────────────────────────────────────────────
## Ridge Regression
erisk21_av_word_embeds["meta_pred_RR"]     = meta_RR_transfer_model.predict(erisk21_word_embeds).ravel()

deptexts["meta_pred_RR"]                   = meta_RR_transfer_model.predict(deptexts_word_embeds).ravel()

meta["erisk21_pred_RR"]                    = erisk21_RR_transfer_model.predict(meta_word_embeds).ravel()

deptexts["erisk21_pred_RR"]                = erisk21_RR_transfer_model.predict(deptexts_word_embeds).ravel()

meta["deptexts_pred_RR"]                   = deptextsbdi_RR_transfer_model.predict(meta_word_embeds).ravel()

erisk21_av_word_embeds["deptexts_pred_RR"] = deptextsbdi_RR_transfer_model.predict(erisk21_word_embeds).ravel()

## Partial Least Squares Regression
erisk21_av_word_embeds["meta_pred_PLS"]     = meta_PLS_transfer_model.predict(erisk21_word_embeds).ravel()

deptexts["meta_pred_PLS"]                   = meta_PLS_transfer_model.predict(deptexts_word_embeds).ravel()

meta["erisk21_pred_PLS"]                    = erisk21_PLS_transfer_model.predict(meta_word_embeds).ravel()

deptexts["erisk21_pred_PLS"]                = erisk21_PLS_transfer_model.predict(deptexts_word_embeds).ravel()

meta["deptexts_pred_PLS"]                   = deptextsbdi_PLS_transfer_model.predict(meta_word_embeds).ravel()

erisk21_av_word_embeds["deptexts_pred_PLS"] = deptextsbdi_PLS_transfer_model.predict(erisk21_word_embeds).ravel()

In [17]:

deptexts <- py$deptexts
meta <- py$meta
erisk21_av_word_embeds <- py$erisk21_av_word_embeds

# ── Helper ────────────────────────────────────────────────────────────────────
r_val <- function(x, y) cor.test(as.numeric(x), as.numeric(y))$estimate
p_val <- function(x, y) cor.test(as.numeric(x), as.numeric(y))$p.value
# ── Shortcuts for outcome vectors ─────────────────────────────────────────────
dep_outcome     <- deptexts$mapped_BDI_sum
meta_outcome    <- meta$`bdi_sum_DU-Prä`
erisk_outcome   <- erisk21_av_word_embeds$bdi_sum
# ── RR ────────────────────────────────────────────────────────────────────────
RR_transfer_results <- tibble(
  training_data = c(
    "Training: dep_wor_data", "Training: META-FBZ",     "Training: eRisk-2021",
    "Training: META-FBZ",     "Training: META-FBZ",
    "Training: eRisk-2021",   "Training: eRisk-2021",
    "Training: dep_wor_data", "Training: dep_wor_data"
  ),
  test_data = c(
    "dep_wor_data", "META-FBZ",     "eRisk-2021",
    "eRisk-2021",   "dep_wor_data",
    "META-FBZ",     "dep_wor_data",
    "META-FBZ",     "eRisk-2021"
  ),
  r = c(
    r_val(dep_outcome,   deptexts$mapped_BDI_sum_RR_pred),
    r_val(meta_outcome,  meta$bdi_sum_RR_pred),
    r_val(erisk_outcome, erisk21_av_word_embeds$bdi_sum_RR_pred),
    r_val(erisk_outcome, erisk21_av_word_embeds$meta_pred_RR),
    r_val(dep_outcome,   deptexts$meta_pred_RR),
    r_val(meta_outcome,  meta$erisk21_pred_RR),
    r_val(dep_outcome,   deptexts$erisk21_pred_RR),
    r_val(meta_outcome,  meta$deptexts_pred_RR),
    r_val(erisk_outcome, erisk21_av_word_embeds$deptexts_pred_RR)
  ),
  p = c(
    p_val(dep_outcome,   deptexts$mapped_BDI_sum_RR_pred),
    p_val(meta_outcome,  meta$bdi_sum_RR_pred),
    p_val(erisk_outcome, erisk21_av_word_embeds$bdi_sum_RR_pred),
    p_val(erisk_outcome, erisk21_av_word_embeds$meta_pred_RR),
    p_val(dep_outcome,   deptexts$meta_pred_RR),
    p_val(meta_outcome,  meta$erisk21_pred_RR),
    p_val(dep_outcome,   deptexts$erisk21_pred_RR),
    p_val(meta_outcome,  meta$deptexts_pred_RR),
    p_val(erisk_outcome, erisk21_av_word_embeds$deptexts_pred_RR)
  )
) %>%
  add_stars()
# ── PLS ───────────────────────────────────────────────────────────────────────
PLS_transfer_results <- tibble(
  training_data = RR_transfer_results$training_data,
  test_data     = RR_transfer_results$test_data,
  r = c(
    r_val(dep_outcome,   deptexts$mapped_BDI_sum_PLS_pred),
    r_val(meta_outcome,  meta$bdi_sum_PLS_pred),
    r_val(erisk_outcome, erisk21_av_word_embeds$bdi_sum_PLS_pred),
    r_val(erisk_outcome, erisk21_av_word_embeds$meta_pred_PLS),
    r_val(dep_outcome,   deptexts$meta_pred_PLS),
    r_val(meta_outcome,  meta$erisk21_pred_PLS),
    r_val(dep_outcome,   deptexts$erisk21_pred_PLS),
    r_val(meta_outcome,  meta$deptexts_pred_PLS),
    r_val(erisk_outcome, erisk21_av_word_embeds$deptexts_pred_PLS)
  ),
  p = c(
    p_val(dep_outcome,   deptexts$mapped_BDI_sum_PLS_pred),
    p_val(meta_outcome,  meta$bdi_sum_PLS_pred),
    p_val(erisk_outcome, erisk21_av_word_embeds$bdi_sum_PLS_pred),
    p_val(erisk_outcome, erisk21_av_word_embeds$meta_pred_PLS),
    p_val(dep_outcome,   deptexts$meta_pred_PLS),
    p_val(meta_outcome,  meta$erisk21_pred_PLS),
    p_val(dep_outcome,   deptexts$erisk21_pred_PLS),
    p_val(meta_outcome,  meta$deptexts_pred_PLS),
    p_val(erisk_outcome, erisk21_av_word_embeds$deptexts_pred_PLS)
  )
) %>%
  add_stars()
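
The nine train/test combinations above are written out by hand. As a minimal sketch of the same bookkeeping, the loop below builds the full train-by-test grid from a dictionary of predictions. The dataset labels, toy scores, and the `predictions` layout are hypothetical and only illustrate the structure; in-domain cells are assumed to hold out-of-fold predictions and cross-domain cells to hold transfer-model predictions.

```python
import math
from itertools import product


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ssx * ssy)


# Hypothetical observed questionnaire scores per dataset (toy values).
observed = {
    "A": [1, 2, 3, 4, 5],
    "B": [10, 20, 15, 30, 25],
}

# Hypothetical predictions keyed by (training set, test set).
# Diagonal keys stand for out-of-fold predictions, off-diagonal keys for transfer.
predictions = {
    ("A", "A"): [1.1, 1.9, 3.2, 3.8, 5.1],
    ("B", "A"): [2, 1, 4, 3, 5],
    ("B", "B"): [12, 18, 16, 28, 26],
    ("A", "B"): [11, 14, 13, 12, 15],
}

# One row per train/test combination, mirroring the columns of the results table.
rows = [
    {
        "training_data": f"Training: {train}",
        "test_data": test,
        "in_domain": train == test,
        "r": pearson_r(observed[test], predictions[(train, test)]),
    }
    for train, test in product(observed, repeat=2)
]

for row in rows:
    print(row["training_data"], "->", row["test_data"], f"r = {row['r']:.2f}")
```

Generating the grid programmatically keeps the training label, test label, and correlation for each cell aligned by construction, which is harder to guarantee when three parallel vectors are maintained by hand.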

Content Validity

A content analysis of individual items was conducted to test whether CCRs represent a coherent construct that is distinct from related psychopathology constructs, namely anxiety and stress. Following a procedure similar to Grand et al. (2022), the positive items of the BDI-II were embedded and compared to the embeddings of the Depression-Anxiety-Stress-Scale (42-item version; DASS-42) (Lovibond & Lovibond, 1995). The semantic similarity among BDI-II items (i.e., within similarity) was compared to the semantic similarity of BDI-II items to DASS-42 anxiety and stress sub-scale items (i.e., between similarity) via a Welch t-test. Within similarity was hypothesized to exceed between similarity. Additionally, the similarity of BDI-II items to the DASS-42 depression sub-scale was compared to their similarity to the anxiety and stress sub-scales via pairwise Welch t-tests. Similarity to depression items was expected to be higher than similarity to both anxiety and stress items.
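
The within vs. between comparison can be illustrated with a small, self-contained sketch. It uses randomly generated toy vectors in place of real item embeddings (the `toy_items` helper, the cluster centers, and the noise level are all assumptions made for illustration) and computes Welch's t statistic by hand rather than with the R functions used in the analysis.

```python
import math
import random
import statistics
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = statistics.variance(x) / len(x)
    vy = statistics.variance(y) / len(y)
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


rng = random.Random(0)


def toy_items(center, n, noise=0.3):
    """Hypothetical stand-in for item embeddings: noisy copies of a center vector."""
    return [[c + rng.gauss(0, noise) for c in center] for _ in range(n)]


target_items = toy_items([1.0] * 4 + [0.0] * 4, 6)  # plays the role of BDI-II items
other_items = toy_items([0.0] * 4 + [1.0] * 4, 6)   # plays the role of anxiety/stress items

# Within: all unordered pairs of target items (the lower triangle of the similarity matrix).
within = [cosine(u, v) for u, v in combinations(target_items, 2)]
# Between: every target item paired with every item of the other scale.
between = [cosine(u, v) for u, v in product(target_items, other_items)]

t_stat, df = welch_t(within, between)
print(f"within M = {statistics.fmean(within):.3f}, between M = {statistics.fmean(between):.3f}")
print(f"Welch t = {t_stat:.2f}, df = {df:.1f}")
```

Note that pairwise similarities share items and are therefore not independent observations, so the t statistic in this sketch is best read as a descriptive contrast.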

In [18]:

import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import torch
import os
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
sentmod = SentenceTransformer("mixedbread-ai/deepset-mxbai-embed-de-large-v1", device=device)
bdi_items = {
    "positive_anchors": [
        "I am so sad and unhappy that I can't stand it.",
        "I feel my future is hopeless and will only get worse.",
        "I feel I am a total failure as a person.",
        "I can't get any pleasure from the things I used to enjoy.",
        "I feel guilty all of the time.",
        "I feel I am being punished.",
        "I dislike myself.",
        "I blame myself for everything bad that happens.",
        "I would kill myself if I had the chance.",
        "I feel like crying, but I can't.",
        "I am so restless or agitated that I have to keep moving or doing something.",
        "It's hard to get interested in anything.",
        "I have trouble making any decisions.",
        "I feel utterly worthless.",
        "I don't have enough energy to do anything.",
        "I have experienced changes in my sleeping pattern.",
        "I am irritable all the time.",
        "I have experienced changes in my appetite.",
        "I find I can't concentrate on anything.",
        "I am too tired or fatigued to do most of the things I used to do.",
        "I have lost interest in sex completely."
    ],
    "negative_anchors": [
        "I do not feel sad.",
        "I am not discouraged about my future.",
        "I do not feel like a failure.",
        "I get as much pleasure as I ever did from the things I enjoy.",
        "I don't feel particularly guilty.",
        "I don't feel I am being punished.",
        "I feel the same about myself as ever.",
        "I don't criticize or blame myself more than usual.",
        "I don't have any thoughts of killing myself.",
        "I don't cry any more than I used to.",
        "I am no more restless or wound up than usual.",
        "I have not lost interest in other people or activities.",
        "I make decisions about as well as ever.",
        "I don't feel I am worthless.",
        "I have as much energy as ever.",
        "I have not experienced any change in my sleeping pattern.",
        "I am not more irritable than usual.",
        "I have not experienced any change in my appetite.",
        "I can concentrate as well as ever.",
        "I am no more tired or fatigued than usual.",
        "I have not noticed any recent change in my interest in sex."
    ]
}
dass_items = {
    "depression": [
        "I couldn't seem to experience any positive feeling at all.",
        "I just couldn't seem to get going.",
        "I felt that I had nothing to look forward to.",
        "I felt sad and depressed.",
        "I felt that I had lost interest in just about everything.",
        "I felt I wasn't worth much as a person.",
        "I felt that life wasn't worthwhile.",
        "I couldn't seem to get any enjoyment out of the things I did.",
        "I felt down-hearted and blue.",
        "I was unable to become enthusiastic about anything.",
        "I felt I was pretty worthless.",
        "I could see nothing in the future to be hopeful about.",
        "I felt that life was meaningless.",
        "I found it difficult to work up the initiative to do things."
    ],
    "anxiety": [
        "I was aware of dryness of my mouth.",
        "I experienced breathing difficulty.",
        "I had a feeling of shakiness.",
        "I found myself in situations that made me so anxious I was most relieved when they ended.",
        "I had a feeling of faintness.",
        "I perspired noticeably.",
        "I felt scared without any good reason.",
        "I had difficulty in swallowing.",
        "I was aware of the action of my heart in the absence of physical exertion.",
        "I felt I was close to panic.",
        "I feared that I would be thrown by some trivial but unfamiliar task.",
        "I felt terrified.",
        "I was worried about situations in which I might panic and make a fool of myself.",
        "I experienced trembling."
    ],
    "stress": [
        "I found myself getting upset by quite trivial things.",
        "I tended to over-react to situations.",
        "I found it difficult to relax.",
        "I found myself getting upset rather easily.",
        "I felt that I was using a lot of nervous energy.",
        "I found myself getting impatient when I was delayed.",
        "I felt that I was rather touchy.",
        "I found it hard to wind down.",
        "I found that I was very irritable.",
        "I found it hard to calm down after something upset me.",
        "I found it difficult to tolerate interruptions to what I was doing.",
        "I was in a state of nervous tension.",
        "I was intolerant of anything that kept me from getting on with what I was doing.",
        "I found myself getting agitated."
    ]
}
# embed items
bdi_embeds = sentmod.encode(bdi_items["positive_anchors"])
dass_dep_embeds = sentmod.encode(dass_items["depression"])
dass_anx_embeds = sentmod.encode(dass_items["anxiety"])
dass_str_embeds = sentmod.encode(dass_items["stress"])
# store as numpy arrays
np.save(os.path.join(r.PUBLIC_DATA_PATH, "bdi_embeds.npy"), bdi_embeds)
np.save(os.path.join(r.PUBLIC_DATA_PATH, "dass_dep_embeds.npy"), dass_dep_embeds)
np.save(os.path.join(r.PUBLIC_DATA_PATH, "dass_anx_embeds.npy"), dass_anx_embeds)
np.save(os.path.join(r.PUBLIC_DATA_PATH, "dass_str_embeds.npy"), dass_str_embeds)

In [19]:

# load item embeddings
bdi_embeds <- np$load(file.path(PUBLIC_DATA_PATH, "bdi_embeds.npy"))
dass_dep_embeds <- np$load(file.path(PUBLIC_DATA_PATH, "dass_dep_embeds.npy"))
dass_anx_embeds <- np$load(file.path(PUBLIC_DATA_PATH, "dass_anx_embeds.npy"))
dass_str_embeds <- np$load(file.path(PUBLIC_DATA_PATH, "dass_str_embeds.npy"))
# compute cosine similarities
## BDI internal similarity
bdi_internal_sims <- cosine(t(bdi_embeds))
bdi_internal <- bdi_internal_sims[lower.tri(bdi_internal_sims)]
## BDI-DASS-depression similarity
bdi_norm <- sqrt(rowSums(bdi_embeds^2))
dassdep_norm <- sqrt(rowSums(dass_dep_embeds^2))
bdi_dassdep_sims <- (bdi_embeds %*% t(dass_dep_embeds)) / outer(bdi_norm, dassdep_norm)
bdi_dassdep <- as.vector(bdi_dassdep_sims)
## BDI-DASS-anxiety similarity
dassanx_norm <- sqrt(rowSums(dass_anx_embeds^2))
bdi_dassanx_sims <- (bdi_embeds %*% t(dass_anx_embeds)) /
  outer(bdi_norm, dassanx_norm)
bdi_dassanx <- as.vector(bdi_dassanx_sims)
## BDI-DASS-stress similarity
dassstr_norm <- sqrt(rowSums(dass_str_embeds^2))
bdi_dassstr_sims <- (bdi_embeds %*% t(dass_str_embeds)) /
  outer(bdi_norm, dassstr_norm)
bdi_dassstr <- as.vector(bdi_dassstr_sims)

In [20]:

# BDI internal similarity vs. Similarity to other Questionnaires
internal_external_results <- data.frame(
  type = c(
    rep("Within", length(bdi_internal)),
    rep("Between", length(c(bdi_dassanx, bdi_dassstr)))
  ),
  similarity = c(
    bdi_internal,
    c(bdi_dassanx, bdi_dassstr)
  )
)
internal_external_results$type <- factor(internal_external_results$type,
                        levels = c("Within", "Between"))
# Welch t-test (unequal variances); the argument name in t.test() is var.equal
internal_external_ttest <- t.test(similarity ~ type, data = internal_external_results, var.equal = FALSE)
internal_external_d <- cohen.d(similarity ~ type, data = internal_external_results)
# BDI Similarity to DASS-42 Sub-Scales
item_similarity_data <- data.frame(
  subscale = rep(c("Depression", "Anxiety", "Stress"),
                 each = length(bdi_dassdep)),
  similarity = c(bdi_dassdep, bdi_dassanx, bdi_dassstr)
)
item_similarity_data$subscale <- factor(
  item_similarity_data$subscale,
  levels = c("Depression", "Anxiety", "Stress")
)
m_sd <- function(x) {
  sprintf("%.3f ± %.3f", mean(x), sd(x))
}
dep <- item_similarity_data$similarity[item_similarity_data$subscale == "Depression"]
anx <- item_similarity_data$similarity[item_similarity_data$subscale == "Anxiety"]
str <- item_similarity_data$similarity[item_similarity_data$subscale == "Stress"]
# Only two comparisons needed: dep vs anx, dep vs str
t_dep_anx <- t.test(dep, anx)
t_dep_str <- t.test(dep, str)
d_dep_anx <- effsize::cohen.d(dep, anx)$estimate
d_dep_str <- effsize::cohen.d(dep, str)$estimate
item_similarity_results <- data.frame(
  `DASS-42 scale` = c("Depression", "Anxiety", "Stress"),
  `M +- SD` = c(
    m_sd(dep),
    m_sd(anx),
    m_sd(str)
  ),
  `t-value` = c(
    NA,                               # no comparison for the reference group
    round(-t_dep_anx$statistic, 2),   # anxiety vs depression (sign flipped: anx - dep)
    round(-t_dep_str$statistic, 2)    # stress vs depression (sign flipped: str - dep)
  ),
  `p-value` = c(
    NA,
    format_p(t_dep_anx$p.value),
    format_p(t_dep_str$p.value)
  ),
  `Cohen's d` = c(
    NA,
    round(-d_dep_anx, 2),             # same sign convention as the t-values
    round(-d_dep_str, 2)
  ),
  check.names = FALSE
)
colnames(item_similarity_results) <- c("DASS-42 Scale", "M ± SD", 
                                        "t vs. Depression", 
                                        "p vs. Depression", 
                                        "d vs. Depression")

A psychometric tool can be considered to have high content validity if it represents the entire spectrum of the construct (or syndrome) to be assessed. To test this, the semantic similarities of Reddit sentences to the positive ($\mu_{pos}$) and negative ($\mu_{neg}$) BDI-II centroids were used to predict manual symptom relevance ratings (0 = irrelevant, 1 = relevant) for each of the 21 BDI-II symptoms individually. A logistic regression was fitted per symptom, and performance was assessed via the area under the receiver operating characteristic curve (AUROC). An AUROC was considered significant if its confidence interval excluded 0.5 (chance level).
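
The significance rule can be sketched as follows. This is an illustration under stated assumptions, not the analysis code: the scores are simulated stand-ins for the fitted probabilities of the logistic regression, and the interval is a percentile bootstrap, which is a swapped-in technique and may differ from the interval returned by `ci.auc` in the analysis. The function names `auroc` and `bootstrap_ci` are hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the share of (relevant, irrelevant) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUROC."""
    rng = random.Random(seed)
    idx = range(len(labels))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # a resample needs both classes
            stats.append(auroc(ys, [scores[i] for i in sample]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Simulated data: 40 relevant and 40 irrelevant sentences with shifted score distributions.
rng = random.Random(1)
labels = [1] * 40 + [0] * 40
scores = [rng.gauss(1.5, 1.0) for _ in range(40)] + [rng.gauss(0.0, 1.0) for _ in range(40)]

auc = auroc(labels, scores)
lower, upper = bootstrap_ci(labels, scores)
significant = not (lower <= 0.5 <= upper)  # same rule as in the text: CI must exclude 0.5
print(f"AUROC = {auc:.3f}, CI = [{lower:.3f}, {upper:.3f}], significant = {significant}")
```

Because the decision depends only on whether 0.5 lies inside the interval, the choice of interval method and confidence level directly determines which symptoms are flagged.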

In [21]:

content_validity_table <- data.frame(
  query = 1:21,
  nrel = NA,
  nirrel = NA,
  auc = NA,
  auc_lower_ci = NA,
  auc_upper_ci = NA,
  auc_sig = NA  # initialised here so the loop fills it row by row
)
for (query_i in 1:21) {
  # select data for given BDI item
  select_df <- erisk25[erisk25$query == query_i, ]
  # get N per relevance label
  content_validity_table$nrel[query_i] <- sum(select_df$relevant==1)
  content_validity_table$nirrel[query_i] <- sum(select_df$relevant==0)
  # fit logistic regression
  logreg <- glm(relevant ~ positive_centroid_similarity + negative_centroid_similarity,
                data = select_df,
                family = binomial)
  
  # predict relevancy labels
  roc_obj <- roc(select_df$relevant,
                 predict(logreg, type = "response"),
                 quiet = TRUE)  # suppress the per-call level/direction messages
  # get AUROC
  content_validity_table$auc[query_i] <- auc(roc_obj)
  # get CI
  ci <- ci.auc(roc_obj)
  content_validity_table$auc_lower_ci[query_i] <- ci[1]
  content_validity_table$auc_upper_ci[query_i] <- ci[3]
  # determine significance from CI
  content_validity_table$auc_sig[query_i] <- !(0.5 >= ci[1] & 0.5 <= ci[3])
}

colnames(content_validity_table) <- c("BDI item", "N<sub>Relevant</sub>", "N<sub>Irrelevant</sub>", "AUROC", "Lower CI", "Upper CI", "Significance")

Convergent Validity

Pearson correlations were used to test for significant associations between language-assessed depressivity and self-reported depressive symptoms. To compare the magnitude of text-rating correlations for different LBA methods, Steiger tests were conducted (Steiger, 1980).
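
For orientation, the following is a minimal sketch of Steiger's (1980) z for two dependent correlations that share one variable, here the self-report score. It reflects my reading of the formula (Fisher-transformed difference with a covariance term based on the averaged correlation) and is not the `cocor` implementation; the function name `steiger_z` and the example values are hypothetical.

```python
import math
from statistics import NormalDist


def steiger_z(r_jk, r_jh, r_kh, n):
    """Steiger's (1980) z for overlapping dependent correlations.

    r_jk: correlation of method 1 with the outcome
    r_jh: correlation of method 2 with the outcome
    r_kh: correlation between the two methods
    n:    number of complete cases
    """
    z_jk, z_jh = math.atanh(r_jk), math.atanh(r_jh)  # Fisher z transform
    r_bar_sq = ((r_jk + r_jh) / 2) ** 2              # squared mean of the two correlations
    cov = (
        r_kh * (1 - 2 * r_bar_sq)
        - 0.5 * r_bar_sq * (1 - 2 * r_bar_sq - r_kh ** 2)
    ) / (1 - r_bar_sq) ** 2
    z = (z_jk - z_jh) * math.sqrt(n - 3) / math.sqrt(2 - 2 * cov)
    p = 2 * (1 - NormalDist().cdf(abs(z)))           # two-sided p-value
    return z, p


# Hypothetical example: two methods correlating .60 and .40 with the outcome,
# and .50 with each other, in a sample of 200.
z, p = steiger_z(0.60, 0.40, 0.50, 200)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The test accounts for the fact that both correlations are computed on the same participants: the more strongly the two methods correlate with each other, the smaller the difference that can be detected as significant.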

In [22]:

erisk21_w <- left_join(erisk21_w, erisk21_av_word_embeds[, c("sub_id", "bdi_sum_RR_pred", "bdi_sum_PLS_pred", "meta_pred_RR", "deptexts_pred_RR", "meta_pred_PLS", "deptexts_pred_PLS")], by = "sub_id")

# ── Helper function ───────────────────────────────────────────────────────────
run_steiger <- function(data, outcome, methods, dataset_name) {
  
  comps <- combn(methods, 2, simplify = FALSE)
  
  map_dfr(comps, function(comp) {
    
    m1 <- comp[1]
    m2 <- comp[2]
    
    # restrict to complete cases across all three variables
    tmp <- data %>%
      select(all_of(c(m1, m2, outcome))) %>%
      drop_na()
    
    n    <- nrow(tmp)
    r_jk <- cor(tmp[[m1]], tmp[[outcome]])
    r_jh <- cor(tmp[[m2]], tmp[[outcome]])
    r_kh <- cor(tmp[[m1]], tmp[[m2]])
    
    res <- cocor.dep.groups.overlap(
      r.jk = r_jk,
      r.jh = r_jh,
      r.kh = r_kh,
      n    = n,
      test = "steiger1980"
    )
    
    tibble(
      Dataset    = dataset_name,
      Method_1   = m1,
      Method_2   = m2,
      r_Method_1 = round(r_jk, 2),
      r_Method_2 = round(r_jh, 2),
      r_Methods  = round(r_kh, 2),
      n          = n,
      z          = round(res@steiger1980$statistic, 2),
      p          = format_p(res@steiger1980$p.value)
    )
  })
}
# ── Extract column-name vectors from global method lists ─────────────────────
deptexts_cvars <- sapply(deptexts_methods, function(m) m$var)
meta_cvars     <- sapply(meta_methods,     function(m) m$var)
erisk21_cvars  <- meta_cvars   # same column names as meta
# ── Run Steiger tests ────────────────────────────────────────────────────────
deptexts_steiger <- run_steiger(
  data = deptexts,
  outcome = "PHQ9tot",
  methods = deptexts_cvars,
  dataset_name = "dep_wor_data"
)
erisk21_steiger <- run_steiger(
  data = erisk21_w,
  outcome = "bdi_sum",
  methods = erisk21_cvars,
  dataset_name = "eRisk-2021"
)
meta_steiger <- run_steiger(
  data = meta,
  outcome = "bdi_sum_DU-Prä",
  methods = meta_cvars,
  dataset_name = "META-FBZ"
)
# ── Final publication-ready dataframe ───────────────────────────────────────
steiger_results <- bind_rows(
  deptexts_steiger,
  erisk21_steiger,
  meta_steiger
) %>%
  mutate(
    Method_1 = case_when(
      grepl("RR_pred", Method_1)  ~ "RR",
      grepl("PLS_pred", Method_1) ~ "PLS",
      TRUE ~ Method_1
    ),
    Method_2 = case_when(
      grepl("RR_pred", Method_2)  ~ "RR",
      grepl("PLS_pred", Method_2) ~ "PLS",
      TRUE ~ Method_2
    )
  )

In [23]:

# ── Correlation stats: CCR / aCCR ────────────────────────────────────────────
F3a_cor <- paste0("r = ", round(cor.test(deptexts$CCR,  deptexts$PHQ9tot)$estimate, 2))
F3a_p   <- format_p(cor.test(deptexts$CCR,  deptexts$PHQ9tot)$p.value)
F3b_cor <- paste0("r = ", round(cor.test(deptexts$aCCR, deptexts$PHQ9tot)$estimate, 2))
F3b_p   <- format_p(cor.test(deptexts$aCCR, deptexts$PHQ9tot)$p.value)
F3c_cor <- paste0("r = ", round(cor.test(erisk21_w$CCR,  erisk21_w$bdi_sum)$estimate, 2))
F3c_p   <- format_p(cor.test(erisk21_w$CCR,  erisk21_w$bdi_sum)$p.value)
F3d_cor <- paste0("r = ", round(cor.test(erisk21_w$aCCR, erisk21_w$bdi_sum)$estimate, 2))
F3d_p   <- format_p(cor.test(erisk21_w$aCCR, erisk21_w$bdi_sum)$p.value)
F3e_cor <- paste0("r = ", round(cor.test(meta$CCR,  meta$`bdi_sum_DU-Prä`)$estimate, 2))
F3e_p   <- format_p(cor.test(meta$CCR,  meta$`bdi_sum_DU-Prä`)$p.value)
F3f_cor <- paste0("r = ", round(cor.test(meta$aCCR, meta$`bdi_sum_DU-Prä`)$estimate, 2))
F3f_p   <- format_p(cor.test(meta$aCCR, meta$`bdi_sum_DU-Prä`)$p.value)
# ── Correlation stats: Ridge ──────────────────────────────────────────────────
F3g_cor <- paste0("r = ", round(cor.test(deptexts$PHQ9tot_RR_pred, deptexts$PHQ9tot)$estimate, 2))
F3g_p   <- format_p(cor.test(deptexts$PHQ9tot_RR_pred, deptexts$PHQ9tot)$p.value)
F3h_cor <- paste0("r = ", round(cor.test(erisk21_w$bdi_sum_RR_pred, erisk21_w$bdi_sum)$estimate, 2))
F3h_p   <- format_p(cor.test(erisk21_w$bdi_sum_RR_pred, erisk21_w$bdi_sum)$p.value)
F3i_cor <- paste0("r = ", round(cor.test(meta$bdi_sum_RR_pred, meta$`bdi_sum_DU-Prä`)$estimate, 2))
F3i_p   <- format_p(cor.test(meta$bdi_sum_RR_pred, meta$`bdi_sum_DU-Prä`)$p.value)
# ── Correlation stats: PLS ────────────────────────────────────────────────────
F3j_cor <- paste0("r = ", round(cor.test(deptexts$PHQ9tot_PLS_pred, deptexts$PHQ9tot)$estimate, 2))
F3j_p   <- format_p(cor.test(deptexts$PHQ9tot_PLS_pred, deptexts$PHQ9tot)$p.value)
F3k_cor <- paste0("r = ", round(cor.test(erisk21_w$bdi_sum_PLS_pred, erisk21_w$bdi_sum)$estimate, 2))
F3k_p   <- format_p(cor.test(erisk21_w$bdi_sum_PLS_pred, erisk21_w$bdi_sum)$p.value)
F3l_cor <- paste0("r = ", round(cor.test(meta$bdi_sum_PLS_pred, meta$`bdi_sum_DU-Prä`)$estimate, 2))
F3l_p   <- format_p(cor.test(meta$bdi_sum_PLS_pred, meta$`bdi_sum_DU-Prä`)$p.value)

Divergent Validity

Pearson correlations were used to test whether language-assessed depressivity would correlate with other mental health symptoms. Correlations of LBAs with depression scores (DASS-21-Depression in META-FBZ; PHQ-9 in dep_wor_data) were compared with correlations of LBAs with anxiety (DASS-21-Anxiety in META-FBZ; GAD-7 in dep_wor_data) and stress self-reports (DASS-21-Stress in META-FBZ). Steiger tests were conducted to statistically compare correlations.
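
To make the comparison concrete, the following minimal sketch computes Steiger's (1980) z for two dependent correlations that share one variable, using base R only. It is an illustration, not the analysis code: the function name `steiger_z` and the example values are hypothetical, and the analyses reported here rely on `cocor.dep.groups.overlap()` with `test = "steiger1980"` from the cocor package.

```r
# Sketch of Steiger's (1980) z for two dependent, overlapping correlations.
# r_jk: correlation of predictor j with measure k
# r_jh: correlation of predictor j with measure h
# r_kh: correlation between the two measures k and h
# n:    number of complete cases
steiger_z <- function(r_jk, r_jh, r_kh, n) {
  r_bar <- (r_jk + r_jh) / 2                      # average of the two correlations
  # covariance term of the two Fisher-transformed correlations
  cov_z <- (r_kh * (1 - 2 * r_bar^2) -
              0.5 * r_bar^2 * (1 - 2 * r_bar^2 - r_kh^2)) /
    (1 - r_bar^2)^2
  z <- (atanh(r_jk) - atanh(r_jh)) * sqrt(n - 3) / sqrt(2 - 2 * cov_z)
  p <- 2 * pnorm(-abs(z))                         # two-sided p-value
  c(z = z, p = p)
}

# Hypothetical example values (not taken from the data)
example_res <- steiger_z(r_jk = 0.30, r_jh = 0.10, r_kh = 0.50, n = 400)
print(round(example_res, 3))
```

A positive z indicates that the predictor correlates more strongly with the first measure than with the second; identical correlations yield z = 0.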

In [24]:

# ── Helper function ───────────────────────────────────────────────────────────
run_divergent_steiger <- function(
    data,
    predictor,
    measure_a,
    measure_b,
    dataset_name,
    method_name,
    measure_a_label,
    measure_b_label
) {
  
  # complete cases
  tmp <- data %>%
    select(all_of(c(predictor, measure_a, measure_b))) %>%
    drop_na()
  
  n <- nrow(tmp)
  
  # correlations
  r_jk <- cor(tmp[[predictor]], tmp[[measure_a]])
  r_jh <- cor(tmp[[predictor]], tmp[[measure_b]])
  r_kh <- cor(tmp[[measure_a]], tmp[[measure_b]])
  
  # Steiger test
  res <- cocor.dep.groups.overlap(
    r.jk = r_jk,
    r.jh = r_jh,
    r.kh = r_kh,
    n = n,
    test = "steiger1980"
  )
  
  tibble(
    Dataset = dataset_name,
    Method = method_name,
    Measure_A = measure_a_label,
    Measure_B = measure_b_label,
    r_A = round(r_jk, 2),
    r_B = round(r_jh, 2),
    r_AB = round(r_kh, 2),
    n = n,
    z = round(res@steiger1980$statistic, 2),
    p = format_p(res@steiger1980$p.value)
  )
}
# ── dep_wor_data: PHQ-9 vs GAD-7 ─────────────────────────────────────────────
deptexts_results <- map_dfr(deptexts_methods, function(m) {
  
  run_divergent_steiger(
    data = deptexts,
    predictor = m$var,
    measure_a = "PHQ9tot",
    measure_b = "GAD7tot",
    dataset_name = "dep_wor_data",
    method_name = m$label,
    measure_a_label = "PHQ-9",
    measure_b_label = "GAD-7"
  )
})
# ── META-FBZ: DASS Depression vs Anxiety ─────────────────────────────────────
meta_dep_anx_results <- map_dfr(meta_methods, function(m) {
  
  run_divergent_steiger(
    data = meta,
    predictor = m$var,
    measure_a = "dass_dep_score_DU-Prä",
    measure_b = "dass_anx_score_DU-Prä",
    dataset_name = "META-FBZ",
    method_name = m$label,
    measure_a_label = "DASS Depression",
    measure_b_label = "DASS Anxiety"
  )
})
# ── META-FBZ: DASS Depression vs Stress ──────────────────────────────────────
meta_dep_str_results <- map_dfr(meta_methods, function(m) {
  
  run_divergent_steiger(
    data = meta,
    predictor = m$var,
    measure_a = "dass_dep_score_DU-Prä",
    measure_b = "dass_str_score_DU-Prä",
    dataset_name = "META-FBZ",
    method_name = m$label,
    measure_a_label = "DASS Depression",
    measure_b_label = "DASS Stress"
  )
})
# ── Final publication-ready table ────────────────────────────────────────────
divergent_validity_table <- bind_rows(
  deptexts_results,
  meta_dep_anx_results,
  meta_dep_str_results
)

Criterion Validity

Associations of language-assessed depression with four external criteria served as tests of criterion validity. In META-FBZ, primary diagnosis of a depressive disorder (semi-structured clinical interview) and work ability status (currently working vs. on sick leave) served as external criteria. Comparisons were made using Welch t-tests. In dep_wor_data, the number of healthcare visits and sick leave days in the past year served as external criteria. Pearson correlations were used to test for these associations.
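
As a rough illustration of the group comparison described above, the sketch below runs a Welch t-test and computes Cohen's d with a pooled standard deviation on simulated scores. The data, the group sizes, and the helper name `cohens_d_pooled` are assumptions made for the example; the reported effect sizes come from the `cohen.d()` call in the analysis code, not from this helper.

```r
set.seed(1)
# Simulated language-based scores for two hypothetical groups
scores_dep    <- rnorm(60, mean = 0.5, sd = 1)   # e.g., depression diagnosis
scores_no_dep <- rnorm(80, mean = 0.0, sd = 1)   # e.g., no depression diagnosis

# Welch t-test: t.test() does not assume equal variances by default
welch_res <- t.test(scores_dep, scores_no_dep)

# Cohen's d based on the pooled standard deviation
cohens_d_pooled <- function(x1, x2) {
  n1 <- length(x1)
  n2 <- length(x2)
  sd_pooled <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
  (mean(x1) - mean(x2)) / sd_pooled
}

d_est <- cohens_d_pooled(scores_dep, scores_no_dep)
cat("t =", round(welch_res$statistic, 2),
    ", df =", round(welch_res$parameter, 1),
    ", d =", round(d_est, 2), "\n")
```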

In [25]:

# ── Helper functions ─────────────────────────────────────────────────────────
run_ttest <- function(
    data,
    predictor,
    grouping_var,
    group_1,
    group_2,
    dataset_name,
    method_name,
    criterion_label
) {
  
  tmp <- data %>%
    filter(.data[[grouping_var]] %in% c(group_1, group_2)) %>%
    select(all_of(c(predictor, grouping_var))) %>%
    drop_na()
  
  # extract numeric vectors (also works when `data` is a tibble)
  x1 <- tmp[[predictor]][tmp[[grouping_var]] == group_1]
  x2 <- tmp[[predictor]][tmp[[grouping_var]] == group_2]
  
  t_res <- t.test(x1, x2)
  d     <- cohen.d(x1, x2)$estimate
  
  tibble(
    Dataset            = dataset_name,
    Method             = method_name,
    External_Criterion = criterion_label,
    N                  = nrow(tmp),
    Effect_Size        = paste0("d = ", round(d, 2)),
    Test_Statistic     = paste0("t = ", round(t_res$statistic, 2)),
    p                  = format_p(t_res$p.value)
  )
}
run_correlation <- function(
    data,
    predictor,
    criterion,
    dataset_name,
    method_name,
    criterion_label
) {
  
  tmp <- data %>%
    select(all_of(c(predictor, criterion))) %>%
    drop_na()
  
  cor_res <- cor.test(tmp[[predictor]], tmp[[criterion]])
  
  tibble(
    Dataset            = dataset_name,
    Method             = method_name,
    External_Criterion = criterion_label,
    N                  = nrow(tmp),
    Effect_Size        = paste0("r = ", round(cor_res$estimate, 2)),
    Test_Statistic     = paste0("t = ", round(cor_res$statistic, 2)),
    p                  = format_p(cor_res$p.value)
  )
}
# ── META-FBZ: Diagnosis ──────────────────────────────────────────────────────
meta_diagnosis_results <- map_dfr(meta_methods, function(m) {
  run_ttest(
    data           = meta,
    predictor      = m$var,
    grouping_var   = "depression_diagnosis",
    group_1        = 1,
    group_2        = 0,
    dataset_name   = "META-FBZ",
    method_name    = m$label,
    criterion_label = "Primary Depression Diagnosis"
  )
})
# ── META-FBZ: Work ability ───────────────────────────────────────────────────
meta_work_results <- map_dfr(meta_methods, function(m) {
  run_ttest(
    data           = meta,
    predictor      = m$var,
    grouping_var   = "work_ability_status",
    group_1        = "Unable to work (on sick leave)",
    group_2        = "Able to work",
    dataset_name   = "META-FBZ",
    method_name    = m$label,
    criterion_label = "Work Ability Status"
  )
})
# ── dep_wor_data: Healthcare visits ──────────────────────────────────────────
deptexts_visits_results <- map_dfr(deptexts_methods, function(m) {
  run_correlation(
    data            = deptexts,
    predictor       = m$var,
    criterion       = "HealthCareVisits",
    dataset_name    = "dep_wor_data",
    method_name     = m$label,
    criterion_label = "Healthcare Visits"
  )
})
# ── dep_wor_data: Sick leave days ────────────────────────────────────────────
deptexts_sickdays_results <- map_dfr(deptexts_methods, function(m) {
  run_correlation(
    data            = deptexts,
    predictor       = m$var,
    criterion       = "SickDaysYear",
    dataset_name    = "dep_wor_data",
    method_name     = m$label,
    criterion_label = "Sick Leave Days"
  )
})
# ── Final publication-ready table ────────────────────────────────────────────
criterion_validity_table <- bind_rows(
  meta_diagnosis_results,
  meta_work_results,
  deptexts_visits_results,
  deptexts_sickdays_results
)

Results

Descriptive Statistics

Table 2 shows descriptive statistics for the three data-sets, totalling N = 1,181 participants. In the samples with available demographic information, mean age ranged from 33.62 to 40.25 years, and 39.4% to 62.2% of participants were female. Based on common cut-offs, between 17.4% and 42.5% of participants were severely depressed. In the data-set with the fewest words per person, responses were on average 55 words long, indicating sufficient data for language analyses.

In [26]:

# ── Data ──────────────────────────────────────────────────────────────────────
dataset_names <- c("dep_wor_data", "eRisk-2021", "META-FBZ")
N_subjects    <- c(nrow(deptexts), length(unique(erisk21$sub_id)), nrow(meta))
P_female <- c(mean(deptexts$Gender, na.rm = TRUE) * 100,
              NA,
              mean(meta$patient_sex == "female", na.rm = TRUE) * 100)
M_age  <- c(mean(deptexts$Age, na.rm = TRUE), NA,
            mean(meta$patient_age_therapy_start, na.rm = TRUE))
SD_age <- c(sd(deptexts$Age,  na.rm = TRUE), NA,
            sd(meta$patient_age_therapy_start,  na.rm = TRUE))
Dep_Measures <- c("PHQ-9", "BDI-II", "BDI-II")
M_Dep  <- c(mean(deptexts$PHQ9tot,          na.rm = TRUE),
            mean(erisk21_w$bdi_sum,          na.rm = TRUE),
            mean(meta$`bdi_sum_DU-Prä`,      na.rm = TRUE))
SD_Dep <- c(sd(deptexts$PHQ9tot,            na.rm = TRUE),
            sd(erisk21_w$bdi_sum,            na.rm = TRUE),
            sd(meta$`bdi_sum_DU-Prä`,        na.rm = TRUE))
P_severe <- c(
  mean(deptexts$PHQ9tot          >= 20, na.rm = TRUE) * 100,
  mean(erisk21_w$bdi_sum         >= 30, na.rm = TRUE) * 100,
  mean(meta$`bdi_sum_DU-Prä`     >= 30, na.rm = TRUE) * 100
)
N_texts <- c(1,
             round(nrow(erisk21) / length(unique(erisk21$sub_id)), 2),
             1)
N_words <- c(round(mean(deptexts$n_words), 2),
             round(sum(erisk21$n_words) / length(unique(erisk21$sub_id)), 2),
             round(mean(meta$n_words), 2))
M_CCR   <- c(mean(deptexts$CCR,  na.rm = TRUE), mean(erisk21$CCR,  na.rm = TRUE), mean(meta$CCR,  na.rm = TRUE))
SD_CCR  <- c(sd(deptexts$CCR,   na.rm = TRUE), sd(erisk21$CCR,   na.rm = TRUE), sd(meta$CCR,   na.rm = TRUE))
M_aCCR  <- c(mean(deptexts$aCCR, na.rm = TRUE), mean(erisk21$aCCR, na.rm = TRUE), mean(meta$aCCR, na.rm = TRUE))
SD_aCCR <- c(sd(deptexts$aCCR,  na.rm = TRUE), sd(erisk21$aCCR,  na.rm = TRUE), sd(meta$aCCR,  na.rm = TRUE))
# ── Helper ────────────────────────────────────────────────────────────────────
fmt <- function(m, s, digits = 2) {
  ifelse(is.na(m), "—", paste0(round(m, digits), " (", round(s, digits), ")"))
}
# ── Build table with inline group header rows ─────────────────────────────────
table1 <- data.frame(
  Variable = c(
    "**Sample**",
    "N",
    "% Female",
    "Age, M (SD)",
    "**Depression**",
    "Measure",
    "Score, M (SD)",
    "% Severe depression",
    "**Corpus size**",
    "Texts per person",
    "Words per person",
    "**CCR loadings**",
    "CCR, M (SD)",
    "aCCR, M (SD)"
  ),
  dep_wor_data = c(
    "", N_subjects[1], round(P_female[1], 1), fmt(M_age[1], SD_age[1]),
    "", Dep_Measures[1], fmt(M_Dep[1], SD_Dep[1]), round(P_severe[1], 1),
    "", N_texts[1], N_words[1],
    "", fmt(M_CCR[1], SD_CCR[1]), fmt(M_aCCR[1], SD_aCCR[1])
  ),
  eRisk_2021 = c(
    "", N_subjects[2], "—", "—",
    "", Dep_Measures[2], fmt(M_Dep[2], SD_Dep[2]), round(P_severe[2], 1),
    "", N_texts[2], N_words[2],
    "", fmt(M_CCR[2], SD_CCR[2]), fmt(M_aCCR[2], SD_aCCR[2])
  ),
  META_FBZ = c(
    "", N_subjects[3], round(P_female[3], 1), fmt(M_age[3], SD_age[3]),
    "", Dep_Measures[3], fmt(M_Dep[3], SD_Dep[3]), round(P_severe[3], 1),
    "", N_texts[3], N_words[3],
    "", fmt(M_CCR[3], SD_CCR[3]), fmt(M_aCCR[3], SD_aCCR[3])
  ),
  stringsAsFactors = FALSE
)
kable(
  table1,
  col.names = c("", "dep\\_wor\\_data", "eRisk-2021", "META-FBZ"),
  align     = c("l", "r", "r", "r"),
  booktabs  = TRUE
)

In [27]:

Table 2: Descriptive Statistics for the Three Datasets

	dep_wor_data	eRisk-2021	META-FBZ
Sample
N	500	80	601
% Female	39.4	—	62.2
Age, M (SD)	33.62 (11.87)	—	40.25 (14.22)
Depression
Measure	PHQ-9	BDI-II	BDI-II
Score, M (SD)	11.49 (7.53)	28.4 (12.65)	23.47 (12.73)
% Severe depression	17.4	42.5	31.1
Corpus size
Texts per person	1	344.92	1
Words per person	55.13	19847.89	84.25
CCR loadings
CCR, M (SD)	0.82 (0.02)	0.77 (0.03)	0.83 (0.02)
aCCR, M (SD)	0.03 (0.08)	0.02 (0.04)	0.09 (0.03)

Note. The cut-off for severe depression in the PHQ-9 was 20, while in the BDI-II the cut-off was set at a score of 30.

Results of Psychometric Evaluation

Face Validity

Figure 2 shows z-standardized CCR loadings for five example texts from dep_wor_data. Texts with higher CCR loadings described more acute/severe symptoms of depression, indicating that assessments based on CCRs exhibit face validity.

Content Validity

The semantic similarity among BDI-II items (within similarity; M = 0.797, SD = 0.035) was higher than the similarity of BDI-II items to DASS-42 anxiety and stress items (between similarity; M = 0.767, SD = 0.029, t = 11.25, p < .001, d = 0.99). Furthermore, BDI-II items were significantly more similar to DASS-42 depression items (M = 0.791, SD = 0.037) than to both DASS-42 anxiety (M = 0.758, SD = 0.028, t = 12, p < .001, d = 0.99) and stress items (M = 0.775, SD = 0.026, t = 5.78, p < .001, d = 0.48). See Figure 2.

Symptom relevance ratings could be predicted above chance for all 21 BDI-II symptoms with a mean classification performance of AUROC = 0.68 (SD = 0.06, range = [0.57; 0.77], Figure 2, Table 6).
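
For readers less familiar with the metric, the following sketch shows how an AUROC can be computed from classifier scores via the rank-sum (Mann-Whitney) identity in base R. The function name `auroc_rank` and the toy scores are hypothetical; this is not the classification pipeline used for the relevance ratings, only an illustration of what an AUROC between 0.5 (chance) and 1 (perfect separation) means.

```r
# AUROC as the probability that a randomly drawn positive case receives a
# higher score than a randomly drawn negative case (ties count as 0.5).
auroc_rank <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  ranks <- rank(scores)                      # average ranks handle ties
  (sum(ranks[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: similarity-based scores and binary relevance ratings
toy_scores <- c(0.10, 0.35, 0.40, 0.80, 0.65, 0.90)
toy_labels <- c(0,    0,    1,    1,    0,    1)
toy_auc <- auroc_rank(toy_scores, toy_labels)
print(toy_auc)
```

In the toy example, eight of the nine positive-negative pairs are ordered correctly, so the AUROC is 8/9.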

In [28]:

F2a <- ggplot(internal_external_results, aes(x = type, y = similarity)) +
  geom_boxplot(width = 0.6, alpha = 0.6, outlier.alpha = 0.3) +
  geom_jitter(width = 0.05, alpha = 0.4, color = grey_col) +
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.2))) +
  theme_minimal() +
  labs(
    x = NULL,
    y = "Item similarity"
  ) +
  stat_compare_means(
    comparisons = list(c("Within", "Between")),
    method = "t.test",
    label = "p.signif"
  )
F2b <- ggplot(item_similarity_data, aes(x = subscale, y = similarity)) +
  geom_boxplot(width = 0.6, outlier.alpha = 0.3) +
  geom_jitter(width = 0.05, alpha = 0.5, color = grey_col) +
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.2))) +
  theme_minimal() +
  labs(
    x = "DASS-42 sub-scale",
    y = "Item Similarity to BDI-II"
  ) +
  stat_compare_means(
    comparisons = list(
      c("Depression", "Anxiety"),
      c("Depression", "Stress")
    ),
    method = "t.test",
    label = "p.signif"
  )
item_labels <- c(
  "Sadness", "Pessimism", "Past Failure", "Loss of Pleasure",
  "Guilty Feelings", "Punishment Feelings", "Self-Dislike", "Self-Criticalness",
  "Suicidal Thoughts", "Crying", "Agitation", "Loss of Interest", "Indecisiveness",
  "Worthlessness", "Loss of Energy", "Sleep Changes", "Irritability", "Appetite Changes",
  "Concentration Difficulty", "Tiredness", "Loss of Interest in Sex"
)
F2c <- ggplot(content_validity_table,
              aes(x = factor(`BDI item`, labels = item_labels),
                  y = `AUROC`,
                  fill = Significance)) +
  geom_col() +
  geom_errorbar(aes(ymin = `Lower CI`,
                    ymax = `Upper CI`),
                width = 0.2) +
  scale_fill_manual(values = c("TRUE" = ccr_col,
                               "FALSE" = grey_col)) +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  labs(
    x = "BDI Item",
    y = "AUROC ± CI",
    fill = NULL
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
# arrange panels
top_row <- plot_grid(
  F2a, F2b,
  labels = c("a", "b"),
  ncol = 2
)
bottom_row <- plot_grid(
  F2c,
  labels = "c",
  ncol = 1
)
plot_grid(
  top_row,
  bottom_row,
  ncol = 1,
  rel_heights = c(1, 1)
)

Figure 2: Content Validity. *p < .05, **p < .01, ***p < .001, ns = not significant.

Note. a, Boxplots comparing the semantic similarity among BDI-II items (Within) with the similarity of BDI-II items to DASS-42 anxiety and stress items (Between). b, Boxplots comparing the semantic similarity of DASS-42 sub-scale items to BDI-II items. c, Bar plot showing the performance of a logistic regression classifier trained to predict BDI-item relevance ratings. For a given BDI item, the semantic similarity of sentences to positive and negative item formulations served as input features. Significance was assessed based on the CI (error bars) of the area under the receiver operating characteristic curve (AUROC). *p < .05, **p < .01, ***p < .001, ****p < .0001, ns = p > .05.

Convergent Validity

Significant positive correlations of language-assessed depression with self-reported depression scores emerged across all data-sets and methods (see Figure 3). Text-rating correlations were highest in dep_wor_data, with moderate correlations for CCR (r = 0.43, p < .001) and significantly larger correlations for aCCR (r = 0.64, p < .001), and yet larger correlations for RR (r = 0.69, p < .001), and PLS (r = 0.68, p < .001). In eRisk-2021, small-to-moderate correlations emerged for CCR (r = 0.31, p = .006), aCCR (r = 0.36, p = .001), and PLS (r = 0.26, p = .022), while RR (r = 0.15, p = .193) showed no significant correlation and was outperformed by aCCR and PLS. In META-FBZ, small correlations were found for CCR (r = 0.17, p < .001), aCCR (r = 0.14, p < .001), RR (r = 0.26, p < .001), and PLS (r = 0.26, p < .001) with RR outperforming aCCR and PLS outperforming both CCR and aCCR. Full results of text-rating correlation comparisons are given in Table 3.

In [29]:

# ── Plots: no titles on individual panels ────────────────────────────────────
# Row 1: CCR
F3a <- ggplot(deptexts, aes(x=CCR, y=PHQ9tot)) +
  geom_point(color=grey_col) + geom_smooth(method="lm", color=ccr_col, fill=ccr_col) +
  ann(paste0(F3a_cor, "\n", F3a_p)) + labs(x="CCR loading", y="PHQ-9") + theme_minimal()
F3c <- ggplot(erisk21_w, aes(x=CCR, y=bdi_sum)) +
  geom_point(color=grey_col) + geom_smooth(method="lm", color=ccr_col, fill=ccr_col) +
  ann(paste0(F3c_cor, "\n", F3c_p)) + labs(x="CCR loading", y="BDI") + theme_minimal()
F3e <- ggplot(meta, aes(x=CCR, y=`bdi_sum_DU-Prä`)) +
  geom_point(color=grey_col) + geom_smooth(method="lm", color=ccr_col, fill=ccr_col) +
  ann(paste0(F3e_cor, "\n", F3e_p)) + labs(x="CCR loading", y="BDI") + theme_minimal()
# Row 2: aCCR
F3b <- ggplot(deptexts, aes(x=aCCR, y=PHQ9tot)) +
  geom_point(color=grey_col) + geom_smooth(method="lm", color=accr_col, fill=accr_col) +
  ann(paste0(F3b_cor, "\n", F3b_p)) + labs(x="aCCR loading", y="PHQ-9") +
  scale_x_continuous(n.breaks = 4) + theme_minimal()
F3d <- ggplot(erisk21_w, aes(x=aCCR, y=bdi_sum)) +
  geom_point(color=grey_col) + geom_smooth(method="lm", color=accr_col, fill=accr_col) +
  ann(paste0(F3d_cor, "\n", F3d_p)) + labs(x="aCCR loading", y="BDI") +
  scale_x_continuous(n.breaks = 4) + theme_minimal()
F3f <- ggplot(meta, aes(x=aCCR, y=`bdi_sum_DU-Prä`)) +
  geom_point(color=grey_col) + geom_smooth(method="lm", color=accr_col, fill=accr_col) +
  ann(paste0(F3f_cor, "\n", F3f_p)) + labs(x="aCCR loading", y="BDI") +
  scale_x_continuous(n.breaks = 4) + theme_minimal()
# Row 3: Ridge
F3g <- ggplot(deptexts, aes(x=PHQ9tot_RR_pred, y=PHQ9tot)) +
  geom_point(color=grey_col) + geom_smooth(method="lm", color=rr_col, fill=rr_col) +
  ann(paste0(F3g_cor, "\n", F3g_p)) + labs(x="Pred. score", y="PHQ-9") + theme_minimal()
F3h <- ggplot(erisk21_w, aes(x=bdi_sum_RR_pred, y=bdi_sum)) +
  geom_point(color=grey_col) + geom_smooth(method="lm", color=rr_col, fill=rr_col) +
  ann(paste0(F3h_cor, "\n", F3h_p)) + labs(x="Pred. score", y="BDI") + theme_minimal()
F3i <- ggplot(meta, aes(x=bdi_sum_RR_pred, y=`bdi_sum_DU-Prä`)) +
  geom_point(color=grey_col) + geom_smooth(method="lm", color=rr_col, fill=rr_col) +
  ann(paste0(F3i_cor, "\n", F3i_p)) + labs(x="Pred. score", y="BDI") + theme_minimal()
# Row 4: PLS
F3j <- ggplot(deptexts, aes(x=PHQ9tot_PLS_pred, y=PHQ9tot)) +
  geom_point(color=grey_col) + geom_smooth(method="lm", color=pls_col, fill=pls_col) +
  ann(paste0(F3j_cor, "\n", F3j_p)) + labs(x="Pred. score", y="PHQ-9") + theme_minimal()
F3k <- ggplot(erisk21_w, aes(x=bdi_sum_PLS_pred, y=bdi_sum)) +
  geom_point(color=grey_col) + geom_smooth(method="lm", color=pls_col, fill=pls_col) +
  ann(paste0(F3k_cor, "\n", F3k_p)) + labs(x="Pred. score", y="BDI") + theme_minimal()
F3l <- ggplot(meta, aes(x=bdi_sum_PLS_pred, y=`bdi_sum_DU-Prä`)) +
  geom_point(color=grey_col) + geom_smooth(method="lm", color=pls_col, fill=pls_col) +
  ann(paste0(F3l_cor, "\n", F3l_p)) + labs(x="Pred. score", y="BDI") + theme_minimal()
# ── Column headers (once at top) ──────────────────────────────────────────────
headers <- plot_grid(
  ggdraw(),                          # gutter spacer matching row-label width
  col_header("dep_wor_data"),
  col_header("eRisk-2021"),
  col_header("META-FBZ"),
  ncol = 4, rel_widths = c(0.05, 1, 1, 1)
)
# ── Assemble ──────────────────────────────────────────────────────────────────
Figure3 <- plot_grid(
  headers,
  labeled_row(row_label("CCR"),   F3a, F3c, F3e),
  labeled_row(row_label("aCCR"),  F3b, F3d, F3f),
  labeled_row(row_label("RR"), F3g, F3h, F3i),
  labeled_row(row_label("PLS"),   F3j, F3k, F3l),
  ncol = 1, rel_heights = c(0.08, 1, 1, 1, 1)
)

`geom_smooth()` using formula = 'y ~ x'

Figure3

Note. Correlations of language-assessed depression with self-reported depression. Rows show correlations for CCR, aCCR, RR, and PLS; columns show correlations for the data-sets dep_wor_data, eRisk-2021, and META-FBZ.

In [30]:

# Build table with inline group headers and N folded in
convergent_table <- data.frame(
  `Method 1` = c(
    "**dep\\_wor\\_data**", "CCR", "CCR", "CCR", "aCCR", "aCCR", "RR",
    "**eRisk-2021**",        "CCR", "CCR", "CCR", "aCCR", "aCCR", "RR",
    "**META-FBZ**",         "CCR", "CCR", "CCR", "aCCR", "aCCR", "RR"
  ),
  `Method 2` = c(
    "", "aCCR", "RR", "PLS", "RR", "PLS", "PLS",
    "", "aCCR", "RR", "PLS", "RR", "PLS", "PLS",
    "", "aCCR", "RR", "PLS", "RR", "PLS", "PLS"
  ),
  `r M1` = c(
    "", steiger_results$r_Method_1[steiger_results$Dataset == "dep_wor_data"],
    "", steiger_results$r_Method_1[steiger_results$Dataset == "eRisk-2021"],
    "", steiger_results$r_Method_1[steiger_results$Dataset == "META-FBZ"]
  ),
  `r M2` = c(
    "", steiger_results$r_Method_2[steiger_results$Dataset == "dep_wor_data"],
    "", steiger_results$r_Method_2[steiger_results$Dataset == "eRisk-2021"],
    "", steiger_results$r_Method_2[steiger_results$Dataset == "META-FBZ"]
  ),
  `r M1,M2` = c(
    "", steiger_results$r_Methods[steiger_results$Dataset == "dep_wor_data"],
    "", steiger_results$r_Methods[steiger_results$Dataset == "eRisk-2021"],
    "", steiger_results$r_Methods[steiger_results$Dataset == "META-FBZ"]
  ),
  `z` = c(
    "", steiger_results$z[steiger_results$Dataset == "dep_wor_data"],
    "", steiger_results$z[steiger_results$Dataset == "eRisk-2021"],
    "", steiger_results$z[steiger_results$Dataset == "META-FBZ"]
  ),
  `p` = c(
    "", steiger_results$p[steiger_results$Dataset == "dep_wor_data"],
    "", steiger_results$p[steiger_results$Dataset == "eRisk-2021"],
    "", steiger_results$p[steiger_results$Dataset == "META-FBZ"]
  ),
  check.names = FALSE
)
knitr::kable(
  convergent_table,
  col.names = c("Method 1", "Method 2", 
                "*r* M1", "*r* M2", "*r* M1,M2", 
                "*z*", "*p*"),
  align    = c("l", "l", "r", "r", "r", "r", "r"),
  booktabs = TRUE
)

In [31]:

Table 3: Convergent Validity

Method 1	Method 2	r M1	r M2	r M1,M2	z	p
dep_wor_data
CCR	aCCR	0.43	0.64	0.57	-6.09	p < .001
CCR	RR	0.43	0.69	0.57	-8.19	p < .001
CCR	PLS	0.43	0.68	0.57	-7.67	p < .001
aCCR	RR	0.64	0.69	0.87	-3.53	p < .001
aCCR	PLS	0.64	0.68	0.95	-4.2	p < .001
RR	PLS	0.69	0.68	0.94	1.32	p = .188
eRisk-2021
CCR	aCCR	0.31	0.36	0.66	-0.61	p = .542
CCR	RR	0.31	0.15	0.64	1.71	p = .088
CCR	PLS	0.31	0.26	0.69	0.59	p = .557
aCCR	RR	0.36	0.15	0.66	2.37	p = .018
aCCR	PLS	0.36	0.26	0.72	1.29	p = .197
RR	PLS	0.15	0.26	0.95	-3.06	p = .002
META-FBZ
CCR	aCCR	0.17	0.14	0.37	0.71	p = .475
CCR	RR	0.17	0.26	0.38	-1.94	p = .053
CCR	PLS	0.17	0.27	0.44	-2.3	p = .022
aCCR	RR	0.14	0.26	0.26	-2.44	p = .015
aCCR	PLS	0.14	0.27	0.25	-2.64	p = .008
RR	PLS	0.26	0.27	0.92	-0.72	p = .471

Note. Comparison of text-rating correlations between LBA methods using Steiger tests.

Results of transfer performance tests indicated that supervised LBA can generalize to novel domains, although not consistently and with significant losses in performance (see Figure 4 for visualization). Models trained on dep_wor_data showed the greatest transfer performance, with significant text-rating correlations on both eRisk-2021 (RR: r = 0.37, p < .001; PLS: r = 0.39, p < .001) and META-FBZ (RR: r = 0.19, p < .001; PLS: r = 0.14, p < .001). Models trained on eRisk-2021 showed poorer generalization performance, with decreased text-rating correlations on dep_wor_data (RR: r = 0.44, p < .001; PLS: r = 0.42, p < .001) and non-significant correlations on META-FBZ (RR: r = 0.07, p = .088; PLS: r = 0.07, p = .098). Models trained on META-FBZ showed inconsistent results, with losses in performance on dep_wor_data (RR: r = 0.34, p < .001; PLS: r = 0.31, p < .001) and no significant correlations on eRisk-2021 (RR: r = 0.2, p = .072; PLS: r = 0.22, p = .050). Table 7 contrasts text-rating correlations of transfer models with those involving CCRs. Transfer models trained on dep_wor_data performed as well on unseen test data as aCCR and CCR. In contrast, when transfer models were trained on either eRisk-2021 or META-FBZ, significant performance losses emerged.
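
The logic of a transfer test can be sketched as follows: a model is fitted in one domain and then applied, without refitting, to another. The example uses simulated features and a closed-form ridge solution in base R; the function names, the penalty value `lambda`, and the simulated data are assumptions for illustration and do not reproduce the models reported here.

```r
set.seed(42)
# Simulate two "domains" that share the same feature-outcome relationship
simulate_domain <- function(n, shift = 0) {
  X <- matrix(rnorm(n * 5, mean = shift), nrow = n, ncol = 5)
  y <- drop(X %*% c(1, -1, 0.5, 0, 0)) + rnorm(n, sd = 0.5)
  list(X = X, y = y)
}
train_dom <- simulate_domain(200)             # source domain
test_dom  <- simulate_domain(100, shift = 1)  # target domain with shifted features

# Closed-form ridge regression on centered data (the intercept is not penalized)
fit_ridge <- function(X, y, lambda = 1) {
  x_means <- colMeans(X)
  Xc <- sweep(X, 2, x_means)
  beta <- solve(crossprod(Xc) + lambda * diag(ncol(X)), crossprod(Xc, y - mean(y)))
  list(beta = drop(beta), intercept = mean(y) - sum(x_means * drop(beta)))
}
predict_ridge <- function(fit, X) drop(X %*% fit$beta) + fit$intercept

fit <- fit_ridge(train_dom$X, train_dom$y)

# Transfer performance: correlation of predictions with observed scores
# in the domain the model was not trained on
transfer_test <- cor.test(predict_ridge(fit, test_dom$X), test_dom$y)
cat("transfer r =", round(transfer_test$estimate, 2), "\n")
```

Because the correlation is invariant to shifts in location, a mean difference between domains alone does not reduce transfer performance in this toy setting; losses arise when the feature-outcome relationship itself differs between domains.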

In [32]:

F2a <- ggplot(RR_transfer_results, aes(x = test_data, y = r)) +
  geom_col() +
  geom_text(
    aes(
      label = p_label,
      y = ifelse(r >= 0, r + 0.02, r - 0.02),
      vjust = ifelse(r >= 0, 0, 1)
    ),
    size = 4
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  facet_wrap(~ training_data) +
  labs(
    x = "Test dataset",
    y = "Text-Rating Corr. (r)",
    title = "RR"
  ) +
  theme_bw() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
F2b <- ggplot(PLS_transfer_results, aes(x = test_data, y = r)) +
  geom_col() +
  geom_text(
    aes(
      label = p_label,
      y = ifelse(r >= 0, r + 0.02, r - 0.02),
      vjust = ifelse(r >= 0, 0, 1)
    ),
    size = 4
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  facet_wrap(~ training_data) +
  labs(
    x = "Test dataset",
    y = "Text-Rating Corr. (r)",
    title = "PLS Regression"
  ) +
  theme_bw() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
plot_grid(F2a, F2b, labels = "auto", ncol = 1)

Figure 4: Generalization of supervised LBA.

Note. Transfer performance of LBA based on RR (a) and PLS (b), measured as text-rating correlations for LBA models trained on one data-set and tested on another. *p < .05, **p < .01, ***p < .001, ****p < .0001, ns = p > .05.

Divergent Validity

In META-FBZ, all LBA methods resulted in significant positive correlations with DASS-42 Depression scores (CCR: r = 0.16, p = .001; aCCR: r = 0.11, p = .022; RR: r = 0.19, p < .001; PLS: r = 0.23, p < .001). In contrast, text-rating correlations with DASS-42 Anxiety were significant only for CCR (r = 0.1, p = .034). Correlations with DASS-42 Stress were significant only for RR (r = 0.12, p = .017) and PLS (r = 0.12, p = .017). When comparing text-rating correlation strength, only RR and PLS significantly differentiated depression from anxiety, and only PLS significantly differentiated depression from stress.

In dep_wor_data, all LBA methods resulted in significant positive correlations with GAD-7 scores (CCR: r = 0.39, p < .001; aCCR: r = 0.59, p < .001; RR: r = 0.63, p < .001; PLS: r = 0.62, p < .001). When text-rating correlations were compared, all methods significantly differentiated depression from anxiety.

In [33]:

# ── Subset data ───────────────────────────────────────────────────────────────
dw  <- divergent_validity_table[divergent_validity_table$Dataset == "dep_wor_data", ]
mfa <- divergent_validity_table[divergent_validity_table$Dataset == "META-FBZ" &
                                  divergent_validity_table$Measure_B == "DASS Anxiety", ]
mfs <- divergent_validity_table[divergent_validity_table$Dataset == "META-FBZ" &
                                  divergent_validity_table$Measure_B == "DASS Stress", ]
# ── Helper to extract columns cleanly ────────────────────────────────────────
fmt_rows <- function(df) {
  data.frame(
    Method  = df$Method,
    N       = as.character(df$n),
    rA      = as.character(round(df$r_A, 2)),
    rB      = as.character(round(df$r_B, 2)),
    rAB     = as.character(round(df$r_AB, 2)),
    z       = as.character(round(df$z, 2)),   # Steiger z statistic
    p       = df$p,
    stringsAsFactors = FALSE
  )
}
blank_row <- function(label) {
  data.frame(Method = label, N = "", rA = "", rB = "", rAB = "", z = "", p = "",
             stringsAsFactors = FALSE)
}
# ── Assemble ──────────────────────────────────────────────────────────────────
dw_pair <- paste(unique(dw$Measure_A), "vs", unique(dw$Measure_B))
ma_pair <- paste(unique(mfa$Measure_A), "vs", unique(mfa$Measure_B))
ms_pair <- paste(unique(mfs$Measure_A), "vs", unique(mfs$Measure_B))
divergent_table <- rbind(
  blank_row("**dep\\_wor\\_data**"),
  blank_row(paste0("*", dw_pair, "*")),
  fmt_rows(dw),
  blank_row("**META-FBZ**"),
  blank_row(paste0("*", ma_pair, "*")),
  fmt_rows(mfa),
  blank_row(paste0("*", ms_pair, "*")),
  fmt_rows(mfs)
)
knitr::kable(
  divergent_table,
  col.names = c("Method", "*N*", "*r* A", "*r* B", "*r* A,B", "*z*", "*p*"),
  align     = c("l", "r", "r", "r", "r", "r", "r"),
  booktabs  = TRUE,
  row.names = FALSE
)

In [34]:

Table 4: Divergent Validity

Method	N	r A	r B	r A,B	z	p
dep_wor_data
PHQ-9 vs GAD-7
CCR	500	0.43	0.39	0.83	2.02	p = .044
aCCR	500	0.64	0.59	0.83	2.34	p = .019
RR	500	0.69	0.63	0.83	3.39	p < .001
PLS	500	0.68	0.62	0.83	3.02	p = .003
META-FBZ
DASS Depression vs DASS Anxiety
CCR	411	0.16	0.10	0.50	1.15	p = .249
aCCR	411	0.11	0.04	0.50	1.42	p = .156
RR	411	0.19	0.03	0.50	3.26	p = .001
PLS	411	0.23	0.00	0.50	4.63	p < .001
DASS Depression vs DASS Stress
CCR	411	0.16	0.09	0.61	1.63	p = .104
aCCR	411	0.11	0.07	0.61	0.96	p = .336
RR	411	0.19	0.12	0.61	1.69	p = .091
PLS	411	0.23	0.12	0.61	2.60	p = .009

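Note. For each LBA method, Steiger tests compared the text-rating correlation with the depression measure (r A) against the correlation with the anxiety or stress measure (r B). r A,B is the correlation between the two questionnaire measures.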
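
To make the test behind Table 4 concrete, the following is a minimal Python sketch of a z-test for two dependent correlations that share one variable, written after the formula we understand the `steiger1980` option of `cocor` to implement (Fisher-transformed correlations with a pooled-correlation covariance term). The function name `steiger_z` is hypothetical and not part of the analysis code. Because the inputs below are the rounded values printed in the table, the result will only approximate the reported z.

```python
import math

def steiger_z(r_jk, r_jh, r_kh, n):
    """z-test for two dependent, overlapping correlations r_jk and r_jh.

    j is the shared variable (here: the language-based score), k and h are
    the two questionnaire measures, and r_kh is their intercorrelation.
    """
    z_jk = math.atanh(r_jk)  # Fisher z-transform
    z_jh = math.atanh(r_jh)
    r_bar_sq = ((r_jk + r_jh) / 2) ** 2  # squared pooled correlation
    # Covariance term of the two z-transformed correlations
    c = (r_kh * (1 - 2 * r_bar_sq)
         - 0.5 * r_bar_sq * (1 - 2 * r_bar_sq - r_kh ** 2)) / (1 - r_bar_sq) ** 2
    z = (z_jk - z_jh) * math.sqrt(n - 3) / math.sqrt(2 - 2 * c)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the standard normal
    return z, p

# Rounded CCR values from the dep_wor_data block of Table 4
z, p = steiger_z(r_jk=0.43, r_jh=0.39, r_kh=0.83, n=500)
print(round(z, 2), round(p, 3))
```

A high intercorrelation between the two questionnaires (r A,B) shrinks the denominator, which is why even small differences between r A and r B can reach significance in dep_wor_data.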

Criterion Validity

In META-FBZ, language-assessed depression was elevated in outpatients with a primary depressive disorder for CCR (d = 0.41, p < .001), RR (d = 0.46, p < .001), and PLS (d = 0.54, p < .001), but not for aCCR (d = 0.12, p = .128). Language-assessed depression was not elevated in outpatients who were on sick leave compared with those who were able to work at the time of assessment. In dep_wor_data, language-assessed depression was significantly positively correlated with the number of healthcare visits for CCR (r = 0.14, p = .002), aCCR (r = 0.22, p < .001), RR (r = 0.23, p < .001), and PLS (r = 0.22, p < .001). Likewise, language-assessed depression was significantly positively correlated with the number of sick leave days for CCR (r = 0.12, p = .007), aCCR (r = 0.16, p < .001), RR (r = 0.19, p < .001), and PLS (r = 0.18, p < .001). See Figure 5 and Table 5.

In [35]:

# ── Subset & recode work ability labels ──────────────────────────────────────
meta_diagnosis <- meta %>%
  mutate(depression_diagnosis = recode(depression_diagnosis,
    `1` = "Dep",
    `0` = "No-Dep"
  ))
meta_work <- meta %>%
  filter(work_ability_status %in% c("Able to work", "Unable to work (on sick leave)")) %>%
  mutate(work_ability_status = recode(work_ability_status,
    "Able to work"                      = "Working",
    "Unable to work (on sick leave)"    = "Sick"
  ))
# ── Plot factories ────────────────────────────────────────────────────────────
make_boxplot <- function(data, var, group_var, col, xlab, ylab) {
  groups <- unique(data[[group_var]])  # the two group labels to compare
  ggplot(data, aes(x = .data[[group_var]], y = .data[[var]])) +
    geom_boxplot(color = col) +
    geom_point(position = position_jitter(0.1), color = grey_col, alpha = 0.5) +
    ann(ttest_ann(data, var, group_var, groups[1], groups[2])) +
    labs(x = xlab, y = ylab) +
    theme_minimal() +
    theme(axis.text.x = element_text(size = 7))
}
make_scatter <- function(data, xvar, yvar, col, xlab, ylab) {
  ggplot(data, aes(x = .data[[xvar]], y = .data[[yvar]])) +
    geom_point(color = grey_col) +
    geom_smooth(method = "lm", color = col, fill = col) +
    ann(cor_ann(data, xvar, yvar)) +
    labs(x = xlab, y = ylab) +
    theme_minimal()
}
# ── Row 1: CCR ────────────────────────────────────────────────────────────────
r1c1 <- make_boxplot(meta_diagnosis, "CCR",  "depression_diagnosis", ccr_col,
                     "Diagnosis",    "CCR")
r1c2 <- make_boxplot(meta_work,      "CCR",  "work_ability_status",  ccr_col,
                     "Work Ability", "CCR")
r1c3 <- make_scatter(deptexts, "CCR",  "HealthCareVisits", ccr_col,  "CCR",  "HC Visits")
r1c4 <- make_scatter(deptexts, "CCR",  "SickDaysYear",     ccr_col,  "CCR",  "SL Days")
# ── Row 2: aCCR ───────────────────────────────────────────────────────────────
r2c1 <- make_boxplot(meta_diagnosis, "aCCR", "depression_diagnosis", accr_col,
                     "Diagnosis",    "aCCR")
r2c2 <- make_boxplot(meta_work,      "aCCR", "work_ability_status",  accr_col,
                     "Work Ability", "aCCR")
r2c3 <- make_scatter(deptexts, "aCCR", "HealthCareVisits", accr_col, "aCCR", "HC Visits")
r2c4 <- make_scatter(deptexts, "aCCR", "SickDaysYear",     accr_col, "aCCR", "SL Days")
# ── Row 3: RR ─────────────────────────────────────────────────────────────────
r3c1 <- make_boxplot(meta_diagnosis, "bdi_sum_RR_pred",         "depression_diagnosis", rr_col,
                     "Diagnosis",    "Pred. BDI")
r3c2 <- make_boxplot(meta_work,      "bdi_sum_RR_pred",         "work_ability_status",  rr_col,
                     "Work Ability", "Pred. BDI")
r3c3 <- make_scatter(deptexts, "mapped_BDI_sum_RR_pred",  "HealthCareVisits", rr_col,  "Pred. BDI", "HC Visits")
r3c4 <- make_scatter(deptexts, "mapped_BDI_sum_RR_pred",  "SickDaysYear",     rr_col,  "Pred. BDI", "SL Days")
# ── Row 4: PLS ────────────────────────────────────────────────────────────────
r4c1 <- make_boxplot(meta_diagnosis, "bdi_sum_PLS_pred",        "depression_diagnosis", pls_col,
                     "Diagnosis",    "Pred. BDI")
r4c2 <- make_boxplot(meta_work,      "bdi_sum_PLS_pred",        "work_ability_status",  pls_col,
                     "Work Ability", "Pred. BDI")
r4c3 <- make_scatter(deptexts, "mapped_BDI_sum_PLS_pred", "HealthCareVisits", pls_col, "Pred. BDI", "HC Visits")
r4c4 <- make_scatter(deptexts, "mapped_BDI_sum_PLS_pred", "SickDaysYear",     pls_col, "Pred. BDI", "SL Days")
# ── Column headers (drawn once at the top) ────────────────────────────────────
headers <- plot_grid(
  ggdraw(),                                        # spacer for row-label gutter
  col_header("Depressive Diagnosis"),
  col_header("Work Ability"),
  col_header("Healthcare Visits"),
  col_header("Sick Leave Days"),
  ncol = 5, rel_widths = c(0.04, 1, 1, 1, 1)
)
# ── Assemble ──────────────────────────────────────────────────────────────────
FigureCrit <- plot_grid(
  headers,
  labeled_row(row_label("CCR"),   r1c1, r1c2, r1c3, r1c4),
  labeled_row(row_label("aCCR"),  r2c1, r2c2, r2c3, r2c4),
  labeled_row(row_label("RR"),    r3c1, r3c2, r3c3, r3c4),
  labeled_row(row_label("PLS"),   r4c1, r4c2, r4c3, r4c4),
  ncol = 1, rel_heights = c(0.08, 1, 1, 1, 1)
)

`geom_smooth()` using formula = 'y ~ x'

FigureCrit

Note. Associations of language-assessed depression based on CCR (first row), aCCR (second row), RR (third row), and PLS (fourth row) with the presence of a gold-standard depression diagnosis (first column, META-FBZ), current work ability status (second column, META-FBZ), healthcare visits in the past year (third column, dep_wor_data), and sick leave days in the past year (fourth column, dep_wor_data). HC Visits = healthcare visits; SL Days = sick leave days.

In [36]:

# ── Subset data ───────────────────────────────────────────────────────────────
mf_dep  <- criterion_validity_table[
  criterion_validity_table$Dataset == "META-FBZ" &
  criterion_validity_table$External_Criterion == "Primary Depression Diagnosis", ]
mf_work <- criterion_validity_table[
  criterion_validity_table$Dataset == "META-FBZ" &
  criterion_validity_table$External_Criterion == "Work Ability Status", ]
dw_hc   <- criterion_validity_table[
  criterion_validity_table$Dataset == "dep_wor_data" &
  criterion_validity_table$External_Criterion == "Healthcare Visits", ]
dw_sl   <- criterion_validity_table[
  criterion_validity_table$Dataset == "dep_wor_data" &
  criterion_validity_table$External_Criterion == "Sick Leave Days", ]
# ── Helpers ───────────────────────────────────────────────────────────────────
fmt_rows <- function(df) {
  data.frame(
    Method         = df$Method,
    Effect_Size    = df$Effect_Size,
    Test_Statistic = df$Test_Statistic,
    p              = df$p,
    stringsAsFactors = FALSE,
    check.names    = FALSE
  )
}
blank_row <- function(label) {
  data.frame(Method = label, Effect_Size = "", Test_Statistic = "", p = "",
             stringsAsFactors = FALSE, check.names = FALSE)
}
# ── Assemble ──────────────────────────────────────────────────────────────────
# N may differ per criterion within a dataset, so take the max as the dataset N
n_mf <- max(criterion_validity_table$N[criterion_validity_table$Dataset == "META-FBZ"])
n_dw <- max(criterion_validity_table$N[criterion_validity_table$Dataset == "dep_wor_data"])
criterion_table <- rbind(
  blank_row(paste0("**META-FBZ** *(N = ", n_mf, ")*")),
  blank_row(paste0("*", unique(mf_dep$External_Criterion), "*")),
  fmt_rows(mf_dep),
  blank_row(paste0("*", unique(mf_work$External_Criterion), "*")),
  fmt_rows(mf_work),
  blank_row(paste0("**dep\\_wor\\_data** *(N = ", n_dw, ")*")),
  blank_row(paste0("*", unique(dw_hc$External_Criterion), "*")),
  fmt_rows(dw_hc),
  blank_row(paste0("*", unique(dw_sl$External_Criterion), "*")),
  fmt_rows(dw_sl)
)
knitr::kable(
  criterion_table,
  col.names = c("Method", "Effect Size", "*t*", "*p*"),
  align     = c("l", "r", "r", "r"),
  booktabs  = TRUE,
  row.names = FALSE
)

In [37]:

Table 5: Association of Language-assessed Depression with External Criteria.

Method	Effect Size	t	p
META-FBZ (N = 601)
Primary Depression Diagnosis
CCR	d = 0.41	t = 5.04	p < .001
aCCR	d = 0.12	t = 1.52	p = .128
RR	d = 0.46	t = 5.62	p < .001
PLS	d = 0.54	t = 6.61	p < .001
Work Ability Status
CCR	d = -0.15	t = -1.19	p = .236
aCCR	d = 0.03	t = 0.21	p = .834
RR	d = 0.07	t = 0.57	p = .568
PLS	d = 0.00	t = -0.01	p = .993
dep_wor_data (N = 500)
Healthcare Visits
CCR	r = 0.14	t = 3.05	p = .002
aCCR	r = 0.22	t = 5.05	p < .001
RR	r = 0.23	t = 5.36	p < .001
PLS	r = 0.22	t = 4.94	p < .001
Sick Leave Days
CCR	r = 0.12	t = 2.73	p = .007
aCCR	r = 0.16	t = 3.68	p < .001
RR	r = 0.19	t = 4.39	p < .001
PLS	r = 0.18	t = 4.10	p < .001
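
As a reading aid for the correlation rows of Table 5, the sketch below shows the standard conversion from a Pearson correlation to its t statistic with n - 2 degrees of freedom. We assume, without confirmation from the analysis code shown here, that the reported t values come from this standard test. The helper name `t_from_r` is hypothetical, and because the input is the rounded r from the table, the result will differ slightly from the reported t.

```python
import math

def t_from_r(r, n):
    """t statistic for H0: rho = 0, given a Pearson r from n paired observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Rounded CCR / healthcare-visits correlation from Table 5 (dep_wor_data, N = 500)
t = t_from_r(0.14, 500)
print(round(t, 2))
```

With N = 500, even correlations as small as r = 0.12 yield t values well above conventional thresholds, so the effect sizes rather than the p values carry most of the information in this block of the table.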

Similarity of Learned and Theory-Driven Vectors

Supervised LBA and theory-defined vectors yielded comparable text-rating correlations. We were therefore interested in whether the vector directions learned during supervised training would resemble the theory-defined directions. To this end, we computed the cosine similarity of all vector pairs (see Figure 6). This analysis revealed that learned vector representations were generally quite dissimilar (0.07 to 0.23), except when RR and PLS models were trained on the same dataset (0.76 to 0.94). CCR and aCCR vectors were also dissimilar both from each other (0.15) and from RR/PLS vectors (0.10 to 0.17). A crucial exception was found in dep_wor_data, where the similarity of aCCR and PLS (0.66) was comparable to the similarity of PLS and RR (0.76). Thus, PLS trained on texts targeting depressive symptoms converged on a vector representation that is similar, although not identical, to the semantic representation of the BDI-II.
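
The comparison can be illustrated with a small, self-contained Python sketch. It uses toy three-dimensional vectors rather than the actual embeddings, and the helper `direction_similarity` is hypothetical. The `sign_free` flag mirrors the analysis choice of taking the absolute cosine whenever a PLS vector is involved, which we assume is motivated by the arbitrary sign of a PLS weight vector.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def direction_similarity(u, v, sign_free=False):
    """Cosine similarity; with sign_free=True the orientation of the axis is ignored."""
    sim = cosine(u, v)
    return abs(sim) if sign_free else sim

# Toy example: the "learned" vector lies on almost the same axis as the
# "theory" vector but points in the opposite direction.
theory = [0.2, -0.5, 0.8]
learned = [-0.21, 0.48, -0.79]

print(round(direction_similarity(theory, learned), 2))                  # strongly negative
print(round(direction_similarity(theory, learned, sign_free=True), 2))  # strongly positive
```

Without the sign-free comparison, two vectors that span the same axis could appear maximally dissimilar, which would understate the agreement between learned and theory-defined directions.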

In [38]:

# One direction per fitted pipeline: first PLS weight vector, or the ridge coefficient vector
deptextsbdi_PLS_vector = deptextsbdi_PLS_transfer_model.named_steps["PLS"].x_weights_[:, 0]
deptextsbdi_RR_vector = deptextsbdi_RR_transfer_model.named_steps["ridge"].coef_.flatten()
erisk21_PLS_vector = erisk21_PLS_transfer_model.named_steps["PLS"].x_weights_[:, 0]
erisk21_RR_vector = erisk21_RR_transfer_model.named_steps["ridge"].coef_.flatten()
meta_PLS_vector = meta_PLS_transfer_model.named_steps["PLS"].x_weights_[:, 0]
meta_RR_vector = meta_RR_transfer_model.named_steps["ridge"].coef_.flatten()

In [39]:

CCR_en <- np$load(file.path(PUBLIC_DATA_PATH, "CCR_en.npy"))
aCCR_en <- np$load(file.path(PUBLIC_DATA_PATH, "aCCR_en.npy"))
vectors <- list(
  CCR              = as.vector(CCR_en),
  aCCR             = as.vector(aCCR_en),
  dep_wor_data_PLS = as.vector(py$deptextsbdi_PLS_vector),
  dep_wor_data_RR  = as.vector(py$deptextsbdi_RR_vector),
  eRisk_PLS        = as.vector(py$erisk21_PLS_vector),
  eRisk_RR         = as.vector(py$erisk21_RR_vector),
  Meta_PLS         = as.vector(py$meta_PLS_vector),
  Meta_RR          = as.vector(py$meta_RR_vector)
)
# Identify which vector names are PLS
pls_nms <- grep("PLS", names(vectors), value = TRUE)
# Compute pairwise cosine similarity, taking absolute value if either vector is PLS
n   <- length(vectors)
nms <- names(vectors)
sim_mat <- matrix(NA, nrow = n, ncol = n, dimnames = list(nms, nms))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim <- cosine(vectors[[i]], vectors[[j]])
    if (nms[i] %in% pls_nms || nms[j] %in% pls_nms) {
      sim <- abs(sim)
    }
    sim_mat[i, j] <- sim
  }
}
# Melt to long format and relabel
label_map <- c(
  "CCR"              = "CCR",
  "aCCR"             = "aCCR",
  "dep_wor_data_PLS" = "PLS (dep_wor_data)",
  "dep_wor_data_RR"  = "RR (dep_wor_data)",
  "eRisk_PLS"        = "PLS (eRisk-2021)",
  "eRisk_RR"         = "RR (eRisk-2021)",
  "Meta_PLS"         = "PLS (META-FBZ)",
  "Meta_RR"          = "RR (META-FBZ)"
)
sim_df <- melt(sim_mat, varnames = c("Vec1", "Vec2"), value.name = "Cosine")
sim_df$Vec1 <- factor(label_map[as.character(sim_df$Vec1)], levels = label_map)
sim_df$Vec2 <- factor(label_map[as.character(sim_df$Vec2)], levels = rev(label_map))
ggplot(sim_df, aes(x = Vec1, y = Vec2, fill = Cosine)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = round(Cosine, 2)), size = 3.5, color = "black") +
  scale_fill_viridis_c(
    limits = c(0, 1), 
    name = "Cosine\nSimilarity"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title  = element_blank(),
    panel.grid  = element_blank()
  )

Figure 6: Similarity of Learned and Theory-Driven Vectors.

Discussion

The goal of the present study was to evaluate the psychometric qualities of CCRs, with a focus on domain-generality, for the assessment of depression symptoms. Compared with state-of-the-art supervised LBA, CCRs demonstrated high validity and good performance across various domains and are thus supported as domain-general, interpretable, and highly valid tools for the assessment of depression from language. Interestingly, findings also revealed that supervised LBA, which has previously been hypothesized to generalize poorly across strongly diverging domains (Nilsson et al., 2026; Teitelbaum & Simchon, 2025), achieves strong domain-generality under certain training regimes. Moreover, supervised LBA converged on a vector representation that is similar to the semantic representation of the BDI-II. Thus, in addition to validating the use of CCRs, our findings lend further support to the validity of supervised LBA. In the following, we discuss these findings in detail and provide recommendations about when to use which LBA method.

Both CCR types yielded significant text-rating correlations that were comparable in magnitude to those of supervised LBA. In dep_wor_data, performance losses (relative to aCCR) were minimal (), in eRisk-2021 non-significant, and in META-FBZ more notable (). When trained on social media data, RR was unable to generate significant text-rating correlations, underlining the sensitivity of supervised LBA to small samples and noisy data. PLS did achieve significant text-rating correlations under these conditions, which supports recent suggestions to use PLS for LBA (Teitelbaum & Simchon, 2025). However, definitive conclusions about meaningful differences between PLS and RR will require more research on different constructs.

Results of the transfer tests speak directly to the measurement-versus-prediction question raised in the introduction. When supervised LBA models were trained on outpatient anamnesis texts (META-FBZ) or social-media texts (eRisk-2021), notable performance losses occurred under transfer. Such losses admit two non-exclusive interpretations. A model may have learned domain-specific correlates of depression alongside construct-central features; alternatively, or in addition, the linguistic expression of depression may genuinely differ across these domains, such that a representation learned in one register transfers imperfectly to another even when both validly index the construct. Transfer performance alone does not adjudicate between these readings, which is why we complement it with content-validity and representational analyses. When trained on language probes specifically targeting depressive symptomatology (dep_wor_data), however, no performance loss was detectable. Critically, the vector similarity analysis disambiguates this finding: the PLS model trained on construct-targeted language converged on a vector representation substantially similar to the theory-defined aCCR (cosine similarity = .66, approaching the within-method similarity of RR and PLS). It is the conjunction of these two results — domain-general performance and convergence of the learned representation on the questionnaire-defined construct representation — that allows the interpretation that this model approximates measurement of the depression construct rather than prediction from correlated signals. 
Taken together, these findings suggest that LBA methods are best located on a continuum between prediction and reflective measurement, and that a model’s position on this continuum is determined not by the use of language per se, but by whether the indicator-construct mapping is theoretically specified (CCR), learned from construct-targeted elicitation (supervised LBA on probes), or learned from naturalistic corpora (supervised LBA on found text).

The context-dependence of construct expression also helps account for the gradient observed for CCR, whose convergent validity decreased from construct-targeted symptom descriptions (dep_wor_data, r = .43) through spontaneous social-media posts (eRisk-2021, r = .31) to clinical intake narratives (META-FBZ, r = .17). Because CCR fixes its construct representation in the linguistic register of questionnaire items, its loadings should be highest where elicited language most closely approximates that register — as when participants are asked to describe their symptoms directly — and lower where language is produced for other communicative purposes, such as recounting the development of one’s difficulties at clinical intake. On this view, the gradient reflects, at least in part, the varying distance between domain-typical language and the questionnaire register rather than a uniform loss of construct validity. The same logic offers a constructive reading of why supervised LBA generalized best when trained on construct-targeted elicitation: such elicitation draws spontaneous language toward the canonical expression of the construct, narrowing the register gap that transfer must bridge. Both methods, then, appear to track the depression construct; they differ in how their performance is modulated by the linguistic register of the domain in which the construct is expressed.

Tests of divergent validity indicated that LBA can differentiate between depression and related constructs, although the evidence was mixed. In dep_wor_data, where text-rating correlations were generally high, all methods significantly differentiated between depression and generalized anxiety. In META-FBZ, where text-rating correlations were small, results were less consistent. aCCR correlated significantly with depression but not significantly with anxiety or stress, although differences between these correlations were not significant. Supervised LBAs did show significant correlations with stress; these were smaller than the correlations with depression, significantly so for PLS (p = .009) but not for RR (p = .091). The lesser discriminant validity of CCRs in META-FBZ could stem from the small magnitude of correlations, as CCRs demonstrated substantial discriminant validity in dep_wor_data, where correlations were moderate to large in size.

LBAs were significantly associated with three out of four external criteria, namely gold-standard depression diagnoses, the number of healthcare visits, and the number of sick leave days. No difference in language-assessed depression was found between mental health outpatients who were working and those who were on sick leave at the time of assessment. Additionally, CCR loadings, but not aCCR loadings, were elevated in outpatients with a depression diagnosis. In sum, these results support the criterion validity of LBA, but point towards some remaining inconsistencies.

Limitations and Future Work

While these results are promising, several limitations should be noted. Firstly, the presented analysis was confined to the assessment of depression symptoms, limiting extrapolation of findings to other psychological constructs. Indeed, prior work has shown that, for some dimensions of moral concerns, CCRs do not correlate significantly with questionnaire scores, potentially implying that some psychological constructs are not well represented in natural language (Teitelbaum & Simchon, 2025). Secondly, our validation strategy shares a construct definition with its criterion: CCRs are constructed from BDI-II items and evaluated, in part, against BDI-II and PHQ-9 scores. Convergent validity coefficients therefore reflect agreement between two operationalizations derived from the same nomological tradition rather than agreement with the construct itself — a limitation that applies to all validation of new instruments against established self-reports (Cronbach & Meehl, 1955). The criterion validity findings (clinical diagnoses, healthcare utilization) partially mitigate this concern, as these criteria are external to the self-report operationalization. Future work should expand the criterion space toward method-independent indicators (e.g., clinician ratings, behavioral measures, longitudinal treatment outcomes) to probe whether language-based scores carry construct-relevant information beyond the questionnaire-defined operationalization. Relatedly, our domain comparisons confound data domain with elicitation format and linguistic register. We therefore cannot fully separate genuine context-dependence in the expression of depression from register-related method effects. Future designs could hold register constant across domains, or vary it systematically within a domain, to disentangle these influences and to test directly whether the construct’s linguistic manifestation is itself domain-variable. 
Thirdly, the comparatively small number of individuals in the eRisk-2021 dataset may have left the correlation comparisons underpowered. Fourthly, we used supervised LBA as the comparison method, but note that zero-shot prompting has yielded impressive results for LBA of personality (Wright et al., 2026) and mental health (J.-J. Lee et al., 2026). Our choice was motivated by the fact that supervised LBA represents the current state of the art and has been validated extensively, and therefore serves as the critical comparison benchmark. Still, it would be desirable for future studies to undertake a systematic comparison of these three methods (CCR, supervised LBA, and zero-shot prompting).

Choosing a method

The results of this study provide insight into the strengths and weaknesses of different LBA methods, raising the question of which method to select in a given research scenario. We propose that method selection depends chiefly on two factors: the goal of the assessment and the available data.

When the goal is prediction, e.g., screening for at-risk individuals or stratifying patients into treatment groups in a highly standardized setting where large amounts of gold-standard data can be accumulated, supervised LBA will most often be indicated. In such scenarios, it is legitimate and indeed desirable for a model to exploit all available predictive information, including semantic information that is predictive of, but not reflective of, the target construct: a screening instrument is evaluated by its predictive utility, not by the purity of its construct representation. Validity evidence remains necessary, but domain-generality does not take precedence over within-domain accuracy.

When the goal is measurement, that is, when scores are intended to be interpreted as reflecting standing on the depression construct, compared across populations or settings, or related to other constructs in theory testing, the requirements shift. Here, construct specificity and an explicit indicator-construct mapping become paramount, and CCR is the preferable choice: it offers an interpretable, a priori specified construct representation that is not conflated with correlated traits and that can be deployed in domains where gold-standard ground truth is unavailable (e.g., historical records; Chen et al., 2024). Transferring a supervised LBA trained on construct-targeted language elicitation is a viable alternative, as the present results show, but such models must be inspected carefully, ideally including a comparison of the learned representation against a theory-defined construct vector, as demonstrated here.

Supplementary Material

In [40]:

content_stats_table <- content_validity_table[, !names(content_validity_table) %in% c("Significance")]
content_stats_table <- round(content_stats_table, 2)
knitr::kable(
  content_stats_table,
  col.names = c("BDI Item", "N Relevant", "N Irrelevant", "AUROC", "Lower CI", "Upper CI"),
  align     = c("l", "r", "r", "r", "r", "r"),
  booktabs  = TRUE
)

In [41]:

Table 6: Symptom Relevance Classification

BDI Item	N Relevant	N Irrelevant	AUROC	Lower CI	Upper CI
1	167	414	0.72	0.68	0.77
2	209	343	0.67	0.62	0.71
3	146	390	0.71	0.66	0.76
4	132	390	0.75	0.70	0.79
5	88	312	0.66	0.60	0.72
6	33	519	0.77	0.70	0.84
7	205	269	0.74	0.70	0.78
8	115	419	0.73	0.68	0.78
9	300	217	0.67	0.62	0.72
10	143	403	0.74	0.69	0.78
11	142	451	0.59	0.54	0.64
12	105	448	0.75	0.70	0.80
13	50	534	0.67	0.59	0.75
14	161	263	0.67	0.62	0.72
15	161	330	0.74	0.70	0.79
16	274	294	0.64	0.60	0.69
17	132	408	0.57	0.51	0.62
18	225	323	0.62	0.57	0.67
19	166	262	0.68	0.62	0.73
20	217	349	0.62	0.58	0.67
21	239	291	0.65	0.60	0.69

Note. Results of logistic regressions predicting symptom relevance labels based on positive and negative BDI-II item centroids.
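
For readers less familiar with the metric in Table 6, the sketch below computes the AUROC through its pairwise definition: the probability that a randomly chosen relevant text receives a higher predicted score than a randomly chosen irrelevant text, with ties counted as one half. This is an illustration on made-up scores and is not necessarily how the values in the table were computed. The function name `auroc` is hypothetical.

```python
def auroc(scores, labels):
    """AUROC as the share of (relevant, irrelevant) pairs ranked correctly.

    scores: predicted relevance scores (e.g., logistic regression probabilities)
    labels: 1 for relevant, 0 for irrelevant
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count a correctly ordered pair as 1 and a tie as 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: two relevant and two irrelevant texts
print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # 3 of 4 pairs are ordered correctly
```

An AUROC of 0.5 corresponds to chance-level ranking, so items whose lower confidence bound stays close to 0.5 (e.g., item 17) are the ones for which the item centroids separate relevant from irrelevant texts least clearly.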

In [42]:

run_corr_comparison <- function(data, ccr_var, pred_var, bdi_var,
                                sample_label, ccr_label, pred_label) {
  tmp <- data %>%
    dplyr::select(all_of(c(ccr_var, pred_var, bdi_var))) %>%
    tidyr::drop_na()
  r_ccr_bdi  <- cor(tmp[[ccr_var]],  tmp[[bdi_var]])
  r_pred_bdi <- cor(tmp[[pred_var]], tmp[[bdi_var]])
  r_ccr_pred <- cor(tmp[[ccr_var]],  tmp[[pred_var]])
  n <- nrow(tmp)
  res <- cocor.dep.groups.overlap(r.jk = r_ccr_bdi, r.jh = r_pred_bdi,
                                  r.kh = r_ccr_pred, n = n, test = "steiger1980")
  tibble(Sample = sample_label, CCR = ccr_label, `Transfer Model` = pred_label,
         `r(CCR, BDI)` = round(r_ccr_bdi, 3), `r(Pred, BDI)` = round(r_pred_bdi, 3),
         `r(CCR, Pred)` = round(r_ccr_pred, 3), n = n,
         z = round(res@steiger1980$statistic, 3), p = res@steiger1980$p.value)
}
results_tbl <- bind_rows(
  run_corr_comparison(erisk21_w, "CCR",  "deptexts_pred_RR",  "bdi_sum", "eRisk-2021", "CCR",  "RR (dep_wor_data)"),
  run_corr_comparison(erisk21_w, "aCCR", "deptexts_pred_RR",  "bdi_sum", "eRisk-2021", "aCCR", "RR (dep_wor_data)"),
  run_corr_comparison(erisk21_w, "CCR",  "deptexts_pred_PLS", "bdi_sum", "eRisk-2021", "CCR",  "PLS (dep_wor_data)"),
  run_corr_comparison(erisk21_w, "aCCR", "deptexts_pred_PLS", "bdi_sum", "eRisk-2021", "aCCR", "PLS (dep_wor_data)"),
  run_corr_comparison(erisk21_w, "CCR",  "meta_pred_RR",  "bdi_sum", "eRisk-2021", "CCR",  "RR (META-FBZ)"),
  run_corr_comparison(erisk21_w, "aCCR", "meta_pred_RR",  "bdi_sum", "eRisk-2021", "aCCR", "RR (META-FBZ)"),
  run_corr_comparison(erisk21_w, "CCR",  "meta_pred_PLS", "bdi_sum", "eRisk-2021", "CCR",  "PLS (META-FBZ)"),
  run_corr_comparison(erisk21_w, "aCCR", "meta_pred_PLS", "bdi_sum", "eRisk-2021", "aCCR", "PLS (META-FBZ)"),
  run_corr_comparison(deptexts, "CCR",  "erisk21_pred_RR",  "mapped_BDI_sum", "dep_wor_data", "CCR",  "RR (eRisk-2021)"),
  run_corr_comparison(deptexts, "aCCR", "erisk21_pred_RR",  "mapped_BDI_sum", "dep_wor_data", "aCCR", "RR (eRisk-2021)"),
  run_corr_comparison(deptexts, "CCR",  "erisk21_pred_PLS", "mapped_BDI_sum", "dep_wor_data", "CCR",  "PLS (eRisk-2021)"),
  run_corr_comparison(deptexts, "aCCR", "erisk21_pred_PLS", "mapped_BDI_sum", "dep_wor_data", "aCCR", "PLS (eRisk-2021)"),
  run_corr_comparison(deptexts, "CCR",  "meta_pred_RR",  "mapped_BDI_sum", "dep_wor_data", "CCR",  "RR (META-FBZ)"),
  run_corr_comparison(deptexts, "aCCR", "meta_pred_RR",  "mapped_BDI_sum", "dep_wor_data", "aCCR", "RR (META-FBZ)"),
  run_corr_comparison(deptexts, "CCR",  "meta_pred_PLS", "mapped_BDI_sum", "dep_wor_data", "CCR",  "PLS (META-FBZ)"),
  run_corr_comparison(deptexts, "aCCR", "meta_pred_PLS", "mapped_BDI_sum", "dep_wor_data", "aCCR", "PLS (META-FBZ)"),
  run_corr_comparison(meta, "CCR",  "erisk21_pred_RR",  "bdi_sum_DU-Prä", "META-FBZ", "CCR",  "RR (eRisk-2021)"),
  run_corr_comparison(meta, "aCCR", "erisk21_pred_RR",  "bdi_sum_DU-Prä", "META-FBZ", "aCCR", "RR (eRisk-2021)"),
  run_corr_comparison(meta, "CCR",  "erisk21_pred_PLS", "bdi_sum_DU-Prä", "META-FBZ", "CCR",  "PLS (eRisk-2021)"),
  run_corr_comparison(meta, "aCCR", "erisk21_pred_PLS", "bdi_sum_DU-Prä", "META-FBZ", "aCCR", "PLS (eRisk-2021)"),
  run_corr_comparison(meta, "CCR",  "deptexts_pred_RR",  "bdi_sum_DU-Prä", "META-FBZ", "CCR",  "RR (dep_wor_data)"),
  run_corr_comparison(meta, "aCCR", "deptexts_pred_RR",  "bdi_sum_DU-Prä", "META-FBZ", "aCCR", "RR (dep_wor_data)"),
  run_corr_comparison(meta, "CCR",  "deptexts_pred_PLS", "bdi_sum_DU-Prä", "META-FBZ", "CCR",  "PLS (dep_wor_data)"),
  run_corr_comparison(meta, "aCCR", "deptexts_pred_PLS", "bdi_sum_DU-Prä", "META-FBZ", "aCCR", "PLS (dep_wor_data)")
)
results_tbl %>%
  mutate(p = ifelse(p < .001, "< .001", sprintf("= %.3f", p))) %>%
  knitr::kable(align = c("l", "l", "l", "r", "r", "r", "r", "r", "r"), booktabs = TRUE)

In [43]:

Table 7: Comparison of transfer models with CCR/aCCR correlations.

Sample	CCR	Transfer Model	r(CCR, BDI)	r(Pred, BDI)	r(CCR, Pred)	n	z	p
eRisk-2021	CCR	RR (dep_wor_data)	0.306	0.374	0.669	80	-0.787	= 0.431
eRisk-2021	aCCR	RR (dep_wor_data)	0.360	0.374	0.863	80	-0.257	= 0.797
eRisk-2021	CCR	PLS (dep_wor_data)	0.306	0.394	0.624	80	-0.964	= 0.335
eRisk-2021	aCCR	PLS (dep_wor_data)	0.360	0.394	0.901	80	-0.735	= 0.462
eRisk-2021	CCR	RR (META-FBZ)	0.306	0.202	0.640	80	1.118	= 0.264
eRisk-2021	aCCR	RR (META-FBZ)	0.360	0.202	0.478	80	1.430	= 0.153
eRisk-2021	CCR	PLS (META-FBZ)	0.306	0.220	0.641	80	0.927	= 0.354
eRisk-2021	aCCR	PLS (META-FBZ)	0.360	0.220	0.498	80	1.293	= 0.196
dep_wor_data	CCR	RR (eRisk-2021)	0.430	0.441	0.418	500	-0.268	= 0.789
dep_wor_data	aCCR	RR (eRisk-2021)	0.618	0.441	0.510	500	4.965	< .001
dep_wor_data	CCR	PLS (eRisk-2021)	0.430	0.416	0.317	500	0.303	= 0.762
dep_wor_data	aCCR	PLS (eRisk-2021)	0.618	0.416	0.442	500	5.298	< .001
dep_wor_data	CCR	RR (META-FBZ)	0.430	0.338	0.445	500	2.168	= 0.030
dep_wor_data	aCCR	RR (META-FBZ)	0.618	0.338	0.338	500	6.619	< .001
dep_wor_data	CCR	PLS (META-FBZ)	0.430	0.306	0.429	500	2.859	= 0.004
dep_wor_data	aCCR	PLS (META-FBZ)	0.618	0.306	0.325	500	7.237	< .001
META-FBZ	CCR	RR (eRisk-2021)	0.170	0.070	0.270	601	2.050	= 0.040
META-FBZ	aCCR	RR (eRisk-2021)	0.138	0.070	0.129	601	1.269	= 0.204
META-FBZ	CCR	PLS (eRisk-2021)	0.170	0.068	0.240	601	2.052	= 0.040
META-FBZ	aCCR	PLS (eRisk-2021)	0.138	0.068	0.129	601	1.309	= 0.191
META-FBZ	CCR	RR (dep_wor_data)	0.170	0.191	0.266	601	-0.441	= 0.659
META-FBZ	aCCR	RR (dep_wor_data)	0.138	0.191	0.525	601	-1.369	= 0.171
META-FBZ	CCR	PLS (dep_wor_data)	0.170	0.142	0.289	601	0.572	= 0.568
META-FBZ	aCCR	PLS (dep_wor_data)	0.138	0.142	0.763	601	-0.174	= 0.862

In [44]:

make_plot <- function(data, predictor, outcome, xlab, ylab, color) {
  ggplot(data, aes(x = .data[[predictor]], y = .data[[outcome]])) +
    geom_point(color = grey_col) +
    geom_smooth(method = "lm", color = color, fill = color) +
    ann(cor_ann(data, predictor, outcome)) +
    labs(x = xlab, y = ylab) +
    scale_x_continuous(n.breaks = 4) +
    theme_minimal()
}
# ── Outcomes / columns ───────────────────────────────────────────────────────
vars <- list(
  list(
    data = meta,
    outcome = "dass_dep_score_DU-Prä",
    label = "Dep."
  ),
  list(
    data = meta,
    outcome = "dass_anx_score_DU-Prä",
    label = "Anx."
  ),
  list(
    data = meta,
    outcome = "dass_str_score_DU-Prä",
    label = "Str."
  ),
  list(
    data = deptexts,
    outcome = "GAD7tot",
    label = "GAD-7"
  )
)
# ── Method definitions / rows ────────────────────────────────────────────────
methods <- list(
  list(
    predictor_meta = "CCR",
    predictor_deptexts = "CCR",
    label = "CCR",
    color = ccr_col,
    xlab = "CCR loading"
  ),
  list(
    predictor_meta = "aCCR",
    predictor_deptexts = "aCCR",
    label = "aCCR",
    color = accr_col,
    xlab = "aCCR loading"
  ),
  list(
    predictor_meta = "bdi_sum_RR_pred",
    predictor_deptexts = "PHQ9tot_RR_pred",
    label = "RR",
    color = rr_col,
    xlab = "Predicted score"
  ),
  list(
    predictor_meta = "bdi_sum_PLS_pred",
    predictor_deptexts = "PHQ9tot_PLS_pred",
    label = "PLS",
    color = pls_col,
    xlab = "Predicted score"
  )
)
# ── Generate plots ───────────────────────────────────────────────────────────
all_rows <- lapply(methods, function(meth) {
  
  plots <- lapply(vars, function(v) {
    
    predictor <- if (identical(v$data, deptexts)) {
      meth$predictor_deptexts
    } else {
      meth$predictor_meta
    }
    
    make_plot(
      data = v$data,
      predictor = predictor,
      outcome = v$outcome,
      xlab = meth$xlab,
      ylab = v$label,
      color = meth$color
    )
  })
  
  plot_grid(plotlist = plots, nrow = 1)
})

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 190 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 190 rows containing missing values or values outside the scale range
(`geom_point()`).

header_row <- plot_grid(
  NULL,
  col_header("Dep."),
  col_header("Anx."),
  col_header("Str."),
  col_header("GAD-7"),
  nrow = 1,
  rel_widths = c(0.05, 1, 1, 1, 1)
)
# ── Final figure ─────────────────────────────────────────────────────────────
FigureDiv <- plot_grid(
  header_row,
  labeled_row(row_label(methods[[1]]$label), all_rows[[1]]),
  labeled_row(row_label(methods[[2]]$label), all_rows[[2]]),
  labeled_row(row_label(methods[[3]]$label), all_rows[[3]]),
  labeled_row(row_label(methods[[4]]$label), all_rows[[4]]),
  ncol = 1,
  rel_heights = c(0.08, 1, 1, 1, 1)
)
FigureDiv

Figure 7: Divergent Validity. Columns: Dep. = DASS Depression; Anx. = DASS Anxiety; Str. = DASS Stress; GAD-7 = Generalized Anxiety Disorder 7-item scale.

Note. Pearson correlations of language-assessed depression based on CCR (first row), aCCR (second row), RR (third row), and PLS (fourth row) with self-reports of DASS-21 depression (Dep., first column, META-FBZ), DASS-21 anxiety (Anx., second column, META-FBZ), DASS-21 stress (Str., third column, META-FBZ), and GAD-7 anxiety (fourth column, dep_wor_data).
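
Each panel in Figure 7 annotates a Pearson correlation between a language-based score and a self-report score. As a minimal sketch of that quantity in Python, assuming rows with a missing or non-finite value on either variable are dropped pairwise (as ggplot2 reports doing for the plotted points), the helper below returns r together with the number of pairs used. The function name and the toy values are hypothetical; `cor_ann` in the plotting code is defined elsewhere and may differ.

```python
import math

def pearson_complete_cases(xs, ys):
    """Pearson r over pairs where both values are present and finite."""
    pairs = [
        (x, y) for x, y in zip(xs, ys)
        if x is not None and y is not None
        and math.isfinite(x) and math.isfinite(y)
    ]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy), n

# Hypothetical toy values, not taken from META-FBZ or dep_wor_data.
loading = [0.10, 0.25, None, 0.40, 0.55, 0.30]
dass_dep = [4.0, 9.0, 12.0, float("nan"), 20.0, 11.0]
r, n_used = pearson_complete_cases(loading, dass_dep)
print(f"r = {r:.3f} (n = {n_used})")
```

Reporting the number of pairs alongside r makes the effect of missing self-reports visible, which matters when correlations from the same dataset are compared across columns.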

Abdi, H. (2003). Partial least square regression (PLS regression). Encyclopedia for Research Methods for the Social Sciences, 6(4), 792–795.

Adair, J. G. (1984). The hawthorne effect: A reconsideration of the methodological artifact. Journal of Applied Psychology, 69(2), 334.

Allaire, J. J., Teague, C., Xie, Y., & Dervieux, C. (2022). Quarto. https://doi.org/10.5281/ZENODO.5960048

Atari, M., Omrani, A., & Dehghani, M. (2023). Contextualized construct representation: Leveraging psychometric scales to advance theory-driven text analysis. Preprint at PsyArXiv Https://Doi.org/10.31234/Osf. Io/M93pd.

Beck, A. T., Steer, R. A., & Brown, G. (1996). Beck depression inventory–II. Psychological Assessment.

Bilgrami, Z. R., Sarac, C., Srivastava, A., Herrera, S. N., Azis, M., Haas, S. S., Shaik, R. B., Parvaz, M. A., Mittal, V. A., Cecchi, G., et al. (2022). Construct validity for computational linguistic metrics in individuals at clinical risk for psychosis: Associations with clinical ratings. Schizophrenia Research, 245, 90–96.

Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29.

Borsboom, D., Mellenbergh, G. J., & Heerden, J. van. (2004). The Concept of Validity. Psychological Review, 111(4), 1061–1071. https://doi.org/10.1037/0033-295x.111.4.1061

Chandler, C., Foltz, P. W., & Elvevåg, B. (2020). Using machine learning in psychiatry: The need to establish a framework that nurtures trustworthiness. Schizophrenia Bulletin, 46(1), 11–14.

Chen, Y., Li, S., Li, Y., & Atari, M. (2024). Surveying the dead minds: Historical-psychological text analysis with contextualized construct representation (CCR) for classical chinese. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2597–2615.

Cohen, A. S. (2019). Advancing ambulatory biobehavioral technologies beyond “proof of concept”: Introduction to the special section. Psychological Assessment, 31(3), 277.

Cohen, A. S., Rodriguez, Z., Warren, K. K., Cowan, T., Masucci, M. D., Edvard Granrud, O., Holmlund, T. B., Chandler, C., Foltz, P. W., & Strauss, G. P. (2022). Natural language processing and psychosis: On the need for comprehensive psychometric evaluation. Schizophrenia Bulletin, 48(5), 939–948.

Crestani, F., Losada, D. E., & Parapar, J. (2022). Early detection of mental health disorders by social media monitoring. Studies in Computational Intelligence, 1018(4).

Cronbach, L. J. (1949). Essentials of psychological testing.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/h0040957

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186.

Dohnány, S., Kurth-Nelson, Z., Spens, E., Luettgau, L., Reid, A., Gabriel, I., Summerfield, C., Shanahan, M., & Nour, M. M. (2026). Technological folie à deux: Feedback loops between AI chatbots and mental health. Nature Mental Health, 1–10.

Eberhardt, S. T., Vehlen, A., Schaffrath, J., Schwartz, B., Baur, T., Schiller, D., Hallmen, T., André, E., & Lutz, W. (2025). Development and validation of large language model rating scales for automatically transcribed psychological therapy sessions. Scientific Reports, 15(1), 29541.

First, M. B. (2014). Structured clinical interview for the DSM (SCID). The Encyclopedia of Clinical Psychology, 1–6.

Firth, J. (1957). A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis, 10–32.

Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8(4), 370–378.

Furr, R. M. (2021). Psychometrics: An introduction. SAGE publications.

Giorgi, S., Lynn, V. E., Gupta, K., Ahmed, F., Matz, S., Ungar, L. H., & Schwartz, H. A. (2022). Correcting sociodemographic selection biases for population prediction from social media. Proceedings of the International AAAI Conference on Web and Social Media, 16, 228–240.

Grand, G., Blank, I. A., Pereira, F., & Fedorenko, E. (2022). Semantic projection recovers rich human knowledge of multiple object features from word embeddings. Nature Human Behaviour, 6(7), 975–987.

Grimm, K. J., & Widaman, K. F. (2012). Construct validity.

Gu, Z., Kjell, K., Schwartz, H. A., & Kjell, O. (2025). Natural language response formats for assessing depression and worry with large language models: A sequential evaluation with model pre-registration. Assessment, 10731911251364022.

Harris, Z. S. (1954). Distributional structure. Word, 10(2-3), 146–162.

Hupkes, D., Giulianelli, M., Dankers, V., Artetxe, M., Elazar, Y., Pimentel, T., Christodoulopoulos, C., Lasri, K., Saphra, N., Sinclair, A., et al. (2023). A taxonomy and review of generalization research in NLP. Nature Machine Intelligence, 5(10), 1161–1174.

Hüppi, R. M., Bautista, L., Cecere, G., Just, S. A., Koops, S., Hussain, M., Tedeschi, E., Benke-Bruderer, S., Bora, E., Lyne, J., et al. (2025). TRUSTING: An international multicenter observational study of speech-based relapse prediction in psychosis using explainable AI. medRxiv, 2025–2011.

Kjell, O. N. E., Kjell, K., Garcia, D., & Sikström, S. (2019). Semantic measures: Using natural language processing to measure, differentiate, and describe psychological constructs. Psychological Methods, 24(1), 92–115. https://doi.org/10.1037/met0000191

Kjell, O. N., Kjell, K., Garcia, D., & Sikström, S. (2019). Semantic measures: Using natural language processing to measure, differentiate, and describe psychological constructs. Psychological Methods, 24(1), 92.

Kjell, O. N., Kjell, K., & Schwartz, H. A. (2024). Beyond rating scales: With targeted evaluation, large language models are poised for psychological assessment. Psychiatry Research, 333, 115667.

Kjell, O. N., Sikström, S., Kjell, K., & Schwartz, H. A. (2022). Natural language analyzed with AI-based transformers predict traditional subjective well-being measures approaching the theoretical upper limits in accuracy. Scientific Reports, 12(1), 3918.

Kjell, O., Daukantaitė, D., & Sikström, S. (2021). Computational language assessments of harmony in life—not satisfaction with life or rating scales—correlate with cooperative behaviors. Frontiers in Psychology, 12, 601679.

Kjell, O., Ganesan, A. V., Boyd, R. L., Oltmanns, J., Rivero, A., Feltman, S., Carr, M. A., Alves, J., Luft, B., Kotov, R., et al. (2026). Replicability and validity of a new artificial-intelligence assessment of posttraumatic stress disorder from patient language: A sequential evaluation with model preregistration. Clinical Psychological Science, 21677026261439026.

Kjell, O., Giorgi, S., & Schwartz, H. A. (2023). The text-package: An r-package for analyzing and visualizing human language using natural language processing and transformers. Psychological Methods, 28(6), 1478.

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2001). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613.

Kühner, C., Bürger, C., Keller, F., & Hautzinger, M. (2007). Reliability and validity of the revised beck depression inventory (BDI-II). Results from german samples. Der Nervenarzt, 78(6), 651–656.

Lake, B. M., & Murphy, G. L. (2023). Word meaning in minds and machines. Psychological Review, 130(2), 401.

Lee, J.-J., Han, J., & Woo, C.-W. (2026). Interpretable depression assessment using a large language model. PLOS Digital Health, 5(2), e0001205.

Lee, S., Shakir, A., Koenig, D., & Lipp, J. (2024). Open source gets DE-licious: Mixedbread x deepset german/english embeddings. https://www.mixedbread.ai/blog/deepset-mxbai-embed-de-large-v1

Lovibond, P. F., & Lovibond, S. H. (1995). The structure of negative emotional states: Comparison of the depression anxiety stress scales (DASS) with the beck depression and anxiety inventories. Behaviour Research and Therapy, 33(3), 335–343.

Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 746–751.

Nilges, P., & Essau, C. (2015). Die depressions-angst-stress-skalen. Der Schmerz, 29(6), 649–657.

Nilsson, A. H., Eijsbroek, V. C., Gu, Z., Kjell, K., Giorgi, S., Kotov, R., Ganesan, A. V., Schwartz, H. A., & Kjell, O. N. (2026). The language-based assessment model library: Open model sharing for independent validation and broader applications. Advances in Methods and Practices in Psychological Science, 9(2), 25152459261419036.

Nolen-Hoeksema, S. (2001). Gender differences in depression. Current Directions in Psychological Science, 10(5), 173–176.

Palan, S., & Schitter, C. (2018). Prolific. Ac—a subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17, 22–27.

Pan, S. J., & Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.

Parapar, J., Martı́n-Rodilla, P., Losada, D. E., & Crestani, F. (2021). eRisk 2021: Pathological gambling, self-harm and depression challenges. European Conference on Information Retrieval, 650–656.

Parapar, J., Perez, A., Wang, X., & Crestani, F. (2025). eRisk 2025: Contextual and conversational approaches for depression challenges. European Conference on Information Retrieval, 416–424.

Piantadosi, S. T., Muller, D. C., Rule, J. S., Kaushik, K., Gorenstein, M., Leib, E. R., & Sanford, E. (2024). Why concepts are (probably) vectors. Trends in Cognitive Sciences, 28(9), 844–856.

Plank, L., & Zlomuzica, A. (2024a). Natural language processing reveals differences in mental time travel at higher levels of self-efficacy. Scientific Reports, 14(1), 25342.

Plank, L., & Zlomuzica, A. (2024b). Reduced speech coherence in psychosis-related social media forum posts. Schizophrenia, 10(1), 60.

Plank, L., & Zlomuzica, A. (2025). Detecting psychosis via natural language processing of social media posts: Potentials and pitfalls. Neuropsychologia, 109325.

Sap, M., Park, G., Eichstaedt, J., Kern, M., Stillwell, D., Kosinski, M., Ungar, L., & Schwartz, H. A. (2014). Developing age and gender predictive lexica over social media. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1146–1151.

Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., Goldowsky-Dill, N., Heimersheim, S., Ortega, A., Bloom, J., et al. (2025). Open problems in mechanistic interpretability. arXiv Preprint arXiv:2501.16496.

Shiffman, S., Stone, A. A., & Hufford, M. R. (2008). Ecological momentary assessment. Annu. Rev. Clin. Psychol., 4(1), 1–32.

Spitzer, R. L., Kroenke, K., Williams, J. B., & Löwe, B. (2006). A brief measure for assessing generalized anxiety disorder: The GAD-7. Archives of Internal Medicine, 166(10), 1092–1097.

Stade, E. C., Ungar, L., Eichstaedt, J. C., Sherman, G., & Ruscio, A. M. (2023). Depression and anxiety have distinct and overlapping language patterns: Results from a clinical interview. Journal of Psychopathology and Clinical Science, 132(8), 972.

Steen, E., Yurechko, K., & Klug, D. (2023). You can (not) say what you want: Using algospeak to contest and evade algorithmic content moderation on TikTok. Social Media+ Society, 9(3), 20563051231194586.

Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2), 245.

Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1), 24–54.

Team, R. C. et al. (2016). R: A language and environment for statistical computing. R foundation for statistical computing, vienna, austria. Http://Www. R-Project. Org/.

Teitelbaum, L., & Simchon, A. (2025). Neural text embeddings in psychological research: A guide with examples in r. Psychological Methods.

Van Rossum, G., & Drake Jr, F. L. (1995). Python tutorial (Vol. 620). Centrum voor Wiskunde en Informatica Amsterdam, The Netherlands.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Wahl, I., Löwe, B., Bjorner, J. B., Fischer, F., Langs, G., Voderholzer, U., Aita, S. A., Bergemann, N., Brähler, E., & Rose, M. (2014). Standardization of depression measurement: A common metric was developed for 11 self-report depression measures. Journal of Clinical Epidemiology, 67(1), 73–86.

Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., & Wei, F. (2024). Multilingual e5 text embeddings: A technical report. arXiv Preprint arXiv:2402.05672.

Wright, A. G., Ringwald, W. R., Vize, C. E., Eichstaedt, J. C., Angstadt, M., Taxali, A., & Sripada, C. (2026). Assessing personality using zero-shot generative AI scoring of brief open-ended text. Nature Human Behaviour, 1–15.

https://quarto.org/
https://github.com/laurinplank/PsychometricVectors
Available at https://huggingface.co/openai/whisper-large-v2
Model available at https://huggingface.co/flair/ner-german-large
Available at https://github.com/cran/topics/blob/master/data/dep_wor_data.rda
Researchers may apply for access at https://erisk.irlab.org/
Available at https://huggingface.co/mixedbread-ai/deepset-mxbai-embed-de-large-v1
Available at https://huggingface.co/intfloat/multilingual-e5-large

GridSearchCV parameter	Value
estimator	Pipeline(step...e', Ridge())])
param_grid	{'ridge__alpha': array([1.0000...00000000e+06])}
scoring	'r2'
n_jobs	None
refit	True
cv	KFold(n_split... shuffle=True)
verbose	0
pre_dispatch	'2*n_jobs'
error_score	nan
return_train_score	False

Ridge parameter	Value
alpha	np.float64(2656.0877829466867)
fit_intercept	True
copy_X	True
max_iter	None
tol	0.0001
solver	'auto'
positive	False
random_state	None

	return_train_score return_train_score: bool, default=False If ``False``, the ``cv_results_`` attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance. .. versionadded:: 0.19 .. versionchanged:: 0.21 Default value was changed from ``True`` to ``False``	False

	alpha	np.float64(2656.0877829466867)
	fit_intercept	True
	copy_X	True
	max_iter	None
	tol	0.0001
	solver	'auto'
	positive	False
	random_state	None

	estimator	Pipeline(step...e', Ridge())])
	param_grid	{'ridge__alpha': array([1.0000...00000000e+06])}
	scoring	'r2'
	n_jobs	None
	refit	True
	cv	KFold(n_split... shuffle=True)
	verbose	0
	pre_dispatch	'2*n_jobs'
	error_score	nan
	return_train_score	False
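
The parameter listing above describes a cross-validated grid search over the ridge penalty. As a rough illustration only, a search with the same visible settings (`scoring='r2'`, a shuffled `KFold`, a `ridge__alpha` grid ending at 1e6, `refit=True`) might be set up as follows. Everything not shown in the output is an assumption made for this sketch: the synthetic data, the `StandardScaler` step (the pipeline steps before `ridge` are truncated in the output), the lower bound and size of the alpha grid, the number of folds, and the random seeds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): in the manuscript the features
# would be text embeddings and the target a questionnaire score.
X, y = make_regression(n_samples=120, n_features=20, noise=10.0, random_state=0)

# The step named "ridge" matches the 'ridge__alpha' key in the output.
# The "scale" step is hypothetical; the real preceding steps are elided.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])

# Log-spaced penalties. Only the upper end (1e6) is visible in the output;
# the lower end and the number of candidates are assumed here.
param_grid = {"ridge__alpha": np.logspace(-3, 6, 10)}

# Shuffled K-fold, as in the output; n_splits and the seed are assumed.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    scoring="r2",   # select alpha by cross-validated R^2
    cv=cv,
    refit=True,     # refit the best pipeline on all data
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
print("best alpha:", best_alpha)
print("mean cross-validated R^2:", search.best_score_)
```

With `refit=True`, the selected penalty is also available on the refitted model as `search.best_estimator_.named_steps["ridge"].alpha`, which is the kind of value shown in the `alpha` row of the Ridge listing.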
