1 Introduction

A popular way to improve the power of classifiers is to expand their application from a single data set (Fig. 1(a)) to multiple, independently collected sets of the same disease (Fig. 1(b)) [1]. To analyze across multiple sets, neuroimaging studies generally first harmonize the data, for example by regressing out demographic factors from the MRI measurements, and then train the classifier to distinguish disease from control samples [2]. However, harmonization might also mitigate group differences, making classification more difficult (as in Fig. 2). To improve classification accuracy, we propose the first approach to jointly learn how to harmonize MR image data and classify disease. Harmonization relies on both the control and the disease cohorts to reduce differences in the image-based measurements caused by acquisition differences between sets while preserving the group separation investigated by the classifier. We evaluate our approach on the challenging task (Fig. 1(c)) of being trained on two data sets that differ in their acquisition and the diseases they sample, and being tested on accurately distinguishing the two diseases while being impartial to acquisition differences, i.e., to the controls of the two sets.

Fig. 1. Examples for classifying data: (a) classification based on a single set, (b) separating healthy and disease cohorts based on multiple sets, which requires harmonization, and (c) separating two disease groups based on harmonizing the controls of disease-specific data sets.

Training of our approach is defined by an energy function that combines a linear harmonization model with a logistic regression classifier. Minimizing this function is confined to group cardinality constrained solutions, i.e., labeling is based on a small number of groups of image scores (counted via the \(l_0\)-‘norm’). Doing so reduces the risk of overfitting, accounts for inter-dependencies between regions (e.g., the bilateral impact of diseases), and identifies disease-distinguishing patterns (defined by the non-zero weights of the classifier) that avoid the issues of solutions based on relaxed sparsity constraints [3]. Inspired by [4], our method uses block coordinate descent to find the optimal parameters for harmonizing the two data sets and correctly labeling the disease samples while being indifferent to the control cohorts (i.e., to acquisition differences). During testing, we use those parameters to harmonize the image scores before performing classification, ensuring that the data set associated with a subject does not enter the labeling decision.

Using 5-fold cross-validation, we measure the accuracy of our method in distinguishing HIV-Associated Neurocognitive Disorder (HAND) from Mild Cognitive Impairment (MCI) based on two independently collected data sets; each set contains controls and samples of one disease only. Distinguishing HAND from MCI is clinically challenging due to the similarity in symptoms and the lack of standard protocols for assessing neuropsychological deficits [5]. Not only is our classifier indifferent to the controls of both sets, but it is also significantly more accurate in distinguishing the two diseases than the conventional sequential harmonization-and-classification approach and non-sparsity based logistic regression methods.

2 Joint Learning of Harmonization and Classification

Let a data set consist of a set \(S_A\) of controls and samples with disease A, and an independently collected set \(S_B\) of controls and samples with disease B. The four subsets are matched with respect to demographic scores, such as age. Each sample s of the joint set is represented by a vector of image features \(\mathbf x _s\) and a label \(y_s\), where \(y_s=-1\) if \(s \in S_A\) and \(y_s=+1\) if \(s \in S_B\). The acquisition differences between \(S_A\) and \(S_B\) are assumed to impact the image features linearly. To extract disease-separating patterns from this joint set, we first review the training of a sequential model for data harmonization and classification, and then propose to simultaneously parameterize both tasks by minimizing a single energy function.

2.1 Sequential Harmonization and Classification

Training the linear regression model for data harmonization results in a parameterization of the matrix U that minimizes the difference (with respect to the \(l_2\)-norm \(\Vert \cdot \Vert _2\)) between the inferred values \(U \cdot [1 ~ y_s]^\top \) (1 being the bias term of the linear model) and the raw image scores \(\mathbf x _s\) across the \(n_C\) controls of the joint data set, i.e.,

$$\begin{aligned} \widehat{U} := \arg \min _U h_C(U) \text{ with } h_C(U):= \frac{1}{n_C}\sum _{s=1}^{n_C} \Vert U \cdot [1 ~ y_s]^\top - \mathbf x _s\Vert _2^2. \end{aligned}$$
(1)
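As an illustration of Eq. (1) (not part of the original paper), the following Python sketch fits \(\widehat{U}\) by ordinary least squares; the array names X_c (raw control scores, one row per control) and y_c (set membership, -1 for \(S_A\), +1 for \(S_B\)) are our own conventions.

import numpy as np

def fit_harmonization(X_c, y_c):
    """Least-squares fit of U in Eq. (1): U @ [1, y_s] should approximate x_s over all controls.

    X_c : (n_C, d) raw image scores of the controls of both sets
    y_c : (n_C,)   set membership, -1 for S_A and +1 for S_B
    Returns U with shape (d, 2); its columns are the bias and the set effect.
    """
    D = np.column_stack([np.ones_like(y_c, dtype=float), y_c])   # design matrix with rows [1, y_s]
    # min_U sum_s ||U @ D[s] - X_c[s]||_2^2  is equivalent to  min_B ||D @ B - X_c||_F^2 with B = U.T
    B, *_ = np.linalg.lstsq(D, X_c, rcond=None)
    return B.T

def residuals(U, X, y):
    """Harmonized scores r_s = U @ [1, y_s] - x_s (Sect. 2.1), one row per sample."""
    D = np.column_stack([np.ones_like(y, dtype=float), y])
    return D @ U.T - X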

A group sparsity constrained logistic regression classifier is trained on the residuals \(\mathbf r _s(\widehat{U}) := \widehat{U} \cdot [1 ~ y_s]^\top - \mathbf x _s\), i.e., the harmonized scores, across the \(n_D\) samples of the two disease cohorts. Note that classification based on the inferred values \(\widehat{U} \cdot [1 ~ y_s]^\top \) is uninformative as all scores of a region are mapped to just two values. Now, let the logistic function be \(\theta (\alpha ) := \log (1+\exp (-\alpha ))\), let the weight vector \(\mathbf w \) encode the importance of each score of \(\mathbf r _s\), and let \(v\in \mathbb {R}\) be the label offset; then the penalty (or label) function of the classifier is defined by

$$\begin{aligned} l_D(\widehat{U}, v,\mathbf w ) := \frac{1}{n_D}\sum _{s=n_C+1}^{n_C+n_D}\theta \left( y_s \cdot \left( \mathbf w ^\top \cdot \mathbf r _s(\widehat{U}) + v \right) \right) . \end{aligned}$$
(2)
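A minimal sketch of the penalty \(l_D\) in Eq. (2), reusing the residuals helper above; X_d and y_d (raw scores and disease labels of the \(n_D\) disease samples) as well as w and v are assumed inputs in our own notation.

def logistic_penalty(U, v, w, X_d, y_d):
    """l_D of Eq. (2): mean logistic loss over the n_D disease samples.

    X_d : (n_D, d) raw scores of the disease samples
    y_d : (n_D,)   disease label, -1 for disease A and +1 for disease B
    """
    R = residuals(U, X_d, y_d)                  # harmonized scores r_s(U)
    margins = y_d * (R @ w + v)                 # y_s * (w^T r_s + v)
    return np.mean(np.log1p(np.exp(-margins)))  # theta(alpha) = log(1 + exp(-alpha))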

The search for the point \((\widehat{v},\widehat{\mathbf{w }})\) minimizing \(l_D(\widehat{U},\cdot ,\cdot )\) often has to be restricted so that \(\widehat{\mathbf{w }}\) conforms to disease-specific constraints, such as the bilateral impact of HAND or MCI on individual regions. These constraints can be modeled by group cardinality. Specifically, every entry of \(\mathbf w \) is assigned to one of \(n_G\) groups \(\mathbf g _1(\mathbf w ), \ldots ,\mathbf g _{n_G}(\mathbf w )\). The search space \(\mathbb {S}\) is composed of all weight vectors for which the number of groups with non-zero entries is at most a threshold \(k \in \mathbb {N}\), i.e., \(\mathbb {S}:=\left\{ \mathbf w : \sum _{i=1}^{n_G} \Vert \Vert \mathbf g _{i}(\mathbf w )\Vert _2\Vert _0 \le k \right\} \) with \(\Vert \cdot \Vert _0\) being the \(l_0\)-‘norm’. The training of the classifier is then fully described by the following minimization problem

$$\begin{aligned} (\hat{v},\hat{\mathbf{w }}) := \arg \min _{v \in \mathbb {R},\mathbf w \in \mathbb {S}} l_D(\widehat{U}, v,\mathbf w ), \end{aligned}$$
(3)

which can be solved via penalty decomposition [4, 6]. Note that Eq. (3) turns into a sparsity constrained problem if each group \(\mathbf g _i\) is of size 1. Furthermore, setting \(k = n_G\) turns Eq. (3) into an ordinary logistic regression problem. Finally, Eq. (3) can also be used to distinguish a single disease from controls by simply replacing \(y_s\) in Eq. (2) with a variable encoding the assignment to cohorts instead of data sets.
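For concreteness, the group cardinality constraint defining \(\mathbb {S}\) can be checked as follows; representing the groups as a list of index arrays (e.g., one array per left/right region pair) is our assumption, not a prescription of the paper.

def n_active_groups(w, groups):
    """Evaluates sum_i || ||g_i(w)||_2 ||_0, i.e., the number of groups with a non-zero entry.

    groups : list of integer index arrays, one per group g_i
    """
    return sum(np.linalg.norm(w[idx]) > 0 for idx in groups)

def in_search_space(w, groups, k):
    """Membership test for S = {w : at most k groups of w are non-zero}."""
    return n_active_groups(w, groups) <= k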

2.2 Simultaneous Harmonization and Classification

We now determine \((\widehat{U},\widehat{v},\widehat{\mathbf{w }})\) by solving a single minimization problem composed of the linear (harmonization) and logistic (classification) regression terms, i.e.,

$$\begin{aligned} (\widehat{U},\widehat{v},\widehat{\mathbf{w }}) := \arg \min _{U, v,\mathbf w \in \mathbb {S}} (1- \lambda ) \cdot l_D(U,v,\mathbf w ) + \lambda \cdot h_C(U), \end{aligned}$$
(4)

where \(\lambda \in (0,1)\). Note that the model fails to classify when \(\lambda =1\) (\(\widehat{v}\) and \(\widehat{\mathbf{w }}\) are undefined) and fails to harmonize when \(\lambda =0\) (the entries of \(\widehat{U}\) are undefined). Motivated by [4], we simplify the optimization by first parameterizing the classifier with respect to the ‘unconstrained’ vector \(\mathbf q \) before determining the corresponding sparse solution \(\widehat{\mathbf{w }}\). The solution to Eq. (4) is estimated by iteratively increasing \({\varrho }\) of

$$\begin{aligned} (U_{\varrho },v_{\varrho },\mathbf w _{\varrho },\mathbf q _{\varrho }) := \arg \min _{U, v,\mathbf w \in \mathbb {S}, \mathbf q } (1- \lambda ) \cdot l_D(U,v,\mathbf q ) + \lambda \cdot h_C(U) + {\varrho }\cdot \Vert \mathbf w -\mathbf q \Vert _2^2 \end{aligned}$$
(5)

until the maximum absolute difference between the elements of \(\mathbf w _{\varrho }\) and \(\mathbf q _{\varrho }\) is below a threshold, i.e., \(\Vert \mathbf w _{\varrho }-\mathbf q _{\varrho }\Vert _{\max } < \epsilon _{P}\). \((U_{\varrho },v_{\varrho },\mathbf w _{\varrho },\mathbf q _{\varrho })\) are determined by block coordinate descent (BCD). As outlined in Algorithm 1, let \((U',v',\mathbf w ',\mathbf q ')\) be the current estimates of \((U_{\varrho }, v_{\varrho },\mathbf w _{\varrho },\mathbf q _{\varrho })\); then \(U'\) is updated by solving Eq. (5) with fixed (\(v'\), \(\mathbf w '\), \(\mathbf q '\)):

$$\begin{aligned} U':= \arg \min _{U} ~ (1-\lambda ) \cdot l_D(U,v',\mathbf q ')+\lambda \cdot h_C(U). \end{aligned}$$
(6)

As this minimization is over a convex and smooth function, Eq. (6) is solved via gradient descent. Note that determining \(U'\) is equivalent to increasing the separation between the two disease groups by minimizing \(l_D(\cdot ,v',\mathbf q ')\) while reducing the difference between the two control groups by minimizing \(h_C(\cdot )\).
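To make the update of Eq. (6) concrete, the sketch below performs plain gradient descent on \((1-\lambda ) \cdot l_D(U,v',\mathbf q ') + \lambda \cdot h_C(U)\); the fixed step size and iteration count are simplifications of ours, not the schedule used by the authors.

def grad_U(U, v, q, X_c, y_c, X_d, y_d, lam):
    """Gradient of (1 - lam) * l_D(U, v, q) + lam * h_C(U) with respect to U."""
    # h_C term: (2 / n_C) * sum_s (U [1, y_s]^T - x_s) [1, y_s]  over the controls
    D_c = np.column_stack([np.ones_like(y_c, dtype=float), y_c])
    g_h = 2.0 / len(y_c) * (D_c @ U.T - X_c).T @ D_c
    # l_D term: (1 / n_D) * sum_s theta'(m_s) * y_s * q [1, y_s]  with m_s = y_s (q^T r_s + v)
    D_d = np.column_stack([np.ones_like(y_d, dtype=float), y_d])
    m = y_d * ((D_d @ U.T - X_d) @ q + v)
    dtheta = -1.0 / (1.0 + np.exp(m))            # theta'(m) = -sigmoid(-m)
    g_l = np.outer(q, (dtheta * y_d) @ D_d) / len(y_d)
    return (1.0 - lam) * g_l + lam * g_h

def update_U(U, v, q, X_c, y_c, X_d, y_d, lam, step=1e-2, n_steps=200):
    """Plain gradient descent for Eq. (6) with a fixed step size (a simplification)."""
    for _ in range(n_steps):
        U = U - step * grad_U(U, v, q, X_c, y_c, X_d, y_d, lam)
    return U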

Next, BCD updates \(v'\) and \(\mathbf q '\) by keeping (\(U'\),\(\mathbf w '\)) fixed in Eq. (5), i.e.,

$$\begin{aligned} (v',\mathbf q '):= \arg \min _{v,\mathbf q } (1-\lambda ) \cdot l_D(U',v,\mathbf q ) + {{\varrho }} \cdot \Vert \mathbf w '-\mathbf q \Vert _2^2. \end{aligned}$$
(7)

This minimization problem is also defined by a smooth and convex function, so its solution is likewise estimated via gradient descent. Finally, \(\mathbf w '\) is updated by solving Eq. (5) with fixed (\(U'\), \(v'\), \(\mathbf q '\)), i.e.,

$$\begin{aligned} \mathbf w ':= \arg \min _\mathbf{w \in \mathbb {S}} \Vert \mathbf w -\mathbf q '\Vert _2^2. \end{aligned}$$
(8)

As shown in [4, 6], the closed-form solution of Eq. (8) first computes \(\Vert \mathbf g _i(\mathbf q ')\Vert _2\) for each group i and then sets \(\mathbf w '\) to the entries of \(\mathbf q '\) that are assigned to the k groups with the highest norms. The remaining entries of \(\mathbf w '\) are set to 0. Steps (6)–(8) are repeated until the relative changes of (\(U'\), \(v'\), \(\mathbf w '\), \(\mathbf q '\)) between iterations are smaller than a threshold \(\epsilon _{B}\). (\(U_{\varrho }\), \(v_{\varrho }\), \(\mathbf w _{\varrho }\), \(\mathbf q _{\varrho }\)) is then updated with the converged (\(U'\), \(v'\), \(\mathbf w '\), \(\mathbf q '\)), \({\varrho }\) is increased, and another BCD loop is initiated until \(\mathbf w _{\varrho }\) and \(\mathbf q _{\varrho }\) converge towards each other (see Algorithm 1 for details).
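The closed-form projection of Eq. (8) can be written in a few lines; as before, the representation of the groups as index arrays is our own convention.

def project_group_sparse(q, groups, k):
    """Closed-form solution of Eq. (8): keep the entries of q belonging to the k groups
    with the largest l2-norm and set all remaining entries to zero."""
    norms = np.array([np.linalg.norm(q[idx]) for idx in groups])
    keep = np.argsort(norms)[-k:]              # indices of the k groups with the highest norms
    w = np.zeros_like(q)
    for i in keep:
        w[groups[i]] = q[groups[i]]
    return w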

Fig. 2. Impact of harmonization on the classification (black line) of two synthetic sets. Compared to the raw scores, the Sequential approach mitigated differences between the two disease groups by ‘pushing’ them together. Our joint model with \(\lambda = 0.5\) is the only approach that is indifferent to controls and correctly labels all disease cases.

Figure 2 showcases the differences between sequential and joint harmonization and classification. Each of the two synthetic data sets consists of a control and a disease cohort, where the raw scores of each cohort are 20 random samples from a Gaussian distribution whose covariance is the identity matrix multiplied by 0.01. The mean of the Gaussian for Disease A of Set I (blue) is (1.3, 2), resulting in samples that are somewhat separated from those of Disease B of Set II (mean = (1.5, 2), red). The difference in data acquisition between the two sets is simulated by an even larger separation of the means of the two control groups (Set I: mean = (0.9, 1), green; Set II: mean = (1.2, 1), black). The Sequential method (see Sect. 2.1 without sparsity) harmonizes the scores so that the classifier assigns the controls of both sets to the same side of the separating plane (black line), i.e., the plane is impartial to acquisition differences. However, this plane fails to perfectly separate the two disease cohorts as those cohorts are ‘pushed’ together, with the mean of Disease B now lying to the right of the mean of Disease A. Higher accuracy in disease classification is achieved by our joint model (omitting sparsity) with \(\lambda =0.8\). Comparing this plot to the results with \(\lambda =0.5\) shows that, as \(\lambda \) decreases, the emphasis on separating the two diseases increases as intended by Eq. (6). The classifier is still impartial to acquisition differences and perfectly labels the samples of the two disease cohorts. In summary, the joint model enables data harmonization that preserves group differences, which is not the case for the sequential approach.
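The synthetic setting of Fig. 2 can be regenerated from the description above; the random seed and variable names are our own choices, and the model fitting and plotting are omitted.

import numpy as np

rng = np.random.default_rng(0)                 # seed chosen arbitrarily for reproducibility
cov = 0.01 * np.eye(2)                         # identity covariance scaled by 0.01

# 20 samples per cohort, drawn around the means described in the text
disease_A   = rng.multivariate_normal([1.3, 2.0], cov, size=20)  # Set I, blue
disease_B   = rng.multivariate_normal([1.5, 2.0], cov, size=20)  # Set II, red
controls_I  = rng.multivariate_normal([0.9, 1.0], cov, size=20)  # Set I, green
controls_II = rng.multivariate_normal([1.2, 1.0], cov, size=20)  # Set II, black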

3 Distinguishing HAND from MCI

We tested our method on the union of two independently collected data sets: the ‘UHES’ set [7] contained cross-sectional MRIs of 15 HAND cases and 21 controls, while the ‘ADNI’ set contained MRIs of 20 MCI cases and 18 controls. The 4 groups were matched with respect to sex and age. Each MRI was acquired on a 3T Siemens scanner (with the two sets following different acquisition protocols) and was processed by [8], resulting in 102 regional volume scores. We assigned those scores to (\(n_G\) = 52) groups to account for the bilateral impact of MCI and HAND on both hemispheres. We now review our experimental setup and results, highlighting that jointly learning harmonization and classification produces findings that are indifferent to the two control cohorts and more accurate in distinguishing the two disease groups than alternative learning models.

To measure the accuracy of our method, we used 5-fold cross-validation with each fold containing roughly 20\(\%\) of each cohort. On the training sets, we used parameter exploration to determine the optimal group cardinality \(k \in \left\{ 1, 2,\ldots , 10 \right\} \) and weighting \(\lambda \in \left\{ 0.1, 0.2,\ldots , 0.9\right\} \). Across all folds and settings, our algorithm converged within 5 iterations, while each BCD optimization converged within 500 iterations. For each setting, we then computed the disease accuracy of the classifier by separately computing the accuracy with which each cohort (MCI and HAND) was assigned to the correct set (UHES or ADNI) and averaging those scores to account for the imbalance in cohort sizes. We determined the control accuracy by repeating those computations with respect to the two control groups. Since a higher control accuracy implies worse harmonization, an unbiased control accuracy, i.e., around 50\(\%\), coupled with a high disease accuracy was viewed as preferable. Note that an unbiased control accuracy only qualifies the harmonization in that the remaining acquisition differences do not impact the separating plane (or weights \(\widehat{\mathbf{w }}\)) of the classifier. For each training set, we then chose the settings \((\lambda ,k)\), with corresponding weights \(\widehat{\mathbf{w }}\) and harmonization parameters \(\widehat{U}\), that produced the largest difference between the two accuracy scores. This criterion selected a unique setting for 2 folds, 2 settings for 2 folds, and 3 settings for 1 fold.

On the test set, we computed the harmonized scores (residuals) of all samples for each selected setting. We then hid the subjects’ data set associations from the ensemble of selected classifiers by applying the classifiers, parameterized by the corresponding weight settings, to those residuals. Based on those results, we computed the control and disease accuracy as well as the corresponding p-values via Fisher’s exact test [9]. We viewed \(p<0.05\) as significantly more accurate than randomly assigning samples to the two sets. An ensemble of classifiers was viewed as informative if separating HAND from MCI resulted in a significant p-value while separating the controls did not.
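As a sketch of how the reported scores could be computed (our own helper functions, using SciPy's fisher_exact), the balanced accuracy averages the per-cohort accuracies and the p-value is derived from the 2x2 table of true versus predicted set labels; the exact contingency table used in the paper is not specified, so the layout below is an assumption.

import numpy as np
from scipy.stats import fisher_exact

def balanced_accuracy(y_true, y_pred):
    """Average of the per-cohort accuracies; accounts for unequal cohort sizes."""
    return np.mean([np.mean(y_pred[y_true == c] == c) for c in (-1, +1)])

def fisher_p_value(y_true, y_pred):
    """p-value of Fisher's exact test on the 2x2 table of true vs. predicted labels."""
    table = [[int(np.sum((y_true == a) & (y_pred == b))) for b in (-1, +1)] for a in (-1, +1)]
    return fisher_exact(table)[1]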

We repeated the above experiment for the sequential harmonization and classification approach of Sect. 2.1, called SeqGroup, to show the advantages of our joint learning approach, called JointGroup. To generalize our findings about jointly learning harmonization and classification parameters to non-sparsity constrained models, we also tested both approaches without the group sparsity constraint, i.e., with \(k = 52\). JointNoGroup refers to the results of the corresponding joint method and SeqNoGroup to those of the sequential one.

Table 1 shows the accuracy scores and p-values of all implementations, listed according to the difference between disease and control accuracy. The method with the smallest difference, SeqNoGroup, was the only approach recording a significant control accuracy (\(68.4\,\%\), \(p=0.024\)). The corresponding separating plane was thus not impartial to acquisition differences, so that the relatively high disease accuracy (\(82.5\,\%\)) was insignificant. The next approach, SeqGroup, obtained a non-significant accuracy score for controls (\(64.6\,\%\)), indicating that the group sparsity constraint improved the generalization of the results from the relatively small training set to testing. The difference between control (\(52.8\,\%\)) and disease accuracy (\(83.3\,\%\)) almost doubled (to 30.5\(\%\)) for the joint approach JointNoGroup. As reflected in Fig. 2, learning data harmonization separately from classification does not generalize as well as the joint approach, which harmonizes the data so that differences between the disease groups are preserved. Confirming all previous findings, the overall best approach (i.e., the one whose control and disease accuracy differ the most) is our proposed JointGroup, which achieves a disease accuracy of \(90\,\%\).

Table 1. Implementations listed according to the difference between disease (MCI vs. HAND) and control (UHES controls vs. ADNI controls) accuracy. In bold are the findings that are statistically significantly different from chance, i.e., \(p<0.05\). Our method, JointGroup, achieves the highest disease accuracy and the largest spread in scores.
Fig. 3. Each column shows the largest, unique pattern extracted by JointGroup on one of the 5 training sets. Identified regions are impacted by HAND, HIV, or MCI.

The group sparsity constraint aided in separating the diseases and identified patterns of regions (i.e., regions with non-zero weights \(\widehat{\mathbf{w }}\)) impacted by either MCI or HAND (or HIV). Each column of Fig. 3 shows the largest, unique pattern associated with a training set. For those training sets that selected multiple patterns (i.e., multiple \(\widehat{\mathbf{w }}\) settings), the patterns with fewer regions were always contained in the largest pattern. The precentral gyrus, cerebellum VIII, and the lateral ventricle were part of all patterns. HIV is known to impact the cerebellum [10], and accelerated enlargement of the ventricles is linked to both HIV [11] and MCI [12]. These findings indicate that the extracted patterns are informative with respect to MCI and HAND (and HIV), which requires an in-depth morphometric analysis for confirmation.

4 Conclusion

We proposed an approach that simultaneously learned the parameters for data harmonization and disease classification. The search for the optimal separating plane was confined by group cardinality to reduce the risk of overfitting, to explicitly account for the impact of disease on the inter-dependency of regions, and to identify disease-specific patterns. In separating HAND from MCI samples of two disease-specific data sets, our joint approach achieved better classification accuracy than the non-sparsity based models and the sequential approaches.