Paper • Open access

Using random forest for brain tissue identification by Raman spectroscopy


Published 14 December 2023 © 2023 The Author(s). Published by IOP Publishing Ltd
Citation: Weiyi Zhang et al 2023 Mach. Learn.: Sci. Technol. 4 045053. DOI: 10.1088/2632-2153/ad1349


Abstract

The traditional definitive diagnosis of brain tumors is performed by needle biopsy under the guidance of imaging-based exams. This paradigm relies on the experience of radiologists, and its accuracy can be affected by uncertainty in imaging interpretation and needle placement. Raman spectroscopy has the potential to improve needle biopsy by providing fingerprints of different materials and performing in situ tissue identification. In this paper, we present the development of a supervised machine learning algorithm using random forest (RF) to distinguish the Raman spectra of different types of tissue. An integral process from raw data collection and preprocessing to model training and evaluation is presented. To illustrate the feasibility of this approach, viable animal tissues were used, including ectocinerea (grey matter), alba (white matter), and blood vessels. Raman spectra were acquired using a custom-built Raman spectrometer. The hyperparameters of the RF model were determined by combining a cross-validation-based search with manual adjustment. The experimental results show the ability of our approach to discriminate different types of tissue with high accuracy.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Accurate tissue identification and characterization play an important role in a variety of applications. For example, in many forensic cases, the identification of biological materials found at crime scenes can be very informative [1]. In the biomedical field, characterization of abnormal tissues in patients enables clinicians to diagnose diseases and prescribe appropriate treatments [2, 3]. In particular, tumor identification and classification are essential for early cancer detection and selection of the most appropriate therapies. Glioma is one of the most common and life-threatening primary brain tumors and can arise from a variety of brain cells [4]. These tumors are invariably invasive and cause local tissue destruction that leads to neurological symptoms. When patients report symptoms, several imaging-based diagnostic approaches are available for the preliminary diagnosis of brain tumors, such as computed tomography (CT) and magnetic resonance imaging (MRI). These imaging modalities, however, are not specific and do not provide the critical histological and molecular information that is used to classify these tumors. Recently, image processing and machine learning algorithms have been developed to aid this process, such as a robust random forest (RF) classifier for brain tumor detection using MRI [5] and convolutional neural networks for human brain cancer using optical coherence tomography images [6]. However, these techniques are still in the early development stage, and the current paradigm involves surgical sampling of tumors by image-guided open surgery (craniotomy) or stereotactic needle biopsy. Craniotomy is the surgical procedure of choice for bulky gliomas that can be safely resected, but tumors that are invasive and deeply seated are often biopsied under stereotactic guidance.

Although needle biopsy has been widely used in the diagnosis of a variety of cancers, it faces unique challenges in glioma diagnosis due to intratumor heterogeneity, risk of hemorrhage, and excessive tumor necrosis (dead non-diagnostic tissue). Furthermore, due to the finite resolution of medical imaging, intraoperative brain shift, and potential error in needle placement, tissue samples extracted by a biopsy needle could misrepresent the true glioma subtype and misguide subsequent treatment.

To improve the accuracy and safety of brain tumor diagnostic procedures, we plan to combine machine learning and spectroscopy with needle biopsy to guide the selection of the tissue biopsy location (figure 1). Among spectroscopic techniques, Raman spectroscopy is one of the most widely used to analyze materials [7–9], and has been applied to detect and classify several types of tumors in open procedures [10, 11]. Optical spectroscopy methods, such as Raman and infrared spectroscopy, measure the molecular vibrational states in samples. Such information allows detecting changes in composition and chemical environment. When a laser photon interacts with a molecule of the sample, the scattered photon may undergo an energy shift corresponding to a change in the molecule's vibrational energy levels. In Raman spectroscopy, this inelastically scattered light is detected by a CCD and converted into electronic signals so that the information can be received and processed on a computer [12]. The peaks in a Raman spectrum are characteristic of the measured material, allowing for the identification of different materials in a more quantitative way than images [13]. Raman spectra provide valuable insights into the molecular and physical properties of materials by providing vibrational information, making Raman spectroscopy a powerful tool in scientific disciplines such as chemistry, materials science, physics, biology, and geophysics. The energies and intensities of the peaks in the Raman spectrum are sensitive to the compositions, structures, bonds, and chemical environments of the materials. The technical details of Raman spectroscopy can be found in [14]. However, manual identification of spectra requires training, and false identification can be common. Therefore, it is necessary to develop a reliable method to analyze spectra autonomously and to improve the efficiency and accuracy of spectral identification.


Figure 1. A biopsy needle equipped with Raman spectral sensing is capable of identifying tissue at the needle tip and guiding localization to biopsy targets.


In the medical field, the combination of Raman spectroscopy and machine learning algorithms is gaining popularity. For example, Raman spectroscopy and two machine learning methods, i.e. partial least squares discriminant analysis and support vector machine (SVM), were used for early detection of gastric cancer [15]. As reported in [16], deep learning (DL), SVM, and extreme gradient boosting trees were used with Raman spectroscopy to determine tumor markers for colon cancer. In addition, the combination of DL and Raman spectroscopy allows rapid detection of melanoma at the single cell level [17].

This work explores RF as a robust tool for the classification of brain tissue. As a supervised machine learning algorithm, RF is widely used for classification and regression problems [18]. A prominent benefit of using RF is that it limits overfitting during training to some extent, as RF is based on the aggregation of several decision trees [19] using bagging, which is more robust against overfitting [20]. Additionally, RF is less sensitive to outliers in the training data [21] and to the parameters used to run it [22]. RF can avoid the difficulty of handling categorical variables with many values by selecting a random subset of them [18]. Another benefit is that it allows one to train the model on small data sets. Several publications have identified RF as a suitable algorithm among others in similar situations [23–25]. RF has been used with Raman spectroscopy to diagnose several types of cancer, such as lung cancer [26] and laryngeal carcinoma [27]. Its efficacy in identifying various types of brain tissue and tumor tissue has not yet been explored.

The aim of this work is to validate the feasibility of RF to identify spectral differences in various tissues, thus laying the foundation for the use of machine learning and Raman spectroscopy to guide brain tumor biopsy procedures. Several critical questions arise: (i) how to optimize Raman spectroscopy to achieve sufficient data quality? (ii) What is the appropriate procedure for data preprocessing? (iii) How to build and train a reasonable model? (iv) What is an effective way to evaluate model performance? In this work, we address these questions by designing and implementing a new classification workflow for animal tissues. The necessary data acquisition and preprocessing are described, the RF algorithm and the choice of parameters are elaborated, and the training results are presented and discussed.

2. Methods

2.1. Acquisition of Raman spectra

Three types of viable tissue were prepared, including ectocinerea (grey matter), alba (white matter), and blood vessels. A freshly dissected goat brain (figure 2) was acquired from a slaughterhouse. The brain was kept at −18 °C for 24 h. The frozen brain was cut to a thickness of approximately 2 mm using a scalpel for Raman measurement. The samples were fixed on a glass slide using super glue and acclimatized to room temperature. The slide was mounted on a three-axis translation stage (figure 2). Data sampling was done by moving the stage. Since the blood vessels contained in the sliced tissue are extremely small, we sliced a goat aorta to create blood vessel samples to develop our approach.


Figure 2. Preparation of tissue samples and installation on an open-bench Raman spectrometer.


In our study, Raman spectra of tissue samples were acquired using a custom-built open-bench Raman system at 532 nm laser excitation. The spectrometer has a focal length of 500 mm and a grating density of 1200 mm−1, and its CCD camera, with a 20 × 20 μm pixel size, offers an optimal combination of high resolution and dynamic range. An input laser power of 10 mW or less was used to avoid sample damage. A long-working-distance microscope objective was used to focus and collect light in reflective geometry, and a notch filter was used to remove the Rayleigh scattering. The objective was refocused at each data collection site. The spectra were manually tagged with the type of sample under the microscope. Measurement parameters, such as wavenumber range and exposure time, were optimized by collecting a small amount of data at different settings. Details are discussed in section 3.3.

2.2. Data pre-processing

Data preprocessing is performed to improve the accuracy of identification [28]. It includes sporadic signal removal, rebinning, background removal, data normalization, and data augmentation, as discussed in the following sections. To avoid data leakage, where training data appear in the test set [29], all data were shuffled and then divided into a training set and a test set in an 8:2 ratio before data preprocessing. Figure 3 shows the Raman spectrum at each stage of preprocessing, over a wavenumber range of 892–2062 cm−1 with a 120 s exposure time.


Figure 3. Example data preprocessing of the Raman spectrum: (a) original Raman spectrum, (b) Raman spectrum after the sporadic signal is removed, (c) Raman spectrum after rebinning, (d) Raman spectrum after background removal, and (e) normalized Raman spectrum.


2.2.1. Sporadic signal removal and spectral rebinning

Sporadic signals from background radiation are inevitable. In most cases, a sporadic signal appears as a single high-count pixel on the detector. An algorithm was designed to recognize and remove these signals as follows. First, for each pixel in the Raman spectrum, the standard deviation of the intensity was calculated from its 20 neighboring pixels. A multiple of this standard deviation was then set as the threshold for the detection of sporadic signals. If a pixel's deviation exceeded the threshold, it was replaced by the average of its two adjacent pixels. The threshold was selected by comparing results obtained with different multiples of the standard deviation; notably, no significant differences in performance were found when 2–5 times the standard deviation was used as the threshold.
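The steps above can be sketched as follows. The function name and windowing details are illustrative, with the threshold expressed as k times the local standard deviation (the paper reports similar results for multiples between 2 and 5):

```python
import numpy as np

def remove_spikes(spectrum, window=20, k=3):
    """Replace single-pixel spikes with the mean of their two neighbours.

    A pixel is flagged when its deviation from the local mean exceeds
    k times the standard deviation of the `window` surrounding pixels
    (the pixel itself is excluded from the local statistics).
    """
    out = spectrum.astype(float).copy()
    n = len(out)
    half = window // 2
    for i in range(1, n - 1):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        neigh = np.delete(spectrum[lo:hi], i - lo)  # exclude pixel i
        if abs(spectrum[i] - neigh.mean()) > k * neigh.std():
            out[i] = 0.5 * (out[i - 1] + out[i + 1])
    return out
```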

If a sporadic signal occupies two or three pixels in the Raman spectrum, this method cannot remove it. Since this happens rarely, it has a negligible effect on the results. Once the sporadic signals were removed, all spectra were rebinned into 1 cm−1 intervals using a weighted rebinning method, because the original wavenumber intervals are not uniform [30].
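A minimal sketch of resampling onto a uniform 1 cm−1 grid is shown below. It approximates the weighted rebinning of [30] with linear interpolation, so it illustrates the idea rather than reproducing the exact method:

```python
import numpy as np

def rebin_spectrum(wavenumbers, counts, step=1.0):
    """Resample a spectrum with non-uniform wavenumber spacing onto a
    uniform grid of `step` cm^-1 (approximation via linear interpolation)."""
    grid = np.arange(np.ceil(wavenumbers[0]),
                     np.floor(wavenumbers[-1]) + step, step)
    return grid, np.interp(grid, wavenumbers, counts)
```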

2.2.2. Background removal

The background of Raman spectra may bias the identification process. The background was estimated using improved asymmetric least squares [31], which is based on polynomial fitting, and then subtracted. The optimal smoothing parameter $\lambda $ and the penalizing weight factor $p$ were set to $10^5$ and 0.02, respectively.
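For illustration, a standard asymmetric least squares baseline (in the style of Eilers' algorithm) with the quoted $\lambda $ and $p$ values can be sketched as below; note that the paper uses an improved variant [31], so this is an approximation:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def asls_baseline(y, lam=1e5, p=0.02, n_iter=10):
    """Estimate a smooth baseline by asymmetric least squares.

    lam penalizes curvature (second differences); p down-weights points
    above the current baseline so peaks are largely ignored.
    """
    n = len(y)
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    H = lam * (D.T @ D)
    w = np.ones(n)
    for _ in range(n_iter):
        W = sparse.diags(w)
        z = spsolve(sparse.csc_matrix(W + H), w * y)
        w = np.where(y > z, p, 1.0 - p)  # asymmetric reweighting
    return z
```

Subtracting the returned baseline from the raw spectrum yields the background-removed spectrum of figure 3(d).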

2.2.3. Spectral normalization and augmentation

The Raman spectra were normalized to the range 0–1 to prevent high-count pixels from dominating the training.

To increase the data size, a spectral augmentation algorithm was also implemented. The data augmentation algorithm includes two steps: randomly shifting each spectrum by a few pixels according to a normal distribution, and adding Gaussian noise to each pixel at a scale of 5% of the maximum intensity. Shifting simulates small changes in alignment, while adding noise simulates statistical fluctuations. Both are common data augmentation techniques in machine learning: shifting can improve the robustness and generalization of the model [32], and adding noise can improve the model's ability to handle noisy or low signal-to-noise ratio data [33]. Data augmentation should be applied cautiously to avoid generating non-representative data that do not accurately reflect the variations and noise observed in real experiments; prudent application can help mitigate such errors. The evaluation results indicate that our data augmentation matches the expected fluctuation in Raman measurements. After removing spectra with insufficient statistics, 654 spectra were used for spectral augmentation, of which 269 were from grey matter, 264 from white matter, and 121 from blood vessels. Data augmentation increased the number of training and test spectra to 24 300 and 8400, respectively. Finally, the Raman spectra of white matter, grey matter, and blood vessels were labeled.
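The two augmentation steps can be sketched as follows; `n_copies` and `shift_sigma` are illustrative values not specified in the text:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def augment(spectrum, n_copies=10, shift_sigma=2.0, noise_frac=0.05):
    """Generate augmented copies of a normalized spectrum.

    Each copy is shifted by a few pixels drawn from a normal distribution
    (simulating alignment drift) and perturbed with Gaussian noise scaled
    to 5% of the maximum intensity (simulating statistical fluctuations).
    """
    copies = []
    for _ in range(n_copies):
        shift = int(round(rng.normal(0.0, shift_sigma)))
        shifted = np.roll(spectrum, shift)
        noise = rng.normal(0.0, noise_frac * spectrum.max(), spectrum.size)
        copies.append(shifted + noise)
    return np.array(copies)
```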

3. Classifier training

3.1. Choice of training algorithm

Numerous supervised machine learning algorithms are available for classification problems, including SVM, k-nearest neighbors (KNN), and RF, each with distinct strengths and limitations. SVM is a powerful and flexible algorithm that excels at processing high-dimensional data and unbalanced datasets [34]. However, estimating class probabilities, which reflect the confidence level of the model, requires a computationally expensive operation for large datasets [35]. KNN is not optimized for high-dimensional spaces due to the intrinsic difficulty of processing the KNN query in such spaces [36], and Raman spectra contain over one thousand features. This paper focuses solely on ensemble-based methods, which offer specific advantages over DL techniques (e.g. artificial neural networks) for our particular classification task, such as interpretability, data efficiency, and reduced computational complexity [37].

Gradient boosting is a popular ensemble learning algorithm for classification problems. It constructs the model by iteratively adding decision trees to the ensemble, with each new tree intended to correct the errors made by the previous trees [38]. The errors represent the difference between the actual and predicted values. Each new tree is added to the ensemble, and the resulting ensemble is used to predict the target variable [39]. The algorithm optimizes the decision tree model by minimizing a loss function [40].

For comparison with the RF classifier, a gradient boosting classifier was trained. The hyperparameter values of the gradient boosting classifier were selected based on the results of a cross-validated grid-search algorithm [41]. The model was trained using 10-fold cross-validation to prevent overfitting, with 19 decision trees and a maximum depth of 6. The mean score was 0.961 (±0.016) for gradient boosting, whereas the mean score was 0.962 (±0.003) for RF. The scores were nearly identical, but the gradient boosting classifier was more computationally demanding than RF: the RF model takes seconds to train, whereas gradient boosting requires roughly three orders of magnitude more training time. This is because gradient boosting is a sequential process, with each new tree built to improve on the previous trees, whereas the decision trees in RF can be built independently and run in parallel. The details of RF are discussed in sections 3.2 and 3.3.
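A comparison in this spirit can be reproduced with scikit-learn as sketched below, using synthetic stand-in data in place of the preprocessed spectra; the hyperparameter values follow those quoted in the text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed, labeled Raman spectra.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)

# 19 trees / depth 6 for gradient boosting; 12 trees / depth 12 for RF,
# evaluated with 10-fold cross-validation as in the text.
gb = GradientBoostingClassifier(n_estimators=19, max_depth=6, random_state=0)
rf = RandomForestClassifier(n_estimators=12, max_depth=12, random_state=0)

gb_scores = cross_val_score(gb, X, y, cv=10)
rf_scores = cross_val_score(rf, X, y, cv=10)
print(f"GB: {gb_scores.mean():.3f} (+/-{gb_scores.std():.3f})")
print(f"RF: {rf_scores.mean():.3f} (+/-{rf_scores.std():.3f})")
```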

3.2. RF

RF is an ensemble model for classification. When building the model, bootstrap sampling is used to assign training data to each decision tree [42], which is a binary tree. At each node, Raman spectra are placed into one of two subgroups using a randomly chosen feature [35]. The splitting criterion is to maximize the decrease in Gini impurity from each node to its child nodes [43]. Each decision tree grows until it reaches the maximum depth of the classifier. The RF algorithm combines classifiers by averaging their probabilistic predictions, rather than letting each classifier vote for a single class [44].

Figure 4 is a schematic of how the RF algorithm classifies a Raman spectrum. In this study, the preprocessed Raman spectra used for training the model were drawn from the training set using bootstrap sampling. At each node, the spectra were split using the criterion that decreases the Gini impurity the most. This process was repeated to create a forest of decision trees. The probabilistic predictions of the decision trees were averaged, and the resulting class of the spectrum is the output of the model. The details of the parameter selection of the model are discussed in section 3.3.1.


Figure 4. Schematic of random forest identification. The circles represent the nodes of the trees. The blue circles denote a possible path when a spectrum is passed through the model.


The classifier was implemented using Python 3.8.3 [45], Scikit-learn 0.23.1 [44], Matplotlib 3.2.2 [46] and SciPy 1.0 [47], running on a laptop with an Intel Core i5-8250U processor.

3.3. Choice of parameters

3.3.1. Parameter selection

Hyperparameters (e.g. the number of trees in the forest and the maximum depth of each tree) are arguments passed to the constructor of the RF classifier class [44]. The choice and values of the hyperparameters significantly influence the result [48]. An exhaustive cross-validated grid search over a specified range of parameter values [41] was used to determine the approximate hyperparameter ranges, and the hyperparameter combination was then tuned manually for the best test score.
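The first stage of this procedure might look like the following sketch, with an illustrative parameter grid and synthetic stand-in data; the best parameters found serve as the starting point for manual fine-tuning:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed Raman spectra.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)

# Illustrative grid around the values reported in the paper.
param_grid = {"n_estimators": [8, 12, 16, 20],
              "max_depth": [6, 9, 12, 15]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=10)
search.fit(X, y)
print(search.best_params_)  # starting point for manual adjustment
```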

The number of trees in the forest and the maximum depth of each tree were both set to 12. Increasing either parameter further produced no significant improvement in model performance, while a slight decrease in accuracy was observed with smaller values. The function that measures the quality of each split was the Gini impurity, a criterion commonly used in RF.

3.3.2. Feature number reduction

Reducing the number of features is necessary since irrelevant features may negatively affect the training and increase computation cost [49]. To reduce the number of features, we started with a wide range of wavenumbers containing over 4000 pixels and narrowed it down to 892–2062 cm−1 using feature importance. The latter is calculated pixel-wise as the mean (with standard deviation across trees) of the accumulated decrease in Gini impurity in each tree [50]. The reduced range did not degrade model performance compared to the full wavenumber range and significantly reduced the measurement time. Figure 5 shows a semilog plot of the feature importance of the chosen Raman spectral features by pixel index. The chosen range contains sufficient information to contribute to the predictive power of the model.
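Feature selection by importance can be sketched as below; the cumulative-importance threshold is illustrative, since the paper instead narrows the range to the contiguous 892–2062 cm−1 window:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for full-wavenumber-range spectra (>4000 pixels in reality).
X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           n_classes=3, random_state=0)

rf = RandomForestClassifier(n_estimators=12, max_depth=12,
                            random_state=0).fit(X, y)

# Mean decrease in Gini impurity per pixel, with spread across trees.
importances = rf.feature_importances_
std = np.std([t.feature_importances_ for t in rf.estimators_], axis=0)

# Keep the pixels that carry 95% of the total importance (illustrative rule).
order = np.argsort(importances)[::-1]
keep = np.sort(order[np.cumsum(importances[order]) <= 0.95])
X_reduced = X[:, keep]
```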


Figure 5. Semi-log plot of feature importance. The y-axis shows the mean decrease in Gini impurity.


3.3.3. Optimization of experiment parameters

The exposure time is the accumulation time of the CCD camera and affects the signal-to-noise ratio of the Raman spectrum. Longer exposure times result in a higher signal-to-noise ratio and improved spectral quality. This can lead to increased accuracy and precision in quantitative measurements, as well as better detection of weaker Raman features. Figure 6 shows the test scores for different exposure times using grey matter and white matter. The abscissa represents the exposure time and the ordinate the corresponding test score. 'Training' in this plot denotes a test set with a 120 s exposure time, the same as that of the training set. The prediction accuracy of the test set decreases with shorter exposure time: the scores are above 0.9 when the exposure time is greater than 60 s, and the model performs poorly when the exposure time is shorter than 10 s. An exposure time of 120 s with a wavenumber range of 892–2062 cm−1 was therefore used for data collection. The 120 s exposure time ensures excellent data quality, and the optimized wavenumber range reduces the experimental measurement time while containing sufficient peaks for identification.


Figure 6. Comparison of test scores between data sets with different Raman collection times. 'Training' in this plot denotes a data set with a 120 s exposure time. The blue error bars are the standard deviations of the test scores.


4. Results and discussion

The performance of the trained model is evaluated by the confusion matrix, precision, recall, F1 score, accuracy, and the probability of each prediction.

The confusion matrix of the test set is an essential indicator. It provides detailed information about the performance of the model by tabulating the number of samples for each pair of true and predicted labels. When comparing predicted and actual results, the confusion matrix not only determines the number of wrong estimations, but also exposes the distribution of errors [51]. The results in table 1 indicate that the spectral difference between white matter and blood vessels is sufficient to achieve confident predictions with zero misidentifications. The misidentified results show that the model tends to identify grey matter as white matter.

Table 1. Confusion matrix of the testing set.

True \ Predicted   White matter   Grey matter   Blood vessels
White matter       3145           55            0
Grey matter        220            3265          15
Blood vessels      0              5             1695

The analysis was further refined using the confusion matrix, providing metrics such as precision (the ability of the model to avoid incorrectly identifying a negative sample as positive), recall (the ability to detect all positives), and the F1 score (the harmonic mean of precision and recall). Accuracy assesses overall predictive performance. A true positive (TP) occurs when the model correctly classifies a positive sample. A false positive (FP) arises when a sample is incorrectly labeled as positive. True negatives (TN) are correct identifications of negative samples, while false negatives (FN) occur when positive samples are wrongly predicted as negative. Support represents the data volume for each class in the test set. The macro-average is the unweighted mean across labels, whereas the weighted average takes the class distribution into account. Table 2 shows a slightly higher macro-average due to superior performance on the blood vessel class, whereas the weighted average is lower because it assigns greater weight to the grey and white matter classes.

Table 2. The comparison of the main identification report for gradient boosting algorithm (GB) and random forest algorithm (RF).

               Precision        Recall           F1 score         Support
               RF      GB       RF      GB       RF      GB       RF     GB
White matter   0.928   0.886    0.981   0.983    0.954   0.932    3200   3200
Grey matter    0.982   0.957    0.926   0.880    0.953   0.917    3500   3500
Blood vessels  0.991   0.988    0.999   0.947    0.995   0.967    1700   1700
Accuracy                                         0.962   0.933    8400   8400
Macro avg      0.967   0.943    0.969   0.937    0.967   0.938    8400   8400
Weighted avg   0.963   0.936    0.962   0.933    0.962   0.933    8400   8400
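Tables of this form can be generated with scikit-learn's metrics utilities, as in the following sketch with a handful of hypothetical labels standing in for the held-out test spectra:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true/predicted labels; in the study these come from the
# test-set spectra and the trained classifier.
y_true = ["white", "white", "white", "grey", "grey", "grey", "vessel", "vessel"]
y_pred = ["white", "white", "grey",  "white", "grey", "grey", "vessel", "vessel"]
labels = ["white", "grey", "vessel"]

# Rows are true labels, columns predicted labels, as in table 1.
print(confusion_matrix(y_true, y_pred, labels=labels))

# Per-class precision/recall/F1 with macro and weighted averages, as in table 2.
print(classification_report(y_true, y_pred, labels=labels, digits=3))
```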

The evaluation results in table 2 demonstrate the high accuracy of the RF model in identifying the Raman spectra of grey matter, white matter, and blood vessels. Furthermore, the results show that the preprocessed data contain sufficient and significant information for building a reliable RF classification model. Figure 7 is a schematic of the proposed process, including data preprocessing, data augmentation, training, and evaluation of the two machine learning algorithms.


Figure 7. A schematic of the proposed process.


The above analysis is based on the output given by the model; however, it is also important to understand the confidence level of the model when making predictions. The predicted probabilities for one type of tissue are the mean predicted probabilities of the trees in the forest for that type of tissue [44]. They are calculated by averaging the class frequencies at the leaf nodes where the input data points land; the criterion used to calculate the probability in RF is the normalized class histogram of the leaf nodes [52]. The probabilities sum to 1 for each spectrum, allowing them to serve as an indication of the model's confidence in its predictions. Table 3 provides the predicted probabilities of tissue types for part of the test set as an example. The values are the probabilities of the data being identified as each type of tissue, and the class with the maximum value in each row indicates the model's preference.
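The per-class probabilities of table 3 correspond to scikit-learn's `predict_proba`, sketched here on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed spectra, split 8:2 as in the text.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=12, max_depth=12, random_state=0)
rf.fit(X_tr, y_tr)

# One row per spectrum, one column per class; each row sums to 1,
# mirroring the examples in table 3.
proba = rf.predict_proba(X_te)
```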

Table 3. Examples of class probabilities predicted for the test set.

            White matter   Grey matter   Blood vessels
Example 1   0.121          0.794         0.085
Example 2   0.169          0.830         0.001
Example 3   0.825          0.175         0
Example 4   0.084          0.083         0.833
Example 5   0              0.083         0.917

Data augmentation is required to generate adequate data with sufficient diversity. When the noise level is increased, the performance of the model decreases, and a similar result was found for shifting. Table 4 compares the results with and without data augmentation. The numbers one to ten are the indices of the training runs, and the values beneath them are the accuracies of the model; Mean is the mean of these ten accuracies and STD their standard deviation. The results show that applying data augmentation to the training set produces test scores that are better by more than 3$\sigma $ than those obtained without augmentation. Furthermore, the test score becomes less stable in the absence of data augmentation, as evidenced by a standard deviation of 0.023, ∼8 times higher than that of the augmented data. The model performs worse without data augmentation because the raw training set is too small and lacks diversity compared to the augmented training data. Data augmentation can also improve the model's generalization performance by increasing its robustness against variations and noise in the data. Therefore, applying data augmentation can improve and stabilize model performance.

Table 4. Comparison of the results between with and without data augmentation.

         1      2      3      4      5      6      7      8      9      10     Mean   STD
With     0.965  0.966  0.959  0.963  0.965  0.959  0.961  0.956  0.962  0.964  0.962  0.003
Without  0.864  0.845  0.829  0.836  0.850  0.793  0.827  0.836  0.830  0.886  0.840  0.023

Subsets (25%, 50%, and 75%) of the dataset were randomly selected for training to test the performance with limited data. The scores were 0.958 (±0.005), 0.960 (±0.003), and 0.961 (±0.002), respectively. Although our model displays excellent performance with less data, maximizing the training data remains important, especially in real-world situations, where other factors reduce the signal-to-noise ratio, for example, changes in optical alignment.

5. Summary

This work implements a novel pipeline to identify white matter, grey matter, and blood vessels by combining Raman spectroscopy and RF. The results have validated that this method can be used to build a reliable model with reasonable performance for this classification task.

However, analysis of the incorrectly predicted data in the test set shows that the model is sensitive to spectral shifting. This may be because the model relies on individual pixels for identification. Future studies may therefore consider using a convolutional model to avoid concentrating the feature importance on single pixels. Furthermore, future experiments may use brain tumors to obtain Raman spectra for further study. In reality, the composition of a sample may be complex and may not consist of a single type of tissue; the main limitation of this study is that RF does not support this case, since the training data were labeled as single tissue types to build the RF model.

Raman spectrometers exhibit a wide cost range, spanning from $4000 (e.g. Thorlabs Raman spectrometer kits) to high-resolution custom systems priced around $100 000, such as the Raman microscopes offered by Renishaw or Horiba, while the cost of medical imaging equipment, such as CT scanners, can reach hundreds of thousands of dollars [53]. In addition, the machine learning-based identification method does not depend on the practitioner. Therefore, it is believed that this work could not only serve as a novel auxiliary diagnostic tool to help the diagnosis and biopsy of brain tumors in the future, but could also benefit hospitals by reducing their operating expenses, especially in remote regions and less developed countries.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://doi.org/10.6086/D1XH5H [54].
