[Paper] [Pre-print] [Code] [BibTeX]
Presented in 2024 at the 21st IEEE International Symposium on Biomedical Imaging (ISBI 2024).
Authors: George Batchkala, Bin Li, Mengran Fan, Mark McCole, Cecilia Brambilla, Fergus Gleeson, Jens Rittscher.
- `DHMC_MetaData_Release_1.0.csv` - downloaded from https://bmirds.github.io/LungCancer/; gives the predominant LUAD pattern
- `tcga_classes_extended_info.csv` - see https://github.com/GeorgeBatch/TCGA-lung-histology-download/
- `tcga_dsmil_test_ids.csv` - see https://github.com/GeorgeBatch/TCGA-lung-histology-download/
- `tcia_cptac_md5sum_hashes.txt` - see https://github.com/GeorgeBatch/TCIA-CPTAC-lung-histology-download
- `tcia_cptac_luad_lusc_cohort.csv` - see https://github.com/GeorgeBatch/TCIA-CPTAC-lung-histology-download
- `tcia_cptac_string_2_ouh_labels.csv` - created by taking the unique values from `tcia_cptac_luad_lusc_cohort.csv` and manually mapping them to labels inspired by OUH (Oxford University Hospitals) reports
Columns include the label (LUAD vs LUSC) and paths to features:
- `features_csv_file_path`
- `h5_file_path`
- `pt_file_path`
```python
mapping = {
    "LUAD": 0,
    "LUSC": 1,
}
```
DHMC has only LUAD slides, so all entries in the label field are 0. TCGA has both LUAD and LUSC slides, so entries in the label field include both 0 and 1.
Run the label-creation notebook. The code will create the files in `labels/experiment-label-files/`.
Note that the combined dataset for training/validation is not the same as in the paper, since the in-house DART dataset is not publicly available. The test set, however, is the same as in the paper and is fully available for both the 8-label and 5-label tasks.
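As a quick sanity check on these files, the hedged sketch below reads one of the generated label files; the file name is hypothetical, while the `label` and `pt_file_path` columns come from the description above.

```python
# Hedged sketch: the file name is hypothetical; the "label" and "pt_file_path"
# columns are the ones described above (0 = LUAD, 1 = LUSC).
import pandas as pd
import torch

df = pd.read_csv("labels/experiment-label-files/example_labels.csv")  # hypothetical name
print(df["label"].value_counts())                  # DHMC: only 0s; TCGA: 0s and 1s
features = torch.load(df.loc[0, "pt_file_path"])   # pre-extracted patch features for one slide
print(features.shape)                              # e.g. [num_patches, feature_dim]
```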
- `a_save_slide_metadata.py`: Saves metadata for all WSIs in a dataset.
  Example: `python a_save_slide_metadata.py --dataset TCGA-lung --slide_format svs`
- `b_create_thumbnails_and_masks.py`: Produces thumbnails and masks for WSIs.
  Example: `python b_create_thumbnails_and_masks.py --dataset TCGA-lung --slide_format svs`
- `c_compute_tiatoolbox_feats.py`: Extracts patch features for WSIs using tiatoolbox.
  Example: `python c_compute_tiatoolbox_feats.py --dataset TCGA-lung --slide_format svs`
- `c_record_masked_positions.py`: Records masked positions for WSIs by comparing feature positions with mask intersections.
  Example: `python c_record_masked_positions.py --dataset TCGA-lung --slide_format svs --min_mask_ratio 0.1`
- `c_record_positions_intersections.py`: Records intersections between slide feature positions and mask intersection positions.
  Example: `python c_record_positions_intersections.py --num_workers 24`
- `d_train_classifier.py`: Trains a MIL classifier on patch features.
  - Example 1 (as in the paper):
    `python d_train_classifier.py --base_config_path ./configs/base_config.yaml --config_path ./configs/combined-configs-sota/simclr-tcga-lung_resnet18-10x_COMBINED-ALL-8-dsmil-wo_subsampling.yaml`
  - Example 2 (subsampling patches):
    `python d_train_classifier.py --base_config_path ./configs/base_config.yaml --config_path ./configs/combined-configs-sota/simclr-tcga-lung_resnet18-10x_COMBINED-ALL-8-dsmil_config.yaml`
  - Example 3 (mixed supervision; uses in-house data):
    `python d_train_classifier.py --base_config_path ./configs/base_config.yaml --config_path ./configs/combined-configs-mixed-supervision/simclr-tcga-lung_resnet18-10x_COMBINED-ALL-8-dsmil_config.yaml`
- `e_compute_gigapath_slide_level_feats.py`: Computes slide-level embeddings using the Prov-GigaPath model.
  Example: `python e_compute_gigapath_slide_level_feats.py --embedding_data_dir datasets/TCGA-lung/features/prov-gigapath/imagenet/patch_224_0.5_mpp`
- `e_compute_prism_slide_caption_similarities.py`: Computes slide-level caption similarities using the PRISM model.
  Example: `python e_compute_prism_slide_caption_similarities.py --embedding_data_dir datasets/TCGA-lung/features/VirchowFeatureExtractor_v1_concat/imagenet/patch_224_0.5_mpp`
- `e_compute_prism_slide_level_feats.py`: Computes slide-level embeddings using the PRISM model.
  Example: `python e_compute_prism_slide_level_feats.py --embedding_data_dir datasets/TCGA-lung/features/VirchowFeatureExtractor_v1_concat/imagenet/patch_224_0.5_mpp`
- `f_train_linear_probing_classifier.py`: Trains a linear-probing classifier on slide features.
  Example: `python f_train_linear_probing_classifier.py --base_config_path ./configs/base_config.yaml --config_path ./configs/combined-configs-slide-linear-probing/PRISM_COMBINED-ALL-8-linear_config.yaml`
The data loading pipeline is implemented using custom PyTorch Datasets and PyTorch Lightning DataModules (a simplified sketch follows the list). Specifically:
- Datasets: `source.data.dataset_detailed` (`LungSubtypingDataset` and `LungSubtypingSlideEmbeddingDataset`) load precomputed features, positional data, and label masks from `.pt` and `.npy` files. They also perform on-the-fly subsampling and compute weight masks for instances with unknown labels.
- DataModules: `source.data.datamodule_detailed` (`LungSubtypingDM` and `LungSubtypingSlideEmbeddingDM`) handle the creation and splitting of datasets based on patient IDs, reading CSV descriptions that reference pre-extracted patch features.
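The snippet below is a simplified stand-in, not the repository's actual classes; it only illustrates the behaviour described above (loading `.pt` features and `.npy` positions, building a weight mask for instances with unknown labels, and optional on-the-fly subsampling). All dictionary keys and constructor arguments are assumptions.

```python
# Simplified stand-in for illustration only; keys and arguments are assumptions.
import numpy as np
import torch
from torch.utils.data import Dataset

class ToySlideDataset(Dataset):
    def __init__(self, rows, num_patches=None):
        self.rows = rows                  # list of dicts with feature/label paths per slide
        self.num_patches = num_patches    # optional on-the-fly subsampling size

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        feats = torch.load(row["pt_file_path"])                           # [N, feat_dim]
        positions = torch.from_numpy(np.load(row["positions_npy_path"]))  # [N, 2] patch coords
        patch_labels = np.load(row["patch_labels_npy_path"])              # hypothetical; -1 = unknown
        weights = torch.from_numpy(patch_labels >= 0).float()             # 0-weight for unknown labels
        if self.num_patches is not None and feats.shape[0] > self.num_patches:
            keep = torch.randperm(feats.shape[0])[: self.num_patches]
            feats, positions, weights = feats[keep], positions[keep], weights[keep]
        return feats, positions, weights, row["label"]
```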
In addition to the patch feature extractors readily available through TIAToolbox (UNI, Prov-GigaPath, H-Optimus-0), this repository provides a range of feature-extraction models that can be used as plug-in models for TIAToolbox via the `get_feature_extractor_model` function from `source.feature_extraction.get_model` (used in `c_compute_tiatoolbox_feats.py`; a usage sketch follows the list below).
The added feature extractors include:
- ResNet-based extractors:
  - The CLAM-inspired extractor (`resnet50_baseline`) adapts a truncated ResNet50 pre-trained on ImageNet.
  - The DSMIL variant (via `get_resnet18_dsmil`) leverages SimCLR pre-training to extract rich features from whole slide images.
- Transformer-based extractors:
  - `PhikonFeatureExtractor` with versions `v1` and `v2` (trained on more data)
  - `HibouFeatureExtractor` with versions `b` and `L`
  - `VirchowFeatureExtractor` with versions `v1` (without register tokens) and `v2` (with DINOv2 register tokens)
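A minimal usage sketch, under the assumption that `get_feature_extractor_model` accepts an extractor-name string similar to the ones listed above and returns a standard PyTorch module:

```python
# Hedged sketch: the extractor-name string and the callable interface are assumptions;
# only the import path comes from the description above.
import torch
from source.feature_extraction.get_model import get_feature_extractor_model

model = get_feature_extractor_model("phikon_v2")   # hypothetical extractor name
model.eval()
patches = torch.rand(8, 3, 224, 224)               # dummy batch of 224x224 RGB patches
with torch.no_grad():
    feats = model(patches)                         # expected: [8, feature_dim] embeddings
print(feats.shape)
```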
The Dependency-MIL model can be created with the `get_model()` function from `source.feature_aggregation.models.combined_model`. It uses the following components (a conceptual sketch follows the list):
- Instance Embedders: `IdentityEmbedder`, `AdaptiveAvgPoolingEmbedder`, `LinearEmbedder`, `SliceEmbedder` from `source.feature_aggregation.instance_embedders`
- Bag Aggregators: `AbmilBagClassifier`, `DsmilBagClassifier` from `source.feature_aggregation.combined_model`
- Class Connectors: `BahdanauSelfAttention`, `TransformerSelfAttention` from `source.feature_aggregation.class_connectors`
- Classifier Heads: `LinearClassifier`, `DSConvClassifier`, `CommunicatingConvClassifier` from `source.feature_aggregation.classifier_heads`
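For intuition only, here is a simplified conceptual stand-in, not the repository's `get_model()`, showing how the four component types fit together in one forward pass; all layer choices and dimensions are placeholders.

```python
# Conceptual stand-in: embedder -> per-class attention pooling -> class connector -> head.
# Layer choices and dimensions are placeholders, not the repository's implementation.
import torch
import torch.nn as nn

class ToyDependencyMIL(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=256, num_classes=8):
        super().__init__()
        self.embedder = nn.Linear(feat_dim, embed_dim)        # instance embedder
        self.attention = nn.Linear(embed_dim, num_classes)    # per-class attention scores (bag aggregator)
        self.class_connector = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(embed_dim, 1)                   # classifier head: one logit per class

    def forward(self, patch_feats):                     # patch_feats: [N, feat_dim]
        h = self.embedder(patch_feats)                  # [N, embed_dim]
        attn = torch.softmax(self.attention(h), dim=0)  # [N, num_classes], per-class weights over patches
        class_embeds = attn.T @ h                       # [num_classes, embed_dim] class-wise bag embeddings
        connected, _ = self.class_connector(            # model dependencies between class embeddings
            class_embeds[None], class_embeds[None], class_embeds[None]
        )
        return self.head(connected[0]).squeeze(-1)      # [num_classes] bag-level logits

logits = ToyDependencyMIL()(torch.rand(1000, 512))      # 1000 patches -> 8 class logits
```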
The custom loss function in source.losses computes a weighted loss across a list of input logits. Depending on the setting, it uses either binary cross entropy with logits (multiplying the loss elementwise with a provided weight tensor) or standard cross entropy (which requires all weights to be positive). For each valid input (i.e. non-None), the loss is summed and then normalized by the product of the number of valid inputs and the total sum of weights.
The same loss function is used for both the instance-level (when patch labels are available) and bag-level (on slide labels) losses in the Dependency-MIL model.
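A hedged sketch of the behaviour described above; the function name and argument names are assumptions, while the two branches and the normalization follow the description:

```python
# Hedged sketch of the weighted multi-logit loss described above (names are assumptions).
import torch
import torch.nn.functional as F

def weighted_multi_logit_loss(logits_list, targets, weights, use_bce=True):
    """Sum the loss over every non-None entry in logits_list, then normalize by
    (number of valid inputs) * (total sum of weights)."""
    valid = [lg for lg in logits_list if lg is not None]
    if not valid:
        return torch.zeros((), device=targets.device)
    total = torch.zeros((), device=targets.device)
    for logits in valid:
        if use_bce:
            # binary cross entropy with logits, weighted elementwise
            per_elem = F.binary_cross_entropy_with_logits(
                logits, targets.float(), reduction="none"
            )
            total = total + (per_elem * weights).sum()
        else:
            # standard cross entropy; only valid when all weights are positive
            total = total + F.cross_entropy(logits, targets, reduction="sum")
    return total / (len(valid) * weights.sum())
```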
The classifier heads that use this loss (e.g. `LinearClassifier`, `DSConvClassifier`, and `CommunicatingConvClassifier`) are defined in `source.feature_aggregation.classifier_heads`.
George Batchkala is supported by Fergus Gleeson and the EPSRC Center for Doctoral Training in Health Data Science (EP/S02428X/1). The work was done as part of the DART Lung Health Program (UKRI grant 40255).
The computational aspects of this research were supported by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z and the NIHR Oxford BRC. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.
If you find Dependency-MIL useful for your research and applications, please cite it using this BibTeX:
@INPROCEEDINGS{batchkala2024dependency-mil,
author={Batchkala, George and Li, Bin and Fan, Mengran and McCole, Mark and Brambilla, Cecilia and Gleeson, Fergus and Rittscher, Jens},
booktitle={2024 IEEE International Symposium on Biomedical Imaging (ISBI)},
title={Accurate Subtyping of Lung Cancers by Modelling Class Dependencies},
year={2024},
volume={},
number={},
pages={1-5},
keywords={Accuracy;Convolution;Annotations;Histopathology;Lung cancer;Lung;Predictive models;lung cancer;computational pathology;multi-label classification;multiple-instance learning},
doi={10.1109/ISBI56570.2024.10635232}
}