[Paper] [Pre-print] [Code] [BibTeX]
Presented in 2024 at the 21st IEEE International Symposium on Biomedical Imaging (ISBI 2024).
Authors: George Batchkala, Bin Li, Mengran Fan, Mark McCole, Cecilia Brambilla, Fergus Gleeson, Jens Rittscher.
- `DHMC_MetaData_Release_1.0.csv` - downloaded from https://bmirds.github.io/LungCancer/; gives the predominant LUAD pattern
- `tcga_classes_extended_info.csv` - see https://github.com/GeorgeBatch/TCGA-lung-histology-download/
- `tcga_dsmil_test_ids.csv` - see https://github.com/GeorgeBatch/TCGA-lung-histology-download/
- `tcia_cptac_md5sum_hashes.txt` - see https://github.com/GeorgeBatch/TCIA-CPTAC-lung-histology-download
- `tcia_cptac_luad_lusc_cohort.csv` - see https://github.com/GeorgeBatch/TCIA-CPTAC-lung-histology-download
- `tcia_cptac_string_2_ouh_labels.csv` - created by taking the unique values from `tcia_cptac_luad_lusc_cohort.csv` and manually mapping them to labels inspired by OUH (Oxford University Hospitals) reports
Columns include the label (LUAD vs LUSC) and paths to features:
- `features_csv_file_path`
- `h5_file_path`
- `pt_file_path`
```python
mapping = {
    "LUAD": 0,
    "LUSC": 1,
}
```
DHMC has only LUAD slides, so all entries in the label field are 0. TCGA has both LUAD and LUSC slides, so entries in the label field include both 0 and 1.
Run the label-creation notebook. The code will create the files in `labels/experiment-label-files/`.
Note that the combined dataset for training/validation is not the same as in the paper, since the in-house DART dataset is not publicly available. The test set, however, is the same as in the paper and is fully available for both the 8-label and 5-label tasks.
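As a quick sanity check on these files, the hedged sketch below reads one of the generated label files; the file name is hypothetical, while the `label` and `pt_file_path` columns come from the description above.

```python
# Hedged sketch: the file name is hypothetical; the "label" and "pt_file_path"
# columns are the ones described above (0 = LUAD, 1 = LUSC).
import pandas as pd
import torch

df = pd.read_csv("labels/experiment-label-files/example_labels.csv")  # hypothetical name
print(df["label"].value_counts())                  # DHMC: only 0s; TCGA: 0s and 1s
features = torch.load(df.loc[0, "pt_file_path"])   # pre-extracted patch features for one slide
print(features.shape)                              # e.g. [num_patches, feature_dim]
```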
- `a_save_slide_metadata.py`: Saves metadata for all WSIs in a dataset.
  Example: `python a_save_slide_metadata.py --dataset TCGA-lung --slide_format svs`
- `b_create_thumbnails_and_masks.py`: Produces thumbnails and masks for WSIs.
  Example: `python b_create_thumbnails_and_masks.py --dataset TCGA-lung --slide_format svs`
- `c_compute_tiatoolbox_feats.py`: Extracts patch features for WSIs using tiatoolbox.
  Example: `python c_compute_tiatoolbox_feats.py --dataset TCGA-lung --slide_format svs`
- `c_record_masked_positions.py`: Records masked positions for WSIs by comparing feature positions with mask intersections.
  Example: `python c_record_masked_positions.py --dataset TCGA-lung --slide_format svs --min_mask_ratio 0.1`
- `c_record_positions_intersections.py`: Records intersections between slide feature positions and mask intersection positions.
  Example: `python c_record_positions_intersections.py --num_workers 24`
- `d_train_classifier.py`: Trains a MIL classifier on patch features.
  - Example 1 (as in the paper):
    `python d_train_classifier.py --base_config_path ./configs/base_config.yaml --config_path ./configs/combined-configs-sota/simclr-tcga-lung_resnet18-10x_COMBINED-ALL-8-dsmil-wo_subsampling.yaml`
  - Example 2 (subsampling patches):
    `python d_train_classifier.py --base_config_path ./configs/base_config.yaml --config_path ./configs/combined-configs-sota/simclr-tcga-lung_resnet18-10x_COMBINED-ALL-8-dsmil_config.yaml`
  - Example 3 (mixed supervision; uses in-house data):
    `python d_train_classifier.py --base_config_path ./configs/base_config.yaml --config_path ./configs/combined-configs-mixed-supervision/simclr-tcga-lung_resnet18-10x_COMBINED-ALL-8-dsmil_config.yaml`
- `e_compute_gigapath_slide_level_feats.py`: Computes slide-level embeddings using the Prov-GigaPath model.
  Example: `python e_compute_gigapath_slide_level_feats.py --embedding_data_dir datasets/TCGA-lung/features/prov-gigapath/imagenet/patch_224_0.5_mpp`
- `e_compute_prism_slide_caption_similarities.py`: Computes slide-level caption similarities using the PRISM model.
  Example: `python e_compute_prism_slide_caption_similarities.py --embedding_data_dir datasets/TCGA-lung/features/VirchowFeatureExtractor_v1_concat/imagenet/patch_224_0.5_mpp`
- `e_compute_prism_slide_level_feats.py`: Computes slide-level embeddings using the PRISM model.
  Example: `python e_compute_prism_slide_level_feats.py --embedding_data_dir datasets/TCGA-lung/features/VirchowFeatureExtractor_v1_concat/imagenet/patch_224_0.5_mpp`
- `f_train_linear_probing_classifier.py`: Trains a linear-probing classifier on slide features.
  Example: `python f_train_linear_probing_classifier.py --base_config_path ./configs/base_config.yaml --config_path ./configs/combined-configs-slide-linear-probing/PRISM_COMBINED-ALL-8-linear_config.yaml`
The data loading pipeline is implemented using custom PyTorch Datasets and PyTorch Lightning DataModules (a simplified sketch follows the list). Specifically:
- Datasets: `source.data.dataset_detailed` (`LungSubtypingDataset` and `LungSubtypingSlideEmbeddingDataset`) load precomputed features, positional data, and label masks from `.pt` and `.npy` files. They also perform on-the-fly subsampling and compute weight masks for instances with unknown labels.
- DataModules: `source.data.datamodule_detailed` (`LungSubtypingDM` and `LungSubtypingSlideEmbeddingDM`) handle the creation and splitting of datasets based on patient IDs, reading CSV descriptions that reference pre-extracted patch features.
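The snippet below is a simplified stand-in, not the repository's actual classes; it only illustrates the behaviour described above (loading `.pt` features and `.npy` positions, building a weight mask for instances with unknown labels, and optional on-the-fly subsampling). All dictionary keys and constructor arguments are assumptions.

```python
# Simplified stand-in for illustration only; keys and arguments are assumptions.
import numpy as np
import torch
from torch.utils.data import Dataset

class ToySlideDataset(Dataset):
    def __init__(self, rows, num_patches=None):
        self.rows = rows                  # list of dicts with feature/label paths per slide
        self.num_patches = num_patches    # optional on-the-fly subsampling size

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        feats = torch.load(row["pt_file_path"])                           # [N, feat_dim]
        positions = torch.from_numpy(np.load(row["positions_npy_path"]))  # [N, 2] patch coords
        patch_labels = np.load(row["patch_labels_npy_path"])              # hypothetical; -1 = unknown
        weights = torch.from_numpy(patch_labels >= 0).float()             # 0-weight for unknown labels
        if self.num_patches is not None and feats.shape[0] > self.num_patches:
            keep = torch.randperm(feats.shape[0])[: self.num_patches]
            feats, positions, weights = feats[keep], positions[keep], weights[keep]
        return feats, positions, weights, row["label"]
```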
In addition to the patch feature extractors readily available through TIAToolbox (UNI, Prov-GigaPath, H-Optimus-0), this repository provides a range of feature-extraction models that can be used as plug-in models for TIAToolbox via the `get_feature_extractor_model` function from `source.feature_extraction.get_model` (used in `c_compute_tiatoolbox_feats.py`; a usage sketch follows the list below).
The added feature extractors include:
- ResNet-based extractors:
  - The CLAM-inspired extractor (`resnet50_baseline`) adapts a truncated ResNet50 pre-trained on ImageNet.
  - The DSMIL variant (via `get_resnet18_dsmil`) leverages SimCLR pre-training to extract rich features from whole slide images.
- Transformer-based extractors:
  - `PhikonFeatureExtractor` with versions `v1` and `v2` (trained on more data)
  - `HibouFeatureExtractor` with versions `b` and `L`
  - `VirchowFeatureExtractor` with versions `v1` (without register tokens) and `v2` (with DINOv2 register tokens)
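A minimal usage sketch, under the assumption that `get_feature_extractor_model` accepts an extractor-name string similar to the ones listed above and returns a standard PyTorch module:

```python
# Hedged sketch: the extractor-name string and the callable interface are assumptions;
# only the import path comes from the description above.
import torch
from source.feature_extraction.get_model import get_feature_extractor_model

model = get_feature_extractor_model("phikon_v2")   # hypothetical extractor name
model.eval()
patches = torch.rand(8, 3, 224, 224)               # dummy batch of 224x224 RGB patches
with torch.no_grad():
    feats = model(patches)                         # expected: [8, feature_dim] embeddings
print(feats.shape)
```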
The Dependency-MIL model can be created with the `get_model()` function from `source.feature_aggregation.models.combined_model`. It uses the following components (a conceptual sketch follows the list):
- Instance Embedders: `IdentityEmbedder`, `AdaptiveAvgPoolingEmbedder`, `LinearEmbedder`, `SliceEmbedder` from `source.feature_aggregation.instance_embedders`
- Bag Aggregators: `AbmilBagClassifier`, `DsmilBagClassifier` from `source.feature_aggregation.combined_model`
- Class Connectors: `BahdanauSelfAttention`, `TransformerSelfAttention` from `source.feature_aggregation.class_connectors`
- Classifier Heads: `LinearClassifier`, `DSConvClassifier`, `CommunicatingConvClassifier` from `source.feature_aggregation.classifier_heads`
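For intuition only, here is a simplified conceptual stand-in, not the repository's `get_model()`, showing how the four component types fit together in one forward pass; all layer choices and dimensions are placeholders.

```python
# Conceptual stand-in: embedder -> per-class attention pooling -> class connector -> head.
# Layer choices and dimensions are placeholders, not the repository's implementation.
import torch
import torch.nn as nn

class ToyDependencyMIL(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=256, num_classes=8):
        super().__init__()
        self.embedder = nn.Linear(feat_dim, embed_dim)        # instance embedder
        self.attention = nn.Linear(embed_dim, num_classes)    # per-class attention scores (bag aggregator)
        self.class_connector = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(embed_dim, 1)                   # classifier head: one logit per class

    def forward(self, patch_feats):                     # patch_feats: [N, feat_dim]
        h = self.embedder(patch_feats)                  # [N, embed_dim]
        attn = torch.softmax(self.attention(h), dim=0)  # [N, num_classes], per-class weights over patches
        class_embeds = attn.T @ h                       # [num_classes, embed_dim] class-wise bag embeddings
        connected, _ = self.class_connector(            # model dependencies between class embeddings
            class_embeds[None], class_embeds[None], class_embeds[None]
        )
        return self.head(connected[0]).squeeze(-1)      # [num_classes] bag-level logits

logits = ToyDependencyMIL()(torch.rand(1000, 512))      # 1000 patches -> 8 class logits
```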
The custom loss function in source.losses computes a weighted loss across a list of input logits. Depending on the setting, it uses either binary cross entropy with logits (multiplying the loss elementwise with a provided weight tensor) or standard cross entropy (which requires all weights to be positive). For each valid input (i.e. non-None), the loss is summed and then normalized by the product of the number of valid inputs and the total sum of weights.
The same loss function is used for both the instance-level (when patch labels are available) and bag-level (on slide labels) losses in the Dependency-MIL model.
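A hedged sketch of the behaviour described above; the function name and argument names are assumptions, while the two branches and the normalization follow the description:

```python
# Hedged sketch of the weighted multi-logit loss described above (names are assumptions).
import torch
import torch.nn.functional as F

def weighted_multi_logit_loss(logits_list, targets, weights, use_bce=True):
    """Sum the loss over every non-None entry in logits_list, then normalize by
    (number of valid inputs) * (total sum of weights)."""
    valid = [lg for lg in logits_list if lg is not None]
    if not valid:
        return torch.zeros((), device=targets.device)
    total = torch.zeros((), device=targets.device)
    for logits in valid:
        if use_bce:
            # binary cross entropy with logits, weighted elementwise
            per_elem = F.binary_cross_entropy_with_logits(
                logits, targets.float(), reduction="none"
            )
            total = total + (per_elem * weights).sum()
        else:
            # standard cross entropy; only valid when all weights are positive
            total = total + F.cross_entropy(logits, targets, reduction="sum")
    return total / (len(valid) * weights.sum())
```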
The classifier heads that use this loss (e.g. `LinearClassifier`, `DSConvClassifier`, and `CommunicatingConvClassifier`) are defined in `source.feature_aggregation.classifier_heads`.
George Batchkala is supported by Fergus Gleeson and the EPSRC Center for Doctoral Training in Health Data Science (EP/S02428X/1). The work was done as part of the DART Lung Health Program (UKRI grant 40255).
The computational aspects of this research were supported by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z and the NIHR Oxford BRC. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.
If you find Dependency-MIL useful for your research and applications, please cite it using this BibTeX:
@INPROCEEDINGS{batchkala2024dependency-mil,
author={Batchkala, George and Li, Bin and Fan, Mengran and McCole, Mark and Brambilla, Cecilia and Gleeson, Fergus and Rittscher, Jens},
booktitle={2024 IEEE International Symposium on Biomedical Imaging (ISBI)},
title={Accurate Subtyping of Lung Cancers by Modelling Class Dependencies},
year={2024},
volume={},
number={},
pages={1-5},
keywords={Accuracy;Convolution;Annotations;Histopathology;Lung cancer;Lung;Predictive models;lung cancer;computational pathology;multi-label classification;multiple-instance learning},
doi={10.1109/ISBI56570.2024.10635232}
}