Abstract
Foundation models in deep learning are characterized by a single large-scale model trained on vast amounts of data serving as the foundation for various downstream tasks. Foundation models are generally trained using self-supervised learning and excel in reducing the demand for training samples in downstream applications. This is especially important in medicine, where large labelled datasets are often scarce. Here, we developed a foundation model for cancer imaging biomarker discovery by training a convolutional encoder through self-supervised learning using a comprehensive dataset of 11,467 radiographic lesions. The foundation model was evaluated in distinct and clinically relevant applications of cancer imaging-based biomarkers. We found that it facilitated better and more efficient learning of imaging biomarkers and yielded task-specific models that significantly outperformed their conventional supervised and other state-of-the-art pretrained implementations on downstream tasks, especially when training dataset sizes were very limited. Furthermore, the foundation model was more stable to input variations and showed strong associations with underlying biology. Our results demonstrate the tremendous potential of foundation models in discovering new imaging biomarkers that may extend to other clinical use cases and can accelerate the widespread translation of imaging biomarkers into clinical settings.
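For readers who want a concrete picture of the pretraining stage, below is a schematic, self-contained sketch of contrastive (SimCLR-style) self-supervised pretraining on 3D lesion patches. The toy encoder, random "views", batch size, and temperature are illustrative placeholders, not our actual architecture or training configuration; those are specified in the paper and in the repository's configuration files.

```python
# Schematic SimCLR-style contrastive pretraining on 3D lesion patches.
# The toy encoder, random "views", batch size and temperature are
# illustrative placeholders, not the actual foundation-model configuration.
import torch
import torch.nn.functional as F
from torch import nn

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent loss: (z1[i], z2[i]) are embeddings of two augmented views
    of the same lesion; all other samples in the batch act as negatives."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)         # (2N, d)
    sim = z @ z.t() / temperature                              # cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                 # drop self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Stand-ins for the real 3D convolutional trunk and MLP projection head.
encoder = nn.Sequential(
    nn.Conv3d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
)
projector = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(projector.parameters()), lr=1e-3
)

# One optimization step on a batch of two augmented views per lesion
# (random tensors here stand in for augmented CT patches).
view1, view2 = torch.randn(8, 1, 50, 50, 50), torch.randn(8, 1, 50, 50, 50)
loss = nt_xent_loss(projector(encoder(view1)), projector(encoder(view2)))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```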
Publication
Pai, S. et al. Foundation model for cancer imaging biomarkers. Nature Machine Intelligence, 2024.
Use our model quickly and easily through MHub.ai
To ensure ease of use for both academic and clinical research, we offer a complete, containerized, and ready-to-use implementation of our model. Through the MHub.ai platform, we support various input workflows, enabling users to leverage our foundation model regardless of their data format. Additionally, we provide seamless integration with 3D Slicer, making our model readily applicable across diverse research settings.
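If you prefer scripting over the command line, the following is a minimal, hypothetical sketch of launching the MHub container from Python. The image name mhubai/fmcib_radiomics and the /app/data mount points follow MHub's documented conventions, but both are assumptions here; verify the exact image name, input formats, and workflow options on the model's MHub.ai page.

```python
# Minimal sketch: running the MHub container from Python. The image name
# "mhubai/fmcib_radiomics" and the /app/data mount points are assumed from
# MHub's documented conventions; verify them on the model's MHub.ai page.
import subprocess

subprocess.run(
    [
        "docker", "run", "--rm", "-t", "--gpus", "all",
        "-v", "/path/to/input_dicom:/app/data/input_data",   # your CT data
        "-v", "/path/to/output:/app/data/output_data",       # model outputs
        "mhubai/fmcib_radiomics:latest",
    ],
    check=True,
)
```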
Data availability
Most of the datasets used in this study are openly accessible for both training and validation purposes. They can be obtained from the following sources: i) DeepLesion, an extensive collection of RECIST-bookmarked lesions, used both for our pre-training and for use case 1; ii) LUNA16, used for developing our diagnostic imaging biomarker; iii) LUNG1 and iv) RADIO, both used for validating our prognostic imaging biomarker model. The training dataset for our prognostic biomarker model, HarvardRT, is internal and not publicly available. Nonetheless, our foundation model can be publicly accessed, and the results can be reproduced using the accessible test datasets.
Code availability
Our GitHub repo includes the code for (1) data download and preprocessing, from retrieving the data to generating the train-validation-test splits used in our study; (2) replicating the training and inference of the foundation and baseline models across all tasks through easily readable and customizable YAML files (leveraging project-lighter); and (3) reproducing our comprehensive performance validation.
In addition to sharing reproducible code, we provide trained model weights, extracted features, and outcome predictions for all the models used in our study. Most importantly, our foundation model is accessible through a simple pip install, and features can be extracted for your dataset with two lines of code, as sketched below. We also provide a detailed documentation website.
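As a minimal sketch of that workflow (the exact package entry point and CSV schema, e.g., the image-path and lesion coordinate columns, should be taken from the documentation website):

```python
# Install first: pip install foundation-cancer-image-biomarker
# The CSV is assumed to list one lesion per row with an image path and
# seed-point coordinates (exact column names per the documentation).
from fmcib.run import get_features

feature_df = get_features("lesions.csv")  # DataFrame of deep features
feature_df.to_csv("foundation_features.csv", index=False)
```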
Acknowledgements
We acknowledge financial support from the National Institutes of Health (NIH) (H.J.W.L.A., grant nos. NIH-USA U24CA194354, NIH-USA U01CA190234, NIH-USA U01CA209414, NIH-USA R35CA22052 and NIH-USA U54CA274516-01A1), the European Union (European Research Council, H.J.W.L.A., grant no. 866504) and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) (S.B., grant no. 502050303).