MMIST ccRCC Dataset

Clear cell renal cell carcinoma (ccRCC) is the most common type of kidney cancer, accounting for up to 80% of all renal cell carcinoma cases in adults. Estimating prognosis is critical for patient management, but it remains a very challenging task. Research on this topic has produced two public studies, CPTAC-CCRCC and TCGA-KIRC, from which we curated MMIST-ccRCC.

Number of Modalities Across the Dataset

Split    Patients   CT    MRI   WSI   Genomics   Clinical
Train    497        189   35    497   361        497
Test     121        59    13    121   101        121
Total    618        248   48    618   462        618
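
The table shows that not every patient has every modality. As a small worked example, the sketch below recomputes the totals and the share of patients covered by each modality, using only the counts listed above. The function name `coverage` is ours and is purely illustrative.

```python
# Counts copied from the table above: (train, test) per column.
counts = {
    "Patients": (497, 121),
    "CT": (189, 59),
    "MRI": (35, 13),
    "WSI": (497, 121),
    "Genomics": (361, 101),
    "Clinical": (497, 121),
}


def coverage(counts):
    """Return {modality: (total, fraction of all patients)}."""
    n_patients = sum(counts["Patients"])
    return {
        name: (train + test, (train + test) / n_patients)
        for name, (train, test) in counts.items()
    }


for name, (total, fraction) in coverage(counts).items():
    print(f"{name:9s} {total:4d} {fraction:6.1%}")
```

CT is available for roughly 40% of the patients and MRI for about 8%, so any multi-modal model built on this dataset has to cope with missing modalities.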

Patients' Clinical and Genomic Data

Here you can access the metadata with the train/test split (from CPTAC and TCGA repositories).

Download the Multi-Modal Data

Download WSIs

Below are the links to the original datasets from TCGA (XXGB) and CPTAC ('Tissue Slide Images', 190GB), together with the CSV listing the files we selected. For TCGA, we provide a manifest file that can be used directly with the GDC Data Transfer Tool.
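
For readers who prefer to script the TCGA download, here is a minimal sketch that wraps the GDC Data Transfer Tool command-line client (`gdc-client download -m <manifest> -d <directory>`). It assumes `gdc-client` is installed and on your PATH. The manifest and output paths in the commented usage line are hypothetical placeholders, not file names from this repository.

```python
import shutil
import subprocess
from pathlib import Path


def gdc_download_command(manifest, out_dir):
    """Build the gdc-client invocation for a manifest file."""
    return ["gdc-client", "download", "-m", str(manifest), "-d", str(out_dir)]


def download_tcga_slides(manifest, out_dir):
    """Run gdc-client if it is installed; raise a clear error otherwise."""
    if shutil.which("gdc-client") is None:
        raise RuntimeError("gdc-client was not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(gdc_download_command(manifest, out_dir), check=True)


# Hypothetical usage; substitute the manifest you downloaded:
# download_tcga_slides("tcga_wsi_manifest.txt", "data/tcga_wsi")
```

Keeping the command construction separate from the call makes it easy to inspect the exact command before starting a large download.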

Download the CT and MRI images

Below you can access the CT and MRI images of the same patients from the TCGA ('Images', 91.56GB) and CPTAC ('Radiology Images', 56.58GB) repositories, as well as the CSV with the filenames and IDs that we used in our repository.

We discarded the localization scans, the pre-contrast scans, and scans acquired with a significant time lapse (years) from the diagnosis. We also removed the coronal and sagittal views to minimize domain shift. After this filtering, 736 CT and 552 MRI volumes remained.
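
The exclusion rules above can be sketched as a simple filter over per-series metadata. This is only an illustration under stated assumptions, not the procedure we actually ran: the record keys (`is_localizer`, `has_contrast`, `days_from_diagnosis`, `iop`) and the 365-day cutoff are hypothetical. The view is inferred from the DICOM Image Orientation (Patient) attribute (0020,0037), which is one common way to separate axial from coronal and sagittal series.

```python
def slice_normal(iop):
    """Cross product of the row and column direction cosines (DICOM tag 0020,0037)."""
    rx, ry, rz, cx, cy, cz = iop
    return (ry * cz - rz * cy, rz * cx - rx * cz, rx * cy - ry * cx)


def view_of(iop):
    """Classify a series by the dominant axis of its slice normal."""
    normal = [abs(v) for v in slice_normal(iop)]
    # Patient axes: x (left-right), y (front-back), z (feet-head).
    return ("sagittal", "coronal", "axial")[normal.index(max(normal))]


def keep_series(series, max_days=365):
    """Apply the exclusion rules to one series record (hypothetical keys)."""
    return (
        not series["is_localizer"]
        and series["has_contrast"]
        and view_of(series["iop"]) == "axial"
        and abs(series["days_from_diagnosis"]) <= max_days  # assumed cutoff
    )


example = {
    "is_localizer": False,
    "has_contrast": True,
    "days_from_diagnosis": 12,
    "iop": [1, 0, 0, 0, 1, 0],  # standard axial orientation
}
print(keep_series(example))
```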

Use Our Features

If you don't want to download the original images, you can use our features.

Access the CT/MRI/WSI Features

Here you can access the different folders containing the multi-modal features:

Per Patient Files (Selected Using MIL)

Several patients have more than one CT scan, MRI scan, or WSI. We opted to reduce the amount of data used by our multi-modal system to a single CT, MRI, and WSI per patient. To do so, we implemented a novel patient-level MIL framework that automatically selects the best scan or slide of each modality from the available pool. Here you can access the different CSV files containing the selected files per patient that were used in our paper:
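
To illustrate the idea of keeping one file per patient and modality, here is a minimal sketch of the final selection step. It assumes that each candidate file has already been given a score by some model and simply keeps the highest-scoring one. How the scores are produced is the job of the MIL framework and is not shown. The function name, tuple layout, and file names are hypothetical.

```python
def select_best_per_patient(scored_files):
    """Keep the highest-scoring file for each patient.

    scored_files: iterable of (patient_id, filename, score) tuples.
    Returns {patient_id: filename}.
    """
    best = {}
    for patient_id, filename, score in scored_files:
        if patient_id not in best or score > best[patient_id][1]:
            best[patient_id] = (filename, score)
    return {pid: fname for pid, (fname, _) in best.items()}


# Hypothetical scores for two patients with several CT volumes each.
candidates = [
    ("P001", "ct_a.nii", 0.41),
    ("P001", "ct_b.nii", 0.77),
    ("P002", "ct_c.nii", 0.52),
    ("P002", "ct_d.nii", 0.18),
]
print(select_best_per_patient(candidates))
```

Running the same selection once per modality gives a single CT, MRI, and WSI entry for each patient that has that modality.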

Model Hyperparameter Settings

The following tables provide a detailed summary of the hyperparameters and configurations used for various model architectures in our study. Each table specifies the parameters applied for different data types and model configurations.

MIL Models

Table A.1 summarizes the hyperparameters of the Multiple Instance Learning (MIL) models for CT, MRI, and pathology data. It lists the number of epochs, architecture, hidden sizes, learning rate and scheduler settings, optimizer, and oversampling factor for each modality.

Table A.1: MIL Models Hyperparameters
Models         MIL CT                         MIL MRI                        MIL Pathology
Epochs         60                             60                             100
Architecture   3 FC                           3 FC                           4 FC
Hidden Sizes   [256, 128]                     [256, 128]                     [512, 256, 128]
Initial LR     1e-3                           1e-3                           1e-3
LR Scheduler   Step                           Step                           Step
LR Settings    Step size = 30, Gamma = 1e-2   Step size = 30, Gamma = 1e-2   Step size = 30, Gamma = 1e-2
Optimizer      SGD                            Adam                           AdamW
Oversample     8x                             16x                            8x
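
To make the step schedule concrete, the sketch below computes the learning rate at a given epoch from the values in Table A.1. It assumes the usual step-decay rule, in which the learning rate is multiplied by gamma once every `step_size` epochs (the behavior of PyTorch's `StepLR`, for example). The table does not state which implementation was used.

```python
def step_lr(epoch, initial_lr=1e-3, step_size=30, gamma=1e-2):
    """Learning rate at a 0-indexed epoch under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


# Print the rate at the first epoch and at each epoch where it drops.
for epoch in (0, 30, 60, 90):
    print(epoch, step_lr(epoch))
```

Under this assumption, the 60-epoch CT and MRI runs train at 1e-3 for the first 30 epochs and at 1e-5 afterwards, while the 100-epoch pathology run drops further to 1e-7 at epoch 60 and to 1e-9 at epoch 90.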

Base Models

Table A.2 provides the hyperparameter settings for the base models across modalities and configurations: CT, MRI, Pathology, and Clingen (clinical and genomic data), as well as the multi-modal approaches Weighted Sum, Learn Weights, Mean, and Concat. For each model, the table lists the optimizer, learning rate schedule, batch size, and other settings.

Table A.2: Base Models Hyperparameters
Models         Base CT   Base MRI   Base Path   Base Clingen   Weighted Sum   Learn Weights   Mean      Concat
Epochs         60        60         60          60             60             60              120       120
Architecture   3L        3L         3L          3L             3L             3L              5L        5L
Hidden Sizes   128       128        128         128            128            128             128       128
Initial LR     1e-3      1e-3       1e-4        1e-3           1e-3           1e-3            1e-3      1e-3
LR Scheduler   Cosine    Cosine     Const       Const          Const          Cosine          Cosine    Cosine
LR Settings    Default   Default    None        None           None           Default         Default   Default
Optimizer      AdamW     SGD        SGD         AdamW          AdamW          AdamW           AdamW     SGD
Oversample     None      None       6x          6x             6x             6x              6x        6x
Batch Size     14        14         14          14             14             14              14        14
Batch Norm     No        No         No          No             Yes            Yes             Yes       Yes
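
The multi-modal column names in Table A.2 refer to different ways of combining per-modality features. The sketch below shows one plausible reading of three of them on plain Python lists. It is an assumption for illustration only: the exact fusion layers, feature dimensions, and handling of missing modalities in our models are not described here. In this reading, "Learn Weights" would correspond to `fuse_weighted_sum` with weights learned during training rather than fixed.

```python
def fuse_mean(features):
    """Element-wise mean of equally sized feature vectors."""
    n = len(features)
    return [sum(values) / n for values in zip(*features)]


def fuse_weighted_sum(features, weights):
    """Element-wise weighted sum; one weight per modality."""
    return [
        sum(w * v for w, v in zip(weights, values))
        for values in zip(*features)
    ]


def fuse_concat(features):
    """Concatenate the feature vectors into one longer vector."""
    return [v for vector in features for v in vector]


# Toy two-dimensional features for two modalities (hypothetical values).
ct_feat, wsi_feat = [1.0, 2.0], [3.0, 4.0]
print(fuse_mean([ct_feat, wsi_feat]))
print(fuse_weighted_sum([ct_feat, wsi_feat], [0.25, 0.75]))
print(fuse_concat([ct_feat, wsi_feat]))
```

Mean and weighted-sum fusion keep the feature size fixed regardless of how many modalities are present, whereas concatenation grows with the number of modalities.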

Reconstruction Encoder-Decoder Model

Table A.3 outlines the hyperparameter settings for the reconstruction encoder-decoder model, including the number of epochs, learning rate, optimizer, and other relevant settings.

Table A.3: Encoder-Decoder Model Hyperparameters
Model                  Encoder-Decoder
Epochs                 600
Architecture Encoder   2 FC
Architecture Decoder   2 FC
Hidden Sizes           128
Initial LR             1e-3
LR Scheduler           Cosine
Optimizer              AdamW
Oversample             6x
Batch Size             14
Batch Normalization    No

Scientific Paper

If you want more information about this dataset, feel free to read our paper:

Tiago Mota, Maria Rita Fonseca Verdelho, Diogo José Pereira Araújo, Alceu Bissoto, Carlos Santiago, and Catarina Barata. MMIST-ccRCC: A Real-World Medical Dataset for the Development of Multi-Modal Systems. Data Curation and Augmentation in Enhancing Medical Imaging Applications Workshop (archival), CVPR 2024.

How to cite us?

Get in Touch at