MMIST ccRCC Dataset

Clear cell renal cell carcinoma (ccRCC) is the most common type of kidney cancer, accounting for up to 80% of all renal cell carcinoma cases in adults. Estimating prognosis is critical for patient management but remains very challenging. Ongoing research on this topic has produced two public collections, CPTAC-CCRCC and TCGA-KIRC, from which we curated MMIST-ccRCC.

Number of Modalities Across the Dataset

	Patients	CT	MRI	WSI	Genomics	Clinical
Train	497	189	35	497	361	497
Test	121	59	13	121	101	121
Total	618	248	48	618	462	618
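
As a quick worked check of how incomplete the imaging modalities are, the snippet below computes the share of patients covered by each modality from the Total row. It assumes each count is the number of patients who have that modality; the dictionary simply restates the table.

```python
# Counts copied from the "Total" row of the table above.
totals = {"CT": 248, "MRI": 48, "WSI": 618, "Genomics": 462, "Clinical": 618}
n_patients = 618

# Fraction of patients for whom each modality is available
# (assumes each count is a number of patients, not of files).
coverage = {name: count / n_patients for name, count in totals.items()}

for name, frac in coverage.items():
    print(f"{name}: {frac:.1%}")
```

Under that reading, fewer than half of the patients have a CT scan and fewer than one in ten have an MRI, which is why missing modalities matter for this dataset.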

Patients' Clinical and Genomic Data

Here you can access the metadata with the train/test split (from CPTAC and TCGA repositories).

Access the CSV here!
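
A minimal sketch of how the split could be read from such a CSV is shown below. The column names `case_id` and `split`, and the sample rows, are placeholders rather than the real schema, so check the header of the downloaded file before adapting it.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the real metadata file. The column names "case_id" and
# "split" and the IDs below are assumptions made for illustration only.
SAMPLE_CSV = """case_id,split
case-001,train
case-002,test
case-003,train
"""

def split_cases(csv_file):
    """Group case IDs by the value of their split column."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["split"].strip().lower()].append(row["case_id"])
    return dict(groups)

splits = split_cases(io.StringIO(SAMPLE_CSV))
print({name: len(ids) for name, ids in splits.items()})
```

To use it on the real file, pass an open file handle instead of the `io.StringIO` object.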

Download the Multi-Modal Data

Download WSIs

Below are links to the original datasets from TCGA (XXGB) and CPTAC ('Tissue Slide Images', 190GB), together with the CSV listing the files we selected. For TCGA, we provide a manifest file that can be used directly with the GDC Data Transfer Tool.
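
For the TCGA manifest, a download with the GDC Data Transfer Tool might be scripted as follows. This is a sketch: the file name `manifest.txt` and the output folder `wsi_tcga` are placeholders, and it assumes `gdc-client` is installed and on your PATH.

```python
import shlex

def gdc_download_command(manifest_path, out_dir):
    """Build the argument list for a manifest-based gdc-client download."""
    # -m selects the manifest file, -d the directory to download into.
    return ["gdc-client", "download", "-m", manifest_path, "-d", out_dir]

# Placeholder paths; replace with the manifest provided on this page.
cmd = gdc_download_command("manifest.txt", "wsi_tcga")
print(shlex.join(cmd))

# To actually start the download (requires gdc-client on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```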

Download the CT and MRI images

Below you can access the CT and MRI images of the same patients from the TCGA ('Images', 91.56GB) and CPTAC ('Radiology Images', 56.58GB) repositories, as well as the CSV with the filenames and IDs that we used in our repository.

We discarded localization scans, pre-contrast scans, and scans acquired long after the diagnosis (years). To minimize domain shift, we also removed scans from the coronal and sagittal views. This reduced the number of volumes to 736 for CT and 552 for MRI.
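
The exclusion rules above can be sketched as a simple filter. All field names are hypothetical, the one-year cutoff is an assumption (the text only says "years"), and keeping axial scans is inferred from the removal of coronal and sagittal views; this is not the script used to build the dataset.

```python
from dataclasses import dataclass

@dataclass
class Scan:
    # All field names here are hypothetical, not the dataset's schema.
    series_id: str
    is_localizer: bool
    has_contrast: bool
    plane: str                # "axial", "coronal" or "sagittal"
    days_from_diagnosis: int

MAX_DAYS = 365  # assumed cutoff; the text only says "years"

def keep_scan(scan):
    """Apply the four exclusion rules described above."""
    return (
        not scan.is_localizer
        and scan.has_contrast
        and scan.plane == "axial"
        and abs(scan.days_from_diagnosis) <= MAX_DAYS
    )

scans = [
    Scan("s1", False, True, "axial", 10),
    Scan("s2", True, False, "axial", 10),    # localizer
    Scan("s3", False, False, "axial", 5),    # pre-contrast
    Scan("s4", False, True, "coronal", 3),   # non-axial view
    Scan("s5", False, True, "axial", 900),   # too far from diagnosis
]
kept = [s.series_id for s in scans if keep_scan(s)]
print(kept)
```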

Use Our Features

If you don't want to download the original images, you can use our features.

Access the CT/MRI/WSI Features

Here you can access the different folders containing the multi-modal features:

Per Patient Files (Selected Using MIL)

Access the CT/MRI/WSI Features

Several patients had more than one CT scan, MRI scan, or WSI. To reduce the amount of data used by our multi-modal system, we kept a single CT, MRI, and WSI per patient, using a novel patient-level MIL framework that automatically selects the best sample of each modality from the available pool. The CSV files listing the files selected per patient, as used in our paper, are available here.
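
As a rough illustration of the per-patient selection step, the sketch below keeps the highest-scoring file for each patient and modality. It assumes each candidate file already carries a scalar score (for example one produced by a MIL model); how the actual framework scores and selects files is described in the paper, and the tuple layout and file names here are hypothetical.

```python
def select_best_per_patient(scored_files):
    """Keep, for each (patient, modality) pair, the file with the highest score."""
    best = {}
    for patient, modality, filename, score in scored_files:
        key = (patient, modality)
        if key not in best or score > best[key][1]:
            best[key] = (filename, score)
    return {key: filename for key, (filename, _) in best.items()}

# Hypothetical candidates: (patient, modality, filename, score).
candidates = [
    ("p1", "CT", "a.nii", 0.2),
    ("p1", "CT", "b.nii", 0.9),
    ("p1", "WSI", "x.svs", 0.5),
    ("p2", "CT", "c.nii", 0.4),
]
selected = select_best_per_patient(candidates)
print(selected)
```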

Model Hyperparameter Settings

The following tables provide a detailed summary of the hyperparameters and configurations used for various model architectures in our study. Each table specifies the parameters applied for different data types and model configurations.

MIL Models

This table summarizes the hyperparameters for the Multiple Instance Learning (MIL) models, with specific settings for CT, MRI, and Pathology data. For each modality it lists the number of epochs, architecture, hidden sizes, learning rate and its schedule, optimizer, and oversampling factor.

Table A.1: MIL Models Hyperparameters
Models	MIL CT	MIL MRI	MIL Pathology
Epochs	60	60	100
Architecture	3 FC	3 FC	4 FC
Hidden Sizes	[256,128]	[256,128]	[512, 256, 128]
Initial LR	1e-3	1e-3	1e-3
LR Scheduler	Step	Step	Step
LR Settings	Step size = 30, Gamma = 1e-2	Step size = 30, Gamma = 1e-2	Step size = 30, Gamma = 1e-2
Optimizer	SGD	Adam	AdamW
Oversample	8x	16x	8x
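
To make the step schedule in Table A.1 concrete, the sketch below evaluates the usual step-decay rule, where the learning rate is multiplied by gamma once every `step_size` epochs. It assumes the schedule is stepped once per epoch; the function name is illustrative.

```python
import math

def step_lr(initial_lr, epoch, step_size=30, gamma=1e-2):
    """Learning rate at a given epoch under a step-decay schedule."""
    return initial_lr * gamma ** (epoch // step_size)

# Values from Table A.1: initial LR 1e-3, step size 30, gamma 1e-2.
for epoch in (0, 29, 30, 59, 60, 90):
    print(epoch, step_lr(1e-3, epoch))
```

Under this assumption, the 60-epoch CT and MRI models see a single decay (to about 1e-5 at epoch 30), while the 100-epoch Pathology model decays three times.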

Base Models

Table A.2 lists the hyperparameter settings for the base models: the single-modality models (CT, MRI, Pathology, and Clingen, i.e. clinical and genomic data) and the multi-modal fusion approaches (Weighted Sum, Learn Weights, Mean, and Concat). For each model, the table gives the optimizer, learning rate schedule, batch size, and other settings.

Table A.2: Base Models Hyperparameters
Models	Base CT	Base MRI	Base Path	Base Clingen	Weighted Sum	Learn Weights	Mean	Concat
Epochs	60	60	60	60	60	60	120	120
Architecture	3L	3L	3L	3L	3L	3L	5L	5L
Hidden Sizes	128	128	128	128	128	128	128	128
Initial LR	1e-3	1e-3	1e-4	1e-3	1e-3	1e-3	1e-3	1e-3
LR Scheduler	Cosine	Cosine	Const	Const	Const	Cosine	Cosine	Cosine
LR Settings	Default	Default	None	None	None	Default	Default	Default
Optimizer	AdamW	SGD	SGD	AdamW	AdamW	AdamW	AdamW	SGD
Oversample	None	None	6x	6x	6x	6x	6x	6x
Batch Size	14	14	14	14	14	14	14	14
Batch Norm	No	No	No	No	Yes	Yes	Yes	Yes
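
For the models in Table A.2 that use a cosine schedule, the learning rate follows the standard cosine annealing curve. The sketch below assumes that "Default" means a minimum learning rate of 0 and a period equal to the number of training epochs; neither value is stated in the table.

```python
import math

def cosine_lr(initial_lr, epoch, total_epochs, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example with the Base CT settings: initial LR 1e-3, 60 epochs.
for epoch in (0, 15, 30, 45, 60):
    print(epoch, cosine_lr(1e-3, epoch, 60))
```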

Reconstruction Encoder-Decoder Model

This table outlines the hyperparameter settings for the encoder-decoder model architecture. It includes the number of epochs, learning rate, optimizer, and other relevant hyperparameters used specifically for this model.

Table A.3: Encoder-Decoder Model Hyperparameters
Model	Encoder - Decoder
Epochs	600
Architecture Encoder	2 FC
Architecture Decoder	2 FC
Hidden Sizes	128
Initial LR	1e-3
LR Scheduler	Cosine
Optimizer	AdamW
Oversample	6x
Batch Size	14
Batch Normalization	No

Scientific Paper

If you want more information about this dataset, feel free to read our paper:

Tiago Mota, Maria Rita Fonseca Verdelho, Diogo José Pereira Araújo, Alceu Bissoto, Carlos Santiago, Catarina Barata, MMIST-ccRCC: A Real-World Medical Dataset for the Development of Multi-Modal Systems, Data Curation and Augmentation in Enhancing Medical Imaging Applications Workshop (archival) @CVPR 2024.

Access the Paper Here!

How to cite us?

Cite the paper where the dataset was first published and acknowledge us: "The results shown in this work used datasets collected from MMIST: https://Multi-Modal-IST.github.io/"

Get in Touch at

ana.c.fidalgo.barata@tecnico.ulisboa.pt