Free Datasets for Radiology AI
- Dr. Candace Makeda Moore
- Apr 20, 2019
- 4 min read

The ACR has released a directory of datasets for AI. I found it to be significantly smaller than the number of datasets I have encountered. I'm sending them updates, but who knows how long it will take them to put them up. So I've decided to post my fruitful methods for finding radiology datasets in 2019.
Obviously, you could just go googling around. However, many of the previously available datasets you will find links to were either open-source or academic projects that died, or their websites moved, or they migrated behind a paywall. Here are some links that are not dead as of today.
The ACR list of datasets is here:
https://www.acrdsi.org/DSI-Services/Dataset-Directory
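Since dataset links die so often, it can help to re-check a list of them before relying on it. Here is a minimal sketch using only the Python standard library. The function names check_link and check_links are my own, and the choice of a HEAD request is an assumption: some servers reject HEAD, so a failure here is a hint to go look by hand, not proof that a dataset is gone.

```python
import urllib.error
import urllib.request


def check_link(url, timeout=10, opener=urllib.request.urlopen):
    """Return (url, status): an HTTP status code, or a short error string."""
    try:
        # HEAD asks for headers only, so nothing large gets downloaded.
        request = urllib.request.Request(
            url, method="HEAD", headers={"User-Agent": "dataset-link-check"}
        )
        with opener(request, timeout=timeout) as response:
            return url, response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code such as 404 or 403.
        return url, err.code
    except (urllib.error.URLError, OSError, ValueError) as err:
        # DNS failure, timeout, refused connection, or a malformed URL.
        return url, "unreachable: " + str(err)


def check_links(urls, **kwargs):
    """Check every URL in turn and return a list of (url, status) pairs."""
    return [check_link(url, **kwargs) for url in urls]
```

Calling check_links with the URLs from this post gives one (url, status) pair per link. Anything that is not 200 is worth a manual look.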
My additional recommendations to find datasets are:
#1: The Medical Image Bank of Valencia- not just one but MANY DATASETS!
http://bimcv.cipf.es/
Spine (MIDAS), Brains (specifically multiple datasets like GLIOHABITATS, NEUROBIM-MS etc.), Chest Xrays (PADCHEST), even methodological frameworks and a GIS. Just off the hook cool- with people who actually write you back!
#2 MD.ai
https://public.md.ai/hub/projects/public
Columbia, Harvard and Duke put some great datasets including the Qure.ai head CT dataset in one place. Not the largest list of datasets- but taking #2 in my heart for leading me to publicly available algorithms and code nearby at https://public.md.ai/hub/models/public
#3: OpenI: The Open Access Biomedical Image Search Engine
https://openi.nlm.nih.gov/
Home to the Indiana University Chest Xray dataset. The IU dataset, while smaller than either ChexNet set, includes the full reports in XML. So there is a CXR REPORT dataset here, not just images.
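Because the reports come as XML, pulling the text sections out takes only a few lines. The sketch below uses Python's standard xml.etree.ElementTree. The sample report is invented for illustration. The AbstractText element and its Label attribute are my assumption about how the report files are laid out, so check a real file and adjust the names before trusting it.

```python
import xml.etree.ElementTree as ET

# Invented sample for illustration only; not taken from the dataset.
SAMPLE_REPORT = """<eCitation>
  <MedlineCitation>
    <Article>
      <Abstract>
        <AbstractText Label="INDICATION">Chest pain.</AbstractText>
        <AbstractText Label="FINDINGS">The lungs are clear.</AbstractText>
        <AbstractText Label="IMPRESSION">No acute disease.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
</eCitation>"""


def extract_sections(xml_text):
    """Map each labeled report section (e.g. FINDINGS) to its text."""
    root = ET.fromstring(xml_text)
    sections = {}
    # Assumed layout: one AbstractText element per section, named by Label.
    for node in root.iter("AbstractText"):
        label = (node.get("Label") or "UNLABELED").upper()
        sections[label] = (node.text or "").strip()
    return sections


sections = extract_sections(SAMPLE_REPORT)
print(sections["FINDINGS"])
```

Run over a folder of report files, this gives you findings and impressions as plain text, ready to pair with the images.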
#4: Kaggle:
www.kaggle.com
A general dataset website that includes non-radiology datasets, but also many radiology datasets. It is often a better way to get datasets than their own official websites, as you don't have to buy special software but just download a zip or two. You'll find a subset of the DeepLesion dataset, ChestXray8 (because before 14, there was 8) and so on.
#5+6: OpenNeuro and OASIS
https://openneuro.org/ and http://oasis-brains.org/
Neuro, neuro-> Brain MRIs and more
#7 Spineweb's datasets
http://spineweb.digitalimaginggroup.ca/spineweb/index.php?n=Main.Datasets
Over a dozen datasets about the spine
#8 Zenodo
https://zenodo.org/
A few simple clicks or queries and you can grab plenty of datasets such as this one (UCLH Stroke EIT Dataset - Radiology Data) https://zenodo.org/record/1199398#.XL2uNOgzZPY
#9 The Cancer Imaging Archive:
https://wiki.cancerimagingarchive.net/
The ACR posted some but not all or even most of the datasets available from this site. The site has cancer imaging DICOMS by the terabyte. So many collections I'm too lazy to describe them all- just look:
4D-Lung
ACRIN-FLT-Breast
ACRIN-FMISO-Brain
ACRIN-NSCLC-FDG-PET
Anti-PD-1 Immunotherapy Lung (Anti-PD-1_Lung)
Anti-PD-1 Immunotherapy Melanoma (Anti-PD-1_MELANOMA)
APOLLO-1-VA
APOLLO2
Brain-Tumor-Progression
BREAST-DIAGNOSIS
Breast-MRI-NACT-Pilot
CBIS-DDSM
CPTAC-CCRCC
CPTAC-CM
CPTAC-GBM
CPTAC-HNSCC
CPTAC-LSCC
CPTAC-LUAD
CPTAC-PDA
CPTAC-SAR
CPTAC-UCEC
Credence Cartridge Radiomics Phantom CT Scans
Credence Cartridge Radiomics Phantom CT Scans with Controlled Scanning Approach (CC-Radiomics-Phantom-2)
CT COLONOGRAPHY
CT Lymph Nodes
Head-and-neck squamous cell carcinoma patients with CT taken during pre-treatment, mid-treatment, and post-treatment (HNSCC-3DCT-RT)
Head-Neck Cetuximab
Head-Neck-PET-CT
ISPY1
Ivy GAP
LGG-1p19qDeletion
LIDC-IDRI
LungCT-Diagnosis
Lung CT Segmentation Challenge 2017
Lung Phantom
Mouse-Astrocytoma
Mouse-Mammary
NaF Prostate
NRG-1308
NSCLC-Cetuximab
NSCLC Radiogenomics
NSCLC-Radiomics
NSCLC-Radiomics-Genomics
Osteosarcoma data from UT Southwestern/UT Dallas for Viable and Necrotic Tumor Assessment
Pancreas-CT
Phantom FDA
Prostate-3T
PROSTATE-DIAGNOSIS
Prostate Fused-MRI-Pathology
PROSTATE-MRI
QIBA CT-1C
QIN-BRAIN-DSC-MRI
QIN-Breast
QIN Breast DCE-MRI
QIN GBM Treatment Response
QIN-HEADNECK
QIN LUNG CT
QIN PET Phantom
QIN PROSTATE
QIN-PROSTATE-Repeatability
QIN-SARCOMA
Quantitative Imaging Network Collections
REMBRANDT
RIDER Breast MRI
RIDER Collections
RIDER Lung CT
RIDER Lung PET-CT
RIDER NEURO MRI
RIDER PHANTOM MRI
RIDER Phantom PET-CT
Soft-tissue-Sarcoma
SPIE-AAPM Lung CT Challenge
SPIE-AAPM-NCI PROSTATEx Challenges
Synthetic and Phantom MR Images for Determining Deformable Image Registration Accuracy (MRI-DIR)
TCGA-BLCA
TCGA-BRCA
TCGA-CESC
TCGA-COAD
TCGA-ESCA
TCGA-GBM
TCGA-HNSC
TCGA-KICH
TCGA-KIRC
TCGA-KIRP
TCGA-LGG
TCGA-LIHC
TCGA-LUAD
TCGA-LUSC
TCGA-OV
TCGA-PRAD
TCGA-READ
TCGA-SARC
TCGA-STAD
TCGA-THCA
TCGA-UCEC
The VICTRE Trial: Open-Source, In-Silico Clinical Trial For Evaluating Digital Breast Tomosynthesis
# 8002 The NIH:
I can't tell you how irritated I have been every time I try to access an NIH dataset and find out that theoretically I need to pay. The N is for national, and I'm a sucker who pays taxes for this institution, yet somehow I need to pay a private company called BOX to get datasets? Because I need to subsidize rich people in tech, not the other way around, according to government logic. Apparently there is even a National Biomedical Imaging Archive, complete with an NBIA Data Retriever, which is always down for maintenance. Since I can't get the thing, I must presume the cart icon means they are working on some way for me to pay (beyond my taxes) for that as well. Nonetheless, DeepLesion (https://nihcc.app.box.com/v/DeepLesion) as well as ChestXray 8 and ChestXray 14 have made their way to various dataset groupie websites and gone open. If you look, you can find and torrent them. The question is when will someone set the MIMIC Chest Xray dataset free? (https://physionet.nlm.nih.gov/physiobank/database/mimiccxr/)
Clearly the NIH did not get the memo- I mean literally- THE MEMO from the US government:
https://project-open-data.cio.gov/policy-memo/
Which includes such nuggets of hope as:
"this Memorandum requires agencies to collect or create information in a way that supports downstream information processing and dissemination activities" and "Making information resources accessible, discoverable, and usable by the public can help fuel entrepreneurship, innovation, and scientific discovery – all of which improve Americans’ lives and contribute significantly to job creation"
The NSF seems to have gotten the memo, so here's to hoping... someone besides the one tiny branch that published some serious CXR data gets the memo. (Here is that CXR gold: https://ceb.nlm.nih.gov/repositories/tuberculosis-chest-x-ray-image-data-sets/)