Addressing Class Imbalance in Medical Image Data: Finding a Needle in a Haystack (IDDDP)
- Supervisor: Prof. Ganesh Ramakrishnan
- Co-supervisor: Prof. Rishabh Iyer
Current state-of-the-art active learning algorithms (www.arxiv.org/abs/1906.03671, www.arxiv.org/abs/2012.10630,
www.cse.iitb.ac.in) often do not work well when there is class imbalance (e.g. cancerous
cases are far fewer than non-cancerous cases), distribution shift (e.g. training on data
from one ethnicity or group and testing on another), or out-of-distribution examples
(e.g. unseen classes in the unlabeled set). These issues are often present in real-world
medical imaging datasets. We will develop new active learning algorithms that effectively
address them.
We will achieve this by building upon the supervisors' work on active learning using
submodular mutual information functions (www.arxiv.org/abs/2103.00128). To get a glimpse
of the datasets and challenges in this space, look at www.grand-challenge.org. As for the
coding platform, we will develop on www.decile.org, our homegrown data-efficient machine
learning platform. More specifically, we will build on DISTIL
(www.github.com/decile-team/distil), an open-source platform for Deep Diversified
Interactive Learning with several Jupyter notebooks and video tutorials (such as
www.youtube.com).
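To make the idea of selection with submodular mutual information more concrete, here is a minimal sketch of greedy selection under a facility-location-style mutual information function between a candidate set A (drawn from the unlabeled pool) and a small query set Q of rare-class exemplars. The objective used, sum over q of max over a in A of sim[a, q], plus eta times the sum over a in A of max over q of sim[a, q], is one common form of such a function; whether it matches the exact formulation in the cited work is an assumption. The function name `flqmi_greedy`, the parameter `eta`, and the similarity matrix layout are hypothetical and are not part of the DISTIL API.

```python
import numpy as np

def flqmi_greedy(sim, budget, eta=1.0):
    """Greedily select `budget` unlabeled items that are most relevant to a query set.

    sim: array of shape (n_unlabeled, n_query) with nonnegative similarities
         between each unlabeled item and each query (rare-class) exemplar.
    Objective (an assumed facility-location-style mutual information):
        sum_q max_{a in A} sim[a, q] + eta * sum_{a in A} max_q sim[a, q]
    """
    sim = np.asarray(sim, dtype=float)
    n_unlabeled, n_query = sim.shape
    selected = []
    best_cover = np.zeros(n_query)        # max_{a in A} sim[a, q] for the current A
    relevance = eta * sim.max(axis=1)     # per-item query-relevance term
    available = np.ones(n_unlabeled, dtype=bool)

    for _ in range(min(budget, n_unlabeled)):
        # Marginal gain of adding each item: improvement in query coverage
        # plus the item's own relevance to its closest query exemplar.
        gains = np.maximum(sim - best_cover, 0.0).sum(axis=1) + relevance
        gains[~available] = -np.inf       # never pick the same item twice
        pick = int(np.argmax(gains))
        selected.append(pick)
        available[pick] = False
        best_cover = np.maximum(best_cover, sim[pick])
    return selected
```

In a class-imbalanced setting, the query set would hold the few labeled examples of the rare class (e.g. cancerous cases), and the selected indices would be sent to an annotator. Similarities would typically come from model embeddings, which this sketch leaves to the caller.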
The overall goal of the project is data- and cost-efficient deep learning for medical
image classification. Advances in machine learning and deep learning are having a big
impact in several domains. One important application is building machine learning
classifiers that effectively complement human doctors and radiologists in detecting
various diseases (e.g. forms of cancer) from medical images such as X-rays, CT scans, and
MRI images. Applications range from tumor detection and medical image segmentation to the
detection of Alzheimer's and Parkinson's disease. Despite the progress of deep learning,
one main challenge is that deep learning models are extremely data-hungry and require
several tens of thousands of images to work effectively. Because detecting diseases like
cancer in images requires specialized skills (doctors and radiologists), the cost of
annotating these datasets is very high. Furthermore, it is often hard to find sufficient
samples for certain rare diseases, and medical imaging data is often heavily
class-imbalanced. The goal of this project is to significantly reduce the amount of
labeled data required, with minimal loss in accuracy. To achieve this, we will study the
role of active learning and semi-supervised learning. Semi-supervised approaches aim to
use the unlabeled data effectively to complement the limited labeled data during
learning, while data selection and active learning seek to select the most informative
data points (to train on or to label) in order to improve model performance. Preliminary
results suggest that we can reduce the amount of labeled data by factors of 5x to 20x
with negligible performance degradation, depending on the dataset and the choice of
algorithm.
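As a point of reference for what "selecting the most informative data" means in practice, the following is a minimal sketch of the selection step in one round of entropy-based uncertainty sampling, a standard active learning baseline. It is not the algorithm this project proposes; the function name `entropy_sampling` and its arguments are hypothetical, and the class probabilities are assumed to come from whatever model is currently being trained.

```python
import numpy as np

def entropy_sampling(probs, budget):
    """Return indices of the `budget` unlabeled examples with the highest
    predictive entropy.

    probs: array of shape (n_unlabeled, n_classes); each row holds the model's
           predicted class probabilities for one unlabeled example.
    """
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)  # avoid log(0)
    entropy = -(probs * np.log(probs)).sum(axis=1)
    # Sort by decreasing entropy; a stable sort keeps ties in their original order.
    order = np.argsort(-entropy, kind="stable")
    return order[:budget].tolist()
```

In a full loop, the returned examples would be labeled by an expert, added to the labeled set, and the model retrained before the next round. Plain uncertainty sampling of this kind does not account for class imbalance or out-of-distribution examples, which is the gap described earlier.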
The advisors complement each other in this project: Prof. Ganesh Ramakrishnan's and
Prof. Rishabh Iyer's expertise is in active learning and semi-supervised learning, and
Prof. Tamil's expertise is in deep learning for medical imaging. We will facilitate a
summer internship at a healthcare technology company or organization as part of the
collaboration, subject to approval by the faculty advisors.