Addressing Class Imbalance in Medical Image
Data: Finding a Needle in a Haystack
Current state-of-the-art active learning algorithms (www.arxiv.org/abs/1906.03671, www.arxiv.org/abs/2012.10630, www.cse.iitb.ac.in) often do not work well
in the presence of class imbalance (e.g. cancerous cases are far fewer
than non-cancerous cases), distribution shift (e.g. training on
data from one ethnicity/group and testing on another), or
out-of-distribution examples (e.g. unseen classes in the unlabeled set). All of
these issues are often present in real-world medical imaging datasets. We
will develop new active learning algorithms that effectively address
these issues, building on the
supervisors’ work on active learning with submodular mutual
information functions (www.arxiv.org/abs/2103.00128). For a glimpse of
the datasets and challenges in this space, see www.grand-challenge.org. As for
the coding platform, we will develop on www.decile.org, our homegrown data-efficient
machine learning platform. More specifically, we will build on DISTIL
(www.github.com/decile-team/distil), an open-source
platform for Deep Diversified Interactive Learning that comes with several Jupyter
notebooks and video tutorials (such as www.youtube.com).
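To make the idea of targeted selection with a submodular mutual information function concrete, here is a minimal sketch in plain Python. It assumes a facility-location-style score: the selected set is rewarded when every rare-class query exemplar has a close match in the selection, and when every selected item is close to some query exemplar. That particular form, the cosine similarity, the trade-off parameter `eta`, and the names `smi_score`, `greedy_select`, and `make_toy_pool` are illustrative assumptions, not DISTIL's API and not necessarily the functions the project will use.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def smi_score(sim, selected, eta=1.0):
    """Facility-location-style mutual information score (assumed form).

    sim[i][q] is the similarity between unlabeled item i and query exemplar q.
    """
    if not selected:
        return 0.0
    n_query = len(sim[0])
    # How well each query exemplar is covered by its best match in the selection.
    coverage = sum(max(sim[i][q] for i in selected) for q in range(n_query))
    # How relevant each selected item is to its closest query exemplar.
    relevance = sum(max(sim[i]) for i in selected)
    return coverage + eta * relevance


def greedy_select(sim, budget, eta=1.0):
    """Greedily add the item with the largest marginal gain in smi_score."""
    selected = []
    remaining = list(range(len(sim)))
    for _ in range(min(budget, len(sim))):
        base = smi_score(sim, selected, eta)
        best = max(
            remaining,
            key=lambda i: smi_score(sim, selected + [i], eta) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


def make_toy_pool(seed=0):
    """Hypothetical 2-D features: 95 common items, 5 rare items, 3 rare queries."""
    rng = random.Random(seed)

    def noisy(center):
        return [c + rng.gauss(0.0, 0.05) for c in center]

    pool = [noisy([1.0, 0.0]) for _ in range(95)]   # common class (indices 0..94)
    pool += [noisy([0.0, 1.0]) for _ in range(5)]   # rare class (indices 95..99)
    query = [noisy([0.0, 1.0]) for _ in range(3)]   # a few labeled rare exemplars
    return pool, query


pool, query = make_toy_pool()
sim = [[cosine(x, q) for q in query] for x in pool]
picked = greedy_select(sim, budget=5)
print(sorted(picked))
```

On this toy pool the five picks all come from the rare cluster (indices 95 to 99), even though it makes up only 5% of the pool. In a real setting the feature vectors would come from a trained model rather than from hand-made 2-D points, and the naive greedy loop would need a more efficient implementation for large pools.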
The overall goal of the project is data- and cost-efficient deep learning for
medical image classification. Advances in machine learning and deep
learning are having a large impact in several domains. One important
application is building classifiers that effectively complement
doctors and radiologists in detecting diseases (e.g. forms of
cancer) from medical images such as X-rays, CT scans, and MRI scans.
Applications range from tumor detection and medical image
segmentation to the detection of Alzheimer’s and Parkinson’s
disease. Despite the progress of deep learning, one main
challenge of these approaches is that deep learning models are extremely
data-hungry and require several tens of thousands of images to work
effectively. Because detecting diseases such as cancer in images requires
specialized expertise (doctors and radiologists), the cost of annotating
these datasets is very high. Furthermore, it is often hard to find
sufficient samples for certain rare diseases, and medical imaging data is
often heavily class-imbalanced. The goal of this project is to significantly
reduce the amount of labeled data required, with minimal loss in accuracy.
To achieve this, we will study the role of active learning and
semi-supervised learning.
Semi-supervised approaches use unlabeled data to
complement the limited labeled data during training, while data selection and
active learning seek to select the most informative examples to label in order to improve
model performance. Preliminary results suggest that we can reduce the
amount of labeled data by a factor of 5x to 20x with negligible performance
degradation, depending on the dataset and the choice of algorithm.
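The two ingredients described above can be illustrated with a small sketch. It assumes entropy-based uncertainty sampling as the active learning criterion and confidence-thresholded pseudo-labeling as the semi-supervised step. Both are common baselines chosen here for illustration only, not the project's method. The function names, the 0.95 threshold, and the example probabilities are hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(pool_probs, budget):
    """Active learning step: indices of the `budget` highest-entropy pool items."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: predictive_entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


def pseudo_label(pool_probs, threshold=0.95):
    """Semi-supervised step: keep the model's label where it is very confident."""
    labels = {}
    for i, probs in enumerate(pool_probs):
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            labels[i] = best
    return labels


# Hypothetical model outputs for four unlabeled images (two classes).
pool_probs = [
    [0.99, 0.01],
    [0.50, 0.50],
    [0.80, 0.20],
    [0.60, 0.40],
]

print(select_most_uncertain(pool_probs, budget=2))  # [1, 3] -> send to an annotator
print(pseudo_label(pool_probs))                     # {0: 0} -> use without annotation
```

Annotation effort goes to the images the model is least sure about, while confidently predicted images contribute to training without a human label. Plain uncertainty sampling of this kind is exactly what tends to struggle under class imbalance, which is the motivation for the targeted selection methods the project proposes.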
The advisors complement each other in this project: Ganesh Ramakrishnan's
and Rishabh Iyer’s expertise is in active learning and semi-supervised
learning, and Prof. Tamil's expertise is in deep learning for medical
imaging. As part of the collaboration, we will facilitate a summer internship
at a healthcare technology company/organization, subject to approval by
the faculty advisors.
Project Type: Student Project