MuDaRI | Quang N. Nguyen

This project was presented at the 4th Asia Pacific Tele-Ophthalmology Society (APTOS) Symposium 2019, Chennai, India. [Slides]

This work also received Best Poster Presentation at the Southeast Asia Machine Learning School (SEA MLS) 2019, Jakarta, Indonesia.

Introduction

Automated screening of retinal fundus photographs has made significant progress over the past decade, driven largely by the availability of large, well-annotated datasets. However, most publicly available datasets, such as EyePACS (Cuadros & Bresnick, 2009), are organized around a single disease category, meaning each image carries only one diagnostic label, typically for diabetic retinopathy (DR), age-related macular degeneration (AMD), or glaucoma. This single-label paradigm creates a critical gap when the resulting models are deployed in real clinical settings: a model trained exclusively on DR, for instance, can only report whether DR is present or absent, leaving every other concurrent pathology undetected. The clinical consequence is potentially misleading: a screening tool reports that a patient is clear of one condition while remaining entirely silent about others, as illustrated in Figure 1.

**Figure 1:** Existing large-scale retinal datasets such as EyePACS are organized around a single diagnostic label per image, making them unsuitable for multi-disease screening in clinical practice.

Real patients frequently present with co-occurring conditions: a diabetic patient may simultaneously exhibit DR, macular edema, and early cataract changes within the same fundus image. Addressing this gap requires datasets that are both broader in disease coverage and structured to support multilabel annotation, where a single image can carry several concurrent diagnoses.

In this project, we curate one of the first multilabel datasets of retinal fundus images covering a wide range of ocular diseases, and we develop a multilabel classifier trained on this data to perform simultaneous detection of multiple conditions from a single image.

Building the Dataset

Curation Pipeline

The dataset was sourced from Cao Thang International Eye Hospital (CTEH) in Ho Chi Minh City, Vietnam, a private tertiary eye care center, and covers a six-year collection period from 2013 to 2018. A total of 6,871 fundus photographs were retrieved and matched against the corresponding Electronic Medical Records (EMRs). All diagnostic terms were standardized and verified against ICD-10 coding conventions, and the dataset was fully de-identified prior to use.

Rather than relying on a purely manual labeling workflow, we integrated convolutional neural network (CNN)-based tools at intermediate stages to assist with image quality filtering and preliminary label assignment, which were then reviewed and corrected by human graders. This hybrid approach, combining automated pre-screening with expert verification, allowed us to scale the curation process without sacrificing annotation quality. The full pipeline is illustrated in Figure 2.

**Figure 2:** Overview of the dataset curation pipeline. CNN-based tools assist with quality filtering and initial labeling at each stage, with human expert review at critical checkpoints.

Dataset Composition

The final curated dataset spans eight diagnostic categories: seven major conditions and an “Others” class that groups the long tail of less common conditions. Table 1 summarizes the disease counts, and Figure 3 shows the resulting label distribution.

Condition	Occurrence
Cataract	2,258
Diabetic Retinopathy	1,085
Glaucoma	812
Macular Edema	760
Macular Degeneration	757
Retinal Vascular Occlusion	592
Optic Neuritis/Neuropathy	580
Others	2,199

Table 1: Disease occurrence counts in the full MuDaRI dataset.

The “Others” category consolidates all conditions with fewer than 500 occurrences, including Posterior Uveitis, Eye Infections, Macular Pucker, Vitreous/Retinal Hemorrhage, Posterior Capsular Opacification, Retinal Detachment/Breaks, Laser Scars, Hereditary Retinal Dystrophy, Myopia, Central Serous Chorioretinopathy, Glaucoma Suspect, Hypertensive Retinopathy, Chorioretinal Atrophy, Large Cup, and Chorioretinal Neovascularization.
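The consolidation rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `consolidate_rare_labels`, the per-image set representation, and the toy records are hypothetical, and only the 500-occurrence threshold comes from the text.

```python
from collections import Counter

MIN_OCCURRENCES = 500  # threshold stated in the text


def consolidate_rare_labels(records, min_count=MIN_OCCURRENCES, other="Others"):
    """Map every label seen fewer than min_count times to a shared class.

    records is a list of label sets, one set per image (assumed layout).
    """
    counts = Counter(label for labels in records for label in labels)
    keep = {label for label, n in counts.items() if n >= min_count}
    # Building a set per image means several rare labels on the same
    # image collapse into a single "Others" entry.
    return [
        {label if label in keep else other for label in labels}
        for labels in records
    ]


# Toy example with a lowered threshold so the effect is visible.
toy = [{"Cataract", "Myopia"}, {"Cataract"}, {"Laser Scars", "Large Cup"}]
grouped = consolidate_rare_labels(toy, min_count=2)
```

Whether an image with two rare conditions counts once or twice toward “Others” is not stated in the source; the sketch counts it once.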

Because images can carry multiple simultaneous diagnoses, the dataset yields 59 unique combinations of disease labels. This combinatorial diversity is precisely what makes MuDaRI challenging and valuable: no two patients are guaranteed to present identically, and a model must learn to assign any subset of the eight classes independently rather than selecting a single winner.
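Counting distinct label combinations is straightforward once each image is represented as a set of labels. The sketch below uses hypothetical toy records and a hypothetical helper name, `count_label_combinations`; it shows the idea rather than the script that produced the figure of 59.

```python
from collections import Counter


def count_label_combinations(records):
    """Count how often each distinct set of labels occurs."""
    # frozenset is hashable and order-independent, so the same
    # combination listed in a different order maps to the same key.
    return Counter(frozenset(labels) for labels in records)


records = [
    {"Cataract"},
    {"Diabetic Retinopathy", "Macular Edema"},
    {"Macular Edema", "Diabetic Retinopathy"},  # same combination as above
    {"Glaucoma", "Cataract"},
]
combos = count_label_combinations(records)
print(len(combos))  # 3 distinct combinations in this toy sample
```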

**Figure 3:** Label distribution across the MuDaRI dataset, illustrating the class imbalance present in real-world retinal screening data.

Multilabel Classifier

Data Splits

Images were drawn from two sources: the CTEH collection and the Messidor dataset. Table 2 summarizes the data sources, and Table 3 shows the per-condition breakdown across train, validation, and test splits.

Dataset	Partition	Images
CTEH	Train	5,351
CTEH	Test	592
Messidor	Train	988
Messidor	Test	113
Total		7,048

Table 2: Data sources and partition sizes.

Condition	Train	Validation	Test
Cataract	1,823	202	233
Diabetic Retinopathy	1,353	150	168
Glaucoma	662	74	76
Macular Edema	765	85	104
Macular Degeneration	508	68	81
Retinal Vascular Occlusion	468	52	72
Optic Neuritis/Neuropathy	470	52	58
Others	1,800	200	199
Normal	415	46	54

Table 3: Per-condition image counts across train, validation, and test splits.

Model Architecture

Our classifier is built on EfficientNet-B3 (Tan & Le, 2019), a highly parameter-efficient convolutional network that achieves strong ImageNet performance via compound scaling of depth, width, and resolution. We initialize the network with ImageNet pretrained weights and replace the final classification head with a 9-node fully connected layer with sigmoid activations, one node for each of the eight disease categories plus a Normal class (the nine rows of Table 3). Each node produces an independent probability estimate. This formulation treats the problem as nine simultaneous binary classification tasks rather than a single mutually exclusive choice, directly accommodating images with multiple concurrent diagnoses. The adapted architecture is shown in Figure 4.
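To make the difference between a sigmoid head and a softmax head concrete, here is a minimal sketch of the output stage only, without the EfficientNet-B3 backbone. The class order, the 0.5 decision threshold, and the example logits are assumptions for illustration and are not taken from the trained model.

```python
import math

# Class order is an assumption; it follows the rows of Table 3.
CLASSES = [
    "Cataract", "Diabetic Retinopathy", "Glaucoma", "Macular Edema",
    "Macular Degeneration", "Retinal Vascular Occlusion",
    "Optic Neuritis/Neuropathy", "Others", "Normal",
]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(logits):
    """Mutually exclusive alternative, shown only for contrast."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_labels(logits, threshold=0.5):
    """Return every class whose independent probability clears the threshold."""
    probs = [sigmoid(v) for v in logits]
    return [(name, p) for name, p in zip(CLASSES, probs) if p >= threshold]


# Hypothetical logits for one image with three concurrent findings.
logits = [2.0, 1.5, -3.0, 0.8, -2.0, -4.0, -3.5, -1.0, -5.0]
predicted = [name for name, _ in predict_labels(logits)]
# predicted == ["Cataract", "Diabetic Retinopathy", "Macular Edema"]
```

Because softmax probabilities sum to one, at most one class could exceed 0.5 for the same logits, which is why a softmax head cannot express several concurrent diagnoses.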

**Figure 4:** EfficientNet-B3 architecture with the final layer replaced by a 9-node sigmoid head for multilabel disease classification.

Two additional techniques were applied to improve generalization. First, real-time data augmentation (random flips, rotations, color jitter, and zoom) was applied during training to expand the effective diversity of the training set and reduce overfitting. Second, the training loss was modified to account for class imbalance, which is substantial: conditions such as Glaucoma and Retinal Vascular Occlusion are far less frequent than Cataract. Separately, to support interpretability and facilitate clinician feedback, saliency maps (Simonyan et al., 2014) are computed for each prediction, highlighting the image regions that most strongly drive the model’s output. Examples of augmented images and their saliency maps are shown in Figure 5.
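The text does not say how the loss was modified. One common option, shown here purely as an illustrative assumption and not as the method actually used, is binary cross-entropy with the positive term of each class up-weighted by its inverse relative frequency. The training counts are taken from Table 3; the function names and the weighting rule are hypothetical.

```python
import math

# Training-set counts from Table 3.
TRAIN_COUNTS = {
    "Cataract": 1823, "Diabetic Retinopathy": 1353, "Glaucoma": 662,
    "Macular Edema": 765, "Macular Degeneration": 508,
    "Retinal Vascular Occlusion": 468, "Optic Neuritis/Neuropathy": 470,
    "Others": 1800, "Normal": 415,
}


def positive_weights(counts):
    """Assumed rule: weight = (largest class count) / (this class count)."""
    largest = max(counts.values())
    return {name: largest / n for name, n in counts.items()}


def weighted_bce(probs, targets, weights, eps=1e-7):
    """Mean binary cross-entropy over classes with up-weighted positives."""
    total = 0.0
    for p, y, w in zip(probs, targets, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


weights = positive_weights(TRAIN_COUNTS)
# The most frequent class (Cataract) gets weight 1.0; rarer classes get more.
```

Under this rule, a missed positive for a rare class costs more than a missed positive for Cataract, which pushes the model toward higher recall on the rare classes.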

**Figure 5:** Left: examples of real-time data augmentation applied during training. Right: saliency maps overlaid on fundus images, highlighting lesion regions that the model attends to for each predicted condition.

Results

The model was evaluated using standard example-based multilabel metrics on the held-out test set, reported in Table 4. Per-class confusion matrices are shown in Figure 6.

Metric	Score
Exact Match Ratio	0.59
Accuracy	0.76
Recall	0.89
Specificity	0.95
Precision	0.79
F1-Score	0.82

Table 4: Example-based multilabel classification metrics on the test set.

The exact match ratio of 0.59 is the fraction of test images whose predicted label set matches the ground truth exactly, a strict criterion given the combinatorial space of possible multi-disease configurations. The high recall (0.89) and specificity (0.95) indicate that the classifier misses few true positives while also avoiding excessive false alarms. This balance is desirable in a screening context, where missing a true positive carries greater clinical cost than a false referral.
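For readers unfamiliar with example-based metrics, the sketch below computes them per image from the true and predicted label sets and then averages over images. It assumes the usual set-overlap definitions; the exact formulas behind Table 4 are not stated in the source, and the function name and toy data are hypothetical.

```python
def example_based_metrics(y_true, y_pred, n_classes):
    """Average per-image set-overlap metrics over paired lists of label sets."""

    def ratio(num, den):
        # Convention chosen here: an empty denominator scores 0.0.
        return num / den if den else 0.0

    names = ["exact_match", "accuracy", "precision", "recall", "specificity", "f1"]
    sums = dict.fromkeys(names, 0.0)
    for true, pred in zip(y_true, y_pred):
        tp = len(true & pred)          # labels correctly predicted
        union = len(true | pred)       # labels in either set
        fp = len(pred - true)          # labels predicted but absent
        tn = n_classes - union         # labels absent from both sets
        sums["exact_match"] += float(true == pred)
        sums["accuracy"] += ratio(tp, union)  # Jaccard overlap
        sums["precision"] += ratio(tp, len(pred))
        sums["recall"] += ratio(tp, len(true))
        sums["specificity"] += ratio(tn, tn + fp)
        sums["f1"] += ratio(2 * tp, len(true) + len(pred))
    n = len(y_true)
    return {name: total / n for name, total in sums.items()}


# Toy example: one perfect prediction, one with a missed label.
y_true = [{"Cataract"}, {"Diabetic Retinopathy", "Macular Edema"}]
y_pred = [{"Cataract"}, {"Diabetic Retinopathy"}]
scores = example_based_metrics(y_true, y_pred, n_classes=9)
# exact_match = 0.5, accuracy = 0.75, recall = 0.75, precision = 1.0
```

The toy case shows why exact match is the strictest of these scores: the second image gets partial credit for accuracy and recall but none for exact match.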

**Figure 6:** Per-class confusion matrices on the test set. Each matrix shows the true positive, false positive, false negative, and true negative counts for one disease category.

Application Demo

To demonstrate real-world utility, the classifier was integrated into a prototype screening interface. A clinician uploads a fundus photograph and receives a ranked list of predicted conditions alongside the corresponding saliency map, enabling rapid triage and supporting a more transparent second-opinion workflow. Screenshots of the interface are shown in Figure 7.

**Figure 7:** Screenshots of the prototype screening application. The interface displays predicted disease labels with confidence scores and overlays the saliency map on the input fundus image.

Conclusion

This project addresses a structural gap in retinal AI research: the absence of large-scale, multilabel annotated datasets that reflect the co-occurrence of multiple ocular diseases in real clinical populations. By curating MuDaRI and training an EfficientNet-B3-based multilabel classifier on it, we demonstrate that a single model can simultaneously screen for eight categories of ocular disease from one fundus photograph, with strong example-based recall (0.89) and specificity (0.95) on the test set.

The high recall in particular has practical implications for deployment: screening systems must minimize missed diagnoses, and the model’s behavior aligns with this priority. The saliency map integration further supports clinical adoption by making the model’s reasoning interpretable to ophthalmologists, enabling more productive human-AI collaboration.

Future directions include expanding the dataset with additional disease categories, incorporating multi-reader annotations to quantify labeling uncertainty, and evaluating the classifier across imaging devices and patient populations to assess generalizability.