## What it is
ResNet-18 + CORAL ordinal regression that predicts analyte concentrations from smartphone images of lateral flow assay (LFA) strips, the same format used in over-the-counter flu and COVID tests. 59.3% ± 0.6% exact accuracy (5-fold CV, held-out test set) on a 12-class ordinal problem, approximately 7× random chance. Originally built as my MS thesis at UW's Lutz Lab in 2022, then rebuilt in 2026 as a production-quality ML framework.
## Why I built it
Lateral flow assays output a binary readout (a line or no line) but the underlying signal is a continuous analyte concentration. A human reader calls a faint line "negative" even when a measurable quantity of antigen is present. A model that reads the strip image more precisely than the eye can improve clinical sensitivity without changing the hardware or the test protocol.
The question was whether a generic smartphone camera (not a calibrated reader) could provide enough signal to make this work under real-world variation in devices, lighting, and capture angle.
## How it works

### The dataset
This is not a public benchmark. I designed and collected this dataset while working as a research engineer at UW's Lutz Lab on a project for Audere, a health-tech company building smartphone-based diagnostic tools. I managed a team of undergraduates to run Quidel LFA strips at 12 known antigen concentrations (0–10 ng/mL, 2× serial dilution), photograph each strip at a fixed 10-minute read time using 3–5 different smartphone models, and deliberately vary the angle, direction, and lighting type per shot. The goal: test how well a generic smartphone camera could function as a diagnostic reader, so the dataset was intentionally noisy across devices and capture conditions.
The dataset contains ~1,940 usable images. Blank controls (confirmed negative) are included; invalid tests (no control line) are excluded. Sampling deliberately oversampled concentrations near the limit of detection, the clinically critical range.
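The 12 concentration classes follow directly from the collection design: a blank control plus a 2× serial dilution series down from 10 ng/mL. The exact series below is a reconstruction from the figures quoted in this writeup (0–10 ng/mL range, 2× steps, and the 0.0781/0.1563 ng/mL levels discussed later), not a dump of the actual protocol:

```python
# Reconstruct the 12 ordinal concentration levels (assumed series):
# one blank control plus an 11-point 2x serial dilution from 10 ng/mL.
TOP_NG_ML = 10.0
NUM_DILUTIONS = 11  # 11 dilution points + 1 blank = 12 classes

levels = [0.0] + [TOP_NG_ML / 2**k for k in range(NUM_DILUTIONS - 1, -1, -1)]

# Indices 4 and 5 of this series are 0.078125 and 0.15625 ng/mL --
# the assay's stated LoD and the level just above it.
```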

Dataset collected under agreement with Audere and not in the public repository. See the data schema for column documentation.
### The model
ResNet-18 pretrained on ImageNet, fine-tuned with a CORAL ordinal regression head.
Why ordinal? LFA concentration levels are ordered but not linearly spaced on a continuous scale. Standard classification treats "off by 1 class" the same as "off by 6 classes," ignoring the ordering entirely. Standard regression assumes smooth linear continuity that doesn't hold in discrete 2× dilution steps. CORAL ordinal encoding respects both constraints: K-1 binary sigmoid outputs ("is concentration above level k?"), trained with a sum of K-1 binary cross-entropy losses.
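The extended-binary encoding and its rank decoding fit in a few lines. This is a framework-free sketch of the idea only: the actual CORAL head also shares a single weight vector across the K−1 logits (only the bias terms differ), which is what guarantees the binary probabilities are monotonically decreasing and makes the count-based decoding consistent.

```python
def coral_encode(label: int, num_classes: int) -> list[int]:
    """Encode ordinal label k as K-1 binary targets: 'is y above level j?'."""
    return [1 if label > j else 0 for j in range(num_classes - 1)]

def coral_decode(probs: list[float], threshold: float = 0.5) -> int:
    """Decode K-1 sigmoid probabilities back to a rank by counting how many
    'above level j' outputs exceed the threshold."""
    return sum(1 for p in probs if p > threshold)

# Example with K = 12 classes (11 binary outputs), as in this project:
targets = coral_encode(3, 12)   # [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
rank = coral_decode([0.9, 0.8, 0.7, 0.4] + [0.1] * 7)  # -> 3
```

Training then sums a binary cross-entropy term over the K−1 outputs, so an "off by 6" prediction violates more binary targets than an "off by 1" prediction, which is exactly how the ordering enters the loss.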
### The v2 rebuild

The v2 framework replaced the original one-file-per-experiment approach with:

- Config-driven YAML experiments with base inheritance
- Modular, interchangeable backbone/head/loss/metric components
- 5-fold stratified cross-validation throughout
- W&B experiment tracking across all runs
- 49 unit tests
- Differential learning rates (1e-4 frozen backbone, 1e-5 fine-tuned)
- An improved image cropping strategy that reduced pixel count and removed edge noise left in by the original pipeline
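The base-inheritance mechanism can be illustrated with a recursive dict merge: an experiment YAML only states what differs from the base, and nested sections merge key by key. The keys and values below are hypothetical, not the framework's actual schema:

```python
def merge_configs(base: dict, override: dict) -> dict:
    """Recursively merge an experiment config over a base config:
    nested dicts merge key-by-key, scalar values in the override win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical base experiment and a CORAL variant overriding two keys:
base = {"model": {"backbone": "resnet18", "pretrained": True},
        "optim": {"lr_head": 1e-4, "lr_backbone": 1e-5}}
exp = {"model": {"head": "coral"}, "optim": {"lr_head": 3e-4}}
cfg = merge_configs(base, exp)
```

In practice the two dicts would be loaded from YAML files; the merge semantics are the interesting part.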
## On the accuracy metric
With 12 ordinal classes, random guessing achieves ~8%. The v2 model reaches 59.3% exact accuracy and 70.2% ± 1.1% within-1 accuracy (predictions within one concentration level of ground truth). In a 2× serial dilution, being off by one class means the estimate is within a factor of 2 of the actual value.
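The within-k family of metrics is simple to state precisely; a minimal sketch (function name is mine, not the framework's):

```python
def within_k_accuracy(preds: list[int], targets: list[int], k: int = 1) -> float:
    """Fraction of predictions within k ordinal levels of the target.
    k=0 is exact accuracy; k=1 is the 'within-1' metric used above."""
    assert preds and len(preds) == len(targets)
    hits = sum(1 for p, t in zip(preds, targets) if abs(p - t) <= k)
    return hits / len(preds)

# Toy example: one exact hit, one off-by-one, one off-by-two, one exact.
preds, targets = [3, 5, 7, 0], [3, 4, 9, 0]
exact = within_k_accuracy(preds, targets, k=0)    # 0.5
within1 = within_k_accuracy(preds, targets, k=1)  # 0.75
```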
The v2 model's effective limit of detection is 0.1563 ng/mL, one concentration level above the assay's stated LoD of 0.0781 ng/mL. Concentrations at or below 0.0781 ng/mL are predicted as either 0.1563 or negative. This is a known behavior of ordinal regression near decision boundaries: the model absorbs highly ambiguous classes into the nearest confident threshold rather than pushing a low-confidence prediction through. The v1 regression model had partial detection at the true LoD, since a continuous output can interpolate toward the correct class even when uncertain.
## Results
The v1 to v2 improvement spans three simultaneous changes: preprocessing, backbone, and loss function. All three contributed.
| Experiment | Architecture | Loss | Exact Accuracy | Within-1 Accuracy |
|---|---|---|---|---|
| v1 thesis | Custom CNet (unpretrained) | CrossEntropyLoss | 35.0% | N/A |
| v1 thesis | Custom CNet (unpretrained) | HuberLoss | 46.8% | ~65% |
| v2 rebuild | ResNet-18 (ImageNet) | CORAL ordinal | 59.3% ± 0.6% | 70.2% ± 1.1% |
v1 within-1 derived from presentation data: ~35% of misclassified tests were one level off. v2 exact accuracy is from the held-out test set; within-1 is from validation at best-checkpoint epoch.

## The arc: v1 to v2
The original thesis was a two-week sprint with real constraints. Working on a compute-limited laptop, the first challenge was simply fitting the problem in GPU memory: working out a batching scheme for the large raw images took time before training could start. Building the dataloader, scoring metrics, and training loop from scratch took longer than the network itself. The network was an AlexNet variant with augmented front and back layers, not pretrained, because the infrastructure work consumed most of the available time. After 30 epochs it reached 46.8% exact accuracy with an MSE loss of 0.022. About 35% of misclassified tests were only one level off, and 30% were two levels off: the errors clustered near the true value rather than being random. The model also correctly identified invalid tests (no control line) in all but one case.
The v2 rebuild targeted three bottlenecks identified in hindsight: (1) preprocessing, where a better cropping strategy reduced pixel count and removed edge noise the original pipeline left in; (2) backbone, replacing the unpretrained AlexNet with an ImageNet-pretrained ResNet-18 for a head start on low-level feature detection; (3) loss, where CORAL ordinal regression encodes the concentration ordering that both classification and regression formulations ignore.
There was also a methodological bug: the original v2 checkpoint was saved at the epoch with the lowest validation loss, not the highest accuracy. For CORAL ordinal regression the two can diverge significantly: loss can increase while accuracy continues to improve. Fixing the checkpoint criterion to save on peak within-1 accuracy added 25 percentage points to the saved checkpoint's performance, and is itself a lesson about the difference between optimizing what you measure and optimizing what you care about.
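The fix amounts to checkpointing on the target metric rather than on validation loss. A minimal sketch of the selection logic (names and numbers hypothetical; the toy history below mimics loss rising while within-1 accuracy keeps improving):

```python
best_metric = float("-inf")
best_state = None

def maybe_checkpoint(epoch: int, val_within1: float, model_state: dict) -> None:
    """Keep the checkpoint that maximizes within-1 accuracy, not the one
    that minimizes validation loss -- under CORAL the two can diverge."""
    global best_metric, best_state
    if val_within1 > best_metric:
        best_metric = val_within1
        best_state = {"epoch": epoch, "state_dict": dict(model_state)}

# Hypothetical (loss, within-1) history: loss bottoms out at epoch 1,
# but within-1 accuracy peaks at epoch 2 -- the checkpoint we want.
history = [(0.40, 0.55), (0.35, 0.62), (0.42, 0.70)]
for epoch, (val_loss, val_acc) in enumerate(history):
    maybe_checkpoint(epoch, val_acc, {"w": epoch})
```

Selecting on minimum `val_loss` here would have saved epoch 1 and silently thrown away the best-performing weights.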
## What's next
Phase 2 is in progress: cross-dataset generalization to Access Bio cartridge-housed strips (a different manufacturer, with shadow artifacts from the plastic housing). Running a 7-experiment domain adaptation matrix (zero-shot transfer, fine-tuning, combined training) to answer how well an LFA model trained on one test kit generalizes to another. Results will be added when complete.
## What I learned
- CORAL ordinal loss outperformed both classification and regression. Ordering information in the label space is real signal, and formulations that ignore it leave accuracy on the table.
- The biggest early gains came from preprocessing and input normalization, not architecture changes. The data pipeline bottleneck is real and usually underestimated.
- Checkpoint criterion matters as much as architecture. Saving on minimum validation loss while accuracy continues to improve can leave 20+ percentage points on the table. Optimize the checkpoint for the metric you actually care about.
- K-fold CV is non-negotiable on small medical image datasets. Single train/test splits gave wildly different accuracy impressions across folds.
- Designing a robust dataset is its own research contribution. Controlling for read time, device variation, and lighting during collection is what makes generalization results meaningful rather than optimistic.
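The stratification point is mechanical enough to show concretely. A minimal stand-in for a stratified fold assigner (in practice something like scikit-learn's `StratifiedKFold` does this): group samples by class, then deal each class round-robin across folds so every fold sees a near-equal share of each concentration level.

```python
from collections import defaultdict

def stratified_folds(labels: list[int], n_folds: int = 5) -> list[int]:
    """Assign each sample a fold id so every fold gets a near-equal
    share of each class -- a toy stratified k-fold splitter."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        for j, idx in enumerate(indices):  # round-robin within each class
            fold_of[idx] = j % n_folds
    return fold_of

# Two classes, five samples each: every fold ends up with one of each.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
folds = stratified_folds(labels, n_folds=5)
```

Without stratification, a small dataset with 12 classes can easily produce folds where the rarest concentration levels are missing entirely, which is one source of the wildly varying single-split accuracy mentioned above.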
## Links
- GitHub: rdt-cnn (code only, dataset under agreement with Audere)
- Experiment runs on W&B
- Research poster (PDF)
- Lutz Lab, UW Bioengineering