2026-04 · 8 min · draft

Building Deep Learning Models for FDA Submission

What Changes When the Output Matters

regulatory · deep-learning · FDA · clinical-ai

Most machine learning tutorials end at "the model achieves 0.92 AUC." In clinical AI, that's roughly where the real work begins.

I spent two and a half years at Anumana building ECG-based deep learning algorithms in partnership with Pfizer, working toward what eventually became an FDA Breakthrough Device Designation for early detection of cardiac amyloidosis — an underdiagnosed condition where misfolded proteins damage the heart. Here's what I learned about the gap between research and regulatory reality.

The AUC Is Not the Product

In a research context, you optimize a metric, you report it, you move on. In a regulatory context, you have to justify every decision: why this metric, why this threshold, why this population.

For Breakthrough Device submissions, the FDA cares about clinical utility — does this algorithm change patient outcomes? That means sensitivity at a clinically meaningful threshold matters more than aggregate AUC. You're not just building a classifier; you're defining who gets screened, at what sensitivity, and what happens next.
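The distinction can be made concrete. Here's a minimal sketch of working from a target sensitivity rather than a pooled metric: fix the sensitivity first, then read off the threshold and the specificity that choice implies. The helper names and toy data are mine for illustration, not the actual study code.

```python
import math

def pick_threshold(scores, labels, target_sensitivity=0.90):
    """Highest score threshold at which sensitivity still meets the target.

    Hypothetical helper: sorts positive-class scores and finds the loosest
    cutoff that flags enough true positives.
    """
    pos_scores = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    k = math.ceil(target_sensitivity * len(pos_scores))  # positives that must be flagged
    return pos_scores[k - 1]  # predict positive when score >= threshold

def sens_spec(scores, labels, thr):
    """Sensitivity and specificity at a fixed operating threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < thr)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
    return tp / (tp + fn), tn / (tn + fp)
```

The point of the sketch: the quantity you justify to a regulator is the operating point (sensitivity, specificity, and the clinical action each implies), not the area under a curve that averages over thresholds you'll never use.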

This is where the clinical indication for use becomes critical. Before writing a single line of model code, we had to define: what patient population does this apply to? What clinical action does a positive prediction trigger? What's the acceptable false positive rate given the downstream diagnostic pathway?

These aren't ML questions. They're clinical questions that constrain every technical decision.

Data Collection Determines Everything

The most underrated phase of any clinical AI project is data collection design. At Idoven, I designed Case Report Forms for an AstraZeneca partnership — the actual documents that determine which variables get recorded at each hospital site.

The instinct is to collect everything. The reality is that every additional field introduces noise, missingness, and cross-site inconsistency. A CRF is a data contract: you're committing to a schema that will determine what models you can train six months from now.
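Treating the CRF as a contract means making the schema explicit and machine-checkable. A sketch of what that can look like — field names and allowed values here are hypothetical, not the actual AstraZeneca CRF:

```python
# A CRF schema as an explicit data contract: each field declares whether it is
# required and what values a site may enter. Fields are illustrative only.
CRF_SCHEMA = {
    "patient_id": {"required": True,  "allowed": None},                      # free-form ID
    "ecg_device": {"required": True,  "allowed": {"GE", "Philips", "Mortara"}},
    "nyha_class": {"required": False, "allowed": {"I", "II", "III", "IV"}},
}

def validate_record(record):
    """Return a list of contract violations for one site-entered record."""
    errors = []
    for field, rule in CRF_SCHEMA.items():
        value = record.get(field)
        if value is None:
            if rule["required"]:
                errors.append(f"missing required field: {field}")
            continue
        if rule["allowed"] is not None and value not in rule["allowed"]:
            errors.append(f"invalid value for {field}: {value!r}")
    return errors
```

Running this kind of check at data-entry time, per site, is how you find out in week two — not month six — that one hospital records device names in a format your schema never anticipated.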

Getting this right requires sitting with clinicians and understanding their workflow. Which fields will they actually fill in consistently? Which measurements vary by equipment or protocol? Where does the data entry bottleneck sit?

I've seen projects fail not because the model was wrong, but because the upstream data was collected inconsistently. By the time you realize it, you've burned months.

Multi-Site Validation Is Where Models Go to Die

A model trained on data from one hospital performs well on held-out data from that same hospital. This surprises nobody and convinces no regulator.

The real test is generalization across sites — different equipment, different patient demographics, different clinical protocols. At Anumana, we curated cohorts from multiple US hospitals, each with their own ECG acquisition systems and patient populations.

The model that looked excellent on single-site validation dropped in predictable ways: older machines with different sampling rates, demographic subgroups underrepresented in training data, sites with different lead configurations. Each of these required specific technical solutions — signal resampling, stratified evaluation, subgroup analysis with pre-specified thresholds.
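The resampling step, for instance, is conceptually simple: map recordings from older machines onto the sampling rate the model expects. A minimal sketch using linear interpolation — real pipelines typically use proper polyphase or bandlimited resampling, and the rates here are illustrative:

```python
def resample(signal, src_hz, dst_hz):
    """Resample a 1-D trace from src_hz to dst_hz via linear interpolation.

    Sketch only: production ECG pipelines would use an anti-aliased resampler,
    but the idea is the same -- normalize acquisition rates before inference.
    """
    n_out = (len(signal) - 1) * dst_hz // src_hz + 1
    out = []
    for i in range(n_out):
        x = i * src_hz / dst_hz              # fractional index into the source
        lo = min(int(x), len(signal) - 2)    # left neighbor (clamped at the end)
        frac = x - lo
        out.append(signal[lo] * (1 - frac) + signal[lo + 1] * frac)
    return out
```

Upsampling a toy 250 Hz trace to 500 Hz simply interleaves interpolated midpoints; the important part is that this normalization happens identically in training, validation, and deployment.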

Regulatory bodies expect this. You don't just report overall performance; you report performance across every meaningful subgroup, and you explain any disparities.
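Mechanically, subgroup reporting just means computing your operating-point metrics within each pre-specified stratum instead of pooling. A sketch, with illustrative subgroup labels:

```python
from collections import defaultdict

def subgroup_sensitivity(records, threshold=0.5):
    """Per-subgroup sensitivity at a fixed threshold.

    records: iterable of (subgroup, score, label) tuples. Subgroups would be
    pre-specified in the analysis plan (site, sex, age band, device model).
    """
    tp, pos = defaultdict(int), defaultdict(int)
    for group, score, label in records:
        if label == 1:
            pos[group] += 1
            if score >= threshold:
                tp[group] += 1
    return {g: tp[g] / pos[g] for g in pos}
```

The discipline lies less in the code than in pre-specifying the subgroups and acceptance thresholds before looking at results, so the analysis plan, not the outcome, determines what gets reported.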

What Breakthrough Device Actually Means

The FDA Breakthrough Device Designation is often misunderstood. It's not an approval — it's a designation that grants accelerated review and closer FDA interaction during development.

To qualify, you need to demonstrate that the device provides more effective treatment or diagnosis of a life-threatening or irreversibly debilitating condition, and that it represents a breakthrough compared to existing alternatives.

For cardiac amyloidosis, the case was strong: the condition is massively underdiagnosed, existing diagnostic pathways require specialized imaging or biopsy, and an ECG-based screen could catch patients years earlier. The algorithm we built was designed to flag potential transthyretin amyloid cardiomyopathy (ATTR-CM) from a routine 12-lead ECG — something every patient already gets.

The designation changed how we worked. Instead of building in isolation and submitting a completed package, we had ongoing dialogue with FDA reviewers about study design, endpoint selection, and performance benchmarks.

The Takeaway

If you're a data scientist considering clinical AI: the modeling is maybe 30% of the work. The rest is study design, data governance, clinical collaboration, and regulatory strategy. The most impactful technical decision you'll make isn't which architecture to use — it's how you define your cohort and your clinical endpoint.

The models I'm most proud of aren't the ones with the highest AUC. They're the ones where I was involved in defining what we were measuring and why.


This is a draft. Sections to expand: specific examples of CRF design decisions, more on the Breakthrough Device application process, comparison with European CE mark pathway.