The Data Pipeline Nobody Talks About
Why 80% of Clinical AI Is Data Work
Every clinical AI paper describes the model. Almost none describe how the data got there.
I've spent five years working with clinical data from hospitals across the US and Europe: ECG signals, electronic health records, clinical trial databases. The single biggest determinant of model quality is not the architecture, the optimizer, or the training schedule. It's the pipeline that takes raw hospital data and turns it into something a model can learn from.
Here's what that actually looks like.
Hospital Data Is Not a CSV
When people picture clinical data, they imagine tidy tables with columns like "age," "diagnosis," "outcome." What you actually receive is closer to a digital archaeological site.
ECG files come in proprietary formats that vary by manufacturer. Patient records are scattered across systems that don't talk to each other. Diagnosis codes are entered by humans who abbreviate, misspell, and use local conventions. Timestamps are unreliable. Duplicate records are everywhere.
At Anumana, I built cohorts from multi-site data using Spark SQL because the volumes made pandas infeasible. But the real challenge wasn't scale — it was reconciliation. The same patient might appear with different identifiers across hospital systems. The same diagnosis might be coded differently depending on whether it was entered by a cardiologist or a primary care physician.
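To make the reconciliation problem concrete, here is a minimal sketch of collapsing records whose identifiers differ only in formatting. The record layout and normalization rules are illustrative assumptions, not the production logic (which also has to handle genuinely conflicting identifiers, not just formatting noise):

```python
# Sketch: reconciling patient records that carry differently formatted
# identifiers across hospital systems. Layout and rules are illustrative.
from collections import defaultdict

def normalize_id(raw_id: str) -> str:
    """Collapse formatting differences: case, whitespace, separators."""
    return "".join(ch for ch in raw_id.upper() if ch.isalnum())

def reconcile(records):
    """Group records whose identifiers normalize to the same key."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize_id(rec["patient_id"])].append(rec)
    return dict(groups)

records = [
    {"patient_id": "ab-1234", "site": "cardiology"},
    {"patient_id": "AB 1234", "site": "primary_care"},
    {"patient_id": "XY-9999", "site": "cardiology"},
]
merged = reconcile(records)
# "ab-1234" and "AB 1234" collapse to the same patient key, "AB1234".
```

In practice this is the easy half; the hard half is deciding, with clinical input, when two near-matching records are the same person.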
Reproducibility As Infrastructure
At Idoven, I built what I'd describe as a fully reproducible data pipeline — a package that could ingest very dirty, multi-source clinical data and output clean, version-controlled datasets with complete audit trails.
The design principles were simple:
- Every transformation is logged and reversible
- Patient stratification (train/validation/test splits) is deterministic and version-controlled, preventing data leakage across experiments
- Output datasets carry metadata about their provenance — which source files, which filters, which exclusion criteria
This pipeline achieved a 4.4x processing speedup over the previous workflow, but the speed wasn't the point. The point was that when a collaborator asked "which patients are in this cohort and why?", the answer was a reproducible script, not a researcher's memory.
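The deterministic-split principle above can be sketched in a few lines: assign each patient to a set by hashing a stable patient key, so the assignment never changes between runs or machines and a patient can never drift across sets between experiments. The 80/10/10 ratios and key format here are illustrative assumptions, not the pipeline's actual configuration:

```python
# Sketch: deterministic, order-independent patient stratification.
# Hashing the patient key (rather than shuffling) means adding or
# removing other patients never changes anyone's assignment.
import hashlib

def assign_split(patient_id: str, ratios=(0.8, 0.1, 0.1)) -> str:
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # stable value in [0, 1)
    if bucket < ratios[0]:
        return "train"
    if bucket < ratios[0] + ratios[1]:
        return "validation"
    return "test"

# The same patient lands in the same set, on any machine, in any run.
assert assign_split("patient-001") == assign_split("patient-001")
```

Splitting by patient key rather than by record is what prevents leakage: a patient with five ECGs contributes all five to one set.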
Why CRFs Deserve More Respect
A Case Report Form (CRF) is the document that defines what data gets collected during a clinical study. Most data scientists never see one. I think that's a mistake.
At Idoven, I designed CRFs for the AstraZeneca partnership. The exercise forced a level of precision that ML practitioners rarely engage with: what exactly is the definition of "symptom onset"? Which ECG interval measurements should be manual vs. automated? What constitutes a "complete" patient record?
Every ambiguity in a CRF becomes noise in your dataset. Every optional field becomes a pattern of missingness that your model will learn from — often in ways you don't want.
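One cheap way to catch this early is to tabulate missingness per field, split by outcome. If an "optional" field is empty far more often in one outcome group, the model can learn the missingness pattern itself. The field names and records below are illustrative assumptions, not data from any study:

```python
# Sketch: per-field missingness rates, split by outcome, to surface
# informative missingness before training. Records are illustrative.
def missing_rate(records, field):
    return sum(1 for r in records if r.get(field) is None) / len(records)

def missingness_by_outcome(records, fields):
    report = {}
    for outcome in {r["outcome"] for r in records}:
        group = [r for r in records if r["outcome"] == outcome]
        report[outcome] = {f: missing_rate(group, f) for f in fields}
    return report

records = [
    {"outcome": "event", "symptom_onset": None},
    {"outcome": "event", "symptom_onset": None},
    {"outcome": "no_event", "symptom_onset": "2021-03-01"},
    {"outcome": "no_event", "symptom_onset": None},
]
report = missingness_by_outcome(records, ["symptom_onset"])
# Here "symptom_onset" is missing in 100% of event patients but only
# 50% of non-event patients: a pattern a model will happily exploit.
```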
The best CRFs are designed backward from the analysis plan. If you know what model you're going to train and what subgroup analyses you need, you can design a collection protocol that ensures you'll have the data to support it.
Signal Processing Before Deep Learning
ECGs are not images, even though they look like squiggly lines. They're time series with specific physical meaning: each wave and interval corresponds to a cardiac event. The signal varies by hardware (sampling rate, number of leads, filtering), patient factors (electrode placement, body habitus, movement artifacts), and clinical context.
Before any model training, there's a signal processing layer:
- Resampling to a standard rate
- Baseline wander removal
- Quality filtering (reject noisy signals that could mislead the model)
- Standardization across hardware vendors
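Two of the steps above can be sketched with NumPy: resampling via linear interpolation and baseline wander removal via sliding-median subtraction. The 0.6 s window and target rate are illustrative assumptions; a production pipeline would typically use proper polyphase resampling and a high-pass filter instead:

```python
# Sketch: resampling and baseline wander removal. Parameters are
# illustrative, not production values.
import numpy as np

def resample(signal, fs_in, fs_out):
    """Linearly interpolate a 1-D signal from fs_in Hz to fs_out Hz."""
    duration = len(signal) / fs_in
    t_in = np.arange(len(signal)) / fs_in
    t_out = np.arange(0, duration, 1 / fs_out)
    return np.interp(t_out, t_in, signal)

def remove_baseline_wander(signal, fs, window_s=0.6):
    """Subtract a sliding-median estimate of slow baseline drift."""
    win = max(1, int(window_s * fs))
    pad = np.pad(signal, (win // 2, win - win // 2 - 1), mode="edge")
    baseline = np.array(
        [np.median(pad[i:i + win]) for i in range(len(signal))]
    )
    return signal - baseline
```

The median window is chosen longer than a QRS complex so the estimate tracks slow drift without flattening the waves that carry diagnostic meaning.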
I built anomaly detection methods using Mahalanobis distance for ECG quality filtering — identifying signals that were statistical outliers not because of pathology but because of acquisition artifacts. Getting this wrong means your model learns to detect bad electrodes instead of disease.
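A minimal sketch of the Mahalanobis idea, assuming each ECG has already been reduced to a feature vector (noise power, drift, amplitude statistics, and so on). The threshold and the use of a pseudo-inverse are illustrative choices, not the production values:

```python
# Sketch: flag ECG feature vectors that are statistical outliers
# relative to the cohort, via Mahalanobis distance. Illustrative only.
import numpy as np

def mahalanobis_outliers(features, threshold=3.0):
    """Return a boolean mask over rows of `features` (n_signals, n_features)
    marking those whose Mahalanobis distance from the mean exceeds threshold."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates degenerate features
    diff = features - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(d2) > threshold
```

Unlike per-feature thresholds, the covariance term catches signals that are individually plausible on every feature but jointly implausible, which is exactly how electrode artifacts tend to present.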
The 80% Nobody Publishes
In my experience, roughly 80% of the effort in clinical AI goes into data work. The remaining 20% — architecture selection, training, evaluation — is what gets written up in papers.
This creates a distorted picture of what the field actually needs. We don't need more novel architectures for ECG analysis. We need better tooling for data provenance, better standards for cross-site data harmonization, and more people who understand both the clinical and technical sides of data collection.
The most valuable thing I bring to a project isn't my ability to train a Transformer. It's knowing which questions to ask about the data before writing any code.
This is a draft. Sections to expand: specific examples of data reconciliation challenges, comparison of ECG formats across manufacturers, the relationship between data quality and regulatory submission readiness.