2026-04 · 9 min · draft

From CNN to Vision Transformer: Choosing Architectures for ECG Analysis

What the Benchmarks Won't Tell You

deep-learning · architectures · ECG · transformers

ECGs are deceptively simple inputs. Twelve channels, a few seconds of signal, a fixed sampling rate. Compared to radiology or pathology, the raw data is tiny. And yet the architecture choice matters enormously — not just for accuracy, but for what the model actually learns.

At Idoven, I trained and compared 9 deep learning architectures on 20,754 ECGs from 2,901 patients for cardiac amyloidosis detection — the full results are published in Heart Rhythm (2026). At Anumana before that, I worked on CNNs, Transformers, and multimodal models for different cardiac conditions. Here's what I found about what works, what doesn't, and what the benchmarks won't tell you.

Why ECGs Are Interesting From an ML Perspective

A 12-lead ECG is a 12-channel time series, typically sampled at 250-500 Hz over 10 seconds. That's 2,500-5,000 samples per lead — small enough to fit in memory, large enough to contain real signal.

The clinical information is encoded at multiple scales:

  • Local features: P-wave morphology, QRS duration, ST-segment changes — these are millisecond-level patterns
  • Global features: QT interval, heart rate variability, axis deviation — these span the full recording
  • Cross-lead relationships: patterns that appear in specific lead combinations reveal the spatial origin of cardiac abnormalities

This multi-scale, multi-channel structure is what makes architecture selection non-trivial. A model that excels at local feature extraction (CNN) may miss global patterns. A model that captures long-range dependencies (Transformer) may struggle with the fine-grained morphological features that cardiologists rely on.

The Architectures

Standard CNN: The workhorse. 1D convolutions across the time axis, often with some form of multi-lead aggregation. Fast to train, easy to interpret via gradient-based attribution. Limitation: the fixed receptive field means that capturing long-range dependencies requires very deep networks, aggressive striding, or dilated convolutions.
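To make the receptive-field limitation concrete, here's a small sketch (pure Python, framework-agnostic; the kernel sizes and strides are illustrative, not the ones I used) of how the receptive field of stacked 1D convolutions grows:

```python
def receptive_field(layers):
    """Receptive field (in input samples) of stacked 1D convolutions.
    layers: list of (kernel_size, stride) tuples, applied in order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) * current step
        jump *= s             # striding compounds the step between input samples
    return rf

# Ten stride-1 layers with kernel 7 grow the field only linearly:
receptive_field([(7, 1)] * 10)  # 61 samples, ~0.12 s at 500 Hz
# Ten stride-2 layers grow it geometrically and cover a full 10 s lead:
receptive_field([(7, 2)] * 10)  # 6139 samples
```

This is why plain stride-1 CNNs need extreme depth to relate, say, a P wave to a T wave several seconds later, while strided or dilated variants get there cheaply.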

ResNet variants: Skip connections help with depth, and ResNet-based architectures have been competitive in most ECG benchmarks. The architecture I've used most frequently for initial baselines.

CNN-LSTM hybrids: The idea is to use CNNs for local feature extraction and LSTMs for temporal modeling. In practice, the LSTM component adds training complexity and the gains over deeper CNNs are inconsistent.

Vision Transformer (ViT): Treating the ECG as a sequence of patches and using self-attention. Strong at capturing long-range dependencies and cross-lead interactions. Requires more data and longer training. At Idoven, we explored this for ATTR detection alongside CNNs and ResNets.
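A minimal sketch of the patching step, assuming a (leads, samples) array and a hypothetical token layout where each token is one patch of one lead (other layouts, e.g. patches spanning all leads, are equally common):

```python
import numpy as np

def patchify(ecg, patch_len=50):
    """Split a (leads, samples) ECG into a token sequence for a
    ViT-style encoder. Each token is one flattened patch of one lead;
    a linear projection and positional/lead embeddings would follow."""
    leads, samples = ecg.shape
    n = samples // patch_len
    # Drop any trailing remainder, then carve each lead into n patches.
    tokens = ecg[:, : n * patch_len].reshape(leads, n, patch_len)
    return tokens.reshape(leads * n, patch_len)

ecg = np.random.randn(12, 5000).astype(np.float32)
tokens = patchify(ecg)  # shape (1200, 50): 12 leads x 100 patches
```

Self-attention over these 1,200 tokens is what lets the model relate any patch to any other, including across leads — the cross-lead relationships described above.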

Multimodal models: Combining ECG signals with tabular clinical data (age, sex, comorbidities). This is where things get interesting clinically, because the same ECG pattern means different things in different patient contexts. We used concatenation at various fusion points.
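The simplest fusion point is late fusion: concatenate the ECG encoder's embedding with the tabular features just before the classifier head. A minimal sketch (the dimensions and feature names are hypothetical):

```python
import numpy as np

def late_fusion(ecg_embedding, clinical):
    """Late fusion: concatenate the ECG branch's embedding with
    tabular clinical features; a classifier head would consume this."""
    return np.concatenate([ecg_embedding, clinical], axis=-1)

emb = np.zeros(256, dtype=np.float32)                # ECG encoder output
tab = np.array([0.72, 1.0, 0.0], dtype=np.float32)   # e.g. scaled age, sex one-hot
fused = late_fusion(emb, tab)                        # shape (259,)
```

Earlier fusion points (injecting clinical features into intermediate layers) let the network condition its ECG features on patient context, at the cost of a more entangled architecture.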

What the Benchmarks Won't Tell You

On public ECG datasets, the performance differences between architectures are often small. A well-tuned ResNet and a well-tuned ViT will often land within a point or two of each other on AUC.

The differences show up in deployment-relevant ways:

  • Calibration: some architectures produce better-calibrated probabilities, which matters for clinical decision-making
  • Subgroup robustness: attention-based models sometimes generalize better across demographic subgroups, possibly because they learn more flexible representations
  • Inference speed: matters for real-time clinical systems; a CNN is typically 5-10x faster than a Transformer at inference
  • Data efficiency: CNNs learn reasonable features with less data; Transformers need more examples to outperform
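Calibration is cheap to measure and worth checking for every candidate model. A minimal sketch of Expected Calibration Error (ECE), one common summary metric (binning scheme and bin count are conventional choices, not the only ones):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap between
    mean predicted probability and observed positive rate per bin,
    weighted by bin occupancy."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return ece

# A perfectly calibrated toy example: 20% positives among 0.2 predictions.
probs = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
labels = np.array([1, 0, 0, 0, 0])
expected_calibration_error(probs, labels)  # 0.0
```

Two models with identical AUC can have very different ECE; for a clinical threshold-based workflow, the better-calibrated one is usually the safer choice.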

For ATTR-CM detection specifically, the signal is subtle — many patients with the condition have ECGs that look normal to a human reader. The final model (a Time-Slicing CNN) achieved AUC 0.88 on the test set, with sensitivity of 80.7% and specificity of 78.5%. Notably, it maintained informative performance in early-stage disease (asymptomatic sensitivity 68.4%) and in paced ECGs — a first for ATTR detection.

Bayesian Optimization for Hyperparameter Tuning

With multiple architectures to compare, hyperparameter search becomes a bottleneck. I used Bayesian optimization to systematically explore the space: learning rate, batch size, augmentation parameters, architecture-specific settings (number of heads for ViT, kernel sizes for CNN, hidden dimensions for LSTM).

The key insight is that architecture comparisons are only valid when each architecture is given a fair shot at hyperparameter optimization. A badly tuned Transformer will lose to a well-tuned CNN, and the conclusion "CNNs are better for ECGs" would be wrong.

The search strategy matters: I used Bayesian optimization rather than random search because the evaluation budget was constrained — each training run used significant GPU time, and clinical validation required specific data splits that couldn't be arbitrarily expanded.

Ensemble and Meta-Ensemble Methods

Single models have ceilings. At Anumana, we used ensemble methods — combining predictions from multiple models trained with different initializations, architectures, or data subsets.

The meta-ensemble approach goes further: training a second-level model that learns how to optimally combine the outputs of base models. This is where architectural diversity pays off — a meta-ensemble of a CNN, a Transformer, and a CNN-LSTM captures complementary patterns.
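A minimal stacking sketch with synthetic data (the base-model outputs here are simulated; in practice the meta-model must be fit on out-of-fold predictions to avoid leakage, which this toy example glosses over):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated validation-set probabilities from three base models
# (standing in for a CNN, a Transformer, and a CNN-LSTM).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
base_preds = np.column_stack([
    np.clip(y * 0.7 + rng.normal(0.15, 0.2, 500), 0, 1)
    for _ in range(3)
])

# Second-level model: learns how to weight the base models' outputs.
meta = LogisticRegression().fit(base_preds, y)
ensemble_prob = meta.predict_proba(base_preds)[:, 1]
```

The learned coefficients are also diagnostic: if the meta-model puts near-zero weight on one base model, that model is adding regulatory burden without adding signal.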

The tradeoff is complexity. For a regulated medical device, every model in the ensemble needs its own documentation, validation, and version control. A single well-validated model is often preferred by regulatory teams over a complex ensemble, even if the ensemble has slightly better performance.

What I'd Recommend

For a new ECG classification project: start with a ResNet baseline, get your data pipeline and evaluation framework solid, then explore Transformers if you have sufficient data. Invest time in Bayesian hyperparameter search before concluding one architecture is better than another. And don't neglect the multimodal component — clinical context often matters more than the marginal architecture improvement.


This is a draft. Sections to expand: specific architecture diagrams, training curves comparing convergence, more on the Bayesian optimization setup (Optuna vs. others), discussion of foundation models for ECG (emerging area).