Beyond Accuracy: Rethinking Model Selection in Health AI

10 April 2026

In recent years, the discourse around artificial intelligence in healthcare has increasingly emphasised model performance, often measured through metrics such as accuracy, AUC, or F1-score. While these metrics are important, they can also be misleading when used in isolation, particularly in high-stakes clinical environments.

In a recent exploratory analysis using a publicly available heart disease dataset (see my project on GitHub), I compared the performance of three commonly used machine learning models: logistic regression, decision trees, and random forests. The results were, at first glance, unsurprising: tree-based models significantly outperformed logistic regression in predictive accuracy, achieving approximately 98% accuracy compared to 79% for logistic regression.
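To make the comparison concrete, the experiment can be sketched roughly as follows. Note that `make_classification` generates synthetic data standing in for the actual heart disease dataset, so the accuracies it prints will not match the 98%/79% figures above.

```python
# Rough sketch of the three-model comparison. Synthetic data is an
# assumption here; the real project used a public heart disease dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.3f}")
```

The same loop structure works for any scikit-learn estimator, which is what makes this kind of head-to-head comparison so easy, and, as argued below, so easy to over-interpret.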

However, a deeper examination reveals why such results should be interpreted with caution. Tree-based models, particularly decision trees, are highly flexible and capable of capturing complex, non-linear relationships within data. This flexibility allows them to achieve high performance on structured datasets, but it also introduces a well-known risk: overfitting, where a model captures dataset-specific patterns that do not generalise to real-world populations. Random forests mitigate this risk by aggregating many decision trees trained on bootstrapped samples, improving robustness and generalisability. Yet even ensemble methods are not immune to deeper structural issues.
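A quick way to see the overfitting risk is to compare training accuracy against held-out accuracy for an unconstrained decision tree. On noisy synthetic data (again an assumption, not the project's dataset), the tree memorises the training set while scoring noticeably lower on unseen examples:

```python
# Illustration of overfitting: a fully grown tree fits its training data
# (near) perfectly, but the gap to held-out accuracy exposes memorisation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise so the training/test gap is visible.
X, y = make_classification(
    n_samples=500, n_features=13, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train: {train_acc:.3f}, test: {test_acc:.3f}")
```

Reporting only the test score hides this gap; reporting only a single train/test split hides how much the gap varies, which is why cross-validation is the usual next step.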

This is where insights from Sociotechnical Challenges in Machine Learning Systems become particularly relevant. A recent paper (link to the full text below) argues that machine learning systems should not be evaluated as isolated technical artefacts, but as components of complex sociotechnical systems. It highlights several key challenges that directly inform how we should interpret model performance in healthcare:

1. Data is Not Neutral

Datasets are shaped by:

  • historical practices

  • institutional processes

  • measurement constraints

In my project, the heart disease dataset represents a simplified and potentially biased snapshot of reality. High model accuracy may therefore reflect patterns specific to that dataset, rather than clinically valid or generalisable insights.

2. Generalisation is Context-Dependent

A key argument in the paper is that model performance does not transfer reliably across contexts due to:

  • population differences

  • clinical practice variation

  • data collection inconsistencies

This directly challenges the assumption that a model achieving 98% accuracy in a controlled dataset would perform similarly in real healthcare environments.
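One deliberately crude way to see this is to score the same model on data shifted away from its training distribution. The constant feature offset below is a stand-in for genuine population differences, not a realistic model of distribution shift:

```python
# Sketch of context-dependence: the same model, scored on in-distribution
# data and on a crudely shifted copy of it. Synthetic data is assumed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

in_dist = model.score(X_test, y_test)
shifted = model.score(X_test + 1.5, y_test)  # crude covariate shift
print(f"in-distribution: {in_dist:.3f}, shifted: {shifted:.3f}")
```

Real population differences are of course far subtler than adding an offset, but even this toy shift typically erodes the headline accuracy, which is the point: a single in-distribution score is not evidence of performance elsewhere.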

3. Human–AI Interaction Matters

The paper highlights that machine learning systems are ultimately used by people, and their effectiveness depends on:

  • clinician trust

  • interpretability

  • usability within workflows

This reinforces the importance of models like logistic regression, which—despite lower accuracy—offer greater transparency and interpretability, making them more suitable for clinical decision-making contexts.
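Part of what makes logistic regression easier to communicate is that each coefficient can be read directly as an odds ratio. A minimal sketch, with hypothetical feature names standing in for the dataset's real columns:

```python
# Interpretability sketch: logistic regression coefficients map to odds
# ratios per feature. Feature names are illustrative, not the dataset's.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=1)
feature_names = ["age", "chol", "thalach", "oldpeak"]  # hypothetical

clf = LogisticRegression(max_iter=1000).fit(X, y)
for name, coef in zip(feature_names, clf.coef_[0]):
    # exp(coef) is the multiplicative change in odds per unit of feature.
    print(f"{name}: odds ratio {np.exp(coef):.2f}")
```

A clinician can sanity-check a table like this against domain knowledge; there is no comparably direct readout for a random forest's hundreds of trees.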

4. Hidden Failure Modes

Another critical insight from the paper is that models can fail in ways that are not visible through standard evaluation metrics.

For example:

  • systematic bias against certain patient groups

  • sensitivity to missing or noisy data

  • unintended consequences in deployment

These risks are not captured by accuracy alone, yet they are central to patient safety.
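At least one of these failure modes, sensitivity to missing data, can be probed directly: degrade the test set and re-score. The sketch below assumes synthetic data and simple mean imputation, not the project's actual pipeline:

```python
# Robustness probe: compare accuracy on clean test data with accuracy
# after masking 30% of feature values. Synthetic data and mean imputation
# are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=800, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
clean_acc = model.score(X_test, y_test)

# Simulate 30% missingness, imputed with training-set feature means.
mask = rng.random(X_test.shape) < 0.3
X_noisy = np.where(mask, X_train.mean(axis=0), X_test)
noisy_acc = model.score(X_noisy, y_test)
print(f"clean: {clean_acc:.3f}, 30% missing: {noisy_acc:.3f}")
```

Probes like this still miss the other two failure modes above; subgroup bias in particular requires per-group evaluation, which in turn requires the sensitive attributes that many datasets do not record.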

Reframing Model Performance

Taken together, these insights fundamentally change how we interpret the model comparison results.

While tree-based models demonstrated superior accuracy, this advantage must be evaluated within a broader context:

  • Are the results generalisable?

  • Are the predictions explainable to clinicians?

  • How does the model behave under real-world data conditions?

  • What are the risks of failure, and who is affected?

A marginal gain in predictive performance may not justify the loss of interpretability, especially when decisions impact diagnosis, treatment, or resource allocation.

Toward a Sociotechnical Evaluation Framework

As both the empirical findings and the referenced research suggest, model evaluation in healthcare must move beyond purely technical metrics toward a multi-dimensional, sociotechnical framework that incorporates:

  • Predictive performance

  • Interpretability and explainability

  • Data quality and representativeness

  • Clinical workflow integration

  • Human factors and usability

  • Ethical and regulatory considerations

The key takeaway is not that complex models should be avoided, but that model selection in healthcare is not purely a technical decision.

It is a governance decision, shaped by the interaction between technology, people, and systems. As highlighted in Sociotechnical Challenges in Machine Learning Systems, the success or failure of AI in healthcare depends less on achieving marginal gains in accuracy and more on how well these systems align with the complex realities of clinical practice.