Figure 1. Overview of the MedFocusLeak framework. The attack first generates a targeted adversarial text defining the malicious diagnostic objective, then jointly optimizes image and text perturbations confined to non-diagnostic background regions. An attention-shift loss explicitly redirects the model's visual focus toward the perturbed background, causing it to produce confident yet clinically incorrect diagnoses.
Vision-Language Models (VLMs) are increasingly used in clinical diagnostics, but their robustness to adversarial attacks remains largely unexplored, posing serious risks. Existing medical image attacks mostly pursue secondary goals such as model stealing or adversarial fine-tuning, while transferable attacks designed for natural images fail because they introduce visible distortions that clinicians can easily detect.
To address this, we propose MedFocusLeak, a novel and highly transferable black-box multimodal attack that forces incorrect medical diagnoses while ensuring perturbations remain imperceptible. The approach strategically introduces synergistic perturbations into non-diagnostic background regions of a medical image and uses an Attention-Distract loss to deliberately shift the model's diagnostic focus away from pathological areas.
Through comprehensive evaluations on 6 distinct medical imaging modalities, we demonstrate that MedFocusLeak attains state-of-the-art effectiveness, producing adversarial examples that elicit plausible but incorrect diagnostic outputs across a range of VLMs, including GPT-5 and Gemini 2.5 Pro Thinking. We also propose a novel evaluation framework with new metrics that capture both the success of misleading text generation and the preservation of image quality in a single score.
MedFocusLeak corrupts a model's internal visual focus rather than just its output, forcing attention onto adversarially perturbed background regions while ignoring pathological evidence. Four tightly coupled stages drive the attack.
Given a medical image and clinical prompt, an attacker first crafts a targeted adversarial prompt using GPT-4o to produce a plausible but incorrect diagnosis. The adversarial output preserves the image's primary modality (e.g., "X-ray") while altering the reported clinical findings — for example, escalating a benign nodule to a malignant mass.
This adversarial text becomes the semantic anchor that steers all subsequent optimization.
A blank white seed image is initialized with the adversarial text rendered as an overlay to establish explicit cross-modal correspondence. Optimization then alternates between modalities: with the text perturbation fixed, the image perturbation is updated, and with the image fixed, the adversarial text is refined.
This alternating search continues until both perturbations converge to a stable adversarial configuration aligned with the target wrong diagnosis.
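The alternating scheme can be sketched as a generic coordinate-descent loop. Here `loss_fn`, `step_image`, and `step_text` are hypothetical placeholders standing in for the surrogate-ensemble objective and the per-modality update rules; this is a minimal sketch, not the paper's actual code:

```python
def alternating_optimize(image_delta, text_tokens, loss_fn,
                         step_image, step_text,
                         max_iters=300, tol=1e-4):
    """Alternate image/text updates until the joint loss stabilizes.

    step_image: updates the image perturbation with the text held fixed.
    step_text:  updates the text perturbation with the image held fixed.
    """
    prev_loss = float("inf")
    for _ in range(max_iters):
        image_delta = step_image(image_delta, text_tokens)  # text fixed
        text_tokens = step_text(image_delta, text_tokens)   # image fixed
        loss = loss_fn(image_delta, text_tokens)
        if abs(prev_loss - loss) < tol:  # converged to a stable configuration
            break
        prev_loss = loss
    return image_delta, text_tokens
```

With simple gradient-style update closures, the loop converges to the joint minimizer of a toy separable objective, mirroring how each modality is refined while the other is frozen.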
MedSAM is used to segment and isolate the diagnostically critical foreground region. From the remaining background, the top-k largest square patches are identified via dynamic programming, and adversarial perturbations are then iteratively generated exclusively within these non-critical patches.
This ensures the core clinical content (the pathology itself) is left completely untouched.
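The patch-selection step can be sketched with the classic maximal-square dynamic program; the exact top-k rule (for instance, whether overlapping candidate squares are allowed) is an assumption here:

```python
def largest_background_squares(mask, k):
    """Find the k largest axis-aligned squares lying entirely in background.

    mask: 2D list, 1 = background (editable), 0 = diagnostic foreground.
    dp[i][j] = side of the largest all-background square whose bottom-right
    corner is (i, j); the classic maximal-square recurrence.
    Returns (row, col, side) tuples for bottom-right corners, largest first.
    Note: candidates may overlap; deduplication is left out of this sketch.
    """
    h, w = len(mask), len(mask[0])
    dp = [[0] * w for _ in range(h)]
    candidates = []
    for i in range(h):
        for j in range(w):
            if mask[i][j]:
                dp[i][j] = 1 if i == 0 or j == 0 else \
                    1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
                candidates.append((dp[i][j], i, j))
    candidates.sort(reverse=True)  # biggest squares first
    return [(i, j, s) for s, i, j in candidates[:k]]
```

The recurrence visits each pixel once, so selection is O(HW) per image regardless of how many patches are requested.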
Embedding adversarial signals in the background alone is insufficient, as models still anchor predictions on foreground evidence. An auxiliary Attention-Distract Loss intervenes directly, penalizing attention mass on the pathological foreground and rewarding attention on the perturbed background regions.
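One plausible form of such a loss, written here as an assumption rather than the paper's exact formulation, contrasts normalized attention mass on the foreground against mass on the background:

```python
def attention_distract_loss(attn, fg_mask, eps=1e-8):
    """Hypothetical Attention-Distract loss sketch.

    attn:    2D list of non-negative attention weights (H x W).
    fg_mask: 2D list, 1 on diagnostic foreground pixels, 0 on background.
    Minimizing (foreground mass - background mass) pushes the model's
    visual focus toward the perturbed background.
    """
    total = sum(sum(row) for row in attn) + eps  # normalize to a distribution
    fg = sum(a * m for a_row, m_row in zip(attn, fg_mask)
             for a, m in zip(a_row, m_row)) / total
    bg = sum(a * (1 - m) for a_row, m_row in zip(attn, fg_mask)
             for a, m in zip(a_row, m_row)) / total
    return fg - bg  # lower = more attention diverted off the pathology
```

The loss ranges over [-1, 1]: it reaches -1 when all attention sits on background and +1 when all of it sits on the pathology.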
We assembled a dataset of 1,000 medical images with ground-truth findings drawn from MIMIC-CXR, SkinCAP, and MedTrinity, spanning seven imaging modalities and ten anatomical body parts. Transferability is assessed across six VLMs: InternVL-8B, QwenVL-7B, BioMedLlama, Gemini 2.5 Pro, MedVLM-R1, and GPT-5.
The attack is optimized for 300 iterations with an ℓ∞ perturbation budget ε = 16/255, step size 1/255, and k = 10 background patches selected via dynamic programming. The surrogate ensemble comprises four CLIP variants: ViT-L/14-336, ViT-B/16, ViT-B/32, and LAION ViT-G/14.
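With these hyperparameters, one masked ℓ∞ projected-gradient step might look like the following sketch (pixels as flat float lists for illustration; `patch_mask` marks the selected background patches and is a hypothetical name, not the paper's API):

```python
def pgd_step(x_adv, x_orig, grad, eps=16 / 255, alpha=1 / 255,
             patch_mask=None):
    """One masked l-infinity PGD step with the budgets stated above.

    grad:       loss gradient w.r.t. the adversarial image.
    patch_mask: 1.0 where updates are allowed (background patches),
                0.0 on the protected diagnostic foreground.
    """
    if patch_mask is None:
        patch_mask = [1.0] * len(x_adv)
    out = []
    for xa, xo, g, m in zip(x_adv, x_orig, grad, patch_mask):
        sign = 1.0 if g > 0 else -1.0 if g < 0 else 0.0
        xn = xa - alpha * sign * m             # signed step down the loss
        xn = min(max(xn, xo - eps), xo + eps)  # project into the eps-ball
        xn = min(max(xn, 0.0), 1.0)            # clamp to valid pixel range
        out.append(xn)
    return out
```

The projection guarantees the perturbation never exceeds 16/255 per pixel, and the mask keeps foreground pixels bit-identical to the original image.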
Figure 2. Qualitative analysis of diagnostic misdirection. Original findings (top) vs. adversarial findings (bottom) for brain MRI cases. Correct clinical tokens are highlighted in green and adversarially altered tokens in red. Note that the adversarial image remains visually indistinguishable from the original.
MedFocusLeak consistently outperforms all five baselines (AttackVLM, AttackBard, AnyAttack, M-Attack, and FOA-Attack) across all models and modalities. In the paper's tables, rows for our method are highlighted in red, and numbers in blue indicate statistically significant improvements (paired t-test, p < 0.05).
Table 1. Performance of different attacks (MTR / AvgSim / MAS) across six VLMs. Red rows = MedFocusLeak.
| Attack | InternVL-8B MTR | AvgSim | MAS | QwenVL-7B MTR | AvgSim | MAS | BioMedLlama MTR | AvgSim | MAS |
|---|---|---|---|---|---|---|---|---|---|
| AttackBard | 0.550 | 0.680 | 0.370 | 0.590 | 0.680 | 0.400 | 0.620 | 0.680 | 0.420 |
| AnyAttack | 0.540 | 0.790 | 0.420 | 0.660 | 0.790 | 0.520 | 0.570 | 0.790 | 0.450 |
| AttackVLM | 0.630 | 0.830 | 0.520 | 0.630 | 0.830 | 0.520 | 0.620 | 0.830 | 0.510 |
| M-Attack | 0.690 | 0.750 | 0.518 | 0.660 | 0.750 | 0.490 | 0.560 | 0.750 | 0.420 |
| FOA-Attack | 0.630 | 0.590 | 0.370 | 0.640 | 0.590 | 0.370 | 0.590 | 0.590 | 0.340 |
| MedFocusLeak (Ours) | 0.790 | 0.850 | 0.670 | 0.750 | 0.850 | 0.630 | 0.680 | 0.850 | 0.570 |

| Attack | Gemini 2.5 Pro MTR | AvgSim | MAS | MedVLM-R1 MTR | AvgSim | MAS | GPT-5 MTR | AvgSim | MAS |
|---|---|---|---|---|---|---|---|---|---|
| AttackBard | 0.350 | 0.680 | 0.230 | 0.290 | 0.680 | 0.190 | 0.370 | 0.680 | 0.250 |
| AnyAttack | 0.410 | 0.790 | 0.320 | 0.350 | 0.790 | 0.270 | 0.390 | 0.790 | 0.300 |
| AttackVLM | 0.330 | 0.830 | 0.270 | 0.320 | 0.830 | 0.266 | 0.400 | 0.830 | 0.330 |
| M-Attack | 0.310 | 0.750 | 0.240 | 0.330 | 0.750 | 0.233 | 0.340 | 0.750 | 0.220 |
| FOA-Attack | 0.160 | 0.590 | 0.094 | 0.290 | 0.590 | 0.170 | 0.070 | 0.590 | 0.041 |
| MedFocusLeak (Ours) | 0.480 | 0.850 | 0.400 | 0.400 | 0.850 | 0.340 | 0.480 | 0.850 | 0.400 |
Table 2. Ablation study across QwenVL, Gemini 2.5 Pro Thinking, and MedVLM-R1.
| Setting | QwenVL-7B MTR | AvgSim | MAS | Gemini 2.5 Pro MTR | AvgSim | MAS | MedVLM-R1 MTR | AvgSim | MAS |
|---|---|---|---|---|---|---|---|---|---|
| Image perturbation only | 0.47 | 0.79 | 0.37 | 0.26 | 0.79 | 0.20 | 0.28 | 0.79 | 0.22 |
| Text perturbation only | 0.62 | 0.81 | 0.50 | 0.37 | 0.81 | 0.30 | 0.38 | 0.81 | 0.30 |
| Without attention shift | 0.55 | 0.88 | 0.48 | 0.27 | 0.88 | 0.24 | 0.30 | 0.88 | 0.26 |
| ε = 4 | 0.43 | 0.92 | 0.39 | 0.33 | 0.92 | 0.30 | 0.25 | 0.92 | 0.23 |
| ε = 8 | 0.57 | 0.88 | 0.50 | 0.34 | 0.88 | 0.30 | 0.29 | 0.88 | 0.26 |
| MedFocusLeak (ε = 16) | 0.74 | 0.85 | 0.63 | 0.48 | 0.85 | 0.40 | 0.39 | 0.87 | 0.33 |
Table 3. Human evaluation by 3 certified medical interns (supervised by a senior expert). Scale: 1–5.
| Method | ATI ↑ | IQP ↑ | OHAS ↑ |
|---|---|---|---|
| M-Attack | 3.1 | 3.1 | 3.2 |
| FOA-Attack | 3.3 | 1.5 | 2.8 |
| MedFocusLeak (Ours) | 3.94 | 3.5 | 3.75 |
@inproceedings{ghosh2026medfocusleak,
title = {When Background Matters: Breaking Medical Vision Language Models by Transferable Attack},
author = {Akash Ghosh and Subhadip Baidya and Sriparna Saha and Xiuying Chen},
booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
year = {2026},
url = {https://arxiv.org/abs/0000.00000},
}