Figure 1. Overview of the MedFocusLeak framework. The attack first generates a targeted adversarial text defining the malicious diagnostic objective, then jointly optimizes image and text perturbations confined to non-diagnostic background regions. An attention-shift loss explicitly redirects the model's visual focus toward the perturbed background, causing it to produce confident yet clinically incorrect diagnoses.
Vision-Language Models (VLMs) are increasingly used in clinical diagnostics, but their robustness to adversarial attacks remains largely unexplored, posing serious risks. Existing medical image attacks mostly pursue secondary goals such as model stealing or adversarial fine-tuning, while transferable attacks designed for natural images fail because they introduce visible distortions that clinicians can easily detect.
To address this, we propose MedFocusLeak, a novel and highly transferable black-box multimodal attack that forces incorrect medical diagnoses while ensuring perturbations remain imperceptible. The approach strategically introduces synergistic perturbations into non-diagnostic background regions of a medical image and uses an Attention-Distract loss to deliberately shift the model's diagnostic focus away from pathological areas.
Through comprehensive evaluations on 6 distinct medical imaging modalities, we demonstrate that MedFocusLeak attains state-of-the-art effectiveness, producing adversarial examples that elicit plausible but incorrect diagnostic outputs across a range of VLMs, including GPT-5 and Gemini 2.5 Pro Thinking. We also propose a novel evaluation framework with new metrics that capture both the success of misleading text generation and the preservation of image quality in a single score.
MedFocusLeak corrupts a model's internal visual focus rather than just its output, forcing attention onto adversarially perturbed background regions while ignoring pathological evidence. Four tightly coupled stages drive the attack.
Given a medical image and clinical prompt, an attacker first crafts a targeted adversarial prompt using GPT-4o to produce a plausible but incorrect diagnosis. The adversarial output preserves the image's primary modality (e.g., "X-ray") while altering the reported clinical findings — for example, escalating a benign nodule to a malignant mass.
This adversarial text becomes the semantic anchor that steers all subsequent optimization.
A blank white seed image is initialized with the adversarial text rendered as an overlay to establish explicit cross-modal correspondence. Optimization then alternates between modalities: with the text perturbation fixed, the image perturbation is updated, and with the image fixed, the adversarial text is refined.
This alternating search continues until both perturbations converge to a stable adversarial configuration aligned with the target wrong diagnosis.
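The alternating scheme can be sketched as a generic coordinate-descent loop. Here `loss_fn`, `step_image`, and `step_text` are hypothetical placeholders standing in for the surrogate-ensemble objective and the per-modality update rules; this is a minimal sketch, not the paper's actual code:

```python
def alternating_optimize(image_delta, text_tokens, loss_fn,
                         step_image, step_text,
                         max_iters=300, tol=1e-4):
    """Alternate image/text updates until the joint loss stabilizes.

    step_image: updates the image perturbation with the text held fixed.
    step_text:  updates the text perturbation with the image held fixed.
    """
    prev_loss = float("inf")
    for _ in range(max_iters):
        image_delta = step_image(image_delta, text_tokens)  # text fixed
        text_tokens = step_text(image_delta, text_tokens)   # image fixed
        loss = loss_fn(image_delta, text_tokens)
        if abs(prev_loss - loss) < tol:  # converged to a stable configuration
            break
        prev_loss = loss
    return image_delta, text_tokens
```

With simple gradient-style update closures, the loop converges to the joint minimizer of a toy separable objective, mirroring how each modality is refined while the other is frozen.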
MedSAM is used to segment and isolate the diagnostically critical foreground region. From the remaining background, the top-k largest square patches are identified via dynamic programming, and adversarial perturbations are then iteratively generated exclusively within these non-critical patches.
This ensures the core clinical content (the pathology itself) is left completely untouched.
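The patch-selection step can be sketched with the classic maximal-square dynamic program; the exact top-k rule (for instance, whether overlapping candidate squares are allowed) is an assumption here:

```python
def largest_background_squares(mask, k):
    """Find the k largest axis-aligned squares lying entirely in background.

    mask: 2D list, 1 = background (editable), 0 = diagnostic foreground.
    dp[i][j] = side of the largest all-background square whose bottom-right
    corner is (i, j); the classic maximal-square recurrence.
    Returns (row, col, side) tuples for bottom-right corners, largest first.
    Note: candidates may overlap; deduplication is left out of this sketch.
    """
    h, w = len(mask), len(mask[0])
    dp = [[0] * w for _ in range(h)]
    candidates = []
    for i in range(h):
        for j in range(w):
            if mask[i][j]:
                dp[i][j] = 1 if i == 0 or j == 0 else \
                    1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
                candidates.append((dp[i][j], i, j))
    candidates.sort(reverse=True)  # biggest squares first
    return [(i, j, s) for s, i, j in candidates[:k]]
```

The recurrence visits each pixel once, so selection is O(HW) per image regardless of how many patches are requested.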
Embedding adversarial signals in the background alone is insufficient, as models still anchor predictions on foreground evidence. An auxiliary Attention-Distract Loss intervenes directly, penalizing attention mass on the pathological foreground and rewarding attention on the perturbed background regions.
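One plausible form of such a loss, written here as an assumption rather than the paper's exact formulation, contrasts normalized attention mass on the foreground against mass on the background:

```python
def attention_distract_loss(attn, fg_mask, eps=1e-8):
    """Hypothetical Attention-Distract loss sketch.

    attn:    2D list of non-negative attention weights (H x W).
    fg_mask: 2D list, 1 on diagnostic foreground pixels, 0 on background.
    Minimizing (foreground mass - background mass) pushes the model's
    visual focus toward the perturbed background.
    """
    total = sum(sum(row) for row in attn) + eps  # normalize to a distribution
    fg = sum(a * m for a_row, m_row in zip(attn, fg_mask)
             for a, m in zip(a_row, m_row)) / total
    bg = sum(a * (1 - m) for a_row, m_row in zip(attn, fg_mask)
             for a, m in zip(a_row, m_row)) / total
    return fg - bg  # lower = more attention diverted off the pathology
```

The loss ranges over [-1, 1]: it reaches -1 when all attention sits on background and +1 when all of it sits on the pathology.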
We assembled a dataset of 1,000 medical images with ground-truth findings drawn from MIMIC-CXR, SkinCAP, and MedTrinity, spanning seven imaging modalities and ten anatomical body parts. Transferability is assessed across six VLMs: InternVL-8B, QwenVL-7B, BioMedLlama, Gemini 2.5 Pro, MedVLM-R1, and GPT-5.
The attack is optimized for 300 iterations with an ℓ∞ perturbation budget ε = 16/255, step size 1/255, and k = 10 background patches selected via dynamic programming. The surrogate ensemble comprises four CLIP variants: ViT-L/14-336, ViT-B/16, ViT-B/32, and LAION ViT-G/14.
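With these hyperparameters, one masked ℓ∞ projected-gradient step might look like the following sketch (pixels as flat float lists for illustration; `patch_mask` marks the selected background patches and is a hypothetical name, not the paper's API):

```python
def pgd_step(x_adv, x_orig, grad, eps=16 / 255, alpha=1 / 255,
             patch_mask=None):
    """One masked l-infinity PGD step with the budgets stated above.

    grad:       loss gradient w.r.t. the adversarial image.
    patch_mask: 1.0 where updates are allowed (background patches),
                0.0 on the protected diagnostic foreground.
    """
    if patch_mask is None:
        patch_mask = [1.0] * len(x_adv)
    out = []
    for xa, xo, g, m in zip(x_adv, x_orig, grad, patch_mask):
        sign = 1.0 if g > 0 else -1.0 if g < 0 else 0.0
        xn = xa - alpha * sign * m             # signed step down the loss
        xn = min(max(xn, xo - eps), xo + eps)  # project into the eps-ball
        xn = min(max(xn, 0.0), 1.0)            # clamp to valid pixel range
        out.append(xn)
    return out
```

The projection guarantees the perturbation never exceeds 16/255 per pixel, and the mask keeps foreground pixels bit-identical to the original image.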
Figure 2. Qualitative analysis of diagnostic misdirection. Original findings (top) vs. adversarial findings (bottom) for brain MRI cases. Correct clinical tokens are highlighted in green and adversarially altered tokens in red. Note that the adversarial image remains visually indistinguishable from the original.
MedFocusLeak consistently outperforms all five baselines (AttackVLM, AttackBard, AnyAttack, M-Attack, and FOA-Attack) across all models and modalities. In the paper's tables, rows for our method are highlighted in red, and numbers in blue indicate statistically significant improvements (paired t-test, p < 0.05).
Table 1. Performance of different attacks (MTR / AvgSim / MAS) across six VLMs. Red rows = MedFocusLeak.
| Attack | InternVL-8B MTR | AvgSim | MAS | QwenVL-7B MTR | AvgSim | MAS | BioMedLlama MTR | AvgSim | MAS |
|---|---|---|---|---|---|---|---|---|---|
| AttackBard | 0.550 | 0.680 | 0.370 | 0.590 | 0.680 | 0.400 | 0.620 | 0.680 | 0.420 |
| AnyAttack | 0.540 | 0.790 | 0.420 | 0.660 | 0.790 | 0.520 | 0.570 | 0.790 | 0.450 |
| AttackVLM | 0.630 | 0.830 | 0.520 | 0.630 | 0.830 | 0.520 | 0.620 | 0.830 | 0.510 |
| M-Attack | 0.690 | 0.750 | 0.518 | 0.660 | 0.750 | 0.490 | 0.560 | 0.750 | 0.420 |
| FOA-Attack | 0.630 | 0.590 | 0.370 | 0.640 | 0.590 | 0.370 | 0.590 | 0.590 | 0.340 |
| MedFocusLeak (Ours) | 0.790 | 0.850 | 0.670 | 0.750 | 0.850 | 0.630 | 0.680 | 0.850 | 0.570 |

| Attack | Gemini 2.5 Pro MTR | AvgSim | MAS | MedVLM-R1 MTR | AvgSim | MAS | GPT-5 MTR | AvgSim | MAS |
|---|---|---|---|---|---|---|---|---|---|
| AttackBard | 0.350 | 0.680 | 0.230 | 0.290 | 0.680 | 0.190 | 0.370 | 0.680 | 0.250 |
| AnyAttack | 0.410 | 0.790 | 0.320 | 0.350 | 0.790 | 0.270 | 0.390 | 0.790 | 0.300 |
| AttackVLM | 0.330 | 0.830 | 0.270 | 0.320 | 0.830 | 0.266 | 0.400 | 0.830 | 0.330 |
| M-Attack | 0.310 | 0.750 | 0.240 | 0.330 | 0.750 | 0.233 | 0.340 | 0.750 | 0.220 |
| FOA-Attack | 0.160 | 0.590 | 0.094 | 0.290 | 0.590 | 0.170 | 0.070 | 0.590 | 0.041 |
| MedFocusLeak (Ours) | 0.480 | 0.850 | 0.400 | 0.400 | 0.850 | 0.340 | 0.480 | 0.850 | 0.400 |
Table 2. Ablation study across QwenVL, Gemini 2.5 Pro Thinking, and MedVLM-R1.
| Setting | QwenVL-7B MTR | AvgSim | MAS | Gemini 2.5 Pro MTR | AvgSim | MAS | MedVLM-R1 MTR | AvgSim | MAS |
|---|---|---|---|---|---|---|---|---|---|
| Image perturbation only | 0.47 | 0.79 | 0.37 | 0.26 | 0.79 | 0.20 | 0.28 | 0.79 | 0.22 |
| Text perturbation only | 0.62 | 0.81 | 0.50 | 0.37 | 0.81 | 0.30 | 0.38 | 0.81 | 0.30 |
| Without attention shift | 0.55 | 0.88 | 0.48 | 0.27 | 0.88 | 0.24 | 0.30 | 0.88 | 0.26 |
| ε = 4 | 0.43 | 0.92 | 0.39 | 0.33 | 0.92 | 0.30 | 0.25 | 0.92 | 0.23 |
| ε = 8 | 0.57 | 0.88 | 0.50 | 0.34 | 0.88 | 0.30 | 0.29 | 0.88 | 0.26 |
| MedFocusLeak (ε = 16) | 0.74 | 0.85 | 0.63 | 0.48 | 0.85 | 0.40 | 0.39 | 0.87 | 0.33 |
Table 3. Human evaluation by 3 certified medical interns (supervised by a senior expert). Scale: 1–5.
| Method | ATI ↑ | IQP ↑ | OHAS ↑ |
|---|---|---|---|
| M-Attack | 3.1 | 3.1 | 3.2 |
| FOA-Attack | 3.3 | 1.5 | 2.8 |
| MedFocusLeak (Ours) | 3.94 | 3.5 | 3.75 |
@inproceedings{ghosh2026medfocusleak,
title = {When Background Matters: Breaking Medical Vision Language Models by Transferable Attack},
author = {Akash Ghosh and Subhadip Baidya and Sriparna Saha and Xiuying Chen},
booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
year = {2026},
url = {https://arxiv.org/abs/0000.00000},
}