CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare

¹Indian Institute of Technology Patna   ²Mohamed bin Zayed University of Artificial Intelligence

Figure 1. Overview of the CarePilot framework. An Actor–Critic multi-agent architecture governs hierarchical decision-making for long-horizon healthcare workflows. At each step, the Actor observes the current interface, integrates tool-grounding signals and dual-memory context, then predicts the next semantic action. The Critic evaluates outcomes, provides corrective feedback, and updates both memory buffers.


Abstract

Multimodal agentic pipelines are transforming human–computer interaction by enabling efficient and accessible automation of complex, real-world tasks. However, recent efforts have focused on short-horizon or general-purpose applications (e.g., mobile or desktop interfaces), leaving long-horizon automation for domain-specific systems, particularly in healthcare, largely unexplored. To address this, we introduce CareFlow, a high-quality human-annotated benchmark comprising complex, long-horizon software workflows across medical annotation tools, DICOM viewers, EHR systems, and laboratory information systems. On this benchmark, existing vision–language models (VLMs) perform poorly, struggling with long-horizon reasoning and multi-step interactions in medical contexts. To overcome this, we propose CarePilot, a multi-agent framework based on the actor–critic paradigm. The Actor integrates tool grounding with dual-memory mechanisms—long-term and short-term experience—to predict the next semantic action from the visual interface and system state. The Critic evaluates each action, updates memory based on observed effects, and either executes or provides corrective feedback to refine the workflow. Our experiments show that CarePilot achieves state-of-the-art performance, outperforming strong closed-source and open-source multimodal baselines by approximately 15.26% and 3.38% on our benchmark and out-of-distribution dataset, respectively.


How CarePilot Works

CarePilot is a memory- and tool-augmented multi-agent framework built on the actor–critic paradigm. At each timestep t, the system runs through four tightly coupled stages.

🔍

① Tool Grounding

Before predicting any action, CarePilot enriches perception of the GUI using four lightweight tools:

  • UI Object Detection — open-vocabulary bounding-box localization of widgets and panels.
  • Zoom/Crop — magnifies fine-grained controls that are hard to parse at full resolution.
  • OCR — extracts text–box pairs for patient IDs, series names, and LIS codes.
  • Template/Icon Matching — identifies toolbar icons robustly across themes and locales.

All four outputs are fused into a perceptual grounding signal φt that conditions both memory updates and action prediction.
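The fusion step can be sketched as follows. The tool functions here are toy stubs, and the `GroundingSignal` container and its field names are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass, field

# Stub tool calls standing in for the four real grounding tools; a real
# system would run a detector, a cropper, an OCR engine, and a template
# matcher against the current screenshot.
def detect_ui_objects(img): return [("Save button", (10, 10, 60, 30))]
def zoom_crop(img, boxes):  return [f"crop of {label}" for label, _ in boxes]
def run_ocr(img):           return [("Patient ID: 00127", (80, 5, 200, 20))]
def match_icons(img):       return ["zoom-in icon"]

@dataclass
class GroundingSignal:
    """Fused perceptual grounding signal phi_t (hypothetical container)."""
    boxes: list = field(default_factory=list)
    crops: list = field(default_factory=list)
    ocr:   list = field(default_factory=list)
    icons: list = field(default_factory=list)

    def as_prompt(self) -> str:
        # Serialize the fused signal so it can condition both the memory
        # updates and the Actor's action prediction.
        lines  = [f"[UI] {label} at {box}" for label, box in self.boxes]
        lines += [f"[OCR] {text}" for text, _ in self.ocr]
        lines += [f"[ICON] {icon}" for icon in self.icons]
        return "\n".join(lines)

def ground(screenshot) -> GroundingSignal:
    """Run all four tools and fuse their outputs into one signal."""
    boxes = detect_ui_objects(screenshot)
    return GroundingSignal(
        boxes=boxes,
        crops=zoom_crop(screenshot, boxes),
        ocr=run_ocr(screenshot),
        icons=match_icons(screenshot),
    )
```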

🧠

② Dual-Memory Design

Two complementary memories provide immediate and long-range context:

  • Short-Term Memory (STM) — stores the previous screenshot, the action just executed, and the Critic's feedback. Enables rapid correction of local errors.
  • Long-Term Memory (LTM) — maintains a compact rolling summary of the full trajectory updated with tool-grounding features φt. Prevents error accumulation and preserves semantic consistency across many steps.

The Actor always conditions on both memories, giving it situational awareness at every timescale.
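The two-buffer design can be sketched as below; the class and field names are assumptions for illustration:

```python
class DualMemory:
    """Sketch of the STM/LTM split. STM holds only the last step's
    context; LTM keeps a rolling summary of the whole trajectory."""

    def __init__(self):
        self.stm = None   # (screenshot, last_action, critic_feedback)
        self.ltm = []     # one compact summary entry per step

    def update(self, screenshot, action, feedback, grounding_summary):
        # STM is overwritten every step: only the immediately preceding
        # context survives, which enables rapid local error correction.
        self.stm = (screenshot, action, feedback)
        # LTM is appended to (and in practice periodically re-summarized),
        # preserving semantic consistency across many steps.
        self.ltm.append(f"step {len(self.ltm) + 1}: {action} | {grounding_summary}")

    def context(self) -> str:
        # The Actor conditions on both memories at every timestep.
        recent = (f"last action: {self.stm[1]}, feedback: {self.stm[2]}"
                  if self.stm else "none")
        return "LTM: " + "; ".join(self.ltm) + "\nSTM: " + recent
```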

🎭

③ Actor–Critic with Hierarchical Reflection

Both agents share the same VLM backbone (Qwen-VL), differing only in role:

  • Actor — samples the next semantic action from {CLICK, SCROLL, ZOOM, TEXT, SEGMENT, COMPLETE}.
  • Critic — scores the action. If rejected, it triggers three-level hierarchical reflection:
    1. Action Reflector — detects local grounding errors (→ STM).
    2. Trajectory Reflector — diagnoses stalled progress (→ LTM).
    3. Global Reflector — checks full-trajectory goal consistency (→ LTM).
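One timestep of this loop can be sketched as follows; the toy actor and critic below stand in for the two role-prompted copies of the shared VLM backbone, and the verdict/feedback format is an assumption:

```python
ACTIONS = {"CLICK", "SCROLL", "ZOOM", "TEXT", "SEGMENT", "COMPLETE"}

def step(actor, critic, state, memory, max_retries=3):
    """One timestep: the Actor proposes, the Critic accepts or routes
    corrective feedback to the appropriate memory buffer."""
    action = None
    for _ in range(max_retries):
        action = actor(state, memory)             # propose next semantic action
        verdict = critic(state, memory, action)   # accept/reject + reflection level
        if verdict["accept"]:
            return action
        if verdict["level"] == "action":          # Action Reflector -> STM
            memory["stm"] = verdict["feedback"]
        else:                                     # Trajectory/Global Reflector -> LTM
            memory["ltm"].append(verdict["feedback"])
    return action  # fall back to the last proposal after exhausting retries

# Toy stand-ins for the two role-prompted agents:
def toy_actor(state, memory):
    return "CLICK(Save)" if memory.get("stm") else "CLICK(Cancel)"

def toy_critic(state, memory, action):
    if action == "CLICK(Cancel)":
        return {"accept": False, "level": "action", "feedback": "wrong button"}
    return {"accept": True, "level": None, "feedback": None}
```

In this toy run the first proposal is rejected by the Action Reflector, its feedback lands in STM, and the retried proposal is accepted.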
🎓

④ Training & Inference

The Actor is fine-tuned on Critic-augmented trajectories via supervised fine-tuning (SFT), internalizing the Critic's structured reasoning into its weights.

At inference time, only the Actor is retained — given the GUI state, task instruction, and memory context, it directly predicts the next action without invoking the Critic, drastically reducing compute while preserving the quality gains distilled from hierarchical reflection.

Fine-tuned with LoRA (rank 2, lora_alpha 4) on Qwen 2.5 VL-7B / Qwen 3 VL-8B in 4-bit precision on NVIDIA A100 GPUs.
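As a quick numeric illustration of what the stated LoRA setting means: the adapted weight is W′ = W + (α/r)·BA, so with rank r = 2 and α = 4 the low-rank update is scaled by 2, and only A and B are trained. The matrices below are toy values, not the model's:

```python
# Toy illustration of a LoRA update at the stated hyperparameters:
# W' = W + (alpha / r) * (B @ A), with only A and B trained.
r, alpha = 2, 4
scale = alpha / r  # lora_alpha / rank = 2.0

def matmul(X, Y):
    # Plain-Python matrix product, to keep the sketch dependency-free.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d_in, d_out = 3, 2
W = [[1.0] * d_in for _ in range(d_out)]   # frozen base weight (toy values)
A = [[0.1] * d_in for _ in range(r)]       # r x d_in, trained
B = [[0.5] * r for _ in range(d_out)]      # d_out x r, trained
delta = matmul(B, A)                       # low-rank update, d_out x d_in
W_adapted = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
```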

CareFlow Benchmark

🖥 Orthanc 🩻 Weasis 🧊 3D Slicer 📋 OpenEMR 🔬 OpenHospital

CareFlow is the first large-scale, human-annotated benchmark dedicated to long-horizon healthcare software automation. It contains 1,100 tasks (735 train / 315 test / 50 OOD), each paired with a trajectory of 8–24 consecutive GUI screenshots. Every screenshot is labeled with an interface-invariant next action from a six-primitive action space: CLICK, SCROLL, ZOOM, TEXT, SEGMENT, COMPLETE.
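One CareFlow record might look roughly like this; the field names and values are illustrative placeholders, not the released schema:

```python
# Hypothetical shape of a single CareFlow task record.
task = {
    "task_id": "orthanc-0042",   # made-up identifier
    "software": "Orthanc",
    "goal": "Open the study for patient 00127 and export the CT series",
    "trajectory": [
        {"screenshot": "step_01.png", "action": "CLICK(Patient list)"},
        {"screenshot": "step_02.png", "action": "TEXT(00127)"},
        # ... each task has 8-24 labeled steps in total ...
        {"screenshot": "step_12.png", "action": "COMPLETE"},
    ],
}

# Every labeled action must come from the six-primitive action space.
VALID_ACTIONS = {"CLICK", "SCROLL", "ZOOM", "TEXT", "SEGMENT", "COMPLETE"}
assert all(s["action"].split("(")[0] in VALID_ACTIONS
           for s in task["trajectory"])
```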


Figure 2. Example task trajectory from CareFlow. Each task pairs a natural-language goal with a sequence of GUI screenshots representing authentic clinical workflows across DICOM viewers, annotation tools, EMR/EHR, and LIS platforms.


Experimental Results

We evaluate CarePilot against strong open- and closed-source multimodal baselines using two metrics: Step-Wise Accuracy (SWA) — fraction of correct next-action predictions across all steps — and Task Accuracy (TA) — fraction of tasks where every action is predicted correctly in order.
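Under these definitions, the two metrics can be computed as follows (a sketch; the action strings are placeholders):

```python
def step_wise_accuracy(preds, golds):
    """SWA: fraction of correct next-action predictions over all steps.
    `preds` and `golds` are per-task lists of action sequences."""
    correct = sum(p == g
                  for ps, gs in zip(preds, golds)
                  for p, g in zip(ps, gs))
    total = sum(len(gs) for gs in golds)
    return correct / total

def task_accuracy(preds, golds):
    """TA: fraction of tasks whose whole sequence is predicted correctly,
    in order -- a single wrong step fails the entire task."""
    return sum(ps == gs for ps, gs in zip(preds, golds)) / len(golds)

golds = [["CLICK", "TEXT", "COMPLETE"], ["SCROLL", "CLICK"]]
preds = [["CLICK", "TEXT", "COMPLETE"], ["SCROLL", "ZOOM"]]
# 4 of 5 steps are correct, and 1 of 2 tasks is fully correct.
```

TA is therefore much stricter than SWA, which is why the TA columns in the tables below are far lower than the SWA columns.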

Table 1. Results on CareFlow. Each cell reports SWA / TA (%); the last two rows are our method.

| Model | Weasis | 3D Slicer | Orthanc | OpenEMR | Average |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5 VL 32B | 79.95 / 5.13 | 68.00 / 1.90 | 48.62 / 2.42 | 41.60 / 0.32 | 60.72 / 2.43 |
| Llama 3.2 11B | 86.03 / 10.26 | 76.47 / 4.60 | 69.58 / 13.58 | 62.56 / 11.47 | 75.65 / 9.50 |
| Llama 4 Scout | 84.34 / 10.26 | 78.47 / 2.70 | 88.56 / 23.90 | 85.55 / 21.79 | 85.65 / 13.50 |
| Llama 4 Maverick | 88.21 / 18.69 | 71.55 / 3.40 | 84.99 / 27.99 | 77.97 / 25.68 | 80.53 / 19.20 |
| Qwen3 VL 235B | 83.14 / 17.69 | 72.44 / 5.30 | 87.48 / 25.40 | 84.46 / 24.52 | 81.85 / 19.70 |
| Mistral 3.2 VL 24B | 88.15 / 5.13 | 64.81 / 0.67 | 68.44 / 0.79 | 61.43 / 0.00 | 70.65 / 1.67 |
| Nemotron 12B VL | 86.98 / 12.82 | 73.95 / 5.13 | 73.56 / 14.46 | 66.55 / 12.36 | 77.93 / 10.71 |
| GPT-4o | 85.30 / 20.00 | 77.50 / 27.37 | 88.50 / 26.67 | 85.10 / 27.50 | 83.13 / 25.40 |
| GPT-5 | 88.72 / 31.25 | 81.42 / 37.90 | 86.92 / 46.67 | 83.82 / 31.25 | 85.22 / 36.19 |
| Gemini 2.5 Pro | 68.90 / 3.75 | 59.70 / 5.26 | 71.30 / 6.66 | 61.70 / 6.75 | 65.15 / 5.39 |
| CarePilot (Qwen 2.5 VL-7B) | 90.38 / 40.00 | 82.09 / 54.75 | 93.80 / 55.00 | 90.18 / 56.70 | 88.05 / 48.90 |
| CarePilot (Qwen 3 VL-8B) | 92.50 / 48.76 | 88.90 / 54.80 | 91.80 / 56.67 | 87.52 / 46.25 | 90.18 / 51.45 |

Table 2. Out-of-distribution results on OpenHospital (SWA / TA, %); the last two rows are our method.

| Model | SWA | TA |
| --- | --- | --- |
| Qwen2.5 VL 32B | 71.74 | 12.72 |
| Llama 3.2 11B | 70.76 | 16.36 |
| Llama 4 Scout | 72.20 | 20.75 |
| Llama 4 Maverick | 73.71 | 27.27 |
| Qwen3 VL 235B | 75.18 | 25.46 |
| Mistral 3.2 VL 24B | 69.63 | 1.82 |
| Nemotron 12B VL | 72.90 | 18.18 |
| Gemini 2.5 Pro | 73.90 | 18.87 |
| GPT-4o | 74.63 | 25.48 |
| GPT-5 | 79.70 | 34.80 |
| CarePilot (Qwen 2.5 VL-7B) | 77.93 | 36.40 |
| CarePilot (Qwen 3 VL-8B) | 79.27 | 38.18 |

Table 3. Ablation on contextual components (Qwen 2.5 VL-7B). TG = Tool Grounding, LTM = Long-Term Memory, STM = Short-Term Memory; ✓ marks an enabled component.

| TG | LTM | STM | SWA | TA |
| --- | --- | --- | --- | --- |
| ✗ | ✓ | ✓ | 73.20 | 9.37 |
| ✓ | ✗ | ✓ | 82.10 | 23.67 |
| ✓ | ✓ | ✗ | 80.40 | 30.42 |
| ✓ | ✓ | ✓ | 88.05 | 48.90 |

Tool Grounding is the most critical component: removing it drops task accuracy to 9.37%. Removing Long-Term Memory hurts more than removing Short-Term Memory (23.67% vs. 30.42% TA), indicating that long-range trajectory context matters more than the immediately preceding step alone.

BibTeX

@misc{ghosh2026carepilotmultiagentframeworklonghorizon,
      title={CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare}, 
      author={Akash Ghosh and Tajamul Ashraf and Rishu Kumar Singh and Numan Saeed and Sriparna Saha and Xiuying Chen and Salman Khan},
      year={2026},
      eprint={2603.24157},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.24157}, 
}