Vision-language models for medical report generation and visual question answering: a review

Hartsock, Iryna; Rasool, Ghulam

doi:10.3389/frai.2024.1430984

Review | Frontiers in Artificial Intelligence | Nov 19, 2024 | Gold OA

Iryna Hartsock, Ghulam Rasool

Moffitt Cancer Center

Indexed in: Crossref, DOAJ, PubMed

Abstract

Medical vision-language models (VLMs) combine computer vision (CV) and natural language processing (NLP) to analyze visual and textual medical data. Our paper reviews recent advancements in developing VLMs specialized for healthcare, focusing on publicly available models designed for medical report generation and visual question answering (VQA). We provide background on NLP and CV, explaining how techniques from both fields are integrated into VLMs, with visual and language data often fused using Transformer-based architectures to enable effective learning from multimodal data. Key areas we address include the exploration of 18 public medical vision-language datasets, in-depth analyses of the architectures and…

Citation impact

Total citations: 159

FWCI: 35.54
Percentile: 100%
References: 220

Authors: 2

Topics & keywords

Keywords

Computer science
Data science
Question answering
Artificial intelligence
Key (lock)
Machine learning
Human–computer interaction
Computer security

Funding

National Science Foundation
Awards: 1903466, 2234468, 2234836