Vision-language models for medical report generation and visual question answering: a review
Indexed in: Crossref, DOAJ, PubMed
Abstract
Medical vision-language models (VLMs) combine computer vision (CV) and natural language processing (NLP) to analyze visual and textual medical data. Our paper reviews recent advancements in developing VLMs specialized for healthcare, focusing on publicly available models designed for medical report generation and visual question answering (VQA). We provide background on NLP and CV, explaining how techniques from both fields are integrated into VLMs, with visual and language data often fused using Transformer-based architectures to enable effective learning from multimodal data. Key areas we address include the exploration of 18 public medical vision-language datasets, in-depth analyses of the architectures and…
Citation impact
- Total citations: 159
- FWCI: 35.54
- Percentile: 100%
- References: 220
Authors: 2
Topics & keywords
Topics
Keywords
- Computer science
- Data science
- Question answering
- Artificial intelligence
- Key (lock)
- Machine learning
- Human–computer interaction
- Computer security