Detecting and Preventing Hallucinations in Large Vision Language Models
Meso Scale Discovery (United States)
Abstract
Instruction-tuned Large Vision Language Models (LVLMs) have significantly advanced in generalizing across a diverse set of multi-modal tasks, especially Visual Question Answering (VQA). However, generating detailed responses that are visually grounded remains challenging for these models. We find that even the output of a current state-of-the-art LVLM (InstructBLIP) still contains a staggering 30 percent hallucinatory text in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a Multimodal Hallucination Detection Dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect…
Citation impact
- FWCI: 62.25
- Percentile: 100%
- References: 27
Authors
- 3
Topics & keywords
- Visual Hallucination
- Psychology
- Artificial intelligence
- Cognitive psychology
- Computer science
- Computer vision
- Psychiatry
- Quality Education