Detecting and Preventing Hallucinations in Large Vision Language Models

Meso Scale Discovery (United States)

Indexed in Crossref

Abstract

Instruction-tuned Large Vision Language Models (LVLMs) have advanced significantly in generalizing across a diverse set of multimodal tasks, especially Visual Question Answering (VQA). However, generating detailed responses that are visually grounded remains challenging for these models. We find that even the current state-of-the-art LVLMs (InstructBLIP) still produce output in which a staggering 30 percent of the text is hallucinatory, in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a Multimodal Hallucination Detection Dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect…

Citation impact

Total citations: 122
FWCI: 62.25
Percentile: 100%
References: 27

Authors

3

Topics & keywords

Keywords
  • Visual Hallucination
  • Psychology
  • Artificial intelligence
  • Cognitive psychology
  • Computer science
  • Computer vision
  • Psychiatry
UN Sustainable Development Goals
  • Quality Education