A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
Indexed in: arXiv, DataCite
Abstract
This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts, via an intermediate code…
Citation impact
- Total citations: 353
- FWCI: not available
- Percentile: not available
- References: 0
Authors
13
Topics & keywords
Topics
Keywords
- Computer science
- Semantic reasoner
- Artificial intelligence
- Natural language processing
- Set (abstract data type)
- Automatic summarization
- Machine learning
- Programming language