Preprint · arXiv (Cornell University) · Mar 25, 2016 · Green OA

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Indexed in: arXiv, DataCite

Abstract

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
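As a rough illustration of the kind of comparison the abstract describes, the sketch below scores each generated response against a single target response with a word-overlap metric, then correlates those scores with human ratings. The abstract does not name a specific metric or correlation statistic, so clipped unigram precision and Pearson correlation are stand-ins chosen for simplicity. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: share of candidate tokens matched in the reference."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(cand_tokens)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Toy data, invented for illustration only (NOT from the paper):
# (generated response, single target response, human rating on a 1-5 scale)
examples = [
    ("i am not sure", "try restarting the service", 2.0),
    ("try restarting the service", "try restarting the service", 5.0),
    ("you could reboot it", "try restarting the service", 4.0),
    ("the service the service", "try restarting the service", 1.0),
]

metric_scores = [unigram_precision(gen, ref) for gen, ref, _ in examples]
human_scores = [rating for _, _, rating in examples]
print("metric scores:", metric_scores)
print("correlation with human ratings:", pearson(metric_scores, human_scores))
```

The third toy example shows why a single-reference overlap metric can mislead: a reasonable paraphrase shares no words with the target and scores zero, while the fourth, a degenerate response, still earns partial credit.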

Citation impact

614 total citations
41 references

Authors

6

Topics & keywords

Keywords
  • Computer science
  • Domain (mathematical analysis)
  • Task (project management)
  • Artificial intelligence
  • Machine translation
  • Machine learning
  • Open domain
  • Strengths and weaknesses