How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Liu, Chia‐Wei; Lowe, Ryan; Serban, Iulian Vlad; Noseworthy, Mike; Charlin, Laurent; Pineau, Joëlle

doi:10.18653/v1/d16-1230

Article · Jan 1, 2016 · Gold OA

McGill University · Université de Montréal · +1 more institution

Indexed in Crossref

Abstract

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
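As a rough illustration of what "correlating a metric with human judgements" can mean in practice, the sketch below scores each generated response against a single target response with a simple word-overlap measure and then computes a rank correlation with human ratings. This is not the authors' code. The clipped unigram precision is only a simplified stand-in for the machine-translation-style metrics the abstract mentions, the choice of Spearman's rank correlation is an assumption, and the example responses and ratings are hypothetical toy data, not results from the paper.

```python
from collections import Counter
from math import sqrt


def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens found in the reference, with clipped counts.

    A simplified stand-in for word-overlap metrics; it is not BLEU.
    """
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    return overlap / len(cand)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy data: (generated response, target response, human rating 1-5).
examples = [
    ("i am not sure", "try restarting the service", 2),
    ("try restarting the service", "try restarting the service", 5),
    ("reboot it and see", "try restarting the service", 4),
    ("the the the service", "try restarting the service", 1),
]
metric_scores = [unigram_precision(g, r) for g, r, _ in examples]
human_scores = [h for _, _, h in examples]
rho = spearman(metric_scores, human_scores)
print(f"Spearman correlation between metric and human ratings: {rho:.3f}")
```

In the toy data, the third response is a reasonable paraphrase that shares no words with the target and so scores zero, while the fourth repeats words from the target and scores higher despite being a poor reply. Single-reference overlap metrics can diverge from human ratings in this way.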

Citation impact

908

total citations

FWCI: 123.68
Percentile: 100%
References: 51

Authors

6

Chia‐Wei Liu (corresponding author)
McGill University

Ryan Lowe
McGill University

Iulian Vlad Serban
Université de Montréal, University of Monterrey

Mike Noseworthy
McGill University

Laurent Charlin
McGill University

Topics & keywords

Keywords

Computer science
Empirical research
Artificial intelligence
Machine learning
Natural language processing
Statistics
Mathematics
