Title: Referenceless Evaluation of Natural Language Generation from Meaning Representations

Proposal Abstract:

Automatic evaluation of Natural Language Generation (NLG) usually involves comparison to human-authored references, which are treated as ground truth. However, these references often fail to capture the full range of valid outputs for an NLG task, and numerous validation studies have shown that reference-based metrics do not reliably correlate with human judgments. Focusing on tasks that generate English from meaning representations, this dissertation will explore new referenceless metrics for the automatic evaluation of NLG. In particular, inspired by the UCCA-based approach to evaluating grammatical error correction proposed by Choshen and Abend (2018), I will test whether automatic evaluation of adequacy can be improved by parsing system output back into the meaning representation and directly comparing this parse to the original input. This method can be combined with referenceless fluency metrics, such as those proposed by Napoles et al. (2016), to perform an entirely referenceless evaluation.
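The parse-and-compare idea above can be illustrated with a deliberately simplified sketch. Real AMR comparison (e.g., Smatch) must search over variable alignments; the toy function below skips that step and treats each meaning representation as a plain set of (source, relation, target) triples, scoring the parse of the system output against the input MR by triple-overlap F1. The triples and the parser-output scenario are hypothetical, not from the proposal.

```python
# Simplified, Smatch-style adequacy score: F1 overlap between the input MR
# and the MR obtained by parsing the system's output. Unlike real Smatch,
# this ignores variable alignment and assumes triples match exactly.

def triple_f1(input_mr, parsed_mr):
    """F1 overlap between two MRs represented as collections of triples."""
    input_mr, parsed_mr = set(input_mr), set(parsed_mr)
    if not input_mr or not parsed_mr:
        return 0.0
    overlap = len(input_mr & parsed_mr)
    precision = overlap / len(parsed_mr)  # parsed triples present in the input
    recall = overlap / len(input_mr)      # input triples recovered by the parse
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical AMR-like triples for "The boy wants to go."
input_mr = [
    ("w", "instance", "want-01"),
    ("b", "instance", "boy"),
    ("g", "instance", "go-02"),
    ("w", "ARG0", "b"),
    ("w", "ARG1", "g"),
    ("g", "ARG0", "b"),
]
# Parse of a system output that dropped the control relation (g, ARG0, b).
parsed_mr = input_mr[:-1]

print(round(triple_f1(input_mr, parsed_mr), 3))  # → 0.909
```

A lower score here signals that the output failed to realize part of the input meaning, which is exactly the adequacy signal a reference-based metric can miss when the references themselves are incomplete.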

In addition to developing and validating new metrics, the dissertation will involve creating a new dataset of human judgments of AMR generation that can be used to validate evaluation metrics. I will also experiment with applications of the new metrics beyond system ranking, including reranking candidate outputs to improve quality, supporting error analysis, and supporting post-editing by highlighting potential adequacy errors.
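The reranking application can be sketched in a few lines. The interpolation weight, score names, and candidate scores below are all illustrative assumptions, not part of the proposal: the idea is simply that, given referenceless adequacy and fluency scores for each candidate generation, a system can output the candidate maximizing a weighted combination rather than its original top hypothesis.

```python
# Hypothetical metric-based reranking: choose the candidate generation that
# maximizes alpha * adequacy + (1 - alpha) * fluency. Scores are illustrative.

def rerank(candidates, alpha=0.5):
    """Return candidates sorted best-first by the combined referenceless score."""
    return sorted(
        candidates,
        key=lambda c: alpha * c["adequacy"] + (1 - alpha) * c["fluency"],
        reverse=True,
    )

candidates = [
    {"text": "The boy wants to leave.", "adequacy": 0.78, "fluency": 0.95},
    {"text": "The boy wants to go.",    "adequacy": 0.96, "fluency": 0.93},
    {"text": "Boy want go.",            "adequacy": 0.90, "fluency": 0.40},
]

best = rerank(candidates)[0]
print(best["text"])  # → The boy wants to go.
```

The same per-candidate scores could also drive the other proposed applications, e.g., flagging low-adequacy outputs for post-editing.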