ServiceNow Research

The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents

Abstract

We introduce the StatCan Dialogue Dataset consisting of 4967 conversations between agents working at Statistics Canada and online users looking for published data tables. The conversations stem from genuine intents, are held in either English or French, and lead to agents retrieving one of over 5000 complex data tables. Based on this dataset, we propose two tasks: (1) automatic retrieval of relevant tables based on an ongoing conversation, and (2) automatic generation of appropriate agent responses at each turn. We investigate the difficulty of each task by establishing strong baselines. Our experiments on a temporal data split reveal that all models struggle to generalize to future conversations, as we observe a significant drop in performance across both tasks when we move from the validation to the test set. In addition, we find that response generation models struggle to decide when to return a table. Thus, the introduced tasks pose significant challenges to existing models, and we encourage the community to further develop models that can assist agents and users of Statistics Canada.

Publication
European Chapter of the Association for Computational Linguistics (EACL)
Siva Reddy
Research Scientist

Research Scientist at Human Machine Interaction Through Language, located in Montreal, QC, Canada.

Harm de Vries
Research Lead

Research Lead at Large Language Models Lab, located in Amsterdam, Holland.