End-to-end interactive learning of dialogue systems has been all but abandoned in favour of approaches that use more labelled data, such as dialogue state tracking. A major issue with this approach is that using language models as speaker and listener can lead to
``language drift.'' Models are trained only to optimize a task objective, so their intermediate language can drift from pretrained natural language to an unnatural communication protocol. We reproduce previous work on tackling this phenomenon and find that baseline methods are not as bad as reported. Furthermore, we use a simple KL regularization with an EMA model to stabilize RL training and outperform previous methods. Finally, we investigate the issue of ``language drift'' and find that it focuses only on the sender. We argue that ``receiver drift'' is equally important and show strong results on this novel metric.