In this work, we propose a causal representation learning framework for learning disentangled and intervenable high-level explanations in text. Our approach is grounded in a key desideratum of explainability: explanations should identify a high-level abstraction over the inputs that conveys the essence of model decisions. Importantly, these high-level factors can be realized in text with different tokens, yet model behavior remains invariant across these choices. Furthermore, an effective intervention should change all relevant realizations of a factor simultaneously. This is challenging, particularly without manual intervention, because it requires disentangling the high-level causal factors and intervening on each of them as a unified entity. To learn these factors, we employ a representation learning approach that bottlenecks large language model representations, identifying the factors that distinguish different model predictions and discarding all other information. To ensure disentangled representations, we pose an identifiability criterion under which the disentangled causal factors in text can be provably recovered. We conduct an empirical evaluation on a semi-synthetic income prediction task, demonstrating the efficacy of our approach in recovering disentangled causal factors. Finally, using our approach, we can intervene in latent space to automatically generate counterfactuals.
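To make the bottlenecking idea concrete, the following is a minimal sketch, not the implementation used in this work: frozen large language model representations are projected onto a small number of latent factor dimensions, and those factors alone are trained to reproduce the target model's predictions, so that prediction-irrelevant information is discarded. All names (`FactorBottleneck`, `num_factors`, the placeholder tensors) are illustrative assumptions, and the disentanglement criterion and text-level counterfactual generation are omitted.

```python
import torch
import torch.nn as nn

class FactorBottleneck(nn.Module):
    """Compress LLM representations into a few latent factors that predict the model's decision."""
    def __init__(self, llm_dim: int = 768, num_factors: int = 4, num_classes: int = 2):
        super().__init__()
        # Project the high-dimensional LLM representation onto a small set of factor slots.
        self.encoder = nn.Linear(llm_dim, num_factors)
        # Reconstruct the target model's decision from the bottlenecked factors only.
        self.head = nn.Linear(num_factors, num_classes)

    def forward(self, llm_repr: torch.Tensor):
        z = self.encoder(llm_repr)   # low-dimensional factor estimates
        logits = self.head(z)        # decision predicted from the factors alone
        return z, logits

# Training sketch: match the target model's predictions through the bottleneck.
model = FactorBottleneck()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
llm_reprs = torch.randn(32, 768)            # placeholder for frozen LLM embeddings
target_preds = torch.randint(0, 2, (32,))   # placeholder for the model's own predictions

for _ in range(100):
    z, logits = model(llm_reprs)
    loss = nn.functional.cross_entropy(logits, target_preds)
    optim.zero_grad()
    loss.backward()
    optim.step()

# Latent intervention sketch: shift one learned factor and re-decode the prediction.
# In the full framework, such a latent edit would be mapped back to a textual counterfactual.
with torch.no_grad():
    z, _ = model(llm_reprs[:1])
    z_intervened = z.clone()
    z_intervened[0, 0] += 1.0               # hypothetical do-intervention on factor 0
    counterfactual_logits = model.head(z_intervened)
```

This sketch illustrates only the supervised bottleneck and a single-coordinate latent intervention; it does not include the identifiability constraints that guarantee the recovered factors are disentangled.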