News

Invited Talks and Lectures

Research
These days, most of my time goes into thinking about how brains do credit assignment through time.
For RL problems, I'd like to figure out a method that estimates the gradient of the reward with respect
to the action probabilities in a way that mimics some of the fundamental properties of
backprop. Backprop works by *composing* local estimates of effects (Jacobians).
More generally, I'm interested in unsupervised exploration and in training generative models! :)


A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms
Yoshua Bengio,
Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk,
Anirudh Goyal,
Christopher Pal
arXiv
ICML'19 submission
We propose to meta-learn causal structures based on how fast a learner adapts to new distributions arising from sparse distributional changes, e.g. due to interventions, actions of agents and other sources of non-stationarities. We show that under this assumption, the correct causal structural choices lead to faster adaptation to modified distributions because the changes are concentrated in one or just a few mechanisms when the learned knowledge is modularized appropriately. This leads to sparse expected gradients and a lower effective number of degrees of freedom needing to be relearned while adapting to the change. It motivates using the speed of adaptation to a modified distribution as a meta-learning objective. We demonstrate how this can be used to determine the cause-effect relationship between two observed variables. The distributional changes do not need to correspond to standard interventions (clamping a variable), and the learner has no direct knowledge of these interventions. We show that causal structures can be parameterized via continuous variables and learned end-to-end. We then explore how these ideas could be used to also learn an encoder that would map low-level observed variables to unobserved causal variables, leading to faster adaptation out-of-distribution, learning a representation space where one can satisfy the assumptions of independent mechanisms and of small and sparse changes in these mechanisms due to actions and non-stationarities.
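As a toy illustration of the core assumption (not from the paper's code; variable sizes and names are invented here), the sketch below builds a discrete two-variable ground truth A → B, applies an intervention on the cause's marginal, and checks that under the correct factorization only p(A) changes, while under the anti-causal factorization both factors change and hence more would need to be relearned:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10  # number of discrete values per variable

# Ground truth: A -> B, with random categorical parameters.
p_a = rng.dirichlet(np.ones(N))
p_b_given_a = rng.dirichlet(np.ones(N), size=N)  # row a is p(B | A=a)
joint = p_a[:, None] * p_b_given_a               # p(A, B)

# Intervention: replace the marginal of the cause, keep the mechanism.
p_a_new = rng.dirichlet(np.ones(N))
joint_new = p_a_new[:, None] * p_b_given_a

def factors(joint):
    """Both factorizations of a joint: (p(A), p(B|A)) and (p(B), p(A|B))."""
    pa, pb = joint.sum(1), joint.sum(0)
    return pa, joint / pa[:, None], pb, (joint / pb[None, :]).T

pa0, pba0, pb0, pab0 = factors(joint)
pa1, pba1, pb1, pab1 = factors(joint_new)

# Correct model A -> B: only the cause marginal moved.
assert np.allclose(pba0, pba1)      # mechanism p(B|A) unchanged
assert not np.allclose(pa0, pa1)    # marginal p(A) changed

# Anti-causal model B -> A: both factors moved -> more to re-learn.
assert not np.allclose(pb0, pb1)
assert not np.allclose(pab0, pab1)
```

Adaptation speed after the change thus discriminates the two hypotheses: the correct one concentrates the update in a single factor.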


Maximum Entropy Generators for Energy-Based Models
Rithesh Kumar, Anirudh Goyal, Aaron Courville, Yoshua Bengio
ICML'19 submission
Unsupervised learning is about capturing dependencies between variables and is driven by the contrast between the probable vs. improbable configurations of these variables, often either via a generative model that only samples probable ones or with an energy function (unnormalized log-density) that is low for probable ones and high for improbable ones. Here, we consider learning both an energy function and an efficient approximate sampling mechanism. Whereas the discriminator in generative adversarial networks (GANs) learns to separate data and generator samples, introducing an entropy maximization regularizer on the generator can turn the interpretation of the critic into an energy function, which separates the training distribution from everything else, and thus can be used for tasks like anomaly or novelty detection. Then, we show how Markov Chain Monte Carlo can be done in the generator latent space whose samples can be mapped to data space, producing better samples. These samples are used for the negative phase gradient required to estimate the log-likelihood gradient of the data space energy function. To maximize entropy at the output of the generator, we take advantage of recently introduced neural estimators of mutual information. We find that in addition to producing a useful scoring function for anomaly detection, the resulting approach produces sharp samples while covering the modes well, leading to high Inception and Fréchet scores.


Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future
Nan Rosemary Ke, Amanpreet Singh, Ahmed Touati,
Anirudh Goyal,
Yoshua Bengio, Devi Parikh, Dhruv Batra
arXiv
International Conference on Learning Representations (ICLR), 2019 (Poster)
In model-based reinforcement learning, the agent interleaves between model learning and planning. These two components are inextricably intertwined. If the model is not able to provide sensible long-term prediction, the executed planner would exploit model flaws, which can yield catastrophic failures. This paper focuses on building a model that reasons about the long-term future and demonstrates how to use this for efficient planning and exploration. To this end, we build a latent-variable autoregressive model by leveraging recent ideas in variational inference. We argue that forcing latent variables to carry future information through an auxiliary task substantially improves long-term predictions. Moreover, by planning in the latent space, the planner's solution is ensured to be within regions where the model is valid. An exploration strategy can be devised by searching for unlikely trajectories under the model. Our method achieves higher reward faster compared to baselines on a variety of tasks and environments in both the imitation learning and model-based reinforcement learning settings.


InfoBot: Transfer and Exploration via the Information Bottleneck
Anirudh Goyal,
Riashat Islam,
DJ Strouse,
Zafarali Ahmed,
Matthew Botvinick,
Hugo Larochelle,
Yoshua Bengio,
Sergey Levine,
arXiv
International Conference on Learning Representations (ICLR), 2019 (Poster)
A central challenge in reinforcement learning is discovering effective policies for tasks where rewards are sparsely distributed. We postulate that in the absence of useful reward signals, an effective exploration strategy should seek out *decision states*. These states lie at critical junctions in the state space from where the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned policy with an information bottleneck, we can identify decision states by examining where the model actually leverages the goal state. We find that this simple mechanism effectively identifies decision states, even in partially observed settings. In effect, the model learns the sensory cues that correlate with potential subgoals. In new environments, this model can then identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and through new regions of the state space.
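A minimal sketch of the decision-state idea, with an invented hand-coded "encoder" standing in for the learned goal-conditioned policy: in a corridor the next action is forced, so the latent ignores the goal; at a junction the latent shifts with the goal, and the KL to a goal-agnostic prior flags the state as a decision state.

```python
import numpy as np

def gaussian_kl(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ) for scalar Gaussians."""
    return (np.log(sig_p / sig_q)
            + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5)

# Hypothetical encoder p(Z|S,G): returns (mean, std) of the latent.
def encoder(state, goal):
    if state == "corridor":                      # action forced: ignore goal
        return 0.0, 1.0
    return (2.0 if goal == "left" else -2.0), 1.0  # junction: goal matters

prior = (0.0, 1.0)  # goal-agnostic prior p(Z)

# Decision states are where the policy actually uses the goal, i.e.
# where the average KL between encoder and prior is large.
kl_by_state = {
    s: np.mean([gaussian_kl(*encoder(s, g), *prior)
                for g in ("left", "right")])
    for s in ("corridor", "junction")
}
print(kl_by_state)  # junction has much higher KL than corridor
```

During exploration, visiting high-KL states can then be rewarded as a bonus, steering the agent toward potential subgoals.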


Recall Traces: Backtracking Models for Efficient Reinforcement Learning
Anirudh Goyal,
Philemon Brakel,
William Fedus,
Soumye Singhal,
Timothy Lillicrap,
Sergey Levine,
Hugo Larochelle,
Yoshua Bengio,
arXiv
International Conference on Learning Representations (ICLR), 2019 (Poster)
In many environments only a tiny subset of all states yield high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state. We can train a model which, starting from a high-value state (or one that is estimated to have high value), predicts and samples which (state, action) tuples may have led to that high-value state. These traces of (state, action) pairs, which we refer to as Recall Traces, sampled from this backtracking model starting from a high-value state, are informative as they terminate in good states, and hence we can use these traces to improve a policy. We provide a variational interpretation for this idea and a practical algorithm in which the backtracking model samples from an approximate posterior distribution over trajectories which lead to large rewards. Our method improves the sample efficiency of both on-policy and off-policy RL algorithms across several environments and tasks.
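The backtracking idea can be sketched on a toy chain MDP (dynamics and names invented for illustration): an empirical backward model proposes (state, action) pairs that may have led to the high-reward state, and walking it backward yields a recall trace of valid transitions terminating at the goal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deterministic chain MDP: states 0..9, actions +1/-1, reward at 9.
N_STATES, GOAL = 10, 9

def step(s, a):
    return min(max(s + a, 0), N_STATES - 1)

# Empirical backtracking model p(s_prev, a | s_next) from random rollouts.
counts = {}
for _ in range(2000):
    s = int(rng.integers(N_STATES))
    a = int(rng.choice([-1, 1]))
    counts.setdefault(step(s, a), []).append((s, a))

def sample_backward(s_next):
    """Sample a (state, action) pair that may have led to s_next."""
    prev = counts[s_next]
    return prev[rng.integers(len(prev))]

# Recall trace: walk backward from the high-reward state.
trace, s = [], GOAL
for _ in range(5):
    s_prev, a = sample_backward(s)
    trace.append((s_prev, a, s))
    s = s_prev
print(trace)  # every transition terminates toward the goal
```

In the actual method the backward model is learned and the resulting traces are used as extra imitation targets for the policy.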


Sparse Attentive Backtracking: Temporal Credit Assignment Through Reminding
Rosemary Nan Ke, Anirudh Goyal, Olexa Bilaniuk, Jonathan Binas, Michael C. Mozer, Chris Pal, Yoshua Bengio
Neural Information Processing Systems (NIPS), 2018 (Oral Presentation)
Learning long-term dependencies in extended temporal sequences requires credit assignment to events far back in the past. The most common method for training recurrent neural networks, backpropagation through time (BPTT), requires credit information to be propagated backwards through every single step of the forward computation, potentially over thousands or millions of time steps. This becomes computationally expensive or even infeasible when used with long sequences. Importantly, biological brains are unlikely to perform such detailed reverse replay over very long sequences of internal states (consider days, months, or years). However, humans are often reminded of past memories or mental states which are associated with the current mental state. We consider the hypothesis that such memory associations between past and present could be used for credit assignment through arbitrarily long sequences, propagating the credit assigned to the current state to the associated past state. Based on this principle, we study a novel algorithm which only backpropagates through a few of these temporal skip connections, realized by a learned attention mechanism that associates current states with relevant past states. We demonstrate in experiments that our method matches or outperforms regular BPTT and truncated BPTT in tasks involving particularly long-term dependencies, but without requiring the biologically implausible backward replay through the whole history of states. Additionally, we demonstrate that the proposed method transfers to longer sequences significantly better than LSTMs trained with BPTT and LSTMs trained with full self-attention.


Fortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden Representations
Alex Lamb,
Jonathan Binas,
Anirudh Goyal,
Dmitriy Serdyuk,
Sandeep Subramanian,
Ioannis Mitliagkas,
Yoshua Bengio,
arXiv
/
Code
Deep networks have achieved impressive results across a variety of important tasks. However, a known weakness is a failure to perform well when evaluated on data which differ from the training distribution, even if these differences are very small, as is the case with adversarial examples. We propose Fortified Networks, a simple transformation of existing networks, which fortifies the hidden layers in a deep network by identifying when the hidden states are off the data manifold, and maps these hidden states back to parts of the data manifold where the network performs well. Our principal contribution is to show that fortifying these hidden states improves the robustness of deep networks, and our experiments (i) demonstrate improved robustness to standard adversarial attacks in both black-box and white-box threat models; (ii) suggest that our improvements are not primarily due to the gradient masking problem; and (iii) show the advantage of doing this fortification in the hidden layers instead of the input space.


Generalization of Equilibrium Propagation to Vector Field Dynamics
Benjamin Scellier,
Anirudh Goyal,
Jonathan Binas,
Thomas Mesnard,
Yoshua Bengio,
ICLR'18 Workshop
The biological plausibility of the backpropagation algorithm has long been doubted by neuroscientists. Two major reasons are that neurons would need to send two different types of signal in the forward and backward phases, and that pairs of neurons would need to communicate through symmetric bidirectional connections. We present a simple two-phase learning procedure for fixed point recurrent networks that addresses both these issues. In our model, neurons perform leaky integration and synaptic weights are updated through a local mechanism. Our learning method extends the framework of Equilibrium Propagation to general dynamics, relaxing the requirement of an energy function. As a consequence of this generalization, the algorithm does not compute the true gradient of the objective function, but rather approximates it at a precision which is proven to be directly related to the degree of symmetry of the feedforward and feedback weights. We show experimentally that the intrinsic properties of the system lead to alignment of the feedforward and feedback weights, and that our algorithm optimizes the objective function.


Z-Forcing: Training Stochastic Recurrent Networks
Anirudh Goyal,
Alessandro Sordoni,
Marc-Alexandre Côté,
Rosemary Nan Ke,
Yoshua Bengio,
Neural Information Processing Systems (NIPS), 2017
arXiv
/
code
We propose a novel approach to incorporate stochastic latent variables in sequential neural networks. The method builds on recent architectures that use latent variables to condition the recurrent dynamics of the network. We augment the inference network with an RNN that runs backward through the sequence and add a new auxiliary cost that forces the latent variables to reconstruct the state of that backward RNN, i.e. predict a summary of future observations.
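A structural sketch of the auxiliary cost (fixed random weights stand in for trained networks, and the inference and reconstruction maps are assumed linear for brevity): a backward RNN summarizes the future, and the latent inferred from the forward state is penalized for failing to reconstruct it.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 4
x = rng.normal(size=(T, D))           # an input sequence

def rnn(seq):
    """A stand-in recurrent cell with fixed random weights."""
    W = rng.normal(size=(D, D)) * 0.1
    h, hs = np.zeros(D), []
    for xt in seq:
        h = np.tanh(h @ W + xt)
        hs.append(h)
    return np.stack(hs)

h_fwd = rnn(x)                        # forward states h_1..h_T
h_bwd = rnn(x[::-1])[::-1]            # backward states b_1..b_T

# Latent z_t is inferred from the forward state; the auxiliary cost
# forces z_t to reconstruct the backward state (a summary of the future).
W_z = rng.normal(size=(D, D)) * 0.1   # inference map (assumed linear)
W_rec = rng.normal(size=(D, D)) * 0.1 # reconstruction head
z = h_fwd @ W_z
aux_loss = np.mean((z @ W_rec - h_bwd) ** 2)
print(aux_loss)
```

In training, this term is added to the usual variational objective so that the latents carry information about the future rather than collapsing to the prior.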


Variational Walkback: Learning a Transition Operator as a Stochastic Recurrent Net
Anirudh Goyal,
Nan Rosemary Ke,
Surya Ganguli,
Yoshua Bengio
Neural Information Processing Systems (NIPS), 2017
arXiv
/
code
We propose a novel method to directly learn a stochastic transition operator whose repeated application provides generated samples. Traditional undirected graphical models approach this problem indirectly by learning a Markov chain model whose stationary distribution obeys detailed balance with respect to a parameterized energy function.


Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
David Krueger, Tegan Maharaj, Janos Kramar, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal,
Yoshua Bengio,
Aaron Courville,
Chris Pal
International Conference on Learning Representations (ICLR), 2017
arXiv
/
code
We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudoensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks.
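The zoneout update is simple enough to sketch directly; a minimal NumPy version of the per-unit rule described above (the test-time expected update is an assumption of the standard formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def zoneout(h_prev, h_new, rate=0.15, training=True):
    """Randomly keep some units of the previous hidden state.

    Unlike dropout, 'zoned-out' units copy h_prev instead of being
    zeroed, so state (and, with autodiff, gradient) flows through time.
    """
    if not training:  # at test time, use the expected update
        return rate * h_prev + (1 - rate) * h_new
    keep_prev = rng.random(h_prev.shape) < rate
    return np.where(keep_prev, h_prev, h_new)

h_prev = np.zeros(8)
h_new = np.ones(8)
h = zoneout(h_prev, h_new, rate=0.5)
print(h)  # a random mix of old (0) and new (1) unit values
```

In an RNN, `h_new` would be the cell's proposed next state, so zoneout slots in as a one-line wrapper around the state update.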


ACtuAL: Actor-Critic Under Adversarial Learning
Anirudh Goyal, Nan Rosemary Ke, Alex Lamb, R Devon Hjelm, Chris Pal, Joelle Pineau,
Yoshua Bengio
arXiv
/
code
Generative Adversarial Networks (GANs) are a powerful framework for deep generative modeling. Posed as a two-player minimax problem, GANs are typically trained end-to-end on real-valued data and can be used to train a generator of high-dimensional and realistic images. However, a major limitation of GANs is that training relies on passing gradients from the discriminator through the generator via backpropagation. This makes it fundamentally difficult to train GANs with discrete data, as generation in this case typically involves a non-differentiable function. These difficulties extend to the reinforcement learning setting when the action space is composed of discrete decisions. We address these issues by reframing the GAN framework so that the generator is no longer trained using gradients through the discriminator, but is instead trained using a learned critic in the actor-critic framework with a Temporal Difference (TD) objective. This is a natural fit for sequence modeling, and we use it to achieve improvements on language modeling tasks over the standard Teacher Forcing methods.


An Actor-Critic Algorithm for Sequence Prediction
Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau,
Aaron Courville,
Yoshua Bengio
International Conference on Learning Representations (ICLR), 2017
arXiv
/
code
We present an approach to training neural networks to generate sequences using actor-critic methods from reinforcement learning (RL). Current log-likelihood training methods are limited by the discrepancy between their training and testing modes, as models must generate tokens conditioned on their previous guesses rather than the ground-truth tokens. We address this problem by introducing a critic network that is trained to predict the value of an output token, given the policy of an actor network.


Professor Forcing: A New Algorithm for Training Recurrent Networks
Anirudh Goyal, Alex Lamb, Ying Zhang, Saizheng Zhang,
Aaron Courville,
Yoshua Bengio,
Neural Information Processing Systems (NIPS), 2016
arXiv
/
video
/
code
The Teacher Forcing algorithm trains recurrent networks by supplying observed sequence values as inputs during training and using the network's own one-step-ahead predictions to do multi-step sampling. We introduce the Professor Forcing algorithm, which uses adversarial domain adaptation to encourage the dynamics of the recurrent network to be the same when training the network and when sampling from the network over multiple time steps.
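A toy sketch of the discrepancy Professor Forcing targets, using a fixed random RNN cell (all weights invented for illustration): hidden trajectories are generated under teacher forcing and under free running, and their statistics differ. The actual method trains a discriminator on such trajectories and trains the generator to make the two modes indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
W = rng.normal(size=(D, D)) * 0.5   # recurrent weights
V = rng.normal(size=(D, D)) * 0.5   # input weights
O = rng.normal(size=(D, D)) * 0.5   # output (readout) weights

def cell(h, x):
    return np.tanh(h @ W + x @ V)

x = rng.normal(size=(10, D))        # ground-truth sequence

# Teacher forcing: condition each step on the ground-truth input.
h, forced = np.zeros(D), []
for xt in x:
    h = cell(h, xt)
    forced.append(h)

# Free running: condition each step on the model's own previous output.
h, free, inp = np.zeros(D), [], x[0]
for _ in range(10):
    h = cell(h, inp)
    free.append(h)
    inp = h @ O

# Professor Forcing trains a discriminator on these two populations of
# hidden trajectories; here we just measure that their statistics differ.
gap = np.linalg.norm(np.mean(forced, 0) - np.mean(free, 0))
print(gap)  # nonzero: the two modes visit different hidden-state regions
```

Closing this gap adversarially is what makes multi-step sampling behave like the teacher-forced training regime.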

