Forward-Backward Representations:
Instead of directly learning the expected discounted return Q,
the FB representation approach learns the expected discounted state-action visitation measure M (the successor measure),
which, unlike Q, does not depend on the reward function.
The two quantities are related by \(Q = M \cdot R\).
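This identity follows from regrouping the discounted return by the state-action pairs the policy visits (with M counting visits from the first transition onward, matching the Bellman equation below):
\[
Q^{\pi}(s_0, a_0)
= \mathbb{E}\left[ \sum_{t \geq 1} \gamma^{t-1} R(s_t, a_t) \right]
= \sum_{s, a} \underbrace{\left( \sum_{t \geq 1} \gamma^{t-1} \Pr(s_t = s, a_t = a \mid s_0, a_0, \pi) \right)}_{=\, M^{\pi}(s_0, a_0, s, a)} R(s, a)
\]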
To learn M, we use its Bellman equation, analogous to the one behind Q-learning, which describes how the visitation probabilities propagate through transitions:
\[
M^{\pi} =
\underbrace{P}_{\text{first transition from } (s,a)}
+
\underbrace{\gamma P_{\pi} M^{\pi}}_{\text{discounted visitations after the first transition, following } \pi}
\]
Here, \(P\) is the transition kernel, \(P_{\pi}\) the transition kernel induced by following the policy \(\pi\), and \(\gamma\) the discount factor.
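A quick tabular sanity check of this identity (a self-contained sketch on a hypothetical random MDP; M is written over future states only, with actions chosen by a fixed deterministic policy):

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP with a state-dependent reward.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

P = rng.random((n_states, n_actions, n_states))        # transition kernel P[s, a, s']
P /= P.sum(axis=-1, keepdims=True)
pi = np.array([0, 1, 0])                               # a fixed deterministic policy
P_pi = P[np.arange(n_states), pi]                      # P_pi[s, s'] = P[s, pi(s), s']

# Closed-form successor measure along the policy: M_pi = (I - gamma * P_pi)^{-1} P_pi,
# then one extra step for an arbitrary first action a.
M_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, P_pi)
M = P + gamma * np.einsum("sak,kj->saj", P, M_pi)      # M[s, a, s']

# Bellman consistency: M = P + gamma * P_pi M, applied after the first transition.
M_boot = P + gamma * np.einsum("sak,kj->saj", P, M[np.arange(n_states), pi])
print(np.allclose(M, M_boot))                          # True

# And the reward link from above: Q = M . r for a state reward r(s').
r = rng.random(n_states)
Q = M @ r
Q_check = np.zeros((n_states, n_actions))
for _ in range(2000):                                  # plain policy evaluation for comparison
    Q_check = np.einsum("sak,k->sa", P, r + gamma * Q_check[np.arange(n_states), pi])
print(np.allclose(Q, Q_check))                         # True
```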
We turn this Bellman consistency into a learning objective:
\[ \mathcal{L}(M) = \mathbb{E} \left[ \left\| M^{\pi}(s_0, a_0, ds, da) - \bigl( P + \gamma P_{\pi} M^{\pi} \bigr)(s_0, a_0, ds, da) \right\|^2 \right] \]
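In the tabular case, this objective corresponds to a simple TD update on an estimate of M; a minimal sketch (function name and learning rate are my own choices, and the sampled next state stands in for P):

```python
import numpy as np

def td_update_M(M_hat, s, a, s_next, pi, gamma=0.9, lr=0.1):
    """One TD step on the successor-measure estimate M_hat[s, a, s'] from a single transition.

    The target mirrors the Bellman equation above: a one-hot sample of P(s, a, .)
    plus the bootstrapped tail gamma * M_hat[s_next, pi(s_next), :].
    """
    target = np.eye(M_hat.shape[-1])[s_next] + gamma * M_hat[s_next, pi[s_next]]
    M_hat[s, a] += lr * (target - M_hat[s, a])
    return M_hat
```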
However, in realistic continuous or high-dimensional settings, M is far too complex an object to represent and learn directly.
To make the problem tractable, FB uses a low-rank decomposition into a forward embedding F and a backward embedding B:
\[ M^{\pi_z}(s_0, a_0, ds, da) = F(s_0, a_0, z)^{\top} B(s, a)\, \rho(ds, da) \]
Here, \(\rho\) is the distribution of state-action pairs in the training dataset,
and \(z\) is a latent representation of the reward function: a compact vector encoding of the reward
that is fed to F and later used to derive the final policy.
It is computed as:
\[ z_r = \mathbb{E}_{s \sim \rho}[r(s) B(s)] \]
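In practice, \(z_r\) is estimated by a Monte Carlo average over a batch of reward-labelled states sampled from \(\rho\); a minimal sketch (the backward network `B` and the batch tensors are placeholders):

```python
import torch

@torch.no_grad()
def infer_z(B, states, rewards):
    """Estimate z_r = E_{s ~ rho}[ r(s) B(s) ] from a batch of samples.

    B       : backward network mapping states of shape (n, state_dim) to embeddings (n, d)
    states  : states sampled from the data distribution rho
    rewards : reward labels r(s) for those states, shape (n,)
    """
    return (rewards.unsqueeze(-1) * B(states)).mean(dim=0)   # shape (d,)
```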
In this decomposition, \(B(s, a)\) captures the environment dynamics (the dependency of M on the visited state-action pair),
while \(F(s_0, a_0, z)\) encodes the initial state-action pair together with the task embedding \(z\).
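As a concrete, purely illustrative parameterization, F and B can be small MLPs producing `D`-dimensional embeddings (D = 64 is an arbitrary choice here). In this sketch B takes only the state, matching the \(z_r\) formula above; a state-action variant works the same way:

```python
import torch
import torch.nn as nn

D = 64  # embedding dimension (an arbitrary choice for this sketch)

class Backward(nn.Module):
    """B(s) -> R^D: backward embedding of a visited state."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, D),
        )

    def forward(self, s):
        return self.net(s)

class Forward(nn.Module):
    """F(s, a, z) -> R^D: forward embedding of the starting state-action pair and task z."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + D, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, D),
        )

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1))
```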
With this decomposition, the objective becomes (where \(\bar{F}, \bar{B}\) denote target copies of the networks and \(s'\) is sampled from \(\rho\) independently of the transition):
$$
\begin{aligned}
\mathcal{L}(F, B) = & \|F_z^{\top} B\rho - (P + \gamma P_{\pi_z} \bar{F}_z^{\top} \bar{B}\rho)\|_{\rho}^2\\
= & \mathbb{E}_{(s_t,a_t,s_{t+1})\sim\rho,\, s'\sim\rho}\left[\left(F(s_t, a_t, z)^{\top}B(s') - \gamma \bar{F}(s_{t+1}, \pi_z(s_{t+1}), z)^{\top}\bar{B}(s')\right)^2\right]\\
& - 2\,\mathbb{E}_{(s_t,a_t,s_{t+1})\sim\rho}\left[F(s_t, a_t, z)^{\top}B(s_{t+1})\right] + \text{Const}
\end{aligned}
$$
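A hedged PyTorch sketch of this loss for one batch (names, shapes, and the target-network handling are assumptions, not the reference implementation):

```python
import torch

def fb_loss(F, B, F_tgt, B_tgt, policy, z, batch, gamma=0.98):
    """One-batch sketch of the FB objective above.

    F, B         : forward / backward networks, F(s, a, z) -> (n, d), B(s) -> (n, d)
    F_tgt, B_tgt : target (barred) copies of F and B
    policy       : pi_z, mapping (s, z) -> action
    z            : task latent of shape (n, d), already tiled across the batch
    batch        : transitions (s, a, s_next) ~ rho plus independent states s_prime ~ rho
    """
    s, a, s_next, s_prime = batch["s"], batch["a"], batch["s_next"], batch["s_prime"]

    F_sa = F(s, a, z)                                        # (n, d)
    B_sp = B(s_prime)                                        # (n, d)
    with torch.no_grad():
        a_next = policy(s_next, z)
        target = gamma * (F_tgt(s_next, a_next, z) * B_tgt(s_prime)).sum(-1)   # (n,)

    # Squared Bellman residual against the independent samples s' ~ rho ...
    td_term = ((F_sa * B_sp).sum(-1) - target).pow(2).mean()
    # ... minus the term that fits M's density on the observed next state s_{t+1}.
    diag_term = (F_sa * B(s_next)).sum(-1).mean()
    return td_term - 2.0 * diag_term
```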
Finally, the policy for a reward \(r\) selects the action that maximizes the Q-value; writing M over future states \(s'\) (the reward here depends only on the state), the decomposition reduces this to a dot product between F and \(z_r\):
$$
\begin{aligned}
\pi_{z_r}(s) = & \operatorname{argmax}_a Q_r(s, a) \\
= & \operatorname{argmax}_a \sum_{s'} M^{\pi_{z_r}}(s, a, s')\, r(s') \\
= & \operatorname{argmax}_a \sum_{s'} F(s, a, z_r)^{\top} B(s')\, r(s')\, \rho(s') \\
= & \operatorname{argmax}_a\, F(s, a, z_r)^{\top} \underbrace{\mathbb{E}_{s' \sim \rho}\left[ r(s') B(s') \right]}_{=\, z_r}
\end{aligned}
$$
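Given \(z_r\), acting greedily therefore reduces to the dot product above. A sketch for a finite set of candidate actions (for continuous action spaces one would typically train an actor on the same objective instead):

```python
import torch

@torch.no_grad()
def act_greedy(F, z_r, state, candidate_actions):
    """Return argmax_a F(s, a, z_r)^T z_r over a finite set of candidate actions.

    F                 : forward network, F(s, a, z) -> (n, d)
    z_r               : task embedding of shape (d,)
    state             : single state of shape (state_dim,)
    candidate_actions : tensor of shape (n, action_dim)
    """
    n = candidate_actions.shape[0]
    s_rep = state.unsqueeze(0).expand(n, -1)       # tile the state across candidates
    z_rep = z_r.unsqueeze(0).expand(n, -1)
    q_values = (F(s_rep, candidate_actions, z_rep) * z_rep).sum(-1)   # Q_r(s, a) = F^T z_r
    return candidate_actions[q_values.argmax()]
```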
Main issues with existing FB learning methods:
- Performance: The original model uses simple MLP architectures with at most 3 layers, which are not expressive enough for complex patterns. Moreover, simply scaling up these networks (adding more parameters) does not improve performance; it mainly makes training harder without capturing the richer structure found in larger datasets or more difficult problems.
- Distributional Shift: The bootstrapped target is evaluated at the policy's action \(\pi_z(s_{t+1})\), which may not be covered by the dataset. For such out-of-distribution actions the algorithm receives no corrective feedback, so it can assign them unrealistic values, which biases training.