From [1], we have:
$$\mathcal{L}=\mathcal{L}_{\mathrm{CE}}+\beta \cdot D_{\mathrm{KL}}\left[P(\mathbf{T} \mid \mathbf{X}) \,\|\, Q(\mathbf{T})\right]$$
For comparison, conventional adversarial training in [2] optimizes:
$$\min \left[\underset{x, y \sim p_{\mathcal{D}}}{\mathbb{E}}\left[\max _{\hat{x} \in \mathbb{B}(x)} \mathcal{L}(x, \hat{x}, y)\right]\right]$$
where the objective above can be instantiated in a semi-supervised fashion as:
$$-\log q\left(y \mid F_{s}(x)\right)+\beta\, \mathrm{KL}\left(q\left(\cdot \mid F_{s}(x)\right) \,\|\, q\left(\cdot \mid F_{s}(\hat{x})\right)\right)$$
The IB principle and adversarial training (AT) both introduce a regularization term that smooths the model's loss landscape. The only difference is the distribution used in the KL divergence term: variational IB compares against a Gaussian, while AT compares against the prediction on adversarial examples.
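To make the comparison concrete, here is a minimal NumPy sketch of the two regularizers as I understand them. It assumes that $Q(\mathbf{T})$ is a standard normal and that $P(\mathbf{T} \mid \mathbf{X})$ is a diagonal Gaussian, which may not match the exact setup in [1]. The function names and all numeric values are made up for illustration and are not taken from [1] or [2].

```python
import numpy as np


def kl_gaussian_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)], summed over latent dims.

    Assumption: the IB prior Q(T) is a standard normal.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl_categorical(p, q, eps=1e-12):
    """KL(p || q) between categorical distributions over the last axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)


# IB-style regularizer: the encoder output is compared with a fixed prior.
mu = np.array([[0.5, -0.3]])        # hypothetical encoder mean
log_var = np.array([[0.0, -1.0]])   # hypothetical encoder log-variance
ib_reg = kl_gaussian_to_standard_normal(mu, log_var)

# AT-style regularizer: the clean prediction is compared with the
# prediction on an adversarial example x_hat.
clean_logits = np.array([[2.0, 0.5, -1.0]])  # hypothetical logits for x
adv_logits = np.array([[1.0, 1.2, -0.5]])    # hypothetical logits for x_hat
at_reg = kl_categorical(softmax(clean_logits), softmax(adv_logits))

print("IB regularizer:", ib_reg)
print("AT regularizer:", at_reg)
```

In this sketch the IB term depends only on the clean input's latent distribution and a fixed reference, whereas the AT term needs a second forward pass on $\hat{x}$ and measures a divergence in the output space rather than the latent space.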
Adversarial examples can be viewed as a special case of out-of-distribution (OOD) data. From this view, compared with IB, AT should give a tighter bound for OOD optimization. But in your experimental results, IB surpasses all previous AT-like methods. How can a looser bound be better than a tighter one? This really confuses me. Is there something I have misunderstood?
[1] Improving the Adversarial Robustness of NLP Models by Information Bottleneck
[2] How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?