Hi All,
Here is a good blog post on bi-level optimization. It makes clear why we need the Hessian matrix of the inner (training) loss with respect to the model parameters when we compute the gradient with respect to the hyperparameters (e.g., the distillation dataset):
https://timvieira.github.io/blog/post/2016/03/05/gradient-based-hyperparameter-optimization-and-the-implicit-function-theorem/
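
To make the role of the Hessian concrete, here is a minimal sketch on a toy problem of my own (not taken from the blog post). Everything in it is an illustrative assumption: the quadratic inner loss, the matrices A, b, t, the scalar hyperparameter lam, and the helper names. The inner problem is w*(lam) = argmin_w 0.5 w^T A w - b^T w + 0.5 lam ||w||^2 and the outer loss is g(w*) = 0.5 ||w* - t||^2. Differentiating the inner optimality condition with respect to lam gives dw*/dlam = -H^{-1} (d/dlam of the inner gradient), which is where the Hessian H enters.

```python
# Toy bi-level problem; all names and numbers are illustrative assumptions.
A = [[3.0, 1.0], [1.0, 2.0]]  # symmetric positive definite
b = [1.0, -1.0]
t = [0.5, 0.0]                # target used by the outer loss


def solve2(M, v):
    """Solve the 2x2 linear system M x = v with Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(v[0] * M[1][1] - M[0][1] * v[1]) / det,
            (M[0][0] * v[1] - M[1][0] * v[0]) / det]


def hessian(lam):
    """Hessian of the inner loss with respect to w: A + lam * I."""
    return [[A[0][0] + lam, A[0][1]],
            [A[1][0], A[1][1] + lam]]


def inner_solution(lam):
    """The inner gradient (A + lam I) w - b vanishes at w* = H^{-1} b."""
    return solve2(hessian(lam), b)


def outer_loss(lam):
    """Outer loss g evaluated at the inner optimum w*(lam)."""
    w = inner_solution(lam)
    return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, t))


def hypergradient(lam):
    """dg/dlam via the implicit function theorem."""
    w = inner_solution(lam)
    # Mixed partial: d/dlam of the inner gradient (A + lam I) w - b is w.
    mixed = w
    # Implicit function theorem: dw*/dlam = -H^{-1} * mixed.
    dw = [-x for x in solve2(hessian(lam), mixed)]
    # Chain rule: dg/dlam = grad_w g(w*) . dw*/dlam
    # (g has no direct dependence on lam in this toy problem).
    grad_g = [wi - ti for wi, ti in zip(w, t)]
    return sum(gi * di for gi, di in zip(grad_g, dw))


def finite_difference(lam, eps=1e-6):
    """Central-difference check that re-solves the inner problem twice."""
    return (outer_loss(lam + eps) - outer_loss(lam - eps)) / (2 * eps)


if __name__ == "__main__":
    print("implicit:", hypergradient(0.1))
    print("finite difference:", finite_difference(0.1))
```

The two printed values should agree closely. In a real setting the Hessian is too large to form explicitly, so the solve against H is usually approximated, but the structure of the computation is the same.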
Best,
Soheil