
Question on relation to univariate Taylor series propagation for higher-order derivatives #1

thisiscam opened this issue Dec 19, 2024 · 2 comments


@thisiscam

Hi,

Thank you for your interesting work! I’ve been reading your paper and found the idea of stochastically evaluating higher-order derivatives quite intriguing.

I was wondering if you could comment on how your technique relates to the known method of evaluating higher-order derivative tensors by propagating a set of univariate Taylor series. This idea is discussed in the paper by Griewank, Utke, and Walther (GUW98) and in the Evaluating Derivatives book (Chapter 13).

Specifically, could your method be interpreted (or extended) as sampling a set of univariate Taylor series to propagate, with the aim of approximating the derivative tensor? I'd love to hear your thoughts on this connection or any fundamental differences.
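
To make the question concrete, here is the kind of univariate propagation I have in mind, sketched with `jax.experimental.jet` (the test function and direction are arbitrary):

```python
# Sketch of univariate Taylor propagation: push the curve x(t) = x + v t
# through f and read off d^k/dt^k f(x + v t) at t = 0. GUW98 recovers the
# full derivative tensor by interpolating from many such directions v
# (integer combinations of basis vectors).
import jax.numpy as jnp
from jax.experimental.jet import jet

f = lambda x: jnp.sin(x[0]) * jnp.cos(x[1])

x = jnp.array([0.3, 0.7])
v = jnp.array([1.0, 2.0])  # a single perturbation direction
k = 3                      # derivative order

# The input jet [v, 0, ..., 0] encodes x(t) = x + v t; the output series
# holds d^m/dt^m f(x(t)) |_{t=0} for m = 1..k.
series_in = [v] + [jnp.zeros_like(v)] * (k - 1)
_, series_out = jet(f, (x,), (series_in,))
print(series_out[-1])  # k-th directional derivative along v
```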

Thanks in advance for your response!

@thisiscam changed the title from "Question on relation to univariate Taylor series for higher-order derivatives" to "Question on relation to univariate Taylor series propagation for higher-order derivatives" on Dec 19, 2024
@zekun-shi (Collaborator)

Thanks for bringing up this highly relevant work! Having read the paper, I agree that the idea of evaluating arbitrary derivative tensor elements via forward propagation of univariate Taylor series is similar. I wasn't aware of this work, and based on my experience talking to people at NeurIPS, it is not well known within the ML community, even among people who work on AD in an ML context. The JAX team members who wrote the Taylor-mode AD I used in the paper weren't aware of this technique and were quite surprised that one could do this.

So yes, your interpretation is right: STDE approximates the derivative tensor with randomized univariate Taylor series propagation.
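
As a toy illustration of that randomized view (not the paper's exact estimator or sampling distribution, just the idea), one can estimate the Laplacian $\operatorname{tr}(\nabla^2 f)$ by pushing random 2-jets forward and averaging:

```python
# Toy randomized estimate of the Laplacian tr(H) of f at x: push the
# 2-jet [v, 0] forward, whose 2nd output coefficient is v^T H v, and
# average over random v with E[v v^T] = I (Hutchinson-style).
import jax
import jax.numpy as jnp
from jax.experimental.jet import jet

f = lambda x: jnp.sin(x).sum() * jnp.cos(x[0])

def laplacian_estimate(key, x, n_samples=4096):
    # Rademacher directions satisfy E[v v^T] = I.
    vs = jax.random.rademacher(key, (n_samples, x.shape[0]), dtype=x.dtype)

    def quad_form(v):
        # The 2nd Taylor coefficient of t -> f(x + v t) is exactly v^T H v.
        _, (_, f2) = jet(f, (x,), ([v, jnp.zeros_like(v)],))
        return f2

    return jax.vmap(quad_form)(vs).mean()  # E[v^T H v] = tr(H)

key = jax.random.PRNGKey(0)
x = jnp.arange(1.0, 6.0) / 10.0
print(laplacian_estimate(key, x))  # compare: jnp.trace(jax.hessian(f)(x))
```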

@thisiscam (Author)

thisiscam commented Jan 10, 2025

Thank you for your detailed response! I wanted to follow up with some additional thoughts and questions:

  1. I agree that the method of propagating univariate Taylor series to compute higher-order derivative tensors is not widely known in the ML community, which is indeed unfortunate given its mention in the classic Evaluating Derivatives book. It's great to see that your work brings attention to this approach, even if from a different perspective!

  2. That said, I think there's a key distinction between your method and GUW98: GUW98 considers only first-order perturbations, whereas your work (as seen in Eq. (10) and Appendix F.1) includes higher-order perturbations.
    Specifically, (it appears to me that) GUW98 only considers series inputs of the form
    $[v, 0, \ldots, 0]$, where $v$ is an integer linear combination of the basis vectors $e_i$,
    while your work (please correct me if I'm wrong) considers inputs of the form
    $[\ldots, 0, e_1, 0, \ldots, 0, e_2, 0, \ldots, 0, e_k]$ that scatter the basis vectors across the higher-order perturbations.
    In Eq. (10) of your paper, the coefficient $C_{d_1 \dots d_k}$ is the coefficient associated with each directional derivative. However, I couldn't find a clear algorithm or explicit description in your paper for computing $C$. Could you clarify whether you have a formula or algorithm for computing $C$ that generalizes Eq. (13) in GUW98? (One such higher-order placement is checked numerically in the sketch after this list.) It would be interesting to explore whether including higher-order perturbations, as in your work, might lead to better algorithms or approximations in some contexts.

  3. I found your comment in the paper on fully mixed partial derivatives $\frac{\partial^k f}{\partial x_1 \cdots \partial x_k}$ (i.e., maximally non-diagonal operators) particularly interesting. Your paper mentions that the minimum $l$ required for these cases is $(1 + k)k/2$. However, when generalizing your example, I noticed that calculating the fully mixed partial derivative from a single univariate jet (without cancellation steps) seems to require an exponential $l = O(k \cdot 2^k)$-th order derivative. For instance:

    • When $k = 2$, positions $[0, e_1, e_2]$ work.
    • When $k = 3$, positions $[0, 0, 0, e_1, 0, e_2, e_3]$ work.
    • In general, the perturbation $e_i$ appears at index $d_i = 2^{k-1} + \ldots + 2^{k-i} = 2^k - 2^{k-i}$, so the nonzero part of the jet has length $2^k - 1$ and the mixed partial is read off at order $l = \sum_i d_i = (k - 1)2^k + 1$. Could you clarify whether the quadratic $l = (1 + k)k/2$ formula in your paper gives a more efficient method, or some insight into improving this computation? (The $k = 2$ case is checked numerically in the sketch below.)
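
To make the $k = 2$ case concrete, here is a quick numerical sketch with `jax.experimental.jet` (the test function is arbitrary, and the constant $10 = 5!/(2!\,3!)$ is my own Faà di Bruno count, so please double-check):

```python
# Check the k = 2 construction: put e_1 at order 2 and e_2 at order 3,
# pad with zeros up to order 5 = 2 + 3, and read the 5th output Taylor
# coefficient. By Faà di Bruno it should equal C * d^2 f / dx_1 dx_2 with
# C = 5!/(2! * 3!) = 10, since 5 = 2 + 3 is the only way to write 5 as a
# sum of the perturbation orders {2, 3} (so no cancellation is needed).
import jax
import jax.numpy as jnp
from jax.experimental.jet import jet

f = lambda x: jnp.sin(x[0] * x[1]) + x[0] * x[1] ** 2

x = jnp.array([0.4, 0.9])
e1, e2 = jnp.array([1.0, 0.0]), jnp.array([0.0, 1.0])
zero = jnp.zeros(2)

# Input jet [0, e1, e2, 0, 0] encodes x(t) = x + e1 t^2/2! + e2 t^3/3!.
_, series_out = jet(f, (x,), ([zero, e1, e2, zero, zero],))
mixed_from_jet = series_out[4] / 10.0

# Ground truth via nested first-order AD.
mixed_true = jax.hessian(f)(x)[0, 1]
print(mixed_from_jet, mixed_true)  # should agree
```

If this reading is right, the coefficient would be $C_{d_1 \dots d_k} = l!/(d_1! \cdots d_k!)$ whenever the orders $d_i$ are distinct and $l = \sum_i d_i$ has no other decomposition into the available orders, which is the kind of generalization of GUW98's Eq. (13) I was hoping for.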

Looking forward to your thoughts!
