Improvements to debug_nans #25643

Open
5 tasks
emilyfertig opened this issue Dec 20, 2024 · 0 comments
Labels
enhancement New feature or request

emilyfertig commented Dec 20, 2024

This issue tracks improvements to debug_nans/debug_infs. See the description of #25519 for some examples of status quo behavior.

  • NaNs in the forward pass of a shard_map function currently only report "Invalid value in a sharded computation," and the line where the shard_map function was called. The traceback should extend to the lax/jnp op within the shard_map function that produced the NaN.
  • Avoid re-running functions that have side effects or contain collectives when locating the source of a NaN.
  • For a NaN produced inside a jnp function, consider stopping the traceback at the jax.numpy boundary rather than at the underlying lax primitive, or using the JAX_TRACEBACK_FILTERING flag to switch between the two behaviors.
  • Improve the error messages for NaNs in the backward pass. Currently they report a stack trace as a list of line numbers; a regular Python stack trace with the relevant code highlighted would be more useful.
  • For pmap, the Python dispatch path reports "Invalid value in parallel computation" and the traceback stops at the call to the pmap function. It should extend to the arithmetic op where the NaN occurred, like it does for the C++ dispatch path.
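
For readers who have not used the flag, here is a minimal sketch of the baseline behavior these items build on, assuming a recent JAX release. The names `log_of` and `run_and_catch` are hypothetical and exist only for illustration; the exact error text and traceback depth vary by version and dispatch path, so the sketch does not depend on them.

```python
import jax
import jax.numpy as jnp

# Enable NaN checking globally (equivalent to setting JAX_DEBUG_NANS=True).
jax.config.update("jax_debug_nans", True)


@jax.jit
def log_of(x):
    # The log of a negative real number is NaN.
    return jnp.log(x)


def run_and_catch(x):
    """Return the error message if a NaN is detected, else None."""
    try:
        log_of(x).block_until_ready()
    except FloatingPointError as err:
        return str(err)
    return None


bad = run_and_catch(jnp.array(-1.0))   # NaN detected: a message string
good = run_and_catch(jnp.array(1.0))   # log(1) = 0: no error, so None
```

The improvements tracked here concern what the resulting message and traceback point to (for example inside shard_map or pmap), not whether the error is raised.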
emilyfertig added the enhancement label Dec 20, 2024