
Infinite loop in url retry #21958

Closed
chris-smith-zocdoc opened this issue Feb 14, 2025 · 7 comments · Fixed by #21959

@chris-smith-zocdoc (Contributor)

Describe the bug
I'm seeing bad URLs (nonexistent, or returning a bad status code) trigger an infinite loop; the download never times out or completes.


pants run :bad_file

file(
    name="bad_file",
    source=http_source(
        url="http://localhost:8080/foo.txt",
        len=123,
        sha256="sha",
    )
)

run_shell_command(
    name="list_files",
    command="ls -lah {chroot}",
    execution_dependencies=[":bad_file"],
)

Pants version
2.24.1

OS
Mac

Additional info
Possibly related to #21308

@sureshjoshi (Member) commented Feb 14, 2025

Can you check what your download attempts/delays are set to? And can you try removing retries? It might also be that the last retry just keeps going?

https://www.pantsbuild.org/stable/reference/global-options#file_downloads_max_attempts

I wonder if http_source bypasses it somehow.
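
For reference, a minimal pants.toml sketch of where those knobs live (file_downloads_max_attempts is the documented option linked above; the delay option name here is assumed to be the companion file_downloads_retry_delay):

# pants.toml (sketch; defaults shown)
[GLOBAL]
file_downloads_max_attempts = 4    # documented global option
file_downloads_retry_delay = 0.2   # assumed name of the retry-delay option, in seconds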

@chris-smith-zocdoc (Contributor, Author)

Can you check what your download attempts/delays are set to?

I'm using the defaults of 4 and 0.2.

It might also be that the last retry just keeps going?

It appears to be related to the retry delay. If I only set file-downloads-max-attempts higher, it stalls out on the 4th request (which happens to be the default limit).

Setting the delay lower, however, lets it progress past the 4th request and eventually complete.

@chris-smith-zocdoc (Contributor, Author)

Added some logging to figure out what is going wrong; the exponential backoff is going a little haywire.

I replaced the jitter function with this:

use std::time::Duration;

pub fn jitter(duration: Duration) -> Duration {
    // Scale the delay by a random factor in [0, 1) and log the before/after values in ms.
    let r = rand::random::<f64>();
    let m = duration.mul_f64(r);
    log::debug!("Jitter random: {} old duration: {} new duration: {}", r, duration.as_millis(), m.as_millis());
    m
}
Using retry delay of 200 ms
Downloading http://localhost:8080/foo.txt (attempt #1)
Error while downloading http://localhost:8080/foo.txt: Error downloading file: error sending request for url (http://localhost:8080/foo.txt) (retryable)
Jitter random: 0.1434208799854798 old duration: 200 new duration: 28
Downloading http://localhost:8080/foo.txt (attempt #2)
Error while downloading http://localhost:8080/foo.txt: Error downloading file: error sending request for url (http://localhost:8080/foo.txt) (retryable)
Jitter random: 0.6670330227038747 old duration: 40000 new duration: 26681
Downloading http://localhost:8080/foo.txt (attempt #3)
Error while downloading http://localhost:8080/foo.txt: Error downloading file: error sending request for url (http://localhost:8080/foo.txt) (retryable)
Jitter random: 0.9847793764108195 old duration: 8000000 new duration: 7878235

@chris-smith-zocdoc (Contributor, Author)

The typical way I'm used to seeing exponential backoff work is described by AWS here:

# exponential 
sleep = min(cap, base * 2 ** attempt)

# exponential w/ jitter
sleep = random(0, min(cap, base * 2 ** attempt))

So for a base of 200 ms (ignoring the cap and jitter), the four delays would be 200 ms, 400 ms, 800 ms, 1600 ms.

Instead, what we're seeing here is 200 ms, 40000 ms, 8000000 ms, and presumably 1600000000 ms next.
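
To make the difference concrete, here's a quick sketch (plain Rust, with attempt numbering assumed to start at 1) that prints both schedules for a 200 ms base:

// Compare the observed (broken) schedule against conventional exponential backoff.
fn main() {
    let base: u64 = 200;
    for attempt in 1u32..=4 {
        // Observed schedule: base ** attempt (matches the logged 200, 40000, 8000000, 1600000000).
        let observed_ms = (base as f64).powi(attempt as i32) as u64;
        // Conventional schedule: base * 2 ** (attempt - 1), i.e. 200, 400, 800, 1600.
        let expected_ms = base * 2u64.pow(attempt - 1);
        println!("attempt {attempt}: observed = {observed_ms} ms, expected = {expected_ms} ms");
    }
}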

@chris-smith-zocdoc (Contributor, Author)

@sureshjoshi (Member)

Good catch, that's a ridiculous set of backoff timings…

@chris-smith-zocdoc (Contributor, Author)

I'm very new to Rust, but I got it running in #21959. Feedback welcome 🙏

chris-smith-zocdoc added a commit to Zocdoc/pants that referenced this issue Feb 15, 2025
This required updating tokio-retry to tokio-retry2, a fork of the original (which is now unmaintained?).

The backoff algorithm is now

sleep = random(0, base * 2 ** attempt)

so the delays will approximate 200ms, 400ms, 800ms, 1600ms.

Previously it was

sleep = random(0, base ** attempt)

which gave 200ms, 40000ms, 8000000ms, 1600000000ms.
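
A rough sketch of that corrected formula in plain Rust (this is not the actual tokio-retry2 API; the helper name and attempt numbering are made up for illustration):

use rand::Rng;
use std::time::Duration;

// Full-jitter exponential backoff per the commit message: sleep = random(0, base * 2 ** attempt),
// with attempts assumed to be numbered from 0 so the cap grows 200, 400, 800, 1600 ms.
fn backoff_with_jitter(base: Duration, attempt: u32) -> Duration {
    let cap = base * 2u32.pow(attempt);
    cap.mul_f64(rand::thread_rng().gen::<f64>())
}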
chris-smith-zocdoc added a commit to Zocdoc/pants that referenced this issue Feb 17, 2025
@huonw closed this as completed in 303a082 on Feb 17, 2025