
Stop hammering LetsEncrypt 100 times for badNonce #287

Open
anthonyryan1 opened this issue Jan 4, 2025 · 5 comments


@anthonyryan1

anthonyryan1 commented Jan 4, 2025

Relevant lines:

It's extremely rude of acme-tiny to abuse LetsEncrypt like this, and the code in this repository acts like this is a bug in LetsEncrypt when it's actually a bug in our code.

And we aren't even polite enough to sleep between retries; we just hammer their server repeatedly, expecting a different result.

The cause of the badNonce error is that LetsEncrypt runs a large number of servers to deal with the load. If we request a nonce from one server, then try to use it on a different server over a different connection, the other server won't recognize the nonce. Admittedly I'm not sure if this is a replication delay, or if LetsEncrypt deliberately doesn't sync nonce tokens between servers.

I've personally verified this: after changing acme-tiny to use a single TLS connection for all requests with `persistent_connection = http.client.HTTPSConnection(...)` and sending every request on that connection (so they route consistently to the same LetsEncrypt server), I'm no longer seeing badNonce errors. I don't even have retry code for it anymore, just a fatal exception if it's ever hit (which has yet to happen).
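For illustration, here is a minimal sketch of that approach (not acme-tiny's actual code; the host constant and helper name are mine):

```python
# Minimal sketch of the persistent-connection approach described above.
# Not acme-tiny's actual code: the host constant and helper are illustrative.
import http.client

ACME_HOST = "acme-v02.api.letsencrypt.org"  # assumed; normally taken from the directory URL

# One TLS connection reused for every request, so all requests reach the same backend.
persistent_connection = http.client.HTTPSConnection(ACME_HOST)

def acme_request(method, path, body=None, headers=None):
    """Send one request on the shared connection and return the response object and body."""
    persistent_connection.request(method, path, body=body, headers=headers or {})
    resp = persistent_connection.getresponse()
    data = resp.read()  # the body must be fully read before the connection can be reused
    return resp, data

# Example: fetch a nonce and consume it over the same connection,
# so both requests hit the same data center.
resp, _ = acme_request("HEAD", "/acme/new-nonce")
nonce = resp.getheader("Replay-Nonce")
```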

I would be happy to send a patch, but I have no interest in trying to fit it into the line count.

Respectfully, I feel the line-count goal for this project has fallen victim to Goodhart's law. What used to be a measure of easy-to-read code has become the opposite, with developers now making individual lines much less readable in a desperate attempt to stay under the arbitrary line count limit.

So for the moment, I'm just filing this as an issue explaining exactly what's causing badNonce errors and how to fix it. The rest is left as an exercise for someone who enjoys playing code golf to stay under the line count.

@felixfontein
Contributor

> The cause of the badNonce error is that LetsEncrypt runs a large number of servers to deal with the load. If we request a nonce from one server, then try to use it on a different server over a different connection, the other server won't recognize the nonce. Admittedly I'm not sure if this is a replication delay, or if LetsEncrypt deliberately doesn't sync nonce tokens between servers.

While it's correct that Let's Encrypt (that's the correct spelling BTW, not LetsEncrypt or any other variant) has a large number of servers, it is not correct that nonces depend on the particular server you are talking to. In Let's Encrypt's implementation, nonces are valid within the same data center, but not in a different data center. (See for example bruncsak/ght-acme.sh#1 (comment).)

Let's Encrypt is currently using two different data centers (source), so if your requests happen to end up at one data center or the other all the time, you would have a problem. I'm not sure what the chances are that this actually happens, though. urllib.request in the end uses socket.create_connection() (https://docs.python.org/3/library/socket.html#socket.create_connection) to connect, and the way acme-tiny uses it, it is called repeatedly with the same domain name for host. Its implementation simply loops over the return value of socket.getaddrinfo() and connects to the first result. When I call socket.getaddrinfo('acme-v02.api.letsencrypt.org', 443, 0, socket.SOCK_STREAM) repeatedly on my system, I always get the same two IPs (one IPv4 and one IPv6) in the same order, so from that side it looks OK. But acme-v02.api.letsencrypt.org is actually ca80a1adb12a4fbdac5ffcbc944e9a61.pacloudflare.com (via two CNAMEs), and I don't know how CloudFlare routes the requests to the data centers.
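For reference, the lookup described above can be reproduced with just the standard library (the hostname is the production ACME endpoint mentioned above):

```python
# Reproduces the DNS lookup described above: urllib.request ends up in
# socket.create_connection(), which tries the getaddrinfo() results in order.
import socket

results = socket.getaddrinfo("acme-v02.api.letsencrypt.org", 443, 0, socket.SOCK_STREAM)
for family, socktype, proto, canonname, sockaddr in results:
    print(sockaddr[0])  # the resolved addresses, in the order they would be tried
```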

> I don't even have retry code for it anymore, just a fatal exception if it's ever hit (which has yet to happen).

Doing that is not a good idea; see https://community.letsencrypt.org/t/17933/3.

@anthonyryan1
Author

I agree that it's infrequent, but my logs show badNonce errors for approximately 1 out of every thousand certificates issued by acme-tiny.

Do you know if nonce propagation within the same data center is eventually consistent? Or is it guaranteed that the nonce has reached all servers before the newNonce request returns?

Since I implemented a single persistent TCP connection for all requests sent by acme-tiny (preventing server changes), I've issued about 3000 certificates without any badNonce errors. I'm happy to report back after another 10K or 20K certificates, but I believe the error has been resolved on my end with 0 badNonce retries in my code.

Even if others disagree with me that persistent connections appear to resolve the badNonce errors (whether caused by data-center changes mid-issuance, or by racing eventually consistent propagation within a single LetsEncrypt data center), I think it's hard to argue that retrying against LetsEncrypt up to 100 times isn't an absurdly high limit.

@felixfontein
Contributor

Do I understand it correctly that you are not aware of any case where a request had to be repeated up to (or close to) 100 times, but only of cases where it was repeated a few times (or even just once) across a large number of acme-tiny invocations?

@anthonyryan1
Author

I'm not sure I fully grasp the meaning behind your question.

If you're claiming that the 100 retries is only a notional limit that will rarely be hit in practice: I did not collect statistics on how many retries were used before switching to persistent TCP connections locally, so I have no information to share about how many of the 100 are actually used in practice. My concern would be the same if it were an unbounded number of retries.

If you're just asking me to confirm that I issued 3,000 certificates with acme-tiny modified for persistent TCP connections and 0 retries for badNonce errors, then I can confirm that is the case in my testing. badNonce errors are no longer happening, and retries appear entirely unnecessary in my testing so far.

@felixfontein
Contributor

You wrote

> I agree that it's infrequent, but my logs show badNonce errors for approximately 1 out of every thousand certificates issued by acme-tiny.

which indicated to me that you got only a single badNonce error every ~1000 certificates. I now see that you didn't write that there was one badNonce error, but that one request had badNonce errors (potentially multiple). In any case, I don't think it would get anywhere near 100 retries, and so far there has never been any indication that this would be the case. The client I'm using (which isn't acme-tiny) is also not using a persistent connection (same as acme-tiny), but it limits its badNonce retries to a maximum of 5 per request, and I've never seen that limit being reached. So I really doubt that it will get anywhere close to 100 (unless there's a bug on the CA side).
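As a sketch of what a small retry budget with a pause could look like (the limit, sleep interval, and helper names are illustrative, not taken from acme-tiny or any other client):

```python
# Sketch of a bounded badNonce retry with a pause between attempts.
# All names and numbers here are illustrative, not acme-tiny's actual code.
import time

class BadNonceError(Exception):
    """Assumed to be raised when the CA responds with urn:ietf:params:acme:error:badNonce."""

MAX_BAD_NONCE_RETRIES = 5

def signed_request_with_retry(send_signed_request, url, payload):
    # send_signed_request is a hypothetical callable performing one signed ACME request.
    for attempt in range(MAX_BAD_NONCE_RETRIES):
        try:
            return send_signed_request(url, payload)
        except BadNonceError:
            if attempt == MAX_BAD_NONCE_RETRIES - 1:
                raise  # give up after a handful of attempts instead of 100
            time.sleep(1)  # pause before retrying instead of hammering the CA
```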

In any case, using the same TLS connection for all requests is nice if it works, but it also has a potential downside: you need a mechanism to re-create the connection in case it gets closed or interrupted for whatever reason. While that isn't too hard, it does increase complexity and does not fit the goal of this project.
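A rough sketch of that extra mechanism, assuming the http.client approach from earlier in the thread (names are illustrative):

```python
# Sketch of the reconnect handling a persistent connection needs:
# if the shared connection was closed, re-create it once and retry the request.
# Illustrative only; not acme-tiny's code.
import http.client

ACME_HOST = "acme-v02.api.letsencrypt.org"  # assumed host
conn = http.client.HTTPSConnection(ACME_HOST)

def request_with_reconnect(method, path, body=None, headers=None):
    global conn
    try:
        conn.request(method, path, body=body, headers=headers or {})
        return conn.getresponse()
    except (http.client.NotConnected, http.client.RemoteDisconnected, ConnectionError):
        conn.close()
        conn = http.client.HTTPSConnection(ACME_HOST)  # re-create the dropped connection
        conn.request(method, path, body=body, headers=headers or {})
        return conn.getresponse()
```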
