Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix concurrency issue of Terraform provider cache #21805

Merged
merged 11 commits into from
Jan 29, 2025

Conversation

lilatomic
Copy link
Contributor

@lilatomic lilatomic commented Jan 2, 2025

The Terraform provider cache is not concurrency safe.
The worst concurrency bug with the Terraform provider cache is a race condition when nonatomic moves result in another process capturing incorrect hashes in the lockfile which gets pulled into the Pants cache and then poisons it.

After lots of debugging and help from the community, we've determined that the best way forward is to isolate each module into its own Terraform cache. This prevents concurrent access to the Terraform cache.

One possible solution would have been to not cache during cache-unsafe operations. This is harder than it seems, because envvars can override user cache settings in Terraform RC files but they cannot unset it.

fixes #21804

@lilatomic lilatomic added category:bugfix Bug fixes for released features backend: Terraform Terraform backend-related issues labels Jan 2, 2025
@lilatomic lilatomic force-pushed the terraform/fix/cache-concurrency branch from 9d22f89 to e975123 Compare January 19, 2025 04:45
The worst concurrency bug with the Terraform provider cache
is when nonatomic moves result in incorrect hashes in the lockfile
which gets pulled into the Pants cache and then poisons it.
So we just don't use the provider cache in that case.

I think there's still the possibility that the nonatomic move triggers a problem with a module with a lockfile when the cache is being populated,
but at least that won't enter the cache.
I think that risk is lower since the files are fetched during `init` but used later (ex during `validate`),
by which time the copy operation has completed.
Terraform downloads providers, while locking providers.
We therefore skip the cache while generating lockfiles
it's not ideal but it works. In the worst case we're the same as a .terraform folder
This will cause it to override anything set in the .terraformrc file
Terraform expects the directory to already exist and will not create it.
@lilatomic lilatomic force-pushed the terraform/fix/cache-concurrency branch from e975123 to d97a1be Compare January 22, 2025 23:40
@lilatomic lilatomic added this to the 2.23.x milestone Jan 22, 2025
@lilatomic lilatomic requested a review from tdyas January 22, 2025 23:58
@jgranstrom
Copy link
Contributor

If there's any chance this could be cherry-picked that would be great, it does fix a fairly tricky-to-debug issue

@tdyas
Copy link
Contributor

tdyas commented Jan 23, 2025

If there's any chance this could be cherry-picked that would be great, it does fix a fairly tricky-to-debug issue

It is straightforward enough to mark this PR for the CI cherry picking automation. That said, seems more of a question for @lilatomic about how far back this can be back-ported.

@@ -450,6 +452,17 @@ async def setup_terraform_process(
def prepend_paths(paths: Tuple[str, ...]) -> Tuple[str, ...]:
return tuple((Path(request.chdir) / path).as_posix() for path in paths)

# Initialise the Terraform provider cache, since Terraform expects the directory to already exist.
await Get(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate invocations of mkdir and terraform would not work in remote execution because the mkdir and terraform invocations could potentially end up on separate workers. Moreover, it can even be in an issue in local execution because the mkdir could be cached and the terraform invocation uncached. Thus, the side effect of making the cache directory would not occur because Pants would see the mkdir process as cached.

They should be a singe process invocation running the shell with a custom script (doing the necessary replacements):

script = f"""mkdir -p {tf_plugin_cache_dir} && exec  __terraform/terraform  -chdir={shlex.quote(request.chdir)} {shlex.join(request.args)}"""

process = Process(argv=(bash.path, "-c", script), ...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: bash.path comes from a bash: BashBinary instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, that makes a lot of sense. I've fixed it, thank you for the direction on how.

This ensures that they're executed on the same machine.
As separate processes, a remote execution might execute these on different machines.
As well, it's possible that locally the mkdir is cached even  if it needs to be run again
@lilatomic lilatomic force-pushed the terraform/fix/cache-concurrency branch from ddfe4ea to d26e69e Compare January 26, 2025 16:11
@lilatomic
Copy link
Contributor Author

If there's any chance this could be cherry-picked that would be great, it does fix a fairly tricky-to-debug issue

It is straightforward enough to mark this PR for the CI cherry picking automation. That said, seems more of a question for @lilatomic about how far back this can be back-ported.

I've marked it for porting back to 2.23 . I think it will merge cleanly.

@lilatomic lilatomic merged commit eca3207 into pantsbuild:main Jan 29, 2025
24 checks passed
@WorkerPants
Copy link
Member

I tried to automatically cherry-pick this change back to each relevant milestone, so that it is available in those older releases of Pants.

❌ 2.23.x

I was unable to cherry-pick this PR to 2.23.x, likely due to merge-conflicts.

Steps to Cherry-Pick locally

To resolve:

  1. (Ensure your git working directory is clean)
  2. Run the following script to reproduce the merge-conflicts:
    git fetch https://github.com/pantsbuild/pants main \
      && git fetch https://github.com/pantsbuild/pants 2.23.x \
      && git checkout -b cherry-pick-21805-to-2.23.x FETCH_HEAD \
      && git cherry-pick eca32073999b4a15894baef4f3c3795ec91f37dc
  3. Fix the merge conflicts and commit the changes
  4. Run build-support/cherry_pick/make_pr.sh "21805" "2.23.x"

Please note that I cannot re-run CI if a job fails. Please work with your PR approver(s) to re-run CI if necessary.

❌ 2.24.x

I was unable to cherry-pick this PR to 2.24.x, likely due to merge-conflicts.

Steps to Cherry-Pick locally

To resolve:

  1. (Ensure your git working directory is clean)
  2. Run the following script to reproduce the merge-conflicts:
    git fetch https://github.com/pantsbuild/pants main \
      && git fetch https://github.com/pantsbuild/pants 2.24.x \
      && git checkout -b cherry-pick-21805-to-2.24.x FETCH_HEAD \
      && git cherry-pick eca32073999b4a15894baef4f3c3795ec91f37dc
  3. Fix the merge conflicts and commit the changes
  4. Run build-support/cherry_pick/make_pr.sh "21805" "2.24.x"

Please note that I cannot re-run CI if a job fails. Please work with your PR approver(s) to re-run CI if necessary.


When you're done manually cherry-picking, please remove the needs-cherrypick label on this PR.

Thanks again for your contributions!

🤖 Beep Boop here's my run link

@WorkerPants WorkerPants added the auto-cherry-picking-failed Auto Cherry-Picking Failed label Jan 29, 2025
lilatomic added a commit to lilatomic/pants that referenced this pull request Jan 29, 2025
The Terraform provider cache is not concurrency safe.
The worst concurrency bug with the Terraform provider cache is a race
condition when nonatomic moves result in another process capturing
incorrect hashes in the lockfile which gets pulled into the Pants cache
and then poisons it.

After [lots of debugging and help from the
community](https://chat.pantsbuild.org/t/26730983/i-m-finding-terraform-init-may-not-cache-stably-from-2-23-it#84153d7a-cedd-4600-838f-bd6ef99f6cc9),
we've determined that the best way forward is to isolate each module
into its own Terraform cache. This prevents concurrent access to the
Terraform cache.

One possible solution would have been to not cache during cache-unsafe
operations. This is harder than it seems, because envvars can override
user cache settings in Terraform RC files but they cannot _unset_ it.

fixes pantsbuild#21804
lilatomic added a commit to lilatomic/pants that referenced this pull request Jan 29, 2025
The Terraform provider cache is not concurrency safe.
The worst concurrency bug with the Terraform provider cache is a race
condition when nonatomic moves result in another process capturing
incorrect hashes in the lockfile which gets pulled into the Pants cache
and then poisons it.

After [lots of debugging and help from the
community](https://chat.pantsbuild.org/t/26730983/i-m-finding-terraform-init-may-not-cache-stably-from-2-23-it#84153d7a-cedd-4600-838f-bd6ef99f6cc9),
we've determined that the best way forward is to isolate each module
into its own Terraform cache. This prevents concurrent access to the
Terraform cache.

One possible solution would have been to not cache during cache-unsafe
operations. This is harder than it seems, because envvars can override
user cache settings in Terraform RC files but they cannot _unset_ it.

fixes pantsbuild#21804
cburroughs pushed a commit that referenced this pull request Jan 31, 2025
) (#21887)

The Terraform provider cache is not concurrency safe.
The worst concurrency bug with the Terraform provider cache is a race
condition when nonatomic moves result in another process capturing
incorrect hashes in the lockfile which gets pulled into the Pants cache
and then poisons it.

After [lots of debugging and help from the
community](https://chat.pantsbuild.org/t/26730983/i-m-finding-terraform-init-may-not-cache-stably-from-2-23-it#84153d7a-cedd-4600-838f-bd6ef99f6cc9),
we've determined that the best way forward is to isolate each module
into its own Terraform cache. This prevents concurrent access to the
Terraform cache.

One possible solution would have been to not cache during cache-unsafe
operations. This is harder than it seems, because envvars can override
user cache settings in Terraform RC files but they cannot _unset_ it.

fixes #21804
cburroughs pushed a commit that referenced this pull request Jan 31, 2025
) (#21889)

The Terraform provider cache is not concurrency safe.
The worst concurrency bug with the Terraform provider cache is a race
condition when nonatomic moves result in another process capturing
incorrect hashes in the lockfile which gets pulled into the Pants cache
and then poisons it.

After [lots of debugging and help from the
community](https://chat.pantsbuild.org/t/26730983/i-m-finding-terraform-init-may-not-cache-stably-from-2-23-it#84153d7a-cedd-4600-838f-bd6ef99f6cc9),
we've determined that the best way forward is to isolate each module
into its own Terraform cache. This prevents concurrent access to the
Terraform cache.

One possible solution would have been to not cache during cache-unsafe
operations. This is harder than it seems, because envvars can override
user cache settings in Terraform RC files but they cannot _unset_ it.

fixes #21804
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-cherry-picking-failed Auto Cherry-Picking Failed backend: Terraform Terraform backend-related issues category:bugfix Bug fixes for released features needs-cherrypick
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Terraform cache is not concurrency safe
4 participants