@YakDriver commented Oct 3, 2025

Rollback Plan

If a change needs to be reverted, we will publish an updated version of the library.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

Description

Forensic Report: Transit Gateway Route Table Association/Propagation ForceNew Behavior

Executive Summary

This report documents the history of ForceNew behavior changes in the aws_ec2_transit_gateway_route_table_association and aws_ec2_transit_gateway_route_table_propagation resources, the problems those changes attempted to solve, the new problems they created, and the justification for reverting them.

Timeline of Events

Original State (Pre-v5.87.0)

  • Behavior: transit_gateway_attachment_id had ForceNew: true
  • Effect: Any change to the attachment ID (including when it became unknown during planning) would force resource recreation
  • Problem: This was the correct behavior for actual attachment changes, but caused unnecessary recreation when the attachment ID became unknown due to unrelated dependency changes

March 2023: Issue #30085 Reported

  • Date: March 17, 2023
  • Reporter: @marcincuber
  • Problem: Updating allowed_prefixes on aws_dx_gateway_association caused unnecessary recreation of aws_ec2_transit_gateway_route_table_association
  • Root Cause:
    • User configuration used data source aws_ec2_transit_gateway_dx_gateway_attachment to get attachment ID
    • When aws_dx_gateway_association.allowed_prefixes changed, Terraform deferred the data source read to apply time
    • This made transit_gateway_attachment_id unknown during planning
    • ForceNew: true on unknown value triggered replacement
  • Impact: 4-10 minutes of network downtime during DX gateway updates
  • Community Response: 15 thumbs up and multiple comments over nearly 2 years requesting a fix

April 2024: Issue #36889 Reported

February 2025: PR #41292 - First Workaround Attempt

  • Date: Merged February 11, 2025
  • Release: v5.87.0 (February 14, 2025)
  • Resource: aws_ec2_transit_gateway_route_table_association
  • Change:
    • Removed ForceNew: true from transit_gateway_attachment_id
    • Added CustomizeDiff logic to conditionally apply ForceNew
  • Logic: Only force new when:
    • Old value is empty string, OR
    • New value is not empty string, OR
    • Plan value is known, OR
    • State is null/unknown, OR
    • State value is unknown
  • Intent: Avoid recreation when attachment ID becomes unknown but won't actually change
  • Closed: Issue #30085 ([Bug]: expanding allowed prefixes aws_dx_gateway_association causes downtime)
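The five conditions above can be made concrete with a minimal, stdlib-only Go sketch of the decision as this report describes it. This is not the provider's actual CustomizeDiff code. The `attachmentIDDiff` type, its fields, and `shouldForceNew` are hypothetical names. The convention that an unknown planned value surfaces as an empty string is modeled with a plain `string`, and the attachment IDs are placeholders.

```go
package main

import "fmt"

// attachmentIDDiff is a hypothetical, simplified view of what the
// conditional ForceNew check can observe about transit_gateway_attachment_id.
type attachmentIDDiff struct {
	Old                string // value in state; "" if there is none
	New                string // planned value; "" when unknown at plan time
	PlanKnown          bool   // the planned value is known
	StateNullOrUnknown bool   // the prior state is null or unknown
	StateValueUnknown  bool   // the attribute's value in state is unknown
}

// shouldForceNew mirrors the five OR-ed conditions: any one of them forces
// replacement. ForceNew is skipped only when every condition is false.
func shouldForceNew(d attachmentIDDiff) bool {
	return d.Old == "" ||
		d.New != "" ||
		d.PlanKnown ||
		d.StateNullOrUnknown ||
		d.StateValueUnknown
}

func main() {
	// Known ID in state, unknown planned value: ForceNew is skipped.
	unknownNew := attachmentIDDiff{Old: "tgw-attach-1111"}
	// Planned value is known and different: ForceNew applies.
	knownChange := attachmentIDDiff{Old: "tgw-attach-1111", New: "tgw-attach-2222", PlanKnown: true}

	fmt.Println(shouldForceNew(unknownNew))  // false
	fmt.Println(shouldForceNew(knownChange)) // true
}
```

In this model, the only path to update-in-place is "known old value, unknown new value, known state". That is exactly the shape both scenarios in the technical analysis share.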

July 2025: PR #43405 - Second Workaround Attempt

July 2025: PR #43436 - Proper Solution for DX Gateway

  • Date: Merged July 22, 2025
  • Release: v6.5.0 (July 24, 2025)
  • Resource: aws_dx_gateway_association
  • Change: Added computed transit_gateway_attachment_id attribute
  • Impact: Eliminated need for data source lookup in DX Gateway scenarios
  • Significance: This was the proper solution - expose the value directly rather than trying to guess when it changes

August 2025: Issue #43706 Reported

  • Date: August 4, 2025
  • Reporter: @industrialzombie
  • Problem: The CustomizeDiff workaround logic is fundamentally flawed
  • Scenario: VPC attachment with lifecycle rule forcing replacement when subnets change
  • Failure Mode:
    • VPC attachment is being replaced (attachment ID will change)
    • Attachment ID becomes unknown during planning
    • CustomizeDiff logic incorrectly assumes it won't change
    • Plans update-in-place instead of replacement
    • Apply fails with "Provider produced inconsistent final plan"
    • Association/propagation gets destroyed but not recreated
    • Network connectivity lost
  • Status: Prioritized, assigned to @YakDriver

Technical Analysis

The Fundamental Flaw

The CustomizeDiff logic attempts to distinguish between two scenarios:

Scenario A (Intended to handle):

aws_dx_gateway_association.allowed_prefixes changes
  → Data source deferred to apply
  → Attachment ID unknown but won't change
  → Should NOT force recreation

Scenario B (Fails to handle):

aws_ec2_transit_gateway_vpc_attachment replaced
  → Attachment ID unknown AND WILL change
  → SHOULD force recreation
  → CustomizeDiff incorrectly allows update-in-place

The logic cannot distinguish between these scenarios because both present identically during planning:

  • Old value: known attachment ID
  • New value: empty string (unknown)
  • Plan: unknown
  • State: known
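A small Go sketch shows why the two scenarios cannot be told apart. It uses a hypothetical, simplified model of the conditional check: the only inputs available during planning are the four facts listed above, and they are identical in both scenarios. The names `planView`, `skipsForceNew`, `scenario`, and the field `realNewID` are illustrative assumptions. `realNewID` stands for the value that becomes known only at apply time, and it is deliberately never passed to the decision function.

```go
package main

import "fmt"

// planView is a hypothetical summary of what is visible at plan time.
type planView struct {
	Old        string // known attachment ID in state
	New        string // "" because the planned value is unknown
	PlanKnown  bool
	StateKnown bool
}

// skipsForceNew reports whether the workaround would plan an
// update-in-place instead of a replacement.
func skipsForceNew(v planView) bool {
	return v.Old != "" && v.New == "" && !v.PlanKnown && v.StateKnown
}

type scenario struct {
	name      string
	view      planView // what the check sees during planning
	realNewID string   // known only at apply time; invisible to the check
}

func main() {
	seen := planView{Old: "tgw-attach-1111", StateKnown: true}
	scenarios := []scenario{
		{"A: DX data source read deferred", seen, "tgw-attach-1111"},
		{"B: VPC attachment replaced", seen, "tgw-attach-2222"},
	}
	for _, s := range scenarios {
		changes := s.realNewID != s.view.Old
		fmt.Printf("%s: skipsForceNew=%v, ID actually changes=%v\n",
			s.name, skipsForceNew(s.view), changes)
	}
}
```

Both scenarios print `skipsForceNew=true`. Only Scenario B's ID actually changes, so the same plan-time decision is right for A and wrong for B.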

Why the Logic Fails

The condition that allows update-in-place:

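// o and n are the old (state) and new (planned) values of
// transit_gateway_attachment_id; an unknown new value appears as "".
if o.(string) == "" || n.(string) != "" {
    return d.ForceNew(names.AttrTransitGatewayAttachmentID)
}
// If we reach here: o is not empty, n is empty (unknown)
// Logic assumes: attachment ID won't change
// Reality: Could be either scenario A or B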

This assumption is incorrect when the attachment resource itself is being replaced.

Breaking Change Justification

Is Restoring ForceNew a Breaking Change?

Technically: Yes. Changing from non-ForceNew to ForceNew is considered breaking because it changes resource replacement behavior.

Practically: No, for the following reasons:

1. The Current Behavior is Broken

The CustomizeDiff workaround produces incorrect plans that fail during apply with "inconsistent final plan" errors. This is not a working feature - it's a bug that causes:

  • Failed applies
  • Destroyed resources without recreation
  • Network connectivity loss
  • Manual intervention required

2. Limited Actual Impact

The CustomizeDiff logic only prevents ForceNew in very specific conditions:

  • Old value must be non-empty (resource exists)
  • New value must be empty/unknown
  • Plan must be unknown
  • State must be known

In most real-world scenarios, the logic still forces recreation. To actually benefit from the workaround (Scenario A), and therefore be affected by restoring ForceNew, users would need to be:

  1. Using the workaround versions (v5.87.0+ or v6.4.0+)
  2. Making changes that defer data source reads
  3. Having those changes NOT actually affect the attachment ID
  4. Not experiencing the "inconsistent final plan" error

This is a narrow use case.

3. The Workaround Was Short-Lived

  • Association: v5.87.0 (Feb 14, 2025) to present = ~8 months
  • Propagation: v6.4.0 (Jul 17, 2025) to present = ~3 months

The workaround has not been in place long enough for widespread adoption.

4. The Proper Solution Exists

PR #43436 (v6.5.0) provided the correct solution for DX Gateway scenarios by exposing transit_gateway_attachment_id directly on aws_dx_gateway_association. Users should migrate to using this attribute instead of data source lookups.

For VPC attachments, the attachment ID is already available as the resource's id attribute, so no data source is needed.

5. Reverting Prevents Data Loss

The current behavior can cause:

  • Destroyed associations/propagations without recreation
  • Network connectivity loss
  • Production outages

Restoring ForceNew: true may cause unnecessary recreations in some edge cases, but it will never cause the "inconsistent final plan" error or leave resources in a destroyed state.
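The trade-off can be illustrated with a deliberately simplified Go model that compares the workaround with restored ForceNew in both scenarios. It is not Terraform core's actual consistency check. It assumes that the apply-time re-plan reruns the same decision with the now-known attachment ID and flags a mismatch between the planned and final action, which is the failure pattern this report describes. The `action` type, `plannedAction`, and `finalAction` are hypothetical names.

```go
package main

import "fmt"

type action string

const (
	update  action = "update"
	replace action = "replace"
)

// plannedAction models the plan-time decision when the new attachment ID is
// unknown. forceNewAlways=true models restored ForceNew: true; false models
// the workaround, which skips ForceNew for a known old and unknown new value.
func plannedAction(forceNewAlways bool) action {
	if forceNewAlways {
		return replace
	}
	return update
}

// finalAction models the apply-time re-plan once the ID is known (simplified
// assumption): a planned replacement stays a replacement, and a planned
// update turns into a replacement if the ID actually changed.
func finalAction(planned action, oldID, newID string) action {
	if planned == replace || oldID != newID {
		return replace
	}
	return update
}

func main() {
	oldID := "tgw-attach-1111"
	cases := []struct {
		scenario string
		newID    string
	}{
		{"A (ID unchanged)", oldID},
		{"B (ID changed)", "tgw-attach-2222"},
	}
	for _, forceNew := range []bool{false, true} {
		for _, c := range cases {
			p := plannedAction(forceNew)
			f := finalAction(p, oldID, c.newID)
			fmt.Printf("ForceNew=%v %s: planned=%s final=%s consistent=%v\n",
				forceNew, c.scenario, p, f, p == f)
		}
	}
}
```

Under this model, only the workaround in Scenario B yields `consistent=false`. Restored ForceNew is consistent in both scenarios, at the cost of an unnecessary replacement in Scenario A.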

Recommendation

Revert the CustomizeDiff workaround and restore ForceNew: true because:

  1. Correctness: The workaround produces incorrect plans that fail during apply
  2. Safety: ForceNew is the safe default: it may recreate the resource unnecessarily, but it won't fail mid-apply or leave the association/propagation destroyed without recreating it
  3. Proper Solution Available: PR #43436 (Add transit_gateway_attachment_id to aws_dx_gateway_association) provides the correct approach for DX Gateway scenarios
  4. Limited Impact: The workaround only worked in narrow scenarios and has been present for a short time
  5. Prevents Data Loss: Current behavior can destroy resources without recreation

Migration Path for Users

For DX Gateway Users (Scenario A)

Before (using data source):

data "aws_ec2_transit_gateway_dx_gateway_attachment" "this" {
  transit_gateway_id = aws_dx_gateway_association.this.associated_gateway_id
  dx_gateway_id      = aws_dx_gateway_association.this.dx_gateway_id
}

resource "aws_ec2_transit_gateway_route_table_association" "this" {
  transit_gateway_attachment_id  = data.aws_ec2_transit_gateway_dx_gateway_attachment.this.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.this.id
}

After (using direct attribute - requires v6.5.0+):

resource "aws_ec2_transit_gateway_route_table_association" "this" {
  transit_gateway_attachment_id  = aws_dx_gateway_association.this.transit_gateway_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.this.id
}

For VPC Attachment Users

Correct approach (no data source needed):

resource "aws_ec2_transit_gateway_vpc_attachment" "this" {
  subnet_ids         = var.subnet_ids
  transit_gateway_id = var.transit_gateway_id
  vpc_id             = var.vpc_id
}

resource "aws_ec2_transit_gateway_route_table_association" "this" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.this.id
  transit_gateway_route_table_id = var.route_table_id
}

Conclusion

Restoring ForceNew: true is justified as a bug fix rather than a breaking change because:

  • The current behavior is broken and causes apply failures
  • The workaround was a temporary measure that has been superseded by proper solutions
  • The impact is limited and the alternative (data loss) is worse
  • Users have a clear migration path to avoid unnecessary recreations

The lesson learned: Expose computed attributes directly rather than attempting to predict when unknown values will change.

Relations

Closes #43706 (aws_ec2_transit_gateway_route_table_association produces inconsistent final plan)

References

Output from Acceptance Testing

2025/10/03 17:22:57 Creating Terraform AWS Provider (SDKv2-style)...
2025/10/03 17:22:57 Initializing Terraform AWS Provider (SDKv2-style)...
=== RUN   TestAccTransitGateway_serial
=== PAUSE TestAccTransitGateway_serial
=== CONT  TestAccTransitGateway_serial
=== RUN   TestAccTransitGateway_serial/RouteTablePropagation_attachmentChange
=== PAUSE TestAccTransitGateway_serial/RouteTablePropagation_attachmentChange
=== RUN   TestAccTransitGateway_serial/RouteTableAssociation_attachmentChange
=== PAUSE TestAccTransitGateway_serial/RouteTableAssociation_attachmentChange
=== CONT  TestAccTransitGateway_serial/RouteTablePropagation_attachmentChange
=== CONT  TestAccTransitGateway_serial/RouteTableAssociation_attachmentChange
--- PASS: TestAccTransitGateway_serial (0.00s)
    --- PASS: TestAccTransitGateway_serial/RouteTablePropagation_attachmentChange (395.33s)
    --- PASS: TestAccTransitGateway_serial/RouteTableAssociation_attachmentChange (605.10s)
PASS
ok  	github.com/hashicorp/terraform-provider-aws/internal/service/ec2	610.578s

@YakDriver requested a review from a team as a code owner October 3, 2025 20:22