Skip to content

fix: fail doctor when shared Dolt rig databases are missing#3449

Open
Bella-Giraffety wants to merge 4 commits intogastownhall:mainfrom
Bella-Giraffety:polecat/mayor-gs-635
Open

fix: fail doctor when shared Dolt rig databases are missing#3449
Bella-Giraffety wants to merge 4 commits intogastownhall:mainfrom
Bella-Giraffety:polecat/mayor-gs-635

Conversation

@Bella-Giraffety
Copy link
Copy Markdown

@Bella-Giraffety Bella-Giraffety commented Mar 31, 2026

Summary

  • make dolt-server-reachable verify expected server-mode databases instead of treating a listening TCP socket as healthy
  • query the exact metadata-configured Dolt server address, including non-default local ports, alias database names, and non-loopback hosts
  • preserve retry behavior for startup races while keeping the fix scoped to the doctor health-check path
  • add focused regression coverage for missing expected databases, verification errors, configured server ports, and rigs whose dolt_database differs from the rig name

Related Issue

  • Fixes gs-635

Changes

  • update internal/doctor/migration_check.go so the doctor reachability check:
    • groups server-mode rigs by configured server address
    • still fails on unreachable servers as before
    • now also verifies that each configured server is actually serving the databases named in rig metadata
    • reports a real failure when the server is reachable but expected databases are missing
  • add VerifyExpectedDatabasesAtConfig in internal/doltserver/doltserver.go so doctor can check a specific configured server address without relying on the town default Dolt config
  • add focused tests in internal/doctor/migration_check_test.go for:
    • reachable server with missing expected rig database
    • reachable server where verification errors out
    • configured database name differing from rig name
    • configured server port coming from metadata/config

Bug

During the coder_dotfiles bootstrap investigation, gt doctor could report shared Dolt as healthy as soon as the socket accepted connections even when one or more configured rig databases were absent or unreadable. That hid the real failure mode: HQ looked reachable, but the rig database needed by the town was missing from the running server.

Why this fix

The smallest safe fix is to keep the change inside the existing doctor reachability check and reuse explicit SHOW DATABASES verification against the server address declared by rig metadata. This avoids broad startup or repair changes while making the health signal honest for the exact failure we reproduced.

Testing

  • Unit tests pass for the touched area
  • Manual testing performed

Focused command run:

  • GOTOOLCHAIN=auto go test ./internal/doctor -run 'TestDoltServerReachableCheck|TestGetServerAddr'

Regression cases added:

  • TestDoltServerReachableCheck_FailsWhenExpectedRigDatabaseMissing
  • TestDoltServerReachableCheck_FailsWhenDatabaseVerificationErrors
  • TestDoltServerReachableCheck_UsesConfiguredDatabaseNameNotRigName
  • TestGetServerAddr_UsesConfigYAMLPort

Review process

This branch was reviewed in five independent passes before implementation and five more passes against the implementation branch. The final patch incorporates review findings about:

  • exact-address verification instead of town-default verification
  • startup-race retries in the verification helper
  • alias database names where dolt_database != rig name
  • keeping the change scoped to the doctor path
  • operator-facing diagnostics and fix-hint behavior

Checklist

  • Code follows project style
  • Documentation updated where needed in the PR body/testing notes
  • No breaking changes intended

@github-actions github-actions bot added the status/needs-triage Inbox — we haven't looked at it yet label Mar 31, 2026
@Bella-Giraffety
Copy link
Copy Markdown
Author

Rebuilt this PR on a clean upstream main base and force-pushed it so it now contains only the doctor shared-Dolt database verification change and its focused tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status/needs-triage Inbox — we haven't looked at it yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants