feat(sitemap): site discoverability check (robots.txt + noindex) #357

## Summary

Add a discoverability check that runs alongside `bing inspect-sitemap` / `gsc inspect-sitemap` (and likely as its own command) to surface URLs in the sitemap that are blocked from indexing by **local** signals — independent of GSC/Bing verdicts.

## Why

GSC's `inspectUrl` returns `robotsTxtState` only after Google has crawled the URL. Bing's `GetUrlInfo` returns less detail, and the existing `GetCrawlIssues` cross-check (`packages/canonry/src/bing-inspect-sitemap.ts:88-103`) only catches issues Bing has already flagged. A local check answers "why isn't this indexed?" without waiting for the indexer to surface a verdict.

The classic conflict it catches: a URL is listed in the sitemap (asking search engines to index it) but is blocked by a `robots.txt` `Disallow` rule, a `<meta name="robots" content="noindex">` tag, or an `X-Robots-Tag: noindex` header.

## Scope

For each URL discovered during a sitemap inspection run:

1. Match the URL path against `/robots.txt` `Disallow:` rules (taking any overriding `Allow:` rules into account) for `Googlebot` / `Bingbot` / `*`
2. Fetch the page; scan HTML for `<meta name="robots" content="noindex">` and per-bot variants (`googlebot`, `bingbot`)
3. Inspect the `X-Robots-Tag` response header for `noindex`
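
A rough sketch of steps 2 and 3, operating on a response body and header value that have already been fetched. The function names and the `ROBOTS_META_NAMES` / `SCOPED_BOTS` constants are hypothetical, not existing canonry code. The comment stripping is a crude stand-in for a real HTML parser, and the header check simplifies user-agent scoping (see the comments), so treat this as a starting point rather than the implementation.

```typescript
// Hypothetical names throughout; nothing here exists in the codebase yet.
const ROBOTS_META_NAMES = ['robots', 'googlebot', 'bingbot'];
const SCOPED_BOTS = ['googlebot', 'bingbot'];

/** True if a robots-style <meta> tag outside HTML comments carries `noindex`. */
function hasMetaNoindex(html: string): boolean {
  // Drop HTML comments first so commented-out tags are ignored.
  const withoutComments = html.replace(/<!--[\s\S]*?-->/g, '');
  const metaTags: string[] = withoutComments.match(/<meta\b[^>]*>/gi) ?? [];
  return metaTags.some((tag) => {
    const name = /\sname\s*=\s*["']?([^"'\s>]+)/i.exec(tag)?.[1]?.toLowerCase();
    const content = /\scontent\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i.exec(tag);
    if (name === undefined || content === null) return false;
    if (!ROBOTS_META_NAMES.includes(name)) return false;
    const value = (content[1] ?? content[2] ?? content[3] ?? '').toLowerCase();
    return value.split(',').some((directive) => directive.trim() === 'noindex');
  });
}

/**
 * True if an X-Robots-Tag header value contains `noindex`, either unscoped
 * or as the first directive after a `googlebot:` / `bingbot:` prefix.
 * Simplification: directives after the first comma are treated as unscoped.
 */
function hasXRobotsNoindex(headerValue: string | null): boolean {
  if (!headerValue) return false;
  return headerValue.toLowerCase().split(',').some((part) => {
    const directive = part.trim();
    if (directive === 'noindex') return true;
    const scoped = /^([a-z0-9_-]+)\s*:\s*(.+)$/.exec(directive);
    return (
      scoped !== null &&
      SCOPED_BOTS.includes(scoped[1]) &&
      scoped[2].trim() === 'noindex'
    );
  });
}
```

If the regex proves too brittle on real pages, the same two signatures could wrap a cheerio-based lookup without changing callers.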

Surface as new fields on the inspection record:

- `localBlocked: boolean`
- `localBlockReason: 'robots-disallow' | 'meta-noindex' | 'x-robots-tag-noindex' | null`
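
One way the two fields could be derived from the three checks, as a sketch. The `LocalSignals` shape and `toLocalBlockFields` name are hypothetical. The precedence is also an assumption: `robots-disallow` wins because a compliant crawler that is blocked by `robots.txt` never fetches the page and so never sees a `noindex`; the ordering between header and meta tag is arbitrary.

```typescript
type LocalBlockReason =
  | 'robots-disallow'
  | 'meta-noindex'
  | 'x-robots-tag-noindex'
  | null;

interface LocalBlockFields {
  localBlocked: boolean;
  localBlockReason: LocalBlockReason;
}

// Hypothetical input: one boolean per check in the Scope list.
interface LocalSignals {
  robotsDisallowed: boolean;
  metaNoindex: boolean;
  xRobotsNoindex: boolean;
}

/** Collapses the three signals into the two record fields (assumed precedence). */
function toLocalBlockFields(signals: LocalSignals): LocalBlockFields {
  const reason: LocalBlockReason = signals.robotsDisallowed
    ? 'robots-disallow'
    : signals.xRobotsNoindex
      ? 'x-robots-tag-noindex'
      : signals.metaNoindex
        ? 'meta-noindex'
        : null;
  return { localBlocked: reason !== null, localBlockReason: reason };
}
```

Since only one reason is reported, a URL blocked by several signals at once shows just the highest-precedence one; if that turns out to hide useful detail, the field could become an array instead.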

## Non-goals

- **Not** for sitemap *discovery* — `robots.txt` `Sitemap:` directive parsing is a separate (probably-not-worth-it) change.
- Not a full crawler — only checks URLs already in the sitemap.

## Notes

- `robots-parser` (12M weekly downloads, MIT) handles step 1 cleanly.
- Steps 2 and 3 are a fetch + a small meta-tag regex (or cheerio if we want to be careful about commented-out tags).
- Logic overlaps with both `bing-inspect-sitemap` and `gsc-inspect-sitemap`, so it likely belongs in a shared helper consumed by each executor.
- Possibly also a standalone command: `canonry <project> discoverability` for ad-hoc audits.
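
A sketch of what the shared helper for step 1 might look like. It depends only on a small structural interface so that both executors (and tests) can pass in any robots.txt matcher. The assumption is that the object returned by `robots-parser` satisfies it, with `isAllowed(url, ua)` returning `false` for a disallowed URL and `undefined` for a URL outside the robots.txt origin; that should be verified against the installed version. `RobotsLike`, `CHECKED_USER_AGENTS`, and `disallowedUserAgents` are hypothetical names.

```typescript
// Minimal structural type covering the one method this sketch relies on.
interface RobotsLike {
  isAllowed(url: string, userAgent?: string): boolean | undefined;
}

// User agents named in the Scope section.
const CHECKED_USER_AGENTS = ['Googlebot', 'Bingbot', '*'];

/**
 * Returns the subset of checked user agents for which the URL is disallowed.
 * An `undefined` answer (URL not covered by this robots.txt) is treated as
 * "not blocked", which is an assumption of this sketch.
 */
function disallowedUserAgents(robots: RobotsLike, url: string): string[] {
  return CHECKED_USER_AGENTS.filter((ua) => robots.isAllowed(url, ua) === false);
}
```

An executor would presumably call this as `disallowedUserAgents(robotsParser(robotsTxtUrl, robotsTxtBody), pageUrl)` and treat a non-empty result as `robots-disallow`.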

## Out of band

Discovered while reviewing #354 / #356 — split out from that work to keep PRs focused.
