Summary
Add a discoverability check that runs alongside bing inspect-sitemap / gsc inspect-sitemap (and likely as its own command) to surface URLs in the sitemap that are blocked from indexing by local signals — independent of GSC/Bing verdicts.
Why
GSC's inspectUrl returns robotsTxtState only after Google has crawled. Bing's GetUrlInfo is thinner; the existing GetCrawlIssues cross-check (packages/canonry/src/bing-inspect-sitemap.ts:88-103) only catches issues Bing has already flagged. A local check explains "why isn't this indexed?" without waiting on the indexer to surface a verdict.
Classic conflict it catches: URL is in the sitemap (telling search engines to index it) AND is blocked by robots.txt Disallow OR <meta name="robots" content="noindex"> OR X-Robots-Tag: noindex.
Scope
For each URL discovered during a sitemap inspection run:
- Match path against /robots.txt Disallow: rules for Googlebot / Bingbot / *
- Fetch the page; scan HTML for <meta name="robots" content="noindex"> and per-bot variants (googlebot, bingbot)
- Inspect the X-Robots-Tag response header for noindex
Surface as new fields on the inspection record:
localBlocked: boolean
localBlockReason: 'robots-disallow' | 'meta-noindex' | 'x-robots-tag-noindex' | null
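A minimal sketch of how the three checks could collapse into those two fields. Only localBlocked, localBlockReason, and the reason literals come from this issue; LocalSignals, classifyLocalBlock, and the precedence order are assumptions for illustration. The robotsDisallowed flag is expected to come from whatever does the robots.txt match (for example robots-parser's isDisallowed), which is deliberately left out so the snippet stays dependency-free.

```typescript
// Reason literals are taken from the issue; everything else here is hypothetical.
type LocalBlockReason =
  | 'robots-disallow'
  | 'meta-noindex'
  | 'x-robots-tag-noindex'
  | null;

// Hypothetical input: one boolean per local check, computed elsewhere.
interface LocalSignals {
  robotsDisallowed: boolean; // step 1: robots.txt Disallow match
  metaNoindex: boolean;      // step 2: <meta name="robots" content="noindex"> found
  headerNoindex: boolean;    // step 3: X-Robots-Tag contains noindex
}

interface LocalBlockFields {
  localBlocked: boolean;
  localBlockReason: LocalBlockReason;
}

// Assumed precedence: robots.txt wins, because a disallowed URL is never fetched
// by the crawler, so any noindex on the page itself would go unseen.
// The header is checked before the meta tag; that ordering is arbitrary.
function classifyLocalBlock(s: LocalSignals): LocalBlockFields {
  let reason: LocalBlockReason = null;
  if (s.robotsDisallowed) reason = 'robots-disallow';
  else if (s.headerNoindex) reason = 'x-robots-tag-noindex';
  else if (s.metaNoindex) reason = 'meta-noindex';
  return { localBlocked: reason !== null, localBlockReason: reason };
}
```

Since localBlockReason holds a single value, a URL blocked by more than one signal reports only the highest-precedence one. If audits need every signal, an array-valued field would be an alternative shape.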
Non-goals
- Not for sitemap discovery; robots.txt Sitemap: directive parsing is a separate (probably-not-worth-it) change.
- Not a full crawler; only checks URLs already in the sitemap.
Notes
- robots-parser (12M weekly downloads, MIT) handles step 1 cleanly.
- Steps 2 and 3 are a fetch + a small meta-tag regex (or cheerio if we want to be careful about commented-out tags).
- Logic overlaps with both bing-inspect-sitemap and gsc-inspect-sitemap, so it likely belongs in a shared helper consumed by each executor.
- Possibly also a standalone command: canonry <project> discoverability for ad-hoc audits.
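As a rough illustration of the regex route for steps 2 and 3, the sketch below strips HTML comments before scanning meta tags and treats X-Robots-Tag as a comma-separated list with optional per-bot prefixes. The function names hasMetaNoindex and hasHeaderNoindex are hypothetical, and the parsing is intentionally simple: unquoted content attributes and dates containing commas in unavailable_after are not handled.

```typescript
// Hypothetical helpers; names are illustrative and not from the canonry codebase.
const META_NAMES = ['robots', 'googlebot', 'bingbot'];

// True when a generic or per-bot robots meta tag carries a noindex directive.
function hasMetaNoindex(html: string): boolean {
  // Remove HTML comments first so commented-out tags are not counted.
  const visible = html.replace(/<!--[\s\S]*?-->/g, '');
  const tags = visible.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of tags) {
    const name = /\sname\s*=\s*["']?([^"'\s>]+)/i.exec(tag)?.[1]?.toLowerCase();
    const content = /\scontent\s*=\s*["']([^"']*)["']/i.exec(tag)?.[1]?.toLowerCase();
    if (!name || content === undefined) continue;
    if (!META_NAMES.includes(name)) continue;
    const directives = content.split(',').map((d) => d.trim());
    if (directives.includes('noindex')) return true;
  }
  return false;
}

// True when the X-Robots-Tag value has noindex for all crawlers,
// or scoped to googlebot / bingbot via a "botname: directive" prefix.
function hasHeaderNoindex(headerValue: string | null): boolean {
  if (!headerValue) return false;
  let scope: string | null = null; // null means the directive applies to every crawler
  for (const raw of headerValue.toLowerCase().split(',')) {
    let token = raw.trim();
    const m = /^([a-z0-9_-]+)\s*:\s*(.*)$/.exec(token);
    if (m && m[1] !== 'unavailable_after') {
      scope = m[1];        // later directives stay scoped to this bot
      token = m[2].trim();
    }
    const relevant = scope === null || scope === 'googlebot' || scope === 'bingbot';
    if (relevant && token === 'noindex') return true;
  }
  return false;
}
```

In a real run the header value would come from the fetch response, for example response.headers.get('x-robots-tag'), and the HTML from response.text(). Swapping hasMetaNoindex for a cheerio-based lookup would keep the same signature.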
Out of band
Discovered while reviewing #354 / #356 — split out from that work to keep PRs focused.