feat(sitemap): site discoverability check (robots.txt + noindex) #357

@arberx

Summary

Add a discoverability check that runs alongside bing inspect-sitemap / gsc inspect-sitemap (and likely as its own command) to surface URLs in the sitemap that are blocked from indexing by local signals — independent of GSC/Bing verdicts.

Why

GSC's inspectUrl returns robotsTxtState only after Google has crawled the URL. Bing's GetUrlInfo returns less detail, and the existing GetCrawlIssues cross-check (packages/canonry/src/bing-inspect-sitemap.ts:88-103) only catches issues Bing has already flagged. A local check answers "why isn't this indexed?" without waiting for the indexer to surface a verdict.

Classic conflict it catches: a URL is in the sitemap (telling search engines to index it) and is also blocked by a robots.txt Disallow rule, a <meta name="robots" content="noindex"> tag, or an X-Robots-Tag: noindex header.

Scope

For each URL discovered during a sitemap inspection run:

  1. Match path against /robots.txt Disallow: rules for Googlebot / Bingbot / *
  2. Fetch the page; scan HTML for <meta name="robots" content="noindex"> and per-bot variants (googlebot, bingbot)
  3. Inspect the X-Robots-Tag response header for noindex
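Steps 2 and 3 could look roughly like the sketch below. It is a minimal illustration, not the proposed implementation: the function names (hasNoindexMeta, hasNoindexHeader, getAttr) are hypothetical, the meta scan is regex-based rather than a real HTML parser, and the header parser assumes all X-Robots-Tag values arrive as one comma-joined string. Only the literal noindex directive is matched; related directives such as none are ignored here.

```typescript
// Hypothetical helpers for illustration; none of these names exist in the codebase.

// Meta tag names to inspect: the generic one plus the per-bot variants.
const META_NAMES = new Set(["robots", "googlebot", "bingbot"]);

// Crawlers whose X-Robots-Tag directives we care about ("*" = no prefix).
const HEADER_AGENTS = new Set(["*", "googlebot", "bingbot"]);

// Directives that carry their own "name: value" syntax and are not bot prefixes.
const VALUE_DIRECTIVES = new Set([
  "unavailable_after",
  "max-snippet",
  "max-image-preview",
  "max-video-preview",
]);

/** Read one attribute value (quoted or unquoted) out of a single tag string. */
function getAttr(tag: string, name: string): string | null {
  const re = new RegExp(
    `\\s${name}\\s*=\\s*(?:"([^"]*)"|'([^']*)'|([^\\s"'>]+))`,
    "i",
  );
  const m = re.exec(tag);
  return m ? (m[1] ?? m[2] ?? m[3] ?? "") : null;
}

/** Step 2: true if any robots/googlebot/bingbot meta tag contains noindex. */
function hasNoindexMeta(html: string): boolean {
  // Drop HTML comments first so commented-out tags are not counted.
  const withoutComments = html.replace(/<!--[\s\S]*?-->/g, "");
  const metaTags = withoutComments.match(/<meta\b[^>]*>/gi) ?? [];
  return metaTags.some((tag) => {
    const name = getAttr(tag, "name")?.trim().toLowerCase();
    if (!name || !META_NAMES.has(name)) return false;
    const content = (getAttr(tag, "content") ?? "").toLowerCase();
    return content.split(",").some((d) => d.trim() === "noindex");
  });
}

/** Step 3: true if the X-Robots-Tag value sets noindex for a crawler we track. */
function hasNoindexHeader(value: string | null): boolean {
  if (!value) return false;
  let agent = "*"; // directives without a bot prefix apply to every crawler
  for (const raw of value.toLowerCase().split(",")) {
    let token = raw.trim();
    const m = /^([a-z0-9_-]+)\s*:\s*(.*)$/.exec(token);
    if (m && !VALUE_DIRECTIVES.has(m[1])) {
      // "googlebot: noindex" switches the agent for this and later directives.
      agent = m[1];
      token = m[2].trim();
    }
    if (token === "noindex" && HEADER_AGENTS.has(agent)) return true;
  }
  return false;
}
```

With fetch, the header value would come from `response.headers.get("x-robots-tag")` and the HTML from `await response.text()`. Step 1 is left out on purpose, since the notes below suggest delegating robots.txt matching to a parser library.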

Surface as new fields on the inspection record:

  • localBlocked: boolean
  • localBlockReason: 'robots-disallow' | 'meta-noindex' | 'x-robots-tag-noindex' | null
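As a rough sketch, the two fields could be typed and derived as shown below. The LocalSignals shape and the toLocalBlockFields name are hypothetical, and the precedence between reasons is an assumption, since the record holds a single reason while a URL can trip several checks at once. robots.txt is ranked first here because a compliant crawler never fetches a disallowed page, so it cannot see that page's noindex signals; the header-before-meta order is arbitrary.

```typescript
// Field names and the reason union come from the issue text;
// everything else here is a hypothetical illustration.
type LocalBlockReason =
  | "robots-disallow"
  | "meta-noindex"
  | "x-robots-tag-noindex"
  | null;

interface LocalBlockFields {
  localBlocked: boolean;
  localBlockReason: LocalBlockReason;
}

/** Raw outcome of the three local checks for one URL (assumed shape). */
interface LocalSignals {
  robotsDisallowed: boolean;
  headerNoindex: boolean;
  metaNoindex: boolean;
}

/** Collapse the three signals into the two record fields. */
function toLocalBlockFields(signals: LocalSignals): LocalBlockFields {
  let reason: LocalBlockReason = null;
  if (signals.robotsDisallowed) {
    // Assumed to win: the page is never fetched, so noindex goes unseen.
    reason = "robots-disallow";
  } else if (signals.headerNoindex) {
    reason = "x-robots-tag-noindex";
  } else if (signals.metaNoindex) {
    reason = "meta-noindex";
  }
  return { localBlocked: reason !== null, localBlockReason: reason };
}
```

If reporting every matching reason turns out to be more useful for audits, localBlockReason could become an array instead; that would be a change to the shape proposed above.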

Non-goals

  • Not for sitemap discovery: robots.txt Sitemap: directive parsing is a separate (probably-not-worth-it) change.
  • Not a full crawler — only checks URLs already in the sitemap.

Notes

  • robots-parser (12M weekly downloads, MIT) handles step 1 cleanly.
  • Steps 2 and 3 are a fetch + a small meta-tag regex (or cheerio if we want to be careful about commented-out tags).
  • Logic overlaps with both bing-inspect-sitemap and gsc-inspect-sitemap, so it likely belongs in a shared helper consumed by each executor.
  • Possibly also a standalone command: canonry <project> discoverability for ad-hoc audits.

Out of band

Discovered while reviewing #354 / #356 — split out from that work to keep PRs focused.
