Summary
Connect a GitHub repo to a Canonry project so that site scans can cross-reference citation data with actual page content — producing file-level, citation-correlated recommendations and (in a later phase) opening PRs with fixes.
The adapter model is generic so WordPress, Shopify, and GitLab integrations slot in later without architectural changes.
The goal is to move from "you lost a citation" to "you lost a citation, and here's what to fix in src/app/blog/[slug]/page.tsx to win it back."
Scope: 4 Phases
Each phase is independently shippable.
| Phase | Description |
| --- | --- |
| 1 | GitHub PAT read-only repo scan + site-scan run kind |
| 2 | Page-keyword matching and citation-correlated recommendations |
| 3 | apply-fix for one narrow fix class (JSON-LD injection via PR) |
| 4 | GitHub App, draft-PR analysis, non-Git adapters (future, not scoped) |
Architecture
Package: packages/integration-github/
Follows the existing packages/integration-google/, packages/integration-bing/ convention. Depends on @ainyc/canonry-contracts only.
packages/integration-github/
    package.json            # @ainyc/canonry-integration-github
    src/
        index.ts            # public exports
        adapter.ts          # GitHubSiteSourceAdapter
        framework.ts        # framework detection + file→URL routing
        extract.ts          # structured data / heading / meta extraction from source files
        types.ts            # adapter-specific types (NOT shared DTOs; those go in contracts)
    test/
        framework.test.ts
        extract.test.ts
Shared types in packages/contracts/
// packages/contracts/src/site-source.ts

/** Generic adapter interface: GitHub is the first impl, WordPress/Shopify/GitLab follow the same shape */
interface SiteSourceAdapter {
  name: string
  displayName: string
  healthcheck(config: SiteSourceConfig): Promise<{ ok: boolean; message: string }>
  listPages(config: SiteSourceConfig): Promise<SitePage[]>
  getPage(config: SiteSourceConfig, urlOrPath: string): Promise<SitePage | null>
}

/** Remediation is a separate, opt-in interface; not all adapters support it */
interface SiteSourceRemediator {
  applyFix(config: SiteSourceConfig, recommendation: SiteRecommendationIntent): Promise<SiteFixResult>
}

interface SitePage {
  url: string
  path: string
  title: string | null
  content: string | null
  structuredData: StructuredDataItem[]
  headings: HeadingNode[]
  metaDescription: string | null
  lastModified: string | null
  sourceRef: SourceReference | null
}

/** Discriminated by adapter type: Git has file paths, CMS has post IDs */
type SourceReference =
  | { type: 'git-file'; filePath: string; lineStart?: number; lineEnd?: number }
  | { type: 'cms-post'; identifier: string; editUrl?: string }
  | { type: 'cms-product'; identifier: string; editUrl?: string }

/** Recommendation intent: enough to regenerate the fix at apply-time, not a stale patch */
interface SiteRecommendationIntent {
  type: string
  targetFile: string
  schemaType?: string
  content?: Record<string, unknown>
}

type SiteFixResult =
  | { applied: true; prUrl: string }
  | { applied: true; editUrl: string }
  | { applied: false; error: string }
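The two success variants of SiteFixResult share `applied: true`, so a caller has to narrow on which URL property is present. A minimal sketch of that narrowing follows; `describeFixResult` and its message strings are hypothetical and not part of the contract.

```typescript
// Copied from the contract above so the sketch compiles on its own.
type SiteFixResult =
  | { applied: true; prUrl: string }
  | { applied: true; editUrl: string }
  | { applied: false; error: string }

// Hypothetical helper: turn a fix result into a one-line message for CLI output.
function describeFixResult(result: SiteFixResult): string {
  // `applied` is a boolean literal, so this check isolates the failure variant.
  if (!result.applied) {
    return `Fix not applied: ${result.error}`
  }
  // Both remaining variants have `applied: true`; narrow by property presence.
  if ('prUrl' in result) {
    return `Opened pull request: ${result.prUrl}`
  }
  return `Content updated, review at: ${result.editUrl}`
}
```

If more success variants are added later, an explicit `kind` discriminant on each variant may be easier to narrow than property presence.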
Key design decisions
- Scan vs remediation are separate interfaces. SiteSourceAdapter handles read-only scanning. SiteSourceRemediator handles fix application. CMS adapters won't produce unified diffs; they'll produce content updates via their own API. This avoids Git-centric fields in the base interface.
- Store intent, not patches. Recommendations store SiteRecommendationIntent (what to fix, where, what schema type) plus the commitSha the scan was against. At apply-fix time, the adapter generates the fix against the current repo state. This avoids stale patches in the DB as the repo moves.
- Dedicated endpoint, not the runs route. The current POST /projects/:name/runs hard-gates on kind !== 'answer-visibility'. Rather than plumbing a new callback through RunRoutesOptions, create POST /projects/:name/site-scan, mirroring how GSC sync and sitemap inspection have their own trigger endpoints with dedicated callbacks.
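To illustrate "store intent, not patches", here is a rough sketch of regenerating a JSON-LD fix from the stored intent against the file as it exists at apply time. `injectJsonLd` is a hypothetical helper, and both the "skip files that already contain JSON-LD" rule and the "insert before `</head>`" placement are assumptions made for the example.

```typescript
// Copied from the contract above so the sketch compiles on its own.
interface SiteRecommendationIntent {
  type: string
  targetFile: string
  schemaType?: string
  content?: Record<string, unknown>
}

// Hypothetical helper: build the new file content from the intent and the
// current HTML, instead of replaying a diff saved at scan time.
// Returns null when the fix cannot (or should not) be applied.
function injectJsonLd(currentHtml: string, intent: SiteRecommendationIntent): string | null {
  if (!intent.schemaType) return null
  // Assumption: if the file gained a JSON-LD block since the scan, do nothing.
  if (/<script[^>]*type=["']application\/ld\+json["']/i.test(currentHtml)) return null
  const headClose = currentHtml.search(/<\/head>/i)
  if (headClose === -1) return null

  const jsonLd = JSON.stringify(
    { '@context': 'https://schema.org', '@type': intent.schemaType, ...intent.content },
    null,
    2,
  ).replace(/</g, '\\u003c') // keep a stray "</script>" in the data from closing the tag

  const block = `<script type="application/ld+json">\n${jsonLd}\n</script>\n`
  return currentHtml.slice(0, headClose) + block + currentHtml.slice(headClose)
}
```

Because the function only reads the current content, running it after the repo has moved produces either a fresh insertion or a clean `null`, never a patch that no longer applies.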
Run Kind Wiring: site-scan
This is not a drop-in. The current run route (packages/api-routes/src/runs.ts:26) hard-gates on kind !== 'answer-visibility'. The job runner in server.ts:375 only handles answer-visibility. Cloud API (apps/api/src/app.ts) doesn't wire run callbacks at all.
Approach: Follow the GSC sync pattern (google.ts → server.ts:333):
- Route handler creates a run record with kind: 'site-scan'
- Calls opts.onSiteScanRequested(runId, projectId) callback
- server.ts wires that callback to executeSiteScan()
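The three steps above can be sketched without committing to a web framework. Only `onSiteScanRequested` and `kind: 'site-scan'` come from this document; `RunRecord`, `SiteScanRouteOptions`, `makeSiteScanTrigger`, and the `'queued'` status are hypothetical names chosen for illustration.

```typescript
// Hypothetical shape of a run row; the real one lives in the contracts package.
interface RunRecord {
  id: string
  projectId: string
  kind: 'site-scan'
  status: 'queued' // assumption: new runs start queued
}

// Hypothetical plugin options: persistence plus the callback named in this doc.
interface SiteScanRouteOptions {
  createRun(run: RunRecord): void
  onSiteScanRequested?: (runId: string, projectId: string) => void
}

// Returns the function a POST /projects/:name/site-scan handler would call
// after resolving the project.
function makeSiteScanTrigger(opts: SiteScanRouteOptions, newId: () => string) {
  return function triggerSiteScan(projectId: string): RunRecord {
    const run: RunRecord = { id: newId(), projectId, kind: 'site-scan', status: 'queued' }
    opts.createRun(run) // 1. create the run record
    opts.onSiteScanRequested?.(run.id, projectId) // 2. hand off to whoever wired the callback
    return run
  }
}
```

In this shape server.ts would pass a callback that invokes executeSiteScan(), while a host that wires nothing (such as the cloud API today) still gets a run record without a crash.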
Files:
packages/contracts/src/run.ts:7 — add 'site-scan' to runKindSchema
packages/api-routes/src/site-source.ts (new) — route plugin with callback
packages/api-routes/src/index.ts:143+ — register plugin
packages/canonry/src/server.ts:325+ — wire callback
Auth & Storage Model
Secrets in ~/.canonry/config.yaml, non-secret config in DB. Follows existing Google/Bing pattern.
# ~/.canonry/config.yaml
github:
  token: ghp_xxxxxxxxxxxx   # PAT with repo read access
siteConnections DB table stores only non-secret config (adapter name, repo, branch). Connection store follows the Bing pattern in server.ts:194-235.
No split-brain:
- Auth credentials: ~/.canonry/config.yaml (source of truth)
- Connection config (which repo, which branch): siteConnections table
- Scan results: sitePages, siteRecommendations tables
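One way to picture the split is a small resolver that joins the token from the local config with the non-secret row from siteConnections at scan time. This is only a sketch: the interfaces, `resolveGitHubConfig`, and the `main` branch fallback are assumptions, not existing code.

```typescript
// Hypothetical shapes: the parsed ~/.canonry/config.yaml and one siteConnections row.
interface LocalConfig {
  github?: { token?: string }
}
interface SiteConnectionRow {
  projectId: string
  adapter: string
  config: { repo: string; branch?: string }
}
interface ResolvedGitHubConfig {
  token: string
  owner: string
  repo: string
  branch: string
}

// Combine the secret (local file) with the connection settings (DB).
// Neither side stores a copy of the other's data.
function resolveGitHubConfig(local: LocalConfig, row: SiteConnectionRow): ResolvedGitHubConfig {
  if (row.adapter !== 'github') {
    throw new Error(`Unsupported adapter: ${row.adapter}`)
  }
  const token = local.github?.token
  if (!token) {
    throw new Error('Missing github.token in ~/.canonry/config.yaml')
  }
  const [owner, repo, ...rest] = row.config.repo.split('/')
  if (!owner || !repo || rest.length > 0) {
    throw new Error(`Expected owner/repo, got: ${row.config.repo}`)
  }
  // Assumption: fall back to "main" when no branch was stored.
  return { token, owner, repo, branch: row.config.branch ?? 'main' }
}
```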
Database Schema
Three new tables in packages/db/src/schema.ts with corresponding migrations in migrate.ts:
site_connections — 1 per project: adapter, config JSON (repo, branch), timestamps. Unique index on project_id.
site_pages — per-scan page snapshots: url, path, title, structured_data_types, headings, meta_description, has_schema, source_ref, commit_sha. Indexed on project_id and scan_run_id.
site_recommendations — per-scan recommendations: page_url, keyword_id, type, severity, title, description, source_ref, intent JSON, commit_sha, status, pr_url. Indexed on project_id, scan_run_id, and (project_id, status).
GitHub Adapter: Framework Detection
This is the hard part. Repo file trees don't reliably map to live URLs for dynamic routes, rewrites, or locales.
MVP: narrow framework support + explicit mapping fallback
| Framework | Detection | Content paths | Route mapping |
| --- | --- | --- | --- |
| Next.js (app) | next.config.* + app/ | app/**/page.{tsx,jsx,mdx} | Dir path = URL |
| Next.js (pages) | next.config.* + pages/ | pages/**/*.{tsx,jsx} | File path = URL |
| Hugo | hugo.toml/config.toml | content/**/*.md | Front matter slug or dir path |
| Astro | astro.config.* | src/pages/**/*.{astro,md,mdx} | File path = URL |
| Plain HTML | Fallback | **/*.html | File path = URL |
Explicitly NOT handled in MVP: dynamic routes with data fetching, rewrites/redirects, i18n routing, generateStaticParams.
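A minimal sketch of how detection and the "Dir path = URL" rule for the Next.js app router might look, working only on a list of repo-relative file paths. The function names, the optional `src/` prefix, and the detection order (Hugo checked last because config.toml is a generic file name) are assumptions; dynamic segments return null in line with the MVP exclusions.

```typescript
type Framework = 'nextjs-app' | 'nextjs-pages' | 'hugo' | 'astro' | 'plain-html'

// Hypothetical detector: looks at file names only, never reads or executes config.
function detectFramework(paths: string[]): Framework {
  const has = (re: RegExp) => paths.some((p) => re.test(p))
  if (has(/^next\.config\.[a-z]+$/)) {
    // Assumption: also accept the common src/ layout.
    if (has(/^(src\/)?app\/(.*\/)?page\.(tsx|jsx|mdx)$/)) return 'nextjs-app'
    if (has(/^(src\/)?pages\/.+\.(tsx|jsx)$/)) return 'nextjs-pages'
  }
  if (has(/^astro\.config\.[a-z]+$/)) return 'astro'
  if (has(/^(hugo|config)\.toml$/)) return 'hugo'
  return 'plain-html'
}

// "Dir path = URL" for the app router: app/blog/page.tsx -> /blog.
// Returns null for files that are not pages or that contain dynamic segments.
function nextAppRouteForFile(filePath: string): string | null {
  const m = /^(?:src\/)?app\/(?:(.*)\/)?page\.(?:tsx|jsx|mdx)$/.exec(filePath)
  if (!m) return null
  const segments = (m[1] ?? '')
    .split('/')
    // Route groups such as (marketing) organise files but do not appear in the URL.
    .filter((s) => s !== '' && !/^\(.*\)$/.test(s))
  // Dynamic segments like [slug] are out of scope for the MVP.
  if (segments.some((s) => s.includes('['))) return null
  return '/' + segments.join('/')
}
```

Returning null rather than guessing keeps unmapped files visible, so they can be covered by the explicit mapping config instead.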
Explicit mapping config for when auto-detection fails:
spec:
  siteSource:
    adapter: github
    repo: myorg/my-site
    contentPaths:
      - glob: "content/blog/**/*.md"
        urlPrefix: "/blog/"
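The explicit mapping could be applied roughly as follows: match the file against each glob, drop the static part of the glob and the file extension, and prepend urlPrefix. The tiny glob matcher below handles only `*` and `**/` and stands in for whatever glob library the adapter ends up using; all function names and the treatment of index/_index files are assumptions.

```typescript
interface ContentPathRule {
  glob: string
  urlPrefix: string
}

// Directory part of the glob that comes before the first wildcard segment.
function globBase(glob: string): string {
  const segments = glob.split('/')
  const firstWild = segments.findIndex((s) => s.includes('*'))
  return segments.slice(0, firstWild === -1 ? segments.length - 1 : firstWild).join('/')
}

// Minimal glob support: "**/" spans directories, "*" stays within one segment.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, '\\$&')
  const pattern = escaped
    .replace(/\*\*\//g, '\u0000') // placeholder so the next step leaves "**/" alone
    .replace(/\*/g, '[^/]*')
    .replace(/\u0000/g, '(?:.*/)?')
  return new RegExp(`^${pattern}$`)
}

// First matching rule wins; returns null when no rule covers the file.
function urlForFile(filePath: string, rules: ContentPathRule[]): string | null {
  for (const rule of rules) {
    if (!globToRegExp(rule.glob).test(filePath)) continue
    const base = globBase(rule.glob)
    const relative = base ? filePath.slice(base.length + 1) : filePath
    const slug = relative
      .replace(/\.[^./]+$/, '') // drop the extension
      .replace(/(^|\/)(index|_index)$/, '') // assumption: index files map to their directory
    const prefix = rule.urlPrefix.endsWith('/') ? rule.urlPrefix : rule.urlPrefix + '/'
    return prefix + slug
  }
  return null
}
```

With the config above, content/blog/hello.md maps to /blog/hello. Whether URLs carry a trailing slash depends on the site, so that detail would need to be configurable or checked against the live site.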
Structured data extraction — pattern-match only, never execute:
- Regex <script type="application/ld+json"> blocks
- YAML front matter in MD/MDX
- Heading tags / # headings
- <meta name="description"> tags
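The four patterns above can be sketched as a single pass over the raw file text. `extractFromSource` and its return shape are hypothetical; the front matter is returned as raw text (a YAML parser would take it from there), JSON-LD that is not literal JSON is skipped rather than evaluated, and the meta regex assumes `name` appears before `content`.

```typescript
interface ExtractedSource {
  frontMatter: string | null
  jsonLdTypes: string[]
  headings: { level: number; text: string }[]
  metaDescription: string | null
}

function extractFromSource(source: string): ExtractedSource {
  // YAML front matter: a leading block delimited by --- lines.
  const fm = /^---\r?\n([\s\S]*?)\r?\n---[ \t]*(?:\r?\n|$)/.exec(source)
  const frontMatter = fm ? fm[1] : null
  // Scan the rest only, so "# comment" lines in YAML are not read as headings.
  const body = fm ? source.slice(fm[0].length) : source

  const jsonLdTypes: string[] = []
  const scriptRe = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi
  let script: RegExpExecArray | null
  while ((script = scriptRe.exec(body)) !== null) {
    try {
      const parsed: unknown = JSON.parse(script[1])
      for (const item of Array.isArray(parsed) ? parsed : [parsed]) {
        const type = (item as { '@type'?: unknown } | null)?.['@type']
        if (typeof type === 'string') jsonLdTypes.push(type)
      }
    } catch {
      // Not literal JSON (for example a JSX expression): skip, never evaluate.
    }
  }

  const found: { index: number; level: number; text: string }[] = []
  const htmlRe = /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi
  let h: RegExpExecArray | null
  while ((h = htmlRe.exec(body)) !== null) {
    found.push({ index: h.index, level: Number(h[1]), text: h[2].replace(/<[^>]+>/g, '').trim() })
  }
  const mdRe = /^(#{1,6})[ \t]+(.+?)[ \t]*#*[ \t]*$/gm
  let md: RegExpExecArray | null
  while ((md = mdRe.exec(body)) !== null) {
    found.push({ index: md.index, level: md[1].length, text: md[2] })
  }
  found.sort((a, b) => a.index - b.index) // keep document order across both syntaxes
  const headings = found.map(({ level, text }) => ({ level, text }))

  const meta = /<meta[^>]*name=["']description["'][^>]*content=(["'])(.*?)\1/i.exec(body)
  return { frontMatter, jsonLdTypes, headings, metaDescription: meta ? meta[2] : null }
}
```

Known gaps in this sketch: `#` lines inside fenced code blocks are counted as headings, and metadata generated in code (for example a Next.js `metadata` export) is not seen at all.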
Recommendation Engine (Phase 2)
Available data
querySnapshots stores: citationState, citedDomains, competitorOverlap, answerText, groundingSources. It does not store competitor page content or structured data.
Correlation logic
For each keyword:
- Find best-matching page (URL path terms, title/h1 match)
- If not cited → check what the page is missing (schema, meta, headings)
- If no matching page → content gap
- Competitor comparison is limited to "competitor X is cited, you are not"; structured data comparison requires an optional page fetch via the existing site-fetch.ts
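The "find best-matching page" step could start as simple term overlap between the keyword and each page's URL path, title, and h1. The sketch below is one such heuristic; the equal weighting of path and title hits and the 0.5 threshold are placeholder assumptions, and all names are hypothetical.

```typescript
interface PageSummary {
  url: string
  title: string | null
  h1: string | null
}

// Lowercase alphanumeric terms; single characters are dropped as noise.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 1)
}

// Score in [0, 1]: each keyword term can hit once in the path and once in title/h1.
function scorePage(keyword: string, page: PageSummary): number {
  const terms = tokenize(keyword)
  if (terms.length === 0) return 0
  const path = page.url.replace(/^[a-z]+:\/\/[^/]+/i, '') // ignore scheme and host
  const pathTerms = new Set(tokenize(path))
  const titleTerms = new Set(tokenize(`${page.title ?? ''} ${page.h1 ?? ''}`))
  let hits = 0
  for (const term of terms) {
    if (pathTerms.has(term)) hits += 1
    if (titleTerms.has(term)) hits += 1
  }
  return hits / (2 * terms.length)
}

// Returns null when nothing clears the threshold, which is the content-gap case.
function bestMatchingPage(keyword: string, pages: PageSummary[], minScore = 0.5): PageSummary | null {
  let best: PageSummary | null = null
  let bestScore = 0
  for (const page of pages) {
    const score = scorePage(keyword, page)
    if (score > bestScore) {
      best = page
      bestScore = score
    }
  }
  return bestScore >= minScore ? best : null
}
```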
Recommendation types
| Type | Trigger | Severity |
| --- | --- | --- |
| content-gap | Not-cited keyword, no matching page | High |
| missing-schema | Page targets keyword but lacks relevant schema | High |
| competitor-advantage | Competitor cited, yours isn't | High |
| weak-headings | Keyword missing from h1/h2 | Medium |
| no-meta-description | Page has no meta description | Medium |
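As a rough sketch, the table can be read as a rule function over one keyword observation and its matched page. The input shapes and `recommend` are hypothetical, "relevant schema" is simplified to "any schema", `headings` is assumed to hold only h1/h2 text, and page-level checks are assumed to run only for keywords that are not cited.

```typescript
type RecommendationType =
  | 'content-gap'
  | 'missing-schema'
  | 'competitor-advantage'
  | 'weak-headings'
  | 'no-meta-description'
type Severity = 'high' | 'medium'

// Hypothetical inputs derived from querySnapshots and sitePages.
interface KeywordObservation {
  keyword: string
  cited: boolean
  competitorCited: boolean
}
interface MatchedPage {
  url: string
  hasSchema: boolean
  headings: string[] // assumption: h1/h2 text only
  metaDescription: string | null
}
interface Recommendation {
  type: RecommendationType
  severity: Severity
  keyword: string
  pageUrl: string | null
}

function recommend(obs: KeywordObservation, page: MatchedPage | null): Recommendation[] {
  if (obs.cited) return [] // nothing to win back
  const out: Recommendation[] = []
  const add = (type: RecommendationType, severity: Severity) =>
    out.push({ type, severity, keyword: obs.keyword, pageUrl: page ? page.url : null })

  if (obs.competitorCited) add('competitor-advantage', 'high')
  if (!page) {
    add('content-gap', 'high')
    return out
  }
  if (!page.hasSchema) add('missing-schema', 'high')
  const keyword = obs.keyword.toLowerCase()
  if (!page.headings.some((h) => h.toLowerCase().includes(keyword))) add('weak-headings', 'medium')
  if (!page.metaDescription) add('no-meta-description', 'medium')
  return out
}
```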
API Endpoints
| Method | Path | Phase |
| --- | --- | --- |
| PUT | /projects/:name/site-source | 1 |
| GET | /projects/:name/site-source | 1 |
| DELETE | /projects/:name/site-source | 1 |
| POST | /projects/:name/site-source/healthcheck | 1 |
| POST | /projects/:name/site-scan | 1 |
| GET | /projects/:name/pages | 1 |
| GET | /projects/:name/recommendations | 2 |
| PATCH | /projects/:name/recommendations/:id | 2 |
| POST | /projects/:name/recommendations/:id/apply | 3 |
Plus: OpenAPI catalog entries, API client methods, route index registration.
CLI Commands
# Phase 1
canonry site connect <project> --adapter github --repo owner/repo [--branch main]
canonry site disconnect <project>
canonry site status <project>
canonry site scan <project> [--wait] [--format json]
canonry site pages <project> [--format json]
# Phase 2
canonry site recommendations <project> [--severity high] [--status open] [--format json]
# Phase 3
canonry site apply-fix <project> <recommendation-id>
New files: commands/site.ts, cli-commands/site.ts. Register in cli-commands.ts.
Config-as-Code Round-Trip
Adding siteSource to configSpecSchema also requires changes to:
packages/api-routes/src/apply.ts — handle in apply body, upsert siteConnections
packages/api-routes/src/projects.ts — include in export response
packages/contracts/src/project.ts — add to projectDtoSchema
What's Explicitly Out of Scope
- Pre-publish analysis on draft PRs: requires ref-specific scanning + GitHub App/check-run integration
- Competitor structured data comparison: optional enrichment using site-fetch.ts, not a requirement
- Cloud mode wiring: apps/api/ and apps/worker/ don't wire run callbacks today
- Non-Git adapters: interface supports them, no code written
Implementation Steps
Phase 1: Read-only GitHub scan (19 steps)
packages/contracts/src/site-source.ts (new) — shared types
packages/contracts/src/run.ts — add 'site-scan'
packages/contracts/src/config-schema.ts — add siteSource
packages/contracts/src/project.ts — add to DTO
packages/integration-github/ (new package) — adapter
packages/db/src/schema.ts — siteConnections, sitePages tables
packages/db/src/migrate.ts — migrations
packages/api-routes/src/site-source.ts (new) — route plugin
packages/api-routes/src/index.ts — register routes
packages/api-routes/src/openapi.ts — endpoint docs
packages/api-routes/src/apply.ts — handle siteSource
packages/api-routes/src/projects.ts — export siteSource
packages/canonry/src/client.ts — API client methods
packages/canonry/src/site-scan.ts (new) — execution function
packages/canonry/src/server.ts — connection store + callback
packages/canonry/src/commands/site.ts (new) — CLI handler
packages/canonry/src/cli-commands/site.ts (new) — CLI specs
packages/canonry/src/cli-commands.ts — register
- Tests: framework detection, file→URL mapping, extraction, API endpoints
Phase 2: Recommendations (6 steps)
packages/db/src/schema.ts — siteRecommendations table + migration
packages/canonry/src/site-scan.ts — recommendation engine
packages/api-routes/src/site-source.ts — recommendation endpoints
packages/canonry/src/client.ts — recommendation methods
packages/canonry/src/commands/site.ts — recommendations subcommand
- Tests: recommendation generation with mock data
Phase 3: Apply-fix (4 steps)
packages/integration-github/src/remediate.ts — SiteSourceRemediator
packages/api-routes/src/site-source.ts — apply endpoint
packages/canonry/src/commands/site.ts — apply-fix subcommand
- Tests: patch generation
Reusable Existing Code
packages/canonry/src/site-fetch.ts — SSRF-safe HTML fetch
packages/api-routes/src/run-queue.ts — queueRunIfProjectIdle()
packages/api-routes/src/helpers.ts — resolveProject(), writeAuditLog()
server.ts:194-235 — Bing connection store pattern
server.ts:333-358 — GSC sync callback wiring pattern