diff --git a/README.md b/README.md index 3c19d90..cc5c6ce 100644 --- a/README.md +++ b/README.md @@ -76,6 +76,52 @@ npx @ainyc/aeo-audit https://example.com --include-geo npx @ainyc/aeo-audit https://example.com --include-agent-skills ``` +### Platform Detection Mode + +Detect what platform, CMS, framework, or static site generator a website is built on. Useful for competitor research, lead qualification, and triage before an audit. + +```bash +# Identify the stack (WordPress, Webflow, Shopify, Next.js, Vercel, etc.) +npx @ainyc/aeo-audit https://example.com --detect-platform + +# JSON for programmatic use +npx @ainyc/aeo-audit https://example.com --detect-platform --format json + +# Only show high-confidence matches +npx @ainyc/aeo-audit https://example.com --detect-platform --min-confidence high +``` + +The detector inspects HTML, response headers, ``, script and link sources, and platform-specific globals to fingerprint: + +- **CMS:** WordPress, Drupal, Joomla, Ghost, HubSpot, Craft CMS, Sanity, Contentful, Notion +- **Site builders:** Wix, Squarespace, Webflow, Framer, Carrd, Bubble +- **E-commerce:** Shopify, WooCommerce, BigCommerce, Magento, PrestaShop +- **Frameworks:** Next.js, Nuxt, Gatsby, Remix, Astro, SvelteKit, Angular, Vue, React, Ember, Qwik +- **Static site generators:** Hugo, Jekyll, Eleventy, Hexo, Docusaurus, MkDocs +- **Hosting / CDN:** Vercel, Netlify, Cloudflare, GitHub Pages, Fastly, AWS CloudFront + +Each detected platform is reported with a confidence bucket (`high`, `medium`, `low`), a numeric score, an optional version, and the list of signals that matched. When no CMS, site builder, or e-commerce platform is found, the report flags the site as `custom-built` (framework and hosting fingerprints are still surfaced for context). Exit code is `0` when at least one platform is detected, `1` otherwise. + +#### Batch detection + +Pass `--urls` to fingerprint many sites in a single run. Pages are fetched with bounded concurrency (5 in flight by default; tune with `--concurrency`). + +```bash +# From a file (one URL per line; # comments and blank lines are skipped) +npx @ainyc/aeo-audit --detect-platform --urls urls.txt + +# Inline comma-separated list +npx @ainyc/aeo-audit --detect-platform --urls https://a.com,https://b.com,https://c.com + +# From stdin +cat urls.txt | npx @ainyc/aeo-audit --detect-platform --urls - + +# JSON for downstream processing +npx @ainyc/aeo-audit --detect-platform --urls urls.txt --format json +``` + +Per-URL fetch errors don't abort the batch — each entry is reported with `status: 'success'` or `status: 'error'`. Exit code is `0` when at least one URL succeeded, `1` otherwise. + ### Sitemap Mode Audit every page discovered from the site's sitemap with bounded concurrency (5 in flight): @@ -107,6 +153,10 @@ When the sitemap has more URLs than `--limit`, the run audits the highest-priori | `--sitemap [url]` | Audit all pages from the sitemap (auto-discovers `/sitemap.xml` or uses an explicit URL) | | `--limit ` | Max pages to audit in sitemap mode (default 200, sorted by sitemap priority) | | `--top-issues` | In sitemap mode, skip per-page output and show only cross-cutting issues | +| `--detect-platform` | Identify the platform/CMS/framework powering the site instead of running an audit | +| `--urls ` | In `--detect-platform` mode, run on multiple URLs. `` is a file path (one URL per line), a comma-separated list, or `-` for stdin | +| `--concurrency ` | In `--detect-platform` batch mode, max in-flight fetches (default 5) | +| `--min-confidence ` | In platform-detect mode, only report matches at or above this level: `low` (default), `medium`, `high` | | `-h`, `--help` | Show the help message | Exit code `0` for score >= 70, `1` for < 70 (CI-friendly). In sitemap mode the exit code is based on the aggregate score. diff --git a/package.json b/package.json index d9c16ac..7fbef65 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "@ainyc/aeo-audit", - "version": "1.5.0", + "version": "1.6.0", "description": "The most comprehensive open-source Answer Engine Optimization (AEO) audit tool. Scores websites across 13 ranking factors that determine AI citation.", "type": "module", "main": "./dist/index.js", diff --git a/skills/aeo/SKILL.md b/skills/aeo/SKILL.md index 686e1a9..78d7804 100644 --- a/skills/aeo/SKILL.md +++ b/skills/aeo/SKILL.md @@ -42,6 +42,7 @@ npx @ainyc/aeo-audit@1 "" [flags] --format json - `schema`: validate JSON-LD and entity consistency - `llms`: create or improve `llms.txt` and `llms-full.txt` - `monitor`: compare changes over time or benchmark competitors +- `detect-platform`: identify the CMS, site builder, framework, or hosting stack a site uses If no mode is provided, default to `audit`. @@ -55,10 +56,14 @@ If no mode is provided, default to `audit`. - `schema https://example.com` - `llms https://example.com` - `monitor https://site-a.com --compare https://site-b.com` +- `detect-platform https://example.com` +- `detect-platform https://example.com --min-confidence high` +- `detect-platform --urls competitors.txt` +- `detect-platform --urls https://a.com,https://b.com` ## Mode Selection -- If the first argument is one of `audit`, `fix`, `schema`, `llms`, or `monitor`, use that mode. +- If the first argument is one of `audit`, `fix`, `schema`, `llms`, `monitor`, or `detect-platform`, use that mode. - If no explicit mode is given, infer the intent from the request and default to `audit`. ## Audit @@ -101,6 +106,35 @@ Returns: - Aggregate score and grade - Prioritized fixes ranked by site-wide impact +### Detect Platform Mode + +Use `--detect-platform` when the user wants to know what stack a site is built on (e.g., "is this WordPress?", "what framework does competitor X use?", "is this site custom-built?"). This is much faster than a full audit because it skips analyzer scoring. + +```bash +npx @ainyc/aeo-audit@1 "" --detect-platform --format json +npx @ainyc/aeo-audit@1 "" --detect-platform --min-confidence high --format json +``` + +Flags: +- `--detect-platform` — switch to detection mode instead of auditing +- `--min-confidence ` — filter to `low` (default), `medium`, or `high` confidence +- `--urls ` — run on multiple URLs at once (file path, comma-separated list, or `-` for stdin) +- `--concurrency ` — max in-flight fetches in batch mode (default 5) + +The report groups detections by category (CMS, site builder, e-commerce, framework, SSG, hosting), each with a confidence bucket, a 0–100 score, an optional version, and the signals that matched. When the report's `isCustom` flag is true, no CMS/site-builder/e-commerce platform was identified — the site is likely custom-built. Exit code is `0` when at least one platform is detected, `1` otherwise. + +#### Batch detection + +When the user wants to fingerprint many sites at once (competitor lists, customer cohorts), pass `--urls`: + +```bash +npx @ainyc/aeo-audit@1 --detect-platform --urls urls.txt --format json +npx @ainyc/aeo-audit@1 --detect-platform --urls https://a.com,https://b.com --format json +cat urls.txt | npx @ainyc/aeo-audit@1 --detect-platform --urls - --format json +``` + +The batch report contains a `results` array; each entry has `status: 'success'` or `'error'`, plus the same shape as a single-URL report on success. Per-URL fetch errors do not abort the run. Exit code is `0` when at least one URL succeeded, `1` otherwise. + ## Fix Use when the user wants code changes applied after the audit. diff --git a/src/cli.ts b/src/cli.ts index 214f115..8e6cddd 100644 --- a/src/cli.ts +++ b/src/cli.ts @@ -1,11 +1,33 @@ +import { readFile } from 'node:fs/promises' import { runAeoAudit } from './index.js' import { runSitemapAudit } from './sitemap.js' +import { detectPlatform, detectPlatformBatch } from './detect-platform.js' import { isAeoAuditError } from './errors.js' -import { formatJson } from './formatters/json.js' -import { formatSitemapJson } from './formatters/json.js' -import { formatMarkdown, formatSitemapMarkdown } from './formatters/markdown.js' -import { formatText, formatSitemapText } from './formatters/text.js' -import type { AuditReport, SitemapAuditReport, SitemapAuditOptions } from './types.js' +import { + formatBatchPlatformJson, + formatJson, + formatPlatformJson, + formatSitemapJson, +} from './formatters/json.js' +import { + formatBatchPlatformMarkdown, + formatMarkdown, + formatPlatformMarkdown, + formatSitemapMarkdown, +} from './formatters/markdown.js' +import { + formatBatchPlatformText, + formatPlatformText, + formatSitemapText, + formatText, +} from './formatters/text.js' +import type { + BatchPlatformDetectionReport, + PlatformConfidence, + PlatformDetectionReport, + SitemapAuditOptions, + SitemapAuditReport, +} from './types.js' const FORMATTERS = { json: formatJson, @@ -19,6 +41,18 @@ const SITEMAP_FORMATTERS = { text: (report: SitemapAuditReport, topIssuesOnly: boolean) => formatSitemapText(report, topIssuesOnly), } +const PLATFORM_FORMATTERS = { + json: (report: PlatformDetectionReport) => formatPlatformJson(report), + markdown: (report: PlatformDetectionReport) => formatPlatformMarkdown(report), + text: (report: PlatformDetectionReport) => formatPlatformText(report), +} + +const BATCH_PLATFORM_FORMATTERS = { + json: (report: BatchPlatformDetectionReport) => formatBatchPlatformJson(report), + markdown: (report: BatchPlatformDetectionReport) => formatBatchPlatformMarkdown(report), + text: (report: BatchPlatformDetectionReport) => formatBatchPlatformText(report), +} + type FormatterName = keyof typeof FORMATTERS interface ParsedArgs { @@ -32,6 +66,10 @@ interface ParsedArgs { sitemapUrl: string | null limit: number | null topIssues: boolean + detectPlatform: boolean + minConfidence: PlatformConfidence | null + urls: string | null + concurrency: number | null } function isFormatterName(value: string): value is FormatterName { @@ -51,6 +89,10 @@ function parseArgs(argv: string[]): ParsedArgs { sitemapUrl: null, limit: null, topIssues: false, + detectPlatform: false, + minConfidence: null, + urls: null, + concurrency: null, } for (let i = 0; i < args.length; i += 1) { @@ -79,6 +121,23 @@ function parseArgs(argv: string[]): ParsedArgs { i += 1 } else if (args[i] === '--top-issues') { result.topIssues = true + } else if (args[i] === '--detect-platform') { + result.detectPlatform = true + } else if (args[i] === '--min-confidence' && args[i + 1]) { + const value = args[i + 1] + if (value === 'high' || value === 'medium' || value === 'low') { + result.minConfidence = value + } + i += 1 + } else if (args[i] === '--urls' && args[i + 1]) { + result.urls = args[i + 1] + i += 1 + } else if (args[i] === '--concurrency' && args[i + 1]) { + const num = parseInt(args[i + 1], 10) + if (Number.isFinite(num) && num > 0) { + result.concurrency = num + } + i += 1 } else if (args[i] === '--help' || args[i] === '-h') { result.help = true } else if (!args[i].startsWith('-')) { @@ -89,6 +148,39 @@ function parseArgs(argv: string[]): ParsedArgs { return result } +export function parseUrlList(text: string): string[] { + const urls: string[] = [] + for (const rawLine of text.split(/\r?\n/)) { + const line = rawLine.trim() + if (line.length === 0 || line.startsWith('#')) continue + // Allow comma-separated values on a single line too. + for (const part of line.split(',')) { + const candidate = part.trim() + if (candidate.length > 0) urls.push(candidate) + } + } + return urls +} + +async function readStdin(): Promise { + const chunks: Buffer[] = [] + for await (const chunk of process.stdin) { + chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk)) + } + return Buffer.concat(chunks).toString('utf-8') +} + +async function resolveUrls(spec: string): Promise { + if (spec === '-') { + return parseUrlList(await readStdin()) + } + if (spec.startsWith('http://') || spec.startsWith('https://')) { + return parseUrlList(spec) + } + const text = await readFile(spec, 'utf-8') + return parseUrlList(text) +} + function printHelp() { console.log(` Usage: aeo-audit [options] @@ -103,6 +195,14 @@ Options: --limit Max pages to audit in sitemap mode (default 200, sorted by sitemap priority). When the sitemap exceeds the limit, a notice is printed to stderr. --top-issues In sitemap mode, skip per-page output and show only cross-cutting issues + --detect-platform Detect what platform/CMS/framework the site is built on (WordPress, + Webflow, Shopify, Next.js, etc.) instead of running a full audit. + --urls In --detect-platform mode, run on multiple URLs. can be a path + to a text file (one URL per line, # comments allowed), a comma-separated + list (e.g. https://a.com,https://b.com), or - to read from stdin. + --concurrency In --detect-platform batch mode, max in-flight fetches (default 5). + --min-confidence In platform-detect mode, only report platforms at or above this + confidence level: low (default), medium, high. -h, --help Show this help message Examples: @@ -115,8 +215,16 @@ Examples: aeo-audit https://example.com --sitemap https://example.com/sitemap.xml aeo-audit https://example.com --sitemap --limit 10 aeo-audit https://example.com --sitemap --top-issues + aeo-audit https://example.com --detect-platform + aeo-audit https://example.com --detect-platform --format json + aeo-audit https://example.com --detect-platform --min-confidence medium + aeo-audit --detect-platform --urls urls.txt + aeo-audit --detect-platform --urls https://a.com,https://b.com --format json + cat urls.txt | aeo-audit --detect-platform --urls - Exit code: 0 when score >= 70, 1 otherwise. In sitemap mode, the aggregate score is used. +In --detect-platform mode, exit code is 0 if any platform is detected, 1 otherwise. +In --detect-platform batch mode, exit code is 0 if at least one URL succeeded, 1 otherwise. `) } @@ -128,17 +236,57 @@ export async function main(argv: string[] = process.argv): Promise { return 0 } - if (!args.url) { - console.error('Error: URL is required. Run with --help for usage.') + if (!isFormatterName(args.format)) { + console.error(`Error: Unknown format "${args.format}". Use: text, json, markdown`) return 1 } - if (!isFormatterName(args.format)) { - console.error(`Error: Unknown format "${args.format}". Use: text, json, markdown`) + if (args.urls && !args.detectPlatform) { + console.error('Error: --urls is only supported with --detect-platform.') return 1 } try { + if (args.detectPlatform) { + if (args.urls) { + if (args.url) { + console.error('Error: cannot combine a positional URL with --urls. Use one or the other.') + return 1 + } + + const urls = await resolveUrls(args.urls) + if (urls.length === 0) { + console.error('Error: no URLs found in --urls input.') + return 1 + } + + const batch = await detectPlatformBatch(urls, { + minConfidence: args.minConfidence ?? undefined, + concurrency: args.concurrency ?? undefined, + }) + const batchFormatter = BATCH_PLATFORM_FORMATTERS[args.format] + console.log(batchFormatter(batch)) + return batch.successful > 0 ? 0 : 1 + } + + if (!args.url) { + console.error('Error: URL is required (or pass --urls |). Run with --help for usage.') + return 1 + } + + const report = await detectPlatform(args.url, { + minConfidence: args.minConfidence ?? undefined, + }) + const platformFormatter = PLATFORM_FORMATTERS[args.format] + console.log(platformFormatter(report)) + return report.detected.length > 0 ? 0 : 1 + } + + if (!args.url) { + console.error('Error: URL is required. Run with --help for usage.') + return 1 + } + if (args.sitemap) { const options: SitemapAuditOptions = { factors: args.factors, diff --git a/src/detect-platform.ts b/src/detect-platform.ts new file mode 100644 index 0000000..5227a6d --- /dev/null +++ b/src/detect-platform.ts @@ -0,0 +1,811 @@ +import { load, type CheerioAPI } from 'cheerio' +import { isAeoAuditError } from './errors.js' +import { fetchPage, normalizeTargetUrl } from './fetch-page.js' +import { mapWithConcurrency } from './sitemap.js' +import type { + BatchDetectionEntry, + BatchPlatformDetectionReport, + DetectedPlatform, + PlatformCategory, + PlatformConfidence, + PlatformDetectionReport, +} from './types.js' + +const HIGH_THRESHOLD = 70 +const MEDIUM_THRESHOLD = 40 +const LOW_THRESHOLD = 15 + +interface DetectionInput { + $: CheerioAPI + html: string + headers: Record +} + +interface SignalHit { + evidence: string + weight: number + version?: string +} + +type SignalFn = (input: DetectionInput) => SignalHit | null + +interface PlatformDef { + id: string + name: string + category: PlatformCategory + signals: SignalFn[] +} + +/* ── Signal helpers ───────────────────────────────────────────────────── */ + +function getHeader(headers: Record, name: string): string | undefined { + const lower = name.toLowerCase() + for (const key of Object.keys(headers)) { + if (key.toLowerCase() === lower) return headers[key] + } + return undefined +} + +function metaGenerator(pattern: RegExp, weight: number): SignalFn { + return ({ $ }) => { + const content = $('meta[name="generator" i]').attr('content')?.trim() + if (!content) return null + const match = content.match(pattern) + if (!match) return null + return { + evidence: ``, + weight, + version: match[1], + } + } +} + +function scriptSrcContains(needle: string | RegExp, weight: number): SignalFn { + return ({ $ }) => { + let hit: string | undefined + $('script[src]').each((_, el) => { + if (hit) return + const src = $(el).attr('src') || '' + if (typeof needle === 'string' ? src.includes(needle) : needle.test(src)) { + hit = src + } + }) + if (!hit) return null + return { evidence: `script src includes "${truncate(hit, 80)}"`, weight } + } +} + +function linkHrefContains(needle: string | RegExp, weight: number): SignalFn { + return ({ $ }) => { + let hit: string | undefined + $('link[href]').each((_, el) => { + if (hit) return + const href = $(el).attr('href') || '' + if (typeof needle === 'string' ? href.includes(needle) : needle.test(href)) { + hit = href + } + }) + if (!hit) return null + return { evidence: `link href includes "${truncate(hit, 80)}"`, weight } + } +} + +function imgSrcContains(needle: string | RegExp, weight: number): SignalFn { + return ({ $ }) => { + let hit: string | undefined + $('img[src]').each((_, el) => { + if (hit) return + const src = $(el).attr('src') || '' + if (typeof needle === 'string' ? src.includes(needle) : needle.test(src)) { + hit = src + } + }) + if (!hit) return null + return { evidence: `img src includes "${truncate(hit, 80)}"`, weight } + } +} + +function htmlMatches(pattern: RegExp, weight: number, label?: string): SignalFn { + return ({ html }) => { + const match = html.match(pattern) + if (!match) return null + return { + evidence: label ?? `HTML contains ${pattern.source.slice(0, 60)}`, + weight, + version: match[1], + } + } +} + +function headerMatches(name: string, pattern: RegExp, weight: number): SignalFn { + return ({ headers }) => { + const value = getHeader(headers, name) + if (!value) return null + const match = value.match(pattern) + if (!match) return null + return { + evidence: `${name} header: "${truncate(value, 80)}"`, + weight, + version: match[1], + } + } +} + +function headerExists(name: string, weight: number): SignalFn { + return ({ headers }) => { + const value = getHeader(headers, name) + if (!value) return null + return { evidence: `${name} header present: "${truncate(value, 80)}"`, weight } + } +} + +function bodyClassContains(token: string, weight: number): SignalFn { + return ({ $ }) => { + const cls = $('body').attr('class') || '' + if (!cls.split(/\s+/).some((c) => c.includes(token))) return null + return { evidence: `body class contains "${token}"`, weight } + } +} + +function elementExists(selector: string, weight: number, label?: string): SignalFn { + return ({ $ }) => { + if ($(selector).length === 0) return null + return { evidence: label ?? `element ${selector} present`, weight } + } +} + +function attributeExists(selector: string, attr: string, weight: number): SignalFn { + return ({ $ }) => { + let hit: string | undefined + $(selector).each((_, el) => { + if (hit) return + const v = $(el).attr(attr) + if (typeof v === 'string') hit = v + }) + if (hit === undefined) return null + return { evidence: `${selector}[${attr}] present`, weight } + } +} + +function truncate(s: string, n: number): string { + return s.length > n ? s.slice(0, n - 1) + '…' : s +} + +/* ── Platform definitions ─────────────────────────────────────────────── */ + +const PLATFORMS: PlatformDef[] = [ + /* === CMS ============================================================ */ + { + id: 'wordpress', + name: 'WordPress', + category: 'cms', + signals: [ + metaGenerator(/WordPress(?:\s+(\d+\.\d+(?:\.\d+)?))?/i, 5), + scriptSrcContains('/wp-includes/', 5), + scriptSrcContains('/wp-content/', 4), + linkHrefContains('/wp-content/', 3), + linkHrefContains('/wp-json/', 4), + htmlMatches(/wp-block-[a-z]/, 2, 'wp-block-* class found'), + headerExists('x-pingback', 3), + headerMatches('link', /wp-json/i, 4), + ], + }, + { + id: 'drupal', + name: 'Drupal', + category: 'cms', + signals: [ + metaGenerator(/Drupal(?:\s+(\d+))?/i, 5), + headerMatches('x-generator', /drupal/i, 5), + headerExists('x-drupal-cache', 4), + headerExists('x-drupal-dynamic-cache', 4), + htmlMatches(/sites\/(?:default|all)\//i, 3, '/sites/default/ or /sites/all/ path'), + ], + }, + { + id: 'joomla', + name: 'Joomla', + category: 'cms', + signals: [ + metaGenerator(/Joomla(?:!\s*-?\s*(\d+\.\d+))?/i, 5), + htmlMatches(/\/media\/jui\//i, 3, '/media/jui/ path'), + htmlMatches(/\/components\/com_/i, 3, '/components/com_* path'), + ], + }, + { + id: 'ghost', + name: 'Ghost', + category: 'cms', + signals: [ + metaGenerator(/Ghost(?:\s+(\d+\.\d+))?/i, 5), + linkHrefContains('/ghost/', 3), + scriptSrcContains('ghost.io', 3), + ], + }, + { + id: 'hubspot', + name: 'HubSpot CMS', + category: 'cms', + signals: [ + metaGenerator(/HubSpot/i, 5), + // tracking/forms scripts: low weight — many non-CMS sites embed HubSpot CRM tooling. + scriptSrcContains('js.hs-scripts.com', 2), + scriptSrcContains('js.hsforms.net', 1), + scriptSrcContains('hs-banner.com', 1), + scriptSrcContains('hs-analytics.net', 1), + ], + }, + { + id: 'craft-cms', + name: 'Craft CMS', + category: 'cms', + signals: [ + headerMatches('x-powered-by', /Craft\s*CMS/i, 5), + scriptSrcContains('cdn.craft.cm', 3), + ], + }, + { + id: 'sanity', + name: 'Sanity', + category: 'cms', + signals: [ + imgSrcContains('cdn.sanity.io', 4), + htmlMatches(/cdn\.sanity\.io/i, 3, 'cdn.sanity.io reference'), + ], + }, + { + id: 'contentful', + name: 'Contentful', + category: 'cms', + signals: [ + imgSrcContains('images.ctfassets.net', 4), + imgSrcContains('assets.ctfassets.net', 4), + htmlMatches(/(?:images|assets|videos)\.ctfassets\.net/i, 3, 'ctfassets.net reference'), + ], + }, + { + id: 'notion', + name: 'Notion (notion.site)', + category: 'cms', + signals: [ + htmlMatches(/notion\.site/i, 3, 'notion.site reference'), + metaGenerator(/Notion/i, 5), + htmlMatches(/__NOTION/, 3, '__NOTION marker'), + ], + }, + + /* === Site builders =================================================== */ + { + id: 'wix', + name: 'Wix', + category: 'site-builder', + signals: [ + metaGenerator(/Wix(?:\.com)?/i, 5), + scriptSrcContains('static.wixstatic.com', 5), + htmlMatches(/wix-warmup-data/i, 4, 'wix-warmup-data marker'), + htmlMatches(/parastorage\.com/i, 3, 'parastorage.com (Wix CDN)'), + ], + }, + { + id: 'squarespace', + name: 'Squarespace', + category: 'site-builder', + signals: [ + metaGenerator(/Squarespace/i, 5), + scriptSrcContains('static1.squarespace.com', 5), + htmlMatches(/Static\.SQUARESPACE_CONTEXT/, 5, 'Static.SQUARESPACE_CONTEXT global'), + headerMatches('x-served-by', /squarespace/i, 4), + ], + }, + { + id: 'webflow', + name: 'Webflow', + category: 'site-builder', + signals: [ + metaGenerator(/Webflow/i, 5), + attributeExists('html', 'data-wf-page', 5), + attributeExists('html', 'data-wf-site', 5), + scriptSrcContains('webflow.js', 4), + scriptSrcContains('webflow.com', 3), + imgSrcContains('uploads-ssl.webflow.com', 4), + imgSrcContains('cdn.prod.website-files.com', 4), + ], + }, + { + id: 'framer', + name: 'Framer', + category: 'site-builder', + signals: [ + metaGenerator(/Framer/i, 5), + htmlMatches(/framerusercontent\.com/i, 4, 'framerusercontent.com reference'), + htmlMatches(/framer\.com/i, 2, 'framer.com reference'), + ], + }, + { + id: 'carrd', + name: 'Carrd', + category: 'site-builder', + signals: [ + metaGenerator(/Carrd/i, 5), + htmlMatches(/carrd\.co/i, 3, 'carrd.co reference'), + ], + }, + { + id: 'bubble', + name: 'Bubble', + category: 'site-builder', + signals: [ + metaGenerator(/Bubble/i, 4), + scriptSrcContains('bubble.io', 4), + htmlMatches(/bubble-app/i, 3, 'bubble-app marker'), + ], + }, + + /* === E-commerce ====================================================== */ + { + id: 'shopify', + name: 'Shopify', + category: 'ecommerce', + signals: [ + headerExists('x-shopid', 5), + headerExists('x-shopify-stage', 5), + scriptSrcContains('cdn.shopify.com', 5), + htmlMatches(/Shopify\.theme\b/, 5, 'Shopify.theme global'), + htmlMatches(/Shopify\.shop\b/, 4, 'Shopify.shop global'), + headerMatches('powered-by', /Shopify/i, 4), + ], + }, + { + id: 'woocommerce', + name: 'WooCommerce', + category: 'ecommerce', + signals: [ + bodyClassContains('woocommerce', 4), + htmlMatches(/wp-content\/plugins\/woocommerce/i, 5, 'WooCommerce plugin path'), + metaGenerator(/WooCommerce(?:\s+(\d+\.\d+))?/i, 5), + ], + }, + { + id: 'bigcommerce', + name: 'BigCommerce', + category: 'ecommerce', + signals: [ + scriptSrcContains('cdn11.bigcommerce.com', 5), + headerMatches('x-bc-apigw-client-id', /./, 5), + htmlMatches(/bigcommerce\.com/i, 3, 'bigcommerce.com reference'), + ], + }, + { + id: 'magento', + name: 'Magento (Adobe Commerce)', + category: 'ecommerce', + signals: [ + htmlMatches(/Mage\.Cookies/, 5, 'Mage.Cookies marker'), + htmlMatches(/\/skin\/frontend\//i, 4, '/skin/frontend/ path'), + htmlMatches(/\/static\/version\d+\//i, 4, '/static/versionN/ path'), + headerMatches('x-magento-cache-debug', /./, 5), + ], + }, + { + id: 'prestashop', + name: 'PrestaShop', + category: 'ecommerce', + signals: [ + metaGenerator(/PrestaShop/i, 5), + headerMatches('powered-by', /PrestaShop/i, 5), + ], + }, + + /* === JS frameworks =================================================== */ + { + id: 'nextjs', + name: 'Next.js', + category: 'framework', + signals: [ + elementExists('script#__NEXT_DATA__', 5, '', + ) + const result = detectPlatformFromInput({ html }) + const wp = result.detected.find((p) => p.id === 'wordpress') + expect(wp, 'expected WordPress detection').toBeDefined() + expect(wp!.confidence).toBe('high') + expect(wp!.version).toBe('6.4.1') + expect(wp!.evidence.length).toBeGreaterThanOrEqual(2) + }) + + it('detects Drupal from x-generator header', () => { + const result = detectPlatformFromInput({ + html: wrap(), + headers: { 'x-generator': 'Drupal 10 (https://www.drupal.org)' }, + }) + const drupal = result.detected.find((p) => p.id === 'drupal') + expect(drupal).toBeDefined() + expect(drupal!.confidence).toBe('high') + }) + + it('detects Ghost from generator', () => { + const html = wrap('') + const result = detectPlatformFromInput({ html }) + const ghost = result.detected.find((p) => p.id === 'ghost') + expect(ghost).toBeDefined() + expect(ghost!.version).toBe('5.78') + }) + + it('detects HubSpot CMS at high confidence when meta generator declares it', () => { + const html = wrap('') + const result = detectPlatformFromInput({ html }) + const hs = result.detected.find((p) => p.id === 'hubspot') + expect(hs).toBeDefined() + expect(hs!.confidence).toBe('high') + }) + + it('does not flag HubSpot CMS at high confidence from a tracking pixel alone', () => { + // js.hs-scripts.com is the HubSpot CRM tracking pixel — countless non-CMS sites embed it. + const html = wrap('') + const result = detectPlatformFromInput({ html }) + const hs = result.detected.find((p) => p.id === 'hubspot') + expect(hs).toBeDefined() + expect(hs!.confidence).not.toBe('high') + }) + }) + + describe('Site builder detection', () => { + it('detects Wix from generator + static.wixstatic.com', () => { + const html = wrap( + '' + + '', + ) + const result = detectPlatformFromInput({ html }) + const wix = result.detected.find((p) => p.id === 'wix') + expect(wix).toBeDefined() + expect(wix!.confidence).toBe('high') + }) + + it('detects Squarespace from Static.SQUARESPACE_CONTEXT global', () => { + const html = wrap('', '') + const result = detectPlatformFromInput({ html }) + const sq = result.detected.find((p) => p.id === 'squarespace') + expect(sq).toBeDefined() + expect(sq!.confidence).toBe('high') + }) + + it('detects Webflow from data-wf-page attribute on ', () => { + const html = wrap('', '', 'data-wf-page="abc123" data-wf-site="xyz"') + const result = detectPlatformFromInput({ html }) + const wf = result.detected.find((p) => p.id === 'webflow') + expect(wf).toBeDefined() + expect(wf!.confidence).toBe('high') + }) + + it('detects Framer from generator + framerusercontent', () => { + const html = wrap( + '' + + '', + ) + const result = detectPlatformFromInput({ html }) + const framer = result.detected.find((p) => p.id === 'framer') + expect(framer).toBeDefined() + }) + }) + + describe('E-commerce detection', () => { + it('detects Shopify from headers + theme global', () => { + const html = wrap('', '') + const result = detectPlatformFromInput({ + html, + headers: { 'x-shopid': '12345' }, + }) + const shop = result.detected.find((p) => p.id === 'shopify') + expect(shop).toBeDefined() + expect(shop!.confidence).toBe('high') + }) + + it('detects WooCommerce as a layered signal on top of WordPress', () => { + const html = wrap( + '', + '', + '', + ).replace( + '', + '', + ) + const result = detectPlatformFromInput({ html }) + expect(result.detected.find((p) => p.id === 'wordpress')).toBeDefined() + expect(result.detected.find((p) => p.id === 'woocommerce')).toBeDefined() + }) + + it('detects BigCommerce from cdn11.bigcommerce.com', () => { + const html = wrap('') + const result = detectPlatformFromInput({ html }) + expect(result.detected.find((p) => p.id === 'bigcommerce')).toBeDefined() + }) + }) + + describe('JS framework detection', () => { + it('detects Next.js from __NEXT_DATA__ script', () => { + const html = wrap( + '', + '', + ) + const result = detectPlatformFromInput({ html }) + const next = result.detected.find((p) => p.id === 'nextjs') + expect(next).toBeDefined() + expect(next!.confidence).toBe('high') + }) + + it('detects Nuxt from window.__NUXT__ + /_nuxt/', () => { + const html = wrap( + '', + '', + ) + const result = detectPlatformFromInput({ html }) + expect(result.detected.find((p) => p.id === 'nuxt')).toBeDefined() + }) + + it('detects Gatsby from div#___gatsby', () => { + const html = wrap('', '
') + const result = detectPlatformFromInput({ html }) + expect(result.detected.find((p) => p.id === 'gatsby')).toBeDefined() + }) + + it('detects Astro from generator + data-astro-cid attribute', () => { + const html = wrap('', '
x
') + const result = detectPlatformFromInput({ html }) + const astro = result.detected.find((p) => p.id === 'astro') + expect(astro).toBeDefined() + expect(astro!.version).toBe('4.5.2') + }) + + it('detects Angular from ng-version attribute and captures the version', () => { + const html = wrap('', '') + const result = detectPlatformFromInput({ html }) + const ng = result.detected.find((p) => p.id === 'angular') + expect(ng).toBeDefined() + expect(ng!.confidence).toBe('high') + expect(ng!.version).toBe('17.1.2') + }) + + it('detects Remix from window.__remixContext', () => { + const html = wrap('', '') + const result = detectPlatformFromInput({ html }) + expect(result.detected.find((p) => p.id === 'remix')).toBeDefined() + }) + }) + + describe('Static site generator detection', () => { + it('detects Hugo from generator', () => { + const html = wrap('') + const result = detectPlatformFromInput({ html }) + const hugo = result.detected.find((p) => p.id === 'hugo') + expect(hugo).toBeDefined() + expect(hugo!.version).toBe('0.121.1') + }) + + it('detects Jekyll from generator', () => { + const html = wrap('') + const result = detectPlatformFromInput({ html }) + const j = result.detected.find((p) => p.id === 'jekyll') + expect(j).toBeDefined() + expect(j!.version).toBe('4.3.2') + }) + + it('detects Docusaurus from __docusaurus marker', () => { + const html = wrap('') + const result = detectPlatformFromInput({ html }) + expect(result.detected.find((p) => p.id === 'docusaurus')).toBeDefined() + }) + }) + + describe('Hosting detection', () => { + it('detects Vercel from x-vercel-id header', () => { + const result = detectPlatformFromInput({ + html: wrap(), + headers: { 'x-vercel-id': 'sfo1::abc123', server: 'Vercel' }, + }) + const v = result.detected.find((p) => p.id === 'vercel') + expect(v).toBeDefined() + expect(v!.confidence).toBe('high') + }) + + it('detects Netlify from x-nf-request-id header', () => { + const result = detectPlatformFromInput({ + html: wrap(), + headers: { 'x-nf-request-id': 'abc-123', server: 'Netlify' }, + }) + expect(result.detected.find((p) => p.id === 'netlify')).toBeDefined() + }) + + it('detects Cloudflare from cf-ray header', () => { + const result = detectPlatformFromInput({ + html: wrap(), + headers: { server: 'cloudflare', 'cf-ray': '8a1b2c3d-LAX' }, + }) + expect(result.detected.find((p) => p.id === 'cloudflare')).toBeDefined() + }) + + it('detects GitHub Pages from server header', () => { + const result = detectPlatformFromInput({ + html: wrap(), + headers: { server: 'GitHub.com', 'x-github-request-id': 'abc' }, + }) + expect(result.detected.find((p) => p.id === 'github-pages')).toBeDefined() + }) + }) + + describe('Custom site behavior', () => { + it('flags isCustom=true when only framework + hosting are detected', () => { + const html = wrap( + '', + '', + ) + const result = detectPlatformFromInput({ + html, + headers: { 'x-vercel-id': 'sfo1::abc' }, + }) + expect(result.isCustom).toBe(true) + expect(result.detected.find((p) => p.id === 'nextjs')).toBeDefined() + expect(result.detected.find((p) => p.id === 'vercel')).toBeDefined() + }) + + it('flags isCustom=false when a CMS is detected', () => { + const html = wrap( + '' + + '', + ) + const result = detectPlatformFromInput({ html }) + expect(result.isCustom).toBe(false) + }) + + it('returns no detections and isCustom=true for a fully blank page', () => { + const result = detectPlatformFromInput({ + html: 'xhi', + }) + expect(result.detected).toHaveLength(0) + expect(result.isCustom).toBe(true) + expect(result.rawSignals.generator).toBeNull() + }) + + it('captures raw signals even when no platform matches', () => { + const html = wrap('') + const result = detectPlatformFromInput({ + html, + headers: { 'x-powered-by': 'PHP/8.2', server: 'nginx' }, + }) + expect(result.rawSignals.generator).toBe('MysteryStack 1.0') + expect(result.rawSignals.xPoweredBy).toBe('PHP/8.2') + expect(result.rawSignals.server).toBe('nginx') + }) + }) + + describe('Confidence + ranking', () => { + it('sorts detections by confidence score (highest first)', () => { + const html = wrap( + '' + + '', + '
', + ) + const result = detectPlatformFromInput({ html }) + const wpIdx = result.detected.findIndex((p) => p.id === 'wordpress') + const reactIdx = result.detected.findIndex((p) => p.id === 'react') + expect(wpIdx).toBeGreaterThanOrEqual(0) + // WordPress should rank above generic React if both detected. + if (reactIdx >= 0) expect(wpIdx).toBeLessThan(reactIdx) + }) + + it('respects minConfidence=high — drops weaker matches', () => { + const html = wrap('', '
') // weak React signal only + const all = detectPlatformFromInput({ html }) + const high = detectPlatformFromInput({ html }, { minConfidence: 'high' }) + // React alone would be at most ~medium; high filter should drop it. + expect(high.detected.length).toBeLessThanOrEqual(all.detected.length) + expect(high.detected.find((p) => p.id === 'react')).toBeUndefined() + }) + }) +})