Problem
cnry analytics <project> --feature sources (and the API + report behind it) is the surface for "where do AI engines get their facts about this project." Today it has three gaps that force an operator to query the SQLite DB directly to answer common questions.
Truncated to top-5 per category. topDomains caps at 5 per category, so the long tail is invisible. A domain that ranks near the top by raw citation count can be absent from the surfaced list because it sits inside the large catch-all category. There is no way to get the full ranked list of cited domains from the CLI or API.
No per-provider breakdown. sources aggregates across all providers (overall + byQuery, no byProvider). "What does provider X cite, and how much does it ground at all?" is unanswerable from the CLI, even though query_snapshots.cited_domains is stored per snapshot per provider. This blocks comparing engines (one engine may ground heavily on a single source while another barely grounds).
Coarse categories hide the structure. Domains are bucketed as other / video / directory / news, so the large majority land in a single "Independent sites" bucket. That bucket mixes booking aggregators, listicle/recommendation aggregators, direct competitors, editorial/curator media, and the project's own domains. Concrete failure: a hotel-recommendation aggregator and a rival business in the same vertical both fall into "Independent sites," even though one is a placement target and the other is a competitor to out-rank. That distinction is exactly what makes the ranking actionable.
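To make the first gap concrete, here is a minimal sketch of how a per-category cap can hide a highly cited domain. The domain names, counts, and the `top_per_category` helper are hypothetical and not taken from the cnry codebase. Only the cap of five per category and the four category names come from the description above.

```python
from collections import defaultdict

# Hypothetical citation counts: (domain, category, citations).
CITED = [
    ("indie-a.example", "other", 90),
    ("indie-b.example", "other", 80),
    ("indie-c.example", "other", 70),
    ("indie-d.example", "other", 60),
    ("indie-e.example", "other", 50),
    ("indie-f.example", "other", 40),  # sixth inside the catch-all category
    ("clips.example", "video", 12),
    ("listing.example", "directory", 9),
    ("daily.example", "news", 7),
]


def top_per_category(rows, cap=5):
    """Keep only the `cap` most-cited domains within each category."""
    by_category = defaultdict(list)
    for domain, category, count in rows:
        by_category[category].append((domain, count))
    return {
        category: sorted(items, key=lambda item: -item[1])[:cap]
        for category, items in by_category.items()
    }


# Domains that survive the per-category cap.
surfaced = {
    domain
    for items in top_per_category(CITED).values()
    for domain, _ in items
}

# Overall ranking by raw citation count, ignoring categories.
overall = sorted(CITED, key=lambda row: -row[2])

# (overall rank, domain, citations) for every domain the cap dropped.
hidden = [
    (rank, domain, count)
    for rank, (domain, _, count) in enumerate(overall, start=1)
    if domain not in surfaced
]
print(hidden)
```

With these made-up numbers, the sixth most-cited domain overall is dropped while domains with far fewer citations are surfaced, because they sit in smaller categories.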
Proposed changes (exposure only, no new data collection)
All three read from data already stored in query_snapshots.cited_domains, which is already per-snapshot and per-provider. No collection change, no schema change.
Full ranked sources. Return the complete ranked domain list (or top-N with an explicit long-tail rollup) via cnry sources <project> --rank [--limit N] --format json and the backing endpoint. Stop truncating to five-per-category in the machine output.
Per-provider cut. Add a byProvider breakdown (--by-provider) so each provider's cited-domain mix and cited-slot total is one call.
Classify by bucket. Tag each cited domain with a class (for example: ota-aggregator / direct-competitor / editorial-media / own / other) by reusing the discovery domain classifier, replacing or augmenting the current other/video/directory/news categories.
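The three changes can be computed from the same rows, as the sketch below tries to show. It rests on several assumptions:

- Each snapshot row exposes a provider name and `cited_domains` as a list of domain strings. The real storage encoding may differ.
- Every occurrence of a domain counts as one cited slot.
- A plain lookup table stands in for the discovery domain classifier.
- The output field names (`ranked`, `longTail`, `citedSlots`, `groundedSnapshots`) are illustrative. Only `byProvider` and the example class labels come from the proposal.

```python
from collections import Counter

# Hypothetical snapshot rows; provider and domain names are made up.
SNAPSHOTS = [
    {"provider": "engine-a", "cited_domains": ["ota.example", "rival.example", "ota.example"]},
    {"provider": "engine-a", "cited_domains": ["ota.example", "mag.example"]},
    {"provider": "engine-b", "cited_domains": []},  # answer with no grounding
    {"provider": "engine-b", "cited_domains": ["mine.example"]},
]

# Stand-in for the discovery domain classifier (assumption: a lookup table).
CLASS_OF = {
    "ota.example": "ota-aggregator",
    "rival.example": "direct-competitor",
    "mag.example": "editorial-media",
    "mine.example": "own",
}


def classify(domain):
    return CLASS_OF.get(domain, "other")


def rank(counts, limit=None):
    """Full ranked list, or top-`limit` plus an explicit long-tail rollup."""
    total = sum(counts.values())
    # Sort by count descending, then by domain name for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    head = ordered if limit is None else ordered[:limit]
    tail = [] if limit is None else ordered[limit:]
    ranked = [
        {
            "domain": domain,
            "count": count,
            "share": count / total if total else 0.0,
            "class": classify(domain),
        }
        for domain, count in head
    ]
    long_tail = {"domains": len(tail), "count": sum(n for _, n in tail)}
    return {"ranked": ranked, "longTail": long_tail}


def sources_report(snapshots, limit=None):
    overall = Counter()
    per_provider = {}
    for snap in snapshots:
        stats = per_provider.setdefault(
            snap["provider"], {"counts": Counter(), "snapshots": 0, "grounded": 0}
        )
        stats["snapshots"] += 1
        stats["grounded"] += 1 if snap["cited_domains"] else 0
        stats["counts"].update(snap["cited_domains"])
        overall.update(snap["cited_domains"])
    by_provider = {
        provider: {
            **rank(stats["counts"], limit),
            "citedSlots": sum(stats["counts"].values()),
            "groundedSnapshots": stats["grounded"],
            "snapshots": stats["snapshots"],
        }
        for provider, stats in per_provider.items()
    }
    return {"overall": rank(overall, limit), "byProvider": by_provider}


report = sources_report(SNAPSHOTS, limit=2)
```

Doing the counting, shares, and class tagging in one function mirrors the intent that all of it lives in the API response, so the CLI, MCP, and report consumers would only render what they receive.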
Related
Change 3 reuses the same discovery classifier proposed for content targets in docs(plans): add content brief synthesis design doc #673 (the surfaceClass gate). One classifier-exposure foundation, two consumers: content briefs and source ranking.
Overlaps the SoV rework (plans/sov-rework.md). Note: Retrieval Share was evaluated and rejected, because per-provider extractors already decode grounding URLs into cited_domains, so a grounding-share metric equals the existing citation share. The win here is exposing the cited_domains data already collected, not a new metric.
Surface and parity
API first, then CLI, then MCP and report, per the agent-first parity rules. All counts, shares, and classification live in the API response, not in any consumer.
🤖 Generated with Claude Code