analytics(sources): expose full, per-provider, and classified source rankings #675

@arberx

Problem

cnry analytics <project> --feature sources (and the API + report behind it) is the surface for "where do AI engines get their facts about this project." Today it has three gaps that force an operator to query the SQLite DB directly to answer common questions.

  1. Truncated to top-5 per category. topDomains caps at 5 per category, so the long tail is invisible. A domain that ranks near the top by raw citation count can be absent from the surfaced list because it sits inside the large catch-all category. There is no way to get the full ranked list of cited domains from the CLI or API.

  2. No per-provider breakdown. sources aggregates across all providers (overall + byQuery, no byProvider). "What does provider X cite, and how much does it ground at all?" is unanswerable from the CLI, even though query_snapshots.cited_domains is stored per snapshot per provider. This blocks comparing engines (one engine may ground heavily on a single source while another barely grounds).

  3. Coarse categories hide the structure. Domains are bucketed as other / video / directory / news, so the large majority land in a single "Independent sites" bucket. That bucket mixes booking aggregators, listicle/recommendation aggregators, direct competitors, editorial/curator media, and the project's own domains. Concrete failure: a hotel-recommendation aggregator and a rival business in the same vertical both fall into "Independent sites," even though one is a placement target and the other is a competitor to out-rank. That distinction is exactly what makes the ranking actionable.
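To make gap 1 concrete, here is a small Python sketch with made-up numbers. The domain names, counts, and the `top_per_category` helper are hypothetical; it assumes only the behavior described above (a cap of five per category) and is not the project's actual implementation.

```python
from collections import defaultdict

# Hypothetical (domain, category, citation_count) rows; not real data.
CITED = [
    ("a.example", "other", 90),
    ("b.example", "other", 80),
    ("c.example", "other", 70),
    ("d.example", "other", 60),
    ("e.example", "other", 50),
    ("f.example", "other", 40),  # 6th inside "other": dropped by the cap
    ("v.example", "video", 5),
    ("n.example", "news", 3),
]


def top_per_category(rows, cap=5):
    """Keep at most `cap` domains per category, ranked by count within each."""
    by_cat = defaultdict(list)
    for domain, category, count in rows:
        by_cat[category].append((domain, count))
    return {
        cat: sorted(items, key=lambda item: -item[1])[:cap]
        for cat, items in by_cat.items()
    }


# Domains that survive the per-category cap.
surfaced = {d for items in top_per_category(CITED).values() for d, _ in items}
# Domains ranked by raw citation count, ignoring categories.
overall = [d for d, _, _ in sorted(CITED, key=lambda row: -row[2])]
# Domains present in the raw ranking but missing from the surfaced list.
hidden = [d for d in overall if d not in surfaced]
print(hidden)
```

In this toy data `f.example` is sixth overall with 40 citations yet never surfaces, while `v.example` with 5 citations does, purely because its category is small.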

Proposed changes (exposure only, no new data collection)

All three read from data already stored in query_snapshots.cited_domains, which is already per-snapshot and per-provider. No collection change, no schema change.

  1. Full ranked sources. Return the complete ranked domain list (or top-N with an explicit long-tail rollup) via cnry sources <project> --rank [--limit N] --format json and the backing endpoint. Stop truncating to five-per-category in the machine output.

  2. Per-provider cut. Add a byProvider breakdown (--by-provider) so each provider's cited-domain mix and cited-slot total are available in one call.

  3. Classify by bucket. Tag each cited domain with a class (for example: ota-aggregator / direct-competitor / editorial-media / own / other) by reusing the discovery domain classifier, replacing or augmenting the current other/video/directory/news categories.
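The three changes above could be sketched as a single aggregation over rows that are already collected. This is a minimal Python sketch under stated assumptions, not the intended implementation: each snapshot is modeled as a `(provider, cited_domains)` pair where `cited_domains` is a list of domain strings, one "cited slot" is one occurrence of a domain in a snapshot, shares are relative to the total of the cut they appear in, and `classify` stands in for the discovery domain classifier, whose real interface is not shown in this issue. The function name `rank_sources` and every output key other than `byProvider` are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

Snapshot = tuple[str, list[str]]  # (provider, cited_domains); assumed shape


def rank_sources(
    snapshots: Iterable[Snapshot],
    classify: Callable[[str], str],
    limit: Optional[int] = None,
) -> dict:
    """Rank cited domains overall and per provider, with a long-tail rollup."""
    overall: Counter = Counter()
    per_provider: dict[str, Counter] = {}
    for provider, domains in snapshots:
        overall.update(domains)
        # setdefault keeps providers that cited nothing, so "barely grounds" is visible.
        per_provider.setdefault(provider, Counter()).update(domains)

    def ranked(counts: Counter) -> dict:
        total = sum(counts.values())
        # Count descending, then domain name, so the order is deterministic.
        ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        head = ordered if limit is None else ordered[:limit]
        tail = [] if limit is None else ordered[limit:]
        return {
            "totalCitedSlots": total,
            "domains": [
                {"domain": d, "count": c, "share": c / total, "class": classify(d)}
                for d, c in head
            ],
            # Explicit rollup of everything past the limit, instead of silent truncation.
            "longTail": {"domains": len(tail), "count": sum(c for _, c in tail)},
        }

    return {
        "overall": ranked(overall),
        "byProvider": {p: ranked(c) for p, c in sorted(per_provider.items())},
    }


# Usage with made-up snapshots and a placeholder classifier.
snapshots = [
    ("provider-a", ["x.example", "y.example", "x.example"]),
    ("provider-b", ["y.example"]),
    ("provider-b", []),
]
result = rank_sources(snapshots, classify=lambda domain: "other", limit=1)
print(result["overall"]["longTail"])
```

Keeping the rollup and the per-provider totals in the same structure means a consumer only renders the result and never recomputes counts or shares.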

Related

  • Change 3 reuses the same discovery classifier proposed for content targets in #673, "docs(plans): add content brief synthesis design doc" (the surfaceClass gate). One classifier-exposure foundation, two consumers: content briefs and source ranking.
  • Overlaps the SoV rework (plans/sov-rework.md). Note: Retrieval Share was evaluated and rejected, because per-provider extractors already decode grounding URLs into cited_domains, so a grounding-share metric equals the existing citation share. The win here is exposing the cited_domains data already collected, not a new metric.

Surface and parity

API first, then CLI, then MCP and report, per the agent-first parity rules. All counts, shares, and classification live in the API response, not in any consumer.

🤖 Generated with Claude Code
