analytics(sources): expose full, per-provider, and classified source rankings

## Problem

`cnry analytics <project> --feature sources` (and the API + report behind it) is the surface for "where do AI engines get their facts about this project." Today it has three gaps that force an operator to query the SQLite DB directly to answer common questions.

1. **Truncated to top-5 per category.** `topDomains` caps at 5 per category, so the long tail is invisible. A domain that ranks near the top by raw citation count can be absent from the surfaced list because it sits inside the large catch-all category. There is no way to get the full ranked list of cited domains from the CLI or API.

2. **No per-provider breakdown.** `sources` aggregates across all providers (`overall` + `byQuery`, no `byProvider`). "What does provider X cite, and how much does it ground at all?" is unanswerable from the CLI, even though `query_snapshots.cited_domains` is stored per snapshot per provider. This blocks comparing engines (one engine may ground heavily on a single source while another barely grounds).

3. **Coarse categories hide the structure.** Domains are bucketed as other / video / directory / news, so the large majority land in a single "Independent sites" bucket. That bucket mixes booking aggregators, listicle/recommendation aggregators, direct competitors, editorial/curator media, and the project's own domains. Concrete failure: a hotel-recommendation aggregator and a rival business in the same vertical both fall into "Independent sites," even though one is a placement target and the other is a competitor to out-rank. That distinction is exactly what makes the ranking actionable.
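To make gap 1 concrete, here is a minimal Python sketch with made-up domains and counts. None of this is real project data, and `top_per_category` is a hypothetical stand-in for the five-per-category cap described above, not the actual `topDomains` implementation.

```python
from collections import defaultdict

# Hypothetical (domain, category, citation_count) rows, for illustration only.
CITATIONS = [
    ("a.example", "other", 50),
    ("b.example", "other", 45),
    ("c.example", "other", 40),
    ("d.example", "other", 35),
    ("e.example", "other", 30),
    ("f.example", "other", 28),
    ("clips.example", "video", 4),
    ("daily.example", "news", 3),
]


def top_per_category(rows, cap=5):
    """Keep only the `cap` most-cited domains inside each category."""
    by_category = defaultdict(list)
    for domain, category, count in rows:
        by_category[category].append((domain, count))
    surfaced = {}
    for category, items in by_category.items():
        # Sort by count descending, then by domain name for a stable order.
        items.sort(key=lambda item: (-item[1], item[0]))
        surfaced[category] = items[:cap]
    return surfaced


def full_ranking(rows):
    """Rank every domain by raw citation count, ignoring categories."""
    return sorted(((d, n) for d, _, n in rows), key=lambda item: (-item[1], item[0]))


surfaced = top_per_category(CITATIONS)
surfaced_domains = {d for items in surfaced.values() for d, _ in items}
ranking = [d for d, _ in full_ranking(CITATIONS)]
```

In this toy data `f.example` is sixth by raw count but is missing from `surfaced_domains`, while `daily.example` with 3 citations is surfaced because its category is nearly empty.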

## Proposed changes (exposure only, no new data collection)

All three changes read from `query_snapshots.cited_domains`, which is already stored per snapshot and per provider. No collection change, no schema change.

1. **Full ranked sources.** Return the complete ranked domain list (or top-N with an explicit long-tail rollup) via `cnry sources <project> --rank [--limit N] --format json` and the backing endpoint. Stop truncating to five-per-category in the machine output.

2. **Per-provider cut.** Add a `byProvider` breakdown (`--by-provider`) so each provider's cited-domain mix and cited-slot total are available in one call.

3. **Classify by bucket.** Tag each cited domain with a class (for example: ota-aggregator / direct-competitor / editorial-media / own / other) by reusing the discovery domain classifier, replacing or augmenting the current other/video/directory/news categories.
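Changes 1 and 2 could be sketched roughly as below. This is an illustration under assumptions, not the proposed implementation: the row shape `(provider, cited_domains)`, the idea that each list entry counts as one cited slot, and the field names `totalCitedSlots`, `domains`, and `longTail` are all hypothetical. Only `overall` and `byProvider` come from this issue.

```python
from collections import Counter, defaultdict

# Hypothetical snapshot rows: (provider, cited_domains).
# Assumes cited_domains is available as a list of domain strings.
SNAPSHOTS = [
    ("provider-a", ["agg.example", "rival.example"]),
    ("provider-a", ["agg.example", "mag.example"]),
    ("provider-b", ["mag.example"]),
    ("provider-b", []),  # a snapshot where the provider cited nothing
]


def rank(counts, limit=None):
    """Rank domains by count; roll everything past `limit` into a long-tail summary."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    total = sum(counts.values())
    head = ranked if limit is None else ranked[:limit]
    tail = [] if limit is None else ranked[limit:]
    return {
        "totalCitedSlots": total,
        "domains": [
            {"domain": d, "count": n, "share": n / total if total else 0.0}
            for d, n in head
        ],
        "longTail": {"domains": len(tail), "count": sum(n for _, n in tail)},
    }


def source_rankings(snapshots, limit=None):
    """Build the overall ranking plus one ranking per provider."""
    overall = Counter()
    per_provider = defaultdict(Counter)
    for provider, domains in snapshots:
        overall.update(domains)
        per_provider[provider].update(domains)
    return {
        "overall": rank(overall, limit),
        "byProvider": {p: rank(c, limit) for p, c in sorted(per_provider.items())},
    }


result = source_rankings(SNAPSHOTS, limit=2)
```

With `limit=None` the full list is returned and the long-tail rollup is empty, so nothing is silently dropped in either mode.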

## Related

- Change 3 reuses the same discovery classifier proposed for content targets in #673 (the `surfaceClass` gate). One classifier-exposure foundation, two consumers: content briefs and source ranking.
- Overlaps the SoV rework (`plans/sov-rework.md`). Note: Retrieval Share was evaluated and rejected. Per-provider extractors already decode grounding URLs into `cited_domains`, so a grounding-share metric would equal the existing citation share. The win here is exposing the `cited_domains` data already collected, not a new metric.

## Surface and parity

API first, then CLI, then MCP and report, per the agent-first parity rules. All counts, shares, and classification live in the API response, not in any consumer.
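One way to picture the parity rule is the Python sketch below, where the API layer computes counts, shares, and classes once and each consumer only formats what it receives. Everything here is hypothetical: `classify` is a lookup-table stand-in for the discovery classifier (whose real interface is not described in this issue), and `build_response`, `render_cli`, `render_json`, and the field names are illustrative.

```python
import json

# Hypothetical stand-in for the discovery domain classifier.
KNOWN_CLASSES = {
    "agg.example": "ota-aggregator",
    "rival.example": "direct-competitor",
    "mag.example": "editorial-media",
    "myhotel.example": "own",
}


def classify(domain):
    """Return the class for a domain, falling back to 'other'."""
    return KNOWN_CLASSES.get(domain, "other")


def build_response(counts):
    """API layer: compute counts, shares, and classes in one place."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        "totalCitedSlots": total,
        "domains": [
            {
                "domain": d,
                "count": n,
                "share": round(n / total, 4) if total else 0.0,
                "class": classify(d),
            }
            for d, n in ranked
        ],
    }


def render_cli(response):
    """CLI consumer: formats fields from the response and derives nothing itself."""
    return "\n".join(
        f"{row['domain']:<18} {row['class']:<18} {row['count']:>3} {row['share']:.0%}"
        for row in response["domains"]
    )


def render_json(response):
    """Machine consumer: serializes the same response unchanged."""
    return json.dumps(response, sort_keys=True)


response = build_response({"agg.example": 3, "rival.example": 1, "blog.example": 1})
```

Because both renderers read the same `response`, a CLI table and a JSON payload cannot disagree on a share or a class.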

🤖 Generated with [Claude Code](https://claude.com/claude-code)

analytics(sources): expose full, per-provider, and classified source rankings #675
