feat: site cap (1000/section), HF source splits, live hero count#19

Merged: kishormorol merged 2 commits into main from feat/site-cap-and-hf-source-splits on Jun 6, 2026

@kishormorol kishormorol commented Jun 6, 2026

Owner

Three related site/data-exposure changes.

1. Website shows 3,000 papers (1,000 each)

The browse pages (Papers / Conferences / Journals) query Railway via _queryPapers, which previously reported json.total, so pagination spanned the full corpus (74k+ journals). It now clamps the reported count to SECTION_CAP = 1000 and trims rows past the cap, so each section paginates through at most 1,000 papers (3,000 total). The static-JSON fallback is clamped too. The full corpus stays available via the API and the HF dataset.
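
The clamping described above can be sketched roughly as follows. This is an illustration in Python, not the actual `_queryPapers` code (which lives in the site's JavaScript). The helper name `clamp_page`, its parameters, and the offset-based pagination are assumptions made for the example.

```python
SECTION_CAP = 1000  # per-section cap named in this PR


def clamp_page(rows, total, offset, cap=SECTION_CAP):
    """Return (rows, total) limited so pagination never goes past `cap`.

    Hypothetical helper: `rows` is one page of results starting at
    `offset`, and `total` is the full match count reported by the API.
    """
    clamped_total = min(total, cap)
    # Number of rows on this page that still fall inside the cap.
    remaining = max(cap - offset, 0)
    return rows[:remaining], clamped_total


# A page of 50 rows starting at offset 980 of a 74,000-row result set:
page, total = clamp_page(list(range(50)), 74_000, offset=980)
print(len(page), total)  # 20 1000
```

Clamping the reported total alone would stop the pager at 1,000, but the last page could still carry rows beyond the cap, which is presumably why the PR trims rows as well.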

2. HF dataset: separate arXiv / conference / journal

The papers config now also publishes per-source splits:

load_dataset("kishormorol/researchscope-papers", "papers", split="journal")  # or conference / arxiv

Uploaded as data/papers_{arxiv,conference,journal}.jsonl. The combined train split is kept for the existing downloaders. Card updated with per-source counts + usage.
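
A minimal sketch of how records might be partitioned into the three per-source files. The PR mentions a `_bucket` helper but does not show it, so the rule below (a `source` field on each record, with anything unrecognized falling into `arxiv`) is an assumption, as are the function names.

```python
import json

SPLITS = ("arxiv", "conference", "journal")


def bucket(record):
    """Map a paper record to a split name (hypothetical rule).

    Assumes each record carries a `source` field; anything that is not
    a conference or journal is treated as arXiv here.
    """
    source = str(record.get("source", "")).lower()
    return source if source in ("conference", "journal") else "arxiv"


def split_by_source(records):
    """Group records into one list per split, keeping input order."""
    groups = {name: [] for name in SPLITS}
    for record in records:
        groups[bucket(record)].append(record)
    return groups


def to_jsonl(records):
    """Serialize records as JSON Lines text (one object per line)."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


sample = [
    {"id": "a1", "source": "arxiv"},
    {"id": "j1", "source": "journal"},
    {"id": "c1", "source": "Conference"},
]
groups = split_by_source(sample)
for name in SPLITS:
    # In a real pipeline this text would be written to data/papers_<name>.jsonl.
    print(name, len(groups[name]))
```

Because the combined `train` split is kept, the per-source files are additive; existing downloaders that read `train` are unaffected.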

3. Fix stale "83,000+" paper count on the homepage

The hero claimed "83,000+ papers", which was long outdated (the corpus is now 100K+; Railway holds ~165K). The hero count is now dynamic: loadStats() injects it from stats.json, rounded down to a clean "N,000+", so it won't go stale again. The remaining hardcoded "83K+" copy across the README, index feature card, and sign-in/register pages was replaced with "100,000+".
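
The "N,000+" rounding can be illustrated with a few lines of Python. The real logic sits in `loadStats()` in the site's JavaScript; the function name `hero_count_label` is hypothetical, and the sketch assumes the total is a non-negative integer of at least 1,000.

```python
def hero_count_label(total):
    """Round `total` down to the nearest thousand and format as "N,000+"."""
    rounded = (total // 1000) * 1000
    return f"{rounded:,}+"


print(hero_count_label(165_432))  # 165,000+
print(hero_count_label(100_999))  # 100,000+
```

Rounding down rather than to the nearest thousand keeps the "+" claim true: the displayed figure never exceeds the actual count.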

Verification

  • node --check passes on railway-api.js and app.js.
  • HF _bucket unit-checked; card front-matter validates as YAML with splits [train, arxiv, conference, journal].
  • Full suite: 147 passed.
  • No 83K/83,000 references remain.

HF per-source splits + the live hero count populate on the next pipeline run that pushes data (any sync with HF_TOKEN).

🤖 Generated with Claude Code

kishormorol and others added 2 commits June 6, 2026 19:04
Website:
- Cap each browse section (arXiv / conference / journal) to 1,000 papers
  (3,000 total). _queryPapers now clamps the reported count to SECTION_CAP
  and trims rows past the cap, so pagination stops at 1,000 per section.
  The full corpus stays available via the API and the HF dataset; the
  static JSON was already capped at 1,000.

HF dataset:
- Split the `papers` config by source into `arxiv`, `conference`, and
  `journal` splits (uploaded as papers_<source>.jsonl), alongside the
  existing combined `train` split (kept for backward compatibility).
  Users can now `load_dataset(repo, "papers", split="journal")`.
- Card updated: per-source counts in stats, file table, and usage examples.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The homepage hero claimed "83,000+ papers", which is long stale (the corpus is now
100K+; the Railway API holds ~165K). Fixes:

- Hero count is now dynamic: loadStats() injects the live total from
  stats.json, rounded down to a clean "N,000+", so it never goes stale again
  (static fallback "100,000+" before JS loads).
- Replace remaining hardcoded "83K+"/"83,000+" copy with "100,000+" across
  README, the index feature card, and the sign-in/register pages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kishormorol kishormorol changed the title from "feat: cap site to 1000/section + split HF dataset by source" to "feat: site cap (1000/section), HF source splits, live hero count" on Jun 6, 2026
@kishormorol kishormorol merged commit 5aa32f3 into main Jun 6, 2026
1 check passed