feat: site cap (1000/section), HF source splits, live hero count #19
Merged
Website:
- Cap each browse section (arXiv / conference / journal) to 1,000 papers (3,000 total). _queryPapers now clamps the reported count to SECTION_CAP and trims rows past the cap, so pagination stops at 1,000 per section. The full corpus stays available via the API and the HF dataset; the static JSON was already capped at 1,000.

HF dataset:
- Split the `papers` config by source into `arxiv`, `conference`, and `journal` splits (uploaded as papers_<source>.jsonl), alongside the existing combined `train` split (kept for backward compatibility). Users can now `load_dataset(repo, "papers", split="journal")`.
- Card updated: per-source counts in stats, file table, and usage examples.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
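The clamping described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual `_queryPapers` code: only `SECTION_CAP` and the idea of clamping the count and trimming rows come from the commit message, while `clampPage`, its parameters, and the response shape `{ total, rows }` are assumptions made for the example.

```javascript
// Hypothetical sketch: only SECTION_CAP and the clamp-and-trim idea come from the PR.
const SECTION_CAP = 1000;

// Clamp one page of results so a section never exposes more than `cap` rows.
// `total` is the upstream count, `rows` is the current page,
// `offset` is the zero-based index of the page's first row (assumed paging model).
function clampPage({ total, rows }, offset, cap = SECTION_CAP) {
  const remaining = Math.max(0, cap - offset); // rows still allowed at this offset
  return {
    total: Math.min(total, cap),    // reported count never exceeds the cap
    rows: rows.slice(0, remaining), // drop rows that fall past the cap
  };
}

// A page of 50 rows starting at offset 980 keeps only the 20 rows below the cap.
const page = clampPage({ total: 74000, rows: new Array(50).fill({}) }, 980);
console.log(page.total, page.rows.length); // 1000 20
```

Clamping the reported total as well as the rows matters here: if only the rows were trimmed, the pager would still offer pages beyond the cap and show them empty.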
The homepage hero claimed "83,000+ papers", which was long stale (the corpus is now 100K+; the Railway API holds ~165K).

Fixes:
- Hero count is now dynamic: loadStats() injects the live total from stats.json, rounded down to a clean "N,000+", so it never goes stale again (static fallback "100,000+" before JS loads).
- Replace remaining hardcoded "83K+"/"83,000+" copy with "100,000+" across README, the index feature card, and the sign-in/register pages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
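The rounding step can be illustrated with a small helper. This is a sketch under stated assumptions rather than the code in `loadStats()`: the name `formatHeroCount`, the fallback handling for missing or sub-1,000 totals, and the use of `toLocaleString` are all hypothetical; the commit only says the live total is rounded down to "N,000+" with a static "100,000+" fallback.

```javascript
// Hypothetical helper: round a live total down to the nearest thousand
// and format it as "N,000+". Not the actual loadStats() implementation.
function formatHeroCount(total, fallback = "100,000+") {
  // Assumption: keep the static text when the stat is missing or below 1,000.
  if (!Number.isFinite(total) || total < 1000) return fallback;
  const rounded = Math.floor(total / 1000) * 1000; // round down, never up
  return rounded.toLocaleString("en-US") + "+";    // e.g. 165000 -> "165,000+"
}

console.log(formatHeroCount(165432));    // "165,000+"
console.log(formatHeroCount(undefined)); // "100,000+"
```

Rounding down rather than to the nearest thousand keeps the "+" suffix truthful: the displayed figure is never larger than the real total.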
Three related site/data-exposure changes.
1. Website shows 3,000 papers (1,000 each)
The browse pages (Papers / Conferences / Journals) query Railway via `_queryPapers`, which reported `json.total`, so pagination spanned the full corpus (74k+ journals). Now it clamps the count to `SECTION_CAP = 1000` and trims rows past the cap, so each section paginates through at most 1,000 (3,000 total). The static-JSON fallback is clamped too. The full corpus stays available via the API and the HF dataset.

2. HF dataset: separate arXiv / conference / journal
The `papers` config now also publishes per-source splits (`arxiv`, `conference`, `journal`), uploaded as `data/papers_{arxiv,conference,journal}.jsonl`. The combined `train` split is kept for the existing downloaders. Card updated with per-source counts and usage.

3. Fix stale "83,000+" paper count on the homepage
The hero claimed "83,000+ papers", which was long outdated (corpus is now 100K+; Railway holds ~165K). The hero count is now dynamic (injected from `stats.json` by `loadStats()`, rounded to a clean "N,000+"), so it won't go stale again. Remaining hardcoded "83K+" copy across the README, index feature card, and sign-in/register pages has been replaced with "100,000+".

Verification
- `node --check` passes on `railway-api.js` and `app.js`.
- `_bucket` unit-checked; card front-matter validates as YAML with splits `[train, arxiv, conference, journal]`.
- No `83K`/`83,000` references remain.

HF per-source splits and the live hero count populate on the next pipeline run that pushes data (any sync with `HF_TOKEN`).

🤖 Generated with Claude Code