diff --git a/paper/main.pdf b/paper/main.pdf index c8fd980..a31ef02 100644 Binary files a/paper/main.pdf and b/paper/main.pdf differ diff --git a/paper/main.tex b/paper/main.tex index 2a92a8e..6a0cf6e 100644 --- a/paper/main.tex +++ b/paper/main.tex @@ -45,7 +45,7 @@ % Abstract \begin{abstract} -Understanding how scientific datasets are accessed and reused is essential for resource planning and impact assessment. Here we present the PRIDE Archive download tracking infrastructure and a comprehensive analysis of 159.3 million download records from the PRIDE proteomics database (2021--2025), spanning 35,528 datasets accessed from 235 countries. The infrastructure includes \texttt{nf-downloadstats}, a scalable Nextflow pipeline for processing download logs, and DeepLogBot, a machine-learning framework that classifies traffic into bots, institutional download hubs, and independent user downloads. DeepLogBot combines heuristic seed selection with blind multi-LLM annotation (Claude and Qwen3) to produce gold-standard training labels, achieving 92.2\% classification accuracy on a held-out test set. After separating automated traffic, analysis reveals downloads from 235 countries, 249 institutional download hubs, and a concentrated reuse distribution, with the top five countries (United States, United Kingdom, Germany, China, and Canada) accounting for over 54\% of independent user downloads. These findings provide actionable insights for repository infrastructure planning and highlight the importance of distinguishing automated from individual access in scientific data resources. +Understanding how scientific datasets are accessed and reused is essential for resource planning and impact assessment. Here we present the PRIDE Archive download tracking infrastructure and a comprehensive analysis of 159.3 million download records from the PRIDE proteomics database (2021--2025), spanning 35,528 datasets accessed from 235 countries. 
The infrastructure includes \texttt{nf-downloadstats}, a scalable Nextflow pipeline for processing download logs, and DeepLogBot, a machine-learning framework that classifies traffic into bots, institutional download hubs, and independent user downloads. DeepLogBot combines heuristic seed selection with multi-LLM annotation (Claude and Qwen3) to produce gold-standard training labels, achieving 92.2\% overall classification accuracy on a held-out test set. After separating bot traffic, analysis reveals downloads from 214 countries, 249 institutional download hubs, and a concentrated reuse distribution, with the top five countries (United States, United Kingdom, Germany, China, and Canada) accounting for over 54\% of independent user downloads. These findings provide actionable insights for repository infrastructure planning and highlight the importance of distinguishing automated from individual access in scientific data resources. \begin{sloppypar} \noindent\textbf{Availability:} The \texttt{nf-downloadstats} pipeline is available at \url{https://github.com/PRIDE-Archive/nf-downloadstats} and DeepLogBot at \url{https://github.com/ypriverol/deeplogbot}, both under the Apache 2.0 license. @@ -60,7 +60,7 @@ \section{Introduction} The PRIDE database is the world-leading data repository for mass spectrometry-based proteomics \citep{PerezRiverol2025}. As a founding member of the ProteomeXchange consortium \citep{Deutsch2026}, PRIDE enables researchers to share and access proteomics datasets globally, promoting transparency, reproducibility, and data reuse. Aligned with the FAIR principles (Findable, Accessible, Interoperable, and Reusable) \citep{Wilkinson2016}, PRIDE supports open science by ensuring that public datasets are well-annotated and machine-readable. These principles are essential for maximizing the value of shared scientific data. 
-Understanding how much public datasets are reused is essential for assessing the scientific impact of a data resource such as PRIDE, and informing data-driven policies. While citations in scholarly publications offer one indicator of reuse, data download statistics can provide a more granular view of data demand from any data resource. In our previous work \citep{PerezRiverol2019}, we demonstrated that usage/downloads metrics can serve as complementary indicators of scientific impact for a given publication, supporting improved data stewardship, resource allocation, and funding decisions. Beyond measuring the global impact of a data resource, download statistics are critical for designing more effective data infrastructures. Download patterns can inform the optimization of data access protocols, guide the prioritization of metadata curation and visualization features, and identify high-value datasets for targeted annotation or integration efforts. As public data volumes continue to grow, usage-driven strategies become increasingly important for improving dataset discoverability and reuse. For example, in the concrete case of PRIDE, frequently downloaded datasets - particularly those used as community benchmarks - could be prioritized for enhanced manual metadata curation/annotation (e.g., SDRF-based sample descriptions \citep{Dai2021}) and enriched with curated tags and keywords, making them easier to find through search interfaces. Users could then combine these annotations with download counts as a proxy to identify the most relevant and community-validated datasets for reuse. Similarly, repositories can leverage download patterns to allocate faster transfer services and optimized storage for high-demand datasets, ensuring that the most reused data remain readily accessible. In contrast, knowing which datasets are less frequently downloaded can help repositories to identify underused datasets that may benefit from improved metadata and file organization. 
+Understanding how much public datasets are reused is essential for assessing the scientific impact of a data resource such as PRIDE and for informing data-driven policies. While citations in scholarly publications offer one indicator of reuse, data download statistics can provide a more granular view of data demand from any data resource. In our previous work \citep{PerezRiverol2019}, we demonstrated that usage/download metrics can serve as complementary indicators of scientific impact for a given publication, supporting improved data stewardship, resource allocation, and funding decisions. Beyond measuring the global impact of a data resource, download statistics are critical for designing more effective data infrastructures. Download patterns can inform the optimization of data access protocols, guide the prioritization of metadata curation and visualization features, and identify high-value datasets for targeted annotation or integration efforts. As public data volumes continue to grow, usage-driven strategies become increasingly important for improving dataset discoverability and reuse. For example, in the concrete case of PRIDE, frequently downloaded datasets, particularly those used as community benchmarks, could be prioritized for enhanced manual metadata curation/annotation (e.g., SDRF-based sample metadata \citep{Dai2021}) and enriched with curated tags and keywords, making them easier to find through search interfaces. Users could then combine these annotations with download counts as a proxy to identify the most relevant and community-validated datasets for reuse. Similarly, repositories can leverage download patterns to allocate faster transfer services and optimized storage for high-demand datasets, ensuring that the most reused data remain readily accessible. In contrast, knowing which datasets are less frequently downloaded can help repositories identify underused datasets that may benefit from improved metadata and file organization. 
Despite their importance, systematic tracking of dataset downloads remains a major challenge across bioinformatics resources in general, including PRIDE and the other ProteomeXchange partners. Although some common access statistics have been widely adopted across data resources \citep{Perez2019}, barriers include the absence of a standardized infrastructure for logging access events, technical complexities in aggregating usage data across distributed and heterogeneous transfer systems (e.g., FTP, HTTP), and ongoing concerns related to user privacy and data protection. Compounding these challenges, automated bot traffic contaminates download statistics: studies estimate that bots account for 30--70\% of all internet traffic (\url{https://cpl.thalesgroup.com/ppc/application-security/bad-bot-report}), and scientific repositories are particularly attractive targets due to their open-access policies and valuable content \citep{Orr2025}. Without accounting for this contamination, any analysis of repository usage risks drawing conclusions from inflated and distorted metrics. As bioinformatics resources continue to scale in both size and complexity, robust download analytics will become increasingly vital, not only for measuring impact but also for enabling smarter, user-informed development of open data platforms. @@ -82,18 +82,18 @@ \subsection{nf-downloadstats: Log Processing Pipeline} To efficiently process the large volume of log files, parallelization is employed using a high-performance computing (HPC) environment managed via Slurm. Log files are processed in batches, with batch sizes and filtering criteria defined in a user-friendly YAML configuration file. The output is a consolidated 4.7~GB Parquet file containing 159,327,635 individual download records spanning five years: from January 2021 through December 2025 (2025 data extends to December~10). Each record includes the download date, geolocation, dataset accession, filename, and download method (protocol). 
The data covers 35,528 unique dataset accessions accessed from 235 countries. -For analysis, the full 159.3 million download events are aggregated at the \textit{location} level, where each location represents a unique geographic coordinate. Throughout this work, a ``unique user'' is defined as a distinct anonymized IP hash; because IP addresses are hashed irreversibly at ingestion, we cannot link users across locations or identify individuals, but can count distinct downloaders per location. Each location profile is characterized by behavioral features including download volumes, user counts, temporal patterns (working hours ratio, night activity, hourly entropy), protocol usage (HTTP, FTP, Aspera, Globus shares), burst patterns, user coordination scores, year-over-year growth metrics, and user distribution statistics (entropy, Gini coefficient, single-download user ratio). Full feature descriptions are provided in \textbf{Supplementary Notes Section~S3}. +For analysis, the full 159.3 million download events are aggregated at the \textit{location} level, where each location represents a unique geographic coordinate. Throughout this work, a ``unique session'' is defined as a distinct anonymized IP hash; because IP addresses are hashed irreversibly at ingestion, we cannot link users across locations or identify individuals, but can count distinct downloaders per location. Each location profile is characterized by behavioral features including download volumes, user counts, temporal patterns (working hours ratio, night activity, hourly entropy), protocol usage (HTTP, FTP, Aspera, Globus shares), burst patterns, user coordination scores, year-over-year growth metrics, and user distribution statistics (entropy, Gini coefficient, single-download user ratio). Full feature descriptions are provided in \textbf{Supplementary Notes Section~S3}. 
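To make the location-level aggregation concrete, the following is an illustrative sketch of two of the user-distribution features named above, hourly entropy and the Gini coefficient of per-user download counts. The column names (`location`, `user_hash`, `hour`) and the toy records are assumptions for the example; the actual feature definitions live in Supplementary Notes Section~S3.

```python
import numpy as np
import pandas as pd

def hourly_entropy(hours: pd.Series) -> float:
    """Shannon entropy (bits) of the hour-of-day histogram.
    Uniform round-the-clock activity approaches log2(24) ~= 4.58,
    while human users with a circadian rhythm score much lower."""
    p = hours.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def gini(counts: np.ndarray) -> float:
    """Gini coefficient of per-user download counts:
    0 = downloads spread evenly across users, near 1 = one user dominates."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0.0:
        return 0.0
    cum = np.cumsum(x)
    return float((n + 1 - 2 * (cum / cum[-1]).sum()) / n)

# Toy records; in practice these come from the consolidated Parquet file.
records = pd.DataFrame({
    "location": ["loc1"] * 6,
    "user_hash": ["a", "a", "a", "b", "b", "c"],
    "hour": [9, 10, 14, 9, 15, 3],
})

# One behavioral profile per geographic location.
profile = pd.DataFrame({
    loc: {
        "downloads": len(g),
        "unique_users": g["user_hash"].nunique(),
        "hourly_entropy": hourly_entropy(g["hour"]),
        "user_gini": gini(g["user_hash"].value_counts().to_numpy()),
    }
    for loc, g in records.groupby("location")
}).T
```

Under this sketch, bot-like locations would combine high `hourly_entropy` with a low `user_gini` spread over many few-download users, while hubs would show few users with highly concentrated counts.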
\subsection{Traffic Classification Framework} -To distinguish automated traffic from individual user access, we developed DeepLogBot, a semi-supervised four-phase classification pipeline (\textbf{Figure~\ref{fig:bot_overview}}). Each geographic location is classified as bot, hub (legitimate automation), or user. In Phase~1, heuristic rules derive training seeds from structural signals: organic seeds via a three-tier system (individual researchers, active researchers, research groups), bot seeds via six complementary signals (bot-farm, distributed, nocturnal, coordinated, scraper, and explosive-growth patterns), and hub seeds from institutional mirrors and sustained high-volume sites. In Phase~2, to break the circularity of heuristic-only training, 1,153 locations sampled across 20 stratified feature-space zones were annotated blindly by two LLMs --- Claude Opus 4.6 and Qwen3-30B-A3B --- using only behavioral features and geographic context, yielding 934 consensus labels. These were split into training (67\%) and held-out test (33\%) sets and injected as high-confidence seeds that override heuristic labels. In Phase~3, a gradient-boosted classifier (200 estimators, max depth 5) is trained on the gold-standard labels using 36 behavioral features. Phase~4 applies post-classification hub protection --- locations with institutional download patterns (specialized protocols, multi-year activity) are never classified as bots --- flags locations with fewer than 3 downloads as insufficient evidence, and derives final boolean labels. Full pipeline details are in \textbf{Supplementary Notes Sections~S2 and~S5.2}. +To distinguish automated traffic from individual user access, we developed DeepLogBot, a semi-supervised four-phase classification pipeline (\textbf{Figure~\ref{fig:bot_overview}}). Each geographic location is classified as bot, hub (legitimate automation), or user. 
In Phase~1, heuristic rules derive training seeds from structural signals: user seeds via a three-tier system (individual researchers, active researchers, research groups), bot seeds via six complementary signals (bot-farm, distributed, nocturnal, coordinated, scraper, and explosive-growth patterns), and hub seeds from institutional mirrors and sustained high-volume sites. In Phase~2, to break the circularity of heuristic-only training, 1,153 locations sampled across 20 stratified feature-space zones were annotated blindly by two LLMs, Claude Opus 4.6 and Qwen3-30B-A3B, using only behavioral features and geographic context, yielding 934 consensus labels. These were split into training (67\%) and held-out test (33\%) sets and injected as high-confidence seeds that override heuristic labels. In Phase~3, a gradient-boosted classifier (200 estimators, max depth 5) is trained on the gold-standard labels using 36 behavioral features. Phase~4 applies post-classification hub protection (locations with institutional download patterns, such as specialized protocols and multi-year activity, are never classified as bots), flags locations with fewer than 3 downloads as insufficient evidence, and derives final boolean labels. Full pipeline details are in \textbf{Supplementary Notes Sections~S2 and~S5.2}. -Training on gold-standard consensus labels improved classification accuracy from 62.1\% (heuristic seeds only) to 92.2\% on the held-out test set of 309 locations. Per-class F1 scores improved substantially: bot from 0.734 to 0.946, hub from 0.687 to 0.933, and organic from 0.305 to 0.849. The largest gains came from correctly reclassifying research-city locations previously mislabeled as bots and residential-area bots previously mislabeled as organic. 
Bot locations are typically characterized by large numbers of anonymized users each making very few downloads (3--15 downloads per user), uniform temporal patterns lacking circadian rhythm, and activity concentrated in a single year --- consistent with distributed web crawlers, scrapers, or automated scanning tools. Full evaluation details including confusion matrices are in \textbf{Supplementary Notes Section~S5.2}. +Training on gold-standard consensus labels improved classification accuracy from 62.1\% (heuristic seeds only) to 92.2\% on the held-out test set of 309 locations. Per-class F1 scores improved substantially: bot from 0.734 to 0.946, hub from 0.687 to 0.933, and user from 0.305 to 0.849. The largest gains came from correctly reclassifying research-city locations previously mislabeled as bots and residential-area bots previously mislabeled as independent users. Bot locations are typically characterized by large numbers of anonymized users each making very few downloads (3--15 downloads per user), uniform temporal patterns lacking circadian rhythm, and activity concentrated in a single year, consistent with distributed web crawlers, scrapers, or automated scanning tools. Full evaluation details including confusion matrices are in \textbf{Supplementary Notes Section~S5.2}. \begin{figure}[H] \centering \includegraphics[width=\textwidth]{figures/figure1_pipeline_overview.pdf} -\caption{PRIDE Download Traffic Classification Pipeline. 
End-to-end pipeline comprising two components: \texttt{nf-downloadstats} (green dashed box) processes raw PRIDE download logs into a consolidated Parquet file (159M records, 4.7~GB), and DeepLogBot (blue dashed box) aggregates records to geographic locations, extracts behavioral features per location, and applies a four-phase classification pipeline --- heuristic seed selection, blind multi-LLM seed refinement, gradient-boosted fusion meta-learner with Platt calibration, and hub protection with finalization --- to classify locations as bot, hub, or user.} +\caption{PRIDE Download Traffic Classification Pipeline. End-to-end pipeline comprising two components: \texttt{nf-downloadstats} (green dashed box) processes raw PRIDE download logs into a consolidated Parquet file (159M records, 4.7~GB), and DeepLogBot (blue dashed box) aggregates records to geographic locations, extracts behavioral features per location, and applies a four-phase classification pipeline (heuristic seed selection, blind multi-LLM seed refinement, gradient-boosted fusion meta-learner with Platt calibration, and hub protection with finalization) to classify locations as bot, hub, or user.} \label{fig:bot_overview} \end{figure} @@ -114,7 +114,7 @@ \subsection{Global PRIDE Usage Patterns} \label{fig:temporal} \end{figure} -The geographic reach of PRIDE data reuse is truly global. After separating bot and hub traffic, user downloads span over 200 countries with broad geographic diversity. To characterize the relationship between user base size and download intensity, we plotted user-only downloads against unique users for the top 50 countries (\textbf{Figure~\ref{fig:bubble}}A). Download patterns vary considerably: some countries show broad user bases with moderate per-user activity, suggesting predominantly individual researchers, while others exhibit high per-user averages reflecting concentrated institutional access. +The geographic reach of PRIDE data reuse is truly global. 
After separating bot and hub traffic, user downloads span 214 countries with broad geographic diversity. To characterize the relationship between user base size and download intensity, we plotted user-only downloads against unique users for the top 50 countries (\textbf{Figure~\ref{fig:bubble}}A). Download patterns vary considerably: some countries show broad user bases with moderate per-user activity, suggesting predominantly individual researchers, while others exhibit high per-user averages reflecting concentrated institutional access. Although European countries account for a major share of user downloads, yearly trends reveal shifting dynamics (\textbf{Figure~\ref{fig:bubble}}B). Notably, PRIDE usage is growing in some low- and middle-income countries (LMIC, as defined by the Wellcome Trust based on the OECD DAC list \citep{WellcomeLMIC}; \textbf{Figure~\ref{fig:bubble}}D), suggesting that PRIDE is increasingly serving as a resource for researchers in developing nations, supporting broader global participation in proteomics data reuse. @@ -132,16 +132,16 @@ \subsection{Download Concentration} \begin{figure}[H] \centering \includegraphics[width=\textwidth]{figures/figure7_dataset_reuse.pdf} -\caption{Dataset download concentration and consistency (user downloads only, after bot and hub removal). (A) Rank-frequency distribution on log-log scale; the dashed red line marks the top 1\% boundary. (B) Top 20 most downloaded datasets with the number of accessing countries. (C) Download consistency heatmap for the top 25 datasets (2021--2025); color intensity represents download count on a log$_{10}$ scale; most top datasets show sustained volumes across 4--5 years, indicating their role as community reference and benchmark datasets.} +\caption{Dataset download concentration and consistency (user downloads only, after bot and hub removal). (A) Rank-frequency distribution on log-log scale; the dashed red line marks the top 1\% boundary. 
(B) Top 20 most downloaded datasets with the number of accessing countries. (C) Download consistency heatmap for the top 25 datasets (2021--2025); color intensity represents download count on a log$_{10}$ scale; most top datasets show sustained volumes across 4--5 years, indicating their role as community reference and benchmark datasets.} \label{fig:concentration} \end{figure} -Download activity -- particularly sustained, multi-year download patterns -- provides a complementary measure of genuine community adoption that is not fully captured by publication records alone. By separating bot and hub traffic from individual user downloads, the download statistics presented here more accurately reflect individual researcher engagement with specific datasets. +Download activity, particularly sustained, multi-year download patterns, provides a complementary measure of genuine community adoption that is not fully captured by publication records alone. By separating bot and hub traffic from individual user downloads, the download statistics presented here more accurately reflect individual researcher engagement with specific datasets. \subsection{File Transfer Protocol Usage} \begin{sloppypar} -Prior to 2025, PRIDE downloads relied almost exclusively on HTTP and FTP, with FTP dominating in 2021 (66.7\%) and 2023 (61.2\%), and HTTP leading in 2022 (73.4\%) and 2024 (60.2\%) (\textbf{Figure~\ref{fig:protocols}}A). This shifted in 2025 with the emergence of Aspera (FASP), which accounted for 10.5\% of all non-bot downloads (4.5M downloads) despite appearing only from July onward (\textbf{Figure~\ref{fig:protocols}}B). Aspera usage peaked in September 2025 with 3.2M downloads, driven primarily by institutional hubs in China --- particularly a hub in Chongqing (5.1M total downloads, 70\% via Aspera), Hefei (182K downloads, 93\% via Aspera), and a hub in Kensington, MD near NIH/NCBI (36K downloads, 11\% via Aspera). 
The timing of Aspera adoption coincides with the release of \texttt{pridepy} \citep{Kamatchinathan2025} in March 2025, a Python-based command-line tool that abstracts protocol complexity and enables seamless switching between FTP, Aspera, and Globus transfers with a single command, substantially lowering adoption barriers for high-performance protocols. +Prior to 2025, PRIDE downloads relied almost exclusively on HTTP and FTP, with FTP dominating in 2021 (66.7\%) and 2023 (61.2\%), and HTTP leading in 2022 (73.4\%) and 2024 (60.2\%) (\textbf{Figure~\ref{fig:protocols}}A). This shifted in 2025 with the emergence of Aspera (FASP), which accounted for 10.5\% of all non-bot downloads (4.5M downloads) despite appearing only from July onward (\textbf{Figure~\ref{fig:protocols}}B). Aspera usage peaked in September 2025 with 3.2M downloads, driven primarily by institutional hubs in China, particularly hubs in Chongqing (5.1M total downloads, 70\% via Aspera) and Hefei (182K downloads, 93\% via Aspera), alongside a hub in Kensington, MD near NIH/NCBI (36K downloads, 11\% via Aspera). The timing of Aspera adoption coincides with the release of \texttt{pridepy} \citep{Kamatchinathan2025} in March 2025, a Python-based command-line tool that abstracts protocol complexity and enables seamless switching between FTP, Aspera, and Globus transfers with a single command, substantially lowering adoption barriers for high-performance protocols. \end{sloppypar} \begin{figure}[H] @@ -154,7 +154,7 @@ \subsection{File Transfer Protocol Usage} \subsection{Download Hubs} \label{sec:hubs} -Institutional download hubs are characterized by few users with very high per-user download rates, usage of specialized bulk-transfer protocols (Aspera, Globus, FTP), and sustained multi-year activity (\textbf{Figure~\ref{fig:hubs}}). 
These hubs represent institutions that systematically and continuously download public proteomics data \citep{PerezRiverol2022reanalysis} for reanalysis, mirroring, or aggregation purposes. The geographic spread of hubs --- spanning all six inhabited continents --- demonstrates that institutional data reuse is a global phenomenon. +Institutional download hubs are characterized by few users with very high per-user download rates, usage of specialized bulk-transfer protocols (Aspera, Globus, FTP), and sustained multi-year activity (\textbf{Figure~\ref{fig:hubs}}). These hubs represent institutions that systematically and continuously download public proteomics data \citep{PerezRiverol2022reanalysis} for reanalysis, mirroring, or aggregation purposes. The geographic spread of hubs, spanning all six inhabited continents, demonstrates that institutional data reuse is a global phenomenon. Hub characteristics vary widely (\textbf{Figure~\ref{fig:bubble}}C): some operate with very few users but extremely high per-user download rates, consistent with mirrors or automated reanalysis pipelines, while others involve many users accessing data at moderate intensity, suggesting shared institutional infrastructure. @@ -192,17 +192,17 @@ \section{Discussion} % ====================================================================== \begin{sloppypar} -Here we have performed a detailed study of the PRIDE data download statistics for the last 5 years (2021--2025). A central finding is that 48.2\% of PRIDE download traffic originates from 27,063 automated bot locations. After removing bot traffic, the remaining downloads comprise 249 institutional download hubs (50.0\% of total traffic) and independent user downloads --- together representing the valuable, legitimate use of PRIDE data. Without separating bot traffic, raw download volumes are unreliable as scientific impact indicators \citep{PerezRiverol2019}. 
As AI-driven platforms that perform large-scale automated reanalysis become more prevalent, repositories will need adaptive classification schemes that evolve alongside legitimate automation patterns. +Here we have performed a detailed study of the PRIDE data download statistics for the last 5 years (2021--2025). A central finding is that 48.2\% of PRIDE download traffic originates from 27,063 automated bot locations. After removing bot traffic, the remaining downloads comprise 249 institutional download hubs (50.0\% of total traffic) and independent user downloads, together representing the valuable, legitimate use of PRIDE data. Without separating bot traffic, raw download volumes are unreliable as scientific impact indicators \citep{PerezRiverol2019}. As AI-driven platforms that perform large-scale automated reanalysis become more prevalent, repositories will need adaptive classification schemes that evolve alongside legitimate automation patterns. -PRIDE download volumes have grown substantially over 5 years, confirming accelerating data access across a geographically broad user base (235 countries). Download intensity varies markedly: some countries exhibit broad individual user bases, while others show concentrated institutional access, suggesting that the nature of data downloads -- individual exploration versus systematic reanalysis -- differs between research communities. The 249 download hubs we identified reveal a global infrastructure of institutional data consumers, from single-user mirrors performing full-repository synchronization to multi-user reanalysis centers processing hundreds of datasets. This hub distribution provides an empirical map of where proteomics bioinformatics infrastructure exists and can inform ProteomeXchange decisions about potential PRIDE mirrors placement, edge caching, and regional resource allocation -- for instance, countries with growing user bases but no local hubs may benefit from targeted infrastructure support. 
Protocol analysis reveals that hub traffic is strongly characterized by Aspera, FTP, and Globus usage, while individual users predominantly rely on HTTP. Tools such as \texttt{pridepy} \citep{Kamatchinathan2025} should lower adoption barriers for high-performance protocols as datasets continue to grow in size. +PRIDE download volumes have grown substantially over 5 years, confirming accelerating data access across a geographically broad user base (214 countries). Download intensity varies around the world: some countries exhibit broad individual user bases, while others show concentrated institutional access, suggesting that the nature of data downloads (individual exploration versus systematic reanalysis) differs between research communities. The 249 download hubs we identified reveal a global infrastructure of institutional data consumers, from single-user mirrors performing full-repository synchronization to multi-user reanalysis centers processing hundreds of datasets. This hub distribution provides an empirical map of where proteomics bioinformatics infrastructure exists and can inform ProteomeXchange decisions about the placement of potential PRIDE mirrors, edge caching, and regional resource allocation; for instance, countries with growing user bases but no local hubs may benefit from targeted infrastructure support. Protocol analysis reveals that hub traffic is strongly characterized by Aspera, FTP, and Globus usage, while individual users predominantly rely on HTTP. Tools such as \texttt{pridepy} \citep{Kamatchinathan2025} should lower adoption barriers for high-performance protocols as datasets continue to grow in size. 
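The protocol trends discussed in this section reduce to simple aggregations over the consolidated download records. A minimal sketch, with an assumed `protocol` column and toy rows standing in for the real Parquet file:

```python
import pandas as pd

# Toy stand-in; in practice the records would be loaded from the
# consolidated Parquet file, e.g. pd.read_parquet("downloads.parquet").
records = pd.DataFrame({
    "year":     [2024, 2024, 2025, 2025, 2025, 2025],
    "protocol": ["http", "ftp", "http", "aspera", "aspera", "ftp"],
})

# Percentage share of each protocol within each year.
shares = (
    records.groupby("year")["protocol"]
    .value_counts(normalize=True)  # per-year proportions
    .mul(100)
    .rename("pct")
    .reset_index()
)
```

The same groupby, applied per traffic class (bot, hub, user), yields the hub-versus-user protocol contrast described above.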
The 2025 surge in total download volume is largely driven by a rapid expansion of institutional download hubs, particularly in China: new hubs in Chongqing, Changsha, Wuxi, Nantong, and Lu'an --- each contributing 99--100\% of their traffic in 2025 alone --- account for much of the hub growth. Chinese hubs collectively contributed 22.4~million downloads in 2025, representing 53\% of all hub traffic that year. This expansion reflects growing investment in proteomics data infrastructure and the broader trend of regional data mirroring to serve local research communities, consistent with the growth of platforms such as iProX \citep{Chen2022iProX} and the National Genomics Data Center. The absence of a centralized PRIDE mirror in the region likely contributes to this pattern: without a local mirror, individual institutions in major proteomics cities independently download and cache large portions of PRIDE data, as the slow cross-continental transfer speeds make it more practical to pre-download datasets before they are needed for analysis. The emergence of these hubs highlights the value of establishing formal mirror agreements to coordinate replication, reduce redundant transfers, and improve data access latency for the growing Chinese proteomics community. +User download activity grew every year of the study period, with year-over-year increases of 109\% (2022), 25\% (2023), 6\% (2024), and 62\% (2025), representing genuine expansion of the PRIDE user base. The 2025 surge in total download volume is largely driven by a rapid expansion of institutional download hubs, particularly in China: new hubs in Chongqing, Changsha, Wuxi, Nantong, and Lu'an, each contributing 99--100\% of their traffic in 2025 alone, account for much of the hub growth. Chinese hubs collectively contributed 22.4~million downloads in 2025, representing 53\% of all hub traffic that year. 
This expansion reflects growing investment in proteomics data infrastructure and the broader trend of regional data mirroring to serve local research communities, consistent with the growth of platforms such as iProX \citep{Chen2022iProX} and the National Genomics Data Center. The absence of a centralized PRIDE mirror in the region likely contributes to this pattern: without a local mirror, individual institutions in major proteomics cities independently download and cache large portions of PRIDE data, as the slow cross-continental transfer speeds make it more practical to pre-download datasets before they are needed for analysis. The emergence of these hubs highlights the value of establishing formal mirror agreements to coordinate replication, reduce redundant transfers, and improve data access latency for the growing Chinese proteomics community. -Dataset user downloads (after bot and hub removal) show a concentrated distribution. While community reference datasets such as ProteomeTools (\href{https://www.ebi.ac.uk/pride/archive/projects/PXD004732}{PXD004732}) show sustained multi-year downloads -- likely because its comprehensive synthetic peptide spectral libraries serve as training data for machine learning models, retention time predictors, and spectral library search engines across the field -- the ``long tail'' of rarely downloaded datasets should not be disregarded: these datasets may gain future value through meta-analyses, machine learning applications, or integration into multi-omics studies. Repositories can better serve both ends of this distribution by investing in improved discoverability -- richer metadata, curated tags, and recommendation systems -- alongside prioritized access for high-demand datasets. +Dataset user downloads (after bot and hub removal) show a concentrated distribution. 
While community reference datasets such as ProteomeTools (\href{https://www.ebi.ac.uk/pride/archive/projects/PXD004732}{PXD004732}) show sustained multi-year downloads, likely because their comprehensive synthetic peptide spectral libraries serve as training data for machine learning models, retention time predictors, and spectral library search engines across the field, the ``long tail'' of rarely downloaded datasets should not be disregarded: these datasets may gain future value through meta-analyses, machine learning applications, or integration into multi-omics studies. Repositories can better serve both ends of this distribution by investing in improved discoverability (richer metadata, curated tags, and recommendation systems) alongside prioritized access for high-demand datasets. -Regional differences in file type usage, with LMIC countries showing higher reliance on processed results rather than raw files, suggest that computational capacity and bandwidth constraints shape data download patterns. The dominance of raw file downloads across all regions (\textbf{Figure~\ref{fig:filetype}}A) indicates that researchers currently lack easy access to analysis results within submissions, forcing them to re-download and reprocess raw data even when search engine outputs already exist. To address this, the PRIDE team is developing dedicated infrastructure for discovering, browsing, and downloading result and analysis files independently of the full raw dataset, enabling researchers with limited computational resources to directly access quantification tables, identification lists, and processed spectra (peak list files) without the overhead of re-running search engines. In parallel, the PRIDE team will prioritize SDRF sample metadata annotation \citep{Dai2021} for the most downloaded and community-relevant datasets identified in this study, making these high-impact submissions immediately reusable through standardized experimental design descriptions.
Several complementary efforts support this vision: \texttt{quantms} \citep{Dai2024} generates standardized reanalysis outputs from public datasets, the PTMExchange initiative (\url{https://www.proteomexchange.org/ptmexchange}) provides harmonised results coming from the reanalysis of PTM-enriched datasets, and the PRIDE team is collaborating with developers of widely used search engines -- including DIA-NN \citep{Demichev2020}, MaxQuant \citep{Cox2008}, and MSFragger \citep{Kong2017} -- to define standardized submission guidelines that ensure result files, quantification tables, and metadata are structured for immediate reuse. +Regional differences in file type usage, with LMIC countries showing higher reliance on processed results rather than raw files, suggest that computational capacity and bandwidth constraints shape data download patterns. The dominance of raw file downloads across all regions (\textbf{Figure~\ref{fig:filetype}}A) indicates that researchers currently lack easy access to analysis results within submissions, forcing them to re-download and reprocess raw data even when search engine outputs already exist. To address this, the PRIDE team is developing dedicated infrastructure for discovering, browsing, and downloading result and analysis files independently of the full raw dataset, enabling researchers with limited computational resources to directly access quantification tables, identification lists, and processed spectra (peak list files) without the overhead of re-running search engines. In parallel, the PRIDE team will prioritize SDRF sample metadata annotation \citep{Dai2021} for the most downloaded and community-relevant datasets identified in this study, making these high-impact submissions immediately reusable through standardized experimental design descriptions. 
Several complementary efforts support this vision: \texttt{quantms} \citep{Dai2024} generates standardized reanalysis outputs from public datasets, the PTMExchange initiative (\url{https://www.proteomexchange.org/ptmexchange}) provides harmonized results from the reanalysis of PTM-enriched datasets, and the PRIDE team is collaborating with developers of widely used search engines, including DIA-NN \citep{Demichev2020}, MaxQuant \citep{Cox2008}, and MSFragger \citep{Kong2017}, to define standardized submission guidelines that ensure result files, quantification tables, and metadata are structured for immediate reuse. -In summary, we present the PRIDE database download tracking infrastructure, comprising \texttt{nf-downloadstats} and DeepLogBot, and the first comprehensive analysis of PRIDE data download statistics, processing 159.3 million records spanning 2021--2025. Our classification pipeline separates bot traffic, institutional download hubs, and independent user access with 92.2\% accuracy on a held-out test set. After removing automated traffic, the remaining legitimate downloads across 35,528 datasets and 235 countries reveal a globally distributed user base, shifting protocol preferences, a concentrated download distribution, and diverse patterns of data reuse. The PRIDE team has integrated download statistics into the PRIDE web interface, enabling data submitters to use these metrics in grant reports and publications. Through \texttt{pridepy} \citep{Kamatchinathan2025} and dedicated infrastructure for result-level data access, we aim to lower barriers for researchers --- particularly in LMIC --- to discover and download analysis outputs without reprocessing full raw datasets.\end{sloppypar} +In summary, we present the PRIDE database download tracking infrastructure, comprising \texttt{nf-downloadstats} and DeepLogBot, and the first comprehensive analysis of PRIDE data download statistics, processing 159.3 million records spanning 2021--2025.
Our classification pipeline separates bot traffic, institutional download hubs, and independent user access with 92.2\% accuracy on a held-out test set. After removing automated traffic, the remaining legitimate downloads across 35,528 datasets and 214 countries reveal a globally distributed user base, shifting protocol preferences, a concentrated download distribution, and diverse patterns of data reuse. The PRIDE team has integrated download statistics into the PRIDE web interface, enabling data submitters to use these metrics in grant reports and publications. Through \texttt{pridepy} \citep{Kamatchinathan2025} and dedicated infrastructure for result-level data access, we aim to lower barriers for researchers, particularly in LMIC, to discover and download analysis outputs without reprocessing full raw datasets.\end{sloppypar} \section*{Data and Code Availability} diff --git a/paper/supplementary.pdf b/paper/supplementary.pdf index 096a372..c81478d 100644 Binary files a/paper/supplementary.pdf and b/paper/supplementary.pdf differ diff --git a/paper/supplementary.tex b/paper/supplementary.tex index b768683..302f8e8 100644 --- a/paper/supplementary.tex +++ b/paper/supplementary.tex @@ -91,7 +91,7 @@ \subsection{System Overview} ] \node[block] (input) {Parquet Log Files}; \node[block, below=of input] (features) {Feature Extraction (36 behavioral features per location)}; -\node[block, below=of features, fill=purple!10] (seeds) {Phase 1: Seed Selection (organic 3-tier / bot 6-signal / hub structural)}; +\node[block, below=of features, fill=purple!10] (seeds) {Phase 1: Seed Selection (user 3-tier / bot 6-signal / hub structural)}; \node[block, below=of seeds, fill=yellow!15] (llm) {Phase 2: LLM Seed Refinement (blind multi-LLM consensus corrections)}; \node[block, below=of llm, fill=red!10] (fusion) {Phase 3: Fusion Meta-Learner (GradientBoosting + Platt calibration)}; \node[block, below=of fusion, fill=blue!10] (hub) {Phase 4: Hub Protection \& Finalization 
(structural override, insufficient evidence filter, boolean derivation)}; @@ -246,11 +246,11 @@ \section{S4. Algorithm Details} \subsection{Rule-Based Method} -The rule-based method applies threshold patterns from a YAML configuration file. Classification proceeds in two stages to assign each location to one of three categories---\textbf{bot}, \textbf{hub} (legitimate automation), or \textbf{organic}: +The rule-based method applies threshold patterns from a YAML configuration file. Classification proceeds in two stages to assign each location to one of three categories: \textbf{bot}, \textbf{hub} (legitimate automation), or \textbf{user}: \begin{sloppypar} \begin{enumerate} - \item \textbf{Stage~1 (Organic vs.\ Automated):} Organic patterns match on \texttt{working\_hours\_ratio}~$\geq 0.4$ and \texttt{regularity\_score}~$\leq 0.6$, or \texttt{interval\_cv}~$\geq 0.7$, or \texttt{unique\_users}~$< 50$ with moderate activity. Automated patterns match on \texttt{regularity\_score}~$\geq 0.7$, or \texttt{night\_activity\_ratio}~$\geq 0.35$ with low working hours, or \texttt{user\_coordination\_score}~$\geq 0.6$ with many users. + \item \textbf{Stage~1 (User vs.\ Automated):} User patterns match on \texttt{working\_hours\_ratio}~$\geq 0.4$ and \texttt{regularity\_score}~$\leq 0.6$, or \texttt{interval\_cv}~$\geq 0.7$, or \texttt{unique\_users}~$< 50$ with moderate activity. Automated patterns match on \texttt{regularity\_score}~$\geq 0.7$, or \texttt{night\_activity\_ratio}~$\geq 0.35$ with low working hours, or \texttt{user\_coordination\_score}~$\geq 0.6$ with many users. 
\item \textbf{Stage~2 (Bot vs.\ Hub):} Among automated locations, bot patterns include many-users-low-downloads (\texttt{unique\_users}~$\geq 1000$, \texttt{downloads\_per\_user}~$\leq 50$), coordinated activity (\texttt{coordination\_score}~$\geq 0.7$, \texttt{authenticity\_score}~$\leq 0.4$), and suspicious timing (\texttt{night\_activity\_ratio}~$\geq 0.5$, \texttt{working\_hours\_ratio}~$\leq 0.2$). Hub patterns match on mirror-like behavior (\texttt{downloads\_per\_user}~$\geq 500$, \texttt{unique\_users}~$\leq 100$) or CI/CD patterns (\texttt{users}~$\leq 10$, \texttt{regularity}~$\geq 0.7$). @@ -264,15 +264,15 @@ \subsection{Deep Architecture Method} \subsubsection{Phase 1: Seed Selection} -Seed selection identifies high-confidence training examples for each category using structural and behavioral heuristics. The seeds are \emph{not} the final classification --- they serve only as labeled training data for the meta-learner. Each seed receives a confidence weight (0--1) that influences its importance during gradient-boosted training. +Seed selection identifies high-confidence training examples for each category using structural and behavioral heuristics. The seeds are \emph{not} the final classification; they serve only as labeled training data for the meta-learner. Each seed receives a confidence weight (0--1) that influences its importance during gradient-boosted training. \paragraph{Minimum volume filter.} All seeds require a minimum of 20 total downloads (\texttt{MIN\_SEED\_DOWNLOADS}~$= 20$). Below this threshold, behavioral features (e.g., hourly entropy, working hours ratio) become unreliable because they are computed from too few events. This filter is critical: without it, the training set would include noisy locations where a handful of downloads happened to fall at night or during working hours purely by chance, degrading meta-learner performance. 
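The two-stage rule-based classification described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the threshold values come from the text, but the feature-dict keys, the cutoffs glossed in the text as ``moderate activity'', ``low working hours'', and ``many users'', and the fallback for automated locations matching neither bot nor hub pattern are assumptions.

```python
# Minimal sketch of the two-stage rule-based classifier; thresholds follow the
# text, while key names and unstated cutoffs are assumptions. Stage 1 user
# patterns are treated as the complement of the automated patterns, which is a
# simplification of the YAML-driven matching described above.

def classify_location(f: dict) -> str:
    """Assign one location's behavioral features to 'user', 'bot', or 'hub'."""
    # Stage 1: user vs. automated
    automated = (
        f["regularity_score"] >= 0.7
        or (f["night_activity_ratio"] >= 0.35 and f["working_hours_ratio"] < 0.2)
        or (f["user_coordination_score"] >= 0.6 and f["unique_users"] >= 1000)
    )
    if not automated:
        return "user"
    # Stage 2: bot vs. hub, applied only to automated locations
    is_hub = (
        (f["downloads_per_user"] >= 500 and f["unique_users"] <= 100)  # mirror-like
        or (f["unique_users"] <= 10 and f["regularity_score"] >= 0.7)  # CI/CD-like
    )
    if is_hub:
        return "hub"
    is_bot = (
        (f["unique_users"] >= 1000 and f["downloads_per_user"] <= 50)
        or (f["user_coordination_score"] >= 0.7 and f["authenticity_score"] <= 0.4)
        or (f["night_activity_ratio"] >= 0.5 and f["working_hours_ratio"] <= 0.2)
    )
    return "bot" if is_bot else "user"  # fallback: assumed, not specified in text
```

For example, a location with 20 users each downloading 800 files under highly regular timing matches the mirror-like hub pattern, while 5,000 users at 10 downloads/user with nocturnal, coordinated activity matches the bot patterns.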
-\paragraph{Organic seeds (3-tier system).} +\paragraph{User seeds (3-tier system).} \begin{itemize} - \item \textbf{Tier~A --- Individual researchers} (confidence 1.0): $\leq$10 users, $\leq$5 downloads/user, 20--200 total downloads, working hours ratio $>$0.3, night activity $<$0.5, $\geq$2 years span. These are the cleanest organic examples: small-scale, sustained, daytime activity. - \item \textbf{Tier~B --- Active researchers} (confidence 0.7): $\leq$50 users, $\leq$10 downloads/user, 20--1000 total downloads, working hours ratio $>$0.25, hourly entropy $>$1.5, burst pattern score $<$0.5. Broader than Tier~A, these capture moderately active research locations with irregular (non-automated) temporal patterns. - \item \textbf{Tier~C --- Research groups} (confidence 0.4): $\leq$200 users, $\leq$20 downloads/user, $\geq$20 total downloads, user coordination score $<$0.3, protocol legitimacy $>$0.3. Additional safeguards exclude locations active only in the latest year with $>$50 users (likely distributed bot farms appearing as ``new'' research groups) and single-year locations with $>$30 users (insufficient history to confirm organic behavior). + \item \textbf{Tier~A -- Individual researchers} (confidence 1.0): $\leq$10 users, $\leq$5 downloads/user, 20--200 total downloads, working hours ratio $>$0.3, night activity $<$0.5, $\geq$2 years span. These are the cleanest user examples: small-scale, sustained, daytime activity. + \item \textbf{Tier~B -- Active researchers} (confidence 0.7): $\leq$50 users, $\leq$10 downloads/user, 20--1000 total downloads, working hours ratio $>$0.25, hourly entropy $>$1.5, burst pattern score $<$0.5. Broader than Tier~A, these capture moderately active research locations with irregular (non-automated) temporal patterns. + \item \textbf{Tier~C -- Research groups} (confidence 0.4): $\leq$200 users, $\leq$20 downloads/user, $\geq$20 total downloads, user coordination score $<$0.3, protocol legitimacy $>$0.3.
Additional safeguards exclude locations active only in the latest year with $>$50 users (likely distributed bot farms appearing as ``new'' research groups) and single-year locations with $>$30 users (insufficient history to confirm user behavior). \end{itemize} \noindent The decreasing confidence weights reflect decreasing certainty: Tier~A seeds are unambiguous individual researchers, while Tier~C seeds may include small institutional users that share some characteristics with legitimate automation. @@ -284,12 +284,12 @@ \subsubsection{Phase 1: Seed Selection} \item \textbf{Nocturnal}: Night activity ratio $>$0.8, working hours ratio $<$0.1. Extreme nocturnal activity with essentially no daytime presence, inconsistent with any plausible human usage pattern regardless of time zone. \item \textbf{Coordinated}: $>$10,000 users, $<$20 downloads/user. Massive-scale coordinated access characteristic of large botnets or web crawlers. \item \textbf{Scraper}: $>$15,000 unique projects accessed. Locations that systematically crawl the entire PRIDE catalog, accessing far more datasets than any researcher or institution would. - \item \textbf{Year-over-year explosion}: Spike ratio $>$50$\times$ compared to previous years, $>$200 users, $>$95\% of activity in the latest year. Captures locations that experience sudden, extreme surges in activity, typically indicating the emergence of a new automated process rather than organic growth. + \item \textbf{Year-over-year explosion}: Spike ratio $>$50$\times$ compared to previous years, $>$200 users, $>$95\% of activity in the latest year. Captures locations that experience sudden, extreme surges in activity, typically indicating the emergence of a new automated process rather than gradual growth. \end{itemize} \noindent Bot seeds are assigned confidence 0.7 (base), boosted to 0.9 for $>$10,000 users, 0.95 for bot-farm + nocturnal overlap, and reduced to 0.6 for distributed bots (where individual signals are weaker). 
-\paragraph{Hub-like exclusion from bot seeds.} Locations with downloads/user $>$200 sustained over $\geq$3 years are excluded from bot seeds regardless of other signals. This prevents institutional mirrors from contaminating the bot training set --- a mirror may have thousands of users and uniform temporal patterns, but the key distinguishing feature is high downloads per user (institutional users download hundreds of files each, while bot ``users'' download 3--15). This exclusion is critical for preventing the meta-learner from learning to associate high download volume with bot behavior. +\paragraph{Hub-like exclusion from bot seeds.} Locations with downloads/user $>$200 sustained over $\geq$3 years are excluded from bot seeds regardless of other signals. This prevents institutional mirrors from contaminating the bot training set; a mirror may have thousands of users and uniform temporal patterns, but the key distinguishing feature is high downloads per user (institutional users download hundreds of files each, while bot ``users'' download 3--15). This exclusion is critical for preventing the meta-learner from learning to associate high download volume with bot behavior. \paragraph{Hub seeds (2 structural patterns).} \begin{itemize} @@ -299,7 +299,7 @@ \subsubsection{Phase 1: Seed Selection} \noindent Hub seeds exclude nocturnal-dominant locations (working hours ratio $<$0.1 and night activity $>$0.7) to avoid capturing bot locations that happen to have high downloads/user. Confidence is set to 0.8 (base), boosted to 0.95 for protocol-verified hubs (Aspera ratio $>$0.3 or Globus ratio $>$0.1) and 0.85 for long-running hubs ($\geq$4 years). -\paragraph{Seed overlap resolution.} Seed sets are resolved with strict priority ordering: \textbf{hub $>$ bot $>$ organic}. If a location qualifies as both a hub seed and a bot seed, it is retained only as a hub seed and removed from the bot set. Similarly, bot--organic overlaps are resolved in favor of bot. 
This priority reflects the asymmetric cost of misclassification: incorrectly training the meta-learner to associate hub patterns with bot behavior is far more damaging than the reverse, because hub misclassification removes legitimate scientific infrastructure from downstream analyses. +\paragraph{Seed overlap resolution.} Seed sets are resolved with strict priority ordering: \textbf{hub $>$ bot $>$ user}. If a location qualifies as both a hub seed and a bot seed, it is retained only as a hub seed and removed from the bot set. Similarly, bot--user overlaps are resolved in favor of bot. This priority reflects the asymmetric cost of misclassification: incorrectly training the meta-learner to associate hub patterns with bot behavior is far more damaging than the reverse, because hub misclassification removes legitimate scientific infrastructure from downstream analyses. \subsubsection{Phase 2: LLM Seed Refinement} @@ -311,9 +311,9 @@ \subsubsection{Phase 3: Fusion Meta-Learner} The model takes 36 behavioral features as input (Table~\ref{tab:all_features}), organized into six groups: download intensity (unique users, downloads/user, total downloads), temporal patterns (working hours ratio, night activity, hourly entropy, burst pattern score, spike ratio), protocol usage (Aspera ratio, Globus ratio, protocol legitimacy score), user distribution (user entropy, Gini coefficient, single-download user ratio, power user ratio), growth dynamics (years span, year-over-year CV, fraction latest year, momentum score, recent activity ratio), and project diversity (unique projects, top project concentration, top-3 project concentration, project HHI). -Probabilities are calibrated via Platt scaling (\texttt{CalibratedClassifierCV} with sigmoid method, \texttt{cv=5}). The calibrated probabilities enable meaningful confidence thresholds: predictions with $<$0.5 maximum probability are flagged as \texttt{needs\_review}. 
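The strict-priority seed overlap resolution described earlier (hub $>$ bot $>$ user) reduces to two set differences. A minimal sketch, assuming seeds are represented as sets of location identifiers (the function name is illustrative, not the pipeline's actual API):

```python
# Sketch of strict-priority seed overlap resolution: hub > bot > user.
# Inputs are sets of location identifiers; outputs are pairwise disjoint.

def resolve_seed_overlaps(hub_seeds: set, bot_seeds: set, user_seeds: set):
    """Return disjoint (hub, bot, user) seed sets under hub > bot > user priority."""
    bot = bot_seeds - hub_seeds           # hub wins hub-bot overlaps
    user = user_seeds - hub_seeds - bot   # bot wins bot-user overlaps; hub wins all
    return hub_seeds, bot, user
```

For instance, `resolve_seed_overlaps({"a", "b"}, {"b", "c"}, {"a", "c", "d"})` keeps `"b"` as a hub seed and `"c"` as a bot seed, leaving only `"d"` in the user set.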
The key discriminating features --- identified through feature importance analysis --- are downloads per user, unique users, unique projects, total downloads, and user Gini coefficient. +Probabilities are calibrated via Platt scaling (\texttt{CalibratedClassifierCV} with sigmoid method, \texttt{cv=5}). The calibrated probabilities enable meaningful confidence thresholds: predictions with $<$0.5 maximum probability are flagged as \texttt{needs\_review}. The key discriminating features, identified through feature importance analysis, are downloads per user, unique users, unique projects, total downloads, and user Gini coefficient. -\paragraph{Training data composition.} Across the full PRIDE dataset, seed selection typically yields $\sim$25,000--35,000 organic seeds, $\sim$300--700 bot seeds, and $\sim$500--900 hub seeds. The class imbalance (organic seeds outnumber bot seeds by $\sim$50:1) is partially mitigated by confidence weighting and the gradient boosting algorithm's sequential error correction, but represents a fundamental challenge: bot locations are inherently rare, and the meta-learner has fewer examples from which to learn bot patterns. This imbalance means the model may underdetect bots in novel configurations not represented in the seed set. +\paragraph{Training data composition.} Across the full PRIDE dataset, seed selection typically yields $\sim$25,000--35,000 user seeds, $\sim$300--700 bot seeds, and $\sim$500--900 hub seeds. The class imbalance (user seeds outnumber bot seeds by $\sim$50:1) is partially mitigated by confidence weighting and the gradient boosting algorithm's sequential error correction, but represents a fundamental challenge: bot locations are inherently rare, and the meta-learner has fewer examples from which to learn bot patterns. This imbalance means the model may underdetect bots in novel configurations not represented in the seed set. 
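The Phase 3 setup above (gradient boosting trained on confidence-weighted seed labels, with probabilities calibrated via sigmoid \texttt{CalibratedClassifierCV} at \texttt{cv=5}) can be sketched with scikit-learn as follows. Synthetic random data stands in for the 36 behavioral features; this is an illustration of the described configuration, not the pipeline's training code.

```python
# Sketch of the fusion meta-learner: gradient boosting with per-seed
# confidence weights, Platt-calibrated probabilities, and a 0.5 maximum-
# probability threshold for flagging needs_review predictions.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 36))        # stand-in for 36 behavioral features
y = rng.integers(0, 3, size=300)      # seed labels: 0=user, 1=bot, 2=hub
w = rng.uniform(0.4, 1.0, size=300)   # seed confidence weights in (0, 1]

base = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf = CalibratedClassifierCV(base, method="sigmoid", cv=5)  # Platt scaling
clf.fit(X, y, sample_weight=w)        # confidence weights enter training

proba = clf.predict_proba(X)          # calibrated class probabilities
needs_review = proba.max(axis=1) < 0.5  # low-confidence predictions flagged
```

The calibration step is what makes the 0.5 threshold meaningful: uncalibrated gradient-boosting scores are not directly interpretable as probabilities.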
\subsubsection{Phase 4: Hub Protection \& Finalization} @@ -338,7 +338,7 @@ \subsubsection{Phase 4: Hub Protection \& Finalization} \begin{itemize} \item \textbf{Insufficient evidence}: Locations with $<$3 total downloads are marked as insufficient evidence. At 1--2 downloads, behavioral features are essentially random: a single download at 2\,AM does not indicate nocturnal bot behavior, and a single download of one file does not indicate scraping. The threshold of 3 (rather than 5 or 10) was chosen empirically: locations with 3--4 downloads still show enough signal for the meta-learner to produce useful (if low-confidence) predictions, while raising the threshold to 5 would reclassify $\sim$7,000 additional locations as insufficient evidence. - \item \textbf{User sub-classification}: Organic locations with $\leq$10 users and $\leq$5 downloads/user are labeled as \texttt{independent\_user} (individual researchers), while the remainder are \texttt{normal} (research groups or active labs). + \item \textbf{User sub-classification}: User locations with $\leq$10 users and $\leq$5 downloads/user are labeled as \texttt{independent\_user} (individual researchers), while the remainder are \texttt{normal} (research groups or active labs). \item \textbf{Boolean derivation}: Final boolean columns (\texttt{is\_bot}, \texttt{is\_hub}, \texttt{is\_organic}) are derived from the hierarchical \texttt{behavior\_type} and \texttt{automation\_category} fields. \end{itemize} @@ -346,7 +346,7 @@ \subsubsection{Known Edge Cases and Failure Modes} The pipeline has several known limitations that arise from the fundamental challenges of semi-supervised traffic classification: -\paragraph{1. Bot--hub boundary ambiguity.} The hardest classification boundary is between \textbf{large institutional hubs and sophisticated bot networks}. Both can exhibit: many users, high total download volume, multi-year sustained activity, and access to thousands of datasets. 
The key discriminating feature is downloads per user --- hubs have $>$200 DL/user while bot farms have 3--15 DL/user --- but locations in the 15--200 DL/user range occupy an ambiguous zone where the meta-learner's confidence is low. These locations are flagged with \texttt{needs\_review = True} but their classification may be unreliable. Examples include medium-sized bioinformatics groups with automated pipelines that download moderately, or sophisticated bots that throttle their per-user download rate to mimic human behavior. +\paragraph{1. Bot--hub boundary ambiguity.} The hardest classification boundary is between \textbf{large institutional hubs and sophisticated bot networks}. Both can exhibit: many users, high total download volume, multi-year sustained activity, and access to thousands of datasets. The key discriminating feature is downloads per user: hubs have $>$200 DL/user while bot farms have 3--15 DL/user. However, locations in the 15--200 DL/user range occupy an ambiguous zone where the meta-learner's confidence is low. These locations are flagged with \texttt{needs\_review = True} but their classification may be unreliable. Examples include medium-sized bioinformatics groups with automated pipelines that download moderately, or sophisticated bots that throttle their per-user download rate to mimic human behavior. \paragraph{2. Geographic proxy limitations.} The classification operates at the \textbf{geographic location level}, meaning all users from the same city/region are grouped together. This creates two failure modes: \begin{itemize} @@ -354,24 +354,24 @@ \subsubsection{Known Edge Cases and Failure Modes} \item \textbf{VPN and proxy aggregation}: Users behind institutional VPNs or cloud-based proxies appear to originate from a single location. 
A university VPN concentrating hundreds of researchers into one geographic point can produce a profile that resembles a bot farm (many users, low DL/user), while a cloud-hosted bot using a residential proxy appears as a single legitimate user. \end{itemize} -\paragraph{3. Temporal distribution of seeds.} Organic Tier~A seeds require $\geq$2 years of activity, which \textbf{systematically excludes new locations} from the highest-confidence organic training set. Locations that first appeared in the final year of the study period can only qualify as Tier~B or Tier~C seeds (lower confidence). This means the meta-learner has weaker training signal for new locations, and newly active legitimate users may receive lower confidence scores. Conversely, bot seeds emphasizing \texttt{fraction\_latest\_year $> 0.8$} for distributed bots may miss bot campaigns that have been running for multiple years. +\paragraph{3. Temporal distribution of seeds.} User Tier~A seeds require $\geq$2 years of activity, which \textbf{systematically excludes new locations} from the highest-confidence user training set. Locations that first appeared in the final year of the study period can only qualify as Tier~B or Tier~C seeds (lower confidence). This means the meta-learner has weaker training signal for new locations, and newly active legitimate users may receive lower confidence scores. Conversely, bot seeds emphasizing \texttt{fraction\_latest\_year $> 0.8$} for distributed bots may miss bot campaigns that have been running for multiple years. -\paragraph{4. Seed contamination.} The pipeline's accuracy depends on the purity of seed sets. If the heuristic thresholds are poorly calibrated, contaminated seeds will teach the meta-learner incorrect associations. Several safeguards mitigate this risk --- hub-like exclusion from bot seeds, multi-year requirements for hub seeds, behavioral exclusions --- but edge cases remain: +\paragraph{4. 
Seed contamination.} The pipeline's accuracy depends on the purity of seed sets. If the heuristic thresholds are poorly calibrated, contaminated seeds will teach the meta-learner incorrect associations. Several safeguards mitigate this risk (hub-like exclusion from bot seeds, multi-year requirements for hub seeds, behavioral exclusions), but edge cases remain: \begin{itemize} \item A botnet that uses Aspera or Globus (extremely rare but possible) would be excluded from bot seeds and potentially protected as a hub. - \item An automated reanalysis pipeline at a small institution (5 users, 100 DL/user) that only started in the current year would fail the multi-year requirement for hub seeds and might be classified as a research group (organic Tier~C). + \item An automated reanalysis pipeline at a small institution (5 users, 100 DL/user) that only started in the current year would fail the multi-year requirement for hub seeds and might be classified as a research group (user Tier~C). \item Large volunteer-based projects (e.g., citizen science) with thousands of individual users each making few downloads would match bot-farm seed criteria despite being legitimate. \end{itemize} \paragraph{5. Protocol signal erosion.} The pipeline uses Aspera and Globus usage as strong hub indicators. As tools like \texttt{pridepy} lower adoption barriers for high-performance protocols, individual users may increasingly use Aspera and Globus for routine downloads. This will gradually erode the discriminative power of protocol-based features. Future versions of the pipeline will need to adapt as protocol adoption patterns evolve. -\paragraph{6. Insufficient evidence threshold.} The threshold of $<$3 downloads for insufficient evidence is conservative. Locations with 3--10 downloads have technically computable features, but the meta-learner's predictions for these locations are inherently noisy. 
A location with 3 downloads all at night is flagged as bot despite insufficient statistical evidence, while a location with 3 daytime downloads is classified as organic. In practice, these low-activity locations contribute negligibly to total download volume ($<$0.02\% of all downloads), so misclassification has minimal impact on aggregate statistics but may affect per-location accuracy assessments. +\paragraph{6. Insufficient evidence threshold.} The threshold of $<$3 downloads for insufficient evidence is conservative. Locations with 3--10 downloads have technically computable features, but the meta-learner's predictions for these locations are inherently noisy. A location with 3 downloads all at night is flagged as bot despite insufficient statistical evidence, while a location with 3 daytime downloads is classified as user. In practice, these low-activity locations contribute negligibly to total download volume ($<$0.02\% of all downloads), so misclassification has minimal impact on aggregate statistics but may affect per-location accuracy assessments. \paragraph{7. Concept drift.} The pipeline is trained on behavioral patterns observed during 2021--2025. As bot technology evolves (e.g., AI-generated browsing patterns, randomized request timing) and legitimate automation patterns shift (e.g., more institutions adopting automated reanalysis), the seed heuristics may become less effective. The pipeline does not currently include temporal model updating or drift detection mechanisms. Re-running the full pipeline periodically on updated data partially addresses this, but gradual drift in bot sophistication may go undetected until classification accuracy degrades noticeably. -\paragraph{8. Class imbalance in training.} Organic seeds outnumber bot seeds by $\sim$50:1 and hub seeds by $\sim$30:1. 
While confidence weighting and gradient boosting partially compensate, the meta-learner may still exhibit \textbf{recall bias toward organic}: locations with ambiguous features are more likely to be classified as organic simply because organic patterns dominate the training set. This is acceptable for our primary use case (separating bot traffic from user traffic to compute accurate download statistics), but means the pipeline likely \emph{undercounts} bots rather than overcounts them. +\paragraph{8. Class imbalance in training.} User seeds outnumber bot seeds by $\sim$50:1 and hub seeds by $\sim$30:1. While confidence weighting and gradient boosting partially compensate, the meta-learner may still exhibit \textbf{recall bias toward user}: locations with ambiguous features are more likely to be classified as user simply because user patterns dominate the training set. This is acceptable for our primary use case (separating bot traffic from user traffic to compute accurate download statistics), but means the pipeline likely \emph{undercounts} bots rather than overcounts them. -\paragraph{9. Single-year bots with human-like patterns.} The most challenging false negatives are \textbf{bots that mimic human behavior}: they operate during working hours, download at irregular intervals, and access a realistic number of datasets. These bots fail to match any bot seed heuristic (no nocturnal activity, no user-count anomaly, no temporal concentration) and produce organic-like feature profiles. Without additional signals such as browser fingerprinting, session-level behavioral analysis, or known-bot IP lists (none of which are available in PRIDE's anonymized logs), these sophisticated bots are fundamentally undetectable by our approach. +\paragraph{9. 
Single-year bots with human-like patterns.} The most challenging false negatives are \textbf{bots that mimic human behavior}: they operate during working hours, download at irregular intervals, and access a realistic number of datasets. These bots fail to match any bot seed heuristic (no nocturnal activity, no user-count anomaly, no temporal concentration) and produce user-like feature profiles. Without additional signals such as browser fingerprinting, session-level behavioral analysis, or known-bot IP lists (none of which are available in PRIDE's anonymized logs), these sophisticated bots are fundamentally undetectable by our approach. % ====================================================================== @@ -385,7 +385,7 @@ \subsection{S5.1 Heuristic Ground Truth for Benchmark} \begin{table}[H] \centering -\caption{Ground truth label criteria and counts. Subtypes are applied hierarchically (each excludes locations matched by prior subtypes within the same category); per-subtype counts are not listed as they depend on evaluation order --- totals per category are the relevant quantities.} +\caption{Ground truth label criteria and counts. 
Subtypes are applied hierarchically (each excludes locations matched by prior subtypes within the same category); per-subtype counts are not listed as they depend on evaluation order; totals per category are the relevant quantities.} \label{tab:ground_truth} \small \begin{tabular}{llrp{6cm}} @@ -404,7 +404,7 @@ \subsection{S5.1 Heuristic Ground Truth for Benchmark} \cmidrule{2-4} & \textbf{Total} & \textbf{44} & \\ \midrule -\multirow{3}{*}{Organic} & Individual user & -- & $\leq$3 users, $\leq$20 DL/user, work ratio $\geq$0.4 \\ +\multirow{3}{*}{User} & Individual user & -- & $\leq$3 users, $\leq$20 DL/user, work ratio $\geq$0.4 \\ & Research group & -- & 3--30 users, 5--100 DL/user, work ratio $\geq$0.35 \\ & Casual user & -- & $\leq$5 users, $\leq$50 DL/user, night ratio $\leq$0.3 \\ \cmidrule{2-4} @@ -431,7 +431,7 @@ \subsection{S5.2 Blind Multi-LLM Independent Validation} \item \textbf{Qwen3-30B-A3B} (Alibaba): Run locally via Ollama with temperature 0.1 for consistency, processing each location sequentially with no access to other annotations. \end{enumerate} -Both LLMs used the same structured prompt defining three categories (bot, hub, organic) with explicit feature interpretation guidelines. The key discriminator was downloads per user: bots typically show 3--15 DL/user, hubs $>$200 DL/user, and organic falls in between with research-consistent patterns. +Both LLMs used the same structured prompt defining three categories (bot, hub, user) with explicit feature interpretation guidelines. The key discriminator was downloads per user: bots typically show 3--15 DL/user, hubs $>$200 DL/user, and user falls in between with research-consistent patterns. \paragraph{Inter-annotator agreement.} Of the 1,153 locations, 1,029 received valid labels from both LLMs (124 had parse errors). Cohen's kappa was 0.535 (moderate agreement) with 75.5\% raw agreement (777/1,029). 
The confusion matrix between annotators shows: @@ -445,21 +445,21 @@ \subsection{S5.2 Blind Multi-LLM Independent Validation} \toprule & \multicolumn{3}{c}{\textbf{Qwen3}} & \\ \cmidrule{2-4} -\textbf{Claude} & Bot & Hub & Organic & \textbf{Total} \\ +\textbf{Claude} & Bot & Hub & User & \textbf{Total} \\ \midrule Bot & 559 & 4 & 35 & 598 \\ Hub & 27 & 95 & 41 & 163 \\ -Organic & 133 & 12 & 123 & 268 \\ +User & 133 & 12 & 123 & 268 \\ \midrule \textbf{Total} & 719 & 111 & 199 & 1,029 \\ \bottomrule \end{tabular} \end{table} -Qwen3 was more aggressive at labeling locations as bots (719 vs.\ 598), while Claude assigned more organic (268 vs.\ 199) and hub (163 vs.\ 111) labels. The dominant disagreement pattern was Claude=organic $\to$ Qwen3=bot (133 cases), typically involving mid-sized research cities with moderate user counts. +Qwen3 was more aggressive at labeling locations as bots (719 vs.\ 598), while Claude assigned more user (268 vs.\ 199) and hub (163 vs.\ 111) labels. The dominant disagreement pattern was Claude=user $\to$ Qwen3=bot (133 cases), typically involving mid-sized research cities with moderate user counts. \paragraph{Consensus resolution.} -The 777 locations where both LLMs agreed were assigned the consensus label directly. Of the 252 disagreements, 157 were resolved by feature-based tiebreaker rules (e.g., if DL/user $<$ 15, single-year activity, and high hourly entropy, resolve as bot; if total downloads $<$ 50 with $\leq$5 users, resolve as organic). The remaining 95 truly ambiguous locations were excluded. The final consensus set contains 934 locations: 593 bot (63.5\%), 134 hub (14.3\%), 207 organic (22.2\%). +The 777 locations where both LLMs agreed were assigned the consensus label directly. Of the 252 disagreements, 157 were resolved by feature-based tiebreaker rules (e.g., if DL/user $<$ 15, single-year activity, and high hourly entropy, resolve as bot; if total downloads $<$ 50 with $\leq$5 users, resolve as user). 
The remaining 95 truly ambiguous locations were excluded. The final consensus set contains 934 locations: 593 bot (63.5\%), 134 hub (14.3\%), 207 user (22.2\%). \paragraph{Human review.} All consensus labels and tiebreaker rules were reviewed by a domain expert (Y.P.-R.) who validated the annotations based on direct knowledge of proteomics institutions and PRIDE usage patterns. @@ -498,14 +498,14 @@ \subsection{S5.3 LLM-Augmented Seed Retraining} Hub recall & 0.576 & & 0.946 & +0.370 \\ Hub F1 & 0.687 & & 0.933 & +0.246 \\ \midrule -Organic precision & 0.365 & & 0.766 & +0.401 \\ -Organic recall & 0.261 & & 0.952 & +0.691 \\ -Organic F1 & 0.305 & & 0.849 & +0.544 \\ +User precision & 0.365 & & 0.766 & +0.401 \\ +User recall & 0.261 & & 0.952 & +0.691 \\ +User F1 & 0.305 & & 0.849 & +0.544 \\ \bottomrule \end{tabular} \end{table} -The largest improvements came from two systematic corrections: (i)~research-city locations with high working-hours ratios that were heuristically seeded as bots were correctly classified as organic, as the gold-standard labels teach the meta-learner that working-hours activity in proteomics-rich cities indicates legitimate access; and (ii)~residential-area locations with night-dominant patterns that were heuristically labeled organic were correctly classified as bot, as the model learns to recognize distributed bot-farm signatures in non-research areas. +The largest improvements came from two systematic corrections: (i)~research-city locations with high working-hours ratios that were heuristically seeded as bots were correctly classified as user, as the gold-standard labels teach the meta-learner that working-hours activity in proteomics-rich cities indicates legitimate access; and (ii)~residential-area locations with night-dominant patterns that were heuristically labeled user were correctly classified as bot, as the model learns to recognize distributed bot-farm signatures in non-research areas. 
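The feature-based tiebreaker rules used during consensus resolution can be sketched as a small decision function. This is an illustrative sketch only: the field names (`dl_per_user`, `active_years`, `hourly_entropy`, `total_downloads`, `users`) and the exact entropy cutoff are hypothetical stand-ins for DeepLogBot's actual feature schema; the DL/user, total-download, and user-count thresholds are those quoted above.

```python
def resolve_disagreement(loc):
    """Tiebreaker for locations where the two LLM annotators disagree.

    Field names and the entropy cutoff are illustrative, not the actual
    DeepLogBot schema; the numeric thresholds for DL/user, total
    downloads, and user count follow the rules described in the text.
    """
    # Bot-like: few downloads per user, single-year activity, and a
    # near-uniform hourly profile (high entropy over 24 hourly bins).
    if (loc["dl_per_user"] < 15
            and loc["active_years"] == 1
            and loc["hourly_entropy"] > 4.0):
        return "bot"
    # User-like: low total volume from a handful of users.
    if loc["total_downloads"] < 50 and loc["users"] <= 5:
        return "user"
    # Neither rule fires: leave unresolved.
    return None
```

Locations left unresolved by every rule correspond to the 95 truly ambiguous cases excluded from the final consensus set.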
\begin{table}[H] \centering @@ -516,11 +516,11 @@ \subsection{S5.3 LLM-Augmented Seed Retraining} \toprule & \multicolumn{3}{c}{\textbf{Gold-Standard Label}} & \\ \cmidrule{2-4} -\textbf{Classifier Label} & Bot & Hub & Organic & \textbf{Total} \\ +\textbf{Classifier Label} & Bot & Hub & User & \textbf{Total} \\ \midrule Bot & 188 & 0 & 4 & 192 \\ Hub & 7 & 40 & 15 & 62 \\ -Organic & 9 & 1 & 45 & 55 \\ +User & 9 & 1 & 45 & 55 \\ \midrule \textbf{Total} & 204 & 41 & 64 & 309 \\ \bottomrule @@ -572,11 +572,11 @@ \subsection{Statistical Significance} \end{tabular} \end{table} -McNemar's test evaluates whether two classifiers differ in their \emph{overall} error rate (i.e., the number of locations where one method is correct and the other incorrect). The non-significant result ($p = 0.877$) indicates that Rules and Deep make a similar \emph{number} of errors overall on the heuristic benchmark. However, the bootstrap confidence intervals on Macro F1 (Section~\ref{sec:bootstrap}) show non-overlapping intervals (Deep: 0.731--0.818 vs Rules: 0.574--0.691), indicating that Deep achieves significantly better \emph{class-balanced} performance. This apparent discrepancy arises because McNemar's test is insensitive to which classes the errors fall in: Rules achieves high bot precision but very low hub and organic performance, while Deep balances all three classes. The two methods make a similar total number of errors, but Deep distributes its errors more evenly across classes, yielding higher Macro F1. We therefore interpret the bootstrap Macro F1 as the primary comparison metric, as it reflects class-balanced performance relevant to the three-class classification task. +McNemar's test evaluates whether two classifiers differ in their \emph{overall} error rate (i.e., the number of locations where one method is correct and the other incorrect). 
The non-significant result ($p = 0.877$) indicates that Rules and Deep make a similar \emph{number} of errors overall on the heuristic benchmark. However, the bootstrap confidence intervals on Macro F1 (Section~\ref{sec:bootstrap}) show non-overlapping intervals (Deep: 0.731--0.818 vs Rules: 0.574--0.691), indicating that Deep achieves significantly better \emph{class-balanced} performance. This apparent discrepancy arises because McNemar's test is insensitive to which classes the errors fall in: Rules achieves high bot precision but very low hub and user performance, while Deep balances all three classes. The two methods make a similar total number of errors, but Deep distributes its errors more evenly across classes, yielding higher Macro F1. We therefore interpret the bootstrap Macro F1 as the primary comparison metric, as it reflects class-balanced performance relevant to the three-class classification task. \subsection{Inter-Method Agreement} -Figure~\ref{fig:agreement_supp} shows pairwise agreement between methods. Rules and Deep agree on 87.8\% of classifications (Cohen's $\kappa$ = 0.508), indicating moderate agreement. Per-category agreement is highest for organic classification and lowest for hubs. +Figure~\ref{fig:agreement_supp} shows pairwise agreement between methods. Rules and Deep agree on 87.8\% of classifications (Cohen's $\kappa$ = 0.508), indicating moderate agreement. Per-category agreement is highest for user classification and lowest for hubs. \begin{figure}[H] \centering @@ -603,12 +603,12 @@ \subsection{Full-Dataset Comparison: Rules vs Deep} \begin{table}[H] \centering -\caption{Full-dataset classification comparison: Rules vs Deep. Location counts reflect full-dataset classification; ``Organic locs'' includes both classified organic locations and locations with insufficient evidence ($<$3 downloads). 
Accuracy and Macro F1 are measured against all 934 blind multi-LLM consensus labels; of these, 625 were used as training labels for the Deep method --- see Table~\ref{tab:retraining_comparison} for the held-out evaluation on 309 locations.} +\caption{Full-dataset classification comparison: Rules vs Deep. Location counts reflect full-dataset classification; ``User locs'' includes both classified user locations and locations with insufficient evidence ($<$3 downloads). Accuracy and Macro F1 are measured against all 934 blind multi-LLM consensus labels; of these, 625 were used as training labels for the Deep method; see Table~\ref{tab:retraining_comparison} for the held-out evaluation on 309 locations.} \label{tab:rules_vs_deep_full} \small \begin{tabular}{lrrrcc} \toprule -\textbf{Method} & \textbf{Bot locs} & \textbf{Hub locs} & \textbf{Organic locs} & \textbf{Accuracy} & \textbf{Macro F1} \\ +\textbf{Method} & \textbf{Bot locs} & \textbf{Hub locs} & \textbf{User locs} & \textbf{Accuracy} & \textbf{Macro F1} \\ \midrule Rules & 22,981 & 714 & 47,438 & 59.7\% & 0.592 \\ Deep & 27,063 & 249 & 43,821 & 92.2\% & 0.909 \\ @@ -617,7 +617,7 @@ \subsection{Full-Dataset Comparison: Rules vs Deep} \end{table} \paragraph{Key differences.} -The rule-based method labels fewer locations as bots (22,981 vs.\ 27,063) compared to the deep pipeline. This underdetection stems from two systematic issues: (i)~the fixed-threshold approach requires locations to clearly exceed specific criteria (e.g., $>$1,000 users, $\leq$50 DL/user) to be flagged as automated, missing distributed bots with moderate user counts (500--1,000) that fall below these thresholds; and (ii)~locations not matching any automated pattern default to organic, even if their behavioral features (uniform temporal patterns, single-year activity) are inconsistent with independent research access.
The deep pipeline's fusion meta-learner, by contrast, weighs all 36 behavioral features jointly and benefits from gold-standard training labels, capturing bot patterns that no single threshold rule detects. Additionally, the pipeline includes a suspicious hub demotion step that reclassifies hub-labeled locations with concentrated project downloads (top three projects exceeding 45--50\% of traffic with HHI $>$ 0.05) as bots, since legitimate mirrors download broadly across many datasets. +The rule-based method labels fewer locations as bots (22,981 vs.\ 27,063) compared to the deep pipeline. This underdetection stems from two systematic issues: (i)~the fixed-threshold approach requires locations to clearly exceed specific criteria (e.g., $>$1,000 users, $\leq$50 DL/user) to be flagged as automated, missing distributed bots with moderate user counts (500--1,000) that fall below these thresholds; and (ii)~locations not matching any automated pattern default to user, even if their behavioral features (uniform temporal patterns, single-year activity) are inconsistent with independent research access. The deep pipeline's fusion meta-learner, by contrast, weighs all 36 behavioral features jointly and benefits from gold-standard training labels, capturing bot patterns that no single threshold rule detects. Additionally, the pipeline includes a suspicious hub demotion step that reclassifies hub-labeled locations with concentrated project downloads (top three projects exceeding 45--50\% of traffic with HHI $>$ 0.05) as bots, since legitimate mirrors download broadly across many datasets. 
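The suspicious hub demotion step reduces to a simple concentration test over per-project download shares. A minimal sketch, assuming a list of per-project download counts as input (the function name and signature are illustrative; the top-three-share and HHI $>$ 0.05 thresholds are those given in the text):

```python
def should_demote_hub(project_downloads, top3_threshold=0.45, hhi_threshold=0.05):
    """Return True if a hub-labeled location should be demoted to bot
    because its downloads are concentrated in a few projects.

    Illustrative sketch of the demotion rule described in the text:
    legitimate mirrors download broadly, so high top-3 share plus a
    high Herfindahl-Hirschman Index (HHI) signals bot-like focus.
    """
    total = sum(project_downloads)
    if total == 0:
        return False
    shares = sorted((d / total for d in project_downloads), reverse=True)
    hhi = sum(s * s for s in shares)   # Herfindahl-Hirschman Index of project shares
    top3_share = sum(shares[:3])       # fraction of traffic in the three largest projects
    return top3_share > top3_threshold and hhi > hhi_threshold
```

A location downloading one project almost exclusively is demoted, while a mirror spreading traffic across hundreds of datasets keeps its hub label.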
\begin{table}[H] \centering @@ -631,17 +631,17 @@ \subsection{Full-Dataset Comparison: Rules vs Deep} \multirow{3}{*}{Rules} & Bot & 0.810 & 0.589 & 0.682 \\ & Hub & 0.653 & 0.858 & 0.742 \\ - & Organic & 0.287 & 0.454 & 0.352 \\ + & User & 0.287 & 0.454 & 0.352 \\ \midrule \multirow{3}{*}{Deep} & Bot & 1.000 & 0.975 & 0.987 \\ & Hub & 0.687 & 1.000 & 0.815 \\ - & Organic & 1.000 & 0.778 & 0.875 \\ + & User & 1.000 & 0.778 & 0.875 \\ \bottomrule \end{tabular} \end{table} -The rule-based method achieves reasonable bot precision (0.810) but the lowest organic F1 (0.352), confirming that static thresholds cannot separate research-city organic traffic from automated patterns. The deep pipeline improves all per-class F1 scores substantially. Note that the Deep method's high scores on the full 934-label set reflect partial overlap with training seeds (625 locations); the held-out evaluation on 309 locations (Table~\ref{tab:retraining_comparison}) provides an unbiased comparison. +The rule-based method achieves reasonable bot precision (0.810) but the lowest user F1 (0.352), confirming that static thresholds cannot separate research-city user traffic from automated patterns. The deep pipeline improves all per-class F1 scores substantially. Note that the Deep method's high scores on the full 934-label set reflect partial overlap with training seeds (625 locations); the held-out evaluation on 309 locations (Table~\ref{tab:retraining_comparison}) provides an unbiased comparison. 
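The per-class scores for the rule-based method can be recomputed directly from its full-dataset confusion matrix against the 934 consensus labels (349, 115, and 94 on the diagonal). A minimal sketch, with the helper function being ours rather than part of the released pipeline:

```python
def per_class_metrics(cm, labels):
    """Precision, recall, and F1 per class from a confusion matrix
    whose rows are classifier labels and columns are reference labels."""
    scores = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        predicted = sum(cm[i])                 # row total: predicted as `label`
        actual = sum(row[i] for row in cm)     # column total: truly `label`
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores

# Rule-based classifier vs. the 934 blind multi-LLM consensus labels
cm_rules = [
    [349, 15, 67],   # predicted bot
    [15, 115, 46],   # predicted hub
    [229, 4, 94],    # predicted user
]
metrics = per_class_metrics(cm_rules, ["bot", "hub", "user"])
```

Rounding the resulting values reproduces the table above (bot precision 0.810, user F1 0.352) and averaging the three F1 scores recovers the Rules Macro F1 of 0.592.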
\begin{table}[H] \centering @@ -652,18 +652,18 @@ \subsection{Full-Dataset Comparison: Rules vs Deep} \toprule & \multicolumn{3}{c}{\textbf{Consensus Label}} & \\ \cmidrule{2-4} -\textbf{Classifier Label} & Bot & Hub & Organic & \textbf{Total} \\ +\textbf{Classifier Label} & Bot & Hub & User & \textbf{Total} \\ \midrule Bot & 349 & 15 & 67 & 431 \\ Hub & 15 & 115 & 46 & 176 \\ -Organic & 229 & 4 & 94 & 327 \\ +User & 229 & 4 & 94 & 327 \\ \midrule \textbf{Total} & 593 & 134 & 207 & 934 \\ \bottomrule \end{tabular} \end{table} -The rule-based confusion matrix reveals two characteristic error modes: 229 bot locations misclassified as organic (residential proxy locations falling below the automated threshold) and 67 organic locations misclassified as bots (research cities with features that trigger automated patterns). These are precisely the errors that the LLM-augmented seed correction addresses. +The rule-based confusion matrix reveals two characteristic error modes: 229 bot locations misclassified as user (residential proxy locations falling below the automated threshold) and 67 user locations misclassified as bots (research cities with features that trigger automated patterns). These are precisely the errors that the LLM-augmented seed correction addresses. % ====================================================================== \section{S7. Bot Removal Analysis} @@ -672,7 +672,7 @@ \section{S7. Bot Removal Analysis} \subsection{Full-Dataset Classification} -The semi-supervised classification pipeline was applied to the complete dataset of locations aggregated from 159.3 million download records. The classification results are summarized in Figure~\ref{fig:classification_dist}. The asymmetry between location counts and download volumes is striking: bots generate far more traffic per location on average, while institutional hubs --- though comprising a small fraction of locations --- account for a substantial share of download volume. 
+The semi-supervised classification pipeline was applied to the complete dataset of locations aggregated from 159.3 million download records. The classification results are summarized in Figure~\ref{fig:classification_dist}. The asymmetry between location counts and download volumes is striking: bots generate far more traffic per location on average, while institutional hubs, though comprising a small fraction of locations, account for a substantial share of download volume. \begin{figure}[H] \centering @@ -830,11 +830,11 @@ \section{S9. Limitations} \paragraph{2. User identity resolution.} A ``unique user'' is defined as a distinct anonymized IP hash. This proxy is imperfect in two directions: (i)~multiple individuals behind a single NAT gateway or institutional proxy appear as one user, deflating user counts; and (ii)~a single individual using multiple networks (e.g., VPN, home vs.\ office) appears as multiple users, inflating counts. -Consequently, the ``downloads per user'' metric---central to both hub protection and bot-farm detection---may be distorted for locations with heavy proxy usage. +Consequently, the ``downloads per user'' metric, central to both hub protection and bot-farm detection, may be distorted for locations with heavy proxy usage. \paragraph{3. Geographic location granularity.} All users from the same geolocated coordinate are grouped into a single location profile. -In large metropolitan areas, this conflates distinct institutions and user populations (e.g., multiple universities in London or Beijing), potentially masking mixed organic/bot behavior. +In large metropolitan areas, this conflates distinct institutions and user populations (e.g., multiple universities in London or Beijing), potentially masking mixed user/bot behavior. Conversely, users at the same institution on different subnets may map to different coordinates, splitting what is logically one location. \paragraph{4. IP geolocation accuracy.}