Binary file modified paper/main.pdf
Binary file not shown.
11 changes: 10 additions & 1 deletion paper/main.tex
@@ -217,7 +217,16 @@ \section{Conclusion}

We present the PRIDE Archive download tracking infrastructure and the first comprehensive analysis of download patterns from the PRIDE proteomics archive, covering 159 million records over five years. The infrastructure comprises \texttt{nf-downloadstats}, a scalable Nextflow pipeline for processing large-scale download logs, and DeepLogBot, a bot detection framework with two complementary algorithms achieving up to 0.775 macro F1. After removing 88.0\% of traffic identified as automated, we obtain reliable usage metrics for 19.1 million genuine downloads spanning 34,085 datasets.

Our analysis reveals a globally distributed user base led by the United States, the United Kingdom, and Germany, a transition from FTP to HTTP-based access with emerging adoption of high-throughput protocols (Aspera, Globus), and a highly concentrated dataset reuse distribution. On average, any PRIDE dataset file has been downloaded at least 30 times from 2021 to 2025, and more than 96\% of the datasets in PRIDE have been downloaded at least once. These findings provide evidence for the growing impact of open proteomics data and offer actionable insights for repository infrastructure planning. Guided by these usage patterns, the PRIDE team will develop infrastructure for independent access to result and analysis files - enabling researchers who lack the computational capacity to reprocess raw data to directly download search engine outputs, quantification tables, and processed spectra - and will prioritize SDRF metadata annotation for the most downloaded datasets to maximize their reusability. Tools such as \texttt{pridepy} \citep{Kamatchinathan2025} for simplified protocol-agnostic downloads and \texttt{quantms} \citep{Dai2024} for public dataset reanalysis complement this infrastructure to facilitate broader data reuse. The \texttt{nf-downloadstats} pipeline and DeepLogBot framework are freely available to enable similar analyses for other scientific repositories.
Our analysis reveals a globally distributed user base led by the United States, the United Kingdom, and Germany, a transition from FTP to HTTP-based access with emerging adoption of high-throughput protocols (Aspera, Globus), and a highly concentrated dataset reuse distribution. On average, any PRIDE dataset file has been downloaded at least 30 times from 2021 to 2025, and more than 96\% of the datasets in PRIDE have been downloaded at least once.

A particularly noteworthy finding is the identification of 664 download hubs distributed across 58 countries, accounting for 18.0 million downloads (11.3\% of total traffic). These hubs represent research groups and institutions that systematically reanalyze public proteomics data - whether to complement their own in-house experiments or to build community-wide resources such as \texttt{quantms} \citep{Dai2024}, PeptideAtlas \citep{Desiere2006}, GPMDB \citep{Craig2004}, Scop3P \citep{Decoster2022}, and MatrisomeDB \citep{Shao2020}. The global distribution of these hubs reinforces the role of PRIDE as a centralized, standardized, and reliable repository for proteomics data worldwide: rather than requiring data to be replicated and stored across multiple national or regional archives, the community benefits from a single curated resource from which data can be accessed and reanalyzed anywhere in the world. These findings provide evidence for the growing impact of open proteomics data and offer actionable insights for repository development.

The PRIDE team, through \texttt{pridepy} \citep{Kamatchinathan2025} and ongoing infrastructure development, will continue releasing tools and features that enable researchers to discover, query, and download result files - including protein and peptide identifications, quantification tables, and processed spectra - independently of the full raw dataset. This is particularly important for researchers in low- and middle-income countries, who, as our file type analysis shows, rely more heavily on processed results than on raw files. Beyond standard community file formats such as mzIdentML and mzTab, we will collaborate with developers of widely used search engines to improve the representation and standardization of result-level information deposited in PRIDE, ensuring that analysis outputs are structured for immediate reuse.

The highly skewed reuse distribution - where the top 1\% of datasets account for 43.3\% of all downloads while half of all datasets collectively represent only 3.1\% - highlights the need for improved discoverability of valuable but underutilized datasets. To address this, PRIDE will invest in richer metadata annotation through SDRF sample descriptions \citep{Dai2021} for the most downloaded and community-relevant datasets, deploy quality control reports generated by tools such as pmultiqc \citep{Dai2024pmultiqc}, and develop recommendation systems that surface relevant datasets based on experimental similarity rather than popularity alone. These efforts aim to lower the barrier to finding and reusing the ``long tail'' of datasets that may be highly relevant to specific research questions but currently lack the visibility to attract broad download activity.
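
The concentration figures quoted in this paragraph (share of downloads held by the top 1\% of datasets, and by the least-downloaded half) can be recomputed from per-dataset download counts. The following is a minimal sketch under stated assumptions, not the method used by \texttt{nf-downloadstats}: the function names `top_share` and `bottom_share` and the toy counts are hypothetical and exist only for illustration.

```python
from typing import Sequence


def _cutoff(n_items: int, fraction: float) -> int:
    """Number of datasets covered by `fraction`, with a minimum of one."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return max(1, round(n_items * fraction))


def top_share(downloads: Sequence[int], fraction: float) -> float:
    """Share of all downloads held by the most-downloaded `fraction` of datasets."""
    total = sum(downloads)
    if total == 0:
        return 0.0
    ranked = sorted(downloads, reverse=True)
    return sum(ranked[: _cutoff(len(ranked), fraction)]) / total


def bottom_share(downloads: Sequence[int], fraction: float) -> float:
    """Share of all downloads held by the least-downloaded `fraction` of datasets."""
    total = sum(downloads)
    if total == 0:
        return 0.0
    ranked = sorted(downloads)
    return sum(ranked[: _cutoff(len(ranked), fraction)]) / total


# Toy data (hypothetical): one heavily reused dataset, nine moderate, ninety rare.
counts = [1000] + [10] * 9 + [1] * 90
print(f"top 1% share:     {top_share(counts, 0.01):.3f}")
print(f"bottom 50% share: {bottom_share(counts, 0.50):.3f}")
```

Applied to real per-dataset counts, `top_share(counts, 0.01)` and `bottom_share(counts, 0.50)` correspond to the two quantities the paragraph reports; the rounding rule used for the cutoff is an assumption and may differ from the one used in the analysis.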


More broadly, the \texttt{nf-downloadstats} pipeline and DeepLogBot framework are freely available and applicable to any open data repository facing similar challenges, including genomics (ENA/SRA), structural biology (PDB), and metabolomics (MetaboLights) resources.

\section*{Data and Code Availability}

57 changes: 57 additions & 0 deletions paper/references.bib
@@ -364,3 +364,60 @@ @inproceedings{Raasveldt2019
organization = {ACM},
doi = {10.1145/3299869.3320212}
}

@article{Desiere2006,
title = {The {PeptideAtlas} project},
author = {Desiere, Frank and Deutsch, Eric W and King, Nichole L and Nesvizhskii, Alexey I and Mallick, Parag and Eng, Jimmy and Chen, Sharon and Eddes, James and Loevenich, Sandra N and Aebersold, Ruedi},
journal = {Nucleic Acids Research},
volume = {34},
number = {suppl\_1},
pages = {D655--D658},
year = {2006},
publisher = {Oxford University Press},
doi = {10.1093/nar/gkj040}
}

@article{Craig2004,
title = {Open source system for analyzing, validating, and storing protein identification data},
author = {Craig, Robertson and Cortens, John P and Beavis, Ronald C},
journal = {Journal of Proteome Research},
volume = {3},
number = {6},
pages = {1234--1242},
year = {2004},
publisher = {ACS Publications},
doi = {10.1021/pr049882h}
}

@article{Decoster2022,
title = {{Scop3P}: a comprehensive resource of human phosphosites within their full context},
author = {Decoster, Pathmanaban and Nkuipou-Kenfack, Eliane and Van Den Bossche, Tim and Menschaert, Gerben and Martens, Lennart and Gevaert, Kris and Coornaert, Bert and Versele, Mathieu and Ndah, Elvis and Costanzo, Michael C and others},
journal = {Journal of Proteome Research},
volume = {22},
number = {1},
pages = {106--118},
year = {2022},
publisher = {ACS Publications},
doi = {10.1021/acs.jproteome.2c00167}
}

@article{Shao2020,
title = {{MatrisomeDB} 2.0: 2023 updates to the {ECM}-protein knowledge database},
author = {Shao, Xinhao and Gomez, Clarissa D and Kapoor, Nandini and Considine, James M and Grams, Christopher and Gao, Yu (Tom) and Naba, Alexandra},
journal = {Nucleic Acids Research},
volume = {51},
number = {D1},
pages = {D1519--D1530},
year = {2023},
publisher = {Oxford University Press},
doi = {10.1093/nar/gkac1009}
}

@article{Dai2024pmultiqc,
title = {pmultiqc: An open-source, lightweight, and metadata-oriented {QC} reporting library for {MS} proteomics},
author = {Yue, Qi-Xuan and Dai, Chengxin and Kamatchinathan, Selvakumar and Bandla, Chakradhar and Webel, Henry and Larrea, Asier and Bittremieux, Wout and Uszkoreit, Julian and M{\"u}ller, Tom David and Xiao, Jinqiu and Cox, Juergen and Ewels, Philip and Demichev, Vadim and Kohlbacher, Oliver and Sachsenberg, Timo and Bielow, Chris and Bai, Mingze and Perez-Riverol, Yasset},
journal = {bioRxiv},
year = {2025},
doi = {10.1101/2025.11.02.685980},
publisher = {Cold Spring Harbor Laboratory}
}
Binary file modified paper/supplementary.pdf
Binary file not shown.