This repository contains the official code for the ICML 2025 paper STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings by Saksham Rastogi, Pratyush Maini, and Danish Pruthi.
To install the necessary packages, first create a conda environment.
conda create -n <env_name> python=3.10
conda activate <env_name>
Then, install the required packages with
pip install -r requirements.txt
We provide the following artifacts for future research and reproducibility:
Below are links to the trained models (continually pretrained on contaminated data) from the paper's experiments, hosted on Hugging Face. They can also be found in this Hugging Face Collection. A minimal loading sketch follows the list below.
- Continual pretraining on a corpus of ~6B tokens.
- Continual pretraining on a corpus of ~4B tokens.
- Continual pretraining on a corpus of ~3B tokens.
- Continual pretraining on a corpus of ~2B tokens.
- Continual pretraining on a corpus of ~1B tokens.
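As a convenience, here is a minimal sketch of loading one of the released checkpoints with the `transformers` library. The model id below is a placeholder, not an actual repository name; substitute the id of the checkpoint you want from the Hugging Face Collection linked above.

```python
# Minimal sketch: load a released checkpoint and run a short generation.
# The model id is hypothetical -- replace it with a real repo name from the collection.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<hf_org>/<stamp-checkpoint>"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The watermark detection test asks whether", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```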
The `benchmarks` folder contains all the test files used to produce the paper's results, including both original and rephrased versions for the following four datasets:
We heavily rely on the following repos in our paper:
If you have any questions, feel free to open an issue on GitHub or contact Saksham ([email protected]).
If you find this repo useful, please consider citing:
@misc{rastogi2025stampcontentprovingdataset,
title={STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings},
author={Saksham Rastogi and Pratyush Maini and Danish Pruthi},
year={2025},
eprint={2504.13416},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.13416},
}