This repository contains the source code to replicate the experimental results in our paper.
We use Anaconda 24.3.0 to set up a Python 3.8 virtual environment:
conda create -n private-synthetic-text-generation python=3.8
conda activate private-synthetic-text-generation
We then install PyTorch and the remaining requirements with pip:
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
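To confirm that the pinned PyTorch versions from the command above ended up in the environment, a quick check along the following lines may help. This is a minimal sketch and not part of the repository: the function name `find_version_mismatches` is hypothetical, and it uses only the standard-library `importlib.metadata`. CUDA wheels may report a local version suffix such as `+cu118`, so the sketch compares only the part before the `+`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above.
EXPECTED = {"torch": "2.3.0", "torchvision": "0.18.0", "torchaudio": "2.3.0"}


def find_version_mismatches(expected):
    """Return {package: installed_version_or_None} for every pin that is not met.

    Hypothetical helper for illustration; not part of the repository.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            mismatches[name] = None  # package is not installed at all
            continue
        # Drop a local suffix such as "+cu118" before comparing.
        if installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    problems = find_version_mismatches(EXPECTED)
    for name, installed in problems.items():
        print(f"{name}: expected {EXPECTED[name]}, found {installed or 'nothing'}")
    if not problems:
        print("All pinned packages match.")
```

Run it inside the activated conda environment; an empty result means the three pins are satisfied.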
Please download the respective datasets and place the CSV files in the destination folders listed below. Access to SWMH must be granted by its creators.
Dataset | Source | Manually move to |
---|---|---|
Drugs.com | Already in Repository | not needed |
SPAM | 🔗 | data/spam/ 📂 |
SWMH | 🔗 | data/swmh/ 📂 |
Thumbs-Up | Already available on Hugging Face Datasets | not needed |
WebMD | 🔗 | data/webmd/ 📂 |
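Before preprocessing, it can be useful to verify that the manually moved files are where the table says they should be. The following is a small sketch under stated assumptions: the helper `folders_without_csv` is hypothetical, the folder names are taken from the table above, and the check only looks for any file ending in `.csv` because the expected file names are not given here.

```python
from pathlib import Path

# Destination folders from the table above (datasets that must be moved manually).
DATASET_DIRS = ["data/spam", "data/swmh", "data/webmd"]


def folders_without_csv(root, folders=DATASET_DIRS):
    """Return the folders (relative to root) that hold no .csv file.

    Hypothetical helper for illustration; a missing folder counts as empty.
    """
    missing = []
    for folder in folders:
        path = Path(root) / folder
        if not path.is_dir() or not any(path.glob("*.csv")):
            missing.append(folder)
    return missing


if __name__ == "__main__":
    # Assumes the script is started from the repository root.
    for folder in folders_without_csv("."):
        print(f"No CSV file found in {folder}/")
```

The check does not validate file contents or names; it only catches the case where a download was skipped or placed in the wrong folder.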
Then you can run the three preprocessing scripts:
python preprocessing.py
python create_samples.py
python create_val_sets.py
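If you prefer to launch the three scripts in one go, a thin wrapper such as the one below is one option. It is a sketch, not part of the repository: `build_commands` and `run_all` are hypothetical names, and it assumes that the scripts should run in the order listed above and from the repository root. It stops at the first script that exits with a non-zero status.

```python
import subprocess
import sys

# Script names taken from the commands above; the order is assumed to matter.
SCRIPTS = ["preprocessing.py", "create_samples.py", "create_val_sets.py"]


def build_commands(scripts=SCRIPTS, python=sys.executable):
    """Return one argument list per script, using the current interpreter."""
    return [[python, script] for script in scripts]


def run_all(commands):
    """Run each command in turn; check=True raises CalledProcessError on failure."""
    for command in commands:
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_all(build_commands())` from the repository root, inside the activated environment, is then equivalent to typing the three commands by hand.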
Our code relies on some publicly available text diffusion model checkpoints, which you can download here:
Model | Source | Manually move to |
---|---|---|
GENIE | 🔗 | GENIE/ 📂 |
DiffuSeq | 🔗 | DiffuSeq/ 📂 |
SeqDiffuSeq | t.b.d. | SeqDiffuSeq/ 📂 |
Please use the following citation:
@misc{ochs2024privatesynthetictextgeneration,
title={Private Synthetic Text Generation with Diffusion Models},
author={Sebastian Ochs and Ivan Habernal},
year={2024},
eprint={2410.22971},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.22971},
}
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.