Private Synthetic Text Generation with Diffusion Models

Description

This repository contains the source code to replicate the experimental results in our paper.

Installation

We use Anaconda 24.3.0 to create a Python 3.8 virtual environment.

conda create -n private-synthetic-text-generation python=3.8
conda activate private-synthetic-text-generation

We install the remaining requirements with pip.

pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
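
After installation, it can help to confirm that the pinned PyTorch packages were actually picked up. The following is a minimal sketch, not part of the repository: the helper name `check_versions` is hypothetical, and it assumes that wheels from the PyTorch index report a local version suffix such as `+cu118`, which is stripped before comparing.

```python
from importlib import metadata

# Pins copied from the pip command above; extend as needed.
EXPECTED = {"torch": "2.3.0", "torchvision": "0.18.0", "torchaudio": "2.3.0"}


def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pinned in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        # Assumption: a local suffix like "+cu118" may follow the version.
        report[name] = (installed, installed.split("+")[0] == pinned)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(EXPECTED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: installed={installed} pinned={EXPECTED[name]} [{status}]")
```

The check reads package metadata only, so it does not import torch and does not verify that a CUDA device is usable.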

Data

Please download the datasets listed below and place the CSV files in the indicated destination folders. Access to SWMH must be granted by its creators.

Dataset	Source	Manually move to
Drugs.com	Already in Repository	not needed
SPAM	🔗	data/spam/ 📂
SWMH	🔗	data/swmh/ 📂
Thumbs-Up	Already available on Hugging Face Datasets	not needed
WebMD	🔗	data/webmd/ 📂
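
Because three of the datasets have to be moved by hand, a quick check before preprocessing can catch a misplaced file. The sketch below is an illustration only: the function name `dirs_without_csv` is hypothetical, and since the expected CSV file names are not given here, it only tests whether each destination folder contains at least one `*.csv` file.

```python
from pathlib import Path

# Destination folders taken from the table above.
DATA_DIRS = ["data/spam", "data/swmh", "data/webmd"]


def dirs_without_csv(root, dirs=DATA_DIRS):
    """Return the folders under `root` that are missing or hold no *.csv file."""
    missing = []
    for rel in dirs:
        folder = Path(root) / rel
        if not folder.is_dir() or not any(folder.glob("*.csv")):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    missing = dirs_without_csv(".")
    if missing:
        print("No CSV files found in:", ", ".join(missing))
    else:
        print("All dataset folders contain at least one CSV file.")
```

Run it from the repository root so that the relative paths resolve correctly.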

Then run the three preprocessing scripts:

python preprocessing.py
python create_samples.py
python create_val_sets.py
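
The three commands can also be chained so that a failure in one script stops the rest. This is a minimal sketch under two assumptions: the scripts take no arguments, as in the commands above, and they are meant to run in the order listed. The wrapper `run_scripts` is hypothetical and not part of the repository.

```python
import subprocess
import sys

# Script names and order as listed above.
SCRIPTS = ["preprocessing.py", "create_samples.py", "create_val_sets.py"]


def run_scripts(scripts=SCRIPTS, runner=subprocess.run):
    """Run each script with the current interpreter; stop at the first failure."""
    completed = []
    for script in scripts:
        # check=True makes subprocess.run raise CalledProcessError
        # when the script exits with a non-zero status.
        runner([sys.executable, script], check=True)
        completed.append(script)
    return completed
```

Calling `run_scripts()` from the repository root, inside the activated conda environment, runs the scripts with the same interpreter. The `runner` parameter exists only so that the function can be exercised without launching real processes.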

Pretrained Models

Our code relies on publicly available text diffusion model checkpoints, which you can download from the sources below:

Model	Source	Manually move to
GENIE	🔗	GENIE/ 📂
DiffuSeq	🔗	DiffuSeq/ 📂
SeqDiffuSeq	t.b.d.	SeqDiffuSeq/ 📂

Cite

Please use the following citation:

@misc{ochs2024privatesynthetictextgeneration,
      title={Private Synthetic Text Generation with Diffusion Models}, 
      author={Sebastian Ochs and Ivan Habernal},
      year={2024},
      eprint={2410.22971},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.22971}, 
}

Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

