SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation (NAACL Findings 2025)
This is the official repository for SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation.
We present SusGen-30K, a meticulously curated instruction-tuning dataset for the financial and ESG NLP domain; SusGen-GPT, a suite of fine-tuned LLMs that achieve state-of-the-art performance across financial and ESG benchmarks with only 7–8B parameters; and TCFD-Bench, a benchmark for evaluating sustainability report generation that sets a new standard for model evaluation in this domain.
Tasks Supported (Click to expand)
Headline Classification (HC), Named Entity Recognition (NER), Relation Extraction (RE), Sentiment Analysis (SA), Financial Question Answering (FIN-QA), Financial Table Question Answering (FIN-TQA), Text Summarisation (SUM), Sustainability Report Generation (SRG).
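All of the tasks above are framed as instruction-following examples for fine-tuning. As a minimal sketch of how such a record can be rendered into a prompt (the field names and template below are illustrative assumptions, not the exact SusGen-30K schema):

```python
# Illustrative sketch of an instruction-tuning record; the actual
# SusGen-30K schema and prompt template may differ (names here are assumptions).
def build_prompt(record: dict) -> str:
    """Render one instruction-tuning record into a single prompt string."""
    prompt = f"Instruction: {record['instruction']}\n"
    if record.get("input"):  # e.g. the headline for HC or the table for FIN-TQA
        prompt += f"Input: {record['input']}\n"
    prompt += "Response:"
    return prompt

example = {
    "instruction": "Classify the sentiment of the following financial headline.",
    "input": "Company X beats quarterly earnings expectations.",
    "output": "positive",  # the target completion the model is trained on
}
print(build_prompt(example))
```

Tasks without extra context (e.g. free-form SRG prompts) would simply omit the `Input:` field.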
2025.1.23: 🎉🎉🎉 Our paper is accepted to NAACL Findings 2025! 🎉🎉🎉
2025.1.12: Code and checkpoints are released.
2024.12.14: 🍃🍃🍃 SusGen is released! 🍃🍃🍃
git clone [email protected]:JerryWu-code/SusGen.git
cd SusGen/
conda create --name susgen python==3.10 -y
conda activate susgen
export VLLM_INSTALL_PUNICA_KERNELS=1
pip install -r requirements.txt
Before downloading the checkpoints, make sure you have access to Llama3-8B-Instruct, Llama3-8B, Mistral-7B-Instruct-v0.3, Mistral-7B-v0.3, and our LoRA checkpoint, and log in to your Hugging Face client in the terminal.
# 1. set up the huggingface client
huggingface-cli login # paste your access token (with write permission) when prompted
mkdir ckpts && cd ckpts/
git lfs install
# 2. download the llm base checkpoint
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
git clone https://huggingface.co/mistralai/Mistral-7B-v0.3
# 3. download the LoRA checkpoint (replace 'path-to-our-lora-checkpoint' with the actual path)
git clone https://huggingface.co/WHATX/path-to-our-lora-checkpoint
You can download either:
- Only the SusGen-30K data from our Hugging Face repository via this link, and put it under the folder data/SusGen/.
- The data along with the preprocessing parts and additional data via this link, and put it under the folder data/.
Adjust the configuration in configs/training_configs/finetune_config.yaml, then run the following command:
cd src/
CUDA_VISIBLE_DEVICES=0 /home/(your username)/anaconda3/envs/susgen/bin/torchrun --nproc_per_node=1 --master_port=29501 finetune.py --config configs/training_configs/finetune_config.yaml
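The exact keys are defined by configs/training_configs/finetune_config.yaml in the repository; as a hedged sketch, a LoRA fine-tuning config of this kind typically holds settings along these lines (every key and value below is an illustrative assumption, not the repo's actual schema):

```yaml
# Illustrative only — check configs/training_configs/finetune_config.yaml
# for the actual keys; everything below is an assumption.
model_name_or_path: ckpts/Mistral-7B-Instruct-v0.3
data_path: data/SusGen/          # hypothetical data location
output_dir: ckpts/susgen-lora
# Typical LoRA hyperparameters
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
# Training schedule
num_train_epochs: 3
per_device_train_batch_size: 4
learning_rate: 2.0e-4
```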
- Release the curated training data SusGen-30k.
- Release the curated evaluation data TCFD-Bench.
- Release the training & inference code.
- Release the data preprocessing code.
- Release the suite of trained model checkpoints SusGen-GPT.
- Release the evaluation code.
- Release more if time allows ...
| Project Leader | Project Members | Other Members | PI |
|---|---|---|---|
| Qilong Wu | Xiaoneng Xiang<br>Hejia Huang<br>Xuan Wang | Yeo Wei Jie<br>Ricardo Shirota Filho | Dr. Ranjan Satapathy (A*STAR)<br>Prof. Bharadwaj Veeravalli (NUS) |
If our work assists your research or you use our data, feel free to give us a star ⭐ or cite us with:
@article{wu2024susgen,
title={SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation},
author={Wu, Qilong and Xiang, Xiaoneng and Huang, Hejia and Wang, Xuan and Jie, Yeo Wei and Satapathy, Ranjan and Veeravalli, Bharadwaj and others},
journal={arXiv preprint arXiv:2412.10906},
year={2024}
}
We thank the great work from QLoRA, FinGPT, and PIXIU, and the paper inspiration from ChatReport and Common-7B. This project was sponsored by the National University of Singapore and the A*STAR Institute of High Performance Computing.