SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation (NAACL Findings 2025)
This is the official repository for SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation.
We present SusGen-30K, a meticulously curated instruction-tuning dataset for the financial and ESG NLP domain; SusGen-GPT, a suite of fine-tuned LLMs that achieve state-of-the-art performance across financial and ESG benchmarks with only 7–8B parameters; and TCFD-Bench, a benchmark for evaluating sustainability report generation that sets a new standard for model evaluation in this domain.
Tasks Supported (Click to expand)
Headline Classification (HC), Named Entity Recognition (NER), Relation Extraction (RE), Sentiment Analysis (SA), Financial Question Answering (FIN-QA), Financial Table Question Answering (FIN-TQA), Text Summarisation (SUM), Sustainability Report Generation (SRG).
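All of the tasks above are framed as instruction-following examples for fine-tuning. As a minimal sketch of how such a record can be rendered into a prompt (the field names and template below are illustrative assumptions, not the exact SusGen-30K schema):

```python
# Illustrative sketch of an instruction-tuning record; the actual
# SusGen-30K schema and prompt template may differ (names here are assumptions).
def build_prompt(record: dict) -> str:
    """Render one instruction-tuning record into a single prompt string."""
    prompt = f"Instruction: {record['instruction']}\n"
    if record.get("input"):  # e.g. the headline for HC or the table for FIN-TQA
        prompt += f"Input: {record['input']}\n"
    prompt += "Response:"
    return prompt

example = {
    "instruction": "Classify the sentiment of the following financial headline.",
    "input": "Company X beats quarterly earnings expectations.",
    "output": "positive",  # the target completion the model is trained on
}
print(build_prompt(example))
```

Tasks without extra context (e.g. free-form SRG prompts) would simply omit the `Input:` field.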
2025.1.23: 🎉🎉🎉 Our paper is accepted to NAACL Findings 2025! 🎉🎉🎉
2025.1.12: Code and checkpoints are released.
2024.12.14: 🍃🍃🍃 SusGen is released! 🍃🍃🍃
git clone [email protected]:JerryWu-code/SusGen.git
cd SusGen/
conda create --name susgen python==3.10 -y
conda activate susgen
export VLLM_INSTALL_PUNICA_KERNELS=1
pip install -r requirements.txt
Before downloading the checkpoints, make sure you have access to Llama3-8B-Instruct, Llama3-8B, Mistral-7B-Instruct-v0.3, Mistral-7B-v0.3, and our LoRA checkpoint, and log in to your Hugging Face client in the terminal.
# 1. set up the huggingface client
huggingface-cli login # paste your access token (with write permission) when prompted
mkdir ckpts && cd ckpts/
git lfs install
# 2. download the llm base checkpoint
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B
git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
git clone https://huggingface.co/mistralai/Mistral-7B-v0.3
# 3. download the LoRA checkpoint (replace 'path-to-our-lora-checkpoint' with the actual path)
git clone https://huggingface.co/WHATX/path-to-our-lora-checkpoint
You can download either:
- Only the SusGen-30K data from our Hugging Face repository via this link, and put it under the folder data/SusGen/.
- The data along with the preprocessing parts and additional data via this link, and put it under the folder data/.
Adjust the configuration in configs/training_configs/finetune_config.yaml, then run the following command:
cd src/
CUDA_VISIBLE_DEVICES=0 /home/(your username)/anaconda3/envs/susgen/bin/torchrun --nproc_per_node=1 --master_port=29501 finetune.py --config configs/training_configs/finetune_config.yaml
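The exact keys are defined by configs/training_configs/finetune_config.yaml in the repository; as a hedged sketch, a LoRA fine-tuning config of this kind typically holds settings along these lines (every key and value below is an illustrative assumption, not the repo's actual schema):

```yaml
# Illustrative only — check configs/training_configs/finetune_config.yaml
# for the actual keys; everything below is an assumption.
model_name_or_path: ckpts/Mistral-7B-Instruct-v0.3
data_path: data/SusGen/          # hypothetical data location
output_dir: ckpts/susgen-lora
# Typical LoRA hyperparameters
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
# Training schedule
num_train_epochs: 3
per_device_train_batch_size: 4
learning_rate: 2.0e-4
```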
- Release the curated training data SusGen-30k.
- Release the curated evaluation data TCFD-Bench.
- Release the training & inference code.
- Release the data preprocessing code.
- Release the suite of trained model checkpoints SusGen-GPT.
- Release the evaluation code.
- Release more if time allows ...
| Project Leader | Project Members | Other Members | PI |
|---|---|---|---|
| Qilong Wu | Xiaoneng Xiang<br>Hejia Huang<br>Xuan Wang | Yeo Wei Jie<br>Ricardo Shirota Filho | Dr. Ranjan Satapathy (A*STAR)<br>Prof. Bharadwaj Veeravalli (NUS) |
If our work assists your research or you use our data, feel free to give us a star ⭐ or cite us with:
@article{wu2024susgen,
title={SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation},
author={Wu, Qilong and Xiang, Xiaoneng and Huang, Hejia and Wang, Xuan and Jie, Yeo Wei and Satapathy, Ranjan and Veeravalli, Bharadwaj and others},
journal={arXiv preprint arXiv:2412.10906},
year={2024}
}
We thank the great work from QLoRA, FinGPT, and PIXIU, and the paper inspiration from ChatReport and Common-7B. This project was sponsored by the National University of Singapore and the A*STAR Institute of High Performance Computing.