This repository accompanies the paper "Keyphrase Extraction from Scientific Articles via Extractive Summarization", published at the NAACL 2021 Scholarly Document Processing (SDP) workshop. The paper can be accessed here.
Automatically extracting keyphrases from scholarly documents yields a valuable concise representation that humans can understand and machines can process for tasks such as information retrieval, article clustering, and article classification. This work is concerned with the parts of a scientific article that should be given as input to keyphrase extraction methods. Recent deep learning methods take titles and abstracts as input due to the increased computational complexity of processing long sequences, whereas traditional approaches can also work with full-texts. Titles and abstracts are dense in keyphrases but often miss important aspects of the articles, while full-texts are richer in keyphrases but much noisier. To address this trade-off, we propose applying extractive summarization models to the full-texts of scholarly documents. Our empirical study on 3 article collections using 3 keyphrase extraction methods shows promising results.
Clone the project
git clone https://github.com/intelligence-csd-auth-gr/keyphrase-extraction-via-summarization.git
Install libraries (Python 3.7)
pip install -r requirements.txt
Download all datasets (KP20K, NUS, ACM & SemEval) used in the experiments here, OR follow the instructions below to set them up manually. To place the data in the working project folder, paste the contents of the downloaded `root_folder/` folder into the root folder of the project. The downloaded folder contains all the necessary folders under the `/data/preprocessed_data/`, `GloVe` and `pretrained_models` folders, so skip the folder creation step (see also the "Folders" section below).
- KP20K (source: initial repository of paper Deep Keyphrase Generation)
- NUS (source: updated repository of paper Deep Keyphrase Generation)
- ACM
- SemEval 2010
Place the KP20K datasets (`kp20k_training.json`, `kp20k_validation.json`, `kp20k_testing.json`) under the folder `/data/`.
For the NUS dataset:
- move the file `data/json/nus/nus_test.json` to `data/benchmark_data/` and rename it to `NUS.json`
For the ACM dataset (create folders if not existing):
- place the contents of `src/all_docs_abstacts_refined.zip` inside the folder `data/benchmark_data/test_dataset_processing/ACM/`
- place the file `references/test.author.stem.json` in `data/benchmark_data/test_dataset_processing/ACM/all_keys_in_json/`
For the SemEval 2010 dataset (create folders if not existing):
- place the contents of both the `train/` and `test/` folders into the project folder `data/benchmark_data/test_dataset_processing/semeval_2010/train_test_combined/`
- manually merge the files `references/train.combined.stem.json` and `references/test.combined.stem.json` into a file named `train_test.combined.stem.json`, and place it in `data/benchmark_data/test_dataset_processing/semeval_2010/` (a scripted version of this merge is sketched below)
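Since both reference files are JSON, the manual merge can also be scripted. A minimal sketch, assuming each file holds a single top-level JSON object mapping document IDs to stemmed keyphrases (an assumption on our part):

```python
import json

# Merge the SemEval 2010 train and test reference files into one file.
# Assumes each file holds one JSON object (document id -> keyphrases).
with open("references/train.combined.stem.json") as f:
    merged = json.load(f)
with open("references/test.combined.stem.json") as f:
    merged.update(json.load(f))

out_path = ("data/benchmark_data/test_dataset_processing/"
            "semeval_2010/train_test.combined.stem.json")
with open(out_path, "w") as f:
    json.dump(merged, f)
```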
Convert the format of the `ACM` and `SemEval 2010` datasets:
- Prepare the SemEval 2010 dataset
python data/benchmark_data/test_dataset_processing/combine_semeval_dataset.py
- Prepare the ACM dataset
python data/benchmark_data/test_dataset_processing/acm_parser.py
Download "glove.6B/glove.6B.100d.txt" GloVe embeddings and place them in the root folder of the project under the path:
/GloVe/glove.6B/
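For reference, GloVe files are plain text with one token and its vector per line, so they can be loaded with a few lines of Python (a generic sketch, not a script from this repository):

```python
import numpy as np

# Build a {token: vector} dict from the 100-dimensional GloVe file.
embeddings = {}
with open("GloVe/glove.6B/glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        token, *values = line.rstrip().split(" ")
        embeddings[token] = np.asarray(values, dtype="float32")

print(len(embeddings), "tokens, vector size", embeddings["the"].shape[0])
```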
| Folder | Description |
|---|---|
| data | contains all the datasets and also the necessary scripts to generate the test datasets (folders are split by experiment) |
| data_statistics | run any script in this folder to get statistics from each dataset |
| unsupervised_models | contains code for the MultipartiteRank and TF-IDF models |
Create the following folders under `/data/preprocessed_data/` to store the processed data for all experiments (skip if the data were downloaded with the "Automated Data Set-up" method; a folder-creation sketch follows the list):
/data/preprocessed_data/first_paragraphs_fulltext/
/data/preprocessed_data/full_abstract/
/data/preprocessed_data/paragraph_fulltext/
/data/preprocessed_data/sentence_abstract/
/data/preprocessed_data/sentence_fulltext/
/data/preprocessed_data/summarization_experiment/
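The same folders can be created programmatically; a small convenience sketch:

```python
import os

# Create every experiment folder under data/preprocessed_data/.
for name in [
    "first_paragraphs_fulltext",
    "full_abstract",
    "paragraph_fulltext",
    "sentence_abstract",
    "sentence_fulltext",
    "summarization_experiment",
]:
    os.makedirs(os.path.join("data", "preprocessed_data", name), exist_ok=True)
```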
To generate the summarizations, a transformer-based model from TransformerSum was used. Clone the TransformerSum repository and place the script `acm_nus_semeval_summarization.py` in the root project folder. Next, download the `distilroberta-base-ext-sum` pre-trained model, trained on the arXiv-PubMed dataset, and place it in the root of the project in a folder named `models/`. In the `datasets/` folder, drop the datasets to be summarized (`NUS.json`, `ACM.json` and `semeval_2010.json`).
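For orientation, TransformerSum exposes an extractive summarizer that loads a checkpoint and summarizes raw text; the script above applies this to every document. A rough usage sketch based on the TransformerSum documentation (the checkpoint filename is a placeholder for the downloaded model):

```python
# Sketch of extractive summarization with TransformerSum
# (https://github.com/HHousen/TransformerSum).
from extractive import ExtractiveSummarizer

# Placeholder checkpoint name for the distilroberta-base-ext-sum model.
model = ExtractiveSummarizer.load_from_checkpoint("models/epoch=3.ckpt")

text = ("Full-texts are rich in keyphrases but noisy. "
        "Extractive summarization keeps only the most salient sentences.")
print(model.predict(text))
```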
Generate summarizations for the `ACM`, `NUS` and `SemEval` datasets:
python acm_nus_semeval_summarization.py
Move the generated files (`ACM_summarized.csv`, `NUS_summarized.csv` and `SemEval-2010_summarized.csv`) that contain the summarizations into the folder `data/benchmark_data/summarization_experiment/`.
Clean duplicate documents between the train set and each of the test sets
python data/benchmark_data/clean_duplicate_papers.py
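Conceptually, this step drops any training article that also appears in a test collection; a simplified illustration of the idea, with hypothetical file paths and a `title` field assumed for each JSON record (the actual logic lives in `clean_duplicate_papers.py`):

```python
import json

# Illustration only: remove KP20K training records whose (normalized)
# title also occurs in a test set. Paths and the "title" field are assumed.
def titles(path):
    with open(path) as f:
        return {json.loads(line)["title"].strip().lower() for line in f}

test_titles = titles("data/benchmark_data/NUS.json")

with open("data/kp20k_training.json") as src, \
        open("data/kp20k_training.dedup.json", "w") as dst:
    for line in src:
        if json.loads(line)["title"].strip().lower() not in test_titles:
            dst.write(line)
```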
Prepare the KP20k datasets (train: kp527k, validation: kp20k-v, test: kp20k)
python preprocessing_full.py --mode train
python preprocessing_full.py --mode validation
python preprocessing_full.py --mode test
Prepare the sentence-split KP20k datasets (train: kp527k, validation: kp20k-v, test: kp20k)
python preprocessing_sentences.py --mode train
python preprocessing_sentences.py --mode validation
python preprocessing_sentences.py --mode test
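In the sentence variant, each sentence of an abstract becomes its own (much shorter) training sequence. A toy illustration of the splitting step, assuming NLTK's sentence tokenizer (the repository's preprocessing scripts implement their own pipeline):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)

abstract = ("Keyphrase extraction yields a concise representation of an "
            "article. We study which parts of an article to use as input.")

# Each sentence is treated as a separate, shorter sequence.
for i, sentence in enumerate(sent_tokenize(abstract)):
    print(i, sentence)
```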
Change the sequence size of the string data without pre-processing the data again (the `mode` argument selects which dataset to load, and `sentence_model` defines whether to load the data split into sentences or as a whole):
python load_preprocessed_data.py --mode train --sentence_model 0
python load_preprocessed_data.py --mode validation --sentence_model 0
python load_preprocessed_data.py --mode test --sentence_model 0
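Changing the sequence size amounts to re-padding or truncating the already tokenized sequences rather than re-running the full preprocessing; conceptually something like the following Keras call (the length and data are placeholders):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 400  # hypothetical target sequence size

# Already tokenized documents (lists of token ids) are padded or
# truncated to the new fixed length without redoing preprocessing.
sequences = [[12, 7, 93], [4, 5, 6, 7, 8, 9]]
padded = pad_sequences(sequences, maxlen=MAX_LEN,
                       padding="post", truncating="post")
print(padded.shape)  # (2, 400)
```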
Prepare the test datasets by running the scripts in the folder that corresponds to each experiment:
- first three paragraphs of the full-text: `data/benchmark_data/first_paragraphs_fulltext/`
- complete abstract: `data/benchmark_data/full_abstract/`
- full-text split into paragraphs: `data/benchmark_data/paragraph_fulltext/`
- abstract split into sentences: `data/benchmark_data/sentence_abstract/`
- full-text split into sentences: `data/benchmark_data/sentence_fulltext/`
- summarization of the full-text: `data/benchmark_data/summarization_experiment/`
Create a folder named `pretrained_models/checkpoint/` to store the trained models.
Train the Bi-LSTM-CRF model (the `select_test_set` argument selects which test dataset to use and `sentence_model` defines whether to train a sentence model; a sketch of the model backbone follows the table below):
python bi_lstm_crf.py --sentence_model 0 --select_test_set acm_full_abstract
The `select_test_set` argument can take the following values:
| experiment | select_test_set |
|---|---|
| full abstract | kp20k_full_abstract, nus_full_abstract, acm_full_abstract, semeval_full_abstract |
| abstract in sentences | kp20k_sentences_abstract, nus_sentences_abstract, acm_sentences_abstract, semeval_sentences_abstract |
| fulltext in sentences | nus_sentences_fulltext, acm_sentences_fulltext, semeval_sentences_fulltext |
| fulltext in paragraphs | nus_paragraph_fulltext, acm_paragraph_fulltext, semeval_paragraph_fulltext |
| first 3 paragraphs | nus_220_first_3_paragraphs, acm_220_first_3_paragraphs, semeval_220_first_3_paragraphs, nus_400_first_3_paragraphs, acm_400_first_3_paragraphs, semeval_400_first_3_paragraphs |
| summarization of abstract and fulltext | nus_summarization, acm_summarization, semeval_summarization |
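For orientation, the trained model is a Bi-LSTM-CRF sequence tagger over GloVe embeddings. A minimal Keras sketch of the backbone, with hypothetical sizes and a plain softmax output standing in for the CRF layer used by `bi_lstm_crf.py`:

```python
import tensorflow as tf

# Hypothetical sizes; the real script derives these from the data.
VOCAB_SIZE, EMB_DIM, MAX_LEN, NUM_TAGS = 400_000, 100, 400, 3  # B/I/O tags

inputs = tf.keras.Input(shape=(MAX_LEN,))
# The Embedding layer would be initialized with the GloVe vectors.
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(inputs)
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(x)
# bi_lstm_crf.py uses a CRF output layer; softmax is a simplification here.
outputs = tf.keras.layers.Dense(NUM_TAGS, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```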
Load a trained model (the `select_test_set` argument selects which test dataset to use, `sentence_model` defines whether to load a sentence model, and `pretrained_model_path` gives the path to the pre-trained model). The `select_test_set` and `sentence_model` arguments take the same values as in the table above.
python load_pretrained_model.py --sentence_model 0 --select_test_set acm_full_abstract --pretrained_model_path pretrained_models/checkpoint/model.03.h5
Load trained models for the experiments with combined predictions on Abstract & Summaries (the `select_test_set` argument selects which test dataset to use, `sentence_model` defines whether to load a sentence model, and `pretrained_model_path` gives the path to the pre-trained model). In this case, the `select_test_set` argument can be either `nus`, `acm` or `semeval`.
python combined_summary_abstract_load_pretrained_model.py --sentence_model 0 --select_test_set acm --pretrained_model_path pretrained_models/checkpoint/model.03.h5
Please cite the following paper if you are interested in using our code.
@inproceedings{kontoulis2021keyphrase,
title={Keyphrase Extraction from Scientific Articles via Extractive Summarization},
author={Kontoulis, Chrysovalantis Giorgos and Papagiannopoulou, Eirini and Tsoumakas, Grigorios},
booktitle={Proceedings of the Second Workshop on Scholarly Document Processing},
pages={49--55},
year={2021}
}