Spring 2018 Deep Learning Project
Eduardo Fierro (eff254), Lisa Ren (tr1312), Caroline Roper (cer446)
You can download the raw and pre-processed data used in this project, using an NYU email account ([email protected]), from the following link: https://bit.ly/2IEN7Ke
To pre-process the data, we followed Rico Sennrich's subword-NMT scripts, originally available here: https://github.com/rsennrich/subword-nmt. All the files are available in the Preprocess folder inside this repository. Please refer to the README.md file inside that folder for replication instructions.
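As a rough illustration of what that pre-processing does, subword-NMT provides `learn_bpe.py` and `apply_bpe.py`, which learn a byte-pair encoding from the training text and then segment each file with it. The sketch below is only indicative; the number of merge operations and the code file name are assumptions, and the scripts in the Preprocess folder are the authoritative version:

```bash
# Learn a BPE model from the training articles and titles
# (32000 merge operations is an assumed, commonly used value):
cat train.article.txt train.title.txt | python learn_bpe.py -s 32000 > bpe.codes

# Apply the learned codes to each split, producing the *.bpe files
# consumed by preprocess.py below:
python apply_bpe.py -c bpe.codes < train.article.txt > train.article.txt.bpe
python apply_bpe.py -c bpe.codes < train.title.txt > train.title.txt.bpe
python apply_bpe.py -c bpe.codes < valid.article.filter.txt > valid.article.filter.txt.bpe
python apply_bpe.py -c bpe.codes < valid.title.filter.txt > valid.title.filter.txt.bpe
```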
The model follows the specification made publicly available by Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart and Alexander M. Rush, here: https://github.com/OpenNMT/OpenNMT-py. Due to recent changes for PyTorch 0.4, the code was forked into this repository. For requirements, please refer to the README.md file of this repository.
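Assuming the fork keeps upstream OpenNMT-py's layout, the dependencies can typically be installed from the repository root (a sketch; the README.md remains the authoritative list):

```bash
# Install the Python dependencies declared by the repository
# (assumes a requirements.txt file, as in upstream OpenNMT-py):
pip install -r requirements.txt
```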
Once the requirements are met, you can pre-process the data to generate the lazy dataloaders with the following command:
```bash
python preprocess.py -train_src DIRECTORY_TO_SOURCE_TRAIN/train.article.txt.bpe \
    -train_tgt DIRECTORY_TO_TARGET_TRAIN/train.title.txt.bpe \
    -valid_src DIRECTORY_TO_SOURCE_VALIDATION/valid.article.filter.txt.bpe \
    -valid_tgt DIRECTORY_TO_TARGET_VALIDATION/valid.title.filter.txt.bpe \
    -save_data OUTPUT_DIRECTORY/bpe \
    -src_seq_length 10000 \
    -dynamic_dict \
    -share_vocab \
    -max_shard_size 524288000
```
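If the step succeeds, OUTPUT_DIRECTORY should contain the serialized datasets and the shared vocabulary. The file names below are indicative of what OpenNMT-py's preprocess.py writes for a `-save_data` prefix of `bpe`; the number of training shards depends on `-max_shard_size`:

```bash
# Quick sanity check of the generated artifacts (names indicative):
ls OUTPUT_DIRECTORY
# bpe.train.1.pt  bpe.train.2.pt  ...  bpe.valid.1.pt  bpe.vocab.pt
```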
After that, the model can be run using the following specification, where INPUT_DIRECTORY below is the same as OUTPUT_DIRECTORY from above:
```bash
python train.py -data INPUT_DIRECTORY/bpe \
    -save_model SAVED_MODELS_DIRECTORY \
    -copy_attn \
    -reuse_copy_attn \
    -epochs 20 \
    -gpuid 0 \
    -bridge
```
You can also add the `-share_embeddings` option, or leave it out, as shown below. For comparison purposes, run the model both on the BPE data and on data that has not been converted into byte pairs.
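For example, the shared-embeddings variant only adds one flag to the command above (a sketch; `-share_embeddings` requires the vocabulary to have been built with `-share_vocab`, as done in the preprocessing step):

```bash
python train.py -data INPUT_DIRECTORY/bpe \
    -save_model SAVED_MODELS_DIRECTORY \
    -copy_attn \
    -reuse_copy_attn \
    -share_embeddings \
    -epochs 20 \
    -gpuid 0 \
    -bridge
```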
To generate the translation, you can run the following:
```bash
python translate.py -gpu 0 \
    -batch_size 20 \
    -beam_size 5 \
    -model SAVED_MODELS_DIRECTORY/MODEL_CHECKPOINT.pt \
    -src PREDICTION_DATA \
    -output OUTPUT_FILE \
    -min_length 9 \
    -verbose \
    -stepwise_penalty \
    -coverage_penalty summary \
    -beta 5 \
    -length_penalty wu \
    -alpha 0.9 \
    -block_ngram_repeat 3 \
    -ignore_when_blocking "." "</t>" "<t>"
```
In the example above, the translations are generated with alpha = 0.9 (the strength of the length penalty) and a minimum output length of 9 tokens.
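Because the model is trained on byte-pair-encoded text, the generated summaries still contain BPE continuation markers. A standard way to restore plain text, assuming subword-NMT's default `@@ ` separator, is the following (a sketch, not part of the original pipeline):

```bash
# Undo BPE segmentation by removing the '@@ ' continuation markers:
sed -r 's/(@@ )|(@@ ?$)//g' < bpe_m9_a9.out > m9_a9.out
```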
METEOR scores are obtained using the publicly available implementation authored by Michael Denkowski: https://github.com/cmu-mtlab/meteor. Please refer to the README for further installation instructions and details: http://www.cs.cmu.edu/~alavie/METEOR/README.html.
The scores are obtained using the command below, where bpe_m9_a9.out is a text file of generated summaries and validation.title.bpe.txt is a text file of reference summaries.
```bash
java -Xmx2G -jar meteor-*.jar bpe_m9_a9.out validation.title.bpe.txt -l en -norm
```
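METEOR prints per-segment scores followed by aggregate statistics. If you only want the corpus-level number, you can filter its output; the exact label depends on the METEOR version, but 1.5 reports a "Final score" line:

```bash
# Keep only the aggregate score from METEOR's output (label assumed
# from METEOR 1.5's default report format):
java -Xmx2G -jar meteor-*.jar bpe_m9_a9.out validation.title.bpe.txt -l en -norm | grep "Final score"
```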