
Commit d88abf6

Updating documentation (#15)
1 parent c6ba686

24 files changed: +495 −52 lines

Readme.md

+195 −50
Large diffs are not rendered by default.

doc/datasets/chaganty2018.md

+16

# Chaganty 2018
This dataset contains quality judgments for several different summarization systems on the CNN/DailyMail dataset.
The data was published in [The price of debiasing automatic metrics in natural language evaluation](https://www.aclweb.org/anthology/P18-1060.pdf).

```bash
sacrerouge setup-dataset chaganty2018 \
    <output-dir>
```

The output files are the following:
- `documents.jsonl`: The CNN/DailyMail documents
- `summaries.jsonl`: The system summaries
- `metrics.jsonl`: The corresponding manual evaluation metrics for the system summaries
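
Each output file is in JSON Lines format, so a quick way to see what a record looks like is to pretty-print the first line of each file. This is only a sketch; the output directory name below is an assumption.

```bash
# Pretty-print the first record of each output file to see its fields.
# "datasets/chaganty2018" is only a hypothetical output directory.
for f in documents summaries metrics; do
    echo "== ${f}.jsonl =="
    head -n 1 "datasets/chaganty2018/${f}.jsonl" | python -m json.tool
done
```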

## Notes
Instance `006588` appears twice for the `ml+rl` system.

doc/datasets/datasets.md

+14 −1

# Datasets
SacreROUGE provides dataset readers for the following datasets:

- [DUC and TAC](duc-tac/duc-tac.md)
- [MultiLing](multiling/multiling.md)
- [Chaganty 2018](chaganty2018.md)

The readers parse the original data and convert it to a common format for use in SacreROUGE.
Please see the respective documentation for each dataset for more details.

Each of the datasets can be set up via a command such as:
```bash
sacrerouge setup-dataset <dataset-name>
```
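
For example, the Chaganty 2018 dataset only takes an output directory, so a concrete invocation might look like the sketch below (the output path is just an assumption); datasets such as DUC/TAC take additional arguments, as described on their pages.

```bash
# Hypothetical output directory; any writable path works.
sacrerouge setup-dataset chaganty2018 datasets/chaganty2018
```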

doc/datasets/duc-tac/duc-tac.md

+18

# DUC/TAC
The DUC (Document Understanding Conference) and TAC (Text Analysis Conference) provided many single- and multi-document summarization datasets with human judgments from 2001 to 2011.
Due to license restrictions, we cannot release any of this data.
However, if you have the usernames and passwords required to download the data, we have provided [this repository](https://github.com/danieldeutsch/duc-tac-data) which you can use to set up the data in the format required by SacreROUGE.
A sketch of setting up several years at once is shown after the list below.

For details related to each year, please see the corresponding documentation:

- [DUC 2001](duc2001.md)
- [DUC 2002](duc2002.md)
- [DUC 2003](duc2003.md)
- [DUC 2004](duc2004.md)
- [DUC 2005](duc2005.md)
- [DUC 2006](duc2006.md)
- [DUC 2007](duc2007.md)
- [TAC 2008](tac2008.md)
- [TAC 2009](tac2009.md)
- [TAC 2010](tac2010.md)
- [TAC 2011](tac2011.md)
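
Assuming the raw data has already been downloaded with the repository above, the per-year pages all use the same `setup-dataset` pattern, so most years can be set up in one loop. This is only a sketch with assumed paths; TAC 2011 is left out because it additionally requires a path to English Gigaword (see its page).

```bash
# Sketch only: both paths are assumptions.
RAW=/path/to/duc-tac-data   # root of the downloaded DUC/TAC data repository
OUT=datasets                # hypothetical output directory

for year in duc2001 duc2002 duc2003 duc2004 duc2005 duc2006 duc2007 \
            tac2008 tac2009 tac2010; do
    sacrerouge setup-dataset "${year}" "${RAW}" "${OUT}/${year}"
done
```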

doc/datasets/duc-tac/duc2001.md

+20

# DUC 2001
[Homepage](https://www-nlpir.nist.gov/projects/duc/guidelines/2001.html)

For DUC 2001, we provide dataset readers for task 1 (single-document summarization) and task 2 (multi-document summarization).
```bash
sacrerouge setup-dataset duc2001 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.train.jsonl`: The training data for task 1
- `task1.test.jsonl`: The test data for task 1
- `task2.train.X.jsonl`: The training data for task 2 with target summaries of length `X` for `X` in `[50, 100, 200, 400]` (concrete filenames are sketched after this list)
- `task2.test.X.jsonl`: The same as the above but for testing
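
As a concrete illustration of the `X` placeholder, the task 2 files expand to one file per target length, so after setup you should see something like the following, where `<output-dir>` is the directory passed above.

```bash
# One task 2 file per target summary length (50, 100, 200, 400).
ls <output-dir>/task2.train.{50,100,200,400}.jsonl \
   <output-dir>/task2.test.{50,100,200,400}.jsonl
```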

## Notes
The input documents for DUC 2001 did not always have a standard schema, so parsing them was quite difficult.
Therefore, there may be noise in the documents.

doc/datasets/duc-tac/duc2002.md

+22

# DUC 2002
[Homepage](https://www-nlpir.nist.gov/projects/duc/guidelines/2002.html)

For DUC 2002, we provide dataset readers for task 1 (single-document summarization) and task 2 (multi-document summarization).
```bash
sacrerouge setup-dataset duc2002 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.jsonl`: The data for task 1
- `task1.summaries.jsonl`: The submitted peer and reference summaries for task 1
- `task1.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 1
- `task2.X.jsonl`: The data for task 2 for the summary target length `X`
- `task2.X.summaries.jsonl`: The submitted peer and reference summaries for task 2, length `X`
- `task2.X.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 2, length `X`
- `task2.Xe.jsonl`: The extractive summarization data for task 2, length `X`

## Notes
Not all of the human judgments were loaded; the `multijudge.short.results.table` results are not included.

doc/datasets/duc-tac/duc2003.md

+15

# DUC 2003
[Homepage](https://duc.nist.gov/duc2003/tasks.html)

For DUC 2003, we provide dataset readers for all 4 tasks.
```bash
sacrerouge setup-dataset duc2003 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `taskX.jsonl`: The data for task `X`
- `taskX.summaries.jsonl`: The submitted peer and reference summaries for task `X`
- `taskX.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task `X`

doc/datasets/duc-tac/duc2004.md

+16

# DUC 2004
[Homepage](https://duc.nist.gov/duc2004/)

For DUC 2004, we provide dataset readers for tasks 1, 2, and 5.
```bash
sacrerouge setup-dataset duc2004 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.jsonl`: The data for task 1
- `taskX.jsonl`: The data for task `X` in `[2, 5]`
- `taskX.summaries.jsonl`: The submitted peer and reference summaries for task `X` in `[2, 5]`
- `taskX.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task `X` in `[2, 5]`

doc/datasets/duc-tac/duc2005.md

+20

# DUC 2005
[Homepage](https://duc.nist.gov/duc2005/)

For DUC 2005, we provide dataset readers for the single task.
```bash
sacrerouge setup-dataset duc2005 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.jsonl`: The data for the task
- `task1.summaries.jsonl`: The submitted peer and reference summaries for the task
- `task1.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for the task

## Notes
The reference summaries in `ROUGE/extras` are not loaded in `summaries` because they weren't used in evaluation, even though they do have some judgments (for example, linguistic quality).

Some of the Pyramid scores were done twice. We only take the last score.

doc/datasets/duc-tac/duc2006.md

+18

# DUC 2006
[Homepage](https://duc.nist.gov/duc2006/tasks.html)

For DUC 2006, we provide dataset readers for the single task.
```bash
sacrerouge setup-dataset duc2006 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.jsonl`: The data for the task
- `task1.summaries.jsonl`: The submitted peer and reference summaries for the task
- `task1.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for the task
- `task1.pyramids.jsonl`: The Pyramids for the set of references for task 1
- `task1.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary
- `task1.d0631.pyramid-annotations.jsonl`: The Pyramid annotations for instance `d0631`, which were done several times

doc/datasets/duc-tac/duc2007.md

+22

# DUC 2007
[Homepage](https://duc.nist.gov/duc2007/tasks.html)

For DUC 2007, we provide dataset readers for both tasks 1 and 2.
```bash
sacrerouge setup-dataset duc2007 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.jsonl`: The data for task 1
- `task1.summaries.jsonl`: The submitted peer and reference summaries for task 1
- `task1.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 1
- `task1.pyramids.jsonl`: The Pyramids for the set of references for task 1
- `task1.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary for task 1
- `task2.X.jsonl`: The data for task 2 for document set `X`, where `X` is `A`, `B`, `C`, or `A-B-C` (all three combined); concrete filenames are sketched after this list
- `task2.X.summaries.jsonl`: The submitted peer and reference summaries for task 2
- `task2.X.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 2
- `task2.X.pyramids.jsonl`: The Pyramids for the set of references for task 2
- `task2.X.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary for task 2
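
To make the `X` naming concrete, each task 2 file type appears once per document-set value, so the top-level data files should look like the following, with `<output-dir>` being the directory passed above.

```bash
# One task 2 data file per document-set value of X described above.
ls <output-dir>/task2.{A,B,C,A-B-C}.jsonl
```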

doc/datasets/duc-tac/tac2008.md

+17

# TAC 2008
[Homepage](https://tac.nist.gov/2008/summarization/)

For TAC 2008, we provide dataset readers for task 1.
```bash
sacrerouge setup-dataset tac2008 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.X.jsonl`: The data for task 1 for document set `X`, where `X` is `A`, `B`, or `A-B` (both combined)
- `task1.X.summaries.jsonl`: The submitted peer and reference summaries for task 1
- `task1.X.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 1
- `task1.X.pyramids.jsonl`: The Pyramids for the set of references for task 1
- `task1.X.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary for task 1

doc/datasets/duc-tac/tac2009.md

+21

# TAC 2009
[Homepage](https://tac.nist.gov/2009/Summarization/)

For TAC 2009, we provide dataset readers for task 1 and the submitted AESOP values.
```bash
sacrerouge setup-dataset tac2009 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.X.jsonl`: The data for task 1 for document set `X`, where `X` is `A`, `B`, or `A-B` (both combined)
- `task1.X.summaries.jsonl`: The submitted peer and reference summaries for task 1
- `task1.X.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 1
- `task1.X.pyramids.jsonl`: The Pyramids for the set of references for task 1
- `task1.X.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary for task 1

## Notes
Our correlation tests do not match the original for the LCS version of ROUGE because NIST ran ROUGE on non-sentence-tokenized summaries.
We run sentence tokenization, which ends up having a large effect on the LCS scores.

doc/datasets/duc-tac/tac2010.md

+17

# TAC 2010
[Homepage](https://tac.nist.gov/2010/Summarization/)

For TAC 2010, we provide dataset readers for task 1 and the submitted AESOP values.
```bash
sacrerouge setup-dataset tac2010 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.X.jsonl`: The data for task 1 for document set `X`, where `X` is `A`, `B`, or `A-B` (both combined)
- `task1.X.summaries.jsonl`: The submitted peer and reference summaries for task 1
- `task1.X.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 1
- `task1.X.pyramids.jsonl`: The Pyramids for the set of references for task 1
- `task1.X.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary for task 1

doc/datasets/duc-tac/tac2011.md

+52

# TAC 2011
[Homepage](https://tac.nist.gov/2011/Summarization/)

For TAC 2011, we provide dataset readers for task 1 and the submitted AESOP values.
```bash
sacrerouge setup-dataset tac2011 \
    <path-to-gigaword-root> \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-gigaword-root>` is the path to the root of `LDC2011T07/gigaword_eng_5`.
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.
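
A concrete invocation might look like the sketch below; every path here is an assumption and should be replaced with your own locations.

```bash
# All paths are assumptions: point them at your Gigaword copy, the downloaded
# DUC/TAC data repository, and the desired output directory.
sacrerouge setup-dataset tac2011 \
    /data/LDC2011T07/gigaword_eng_5 \
    /data/duc-tac-data \
    datasets/tac2011
```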

The output files are the following:
- `task1.X.jsonl`: The data for task 1 for document set `X`, where `X` is `A`, `B`, or `A-B` (both combined)
- `task1.X.summaries.jsonl`: The submitted peer and reference summaries for task 1
- `task1.X.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 1
- `task1.X.pyramids.jsonl`: The Pyramids for the set of references for task 1
- `task1.X.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary for task 1

## Notes
It appears that the Pyramid annotations were exhaustive (identifying SCUs which are not present in the reference Pyramids).
Those extra SCUs are not loaded here.

There are Pyramids for the combined A-B summaries, which we do not load.

The Pyramid annotations have incorrect SCU IDs, so they should be used with caution.
Here is an example:
```xml
<!-- Pyramid for D1112-B -->
<scu uid="7" label="Jury did not believe Alvarez planned to hurt anyone (NONE)">
  <contributor label="The jury foreman said at a news conference, after the trial...he did not believe Alvarez planned to kill anyone">
    <part label="he did not believe Alvarez planned to kill anyone" start="323" end="372"/>
    <part label="The jury foreman said at a news conference, after the trial" start="263" end="322"/>
  </contributor>
  <contributor label="the jury...believed he didn't intend to kill anyone">
    <part label="the jury" start="642" end="650"/>
    <part label="believed he didn't intend to kill anyone" start="713" end="753"/>
  </contributor>
  <contributor label="Jurors...didn't believe he meant to hurt anyone">
    <part label="didn't believe he meant to hurt anyone" start="1288" end="1326"/>
    <part label="Jurors" start="1228" end="1234"/>
  </contributor>
</scu>

<!-- # Annotation for system 22 -->
<peerscu uid="41" label="(3) Jury did not believe Alvarez planned to hurt anyone (NONE)">
  <contributor label="some jurors in the Metrolink train derailment case last month said they really didn't think Alvarez intended to kill anyone">
    <part label="some jurors in the Metrolink train derailment case last month said they really didn't think Alvarez intended to kill anyone" start="304" end="427"/>
  </contributor>
</peerscu>
```

doc/metrics/autosummeng.md

+2 −1

# AutoSummENG, MeMoG, and NPowER
AutoSummENG [1, 2], MeMoG [1], and NPowER [3] are a family of reference-based evaluation metrics that use n-gram graphs to compare the content of a summary and a set of reference summaries.
Our implementation wraps [our modification](https://github.com/danieldeutsch/AutoSummENG) of the [original code](https://github.com/ggianna/SummaryEvaluation) which allows for evaluating batches of summaries.
All three metrics can be computed with the metric name `autosummeng`.

## Setting Up
Running the AutoSummENG code requires Java 1.8 and Maven to be installed.

doc/metrics/bertscore.md

+1

# BERTScore
BERTScore [1] is a reference-based evaluation metric based on calculating the similarity of two summaries' BERT embeddings.
Our implementation calls the `score` function from [our fork](https://github.com/danieldeutsch/bert_score) of the [original repository](https://github.com/Tiiiger/bert_score), which we modified to expose creating the IDF dictionaries.
The name for this metric is `bertscore`.

## Setting Up
BERTScore can be installed via pip:

doc/metrics/bewte.md

+1

BEwT-E [1] is an extension of the Basic Elements [2].
These metrics compare a summary and reference based on matches between heads of syntactic phrases and dependency tree-based relations.
Our implementation wraps a [mavenized fork](https://github.com/igorbrigadir/ROUGE-BEwTE) of the original code.
The name for this metric is `bewte`.

## Setting Up
Running BEwT-E requires having Git LFS, Java 1.6, and Maven installed.

doc/metrics/meteor.md

+1

# METEOR
METEOR [1] is a reference-based metric that scores a summary based on an alignment to the reference.
Our implementation wraps the released Java library.
The name for this metric is `meteor`.

## Setting Up
METEOR requires Java (not sure which version) to run.

doc/metrics/moverscore.md

+1

# MoverScore
MoverScore [1] is a reference-based evaluation metric using an Earth Mover's Distance between a summary and its reference that uses contextual word representations.
Our implementation uses the `moverscore` [pip package](https://github.com/AIPHES/emnlp19-moverscore).
The name for this metric is `moverscore`.

## Setting Up
To set up MoverScore, pip install the package:

doc/metrics/python-rouge.md

+2

The Python version is significantly faster.
The Python version currently supports ROUGE-N and ROUGE-L.
Although it is near-identical to the Perl version, it should only be used for development and not for official evaluation, for which you should use the original [ROUGE](rouge.md).

The name for this metric is `python-rouge`.

## Setting Up
This metric only requires that ROUGE has been set up (see [here](rouge.md)).

doc/metrics/rouge.md

+1

# ROUGE
ROUGE [1] is a reference-based evaluation metric based on n-gram overlaps between a summary and its reference.
Our implementation wraps the original Perl code.
The name for this metric is `rouge`.

## Setting Up
To set up ROUGE, run the following:

doc/metrics/simetrix.md

+1

# SIMetrix
SIMetrix [1, 2, 3] is a reference-free evaluation metric that compares a summary to the input documents.
Our implementation wraps [this fork](https://github.com/igorbrigadir/simetrix) of the original code.
The name for this metric is `simetrix`.

## Setting Up
Running SIMetrix requires Java 1.7 and Maven to be installed.

doc/metrics/sumqe.md

+2

- [Model trained on DUC 2005 and 2007](https://danieldeutsch.s3.amazonaws.com/sacrerouge/metrics/SumQE/models/multitask_5-duc2005_duc2007.npy)
- [Model trained on DUC 2006 and 2007](https://danieldeutsch.s3.amazonaws.com/sacrerouge/metrics/SumQE/models/multitask_5-duc2006_duc2007.npy)

The name for this metric is `sum-qe`.

## Setting Up
Sum-QE has many Python dependencies.
We recommend referencing the repository's instructions for creating the conda environment.
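
If you want to fetch one of the retrained multi-task models linked above manually, a plain download is enough; the destination directory below is just an assumption, so put the file wherever your Sum-QE setup expects it.

```bash
# Destination path is an assumption; the URL is the DUC 2005/2007 model listed above.
mkdir -p models/sumqe
curl -L -o models/sumqe/multitask_5-duc2005_duc2007.npy \
  https://danieldeutsch.s3.amazonaws.com/sacrerouge/metrics/SumQE/models/multitask_5-duc2005_duc2007.npy
```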
