
Commit d88abf6

Updating documentation (#15)
1 parent c6ba686

24 files changed: +495 −52 lines

Readme.md

+195 −50
Large diffs are not rendered by default.

doc/datasets/chaganty2018.md

+16

# Chaganty 2018
This dataset contains quality judgments for several different summarization systems on the CNN/DailyMail dataset.
The data was published in [The price of debiasing automatic metrics in natural language evaluation](https://www.aclweb.org/anthology/P18-1060.pdf).

```bash
sacrerouge setup-dataset chaganty2018 \
    <output-dir>
```

The output files are the following:
- `documents.jsonl`: The CNN/DailyMail documents
- `summaries.jsonl`: The system summaries
- `metrics.jsonl`: The corresponding manual evaluation metrics for the system summaries
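
Each output file is in JSON Lines format, so a quick way to see what a record looks like is to pretty-print the first line of each file. This is only a sketch; the output directory name below is an assumption.

```bash
# Pretty-print the first record of each output file to see its fields.
# "datasets/chaganty2018" is only a hypothetical output directory.
for f in documents summaries metrics; do
    echo "== ${f}.jsonl =="
    head -n 1 "datasets/chaganty2018/${f}.jsonl" | python -m json.tool
done
```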

## Notes
Instance `006588` appears twice for the `ml+rl` system.

doc/datasets/datasets.md

+14 −1

# Datasets
SacreROUGE provides dataset readers for the following datasets:

- [DUC and TAC](duc-tac/duc-tac.md)
- [MultiLing](multiling/multiling.md)
- [Chaganty 2018](chaganty2018.md)

The readers parse the original data and convert it to a common format for use in SacreROUGE.
Please see the respective documentation for each dataset for more details.

Each of the datasets can be set up via a command such as:
```bash
sacrerouge setup-dataset <dataset-name>
```
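
For example, the Chaganty 2018 dataset only takes an output directory, so a concrete invocation might look like the sketch below (the output path is just an assumption); datasets such as DUC/TAC take additional arguments, as described on their pages.

```bash
# Hypothetical output directory; any writable path works.
sacrerouge setup-dataset chaganty2018 datasets/chaganty2018
```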

doc/datasets/duc-tac/duc-tac.md

+18

# DUC/TAC
The DUC (Document Understanding Conference) and TAC (Text Analysis Conference) provided many single- and multi-document summarization datasets with human judgments from 2001 to 2011.
Due to license restrictions, we cannot release any of this data.
However, if you have the usernames and passwords required to download the data, we have provided [this repository](https://github.com/danieldeutsch/duc-tac-data) which you can use to set up the data in the format required by SacreROUGE.
A sketch of setting up several years at once is shown after the list below.

For details related to each year, please see the corresponding documentation:

- [DUC 2001](duc2001.md)
- [DUC 2002](duc2002.md)
- [DUC 2003](duc2003.md)
- [DUC 2004](duc2004.md)
- [DUC 2005](duc2005.md)
- [DUC 2006](duc2006.md)
- [DUC 2007](duc2007.md)
- [TAC 2008](tac2008.md)
- [TAC 2009](tac2009.md)
- [TAC 2010](tac2010.md)
- [TAC 2011](tac2011.md)
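
Assuming the raw data has already been downloaded with the repository above, the per-year pages all use the same `setup-dataset` pattern, so most years can be set up in one loop. This is only a sketch with assumed paths; TAC 2011 is left out because it additionally requires a path to English Gigaword (see its page).

```bash
# Sketch only: both paths are assumptions.
RAW=/path/to/duc-tac-data   # root of the downloaded DUC/TAC data repository
OUT=datasets                # hypothetical output directory

for year in duc2001 duc2002 duc2003 duc2004 duc2005 duc2006 duc2007 \
            tac2008 tac2009 tac2010; do
    sacrerouge setup-dataset "${year}" "${RAW}" "${OUT}/${year}"
done
```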

doc/datasets/duc-tac/duc2001.md

+20

# DUC 2001
[Homepage](https://www-nlpir.nist.gov/projects/duc/guidelines/2001.html)

For DUC 2001, we provide dataset readers for task 1 (single-document summarization) and task 2 (multi-document summarization).
```bash
sacrerouge setup-dataset duc2001 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.train.jsonl`: The training data for task 1
- `task1.test.jsonl`: The test data for task 1
- `task2.train.X.jsonl`: The training data for task 2 with target summaries of length `X` for `X` in `[50, 100, 200, 400]` (concrete filenames are sketched after this list)
- `task2.test.X.jsonl`: The same as the above but for testing
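
As a concrete illustration of the `X` placeholder, the task 2 files expand to one file per target length, so after setup you should see something like the following, where `<output-dir>` is the directory passed above.

```bash
# One task 2 file per target summary length (50, 100, 200, 400).
ls <output-dir>/task2.train.{50,100,200,400}.jsonl \
   <output-dir>/task2.test.{50,100,200,400}.jsonl
```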

## Notes
The input documents for DUC 2001 did not always have a standard schema, so parsing them was quite difficult.
Therefore, there may be noise in the documents.

doc/datasets/duc-tac/duc2002.md

+22

# DUC 2002
[Homepage](https://www-nlpir.nist.gov/projects/duc/guidelines/2002.html)

For DUC 2002, we provide dataset readers for task 1 (single-document summarization) and task 2 (multi-document summarization).
```bash
sacrerouge setup-dataset duc2002 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.jsonl`: The data for task 1
- `task1.summaries.jsonl`: The submitted peer and reference summaries for task 1
- `task1.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 1
- `task2.X.jsonl`: The data for task 2 for the summary target length `X`
- `task2.X.summaries.jsonl`: The submitted peer and reference summaries for task 2, length `X`
- `task2.X.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 2, length `X`
- `task2.Xe.jsonl`: The extractive summarization data for task 2, length `X`

## Notes
Not all of the human judgments were loaded; the `multijudge.short.results.table` results are not included.

doc/datasets/duc-tac/duc2003.md

+15

# DUC 2003
[Homepage](https://duc.nist.gov/duc2003/tasks.html)

For DUC 2003, we provide dataset readers for all 4 tasks.
```bash
sacrerouge setup-dataset duc2003 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `taskX.jsonl`: The data for task `X`
- `taskX.summaries.jsonl`: The submitted peer and reference summaries for task `X`
- `taskX.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task `X`

doc/datasets/duc-tac/duc2004.md

+16

# DUC 2004
[Homepage](https://duc.nist.gov/duc2004/)

For DUC 2004, we provide dataset readers for tasks 1, 2, and 5.
```bash
sacrerouge setup-dataset duc2004 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.jsonl`: The data for task 1
- `taskX.jsonl`: The data for task `X` in `[2, 5]`
- `taskX.summaries.jsonl`: The submitted peer and reference summaries for task `X` in `[2, 5]`
- `taskX.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task `X` in `[2, 5]`

doc/datasets/duc-tac/duc2005.md

+20

# DUC 2005
[Homepage](https://duc.nist.gov/duc2005/)

For DUC 2005, we provide dataset readers for the single task.
```bash
sacrerouge setup-dataset duc2005 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.jsonl`: The data for the task
- `task1.summaries.jsonl`: The submitted peer and reference summaries for the task
- `task1.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for the task

## Notes
The reference summaries in `ROUGE/extras` are not loaded in `summaries` because they weren't used in evaluation, even though they do have some judgments (for example, linguistic quality).

Some of the Pyramid scores were done twice. We only take the last score.

doc/datasets/duc-tac/duc2006.md

+18

# DUC 2006
[Homepage](https://duc.nist.gov/duc2006/tasks.html)

For DUC 2006, we provide dataset readers for the single task.
```bash
sacrerouge setup-dataset duc2006 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.jsonl`: The data for the task
- `task1.summaries.jsonl`: The submitted peer and reference summaries for the task
- `task1.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for the task
- `task1.pyramids.jsonl`: The Pyramids for the set of references for task 1
- `task1.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary
- `task1.d0631.pyramid-annotations.jsonl`: The Pyramid annotations for instance `d0631`, which were done several times

doc/datasets/duc-tac/duc2007.md

+22

# DUC 2007
[Homepage](https://duc.nist.gov/duc2007/tasks.html)

For DUC 2007, we provide dataset readers for both tasks 1 and 2.
```bash
sacrerouge setup-dataset duc2007 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.jsonl`: The data for task 1
- `task1.summaries.jsonl`: The submitted peer and reference summaries for task 1
- `task1.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 1
- `task1.pyramids.jsonl`: The Pyramids for the set of references for task 1
- `task1.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary for task 1
- `task2.X.jsonl`: The data for task 2 for document set `X`, where `X` is `A`, `B`, `C`, or `A-B-C` (all three combined); concrete filenames are sketched after this list
- `task2.X.summaries.jsonl`: The submitted peer and reference summaries for task 2
- `task2.X.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 2
- `task2.X.pyramids.jsonl`: The Pyramids for the set of references for task 2
- `task2.X.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary for task 2
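
To make the `X` naming concrete, each task 2 file type appears once per document-set value, so the top-level data files should look like the following, with `<output-dir>` being the directory passed above.

```bash
# One task 2 data file per document-set value of X described above.
ls <output-dir>/task2.{A,B,C,A-B-C}.jsonl
```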

doc/datasets/duc-tac/tac2008.md

+17

# TAC 2008
[Homepage](https://tac.nist.gov/2008/summarization/)

For TAC 2008, we provide dataset readers for task 1.
```bash
sacrerouge setup-dataset tac2008 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.X.jsonl`: The data for task 1 for document set `X`, where `X` is `A`, `B`, or `A-B` (both combined)
- `task1.X.summaries.jsonl`: The submitted peer and reference summaries for task 1
- `task1.X.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 1
- `task1.X.pyramids.jsonl`: The Pyramids for the set of references for task 1
- `task1.X.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary for task 1

doc/datasets/duc-tac/tac2009.md

+21

# TAC 2009
[Homepage](https://tac.nist.gov/2009/Summarization/)

For TAC 2009, we provide dataset readers for task 1 and the submitted AESOP values.
```bash
sacrerouge setup-dataset tac2009 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.X.jsonl`: The data for task 1 for document set `X`, where `X` is `A`, `B`, or `A-B` (both combined)
- `task1.X.summaries.jsonl`: The submitted peer and reference summaries for task 1
- `task1.X.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 1
- `task1.X.pyramids.jsonl`: The Pyramids for the set of references for task 1
- `task1.X.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary for task 1

## Notes
Our correlation tests do not match the original for the LCS version of ROUGE because NIST ran ROUGE on non-sentence-tokenized summaries.
We run sentence tokenization, which ends up having a large effect on the LCS scores.

doc/datasets/duc-tac/tac2010.md

+17

# TAC 2010
[Homepage](https://tac.nist.gov/2010/Summarization/)

For TAC 2010, we provide dataset readers for task 1 and the submitted AESOP values.
```bash
sacrerouge setup-dataset tac2010 \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.

The output files are the following:
- `task1.X.jsonl`: The data for task 1 for document set `X`, where `X` is `A`, `B`, or `A-B` (both combined)
- `task1.X.summaries.jsonl`: The submitted peer and reference summaries for task 1
- `task1.X.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 1
- `task1.X.pyramids.jsonl`: The Pyramids for the set of references for task 1
- `task1.X.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary for task 1

doc/datasets/duc-tac/tac2011.md

+52

# TAC 2011
[Homepage](https://tac.nist.gov/2011/Summarization/)

For TAC 2011, we provide dataset readers for task 1 and the submitted AESOP values.
```bash
sacrerouge setup-dataset tac2011 \
    <path-to-gigaword-root> \
    <path-to-raw-data> \
    <output-dir>
```
The `<path-to-gigaword-root>` is the path to the root of `LDC2011T07/gigaword_eng_5`.
The `<path-to-raw-data>` is the path to the root of the [DUC/TAC data repository](https://github.com/danieldeutsch/duc-tac-data) with the data already downloaded.
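
A concrete invocation might look like the sketch below; every path here is an assumption and should be replaced with your own locations.

```bash
# All paths are assumptions: point them at your Gigaword copy, the downloaded
# DUC/TAC data repository, and the desired output directory.
sacrerouge setup-dataset tac2011 \
    /data/LDC2011T07/gigaword_eng_5 \
    /data/duc-tac-data \
    datasets/tac2011
```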

The output files are the following:
- `task1.X.jsonl`: The data for task 1 for document set `X`, where `X` is `A`, `B`, or `A-B` (both combined)
- `task1.X.summaries.jsonl`: The submitted peer and reference summaries for task 1
- `task1.X.metrics.jsonl`: The corresponding automatic and manual evaluation metrics for the peer and reference summaries for task 1
- `task1.X.pyramids.jsonl`: The Pyramids for the set of references for task 1
- `task1.X.pyramid-annotations.jsonl`: The Pyramid annotations for each submitted peer and reference summary for task 1

## Notes
It appears that the Pyramid annotations were exhaustive (identifying SCUs which are not present in the reference Pyramids).
Those extra SCUs are not loaded here.

There are Pyramids for the combined A-B summaries, which we do not load.

The Pyramid annotations have incorrect SCU IDs, so they should be used with caution.
Here is an example:
```xml
<!-- Pyramid for D1112-B -->
<scu uid="7" label="Jury did not believe Alvarez planned to hurt anyone (NONE)">
  <contributor label="The jury foreman said at a news conference, after the trial...he did not believe Alvarez planned to kill anyone">
    <part label="he did not believe Alvarez planned to kill anyone" start="323" end="372"/>
    <part label="The jury foreman said at a news conference, after the trial" start="263" end="322"/>
  </contributor>
  <contributor label="the jury...believed he didn't intend to kill anyone">
    <part label="the jury" start="642" end="650"/>
    <part label="believed he didn't intend to kill anyone" start="713" end="753"/>
  </contributor>
  <contributor label="Jurors...didn't believe he meant to hurt anyone">
    <part label="didn't believe he meant to hurt anyone" start="1288" end="1326"/>
    <part label="Jurors" start="1228" end="1234"/>
  </contributor>
</scu>

<!-- # Annotation for system 22 -->
<peerscu uid="41" label="(3) Jury did not believe Alvarez planned to hurt anyone (NONE)">
  <contributor label="some jurors in the Metrolink train derailment case last month said they really didn't think Alvarez intended to kill anyone">
    <part label="some jurors in the Metrolink train derailment case last month said they really didn't think Alvarez intended to kill anyone" start="304" end="427"/>
  </contributor>
</peerscu>
```

doc/metrics/autosummeng.md

+2 −1

# AutoSummENG, MeMoG, and NPowER
AutoSummENG [1, 2], MeMoG [1], and NPowER [3] are a family of reference-based evaluation metrics that use n-gram graphs to compare the content of a summary and a set of reference summaries.
Our implementation wraps [our modification](https://github.com/danieldeutsch/AutoSummENG) of the [original code](https://github.com/ggianna/SummaryEvaluation) which allows for evaluating batches of summaries.
All three metrics can be computed with the metric name `autosummeng`.

## Setting Up
Running the AutoSummENG code requires Java 1.8 and Maven to be installed.

doc/metrics/bertscore.md

+1

# BERTScore
BERTScore [1] is a reference-based evaluation metric based on calculating the similarity of two summaries' BERT embeddings.
Our implementation calls the `score` function from [our fork](https://github.com/danieldeutsch/bert_score) of the [original repository](https://github.com/Tiiiger/bert_score), which we modified to expose creating the IDF dictionaries.
The name for this metric is `bertscore`.

## Setting Up
BERTScore can be installed via pip:

doc/metrics/bewte.md

+1

BEwT-E [1] is an extension of the Basic Elements [2].
These metrics compare a summary and reference based on matches between heads of syntactic phrases and dependency tree-based relations.
Our implementation wraps a [mavenized fork](https://github.com/igorbrigadir/ROUGE-BEwTE) of the original code.
The name for this metric is `bewte`.

## Setting Up
Running BEwT-E requires having Git LFS, Java 1.6, and Maven installed.

doc/metrics/meteor.md

+1

# METEOR
METEOR [1] is a reference-based metric that scores a summary based on an alignment to the reference.
Our implementation wraps the released Java library.
The name for this metric is `meteor`.

## Setting Up
METEOR requires Java (not sure which version) to run.

doc/metrics/moverscore.md

+1

# MoverScore
MoverScore [1] is a reference-based evaluation metric using an Earth Mover's Distance between a summary and its reference that uses contextual word representations.
Our implementation uses the `moverscore` [pip package](https://github.com/AIPHES/emnlp19-moverscore).
The name for this metric is `moverscore`.

## Setting Up
To set up MoverScore, pip install the package:

doc/metrics/python-rouge.md

+2

The Python version is significantly faster.
The Python version currently supports ROUGE-N and ROUGE-L.
Although it is near-identical to the Perl version, it should only be used for development and not for official evaluation, for which you should use the original [ROUGE](rouge.md).

The name for this metric is `python-rouge`.

## Setting Up
This metric only requires that ROUGE has been set up (see [here](rouge.md)).

doc/metrics/rouge.md

+1

# ROUGE
ROUGE [1] is a reference-based evaluation metric based on n-gram overlaps between a summary and its reference.
Our implementation wraps the original Perl code.
The name for this metric is `rouge`.

## Setting Up
To set up ROUGE, run the following:

doc/metrics/simetrix.md

+1

# SIMetrix
SIMetrix [1, 2, 3] is a reference-free evaluation metric that compares a summary to the input documents.
Our implementation wraps [this fork](https://github.com/igorbrigadir/simetrix) of the original code.
The name for this metric is `simetrix`.

## Setting Up
Running SIMetrix requires Java 1.7 and Maven to be installed.

doc/metrics/sumqe.md

+2

- [Model trained on DUC 2005 and 2007](https://danieldeutsch.s3.amazonaws.com/sacrerouge/metrics/SumQE/models/multitask_5-duc2005_duc2007.npy)
- [Model trained on DUC 2006 and 2007](https://danieldeutsch.s3.amazonaws.com/sacrerouge/metrics/SumQE/models/multitask_5-duc2006_duc2007.npy)

The name for this metric is `sum-qe`.

## Setting Up
Sum-QE has many Python dependencies.
We recommend referencing the repository's instructions for creating the conda environment.
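
If you want to fetch one of the retrained multi-task models linked above manually, a plain download is enough; the destination directory below is just an assumption, so put the file wherever your Sum-QE setup expects it.

```bash
# Destination path is an assumption; the URL is the DUC 2005/2007 model listed above.
mkdir -p models/sumqe
curl -L -o models/sumqe/multitask_5-duc2005_duc2007.npy \
  https://danieldeutsch.s3.amazonaws.com/sacrerouge/metrics/SumQE/models/multitask_5-duc2005_duc2007.npy
```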
