Commit 536c7c7

Merge branch 'master' of https://github.com/Torchee/ConvoKit
2 parents f463985 + 9484164 commit 536c7c7

18 files changed: 3720 additions and 13 deletions

.github/workflows/build-docs.yml

Lines changed: 3 additions & 0 deletions
@@ -10,6 +10,9 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v3
+      - name: Install Dependencies
+        run: |
+          pip install sphinx sphinx-rtd-theme m2r2
       - name: Sphinx Build
         uses: ammaraskar/sphinx-action@master
         with:

.github/workflows/continuous-integration.yml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: [3.7, 3.8, 3.9, '3.10']
+        python-version: [3.9, '3.10', '3.11', '3.12']
         mongodb-version: [5.0.2]
 
     steps:

README.md

Lines changed: 21 additions & 3 deletions
@@ -4,13 +4,13 @@
 <!-- ALL-CONTRIBUTORS-BADGE:END -->
 
 [![pypi](https://img.shields.io/pypi/v/convokit.svg)](https://pypi.org/pypi/convokit/)
-[![py_versions](https://img.shields.io/badge/python-3.8%2B-blue)](https://pypi.org/pypi/convokit/)
+[![py_versions](https://img.shields.io/badge/python-3.9%2B-blue)](https://pypi.org/pypi/convokit/)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![license](https://img.shields.io/badge/license-MIT-green)](https://github.com/CornellNLP/ConvoKit/blob/master/LICENSE.md)
 [![Discord Community](https://img.shields.io/static/v1?logo=discord&style=flat&color=red&label=discord&message=community)](https://discord.gg/WMFqMWgz6P)
 
 
-This toolkit contains tools to extract conversational features and analyze social phenomena in conversations, using a [single unified interface](https://convokit.cornell.edu/documentation/architecture.html) inspired by (and compatible with) scikit-learn. Several large [conversational datasets](https://github.com/CornellNLP/ConvoKit#datasets) are included together with scripts exemplifying the use of the toolkit on these datasets. The latest version is [3.0.0](https://github.com/CornellNLP/ConvoKit/releases/tag/v3.0.0) (released July 17, 2023); follow the [project on GitHub](https://github.com/CornellNLP/ConvoKit) to keep track of updates.
+This toolkit contains tools to extract conversational features and analyze social phenomena in conversations, using a [single unified interface](https://convokit.cornell.edu/documentation/architecture.html) inspired by (and compatible with) scikit-learn. Several large [conversational datasets](https://github.com/CornellNLP/ConvoKit#datasets) are included together with scripts exemplifying the use of the toolkit on these datasets. The latest version is [3.0.1](https://github.com/CornellNLP/ConvoKit/releases/tag/v3.0.1) (released November 13, 2024); follow the [project on GitHub](https://github.com/CornellNLP/ConvoKit) to keep track of updates.
 
 Read our [documentation](https://convokit.cornell.edu/documentation) or try ConvoKit in our [interactive tutorial](https://colab.research.google.com/github/CornellNLP/ConvoKit/blob/master/examples/Introduction_to_ConvoKit.ipynb).
 
@@ -137,6 +137,24 @@ A collection of all the conversations that occurred over 10 seasons of Friends,
 
 Name for download: `friends-corpus`
 
+### [Federal Open Market Committee (FOMC) Corpus](https://convokit.cornell.edu/documentation/fomc.html)
+
+Transcripts of recurring meetings of the Federal Reserve’s Open Market Committee (FOMC), where important aspects of U.S. monetary policy are decided, covering the period 1977-2008.
+
+Name for download: `fomc-corpus`
+
+### [NPR Interview 2P Dataset Corpus](https://convokit.cornell.edu/documentation/npr-2p.html)
+
+This corpus contains conversations between NPR show hosts and their guests.
+
+Name for download: `npr-2p-corpus`
+
+### [DeliData Dataset Corpus](https://convokit.cornell.edu/documentation/deli.html)
+
+This corpus contains conversations in multi-party problem-solving contexts, containing information about group discussions and team performance.
+
+Name for download: `deli-corpus`
+
 ### [Switchboard Dialog Act Corpus](https://convokit.cornell.edu/documentation/switchboard.html)
 
 A collection of 1,155 five-minute telephone conversations between two participants, annotated with speech act tags.
@@ -180,7 +198,7 @@ Name for download: `spolin-corpus`
 In addition to the provided datasets, you may also use ConvoKit with your own custom datasets by loading them into a `convokit.Corpus` object. [This example script](https://github.com/CornellNLP/ConvoKit/blob/master/examples/converting_movie_corpus.ipynb) shows how to construct a Corpus from custom data.
 
 ## Installation
-This toolkit requires Python >= 3.8.
+This toolkit requires Python >= 3.9.
 
 1. Download the toolkit: `pip3 install convokit`
 2. Download Spacy's English model: `python3 -m spacy download en`
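
Reviewer note: the README and the CI matrix now agree on a Python 3.9 floor. A minimal sketch of how a script could check that floor before installing; `MIN_PYTHON` and `meets_minimum` are hypothetical names used only for illustration, and only the 3.9 value comes from this commit.

```python
import sys

# Minimum interpreter version stated in the updated README (assumed floor).
MIN_PYTHON = (3, 9)


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) interpreter satisfies the minimum."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); tuple comparison is lexicographic.
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum():
    print("ConvoKit requires Python >= %d.%d" % MIN_PYTHON)
```

Passing an explicit tuple, for example `meets_minimum((3, 8))`, makes the check easy to exercise without switching interpreters.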

docs/source/conf.py

Lines changed: 7 additions & 3 deletions
@@ -19,6 +19,7 @@
 #
 import os
 import sys
+import sphinx_rtd_theme
 
 _HERE = os.path.dirname(__file__)
 _DOCS_DIR = os.path.abspath(os.path.join(_HERE, ".."))
@@ -55,7 +56,7 @@
 
 # General information about the project.
 project = "convokit"
-copyright = "2017-2023 The ConvoKit Developers"
+copyright = "2017-2024 The ConvoKit Developers"
 author = "The ConvoKit Developers"
 
 # The version info for the project you're documenting, acts as replacement for
@@ -65,7 +66,7 @@
 # The short X.Y version.
 version = "3.0"
 # The full version, including alpha/beta/rc tags.
-release = "3.0.0"
+release = "3.0.1"
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.
@@ -126,6 +127,9 @@
 # a list of builtin themes.
 #
 html_theme = "sphinx_rtd_theme"
+
+# Add theme path explicitly
+html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
 # html_context = {"css_files": ["_static/overrides.css"]}
 # Theme options are theme-specific and customize the look and feel of a theme
 # further. For a list of options available for each theme, see the
@@ -159,7 +163,7 @@
 # Add any paths that contain custom static files (such as style sheets) here,
 # relative to this directory. They are copied after the builtin static files,
 # so a file named "default.css" will overwrite the builtin "default.css".
-html_static_path = ["static"]
+html_static_path = ["_static"]
 
 # Add any extra paths that contain custom files (such as robots.txt or
 # .htaccess) here, relative to this directory. These files are copied
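
Reviewer note: `conf.py` now imports `sphinx_rtd_theme` at module level and calls `get_html_theme_path()` unconditionally, so the docs build depends on that helper being present. A hedged sketch of a more defensive variant follows; `resolve_theme_path` is a hypothetical helper, and the possibility that the function is missing in some theme releases is an assumption, not something this commit states.

```python
import importlib


def resolve_theme_path(module_name="sphinx_rtd_theme"):
    """Return a list usable as html_theme_path, or [] if nothing can be resolved."""
    try:
        theme = importlib.import_module(module_name)
    except ImportError:
        # Theme (or Sphinx itself) is not installed in this environment.
        return []
    # Assumption: the helper may be absent in some theme releases, so it is
    # looked up defensively instead of being called unconditionally.
    getter = getattr(theme, "get_html_theme_path", None)
    if getter is None:
        return []
    return [getter()]


html_theme = "sphinx_rtd_theme"
html_theme_path = resolve_theme_path()
```

With an empty list, Sphinx falls back to locating the theme by name, which is generally enough when the theme package is installed alongside Sphinx.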

docs/source/datasets.rst

Lines changed: 4 additions & 0 deletions
@@ -27,3 +27,7 @@ Datasets
    Supreme Court Oral Arguments Dataset <supreme.rst>
    Wikipedia Articles for Deletion Dataset <wiki-articles-for-deletion-corpus.rst>
    CaSiNo Corpus <casino-corpus.rst>
+   NPR Interviews 2P Corpus <npr-2p.rst>
+   Federal Open Market Committee Corpus <fomc.rst>
+   FORA Corpus <fora.rst>
+   DeliData Corpus <deli.rst>

docs/source/deli.rst

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+DeliData Corpus
+===============
+
+DeliData is a dataset designed for analyzing deliberation in multi-party problem-solving contexts. It contains information about group discussions, capturing various aspects of participant interactions, message annotations, and team performance.
+
+The corpus is available upon request from the authors, and a ConvoKit-compatible version can be derived using ConvoKit’s conversion tools. ConvoKit also hosts a ConvoKit-format version of the corpus, which can be downloaded directly by following the instructions in the Usage section.
+
+For a full description of the dataset collection and potential applications, please refer to the original publication: `Karadzhov, G., Stafford, T., & Vlachos, A. (2023). DeliData: A dataset for deliberation in multi-party problem solving. Proceedings of the ACM on Human-Computer Interaction, 7(CSCW2), 1-25.`
+
+Dataset details
+---------------
+
+All ConvoKit metadata attributes retain the original names used in the dataset.
+
+Speaker-level information
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Metadata for each speaker includes the following fields:
+
+* speaker: Identifier or pseudonym of the speaker.
+
+Utterance-level information
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Each utterance includes:
+
+* id: Unique identifier for an utterance.
+* conversation_id: Identifier for the conversation that the utterance belongs to.
+* reply_to: Identifier for the previous utterance in the conversation, if any (null if not a reply).
+* speaker: Name or pseudonym of the utterance speaker.
+* text: Normalized textual content of the utterance with applied tokenization and masked special tokens.
+* timestamp: Null for the entirety of this corpus.
+
+Metadata for each utterance includes:
+
+* annotation_type: Type of utterance deliberation, if annotated (e.g., "Probing" or "Non-probing deliberation"). May be null if not annotated.
+* annotation_target: Target annotation, indicating the intended focus of the message, such as "Moderation" or "Solution". May be null if not annotated.
+* annotation_additional: Any additional annotations indicating specific deliberative actions (e.g., "complete_solution"). May be null if not annotated.
+* message_type: Type of message, categorized as INITIAL, SUBMIT, or MESSAGE, indicating its function in the dialogue.
+* original_text: Original text as said in the collected conversation. For INITIAL messages, contains the list of participants and the cards presented. For SUBMIT messages, contains the cards submitted.
+
+Conversation-level information
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For each conversation we provide:
+
+* id: Identifier of the conversation.
+
+Metadata for each conversation includes:
+
+* team_performance: Approximate performance of the team based on user submissions and solution mentions, ranging from 0 to 1, where 1 indicates all participants selected the correct solution.
+* sol_tracker_message: Extracted solution from the current message content.
+* sol_tracker_all: Up-to-date "state-of-mind" for each of the participants, i.e. an approximation of what each participant thinks the correct solution is at a given timestep. This is based on initial solutions, submitted solutions, and solution mentions. The team_performance value is calculated from this column.
+* performance_change: Change in team performance relative to the previous utterance.
+
+Usage
+-----
+
+Convert the DeliData Corpus into ConvoKit format using the following notebook: `Converting DeliData to ConvoKit Format <https://github.com/CornellNLP/ConvoKit/blob/master/examples/dataset-examples/DELI/ConvoKit_DeliData_Conversion.ipynb>`_
+
+To download directly with ConvoKit:
+
+>>> from convokit import Corpus, download
+>>> corpus = Corpus(filename=download("deli-corpus"))
+
+
+For some quick stats:
+
+>>> corpus.print_summary_stats()
+
+* Number of Speakers: 30
+* Number of Utterances: 17111
+* Number of Conversations: 500
+
+Additional note
+---------------
+Data License
+^^^^^^^^^^^^
+
+ConvoKit is not distributing the corpus separately, and thus no additional data license is applicable. The license of the original distribution applies.
+
+Contact
+^^^^^^^
+
+Questions regarding the DeliData corpus should be directed to Georgi Karadzhov ([email protected]).
+
+Files
+^^^^^^^
+
+Request the Official Released DeliData Corpus without ConvoKit formatting: https://delibot.xyz/delidata
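
Reviewer note: the metadata fields documented in `deli.rst` lend themselves to simple aggregate queries. The sketch below is illustrative only: `count_annotation_types` and `mean_team_performance` are hypothetical helpers, and the stand-in objects are made-up placeholders, not corpus data. It assumes each utterance and conversation exposes a dict-like `meta` holding the documented field names.

```python
from collections import Counter
from types import SimpleNamespace


def count_annotation_types(utterances):
    """Count annotation_type labels over MESSAGE utterances, skipping nulls."""
    counts = Counter()
    for utt in utterances:
        # INITIAL and SUBMIT entries carry cards, not deliberation text.
        if utt.meta.get("message_type") != "MESSAGE":
            continue
        label = utt.meta.get("annotation_type")
        if label is not None:
            counts[label] += 1
    return counts


def mean_team_performance(conversations):
    """Average team_performance over conversations that report a value."""
    scores = [c.meta.get("team_performance") for c in conversations]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None


# Stand-in objects for illustration only (not real DeliData records).
demo_utts = [
    SimpleNamespace(meta={"message_type": "INITIAL", "annotation_type": None}),
    SimpleNamespace(meta={"message_type": "MESSAGE", "annotation_type": "Probing"}),
    SimpleNamespace(meta={"message_type": "MESSAGE", "annotation_type": None}),
    SimpleNamespace(meta={"message_type": "MESSAGE", "annotation_type": "Probing"}),
]
demo_convos = [
    SimpleNamespace(meta={"team_performance": 0.5}),
    SimpleNamespace(meta={"team_performance": 1.0}),
]

print(count_annotation_types(demo_utts))
print(mean_team_performance(demo_convos))
```

With a downloaded corpus, the same helpers would presumably be called as `count_annotation_types(corpus.iter_utterances())` and `mean_team_performance(corpus.iter_conversations())`.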

docs/source/fomc.rst

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+Federal Open Market Committee (FOMC) Corpus
+===========================================
+
+Transcripts of recurring meetings of the Federal Reserve’s Open Market Committee (FOMC), where important aspects of U.S. monetary policy are decided, covering the period 1977-2008 (108,504 conversational exchanges among 364 speakers (FOMC board members) in 268 meetings).
+
+Distributed together with:
+`Talk it up or play it down? (Un)expected correlations between (de-)emphasis and recurrence of discussion points in consequential U.S. economic policy meetings <https://chenhaot.com/papers/de-emphasis-fomc.html>`_. Chenhao Tan and Lillian Lee. Presented at Text As Data 2016.
+
+Dataset details
+---------------
+
+Speaker-level information
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Speakers in this dataset are FOMC members, indexed by their name as recorded in the transcripts.
+* id: name of the speaker
+* chair: (boolean) whether the speaker is the FOMC Chair
+* vice_chair: (boolean) whether the speaker is the FOMC Vice-Chair
+
+Utterance-level information
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For each utterance, we provide:
+* id: index of the utterance (concatenating the meeting date with the utterance’s sequence position)
+* speaker: the speaker who authored the utterance
+* conversation_id: ID of the meeting
+* reply_to: id of the sequentially prior utterance (None for the first utterance of a meeting)
+* text: textual content of the utterance
+* timestamp: value calculated from the date of the meeting and the speech index
+
+Metadata for each utterance includes:
+* speech_index: index of the utterance within the conversation
+* parsed: parsed version of the utterance text, represented as a SpaCy Doc
+
+Conversation-level information
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Conversations are indexed by a string representing the meeting date.
+
+Usage
+-----------
+
+Convert the FOMC Corpus into ConvoKit format using this notebook `Converting FOMC Corpus to ConvoKit Format <https://github.com/CornellNLP/ConvoKit/blob/master/examples/dataset-examples/FOMC/fomc_to_convokit.ipynb>`_
+
+To download directly with ConvoKit:
+
+>>> from convokit import Corpus, download
+>>> corpus = Corpus(filename=download("fomc-corpus"))
+
+
+For some quick stats:
+
+>>> corpus.print_summary_stats()
+Number of Speakers: 364
+Number of Utterances: 108504
+Number of Conversations: 268
+
+
+Additional note
+---------------
+
+The original dataset can be downloaded `here <https://chenhaot.com/pages/de-emphasis-fomc.html>`_. Refer to the original README for more explanations on dataset construction.
+
+Contact
+^^^^^^^
+
+Please email any questions to: [email protected] (Cristian Danescu-Niculescu-Mizil).
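
Reviewer note: the speaker-level `chair` flag documented in `fomc.rst` makes it straightforward to ask how much of a meeting the Chair accounts for. A minimal sketch, assuming the flag lives in each speaker's dict-like `meta`; `chair_utterance_share` is a hypothetical helper and the demo objects are placeholders, not transcript data.

```python
from types import SimpleNamespace


def chair_utterance_share(utterances):
    """Return the fraction of utterances whose speaker is flagged as Chair."""
    total = 0
    by_chair = 0
    for utt in utterances:
        total += 1
        # A missing or null flag is treated as "not the Chair".
        if utt.speaker.meta.get("chair"):
            by_chair += 1
    return by_chair / total if total else 0.0


# Stand-in speakers and utterances for illustration only.
chair = SimpleNamespace(meta={"chair": True, "vice_chair": False})
member = SimpleNamespace(meta={"chair": False, "vice_chair": False})
demo_utts = [
    SimpleNamespace(speaker=chair),
    SimpleNamespace(speaker=member),
    SimpleNamespace(speaker=member),
    SimpleNamespace(speaker=member),
]

print(chair_utterance_share(demo_utts))
```

On a downloaded corpus the helper would presumably be applied per meeting, for example to the utterances yielded by a conversation's `iter_utterances()`.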
