Automated Identification of Competing Narratives on Social Media


Narrative Extraction Pipeline

About The Project

This repository contains the code and data for the paper "Automated Identification of Competing Narratives in Political Discourse on Social Media", published at the Text2Story 2025 workshop, held in conjunction with ECIR 2025.

The project is organized as follows:

  • The dataset/ folder holds the data to be analyzed.
  • app.py and app_pages/ contain the code for the web application, built with Streamlit.
  • The numbered scripts preprocess the data.

The dataset is expected to be in JSONL format, with one post per line. Each post should contain the following fields:

  • date: the date of the post
  • text: the text of the post
  • user_id: unique identifier for the author of the post
  • translation: the translation of the post (optional)
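
As an illustration of the expected layout, the following sketch writes two posts to a gzip-compressed JSONL file and reads them back, checking that the required fields are present. The helper names (write_posts, read_posts), the sample posts, and the ISO 8601 date strings are assumptions made for this example; the README does not prescribe a date format, and this is not the loader used by the numbered scripts.

```python
import gzip
import json
import os
import tempfile

# Fields listed above as mandatory; "translation" is optional.
REQUIRED_FIELDS = ("date", "text", "user_id")


def write_posts(path, posts):
    """Write one JSON object per line to a gzip-compressed JSONL file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for post in posts:
            fh.write(json.dumps(post, ensure_ascii=False) + "\n")


def read_posts(path):
    """Yield posts from a gzip-compressed JSONL file, checking required fields."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            post = json.loads(line)
            missing = [name for name in REQUIRED_FIELDS if name not in post]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            yield post


# Hypothetical sample data, for illustration only.
sample_posts = [
    {
        "date": "2021-03-01T12:00:00Z",  # date format is an assumption
        "text": "Die Impfung ist sicher.",
        "user_id": "u1",
        "translation": "The vaccination is safe.",
    },
    {"date": "2021-03-02T08:30:00Z", "text": "Stay home, stay safe.", "user_id": "u2"},
]

sample_path = os.path.join(tempfile.mkdtemp(), "dataset.jsonl.gz")
write_posts(sample_path, sample_posts)
loaded_posts = list(read_posts(sample_path))
```

A file produced this way could be placed under dataset/, for example at dataset/twitter-covid/dataset.jsonl.gz.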

Getting Started

Prerequisites

Installation

  1. Clone the repo
  2. Add your dataset to the dataset/ folder, for example dataset/twitter-covid/dataset.jsonl.gz
    • The data should be in JSONL format.
    • Each line should contain one post.
  3. Add your configuration to .env. See .env.sample for a template.
  4. Install the dependencies: pipenv install

Configuration Reference

Configuration is read from the .env file. The following variables are available:

  • DATASET: Path inside the dataset/ folder to the dataset.
    • Example: twitter-covid/dataset.jsonl.gz
  • TEXT_ATTR: Name of the field in the dataset that contains the text of the post.
    • Example: text
  • TEXT_TRANSLATION_ATTR: Name of the field in the dataset that contains the translation of the post. (optional)
    • Example: translation
  • USER_ATTR: Name of the field in the dataset that contains the unique identifier of the author.
    • Example: user_id
  • EMBEDDING_MODEL: Name of the sentence embedding model to use.
    • Example: paraphrase-multilingual-MiniLM-L12-v2
  • EMBED_TRANSLATION: Whether to use the translation for the embeddings. (optional)
    • Values: 0 or 1
  • OPENAI_URL: URL for an OpenAI compatible API. (optional)
  • OPENAI_API_KEY: API key for the OpenAI API. (optional)
  • OPENAI_MODEL: Name of the LLM to use. (optional)
    • Example: phi4:14b
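
To make the roles of these variables concrete, here is a minimal sketch of how they could be collected into a settings dictionary. The function load_settings and the dictionary keys are hypothetical, and treating an unset EMBED_TRANSLATION as 0 is an assumption; the repository's scripts may read the configuration differently.

```python
import os
from pathlib import Path


def load_settings(env=None):
    """Collect the documented variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {
        # DATASET is a path relative to the dataset/ folder.
        "dataset_path": Path("dataset") / env["DATASET"],
        "text_attr": env["TEXT_ATTR"],
        # Optional: None when no translation field is configured.
        "translation_attr": env.get("TEXT_TRANSLATION_ATTR") or None,
        "user_attr": env["USER_ATTR"],
        "embedding_model": env["EMBEDDING_MODEL"],
        # Optional flag; an unset value is treated as 0 here (assumption).
        "embed_translation": env.get("EMBED_TRANSLATION", "0") == "1",
    }


# Values taken from the examples in the list above.
example_env = {
    "DATASET": "twitter-covid/dataset.jsonl.gz",
    "TEXT_ATTR": "text",
    "TEXT_TRANSLATION_ATTR": "translation",
    "USER_ATTR": "user_id",
    "EMBEDDING_MODEL": "paraphrase-multilingual-MiniLM-L12-v2",
    "EMBED_TRANSLATION": "1",
}

settings = load_settings(example_env)
```

Passing a plain dictionary instead of os.environ keeps the sketch easy to try without creating a .env file.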

LLMs can optionally be used to summarize events and stories. Otherwise, we fall back to keyword extraction.
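
The fallback described above could look roughly like the following sketch. Everything here is illustrative: the names llm_configured, top_keywords, and describe are hypothetical, the rule that an LLM counts as configured when both OPENAI_URL and OPENAI_MODEL are set is an assumption, and the naive word-frequency keywords stand in for whatever keyword extraction the project actually uses.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "and", "are", "for", "with", "that", "this"}


def llm_configured(env):
    """Assumption: an LLM is usable when both a URL and a model name are set."""
    return bool(env.get("OPENAI_URL")) and bool(env.get("OPENAI_MODEL"))


def top_keywords(texts, k=2):
    """Naive keyword extraction: the k most frequent non-stopword tokens."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if len(token) > 2 and token not in STOPWORDS:
                counts[token] += 1
    return [word for word, _ in counts.most_common(k)]


def describe(texts, env, llm_summarize=None):
    """Use the supplied LLM callable if configured, else fall back to keywords."""
    if llm_configured(env) and llm_summarize is not None:
        return llm_summarize(texts)
    return ", ".join(top_keywords(texts))


example_texts = ["Vaccines work", "vaccines are safe", "Safe vaccines save lives"]
fallback_label = describe(example_texts, env={})  # no LLM configured
```

In this sketch the caller supplies the LLM client as a callable (llm_summarize), so the fallback logic stays independent of any particular API library.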

Usage

  1. Activate the virtual environment: pipenv shell
  2. Run the numbered scripts in ascending order to preprocess the data
  3. Run the Streamlit app: streamlit run app.py

Citation

If you use this code or data, please cite the following paper:

@inproceedings{wildemann2025automated,
  title     = {Automated Identification of Competing Narratives in Political Discourse on Social Media},
  author    = {Sergej Wildemann and Erick Elejalde},
  editor    = {Ricardo Campos and 
               Al{\'{\i}}pio M{\'{a}}rio Jorge and 
               Adam Jatowt and 
               Sumit Bhatia and 
               Marina Litvak},
  booktitle = {Proceedings of Text2Story - Eighth Workshop on Narrative Extraction
               From Texts held in conjunction with the 47th European Conference on
               Information Retrieval {(ECIR} 2025), Lucca, Italy, April 10, 2025},
  year      = {2025},
  series    = {{CEUR} Workshop Proceedings},
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
