2021-08-19
This dataset is derived from Infomedia, a proprietary Danish media archive.
Unfortunately, we are not allowed to share the text itself, so instead we share the inferred LDA topic distribution of each article, together with the 10 most associated words for each topic.
Derived newspaper data is available here
We have trained two topic models: one on omnibus newspapers and one on tabloid newspapers, as their reporting styles differ.
Both print and web versions are included.
Tabloid newspapers
- BT
- Ekstrabladet
Omnibus newspapers
- Information
- Jyllands Posten
- Kristeligt Dagblad
- Politiken
- Weekendavisen
- (Berlingske)
| | Omnibus | Tabloids |
|---|---|---|
| Total number of articles | 328,129 | 205,529 |
| First article | 2019-12-02 | 2019-12-02 |
| Last article | 2021-01-21 | 2021-01-21 |
Each article record contains its topic distribution along with the following metadata fields:
- date (e.g. "2019-12-02T00:00:00Z")
- source_name (e.g. "politiken")
- source_type ("print" or "web")
In separate files:
- topic content and coherence scores (.txt files)
The dataset comes in two formats: newline-delimited JSON (ndjson) and CSV.
Pick whichever format suits you best.
Loading the dataset (shown here for the omnibus part):
ndjson version:

```python
import ndjson

# load the newline-delimited JSON file into a list of dicts (one per article)
with open('Omnibus/topics_omnibus.ndjson') as f:
    omnibus = ndjson.load(f)
```
CSV version:

```python
import pandas as pd

omnibus = pd.read_csv('Omnibus/topics_omnibus.csv')

# optional parsing of article timestamps
omnibus['date'] = pd.to_datetime(omnibus['date'])
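```

As a quick sanity check after loading, you can confirm that each distribution sums to roughly 1 and extract each article's dominant topic. This is a minimal sketch assuming the 100 topic probabilities sit in columns named `0` through `99`; check the CSV header for the actual column names.

```python
# assumed column names for the 100 topic probabilities; adjust to the real header
topic_cols = [str(i) for i in range(100)]

# every topic distribution should sum to 1 (or very close to it)
print(omnibus[topic_cols].sum(axis=1).describe())

# dominant topic per article = column with the highest probability
omnibus['dominant_topic'] = omnibus[topic_cols].idxmax(axis=1)
print(omnibus['dominant_topic'].value_counts().head())
```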
Unzipped, the two folders are ~2.2 GB.
Infomedia/
│
├── Tabloid/ (in tabloid.zip)
│ ├── topics_tabloid.ndjson
│ │ tabloid articles - ndjson version
│ ├── topics_tabloid.csv
│ │ tabloid articles - csv version
│ ├── coherence_tabloid.txt
│ │ coherence scores: a) whole model b) per topic
│ └── content_tabloid.txt
│ topic model - top words per topic
│
├── Omnibus/ (in omnibus.zip)
│ ├── topics_omnibus.ndjson
│ │ omnibus articles - ndjson version
│ ├── topics_omnibus.csv
│ │ omnibus articles - csv version
│ ├── coherence_omnibus.txt
│ │ coherence scores: a) whole model b) per topic
│ └── content_omnibus.txt
│ topic model - top words per topic
│
└── README.md
We trained LDA topic models to represent articles by their topic distributions.
A topic is essentially a cluster of related words that often appear together. As an example, the following topic captures writing about US presidents & politics. The ten most associated words are shown along with their probabilities.
(5, '0.065*"usa" + 0.043*"præsident" + 0.041*"amerikansk" + 0.030*"trump" + 0.021*"obama" + 0.018*"bush" + 0.014*"donald" + 0.013*"amerikaner" + 0.009*"george" + 0.008*"politisk"')
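If you want these word lists programmatically, here is a minimal parsing sketch, assuming the content_*.txt files contain one topic per line in the gensim show_topics format above (the exact file layout is an assumption):

```python
import re

def parse_topic(line):
    """Parse a gensim-style topic string (see the example above)
    into (topic_id, [(word, probability), ...])."""
    topic_id = int(re.search(r'\((\d+),', line).group(1))
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', line)
    return topic_id, [(word, float(prob)) for prob, word in pairs]

# hypothetical usage: one topic per line in content_omnibus.txt
with open('Omnibus/content_omnibus.txt') as f:
    topics = dict(parse_topic(line) for line in f if line.strip())

print(topics[5][:3])  # e.g. [('usa', 0.065), ('præsident', 0.043), ('amerikansk', 0.041)]
```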
Each article is represented by a vector of 100 values (topics). This number of topics yielded marginally the highest C_v topic coherence, although the parameter search was limited (models with 50, 100, and 150 topics were fitted on the printed articles only, from both source types).
Each of the 100 values in an article's topic distribution corresponds to the probability of that topic appearing in the article.
Topic probabilities range between 0 and 1, and each topic distribution/vector sums to 1 (or very close to it).
Topics are numbered from 0 to 99; see which words make up each topic in the content_*.txt files.
- concatenate news articles into a single string
  - heading, subheading, text body
- preprocessing
  - remove HTML tags
  - replace non-word characters (e.g. punctuation, special characters) with a whitespace
  - remove digits
  - remove stopwords
  - remove excess whitespace
  - lemmatization (using spaCy's `da_core_news_sm` model)
  - lowercasing
  - remove tokens that appear fewer than 10 times in the whole dataset
- training the topic model (a minimal sketch of the full pipeline follows this list)
  - gensim's LdaModel with default hyperparameters
    - (alpha) inferred symmetric priors per topic
    - (eta) no a priori assumptions about words
  - C_v coherence grid search on print media only
    - tabloid final model: ≈ 0.28 at 100 topics (≈ 0.26 at the other settings); 100 topics chosen for simplicity
    - omnibus final model: ≈ 0.45 at 100 topics
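For reference, a minimal sketch of the pipeline described above, using gensim and spaCy. The cleaning regexes, the stopword list, and the placeholder corpus `raw_articles` are illustrative assumptions; the spaCy model, the rare-token cutoff, the 100-topic setting, and the C_v evaluation come from the list above.

```python
import re
from collections import Counter

import spacy
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

nlp = spacy.load('da_core_news_sm')
stopwords = nlp.Defaults.stop_words  # assumption: the actual stopword list is not specified

def preprocess(text):
    text = re.sub(r'<[^>]+>', ' ', text)   # remove HTML tags
    text = re.sub(r'[\W\d_]+', ' ', text)  # non-word characters and digits -> whitespace
    # lemmatize, lowercase, drop stopwords; excess whitespace disappears with tokenization
    return [tok.lemma_.lower() for tok in nlp(text)
            if not tok.is_space and tok.lemma_.lower() not in stopwords]

# placeholder: the article text itself is not distributed; supply a real corpus here
raw_articles = ['<p>Præsidenten besøgte USA.</p>']
texts = [preprocess(a) for a in raw_articles]

# remove tokens that appear fewer than 10 times in the whole dataset
# (with a real corpus; this tiny placeholder would be filtered away entirely)
counts = Counter(tok for doc in texts for tok in doc)
texts = [[tok for tok in doc if counts[tok] >= 10] for doc in texts]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

# default hyperparameters, 100 topics
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=100)

# C_v coherence, as used in the grid search
cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v')
print(f'C_v coherence: {cv.get_coherence():.2f}')
```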
- Infomedia database (access through the Danish Royal Library)