diff --git a/.gitignore b/.gitignore
index 1f0b2078..7773e9c6 100644
--- a/.gitignore
+++ b/.gitignore
@@ -128,5 +128,23 @@ dmypy.json
# Pyre type checker
.pyre/
-# exclude /data directory entirely
-/data/
+# exclude data directories
+data/raw/
+data/preprocessing/
+data/classification/mlflow
+mlruns/
+data/feature_extraction/
+
+# exclude OSX-specific files
+.DS_Store
+**/.DS_Store
+
+# classifier.pickle file gets too big after certain number of features to classify with
+data/classification/classifier.pickle
+data/feature_extraction/validation.pickle
+data/feature_extraction/test.pickle
+data/feature_extraction/training.pickle
diff --git a/Documentation.md b/Documentation.md
new file mode 100644
index 00000000..07e2f6e8
--- /dev/null
+++ b/Documentation.md
@@ -0,0 +1,769 @@
+# Documentation
+Machine Learning in Practice block seminar, winter term 2021/22 @ [UOS](https://www.uni-osnabrueck.de/startseite/).
+Held by Lucas Bechberger, M.Sc.
+Group members: Dennis Hesenkamp, Yannik Ullrich
+
+---
+### Table of contents
+1. [Introduction](#introduction)
+2. [Preprocessing](#preprocessing)
+3. [Feature Extraction](#feature_extraction)
+4. [Dimensionality Reduction](#dimensionality_reduction)
+5. [Classification](#classification)
+6. [Evaluation Metrics](#evaluation)
+7. [Hyperparameter Optimization](#param_optimization)
+8. [Results](#results)
+9. [Conclusion](#conclusion)
+10. [Next Steps](#next_steps)
+11. [Resources](#resources)
+
+---
+
+
+
+
+## Introduction
+
+This file contains the documentation for our project, which aims to classify tweets as viral/non-viral based on multiple features derived from
+
+- the metadata of the tweet and
+- the natural language features of the tweet.
+
+The data set used is Ruchi Bhatia's [Data Science Tweets 2010-2021](https://www.kaggle.com/ruchi798/data-science-tweets) from [Kaggle](https://www.kaggle.com/). The code base on which we built our machine learning pipeline was provided by Lucas Bechberger (lecturer) and can be found [here](https://github.com/lbechberger/MLinPractice).
+
+We can see that, over the years, the interest in data science and related topics has grown rapidly:
+
+
+
+ Fig 1: Tweets per year.
+
+
+Also, most tweets in our data set are written in English, as can be seen here:
+
+
+
+ Fig 2: Language distribution of the tweets.
+
+
+
+
+
+
+## Preprocessing
+
+The data set provides the raw tweet as it has been posted as well as multiple features related to the tweet, for instance the person who published it, the time it has been published at, whether it contained any media (be it photo, video, url, etc.), and many more. We employed multiple preprocessing steps to transform the input data into a more usable format for feature extraction steps later on.
+
+### Splitting
+As a very first step, we split the data set into a training (60%), a validation (20%), and a test set (20%). We develop and tune everything on the training and validation sets and only evaluate our final classifiers on the test set.
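A two-step split along these lines can be sketched with `train_test_split` from `sklearn.model_selection` (the 60/20/20 ratios are the ones stated above; the exact call in our pipeline may differ):

```python
from sklearn.model_selection import train_test_split

data = list(range(100))  # stand-in for the tweet data set

# first split off the 20% test set
rest, test = train_test_split(data, test_size=0.2, random_state=42)

# then take 25% of the remaining 80% as validation set (= 20% overall)
train, val = train_test_split(rest, test_size=0.25, random_state=42)

len(train), len(val), len(test)
# (60, 20, 20)
```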
+
+
+### Tokenization
+In the lecture, Lucas implemented a tokenizer to disassemble tweets into individual words using the `nltk` library[^nltk]. This is done to split up the raw tweet into its constituents, i.e. the single words and punctuation signs it contains. By doing so, further processing and feature extraction can be performed by looking at the single components of a sentence/tweet as opposed to working with one long string.
+
+Example:
+
+```python
+import nltk
+
+sent = 'There is great genius behind all this.'
+nltk.word_tokenize(sent)
+
+# ['There', 'is', 'great', 'genius', 'behind', 'all', 'this', '.']
+```
+
+
+### Stopword Removal
+To extract meaningful natural language features from a string, it makes sense to first remove any stopwords occurring in that string. Say, for example, one would like to look at the most frequently occurring words in a large corpus. Usually, that means looking at words which actually carry _meaning_ in the given context. According to the OEC[^oec], the largest 21st-century English text corpus, the most common word in English is _the_ - from which we cannot derive any meaning. Hence, it makes sense to remove _the_ and other non-meaning-carrying words (= stopwords) from a corpus (the set of tweets in our case) before doing anything like keyword or occurrence-frequency analysis.
+
+There is not one universal stopword list nor are there universal rules on how stopwords should be defined. For the sake of convenience, we decided to use `gensim`'s `gensim.parsing.preprocessing.remove_stopwords` function[^gensim_stopwords], which uses `gensim`'s built-in stopword list containing high-frequency words with little lexical content.
+
+Example:
+
+```python
+import gensim
+
+sent = 'There is great genius behind all this.'
+gensim.parsing.preprocessing.remove_stopwords(sent)
+
+# 'There great genius this.'
+```
+
+Other options would have been `nltk`'s stopword corpus[^nltk_stopwords], an annotated corpus with 2,400 stopwords from 11 languages, or `spaCy`'s stopword list[^spacy_stopwords]. However, we faced problems implementing the former, while `gensim`'s list contains more words than `spaCy`'s and led to better results in our experiments.
+
+
+### Punctuation Removal
+Punctuation removal follows the same rationale as stopword removal: a dot, hyphen, or exclamation mark will probably occur often in the corpus without carrying much meaning at first sight (we can actually also infer features from punctuation, more about that in [Sentiment Analysis](#sentiment_analysis)). A feature for removing punctuation from the raw tweet had already been implemented by Lucas during the lecture using the `string` package. Again, alternatives exist - for example `gensim`, which offers a function for punctuation removal[^gensim-punctuation]. We had to rebuild this class: it was initially meant to be the first step in the preprocessing pipeline, but we now run it in second place, so we changed how the class handles input and output and added a command line argument. We did not change the method of removing punctuation itself, as there is little benefit in comparing different ways of punctuation removal - in contrast to stopword removal, where lists can vary a lot depending on the corpus.
+
+Example:
+
+```python
+import re
+import string
+
+sent = "O, that my tongue were in the thunder's mouth!"
+pattern = '[{}]'.format(re.escape(string.punctuation))
+re.sub(pattern, '', sent)
+
+# "O that my tongue were in the thunders mouth"
+```
+Note: plain `str.replace` does not interpret its argument as a regular expression, which is why `re.sub` is used here. In our implementation, the pattern is applied to a `dtype object` column, where it is evaluated as a regex. This example is just meant to illustrate how punctuation removal works in general.
+
+
+### Lemmatization
+Lemmatization modifies an inflected or variant form of a word into its lemma or dictionary form. Through lemmatization, we can make sure that words - on a semantic level - get interpreted in the same way, even when inflected: _walk_ and _walking_, for example, stem from the same word and ultimately carry the same meaning. We decided to use lemmatization as opposed to stemming, although it is computationally more expensive. This is because lemmatization takes context into account, as it depends on part-of-speech (PoS) tagging.
+
+To implement this, we used `nltk`'s `pos_tag` to assign PoS tags and WordNet's `WordNetLemmatizer()` class, as well as a manually defined PoS dictionary to reduce the (rather detailed) tags from `pos_tag` to only four different ones, namely _noun_, _verb_, _adjective_, and _adverb_:
+
+```python
+from nltk.corpus import wordnet
+
+tag_dict = {"J": wordnet.ADJ,
+ "N": wordnet.NOUN,
+ "V": wordnet.VERB,
+ "R": wordnet.ADV}
+```
+
+This simplified PoS assignment is important because `pos_tag` returns a tuple, which has to be converted to a format the WordNet lemmatizer can work with; furthermore, WordNet lemmatizes differently for different PoS classes and only distinguishes between the classes mentioned above. Credit to [this blog entry](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#wordnetlemmatizerwithappropriatepostag) by Selva Prabhakaran for the idea and the code.
+
+Example:
+
+```python
+from nltk import pos_tag
+from nltk.corpus import wordnet
+from nltk.stem import WordNetLemmatizer
+
+# uses the tag_dict defined above
+sent = ['These', 'newer', 'data', 'help', 'scientists', 'accurately', 'project', 'how', 'quickly', 'glaciers', 'are', 'retreating', '.']
+lem = WordNetLemmatizer()
+lemmatized = []
+
+for word in sent:
+ # get the part-of-speech tag
+ tag = pos_tag([word])[0][1][0].upper()
+ lemmatized.append(lem.lemmatize(word.lower(), tag_dict.get(tag, wordnet.NOUN)))
+
+# ['these', 'newer', 'data', 'help', 'scientist', 'accurately', 'project', 'how', 'quickly', 'glacier', 'be', 'retreat', '.']
+```
+
+Whenever the PoS tagging encounters an unknown tag or a tag which the lemmatizer cannot handle, the default tag to be used is `wordnet.NOUN`.
+
+As mentioned in the beginning, as an alternative to lemmatization we could have used the computationally cheaper stemming, which only reduces an inflected word to its stem (e.g. _accurately_ becomes _accur_). This could be done with `gensim.parsing.preprocessing.stem`[^gensim_stemming].
+
+
+### Final Words
+The above preprocessing steps have all been tested and work fine. Some of them can be performed independently, but we built the pipeline such that they stack. To ensure proper functionality, the order of steps has to be as follows:
+
+1. Stopword removal
+2. Punctuation removal
+3. Tokenization
+4. Lemmatization
+
+Input columns have to be specified accordingly with the provided command line arguments (see readme for more info).
+
+
+
+
+
+## Feature Extraction
+After the preprocessing of the data is done, we can move on to extracting features from the dataset.
+
+
+### Character Length
+The length of a tweet might influence its chance of going viral as people might prefer shorter texts on social media (or longer, more complex ones). This feature was already implemented by Lucas as an example, using `len()`.
+
+Example:
+
+```python
+sent = 'There is great genius behind all this.'
+len(sent)
+
+# 38
+```
+
+However simple it may be, this feature is difficult to interpret: for most of its existence, Twitter had a character limit of 140 characters per tweet. In 2017, the maximum was raised to 280[^twitter_charlength], which led to an almost immediate drop in the prevalence of tweets with around 140 characters; at the same time, tweets approaching 280 characters appear to be syntactically and semantically similar to tweets of around 140 characters from before the change (Gligorić et al., 2020).
+
+
+### Month
+We thought that the month in which a tweet was published could have (some minor?) influence on its virality. Maybe during holiday season or the darker time of the year, i.e. from October to March, people spend more time on the internet, hence tweets might get more interaction which will lead to a higher potential of going viral.
+
+We extracted the month from the `date` column of the dataframe using the `datetime` package as follows:
+
+```python
+import datetime
+
+date = "2021-04-14"
+datetime.datetime.strptime(date, "%Y-%m-%d").month
+
+# 4
+```
+
+The result we return is the respective month. We did not implement one-hot encoding for this result because we decided rather quickly not to use this feature: we could not find evidence or research supporting our assumption that screen time is higher during certain months or periods of the year. How one-hot encoding is done can be seen in [Time of Day](#time_of_day).
+
+
+
+### Sentiment Analysis
+Using the VADER (Valence Aware Dictionary and sEntiment Reasoner)[^vader_pypi] [^vader_homepage] framework, we extract the compound sentiment of a tweet. VADER was built for social media and takes into account, among other factors, emojis, punctuation, and caps - which is why we let it work on the unmodified `tweet` column of the dataframe, ensuring that we do not artificially modify the sentiment. The `polarity_score()` function returns a value for positive, negative, and neutral polarity, as well as an additional compound value with $-1$ representing the most negative and $+1$ the most positive sentiment. The classifier does not need training as it is pre-trained, unknown words, however, are simply classified as neutral.
+
+Example:
+
+```python
+from nltk.sentiment.vader import SentimentIntensityAnalyzer
+
+sia = SentimentIntensityAnalyzer()
+sentences = ["The service here is good.",
+ "The service here is extremely good.",
+ "The service here is extremely good!!!",
+ "The service here is extremely good!!! ;)"]
+
+for s in sentences:
+ print(sia.polarity_scores(s)['compound'])
+
+# 0.4404
+# 0.4927
+# 0.6211
+# 0.7389
+```
+We can see how the compound sentiment changes with the addition of words, punctuation, and emojis. We decided to only use the compound sentiment as measure because we felt that this is the most important one. A tweet might have a certain negativity score (indicating, e.g., that it is negatively phrased) because of a few words, while the rest of the tweet is phrased very positively, resulting in a positive compound sentiment. However, compared to a tweet with only neutral phrasing (i.e. a negative score of $0$), it would still be classified as more negative, which would intuitively be wrong.
+
+Nota bene: We added $+1$ to the compound sentiment so that no negative values are returned. We had to do this because the Bayes classifier (see [Bayes Classifier](#bayes_classifier)) cannot handle negative values. The range of possible values is now [0, 2] with 1 representing a neutral tweet, 0 being the most negative and 2 the most positive value.
+
+
+
+### Time of Day
+As opposed to the [Month](#month) feature, which we ended up not using, we felt that the time of day during which a tweet was posted might very well influence its virality. For example, we suppose that fewer people are online during the night, decreasing a tweet's virality potential. We decided to split the day into time ranges with hard boundaries:
+
+1. Morning hours from 5am to 9am (5 hours)
+2. Midday from 10am to 2pm (5 hours)
+3. Afternoon from 3pm to 6pm (4 hours)
+4. Evening from 7pm to 11pm (5 hours)
+5. Night from 12am to 4am (5 hours)
+
+The time sections are roughly equally sized, with afternoon being the only exception. We based this split on our own experience and have not tested different splits. Alternatively, one could test
+
+1. Different time ranges
+2. A finer split, i.e. more categories
+3. A coarser split, i.e. fewer categories
+
+We extracted the time from the `time` column of the dataframe and simply used the `split()` method to extract the hour from the string it is stored in, then checked which predefined range the extracted value falls into and appended the respective value to a new column. We then one-hot encoded the result with `pandas`' `get_dummies()` function to retrieve a binary encoding for every entry.
+
+```python
+import pandas as pd
+
+times = ["05:15:01", "07:11:31", "16:04:59", "23:12:00"]
+hours = [int(t.split(":")[0]) for t in times]
+# [5, 7, 16, 23]
+
+result = []
+for h in hours:
+    if h in range(0, 5):       # night, 12am-4am
+        result.append(0)
+    elif h in range(5, 10):    # morning, 5am-9am
+        result.append(1)
+    elif h in range(10, 15):   # midday, 10am-2pm
+        result.append(2)
+    elif h in range(15, 19):   # afternoon, 3pm-6pm
+        result.append(3)
+    else:                      # evening, 7pm-11pm
+        result.append(4)
+# [1, 1, 3, 4]
+
+pd.get_dummies(result)
+
+#    1  3  4
+# 0  1  0  0
+# 1  1  0  0
+# 2  0  1  0
+# 3  0  0  1
+# only yields columns for the three categories that actually occur in this example
+```
+
+
+
+### Named Entity Recognition
+Named entity recognition (NER) aims to identify so-called named entities in unstructured texts. We implemented this feature using `spacy`'s pre-trained `en_core_web_sm` pipeline[^spacy_ner]. It has been trained on the OntoNotes 5.0[^onto_notes] and WordNet 3.0[^wordnet] databases. The following entity types are supported by this model:
+
+```python
+import spacy
+
+spacy.info("en_core_web_sm")['labels']['ner']
+
+# ['CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART']
+```
+
+The entities can be accessed as follows:
+
+```python
+ner = spacy.load("en_core_web_sm")
+sent1 = "The current Dalai Lama was born in 1935 in Tibet"
+sent2 = "Big Data and AI are a hot trend in 2021 on Twitter"
+
+ner(sent1).ents
+# (1935, Tibet)
+
+ner(sent2).ents
+len(ner(sent2).ents)
+# (Big Data, AI, 2021, Twitter)
+# 4
+```
+As can be seen in the above example, NER does not work perfectly: "Dalai Lama" in the first sentence is a named entity but is not recognized as such. We still decided to use this feature, as it classifies most named entities correctly. We counted the number of NEs per tweet and stored the count as an integer. The pipeline we employed was specifically designed for web language. We went with the English version - although an abundance of other languages is available - because the number of tweets in other languages in our dataset is rather low and the model might still capture named entities, even if in another language:
+
+```python
+sent3 = "Kanzlerin Merkel hat 16 Jahre der Bundesregierung vorgesessen."
+ner(sent3).ents
+
+# (Kanzlerin Merkel, 16, Jahre, Bundesregierung)
+```
+
+Other models are offered as well; `en_core_web_sm` is a small one designed for efficiency. Alternatively, models with a larger corpus or tuned towards higher accuracy are available. Even with the efficient version, this feature extraction step is computationally very expensive.
+
+NER with `nltk` is also possible when utilizing the `pos_tag()` function, it requires a much larger effort, though, as noun phrase chunking and regular expressions have to be used for the classification.
+
+
+### URLs, Photos, Mentions, Hashtags
+In this section, we evaluate whether any of the above have been attached to a tweet as a binary (1 if attached, 0 else). Our thinking here was that additional media, be it a link, pictures, mentions of another account, or hashtags, might influence the potential virality of a tweet. We accessed the respective columns of the dataframe (`url`, `photos`, `mentions`, `hashtags`), in which the entries are stored in a list. Hence, we could simply evaluate the length of the entries. If they exceed a length of 2, the column contains more than just the empty brackets and the tweet contains the respective feature.
+
+Example with URL:
+
+```python
+urls = ["[]", "['https://www.sciencenews.org/article/climate-thwaites-glacier-under-ice-shelf-risks-warm-water']", "[]"]
+
+result = [0 if len(url) <= 2 else 1 for url in urls]
+
+# [0, 1, 0]
+```
+
+Important: although stored as lists, the column entries are still evaluated as strings, which is why checking for a length less than or equal to 2 works here. The evaluation procedure (checking the length) is the same for all of the above features.
+
+
+### Replies
+We also figured that the number of replies has an influence on virality: the more people engage with a tweet and reply to it, the more people see it in their news feed, which again increases reach and interactions. The number of replies is stored as a float in the `replies_count` column of the dataframe, so we just access that column, make a copy, transform it to a `numpy.array`, and reshape it so the classifier can work with the data later on:
+
+```python
+import numpy as np
+
+replies = [0.0, 7.0, 2.0, 49.0]
+np.array(replies).reshape(-1, 1)
+
+# array([[ 0.],
+# [ 7.],
+# [ 2.],
+# [49.]])
+```
+
+
+### Retweets and Likes
+Retweets and likes follow the same rationale as replies. These are the most obvious features to consider when measuring virality and we just implemented them for the purpose of testing. We did not use them for training our model (since that easily results in an accuracy $>99\%$ and does not tell us anything about _why_ the tweet went viral). The procedure is the same as above: access the respective column, convert it to a `numpy.array`, and reshape it.
+
+
+
+
+
+## Dimensionality Reduction
+When considering a large number of features, we ultimately also have to ask whether they are all useful. Some features might contribute a lot to the classification task at hand, while others contribute nothing at all. Large, high-dimensional feature spaces - which can emerge extremely fast due to the curse of dimensionality - can make classification computationally very costly, so it makes sense to distinguish between important and less important features.
+
+We decided neither to implement new dimensionality reduction methods nor to use the already provided `sklearn.feature_selection.SelectKBest` procedure, since our feature vector comprised fewer than 20 features.
+
+
+
+
+
+## Classification
+The data vectors in our dataset have one of two possible labels - __true__ if the tweet went viral, __false__ if not. We are thus faced with a binary classification task and need classifiers suited for this kind of task.
+
+
+### Dummy Classifiers
+```python
+sklearn.dummy.DummyClassifier
+```
+
+Dummy classifiers make predictions without any knowledge about patterns in the data. We implemented three of them with different rules to explore possible baselines to which we could compare our real classifiers later on:
+
+- Majority vote: always predicts the most frequently occurring label in the data set, in our case __false__.
+- Stratified classification: uses the data set's class distribution to assign labels.
+- Uniform classification: makes random uniform predictions.
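The three baselines can be sketched as follows (toy labels invented for the example; the unbalanced distribution mirrors our viral/non-viral flag):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((10, 1))                   # features are ignored by dummy classifiers
y = np.array([False] * 8 + [True] * 2)  # unbalanced labels, like our viral flag

baselines = {
    strategy: DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
    for strategy in ("most_frequent", "stratified", "uniform")
}

# majority vote always predicts the most frequent label (here: False)
baselines["most_frequent"].predict(X[:3])
# array([False, False, False])
```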
+
+
+### _k_ - Nearest Neighbor
+```python
+from sklearn.neighbors import KNeighborsClassifier
+```
+The _k_-NN classifier was implemented by Lucas during the lecture. We use it with only one hyperparameter - _k_ - for our binary classification task. This algorithm is an example for instance-based learning. It relies on the distance between data points for classification, hence it requires standardization of feature vectors.
+
+We decided to additionally implement a way to change the weight function. As default, the `KNeighborsClassifier` works with uniform weights, i.e. all features are equally important. Having an additional option for distance-weighted classification where nearer neighbors are more important than those further away made sense for us (and it also improved our results, as can be seen later).
+
+Other than that, though, we left the classifier with default settings. A notable alternative could have been the choice of the algorithm for computation of the nearest neighbors, options being a brute-force search, _k_-dimensional tree search, and ball tree search. The default option is `auto`, where the classifier picks the method it deems fittest for the task at hand.
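A toy illustration of the two weight options (data points invented for the example; both weightings agree on this easy case, but they can differ near class boundaries):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])

# uniform: all k neighbors count equally; distance: closer neighbors count more
for weights in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=3, weights=weights).fit(X, y)
    print(knn.predict([[0.15], [1.05]]))

# [0 1]
# [0 1]
```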
+
+
+
+### Decision Tree
+```python
+from sklearn.tree import DecisionTreeClassifier
+```
+
+Further, we implemented a decision tree classifier. Due to its nature of learning decision rules from the dataset, it neither requires standardization of the data nor makes assumptions about the data distribution.
+
+We added the option to define a maximum depth of the tree, which is extremely important to cope with overfitting. Further, the criterion for measuring split quality can be chosen between Gini impurity and entropy/information gain. The former is usually preferred for classification and regression trees (CART), while the latter is used in the ID3 variant[^id3] of decision trees. Although `sklearn` employs a version of the CART algorithm, it nonetheless offers entropy as a split criterion.
+
+Decision trees generally have difficulties working with continuous data, and our compound sentiment feature (see [Sentiment Analysis](#sentiment_analysis)) is of exactly that nature: continuous in the range $[-1, 1]$ (or $[0, 2]$ after our shift), although a case could be made for it being discrete, since it is rounded to four decimal places.
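A sketch of how the two hyperparameters are set (synthetic data; the parameter values are for illustration only, not our tuned configuration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = (X[:, 0] > 0.5).astype(int)  # synthetic label depending only on the first feature

# max_depth limits tree growth to counter overfitting;
# criterion switches between "gini" and "entropy"
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0).fit(X, y)
tree.get_depth()
```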
+
+
+### Random Forest
+```python
+from sklearn.ensemble import RandomForestClassifier
+```
+
+Random forest classifiers represent an ensemble of multiple decision trees. They are often more robust and accurate than single decision trees and less prone to overfitting and can, therefore, better generalize on new, unseen data (Breiman, 2001). Random forests are able to deal with unbalanced datasets by down-sampling the majority class such that each subtree works on more balanced data (Biau and Scornet, 2016).
+
+We implemented it such that we can modify the number of trees per forest, the maximum depth per tree, as well as the criterion based on which a split occurs. The options for this are the same as for single decision trees - Gini impurity and entropy. The first two are the main parameters to look at when constructing a forest according to `sklearn`'s user guide on classifiers[^sklearn_forest]. Usually, the classification is obtained via majority vote of the trees in a forest, but this implementation averages over the probabilistic class prediction of the single classifiers.
+
+Being able to manipulate both the maximum depth as well as the split criterion further allows us to compare the forest to our single decision tree classifier, since we can use the same parametrization for both.
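The three parameters we expose can be sketched as follows (synthetic data; parameter values are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = (X[:, 0] > 0.5).astype(int)  # synthetic label depending only on the first feature

# n_estimators sets the number of trees per forest;
# max_depth and criterion work as for a single decision tree
forest = RandomForestClassifier(n_estimators=50, max_depth=8,
                                criterion="gini", random_state=0).fit(X, y)
len(forest.estimators_)
# 50
```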
+
+
+### Support Vector Machine
+```python
+from sklearn.svm import SVC
+```
+
+We also added a support vector machine (SVM). This classifier seeks to find a hyperplane in the data space which maximizes the distance between different classes. It can easily deal with higher-dimensional data by changing the kernel (application of the so-called kernel-trick). `sklearn` offers to choose between a linear, polynomial (default degree: 3), radial basis function, and sigmoid kernel. We decided to implement a way to change the kernel, as this can highly affect the outcome of the classifier.
+
+SVMs are sensitive to unscaled data and require standardization of the input, which we carried out using the `StandardScaler()` from `sklearn.preprocessing`. Class weights can be set via the `class_weight` parameter to deal with unbalanced data sets; we did not implement this parameter.
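Scaling and kernel choice can be combined in a pipeline, sketched here with toy data (not our exact setup; the four kernel names are `sklearn`'s identifiers for the kernels listed above):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [8.0, 8.0], [9.0, 9.0]])
y = np.array([0, 0, 1, 1])

# StandardScaler standardizes the features before they reach the SVM;
# the kernel is switched via the `kernel` parameter
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel)).fit(X, y)

make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X, y).predict([[0.5, 0.5]])
# array([0])
```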
+
+
+### Multilayer Perceptron
+```python
+from sklearn.neural_network import MLPClassifier
+```
+
+The multilayer perceptron (MLP) usually consists of at least three layers: one input layer, one hidden layer, and one output layer. We implemented it such that we can define the number of hidden layers as well as the number of neurons per layer. It should be noted that `sklearn`'s implementation of the MLP stops after 200 iterations if the network has not converged to a solution by then.
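The layer layout is passed via `hidden_layer_sizes` (toy data; the layer sizes and the raised `max_iter` are illustrative, not our tuned values):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = (X[:, 0] > 0.5).astype(int)  # synthetic binary labels

# two hidden layers with 25 and 10 neurons; input and output layers are implicit
mlp = MLPClassifier(hidden_layer_sizes=(25, 10), max_iter=500, random_state=0).fit(X, y)
mlp.predict(X[:5]).shape
# (5,)
```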
+
+
+### Naive Bayes
+```python
+from sklearn.naive_bayes import ComplementNB
+```
+As a sixth (and last) classifier, we implemented one of the naive Bayes variants that `sklearn` offers. The two classic variants are Gaussian and multinomial Bayes, yet, we chose the complement naive Bayes (CNB) algorithm as it was specifically designed to deal with unbalanced data and addresses some of the shortcomings of the multinomial variant (Rennie et al., 2003).
+
+We did not implement additional parameters. Alternatively, we could have added a command line argument for the smoothing parameter alpha, which applies Laplace smoothing and takes care of the zero probability problem. The default value is $1$, which is also the preferred value in most use cases, so we did not deem modification necessary.
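A minimal sketch with toy data. Note that the feature values are non-negative - CNB cannot handle negative inputs, which is why we shifted the compound sentiment to $[0, 2]$:

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

# feature values must be non-negative, hence the sentiment shift to [0, 2]
X = np.array([[1.8, 0.2], [1.9, 0.1], [0.2, 1.7], [0.1, 1.9]])
y = np.array([1, 1, 0, 0])

cnb = ComplementNB().fit(X, y)  # fitting with negative values would raise a ValueError
cnb.predict(X)
```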
+
+
+
+
+
+## Evaluation Metrics
+We implemented multiple evaluation metrics to see how well our classification works and to compare the different classifiers described above.
+
+
+### Accuracy
+The accuracy - or fraction of correctly classified samples - might just be the simplest statistic for evaluating a classification task. It can be calculated as follows:
+
+$$
+\text{Accuracy} = \frac{TP + TN}{N}
+$$
+
+$TP$: True positive, correctly positively labeled samples
+$TN$: True negative, correctly negatively labeled samples
+$N$: Total number of samples
+
+The best value is $1$ ($100\%$ correctly classified samples), the worst $0$.
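With `sklearn`, this is `accuracy_score` (toy labels for illustration):

```python
from sklearn.metrics import accuracy_score

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]  # 3 of 4 samples correct

accuracy_score(y_true, y_pred)
# 0.75
```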
+
+### Balanced Accuracy
+The balanced accuracy is better suited for unbalanced data sets. It is based on two other commonly used metrics, the sensitivity and specificity (see section [F1-Score](#f1_score) for more details). Its calculation works as follows:
+
+$$
+\text{Balanced accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}
+$$
+
+Again, values can range from $0$ (worst) to $1$ (best). This metric makes a much better statement about the quality of a classifier when the majority of samples belongs to one class, as in our case with the label '_false_', because it takes into account both how many samples were correctly classified as _true_ and correctly classified as _false_. Consider an example with $1\%$ of the data points belonging to class A and a classifier always assigning the other class (B) as label. This would yield an accuracy of $99\%$ as that would be the fraction of correctly classified samples. Balanced accuracy, however, would return a value of $0.5$ as no sample with class A had been correctly identified, while all class B samples were correctly labeled.
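The example from the paragraph above, computed with `sklearn` (here class A is label 1):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = [0] * 99 + [1]   # 1% of samples belong to class A (label 1)
y_pred = [0] * 100        # classifier always predicts class B (label 0)

accuracy_score(y_true, y_pred)
# 0.99
balanced_accuracy_score(y_true, y_pred)
# 0.5
```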
+
+
+### Cohen's Kappa
+Cohen's kappa is another metric which is said to be very robust against class imbalance and, therefore, well suited for our task.
+
+Calculation:
+$$
+\text{Cohen's kappa} = \frac{\text{Accuracy} - p_e}{1 - p_e}
+$$
+
+with
+
+$$
+p_e = p(\text{AC = T}) * p(\text{PC = T}) + p(\text{AC = F}) * p(\text{PC = F})
+$$
+
+where $p$ denotes the probability, $p_e$ specifically designates the expected agreement by chance, $AC$ and $PC$ are the actual and predicted class, and $T$ and $F$ true and false, i.e. the class labels. This metric takes into account that correct classification can happen by chance. Values range from $-1$ to $1$: $0$ means the agreement is no better than chance, negative values mean agreement below chance level, and $1$ represents complete agreement.
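Continuing the imbalanced example from above: a majority-class predictor reaches high accuracy but a kappa of exactly 0, i.e. chance level (`sklearn`'s `cohen_kappa_score`):

```python
from sklearn.metrics import cohen_kappa_score

y_true = [0] * 99 + [1]
y_pred = [0] * 100        # always predicting the majority class

# accuracy is 0.99, but p_e is also 0.99, so kappa = 0 (pure chance)
cohen_kappa_score(y_true, y_pred)
# 0.0
```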
+
+
+
+### F-Score
+The $F_{\beta}$-score combines precision and recall into a single value. The relative contribution of the two measures can be adjusted with $\beta$: $\beta = 1$ weighs them equally, a value closer to $0$ weighs precision more strongly, and $\beta > 1$ favors recall. We chose the standard score with $\beta = 1$, as we deem precision and recall equally important. It can be calculated as follows:
+
+$$
+F_{\beta} = (1 + \beta^2) \frac{\text{Precision} * \text{Recall}}{(\beta^2 * \text{Precision}) + \text{Recall}}
+$$
+
+with
+
+$$
+\text{Precision} = \frac{TP}{TP + FP}
+$$
+
+and
+
+$$
+\text{Recall} = \frac{TP}{TP + FN}
+$$
+
+Values range from $0$ (worst) to $1$ (best).
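The $\beta = 1$ case is `sklearn`'s `f1_score` (toy labels; precision $0.5$, recall $1.0$):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]   # TP = 1, FP = 1, FN = 0

# precision = 1/2, recall = 1/1 -> F1 = 2 * 0.5 * 1.0 / (0.5 + 1.0)
f1_score(y_true, y_pred)
# 0.6666666666666666
```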
+
+
+
+## Hyperparameter Optimization
+All logged runs can be inspected in the tracking UI via `mlflow ui --backend-store-uri data/classification/mlflow`.
+
+After having done preprocessing and feature extraction, chosen evaluation metrics, and decided on the classifiers to employ with what kind of parameters, we ran different configurations on the training and validation set to find the most promising classifier and hyperparameter set. Listing the results of every possible combination would go beyond the scope of this documentation, which is why we will only provide an overview over all tested combinations and the most notable results. We tracked all results using the `mlflow` package, which allows for very convenient logging of the used parameters and metrics.
+
+
+### _k_ - Nearest Neighbor
+
+For the _k_-NN classifier, we tested the following parameter combinations:
+
+| Weight   | _k_ values    |
+| -------- | ------------- |
+| Uniform  | 1, 3, 5, 7, 9 |
+| Distance | 1, 3, 5, 7, 9 |
+
+We used only odd values for _k_ to avoid ties, since our task is binary classification: an odd number of neighbors can never split evenly between two classes.
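The ten configurations mirror how `run_classifier.py` builds the _k_-NN model (a `StandardScaler` followed by `KNeighborsClassifier`); a sketch of enumerating them:

```python
from itertools import product

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# the grid from the table above: 2 weight functions x 5 odd k values
grid = list(product(("uniform", "distance"), (1, 3, 5, 7, 9)))

pipelines = {
    (weights, k): make_pipeline(StandardScaler(),
                                KNeighborsClassifier(n_neighbors=k, weights=weights))
    for weights, k in grid
}
print(len(pipelines))  # 10 configurations
```

Each pipeline would then be fit on the training features and scored on the validation set exactly like any other classifier in the pipeline.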
+
+
+
+### Decision Tree
+For decision trees, we explored the following hyperparameter space:
+
+| Criterion     | Max depth                          |
+| ------------- | ---------------------------------- |
+| Gini impurity | 16, 18, 20, 22, 24, 26, 28, 30, 32 |
+| Entropy       | 16, 18, 20, 22, 24, 26, 28, 30, 32 |
+
+Additionally, we built the tree without depth restriction for both split criteria.
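The 20 resulting configurations (two criteria, nine depth limits plus unrestricted growth) can be swept as follows; the synthetic data here is only an illustrative stand-in for the extracted tweet features:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the extracted tweet features
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

scores = {}
for criterion in ("gini", "entropy"):
    for max_depth in list(range(16, 34, 2)) + [None]:  # None = unrestricted depth
        clf = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth,
                                     random_state=42)
        clf.fit(X, y)
        scores[(criterion, max_depth)] = clf.score(X, y)

print(len(scores))  # 20 configurations
```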
+
+
+
+### Random Forest
+The random forest classifier comes with the added parameter of a set number of trees per forest:
+
+| Parameter        | Values          |
+| ---------------- | --------------- |
+| Trees per forest | 10, 25, 50, 100 |
+
+For each possible number of trees per forest, we explored the same hyperparameter space as with the single decision tree. Further, we also built one forest without depth restriction for each combination of tree count and split criterion. A higher number of trees per forest usually yields better and more stable results, especially in terms of avoiding overfitting.
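A sketch of the forest sweep: `n_estimators` controls the number of trees, and the remaining grid matches the decision tree section (the data is again an illustrative stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

scores = {}
for n_trees in (10, 25, 50, 100):
    clf = RandomForestClassifier(n_estimators=n_trees, criterion="gini",
                                 max_depth=None,  # let the trees grow fully
                                 n_jobs=-1, random_state=42)
    clf.fit(X, y)
    scores[n_trees] = clf.score(X, y)

print(sorted(scores))  # [10, 25, 50, 100]
```

`n_jobs=-1` trains the trees on all available cores, as in our `run_classifier.py`.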
+
+
+### Support Vector Machine
+We tested the SVM classifier with four different kernels:
+
+| Parameter | Values                                             |
+| --------- | -------------------------------------------------- |
+| Kernel    | linear, polynomial, radial basis function, sigmoid |
+
+The computational cost of the SVM classifier is very high; training and validation took far longer than for any other classifier.
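As in `run_classifier.py`, each SVM is wrapped in a pipeline with a `StandardScaler`. Note that scikit-learn's `SVC` spells the polynomial kernel `"poly"`, and that its fit time grows at least quadratically with the number of samples, which explains the long runtimes on our full data set. A sketch on small synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=5, random_state=42)

models = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # standardize first: SVMs are sensitive to feature scale
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, gamma="auto"))
    clf.fit(X, y)
    models[kernel] = clf

print(sorted(models))  # ['linear', 'poly', 'rbf', 'sigmoid']
```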
+
+
+### Multilayer Perceptron
+We tried many different configurations for the MLP classifier: we built it first with only one hidden layer (which means three layers in total, including input and output), then with two and three hidden layers. For each depth, we tried every possible combination of neurons per hidden layer from the set (10, 25, 50), which yields a total of 39 combinations. The hyperparameter space for a network with three hidden layers and 10 neurons in hidden layer 1 would, e.g., look like this:
+
+
+| Layer 1 | Layer 2 | Layer 3    |
+| ------- | ------- | ---------- |
+| 10      | 10      | 10, 25, 50 |
+| 10      | 25      | 10, 25, 50 |
+| 10      | 50      | 10, 25, 50 |
+
+Additionally, we followed a promising lead and also trained a network with the hidden layer structure (100, 100, 100). Training was stopped after 200 iterations because the network had still not converged, but it nonetheless delivered the best result we observed with the MLP. We decided not to explore combinations with higher numbers of neurons because of the computational cost.
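The count of 39 tested layer structures (3 one-layer + 9 two-layer + 27 three-layer) can be enumerated with `itertools.product`; each resulting tuple would be passed to `MLPClassifier(hidden_layer_sizes=...)` as in `run_classifier.py`:

```python
from itertools import product

sizes = (10, 25, 50)
combos = [combo
          for depth in (1, 2, 3)                 # one, two, or three hidden layers
          for combo in product(sizes, repeat=depth)]

print(len(combos))  # 39 = 3 + 9 + 27
print(combos[:4])   # [(10,), (25,), (50,), (10, 10)]
```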
+
+
+
+### Naive Bayes
+Since our pipeline exposes no hyperparameters for the CNB classifier, we ran it only once.
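A sketch of this single run; note that `ComplementNB` requires non-negative feature values (a standardized feature matrix with negative entries would raise an error), which is why no `StandardScaler` precedes it in our pipeline. The data here is an illustrative stand-in:

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB

rng = np.random.RandomState(42)
X = rng.rand(100, 5)         # non-negative stand-in features
y = rng.randint(0, 2, 100)   # binary viral / non-viral labels

clf = ComplementNB()         # run once, no hyperparameters tuned
clf.fit(X, y)
pred = clf.predict(X)
print(set(pred) <= {0, 1})   # True: predictions are class labels
```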
+
+
+
+## Results
+An important note right away: we did not use our institute's grid for the hyperparameter optimization but ran the classifiers only on a local machine. The results we obtained stem from a naive exploration of the search space: we tried to narrow down interesting and promising configurations and ranges for every classifier by manual testing. Hence, the results may represent only local optima.
+
+The results per classifier for our evaluation metrics on the validation set can be seen in the figures below:
+
+
+
+
+ Fig 3: Accuracy per classifier
+
+
+
+ Fig 4: Balanced accuracy per classifier
+
+
+
+ Fig 5: Cohen's kappa per classifier
+
+
+
+ Fig 6: F1-score per classifier
+
+
+
+Figure 3 shows that the majority of classifiers and configurations achieve a high accuracy; the CNB is an exception, while the uniform classifier achieves an accuracy of 0.5, as expected. As discussed earlier in the section [Evaluation Metrics](#evaluation), this measurement unfortunately does not tell us much about the actual quality of a classifier.
+
+In Figure 4, we can see that none of our classifiers performs much better than the balanced accuracy baseline of 0.5. This value can easily be obtained by classifying most or all of the tweets as _false_. Only the CNB pulls ahead with a score of 0.622, which is still rather low.
+
+Looking further at Cohen's kappa in Figure 5, we can now see more of a difference between the configurations. The random forest classifier with 25 trees, Gini impurity, and no maximum depth performs best with a value of 0.134. We can also see that especially the MLP and SVM configurations are not useful. Decision tree, random forest, and _k_-NN perform similarly; CNB scores equally well, given that only one configuration was run.
+
+Figure 6 displays the F1-score and shows a similar picture to Figure 5: SVM and MLP do not perform well, while decision tree, random forest, and _k_-NN are again quite level in terms of mean score. Even the uniform dummy classifier performs on par with our other classifiers, since it probably assigns the correct label to about half of the positive samples. Again, CNB leads the field with a score of 0.233.
+
+Given the above results, we decided that the CNB classifier is the best choice to run on our test set. It performed best on both the balanced accuracy and F1-score, while being average on the Cohen's kappa metric.
+
+
+
+## Conclusion
+
+
+ Fig 7: Final result on test set.
+
+
+
+Figure 7 shows the result of our CNB classifier on the test set. The scores resemble those achieved on the validation set quite closely, which confirms our choice.
+
+In the end, we based our classifier decision not on one single metric as initially planned, but on a combination of values. That CNB worked best overall caught us by surprise, although we might just have been biased by the many configurations tested with the other classifiers compared to the single CNB run. A random forest with Gini impurity split criterion and unlimited depth would be our second choice; the number of trees should be at least 25 (higher numbers did not yield better results, but are possibly more robust to overfitting).
+
+We decided to drop accuracy as a decision-influencing metric because of its drawbacks on unbalanced data. The scores on the other three metrics are still not very satisfactory and leave much room for improvement. Our pipeline cannot be considered production-ready at this point due to its substandard performance. There are several possible reasons for this.
+
+First of all, the features we extracted are largely metadata-based. We implemented only two natural language based features, namely the sentiment score and named entity recognition. But even these two are not without error, as our examples have shown: sometimes they fail to label even simple examples correctly (see the sections [Sentiment Analysis](#sentiment_analysis) and [Named Entity Recognition](#ner) for details). It could also be that the features we extracted simply do not capture what makes a tweet go viral. There has been research into virality in the past, but it is not easy to pin down what exactly helps a tweet (or any piece of media, for that matter) become popular. Marketing agency OutGrow has put together an infographic with some aspects that seem to play a role in making content shareworthy[^outgrow].
+
+Further, we only did a naive hyperparameter optimization. It is possible that we found solutions only in a local optimum, while much more suitable classifier setups exist. In Figures 4-6, we can, for example, see that one MLP configuration, namely the one with hidden layer structure (100, 100, 100), outperforms the other MLPs. In this particular case, our exploration of the hyperparameter space was limited by computational capability; training more complex networks was simply not feasible for us.
+
+On the other hand, when testing the decision tree and random forest classifiers, we also let one tree or forest grow to full depth per parameter combination. We figured this might lead to overfitting on the validation set, but the unrestricted-depth configurations actually achieved the best (in one case second-best) performance, i.e., for each evaluation metric the best performing trees and forests were those that were allowed to grow to full depth.
+
+
+
+## Next Steps
+After having discussed the results and possible shortcomings of our pipeline, we would like to point out directions for further research.
+
+As already mentioned, we implemented only two natural language grounded features. It is likely that a greater focus on this kind of feature would lead to better classification results. One could, e.g., consider _n_-grams, word importance measures like tf-idf, or constituency and dependency parsing to measure syntactic complexity. There might also be more interesting (and more obvious) metadata features, such as the number of followers of a Twitter account or an account's history of viral posts. Such features, though, seem less interesting to us than actual language based features.
+
+A more thorough and well-thought-out implementation of classifiers based on our first results is another feasible direction. This work can be understood as groundwork to build on.
+
+Lastly, it should not be forgotten that we worked with a data set containing only tweets about data science. While narrowing the domain probably makes it easier to work with features such as keywords, it also makes it much harder to obtain large data sets and to find general patterns that transfer to new data.
+
+
+
+## Resources
+
Biau, G., & Scornet, E. (2016). A random forest guided tour. TEST, 25(2), 197–227. https://doi.org/10.1007/s11749-016-0481-7
+
+
Breiman, L. (2001). Random Forests. Machine Learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324
+
+
Gligorić, K., Anderson, A., & West, R. (2020). Adoption of Twitter’s New Length Limit: Is 280 the New 140? http://arxiv.org/abs/2009.07661
+
+
Rennie, J. D. M., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 616–623. https://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
+
+
+
+
+
+[^nltk]: , retrieved Oct 26, 2021
+[^oec]: , retrieved Oct 26, 2021
+[^nltk_stopwords]: , retrieved Oct 26, 2021
+[^gensim_stopwords]: , retrieved Oct 26, 2021
+[^spacy_stopwords]: , retrieved Oct 26, 2021
+[^gensim-punctuation]: , retrieved Oct 26, 2021
+[^gensim_stemming]: , retrieved Oct 26, 2021
+[^vader_pypi]: , retrieved Oct 15, 2021
+[^vader_homepage]: , retrieved Oct 15, 2021
+[^twitter_charlength]: , retrieved Oct 29, 2021
+[^sklearn_forest]: , retrieved Oct 31, 2021
+[^mlflow]: , retrieved Oct 31, 2021
+[^onto_notes]: , retrieved Oct 31, 2021
+[^wordnet]: , retrieved Oct 31, 2021
+[^spacy_ner]: , retrieved Oct 31, 2021
+[^outgrow]: , retrieved Oct 31, 2021
+
diff --git a/README.md b/README.md
index 26e7793d..afa22ece 100644
--- a/README.md
+++ b/README.md
@@ -7,7 +7,7 @@ As data source, we use the "Data Science Tweets 2010-2021" data set (version 3)
In order to install all necessary dependencies, please make sure that you have a local [Conda](https://docs.conda.io/en/latest/) distribution (e.g., Anaconda or miniconda) installed. Begin by creating a new environment called "MLinPractice" that has Python 3.6 installed:
-```conda create -y -q --name MLinPractice python=3.6```
+ ```conda create -y -q --name MLinPractice python=3.6```
You can enter this environment with `conda activate MLinPractice` (or `source activate MLinPractice`, if the former does not work). You can leave it with `conda deactivate` (or `source deactivate`, if the former does not work). Enter the environment and execute the following commands in order to install the necessary dependencies (this may take a while):
@@ -18,6 +18,11 @@ conda install -y -q -c conda-forge nltk=3.6.3
conda install -y -q -c conda-forge gensim=4.1.2
conda install -y -q -c conda-forge spyder=5.1.5
conda install -y -q -c conda-forge pandas=1.1.5
+conda install -c conda-forge mlflow
+conda install -c conda-forge vadersentiment
+conda install -c conda-forge spacy
+python -m spacy download en_core_web_sm
+
```
You can double-check that all of these packages have been installed by running `conda list` inside of your virtual environment. The Spyder IDE can be started by typing `~/miniconda/envs/MLinPractice/bin/spyder` in your terminal window (assuming you use miniconda, which is installed right in your home directory).
@@ -43,7 +48,7 @@ All python scripts and classes for the preprocessing of the input data can be fo
### Creating Labels
The script `create_labels.py` assigns labels to the raw data points based on a threshold on a linear combination of the number of likes and retweets. It is executed as follows:
-```python -m code.preprocessing.create_labels path/to/input_dir path/to/output.csv```
+ ```python -m code.preprocessing.create_labels path/to/input_dir path/to/output.csv```
Here, `input_dir` is the directory containing the original raw csv files, while `output.csv` is the single csv file where the output will be written.
The script takes the following optional parameters:
- `-l` or `--likes_weight` determines the relative weight of the number of likes a tweet has received. Defaults to 1.
@@ -53,11 +58,13 @@ The script takes the following optional parameters:
### Classical Preprocessing
The script `run_preprocessing.py` is used to run various preprocessing steps on the raw data, producing additional columns in the csv file. It is executed as follows:
-```python -m code.preprocessing.run_preprocessing path/to/input.csv path/to/output.csv```
+ ```python -m code.preprocessing.run_preprocessing path/to/input.csv path/to/output.csv```
Here, `input.csv` is a csv file (ideally the output of `create_labels.py`), while `output.csv` is the csv file where the output will be written.
The preprocessing steps to take can be configured with the following flags:
- `-p` or `--punctuation`: A new column "tweet_no_punctuation" is created, where all punctuation is removed from the original tweet. (See `code/preprocessing/punctuation_remover.py` for more details)
-- `-t`or `--tokenize`: Tokenize the given column (can be specified by `--tokenize_input`, default = "tweet"), and create new column with suffix "_tokenized" containing tokenized tweet.
+- `-t` or `--tokenize`: Tokenize the given column (can be specified by `--tokenize_input`, default = "tweet"), and create new column with suffix "_tokenized" containing tokenized tweet.
+- `-s` or `--stopwords`: Remove common stopwords from the given column (can be specified with `stopwords_input`, default = "tweet"), and returns new column with suffix "_stopwords_removed".
+- `-l` or `--lemmatize`: Modifies inflected or variant words into their base forms (= lemmas). Input column can be specified with `--lemmatize_input` (default = "tweet"), returns new column with suffix "_lemmatized".
Moreover, the script accepts the following optional parameters:
- `-e` or `--export` gives the path to a pickle file where an sklearn pipeline of the different preprocessing steps will be stored for later usage.
@@ -65,7 +72,7 @@ Moreover, the script accepts the following optional parameters:
### Splitting the Data Set
The script `split_data.py` splits the overall preprocessed data into training, validation, and test set. It can be invoked as follows:
-```python -m code.preprocessing.split_data path/to/input.csv path/to/output_dir```
+ ```python -m code.preprocessing.split_data path/to/input.csv path/to/output_dir```
Here, `input.csv` is the input csv file to split (containing a column "label" with the label information, i.e., `create_labels.py` needs to be run beforehand) and `output_dir` is the directory where three individual csv files `training.csv`, `validation.csv`, and `test.csv` will be stored.
The script takes the following optional parameters:
- `-t` or `--test_size` determines the relative size of the test set and defaults to 0.2 (i.e., 20 % of the data).
@@ -77,8 +84,8 @@ The script takes the following optional parameters:
All python scripts and classes for feature extraction can be found in `code/feature_extraction/`.
-The script `extract_features.py` takes care of the overall feature extraction process and can be invoked as follows:
-```python -m code.feature_extraction.extract_features path/to/input.csv path/to/output.pickle```
+The script `extract_features.py` takes care of the overall feature extraction process and can be invoked as follows:
+ ```python -m code.feature_extraction.extract_features path/to/input.csv path/to/output.pickle```
Here, `input.csv` is the respective training, validation, or test set file created by `split_data.py`. The file `output.pickle` will be used to store the results of the feature extraction process, namely a dictionary with the following entries:
- `"features"`: a numpy array with the raw feature values (rows are training examples, colums are features)
- `"feature_names"`: a list of feature names for the columns of the numpy array
@@ -86,6 +93,18 @@ Here, `input.csv` is the respective training, validation, or test set file creat
The features to be extracted can be configured with the following optional parameters:
- `-c` or `--char_length`: Count the number of characters in the "tweet" column of the data frame. (see code/feature_extraction/character_length.py)
+- `-m` or `--month`: Extract the month in which the tweet was published from the "date" column of the data frame.
+- `-s` or `--sentiment`: Extract a compound sentiment value from the original tweet using VADER.
+- `-p` or `--photos`: Extracts a binary feature for whether the tweet has photo(s) attached, from the "photo" column
+- `-@` or `--mention`: Extracts a binary feature for whether the tweet author mentioned someone, from the "mention" column
+- `-u` or `--url`: Extracts a binary feature for whether a URL is attached to the tweet, from the "url" column
+- `-rt` or `--retweet`: Extracts the number of retweets from the "retweet_count" column
+- `-re` or `--replies`: Extracts the number of replies from the "replies_count" column
+- `-#` or `--hashtag`: Extracts a binary feature for whether a hashtag is attached to the tweet, from the "hashtag" column
+- `-l` or `--likes`: Extracts the number of likes from the "likes_count" column
+- `-d` or `--daytime`: Extracts the time of day at which the tweet was posted, from the "time" column
+- `-n` or `--ner`: Counts the number of named entities in the tweet
+
Moreover, the script support importing and exporting fitted feature extractors with the following optional arguments:
- `-i` or `--import_file`: Load a configured and fitted feature extraction from the given pickle file. Ignore all parameters that configure the features to extract.
@@ -97,7 +116,7 @@ All python scripts and classes for dimensionality reduction can be found in `cod
The script `reduce_dimensionality.py` takes care of the overall dimensionality reduction procedure and can be invoked as follows:
-```python -m code.dimensionality_reduction.reduce_dimensionality path/to/input.pickle path/to/output.pickle```
+ ```python -m code.dimensionality_reduction.reduce_dimensionality path/to/input.pickle path/to/output.pickle```
Here, `input.pickle` is the respective training, validation, or test set file created by `extract_features.py`.
The file `output.pickle` will be used to store the results of the dimensionality reduction process, containing `"features"` (which are the selected/projected ones) and `"labels"` (same as in the input file).
@@ -117,16 +136,31 @@ All python scripts and classes for classification can be found in `code/classifi
### Train and Evaluate a Single Classifier
The script `run_classifier.py` can be used to train and/or evaluate a given classifier. It can be executed as follows:
-```python -m code.classification.run_classifier path/to/input.pickle```
+ ```python -m code.classification.run_classifier path/to/input.pickle```
Here, `input.pickle` is a pickle file of the respective data subset, produced by either `extract_features.py` or `reduce_dimensionality.py`.
By default, this data is used to train a classifier, which is specified by one of the following optional arguments:
- `-m` or `--majority`: Majority vote classifier that always predicts the majority class.
- `-f` or `--frequency`: Dummy classifier that makes predictions based on the label frequency in the training data.
+- `-u` or `--uniform`: uniform (random) classifier
+- `--knn`: k nearest neighbor classifier with the specified value of k, default = None
+- `--knn_weights`: weight function of knn, which can optionally be chosen: uniform or distance, default = uniform
+- `--tree`: decision tree classifier with the specified value as max_depth, default = None
+- `--tree_criterion`: Criterion to measure split quality: gini or entropy, default = "gini"
+- `--svm`: Support vector machine with specified kernel: "linear", "polynomial", "rbf", or "sigmoid", default = None
+- `--randforest`: Random forest classifier with specified value as # of trees in forest, default = None
+- `--forest_criterion`: Criterion to measure split quality: "gini" or "entropy", default = "gini"
+- `--forest_max_depth`: max_depth of trees in forest, default = None
+- `--mlp`: Multilayer perceptron classifier, values resemble hidden layer sizes (1 value per layer), default = None
+- `--bayes`: Complement naive Bayes classifier
+
+
The classifier is then evaluated, using the evaluation metrics as specified through the following optional arguments:
- `-a`or `--accuracy`: Classification accurracy (i.e., percentage of correctly classified examples).
- `-k`or `--kappa`: Cohen's kappa (i.e., adjusting accuracy for probability of random agreement).
+- `-f1` or `--f1_score`: F1-score, calculated from precision and recall
+- `-ba` or `--balanced_accuracy`: Balanced classification accuracy
Moreover, the script support importing and exporting trained classifiers with the following optional arguments:
@@ -142,5 +176,5 @@ All python code for the application demo can be found in `code/application/`.
The script `application.py` provides a simple command line interface, where the user is asked to type in their prospective tweet, which is then analyzed using the trained ML pipeline.
The script can be invoked as follows:
-```python -m code.application.application path/to/preprocessing.pickle path/to/feature_extraction.pickle path/to/dimensionality_reduction.pickle path/to/classifier.pickle```
-The four pickle files correspond to the exported versions for the different pipeline steps as created by `run_preprocessing.py`, `extract_features.py`, `reduce_dimensionality.py`, and `run_classifier.py`, respectively, with the `-e` option.
+ ```python -m code.application.application path/to/preprocessing.pickle path/to/feature_extraction.pickle path/to/dimensionality_reduction.pickle path/to/classifier.pickle```
+The four pickle files correspond to the exported versions for the different pipeline steps as created by `run_preprocessing.py`, `extract_features.py`, `reduce_dimensionality.py`, and `run_classifier.py`, respectively, with the `-e` option.
\ No newline at end of file
diff --git a/code/application/application.py b/code/application/application.py
index 159aafaa..bd990e07 100644
--- a/code/application/application.py
+++ b/code/application/application.py
@@ -13,6 +13,7 @@
from sklearn.pipeline import make_pipeline
from code.util import COLUMN_TWEET
+
# setting up CLI
parser = argparse.ArgumentParser(description = "Application")
parser.add_argument("preprocessing_file", help = "path to the pickle file containing the preprocessing")
@@ -29,7 +30,7 @@
with open(args.dim_red_file, 'rb') as f_in:
dimensionality_reduction = pickle.load(f_in)
with open(args.classifier_file, 'rb') as f_in:
- classifier = pickle.load(f_in)
+ classifier = pickle.load(f_in)["classifier"]
# chain them together into a single pipeline
pipeline = make_pipeline(preprocessing, feature_extraction, dimensionality_reduction, classifier)
@@ -56,5 +57,4 @@
confidence = pipeline.predict_proba(df)
print("Prediction: {0}, Confidence: {1}".format(prediction, confidence))
- print("")
-
+ print("")
\ No newline at end of file
diff --git a/code/classification.sh b/code/classification.sh
index ceb7ac18..ec947e76 100755
--- a/code/classification.sh
+++ b/code/classification.sh
@@ -5,10 +5,16 @@ mkdir -p data/classification/
# run feature extraction on training set (may need to fit extractors)
echo " training set"
-python -m code.classification.run_classifier data/dimensionality_reduction/training.pickle -e data/classification/classifier.pickle --knn 5 -s 42 --accuracy --kappa
+#python -m code.classification.run_classifier data/dimensionality_reduction/training.pickle -e data/classification/classifier.pickle --svm 'rbf' -s 42 -a -k -f1 -ba
+python -m code.classification.run_classifier data/feature_extraction/training.pickle -e data/classification/classifier.pickle --bayes -s 42 -a -k -f1 -ba
+
# run feature extraction on validation set (with pre-fit extractors)
echo " validation set"
-python -m code.classification.run_classifier data/dimensionality_reduction/validation.pickle -i data/classification/classifier.pickle --accuracy --kappa
+#python -m code.classification.run_classifier data/dimensionality_reduction/validation.pickle -i data/classification/classifier.pickle -a -k -f1 -ba
+python -m code.classification.run_classifier data/feature_extraction/validation.pickle -i data/classification/classifier.pickle -a -k -f1 -ba
+
-# don't touch the test set, yet, because that would ruin the final generalization experiment!
\ No newline at end of file
+# don't touch the test set, yet, because that would ruin the final generalization experiment!
+echo " test set"
+python -m code.classification.run_classifier data/feature_extraction/test.pickle -i data/classification/classifier.pickle -a -k -f1 -ba
\ No newline at end of file
diff --git a/code/classification/run_classifier.py b/code/classification/run_classifier.py
index b9d55245..00ce13aa 100644
--- a/code/classification/run_classifier.py
+++ b/code/classification/run_classifier.py
@@ -10,10 +10,16 @@
import argparse, pickle
from sklearn.dummy import DummyClassifier
-from sklearn.metrics import accuracy_score, cohen_kappa_score
+from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, balanced_accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.svm import SVC
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.neural_network import MLPClassifier
+from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
+from mlflow import log_metric, log_param, set_tracking_uri
# setting up CLI
parser = argparse.ArgumentParser(description = "Classifier")
@@ -21,42 +27,158 @@
parser.add_argument("-s", '--seed', type = int, help = "seed for the random number generator", default = None)
parser.add_argument("-e", "--export_file", help = "export the trained classifier to the given location", default = None)
parser.add_argument("-i", "--import_file", help = "import a trained classifier from the given location", default = None)
+
+# <--- Classifier --->
parser.add_argument("-m", "--majority", action = "store_true", help = "majority class classifier")
parser.add_argument("-f", "--frequency", action = "store_true", help = "label frequency classifier")
+parser.add_argument("-u", "--uniform", action = "store_true", help = "uniform (random) classifier")
parser.add_argument("--knn", type = int, help = "k nearest neighbor classifier with the specified value of k", default = None)
+parser.add_argument("--knn_weights", type = str, help = "weight function of knn, uniform or distance", default = "uniform")
+parser.add_argument("--tree", action = "store_true", help = "decision tree classifier", default = None)
+parser.add_argument("--tree_depth", type = int, help = "max depth of decision tree", default = None)
+parser.add_argument("--tree_criterion", type = str, help = "criterion to measure split quality, gini or entropy", default = "gini")
+parser.add_argument("--svm", type = str, help = "support vector machine with specified kernel: linear, polynomial, rbf, or sigmoid", default = None)
+parser.add_argument("--randforest", type = int, help = "random forest classifier with specified value as # of trees in forest", default = None)
+parser.add_argument("--forest_criterion", type = str, help = "criterion to measure split quality, gini or entropy", default = "gini")
+parser.add_argument("--forest_max_depth", type = int, help = "max depth of trees in forest", default = None)
+parser.add_argument("--mlp", nargs = "+", type = int, help = "multilayer perceptron classifier, values resemble hidden layer sizes (1 value per layer)", default = None)
+parser.add_argument("--bayes", action = "store_true", help = "complement naive bayes classifier")
+
+# <--- Evaluation metrics --->
parser.add_argument("-a", "--accuracy", action = "store_true", help = "evaluate using accuracy")
parser.add_argument("-k", "--kappa", action = "store_true", help = "evaluate using Cohen's kappa")
+parser.add_argument("-f1", "--f1_score", action = "store_true", help = "evaluate using F1 score")
+parser.add_argument("-ba", "--balanced_accuracy", action = "store_true", help = "evaluate using balanced accuracy score")
+
+# <--- Param optimization --->
+parser.add_argument("--log_folder", help = "where to log the mlflow results", default = "data/classification/mlflow")
+
args = parser.parse_args()
# load data
with open(args.input_file, 'rb') as f_in:
data = pickle.load(f_in)
+set_tracking_uri(args.log_folder)
+
if args.import_file is not None:
# import a pre-trained classifier
with open(args.import_file, 'rb') as f_in:
- classifier = pickle.load(f_in)
+ input_dict = pickle.load(f_in)
+
+ classifier = input_dict["classifier"]
+ for param, value in input_dict["params"].items():
+ log_param(param, value)
+
+ log_param("dataset", "validation")
else: # manually set up a classifier
if args.majority:
# majority vote classifier
print(" majority vote classifier")
+ log_param("classifier", "majority")
+ params = {"classifier": "majority"}
classifier = DummyClassifier(strategy = "most_frequent", random_state = args.seed)
elif args.frequency:
# label frequency classifier
print(" label frequency classifier")
+ log_param("classifier", "stratified")
+ params = {"classifier": "stratified"}
classifier = DummyClassifier(strategy = "stratified", random_state = args.seed)
+ elif args.uniform:
+ # uniform classifier
+ print(" uniform classifier")
+ log_param("classifier", "uniform")
+ params = {"classifier": "uniform"}
+ classifier = DummyClassifier(strategy = "uniform", random_state = args.seed)
elif args.knn is not None:
- print(" {0} nearest neighbor classifier".format(args.knn))
+ # k nearest neighbour classifier
+ print(" {0} nearest neighbor classifier, {1} weights".format(args.knn, args.knn_weights))
+
+ log_param("classifier", "knn")
+ log_param("k", args.knn)
+ log_param("weights", args.knn_weights)
+ params = {"classifier": "knn",
+ "k": args.knn,
+ "weights": args.knn_weights}
+
standardizer = StandardScaler()
- knn_classifier = KNeighborsClassifier(args.knn)
+ knn_classifier = KNeighborsClassifier(n_neighbors = args.knn, weights = args.knn_weights, n_jobs = -1)
classifier = make_pipeline(standardizer, knn_classifier)
+
+ elif args.tree is not None:
+ # decision tree classifier
+ print(" decision tree with max depth {0}, {1} split criterion".format(args.tree_depth, args.tree_criterion))
+
+ log_param("classifier", "tree")
+ log_param("criterion", args.tree_criterion)
+ log_param("max_depth", args.tree_depth)
+ params = {"classifier": "tree",
+ "criterion": args.tree_criterion,
+ "max_depth": args.tree_depth}
+
+ #standardizer = StandardScaler()
+ classifier = DecisionTreeClassifier(criterion = args.tree_criterion, max_depth = args.tree_depth)
+ #classifier = make_pipeline(standardizer, decision_tree)
+ elif args.svm is not None:
+ # support vector machine
+ print(" svm classifier, kernel: {0}".format(args.svm))
+
+ log_param("classifier", "svm")
+ log_param("kernel", args.svm)
+ params = {"classifier": "svm",
+ "kernel": args.svm}
+
+ standardizer = StandardScaler()
+ svm_classifier = SVC(kernel = args.svm, gamma = "auto")
+ classifier = make_pipeline(standardizer, svm_classifier)
+
+ elif args.randforest is not None:
+ # random forest classifier
+ print(" random forest classifier with {0} trees, max depth {1}, {2} criterion".format(args.randforest, args.forest_max_depth, args.forest_criterion))
+
+ log_param("classifier", "random forest")
+ log_param("nr trees", args.randforest)
+ log_param("max depth", args.forest_max_depth)
+ log_param("criterion", args.forest_criterion)
+ params = {"classifier": "random forest",
+ "nr trees": args.randforest,
+ "max depth": args.forest_max_depth,
+ "criterion": args.forest_criterion}
+
+ classifier = RandomForestClassifier(n_estimators = args.randforest, criterion = args.forest_criterion, max_depth = args.forest_max_depth, n_jobs = -1)
+
+ elif args.mlp is not None:
+ # multilayer perceptron
+ print(" multilayer perceptron with hidden layer size {0}".format(args.mlp))
+
+ log_param("classifier", "mlp")
+ log_param("hidden layer sizes", args.mlp)
+ params = {"classifier": "mlp",
+ "hidden layer sizes": args.mlp}
+
+ standardizer = StandardScaler()
+ mlp_classifier = MLPClassifier(hidden_layer_sizes = tuple(args.mlp))
+ classifier = make_pipeline(standardizer, mlp_classifier)
+
+ elif args.bayes:
+    # complement naive bayes
+ print(" complement NB classifier")
+
+ log_param("classifier", "complementNB")
+ params = {"classifier": "complementNB"}
+
+ #standardizer = StandardScaler()
+ classifier = ComplementNB()
+ #classifier = make_pipeline(standardizer, nb_classifer)
+
classifier.fit(data["features"], data["labels"].ravel())
+ log_param("dataset", "training")
# now classify the given data
prediction = classifier.predict(data["features"])
@@ -64,15 +186,22 @@
# collect all evaluation metrics
evaluation_metrics = []
if args.accuracy:
- evaluation_metrics.append(("accuracy", accuracy_score))
+ evaluation_metrics.append(("Accuracy", accuracy_score))
if args.kappa:
- evaluation_metrics.append(("Cohen's kappa", cohen_kappa_score))
+ evaluation_metrics.append(("Cohens_kappa", cohen_kappa_score))
+if args.f1_score:
+ evaluation_metrics.append(("F1_score", f1_score))
+if args.balanced_accuracy:
+ evaluation_metrics.append(("Balanced_accuracy", balanced_accuracy_score))
# compute and print them
for metric_name, metric in evaluation_metrics:
- print(" {0}: {1}".format(metric_name, metric(data["labels"], prediction)))
+ metric_value = metric(data["labels"], prediction)
+ print(" {0}: {1}".format(metric_name, metric_value))
+ log_metric(metric_name, metric_value)
# export the trained classifier if the user wants us to do so
if args.export_file is not None:
+ output_dict = {"classifier": classifier, "params": params}
with open(args.export_file, 'wb') as f_out:
- pickle.dump(classifier, f_out)
\ No newline at end of file
+ pickle.dump(output_dict, f_out)
\ No newline at end of file
diff --git a/code/examples.py b/code/examples.py
index 69b2b3e3..816c840c 100644
--- a/code/examples.py
+++ b/code/examples.py
@@ -87,12 +87,28 @@
# NER
text = "John Wilkes Booth shot Abraham Lincoln. Abraham Lincoln was not shot inside the White House."
sentences = nltk.sent_tokenize(text)
+print(sentences[0])
for sentence in sentences:
- words = nltk.word_tokenize(sentence)
- pos_tagged = nltk.pos_tag(words)
+ #words = nltk.word_tokenize(sentence)
+ pos_tagged = nltk.pos_tag([sentence])
ne_chunked = nltk.ne_chunk(pos_tagged)
+ print(pos_tagged)
print(ne_chunked)
-
+ print(ne_chunked.label())
+    for i in ne_chunked:
+        # leaves of the chunk tree are plain (word, tag) tuples without a label
+        if hasattr(i, "label"):
+            print(i.label())
+
+import spacy
+import en_core_web_sm
+
+nlp = en_core_web_sm.load()
+doc = nlp(text)
+for i in doc:
+ print(i.ent_type_)
+l = [(X.text, X.label_) for X in doc.ents]
+print(len(doc.ents))
+print(len(l))
+
# WordNet
dog_synsets = nltk.corpus.wordnet.synsets('dog')
@@ -195,10 +211,39 @@
print("Compare: ", X[0], embedded_transformed[0])
+###############################################################################
+##################### Pickle ################################################
+###############################################################################
+# inspect the contents of the generated pickle files
+object = pd.read_pickle(r'C:/Users/Yannik/Desktop/Uni/Semester5/ml/MLinPractice/data/feature_extraction/training.pickle')
+print(object)
+
+###############################################################################
+##################### other testing #########################################
+###############################################################################
-
-
+count = 0
+count2 = 0
+data = df["created_at"]
+for i in data:
+ a = i.split(" ")
+ count2 = count2 + 1
+ if a[-1] == "IST":
+ count = count +1
+print(count)
+print(count2)
+
+count3 = 0
+count4 = 0
+data = df["language"]
+for i in data:
+ a = i.split(" ")
+ count4 = count4 + 1
+ if a[-1] == "en":
+ count3 = count3 +1
+print(count3)
+print(count4)
diff --git a/code/feature_extraction.sh b/code/feature_extraction.sh
index f494f835..444ec425 100755
--- a/code/feature_extraction.sh
+++ b/code/feature_extraction.sh
@@ -3,12 +3,16 @@
# create directory if not yet existing
mkdir -p data/feature_extraction/
+# install spaCy language model
+python -m spacy download en_core_web_sm
+
# run feature extraction on training set (may need to fit extractors)
echo " training set"
-python -m code.feature_extraction.extract_features data/preprocessing/split/training.csv data/feature_extraction/training.pickle -e data/feature_extraction/pipeline.pickle --char_length
+python -m code.feature_extraction.extract_features data/preprocessing/split/training.csv data/feature_extraction/training.pickle -e data/feature_extraction/pipeline.pickle -c -s -p -@ -u -# -d -n
# run feature extraction on validation set and test set (with pre-fit extractors)
echo " validation set"
python -m code.feature_extraction.extract_features data/preprocessing/split/validation.csv data/feature_extraction/validation.pickle -i data/feature_extraction/pipeline.pickle
+
echo " test set"
python -m code.feature_extraction.extract_features data/preprocessing/split/test.csv data/feature_extraction/test.pickle -i data/feature_extraction/pipeline.pickle
\ No newline at end of file
diff --git a/code/feature_extraction/bigrams.py b/code/feature_extraction/bigrams.py
index 6c0c4b3a..be924176 100644
--- a/code/feature_extraction/bigrams.py
+++ b/code/feature_extraction/bigrams.py
@@ -1,23 +1,32 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
+Collect bigrams from a tweet (not finished yet).
+
Created on Thu Oct 7 14:53:52 2021
-@author: ml
+@author: lbechberger
"""
import ast
import nltk
from code.feature_extraction.feature_extractor import FeatureExtractor
+
class BigramFeature(FeatureExtractor):
+ """Collects bigrams."""
def __init__(self, input_column):
+ """Constructor, calls super Constructor."""
super().__init__([input_column], "{0}_bigrams".format(input_column))
+
+ # don't need to fit, so don't overwrite _set_variables()
+
+
def _set_variables(self, inputs):
-
+ """"""
overall_text = []
for line in inputs:
tokens = ast.literal_eval(line.item())
diff --git a/code/feature_extraction/character_length.py b/code/feature_extraction/character_length.py
index 0349bf94..f0f659ac 100644
--- a/code/feature_extraction/character_length.py
+++ b/code/feature_extraction/character_length.py
@@ -11,18 +11,22 @@
import numpy as np
from code.feature_extraction.feature_extractor import FeatureExtractor
-# class for extracting the character-based length as a feature
+
class CharacterLength(FeatureExtractor):
+ """Extracts the character-based length as a feature."""
- # constructor
+
def __init__(self, input_column):
+ """Constructor with given input_column."""
super().__init__([input_column], "{0}_charlength".format(input_column))
+
# don't need to fit, so don't overwrite _set_variables()
- # compute the word length based on the inputs
+
def _get_values(self, inputs):
-
+ """Compute the word length based on the input."""
result = np.array(inputs[0].str.len())
result = result.reshape(-1,1)
- return result
+
+ return result
\ No newline at end of file
diff --git a/code/feature_extraction/daytime.py b/code/feature_extraction/daytime.py
new file mode 100644
index 00000000..b7956149
--- /dev/null
+++ b/code/feature_extraction/daytime.py
@@ -0,0 +1,60 @@
+# -*- coding: utf-8 -*-
+"""
+Extracts time from a tweet and one-hot encodes it.
+
+Created on Sat Oct 23 17:51:46 2021
+
+@author: Yannik
+modified by dhesenkamp
+"""
+
+from code.feature_extraction.feature_extractor import FeatureExtractor
+import numpy as np
+import pandas as pd
+
+
+class Daytime(FeatureExtractor):
+ """Extracting time from a given input_column"""
+
+ def __init__(self, input_column):
+ """Constructor with given input_column."""
+ super().__init__([input_column], "day_{0}".format(input_column))
+
+
+ # don't need to fit, so don't overwrite _set_variables()
+
+
+ def _get_values(self, inputs):
+ """
+        Extracts the hour at which the tweet was posted, maps it to a
+        time-of-day category, and one-hot encodes the result before returning.
+ """
+ daytime = []
+ for i in inputs[0]:
+ time = i.split(":")
+ hour = int(time[0])
+
+ # night hours
+ if hour in range(0, 5):
+ daytime.append(0)
+
+ # morning hours
+ if hour in range(5, 10):
+ daytime.append(1)
+
+ # midday
+ if hour in range(10, 15):
+ daytime.append(2)
+
+ # afternoon
+ if hour in range(15, 19):
+ daytime.append(3)
+
+ # evening hours
+ if hour in range(19, 24):
+ daytime.append(4)
+
+        # one-hot encoding
+        # note: pd.get_dummies only creates columns for categories that occur
+        # in the data, so the number of columns can vary between data sets
+        result = np.array(pd.get_dummies(daytime))
+
+ return result
\ No newline at end of file
diff --git a/code/feature_extraction/extract_features.py b/code/feature_extraction/extract_features.py
index a3527acf..cb5bf426 100644
--- a/code/feature_extraction/extract_features.py
+++ b/code/feature_extraction/extract_features.py
@@ -13,7 +13,18 @@
import numpy as np
from code.feature_extraction.character_length import CharacterLength
from code.feature_extraction.feature_collector import FeatureCollector
-from code.util import COLUMN_TWEET, COLUMN_LABEL
+from code.feature_extraction.month import MonthExtractor
+from code.feature_extraction.sentiment import SentimentAnalyzer
+from code.feature_extraction.photos import Photos
+from code.feature_extraction.mention import Mentions
+from code.feature_extraction.retweets import RetweetExtractor
+from code.feature_extraction.url import URL
+from code.feature_extraction.replies import RepliesExtractor
+from code.feature_extraction.hashtags import Hashtags
+from code.feature_extraction.likes import Likes
+from code.feature_extraction.daytime import Daytime
+from code.feature_extraction.ner import NER
+from code.util import COLUMN_TWEET, COLUMN_LABEL, COLUMN_MONTH, COLUMN_PHOTOS, COLUMN_MENTIONS, COLUMN_URL, COLUMN_RETWEETS, COLUMN_REPLIES, COLUMN_HASHTAG, COLUMN_LIKES, COLUMN_TIME
# setting up CLI
@@ -22,7 +33,21 @@
parser.add_argument("output_file", help = "path to the output pickle file")
parser.add_argument("-e", "--export_file", help = "create a pipeline and export to the given location", default = None)
parser.add_argument("-i", "--import_file", help = "import an existing pipeline from the given location", default = None)
+
+# <--- Features --->
parser.add_argument("-c", "--char_length", action = "store_true", help = "compute the number of characters in the tweet")
+parser.add_argument("-m", "--month", action = "store_true", help = "extract month in which tweet was published")
+parser.add_argument("-s", "--sentiment", action = "store_true", help = "extracts compound sentiment from original tweet")
+parser.add_argument("-p", "--photos", action = "store_true", help = "extracts binary for whether tweet has photo(s) attached")
+parser.add_argument("-@", "--mention", action = "store_true", help = "extracts binary for whether someone has been mentioned by the tweet author")
+parser.add_argument("-u", "--url", action = "store_true", help = "extracts binary for whether a url is attached to tweet")
+parser.add_argument("-rt", "--retweet", action = "store_true", help = "extracts number of retweets")
+parser.add_argument("-re", "--replies", action = "store_true", help = "extracts number of replies")
+parser.add_argument("-#", "--hashtag", action = "store_true", help = "extracts binary for whether a hashtag is attached to tweet")
+parser.add_argument("-l", "--likes", action = "store_true", help = "extracts amount of likes from a tweet")
+parser.add_argument("-d", "--daytime", action = "store_true", help = "extracts at which time of day tweet was tweeted")
+parser.add_argument("-n", "--ner", action = "store_true", help = "collects the number of entities in a tweet")
+
args = parser.parse_args()
# load data
@@ -33,14 +58,58 @@
with open(args.import_file, "rb") as f_in:
feature_collector = pickle.load(f_in)
-else: # need to create FeatureCollector manually
-
+# need to create FeatureCollector manually
+else:
# collect all feature extractors
features = []
if args.char_length:
# character length of original tweet (without any changes)
features.append(CharacterLength(COLUMN_TWEET))
+
+ if args.month:
+ # month in which tweet was published
+ features.append(MonthExtractor(COLUMN_MONTH))
+
+ if args.sentiment:
+ # compound sentiment of tweet
+ features.append(SentimentAnalyzer(COLUMN_TWEET))
+
+ if args.photos:
+ # photos attached to original tweet
+ features.append(Photos(COLUMN_PHOTOS))
+
+ if args.mention:
+ # mentions contained in tweet
+ features.append(Mentions(COLUMN_MENTIONS))
+
+ if args.url:
+ # url attached to tweet
+ features.append(URL(COLUMN_URL))
+
+ if args.retweet:
+ # number of retweets
+ features.append(RetweetExtractor(COLUMN_RETWEETS))
+ if args.replies:
+ # number of replies
+ features.append(RepliesExtractor(COLUMN_REPLIES))
+
+ if args.hashtag:
+        # whether a hashtag is attached
+ features.append(Hashtags(COLUMN_HASHTAG))
+
+ if args.likes:
+ # number of likes
+ features.append(Likes(COLUMN_LIKES))
+
+ if args.daytime:
+        # time of day at which the tweet was posted
+ features.append(Daytime(COLUMN_TIME))
+
+ if args.ner:
+        # number of named entities
+ features.append(NER(COLUMN_TWEET))
+
# create overall FeatureCollector
feature_collector = FeatureCollector(features)
diff --git a/code/feature_extraction/feature_collector.py b/code/feature_extraction/feature_collector.py
index d2fca494..5e9dc6b1 100644
--- a/code/feature_extraction/feature_collector.py
+++ b/code/feature_extraction/feature_collector.py
@@ -11,12 +11,19 @@
import numpy as np
from code.feature_extraction.feature_extractor import FeatureExtractor
-# extend FeatureExtractor for the sake of simplicity
+
class FeatureCollector(FeatureExtractor):
+ """
+    Collects feature values from many different feature extractors.
+    Extends FeatureExtractor for the sake of simplicity.
+ """
- # constructor
+
def __init__(self, features):
-
+ """
+    Constructor: stores the features, collects their input columns,
+    removes duplicate columns, and calls the super constructor.
+ """
# store features
self._features = features
@@ -32,15 +39,20 @@ def __init__(self, features):
super().__init__(input_columns, "FeatureCollector")
- # overwrite fit: instead of calling _set_variables(), we forward the call to the features
def fit(self, df):
-
+ """
+        Overwrites fit: instead of calling _set_variables(),
+ we forward the call to the features.
+ """
for feature in self._features:
feature.fit(df)
- # overwrite transform: instead of calling _get_values(), we forward the call to the features
+
def transform(self, df):
-
+ """
+        Overwrites transform: instead of calling _get_values(),
+ we forward the call to the features.
+ """
all_feature_values = []
for feature in self._features:
@@ -49,7 +61,9 @@ def transform(self, df):
result = np.concatenate(all_feature_values, axis = 1)
return result
+
def get_feature_names(self):
+ """Getter for feature names."""
feature_names = []
for feature in self._features:
feature_names.append(feature.get_feature_name())
diff --git a/code/feature_extraction/feature_extractor.py b/code/feature_extraction/feature_extractor.py
index e8db5d84..c5a7f8c0 100644
--- a/code/feature_extraction/feature_extractor.py
+++ b/code/feature_extraction/feature_extractor.py
@@ -10,57 +10,74 @@
from sklearn.base import BaseEstimator, TransformerMixin
-# base class for all feature extractors
-# inherits from BaseEstimator (as pretty much everything in sklearn)
-# and TransformerMixin (allowing for fit, transform, and fit_transform methods)
+
class FeatureExtractor(BaseEstimator,TransformerMixin):
+ """
+ Base class for all feature extractors.
+    Inherits from sklearn's BaseEstimator and TransformerMixin
+    (allowing for fit, transform, and fit_transform methods).
+ """
- # constructor
+
def __init__(self, input_columns, feature_name):
+ """
+        Constructor:
+        calls the super constructors of BaseEstimator and TransformerMixin,
+        and sets the respective _input_columns and _feature_name.
+ """
super(BaseEstimator, self).__init__()
super(TransformerMixin, self).__init__()
self._input_columns = input_columns
self._feature_name = feature_name
- # access to feature name
+
def get_feature_name(self):
+ """Getter for feature name."""
return self._feature_name
- # access to input colums
+
def get_input_columns(self):
+ """Getter for input columns."""
return self._input_columns
- # set internal variables based on input columns
- # to be implemented by subclass!
def _set_variables(self, inputs):
+ """
+ Set the internal variables based on input columns.
+ Needs to be implemented by subclass!
+ """
pass
- # fit function: takes pandas DataFrame to set any internal variables
+
def fit(self, df):
-
+ """Fit function: takes pandas DataFrame to set any internal variables"""
inputs = []
- # collect all input columns from df
+ # collect all input columns from the DataFrame
for input_col in self._input_columns:
inputs.append(df[input_col])
- # call _set_variables (to be implemented by subclass)
+ # call _set_variables
self._set_variables(inputs)
return self
- # get feature values based on input column and internal variables
- # should return a numpy array
- # to be implemented by subclass!
def _get_values(self, inputs):
+ """
+        Get the feature values based on the inputs from the DataFrame
+        and internal variables. Should return a numpy array.
+ Needs to be implemented by subclass.
+ """
pass
- # transform function: transforms pandas DataFrame to numpy array of feature values
- def transform(self, df):
+ def transform(self, df):
+ """
+ Transform function: transforms pandas DataFrame
+ based on any internal variables.
+ """
inputs = []
- # collect all input columns from df
+ # collect all input columns from the DataFrame
for input_col in self._input_columns:
inputs.append(df[input_col])
diff --git a/code/feature_extraction/hashtags.py b/code/feature_extraction/hashtags.py
new file mode 100644
index 00000000..77a275eb
--- /dev/null
+++ b/code/feature_extraction/hashtags.py
@@ -0,0 +1,34 @@
+# -*- coding: utf-8 -*-
+"""
+Creates a binary column showing whether the tweet uses a hashtag or not.
+
+Created on Fri Oct 22 01:44:28 2021
+
+@author: Yannik
+"""
+
+from code.feature_extraction.feature_extractor import FeatureExtractor
+import numpy as np
+
+
+class Hashtags(FeatureExtractor):
+ """Determines if there is Hastag in a Tweet."""
+
+
+ def __init__(self, input_column):
+ """Constructor, calls super Constructor."""
+ super().__init__([input_column], "{0}_binary".format(input_column))
+
+
+ # don't need to fit, so don't overwrite _set_variables()
+
+
+ def _get_values(self, inputs):
+ """
+        Returns a binary column: 0 if the tweet has no hashtags, 1 otherwise.
+        (The CSV column holds a stringified list; an empty list "[]" has length 2.)
+ """
+ result = np.array([0 if len(_) <= 2 else 1 for _ in inputs[0]])
+ result = result.reshape(-1, 1)
+
+ return result
\ No newline at end of file
diff --git a/code/feature_extraction/likes.py b/code/feature_extraction/likes.py
new file mode 100644
index 00000000..26cb5db4
--- /dev/null
+++ b/code/feature_extraction/likes.py
@@ -0,0 +1,31 @@
+# -*- coding: utf-8 -*-
+"""
+Collects the number of likes of a tweet.
+
+Created on Tue Oct 19 18:01:56 2021
+
+@author: Yannik
+modified by dhesenkamp
+"""
+
+from code.feature_extraction.feature_extractor import FeatureExtractor
+import numpy as np
+
+
+class Likes(FeatureExtractor):
+ """Collects the number of likes for a tweet and stores them as seperate feature."""
+
+ def __init__(self, input_column):
+ """Constructor, calls super Constructor."""
+ super().__init__([input_column], "{0}_feature".format(input_column))
+
+
+ # don't need to fit, so don't overwrite _set_variables()
+
+
+ def _get_values(self, inputs):
+ """Returns the given input column and reshapes it."""
+ result = np.array(inputs[0])
+ result = result.reshape(-1, 1)
+
+ return result
\ No newline at end of file
diff --git a/code/feature_extraction/mention.py b/code/feature_extraction/mention.py
new file mode 100644
index 00000000..70fb587e
--- /dev/null
+++ b/code/feature_extraction/mention.py
@@ -0,0 +1,35 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Determines whether a tweet has any mentions.
+
+Created on Thu Oct 21 10:22:30 2021
+
+@author: dhesenkamp
+"""
+
+from code.feature_extraction.feature_extractor import FeatureExtractor
+import numpy as np
+
+
+class Mentions(FeatureExtractor):
+ """Determines whether a Tweet has mentions and returnes it in a binary form."""
+
+
+ def __init__(self, input_column):
+ """Constructor, calls super Constructor."""
+ super().__init__([input_column], "{0}_binary".format(input_column))
+
+
+ # don't need to fit, so don't overwrite _set_variables()
+
+
+ def _get_values(self, inputs):
+ """
+        Returns a binary column: 0 if the tweet has no mentions, 1 otherwise.
+        (The CSV column holds a stringified list; an empty list "[]" has length 2.)
+ """
+ result = np.array([0 if len(_) <= 2 else 1 for _ in inputs[0]])
+ result = result.reshape(-1, 1)
+
+ return result
\ No newline at end of file
diff --git a/code/feature_extraction/month.py b/code/feature_extraction/month.py
new file mode 100644
index 00000000..56522242
--- /dev/null
+++ b/code/feature_extraction/month.py
@@ -0,0 +1,30 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Extract the month in which a tweet was published as feature.
+
+Created on Tue Oct 12 12:33:37 2021
+
+@author: dhesenkamp
+"""
+
+from code.feature_extraction.feature_extractor import FeatureExtractor
+import numpy as np
+import datetime
+
+
+class MonthExtractor(FeatureExtractor):
+ """Extracts the month in which a tweet has been published."""
+
+
+ def __init__(self, input_column):
+ """Constructor, calls super Constructor."""
+ super().__init__([input_column], "tweet_month")
+
+
+ def _get_values(self, inputs):
+ """Extracts the month from a string containing a date."""
+ result = np.array([datetime.datetime.strptime(date, "%Y-%m-%d").month for date in inputs[0]])
+ result = result.reshape(-1, 1)
+
+ return result
\ No newline at end of file
diff --git a/code/feature_extraction/ner.py b/code/feature_extraction/ner.py
new file mode 100644
index 00000000..a803cde4
--- /dev/null
+++ b/code/feature_extraction/ner.py
@@ -0,0 +1,42 @@
+# -*- coding: utf-8 -*-
+"""
+Named entity recognition using spaCy's pretrained model.
+Works with unprocessed tweet column as default input.
+
+Created on Thu Oct 28 10:08:05 2021
+
+@author: Yannik
+modified by dhesenkamp
+"""
+
+import spacy
+import en_core_web_sm
+import numpy as np
+from code.feature_extraction.feature_extractor import FeatureExtractor
+
+
+class NER(FeatureExtractor):
+ """Named entity recognition as a count."""
+
+
+ def __init__(self, input_column):
+ """Constructor, calls super Constructor."""
+ super().__init__([input_column], "{0}_ner_count".format(input_column))
+
+
+ # don't need to fit, so don't overwrite _set_variables()
+
+
+ def _get_values(self, inputs):
+ """Recognize named entities and counts their occurence on a per tweet basis."""
+ result = []
+ nlp = en_core_web_sm.load()
+
+ for tweet in inputs[0]:
+ doc = nlp(tweet)
+ result.append(len(doc.ents))
+
+ result = np.array(result)
+ result = result.reshape(-1, 1)
+
+ return result
\ No newline at end of file
diff --git a/code/feature_extraction/photos.py b/code/feature_extraction/photos.py
new file mode 100644
index 00000000..a4a07817
--- /dev/null
+++ b/code/feature_extraction/photos.py
@@ -0,0 +1,35 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Feature extraction for photos
+
+Created on Thu Oct 21 09:44:54 2021
+
+@author: dhesenkamp
+"""
+
+from code.feature_extraction.feature_extractor import FeatureExtractor
+import numpy as np
+
+
+class Photos(FeatureExtractor):
+ """Determines whether a Tweet contains any photos."""
+
+
+ def __init__(self, input_column):
+ """Constuctor, calls super Constructor."""
+ super().__init__([input_column], "{0}_binary".format(input_column))
+
+
+ # don't need to fit, so don't overwrite _set_variables()
+
+
+ def _get_values(self, inputs):
+ """
+        Returns a binary column: 0 if the tweet has no photos, 1 otherwise.
+        (The CSV column holds a stringified list; an empty list "[]" has length 2.)
+ """
+ result = np.array([0 if len(_) <= 2 else 1 for _ in inputs[0]])
+ result = result.reshape(-1, 1)
+
+ return result
\ No newline at end of file
diff --git a/code/feature_extraction/replies.py b/code/feature_extraction/replies.py
new file mode 100644
index 00000000..679bd8b1
--- /dev/null
+++ b/code/feature_extraction/replies.py
@@ -0,0 +1,29 @@
+# -*- coding: utf-8 -*-
+"""
+Extracts number of replies from a tweet.
+
+Created on Fri Oct 22 00:58:01 2021
+
+@author: Yannik
+"""
+
+from code.feature_extraction.feature_extractor import FeatureExtractor
+import numpy as np
+
+class RepliesExtractor(FeatureExtractor):
+ """Collects the number of replies for a Tweet and stores them as seperate feature."""
+
+ def __init__(self, input_column):
+ """Constructor, calls super Constructor."""
+ super().__init__([input_column], "{0}_feature".format(input_column))
+
+
+ # don't need to fit, so don't overwrite _set_variables()
+
+
+ def _get_values(self, inputs):
+ """Returnes the given input column as a feature."""
+ result = np.array(inputs[0])
+ result = result.reshape(-1,1)
+
+ return result
\ No newline at end of file
diff --git a/code/feature_extraction/retweets.py b/code/feature_extraction/retweets.py
new file mode 100644
index 00000000..3ec0a6d5
--- /dev/null
+++ b/code/feature_extraction/retweets.py
@@ -0,0 +1,31 @@
+# -*- coding: utf-8 -*-
+"""
+Feature extraction for retweets.
+
+Created on Thu Oct 21 22:39:28 2021
+
+@author: Yannik
+"""
+
+from code.feature_extraction.feature_extractor import FeatureExtractor
+import numpy as np
+
+
+class RetweetExtractor(FeatureExtractor):
+ """Collects the number of retweets for a Tweet and stores them as seperate feature."""
+
+
+ def __init__(self, input_column):
+ """Constructor, calls super Constructor."""
+ super().__init__([input_column], "{0}_feature".format(input_column))
+
+
+ # don't need to fit, so don't overwrite _set_variables()
+
+
+ def _get_values(self, inputs):
+ """Returns the input column as a seperate feature."""
+ result = np.array(inputs[0])
+ result = result.reshape(-1,1)
+
+ return result
\ No newline at end of file
diff --git a/code/feature_extraction/sentiment.py b/code/feature_extraction/sentiment.py
new file mode 100644
index 00000000..0bd600eb
--- /dev/null
+++ b/code/feature_extraction/sentiment.py
@@ -0,0 +1,38 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Sentiment analyser to extract the compound sentiment from a tweet.
+
+Created on Wed Oct 13 11:12:51 2021
+
+@author: dhesenkamp
+"""
+
+from code.feature_extraction.feature_extractor import FeatureExtractor
+from nltk.sentiment.vader import SentimentIntensityAnalyzer
+import numpy as np
+
+
+class SentimentAnalyzer(FeatureExtractor):
+ """
+    Analyses the input's sentiment using the VADER approach, which takes
+    punctuation, capitalization, and emojis (among others) into account,
+    making it well suited for social media content.
+ """
+
+
+ def __init__(self, input_column):
+ """Constructor, calls super Constructor."""
+ super().__init__([input_column], "{0}_sentiment".format(input_column))
+
+
+ # don't need to fit, so don't overwrite _set_variables()
+
+
+ def _get_values(self, inputs):
+ """Analyse sentiment and return compound value."""
+ sia = SentimentIntensityAnalyzer()
+
+ result = np.array([sia.polarity_scores(tweet)["compound"] + 1 for tweet in inputs[0]])
+
+ return result.reshape(-1,1)
\ No newline at end of file
diff --git a/code/feature_extraction/url.py b/code/feature_extraction/url.py
new file mode 100644
index 00000000..fda52453
--- /dev/null
+++ b/code/feature_extraction/url.py
@@ -0,0 +1,35 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Determines whether a Tweet has a URL attached.
+
+Created on Thu Oct 21 10:34:06 2021
+
+@author: dhesenkamp
+"""
+
+from code.feature_extraction.feature_extractor import FeatureExtractor
+import numpy as np
+
+
+class URL(FeatureExtractor):
+ """Determines whether a Tweet has a URL and returnes it in a binary form."""
+
+
+ def __init__(self, input_column):
+ """Constructor, calls super Constructor."""
+ super().__init__([input_column], "{0}_binary".format(input_column))
+
+
+ # don't need to fit, so don't overwrite _set_variables()
+
+
+ def _get_values(self, inputs):
+ """
+        Returns a binary column: 0 if the tweet has no URL, 1 otherwise.
+        (The CSV column holds a stringified list; an empty list "[]" has length 2.)
+ """
+ result = np.array([0 if len(_) <= 2 else 1 for _ in inputs[0]])
+ result = result.reshape(-1, 1)
+
+ return result
\ No newline at end of file
diff --git a/code/param_optimization.sh b/code/param_optimization.sh
new file mode 100755
index 00000000..de0fcb35
--- /dev/null
+++ b/code/param_optimization.sh
@@ -0,0 +1,81 @@
+#!/bin/bash
+
+# runs all classifiers with the configurations we want to explore as part of
+# the hyperparameter optimization
+
+# create directory if not yet existing
+mkdir -p data/classification/
+
+
+# knn
+for i in "uniform" "distance"
+do
+ for j in 1 3 5 7 9
+ do
+ echo " training set"
+        python -m code.classification.run_classifier data/feature_extraction/training.pickle -e data/classification/classifier.pickle --knn $j --knn_weights $i -s 42 -a -k -f1 -ba
+ echo " validation set"
+ python -m code.classification.run_classifier data/feature_extraction/validation.pickle -i data/classification/classifier.pickle -a -k -f1 -ba
+ done
+done
+
+
+# decision tree
+for i in "gini" "entropy"
+do
+ for j in 16 18 20 22 24 26 28 30 32
+ do
+ echo " training set"
+ python -m code.classification.run_classifier data/feature_extraction/training.pickle -e data/classification/classifier.pickle --tree --tree_criterion $i --tree_depth $j -s 42 -a -k -f1 -ba
+ echo " validation set"
+ python -m code.classification.run_classifier data/feature_extraction/validation.pickle -i data/classification/classifier.pickle -a -k -f1 -ba
+ done
+done
+
+
+# random forest
+for i in "gini" "entropy"
+do
+ for j in 10 25 50 100
+ do
+ for k in 16 18 20 22 24 26 28 30 32
+ do
+ echo " training set"
+ python -m code.classification.run_classifier data/feature_extraction/training.pickle -e data/classification/classifier.pickle --randforest $j --forest_criterion $i --forest_max_depth $k -s 42 -a -k -f1 -ba
+ echo " validation set"
+ python -m code.classification.run_classifier data/feature_extraction/validation.pickle -i data/classification/classifier.pickle -a -k -f1 -ba
+ done
+ done
+done
+
+
+# svm
+for i in "linear" "poly" "rbf" "sigmoid"
+do
+ echo " training set"
+ python -m code.classification.run_classifier data/feature_extraction/training.pickle -e data/classification/classifier.pickle --svm $i -s 42 -a -k -f1 -ba
+ echo " validation set"
+ python -m code.classification.run_classifier data/feature_extraction/validation.pickle -i data/classification/classifier.pickle -a -k -f1 -ba
+done
+
+
+# mlp
+for i in 10 25 50
+do
+ for j in 10 25 50
+ do
+ for k in 10 25 50
+ do
+ echo " training set"
+ python -m code.classification.run_classifier data/feature_extraction/training.pickle -e data/classification/classifier.pickle --mlp $i $j $k -s 42 -a -k -f1 -ba
+ echo " validation set"
+ python -m code.classification.run_classifier data/feature_extraction/validation.pickle -i data/classification/classifier.pickle -a -k -f1 -ba
+ done
+ done
+done
+
+# bayes
+echo " training set"
+python -m code.classification.run_classifier data/feature_extraction/training.pickle -e data/classification/classifier.pickle --bayes -s 42 -a -k -f1 -ba
+echo " validation set"
+python -m code.classification.run_classifier data/feature_extraction/validation.pickle -i data/classification/classifier.pickle -a -k -f1 -ba
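The nested shell loops above retrain one classifier per hyperparameter combination. As a rough sketch (variable names are ours, not part of the repository), the same grids can be enumerated in Python with `itertools.product`:

```python
from itertools import product

# The hyperparameter grids the shell loops above iterate over,
# enumerated as (criterion, depth), (criterion, n_trees, depth),
# and (layer1, layer2, layer3) tuples.
tree_grid = list(product(["gini", "entropy"],
                         [16, 18, 20, 22, 24, 26, 28, 30, 32]))
forest_grid = list(product(["gini", "entropy"],
                           [10, 25, 50, 100],
                           [16, 18, 20, 22, 24, 26, 28, 30, 32]))
mlp_grid = list(product([10, 25, 50], repeat=3))
```

This yields 18 decision tree, 72 random forest, and 27 MLP configurations, matching the number of training runs the script performs.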
diff --git a/code/plots.py b/code/plots.py
new file mode 100644
index 00000000..36b968d4
--- /dev/null
+++ b/code/plots.py
@@ -0,0 +1,101 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Produces some of the plots used in the documentation
+
+Created on Sun Oct 31 23:32:44 2021
+
+@author: dhesenkamp
+"""
+
+import numpy as np
+import pandas as pd
+import csv
+import matplotlib.pyplot as plt
+
+
+# courtesy of https://www.kaggle.com/docxian/data-science-tweets for some code snippets
+# load data
+df = pd.read_csv('data/raw/data_science.csv', quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n")
+
+# convert date
+df.date = pd.to_datetime(df.date)
+# and extract year, month
+df['year'] = df.date.dt.year
+df['month'] = df.date.dt.month
+
+# year distribution
+df.year.value_counts().sort_index().plot(kind='bar', color='#b84064')
+plt.title('Year of Tweet')
+
+plt.savefig('img/year_distribution.png')
+plt.show()
+
+# language distribution
+plt.figure(figsize=(20,6))
+df.language.value_counts().plot(kind='bar', color="#b84064")
+plt.title('Language')
+#plt.grid()
+plt.savefig('img/lang_distribution.png')
+plt.show()
+
+# time range distribution
+# initializing data
+preprocessed = pd.read_csv("data/preprocessing/preprocessed.csv", quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n")
+
+daytime = []
+day = preprocessed["time"]
+for i in day:
+    t = i.split(":")
+    hour = int(t[0])
+
+    # night hours
+    if hour in range(0, 5):
+        daytime.append(0)
+    # morning hours
+    elif hour in range(5, 10):
+        daytime.append(1)
+    # midday
+    elif hour in range(10, 15):
+        daytime.append(2)
+    # afternoon
+    elif hour in range(15, 19):
+        daytime.append(3)
+    # evening hours
+    else:
+        daytime.append(4)
+
+# count tweets per time range
+d = np.bincount(daytime, minlength=5)
+
+plt.title("Tweets per time range.")
+plt.ylabel("Number of tweets")
+plt.xlabel("Daytime")
+r = ["night","morning","midday","afternoon","evening"]
+plt.bar(r, d, align = "center", color="#b84064")
+plt.savefig('img/time_distribution.png')
+plt.show()
+
+# final results
+x = ['accuracy', 'balanced\naccuracy', 'cohens\nkappa', 'f1 score']
+results = [0.63, 0.606, 0.088, 0.223]
+plt.bar(x, results, 0.5, color='#b84064')
+plt.title('CNB classifier on test set.')
+plt.ylabel("Score")
+plt.xlabel("Metric")
+plt.gcf().subplots_adjust(bottom=0.15)
+plt.savefig('img/final_result.png')
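The hour-to-daytime binning in `plots.py` can be expressed as a single function; a minimal stdlib sketch (the function name is ours, not part of the repository):

```python
def daytime_bin(timestamp: str) -> int:
    """Map an 'HH:MM:SS' string to the five day-period bins used in plots.py:
    0 = night (00-04), 1 = morning (05-09), 2 = midday (10-14),
    3 = afternoon (15-18), 4 = evening (19-23)."""
    hour = int(timestamp.split(":")[0])
    for bin_idx, upper in enumerate((5, 10, 15, 19, 24)):
        if hour < upper:
            return bin_idx
    raise ValueError(f"hour out of range: {hour}")
```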
diff --git a/code/preprocessing.sh b/code/preprocessing.sh
index 61f83ea6..764b5681 100755
--- a/code/preprocessing.sh
+++ b/code/preprocessing.sh
@@ -12,7 +12,7 @@ python -m code.preprocessing.create_labels data/raw/ data/preprocessing/labeled.
# other preprocessing (removing punctuation etc.)
echo " general preprocessing"
-python -m code.preprocessing.run_preprocessing data/preprocessing/labeled.csv data/preprocessing/preprocessed.csv --punctuation --tokenize -e data/preprocessing/pipeline.pickle
+python -m code.preprocessing.run_preprocessing data/preprocessing/labeled.csv data/preprocessing/preprocessed.csv -s --stopwords_input "tweet" -p --punctuation_input "tweet_stopwords_removed" -t --tokenize_input "tweet_stopwords_removed_punct_removed" -l --lemmatize_input "tweet_stopwords_removed_punct_removed_tokenized" -e data/preprocessing/pipeline.pickle
# split the data set
echo " splitting the data set"
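The long command above chains the preprocessing steps by column name: each step reads the previous step's output column and appends its own suffix (the suffix constants come from `code/util.py`). A sketch of how the final column name is built:

```python
# Suffixes as defined in code/util.py; each preprocessing step appends
# its suffix to the column it consumed.
SUFFIX_STOPWORDS = "_stopwords_removed"
SUFFIX_PUNCTUATION = "_punct_removed"
SUFFIX_TOKENIZED = "_tokenized"
SUFFIX_LEMMATIZED = "_lemmatized"

column = "tweet"
for suffix in (SUFFIX_STOPWORDS, SUFFIX_PUNCTUATION,
               SUFFIX_TOKENIZED, SUFFIX_LEMMATIZED):
    column += suffix
```

After all four steps, `column` is the name of the fully preprocessed column that the lemmatizer writes.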
diff --git a/code/preprocessing/create_labels.py b/code/preprocessing/create_labels.py
index 21b1748d..c7e3e553 100644
--- a/code/preprocessing/create_labels.py
+++ b/code/preprocessing/create_labels.py
@@ -13,6 +13,7 @@
import pandas as pd
from code.util import COLUMN_LIKES, COLUMN_RETWEETS, COLUMN_LABEL
+
# setting up CLI
parser = argparse.ArgumentParser(description = "Creation of Labels")
parser.add_argument("data_directory", help = "directory where the original csv files reside")
diff --git a/code/preprocessing/lemmatizer.py b/code/preprocessing/lemmatizer.py
new file mode 100644
index 00000000..96157468
--- /dev/null
+++ b/code/preprocessing/lemmatizer.py
@@ -0,0 +1,59 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Lemmatizes the given input column, i.e. reduces inflected or variant
+forms of a word to its lemma.
+
+Created on Fri Oct 8 11:18:30 2021
+
+@author: dhesenkamp
+"""
+
+from code.preprocessing.preprocessor import Preprocessor
+from nltk.stem import WordNetLemmatizer
+from nltk.corpus import wordnet
+from nltk import pos_tag
+from ast import literal_eval
+
+
+class Lemmatizer(Preprocessor):
+ """
+    Lemmatize the given input column.
+    Inspired by https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
+ """
+
+
+ def __init__(self, input_column, output_column):
+        """Constructor, calls the super constructor."""
+ super().__init__([input_column], output_column)
+
+
+ # implementation of _set_variables() not necessary
+
+
+ def _get_values(self, inputs):
+ """
+        Lemmatize the given input based on WordNet; also lowercases each word.
+ """
+ lemmatizer = WordNetLemmatizer()
+
+ # dict to map PoS to arg accepted by lemmatize()
+ tag_dict = {"J": wordnet.ADJ,
+ "N": wordnet.NOUN,
+ "V": wordnet.VERB,
+ "R": wordnet.ADV
+ }
+
+ column = []
+
+ for tweet in inputs[0]:
+ tweet_eval = literal_eval(tweet)
+ lemmatized = []
+
+ for word in tweet_eval:
+ # get first letter of PoS tag to retrieve entry from dict
+ tag = pos_tag([word])[0][1][0].upper()
+ lemmatized.append(lemmatizer.lemmatize(word.lower(), tag_dict.get(tag, wordnet.NOUN)))
+ column.append(lemmatized)
+
+ return column
\ No newline at end of file
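The `tag_dict` lookup in the `Lemmatizer` can be illustrated without nltk. In this stdlib sketch, the single-letter PoS strings mirror nltk's `wordnet.ADJ/NOUN/VERB/ADV` constants, and the helper name is ours:

```python
# WordNet PoS tags are single-letter strings; these values mirror
# nltk.corpus.wordnet.ADJ / NOUN / VERB / ADV.
WORDNET_POS = {"ADJ": "a", "NOUN": "n", "VERB": "v", "ADV": "r"}

# first letter of a Penn Treebank tag -> WordNet PoS, default noun
TAG_DICT = {"J": WORDNET_POS["ADJ"], "N": WORDNET_POS["NOUN"],
            "V": WORDNET_POS["VERB"], "R": WORDNET_POS["ADV"]}

def wordnet_pos(penn_tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBD') to the PoS argument
    accepted by WordNetLemmatizer.lemmatize, defaulting to noun."""
    return TAG_DICT.get(penn_tag[0].upper(), WORDNET_POS["NOUN"])
```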
diff --git a/code/preprocessing/preprocessor.py b/code/preprocessing/preprocessor.py
index a5abd445..3332fc34 100644
--- a/code/preprocessing/preprocessor.py
+++ b/code/preprocessing/preprocessor.py
@@ -10,49 +10,66 @@
from sklearn.base import BaseEstimator, TransformerMixin
-# inherits from BaseEstimator (as pretty much everything in sklearn)
-# and TransformerMixin (allowing for fit, transform, and fit_transform methods)
+
class Preprocessor(BaseEstimator,TransformerMixin):
+    """Preprocessor base class; inherits from sklearn's BaseEstimator and TransformerMixin."""
+
- # constructor
def __init__(self, input_columns, output_column):
+ """
+        Constructor: calls the super constructors of BaseEstimator and
+        TransformerMixin and stores _input_columns and _output_column.
+ """
super(BaseEstimator, self).__init__()
super(TransformerMixin, self).__init__()
self._output_column = output_column
self._input_columns = input_columns
- # set internal variables based on input columns
- # to be implemented by subclass!
+
def _set_variables(self, inputs):
+ """
+        Set the internal variables based on the input columns.
+        Needs to be implemented by the subclass.
+ """
pass
- # fit function: takes pandas DataFrame to set any internal variables
+
def fit(self, df):
-
+        """Fit function: takes a pandas DataFrame and sets any internal variables."""
inputs = []
- # collect all input columns from df
+
+ # collect all input columns from the dataframe
for input_col in self._input_columns:
inputs.append(df[input_col])
- # call _set_variables (to be implemented by subclass)
+ # call _set_variables
self._set_variables(inputs)
return self
- # get preprocessed column based on the inputs from the DataFrame and internal variables
- # to be implemented by subclass!
+
def _get_values(self, inputs):
+ """
+ Get preprocessed column based on the inputs from the DataFrame
+ and internal variables.
+        Needs to be implemented by the subclass.
+ """
pass
- # transform function: transforms pandas DataFrame based on any internal variables
+
def transform(self, df):
-
+ """
+ Transform function: transforms pandas DataFrame
+ based on any internal variables.
+ """
inputs = []
- # collect all input columns from df
+ # collect all input columns from the dataframe
for input_col in self._input_columns:
inputs.append(df[input_col])
# add to copy of DataFrame
df_copy = df.copy()
- df_copy[self._output_column] = self._get_values(inputs)
+ df_copy[self._output_column] = self._get_values(inputs)
+
return df_copy
\ No newline at end of file
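The fit/transform contract that every `Preprocessor` subclass fulfils can be sketched without sklearn or pandas. A toy stand-in, assuming rows are plain dicts rather than DataFrame rows:

```python
class UppercasePreprocessor:
    """Toy stand-in for the Preprocessor contract: fit() may inspect the
    input column, transform() adds the output column to a copy of the data."""

    def __init__(self, input_column, output_column):
        self._input_column = input_column
        self._output_column = output_column

    def fit(self, rows):
        # nothing to learn here; real subclasses set internal variables
        return self

    def transform(self, rows):
        # copy each row and add the derived output column
        return [{**row, self._output_column: row[self._input_column].upper()}
                for row in rows]

    def fit_transform(self, rows):
        # what sklearn's TransformerMixin derives from fit + transform
        return self.fit(rows).transform(rows)
```

`run_preprocessing.py` relies on exactly this pattern when it calls `fit_transform` on each preprocessor in turn.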
diff --git a/code/preprocessing/punctuation_remover.py b/code/preprocessing/punctuation_remover.py
index 0f026b0e..9274256e 100644
--- a/code/preprocessing/punctuation_remover.py
+++ b/code/preprocessing/punctuation_remover.py
@@ -6,28 +6,36 @@
Created on Wed Sep 29 09:45:56 2021
@author: lbechberger
+modified by dhesenkamp
"""
import string
from code.preprocessing.preprocessor import Preprocessor
-from code.util import COLUMN_TWEET, COLUMN_PUNCTUATION
-# removes punctuation from the original tweet
-# inspired by https://stackoverflow.com/a/45600350
+
class PunctuationRemover(Preprocessor):
+ """
+    Removes punctuation marks from the given input column.
+    Inspired by https://stackoverflow.com/a/45600350
+ """
+
+
+ def __init__(self, input_column, output_column):
+        """Constructor, calls the super constructor."""
+ super().__init__([input_column], output_column)
- # constructor
- def __init__(self):
- # input column "tweet", new output column
- super().__init__([COLUMN_TWEET], COLUMN_PUNCTUATION)
- # set internal variables based on input columns
def _set_variables(self, inputs):
- # store punctuation for later reference
+ """
+        Stores the punctuation characters for later reference.
+ """
self._punctuation = "[{}]".format(string.punctuation)
- # get preprocessed column based on data frame and internal variables
+
def _get_values(self, inputs):
- # replace punctuation with empty string
+ """
+        Replaces each punctuation mark with an empty string.
+ """
column = inputs[0].str.replace(self._punctuation, "")
+
return column
\ No newline at end of file
diff --git a/code/preprocessing/run_preprocessing.py b/code/preprocessing/run_preprocessing.py
index 72130a30..7da03b29 100644
--- a/code/preprocessing/run_preprocessing.py
+++ b/code/preprocessing/run_preprocessing.py
@@ -13,15 +13,27 @@
from sklearn.pipeline import make_pipeline
from code.preprocessing.punctuation_remover import PunctuationRemover
from code.preprocessing.tokenizer import Tokenizer
-from code.util import COLUMN_TWEET, SUFFIX_TOKENIZED
+from code.preprocessing.stopword_remover import StopwordRemover
+from code.preprocessing.lemmatizer import Lemmatizer
+from code.util import COLUMN_TWEET, SUFFIX_TOKENIZED, SUFFIX_STOPWORDS, SUFFIX_LEMMATIZED, SUFFIX_PUNCTUATION
+
# setting up CLI
parser = argparse.ArgumentParser(description = "Various preprocessing steps")
parser.add_argument("input_file", help = "path to the input csv file")
parser.add_argument("output_file", help = "path to the output csv file")
+
+# <--- Preprocessing steps --->
parser.add_argument("-p", "--punctuation", action = "store_true", help = "remove punctuation")
+parser.add_argument("--punctuation_input", help = "input column to remove punctuation from", default = COLUMN_TWEET)
parser.add_argument("-t", "--tokenize", action = "store_true", help = "tokenize given column into individual words")
parser.add_argument("--tokenize_input", help = "input column to tokenize", default = COLUMN_TWEET)
+parser.add_argument("-s", "--stopwords", action = "store_true", help = "remove common stopwords")
+parser.add_argument("--stopwords_input", help = "input column to remove stopwords from", default = COLUMN_TWEET)
+parser.add_argument("-l", "--lemmatize", action = "store_true", help = "change words into their lemma based on WordNet")
+parser.add_argument("--lemmatize_input", help = "input column to lemmatize", default = COLUMN_TWEET)
+
+# <--- File export --->
parser.add_argument("-e", "--export_file", help = "create a pipeline and export to the given location", default = None)
args = parser.parse_args()
@@ -30,11 +42,19 @@
# collect all preprocessors
preprocessors = []
+
+if args.stopwords:
+ preprocessors.append(StopwordRemover(args.stopwords_input, args.stopwords_input + SUFFIX_STOPWORDS))
+
if args.punctuation:
- preprocessors.append(PunctuationRemover())
+ preprocessors.append(PunctuationRemover(args.punctuation_input, args.punctuation_input + SUFFIX_PUNCTUATION))
+
if args.tokenize:
preprocessors.append(Tokenizer(args.tokenize_input, args.tokenize_input + SUFFIX_TOKENIZED))
+if args.lemmatize:
+ preprocessors.append(Lemmatizer(args.lemmatize_input, args.lemmatize_input + SUFFIX_LEMMATIZED))
+
# call all preprocessing steps
for preprocessor in preprocessors:
df = preprocessor.fit_transform(df)
@@ -46,4 +66,4 @@
if args.export_file is not None:
pipeline = make_pipeline(*preprocessors)
with open(args.export_file, 'wb') as f_out:
- pickle.dump(pipeline, f_out)
\ No newline at end of file
+ pickle.dump(pipeline, f_out)
\ No newline at end of file
diff --git a/code/preprocessing/split_data.py b/code/preprocessing/split_data.py
index 57bad668..f2efa46d 100644
--- a/code/preprocessing/split_data.py
+++ b/code/preprocessing/split_data.py
@@ -13,6 +13,7 @@
from sklearn.model_selection import train_test_split
from code.util import COLUMN_LABEL
+
# setting up CLI
parser = argparse.ArgumentParser(description = "Splitting the data set")
parser.add_argument("input_file", help = "path to the input csv file")
diff --git a/code/preprocessing/stopword_remover.py b/code/preprocessing/stopword_remover.py
new file mode 100644
index 00000000..43cab5c3
--- /dev/null
+++ b/code/preprocessing/stopword_remover.py
@@ -0,0 +1,31 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Remove common stopwords from the tweet.
+
+Created on Thu Oct 7 12:21:12 2021
+
+@author: dhesenkamp
+"""
+
+from code.preprocessing.preprocessor import Preprocessor
+import gensim
+
+
+class StopwordRemover(Preprocessor):
+    """Remove common stopwords from the given input column."""
+
+
+ def __init__(self, input_column, output_column):
+        """Constructor, calls the super constructor."""
+ super().__init__([input_column], output_column)
+
+
+    # implementation of _set_variables() not necessary
+
+
+ def _get_values(self, inputs):
+        """Remove stopwords from the given column using gensim."""
+ column = [gensim.parsing.preprocessing.remove_stopwords(tweet) for tweet in inputs[0]]
+
+ return column
\ No newline at end of file
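A stdlib approximation of what gensim's `remove_stopwords` does (the tiny `STOPWORDS` set here is illustrative; gensim ships its own, much larger built-in list):

```python
# Illustrative stopword set; gensim's built-in STOPWORDS list is far larger.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}

def remove_stopwords(text: str) -> str:
    """Drop stopwords from a whitespace-tokenized string, keeping word order."""
    return " ".join(word for word in text.split()
                    if word.lower() not in STOPWORDS)
```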
diff --git a/code/preprocessing/tokenizer.py b/code/preprocessing/tokenizer.py
index 94191502..ab83249a 100644
--- a/code/preprocessing/tokenizer.py
+++ b/code/preprocessing/tokenizer.py
@@ -11,23 +11,27 @@
from code.preprocessing.preprocessor import Preprocessor
import nltk
+
class Tokenizer(Preprocessor):
"""Tokenizes the given input column into individual words."""
+
def __init__(self, input_column, output_column):
- """Initialize the Tokenizer with the given input and output column."""
+        """Constructor, calls the super constructor."""
super().__init__([input_column], output_column)
+
# don't need to implement _set_variables(), since no variables to set
+
def _get_values(self, inputs):
- """Tokenize the tweet."""
-
+ """Tokenize the tweet into individual words using nltk."""
tokenized = []
for tweet in inputs[0]:
sentences = nltk.sent_tokenize(tweet)
tokenized_tweet = []
+
for sentence in sentences:
words = nltk.word_tokenize(sentence)
tokenized_tweet += words
diff --git a/code/util.py b/code/util.py
index 7d8794c7..c5aec9d7 100644
--- a/code/util.py
+++ b/code/util.py
@@ -12,9 +12,20 @@
COLUMN_TWEET = "tweet"
COLUMN_LIKES = "likes_count"
COLUMN_RETWEETS = "retweets_count"
+COLUMN_MONTH = "date"
+COLUMN_PHOTOS = "photos"
+COLUMN_MENTIONS = "mentions"
+COLUMN_URL = "urls"
+COLUMN_REPLIES = "replies_count"
+COLUMN_HASHTAG = "hashtags"
+COLUMN_TIME = "time"
# column names of novel columns for preprocessing
COLUMN_LABEL = "label"
COLUMN_PUNCTUATION = "tweet_no_punctuation"
-SUFFIX_TOKENIZED = "_tokenized"
\ No newline at end of file
+SUFFIX_TOKENIZED = "_tokenized"
+SUFFIX_STOPWORDS = "_stopwords_removed"
+SUFFIX_LEMMATIZED = "_lemmatized"
+SUFFIX_PUNCTUATION = "_punct_removed"
\ No newline at end of file
diff --git a/data/dimensionality_reduction/pipeline.pickle b/data/dimensionality_reduction/pipeline.pickle
new file mode 100644
index 00000000..4fd6eb2f
Binary files /dev/null and b/data/dimensionality_reduction/pipeline.pickle differ
diff --git a/data/dimensionality_reduction/test.pickle b/data/dimensionality_reduction/test.pickle
new file mode 100644
index 00000000..818c7424
Binary files /dev/null and b/data/dimensionality_reduction/test.pickle differ
diff --git a/data/dimensionality_reduction/training.pickle b/data/dimensionality_reduction/training.pickle
new file mode 100644
index 00000000..b4f6bef3
Binary files /dev/null and b/data/dimensionality_reduction/training.pickle differ
diff --git a/data/dimensionality_reduction/validation.pickle b/data/dimensionality_reduction/validation.pickle
new file mode 100644
index 00000000..41653b42
Binary files /dev/null and b/data/dimensionality_reduction/validation.pickle differ
diff --git a/img/accuracy.png b/img/accuracy.png
new file mode 100644
index 00000000..a0ff0b21
Binary files /dev/null and b/img/accuracy.png differ
diff --git a/img/balancedaccuracy.png b/img/balancedaccuracy.png
new file mode 100644
index 00000000..890be61c
Binary files /dev/null and b/img/balancedaccuracy.png differ
diff --git a/img/cohenskappa.png b/img/cohenskappa.png
new file mode 100644
index 00000000..850f4251
Binary files /dev/null and b/img/cohenskappa.png differ
diff --git a/img/final_result.png b/img/final_result.png
new file mode 100644
index 00000000..f49894fb
Binary files /dev/null and b/img/final_result.png differ
diff --git a/img/fscore.png b/img/fscore.png
new file mode 100644
index 00000000..c0048c32
Binary files /dev/null and b/img/fscore.png differ
diff --git a/img/lang_distribution.png b/img/lang_distribution.png
new file mode 100644
index 00000000..c3dc6edf
Binary files /dev/null and b/img/lang_distribution.png differ
diff --git a/img/year_distribution.png b/img/year_distribution.png
new file mode 100644
index 00000000..508ee8a8
Binary files /dev/null and b/img/year_distribution.png differ
diff --git a/optimization/runs.csv b/optimization/runs.csv
new file mode 100644
index 00000000..84882c53
--- /dev/null
+++ b/optimization/runs.csv
@@ -0,0 +1,98 @@
+Run ID,Name,Source Type,Source Name,User,Status,classifier,dataset,hidden layer sizes,kernel,Accuracy,Balanced_accuracy,Cohens_kappa,F1_score
+efec387940c644a2a5a2ef4841cd11ee,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,complementNB,validation,,,0.6304616060713621,0.6064940834378993,0.08769047501083926,0.2228699392172893
+b5bdc7f94ad24b23bfe6fbfffd9a821e,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,complementNB,validation,,,0.631892092897468,0.6220923866043796,0.09960836001164752,0.23333098641132155
+4d01b6864964469a926baedbb40f1f96,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,complementNB,training,,,0.6338584451731404,0.6123910053275523,0.092875338085471,0.2271694792298453
+1e37af9fd3a34c42a591a9750e3d34fb,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,svm,validation,,sigmoid,0.8368885433217268,0.510392989328906,0.020810084617806535,0.11059907834101383
+93a9fd6410b44d5ba9fa05a38093ad1a,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,svm,training,,sigmoid,0.8372209639070124,0.5093624635112142,0.018823045997413135,0.1083850260778323
+e31511cfba40494d8827d49236b83972,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,svm,validation,,rbf,0.9081843074946756,0.5,0,0
+29b07d98efe342c78d9498197b2ad937,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,svm,training,,rbf,0.9081843074946756,0.5,0,0
+b139536e21574264a4173ae45ae919bf,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,svm,validation,,poly,0.9081843074946756,0.5,0,0
+c75118af40bf434fa0b6d52c9e3eb592,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,svm,training,,poly,0.9081843074946756,0.5,0,0
+841eae55e44f41b8ab030297c0257ca9,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,svm,validation,,linear,0.9081843074946756,0.5,0,0
+95eb2e7a8e134d52813366d37278a1ea,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,svm,training,,linear,0.9081843074946756,0.5,0,0
+563c000dc8f645b38dc5033b840f12e4,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,complementNB,validation,,,0.631892092897468,0.6220923866043796,0.09960836001164752,0.23333098641132155
+8c5cc40765d84f93bc73623082684144,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,complementNB,training,,,0.6338584451731404,0.6123910053275523,0.092875338085471,0.2271694792298453
+594220292a094caf9dcacd93722f8bd5,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[100, 100, 100]",,0.9005611710219398,0.517233096563382,0.05464532288137236,0.08092485549132948
+80aaf1a449504b96ac71d4a0f0ea23b8,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[100, 100, 100]",,0.9144833958734774,0.5558156447979273,0.17877898013721394,0.20006324443975967
+9da9104052d644d4bb0f092d33fad2b1,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[50, 50, 50]",,0.9058855346337176,0.5069258054577694,0.023957071393251983,0.03433922996878252
+c478bd4b9aa04804a4b0470512ce48a9,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[50, 50, 50]",,0.910319687186595,0.5197924327267348,0.06856544627029226,0.07818381884519604
+a73c3743d40040ffb4a132ed0e26603d,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[50, 50, 25]",,0.9075251005713126,0.5028639865174341,0.010224328047144793,0.014056586772391423
+c5748ff2756947fc878392b3bfd2bab7,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[50, 50, 25]",,0.9090801528007842,0.5079675078148976,0.028398974923405484,0.032495952994783854
+302b35e017694d0faef563b611571e6e,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[50, 50, 10]",,0.9072546567053176,0.5031288006058767,0.011126977299656948,0.0157847533632287
+3b53ffc0c627414c9a0059a083fc2e05,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[50, 50, 10]",,0.9088829541484962,0.5071142682265306,0.025381435612958136,0.029291716686674674
+3137f11d92fb4387a2b7f109b8c505e2,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[50, 25, 50]",,0.9073053649301916,0.504728803429135,0.016730924587559648,0.022459893048128343
+85d46fed14834a2e9927b5fba1959fdc,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[50, 25, 50]",,0.9091421295200748,0.5106493519160376,0.037623278193080534,0.043534994068801895
+15039c8080b24162ba8b26357d3158d1,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[50, 25, 25]",,0.9080659883033028,0.5004313075221742,0.0015623655792601499,0.002201430930104568
+11b88994c4744aaf8629854bc663b7cb,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[50, 25, 25]",,0.9083138951804649,0.5012573035818559,0.004553083706556493,0.0052570450516535245
+aceef5db676d46ceae0d92286cdbb1c5,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[50, 25, 10]",,0.9076265170210608,0.5026715972268477,0.0095544385998112,0.013003431461080009
+14e2ce06d9de4a25a76e78c892006514,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[50, 25, 10]",,0.9086237787769176,0.505868361608402,0.02097151946173359,0.024539877300613498
+dfb4896019a84344b1572226e7771d81,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[50, 10, 50]",,0.9080490855616781,0.500670225753762,0.0024252772576568438,0.0032979113228288753
+86680bab032046ffacbb0dec83b53809,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[50, 10, 50]",,0.9082068444835085,0.5011707864766112,0.004236182271651567,0.005129457743038593
+a7f1cbd41a3843f9947f3c1c7c117917,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[50, 10, 25]",,0.9080321828200534,0.5014055920249259,0.005071845613671244,0.006572941391272595
+ad0c7d56551347b6993b0d7f7d0ec902,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[50, 10, 25]",,0.9083477006637143,0.502599776597126,0.009372467822858832,0.011064502401361786
+1c4413a9764b4ebfbcd520f61f9b8661,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[50, 10, 10]",,0.9081335992698015,0.5007167546947633,0.0025955896168747827,0.003300935264991748
+a1fc96e7748c4dafb3b6054901269023,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[50, 10, 10]",,0.9082800896972155,0.5011559506655261,0.00418605356474111,0.004890274466654441
+851a166a33bb4471b75e2e5f3837e33f,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[25, 50, 50]",,0.9076265170210608,0.5027543385667771,0.009847432687869917,0.013359812240476622
+a9cbac7cf02c4fe9b9cf95859bb216da,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[25, 50, 50]",,0.9088153431819974,0.5063323730143654,0.022638663043718532,0.02611625947767481
+ad9c9bdce8354a10a3b40575306fa2d4,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[25, 50, 25]",,0.9078969608870558,0.5038961272571333,0.013913142786541433,0.017667207499549302
+dc8876f7d4794896a166c4c233a0a69a,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[25, 50, 25]",,0.9086688527545834,0.5062517228499632,0.02232172068307048,0.02607546262917568
+6213c727f8d24b11a59a5dc7a3115431,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[25, 50, 10]",,0.9078124471789324,0.5016155821380398,0.005811342333363556,0.008002910149145144
+c9bd85719d53431daf5cfa728f07d820,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[25, 50, 10]",,0.9083307979220896,0.5030869188485017,0.011107131290344907,0.013221737020863659
+dfae20218359476b9a938437ba58795d,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[25, 25, 50]",,0.9078462526621818,0.5022961244338751,0.008242120857151924,0.01088534107402032
+28bc93462d214421a00703fd6db1abc0,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[25, 25, 50]",,0.9084716541022954,0.5039643000361541,0.014240743541621748,0.01670601053205012
+186fe2dca3c34a858720438ca3b5f4d2,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[25, 25, 25]",,0.9080828910449275,0.5005233546503038,0.0018955124378710053,0.0025678650036683784
+a9eb9ba497a346a8b65b8d1bb913a2b2,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[25, 25, 25]",,0.9083082609332567,0.501061138525954,0.003845205908099447,0.004404747338798483
+5ac10ced25424ce6b293a337d45e3b49,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[25, 25, 10]",,0.9081166965281768,0.5005419662267043,0.0019635092081258243,0.002568807339449541
+8b0f6c23510141748408f6f6fd6c0e29,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[25, 25, 10]",,0.9083082609332567,0.5011438798658834,0.004143790026535199,0.004770058708414873
+f1408b668729422d9d078b57e6f1ecd7,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[25, 10, 50]",,0.9076941279875596,0.502295113680002,0.008225056746911652,0.011225783088900959
+1a8e774942e74993be73d29ca2f4eed5,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[25, 10, 50]",,0.9081730390002591,0.5036344150980909,0.013029252371474564,0.015940103852191766
+16f8ca6976df45e985c4165ca8694069,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[25, 10, 25]",,0.9081505020114262,0.49998138842359946,-0.00006759065026695765,0
+d17265e83d9a446ca52465eb65eaa35a,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[25, 10, 25]",,0.9082012102363003,0.5000920471281296,0.0003343327828052578,0.0003681207436039021
+fd6acee07aca4f5b9f19a2ed79571170,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[25, 10, 10]",,0.9081674047530509,0.5001561768916585,0.000566923597530411,0.0007356998344675373
+65dc412cf9404aa98c236a794977e9f4,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[25, 10, 10]",,0.9082406499667579,0.5005274673335769,0.0019136576759937185,0.0022056120573459136
+fcb9b9c1b89c45bb9f0c3b1e6c69fc7f,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 100, 50]",,0.9081674047530509,0.49999069421179976,-0.00003380040342815249,0
+dc7027a9067445b183b734c46b081b0a,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 100, 50]",,0.9081786732474674,0.5000244785172431,0.00008891065864957692,0.00012270691453463402
+18f61bca5999479cbf02a44d233f3407,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 100, 25]",,0.9082012102363003,0.5017468739267165,0.006307070790515845,0.0076740361775991224
+765f3898a1554cb1b39fcf8f469d5727,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 100, 25]",,0.9083984088885884,0.5025449526217973,0.009181668658757247,0.0107095046854083
+5dae56dee82e4018916fd63f0a906874,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 100, 10]",,0.9082012102363003,0.500671236507635,0.0024329411952009883,0.002937396732146135
+152775221bde499d9c98ef21fada667e,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 100, 10]",,0.9082800896972155,0.5012111115588123,0.0043849322467430785,0.005133532970726639
+300774e4c4d941079894fcd681b87d85,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 50, 50]",,0.9081674047530509,0.49999069421179976,-0.00003380040342815249,0
+bac8ba33986d483e9a93108c747f47e7,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 50, 50]",,0.9081899417418838,0.5000582628226864,0.00021162172072741736,0.00024541382906926803
+2b519bf461b6481a862f6015c9ddc63e,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 50, 25]",,0.9082012102363003,0.5002575298079883,0.0009348356447980155,0.0011035497517013057
+aef81fdde1844c80b077b0457c3c6aef,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 50, 25]",,0.9081955759890921,0.5002820083252313,0.001023539183543587,0.001225940909648155
+d088e9f46d84432a89ada6c5edce6c4b,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 50, 10]",,0.9081843074946756,0.5006619307194348,0.0023988520396821533,0.002936857562408223
+fda1052d79004998972b262fd84078cc,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 50, 10]",,0.9082969924388402,0.5008067106473658,0.0029255847337198437,0.0033067973055725657
+d135ffab84294e1e8eda7c2c5c250049,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 25, 50]",,0.9082350157195497,0.5008553307638942,0.0030992735476269573,0.003670398238208846
+989e08d6b22c42b194ab7af0ac121338,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 25, 50]",,0.9082124787307168,0.5007877621530077,0.0028544400665798664,0.0034257050223282562
+96edfc003705432d9410d9c97aa8f43d,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 25, 25]",,0.9081335992698015,0.5000548239753286,0.00019901155026280648,0.00036784991723376866
+d24b3c36107d47d8ba1621a14febb7a4,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 25, 25]",,0.9082237472251332,0.5002423570789456,0.0008800228109887565,0.0009812940815700705
+3a0c7206e1ec4e508861e02cf9776dec,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 25, 10]",,0.9081674047530509,0.49999069421179976,-0.00003380040342815249,0
+4127da98c43647e1b3037390bfee9e63,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 25, 10]",,0.9081899417418838,0.5000306823760432,0.00011145542625656812,0.00012272197336933178
+8b4d4ddeb7824559a0fa82a3b14874bb,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 10, 50]",,0.9081843074946756,0.5,0,0
+46126c7a22c44de0ac9244deafd19666,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 10, 50]",,0.9081955759890921,0.5000613647520864,0.0002228996866323607,0.00024542888697999754
+cb0f9f4f7c144e1b8dfe2206e2440cbb,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 10, 25]",,0.9082350157195497,0.5002761413843888,0.0010026970054990425,0.0011039558417663294
+7f39b429aca7461e9fc3017dfbb17216,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 10, 25]",,0.9081955759890921,0.5001165256453727,0.0004231798518310459,0.000490737332842596
+8b2d16c761b24ed2b81dbf5355a4b535,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 10, 10]",,0.9081843074946756,0.5,0,0
+9f41b0c27ca04f39a7dfd27b4f6cf107,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 10, 10]",,0.9081843074946756,0.5000275804466432,0.00010018247780541056,0.00012271444348999877
+e5e16c5b1bed485abdd976578d4f8530,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[50, 50]",,0.9081166965281768,0.5001282595270576,0.0004653736355804128,0.0007352941176470588
+5d7aa9db0e4d4caf877721edd6cc5921,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[50, 50]",,0.9082800896972155,0.5008249853058087,0.0029911109394871183,0.0034282216100397916
+d9a70e40ebfb42db8ce10438cf567ab3,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[50, 25]",,0.9076772252459349,0.5017066185122964,0.0061278625889164,0.008711433756805808
+a8bd382b2de04f51a8e54aa59446bcba,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[50, 25]",,0.9084603856078789,0.5039305157307109,0.014119381983424328,0.016584952484716423
+500233a8205a49d6b261da11e1a64fa9,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[50, 10]",,0.908218112977925,0.5002668355961886,0.000968761228811954,0.0011037527593818985
+901a344a33b043a3a1cb4f056e9bcd6f,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[50, 10]",,0.9081955759890921,0.5002544278785882,0.000923529381038457,0.0011034820990681706
+e4c998c65d884341b4a64cbdab3be02c,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[25, 50]",,0.9081843074946756,0.5002482240197881,0.0009009202488663437,0.0011033468186833395
+7343f69999414b5dad9f02ab9604f008,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[25, 50]",,0.9082237472251332,0.5006560637785923,0.0023786599971316047,0.0028160391796755433
+64cea74d088849d0b4af0401a7e8c7ce,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[25, 25]",,0.9081166965281768,0.5006247075666337,0.0022625975748927774,0.00293470286133529
+f536d776dca14aea8a8043eb3a445643,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[25, 25]",,0.9082688212027991,0.5010394250201534,0.0037652064715287814,0.004402861860209136
+d87482fba54540e99618ae64fb7b94bf,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[25, 10]",,0.9081843074946756,0.5,0,0
+9270f47fef9b4ab4b6b63684d0eae92f,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[25, 10]",,0.9081843074946756,0.5,0,0
+1e3780e16372476983ceabee1522f74b,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 50]",,0.9081843074946756,0.5,0,0
+ad5e71915b984870a80d2f851830ce60,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 50]",,0.9081843074946756,0.5,0,0
+1eb0305f0a834639b6bad50b8415cf55,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 25]",,0.9081674047530509,0.49999069421179976,-0.00003380040342815249,0
+cd8d7c131333418da75f74c19bf0e52a,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 25]",,0.9081843074946756,0.5,0,0
+8706ac71f52e46a9a10989172379f6fd,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,"[10, 10]",,0.9081674047530509,0.49999069421179976,-0.00003380040342815249,0
+3c11365ff62f4ec7877f5d75019a8b4d,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,"[10, 10]",,0.9081786732474674,0.5000520589638862,0.00018906917797201217,0.0002453837187902583
+5fc7c8115d3a4bc2971d8a831e82e489,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,[50],,0.9081335992698015,0.5000548239753286,0.00019901155026280648,0.00036784991723376866
+8ead250f32cb41b3bb18739966939b36,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,[50],,0.9081899417418838,0.5004719695223331,0.0017117114939724232,0.002082185069508237
+3e14a76354ed41f1ac409388129d941b,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,validation,[25],,0.9081843074946756,0.5003309653597174,0.0012008663702119948,0.0014705882352941176
+b267b338ff4043e681f84f3d863ab199,,LOCAL,/Users/dhesenkamp/Documents/GitHub/MLinPractice/code/classification/run_classifier.py,dhesenkamp,FINISHED,mlp,training,[25],,0.9081505020114262,0.5002020319967444,0.0007331211704250107,0.000980512317685991