diff --git a/Documentation.md b/Documentation.md index de0a5b0f..9bdf07c4 100644 --- a/Documentation.md +++ b/Documentation.md @@ -1,59 +1,223 @@ -# Documentation Example +# Documentation -Some introductory sentence(s). Data set and task are relatively fixed, so -probably you don't have much to say about them (unless you modifed them). -If you haven't changed the application much, there's also not much to say about -that. -The following structure thus only covers preprocessing, feature extraction, -dimensionality reduction, classification, and evaluation. +This is the forked repository of Magnus Müller, Maximilian Kalcher and Samuel Hagemann. -## Evaluation + +Our task was to build and document a real-life application of a machine learning task. We were given a dataset of 295811 tweets about data science from 2010 to 2021 and had to build a classifier that detects whether a tweet will go viral or not. A tweet counted as viral when the sum of its likes and retweets was greater than 50, which resulted in 91% false (non-viral) labels and 9% true (viral) labels. + +Our work consisted of a pipeline of steps, in summary: we loaded and labeled the data using the framework given to us by Bechberger. We then preprocessed the data, mainly the raw tweet text, to better fit our later feature extraction; this was done by removing punctuation, stopwords, etc., and by tokenizing the text into single words. After this, we extracted a handful of features that we found important: some were already included in the raw dataset columns, some we had to extract ourselves. Since the feature space was small and manageable, we did not apply any dimensionality reduction beyond what was already implemented. After feature extraction, we headed straight into classification using a variety of classifiers and benchmarks for evaluation. 
At the end, our best classifier is wrapped into an 'application', callable from the terminal, which outputs the likelihood that an input tweet goes viral, based on the features it was trained on. + +This pipeline is documented in more detail below. + +## Preprocessing + +Before using data from columns such as the raw tweets, it was important to process parts of it beforehand so our chosen features could be extracted smoothly. The first thing we noticed was that many tweets contained various kinds of punctuation, stopwords and emojis, and some were even written in different languages. + +Since we found it interesting from the start to use NLP features such as word semantics alongside other features, we aimed to keep only the core words of each tweet. ### Design Decisions -Which evaluation metrics did you use and why? -Which baselines did you use and why? +To keep only the core words of a tweet, we used the following data cleaning methods: -### Results +1. Removing punctuation and digits -How do the baselines perform with respect to the evaluation metrics? +We knew that tweets sometimes use extensive punctuation, which would be a problem for later features and/or the tokenizer, since it detects punctuation as tokens too. We chose to remove any digits as well, to keep only words. With the help of the ```string``` package, we filtered out most punctuation and digits, namely: ```['!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~0123456789] ``` . But we also had to add some other special characters that the package did not pick up and that were still present after cleaning: ``` [’—”“↓'] ``` . -### Interpretation +Example: -Is there anything we can learn from these results? 
+Input: +` Red Black tree, AVL Tree and other Algorithms like Bellman ford etc are asked in most interviews even on Devops, Data Science roles 😏 httptcoQrKYJpiiVp ` -## Preprocessing +Output: +` Red Black tree AVL Tree and other Algorithms like Bellman ford etc are asked in most interviews even on Devops Data Science roles 😏 httptcoQrKYJpiiVp` -I'm following the "Design Decisions - Results - Interpretation" structure here, -but you can also just use one subheading per preprocessing step to organize -things (depending on what you do, that may be better structured). -### Design Decisions +2. Removing stopwords + +To use only important words for our features, we removed so-called stopwords, i.e. filler words that are commonly used in sentences. This was possible with the ```nltk``` package: every word in the tweet that appeared in the stopword corpus was removed: +``` +{'a', + 'about', + 'above', + 'after', + 'again', +… + 'your', + 'yours', + 'yourself', + 'yourselves'} +``` +(179 total) + +Example: + +Input: +`Red Black tree, AVL Tree and other Algorithms like Bellman ford etc are asked in most interviews even on Devops, Data Science roles 😏 httptcoQrKYJpiiVp` + +Output: +`Red Black tree, AVL Tree Algorithms like Bellman ford etc asked interviews even Devops, Data Science roles 😏 httptcoQrKYJpiiVp` + + +3. Removing emojis + +Among the now usable words, we found that some tweets contained emojis. These are not plain strings, and since interpreting emojis by encoding and decoding them to ASCII proved difficult, we chose to simply remove them from the tweet. + +Since the decoded emojis all start with ```\U``` followed by a set of digits and letters, we removed every string that started with or contained ```\U``` after being decoded (contained, because some emojis were written without spacing and were thus recognized as part of a single string). 
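A rough sketch of these first three cleaning steps (the function name, the stopword subset and the character list are our own illustration, not the project's actual code; the real stopword corpus comes from ```nltk``` and has 179 entries):

```python
import string

# a tiny subset of nltk's 179 English stopwords, for illustration only
STOPWORDS = {"a", "about", "and", "are", "in", "on", "most", "even", "other", "like"}

# punctuation and digits covered by the string package, plus extras it misses
REMOVE = string.punctuation + string.digits + "’—”“↓"

def clean_tweet(tweet: str) -> str:
    # 1. strip punctuation and digits
    tweet = tweet.translate(str.maketrans("", "", REMOVE))
    # 2. drop stopwords (case-insensitive)
    words = [w for w in tweet.split() if w.lower() not in STOPWORDS]
    # 3. drop emoji tokens: their escaped form contains \U
    words = [w for w in words
             if "\\U" not in w.encode("unicode_escape").decode("ascii")]
    return " ".join(words)

clean_tweet("Red Black tree, AVL Tree and other Algorithms 😏")
# → 'Red Black tree AVL Tree Algorithms'
```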
+ +Example: + +Input: ``` Red Black tree, AVL Tree and other Algorithms like Bellman ford etc are asked in most interviews even on Devops, Data Science roles 😏 httptcoQrKYJpiiVp``` + +Output: ``` Red Black tree, AVL Tree and other Algorithms like Bellman ford etc are asked in most interviews even on Devops, Data Science roles httptcoQrKYJpiiVp``` + +4. Removing links -Which kind of preprocessing steps did you implement? Why are they necessary -and/or useful down the road? +We thought we had finished the cleaning part, since a tweet now finally contained only text. But we quickly found that links were also recognized as strings and were practically unusable for us. So we also removed any string that started with ``` http ```. + +Example: + +Input: +`Red Black tree, AVL Tree and other Algorithms like Bellman ford etc are asked in most interviews even on Devops, Data Science roles 😏 httptcoQrKYJpiiVp` + +Output: +`Red Black tree, AVL Tree and other Algorithms like Bellman ford etc are asked in most interviews even on Devops, Data Science roles 😏` + +5. Tokenizing the tweet + +After these important cleaning steps, we tokenized all words in the tweet for easier feature extraction when dealing with NLP features, because we could then simply iterate over a list of core words. This tokenizer is built using ```nltk``` and was already implemented in one of the seminar sessions. + +Every part of the string that is not whitespace or ```’ ‘``` is added to a list of so-called tokens that represent the words in the sentence. We didn’t want to overwrite the normally preprocessed tweet, because further features would not need tokens, so we decided to insert an extra preprocessing column just for these tokenized words. 
+ +Example: + +Input: ``` Red Black tree, AVL Tree and other Algorithms like Bellman ford etc are asked in most interviews even on Devops, Data Science roles 😏 httptcoQrKYJpiiVp``` + +Output: ``` [‘Red’, ‘Black’, ‘tree’, ’,’, ‘AVL’, ‘Tree’, ‘and’, ‘other’, ‘Algorithms’, ‘like’, ‘Bellman’, ‘ford’, ‘etc’, ‘are’, ‘asked’, ‘in’, ‘most’, ‘interviews’, ‘even’, ‘on’, ‘Devops’, ‘,’, ‘Data’, ‘Science’, ‘roles’, ‘😏’, ‘httptcoQrKYJpiiVp’ ]``` + +6. Keeping only English tweets + +Taking a closer look at the languages of our tweets, our analysis produced the following summary: + +{'en': 282035, 'it': 4116, 'es': 3272, 'fr': 2781, 'de': 714, 'id': 523, 'nl': 480, 'pt': 364, 'ca': 275, 'ru': 204, 'th': 157, 'ar': 126, 'tl': 108, 'tr': 84, 'hr': 68, 'da': 66, 'ro': 60, 'ja': 58, 'sv': 42, 'et': 29, 'pl': 25, 'bg': 24, 'af': 23, 'no': 21, 'fi': 20, 'so': 16, 'ta': 16, 'hi': 11, 'mk': 11, 'he': 9, 'sw': 9, 'lt': 7, 'uk': 6, 'sl': 6, 'te': 5, 'zh-cn': 5, 'lv': 5, 'ko': 5, 'bn': 4, 'el': 4, 'fa': 3, 'vi': 2, 'mr': 2, 'ml': 2, 'hu': 2, 'kn': 1, 'cs': 1, 'gu': 1, 'sk': 1, 'ur': 1, 'sq': 1} + +It turns out that of the 295811 total samples, 95% were English tweets. The rest would be largely unusable for our subsequent NLP-based features, so we chose to remove that 5% portion from our preprocessed data. + +Example (the tweet is English, so it passes through unchanged): + +Input: ``` Red Black tree, AVL Tree and other Algorithms like Bellman ford etc are asked in most interviews even on Devops, Data Science roles 😏 httptcoQrKYJpiiVp``` + +Output: ``` Red Black tree, AVL Tree and other Algorithms like Bellman ford etc are asked in most interviews even on Devops, Data Science roles 😏 httptcoQrKYJpiiVp``` + +7. Setting all words to lowercase + +Our final preprocessing step was only implemented after we had examined keywords for the feature extraction and found that many words appeared twice in the ‘most used words’ analysis. This was because people sometimes wrote them in lower case and sometimes capitalized. 
So we went back into the code and made sure to append all strings to our preprocessing columns with ```.lower()```, making all words lowercase. + +Example: + +Input: ``` Red Black tree, AVL Tree and other Algorithms like Bellman ford etc are asked in most interviews even on Devops, Data Science roles 😏 httptcoQrKYJpiiVp``` + +Output: ``` red black tree, avl tree and other algorithms like bellman ford etc are asked in most interviews even on devops, data science roles 😏 httptcoQrKYJpiiVp``` ### Results -Maybe show a short example what your preprocessing does. +At the end of our preprocessing, we had created 2 new columns: ``` preprocess_col ``` for the general feature extraction and ``` preprocess_col_tokenized ``` for the word-level (semantic) feature extraction. -### Interpretation + +The resulting columns for a raw tweet look like this: + +(raw) (preprocess_col) (preprocess_col_tokenized) +Looking at this from a different perspective, we can count how much of the original dataset was removed, based solely on the character count: + +Number of characters before preprocessing: 52686072 +Number of characters after preprocessing: 32090622 +Removed: 20595450 (39.1 %) -Probably, no real interpretation possible, so feel free to leave this section out. +In other words, about 61% of the original characters remained to be worked with after preprocessing. ## Feature Extraction -Again, either structure among decision-result-interpretation or based on feature, -up to you. +The dataset was rich and offered many candidate features, which gave us the freedom to choose and experiment freely. We chose to extract two types of features: + +Features about special characters and metadata, and +Features about natural language, word similarity and frequency + +These features are appended to the preprocessed dataset as new columns and can be inspected in the ``` features.csv ``` file in the feature extraction folder. 
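As a toy illustration of that layout (the two tweets, the labels and the feature values below are made up; only the column names ```preprocess_col``` and ```label``` and the file name ```features.csv``` come from the project):

```python
import pandas as pd

# toy preprocessed dataset (the real one has ~282k rows)
df = pd.DataFrame({
    "preprocess_col": ["red black tree avl tree algorithms",
                       "learn datascience free"],
    "label": [False, True],
})

# each extractor yields one value per tweet; the results become new columns
df["hashtag_count"] = [0, 0]  # e.g. from counting '#' in the raw tweet
df["char_length"] = df["preprocess_col"].str.len()

# df.to_csv("features.csv", index=False)  # the project stores them like this
```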
### Design Decisions -Which features did you implement? What's their motivation and how are they computed? +Below are listed all implemented features and their explanation. + +Photo Bool + +We implemented a simple feature that tells whether the tweet under consideration contains one (or even multiple) photos. The idea behind this was that tweets with a visual stimulus are more appealing and would therefore be more likely to go viral. + +The possible outputs are ```[1]``` if the tweet contains photo(s) and ```[0]``` if it does not. If we have 5 tweets that either contain photos or not, an exemplary output could be: ```[[1], [0], [0], [1], [0]]``` + +Video Bool + +We were also interested in the possible effect of video(s) in tweets and implemented this as a feature in a similar way to the photo feature. It tells whether the tweet under consideration contains video(s) or not. The possible outputs are ```[1]``` if the tweet contains video(s) and ```[0]``` if it does not. If we have 5 tweets that either contain videos or not, an exemplary output could be: ```[[0], [1], [0], [1], [1]]``` + +Since this feature was already integrated in the raw dataset, we just had to extract the values of this column. + +Hashtag Counter + +We also thought about the importance of hashtags. Since using a hashtag increases the engagement of tweets by tagging similar topics together, having more hashtags means that more people could be reached across a variety of topics. Because of this, we implemented a feature that counts the number of hashtags a tweet has. If we have 5 tweets containing different numbers of hashtags, an exemplary output could be: ```[[0], [5], [2], [3], [1]]``` + +Emoji Counter + +According to a study, emojis have not only become a visual language used to reveal several things (like feelings and thoughts), but also part of the fundamental structure of texts. 
They have been shown to ease communication and to strengthen meaning and social connections between users. + +Because of this, we thought it would be very fitting to add a feature about emojis. So we took the original unprocessed tweets again and counted every occurrence of strings that started with or contained ```\U``` after being decoded. This number was then added as a feature to a new column, under the assumption that using n > 0 emojis increases the attractiveness of the tweet. If we have 5 tweets containing different numbers of emojis, an exemplary output could be: ```[[0], [5], [2], [3], [1]]``` + + +Tweet length + +This feature was already implemented, so we did not add anything to it. We did find it very interesting, though. According to a 2013 study analyzing why tweets became viral, most viral tweets back then had fewer than 140 characters. This means that a short and precise tweet would perform better than a longer tweet conveying more information. + +This feature simply counts the length of the entire tweet string and adds the value to a new feature column. If we have 5 tweets of different lengths, an exemplary output could be: ```[[45], [14], [41], [30], [12]]``` + +Hour of tweet + +The last metadata feature concerns the posting time of the tweet. We wanted to know whether there is a difference in posting times between viral and non-viral tweets. The following two graphs represent the tweet frequency per hour of posting. + +![Hour frequency of non-viral tweets](/Documentation/time_non_viral.png) +[2] Hour frequency of non-viral tweets [0-24] + +![Hour frequency of viral tweets](/Documentation/time_viral.png) +[3] Hour frequency of viral tweets [0-24] + +Interestingly enough, both viral and non-viral tweets are distributed pretty much equally, except for a roughly three-hour window in the morning from 7:00 to 10:00, where viral tweets tend to be posted more often. Using this information, we stripped the hour from the ```time``` column of the raw dataset and added it as a feature in a new column. 
So for a time of ```12:05:45``` the feature would just extract the number ```12```. If we have 5 tweets with different posting times, an exemplary output could be: ```[[12], [14], [3], [4], [0]]``` + + +Word2Vec + +We then started to analyze features about semantics and natural language. The first thing that came to mind was to check whether word usage differs between viral and non-viral tweets. The two graphs below show the 20 most used words in both categories: + +![Top 20 most used words in non-viral tweets](/Documentation/word_count_non_viral.png) +[4] Top 20 most used words in non-viral tweets (label == False) + +![Top 20 most used words in viral tweets](/Documentation/word_count_viral.png) +[5] Top 20 most used words in viral tweets (label == True) + +What we can gather from this is that both categories use the word ‘data’ the most, followed by ‘datascience’. This is not surprising, as the entire dataset consists of data-science-related tweets. Applying a word embedding feature using these words would not make much sense, as the feature would score equally high for the majority of both viral and non-viral tweets. So we had to differentiate between the categories and find words that were present only in the top 20 words of viral tweets, not in those of non-viral ones. Since we used a dataset of embedded words from Google News ( ```'word2vec-google-news-300'```), the keywords had to be present there and could not be too specific (deeplearning, datascience, etc. did not exist). We then settled on the following words: + +` keywords = ['coding','free','algorithms','statistics'] ` + +These words were present in the word embedding dataset and were also used exclusively in viral tweets. + +Using the package ```gensim``` for computing word embeddings and semantic similarity, we could easily iterate over all words in ```preprocess_col_tokenized``` and compare each of them with each word in our ```keywords``` list. For every (tokenized) tweet, we took the mean of all similarity values and added this float number to a new feature column ```word2vec```. 
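A sketch of that scoring logic, using tiny made-up vectors in place of the multi-gigabyte ```'word2vec-google-news-300'``` model (with gensim one would call ```model.similarity(w1, w2)``` instead of the cosine helper below; the toy embeddings and the word ```tree``` are our own illustration):

```python
import numpy as np

keywords = ['coding', 'free', 'algorithms', 'statistics']

# toy embeddings standing in for gensim's 'word2vec-google-news-300'
toy_model = {
    "coding":     np.array([1.0, 0.0]),
    "free":       np.array([0.0, 1.0]),
    "algorithms": np.array([0.8, 0.6]),
    "statistics": np.array([0.6, 0.8]),
    "tree":       np.array([0.9, 0.1]),
}

def cosine(a, b):
    # gensim's model.similarity is cosine similarity of the two word vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def word2vec_feature(tokens):
    # mean similarity of every known token against every keyword
    sims = [cosine(toy_model[t], toy_model[k])
            for t in tokens if t in toy_model
            for k in keywords]
    return sum(sims) / len(sims) if sims else 0.0

score = word2vec_feature(["tree", "coding"])
```

Tweets whose tokens lie close to the keywords in embedding space get a higher mean, which is the float stored in the ```word2vec``` column.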
+ + +Hashing Vectorizer / Tf-Idf + +We wanted to try something we hadn't heard in the lecture. Therefore, we used the HashingVectorizer from sklearn to create an individual hash vector for each tweet. For a sentence like 'I love Machine Learning', the output can look like [0.4, 0.3, 0.9, 0, 0.21], with the length n representing the number of features. It is not very intuitive to humans why this works, but after a long time of version conflicts and other problems, we enjoyed the simplicity of using sklearn. + +Usage: `--hash_vec` +The number of features for the hash vector can be changed via HASH_VECTOR_N_FEATURES in util.py. ### Results Can you say something about how the feature values are distributed? Maybe show some plots? +When we finally ran it successfully with 25 features, we first tried the SVM classifier, but that took too much time (nearly endless). We therefore used KNN with 4 nearest neighbors on a 20000-sample subset; for the first time, our Cohen's kappa went from 0.0 to 0.1, and after some tuning (using more data) to 0.3. + + ### Interpretation Can we already guess which features may be more useful than others? @@ -78,12 +242,13 @@ Can we somehow make sense of the dimensionality reduction results? Which features are the most important ones and why may that be the case? ## Classification - +First of all, we added a new argument, --small 1000, which uses just the first 1000 tweets. ### Design Decisions Which classifier(s) did you use? Which hyperparameter(s) (with their respective candidate values) did you look at? What were your reasons for this? +- SVM ### Results The big finale begins: What are the evaluation results you obtained with your @@ -94,4 +259,27 @@ selected setup: How well does it generalize to the test set? Which hyperparameter settings are how important for the results? How good are we? Can this be used in practice or are we still too bad? -Anything else we may have learned? \ No newline at end of file +Anything else we may have learned? 
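The HashingVectorizer-plus-KNN setup described in the feature extraction results can be sketched end-to-end on toy data (the tweets and labels below are made up; the real run used HASH_VECTOR_N_FEATURES from util.py, 4 neighbors and a 20000-sample subset, whereas this sketch uses 1 neighbor on 4 samples):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import cohen_kappa_score

# toy corpus with made-up viral (1) / non-viral (0) labels
tweets = ["i love machine learning", "free coding algorithms",
          "data science rocks", "boring day today"]
labels = [1, 1, 0, 0]

# hash each tweet into a fixed-size sparse vector
vec = HashingVectorizer(n_features=2**10)
X = vec.transform(tweets)

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, labels)

pred = clf.predict(X)
kappa = cohen_kappa_score(labels, pred)  # 1.0 on the training data itself
```

Evaluating on the training data like this is only meant to show the API flow; the project scores kappa on a held-out split.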
+ +## Evaluation + +### Design Decisions + +Which evaluation metrics did you use and why? +Which baselines did you use and why? + +### Results + +How do the baselines perform with respect to the evaluation metrics? + +### Interpretation + +Is there anything we can learn from these results? + +## Tests + +We have written tests for tfidf_vec and hash_vector because, even though the sklearn functions themselves naturally come with many tests, we want to double-check that we are using them correctly and that we are getting the expected output. 'test_result_shape' is especially important, because it checks whether the length of the output list matches the number of input elements. + +In run_classifier we added a number of functions that are run from run_classifier_test.py, which tests all classifiers: it checks that the data still has equal length, tries to fit each given classifier, and verifies that a correct error output is produced if no classifier is specified. + +## Project Organization diff --git a/Documentation/Screenshot 2021-10-19 at 12.45.51.png b/Documentation/Screenshot 2021-10-19 at 12.45.51.png new file mode 100644 index 00000000..cc563390 Binary files /dev/null and b/Documentation/Screenshot 2021-10-19 at 12.45.51.png differ diff --git a/Documentation/Screenshot 2021-10-19 at 19.41.16.png b/Documentation/Screenshot 2021-10-19 at 19.41.16.png new file mode 100644 index 00000000..f3c47b2e Binary files /dev/null and b/Documentation/Screenshot 2021-10-19 at 19.41.16.png differ diff --git a/Documentation/time_non_viral.png b/Documentation/time_non_viral.png new file mode 100644 index 00000000..a4becd25 Binary files /dev/null and b/Documentation/time_non_viral.png differ diff --git a/Documentation/time_viral.png b/Documentation/time_viral.png new file mode 100644 index 00000000..795387fa Binary files /dev/null and b/Documentation/time_viral.png differ diff --git a/Documentation/word_count_non_viral.png b/Documentation/word_count_non_viral.png new file mode 100644 index 
00000000..5a912390 Binary files /dev/null and b/Documentation/word_count_non_viral.png differ diff --git a/Documentation/word_count_viral.png b/Documentation/word_count_viral.png new file mode 100644 index 00000000..f5853157 Binary files /dev/null and b/Documentation/word_count_viral.png differ diff --git a/README.md b/README.md index f1c12d81..24964e9a 100644 --- a/README.md +++ b/README.md @@ -19,6 +19,8 @@ conda install -y -q -c conda-forge gensim=4.1.2 conda install -y -q -c conda-forge spyder=5.1.5 conda install -y -q -c conda-forge pandas=1.1.5 conda install -y -q -c conda-forge mlflow=1.20.2 +conda install -y -q -c conda-forge spacy +conda install -y -q -c conda-forge langdetect ``` You can double-check that all of these packages have been installed by running `conda list` inside of your virtual environment. The Spyder IDE can be started by typing `~/miniconda/envs/MLinPractice/bin/spyder` in your terminal window (assuming you use miniconda, which is installed right in your home directory). @@ -91,6 +93,8 @@ The features to be extracted can be configured with the following optional param Moreover, the script support importing and exporting fitted feature extractors with the following optional arguments: - `-i` or `--import_file`: Load a configured and fitted feature extraction from the given pickle file. Ignore all parameters that configure the features to extract. - `-e` or `--export_file`: Export the configured and fitted feature extraction into the given pickle file. +- `--hash_vec`: use the HashingVectorizer from sklearn. The number of features for the hash vector can be changed via HASH_VECTOR_N_FEATURES in util.py. ## Dimensionality Reduction @@ -128,7 +132,7 @@ By default, this data is used to train a classifier, which is specified by one o The classifier is then evaluated, using the evaluation metrics as specified through the following optional arguments: - `-a`or `--accuracy`: Classification accurracy (i.e., percentage of correctly classified examples). 
- `-k`or `--kappa`: Cohen's kappa (i.e., adjusting accuracy for probability of random agreement). - +- `--small 1000`: use just 1000 tweets. Moreover, the script support importing and exporting trained classifiers with the following optional arguments: - `-i` or `--import_file`: Load a trained classifier from the given pickle file. Ignore all parameters that configure the classifier to use and don't retrain the classifier. diff --git a/code/all_in_one.py b/code/all_in_one.py new file mode 100644 index 00000000..290c8828 --- /dev/null +++ b/code/all_in_one.py @@ -0,0 +1,183 @@ +import argparse +import pdb +import csv +import pickle + + +# feature_extraction +from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer, TfidfVectorizer + +# feature_selection +from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2 + +# dim_reduction +from sklearn.decomposition import PCA, TruncatedSVD, NMF + +# classifier +from sklearn.naive_bayes import MultinomialNB +from sklearn.linear_model import LogisticRegression +from sklearn.ensemble import RandomForestClassifier +from sklearn.linear_model import SGDClassifier +from sklearn.svm import LinearSVC, l1_min_c, SVC, LinearSVR, SVR + +from sklearn.pipeline import Pipeline +from sklearn.pipeline import FeatureUnion +from sklearn.model_selection import cross_val_score + +# metrics +from sklearn.metrics import classification_report, cohen_kappa_score, accuracy_score, balanced_accuracy_score + +import pandas as pd +import numpy as np +import seaborn as sns + +# balancing +from imblearn.under_sampling import RandomUnderSampler +from imblearn.over_sampling import RandomOverSampler +from sklearn.model_selection import train_test_split + +from collections import Counter + +parser = argparse.ArgumentParser(description="all in one") +parser.add_argument("input_file", help="path to the input file") +parser.add_argument("-e", "--export_file", + help="export the trained classifier to the 
given location", default=None) + +# evaluate: +parser.add_argument("-a", "--accuracy", action="store_true", + help="evaluate using accuracy") +parser.add_argument("-k", "--kappa", action="store_true", + help="evaluate using Cohen's kappa") +parser.add_argument("--balanced_accuracy", action="store_true", + help="evaluate using balanced_accuracy") +parser.add_argument("--classification_report", action="store_true", + help="evaluate using classification_report") + +# balance dataset +parser.add_argument("--balance", type=str, + help="choose btw under and oversampling", default=None) +parser.add_argument("--small", type=int, + help="choose subset of all data", default=None) +# feature_extraction +parser.add_argument("--feature_extraction", type=str, + help="choose a feature_extraction algo", default=None) +# dim_red +parser.add_argument("--dim_red", type=str, + help="choose a dim_red algo", default=None) +# classifier +parser.add_argument("--classifier", type=str, + help="choose a classifier", default=None) + +args = parser.parse_args() +#args, unk = parser.parse_known_args() + +# load data +# with open(args.input_file, 'rb') as f_in: +# data = pickle.load(f_in) + +# load data +df = pd.read_csv(args.input_file, quoting=csv.QUOTE_NONNUMERIC, + lineterminator="\n") + +if args.small is not None: + # if limit is given + max_length = len(df['label']) + limit = min(args.small, max_length) + df = df.head(limit) + +# split data +input_col = 'preprocess_col' +X = df[input_col].array.reshape(-1, 1) +y = df["label"].ravel() + +X_train, X_test, y_train, y_test = train_test_split( + X, y, test_size=0.1, random_state=42) + +# balance data +if args.balance == 'over_sampler': + over_sampler = RandomOverSampler(random_state=42) + X_res, y_res = over_sampler.fit_resample(X_train, y_train) +elif args.balance == 'under_sampler': + under_sampler = RandomUnderSampler(random_state=42) + X_res, y_res = under_sampler.fit_resample(X_train, y_train) +else: + X_res, y_res = X_train, y_train + 
+print(f"Training target statistics: {Counter(y_res)}") +print(f"Testing target statistics: {Counter(y_test)}") + + +my_pipeline = [] + +# feature_extraction +if args.feature_extraction == 'HashingVectorizer': + my_pipeline.append(('hashvec', HashingVectorizer(n_features=2**22, + strip_accents='ascii', stop_words='english', ngram_range=(1, 3)))) +elif args.feature_extraction == 'TfidfVectorizer': + my_pipeline.append(('tfidf', TfidfVectorizer( + stop_words='english', ngram_range=(1, 3)))) + +# dimension reduction +if args.dim_red == 'SelectKBest(chi2)': + my_pipeline.append(('dim_red', SelectKBest(chi2))) +elif args.dim_red == 'NMF': + my_pipeline.append(('nmf', NMF())) + + +# classifier +if args.classifier == 'MultinomialNB': + my_pipeline.append(('MNB', MultinomialNB())) +elif args.classifier == 'SGDClassifier': + my_pipeline.append(('SGD', SGDClassifier(class_weight="balanced", n_jobs=-1, + random_state=42, alpha=1e-07, verbose=1))) +elif args.classifier == 'LogisticRegression': + my_pipeline.append(('LogisticRegression', LogisticRegression(class_weight="balanced", n_jobs=-1, + random_state=42, verbose=1))) +elif args.classifier == 'LinearSVC': + my_pipeline.append(('LinearSVC', LinearSVC(class_weight="balanced", + random_state=42, verbose=1))) +elif args.classifier == 'SVC': + # attention: SVC training time grows quadratically with the number of samples + my_pipeline.append(('SVC', SVC(class_weight="balanced", + random_state=42, verbose=1))) + +classifier = Pipeline(my_pipeline) +classifier.fit(X_res.ravel(), y_res) + +# now classify the given data +prediction = classifier.predict(X_test.ravel()) + +prediction_train_set = classifier.predict(X_res.ravel()) + +# collect all evaluation metrics +evaluation_metrics = [] +if args.accuracy: + evaluation_metrics.append(("accuracy", accuracy_score)) +if args.kappa: + evaluation_metrics.append(("Cohen's kappa", cohen_kappa_score)) +if args.balanced_accuracy: + evaluation_metrics.append(("balanced accuracy", 
balanced_accuracy_score)) +# compute and print them +for metric_name, metric in evaluation_metrics: + + print(" {0}: {1}".format(metric_name, + metric(y_test, prediction))) + +if args.classification_report: + categories = ["Flop", "Viral"] + print("Matrix Train set:") + print(classification_report(y_res, prediction_train_set, + target_names=categories)) + print("Matrix Test set:") + print(classification_report(y_test.ravel(), prediction, + target_names=categories)) + + +# export the trained classifier if the user wants us to do so +if args.export_file is not None: + with open(args.export_file, 'wb') as f_out: + pickle.dump(classifier, f_out) diff --git a/code/all_in_one.sh b/code/all_in_one.sh new file mode 100755 index 00000000..070a267c --- /dev/null +++ b/code/all_in_one.sh @@ -0,0 +1,27 @@ +#!/bin/bash + +# create directory if not yet existing +mkdir -p data/all_in_one/ + +# run feature extraction on training set (may need to fit extractors) +#echo " training set" +#python3 -m code.all_in_one data/feature_extraction/training.pickle -e data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --classification_report --small 10000 + +# raw input, mit preprocessing +#python3 -m code.all_in_one data/preprocessing/split/training.csv -e data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --classification_report --hash_vectorizer #--count_vectorizer + +# raw input, ohne preprocessing +#python3 -m code.all_in_one data/preprocessing/labeled.csv -e data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --classification_report --count_vectorizer #--hash_vectorizer # + +# sklearn example +#python3 -m code.example_sklearn_pipeline data/preprocessing/split/training.csv + + +# run feature extraction on validation set (with pre-fit extractors) +#echo " validation set" +#python3 -m code.all_in_one data/feature_extraction/validation.pickle -i data/classification/classifier.pickle --accuracy --kappa 
--balanced_accuracy --small 10000 + +# don't touch the test set, yet, because that would ruin the final generalization experiment! + +# new approach +python3 -m code.all_in_one data/preprocessing/preprocessed.csv -e data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --classification_report --classifier 'LogisticRegression' --feature_extraction 'TfidfVectorizer' #--small 20000 #--balance 'over_sampler' # | HashingVectorizer TfidfVectorizer | SVC SGDClassifier LogisticRegression LinearSVC MultinomialNB data/preprocessing/split/training.csv data/preprocessing/labeled.csv data/preprocessing/preprocessed.csv \ No newline at end of file diff --git a/code/all_in_one_multiple_input_features.py b/code/all_in_one_multiple_input_features.py new file mode 100644 index 00000000..f0f0a6a1 --- /dev/null +++ b/code/all_in_one_multiple_input_features.py @@ -0,0 +1,215 @@ +import argparse +import pdb +import csv +import pickle + + +# feature_extraction +from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer, TfidfVectorizer + +# feature_selection +from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2 + +# dim_reduction +from sklearn.decomposition import PCA, TruncatedSVD, NMF + +# classifier +from sklearn.naive_bayes import MultinomialNB +from sklearn.linear_model import LogisticRegression +from sklearn.ensemble import RandomForestClassifier +from sklearn.linear_model import SGDClassifier +from sklearn.svm import LinearSVC, l1_min_c, SVC, LinearSVR, SVR + +from sklearn.pipeline import Pipeline +from sklearn.pipeline import FeatureUnion +from sklearn.model_selection import cross_val_score +from sklearn.preprocessing import FunctionTransformer + +# metrics +from sklearn.metrics import classification_report, cohen_kappa_score, accuracy_score, balanced_accuracy_score + +import pandas as pd +import numpy as np +import seaborn as sns + +# balancing +from imblearn.under_sampling import 
RandomUnderSampler +from imblearn.over_sampling import RandomOverSampler +from sklearn.model_selection import train_test_split + +from collections import Counter + +parser = argparse.ArgumentParser(description="all in one") +parser.add_argument("input_file", help="path to the input file") +parser.add_argument("-e", "--export_file", + help="export the trained classifier to the given location", default=None) + +# evaluate: +parser.add_argument("-a", "--accuracy", action="store_true", + help="evaluate using accuracy") +parser.add_argument("-k", "--kappa", action="store_true", + help="evaluate using Cohen's kappa") +parser.add_argument("--balanced_accuracy", action="store_true", + help="evaluate using balanced_accuracy") +parser.add_argument("--classification_report", action="store_true", + help="evaluate using classification_report") + +# balance dataset +parser.add_argument("--balance", type=str, + help="choose btw under and oversampling", default=None) +parser.add_argument("--small", type=int, + help="choose subset of all data", default=None) +# feature_extraction +parser.add_argument("--feature_extraction", type=str, + help="choose a feature_extraction algo", default=None) +# dim_red +parser.add_argument("--dim_red", type=str, + help="choose a dim_red algo", default=None) +# classifier +parser.add_argument("--classifier", type=str, + help="choose a classifier", default=None) + +args = parser.parse_args() +#args, unk = parser.parse_known_args() + +# load data +# with open(args.input_file, 'rb') as f_in: +# data = pickle.load(f_in) + +# load data +df = pd.read_csv(args.input_file, quoting=csv.QUOTE_NONNUMERIC, + lineterminator="\n") + +if args.small is not None: + # if limit is given + max_length = len(df['label']) + limit = min(args.small, max_length) + df = df.head(limit) + +# split data +# input_col = 'preprocess_col' +#X = df[input_col].array.reshape(-1, 1) +#y = df["label"].ravel() + +X_train, X_test, y_train, y_test = train_test_split( + df, df['label'], 
test_size=0.1, random_state=42) + +""" +# balance data +if args.balance == 'over_sampler': + over_sampler = RandomOverSampler(random_state=42) + X_res, y_res = over_sampler.fit_resample(X_train, y_train) +elif args.balance == 'under_sampler': + under_sampler = RandomUnderSampler(random_state=42) + X_res, y_res = under_sampler.fit_resample(X_train, y_train) +else: + X_res, y_res = X_train, y_train + + +print(f"Training target statistics: {Counter(y_res)}") +print(f"Testing target statistics: {Counter(y_test)}") +""" + +my_pipeline = [] +get_text_data = FunctionTransformer( + lambda x: x['preprocess_col'], validate=False) +get_numeric_data = FunctionTransformer(lambda x: x[[ + 'replies_count', 'likes_count', 'retweets_count']], validate=False) # array.reshape(-1,1) + +# add text data +if args.feature_extraction != 'union': + my_pipeline.append(('selector', get_numeric_data)) + + +# feature_extraction +if args.feature_extraction == 'HashingVectorizer': + my_pipeline.append(('hashvec', HashingVectorizer(n_features=2**22, + strip_accents='ascii', stop_words='english', ngram_range=(1, 3)))) +elif args.feature_extraction == 'TfidfVectorizer': + my_pipeline.append(('tfidf', TfidfVectorizer( + stop_words='english', ngram_range=(1, 3)))) + +elif args.feature_extraction == 'union': + # using more than just text data as features: + my_pipeline.append(('features', FeatureUnion([ + ('selector', get_numeric_data), + ('text_features', Pipeline( + [('selector', get_text_data), ('vec', TfidfVectorizer())]))], verbose=1))) + +""" +abc = FeatureUnion([ + ('numeric_features', Pipeline([ + ('selector', get_numeric_data) + ])), + ('text_features', Pipeline([ + ('selector', get_text_data), + ('vec', TfidfVectorizer()) + ])) +],verbose=1) +""" + + +# dimension reduction +if args.dim_red == 'SelectKBest(chi2)': + my_pipeline.append(('dim_red', SelectKBest(chi2))) +elif args.dim_red == 'NMF': + my_pipeline.append(('nmf', NMF())) + + +# classifier +if args.classifier == 'MultinomialNB': + 
my_pipeline.append(('MNB', MultinomialNB())) +elif args.classifier == 'SGDClassifier': + my_pipeline.append(('SGD', SGDClassifier(class_weight="balanced", n_jobs=-1, + random_state=42, alpha=1e-07, verbose=1))) +elif args.classifier == 'LogisticRegression': + my_pipeline.append(('LogisticRegression', LogisticRegression(class_weight="balanced", n_jobs=-1, + random_state=42, verbose=1))) +elif args.classifier == 'LinearSVC': + my_pipeline.append(('LinearSVC', LinearSVC(class_weight="balanced", + random_state=42, verbose=1, max_iter=10000))) +elif args.classifier == 'SVC': + # attention: time = samples ^ 2 + my_pipeline.append(('SVC', SVC(class_weight="balanced", + random_state=42, verbose=1))) + +classifier = Pipeline(my_pipeline) + + +classifier.fit(X_train, y_train) + + +# now classify the given data +prediction = classifier.predict(X_test) + +prediction_train_set = classifier.predict(X_train) + + +# collect all evaluation metrics +evaluation_metrics = [] +if args.accuracy: + evaluation_metrics.append(("accuracy", accuracy_score)) +if args.kappa: + evaluation_metrics.append(("Cohen's kappa", cohen_kappa_score)) +if args.balanced_accuracy: + evaluation_metrics.append(("balanced accuracy", balanced_accuracy_score)) +# compute and print them +for metric_name, metric in evaluation_metrics: + + print(" {0}: {1}".format(metric_name, + metric(y_test, prediction))) + +if args.classification_report: + categories = ["Flop", "Viral"] + print("Matrix Train set:") + print(classification_report( + y_train, prediction_train_set, target_names=categories)) + print("Matrix Test set:") + print(classification_report(y_test, prediction, + target_names=categories)) + + +# export the trained classifier if the user wants us to do so +# if args.export_file is not None: +# with open(args.export_file, 'wb') as f_out: +# pickle.dump(classifier, f_out) diff --git a/code/all_in_one_multiple_input_features.sh b/code/all_in_one_multiple_input_features.sh new file mode 100755 index 
00000000..0af59edc --- /dev/null +++ b/code/all_in_one_multiple_input_features.sh @@ -0,0 +1,27 @@ +#!/bin/bash + +# create directory if not yet existing +mkdir -p data/all_in_one_multiple_input_features/ + +# run the full pipeline on the training set (may need to fit extractors) +#echo " training set" +#python3 -m code.all_in_one data/feature_extraction/training.pickle -e data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --classification_report --small 10000 + +# raw input, with preprocessing +#python3 -m code.all_in_one data/preprocessing/split/training.csv -e data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --classification_report --hash_vectorizer #--count_vectorizer + +# raw input, without preprocessing +#python3 -m code.all_in_one data/preprocessing/labeled.csv -e data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --classification_report --count_vectorizer #--hash_vectorizer # + +# sklearn example +#python3 -m code.example_sklearn_pipeline data/preprocessing/split/training.csv + + +# run the full pipeline on the validation set (with pre-fit extractors) +#echo " validation set" +#python3 -m code.all_in_one data/feature_extraction/validation.pickle -i data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --small 10000 + +# don't touch the test set, yet, because that would ruin the final generalization experiment! 
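The `union` option wired up in `all_in_one_multiple_input_features.py` feeds a `FeatureUnion` of raw numeric columns and a TF-IDF vectorization of the preprocessed text into one pipeline. A minimal sketch of that pattern, using a toy DataFrame with the script's column names (the rows themselves are invented for illustration):

```python
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# toy stand-in for the preprocessed tweet data
df = pd.DataFrame({
    "preprocess_col": ["great data science insight", "boring tweet",
                       "viral machine learning thread", "nothing to see"],
    "replies_count": [10, 0, 25, 1],
    "label": [True, False, True, False],
})

# column selectors, mirroring get_text_data / get_numeric_data in the script
get_text = FunctionTransformer(lambda x: x["preprocess_col"], validate=False)
get_numeric = FunctionTransformer(lambda x: x[["replies_count"]].to_numpy(),
                                  validate=False)

# numeric features and TF-IDF text features are concatenated side by side
clf = Pipeline([
    ("features", FeatureUnion([
        ("numeric", get_numeric),
        ("text", Pipeline([("selector", get_text),
                           ("vec", TfidfVectorizer())])),
    ])),
    ("clf", LogisticRegression(class_weight="balanced")),
])

clf.fit(df, df["label"])
predictions = clf.predict(df)
```

Selecting columns inside the pipeline keeps a single DataFrame as the input interface, so the same object can be passed to `fit`, to `predict`, and later to a pickled classifier.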
+ +# new approach +python3 -m code.all_in_one_multiple_input_features data/preprocessing/preprocessed.csv -e data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --classification_report --classifier 'LinearSVC' --feature_extraction 'union' --small 2000 #--balance 'over_sampler' # | HashingVectorizer TfidfVectorizer | SVC SGDClassifier LogisticRegression LinearSVC MultinomialNB data/preprocessing/split/training.csv data/preprocessing/labeled.csv data/preprocessing/preprocessed.csv \ No newline at end of file diff --git a/code/application.sh b/code/application.sh index da31860e..b610e4f7 100755 --- a/code/application.sh +++ b/code/application.sh @@ -2,4 +2,4 @@ # execute the application with all necessary pickle files echo "Starting the application..." -python -m code.application.application data/preprocessing/pipeline.pickle data/feature_extraction/pipeline.pickle data/dimensionality_reduction/pipeline.pickle data/classification/classifier.pickle \ No newline at end of file +python -m code.application.application data/preprocessing/pipeline.pickle data/feature_extraction/pipeline.pickle data/dimensionality_reduction/pipeline.pickle data/classification/classifier.pickle diff --git a/code/classification.sh b/code/classification.sh index ceb7ac18..aff1833b 100755 --- a/code/classification.sh +++ b/code/classification.sh @@ -5,10 +5,9 @@ mkdir -p data/classification/ # run feature extraction on training set (may need to fit extractors) echo " training set" -python -m code.classification.run_classifier data/dimensionality_reduction/training.pickle -e data/classification/classifier.pickle --knn 5 -s 42 --accuracy --kappa - +python3 -m code.classification.run_classifier data/feature_extraction/training.pickle -e data/classification/classifier.pickle --LinearSVC --accuracy --kappa --balanced_accuracy --small 10000 # run feature extraction on validation set (with pre-fit extractors) echo " validation set" -python -m code.classification.run_classifier 
data/dimensionality_reduction/validation.pickle -i data/classification/classifier.pickle --accuracy --kappa +python3 -m code.classification.run_classifier data/feature_extraction/validation.pickle -i data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy -# don't touch the test set, yet, because that would ruin the final generalization experiment! \ No newline at end of file +# don't touch the test set, yet, because that would ruin the final generalization experiment! diff --git a/code/classification/run_classifier.py b/code/classification/run_classifier.py index b9d55245..538ad24c 100644 --- a/code/classification/run_classifier.py +++ b/code/classification/run_classifier.py @@ -8,71 +8,212 @@ @author: lbechberger """ -import argparse, pickle +import argparse +import pickle + from sklearn.dummy import DummyClassifier -from sklearn.metrics import accuracy_score, cohen_kappa_score -from sklearn.preprocessing import StandardScaler +from sklearn.naive_bayes import MultinomialNB +from sklearn.linear_model import LogisticRegression +from sklearn.ensemble import RandomForestClassifier +from sklearn.neighbors import KNeighborsClassifier +from sklearn.linear_model import SGDClassifier +from sklearn.svm import LinearSVC, SVC + +from sklearn.pipeline import make_pipeline +from sklearn.preprocessing import StandardScaler + +from sklearn.metrics import accuracy_score, cohen_kappa_score, balanced_accuracy_score, classification_report + + +def load_args(): + # setting up CLI + parser = argparse.ArgumentParser(description="Classifier") + parser.add_argument("input_file", help="path to the input pickle file") + parser.add_argument( + "-s", + "--seed", + type=int, + help="seed for the random number generator", + default=None, + ) + parser.add_argument( + "-e", + "--export_file", + help="export the trained classifier to the given location", + default=None, + ) + parser.add_argument( + "-i", + "--import_file", + help="import a trained classifier from the given location", + default=None, 
+ ) + parser.add_argument( + "-m", "--majority", action="store_true", help="majority class classifier" + ) + parser.add_argument( + "-f", "--frequency", action="store_true", help="label frequency classifier" + ) + parser.add_argument("-v", "--svm", action="store_true", help="SVM classifier") + parser.add_argument("--sgd", action="store_true", help="SGD classifier") + parser.add_argument( + "--LogisticRegression", action="store_true", help="LogisticRegression" + ) + parser.add_argument("--LinearSVC", action="store_true", help="LinearSVC") + parser.add_argument("--MultinomialNB", action="store_true", help="MultinomialNB") + + parser.add_argument( + "--knn", + type=int, + help="k nearest neighbor classifier with the specified value of k", + default=None, + ) + parser.add_argument( + "-a", "--accuracy", action="store_true", help="evaluate using accuracy" + ) + parser.add_argument( + "-k", "--kappa", action="store_true", help="evaluate using Cohen's kappa" + ) + parser.add_argument( + "--balanced_accuracy", + action="store_true", + help="evaluate using balanced_accuracy", + ) + parser.add_argument( + "--classification_report", + action="store_true", + help="evaluate using classification_report", + ) + + parser.add_argument( + "--small", type=int, help="not use all data but just subset", default=None + ) + parser.add_argument( + "--balanced_data_set", + action="store_true", + help="arg for classifier, use balanced data", + ) + + args = parser.parse_args() + return args + + +def load_dataset(args): + """load a pickle file and reduce samples""" + # load data + with open(args.input_file, "rb") as f_in: + data = pickle.load(f_in) + + # use less data to save time for testing + if args.small is not None: + # if limit is given + max_length = len(data["features"]) + limit = min(args.small, max_length) + # go through data and limit it + for key, value in data.items(): + data[key] = value[:limit] + return data + + +def create_classifier(args, data): + """Load or 
create a classifier with given args and sklearn methods.""" + # use balanced data in classifier + balanced = "balanced" if args.balanced_data_set else None + + if args.import_file is not None: + # import a pre-trained classifier + with open(args.import_file, "rb") as f_in: + classifier = pickle.load(f_in) + + else: # manually set up a classifier + + if args.majority: + # majority vote classifier + print(" majority vote classifier") + classifier = DummyClassifier( + strategy="most_frequent", random_state=args.seed + ) + elif args.frequency: + # label frequency classifier + print(" label frequency classifier") + classifier = DummyClassifier(strategy="stratified", random_state=args.seed) + elif args.svm: + print(" SVC classifier") + classifier = make_pipeline( + StandardScaler(), SVC(probability=True, verbose=1) + ) + elif args.knn is not None: + print(" {0} nearest neighbor classifier".format(args.knn)) + standardizer = StandardScaler() + knn_classifier = KNeighborsClassifier(args.knn, n_jobs=-1) + classifier = make_pipeline(standardizer, knn_classifier) + elif args.sgd: + print(" SGDClassifier") + standardizer = StandardScaler() + classifier = make_pipeline( + standardizer, + SGDClassifier( + class_weight=balanced, random_state=args.seed, n_jobs=-1, verbose=1 + ), + ) + elif args.MultinomialNB: + print(" MultinomialNB") + classifier = MultinomialNB() + elif args.LogisticRegression: + print(" LogisticRegression") + classifier = LogisticRegression( + class_weight=balanced, n_jobs=-1, random_state=args.seed, verbose=1 + ) + elif args.LinearSVC: + print(" LinearSVC") + classifier = LinearSVC( + class_weight=balanced, random_state=args.seed, verbose=1 + ) + + try: + classifier.fit(data["features"], data["labels"].ravel()) + except NameError: + raise UnboundLocalError("Import a classifier or choose one.") + + return classifier + + +def evaluate_classifier(args, data, prediction): + # collect all evaluation metrics + evaluation_metrics = [] + if args.accuracy: 
+ evaluation_metrics.append(("accuracy", accuracy_score)) + if args.kappa: + evaluation_metrics.append(("Cohen's kappa", cohen_kappa_score)) + if args.balanced_accuracy: + evaluation_metrics.append(("balanced accuracy", balanced_accuracy_score)) + # compute and print them + for metric_name, metric in evaluation_metrics: + print(" {0}: {1}".format(metric_name, metric(data["labels"], prediction))) + + if args.classification_report: + categories = ["Flop", "Viral"] + print(classification_report( + data["labels"], prediction, target_names=categories)) + +def export_classifier(args, classifier): + # export the trained classifier if the user wants us to do so + if args.export_file is not None: + with open(args.export_file, "wb") as f_out: + pickle.dump(classifier, f_out) + + +if __name__ == "__main__": + args = load_args() + + data = load_dataset(args) + + classifier = create_classifier(args, data) + # now classify the given data + prediction = classifier.predict(data["features"]) + + evaluate_classifier(args, data, prediction) -# setting up CLI -parser = argparse.ArgumentParser(description = "Classifier") -parser.add_argument("input_file", help = "path to the input pickle file") -parser.add_argument("-s", '--seed', type = int, help = "seed for the random number generator", default = None) -parser.add_argument("-e", "--export_file", help = "export the trained classifier to the given location", default = None) -parser.add_argument("-i", "--import_file", help = "import a trained classifier from the given location", default = None) -parser.add_argument("-m", "--majority", action = "store_true", help = "majority class classifier") -parser.add_argument("-f", "--frequency", action = "store_true", help = "label frequency classifier") -parser.add_argument("--knn", type = int, help = "k nearest neighbor classifier with the specified value of k", default = None) -parser.add_argument("-a", "--accuracy", action = "store_true", help = "evaluate using accuracy") 
-parser.add_argument("-k", "--kappa", action = "store_true", help = "evaluate using Cohen's kappa") -args = parser.parse_args() - -# load data -with open(args.input_file, 'rb') as f_in: - data = pickle.load(f_in) - -if args.import_file is not None: - # import a pre-trained classifier - with open(args.import_file, 'rb') as f_in: - classifier = pickle.load(f_in) - -else: # manually set up a classifier - - if args.majority: - # majority vote classifier - print(" majority vote classifier") - classifier = DummyClassifier(strategy = "most_frequent", random_state = args.seed) - - elif args.frequency: - # label frequency classifier - print(" label frequency classifier") - classifier = DummyClassifier(strategy = "stratified", random_state = args.seed) - - - elif args.knn is not None: - print(" {0} nearest neighbor classifier".format(args.knn)) - standardizer = StandardScaler() - knn_classifier = KNeighborsClassifier(args.knn) - classifier = make_pipeline(standardizer, knn_classifier) - - classifier.fit(data["features"], data["labels"].ravel()) - -# now classify the given data -prediction = classifier.predict(data["features"]) - -# collect all evaluation metrics -evaluation_metrics = [] -if args.accuracy: - evaluation_metrics.append(("accuracy", accuracy_score)) -if args.kappa: - evaluation_metrics.append(("Cohen's kappa", cohen_kappa_score)) - -# compute and print them -for metric_name, metric in evaluation_metrics: - print(" {0}: {1}".format(metric_name, metric(data["labels"], prediction))) - -# export the trained classifier if the user wants us to do so -if args.export_file is not None: - with open(args.export_file, 'wb') as f_out: - pickle.dump(classifier, f_out) \ No newline at end of file + export_classifier(args, classifier) diff --git a/code/dimensionality_reduction/reduce_dimensionality.py b/code/dimensionality_reduction/reduce_dimensionality.py index d2b27419..7d4da260 100644 --- a/code/dimensionality_reduction/reduce_dimensionality.py +++ 
b/code/dimensionality_reduction/reduce_dimensionality.py @@ -40,6 +40,7 @@ if args.mutual_information is not None: # select K best based on Mutual Information dim_red = SelectKBest(mutual_info_classif, k = args.mutual_information) + dim_red.fit(features, labels.ravel()) # resulting feature names based on support given by SelectKBest @@ -64,6 +65,7 @@ def get_feature_names(kbest, names): # store the results output_data = {"features": reduced_features, "labels": labels} + with open(args.output_file, 'wb') as f_out: pickle.dump(output_data, f_out) diff --git a/code/example_sklearn_pipeline.py b/code/example_sklearn_pipeline.py new file mode 100644 index 00000000..8ad9f392 --- /dev/null +++ b/code/example_sklearn_pipeline.py @@ -0,0 +1,48 @@ +import pdb +from sklearn.feature_extraction.text import CountVectorizer +from sklearn.feature_extraction.text import TfidfTransformer +from sklearn.pipeline import Pipeline +from sklearn.naive_bayes import MultinomialNB +from sklearn.linear_model import SGDClassifier +from sklearn.metrics import classification_report, cohen_kappa_score, accuracy_score, balanced_accuracy_score +import argparse +import csv +import pickle +from sklearn.datasets import fetch_20newsgroups +import pandas as pd +import numpy as np + + +parser = argparse.ArgumentParser(description="Classifier") +parser.add_argument("input_file", help="path to the input pickle file") +args = parser.parse_args() + + +df = pd.read_csv(args.input_file, quoting=csv.QUOTE_NONNUMERIC, + lineterminator="\n") +categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med'] + +twenty_train = fetch_20newsgroups(subset='train', + categories=categories, shuffle=True, random_state=42) +twenty_test = fetch_20newsgroups(subset='test', + categories=categories, shuffle=True, random_state=42) + + +text_clf = Pipeline([ + ('vect', CountVectorizer()), + ('tfidf', TfidfTransformer()), + ('clf', SGDClassifier(loss='hinge', penalty='l2', + alpha=1e-3, random_state=42, + 
max_iter=5, tol=None, verbose=True))]) + +text_clf.fit(twenty_train.data, twenty_train.target) + + +docs_test = twenty_test.data + +predicted = text_clf.predict(docs_test) +print(np.mean(predicted == twenty_test.target)) +print(classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names)) +print(balanced_accuracy_score(twenty_test.target, predicted)) +print(cohen_kappa_score(twenty_test.target, predicted)) \ No newline at end of file diff --git a/code/feature_extraction.sh b/code/feature_extraction.sh index f494f835..c91bd3d8 100755 --- a/code/feature_extraction.sh +++ b/code/feature_extraction.sh @@ -4,11 +4,14 @@ mkdir -p data/feature_extraction/ # run feature extraction on training set (may need to fit extractors) -echo " training set" -python -m code.feature_extraction.extract_features data/preprocessing/split/training.csv data/feature_extraction/training.pickle -e data/feature_extraction/pipeline.pickle --char_length + +echo -e "\n -> training set" +python -m code.feature_extraction.extract_features data/preprocessing/split/training.csv data/feature_extraction/training.pickle -e data/feature_extraction/pipeline.pickle --char_length --photo_bool --video_bool --time --word2vec --emoji_count --hashtags # run feature extraction on validation set and test set (with pre-fit extractors) -echo " validation set" -python -m code.feature_extraction.extract_features data/preprocessing/split/validation.csv data/feature_extraction/validation.pickle -i data/feature_extraction/pipeline.pickle -echo " test set" -python -m code.feature_extraction.extract_features data/preprocessing/split/test.csv data/feature_extraction/test.pickle -i data/feature_extraction/pipeline.pickle \ No newline at end of file +echo -e "\n -> validation set" +python -m code.feature_extraction.extract_features data/preprocessing/split/validation.csv data/feature_extraction/validation.pickle -i data/feature_extraction/pipeline.pickle --char_length 
--photo_bool --video_bool --time --word2vec --emoji_count --hashtags + +echo -e "\n -> test set\n" +python -m code.feature_extraction.extract_features data/preprocessing/split/test.csv data/feature_extraction/test.pickle -i data/feature_extraction/pipeline.pickle --char_length --photo_bool --video_bool --time --word2vec --emoji_count --hashtags + diff --git a/code/feature_extraction/bigrams.py b/code/feature_extraction/bigrams.py index 6c0c4b3a..6303f3d9 100644 --- a/code/feature_extraction/bigrams.py +++ b/code/feature_extraction/bigrams.py @@ -23,4 +23,4 @@ def _set_variables(self, inputs): tokens = ast.literal_eval(line.item()) overall_text += tokens - self._bigrams = nltk.bigrams(overall_text) \ No newline at end of file + self._bigrams = nltk.bigrams(overall_text) diff --git a/code/feature_extraction/character_length.py b/code/feature_extraction/character_length.py index 0349bf94..1478faad 100644 --- a/code/feature_extraction/character_length.py +++ b/code/feature_extraction/character_length.py @@ -16,7 +16,7 @@ class CharacterLength(FeatureExtractor): # constructor def __init__(self, input_column): - super().__init__([input_column], "{0}_charlength".format(input_column)) + super().__init__([input_column], "tweet_charlength") # don't need to fit, so don't overwrite _set_variables() diff --git a/code/feature_extraction/emoji_count.py b/code/feature_extraction/emoji_count.py new file mode 100644 index 00000000..130a60f5 --- /dev/null +++ b/code/feature_extraction/emoji_count.py @@ -0,0 +1,35 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Simple feature that counts the number of emojis in the given column. 
+Created on Wed Sep 29 12:29:25 2021 +@author: maximilian +""" + +import numpy as np +from code.feature_extraction.feature_extractor import FeatureExtractor + +# class for extracting the emoji count as a feature +class EmojiCount(FeatureExtractor): + + # constructor + def __init__(self, input_column): + super().__init__([input_column], "{0}_emoji_count".format(input_column)) + + # don't need to fit, so don't overwrite _set_variables() + + # count the emojis (unicode escape sequences) per tweet + def _get_values(self, inputs): + + column = inputs[0].str + column = [' '.join([word for word in tweet if str(word.encode('unicode-escape').decode('ASCII')).count('U00')]) for tweet in column.split()] + + values = [] + + for i in column: + values.append(len(list(i.split(None)))) + + result = np.array(values) + result = result.reshape(-1,1) + + return result diff --git a/code/feature_extraction/extract_features.py b/code/feature_extraction/extract_features.py index a3527acf..e53b23da 100644 --- a/code/feature_extraction/extract_features.py +++ b/code/feature_extraction/extract_features.py @@ -8,25 +8,45 @@ @author: lbechberger """ -import argparse, csv, pickle +import argparse +import csv +import pickle import pandas as pd import numpy as np from code.feature_extraction.character_length import CharacterLength +from code.feature_extraction.emoji_count import EmojiCount +from code.feature_extraction.hash_vector import HashVector +from code.feature_extraction.tfidf_vector import TfidfVector from code.feature_extraction.feature_collector import FeatureCollector -from code.util import COLUMN_TWEET, COLUMN_LABEL +from code.feature_extraction.hashtag_count import HashtagCount +from code.feature_extraction.photo_bool import PhotoBool +from code.feature_extraction.video_bool import VideoBool +from code.feature_extraction.word2vec import Word2Vec +from code.feature_extraction.time_feature import Hours +from code.util import COLUMN_TWEET, COLUMN_LABEL, COLUMN_PREPROCESS, COLUMN_PHOTOS, COLUMN_REPLIES, 
COLUMN_VIDEO # setting up CLI parser = argparse.ArgumentParser(description = "Feature Extraction") -parser.add_argument("input_file", help = "path to the input csv file") -parser.add_argument("output_file", help = "path to the output pickle file") -parser.add_argument("-e", "--export_file", help = "create a pipeline and export to the given location", default = None) -parser.add_argument("-i", "--import_file", help = "import an existing pipeline from the given location", default = None) -parser.add_argument("-c", "--char_length", action = "store_true", help = "compute the number of characters in the tweet") +parser.add_argument("input_file", help="path to the input csv file") +parser.add_argument("output_file", help="path to the output pickle file") +parser.add_argument("-e", "--export_file", help="create a pipeline and export to the given location", default=None) +parser.add_argument("-i", "--import_file", help="import an existing pipeline from the given location", default=None) +parser.add_argument("-c", "--char_length", action="store_true", help="compute the number of characters in the tweet") +parser.add_argument("--hash_vec", action="store_true", help="compute the hash vector of the tweet") +parser.add_argument("--tfidf_vec", action="store_true", help="compute the tf idf of the tweet") +parser.add_argument("--photo_bool", action="store_true", help="tells whether the tweet contains photos or not") +parser.add_argument("--video_bool", action="store_true", help="tells whether the tweet contains a video or not") +parser.add_argument("--word2vec", action="store_true", help="compute the semantic distance of words to given keywords") +parser.add_argument("--time", action="store_true", help="take into account what hour the tweet was sent") +parser.add_argument("--emoji_count", action="store_true", help="count the emojis in a tweet") +parser.add_argument("--hashtags", action = "store_true", help = "count hashtags of the tweet") + args = parser.parse_args() # load data 
-df = pd.read_csv(args.input_file, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n") +df = pd.read_csv(args.input_file, quoting=csv.QUOTE_NONNUMERIC, + lineterminator="\n") if args.import_file is not None: # simply import an existing FeatureCollector @@ -38,12 +58,32 @@ # collect all feature extractors features = [] if args.char_length: - # character length of original tweet (without any changes) - features.append(CharacterLength(COLUMN_TWEET)) - + features.append(CharacterLength(COLUMN_PREPROCESS)) + if args.hash_vec: + # hash of original tweet (without any changes) + features.append(HashVector(COLUMN_TWEET)) + if args.hashtags: + # number of hashtags per tweet + features.append(HashtagCount('hashtags')) + if args.tfidf_vec: + features.append(TfidfVector(COLUMN_PREPROCESS)) + if args.emoji_count: + features.append(EmojiCount(COLUMN_TWEET)) + if args.photo_bool: + # do photos exist or not + features.append(PhotoBool(COLUMN_PHOTOS)) + if args.video_bool: + # does a video exist or not + features.append(VideoBool(COLUMN_VIDEO)) + if args.word2vec: + features.append(Word2Vec('preprocess_col_tokenized')) + if args.time: + # hour of the day at which the tweet was sent + features.append(Hours('time')) + # create overall FeatureCollector feature_collector = FeatureCollector(features) - + # fit it on the given data set (assumed to be training data) feature_collector.fit(df) @@ -57,12 +97,19 @@ label_array = label_array.reshape(-1, 1) # store the results -results = {"features": feature_array, "labels": label_array, +results = {"features": feature_array, "labels": label_array, "feature_names": feature_collector.get_feature_names()} + + with open(args.output_file, 'wb') as f_out: pickle.dump(results, f_out) # export the FeatureCollector as pickle file if desired by user if args.export_file is not None: with open(args.export_file, 'wb') as f_out: - pickle.dump(feature_collector, f_out) \ No newline at end of file + pickle.dump(feature_collector, f_out) + +df_out = 
pd.DataFrame(feature_array, columns=feature_collector.get_feature_names()) +df_out.to_csv("data/feature_extraction/features.csv") + +# results.to_csv(args.output_file, index = False, quoting = csv.QUOTE_NONNUMERIC, line_terminator = "\n") \ No newline at end of file diff --git a/code/feature_extraction/feature_collector.py b/code/feature_extraction/feature_collector.py index d2fca494..2534b738 100644 --- a/code/feature_extraction/feature_collector.py +++ b/code/feature_extraction/feature_collector.py @@ -53,4 +53,4 @@ def get_feature_names(self): feature_names = [] for feature in self._features: feature_names.append(feature.get_feature_name()) - return feature_names \ No newline at end of file + return feature_names diff --git a/code/feature_extraction/hash_vector.py b/code/feature_extraction/hash_vector.py new file mode 100644 index 00000000..fd431f93 --- /dev/null +++ b/code/feature_extraction/hash_vector.py @@ -0,0 +1,37 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +Simple feature that computes a hashing vectorization of the given column. 
+
+Created on Wed Sep 29 12:29:25 2021
+
+@author: lbechberger
+"""
+
+import numpy as np
+from code.feature_extraction.feature_extractor import FeatureExtractor
+from sklearn.feature_extraction.text import HashingVectorizer
+
+from code.util import HASH_VECTOR_N_FEATURES
+
+# class for extracting a hashed bag-of-ngrams vector as a feature
+class HashVector(FeatureExtractor):
+
+    # constructor
+    def __init__(self, input_column):
+        super().__init__([input_column], "{0}_hashvector".format(input_column))
+
+    # don't need to fit, so don't overwrite _set_variables()
+
+    # compute the hashed n-gram vector based on the inputs
+    def _get_values(self, inputs):
+        # inputs is list of text documents
+        # create the transform
+        vectorizer = HashingVectorizer(n_features=HASH_VECTOR_N_FEATURES,
+                                       strip_accents='ascii', stop_words='english', ngram_range=(1, 2))
+        # encode the documents
+        vector = vectorizer.fit_transform(inputs[0])
+        return vector.toarray()
diff --git a/code/feature_extraction/hashtag_count.py b/code/feature_extraction/hashtag_count.py
new file mode 100644
index 00000000..c617ecaa
--- /dev/null
+++ b/code/feature_extraction/hashtag_count.py
@@ -0,0 +1,39 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Simple feature that counts the number of hashtags in the given column.
+
+Created on Wed Sep 29 12:29:25 2021
+
+@author: lbechberger
+"""
+
+import numpy as np
+import ast
+from code.feature_extraction.feature_extractor import FeatureExtractor
+
+# class for extracting the number of hashtags as a feature
+class HashtagCount(FeatureExtractor):
+
+    # constructor
+    def __init__(self, input_column):
+        super().__init__([input_column], "{0}_count".format(input_column))
+
+    # don't need to fit, so don't overwrite _set_variables()
+
+    # compute the number of hashtags based on the inputs
+    def _get_values(self, inputs):
+
+        hashtags_list = inputs[0].astype(str).values.tolist()
+
+        values = []
+        for row in hashtags_list:
+            if ast.literal_eval(row) == []:
+                values.append(0)
+            else:
+                values.append(len(ast.literal_eval(row)))
+
+        result = np.array(values)
+        result = result.reshape(-1,1)
+
+        return result
diff --git a/code/feature_extraction/photo_bool.py b/code/feature_extraction/photo_bool.py
new file mode 100644
index 00000000..a16e83e8
--- /dev/null
+++ b/code/feature_extraction/photo_bool.py
@@ -0,0 +1,33 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Simple feature that tells whether photos are present or not.
+
+Created on Wed Sep 29 12:29:25 2021
+
+@author: shagemann
+"""
+
+import numpy as np
+from code.feature_extraction.feature_extractor import FeatureExtractor
+
+# class for extracting the photo-bool as a feature
+class PhotoBool(FeatureExtractor):
+
+    # constructor
+    def __init__(self, input_column):
+        super().__init__([input_column], "{0}_bool".format(input_column))
+
+    # don't need to fit, so don't overwrite _set_variables()
+
+    # return 0 if no photos are present, 1 otherwise
+    def _get_values(self, inputs):
+        values = []
+        for index, row in inputs[0].iteritems():
+            if len(row) > 2:
+                values.append(1)
+            else:
+                values.append(0)
+        result = np.array(values)
+        result = result.reshape(-1,1)
+        return result
diff --git a/code/feature_extraction/tfidf_vector.py b/code/feature_extraction/tfidf_vector.py
new file mode 100644
index 00000000..1ab023cf
--- /dev/null
+++ b/code/feature_extraction/tfidf_vector.py
@@ -0,0 +1,37 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Feature that computes a TF-IDF vector for the text in the given column.
+
+Created on Wed Sep 29 12:29:25 2021
+
+@author: lbechberger
+"""
+
+import numpy as np
+from code.feature_extraction.feature_extractor import FeatureExtractor
+from sklearn.feature_extraction.text import TfidfVectorizer
+
+# class for extracting a TF-IDF vector as a feature
+class TfidfVector(FeatureExtractor):
+
+    # constructor
+    def __init__(self, input_column):
+        super().__init__([input_column],
+                         "{0}_tfidfvector".format(input_column))
+
+    # don't need to fit, so don't overwrite _set_variables()
+
+    # compute the TF-IDF vector based on the inputs
+    def _get_values(self, inputs):
+        # inputs is list of text documents
+        # create the transform
+        vectorizer = TfidfVectorizer(strip_accents='ascii', stop_words='english', ngram_range=(1, 2))
+        # encode the documents
+        vector = vectorizer.fit_transform(inputs[0])
+
+        return vector.toarray()
diff --git a/code/feature_extraction/time_feature.py b/code/feature_extraction/time_feature.py
new file mode 100644
index 00000000..c0a4ef19
--- /dev/null
+++ b/code/feature_extraction/time_feature.py
@@ -0,0 +1,29 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Created on Tue Oct 19 22:05:32 2021
+
+@author: maximilian
+"""
+
+import numpy as np
+import pandas as pd
+from code.feature_extraction.feature_extractor import FeatureExtractor
+
+# class for extracting the hour of posting as a feature
+class Hours(FeatureExtractor):
+
+    # constructor
+    def __init__(self, input_column):
+        super().__init__([input_column], "{0}_hours".format(input_column))
+
+    # don't need to fit, so don't overwrite _set_variables()
+
+    # use the hour of the 'time' column as a feature
+    def _get_values(self, inputs):
+
+        hours = pd.to_datetime(inputs[0], format='%H:%M:%S').dt.hour
+
+        result = np.array(hours)
+        result = result.reshape(-1,1)
+        return result
diff --git a/code/feature_extraction/video_bool.py b/code/feature_extraction/video_bool.py
new file mode 100644
index 00000000..46824840
--- /dev/null
+++ b/code/feature_extraction/video_bool.py
@@ -0,0 +1,31 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Simple feature that tells whether a video is in the tweet or not.
+
+Created on Wed Sep 29 12:29:25 2021
+
+@author: shagemann
+"""
+
+import numpy as np
+from code.feature_extraction.feature_extractor import FeatureExtractor
+
+# class for extracting the video-bool as a feature
+class VideoBool(FeatureExtractor):
+
+    # constructor
+    def __init__(self, input_column):
+        super().__init__([input_column], "{0}_bool".format(input_column))
+
+    # don't need to fit, so don't overwrite _set_variables()
+
+    # use 'video' column as a feature
+    # 0 if no video, 1 otherwise
+    def _get_values(self, inputs):
+        values = []
+        for index, row in inputs[0].iteritems():
+            values.append(int(row))
+        result = np.array(values)
+        result = result.reshape(-1,1)
+        return result
diff --git a/code/feature_extraction/word2vec.py b/code/feature_extraction/word2vec.py
new file mode 100644
index 00000000..88d58300
--- /dev/null
+++ b/code/feature_extraction/word2vec.py
@@ -0,0 +1,49 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Feature that measures the word2vec similarity between the tweet text and a set of data-science keywords.
+
+Created on Wed Sep 29 12:29:25 2021
+
+@author: lbechberger
+"""
+
+import numpy as np
+import gensim.downloader as api
+import pandas as pd
+import ast
+from code.feature_extraction.feature_extractor import FeatureExtractor
+
+# class for extracting the mean word2vec keyword similarity as a feature
+class Word2Vec(FeatureExtractor):
+
+    # constructor
+    def __init__(self, input_column):
+        super().__init__([input_column], "tweet_word2vec")
+
+    # don't need to fit, so don't overwrite _set_variables()
+
+    # compute the mean similarity to the keywords based on the inputs
+    def _get_values(self, inputs):
+
+        embeddings = api.load('word2vec-google-news-300')  # Try glove-twitter-200 for classifier
+        keywords = ['coding','free','algorithms','statistics']  # deeplearning not present
+
+        tokens = inputs[0].apply(lambda x: list(ast.literal_eval(x)))  # Column from Series to list
+
+        similarities = []
+
+        for rows in tokens:
+            sim = []
+            for word in keywords:
+                for item in rows:
+                    try:
+                        sim.append(embeddings.similarity(item,word))
+                    except KeyError:
+                        pass
+            # similarities.append(max(sim)-np.std(sim))
+            similarities.append(round(np.mean(sim),4))  # try median
+
+        result = np.asarray(similarities)
+        result = result.reshape(-1,1)
+        return result
diff --git a/code/pipeline.sh b/code/pipeline.sh
index 8cfef559..d7a4771c 100755
--- a/code/pipeline.sh
+++ b/code/pipeline.sh
@@ -10,4 +10,4 @@ code/feature_extraction.sh
 echo "dimensionality reduction"
 code/dimensionality_reduction.sh
 echo "classification"
-code/classification.sh
\ No newline at end of file
+code/classification.sh
diff --git a/code/preprocessing.sh b/code/preprocessing.sh
index 61f83ea6..2c342324 100755
--- a/code/preprocessing.sh
+++ b/code/preprocessing.sh
@@ -1,19 +1,19 @@
 #!/bin/bash
 
 # create directory if not yet existing
-mkdir -p data/preprocessing/split/
+#mkdir -p data/preprocessing/split/
 
 # install all NLTK models
-python -m nltk.downloader all
+#python -m nltk.downloader all
 
 # add labels
-echo "  creating labels"
+echo -e "\n -> creating labels\n"
 python -m code.preprocessing.create_labels data/raw/ data/preprocessing/labeled.csv
 
 # other preprocessing (removing punctuation etc.)
-echo "  general preprocessing"
-python -m code.preprocessing.run_preprocessing data/preprocessing/labeled.csv data/preprocessing/preprocessed.csv --punctuation --tokenize -e data/preprocessing/pipeline.pickle
+echo -e "\n -> general preprocessing\n"
+python -m code.preprocessing.run_preprocessing data/preprocessing/labeled.csv data/preprocessing/preprocessed.csv --punctuation --strings --tokenize --language en -e data/preprocessing/pipeline.pickle
 
 # split the data set
-echo "  splitting the data set"
-python -m code.preprocessing.split_data data/preprocessing/preprocessed.csv data/preprocessing/split/ -s 42
\ No newline at end of file
+echo -e "\n -> splitting the data set\n"
+python -m code.preprocessing.split_data data/preprocessing/preprocessed.csv data/preprocessing/split/ -s 42
diff --git a/code/preprocessing/create_labels.py b/code/preprocessing/create_labels.py
index 21b1748d..d91509c8 100644
--- a/code/preprocessing/create_labels.py
+++ b/code/preprocessing/create_labels.py
@@ -28,7 +28,7 @@
 # load all csv files
 dfs = []
 for file_path in file_paths:
-    dfs.append(pd.read_csv(file_path, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n"))
+    dfs.append(pd.read_csv(file_path, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n", low_memory=False))
 
 # join all data into a single DataFrame
 df = pd.concat(dfs)
@@ -42,4 +42,4 @@
 print(df[COLUMN_LABEL].value_counts(normalize = True))
 
 # store the DataFrame into a csv file
-df.to_csv(args.output_file, index = False, quoting = csv.QUOTE_NONNUMERIC, line_terminator = "\n")
\ No newline at end of file
+df.to_csv(args.output_file, index = False, quoting = csv.QUOTE_NONNUMERIC, line_terminator = "\n")
diff --git a/code/preprocessing/language_remover.py b/code/preprocessing/language_remover.py
new file mode 100644
index 00000000..0eba3551
--- /dev/null
+++ b/code/preprocessing/language_remover.py
@@ -0,0 +1,29 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+
+import string
+from code.preprocessing.preprocessor import Preprocessor
+from langdetect import detect
+from code.util import COLUMN_TWEET, COLUMN_LANGUAGE
+
+class LanguageRemover(Preprocessor):
+
+    # constructor
+    def __init__(self, input_column = COLUMN_TWEET, output_column = COLUMN_LANGUAGE):
+        # input column "tweet", new output column
+        super().__init__([input_column], output_column)
+
+    # don't need to fit, so don't overwrite _set_variables()
+
+    # get preprocessed column based on data frame and internal variables
+    def _get_values(self, inputs):
+        # detect the language of each tweet
+        column = [detect(tweet) for tweet in inputs[0]]
+        return column
diff --git a/code/preprocessing/punctuation_remover.py b/code/preprocessing/punctuation_remover.py
index 0f026b0e..3bd13e81 100644
--- a/code/preprocessing/punctuation_remover.py
+++ b/code/preprocessing/punctuation_remover.py
@@ -1,7 +1,7 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 """
-Preprocessor that removes punctuation from the original tweet text.
+Preprocessor that removes punctuation & digits from the original tweet text.
 Created on Wed Sep 29 09:45:56 2021
@@ -11,23 +11,29 @@
 import string
 from code.preprocessing.preprocessor import Preprocessor
 from code.util import COLUMN_TWEET, COLUMN_PUNCTUATION
 
 # removes punctuation from the original tweet
 # inspired by https://stackoverflow.com/a/45600350
 class PunctuationRemover(Preprocessor):
 
     # constructor
-    def __init__(self):
-        # input column "tweet", new output column
-        super().__init__([COLUMN_TWEET], COLUMN_PUNCTUATION)
+    def __init__(self, inputcol, outputcol):
+        # input column, new output column
+        super().__init__([inputcol], outputcol)
 
     # set internal variables based on input columns
     def _set_variables(self, inputs):
         # store punctuation for later reference
-        self._punctuation = "[{}]".format(string.punctuation)
+        # also remove digits and special characters that string.punctuation does not cover
+        self._punctuation = "[{}]".format(string.punctuation+string.digits+'’'+'—'+'”'+'➡️'+'“'+'↓')
 
     # get preprocessed column based on data frame and internal variables
     def _get_values(self, inputs):
         # replace punctuation with empty string
         column = inputs[0].str.replace(self._punctuation, "")
-        return column
\ No newline at end of file
+        return column
diff --git a/code/preprocessing/run_preprocessing.py b/code/preprocessing/run_preprocessing.py
index 72130a30..0b2f61b2 100644
--- a/code/preprocessing/run_preprocessing.py
+++ b/code/preprocessing/run_preprocessing.py
@@ -10,35 +10,58 @@
 import argparse, csv, pickle
 import pandas as pd
+from tqdm import tqdm
 from sklearn.pipeline import make_pipeline
 from code.preprocessing.punctuation_remover import PunctuationRemover
+from code.preprocessing.string_remover import StringRemover
+from code.preprocessing.language_remover import LanguageRemover
 from code.preprocessing.tokenizer import Tokenizer
-from code.util import COLUMN_TWEET, SUFFIX_TOKENIZED
+from code.util import COLUMN_TWEET, SUFFIX_TOKENIZED, COLUMN_LANGUAGE, COLUMN_PREPROCESS
 
 # setting up CLI
 parser = argparse.ArgumentParser(description = "Various preprocessing steps")
 parser.add_argument("input_file", help = "path to the input csv file")
 parser.add_argument("output_file", help = "path to the output csv file")
-parser.add_argument("-p", "--punctuation", action = "store_true", help = "remove punctuation")
+parser.add_argument("-p", "--punctuation", action = "store_true", help = "remove punctuation and special characters")
+parser.add_argument("-s", "--strings", action = "store_true", help = "remove stopwords, links and emojis")
 parser.add_argument("-t", "--tokenize", action = "store_true", help = "tokenize given column into individual words")
-parser.add_argument("--tokenize_input", help = "input column to tokenize", default = COLUMN_TWEET)
 parser.add_argument("-e", "--export_file", help = "create a pipeline and export to the given location", default = None)
+parser.add_argument("--language", help = "only keep tweets written in this language", default = None)
 
 args = parser.parse_args()
 
 # load data
-df = pd.read_csv(args.input_file, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n")
+df = pd.read_csv(args.input_file, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n", low_memory=False)
 
 # collect all preprocessors
 preprocessors = []
 if args.punctuation:
-    preprocessors.append(PunctuationRemover())
+    preprocessors.append(PunctuationRemover(COLUMN_TWEET, COLUMN_PREPROCESS))
+if args.strings:
+    preprocessors.append(StringRemover(COLUMN_PREPROCESS, COLUMN_PREPROCESS))
 if args.tokenize:
-    preprocessors.append(Tokenizer(args.tokenize_input, args.tokenize_input + SUFFIX_TOKENIZED))
+    preprocessors.append(Tokenizer(COLUMN_PREPROCESS, COLUMN_PREPROCESS + SUFFIX_TOKENIZED))
+
+# no need to detect languages with LanguageRemover, because the language is already given in the data set
+# if args.language is not None:
+#     preprocessors.append(LanguageRemover())
+
+if args.language is not None:
+    # filter out all tweets that are not written in the given language
+    before = len(df)
+    df = df[df['language'] == args.language]
+    after = len(df)
+    print("Filtered out: {0} (not '{1}')".format(before - after, args.language))
+    df.reset_index(drop=True, inplace=True)
 
 # call all preprocessing steps
-for preprocessor in preprocessors:
+for preprocessor in tqdm(preprocessors):
     df = preprocessor.fit_transform(df)
 
+# drop the 'trans_dest\r' column, which causes problems in the csv output
+del df['trans_dest\r']
 
 # store the results
 df.to_csv(args.output_file, index = False, quoting = csv.QUOTE_NONNUMERIC, line_terminator = "\n")
@@ -46,4 +69,7 @@
 if args.export_file is not None:
     pipeline = make_pipeline(*preprocessors)
     with open(args.export_file, 'wb') as f_out:
-        pickle.dump(pipeline, f_out)
\ No newline at end of file
+        pickle.dump(pipeline, f_out)
diff --git a/code/preprocessing/split_data.py b/code/preprocessing/split_data.py
index 57bad668..2308ce82 100644
--- a/code/preprocessing/split_data.py
+++ b/code/preprocessing/split_data.py
@@ -23,7 +23,7 @@
 args = parser.parse_args()
 
 # load the data
-df = pd.read_csv(args.input_file, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n")
+df = pd.read_csv(args.input_file, quoting = csv.QUOTE_NONNUMERIC, lineterminator = "\n", low_memory=False)
 
 # split into (training & validation) and test set
 X, X_test = train_test_split(df, test_size = args.test_size, random_state = args.seed, shuffle = True, stratify = df[COLUMN_LABEL])
@@ -37,4 +37,4 @@
 X_val.to_csv(os.path.join(args.output_folder, "validation.csv"), index = False, quoting = csv.QUOTE_NONNUMERIC, line_terminator = "\n")
 X_test.to_csv(os.path.join(args.output_folder, "test.csv"), index = False, quoting = csv.QUOTE_NONNUMERIC, line_terminator = "\n")
 
-print("Training: {0} examples, Validation: {1} examples, Test: {2} examples".format(len(X_train), len(X_val), len(X_test)))
\ No newline at end of file
+print("Training: {0} examples, Validation: {1} examples, Test: {2} examples".format(len(X_train), len(X_val), len(X_test)))
diff --git a/code/preprocessing/string_remover.py b/code/preprocessing/string_remover.py
new file mode 100644
index 00000000..e76b4304
--- /dev/null
+++ b/code/preprocessing/string_remover.py
@@ -0,0 +1,45 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Preprocessor that removes stopwords, links, and emojis from the tweet text.
+Created on Wed Sep 29 09:45:56 2021
+@author: lbechberger
+"""
+import string
+from code.preprocessing.preprocessor import Preprocessor
+from code.util import COLUMN_TWEET, COLUMN_PUNCTUATION
+from nltk.corpus import stopwords
+import pandas as pd
+
+STOPWORDS = set(stopwords.words('english'))
+
+# removes stopwords, links, and emojis from the tweet text
+class StringRemover(Preprocessor):
+
+    # constructor
+    def __init__(self, inputcol, outputcol):
+        # input column, new output column
+        super().__init__([inputcol], outputcol)
+
+    # don't need to fit, so don't overwrite _set_variables()
+
+    # get preprocessed column based on data frame and internal variables
+    def _get_values(self, inputs):
+        column = inputs[0].str
+
+        # replace stopwords with empty string
+        column = [' '.join([word.lower() for word in tweet if word.lower() not in STOPWORDS]) for tweet in column.split()]
+        column = pd.Series(column)
+        # replace links with empty string
+        column = [' '.join([word for word in tweet if word.startswith('http') is False]) for tweet in column.str.split()]
+        column = pd.Series(column)
+        # replace emojis with empty string
+        column = [' '.join([word for word in tweet if str(word.encode('unicode-escape').decode('ASCII')).__contains__('\\U') is False]) for tweet in column.str.split()]
+        column = pd.Series(column)
+        return column
diff --git a/code/preprocessing/tokenizer.py b/code/preprocessing/tokenizer.py
index 94191502..9f90f87b 100644
--- a/code/preprocessing/tokenizer.py
+++ b/code/preprocessing/tokenizer.py
@@ -24,14 +24,20 @@
     def _get_values(self, inputs):
         """Tokenize the tweet."""
         tokenized = []
 
         for tweet in inputs[0]:
-            sentences = nltk.sent_tokenize(tweet)
+            if type(tweet) is float:
+                # if tweet is nan, maybe because of stopword removal
+                sentences = nltk.sent_tokenize('')
+            else:
+                sentences = nltk.sent_tokenize(tweet)
             tokenized_tweet = []
             for sentence in sentences:
                 words = nltk.word_tokenize(sentence)
                 tokenized_tweet += words
-            tokenized.append(str(tokenized_tweet))
-
-        return tokenized
\ No newline at end of file
+            tokenized.append(str(tokenized_tweet).lower().strip('\\u'))
 
+        return tokenized
diff --git a/code/util.py b/code/util.py
index 7d8794c7..73982ab4 100644
--- a/code/util.py
+++ b/code/util.py
@@ -12,9 +12,16 @@
 COLUMN_TWEET = "tweet"
 COLUMN_LIKES = "likes_count"
 COLUMN_RETWEETS = "retweets_count"
+COLUMN_PHOTOS = "photos"
+COLUMN_VIDEO = "video"
+COLUMN_REPLIES = "replies_count"
 
 # column names of novel columns for preprocessing
 COLUMN_LABEL = "label"
 COLUMN_PUNCTUATION = "tweet_no_punctuation"
+COLUMN_LANGUAGE = "language"
+COLUMN_PREPROCESS = 'preprocess_col'
+SUFFIX_TOKENIZED = "_tokenized"
 
-SUFFIX_TOKENIZED = "_tokenized"
\ No newline at end of file
+# number of features for hash vector
+HASH_VECTOR_N_FEATURES = 2**10
diff --git a/script/application.sh b/script/application.sh
new file mode 100755
index 00000000..9dea8208
--- /dev/null
+++ b/script/application.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+# execute the application with all necessary pickle files
+# dim_red not used: data/dimensionality_reduction/pipeline.pickle
+echo "Starting the application..."
+python -m script.application.application data/preprocessing/pipeline.pickle data/feature_extraction/pipeline.pickle data/classification/classifier.pickle
diff --git a/script/application/application.py b/script/application/application.py
new file mode 100644
index 00000000..3fb5e6d1
--- /dev/null
+++ b/script/application/application.py
@@ -0,0 +1,91 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Console-based application for tweet classification.
+
+Created on Wed Sep 29 14:49:25 2021
+
+@author: lbechberger, mkalcher, magmueller, shagemann
+"""
+
+import argparse, pickle
+import pandas as pd
+import numpy as np
+from datetime import datetime
+from sklearn.pipeline import make_pipeline
+from code.util import COLUMN_TWEET, COLUMN_HASHTAGS, COLUMN_PHOTOS, COLUMN_VIDEO, COLUMN_TIME
+
+# setting up CLI
+# the user can choose the features to extract in the file 'feature_extraction.sh'
+parser = argparse.ArgumentParser(description = "Application")
+parser.add_argument("preprocessing_file", help = "path to the pickle file containing the preprocessing")
+parser.add_argument("feature_file", help = "path to the pickle file containing the feature extraction")
+parser.add_argument("classifier_file", help = "path to the pickle file containing the classifier")
+args = parser.parse_args()
+
+# load all the pipeline steps
+# dimensionality_reduction is NOT used here
+with open(args.preprocessing_file, 'rb') as f_in:
+    preprocessing = pickle.load(f_in)
+with open(args.feature_file, 'rb') as f_in:
+    feature_extraction = pickle.load(f_in)
+with open(args.classifier_file, 'rb') as f_in:
+    classifier = pickle.load(f_in)
+
+# chain them together into a single pipeline
+pipeline = make_pipeline(preprocessing, feature_extraction, classifier)
+
+# headline output
+print("Welcome to ViralTweeter v0.1!")
+print("-----------------------------")
+print("")
+
+while True:
+    # ask user for input
+    tweet = input("Please type in your tweet (type 'quit' to quit the program): ")
+
+    # terminate if necessary
+    if tweet == "quit":
+        print("Okay, goodbye!")
+        break
+
+    # ask how many videos the tweet contains
+    check_bool = True
+    while check_bool:
+        input_video = input("How many videos does your tweet contain? (type in 0, 1, 2, ...): ")
+        if not input_video.isnumeric():
+            print("Your input must be an integer!")
+        else:
+            check_bool = False
+
+    # check how many photos the tweet contains
+    input_photos = []
+    for word in tweet.split():
+        if "https://pbs.twimg.com/media" in word and (".png" in word or ".jpg" in word):
+            input_photos.append(word)
+
+    # get current time
+    now = datetime.now()
+
+    # if not terminated: create pandas DataFrame and put it through the pipeline
+    # the feature columns that are not generated in feature extraction need to be put in manually
+    # the video column needs to be manually put in by the user
+    # the current time is generated automatically
+    df = pd.DataFrame()
+    df[COLUMN_TWEET] = [tweet]
+    df[COLUMN_HASHTAGS] = [tweet.split()]
+    df[COLUMN_PHOTOS] = [input_photos]
+    df[COLUMN_VIDEO] = [input_video]
+    df[COLUMN_TIME] = [now.strftime("%H:%M:%S")]
+    prediction = pipeline.predict(df)
+    try:
+        confidence = pipeline.predict_proba(df)
+        print("Prediction: {0}, Confidence: {1}".format(prediction, confidence))
+    except AttributeError:
+        # classifier does not support predict_proba
+        print("Prediction:", prediction)
+    if prediction.flat[0] == False:
+        print("Your tweet will most likely not go viral.")
+    elif prediction.flat[0] == True:
+        print("Your tweet will most likely go viral.")
+
+    print("")
diff --git a/script/util.py b/script/util.py
new file mode 100644
index 00000000..4e3df22c
--- /dev/null
+++ b/script/util.py
@@ -0,0 +1,29 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Utility file for collecting frequently used constants and helper functions.
+
+Created on Wed Sep 29 10:50:36 2021
+
+@author: lbechberger, shagemann
+"""
+
+# column names for the original data frame
+COLUMN_TWEET = "tweet"
+COLUMN_LIKES = "likes_count"
+COLUMN_RETWEETS = "retweets_count"
+COLUMN_PHOTOS = "photos"
+COLUMN_VIDEO = "video"
+COLUMN_REPLIES = "replies_count"
+COLUMN_HASHTAGS = "hashtags"
+COLUMN_TIME = "time"
+
+# column names of novel columns for preprocessing
+COLUMN_LABEL = "label"
+COLUMN_PUNCTUATION = "tweet_no_punctuation"
+COLUMN_LANGUAGE = "language"
+COLUMN_PREPROCESS = 'preprocess_col'
+SUFFIX_TOKENIZED = "_tokenized"
+
+# number of features for hash vector
+HASH_VECTOR_N_FEATURES = 2**3
diff --git a/test/classifier/__init__.py b/test/classifier/__init__.py
new file mode 100644
index 00000000..4287ca86
--- /dev/null
+++ b/test/classifier/__init__.py
@@ -0,0 +1 @@
+#
\ No newline at end of file
diff --git a/test/classifier/run_classifier_test.py b/test/classifier/run_classifier_test.py
new file mode 100644
index 00000000..4051c01b
--- /dev/null
+++ b/test/classifier/run_classifier_test.py
@@ -0,0 +1,73 @@
+import unittest
+import pandas as pd
+import numpy as np
+from argparse import Namespace
+from code.classification.run_classifier import load_dataset, create_classifier
+
+
+class TestClassifier(unittest.TestCase):
+    def setUp(self):
+        self.small_len = 10
+        self.args = Namespace(
+            input_file="data/feature_extraction/training.pickle",
+            small=self.small_len,
+            seed=42,
+            balanced_data_set=True,
+            import_file=None,
+            majority=False,
+            frequency=False,
+            sgd=False,
+            svm=False,
+            LinearSVC=False,
+            MultinomialNB=False,
+            LogisticRegression=False,
+            knn=1,
+        )
+        self.data = load_dataset(self.args)
+        self.all_clf = [
+            "majority",
+            "frequency",
+            "sgd",
+            "LogisticRegression",
+            "MultinomialNB",
+            "LinearSVC",
+            "svm",
+        ]
+
+    def test_load_dataset(self):
+        # small arg is working
+        self.assertEqual(len(self.data["labels"]), self.small_len)
+
+        # dict has 3 output keys
+        self.assertEqual(len(self.data), 3)
+
+        # is dict
+        self.assertIsInstance(self.data, dict)
+
+    def test_create_classifier(self):
+        for clf in self.all_clf:
+            args_dict = vars(self.args)
+            args_dict[clf] = True
+            try:
+                classifier = create_classifier(self.args, self.data)
+            except Exception:
+                raise Exception("{0} was not created successfully.".format(clf))
+            args_dict[clf] = False
+
+    def test_no_classifier(self):
+        self.args.knn = None
+        with self.assertRaises(UnboundLocalError):
+            create_classifier(self.args, self.data)
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/test/feature_extraction/bigrams_test.py b/test/feature_extraction/bigrams_test.py
index 29abfdae..677eecc3 100644
--- a/test/feature_extraction/bigrams_test.py
+++ b/test/feature_extraction/bigrams_test.py
@@ -40,4 +40,4 @@
         self.assertEqual(freq_list[0][0], EXPECTED_BIGRAM)
 
 if __name__ == '__main__':
-    unittest.main()
\ No newline at end of file
+    unittest.main()
diff --git a/test/feature_extraction/hash_vector_test.py b/test/feature_extraction/hash_vector_test.py
new file mode 100644
index 00000000..96de1e6a
--- /dev/null
+++ b/test/feature_extraction/hash_vector_test.py
@@ -0,0 +1,38 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Created on Thu Oct 14 14:51:00 2021
+"""
+
+import unittest
+import pandas as pd
+import numpy as np
+from code.feature_extraction.hash_vector import HashVector
+
+class HashVectorTest(unittest.TestCase):
+
+    def setUp(self):
+        self.INPUT_COLUMN = "input"
+        self.hash_vector_feature = HashVector(self.INPUT_COLUMN)
+        self.df = pd.DataFrame()
+        self.df[self.INPUT_COLUMN] = ["This is a tweet This is also a test",
+                                      "This is a tweet This is also a test", "hallo ne data science", "OMG look at this"]
+        self.result = self.hash_vector_feature._get_values([self.df.squeeze()])
+
+    def test_input_columns(self):
+        self.assertEqual(self.hash_vector_feature._input_columns, [self.INPUT_COLUMN])
+
+    def test_feature_name(self):
+        self.assertEqual(self.hash_vector_feature.get_feature_name(), self.INPUT_COLUMN + "_hashvector")
+
+    def test_result_shape(self):
+        self.assertEqual(self.result.shape[0], len(self.df[self.INPUT_COLUMN]))
+
+    def test_result_type(self):
+        self.assertEqual(type(self.result), np.ndarray)
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/test/feature_extraction/tfidf_vec_test.py b/test/feature_extraction/tfidf_vec_test.py
new file mode 100644
index 00000000..1b377114
--- /dev/null
+++ b/test/feature_extraction/tfidf_vec_test.py
@@ -0,0 +1,44 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Created on Thu Oct 7 14:51:00 2021
+
+@author: ml
+"""
+
+import unittest
+import pandas as pd
+import numpy as np
+
+from code.feature_extraction.tfidf_vector import TfidfVector
+
+
+class TfidfVectorTest(unittest.TestCase):
+
+    def setUp(self):
+        self.INPUT_COLUMN = "preprocessing_col"
+        self.tfidf_vector_feature = TfidfVector(self.INPUT_COLUMN)
+        self.df = pd.DataFrame()
+        self.df[self.INPUT_COLUMN] = ["This is a tweet This is also a test",
+                                      "This is a tweet This is also a test", "hallo ne data science", "OMG look at this"]
+        self.result = self.tfidf_vector_feature._get_values([self.df.squeeze()])
+
+    def test_input_columns(self):
+        self.assertEqual(self.tfidf_vector_feature._input_columns, [self.INPUT_COLUMN])
+
+    def test_feature_name(self):
+        self.assertEqual(self.tfidf_vector_feature.get_feature_name(), self.INPUT_COLUMN + "_tfidfvector")
+
+    def test_result_shape(self):
+        self.assertEqual(self.result.shape[0], len(self.df[self.INPUT_COLUMN]))
+
+    def test_result_positive(self):
+        self.assertTrue(np.all(self.result >= 0))
+
+
+if __name__ == '__main__':
+    unittest.main()
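The `HashtagCount` extractor in the diff above parses the stringified hashtag lists from the csv with `ast.literal_eval` before counting and reshaping. A minimal standalone sketch of that transformation (the sample rows are made up for illustration; they mimic how the `hashtags` column is stored):

```python
import ast

import numpy as np

# stringified hashtag lists, as they appear in the raw csv (sample values)
rows = ["['ml', 'datascience']", "[]", "['ai']"]

# parse each string back into a Python list and count its entries,
# then reshape into the (n_samples, 1) column the FeatureCollector expects
counts = np.array([len(ast.literal_eval(r)) for r in rows]).reshape(-1, 1)
print(counts.ravel().tolist())  # [2, 0, 1]
```

`ast.literal_eval` is used rather than `eval` so that only Python literals are accepted, which is safe on untrusted csv content.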