# ---
# jupyter:
# jupytext:
# formats: ipynb,py:percent
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.3.1
# kernelspec:
# display_name: Python 3
# language: python
# name: python3
# ---
# %% [markdown]
# In this notebook, I will do some exploratory data analysis to understand the dataset better and to see which parts we need to focus on before text processing.
# %% [markdown]
# ### Install jupytext for notebook version control
# %% language="bash"
# pip install --upgrade pip
# pip install jupytext
# %% [markdown]
# ### Download Stack Overflow questions dataset from Kaggle
# %% language="bash"
# pip install kaggle --upgrade
# %% language="bash"
# kaggle datasets download -d stackoverflow/stacksample --force
# %% language="bash"
# unzip /home/ec2-user/SageMaker/stack-overflow-questions-auto-tagger/stacksample.zip -d data
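# %% [markdown]
# Verify the extracted files (a quick check; the stacksample archive contains `Questions.csv`, `Answers.csv` and `Tags.csv`):
# %% language="bash"
# ls -lh data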
# %% [markdown]
# ### Exploratory Data Analysis
# %%
import os
HOME_DIR = os.curdir
DATA_DIR = os.path.join(HOME_DIR, "data")
# %% language="bash"
# pip install --upgrade pip
# pip install seaborn==0.9.0
# %%
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.max_colwidth = 255
sns.set(style="darkgrid")
# %% [markdown]
# #### Questions
# %%
# %%time
questions_df = pd.read_csv(os.path.join(DATA_DIR, "Questions.csv"), encoding="ISO-8859-1", parse_dates=["CreationDate", "ClosedDate"])
# %%
print(f"Number of rows: {questions_df.shape[0]}")
print(f"Number of columns: {questions_df.shape[1]}")
# %%
questions_df.head()
# %% [markdown]
# We can see that `Title` is in plain text, while `Body` is in HTML format, which requires a lot of data cleansing before it is in a useful format. Also note that punctuation can be meaningful in this problem, e.g. `ASP.NET`, `C#`, etc., so we need to be careful not to remove it during data cleansing.
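# %% [markdown]
# As a quick illustration (a minimal sketch; the actual cleansing happens in a later step), we can peek at a raw `Body` and roughly count its markup with a simple regex:
# %%
import re

# take the first question body as a sample
sample_body = questions_df["Body"].iloc[0]
print(sample_body[:200])
# count HTML tags with a rough heuristic (not a proper HTML parser)
print(f"Approximate number of HTML tags: {len(re.findall(r'<[^>]+>', sample_body))}")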
# %% [markdown]
# #### Tags
# %%
# %%time
tags_df = pd.read_csv(os.path.join(DATA_DIR, "Tags.csv"), encoding="ISO-8859-1")
# %%
print(f"Number of rows: {tags_df.shape[0]}")
print(f"Number of columns: {tags_df.shape[1]}")
# %%
tags_df.head()
# %% [markdown]
# `Questions` has a one-to-many relationship with `Tags`: each record in `Tags` contains a single question ID and tag pair, so a question with multiple tags appears in multiple rows.
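# %% [markdown]
# A quick sanity check on this relationship (`Id` is the question ID column in both files):
# %%
# `Id` should be unique in Questions but repeated in Tags
print(f"Unique ids in Questions: {questions_df['Id'].is_unique}")
print(f"Unique ids in Tags: {tags_df['Id'].is_unique}")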
# %% [markdown]
# ### Top 10 tags with most questions
# %%
tag_value_counts = tags_df["Tag"].value_counts()
# %%
top_ten_tags = tag_value_counts.head(10)
top_ten_tags
# %%
sns.barplot(x=top_ten_tags.index, y=top_ten_tags.values)
plt.xticks(rotation=45)
# %% [markdown]
# ### Top 50 tags with most questions
# %%
top_fifty_tags = tag_value_counts.head(50)
top_fifty_tags
# %% [markdown]
# #### Let's plot the counts to have a better visualization about the distribution:
# %%
top_fifty_tags_barplot = sns.barplot(x=top_fifty_tags.index, y=top_fifty_tags.values)
for i, label in enumerate(top_fifty_tags_barplot.xaxis.get_ticklabels()):
if i % 5 != 0:
label.set_visible(False)
plt.xticks(rotation=45)
top_fifty_tags_barplot
# %% [markdown]
# We can see that the number of questions per tag clearly demonstrates a long-tail distribution. Therefore, we can limit the number of tags to include in the dataset, so that model training can be more efficient while still maintaining a high level of accuracy.
# %%
pd.options.display.float_format = "{:.2f}%".format
100 * tag_value_counts.head(4000).cumsum() / tag_value_counts.sum()
# %% [markdown]
# The top 4000 tags cover almost 90% of all tag occurrences in the dataset. Therefore, I will limit the dataset to include only questions with the top 4000 tags to reduce the size and time for model training. We can always include more tags later in case we find the model is not as performant as expected.
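# %% [markdown]
# For reference, the filtering could look like the following sketch (the actual filtering is applied later in the pipeline; `4000` is the cutoff chosen above):
# %%
# keep only the tag rows whose tag is among the 4000 most frequent ones
top_tags = set(tag_value_counts.head(4000).index)
filtered_tags_df = tags_df[tags_df["Tag"].isin(top_tags)]
print(f"Retained {len(filtered_tags_df)} of {len(tags_df)} tag rows")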
# %% [markdown]
# #### Joining `Questions` with `Tags`
# %%
# standardize column names
for df in [questions_df, tags_df]:
df.columns = df.columns.str.lower()
# %%
# %%time
# group rows per question id
tags_per_question_df = tags_df.groupby(['id'])['tag'].apply(list)
# %%
tags_per_question_df.head()
# %%
# %%time
# we are only interested in the text columns from `questions_df`;
# the merge key `id` matches the index level name of `tags_per_question_df`
df = questions_df[["id", "title", "body"]].merge(tags_per_question_df.to_frame(), on="id")
# %%
df["tag_count"] = df["tag"].apply(len)
# %%
df.head()
# %% [markdown]
# #### Minimum, maximum and average tags per question
# %%
min_tag_count = df["tag_count"].min()
max_tag_count = df["tag_count"].max()
avg_tag_count = df["tag_count"].mean()
# %%
print(f"Each question has a minimum of {min_tag_count} tag and a maximum of {max_tag_count} tags. \
The average number of tags per question is {avg_tag_count:.2f}.")
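# %% [markdown]
# The full distribution of tags per question (a quick sketch using seaborn's `countplot`):
# %%
sns.countplot(x="tag_count", data=df)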
# %% [markdown]
# ### Export dataframe for next section
# %%
df.to_pickle(f"{DATA_DIR}/eda.pkl")
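# %% [markdown]
# For reference, the next notebook can load the exported dataframe back as follows (the path matches the export above):
# %%
df = pd.read_pickle(f"{DATA_DIR}/eda.pkl")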
# %%