-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gitter: Messages, Analysis and Import #145
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also typo in commit message, anayise
:P
gitter/dicttagger.py
Outdated
|
||
class DictionaryTagger(object): | ||
def __init__(self, dictionary_paths): | ||
files = [open(path, 'r') for path in dictionary_paths] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unnecessary to load all files into memory together. Also, it's better to use context manager when reading from the file IMO.
gitter/dicttagger.py
Outdated
self.dictionary[key] = curr_dict[key] | ||
self.max_key_size = max(self.max_key_size, len(key)) | ||
|
||
def tag(self, postagged_sentences): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
postagged
-> pos_tagged
maybe?
gitter/dicttagger.py
Outdated
else: | ||
literal = expression_form | ||
if literal in self.dictionary: | ||
# self.logger.debug("found: %s" % literal) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either remove this line or remove #
gitter/ignore.yml
Outdated
@@ -0,0 +1,5 @@ | |||
welcome: [ignore] | |||
thank you: [ignore] | |||
thanks: [ignore] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also thx
;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any library/external resource that contains similar words/phrases?
gitter/models.py
Outdated
sent_by = models.CharField(max_length=300) | ||
|
||
def __str__(self): | ||
return ('sent_by: ' + self.sent_by + 'on room: ' + self.room) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space before on room
gitter/newcomer_messages.py
Outdated
|
||
def get_newcomer_messages(): | ||
logger = logging.getLogger(__name__) | ||
IMPORT_URL = 'https://pastebin.com/raw/TR4YaaU8' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for test/demonstration only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, the messages data will come from webservices, after https://gitlab.com/coala/landing/merge_requests/45 is merged.
Those are the output messages from that MR, when I run it locally.
gitter/newcomer_messages.py
Outdated
response.raise_for_status() | ||
except Exception as e: | ||
logger.error(e) | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite understand why simply return
here. Would the program do anything meaningful even if it doesn't get any message?
gitter/preprocessing.py
Outdated
e.g.: [['this', 'is', 'a', 'sentence'], | ||
['this', 'is', 'another', 'one']] | ||
""" | ||
sentences = self.nltk_splitter.tokenize(text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with nltk
, but I wonder what would happen if the text is not in English? Would it be an issue?
gitter/question.yml
Outdated
@@ -0,0 +1,4 @@ | |||
when: [question] | |||
why: [question] | |||
how: [question] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and what
, where
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will add those for sure.
I will add many more. Keep giving your suggestions. ;)
gitter/score.py
Outdated
return dict_tagged_sentences | ||
|
||
|
||
def value_of(sentiment): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better explain a bit how these values are determined, like why questions are given negative marks.
@li-boxuan, This is a wip PR, I created for showing the demo of how the process will go on... This is a part of my second coding phase. I will improve this day by day in the meantime. Thanks for the review ;) |
Okay, I thought it was ready for review 😅 |
gitter/score.py
Outdated
def sentiment_score(text, review): | ||
return sum( | ||
[value_of(tag) for sentence in tagged_sentences(text) | ||
for token in sentence for tag in token[2]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code does not comply to PEP8.
Origin: PEP8Bear, Section: all.python.default
.
The issue can be fixed by applying the following patch:
--- a/tmp/tmpge2qu_p4/gitter/score.py
+++ b/tmp/tmpge2qu_p4/gitter/score.py
@@ -23,7 +23,7 @@
def sentiment_score(text, review):
return sum(
[value_of(tag) for sentence in tagged_sentences(text)
- for token in sentence for tag in token[2]])
+ for token in sentence for tag in token[2]])
def message_score(message):
gitter/score.py
Outdated
def sentiment_score(text, review): | ||
return sum( | ||
[value_of(tag) for sentence in tagged_sentences(text) | ||
for token in sentence for tag in token[2]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E128 continuation line under-indented for visual indent
Origin: PycodestyleBear (E128), Section: all.python.default
.
gitter/dict_tagger.py
Outdated
tagged = False | ||
while (j > i): | ||
expression_form = ' '.join([word[0] | ||
for word in sentence[i:j]]).lower() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code does not comply to PEP8.
Origin: PEP8Bear, Section: all.python.default
.
The issue can be fixed by applying the following patch:
--- a/tmp/tmpge2qu_p4/gitter/dict_tagger.py
+++ b/tmp/tmpge2qu_p4/gitter/dict_tagger.py
@@ -37,7 +37,7 @@
tagged = False
while (j > i):
expression_form = ' '.join([word[0]
- for word in sentence[i:j]]).lower()
+ for word in sentence[i:j]]).lower()
expression_lemma = ' '.join(
[word[1] for word in sentence[i:j]]).lower()
if tag_with_lemmas:
gitter/dict_tagger.py
Outdated
tagged = False | ||
while (j > i): | ||
expression_form = ' '.join([word[0] | ||
for word in sentence[i:j]]).lower() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E128 continuation line under-indented for visual indent
Origin: PycodestyleBear (E128), Section: all.python.default
.
Travis tests have failedHey @sks444, 2nd Buildcoala --non-interactive -V
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gitmate and travis reported errors.
unreviewable!
Travis tests have failedHey @sks444, 2nd Buildcoala --non-interactive -V
|
Comment on e46b53e, file openhub/organization.py, line 13. Broken link - unable to connect to https://www.openhub.net/orgs/ Origin: InvalidLinkBear, Section: |
Comment on e46b53e, file openhub/data.py, line 20. Broken link - unable to connect to https://www.openhub.net/orgs/ Origin: InvalidLinkBear, Section: |
Comment on e46b53e, file community/settings.py, line 7. Broken link - unable to connect to https://docs.djangoproject.com/en/1.11/topics/settings/ Origin: InvalidLinkBear, Section: |
gitter/messages.py
Outdated
import requests | ||
|
||
from data.webservices import webservices_url | ||
from gitter.score import sentiment_score, tagged_sentences |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'gitter.score.tagged_sentences' imported but unused
Origin: PyFlakesBear, Section: all.python.default
.
Travis tests have failedHey @sks444, 2nd Buildcoala --non-interactive -V
|
Travis tests have failedHey @sks444, 2nd Build./.ci/build.sh
|
Comment on 9114cab, file community/settings.py, line 117. Broken link - unable to connect to https://docs.djangoproject.com/en/1.11/topics/i18n/ Origin: InvalidLinkBear, Section: |
Travis tests have failedHey @sks444, 2nd Build./.ci/build.sh
|
Travis tests have failedHey @sks444, 2nd Build./.ci/build.sh
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cant be merged without lots of tests showing the sentiment analysis is correct when it makes a deduction, or it skips the message and the message is not stored in the data model or included in any analysis.
gitter/pos_tagger.py
Outdated
@@ -0,0 +1,28 @@ | |||
import nltk | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all nlp files should go in gitter/nlp/
, and nlp training data files in gitter/nlp/training_data/
gitter/score.py
Outdated
postagger = POSTagger() | ||
splitted_sentences = splitter.split(text) | ||
pos_tagged_sentences = postagger.pos_tag(splitted_sentences) | ||
dicttagger = DictionaryTagger(['gitter/question.yml', 'gitter/ignore.yml']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need one Splitter, one POSTagger and one DictionaryTagger.
Do not recreate these objects every time you want to do some nlp on a fragment.
Create another class which creates and uses those objects internally, and instantiate only one instance of our class.
gitter/score.py
Outdated
return 1 | ||
if sentiment == 'question': | ||
return -1 | ||
return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this system currently only knows about two sentiments, ignore and question (answer is missing).
It should raise a ValueError here if the sentiment is not known, as its score can not be determined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayvdb, actually the strategy I am using here is a message which don't include the ignore
and question
words should be categorized into answer
. That's why I am returing 0
here, so that when we get the sentiment_score=0
, we can say that this message is a answer.
gitter/messages.py
Outdated
|
||
""" | ||
text = message['text'] | ||
return (sentiment_score(text)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just return sentiment_score(text)
gitter/messages.py
Outdated
return (sentiment_score(text)) | ||
|
||
|
||
def import_messages(message): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this import_messages
or import_message
?
gitter/messages.py
Outdated
return data | ||
|
||
|
||
def message_score(message): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no message scoring in https://github.com/coala/cEPs/blob/master/cEP-0020.md#gitter , so this method name is not appropriate.
You can use score
terminology in the gitter/nlp
namespace, but in the main app code there is only 'question', 'answer', and 'not known'.
gitter/messages.py
Outdated
"room": "offtopic", | ||
"sent_at": "2018-07-25 14:00:09.456000+00:00", | ||
"sent_by": "Naveenaidu", | ||
"text": "How can I solve this issue?" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bad indentation and missing trailing comma.
gitter/messages.py
Outdated
def import_messages(message): | ||
logger = logging.getLogger(__name__) | ||
sent_by = message.get('sent_by') | ||
score = message_score(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no use of word 'score' anywhere in the main app logic. The nlp layer should spit out a 'message classification'.
maybe create an enum
MessageCategory
/ MessageType
or something like that.
gitter/models.py
Outdated
sent_by = models.CharField(max_length=300) | ||
|
||
def __str__(self): | ||
return ('sent_by: ' + self.sent_by + ' on room: ' + self.room) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the identifier
and text
are more useful
also add __repr__
to be more helpful thant __str__
for debugging
I have written a blog about the work done in this pr. https://sks444.github.io/gsoc/pinned/2018/07/25/GSoC'18-CodingPhase3-Part-I.html |
Travis tests have failedHey @sks444, 2nd Build./.ci/build.sh
|
Travis tests have failedHey @sks444, 1st Build./.ci/build.sh
|
gitter/tests/test_message_type.py
Outdated
message2s = {'text': 'No create a pull request.The Travis CI ' | ||
'build will continue its checking.'} | ||
m_type2 = message_type(message2s) | ||
self.assertEqual(m_type2, 'answer') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more tests coming soon..
gitter/nlp/training_data/__init__.py
Outdated
@@ -0,0 +1,4 @@ | |||
import nltk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed this file by mistake. Will remove it in the next push.
gitter/nlp/splitter.py
Outdated
class Splitter(object): | ||
|
||
def __init__(self): | ||
nltk.download('punkt') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayvdb, Is there any other way to download this and nltk.download('averaged_perceptron_tagger')
? As of now its running every time we analyze a message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, do it in the module scope, or better create a single django command to download them all
gitter/nlp/splitter.py
Outdated
class Splitter(object): | ||
|
||
def __init__(self): | ||
nltk.download('punkt') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, do it in the module scope, or better create a single django command to download them all
help = 'Import Newcomer Messages' | ||
|
||
def handle(self, *args, **options): | ||
MESSAGES = get_messages() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this uppercase crap
gitter/messages.py
Outdated
Get all the messages send by newcomers on the gitter rooms. | ||
""" | ||
logger = logging.getLogger(__name__) | ||
IMPORT_URL = webservices_url('messages') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more uppercase
gitter/tests/test_message_type.py
Outdated
|
||
message18s = {'text': 'ok, so what should I do now?'} | ||
m_type18 = message_type(message18s) | ||
self.assertEqual(m_type18, 'question') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create your own assertion method assertIsQuestion('ok, so what should I do now?')
Travis tests have failedHey @sks444, 1st Build./.ci/build.sh
pytest
|
This PR creates an app gitter, and anayise the messages by the newcomers, which is being fetched from webservices, and then import those messages into two models Question and Answer. Closes #84
Comment on 5f0de59, file gci/feeds.py, line 37. Broken link - unable to connect to https://codein.withgoogle.com/tasks/ Origin: InvalidLinkBear, Section: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an alternative is to create a django app for this.
Get the type of a message from a message_dict. | ||
|
||
:param message: a message dict of type: | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use a doctest
for the example
def pos_tag(self, sentences): | ||
""" | ||
input format: list of lists of words | ||
e.g.: [['this', 'is', 'a', 'sentence'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use doctest
for examples
Get a value based on the sentiment. | ||
|
||
:param sentiment: a string representing the sentiment of | ||
a word. i.e. 'ignore' or 'question'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use doctest
for examples
Get the score of a message. | ||
|
||
:param text: a string representing the text of a message. | ||
e.g. 'How to solve this issue?' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use doctest
for examples
and 'training_data/answers.yml'. | ||
|
||
:param text: a string representing the message. | ||
e.g. 'How to solve this issue?' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use doctest for examples
@@ -8,6 +8,7 @@ nocover_file_globs: | |||
- community/git.py | |||
- gci/*.py | |||
- gsoc/*.py | |||
- gitter/nlp/dict_tagger.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use def tag_sentence
in the regex section, if that is the problem.
but you must get full coverage of __init__
Going for creating a django app :) |
@sks444 It seems that work has stopped on this PR, are you looking forward to create a django app for it ? |
This PR creates an app gitter, and anayise the
messages by the newcomers, which is being fetched from
webservices, and then import those messages into
two models Question and Answer.
Closes #84