Description
In `dps/dps/spark/jobs/japanese_job.py`, lines 64 to 75 at `bec4078`:

```python
.filter(lambda x: japanese_bad_words_filter(x[use_column]))
.filter(lambda x: doc_len_filter(x[use_column], conf["min_doc_len"], conf["max_doc_len"]))
.filter(lambda x: japanese_mean_word_len_filter(x[use_column], conf["min_mean_word_len"], conf["max_mean_word_len"]))
.filter(lambda x: japanese_symbol_to_word_ratio_filter(x[use_column], conf["symbol_to_word_ratio"]))
.filter(lambda x: bullet_ellipsis_filter(x[use_column], conf["bullet_point_ratio"], conf["ellipsis_ratio"]))
.filter(lambda x: japanese_word_ratio_filter(x[use_column], conf["japanese_word_ratio"]))
.filter(lambda x: dict(text=preprocess_text(x[use_column])))
.filter(lambda x: doc_len_filter(x[use_column], conf["min_doc_len"], conf["max_doc_len"]))
.filter(lambda x: japanese_frequent_char_existence_filter(x[use_column], conf["freq_char_cnt"]))
.filter(lambda x: reduce_japanese_emoticon(x[use_column]))
.filter(lambda x: many_separators_filter(x[use_column], conf["separator_ratio"]))
.filter(lambda x: remove_symbols(x[use_column]))
```

several of these calls apply a text transformation inside `.filter`, but a transformation should be applied with `.map` instead.
For example, line 73 of `dps/dps/spark/jobs/japanese_job.py` at `bec4078`:

```python
.filter(lambda x: reduce_japanese_emoticon(x[use_column]))
```
calls `reduce_japanese_emoticon`, defined in `dps/dps/spark/prep/japanese_prep.py`, lines 64 to 67 at `bec4078`:

```python
def reduce_japanese_emoticon(text):
    text = re.sub("w{3,}", "www", text)
    text = re.sub("笑{2,}", "笑", text)
    return text
```
but in effect this does nothing, because the expression inside `.filter` is always truthy as long as `text` is non-empty:
```python
>>> import re
>>> def reduce_japanese_emoticon(text):
...     text = re.sub("w{3,}", "www", text)
...     text = re.sub("笑{2,}", "笑", text)
...     return text
...
>>> rdd = sc.parallelize([{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}, {'text': ''}])
>>> rdd.filter(lambda x: reduce_japanese_emoticon(x['text'])).collect()
[{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}]
```

Thus, I think the following uses of `.filter` are simply doing nothing instead of the intended preprocessing:
- `preprocess_text` on `dps/dps/spark/jobs/japanese_job.py`, line 70 at `bec4078`:
  ```python
  .filter(lambda x: dict(text=preprocess_text(x[use_column])))
  ```
- `reduce_japanese_emoticon` on `dps/dps/spark/jobs/japanese_job.py`, line 73 at `bec4078`:
  ```python
  .filter(lambda x: reduce_japanese_emoticon(x[use_column]))
  ```
- `remove_symbols` on `dps/dps/spark/jobs/japanese_job.py`, line 75 at `bec4078`:
  ```python
  .filter(lambda x: remove_symbols(x[use_column]))
  ```
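The proposed fix can be sketched without a Spark session. `MiniRDD` below is a hypothetical stand-in, just enough to show the chaining; in the real job this would be Spark's `rdd.map`:

```python
import re

class MiniRDD:
    """Hypothetical stand-in for a Spark RDD, just enough to show the chaining."""
    def __init__(self, data):
        self.data = list(data)
    def filter(self, f):
        return MiniRDD(x for x in self.data if f(x))
    def map(self, f):
        return MiniRDD(f(x) for x in self.data)
    def collect(self):
        return self.data

def reduce_japanese_emoticon(text):
    text = re.sub("w{3,}", "www", text)
    text = re.sub("笑{2,}", "笑", text)
    return text

rdd = MiniRDD([{"text": "wwwwasdf"}, {"text": "1234笑笑笑"}, {"text": ""}])

# Current behavior: .filter only drops the empty document; no text is rewritten.
filtered = rdd.filter(lambda x: reduce_japanese_emoticon(x["text"])).collect()

# Proposed fix: .map rebuilds each record with the transformed text.
mapped = rdd.map(lambda x: dict(text=reduce_japanese_emoticon(x["text"]))).collect()
```

With `.map`, `"wwwwasdf"` becomes `"wwwasdf"` and `"1234笑笑笑"` becomes `"1234笑"`, whereas `.filter` leaves every surviving document byte-for-byte unchanged.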
The remaining calls to methods ending in `_filter` (e.g. `japanese_bad_words_filter`, `doc_len_filter`, etc.) are actual filter predicates that return booleans, so they should be OK.
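For contrast, a boolean-returning predicate behaves as intended under `.filter`. A minimal sketch (this `doc_len_filter` is a simplified hypothetical version for illustration, not the repo's actual implementation):

```python
def doc_len_filter(text, min_len, max_len):
    # Simplified, hypothetical version: returns True/False,
    # so .filter genuinely keeps or drops each document.
    return min_len <= len(text) <= max_len

docs = [{"text": "short"}, {"text": "a much longer document than allowed"}]
kept = [d for d in docs if doc_len_filter(d["text"], 1, 10)]
```

Here the predicate's truth value carries the decision, which is exactly what `.filter` expects.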