[ja] .filter is used instead of .map for non-filter methods #74

@mrorii

In the following chain of calls

.filter(lambda x: japanese_bad_words_filter(x[use_column]))
.filter(lambda x: doc_len_filter(x[use_column], conf["min_doc_len"], conf["max_doc_len"]))
.filter(lambda x: japanese_mean_word_len_filter(x[use_column], conf["min_mean_word_len"], conf["max_mean_word_len"]))
.filter(lambda x: japanese_symbol_to_word_ratio_filter(x[use_column], conf["symbol_to_word_ratio"]))
.filter(lambda x: bullet_ellipsis_filter(x[use_column], conf["bullet_point_ratio"], conf["ellipsis_ratio"]))
.filter(lambda x: japanese_word_ratio_filter(x[use_column], conf["japanese_word_ratio"]))
.filter(lambda x: dict(text=preprocess_text(x[use_column])))
.filter(lambda x: doc_len_filter(x[use_column], conf["min_doc_len"], conf["max_doc_len"]))
.filter(lambda x: japanese_frequent_char_existence_filter(x[use_column], conf["freq_char_cnt"]))
.filter(lambda x: reduce_japanese_emoticon(x[use_column]))
.filter(lambda x: many_separators_filter(x[use_column], conf["separator_ratio"]))
.filter(lambda x: remove_symbols(x[use_column]))
there are several cases where .filter is used but .map should be used instead.

For example

.filter(lambda x: reduce_japanese_emoticon(x[use_column]))

calls
def reduce_japanese_emoticon(text):
    text = re.sub("w{3,}", "www", text)
    text = re.sub("笑{2,}", "笑", text)
    return text

but in effect this does nothing, because .filter only uses the return value as a truth test, and the returned string is truthy whenever it is non-empty. The record itself is passed through unmodified:

>>> import re
>>> def reduce_japanese_emoticon(text):
...     text = re.sub("w{3,}", "www", text)
...     text = re.sub("笑{2,}", "笑", text)
...     return text
>>> rdd = sc.parallelize([{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}, {'text': ''}])
>>> rdd.filter(lambda x: reduce_japanese_emoticon(x['text'])).collect()
[{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}]
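
For comparison, here is a minimal sketch of the difference between the two calls. It uses Python's built-in filter and map on a plain list as a stand-in for the RDD methods, which is an assumption made so that the example runs without Spark. The helper apply_to_column is hypothetical and not part of the repository.

```python
import re


def reduce_japanese_emoticon(text):
    text = re.sub("w{3,}", "www", text)
    text = re.sub("笑{2,}", "笑", text)
    return text


def apply_to_column(record, column, fn):
    """Return a copy of `record` with `fn` applied to record[column].

    Hypothetical helper for illustration only.
    """
    return {**record, column: fn(record[column])}


records = [{"text": "wwwwasdf"}, {"text": "1234笑笑笑"}, {"text": ""}]

# filter only truth-tests the return value: records pass through unchanged,
# and the record with empty text is dropped as a side effect.
filtered = list(filter(lambda x: reduce_japanese_emoticon(x["text"]), records))

# map replaces each record with its transformed copy and drops nothing.
mapped = list(
    map(lambda x: apply_to_column(x, "text", reduce_japanese_emoticon), records)
)

print(filtered)
print(mapped)
```

On an RDD, the corresponding change would presumably be something like .map(lambda x: apply_to_column(x, use_column, reduce_japanese_emoticon)) in place of the .filter call, assuming the downstream steps expect the same record shape.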

Thus, I think the following cases of .filter do nothing instead of performing the intended preprocessing: the calls wrapping preprocess_text, reduce_japanese_emoticon, and remove_symbols.

The remaining calls to methods that end with _filter (e.g. japanese_bad_words_filter, doc_len_filter, etc.) are actually filter methods that return booleans so they should be OK.
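
One possible way to catch this class of mistake early is to wrap every function passed to .filter in a guard that rejects non-boolean return values. This is only a sketch. The names strict_predicate, example_len_filter, and example_transform are hypothetical stand-ins, and their bodies are assumed rather than taken from the repository.

```python
import functools


def strict_predicate(fn):
    """Wrap `fn` so that it raises TypeError unless it returns a bool."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        if not isinstance(result, bool):
            raise TypeError(
                f"{fn.__name__} returned {type(result).__name__}, expected bool"
            )
        return result

    return wrapper


def example_len_filter(text, min_len, max_len):
    # Hypothetical predicate: returns a real boolean.
    return min_len <= len(text) <= max_len


def example_transform(text):
    # Hypothetical transform: returns a string, so it is not a predicate.
    return text.strip()


docs = ["short", "a much longer document"]

# A genuine predicate passes through the guard unchanged.
kept = list(filter(lambda t: strict_predicate(example_len_filter)(t, 1, 10), docs))

# A transform used as a predicate is reported instead of silently passing.
try:
    list(filter(strict_predicate(example_transform), docs))
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

With such a guard, a transform accidentally passed to .filter fails loudly on the first record rather than leaving the data untouched.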
