On `dps/dps/spark/jobs/japanese_job.py`, lines 64 to 75 in bec4078:

```python
.filter(lambda x: japanese_bad_words_filter(x[use_column]))
.filter(lambda x: doc_len_filter(x[use_column], conf["min_doc_len"], conf["max_doc_len"]))
.filter(lambda x: japanese_mean_word_len_filter(x[use_column], conf["min_mean_word_len"], conf["max_mean_word_len"]))
.filter(lambda x: japanese_symbol_to_word_ratio_filter(x[use_column], conf["symbol_to_word_ratio"]))
.filter(lambda x: bullet_ellipsis_filter(x[use_column], conf["bullet_point_ratio"], conf["ellipsis_ratio"]))
.filter(lambda x: japanese_word_ratio_filter(x[use_column], conf["japanese_word_ratio"]))
.filter(lambda x: dict(text=preprocess_text(x[use_column])))
.filter(lambda x: doc_len_filter(x[use_column], conf["min_doc_len"], conf["max_doc_len"]))
.filter(lambda x: japanese_frequent_char_existence_filter(x[use_column], conf["freq_char_cnt"]))
.filter(lambda x: reduce_japanese_emoticon(x[use_column]))
.filter(lambda x: many_separators_filter(x[use_column], conf["separator_ratio"]))
.filter(lambda x: remove_symbols(x[use_column]))
```
there are several cases where we are using `.filter` when it should be a `.map`.
For example,

```python
.filter(lambda x: reduce_japanese_emoticon(x[use_column]))
```

(line 73) calls `reduce_japanese_emoticon` (`dps/dps/spark/prep/japanese_prep.py`, lines 64 to 67 in bec4078):

```python
def reduce_japanese_emoticon(text):
    text = re.sub("w{3,}", "www", text)
    text = re.sub("笑{2,}", "笑", text)
    return text
```
but in effect this does nothing, because the expression inside `.filter` is always truthy as long as the text is non-empty:
```python
>>> def reduce_japanese_emoticon(text):
...     text = re.sub("w{3,}", "www", text)
...     text = re.sub("笑{2,}", "笑", text)
...     return text
...
>>> rdd = sc.parallelize([{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}, {'text': ''}])
>>> rdd.filter(lambda x: reduce_japanese_emoticon(x['text'])).collect()
[{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}]
```
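For contrast, routing the same function through a map keeps every record but actually rewrites the text. Here is a plain-Python sketch of the `.map` semantics (a list comprehension standing in for `rdd.map(...).collect()`, so it runs without a SparkContext):

```python
import re

def reduce_japanese_emoticon(text):
    # Collapse long runs of "w" and "笑" (Japanese net slang for laughter)
    text = re.sub("w{3,}", "www", text)
    text = re.sub("笑{2,}", "笑", text)
    return text

records = [{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}, {'text': ''}]

# Equivalent of rdd.map(lambda x: dict(text=reduce_japanese_emoticon(x['text']))).collect():
mapped = [dict(text=reduce_japanese_emoticon(x['text'])) for x in records]
print(mapped)
# [{'text': 'wwwasdf'}, {'text': '1234笑'}, {'text': ''}]
```

Note that with `.map` the emoticon runs are actually collapsed in the output, whereas the `.filter` call above returned the records unchanged.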
Thus, I think the following cases of `.filter` are simply doing nothing instead of the intended preprocessing:

- `preprocess_text` on line 70 (this one is always truthy, since even `dict(text='')` is a non-empty dict):

  ```python
  .filter(lambda x: dict(text=preprocess_text(x[use_column])))
  ```

- `reduce_japanese_emoticon` on line 73:

  ```python
  .filter(lambda x: reduce_japanese_emoticon(x[use_column]))
  ```

- `remove_symbols` on line 75:

  ```python
  .filter(lambda x: remove_symbols(x[use_column]))
  ```
The remaining calls to functions whose names end with `_filter` (e.g. `japanese_bad_words_filter`, `doc_len_filter`, etc.) are actual filter predicates that return booleans, so those should be OK.
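A minimal sketch of the suggested fix, using plain Python's filtering and mapping with hypothetical toy stand-ins for the real helpers (the actual change in the job would just swap `.filter` for `.map` on the three lines above):

```python
import re

# Hypothetical toy stand-ins for the helpers in japanese_prep.py:
def doc_len_filter(text, min_len, max_len):
    # Predicate: returns a bool, so it belongs in .filter
    return min_len <= len(text) <= max_len

def remove_symbols(text):
    # Transformation: returns a new string, so it belongs in .map
    return re.sub(r"[#@*&]", "", text)

records = [{'text': 'hello@world#'}, {'text': ''}]

# Predicates filter records; transformations rewrite them --
# mirroring rdd.filter(...).map(...) in the Spark job:
out = [
    dict(text=remove_symbols(x['text']))
    for x in records
    if doc_len_filter(x['text'], 1, 100)
]
print(out)
# [{'text': 'helloworld'}]
```

The empty record is dropped by the length predicate, and the surviving record has its symbols actually removed, which is what the current `.filter`-only chain fails to do.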