On `dps/dps/spark/jobs/japanese_job.py`, lines 64 to 75 in bec4078:

```python
.filter(lambda x: japanese_bad_words_filter(x[use_column]))
.filter(lambda x: doc_len_filter(x[use_column], conf["min_doc_len"], conf["max_doc_len"]))
.filter(lambda x: japanese_mean_word_len_filter(x[use_column], conf["min_mean_word_len"], conf["max_mean_word_len"]))
.filter(lambda x: japanese_symbol_to_word_ratio_filter(x[use_column], conf["symbol_to_word_ratio"]))
.filter(lambda x: bullet_ellipsis_filter(x[use_column], conf["bullet_point_ratio"], conf["ellipsis_ratio"]))
.filter(lambda x: japanese_word_ratio_filter(x[use_column], conf["japanese_word_ratio"]))
.filter(lambda x: dict(text=preprocess_text(x[use_column])))
.filter(lambda x: doc_len_filter(x[use_column], conf["min_doc_len"], conf["max_doc_len"]))
.filter(lambda x: japanese_frequent_char_existence_filter(x[use_column], conf["freq_char_cnt"]))
.filter(lambda x: reduce_japanese_emoticon(x[use_column]))
.filter(lambda x: many_separators_filter(x[use_column], conf["separator_ratio"]))
.filter(lambda x: remove_symbols(x[use_column]))
```
there are several cases where we are using `.filter` when it should be a `.map`.
For example,

```python
.filter(lambda x: reduce_japanese_emoticon(x[use_column]))
```

(line 73) calls `reduce_japanese_emoticon` (`dps/dps/spark/prep/japanese_prep.py`, lines 64 to 67 in bec4078):

```python
def reduce_japanese_emoticon(text):
    text = re.sub("w{3,}", "www", text)
    text = re.sub("笑{2,}", "笑", text)
    return text
```
but in effect this does nothing, because the expression inside `.filter` is always truthy as long as the text is non-empty:
```python
>>> def reduce_japanese_emoticon(text):
...     text = re.sub("w{3,}", "www", text)
...     text = re.sub("笑{2,}", "笑", text)
...     return text
...
>>> rdd = sc.parallelize([{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}, {'text': ''}])
>>> rdd.filter(lambda x: reduce_japanese_emoticon(x['text'])).collect()
[{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}]
```
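For contrast, routing the same function through a map keeps every record but actually rewrites the text. Here is a plain-Python sketch of the `.map` semantics (a list comprehension standing in for `rdd.map(...).collect()`, so it runs without a SparkContext):

```python
import re

def reduce_japanese_emoticon(text):
    # Collapse long runs of "w" and "笑" (Japanese net slang for laughter)
    text = re.sub("w{3,}", "www", text)
    text = re.sub("笑{2,}", "笑", text)
    return text

records = [{'text': 'wwwwasdf'}, {'text': '1234笑笑笑'}, {'text': ''}]

# Equivalent of rdd.map(lambda x: dict(text=reduce_japanese_emoticon(x['text']))).collect():
mapped = [dict(text=reduce_japanese_emoticon(x['text'])) for x in records]
print(mapped)
# [{'text': 'wwwasdf'}, {'text': '1234笑'}, {'text': ''}]
```

Note that with `.map` the emoticon runs are actually collapsed in the output, whereas the `.filter` call above returned the records unchanged.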
Thus, I think the following cases of `.filter` are simply doing nothing instead of the intended preprocessing:

- `preprocess_text` on line 70 (this one is always truthy, since even `dict(text='')` is a non-empty dict):

  ```python
  .filter(lambda x: dict(text=preprocess_text(x[use_column])))
  ```

- `reduce_japanese_emoticon` on line 73:

  ```python
  .filter(lambda x: reduce_japanese_emoticon(x[use_column]))
  ```

- `remove_symbols` on line 75:

  ```python
  .filter(lambda x: remove_symbols(x[use_column]))
  ```
The remaining calls to functions whose names end with `_filter` (e.g. `japanese_bad_words_filter`, `doc_len_filter`, etc.) are actual filter predicates that return booleans, so those should be OK.
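A minimal sketch of the suggested fix, using plain Python's filtering and mapping with hypothetical toy stand-ins for the real helpers (the actual change in the job would just swap `.filter` for `.map` on the three lines above):

```python
import re

# Hypothetical toy stand-ins for the helpers in japanese_prep.py:
def doc_len_filter(text, min_len, max_len):
    # Predicate: returns a bool, so it belongs in .filter
    return min_len <= len(text) <= max_len

def remove_symbols(text):
    # Transformation: returns a new string, so it belongs in .map
    return re.sub(r"[#@*&]", "", text)

records = [{'text': 'hello@world#'}, {'text': ''}]

# Predicates filter records; transformations rewrite them --
# mirroring rdd.filter(...).map(...) in the Spark job:
out = [
    dict(text=remove_symbols(x['text']))
    for x in records
    if doc_len_filter(x['text'], 1, 100)
]
print(out)
# [{'text': 'helloworld'}]
```

The empty record is dropped by the length predicate, and the surviving record has its symbols actually removed, which is what the current `.filter`-only chain fails to do.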