https://github.com/rosewang2008/language_modeling_via_stochastic_processes/blob/5cbc3eed581eba6444c471bfe716bd56db0f5253/language_modeling_via_stochastic_processes/src/datasets/wikihow.py#L41 It seems there is an extra space on this line, which would prevent the text from being split into sentences.
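
To illustrate the kind of failure I mean, here is a minimal sketch. It does not reproduce the linked line: the sample text and the two delimiters (`". "` and `".  "`) are assumptions chosen only to show how one extra space in a `str.split` separator can silently disable splitting.

```python
# Hypothetical sample text; not taken from the WikiHow dataset.
text = "Preheat the oven. Mix the flour. Bake for ten minutes."

# Assumed intended delimiter: a period followed by a single space.
intended = text.split(". ")

# Same delimiter with one extra trailing space. That exact substring never
# occurs in the text, so split() returns the whole string as one element.
with_extra_space = text.split(".  ")

print(len(intended))          # number of pieces with the assumed delimiter
print(len(with_extra_space))  # number of pieces with the extra space
```

If the delimiter on the linked line behaves like the second case, every article would come back as a single "sentence" and no error would be raised, which is why this is easy to miss.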