this partially reverses #7, but for NLP applications it's very handy to have sentence markers to start from, so the text can be broken into manageable chunks. most of the texts sourced from TLS and CHANT have punctuation and markers we can use to segment. downstream consumers who still want to discard this information can easily strip it out.
texts with punctuation/semantic line breaks:
- Lunyu (KR1h0004)
- Xiaojing (KR1f0001)
- Laozi (KR5c0057)
- Maoshi/Shijing (KR1c0001)
- Zhuangzi (KR5c0126)
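
for downstream consumers, a rough sketch of both options (segmenting on the markers, or throwing them away) might look like the following. the punctuation sets and the helper names `split_sentences` and `strip_punctuation` are assumptions for illustration, not part of this repo; the actual marks used in each text may differ.

```python
import re

# Assumed sentence-final marks; adjust to whatever the texts actually contain.
SENTENCE_END = "。！？"
# Assumed set of punctuation to drop when punctuation is not wanted.
PUNCTUATION = "。，、；：！？「」『』（）《》〈〉"


def split_sentences(text):
    """Split text into chunks ending at a sentence-final mark or a line break.

    Closing quote marks directly after the final mark stay with the sentence.
    """
    pattern = rf"[^{SENTENCE_END}\n]+[{SENTENCE_END}]*[」』]*"
    return [m.strip() for m in re.findall(pattern, text) if m.strip()]


def strip_punctuation(text):
    """Remove the assumed punctuation marks and line breaks, keeping only the characters."""
    return text.translate(str.maketrans("", "", PUNCTUATION + "\n"))


sample = "道可道，非常道。\n名可名，非常名。"
print(split_sentences(sample))
print(strip_punctuation(sample))
```

the two helpers are independent: a consumer that wants the old unpunctuated form only needs `strip_punctuation`.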