preserve newlines and punctuation for texts that have it #9

Description

@thatbudakguy

this partly reverses #7, but for NLP applications it's very handy to have sentence markers to start from so the text can be broken into manageable chunks. most of the texts sourced from TLS and CHANT have punctuation and line breaks we can use for segmentation. if downstream consumers still want to discard this information, that's easy to do.

texts with punctuation/semantic line breaks:

  • Lunyu (KR1h0004)
  • Xiaojing (KR1f0001)
  • Laozi (KR5c0057)
  • Maoshi/Shijing (KR1c0001)
  • Zhuangzi (KR5c0126)
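
to illustrate both sides of the description (segmenting on the preserved markers, and discarding them downstream), here is a minimal sketch. the function names `split_sentences` and `strip_punctuation` and the two punctuation sets are assumptions made for the example, not part of this repo; the actual texts may use other marks.

```python
import re

# Assumed set of sentence-final marks; the real corpus may use others.
SENTENCE_END = "。！？；"
# Assumed broader set of punctuation a consumer might want to discard.
PUNCTUATION = "，。！？；：、「」『』《》"


def split_sentences(text: str) -> list[str]:
    """Split on newlines and after sentence-final marks, keeping each mark."""
    chunks = []
    for line in text.splitlines():
        # Each match is a run of non-final characters plus any final marks
        # that close it, so the mark stays attached to its sentence.
        parts = re.findall(rf"[^{SENTENCE_END}]+[{SENTENCE_END}]*", line)
        chunks.extend(p.strip() for p in parts if p.strip())
    return chunks


def strip_punctuation(text: str) -> str:
    """Drop punctuation and whitespace for consumers that want bare characters."""
    return re.sub(rf"[{PUNCTUATION}\s]", "", text)
```

a consumer that wants chunks would call `split_sentences(text)`; one that wants the old punctuation-free form would call `strip_punctuation(text)`, which is why keeping the markers in the source loses nothing.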

Labels: enhancement (New feature or request)
