Create dataset loader for SCH #707

SamuelCahyawijaya · 2024-07-30T15:40:10Z

Dataset	sch
Description	This is the first publicly available corpus of Hmong [ISO 639-3: mww, hmj], a minority language of China, Vietnam, Laos, Thailand, and various countries in Europe, America, and Australia. The corpus has been scraped from a long-running Usenet newsgroup called soc.culture.hmong and consists of approximately 12 million tokens. This corpus (called SCH) is also the first substantial corpus to be annotated for elaborate expressions, a kind of four-part coordinate construction that is common and important in the languages of mainland Southeast Asia.
Subsets	-
Languages	hnj
Tasks	Language Modeling
License	Creative Commons Zero v1.0 Universal (cc0-1.0)
Homepage	https://github.com/dmort27/sch-corpus/
HF URL	-
Paper URL	https://aclanthology.org/2022.lrec-1.533
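
As a starting point for the loader, the fields above could be captured in a small metadata record. This is only a sketch: `DatasetCard`, `SCH_CARD`, and `missing_fields` are hypothetical names chosen for illustration and are not part of SEACrowd or the SCH repository. Fields marked `-` in the table are left empty.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DatasetCard:
    """Hypothetical container for the metadata listed in this issue."""
    name: str
    languages: Tuple[str, ...]
    tasks: Tuple[str, ...]
    license: str
    homepage: str
    paper_url: str
    hf_url: Optional[str] = None      # "-" in the issue: no HF URL given
    subsets: Tuple[str, ...] = ()     # "-" in the issue: no subsets given


# Values copied from the table in this issue.
SCH_CARD = DatasetCard(
    name="sch",
    languages=("hnj",),
    tasks=("Language Modeling",),
    license="cc0-1.0",
    homepage="https://github.com/dmort27/sch-corpus/",
    paper_url="https://aclanthology.org/2022.lrec-1.533",
)


def missing_fields(card: DatasetCard) -> List[str]:
    """Return the names of optional fields the issue left unfilled."""
    return [name for name in ("hf_url", "subsets") if not getattr(card, name)]
```

Calling `missing_fields(SCH_CARD)` lists the fields a loader author would still need to decide on (here, the HF URL and the subsets).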


SamuelCahyawijaya added this to SEACrowd Data Hub Jul 30, 2024

SamuelCahyawijaya converted this from a draft issue Jul 30, 2024
