Skip to content

Benchmark for conversion from Simplified Chinese to Traditional Chinese

Notifications You must be signed in to change notification settings

StarCC0/benchmark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

b46435d · Apr 24, 2022

History

2 Commits
Apr 24, 2022
Apr 24, 2022
Apr 24, 2022
Apr 24, 2022
Apr 24, 2022
Apr 24, 2022

Repository files navigation

Benchmark for Simplified Chinese to Traditional Chinese conversion

設計思路

由 RTHK 即時新聞匯出 2021 年全年新聞文本,將文本轉為簡體後作為測試輸入,繁體版原文作為預期輸出。

TODO:

  • 將繁體轉為簡體時,需要人工判斷「一繁對多簡」的字詞
  • RTHK 即時新聞語料有別於 OpenCC 標準,需要異體字規範化,如「為」>「爲」
  • RTHK 即時新聞本身有誤,如多次出現將「發佈」誤作「發布」,需人工修正

Compute accuracy

Run python compute_accuracy.py output.txt answer.txt.

Get the dataset

Open Telegram. In the @rthk_new_c channel, click export. Select a range from 1 Jan 2021 to 31 Dec 2021. Save the JSON file to ChatExport.json.

Run preprocess.py.

Run opencc -c hk2s -i answer.txt -o input.txt.

About

Benchmark for conversion from Simplified Chinese to Traditional Chinese

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages